diff --git a/.github/CITATION.md b/.github/CITATION.md
new file mode 100644
index 0000000000..4e78352060
--- /dev/null
+++ b/.github/CITATION.md
@@ -0,0 +1,24 @@
+If you redistribute ArrayFire, please follow the terms established in
+[the license](../LICENSE). If you wish to cite ArrayFire in an academic
+publication, please use the following reference:
+
+Formatted:
+```
+Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P.,
+Kloppenborg, B., Malcolm, J. and Melonakos, J. (2015).
+ArrayFire - A high performance software library for parallel computing with an
+easy-to-use API. Atlanta: AccelerEyes. Retrieved from https://github.com/arrayfire/arrayfire
+```
+
+BibTeX:
+```bibtex
+@misc{Yalamanchili2015,
+abstract = {ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array based function set makes parallel programming simple. ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving you valuable time and lowering development costs.},
+address = {Atlanta},
+author = {Yalamanchili, Pavan and Arshad, Umar and Mohammed, Zakiuddin and Garigipati, Pradeep and Entschev, Peter and Kloppenborg, Brian and Malcolm, James and Melonakos, John},
+publisher = {AccelerEyes},
+title = {{ArrayFire - A high performance software library for parallel computing with an easy-to-use API}},
+url = {https://github.com/arrayfire/arrayfire},
+year = {2015}
+}
+```
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
new file mode 100644
index 0000000000..668986c904
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,76 @@
+---
+name: Bug Report
+about: Create a bug report to help us improve ArrayFire
+title: "[BUG]"
+labels: 'bug'
+assignees: ''
+---
+
+<!-- One- or two-sentence description of the bug -->
+
+Description
+===========
+<!--
+* Additional details regarding the bug
+* Did you build ArrayFire yourself or did you use the official installers?
+* Which backend is experiencing this issue? (CPU, CUDA, OpenCL)
+* Do you have a workaround?
+* Can the bug be reproduced reliably on your system?
+* A clear and concise description of what you expected to happen.
+* Run your executable with AF_TRACE=all and AF_PRINT_ERRORS=1 environment
+  variables set.
+* Screenshot or terminal output of the results
+-->
+
+Reproducible Code and/or Steps
+------------------------------
+<!--
+* Steps or code snippet that can reproduce the bug
+* A full example will allow us to debug and fix the bug faster
+-->
+
+System Information
+------------------
+<!--
+Please provide the following information:
+1. ArrayFire version
+2. Devices installed on the system
+3. (optional) Output from the af::info() function if applicable.
+4. Output from the following scripts:
+
+Run one of the following commands based on your OS
+
+Linux:
+```sh
+lsb_release -a
+if command -v nvidia-smi >/dev/null; then
+  nvidia-smi --query-gpu="name,memory.total,driver_version" --format=csv -i 0
+else
+  echo "nvidia-smi not found"
+fi
+if command -v /opt/rocm/bin/rocm-smi >/dev/null; then
+  /opt/rocm/bin/rocm-smi --showproductname
+else
+  echo "rocm-smi not found."
+fi
+if command -v clinfo > /dev/null; then
+  clinfo
+else
+  echo "clinfo not found."
+fi
+```
+
+Windows:
+Download clinfo from https://github.com/Oblomov/clinfo
+
+If you have NVIDIA GPUs, run nvidia-smi, which is usually located in
+C:\Program Files\NVIDIA Corporation\NVSMI
+
+Provide the driver version for your GPU (this is vendor-specific).
+-->
+
+Checklist
+---------
+
+- [ ] Using the latest available ArrayFire release
+- [ ] GPU drivers are up to date
diff --git a/.github/ISSUE_TEMPLATE/build_error.md b/.github/ISSUE_TEMPLATE/build_error.md
new file mode 100644
index 0000000000..dc457c668e
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/build_error.md
@@ -0,0 +1,36 @@
+---
+name: Build Error
+about: Create a report for errors during the building process
+title: "[Build]"
+labels: 'build'
+assignees: ''
+---
+
+<!--
+A short one or two line description of the error
+-->
+
+Description
+===========
+
+<!--
+* Additional details about the errors during the build
+* What do you suspect is causing the issue?
+* Which steps in the [wiki](https://github.com/arrayfire/arrayfire/wiki) failed?
+* What operating system and/or distro are you using?
+* Versions of the packages related to this bug
+-->
+
+Error Log
+---------
+<!-- Output of the error log. -->
+```
+
+```
+
+Build Environment
+-----------------
+Compiler version: <!-- MSVC v140 or gcc 9.3.2 -->
+Operating system: <!-- Windows 10; Ubuntu 18.04 -->
+Build environment: <!-- Environment variables; Installed software -->
+CMake variables: <!-- Output of  `cmake -L` -->
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
new file mode 100644
index 0000000000..662f8e722d
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,20 @@
+---
+name: Feature Request
+about: Suggest a new idea for ArrayFire
+title: ''
+labels: 'feature'
+assignees: ''
+
+---
+
+<!-- One or two sentences describing the feature. -->
+
+Description
+===========
+<!--
+* Additional information about the feature you would like to add
+* What problem are you trying to solve?
+* (Optional) API of new function
+* (Optional) Algorithms that could be used to implement this feature
+* (Optional) Are there other libraries that implement this feature?
+-->
diff --git a/.github/ISSUE_TEMPLATE/performance_issue.md b/.github/ISSUE_TEMPLATE/performance_issue.md
new file mode 100644
index 0000000000..c563aedee5
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/performance_issue.md
@@ -0,0 +1,40 @@
+---
+name: Performance Issue
+about: For issues related to lackluster performance
+title: "[Perf]"
+labels: 'perf'
+assignees: ''
+
+---
+
+<!-- One or two line description of the performance issue -->
+
+
+Description
+===========
+
+<!--
+* Additional information about the performance issues
+* Did you build ArrayFire yourself or did you use the official installers?
+* Which backend is experiencing this issue? (CPU, CUDA, OpenCL)
+* Do you have a workaround?
+* Can the bug be reproduced reliably on your system?
+-->
+
+Reproducible Code
+-----------------
+<!--
+* Provide a small example that could reproduce the performance issue
+* A full example will allow us to debug and fix this issue faster
+-->
+
+System Information
+------------------
+ArrayFire Version:
+Device:
+Operating System:
+Driver version:
+
+Checklist
+---------
+- [ ] I have read [timing ArrayFire](http://arrayfire.org/docs/timing.htm)
diff --git a/.github/ISSUE_TEMPLATE/question.md b/.github/ISSUE_TEMPLATE/question.md
new file mode 100644
index 0000000000..a37af18d75
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,14 @@
+---
+name: Question
+about: General questions and potential issues
+title: "[Question]"
+labels: ''
+assignees: ''
+
+---
+
+Before asking a question on GitHub, please consider whether it is more appropriate for one of these other platforms:
+
+* [Slack Chat](https://join.slack.com/t/arrayfire-org/shared_invite/MjI4MjIzMDMzMTczLTE1MDI5ODg4NzYtN2QwNGE3ODA5OQ)
+* [Google Groups](https://groups.google.com/forum/#!forum/arrayfire-users)
+* ArrayFire Services:  [Consulting](http://arrayfire.com/consulting/)  |  [Support](http://arrayfire.com/support/)   |  [Training](http://arrayfire.com/training/)
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
new file mode 100644
index 0000000000..5669dd9e7f
--- /dev/null
+++ b/.github/pull_request_template.md
@@ -0,0 +1,40 @@
+
+<!--
+Short description of change
+
+This should be one or two sentences that describe the overall
+motivation of the PR and what was done, written for a reviewer
+unfamiliar with the code.
+-->
+
+Description
+-----------
+<!--
+Additional information about the PR, answering the following questions:
+
+* Is this a new feature or a bug fix?
+* More detail if necessary to describe all commits in pull request.
+* Why these changes are necessary.
+* Potential impact on specific hardware, software or backends.
+* New functions and their functionality.
+* Can this PR be backported to older versions?
+* Future changes not implemented in this PR.
+-->
+Fixes: #<issue number> ...
+
+Changes to Users
+----------------
+<!--
+* Additional options added to the build.
+* What changes will existing users have to make to their code or build steps?
+Refer to [wiki](https://github.com/arrayfire/arrayfire/wiki) for development guidelines
+-->
+
+Checklist
+---------
+<!-- Check if done or not applicable -->
+- [ ] Rebased on latest master
+- [ ] Code compiles
+- [ ] Tests pass
+- [ ] Functions added to unified API
+- [ ] Functions documented
diff --git a/.github/workflows/release_src_artifact.yml b/.github/workflows/release_src_artifact.yml
new file mode 100644
index 0000000000..41b01d4f72
--- /dev/null
+++ b/.github/workflows/release_src_artifact.yml
@@ -0,0 +1,92 @@
+on:
+  push:
+    # Sequence of patterns matched against refs/tags
+    tags:
+    - 'v*' # Push events to tag names starting with v
+
+name: ci
+
+jobs:
+    upload_src_tarball:
+        name: Upload release source tarball
+        runs-on: ubuntu-latest
+        steps:
+            - name: Fetch Repo Info
+              run: |
+                  tag=$(echo ${GITHUB_REF} | awk '{split($0, a, "/"); print a[3]}')
+                  ver=${tag:1}
+                  response=$(curl https://api.github.com/repos/${GITHUB_REPOSITORY}/releases/tags/${tag})
+                  id_line=$(echo "${response}" | grep -m 1 "id.:")
+                  rel_id=$(echo "${id_line}" | awk '{split($0, a, ":"); split(a[2], b, ","); print b[1]}')
+                  trimmed_rel_id=$(echo "${rel_id}" | awk '{gsub(/^[ \t]+/,""); print $0 }')
+                  echo "RELEASE_ID=${trimmed_rel_id}" >> $GITHUB_ENV
+                  echo "AF_TAG=${tag}" >> $GITHUB_ENV
+                  echo "AF_VER=${ver}" >> $GITHUB_ENV
+
+            - name: Checkout Repo
+              run: |
+                  cd ${GITHUB_WORKSPACE}
+                  clone_url="https://github.com/${GITHUB_REPOSITORY}"
+                  git clone --depth 1 -b ${AF_TAG} ${clone_url} arrayfire-full-${AF_VER}
+
+            - name: Install Dependencies
+              run: |
+                  sudo add-apt-repository ppa:mhier/libboost-latest
+                  sudo apt-get -qq update
+                  sudo apt-get install -y libfontconfig1-dev \
+                                          libglfw3-dev \
+                                          libfftw3-dev \
+                                          liblapacke-dev \
+                                          libopenblas-dev \
+                                          ocl-icd-opencl-dev \
+                                          nvidia-cuda-toolkit \
+                                          libboost-dev
+
+            - name: CMake Configure
+              run: |
+                  cd ${GITHUB_WORKSPACE}/arrayfire-full-${AF_VER}
+                  mkdir build && cd build
+                  cmake .. -DAF_BUILD_FORGE:BOOL=ON -DAF_COMPUTE_LIBRARY="FFTW/LAPACK/BLAS"
+
+            - name: Create source tarball
+              id: create-src-tarball
+              run: |
+                  cd $GITHUB_WORKSPACE
+                  rm -rf arrayfire-full-${AF_VER}/.git
+                  rm -rf arrayfire-full-${AF_VER}/.github
+                  rm arrayfire-full-${AF_VER}/.gitmodules
+                  cd arrayfire-full-${AF_VER}/build/
+                  shopt -s extglob
+                  rm -r !(extern)
+                  cd ./extern
+                  rm -rf ./*-build
+                  rm -rf ./*-subbuild
+                  declare -a deps
+                  deps=($(ls))
+                  for dep in ${deps[@]}; do
+                    rm -rf ./${dep}/.git
+                    rm -rf ./${dep}/.gitattributes
+                    rm -rf ./${dep}/.gitmodules
+                  done
+                  shopt -u extglob
+                  rm -rf matrixmarket
+                  cp -r ./* ../../extern/
+                  cd ..
+                  wget https://github.com/arrayfire/forge/releases/download/v1.0.8/forge-full-1.0.8.tar.bz2
+                  tar -xf forge-full-1.0.8.tar.bz2
+                  mv forge-full-1.0.8 ../extern/af_forge-src
+                  cd ..
+                  rm -rf build
+                  cd ..
+                  tar -cjf arrayfire-full-${AF_VER}.tar.bz2 arrayfire-full-${AF_VER}/
+                  echo "UPLOAD_FILE=arrayfire-full-${AF_VER}.tar.bz2" >> $GITHUB_ENV
+
+            - name: Upload source tarball
+              uses: actions/upload-release-asset@v1
+              env:
+                  GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+              with:
+                  upload_url: https://uploads.github.com/repos/${{ github.repository }}/releases/${{ env.RELEASE_ID }}/assets{?name,label}
+                  asset_path: ${{ env.UPLOAD_FILE }}
+                  asset_name: ${{ env.UPLOAD_FILE }}
+                  asset_content_type: application/x-bzip2
diff --git a/.github/workflows/unix_cpu_build.yml b/.github/workflows/unix_cpu_build.yml
new file mode 100644
index 0000000000..07ffba36f7
--- /dev/null
+++ b/.github/workflows/unix_cpu_build.yml
@@ -0,0 +1,196 @@
+on:
+  push:
+    branches:
+    - master
+  pull_request:
+    branches:
+    - master
+
+name: ci
+
+jobs:
+    clang-format:
+        name: Clang Format Lint
+        runs-on: ubuntu-latest
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@master
+
+            - name: Check Sources
+              uses: DoozyX/clang-format-lint-action@v0.15
+              with:
+                source: './src ./test ./examples'
+                extensions: 'h,cpp,hpp'
+                clangFormatVersion: 15
+
+    documentation:
+        name: Documentation
+        runs-on: ubuntu-20.04
+        env:
+          DOXYGEN_VER: 1.8.18
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@master
+
+            - name: Install Doxygen
+              run: |
+                  wget --quiet https://sourceforge.net/projects/doxygen/files/rel-${DOXYGEN_VER}/doxygen-${DOXYGEN_VER}.linux.bin.tar.gz
+                  mkdir doxygen
+                  tar -xf doxygen-${DOXYGEN_VER}.linux.bin.tar.gz -C doxygen --strip 1
+
+            - name: Install Boost
+              run: |
+                  sudo add-apt-repository ppa:mhier/libboost-latest
+                  sudo apt-get -qq update
+                  sudo apt-get install -y libboost1.74-dev
+
+            - name: Configure
+              run: |
+                  mkdir build && cd build && unset VCPKG_ROOT
+                  cmake -DAF_BUILD_CPU:BOOL=OFF -DAF_BUILD_CUDA:BOOL=OFF \
+                        -DAF_BUILD_OPENCL:BOOL=OFF -DAF_BUILD_UNIFIED:BOOL=OFF \
+                        -DAF_BUILD_EXAMPLES:BOOL=OFF -DBUILD_TESTING:BOOL=OFF \
+                        -DDOXYGEN_EXECUTABLE:FILEPATH=${GITHUB_WORKSPACE}/doxygen/bin/doxygen ..
+
+            - name: Build
+              run: |
+                  cd ${GITHUB_WORKSPACE}/build
+                  cmake --build . --target docs
+
+    build_cpu:
+        name: CPU
+        runs-on: ${{ matrix.os }}
+        needs: [clang-format, documentation]
+        env:
+          NINJA_VER: 1.10.2
+          CMAKE_VER: 3.16.3
+        strategy:
+            fail-fast: false
+            matrix:
+                blas_backend: [Atlas, MKL, OpenBLAS]
+                os: [ubuntu-20.04, macos-latest]
+                compiler: [gcc, clang, icx]
+                exclude:
+                    - os: macos-latest
+                      blas_backend: Atlas
+                    - os: macos-latest
+                      blas_backend: MKL
+                    - blas_backend: Atlas
+                      compiler: icx
+                    - blas_backend: OpenBLAS
+                      compiler: icx
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@master
+
+            - name: Download Ninja
+              env:
+                  OS_NAME: ${{ matrix.os }}
+              run: |
+                  os_suffix=$(if [ $OS_NAME == 'macos-latest' ]; then echo "mac"; else echo "linux"; fi)
+                  wget --quiet "https://github.com/ninja-build/ninja/releases/download/v${NINJA_VER}/ninja-${os_suffix}.zip"
+                  unzip ./ninja-${os_suffix}.zip
+                  chmod +x ninja
+                  ${GITHUB_WORKSPACE}/ninja --version
+
+            - name: Download CMake 3.16.3 for Linux
+              if: matrix.os != 'macos-latest'
+              env:
+                  OS_NAME: ${{ matrix.os }}
+                  CC: ${{ matrix.compiler }}
+              run: |
+                  cmake_suffix=$(if [ $OS_NAME == 'macos-latest' ]; then echo "Darwin-x86_64"; else echo "Linux-x86_64"; fi)
+                  cmake_url=$(echo "https://github.com/Kitware/CMake/releases/download/v${CMAKE_VER}/cmake-${CMAKE_VER}-${cmake_suffix}.tar.gz")
+                  wget --quiet "${cmake_url}"
+                  tar -xf ./cmake-${CMAKE_VER}-${cmake_suffix}.tar.gz
+                  cmake_install_dir=$(echo "cmake-${CMAKE_VER}-x86_64")
+                  mv cmake-${CMAKE_VER}-${cmake_suffix} ${cmake_install_dir}
+                  cmake_lnx_dir=$(echo "${cmake_install_dir}/bin")
+                  cmake_osx_dir=$(echo "${cmake_install_dir}/CMake.app/Contents/bin")
+                  cmake_dir=$(if [ $OS_NAME == 'macos-latest' ]; then echo "${cmake_osx_dir}"; else echo "${cmake_lnx_dir}"; fi)
+                  echo "CMAKE_PROGRAM=$(pwd)/${cmake_dir}/cmake" >> $GITHUB_ENV
+                  case "$CC" in
+                    'gcc')
+                        echo "CXX=g++" >> $GITHUB_ENV
+                        ;;
+                    'clang')
+                        echo "CXX=clang++" >> $GITHUB_ENV
+                        ;;
+                    'icx')
+                        echo "CXX=icpx" >> $GITHUB_ENV
+                        ;;
+                  esac
+
+            - name: Install Dependencies for Macos
+              if: matrix.os == 'macos-latest'
+              run: |
+                  brew install boost fontconfig glfw freeimage fftw lapack openblas expat
+                  echo "CMAKE_PROGRAM=cmake" >> $GITHUB_ENV
+
+            - name: Install Common Dependencies for Ubuntu
+              if: matrix.os == 'ubuntu-20.04' || matrix.os == 'ubuntu-22.04'
+              run: |
+                  sudo add-apt-repository ppa:mhier/libboost-latest
+                  sudo apt-get -qq update
+                  sudo apt-get install -y libboost1.74-dev \
+                                          libfreeimage-dev \
+                                          libglfw3-dev \
+                                          libfftw3-dev \
+                                          liblapacke-dev
+
+            - name: Install Atlas for Ubuntu
+              if: matrix.os != 'macos-latest' && matrix.blas_backend == 'Atlas'
+              run: sudo apt-get install -y libatlas-base-dev
+
+            - name: Install MKL for Ubuntu
+              if: matrix.os != 'macos-latest' && matrix.blas_backend == 'MKL'
+              env:
+                  CC: ${{ matrix.compiler }}
+              run: |
+                  wget https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
+                  sudo apt-key add GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
+                  sudo sh -c 'echo deb https://apt.repos.intel.com/oneapi all main > /etc/apt/sources.list.d/oneAPI.list'
+                  sudo apt-get -qq update
+                  sudo apt-get install -y intel-oneapi-mkl-devel intel-oneapi-tbb-devel
+                  if [ "$CC" == 'icx' ]; then sudo apt-get install -y intel-oneapi-compiler-dpcpp-cpp; fi
+                  echo "MKLROOT=/opt/intel/oneapi/mkl/latest" >> ${GITHUB_ENV}
+
+            - name: Install OpenBLAS for Ubuntu
+              if: matrix.os != 'macos-latest' && matrix.blas_backend == 'OpenBLAS'
+              run: sudo apt-get install -y libopenblas-dev
+
+            - name: CMake Configure
+              env:
+                  USE_MKL: ${{ matrix.blas_backend == 'MKL' }}
+                  BLAS_BACKEND: ${{ matrix.blas_backend }}
+                  CC: ${{ matrix.compiler }}
+                  OS_NAME: ${{ matrix.os }}
+              run: |
+                  ref=$(echo ${GITHUB_REF} | awk '/refs\/pull\/[0-9]+\/merge/{print $0}')
+                  prnum=$(echo $ref | awk '{split($0, a, "/"); print a[3]}')
+                  branch=$(git rev-parse --abbrev-ref HEAD)
+                  buildname=$(if [ -z "$prnum" ]; then echo "$branch"; else echo "PR-$prnum"; fi)
+                  dashboard=$(if [ -z "$prnum" ]; then echo "Continuous"; else echo "Experimental"; fi)
+                  backend=$(if [ "$USE_MKL" == true ]; then echo "Intel-MKL"; else echo "FFTW/LAPACK/BLAS"; fi)
+                  buildname="$buildname-cpu-$BLAS_BACKEND"
+                  cmake_rpath=$(if [ $OS_NAME == 'macos-latest' ]; then echo "-DCMAKE_INSTALL_RPATH=/opt/arrayfire/lib"; fi)
+                  if [ "$CC" == 'icx' ] || [ "$USE_MKL" == true ]; then source /opt/intel/oneapi/setvars.sh; fi
+                  mkdir build && cd build && unset VCPKG_ROOT
+                  ${CMAKE_PROGRAM} -G Ninja \
+                      -DCMAKE_MAKE_PROGRAM:FILEPATH=${GITHUB_WORKSPACE}/ninja \
+                      -DAF_BUILD_CUDA:BOOL=OFF -DAF_BUILD_OPENCL:BOOL=OFF \
+                      -DAF_BUILD_UNIFIED:BOOL=OFF -DAF_BUILD_EXAMPLES:BOOL=ON \
+                      -DAF_BUILD_FORGE:BOOL=ON \
+                      -DAF_COMPUTE_LIBRARY:STRING=${backend} \
+                      "$cmake_rpath" \
+                      -DBUILDNAME:STRING=${buildname} ..
+                  echo "CTEST_DASHBOARD=${dashboard}" >> $GITHUB_ENV
+
+            - name: Build and Test
+              env:
+                  CC: ${{ matrix.compiler }}
+                  USE_MKL: ${{ matrix.blas_backend == 'MKL' }}
+              run: |
+                  cd ${GITHUB_WORKSPACE}/build
+                  if [ "$CC" == 'icx' ] || [ "$USE_MKL" == true ]; then source /opt/intel/oneapi/setvars.sh; fi
+                  ctest -D Experimental --track ${CTEST_DASHBOARD} -T Test -T Submit -R cpu -j2
diff --git a/.github/workflows/win_cpu_build.yml b/.github/workflows/win_cpu_build.yml
new file mode 100644
index 0000000000..d42450f103
--- /dev/null
+++ b/.github/workflows/win_cpu_build.yml
@@ -0,0 +1,72 @@
+on:
+  push:
+    branches:
+    - master
+  pull_request:
+    branches:
+    - master
+
+name: ci
+
+jobs:
+    window_build_cpu:
+        name: CPU (fftw, OpenBLAS, windows-latest)
+        runs-on: windows-latest
+        env:
+          VCPKG_HASH: 9d47b24eacbd1cd94f139457ef6cd35e5d92cc84
+          VCPKG_DEFAULT_TRIPLET: x64-windows
+        steps:
+            - name: Checkout Repository
+              uses: actions/checkout@master
+
+            - name: VCPKG Cache
+              uses: actions/cache@v3
+              id: vcpkg-cache
+              with:
+                path: ~/vcpkg
+                key: vcpkg-deps-${{ env.VCPKG_HASH }}
+
+            - name: Install VCPKG Dependencies
+              if: steps.vcpkg-cache.outputs.cache-hit != 'true'
+              run: |
+                pushd .
+                cd ~
+                git clone --quiet --recursive https://github.com/microsoft/vcpkg.git
+                cd vcpkg
+                git checkout $env:VCPKG_HASH
+                .\bootstrap-vcpkg.bat
+                popd
+                mkdir build && cd build; $env:VCPKG_ROOT = $null
+                cmake .. -G "Visual Studio 17 2022" -A x64 `
+                      -DVCPKG_ROOT:PATH=~/vcpkg `
+                      -DAF_BUILD_CUDA:BOOL=OFF -DAF_BUILD_OPENCL:BOOL=OFF `
+                      -DAF_BUILD_UNIFIED:BOOL=OFF -DAF_BUILD_FORGE:BOOL=ON `
+                      -DBUILDNAME:STRING="$buildname" `
+                      -DAF_COMPUTE_LIBRARY:STRING="FFTW/LAPACK/BLAS"
+
+            - name: CMake Configure
+              run: |
+                  $ref = $env:GITHUB_REF | %{ if ($_ -match "refs/pull/[0-9]+/merge") { $_;} }
+                  $prnum = $ref | %{$_.Split("/")[2]}
+                  $branch = git branch --show-current
+                  $buildname = if($prnum -eq $null) { $branch } else { "PR-$prnum" }
+                  $dashboard = if($prnum -eq $null) { "Continuous" } else { "Experimental" }
+                  $buildname = "$buildname-cpu-openblas"
+                  if((Test-Path build) -eq 0) {
+                      mkdir build
+                  }
+                  cd build; $env:VCPKG_ROOT = $null
+                  cmake .. -G "Visual Studio 17 2022" -A x64 `
+                      -DVCPKG_ROOT:PATH=~/vcpkg `
+                      -DAF_BUILD_CUDA:BOOL=OFF -DAF_BUILD_OPENCL:BOOL=OFF `
+                      -DAF_BUILD_UNIFIED:BOOL=OFF -DAF_BUILD_FORGE:BOOL=ON `
+                      -DBUILDNAME:STRING="$buildname" `
+                      -DAF_COMPUTE_LIBRARY:STRING="FFTW/LAPACK/BLAS"
+                  echo "CTEST_DASHBOARD=${dashboard}" >> $env:GITHUB_ENV
+
+            - name: Build and Test
+              run: |
+                  cd build
+                  $build_path = (pwd).Path
+                  $Env:PATH += ";$build_path/vcpkg_installed/x64-windows/bin"
+                  ctest -D Experimental --track $env:CTEST_DASHBOARD -T Test -T Submit -C RelWithDebInfo -R cpu -E pinverse -j2
diff --git a/.gitignore b/.gitignore
index 95a58ae34d..933736dba0 100644
--- a/.gitignore
+++ b/.gitignore
@@ -1,11 +1,25 @@
-CMakeCache.txt
-CMakeFiles/
+#CMakeCache.txt
+#./CMakeFiles/
+CMakeUserPresets.json
 build*/
-Makefile
-cmake_install.cmake
-**~
+Release/
+#Makefile
+#cmake_install.cmake
 GTAGS
 GRTAGS
 GPATH
 .dir-locals.el
-include/af/version.h
+#docs/details/examples.dox
+/TAGS
+external/
+extern/
+compile_commands.json
+venv
+test/gtest
+#src/backend/cuda/cub
+conanbuildinfo*
+conaninfo*
+conan.lock
+graph_info.json
+.ccls-cache
+.projectile
diff --git a/.gitmodules b/.gitmodules
index df157b7e80..e69de29bb2 100644
--- a/.gitmodules
+++ b/.gitmodules
@@ -1,9 +0,0 @@
-[submodule "test/data"]
-	path = test/data
-	url = https://github.com/arrayfire/arrayfire_data
-[submodule "assets"]
-	path = assets
-	url = https://github.com/arrayfire/assets
-[submodule "test/gtest"]
-	path = test/gtest
-	url = http://git.chromium.org/external/googletest.git
diff --git a/ACKNOWLEDGEMENTS.md b/ACKNOWLEDGEMENTS.md
new file mode 100644
index 0000000000..0574ee5fa3
--- /dev/null
+++ b/ACKNOWLEDGEMENTS.md
@@ -0,0 +1,28 @@
+Acknowledgements
+=====
+
+The ArrayFire library is written by developers at [ArrayFire](http://arrayfire.com) LLC
+with [contributions from several individuals](https://github.com/arrayfire/arrayfire/graphs/contributors).
+The developers at ArrayFire LLC have received partial financial support
+from several grants and institutions. Those that wish to receive public
+acknowledgement are listed below:
+
+<!--
+The following section contains acknowledgements for grant funding. In most
+circumstances, the specific phrasing of the text is mandated by the grant
+provider. Thus these acknowledgements must remain intact without modification.
+-->
+
+### Grants
+
+This material is based upon work supported by the DARPA SBIR Program Office
+under Contract Numbers W31P4Q-14-C-0012 and W31P4Q-15-C-0008.
+Any opinions, findings and conclusions or recommendations expressed in this
+material are those of the author(s) and do not necessarily reflect the views of
+the DARPA SBIR Program Office.
+
+Research reported in this publication is supported by the National Library of
+Medicine of the National Institutes of Health under award number
+R43LM012359. The content is solely the responsibility of the author(s) and
+does not necessarily represent the official views of the National Institutes
+of Health.
diff --git a/ArrayFireConfig.cmake.in b/ArrayFireConfig.cmake.in
deleted file mode 100644
index 3ffd8e0c51..0000000000
--- a/ArrayFireConfig.cmake.in
+++ /dev/null
@@ -1,64 +0,0 @@
-# Defines the following variables:
-# ArrayFire_INCLUDE_DIRS    - Location of ArrayFire's include directory.
-# ArrayFire_LIBRARIES       - Location of ArrayFire's libraries. This will default
-#                             to a GPU backend if one is found.
-# ArrayFire_FOUND           - True if ArrayFire has been located
-#
-# You may provide a hint to where ArrayFire's root directory may be located
-# by setting ArrayFire_DIR.
-#
-# ----------------------------------------------------------------------------
-#
-# ArrayFire_CPU_FOUND        - True of the ArrayFire CPU library has been found.
-# ArrayFire_CPU_LIBRARIES    - Location of ArrayFire's CPU library, if found
-# ArrayFire_CUDA_FOUND       - True of the ArrayFire CUDA library has been found.
-# ArrayFire_CUDA_LIBRARIES   - Location of ArrayFire's CUDA library, if found
-# ArrayFire_OpenCL_FOUND     - True of the ArrayFire OpenCL library has been found.
-# ArrayFire_OpenCL_LIBRARIES - Location of ArrayFire's OpenCL library, if found
-#
-#=============================================================================
-# Copyright (c) 2015, ArrayFire
-# All rights reserved.
-#
-# Redistribution and use in source and binary forms, with or without modification,
-# are permitted provided that the following conditions are met:
-#
-# * Redistributions of source code must retain the above copyright notice, this
-#   list of conditions and the following disclaimer.
-#
-# * Redistributions in binary form must reproduce the above copyright notice, this
-#   list of conditions and the following disclaimer in the documentation and/or
-#   other materials provided with the distribution.
-#
-# * Neither the name of the ArrayFire nor the names of its
-#   contributors may be used to endorse or promote products derived from
-#   this software without specific prior written permission.
-#
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
-# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#=============================================================================
-
-get_filename_component(ArrayFire_INCLUDE_DIRS "@INCLUDE_DIR@" ABSOLUTE)
-
-# keep in the backends in the slowest to fastest order
-foreach(backend CPU OpenCL CUDA)
-  string(TOLOWER "${backend}" lowerbackend)
-  set(targetFile ${CMAKE_CURRENT_LIST_DIR}/@BACKEND_DIR@/ArrayFire${backend}.cmake)
-  if(EXISTS ${targetFile})
-    include(${targetFile})
-    set(ArrayFire_${backend}_FOUND ON)
-    set(ArrayFire_${backend}_LIBRARIES af${lowerbackend})
-    # set the default backend
-    set(ArrayFire_LIBRARIES af${lowerbackend})
-  else()
-    set(ArrayFire_${backend}_FOUND OFF)
-  endif()
-endforeach()
diff --git a/ArrayFireConfigVersion.cmake.in b/ArrayFireConfigVersion.cmake.in
deleted file mode 100644
index 3e461b9e99..0000000000
--- a/ArrayFireConfigVersion.cmake.in
+++ /dev/null
@@ -1,73 +0,0 @@
-#=============================================================================
-# Copyright (c) 2015, ArrayFire
-# All rights reserved.
-#
-# Redistribution and use in source and binary forms, with or without modification,
-# are permitted provided that the following conditions are met:
-#
-# * Redistributions of source code must retain the above copyright notice, this
-#   list of conditions and the following disclaimer.
-#
-# * Redistributions in binary form must reproduce the above copyright notice, this
-#   list of conditions and the following disclaimer in the documentation and/or
-#   other materials provided with the distribution.
-#
-# * Neither the name of the ArrayFire nor the names of its
-#   contributors may be used to endorse or promote products derived from
-#   this software without specific prior written permission.
-#
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
-# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#=============================================================================
-
-# This is a basic version file for the Config-mode of find_package().
-#
-# The created file sets PACKAGE_VERSION_EXACT if the current version string and
-# the requested version string are exactly the same and it sets
-# PACKAGE_VERSION_COMPATIBLE if the current version is >= requested version,
-# but only if the requested major version is the same as the current one.
-
-
-set(PACKAGE_VERSION "@AF_VERSION@@AF_VERSION_MINOR@")
-
-if("${PACKAGE_VERSION}" VERSION_LESS "${PACKAGE_FIND_VERSION}" )
-  set(PACKAGE_VERSION_COMPATIBLE FALSE)
-else()
-
-  if("@AF_VERSION@@AF_VERSION_MINOR@" MATCHES "^([0-9]+)\\.")
-    set(ArrayFire_VERSION_MAJOR "${CMAKE_MATCH_1}")
-  else()
-    set(ArrayFire_VERSION_MAJOR "@AF_VERSION@@AF_VERSION_MINOR@")
-  endif()
-
-  if("${PACKAGE_FIND_VERSION_MAJOR}" STREQUAL "${ArrayFire_VERSION_MAJOR}")
-    set(PACKAGE_VERSION_COMPATIBLE TRUE)
-  else()
-    set(PACKAGE_VERSION_COMPATIBLE FALSE)
-  endif()
-
-  if( "${PACKAGE_FIND_VERSION}" STREQUAL "${PACKAGE_VERSION}")
-      set(PACKAGE_VERSION_EXACT TRUE)
-  endif()
-endif()
-
-
-# if the installed or the using project don't have CMAKE_SIZEOF_VOID_P set, ignore it:
-if("${CMAKE_SIZEOF_VOID_P}"  STREQUAL ""  OR "@CMAKE_SIZEOF_VOID_P@" STREQUAL "")
-   return()
-endif()
-
-# check that the installed version has the same 32/64bit-ness as the one which is currently searching:
-if(NOT "${CMAKE_SIZEOF_VOID_P}" STREQUAL "@CMAKE_SIZEOF_VOID_P@")
-  math(EXPR installedBits "@CMAKE_SIZEOF_VOID_P@ * 8")
-  set(PACKAGE_VERSION "${PACKAGE_VERSION} (${installedBits}bit)")
-  set(PACKAGE_VERSION_UNSUITABLE TRUE)
-endif()
diff --git a/CMakeLists.txt b/CMakeLists.txt
index f512fb554a..21bc48d39e 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -1,157 +1,470 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-PROJECT(ARRAYFIRE)
-
-SET_PROPERTY(GLOBAL PROPERTY USE_FOLDERS ON)
-
-OPTION(BUILD_TEST "Build Tests" ON)
-OPTION(BUILD_EXAMPLES "Build Examples" ON)
-OPTION(BUILD_GTEST "Download gtest and check for updates. Necessary if you change compilers" ON)
-
-OPTION(BUILD_CPU "Build ArrayFire with a CPU backend" ON)
-OPTION(BUILD_CUDA "Build ArrayFire with a CUDA backend" OFF)
-OPTION(BUILD_OPENCL "Build ArrayFire with a OpenCL backend" OFF)
-
-OPTION(BUILD_GRAPHICS "Build ArrayFire with Forge Graphics" ON)
-
-OPTION(BUILD_DOCS "Create ArrayFire Documentation" OFF)
-OPTION(WITH_COVERAGE "Added code coverage flags" OFF)
-
-# Set a default build type if none was specified
-if(NOT CMAKE_BUILD_TYPE AND NOT CMAKE_CONFIGURATION_TYPES)
-    set(CMAKE_BUILD_TYPE Release CACHE STRING "Choose the type of build." FORCE)
-    # Set the possible values of build type for cmake-gui
-    set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS "Debug" "Release"
-      "MinSizeRel" "RelWithDebInfo")
-endif()
-
-SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules")
-INCLUDE(${CMAKE_MODULE_PATH}/UploadCoveralls.cmake)
-INCLUDE(AFInstallDirs)
-
-FIND_PACKAGE(FreeImage)
-IF(FREEIMAGE_FOUND)
-    ADD_DEFINITIONS(-DWITH_FREEIMAGE)
-    SET(FreeImage_LIBS ${FREEIMAGE_LIBRARY})
-    MESSAGE(STATUS "Using FreeImage library ${FreeImage_LIBS}")
-    INCLUDE_DIRECTORIES(${FREEIMAGE_INCLUDE_PATH})
-ELSE(FREEIMAGE_FOUND)
-    MESSAGE(WARNING, "FreeImage not found!")
-ENDIF(FREEIMAGE_FOUND)
-
-IF(BUILD_GRAPHICS)
-    OPTION(USE_SYSTEM_FORGE "Use system Forge" OFF)
-    IF(USE_SYSTEM_FORGE)
-        FIND_PACKAGE(Forge)
-    ELSE(USE_SYSTEM_FORGE)
-        INCLUDE("${CMAKE_MODULE_PATH}/build_forge.cmake")
-    ENDIF(USE_SYSTEM_FORGE)
-
-    IF(FORGE_FOUND)
-        ADD_DEFINITIONS(-DGLEW_MX -DWITH_GRAPHICS)
-        INCLUDE("${CMAKE_MODULE_PATH}/FindGLEWmx.cmake")
-        FIND_PACKAGE(GLFW)
-
-        INCLUDE_DIRECTORIES(
-            ${FORGE_INCLUDE_DIRECTORIES}
-            ${GLEW_INCLUDE_DIR}
-            ${GLFW_INCLUDE_DIR})
-
-        SET(FORGE_LIBRARIES ${FORGE_LIBRARIES}
-                            ${GLEWmx_LIBRARY}
-                            ${OPENGL_gl_LIBRARY}
-                            ${OPENGL_glu_LIBRARY})
-
-        IF(APPLE)
-            FIND_PACKAGE(X11 REQUIRED)
-            INCLUDE_DIRECTORIES(${X11_INCLUDE_DIR})
-        ENDIF(APPLE)
-
-    ELSE(FORGE_FOUND)
-        MESSAGE(WARNING "Forge not found. Graphics will be disabled")
-    ENDIF(FORGE_FOUND)
-
-ENDIF(BUILD_GRAPHICS)
-
-INCLUDE_DIRECTORIES(
-    "${CMAKE_CURRENT_SOURCE_DIR}/include"
-    "${CMAKE_CURRENT_SOURCE_DIR}/src/backend"
-    "${CMAKE_CURRENT_SOURCE_DIR}/src/api/c"
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+if(AF_BUILD_ONEAPI)
+  cmake_minimum_required(VERSION 3.20)
+else()
+  cmake_minimum_required(VERSION 3.16.3)
+endif()
+include(CheckLanguage)
+
+include(CMakeModules/AF_vcpkg_options.cmake)
+
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules")
+project(ArrayFire VERSION 3.10.0 LANGUAGES C CXX)
+
+include(AFconfigure_deps_vars)
+include(AFBuildConfigurations)
+include(AFInstallDirs)
+include(CMakeDependentOption)
+include(InternalUtils)
+include(Version)
+include(platform)
+include(GetPrerequisites)
+include(CheckCXXCompilerFlag)
+include(CheckSymbolExists)
+include(SplitDebugInfo)
+
+# Use the function generate_product_version on Windows
+# to attach version info in dll file attributes.
+# Make sure to pass appropriate arguments for each backend
+# to generate the correct resource file
+include(generate_product_version)
+
+set_policies(
+  TYPE NEW
+  POLICIES CMP0073
+           CMP0074
+           CMP0077
+           CMP0079)
+if (CMAKE_VERSION VERSION_GREATER_EQUAL "3.27")
+  cmake_policy(SET CMP0146 OLD)
+endif()
+arrayfire_set_cmake_default_variables()
+
+option(AF_WITH_EXTERNAL_PACKAGES_ONLY "Build ArrayFire with External packages only" OFF)
+if(AF_WITH_EXTERNAL_PACKAGES_ONLY)
+  set(AF_REQUIRED REQUIRED)
+endif()
+if(CMAKE_SYCL_COMPILER)
+  get_filename_component(SYCL_COMPILER_NAME ${CMAKE_SYCL_COMPILER} NAME)
+endif()
+if(SYCL_COMPILER_NAME STREQUAL "dpcpp" OR SYCL_COMPILER_NAME STREQUAL "dpcpp.exe"
+   OR SYCL_COMPILER_NAME STREQUAL "icpx" OR SYCL_COMPILER_NAME STREQUAL "icx.exe")
+  set(MKL_THREAD_LAYER "TBB" CACHE STRING "The thread layer to choose for MKL")
+  set(TBB_ROOT "$ENV{TBBROOT}")
+  set(MKL_INTERFACE "ilp64")
+  set(MKL_INTERFACE_INTEGER_SIZE 8)
+else()
+  set(MKL_THREAD_LAYER "Intel OpenMP" CACHE STRING "The thread layer to choose for MKL")
+  set(MKL_INTERFACE "lp64")
+  set(MKL_INTERFACE_INTEGER_SIZE 4)
+endif()
+
+find_package(CUDA 10.2)
+find_package(cuDNN 4.0)
+find_package(OpenCL 1.2)
+find_package(OpenGL)
+find_package(glad CONFIG QUIET)
+find_package(FreeImage)
+find_package(Threads)
+find_package(FFTW)
+find_package(CBLAS)
+find_package(LAPACKE)
+find_package(Doxygen)
+find_package(AF_MKL)
+find_package(spdlog QUIET ${AF_REQUIRED} NO_CMAKE_PACKAGE_REGISTRY)
+find_package(fmt QUIET ${AF_REQUIRED})
+find_package(span-lite QUIET)
+find_package(GTest)
+find_package(CLBlast QUIET)
+find_package(Boost 1.70 ${AF_REQUIRED})
+
+# CLFFT used in ArrayFire requires a specific fork
+#find_package(clFFT QUIET)
+
+include(boost_package)
+include(config_ccache)
+
+option(AF_BUILD_CPU      "Build ArrayFire with a CPU backend"        ON)
+option(AF_BUILD_CUDA     "Build ArrayFire with a CUDA backend"       ${CUDA_FOUND})
+option(AF_BUILD_OPENCL   "Build ArrayFire with an OpenCL backend"    ${OpenCL_FOUND})
+option(AF_BUILD_ONEAPI   "Build ArrayFire with a oneAPI backend"     OFF)
+option(AF_BUILD_UNIFIED  "Build Backend-Independent ArrayFire API"   ON)
+option(AF_BUILD_DOCS     "Create ArrayFire Documentation"            ${DOXYGEN_FOUND})
+option(AF_BUILD_EXAMPLES "Build Examples"                            ON)
+option(AF_WITH_CUDNN     "Use cuDNN for convolveNN functions"        ${cuDNN_FOUND})
+option(AF_BUILD_FORGE
+    "Forge libs are not built by default as they are not a link-time dependency" OFF)
+
+option(AF_WITH_NONFREE  "Build ArrayFire nonfree algorithms"   OFF)
+option(AF_WITH_LOGGING  "Build ArrayFire with logging support" ON)
+option(AF_WITH_STACKTRACE  "Add stacktraces to the error messages." ON)
+option(AF_CACHE_KERNELS_TO_DISK "Enable caching kernels to disk" ON)
+option(AF_WITH_STATIC_MKL "Link against static Intel MKL libraries" OFF)
+option(AF_WITH_STATIC_CUDA_NUMERIC_LIBS "Link libafcuda with static numeric libraries(cublas, cufft, etc.)" OFF)
+option(AF_WITH_SPDLOG_HEADER_ONLY "Build ArrayFire with header only version of spdlog" OFF)
+option(AF_WITH_FMT_HEADER_ONLY "Build ArrayFire with header only version of fmt" OFF)
+option(AF_WITH_FAST_MATH "Use lower precision but high performance numeric optimizations" OFF)
+option(AF_CTEST_SEPARATED "Run tests separately when called from ctest(increases test times)" OFF)
+option(AF_SKIP_UNSUPPORTED_TESTS "Skip tests where functions are unsupported by the backend instead of failing" OFF)
+
+if(AF_WITH_STATIC_CUDA_NUMERIC_LIBS)
+  option(AF_WITH_PRUNE_STATIC_CUDA_NUMERIC_LIBS "Prune CUDA static libraries to reduce binary size.(WARNING: May break some libs on older CUDA toolkits for some compute arch)" OFF)
+endif()
+
+set(default_compute_library "FFTW/LAPACK/BLAS")
+if(MKL_FOUND)
+  set(default_compute_library "Intel-MKL")
+endif()
+
+if(AF_WITH_STATIC_MKL)
+  set(MKL_LINK static)
+endif()
+if(MKL_THREAD_LAYER STREQUAL "Sequential")
+  set(MKL_THREADING "sequential")
+elseif(MKL_THREAD_LAYER STREQUAL "GNU OpenMP")
+  set(MKL_THREADING "gnu_thread")
+elseif(MKL_THREAD_LAYER STREQUAL "Intel OpenMP")
+  set(MKL_THREADING "intel_thread")
+elseif(MKL_THREAD_LAYER STREQUAL "TBB")
+  set(MKL_THREADING "tbb_thread")
+else()
+endif()
+
+if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.13)
+  # VCPKG overrides the find_package command and the PATH parameter is currently
+  # broken with the current version of VCPKG so we are setting the MKL_ROOT
+  # directory to the MKLROOT environment variable.
+  if(DEFINED ENV{MKLROOT} AND NOT DEFINED MKL_ROOT)
+    set(MKL_ROOT "$ENV{MKLROOT}")
+  endif()
+  set(SYCL_COMPILER ON)
+  find_package(MKL)
+endif()
+
+af_multiple_option(NAME        AF_COMPUTE_LIBRARY
+                   DEFAULT     ${default_compute_library}
+                   DESCRIPTION "Compute library for signal processing and linear algebra routines"
+                   OPTIONS     "Intel-MKL" "FFTW/LAPACK/BLAS")
+
+if(WIN32)
+  af_multiple_option(NAME         AF_STACKTRACE_TYPE
+                     DEFAULT      "Windbg"
+                     DESCRIPTION  "The type of backtrace features. Windbg(simple), None"
+                     OPTIONS       "Windbg" "None")
+else()
+  af_multiple_option(NAME         AF_STACKTRACE_TYPE
+                     DEFAULT      "Basic"
+                     DESCRIPTION  "The type of backtrace features. Basic(simple), libbacktrace(fancy), addr2line(fancy), None"
+                     OPTIONS       "Basic" "libbacktrace" "addr2line" "None")
+endif()
+
+option(AF_INSTALL_STANDALONE "Build installers that include all dependencies" OFF)
+
+cmake_dependent_option(AF_WITH_RELATIVE_TEST_DIR "Use relative paths for the test data directory (for continuous integration (CI) purposes only)" OFF
+  "BUILD_TESTING" OFF)
+
+cmake_dependent_option(AF_WITH_IMAGEIO "Build ArrayFire with Image IO support" ${FreeImage_FOUND}
+                       "FreeImage_FOUND" OFF)
+cmake_dependent_option(AF_BUILD_FRAMEWORK "Build an ArrayFire framework for Apple platforms.(Experimental)" OFF
+                       "APPLE" OFF)
+
+option(AF_WITH_STATIC_FREEIMAGE "Use Static FreeImage Lib" OFF)
+
+set(AF_WITH_CPUID ON CACHE BOOL "Build with CPUID integration")
+
+if(AF_BUILD_CUDA)
+  check_language(CUDA)
+  if(CMAKE_CUDA_COMPILER)
+    enable_language(CUDA)
+  elseif(CUDA_NVCC_EXECUTABLE)
+    message(STATUS "Using the FindCUDA script to search for the CUDA compiler")
+    set(CMAKE_CUDA_COMPILER ${CUDA_NVCC_EXECUTABLE} CACHE INTERNAL "CUDA compiler executable")
+    enable_language(CUDA)
+  else()
+    message(WARNING "No CUDA support")
+  endif()
+endif()
+
+af_deprecate(BUILD_CPU             AF_BUILD_CPU)
+af_deprecate(BUILD_CUDA            AF_BUILD_CUDA)
+af_deprecate(BUILD_OPENCL          AF_BUILD_OPENCL)
+af_deprecate(BUILD_UNIFIED         AF_BUILD_UNIFIED)
+af_deprecate(BUILD_DOCS            AF_BUILD_DOCS)
+af_deprecate(BUILD_NONFREE         AF_WITH_NONFREE)
+af_deprecate(BUILD_EXAMPLES        AF_BUILD_EXAMPLES)
+af_deprecate(USE_RELATIVE_TEST_DIR AF_WITH_RELATIVE_TEST_DIR)
+af_deprecate(USE_FREEIMAGE_STATIC  AF_WITH_STATIC_FREEIMAGE)
+af_deprecate(USE_CPUID             AF_WITH_CPUID)
+if(DEFINED USE_CPU_MKL OR DEFINED USE_OPENCL_MKL)
+  # Cannot use af_deprecate as it expects the new and old variables to store values of
+  # the same type. In this case, USE_*_MKL variables are BOOLs and AF_COMPUTE_LIBRARY is a STRING
+  message(DEPRECATION
+    "Variables USE_CPU_MKL/USE_OPENCL_MKL are deprecated. Use AF_COMPUTE_LIBRARY instead.")
+  message(WARNING
+    "USE_CPU_MKL/USE_OPENCL_MKL defined. These values take precedence over the value of
+    AF_COMPUTE_LIBRARY until they are removed to preserve existing build behavior.")
+  # Until USE_CPU_MKL and USE_OPENCL_MKL are removed, if they are defined, they take
+  # precedence and cmake will check and report an error if Intel-MKL is not found
+  if(USE_CPU_MKL OR USE_OPENCL_MKL)
+    get_property(doc CACHE AF_COMPUTE_LIBRARY PROPERTY HELPSTRING)
+    set(AF_COMPUTE_LIBRARY "Intel-MKL" CACHE STRING "${doc}" FORCE)
+  endif()
+endif()
+
+if(AF_COMPUTE_LIBRARY STREQUAL "Intel-MKL")
+  set(BLA_VENDOR "Intel10_64lp")
+  if(MKL_THREAD_LAYER STREQUAL "Sequential")
+    set(BLA_VENDOR "${BLA_VENDOR}_seq")
+  endif()
+endif()
+find_package(BLAS)
+find_package(LAPACK)
+
+# IF: the old USE_CPU_MKL/USE_OPENCL_MKL flags are present,
+# THEN Irrespective of AF_COMPUTE_LIBRARY value, continue with MKL to preserve old
+#      behavior. Once the deprecated USE_CPU_MKL/USE_OPENCL_MKL are removed in later
+#      versions AF_COMPUTE_LIBRARY will take over total control of selecting CPU
+#      compute backend.
+#
+# Note that AF_COMPUTE_LIBRARY defaults to Intel-MKL only when MKL is found.
+# Also, cmake doesn't have short-circuit of OR/AND conditions in if
+if(${AF_BUILD_CPU} OR ${AF_BUILD_OPENCL})
+  if("${AF_COMPUTE_LIBRARY}" STREQUAL "Intel-MKL"
+      OR "${AF_COMPUTE_LIBRARY}" STREQUAL "MKL")
+    af_mkl_batch_check()
+    dependency_check(MKL_Shared_FOUND "Please ensure Intel-MKL / oneAPI-oneMKL is installed")
+    set(BUILD_WITH_MKL ON)
+  elseif("${AF_COMPUTE_LIBRARY}" STREQUAL "FFTW/LAPACK/BLAS")
+    dependency_check(FFTW_FOUND "FFTW not found")
+    dependency_check(CBLAS_FOUND "CBLAS not found")
+    if(UNIX AND NOT APPLE)
+      dependency_check(LAPACK_FOUND "LAPACK not found")
+    endif()
+  endif()
+endif()
+
+#Configure forge submodule
+#forge is included in ALL target if AF_BUILD_FORGE is ON
+#otherwise, forge is not built at all
+include(AFconfigure_forge_dep)
+
+if(TARGET fmt::fmt AND AF_WITH_FMT_HEADER_ONLY)
+  set_target_properties(fmt::fmt
+    PROPERTIES
+      INTERFACE_COMPILE_DEFINITIONS "FMT_HEADER_ONLY=1")
+endif()
+
+if(TARGET spdlog::spdlog OR AF_WITH_EXTERNAL_PACKAGES_ONLY)
+  if(AF_WITH_SPDLOG_HEADER_ONLY)
+    add_library(af_spdlog ALIAS spdlog::spdlog_header_only)
+  else()
+    add_library(af_spdlog ALIAS spdlog::spdlog)
+  endif()
+else()
+  add_library(af_spdlog INTERFACE)
+  af_dep_check_and_populate(${spdlog_prefix}
+    URI https://github.com/gabime/spdlog.git
+    REF v1.9.2
+  )
+
+  if(TARGET fmt::fmt)
+    set(SPDLOG_FMT_EXTERNAL ON)
+  endif()
+
+  add_subdirectory(${${spdlog_prefix}_SOURCE_DIR} ${${spdlog_prefix}_BINARY_DIR} EXCLUDE_FROM_ALL)
+
+  if(AF_WITH_SPDLOG_HEADER_ONLY)
+    set_target_properties(af_spdlog
+      PROPERTIES
+        INTERFACE_COMPILE_DEFINITIONS "FMT_HEADER_ONLY=1"
+        INTERFACE_LINK_LIBRARIES "spdlog_header_only")
+  else()
+    target_compile_options(spdlog
+      PRIVATE
+        $<$<BOOL:${has_cxx_fp_model}>:-fp-model precise>)
+    install(TARGETS spdlog
+      COMPONENT common_backend_dependencies
+      DESTINATION ${AF_INSTALL_BIN_DIR})
+    set_target_properties(af_spdlog
+      PROPERTIES
+        INTERFACE_LINK_LIBRARIES "spdlog")
+  endif()
+endif()
+
+if(NOT TARGET glad::glad)
+  af_dep_check_and_populate(${glad_prefix}
+    URI https://github.com/arrayfire/glad.git
+    REF main
+  )
+  add_subdirectory(${${glad_prefix}_SOURCE_DIR} ${${glad_prefix}_BINARY_DIR})
+
+  add_library(af_glad STATIC $<TARGET_OBJECTS:af_glad_obj_lib>)
+  target_link_libraries(af_glad PUBLIC ${CMAKE_DL_LIBS})
+  target_include_directories(af_glad
+    SYSTEM PUBLIC
+      $<BUILD_INTERFACE:$<TARGET_PROPERTY:af_glad_obj_lib,INTERFACE_INCLUDE_DIRECTORIES>>)
+endif()
+
+if(NOT TARGET nonstd::span-lite)
+  af_dep_check_and_populate(span-lite
+    URI https://github.com/martinmoene/span-lite
+    REF "ccf2351"
+    )
+  add_subdirectory(${span-lite_SOURCE_DIR} ${span-lite_BINARY_DIR} EXCLUDE_FROM_ALL)
+  get_property(span_include_dir
+    TARGET span-lite
+    PROPERTY INTERFACE_INCLUDE_DIRECTORIES)
+  set_target_properties(span-lite
+    PROPERTIES INTERFACE_SYSTEM_INCLUDE_DIRECTORIES "${span_include_dir}")
+  set_target_properties(span-lite
+    PROPERTIES INTERFACE_COMPILE_DEFINITIONS "span_FEATURE_WITH_INITIALIZER_LIST_P2447=1")
+
+endif()
+
+af_dep_check_and_populate(${assets_prefix}
+  URI https://github.com/arrayfire/assets.git
+  REF master
+)
+set(ASSETS_DIR ${${assets_prefix}_SOURCE_DIR})
+
+# when crosscompiling use the bin2cpp file from the native bin directory
+if(CMAKE_CROSSCOMPILING)
+  set(NATIVE_BIN_DIR "NATIVE_BIN_DIR-NOTFOUND"
+    CACHE FILEPATH "Path to the Native build directory.")
+  if(NATIVE_BIN_DIR)
+    include(${NATIVE_BIN_DIR}/ImportExecutables.cmake)
+  else()
+    message(SEND_ERROR "Native Directory not found. Run cmake in a separate "
+                       "directory and build the bin2cpp target.")
+  endif()
+else()
+  add_executable(bin2cpp CMakeModules/bin2cpp.cpp
+                         src/backend/common/deterministicHash.cpp
+                         src/backend/common/deterministicHash.hpp
+                         src/backend/common/Source.hpp)
+  set_target_properties(bin2cpp
+    PROPERTIES
+      CXX_STANDARD 17)
+  target_link_libraries(bin2cpp PRIVATE nonstd::span-lite)
+
+  if(WIN32)
+    target_compile_definitions(bin2cpp PRIVATE OS_WIN)
+  elseif(APPLE)
+    target_compile_definitions(bin2cpp PRIVATE OS_MAC)
+  elseif(UNIX)
+    target_compile_definitions(bin2cpp PRIVATE OS_LNX)
+  endif()
+  target_include_directories(bin2cpp PRIVATE
+                             ${ArrayFire_SOURCE_DIR}/include
+                             ${ArrayFire_BINARY_DIR}/include
+                             ${ArrayFire_SOURCE_DIR}/src/backend)
+  export(TARGETS bin2cpp FILE ${CMAKE_BINARY_DIR}/ImportExecutables.cmake)
+endif()
+
+
+if(NOT LAPACK_FOUND)
+    if(APPLE)
+        # UNSET THE VARIABLES FROM LAPACKE
+        unset(LAPACKE_LIB CACHE)
+        unset(LAPACK_LIB CACHE)
+        unset(LAPACKE_INCLUDES CACHE)
+        unset(LAPACKE_ROOT_DIR CACHE)
+    endif()
+endif()
+
+add_subdirectory(src/backend/common)
+add_subdirectory(src/api/c)
+add_subdirectory(src/api/cpp)
+
+conditional_directory(AF_BUILD_CPU     src/backend/cpu)
+conditional_directory(AF_BUILD_CUDA    src/backend/cuda)
+conditional_directory(AF_BUILD_ONEAPI  src/backend/oneapi)
+conditional_directory(AF_BUILD_OPENCL  src/backend/opencl)
+conditional_directory(AF_BUILD_UNIFIED src/api/unified)
+
+if(TARGET af)
+  list(APPEND built_backends af)
+endif()
+
+if(TARGET afcpu)
+  list(APPEND built_backends afcpu)
+endif()
+
+if(TARGET afcuda)
+  list(APPEND built_backends afcuda)
+endif()
+
+if(TARGET afoneapi)
+  list(APPEND built_backends afoneapi)
+endif()
+
+if(TARGET afopencl)
+  list(APPEND built_backends afopencl)
+endif()
+
+set_target_properties(${built_backends} PROPERTIES
+                      CXX_STANDARD 17
+                      CXX_EXTENSIONS OFF
+                      CXX_VISIBILITY_PRESET hidden
+                      VERSION "${ArrayFire_VERSION}"
+                      SOVERSION "${ArrayFire_VERSION_MAJOR}")
+
+if(AF_INSTALL_STANDALONE)
+
+  # This flag enables the use of RUNPATH instead of RPATH, which is the
+  # preferred method to set the runtime lookup. Only doing this for
+  # standalone builds because we include all libraries with the installers
+  # and they are included in the same directory so the RUNPATH is set to
+  # $ORIGIN. This avoids setting the linker path in ld.so.conf.d
+  check_cxx_compiler_flag("-Wl,--enable-new-dtags" HAS_RUNPATH_FLAG)
+  if(HAS_RUNPATH_FLAG)
+    set_target_properties(${built_backends} PROPERTIES
+      INSTALL_RPATH "$ORIGIN"
+      LINK_OPTIONS "-Wl,--enable-new-dtags")
+  endif()
+endif()
+
+# On some distributions the linker will not add a library to the ELF header if
+# the symbols are not needed when the library was first parsed by the linker.
+# This causes undefined reference issues when linking with libraries which have
+# circular dependencies.
+if(UNIX AND NOT APPLE AND CMAKE_CXX_COMPILER_ID MATCHES "GNU")
+  set_target_properties(${built_backends} PROPERTIES
+                        LINK_FLAGS "-Wl,--no-as-needed")
+endif()
+
+
+find_library(Backtrace_LIBRARY backtrace
+  DOC "libbacktrace.so file for more informative stacktraces. https://github.com/ianlancetaylor/libbacktrace")
+find_program(ADDR2LINE_PROGRAM addr2line
+  DOC "The path to the addr2line program for informative stacktraces")
+
+check_cxx_compiler_flag(-Wno-ignored-attributes has_ignored_attributes_flag)
+check_cxx_compiler_flag(-Wall has_all_warnings_flag)
+
+foreach(backend ${built_backends})
+  arrayfire_set_default_cxx_flags(${backend})
+endforeach()
+
+if(AF_BUILD_FRAMEWORK)
+  set_target_properties(${built_backends}
+    PROPERTIES
+      FRAMEWORK TRUE
+      FRAMEWORK_VERSION A
+      MACOSX_FRAMEWORK_IDENTIFIER com.arrayfire.arrayfireFramework
+      #MACOSX_FRAMEWORK_INFO_PLIST Info.plist
+      #PUBLIC_HEADER "${CMAKE_CURRENT_SOURCE_DIR}/include/arrayfire.h;${af_headers}"
+      #XCODE_ATTRIBUTE_CODE_SIGN_IDENTITY "iPhone Developer"
     )
+endif()
 
-IF(${UNIX})
-    ADD_DEFINITIONS(-Wall -std=c++11 -fvisibility=hidden)
-    IF(${WITH_COVERAGE})
-        SET(CMAKE_CXX_FLAGS             "-fprofile-arcs -ftest-coverage")
-        SET(CMAKE_EXE_LINKER_FLAGS      "-fprofile-arcs -ftest-coverage")
-        SET(CMAKE_SHARED_LINKER_FLAGS   "-fprofile-arcs -ftest-coverage")
-        SET(CMAKE_STATIC_LINKER_FLAGS   "-fprofile-arcs -ftest-coverage")
-    ENDIF(${WITH_COVERAGE})
-ENDIF(${UNIX})
-
-# OS Definitions
-IF(UNIX)
-    IF(APPLE)   #OSX
-        ADD_DEFINITIONS(-DOS_MAC)
-
-        SET(CMAKE_MACOSX_RPATH ON)
-        SET(CMAKE_SKIP_BUILD_RPATH  FALSE)
-        SET(CMAKE_BUILD_WITH_INSTALL_RPATH FALSE)
-        SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${AF_INSTALL_LIB_DIR}")
-        SET(CMAKE_INSTALL_RPATH_USE_LINK_PATH TRUE)
-
-        LIST(FIND CMAKE_PLATFORM_IMPLICIT_LINK_DIRECTORIES "${CMAKE_INSTALL_PREFIX}/${AF_INSTALL_LIB_DIR}" isSystemDir)
-        IF("${isSystemDir}" STREQUAL "-1")
-            SET(CMAKE_INSTALL_RPATH "${CMAKE_INSTALL_PREFIX}/${AF_INSTALL_LIB_DIR}")
-        ENDIF("${isSystemDir}" STREQUAL "-1")
-    ELSE(APPLE) #Linux
-        ADD_DEFINITIONS(-DOS_LNX)
-    ENDIF()
-ELSE(${UNIX}) #Windows
-    ADD_DEFINITIONS(-DOS_WIN -DNOMINMAX)
-ENDIF()
-
-# Architechture Definitions
-INCLUDE(${CMAKE_MODULE_PATH}/TargetArch.cmake)
-target_architecture(ARCH)
-IF(${ARCH} STREQUAL "x86_64")
-    ADD_DEFINITIONS(-DARCH_64)
-ELSE(${ARCH})
-    ADD_DEFINITIONS(-DARCH_32)
-ENDIF()
-
-INCLUDE(${CMAKE_MODULE_PATH}/Version.cmake)
-
-IF(${BUILD_CPU})
-    ADD_SUBDIRECTORY(src/backend/cpu)
-ENDIF()
-
-IF(${BUILD_CUDA})
-    ADD_SUBDIRECTORY(src/backend/cuda)
-ENDIF()
-
-IF(${BUILD_OPENCL})
-    ADD_SUBDIRECTORY(src/backend/opencl)
-ENDIF()
-
-IF(${BUILD_DOCS})
-    ADD_SUBDIRECTORY(docs)
-ENDIF()
-
-ADD_EXECUTABLE(bin2cpp ${CMAKE_MODULE_PATH}/bin2cpp.cpp)
-
-IF(${BUILD_TEST})
-    ENABLE_TESTING()
-    ADD_SUBDIRECTORY(test)
-ENDIF()
-
-IF(${BUILD_EXAMPLES})
-    ADD_SUBDIRECTORY(examples)
-ENDIF()
-
-##
-# Installation of headers, and CMake scripts
-##
-INSTALL(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/include/" DESTINATION "${AF_INSTALL_INC_DIR}"
+install(DIRECTORY include/ DESTINATION ${AF_INSTALL_INC_DIR}
     COMPONENT headers
     FILES_MATCHING
     PATTERN "*.h"
@@ -161,44 +474,282 @@ INSTALL(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/include/" DESTINATION "${AF_INSTA
 
 ## The ArrayFire version file is generated and won't be included above, install
 ## it separately.
-INSTALL(FILES
-    ${CMAKE_SOURCE_DIR}/include/af/version.h DESTINATION "${AF_INSTALL_INC_DIR}/af/"
-    COMPONENT headers
+install(FILES ${ArrayFire_BINARY_DIR}/include/af/version.h
+              ${ArrayFire_BINARY_DIR}/include/af/compilers.h
+        DESTINATION "${AF_INSTALL_INC_DIR}/af/"
+        COMPONENT headers)
+
+# install the examples irrespective of the AF_BUILD_EXAMPLES value
+# only the examples source files are installed, so the installation of these
+# source files does not depend on AF_BUILD_EXAMPLES
+# when AF_BUILD_EXAMPLES is OFF, the examples source is installed without
+# building the example executables
+install(DIRECTORY examples/ #NOTE The slash at the end is important
+    DESTINATION ${AF_INSTALL_EXAMPLE_DIR}
+    COMPONENT examples)
+
+install(DIRECTORY ${ASSETS_DIR}/examples/ #NOTE The slash at the end is important
+    DESTINATION ${AF_INSTALL_EXAMPLE_DIR}
+    COMPONENT examples)
+
+install(DIRECTORY "${ArrayFire_SOURCE_DIR}/LICENSES/"
+    DESTINATION LICENSES
+    COMPONENT licenses)
+
+foreach(backend CPU CUDA OpenCL oneAPI Unified)
+  string(TOUPPER ${backend} upper_backend)
+  string(TOLOWER ${backend} lower_backend)
+  if(AF_BUILD_${upper_backend})
+    install(EXPORT ArrayFire${backend}Targets
+            NAMESPACE ArrayFire::
+            DESTINATION ${AF_INSTALL_CMAKE_DIR}
+            COMPONENT ${lower_backend}_dev)
+
+    export( EXPORT ArrayFire${backend}Targets
+            NAMESPACE ArrayFire::
+            FILE cmake/ArrayFire${backend}Targets.cmake)
+  endif()
+endforeach()
+
+include(CMakePackageConfigHelpers)
+write_basic_package_version_file(
+  "${ArrayFire_BINARY_DIR}/ArrayFireConfigVersion.cmake"
+  COMPATIBILITY SameMajorVersion
 )
 
-IF(FORGE_FOUND AND NOT USE_SYSTEM_FORGE)
-    INSTALL(DIRECTORY "${CMAKE_BINARY_DIR}/third_party/forge/lib/" DESTINATION "${AF_INSTALL_LIB_DIR}"
-        COMPONENT libraries
-    )
-ENDIF(FORGE_FOUND AND NOT USE_SYSTEM_FORGE)
-
-## configuration to be used from the binary directory directly
-SET(INCLUDE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/include")
-SET(BACKEND_DIR "src/backend/\${lowerbackend}")
-CONFIGURE_FILE(
-    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayFireConfig.cmake.in
-    ${CMAKE_CURRENT_BINARY_DIR}/ArrayFireConfig.cmake
-    @ONLY)
-
-## installed cmake configuration
-# use a relative dir to keep arrayfire relocatable
-STRING(REGEX REPLACE "[^/]+" ".." reldir "${AF_INSTALL_CMAKE_DIR}")
-SET(INCLUDE_DIR "\${CMAKE_CURRENT_LIST_DIR}/${reldir}/include")
-set(BACKEND_DIR)
-CONFIGURE_FILE(
-    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayFireConfig.cmake.in
-    ${CMAKE_CURRENT_BINARY_DIR}/Install/ArrayFireConfig.cmake
-    @ONLY)
-CONFIGURE_FILE(
-    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayFireConfigVersion.cmake.in
-    ${CMAKE_CURRENT_BINARY_DIR}/ArrayFireConfigVersion.cmake
-    @ONLY)
-INSTALL(FILES ${CMAKE_CURRENT_BINARY_DIR}/Install/ArrayFireConfig.cmake
-    ${CMAKE_CURRENT_BINARY_DIR}/ArrayFireConfigVersion.cmake
-    DESTINATION ${AF_INSTALL_CMAKE_DIR}
-    COMPONENT cmake)
-
-##
-# Packaging
-##
-include(${CMAKE_CURRENT_SOURCE_DIR}/CPack.txt)
+# This config file will be installed so we need to set the install_destination
+# path relative to the install path
+set(INCLUDE_DIRS include)
+set(CMAKE_DIR ${AF_INSTALL_CMAKE_DIR})
+configure_package_config_file(
+  ${ArrayFire_SOURCE_DIR}/CMakeModules/ArrayFireConfig.cmake.in
+  cmake/install/ArrayFireConfig.cmake
+  INSTALL_DESTINATION "${AF_INSTALL_CMAKE_DIR}"
+  PATH_VARS INCLUDE_DIRS CMAKE_DIR
+  )
+
+install(FILES ${ArrayFire_BINARY_DIR}/cmake/install/ArrayFireConfig.cmake
+              ${ArrayFire_BINARY_DIR}/ArrayFireConfigVersion.cmake
+              DESTINATION ${AF_INSTALL_CMAKE_DIR}
+              COMPONENT cmake)
+
+if(WIN32 AND AF_INSTALL_STANDALONE)
+  find_program(MSVC_REDIST NAMES vc_redist.x64.exe
+          PATHS "$ENV{VCINSTALLDIR}Redist\\MSVC\\v${MSVC_TOOLSET_VERSION}")
+  get_filename_component(MSVC_REDIST_INSTALLER ${MSVC_REDIST} NAME)
+  install(PROGRAMS ${MSVC_REDIST} COMPONENT common_backend_dependencies
+          DESTINATION ${AF_INSTALL_BIN_DIR})
+endif()
+
+if(BUILD_WITH_MKL AND AF_INSTALL_STANDALONE)
+  if(TARGET MKL::ThreadingLibrary)
+    get_filename_component(mkl_tl ${MKL_ThreadingLibrary_LINK_LIBRARY} REALPATH)
+    install(FILES
+      $<TARGET_FILE:MKL::ThreadingLibrary>
+      ${mkl_tl}
+      DESTINATION ${AF_INSTALL_LIB_DIR}
+      COMPONENT mkl_dependencies)
+  endif()
+
+  if(NOT AF_WITH_STATIC_MKL AND TARGET MKL::Shared)
+    if(NOT WIN32)
+      get_filename_component(mkl_int ${MKL_Interface_LINK_LIBRARY} REALPATH)
+      install(FILES
+        $<TARGET_FILE:MKL::Interface>
+        ${mkl_int}
+        DESTINATION ${AF_INSTALL_LIB_DIR}
+        COMPONENT mkl_dependencies)
+
+      # LP64 library is required for the CPU and OpenCL back ends, so install it too
+      if(MKL_INTERFACE_INTEGER_SIZE EQUAL 8)
+        get_filename_component(mkl_int_lp ${MKL_InterfaceLP_LINK_LIBRARY} REALPATH)
+        install(FILES
+          ${mkl_int_lp}
+          DESTINATION ${AF_INSTALL_LIB_DIR}
+          COMPONENT mkl_dependencies)
+      endif()
+    endif()
+
+  if(UNIX)
+    get_filename_component(mkl_rnt ${MKL_RT_LINK_LIBRARY} REALPATH)
+    get_filename_component(mkl_shd ${MKL_Core_LINK_LIBRARY} REALPATH)
+    get_filename_component(mkl_tly ${MKL_ThreadLayer_LINK_LIBRARY} REALPATH)
+    install(FILES
+      ${mkl_rnt}
+      ${mkl_shd}
+      ${mkl_tly}
+      DESTINATION ${AF_INSTALL_LIB_DIR}
+      COMPONENT mkl_dependencies)
+  endif()
+
+    install(FILES
+      $<TARGET_FILE:MKL::RT>
+      $<TARGET_FILE:MKL::Shared>
+      $<TARGET_FILE:MKL::ThreadLayer>
+      ${MKL_RUNTIME_KERNEL_LIBRARIES}
+
+      # This variable is used to add the tbb.so.2 library because the main lib
+      # is a linker script and not a symlink, so it can't be resolved using
+      # get_filename_component
+      ${AF_ADDITIONAL_MKL_LIBRARIES}
+      DESTINATION ${AF_INSTALL_LIB_DIR}
+      COMPONENT mkl_dependencies)
+    if(AF_BUILD_ONEAPI)
+      if(WIN32)
+        get_filename_component(mkl_sycl_lapack ${MKL_SyclLapack_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_dft ${MKL_SyclDft_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_blas ${MKL_SyclBlas_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_sparse ${MKL_SyclSparse_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_data ${MKL_SyclDataFitting_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_rng ${MKL_SyclRNG_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_stats ${MKL_SyclStats_DLL_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_vm ${MKL_SyclVM_DLL_LIBRARY} REALPATH)
+      else()
+        get_filename_component(mkl_sycl_lapack ${MKL_SyclLapack_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_dft ${MKL_SyclDft_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_blas ${MKL_SyclBlas_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_sparse ${MKL_SyclSparse_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_data ${MKL_SyclDataFitting_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_rng ${MKL_SyclRNG_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_stats ${MKL_SyclStats_LINK_LIBRARY} REALPATH)
+        get_filename_component(mkl_sycl_vm ${MKL_SyclVM_LINK_LIBRARY} REALPATH)
+      endif()
+      install(FILES
+        ${mkl_sycl_lapack}
+        ${mkl_sycl_dft}
+        ${mkl_sycl_blas}
+        ${mkl_sycl_sparse}
+        ${mkl_sycl_data}
+        ${mkl_sycl_rng}
+        ${mkl_sycl_stats}
+        ${mkl_sycl_vm}
+        DESTINATION ${AF_INSTALL_LIB_DIR}
+        COMPONENT mkl_dependencies)
+    endif()
+  endif()
+endif()
+
+# This file will be used to create the config file for the build directory.
+# These config files will be used by the examples to find the ArrayFire
+# libraries
+set(INCLUDE_DIRS "${ArrayFire_SOURCE_DIR}/include" "${ArrayFire_BINARY_DIR}/include")
+set(CMAKE_DIR "${ArrayFire_BINARY_DIR}/cmake")
+configure_package_config_file(
+  ${ArrayFire_SOURCE_DIR}/CMakeModules/ArrayFireConfig.cmake.in
+  ArrayFireConfig.cmake
+  INSTALL_DESTINATION "${ArrayFire_BINARY_DIR}"
+  PATH_VARS INCLUDE_DIRS CMAKE_DIR
+  INSTALL_PREFIX "${ArrayFire_BINARY_DIR}"
+  )
+
+# Registers the current build directory with the user's cmake config. This will
+# create a file at $HOME/.cmake/packages/ArrayFire which will point to this source
+# build directory.
+# TODO(umar): Disable for now. Causing issues with builds on windows.
+#export(PACKAGE ArrayFire)
+
+# Unset the visibility to avoid setting policy commands for older versions of
+# CMake for examples and tests.
+unset(CMAKE_CXX_VISIBILITY_PRESET)
+
+configure_file(
+  ${ArrayFire_SOURCE_DIR}/CMakeModules/CTestCustom.cmake
+  ${ArrayFire_BINARY_DIR}/CTestCustom.cmake)
+
+include(CTest)
+
+# Handle deprecated BUILD_TEST variable if found.
+if(BUILD_TEST)
+  set(BUILD_TESTING ${BUILD_TEST})
+endif()
+
+conditional_directory(BUILD_TESTING test)
+
+conditional_directory(AF_BUILD_EXAMPLES examples)
+conditional_directory(AF_BUILD_DOCS docs)
+
+include(CPackConfig)
+
+# VCPKG variables that aren't necessarily important
+# for ArrayFire Development. They are marked hidden.
+# If VCPKG is not used, marking them is not harmful
+mark_as_advanced(
+  AF_BUILD_FRAMEWORK
+  AF_CACHE_KERNELS_TO_DISK
+  AF_INSTALL_STANDALONE
+  AF_WITH_CPUID
+  AF_WITH_LOGGING
+  AF_WITH_STACKTRACE
+  AF_WITH_STATIC_FREEIMAGE
+  AF_WITH_NONFREE
+  AF_WITH_IMAGEIO
+  AF_WITH_RELATIVE_TEST_DIR
+  AF_TEST_WITH_MTX_FILES
+  ArrayFire_DIR
+
+  VCPKG_APPLOCAL_DEPS
+  VCPKG_BOOTSTRAP_OPTIONS
+  VCPKG_INSTALL_OPTIONS
+  VCPKG_MANIFEST_DIR
+  VCPKG_MANIFEST_INSTALL
+  VCPKG_MANIFEST_MODE
+  VCPKG_OVERLAY_PORTS
+  VCPKG_OVERLAY_TRIPLETS
+  VCPKG_TARGET_TRIPLET
+  X_VCPKG_APPLOCAL_DEPS_INSTALL
+  X_VCPKG_APPLOCAL_DEPS_SERIALIZED
+  Z_VCPKG_BUILTIN_POWERSHELL_PATH
+  Z_VCPKG_PWSH_PATH
+  Z_VCPKG_CL
+  _VCPKG_INSTALLED_DIR
+
+  Boost_INCLUDE_DIR
+  CLEAR CUDA_VERSION
+  CUDA_HOST_COMPILER
+  CUDA_SDK_ROOT_DIR
+  CUDA_USE_STATIC_CUDA_RUNTIME
+  CUDA_rt_LIBRARY
+  SPDLOG_BUILD_EXAMPLES
+  SPDLOG_BUILD_TESTING
+  ADDR2LINE_PROGRAM
+  Backtrace_LIBRARY
+  AF_WITH_STATIC_MKL
+  GIT
+  Forge_DIR
+  glad_DIR
+  spdlog_DIR
+  FG_BUILD_OFFLINE
+  SPAN_LITE_COLOURISE_TEST
+  SPAN_LITE_EXPORT_PACKAGE
+  SPAN_LITE_OPT_BUILD_EXAMPLES
+  SPAN_LITE_OPT_BUILD_TESTS
+  SPAN_LITE_OPT_SELECT_NONSTD
+  SPAN_LITE_OPT_SELECT_STD
+  FETCHCONTENT_SOURCE_DIR_SPAN-LITE
+  SPDLOG_BUILD_ALL
+  SPDLOG_BUILD_BENCH
+  SPDLOG_BUILD_EXAMPLE
+  SPDLOG_BUILD_EXAMPLE_HO
+  SPDLOG_BUILD_SHARED
+  SPDLOG_BUILD_TESTS
+  SPDLOG_BUILD_TESTS_HO
+  SPDLOG_BUILD_WARNINGS
+  SPDLOG_CLOCK_COARSE
+  SPDLOG_DISABLE_DEFAULT_LOGGER
+  SPDLOG_ENABLE_PCH
+  SPDLOG_FMT_EXTERNAL
+  SPDLOG_FMT_EXTERNAL_HO
+  SPDLOG_INSTALL
+  SPDLOG_NO_ATOMIC_LEVELS
+  SPDLOG_NO_EXCEPTIONS
+  SPDLOG_NO_THREAD_ID
+  SPDLOG_NO_TLS
+  SPDLOG_PREVENT_CHILD_FD
+  SPDLOG_SANITIZE_ADDRESS
+  SPDLOG_TIDY
+  SPDLOG_WCHAR_FILENAMES
+  SPDLOG_WCHAR_SUPPORT
+  cub_include_dir
+  fmt_DIR
+  span-lite_DIR
+  )
diff --git a/CMakeModules/AFBuildConfigurations.cmake b/CMakeModules/AFBuildConfigurations.cmake
new file mode 100644
index 0000000000..48dd07001b
--- /dev/null
+++ b/CMakeModules/AFBuildConfigurations.cmake
@@ -0,0 +1,24 @@
+# CMake 3.9 or later provides a global property that indicates whether we are
+# using a multi-config or single-config generator. Before 3.9, the definition of the
+# CMAKE_CONFIGURATION_TYPES variable indicated multi-config, but developers might modify it.
+if(NOT CMAKE_VERSION VERSION_LESS 3.9)
+  get_property(isMultiConfig GLOBAL PROPERTY GENERATOR_IS_MULTI_CONFIG)
+elseif(CMAKE_CONFIGURATION_TYPES)
+  # CMAKE_CONFIGURATION_TYPES is set by project() call for multi-config generators
+  set(isMultiConfig True)
+else()
+  set(isMultiConfig False)
+endif()
+
+if(isMultiConfig)
+  set(CMAKE_CONFIGURATION_TYPES
+    "Coverage;Debug;MinSizeRel;Release;RelWithDebInfo"
+    CACHE STRING "Configurations for Multi-Config CMake Generator" FORCE)
+else()
+  if(NOT CMAKE_BUILD_TYPE)
+    set(CMAKE_BUILD_TYPE "Release" CACHE STRING "Build Type" FORCE)
+  endif()
+  set_property(CACHE CMAKE_BUILD_TYPE
+    PROPERTY
+      STRINGS "Debug" "Release" "MinSizeRel" "RelWithDebInfo" "Coverage")
+endif()
diff --git a/CMakeModules/AFInstallDirs.cmake b/CMakeModules/AFInstallDirs.cmake
index 578ff48e28..2c7b96eaf8 100644
--- a/CMakeModules/AFInstallDirs.cmake
+++ b/CMakeModules/AFInstallDirs.cmake
@@ -2,16 +2,22 @@
 # Sets ArrayFire installation paths.
 #
 
+include(GNUInstallDirs)
+
 # NOTE: These paths are all relative to the project installation prefix.
 
 # Executables
 if(NOT DEFINED AF_INSTALL_BIN_DIR)
-  set(AF_INSTALL_BIN_DIR "bin" CACHE PATH "Installation path for executables")
+  set(AF_INSTALL_BIN_DIR "lib" CACHE PATH "Installation path for executables")
 endif()
 
 # Libraries
 if(NOT DEFINED AF_INSTALL_LIB_DIR)
-  set(AF_INSTALL_LIB_DIR "lib" CACHE PATH "Installation path for libraries")
+  if(WIN32)
+    set(AF_INSTALL_LIB_DIR "lib" CACHE PATH "Installation path for libraries")
+  else()
+    set(AF_INSTALL_LIB_DIR "${CMAKE_INSTALL_LIBDIR}" CACHE PATH "Installation path for libraries")
+  endif()
 endif()
 
 # Header files
@@ -19,22 +25,45 @@ if(NOT DEFINED AF_INSTALL_INC_DIR)
   set(AF_INSTALL_INC_DIR "include" CACHE PATH "Installation path for headers")
 endif()
 
-# Data files
-if(NOT DEFINED AF_INSTALL_DATA_DIR)
-  set(AF_INSTALL_DATA_DIR "share/ArrayFire" CACHE PATH "Installation path for data files")
-endif()
+set(DATA_DIR "share/ArrayFire")
 
 # Documentation
 if(NOT DEFINED AF_INSTALL_DOC_DIR)
-  set(AF_INSTALL_DOC_DIR "${AF_INSTALL_DATA_DIR}/doc" CACHE PATH "Installation path for documentation")
+  if (WIN32)
+    set(AF_INSTALL_DOC_DIR "doc" CACHE PATH "Installation path for documentation")
+  else ()
+    set(AF_INSTALL_DOC_DIR "${DATA_DIR}/doc" CACHE PATH "Installation path for documentation")
+  endif ()
+endif()
+
+if(NOT DEFINED AF_INSTALL_EXAMPLE_DIR)
+  if (WIN32)
+    set(AF_INSTALL_EXAMPLE_DIR "examples" CACHE PATH "Installation path for examples")
+  else ()
+    set(AF_INSTALL_EXAMPLE_DIR "${DATA_DIR}/examples" CACHE PATH "Installation path for examples")
+  endif ()
 endif()
 
 # Man pages
 if(NOT DEFINED AF_INSTALL_MAN_DIR)
-  set(AF_INSTALL_MAN_DIR "${AF_INSTALL_DATA_DIR}/man" CACHE PATH "Installation path for man pages")
+  set(AF_INSTALL_MAN_DIR "${DATA_DIR}/man" CACHE PATH "Installation path for man pages")
 endif()
 
 # CMake files
 if(NOT DEFINED AF_INSTALL_CMAKE_DIR)
-  set(AF_INSTALL_CMAKE_DIR "${AF_INSTALL_DATA_DIR}/cmake" CACHE PATH "Installation path for CMake files")
+  if (WIN32)
+    set(AF_INSTALL_CMAKE_DIR "cmake" CACHE PATH "Installation path for CMake files")
+  else ()
+    set(AF_INSTALL_CMAKE_DIR "${DATA_DIR}/cmake" CACHE PATH "Installation path for CMake files")
+  endif ()
 endif()
+
+mark_as_advanced(
+  AF_INSTALL_BIN_DIR
+  AF_INSTALL_LIB_DIR
+  AF_INSTALL_INC_DIR
+  AF_INSTALL_DATA_DIR
+  AF_INSTALL_DOC_DIR
+  AF_INSTALL_EXAMPLE_DIR
+  AF_INSTALL_MAN_DIR
+  AF_INSTALL_CMAKE_DIR)
diff --git a/CMakeModules/AF_vcpkg_options.cmake b/CMakeModules/AF_vcpkg_options.cmake
new file mode 100644
index 0000000000..c84adcee82
--- /dev/null
+++ b/CMakeModules/AF_vcpkg_options.cmake
@@ -0,0 +1,38 @@
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(ENV{VCPKG_FEATURE_FLAGS} "versions")
+set(VCPKG_MANIFEST_NO_DEFAULT_FEATURES ON)
+
+set(VCPKG_OVERLAY_TRIPLETS ${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules/vcpkg/vcpkg-triplets)
+set(VCPKG_OVERLAY_PORTS ${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules/vcpkg/ports)
+
+if(AF_BUILD_CUDA)
+  list(APPEND VCPKG_MANIFEST_FEATURES "cuda")
+endif()
+
+if(AF_BUILD_OPENCL)
+  list(APPEND VCPKG_MANIFEST_FEATURES "opencl")
+endif()
+
+if(AF_BUILD_FORGE)
+  list(APPEND VCPKG_MANIFEST_FEATURES "forge")
+endif()
+
+if(BUILD_TESTING)
+  list(APPEND VCPKG_MANIFEST_FEATURES "tests")
+endif()
+
+if(NOT AF_COMPUTE_LIBRARY STREQUAL "Intel-MKL")
+  list(APPEND VCPKG_MANIFEST_FEATURES "openblasfftw")
+endif()
+
+if(DEFINED VCPKG_ROOT AND NOT DEFINED CMAKE_TOOLCHAIN_FILE)
+  set(CMAKE_TOOLCHAIN_FILE "${VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" CACHE STRING "")
+elseif(DEFINED ENV{VCPKG_ROOT} AND NOT DEFINED CMAKE_TOOLCHAIN_FILE)
+  set(CMAKE_TOOLCHAIN_FILE "$ENV{VCPKG_ROOT}/scripts/buildsystems/vcpkg.cmake" CACHE STRING "")
+endif()
diff --git a/CMakeModules/AFconfigure_deps_vars.cmake b/CMakeModules/AFconfigure_deps_vars.cmake
new file mode 100644
index 0000000000..aac332f5ab
--- /dev/null
+++ b/CMakeModules/AFconfigure_deps_vars.cmake
@@ -0,0 +1,148 @@
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(DOWNLOAD
+  "https://github.com/arrayfire/arrayfire/blob/v3.0.0/CMakeLists.txt"
+  "${ArrayFire_BINARY_DIR}/download_copy_cmakelists.stamp"
+  STATUS af_check_result
+  TIMEOUT 4
+)
+list(GET af_check_result 0 af_is_connected)
+if(${af_is_connected})
+  set(BUILD_OFFLINE ON)
+  # Turn ON disconnected flag when not connected to cloud
+  set(FETCHCONTENT_FULLY_DISCONNECTED ON CACHE BOOL
+      "Disable Download/Update stages of FetchContent workflow" FORCE)
+
+  message(STATUS "No cloud connection. Attempting offline build if dependencies are available")
+else()
+  set(BUILD_OFFLINE OFF)
+  # Turn OFF disconnected flag when connected to cloud
+  # This is required especially in the following scenario:
+  # - cmake run successfully first
+  # - lost connection, but development can still be done
+  # - Now, connection regained. Hence updates should be allowed
+  set(FETCHCONTENT_FULLY_DISCONNECTED OFF CACHE BOOL
+      "Disable Download/Update stages of FetchContent workflow" FORCE)
+endif()
+
+# Track dependency downloads persistently across multiple
+# cmake configure runs. *_POPULATED variables are reset to 0 for each
+# cmake run. Hence, this internal cache value is needed to
+# check for data already populated (by a previous cmake run)
+# during the current cmake run if it loses network connection.
+set(AF_INTERNAL_DOWNLOAD_FLAG OFF CACHE BOOL "Deps Download Flag")
+
+# Override fetch content base dir before including AFfetch_content
+set(FETCHCONTENT_BASE_DIR "${ArrayFire_BINARY_DIR}/extern" CACHE PATH
+    "Base directory where ArrayFire dependencies are downloaded and/or built" FORCE)
+
+include(AFfetch_content)
+
+mark_as_advanced(
+  AF_INTERNAL_DOWNLOAD_FLAG
+  FETCHCONTENT_BASE_DIR
+  FETCHCONTENT_QUIET
+  FETCHCONTENT_FULLY_DISCONNECTED
+  FETCHCONTENT_UPDATES_DISCONNECTED
+)
+
+macro(set_and_mark_depnames_advncd var name)
+  string(TOLOWER ${name} ${var})
+  string(TOUPPER ${name} ${var}_ucname)
+  mark_as_advanced(
+      FETCHCONTENT_SOURCE_DIR_${${var}_ucname}
+      FETCHCONTENT_UPDATES_DISCONNECTED_${${var}_ucname}
+  )
+endmacro()
+
+set_and_mark_depnames_advncd(assets_prefix "af_assets")
+set_and_mark_depnames_advncd(testdata_prefix "af_test_data")
+set_and_mark_depnames_advncd(gtest_prefix "googletest")
+set_and_mark_depnames_advncd(glad_prefix "af_glad")
+set_and_mark_depnames_advncd(forge_prefix "af_forge")
+set_and_mark_depnames_advncd(spdlog_prefix "spdlog")
+set_and_mark_depnames_advncd(threads_prefix "af_threads")
+set_and_mark_depnames_advncd(cub_prefix "nv_cub")
+set_and_mark_depnames_advncd(cl2hpp_prefix "ocl_cl2hpp")
+set_and_mark_depnames_advncd(clblast_prefix "ocl_clblast")
+set_and_mark_depnames_advncd(clfft_prefix "ocl_clfft")
+set_and_mark_depnames_advncd(boost_prefix "boost_compute")
+
+macro(af_dep_check_and_populate dep_prefix)
+  set(single_args URI REF)
+  cmake_parse_arguments(adcp_args "" "${single_args}" "" ${ARGN})
+
+  if("${adcp_args_URI}" STREQUAL "")
+    message(FATAL_ERROR [=[
+        Cannot check requested dependency source's availability.
+        Please provide a valid URI (almost always a URL to a GitHub repo).
+        Note that the above error message is for developers of ArrayFire.
+        ]=])
+  endif()
+
+  string(FIND "${adcp_args_REF}" "=" adcp_has_algo_id)
+
+  if(${BUILD_OFFLINE} AND NOT ${AF_INTERNAL_DOWNLOAD_FLAG})
+    if(NOT ${adcp_has_algo_id} EQUAL -1)
+      FetchContent_Populate(${dep_prefix}
+        QUIET
+        URL            ${adcp_args_URI}
+        URL_HASH       ${adcp_args_REF}
+        DOWNLOAD_COMMAND \"\"
+        UPDATE_DISCONNECTED ON
+        SOURCE_DIR     "${ArrayFire_SOURCE_DIR}/extern/${dep_prefix}-src"
+        BINARY_DIR     "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-build"
+        SUBBUILD_DIR   "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-subbuild"
+      )
+    elseif("${adcp_args_REF}" STREQUAL "")
+      FetchContent_Populate(${dep_prefix}
+        QUIET
+        URL            ${adcp_args_URI}
+        DOWNLOAD_COMMAND \"\"
+        UPDATE_DISCONNECTED ON
+        SOURCE_DIR     "${ArrayFire_SOURCE_DIR}/extern/${dep_prefix}-src"
+        BINARY_DIR     "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-build"
+        SUBBUILD_DIR   "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-subbuild"
+      )
+    else()
+      # The left over alternative is assumed to be a cloud hosted git repository
+      FetchContent_Populate(${dep_prefix}
+        QUIET
+        GIT_REPOSITORY ${adcp_args_URI}
+        GIT_TAG        ${adcp_args_REF}
+        DOWNLOAD_COMMAND \"\"
+        UPDATE_DISCONNECTED ON
+        SOURCE_DIR     "${ArrayFire_SOURCE_DIR}/extern/${dep_prefix}-src"
+        BINARY_DIR     "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-build"
+        SUBBUILD_DIR   "${ArrayFire_BINARY_DIR}/extern/${dep_prefix}-subbuild"
+      )
+    endif()
+  else()
+    if(NOT ${adcp_has_algo_id} EQUAL -1)
+      FetchContent_Declare(${dep_prefix}
+        URL            ${adcp_args_URI}
+        URL_HASH       ${adcp_args_REF}
+      )
+    elseif("${adcp_args_REF}" STREQUAL "")
+      FetchContent_Declare(${dep_prefix}
+        URL            ${adcp_args_URI}
+      )
+    else()
+      # The left over alternative is assumed to be a cloud hosted git repository
+      FetchContent_Declare(${dep_prefix}
+        GIT_REPOSITORY ${adcp_args_URI}
+        GIT_TAG        ${adcp_args_REF}
+      )
+    endif()
+    FetchContent_GetProperties(${dep_prefix})
+    if(NOT ${dep_prefix}_POPULATED)
+      FetchContent_Populate(${dep_prefix})
+    endif()
+    set(AF_INTERNAL_DOWNLOAD_FLAG ON CACHE BOOL "Deps Download Flag" FORCE)
+  endif()
+endmacro()
diff --git a/CMakeModules/AFconfigure_forge_dep.cmake b/CMakeModules/AFconfigure_forge_dep.cmake
new file mode 100644
index 0000000000..8bf27d3a9e
--- /dev/null
+++ b/CMakeModules/AFconfigure_forge_dep.cmake
@@ -0,0 +1,100 @@
+# Copyright (c) 2019, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(FG_VERSION_MAJOR 1)
+set(FG_VERSION_MINOR 0)
+set(FG_VERSION_PATCH 8)
+set(FG_VERSION "${FG_VERSION_MAJOR}.${FG_VERSION_MINOR}.${FG_VERSION_PATCH}")
+set(FG_API_VERSION_CURRENT ${FG_VERSION_MAJOR}${FG_VERSION_MINOR})
+
+
+if(AF_BUILD_FORGE)
+    af_dep_check_and_populate(${forge_prefix}
+        URI https://github.com/arrayfire/forge.git
+        REF "v${FG_VERSION}"
+    )
+
+    set(af_FETCHCONTENT_BASE_DIR ${FETCHCONTENT_BASE_DIR})
+    set(af_FETCHCONTENT_QUIET ${FETCHCONTENT_QUIET})
+    set(af_FETCHCONTENT_FULLY_DISCONNECTED ${FETCHCONTENT_FULLY_DISCONNECTED})
+    set(af_FETCHCONTENT_UPDATES_DISCONNECTED ${FETCHCONTENT_UPDATES_DISCONNECTED})
+
+    set(ArrayFireInstallPrefix ${CMAKE_INSTALL_PREFIX})
+    set(ArrayFireBuildType ${CMAKE_BUILD_TYPE})
+    set(CMAKE_INSTALL_PREFIX ${${forge_prefix}_BINARY_DIR}/extern/forge/package)
+    set(CMAKE_BUILD_TYPE Release)
+    set(FG_BUILD_EXAMPLES OFF CACHE BOOL "Used to build Forge examples")
+    set(FG_BUILD_DOCS OFF CACHE BOOL "Used to build Forge documentation")
+    set(FG_WITH_FREEIMAGE OFF CACHE BOOL "Turn on usage of freeimage dependency")
+
+    add_subdirectory(
+        ${${forge_prefix}_SOURCE_DIR} ${${forge_prefix}_BINARY_DIR} EXCLUDE_FROM_ALL)
+    mark_as_advanced(
+        FG_BUILD_EXAMPLES
+        FG_BUILD_DOCS
+        FG_WITH_FREEIMAGE
+        FG_USE_WINDOW_TOOLKIT
+        FG_RENDERING_BACKEND
+        SPHINX_EXECUTABLE
+        glfw3_DIR
+        glm_DIR
+        )
+    set(CMAKE_BUILD_TYPE ${ArrayFireBuildType})
+    set(CMAKE_INSTALL_PREFIX ${ArrayFireInstallPrefix})
+    set(FETCHCONTENT_BASE_DIR ${af_FETCHCONTENT_BASE_DIR})
+    set(FETCHCONTENT_QUIET ${af_FETCHCONTENT_QUIET})
+    set(FETCHCONTENT_FULLY_DISCONNECTED ${af_FETCHCONTENT_FULLY_DISCONNECTED})
+    set(FETCHCONTENT_UPDATES_DISCONNECTED ${af_FETCHCONTENT_UPDATES_DISCONNECTED})
+    install(FILES
+        $<TARGET_FILE:forge>
+        $<$<PLATFORM_ID:Linux>:$<TARGET_SONAME_FILE:forge>>
+        $<$<PLATFORM_ID:Darwin>:$<TARGET_SONAME_FILE:forge>>
+        $<$<PLATFORM_ID:Linux>:$<TARGET_LINKER_FILE:forge>>
+        $<$<PLATFORM_ID:Darwin>:$<TARGET_LINKER_FILE:forge>>
+        DESTINATION "${AF_INSTALL_LIB_DIR}"
+        COMPONENT common_backend_dependencies)
+
+    if(AF_INSTALL_STANDALONE)
+        cmake_minimum_required(VERSION 3.21)
+        install(FILES
+            $<TARGET_RUNTIME_DLLS:forge>
+            DESTINATION "${AF_INSTALL_LIB_DIR}"
+            COMPONENT common_backend_dependencies)
+    endif(AF_INSTALL_STANDALONE)
+
+    set_property(TARGET forge APPEND_STRING PROPERTY COMPILE_FLAGS " -w")
+else(AF_BUILD_FORGE)
+    find_package(Forge
+        ${FG_VERSION_MAJOR}.${FG_VERSION_MINOR}.${FG_VERSION_PATCH}
+        QUIET
+    )
+
+    if(TARGET Forge::forge)
+        get_target_property(fg_lib_type Forge::forge TYPE)
+        if(NOT ${fg_lib_type} STREQUAL "STATIC_LIBRARY" AND
+           AF_INSTALL_STANDALONE)
+            install(FILES
+                    $<TARGET_FILE:Forge::forge>
+                    $<$<PLATFORM_ID:Linux>:$<TARGET_SONAME_FILE:Forge::forge>>
+                    $<$<PLATFORM_ID:Darwin>:$<TARGET_SONAME_FILE:Forge::forge>>
+                    $<$<PLATFORM_ID:Linux>:$<TARGET_LINKER_FILE:Forge::forge>>
+                    $<$<PLATFORM_ID:Darwin>:$<TARGET_LINKER_FILE:Forge::forge>>
+                    DESTINATION "${AF_INSTALL_LIB_DIR}"
+                    COMPONENT common_backend_dependencies)
+        endif()
+    else()
+        af_dep_check_and_populate(${forge_prefix}
+            URI https://github.com/arrayfire/forge.git
+            REF "v${FG_VERSION}"
+        )
+
+        configure_file(
+            ${${forge_prefix}_SOURCE_DIR}/CMakeModules/version.h.in
+            ${${forge_prefix}_BINARY_DIR}/include/fg/version.h
+        )
+    endif()
+endif(AF_BUILD_FORGE)
diff --git a/CMakeModules/AFcuda_helpers.cmake b/CMakeModules/AFcuda_helpers.cmake
new file mode 100644
index 0000000000..a5d20c4a62
--- /dev/null
+++ b/CMakeModules/AFcuda_helpers.cmake
@@ -0,0 +1,69 @@
+# Copyright (c) 2020, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+find_program(NVPRUNE NAMES nvprune)
+cuda_select_nvcc_arch_flags(cuda_architecture_flags ${CUDA_architecture_build_targets})
+set(cuda_architecture_flags ${cuda_architecture_flags} CACHE INTERNAL "CUDA compute flags" FORCE)
+set(cuda_architecture_flags_readable ${cuda_architecture_flags_readable} CACHE INTERNAL "Readable CUDA compute flags" FORCE)
+
+function(af_detect_and_set_cuda_architectures target)
+  if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
+    string(REGEX REPLACE "sm_([0-9]+)[ ]*" "\\1-real|" cuda_build_targets ${cuda_architecture_flags_readable})
+    string(REGEX REPLACE "compute_([0-9]+)[ ]*" "\\1-virtual|" cuda_build_targets ${cuda_build_targets})
+    string(REPLACE "|" ";" cuda_build_targets ${cuda_build_targets})
+
+    set_target_properties(${target}
+      PROPERTIES
+        CUDA_ARCHITECTURES "${cuda_build_targets}")
+  else()
+    # CMake 3.12 adds deduplication of compile options. This breaks the way the
+    # gencode flags are passed into the compiler. These replace instructions add
+    # the SHELL: prefix to each of the gencode options so that it is not removed
+    # from the command
+    if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.12")
+      string(REPLACE ";" "|" cuda_architecture_flags "${cuda_architecture_flags}")
+      string(REGEX REPLACE "(-gencode)\\|" "SHELL:\\1 " cuda_architecture_flags2 "${cuda_architecture_flags}")
+      string(REPLACE "|" ";" cuda_architecture_flags ${cuda_architecture_flags2})
+    endif()
+    target_compile_options(${target}
+      PRIVATE
+        $<$<COMPILE_LANGUAGE:CUDA>:${cuda_architecture_flags}>)
+  endif()
+endfunction()
+
+# The following function uses a macro defined by
+# the FindCUDA module from CMake.
+function(af_find_static_cuda_libs libname)
+  cmake_parse_arguments(fscl "PRUNE" "" "" ${ARGN})
+
+  set(search_name
+    "${CMAKE_STATIC_LIBRARY_PREFIX}${libname}${CMAKE_STATIC_LIBRARY_SUFFIX}")
+  cuda_find_library_local_first(CUDA_${libname}_LIBRARY
+    ${search_name} "${libname} static library")
+
+  if(fscl_PRUNE AND AF_WITH_PRUNE_STATIC_CUDA_NUMERIC_LIBS)
+    get_filename_component(af_${libname} ${CUDA_${libname}_LIBRARY} NAME)
+
+    set(liboutput ${CMAKE_CURRENT_BINARY_DIR}/${af_${libname}})
+    add_custom_command(OUTPUT ${liboutput}.depend
+      COMMAND ${NVPRUNE} ${cuda_architecture_flags} ${CUDA_${libname}_LIBRARY} -o ${liboutput}
+      COMMAND ${CMAKE_COMMAND} -E touch ${liboutput}.depend
+      BYPRODUCTS ${liboutput}
+      MAIN_DEPENDENCY ${CUDA_${libname}_LIBRARY}
+      COMMENT "Pruning ${CUDA_${libname}_LIBRARY} for ${cuda_build_targets}"
+      VERBATIM)
+    add_custom_target(prune_${libname}
+      DEPENDS ${liboutput}.depend)
+    set(cuda_pruned_library_targets ${cuda_pruned_library_targets};prune_${libname} PARENT_SCOPE)
+
+    set(AF_CUDA_${libname}_LIBRARY "${liboutput}" PARENT_SCOPE)
+  else()
+    set(AF_CUDA_${libname}_LIBRARY ${CUDA_${libname}_LIBRARY} PARENT_SCOPE)
+  endif()
+  mark_as_advanced(CUDA_${libname}_LIBRARY)
+endfunction()
+
diff --git a/CMakeModules/AFfetch_content.cmake b/CMakeModules/AFfetch_content.cmake
new file mode 100644
index 0000000000..98cdf6cb96
--- /dev/null
+++ b/CMakeModules/AFfetch_content.cmake
@@ -0,0 +1,916 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+#[=======================================================================[.rst:
+FetchContent
+------------------
+
+.. only:: html
+
+  .. contents::
+
+Overview
+^^^^^^^^
+
+This module enables populating content at configure time via any method
+supported by the :module:`ExternalProject` module.  Whereas
+:command:`ExternalProject_Add` downloads at build time, the
+``FetchContent`` module makes content available immediately, allowing the
+configure step to use the content in commands like :command:`add_subdirectory`,
+:command:`include` or :command:`file` operations.
+
+Content population details would normally be defined separately from the
+command that performs the actual population.  Projects should also
+check whether the content has already been populated somewhere else in the
+project hierarchy.  Typical usage would look something like this:
+
+.. code-block:: cmake
+
+  FetchContent_Declare(
+    googletest
+    GIT_REPOSITORY https://github.com/google/googletest.git
+    GIT_TAG        release-1.8.0
+  )
+
+  FetchContent_GetProperties(googletest)
+  if(NOT googletest_POPULATED)
+    FetchContent_Populate(googletest)
+    add_subdirectory(${googletest_SOURCE_DIR} ${googletest_BINARY_DIR})
+  endif()
+
+When using the above pattern with a hierarchical project arrangement,
+projects at higher levels in the hierarchy are able to define or override
+the population details of content specified anywhere lower in the project
+hierarchy.  The ability to detect whether content has already been
+populated ensures that even if multiple child projects want certain content
+to be available, the first one to populate it wins.  The other child project
+can simply make use of the already available content instead of repeating
+the population for itself.  See the
+:ref:`Examples <fetch-content-examples>` section which demonstrates
+this scenario.
+
+The ``FetchContent`` module also supports defining and populating
+content in a single call, with no check for whether the content has been
+populated elsewhere in the project already.  This is a more low level
+operation and would not normally be the way the module is used, but it is
+sometimes useful as part of implementing some higher level feature or to
+populate some content in CMake's script mode.
+
+
+Declaring Content Details
+^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. command:: FetchContent_Declare
+
+  .. code-block:: cmake
+
+    FetchContent_Declare(<name> <contentOptions>...)
+
+  The ``FetchContent_Declare()`` function records the options that describe
+  how to populate the specified content, but if such details have already
+  been recorded earlier in this project (regardless of where in the project
+  hierarchy), this and all later calls for the same content ``<name>`` are
+  ignored.  This "first to record, wins" approach is what allows hierarchical
+  projects to have parent projects override content details of child projects.
+
+  The content ``<name>`` can be any string without spaces, but good practice
+  would be to use only letters, numbers and underscores.  The name will be
+  treated case-insensitively and it should be obvious for the content it
+  represents, often being the name of the child project or the value given
+  to its top level :command:`project` command (if it is a CMake project).
+  For well-known public projects, the name should generally be the official
+  name of the project.  Choosing an unusual name makes it unlikely that other
+  projects needing that same content will use the same name, leading to
+  the content being populated multiple times.
+
+  The ``<contentOptions>`` can be any of the download or update/patch options
+  that the :command:`ExternalProject_Add` command understands.  The configure,
+  build, install and test steps are explicitly disabled and therefore options
+  related to them will be ignored.  In most cases, ``<contentOptions>`` will
+  just be a couple of options defining the download method and method-specific
+  details like a commit tag or archive hash.  For example:
+
+  .. code-block:: cmake
+
+    FetchContent_Declare(
+      googletest
+      GIT_REPOSITORY https://github.com/google/googletest.git
+      GIT_TAG        release-1.8.0
+    )
+
+    FetchContent_Declare(
+      myCompanyIcons
+      URL      https://intranet.mycompany.com/assets/iconset_1.12.tar.gz
+      URL_HASH MD5=5588a7b18261c20068beabfb4f530b87
+    )
+
+    FetchContent_Declare(
+      myCompanyCertificates
+      SVN_REPOSITORY svn+ssh://svn.mycompany.com/srv/svn/trunk/certs
+      SVN_REVISION   -r12345
+    )
+
+Populating The Content
+^^^^^^^^^^^^^^^^^^^^^^
+
+.. command:: FetchContent_Populate
+
+  .. code-block:: cmake
+
+    FetchContent_Populate( <name> )
+
+  In most cases, the only argument given to ``FetchContent_Populate()`` is the
+  ``<name>``.  When used this way, the command assumes the content details have
+  been recorded by an earlier call to :command:`FetchContent_Declare`.  The
+  details are stored in a global property, so they are unaffected by things
+  like variable or directory scope.  Therefore, it doesn't matter where in the
+  project the details were previously declared, as long as they have been
+  declared before the call to ``FetchContent_Populate()``.  Those saved details
+  are then used to construct a call to :command:`ExternalProject_Add` in a
+  private sub-build to perform the content population immediately.  The
+  implementation of ``ExternalProject_Add()`` ensures that if the content has
+  already been populated in a previous CMake run, that content will be reused
+  rather than being repopulated.  For the common case where population
+  involves downloading content, the cost of the download is only paid once.
+
+  An internal global property records when a particular content population
+  request has been processed.  If ``FetchContent_Populate()`` is called more
+  than once for the same content name within a configure run, the second call
+  will halt with an error.  Projects can and should check whether content
+  population has already been processed with the
+  :command:`FetchContent_GetProperties` command before calling
+  ``FetchContent_Populate()``.
+
+  ``FetchContent_Populate()`` will set three variables in the scope of the
+  caller; ``<lcName>_POPULATED``, ``<lcName>_SOURCE_DIR`` and
+  ``<lcName>_BINARY_DIR``, where ``<lcName>`` is the lowercased ``<name>``.
+  ``<lcName>_POPULATED`` will always be set to ``True`` by the call.
+  ``<lcName>_SOURCE_DIR`` is the location where the
+  content can be found upon return (it will have already been populated), while
+  ``<lcName>_BINARY_DIR`` is a directory intended for use as a corresponding
+  build directory.  The main use case for the two directory variables is to
+  call :command:`add_subdirectory` immediately after population, i.e.:
+
+  .. code-block:: cmake
+
+    FetchContent_Populate(FooBar ...)
+    add_subdirectory(${foobar_SOURCE_DIR} ${foobar_BINARY_DIR})
+
+  The values of the three variables can also be retrieved from anywhere in the
+  project hierarchy using the :command:`FetchContent_GetProperties` command.
+
+  A number of cache variables influence the behavior of all content population
+  performed using details saved from a :command:`FetchContent_Declare` call:
+
+  ``FETCHCONTENT_BASE_DIR``
+    In most cases, the saved details do not specify any options relating to the
+    directories to use for the internal sub-build, final source and build areas.
+    It is generally best to leave these decisions up to the ``FetchContent``
+    module to handle on the project's behalf.  The ``FETCHCONTENT_BASE_DIR``
+    cache variable controls the point under which all content population
+    directories are collected, but in most cases developers would not need to
+    change this.  The default location is ``${CMAKE_BINARY_DIR}/_deps``, but if
+    developers change this value, they should aim to keep the path short and
+    just below the top level of the build tree to avoid running into path
+    length problems on Windows.
+
+  ``FETCHCONTENT_QUIET``
+    The logging output during population can be quite verbose, making the
+    configure stage quite noisy.  This cache option (``ON`` by default) hides
+    all population output unless an error is encountered.  If experiencing
+    problems with hung downloads, temporarily switching this option off may
+    help diagnose which content population is causing the issue.
+
+  ``FETCHCONTENT_FULLY_DISCONNECTED``
+    When this option is enabled, no attempt is made to download or update
+    any content.  It is assumed that all content has already been populated in
+    a previous run or the source directories have been pointed at existing
+    contents the developer has provided manually (using options described
+    further below).  When the developer knows that no changes have been made to
+    any content details, turning this option ``ON`` can significantly speed up
+    the configure stage.  It is ``OFF`` by default.
+
+  ``FETCHCONTENT_UPDATES_DISCONNECTED``
+    This is a less severe download/update control compared to
+    ``FETCHCONTENT_FULLY_DISCONNECTED``.  Instead of bypassing all download and
+    update logic, ``FETCHCONTENT_UPDATES_DISCONNECTED`` only disables the
+    update stage.  Therefore, if content has not been downloaded previously,
+    it will still be downloaded when this option is enabled.  This can speed up
+    the configure stage, but not as much as
+    ``FETCHCONTENT_FULLY_DISCONNECTED``.  It is ``OFF`` by default.
+
+  In addition to the above cache variables, the following cache variables are
+  also defined for each content name (``<ucName>`` is the uppercased value of
+  ``<name>``):
+
+  ``FETCHCONTENT_SOURCE_DIR_<ucName>``
+    If this is set, no download or update steps are performed for the specified
+    content and the ``<lcName>_SOURCE_DIR`` variable returned to the caller is
+    pointed at this location.  This gives developers a way to have a separate
+    checkout of the content that they can modify freely without interference
+    from the build.  The build simply uses that existing source, but it still
+    defines ``<lcName>_BINARY_DIR`` to point inside its own build area.
+    Developers are strongly encouraged to use this mechanism rather than
+    editing the sources populated in the default location, as changes to
+    sources in the default location can be lost when content population details
+    are changed by the project.
+
+  ``FETCHCONTENT_UPDATES_DISCONNECTED_<ucName>``
+    This is the per-content equivalent of
+    ``FETCHCONTENT_UPDATES_DISCONNECTED``. If the global option or this option
+    is ``ON``, then updates will be disabled for the named content.
+    Disabling updates for individual content can be useful for content whose
+    details rarely change, while still leaving other frequently changing
+    content with updates enabled.
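+
+  As an illustrative sketch, a developer could preset some of these cache
+  variables from an initial-cache script passed to ``cmake -C`` (or with
+  equivalent ``-D`` options).  The content name ``foobar`` and the path below
+  are placeholders, not something defined by this module:
+
+  .. code-block:: cmake
+
+    # Use an existing local checkout of foobar instead of downloading it
+    # (hypothetical path)
+    set(FETCHCONTENT_SOURCE_DIR_FOOBAR "/path/to/foobar" CACHE PATH "")
+
+    # Show population output to help diagnose a stalled download
+    set(FETCHCONTENT_QUIET OFF CACHE BOOL "")
+
+    # Skip the update stage for content that has already been downloaded
+    set(FETCHCONTENT_UPDATES_DISCONNECTED ON CACHE BOOL "")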
+
+
+  The ``FetchContent_Populate()`` command also supports a syntax allowing the
+  content details to be specified directly rather than using any saved
+  details.  This is more low-level and use of this form is generally to be
+  avoided in favour of using saved content details as outlined above.
+  Nevertheless, in certain situations it can be useful to invoke the content
+  population as an isolated operation (typically as part of implementing some
+  other higher level feature or when using CMake in script mode):
+
+  .. code-block:: cmake
+
+    FetchContent_Populate( <name>
+      [QUIET]
+      [SUBBUILD_DIR <subBuildDir>]
+      [SOURCE_DIR <srcDir>]
+      [BINARY_DIR <binDir>]
+      ...
+    )
+
+  This form differs in several key ways from the form where only ``<name>`` is
+  provided:
+
+  - All required population details are assumed to have been provided directly
+    in the call to ``FetchContent_Populate()``. Any saved details for
+    ``<name>`` are ignored.
+  - No check is made for whether content for ``<name>`` has already been
+    populated.
+  - No global property is set to record that the population has occurred.
+  - No global properties record the source or binary directories used for the
+    populated content.
+  - The ``FETCHCONTENT_FULLY_DISCONNECTED`` and
+    ``FETCHCONTENT_UPDATES_DISCONNECTED`` cache variables are ignored.
+
+  The ``<lcName>_SOURCE_DIR`` and ``<lcName>_BINARY_DIR`` variables are still
+  returned to the caller, but since these locations are not stored as global
+  properties when this form is used, they are only available to the calling
+  scope and below rather than the entire project hierarchy.  No
+  ``<lcName>_POPULATED`` variable is set in the caller's scope with this form.
+
+  The supported options for ``FetchContent_Populate()`` are the same as those
+  for :command:`FetchContent_Declare()`.  Those few options shown just
+  above are either specific to ``FetchContent_Populate()`` or their behavior is
+  slightly modified from how :command:`ExternalProject_Add` treats them.
+
+  ``QUIET``
+    The ``QUIET`` option can be given to hide the output associated with
+    populating the specified content.  If the population fails, the output will
+    be shown regardless of whether this option was given or not so that the
+    cause of the failure can be diagnosed.  The global ``FETCHCONTENT_QUIET``
+    cache variable has no effect on ``FetchContent_Populate()`` calls where the
+    content details are provided directly.
+
+  ``SUBBUILD_DIR``
+    The ``SUBBUILD_DIR`` argument can be provided to change the location of the
+    sub-build created to perform the population.  The default value is
+    ``${CMAKE_CURRENT_BINARY_DIR}/<lcName>-subbuild`` and it would be unusual
+    to need to override this default.  If a relative path is specified, it will
+    be interpreted as relative to :variable:`CMAKE_CURRENT_BINARY_DIR`.
+
+  ``SOURCE_DIR``, ``BINARY_DIR``
+    The ``SOURCE_DIR`` and ``BINARY_DIR`` arguments are supported by
+    :command:`ExternalProject_Add`, but different default values are used by
+    ``FetchContent_Populate()``.  ``SOURCE_DIR`` defaults to
+    ``${CMAKE_CURRENT_BINARY_DIR}/<lcName>-src`` and ``BINARY_DIR`` defaults to
+    ``${CMAKE_CURRENT_BINARY_DIR}/<lcName>-build``.  If a relative path is
+    specified, it will be interpreted as relative to
+    :variable:`CMAKE_CURRENT_BINARY_DIR`.
+
+  In addition to the above explicit options, any other unrecognized options are
+  passed through unmodified to :command:`ExternalProject_Add` to perform the
+  download, patch and update steps.  The following options are explicitly
+  prohibited (they are disabled by the ``FetchContent_Populate()`` command):
+
+  - ``CONFIGURE_COMMAND``
+  - ``BUILD_COMMAND``
+  - ``INSTALL_COMMAND``
+  - ``TEST_COMMAND``
+
+  If using ``FetchContent_Populate()`` within CMake's script mode, be aware
+  that the implementation sets up a sub-build which therefore requires a CMake
+  generator and build tool to be available. If these cannot be found by
+  default, then the :variable:`CMAKE_GENERATOR` and/or
+  :variable:`CMAKE_MAKE_PROGRAM` variables will need to be set appropriately
+  on the command line invoking the script.
+
+
+Retrieve Population Properties
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. command:: FetchContent_GetProperties
+
+  When using saved content details, a call to :command:`FetchContent_Populate`
+  records information in global properties which can be queried at any time.
+  This information includes the source and binary directories associated with
+  the content and also whether or not the content population has been processed
+  during the current configure run.
+
+  .. code-block:: cmake
+
+    FetchContent_GetProperties( <name>
+      [SOURCE_DIR <srcDirVar>]
+      [BINARY_DIR <binDirVar>]
+      [POPULATED <doneVar>]
+    )
+
+  The ``SOURCE_DIR``, ``BINARY_DIR`` and ``POPULATED`` options can be used to
+  specify which properties should be retrieved.  Each option accepts a value
+  which is the name of the variable in which to store that property.  Most of
+  the time though, only ``<name>`` is given, in which case the call will then
+  set the same variables as a call to
+  :command:`FetchContent_Populate(name) <FetchContent_Populate>`.  This allows
+  the following canonical pattern to be used, which ensures that the relevant
+  variables will always be defined regardless of whether or not the population
+  has been performed elsewhere in the project already:
+
+  .. code-block:: cmake
+
+    FetchContent_GetProperties(foobar)
+    if(NOT foobar_POPULATED)
+      FetchContent_Populate(foobar)
+
+      # Set any custom variables, etc. here, then
+      # populate the content as part of this build
+
+      add_subdirectory(${foobar_SOURCE_DIR} ${foobar_BINARY_DIR})
+    endif()
+
+  The above pattern allows other parts of the overall project hierarchy to
+  re-use the same content and ensure that it is only populated once.
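+
+  As a small sketch of the explicit-option form (the output variable names
+  ``foobarSrc`` and ``foobarDone`` are arbitrary choices for illustration),
+  individual properties can be retrieved into caller-chosen variables:
+
+  .. code-block:: cmake
+
+    FetchContent_GetProperties(foobar
+      SOURCE_DIR foobarSrc
+      POPULATED  foobarDone
+    )
+    if(foobarDone)
+      # foobarSrc is only set once foobar has been populated
+      message(STATUS "foobar sources are in ${foobarSrc}")
+    endif()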
+
+
+.. _`fetch-content-examples`:
+
+Examples
+^^^^^^^^
+
+Consider a project hierarchy where ``projA`` is the top level project and it
+depends on projects ``projB`` and ``projC``. Both ``projB`` and ``projC``
+can be built standalone and they also both depend on another project
+``projD``.  For simplicity, this example will assume that all four projects
+are available on a company git server.  The ``CMakeLists.txt`` of each project
+might have sections like the following:
+
+*projA*:
+
+.. code-block:: cmake
+
+  include(FetchContent)
+  FetchContent_Declare(
+    projB
+    GIT_REPOSITORY git@mycompany.com/git/projB.git
+    GIT_TAG        4a89dc7e24ff212a7b5167bef7ab079d
+  )
+  FetchContent_Declare(
+    projC
+    GIT_REPOSITORY git@mycompany.com/git/projC.git
+    GIT_TAG        4ad4016bd1d8d5412d135cf8ceea1bb9
+  )
+  FetchContent_Declare(
+    projD
+    GIT_REPOSITORY git@mycompany.com/git/projD.git
+    GIT_TAG        origin/integrationBranch
+  )
+
+  FetchContent_GetProperties(projB)
+  if(NOT projb_POPULATED)
+    FetchContent_Populate(projB)
+    add_subdirectory(${projb_SOURCE_DIR} ${projb_BINARY_DIR})
+  endif()
+
+  FetchContent_GetProperties(projC)
+  if(NOT projc_POPULATED)
+    FetchContent_Populate(projC)
+    add_subdirectory(${projc_SOURCE_DIR} ${projc_BINARY_DIR})
+  endif()
+
+*projB*:
+
+.. code-block:: cmake
+
+  include(FetchContent)
+  FetchContent_Declare(
+    projD
+    GIT_REPOSITORY git@mycompany.com/git/projD.git
+    GIT_TAG        20b415f9034bbd2a2e8216e9a5c9e632
+  )
+
+  FetchContent_GetProperties(projD)
+  if(NOT projd_POPULATED)
+    FetchContent_Populate(projD)
+    add_subdirectory(${projd_SOURCE_DIR} ${projd_BINARY_DIR})
+  endif()
+
+
+*projC*:
+
+.. code-block:: cmake
+
+  include(FetchContent)
+  FetchContent_Declare(
+    projD
+    GIT_REPOSITORY git@mycompany.com/git/projD.git
+    GIT_TAG        7d9a17ad2c962aa13e2fbb8043fb6b8a
+  )
+
+  FetchContent_GetProperties(projD)
+  if(NOT projd_POPULATED)
+    FetchContent_Populate(projD)
+    add_subdirectory(${projd_SOURCE_DIR} ${projd_BINARY_DIR})
+  endif()
+
+A few key points should be noted in the above:
+
+- ``projB`` and ``projC`` define different content details for ``projD``,
+  but ``projA`` also defines a set of content details for ``projD`` and
+  because ``projA`` will define them first, the details from ``projB`` and
+  ``projC`` will not be used.  The override details defined by ``projA``
+  are not required to match either of those from ``projB`` or ``projC``, but
+  it is up to the higher level project to ensure that the details it does
+  define still make sense for the child projects.
+- While ``projA`` defined content details for ``projD``, it did not need
+  to explicitly call ``FetchContent_Populate(projD)`` itself.  Instead, it
+  leaves that to a child project to do (in this case it will be ``projB``
+  since it is added to the build ahead of ``projC``).  If ``projA`` needed to
+  customize how the ``projD`` content was brought into the build as well
+  (e.g. define some CMake variables before calling
+  :command:`add_subdirectory` after populating), it would do the call to
+  ``FetchContent_Populate()``, etc. just as it did for the ``projB`` and
+  ``projC`` content.  For higher level projects, it is usually enough to
+  just define the override content details and leave the actual population
+  to the child projects.  This saves repeating the same thing at each level
+  of the project hierarchy unnecessarily.
+- Even though ``projA`` is the top level project in this example, it still
+  checks whether ``projB`` and ``projC`` have already been populated before
+  going ahead to do those populations.  This makes ``projA`` able to be more
+  easily incorporated as a child of some other higher level project in the
+  future if required.  Always protect a call to
+  :command:`FetchContent_Populate` with a check to
+  :command:`FetchContent_GetProperties`, even in what may be considered a top
+  level project at the time.
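+
+As a sketch of the customization case described in the second point, ``projA``
+could populate ``projD`` itself before adding ``projB`` and ``projC``.  The
+``PROJD_BUILD_TESTS`` option is hypothetical and is only assumed here to be
+something ``projD`` understands:
+
+.. code-block:: cmake
+
+  FetchContent_GetProperties(projD)
+  if(NOT projd_POPULATED)
+    FetchContent_Populate(projD)
+
+    # Hypothetical projD option, set before projD's CMakeLists.txt is read
+    set(PROJD_BUILD_TESTS OFF CACHE BOOL "Build the projD tests")
+
+    add_subdirectory(${projd_SOURCE_DIR} ${projd_BINARY_DIR})
+  endif()
+
+Because ``projD`` is then already populated, the guarded blocks in ``projB``
+and ``projC`` skip their own population and :command:`add_subdirectory` calls.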
+
+
+The following example demonstrates how one might download and unpack a
+firmware tarball using CMake's :manual:`script mode <cmake(1)>`.  The call to
+:command:`FetchContent_Populate` specifies all the content details and the
+unpacked firmware will be placed in a ``firmware`` directory below the
+current working directory.
+
+*getFirmware.cmake*:
+
+.. code-block:: cmake
+
+  # NOTE: Intended to be run in script mode with cmake -P
+  include(FetchContent)
+  FetchContent_Populate(
+    firmware
+    URL        https://mycompany.com/assets/firmware-1.23-arm.tar.gz
+    URL_HASH   MD5=68247684da89b608d466253762b0ff11
+    SOURCE_DIR firmware
+  )
+
+#]=======================================================================]
+
+
+set(__FetchContent_privateDir "${CMAKE_CURRENT_LIST_DIR}/FetchContent")
+
+#=======================================================================
+# Recording and retrieving content details for later population
+#=======================================================================
+
+# Internal use, projects must not call this directly. It is
+# intended for use by FetchContent_Declare() only.
+#
+# Sets a content-specific global property (not meant for use
+# outside of functions defined here in this file) which can later
+# be retrieved using __FetchContent_getSavedDetails() with just the
+# same content name. If there is already a value stored in the
+# property, it is left unchanged and this call has no effect.
+# This allows parent projects to define the content details,
+# overriding anything a child project may try to set (properties
+# are not cached between runs, so the first thing to set it in a
+# build will be in control).
+function(__FetchContent_declareDetails contentName)
+
+  string(TOLOWER ${contentName} contentNameLower)
+  set(propertyName "_FetchContent_${contentNameLower}_savedDetails")
+  get_property(alreadyDefined GLOBAL PROPERTY ${propertyName} DEFINED)
+  if(NOT alreadyDefined)
+    define_property(GLOBAL PROPERTY ${propertyName}
+      BRIEF_DOCS "Internal implementation detail of FetchContent_Populate()"
+      FULL_DOCS  "Details used by FetchContent_Populate() for ${contentName}"
+    )
+    set_property(GLOBAL PROPERTY ${propertyName} ${ARGN})
+  endif()
+
+endfunction()
+
+
+# Internal use, projects must not call this directly. It is
+# intended for use by the FetchContent_Populate() function.
+#
+# Retrieves details saved for the specified content in an
+# earlier call to __FetchContent_declareDetails().
+function(__FetchContent_getSavedDetails contentName outVar)
+
+  string(TOLOWER ${contentName} contentNameLower)
+  set(propertyName "_FetchContent_${contentNameLower}_savedDetails")
+  get_property(alreadyDefined GLOBAL PROPERTY ${propertyName} DEFINED)
+  if(NOT alreadyDefined)
+    message(FATAL_ERROR "No content details recorded for ${contentName}")
+  endif()
+  get_property(propertyValue GLOBAL PROPERTY ${propertyName})
+  set(${outVar} "${propertyValue}" PARENT_SCOPE)
+
+endfunction()
+
+
+# Saves population details of the content, sets defaults for the
+# SOURCE_DIR and BINARY_DIR.
+function(FetchContent_Declare contentName)
+
+  set(options "")
+  set(oneValueArgs SVN_REPOSITORY)
+  set(multiValueArgs "")
+
+  cmake_parse_arguments(ARG "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  unset(srcDirSuffix)
+  unset(svnRepoArgs)
+  if(ARG_SVN_REPOSITORY)
+    # Add a hash of the svn repository URL to the source dir. This works
+    # around the problem where if the URL changes, the download would
+    # fail because it tries to checkout/update rather than switch the
+    # old URL to the new one. We limit the hash to the first 7 characters
+    # so that the source path doesn't get overly long (which can be a
+    # problem on Windows due to path length limits).
+    string(SHA1 urlSHA ${ARG_SVN_REPOSITORY})
+    string(SUBSTRING ${urlSHA} 0 7 urlSHA)
+    set(srcDirSuffix "-${urlSHA}")
+    set(svnRepoArgs  SVN_REPOSITORY ${ARG_SVN_REPOSITORY})
+  endif()
+
+  string(TOLOWER ${contentName} contentNameLower)
+  __FetchContent_declareDetails(
+    ${contentNameLower}
+    SOURCE_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-src${srcDirSuffix}"
+    BINARY_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-build"
+    ${svnRepoArgs}
+    # List these last so they can override things we set above
+    ${ARG_UNPARSED_ARGUMENTS}
+  )
+
+endfunction()
+
+
+#=======================================================================
+# Set/get whether the specified content has been populated yet.
+# The setter also records the source and binary dirs used.
+#=======================================================================
+
+# Internal use, projects must not call this directly. It is
+# intended for use by the FetchContent_Populate() function to
+# record when FetchContent_Populate() is called for a particular
+# content name.
+function(__FetchContent_setPopulated contentName sourceDir binaryDir)
+
+  string(TOLOWER ${contentName} contentNameLower)
+  set(prefix "_FetchContent_${contentNameLower}")
+
+  set(propertyName "${prefix}_sourceDir")
+  define_property(GLOBAL PROPERTY ${propertyName}
+    BRIEF_DOCS "Internal implementation detail of FetchContent_Populate()"
+    FULL_DOCS  "Details used by FetchContent_Populate() for ${contentName}"
+  )
+  set_property(GLOBAL PROPERTY ${propertyName} ${sourceDir})
+
+  set(propertyName "${prefix}_binaryDir")
+  define_property(GLOBAL PROPERTY ${propertyName}
+    BRIEF_DOCS "Internal implementation detail of FetchContent_Populate()"
+    FULL_DOCS  "Details used by FetchContent_Populate() for ${contentName}"
+  )
+  set_property(GLOBAL PROPERTY ${propertyName} ${binaryDir})
+
+  set(propertyName "${prefix}_populated")
+  define_property(GLOBAL PROPERTY ${propertyName}
+    BRIEF_DOCS "Internal implementation detail of FetchContent_Populate()"
+    FULL_DOCS  "Details used by FetchContent_Populate() for ${contentName}"
+  )
+  set_property(GLOBAL PROPERTY ${propertyName} True)
+
+endfunction()
+
+
+# Set variables in the calling scope for any of the retrievable
+# properties. If no specific properties are requested, variables
+# will be set for all retrievable properties.
+#
+# This function is intended to also be used by projects as the canonical
+# way to detect whether they should call FetchContent_Populate()
+# and pull the populated source into the build with add_subdirectory(),
+# if they are using the populated content in that way.
+function(FetchContent_GetProperties contentName)
+
+  string(TOLOWER ${contentName} contentNameLower)
+
+  set(options "")
+  set(oneValueArgs SOURCE_DIR BINARY_DIR POPULATED)
+  set(multiValueArgs "")
+
+  cmake_parse_arguments(ARG "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if(NOT ARG_SOURCE_DIR AND
+     NOT ARG_BINARY_DIR AND
+     NOT ARG_POPULATED)
+    # No specific properties requested, provide them all
+    set(ARG_SOURCE_DIR ${contentNameLower}_SOURCE_DIR)
+    set(ARG_BINARY_DIR ${contentNameLower}_BINARY_DIR)
+    set(ARG_POPULATED  ${contentNameLower}_POPULATED)
+  endif()
+
+  set(prefix "_FetchContent_${contentNameLower}")
+
+  if(ARG_SOURCE_DIR)
+    set(propertyName "${prefix}_sourceDir")
+    get_property(value GLOBAL PROPERTY ${propertyName})
+    if(value)
+      set(${ARG_SOURCE_DIR} ${value} PARENT_SCOPE)
+    endif()
+  endif()
+
+  if(ARG_BINARY_DIR)
+    set(propertyName "${prefix}_binaryDir")
+    get_property(value GLOBAL PROPERTY ${propertyName})
+    if(value)
+      set(${ARG_BINARY_DIR} ${value} PARENT_SCOPE)
+    endif()
+  endif()
+
+  if(ARG_POPULATED)
+    set(propertyName "${prefix}_populated")
+    get_property(value GLOBAL PROPERTY ${propertyName} DEFINED)
+    set(${ARG_POPULATED} ${value} PARENT_SCOPE)
+  endif()
+
+endfunction()
+
+
+#=======================================================================
+# Performing the population
+#=======================================================================
+
+# The value of contentName will always have been lowercased by the caller.
+# All other arguments are assumed to be options that are understood by
+# ExternalProject_Add(), except for QUIET and SUBBUILD_DIR.
+function(__FetchContent_directPopulate contentName)
+
+  set(options
+      QUIET
+  )
+  set(oneValueArgs
+      SUBBUILD_DIR
+      SOURCE_DIR
+      BINARY_DIR
+      # Prevent the following from being passed through
+      CONFIGURE_COMMAND
+      BUILD_COMMAND
+      INSTALL_COMMAND
+      TEST_COMMAND
+  )
+  set(multiValueArgs "")
+
+  cmake_parse_arguments(ARG "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if(NOT ARG_SUBBUILD_DIR)
+    message(FATAL_ERROR "Internal error: SUBBUILD_DIR not set")
+  elseif(NOT IS_ABSOLUTE "${ARG_SUBBUILD_DIR}")
+    set(ARG_SUBBUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/${ARG_SUBBUILD_DIR}")
+  endif()
+
+  if(NOT ARG_SOURCE_DIR)
+    message(FATAL_ERROR "Internal error: SOURCE_DIR not set")
+  elseif(NOT IS_ABSOLUTE "${ARG_SOURCE_DIR}")
+    set(ARG_SOURCE_DIR "${CMAKE_CURRENT_BINARY_DIR}/${ARG_SOURCE_DIR}")
+  endif()
+
+  if(NOT ARG_BINARY_DIR)
+    message(FATAL_ERROR "Internal error: BINARY_DIR not set")
+  elseif(NOT IS_ABSOLUTE "${ARG_BINARY_DIR}")
+    set(ARG_BINARY_DIR "${CMAKE_CURRENT_BINARY_DIR}/${ARG_BINARY_DIR}")
+  endif()
+
+  # Ensure the caller can know where to find the source and build directories
+  # with some convenient variables. Doing this here ensures the caller sees
+  # the correct result in the case where the default values are overridden by
+  # the content details set by the project.
+  set(${contentName}_SOURCE_DIR "${ARG_SOURCE_DIR}" PARENT_SCOPE)
+  set(${contentName}_BINARY_DIR "${ARG_BINARY_DIR}" PARENT_SCOPE)
+
+  # The unparsed arguments may contain spaces, so build up ARG_EXTRA
+  # in such a way that it correctly substitutes into the generated
+  # CMakeLists.txt file with each argument quoted.
+  unset(ARG_EXTRA)
+  foreach(arg IN LISTS ARG_UNPARSED_ARGUMENTS)
+    set(ARG_EXTRA "${ARG_EXTRA} \"${arg}\"")
+  endforeach()
+
+  # Hide output if requested, but save it to a variable in case there's an
+  # error so we can show the output upon failure. When not quiet, don't
+  # capture the output to a variable because the user may want to see the
+  # output as it happens (e.g. progress during long downloads). Combine both
+  # stdout and stderr in the one capture variable so the output stays in order.
+  if (ARG_QUIET)
+    set(outputOptions
+        OUTPUT_VARIABLE capturedOutput
+        ERROR_VARIABLE  capturedOutput
+    )
+  else()
+    set(capturedOutput)
+    set(outputOptions)
+    message(STATUS "Populating ${contentName}")
+  endif()
+
+  if(CMAKE_GENERATOR)
+    set(generatorOpts "-G${CMAKE_GENERATOR}")
+    if(CMAKE_GENERATOR_PLATFORM)
+      list(APPEND generatorOpts "-A${CMAKE_GENERATOR_PLATFORM}")
+    endif()
+    if(CMAKE_GENERATOR_TOOLSET)
+      list(APPEND generatorOpts "-T${CMAKE_GENERATOR_TOOLSET}")
+    endif()
+
+    if(CMAKE_MAKE_PROGRAM)
+      list(APPEND generatorOpts "-DCMAKE_MAKE_PROGRAM:FILEPATH=${CMAKE_MAKE_PROGRAM}")
+    endif()
+
+  else()
+    # Likely we've been invoked via CMake's script mode where no
+    # generator is set (and hence CMAKE_MAKE_PROGRAM could not be
+    # trusted even if provided). We will have to rely on being
+    # able to find the default generator and build tool.
+    unset(generatorOpts)
+  endif()
+
+  # Create and build a separate CMake project to carry out the population.
+  # If we've already previously done these steps, they will not cause
+  # anything to be updated, so extra rebuilds of the project won't occur.
+  # Make sure to pass through CMAKE_MAKE_PROGRAM in case the main project
+  # has this set to something not findable on the PATH.
+  configure_file("${__FetchContent_privateDir}/CMakeLists.cmake.in"
+                 "${ARG_SUBBUILD_DIR}/CMakeLists.txt")
+  execute_process(
+    COMMAND ${CMAKE_COMMAND} ${generatorOpts} .
+    RESULT_VARIABLE result
+    ${outputOptions}
+    WORKING_DIRECTORY "${ARG_SUBBUILD_DIR}"
+  )
+  if(result)
+    if(capturedOutput)
+      message("${capturedOutput}")
+    endif()
+    message(FATAL_ERROR "CMake step for ${contentName} failed: ${result}")
+  endif()
+  execute_process(
+    COMMAND ${CMAKE_COMMAND} --build .
+    RESULT_VARIABLE result
+    ${outputOptions}
+    WORKING_DIRECTORY "${ARG_SUBBUILD_DIR}"
+  )
+  if(result)
+    if(capturedOutput)
+      message("${capturedOutput}")
+    endif()
+    message(FATAL_ERROR "Build step for ${contentName} failed: ${result}")
+  endif()
+
+endfunction()
+
+
+option(FETCHCONTENT_FULLY_DISCONNECTED   "Disables all attempts to download or update content and assumes source dirs already exist")
+option(FETCHCONTENT_UPDATES_DISCONNECTED "Enables UPDATE_DISCONNECTED behavior for all content population")
+option(FETCHCONTENT_QUIET                "Enables QUIET option for all content population" ON)
+set(FETCHCONTENT_BASE_DIR "${CMAKE_BINARY_DIR}/_deps" CACHE PATH "Directory under which to collect all populated content")
+
+# Populate the specified content using details stored from
+# an earlier call to FetchContent_Declare().
+function(FetchContent_Populate contentName)
+
+  if(NOT contentName)
+    message(FATAL_ERROR "Empty contentName not allowed for FetchContent_Populate()")
+  endif()
+
+  string(TOLOWER ${contentName} contentNameLower)
+
+  if(ARGN)
+    # This is the direct population form with details fully specified
+    # as part of the call, so we already have everything we need
+    __FetchContent_directPopulate(
+      ${contentNameLower}
+      SUBBUILD_DIR "${CMAKE_CURRENT_BINARY_DIR}/${contentNameLower}-subbuild"
+      SOURCE_DIR   "${CMAKE_CURRENT_BINARY_DIR}/${contentNameLower}-src"
+      BINARY_DIR   "${CMAKE_CURRENT_BINARY_DIR}/${contentNameLower}-build"
+      ${ARGN}  # Could override any of the above ..._DIR variables
+    )
+
+    # Pass source and binary dir variables back to the caller
+    set(${contentNameLower}_SOURCE_DIR "${${contentNameLower}_SOURCE_DIR}" PARENT_SCOPE)
+    set(${contentNameLower}_BINARY_DIR "${${contentNameLower}_BINARY_DIR}" PARENT_SCOPE)
+
+    # Don't set global properties, or record that we did this population, since
+    # this was a direct call outside of the normal declared details form.
+    # We only want to save values in the global properties for content that
+    # honours the hierarchical details mechanism so that projects are not
+    # robbed of the ability to override details set in nested projects.
+    return()
+  endif()
+
+  # No details provided, so assume they were saved from an earlier call
+  # to FetchContent_Declare(). Do a check that we haven't already
+  # populated this content before in case the caller forgot to check.
+  FetchContent_GetProperties(${contentName})
+  if(${contentNameLower}_POPULATED)
+    message(FATAL_ERROR "Content ${contentName} already populated in ${${contentNameLower}_SOURCE_DIR}")
+  endif()
+
+  string(TOUPPER ${contentName} contentNameUpper)
+  set(FETCHCONTENT_SOURCE_DIR_${contentNameUpper}
+      "${FETCHCONTENT_SOURCE_DIR_${contentNameUpper}}"
+      CACHE PATH "When not empty, overrides where to find pre-populated content for ${contentName}")
+
+  if(FETCHCONTENT_SOURCE_DIR_${contentNameUpper})
+    # The source directory has been explicitly provided in the cache,
+    # so no population is required
+    set(${contentNameLower}_SOURCE_DIR "${FETCHCONTENT_SOURCE_DIR_${contentNameUpper}}")
+    set(${contentNameLower}_BINARY_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-build")
+
+  elseif(FETCHCONTENT_FULLY_DISCONNECTED)
+    # Bypass population and assume source is already there from a previous run
+    set(${contentNameLower}_SOURCE_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-src")
+    set(${contentNameLower}_BINARY_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-build")
+
+  else()
+    # Support both a global "disconnect all updates" and a per-content
+    # update test (either one being set disables updates for this content).
+    option(FETCHCONTENT_UPDATES_DISCONNECTED_${contentNameUpper}
+           "Enables UPDATE_DISCONNECTED behavior just for population of ${contentName}")
+    if(FETCHCONTENT_UPDATES_DISCONNECTED OR
+       FETCHCONTENT_UPDATES_DISCONNECTED_${contentNameUpper})
+      set(disconnectUpdates True)
+    else()
+      set(disconnectUpdates False)
+    endif()
+
+    if(FETCHCONTENT_QUIET)
+      set(quietFlag QUIET)
+    else()
+      unset(quietFlag)
+    endif()
+
+    __FetchContent_getSavedDetails(${contentName} contentDetails)
+    if("${contentDetails}" STREQUAL "")
+      message(FATAL_ERROR "No details have been set for content: ${contentName}")
+    endif()
+
+    __FetchContent_directPopulate(
+      ${contentNameLower}
+      ${quietFlag}
+      UPDATE_DISCONNECTED ${disconnectUpdates}
+      SUBBUILD_DIR "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-subbuild"
+      SOURCE_DIR   "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-src"
+      BINARY_DIR   "${FETCHCONTENT_BASE_DIR}/${contentNameLower}-build"
+      # Put the saved details last so they can override any of the
+      # options we set above (this can include SOURCE_DIR or
+      # BINARY_DIR)
+      ${contentDetails}
+    )
+  endif()
+
+  __FetchContent_setPopulated(
+    ${contentName}
+    ${${contentNameLower}_SOURCE_DIR}
+    ${${contentNameLower}_BINARY_DIR}
+  )
+
+  # Pass variables back to the caller. The variables passed back here
+  # must match what FetchContent_GetProperties() sets when it is called
+  # with just the content name.
+  set(${contentNameLower}_SOURCE_DIR "${${contentNameLower}_SOURCE_DIR}" PARENT_SCOPE)
+  set(${contentNameLower}_BINARY_DIR "${${contentNameLower}_BINARY_DIR}" PARENT_SCOPE)
+  set(${contentNameLower}_POPULATED  True PARENT_SCOPE)
+
+endfunction()
diff --git a/CMakeModules/ArrayFireConfig.cmake.in b/CMakeModules/ArrayFireConfig.cmake.in
new file mode 100644
index 0000000000..c258d19ed3
--- /dev/null
+++ b/CMakeModules/ArrayFireConfig.cmake.in
@@ -0,0 +1,150 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+# ArrayFire
+# ---------
+#
+# IMPORTED Targets
+# ^^^^^^^^^^^^^^^^
+#
+# This is the configuration file for the ArrayFire Library. It provides the
+# following :prop_tgt:`IMPORTED` targets:
+#
+# ``ArrayFire::af``
+#   Target for the ArrayFire Unified backend.
+# ``ArrayFire::afcpu``
+#   Target for the ArrayFire CPU backend.
+# ``ArrayFire::afcuda``
+#   Target for the ArrayFire CUDA backend.
+# ``ArrayFire::afoneapi``
+#   Target for the ArrayFire oneAPI backend.
+# ``ArrayFire::afopencl``
+#   Target for the ArrayFire OpenCL backend.
+#
+# These targets can be used to link with your application using the
+# ``target_link_libraries`` command. Here is an example of how to use these
+# targets in your application:
+#
+#   add_executable(mybinary source.cpp)
+#   target_link_libraries(mybinary PRIVATE ArrayFire::afopencl)
+#
+# This example creates a mybinary executable from the source.cpp file and links
+# against the OpenCL backend of the ArrayFire library. Note that you do *not*
+# need to set the include directories; they are propagated by the target.
+#
+# This is the recommended way of linking against ArrayFire.
+#
+# Legacy Variables
+# ^^^^^^^^^^^^^^^^
+#
+# Additionally, this config file creates the following variables for backward
+# compatibility with legacy cmake files:
+#
+# ``ArrayFire_INCLUDE_DIRS``
+#   Path to ArrayFire's include directory.
+# ``ArrayFire_LIBRARIES``
+#   ArrayFire's libraries. This will default to a GPU backend if one
+#   is found.
+# ``ArrayFire_FOUND``
+#   True if ArrayFire has been located.
+#
+# ``ArrayFire_CPU_FOUND``
+#   True if the ArrayFire CPU library has been found.
+# ``ArrayFire_CPU_LIBRARIES``
+#   Location of ArrayFire's CPU library, if found.
+#
+# ``ArrayFire_CUDA_FOUND``
+#   True if the ArrayFire CUDA library has been found.
+# ``ArrayFire_CUDA_LIBRARIES``
+#   Location of ArrayFire's CUDA library, if found.
+#
+# ``ArrayFire_oneAPI_FOUND``
+#   True if the ArrayFire oneAPI library has been found.
+# ``ArrayFire_oneAPI_LIBRARIES``
+#   Location of ArrayFire's oneAPI library, if found.
+#
+# ``ArrayFire_OpenCL_FOUND``
+#   True if the ArrayFire OpenCL library has been found.
+# ``ArrayFire_OpenCL_LIBRARIES``
+#   Location of ArrayFire's OpenCL library, if found.
+#
+# ``ArrayFire_Unified_FOUND``
+#   True if the ArrayFire Unified library has been found.
+# ``ArrayFire_Unified_LIBRARIES``
+#   Location of ArrayFire's Unified library, if found.
+#
+# It is recommended you use imported targets instead of these variables.
+#
+# You may provide a hint to where ArrayFire's root directory may be located
+# by setting ArrayFire_DIR. You should not need to set this if you installed
+# ArrayFire using the official installers or a package manager (if you do,
+# please submit a bug report). If CMake is unable to locate ArrayFire, set
+# ArrayFire_DIR to the directory containing this file.
+#
+# If you are trying to link against a source build then this should be set to
+# the build directory.
+
+@PACKAGE_INIT@
+
+set_and_check(ArrayFire_INCLUDE_DIRS @PACKAGE_INCLUDE_DIRS@)
+
+foreach(backend Unified CPU oneAPI OpenCL CUDA)
+  if(backend STREQUAL "Unified")
+    set(lowerbackend "")
+  else()
+    string(TOLOWER "${backend}" lowerbackend)
+  endif()
+
+  string(TOUPPER "${backend}" upperbackend)
+  if(NOT TARGET ArrayFire::af${lowerbackend} AND NOT TARGET af${lowerbackend})
+    # Either we are not in the ArrayFire project or the target was not built
+    if(EXISTS @PACKAGE_CMAKE_DIR@/ArrayFire${backend}Targets.cmake)
+      include(@PACKAGE_CMAKE_DIR@/ArrayFire${backend}Targets.cmake)
+    endif()
+  endif()
+  if(TARGET ArrayFire::af${lowerbackend})
+    get_property(all_config TARGET ArrayFire::af${lowerbackend} PROPERTY IMPORTED_CONFIGURATIONS)
+    if(NOT all_config)
+      set(all_config "NOCONFIG")
+    endif()
+    foreach(config IN LISTS all_config)
+      get_property(loc TARGET ArrayFire::af${lowerbackend} PROPERTY IMPORTED_LOCATION_${config})
+
+      # Break if any of the imported configurations exist. All configs write to the same
+      # location so they are not working as CMake intended. It's fine for single-config
+      # installers like ours.
+      if(EXISTS "${loc}")
+        set(ArrayFire_${backend}_BINARY_EXISTS TRUE)
+        break()
+      endif()
+    endforeach()
+  endif()
+
+  if((TARGET ArrayFire::af${lowerbackend} AND ArrayFire_${backend}_BINARY_EXISTS) OR TARGET af${lowerbackend})
+      set(ArrayFire_${backend}_FOUND ON)
+      set(ArrayFire_${backend}_LIBRARIES ArrayFire::af${lowerbackend})
+      set(ArrayFire_LIBRARIES ArrayFire::af${lowerbackend})
+  else()
+      set(ArrayFire_${backend}_FOUND OFF)
+  endif()
+
+  # If this file is used from within the ArrayFire project, only report a
+  # backend as found when it is selected to be built, even if its binary
+  # already exists.
+  if(DEFINED AF_BUILD_${upperbackend} AND NOT AF_BUILD_${upperbackend})
+    set(ArrayFire_${backend}_FOUND OFF)
+  endif()
+endforeach()
+
+foreach(_comp ${ArrayFire_FIND_COMPONENTS})
+  if (NOT ArrayFire_${_comp}_FOUND)
+    set(ArrayFire_FOUND False)
+    set(ArrayFire_NOT_FOUND_MESSAGE "Required ArrayFire component ${_comp} not found")
+  endif()
+endforeach()
+
+check_required_components(CPU oneAPI OpenCL CUDA Unified)
diff --git a/CMakeModules/CLKernelToH.cmake b/CMakeModules/CLKernelToH.cmake
deleted file mode 100644
index c4df5d5b3e..0000000000
--- a/CMakeModules/CLKernelToH.cmake
+++ /dev/null
@@ -1,63 +0,0 @@
-# Function to turn an OpenCL source file into a C string within a source file.
-# xxd uses its input's filename to name the string and its length, so we
-# need to move them to a name that depends only on the path output, not its
-# input.  Otherwise, builds in different relative locations would put the
-# source into different variable names, and everything would fall over.
-# The actual name will be filename (.s replaced with underscores), and length
-# name_len.
-#
-# Usage example:
-#
-# set(KERNELS a.cl b/c.cl)
-# resource_to_cxx_source(
-#   SOURCES ${KERNELS}
-#   VARNAME OUTPUTS
-# )
-# add_executable(foo ${OUTPUTS})
-#
-# The namespace they are placed in is taken from filename.namespace.
-#
-# For example, if the input file is kernel.cl, the two variables will be
-#  unsigned char ns::kernel_cl[];
-#  unsigned int ns::kernel_cl_len;
-#
-# where ns is the contents of kernel.cl.namespace.
-
-include(CMakeParseArguments)
-
-set(BIN2CPP_PROGRAM "bin2cpp")
-
-function(CL_KERNEL_TO_H)
-    cmake_parse_arguments(RTCS "" "VARNAME;EXTENSION;OUTPUT_DIR;TARGETS;NAMESPACE;EOF" "SOURCES" ${ARGN})
-
-    set(_output_files "")
-    foreach(_input_file ${RTCS_SOURCES})
-        get_filename_component(_path "${_input_file}" PATH)
-        get_filename_component(_name "${_input_file}" NAME)
-        get_filename_component(var_name "${_input_file}" NAME)
-        get_filename_component(_name_we "${_input_file}" NAME_WE)
-
-        set(_namespace "${RTCS_NAMESPACE}")
-        string(REPLACE "." "_" var_name ${var_name})
-
-        set(_output_path "${CMAKE_CURRENT_BINARY_DIR}/${RTCS_OUTPUT_DIR}")
-        set(_output_file "${_output_path}/${_name_we}.${RTCS_EXTENSION}")
-
-        ADD_CUSTOM_COMMAND(
-            OUTPUT ${_output_file}
-            DEPENDS ${_input_file} ${BIN2CPP_PROGRAM}
-            COMMAND ${CMAKE_COMMAND} -E make_directory "${_output_path}"
-            COMMAND ${CMAKE_COMMAND} -E echo "\\#include \\<${_path}/${_name_we}.hpp\\>"  >>"${_output_file}"
-            COMMAND ${BIN2CPP_PROGRAM} --file ${_name} --namespace ${_namespace} --output ${_output_file} --name ${var_name} --eof ${RTCS_EOF}
-            WORKING_DIRECTORY "${_path}"
-            COMMENT "Compiling ${_input_file} to C++ source"
-        )
-
-
-        list(APPEND _output_files ${_output_file})
-    endforeach()
-    ADD_CUSTOM_TARGET(${RTCS_NAMESPACE}_bin_target DEPENDS ${_output_files})
-
-    set("${RTCS_VARNAME}" ${_output_files} PARENT_SCOPE)
-    set("${RTCS_TARGETS}" ${RTCS_NAMESPACE}_bin_target PARENT_SCOPE)
-endfunction(CL_KERNEL_TO_H)
diff --git a/CMakeModules/CMakeCompilerABI.h b/CMakeModules/CMakeCompilerABI.h
new file mode 100644
index 0000000000..c5ce4dd9ab
--- /dev/null
+++ b/CMakeModules/CMakeCompilerABI.h
@@ -0,0 +1,45 @@
+
+/* Size of a pointer-to-data in bytes.  */
+#define SIZEOF_DPTR (sizeof(void*))
+const char info_sizeof_dptr[] = {
+  /* clang-format off */
+  'I', 'N', 'F', 'O', ':', 's', 'i', 'z', 'e', 'o', 'f', '_', 'd', 'p', 't',
+  'r', '[', ('0' + ((SIZEOF_DPTR / 10) % 10)), ('0' + (SIZEOF_DPTR % 10)), ']',
+  '\0'
+  /* clang-format on */
+};
+
+/* Byte order.  Only one of these will have bytes in the right order.  */
+static unsigned short const info_byte_order_big_endian[] = {
+  /* INFO:byte_order string for BIG_ENDIAN */
+  0x494E, 0x464F, 0x3A62, 0x7974, 0x655F, 0x6F72, 0x6465, 0x725B,
+  0x4249, 0x475F, 0x454E, 0x4449, 0x414E, 0x5D00, 0x0000
+};
+static unsigned short const info_byte_order_little_endian[] = {
+  /* INFO:byte_order string for LITTLE_ENDIAN */
+  0x4E49, 0x4F46, 0x623A, 0x7479, 0x5F65, 0x726F, 0x6564, 0x5B72,
+  0x494C, 0x5454, 0x454C, 0x455F, 0x444E, 0x4149, 0x5D4E, 0x0000
+};
+
+/* Application Binary Interface.  */
+
+/* Check for (some) ARM ABIs.
+ * See e.g. http://wiki.debian.org/ArmEabiPort for some information on this. */
+#if defined(__GNU__) && defined(__ELF__) && defined(__ARM_EABI__)
+#  define ABI_ID "ELF ARMEABI"
+#elif defined(__GNU__) && defined(__ELF__) && defined(__ARMEB__)
+#  define ABI_ID "ELF ARM"
+#elif defined(__GNU__) && defined(__ELF__) && defined(__ARMEL__)
+#  define ABI_ID "ELF ARM"
+
+#elif defined(__linux__) && defined(__ELF__) && defined(__amd64__) &&         \
+  defined(__ILP32__)
+#  define ABI_ID "ELF X32"
+
+#elif defined(__ELF__)
+#  define ABI_ID "ELF"
+#endif
+
+#if defined(ABI_ID)
+static char const info_abi[] = "INFO:abi[" ABI_ID "]";
+#endif
diff --git a/CMakeModules/CMakeDetermineSYCLCompiler.cmake b/CMakeModules/CMakeDetermineSYCLCompiler.cmake
new file mode 100644
index 0000000000..669e8a79e3
--- /dev/null
+++ b/CMakeModules/CMakeDetermineSYCLCompiler.cmake
@@ -0,0 +1,239 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+
+# determine the compiler to use for SYCL programs
+# NOTE, a generator may set CMAKE_SYCL_COMPILER before
+# loading this file to force a compiler.
+# use environment variable SYCL first if defined by user, next use
+# the cmake variable CMAKE_GENERATOR_SYCL which can be defined by a generator
+# as a default compiler
+# If the internal cmake variable _CMAKE_TOOLCHAIN_PREFIX is set, this is used
+# as prefix for the tools (e.g. arm-elf-g++, arm-elf-ar etc.)
+#
+# Sets the following variables:
+#   CMAKE_SYCL_COMPILER
+#   CMAKE_COMPILER_IS_GNUSYCL
+#   CMAKE_AR
+#   CMAKE_RANLIB
+#
+# If not already set before, it also sets
+#   _CMAKE_TOOLCHAIN_PREFIX
+
+#list(APPEND CMAKE_MODULE_PATH ${CMAKE_ROOT})
+include(CMakeDetermineCompiler)
+
+# Load system-specific compiler preferences for this language.
+#include(Platform/${CMAKE_SYSTEM_NAME}-Determine-SYCL OPTIONAL)
+#include(Platform/${CMAKE_SYSTEM_NAME}-SYCL OPTIONAL)
+if(NOT CMAKE_SYCL_COMPILER_NAMES)
+  set(CMAKE_SYCL_COMPILER_NAMES icpx)
+endif()
+
+if("${CMAKE_GENERATOR}" MATCHES "Visual Studio")
+elseif("${CMAKE_GENERATOR}" MATCHES "Green Hills MULTI")
+elseif("${CMAKE_GENERATOR}" MATCHES "Xcode")
+  set(CMAKE_SYCL_COMPILER_XCODE_TYPE sourcecode.cpp.cpp)
+  _cmake_find_compiler_path(SYCL)
+else()
+  if(NOT CMAKE_SYCL_COMPILER)
+    set(CMAKE_SYCL_COMPILER_INIT NOTFOUND)
+
+    # prefer the environment variable SYCL
+    if(NOT $ENV{SYCL} STREQUAL "")
+      get_filename_component(CMAKE_SYCL_COMPILER_INIT $ENV{SYCL} PROGRAM PROGRAM_ARGS CMAKE_SYCL_FLAGS_ENV_INIT)
+      if(CMAKE_SYCL_FLAGS_ENV_INIT)
+        set(CMAKE_SYCL_COMPILER_ARG1 "${CMAKE_SYCL_FLAGS_ENV_INIT}" CACHE STRING "Arguments to SYCL compiler")
+      endif()
+      if(NOT EXISTS ${CMAKE_SYCL_COMPILER_INIT})
+        message(FATAL_ERROR "Could not find compiler set in environment variable SYCL:\n$ENV{SYCL}.\n${CMAKE_SYCL_COMPILER_INIT}")
+      endif()
+    endif()
+
+    # next prefer the generator specified compiler
+    if(CMAKE_GENERATOR_SYCL)
+      if(NOT CMAKE_SYCL_COMPILER_INIT)
+        set(CMAKE_SYCL_COMPILER_INIT ${CMAKE_GENERATOR_SYCL})
+      endif()
+    endif()
+
+    # finally list compilers to try
+    if(NOT CMAKE_SYCL_COMPILER_INIT)
+      set(CMAKE_SYCL_COMPILER_LIST icpx icx)
+      if(NOT CMAKE_HOST_WIN32)
+        # FIXME(#24314): Add support for the GNU-like icpx compiler driver
+        # on Windows, first introduced by Intel oneAPI 2023.0.
+        list(APPEND CMAKE_SYCL_COMPILER_LIST icpx)
+      endif()
+    endif()
+
+    _cmake_find_compiler(SYCL)
+  else()
+    _cmake_find_compiler_path(SYCL)
+  endif()
+  mark_as_advanced(CMAKE_SYCL_COMPILER)
+
+  # Each entry in this list is a set of extra flags to try
+  # adding to the compile line to see if it helps produce
+  # a valid identification file.
+  set(CMAKE_SYCL_COMPILER_ID_TEST_FLAGS_FIRST)
+  set(CMAKE_SYCL_COMPILER_ID_TEST_FLAGS
+    "-fsycl"
+    # Try compiling to an object file only.
+    "-c"
+    # IAR does not detect language automatically
+    "--c++"
+    "--ec++"
+
+    # ARMClang need target options
+    "--target=arm-arm-none-eabi -mcpu=cortex-m3"
+
+    # MSVC needs at least one include directory for __has_include to function,
+    # but custom toolchains may run MSVC with no INCLUDE env var and no -I flags.
+    # Also avoid linking so this works with no LIB env var.
+    "-c -I__does_not_exist__"
+    )
+endif()
+
+if(CMAKE_SYCL_COMPILER_TARGET)
+  set(CMAKE_SYCL_COMPILER_ID_TEST_FLAGS_FIRST "-c --target=${CMAKE_SYCL_COMPILER_TARGET}")
+endif()
+
+# Build a small source file to identify the compiler.
+if(NOT CMAKE_SYCL_COMPILER_ID_RUN)
+  set(CMAKE_SYCL_COMPILER_ID_RUN 1)
+
+  # Try to identify the compiler.
+  set(CMAKE_SYCL_COMPILER_ID)
+  set(CMAKE_SYCL_PLATFORM_ID)
+  file(READ ${CMAKE_ROOT}/Modules/CMakePlatformId.h.in
+    CMAKE_SYCL_COMPILER_ID_PLATFORM_CONTENT)
+
+  # The IAR compiler produces weird output.
+  # See https://gitlab.kitware.com/cmake/cmake/-/issues/10176#note_153591
+  list(APPEND CMAKE_SYCL_COMPILER_ID_VENDORS IAR)
+  set(CMAKE_SYCL_COMPILER_ID_VENDOR_FLAGS_IAR )
+  set(CMAKE_SYCL_COMPILER_ID_VENDOR_REGEX_IAR "IAR .+ Compiler")
+
+  # Match the link line from xcodebuild output of the form
+  #  Ld ...
+  #      ...
+  #      /path/to/cc ...CompilerIdSYCL/...
+  # to extract the compiler front-end for the language.
+  set(CMAKE_SYCL_COMPILER_ID_TOOL_MATCH_REGEX "\nLd[^\n]*(\n[ \t]+[^\n]*)*\n[ \t]+([^ \t\r\n]+)[^\r\n]*-o[^\r\n]*CompilerIdSYCL/(\\./)?(CompilerIdSYCL.(framework|xctest|build/[^ \t\r\n]+)/)?CompilerIdSYCL[ \t\n\\\"]")
+  set(CMAKE_SYCL_COMPILER_ID_TOOL_MATCH_INDEX 2)
+
+  include(${CMAKE_ROOT}/Modules/CMakeDetermineCompilerId.cmake)
+  set(SYCLFLAGS "-fsycl -Werror")
+  CMAKE_DETERMINE_COMPILER_ID(SYCL SYCLFLAGS CMakeSYCLCompilerId.cpp)
+
+  _cmake_find_compiler_sysroot(SYCL)
+
+  # Set old compiler and platform id variables.
+  if(CMAKE_SYCL_COMPILER_ID STREQUAL "GNU")
+    set(CMAKE_COMPILER_IS_GNUSYCL 1)
+  endif()
+else()
+  if(NOT DEFINED CMAKE_SYCL_COMPILER_FRONTEND_VARIANT)
+    # Some toolchain files set our internal CMAKE_SYCL_COMPILER_ID_RUN
+    # variable but are not aware of CMAKE_SYCL_COMPILER_FRONTEND_VARIANT.
+    # They pre-date our support for the GNU-like variant targeting the
+    # MSVC ABI so we do not consider that here.
+    if(CMAKE_SYCL_COMPILER_ID STREQUAL "Clang"
+      OR "x${CMAKE_SYCL_COMPILER_ID}" STREQUAL "xIntelLLVM")
+      if("x${CMAKE_SYCL_SIMULATE_ID}" STREQUAL "xMSVC")
+        set(CMAKE_SYCL_COMPILER_FRONTEND_VARIANT "MSVC")
+      else()
+        set(CMAKE_SYCL_COMPILER_FRONTEND_VARIANT "GNU")
+      endif()
+    else()
+      set(CMAKE_SYCL_COMPILER_FRONTEND_VARIANT "")
+    endif()
+  endif()
+endif()
+
+if (NOT _CMAKE_TOOLCHAIN_LOCATION)
+  get_filename_component(_CMAKE_TOOLCHAIN_LOCATION "${CMAKE_SYCL_COMPILER}" PATH)
+endif ()
+
+# if we have a g++ cross compiler, they have usually some prefix, like
+# e.g. powerpc-linux-g++, arm-elf-g++ or i586-mingw32msvc-g++ , optionally
+# with a 3-component version number at the end (e.g. arm-eabi-gcc-4.5.2).
+# The other tools of the toolchain usually have the same prefix
+# NAME_WE cannot be used since then this test will fail for names like
+# "arm-unknown-nto-qnx6.3.0-gcc.exe", where BASENAME would be
+# "arm-unknown-nto-qnx6" instead of the correct "arm-unknown-nto-qnx6.3.0-"
+
+
+if (NOT _CMAKE_TOOLCHAIN_PREFIX)
+
+  if("${CMAKE_SYCL_COMPILER_ID}" MATCHES "GNU|Clang|QCC|LCC")
+    get_filename_component(COMPILER_BASENAME "${CMAKE_SYCL_COMPILER}" NAME)
+    if (COMPILER_BASENAME MATCHES "^(.+-)?(clang\\+\\+|[gc]\\+\\+|clang-cl)(-[0-9]+(\\.[0-9]+)*)?(-[^.]+)?(\\.exe)?$")
+      set(_CMAKE_TOOLCHAIN_PREFIX ${CMAKE_MATCH_1})
+      set(_CMAKE_TOOLCHAIN_SUFFIX ${CMAKE_MATCH_3})
+      set(_CMAKE_COMPILER_SUFFIX ${CMAKE_MATCH_5})
+    elseif("${CMAKE_SYCL_COMPILER_ID}" MATCHES "Clang")
+      if(CMAKE_SYCL_COMPILER_TARGET)
+        set(_CMAKE_TOOLCHAIN_PREFIX ${CMAKE_SYCL_COMPILER_TARGET}-)
+      endif()
+    elseif(COMPILER_BASENAME MATCHES "QCC(\\.exe)?$")
+      if(CMAKE_SYCL_COMPILER_TARGET MATCHES "gcc_nto([a-z0-9]+_[0-9]+|[^_le]+)(le)")
+        set(_CMAKE_TOOLCHAIN_PREFIX nto${CMAKE_MATCH_1}-)
+      endif()
+    endif ()
+
+    # if "llvm-" is part of the prefix, remove it, since llvm doesn't have its own binutils
+    # but uses the regular ar, objcopy, etc. (instead of llvm-objcopy etc.)
+    if ("${_CMAKE_TOOLCHAIN_PREFIX}" MATCHES "(.+-)?llvm-$")
+      set(_CMAKE_TOOLCHAIN_PREFIX ${CMAKE_MATCH_1})
+    endif ()
+  elseif("${CMAKE_SYCL_COMPILER_ID}" MATCHES "TI")
+    # TI compilers are named e.g. cl6x, cl470 or armcl.exe
+    get_filename_component(COMPILER_BASENAME "${CMAKE_SYCL_COMPILER}" NAME)
+    if (COMPILER_BASENAME MATCHES "^(.+)?cl([^.]+)?(\\.exe)?$")
+      set(_CMAKE_TOOLCHAIN_PREFIX "${CMAKE_MATCH_1}")
+      set(_CMAKE_TOOLCHAIN_SUFFIX "${CMAKE_MATCH_2}")
+    endif ()
+
+  endif()
+
+endif ()
+
+set(_CMAKE_PROCESSING_LANGUAGE "SYCL")
+include(CMakeFindBinUtils)
+include(Compiler/${CMAKE_SYCL_COMPILER_ID}-FindBinUtils OPTIONAL)
+unset(_CMAKE_PROCESSING_LANGUAGE)
+
+if(CMAKE_SYCL_COMPILER_SYSROOT)
+  string(CONCAT _SET_CMAKE_SYCL_COMPILER_SYSROOT
+    "set(CMAKE_SYCL_COMPILER_SYSROOT \"${CMAKE_SYCL_COMPILER_SYSROOT}\")\n"
+    "set(CMAKE_COMPILER_SYSROOT \"${CMAKE_SYCL_COMPILER_SYSROOT}\")")
+else()
+  set(_SET_CMAKE_SYCL_COMPILER_SYSROOT "")
+endif()
+
+if(CMAKE_SYCL_COMPILER_ARCHITECTURE_ID)
+  set(_SET_CMAKE_SYCL_COMPILER_ARCHITECTURE_ID
+    "set(CMAKE_SYCL_COMPILER_ARCHITECTURE_ID ${CMAKE_SYCL_COMPILER_ARCHITECTURE_ID})")
+else()
+  set(_SET_CMAKE_SYCL_COMPILER_ARCHITECTURE_ID "")
+endif()
+
+if(MSVC_SYCL_ARCHITECTURE_ID)
+  set(SET_MSVC_SYCL_ARCHITECTURE_ID
+    "set(MSVC_SYCL_ARCHITECTURE_ID ${MSVC_SYCL_ARCHITECTURE_ID})")
+endif()
+
+if(CMAKE_SYCL_XCODE_ARCHS)
+  set(SET_CMAKE_XCODE_ARCHS
+    "set(CMAKE_XCODE_ARCHS \"${CMAKE_SYCL_XCODE_ARCHS}\")")
+endif()
+
+# configure all variables set in this file
+configure_file(${ArrayFire_SOURCE_DIR}/CMakeModules/CMakeSYCLCompiler.cmake.in
+  ${CMAKE_PLATFORM_INFO_DIR}/CMakeSYCLCompiler.cmake
+  @ONLY
+  )
+
+set(CMAKE_SYCL_COMPILER_ENV_VAR "SYCL")
diff --git a/CMakeModules/CMakeSYCLCompiler.cmake.in b/CMakeModules/CMakeSYCLCompiler.cmake.in
new file mode 100644
index 0000000000..e0193afb13
--- /dev/null
+++ b/CMakeModules/CMakeSYCLCompiler.cmake.in
@@ -0,0 +1,83 @@
+set(CMAKE_SYCL_COMPILER "@CMAKE_SYCL_COMPILER@")
+set(CMAKE_SYCL_COMPILER_ARG1 "@CMAKE_SYCL_COMPILER_ARG1@")
+set(CMAKE_SYCL_COMPILER_ID "@CMAKE_SYCL_COMPILER_ID@")
+set(CMAKE_SYCL_COMPILER_VERSION "@CMAKE_SYCL_COMPILER_VERSION@")
+set(CMAKE_SYCL_COMPILER_VERSION_INTERNAL "@CMAKE_SYCL_COMPILER_VERSION_INTERNAL@")
+set(CMAKE_SYCL_COMPILER_WRAPPER "@CMAKE_SYCL_COMPILER_WRAPPER@")
+set(CMAKE_SYCL_STANDARD_COMPUTED_DEFAULT "@CMAKE_SYCL_STANDARD_COMPUTED_DEFAULT@")
+set(CMAKE_SYCL_EXTENSIONS_COMPUTED_DEFAULT "@CMAKE_SYCL_EXTENSIONS_COMPUTED_DEFAULT@")
+set(CMAKE_SYCL_COMPILE_FEATURES "@CMAKE_SYCL_COMPILE_FEATURES@")
+set(CMAKE_SYCL98_COMPILE_FEATURES "@CMAKE_SYCL98_COMPILE_FEATURES@")
+set(CMAKE_SYCL11_COMPILE_FEATURES "@CMAKE_SYCL11_COMPILE_FEATURES@")
+set(CMAKE_SYCL14_COMPILE_FEATURES "@CMAKE_SYCL14_COMPILE_FEATURES@")
+set(CMAKE_SYCL17_COMPILE_FEATURES "@CMAKE_SYCL17_COMPILE_FEATURES@")
+set(CMAKE_SYCL20_COMPILE_FEATURES "@CMAKE_SYCL20_COMPILE_FEATURES@")
+set(CMAKE_SYCL23_COMPILE_FEATURES "@CMAKE_SYCL23_COMPILE_FEATURES@")
+
+set(CMAKE_SYCL_PLATFORM_ID "@CMAKE_SYCL_PLATFORM_ID@")
+set(CMAKE_SYCL_SIMULATE_ID "@CMAKE_SYCL_SIMULATE_ID@")
+set(CMAKE_SYCL_COMPILER_FRONTEND_VARIANT "@CMAKE_SYCL_COMPILER_FRONTEND_VARIANT@")
+set(CMAKE_SYCL_SIMULATE_VERSION "@CMAKE_SYCL_SIMULATE_VERSION@")
+@_SET_CMAKE_SYCL_COMPILER_ARCHITECTURE_ID@
+@_SET_CMAKE_SYCL_COMPILER_SYSROOT@
+@SET_MSVC_SYCL_ARCHITECTURE_ID@
+@SET_CMAKE_XCODE_ARCHS@
+set(CMAKE_AR "@CMAKE_AR@")
+set(CMAKE_SYCL_COMPILER_AR "@CMAKE_SYCL_COMPILER_AR@")
+set(CMAKE_RANLIB "@CMAKE_RANLIB@")
+set(CMAKE_SYCL_COMPILER_RANLIB "@CMAKE_SYCL_COMPILER_RANLIB@")
+set(CMAKE_LINKER "@CMAKE_LINKER@")
+set(CMAKE_MT "@CMAKE_MT@")
+set(CMAKE_COMPILER_IS_GNUSYCL @CMAKE_COMPILER_IS_GNUSYCL@)
+set(CMAKE_SYCL_COMPILER_LOADED 1)
+set(CMAKE_SYCL_COMPILER_WORKS @CMAKE_SYCL_COMPILER_WORKS@)
+set(CMAKE_SYCL_ABI_COMPILED @CMAKE_SYCL_ABI_COMPILED@)
+
+set(CMAKE_SYCL_COMPILER_ENV_VAR "SYCL")
+
+set(CMAKE_SYCL_COMPILER_ID_RUN 1)
+set(CMAKE_SYCL_SOURCE_FILE_EXTENSIONS C;M;c++;cc;cpp;cxx;m;mm;mpp;CPP;ixx;cppm)
+set(CMAKE_SYCL_IGNORE_EXTENSIONS inl;h;hpp;HPP;H;o;O;obj;OBJ;def;DEF;rc;RC)
+
+foreach (lang SYCL)
+  if (CMAKE_${lang}_COMPILER_ID_RUN)
+    foreach(extension IN LISTS CMAKE_${lang}_SOURCE_FILE_EXTENSIONS)
+      list(REMOVE_ITEM CMAKE_SYCL_SOURCE_FILE_EXTENSIONS ${extension})
+    endforeach()
+  endif()
+endforeach()
+
+set(CMAKE_SYCL_LINKER_PREFERENCE 30)
+set(CMAKE_SYCL_LINKER_PREFERENCE_PROPAGATES 1)
+
+# Save compiler ABI information.
+set(CMAKE_SYCL_SIZEOF_DATA_PTR "@CMAKE_SYCL_SIZEOF_DATA_PTR@")
+set(CMAKE_SYCL_COMPILER_ABI "@CMAKE_SYCL_COMPILER_ABI@")
+set(CMAKE_SYCL_BYTE_ORDER "@CMAKE_SYCL_BYTE_ORDER@")
+set(CMAKE_SYCL_LIBRARY_ARCHITECTURE "@CMAKE_SYCL_LIBRARY_ARCHITECTURE@")
+
+if(CMAKE_SYCL_SIZEOF_DATA_PTR)
+  set(CMAKE_SIZEOF_VOID_P "${CMAKE_SYCL_SIZEOF_DATA_PTR}")
+endif()
+
+if(CMAKE_SYCL_COMPILER_ABI)
+  set(CMAKE_INTERNAL_PLATFORM_ABI "${CMAKE_SYCL_COMPILER_ABI}")
+endif()
+
+if(CMAKE_SYCL_LIBRARY_ARCHITECTURE)
+  set(CMAKE_LIBRARY_ARCHITECTURE "@CMAKE_SYCL_LIBRARY_ARCHITECTURE@")
+endif()
+
+set(CMAKE_SYCL_CL_SHOWINCLUDES_PREFIX "@CMAKE_SYCL_CL_SHOWINCLUDES_PREFIX@")
+if(CMAKE_SYCL_CL_SHOWINCLUDES_PREFIX)
+  set(CMAKE_CL_SHOWINCLUDES_PREFIX "${CMAKE_SYCL_CL_SHOWINCLUDES_PREFIX}")
+endif()
+
+@CMAKE_SYCL_COMPILER_CUSTOM_CODE@
+@CMAKE_SYCL_SYSROOT_FLAG_CODE@
+@CMAKE_SYCL_OSX_DEPLOYMENT_TARGET_FLAG_CODE@
+
+set(CMAKE_SYCL_IMPLICIT_INCLUDE_DIRECTORIES "@CMAKE_SYCL_IMPLICIT_INCLUDE_DIRECTORIES@")
+set(CMAKE_SYCL_IMPLICIT_LINK_LIBRARIES "@CMAKE_SYCL_IMPLICIT_LINK_LIBRARIES@")
+set(CMAKE_SYCL_IMPLICIT_LINK_DIRECTORIES "@CMAKE_SYCL_IMPLICIT_LINK_DIRECTORIES@")
+set(CMAKE_SYCL_IMPLICIT_LINK_FRAMEWORK_DIRECTORIES "@CMAKE_SYCL_IMPLICIT_LINK_FRAMEWORK_DIRECTORIES@")
diff --git a/CMakeModules/CMakeSYCLCompilerABI.cpp b/CMakeModules/CMakeSYCLCompilerABI.cpp
new file mode 100644
index 0000000000..cac613b114
--- /dev/null
+++ b/CMakeModules/CMakeSYCLCompilerABI.cpp
@@ -0,0 +1,19 @@
+#ifndef __cplusplus
+#  error "A C compiler has been selected for C++."
+#endif
+
+#include "CMakeCompilerABI.h"
+
+int main(int argc, char* argv[])
+{
+  int require = 0;
+  require += info_sizeof_dptr[argc];
+  require += info_byte_order_big_endian[argc];
+  require += info_byte_order_little_endian[argc];
+#if defined(ABI_ID)
+  require += info_abi[argc];
+#endif
+  static_cast<void>(argv);
+
+  return require;
+}
diff --git a/CMakeModules/CMakeSYCLCompilerId.cpp.in b/CMakeModules/CMakeSYCLCompilerId.cpp.in
new file mode 100644
index 0000000000..913dbc7932
--- /dev/null
+++ b/CMakeModules/CMakeSYCLCompilerId.cpp.in
@@ -0,0 +1,105 @@
+/* This source file must have a .cpp extension so that all C++ compilers
+   recognize the extension without flags.  Borland does not know .cxx for
+   example.  */
+#ifndef __cplusplus
+# error "A C compiler has been selected for C++."
+#endif
+
+#if !defined(__has_include)
+/* If the compiler does not have __has_include, pretend the answer is
+   always no.  */
+#  define __has_include(x) 0
+#endif
+
+@CMAKE_SYCL_COMPILER_ID_CONTENT@
+
+/* Construct the string literal in pieces to prevent the source from
+   getting matched.  Store it in a pointer rather than an array
+   because some compilers will just produce instructions to fill the
+   array rather than assigning a pointer to a static array.  */
+char const* info_compiler = "INFO" ":" "compiler[" COMPILER_ID "]";
+#ifdef SIMULATE_ID
+char const* info_simulate = "INFO" ":" "simulate[" SIMULATE_ID "]";
+#endif
+
+#ifdef __QNXNTO__
+char const* qnxnto = "INFO" ":" "qnxnto[]";
+#endif
+
+#if defined(__CRAYXT_COMPUTE_LINUX_TARGET)
+char const *info_cray = "INFO" ":" "compiler_wrapper[CrayPrgEnv]";
+#endif
+
+@CMAKE_SYCL_COMPILER_ID_PLATFORM_CONTENT@
+@CMAKE_SYCL_COMPILER_ID_ERROR_FOR_TEST@
+
+#if defined(__INTEL_COMPILER) && defined(_MSVC_LANG) && _MSVC_LANG < 201403L
+#  if defined(__INTEL_CXX11_MODE__)
+#    if defined(__cpp_aggregate_nsdmi)
+#      define CXX_STD 201402L
+#    else
+#      define CXX_STD 201103L
+#    endif
+#  else
+#    define CXX_STD 199711L
+#  endif
+#elif defined(_MSC_VER) && defined(_MSVC_LANG)
+#  define CXX_STD _MSVC_LANG
+#else
+#  define CXX_STD __cplusplus
+#endif
+
+const char* info_language_standard_default = "INFO" ":" "standard_default["
+#if CXX_STD > 202002L
+  "23"
+#elif CXX_STD > 201703L
+  "20"
+#elif CXX_STD >= 201703L
+  "17"
+#elif CXX_STD >= 201402L
+  "14"
+#elif CXX_STD >= 201103L
+  "11"
+#else
+  "98"
+#endif
+"]";
+
+const char* info_language_extensions_default = "INFO" ":" "extensions_default["
+#if (defined(__clang__) || defined(__GNUC__) || defined(__xlC__) ||           \
+     defined(__TI_COMPILER_VERSION__)) &&                                     \
+  !defined(__STRICT_ANSI__)
+  "ON"
+#else
+  "OFF"
+#endif
+"]";
+
+/*--------------------------------------------------------------------------*/
+
+int main(int argc, char* argv[])
+{
+  int require = 0;
+  require += info_compiler[argc];
+  require += info_platform[argc];
+  require += info_arch[argc];
+#ifdef COMPILER_VERSION_MAJOR
+  require += info_version[argc];
+#endif
+#ifdef COMPILER_VERSION_INTERNAL
+  require += info_version_internal[argc];
+#endif
+#ifdef SIMULATE_ID
+  require += info_simulate[argc];
+#endif
+#ifdef SIMULATE_VERSION_MAJOR
+  require += info_simulate_version[argc];
+#endif
+#if defined(__CRAYXT_COMPUTE_LINUX_TARGET)
+  require += info_cray[argc];
+#endif
+  require += info_language_standard_default[argc];
+  require += info_language_extensions_default[argc];
+  (void)argv;
+  return require;
+}
diff --git a/CMakeModules/CMakeSYCLInformation.cmake b/CMakeModules/CMakeSYCLInformation.cmake
new file mode 100644
index 0000000000..b5ec7876db
--- /dev/null
+++ b/CMakeModules/CMakeSYCLInformation.cmake
@@ -0,0 +1,381 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+# make sure default modules are accessible
+list(APPEND CMAKE_MODULE_PATH ${CMAKE_ROOT}/Modules)
+message(STATUS "CMAKE_MODULE_PATH: ${CMAKE_MODULE_PATH}")
+
+set(CMAKE_SYCL_COMPILER_ID IntelLLVM)
+
+# This file sets the basic flags for the SYCL language in CMake.
+# It also loads the available platform file for the system-compiler
+# if it exists.
+# It also loads a system - compiler - processor (or target hardware)
+# specific file, which is mainly useful for crosscompiling and embedded systems.
+
+include(CMakeLanguageInformation)
+
+# some compilers use different extensions (e.g. sdcc uses .rel)
+# so set the extension here first so it can be overridden by the compiler specific file
+if(UNIX)
+  set(CMAKE_SYCL_OUTPUT_EXTENSION .o)
+else()
+  set(CMAKE_SYCL_OUTPUT_EXTENSION .obj)
+endif()
+
+set(_INCLUDED_FILE 0)
+
+# Load compiler-specific information.
+if(CMAKE_SYCL_COMPILER_ID)
+  #include(Compiler/${CMAKE_SYCL_COMPILER_ID}-CXX OPTIONAL)
+endif()
+
+set(CMAKE_BASE_NAME)
+get_filename_component(CMAKE_BASE_NAME "${CMAKE_SYCL_COMPILER}" NAME_WE)
+# since the gnu compiler has several names force g++
+if(CMAKE_COMPILER_IS_GNUSYCL)
+  set(CMAKE_BASE_NAME g++)
+endif()
+
+include(Compiler/${CMAKE_SYCL_COMPILER_ID} OPTIONAL)
+__compiler_intel_llvm(SYCL)
+
+if("x${CMAKE_CXX_COMPILER_FRONTEND_VARIANT}" STREQUAL "xMSVC")
+  string(APPEND CMAKE_SYCL_FLAGS_INIT " /DWIN32 /D_WINDOWS")
+  string(APPEND CMAKE_SYCL_FLAGS_DEBUG_INIT " /Zi /Ob0 /Od /RTC1")
+  string(APPEND CMAKE_SYCL_FLAGS_MINSIZEREL_INIT " /O1 /Ob1 /DNDEBUG")
+  string(APPEND CMAKE_SYCL_FLAGS_RELEASE_INIT " /O2 /Ob2 /DNDEBUG")
+  string(APPEND CMAKE_SYCL_FLAGS_RELWITHDEBINFO_INIT " /Zi /O2 /Ob1 /DNDEBUG")
+  set(CMAKE_SYCL_COMPILE_OPTIONS_EXPLICIT_LANGUAGE -TP)
+  set(CMAKE_SYCL_CLANG_TIDY_DRIVER_MODE "cl")
+  set(CMAKE_SYCL_INCLUDE_WHAT_YOU_USE_DRIVER_MODE "cl")
+  if((NOT DEFINED CMAKE_DEPENDS_USE_COMPILER OR CMAKE_DEPENDS_USE_COMPILER)
+      AND CMAKE_GENERATOR MATCHES "Makefiles|WMake"
+      AND CMAKE_DEPFILE_FLAGS_SYCL)
+    set(CMAKE_SYCL_DEPENDS_USE_COMPILER TRUE)
+  endif()
+else()
+  set(CMAKE_SYCL_COMPILE_OPTIONS_EXPLICIT_LANGUAGE -x c++)
+  if((NOT DEFINED CMAKE_DEPENDS_USE_COMPILER OR CMAKE_DEPENDS_USE_COMPILER)
+      AND CMAKE_GENERATOR MATCHES "Makefiles|WMake"
+      AND CMAKE_DEPFILE_FLAGS_SYCL)
+    # dependencies are computed by the compiler itself
+    set(CMAKE_SYCL_DEPFILE_FORMAT gcc)
+    set(CMAKE_SYCL_DEPENDS_USE_COMPILER TRUE)
+  endif()
+
+  set(CMAKE_SYCL_COMPILE_OPTIONS_VISIBILITY_INLINES_HIDDEN "-fvisibility-inlines-hidden")
+
+  string(APPEND CMAKE_SYCL_FLAGS_MINSIZEREL_INIT " -DNDEBUG")
+  string(APPEND CMAKE_SYCL_FLAGS_RELEASE_INIT " -DNDEBUG")
+  string(APPEND CMAKE_SYCL_FLAGS_RELWITHDEBINFO_INIT " -DNDEBUG")
+endif()
+
+set(CMAKE_SYCL98_STANDARD__HAS_FULL_SUPPORT ON)
+set(CMAKE_SYCL11_STANDARD__HAS_FULL_SUPPORT ON)
+set(CMAKE_SYCL14_STANDARD__HAS_FULL_SUPPORT ON)
+
+if(NOT "x${CMAKE_SYCL_SIMULATE_ID}" STREQUAL "xMSVC")
+  set(CMAKE_SYCL98_STANDARD_COMPILE_OPTION  "-std=c++98")
+  set(CMAKE_SYCL98_EXTENSION_COMPILE_OPTION "-std=gnu++98")
+
+  set(CMAKE_SYCL11_STANDARD_COMPILE_OPTION  "-std=c++11")
+  set(CMAKE_SYCL11_EXTENSION_COMPILE_OPTION "-std=gnu++11")
+
+  set(CMAKE_SYCL14_STANDARD_COMPILE_OPTION  "-std=c++14")
+  set(CMAKE_SYCL14_EXTENSION_COMPILE_OPTION "-std=gnu++14")
+
+  set(CMAKE_SYCL17_STANDARD_COMPILE_OPTION  "-std=c++17")
+  set(CMAKE_SYCL17_EXTENSION_COMPILE_OPTION "-std=gnu++17")
+
+  set(CMAKE_SYCL20_STANDARD_COMPILE_OPTION  "-std=c++20")
+  set(CMAKE_SYCL20_EXTENSION_COMPILE_OPTION "-std=gnu++20")
+
+  set(CMAKE_SYCL23_STANDARD_COMPILE_OPTION  "-std=c++2b")
+  set(CMAKE_SYCL23_EXTENSION_COMPILE_OPTION "-std=gnu++2b")
+else()
+  set(CMAKE_SYCL98_STANDARD_COMPILE_OPTION  "")
+  set(CMAKE_SYCL98_EXTENSION_COMPILE_OPTION "")
+
+  set(CMAKE_SYCL11_STANDARD_COMPILE_OPTION  "")
+  set(CMAKE_SYCL11_EXTENSION_COMPILE_OPTION "")
+
+  set(CMAKE_SYCL14_STANDARD_COMPILE_OPTION  "-Qstd:c++14")
+  set(CMAKE_SYCL14_EXTENSION_COMPILE_OPTION "-Qstd:c++14")
+
+  set(CMAKE_SYCL17_STANDARD_COMPILE_OPTION  "-Qstd:c++17")
+  set(CMAKE_SYCL17_EXTENSION_COMPILE_OPTION "-Qstd:c++17")
+
+  set(CMAKE_SYCL20_STANDARD_COMPILE_OPTION  "-Qstd:c++20")
+  set(CMAKE_SYCL20_EXTENSION_COMPILE_OPTION "-Qstd:c++20")
+
+  set(CMAKE_SYCL23_STANDARD_COMPILE_OPTION  "-Qstd:c++2b")
+  set(CMAKE_SYCL23_EXTENSION_COMPILE_OPTION "-Qstd:c++2b")
+endif()
+
+include(Platform/${CMAKE_EFFECTIVE_SYSTEM_NAME}-${CMAKE_SYCL_COMPILER_ID} OPTIONAL RESULT_VARIABLE _INCLUDED_FILE)
+
+if(WIN32)
+  set(_COMPILE_CXX " /TP")
+  __windows_compiler_intel(SYCL)
+elseif(UNIX AND NOT APPLE)
+  __linux_compiler_intel_llvm(SYCL)
+  # This should be -isystem but icpx throws an error on Ubuntu
+  # when you include /usr/include as a system header
+  set(CMAKE_INCLUDE_SYSTEM_FLAG_SYCL "-I ")
+else()
+  __apple_compiler_intel_llvm(SYCL)
+endif()
+
+# We specify the compiler information in the system file for some
+# platforms, but this language may not have been enabled when the file
+# was first included.  Include it again to get the language info.
+# Remove this when all compiler info is removed from system files.
+if (NOT _INCLUDED_FILE)
+  include(Platform/${CMAKE_SYSTEM_NAME} OPTIONAL)
+endif ()
+
+if(CMAKE_SYCL_SIZEOF_DATA_PTR)
+  foreach(f ${CMAKE_SYCL_ABI_FILES})
+    include(${f})
+  endforeach()
+  unset(CMAKE_SYCL_ABI_FILES)
+endif()
+
+# This should be included before the _INIT variables are
+# used to initialize the cache.  Since the rule variables
+# have if blocks on them, users can still define them here.
+# But, it should still be after the platform file so changes can
+# be made to those values.
+
+if(CMAKE_USER_MAKE_RULES_OVERRIDE)
+  # Save the full path of the file so try_compile can use it.
+  include(${CMAKE_USER_MAKE_RULES_OVERRIDE} RESULT_VARIABLE _override)
+  set(CMAKE_USER_MAKE_RULES_OVERRIDE "${_override}")
+endif()
+
+if(CMAKE_USER_MAKE_RULES_OVERRIDE_SYCL)
+  # Save the full path of the file so try_compile can use it.
+  include(${CMAKE_USER_MAKE_RULES_OVERRIDE_SYCL} RESULT_VARIABLE _override)
+  set(CMAKE_USER_MAKE_RULES_OVERRIDE_SYCL "${_override}")
+endif()
+
+
+# Create a set of shared library variables specific to SYCL.
+# For 90% of the systems, these are the same flags as the CXX versions,
+# so if these are not set, just copy the flags from the CXX version.
+if(NOT CMAKE_SHARED_LIBRARY_CREATE_SYCL_FLAGS)
+  set(CMAKE_SHARED_LIBRARY_CREATE_SYCL_FLAGS ${CMAKE_SHARED_LIBRARY_CREATE_CXX_FLAGS})
+endif()
+
+if(NOT CMAKE_SYCL_COMPILE_OPTIONS_PIC)
+  set(CMAKE_SYCL_COMPILE_OPTIONS_PIC ${CMAKE_CXX_COMPILE_OPTIONS_PIC})
+endif()
+
+if(NOT CMAKE_SYCL_COMPILE_OPTIONS_PIE)
+  set(CMAKE_SYCL_COMPILE_OPTIONS_PIE ${CMAKE_CXX_COMPILE_OPTIONS_PIE})
+endif()
+if(NOT CMAKE_SYCL_LINK_OPTIONS_PIE)
+  set(CMAKE_SYCL_LINK_OPTIONS_PIE ${CMAKE_CXX_LINK_OPTIONS_PIE})
+endif()
+if(NOT CMAKE_SYCL_LINK_OPTIONS_NO_PIE)
+  set(CMAKE_SYCL_LINK_OPTIONS_NO_PIE ${CMAKE_CXX_LINK_OPTIONS_NO_PIE})
+endif()
+
+if(NOT CMAKE_SYCL_COMPILE_OPTIONS_DLL)
+  set(CMAKE_SYCL_COMPILE_OPTIONS_DLL ${CMAKE_CXX_COMPILE_OPTIONS_DLL})
+endif()
+
+if(NOT CMAKE_SHARED_LIBRARY_SYCL_FLAGS)
+  set(CMAKE_SHARED_LIBRARY_SYCL_FLAGS ${CMAKE_SHARED_LIBRARY_CXX_FLAGS})
+endif()
+
+if(NOT DEFINED CMAKE_SHARED_LIBRARY_LINK_SYCL_FLAGS)
+  set(CMAKE_SHARED_LIBRARY_LINK_SYCL_FLAGS ${CMAKE_SHARED_LIBRARY_LINK_CXX_FLAGS})
+endif()
+
+if(NOT CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG)
+  set(CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG ${CMAKE_SHARED_LIBRARY_RUNTIME_CXX_FLAG})
+endif()
+
+if(NOT CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG_SEP)
+  set(CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG_SEP ${CMAKE_SHARED_LIBRARY_RUNTIME_CXX_FLAG_SEP})
+endif()
+
+if(NOT CMAKE_SHARED_LIBRARY_RPATH_LINK_SYCL_FLAG)
+  set(CMAKE_SHARED_LIBRARY_RPATH_LINK_SYCL_FLAG ${CMAKE_SHARED_LIBRARY_RPATH_LINK_CXX_FLAG})
+endif()
+
+if(NOT DEFINED CMAKE_EXE_EXPORTS_SYCL_FLAG)
+  set(CMAKE_EXE_EXPORTS_SYCL_FLAG ${CMAKE_EXE_EXPORTS_CXX_FLAG})
+endif()
+
+if(NOT DEFINED CMAKE_SHARED_LIBRARY_SONAME_SYCL_FLAG)
+  set(CMAKE_SHARED_LIBRARY_SONAME_SYCL_FLAG ${CMAKE_SHARED_LIBRARY_SONAME_CXX_FLAG})
+endif()
+
+if(NOT CMAKE_EXECUTABLE_RUNTIME_SYCL_FLAG)
+  set(CMAKE_EXECUTABLE_RUNTIME_SYCL_FLAG ${CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG})
+endif()
+
+if(NOT CMAKE_EXECUTABLE_RUNTIME_SYCL_FLAG_SEP)
+  set(CMAKE_EXECUTABLE_RUNTIME_SYCL_FLAG_SEP ${CMAKE_SHARED_LIBRARY_RUNTIME_SYCL_FLAG_SEP})
+endif()
+
+if(NOT CMAKE_EXECUTABLE_RPATH_LINK_SYCL_FLAG)
+  set(CMAKE_EXECUTABLE_RPATH_LINK_SYCL_FLAG ${CMAKE_SHARED_LIBRARY_RPATH_LINK_SYCL_FLAG})
+endif()
+
+if(NOT DEFINED CMAKE_SHARED_LIBRARY_LINK_SYCL_WITH_RUNTIME_PATH)
+  set(CMAKE_SHARED_LIBRARY_LINK_SYCL_WITH_RUNTIME_PATH ${CMAKE_SHARED_LIBRARY_LINK_CXX_WITH_RUNTIME_PATH})
+endif()
+
+if(NOT CMAKE_INCLUDE_FLAG_SYCL)
+  set(CMAKE_INCLUDE_FLAG_SYCL ${CMAKE_INCLUDE_FLAG_C})
+endif()
+
+# for most systems a module is the same as a shared library
+# so unless the variable CMAKE_MODULE_EXISTS is set just
+# copy the values from the LIBRARY variables
+if(NOT CMAKE_MODULE_EXISTS)
+  set(CMAKE_SHARED_MODULE_SYCL_FLAGS ${CMAKE_SHARED_LIBRARY_SYCL_FLAGS})
+  set(CMAKE_SHARED_MODULE_CREATE_SYCL_FLAGS ${CMAKE_SHARED_LIBRARY_CREATE_SYCL_FLAGS})
+endif()
+
+# repeat for modules
+if(NOT CMAKE_SHARED_MODULE_CREATE_SYCL_FLAGS)
+  set(CMAKE_SHARED_MODULE_CREATE_SYCL_FLAGS ${CMAKE_SHARED_MODULE_CREATE_CXX_FLAGS})
+endif()
+
+if(NOT CMAKE_SHARED_MODULE_SYCL_FLAGS)
+  set(CMAKE_SHARED_MODULE_SYCL_FLAGS ${CMAKE_SHARED_MODULE_CXX_FLAGS})
+endif()
+
+# Initialize SYCL link type selection flags from CXX versions.
+foreach(type SHARED_LIBRARY SHARED_MODULE EXE)
+  if(NOT CMAKE_${type}_LINK_STATIC_SYCL_FLAGS)
+    set(CMAKE_${type}_LINK_STATIC_SYCL_FLAGS
+      ${CMAKE_${type}_LINK_STATIC_CXX_FLAGS})
+  endif()
+  if(NOT CMAKE_${type}_LINK_DYNAMIC_SYCL_FLAGS)
+    set(CMAKE_${type}_LINK_DYNAMIC_SYCL_FLAGS
+      ${CMAKE_${type}_LINK_DYNAMIC_CXX_FLAGS})
+  endif()
+endforeach()
+
+if(CMAKE_EXECUTABLE_FORMAT STREQUAL "ELF")
+  if(NOT DEFINED CMAKE_SYCL_LINK_WHAT_YOU_USE_FLAG)
+    set(CMAKE_SYCL_LINK_WHAT_YOU_USE_FLAG "LINKER:--no-as-needed")
+  endif()
+  if(NOT DEFINED CMAKE_LINK_WHAT_YOU_USE_CHECK)
+    set(CMAKE_LINK_WHAT_YOU_USE_CHECK ldd -u -r)
+  endif()
+endif()
+
+# add the flags to the cache based
+# on the initial values computed in the platform/*.cmake files
+# use _INIT variables so that this only happens the first time
+# and you can set these flags in the cmake cache
+set(CMAKE_SYCL_FLAGS_INIT "-fsycl $ENV{SYCLFLAGS} ${CMAKE_SYCL_FLAGS_INIT}")
+
+cmake_initialize_per_config_variable(CMAKE_SYCL_FLAGS "Flags used by the SYCL compiler")
+
+if(CMAKE_SYCL_STANDARD_LIBRARIES_INIT)
+  set(CMAKE_SYCL_STANDARD_LIBRARIES "${CMAKE_SYCL_STANDARD_LIBRARIES_INIT}"
+    CACHE STRING "Libraries linked by default with all SYCL applications.")
+  mark_as_advanced(CMAKE_SYCL_STANDARD_LIBRARIES)
+endif()
+
+if(NOT CMAKE_SYCL_COMPILER_LAUNCHER AND DEFINED ENV{CMAKE_SYCL_COMPILER_LAUNCHER})
+  set(CMAKE_SYCL_COMPILER_LAUNCHER "$ENV{CMAKE_SYCL_COMPILER_LAUNCHER}"
+    CACHE STRING "Compiler launcher for SYCL.")
+endif()
+
+if(NOT CMAKE_SYCL_LINKER_LAUNCHER AND DEFINED ENV{CMAKE_SYCL_LINKER_LAUNCHER})
+  set(CMAKE_SYCL_LINKER_LAUNCHER "$ENV{CMAKE_SYCL_LINKER_LAUNCHER}"
+    CACHE STRING "Linker launcher for SYCL.")
+endif()
+
+include(CMakeCommonLanguageInclude)
+
+# now define the following rules:
+# CMAKE_SYCL_CREATE_SHARED_LIBRARY
+# CMAKE_SYCL_CREATE_SHARED_MODULE
+# CMAKE_SYCL_COMPILE_OBJECT
+# CMAKE_SYCL_LINK_EXECUTABLE
+
+# variables supplied by the generator at use time
+# <TARGET>
+# <TARGET_BASE> the target without the suffix
+# <OBJECTS>
+# <OBJECT>
+# <LINK_LIBRARIES>
+# <FLAGS>
+# <LINK_FLAGS>
+
+# SYCL compiler information
+# <CMAKE_SYCL_COMPILER>
+# <CMAKE_SHARED_LIBRARY_CREATE_SYCL_FLAGS>
+# <CMAKE_SHARED_MODULE_CREATE_SYCL_FLAGS>
+# <CMAKE_SYCL_LINK_FLAGS>
+
+# Static library tools
+# <CMAKE_AR>
+# <CMAKE_RANLIB>
+
+# create a shared SYCL library
+if(NOT CMAKE_SYCL_CREATE_SHARED_LIBRARY)
+  set(CMAKE_SYCL_CREATE_SHARED_LIBRARY
+      "<CMAKE_SYCL_COMPILER> <CMAKE_SHARED_LIBRARY_SYCL_FLAGS> <LANGUAGE_COMPILE_FLAGS> <LINK_FLAGS> <CMAKE_SHARED_LIBRARY_CREATE_SYCL_FLAGS> <SONAME_FLAG><TARGET_SONAME> -o <TARGET> <OBJECTS> <LINK_LIBRARIES>")
+endif()
+
+# create a SYCL shared module; copy the shared library rule by default
+if(NOT CMAKE_SYCL_CREATE_SHARED_MODULE)
+  set(CMAKE_SYCL_CREATE_SHARED_MODULE ${CMAKE_SYCL_CREATE_SHARED_LIBRARY})
+endif()
+
+
+# Create a static archive incrementally for large object file counts.
+# If CMAKE_SYCL_CREATE_STATIC_LIBRARY is set it will override these.
+if(NOT DEFINED CMAKE_SYCL_ARCHIVE_CREATE)
+  set(CMAKE_SYCL_ARCHIVE_CREATE "<CMAKE_AR> qc <TARGET> <LINK_FLAGS> <OBJECTS>")
+endif()
+if(NOT DEFINED CMAKE_SYCL_ARCHIVE_APPEND)
+  set(CMAKE_SYCL_ARCHIVE_APPEND "<CMAKE_AR> q <TARGET> <LINK_FLAGS> <OBJECTS>")
+endif()
+if(NOT DEFINED CMAKE_SYCL_ARCHIVE_FINISH)
+  set(CMAKE_SYCL_ARCHIVE_FINISH "<CMAKE_RANLIB> <TARGET>")
+endif()
+
+# compile a SYCL file into an object file
+if(NOT CMAKE_SYCL_COMPILE_OBJECT)
+  set(CMAKE_SYCL_COMPILE_OBJECT
+    "<CMAKE_SYCL_COMPILER> <DEFINES> <INCLUDES> <FLAGS> -o <OBJECT> -c <SOURCE>")
+endif()
+
+if(NOT CMAKE_SYCL_LINK_EXECUTABLE)
+  set(CMAKE_SYCL_LINK_EXECUTABLE
+    "<CMAKE_SYCL_COMPILER> <FLAGS> <CMAKE_SYCL_LINK_FLAGS> <LINK_FLAGS> <OBJECTS> -o <TARGET> <LINK_LIBRARIES>")
+endif()
+
+if(CMAKE_HOST_WIN32)
+  set(MSVC_RUNTIME "")
+  if("${CMAKE_MSVC_RUNTIME_LIBRARY}" STREQUAL "MultiThreaded")
+    set(MSVC_RUNTIME "-MT")
+  elseif("${CMAKE_MSVC_RUNTIME_LIBRARY}" STREQUAL "MultiThreadedDLL")
+    set(MSVC_RUNTIME "-MD")
+  elseif("${CMAKE_MSVC_RUNTIME_LIBRARY}" STREQUAL "MultiThreadedDebug")
+    set(MSVC_RUNTIME "-MTd")
+  elseif("${CMAKE_MSVC_RUNTIME_LIBRARY}" STREQUAL "MultiThreadedDebugDLL")
+    set(MSVC_RUNTIME "-MDd")
+  else()
+    set(MSVC_RUNTIME "-MD$<$<CONFIG:Debug>:d>")
+  endif()
+  set(CMAKE_MSVC_RUNTIME_LIBRARY "")
+endif()
+
+mark_as_advanced(
+CMAKE_VERBOSE_MAKEFILE
+)
+
+set(CMAKE_SYCL_INFORMATION_LOADED 1)
diff --git a/CMakeModules/CMakeTestSYCLCompiler.cmake b/CMakeModules/CMakeTestSYCLCompiler.cmake
new file mode 100644
index 0000000000..ef38081b37
--- /dev/null
+++ b/CMakeModules/CMakeTestSYCLCompiler.cmake
@@ -0,0 +1,95 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+
+if(CMAKE_SYCL_COMPILER_FORCED)
+  # The compiler configuration was forced by the user.
+  # Assume the user has configured all compiler information.
+  set(CMAKE_SYCL_COMPILER_WORKS TRUE)
+  return()
+endif()
+
+include(CMakeTestCompilerCommon)
+
+# work around enforced code signing and / or missing executable target type
+set(__CMAKE_SAVED_TRY_COMPILE_TARGET_TYPE ${CMAKE_TRY_COMPILE_TARGET_TYPE})
+if(_CMAKE_FEATURE_DETECTION_TARGET_TYPE)
+  set(CMAKE_TRY_COMPILE_TARGET_TYPE ${_CMAKE_FEATURE_DETECTION_TARGET_TYPE})
+endif()
+
+# Remove any cached result from an older CMake version.
+# We now store this in CMakeSYCLCompiler.cmake.
+unset(CMAKE_SYCL_COMPILER_WORKS CACHE)
+
+# Try to identify the ABI and configure it into CMakeSYCLCompiler.cmake
+include(CMakeDetermineCompilerABI)
+CMAKE_DETERMINE_COMPILER_ABI(SYCL ${ArrayFire_SOURCE_DIR}/CMakeModules/CMakeSYCLCompilerABI.cpp)
+if(CMAKE_SYCL_ABI_COMPILED)
+  # The compiler worked so skip dedicated test below.
+  set(CMAKE_SYCL_COMPILER_WORKS TRUE)
+  message(STATUS "Check for working SYCL compiler: ${CMAKE_SYCL_COMPILER} - skipped")
+endif()
+
+# This file is used by EnableLanguage in cmGlobalGenerator to
+# determine that the selected SYCL compiler can actually compile
+# and link the most basic of programs. If not, a fatal error
+# is set and cmake stops processing commands and will not generate
+# any makefiles or projects.
+if(NOT CMAKE_SYCL_COMPILER_WORKS)
+  PrintTestCompilerStatus("SYCL")
+  __TestCompiler_setTryCompileTargetType()
+  file(WRITE ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/CMakeTmp/testSYCLCompiler.cxx
+    "#ifndef __cplusplus\n"
+    "# error \"The CMAKE_SYCL_COMPILER is set to a C compiler\"\n"
+    "#endif\n"
+    "int main(){return 0;}\n")
+  # Clear result from normal variable.
+  unset(CMAKE_SYCL_COMPILER_WORKS)
+  # Puts test result in cache variable.
+  try_compile(CMAKE_SYCL_COMPILER_WORKS ${CMAKE_BINARY_DIR}
+    ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/CMakeTmp/testSYCLCompiler.cxx
+    OUTPUT_VARIABLE __CMAKE_SYCL_COMPILER_OUTPUT)
+  unset(__TestCompiler_testSYCLCompilerSource)
+  # Move result from cache to normal variable.
+  set(CMAKE_SYCL_COMPILER_WORKS ${CMAKE_SYCL_COMPILER_WORKS})
+  unset(CMAKE_SYCL_COMPILER_WORKS CACHE)
+  __TestCompiler_restoreTryCompileTargetType()
+  if(NOT CMAKE_SYCL_COMPILER_WORKS)
+    PrintTestCompilerResult(CHECK_FAIL "broken")
+    string(REPLACE "\n" "\n  " _output "${__CMAKE_SYCL_COMPILER_OUTPUT}")
+    message(FATAL_ERROR "The SYCL compiler\n  \"${CMAKE_SYCL_COMPILER}\"\n"
+      "is not able to compile a simple test program.\nIt fails "
+      "with the following output:\n  ${_output}\n\n"
+      "CMake will not be able to correctly generate this project.")
+  endif()
+  PrintTestCompilerResult(CHECK_PASS "works")
+endif()
+
+# Try to identify the compiler features
+if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.30.0)
+    include(CMakeDetermineCompilerSupport)
+    CMAKE_DETERMINE_COMPILER_SUPPORT(CXX)
+else()
+    include(CMakeDetermineCompileFeatures)
+    CMAKE_DETERMINE_COMPILE_FEATURES(CXX)
+endif()
+
+set(CMAKE_TRY_COMPILE_CONFIGURATION "")
+# Re-configure to save learned information.
+configure_file(
+  ${ArrayFire_SOURCE_DIR}/CMakeModules/CMakeSYCLCompiler.cmake.in
+  ${CMAKE_PLATFORM_INFO_DIR}/CMakeSYCLCompiler.cmake
+  @ONLY
+)
+include(${CMAKE_PLATFORM_INFO_DIR}/CMakeSYCLCompiler.cmake)
+
+if(CMAKE_SYCL_SIZEOF_DATA_PTR)
+  foreach(f ${CMAKE_SYCL_ABI_FILES})
+    include(${f})
+  endforeach()
+  unset(CMAKE_SYCL_ABI_FILES)
+endif()
+
+set(CMAKE_TRY_COMPILE_TARGET_TYPE ${__CMAKE_SAVED_TRY_COMPILE_TARGET_TYPE})
+unset(__CMAKE_SAVED_TRY_COMPILE_TARGET_TYPE)
+unset(__CMAKE_SYCL_COMPILER_OUTPUT)
diff --git a/CMakeModules/CPackConfig.cmake b/CMakeModules/CPackConfig.cmake
new file mode 100644
index 0000000000..8cf0880faa
--- /dev/null
+++ b/CMakeModules/CPackConfig.cmake
@@ -0,0 +1,150 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# https://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.10.2)
+
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${PROJECT_SOURCE_DIR}/CMakeModules/nsis")
+
+include(Version)
+
+set(CPACK_THREADS 8)
+
+set(CPACK_GENERATOR "STGZ;TGZ" CACHE STRING "STGZ;TGZ;DEB;RPM;productbuild")
+mark_as_advanced(CPACK_GENERATOR)
+
+set(VENDOR_NAME "ArrayFire")
+set(LIBRARY_NAME ${PROJECT_NAME})
+string(TOLOWER "${LIBRARY_NAME}" APP_LOW_NAME)
+set(SITE_URL "https://arrayfire.com")
+
+# Long description of the package
+set(CPACK_PACKAGE_DESCRIPTION
+"ArrayFire is a high performance software library for parallel computing
+with an easy-to-use API. Its array based function set makes parallel
+programming simple.
+
+ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it
+platform independent and highly portable.
+
+A few lines of code in ArrayFire can replace dozens of lines of parallel
+computing code, saving you valuable time and lowering development costs.")
+
+# Short description of the package
+set(CPACK_PACKAGE_DESCRIPTION_SUMMARY
+  "A high performance library for parallel computing with an easy-to-use API.")
+
+# Common settings to all packaging tools
+set(CPACK_PREFIX_DIR ${CMAKE_INSTALL_PREFIX})
+set(CPACK_PACKAGE_NAME "${LIBRARY_NAME}")
+set(CPACK_PACKAGE_VENDOR "${VENDOR_NAME}")
+set(CPACK_PACKAGE_INSTALL_REGISTRY_KEY ${LIBRARY_NAME})
+set(CPACK_PACKAGE_CONTACT "ArrayFire <technical@arrayfire.com>")
+set(MY_CPACK_PACKAGE_ICON "${ASSETS_DIR}/${APP_LOW_NAME}.ico")
+
+file(TO_NATIVE_PATH "${ASSETS_DIR}/" NATIVE_ASSETS_PATH)
+string(REPLACE "\\" "\\\\" NATIVE_ASSETS_PATH  ${NATIVE_ASSETS_PATH})
+set(CPACK_AF_ASSETS_DIR "${NATIVE_ASSETS_PATH}")
+
+set(CPACK_PACKAGE_VERSION_MAJOR "${ArrayFire_VERSION_MAJOR}")
+set(CPACK_PACKAGE_VERSION_MINOR "${ArrayFire_VERSION_MINOR}")
+set(CPACK_PACKAGE_VERSION_PATCH "${ArrayFire_VERSION_PATCH}")
+
+set(CPACK_PACKAGE_INSTALL_DIRECTORY "${LIBRARY_NAME}")
+
+set(CPACK_DEBIAN_FILE_NAME DEB-DEFAULT)
+set(CPACK_DEB_COMPONENT_INSTALL ON)
+set(CPACK_DEBIAN_DEBUGINFO_PACKAGE OFF)
+set(CPACK_DEBIAN_PACKAGE_DEBUG ON)
+set(CPACK_DEBIAN_PACKAGE_GENERATE_SHLIBS ON)
+set(CPACK_DEBIAN_PACKAGE_GENERATE_SHLIBS_POLICY ">=")
+set(CPACK_DEBIAN_PACKAGE_HOMEPAGE http://www.arrayfire.com)
+set(CPACK_DEBIAN_PACKAGE_CONTROL_STRICT_PERMISSION TRUE)
+set(CPACK_DEBIAN_COMPRESSION_TYPE xz)
+set(CPACK_DEBIAN_DEBUGINFO_PACKAGE ON)
+
+# Creates a variable from an ArrayFire variable so that it can be passed
+# into the CPack project file. This is done by prepending CPACK_ to the
+# variable name.
+macro(to_cpack_variable variable)
+  set(CPACK_${variable} ${${variable}})
+endmacro()
+
+to_cpack_variable(AF_COMPUTE_LIBRARY)
+to_cpack_variable(ArrayFire_SOURCE_DIR)
+to_cpack_variable(ArrayFire_BINARY_DIR)
+to_cpack_variable(CUDA_VERSION_MAJOR)
+to_cpack_variable(CUDA_VERSION_MINOR)
+
+# Create an arrayfire component so that the Debian package has a top-level
+# package that installs all the backends. This package needs to have
+# some files associated with it so that it doesn't get deleted by
+# APT after it's installed.
+file(WRITE ${ArrayFire_BINARY_DIR}/arrayfire_version.txt ${ArrayFire_VERSION})
+install(FILES ${ArrayFire_BINARY_DIR}/arrayfire_version.txt
+  DESTINATION ${CMAKE_INSTALL_SYSCONFDIR}
+  COMPONENT arrayfire)
+
+# Platform specific settings for CPACK generators
+# - OSX specific
+#   - DragNDrop (OSX only)
+#   - PackageMaker (OSX only)
+#   - OSXX11 (OSX only)
+#   - Bundle (OSX only)
+# - Windows
+#   - NSIS64 Generator
+if(APPLE)
+  set(CPACK_PACKAGING_INSTALL_PREFIX "/opt/arrayfire")
+  set(OSX_INSTALL_SOURCE ${PROJECT_SOURCE_DIR}/CMakeModules/osx_install)
+  set(WELCOME_FILE       "${OSX_INSTALL_SOURCE}/welcome.html.in")
+  set(WELCOME_FILE_OUT   "${CMAKE_CURRENT_BINARY_DIR}/welcome.html")
+
+  set(README_FILE       "${OSX_INSTALL_SOURCE}/readme.html.in")
+  set(README_FILE_OUT   "${CMAKE_CURRENT_BINARY_DIR}/readme.html")
+
+  set(LICENSE_FILE       "${ArrayFire_SOURCE_DIR}/LICENSE")
+  set(LICENSE_FILE_OUT   "${CMAKE_CURRENT_BINARY_DIR}/license.txt")
+
+  set(AF_TITLE    "ArrayFire ${AF_VERSION}")
+  configure_file(${WELCOME_FILE} ${WELCOME_FILE_OUT})
+  configure_file(${README_FILE} ${README_FILE_OUT})
+  configure_file(${LICENSE_FILE} ${LICENSE_FILE_OUT})
+  set(CPACK_RESOURCE_FILE_LICENSE ${LICENSE_FILE_OUT})
+  set(CPACK_RESOURCE_FILE_README ${README_FILE_OUT})
+  set(CPACK_RESOURCE_FILE_WELCOME ${WELCOME_FILE_OUT})
+elseif(WIN32)
+  set(WIN_INSTALL_SOURCE ${PROJECT_SOURCE_DIR}/CMakeModules/nsis)
+
+  set(LICENSE_FILE       "${ArrayFire_SOURCE_DIR}/LICENSE")
+  set(LICENSE_FILE_OUT   "${CMAKE_CURRENT_BINARY_DIR}/license.txt")
+  configure_file(${LICENSE_FILE} ${LICENSE_FILE_OUT})
+  set(CPACK_RESOURCE_FILE_LICENSE ${LICENSE_FILE_OUT})
+
+  #NSIS SPECIFIC VARIABLES
+  set(CPACK_NSIS_ENABLE_UNINSTALL_BEFORE_INSTALL ON)
+  set(CPACK_NSIS_MODIFY_PATH ON)
+  set(CPACK_NSIS_DISPLAY_NAME "${LIBRARY_NAME}")
+  set(CPACK_NSIS_PACKAGE_NAME "${LIBRARY_NAME}")
+  set(CPACK_NSIS_HELP_LINK "${SITE_URL}")
+  set(CPACK_NSIS_URL_INFO_ABOUT "${SITE_URL}")
+  set(CPACK_NSIS_INSTALLED_ICON_NAME "${MY_CPACK_PACKAGE_ICON}")
+  set(CPACK_NSIS_COMPRESSOR "lzma")
+  if(CMAKE_CL_64)
+    set(CPACK_NSIS_INSTALL_ROOT "$PROGRAMFILES64")
+  else()
+    set(CPACK_NSIS_INSTALL_ROOT "$PROGRAMFILES")
+  endif()
+  configure_file(
+      ${PROJECT_SOURCE_DIR}/CMakeModules/nsis/NSIS.definitions.nsh.in
+      ${CMAKE_CURRENT_BINARY_DIR}/NSIS.definitions.nsh)
+else()
+  set(CPACK_RESOURCE_FILE_LICENSE "${ArrayFire_SOURCE_DIR}/LICENSE")
+  set(CPACK_RESOURCE_FILE_README "${ArrayFire_SOURCE_DIR}/README.md")
+endif()
+
+set(CPACK_PROJECT_CONFIG_FILE "${CMAKE_SOURCE_DIR}/CMakeModules/CPackProjectConfig.cmake")
+
+include(CPack)
diff --git a/CMakeModules/CPackProjectConfig.cmake b/CMakeModules/CPackProjectConfig.cmake
new file mode 100644
index 0000000000..f75591f8bb
--- /dev/null
+++ b/CMakeModules/CPackProjectConfig.cmake
@@ -0,0 +1,610 @@
+
+include(CPackIFW)
+include(CPackComponent)
+
+# Only install the components created using the af_component macro
+set(CPACK_COMPONENTS_ALL "")
+
+# This is necessary if you don't have a CUDA driver installed on your system
+# but you are still building the CUDA package. You need the libcuda.so library,
+# which is installed by the driver. This tells dpkg-shlibdeps to ignore
+# this library because it is a private library.
+set (CPACK_DEBIAN_PACKAGE_SHLIBDEPS_PRIVATE_DIRS
+  "/usr/local/cuda-${CPACK_CUDA_VERSION_MAJOR}.${CPACK_CUDA_VERSION_MINOR}/lib64/stubs")
+
+
+# Create an ArrayFire component with a set of properties for each package manager.
+# This macro sets all the variables for each component in ArrayFire.
+#
+# ``COMPONENT``
+# The name of the ArrayFire component used in the install(XXX) commands
+#
+# ``DISPLAY_NAME``
+# The name that will appear in the GUI installers for this component
+#
+# ``SUMMARY``
+# A short one line summary of the package
+#
+# ``DESCRIPTION``
+# A longer description of the package
+#
+# ``GROUP``
+# Used to combine packages in GUI installers. Ignored in DEB and RPM installers
+#
+# ``DEB_PACKAGE_NAME``
+# Name of the package for the DEB installers. This is the first component of the
+# file name.
+#
+# ``DEB_PROVIDES``
+# The virtual packages provided by the deb package. This is a higher level name
+# of the file that can be used across version numbers. It also includes the
+# version information about the package.
+#
+# ``DEB_REPLACES``
+# The packages and virtual packages this will replace. Used if there is a package
+# that is installed as part of the base debian installation
+#
+# ``REQUIRES``
+# The components required for the GUI installers
+#
+# ``OPTIONAL``
+# Optional packages that this component can use.
+#
+# ``INSTALL_TYPES``
+# A group of components that will be selected in GUI installers from a drop down
+#
+# ``DEB_REQUIRES``
+# Set of packages required by the debian package. This is slightly different from
+# REQUIRES because it also takes into account external dependencies that can be
+# installed by apt
+#
+# ``DEB_OPTIONAL``
+# Same as OPTIONAL but for debian packages
+#
+# ``DEB_RECOMMENDS``
+# Packages that should be installed but are not required. These packages will
+# be installed by default but if removed will not also delete this package
+#
+# ``HIDDEN``
+# If set, the package will not appear in the GUI installers like NSIS. Usually
+# used for components that install dependencies.
+macro(af_component)
+  cmake_parse_arguments(RC
+    "HIDDEN;DISABLED;DEB_USE_SHLIBDEPS;DEB_ADD_POSTINST"
+    "COMPONENT;DISPLAY_NAME;SUMMARY;DESCRIPTION;GROUP;DEB_PACKAGE_NAME;DEB_PROVIDES;DEB_REPLACES"
+    "REQUIRES;OPTIONAL;INSTALL_TYPES;DEB_REQUIRES;DEB_OPTIONAL;DEB_RECOMMENDS" ${ARGN})
+
+  list(APPEND CPACK_COMPONENTS_ALL ${RC_COMPONENT})
+
+  string(TOUPPER ${RC_COMPONENT} COMPONENT_UPPER)
+  string(REPLACE ";" ", " DEB_REQ "${RC_DEB_REQUIRES}")
+  string(REPLACE ";" ", " DEB_REC "${RC_DEB_RECOMMENDS}")
+  string(REPLACE ";" ", " DEB_OPT "${RC_DEB_OPTIONAL}")
+  string(REPLACE ";" ", " DEB_PROVIDES "${RC_DEB_PROVIDES}")
+
+  if(CPACK_GENERATOR MATCHES "DEB")
+    cpack_add_component(${RC_COMPONENT}
+      DISPLAY_NAME "${RC_DISPLAY_NAME}"
+      INSTALL_TYPES ${RC_INSTALL_TYPES}
+      DESCRIPTION ${RC_DESCRIPTION})
+
+    if(RC_DEB_RECOMMENDS)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_RECOMMENDS ${DEB_REC})
+    endif()
+
+    if(RC_DEB_PACKAGE_NAME)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_NAME "${RC_DEB_PACKAGE_NAME}")
+    endif()
+
+    set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_SUGGESTS ${DEB_OPT})
+
+    if(RC_DEB_REQUIRES)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_DEPENDS "${DEB_REQ}")
+    endif()
+
+    if(RC_DEB_USE_SHLIBDEPS)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_SHLIBDEPS ON)
+    else()
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_SHLIBDEPS OFF)
+    endif()
+
+    if(RC_DEB_PROVIDES)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_PROVIDES ${DEB_PROVIDES})
+    endif()
+
+    if(RC_DEB_REPLACES)
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_REPLACES ${RC_DEB_REPLACES})
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_CONFLICTS ${RC_DEB_REPLACES})
+    endif()
+
+    if(RC_DEB_ADD_POSTINST)
+      configure_file(
+        "${CPACK_ArrayFire_SOURCE_DIR}/CMakeModules/debian/postinst"
+        "${CPACK_ArrayFire_BINARY_DIR}/cpack/${COMPONENT_UPPER}/postinst")
+
+      set(CPACK_DEBIAN_${COMPONENT_UPPER}_PACKAGE_CONTROL_EXTRA
+        "${CPACK_ArrayFire_BINARY_DIR}/cpack/${COMPONENT_UPPER}/postinst")
+    endif()
+  else()
+    cpack_add_component(${RC_COMPONENT}
+      DISPLAY_NAME "${RC_DISPLAY_NAME}"
+      DEPENDS ${RC_REQUIRES}
+      GROUP ${RC_GROUP}
+      INSTALL_TYPES ${RC_INSTALL_TYPES}
+      DESCRIPTION ${RC_DESCRIPTION})
+  endif()
+
+  set(CPACK_COMPONENT_${RC_COMPONENT}_DESCRIPTION_SUMMARY ${RC_SUMMARY})
+  set(CPACK_COMPONENT_${COMPONENT_UPPER}_DESCRIPTION ${RC_DESCRIPTION})
+
+  set(CPACK_COMPONENT_${COMPONENT_UPPER}_HIDDEN ${RC_HIDDEN})
+  set(CPACK_COMPONENT_${COMPONENT_UPPER}_DISABLED ${RC_DISABLED})
+
+  # Does not work with RPM for some reason. Using
+  # CPACK_RPM_${COMPONENT_UPPER}_PACKAGE_REQUIRES instead.
+
+endmacro()
+
+cpack_add_install_type(All DISPLAY_NAME "All Components")
+cpack_add_install_type(Development DISPLAY_NAME "Development")
+cpack_add_install_type(Runtime DISPLAY_NAME "Runtime")
+
+# Groups on debian packages will combine all the packages into one
+# debian component
+if(NOT CPACK_GENERATOR MATCHES "DEB")
+  cpack_add_component_group(afruntime
+    DISPLAY_NAME "ArrayFire Runtime"
+    DESCRIPTION "ArrayFire runtime libraries")
+
+  cpack_add_component_group(afdevelopment
+    DISPLAY_NAME "ArrayFire Development"
+    DESCRIPTION "ArrayFire development files including headers and configuration files"
+    EXPANDED)
+
+  if(CMAKE_BUILD_TYPE STREQUAL "Debug" OR
+     CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo")
+    cpack_add_component_group(debug
+      DISPLAY_NAME "ArrayFire Debug Symbols"
+      DESCRIPTION "ArrayFire Debug symbols")
+  endif()
+endif()
+
+set(arrayfire_cuda_runtime_name "CUDA Runtime(${CPACK_CUDA_VERSION_MAJOR}.${CPACK_CUDA_VERSION_MINOR})")
+set(arrayfire_cuda_dev_name "CUDA Dev")
+
+if(CPACK_GENERATOR MATCHES "DEB")
+  af_component(
+    COMPONENT arrayfire
+    REQUIRES cpu_dev cuda_dev opencl_dev examples documentation
+    SUMMARY  "ArrayFire high performance library"
+    DESCRIPTION  "ArrayFire
+ArrayFire is a general-purpose library that simplifies software
+development that targets parallel and massively-parallel architectures
+including CPUs, GPUs, and other hardware acceleration devices."
+
+    DEB_PACKAGE_NAME arrayfire
+    DEB_REQUIRES arrayfire-cpu3-dev
+                 arrayfire-headers
+
+    DEB_RECOMMENDS arrayfire-cuda3-dev
+                   arrayfire-opencl3-dev
+                   arrayfire-unified3-dev
+                   arrayfire-examples
+                   arrayfire-cmake
+                   arrayfire-doc
+  )
+endif()
+
+
+list(APPEND cpu_deps_comps common_backend_dependencies)
+list(APPEND ocl_deps_comps common_backend_dependencies)
+
+if (NOT APPLE)
+  list(APPEND ocl_deps_comps opencl_dependencies)
+endif ()
+
+set(PACKAGE_MKL_DEPS OFF)
+
+if(CPACK_CUDA_VERSION_MAJOR STREQUAL "10" AND CPACK_GENERATOR MATCHES "DEB")
+  set(deb_cuda_runtime_requirements "libcublas${CPACK_CUDA_VERSION_MAJOR}")
+elseif(CPACK_CUDA_VERSION_MAJOR STREQUAL "11" AND CPACK_GENERATOR MATCHES "DEB")
+  set(deb_cuda_runtime_requirements "libcublas-${CPACK_CUDA_VERSION_MAJOR}-${CPACK_CUDA_VERSION_MINOR}")
+elseif(CPACK_GENERATOR MATCHES "DEB")
+  message(FATAL_ERROR "THIS CUDA VERSION NOT ADDRESSED FOR DEBIAN PACKAGES")
+endif()
+
+if (CPACK_AF_COMPUTE_LIBRARY STREQUAL "Intel-MKL")
+  set(PACKAGE_MKL_DEPS ON)
+  if(NOT CPACK_GENERATOR STREQUAL "DEB")
+    af_component(
+      COMPONENT mkl_dependencies
+      DISPLAY_NAME "Intel MKL Libraries"
+      DESCRIPTION "Intel Math Kernel Libraries for FFTW, BLAS, and LAPACK routines."
+      HIDDEN
+      INSTALL_TYPES All Runtime)
+    list(APPEND cpu_deps_comps mkl_dependencies)
+    list(APPEND ocl_deps_comps mkl_dependencies)
+  endif()
+  set(deb_opencl_runtime_package_name arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR}-mkl)
+  set(deb_opencl_runtime_requirements "intel-mkl-core-rt-2020.0-166, intel-mkl-gnu-rt-2020.0-166")
+  set(deb_cpu_runtime_package_name arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-mkl)
+  set(deb_cpu_runtime_requirements "intel-mkl-core-rt-2020.0-166, intel-mkl-gnu-rt-2020.0-166")
+else()
+  # OpenCL and CPU runtime dependencies are detected using
+  # SHLIBDEPS
+  set(deb_opencl_runtime_package_name arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR}-openblas)
+  set(deb_opencl_runtime_requirements "")
+  set(deb_cpu_runtime_package_name arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-openblas)
+  set(deb_cpu_runtime_requirements "")
+endif ()
+
+af_component(
+  COMPONENT cpu
+  DISPLAY_NAME "CPU Runtime"
+  SUMMARY "ArrayFire CPU backend shared libraries"
+  DESCRIPTION "ArrayFire CPU backend shared libraries"
+  OPTIONAL forge
+  GROUP afruntime
+  REQUIRES ${cpu_deps_comps} licenses
+  INSTALL_TYPES All Runtime
+
+  DEB_PACKAGE_NAME ${deb_cpu_runtime_package_name}
+  DEB_REQUIRES ${deb_cpu_runtime_requirements}
+  DEB_PROVIDES "arrayfire-cpu (= ${CPACK_PACKAGE_VERSION}), arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION}), libarrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-cpu, arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_USE_SHLIBDEPS
+  DEB_ADD_POSTINST
+  DEB_OPTIONAL forge libfreeimage3
+)
+
+af_component(
+  COMPONENT cpu_dev
+  DISPLAY_NAME "CPU Dev"
+  SUMMARY  "ArrayFire CPU backend development files"
+  DESCRIPTION  "ArrayFire CPU backend development files"
+  REQUIRES cpu headers cmake
+  GROUP afdevelopment
+  INSTALL_TYPES All Development
+
+  DEB_PACKAGE_NAME arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-dev
+  DEB_PROVIDES "arrayfire-cpu-dev (= ${CPACK_PACKAGE_VERSION}), arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-dev (= ${CPACK_PACKAGE_VERSION}), libarrayfire-cpu-dev (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-cpu-dev (<< ${CPACK_PACKAGE_VERSION}), arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-dev (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-cpu3-dev (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-openblas (>= ${CPACK_PACKAGE_VERSION}) | arrayfire-cpu${CPACK_PACKAGE_VERSION_MAJOR}-mkl (>= ${CPACK_PACKAGE_VERSION}), arrayfire-headers (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_RECOMMENDS "arrayfire-cmake (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL "cmake (>= 3.0)"
+)
+
+af_component(
+  COMPONENT cuda
+  DISPLAY_NAME "${arrayfire_cuda_runtime_name}"
+  SUMMARY "ArrayFire CUDA backend shared libraries"
+  DESCRIPTION "ArrayFire CUDA backend shared libraries"
+  OPTIONAL forge
+  REQUIRES common_backend_dependencies cuda_dependencies licenses
+  GROUP afruntime
+  INSTALL_TYPES All Runtime
+
+  DEB_PACKAGE_NAME arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR}-cuda-${CPACK_CUDA_VERSION_MAJOR}-${CPACK_CUDA_VERSION_MINOR}
+  DEB_REQUIRES ${deb_cuda_runtime_requirements}
+  DEB_ADD_POSTINST
+  DEB_USE_SHLIBDEPS
+  DEB_PROVIDES "arrayfire-cuda (= ${CPACK_PACKAGE_VERSION}), arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION}), libarrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-cuda (<< ${CPACK_PACKAGE_VERSION}), arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL cudnn9-cuda-${CPACK_CUDA_VERSION_MAJOR}-${CPACK_CUDA_VERSION_MINOR} forge libfreeimage3
+)
+
+af_component(
+  COMPONENT cuda_dev
+  DISPLAY_NAME "${arrayfire_cuda_dev_name}"
+  SUMMARY  "ArrayFire CUDA backend development files"
+  DESCRIPTION  "ArrayFire CUDA backend development files"
+  REQUIRES cuda headers cmake
+  GROUP afdevelopment
+  INSTALL_TYPES All Development
+
+  DEB_PACKAGE_NAME arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR}-dev
+  DEB_PROVIDES "arrayfire-cuda-dev (= ${CPACK_PACKAGE_VERSION}), arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR}-dev (= ${CPACK_PACKAGE_VERSION}), libarrayfire-cuda-dev (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-cuda-dev (<< ${CPACK_PACKAGE_VERSION}), arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR}-dev (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-cuda${CPACK_PACKAGE_VERSION_MAJOR} (>= ${CPACK_PACKAGE_VERSION}), arrayfire-headers (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_RECOMMENDS "arrayfire-cmake (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL "cmake (>= 3.0)"
+)
+
+af_component(
+  COMPONENT opencl
+  DISPLAY_NAME "OpenCL Runtime"
+  SUMMARY "ArrayFire OpenCL backend shared libraries"
+  DESCRIPTION "ArrayFire OpenCL backend shared libraries"
+  REQUIRES ${opencl_deps_comps} licenses
+  OPTIONAL forge
+  GROUP afruntime
+  INSTALL_TYPES All Runtime
+
+  DEB_PACKAGE_NAME ${deb_opencl_runtime_package_name}
+  DEB_PROVIDES "arrayfire-opencl (= ${CPACK_PACKAGE_VERSION}), arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION}), libarrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-opencl (<< ${CPACK_PACKAGE_VERSION}), arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES ${deb_opencl_runtime_requirements}
+  DEB_USE_SHLIBDEPS
+  DEB_ADD_POSTINST
+  DEB_OPTIONAL forge libfreeimage3
+)
+
+af_component(
+  COMPONENT opencl_dev
+  DISPLAY_NAME "OpenCL Dev"
+  SUMMARY  "ArrayFire OpenCL backend development files"
+  DESCRIPTION  "ArrayFire OpenCL backend development files"
+  REQUIRES opencl headers cmake
+  GROUP afdevelopment
+  INSTALL_TYPES All Development
+
+  DEB_PACKAGE_NAME arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR}-dev
+  DEB_PROVIDES "arrayfire-opencl-dev (= ${CPACK_PACKAGE_VERSION}), arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR}-dev (= ${CPACK_PACKAGE_VERSION}), libarrayfire-opencl-dev (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-opencl-dev (<< ${CPACK_PACKAGE_VERSION}), arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR}-dev (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-opencl-dev (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-opencl${CPACK_PACKAGE_VERSION_MAJOR} (>= ${CPACK_PACKAGE_VERSION}), arrayfire-headers (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_RECOMMENDS "arrayfire-cmake (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL "cmake (>= 3.0)"
+)
+
+af_component(
+  COMPONENT oneapi
+  DISPLAY_NAME "oneAPI Runtime"
+  SUMMARY "ArrayFire oneAPI backend shared libraries"
+  DESCRIPTION "ArrayFire oneAPI backend shared libraries"
+  REQUIRES ${oneapi_deps_comps} licenses
+  OPTIONAL forge
+  GROUP afruntime
+  INSTALL_TYPES All Runtime
+
+  DEB_PACKAGE_NAME ${deb_oneapi_runtime_package_name}
+  DEB_PROVIDES "arrayfire-oneapi (= ${CPACK_PACKAGE_VERSION}), arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION}), libarrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-oneapi (<< ${CPACK_PACKAGE_VERSION}), arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES ${deb_oneapi_runtime_requirements}
+  DEB_USE_SHLIBDEPS
+  DEB_ADD_POSTINST
+  DEB_OPTIONAL forge libfreeimage3
+)
+
+af_component(
+  COMPONENT oneapi_dev
+  DISPLAY_NAME "oneAPI Dev"
+  SUMMARY  "ArrayFire oneAPI backend development files"
+  DESCRIPTION  "ArrayFire oneAPI backend development files"
+  REQUIRES oneapi headers cmake
+  GROUP afdevelopment
+  INSTALL_TYPES All Development
+
+  DEB_PACKAGE_NAME arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR}-dev
+  DEB_PROVIDES "arrayfire-oneapi-dev (= ${CPACK_PACKAGE_VERSION}), arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR}-dev (= ${CPACK_PACKAGE_VERSION}), libarrayfire-oneapi-dev (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-oneapi-dev (<< ${CPACK_PACKAGE_VERSION}), arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR}-dev (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-oneapi-dev (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-oneapi${CPACK_PACKAGE_VERSION_MAJOR} (>= ${CPACK_PACKAGE_VERSION}), arrayfire-headers (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_RECOMMENDS "arrayfire-cmake (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL "cmake (>= 3.0)"
+)
+
+af_component(
+  COMPONENT unified
+  DISPLAY_NAME "Unified Runtime"
+  SUMMARY "ArrayFire Unified backend shared libraries."
+  DESCRIPTION "ArrayFire Unified backend shared libraries. Requires other backends to function."
+  OPTIONAL forge
+  REQUIRES licenses
+  GROUP afruntime
+  INSTALL_TYPES All Runtime
+
+  DEB_PACKAGE_NAME arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR}
+  DEB_PROVIDES "arrayfire-unified (= ${CPACK_PACKAGE_VERSION}), arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION}), libarrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR} (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-unified (<< ${CPACK_PACKAGE_VERSION}), arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR} (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-cpu (>= ${CPACK_PACKAGE_VERSION}) | arrayfire-cuda (>= ${CPACK_PACKAGE_VERSION}) | arrayfire-opencl (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_USE_SHLIBDEPS
+)
+
+af_component(
+  COMPONENT unified_dev
+  DISPLAY_NAME "Unified Dev"
+  SUMMARY  "ArrayFire Unified backend development files"
+  DESCRIPTION  "ArrayFire Unified backend development files"
+  REQUIRES unified headers cmake
+  OPTIONAL forge
+  GROUP afdevelopment
+  INSTALL_TYPES All Development
+
+  DEB_PACKAGE_NAME arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR}-dev
+  DEB_PROVIDES "arrayfire-unified-dev (= ${CPACK_PACKAGE_VERSION}), arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR}-dev (= ${CPACK_PACKAGE_VERSION}), libarrayfire-unified-dev (= ${CPACK_PACKAGE_VERSION})"
+  DEB_REPLACES "arrayfire-unified-dev (<< ${CPACK_PACKAGE_VERSION}), arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR}-dev (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-unified-dev (<< ${CPACK_PACKAGE_VERSION})"
+  DEB_REQUIRES "arrayfire-unified${CPACK_PACKAGE_VERSION_MAJOR} (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_RECOMMENDS "arrayfire-cmake (>= ${CPACK_PACKAGE_VERSION})"
+  DEB_OPTIONAL "cmake (>= 3.0)"
+)
+
+af_component(
+  COMPONENT documentation
+  DISPLAY_NAME "Documentation"
+  SUMMARY  "ArrayFire Documentation"
+  INSTALL_TYPES All
+  DESCRIPTION  "ArrayFire Doxygen Documentation"
+
+  DEB_PACKAGE_NAME arrayfire-doc
+  DEB_REPLACES "arrayfire-doc (<< ${CPACK_PACKAGE_VERSION}), libarrayfire-doc (<< ${CPACK_PACKAGE_VERSION})"
+)
+
+af_component(
+  COMPONENT headers
+  DISPLAY_NAME "C/C++ Headers"
+  HIDDEN
+  INSTALL_TYPES All Development
+  DESCRIPTION "Headers for the ArrayFire libraries.")
+
+af_component(
+  COMPONENT examples
+  DISPLAY_NAME "ArrayFire Examples"
+  INSTALL_TYPES All
+  DESCRIPTION "Various examples using ArrayFire.")
+
+af_component(
+  COMPONENT cmake
+  DISPLAY_NAME "CMake Files"
+  HIDDEN
+  INSTALL_TYPES All Development
+  DESCRIPTION "Configuration files to use ArrayFire using CMake.")
+
+af_component(
+  COMPONENT licenses
+  DISPLAY_NAME "Licenses"
+  DESCRIPTION "License files for ArrayFire and its upstream libraries."
+  HIDDEN
+  REQUIRED)
+
+if(NOT CPACK_GENERATOR MATCHES "DEB")
+  af_component(
+    COMPONENT common_backend_dependencies
+    DISPLAY_NAME "Common Dependencies"
+    DESCRIPTION "Libraries commonly required by all ArrayFire backends."
+    HIDDEN
+    INSTALL_TYPES All Development Runtime)
+
+  af_component(
+    COMPONENT cuda_dependencies
+    DISPLAY_NAME "CUDA Dependencies"
+    DESCRIPTION "Shared libraries required for the CUDA backend."
+    HIDDEN
+    INSTALL_TYPES All Development Runtime)
+
+endif()
+
+#TODO(pradeep) Remove check after OSX support addition
+# Debug symbols in debian installers are created using the DEBINFO property
+if(NOT APPLE AND
+   NOT CPACK_GENERATOR MATCHES "DEB")
+  if(CMAKE_BUILD_TYPE STREQUAL "Debug" OR
+     CMAKE_BUILD_TYPE STREQUAL "RelWithDebInfo")
+    af_component(
+      COMPONENT afoneapi_debug_symbols
+      DISPLAY_NAME "oneAPI Debug Symbols"
+      DESCRIPTION "Debug symbols for the oneAPI backend."
+      GROUP debug
+      DISABLED
+      INSTALL_TYPES Development)
+  
+    af_component(
+      COMPONENT afopencl_debug_symbols
+      DISPLAY_NAME "OpenCL Debug Symbols"
+      DESCRIPTION "Debug symbols for the OpenCL backend."
+      GROUP debug
+      DISABLED
+      INSTALL_TYPES Development)
+  
+    af_component(
+      COMPONENT afcuda_debug_symbols
+      DISPLAY_NAME "CUDA Debug Symbols"
+      DESCRIPTION "Debug symbols for the CUDA backend."
+      GROUP debug
+      DISABLED
+      INSTALL_TYPES Development)
+  
+    af_component(
+      COMPONENT afcpu_debug_symbols
+      DISPLAY_NAME "CPU Debug Symbols"
+      DESCRIPTION "Debug symbols for the CPU backend."
+      GROUP debug
+      DISABLED
+      INSTALL_TYPES Development)
+  
+    af_component(
+      COMPONENT af_debug_symbols
+      DISPLAY_NAME "Unified Debug Symbols"
+      DESCRIPTION "Debug symbols for the Unified backend."
+      GROUP debug
+      DISABLED
+      INSTALL_TYPES Development)
+  endif()
+endif()
+
+# if (AF_INSTALL_FORGE_DEV)
+#   list(APPEND CPACK_COMPONENTS_ALL forge)
+#   af_component(
+#     COMPONENT forge
+#     DISPLAY_NAME "Forge Visualization"
+#     DESCRIPTION "Visualization Library"
+#     INSTALL_TYPES Extra)
+# endif ()
+#
+#set(LIBRARY_NAME ${PROJECT_NAME})
+#string(TOLOWER "${LIBRARY_NAME}" APP_LOW_NAME)
+#set(SITE_URL "https://arrayfire.com")
+#
+# set(inst_pkg_name ${APP_LOW_NAME})
+# set(inst_pkg_hash "")
+# if (WIN32)
+#   set(inst_pkg_name ${CPACK_PACKAGE_NAME})
+#   set(inst_pkg_hash "-${GIT_COMMIT_HASH}")
+# endif ()
+#
+#set(CPACK_PACKAGE_FILE_NAME "${inst_pkg_name}${inst_pkg_hash}")
+
+# ##
+# # IFW CPACK generator
+# # Uses Qt installer framework, cross platform installer generator.
+# # Uniform installer GUI on all major desktop platforms: Windows, OSX & Linux.
+# ##
+# set(CPACK_IFW_PACKAGE_TITLE "${CPACK_PACKAGE_NAME}")
+# set(CPACK_IFW_PACKAGE_PUBLISHER "${CPACK_PACKAGE_VENDOR}")
+# set(CPACK_IFW_PRODUCT_URL "${SITE_URL}")
+# set(CPACK_IFW_PACKAGE_ICON "${MY_CPACK_PACKAGE_ICON}")
+# set(CPACK_IFW_PACKAGE_WINDOW_ICON "${CMAKE_SOURCE_DIR}/assets/${APP_LOW_NAME}_icon.png")
+# set(CPACK_IFW_PACKAGE_WIZARD_DEFAULT_WIDTH 640)
+# set(CPACK_IFW_PACKAGE_WIZARD_DEFAULT_HEIGHT 480)
+# if (WIN32)
+#     set(CPACK_IFW_ADMIN_TARGET_DIRECTORY "@ApplicationsDirX64@/${CPACK_PACKAGE_INSTALL_DIRECTORY}")
+# else ()
+#     set(CPACK_IFW_ADMIN_TARGET_DIRECTORY "/opt/${CPACK_PACKAGE_INSTALL_DIRECTORY}")
+# endif ()
+#
+# function(get_native_path out_path path)
+#   file(TO_NATIVE_PATH ${path} native_path)
+#   if (WIN32)
+#     string(REPLACE "\\" "\\\\" native_path  ${native_path})
+#     set(${out_path} ${native_path} PARENT_SCOPE)
+#   else ()
+#     set(${out_path} ${path} PARENT_SCOPE)
+#   endif ()
+# endfunction()
+#
+# get_native_path(zlib_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/zlib-libpng License.txt")
+# get_native_path(boost_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/Boost Software License.txt")
+# get_native_path(fimg_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/FreeImage Public License.txt")
+# get_native_path(apache_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/Apache-2.0.txt")
+# get_native_path(sift_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/OpenSIFT License.txt")
+# get_native_path(bsd3_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/BSD 3-Clause.txt")
+# get_native_path(issl_lic_path "${CPACK_ArrayFire_SOURCE_DIR}/LICENSES/ISSL License.txt")
+
+#cpack_ifw_configure_component_group(backends)
+#cpack_ifw_configure_component_group(cpu-backend)
+#cpack_ifw_configure_component_group(cuda-backend)
+#cpack_ifw_configure_component_group(opencl-backend)
+#if (PACKAGE_MKL_DEPS)
+#  cpack_ifw_configure_component(mkl_dependencies)
+#endif ()
+#if (NOT APPLE)
+#  cpack_ifw_configure_component(opencl_dependencies)
+#endif ()
+#cpack_ifw_configure_component(common_backend_dependencies)
+#cpack_ifw_configure_component(cuda_dependencies)
+#cpack_ifw_configure_component(cpu)
+#cpack_ifw_configure_component(cuda)
+#cpack_ifw_configure_component(opencl)
+#cpack_ifw_configure_component(unified)
+#cpack_ifw_configure_component(headers)
+#cpack_ifw_configure_component(cmake)
+#cpack_ifw_configure_component(documentation)
+#cpack_ifw_configure_component(examples)
+#cpack_ifw_configure_component(licenses FORCED_INSTALLATION
+#  LICENSES "GLFW" ${zlib_lic_path} "FreeImage" ${fimg_lic_path}
+#  "Boost" ${boost_lic_path} "CLBlast, clFFT" ${apache_lic_path} "SIFT" ${sift_lic_path}
+#  "BSD3" ${bsd3_lic_path} "Intel MKL" ${issl_lic_path}
+#)
+#if (AF_INSTALL_FORGE_DEV)
+#  cpack_ifw_configure_component(forge)
+#endif ()
+
+
diff --git a/CMakeModules/CTestCustom.cmake b/CMakeModules/CTestCustom.cmake
new file mode 100644
index 0000000000..604f697465
--- /dev/null
+++ b/CMakeModules/CTestCustom.cmake
@@ -0,0 +1,32 @@
+# Copyright (c) 2019, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(CTEST_CUSTOM_ERROR_POST_CONTEXT 200)
+set(CTEST_CUSTOM_ERROR_PRE_CONTEXT 200)
+set(CTEST_CUSTOM_MAXIMUM_NUMBER_OF_ERRORS 300)
+set(CTEST_CUSTOM_MAXIMUM_NUMBER_OF_WARNINGS 300)
+
+if(WIN32)
+  if(CMAKE_GENERATOR MATCHES "Ninja")
+    set(CTEST_CUSTOM_POST_TEST ./bin/print_info.exe)
+  endif()
+else()
+  set(CTEST_CUSTOM_POST_TEST ./test/print_info)
+endif()
+
+list(APPEND CTEST_CUSTOM_COVERAGE_EXCLUDE
+  "test"
+
+  # All external and third_party libraries
+  "extern/.*"
+  "test/mmio/.*"
+  "src/backend/cpu/threads/.*"
+  "src/backend/cuda/cub/.*"
+  "cl2.hpp"
+
+  # Remove bin2cpp from coverage
+  "CMakeModules/.*")
diff --git a/CMakeModules/CUDACheckCompute.cmake b/CMakeModules/CUDACheckCompute.cmake
deleted file mode 100644
index 0a35dbf6cf..0000000000
--- a/CMakeModules/CUDACheckCompute.cmake
+++ /dev/null
@@ -1,37 +0,0 @@
-#############################
-#Sourced from:
-#https://raw.githubusercontent.com/jwetzl/CudaLBFGS/master/CheckComputeCapability.cmake
-#############################
-# Check for GPUs present and their compute capability
-# based on http://stackoverflow.com/questions/2285185/easiest-way-to-test-for-existence-of-cuda-capable-gpu-from-cmake/2297877#2297877 (Christopher Bruns)
-
-if(CUDA_FOUND)
-    message(STATUS "${CMAKE_MODULE_PATH}/cuda_compute_capability.c")
-    try_run(RUN_RESULT_VAR COMPILE_RESULT_VAR
-        ${CMAKE_BINARY_DIR}
-        ${CMAKE_MODULE_PATH}/cuda_compute_capability.c
-        CMAKE_FLAGS
-        -DINCLUDE_DIRECTORIES:STRING=${CUDA_TOOLKIT_INCLUDE}
-        -DLINK_LIBRARIES:STRING=${CUDA_CUDART_LIBRARY}
-        COMPILE_OUTPUT_VARIABLE COMPILE_OUTPUT_VAR
-        RUN_OUTPUT_VARIABLE RUN_OUTPUT_VAR)
-    message(STATUS "Compile: ${RUN_OUTPUT_VAR}")
-    if (COMPILE_RESULT_VAR)
-        message(STATUS "compiled -> " ${RUN_RESULT_VAR})
-    else()
-        message(STATUS "didn't compile")
-    endif()
-    # COMPILE_RESULT_VAR is TRUE when compile succeeds
-    # RUN_RESULT_VAR is zero when a GPU is found
-    if(COMPILE_RESULT_VAR AND NOT RUN_RESULT_VAR)
-        message(STATUS "worked")
-        set(CUDA_HAVE_GPU TRUE CACHE BOOL "Whether CUDA-capable GPU is present")
-        set(CUDA_COMPUTE_CAPABILITY ${RUN_OUTPUT_VAR} CACHE STRING "Compute capability of CUDA-capable GPU present")
-        set(CUDA_GENERATE_CODE "arch=compute_${CUDA_COMPUTE_CAPABILITY},code=sm_${CUDA_COMPUTE_CAPABILITY}" CACHE STRING "Which GPU architectures to generate code for (each arch/code pair will be passed as --generate-code option to nvcc, separate multiple pairs by ;)")
-        mark_as_advanced(CUDA_COMPUTE_CAPABILITY CUDA_GENERATE_CODE)
-        set(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch compute_${CUDA_COMPUTE_CAPABILITY})
-    else()
-        message(STATUS "didn't work")
-        set(CUDA_HAVE_GPU FALSE CACHE BOOL "Whether CUDA-capable GPU is present")
-    endif()
-endif()
diff --git a/CMakeModules/FetchContent/CMakeLists.cmake.in b/CMakeModules/FetchContent/CMakeLists.cmake.in
new file mode 100644
index 0000000000..9a7a7715ab
--- /dev/null
+++ b/CMakeModules/FetchContent/CMakeLists.cmake.in
@@ -0,0 +1,21 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+cmake_minimum_required(VERSION ${CMAKE_VERSION})
+
+# We name the project and the target for the ExternalProject_Add() call
+# to something that will highlight to the user what we are working on if
+# something goes wrong and an error message is produced.
+
+project(${contentName}-populate NONE)
+
+include(ExternalProject)
+ExternalProject_Add(${contentName}-populate
+                    ${ARG_EXTRA}
+                    SOURCE_DIR          "${ARG_SOURCE_DIR}"
+                    BINARY_DIR          "${ARG_BINARY_DIR}"
+                    CONFIGURE_COMMAND   ""
+                    BUILD_COMMAND       ""
+                    INSTALL_COMMAND     ""
+                    TEST_COMMAND        ""
+)
diff --git a/CMakeModules/FileToString.cmake b/CMakeModules/FileToString.cmake
new file mode 100644
index 0000000000..5491c8b126
--- /dev/null
+++ b/CMakeModules/FileToString.cmake
@@ -0,0 +1,75 @@
+# Function to turn an OpenCL source file into a C string within a source file.
+# xxd uses its input's filename to name the string and its length, so we
+# need to move them to a name that depends only on the path output, not its
+# input.  Otherwise, builds in different relative locations would put the
+# source into different variable names, and everything would fall over.
+# The variable name will be the filename with each '.' replaced by an
+# underscore, and the length variable will be that name followed by _len.
+#
+# Usage example:
+#
+# set(KERNELS a.cl b/c.cl)
+# resource_to_cxx_source(
+#   SOURCES ${KERNELS}
+#   VARNAME OUTPUTS
+# )
+# add_executable(foo ${OUTPUTS})
+#
+# The namespace they are placed in is taken from filename.namespace.
+#
+# For example, if the input file is kernel.cl, the two variables will be
+#  unsigned char ns::kernel_cl[];
+#  unsigned int ns::kernel_cl_len;
+#
+# where ns is the contents of kernel.cl.namespace.
+
+set(BIN2CPP_PROGRAM "bin2cpp")
+
+function(FILE_TO_STRING)
+    cmake_parse_arguments(RTCS "WITH_EXTENSION;NULLTERM" "VARNAME;EXTENSION;OUTPUT_DIR;TARGETS;NAMESPACE;BINARY" "SOURCES" ${ARGN})
+
+    set(_output_files "")
+    foreach(_input_file ${RTCS_SOURCES})
+        get_filename_component(_path "${_input_file}" PATH)
+        get_filename_component(_name "${_input_file}" NAME)
+        get_filename_component(var_name "${_input_file}" NAME)
+        get_filename_component(_name_we "${_input_file}" NAME_WE)
+
+        set(_namespace "${RTCS_NAMESPACE}")
+        set(_binary "")
+        if(${RTCS_BINARY})
+            set(_binary "--binary")
+        endif(${RTCS_BINARY})
+        if(RTCS_NULLTERM)
+            set(_nullterm "--nullterm")
+        endif(RTCS_NULLTERM)
+
+        string(REPLACE "." "_" var_name ${var_name})
+        string(REPLACE "\ " "_" namespace_name ${RTCS_NAMESPACE})
+
+        set(_output_path "${CMAKE_CURRENT_BINARY_DIR}/${RTCS_OUTPUT_DIR}")
+        if(RTCS_WITH_EXTENSION)
+          set(_output_file "${_output_path}/${var_name}.${RTCS_EXTENSION}")
+        else()
+          set(_output_file "${_output_path}/${_name_we}.${RTCS_EXTENSION}")
+        endif()
+
+        add_custom_command(
+            OUTPUT ${_output_file}
+            DEPENDS ${_input_file} ${BIN2CPP_PROGRAM}
+            COMMAND ${CMAKE_COMMAND} -E make_directory "${_output_path}"
+            COMMAND ${CMAKE_COMMAND} -E echo "\\#include \\<${_path}/${_name_we}.hpp\\>"  >>"${_output_file}"
+            COMMAND ${BIN2CPP_PROGRAM} --file ${_name} --namespace ${_namespace} --output ${_output_file} --name ${var_name} ${_binary} ${_nullterm}
+            WORKING_DIRECTORY "${_path}"
+            COMMENT "Compiling ${_input_file} to C++ source"
+        )
+
+
+        list(APPEND _output_files ${_output_file})
+    endforeach()
+    add_custom_target(${namespace_name}_${RTCS_OUTPUT_DIR}_bin_target DEPENDS ${_output_files})
+    set_target_properties(${namespace_name}_${RTCS_OUTPUT_DIR}_bin_target PROPERTIES FOLDER "Generated Targets")
+
+    set("${RTCS_VARNAME}" ${_output_files} PARENT_SCOPE)
+    set("${RTCS_TARGETS}" ${namespace_name}_${RTCS_OUTPUT_DIR}_bin_target PARENT_SCOPE)
+endfunction(FILE_TO_STRING)
diff --git a/CMakeModules/FindAF_MKL.cmake b/CMakeModules/FindAF_MKL.cmake
new file mode 100644
index 0000000000..2da1ed4584
--- /dev/null
+++ b/CMakeModules/FindAF_MKL.cmake
@@ -0,0 +1,499 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license. The complete license
+# agreement can be obtained at: http://arrayfire.com/licenses/BSD-3-Clause
+#
+# A FindMKL script based on the recommendations of Intel's Link Line
+# Advisor. It is currently only tested with the 2018 version of MKL on Windows,
+# Linux, and OSX, but it should work with older versions.
+#
+# To use this module call the mklvars.(sh,bat) script before you call cmake. This
+# script is located in the bin folder of your mkl installation. This will set the
+# MKLROOT environment variable which will be used to find the libraries on your system.
+#
+# If the oneAPI Base Toolkit is installed, setting the ONEAPI_ROOT environment
+# variable also lets this module find Intel oneMKL automatically.
+#
+# Example:
+# set(MKL_THREAD_LAYER "TBB")
+# find_package(MKL)
+#
+# add_executable(myapp main.cpp)
+# target_link_libraries(myapp PRIVATE MKL::Shared)
+#
+# The behavior of this module is controlled by the following variables:
+#
+# ``MKL_THREAD_LAYER``
+#   The threading layer used by the MKL library. This defines which
+#   library will be used to parallelize the MKL kernels. Possible
+#   options are TBB (default), GNU OpenMP, Intel OpenMP, and Sequential.
+#
+# This module provides the following :prop_tgt:'IMPORTED' targets:
+#
+# ``MKL::Shared``
+#   Target used to define and link all MKL libraries required by Intel's Link
+#   Line Advisor. This is usually the only thing you need to link against unless
+#   you want to link against the single dynamic library version of MKL
+#   (libmkl_rt.so)
+#
+# ``MKL::Static``
+#   Target used to define and link all MKL libraries required by Intel's Link
+#   Line Advisor for a static build. This will still link the threading libraries
+#   using dynamic linking as advised by the Intel Link Advisor
+#
+#  Optional:
+#
+# ``MKL::ThreadLayer{_STATIC}``
+#   Target used to define the threading layer(TBB, OpenMP, etc.) based on
+#   MKL_THREAD_LAYER variable.
+#
+# ``MKL::ThreadingLibrary``
+#   Target used to define the threading library(libtbb, libomp, etc) that the
+#   application will need to link against.
+#
+# ``MKL::Interface``
+#   Target used to determine which interface library to use(32bit int or 64bit
+#   int).
+#
+# ``MKL::Core``
+#   Target for the dynamic library dispatcher
+#
+# ``MKL::RT``
+#   Target for the single dynamic library
+#
+# ``MKL::{mkl_def;mkl_mc;mkl_mc3;mkl_avx;mkl_avx2;mkl_avx512}{_STATIC}``
+#   Targets for MKL kernel libraries.
+#
+# This module has the following result variables:
+#
+# ``MKL_INTERFACE_INTEGER_SIZE``
+#   This variable is set to the integer size in bytes on the platform where this
+#   module runs. This is usually 4 or 8; the valid set of values depends on the MKL library.
+
+include(CheckTypeSize)
+include(FindPackageHandleStandardArgs)
+
+if(DEFINED MKL_INTERFACE_INTEGER_SIZE)
+  set(INT_SIZE ${MKL_INTERFACE_INTEGER_SIZE})
+else()
+  check_type_size("int" INT_SIZE
+    BUILTIN_TYPES_ONLY LANGUAGE C)
+endif()
+
+set(MKL_THREAD_LAYER "TBB" CACHE STRING "The thread layer to choose for MKL")
+set_property(CACHE MKL_THREAD_LAYER PROPERTY STRINGS "TBB" "GNU OpenMP" "Intel OpenMP" "Sequential")
+
+message(STATUS "MKL: Thread Layer(${MKL_THREAD_LAYER}) Interface(${INT_SIZE}-byte Integer)")
+
+if(NOT MKL_THREAD_LAYER STREQUAL MKL_THREAD_LAYER_LAST)
+  unset(MKL::ThreadLayer CACHE)
+  unset(MKL::ThreadingLibrary CACHE)
+  unset(MKL_ThreadLayer_LINK_LIBRARY CACHE)
+  unset(MKL_ThreadLayer_STATIC_LINK_LIBRARY CACHE)
+  unset(MKL_ThreadLayer_DLL_LIBRARY CACHE)
+  unset(MKL_ThreadingLibrary_LINK_LIBRARY CACHE)
+  unset(MKL_ThreadingLibrary_STATIC_LINK_LIBRARY CACHE)
+  unset(MKL_ThreadingLibrary_DLL_LIBRARY CACHE)
+  set(MKL_THREAD_LAYER_LAST ${MKL_THREAD_LAYER} CACHE INTERNAL "" FORCE)
+endif()
+
+find_path(MKL_INCLUDE_DIR
+  NAMES
+    mkl.h
+    mkl_blas.h
+    mkl_cblas.h
+  PATHS
+    /opt/intel
+    /opt/intel/mkl
+    $ENV{MKLROOT}
+    $ENV{ONEAPI_ROOT}/mkl/latest
+    /opt/intel/compilers_and_libraries/linux/mkl
+  PATH_SUFFIXES
+    include
+    IntelSWTools/compilers_and_libraries/windows/mkl/include
+    )
+
+if(MKL_INCLUDE_DIR)
+  mark_as_advanced(MKL_INCLUDE_DIR)
+endif()
+
+function(find_version)
+  set(options "")
+  set(single_args VAR FILE REGEX)
+  set(multi_args "")
+  cmake_parse_arguments(find_version "${options}" "${single_args}" "${multi_args}" ${ARGN})
+
+  file(READ ${find_version_FILE} VERSION_FILE_CONTENTS)
+  string(REGEX MATCH ${find_version_REGEX}
+    VERSION_LINE "${VERSION_FILE_CONTENTS}")
+  set(${ARGV0} ${CMAKE_MATCH_1} PARENT_SCOPE)
+endfunction()
+
+if(MKL_INCLUDE_DIR)
+  find_file(MKL_VERSION_HEADER
+    NAMES
+      mkl_version.h
+    PATHS
+      ${MKL_INCLUDE_DIR})
+
+    find_version(MKL_MAJOR_VERSION
+      FILE ${MKL_VERSION_HEADER}
+      REGEX "__INTEL_MKL__ * ([0-9]+)")
+
+    find_version(MKL_MINOR_VERSION
+      FILE ${MKL_VERSION_HEADER}
+      REGEX "__INTEL_MKL_MINOR__ * ([0-9]+)")
+
+    find_version(MKL_UPDATE_VERSION
+      FILE ${MKL_VERSION_HEADER}
+      REGEX "__INTEL_MKL_UPDATE__ * ([0-9]+)")
+
+    find_version(MKL_VERSION_MACRO
+      FILE ${MKL_VERSION_HEADER}
+      REGEX "INTEL_MKL_VERSION * ([0-9]+)")
+
+  set(MKL_VERSION_STRING ${MKL_MAJOR_VERSION}.${MKL_MINOR_VERSION}.${MKL_UPDATE_VERSION})
+  mark_as_advanced(MKL_VERSION_HEADER)
+endif()
+
+
+find_path(MKL_FFTW_INCLUDE_DIR
+  NAMES
+    fftw3_mkl.h
+  HINTS
+    ${MKL_INCLUDE_DIR}/fftw)
+if(MKL_FFTW_INCLUDE_DIR)
+  mark_as_advanced(MKL_FFTW_INCLUDE_DIR)
+endif()
+
+if(WIN32)
+  if(${MSVC_VERSION} GREATER_EQUAL 1900)
+    set(msvc_dir "vc_mt")
+    set(shared_suffix "_dll")
+    set(md_suffix "md")
+  else()
+    message(WARNING "MKL: MSVC version not supported for MKL")
+  endif()
+endif()
+
+
+if(WIN32)
+  set(ENV_LIBRARY_PATHS "$ENV{LIB}")
+  if (${CMAKE_VERSION} VERSION_GREATER_EQUAL 3.15) # message(VERBOSE) requires CMake 3.15
+    message(VERBOSE "MKL environment variable(LIB): ${ENV_LIBRARY_PATHS}")
+  endif()
+else()
+  string(REGEX REPLACE ":" ";" ENV_LIBRARY_PATHS "$ENV{LIBRARY_PATH}")
+  if (${CMAKE_VERSION} VERSION_GREATER_EQUAL 3.15) # message(VERBOSE) requires CMake 3.15
+    message(VERBOSE "MKL environment variable(LIBRARY_PATH): ${ENV_LIBRARY_PATHS}")
+  endif()
+endif()
+
+# Finds and creates libraries for MKL with the MKL:: prefix
+#
+# Parameters:
+#    NAME:         A variable name describing the library
+#    LIBRARY_NAME: The library that needs to be searched
+#
+# OPTIONS:
+#    DLL_ONLY     On Windows do not search for .lib files. Ignored on other
+#                 platforms
+#    SEARCH_STATIC Search for static versions of the libraries as well as the
+#                  dynamic libraries
+#
+# Output Libraries:
+#    MKL::${NAME}
+#    MKL::${NAME}_STATIC
+#
+# Output Variables
+#    MKL_${NAME}_LINK_LIBRARY:        on Unix: *.so on Windows *.lib
+#    MKL_${NAME}_STATIC_LINK_LIBRARY: on Unix: *.a  on Windows *.lib
+#    MKL_${NAME}_DLL_LIBRARY:         on Unix: ""   on Windows *.dll
+function(find_mkl_library)
+  set(options "SEARCH_STATIC;DLL_ONLY")
+  set(single_args NAME LIBRARY_NAME)
+  set(multi_args "")
+
+  cmake_parse_arguments(mkl_args "${options}" "${single_args}" "${multi_args}" ${ARGN})
+
+  if(TARGET MKL::${mkl_args_NAME})
+    return()
+  endif()
+
+  add_library(MKL::${mkl_args_NAME}        SHARED IMPORTED)
+  add_library(MKL::${mkl_args_NAME}_STATIC STATIC IMPORTED)
+
+  if(NOT (WIN32 AND mkl_args_DLL_ONLY))
+    list(APPEND CMAKE_FIND_LIBRARY_SUFFIXES ".so.1;.so.2;.so.3;.so.4;.so.12")
+    find_library(MKL_${mkl_args_NAME}_LINK_LIBRARY
+      NAMES
+        ${mkl_args_LIBRARY_NAME}${shared_suffix}
+        ${mkl_args_LIBRARY_NAME}${md_suffix}
+        lib${mkl_args_LIBRARY_NAME}${md_suffix}
+        ${mkl_args_LIBRARY_NAME}
+      PATHS
+        /opt/intel/mkl/lib
+        /opt/intel/tbb/lib
+        /opt/intel/lib
+        $ENV{MKLROOT}/lib
+        $ENV{ONEAPI_ROOT}/mkl/latest/lib
+        ${ENV_LIBRARY_PATHS}
+        /opt/intel/compilers_and_libraries/linux/mkl/lib
+      PATH_SUFFIXES
+        IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64
+        IntelSWTools/compilers_and_libraries/windows/compiler/lib/intel64
+        IntelSWTools/compilers_and_libraries/windows/tbb/lib/intel64/${msvc_dir}
+        ""
+        intel64
+        intel64/gcc4.7)
+    list(REMOVE_ITEM CMAKE_FIND_LIBRARY_SUFFIXES ".so.1")
+    if(MKL_${mkl_args_NAME}_LINK_LIBRARY)
+      if (CMAKE_VERSION VERSION_GREATER_EQUAL 3.15) # message(VERBOSE) requires CMake 3.15
+        message(VERBOSE "MKL_${mkl_args_NAME}_LINK_LIBRARY: ${MKL_${mkl_args_NAME}_LINK_LIBRARY}")
+      endif()
+      mark_as_advanced(MKL_${mkl_args_NAME}_LINK_LIBRARY)
+    endif()
+  endif()
+
+  #message(STATUS "NAME: ${mkl_args_NAME} LIBNAME: ${mkl_args_LIBRARY_NAME} MKL_${mkl_args_NAME}_LINK_LIBRARY  ${MKL_${mkl_args_NAME}_LINK_LIBRARY}")
+
+  if(mkl_args_SEARCH_STATIC)
+    find_library(MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY
+      NAMES
+        ${CMAKE_STATIC_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}${CMAKE_STATIC_LIBRARY_SUFFIX}
+      PATHS
+        /opt/intel/mkl/lib
+        /opt/intel/tbb/lib
+        /opt/intel/lib
+        $ENV{MKLROOT}/lib
+        $ENV{ONEAPI_ROOT}/mkl/latest/lib
+        ${ENV_LIBRARY_PATHS}
+        /opt/intel/compilers_and_libraries/linux/mkl/lib
+      PATH_SUFFIXES
+        ""
+        intel64
+        intel64/gcc4.7
+        IntelSWTools/compilers_and_libraries/windows/mkl/lib/intel64
+        IntelSWTools/compilers_and_libraries/windows/compiler/lib/intel64
+        IntelSWTools/compilers_and_libraries/windows/tbb/lib/intel64/${msvc_dir}
+        )
+    if(MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY)
+      if(CMAKE_VERSION VERSION_GREATER_EQUAL 3.15) # message(VERBOSE) requires CMake 3.15
+        message(VERBOSE "MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY: ${MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY}")
+      endif()
+    endif()
+    mark_as_advanced(MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY)
+  endif()
+
+  set_target_properties(MKL::${mkl_args_NAME}
+    PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR}"
+      IMPORTED_LOCATION "${MKL_${mkl_args_NAME}_LINK_LIBRARY}"
+      IMPORTED_NO_SONAME TRUE)
+
+  set_target_properties(MKL::${mkl_args_NAME}_STATIC
+      PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR}"
+      IMPORTED_LOCATION "${MKL_${mkl_args_NAME}_STATIC_LINK_LIBRARY}"
+      IMPORTED_NO_SONAME TRUE)
+
+  if(WIN32)
+    find_file(MKL_${mkl_args_NAME}_DLL_LIBRARY
+      NAMES
+        ${CMAKE_SHARED_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}${CMAKE_SHARED_LIBRARY_SUFFIX}
+        ${CMAKE_SHARED_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}${md_suffix}${CMAKE_SHARED_LIBRARY_SUFFIX}
+        ${CMAKE_SHARED_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}.2${CMAKE_SHARED_LIBRARY_SUFFIX}
+        ${CMAKE_SHARED_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}.5${CMAKE_SHARED_LIBRARY_SUFFIX}
+        ${CMAKE_SHARED_LIBRARY_PREFIX}${mkl_args_LIBRARY_NAME}12${CMAKE_SHARED_LIBRARY_SUFFIX}
+        lib${mkl_args_LIBRARY_NAME}${md_suffix}${CMAKE_SHARED_LIBRARY_SUFFIX}
+      PATHS
+        $ENV{LIB}
+        $ENV{LIBRARY_PATH}
+        $ENV{MKLROOT}/bin
+        $ENV{TBBROOT}/bin
+        $ENV{ONEAPI_ROOT}/compiler/latest/bin
+      PATH_SUFFIXES
+        IntelSWTools/compilers_and_libraries/windows/redist/intel64/mkl
+        IntelSWTools/compilers_and_libraries/windows/redist/intel64/compiler
+        IntelSWTools/compilers_and_libraries/windows/redist/intel64/tbb/${msvc_dir}
+      NO_SYSTEM_ENVIRONMENT_PATH)
+
+    set_target_properties(MKL::${mkl_args_NAME}
+      PROPERTIES
+        IMPORTED_LOCATION "${MKL_${mkl_args_NAME}_DLL_LIBRARY}"
+        IMPORTED_IMPLIB "${MKL_${mkl_args_NAME}_LINK_LIBRARY}")
+
+    mark_as_advanced(MKL_${mkl_args_NAME}_DLL_LIBRARY)
+  endif()
+endfunction()
+
+
+find_mkl_library(NAME Core LIBRARY_NAME mkl_core SEARCH_STATIC)
+find_mkl_library(NAME RT LIBRARY_NAME mkl_rt)
+
+if(AF_BUILD_ONEAPI)
+    find_mkl_library(NAME Sycl LIBRARY_NAME sycl DLL_ONLY)
+    find_mkl_library(NAME SyclLapack LIBRARY_NAME mkl_sycl_lapack DLL_ONLY)
+    find_mkl_library(NAME SyclDft LIBRARY_NAME mkl_sycl_dft DLL_ONLY)
+    find_mkl_library(NAME SyclBlas LIBRARY_NAME mkl_sycl_blas DLL_ONLY)
+    find_mkl_library(NAME SyclSparse LIBRARY_NAME mkl_sycl_sparse DLL_ONLY)
+    find_mkl_library(NAME SyclDataFitting LIBRARY_NAME mkl_sycl_data_fitting DLL_ONLY)
+    find_mkl_library(NAME SyclRNG LIBRARY_NAME mkl_sycl_rng DLL_ONLY)
+    find_mkl_library(NAME SyclStats LIBRARY_NAME mkl_sycl_stats DLL_ONLY)
+    find_mkl_library(NAME SyclVM LIBRARY_NAME mkl_sycl_vm DLL_ONLY)
+endif()
+
+# MKL can link against Intel OpenMP, GNU OpenMP, TBB, and Sequential
+if(MKL_THREAD_LAYER STREQUAL "Intel OpenMP")
+  find_mkl_library(NAME ThreadLayer LIBRARY_NAME mkl_intel_thread SEARCH_STATIC)
+  find_mkl_library(NAME ThreadingLibrary LIBRARY_NAME iomp5)
+elseif(MKL_THREAD_LAYER STREQUAL "GNU OpenMP")
+  find_package(OpenMP REQUIRED)
+  find_mkl_library(NAME ThreadLayer LIBRARY_NAME mkl_gnu_thread SEARCH_STATIC)
+  set(MKL_ThreadingLibrary_LINK_LIBRARY ${OpenMP_gomp_LIBRARY})
+  if(MKL_ThreadingLibrary_LINK_LIBRARY)
+    mark_as_advanced(MKL_ThreadingLibrary_LINK_LIBRARY)
+  endif()
+  if(NOT TARGET MKL::ThreadingLibrary)
+    add_library(MKL::ThreadingLibrary SHARED IMPORTED)
+    set_target_properties(MKL::ThreadingLibrary
+      PROPERTIES
+        IMPORTED_LOCATION "${MKL_ThreadingLibrary_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES OpenMP::OpenMP_CXX)
+  endif()
+elseif(MKL_THREAD_LAYER STREQUAL "TBB")
+  find_mkl_library(NAME ThreadLayer LIBRARY_NAME mkl_tbb_thread SEARCH_STATIC)
+  find_mkl_library(NAME ThreadingLibrary LIBRARY_NAME tbb)
+elseif(MKL_THREAD_LAYER STREQUAL "Sequential")
+  find_mkl_library(NAME ThreadLayer LIBRARY_NAME mkl_sequential SEARCH_STATIC)
+endif()
+
+if("${INT_SIZE}" EQUAL 4)
+  set(MKL_INTERFACE_INTEGER_SIZE 4)
+  set(MKL_INTERFACE "lp64")
+  find_mkl_library(NAME Interface LIBRARY_NAME mkl_intel_lp64 SEARCH_STATIC)
+else()
+  set(MKL_INTERFACE_INTEGER_SIZE 8)
+  set(MKL_INTERFACE "ilp64")
+  find_mkl_library(NAME Interface LIBRARY_NAME mkl_intel_ilp64 SEARCH_STATIC)
+  find_mkl_library(NAME InterfaceLP LIBRARY_NAME mkl_intel_lp64 SEARCH_STATIC)
+endif()
+
+set(MKL_KernelLibraries "mkl_def;mkl_mc;mkl_mc3;mkl_avx;mkl_avx2;mkl_avx512")
+
+foreach(lib ${MKL_KernelLibraries})
+  find_mkl_library(NAME ${lib} LIBRARY_NAME ${lib} DLL_ONLY)
+
+  if(MKL_${lib}_LINK_LIBRARY)
+    list(APPEND MKL_RUNTIME_KERNEL_LIBRARIES_TMP ${MKL_${lib}_LINK_LIBRARY})
+  endif()
+
+  if(MKL_${lib}_DLL_LIBRARY)
+    list(APPEND MKL_RUNTIME_KERNEL_LIBRARIES_TMP ${MKL_${lib}_DLL_LIBRARY})
+  endif()
+endforeach()
+
+set(MKL_RUNTIME_KERNEL_LIBRARIES "${MKL_RUNTIME_KERNEL_LIBRARIES_TMP}" CACHE STRING
+    "MKL kernel libraries targeting different CPU architectures")
+mark_as_advanced(MKL_RUNTIME_KERNEL_LIBRARIES)
+
+# Bypass developer warning that the first argument to find_package_handle_standard_args (MKL_...) does not match
+# the name of the calling package (MKL)
+# https://cmake.org/cmake/help/v3.17/module/FindPackageHandleStandardArgs.html
+set(FPHSA_NAME_MISMATCHED TRUE)
+
+find_package_handle_standard_args(MKL_Shared
+  FAIL_MESSAGE "Could NOT find MKL: Source the compilervars.sh or mklvars.sh scripts included with your installation of MKL. This script searches for the libraries in the MKLROOT, LIBRARY_PATH (Linux), and LIB (Windows) environment variables"
+  VERSION_VAR  MKL_VERSION_STRING
+  REQUIRED_VARS MKL_INCLUDE_DIR
+                MKL_Core_LINK_LIBRARY
+                MKL_Interface_LINK_LIBRARY
+                MKL_ThreadLayer_LINK_LIBRARY)
+
+find_package_handle_standard_args(MKL_Static
+  FAIL_MESSAGE "Could NOT find MKL: Source the compilervars.sh or mklvars.sh scripts included with your installation of MKL. This script searches for the libraries in the MKLROOT, LIBRARY_PATH (Linux), and LIB (Windows) environment variables"
+  VERSION_VAR   MKL_VERSION_STRING
+  REQUIRED_VARS MKL_INCLUDE_DIR
+                MKL_Core_STATIC_LINK_LIBRARY
+                MKL_Interface_STATIC_LINK_LIBRARY
+                MKL_ThreadLayer_STATIC_LINK_LIBRARY)
+
+if(NOT WIN32)
+  find_library(M_LIB m)
+  mark_as_advanced(M_LIB)
+endif()
+
+if(TARGET MKL::RT)
+  set_target_properties(MKL::RT
+  PROPERTIES
+    INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}")
+endif()
+
+if(MKL_Shared_FOUND AND NOT TARGET MKL::Shared)
+  add_library(MKL::Shared SHARED IMPORTED)
+  if(MKL_THREAD_LAYER STREQUAL "Sequential")
+    set_target_properties(MKL::Shared
+      PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "MKL::Interface;MKL::ThreadLayer;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+  else()
+    set_target_properties(MKL::Shared
+      PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "MKL::Interface;MKL::ThreadLayer;MKL::ThreadingLibrary;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+  endif()
+  if(WIN32)
+    set_target_properties(MKL::Shared
+      PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_DLL_LIBRARY}"
+        IMPORTED_IMPLIB "${MKL_Core_LINK_LIBRARY}")
+  endif()
+endif()
+
+if(MKL_Static_FOUND AND NOT TARGET MKL::Static)
+  add_library(MKL::Static STATIC IMPORTED)
+
+  if(UNIX AND NOT APPLE)
+    if(MKL_THREAD_LAYER STREQUAL "Sequential")
+      set_target_properties(MKL::Static
+        PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_STATIC_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "-Wl,--start-group;MKL::Core_STATIC;MKL::Interface_STATIC;MKL::ThreadLayer_STATIC;-Wl,--end-group;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+    else()
+      set_target_properties(MKL::Static
+        PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_STATIC_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "-Wl,--start-group;MKL::Core_STATIC;MKL::Interface_STATIC;MKL::ThreadLayer_STATIC;-Wl,--end-group;MKL::ThreadingLibrary;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+    endif()
+  else()
+    if(MKL_THREAD_LAYER STREQUAL "Sequential")
+      set_target_properties(MKL::Static
+        PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_STATIC_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "MKL::Core_STATIC;MKL::Interface_STATIC;MKL::ThreadLayer_STATIC;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+    else()
+      set_target_properties(MKL::Static
+        PROPERTIES
+        IMPORTED_LOCATION "${MKL_Core_STATIC_LINK_LIBRARY}"
+        INTERFACE_LINK_LIBRARIES "MKL::Core_STATIC;MKL::Interface_STATIC;MKL::ThreadLayer_STATIC;MKL::ThreadingLibrary;${CMAKE_DL_LIBS};${M_LIB}"
+        INTERFACE_INCLUDE_DIRECTORIES "${MKL_INCLUDE_DIR};${MKL_FFTW_INCLUDE_DIR}"
+        IMPORTED_NO_SONAME TRUE)
+    endif()
+  endif()
+endif()
+
+set(MKL_FOUND OFF)
+if(MKL_Shared_FOUND OR MKL_Static_FOUND)
+  set(MKL_FOUND ON)
+endif()
diff --git a/CMakeModules/FindCBLAS.cmake b/CMakeModules/FindCBLAS.cmake
index ee47b3af5a..31b6f72dd5 100644
--- a/CMakeModules/FindCBLAS.cmake
+++ b/CMakeModules/FindCBLAS.cmake
@@ -21,30 +21,95 @@ SET(CBLAS_INCLUDE_DIR CACHE STRING
 SET(CBLAS_INCLUDE_FILE CACHE STRING
   "CBLAS header name")
 
+
+# If a valid PkgConfig configuration for cblas is found, this overrides and cancels
+# all further checks.
+FIND_PACKAGE(PkgConfig)
+IF(PKG_CONFIG_FOUND)
+  PKG_CHECK_MODULES(PC_CBLAS cblas)
+ENDIF(PKG_CONFIG_FOUND)
+
+IF(PC_CBLAS_FOUND)
+
+  FOREACH(PC_LIB ${PC_CBLAS_LIBRARIES})
+    FIND_LIBRARY(${PC_LIB}_LIBRARY NAMES ${PC_LIB} HINTS ${PC_CBLAS_LIBRARY_DIRS} )
+    IF (NOT ${PC_LIB}_LIBRARY)
+      message(FATAL_ERROR "Something is wrong in your pkg-config file - lib ${PC_LIB} not found in ${PC_CBLAS_LIBRARY_DIRS}")
+    ENDIF (NOT ${PC_LIB}_LIBRARY)
+    LIST(APPEND CBLAS_LIBRARIES ${${PC_LIB}_LIBRARY})
+  ENDFOREACH(PC_LIB)
+
+  FIND_PATH(CBLAS_INCLUDE_DIRS NAMES cblas.h HINTS ${PC_CBLAS_INCLUDE_DIRS} )
+  IF (NOT CBLAS_INCLUDE_DIRS)
+    message(FATAL_ERROR "Something is wrong in your pkg-config file - cblas.h not found in ${PC_CBLAS_INCLUDE_DIRS}")
+  ENDIF (NOT CBLAS_INCLUDE_DIRS)
+  SET(CBLAS_INCLUDE_DIR ${CBLAS_INCLUDE_DIRS})
+
+  FIND_PACKAGE_HANDLE_STANDARD_ARGS(CBLAS DEFAULT_MSG CBLAS_LIBRARIES CBLAS_INCLUDE_DIR)
+  MARK_AS_ADVANCED(
+    CBLAS_LIBRARIES
+    CBLAS_INCLUDE_DIR
+    CBLAS_INCLUDE_DIRS)
+
+ELSE(PC_CBLAS_FOUND)
+
 SET(INTEL_MKL_ROOT_DIR CACHE STRING
   "Root directory of the Intel MKL")
 
 SET(CBLAS_ROOT_DIR CACHE STRING
   "Root directory for custom CBLAS implementation")
 
+MARK_AS_ADVANCED(INTEL_MKL_ROOT_DIR CBLAS_ROOT_DIR)
+
 INCLUDE(CheckTypeSize)
 CHECK_TYPE_SIZE("void*" SIZE_OF_VOIDP)
 
-SET(CBLAS_LIB_DIR)
+IF (NOT INTEL_MKL_ROOT_DIR)
+  SET(INTEL_MKL_ROOT_DIR $ENV{INTEL_MKL_ROOT})
+ENDIF()
 
-SET(CBLAS_ROOT_DIR "${INTEL_MKL_ROOT_DIR}")
+IF(NOT CBLAS_ROOT_DIR)
 
-IF(CBLAS_ROOT_DIR)
-    IF(INTEL_MKL_ROOT_DIR)
-      IF ("${SIZE_OF_VOIDP}" EQUAL 8)
-        SET(CBLAS_LIB_DIR "${INTEL_MKL_ROOT_DIR}/lib/intel64")
-      ELSE()
-        SET(CBLAS_LIB_DIR "${INTEL_MKL_ROOT_DIR}/lib/ia32")
-      ENDIF()
+  IF (DEFINED ENV{CBLASDIR})
+    SET(CBLAS_ROOT_DIR $ENV{CBLASDIR})
+    IF ("${SIZE_OF_VOIDP}" EQUAL 8)
+        SET(CBLAS_LIB64_DIR "${CBLAS_ROOT_DIR}/lib64")
+    ELSE()
+        SET(CBLAS_LIB32_DIR "${CBLAS_ROOT_DIR}/lib")
     ENDIF()
-    SET(CBLAS_INCLUDE_DIR "${INTEL_MKL_ROOT_DIR}/include")
+  ENDIF()
+
+  IF (DEFINED ENV{CBLAS_ROOT_DIR})
+    SET(CBLAS_ROOT_DIR $ENV{CBLAS_ROOT_DIR})
+    IF ("${SIZE_OF_VOIDP}" EQUAL 8)
+        SET(CBLAS_LIB64_DIR "${CBLAS_ROOT_DIR}/lib64")
+    ELSE()
+        SET(CBLAS_LIB32_DIR "${CBLAS_ROOT_DIR}/lib")
+    ENDIF()
+  ENDIF()
+
+  IF (INTEL_MKL_ROOT_DIR)
+    SET(CBLAS_ROOT_DIR ${INTEL_MKL_ROOT_DIR})
+    IF(APPLE)
+        IF ("${SIZE_OF_VOIDP}" EQUAL 8)
+            SET(CBLAS_LIB64_DIR "${CBLAS_ROOT_DIR}/lib")
+        ELSE()
+            SET(CBLAS_LIB32_DIR "${CBLAS_ROOT_DIR}/lib")
+        ENDIF()
+    ELSE(APPLE) # Windows and Linux
+        IF ("${SIZE_OF_VOIDP}" EQUAL 8)
+            SET(CBLAS_LIB64_DIR "${CBLAS_ROOT_DIR}/lib/intel64")
+        ELSE()
+            SET(CBLAS_LIB32_DIR "${CBLAS_ROOT_DIR}/lib/ia32")
+        ENDIF()
+    ENDIF(APPLE)
+  ENDIF()
 ENDIF()
 
+if(CBLAS_ROOT_DIR)
+  set(CBLAS_INCLUDE_DIR "${CBLAS_ROOT_DIR}/include")
+endif()
+
 # Old CBLAS search
 SET(_verbose TRUE)
 INCLUDE(CheckFunctionExists)
@@ -57,7 +122,7 @@ MACRO(CHECK_ALL_LIBRARIES
     _flags
     _list
     _include
-    _search_include,
+    _search_include
     _libraries_work_check)
   # This macro checks for the existence of the combination of fortran libraries
   # given by _list.  If the combination is found, this macro checks (using the
@@ -93,14 +158,14 @@ MACRO(CHECK_ALL_LIBRARIES
           NAMES ${_library}
           PATHS /usr/local/lib /usr/lib /usr/local/lib64 /usr/lib64
           ENV DYLD_LIBRARY_PATH
-          "{CBLAS_LIB_DIR}"
+          "${CBLAS_LIB_DIR}" "${CBLAS_LIB32_DIR}" "${CBLAS_LIB64_DIR}"
           )
       ELSE(APPLE)
         FIND_LIBRARY(${_prefix}_${_library}_LIBRARY
-          NAMES ${_library}
+          NAMES ${_library} lib${_library}
           PATHS /usr/local/lib /usr/lib /usr/local/lib64 /usr/lib64
           ENV LD_LIBRARY_PATH
-          "${CBLAS_LIB_DIR}"
+          "${CBLAS_LIB_DIR}" "${CBLAS_LIB32_DIR}" "${CBLAS_LIB64_DIR}"
           PATH_SUFFIXES atlas
           )
         IF(NOT ${_prefix}_${library}_LIBRARY)
@@ -109,30 +174,36 @@ MACRO(CHECK_ALL_LIBRARIES
               NAMES ${_library}
               PATHS /usr/local/lib /usr/lib /usr/local/lib64 /usr/lib64
               ENV LD_LIBRARY_PATH
-              "${CBLAS_LIB_DIR}"
+              "${CBLAS_LIB_DIR}" "${CBLAS_LIB32_DIR}" "${CBLAS_LIB64_DIR}"
               PATH_SUFFIXES atlas
               )
           ENDIF(NOT ${_prefix}_${library}_LIBRARY)
       ENDIF(APPLE)
       MARK_AS_ADVANCED(${_prefix}_${_library}_LIBRARY)
 
-      IF(${_prefix}_${_library}_LIBRARY)
-        GET_FILENAME_COMPONENT(_path ${${_prefix}_${_library}_LIBRARY} PATH)
-        LIST(APPEND _paths ${_path}/../include ${_path}/../../include ${CBLAS_ROOT_DIR}/include)
-      ENDIF(${_prefix}_${_library}_LIBRARY)
-
       SET(${LIBRARIES} ${${LIBRARIES}} ${${_prefix}_${_library}_LIBRARY})
       SET(_libraries_work ${${_prefix}_${_library}_LIBRARY})
-
     ENDIF(_libraries_work)
   ENDFOREACH(_library)
 
   # Test include
   SET(_bug_search_include ${_search_include}) #CMAKE BUG!!! SHOULD NOT BE THAT
+  SET(_bug_libraries_work_check ${_libraries_work_check}) #CMAKE BUG!!! SHOULD NOT BE THAT
 
   IF(_bug_search_include)
-    FIND_PATH(${_prefix}${_combined_name}_INCLUDE ${_include} ${_paths})
+    FIND_PATH(${_prefix}${_combined_name}_INCLUDE ${_include}
+      PATHS
+      ${CBLAS_ROOT_DIR}/include
+      /opt/intel/mkl/include
+      /usr/include
+      /usr/local/include
+      /sw/include
+      /opt/local/include
+      PATH_SUFFIXES
+      openblas
+      )
     MARK_AS_ADVANCED(${_prefix}${_combined_name}_INCLUDE)
+
     IF(${_prefix}${_combined_name}_INCLUDE)
       IF (_verbose)
         MESSAGE(STATUS "Includes found")
@@ -142,13 +213,13 @@ MACRO(CHECK_ALL_LIBRARIES
     ELSE(${_prefix}${_combined_name}_INCLUDE)
       SET(_libraries_work FALSE)
     ENDIF(${_prefix}${_combined_name}_INCLUDE)
+
   ELSE(_bug_search_include)
     SET(${_prefix}_INCLUDE_DIR)
     SET(${_prefix}_INCLUDE_FILE ${_include})
   ENDIF(_bug_search_include)
 
-
-  IF (_libraries_work_check)
+  IF (_bug_libraries_work_check)
     # Test this combination of libraries.
     IF(_libraries_work)
       SET(CMAKE_REQUIRED_LIBRARIES ${_flags} ${${LIBRARIES}})
@@ -156,9 +227,13 @@ MACRO(CHECK_ALL_LIBRARIES
       SET(CMAKE_REQUIRED_LIBRARIES)
       MARK_AS_ADVANCED(${_prefix}${_combined_name}_WORKS)
       SET(_libraries_work ${${_prefix}${_combined_name}_WORKS})
+
       IF(_verbose AND _libraries_work)
-        MESSAGE(STATUS "Libraries found")
+        MESSAGE(STATUS "CBLAS Symbols FOUND")
+      ELSE()
+        MESSAGE(STATUS "CBLAS Symbols NOTFOUND")
       ENDIF(_verbose AND _libraries_work)
+
     ENDIF(_libraries_work)
   ENDIF()
 
@@ -193,31 +268,6 @@ IF( NOT CBLAS_LIBRARIES )
     TRUE)
 ENDIF( NOT CBLAS_LIBRARIES )
 
-# MKL
-IF (INTEL_MKL_ROOT_DIR)
-  IF ("${SIZE_OF_VOIDP}" EQUAL 8)
-    SET(MKL_CBLAS_EXT mkl_gf_lp64)
-  ELSE()
-    SET(MKL_CBLAS_EXT mkl_gf)
-  ENDIF()
-
-  IF(NOT CBLAS_LIBRARIES)
-    CHECK_ALL_LIBRARIES(
-      CBLAS_LIBRARIES
-      CBLAS
-      cblas_dgemm
-      ""
-      "${MKL_CBLAS_EXT};mkl_intel_thread"
-      "mkl_cblas.h"
-      TRUE,
-      FALSE)
-  ENDIF(NOT CBLAS_LIBRARIES)
-
-  IF (CBLAS_LIBRARIES)
-    SET(MKL_CBLAS_FOUND TRUE)
-  ENDIF()
-ENDIF()
-
 # CBLAS in ATLAS library? (http://math-atlas.sourceforge.net/)
 IF(NOT CBLAS_LIBRARIES)
   CHECK_ALL_LIBRARIES(
@@ -257,6 +307,20 @@ IF(NOT CBLAS_LIBRARIES)
     TRUE)
 ENDIF(NOT CBLAS_LIBRARIES)
 
+# Generic BLAS+CBLAS library
+# Debian-based systems ship them as a single library
+IF(NOT CBLAS_LIBRARIES)
+  CHECK_ALL_LIBRARIES(
+    CBLAS_LIBRARIES
+    CBLAS
+    dgemm_
+    ""
+    "blas"
+    "cblas.h"
+    TRUE
+    TRUE)
+ENDIF(NOT CBLAS_LIBRARIES)
+
 IF(CBLAS_LIBRARIES)
   IF (NOT MKL_CBLAS_FOUND)
     SET(CBLAS_FOUND TRUE)
@@ -277,3 +341,12 @@ IF(NOT CBLAS_FIND_QUIETLY)
     MESSAGE(STATUS "CBLAS library not found.")
   ENDIF()
 ENDIF(NOT CBLAS_FIND_QUIETLY)
+
+ENDIF(PC_CBLAS_FOUND)
+
+MARK_AS_ADVANCED(
+    CBLAS_INCLUDE_DIR
+    CBLAS_INCLUDE_FILE
+    CBLAS_LIBRARIES
+    cblas_LIBRARY
+    blas_LIBRARY)
diff --git a/CMakeModules/FindFFTW.cmake b/CMakeModules/FindFFTW.cmake
index a725f64ecd..15bc7843d4 100644
--- a/CMakeModules/FindFFTW.cmake
+++ b/CMakeModules/FindFFTW.cmake
@@ -20,75 +20,52 @@
 ######## This FindFFTW.cmake file is a copy of the file from the eigen library
 ######## http://code.metager.de/source/xref/lib/eigen/cmake/FindFFTW.cmake
 
-IF(NOT FFTW_ROOT AND ENV{FFTWDIR})
-    SET(FFTW_ROOT $ENV{FFTWDIR})
-ENDIF()
+find_package(PkgConfig)
+pkg_check_modules(PKG_FFTW "fftw3")
 
-# Check if we can use PkgConfig
-FIND_PACKAGE(PkgConfig)
+find_path( FFTW_INCLUDE_DIR
+  NAMES "fftw3.h"
+  PATHS ${FFTW_ROOT}
+        ${CMAKE_SYSTEM_INCLUDE_PATH}
+        ${CMAKE_SYSTEM_PREFIX_PATH}
+        ${PKG_FFTW_INCLUDE_DIRS}
+  PATH_SUFFIXES "include" "include/fftw"
+  )
 
-#Determine from PKG
-IF(PKG_CONFIG_FOUND AND NOT FFTW_ROOT)
-    PKG_CHECK_MODULES( PKG_FFTW QUIET "fftw3")
-ENDIF()
+find_library( FFTW_LIBRARY
+  NAMES "fftw3" "libfftw3-3" "fftw3-3"
+  PATHS ${FFTW_ROOT}
+        ${CMAKE_SYSTEM_PREFIX_PATH}
+        ${PKG_FFTW_LIBRARY_DIRS}
+  PATH_SUFFIXES "lib" "lib64"
+)
 
-#Check whether to search static or dynamic libs
-SET(CMAKE_FIND_LIBRARY_SUFFIXES_SAV ${CMAKE_FIND_LIBRARY_SUFFIXES})
-IF(${FFTW_USE_STATIC_LIBS} )
-    SET(CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_STATIC_LIBRARY_SUFFIX})
-ELSE()
-    SET(CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_SHARED_LIBRARY_SUFFIX})
-ENDIF()
+find_library( FFTWF_LIBRARY
+  NAMES "fftw3f" "libfftw3f-3" "fftw3f-3"
+  PATHS ${FFTW_ROOT}
+        ${CMAKE_SYSTEM_PREFIX_PATH}
+        ${CMAKE_SYSTEM_LIBRARY_PATH}
+        ${PKG_FFTW_LIBRARY_DIRS}
+  PATH_SUFFIXES "lib" "lib64"
+)
 
-IF(FFTW_ROOT)
-    #find libs
-    FIND_LIBRARY(
-        FFTW_LIB
-        NAMES "fftw3" "libfftw3-3" "fftw3-3"
-        PATHS ${FFTW_ROOT}
-        PATH_SUFFIXES "lib" "lib64"
-        NO_DEFAULT_PATH
-        )
-    FIND_LIBRARY(
-        FFTWF_LIB
-        NAMES "fftw3f" "libfftw3f-3" "fftw3f-3"
-        PATHS ${FFTW_ROOT}
-        PATH_SUFFIXES "lib" "lib64"
-        NO_DEFAULT_PATH
-        )
+mark_as_advanced(FFTW_INCLUDE_DIR FFTW_LIBRARY FFTWF_LIBRARY)
 
-    #find includes
-    FIND_PATH(
-        FFTW_INCLUDES
-        NAMES "fftw3.h"
-        PATHS ${FFTW_ROOT}
-        PATH_SUFFIXES "include"
-        NO_DEFAULT_PATH
-        )
-ELSE()
-    FIND_LIBRARY(
-        FFTW_LIB
-        NAMES "fftw3"
-        PATHS ${PKG_FFTW_LIBRARY_DIRS} ${LIB_INSTALL_DIR}
-        )
-    FIND_LIBRARY(
-        FFTWF_LIB
-        NAMES "fftw3f"
-        PATHS ${PKG_FFTW_LIBRARY_DIRS} ${LIB_INSTALL_DIR}
-        )
-    FIND_PATH(
-        FFTW_INCLUDES
-        NAMES "fftw3.h"
-        PATHS ${PKG_FFTW_INCLUDE_DIRS} ${INCLUDE_INSTALL_DIR}
-        )
-ENDIF(FFTW_ROOT)
+include(FindPackageHandleStandardArgs)
+find_package_handle_standard_args(FFTW DEFAULT_MSG
+    FFTW_INCLUDE_DIR FFTW_LIBRARY FFTWF_LIBRARY)
 
-SET(FFTW_LIBRARIES ${FFTW_LIB} ${FFTWF_LIB})
+if (FFTW_FOUND AND NOT TARGET FFTW::FFTW)
+  add_library(FFTW::FFTW UNKNOWN IMPORTED)
+  set_target_properties(FFTW::FFTW PROPERTIES
+    IMPORTED_LINK_INTERFACE_LANGUAGES "C"
+    IMPORTED_LOCATION "${FFTW_LIBRARY}"
+    INTERFACE_INCLUDE_DIRECTORIES "${FFTW_INCLUDE_DIR}")
 
-SET(CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_FIND_LIBRARY_SUFFIXES_SAV})
+  add_library(FFTW::FFTWF UNKNOWN IMPORTED)
+  set_target_properties(FFTW::FFTWF PROPERTIES
+    IMPORTED_LINK_INTERFACE_LANGUAGES "C"
+    IMPORTED_LOCATION "${FFTWF_LIBRARY}"
+    INTERFACE_INCLUDE_DIRECTORIES "${FFTW_INCLUDE_DIR}")
+endif (FFTW_FOUND)
 
-INCLUDE(FindPackageHandleStandardArgs)
-FIND_PACKAGE_HANDLE_STANDARD_ARGS(FFTW DEFAULT_MSG
-    FFTW_INCLUDES FFTW_LIBRARIES)
-
-MARK_AS_ADVANCED(FFTW_INCLUDES FFTW_LIBRARIES)
diff --git a/CMakeModules/FindForge.cmake b/CMakeModules/FindForge.cmake
deleted file mode 100644
index f1380fde05..0000000000
--- a/CMakeModules/FindForge.cmake
+++ /dev/null
@@ -1,102 +0,0 @@
-# - Find Forge
-
-# Defines the following variables:
-# FORGE_INCLUDE_DIRECTORIES    - Location of FORGE's include directory.
-# FORGE_LIBRARIES              - Location of FORGE's libraries.
-# FORGE_FOUND                  - True if FORGE has been located
-#
-# You may provide a hint to where FORGE's root directory may be located
-# by setting FORGE_ROOT_DIR before calling this script.
-#
-# ----------------------------------------------------------------------------
-#
-# FORGE_FOUND        - True of the FORGE library has been found.
-# FORGE_LIBRARIES    - Location of FORGE library, if found
-#
-# Variables used by this module, they can change the default behaviour and
-# need to be set before calling find_package:
-#
-#=============================================================================
-# Copyright (c) 2014, ArrayFire
-# All rights reserved.
-#
-# Redistribution and use in source and binary forms, with or without modification,
-# are permitted provided that the following conditions are met:
-#
-# * Redistributions of source code must retain the above copyright notice, this
-#   list of conditions and the following disclaimer.
-#
-# * Redistributions in binary form must reproduce the above copyright notice, this
-#   list of conditions and the following disclaimer in the documentation and/or
-#   other materials provided with the distribution.
-#
-# * Neither the name of the ArrayFire nor the names of its
-#   contributors may be used to endorse or promote products derived from
-#   this software without specific prior written permission.
-#
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-# WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-# DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
-# ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-# (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-# LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-# ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#=============================================================================
-
-IF(FORGE_INCLUDE_DIRECTORIES)
-  # Already in cache, be silent
-  set (FORGE_FIND_QUIETLY TRUE)
-ENDIF()
-
-# Find the FORGE install directories and headers:
-FIND_PATH(FORGE_ROOT_DIR
-    NAMES include/forge.h
-    PATH_SUFFIXES forge FORGE FORGE
-    HINTS "${CMAKE_INSTALL_PREFIX}" "${FORGE_ROOT_DIR}" "${FORGE_ROOT_DIR}/lib" "${FORGE_ROOT_DIR}/lib64" "${CMAKE_SOURCE_DIR}/.." "${CMAKE_SOURCE_DIR}/../.."
-    DOC "FORGE root directory.")
-
-FIND_PATH(FORGE_PACKAGE_DIR
-    NAMES include/forge.h lib
-    HINTS "${FORGE_ROOT_DIR}/package" "${FORGE_ROOT_DIR}/build/package" "${FORGE_ROOT_DIR}"
-    DOC "FORGE Package directory.")
-
-FIND_PATH(FORGE_INCLUDE_DIRECTORIES
-    NAMES forge.h
-    HINTS "${FORGE_PACKAGE_DIR}/include"
-    DOC "FORGE Include directory")
-
-# Find all libraries required for the FORGE
-FIND_LIBRARY(FORGE_LIBRARY
-    NAMES forge
-    HINTS "${FORGE_PACKAGE_DIR}/lib")
-
-INCLUDE("${CMAKE_MODULE_PATH}/FindGLEWmx.cmake")
-INCLUDE("${CMAKE_MODULE_PATH}/FindGLFW.cmake")
-
-IF(GLFW_FOUND AND GLEWmx_FOUND AND OPENGL_FOUND)
-    IF(FORGE_INCLUDE_DIRECTORIES)
-        SET(FORGE_INCLUDE_DIRECTORIES ${FORGE_INCLUDE_DIRECTORIES} ${GLFW_INCLUDE_DIR} ${GLEW_INCLUDE_DIR}
-            CACHE INTERNAL "All include dirs required for FORGE'")
-    ENDIF()
-    IF(FORGE_LIBRARY)
-        SET(FORGE_LIBRARIES ${FORGE_LIBRARY} ${GLFW_LIBRARY} ${GLEWmx_LIBRARY} ${OPENGL_gl_LIBRARY} ${OPENGL_glu_LIBRARY}
-            CACHE INTERNAL "All libraries required for FORGE'")
-    ENDIF()
-    # handle the QUIETLY and REQUIRED arguments and set FORGE_FOUND to TRUE if
-    # all listed variables are TRUE
-    INCLUDE (FindPackageHandleStandardArgs)
-    FIND_PACKAGE_HANDLE_STANDARD_ARGS(FORGE DEFAULT_MSG FORGE_LIBRARIES FORGE_INCLUDE_DIRECTORIES)
-    MARK_AS_ADVANCED(FORGE_LIBRARIES FORGE_INCLUDE_DIRECTORIES)
-
-ELSE(GLFW_FOUND AND GLEWmx_FOUND AND OPENGL_FOUND)
-    IF(NOT GLFW_FOUND)
-        MESSAGE(FATAL_ERROR "GLFW Not Found")
-    ELSEIF(NOT GLEWmx_FOUND)
-        MESSAGE(FATAL_ERROR "GLEW-MX Not Found")
-    ELSEIF(NOT OPENGL_FOUND)
-        MESSAGE(FATAL_ERROR "OpenGL Not Found")
-    ENDIF()
-ENDIF(GLFW_FOUND AND GLEWmx_FOUND AND OPENGL_FOUND)
diff --git a/CMakeModules/FindFreeImage.cmake b/CMakeModules/FindFreeImage.cmake
index 09fbcbfc3c..3b2d3fca29 100644
--- a/CMakeModules/FindFreeImage.cmake
+++ b/CMakeModules/FindFreeImage.cmake
@@ -1,68 +1,125 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
 #
-# Try to find the FreeImage library and include path.
-# Once done this will define
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
 #
-# FREEIMAGE_FOUND
-# FREEIMAGE_INCLUDE_PATH
-# FREEIMAGE_LIBRARY
-# FREEIMAGE_STATIC_LIBRARY
-# FREEIMAGE_DYNAMIC_LIBRARY
+# Targets defined by this script
+#   FreeImage::FreeImage
+#   FreeImage::FreeImage_STATIC
 #
+# Note:
+# 1. The static version target is only defined if the static lib is found
+# 2. Environment variable FreeImage_ROOT can be defined on Windows where
+#    FreeImage is just a zip file of header and library files.
+#
+# Sets the following variables:
+#          FreeImage_FOUND
+#          FreeImage_INCLUDE_DIR
+#          FreeImage_LINK_LIBRARY
+#          FreeImage_STATIC_LIBRARY
+#          FreeImage_DLL_LIBRARY - Windows only
+#
+# Usage:
+# find_package(FreeImage)
+# if (FreeImage_FOUND)
+#    target_link_libraries(mylib PRIVATE FreeImage::FreeImage)
+# endif (FreeImage_FOUND)
+#
+# OR if you want to link against the static library:
+#
+# find_package(FreeImage)
+# if (FreeImage_FOUND)
+#    target_link_libraries(mylib PRIVATE FreeImage::FreeImage_STATIC)
+# endif (FreeImage_FOUND)
+#
+# NOTE: You do not need to include the FreeImage include directories since they
+# will be included as part of the target_link_libraries command
 
-OPTION(USE_FREEIMAGE_STATIC "Use Static FreeImage Lib" OFF)
-
-FIND_PATH( FREEIMAGE_INCLUDE_PATH
-    NAMES FreeImage.h
-    HINTS ${PROJECT_SOURCE_DIR}/extern/FreeImage
-    PATHS
+find_path(FreeImage_INCLUDE_DIR
+  NAMES FreeImage.h
+  PATHS
     /usr/include
     /usr/local/include
     /sw/include
     /opt/local/include
-    DOC "The directory where FreeImage.h resides")
+    ${FreeImage_ROOT}
+  DOC "The directory where FreeImage.h resides")
 
-FIND_LIBRARY( FREEIMAGE_DYNAMIC_LIBRARY
-    NAMES FreeImage freeimage
-    HINTS ${PROJECT_SOURCE_DIR}/FreeImage
-    PATHS
+find_library(FreeImage_LINK_LIBRARY
+  NAMES FreeImage freeimage
+  PATHS
     /usr/lib64
     /usr/lib
     /usr/local/lib64
     /usr/local/lib
     /sw/lib
     /opt/local/lib
-    DOC "The FreeImage library")
+    ${FreeImage_ROOT}
+  DOC "The FreeImage library")
 
-SET(PX ${CMAKE_STATIC_LIBRARY_PREFIX})
-SET(SX ${CMAKE_STATIC_LIBRARY_SUFFIX})
-FIND_LIBRARY( FREEIMAGE_STATIC_LIBRARY
-    NAMES ${PX}FreeImageLIB${SX} ${PX}FreeImage${SX} ${PX}freeimage${SX}
-    HINTS ${PROJECT_SOURCE_DIR}/FreeImage
-    PATHS
+find_library(FreeImage_STATIC_LIBRARY
+  NAMES
+    ${CMAKE_STATIC_LIBRARY_PREFIX}FreeImageLIB${CMAKE_STATIC_LIBRARY_SUFFIX}
+    ${CMAKE_STATIC_LIBRARY_PREFIX}FreeImage${CMAKE_STATIC_LIBRARY_SUFFIX}
+    ${CMAKE_STATIC_LIBRARY_PREFIX}freeimage${CMAKE_STATIC_LIBRARY_SUFFIX}
+  PATHS
     /usr/lib64
     /usr/lib
     /usr/local/lib64
     /usr/local/lib
     /sw/lib
     /opt/local/lib
-    DOC "The FreeImage library")
-UNSET(PX)
-UNSET(SX)
+    ${FreeImage_ROOT}
+  DOC "The FreeImage static library")
+
+if (WIN32)
+  get_filename_component(FreeImage_LIB_PATH ${FreeImage_LINK_LIBRARY} DIRECTORY)
+  find_file(FreeImage_DLL_LIBRARY
+    NAMES
+      ${CMAKE_SHARED_LIBRARY_PREFIX}FreeImage${CMAKE_SHARED_LIBRARY_SUFFIX}
+      ${CMAKE_SHARED_LIBRARY_PREFIX}freeimage${CMAKE_SHARED_LIBRARY_SUFFIX}
+    PATHS
+      ${FreeImage_ROOT}
+      ${FreeImage_LIB_PATH}/../bin
+    DOC "The FreeImage dll")
+  mark_as_advanced(FreeImage_DLL_LIBRARY)
+endif ()
+
+mark_as_advanced(
+  FreeImage_INCLUDE_DIR
+  FreeImage_LINK_LIBRARY
+  FreeImage_STATIC_LIBRARY)
+
+include(FindPackageHandleStandardArgs)
+find_package_handle_standard_args(FreeImage
+  REQUIRED_VARS FreeImage_INCLUDE_DIR FreeImage_LINK_LIBRARY)
 
-IF(USE_FREEIMAGE_STATIC)
-  MESSAGE(STATUS "Using Static FreeImage Lib")
-  ADD_DEFINITIONS(-DFREEIMAGE_LIB)
-  SET(FREEIMAGE_LIBRARY ${FREEIMAGE_STATIC_LIBRARY})
-ELSE(USE_FREEIMAGE_STATIC)
-  MESSAGE(STATUS "Using Dynamic FreeImage Lib")
-  REMOVE_DEFINITIONS(-DFREEIMAGE_LIB)
-  SET(FREEIMAGE_LIBRARY ${FREEIMAGE_DYNAMIC_LIBRARY})
-ENDIF(USE_FREEIMAGE_STATIC)
+if(FreeImage_FOUND AND NOT TARGET FreeImage::FreeImage)
+  add_library(FreeImage::FreeImage SHARED IMPORTED)
+  if(WIN32)
+    set_target_properties(FreeImage::FreeImage
+      PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGES "C"
+      INTERFACE_INCLUDE_DIRECTORIES "${FreeImage_INCLUDE_DIR}"
+      IMPORTED_LOCATION "${FreeImage_DLL_LIBRARY}"
+      IMPORTED_IMPLIB "${FreeImage_LINK_LIBRARY}")
+  else(WIN32)
+    set_target_properties(FreeImage::FreeImage
+      PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGES "C"
+      INTERFACE_INCLUDE_DIRECTORIES "${FreeImage_INCLUDE_DIR}"
+      IMPORTED_LOCATION "${FreeImage_LINK_LIBRARY}"
+      IMPORTED_NO_SONAME FALSE)
+  endif(WIN32)
+endif()
 
-MARK_AS_ADVANCED(
-    FREEIMAGE_DYNAMIC_LIBRARY
-    FREEIMAGE_STATIC_LIBRARY
-    )
-INCLUDE(FindPackageHandleStandardArgs)
-FIND_PACKAGE_HANDLE_STANDARD_ARGS(FREEIMAGE DEFAULT_MSG
-    FREEIMAGE_INCLUDE_PATH FREEIMAGE_LIBRARY)
+if(FreeImage_STATIC_LIBRARY AND NOT TARGET FreeImage::FreeImage_STATIC)
+  add_library(FreeImage::FreeImage_STATIC STATIC IMPORTED)
+  set_target_properties(FreeImage::FreeImage_STATIC
+    PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGES "C"
+      INTERFACE_INCLUDE_DIRECTORIES "${FreeImage_INCLUDE_DIR}"
+      IMPORTED_LOCATION "${FreeImage_STATIC_LIBRARY}")
+endif()
diff --git a/CMakeModules/FindGLEWmx.cmake b/CMakeModules/FindGLEWmx.cmake
deleted file mode 100644
index f86cb0509c..0000000000
--- a/CMakeModules/FindGLEWmx.cmake
+++ /dev/null
@@ -1,89 +0,0 @@
-# Source from
-#https://github.com/LaurentGomila/SFML/blob/master/cmake/Modules/FindGLEW.cmake
-
-#
-# Try to find GLEW library and include path.
-# Once done this will define
-#
-# GLEW_FOUND
-# GLEW_INCLUDE_DIR
-# GLEW_LIBRARY
-# GLEWmx_LIBRARY
-# GLEWmxd_LIBRARY
-# GLEWmxs_LIBRARY
-
-FIND_PACKAGE(GLEW)
-FIND_PACKAGE(OpenGL)
-
-OPTION(USE_GLEWmx_STATIC "Use Static GLEWmx Lib" OFF)
-
-IF (WIN32)
-    FIND_LIBRARY( GLEWmxd_LIBRARY
-        NAMES glewmx GLEWmx glew32mx glew32mx
-        PATHS
-        $ENV{PROGRAMFILES}/GLEW/lib
-        ${GLEW_ROOT_DIR}/lib
-        ${GLEW_ROOT_DIR}
-        ${PROJECT_SOURCE_DIR}/../dependencies/glew/lib
-        PATH_SUFFIXES "Release MX/x64" "lib64"
-        DOC "The GLEWmx library"
-    )
-    FIND_LIBRARY( GLEWmxs_LIBRARY
-        NAMES glewmxs GLEWmxs glew32mxs glew32mxs
-        PATHS
-        $ENV{PROGRAMFILES}/GLEW/lib
-        ${GLEW_ROOT_DIR}/lib
-        ${GLEW_ROOT_DIR}
-        ${PROJECT_SOURCE_DIR}/../dependencies/glew/lib
-        PATH_SUFFIXES "Release MX/x64" "lib64"
-        DOC "The GLEWmxs Static library"
-    )
-ELSE (WIN32)
-    FIND_LIBRARY( GLEWmxd_LIBRARY
-        NAMES GLEWmx glewmx
-        PATHS
-        /usr/lib64
-        /usr/lib
-        /usr/lib/x86_64-linux-gnu
-        /usr/lib/arm-linux-gnueabihf
-        /usr/local/lib64
-        /usr/local/lib
-        /sw/lib
-        /opt/local/lib
-        ${GLEW_ROOT_DIR}/lib
-        NO_DEFAULT_PATH
-        DOC "The GLEWmx library")
-
-    SET(PX ${CMAKE_STATIC_LIBRARY_PREFIX})
-    SET(SX ${CMAKE_STATIC_LIBRARY_SUFFIX})
-    FIND_LIBRARY( GLEWmxs_LIBRARY
-        NAMES ${PX}GLEWmx${SX} ${PX}glewmx${SX}
-        PATHS
-        /usr/lib64
-        /usr/lib
-        /usr/lib/x86_64-linux-gnu
-        /usr/lib/arm-linux-gnueabihf
-        /usr/local/lib64
-        /usr/local/lib
-        /sw/lib
-        /opt/local/lib
-        ${GLEW_ROOT_DIR}/lib
-        NO_DEFAULT_PATH
-        DOC "The GLEWmx library")
-    UNSET(PX)
-    UNSET(SX)
-ENDIF (WIN32)
-
-IF(USE_GLEWmx_STATIC)
-    MESSAGE(STATUS "Using Static GLEWmx Lib")
-    ADD_DEFINITIONS(-DGLEW_STATIC)
-    SET(GLEWmx_LIBRARY ${GLEWmxs_LIBRARY})
-ELSE(USE_GLEWmx_STATIC)
-    MESSAGE(STATUS "Using Dynamic GLEWmx Lib")
-    REMOVE_DEFINITIONS(-DGLEW_STATIC)
-    SET(GLEWmx_LIBRARY ${GLEWmxd_LIBRARY})
-ENDIF(USE_GLEWmx_STATIC)
-
-IF (GLEW_INCLUDE_DIR AND GLEWmx_LIBRARY)
-    SET(GLEWmx_FOUND "YES")
-ENDIF (GLEW_INCLUDE_DIR AND GLEWmx_LIBRARY)
diff --git a/CMakeModules/FindGLFW.cmake b/CMakeModules/FindGLFW.cmake
deleted file mode 100644
index 42f1297e98..0000000000
--- a/CMakeModules/FindGLFW.cmake
+++ /dev/null
@@ -1,60 +0,0 @@
-# Sourced from
-#https://code.google.com/p/assembly3d/
-
-# Locate the glfw library
-# This module defines the following variables:
-# GLFW_LIBRARY, the name of the library;
-# GLFW_INCLUDE_DIR, where to find glfw include files.
-# GLFW_FOUND, true if both the GLFW_LIBRARY and GLFW_INCLUDE_DIR have been found.
-#
-# To help locate the library and include file, you could define an environment variable called
-# GLFW_ROOT which points to the root of the glfw library installation. This is pretty useful
-# on a Windows platform.
-#
-#
-# Usage example to compile an "executable" target to the glfw library:
-#
-# FIND_PACKAGE (glfw REQUIRED)
-# INCLUDE_DIRECTORIES (${GLFW_INCLUDE_DIR})
-# ADD_EXECUTABLE (executable ${EXECUTABLE_SRCS})
-# TARGET_LINK_LIBRARIES (executable ${GLFW_LIBRARY})
-#
-# TODO:
-# Allow the user to select to link to a shared library or to a static library.
-
-#Search for the include file...
-FIND_PATH(GLFW_INCLUDE_DIR GLFW/glfw3.h DOC "Path to GLFW include directory."
-  HINTS
-  $ENV{GLFW_ROOT}
-  PATH_SUFFIX include #For finding the include file under the root of the glfw expanded archive, typically on Windows.
-  PATHS
-  /usr/include/
-  /usr/local/include/
-  # By default headers are under GLFW subfolder
-  /usr/include/GLFW
-  /usr/local/include/GLFW
-  ${GLFW_ROOT_DIR}/include/ # added by ptr
-)
-
-FIND_LIBRARY(GLFW_LIBRARY DOC "Absolute path to GLFW library."
-  NAMES glfw GLFW.lib glfw3
-  HINTS
-  $ENV{GLFW_ROOT}
-  PATH_SUFFIXES lib/win32 #For finding the library file under the root of the glfw expanded archive, typically on Windows.
-  PATHS
-  /usr/lib
-  /usr/lib64
-  /usr/lib/x86_64-linux-gnu
-  /usr/lib/arm-linux-gnueabihf
-  /usr/local/lib
-  /usr/local/lib64
-  /sw/lib
-  /opt/local/lib
-  ${GLFW_ROOT_DIR}/lib-msvc100/release # added by ptr
-)
-
-SET(GLFW_FOUND 0)
-IF(GLFW_LIBRARY AND GLFW_INCLUDE_DIR)
-  SET(GLFW_FOUND 1)
-  message(STATUS "GLFW found!")
-ENDIF(GLFW_LIBRARY AND GLFW_INCLUDE_DIR)
diff --git a/CMakeModules/FindLAPACKE.cmake b/CMakeModules/FindLAPACKE.cmake
index 05d218aa86..65c513abb2 100644
--- a/CMakeModules/FindLAPACKE.cmake
+++ b/CMakeModules/FindLAPACKE.cmake
@@ -3,85 +3,47 @@
 # Usage:
 #   FIND_PACKAGE(LAPACKE [REQUIRED] [QUIET] )
 #
-# It sets the following variables:
-#   LAPACK_FOUND               ... true if LAPACKE is found on the system
-#   LAPACK_LIBRARIES           ... full path to LAPACKE library
-#   LAPACK_INCLUDES            ... LAPACKE include directory
-#
 
-IF(NOT LAPACKE_ROOT AND ENV{LAPACKEDIR})
-  SET(LAPACKE_ROOT $ENV{LAPACKEDIR})
-ENDIF()
+INCLUDE(FindPackageHandleStandardArgs)
+SET(LAPACKE_ROOT_DIR "" CACHE PATH
+  "Root directory for custom LAPACK implementation")
+
+if(NOT LAPACKE_ROOT_DIR)
+  if (DEFINED ENV{LAPACKEDIR})
+    SET(LAPACKE_ROOT_DIR $ENV{LAPACKEDIR})
+  endif()
+
+  if (DEFINED ENV{LAPACKE_ROOT_DIR})
+    SET(LAPACKE_ROOT_DIR $ENV{LAPACKE_ROOT_DIR})
+  endif()
+
+  if (DEFINED ENV{MKLROOT})
+    SET(LAPACKE_ROOT_DIR $ENV{MKLROOT})
+  endif()
+endif()
 
 # Check if we can use PkgConfig
 FIND_PACKAGE(PkgConfig)
 
 #Determine from PKG
-IF(PKG_CONFIG_FOUND AND NOT LAPACKE_ROOT)
-  PKG_CHECK_MODULES( PKG_LAPACKE QUIET "lapacke")
+IF(PKG_CONFIG_FOUND AND NOT LAPACKE_ROOT_DIR)
+  PKG_CHECK_MODULES( PC_LAPACKE QUIET "lapacke")
 ENDIF()
 
-IF(LAPACKE_ROOT)
-    #find libs
-    FIND_LIBRARY(
-        LAPACKE_LIB
-        NAMES "lapacke" "LAPACKE" "liblapacke"
-        PATHS ${LAPACKE_ROOT}
-        PATH_SUFFIXES "lib" "lib64"
-        DOC "LAPACKE Library"
-        NO_DEFAULT_PATH
-        )
-    FIND_LIBRARY(
-        LAPACK_LIB
-        NAMES "lapack" "LAPACK" "liblapack"
-        PATHS ${LAPACKE_ROOT}
-        PATH_SUFFIXES "lib" "lib64"
-        DOC "LAPACK Library"
-        NO_DEFAULT_PATH
-        )
-    FIND_PATH(
-        LAPACKE_INCLUDES
-        NAMES "lapacke.h"
-        PATHS ${LAPACKE_ROOT}
-        PATH_SUFFIXES "include"
-        DOC "LAPACKE Include Directory"
-        NO_DEFAULT_PATH
-        )
+IF(PC_LAPACKE_FOUND)
+    FOREACH(PC_LIB ${PC_LAPACKE_LIBRARIES})
+      FIND_LIBRARY(${PC_LIB}_LIBRARY NAMES ${PC_LIB} HINTS ${PC_LAPACKE_LIBRARY_DIRS} )
+      IF (NOT ${PC_LIB}_LIBRARY)
+        MESSAGE(FATAL_ERROR "Something is wrong in your pkg-config file - lib ${PC_LIB} not found in ${PC_LAPACKE_LIBRARY_DIRS}")
+      ENDIF (NOT ${PC_LIB}_LIBRARY)
+      LIST(APPEND LAPACKE_LIB ${${PC_LIB}_LIBRARY})
+    ENDFOREACH(PC_LIB)
 
-ELSE()
-    FIND_LIBRARY(
-        LAPACKE_LIB
-        NAMES "lapacke" "liblapacke"
-        PATHS
-        ${PKG_LAPACKE_LIBRARY_DIRS}
-        ${LIB_INSTALL_DIR}
-        /usr/lib64
-        /usr/lib
-        /usr/local/lib64
-        /usr/local/lib
-        /sw/lib
-        /opt/local/lib
-        DOC "LAPACKE Library"
-        )
-    FIND_LIBRARY(
-       LAPACK_LIB
-        NAMES "lapack" "liblapack"
-        PATHS
-        ${PKG_LAPACKE_LIBRARY_DIRS}
-        ${LIB_INSTALL_DIR}
-        /usr/lib64
-        /usr/lib
-        /usr/local/lib64
-        /usr/local/lib
-        /sw/lib
-        /opt/local/lib
-        DOC "LAPACK Library"
-        )
     FIND_PATH(
         LAPACKE_INCLUDES
         NAMES "lapacke.h"
         PATHS
-        ${PKG_LAPACKE_INCLUDE_DIRS}
+        ${PC_LAPACKE_INCLUDE_DIRS}
         ${INCLUDE_INSTALL_DIR}
         /usr/include
         /usr/local/include
@@ -89,13 +51,82 @@ ELSE()
         /opt/local/include
         DOC "LAPACKE Include Directory"
         )
-ENDIF(LAPACKE_ROOT)
 
-SET(LAPACK_LIBRARIES ${LAPACKE_LIB} ${LAPACK_LIB})
-SET(LAPACK_INCLUDE_DIR ${LAPACKE_INCLUDES})
+    FIND_PACKAGE_HANDLE_STANDARD_ARGS(LAPACKE DEFAULT_MSG LAPACKE_LIB)
+    MARK_AS_ADVANCED(LAPACKE_INCLUDES LAPACKE_LIB)
 
-INCLUDE(FindPackageHandleStandardArgs)
-FIND_PACKAGE_HANDLE_STANDARD_ARGS(LAPACK DEFAULT_MSG
-  LAPACK_INCLUDE_DIR LAPACK_LIBRARIES)
+ELSE(PC_LAPACKE_FOUND)
+
+    IF (CMAKE_SIZEOF_VOID_P EQUAL 8)
+        SET(MKL_LIB_DIR_SUFFIX "intel64")
+    ELSE()
+        SET(MKL_LIB_DIR_SUFFIX "ia32")
+    ENDIF()
+
+    IF(LAPACKE_ROOT_DIR)
+        #find libs
+        FIND_LIBRARY(
+            LAPACKE_LIB
+            NAMES "lapacke" "LAPACKE" "liblapacke" "mkl_rt"
+            PATHS ${LAPACKE_ROOT_DIR}
+            PATH_SUFFIXES "lib" "lib64" "lib/${MKL_LIB_DIR_SUFFIX}"
+            DOC "LAPACKE Library"
+            NO_DEFAULT_PATH
+            )
+        FIND_PATH(
+            LAPACKE_INCLUDES
+            NAMES "lapacke.h" "mkl_lapacke.h"
+            PATHS ${LAPACKE_ROOT_DIR}
+            PATH_SUFFIXES "include"
+            DOC "LAPACKE Include Directory"
+            NO_DEFAULT_PATH
+            )
+    ELSE()
+        FIND_LIBRARY(
+            LAPACKE_LIB
+            NAMES "lapacke" "liblapacke" "openblas" "mkl_rt"
+            PATHS
+            ${PC_LAPACKE_LIBRARY_DIRS}
+            ${LIB_INSTALL_DIR}
+            /opt/intel/mkl/lib/${MKL_LIB_DIR_SUFFIX}
+            /usr/lib64
+            /usr/lib
+            /usr/local/lib64
+            /usr/local/lib
+            /sw/lib
+            /opt/local/lib
+            DOC "LAPACKE Library"
+            )
+        FIND_PATH(
+            LAPACKE_INCLUDES
+            NAMES "lapacke.h" "mkl_lapacke.h"
+            PATHS
+            ${PC_LAPACKE_INCLUDE_DIRS}
+            ${INCLUDE_INSTALL_DIR}
+            /opt/intel/mkl/include
+            /usr/include
+            /usr/local/include
+            /sw/include
+            /opt/local/include
+            DOC "LAPACKE Include Directory"
+            PATH_SUFFIXES
+            lapacke
+            )
+    ENDIF(LAPACKE_ROOT_DIR)
+    find_package_handle_standard_args(LAPACKE DEFAULT_MSG LAPACKE_LIB LAPACKE_INCLUDES)
+ENDIF(PC_LAPACKE_FOUND)
+
+MARK_AS_ADVANCED(
+  LAPACKE_ROOT_DIR
+  LAPACKE_INCLUDES
+  LAPACKE_LIB
+  lapacke_LIBRARY)
 
-MARK_AS_ADVANCED(LAPACK_INCLUDES LAPACK_LIBRARIES)
+if((PC_LAPACKE_FOUND OR (LAPACKE_LIB AND LAPACKE_INCLUDES)) AND NOT TARGET LAPACKE::LAPACKE)
+  add_library(LAPACKE::LAPACKE UNKNOWN IMPORTED)
+  set_target_properties(LAPACKE::LAPACKE PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGE "C"
+      IMPORTED_LOCATION "${LAPACKE_LIB}"
+      INTERFACE_INCLUDE_DIRECTORIES "${LAPACKE_INCLUDES}"
+    )
+endif()
diff --git a/CMakeModules/FindNVVM.cmake b/CMakeModules/FindNVVM.cmake
deleted file mode 100644
index a34947e9cd..0000000000
--- a/CMakeModules/FindNVVM.cmake
+++ /dev/null
@@ -1,64 +0,0 @@
-# - Find the NVVM include directory and libraries
-# Modified version of the file found here:
-# https://raw.githubusercontent.com/nvidia-compiler-sdk/nvvmir-samples/master/CMakeLists.txt
-
-# libNVVM
-if(NOT DEFINED ENV{LIBNVVM_HOME})
-  set(LIBNVVM_HOME "${CUDA_TOOLKIT_ROOT_DIR}/nvvm" CACHE PATH "Path to NVVM.")
-else()
-  set(LIBNVVM_HOME "$ENV{LIBNVVM_HOME}" CACHE PATH "Path to NVVM.")
-endif()
-message(STATUS "Using LIBNVVM_HOME: ${LIBNVVM_HOME}")
-
-if (CMAKE_SIZEOF_VOID_P STREQUAL "8")
-  if (WIN32)
-    set (CUDA_LIB_SEARCH_PATH "${CUDA_TOOLKIT_ROOT_DIR}/lib/x64")
-    set (NVVM_DLL_NAME nvvm64_20_0.dll)
-  else ()
-    set (CUDA_LIB_SEARCH_PATH "")
-  endif()
-else()
-  if (WIN32)
-    set (CUDA_LIB_SEARCH_PATH "${CUDA_TOOLKIT_ROOT_DIR}/lib/Win32")
-    set (NVVM_DLL_NAME nvvm32_20_0.dll)
-  else()
-    set (CUDA_LIB_SEARCH_PATH "")
-  endif()
-endif()
-
-### Find libNVVM
-# The directory structure for nvvm is a bit complex.
-# On Windows:
-#   32-bit -- nvvm/lib/Win32
-#   64-bit -- nvvm/lib/x64
-# On Linux:
-#   32-bit -- nvvm/lib
-#   64-bit -- nvvm/lib64
-# On Mac:
-#   Universal -- nvvm/lib
-if (CMAKE_SIZEOF_VOID_P STREQUAL "8")
-  if (WIN32)
-    set (LIB_ARCH_SUFFIX "/x64")
-  elseif (APPLE)
-    set (LIB_ARCH_SUFFIX "")
-  else ()
-    set (LIB_ARCH_SUFFIX "64")
-  endif()
-else()
-  if (WIN32)
-    set (LIB_ARCH_SUFFIX "/Win32")
-  else()
-    set (LIB_ARCH_SUFFIX "")
-  endif()
-endif()
-
-find_library(NVVM_LIB nvvm PATHS "${LIBNVVM_HOME}/lib${LIB_ARCH_SUFFIX}")
-find_file(NVVM_H nvvm.h PATHS "${LIBNVVM_HOME}/include")
-
-if(NVVM_H)
-  get_filename_component(CUDA_NVVM_INCLUDE_DIR ${NVVM_H} PATH)
-else()
-  message(FATAL_ERROR "Unable to find nvvm.h")
-endif()
-
-set(CUDA_NVVM_LIBRARIES ${NVVM_LIB})
diff --git a/CMakeModules/FindOpenCL.cmake b/CMakeModules/FindOpenCL.cmake
index 80fcf493bc..3ac45a4a12 100644
--- a/CMakeModules/FindOpenCL.cmake
+++ b/CMakeModules/FindOpenCL.cmake
@@ -1,68 +1,43 @@
-#.rst:
-# FindOpenCL
-# ----------
-#
-# Try to find OpenCL
-#
-# Once done this will define::
-#
-#   OpenCL_FOUND          - True if OpenCL was found
-#   OpenCL_INCLUDE_DIRS   - include directories for OpenCL
-#   OpenCL_LIBRARIES      - link against this library to use OpenCL
-#   OpenCL_VERSION_STRING - Highest supported OpenCL version (eg. 1.2)
-#   OpenCL_VERSION_MAJOR  - The major version of the OpenCL implementation
-#   OpenCL_VERSION_MINOR  - The minor version of the OpenCL implementation
-#
-# The module will also define two cache variables::
-#
-#   OpenCL_INCLUDE_DIR    - the OpenCL include directory
-#   OpenCL_LIBRARY        - the path to the OpenCL library
-#
-
-#=============================================================================
-# From CMake 3.2
-# Copyright 2014 Matthaeus G. Chajdas
-#
-# Distributed under the OSI-approved BSD License (the "License");
-# see accompanying file Copyright.txt for details.
-#
-# This software is distributed WITHOUT ANY WARRANTY; without even the
-# implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-# See the License for more information.
-
-# CMake - Cross Platform Makefile Generator
-# Copyright 2000-2014 Kitware, Inc.
-# Copyright 2000-2011 Insight Software Consortium
-# All rights reserved.
-# 
-# Redistribution and use in source and binary forms, with or without
-# modification, are permitted provided that the following conditions
-# are met:
-# 
-# * Redistributions of source code must retain the above copyright
-# notice, this list of conditions and the following disclaimer.
-# 
-# * Redistributions in binary form must reproduce the above copyright
-# notice, this list of conditions and the following disclaimer in the
-# documentation and/or other materials provided with the distribution.
-# 
-# * Neither the names of Kitware, Inc., the Insight Software Consortium,
-# nor the names of their contributors may be used to endorse or promote
-# products derived from this software without specific prior written
-# permission.
-# 
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#=============================================================================
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+#[=======================================================================[.rst:
+FindOpenCL
+----------
+
+.. versionadded:: 3.1
+
+Finds Open Computing Language (OpenCL)
+
+.. versionadded:: 3.10
+  Detection of OpenCL 2.1 and 2.2.
+
+IMPORTED Targets
+^^^^^^^^^^^^^^^^
+
+.. versionadded:: 3.7
+
+This module defines :prop_tgt:`IMPORTED` target ``OpenCL::OpenCL``, if
+OpenCL has been found.
+
+Result Variables
+^^^^^^^^^^^^^^^^
+
+This module defines the following variables::
+
+  OpenCL_FOUND          - True if OpenCL was found
+  OpenCL_INCLUDE_DIRS   - include directories for OpenCL
+  OpenCL_LIBRARIES      - link against this library to use OpenCL
+  OpenCL_VERSION_STRING - Highest supported OpenCL version (e.g. 1.2)
+  OpenCL_VERSION_MAJOR  - The major version of the OpenCL implementation
+  OpenCL_VERSION_MINOR  - The minor version of the OpenCL implementation
+
+The module will also define two cache variables::
+
+  OpenCL_INCLUDE_DIR    - the OpenCL include directory
+  OpenCL_LIBRARY        - the path to the OpenCL library
+
+#]=======================================================================]
 
 function(_FIND_OPENCL_VERSION)
   include(CheckSymbolExists)
@@ -70,12 +45,13 @@ function(_FIND_OPENCL_VERSION)
   set(CMAKE_REQUIRED_QUIET ${OpenCL_FIND_QUIETLY})
 
   CMAKE_PUSH_CHECK_STATE()
-  foreach(VERSION "2_0" "1_2" "1_1" "1_0")
+  foreach(VERSION "3_0" "2_2" "2_1" "2_0" "1_2" "1_1" "1_0")
     set(CMAKE_REQUIRED_INCLUDES "${OpenCL_INCLUDE_DIR}")
+
     if(APPLE)
       CHECK_SYMBOL_EXISTS(
         CL_VERSION_${VERSION}
-        "${OpenCL_INCLUDE_DIR}/OpenCL/cl.h"
+        "${OpenCL_INCLUDE_DIR}/Headers/cl.h"
         OPENCL_VERSION_${VERSION})
     else()
       CHECK_SYMBOL_EXISTS(
@@ -103,11 +79,14 @@ find_path(OpenCL_INCLUDE_DIR
     CL/cl.h OpenCL/cl.h
   PATHS
     ENV "PROGRAMFILES(X86)"
-    ENV NVSDKCOMPUTE_ROOT
-    ENV CUDA_PATH
     ENV AMDAPPSDKROOT
     ENV INTELOCLSDKROOT
+    ENV NVSDKCOMPUTE_ROOT
+    ENV CUDA_PATH
     ENV ATISTREAMSDKROOT
+    ENV OCL_ROOT
+    /usr/local/cuda
+    /opt/cuda
   PATH_SUFFIXES
     include
     OpenCL/common/inc
@@ -121,11 +100,12 @@ if(WIN32)
       NAMES OpenCL
       PATHS
         ENV "PROGRAMFILES(X86)"
-        ENV CUDA_PATH
-        ENV NVSDKCOMPUTE_ROOT
         ENV AMDAPPSDKROOT
         ENV INTELOCLSDKROOT
+        ENV CUDA_PATH
+        ENV NVSDKCOMPUTE_ROOT
         ENV ATISTREAMSDKROOT
+        ENV OCL_ROOT
       PATH_SUFFIXES
         "AMD APP/lib/x86"
         lib/x86
@@ -136,11 +116,12 @@ if(WIN32)
       NAMES OpenCL
       PATHS
         ENV "PROGRAMFILES(X86)"
-        ENV CUDA_PATH
-        ENV NVSDKCOMPUTE_ROOT
         ENV AMDAPPSDKROOT
         ENV INTELOCLSDKROOT
+        ENV CUDA_PATH
+        ENV NVSDKCOMPUTE_ROOT
         ENV ATISTREAMSDKROOT
+        ENV OCL_ROOT
       PATH_SUFFIXES
         "AMD APP/lib/x86_64"
         lib/x86_64
@@ -148,14 +129,37 @@ if(WIN32)
         OpenCL/common/lib/x64)
   endif()
 else()
-  find_library(OpenCL_LIBRARY
-    NAMES OpenCL)
+  if(CMAKE_SIZEOF_VOID_P EQUAL 4)
+    find_library(OpenCL_LIBRARY
+      NAMES OpenCL
+      PATHS
+        ENV AMDAPPSDKROOT
+        ENV CUDA_PATH
+        /usr/local/cuda
+        /opt/cuda
+      PATH_SUFFIXES
+        lib/x86
+        lib)
+  elseif(CMAKE_SIZEOF_VOID_P EQUAL 8)
+    find_library(OpenCL_LIBRARY
+      NAMES OpenCL
+      PATHS
+        ENV AMDAPPSDKROOT
+        ENV CUDA_PATH
+        /usr/local/cuda
+        /opt/cuda
+      PATH_SUFFIXES
+        lib/x86_64
+        lib/x64
+        lib
+        lib64)
+  endif()
 endif()
 
 set(OpenCL_LIBRARIES ${OpenCL_LIBRARY})
 set(OpenCL_INCLUDE_DIRS ${OpenCL_INCLUDE_DIR})
 
-#include(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake)
+include(FindPackageHandleStandardArgs)
 find_package_handle_standard_args(
   OpenCL
   FOUND_VAR OpenCL_FOUND
@@ -166,3 +170,16 @@ mark_as_advanced(
   OpenCL_INCLUDE_DIR
   OpenCL_LIBRARY)
 
+if(OpenCL_FOUND AND NOT TARGET OpenCL::OpenCL)
+  if(OpenCL_LIBRARY MATCHES "/([^/]+)\\.framework$")
+    add_library(OpenCL::OpenCL INTERFACE IMPORTED)
+    set_target_properties(OpenCL::OpenCL PROPERTIES
+      INTERFACE_LINK_LIBRARIES "${OpenCL_LIBRARY}")
+  else()
+    add_library(OpenCL::OpenCL UNKNOWN IMPORTED)
+    set_target_properties(OpenCL::OpenCL PROPERTIES
+      IMPORTED_LOCATION "${OpenCL_LIBRARY}")
+  endif()
+  set_target_properties(OpenCL::OpenCL PROPERTIES
+    INTERFACE_INCLUDE_DIRECTORIES "${OpenCL_INCLUDE_DIRS}")
+endif()
diff --git a/CMakeModules/FindOpenGL.cmake b/CMakeModules/FindOpenGL.cmake
new file mode 100644
index 0000000000..4ab5d4bfd5
--- /dev/null
+++ b/CMakeModules/FindOpenGL.cmake
@@ -0,0 +1,227 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+#.rst:
+# FindOpenGL
+# ----------
+#
+# FindModule for OpenGL and GLU.
+#
+# IMPORTED Targets
+# ^^^^^^^^^^^^^^^^
+#
+# This module defines the :prop_tgt:`IMPORTED` targets:
+#
+# ``OpenGL::GL``
+#  Defined if the system has OpenGL.
+# ``OpenGL::GLU``
+#  Defined if the system has GLU.
+#
+# Result Variables
+# ^^^^^^^^^^^^^^^^
+#
+# This module sets the following variables:
+#
+# ``OPENGL_FOUND``
+#  True, if the system has OpenGL.
+# ``OPENGL_XMESA_FOUND``
+#  True, if the system has XMESA.
+# ``OPENGL_GLU_FOUND``
+#  True, if the system has GLU.
+# ``OPENGL_INCLUDE_DIR``
+#  Path to the OpenGL include directory.
+# ``OPENGL_LIBRARIES``
+#  Paths to the OpenGL and GLU libraries.
+#
+# If you want to use just GL you can use these values:
+#
+# ``OPENGL_gl_LIBRARY``
+#  Path to the OpenGL library.
+# ``OPENGL_glu_LIBRARY``
+#  Path to the GLU library.
+#
+# OSX Specific
+# ^^^^^^^^^^^^
+#
+# On OSX default to using the framework version of OpenGL. People will
+# have to change the cache values of OPENGL_glu_LIBRARY and
+# OPENGL_gl_LIBRARY to use OpenGL with X11 on OSX.
+
+
+set(_OpenGL_REQUIRED_VARS OPENGL_gl_LIBRARY)
+
+if (CYGWIN)
+
+  find_path(OPENGL_INCLUDE_DIR GL/gl.h )
+  list(APPEND _OpenGL_REQUIRED_VARS OPENGL_INCLUDE_DIR)
+
+  find_library(OPENGL_gl_LIBRARY opengl32 )
+
+  find_library(OPENGL_glu_LIBRARY glu32 )
+
+elseif (WIN32)
+
+  if(BORLAND)
+    set (OPENGL_gl_LIBRARY import32 CACHE STRING "OpenGL library for win32")
+    set (OPENGL_glu_LIBRARY import32 CACHE STRING "GLU library for win32")
+  else()
+    set (OPENGL_gl_LIBRARY opengl32 CACHE STRING "OpenGL library for win32")
+    set (OPENGL_glu_LIBRARY glu32 CACHE STRING "GLU library for win32")
+  endif()
+
+elseif (APPLE)
+
+  # The OpenGL.framework provides both gl and glu
+  find_library(OPENGL_gl_LIBRARY OpenGL DOC "OpenGL library for OS X")
+  find_library(OPENGL_glu_LIBRARY OpenGL DOC
+    "GLU library for OS X (usually same as OpenGL library)")
+  find_path(OPENGL_INCLUDE_DIR OpenGL/gl.h DOC "Include for OpenGL on OS X")
+  list(APPEND _OpenGL_REQUIRED_VARS OPENGL_INCLUDE_DIR)
+
+else()
+  if (CMAKE_SYSTEM_NAME MATCHES "HP-UX")
+    # Handle HP-UX cases where we only want to find OpenGL in either hpux64
+    # or hpux32 depending on if we're doing a 64 bit build.
+    if(CMAKE_SIZEOF_VOID_P EQUAL 4)
+      set(_OPENGL_LIB_PATH
+        /opt/graphics/OpenGL/lib/hpux32/)
+    else()
+      set(_OPENGL_LIB_PATH
+        /opt/graphics/OpenGL/lib/hpux64/
+        /opt/graphics/OpenGL/lib/pa20_64)
+    endif()
+  elseif(CMAKE_SYSTEM_NAME STREQUAL Haiku)
+    set(_OPENGL_LIB_PATH
+      /boot/develop/lib/x86)
+    set(_OPENGL_INCLUDE_PATH
+      /boot/develop/headers/os/opengl)
+  endif()
+
+  # The first line below is to make sure that the proper headers
+  # are used on a Linux machine with the NVidia drivers installed.
+  # They replace Mesa with NVidia's own library but normally do not
+  # install headers and that causes the linking to
+  # fail since the compiler finds the Mesa headers but NVidia's library.
+  # Make sure the NVIDIA directory comes BEFORE the others.
+  #  - Atanas Georgiev <atanas@cs.columbia.edu>
+
+  find_path(OPENGL_INCLUDE_DIR GL/gl.h
+    /usr/share/doc/NVIDIA_GLX-1.0/include
+    /usr/openwin/share/include
+    /opt/graphics/OpenGL/include /usr/X11R6/include
+    ${_OPENGL_INCLUDE_PATH}
+  )
+  list(APPEND _OpenGL_REQUIRED_VARS OPENGL_INCLUDE_DIR)
+
+  find_path(OPENGL_xmesa_INCLUDE_DIR GL/xmesa.h
+    /usr/share/doc/NVIDIA_GLX-1.0/include
+    /usr/openwin/share/include
+    /opt/graphics/OpenGL/include /usr/X11R6/include
+  )
+
+  find_library(OPENGL_gl_LIBRARY
+    NAMES GL MesaGL
+    PATHS /opt/graphics/OpenGL/lib
+          /usr/openwin/lib
+          /usr/shlib /usr/X11R6/lib
+          ${_OPENGL_LIB_PATH}
+  )
+
+  unset(_OPENGL_INCLUDE_PATH)
+  unset(_OPENGL_LIB_PATH)
+
+  find_library(OPENGL_glu_LIBRARY
+    NAMES GLU MesaGLU
+    PATHS ${OPENGL_gl_LIBRARY}
+          /opt/graphics/OpenGL/lib
+          /usr/openwin/lib
+          /usr/shlib /usr/X11R6/lib
+  )
+
+endif ()
+
+if(OPENGL_gl_LIBRARY)
+
+    if(OPENGL_xmesa_INCLUDE_DIR)
+      set( OPENGL_XMESA_FOUND "YES" )
+    else()
+      set( OPENGL_XMESA_FOUND "NO" )
+    endif()
+
+    set( OPENGL_LIBRARIES  ${OPENGL_gl_LIBRARY} ${OPENGL_LIBRARIES})
+    if(OPENGL_glu_LIBRARY)
+      set( OPENGL_GLU_FOUND "YES" )
+      if(NOT "${OPENGL_glu_LIBRARY}" STREQUAL "${OPENGL_gl_LIBRARY}")
+        set( OPENGL_LIBRARIES ${OPENGL_glu_LIBRARY} ${OPENGL_LIBRARIES} )
+      endif()
+    else()
+      set( OPENGL_GLU_FOUND "NO" )
+    endif()
+
+    # This deprecated setting is for backward compatibility with CMake1.4
+    set (OPENGL_LIBRARY ${OPENGL_LIBRARIES})
+
+endif()
+
+# This deprecated setting is for backward compatibility with CMake1.4
+set(OPENGL_INCLUDE_PATH ${OPENGL_INCLUDE_DIR})
+
+include(FindPackageHandleStandardArgs)
+FIND_PACKAGE_HANDLE_STANDARD_ARGS(OpenGL REQUIRED_VARS ${_OpenGL_REQUIRED_VARS})
+unset(_OpenGL_REQUIRED_VARS)
+
+# OpenGL:: targets
+if(OPENGL_FOUND)
+  if(NOT TARGET OpenGL::GL)
+    if(IS_ABSOLUTE "${OPENGL_gl_LIBRARY}")
+      add_library(OpenGL::GL UNKNOWN IMPORTED)
+      if(OPENGL_gl_LIBRARY MATCHES "/([^/]+)\\.framework$")
+        set(_gl_fw "${OPENGL_gl_LIBRARY}/${CMAKE_MATCH_1}")
+        if(EXISTS "${_gl_fw}.tbd")
+          set(_gl_fw "${_gl_fw}.tbd")
+        endif()
+        set_target_properties(OpenGL::GL PROPERTIES
+          IMPORTED_LOCATION "${_gl_fw}")
+      else()
+        set_target_properties(OpenGL::GL PROPERTIES
+          IMPORTED_LOCATION "${OPENGL_gl_LIBRARY}")
+      endif()
+    else()
+      add_library(OpenGL::GL INTERFACE IMPORTED)
+      set_target_properties(OpenGL::GL PROPERTIES
+        INTERFACE_LINK_LIBRARIES "${OPENGL_gl_LIBRARY}")
+    endif()
+    set_target_properties(OpenGL::GL PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES "${OPENGL_INCLUDE_DIR}")
+  endif()
+
+  if(OPENGL_GLU_FOUND AND NOT TARGET OpenGL::GLU)
+    if(IS_ABSOLUTE "${OPENGL_glu_LIBRARY}")
+      add_library(OpenGL::GLU UNKNOWN IMPORTED)
+      if(OPENGL_glu_LIBRARY MATCHES "/([^/]+)\\.framework$")
+        set(_glu_fw "${OPENGL_glu_LIBRARY}/${CMAKE_MATCH_1}")
+        if(EXISTS "${_glu_fw}.tbd")
+          set(_glu_fw "${_glu_fw}.tbd")
+        endif()
+        set_target_properties(OpenGL::GLU PROPERTIES
+          IMPORTED_LOCATION "${_glu_fw}")
+      else()
+        set_target_properties(OpenGL::GLU PROPERTIES
+          IMPORTED_LOCATION "${OPENGL_glu_LIBRARY}")
+      endif()
+    else()
+      add_library(OpenGL::GLU INTERFACE IMPORTED)
+      set_target_properties(OpenGL::GLU PROPERTIES
+        INTERFACE_LINK_LIBRARIES "${OPENGL_glu_LIBRARY}")
+    endif()
+    set_target_properties(OpenGL::GLU PROPERTIES
+      INTERFACE_LINK_LIBRARIES OpenGL::GL)
+  endif()
+endif()
+
+mark_as_advanced(
+  OPENGL_INCLUDE_DIR
+  OPENGL_xmesa_INCLUDE_DIR
+  OPENGL_glu_LIBRARY
+  OPENGL_gl_LIBRARY
+)
diff --git a/CMakeModules/FindOpenMP.cmake b/CMakeModules/FindOpenMP.cmake
new file mode 100644
index 0000000000..be7f85661d
--- /dev/null
+++ b/CMakeModules/FindOpenMP.cmake
@@ -0,0 +1,457 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+#.rst:
+# FindOpenMP
+# ----------
+#
+# Finds OpenMP support
+#
+# This module can be used to detect OpenMP support in a compiler.  If
+# the compiler supports OpenMP, the flags required to compile with
+# OpenMP support are returned in variables for the different languages.
+# The variables may be empty if the compiler does not need a special
+# flag to support OpenMP.
+#
+# Variables
+# ^^^^^^^^^
+#
+# This module will set the following variables per language in your
+# project, where ``<lang>`` is one of C, CXX, or Fortran:
+#
+# ``OpenMP_<lang>_FOUND``
+#   Variable indicating if OpenMP support for ``<lang>`` was detected.
+# ``OpenMP_<lang>_FLAGS``
+#   OpenMP compiler flags for ``<lang>``, separated by spaces.
+#
+# For linking with OpenMP code written in ``<lang>``, the following
+# variables are provided:
+#
+# ``OpenMP_<lang>_LIB_NAMES``
+#   :ref:`;-list <CMake Language Lists>` of libraries for OpenMP programs for ``<lang>``.
+# ``OpenMP_<libname>_LIBRARY``
+#   Location of the individual libraries needed for OpenMP support in ``<lang>``.
+# ``OpenMP_<lang>_LIBRARIES``
+#   A list of libraries needed to link with OpenMP code written in ``<lang>``.
+#
+# Additionally, the module provides :prop_tgt:`IMPORTED` targets:
+#
+# ``OpenMP::OpenMP_<lang>``
+#   Target for using OpenMP from ``<lang>``.
+#
+# Specifically for Fortran, the module sets the following variables:
+#
+# ``OpenMP_Fortran_HAVE_OMPLIB_HEADER``
+#   Boolean indicating if OpenMP is accessible through ``omp_lib.h``.
+# ``OpenMP_Fortran_HAVE_OMPLIB_MODULE``
+#   Boolean indicating if OpenMP is accessible through the ``omp_lib`` Fortran module.
+#
+# The module will also try to provide the OpenMP version variables:
+#
+# ``OpenMP_<lang>_SPEC_DATE``
+#   Date of the OpenMP specification implemented by the ``<lang>`` compiler.
+# ``OpenMP_<lang>_VERSION_MAJOR``
+#   Major version of OpenMP implemented by the ``<lang>`` compiler.
+# ``OpenMP_<lang>_VERSION_MINOR``
+#   Minor version of OpenMP implemented by the ``<lang>`` compiler.
+# ``OpenMP_<lang>_VERSION``
+#   OpenMP version implemented by the ``<lang>`` compiler.
+#
+# The specification date is formatted as given in the OpenMP standard:
+# ``yyyymm`` where ``yyyy`` and ``mm`` represent the year and month of
+# the OpenMP specification implemented by the ``<lang>`` compiler.
+#
+# Backward Compatibility
+# ^^^^^^^^^^^^^^^^^^^^^^
+#
+# For backward compatibility with older versions of FindOpenMP, these
+# variables are set, but deprecated::
+#
+#   OpenMP_FOUND
+#
+# In new projects, please use the ``OpenMP_<lang>_XXX`` equivalents.
+
+cmake_policy(PUSH)
+cmake_policy(SET CMP0057 NEW) # if IN_LIST
+
+function(_OPENMP_FLAG_CANDIDATES LANG)
+  if(NOT OpenMP_${LANG}_FLAG)
+    unset(OpenMP_FLAG_CANDIDATES)
+
+    set(OMP_FLAG_GNU "-fopenmp")
+    set(OMP_FLAG_Clang "-fopenmp=libomp" "-fopenmp=libiomp5" "-fopenmp")
+    set(OMP_FLAG_HP "+Oopenmp")
+    if(WIN32)
+      set(OMP_FLAG_Intel "-Qopenmp")
+    elseif(CMAKE_${LANG}_COMPILER_ID STREQUAL "Intel" AND
+           "${CMAKE_${LANG}_COMPILER_VERSION}" VERSION_LESS "15.0.0.20140528")
+      set(OMP_FLAG_Intel "-openmp")
+    else()
+      set(OMP_FLAG_Intel "-qopenmp")
+    endif()
+    set(OMP_FLAG_MIPSpro "-mp")
+    set(OMP_FLAG_MSVC "-openmp")
+    set(OMP_FLAG_PathScale "-openmp")
+    set(OMP_FLAG_NAG "-openmp")
+    set(OMP_FLAG_Absoft "-openmp")
+    set(OMP_FLAG_PGI "-mp")
+    set(OMP_FLAG_SunPro "-xopenmp")
+    set(OMP_FLAG_XL "-qsmp=omp")
+    # Cray compiles with OpenMP automatically
+    set(OMP_FLAG_Cray " ")
+
+    # If we know the correct flags, use those
+    if(DEFINED OMP_FLAG_${CMAKE_${LANG}_COMPILER_ID})
+      set(OpenMP_FLAG_CANDIDATES "${OMP_FLAG_${CMAKE_${LANG}_COMPILER_ID}}")
+    # Fall back to reasonable default tries otherwise
+    else()
+      set(OpenMP_FLAG_CANDIDATES "-openmp" "-fopenmp" "-mp" " ")
+    endif()
+    set(OpenMP_${LANG}_FLAG_CANDIDATES "${OpenMP_FLAG_CANDIDATES}" PARENT_SCOPE)
+  else()
+    set(OpenMP_${LANG}_FLAG_CANDIDATES "${OpenMP_${LANG}_FLAG}" PARENT_SCOPE)
+  endif()
+endfunction()
+
+# sample openmp source code to test
+set(OpenMP_C_CXX_TEST_SOURCE
+"
+#include <omp.h>
+int main() {
+#ifndef _OPENMP
+  breaks_on_purpose
+#endif
+}
+")
+
+# in Fortran, an implementation may provide an omp_lib.h header
+# or omp_lib module, or both (OpenMP standard, section 3.1)
+# Furthermore !$ is the Fortran equivalent of #ifdef _OPENMP (OpenMP standard, 2.2.2)
+# Without the conditional compilation, some compilers (e.g. PGI) might compile OpenMP code
+# while not actually enabling OpenMP, building code sequentially
+set(OpenMP_Fortran_TEST_SOURCE
+  "
+      program test
+      @OpenMP_Fortran_INCLUDE_LINE@
+  !$  integer :: n
+      n = omp_get_num_threads()
+      end program test
+  "
+)
+
+function(_OPENMP_WRITE_SOURCE_FILE LANG SRC_FILE_CONTENT_VAR SRC_FILE_NAME SRC_FILE_FULLPATH)
+  set(WORK_DIR ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/FindOpenMP)
+  if("${LANG}" STREQUAL "C")
+    set(SRC_FILE "${WORK_DIR}/${SRC_FILE_NAME}.c")
+    file(WRITE "${SRC_FILE}" "${OpenMP_C_CXX_${SRC_FILE_CONTENT_VAR}}")
+  elseif("${LANG}" STREQUAL "CXX")
+    set(SRC_FILE "${WORK_DIR}/${SRC_FILE_NAME}.cpp")
+    file(WRITE "${SRC_FILE}" "${OpenMP_C_CXX_${SRC_FILE_CONTENT_VAR}}")
+  elseif("${LANG}" STREQUAL "Fortran")
+    set(SRC_FILE "${WORK_DIR}/${SRC_FILE_NAME}.f90")
+    file(WRITE "${SRC_FILE}_in" "${OpenMP_Fortran_${SRC_FILE_CONTENT_VAR}}")
+    configure_file("${SRC_FILE}_in" "${SRC_FILE}" @ONLY)
+  endif()
+  set(${SRC_FILE_FULLPATH} "${SRC_FILE}" PARENT_SCOPE)
+endfunction()
+
+include(${CMAKE_ROOT}/Modules/CMakeParseImplicitLinkInfo.cmake)
+
+function(_OPENMP_GET_FLAGS LANG FLAG_MODE OPENMP_FLAG_VAR OPENMP_LIB_NAMES_VAR)
+  _OPENMP_FLAG_CANDIDATES("${LANG}")
+  _OPENMP_WRITE_SOURCE_FILE("${LANG}" "TEST_SOURCE" OpenMPTryFlag _OPENMP_TEST_SRC)
+
+  foreach(OPENMP_FLAG IN LISTS OpenMP_${LANG}_FLAG_CANDIDATES)
+    set(OPENMP_FLAGS_TEST "${OPENMP_FLAG}")
+    if(CMAKE_${LANG}_VERBOSE_FLAG)
+      string(APPEND OPENMP_FLAGS_TEST " ${CMAKE_${LANG}_VERBOSE_FLAG}")
+    endif()
+    string(REGEX REPLACE "[-/=+]" "" OPENMP_PLAIN_FLAG "${OPENMP_FLAG}")
+    try_compile( OpenMP_COMPILE_RESULT_${FLAG_MODE}_${OPENMP_PLAIN_FLAG} ${CMAKE_BINARY_DIR} ${_OPENMP_TEST_SRC}
+      CMAKE_FLAGS "-DCOMPILE_DEFINITIONS:STRING=${OPENMP_FLAGS_TEST}"
+      OUTPUT_VARIABLE OpenMP_TRY_COMPILE_OUTPUT
+    )
+
+    if(OpenMP_COMPILE_RESULT_${FLAG_MODE}_${OPENMP_PLAIN_FLAG})
+      set("${OPENMP_FLAG_VAR}" "${OPENMP_FLAG}" PARENT_SCOPE)
+
+      if(CMAKE_${LANG}_VERBOSE_FLAG)
+        unset(OpenMP_${LANG}_IMPLICIT_LIBRARIES)
+        unset(OpenMP_${LANG}_IMPLICIT_LINK_DIRS)
+        unset(OpenMP_${LANG}_IMPLICIT_FWK_DIRS)
+        unset(OpenMP_${LANG}_LOG_VAR)
+
+        file(APPEND ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/CMakeOutput.log
+        "Detecting ${LANG} OpenMP compiler ABI info compiled with the following output:\n${OpenMP_TRY_COMPILE_OUTPUT}\n\n")
+
+        cmake_parse_implicit_link_info("${OpenMP_TRY_COMPILE_OUTPUT}"
+          OpenMP_${LANG}_IMPLICIT_LIBRARIES
+          OpenMP_${LANG}_IMPLICIT_LINK_DIRS
+          OpenMP_${LANG}_IMPLICIT_FWK_DIRS
+          OpenMP_${LANG}_LOG_VAR
+          "${CMAKE_${LANG}_IMPLICIT_OBJECT_REGEX}"
+        )
+
+        file(APPEND ${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/CMakeOutput.log
+        "Parsed ${LANG} OpenMP implicit link information from above output:\n${OpenMP_${LANG}_LOG_VAR}\n\n")
+
+        unset(_OPENMP_LIB_NAMES)
+        foreach(_OPENMP_IMPLICIT_LIB IN LISTS OpenMP_${LANG}_IMPLICIT_LIBRARIES)
+          if(NOT "${_OPENMP_IMPLICIT_LIB}" IN_LIST CMAKE_${LANG}_IMPLICIT_LINK_LIBRARIES)
+            find_library(OpenMP_${_OPENMP_IMPLICIT_LIB}_LIBRARY
+              NAMES "${_OPENMP_IMPLICIT_LIB}"
+              HINTS ${OpenMP_${LANG}_IMPLICIT_LINK_DIRS}
+            )
+            mark_as_advanced(OpenMP_${_OPENMP_IMPLICIT_LIB}_LIBRARY)
+            list(APPEND _OPENMP_LIB_NAMES ${_OPENMP_IMPLICIT_LIB})
+          endif()
+        endforeach()
+        set("${OPENMP_LIB_NAMES_VAR}" "${_OPENMP_LIB_NAMES}" PARENT_SCOPE)
+      else()
+        # The Intel compiler on Windows has no verbose mode, so we need to treat it explicitly
+        if("${CMAKE_${LANG}_COMPILER_ID}" STREQUAL "Intel" AND "${CMAKE_SYSTEM_NAME}" STREQUAL "Windows")
+          set("${OPENMP_LIB_NAMES_VAR}" "libiomp5md" PARENT_SCOPE)
+          find_library(OpenMP_libiomp5md_LIBRARY
+            NAMES "libiomp5md"
+            HINTS ${CMAKE_${LANG}_IMPLICIT_LINK_DIRECTORIES}
+          )
+          mark_as_advanced(OpenMP_libiomp5md_LIBRARY)
+        else()
+          set("${OPENMP_LIB_NAMES_VAR}" "" PARENT_SCOPE)
+        endif()
+      endif()
+      break()
+    endif()
+    set("${OPENMP_LIB_NAMES_VAR}" "NOTFOUND" PARENT_SCOPE)
+    set("${OPENMP_FLAG_VAR}" "NOTFOUND" PARENT_SCOPE)
+  endforeach()
+endfunction()
+
+set(OpenMP_C_CXX_CHECK_VERSION_SOURCE
+"
+#include <stdio.h>
+#include <omp.h>
+const char ompver_str[] = { 'I', 'N', 'F', 'O', ':', 'O', 'p', 'e', 'n', 'M',
+                            'P', '-', 'd', 'a', 't', 'e', '[',
+                            ('0' + ((_OPENMP/100000)%10)),
+                            ('0' + ((_OPENMP/10000)%10)),
+                            ('0' + ((_OPENMP/1000)%10)),
+                            ('0' + ((_OPENMP/100)%10)),
+                            ('0' + ((_OPENMP/10)%10)),
+                            ('0' + ((_OPENMP/1)%10)),
+                            ']', '\\0' };
+int main()
+{
+  puts(ompver_str);
+}
+")
+
+set(OpenMP_Fortran_CHECK_VERSION_SOURCE
+"
+      program omp_ver
+      @OpenMP_Fortran_INCLUDE_LINE@
+      integer, parameter :: zero = ichar('0')
+      integer, parameter :: ompv = openmp_version
+      character, dimension(24), parameter :: ompver_str =&
+      (/ 'I', 'N', 'F', 'O', ':', 'O', 'p', 'e', 'n', 'M', 'P', '-',&
+         'd', 'a', 't', 'e', '[',&
+         char(zero + mod(ompv/100000, 10)),&
+         char(zero + mod(ompv/10000, 10)),&
+         char(zero + mod(ompv/1000, 10)),&
+         char(zero + mod(ompv/100, 10)),&
+         char(zero + mod(ompv/10, 10)),&
+         char(zero + mod(ompv/1, 10)), ']' /)
+      print *, ompver_str
+      end program omp_ver
+")
+
+function(_OPENMP_GET_SPEC_DATE LANG SPEC_DATE)
+  _OPENMP_WRITE_SOURCE_FILE("${LANG}" "CHECK_VERSION_SOURCE" OpenMPCheckVersion _OPENMP_TEST_SRC)
+
+  set(BIN_FILE "${CMAKE_BINARY_DIR}${CMAKE_FILES_DIRECTORY}/FindOpenMP/ompver_${LANG}.bin")
+  string(REGEX REPLACE "[-/=+]" "" OPENMP_PLAIN_FLAG "${OPENMP_FLAG}")
+  try_compile(OpenMP_SPECTEST_${LANG}_${OPENMP_PLAIN_FLAG} "${CMAKE_BINARY_DIR}" "${_OPENMP_TEST_SRC}"
+              CMAKE_FLAGS "-DCOMPILE_DEFINITIONS:STRING=${OpenMP_${LANG}_FLAGS}"
+              COPY_FILE ${BIN_FILE})
+
+  if(${OpenMP_SPECTEST_${LANG}_${OPENMP_PLAIN_FLAG}})
+    file(STRINGS ${BIN_FILE} specstr LIMIT_COUNT 1 REGEX "INFO:OpenMP-date")
+    set(regex_spec_date ".*INFO:OpenMP-date\\[0*([^]]*)\\].*")
+    if("${specstr}" MATCHES "${regex_spec_date}")
+      set(${SPEC_DATE} "${CMAKE_MATCH_1}" PARENT_SCOPE)
+    endif()
+  endif()
+endfunction()
+
+macro(_OPENMP_SET_VERSION_BY_SPEC_DATE LANG)
+  set(OpenMP_SPEC_DATE_MAP
+    # Combined versions, 2.5 onwards
+    "201511=4.5"
+    "201307=4.0"
+    "201107=3.1"
+    "200805=3.0"
+    "200505=2.5"
+    # C/C++ version 2.0
+    "200203=2.0"
+    # Fortran version 2.0
+    "200011=2.0"
+    # Fortran version 1.1
+    "199911=1.1"
+    # C/C++ version 1.0 (there's no 1.1 for C/C++)
+    "199810=1.0"
+    # Fortran version 1.0
+    "199710=1.0"
+  )
+
+  string(REGEX MATCHALL "${OpenMP_${LANG}_SPEC_DATE}=([0-9]+)\\.([0-9]+)" _version_match "${OpenMP_SPEC_DATE_MAP}")
+  if(NOT _version_match STREQUAL "")
+    set(OpenMP_${LANG}_VERSION_MAJOR ${CMAKE_MATCH_1})
+    set(OpenMP_${LANG}_VERSION_MINOR ${CMAKE_MATCH_2})
+    set(OpenMP_${LANG}_VERSION "${OpenMP_${LANG}_VERSION_MAJOR}.${OpenMP_${LANG}_VERSION_MINOR}")
+  else()
+    unset(OpenMP_${LANG}_VERSION_MAJOR)
+    unset(OpenMP_${LANG}_VERSION_MINOR)
+    unset(OpenMP_${LANG}_VERSION)
+  endif()
+  unset(_version_match)
+  unset(OpenMP_SPEC_DATE_MAP)
+endmacro()
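+
+# Worked example (illustrative only, derived from the map above): if the
+# version probe reported a spec date of 201511 for C, the entry "201511=4.5"
+# matches and the macro is expected to set
+#   OpenMP_C_VERSION_MAJOR = 4
+#   OpenMP_C_VERSION_MINOR = 5
+#   OpenMP_C_VERSION       = 4.5
+# A date that is not listed in the map matches nothing, so all three
+# variables are unset.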
+
+foreach(LANG IN ITEMS C CXX)
+  if(CMAKE_${LANG}_COMPILER_LOADED)
+    if(NOT DEFINED OpenMP_${LANG}_FLAGS OR "${OpenMP_${LANG}_FLAGS}" STREQUAL "NOTFOUND"
+      OR NOT DEFINED OpenMP_${LANG}_LIB_NAMES OR "${OpenMP_${LANG}_LIB_NAMES}" STREQUAL "NOTFOUND")
+      _OPENMP_GET_FLAGS("${LANG}" "${LANG}" OpenMP_${LANG}_FLAGS_WORK OpenMP_${LANG}_LIB_NAMES_WORK)
+    endif()
+
+    set(OpenMP_${LANG}_FLAGS "${OpenMP_${LANG}_FLAGS_WORK}"
+      CACHE STRING "${LANG} compiler flags for OpenMP parallelization")
+    set(OpenMP_${LANG}_LIB_NAMES "${OpenMP_${LANG}_LIB_NAMES_WORK}"
+      CACHE STRING "${LANG} compiler libraries for OpenMP parallelization")
+    mark_as_advanced(OpenMP_${LANG}_FLAGS OpenMP_${LANG}_LIB_NAMES)
+  endif()
+endforeach()
+
+if(CMAKE_Fortran_COMPILER_LOADED)
+  if(NOT DEFINED OpenMP_Fortran_FLAGS OR "${OpenMP_Fortran_FLAGS}" STREQUAL "NOTFOUND"
+    OR NOT DEFINED OpenMP_Fortran_LIB_NAMES OR "${OpenMP_Fortran_LIB_NAMES}" STREQUAL "NOTFOUND"
+    OR NOT DEFINED OpenMP_Fortran_HAVE_OMPLIB_MODULE)
+    set(OpenMP_Fortran_INCLUDE_LINE "use omp_lib\n      implicit none")
+    _OPENMP_GET_FLAGS("Fortran" "FortranHeader" OpenMP_Fortran_FLAGS_WORK OpenMP_Fortran_LIB_NAMES_WORK)
+    if(OpenMP_Fortran_FLAGS_WORK)
+      set(OpenMP_Fortran_HAVE_OMPLIB_MODULE TRUE CACHE BOOL INTERNAL "")
+    endif()
+
+    set(OpenMP_Fortran_FLAGS "${OpenMP_Fortran_FLAGS_WORK}"
+      CACHE STRING "Fortran compiler flags for OpenMP parallelization")
+    set(OpenMP_Fortran_LIB_NAMES "${OpenMP_Fortran_LIB_NAMES_WORK}"
+      CACHE STRING "Fortran compiler libraries for OpenMP parallelization")
+    mark_as_advanced(OpenMP_Fortran_FLAGS OpenMP_Fortran_LIB_NAMES)
+  endif()
+
+  if(NOT DEFINED OpenMP_Fortran_FLAGS OR "${OpenMP_Fortran_FLAGS}" STREQUAL "NOTFOUND"
+    OR NOT DEFINED OpenMP_Fortran_LIB_NAMES OR "${OpenMP_Fortran_LIB_NAMES}" STREQUAL "NOTFOUND"
+    OR NOT DEFINED OpenMP_Fortran_HAVE_OMPLIB_HEADER)
+    set(OpenMP_Fortran_INCLUDE_LINE "implicit none\n      include 'omp_lib.h'")
+    _OPENMP_GET_FLAGS("Fortran" "FortranModule" OpenMP_Fortran_FLAGS_WORK OpenMP_Fortran_LIB_NAMES_WORK)
+    if(OpenMP_Fortran_FLAGS_WORK)
+      set(OpenMP_Fortran_HAVE_OMPLIB_HEADER TRUE CACHE BOOL INTERNAL "")
+    endif()
+
+    set(OpenMP_Fortran_FLAGS "${OpenMP_Fortran_FLAGS_WORK}"
+      CACHE STRING "Fortran compiler flags for OpenMP parallelization")
+
+    set(OpenMP_Fortran_LIB_NAMES "${OpenMP_Fortran_LIB_NAMES}"
+      CACHE STRING "Fortran compiler libraries for OpenMP parallelization")
+  endif()
+
+  if(OpenMP_Fortran_HAVE_OMPLIB_MODULE)
+    set(OpenMP_Fortran_INCLUDE_LINE "use omp_lib\n      implicit none")
+  else()
+    set(OpenMP_Fortran_INCLUDE_LINE "implicit none\n      include 'omp_lib.h'")
+  endif()
+endif()
+
+set(OPENMP_FOUND TRUE)
+
+foreach(LANG IN ITEMS C CXX Fortran)
+  if(CMAKE_${LANG}_COMPILER_LOADED)
+    if (NOT OpenMP_${LANG}_SPEC_DATE)
+      _OPENMP_GET_SPEC_DATE("${LANG}" OpenMP_${LANG}_SPEC_DATE_INTERNAL)
+      set(OpenMP_${LANG}_SPEC_DATE "${OpenMP_${LANG}_SPEC_DATE_INTERNAL}" CACHE
+        INTERNAL "${LANG} compiler's OpenMP specification date")
+      _OPENMP_SET_VERSION_BY_SPEC_DATE("${LANG}")
+    endif()
+
+    include(FindPackageHandleStandardArgs)
+
+    set(OpenMP_${LANG}_FIND_QUIETLY ${OpenMP_FIND_QUIETLY})
+    set(OpenMP_${LANG}_FIND_REQUIRED ${OpenMP_FIND_REQUIRED})
+    set(OpenMP_${LANG}_FIND_VERSION ${OpenMP_FIND_VERSION})
+    set(OpenMP_${LANG}_FIND_VERSION_EXACT ${OpenMP_FIND_VERSION_EXACT})
+
+    set(_OPENMP_${LANG}_REQUIRED_VARS OpenMP_${LANG}_FLAGS)
+    if("${OpenMP_${LANG}_LIB_NAMES}" STREQUAL "NOTFOUND")
+      set(_OPENMP_${LANG}_REQUIRED_LIB_VARS OpenMP_${LANG}_LIB_NAMES)
+    else()
+      foreach(_OPENMP_IMPLICIT_LIB IN LISTS OpenMP_${LANG}_LIB_NAMES)
+        list(APPEND _OPENMP_${LANG}_REQUIRED_LIB_VARS OpenMP_${_OPENMP_IMPLICIT_LIB}_LIBRARY)
+      endforeach()
+    endif()
+
+    find_package_handle_standard_args(OpenMP_${LANG}
+      REQUIRED_VARS OpenMP_${LANG}_FLAGS ${_OPENMP_${LANG}_REQUIRED_LIB_VARS}
+      VERSION_VAR OpenMP_${LANG}_VERSION
+    )
+
+    if(OpenMP_${LANG}_FOUND)
+      set(OpenMP_${LANG}_LIBRARIES "")
+      foreach(_OPENMP_IMPLICIT_LIB IN LISTS OpenMP_${LANG}_LIB_NAMES)
+        list(APPEND OpenMP_${LANG}_LIBRARIES "${OpenMP_${_OPENMP_IMPLICIT_LIB}_LIBRARY}")
+      endforeach()
+
+      if(NOT TARGET OpenMP::OpenMP_${LANG})
+        add_library(OpenMP::OpenMP_${LANG} INTERFACE IMPORTED)
+      endif()
+      if(OpenMP_${LANG}_FLAGS)
+
+        if(UNIX)
+          separate_arguments(_OpenMP_${LANG}_OPTIONS UNIX_COMMAND "${OpenMP_${LANG}_FLAGS}")
+        elseif(WIN32)
+          separate_arguments(_OpenMP_${LANG}_OPTIONS WINDOWS_COMMAND "${OpenMP_${LANG}_FLAGS}")
+        endif()
+
+        set_property(TARGET OpenMP::OpenMP_${LANG} PROPERTY
+          INTERFACE_COMPILE_OPTIONS "${_OpenMP_${LANG}_OPTIONS}")
+        unset(_OpenMP_${LANG}_OPTIONS)
+      endif()
+      if(OpenMP_${LANG}_LIBRARIES)
+        set_property(TARGET OpenMP::OpenMP_${LANG} PROPERTY
+          INTERFACE_LINK_LIBRARIES "${OpenMP_${LANG}_LIBRARIES}")
+      endif()
+    else()
+      set(OPENMP_FOUND FALSE)
+    endif()
+  endif()
+endforeach()
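+
+# Usage sketch (not part of the module; `solver` and solver.cpp are
+# hypothetical names). The loop above defines OpenMP::OpenMP_<LANG> for each
+# enabled language in which OpenMP was found, so a consuming project could
+# write:
+#
+#   find_package(OpenMP)
+#   add_executable(solver solver.cpp)
+#   if(OpenMP_CXX_FOUND)
+#     target_link_libraries(solver PRIVATE OpenMP::OpenMP_CXX)
+#   endif()
+#
+# The imported target carries the compile options and link libraries
+# detected above, so no manual flag handling should be needed.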
+
+if(CMAKE_Fortran_COMPILER_LOADED AND OpenMP_Fortran_FOUND)
+  if(NOT DEFINED OpenMP_Fortran_HAVE_OMPLIB_MODULE)
+    set(OpenMP_Fortran_HAVE_OMPLIB_MODULE FALSE CACHE BOOL INTERNAL "")
+  endif()
+  if(NOT DEFINED OpenMP_Fortran_HAVE_OMPLIB_HEADER)
+    set(OpenMP_Fortran_HAVE_OMPLIB_HEADER FALSE CACHE BOOL INTERNAL "")
+  endif()
+endif()
+
+if(NOT ( CMAKE_C_COMPILER_LOADED OR CMAKE_CXX_COMPILER_LOADED OR CMAKE_Fortran_COMPILER_LOADED ))
+  message(SEND_ERROR "FindOpenMP requires the C, CXX or Fortran languages to be enabled")
+endif()
+
+unset(OpenMP_C_CXX_TEST_SOURCE)
+unset(OpenMP_Fortran_TEST_SOURCE)
+unset(OpenMP_C_CXX_CHECK_VERSION_SOURCE)
+unset(OpenMP_Fortran_CHECK_VERSION_SOURCE)
+unset(OpenMP_Fortran_INCLUDE_LINE)
+
+cmake_policy(POP)
diff --git a/CMakeModules/FindcuDNN.cmake b/CMakeModules/FindcuDNN.cmake
new file mode 100644
index 0000000000..98641f4198
--- /dev/null
+++ b/CMakeModules/FindcuDNN.cmake
@@ -0,0 +1,238 @@
+# Fetched the original content of this file from
+# https://github.com/soumith/cudnn.torch
+#
+# Original Copyright:
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+#
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+#
+# FindcuDNN
+# -------
+#
+# Find cuDNN library
+#
+# This module creates the imported target cuDNN::cuDNN upon successful
+# lookup of cuDNN headers and libraries.
+#
+# Variables that affect the result:
+# <VERSION>, <REQUIRED>, <QUIET>: as usual
+#
+# Usage
+# -----
+# add_executable(helloworld main.cpp)
+# target_link_libraries(helloworld PRIVATE cuDNN::cuDNN)
+#
+# Note: It is recommended to avoid using variables set by the find module.
+#
+# Result variables
+# ----------------
+#
+# This module will set the following variables in your project:
+#
+# ``cuDNN_INCLUDE_DIRS``
+#   where to find cudnn.h.
+#
+# ``cuDNN_LINK_LIBRARY``
+#   the libraries to link against to use cuDNN. Prior to cuDNN 8, this is a huge monolithic
+#   library. However, since cuDNN 8 it has been split into multiple shared libraries. If
+#   cuDNN version 8 is found, this variable contains the shared library that dlopens the
+#   other libraries: cuDNN_*_INFER_LINK_LIBRARY and cuDNN_*_TRAIN_LINK_LIBRARY as needed.
+#   For versions of cuDNN 7 or lower, cuDNN_*_INFER_LINK_LIBRARY and cuDNN_*_TRAIN_LINK_LIBRARY
+#   are not defined.
+#
+# ``cuDNN_ADV_INFER_LINK_LIBRARY``
+#   the libraries to link directly to use advanced inference API from cuDNN.
+# ``cuDNN_ADV_INFER_DLL_LIBRARY``
+#   Corresponding advanced inference API Windows DLL. This is not set on non-Windows platforms.
+# ``cuDNN_ADV_TRAIN_LINK_LIBRARY``
+#   the libraries to link directly to use advanced training API from cuDNN.
+# ``cuDNN_ADV_TRAIN_DLL_LIBRARY``
+#   Corresponding advanced training API Windows DLL. This is not set on non-Windows platforms.
+#
+# ``cuDNN_CNN_INFER_LINK_LIBRARY``
+#   the libraries to link directly to use convolutional neural networks inference API from cuDNN.
+# ``cuDNN_CNN_INFER_DLL_LIBRARY``
+#   Corresponding CNN inference API Windows DLL. This is not set on non-Windows platforms.
+# ``cuDNN_CNN_TRAIN_LINK_LIBRARY``
+#   the libraries to link directly to use convolutional neural networks training API from cuDNN.
+# ``cuDNN_CNN_TRAIN_DLL_LIBRARY``
+#   Corresponding CNN training API Windows DLL. This is not set on non-Windows platforms.
+#
+# ``cuDNN_OPS_INFER_LINK_LIBRARY``
+#   the libraries to link directly to use standard ML operations inference API from cuDNN.
+# ``cuDNN_OPS_INFER_DLL_LIBRARY``
+#   Corresponding OPS inference API Windows DLL. This is not set on non-Windows platforms.
+# ``cuDNN_OPS_TRAIN_LINK_LIBRARY``
+#   the libraries to link directly to use standard ML operations training API from cuDNN.
+# ``cuDNN_OPS_TRAIN_DLL_LIBRARY``
+#   Corresponding OPS training API Windows DLL. This is not set on non-Windows platforms.
+#
+# ``cuDNN_FOUND``
+#   If false, do not try to use cuDNN.
+# ``cuDNN_VERSION``
+#   Version of the cuDNN library found
+# ``cuDNN_VERSION_MAJOR``
+#   Major Version of the cuDNN library found
+# ``cuDNN_VERSION_MINOR``
+#   Minor Version of the cuDNN library found
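+#
+# Sketch of a guarded lookup (not from the original documentation; the
+# install path and the target name `trainer` are hypothetical, and this file
+# is assumed to be reachable through CMAKE_MODULE_PATH). cuDNN_ROOT_DIR is
+# one of the search hints used below:
+#
+#   set(cuDNN_ROOT_DIR "/opt/cudnn")
+#   find_package(cuDNN 8)
+#   if(cuDNN_FOUND)
+#     message(STATUS "Found cuDNN ${cuDNN_VERSION}")
+#     target_link_libraries(trainer PRIVATE cuDNN::cuDNN)
+#   endif()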
+
+find_package(PkgConfig)
+pkg_check_modules(PC_CUDNN QUIET cuDNN)
+
+find_package(CUDA QUIET)
+
+find_path(cuDNN_INCLUDE_DIRS
+  NAMES cudnn.h
+  HINTS
+    ${cuDNN_ROOT_DIR}
+    ${PC_CUDNN_INCLUDE_DIRS}
+    ${CUDA_TOOLKIT_INCLUDE}
+  PATH_SUFFIXES include
+  DOC "cuDNN include directory path." )
+
+if(cuDNN_INCLUDE_DIRS)
+  file(READ ${cuDNN_INCLUDE_DIRS}/cudnn.h CUDNN_VERSION_FILE_CONTENTS)
+  string(REGEX MATCH "define CUDNN_MAJOR * +([0-9]+)"
+    CUDNN_MAJOR_VERSION "${CUDNN_VERSION_FILE_CONTENTS}")
+  list(LENGTH CUDNN_MAJOR_VERSION cudnn_ver_matches)
+  if(${cudnn_ver_matches} EQUAL 0)
+    file(READ ${cuDNN_INCLUDE_DIRS}/cudnn_version.h CUDNN_VERSION_FILE_CONTENTS)
+    string(REGEX MATCH "define CUDNN_MAJOR * +([0-9]+)"
+      CUDNN_MAJOR_VERSION "${CUDNN_VERSION_FILE_CONTENTS}")
+  endif()
+  string(REGEX REPLACE "define CUDNN_MAJOR * +([0-9]+)" "\\1"
+      CUDNN_MAJOR_VERSION "${CUDNN_MAJOR_VERSION}")
+  string(REGEX MATCH "define CUDNN_MINOR * +([0-9]+)"
+    CUDNN_MINOR_VERSION "${CUDNN_VERSION_FILE_CONTENTS}")
+  string(REGEX REPLACE "define CUDNN_MINOR * +([0-9]+)" "\\1"
+      CUDNN_MINOR_VERSION "${CUDNN_MINOR_VERSION}")
+  string(REGEX MATCH "define CUDNN_PATCHLEVEL * +([0-9]+)"
+    CUDNN_PATCH_VERSION "${CUDNN_VERSION_FILE_CONTENTS}")
+  string(REGEX REPLACE "define CUDNN_PATCHLEVEL * +([0-9]+)" "\\1"
+      CUDNN_PATCH_VERSION "${CUDNN_PATCH_VERSION}")
+  set(cuDNN_VERSION_MAJOR ${CUDNN_MAJOR_VERSION})
+  set(cuDNN_VERSION_MINOR ${CUDNN_MINOR_VERSION})
+  set(cuDNN_VERSION ${CUDNN_MAJOR_VERSION}.${CUDNN_MINOR_VERSION})
+endif()
+
+# Use the requested major version as the library suffix when an exact version
+# is requested; otherwise use the major version read from the cuDNN header
+if(cuDNN_FIND_VERSION_EXACT)
+  set(cudnn_ver_suffix "${cuDNN_FIND_VERSION_MAJOR}")
+else()
+  set(cudnn_ver_suffix "${CUDNN_MAJOR_VERSION}")
+endif()
+
+if(cuDNN_INCLUDE_DIRS)
+  get_filename_component(libpath_cudart "${CUDA_CUDART_LIBRARY}" PATH)
+
+  macro(af_find_cudnn_libs cudnn_lib_name_infix)
+    if("${cudnn_lib_name_infix}" STREQUAL "")
+      set(LIB_INFIX "")
+    else()
+      string(TOUPPER ${cudnn_lib_name_infix} LIB_INFIX)
+    endif()
+    find_library(cuDNN${LIB_INFIX}_LINK_LIBRARY
+      NAMES
+        libcudnn${cudnn_lib_name_infix}.so.${cudnn_ver_suffix}
+        libcudnn${cudnn_lib_name_infix}.${cudnn_ver_suffix}.dylib
+        cudnn${cudnn_lib_name_infix}
+      PATHS
+        ${cuDNN_ROOT_DIR}
+        ${PC_CUDNN_LIBRARY_DIRS}
+        $ENV{LD_LIBRARY_PATH}
+        ${libpath_cudart}
+        ${CMAKE_INSTALL_PREFIX}
+      PATH_SUFFIXES lib lib64 bin lib/x64 bin/x64
+      DOC "cudnn${cudnn_lib_name_infix} link library." )
+    mark_as_advanced(cuDNN${LIB_INFIX}_LINK_LIBRARY)
+
+    if(WIN32 AND cuDNN_LINK_LIBRARY)
+      find_file(cuDNN${LIB_INFIX}_DLL_LIBRARY
+      NAMES cudnn${cudnn_lib_name_infix}64_${cudnn_ver_suffix}${CMAKE_SHARED_LIBRARY_SUFFIX}
+      PATHS
+        ${cuDNN_ROOT_DIR}
+        ${PC_CUDNN_LIBRARY_DIRS}
+        $ENV{PATH}
+        ${libpath_cudart}
+        ${CMAKE_INSTALL_PREFIX}
+      PATH_SUFFIXES lib lib64 bin lib/x64 bin/x64
+      DOC "cudnn${cudnn_lib_name_infix} Windows DLL." )
+      mark_as_advanced(cuDNN${LIB_INFIX}_DLL_LIBRARY)
+    endif()
+  endmacro()
+
+  af_find_cudnn_libs("") # gets base cudnn shared library
+  if(cuDNN_VERSION_MAJOR VERSION_EQUAL 8)
+    af_find_cudnn_libs("_adv_infer")
+    af_find_cudnn_libs("_adv_train")
+    af_find_cudnn_libs("_cnn_infer")
+    af_find_cudnn_libs("_cnn_train")
+    af_find_cudnn_libs("_ops_infer")
+    af_find_cudnn_libs("_ops_train")
+  elseif(cuDNN_VERSION_MAJOR VERSION_GREATER_EQUAL 9)
+    af_find_cudnn_libs("_adv")
+    af_find_cudnn_libs("_cnn")
+    af_find_cudnn_libs("_ops")
+  endif()
+endif()
+
+find_package_handle_standard_args(cuDNN
+  REQUIRED_VARS cuDNN_LINK_LIBRARY cuDNN_INCLUDE_DIRS
+  VERSION_VAR   cuDNN_VERSION)
+
+mark_as_advanced(cuDNN_LINK_LIBRARY cuDNN_INCLUDE_DIRS cuDNN_DLL_LIBRARY)
+
+if(cuDNN_FOUND)
+  add_library(cuDNN::cuDNN SHARED IMPORTED)
+  if(WIN32)
+    set_target_properties(cuDNN::cuDNN
+      PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGE "C"
+      INTERFACE_INCLUDE_DIRECTORIES "${cuDNN_INCLUDE_DIRS}"
+      IMPORTED_LOCATION "${cuDNN_DLL_LIBRARY}"
+      IMPORTED_IMPLIB "${cuDNN_LINK_LIBRARY}"
+    )
+  else(WIN32)
+    set_target_properties(cuDNN::cuDNN
+      PROPERTIES
+      IMPORTED_LINK_INTERFACE_LANGUAGE "C"
+      INTERFACE_INCLUDE_DIRECTORIES "${cuDNN_INCLUDE_DIRS}"
+      IMPORTED_LOCATION "${cuDNN_LINK_LIBRARY}"
+    )
+  endif(WIN32)
+  if(cuDNN_VERSION_MAJOR VERSION_GREATER 8 OR cuDNN_VERSION_MAJOR VERSION_EQUAL 8)
+    macro(create_cudnn_target cudnn_target_name)
+      string(TOUPPER ${cudnn_target_name} target_infix)
+      add_library(cuDNN::${cudnn_target_name} SHARED IMPORTED)
+      if(WIN32)
+        set_target_properties(cuDNN::${cudnn_target_name}
+          PROPERTIES
+          IMPORTED_LINK_INTERFACE_LANGUAGE "C"
+          INTERFACE_INCLUDE_DIRECTORIES "${cuDNN_INCLUDE_DIRS}"
+          IMPORTED_LOCATION "${cuDNN_${target_infix}_DLL_LIBRARY}"
+          IMPORTED_IMPLIB "${cuDNN_${target_infix}_LINK_LIBRARY}"
+        )
+      else(WIN32)
+        set_target_properties(cuDNN::${cudnn_target_name}
+          PROPERTIES
+          IMPORTED_LINK_INTERFACE_LANGUAGE "C"
+          INTERFACE_INCLUDE_DIRECTORIES "${cuDNN_INCLUDE_DIRS}"
+          IMPORTED_LOCATION "${cuDNN_${target_infix}_LINK_LIBRARY}"
+        )
+      endif(WIN32)
+    endmacro()
+    if(cuDNN_VERSION_MAJOR VERSION_EQUAL 8)
+      create_cudnn_target(adv_infer)
+      create_cudnn_target(adv_train)
+      create_cudnn_target(cnn_infer)
+      create_cudnn_target(cnn_train)
+      create_cudnn_target(ops_infer)
+      create_cudnn_target(ops_train)
+    else()
+      # For cuDNN 9 and newer, af_find_cudnn_libs above searches for the
+      # _adv, _cnn and _ops libraries only, so create matching targets
+      create_cudnn_target(adv)
+      create_cudnn_target(cnn)
+      create_cudnn_target(ops)
+    endif()
+  endif()
+endif(cuDNN_FOUND)
diff --git a/CMakeModules/InternalUtils.cmake b/CMakeModules/InternalUtils.cmake
new file mode 100644
index 0000000000..8d29718365
--- /dev/null
+++ b/CMakeModules/InternalUtils.cmake
@@ -0,0 +1,288 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+function(dependency_check VAR ERROR_MESSAGE)
+  if(NOT ${VAR})
+    message(SEND_ERROR ${ERROR_MESSAGE})
+  endif()
+endfunction()
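+
+# Usage sketch (the variable and message are hypothetical). VAR is the *name*
+# of a variable rather than its value, because if(NOT ${VAR}) dereferences it:
+#
+#   dependency_check(OpenCL_FOUND "OpenCL not found.")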
+
+# Adds the subdirectory to the build if the named variable evaluates to true
+function(conditional_directory variable directory)
+  if(${variable})
+    add_subdirectory(${directory})
+  endif()
+endfunction()
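+
+# Usage sketch (the option and directory names are hypothetical):
+#
+#   option(AF_BUILD_EXAMPLES "Build the examples" ON)
+#   conditional_directory(AF_BUILD_EXAMPLES examples)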
+
+include(CheckCXXCompilerFlag)
+
+if(WIN32)
+  check_cxx_compiler_flag(/Zc:__cplusplus cplusplus_define)
+  check_cxx_compiler_flag(/permissive- cxx_compliance)
+endif()
+
+check_cxx_compiler_flag(-ffast-math has_cxx_fast_math)
+check_cxx_compiler_flag("-fp-model fast" has_cxx_fp_model)
+check_cxx_compiler_flag(-fno-errno-math has_cxx_no_errno_math)
+check_cxx_compiler_flag(-fno-trapping-math  has_cxx_no_trapping_math)
+check_cxx_compiler_flag(-fno-signed-zeros  has_cxx_no_signed_zeros)
+check_cxx_compiler_flag(-mno-ieee-fp has_cxx_no_ieee_fp)
+check_cxx_compiler_flag(-Wno-unqualified-std-cast-call has_cxx_unqualified_std_cast_call)
+check_cxx_compiler_flag(-Werror=reorder-ctor has_cxx_error_reorder_ctor)
+check_cxx_compiler_flag(-Rno-debug-disables-optimization has_cxx_debug-disables-optimization)
+
+
+function(arrayfire_set_default_cxx_flags target)
+  target_compile_options(${target}
+    PRIVATE
+
+      $<$<BOOL:${CMAKE_SYCL_COMPILER}>:
+        $<$<COMPILE_LANGUAGE:SYCL>:
+                # OpenCL targets need this flag to avoid
+                # ignored attribute warnings in the OpenCL
+                # headers
+                -Wno-ignored-attributes
+                -Wall
+                -Wno-unqualified-std-cast-call
+                -Werror=reorder-ctor
+                #-fp-model precise
+                $<$<BOOL:${AF_WITH_FAST_MATH}>: -ffast-math -fno-errno-math -fno-trapping-math -fno-signed-zeros -mno-ieee-fp>
+                $<$<NOT:$<BOOL:${AF_WITH_FAST_MATH}>>: $<IF:$<PLATFORM_ID:Windows>,/fp=precise,-fp-model=precise>>
+                $<$<CONFIG:Debug>:-Rno-debug-disables-optimization>
+
+                $<$<PLATFORM_ID:Windows>: /wd4251
+                                          /wd4068
+                                          /wd4275
+                                          /wd4668
+                                          /wd4710
+                                          /wd4505
+                                          /we5038
+                                          /bigobj
+                                          /EHsc
+                                          /nologo
+                                          # MSVC incorrectly sets the cplusplus to 199711L even if the compiler supports
+                                          # c++11 features. This flag sets it to the correct standard supported by the
+                                          # compiler
+                                          $<$<BOOL:${cplusplus_define}>:/Zc:__cplusplus>
+                                          $<$<BOOL:${cxx_compliance}>:/permissive-> >
+            >>
+      $<$<COMPILE_LANGUAGE:CXX>:
+              # C4068: Warnings about unknown pragmas
+              # C4668: Warnings about unknown definitions
+              # C4275: Warnings about using non-exported classes as base class of an
+              #        exported class
+              $<$<CXX_COMPILER_ID:MSVC>:  /wd4251
+                                          /wd4068
+                                          /wd4275
+                                          /wd4668
+                                          /wd4710
+                                          /wd4505
+                                          /we5038
+                                          /bigobj
+                                          /EHsc
+                                          /nologo
+                                          # MSVC incorrectly sets the cplusplus to 199711L even if the compiler supports
+                                          # c++11 features. This flag sets it to the correct standard supported by the
+                                          # compiler
+                                          $<$<BOOL:${cplusplus_define}>:/Zc:__cplusplus>
+                                          $<$<BOOL:${cxx_compliance}>:/permissive-> >
+
+              # OpenCL targets need this flag to avoid
+              # ignored attribute warnings in the OpenCL
+              # headers
+              $<$<BOOL:${has_ignored_attributes_flag}>:-Wno-ignored-attributes>
+              $<$<BOOL:${has_all_warnings_flag}>:-Wall>
+              $<$<BOOL:${has_cxx_unqualified_std_cast_call}>:-Wno-unqualified-std-cast-call>
+              $<$<BOOL:${has_cxx_error_reorder_ctor}>:-Werror=reorder-ctor>
+
+              $<$<BOOL:${AF_WITH_FAST_MATH}>:
+                  $<$<BOOL:${has_cxx_fast_math}>:-ffast-math>
+                  $<$<BOOL:${has_cxx_no_errno_math}>:-fno-errno-math>
+                  $<$<BOOL:${has_cxx_no_trapping_math}>:-fno-trapping-math>
+                  $<$<BOOL:${has_cxx_no_signed_zeros}>:-fno-signed-zeros>
+                  $<$<BOOL:${has_cxx_no_ieee_fp}>:-mno-ieee-fp>
+                  >
+
+              $<$<NOT:$<BOOL:${AF_WITH_FAST_MATH}>>:
+                    $<$<BOOL:${has_cxx_fp_model}>:-fp-model precise>>
+
+              $<$<BOOL:${has_cxx_debug-disables-optimization}>:
+                  $<$<CONFIG:Debug>:-Rno-debug-disables-optimization>>
+          >
+    )
+
+  target_compile_definitions(${target}
+    PRIVATE
+      AFDLL
+      $<$<PLATFORM_ID:Windows>:             OS_WIN
+                                            WIN32_LEAN_AND_MEAN
+                                            NOMINMAX>
+      $<$<PLATFORM_ID:Darwin>:              OS_MAC>
+      $<$<PLATFORM_ID:Linux>:               OS_LNX>
+
+      $<$<BOOL:${AF_WITH_LOGGING}>:           AF_WITH_LOGGING>
+      $<$<BOOL:${AF_CACHE_KERNELS_TO_DISK}>:  AF_CACHE_KERNELS_TO_DISK>
+      $<$<BOOL:${AF_WITH_FAST_MATH}>:         AF_WITH_FAST_MATH>
+  )
+endfunction()
+
+function(__af_deprecate_var var access value)
+  if(access STREQUAL "READ_ACCESS")
+    message(DEPRECATION "Variable ${var} is deprecated. Use AF_${var} instead.")
+  endif()
+endfunction()
+
+function(af_deprecate var newvar)
+  if(DEFINED ${var})
+    message(DEPRECATION "Variable ${var} is deprecated. Use ${newvar} instead.")
+    get_property(doc CACHE ${newvar} PROPERTY HELPSTRING)
+    set(${newvar} ${${var}} CACHE BOOL "${doc}" FORCE)
+    unset(${var} CACHE)
+  endif()
+  variable_watch(${var} __af_deprecate_var)
+endfunction()
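+
+# Usage sketch (variable names are hypothetical). The new option is declared
+# first so that af_deprecate can copy its help string:
+#
+#   option(AF_BUILD_CPU "Build ArrayFire with a CPU backend" ON)
+#   af_deprecate(BUILD_CPU AF_BUILD_CPU)
+#
+# If a user configures with -DBUILD_CPU=OFF, a deprecation message is issued,
+# the value is forwarded to AF_BUILD_CPU, and BUILD_CPU is removed from the
+# cache.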
+
+function(get_native_path out_path path)
+  file(TO_NATIVE_PATH ${path} native_path)
+  if (WIN32)
+    string(REPLACE "\\" "\\\\" native_path  ${native_path})
+    set(${out_path} ${native_path} PARENT_SCOPE)
+  else ()
+    set(${out_path} ${path} PARENT_SCOPE)
+  endif ()
+endfunction()
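+
+# Usage sketch (the output variable name is hypothetical):
+#
+#   get_native_path(native_bin_dir "${CMAKE_CURRENT_BINARY_DIR}")
+#
+# On Windows the result uses backslashes, each one doubled; on other
+# platforms the path is returned unchanged.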
+
+macro(arrayfire_set_cmake_default_variables)
+  set(CMAKE_PREFIX_PATH "${ArrayFire_BINARY_DIR};${CMAKE_PREFIX_PATH}")
+  set(BUILD_SHARED_LIBS ON)
+
+  set(CMAKE_CXX_FLAGS_COVERAGE
+      "-g -O0"
+      CACHE STRING "Flags used by the C++ compiler during coverage builds.")
+
+  set(CMAKE_C_FLAGS_COVERAGE
+      "-g -O0"
+      CACHE STRING "Flags used by the C compiler during coverage builds.")
+  set(CMAKE_EXE_LINKER_FLAGS_COVERAGE
+      ""
+      CACHE STRING "Flags used for linking binaries during coverage builds.")
+  set(CMAKE_SHARED_LINKER_FLAGS_COVERAGE
+      ""
+      CACHE STRING "Flags used by the shared libraries linker during coverage builds.")
+  set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE
+      ""
+      CACHE STRING "Flags used by the static libraries linker during coverage builds.")
+  set(CMAKE_MODULE_LINKER_FLAGS_COVERAGE
+      ""
+      CACHE STRING "Flags used by the module linker during coverage builds.")
+
+  if(CMAKE_CXX_COMPILER_ID MATCHES "Clang" OR CMAKE_CXX_COMPILER_ID MATCHES "GNU")
+    set(CMAKE_CXX_FLAGS_COVERAGE "${CMAKE_CXX_FLAGS_COVERAGE} --coverage")
+    set(CMAKE_C_FLAGS_COVERAGE "${CMAKE_C_FLAGS_COVERAGE} --coverage")
+    set(CMAKE_EXE_LINKER_FLAGS_COVERAGE "${CMAKE_EXE_LINKER_FLAGS_COVERAGE} --coverage")
+    set(CMAKE_SHARED_LINKER_FLAGS_COVERAGE "${CMAKE_SHARED_LINKER_FLAGS_COVERAGE} --coverage")
+    set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE "${CMAKE_STATIC_LINKER_FLAGS_COVERAGE}")
+    set(CMAKE_MODULE_LINKER_FLAGS_COVERAGE "${CMAKE_MODULE_LINKER_FLAGS_COVERAGE} --coverage")
+  elseif(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
+    set(CMAKE_CXX_FLAGS_COVERAGE "")
+    set(CMAKE_C_FLAGS_COVERAGE "")
+    set(CMAKE_EXE_LINKER_FLAGS_COVERAGE "")
+    set(CMAKE_SHARED_LINKER_FLAGS_COVERAGE "")
+    set(CMAKE_STATIC_LINKER_FLAGS_COVERAGE "")
+    set(CMAKE_MODULE_LINKER_FLAGS_COVERAGE "")
+  endif()
+
+  mark_as_advanced(
+      CMAKE_CXX_FLAGS_COVERAGE
+      CMAKE_C_FLAGS_COVERAGE
+      CMAKE_EXE_LINKER_FLAGS_COVERAGE
+      CMAKE_SHARED_LINKER_FLAGS_COVERAGE
+      CMAKE_STATIC_LINKER_FLAGS_COVERAGE
+      CMAKE_MODULE_LINKER_FLAGS_COVERAGE)
+
+  set_property(GLOBAL PROPERTY USE_FOLDERS ON)
+
+  # Store all binaries in the bin/<Config> directory
+  if(WIN32)
+    set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${ArrayFire_BINARY_DIR}/bin)
+  endif()
+
+  if(APPLE AND (NOT DEFINED CMAKE_INSTALL_RPATH))
+    message(WARNING "CMAKE_INSTALL_RPATH is required when installing ArrayFire to the local system. Set it to /opt/arrayfire/lib when building the installer, or to your own custom install path.")
+  endif()
+
+  # This code is used to generate the compilers.h file in CMakeModules. Not all
+  # features of this module are supported in the versions of CMake we wish to
+  # support, so we include the generated file directly here.
+  #  set(compiler_header_epilogue [=[
+  #  #if defined(AF_COMPILER_CXX_RELAXED_CONSTEXPR) && AF_COMPILER_CXX_RELAXED_CONSTEXPR
+  #  #define AF_CONSTEXPR constexpr
+  #  #else
+  #  #define AF_CONSTEXPR
+  #  #endif
+  #  #if __cpp_if_constexpr || __cplusplus >= 201606L
+  #  #define AF_IF_CONSTEXPR if constexpr
+  #  #else
+  #  #define AF_IF_CONSTEXPR if
+  #  #endif
+  #  ]=])
+  #  include(WriteCompilerDetectionHeader)
+  #  write_compiler_detection_header(
+  #          FILE ${ArrayFire_BINARY_DIR}/include/af/compilers.h
+  #          PREFIX AF
+  #          COMPILERS AppleClang Clang GNU Intel MSVC
+  #          # NOTE: cxx_attribute_deprecated does not work well with C
+  #          FEATURES cxx_rvalue_references cxx_noexcept cxx_variadic_templates cxx_alignas
+  #          cxx_static_assert cxx_generalized_initializers cxx_relaxed_constexpr
+  #          ALLOW_UNKNOWN_COMPILERS
+  #          #[VERSION <version>]
+  #          #[PROLOG <prolog>]
+  #          EPILOG ${compiler_header_epilogue}
+  #          )
+  configure_file(
+    ${ArrayFire_SOURCE_DIR}/CMakeModules/compilers.h
+    ${ArrayFire_BINARY_DIR}/include/af/compilers.h)
+endmacro()
+
+macro(set_policies)
+  cmake_parse_arguments(SP "" "TYPE" "POLICIES" ${ARGN})
+  foreach(_policy ${SP_POLICIES})
+    if(POLICY ${_policy})
+      cmake_policy(SET ${_policy} ${SP_TYPE})
+    endif()
+  endforeach()
+endmacro()
+
+macro(af_mkl_batch_check)
+  set(CMAKE_REQUIRED_LIBRARIES "MKL::RT")
+  check_symbol_exists(sgetrf_batch_strided "mkl_lapack.h" MKL_BATCH)
+endmacro()
+
+# Creates a CACHEd CMake variable which has a limited set of possible string values
+# Arguments:
+#   NAME: The name of the variable
+#   DEFAULT: The default value of the variable
+#   DESCRIPTION: The description of the variable
+#   OPTIONS: The possible set of values for the option
+#
+# Example:
+#
+# af_multiple_option(NAME        AF_COMPUTE_LIBRARY
+#                    DEFAULT     "Intel-MKL"
+#                    DESCRIPTION "Compute library for signal processing and linear algebra routines"
+#                    OPTIONS     "Intel-MKL" "FFTW/LAPACK/BLAS")
+macro(af_multiple_option)
+  cmake_parse_arguments(opt "" "NAME;DEFAULT;DESCRIPTION" "OPTIONS" ${ARGN})
+  set(${opt_NAME} ${opt_DEFAULT} CACHE STRING ${opt_DESCRIPTION})
+  set_property(CACHE ${opt_NAME} PROPERTY STRINGS ${opt_OPTIONS})
+endmacro()
+
+mark_as_advanced(
+    pkgcfg_lib_PC_CBLAS_cblas
+    pkgcfg_lib_PC_LAPACKE_lapacke
+    pkgcfg_lib_PKG_FFTW_fftw3
+    )
diff --git a/CMakeModules/LSANSuppression.txt b/CMakeModules/LSANSuppression.txt
new file mode 100644
index 0000000000..b305e805f3
--- /dev/null
+++ b/CMakeModules/LSANSuppression.txt
@@ -0,0 +1,13 @@
+# This is a known leak.
+leak:libnvidia-ptxjitcompile
+leak:tbb::internal::task_stream
+leak:libnvidia-opencl.so
+
+# Allocated by Intel's OpenMP implementation during inverse_dense_cpu
+# This is not something we can control in ArrayFire
+leak:kmp_alloc_cpp*::bget
+leak:kmp_b_alloc
+
+# ArrayFire leaks the default random engine on each thread. This is to avoid
+# errors on exit on Windows.
+leak:af_get_default_random_engine
diff --git a/CMakeModules/MinBuildTime.cmake b/CMakeModules/MinBuildTime.cmake
new file mode 100644
index 0000000000..e48ab0263b
--- /dev/null
+++ b/CMakeModules/MinBuildTime.cmake
@@ -0,0 +1,89 @@
+IF(NOT DEFINED MINBUILDTIME_FLAG)
+    SET(MINBUILDTIME_FLAG OFF CACHE INTERNAL "Flag" FORCE)
+ENDIF()
+
+IF(${MIN_BUILD_TIME})
+    IF(NOT CMAKE_BUILD_TYPE MATCHES "Release")
+        MESSAGE(WARNING "The MIN_BUILD_TIME flag only works with Release.\
+                        Other CMAKE_BUILD_TYPEs will ignore this flag")
+    ELSEIF(NOT ${MINBUILDTIME_FLAG})
+        # BUILD_TYPE is Release - Set the flags
+        # The flags should be set only when going from OFF -> ON. This is
+        # determined by MINBUILDTIME_FLAG
+        # IF FLAG is ON, then the flags were already set, no need to set them again
+        # IF FLAG is OFF, then the flags are not set, so set them now, and back up
+        # release flags
+        MESSAGE(STATUS "MIN_BUILD_TIME: Setting Release flags to no optimizations")
+
+        # Backup Default Release Flags
+        SET(CMAKE_CXX_FLAGS_RELEASE_DEFAULT ${CMAKE_CXX_FLAGS_RELEASE} CACHE
+            INTERNAL "Default compiler flags during release build" FORCE)
+        SET(CMAKE_C_FLAGS_RELEASE_DEFAULT ${CMAKE_C_FLAGS_RELEASE} CACHE
+            INTERNAL "Default compiler flags during release build" FORCE)
+        SET(CMAKE_EXE_LINKER_FLAGS_RELEASE_DEFAULT ${CMAKE_EXE_LINKER_FLAGS_RELEASE} CACHE
+            INTERNAL "Default linker flags during release build" FORCE)
+        SET(CMAKE_MODULE_LINKER_FLAGS_RELEASE_DEFAULT ${CMAKE_MODULE_LINKER_FLAGS_RELEASE} CACHE
+            INTERNAL "Default linker flags during release build" FORCE)
+        SET(CMAKE_STATIC_LINKER_FLAGS_RELEASE_DEFAULT ${CMAKE_STATIC_LINKER_FLAGS_RELEASE} CACHE
+            INTERNAL "Default linker flags during release build" FORCE)
+        SET(CMAKE_SHARED_LINKER_FLAGS_RELEASE_DEFAULT ${CMAKE_SHARED_LINKER_FLAGS_RELEASE} CACHE
+            INTERNAL "Default linker flags during release build" FORCE)
+
+        IF(MSVC)
+            SET(CMAKE_CXX_FLAGS_RELEASE "/MD /Od /Ob1 /D NDEBUG" CACHE
+                STRING "Flags used by the compiler during release builds." FORCE)
+            SET(CMAKE_C_FLAGS_RELEASE "/MD /Od /Ob1 /D NDEBUG" CACHE
+                STRING "Flags used by the compiler during release builds." FORCE)
+            SET(CMAKE_EXE_LINKER_FLAGS_RELEASE "/INCREMENTAL:NO" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_MODULE_LINKER_FLAGS_RELEASE "/INCREMENTAL:NO" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_STATIC_LINKER_FLAGS_RELEASE "" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_SHARED_LINKER_FLAGS_RELEASE "/INCREMENTAL:NO" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+        ELSE(MSVC)
+            SET(CMAKE_CXX_FLAGS_RELEASE "-O0 -DNDEBUG" CACHE
+                STRING "Flags used by the compiler during release builds." FORCE)
+            SET(CMAKE_C_FLAGS_RELEASE "-O0 -DNDEBUG" CACHE
+                STRING "Flags used by the compiler during release builds." FORCE)
+            SET(CMAKE_EXE_LINKER_FLAGS_RELEASE "" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_MODULE_LINKER_FLAGS_RELEASE "" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_STATIC_LINKER_FLAGS_RELEASE "" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+            SET(CMAKE_SHARED_LINKER_FLAGS_RELEASE "" CACHE
+                STRING "Flags used by the linker during release builds." FORCE)
+        ENDIF(MSVC)
+
+        SET(MINBUILDTIME_FLAG ON CACHE INTERNAL "Flag" FORCE)
+    ENDIF()
+ELSE()
+    # MIN_BUILD_TIME is OFF. Change the flags back only if the flag was set before
+    IF(${MINBUILDTIME_FLAG})
+        MESSAGE(STATUS "MIN_BUILD_TIME was toggled. Resetting Release flags")
+        SET(CMAKE_CXX_FLAGS_RELEASE ${CMAKE_CXX_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the compiler during release builds." FORCE)
+        SET(CMAKE_C_FLAGS_RELEASE ${CMAKE_C_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the compiler during release builds." FORCE)
+        SET(CMAKE_EXE_LINKER_FLAGS_RELEASE ${CMAKE_EXE_LINKER_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the linker during release builds." FORCE)
+        SET(CMAKE_MODULE_LINKER_FLAGS_RELEASE ${CMAKE_MODULE_LINKER_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the linker during release builds." FORCE)
+        SET(CMAKE_STATIC_LINKER_FLAGS_RELEASE ${CMAKE_STATIC_LINKER_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the linker during release builds." FORCE)
+        SET(CMAKE_SHARED_LINKER_FLAGS_RELEASE ${CMAKE_SHARED_LINKER_FLAGS_RELEASE_DEFAULT} CACHE
+            STRING "Flags used by the linker during release builds." FORCE)
+        SET(MINBUILDTIME_FLAG OFF CACHE INTERNAL "Flag" FORCE)
+    ENDIF()
+ENDIF()
+
+MARK_AS_ADVANCED(
+    CMAKE_CXX_FLAGS_RELEASE
+    CMAKE_C_FLAGS_RELEASE
+    CMAKE_EXE_LINKER_FLAGS_RELEASE
+    CMAKE_MODULE_LINKER_FLAGS_RELEASE
+    CMAKE_STATIC_LINKER_FLAGS_RELEASE
+    CMAKE_SHARED_LINKER_FLAGS_RELEASE
+    )
diff --git a/CMakeModules/SplitDebugInfo.cmake b/CMakeModules/SplitDebugInfo.cmake
new file mode 100644
index 0000000000..3900c25a5d
--- /dev/null
+++ b/CMakeModules/SplitDebugInfo.cmake
@@ -0,0 +1,102 @@
+# Tailored after https://github.com/GerbilSoft/mcrecover/blob/master/cmake/macros/SplitDebugInformation.cmake
+# Minor modifications to original
+
+if (NOT WIN32)
+  include(CMakeFindBinUtils)
+  if (NOT APPLE AND NOT CMAKE_OBJCOPY)
+    message("'objcopy' tool not found; debug information will not be split.")
+  elseif (NOT CMAKE_STRIP)
+    message("'strip' tool not found; debug information will not be split.")
+  elseif (APPLE)
+    # TODO(pradeep) debug info splits on OSX are disabled
+    # this section of elseif will be removed when Apple support is added
+    message("Debug information is not split on OSX")
+  endif ()
+endif (NOT WIN32)
+
+function(af_split_debug_info _target _destination_dir)
+  set(SPLIT_TOOL_EXISTS ON)
+  if (WIN32)
+    set(SPLIT_TOOL_EXISTS OFF)
+    if (MSVC)
+      install(FILES
+        $<$<CONFIG:Debug>:$<TARGET_PDB_FILE:${_target}>>
+        $<$<CONFIG:RelWithDebInfo>:$<TARGET_PDB_FILE:${_target}>>
+        DESTINATION ${_destination_dir}
+        COMPONENT "${_target}_debug_symbols"
+        )
+    endif()
+  elseif (NOT APPLE AND NOT CMAKE_OBJCOPY)
+    set(SPLIT_TOOL_EXISTS OFF)
+  elseif (NOT CMAKE_STRIP)
+    set(SPLIT_TOOL_EXISTS OFF)
+  elseif (APPLE)
+    # TODO(pradeep) debug info splits on OSX are disabled
+    # this section of elseif will be removed when Apple support is added
+    set(SPLIT_TOOL_EXISTS OFF)
+  endif ()
+
+  if (SPLIT_TOOL_EXISTS)
+    get_target_property(TRGT_PREFIX ${_target} PREFIX)
+    if(TRGT_PREFIX)
+      set(prefix ${TRGT_PREFIX})
+    else()
+      get_target_property(TRGT_TYPE ${_target} TYPE)
+      set(prefix "${CMAKE_${TRGT_TYPE}_PREFIX}")
+    endif()
+
+    get_target_property(TRGT_OUT_NAME ${_target} OUTPUT_NAME)
+    if(TRGT_OUT_NAME)
+      set(outName ${TRGT_OUT_NAME})
+    else()
+      set(outName "${_target}")
+    endif()
+
+    get_target_property(TRGT_POSTFIX ${_target} POSTFIX)
+    if(TRGT_POSTFIX)
+      set(postfix ${TRGT_POSTFIX})
+    else()
+      get_target_property(TRGT_TYPE ${_target} TYPE)
+      set(postfix "${CMAKE_${TRGT_TYPE}_POSTFIX}")
+    endif()
+
+    set(OUT_NAME "${prefix}${outName}")
+    set(OUT_NAME_WE "${OUT_NAME}${postfix}")
+    set(SPLIT_DEBUG_OUT_FILE_EXT ".debug")
+    if(APPLE)
+      set(SPLIT_DEBUG_OUT_FILE_EXT ".dSYM")
+    endif()
+    set(SPLIT_DEBUG_SRC_FILE "$<TARGET_FILE:${_target}>")
+    set(SPLIT_DEBUG_OUT_NAME "$<TARGET_FILE_DIR:${_target}>/${OUT_NAME_WE}")
+    set(SPLIT_DEBUG_OUT_FILE "${SPLIT_DEBUG_OUT_NAME}${SPLIT_DEBUG_OUT_FILE_EXT}")
+
+    if(APPLE)
+      add_custom_command(TARGET ${_target} POST_BUILD
+          COMMAND dsymutil ${SPLIT_DEBUG_SRC_FILE} -o ${SPLIT_DEBUG_OUT_FILE}
+          # TODO(pradeep) Initial research suggests that stripping debug info
+          # also removes the LC_ID_DYLIB load command, which makes the
+          # shared library unusable. Confirm this with an OSX expert,
+          # then remove these comments and the command below
+          #COMMAND ${CMAKE_STRIP} --strip-debug ${SPLIT_DEBUG_SRC_FILE}
+        )
+    else(APPLE)
+      add_custom_command(TARGET ${_target} POST_BUILD
+        COMMAND ${CMAKE_OBJCOPY}
+          --only-keep-debug ${SPLIT_DEBUG_SRC_FILE} ${SPLIT_DEBUG_OUT_FILE}
+        COMMAND ${CMAKE_STRIP}
+          --strip-debug ${SPLIT_DEBUG_SRC_FILE}
+        COMMAND ${CMAKE_OBJCOPY}
+          --add-gnu-debuglink=${SPLIT_DEBUG_OUT_FILE} ${SPLIT_DEBUG_SRC_FILE}
+        )
+    endif()
+
+    install(FILES ${SPLIT_DEBUG_OUT_FILE}
+      DESTINATION ${_destination_dir}
+      COMPONENT "${OUT_NAME}_debug_symbols"
+      )
+
+    # Make sure the file is deleted on `make clean`.
+    set_property(DIRECTORY APPEND
+      PROPERTY ADDITIONAL_MAKE_CLEAN_FILES ${SPLIT_DEBUG_OUT_FILE})
+  endif(SPLIT_TOOL_EXISTS)
+endfunction(af_split_debug_info)
diff --git a/CMakeModules/TargetArch.cmake b/CMakeModules/TargetArch.cmake
deleted file mode 100644
index bcf7d61a8b..0000000000
--- a/CMakeModules/TargetArch.cmake
+++ /dev/null
@@ -1,157 +0,0 @@
-#https://github.com/petroules/solar-cmake
-
-#Copyright (c) 2012 Petroules Corporation. All rights reserved.
-#
-#Redistribution and use in source and binary forms, with or without
-#modification, are permitted provided that the following conditions are met:
-#
-#Redistributions of source code must retain the above copyright notice, this
-#list of conditions and the following disclaimer.  Redistributions in binary
-#form must reproduce the above copyright notice, this list of conditions and
-#the following disclaimer in the documentation and/or other materials provided
-#with the distribution.  THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
-#CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-#LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
-#PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
-#CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
-#EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
-#PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
-#BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
-#IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
-#ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
-#POSSIBILITY OF SUCH DAMAGE.
-
-# Based on the Qt 5 processor detection code, so should be very accurate
-# https://qt.gitorious.org/qt/qtbase/blobs/master/src/corelib/global/qprocessordetection.h
-# Currently handles arm (v5, v6, v7), x86 (32/64), ia64, and ppc (32/64)
-
-# Regarding POWER/PowerPC, just as is noted in the Qt source,
-# "There are many more known variants/revisions that we do not handle/detect."
-
-set(archdetect_c_code "
-#if defined(__arm__) || defined(__TARGET_ARCH_ARM)
-    #if defined(__ARM_ARCH_7__) \\
-        || defined(__ARM_ARCH_7A__) \\
-        || defined(__ARM_ARCH_7R__) \\
-        || defined(__ARM_ARCH_7M__) \\
-        || (defined(__TARGET_ARCH_ARM) && __TARGET_ARCH_ARM-0 >= 7)
-        #error cmake_ARCH armv7
-    #elif defined(__ARM_ARCH_6__) \\
-        || defined(__ARM_ARCH_6J__) \\
-        || defined(__ARM_ARCH_6T2__) \\
-        || defined(__ARM_ARCH_6Z__) \\
-        || defined(__ARM_ARCH_6K__) \\
-        || defined(__ARM_ARCH_6ZK__) \\
-        || defined(__ARM_ARCH_6M__) \\
-        || (defined(__TARGET_ARCH_ARM) && __TARGET_ARCH_ARM-0 >= 6)
-        #error cmake_ARCH armv6
-    #elif defined(__ARM_ARCH_5TEJ__) \\
-        || (defined(__TARGET_ARCH_ARM) && __TARGET_ARCH_ARM-0 >= 5)
-        #error cmake_ARCH armv5
-    #else
-        #error cmake_ARCH arm
-    #endif
-#elif defined(__i386) || defined(__i386__) || defined(_M_IX86)
-    #error cmake_ARCH i386
-#elif defined(__x86_64) || defined(__x86_64__) || defined(__amd64) || defined(_M_X64)
-    #error cmake_ARCH x86_64
-#elif defined(__ia64) || defined(__ia64__) || defined(_M_IA64)
-    #error cmake_ARCH ia64
-#elif defined(__ppc__) || defined(__ppc) || defined(__powerpc__) \\
-      || defined(_ARCH_COM) || defined(_ARCH_PWR) || defined(_ARCH_PPC)  \\
-      || defined(_M_MPPC) || defined(_M_PPC)
-    #if defined(__ppc64__) || defined(__powerpc64__) || defined(__64BIT__)
-        #error cmake_ARCH ppc64
-    #else
-        #error cmake_ARCH ppc
-    #endif
-#endif
-
-#error cmake_ARCH unknown
-")
-
-# Set ppc_support to TRUE before including this file or ppc and ppc64
-# will be treated as invalid architectures since they are no longer supported by Apple
-
-function(target_architecture output_var)
-    if(APPLE AND CMAKE_OSX_ARCHITECTURES)
-        # On OS X we use CMAKE_OSX_ARCHITECTURES *if* it was set
-        # First let's normalize the order of the values
-
-        # Note that it's not possible to compile PowerPC applications if you are using
-        # the OS X SDK version 10.6 or later - you'll need 10.4/10.5 for that, so we
-        # disable it by default
-        # See this page for more information:
-        # http://stackoverflow.com/questions/5333490/how-can-we-restore-ppc-ppc64-as-well-as-full-10-4-10-5-sdk-support-to-xcode-4
-
-        # Architecture defaults to i386 or ppc on OS X 10.5 and earlier, depending on the CPU type detected at runtime.
-        # On OS X 10.6+ the default is x86_64 if the CPU supports it, i386 otherwise.
-
-        foreach(osx_arch ${CMAKE_OSX_ARCHITECTURES})
-            if("${osx_arch}" STREQUAL "ppc" AND ppc_support)
-                set(osx_arch_ppc TRUE)
-            elseif("${osx_arch}" STREQUAL "i386")
-                set(osx_arch_i386 TRUE)
-            elseif("${osx_arch}" STREQUAL "x86_64")
-                set(osx_arch_x86_64 TRUE)
-            elseif("${osx_arch}" STREQUAL "ppc64" AND ppc_support)
-                set(osx_arch_ppc64 TRUE)
-            else()
-                message(FATAL_ERROR "Invalid OS X arch name: ${osx_arch}")
-            endif()
-        endforeach()
-
-        # Now add all the architectures in our normalized order
-        if(osx_arch_ppc)
-            list(APPEND ARCH ppc)
-        endif()
-
-        if(osx_arch_i386)
-            list(APPEND ARCH i386)
-        endif()
-
-        if(osx_arch_x86_64)
-            list(APPEND ARCH x86_64)
-        endif()
-
-        if(osx_arch_ppc64)
-            list(APPEND ARCH ppc64)
-        endif()
-    else()
-        file(WRITE "${CMAKE_BINARY_DIR}/arch.c" "${archdetect_c_code}")
-
-        enable_language(C)
-
-        # Detect the architecture in a rather creative way...
-        # This compiles a small C program which is a series of ifdefs that selects a
-        # particular #error preprocessor directive whose message string contains the
-        # target architecture. The program will always fail to compile (both because
-        # file is not a valid C program, and obviously because of the presence of the
-        # #error preprocessor directives... but by exploiting the preprocessor in this
-        # way, we can detect the correct target architecture even when cross-compiling,
-        # since the program itself never needs to be run (only the compiler/preprocessor)
-        try_run(
-            run_result_unused
-            compile_result_unused
-            "${CMAKE_BINARY_DIR}"
-            "${CMAKE_BINARY_DIR}/arch.c"
-            COMPILE_OUTPUT_VARIABLE ARCH
-            CMAKE_FLAGS CMAKE_OSX_ARCHITECTURES=${CMAKE_OSX_ARCHITECTURES}
-        )
-
-        # Parse the architecture name from the compiler output
-        string(REGEX MATCH "cmake_ARCH ([a-zA-Z0-9_]+)" ARCH "${ARCH}")
-
-        # Get rid of the value marker leaving just the architecture name
-        string(REPLACE "cmake_ARCH " "" ARCH "${ARCH}")
-
-        # If we are compiling with an unknown architecture this variable should
-        # already be set to "unknown" but in the case that it's empty (i.e. due
-        # to a typo in the code), then set it to unknown
-        if (NOT ARCH)
-            set(ARCH unknown)
-        endif()
-    endif()
-
-    set(${output_var} "${ARCH}" PARENT_SCOPE)
-endfunction()
diff --git a/CMakeModules/TegraCrossToolchain.cmake b/CMakeModules/TegraCrossToolchain.cmake
new file mode 100644
index 0000000000..e8b1e10f5a
--- /dev/null
+++ b/CMakeModules/TegraCrossToolchain.cmake
@@ -0,0 +1,11 @@
+
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_SYSTEM_PROCESSOR aarch64)
+
+set(CMAKE_C_COMPILER aarch64-linux-gnu-gcc-5)
+set(CMAKE_CXX_COMPILER aarch64-linux-gnu-g++-5)
+
+set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
+set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
+set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
+set(CMAKE_FIND_ROOT_PATH_MODE_PACKAGE ONLY)
diff --git a/CMakeModules/Version.cmake b/CMakeModules/Version.cmake
index a5aff0c8a3..2269bd73f2 100644
--- a/CMakeModules/Version.cmake
+++ b/CMakeModules/Version.cmake
@@ -1,18 +1,54 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
 #
 # Make a version file that includes the ArrayFire version and git revision
 #
-SET(AF_VERSION_MAJOR "3")
-SET(AF_VERSION_MINOR "0")
-SET(AF_VERSION_PATCH "beta")
-SET(AF_VERSION "${AF_VERSION_MAJOR}.${AF_VERSION_MINOR}.${AF_VERSION_PATCH}")
-EXECUTE_PROCESS(
+set(AF_VERSION_MAJOR ${ArrayFire_VERSION_MAJOR})
+set(AF_VERSION_MINOR ${ArrayFire_VERSION_MINOR})
+set(AF_VERSION_PATCH ${ArrayFire_VERSION_PATCH})
+
+set(AF_VERSION ${ArrayFire_VERSION})
+set(ArrayFire_API_VERSION_CURRENT ${ArrayFire_VERSION_MAJOR}${ArrayFire_VERSION_MINOR})
+
+# From CMake 3.0.0 CMAKE_<LANG>_COMPILER_ID is AppleClang for OSX machines
+# that use clang for compilations
+if("${CMAKE_C_COMPILER_ID}" STREQUAL "AppleClang")
+    set(COMPILER_NAME "AppleClang")
+elseif("${CMAKE_C_COMPILER_ID}" STREQUAL "Clang")
+    set(COMPILER_NAME "LLVM Clang")
+elseif("${CMAKE_C_COMPILER_ID}" STREQUAL "GNU")
+    set(COMPILER_NAME "GNU Compiler Collection(GCC/G++)")
+elseif("${CMAKE_C_COMPILER_ID}" STREQUAL "Intel")
+    set(COMPILER_NAME "Intel Compiler")
+elseif("${CMAKE_C_COMPILER_ID}" STREQUAL "MSVC")
+    set(COMPILER_NAME "Microsoft Visual Studio")
+endif()
+
+set(COMPILER_VERSION "${CMAKE_C_COMPILER_VERSION}")
+set(AF_COMPILER_STRING "${COMPILER_NAME} ${COMPILER_VERSION}")
+
+execute_process(
     COMMAND git log -1 --format=%h
-    WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
+    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR}
     OUTPUT_VARIABLE GIT_COMMIT_HASH
     OUTPUT_STRIP_TRAILING_WHITESPACE
 )
 
-CONFIGURE_FILE(
-    ${CMAKE_MODULE_PATH}/version.h.in
-    ${CMAKE_SOURCE_DIR}/include/af/version.h
+if(NOT GIT_COMMIT_HASH)
+    message(STATUS "No git. Setting hash to default")
+    set(GIT_COMMIT_HASH "default")
+endif()
+
+configure_file(
+    ${ArrayFire_SOURCE_DIR}/CMakeModules/version.h.in
+    ${ArrayFire_BINARY_DIR}/include/af/version.h
+)
+
+configure_file(
+    ${ArrayFire_SOURCE_DIR}/CMakeModules/build_version.hpp.in
+    ${ArrayFire_BINARY_DIR}/src/backend/build_version.hpp
 )
diff --git a/CMakeModules/bin2cpp.cpp b/CMakeModules/bin2cpp.cpp
index 66fab53fdf..3426b1ebed 100644
--- a/CMakeModules/bin2cpp.cpp
+++ b/CMakeModules/bin2cpp.cpp
@@ -1,19 +1,39 @@
 // Umar Arshad
 // Copyright 2014
 
+// This enables template overloads of standard CRT functions that call the
+// more secure variants automatically.
+#define _CRT_SECURE_CPP_OVERLOAD_SECURE_NAMES 1
 
+#include <cstring>
+// The strtok variant that keeps context has a different name on Windows and
+// Linux, so the above overload define won't help with that function
+#if defined(OS_WIN)
+#define STRTOK_CALL(...) strtok_s(__VA_ARGS__)
+#else
+#define STRTOK_CALL(...) strtok_r(__VA_ARGS__)
+#endif
+
+#include <algorithm>
+#include <cassert>
+#include <cstdlib>
+#include <cstring>
 #include <fstream>
-#include <sstream>
+#include <functional>
 #include <iostream>
-#include <string>
-#include <vector>
 #include <map>
 #include <memory>
+#include <sstream>  // IWYU pragma: keep
+#include <string>
+#include <utility>
+#include <vector>
+
+#include <common/deterministicHash.hpp>
 
 using namespace std;
+using std::cout;
 typedef map<string, string> opt_t;
 
-static
 void print_usage() {
     cout << R"delimiter(BIN2CPP
 Converts files from a binary file to C++ headers. It is similar to bin2c and
@@ -23,6 +43,8 @@ xxd but adds support for namespaces.
 | --file        | input file                                                        |
 | --output      | output file (If no output is specified then it prints to stdout   |
 | --type        | Type of variable (default: char)                                  |
+| --binary      | If the file contents are in binary form                           |
+| --nullterm    | Add a null character to the end of the file                       |
 | --namespace   | A space seperated list of namespaces                              |
 | --formatted   | Tabs for formatting                                               |
 | --version     | Prints my name                                                    |
@@ -35,108 +57,230 @@ Example
 
 Will produce:
 #pragma once
+#include <common/util.hpp>
 #include <cstddef>
 namespace blah {
-    namespace detail {
-        static const char blah_var[] = {
-            0x2f,    0x2f,    0x20,    0x62,    0x6c,    0x61,    0x68,    0x2e,    0x74,    0x78,
-            0x74,    0xa,    0x62,    0x6c,    0x61,    0x68,    0x20,    0x62,    0x6c,    0x61,
-            0x68,    0x20,    0x62,    0x6c,    0x61,    0x68,    0xa,    };
-        static const size_t blah_var_len = 27;
-    }
+	namespace detail {
+		static const unsigned char blah_var_uchar [] = {
+			0x2f,	0x2f,	0x20,	0x62,	0x6c,	0x61,	0x68,	0x2e,	0x74,	0x78,
+			0x74,	0xa,	0x62,	0x6c,	0x61,	0x68,	0x20,	0x62,	0x6c,	0x61,
+			0x68,	0x20,	0x62,	0x6c,	0x61,	0x68,	0xa,	};
+		static const char *blah_var = (const char*)blah_var_uchar;
+		static const size_t blah_var_len  = 27;
+		static const size_t blah_var_hash = 12345678901234567890ULL;
+		static const common::Source blah_var_src = {
+			blah_var,
+			blah_var_len,
+			blah_var_hash
+		};
+	}
 })delimiter";
-        exit(0);
+    exit(0);
 }
 
 static bool formatted;
+static bool binary   = false;
+static bool nullterm = false;
 
-static
-void add_tabs(const int level ){
-    if(formatted) {
-        for(int i =0; i < level; i++) {
-            cout << "\t";
-        }
+void add_tabs(const int level) {
+    if (formatted) {
+        for (int i = 0; i < level; i++) { cout << "\t"; }
     }
 }
 
-static
-opt_t
-parse_options(const vector<string>& args) {
+opt_t parse_options(const vector<string> &args) {
     opt_t options;
 
-    options["--name"]       = "";
-    options["--type"]       = "";
-    options["--file"]       = "";
-    options["--output"]     = "";
-    options["--namespace"]  = "";
-    options["--eof"]        = "";
+    options["--name"]      = "";
+    options["--type"]      = "";
+    options["--file"]      = "";
+    options["--output"]    = "";
+    options["--namespace"] = "";
 
-    //Parse Arguments
+    // Parse Arguments
     string curr_opt;
     bool verbose = false;
-    for(auto arg : args) {
-        if(arg == "--verbose") {
+    for (auto arg : args) {
+        if (arg == "--verbose") {
             verbose = true;
-        }
-        else if(arg == "--formatted") {
+        } else if (arg == "--binary") {
+            binary = true;
+        } else if (arg == "--nullterm") {
+            nullterm = true;
+        } else if (arg == "--formatted") {
             formatted = true;
-        }
-        else if(arg == "--version") {
+        } else if (arg == "--version") {
             cout << args[0] << " By Umar Arshad" << endl;
-        }
-        else if(arg == "--help") {
+        } else if (arg == "--help") {
             print_usage();
-        }
-        else if(options.find(arg) != options.end()) {
+        } else if (options.find(arg) != options.end()) {
             curr_opt = arg;
-        }
-        else if(curr_opt.empty()) {
-            //cerr << "Invalid Argument: " << arg << endl;
-        }
-        else {
-            if(options[curr_opt] != "") {
+        } else if (curr_opt.empty()) {
+            // cerr << "Invalid Argument: " << arg << endl;
+        } else {
+            if (options[curr_opt] != "") {
                 options[curr_opt] += " " + arg;
-            }
-            else {
+            } else {
                 options[curr_opt] += arg;
             }
         }
     }
 
-    if(verbose) {
-        for(auto opts : options) {
+    if (verbose) {
+        for (auto opts : options) {
             cout << get<0>(opts) << " " << get<1>(opts) << endl;
         }
     }
     return options;
 }
 
-int main(int argc, const char * const * const argv)
-{
+stringstream removeComments(ifstream &input, string &filename) {
+    stringstream ss;
+    char line[256]{
+        '\0'};  // Maximum length of lines in OpenCL code is limited to 256
+    const char *tokenCommentsStart = "/*";
+    const char *tokenCommentsEnd   = "*/";
+    const char *tokenCommentsLine  = "//";
+    const char *tokenString        = "\"";
+    const char *delimitors         = " \t;";  // Only the subset we need
+    enum { NO, STRING, ENDOFLINE, MULTILINE } commentsLevel{NO};
+
+    while (input.getline(line, sizeof(line) - 1)) {
+        char local[sizeof(line)];
+        struct segment {
+            char *start;
+            char *end;
+        } del{commentsLevel == MULTILINE ? line : nullptr, nullptr};
+        vector<segment> dels;
+        memcpy(local, line, sizeof(line));   // will be overwritten by strtok
+        local[sizeof(local) - 1] = '\0';     // string is always terminated
+        char *context            = nullptr;
+        char *token              = STRTOK_CALL(local, delimitors, &context);
+        do {
+            char *subtoken = nullptr;
+            while (token) {
+                switch (commentsLevel) {
+                    case MULTILINE:
+                        subtoken = strstr(token, tokenCommentsEnd);
+                        if (subtoken != nullptr) {
+                            if (del.start == nullptr) del.start = line;
+                            del.end = subtoken + strlen(tokenCommentsEnd) -
+                                      local + line;
+                            dels.push_back(del);
+                            del           = {nullptr, nullptr};
+                            token         = subtoken + strlen(tokenCommentsEnd);
+                            commentsLevel = NO;
+                        } else {
+                            token = nullptr;
+                        }
+                        break;
+                    case STRING:
+                        subtoken = strstr(token, tokenString);
+                        if (subtoken != nullptr) {
+                            token         = subtoken + strlen(tokenString);
+                            commentsLevel = NO;
+                        } else {
+                            token = nullptr;
+                        }
+                        break;
+                    case NO: {
+                        // select first subtoken inside this token
+                        subtoken = strstr(token, tokenCommentsStart);
+                        if (subtoken != nullptr) { commentsLevel = MULTILINE; }
+                        char *ptr = strstr(token, tokenCommentsLine);
+                        if ((ptr != nullptr) &&
+                            ((subtoken == nullptr) || (ptr < subtoken))) {
+                            commentsLevel = ENDOFLINE;
+                            subtoken      = ptr;
+                        }
+                        ptr = strstr(token, tokenString);
+                        if ((ptr != nullptr) &&
+                            ((subtoken == nullptr) || ptr < subtoken)) {
+                            commentsLevel = STRING;
+                            subtoken      = ptr;
+                        }
+                        switch (commentsLevel) {
+                            case MULTILINE:
+                                del.start = subtoken - local + line;
+                                token = subtoken + strlen(tokenCommentsStart);
+                                break;
+                            case ENDOFLINE:
+                                del.start = subtoken - local + line;
+                                token = subtoken + strlen(tokenCommentsLine);
+                                break;
+                            case STRING:
+                                token = subtoken + strlen(tokenString);
+                                break;
+                            case NO:
+                            default: token = nullptr;
+                        }
+                    } break;
+                    case ENDOFLINE:
+                    default: token = nullptr;
+                }
+            }
+            token = STRTOK_CALL(nullptr, delimitors, &context);
+        } while (token != nullptr);
+        if (del.start != nullptr) {
+            if (commentsLevel == ENDOFLINE) commentsLevel = NO;
+            del.end = line + strlen(line);
+            dels.push_back(del);
+            del = {nullptr, nullptr};
+        }
+        // Delete segments back to front so earlier offsets stay valid
+        for (auto d = dels.crbegin(); d != dels.crend(); d++) {
+            char *ptr1 = d->start;
+            char *ptr2 = d->end;
+            // Do not use strncpy: its behavior is undefined when the source
+            // and destination ranges overlap
+            while ((ptr2 != line + sizeof(line)) && (*ptr2 != '\0')) { *ptr1++ = *ptr2++; }
+            *ptr1 = '\0';
+        }
+        // Remove trailing blanks
+        for (long i = static_cast<long>(std::min(sizeof(line), strlen(line))) - 1;
+             (i >= 0) && (line[i] == ' '); --i) {
+            line[i] = '\0';
+        }
+        // Remove leading blanks
+        char *linePtr = line;
+        for (size_t i = 0, len = std::min(sizeof(line), strlen(line));
+             (i < len) && (line[i] == ' ');
+             ++i, ++linePtr) {}
+        // Useful text is terminated by '\n';
+        if (linePtr[0] != '\0') { ss << linePtr << "\n"; }
+    }
+    return (ss);
+}
 
-    vector<string> args(argv, argv+argc);
+int main(int argc, const char *const *const argv) {
+    vector<string> args(argv, argv + argc);
 
-    opt_t&& options = parse_options(args);
+    if (argc == 1) {
+        print_usage();
+        return 0;
+    }
+    opt_t &&options = parse_options(args);
 
-    //Save default cout buffer. Need this to prevent crash.
+    // Save default cout buffer. Need this to prevent crash.
     auto bak = cout.rdbuf();
     unique_ptr<ofstream> outfile;
 
     // Set defaults
-    if(options["--name"] == "")     { options["--name"]     = "var"; }
-    if(options["--output"] != "")   {
-        //redirect stream if output file is specified
+    if (options["--name"] == "") { options["--name"] = "var"; }
+    if (options["--output"] != "") {
+        // redirect stream if output file is specified
         outfile.reset(new ofstream(options["--output"]));
         cout.rdbuf(outfile->rdbuf());
     }
 
     cout << "#pragma once\n";
-    cout << "#include <cstddef>\n"; // defines size_t
+    cout << "#include <cstddef>\n";            // defines size_t
+    cout << "#include <common/Source.hpp>\n";  // defines common::Source
 
     int ns_cnt = 0;
-    int level = 0;
-    if(options["--namespace"] != "") {
-        std::stringstream namespaces(options["--namespace"]);
+    int level  = 0;
+    if (options["--namespace"] != "") {
+        stringstream namespaces(options["--namespace"]);
         string name;
         namespaces >> name;
         do {
@@ -144,29 +288,32 @@ int main(int argc, const char * const * const argv)
             cout << "namespace " << name << " { \n";
             ns_cnt++;
             namespaces >> name;
-        } while(!namespaces.fail());
+        } while (!namespaces.fail());
     }
 
-    if(options["--type"] == "") {
-        options["--type"]     = "char";
-    }
+    if (options["--type"] == "") { options["--type"] = "char"; }
     add_tabs(level);
-    cout << "static const " << options["--type"] << " " << options["--name"] << "[] = {\n";
 
+    // Always create unsigned char to avoid narrowing
+    cout << "static const "
+         << "unsigned char"
+         << " " << options["--name"] << "_uchar [] = {\n";
 
-    ifstream input(options["--file"]);
+    ifstream input(options["--file"],
+                   (binary ? std::ios::binary : std::ios::in));
     size_t char_cnt = 0;
+    stringstream ss = removeComments(input, options["--file"]);
     add_tabs(++level);
-    for(char i; input.get(i);) {
-        cout << "0x" << std::hex << static_cast<int>(i) << ",\t";
+    for (char i; ss.get(i);) {
+        cout << "0x" << std::hex << static_cast<int>(i & 0xff) << ",\t";
         char_cnt++;
-        if(!(char_cnt % 10)) {
+        if (!(char_cnt % 10)) {
             cout << endl;
             add_tabs(level);
         }
     }
 
-    if (options["--eof"].c_str()[0] == '1') {
+    if (nullterm) {
         // Add end of file character
         cout << "0x0";
         char_cnt++;
@@ -174,11 +321,34 @@ int main(int argc, const char * const * const argv)
 
     cout << "};\n";
     add_tabs(--level);
-    cout << "static const size_t " << options["--name"] << "_len" << " = " << std::dec << char_cnt << ";\n";
 
-    while(ns_cnt--) {
+    // Cast to proper output type
+    cout << "static const " << options["--type"] << " *" << options["--name"]
+         << " = (const " << options["--type"] << " *)" << options["--name"]
+         << "_uchar;\n";
+    add_tabs(level);
+    cout << "static const size_t " << options["--name"] << "_len"
+         << "  = " << std::dec << char_cnt << ";\n";
+    add_tabs(level);
+    cout << "static const size_t " << options["--name"] << "_hash"
+         << " = " << deterministicHash(ss.str()) << "ULL;\n";
+    add_tabs(level);
+    cout << "static const common::Source " << options["--name"] << "_src{\n";
+    add_tabs(++level);
+    cout << options["--name"] << ",\n";
+    add_tabs(level);
+    cout << options["--name"] << "_len,\n";
+    add_tabs(level);
+    cout << options["--name"] << "_hash\n";
+    add_tabs(--level);
+    cout << "};\n";
+
+    while (ns_cnt--) {
         add_tabs(--level);
         cout << "}\n";
     }
+
     cout.rdbuf(bak);
+
+    return 0;
 }
diff --git a/CMakeModules/boost_package.cmake b/CMakeModules/boost_package.cmake
new file mode 100644
index 0000000000..f6fa995c7f
--- /dev/null
+++ b/CMakeModules/boost_package.cmake
@@ -0,0 +1,67 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(Boost_MIN_VER 107000)
+set(Boost_MIN_VER_STR "1.70")
+
+if(NOT
+  ((Boost_VERSION VERSION_GREATER Boost_MIN_VER OR
+    Boost_VERSION VERSION_EQUAL Boost_MIN_VER) OR
+   (Boost_VERSION_STRING VERSION_GREATER Boost_MIN_VER_STR OR
+    Boost_VERSION_STRING VERSION_EQUAL Boost_MIN_VER_STR) OR
+   (Boost_VERSION_MACRO VERSION_GREATER Boost_MIN_VER OR
+    Boost_VERSION_MACRO VERSION_EQUAL Boost_MIN_VER))
+  AND NOT AF_WITH_EXTERNAL_PACKAGES_ONLY)
+  set(VER 1.70.0)
+  message(WARNING
+      "WARN: Found Boost v${Boost_MAJOR_VERSION}.${Boost_MINOR_VERSION}. "
+      "Minimum required ${VER}. Build will download Boost Compute.")
+  af_dep_check_and_populate(${boost_prefix}
+    URL_AND_HASH
+    URI https://github.com/boostorg/compute/archive/boost-${VER}.tar.gz
+    REF MD5=e160ec0ff825fc2850ea4614323b1fb5
+  )
+  if(NOT TARGET Boost::boost)
+    add_library(Boost::boost IMPORTED INTERFACE GLOBAL)
+  endif()
+  set_target_properties(Boost::boost PROPERTIES
+    INTERFACE_INCLUDE_DIRECTORIES "${${boost_prefix}_SOURCE_DIR}/include;${Boost_INCLUDE_DIR}"
+    INTERFACE_SYSTEM_INCLUDE_DIRECTORIES "${${boost_prefix}_SOURCE_DIR}/include;${Boost_INCLUDE_DIR}"
+    )
+else()
+  if(NOT TARGET Boost::boost)
+    add_library(Boost::boost IMPORTED INTERFACE GLOBAL)
+    set_target_properties(Boost::boost PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES "${Boost_INCLUDE_DIR}"
+      INTERFACE_SYSTEM_INCLUDE_DIRECTORIES "${Boost_INCLUDE_DIR}")
+  endif()
+endif()
+
+if(TARGET Boost::boost)
+  set(BOOST_DEFINITIONS "BOOST_CHRONO_HEADER_ONLY;BOOST_COMPUTE_THREAD_SAFE;BOOST_COMPUTE_HAVE_THREAD_LOCAL")
+
+  # NOTE: Basic and Windows options do not require flags or libraries for
+  #       backtraces
+  if(AF_STACKTRACE_TYPE STREQUAL "libbacktrace")
+    list(APPEND BOOST_DEFINITIONS "BOOST_STACKTRACE_USE_BACKTRACE")
+    set_target_properties(Boost::boost PROPERTIES
+      INTERFACE_LINK_LIBRARIES ${Backtrace_LIBRARY})
+  elseif(AF_STACKTRACE_TYPE STREQUAL "addr2line")
+    list(APPEND BOOST_DEFINITIONS "BOOST_STACKTRACE_USE_ADDR2LINE")
+  elseif(AF_STACKTRACE_TYPE STREQUAL "None")
+    list(APPEND BOOST_DEFINITIONS "BOOST_STACKTRACE_USE_NOOP")
+  endif()
+
+  if(NOT AF_STACKTRACE_TYPE STREQUAL "None" AND APPLE)
+    list(APPEND BOOST_DEFINITIONS "BOOST_STACKTRACE_GNU_SOURCE_NOT_REQUIRED")
+  endif()
+
+  # NOTE: BOOST_CHRONO_HEADER_ONLY is required for Windows because otherwise it
+  # will try to link with libboost-chrono.
+  set_target_properties(Boost::boost PROPERTIES INTERFACE_COMPILE_DEFINITIONS
+      "${BOOST_DEFINITIONS}")
+endif()
diff --git a/CMakeModules/build_CLBlast.cmake b/CMakeModules/build_CLBlast.cmake
new file mode 100644
index 0000000000..7ea0b43256
--- /dev/null
+++ b/CMakeModules/build_CLBlast.cmake
@@ -0,0 +1,104 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+if(TARGET clblast OR AF_WITH_EXTERNAL_PACKAGES_ONLY)
+  if(TARGET clblast)
+    # CLBlast has a broken imported link interface where it lists
+    # the full path to the OpenCL library. OpenCL is imported by
+    # another package so we don't need this property to link against
+    # CLBlast.
+    set_target_properties(clblast PROPERTIES
+      IMPORTED_LINK_INTERFACE_LIBRARIES_RELEASE ""
+      IMPORTED_LINK_INTERFACE_LIBRARIES_DEBUG "")
+
+    if(WIN32 AND VCPKG_ROOT)
+      set_target_properties(clblast PROPERTIES
+        IMPORTED_LOCATION_RELEASE ""
+        IMPORTED_LOCATION_DEBUG "")
+    endif()
+  else()
+    message(FATAL_ERROR "CLBlast not found")
+  endif()
+else()
+  # This specific reference passes tests
+  af_dep_check_and_populate(${clblast_prefix}
+    URI https://github.com/cnugteren/CLBlast.git
+    REF 4500a03440e2cc54998c0edab366babf5e504d67
+  )
+
+  include(ExternalProject)
+  find_program(GIT git)
+
+  set(prefix ${PROJECT_BINARY_DIR}/third_party/CLBlast)
+  set(CLBlast_libname ${CMAKE_STATIC_LIBRARY_PREFIX}clblast${CMAKE_STATIC_LIBRARY_SUFFIX})
+  set(CLBlast_location ${${clblast_prefix}_BINARY_DIR}/pkg/lib/${CLBlast_libname})
+
+  set(extproj_gen_opts "-G${CMAKE_GENERATOR}")
+  if(WIN32 AND CMAKE_GENERATOR_PLATFORM AND NOT CMAKE_GENERATOR MATCHES "Ninja")
+    list(APPEND extproj_gen_opts "-A${CMAKE_GENERATOR_PLATFORM}")
+    if(CMAKE_GENERATOR_TOOLSET)
+      list(APPEND extproj_gen_opts "-T${CMAKE_GENERATOR_TOOLSET}")
+    endif()
+  endif()
+  if(VCPKG_TARGET_TRIPLET)
+    list(APPEND extproj_gen_opts "-DOPENCL_ROOT:PATH=${_VCPKG_INSTALLED_DIR}/${VCPKG_TARGET_TRIPLET}")
+  endif()
+
+  set(extproj_build_type_option "")
+  if(NOT isMultiConfig)
+    if("${CMAKE_BUILD_TYPE}" MATCHES "Release|RelWithDebInfo")
+      set(extproj_build_type "Release")
+    else()
+      set(extproj_build_type ${CMAKE_BUILD_TYPE})
+    endif()
+    set(extproj_build_type_option "-DCMAKE_BUILD_TYPE:STRING=${extproj_build_type}")
+  endif()
+
+  ExternalProject_Add(
+      CLBlast-ext
+      DOWNLOAD_COMMAND ""
+      UPDATE_COMMAND ""
+      PATCH_COMMAND ""
+      SOURCE_DIR "${${clblast_prefix}_SOURCE_DIR}"
+      BINARY_DIR "${${clblast_prefix}_BINARY_DIR}"
+      PREFIX "${prefix}"
+      INSTALL_DIR "${${clblast_prefix}_BINARY_DIR}/pkg"
+      BUILD_BYPRODUCTS ${CLBlast_location}
+      CONFIGURE_COMMAND ${CMAKE_COMMAND} ${extproj_gen_opts}
+        -Wno-dev <SOURCE_DIR>
+        -DCMAKE_POLICY_VERSION_MINIMUM=3.5
+        -DCMAKE_CXX_COMPILER:FILEPATH=${CMAKE_CXX_COMPILER}
+        "-DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS}"
+        -DOVERRIDE_MSVC_FLAGS_TO_MT:BOOL=OFF
+        -DCMAKE_C_COMPILER:FILEPATH=${CMAKE_C_COMPILER}
+        "-DCMAKE_C_FLAGS:STRING=${CMAKE_C_FLAGS}"
+        -DCMAKE_POSITION_INDEPENDENT_CODE=ON
+        -DOPENCL_LIBRARIES="${OPENCL_LIBRARIES}"
+        ${extproj_build_type_option}
+        -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
+        -DCMAKE_INSTALL_LIBDIR:PATH=lib
+        -DBUILD_SHARED_LIBS:BOOL=OFF
+        -DSAMPLES:BOOL=OFF
+        -DTUNERS:BOOL=OFF
+        -DCLIENTS:BOOL=OFF
+        -DTESTS:BOOL=OFF
+        -DNETLIB:BOOL=OFF
+      )
+
+  set(CLBLAST_INCLUDE_DIRS "${${clblast_prefix}_BINARY_DIR}/pkg/include")
+  set(CLBLAST_LIBRARIES CLBlast)
+  set(CLBLAST_FOUND ON)
+
+  make_directory("${CLBLAST_INCLUDE_DIRS}")
+
+  add_library(clblast UNKNOWN IMPORTED)
+  set_target_properties(clblast PROPERTIES
+    IMPORTED_LOCATION "${CLBlast_location}"
+    INTERFACE_INCLUDE_DIRECTORIES "${CLBLAST_INCLUDE_DIRS}")
+
+  add_dependencies(clblast CLBlast-ext)
+endif()
diff --git a/CMakeModules/build_boost_compute.cmake b/CMakeModules/build_boost_compute.cmake
deleted file mode 100644
index c0de1cb291..0000000000
--- a/CMakeModules/build_boost_compute.cmake
+++ /dev/null
@@ -1,70 +0,0 @@
-SET(VER 79aa8f9086fdf6ef6db78e889de0273b0eb7bd19)
-SET(URL https://github.com/boostorg/compute/archive/${VER}.tar.gz)
-SET(MD5 dba3318cbdac912dddce71f2a38ffa43)
-
-SET(thirdPartyDir "${CMAKE_BINARY_DIR}/third_party")
-SET(srcDir "${thirdPartyDir}/compute-${VER}")
-SET(archive ${srcDir}.tar.gz)
-SET(inflated ${srcDir}-inflated)
-
-# the config to be used in the code
-SET(BoostCompute_INCLUDE_DIRS "${srcDir}/include")
-
-# do we have to do it again?
-SET(doExtraction ON)
-IF(EXISTS "${inflated}")
-    FILE(READ "${inflated}" extractedMD5)
-    IF("${extractedMD5}" STREQUAL "${MD5}")
-        # nope, everything looks fine
-        return()
-    ENDIF()
-ENDIF()
-
-# lets get and extract boost compute
-
-MESSAGE(STATUS "BoostCompute...")
-IF(EXISTS "${archive}")
-    FILE(MD5 "${archive}" md5)
-    IF(NOT "${md5}" STREQUAL "${MD5}")
-        MESSAGE("  wrong check sum ${md5}, redownloading")
-        FILE(REMOVE "${archive}")
-    ENDIF()
-ENDIF()
-
-IF(NOT EXISTS "${archive}")
-    MESSAGE(STATUS "  getting ${URL}")
-    FILE(DOWNLOAD "${URL}" ${archive}
-        STATUS rv
-        SHOW_PROGRESS)
-ENDIF()
-
-MESSAGE(STATUS "  validating ${archive}")
-FILE(MD5 "${archive}" md5)
-IF(NOT "${md5}" STREQUAL "${MD5}")
-    MESSAGE(WARNING "${archive}: Invalid check sum ${md5}. Expected was ${MD5}")
-    IF("${md5}" STREQUAL "d41d8cd98f00b204e9800998ecf8427e")
-        MESSAGE(STATUS "Trying wget ${URL}")
-        EXECUTE_PROCESS(COMMAND wget -O ${archive} ${URL})
-        FILE(MD5 "${archive}" md5_)
-        IF(NOT "${md5_}" STREQUAL "${MD5}")
-            MESSAGE(FATAL_ERROR "${archive}: Invalid check sum ${md5_}. Expected was ${MD5}")
-        ENDIF(NOT "${md5_}" STREQUAL "${MD5}")
-        MESSAGE(STATUS "wget successful")
-    ENDIF("${md5}" STREQUAL "d41d8cd98f00b204e9800998ecf8427e")
-ENDIF()
-
-IF(IS_DIRECTORY ${srcDir})
-    MESSAGE(STATUS "  cleaning ${cleaning}")
-    FILE(REMOVE_RECURSE ${srcDir})
-ENDIF()
-
-MESSAGE(STATUS "  extracting ${archive}")
-FILE(MAKE_DIRECTORY ${srcDir})
-EXECUTE_PROCESS(COMMAND ${CMAKE_COMMAND} -E tar xfz ${archive}
-    WORKING_DIRECTORY ${thirdPartyDir}
-    RESULT_VARIABLE rv)
-IF(NOT rv EQUAL 0)
-    MESSAGE(FATAL_ERROR "'${archive}' extraction failed")
-ENDIF()
-
-FILE(WRITE ${inflated} "${MD5}")
diff --git a/CMakeModules/build_cl2hpp.cmake b/CMakeModules/build_cl2hpp.cmake
new file mode 100644
index 0000000000..b38c4bc1d1
--- /dev/null
+++ b/CMakeModules/build_cl2hpp.cmake
@@ -0,0 +1,42 @@
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+# Check if cl2.hpp exists and if not download it from the Khronos GitHub repo
+#
+# NOTE: This file does not use ExternalProject_Add because that command
+#       was not able to download files that are not archives before CMake
+#       version 3.6
+
+find_package(OpenCL)
+
+if(NOT TARGET OpenCL::cl2hpp)
+  find_path(cl2hpp_header_file_path
+    NAMES CL/cl2.hpp
+    PATHS ${OpenCL_INCLUDE_PATHS})
+
+  if(cl2hpp_header_file_path)
+    add_library(cl2hpp IMPORTED INTERFACE GLOBAL)
+    add_library(OpenCL::cl2hpp IMPORTED INTERFACE GLOBAL)
+
+    set_target_properties(cl2hpp OpenCL::cl2hpp PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES ${cl2hpp_header_file_path})
+  elseif (NOT TARGET OpenCL::cl2hpp OR NOT TARGET cl2hpp)
+    af_dep_check_and_populate(${cl2hpp_prefix}
+      URI https://github.com/KhronosGroup/OpenCL-CLHPP.git
+      REF v2024.10.24)
+
+    find_path(cl2hpp_var
+      NAMES CL/cl2.hpp
+      PATHS ${ArrayFire_BINARY_DIR}/extern/${cl2hpp_prefix}-src/include)
+
+    add_library(cl2hpp IMPORTED INTERFACE GLOBAL)
+    add_library(OpenCL::cl2hpp IMPORTED INTERFACE GLOBAL)
+
+    set_target_properties(cl2hpp OpenCL::cl2hpp PROPERTIES
+      INTERFACE_INCLUDE_DIRECTORIES ${cl2hpp_var})
+  endif()
+endif()
diff --git a/CMakeModules/build_clBLAS.cmake b/CMakeModules/build_clBLAS.cmake
deleted file mode 100644
index faa415185e..0000000000
--- a/CMakeModules/build_clBLAS.cmake
+++ /dev/null
@@ -1,42 +0,0 @@
-INCLUDE(ExternalProject)
-
-SET(prefix ${CMAKE_BINARY_DIR}/third_party/clBLAS)
-SET(clBLAS_location ${prefix}/lib/import/${CMAKE_STATIC_LIBRARY_PREFIX}clBLAS${CMAKE_STATIC_LIBRARY_SUFFIX})
-IF(CMAKE_VERSION VERSION_LESS 3.2)
-    IF(CMAKE_GENERATOR MATCHES "Ninja")
-        MESSAGE(WARNING "Building clBLAS with Ninja has known issues with CMake older than 3.2")
-    endif()
-    SET(byproducts)
-ELSE()
-    SET(byproducts BYPRODUCTS ${clBLAS_location})
-ENDIF()
-
-ExternalProject_Add(
-    clBLAS-external
-    GIT_REPOSITORY https://github.com/arrayfire/clBLAS.git
-    GIT_TAG 47662a6ac1186c756508109d7fef8827efab4504
-    PREFIX "${prefix}"
-    INSTALL_DIR "${prefix}"
-    UPDATE_COMMAND ""
-    CONFIGURE_COMMAND ${CMAKE_COMMAND} -Wno-dev "-G${CMAKE_GENERATOR}" <SOURCE_DIR>/src
-    -DCMAKE_CXX_COMPILER:FILEPATH=${CMAKE_CXX_COMPILER}
-    "-DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS} -w -fPIC"
-    -DCMAKE_C_COMPILER:FILEPATH=${CMAKE_C_COMPILER}
-    "-DCMAKE_C_FLAGS:STRING=${CMAKE_C_FLAGS} -w -fPIC"
-    -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-    -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
-    -DBUILD_SHARED_LIBS:BOOL=OFF
-    -DBUILD_CLIENT:BOOL=OFF
-    -DBUILD_TEST:BOOL=OFF
-    -DBUILD_KTEST:BOOL=OFF
-    -DSUFFIX_LIB:STRING=
-    ${byproducts}
-    )
-
-ExternalProject_Get_Property(clBLAS-external install_dir)
-ADD_LIBRARY(clBLAS IMPORTED STATIC)
-SET_TARGET_PROPERTIES(clBLAS PROPERTIES IMPORTED_LOCATION ${clBLAS_location})
-ADD_DEPENDENCIES(clBLAS clBLAS-external)
-SET(CLBLAS_INCLUDE_DIRS ${install_dir}/include)
-SET(CLBLAS_LIBRARIES clBLAS)
-SET(CLBLAS_FOUND ON)
diff --git a/CMakeModules/build_clFFT.cmake b/CMakeModules/build_clFFT.cmake
index cf135f4d53..b3e56137bf 100644
--- a/CMakeModules/build_clFFT.cmake
+++ b/CMakeModules/build_clFFT.cmake
@@ -1,42 +1,43 @@
-INCLUDE(ExternalProject)
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
 
-SET(prefix "${CMAKE_BINARY_DIR}/third_party/clFFT")
-SET(clFFT_location ${prefix}/lib/import/${CMAKE_STATIC_LIBRARY_PREFIX}clFFT${CMAKE_STATIC_LIBRARY_SUFFIX})
-IF(CMAKE_VERSION VERSION_LESS 3.2)
-    IF(CMAKE_GENERATOR MATCHES "Ninja")
-        MESSAGE(WARNING "Building clFFT with Ninja has known issues with CMake older than 3.2")
-    endif()
-    SET(byproducts)
-ELSE()
-    SET(byproducts BYPRODUCTS ${clFFT_location})
-ENDIF()
+af_dep_check_and_populate(${clfft_prefix}
+  URI https://github.com/arrayfire/clFFT.git
+  REF arrayfire-release
+)
 
-ExternalProject_Add(
-    clFFT-external
-    GIT_REPOSITORY https://github.com/arrayfire/clFFT.git
-    GIT_TAG 1597f0f35a644789c7ad77efe79014236cca2fab
-    PREFIX "${prefix}"
-    INSTALL_DIR "${prefix}"
-    UPDATE_COMMAND ""
-    CONFIGURE_COMMAND ${CMAKE_COMMAND} -Wno-dev "-G${CMAKE_GENERATOR}" <SOURCE_DIR>/src
-    -DCMAKE_CXX_COMPILER:FILEPATH=${CMAKE_CXX_COMPILER}
-    "-DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS} -w -fPIC"
-    -DCMAKE_C_COMPILER:FILEPATH=${CMAKE_C_COMPILER}
-    "-DCMAKE_C_FLAGS:STRING=${CMAKE_C_FLAGS} -w -fPIC"
-    -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-    -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
-    -DBUILD_SHARED_LIBS:BOOL=OFF
-    -DBUILD_CLIENT:BOOL=OFF
-    -DBUILD_TEST:BOOL=OFF
-    -DSUFFIX_LIB:STRING=
-    -DUSE_SYSTEM_GTEST:BOOL=ON
-    ${byproducts}
-    )
+set(current_build_type ${BUILD_SHARED_LIBS})
+set(BUILD_SHARED_LIBS OFF)
+add_subdirectory(${${clfft_prefix}_SOURCE_DIR}/src ${${clfft_prefix}_BINARY_DIR} EXCLUDE_FROM_ALL)
+get_property(clfft_include_dir
+  TARGET clFFT
+  PROPERTY INTERFACE_INCLUDE_DIRECTORIES)
+set_target_properties(clFFT
+  PROPERTIES INTERFACE_SYSTEM_INCLUDE_DIRECTORIES "${clfft_include_dir}")
 
-ExternalProject_Get_Property(clFFT-external install_dir)
-ADD_LIBRARY(clFFT IMPORTED STATIC)
-SET_TARGET_PROPERTIES(clFFT PROPERTIES IMPORTED_LOCATION ${clFFT_location})
-ADD_DEPENDENCIES(clFFT clFFT-external)
-SET(CLFFT_INCLUDE_DIRS ${install_dir}/include)
-SET(CLFFT_LIBRARIES clFFT)
-SET(CLFFT_FOUND ON)
+# OpenCL targets need this flag to avoid ignored attribute warnings in the
+# OpenCL headers
+check_cxx_compiler_flag(-Wno-ignored-attributes has_ignored_attributes_flag)
+if(has_ignored_attributes_flag)
+  target_compile_options(clFFT
+    PRIVATE -Wno-ignored-attributes)
+endif()
+set(BUILD_SHARED_LIBS ${current_build_type})
+
+mark_as_advanced(
+  Boost_PROGRAM_OPTIONS_LIBRARY_RELEASE
+  CLFFT_BUILD64
+  CLFFT_BUILD_CALLBACK_CLIENT
+  CLFFT_BUILD_CLIENT
+  CLFFT_BUILD_EXAMPLES
+  CLFFT_BUILD_LOADLIBRARIES
+  CLFFT_BUILD_RUNTIME
+  CLFFT_BUILD_TEST
+  CLFFT_CODE_COVERAGE
+  CLFFT_SUFFIX_BIN
+  CLFFT_SUFFIX_LIB
+)
diff --git a/CMakeModules/build_forge.cmake b/CMakeModules/build_forge.cmake
deleted file mode 100644
index 7041ed9df8..0000000000
--- a/CMakeModules/build_forge.cmake
+++ /dev/null
@@ -1,51 +0,0 @@
-INCLUDE(ExternalProject)
-
-SET(prefix ${CMAKE_BINARY_DIR}/third_party/forge)
-SET(forge_location "${prefix}/lib/${CMAKE_SHARED_LIBRARY_PREFIX}forge${CMAKE_SHARED_LIBRARY_SUFFIX}")
-
-IF(CMAKE_VERSION VERSION_LESS 3.2)
-    IF(CMAKE_GENERATOR MATCHES "Ninja")
-        MESSAGE(WARNING "Building forge with Ninja has known issues with CMake older than 3.2")
-    endif()
-    SET(byproducts)
-ELSE()
-    SET(byproducts BYPRODUCTS ${forge_location})
-ENDIF()
-
-INCLUDE("${CMAKE_MODULE_PATH}/FindGLEWmx.cmake")
-FIND_PACKAGE(GLFW)
-
-ExternalProject_Add(
-    forge-ext
-    GIT_REPOSITORY https://github.com/arrayfire/forge.git
-    GIT_TAG master
-    PREFIX "${prefix}"
-    INSTALL_DIR "${prefix}"
-    UPDATE_COMMAND ""
-    CONFIGURE_COMMAND ${CMAKE_COMMAND} -Wno-dev "-G${CMAKE_GENERATOR}" <SOURCE_DIR>
-    -DCMAKE_SOURCE_DIR:PATH=<SOURCE_DIR>
-    -DCMAKE_CXX_COMPILER:FILEPATH=${CMAKE_CXX_COMPILER}
-    -DCMAKE_C_COMPILER:FILEPATH=${CMAKE_C_COMPILER}
-    -DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-    -DCMAKE_INSTALL_PREFIX:PATH=<INSTALL_DIR>
-    -DBUILD_EXAMPLES:BOOL=OFF
-    -DUSE_GLEWmx_STATIC:BOOL=${USE_GLEWmx_STATIC}
-    -DGLEW_INCLUDE_DIR:PATH=${GLEW_INCLUDE_DIR}
-    -DGLEW_LIBRARY:FILEPATH=${GLEW_LIBRARY}
-    -DGLEWmxd_LIBRARY:FILEPATH=${GLEWmxd_LIBRARY}
-    -DGLEWmxs_LIBRARY:FILEPATH=${GLEWmxs_LIBRARY}
-    -DGLFW_INCLUDE_DIR:PATH=${GLFW_INCLUDE_DIR}
-    -DGLFW_LIBRARY:FILEPATH=${GLFW_LIBRARY}
-    ${byproducts}
-    )
-
-ExternalProject_Get_Property(forge-ext install_dir)
-ADD_LIBRARY(forge SHARED IMPORTED)
-SET_TARGET_PROPERTIES(forge PROPERTIES IMPORTED_LOCATION ${forge_location})
-IF(WIN32)
-    SET_TARGET_PROPERTIES(forge PROPERTIES IMPORTED_IMPLIB ${prefix}/lib/forge.lib)
-ENDIF(WIN32)
-ADD_DEPENDENCIES(forge forge-ext)
-SET(FORGE_INCLUDE_DIRECTORIES ${install_dir}/include)
-SET(FORGE_LIBRARIES forge)
-SET(FORGE_FOUND ON)
diff --git a/CMakeModules/build_gtest.cmake b/CMakeModules/build_gtest.cmake
deleted file mode 100644
index 4d7fbbe055..0000000000
--- a/CMakeModules/build_gtest.cmake
+++ /dev/null
@@ -1,99 +0,0 @@
-# Build the gtest libraries
-
-# Check if Google Test exists
-SET(GTEST_SOURCE_DIR "${CMAKE_SOURCE_DIR}/test/gtest")
-IF(NOT EXISTS "${GTEST_SOURCE_DIR}/README")
-    MESSAGE(WARNING "GTest Source is not available. Tests will not build.")
-    MESSAGE("Did you miss the --recursive option when cloning?")
-    MESSAGE("Run the following commands to correct this:")
-    MESSAGE("git submodule init")
-    MESSAGE("git submodule update")
-    MESSAGE("git submodule foreach git pull origin master")
-ENDIF()
-
-if(CMAKE_VERSION VERSION_LESS 3.2 AND CMAKE_GENERATOR MATCHES "Ninja")
-    message(WARNING "Building GTest with Ninja has known issues with CMake older than 3.2")
-endif()
-
-include(ExternalProject)
-
-# Set the build type if it isn't already
-if(NOT CMAKE_BUILD_TYPE)
-  set(CMAKE_BUILD_TYPE Release)
-endif()
-
-# Set default ExternalProject root directory
-set(prefix "${CMAKE_BINARY_DIR}/third_party/gtest")
-# the binary dir must be know before creating the external project in order
-# to pass the byproducts
-set(binary_dir "${prefix}/src/googletest-build")
-set(stdlib_binary_dir "${prefix}/src/googletest-build-stdlib")
-
-set(GTEST_LIBRARIES gtest gtest_main)
-set(GTEST_LIBRARIES_STDLIB gtest_stdlib gtest_main_stdlib)
-
-set(byproducts)
-set(byproducts_libstdcpp)
-foreach(lib ${GTEST_LIBRARIES})
-    set(${lib}_location
-        ${binary_dir}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}${lib}${CMAKE_STATIC_LIBRARY_SUFFIX})
-    set(${lib}_location_libstdcpp
-        ${stdlib_binary_dir}/${CMAKE_CFG_INTDIR}/${CMAKE_STATIC_LIBRARY_PREFIX}${lib}${CMAKE_STATIC_LIBRARY_SUFFIX})
-    list(APPEND byproducts ${${lib}_location})
-    list(APPEND byproducts_libstdcpp ${${lib}_location_libstdcpp})
-endforeach()
-SET(CMAKE_CXX_FLAGS_STD "${CMAKE_CXX_FLAGS} -stdlib=libstdc++")
-
-FUNCTION(GTEST_BUILD BUILD_NAME BUILD_TYPE BUILD_BINARY_DIR BUILD_BYPRODUCTS)
-# Add gtest
-ExternalProject_Add(
-    ${BUILD_NAME}
-    # URL http://googletest.googlecode.com/files/gtest-1.7.0.zip
-    # URL_MD5 2d6ec8ccdf5c46b05ba54a9fd1d130d7
-    SOURCE_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../test/gtest"
-    PREFIX ${prefix}
-    BINARY_DIR ${BUILD_BINARY_DIR}
-    TIMEOUT 10
-    CMAKE_ARGS -Dgtest_force_shared_crt=ON
-               -DCMAKE_CXX_COMPILER=${CMAKE_CXX_COMPILER}
-               -DCMAKE_BUILD_TYPE=${BUILD_TYPE}
-               -DCMAKE_CXX_FLAGS=${CMAKE_CXX_FLAGS}
-               -DCMAKE_CXX_FLAGS_LIBSTDCPP=${CMAKE_CXX_FLAGS_STD}
-               -DCMAKE_CXX_FLAGS_DEBUG=${CMAKE_CXX_FLAGS_DEBUG}
-               -DCMAKE_CXX_FLAGS_MINSIZEREL=${CMAKE_CXX_FLAGS_MINSIZEREL}
-               -DCMAKE_CXX_FLAGS_RELEASE=${CMAKE_CXX_FLAGS_RELEASE}
-               -DCMAKE_CXX_FLAGS_RELWITHDEBINFO=${CMAKE_CXX_FLAGS_RELWITHDEBINFO}
-    BUILD_BYPRODUCTS ${BUILD_BYPRODUCTS}
-    # Disable install step
-    INSTALL_COMMAND ""
-    # Wrap download, configure and build steps in a script to log output
-    LOG_DOWNLOAD 0
-    LOG_UPDATE 0
-    LOG_CONFIGURE 0
-    LOG_BUILD 0)
-ENDFUNCTION(GTEST_BUILD)
-
-GTEST_BUILD(googletest              ${CMAKE_BUILD_TYPE} ${binary_dir} "${byproducts}")
-
-# If we are on OSX and using the clang compiler go ahead and build
-# GTest using libstdc++ just in case we compile the CUDA backend
-IF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
-    GTEST_BUILD(googletest_libstdcpp    LibStdCpp           ${stdlib_binary_dir} "${byproducts_libstdcpp}")
-ENDIF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
-
-foreach(lib ${GTEST_LIBRARIES})
-    add_library(${lib} IMPORTED STATIC)
-    add_dependencies(${lib} googletest)
-    set_target_properties(${lib} PROPERTIES IMPORTED_LOCATION ${${lib}_location})
-
-    IF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
-        add_library(${lib}_stdlib IMPORTED STATIC)
-        add_dependencies(${lib}_stdlib googletest_libstdcpp)
-        set_target_properties(${lib}_stdlib PROPERTIES IMPORTED_LOCATION ${${lib}_location_libstdcpp})
-    ENDIF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
-endforeach()
-
-# Specify include dir
-ExternalProject_Get_Property(googletest source_dir)
-set(GTEST_INCLUDE_DIRS ${source_dir}/include)
-set(GTEST_FOUND ON)
diff --git a/CMakeModules/build_version.hpp.in b/CMakeModules/build_version.hpp.in
new file mode 100644
index 0000000000..d3b881f8d9
--- /dev/null
+++ b/CMakeModules/build_version.hpp.in
@@ -0,0 +1,13 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#define AF_REVISION "@GIT_COMMIT_HASH@"
+#define AF_COMPILER_STR "@AF_COMPILER_STRING@"
diff --git a/CMakeModules/compilers.h b/CMakeModules/compilers.h
new file mode 100644
index 0000000000..60480d86ee
--- /dev/null
+++ b/CMakeModules/compilers.h
@@ -0,0 +1,550 @@
+
+// This is a generated file. Do not edit!
+
+#ifndef AF_COMPILER_DETECTION_H
+#define AF_COMPILER_DETECTION_H
+
+#ifdef __cplusplus
+# define AF_COMPILER_IS_Comeau 0
+# define AF_COMPILER_IS_Intel 0
+# define AF_COMPILER_IS_PathScale 0
+# define AF_COMPILER_IS_Embarcadero 0
+# define AF_COMPILER_IS_Borland 0
+# define AF_COMPILER_IS_Watcom 0
+# define AF_COMPILER_IS_OpenWatcom 0
+# define AF_COMPILER_IS_SunPro 0
+# define AF_COMPILER_IS_HP 0
+# define AF_COMPILER_IS_Compaq 0
+# define AF_COMPILER_IS_zOS 0
+# define AF_COMPILER_IS_IBMClang 0
+# define AF_COMPILER_IS_XLClang 0
+# define AF_COMPILER_IS_XL 0
+# define AF_COMPILER_IS_VisualAge 0
+# define AF_COMPILER_IS_NVHPC 0
+# define AF_COMPILER_IS_PGI 0
+# define AF_COMPILER_IS_Cray 0
+# define AF_COMPILER_IS_TI 0
+# define AF_COMPILER_IS_FujitsuClang 0
+# define AF_COMPILER_IS_Fujitsu 0
+# define AF_COMPILER_IS_GHS 0
+# define AF_COMPILER_IS_Tasking 0
+# define AF_COMPILER_IS_SCO 0
+# define AF_COMPILER_IS_ARMCC 0
+# define AF_COMPILER_IS_AppleClang 0
+# define AF_COMPILER_IS_ARMClang 0
+# define AF_COMPILER_IS_Clang 0
+# define AF_COMPILER_IS_LCC 0
+# define AF_COMPILER_IS_GNU 0
+# define AF_COMPILER_IS_MSVC 0
+# define AF_COMPILER_IS_ADSP 0
+# define AF_COMPILER_IS_IAR 0
+# define AF_COMPILER_IS_MIPSpro 0
+
+#if defined(__COMO__)
+# undef AF_COMPILER_IS_Comeau
+# define AF_COMPILER_IS_Comeau 1
+
+#elif defined(__INTEL_COMPILER) || defined(__ICC)
+# undef AF_COMPILER_IS_Intel
+# define AF_COMPILER_IS_Intel 1
+
+#elif defined(__PATHCC__)
+# undef AF_COMPILER_IS_PathScale
+# define AF_COMPILER_IS_PathScale 1
+
+#elif defined(__BORLANDC__) && defined(__CODEGEARC_VERSION__)
+# undef AF_COMPILER_IS_Embarcadero
+# define AF_COMPILER_IS_Embarcadero 1
+
+#elif defined(__BORLANDC__)
+# undef AF_COMPILER_IS_Borland
+# define AF_COMPILER_IS_Borland 1
+
+#elif defined(__WATCOMC__) && __WATCOMC__ < 1200
+# undef AF_COMPILER_IS_Watcom
+# define AF_COMPILER_IS_Watcom 1
+
+#elif defined(__WATCOMC__)
+# undef AF_COMPILER_IS_OpenWatcom
+# define AF_COMPILER_IS_OpenWatcom 1
+
+#elif defined(__SUNPRO_CC)
+# undef AF_COMPILER_IS_SunPro
+# define AF_COMPILER_IS_SunPro 1
+
+#elif defined(__HP_aCC)
+# undef AF_COMPILER_IS_HP
+# define AF_COMPILER_IS_HP 1
+
+#elif defined(__DECCXX)
+# undef AF_COMPILER_IS_Compaq
+# define AF_COMPILER_IS_Compaq 1
+
+#elif defined(__IBMCPP__) && defined(__COMPILER_VER__)
+# undef AF_COMPILER_IS_zOS
+# define AF_COMPILER_IS_zOS 1
+
+#elif defined(__open_xl__) && defined(__clang__)
+# undef AF_COMPILER_IS_IBMClang
+# define AF_COMPILER_IS_IBMClang 1
+
+#elif defined(__ibmxl__) && defined(__clang__)
+# undef AF_COMPILER_IS_XLClang
+# define AF_COMPILER_IS_XLClang 1
+
+#elif defined(__IBMCPP__) && !defined(__COMPILER_VER__) && __IBMCPP__ >= 800
+# undef AF_COMPILER_IS_XL
+# define AF_COMPILER_IS_XL 1
+
+#elif defined(__IBMCPP__) && !defined(__COMPILER_VER__) && __IBMCPP__ < 800
+# undef AF_COMPILER_IS_VisualAge
+# define AF_COMPILER_IS_VisualAge 1
+
+#elif defined(__NVCOMPILER)
+# undef AF_COMPILER_IS_NVHPC
+# define AF_COMPILER_IS_NVHPC 1
+
+#elif defined(__PGI)
+# undef AF_COMPILER_IS_PGI
+# define AF_COMPILER_IS_PGI 1
+
+#elif defined(_CRAYC)
+# undef AF_COMPILER_IS_Cray
+# define AF_COMPILER_IS_Cray 1
+
+#elif defined(__TI_COMPILER_VERSION__)
+# undef AF_COMPILER_IS_TI
+# define AF_COMPILER_IS_TI 1
+
+#elif defined(__CLANG_FUJITSU)
+# undef AF_COMPILER_IS_FujitsuClang
+# define AF_COMPILER_IS_FujitsuClang 1
+
+#elif defined(__FUJITSU)
+# undef AF_COMPILER_IS_Fujitsu
+# define AF_COMPILER_IS_Fujitsu 1
+
+#elif defined(__ghs__)
+# undef AF_COMPILER_IS_GHS
+# define AF_COMPILER_IS_GHS 1
+
+#elif defined(__TASKING__)
+# undef AF_COMPILER_IS_Tasking
+# define AF_COMPILER_IS_Tasking 1
+
+#elif defined(__SCO_VERSION__)
+# undef AF_COMPILER_IS_SCO
+# define AF_COMPILER_IS_SCO 1
+
+#elif defined(__ARMCC_VERSION) && !defined(__clang__)
+# undef AF_COMPILER_IS_ARMCC
+# define AF_COMPILER_IS_ARMCC 1
+
+#elif defined(__clang__) && defined(__apple_build_version__)
+# undef AF_COMPILER_IS_AppleClang
+# define AF_COMPILER_IS_AppleClang 1
+
+#elif defined(__clang__) && defined(__ARMCOMPILER_VERSION)
+# undef AF_COMPILER_IS_ARMClang
+# define AF_COMPILER_IS_ARMClang 1
+
+#elif defined(__clang__)
+# undef AF_COMPILER_IS_Clang
+# define AF_COMPILER_IS_Clang 1
+
+#elif defined(__LCC__) && (defined(__GNUC__) || defined(__GNUG__) || defined(__MCST__))
+# undef AF_COMPILER_IS_LCC
+# define AF_COMPILER_IS_LCC 1
+
+#elif defined(__GNUC__) || defined(__GNUG__)
+# undef AF_COMPILER_IS_GNU
+# define AF_COMPILER_IS_GNU 1
+
+#elif defined(_MSC_VER)
+# undef AF_COMPILER_IS_MSVC
+# define AF_COMPILER_IS_MSVC 1
+
+#elif defined(_ADI_COMPILER)
+# undef AF_COMPILER_IS_ADSP
+# define AF_COMPILER_IS_ADSP 1
+
+#elif defined(__IAR_SYSTEMS_ICC__) || defined(__IAR_SYSTEMS_ICC)
+# undef AF_COMPILER_IS_IAR
+# define AF_COMPILER_IS_IAR 1
+
+
+#endif
+
+#  if AF_COMPILER_IS_AppleClang
+
+#    if !(((__clang_major__ * 100) + __clang_minor__) >= 400)
+#      error Unsupported compiler version
+#    endif
+
+# define AF_COMPILER_VERSION_MAJOR (__clang_major__)
+# define AF_COMPILER_VERSION_MINOR (__clang_minor__)
+# define AF_COMPILER_VERSION_PATCH (__clang_patchlevel__)
+# if defined(_MSC_VER)
+   /* _MSC_VER = VVRR */
+#  define AF_SIMULATE_VERSION_MAJOR (_MSC_VER / 100)
+#  define AF_SIMULATE_VERSION_MINOR (_MSC_VER % 100)
+# endif
+# define AF_COMPILER_VERSION_TWEAK (__apple_build_version__)
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_rvalue_references)
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 1
+#    else
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_noexcept)
+#      define AF_COMPILER_CXX_NOEXCEPT 1
+#    else
+#      define AF_COMPILER_CXX_NOEXCEPT 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_variadic_templates)
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 1
+#    else
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_alignas)
+#      define AF_COMPILER_CXX_ALIGNAS 1
+#    else
+#      define AF_COMPILER_CXX_ALIGNAS 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_static_assert)
+#      define AF_COMPILER_CXX_STATIC_ASSERT 1
+#    else
+#      define AF_COMPILER_CXX_STATIC_ASSERT 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_generalized_initializers)
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 1
+#    else
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 400 && __has_feature(cxx_relaxed_constexpr)
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 1
+#    else
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 0
+#    endif
+
+#  elif AF_COMPILER_IS_Clang
+
+#    if !(((__clang_major__ * 100) + __clang_minor__) >= 301)
+#      error Unsupported compiler version
+#    endif
+
+# define AF_COMPILER_VERSION_MAJOR (__clang_major__)
+# define AF_COMPILER_VERSION_MINOR (__clang_minor__)
+# define AF_COMPILER_VERSION_PATCH (__clang_patchlevel__)
+# if defined(_MSC_VER)
+   /* _MSC_VER = VVRR */
+#  define AF_SIMULATE_VERSION_MAJOR (_MSC_VER / 100)
+#  define AF_SIMULATE_VERSION_MINOR (_MSC_VER % 100)
+# endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_rvalue_references)
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 1
+#    else
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_noexcept)
+#      define AF_COMPILER_CXX_NOEXCEPT 1
+#    else
+#      define AF_COMPILER_CXX_NOEXCEPT 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_variadic_templates)
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 1
+#    else
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_alignas)
+#      define AF_COMPILER_CXX_ALIGNAS 1
+#    else
+#      define AF_COMPILER_CXX_ALIGNAS 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_static_assert)
+#      define AF_COMPILER_CXX_STATIC_ASSERT 1
+#    else
+#      define AF_COMPILER_CXX_STATIC_ASSERT 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_generalized_initializers)
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 1
+#    else
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 0
+#    endif
+
+#    if ((__clang_major__ * 100) + __clang_minor__) >= 301 && __has_feature(cxx_relaxed_constexpr)
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 1
+#    else
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 0
+#    endif
+
+#  elif AF_COMPILER_IS_GNU
+
+#    if !((__GNUC__ * 100 + __GNUC_MINOR__) >= 404)
+#      error Unsupported compiler version
+#    endif
+
+# if defined(__GNUC__)
+#  define AF_COMPILER_VERSION_MAJOR (__GNUC__)
+# else
+#  define AF_COMPILER_VERSION_MAJOR (__GNUG__)
+# endif
+# if defined(__GNUC_MINOR__)
+#  define AF_COMPILER_VERSION_MINOR (__GNUC_MINOR__)
+# endif
+# if defined(__GNUC_PATCHLEVEL__)
+#  define AF_COMPILER_VERSION_PATCH (__GNUC_PATCHLEVEL__)
+# endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 404 && (__cplusplus >= 201103L || (defined(__GXX_EXPERIMENTAL_CXX0X__) && __GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 1
+#    else
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 406 && (__cplusplus >= 201103L || (defined(__GXX_EXPERIMENTAL_CXX0X__) && __GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_NOEXCEPT 1
+#    else
+#      define AF_COMPILER_CXX_NOEXCEPT 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 404 && (__cplusplus >= 201103L || (defined(__GXX_EXPERIMENTAL_CXX0X__) && __GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 1
+#    else
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 408 && __cplusplus >= 201103L
+#      define AF_COMPILER_CXX_ALIGNAS 1
+#    else
+#      define AF_COMPILER_CXX_ALIGNAS 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 404 && (__cplusplus >= 201103L || (defined(__GXX_EXPERIMENTAL_CXX0X__) && __GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_STATIC_ASSERT 1
+#    else
+#      define AF_COMPILER_CXX_STATIC_ASSERT 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 404 && (__cplusplus >= 201103L || (defined(__GXX_EXPERIMENTAL_CXX0X__) && __GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 1
+#    else
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 0
+#    endif
+
+#    if (__GNUC__ * 100 + __GNUC_MINOR__) >= 500 && __cplusplus >= 201402L
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 1
+#    else
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 0
+#    endif
+
+#  elif AF_COMPILER_IS_Intel
+
+#    if !(__INTEL_COMPILER >= 1210)
+#      error Unsupported compiler version
+#    endif
+
+  /* __INTEL_COMPILER = VRP prior to 2021, and then VVVV for 2021 and later,
+     except that a few beta releases use the old format with V=2021.  */
+# if __INTEL_COMPILER < 2021 || __INTEL_COMPILER == 202110 || __INTEL_COMPILER == 202111
+#  define AF_COMPILER_VERSION_MAJOR (__INTEL_COMPILER/100)
+#  define AF_COMPILER_VERSION_MINOR (__INTEL_COMPILER/10 % 10)
+#  if defined(__INTEL_COMPILER_UPDATE)
+#   define AF_COMPILER_VERSION_PATCH (__INTEL_COMPILER_UPDATE)
+#  else
+#   define AF_COMPILER_VERSION_PATCH (__INTEL_COMPILER   % 10)
+#  endif
+# else
+#  define AF_COMPILER_VERSION_MAJOR (__INTEL_COMPILER)
+#  define AF_COMPILER_VERSION_MINOR (__INTEL_COMPILER_UPDATE)
+   /* The third version component from --version is an update index,
+      but no macro is provided for it.  */
+#  define AF_COMPILER_VERSION_PATCH (0)
+# endif
+# if defined(__INTEL_COMPILER_BUILD_DATE)
+   /* __INTEL_COMPILER_BUILD_DATE = YYYYMMDD */
+#  define AF_COMPILER_VERSION_TWEAK (__INTEL_COMPILER_BUILD_DATE)
+# endif
+# if defined(_MSC_VER)
+   /* _MSC_VER = VVRR */
+#  define AF_SIMULATE_VERSION_MAJOR (_MSC_VER / 100)
+#  define AF_SIMULATE_VERSION_MINOR (_MSC_VER % 100)
+# endif
+# if defined(__GNUC__)
+#  define AF_SIMULATE_VERSION_MAJOR (__GNUC__)
+# elif defined(__GNUG__)
+#  define AF_SIMULATE_VERSION_MAJOR (__GNUG__)
+# endif
+# if defined(__GNUC_MINOR__)
+#  define AF_SIMULATE_VERSION_MINOR (__GNUC_MINOR__)
+# endif
+# if defined(__GNUC_PATCHLEVEL__)
+#  define AF_SIMULATE_VERSION_PATCH (__GNUC_PATCHLEVEL__)
+# endif
+
+#    if (__cpp_rvalue_references >= 200610 || __INTEL_COMPILER >= 1210) && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 1
+#    else
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 0
+#    endif
+
+#    if __INTEL_COMPILER >= 1400 && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_NOEXCEPT 1
+#    else
+#      define AF_COMPILER_CXX_NOEXCEPT 0
+#    endif
+
+#    if (__cpp_variadic_templates >= 200704 || __INTEL_COMPILER >= 1210) && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 1
+#    else
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 0
+#    endif
+
+#    if __INTEL_COMPILER >= 1500 && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_ALIGNAS 1
+#    else
+#      define AF_COMPILER_CXX_ALIGNAS 0
+#    endif
+
+#    if (__cpp_static_assert >= 200410 || __INTEL_COMPILER >= 1210) && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_STATIC_ASSERT 1
+#    else
+#      define AF_COMPILER_CXX_STATIC_ASSERT 0
+#    endif
+
+#    if __INTEL_COMPILER >= 1400 && ((__cplusplus >= 201103L) || defined(__INTEL_CXX11_MODE__) || defined(__GXX_EXPERIMENTAL_CXX0X__))
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 1
+#    else
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 0
+#    endif
+
+#    if __cpp_constexpr >= 201304 || (__INTEL_COMPILER >= 1700 && ((__cplusplus >= 201300L) || ((__cplusplus == 201103L) && !defined(__INTEL_CXX11_MODE__)) || ((((__INTEL_COMPILER == 1500) && (__INTEL_COMPILER_UPDATE == 1))) && defined(__GXX_EXPERIMENTAL_CXX0X__) && !defined(__INTEL_CXX11_MODE__) ) || (defined(__INTEL_CXX11_MODE__) && defined(__cpp_aggregate_nsdmi)) ) && !defined(_MSC_VER))
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 1
+#    else
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 0
+#    endif
+
+#  elif AF_COMPILER_IS_MSVC
+
+#    if !(_MSC_VER >= 1600)
+#      error Unsupported compiler version
+#    endif
+
+  /* _MSC_VER = VVRR */
+# define AF_COMPILER_VERSION_MAJOR (_MSC_VER / 100)
+# define AF_COMPILER_VERSION_MINOR (_MSC_VER % 100)
+# if defined(_MSC_FULL_VER)
+#  if _MSC_VER >= 1400
+    /* _MSC_FULL_VER = VVRRPPPPP */
+#   define AF_COMPILER_VERSION_PATCH (_MSC_FULL_VER % 100000)
+#  else
+    /* _MSC_FULL_VER = VVRRPPPP */
+#   define AF_COMPILER_VERSION_PATCH (_MSC_FULL_VER % 10000)
+#  endif
+# endif
+# if defined(_MSC_BUILD)
+#  define AF_COMPILER_VERSION_TWEAK (_MSC_BUILD)
+# endif
+
+#    if _MSC_VER >= 1600
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 1
+#    else
+#      define AF_COMPILER_CXX_RVALUE_REFERENCES 0
+#    endif
+
+#    if _MSC_VER >= 1900
+#      define AF_COMPILER_CXX_NOEXCEPT 1
+#    else
+#      define AF_COMPILER_CXX_NOEXCEPT 0
+#    endif
+
+#    if _MSC_VER >= 1800
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 1
+#    else
+#      define AF_COMPILER_CXX_VARIADIC_TEMPLATES 0
+#    endif
+
+#    if _MSC_VER >= 1900
+#      define AF_COMPILER_CXX_ALIGNAS 1
+#    else
+#      define AF_COMPILER_CXX_ALIGNAS 0
+#    endif
+
+#    if _MSC_VER >= 1600
+#      define AF_COMPILER_CXX_STATIC_ASSERT 1
+#    else
+#      define AF_COMPILER_CXX_STATIC_ASSERT 0
+#    endif
+
+#    if _MSC_FULL_VER >= 180030723
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 1
+#    else
+#      define AF_COMPILER_CXX_GENERALIZED_INITIALIZERS 0
+#    endif
+
+#    if _MSC_VER >= 1911
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 1
+#    else
+#      define AF_COMPILER_CXX_RELAXED_CONSTEXPR 0
+#    endif
+
+#  endif
+
+#  if defined(AF_COMPILER_CXX_NOEXCEPT) && AF_COMPILER_CXX_NOEXCEPT
+#    define AF_NOEXCEPT noexcept
+#    define AF_NOEXCEPT_EXPR(X) noexcept(X)
+#  else
+#    define AF_NOEXCEPT
+#    define AF_NOEXCEPT_EXPR(X)
+#  endif
+
+
+#  if defined(AF_COMPILER_CXX_ALIGNAS) && AF_COMPILER_CXX_ALIGNAS
+#    define AF_ALIGNAS(X) alignas(X)
+#  elif AF_COMPILER_IS_GNU || AF_COMPILER_IS_Clang || AF_COMPILER_IS_AppleClang
+#    define AF_ALIGNAS(X) __attribute__ ((__aligned__(X)))
+#  elif AF_COMPILER_IS_MSVC
+#    define AF_ALIGNAS(X) __declspec(align(X))
+#  else
+#    define AF_ALIGNAS(X)
+#  endif
+
+#  if defined(AF_COMPILER_CXX_STATIC_ASSERT) && AF_COMPILER_CXX_STATIC_ASSERT
+#    define AF_STATIC_ASSERT(X) static_assert(X, #X)
+#    define AF_STATIC_ASSERT_MSG(X, MSG) static_assert(X, MSG)
+#  else
+#    define AF_STATIC_ASSERT_JOIN(X, Y) AF_STATIC_ASSERT_JOIN_IMPL(X, Y)
+#    define AF_STATIC_ASSERT_JOIN_IMPL(X, Y) X##Y
+template<bool> struct AFStaticAssert;
+template<> struct AFStaticAssert<true>{};
+#    define AF_STATIC_ASSERT(X) enum { AF_STATIC_ASSERT_JOIN(AFStaticAssertEnum, __LINE__) = sizeof(AFStaticAssert<X>) }
+#    define AF_STATIC_ASSERT_MSG(X, MSG) enum { AF_STATIC_ASSERT_JOIN(AFStaticAssertEnum, __LINE__) = sizeof(AFStaticAssert<X>) }
+#  endif
+
+#endif
+
+  #if defined(AF_COMPILER_CXX_RELAXED_CONSTEXPR) && AF_COMPILER_CXX_RELAXED_CONSTEXPR
+  #define AF_CONSTEXPR constexpr
+  #else
+  #define AF_CONSTEXPR
+  #endif
+  #if defined(__cpp_if_constexpr) || __cplusplus >= 201606L
+  #define AF_IF_CONSTEXPR if constexpr
+  #else
+  #define AF_IF_CONSTEXPR if
+  #endif
+
+
+#endif
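Note on the `compilers.h` hunk above: the following is a minimal, self-contained sketch of two techniques the generated header relies on, namely the pre-C++11 static-assert fallback (an incomplete template whose `sizeof` fails to compile for `false`) and the `VVRR` decoding applied to `_MSC_VER`. Every name in it (`DEMO_JOIN`, `DEMO_STATIC_ASSERT`, `DemoStaticAssert`, `DemoVersion`, `demo_decode_vvrr`, `demo_self_check`) is a hypothetical stand-in chosen for illustration and is not part of ArrayFire or CMake. The condition is also wrapped in parentheses here, which the header itself does not do.

```cpp
#include <cassert>

// Hypothetical stand-ins; none of these names exist in compilers.h.
#define DEMO_JOIN_IMPL(X, Y) X##Y
#define DEMO_JOIN(X, Y) DEMO_JOIN_IMPL(X, Y)

// Only the <true> specialization is a complete type, so taking sizeof() of
// the <false> case fails to compile. That failure is the assertion.
template<bool> struct DemoStaticAssert;
template<> struct DemoStaticAssert<true> {};

// __LINE__ keeps the enumerator names distinct when the macro is used on
// different lines of the same scope.
#define DEMO_STATIC_ASSERT(X) \
    enum { DEMO_JOIN(DemoStaticAssertEnum, __LINE__) = sizeof(DemoStaticAssert<(X)>) }

DEMO_STATIC_ASSERT(sizeof(char) == 1);
DEMO_STATIC_ASSERT(sizeof(long) >= sizeof(int));

// Decode a VVRR-style version number (the layout the header documents for
// _MSC_VER) into its two components.
struct DemoVersion {
    int major_part;
    int minor_part;
};

inline DemoVersion demo_decode_vvrr(int vvrr) {
    DemoVersion v = { vvrr / 100, vvrr % 100 };
    return v;
}

// 1600 is the minimum _MSC_VER the header accepts for MSVC.
inline void demo_self_check() {
    DemoVersion v = demo_decode_vvrr(1600);
    assert(v.major_part == 16 && v.minor_part == 0);
}
```

Changing either `DEMO_STATIC_ASSERT` condition to something false should turn the translation unit into a compile error rather than a runtime failure, which is the property the fallback is meant to preserve on compilers without `static_assert`.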
diff --git a/CMakeModules/config_ccache.cmake b/CMakeModules/config_ccache.cmake
new file mode 100644
index 0000000000..04b3a97901
--- /dev/null
+++ b/CMakeModules/config_ccache.cmake
@@ -0,0 +1,41 @@
+# Adapted from https://crascit.com/2016/04/09/using-ccache-with-cmake/
+
+find_program(CCACHE_PROGRAM ccache)
+
+set(CCACHE_FOUND OFF)
+if(CCACHE_PROGRAM)
+  set(CCACHE_FOUND ON)
+endif()
+
+option(AF_USE_CCACHE "Use ccache when compiling" ${CCACHE_FOUND})
+
+if(AF_USE_CCACHE AND CCACHE_PROGRAM)
+  message(STATUS "ccache FOUND: ${CCACHE_PROGRAM}")
+  # Set up wrapper scripts
+  set(C_LAUNCHER   "${CCACHE_PROGRAM}")
+  set(CXX_LAUNCHER "${CCACHE_PROGRAM}")
+  set(NVCC_LAUNCHER "${CCACHE_PROGRAM}")
+  configure_file(${ArrayFire_SOURCE_DIR}/CMakeModules/launch-c.in   launch-c)
+  configure_file(${ArrayFire_SOURCE_DIR}/CMakeModules/launch-cxx.in launch-cxx)
+  configure_file(${ArrayFire_SOURCE_DIR}/CMakeModules/launch-nvcc.in launch-nvcc)
+  execute_process(COMMAND chmod a+rx
+      "${ArrayFire_BINARY_DIR}/launch-c"
+      "${ArrayFire_BINARY_DIR}/launch-cxx"
+      "${ArrayFire_BINARY_DIR}/launch-nvcc"
+    )
+  if(CMAKE_GENERATOR STREQUAL "Xcode")
+    # Set Xcode project attributes to route compilation and linking
+    # through our scripts
+    set(CMAKE_XCODE_ATTRIBUTE_CC         "${ArrayFire_BINARY_DIR}/launch-c")
+    set(CMAKE_XCODE_ATTRIBUTE_CXX        "${ArrayFire_BINARY_DIR}/launch-cxx")
+    set(CMAKE_XCODE_ATTRIBUTE_LD         "${ArrayFire_BINARY_DIR}/launch-c")
+    set(CMAKE_XCODE_ATTRIBUTE_LDPLUSPLUS "${ArrayFire_BINARY_DIR}/launch-cxx")
+  else()
+    # Support Unix Makefiles and Ninja
+    set(CMAKE_C_COMPILER_LAUNCHER   "${CCACHE_PROGRAM}")
+    set(CMAKE_CXX_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
+    set(CMAKE_CUDA_COMPILER_LAUNCHER "${CCACHE_PROGRAM}")
+  endif()
+endif()
+mark_as_advanced(CCACHE_PROGRAM)
+mark_as_advanced(AF_USE_CCACHE)
diff --git a/CMakeModules/cuda_compute_capability.c b/CMakeModules/cuda_compute_capability.c
deleted file mode 100644
index bc17a40956..0000000000
--- a/CMakeModules/cuda_compute_capability.c
+++ /dev/null
@@ -1,52 +0,0 @@
-/*
-* Copyright (C) 2011 Florian Rathgeber, florian.rathgeber@gmail.com
-*
-* This code is licensed under the MIT License.  See the FindCUDA.cmake script
-* for the text of the license.
-*
-* Based on code by Christopher Bruns published on Stack Overflow (CC-BY):
-* http://stackoverflow.com/questions/2285185
-*/
-
-#include <stdio.h>
-#include <cuda_runtime.h>
-
-int main() {
-    int deviceCount, device, major = 9999, minor = 9999;
-    int gpuDeviceCount = 0;
-    struct cudaDeviceProp properties;
-
-    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess)
-    {
-        printf("Couldn't get device count: %s\n", cudaGetErrorString(cudaGetLastError()));
-        return 1;
-    }
-    /* machines with no GPUs can still report one emulation device */
-    for (device = 0; device < deviceCount; ++device) {
-        cudaGetDeviceProperties(&properties, device);
-        if (properties.major != 9999) {/* 9999 means emulation only */
-            ++gpuDeviceCount;
-            /*  get minimum compute capability of all devices */
-            if (major > properties.major) {
-                major = properties.major;
-                minor = properties.minor;
-            } else if (minor > properties.minor) {
-                minor = properties.minor;
-            }
-        }
-    }
-
-    /* don't just return the number of gpus, because other runtime cuda
-    errors can also yield non-zero return values */
-    if (gpuDeviceCount > 0) {
-        if ((major == 2 && minor == 1))
-        {
-            // There is no --arch compute_21 flag for nvcc, so force minor to 0
-            minor = 0;
-        }
-        /* this output will be parsed by FindCUDA.cmake */
-        printf("%d%d", major, minor);
-        return 0; /* success */
-    }
-    return 1; /* failure */
-}
diff --git a/CMakeModules/debian/postinst b/CMakeModules/debian/postinst
new file mode 100644
index 0000000000..093371bd32
--- /dev/null
+++ b/CMakeModules/debian/postinst
@@ -0,0 +1,9 @@
+#!/bin/sh
+
+set -e
+
+if [ "$1" = "configure" ]; then
+    echo "/opt/intel/compilers_and_libraries/linux/mkl/lib/intel64_lin" >> /etc/ld.so.conf.d/99_arrayfire_${RC_COMPONENT}.conf
+    echo "/usr/local/cuda-${CPACK_CUDA_VERSION_MAJOR}.${CPACK_CUDA_VERSION_MINOR}/lib64" >> /etc/ld.so.conf.d/99_arrayfire_${RC_COMPONENT}.conf
+    ldconfig
+fi
diff --git a/CMakeModules/examples.dox.in b/CMakeModules/examples.dox.in
new file mode 100644
index 0000000000..dfad2dbb50
--- /dev/null
+++ b/CMakeModules/examples.dox.in
@@ -0,0 +1,3 @@
+/**
+@EXAMPLES_LIST@
+*/
diff --git a/CMakeModules/generate_product_version.cmake b/CMakeModules/generate_product_version.cmake
new file mode 100644
index 0000000000..6f4aae1da0
--- /dev/null
+++ b/CMakeModules/generate_product_version.cmake
@@ -0,0 +1,45 @@
+function(generate_product_version outfile)
+  set(options)
+  set(oneValueArgs
+    COMPANY_NAME
+    FILE_DESCRIPTION
+    FILE_NAME
+    ORIGINAL_FILE_NAME
+    COMPANY_COPYRIGHT
+  )
+  set(multiValueArgs)
+  cmake_parse_arguments(PRODUCT "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
+
+  if(NOT PRODUCT_COMPANY_NAME OR "${PRODUCT_COMPANY_NAME}" STREQUAL "")
+    set(PRODUCT_COMPANY_NAME "ArrayFire")
+  endif()
+  if(NOT PRODUCT_FILE_DESCRIPTION OR "${PRODUCT_FILE_DESCRIPTION}" STREQUAL "")
+    set(PRODUCT_FILE_DESCRIPTION "ArrayFire Library")
+  endif()
+  if(NOT PRODUCT_FILE_NAME OR "${PRODUCT_FILE_NAME}" STREQUAL "")
+    set(PRODUCT_FILE_NAME "${PROJECT_NAME}")
+  endif()
+  if(NOT PRODUCT_ORIGINAL_FILE_NAME OR "${PRODUCT_ORIGINAL_FILE_NAME}" STREQUAL "")
+    set(PRODUCT_ORIGINAL_FILE_NAME "${PRODUCT_FILE_NAME}")
+  endif()
+  if(NOT PRODUCT_FILE_DESCRIPTION OR "${PRODUCT_FILE_DESCRIPTION}" STREQUAL "")
+    set(PRODUCT_FILE_DESCRIPTION "${PRODUCT_FILE_NAME}")
+  endif()
+  if(NOT PRODUCT_COMPANY_COPYRIGHT OR "${PRODUCT_COMPANY_COPYRIGHT}" STREQUAL "")
+    string(TIMESTAMP PRODUCT_CURRENT_YEAR "%Y")
+    set(PRODUCT_COMPANY_COPYRIGHT "${PRODUCT_COMPANY_NAME} (C) Copyright ${PRODUCT_CURRENT_YEAR}")
+  endif()
+
+  set(PRODUCT_VERSION ${PROJECT_VERSION})
+  set(PRODUCT_VERSION_MAJOR ${PROJECT_VERSION_MAJOR})
+  set(PRODUCT_VERSION_MINOR ${PROJECT_VERSION_MINOR})
+  set(PRODUCT_VERSION_PATCH ${PROJECT_VERSION_PATCH})
+  set(PRODUCT_INTERNAL_FILE_NAME ${PRODUCT_ORIGINAL_FILE_NAME})
+
+  set(ver_res_file "${PROJECT_BINARY_DIR}/${PRODUCT_FILE_NAME}_version_info.rc")
+  configure_file(
+    ${PROJECT_SOURCE_DIR}/CMakeModules/version_info.rc.in
+    ${ver_res_file}
+  )
+  set(${outfile} ${ver_res_file} PARENT_SCOPE)
+endfunction()
diff --git a/CMakeModules/launch-c.in b/CMakeModules/launch-c.in
new file mode 100644
index 0000000000..6c6c9180bc
--- /dev/null
+++ b/CMakeModules/launch-c.in
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+# The Xcode generator doesn't pass the compiler as the
+# first argument; Ninja and Makefiles do. Handle both cases.
+if [ "$1" = "${CMAKE_C_COMPILER}" ] ; then
+    shift
+fi
+
+export CCACHE_CPP2=true
+exec "${C_LAUNCHER}" "${CMAKE_C_COMPILER}" "$@"
diff --git a/CMakeModules/launch-cxx.in b/CMakeModules/launch-cxx.in
new file mode 100644
index 0000000000..fa541fee0b
--- /dev/null
+++ b/CMakeModules/launch-cxx.in
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+# The Xcode generator doesn't pass the compiler as the
+# first argument; Ninja and Makefiles do. Handle both cases.
+if [ "$1" = "${CMAKE_CXX_COMPILER}" ] ; then
+    shift
+fi
+
+export CCACHE_CPP2=true
+exec "${CXX_LAUNCHER}" "${CMAKE_CXX_COMPILER}" "$@"
diff --git a/CMakeModules/launch-nvcc.in b/CMakeModules/launch-nvcc.in
new file mode 100644
index 0000000000..47a4591850
--- /dev/null
+++ b/CMakeModules/launch-nvcc.in
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+# The Xcode generator doesn't pass the compiler as the
+# first argument; Ninja and Makefiles do. Handle both cases.
+if [ "$1" = "${CUDA_NVCC_EXECUTABLE}" ] ; then
+    shift
+fi
+
+export CCACHE_CPP2=true
+exec "${NVCC_LAUNCHER}" "${CUDA_NVCC_EXECUTABLE}" "$@"
diff --git a/CMakeModules/nsis/NSIS.InstallOptions.ini.in b/CMakeModules/nsis/NSIS.InstallOptions.ini.in
new file mode 100644
index 0000000000..cc17d8268a
--- /dev/null
+++ b/CMakeModules/nsis/NSIS.InstallOptions.ini.in
@@ -0,0 +1,46 @@
+[Settings]
+NumFields=5
+
+[Field 1]
+Type=label
+Text=By default @CPACK_PACKAGE_INSTALL_DIRECTORY@ will add its directory to the system PATH. This will make the dynamic libraries available to all users and software on the system.
+Left=0
+Right=-1
+Top=0
+Bottom=20
+
+[Field 2]
+Type=radiobutton
+Text=Do not add @CPACK_PACKAGE_NAME@ to the system PATH
+Left=0
+Right=-1
+Top=30
+Bottom=40
+State=0
+
+[Field 3]
+Type=radiobutton
+Text=Add @CPACK_PACKAGE_NAME@ to the system PATH for all users
+Left=0
+Right=-1
+Top=40
+Bottom=50
+State=1
+
+[Field 4]
+Type=radiobutton
+Text=Add @CPACK_PACKAGE_NAME@ to the system PATH for current user
+Left=0
+Right=-1
+Top=50
+Bottom=60
+State=0
+
+[Field 5]
+Type=CheckBox
+Text=Create @CPACK_PACKAGE_NAME@ Desktop Icon
+Left=0
+Right=-1
+Top=80
+Bottom=90
+State=0
diff --git a/CMakeModules/nsis/NSIS.definitions.nsh.in b/CMakeModules/nsis/NSIS.definitions.nsh.in
new file mode 100644
index 0000000000..1062271940
--- /dev/null
+++ b/CMakeModules/nsis/NSIS.definitions.nsh.in
@@ -0,0 +1,35 @@
+!define MUI_WELCOMEPAGE_TITLE '${CPACK_PACKAGE_NAME} ${CPACK_PACKAGE_VERSION} Installer'
+!define MUI_WELCOMEPAGE_TITLE_3LINES
+!define MUI_WELCOMEPAGE_TEXT    \
+"ArrayFire is a high performance software library for parallel computing with an easy-to-use API.\r\n\r\n\
+Its array based function set makes parallel programming simple.\r\n\r\n\
+ArrayFire's multiple backends (CUDA, OneAPI, OpenCL, and native CPU) make it platform independent and highly portable.\r\n\r\n\
+A few lines of code in ArrayFire can replace dozens of lines of parallel compute code, \
+saving you valuable time and lowering development costs.\r\n\r\n\
+Follow these steps to install the ArrayFire libraries."
+
+!define MUI_ICON "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@.ico"
+!define MUI_UNICON "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@.ico"
+
+!define MUI_WELCOMEFINISHPAGE_BITMAP "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@_sym.bmp"
+!define MUI_UNWELCOMEFINISHPAGE_BITMAP "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@_sym.bmp"
+!define MUI_WELCOMEFINISHPAGE_BITMAP_NOSTRETCH
+!define MUI_UNWELCOMEFINISHPAGE_BITMAP_NOSTRETCH
+
+!define MUI_HEADERIMAGE
+!define MUI_HEADERIMAGE_RIGHT
+!define MUI_HEADERIMAGE_BITMAP "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@_logo.bmp"
+!define MUI_HEADERIMAGE_UNBITMAP "@CPACK_AF_ASSETS_DIR@@APP_LOW_NAME@_logo.bmp"
+!define MUI_HEADERIMAGE_BITMAP_NOSTRETCH
+!define MUI_HEADERIMAGE_UNBITMAP_NOSTRETCH
+!define MUI_ABORTWARNING
+
+
+; Defines for Finish Page
+!define MUI_FINISHPAGE_RUN "explorer.exe"
+!define MUI_FINISHPAGE_RUN_PARAMETERS "$INSTDIR"
+!define MUI_FINISHPAGE_RUN_TEXT "Open ArrayFire install folder."
+!define MUI_FINISHPAGE_SHOWREADME "https://arrayfire.com/docs/using_on_windows.htm"
+!define MUI_FINISHPAGE_SHOWREADME_TEXT "Open ArrayFire documentation on the web."
+!define MUI_FINISHPAGE_LINK "ArrayFire Support and Services"
+!define MUI_FINISHPAGE_LINK_LOCATION "https://arrayfire.com/consulting/"
diff --git a/CMakeModules/nsis/NSIS.template.in b/CMakeModules/nsis/NSIS.template.in
new file mode 100644
index 0000000000..c46274518c
--- /dev/null
+++ b/CMakeModules/nsis/NSIS.template.in
@@ -0,0 +1,1009 @@
+; CPack install script designed for a nmake build
+
+;--------------------------------
+; You must define these values
+
+  !define VERSION "@CPACK_PACKAGE_VERSION@"
+  !define PATCH  "@CPACK_PACKAGE_VERSION_PATCH@"
+  !define INST_DIR "@CPACK_TEMPORARY_DIRECTORY@"
+
+;--------------------------------
+;Variables
+
+  Var MUI_TEMP
+  Var STARTMENU_FOLDER
+  Var SV_ALLUSERS
+  Var START_MENU
+  Var DO_NOT_ADD_TO_PATH
+  Var ADD_TO_PATH_ALL_USERS
+  Var ADD_TO_PATH_CURRENT_USER
+  Var INSTALL_DESKTOP
+  Var IS_DEFAULT_INSTALLDIR
+;--------------------------------
+;Include Modern UI
+
+  !include "..\..\..\NSIS.definitions.nsh"
+  !include "InstallOptions.nsh"
+  !include "MUI.nsh"
+
+  ;Default installation folder
+  InstallDir "@CPACK_NSIS_INSTALL_ROOT@\@CPACK_PACKAGE_INSTALL_DIRECTORY@\v@CPACK_PACKAGE_VERSION_MAJOR@"
+
+  !define env_af_hklm 'HKLM "SYSTEM\CurrentControlSet\Control\Session Manager\Environment"'
+
+  !define cmake_pkg_reg_key 'HKCU "Software\Kitware\CMake\Packages\ArrayFire"'
+
+;--------------------------------
+;General
+
+  ;Name and file
+  Name "@CPACK_NSIS_PACKAGE_NAME@"
+  OutFile "@CPACK_TOPLEVEL_DIRECTORY@/@CPACK_OUTPUT_FILE_NAME@"
+
+  ;Set compression
+  SetCompressor @CPACK_NSIS_COMPRESSOR@
+
+  ;Require administrator access
+  RequestExecutionLevel admin
+
+@CPACK_NSIS_DEFINES@
+
+  !include Sections.nsh
+
+;--- Component support macros: ---
+; The code for the add/remove functionality is from:
+;   http://nsis.sourceforge.net/Add/Remove_Functionality
+; It has been modified slightly and extended to provide
+; inter-component dependencies.
+Var AR_SecFlags
+Var AR_RegFlags
+@CPACK_NSIS_SECTION_SELECTED_VARS@
+
+; Loads the "selected" flag for the section named SecName into the
+; variable VarName.
+!macro LoadSectionSelectedIntoVar SecName VarName
+ SectionGetFlags ${${SecName}} $${VarName}
+ IntOp $${VarName} $${VarName} & ${SF_SELECTED}  ;Turn off all other bits
+!macroend
+
+; Loads the value of a variable... can we get around this?
+!macro LoadVar VarName
+  IntOp $R0 0 + $${VarName}
+!macroend
+
+; Sets the value of a variable
+!macro StoreVar VarName IntValue
+  IntOp $${VarName} 0 + ${IntValue}
+!macroend
+
+!macro InitSection SecName
+  ;  This macro reads the component's installed flag from the registry and
+  ;changes the checked state of the section on the components page.
+  ;Input: section index constant name specified in the Section command.
+
+  ClearErrors
+  ;Reading component status from registry
+  ReadRegDWORD $AR_RegFlags HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@\Components\${SecName}" "Installed"
+  IfErrors "default_${SecName}"
+    ;Status will stay default if registry value not found
+    ;(component was never installed)
+  IntOp $AR_RegFlags $AR_RegFlags & ${SF_SELECTED} ;Turn off all other bits
+  SectionGetFlags ${${SecName}} $AR_SecFlags  ;Reading default section flags
+  IntOp $AR_SecFlags $AR_SecFlags & 0xFFFE  ;Turn lowest (enabled) bit off
+  IntOp $AR_SecFlags $AR_RegFlags | $AR_SecFlags      ;Change lowest bit
+
+  ; Note whether this component was installed before
+  !insertmacro StoreVar ${SecName}_was_installed $AR_RegFlags
+  IntOp $R0 $AR_RegFlags & $AR_RegFlags
+
+  ;Writing modified flags
+  SectionSetFlags ${${SecName}} $AR_SecFlags
+
+ "default_${SecName}:"
+ !insertmacro LoadSectionSelectedIntoVar ${SecName} ${SecName}_selected
+!macroend
+
+!macro FinishSection SecName
+  ;  This macro reads the section flag set by the user and removes the
+  ;section if it is not selected.
+  ;Then it writes the component's installed flag to the registry.
+  ;Input: section index constant name specified in the Section command.
+
+  SectionGetFlags ${${SecName}} $AR_SecFlags  ;Reading section flags
+  ;Checking lowest bit:
+  IntOp $AR_SecFlags $AR_SecFlags & ${SF_SELECTED}
+  IntCmp $AR_SecFlags 1 "leave_${SecName}"
+    ;Section is not selected:
+    ;Calling Section uninstall macro and writing zero installed flag
+    !insertmacro "Remove_${${SecName}}"
+    WriteRegDWORD HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@\Components\${SecName}" \
+  "Installed" 0
+    Goto "exit_${SecName}"
+
+ "leave_${SecName}:"
+    ;Section is selected:
+    WriteRegDWORD HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@\Components\${SecName}" \
+  "Installed" 1
+
+ "exit_${SecName}:"
+!macroend
+
+!macro RemoveSection_CPack SecName
+  ;  This macro is used to call section's Remove_... macro
+  ;from the uninstaller.
+  ;Input: section index constant name specified in Section command.
+
+  !insertmacro "Remove_${${SecName}}"
+!macroend
+
+; Determine whether the selection of SecName changed
+!macro MaybeSelectionChanged SecName
+  !insertmacro LoadVar ${SecName}_selected
+  SectionGetFlags ${${SecName}} $R1
+  IntOp $R1 $R1 & ${SF_SELECTED} ;Turn off all other bits
+
+  ; See if the status has changed:
+  IntCmp $R0 $R1 "${SecName}_unchanged"
+  !insertmacro LoadSectionSelectedIntoVar ${SecName} ${SecName}_selected
+
+  IntCmp $R1 ${SF_SELECTED} "${SecName}_was_selected"
+  !insertmacro "Deselect_required_by_${SecName}"
+  goto "${SecName}_unchanged"
+
+  "${SecName}_was_selected:"
+  !insertmacro "Select_${SecName}_depends"
+
+  "${SecName}_unchanged:"
+!macroend
+;--- End of Add/Remove macros ---
+
+;--------------------------------
+;Interface Settings
+
+  ;Below two are defined in custom nsh file
+  ;!define MUI_HEADERIMAGE
+  ;!define MUI_ABORTWARNING
+
+;----------------------------------------
+; based upon a script of "Written by KiCHiK 2003-01-18 05:57:02"
+;----------------------------------------
+!verbose 3
+!include "WinMessages.NSH"
+!verbose 4
+;====================================================
+; select_NT_profile
+;     Returns: the selected environment
+;     Output : head of the stack
+;====================================================
+!macro select_NT_profile UN
+Function ${UN}select_NT_profile
+   StrCmp $ADD_TO_PATH_ALL_USERS "1" 0 environment_single
+      DetailPrint "Selected environment for all users"
+      Push "all"
+      Return
+   environment_single:
+      DetailPrint "Selected environment for current user only."
+      Push "current"
+      Return
+FunctionEnd
+!macroend
+!insertmacro select_NT_profile ""
+!insertmacro select_NT_profile "un."
+;----------------------------------------------------
+!define NT_current_env 'HKCU "Environment"'
+!define NT_all_env     'HKLM "SYSTEM\CurrentControlSet\Control\Session Manager\Environment"'
+
+!ifndef WriteEnvStr_RegKey
+  !ifdef ALL_USERS
+    !define WriteEnvStr_RegKey \
+       'HKLM "SYSTEM\CurrentControlSet\Control\Session Manager\Environment"'
+  !else
+    !define WriteEnvStr_RegKey 'HKCU "Environment"'
+  !endif
+!endif
+
+; AddToPath - Adds the given dir to the search path.
+;        Input - head of the stack
+;        Note - Win9x systems require a reboot
+
+Function AddToPath
+  Exch $0
+  Push $1
+  Push $2
+  Push $3
+
+  # don't add if the path doesn't exist
+  IfFileExists "$0\*.*" "" AddToPath_done
+
+  ReadEnvStr $1 PATH
+  ; if the path is too long for a NSIS variable NSIS will return a 0
+  ; length string.  If we find that, then warn and skip any path
+  ; modification as it will trash the existing path.
+  StrLen $2 $1
+  IntCmp $2 0 CheckPathLength_ShowPathWarning CheckPathLength_Done CheckPathLength_Done
+    CheckPathLength_ShowPathWarning:
+    Messagebox MB_OK|MB_ICONEXCLAMATION "Warning! PATH is too long; the installer is unable to modify PATH!"
+    Goto AddToPath_done
+  CheckPathLength_Done:
+  Push "$1;"
+  Push "$0;"
+  Call StrStr
+  Pop $2
+  StrCmp $2 "" "" AddToPath_done
+  Push "$1;"
+  Push "$0\;"
+  Call StrStr
+  Pop $2
+  StrCmp $2 "" "" AddToPath_done
+  GetFullPathName /SHORT $3 $0
+  Push "$1;"
+  Push "$3;"
+  Call StrStr
+  Pop $2
+  StrCmp $2 "" "" AddToPath_done
+  Push "$1;"
+  Push "$3\;"
+  Call StrStr
+  Pop $2
+  StrCmp $2 "" "" AddToPath_done
+
+  Call IsNT
+  Pop $1
+  StrCmp $1 1 AddToPath_NT
+    ; Not on NT
+    StrCpy $1 $WINDIR 2
+    FileOpen $1 "$1\autoexec.bat" a
+    FileSeek $1 -1 END
+    FileReadByte $1 $2
+    IntCmp $2 26 0 +2 +2 # DOS EOF
+      FileSeek $1 -1 END # write over EOF
+    FileWrite $1 "$\r$\nSET PATH=%PATH%;$3$\r$\n"
+    FileClose $1
+    SetRebootFlag true
+    Goto AddToPath_done
+
+  AddToPath_NT:
+    StrCmp $ADD_TO_PATH_ALL_USERS "1" ReadAllKey
+      ReadRegStr $1 ${NT_current_env} "PATH"
+      Goto DoTrim
+    ReadAllKey:
+      ReadRegStr $1 ${NT_all_env} "PATH"
+    DoTrim:
+    StrCmp $1 "" AddToPath_NTdoIt
+      Push $1
+      Call Trim
+      Pop $1
+      StrCpy $0 "$1;$0"
+    AddToPath_NTdoIt:
+      StrCmp $ADD_TO_PATH_ALL_USERS "1" WriteAllKey
+        WriteRegExpandStr ${NT_current_env} "PATH" $0
+        Goto DoSend
+      WriteAllKey:
+        WriteRegExpandStr ${NT_all_env} "PATH" $0
+      DoSend:
+      SendMessage ${HWND_BROADCAST} ${WM_WININICHANGE} 0 "STR:Environment" /TIMEOUT=5000
+
+  AddToPath_done:
+    Pop $3
+    Pop $2
+    Pop $1
+    Pop $0
+FunctionEnd
+
+
+; RemoveFromPath - Remove a given dir from the path
+;     Input: head of the stack
+
+Function un.RemoveFromPath
+  Exch $0
+  Push $1
+  Push $2
+  Push $3
+  Push $4
+  Push $5
+  Push $6
+
+  IntFmt $6 "%c" 26 # DOS EOF
+
+  Call un.IsNT
+  Pop $1
+  StrCmp $1 1 unRemoveFromPath_NT
+    ; Not on NT
+    StrCpy $1 $WINDIR 2
+    FileOpen $1 "$1\autoexec.bat" r
+    GetTempFileName $4
+    FileOpen $2 $4 w
+    GetFullPathName /SHORT $0 $0
+    StrCpy $0 "SET PATH=%PATH%;$0"
+    Goto unRemoveFromPath_dosLoop
+
+    unRemoveFromPath_dosLoop:
+      FileRead $1 $3
+      StrCpy $5 $3 1 -1 # read last char
+      StrCmp $5 $6 0 +2 # if DOS EOF
+        StrCpy $3 $3 -1 # remove DOS EOF so we can compare
+      StrCmp $3 "$0$\r$\n" unRemoveFromPath_dosLoopRemoveLine
+      StrCmp $3 "$0$\n" unRemoveFromPath_dosLoopRemoveLine
+      StrCmp $3 "$0" unRemoveFromPath_dosLoopRemoveLine
+      StrCmp $3 "" unRemoveFromPath_dosLoopEnd
+      FileWrite $2 $3
+      Goto unRemoveFromPath_dosLoop
+      unRemoveFromPath_dosLoopRemoveLine:
+        SetRebootFlag true
+        Goto unRemoveFromPath_dosLoop
+
+    unRemoveFromPath_dosLoopEnd:
+      FileClose $2
+      FileClose $1
+      StrCpy $1 $WINDIR 2
+      Delete "$1\autoexec.bat"
+      CopyFiles /SILENT $4 "$1\autoexec.bat"
+      Delete $4
+      Goto unRemoveFromPath_done
+
+  unRemoveFromPath_NT:
+    StrCmp $ADD_TO_PATH_ALL_USERS "1" unReadAllKey
+      ReadRegStr $1 ${NT_current_env} "PATH"
+      Goto unDoTrim
+    unReadAllKey:
+      ReadRegStr $1 ${NT_all_env} "PATH"
+    unDoTrim:
+    StrCpy $5 $1 1 -1 # copy last char
+    StrCmp $5 ";" +2 # if last char != ;
+      StrCpy $1 "$1;" # append ;
+    Push $1
+    Push "$0;"
+    Call un.StrStr ; Find `$0;` in $1
+    Pop $2 ; pos of our dir
+    StrCmp $2 "" unRemoveFromPath_done
+      ; else, it is in path
+      # $0 - dir to remove
+      # $1 - path var
+      StrLen $3 "$0;"
+      StrLen $4 $2
+      StrCpy $5 $1 -$4 # $5 is now the part before the path to remove
+      StrCpy $6 $2 "" $3 # $6 is now the part after the path to remove
+      StrCpy $3 $5$6
+
+      StrCpy $5 $3 1 -1 # copy last char
+      StrCmp $5 ";" 0 +2 # if last char == ;
+        StrCpy $3 $3 -1 # remove last char
+
+      StrCmp $ADD_TO_PATH_ALL_USERS "1" unWriteAllKey
+        WriteRegExpandStr ${NT_current_env} "PATH" $3
+        Goto unDoSend
+      unWriteAllKey:
+        WriteRegExpandStr ${NT_all_env} "PATH" $3
+      unDoSend:
+      SendMessage ${HWND_BROADCAST} ${WM_WININICHANGE} 0 "STR:Environment" /TIMEOUT=5000
+
+  unRemoveFromPath_done:
+    Pop $6
+    Pop $5
+    Pop $4
+    Pop $3
+    Pop $2
+    Pop $1
+    Pop $0
+FunctionEnd
+
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+; Uninstall stuff
+;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+###########################################
+#            Utility Functions            #
+###########################################
+
+;====================================================
+; IsNT - Returns 1 if the current system is NT, 0
+;        otherwise.
+;     Output: head of the stack
+;====================================================
+; IsNT
+; no input
+; output, top of the stack = 1 if NT or 0 if not
+;
+; Usage:
+;   Call IsNT
+;   Pop $R0
+;  ($R0 at this point is 1 or 0)
+
+!macro IsNT un
+Function ${un}IsNT
+  Push $0
+  ReadRegStr $0 HKLM "SOFTWARE\Microsoft\Windows NT\CurrentVersion" CurrentVersion
+  StrCmp $0 "" 0 IsNT_yes
+  ; we are not NT.
+  Pop $0
+  Push 0
+  Return
+
+  IsNT_yes:
+    ; NT!!!
+    Pop $0
+    Push 1
+FunctionEnd
+!macroend
+!insertmacro IsNT ""
+!insertmacro IsNT "un."
+
+; StrStr
+; input, top of stack = string to search for
+;        top of stack-1 = string to search in
+; output, top of stack (replaces with the portion of the string remaining)
+; modifies no other variables.
+;
+; Usage:
+;   Push "this is a long ass string"
+;   Push "ass"
+;   Call StrStr
+;   Pop $R0
+;  ($R0 at this point is "ass string")
+
+!macro StrStr un
+Function ${un}StrStr
+Exch $R1 ; st=haystack,old$R1, $R1=needle
+  Exch    ; st=old$R1,haystack
+  Exch $R2 ; st=old$R1,old$R2, $R2=haystack
+  Push $R3
+  Push $R4
+  Push $R5
+  StrLen $R3 $R1
+  StrCpy $R4 0
+  ; $R1=needle
+  ; $R2=haystack
+  ; $R3=len(needle)
+  ; $R4=cnt
+  ; $R5=tmp
+  loop:
+    StrCpy $R5 $R2 $R3 $R4
+    StrCmp $R5 $R1 done
+    StrCmp $R5 "" done
+    IntOp $R4 $R4 + 1
+    Goto loop
+done:
+  StrCpy $R1 $R2 "" $R4
+  Pop $R5
+  Pop $R4
+  Pop $R3
+  Pop $R2
+  Exch $R1
+FunctionEnd
+!macroend
+!insertmacro StrStr ""
+!insertmacro StrStr "un."
+
+Function Trim ; Added by Pelaca
+	Exch $R1
+	Push $R2
+Loop:
+	StrCpy $R2 "$R1" 1 -1
+	StrCmp "$R2" " " RTrim
+	StrCmp "$R2" "$\n" RTrim
+	StrCmp "$R2" "$\r" RTrim
+	StrCmp "$R2" ";" RTrim
+	GoTo Done
+RTrim:
+	StrCpy $R1 "$R1" -1
+	Goto Loop
+Done:
+	Pop $R2
+	Exch $R1
+FunctionEnd
+
+Function ConditionalAddToRegisty
+  Pop $0
+  Pop $1
+  StrCmp "$0" "" ConditionalAddToRegisty_EmptyString
+    WriteRegStr SHCTX "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" \
+    "$1" "$0"
+    ;MessageBox MB_OK "Set Registry: '$1' to '$0'"
+    DetailPrint "Set install registry entry: '$1' to '$0'"
+  ConditionalAddToRegisty_EmptyString:
+FunctionEnd
+
+;--------------------------------
+
+!ifdef CPACK_USES_DOWNLOAD
+Function DownloadFile
+    IfFileExists $INSTDIR\* +2
+    CreateDirectory $INSTDIR
+    Pop $0
+
+    ; Skip if already downloaded
+    IfFileExists $INSTDIR\$0 0 +2
+    Return
+
+    StrCpy $1 "@CPACK_DOWNLOAD_SITE@"
+
+  try_again:
+    NSISdl::download "$1/$0" "$INSTDIR\$0"
+
+    Pop $1
+    StrCmp $1 "success" success
+    StrCmp $1 "Cancelled" cancel
+    MessageBox MB_OK "Download failed: $1"
+  cancel:
+    Return
+  success:
+FunctionEnd
+!endif
+
+;--------------------------------
+; Installation types
+@CPACK_NSIS_INSTALLATION_TYPES@
+
+;--------------------------------
+; Component sections
+@CPACK_NSIS_COMPONENT_SECTIONS@
+
+;--------------------------------
+; Define some macro settings for the GUI
+@CPACK_NSIS_INSTALLER_MUI_ICON_CODE@
+@CPACK_NSIS_INSTALLER_ICON_CODE@
+@CPACK_NSIS_INSTALLER_MUI_WELCOMEFINISH_CODE@
+@CPACK_NSIS_INSTALLER_MUI_UNWELCOMEFINISH_CODE@
+@CPACK_NSIS_INSTALLER_MUI_COMPONENTS_DESC@
+@CPACK_NSIS_INSTALLER_MUI_FINISHPAGE_RUN_CODE@
+
+;--------------------------------
+;Pages
+  !insertmacro MUI_PAGE_WELCOME
+
+  !insertmacro MUI_PAGE_LICENSE "@CPACK_RESOURCE_FILE_LICENSE@"
+  Page custom InstallOptionsPage
+  !insertmacro MUI_PAGE_DIRECTORY
+
+  ;Start Menu Folder Page Configuration
+  !define MUI_STARTMENUPAGE_REGISTRY_ROOT "SHCTX"
+  !define MUI_STARTMENUPAGE_REGISTRY_KEY "Software\@CPACK_PACKAGE_VENDOR@\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@"
+  !define MUI_STARTMENUPAGE_REGISTRY_VALUENAME "Start Menu Folder"
+  !insertmacro MUI_PAGE_STARTMENU Application $STARTMENU_FOLDER
+
+  @CPACK_NSIS_PAGE_COMPONENTS@
+
+  !insertmacro MUI_PAGE_INSTFILES
+  !insertmacro MUI_PAGE_FINISH
+
+  !insertmacro MUI_UNPAGE_CONFIRM
+  !insertmacro MUI_UNPAGE_INSTFILES
+
+;--------------------------------
+;Languages
+
+  !insertmacro MUI_LANGUAGE "English" ;first language is the default language
+  !insertmacro MUI_LANGUAGE "Albanian"
+  !insertmacro MUI_LANGUAGE "Arabic"
+  !insertmacro MUI_LANGUAGE "Basque"
+  !insertmacro MUI_LANGUAGE "Belarusian"
+  !insertmacro MUI_LANGUAGE "Bosnian"
+  !insertmacro MUI_LANGUAGE "Breton"
+  !insertmacro MUI_LANGUAGE "Bulgarian"
+  !insertmacro MUI_LANGUAGE "Croatian"
+  !insertmacro MUI_LANGUAGE "Czech"
+  !insertmacro MUI_LANGUAGE "Danish"
+  !insertmacro MUI_LANGUAGE "Dutch"
+  !insertmacro MUI_LANGUAGE "Estonian"
+  !insertmacro MUI_LANGUAGE "Farsi"
+  !insertmacro MUI_LANGUAGE "Finnish"
+  !insertmacro MUI_LANGUAGE "French"
+  !insertmacro MUI_LANGUAGE "German"
+  !insertmacro MUI_LANGUAGE "Greek"
+  !insertmacro MUI_LANGUAGE "Hebrew"
+  !insertmacro MUI_LANGUAGE "Hungarian"
+  !insertmacro MUI_LANGUAGE "Icelandic"
+  !insertmacro MUI_LANGUAGE "Indonesian"
+  !insertmacro MUI_LANGUAGE "Irish"
+  !insertmacro MUI_LANGUAGE "Italian"
+  !insertmacro MUI_LANGUAGE "Japanese"
+  !insertmacro MUI_LANGUAGE "Korean"
+  !insertmacro MUI_LANGUAGE "Kurdish"
+  !insertmacro MUI_LANGUAGE "Latvian"
+  !insertmacro MUI_LANGUAGE "Lithuanian"
+  !insertmacro MUI_LANGUAGE "Luxembourgish"
+  !insertmacro MUI_LANGUAGE "Macedonian"
+  !insertmacro MUI_LANGUAGE "Malay"
+  !insertmacro MUI_LANGUAGE "Mongolian"
+  !insertmacro MUI_LANGUAGE "Norwegian"
+  !insertmacro MUI_LANGUAGE "Polish"
+  !insertmacro MUI_LANGUAGE "Portuguese"
+  !insertmacro MUI_LANGUAGE "PortugueseBR"
+  !insertmacro MUI_LANGUAGE "Romanian"
+  !insertmacro MUI_LANGUAGE "Russian"
+  !insertmacro MUI_LANGUAGE "Serbian"
+  !insertmacro MUI_LANGUAGE "SerbianLatin"
+  !insertmacro MUI_LANGUAGE "SimpChinese"
+  !insertmacro MUI_LANGUAGE "Slovak"
+  !insertmacro MUI_LANGUAGE "Slovenian"
+  !insertmacro MUI_LANGUAGE "Spanish"
+  !insertmacro MUI_LANGUAGE "Swedish"
+  !insertmacro MUI_LANGUAGE "Thai"
+  !insertmacro MUI_LANGUAGE "TradChinese"
+  !insertmacro MUI_LANGUAGE "Turkish"
+  !insertmacro MUI_LANGUAGE "Ukrainian"
+  !insertmacro MUI_LANGUAGE "Welsh"
+
+
+;--------------------------------
+;Reserve Files
+
+  ;These files should be inserted before other files in the data block
+  ;Keep these lines before any File command
+  ;Only for solid compression (by default, solid compression is enabled for BZIP2 and LZMA)
+
+  ReserveFile "NSIS.InstallOptions.ini"
+  !insertmacro MUI_RESERVEFILE_INSTALLOPTIONS
+
+;--------------------------------
+;Installer Sections
+
+Section "-Core installation"
+  ;Use the entire tree produced by the INSTALL target.  Keep the
+  ;list of directories here in sync with the RMDir commands below.
+  SetOutPath "$INSTDIR"
+  @CPACK_NSIS_EXTRA_PREINSTALL_COMMANDS@
+  @CPACK_NSIS_FULL_INSTALL@
+
+  ;Store installation folder
+  WriteRegStr SHCTX "Software\@CPACK_PACKAGE_VENDOR@\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "" $INSTDIR
+
+  ;Create uninstaller
+  WriteUninstaller "$INSTDIR\Uninstall.exe"
+  Push "DisplayName"
+  Push "@CPACK_NSIS_DISPLAY_NAME@"
+  Call ConditionalAddToRegisty
+  Push "DisplayVersion"
+  Push "@CPACK_PACKAGE_VERSION@"
+  Call ConditionalAddToRegisty
+  Push "Publisher"
+  Push "@CPACK_PACKAGE_VENDOR@"
+  Call ConditionalAddToRegisty
+  Push "DisplayIcon"
+  Push "$INSTDIR\Uninstall.exe"
+  Call ConditionalAddToRegisty
+  Push "UninstallString"
+  Push "$INSTDIR\Uninstall.exe"
+  Call ConditionalAddToRegisty
+  Push "NoRepair"
+  Push "1"
+  Call ConditionalAddToRegisty
+
+  !ifdef CPACK_NSIS_ADD_REMOVE
+  ;Create add/remove functionality
+  Push "ModifyPath"
+  Push "$INSTDIR\AddRemove.exe"
+  Call ConditionalAddToRegisty
+  !else
+  Push "NoModify"
+  Push "1"
+  Call ConditionalAddToRegisty
+  !endif
+
+  ; Optional registration
+  Push "HelpLink"
+  Push "@CPACK_NSIS_HELP_LINK@"
+  Call ConditionalAddToRegisty
+  Push "URLInfoAbout"
+  Push "@CPACK_NSIS_URL_INFO_ABOUT@"
+  Call ConditionalAddToRegisty
+  Push "Contact"
+  Push "@CPACK_NSIS_CONTACT@"
+  Call ConditionalAddToRegisty
+  !insertmacro MUI_INSTALLOPTIONS_READ $INSTALL_DESKTOP "NSIS.InstallOptions.ini" "Field 5" "State"
+  !insertmacro MUI_STARTMENU_WRITE_BEGIN Application
+
+  ;Create shortcuts
+  CreateDirectory "$SMPROGRAMS\$STARTMENU_FOLDER"
+@CPACK_NSIS_CREATE_ICONS@
+@CPACK_NSIS_CREATE_ICONS_EXTRA@
+  CreateShortCut "$SMPROGRAMS\$STARTMENU_FOLDER\Uninstall.lnk" "$INSTDIR\Uninstall.exe"
+
+  ;Read a value from an InstallOptions INI file
+  !insertmacro MUI_INSTALLOPTIONS_READ $DO_NOT_ADD_TO_PATH "NSIS.InstallOptions.ini" "Field 2" "State"
+  !insertmacro MUI_INSTALLOPTIONS_READ $ADD_TO_PATH_ALL_USERS "NSIS.InstallOptions.ini" "Field 3" "State"
+  !insertmacro MUI_INSTALLOPTIONS_READ $ADD_TO_PATH_CURRENT_USER "NSIS.InstallOptions.ini" "Field 4" "State"
+
+
+  ;Create AF_PATH variable
+  WriteRegExpandStr ${env_af_hklm} AF_PATH '$INSTDIR'
+  WriteRegExpandStr ${env_af_hklm} AF_PATH_v@CPACK_PACKAGE_VERSION_MAJOR@ '$INSTDIR'
+
+  ;Add key for CMake package
+  WriteRegStr ${cmake_pkg_reg_key} ArrayFire_CMake_DIR '$INSTDIR'
+
+  ; make sure windows knows about the change
+  SendMessage ${HWND_BROADCAST} ${WM_WININICHANGE} 0 "STR:Environment" /TIMEOUT=5000
+  MessageBox MB_OK "Added AF_PATH environment variable for all users.$\n$\nIf you chose not to modify PATH in the installer, please manually add $\"%AF_PATH%\lib$\" to the user or system PATH variable for running applications using ArrayFire." /SD IDOK
+
+
+  ; Write special uninstall registry entries
+  Push "StartMenu"
+  Push "$STARTMENU_FOLDER"
+  Call ConditionalAddToRegisty
+  Push "DoNotAddToPath"
+  Push "$DO_NOT_ADD_TO_PATH"
+  Call ConditionalAddToRegisty
+  Push "AddToPathAllUsers"
+  Push "$ADD_TO_PATH_ALL_USERS"
+  Call ConditionalAddToRegisty
+  Push "AddToPathCurrentUser"
+  Push "$ADD_TO_PATH_CURRENT_USER"
+  Call ConditionalAddToRegisty
+  Push "InstallToDesktop"
+  Push "$INSTALL_DESKTOP"
+  Call ConditionalAddToRegisty
+
+  !insertmacro MUI_STARTMENU_WRITE_END
+
+@CPACK_NSIS_EXTRA_INSTALL_COMMANDS@
+
+SectionEnd
+
+Section "-Visual C++ installation"
+  ExecWait "$INSTDIR\lib\vc_redist.x64.exe /install /passive /norestart"
+  Delete "$INSTDIR\lib\vc_redist.x64.exe"
+SectionEnd
+
+Section "-Add to path"
+  Push $INSTDIR\lib
+  StrCmp "@CPACK_NSIS_MODIFY_PATH@" "ON" 0 doNotAddToPath
+  StrCmp $DO_NOT_ADD_TO_PATH "1" doNotAddToPath 0
+    Call AddToPath
+  doNotAddToPath:
+SectionEnd
+
+;--------------------------------
+; Create custom pages
+Function InstallOptionsPage
+  !insertmacro MUI_HEADER_TEXT "Install Options" "Choose options for installing @CPACK_NSIS_PACKAGE_NAME@"
+  !insertmacro MUI_INSTALLOPTIONS_DISPLAY "NSIS.InstallOptions.ini"
+
+FunctionEnd
+
+;--------------------------------
+; determine admin versus local install
+Function un.onInit
+
+  ClearErrors
+  UserInfo::GetName
+  IfErrors noLM
+  Pop $0
+  UserInfo::GetAccountType
+  Pop $1
+  StrCmp $1 "Admin" 0 +3
+    SetShellVarContext all
+    ;MessageBox MB_OK 'User "$0" is in the Admin group'
+    Goto done
+  StrCmp $1 "Power" 0 +3
+    SetShellVarContext all
+    ;MessageBox MB_OK 'User "$0" is in the Power Users group'
+    Goto done
+
+  noLM:
+    ;Get installation folder from registry if available
+
+  done:
+
+FunctionEnd
+
+;--- Add/Remove callback functions: ---
+!macro SectionList MacroName
+  ;This macro is used to perform an operation on multiple sections.
+  ;List all of your components here in the following manner.
+@CPACK_NSIS_COMPONENT_SECTION_LIST@
+!macroend
+
+Section -FinishComponents
+  ;Removes unselected components and writes component status to registry
+  !insertmacro SectionList "FinishSection"
+
+!ifdef CPACK_NSIS_ADD_REMOVE
+  ; Get the name of the installer executable
+  System::Call 'kernel32::GetModuleFileNameA(i 0, t .R0, i 1024) i r1'
+  StrCpy $R3 $R0
+
+  ; Extract the last 13 characters, to see if we have AddRemove.exe
+  StrLen $R1 $R0
+  IntOp $R1 $R1 - 13
+  StrCpy $R2 $R0 13 $R1
+  StrCmp $R2 "AddRemove.exe" addremove_installed
+
+  ; We're not running AddRemove.exe, so install it
+  CopyFiles $R3 $INSTDIR\AddRemove.exe
+
+  addremove_installed:
+!endif
+SectionEnd
+;--- End of Add/Remove callback functions ---
+
+;--------------------------------
+; Component dependencies
+Function .onSelChange
+  !insertmacro SectionList "MaybeSelectionChanged"
+FunctionEnd
+
+;--------------------------------
+;Uninstaller Section
+
+Section "Uninstall"
+  ReadRegStr $START_MENU SHCTX \
+   "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "StartMenu"
+  ;MessageBox MB_OK "Start menu is in: $START_MENU"
+  ReadRegStr $DO_NOT_ADD_TO_PATH SHCTX \
+    "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "DoNotAddToPath"
+  ReadRegStr $ADD_TO_PATH_ALL_USERS SHCTX \
+    "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "AddToPathAllUsers"
+  ReadRegStr $ADD_TO_PATH_CURRENT_USER SHCTX \
+    "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "AddToPathCurrentUser"
+  ;MessageBox MB_OK "Add to path: $DO_NOT_ADD_TO_PATH all users: $ADD_TO_PATH_ALL_USERS"
+  ReadRegStr $INSTALL_DESKTOP SHCTX \
+    "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "InstallToDesktop"
+  ;MessageBox MB_OK "Install to desktop: $INSTALL_DESKTOP "
+
+@CPACK_NSIS_EXTRA_UNINSTALL_COMMANDS@
+
+  ;Remove files we installed.
+  ;Keep the list of directories here in sync with the File commands above.
+@CPACK_NSIS_DELETE_FILES@
+@CPACK_NSIS_DELETE_DIRECTORIES@
+
+!ifdef CPACK_NSIS_ADD_REMOVE
+  ;Remove the add/remove program
+  Delete "$INSTDIR\AddRemove.exe"
+!endif
+
+
+  ;Delete AF_PATH variable
+  DeleteRegValue ${env_af_hklm} AF_PATH
+  DeleteRegValue ${env_af_hklm} AF_PATH_v@CPACK_PACKAGE_VERSION_MAJOR@
+
+  ;Delete cmake package key
+  DeleteRegValue ${cmake_pkg_reg_key} ArrayFire_CMake_DIR
+
+  ; make sure windows knows about the change
+  SendMessage ${HWND_BROADCAST} ${WM_WININICHANGE} 0 "STR:Environment" /TIMEOUT=5000
+
+
+  ;Remove the uninstaller itself.
+  Delete "$INSTDIR\Uninstall.exe"
+  DeleteRegKey SHCTX "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@"
+
+  ;Remove the installation directory if it is empty.
+  RMDir "$INSTDIR"
+
+  ; Remove the registry entries.
+  DeleteRegKey SHCTX "Software\@CPACK_PACKAGE_VENDOR@\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@"
+
+  ; Removes all optional components
+  !insertmacro SectionList "RemoveSection_CPack"
+
+  !insertmacro MUI_STARTMENU_GETFOLDER Application $MUI_TEMP
+
+  Delete "$SMPROGRAMS\$MUI_TEMP\Uninstall.lnk"
+@CPACK_NSIS_DELETE_ICONS@
+@CPACK_NSIS_DELETE_ICONS_EXTRA@
+
+  ;Delete empty start menu parent directories
+  StrCpy $MUI_TEMP "$SMPROGRAMS\$MUI_TEMP"
+
+  startMenuDeleteLoop:
+    ClearErrors
+    RMDir $MUI_TEMP
+    GetFullPathName $MUI_TEMP "$MUI_TEMP\.."
+
+    IfErrors startMenuDeleteLoopDone
+
+    StrCmp "$MUI_TEMP" "$SMPROGRAMS" startMenuDeleteLoopDone startMenuDeleteLoop
+  startMenuDeleteLoopDone:
+
+  ; If the user changed the shortcut, then uninstall may not work. This should
+  ; try to fix it.
+  StrCpy $MUI_TEMP "$START_MENU"
+  Delete "$SMPROGRAMS\$MUI_TEMP\Uninstall.lnk"
+@CPACK_NSIS_DELETE_ICONS_EXTRA@
+
+  ;Delete empty start menu parent directories
+  StrCpy $MUI_TEMP "$SMPROGRAMS\$MUI_TEMP"
+
+  secondStartMenuDeleteLoop:
+    ClearErrors
+    RMDir $MUI_TEMP
+    GetFullPathName $MUI_TEMP "$MUI_TEMP\.."
+
+    IfErrors secondStartMenuDeleteLoopDone
+
+    StrCmp "$MUI_TEMP" "$SMPROGRAMS" secondStartMenuDeleteLoopDone secondStartMenuDeleteLoop
+  secondStartMenuDeleteLoopDone:
+
+  DeleteRegKey /ifempty SHCTX "Software\@CPACK_PACKAGE_VENDOR@\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@"
+
+  Push $INSTDIR\lib
+  StrCmp $DO_NOT_ADD_TO_PATH "1" doNotRemoveFromPath 0
+    Call un.RemoveFromPath
+  doNotRemoveFromPath:
+SectionEnd
+
+;--------------------------------
+; determine admin versus local install
+; Is install for "AllUsers" or "JustMe"?
+; Default to "JustMe" - set to "AllUsers" if admin or on Win9x
+; This function is used for the very first "custom page" of the installer.
+; This custom page does not show up visibly, but it executes prior to the
+; first visible page and sets up $INSTDIR properly...
+; Choose different default installation folder based on SV_ALLUSERS...
+; "Program Files" for AllUsers, "My Documents" for JustMe...
+
+Function .onInit
+  StrCmp "@CPACK_NSIS_ENABLE_UNINSTALL_BEFORE_INSTALL@" "ON" 0 inst
+
+  ReadRegStr $0 HKLM "Software\Microsoft\Windows\CurrentVersion\Uninstall\@CPACK_PACKAGE_INSTALL_REGISTRY_KEY@" "UninstallString"
+  StrCmp $0 "" inst
+
+  MessageBox MB_YESNOCANCEL|MB_ICONEXCLAMATION \
+  "@CPACK_NSIS_PACKAGE_NAME@ is already installed. $\n$\nDo you want to uninstall the old version before installing the new one?" \
+  /SD IDYES IDYES uninst IDNO inst
+  Abort
+
+;Run the uninstaller
+uninst:
+  ClearErrors
+  StrLen $2 "\Uninstall.exe"
+  StrCpy $3 $0 -$2 # remove "\Uninstall.exe" from UninstallString to get path
+  ExecWait '"$0" /S _?=$3' ;Do not copy the uninstaller to a temp file
+
+  IfErrors uninst_failed inst
+uninst_failed:
+  MessageBox MB_OK|MB_ICONSTOP "Uninstall failed."
+  Abort
+
+
+inst:
+  ; Reads component status from the registry
+  !insertmacro SectionList "InitSection"
+
+  ; check to see if /D has been used to change
+  ; the install directory by comparing it to the
+  ; install directory that is expected to be the
+  ; default
+  StrCpy $IS_DEFAULT_INSTALLDIR 0
+  StrCmp "$INSTDIR" "@CPACK_NSIS_INSTALL_ROOT@\@CPACK_PACKAGE_INSTALL_DIRECTORY@" 0 +2
+    StrCpy $IS_DEFAULT_INSTALLDIR 1
+
+  StrCpy $SV_ALLUSERS "JustMe"
+  ; if default install dir then change the default
+  ; if it is installed for JustMe
+  StrCmp "$IS_DEFAULT_INSTALLDIR" "1" 0 +2
+    StrCpy $INSTDIR "$DOCUMENTS\@CPACK_PACKAGE_INSTALL_DIRECTORY@"
+
+  ClearErrors
+  UserInfo::GetName
+  IfErrors noLM
+  Pop $0
+  UserInfo::GetAccountType
+  Pop $1
+  StrCmp $1 "Admin" 0 +4
+    SetShellVarContext all
+    ;MessageBox MB_OK 'User "$0" is in the Admin group'
+    StrCpy $SV_ALLUSERS "AllUsers"
+    Goto done
+  StrCmp $1 "Power" 0 +4
+    SetShellVarContext all
+    ;MessageBox MB_OK 'User "$0" is in the Power Users group'
+    StrCpy $SV_ALLUSERS "AllUsers"
+    Goto done
+
+  noLM:
+    StrCpy $SV_ALLUSERS "AllUsers"
+    ;Get installation folder from registry if available
+
+  done:
+  StrCmp $SV_ALLUSERS "AllUsers" 0 +3
+    StrCmp "$IS_DEFAULT_INSTALLDIR" "1" 0 +2
+      StrCpy $INSTDIR "@CPACK_NSIS_INSTALL_ROOT@\@CPACK_PACKAGE_INSTALL_DIRECTORY@"
+
+  StrCmp "@CPACK_NSIS_MODIFY_PATH@" "ON" 0 noOptionsPage
+    !insertmacro MUI_INSTALLOPTIONS_EXTRACT "NSIS.InstallOptions.ini"
+
+  noOptionsPage:
+FunctionEnd
diff --git a/CMakeModules/osx_install/readme.html.in b/CMakeModules/osx_install/readme.html.in
new file mode 100644
index 0000000000..82d42a79a8
--- /dev/null
+++ b/CMakeModules/osx_install/readme.html.in
@@ -0,0 +1,14 @@
+<html>
+<body>
+
+    <h2>Install Directories</h2>
+    <ul>
+        <li> Libraries will be installed in <code>/opt/arrayfire/lib</code> </li>
+        <li> Headers will be installed in <code>/opt/arrayfire/include</code> </li>
+        <li> Examples, documentation and CMake config files will be installed in <code>/opt/arrayfire/share</code> </li>
+    </ul>
+    <p> For a complete list of updates, visit <a href="http://www.arrayfire.com/docs/releasenotes.htm">ArrayFire Release Notes</a></p>
+
+    <p> For questions about ArrayFire or this installer, visit <a href="https://groups.google.com/forum/#!forum/arrayfire-users"> ArrayFire User Forums</a> </p>
+</body>
+</html>
diff --git a/CMakeModules/osx_install/welcome.html.in b/CMakeModules/osx_install/welcome.html.in
new file mode 100644
index 0000000000..75cb2f40cf
--- /dev/null
+++ b/CMakeModules/osx_install/welcome.html.in
@@ -0,0 +1,27 @@
+<html>
+<body>
+    <h1>Welcome To ArrayFire ${AF_VERSION}</h1>
+
+    <p>
+    ArrayFire is a high performance software library for parallel computing
+    with an easy-to-use API. Its <em>array</em> based function set makes parallel
+    programming simple.
+    </p>
+
+    <p> 
+    ArrayFire's multiple backends (<strong>CUDA</strong>,
+    <strong>OpenCL</strong> and native <strong>CPU</strong>) make it platform
+    independent and highly portable.
+    </p>
+
+    <p>
+    A few lines of code in ArrayFire can replace dozens of lines of parallel
+    compute code, saving you valuable time and lowering development costs.
+    </p>
+
+    <p>
+    Follow these steps to install the ArrayFire libraries.
+    </p>
+</body>
+
+</html>
diff --git a/CMakeModules/platform.cmake b/CMakeModules/platform.cmake
new file mode 100644
index 0000000000..cf0f72f8ed
--- /dev/null
+++ b/CMakeModules/platform.cmake
@@ -0,0 +1,21 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+# Platform specific settings
+#
+# Add paths and flags for specific platforms. This can inc
+
+if(APPLE)
+  # Default path for Intel MKL libraries
+  set(CMAKE_PREFIX_PATH "${CMAKE_PREFIX_PATH};/opt/intel/mkl/lib")
+endif()
+
+if(UNIX AND NOT APPLE)
+  # Default path for Intel MKL libraries
+  set(CMAKE_PREFIX_PATH "${CMAKE_PREFIX_PATH};/opt/intel/mkl/lib/intel64")
+endif()
+
diff --git a/CMakeModules/select_compute_arch.cmake b/CMakeModules/select_compute_arch.cmake
new file mode 100644
index 0000000000..e09490a7e5
--- /dev/null
+++ b/CMakeModules/select_compute_arch.cmake
@@ -0,0 +1,325 @@
+# Synopsis:
+#   CUDA_SELECT_NVCC_ARCH_FLAGS(out_variable [target_CUDA_architectures])
+#   -- Selects GPU arch flags for nvcc based on target_CUDA_architectures
+#      target_CUDA_architectures : Auto | Common | All | LIST(ARCH_AND_PTX ...)
+#       - "Auto" detects local machine GPU compute arch at runtime.
+#       - "Common" and "All" cover common and entire subsets of architectures
+#      ARCH_AND_PTX : NAME | NUM.NUM | NUM.NUM(NUM.NUM) | NUM.NUM+PTX
+#      NAME: Fermi Kepler Maxwell Kepler+Tegra Kepler+Tesla Maxwell+Tegra Pascal Volta Turing Ampere
+#      NUM: Any number. Only those pairs are currently accepted by NVCC though:
+#            2.0 2.1 3.0 3.2 3.5 3.7 5.0 5.2 5.3 6.0 6.2 7.0 7.2 7.5 8.0 8.6 9.0
+#      Returns LIST of flags to be added to CUDA_NVCC_FLAGS in ${out_variable}
+#      Additionally, sets ${out_variable}_readable to the resulting numeric list
+#      Example:
+#       CUDA_SELECT_NVCC_ARCH_FLAGS(ARCH_FLAGS 3.0 3.5+PTX 5.2(5.0) Maxwell)
+#        LIST(APPEND CUDA_NVCC_FLAGS ${ARCH_FLAGS})
+#
+#      More info on CUDA architectures: https://en.wikipedia.org/wiki/CUDA
+#
+
+if(CMAKE_CUDA_COMPILER_LOADED) # CUDA as a language
+  if(CMAKE_CUDA_COMPILER_ID STREQUAL "NVIDIA"
+      AND CMAKE_CUDA_COMPILER_VERSION MATCHES "^([0-9]+\\.[0-9]+)")
+    set(CUDA_VERSION "${CMAKE_MATCH_1}")
+  endif()
+endif()
+
+# See: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list
+# Additions, deprecations, and removals can be found in the release notes:
+# https://developer.nvidia.com/cuda-toolkit-archive
+
+# The initial status here is for CUDA 7.0
+set(CUDA_KNOWN_GPU_ARCHITECTURES  "Fermi" "Kepler" "Maxwell" "Kepler+Tegra" "Kepler+Tesla" "Maxwell+Tegra")
+set(CUDA_COMMON_GPU_ARCHITECTURES "2.0" "2.1" "3.0" "3.5" "5.0" "5.3")
+set(CUDA_LIMIT_GPU_ARCHITECTURE "6.0")
+set(CUDA_ALL_GPU_ARCHITECTURES "2.0" "2.1" "3.0" "3.2" "3.5" "3.7" "5.0" "5.2" "5.3")
+set(_CUDA_MAX_COMMON_ARCHITECTURE "5.2+PTX")
+
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "8.0")
+  list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Pascal")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "6.0" "6.1")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "6.0" "6.1" "6.2")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "6.2+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "7.0")
+
+  list(REMOVE_ITEM CUDA_COMMON_GPU_ARCHITECTURES "2.0" "2.1")
+endif ()
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "9.0")
+  list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Volta")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "7.0")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "7.0" "7.2")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "7.2+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "8.0")
+
+  list(REMOVE_ITEM CUDA_KNOWN_GPU_ARCHITECTURES "Fermi")
+  list(REMOVE_ITEM CUDA_ALL_GPU_ARCHITECTURES "2.0" "2.1")
+endif()
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "10.0")
+  list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Turing")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "7.5")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "7.5")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "7.5+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "8.0")
+
+  list(REMOVE_ITEM CUDA_COMMON_GPU_ARCHITECTURES "3.0")
+endif()
+
+# https://docs.nvidia.com/cuda/archive/11.0/cuda-toolkit-release-notes/index.html#cuda-general-new-features
+# https://docs.nvidia.com/cuda/archive/11.0/cuda-toolkit-release-notes/index.html#deprecated-features
+if(CUDA_VERSION VERSION_GREATER_EQUAL "11.0")
+  list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Ampere")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "8.0")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "8.0")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "8.0+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "8.6")
+
+  list(REMOVE_ITEM CUDA_COMMON_GPU_ARCHITECTURES "3.5" "5.0")
+  list(REMOVE_ITEM CUDA_ALL_GPU_ARCHITECTURES "3.0" "3.2")
+endif()
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "11.1")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "8.6")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "8.6")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "8.6+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "9.0")
+endif()
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "11.8")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "8.9")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "8.9")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "8.9+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "9.0")
+endif()
+
+if(CUDA_VERSION VERSION_GREATER_EQUAL "12.0")
+  list(APPEND CUDA_KNOWN_GPU_ARCHITECTURES "Hopper")
+  list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "9.0")
+  list(APPEND CUDA_ALL_GPU_ARCHITECTURES "9.0")
+
+  set(_CUDA_MAX_COMMON_ARCHITECTURE "9.0+PTX")
+  set(CUDA_LIMIT_GPU_ARCHITECTURE "9.0")
+
+  list(REMOVE_ITEM CUDA_ALL_GPU_ARCHITECTURES "3.5" "3.7")
+endif()
+
+list(APPEND CUDA_COMMON_GPU_ARCHITECTURES "${_CUDA_MAX_COMMON_ARCHITECTURE}")
+
+# Check with: cmake -DCUDA_VERSION=7.0 -P select_compute_arch.cmake
+if(DEFINED CMAKE_SCRIPT_MODE_FILE)
+  include(CMakePrintHelpers)
+  cmake_print_variables(CUDA_KNOWN_GPU_ARCHITECTURES)
+  cmake_print_variables(CUDA_COMMON_GPU_ARCHITECTURES)
+  cmake_print_variables(CUDA_LIMIT_GPU_ARCHITECTURE)
+  cmake_print_variables(CUDA_ALL_GPU_ARCHITECTURES)
+endif()
+
+
+################################################################################################
+# A function for automatic detection of installed GPUs (if autodetection is enabled)
+# Usage:
+#   CUDA_DETECT_INSTALLED_GPUS(OUT_VARIABLE)
+#
+function(CUDA_DETECT_INSTALLED_GPUS OUT_VARIABLE)
+  if(NOT CUDA_GPU_DETECT_OUTPUT)
+    if(CMAKE_CUDA_COMPILER_LOADED) # CUDA as a language
+      set(file "${PROJECT_BINARY_DIR}/detect_cuda_compute_capabilities.cu")
+    else()
+      set(file "${PROJECT_BINARY_DIR}/detect_cuda_compute_capabilities.cpp")
+    endif()
+
+    file(WRITE ${file} ""
+      "#include <cuda_runtime.h>\n"
+      "#include <cstdio>\n"
+      "int main()\n"
+      "{\n"
+      "  int count = 0;\n"
+      "  if (cudaSuccess != cudaGetDeviceCount(&count)) return -1;\n"
+      "  if (count == 0) return -1;\n"
+      "  for (int device = 0; device < count; ++device)\n"
+      "  {\n"
+      "    cudaDeviceProp prop;\n"
+      "    if (cudaSuccess == cudaGetDeviceProperties(&prop, device))\n"
+      "      std::printf(\"%d.%d \", prop.major, prop.minor);\n"
+      "  }\n"
+      "  return 0;\n"
+      "}\n")
+
+    if(CMAKE_CUDA_COMPILER_LOADED) # CUDA as a language
+      try_run(run_result compile_result ${PROJECT_BINARY_DIR} ${file}
+              RUN_OUTPUT_VARIABLE compute_capabilities)
+    else()
+      try_run(run_result compile_result ${PROJECT_BINARY_DIR} ${file}
+              CMAKE_FLAGS "-DINCLUDE_DIRECTORIES=${CUDA_INCLUDE_DIRS}"
+              LINK_LIBRARIES ${CUDA_LIBRARIES}
+              RUN_OUTPUT_VARIABLE compute_capabilities)
+    endif()
+
+    # Filter unrelated content out of the output.
+    string(REGEX MATCHALL "[0-9]+\\.[0-9]+" compute_capabilities "${compute_capabilities}")
+
+    if(run_result EQUAL 0)
+      string(REPLACE "2.1" "2.1(2.0)" compute_capabilities "${compute_capabilities}")
+      set(CUDA_GPU_DETECT_OUTPUT ${compute_capabilities}
+        CACHE INTERNAL "Returned GPU architectures from detect_gpus tool" FORCE)
+    endif()
+  endif()
+
+  if(NOT CUDA_GPU_DETECT_OUTPUT)
+    message(STATUS "Automatic GPU detection failed. Building for common architectures.")
+    set(${OUT_VARIABLE} ${CUDA_COMMON_GPU_ARCHITECTURES} PARENT_SCOPE)
+  else()
+    # Filter based on CUDA version supported archs
+    set(CUDA_GPU_DETECT_OUTPUT_FILTERED "")
+    separate_arguments(CUDA_GPU_DETECT_OUTPUT)
+    foreach(ITEM IN ITEMS ${CUDA_GPU_DETECT_OUTPUT})
+      if(CUDA_LIMIT_GPU_ARCHITECTURE AND ITEM VERSION_GREATER_EQUAL CUDA_LIMIT_GPU_ARCHITECTURE)
+        list(GET CUDA_COMMON_GPU_ARCHITECTURES -1 NEWITEM)
+        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${NEWITEM}")
+      else()
+        string(APPEND CUDA_GPU_DETECT_OUTPUT_FILTERED " ${ITEM}")
+      endif()
+    endforeach()
+
+    set(${OUT_VARIABLE} ${CUDA_GPU_DETECT_OUTPUT_FILTERED} PARENT_SCOPE)
+  endif()
+endfunction()
+
+
+################################################################################################
+# Function for selecting GPU arch flags for nvcc based on CUDA architectures from parameter list
+# Usage:
+#   CUDA_SELECT_NVCC_ARCH_FLAGS(out_variable [list of CUDA compute archs])
+function(CUDA_SELECT_NVCC_ARCH_FLAGS out_variable)
+  set(CUDA_ARCH_LIST "${ARGN}")
+
+  if("X${CUDA_ARCH_LIST}" STREQUAL "X" )
+    set(CUDA_ARCH_LIST "Auto")
+  endif()
+
+  set(cuda_arch_bin)
+  set(cuda_arch_ptx)
+
+  if("${CUDA_ARCH_LIST}" STREQUAL "All")
+    set(CUDA_ARCH_LIST ${CUDA_KNOWN_GPU_ARCHITECTURES})
+  elseif("${CUDA_ARCH_LIST}" STREQUAL "Common")
+    set(CUDA_ARCH_LIST ${CUDA_COMMON_GPU_ARCHITECTURES})
+  elseif("${CUDA_ARCH_LIST}" STREQUAL "Auto")
+    CUDA_DETECT_INSTALLED_GPUS(CUDA_ARCH_LIST)
+    message(STATUS "Autodetected CUDA architecture(s): ${CUDA_ARCH_LIST}")
+  endif()
+
+  # Now process the list and look for names
+  string(REGEX REPLACE "[ \t]+" ";" CUDA_ARCH_LIST "${CUDA_ARCH_LIST}")
+  list(REMOVE_DUPLICATES CUDA_ARCH_LIST)
+  foreach(arch_name ${CUDA_ARCH_LIST})
+    set(arch_bin)
+    set(arch_ptx)
+    set(add_ptx FALSE)
+    # Check to see if we are compiling PTX
+    if(arch_name MATCHES "(.*)\\+PTX$")
+      set(add_ptx TRUE)
+      set(arch_name ${CMAKE_MATCH_1})
+    endif()
+    if(arch_name MATCHES "^([0-9]\\.[0-9](\\([0-9]\\.[0-9]\\))?)$")
+      set(arch_bin ${CMAKE_MATCH_1})
+      set(arch_ptx ${arch_bin})
+    else()
+      # Look for it in our list of known architectures
+      if(${arch_name} STREQUAL "Fermi")
+        set(arch_bin 2.0 "2.1(2.0)")
+      elseif(${arch_name} STREQUAL "Kepler+Tegra")
+        set(arch_bin 3.2)
+      elseif(${arch_name} STREQUAL "Kepler+Tesla")
+        set(arch_bin 3.7)
+      elseif(${arch_name} STREQUAL "Kepler")
+        set(arch_bin 3.0 3.5)
+        set(arch_ptx 3.5)
+      elseif(${arch_name} STREQUAL "Maxwell+Tegra")
+        set(arch_bin 5.3)
+      elseif(${arch_name} STREQUAL "Maxwell")
+        set(arch_bin 5.0 5.2)
+        set(arch_ptx 5.2)
+      elseif(${arch_name} STREQUAL "Pascal")
+        set(arch_bin 6.0 6.1)
+        set(arch_ptx 6.1)
+      elseif(${arch_name} STREQUAL "Pascal+Tegra")
+        set(arch_bin 6.2)
+        set(arch_ptx 6.2)
+      elseif(${arch_name} STREQUAL "Volta")
+        set(arch_bin 7.0)
+        set(arch_ptx 7.0)
+      elseif(${arch_name} STREQUAL "Volta+Tegra")
+        set(arch_bin 7.2)
+      elseif(${arch_name} STREQUAL "Turing")
+        set(arch_bin 7.5)
+        set(arch_ptx 7.5)
+      elseif(${arch_name} STREQUAL "Ampere")
+        set(arch_bin 8.0)
+        set(arch_ptx 8.0)
+      elseif(${arch_name} STREQUAL "Hopper")
+        set(arch_bin 9.0)
+        set(arch_ptx 9.0)
+      else()
+        message(SEND_ERROR "Unknown CUDA Architecture Name ${arch_name} in CUDA_SELECT_NVCC_ARCH_FLAGS")
+      endif()
+    endif()
+    if(NOT arch_bin)
+      message(SEND_ERROR "arch_bin wasn't set for some reason")
+    endif()
+    list(APPEND cuda_arch_bin ${arch_bin})
+    if(add_ptx)
+      if (NOT arch_ptx)
+        set(arch_ptx ${arch_bin})
+      endif()
+      list(APPEND cuda_arch_ptx ${arch_ptx})
+    endif()
+  endforeach()
+
+  # remove dots and convert to lists
+  string(REGEX REPLACE "\\." "" cuda_arch_bin "${cuda_arch_bin}")
+  string(REGEX REPLACE "\\." "" cuda_arch_ptx "${cuda_arch_ptx}")
+  string(REGEX MATCHALL "[0-9()]+" cuda_arch_bin "${cuda_arch_bin}")
+  string(REGEX MATCHALL "[0-9]+"   cuda_arch_ptx "${cuda_arch_ptx}")
+
+  if(cuda_arch_bin)
+    list(REMOVE_DUPLICATES cuda_arch_bin)
+  endif()
+  if(cuda_arch_ptx)
+    list(REMOVE_DUPLICATES cuda_arch_ptx)
+  endif()
+
+  set(nvcc_flags "")
+  set(nvcc_archs_readable "")
+
+  # Tell NVCC to add binaries for the specified GPUs
+  foreach(arch ${cuda_arch_bin})
+    if(arch MATCHES "([0-9]+)\\(([0-9]+)\\)")
+      # User explicitly specified ARCH for the concrete CODE
+      list(APPEND nvcc_flags -gencode arch=compute_${CMAKE_MATCH_2},code=sm_${CMAKE_MATCH_1})
+      list(APPEND nvcc_archs_readable sm_${CMAKE_MATCH_1})
+    else()
+      # User didn't explicitly specify ARCH for the concrete CODE, we assume ARCH=CODE
+      list(APPEND nvcc_flags -gencode arch=compute_${arch},code=sm_${arch})
+      list(APPEND nvcc_archs_readable sm_${arch})
+    endif()
+  endforeach()
+
+  # Tell NVCC to add PTX intermediate code for the specified architectures
+  foreach(arch ${cuda_arch_ptx})
+    list(APPEND nvcc_flags -gencode arch=compute_${arch},code=compute_${arch})
+    list(APPEND nvcc_archs_readable compute_${arch})
+  endforeach()
+
+  string(REPLACE ";" " " nvcc_archs_readable "${nvcc_archs_readable}")
+  set(${out_variable}          ${nvcc_flags}          PARENT_SCOPE)
+  set(${out_variable}_readable ${nvcc_archs_readable} PARENT_SCOPE)
+endfunction()
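Usage sketch for CUDA_SELECT_NVCC_ARCH_FLAGS as defined in the file above. This is a minimal illustration, not part of the change: it assumes the module has been placed on CMAKE_MODULE_PATH under the name select_compute_arch, and that the consumer still appends to the legacy CUDA_NVCC_FLAGS variable. The architecture list 6.0 7.0+PTX is chosen only as an example.

```cmake
# Hypothetical consumer code; assumes select_compute_arch.cmake is reachable
# through CMAKE_MODULE_PATH.
include(select_compute_arch)

# Request binary code for sm_60 and sm_70, plus PTX for compute_70 so that
# newer GPUs can JIT-compile the kernels.
CUDA_SELECT_NVCC_ARCH_FLAGS(ARCH_FLAGS 6.0 7.0+PTX)
list(APPEND CUDA_NVCC_FLAGS ${ARCH_FLAGS})

# ARCH_FLAGS holds one "-gencode" / "arch=...,code=..." pair per target.
# ARCH_FLAGS_readable is the space-separated summary "sm_60 sm_70 compute_70".
message(STATUS "nvcc arch flags: ${ARCH_FLAGS}")
message(STATUS "Building for: ${ARCH_FLAGS_readable}")
```

Passing no architecture list is treated as "Auto", which compiles and runs a small detection program and falls back to CUDA_COMMON_GPU_ARCHITECTURES if that program fails.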
diff --git a/CMakeModules/vcpkg/ports/lapack-reference/FindLAPACK.cmake b/CMakeModules/vcpkg/ports/lapack-reference/FindLAPACK.cmake
new file mode 100644
index 0000000000..f4d25477d8
--- /dev/null
+++ b/CMakeModules/vcpkg/ports/lapack-reference/FindLAPACK.cmake
@@ -0,0 +1,559 @@
+# Distributed under the OSI-approved BSD 3-Clause License.  See accompanying
+# file Copyright.txt or https://cmake.org/licensing for details.
+
+#[=======================================================================[.rst:
+FindLAPACK
+----------
+
+Find Linear Algebra PACKage (LAPACK) library
+
+This module finds an installed Fortran library that implements the
+LAPACK linear-algebra interface (see http://www.netlib.org/lapack/).
+
+The approach follows that taken for the ``autoconf`` macro file,
+``acx_lapack.m4`` (distributed at
+http://ac-archive.sourceforge.net/ac-archive/acx_lapack.html).
+
+Input Variables
+^^^^^^^^^^^^^^^
+
+The following variables may be set to influence this module's behavior:
+
+``BLA_STATIC``
+  if ``ON`` use static linkage
+
+``BLA_VENDOR``
+  If set, checks only the specified vendor, if not set checks all the
+  possibilities.  List of vendors valid in this module:
+
+  * ``OpenBLAS``
+  * ``FLAME``
+  * ``Intel10_32`` (intel mkl v10 32 bit)
+  * ``Intel10_64lp`` (intel mkl v10+ 64 bit, threaded code, lp64 model)
+  * ``Intel10_64lp_seq`` (intel mkl v10+ 64 bit, sequential code, lp64 model)
+  * ``Intel10_64ilp`` (intel mkl v10+ 64 bit, threaded code, ilp64 model)
+  * ``Intel10_64ilp_seq`` (intel mkl v10+ 64 bit, sequential code, ilp64 model)
+  * ``Intel10_64_dyn`` (intel mkl v10+ 64 bit, single dynamic library)
+  * ``Intel`` (obsolete versions of mkl 32 and 64 bit)
+  * ``ACML``
+  * ``Apple``
+  * ``NAS``
+  * ``Arm``
+  * ``Arm_mp``
+  * ``Arm_ilp64``
+  * ``Arm_ilp64_mp``
+  * ``Generic``
+
+``BLA_F95``
+  if ``ON`` tries to find the BLAS95/LAPACK95 interfaces
+
+Imported targets
+^^^^^^^^^^^^^^^^
+
+This module defines the following :prop_tgt:`IMPORTED` target:
+
+``LAPACK::LAPACK``
+  The libraries to use for LAPACK, if found.
+
+Result Variables
+^^^^^^^^^^^^^^^^
+
+This module defines the following variables:
+
+``LAPACK_FOUND``
+  library implementing the LAPACK interface is found
+``LAPACK_LINKER_FLAGS``
+  uncached list of required linker flags (excluding ``-l`` and ``-L``).
+``LAPACK_LIBRARIES``
+  uncached list of libraries (using full path name) to link against
+  to use LAPACK
+``LAPACK95_LIBRARIES``
+  uncached list of libraries (using full path name) to link against
+  to use LAPACK95
+``LAPACK95_FOUND``
+  library implementing the LAPACK95 interface is found
+
+.. note::
+
+  C, CXX or Fortran must be enabled to detect a BLAS/LAPACK library.
+  C or CXX must be enabled to use Intel Math Kernel Library (MKL).
+
+  For example, to use Intel MKL libraries and/or Intel compiler:
+
+  .. code-block:: cmake
+
+    set(BLA_VENDOR Intel10_64lp)
+    find_package(LAPACK)
+#]=======================================================================]
+
+enable_language(C)
+# Check the language being used
+if(NOT (CMAKE_C_COMPILER_LOADED OR CMAKE_CXX_COMPILER_LOADED OR CMAKE_Fortran_COMPILER_LOADED))
+  if(LAPACK_FIND_REQUIRED)
+    message(FATAL_ERROR "FindLAPACK requires Fortran, C, or C++ to be enabled.")
+  else()
+    message(STATUS "Looking for LAPACK... - NOT found (Unsupported languages)")
+    return()
+  endif()
+endif()
+
+if(CMAKE_Fortran_COMPILER_LOADED)
+  include(${CMAKE_ROOT}/Modules/CheckFortranFunctionExists.cmake)
+else()
+  include(${CMAKE_ROOT}/Modules/CheckFunctionExists.cmake)
+endif()
+include(${CMAKE_ROOT}/Modules/CMakePushCheckState.cmake)
+
+cmake_push_check_state()
+set(CMAKE_REQUIRED_QUIET ${LAPACK_FIND_QUIETLY})
+
+set(LAPACK_FOUND FALSE)
+set(LAPACK95_FOUND FALSE)
+
+# store original values for CMAKE_FIND_LIBRARY_SUFFIXES
+set(_lapack_ORIG_CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_FIND_LIBRARY_SUFFIXES})
+if (CMAKE_SYSTEM_NAME STREQUAL "Linux")
+    list(APPEND CMAKE_FIND_LIBRARY_SUFFIXES .so.3gfs .so.3 .so.4 .so.5)
+endif()
+
+# TODO: move this stuff to a separate module
+
+macro(CHECK_LAPACK_LIBRARIES LIBRARIES _prefix _name _flags _list _threadlibs _addlibdir _subdirs _blas)
+  # This macro checks for the existence of the combination of fortran libraries
+  # given by _list.  If the combination is found, this macro checks (using the
+  # Check_Fortran_Function_Exists macro) whether it can link against that library
+  # combination using the name of a routine given by _name using the linker
+  # flags given by _flags.  If the combination of libraries is found and passes
+  # the link test, LIBRARIES is set to the list of complete library paths that
+  # have been found.  Otherwise, LIBRARIES is set to FALSE.
+
+  # N.B. _prefix is the prefix applied to the names of all cached variables that
+  # are generated internally and marked advanced by this macro.
+  # _addlibdir is a list of additional search paths. _subdirs is a list of path
+  # suffixes to be used by find_library().
+
+  set(_libraries_work TRUE)
+  set(${LIBRARIES})
+  set(_combined_name)
+
+  set(_extaddlibdir "${_addlibdir}")
+  if(WIN32)
+    list(APPEND _extaddlibdir ENV LIB)
+  elseif(APPLE)
+    list(APPEND _extaddlibdir ENV DYLD_LIBRARY_PATH)
+  else()
+    list(APPEND _extaddlibdir ENV LD_LIBRARY_PATH)
+  endif()
+  list(APPEND _extaddlibdir "${CMAKE_C_IMPLICIT_LINK_DIRECTORIES}")
+
+  foreach(_library ${_list})
+    if(_library MATCHES "^-Wl,--(start|end)-group$")
+      # Respect linker flags like --start/end-group (required by MKL)
+      set(${LIBRARIES} ${${LIBRARIES}} "${_library}")
+    else()
+      set(_combined_name ${_combined_name}_${_library})
+      if(_libraries_work)
+        find_library(${_prefix}_${_library}_LIBRARY
+          NAMES ${_library}
+          PATHS ${_extaddlibdir}
+          PATH_SUFFIXES ${_subdirs}
+        )
+        #message("DEBUG: find_library(${_library}) got ${${_prefix}_${_library}_LIBRARY}")
+        mark_as_advanced(${_prefix}_${_library}_LIBRARY)
+        set(${LIBRARIES} ${${LIBRARIES}} ${${_prefix}_${_library}_LIBRARY})
+        set(_libraries_work ${${_prefix}_${_library}_LIBRARY})
+      endif()
+    endif()
+  endforeach()
+
+  if(_libraries_work)
+    # Test this combination of libraries.
+    set(CMAKE_REQUIRED_LIBRARIES ${_flags} ${${LIBRARIES}} ${_blas} ${_threadlibs})
+    #message("DEBUG: CMAKE_REQUIRED_LIBRARIES = ${CMAKE_REQUIRED_LIBRARIES}")
+    if(CMAKE_Fortran_COMPILER_LOADED)
+      check_fortran_function_exists("${_name}" ${_prefix}${_combined_name}_WORKS)
+    else()
+      check_function_exists("${_name}_" ${_prefix}${_combined_name}_WORKS)
+    endif()
+    set(CMAKE_REQUIRED_LIBRARIES)
+    set(_libraries_work ${${_prefix}${_combined_name}_WORKS})
+  endif()
+
+  if(_libraries_work)
+    if("${_list}${_blas}" STREQUAL "")
+      set(${LIBRARIES} "${LIBRARIES}-PLACEHOLDER-FOR-EMPTY-LIBRARIES")
+    else()
+      set(${LIBRARIES} ${${LIBRARIES}} ${_blas} ${_threadlibs})
+    endif()
+  else()
+    set(${LIBRARIES} FALSE)
+  endif()
+  #message("DEBUG: ${LIBRARIES} = ${${LIBRARIES}}")
+endmacro()
+
+set(LAPACK_LINKER_FLAGS)
+set(LAPACK_LIBRARIES)
+set(LAPACK95_LIBRARIES)
+
+include(CMakeFindDependencyMacro)
+find_dependency(BLAS)
+
+if(BLAS_FOUND)
+  set(LAPACK_LINKER_FLAGS ${BLAS_LINKER_FLAGS})
+  if(NOT "$ENV{BLA_VENDOR}" STREQUAL "")
+    set(BLA_VENDOR $ENV{BLA_VENDOR})
+  else()
+    if(NOT BLA_VENDOR)
+      set(BLA_VENDOR "All")
+    endif()
+  endif()
+
+  # LAPACK in the Intel MKL 10+ library?
+  if(BLA_VENDOR MATCHES "Intel" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      if(CMAKE_C_COMPILER_LOADED OR CMAKE_CXX_COMPILER_LOADED)
+        # System-specific settings
+        if(NOT WIN32)
+          set(LAPACK_mkl_LM "-lm")
+          set(LAPACK_mkl_LDL "-ldl")
+        endif()
+
+        if(LAPACK_FIND_QUIETLY OR NOT LAPACK_FIND_REQUIRED)
+          find_package(Threads)
+        else()
+          find_package(Threads REQUIRED)
+        endif()
+
+        if(BLA_VENDOR MATCHES "_64ilp")
+          set(LAPACK_mkl_ILP_MODE "ilp64")
+        else()
+          set(LAPACK_mkl_ILP_MODE "lp64")
+        endif()
+
+        set(LAPACK_SEARCH_LIBS "")
+
+        if(BLA_F95)
+          set(LAPACK_mkl_SEARCH_SYMBOL "cheev_f95")
+          set(_LIBRARIES LAPACK95_LIBRARIES)
+          set(_BLAS_LIBRARIES ${BLAS95_LIBRARIES})
+
+          # old
+          list(APPEND LAPACK_SEARCH_LIBS
+            "mkl_lapack95")
+          # new >= 10.3
+          list(APPEND LAPACK_SEARCH_LIBS
+            "mkl_intel_c")
+          list(APPEND LAPACK_SEARCH_LIBS
+            "mkl_lapack95_${LAPACK_mkl_ILP_MODE}")
+        else()
+          set(LAPACK_mkl_SEARCH_SYMBOL "cheev")
+          set(_LIBRARIES LAPACK_LIBRARIES)
+          set(_BLAS_LIBRARIES ${BLAS_LIBRARIES})
+
+          # old and new >= 10.3
+          list(APPEND LAPACK_SEARCH_LIBS
+            "mkl_lapack")
+        endif()
+
+        # MKL uses a multitude of partially platform-specific subdirectories:
+        if(BLA_VENDOR STREQUAL "Intel10_32")
+          set(LAPACK_mkl_ARCH_NAME "ia32")
+        else()
+          set(LAPACK_mkl_ARCH_NAME "intel64")
+        endif()
+        if(WIN32)
+          set(LAPACK_mkl_OS_NAME "win")
+        elseif(APPLE)
+          set(LAPACK_mkl_OS_NAME "mac")
+        else()
+          set(LAPACK_mkl_OS_NAME "lin")
+        endif()
+        if(DEFINED ENV{MKLROOT})
+          file(TO_CMAKE_PATH "$ENV{MKLROOT}" LAPACK_mkl_MKLROOT)
+          # If MKLROOT points to the subdirectory 'mkl', use the parent directory instead
+          # so we can better detect other relevant libraries in 'compiler' or 'tbb':
+          get_filename_component(LAPACK_mkl_MKLROOT_LAST_DIR "${LAPACK_mkl_MKLROOT}" NAME)
+          if(LAPACK_mkl_MKLROOT_LAST_DIR STREQUAL "mkl")
+              get_filename_component(LAPACK_mkl_MKLROOT "${LAPACK_mkl_MKLROOT}" DIRECTORY)
+          endif()
+        endif()
+        set(LAPACK_mkl_LIB_PATH_SUFFIXES
+            "compiler/lib" "compiler/lib/${LAPACK_mkl_ARCH_NAME}_${LAPACK_mkl_OS_NAME}"
+            "mkl/lib" "mkl/lib/${LAPACK_mkl_ARCH_NAME}_${LAPACK_mkl_OS_NAME}"
+            "lib/${LAPACK_mkl_ARCH_NAME}_${LAPACK_mkl_OS_NAME}")
+
+        # First try empty lapack libs
+        if(NOT ${_LIBRARIES})
+          check_lapack_libraries(
+            ${_LIBRARIES}
+            LAPACK
+            ${LAPACK_mkl_SEARCH_SYMBOL}
+            ""
+            ""
+            "${CMAKE_THREAD_LIBS_INIT};${LAPACK_mkl_LM};${LAPACK_mkl_LDL}"
+            "${LAPACK_mkl_MKLROOT}"
+            "${LAPACK_mkl_LIB_PATH_SUFFIXES}"
+            "${_BLAS_LIBRARIES}"
+          )
+        endif()
+
+        # Then try the search libs
+        foreach(IT ${LAPACK_SEARCH_LIBS})
+          string(REPLACE " " ";" SEARCH_LIBS ${IT})
+          if(NOT ${_LIBRARIES})
+            check_lapack_libraries(
+              ${_LIBRARIES}
+              LAPACK
+              ${LAPACK_mkl_SEARCH_SYMBOL}
+              ""
+              "${SEARCH_LIBS}"
+              "${CMAKE_THREAD_LIBS_INIT};${LAPACK_mkl_LM};${LAPACK_mkl_LDL}"
+              "${LAPACK_mkl_MKLROOT}"
+              "${LAPACK_mkl_LIB_PATH_SUFFIXES}"
+              "${_BLAS_LIBRARIES}"
+            )
+          endif()
+        endforeach()
+
+        unset(LAPACK_mkl_ILP_MODE)
+        unset(LAPACK_mkl_SEARCH_SYMBOL)
+        unset(LAPACK_mkl_LM)
+        unset(LAPACK_mkl_LDL)
+        unset(LAPACK_mkl_MKLROOT)
+        unset(LAPACK_mkl_ARCH_NAME)
+        unset(LAPACK_mkl_OS_NAME)
+        unset(LAPACK_mkl_LIB_PATH_SUFFIXES)
+      endif()
+    endif()
+  endif()
+
+  # gotoblas? (http://www.tacc.utexas.edu/tacc-projects/gotoblas2)
+  if(BLA_VENDOR STREQUAL "Goto" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "goto2"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # OpenBLAS? (http://www.openblas.net)
+  if(BLA_VENDOR STREQUAL "OpenBLAS" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "openblas"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # ArmPL? (https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-compiler-for-linux/arm-performance-libraries)
+  if(BLA_VENDOR MATCHES "Arm" OR BLA_VENDOR STREQUAL "All")
+
+    # Check for 64bit Integer support
+    if(BLA_VENDOR MATCHES "_ilp64")
+      set(LAPACK_armpl_LIB "armpl_ilp64")
+    else()
+      set(LAPACK_armpl_LIB "armpl_lp64")
+    endif()
+
+    # Check for OpenMP support, via BLA_VENDOR of Arm_mp or Arm_ilp64_mp
+    if(BLA_VENDOR MATCHES "_mp")
+     set(LAPACK_armpl_LIB "${LAPACK_armpl_LIB}_mp")
+    endif()
+
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "${LAPACK_armpl_LIB}"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # FLAME's blis library? (https://github.com/flame/blis)
+  if(BLA_VENDOR STREQUAL "FLAME" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "flame"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # LAPACK in the ACML library? (ACML ships BLAS and LAPACK together)
+  if(BLA_VENDOR MATCHES "ACML" OR BLA_VENDOR STREQUAL "All")
+    if(BLAS_LIBRARIES MATCHES ".+acml.+")
+      set(LAPACK_LIBRARIES ${BLAS_LIBRARIES})
+    endif()
+  endif()
+
+  # Apple LAPACK library?
+  if(BLA_VENDOR STREQUAL "Apple" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "Accelerate"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # Apple NAS (vecLib) library?
+  if(BLA_VENDOR STREQUAL "NAS" OR BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "vecLib"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+
+  # Generic LAPACK library?
+  if(BLA_VENDOR STREQUAL "Generic" OR
+      BLA_VENDOR STREQUAL "ATLAS" OR
+      BLA_VENDOR STREQUAL "All")
+    if(NOT LAPACK_LIBRARIES)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "lapack"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+    if(NOT LAPACK_LIBRARIES AND NOT WIN32)
+      check_lapack_libraries(
+        LAPACK_LIBRARIES
+        LAPACK
+        cheev
+        ""
+        "lapack;m;gfortran"
+        ""
+        ""
+        ""
+        "${BLAS_LIBRARIES}"
+      )
+    endif()
+  endif()
+else()
+  message(STATUS "LAPACK requires BLAS")
+endif()
+
+if(BLA_F95)
+  if(LAPACK95_LIBRARIES)
+    set(LAPACK95_FOUND TRUE)
+  else()
+    set(LAPACK95_FOUND FALSE)
+  endif()
+  if(NOT LAPACK_FIND_QUIETLY)
+    if(LAPACK95_FOUND)
+      message(STATUS "A library with LAPACK95 API found.")
+    else()
+      if(LAPACK_FIND_REQUIRED)
+        message(FATAL_ERROR
+          "A required library with LAPACK95 API not found. Please specify library location."
+        )
+      else()
+        message(STATUS
+          "A library with LAPACK95 API not found. Please specify library location."
+        )
+      endif()
+    endif()
+  endif()
+  set(LAPACK_FOUND "${LAPACK95_FOUND}")
+  set(LAPACK_LIBRARIES "${LAPACK95_LIBRARIES}")
+else()
+  if(LAPACK_LIBRARIES)
+    set(LAPACK_FOUND TRUE)
+  else()
+    set(LAPACK_FOUND FALSE)
+  endif()
+
+  if(NOT LAPACK_FIND_QUIETLY)
+    if(LAPACK_FOUND)
+      message(STATUS "A library with LAPACK API found.")
+    else()
+      if(LAPACK_FIND_REQUIRED)
+        message(FATAL_ERROR
+          "A required library with LAPACK API not found. Please specify library location."
+        )
+      else()
+        message(STATUS
+          "A library with LAPACK API not found. Please specify library location."
+        )
+      endif()
+    endif()
+  endif()
+endif()
+
+# On compilers that implicitly link LAPACK (such as ftn, cc, and CC on Cray HPC machines)
+# we used a placeholder for empty LAPACK_LIBRARIES to get through our logic above.
+if(LAPACK_LIBRARIES STREQUAL "LAPACK_LIBRARIES-PLACEHOLDER-FOR-EMPTY-LIBRARIES")
+  set(LAPACK_LIBRARIES "")
+endif()
+
+if(NOT TARGET LAPACK::LAPACK)
+  add_library(LAPACK::LAPACK INTERFACE IMPORTED)
+  set(_lapack_libs "${LAPACK_LIBRARIES}")
+  if(_lapack_libs AND TARGET BLAS::BLAS)
+    # remove the ${BLAS_LIBRARIES} from the interface and replace it
+    # with the BLAS::BLAS target
+    list(REMOVE_ITEM _lapack_libs "${BLAS_LIBRARIES}")
+  endif()
+
+  if(_lapack_libs)
+    set_target_properties(LAPACK::LAPACK PROPERTIES
+      INTERFACE_LINK_LIBRARIES "${_lapack_libs}"
+    )
+  endif()
+  unset(_lapack_libs)
+endif()
+
+cmake_pop_check_state()
+# restore original values for CMAKE_FIND_LIBRARY_SUFFIXES
+set(CMAKE_FIND_LIBRARY_SUFFIXES ${_lapack_ORIG_CMAKE_FIND_LIBRARY_SUFFIXES})
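Usage sketch for the FindLAPACK module above. This is a minimal illustration under stated assumptions: the project name lapack_demo, the target solver, and the source file main.c are placeholders, and OpenBLAS is picked only as an example vendor. Because this version of the module removes BLAS_LIBRARIES from the LAPACK::LAPACK interface without appending BLAS::BLAS, the sketch links BLAS::BLAS explicitly when that target exists.

```cmake
cmake_minimum_required(VERSION 3.16)
project(lapack_demo C)            # placeholder project name

# Restrict the search to one vendor; leave BLA_VENDOR unset to try them all.
set(BLA_VENDOR OpenBLAS)
find_package(LAPACK REQUIRED)

add_executable(solver main.c)     # main.c is a placeholder source file
target_link_libraries(solver PRIVATE LAPACK::LAPACK)

# Precaution: link BLAS separately if the BLAS module defined a target.
if(TARGET BLAS::BLAS)
  target_link_libraries(solver PRIVATE BLAS::BLAS)
endif()
```

Projects that cannot use imported targets can link ${LAPACK_LIBRARIES} and pass ${LAPACK_LINKER_FLAGS} instead, as listed under Result Variables.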
diff --git a/CMakeModules/vcpkg/ports/lapack-reference/lapacke.patch b/CMakeModules/vcpkg/ports/lapack-reference/lapacke.patch
new file mode 100644
index 0000000000..964f0e3192
--- /dev/null
+++ b/CMakeModules/vcpkg/ports/lapack-reference/lapacke.patch
@@ -0,0 +1,16 @@
+diff --git a/CMakeLists.txt b/CMakeLists.txt
+index 1ee66f1..7cec7ca 100644
+--- a/CMakeLists.txt
++++ b/CMakeLists.txt
+@@ -392,8 +392,9 @@ endif()
+ set(LAPACK_INSTALL_EXPORT_NAME ${LAPACK_INSTALL_EXPORT_NAME_CACHE})
+ unset(LAPACK_INSTALL_EXPORT_NAME_CACHE)
+ 
+-add_subdirectory(LAPACKE)
+-
++if(LAPACKE)
++    add_subdirectory(LAPACKE)
++endif()
+ 
+ #-------------------------------------
+ # BLAS++ / LAPACK++
diff --git a/CMakeModules/vcpkg/ports/lapack-reference/portfile.cmake b/CMakeModules/vcpkg/ports/lapack-reference/portfile.cmake
new file mode 100644
index 0000000000..f1a180065a
--- /dev/null
+++ b/CMakeModules/vcpkg/ports/lapack-reference/portfile.cmake
@@ -0,0 +1,164 @@
+#TODO: Features to add:
+# USE_XBLAS??? extended precision blas. needs xblas
+# LAPACKE should be its own PORT
+# USE_OPTIMIZED_LAPACK (Probably not what we want. Does a find_package(LAPACK): probably for LAPACKE-only builds -> own port?)
+# LAPACKE Builds LAPACKE
+# LAPACKE_WITH_TMG Build LAPACKE with tmglib routines
+if(EXISTS "${CURRENT_INSTALLED_DIR}/share/clapack/copyright")
+    message(FATAL_ERROR "Can't build ${PORT} if clapack is installed. Please remove clapack:${TARGET_TRIPLET}, and try to install ${PORT}:${TARGET_TRIPLET} again.")
+endif()
+
+include(vcpkg_find_fortran)
+SET(VCPKG_POLICY_EMPTY_INCLUDE_FOLDER enabled)
+
+set(lapack_ver 3.10.1)
+
+vcpkg_from_github(
+    OUT_SOURCE_PATH SOURCE_PATH
+    REPO  "Reference-LAPACK/lapack"
+    REF "v${lapack_ver}"
+    SHA512 0500bbbb48483208c0a35b74972ff0059c389da6032824a2079637266a99fa980882eedf7f1fc490219ee4ff27812ac8c6afe118e25f40a9c2387e7b997762fb
+    HEAD_REF master
+    PATCHES
+        lapacke.patch
+)
+
+if(NOT VCPKG_TARGET_IS_WINDOWS)
+    set(ENV{FFLAGS} "$ENV{FFLAGS} -fPIC")
+endif()
+
+set(CBLAS OFF)
+if("cblas" IN_LIST FEATURES)
+    set(CBLAS ON)
+    if("noblas" IN_LIST FEATURES)
+        message(FATAL_ERROR "Cannot build feature 'cblas' together with feature 'noblas'. cblas requires blas!")
+    endif()
+endif()
+
+set(USE_OPTIMIZED_BLAS OFF)
+if("noblas" IN_LIST FEATURES)
+    set(USE_OPTIMIZED_BLAS ON)
+    set(pcfile "${CURRENT_INSTALLED_DIR}/lib/pkgconfig/openblas.pc")
+    if(EXISTS "${pcfile}")
+        file(CREATE_LINK "${pcfile}" "${CURRENT_PACKAGES_DIR}/lib/pkgconfig/blas.pc" COPY_ON_ERROR)
+    endif()
+    set(pcfile "${CURRENT_INSTALLED_DIR}/debug/lib/pkgconfig/openblas.pc")
+    if(EXISTS "${pcfile}")
+        file(CREATE_LINK "${pcfile}" "${CURRENT_PACKAGES_DIR}/debug/lib/pkgconfig/blas.pc" COPY_ON_ERROR)
+    endif()
+endif()
+
+set(VCPKG_CRT_LINKAGE_BACKUP ${VCPKG_CRT_LINKAGE})
+vcpkg_find_fortran(FORTRAN_CMAKE)
+if(VCPKG_USE_INTERNAL_Fortran)
+    if(VCPKG_CRT_LINKAGE_BACKUP STREQUAL static)
+        # If openblas has been built with static crt linkage we cannot use it with gfortran!
+        set(USE_OPTIMIZED_BLAS OFF)
+        # Cannot use openblas from vcpkg if we are building with gfortran here.
+        if("noblas" IN_LIST FEATURES)
+            message(FATAL_ERROR "Feature 'noblas' cannot be used without supplying an external fortran compiler")
+        endif()
+    endif()
+else()
+    set(USE_OPTIMIZED_BLAS ON)
+endif()
+
+vcpkg_cmake_configure(
+    SOURCE_PATH "${SOURCE_PATH}"
+    OPTIONS
+        "-DUSE_OPTIMIZED_BLAS=${USE_OPTIMIZED_BLAS}"
+        "-DCBLAS=${CBLAS}"
+        "-DLAPACKE=ON"
+        ${FORTRAN_CMAKE}
+)
+
+vcpkg_cmake_install()
+
+vcpkg_cmake_config_fixup(PACKAGE_NAME lapack-${lapack_ver} CONFIG_PATH lib/cmake/lapack-${lapack_ver}) #Should the target path be lapack and not lapack-reference?
+
+message("CURRENT_PACKAGES_DIR: ${CURRENT_PACKAGES_DIR}")
+set(pcfile "${CURRENT_PACKAGES_DIR}/lib/pkgconfig/lapack.pc")
+if(EXISTS "${pcfile}")
+    file(READ "${pcfile}" _contents)
+    set(_contents "prefix=${CURRENT_INSTALLED_DIR}\n${_contents}")
+    file(WRITE "${pcfile}" "${_contents}")
+endif()
+set(pcfile "${CURRENT_PACKAGES_DIR}/debug/lib/pkgconfig/lapack.pc")
+if(EXISTS "${pcfile}")
+    file(READ "${pcfile}" _contents)
+    set(_contents "prefix=${CURRENT_INSTALLED_DIR}/debug\n${_contents}")
+    file(WRITE "${pcfile}" "${_contents}")
+endif()
+set(pcfile "${CURRENT_PACKAGES_DIR}/lib/pkgconfig/lapacke.pc")
+if(EXISTS "${pcfile}")
+    file(READ "${pcfile}" _contents)
+    set(_contents "prefix=${CURRENT_INSTALLED_DIR}\n${_contents}")
+    file(WRITE "${pcfile}" "${_contents}")
+endif()
+set(pcfile "${CURRENT_PACKAGES_DIR}/debug/lib/pkgconfig/lapacke.pc")
+if(EXISTS "${pcfile}")
+    file(READ "${pcfile}" _contents)
+    set(_contents "prefix=${CURRENT_INSTALLED_DIR}/debug\n${_contents}")
+    file(WRITE "${pcfile}" "${_contents}")
+endif()
+if(NOT USE_OPTIMIZED_BLAS AND NOT (VCPKG_TARGET_IS_WINDOWS AND VCPKG_LIBRARY_LINKAGE STREQUAL "static"))
+    set(pcfile "${CURRENT_PACKAGES_DIR}/lib/pkgconfig/blas.pc")
+    if(EXISTS "${pcfile}")
+        file(READ "${pcfile}" _contents)
+        set(_contents "prefix=${CURRENT_INSTALLED_DIR}\n${_contents}")
+        file(WRITE "${pcfile}" "${_contents}")
+    endif()
+    set(pcfile "${CURRENT_PACKAGES_DIR}/debug/lib/pkgconfig/blas.pc")
+    if(EXISTS "${pcfile}")
+        file(READ "${pcfile}" _contents)
+        set(_contents "prefix=${CURRENT_INSTALLED_DIR}/debug\n${_contents}")
+        file(WRITE "${pcfile}" "${_contents}")
+    endif()
+endif()
+if("cblas" IN_LIST FEATURES)
+    set(pcfile "${CURRENT_PACKAGES_DIR}/lib/pkgconfig/cblas.pc")
+    if(EXISTS "${pcfile}")
+        file(READ "${pcfile}" _contents)
+        set(_contents "prefix=${CURRENT_INSTALLED_DIR}\n${_contents}")
+        file(WRITE "${pcfile}" "${_contents}")
+    endif()
+    set(pcfile "${CURRENT_PACKAGES_DIR}/debug/lib/pkgconfig/cblas.pc")
+    if(EXISTS "${pcfile}")
+        file(READ "${pcfile}" _contents)
+        set(_contents "prefix=${CURRENT_INSTALLED_DIR}/debug\n${_contents}")
+        file(WRITE "${pcfile}" "${_contents}")
+    endif()
+endif()
+#vcpkg_fixup_pkgconfig()
+
+# Handle copyright
+file(INSTALL "${SOURCE_PATH}/LICENSE" DESTINATION "${CURRENT_PACKAGES_DIR}/share/${PORT}" RENAME copyright)
+
+# remove debug includes
+file(REMOVE_RECURSE ${CURRENT_PACKAGES_DIR}/debug/include)
+
+if(VCPKG_TARGET_IS_WINDOWS)
+    if(EXISTS "${CURRENT_PACKAGES_DIR}/lib/liblapack.lib")
+        file(RENAME "${CURRENT_PACKAGES_DIR}/lib/liblapack.lib" "${CURRENT_PACKAGES_DIR}/lib/lapack.lib")
+    endif()
+    if(EXISTS "${CURRENT_PACKAGES_DIR}/debug/lib/liblapack.lib")
+        file(RENAME "${CURRENT_PACKAGES_DIR}/debug/lib/liblapack.lib" "${CURRENT_PACKAGES_DIR}/debug/lib/lapack.lib")
+    endif()
+    if(EXISTS "${CURRENT_PACKAGES_DIR}/lib/liblapacke.lib")
+        file(RENAME "${CURRENT_PACKAGES_DIR}/lib/liblapacke.lib" "${CURRENT_PACKAGES_DIR}/lib/lapacke.lib")
+    endif()
+    if(EXISTS "${CURRENT_PACKAGES_DIR}/debug/lib/liblapacke.lib")
+        file(RENAME "${CURRENT_PACKAGES_DIR}/debug/lib/liblapacke.lib" "${CURRENT_PACKAGES_DIR}/debug/lib/lapacke.lib")
+    endif()
+    if(NOT USE_OPTIMIZED_BLAS)
+        if(EXISTS "${CURRENT_PACKAGES_DIR}/lib/libblas.lib")
+            file(RENAME "${CURRENT_PACKAGES_DIR}/lib/libblas.lib" "${CURRENT_PACKAGES_DIR}/lib/blas.lib")
+        endif()
+        if(EXISTS "${CURRENT_PACKAGES_DIR}/debug/lib/libblas.lib")
+            file(RENAME "${CURRENT_PACKAGES_DIR}/debug/lib/libblas.lib" "${CURRENT_PACKAGES_DIR}/debug/lib/blas.lib")
+        endif()
+    endif()
+endif()
+
+file(COPY ${CMAKE_CURRENT_LIST_DIR}/vcpkg-cmake-wrapper.cmake DESTINATION ${CURRENT_PACKAGES_DIR}/share/lapack)
+file(COPY ${CMAKE_CURRENT_LIST_DIR}/FindLAPACK.cmake DESTINATION ${CURRENT_PACKAGES_DIR}/share/lapack)
diff --git a/CMakeModules/vcpkg/ports/lapack-reference/vcpkg-cmake-wrapper.cmake b/CMakeModules/vcpkg/ports/lapack-reference/vcpkg-cmake-wrapper.cmake
new file mode 100644
index 0000000000..b3a7128fff
--- /dev/null
+++ b/CMakeModules/vcpkg/ports/lapack-reference/vcpkg-cmake-wrapper.cmake
@@ -0,0 +1,11 @@
+message(STATUS "Using VCPKG FindLAPACK from package 'lapack-reference'")
+set(LAPACK_PREV_MODULE_PATH ${CMAKE_MODULE_PATH})
+list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_LIST_DIR})
+
+list(REMOVE_ITEM ARGS "NO_MODULE")
+list(REMOVE_ITEM ARGS "CONFIG")
+list(REMOVE_ITEM ARGS "MODULE")
+
+_find_package(${ARGS})
+
+set(CMAKE_MODULE_PATH ${LAPACK_PREV_MODULE_PATH})
diff --git a/CMakeModules/vcpkg/ports/lapack-reference/vcpkg.json b/CMakeModules/vcpkg/ports/lapack-reference/vcpkg.json
new file mode 100644
index 0000000000..b2fe5d6998
--- /dev/null
+++ b/CMakeModules/vcpkg/ports/lapack-reference/vcpkg.json
@@ -0,0 +1,48 @@
+{
+  "name": "lapack-reference",
+  "version": "3.10.1",
+  "description": "LAPACK - Linear Algebra PACKage",
+  "homepage": "http://www.netlib.org/lapack/",
+  "license": "BSD-3-Clause-Open-MPI",
+  "dependencies": [
+    {
+      "name": "vcpkg-cmake",
+      "host": true
+    },
+    {
+      "name": "vcpkg-cmake-config",
+      "host": true
+    },
+    {
+      "name": "vcpkg-gfortran",
+      "platform": "windows"
+    }
+  ],
+  "default-features": [
+    "blas-select"
+  ],
+  "features": {
+    "blas-select": {
+      "description": "Use external optimized BLAS",
+      "dependencies": [
+        {
+          "name": "lapack-reference",
+          "default-features": false,
+          "features": [
+            "noblas"
+          ],
+          "platform": "!windows | !static"
+        }
+      ]
+    },
+    "cblas": {
+      "description": "Builds CBLAS"
+    },
+    "noblas": {
+      "description": "Use external optimized BLAS",
+      "dependencies": [
+        "blas"
+      ]
+    }
+  }
+}
diff --git a/CMakeModules/vcpkg/vcpkg-triplets/x64-windows.cmake b/CMakeModules/vcpkg/vcpkg-triplets/x64-windows.cmake
new file mode 100644
index 0000000000..67dfc468eb
--- /dev/null
+++ b/CMakeModules/vcpkg/vcpkg-triplets/x64-windows.cmake
@@ -0,0 +1,9 @@
+set(VCPKG_TARGET_ARCHITECTURE x64)
+
+if(PORT MATCHES "freetype")
+  set(VCPKG_CRT_LINKAGE static)
+  set(VCPKG_LIBRARY_LINKAGE static)
+else()
+  set(VCPKG_CRT_LINKAGE dynamic)
+  set(VCPKG_LIBRARY_LINKAGE dynamic)
+endif()
diff --git a/CMakeModules/version.h.in b/CMakeModules/version.h.in
index 1462394861..271fa54907 100644
--- a/CMakeModules/version.h.in
+++ b/CMakeModules/version.h.in
@@ -1,7 +1,16 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
 #pragma once
 
-#define AF_VERSION "@AF_VERSION@"
-#define AF_VERSION_MAJOR "@AF_VERSION_MAJOR@"
-#define AF_VERSION_MINOR "@AF_VERSION_MINOR@"
-#define AF_VERSION_PATCH "@AF_VERSION_PATCH@"
-#define AF_REVISION "@GIT_COMMIT_HASH@"
+#define AF_VERSION "@ArrayFire_VERSION@"
+#define AF_VERSION_MAJOR @ArrayFire_VERSION_MAJOR@
+#define AF_VERSION_MINOR @ArrayFire_VERSION_MINOR@
+#define AF_VERSION_PATCH @ArrayFire_VERSION_PATCH@
+#define AF_API_VERSION_CURRENT   @ArrayFire_API_VERSION_CURRENT@
diff --git a/CMakeModules/version_info.rc.in b/CMakeModules/version_info.rc.in
new file mode 100644
index 0000000000..d738ce20d0
--- /dev/null
+++ b/CMakeModules/version_info.rc.in
@@ -0,0 +1,50 @@
+#include <winresrc.h>
+
+#define VER_FILEVERSION             @PRODUCT_VERSION_MAJOR@,@PRODUCT_VERSION_MINOR@,@PRODUCT_VERSION_PATCH@
+#define VER_FILEVERSION_STR         "@PRODUCT_VERSION@\0"
+
+
+#define VER_PRODUCTVERSION          @PRODUCT_VERSION_MAJOR@,@PRODUCT_VERSION_MINOR@,@PRODUCT_VERSION_PATCH@
+#define VER_PRODUCTVERSION_STR      "@PRODUCT_VERSION@\0"
+
+#ifndef NDEBUG
+#define VER_DEBUG VS_FF_DEBUG
+#else
+#define VER_DEBUG 0
+#endif
+
+VS_VERSION_INFO VERSIONINFO
+FILEVERSION     VER_FILEVERSION
+PRODUCTVERSION  VER_PRODUCTVERSION
+FILEFLAGSMASK   VS_FFI_FILEFLAGSMASK
+FILEFLAGS       VER_DEBUG
+FILEOS          VOS__WINDOWS32
+FILETYPE        VFT_DLL
+FILESUBTYPE     VFT2_UNKNOWN
+BEGIN
+    BLOCK "StringFileInfo"
+    BEGIN
+        BLOCK "040904E4"
+        BEGIN
+            VALUE "CompanyName",      "@PRODUCT_COMPANY_NAME@\0"
+            VALUE "FileDescription",  "@PRODUCT_FILE_DESCRIPTION@\0"
+            VALUE "FileVersion",      "@PRODUCT_VERSION@\0"
+            VALUE "InternalName",     "@PRODUCT_INTERNAL_FILE_NAME@\0"
+            VALUE "LegalCopyright",   "@PRODUCT_COMPANY_COPYRIGHT@\0"
+            VALUE "OriginalFilename", "@PRODUCT_ORIGINAL_FILE_NAME@\0"
+            VALUE "ProductName",      "@PRODUCT_FILE_NAME@\0"
+            VALUE "ProductVersion",   "@PRODUCT_VERSION@\0"
+        END
+    END
+
+    BLOCK "VarFileInfo"
+    BEGIN
+        /* The following line should only be modified for localized versions.     */
+        /* It consists of any number of WORD,WORD pairs, with each pair           */
+        /* describing a language,codepage combination supported by the file.      */
+        /*                                                                        */
+        /* For example, a file might have values "0x409,1252" indicating that it  */
+        /* supports English language (0x409) in the Windows ANSI codepage (1252). */
+        VALUE "Translation", 0x409, 1252
+    END
+END
diff --git a/CMakePresets.json b/CMakePresets.json
new file mode 100644
index 0000000000..ba1520ddf5
--- /dev/null
+++ b/CMakePresets.json
@@ -0,0 +1,259 @@
+{
+  "version": 2,
+  "cmakeMinimumRequired": {
+    "major": 3,
+    "minor": 20,
+    "patch": 0
+  },
+  "configurePresets": [
+    {
+      "name": "ninja-all-off-debug",
+      "hidden": true,
+      "description": "Base preset with all backends off with Debug build configuration",
+      "binaryDir": "${sourceDir}/build/${presetName}",
+      "generator": "Ninja",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": {
+          "type": "String",
+          "value": "Debug"
+        },
+        "AF_COMPUTE_LIBRARY": {
+          "type": "String",
+          "value": "Intel-MKL"
+        },
+        "AF_BUILD_CPU": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_BUILD_CUDA": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_BUILD_OPENCL": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_BUILD_UNIFIED": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_BUILD_FORGE": {
+          "type": "BOOL",
+          "value": "ON"
+        },
+        "AF_BUILD_DOCS": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_BUILD_EXAMPLES": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "AF_TEST_WITH_MTX_FILES": {
+          "type": "BOOL",
+          "value": "OFF"
+        },
+        "CMAKE_INSTALL_PREFIX": {
+          "type": "PATH",
+          "value": "${sourceDir}/build/${presetName}/pkg"
+        }
+      }
+    },
+    {
+      "name": "ninja-cpu-mkl-debug",
+      "description": "Build CPU Backend using Intel MKL in Debug Configuration with Ninja Generator",
+      "inherits": "ninja-all-off-debug",
+      "cacheVariables": {
+        "AF_BUILD_CPU": "ON"
+      }
+    },
+    {
+      "name": "ninja-cpu-mkl-relwithdebinfo",
+      "description": "Build CPU Backend using Intel MKL in RelWithDebInfo Configuration with Ninja Generator",
+      "inherits": "ninja-cpu-mkl-debug",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+      }
+    },
+    {
+      "name": "ninja-cpu-debug",
+      "description": "Build CPU Backend with FFTW and a BLAS library using Ninja Generator in Debug Configuration",
+      "inherits": "ninja-cpu-mkl-debug",
+      "cacheVariables": {
+        "AF_COMPUTE_LIBRARY": "FFTW/LAPACK/BLAS"
+      }
+    },
+    {
+      "name": "ninja-cpu-relwithdebinfo",
+      "description": "Build CPU Backend with FFTW and a BLAS library using Ninja Generator in RelWithDebInfo Configuration",
+      "inherits": "ninja-cpu-debug",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+      }
+    },
+    {
+      "name": "ninja-cuda-debug",
+      "description": "Build CUDA Backend in debug configuration using Ninja Generator",
+      "inherits": "ninja-all-off-debug",
+      "cacheVariables": {
+        "AF_BUILD_CUDA": "ON"
+      }
+    },
+    {
+      "name": "ninja-cuda-relwithdebinfo",
+      "description": "Build CUDA Backend in RelWithDebInfo configuration using Ninja Generator",
+      "inherits": "ninja-cuda-debug",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+      }
+    },
+    {
+      "name": "ninja-opencl-mkl-debug",
+      "description": "Build OpenCL Backend in debug configuration using Ninja Generator",
+      "inherits": "ninja-all-off-debug",
+      "cacheVariables": {
+        "AF_BUILD_OPENCL": "ON"
+      }
+    },
+    {
+      "name": "ninja-opencl-mkl-relwithdebinfo",
+      "description": "Build OpenCL Backend in RelWithDebInfo configuration using Ninja Generator. This preset uses Intel MKL for CPU fallback code.",
+      "inherits": "ninja-opencl-mkl-debug",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+      }
+    },
+    {
+      "name": "ninja-opencl-debug",
+      "description": "Build OpenCL Backend in debug configuration using Ninja Generator",
+      "inherits": "ninja-opencl-mkl-debug",
+      "cacheVariables": {
+        "AF_COMPUTE_LIBRARY": "FFTW/LAPACK/BLAS"
+      }
+    },
+    {
+      "name": "ninja-opencl-relwithdebinfo",
+      "description": "Build OpenCL Backend in RelWithDebInfo configuration using Ninja Generator",
+      "inherits": "ninja-opencl-debug",
+      "cacheVariables": {
+        "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+      }
+    },
+    {
+        "name": "ninja-all-mkl-debug",
+        "description": "Build all feasible backends using Ninja Generator in Debug Configuration",
+        "inherits": "ninja-all-off-debug",
+        "cacheVariables": {
+            "AF_BUILD_CPU": "ON",
+            "AF_BUILD_CUDA": "ON",
+            "AF_BUILD_OPENCL": "ON",
+            "AF_BUILD_UNIFIED": "ON"
+        }
+    },
+    {
+        "name": "ninja-all-mkl-relwithdebinfo",
+        "description": "Build all feasible backends using Ninja Generator in RelWithDebInfo Configuration",
+        "inherits": "ninja-all-mkl-debug",
+        "cacheVariables": {
+            "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+        }
+    },
+    {
+        "name": "ninja-all-debug",
+        "description": "Build all feasible backends using Ninja Generator in Debug Configuration",
+        "inherits": "ninja-all-mkl-debug",
+        "cacheVariables": {
+            "AF_COMPUTE_LIBRARY": "FFTW/LAPACK/BLAS"
+        }
+    },
+    {
+        "name": "ninja-all-relwithdebinfo",
+        "description": "Build all feasible backends using Ninja Generator in RelWithDebInfo Configuration",
+        "inherits": "ninja-all-debug",
+        "cacheVariables": {
+            "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+        }
+    },
+    {
+        "name": "ninja-all-mkl-local-install",
+        "description": "Build all feasible backends using Ninja Generator in RelWithDebInfo Configuration",
+        "inherits": "ninja-all-mkl-relwithdebinfo",
+        "cacheVariables": {
+            "BUILD_TESTING": "OFF"
+        }
+    },
+    {
+        "name": "ninja-all-mkl-standalone-install",
+        "description": "Build all feasible backends using Ninja Generator in RelWithDebInfo Configuration",
+        "inherits": "ninja-all-mkl-local-install",
+        "cacheVariables": {
+            "AF_INSTALL_STANDALONE": "ON"
+        }
+    },
+    {
+      "name": "ninja-docs",
+      "description": "Build ArrayFire Documentation, needs doxygen installed",
+      "inherits": "ninja-all-off-debug",
+      "cacheVariables": {
+          "BUILD_TESTING": "OFF",
+          "AF_BUILD_FORGE": "OFF",
+          "AF_BUILD_DOCS": "ON"
+      }
+    },
+    {
+        "name": "ninja-any-debug",
+        "description": "Build available backends in Debug configuration using Ninja Generator",
+        "binaryDir": "${sourceDir}/build/${presetName}",
+        "generator": "Ninja",
+        "cacheVariables": {
+            "CMAKE_BUILD_TYPE": "Debug",
+            "CMAKE_INSTALL_PREFIX": "${sourceDir}/build/${presetName}/pkg"
+        }
+    },
+    {
+        "name": "ninja-any-relwithdebinfo",
+        "description": "Build available backends in RelWithDebInfo configuration using Ninja Generator",
+        "inherits": "ninja-any-debug",
+        "cacheVariables": {
+            "CMAKE_BUILD_TYPE": "RelWithDebInfo"
+        }
+    },
+    {
+      "name": "msvc2019",
+      "hidden": true,
+      "description": "Base preset for Visual Studio 16 2019 generator.",
+      "generator": "Visual Studio 16 2019",
+      "architecture": "x64"
+    },
+    {
+      "name": "msvc2019-cpu-mkl",
+      "description": "Build CPU Backend using Intel MKL with MSVC 2019 Generator",
+      "inherits": [ "msvc2019", "ninja-cpu-mkl-debug" ]
+    },
+    {
+      "name": "msvc2019-cuda",
+      "description": "Build CUDA Backend with MSVC 2019 Generator",
+      "inherits": [ "msvc2019", "ninja-cuda-debug" ]
+    },
+    {
+      "name": "msvc2019-opencl-mkl",
+      "description": "Build OpenCL Backend with MSVC 2019 Generator. Uses MKL for CPU fallback.",
+      "inherits": [ "msvc2019", "ninja-opencl-mkl-debug" ]
+    },
+    {
+      "name": "msvc2019-all-mkl",
+      "description": "Build all feasible Backends with MSVC 2019 Generator. Uses MKL for CPU fallback.",
+      "inherits": [ "msvc2019", "ninja-all-mkl-debug" ]
+    },
+    {
+      "name": "msvc2019-all-mkl-local-install",
+      "description": "Build all feasible Backends with MSVC 2019 Generator. Installs to specified path prefix.",
+      "inherits": [ "msvc2019", "ninja-all-mkl-local-install" ]
+    },
+    {
+      "name": "msvc2019-all-mkl-standalone-install",
+      "description": "Build all feasible Backends with MSVC 2019 Generator. Also packages dependencies while installing to specified path prefix.",
+      "inherits": [ "msvc2019", "ninja-all-mkl-standalone-install" ]
+    }
+  ]
+}
diff --git a/COPYRIGHT.md b/COPYRIGHT.md
index 5e08985e2a..9948438d88 100644
--- a/COPYRIGHT.md
+++ b/COPYRIGHT.md
@@ -2,6 +2,7 @@ Copyrights
 ==========================================
 ##Index
 * [ArrayFire](#arrayfire)
+* [ArrayFire: SIFT](#arrayfire-sift)
 * [FreeImage](#freeimage)
 * [clBLAS](#clblas)
 * [clFFT](#clfft)
@@ -9,9 +10,10 @@ Copyrights
 * [Boost Compute](#boost-compute)
 * [Thrust](#thrust)
 * [Magma](#magma)
+* [glbinding](#glbinding)
 
 ### Introduction
-ArrayFire uses software written by the following parties. Each software is listed with it's copyright, license and home page.
+ArrayFire uses software written by the following parties. Each software is listed with its copyright, license and home page.
 
 All the licenses can be found in the LICENSES directory.
 
@@ -22,6 +24,15 @@ ArrayFire is distributed under the BSD 3-Clause License. A copy of this license
 
 See ArrayFire home page https://github.com/arrayfire/arrayfire for details and links to the source code.
 
+### ArrayFire: SIFT
+Copyright (C) 2014-2015, ArrayFire.
+
+Copyright (c) 2006-2012, Rob Hess <rob@iqengines.com>
+
+ArrayFire SIFT is based on the OpenSIFT project by Rob Hess, licensed and distributed under the BSD 3-Clause License. A full copy of the license is present in the LICENSES directory.
+
+SIFT is an algorithm patented and protected by US Law under the US Patent 6,711,293 (March 23, 2004) assigned to the University of British Columbia. Before using this code or any binary forms generated from it, please verify that you have permission to do so.
+
 ### FreeImage
 Copyright (C) 2014-2015, FreeImage.
 
@@ -29,22 +40,28 @@ FreeImage is distributed under the FreeImage Public License (FIPL) version 1.0.
 
 See FreeImage home page http://freeimage.sourceforge.net/ for details and links to the source code.
 
+**How ArrayFire uses FreeImage:** The ArrayFire source code does not contain any source code from FreeImage. FreeImage can be optionally linked with or disabled during build time. The binary installers of ArrayFire may come packaged with FreeImage.
+
 ### clBLAS
 Copyright (C) 2013-2015 Advanced Micro Devices, Inc.
-This product includes software developed atAdvanced Micro Devices, Inc. (http://www.amd.com).
+This product includes software developed at Advanced Micro Devices, Inc. (http://www.amd.com).
 
 clBLAS is distributed under the Apache License Version 2.0 License. A copy of this license is present in the LICENSES directory.
 
 See clBLAS home page https://github.com/clMathLibraries/clBLAS for details and links to the source code.
 
+**How ArrayFire uses clBLAS:** The ArrayFire source code does not contain any source code from clBLAS. clBLAS is statically linked during build time when building the OpenCL backend.
+
 ### clFFT
 Copyright (C) 2013-2015 Advanced Micro Devices, Inc.
-This product includes software developed atAdvanced Micro Devices, Inc. (http://www.amd.com).
+This product includes software developed at Advanced Micro Devices, Inc. (http://www.amd.com).
 
 clFFT is distributed under the Apache License Version 2.0 License. A copy of this license is present in the LICENSES directory.
 
 See clFFT home page https://github.com/clMathLibraries/clBLAS for details and links to the source code.
 
+**How ArrayFire uses clFFT:** The ArrayFire source code does not contain any source code from clFFT. clFFT is statically linked during build time when building the OpenCL backend.
+
 ### Random123
 Copyright (C) 2010-2015, D. E. Shaw Research.
 
@@ -52,6 +69,8 @@ Random123 is distributed under the BSD 3-Clause License. A copy of this license
 
 See Random123 home page https://www.deshawresearch.com/resources_random123.html for details and links to the source code.
 
+**How ArrayFire uses Random123:** ArrayFire uses a modified and stripped-down version of Random123 in all backends. Each of the source files using the modified version of Random123 contains the original copyright.
+
 ### Boost Compute
 Copyright (C) 2013-2015 Kyle Lutz
 
@@ -59,6 +78,8 @@ Boost Compute is distributed under the Boost Software License, Version 1.0 Licen
 
 See Boost Compute home page https://github.com/boostorg/compute for details and links to the source code.
 
+**How ArrayFire uses Boost Compute:** The ArrayFire source code does not contain any source code from Boost Compute. Boost Compute header files are optionally required to build the OpenCL backend.
+
 ### Thrust
 Copyright (C) 2011-2015 NVIDIA Corporation.
 
@@ -66,23 +87,16 @@ Thrust is distributed under the Apache License Version 2.0 License. A copy of th
 
 See Thrust home page https://github.com/thrust/thrust for details and links to the source code.
 
-### Magma
-Copyright (C) 2015 The University of Tennessee.
+**How ArrayFire uses Thrust:** The ArrayFire source code does not contain any source code from Thrust. Thrust header files are optionally required to build the CUDA backend.
 
-Magma is distributed under the BSD 3-Clause License. A copy of this license is present in the LICENSES directory.
-
-See Magma home page http://icl.cs.utk.edu/magma/index.html for details and links to the source code.
+### clMagma
+Copyright (C) 2015 The University of Tennessee.
 
-### GLEW
-The OpenGL Extension Wrangler Library
-Copyright (C) 2002-2007, Milan Ikits <milan ikits[]ieee org>
-Copyright (C) 2002-2007, Marcelo E. Magallon <mmagallo[]debian org>
-Copyright (C) 2002, Lev Povalahev
+clMagma is distributed under the BSD 3-Clause License. A copy of this license is present in the LICENSES directory.
 
-GLEW is distributed under the BSD 3-Clause License, Mesa 3D License (MIT) and the Khronos License (MIT).
-A copy of these licenses is present in the LICENSES directory.
+See clMagma home page http://icl.cs.utk.edu/magma/index.html for details and links to the source code.
 
-See GLEW home page http://glew.sourceforge.net for details and links to the source code.
+**How ArrayFire uses clMagma:** ArrayFire uses a modified and stripped-down version of clMagma in the OpenCL backend. Each of the source files using the modified version of clMagma contains the original copyright.
 
 ### GLFW
 Copyright (C) 2002-2006 Marcus Geelnard
@@ -91,3 +105,17 @@ Copyright (C) 2006-2011 Camilla Berglund
 GLFW is distributed under the zlib/libpng License. A copy of this license is present in the LICENSES directory.
 
 See GLFW home page http://www.glfw.org for details and links to the source code.
+
+**How ArrayFire uses GLFW:** The ArrayFire source code does not contain any source code from GLFW. GLFW can be optionally linked with or disabled during build time. The binary installers of ArrayFire may come packaged with GLFW.
+
+### glbinding
+
+Copyright (c) 2014-2015 Computer Graphics Systems Group at the Hasso-Plattner-Institute and CG Internals GmbH, Germany.
+
+glbinding is distributed under the MIT License. A copy of this license is present in the LICENSES directory.
+
+See glbinding home page http://www.glbinding.org for details and links to the source code.
+
+**How ArrayFire uses glbinding:** The ArrayFire source code does not contain any source code from glbinding. glbinding is statically linked during build time.
+
+
diff --git a/CPack.txt b/CPack.txt
deleted file mode 100644
index ba3b09aae6..0000000000
--- a/CPack.txt
+++ /dev/null
@@ -1,93 +0,0 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-
-# CPack package generation
-SET(CPACK_GENERATOR "TGZ;STGZ")
-# Create the following installers are as follows:
-#  Windows: Use external packaging, do nothing here
-#  OSX: Deploy as TGZ and STGZ
-IF("${CMAKE_SYSTEM}" MATCHES "Linux")
-    #  Linux: TGZ, STGZ, DEB
-    SET(CPACK_GENERATOR "TGZ;STGZ;DEB;RPM")
-ENDIF()
-
-# Common settings to all packaging tools
-SET(CPACK_PREFIX_DIR ${CMAKE_INSTALL_PREFIX})
-SET(CPACK_PACKAGE_NAME "arrayfire")
-SET(CPACK_PACKAGE_VERSION 3.0.beta)
-SET(CPACK_PACKAGE_VERSION_MAJOR "3")
-SET(CPACK_PACKAGE_VERSION_MINOR "0")
-SET(CPACK_PACKAGE_VERSION_PATCH "beta")
-IF(BUILD_GRAPHICS)
-    SET(CPACK_PACKAGE_FILE_NAME
-    ${CPACK_PACKAGE_NAME}_${CPACK_PACKAGE_VERSION}_${CMAKE_SYSTEM_NAME}_${CMAKE_SYSTEM_PROCESSOR})
-ELSE()
-    SET(CPACK_PACKAGE_FILE_NAME
-        ${CPACK_PACKAGE_NAME}_no-gl_${CPACK_PACKAGE_VERSION}_${CMAKE_SYSTEM_NAME}_${CMAKE_SYSTEM_PROCESSOR})
-ENDIF()
-SET(CPACK_PACKAGE_VENDOR "ArrayFire")
-SET(CPACK_PACKAGE_CONTACT "ArrayFire Development Group <technical@arrayfire.com>")
-SET(CPACK_RESOURCE_FILE_LICENSE "${PROJECT_SOURCE_DIR}/LICENSE")
-SET(CPACK_RESOURCE_FILE_README "${PROJECT_SOURCE_DIR}/README.md")
-
-# Long description of the package
-SET(CPACK_PACKAGE_DESCRIPTION
-"ArrayFire is a high performance software library for parallel computing
-with an easy-to-use API. Its array based function set makes parallel
-programming simple.
-
-ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it
-platform independent and highly portable.
-
-A few lines of code in ArrayFire can replace dozens of lines of parallel
-computing code, saving you valuable time and lowering development costs.")
-
-# Short description of the package
-SET(CPACK_PACKAGE_DESCRIPTION_SUMMARY "A high performance library for parallel computing with an easy-to-use API.")
-
-# Useful descriptions for components
-SET(CPACK_COMPONENT_LIBRARIES_DISPLAY_NAME "ArrayFire libraries")
-SET(CPACK_COMPONENT_DOCUMENTATION_NAME "Doxygen documentation")
-SET(CPACK_COMPONENT_HEADERS_NAME "C/C++ headers")
-SET(CPACK_COMPONENT_CMAKE_NAME "CMake support")
-# Set the default components installed in the package
-SET(CPACK_COMPONENTS_ALL libraries headers documentation cmake)
-
-##
-# Debian package
-##
-SET(CPACK_DEBIAN_PACKAGE_ARCHITECTURE ${PROCESSOR_ARCHITECTURE})
-SET(CPACK_DEBIAN_PACKAGE_DEPENDS "libfreeimage-dev, libatlas3gf-base, libfftw3-dev, liblapacke-dev")
-SET(CPACK_DEBIAN_PACKAGE_SUGGESTS "ocl-icd-libopencl1 (>= 2.0), nvidia-cuda-dev (>= 6.0)")
-
-##
-# RPM package
-##
-SET(CPACK_RPM_PACKAGE_LICENSE "BSD")
-SET(CPACK_PACKAGE_GROUP "Development/Libraries")
-SET(CPACK_RPM_PACKAGE_REQUIRES "freeimage atlas fftw lapack")
-
-##
-# Source package
-##
-SET(CPACK_SOURCE_GENERATOR "TGZ")
-SET(CPACK_SOURCE_PACKAGE_FILE_NAME
-    ${CPACK_PACKAGE_NAME}_src_${CPACK_PACKAGE_VERSION}_${CMAKE_SYSTEM_NAME}_${CMAKE_SYSTEM_PROCESSOR})
-SET(CPACK_SOURCE_IGNORE_FILES
-    "/build"
-    "CMakeFiles"
-    "/\\\\.dir"
-    "/\\\\.git"
-    "/\\\\.gitignore$"
-    ".*~$"
-    "\\\\.bak$"
-    "\\\\.swp$"
-    "\\\\.orig$"
-    "/\\\\.DS_Store$"
-    "/Thumbs\\\\.db"
-    "/CMakeLists.txt.user$"
-    ${CPACK_SOURCE_IGNORE_FILES})
-# Ignore build directories that may be in the source tree
-FILE(GLOB_RECURSE CACHES "${CMAKE_SOURCE_DIR}/CMakeCache.txt")
-
-# Call to CPACK
-INCLUDE(CPack)
diff --git a/CTestConfig.cmake b/CTestConfig.cmake
new file mode 100644
index 0000000000..9bd3a5c9dc
--- /dev/null
+++ b/CTestConfig.cmake
@@ -0,0 +1,13 @@
+## This file should be placed in the root directory of your project.
+## Then modify the CMakeLists.txt file in the root directory of your
+## project to incorporate the testing dashboard.
+## # The following are required to use Dart and the CDash dashboard
+##   ENABLE_TESTING()
+##   INCLUDE(CTest)
+set(CTEST_PROJECT_NAME "ArrayFire")
+set(CTEST_NIGHTLY_START_TIME "01:00:00 UTC")
+
+set(CTEST_DROP_METHOD "https")
+set(CTEST_DROP_SITE "ci.arrayfire.org")
+set(CTEST_DROP_LOCATION "/submit.php?project=ArrayFire")
+set(CTEST_DROP_SITE_CDASH TRUE)
diff --git a/LICENSE b/LICENSE
index ec2f10c01f..d63051d62b 100644
--- a/LICENSE
+++ b/LICENSE
@@ -1,27 +1,12 @@
-Copyright (c) 2014-2015, ArrayFire
+Copyright (c) 2014-2025, ArrayFire
 All rights reserved.
 
-Redistribution and use in source and binary forms, with or without modification,
-are permitted provided that the following conditions are met:
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
 
-* Redistributions of source code must retain the above copyright notice, this
-  list of conditions and the following disclaimer.
+* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
 
-* Redistributions in binary form must reproduce the above copyright notice, this
-  list of conditions and the following disclaimer in the documentation and/or
-  other materials provided with the distribution.
+* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
 
-* Neither the name of the ArrayFire nor the names of its
-  contributors may be used to endorse or promote products derived from
-  this software without specific prior written permission.
+* Neither the name ArrayFire nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
 
-THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
-ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
-WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
-DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR
-ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
-(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
-LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
-ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
-SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
diff --git a/LICENSES/BSD 3-Clause.txt b/LICENSES/BSD 3-Clause.txt
index e6690fd1f0..ffab4f203a 100644
--- a/LICENSES/BSD 3-Clause.txt	
+++ b/LICENSES/BSD 3-Clause.txt	
@@ -1,4 +1,4 @@
-Copyright (c) <year>, <copyright holder>
+Copyright (c) 2018, ArrayFire
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without
diff --git a/LICENSES/Half(MIT) License.txt b/LICENSES/Half(MIT) License.txt
new file mode 100644
index 0000000000..abee50b132
--- /dev/null
+++ b/LICENSES/Half(MIT) License.txt	
@@ -0,0 +1,21 @@
+The MIT License
+
+Copyright (c) 2012-2017 Christian Rau
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in
+all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+THE SOFTWARE.
diff --git a/LICENSES/ISSL License.txt b/LICENSES/ISSL License.txt
new file mode 100644
index 0000000000..7ce92d1317
--- /dev/null
+++ b/LICENSES/ISSL License.txt	
@@ -0,0 +1,29 @@
+Copyright (c) 2018 Intel Corporation.
+
+Use and Redistribution.  You may use and redistribute the software (the “Software”), without modification, provided the following conditions are met:
+
+* Redistributions must reproduce the above copyright notice and the following terms of use in the Software and in the documentation and/or other materials provided with the distribution.
+
+* Neither the name of Intel nor the names of its suppliers may be used to endorse or promote products derived from this Software without specific prior written permission.
+
+* No reverse engineering, decompilation, or disassembly of this Software is permitted.
+
+Limited patent license.  Intel grants you a world-wide, royalty-free, non-exclusive license under patents it now or hereafter owns or controls to make, have made, use, import, offer to sell and sell (“Utilize”) this Software, but solely to the extent that any such patent is necessary to Utilize the Software alone. The patent license shall not apply to any combinations which include this software.  No hardware per se is licensed hereunder.
+
+Third party and other Intel programs.  “Third Party Programs” are the files listed in the “third-party-programs.txt” text file that is included with the Software and may include Intel programs under separate license terms. Third Party Programs, even if included with the distribution of the Materials, are governed by separate license terms and those license terms solely govern your use of those programs.
+
+DISCLAIMER. THIS SOFTWARE IS PROVIDED "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, AND NON-INFRINGEMENT ARE DISCLAIMED. THIS SOFTWARE IS NOT INTENDED FOR USE IN SYSTEMS OR APPLICATIONS WHERE FAILURE OF THE SOFTWARE MAY CAUSE PERSONAL INJURY OR DEATH AND YOU AGREE THAT YOU ARE FULLY RESPONSIBLE FOR ANY CLAIMS, COSTS, DAMAGES, EXPENSES, AND ATTORNEYS’ FEES ARISING OUT OF ANY SUCH USE, EVEN IF ANY CLAIM ALLEGES THAT INTEL WAS NEGLIGENT REGARDING THE DESIGN OR MANUFACTURE OF THE MATERIALS.
+
+LIMITATION OF LIABILITY. IN NO EVENT WILL INTEL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. YOU AGREE TO INDEMNIFY AND HOLD INTEL HARMLESS AGAINST ANY CLAIMS AND EXPENSES RESULTING FROM YOUR USE OR UNAUTHORIZED USE OF THE SOFTWARE.
+
+No support. Intel may make changes to the Software, at any time without notice, and is not obligated to support, update or provide training for the Software.
+
+Termination. Intel may terminate your right to use the Software in the event of your breach of this Agreement and you fail to cure the breach within a reasonable period of time.
+
+Feedback. Should you provide Intel with comments, modifications, corrections, enhancements or other input (“Feedback”) related to the Software Intel will be free to use, disclose, reproduce, license or otherwise distribute or exploit the Feedback in its sole discretion without any obligations or restrictions of any kind, including without limitation, intellectual property rights or licensing obligations.
+
+Compliance with laws. You agree to comply with all relevant laws and regulations governing your use, transfer, import or export (or prohibition thereof) of the Software.
+
+Governing law.  All disputes will be governed by the laws of the United States of America and the State of Delaware without reference to conflict of law principles and subject to the exclusive jurisdiction of the state or federal courts sitting in the State of Delaware, and each party agrees that it submits to the personal jurisdiction and venue of those courts and waives any objections. The United Nations Convention on Contracts for the International Sale of Goods (1980) is specifically excluded and will not apply to the Software.
+
+*Other names and brands may be claimed as the property of others.
\ No newline at end of file
diff --git a/LICENSES/MIT License.txt b/LICENSES/MIT License.txt
deleted file mode 100644
index 2bf24b9b9f..0000000000
--- a/LICENSES/MIT License.txt	
+++ /dev/null
@@ -1,21 +0,0 @@
-The MIT License (MIT)
-
-Copyright (c) <year> <copyright holders>
-
-Permission is hereby granted, free of charge, to any person obtaining a copy
-of this software and associated documentation files (the "Software"), to deal
-in the Software without restriction, including without limitation the rights
-to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
-copies of the Software, and to permit persons to whom the Software is
-furnished to do so, subject to the following conditions:
-
-The above copyright notice and this permission notice shall be included in
-all copies or substantial portions of the Software.
-
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
-THE SOFTWARE.
\ No newline at end of file
diff --git a/LICENSES/OpenSIFT License.txt b/LICENSES/OpenSIFT License.txt
new file mode 100644
index 0000000000..e6f59ba518
--- /dev/null
+++ b/LICENSES/OpenSIFT License.txt	
@@ -0,0 +1,57 @@
+Copyright (c) 2006-2012, Rob Hess <rob@iqengines.com>
+All rights reserved.
+
+The following patent has been issued for methods embodied in this 
+software: "Method and apparatus for identifying scale invariant features 
+in an image and use of same for locating an object in an image," David 
+G. Lowe, US Patent 6,711,293 (March 23, 2004). Provisional application 
+filed March 8, 1999. Asignee: The University of British Columbia. For 
+further details, contact David Lowe (lowe@cs.ubc.ca) or the 
+University-Industry Liaison Office of the University of British 
+Columbia.
+
+Note that restrictions imposed by this patent (and possibly others) 
+exist independently of and may be in conflict with the freedoms granted 
+in this license, which refers to copyright of the program, not patents 
+for any methods that it implements.  Both copyright and patent law must 
+be obeyed to legally use and redistribute this program and it is not the 
+purpose of this license to induce you to infringe any patents or other 
+property right claims or to contest validity of any such claims.  If you 
+redistribute or use the program, then this license merely protects you 
+from committing copyright infringement.  It does not protect you from 
+committing patent infringement.  So, before you do anything with this 
+program, make sure that you have permission to do so not merely in terms 
+of copyright, but also in terms of patent law.
+
+Please note that this license is not to be understood as a guarantee 
+either.  If you use the program according to this license, but in 
+conflict with patent law, it does not mean that the licensor will refund 
+you for any losses that you incur if you are sued for your patent 
+infringement.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are 
+met:
+    * Redistributions of source code must retain the above copyright and 
+      patent notices, this list of conditions and the following 
+      disclaimer.
+    * Redistributions in binary form must reproduce the above copyright
+      notice, this list of conditions and the following disclaimer in 
+      the documentation and/or other materials provided with the 
+      distribution.
+    * Neither the name of Oregon State University nor the names of its 
+      contributors may be used to endorse or promote products derived 
+      from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS 
+IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED 
+TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A 
+PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT 
+HOLDER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, 
+EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, 
+PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR 
+PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF 
+LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING 
+NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS 
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
diff --git a/LICENSES/zlib-libpng License.txt b/LICENSES/zlib-libpng License.txt
index 28c994e86a..eec5469a5d 100644
--- a/LICENSES/zlib-libpng License.txt	
+++ b/LICENSES/zlib-libpng License.txt	
@@ -1,12 +1,21 @@
-The zlib/libpng License
-Copyright (c) <year> <copyright holders>
+Copyright (c) 2002-2006 Marcus Geelnard
+Copyright (c) 2006-2016 Camilla Berglund <elmindreda@glfw.org>
 
-This software is provided 'as-is', without any express or implied warranty. In no event will the authors be held liable for any damages arising from the use of this software.
+This software is provided 'as-is', without any express or implied
+warranty. In no event will the authors be held liable for any damages
+arising from the use of this software.
 
-Permission is granted to anyone to use this software for any purpose, including commercial applications, and to alter it and redistribute it freely, subject to the following restrictions:
+Permission is granted to anyone to use this software for any purpose,
+including commercial applications, and to alter it and redistribute it
+freely, subject to the following restrictions:
 
-1. The origin of this software must not be misrepresented; you must not claim that you wrote the original software. If you use this software in a product, an acknowledgment in the product documentation would be appreciated but is not required.
+1. The origin of this software must not be misrepresented; you must not
+   claim that you wrote the original software. If you use this software
+   in a product, an acknowledgment in the product documentation would
+   be appreciated but is not required.
 
-2. Altered source versions must be plainly marked as such, and must not be misrepresented as being the original software.
+2. Altered source versions must be plainly marked as such, and must not
+   be misrepresented as being the original software.
 
-3. This notice may not be removed or altered from any source distribution.
\ No newline at end of file
+3. This notice may not be removed or altered from any source
+   distribution.
\ No newline at end of file
diff --git a/README.md b/README.md
index 9e08d66d63..eb6dc6a5f6 100644
--- a/README.md
+++ b/README.md
@@ -1,108 +1,205 @@
-<a href="http://arrayfire.com/"><img src="http://arrayfire.com/logos/arrayfire_logo_whitebkgnd.png" width="300"></a>
-
-ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its **array** based function set makes parallel programming simple.
-
-ArrayFire's multiple backends (**CUDA**, **OpenCL** and native **CPU**) make it platform independent and highly portable.
-
-A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving you valuable time and lowering development costs.
-
-### Build ArrayFire from source
-To build ArrayFire from source, please follow the instructions on our [wiki](https://github.com/arrayfire/arrayfire/wiki).
-
-### Download ArrayFire Installers
-We currently have binary tar balls and installers available for the beta version of ArrayFire 3.0. These can be downloaded at the [ArrayFire Downloads](http://go.arrayfire.com/l/37882/2015-03-31/mmhqy) page.
-
-### Support and Contact Info
-
-* Google Groups: https://groups.google.com/forum/#!forum/arrayfire-users
-* ArrayFire Services:  [Consulting](http://arrayfire.com/consulting/)  |  [Support](http://arrayfire.com/support/)   |  [Training](http://arrayfire.com/training/)
-* ArrayFire Blogs: http://arrayfire.com/blog/
-* Email: <mailto:technical@arrayfire.com>
-
-
-### Build Status
-|                 | Build           | Tests           |
-|-----------------|-----------------|-----------------|
-| Linux x86       | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-linux/devel)](http://ci.arrayfire.org/job/arrayfire-linux/branch/devel/)      | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-linux-test/devel)](http://ci.arrayfire.org/job/arrayfire-linux-test/branch/devel/)              |
-| Linux Tegra     | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-tegra/devel)](http://ci.arrayfire.org/job/arrayfire-tegra/branch/devel/)      | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-tegra-test/devel)](http://ci.arrayfire.org/job/arrayfire-tegra-test/branch/devel/)              |
-| Windows         | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-windows/devel)](http://ci.arrayfire.org/job/arrayfire-windows/branch/devel/)  | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-windows-test/devel)](http://ci.arrayfire.org/job/arrayfire-windows-test/branch/devel/)          |
-| OSX             | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-osx/devel)](http://ci.arrayfire.org/job/arrayfire-osx/branch/devel/)          | [![Build Status](http://ci.arrayfire.org/buildStatus/icon?job=arrayfire-osx-test/devel)](http://ci.arrayfire.org/job/arrayfire-osx-test/branch/devel/)                  |
-
-Test coverage: [![Coverage Status](https://coveralls.io/repos/arrayfire/arrayfire/badge.svg?branch=HEAD)](https://coveralls.io/r/arrayfire/arrayfire?branch=HEAD)
-
-### Example
-
-``` C++
-
-#include <arrayfire.h>
-#include <cstdio>
-
-using namespace af;
-
-int main(int argc, char *argv[])
-{
-    try {
-
-        // Select a device and display arrayfire info
-        int device = argc > 1 ? atoi(argv[1]) : 0;
-        af::setDevice(device);
-        af::info();
-
-        printf("Create a 5-by-3 matrix of random floats on the GPU\n");
-        array A = randu(5,3, f32);
-        af_print(A);
-
-        printf("Element-wise arithmetic\n");
-        array B = sin(A) + 1.5;
-        af_print(B);
-
-        printf("Negate the first three elements of second column\n");
-        B(seq(0, 2), 1) = B(seq(0, 2), 1) * -1;
-        af_print(B);
-
-        printf("Fourier transform the result\n");
-        array C = fft(B);
-        af_print(C);
+<p align="center"><a href="http://arrayfire.com/"><img src="http://arrayfire.com/logos/arrayfire_logo_whitebkgnd.png" width="800"></a></p>
+
+ArrayFire is a general-purpose tensor library that simplifies the software
+development process for the parallel architectures found in CPUs, GPUs, and
+other hardware acceleration devices. The library serves users in every
+technical computing market.
+
+Several of ArrayFire's benefits include:
+
+* Hundreds of accelerated [tensor computing
+  functions](https://arrayfire.org/docs/group__arrayfire__func.htm), in the
+  following areas:
+    * Array handling
+    * Computer vision
+    * Image processing
+    * Linear algebra
+    * Machine learning
+    * Standard math
+    * Signal Processing
+    * Statistics
+    * Vector algorithms
+* [Easy to use](http://arrayfire.org/docs/gettingstarted.htm), stable,
+  [well-documented](http://arrayfire.org/docs) API
+* Rigorous benchmarks and tests ensuring top performance and numerical accuracy
+* Cross-platform compatibility with support for CUDA, oneAPI, OpenCL, and
+  native CPU on Windows, Mac, and Linux
+* Built-in visualization functions through
+  [Forge](https://github.com/arrayfire/forge)
+* Commercially friendly open-source licensing
+* Enterprise support from [ArrayFire](http://arrayfire.com)
+
+ArrayFire provides software developers with a high-level abstraction of data
+that resides on the accelerator, the `af::array` object. Developers write code
+that performs operations on ArrayFire arrays, which, in turn, are automatically
+translated into near-optimal kernels that execute on the computational device.
+
+ArrayFire runs on devices ranging from low-power mobile phones to high-power
+GPU-enabled supercomputers. ArrayFire runs on CPUs from all major vendors
+(Intel, AMD, ARM), GPUs from the prominent manufacturers (AMD, Intel, NVIDIA,
+and Qualcomm), as well as a variety of other accelerator devices on Windows,
+Mac, and Linux.
+
+# Getting ArrayFire
+
+Instructions to [install][32] or to build ArrayFire from source can be found on
+the [wiki][1].
+
+### Conway's Game of Life Using ArrayFire
+
+Visit the [Wikipedia page][2] for a description of Conway's Game of Life.
+
+<img align="left"
+src="https://github.com/arrayfire/assets/blob/master/gifs/conway.gif"
+alt="Conway's Game of Life" height="256" width="256">
+
+```cpp
+static const float h_kernel[] = { 1, 1, 1, 1, 0, 1, 1, 1, 1 };
+static const array kernel(3, 3, h_kernel, afHost);
+
+array state = (randu(128, 128, f32) > 0.5).as(f32); // Init state
+Window myWindow(256, 256);
+while(!myWindow.close()) {
+    array nHood = convolve(state, kernel); // Obtain neighbors
+    array C0 = (nHood == 2);  // Generate conditions for life
+    array C1 = (nHood == 3);
+    state = state * C0 + C1;  // Update state
+    myWindow.image(state);    // Display
+}
+```
+The complete source code can be found [here][3].
 
-        printf("Grab last row\n");
-        array c = C.row(end);
-        af_print(c);
+### Perceptron
 
-        printf("Create 2-by-3 matrix from host data\n");
-        float d[] = { 1, 2, 3, 4, 5, 6 };
-        array D(2, 3, d, af::afHost);
-        af_print(D);
+<img align="left"
+src="https://github.com/arrayfire/assets/blob/imgs_readme_improv/gifs/perceptron.gif"
+alt="Perceptron" height="400" width="300">
 
-        printf("Copy last column onto first\n");
-        D.col(0) = D.col(end);
-        af_print(D);
+```cpp
+array predict(const array &X, const array &W) {
+    return sigmoid(matmul(X, W));
+}
 
-        // Sort A
-        printf("Sort A and print sorted array and corresponding indices\n");
-        array vals, inds;
-        sort(vals, inds, A);
-        af_print(vals);
-        af_print(inds);
+array train(const array &X, const array &Y,
+        double alpha = 0.1, double maxerr = 0.05,
+        int maxiter = 1000, bool verbose = false) {
+    array Weights = constant(0, X.dims(1), Y.dims(1));
 
-    } catch (af::exception& e) {
-        fprintf(stderr, "%s\n", e.what());
-        throw;
+    for (int i = 0; i < maxiter; i++) {
+        array P   = predict(X, Weights);
+        array err = Y - P;
+        if (mean<float>(abs(err)) < maxerr) break;
+        Weights += alpha * matmulTN(X, err);
     }
+    return Weights;
 }
+...
 
+array Weights = train(train_feats, train_targets);
+array test_outputs  = predict(test_feats, Weights);
+display_results<true>(test_images, test_outputs,
+                      test_targets, 20);
 ```
 
-### Documentation
+The complete source code can be found [here][31].
 
-You can find our complete documentation over [here](http://www.arrayfire.com/docs/index.htm).
+For more code examples, visit the [`examples/`][4] directory.
 
-Quick links:
+# Documentation
 
-- [Download Binaries](http://www.arrayfire.com/download/)
-- [List of functions](http://www.arrayfire.com/docs/group__arrayfire__func.htm)
-- [Tutorials](http://www.arrayfire.com/docs/gettingstarted.htm)
-- [Examples](http://www.arrayfire.com/docs/examples.htm)
+You can find the complete documentation [here](http://www.arrayfire.com/docs/index.htm).
 
-### Contribute
+Quick links:
 
-Contributions of any kind are welcome! Please refer to [this document](https://github.com/arrayfire/arrayfire/blob/master/CONTRIBUTING.md) to learn more about how you can get involved with ArrayFire.
+* [List of functions](http://www.arrayfire.org/docs/group__arrayfire__func.htm)
+* [Tutorials](http://arrayfire.org/docs/tutorials.htm)
+* [Examples](http://www.arrayfire.org/docs/examples.htm)
+* [Blog](http://arrayfire.com/blog/)
+
+# Language support
+
+ArrayFire has several official and community-maintained language APIs:
+
+[![C++][5]][6] [![Python][7]][8] [![Rust][9]][10] [![Julia][27]][28]<sub><span>&#8224;</span></sub>
+[![Nim][29]][30]<sub><span>&#8224;</span></sub>
+
+<sup><span>&#8224;</span></sup>&nbsp; Community-maintained wrappers
+
+__In-Progress Wrappers__
+
+[![.NET][11]][12] [![Fortran][13]][14] [![Go][15]][16]
+[![Java][17]][18] [![Lua][19]][20] [![NodeJS][21]][22] [![R][23]][24] [![Ruby][25]][26]
+
+# Contributing
+
+The community of ArrayFire developers invites you to build with us if you are
+interested and able to write top-performing tensor functions. Together we can
+fulfill [The ArrayFire
+Mission](https://github.com/arrayfire/arrayfire/wiki/The-ArrayFire-Mission-Statement)
+for fast scientific computing for all.
+
+Contributions of any kind are welcome! Please refer to [the
+wiki](https://github.com/arrayfire/arrayfire/wiki) and our [Code of
+Conduct][33] to learn more about how you can get involved with the ArrayFire
+Community through
+[Sponsorship](https://github.com/arrayfire/arrayfire/wiki/Sponsorship),
+[Developer
+Commits](https://github.com/arrayfire/arrayfire/wiki/Contributing-Code-to-ArrayFire),
+or [Governance](https://github.com/arrayfire/arrayfire/wiki/Governance).
+
+# Citations and Acknowledgements
+
+If you redistribute ArrayFire, please follow the terms established in [the
+license](LICENSE). If you wish to cite ArrayFire in an academic publication,
+please use the following [citation document](.github/CITATION.md).
+
+ArrayFire development is funded by AccelerEyes LLC and several third parties.
+Please see the list of [acknowledgements](ACKNOWLEDGEMENTS.md) for an
+expression of our gratitude.
+
+# Support and Contact Info
+
+* [Slack Chat](https://join.slack.com/t/arrayfire-org/shared_invite/MjI4MjIzMDMzMTczLTE1MDI5ODg4NzYtN2QwNGE3ODA5OQ)
+* [Google Groups](https://groups.google.com/forum/#!forum/arrayfire-users)
+* ArrayFire Services:  [Consulting](http://arrayfire.com/consulting)  |  [Support](http://arrayfire.com/download)   |  [Training](http://arrayfire.com/training)
+
+# Trademark Policy
+
+The literal mark "ArrayFire" and ArrayFire logos are trademarks of AccelerEyes
+LLC (dba ArrayFire). If you wish to use either of these marks in your own
+project, please consult [ArrayFire's Trademark
+Policy](http://arrayfire.com/trademark-policy/).
+
+[1]: https://github.com/arrayfire/arrayfire/wiki
+[2]: https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life
+[3]: https://github.com/arrayfire/arrayfire/blob/master/examples/graphics/conway_pretty.cpp
+[4]: https://github.com/arrayfire/arrayfire/blob/master/examples/
+[5]: https://img.shields.io/badge/c++-%2300599C.svg?style=for-the-badge&logo=c%2B%2B&logoColor=white
+[6]: http://arrayfire.org/docs/gettingstarted.htm#gettingstarted_api_usage
+[7]: https://img.shields.io/badge/python-%2314354C.svg?style=for-the-badge&logo=python&logoColor=white
+[8]: https://github.com/arrayfire/arrayfire-python
+[9]: https://img.shields.io/badge/rust-%23000000.svg?style=for-the-badge&logo=rust&logoColor=white
+[10]: https://github.com/arrayfire/arrayfire-rust
+[11]: https://img.shields.io/badge/.NET-5C2D91?style=for-the-badge&logo=.net&logoColor=white
+[12]: https://github.com/arrayfire/arrayfire-dotnet
+[13]: https://img.shields.io/badge/F-Fortran-734f96?style=for-the-badge
+[14]: https://github.com/arrayfire/arrayfire-fortran
+[15]: https://img.shields.io/badge/go-%2300ADD8.svg?style=for-the-badge&logo=go&logoColor=white
+[16]: https://github.com/arrayfire/arrayfire-go
+[17]: https://img.shields.io/badge/java-%23ED8B00.svg?style=for-the-badge&logo=java&logoColor=white
+[18]: https://github.com/arrayfire/arrayfire-java
+[19]: https://img.shields.io/badge/lua-%232C2D72.svg?style=for-the-badge&logo=lua&logoColor=white
+[20]: https://github.com/arrayfire/arrayfire-lua
+[21]: https://img.shields.io/badge/javascript-%23323330.svg?style=for-the-badge&logo=javascript&logoColor=%23F7DF1E
+[22]: https://github.com/arrayfire/arrayfire-js
+[23]: https://img.shields.io/badge/r-%23276DC3.svg?style=for-the-badge&logo=r&logoColor=white
+[24]: https://github.com/arrayfire/arrayfire-r
+[25]: https://img.shields.io/badge/ruby-%23CC342D.svg?style=for-the-badge&logo=ruby&logoColor=white
+[26]: https://github.com/arrayfire/arrayfire-rb
+[27]: https://img.shields.io/badge/j-Julia-cb3c33?style=for-the-badge&labelColor=4063d8
+[28]: https://github.com/JuliaComputing/ArrayFire.jl
+[29]: https://img.shields.io/badge/n-Nim-000000?style=for-the-badge&labelColor=efc743
+[30]: https://github.com/bitstormGER/ArrayFire-Nim
+[31]: https://github.com/arrayfire/arrayfire/blob/master/examples/machine_learning/perceptron.cpp
+[32]: https://github.com/arrayfire/arrayfire/wiki/Getting-ArrayFire
+[33]: https://github.com/arrayfire/arrayfire/wiki/Code-Of-Conduct
diff --git a/assets b/assets
deleted file mode 160000
index b71c025475..0000000000
--- a/assets
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit b71c02547544f0f6e1becb9544bbc2aed5dccf76
diff --git a/conanfile.py b/conanfile.py
new file mode 100644
index 0000000000..13169b943b
--- /dev/null
+++ b/conanfile.py
@@ -0,0 +1,126 @@
+from conans import ConanFile, CMake, tools
+from conans.errors import ConanInvalidConfiguration
+import os
+
+ARRAYFIRE_VERSION = "3.7.1"
+BINARY_INSTALLER_NAME_SUFFIX = "-1"
+BINARY_INSTALLER_NAME = f"ArrayFire-v{ARRAYFIRE_VERSION}{BINARY_INSTALLER_NAME_SUFFIX}_Linux_x86_64.sh"
+CUDA_TOOLKIT_VERSION = "10.0"
+
+class ArrayFireConan(ConanFile):
+    name = "arrayfire"
+    version = ARRAYFIRE_VERSION
+    license = "BSD"
+    author = "jacobkahn jacobkahn1@gmail.com"
+    url = "https://github.com/arrayfire/arrayfire"
+    requires = []
+    description = "ArrayFire: a general purpose GPU library"
+    topics = ("arrayfire", "gpu", "cuda", "opencl", "gpgpu",
+              "hpc", "performance", "scientific-computing")
+    settings = "os", "compiler", "build_type", "arch"
+    options = {
+        "cpu_backend": [True, False],
+        "cuda_backend": [True, False],
+        "opencl_backend": [True, False],
+        "unified_backend": [True, False],
+        "graphics": [True, False],
+    }
+    generators = "cmake"  # unused
+
+    def configure(self):
+        if self.settings.os == "Windows":
+            raise ConanInvalidConfiguration(
+                "Linux binary installer not compatible with Windows.")
+
+    def requirements(self):
+        if self.options.graphics:
+            self.requires('glfw/3.3.2@bincrafters/stable')
+
+    def _download_arrayfire(self):
+        self.af_installer_local_path = BINARY_INSTALLER_NAME
+        if not os.path.exists(self.af_installer_local_path):
+            self.output.info(
+                f"Downloading the ArrayFire {ARRAYFIRE_VERSION} binary installer...")
+            tools.download(
+                f"https://arrayfire.s3.amazonaws.com/{ARRAYFIRE_VERSION}/{BINARY_INSTALLER_NAME}", self.af_installer_local_path)
+            self.output.success(
+                f"ArrayFire {ARRAYFIRE_VERSION} binary installer successfully downloaded to {self.af_installer_local_path}")
+        else:
+            self.output.info(
+                f"ArrayFire {ARRAYFIRE_VERSION} binary installer already exists - skipping download.")
+
+    def _unpack_arrayfire(self):
+        if not os.path.exists(self.af_unpack_path):
+            os.mkdir(self.af_unpack_path)
+        self.output.info(
+            f"Unpacking ArrayFire {ARRAYFIRE_VERSION} binary installer...")
+        cmd = f"bash {self.af_installer_local_path} --prefix={self.af_unpack_path} --skip-license"
+        self.run(cmd)
+        self.output.success(
+            f"ArrayFire {ARRAYFIRE_VERSION} successfully unpacked.")
+
+    def _process_arrayfire(self):
+        # Install ArrayFire to requisite path
+        self.af_unpack_path = os.path.join(self.source_folder, 'arrayfire')
+
+        # Only proceed if missing
+        if os.path.exists(os.path.join(self.af_unpack_path, 'include', 'arrayfire.h')):
+            self.output.info(
+                f"ArrayFire {ARRAYFIRE_VERSION} already unpacked - skipping.")
+        else:
+            self._download_arrayfire()
+            self._unpack_arrayfire()
+
+    def build(self):
+        self._process_arrayfire()
+
+    def package(self):
+        # libs
+        self.copy("*.so", dst="lib", keep_path=False, symlinks=True)
+        self.copy("*.so.*", dst="lib", keep_path=False, symlinks=True)
+
+        # headers
+        self.copy("*.h", dst="include", src="arrayfire/include")
+        self.copy("*.hpp", dst="include", src="arrayfire/include")
+
+    def package_info(self):
+        self.cpp_info.libs = []
+        if self.options.unified_backend:
+            self.cpp_info.libs.extend([
+                f"libaf.so.{ARRAYFIRE_VERSION}",
+            ])
+        if self.options.graphics:
+            self.cpp_info.libs.extend([
+                "libforge.so.1.0.5",
+            ])
+        if self.options.cuda_backend:
+            self.cpp_info.libs.extend([
+                f"libafcuda.so.{ARRAYFIRE_VERSION}",
+                "libnvrtc-builtins.so",
+                f"libcudnn.so.{CUDA_TOOLKIT_VERSION}",
+                f"libcusparse.so.{CUDA_TOOLKIT_VERSION}",
+                f"libcublas.so.{CUDA_TOOLKIT_VERSION}",
+                f"libcusolver.so.{CUDA_TOOLKIT_VERSION}",
+                f"libnvrtc.so.{CUDA_TOOLKIT_VERSION}",
+                f"libcufft.so.{CUDA_TOOLKIT_VERSION}",
+            ])
+        if self.options.cpu_backend:
+            self.cpp_info.libs.extend([
+                f"libafcpu.so.{ARRAYFIRE_VERSION}",
+                "libmkl_avx2.so",
+                "libmkl_mc.so",
+                "libmkl_intel_lp64.so",
+                "libmkl_core.so",
+                "libmkl_avx.so",
+                "libmkl_def.so",
+                "libiomp5.so",
+                "libmkl_avx512.so",
+                "libmkl_intel_thread.so",
+                "libmkl_mc3.so",
+
+            ])
+        if self.options.opencl_backend:
+            self.cpp_info.libs.extend([
+                f"libafopencl.so.{ARRAYFIRE_VERSION}",
+                "libOpenCL.so.1",
+            ])
diff --git a/docs/CMakeLists.txt b/docs/CMakeLists.txt
index a4058a8388..93ba6615e8 100644
--- a/docs/CMakeLists.txt
+++ b/docs/CMakeLists.txt
@@ -1,40 +1,59 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-PROJECT(ARRAYFIRE_DOCS)
 
-# Doxygen is required for the documentation to be built. Do not fail silently.
-FIND_PACKAGE(Doxygen REQUIRED)
+include(Version)
+set(AF_DOCS_CONFIG "${CMAKE_CURRENT_SOURCE_DIR}/doxygen.mk")
+set(AF_DOCS_CONFIG_OUT "${CMAKE_CURRENT_BINARY_DIR}/doxygen.mk.out")
 
-SET(AF_DOCS_CONFIG "${CMAKE_CURRENT_SOURCE_DIR}/doxygen.mk")
-SET(AF_DOCS_CONFIG_OUT "${CMAKE_CURRENT_BINARY_DIR}/doxygen.mk.out")
+set(AF_DOCS_LAYOUT "${CMAKE_CURRENT_SOURCE_DIR}/layout.xml")
+set(AF_DOCS_LAYOUT_OUT "${CMAKE_CURRENT_BINARY_DIR}/layout.xml.out")
 
-SET(AF_DOCS_LAYOUT "${CMAKE_CURRENT_SOURCE_DIR}/layout.xml")
-SET(AF_DOCS_LAYOUT_OUT "${CMAKE_CURRENT_BINARY_DIR}/layout.xml.out")
+set(DOCS_DIR ${CMAKE_CURRENT_SOURCE_DIR})
+set(INCLUDE_DIR "${PROJECT_SOURCE_DIR}/include")
+set(EXAMPLES_DIR "${PROJECT_SOURCE_DIR}/examples")
+set(SNIPPETS_DIR "${PROJECT_SOURCE_DIR}/test")
+configure_file(${AF_DOCS_CONFIG} ${AF_DOCS_CONFIG_OUT})
+configure_file(${AF_DOCS_LAYOUT} ${AF_DOCS_LAYOUT_OUT})
 
-SET(DOCS_DIR ${CMAKE_CURRENT_SOURCE_DIR})
-SET(ASSETS_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../assets")
-SET(INCLUDE_DIR "${CMAKE_SOURCE_DIR}/include")
-SET(EXAMPLES_DIR "${CMAKE_SOURCE_DIR}/examples")
-SET(SNIPPETS_DIR "${CMAKE_SOURCE_DIR}/test")
-CONFIGURE_FILE(${AF_DOCS_CONFIG} ${AF_DOCS_CONFIG_OUT})
-CONFIGURE_FILE(${AF_DOCS_LAYOUT} ${AF_DOCS_LAYOUT_OUT})
+###########################################################
+## This generates a list of the examples cpp files and
+## creates a dox file under docs/details/examples.dox
+## This is used to generate documentation for examples
+###########################################################
+file(GLOB EXAMPLES_CPP
+    "${EXAMPLES_DIR}/*/*.cpp")
 
-ADD_CUSTOM_TARGET(docs
+# Sort alphabetically
+# Note: the example directory name is the primary sort key
+list(SORT EXAMPLES_CPP)
+
+# Get filenames and write to a string
+foreach(SRC ${EXAMPLES_CPP})
+    get_filename_component(DIR_PATH ${SRC} DIRECTORY)
+    get_filename_component(DIR_NAME ${DIR_PATH} NAME)
+    get_filename_component(SRC_NAME ${SRC} NAME)
+    set(EXAMPLES_LIST "${EXAMPLES_LIST}\\example ${DIR_NAME}/${SRC_NAME}\n")
+endforeach(SRC ${EXAMPLES_CPP})
+
+# Write string containing file names to examples.dox
+configure_file(
+    ${PROJECT_SOURCE_DIR}/CMakeModules/examples.dox.in
+    ${DOCS_DIR}/details/examples.dox
+)
+###########################################################
+add_custom_target(docs
     ALL
-    COMMAND ${DOXYGEN_EXECUTABLE} ${AF_DOCS_CONFIG_OUT}
-    COMMAND cmake -E copy_directory ${ASSETS_DIR} ${CMAKE_CURRENT_BINARY_DIR}
+    COMMAND Doxygen::doxygen ${AF_DOCS_CONFIG_OUT}
+    COMMAND cmake -E copy_directory ${ASSETS_DIR} ${CMAKE_CURRENT_BINARY_DIR}/html
     WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
     COMMENT "Generating Documentation"
     VERBATIM)
 
 # Install Doxygen documentation
-INSTALL(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/ DESTINATION ${AF_INSTALL_DOC_DIR}
+install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/html DESTINATION ${AF_INSTALL_DOC_DIR}
     COMPONENT documentation
-    PATTERN "*"
-    PATTERN "CMakeFiles" EXCLUDE
-    PATTERN "man" EXCLUDE
+    PATTERN ".git" EXCLUDE
 )
 
 # Install man pages
-#INSTALL(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/man DESTINATION ${AF_INSTALL_MAN_DIR}
+#install(DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/man DESTINATION ${AF_INSTALL_MAN_DIR}
 #    COMPONENT documentation
 #)
diff --git a/docs/arrayfire.css b/docs/arrayfire.css
index 44808a7538..c9a0417fb0 100644
--- a/docs/arrayfire.css
+++ b/docs/arrayfire.css
@@ -1,187 +1,22 @@
-/* The standard CSS for doxygen 1.8.5 */
-
-body, table, div, p, dl
-{
-    font            :   400 12px/22px Lucida Grande, Verdana, Geneva, Arial, sans-serif;
-}
-
-p
-{
-    padding-left    :   10px;
-}
-
-/* @group Heading Levels */
-/* Increase the size of the page title */
-.title
-{
-    font-size       :   250%;
-}
-
-/* Remove space above line items */
-ul
-{
-    margin-top      :   0em;
-}
-
-/* Slightly pad subsections */
-h2, h3, h4, h5
-{
-    padding-left    :   10px;
-    margin-bottom   :   0px;
-}
-
-/* Margins on the left of the code */
-div.line
-{
-    margin-left :   15px;
-}
-
-a.code, a.code:visited, a.line, a.line:visited
-{
-    color       :   #4665A2;
-}
-
-a.codeRef, a.codeRef:visited, a.lineRef, a.lineRef:visited
-{
-    color       :   #4665A2;
-}
-
-@font-face
-{
-    font-family :   prototype;
-    src         :   url('Prototype.ttf');
-}
-
-/*image and image groups*/
-div.image_group
-{
-    text-align  :   center;
-}
-
-div.image_group > div
-{
-    display     :   inline-block;
-}
-
-div.scaled > img
-{
-    max-width   :   250px;
-}
-
-div.scaled > img:hover
-{
-    -ms-transform       :   scale(2, 2);
-    -webkit-transform   :   scale(2, 2);
-    -moz-transform      :   scale(2, 2);
-    transform           :   scale(2, 2);
-}
-
-/*ArrayFire Feature Support Settings*/
-div.support
-{
-    text-align  :   right;
-}
-
-div.support *
-{
-    display     :   inline-block;
-    max-width   :   50px;
-}
-
-#under_logo
-{
-    font-family :   prototype;
-    font-size   :   2em;
-    max-width   :   25px;
-    color       :   #000000;
-}
-
-#projectbrief
-{
-    font-family :   prototype;
-    color       :   #555555
-}
-
-#projectlogo
-{
-    width       :   300px;
-    text-align  :   left;
-}
-
-#projectnumber
-{
-    max-width   :   25px;
-}
-
-#projectname
-{
-    font-family     :   prototype;
-    font-size       :   3em;
-    max-width       :   25px;
-    color           :   #555555
-}
-
-#gsearch
-{
-    width       :   150px;
-}
-
-.tablist span
-{
-    font-weight     :   normal;
-    font-family     :   "Raleway","Helvetica Neue",Helvetica,sans-serif;
-    color           :   #FFFFFF;
-    text-shadow     :   none;
-}
-
-#nav-tree
-{
-    background-color    : #F7F7F7;
-}
-
-div.toc
-{
-    background-color    : #F7F7F7;
-    border              : 1px solid #DFDFDF;
-}
-
-#nav-tree
-{
-    background-color    : #F7F7F7;
-}
-
-div.toc
-{
-    background-color    : #F7F7F7;
-    border              : 1px solid #DFDFDF;
-}
-
-.tablist a
-{
-    background-image:url('tab_b.png');
-}
-
-div.header
-{
-    background-image    :   none;
-    background-color    :   #F7F7F7;
-    border-bottom       :   1px solid #DFDFDF;
-}
-
-#nav-tree
-{
-    background-image    :   none;
-}
-
-.ui-resizable-e
-{
-    background  :   url("ftv2splitbar1.png") repeat scroll right center transparent;
-}
-
-div.fragment
-{
-    background-color    :   #F7F7F7;
-    border              :   1px solid #DFDFDF;
-}
-
-/* @end */
+/*
+Override the Google search bar CSS to better match the doxygen-awesome dark theme
+*/
+.cse input.gsc-input,input.gsc-input,.gsc_input-box,.gsc-input-box-focus{
+	border-radius: 4px !important;
+	background-image:none !important;
+	color-scheme: light !important;
+	-webkit-box-sizing: border-box !important;
+	-moz-box-sizing: content-box !important;
+	box-sizing: content-box !important;
+	border: none !important;
+	outline: none !important;
+}
+.gsc-control-cse {
+	padding: 0px !important;
+	border: none !important;
+	outline: none !important;
+	background-color: transparent !important;
+}
+.gsc-clear-button {
+	display:none !important;
+}
\ No newline at end of file
diff --git a/docs/details/algorithm.dox b/docs/details/algorithm.dox
index e168e32da8..69633524e2 100644
--- a/docs/details/algorithm.dox
+++ b/docs/details/algorithm.dox
@@ -1,155 +1,492 @@
 /*!
 \page batch_detail_algo algorithm
-
-This function performs the operation across all batches present in the input simultaneously.
-
+This function runs across all batches in the input simultaneously.
 */
 
 
+
 /**
 \addtogroup arrayfire_func
 @{
-\defgroup reduce_func_sum sum
 
+
+
+\defgroup reduce_func_sum sum
 \ingroup reduce_mat
 
-Find the sum of values in the input
+Sum array elements over a given dimension.
+
+This table defines output types for corresponding input types:
+
+Input Type          | Output Type
+--------------------|---------------------
+f32, f64, c32, c64  | same as input
+s32, s64, u32, u64  | same as input
+s16, s8             | s32
+u16, u8, b8         | u32
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_product product
+\defgroup reduce_func_sum_by_key sumByKey
+\ingroup reduce_mat
+
+Sum array elements over a given dimension, according to an array of keys.
+
+The values corresponding to each group of consecutive equal keys will be summed
+together. Keys can repeat; however, only consecutive key values will be
+considered for each reduction. If a key value is repeated somewhere else in the
+keys array it will be considered the start of a new reduction. There are two
+outputs: the reduced set of consecutive keys and the corresponding final
+set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_sum_by_key
+
+The keys' input type must be integer (s32 or u32).
+
+This table defines output types for corresponding input types:
+
+Input Type          | Output Type
+--------------------|---------------------
+f32, f64, c32, c64  | same as input
+s32, s64, u32, u64  | same as input
+s16, s8             | s32
+u16, u8, b8         | u32
+f16                 | f32
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_sum_by_key_dim
+
 
+
+\defgroup reduce_func_product product
 \ingroup reduce_mat
 
-Find the product of values in the input
+Multiply array elements over a given dimension.
+
+This table defines output types for corresponding input types:
+
+Input Type          | Output Type
+--------------------|---------------------
+f32, f64, c32, c64  | same as input
+s32, u32, s64, u64  | same as input
+s16, s8             | s32
+u16, u8, b8         | u32
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_min min
+\defgroup reduce_func_product_by_key productByKey
+\ingroup reduce_mat
+
+Multiply array elements over a given dimension, according to an array of keys.
+
+The values corresponding to each group of consecutive equal keys will be
+multiplied together. Keys can repeat; however, only consecutive key values will
+be considered for each reduction. If a key value is repeated somewhere else in
+the keys array it will be considered the start of a new reduction. There are
+two outputs: the reduced set of consecutive keys and the corresponding final
+set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_product_by_key
 
+The keys' input type must be integer (s32 or u32).
+
+This table defines output types for corresponding input types:
+
+Input Type          | Output Type
+--------------------|---------------------
+f32, f64, c32, c64  | same as input
+s32, u32, s64, u64  | same as input
+s16, s8             | s32
+u16, u8, b8         | u32
+f16                 | f32
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_product_by_key_dim
+
+
+
+\defgroup reduce_func_min min
 \ingroup reduce_mat
 
-Find the minimum values and their locations
+Return the minimum along a given dimension.
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_max max
+\defgroup reduce_func_min_by_key minByKey
+\ingroup reduce_mat
 
+Return the minimum along a given dimension, according to an array of keys.
+
+The minimum is returned from the values corresponding to each group of 
+consecutive equal keys. Keys can repeat; however, only consecutive key values
+will be considered for each reduction. If a key value is repeated somewhere
+else in the keys array it will be considered the start of a new reduction.
+There are two outputs: the reduced set of consecutive keys and the
+corresponding final set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_min_by_key
+
+The keys' input type must be integer (s32 or u32).
+
+The output type is the same as input type.
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_min_by_key_dim
+
+
+
+\defgroup reduce_func_max max
 \ingroup reduce_mat
 
-Find the maximum values and their locations
+Return the maximum along a given dimension.
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_all_true alltrue
+\defgroup reduce_func_max_by_key maxByKey
+\ingroup reduce_mat
+
+Return the maximum along a given dimension, according to an array of keys.
+
+The maximum is returned from the values corresponding to each group of 
+consecutive equal keys. Keys can repeat; however, only consecutive key values
+will be considered for each reduction. If a key value is repeated somewhere
+else in the keys array it will be considered the start of a new reduction.
+There are two outputs: the reduced set of consecutive keys and the
+corresponding final set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_max_by_key
+
+The keys' input type must be integer (s32 or u32).
+
+The output type is the same as input type.
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_max_by_key_dim
 
+
+
+\defgroup reduce_func_all_true allTrue
 \ingroup reduce_mat
 
-Find if of all of the values in input are true
+Check if all values along a given dimension are true.
+
+Return type is `b8` for all input types.
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_any_true anytrue
+\defgroup reduce_func_all_true_by_key allTrueByKey
+\ingroup reduce_mat
+
+Check if all values along a given dimension are true, according to an array of
+keys.
+
+All values corresponding to each group of consecutive equal keys will be tested
+to make sure all are true. Keys can repeat; however, only consecutive key
+values will be considered for each reduction. If a key value is repeated
+somewhere else in the keys array it will be considered the start of a new
+reduction. There are two outputs: the reduced set of consecutive keys and the
+corresponding final set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_alltrue_by_key
+
+The keys' input type must be integer (s32 or u32).
 
+The output type is `b8`.
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_alltrue_by_key_dim
+
+
+
+\defgroup reduce_func_any_true anyTrue
 \ingroup reduce_mat
 
-Find if of any of the values in input are true
+Check if any values along a given dimension are true.
+
+The output type is `b8`.
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup reduce_func_count count
+\defgroup reduce_func_anytrue_by_key anyTrueByKey
+\ingroup reduce_mat
+
+Check if any values along a given dimension are true, according to an array of
+keys.
 
+Values corresponding to each group of consecutive equal keys will be tested to
+check if any are true. Keys can repeat; however, only consecutive key
+values will be considered for each reduction. If a key value is repeated
+somewhere else in the keys array it will be considered the start of a new
+reduction. There are two outputs: the reduced set of consecutive keys and the
+corresponding final set of reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_anytrue_by_key
+
+The keys' input type must be integer (s32 or u32).
+
+The output type is `b8`.
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
+
+\snippet test/reduce.cpp ex_reduce_anytrue_by_key_dim
+
+
+
+\defgroup reduce_func_count count
 \ingroup reduce_mat
 
-Count the number of non-zero elements in the input
+Count non-zero values in an array along a given dimension.
+
+The output type is `u32`.
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup scan_func_accum accum
+\defgroup reduce_func_count_by_key countByKey
+\ingroup reduce_mat
+
+Count non-zero values in an array, according to an array of keys.
+
+All non-zero values corresponding to each group of consecutive equal keys will
+be counted. Keys can repeat; however, only consecutive key values will be
+considered for each reduction. If a key value is repeated somewhere else in the
+keys array it will be considered the start of a new reduction. There are two
+outputs: the reduced set of consecutive keys and the corresponding final set of
+reduced values.
+
+An example demonstrating the reduction behavior can be seen in the following
+snippet.
+
+\snippet test/reduce.cpp ex_reduce_count_by_key
+
+The keys' input type must be integer (s32 or u32).
+
+The output type is `u32`.
+
+The keys array must be 1-dimensional matching the size of the reduced
+dimension. An example of multi-dimensional reduce-by-key can be seen below:
 
+\snippet test/reduce.cpp ex_reduce_count_by_key_dim
+
+
+
+\defgroup scan_func_accum accum
 \ingroup scan_mat
 
-Perform exclusive sum along specified dimension
+Evaluate the cumulative sum (inclusive) along a given dimension.
+
+For a 1D array \f$X\f$, the inclusive cumulative sum calculates \f$y_i =
+\sum_{p=0}^{i}x_p\f$ for every \f$x_i \in X\f$. Here is a simple example for the
+1D case:
+
+\snippet test/scan.cpp ex_accum_1D
+
+For 2D arrays and higher dimensions, you can specify the dimension along which
+the cumulative sum will be calculated. The formula above is then evaluated for
+every array slice along the specified dimension (in the 2D case, for example,
+this is \f$y_{i,j} = \sum_{p=0}^{j}x_{i,p}\f$ if the second dimension (dim1)
+is chosen). In the C++ API, the first dimension (dim0) is used by default if no
+dimension is specified; the C API requires the dimension to be specified
+explicitly:
+
+\snippet test/scan.cpp ex_accum_2D
+
+The output array type may be different from the input array type. The following
+table defines corresponding output types for each input type:
+
+Input Type          | Output Type
+--------------------|---------------------
+f32, f64, c32, c64  | same as input
+s32, s64, u32, u64  | same as input
+s16, s8             | s32
+u16, u8, b8         | u32
 
 \copydoc batch_detail_algo
 
 
 
-\defgroup scan_func_where where
+\defgroup scan_func_scan scan
+\ingroup scan_mat
+
+Scan an array (generalized) over a given dimension.
+
+Perform inclusive or exclusive scan using a given binary operation along a
+given dimension.
+
+Binary operations can be [add](\ref AF_BINARY_ADD), [mul](\ref AF_BINARY_MUL),
+[min](\ref AF_BINARY_MIN), [max](\ref AF_BINARY_MAX) as defined by \ref
+af_binary_op.
+
+
+
+\defgroup scan_func_scanbykey scanByKey
+\ingroup scan_mat
+
+Scan an array (generalized) over a given dimension, according to an array of
+keys.
 
+Perform inclusive or exclusive scan using a given binary operation along a
+given dimension using a key.
+
+Binary operations can be [add](\ref AF_BINARY_ADD), [mul](\ref AF_BINARY_MUL),
+[min](\ref AF_BINARY_MIN), [max](\ref AF_BINARY_MAX) as defined by \ref
+af_binary_op.
+
+
+
+\defgroup scan_func_where where
 \ingroup scan_mat
 
-Locate the indices of non-zero elements
+Locate the indices of the non-zero values in an array.
+
+Output type is `u32`.
 
 The locations are provided by flattening the input into a linear array.
 
 
 
 \defgroup calc_func_diff1 diff1
-
 \ingroup calc_mat
 
-First order numerical difference along specified dimension
+Calculate the first order difference in an array over a given dimension.
 
 \copydoc batch_detail_algo
 
 
 
 \defgroup calc_func_diff2 diff2
-
 \ingroup calc_mat
 
-Second order numerical difference along specified dimension
+Calculate the second order difference in an array over a given dimension.
 
 \copydoc batch_detail_algo
 
 
 
 \defgroup sort_func_sort sort
+\ingroup sort_mat
+
+Sort an array over a given dimension.
 
+
+
+\defgroup sort_func_sort_index sortIndex
 \ingroup sort_mat
 
-Sort input arrays
+Sort an array over a given dimension and return the original indices.
 
-Optionally return original indices and keys
+Output type is `u32`.
 
 
 
-\defgroup set_func_unique setunique
+\defgroup sort_func_sort_keys sortByKey
+\ingroup sort_mat
+
+Sort an array over a given dimension, according to an array of keys.
 
+
+
+\defgroup set_func_unique setunique
 \ingroup set_mat
 
-Find unique values from an input
+Return the unique values in an array.
 
+The input must be a one-dimensional array. Batching is not currently supported.
 
+An example, unsorted:
 
-\defgroup set_func_union setunion
+\snippet test/set.cpp ex_set_unique_simple
+
+The function can be sped up if it is known that the inputs are sorted.
 
+An example, sorted (ascending):
+
+\snippet test/set.cpp ex_set_unique_sorted
+
+The inputs can be sorted in ascending or descending order.
+
+An example, sorted (descending):
+
+\snippet test/set.cpp ex_set_unique_desc
+
+
+
+\defgroup set_func_union setunion
 \ingroup set_mat
 
-Find union of two inputs
+Evaluate the union of two arrays.
 
+The inputs must be one-dimensional arrays. Batching is not currently supported.
 
+An example:
 
-\defgroup set_func_intersect setintersect
+\snippet test/set.cpp ex_set_union_simple
+
+The function can be sped up if the inputs are sorted in increasing order and
+their values are unique.
 
+\snippet test/set.cpp ex_set_union
+
+
+
+\defgroup set_func_intersect setintersect
 \ingroup set_mat
 
-Find intersection of two inputs
+Evaluate the intersection of two arrays.
+
+The inputs must be one-dimensional arrays. Batching is not currently supported.
+
+An example:
+
+\snippet test/set.cpp ex_set_intersect_simple
+
+The function can be sped up if the inputs are sorted in increasing order and
+their values are unique.
+
+\snippet test/set.cpp ex_set_intersect
+
 
 
 @}
diff --git a/docs/details/arith.dox b/docs/details/arith.dox
index 1096252312..3a118bc890 100644
--- a/docs/details/arith.dox
+++ b/docs/details/arith.dox
@@ -1,526 +1,595 @@
+/*!
+\page arith_real_only arith_real
+\note This function only supports real inputs; complex inputs are not yet
+supported.
+*/
+
+/*!
+\page arith_int_only arith_int
+\note This function supports integer inputs only.
+*/
+
+
+
 /**
 \addtogroup arrayfire_func
 @{
 
-\defgroup arith_func_add add
 
+
+\defgroup arith_func_add add
 \ingroup arith_mat
 
-Addition of two inputs.
+Elementwise addition.
 
 
 
 \defgroup arith_func_sub sub
-
 \ingroup arith_mat
 
-Subtract one input from another
+Elementwise subtraction.
 
 
 
 \defgroup arith_func_mul mul
-
 \ingroup arith_mat
 
-Multiply two inputs element wise
+Elementwise multiplication.
 
 
 
 \defgroup arith_func_div div
-
 \ingroup arith_mat
 
-Divide one input by another
+Elementwise division.
 
 
 
-\defgroup arith_func_shiftl bitshiftl
-
-\ingroup arith_mat
+\defgroup arith_func_lt lt
+\ingroup logic_mat
 
-Left shift an input
+Less than, an elementwise comparison of two arrays.
 
+Check if the elements of one array are less than those of another array.
 
 
-\defgroup arith_func_shiftr bitshiftr
 
-\ingroup arith_mat
+\defgroup arith_func_gt gt
+\ingroup logic_mat
 
-Right shift an input
+Greater than, an elementwise comparison of two arrays.
 
+Check if the elements of one array are greater than those of another array.
 
 
-\defgroup arith_func_lt lt
 
+\defgroup arith_func_le le
 \ingroup logic_mat
 
-Check if input is less than another
-
+Less than or equal to, an elementwise comparison of two arrays.
 
+Check if the elements of one array are less than or equal to those of another
+array.
 
-\defgroup arith_func_gt gt
 
+\defgroup arith_func_ge ge
 \ingroup logic_mat
 
-Check if input is greater than another
+Greater than or equal to, an elementwise comparison of two arrays.
 
+Check if the elements of one array are greater than or equal to those of
+another array.
 
 
-\defgroup arith_func_le le
 
+\defgroup arith_func_eq eq
 \ingroup logic_mat
 
-Check if input is less than or equal to another
+Equal to, an elementwise comparison of two arrays.
 
+Check if the elements of one array are equal to those of another array.
 
 
-\defgroup arith_func_ge ge
 
+\defgroup arith_func_neq neq
 \ingroup logic_mat
 
-Check if input is greater than or equal to another
+Not equal to, an elementwise comparison of two arrays.
 
+Check if the elements of one array are not equal to those of another array.
 
 
-\defgroup arith_func_eq eq
 
+\defgroup arith_func_and and
 \ingroup logic_mat
 
-Check if input two inputs are equal
+Evaluate the logical AND of two arrays.
 
 
 
-\defgroup arith_func_neq neq
+\defgroup arith_func_or or
+\ingroup logic_mat
+
+Evaluate the logical OR of two arrays.
+
+
 
+\defgroup arith_func_not not
 \ingroup logic_mat
 
-Check if input two inputs are not equal
+Evaluate the logical NOT of an array.
 
 
 
-\defgroup arith_func_and and
+\defgroup arith_func_neg neg
+\ingroup numeric_mat
+
+Negate an array.
 
+
+
+\defgroup arith_func_bitnot bitnot
 \ingroup logic_mat
 
-Logical and of two inputs
+Evaluate the bitwise NOT of an array.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_or or
 
+\defgroup arith_func_bitand bitand
 \ingroup logic_mat
 
-Logical or of two inputs
+Evaluate the bitwise AND of two arrays.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_not not
 
+\defgroup arith_func_bitor bitor
 \ingroup logic_mat
 
-Logical not of an input
+Evaluate the bitwise OR of two arrays.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_neg neg
 
+\defgroup arith_func_bitxor bitxor
 \ingroup logic_mat
 
-Negative of an input
+Evaluate the bitwise XOR of two arrays.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_bitand bitand
 
-\ingroup logic_mat
+\defgroup arith_func_shiftl bitshiftl
+\ingroup arith_mat
 
-Bitwise and operation of two inputs
+Shift the bits of integer arrays left.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_bitor bitor
 
-\ingroup logic_mat
+\defgroup arith_func_shiftr bitshiftr
+\ingroup arith_mat
 
-Bitwise or operation of two inputs
+Shift the bits of integer arrays right.
 
+\copydoc arith_int_only
 
 
-\defgroup arith_func_bitxor bitxor
 
-\ingroup logic_mat
+\defgroup arith_func_cast cast
+\ingroup helper_mat
 
-Bitwise xor operation of two inputs
+Cast an array from one type to another.
 
 
 
 \defgroup arith_func_min min
-
 \ingroup numeric_mat
 
-Minimum of two inputs.
+Return the elementwise minimum between two arrays.
 
 
 
 \defgroup arith_func_max max
+\ingroup numeric_mat
+
+Return the elementwise maximum between two arrays.
+
 
+
+\defgroup arith_func_clamp clamp
 \ingroup numeric_mat
 
-Maximum of two inputs.
+Clamp an array between an upper and a lower limit.
 
 
 
 \defgroup arith_func_rem rem
-
 \ingroup numeric_mat
 
-Remainder operation
+Calculate the remainder of a division.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_mod mod
 
+\defgroup arith_func_mod mod
 \ingroup numeric_mat
 
-Compute \f$x - n * y\f$ where n is quotient of \f$x / y\f$
+Calculate the modulus.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_abs abs
 
+\defgroup arith_func_abs abs
 \ingroup numeric_mat
 
-Absolute value
-
+Calculate the absolute value.
 
 
 \defgroup arith_func_arg arg
-
 \ingroup numeric_mat
 
-Phase of a complex number
+Calculate the phase angle (in radians) of a complex array.
 
 
 
 \defgroup arith_func_sign sign
-
 \ingroup numeric_mat
 
-Check if input is negative
+Return the sign of elements in an array.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_round round
 
+\defgroup arith_func_round round
 \ingroup numeric_mat
 
-Round to nearest integer
+Round numbers to the nearest integer.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_trunc trunc
 
+\defgroup arith_func_trunc trunc
 \ingroup numeric_mat
 
-Truncate to nearest integer
+Truncate numbers to the nearest integer toward zero.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_floor floor
 
+\defgroup arith_func_floor floor
 \ingroup numeric_mat
 
-Round to integer less than equal to current value
+Round down to the greatest integer less than or equal to x.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_ceil ceil
 
+\defgroup arith_func_ceil ceil
 \ingroup numeric_mat
 
-Round to integer greater than equal to current value
+Round up to the least integer greater than or equal to x.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_hypot hypot
 
+\defgroup arith_func_hypot hypot
 \ingroup numeric_mat
 
-Hypotenuse of the two inputs
+Evaluate the length of the hypotenuse of two inputs.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_sin sin
 
+\defgroup arith_func_sin sin
 \ingroup trig_mat
 
-sin of input
+Evaluate the sine function.
 
 
 
 \defgroup arith_func_cos cos
-
 \ingroup trig_mat
 
-cos of input
+Evaluate the cosine function.
 
 
 
-\defgroup arith_func_tan tan/tan2
-
+\defgroup arith_func_tan tan
 \ingroup trig_mat
 
-tan of input
+Evaluate the tangent function.
 
 
 
 \defgroup arith_func_asin asin
-
 \ingroup trig_mat
 
-arc sin of input
+Evaluate the inverse sine function (arc sine).
 
 
 
 \defgroup arith_func_acos acos
-
 \ingroup trig_mat
 
-arc cos of input
+Evaluate the inverse cosine function (arc cosine).
 
+The inverse of cosine so that, if `y = cos(x)`, then `x = arccos(y)`.
 
 
 \defgroup arith_func_atan atan/atan2
-
 \ingroup trig_mat
 
-arc tan of input
+Evaluate the inverse tangent function (arc tangent).
 
 
 
 \defgroup arith_func_sinh sinh
-
 \ingroup hyper_mat
 
-sinh of input
+Evaluate the hyperbolic sine function.
 
 
-\defgroup arith_func_cosh cosh
 
+\defgroup arith_func_cosh cosh
 \ingroup hyper_mat
 
-cosh of input
+Evaluate the hyperbolic cosine function.
 
 
 
 \defgroup arith_func_tanh tanh
-
 \ingroup hyper_mat
 
-tanh of input
+Evaluate the hyperbolic tangent function.
 
 
 
 \defgroup arith_func_asinh asinh
-
 \ingroup hyper_mat
 
-asinh of input
+Evaluate the inverse hyperbolic sine function (area hyperbolic sine).
 
 
 
 \defgroup arith_func_acosh acosh
-
 \ingroup hyper_mat
 
-acosh of input
+Evaluate the inverse hyperbolic cosine function (area hyperbolic cosine).
 
 
 
 \defgroup arith_func_atanh atanh
-
 \ingroup hyper_mat
 
-atanh of input
+Evaluate the inverse hyperbolic tangent function (area hyperbolic tangent).
 
 
 
 \defgroup arith_func_cplx complex
-
 \ingroup complex_mat
 
-create complex arrays
+Create complex arrays.
 
+Complex arrays are created from any of the following four inputs:
+
+1. a single real array, returning zeros for the imaginary component. See
+   `array b` in the example.
+2. two real arrays, one for the real component and one for the imaginary
+   component. See `array c` in the example.
+3. a single real array for the real component and a single scalar for each
+   imaginary component. See `array d` in the example.
+4. a single scalar for each real component and a single real array for the
+   imaginary component. See `array e` in the example.
+
+__Examples:__
+
+\snippet test/complex.cpp ex_arith_func_complex
 
 
-\defgroup arith_func_real real
 
+\defgroup arith_func_real real
 \ingroup complex_mat
 
-Get real part of complex arrays
+Returns the real part of a complex array.
 
 
 
 \defgroup arith_func_imag imag
-
 \ingroup complex_mat
 
-Get imaginary part of complex arrays
+Returns the imaginary part of a complex array.
 
 
 
 \defgroup arith_func_conjg conjg
-
 \ingroup complex_mat
 
-Get complex conjugate
-
+Evaluate the complex conjugate of an input array.
 
 
 
 \defgroup arith_func_root root
-
 \ingroup explog_mat
 
-Find root of an input
+Evaluate the nth root.
 
 
 
 \defgroup arith_func_pow pow
+\ingroup explog_mat
+
+Raise a base to a power (or exponent).
+
+
 
+\defgroup arith_func_pow2 pow2
 \ingroup explog_mat
 
-Raise an array to a power
+Raise 2 to a power (or exponent).
 
 
 
+\defgroup arith_func_sigmoid sigmoid
+Evaluate the logistic sigmoid function.
+
 
-\defgroup arith_func_exp exp
 
+\defgroup arith_func_exp exp
 \ingroup explog_mat
 
-Exponential of input
+Evaluate the exponential function.
 
 
 
 \defgroup arith_func_expm1 expm1
-
 \ingroup explog_mat
 
-Exponential of input - 1
+Evaluate the exponential function of an array minus 1, `exp(in) - 1`.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_erf erf
 
+\defgroup arith_func_erf erf
 \ingroup explog_mat
 
-Error function value
+Evaluate the error function.
 
+\copydoc arith_real_only
 
 
 
 \defgroup arith_func_erfc erfc
-
 \ingroup explog_mat
 
-Complementary Error function value
+Evaluate the complementary error function.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_log log
 
+\defgroup arith_func_log log
 \ingroup explog_mat
 
-Natural logarithm
+Evaluate the natural logarithm.
 
 
 
 \defgroup arith_func_log1p log1p
-
 \ingroup explog_mat
 
-Natural logarithm of (1 + in)
+Evaluate the natural logarithm of 1 + input, `ln(1+in)`.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_log10 log10
 
+\defgroup arith_func_log10 log10
 \ingroup explog_mat
 
-logarithm base 10
+Evaluate the base 10 logarithm.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_sqrt sqrt
 
+\defgroup arith_func_log2 log2
 \ingroup explog_mat
 
-Square root of input arrays
+Evaluate the base 2 logarithm.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_cbrt cbrt
 
+\defgroup arith_func_sqrt sqrt
 \ingroup explog_mat
 
-Cube root of input arrays
+Evaluate the square root.
 
 
 
-\defgroup arith_func_factorial factorial
-
+\defgroup arith_func_rsqrt rsqrt
 \ingroup explog_mat
 
-Factorial function
+Evaluate the reciprocal square root.
 
+\f[ \frac{1}{\sqrt{x}} \f]
+
+\copydoc arith_real_only
 
 
-\defgroup arith_func_tgamma tgamma
 
+\defgroup arith_func_cbrt cbrt
 \ingroup explog_mat
 
-Gamma function
+Evaluate the cube root.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_lgamma lgamma
 
+\defgroup arith_func_factorial factorial
 \ingroup explog_mat
 
-Logarithm of absolute values of Gamma function
+Evaluate the factorial.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_iszero iszero
 
-\ingroup helper_mat
+\defgroup arith_func_tgamma tgamma
+\ingroup explog_mat
 
-Check if values are zero
+Evaluate the gamma function.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_isinf isinf
 
-\ingroup helper_mat
+\defgroup arith_func_lgamma lgamma
+\ingroup explog_mat
 
-Check if values are infinite
+Evaluate the logarithm of the absolute value of the gamma function.
 
+\copydoc arith_real_only
 
 
-\defgroup arith_func_isnan isNan
 
+\defgroup arith_func_iszero iszero
 \ingroup helper_mat
 
-Check if values are Nan
+Check if values are zero.
 
 
 
-\defgroup arith_func_cast cast
+\defgroup arith_func_isinf isinf
+\ingroup helper_mat
 
+Check if values are infinite.
+
+
+
+\defgroup arith_func_isnan isnan
 \ingroup helper_mat
 
-Casting inputs from one type to another
+Check if values are NaN.
+
 
 
 @}
diff --git a/docs/details/array.dox b/docs/details/array.dox
index e8daf4b418..2f696a26a3 100644
--- a/docs/details/array.dox
+++ b/docs/details/array.dox
@@ -3,7 +3,7 @@
 \addtogroup arrayfire_func
 @{
 \defgroup array_mem_operator_paren operator()
-\ingroup index_mat 
+\ingroup index_mat
 
 \brief Gets a reference to a set of elements
 
@@ -16,27 +16,6 @@ to \ref af::array objects.
 
 ===============================================================================
 
-\defgroup array_mem_operator_paren_one operator()
-
-This operator returns a reference of the original array at a given coordinate.
-You can pass \ref af::seq, \ref af::array, or an int as it's parameters. These
-references can be used for assignment or returning references
-to \ref af::array objects.
-
-If the \ref af::array is a multi-dimensional array then this coordinate
-will treated as the data as a linear array.
-
-===============================================================================
-
-\defgroup array_mem_operator_paren_many operator()
-
-This operator returns a reference of the original array at a given coordinate.
-You can pass \ref af::seq, \ref af::array, or an int as it's parameters. These
-references can be used for assignment or returning references
-to \ref af::array objects.
-
-===============================================================================
-
 \defgroup array_mem_row row/rows
 \ingroup index_mat
 
@@ -92,7 +71,7 @@ Adds and assigns values
 \defgroup array_mem_operator_minus_eq operator-=
 \ingroup index_mat
 
-\brief Substracts and assigns the value(s) of val to the elements of the
+\brief Subtracts and assigns the value(s) of val to the elements of the
 af::array
 
 Substracts and assigns values
@@ -116,14 +95,5 @@ Multiplies and assigns values
 
 Divides and assigns values
 
-===============================================================================
-
-\defgroup array_mem_operator_plus operator+
-\ingroup arith_mat
-
-\brief Divides and assigns the value(s) of val to the elements of the af::array
-
-Divides and assigns values
-
 @}
 */
diff --git a/docs/details/backend.dox b/docs/details/backend.dox
new file mode 100644
index 0000000000..893567b696
--- /dev/null
+++ b/docs/details/backend.dox
@@ -0,0 +1,93 @@
+/**
+\addtogroup arrayfire_func
+@{
+
+\defgroup unified_func_setbackend setBackend
+
+\brief Set the current backend when using Unified backend
+
+This is a no-op when using one of the CPU, CUDA, or OpenCL backends.
+
+However, when using one of those three backends, trying to set a different
+backend will result in an exception.
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup unified_func_getbackendcount getBackendCount
+
+\brief Get the number of backends whose libraries were successfully loaded.
+
+This will be between 0 and 3: 0 means no backends were loaded and 3 means all
+backends loaded successfully.
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup unified_func_getavailbackends getAvailableBackends
+
+\brief Returns an integer indicating the backends loaded successfully.
+
+The number returned denotes the backends available according to the table:
+
+Return Value | Backends Available
+-------------|-----------------------
+0            | None
+1            | CPU
+2            | CUDA
+3            | CPU and CUDA
+4            | OpenCL
+5            | CPU and OpenCL
+6            | CUDA and OpenCL
+7            | CPU, CUDA and OpenCL
+
+To convert the integer back into bools for each backend, use the following code:
+\code
+int backends = af::getAvailableBackends();
+
+bool cpu    = backends & AF_BACKEND_CPU;
+bool cuda   = backends & AF_BACKEND_CUDA;
+bool opencl = backends & AF_BACKEND_OPENCL;
+\endcode
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup unified_func_getbackendid getBackendId
+
+\brief Gets the backend enum for an array
+
+This will return one of the values from the \ref af_backend enum.
+The return value specifies which backend the array was created on.
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup unified_func_getactivebackend getActiveBackend
+
+\brief Gets the backend enum for the active backend
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup unified_func_getdeviceid getDeviceId
+
+\brief Gets the id of the device an array was created on.
+
+\ingroup unified_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+@}
+*/
diff --git a/docs/details/blas.dox b/docs/details/blas.dox
index 2c63fa70c9..ac0aa99673 100644
--- a/docs/details/blas.dox
+++ b/docs/details/blas.dox
@@ -1,39 +1,80 @@
 /**
 \addtogroup arrayfire_func
 @{
-\defgroup blas_func_dot dot
 
-\ingroup blas_mat
+\defgroup blas_func_matmul matmul
 
-\brief Calculate the dot product of a vector
+Matrix multiplication.
 
-Scalar dot product between two vectors.  Also referred to as the inner
-product.
+Performs a matrix multiplication on the two input arrays after performing the
+operations specified in the options. The operations are done while reading the
+data from memory. This results in no additional memory being used for temporary
+buffers.
 
-This function returns the scalar product of two equal sized vectors or
-between a matrix and a vector. The second operand needs to be a vector
-in either case.
+Batched matrix multiplications are supported. The supported types of batch
+operations for any given set of two matrices A and B are given below:
 
-\image html matrix_vector_dot_product.png
+| Size of Input Matrix A     | Size of Input Matrix B     | Output Matrix Size          |
+|:--------------------------:|:--------------------------:|:---------------------------:|
+| \f$ \{ M, K,  1,  1 \} \f$ | \f$ \{ K, N,  1,  1 \} \f$ |  \f$ \{ M, N,  1,  1 \} \f$ |
+| \f$ \{ M, K, b2, b3 \} \f$ | \f$ \{ K, N, b2, b3 \} \f$ |  \f$ \{ M, N, b2, b3 \} \f$ |
+| \f$ \{ M, K,  1,  1 \} \f$ | \f$ \{ K, N, b2, b3 \} \f$ |  \f$ \{ M, N, b2, b3 \} \f$ |
+| \f$ \{ M, K, b2, b3 \} \f$ | \f$ \{ K, N,  1,  1 \} \f$ |  \f$ \{ M, N, b2, b3 \} \f$ |
 
-=======================================================================
+where `M`, `K`, `N` are dimensions of the matrix and `b2`, `b3` indicate batch
+size along the respective dimension.
+
+For the last two entries in the above table, the 2D matrix is broadcast to
+match the dimensions of the 3D/4D array. This broadcast doesn't involve any
+additional memory allocations on either the host or the device.
+
+\note Sparse support was added to ArrayFire in v3.4.0. This function can be used
+for Sparse-Dense matrix multiplication. See the notes of the function for usage
+and restrictions.
+
+\par
+\note Limited support for \ref s8 was added to the CUDA backend in ArrayFire
+v3.10.0. See \ref af_gemm "s8 Support" notes for details.
 
-\defgroup blas_func_matmul matmul
 \ingroup blas_mat
 
-\brief Matrix multiplication using array
+=======================================================================
+
+\defgroup blas_func_dot dot
 
-Performs a matrix multiplication on the two input arrays after performing the operations specified in the options. The operations are done while reading the data from memory. This results in no additional memory being used for temporary buffers.
+Compute the dot product.
+
+Scalar dot product between two vectors, also referred to as the inner
+product.
+
+\ingroup blas_mat
 
 =======================================================================
 
 \defgroup blas_func_transpose transpose
-\ingroup blas_mat
-\ingroup manip_mat
 
-\brief Matrix Transpose
+Transpose a matrix.
+
+Reverse or permute the dimensions of an array; returns the modified array.
+For an array `a` with two dimensions, `transpose(a)` gives the matrix transpose.
+For an array with more than two dimensions, the first two dimensions are
+transposed across higher dimensions.
 
-Transposes a matrix
+Set `conjugate=true` to perform the complex conjugate transpose of a matrix
+which interchanges the row and column index for each element, reflecting the
+elements across the main diagonal and negating the imaginary part of any
+complex numbers. For example, if `b = transpose(a, true)` and element
+`a(2, 1)` is `(1, 2)`, then element `b(1, 2)` is `(1, -2)`.
+
+In-place versions perform matrix transposition by reordering the input,
+reducing memory footprint.
+
+__Examples:__
+
+\snippet test/transpose.cpp ex_blas_func_transpose
+
+\ingroup blas_mat
+\ingroup manip_mat
 
 =======================================================================
 
diff --git a/docs/details/data.dox b/docs/details/data.dox
index d0e2eac5a8..bb96a4c61f 100644
--- a/docs/details/data.dox
+++ b/docs/details/data.dox
@@ -2,22 +2,11 @@
 \addtogroup arrayfire_func
 @{
 
-\defgroup data_func_randu randu
+\defgroup data_func_constant constant
 
-\brief Create a random array sampled from uniform distribution
+Create an array from a scalar input value.
 
-The data is uniformly distributed between [0, 1]
-
-\ingroup data_mat
-\ingroup arrayfire_func
-
-=======================================================================
-
-\defgroup data_func_randn randn
-
-\brief Create a random array sampled from a normal distribution
-
-The distribution is centered around 0
+Generate an array with elements set to a specified value.
 
 \ingroup data_mat
 \ingroup arrayfire_func
@@ -26,7 +15,7 @@ The distribution is centered around 0
 
 \defgroup data_func_identity identity
 
-\brief Create an identity array with diagonal values 1
+Generate an identity matrix.
 
 \code
 array a = identity(5, 3);
@@ -45,30 +34,12 @@ array a = identity(5, 3);
 
 \defgroup data_func_range range
 
-\brief Creates an array with [0, n] values along the seq_dim which is tiled across other dimensions
+Generate an array with `[0, n-1]` values along a specified dimension and
+tiled across other dimensions.
 
-\code
-// Generates an array of [0, 4] along first dimension
-array a = range(dim4(5));        // a = [0,
-                                 //      1,
-                                 //      2,
-                                 //      3,
-                                 //      4]
-
-// Generates an array of [0, 4] along first dimension, tiled along second dimension
-array b = range(dim4(5, 2));     // a = [0, 0,
-                                 //      1, 1,
-                                 //      2, 2,
-                                 //      3, 3,
-                                 //      4, 4]
-
-// Generates an array of [0, 2] along second dimension, tiled along first dimension
-array c = range(dim4(5, 3), 1);  // c = [0, 1, 2,
-                                 //      0, 1, 2,
-                                 //      0, 1, 2,
-                                 //      0, 1, 2,
-                                 //      0, 1, 2]
-\endcode
+__Examples:__
+
+\snippet test/range.cpp ex_data_func_range
 
 \ingroup data_mat
 \ingroup arrayfire_func
@@ -77,7 +48,8 @@ array c = range(dim4(5, 3), 1);  // c = [0, 1, 2,
 
 \defgroup data_func_iota iota
 
-\brief Create an sequence [0, dims.elements() - 1] and modify to specified dimensions dims and then tile it according to tile_dims
+Generate an array with `[0, n-1]` values modified to specified dimensions and
+tiling.
 
 \code
 // Generate [0, 5x3 - 1] in dimensions 5, 3
@@ -106,7 +78,12 @@ array b = iota(dim4(5, 3), dim4(1, 2))
 =======================================================================
 
 \defgroup data_func_diag diag
-\brief Extract diagonal from a matrix when \p extract is set to true. Create a diagonal marix from input array when \p extract is set to false
+
+Extract the diagonal from an array.
+
+If `extract` is true, an array containing the diagonal of the matrix is
+extracted, while a false condition returns a diagonal matrix.
+
 
 \code
 // Extraction
@@ -159,9 +136,10 @@ array b = diag(a, -1, false);
 
 \defgroup manip_func_join join
 
-\brief Join up to 4 arrays along specified dimension.
+Join up to 4 arrays along a specified dimension.
 
-Requires that all dimensions except the join dimension must be the same for all arrays.
+All dimensions except the join dimension must be the same for all
+arrays.
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -170,9 +148,32 @@ Requires that all dimensions except the join dimension must be the same for all
 
 \defgroup manip_func_tile tile
 
-\brief Tile the input array along specified dimensions
+Generate a tiled array by repeating an array's contents along a specified
+dimension.
+
+Creates copies of the input array and concatenates them with each other, such
+that the output array will have as many copies of the input array as the user
+specifies along each dimension. In this sense, the output array is a set of
+"tiles" where each copy of the input array, including the original, is
+a "tile".
+
+Given below are some examples. The input array looks like this:
 
-Creates copys of the array a specified number of times within the output array
+\snippet test/tile.cpp ex_tile_input
+
+Here, the input array is tiled along the first dimension, 2 times:
+
+\snippet test/tile.cpp ex_tile_0_2
+
+Here, the input is tiled along the second dimension, 3 times:
+
+\snippet test/tile.cpp ex_tile_1_3
+
+Lastly, one can also tile along multiple dimensions simultaneously. Here, the
+input is tiled 2 times in the first dimension and 3 times in the second
+dimension:
+
+\snippet test/tile.cpp ex_tile_0_2_and_1_3
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -181,49 +182,42 @@ Creates copys of the array a specified number of times within the output array
 
 \defgroup manip_func_reorder reorder
 
-\brief Reorder the input by in the specified order
+Reorder an array.
 
-Exchanges dimensions within an array. The order of the data along each
-dimension does not change.
+Exchanges data of an array such that the requested change in dimension
+is satisfied. The linear ordering of data within the array is preserved.
 
 \code
-array a = randu(5, 4, 3);
-// a [5 4 3 1]
-// 0.0000     0.2190     0.3835     0.5297
-// 0.1315     0.0470     0.5194     0.6711
-// 0.7556     0.6789     0.8310     0.0077
-// 0.4587     0.6793     0.0346     0.3834
-// 0.5328     0.9347     0.0535     0.0668
-
-// 0.4175     0.5269     0.9103     0.3282
-// 0.6868     0.0920     0.7622     0.6326
-// 0.5890     0.6539     0.2625     0.7564
-// 0.9304     0.4160     0.0475     0.9910
-// 0.8462     0.7012     0.7361     0.3653
-
-// 0.2470     0.0727     0.7665     0.1665
-// 0.9826     0.6316     0.4777     0.4865
-// 0.7227     0.8847     0.2378     0.8977
-// 0.7534     0.2727     0.2749     0.9092
-// 0.6515     0.4364     0.3593     0.0606
-
-array b = reorder(a, 2, 0, 1)
-// b [3 5 4 1]
-// 0.0000     0.1315     0.7556     0.4587     0.5328
-// 0.4175     0.6868     0.5890     0.9304     0.8462
-// 0.2470     0.9826     0.7227     0.7534     0.6515
-
-// 0.2190     0.0470     0.6789     0.6793     0.9347
-// 0.5269     0.0920     0.6539     0.4160     0.7012
-// 0.0727     0.6316     0.8847     0.2727     0.4364
-
-// 0.3835     0.5194     0.8310     0.0346     0.0535
-// 0.9103     0.7622     0.2625     0.0475     0.7361
-// 0.7665     0.4777     0.2378     0.2749     0.3593
-
-// 0.5297     0.6711     0.0077     0.3834     0.0668
-// 0.3282     0.6326     0.7564     0.9910     0.3653
-// 0.1665     0.4865     0.8977     0.9092     0.0606
+a [2 2 3 1]
+    1.0000     3.0000
+    2.0000     4.0000
+
+    1.0000     3.0000
+    2.0000     4.0000
+
+    1.0000     3.0000
+    2.0000     4.0000
+
+
+reorder(a, 1, 0, 2) [2 2 3 1]  // equivalent to a transpose
+    1.0000     2.0000
+    3.0000     4.0000
+
+    1.0000     2.0000
+    3.0000     4.0000
+
+    1.0000     2.0000
+    3.0000     4.0000
+
+
+reorder(a, 2, 0, 1) [3 2 2 1]
+    1.0000     2.0000
+    1.0000     2.0000
+    1.0000     2.0000
+
+    3.0000     4.0000
+    3.0000     4.0000
+    3.0000     4.0000
 \endcode
 
 \ingroup manip_mat
@@ -233,9 +227,9 @@ array b = reorder(a, 2, 0, 1)
 
 \defgroup manip_func_shift shift
 
-\brief Circular shift slong specified dimensions
+Shift an array.
 
-Shifts the values in a circular fashion along the specified dimesion.
+Circularly shift array values along a specified dimension.
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -244,9 +238,14 @@ Shifts the values in a circular fashion along the specified dimesion.
 
 \defgroup manip_func_moddims moddims
 
-\brief Modify the input dimensions without changing the data order
+Modify the dimensions of an array without changing the order of its elements.
+
+This function only modifies array metadata and requires no computation. It is a
+NOOP.
 
-Simply modifies the metadata. This is a noop.
+__Examples:__
+
+\snippet test/moddims.cpp ex_data_func_moddims
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -255,9 +254,9 @@ Simply modifies the metadata. This is a noop.
 
 \defgroup manip_func_flat flat
 
-\brief Flatten the input to a single dimension
+Flatten an array.
 
-Simply returns the array as a vector. This is a noop.
+Simply returns the array as a vector. This is a NOOP.
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -266,9 +265,9 @@ Simply returns the array as a vector. This is a noop.
 
 \defgroup manip_func_flip flip
 
-\brief Flip the input along sepcified dimension
+Flip the input along a specified dimension.
 
-Mirrors the array along the specified dimensions.
+Mirrors the array along the specified dimension.
 
 \ingroup manip_mat
 \ingroup arrayfire_func
@@ -277,7 +276,7 @@ Mirrors the array along the specified dimensions.
 
 \defgroup data_func_lower lower
 
-\brief Create a lower triangular marix from input array
+Return the lower triangular matrix from an input array.
 
 \ingroup data_mat
 \ingroup arrayfire_func
@@ -286,7 +285,65 @@ Mirrors the array along the specified dimensions.
 
 \defgroup data_func_upper upper
 
-\brief Create a upper triangular marix from input array
+Return the upper triangular matrix from an input array.
+
+\ingroup data_mat
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup data_func_select select
+
+Select elements based on a conditional array.
+
+Creates a new array that is composed of values either from array `a` or array
+`b`, based on a third conditional array. For all non-zero elements in the
+conditional array, the output array will contain values from `a`. Otherwise the
+output will contain values from `b`.
+
+\snippet test/select.cpp ex_data_select
+
+is equivalent to:
+
+\snippet test/select.cpp ex_data_select_c
+
+The conditional array must be a \ref b8 typed array.
+
+The select function can perform batched operations based on the size of each of
+the inputs. The following table describes the input and output sizes for
+supported batched configurations.
+
+| Output | Condition Array | Array A | Array B |
+|--------|-----------------|---------|---------|
+| (M, N) | (M, 1)          | (M, 1)  | (M, N)  |
+| (M, N) | (M, 1)          | (M, N)  | (M, 1)  |
+| (M, N) | (M, 1)          | (M, N)  | (M, N)  |
+| (M, N) | (M, N)          | (M, 1)  | (M, N)  |
+| (M, N) | (M, N)          | (M, N)  | (M, 1)  |
+
+\ingroup manip_mat
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup data_func_replace replace
+
+Replace elements of an array with elements of another array.
+
+Input values are retained when corresponding elements from the conditional
+array are true. Input values are replaced when corresponding elements from the
+conditional array are false.
+
+\ingroup manip_mat
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup data_func_pad pad
+
+Pad an array.
+
+Pad the input array using a constant or values from input along the border.
 
 \ingroup data_mat
 \ingroup arrayfire_func
@@ -295,4 +352,3 @@ Mirrors the array along the specified dimensions.
 
 @}
 */
-
diff --git a/docs/details/device.dox b/docs/details/device.dox
index 230199d583..1bc1bbdccc 100644
--- a/docs/details/device.dox
+++ b/docs/details/device.dox
@@ -2,6 +2,22 @@
 \addtogroup arrayfire_func
 @{
 
+\defgroup device_func_prop deviceInfo
+\ingroup device_mat
+
+\brief Gets the information about device and platform as strings
+
+\param d_name pointer to a user-allocated char array. Recommended minimum size is 64.
+The name of the device is stored in this array.
+\param d_platform pointer to a user-allocated char array. Recommended minimum size is 10.
+The platform information is stored in this array.
+\param d_toolkit pointer to a user-allocated char array. Recommended minimum size is 64.
+The toolkit information is stored in this array.
+\param d_compute pointer to a user-allocated char array. Recommended minimum size is 10.
+The compute version of the device is stored in this array.
+
+===============================================================================
+
 \defgroup device_func_count getDeviceCount
 \ingroup device_mat
 
@@ -30,6 +46,17 @@ floating point operations
 
 ===============================================================================
 
+\defgroup device_func_half isHalfAvailable
+\ingroup device_mat
+
+\brief Check if half (16-bit) precision floating point support is available for
+       specified device
+
+These functions check if a device has support to perform half precision
+floating point operations
+
+===============================================================================
+
 \defgroup device_func_set setDevice
 \ingroup device_mat
 
@@ -50,15 +77,51 @@ have finished.
 
 ===============================================================================
 
-\defgroup device_func_alloc alloc
+\defgroup device_func_alloc allocV2
 \ingroup device_mat
 
 \brief Allocate memory using the ArrayFire memory manager
 
 This function will allocate memory on the device and return a pointer
 to it. The memory is allocated using ArrayFire's memory manager which
-has some different characteristics to standard method of memory
-allocation
+will defer releasing memory to the driver and reuse the same memory
+for later operations.
+
+This function will return different objects based on the backend used. The
+interface returns a void pointer that needs to be cast to the
+backend-appropriate memory type.
+
+
+| function                     | CPU | CUDA | OpenCL      |
+|------------------------------|-----|------|-------------|
+| af_alloc_device_v2           | T*  | T*   | cl_mem      |
+| af::allocV2                  | T*  | T*   | cl_mem      |
+| af_alloc_device (deprecated) | T*  | T*   | cl::Buffer* |
+| af::alloc (deprecated)       | T*  | T*   | cl::Buffer* |
+
+CPU Backend
+-----------
+\snippet test/memory.cpp ex_alloc_v2_cpu
+
+CUDA Backend
+------------
+\snippet test/cuda.cu ex_alloc_v2_cuda
+
+OpenCL Backend
+--------------
+\snippet test/ocl_ext_context.cpp ex_alloc_v2_opencl
+
+===============================================================================
+
+\defgroup device_func_free freeV2
+\ingroup device_mat
+
+\brief Returns memory to ArrayFire's memory manager. The memory will
+       return to the memory pool.
+
+Releases control of the memory allocated by af::allocV2 functions to ArrayFire's
+memory manager. ArrayFire may reuse the memory for subsequent operations. This
+memory should not be used by the client after this point.
 
 ===============================================================================
 
@@ -73,12 +136,39 @@ a limited resource.
 
 ===============================================================================
 
-\defgroup device_func_free free
+\defgroup device_func_free_pinned freePinned
 \ingroup device_mat
 
-\brief Free device memory allocated by ArrayFire's memory manager
+\brief Free pinned memory allocated by ArrayFire's memory manager
+
+These calls free pinned memory on the host. These functions need to be called
+on pointers allocated using the pinned allocation function.
+
+===============================================================================
+
+\defgroup device_func_alloc_host allocHost
+\ingroup device_mat
+
+\brief Allocate memory on host
+
+This function is used for allocating regular memory on the host. This is useful
+when the compiler version of the ArrayFire library is different from the
+executable's compiler version.
+
+It does not use ArrayFire's memory manager.
+
+===============================================================================
+
+\defgroup device_func_free_host freeHost
+\ingroup device_mat
+
+\brief Free memory allocated on host internally by ArrayFire
+
+This function is used for freeing memory on the host that was allocated within
+ArrayFire. This is useful when the compiler version of the ArrayFire library is
+different from the executable's compiler version.
 
-These calls free the device or pinned memory. These functions need to be called
+It does not use ArrayFire's memory manager.
 
 ===============================================================================
 
diff --git a/docs/details/features.dox b/docs/details/features.dox
new file mode 100644
index 0000000000..6fe1386060
--- /dev/null
+++ b/docs/details/features.dox
@@ -0,0 +1,13 @@
+/**
+\addtogroup arrayfire_func
+@{
+
+\defgroup features_group_features features
+
+\brief Lookup values of an array based on sequences and/or arrays
+
+===============================================================================
+
+
+@}
+*/
\ No newline at end of file
diff --git a/docs/details/graphics.dox b/docs/details/graphics.dox
index 5fa083fc4c..f8fc325987 100644
--- a/docs/details/graphics.dox
+++ b/docs/details/graphics.dox
@@ -3,8 +3,6 @@
 \addtogroup  graphics_func
 @{
 
-A list of graphics functions to visualize data
-
 \defgroup gfx_func_window Window Functions
 @{
 
diff --git a/docs/details/image.dox b/docs/details/image.dox
index 12f38ce3d3..312b88c880 100644
--- a/docs/details/image.dox
+++ b/docs/details/image.dox
@@ -21,6 +21,12 @@ Grayscale is a single channel color space where pixel value ranges from 0 to 1.
 Zero represents black, one represent white and any value between zero & one is
 a gray value
 
+\page image_func_ycbcr YCbCr
+
+YCbCr is a family of color spaces used as part of the color image pipeline in video
+and digital photography systems, where Y is the luma component and Cb and Cr are the
+blue-difference and red-difference chroma components.
+
 */
 //=================================================================================
 /**
@@ -28,7 +34,7 @@ a gray value
 \addtogroup  arrayfire_func
 @{
 
-\defgroup image_func_colorspace colorspace
+\defgroup image_func_colorspace colorSpace
 \ingroup colorconv_mat
 
 Colorspace conversion function
@@ -102,6 +108,81 @@ following formula
 
 =======================================================================
 
+\defgroup image_func_rgb2ycbcr rgb2ycbcr
+\ingroup colorconv_mat
+
+RGB to YCbCr colorspace converter
+
+\copydoc image_func_rgb
+\copydoc image_func_ycbcr
+
+The input array to this function should contain real data in the range \f$[0,1]\f$.
+
+The following equations are used to convert an image from RGB color space to YCbCr color space.
+
+\f$ Y  = 16 + \displaystyle k_r*R + (1 - \displaystyle k_r- \displaystyle k_b)*G + \displaystyle k_b * B \f$
+
+\f$ Cb =  128 + \frac{\displaystyle 1}{\displaystyle 2} * \frac{\displaystyle B - Y\displaystyle
+}{\displaystyle 1 - \displaystyle k_b} \f$
+
+\f$ Cr =  128 + \frac{\displaystyle 1}{\displaystyle 2} * \frac{\displaystyle R - Y\displaystyle
+}{\displaystyle 1 - \displaystyle k_r} \f$
+
+The output image in YCbCr has the following ranges for its respective channels.
+
+\f$Y -> [16, 219]\f$
+
+\f$Cb-> [16, 240]\f$
+
+\f$Cr-> [16, 240]\f$
+
+Based on the ITU-R BT.xyz[w] standard used, different values of \f$k_b\f$ and \f$k_r\f$ are used
+to do the color space conversion. You can change these values by passing the \ref af_ycc_std enum
+value.
+
+=======================================================================
+
+\defgroup image_func_ycbcr2rgb ycbcr2rgb
+\ingroup colorconv_mat
+
+YCbCr to RGB colorspace converter
+
+\copydoc image_func_ycbcr
+\copydoc image_func_rgb
+
+The input array to this function should contain real data with the following ranges in
+the respective channels.
+
+\f$Y -> [16, 219]\f$
+
+\f$Cb-> [16, 240]\f$
+
+\f$Cr-> [16, 240]\f$
+
+
+The following equations are used to convert an image from YCbCr color space to RGB color space.
+
+\f$ R = \frac{\displaystyle Y - \displaystyle 16}{\displaystyle 219}
+       + \frac{\displaystyle C_r - \displaystyle 128}{\displaystyle 112} * (\displaystyle 1 - \displaystyle k_r) \f$
+
+\f$ G = \frac{\displaystyle Y - \displaystyle 16}{\displaystyle 219}
+       - \frac{\displaystyle C_r - \displaystyle 128}{\displaystyle 112} * (\displaystyle 1 -
+\displaystyle k_r) * \frac{\displaystyle k_r}{\displaystyle 1 - \displaystyle k_b - \displaystyle
+k_r} - \frac{\displaystyle C_b - \displaystyle 128}{\displaystyle 112} * (\displaystyle 1 -
+\displaystyle k_b) * \frac{\displaystyle k_b}{\displaystyle 1 - \displaystyle k_b - \displaystyle
+k_r}\f$
+
+\f$ B = \frac{\displaystyle Y - \displaystyle 16}{\displaystyle 219}
+       + \frac{\displaystyle C_b - \displaystyle 128}{\displaystyle 112} * (\displaystyle 1 - \displaystyle k_b) \f$
+
+Output image in RGB will have values in range \f$[0, 1]\f$.
+
+Based on the ITU-R BT.xyz[w] standard used, different values of \f$k_b\f$ and \f$k_r\f$ are used
+to do the color space conversion. You can change these values by passing the \ref af_ycc_std enum
+value.
+
+=======================================================================
+
 \defgroup image_func_histogram histogram
 \ingroup hist_mat
 
@@ -134,35 +215,6 @@ Data normalization via histogram equalization
 
 =======================================================================
 
-\defgroup cv_func_fast fast
-\ingroup featdetect_mat
-
-\brief FAST feature detector
-
-A circle of radius 3 pixels, translating into a total of 16 pixels, is checked
-for sequential segments of pixels much brighter or much darker than the central
-one. For a pixel p to be considered a feature, there must exist a sequential
-segment of arc_length pixels in the circle around it such that all are greather
-than (p + thr) or smaller than (p - thr). After all features in the image are
-detected, if nonmax is true, the non-maximal suppression is applied, checking
-all detected features and the features detected in its 8-neighborhood and
-discard it if its score is non maximal.
-
-=======================================================================
-
-\defgroup cv_func_orb orb
-\ingroup featdescriptor_mat
-
-\brief ORB Feature descriptor
-
-Extract ORB descriptors from FAST features that hold higher Harris responses.
-FAST does not compute orientation, thus, orientation of features is calculated
-using the intensity centroid. As FAST is also not multi-scale enabled, a
-multi-scale pyramid is calculated by downsampling the input image multiple
-times followed by FAST feature detection on each scale.
-
-=======================================================================
-
 \defgroup image_func_regions regions
 \ingroup connected_comps_mat
 
@@ -210,6 +262,46 @@ discussion on it can be found [here](http://en.wikipedia.org/wiki/Sobel_operator
 
 =======================================================================
 
+\defgroup image_func_anisotropic_diffusion anisotropicDiffusion
+\ingroup imageflt_mat
+
+\brief Anisotropic Smoothing Filter
+
+Anisotropic diffusion aims to remove noise from images while preserving important
+features such as edges. The algorithm essentially creates a scale-space representation of the
+original image, where the image from the previous step is used to create a new, more blurred version
+through the diffusion process. Standard isotropic diffusion methods, such as Gaussian blur, do not
+take the local content (a small neighborhood of the pixel being processed) into account while removing
+noise. Anisotropic diffusion uses the flux equations given below to do so. The flux equation is the
+formula the diffusion process uses to determine how much a neighboring pixel should contribute to
+the blurring operation performed at the current pixel in a given iteration.
+
+The flux function can be either exponential or quadratic.
+
+<table>
+<caption id="multi row">Available Flux Functions</caption>
+<tr>
+    <td> AF_FLUX_QUADRATIC </td>
+    <td>  \f$ \frac{1}{1 + (\frac{\| \nabla I\|}{K})^2} \f$  </td>
+</tr>
+<tr>
+    <td> AF_FLUX_EXPONENTIAL </td>
+    <td>  \f$ \exp\left(-\left(\frac{\| \nabla I\|}{K}\right)^2\right) \f$  </td>
+</tr>
+</table>
+
+Be cautious when choosing the time step parameter. Appropriate time steps for solving this type of PDE depend on the dimensionality of the image and the order of the equation. Stable values for most 2D and 3D functions are 0.125 and 0.0625, respectively. The time step values are automatically constrained to the stable value.
+
+Another input parameter to be cautious about is the conductance parameter: lower values preserve image features more strongly, and higher values less so. For human vision, this value ranges from 0.5 to 2.0.
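The two flux functions in the table can be sketched as plain C++ functions of the gradient magnitude. This is only an illustration of the formulas: the function names are hypothetical, and \f$K\f$ is assumed here to be the conductance parameter discussed above.

```cpp
#include <cmath>

// Quadratic flux: 1 / (1 + (|grad I| / K)^2).
double fluxQuadratic(double gradMag, double k) {
    const double r = gradMag / k;
    return 1.0 / (1.0 + r * r);
}

// Exponential flux: exp(-(|grad I| / K)^2).
double fluxExponential(double gradMag, double k) {
    const double r = gradMag / k;
    return std::exp(-(r * r));
}
```

Both functions return 1 where the gradient is zero (flat regions diffuse fully) and fall towards 0 as the gradient grows relative to `k`, which is how strong edges are kept from being blurred.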
+
+#### References
+Pietro Perona and Jitendra Malik, "Scale-space and edge detection using anisotropic diffusion," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 629-639, 1990.
+
+R. Whitaker and X. Xue, "Variable-Conductance, Level-Set Curvature for Image Denoising," International Conference on Image Processing, 2001, vol. 3, pp. 142-145.
+
+=======================================================================
+
 \defgroup cv_func_match_template matchTemplate
 \ingroup match_mat
 
@@ -277,6 +369,9 @@ distance as well as the color distance.
 The bilateral filter requires the size of the filter (in pixels) and the upper
 bound on color values, N, where pixel values range from 0–N inclusively.
 
+The return type of the array is f64 for f64 input, f32 for all other input
+types.
+
 =======================================================================
 
 \defgroup image_func_erode erode
@@ -375,6 +470,37 @@ Save an array to disk as an image
 Supported formats include JPG, PNG, PPM and other formats supported by freeimage
 
 
+\defgroup imageio_func_available isImageIoAvailable
+\ingroup imageio_mat
+
+Returns true if ArrayFire was compiled with ImageIO (FreeImage) support
+
+
+\defgroup imagemem_func_load loadImageMem
+\ingroup imageio_mat
+
+Load an image from memory which is stored as a FreeImage stream (FIMEMORY).
+
+Supported formats include JPG, PNG, PPM and other formats supported by freeimage
+
+
+
+\defgroup imagemem_func_save saveImageMem
+\ingroup imageio_mat
+
+Save an array to memory as an image using FreeImage stream (FIMEMORY).
+
+Supported formats include JPG, PNG, PPM and other formats supported by freeimage
+
+
+\defgroup imagemem_func_delete deleteImageMem
+\ingroup imageio_mat
+
+Delete memory created by the saveImageMem or af_save_image_memory function.
+This internally calls FreeImage_CloseMemory.
+
+Supported formats include JPG, PNG, PPM and other formats supported by freeimage
+
 
 \defgroup calc_func_grad grad
 \ingroup calc_mat
@@ -421,10 +547,12 @@ grad(dx, dy, in);
 
 Resize an input image
 
-Resizing an input image can be done using either \ref AF_INTERP_NEAREST or
-\ref AF_INTERP_BILINEAR interpolations. Nearest interpolation will pick the
-nearest value to the location, whereas bilinear interpolation will do a
-weighted interpolation for calculate the new size.
+Resizing an input image can be done using \ref AF_INTERP_NEAREST,
+\ref AF_INTERP_BILINEAR or \ref AF_INTERP_LOWER interpolation. Nearest
+interpolation picks the value nearest to the location, bilinear
+interpolation computes a weighted average of the neighboring values,
+and lower interpolation is similar to nearest, except that it uses the
+floor function to pick the lower neighbor.
 
 This function does not differentiate between images and data. As long as
 the array is defined and the output dimensions are not 0, it will resize any
@@ -472,27 +600,39 @@ af_print(resize(2, in, AF_INTERP_BILINEAR));
 \defgroup transform_func_rotate rotate
 \ingroup transform_mat
 
-Rotate an input image
-
-The angle theta is in radians.
-
-Rotating an input image can be done using either \ref AF_INTERP_NEAREST or
-\ref AF_INTERP_BILINEAR interpolations. Nearest interpolation will pick the
-nearest value to the location, whereas bilinear interpolation will do a
-weighted interpolation for calculate the new size.
-
-This function does not differentiate between images and data. As long as
-the array is defined, it will rotate any type or size of array.
-
-The crop option allows you to choose whether to resize the image.
-If crop is set to false, ie. the entire rotated image will be a part of the
-array and the new array size will be greater than or equal to the input array
-size.
-If crop is set to true, then the new array size is same as the input array
-size and the data that falls outside the boundaries of the array is discarded.
-
-Any location of the rotated array that does not map to a location of the input
-array is set to 0.
+\brief Rotate an input image or array
+
+The rotation is done counter-clockwise, with an angle \p theta (in radians),
+using a specified \p method of interpolation to determine the values of the
+output array. Six types of interpolation are currently supported:
+
+- \ref AF_INTERP_NEAREST - nearest value to the location
+- \ref AF_INTERP_BILINEAR - weighted interpolation
+- \ref AF_INTERP_BILINEAR_COSINE - bilinear interpolation with cosine smoothing
+- \ref AF_INTERP_BICUBIC - bicubic interpolation
+- \ref AF_INTERP_BICUBIC_SPLINE - bicubic interpolation with Catmull-Rom splines
+- \ref AF_INTERP_LOWER - floor indexed
+
+Since the output image still needs to be an upright box, \p crop determines how
+to bound the output image, given the now-rotated image. The figure below
+illustrates the effect of changing this parameter.
+
+\image html rotate_illus.png "Effect of \p crop parameter on the output"
+
+Here, the original image is represented by the innermost box with the solid
+black and dashed orange lines, and the (theoretical) rotated image is the box
+with the solid orange lines. If \p crop is true, then the output image's
+dimensions will stay the same as the original image's, but the rotated image's
+portions outside the dashed orange lines will be cropped, and the rest of the
+output image (the area between the solid black and solid orange lines) will be
+filled with zeros. However, if \p crop is false, then the output image's
+dimensions might get bigger (as shown in this illustration), as represented by
+the outermost box with dashed black lines. This change in dimensions is
+necessary to accommodate all of the rotated image's data. The remainder of the
+output image will be filled with zeros, as represented by the area between the
+solid orange lines and dashed black lines. Note that the new dimensions in
+general (beyond this illustration) will be greater than or equal to the original
+image's dimensions when \p crop is false.
 
 
 \defgroup transform_func_translate translate
@@ -579,25 +719,379 @@ Skew is a special case of the \ref af::transform function.
 
 Transform an input image
 
-The transform function uses an affine transform matrix to tranform an input
+The transform function uses an affine or perspective transform matrix to transform an input
 image into a new one.
 
-The transform matrix \p tf is a 3x2 matrix of type float. The matrix operation
-is applied to each location (x, y) that is then transformed to (x', y') of the
+If matrix \p tf is a 3x2 matrix, an affine transformation will be performed. The matrix
+operation is applied to each location (x, y) that is then transformed to (x', y') of the
 new array. Hence the transformation is an element-wise operation.
 
-The operation is as below:
-tf = [r00 r10
-      r01 r11
+The operation is as below:\n
+tf = [r00 r10\n
+      r01 r11\n
       t0  t1]
 
-x' = x * r00 + y * r01 + t0;
+x' = x * r00 + y * r01 + t0;\n
 y' = x * r10 + y * r11 + t1;
 
-Interpolation types of \ref AF_INTERP_NEAREST and \ref AF_INTERP_BILINEAR are allowed.
+If matrix \p tf is a 3x3 matrix, a perspective transformation will be performed.
+
+The operation is as below:\n
+tf = [r00 r10 r20\n
+      r01 r11 r21\n
+      t0  t1  t2]
+
+x' = (x * r00 + y * r01 + t0) / (x * r20 + y * r21 + t2);\n
+y' = (x * r10 + y * r11 + t1) / (x * r20 + y * r21 + t2);
+
+The transformation matrix \p tf should always be of type f32.
+
+Interpolation types of \ref AF_INTERP_NEAREST, \ref AF_INTERP_BILINEAR and
+\ref AF_INTERP_LOWER are allowed.
 
 Affine transforms can be used for various purposes. \ref af::translate, \ref af::scale and \ref af::skew
 are specializations of the transform function.
 
+
+\defgroup transform_func_coordinates transformCoordinates
+\ingroup transform_mat
+
+Transform input coordinates
+
+The transform function uses a perspective transform matrix to transform input
+coordinates (given as two dimensions) into a coordinates matrix.
+
+The output is a 4x2 matrix containing the coordinates of the 4 transformed
+two-dimensional points.
+
+=======================================================================
+
+\defgroup image_func_sat sat
+\ingroup imageflt_mat
+
+\brief Summed Area Tables
+
+Given an image \f$ I: (x,y) \mapsto i \f$, where \f$ i \f$ is the pixel intensity at position \f$(x, y)\f$,
+the summed area table \f$ S \f$ is defined by the recurrence:
+
+\f$S(x, y) = i(x, y) + S(x-1, y) + S(x, y-1) - S(x-1, y-1)\f$
+
+The output array of this function will have \f$ S(x, y) \f$ values at their corresponding locations, \f$(x,y)\f$.
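The recurrence can be sketched in plain C++ as follows. This is an illustration of the formula only, not ArrayFire's implementation: the function name is hypothetical, the image is assumed to be stored row-major in a `std::vector`, and terms of \f$S\f$ that fall outside the image are taken to be zero.

```cpp
#include <cstddef>
#include <vector>

// Build the summed area table of a row-major image of size width x height.
std::vector<double> summedAreaTable(const std::vector<double>& img,
                                    std::size_t width, std::size_t height) {
    std::vector<double> sat(img.size(), 0.0);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            // Out-of-range neighbors contribute zero.
            const double left   = (x > 0) ? sat[y * width + (x - 1)] : 0.0;
            const double up     = (y > 0) ? sat[(y - 1) * width + x] : 0.0;
            const double upLeft = (x > 0 && y > 0)
                                      ? sat[(y - 1) * width + (x - 1)]
                                      : 0.0;
            sat[y * width + x] = img[y * width + x] + left + up - upLeft;
        }
    }
    return sat;
}
```

For the 2x2 image with rows {1, 2} and {3, 4}, the table has rows {1, 3} and {4, 10}: the bottom-right entry is the sum of all four pixels.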
+
+=======================================================================
+
+\defgroup image_func_unwrap unwrap
+\ingroup image_mod_mat
+
+\brief Rearrange windowed sections of an array into columns (or rows)
+
+The figure below illustrates how unwrap works. A moving window (marked by
+orange boxes in the figure) of size `wx` \f$\times \f$ `wy` captures sections of
+the input array, and flattens them into columns (or rows if `is_column` is false)
+of the output array (illustrated in the right image). It starts at the top-left
+section of the input array and moves in column-major order, each time moving in
+strides of `sx` units along the column and `sy` units along the row, whenever it
+exhausts a column (stride size illustrated as the white arrows in the left image,
+and window movement illustrated as the progression of the small yellow numbers on
+the corner of each window). When the remainder of the column or row is not big
+enough to accommodate the window, that remainder is skipped and the window moves
+on (in the figure, the last row is not captured in any of the windows).
+
+Optionally, one can specify that the input image's border be padded (with zeros,
+represented as the gray boxes in the figure) before the moving window starts
+capturing sections. The width of the padding is defined by `px` for the top and
+bottom and `py` for the left and right sides, with maximum values of `wx`-1 and
+`wy`-1, respectively. The moving window then captures sections as if the padding
+is part of the input image, and thus the padding also becomes part of the output
+array's columns (illustrated in the bottom of the right image).
+
+\image html unwrap_640.png "Unwrap on a 3x4 input array, using a 2x2 window, 2x2 stride, 1x1 padding"
+
+In the figure, the stride is set to be equally large as the window size (both
+2x2), and thus the sections that the window captures are distinct. However, when
+the stride is set to the minimum (1x1) and is smaller than the window size, the
+sections overlap (which in turn makes the output's columns overlap as well). The
+window then acts as a perfect "sliding window" in this case (see the first code
+example below). In general, there will be some overlap as long as the stride is
+smaller than the window size (though the overlap decreases as the stride
+approaches the window size), and when the stride is equal to or greater than the
+window size, each section (and output column) will be distinct.
+
+For inputs that have more than two dimensions, the unwrap operation will be
+applied to each 2D slice of the input. This is especially useful for
+independently processing each channel of an image (or set of images) - each
+channel (along the third dimension) on the input corresponds to the same channel
+on the output, and each image (along the fourth dimension) on the input
+corresponds to the same image on the output.
+
+The size of the output is shown below. `nsections_dim0` and `nsections_dim1`
+denote how many windows can fit along the column and row, given the padded image
+size, window size, strides, and skips (if any):
+
+\code
+dim4(
+    wx * wy,                               // No. of rows (column height)
+    nsections_dim0 * nsections_dim1,       // No. of columns per channel
+    input.dims(2),                         // No. of channels
+    input.dims(3)                          // No. of images
+)
+\endcode
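The section counts are not spelled out above. A plausible sketch, assuming each dimension is padded on both sides and that a window which no longer fits is skipped, is `(dim + 2 * pad - window) / stride + 1` with integer division. The helper names below are hypothetical, and pairing `wx`, `sx`, `px` with the first dimension is an assumption based on the description above.

```cpp
// Number of window positions along one dimension (assumed formula).
long long sectionsAlong(long long dim, long long window, long long stride,
                        long long pad) {
    const long long padded = dim + 2 * pad;  // padding on both sides
    if (window <= 0 || stride <= 0 || padded < window) { return 0; }
    return (padded - window) / stride + 1;   // partial windows are skipped
}

struct UnwrapDims {
    long long rows;  // wx * wy
    long long cols;  // nsections_dim0 * nsections_dim1
};

// First two output dimensions of unwrap for a d0 x d1 input slice.
UnwrapDims unwrapOutputDims(long long d0, long long d1, long long wx,
                            long long wy, long long sx, long long sy,
                            long long px, long long py) {
    return {wx * wy,
            sectionsAlong(d0, wx, sx, px) * sectionsAlong(d1, wy, sy, py)};
}
```

For the configuration in the figure (3x4 input, 2x2 window, 2x2 stride, 1x1 padding) this gives 2 and 3 sections along the two dimensions, that is a 4x6 output, which agrees with the 4x6 array used in the wrap illustration.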
+
+Here are some code examples that demonstrate unwrap's usage:
+
+\snippet test/unwrap.cpp ex_unwrap
+
+One context where unwrap can be used is pre-processing an array or image for
+making window operations efficient (e.g. convolutions, computing the average
+pixel intensity around a point in an image, etc). Since each window capture is
+laid out as a column in an unwrapped array, vectorized operations can be executed
+efficiently on it (as opposed to strided access of each row in a window in the original
+array).
+
+Note that the actual implementation of unwrap may not match the way the operation
+is described above, but the effect should be the same.
+
+=======================================================================
+
+\defgroup image_func_wrap wrap
+\ingroup image_mod_mat
+
+Performs the opposite of \ref af::unwrap().
+
+More specifically, wrap takes each column (or row if `is_column` is false) of the
+\f$m \times n\f$ input array and reshapes them into `wx` \f$\times\f$ `wy`
+patches (where \f$m =\f$ `wx` \f$\times\f$ `wy`) of the `ox` \f$\times\f$ `oy`
+output array. Wrap is typically used on an array that has been previously
+unwrapped - for example, in the case of image processing, one can unwrap an
+image, process the unwrapped array, and then compose it back into an image using
+wrap.
+
+The figure below illustrates how wrap works. The process can be visualized as a
+moving window (orange boxes in the figure) taking a column from the input
+(top-left), reshaping it into a patch (bottom-left), and then placing that patch
+on its corresponding position in the output array (right; numbers in yellow show
+correspondence). It starts placing a patch on the output's top-left corner, then
+moves `sx` units along the column, and `sy` units along the row whenever it
+exhausts a column. If padding exists in the input array (gray-filled boxes),
+which typically happens when padding was applied on the previous unwrap, then
+`px` and `py` must be specified in order for the padding to be removed on the
+output array (in the figure, the output array on the right will actually only
+contain the inner boxes, size `ox` \f$\times\f$ `oy`).
+
+\image html wrap_distinct.png "Wrap on a 4x6 input array, using a 2x2 window, 2x2 stride, 1x1 padding. The output array is 3x4"
+
+There are some things that must be considered when wrapping a previously
+unwrapped array. First, wrap must use the same parameters that unwrap used, and
+must use the original array's size (before unwrap) as `ox` and `oy`. This is
+necessary to correctly elicit wrap's behavior as the opposite of unwrap. Second,
+one must consider whether the previous unwrap used a distinct or sliding window
+configuration, since the element-wise mapping from the input array to the output
+depends on the configuration. If the distinct window configuration (the stride is
+at least as large as the window size) was used, then the mapping is
+straightforward - each column will map to a unique section in the output array,
+and therefore each element in the input will map to a unique position in the
+output (shown in the figure above). However, in the case of the sliding window
+configuration (the stride is smaller than the window size), some of the columns
+will map to overlapping sections in the output array, and so elements from
+multiple columns will map to the same position on the output array. Recomposing
+the array then requires some way to choose between competing elements to place in
+that position. To address this contention, wrap simply sums all of the competing
+elements and places the sum in that position. The figure below illustrates this
+behavior: the fourth element of the first column and the third element of the
+second column in the input array both map to the same position on the output
+array, and thus their sum is placed on that position (this happens on the second
+and third column of the input as well - they both map to the third element of the
+second column in the output). Given this behavior, it is up to the user to
+pre-process the input (unwrapped) array (or post-process the output (wrapped)
+array) in a way that somehow takes all of the competing elements into
+consideration.
+
+\image html wrap_sliding.png "Wrap on the same array as above, but with 1x1 stride (sliding window)"
+
+For inputs that have more than two dimensions, the wrap operation will be
+applied to each 2D slice of the input. This is especially useful for
+independently processing each channel of an image (or set of images) - each
+channel (along the third dimension) on the input corresponds to the same channel
+on the output, and each image (along the fourth dimension) on the input
+corresponds to the same image on the output.
+
+Here are some code examples that demonstrate wrap's usage. The first one shows
+wrapping a previously unwrapped array that used a 1x1 padding and a distinct
+window configuration. Notice how the arguments used in unwrap are the same as
+those used in wrap:
+
+\snippet test/wrap.cpp ex_wrap_1
+
+The next one shows what happens when both unwrap and wrap use the sliding window
+configuration. Notice how the original array is not recovered through wrap;
+instead, overlapping elements are summed, just as described above:
+
+\snippet test/wrap.cpp ex_wrap_2
+
+Note that the actual implementation of wrap may not match the way the operation
+is visualized above, but the effect should be the same.
+
+=======================================================================
+
+\defgroup image_func_moments moments
+\ingroup moments_mat
+
+The \ref af::moments() function allows for finding different
+properties of image regions. Currently, ArrayFire calculates all first order moments.
+The moments are defined within the \ref af_moment_type enum.
+
+As the enum details, each moment can be returned individually or all first-order
+moments can be calculated at once. This can be done as follows:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+af::array moments = af::moments(input_image, AF_MOMENT_FIRST_ORDER);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Here is an example of how the shorthand versions might be used to find the area (or gray level sum) and
+center of mass of an image:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+double m00, m01, m10;
+af::moments(&m00, input_image, AF_MOMENT_M00);
+af::moments(&m01, input_image, AF_MOMENT_M01);
+af::moments(&m10, input_image, AF_MOMENT_M10);
+
+double area = m00;
+double x_center = m10 / m00;
+double y_center = m01 / m00;
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
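For intuition about what these three values represent, the sketch below computes them by hand in plain C++ using the standard raw-moment definition \f$ M_{pq} = \sum_{x,y} x^p y^q I(x,y) \f$. The names are hypothetical, the image is assumed to be row-major, and how ArrayFire assigns the x and y axes to array dimensions is not covered here.

```cpp
#include <cstddef>
#include <vector>

struct FirstOrderMoments {
    double m00;  // sum of intensities (area for a binary image)
    double m01;  // sum of y * intensity
    double m10;  // sum of x * intensity
};

// Raw moments up to first order of a row-major width x height image.
FirstOrderMoments firstOrderMoments(const std::vector<double>& img,
                                    std::size_t width, std::size_t height) {
    FirstOrderMoments m{0.0, 0.0, 0.0};
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const double v = img[y * width + x];
            m.m00 += v;
            m.m01 += static_cast<double>(y) * v;
            m.m10 += static_cast<double>(x) * v;
        }
    }
    return m;
}
```

For an image whose only non-zero pixel has intensity 3 at x = 2, y = 1, this gives m00 = 3, m10 = 6 and m01 = 3, so the center of mass m10 / m00, m01 / m00 is (2, 1).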
+
+=======================================================================
+
+\defgroup image_func_canny canny
+\ingroup imageflt_mat
+
+\brief Canny Edge Detector
+
+The Canny edge detector is an edge detection operator that uses a multi-stage algorithm to detect a
+wide range of edges in images. A more in depth discussion on it can be found [here](https://en.wikipedia.org/wiki/Canny_edge_detector).
+
+=======================================================================
+
+\defgroup image_func_iterative_deconv iterativeDeconv
+\ingroup imageflt_mat
+
+\brief Iterative Deconvolution
+
+The following table shows the iteration update equations of the respective
+deconvolution algorithms.
+
+<table>
+<tr><th>Algorithm</th><th>Update Equation</th></tr>
+<tr>
+    <td>Landweber</td>
+    <td>
+        \f$ \hat{I}_{n} = \hat{I}_{n-1} + \alpha * P^T \otimes (I - P \otimes \hat{I}_{n-1}) \f$
+    </td>
+</tr>
+<tr>
+  <td>Richardson-Lucy</td>
+  <td>
+    \f$ \hat{I}_{n} = \hat{I}_{n-1} \cdot \left( \frac{I}{\hat{I}_{n-1} \otimes P} \otimes P^T \right) \f$
+  </td>
+</tr>
+</table>
+
+where
+    - \f$ I \f$ is the observed(input/blurred) image
+    - \f$ P \f$ is the point spread function
+    - \f$ P^T \f$ is the transpose of point spread function
+    - \f$ \hat{I}_{n} \f$ is the current iteration's updated image estimate
+    - \f$ \hat{I}_{n-1} \f$ is the previous iteration's image estimate
+    - \f$ \alpha \f$ is the relaxation factor
+    - \f$ \otimes \f$ indicates the convolution operator
+
+The iterative deconvolution function accepts \ref af::array of the following types only:
+    - \ref f32
+    - \ref s16
+    - \ref u16
+    - \ref s8
+    - \ref u8
+
+\note The type of output \ref af::array from deconvolution will be double if
+the input array type is double. For other types, output type will be float.
+Should the caller want to save the image to disk or require the values of output
+to be in a fixed range, that should be done by the caller explicitly.
+
+=======================================================================
+
+\defgroup image_func_inverse_deconv inverseDeconv
+\ingroup imageflt_mat
+
+\brief Inverse Deconvolution
+
+Inverse deconvolution algorithms are linear, i.e. they are non-iterative in
+nature and usually faster than iterative deconvolution algorithms.
+
+Depending on the values passed on to the enum \ref af_inverse_deconv_algo,
+different equations are used to compute the final result.
+
+#### Tikhonov's Deconvolution Method:
+
+The update equation for this algorithm is as follows:
+
+\f[
+\hat{I}_{\omega} = \frac{ I_{\omega} * P^{*}_{\omega} } { |P_{\omega}|^2 + \gamma }
+\f]
+
+where
+    - \f$ I_{\omega} \f$ is the observed(input/blurred) image in frequency domain
+    - \f$ P_{\omega} \f$ is the point spread function in frequency domain
+    - \f$ \gamma \f$ is a user defined regularization constant
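As a small illustration of the equation above, the estimate for a single frequency bin can be computed with `std::complex`. This is only a sketch of the per-bin arithmetic: the function name is hypothetical, and the forward and inverse Fourier transforms that surround this step are not shown.

```cpp
#include <complex>

// Tikhonov estimate for one frequency bin:
//   I_hat = I * conj(P) / (|P|^2 + gamma)
std::complex<double> tikhonovBin(std::complex<double> observed,
                                 std::complex<double> psf, double gamma) {
    // std::norm returns the squared magnitude |P|^2.
    return observed * std::conj(psf) / (std::norm(psf) + gamma);
}
```

With `gamma` equal to 0 this is plain division by the point spread function. A positive `gamma` keeps the denominator away from zero where the point spread function has little energy, at the cost of damping those frequencies.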
+
+The inverse deconvolution function accepts \ref af::array of the following types only:
+    - \ref f32
+    - \ref s16
+    - \ref u16
+    - \ref s8
+    - \ref u8
+
+\note The type of output \ref af::array from deconvolution will be double
+if the input array type is double. In all other cases, it will be float.
+Should the caller want to save the image to disk or require the
+values of output to be in a fixed range, that should be done by the caller
+explicitly.
+
+=======================================================================
+
+\defgroup image_func_confidence_cc confidenceCC
+\ingroup connected_comps_mat
+
+\brief Segment image based on similar pixel characteristics
+
+This filter is similar to \ref af::regions() (connected components) with additional
+criteria for segmentation. In \ref af::regions(), all connected (\ref af_connectivity)
+pixels are considered to be a single component. In this
+variation of connected components, pixels whose statistics are similar to those
+of the neighborhoods around a given set of seed points are grouped together.
+
+The parameter \p radius determines the size of neighborhood around a seed point.
+
+Mean (\f$ \mu \f$) and Variance (\f$ \sigma^2 \f$) are the pixel statistics that
+are computed across all neighborhoods around the given set of seed points. The
+pixels which are connected to seed points and lie in the confidence interval
+ (\f$ [\mu - \alpha * \sigma, \mu + \alpha * \sigma] \f$ where \f$ \alpha \f$
+is the parameter \p multiplier) are grouped. \p multiplier can be used to
+control the width of the confidence interval.
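The interval test can be sketched in plain C++ as below. This illustrates the statistics only, under stated assumptions: the names are hypothetical, the population variance is used (the section does not say which estimator applies), and the connectivity check and the iteration are left out.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Interval {
    double lo;
    double hi;
};

// Confidence interval [mu - alpha * sigma, mu + alpha * sigma] computed from
// the pixel values sampled in the seed neighborhoods.
Interval confidenceInterval(const std::vector<double>& samples,
                            double multiplier) {
    if (samples.empty()) { return {0.0, 0.0}; }
    double sum   = 0.0;
    double sumSq = 0.0;
    for (double v : samples) {
        sum += v;
        sumSq += v * v;
    }
    const double n        = static_cast<double>(samples.size());
    const double mean     = sum / n;
    const double variance = std::max(sumSq / n - mean * mean, 0.0);
    const double sigma    = std::sqrt(variance);
    return {mean - multiplier * sigma, mean + multiplier * sigma};
}

// True if a pixel value lies inside the interval (bounds included).
bool inInterval(const Interval& iv, double value) {
    return value >= iv.lo && value <= iv.hi;
}
```

For samples {1, 3} the mean is 2 and the standard deviation is 1, so a multiplier of 2 gives the interval [0, 4]. A larger multiplier widens the interval and admits more pixels into the segment.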
+
+This filter follows an iterative approach for fine tuning the segmentation.
+An initial segmentation followed by a finite number (\p iter) of segmentations
+are performed. The user provided parameter \p iter is only a request and the
+algorithm can preempt the execution if \f$ \sigma^2 \f$ approaches zero. The
+initial segmentation uses the mean and variance calculated from the neighborhoods
+of all the seed points. For subsequent segmentations, all pixels in the previous
+segmentation are used to re-calculate the mean and variance (as opposed to using
+the pixels in the neighborhood of the seed point).
+
+Given below is a sample output for segmenting three different regions of a
+donut using a single seed.
+
+<img src="ccc_sample_output.png" align="center" alt="Confidence Connected Components Example"
+width="60%" height="60%"/>
+
+
+
 @}
 */
diff --git a/docs/details/index.dox b/docs/details/index.dox
new file mode 100644
index 0000000000..2e9d48eb0b
--- /dev/null
+++ b/docs/details/index.dox
@@ -0,0 +1,46 @@
+/**
+\addtogroup arrayfire_func
+@{
+
+\defgroup index_func_index index
+\ingroup index_mat
+
+\brief Lookup values of an array based on sequences and/or arrays
+
+===============================================================================
+
+\defgroup index_func_lookup lookup
+\ingroup index_mat
+
+\brief Lookup values of an array by indexing with another array.
+
+Will return an array with the values in the \p in array from the locations
+specified in the \p idx array. The resulting array contains values corresponding
+to each of the provided indices. Locations of the input data are assumed to be
+in the range [0, n). Indexing outside of this range will result in mirrored
+wrap-around behavior.
+
+One-dimensional indexing is shown in the following example.
+
+\snippet test/index.cpp ex_index_lookup1d
+
+Index locations can also be out of bounds.
+
+\snippet test/index.cpp ex_index_lookup_oob
+
+The axis along which to query the indices can also be specified. The
+resulting array will be of the same size as the input, except for the queried
+dimension which will match the number of elements in the index array.
+
+\snippet test/index.cpp ex_index_lookup2d
+
+===============================================================================
+
+\defgroup index_func_assign assign
+\ingroup index_mat
+
+\brief Copy and write values in the locations specified by the sequences
+
+
+@}
+*/
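The shape rule stated in the lookup documentation above (the output matches the input except along the queried dimension, which takes the length of the index array) can be sketched in plain C++. `Dims`, `lookupOutputDims` and `lookup1d` are hypothetical names used for illustration only, not ArrayFire API, and the gather is deliberately restricted to in-range indices so that it makes no claim about the out-of-bounds behavior.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4D shape such as [3 5 1 1].
using Dims = std::array<std::size_t, 4>;

// Output shape of a lookup along `dim`: identical to the input shape,
// except that the queried dimension has one entry per index.
Dims lookupOutputDims(const Dims& in, std::size_t numIndices, unsigned dim) {
    assert(dim < 4);
    Dims out = in;
    out[dim] = numIndices;
    return out;
}

// 1D gather for indices in [0, n). Out-of-range indices are rejected here
// instead of being remapped; that is a simplification of this sketch.
std::vector<float> lookup1d(const std::vector<float>& in,
                            const std::vector<std::size_t>& idx) {
    std::vector<float> out;
    out.reserve(idx.size());
    for (std::size_t i : idx) {
        assert(i < in.size());
        out.push_back(in[i]);
    }
    return out;
}
```

For instance, looking up two indices along dimension 1 of a [3 5 1 1] input yields a [3 2 1 1] result, and indices may repeat, so the output can be longer than the input along the queried dimension.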
diff --git a/docs/details/internal.dox b/docs/details/internal.dox
new file mode 100644
index 0000000000..5ac06422ca
--- /dev/null
+++ b/docs/details/internal.dox
@@ -0,0 +1,29 @@
+/**
+\addtogroup internal_func
+@{
+
+\defgroup internal_func_create createStridedArray
+
+Create an array with specified strides and offset.
+
+
+\defgroup internal_func_strides getStrides
+
+Get strides of underlying data.
+
+
+\defgroup internal_func_offset getOffset
+
+Get offset of the underlying data.
+
+
+\defgroup internal_func_linear isLinear
+
+Check if all elements in array are contiguous.
+
+\defgroup internal_func_owner isOwner
+
+Check if underlying data is owned by the current array.
+
+@}
+*/
diff --git a/docs/details/lapack.dox b/docs/details/lapack.dox
index beb39ba555..995d47129b 100644
--- a/docs/details/lapack.dox
+++ b/docs/details/lapack.dox
@@ -1,25 +1,47 @@
 /**
 \addtogroup arrayfire_func
 @{
-\defgroup lapack_factor_func_lu lu
+
+\defgroup lapack_factor_func_svd svd
+
+Perform singular value decomposition.
+
+This function factorizes a matrix \f$A\f$ into two unitary matrices, \f$U\f$
+and \f$V^T\f$, and a diagonal matrix \f$S\f$, such that \f$A = USV^T\f$. If
+\f$A\f$ has \f$M\f$ rows and \f$N\f$ columns (\f$M \times N\f$), then \f$U\f$
+will be \f$M \times M\f$, \f$V\f$ will be \f$N \times N\f$, and \f$S\f$ will be
+\f$M \times N\f$. However, for \f$S\f$, this function only returns the non-zero
+diagonal elements as a sorted (in descending order) 1D array.
+
+To reconstruct the original matrix \f$A\f$ from the individual factors, the
+following code snippet can be used:
+
+\snippet test/svd_dense.cpp ex_svd_reg
+
+When memory is a concern, and \f$A\f$ is dispensable, \ref af::svdInPlace() can
+be used. However, this in-place version is currently limited to input arrays
+where \f$M \geq N\f$.
 
 \ingroup lapack_factor_mat
 
-\brief Perform LU decomposition
+===============================================================================
 
-This function decomposes input matrix **A** into a lower triangle **L**, an upper triangle **U** such that
+\defgroup lapack_factor_func_lu lu
 
-    \f$A = L * U\f$
+Perform LU decomposition.
 
-For stability, a permutation array **P** is also used to modify the formula in the following manner.
+This function decomposes input matrix \f$A\f$ into a lower triangle \f$L\f$ and
+an upper triangle \f$U\f$ such that \f$A = L * U\f$.
 
-    \f$A(P, span) = L * U\f$
+For stability, a permutation array \f$P\f$ is also used to modify the formula
+in the following manner, \f$A(P, span) = L * U\f$.
 
-This operation can be performed in ArrayFire using the following code snippet.
+This operation can be performed in ArrayFire using the following code snippet.
 
 \snippet test/lu_dense.cpp ex_lu_unpacked
 
-The permuted version of the original matrix can be reconstructed using the following snippet.
+The permuted version of the original matrix can be reconstructed using the
+following snippet.
 
 \snippet test/lu_dense.cpp ex_lu_recon
 
@@ -57,91 +79,98 @@ a_perm [3 3 1 1]
     1.0000     4.0000     7.0000
 \endcode
 
-When memory is a concern, users can perform the LU decomposition in place as shown below.
+When memory is a concern, users can perform the LU decomposition in place as
+shown below.
 
 \snippet test/lu_dense.cpp ex_lu_packed
 
-The lower and upper triangle matrices can be obtained if necessary in the following manner.
+The lower and upper triangle matrices can be obtained if necessary in the
+following manner.
 
 \snippet test/lu_dense.cpp ex_lu_extract
 
-LU decompositions has many applications including <a href="http://en.wikipedia.org/wiki/LU_decomposition#Solving_linear_equations">solving a system of linear equations</a>. Check \ref af::solveLU fore more information.
-
-=======================================================================
-
-\defgroup lapack_factor_func_qr qr
+LU decompositions have many applications including
+<a href="http://en.wikipedia.org/wiki/LU_decomposition#Solving_linear_equations">
+solving a system of linear equations</a>. Check \ref af::solveLU for more
+information.
 
 \ingroup lapack_factor_mat
 
-\brief Perform QR decomposition
-
-This function decomposes input matrix **A** into an orthogonal matrix **Q** and an upper triangular matrix **R** such that
+===============================================================================
 
-     \f$A = Q * R\f$
+\defgroup lapack_factor_func_qr qr
 
-     \f$Q * Q^T = I\f$
+Perform QR decomposition.
 
-Where **I** is an identity matrix. The matrix **Q** is a square matrix of size **max(M, N)** where **M** and **N** are rows and columns of **A** respectively. The matrix **R** is the same size as **A*.
+This function decomposes input matrix \f$A\f$ into an orthogonal matrix \f$Q\f$
+and an upper triangular matrix \f$R\f$ such that \f$A = Q * R\f$ and
+\f$Q * Q^T = I\f$, where \f$I\f$ is an identity matrix. The matrix \f$Q\f$ is a
+square matrix of size \f$\max(M, N)\f$, where \f$M\f$ and \f$N\f$ are the
+numbers of rows and columns of \f$A\f$ respectively. The matrix \f$R\f$ is the
+same size as \f$A\f$.
 
 This operation can be performed in ArrayFire using the following code snippet.
 
 \snippet test/qr_dense.cpp ex_qr_unpacked
 
-The additional parameter **Tau** can be used to speed up solving over and under determined system of equations.
+The additional parameter `tau` can be used to speed up solving over- and
+under-determined systems of equations.
 
 The original matrix can be reconstructed using the following code snippet.
 
 \snippet test/qr_dense.cpp ex_qr_recon
 
-When memory is a concern, users can perform QR decomposition in place as shown below.
+When memory is a concern, users can perform QR decomposition in place as shown
+below.
 
 \snippet test/qr_dense.cpp ex_qr_packed
 
-=======================================================================
-
-\defgroup lapack_factor_func_cholesky cholesky
-
 \ingroup lapack_factor_mat
 
-\brief Perform Cholesky decomposition
+===============================================================================
 
-This function decomposes a <a href="http://en.wikipedia.org/wiki/Positive-definite_matrix">positive definite</a> matrix **A** into two triangular matrices such that
+\defgroup lapack_factor_func_cholesky cholesky
 
-     \f$A = L * U\f$
+Perform Cholesky decomposition.
 
-     \f$L = U^T\f$
+This function decomposes a
+<a href="http://en.wikipedia.org/wiki/Positive-definite_matrix">positive
+definite</a> matrix \f$A\f$ into two triangular matrices such that,
+\f$A = L * U\f$ and \f$L = U^T\f$.
 
-Only one of **L** and **U** is stored to conserve space when solving linear equations.
+Only one of \f$L\f$ and \f$U\f$ is stored to conserve space when solving linear
+equations.
 
 This operation can be performed in ArrayFire using the following code snippet.
 
 \snippet test/cholesky_dense.cpp ex_chol_reg
 
-When memory is a concern, users can perform Cholesky decomposition in place as shown below.
+When memory is a concern, users can perform Cholesky decomposition in place as
+shown below.
 
 \snippet test/cholesky_dense.cpp ex_chol_inplace
 
-=======================================================================
-
-\defgroup lapack_solve_func_gen solve
+\ingroup lapack_factor_mat
 
-\ingroup lapack_solve_mat
+===============================================================================
 
-\brief Solve a system of equations
+\defgroup lapack_solve_func_gen solve
 
-This function takes a co-efficient matrix **A** and an output matrix **B**  as inputs to solve the following equation for **X**
+Solve a system of equations.
 
-     \f$A * X = B\f$
+This function takes a coefficient matrix \f$A\f$ and an output matrix \f$B\f$
+as inputs to solve the following equation for \f$X\f$: \f$A * X = B\f$.
 
 This operation can be done in ArrayFire using the following code snippet.
 
-\snippet test/solve_dense.cpp ex_solve
+\snippet test/solve_common.hpp ex_solve
 
-The results can be verified by reconstructing the output matrix using \ref af::matmul in the following manner.
+The results can be verified by reconstructing the output matrix using \ref
+af::matmul in the following manner.
 
-\snippet test/solve_dense.cpp ex_solve_recon
+\snippet test/solve_common.hpp ex_solve_recon
 
-The sample output can be seen below
+The sample output can be seen below.
 
 \code
 A [3 3 1 1]
@@ -165,52 +194,57 @@ B1 [3 1 1 1]
    39.0000
 \endcode
 
-If the coefficient matrix is known to be a triangular matrix, \ref AF_MAT_LOWER or \ref AF_MAT_UPPER can be passed to make solve faster.
+If the coefficient matrix is known to be a triangular matrix, \ref AF_MAT_LOWER
+or \ref AF_MAT_UPPER can be passed to make solve faster.
 
-The sample code snippets for solving a lower triangular matrix can be seen below.
+The sample code snippets for solving a lower triangular matrix can be seen
+below.
 
-\snippet test/solve_dense.cpp ex_solve_lower
+\snippet test/solve_common.hpp ex_solve_lower
 
-Similarily, the code snippet for solving an upper triangular matrix can be seen below.
+Similarly, the code snippet for solving an upper triangular matrix can be seen
+below.
 
-\snippet test/solve_dense.cpp ex_solve_upper
+\snippet test/solve_common.hpp ex_solve_upper
 
 See also: \ref af::solveLU
 
-=======================================================================
-
-\defgroup lapack_solve_lu_func_gen solveLU
-
 \ingroup lapack_solve_mat
 
-\brief Solve a system of equations
+===============================================================================
+
+\defgroup lapack_solve_lu_func_gen solveLU
 
-This function takes a co-efficient matrix **A** and an output matrix **B**  as inputs to solve the following equation for **X**
+Solve a system of equations.
 
-     \f$A * X = B\f$
+This function takes a coefficient matrix \f$A\f$ and an output matrix \f$B\f$
+as inputs to solve the following equation for \f$X\f$: \f$A * X = B\f$.
 
 This operation can be done in ArrayFire using the following code snippet.
 
-\snippet test/solve_dense.cpp ex_solve_lu
+\snippet test/solve_common.hpp ex_solve_lu
 
-This function along with \ref af::lu split up the task af::solve performs for square matrices.
+This function, along with \ref af::lu, splits up the task af::solve performs for
+square matrices.
 
-\note This function is beneficial over \ref af::solve only in long running application where the coefficient matrix **A** stays the same, but the observed variables keep changing.
+This function is beneficial over \ref af::solve only in long-running
+applications where the coefficient matrix \f$A\f$ stays the same, but the
+observed variables keep changing.
 
+\ingroup lapack_solve_mat
 
-=======================================================================
+===============================================================================
 
 \defgroup lapack_ops_func_inv inverse
 
-\ingroup lapack_ops_mat
-
-\brief Invert a matrix
+Invert a matrix.
 
-This function inverts a square matrix **A**. The code snippet to demonstrate this can be seen below.
+This function inverts a square matrix \f$A\f$. The code snippet to demonstrate
+this can be seen below.
 
 \snippet test/inverse_dense.cpp ex_inverse
 
-The sample output can be seen below
+The sample output can be seen below.
 
 \code
 A [3 3 1 1]
@@ -230,38 +264,74 @@ I [3 3 1 1]
 
 \endcode
 
-==================================================================================
+\ingroup lapack_ops_mat
 
-\defgroup lapack_ops_func_rank rank
+===============================================================================
+
+\defgroup lapack_ops_func_pinv pinverse
+
+Pseudo-invert (Moore-Penrose) a matrix.
+
+This function calculates the Moore-Penrose pseudoinverse of a matrix \f$A\f$,
+using \ref af::svd at its core. If \f$A\f$ is of size \f$M \times N\f$, then
+its pseudoinverse \f$A^+\f$ will be of size \f$N \times M\f$.
+
+This calculation can be batched if the input array is three- or four-dimensional
+(\f$M \times N \times P \times Q\f$, with \f$Q=1\f$ for only three dimensions).
+Each \f$M \times N\f$ slice along the third dimension will have its
+own pseudoinverse, for a total of \f$P \times Q\f$ pseudoinverses in the output
+array \f$(N \times M \times P \times Q)\f$.
+
+Below is an example snippet of its usage. In this example, we have a matrix
+\f$A\f$ and compute its pseudoinverse \f$A^+\f$. This condition must hold:
+\f$AA^+A=A\f$, given that the two matrices are pseudoinverses of each other (in
+fact, this is one of the Moore-Penrose conditions):
+
+\snippet test/pinverse.cpp ex_pinverse
 
 \ingroup lapack_ops_mat
 
-\brief Find the rank of the input matrix.
+===============================================================================
 
-This function uses \ref af::qr to find the rank of the input matrix within the given tolerance.
+\defgroup lapack_ops_func_rank rank
 
-=====================================================================================
+Find the rank of a matrix.
 
-\defgroup lapack_ops_func_det det
+This function uses \ref af::qr to find the rank of the input matrix within the
+given tolerance.
 
 \ingroup lapack_ops_mat
 
-\brief Find the determinant of the input matrix.
+===============================================================================
+
+\defgroup lapack_ops_func_det det
+
+Find the determinant of a matrix.
 
+This function requires scratch space equal to the input array.
 
-\note This function requires scratch space equal to the input array
+\ingroup lapack_ops_mat
 
 ===============================================================================
 
 \defgroup lapack_ops_func_norm norm
 
+Find the norm of a matrix.
+
+This function can return the norm using various metrics based on the `type`
+parameter.
+
+\ref AF_NORM_MATRIX_2 is currently not supported.
+
 \ingroup lapack_ops_mat
 
-\brief Find the norm of the input matrix
+===============================================================================
+
+\defgroup lapack_helper_func_available isLAPACKAvailable
 
-This function can return the norm using various metrics based on the type paramter.
+\brief Returns true if ArrayFire is compiled with LAPACK support
 
-\note \ref AF_NORM_MATRIX_2 is currently not supported.
+\ingroup lapack_helper
 
 ===============================================================================
 
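The svd documentation in lapack.dox above notes that only the diagonal of \f$S\f$ is returned, as a 1D array. A minimal plain C++ sketch of how those values expand back into an \f$M \times N\f$ matrix and recombine as \f$A = USV^T\f$ is given here. `Matrix`, `expandSingularValues`, `multiply` and `reconstruct` are hypothetical helpers, not ArrayFire API, and the storage is row-major purely to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major dense matrix, zero-initialized.
struct Matrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double& at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Place the 1D singular values on the diagonal of an M x N matrix S.
Matrix expandSingularValues(const std::vector<double>& s, std::size_t m,
                            std::size_t n) {
    assert(s.size() <= (m < n ? m : n));
    Matrix S(m, n);
    for (std::size_t i = 0; i < s.size(); ++i) S.at(i, i) = s[i];
    return S;
}

// Plain triple-loop matrix product.
Matrix multiply(const Matrix& a, const Matrix& b) {
    assert(a.cols == b.rows);
    Matrix c(a.rows, b.cols);
    for (std::size_t i = 0; i < a.rows; ++i)
        for (std::size_t k = 0; k < a.cols; ++k)
            for (std::size_t j = 0; j < b.cols; ++j)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
    return c;
}

// A = U * S * V^T, with U of size M x M and V^T of size N x N.
Matrix reconstruct(const Matrix& u, const std::vector<double>& s,
                   const Matrix& vt) {
    const Matrix S = expandSingularValues(s, u.rows, vt.rows);
    return multiply(multiply(u, S), vt);
}
```

The sketch takes \f$V^T\f$ directly, so no transpose is needed before the final product; the shapes follow the text, with \f$U\f$ being \f$M \times M\f$, \f$S\f$ being \f$M \times N\f$ and \f$V^T\f$ being \f$N \times N\f$.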
diff --git a/docs/details/random.dox b/docs/details/random.dox
new file mode 100644
index 0000000000..d2400fcbbe
--- /dev/null
+++ b/docs/details/random.dox
@@ -0,0 +1,97 @@
+
+/**
+
+\defgroup random_mat Random Number Generation
+
+\brief Random Number Generation Functions
+
+Functions to generate and manage random numbers and random number engines.
+
+\ingroup data_mat
+
+===============================================================================
+
+\addtogroup arrayfire_func
+@{
+
+\defgroup random_func_random_engine randomEngine
+
+\brief Functions to create, modify, use, and destroy randomEngine objects.
+
+A \ref af::randomEngine object can be used to generate pseudo-random numbers
+using various types of random number generation algorithms defined by \ref
+af::randomEngineType.
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_randu randu
+
+\brief Create a random array sampled from uniform distribution.
+
+The type of engine used is defined by \ref af::randomEngine.
+
+The data is uniformly distributed over the range [0, 1].
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_randn randn
+
+\brief Create a random array sampled from normal distribution.
+
+The type of engine used is defined by \ref af::randomEngine.
+
+The data is centered around 0.
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_set_default_engine setDefaultRandomEngineType
+
+\brief Set the default random engine type.
+
+This random engine type is used when calling random number functions without
+an \ref af::randomEngine object as an argument.
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_get_default_engine getDefaultRandomEngine
+
+\brief Returns the default random engine object.
+
+Returns the \ref af::randomEngine that is currently set as default.
+
+Note that there is no need to call \ref af_release_random_engine on the handle
+returned by \ref af_get_default_random_engine.
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_set_seed setSeed
+
+\brief Set the seed for random number generation.
+
+Sets the seed for the current default random engine.
+
+\ingroup random_mat
+
+===============================================================================
+
+\defgroup random_func_get_seed getSeed
+
+\brief Returns the seed for random number generation.
+
+Returns the seed for the current default random engine.
+
+\ingroup random_mat
+
+
+@}
+*/
diff --git a/docs/details/signal.dox b/docs/details/signal.dox
index 77b1eb003e..e77da4f968 100644
--- a/docs/details/signal.dox
+++ b/docs/details/signal.dox
@@ -10,28 +10,18 @@ if a and b are the coefficients.
 Another way to think about it is that the filter kernel is centered on each pixel in a,
 and the output for that pixel or data point is the sum of the products.
 
-Depending on the dimensions of the input signal and the filter signal, any one of the following
+Depending on the size of the signal and the filter, any one of the following
 batch mode convolutions take place.
 
-- **One to One**   - Single filter applied to single input.
-- **One to Many**  - Many filters applied on same input
-- **Many to One**  - Single filter applied to a set of inputs.
-- **Many to Many** - A set of filters applied onto to a set of inputs in one-to-one correspondence.
-
-
-
-\page signal_func_conv2_batch_desc convolve2
-
-For example, if the signal is two dimensional with m & n as sizes along the 0th & 1st dimensions
-respectively, then the possible batch operations are as follows.
-
-| Input Signal Dimensions | Filter Signal Dimensions | Batch Mode | Explanation |
-|:-----------------------:|:------------------------:|:----------:|:------------|
-| [m n 1 1] | [m n 1 1] | One to One  | Output will be a single convolve array |
-| [m n 1 1] | [m n p 1] | One to Many | Output will be 3d array with 2nd dimension length as p - p filters applied to same input |
-| [m n p 1] | [m n 1 1] | Many to One | Output will be 3d array with 2nd dimension length as p - 1 filter applied to p inputs |
-| [m n p 1] | [m n p 1] | Many to Many| Output will be 3d array with 2nd dimension length as p - p filter applied to p inputs in one-to-one correspondence |
+- **No Batch**   - Single filter applied to a single input.
+- **Filter is Batched**  - Many filters applied to the same input.
+- **Signal is Batched**  - Single filter applied to a set of inputs.
+- **Identical Batches** - A set of filters applied to a set of inputs in one-to-one correspondence.
+- **Non-overlapping Batches** - All batched filters are applied to all batched signals. The batch
+  axis of Signal and Filter **should not** be the same.
 
+\note All non-overlapping (interleaved) convolutions default to the frequency domain
+      (\ref AF_CONV_FREQ) irrespective of the provided convolution mode argument.
 
 
 \page signal_func_fft_desc fft
@@ -56,16 +46,78 @@ factor is calculated internally based on the input data provided.
 \addtogroup arrayfire_func
 @{
 
-\defgroup signal_func_convolve convolve
+\defgroup signal_func_convolve convolve (Non-separable)
 \ingroup convolve_mat
 
-\brief Convolution Integral for any dimensional data
+\brief Convolution Integral for one- through three-dimensional data
 
 \copydoc signal_func_conv_desc
 
-\copydoc signal_func_conv2_batch_desc
+This version of the convolution function delegates the call to the respective
+1D, 2D or 3D convolution functions internally.
+
+Convolution dimensionality is \f$ \min (sd, fd) \f$ where sd & fd are the dimensionalities of
+the signal and filter respectively.  This formulation only decides the dimensionality of convolution.
 
+Given below are some examples of how convolution dimensionality is computed.
 
+| Signal Size    | Filter Size    | Signal Rank | Filter Rank | Convolve Dimensionality   |
+|:--------------:|:--------------:|:----------:|:-----------:|:-------------------------:|
+| \dims{m,n,1,1} | \dims{m,1,1,1} |     2      |      1      |   \f$ min(2, 1) => \f$ 1D |
+| \dims{m,1,1,1} | \dims{m,n,1,1} |     1      |      2      |   \f$ min(1, 2) => \f$ 1D |
+| \dims{m,n,1,1} | \dims{m,n,1,1} |     2      |      2      |   \f$ min(2, 2) => \f$ 2D |
+| \dims{m,n,1,1} | \dims{m,n,p,1} |     2      |      3      |   \f$ min(2, 3) => \f$ 2D |
+| \dims{m,n,1,p} | \dims{m,n,1,q} |     4      |      4      |   3D |
+| \dims{m,n,p,1} | \dims{m,n,q,1} |     3      |      3      |   \f$ min(3, 3) => \f$ 3D |
+
+\note In cases similar to the fifth row of the above table, where both
+      signal and filter are of rank 4, the function delegates the
+      operation to three-dimensional convolution (\ref signal_func_convolve3).
+
+If the operation you intend to perform doesn't align with what this
+function does, please check the documentation of the rank-specific convolve
+functions (linked below) to find out more.
+
+- \ref signal_func_convolve1
+- \ref signal_func_convolve2
+- \ref signal_func_convolve3
+
+
+
+
+\defgroup signal_func_convolve_sep convolve (Separable)
+\ingroup convolve_mat
+
+\brief Separable Convolution
+
+Separable convolution is a faster equivalent of the canonical 2D convolution with
+an additional prerequisite that the filter/kernel can be decomposed into two
+separate spatial vectors. A classic example of such a separable kernel
+is the Sobel operator. Given below is the decomposition of the vertical gradient of the Sobel operator.
+
+\f$
+\begin{bmatrix}
+-1 & 0 & +1 \\
+-2 & 0 & +2 \\
+-1 & 0 & +1 \\
+\end{bmatrix}
+\f$
+
+can be decomposed into two vectors shown below.
+
+\f$
+\begin{bmatrix}
+1 \\
+2 \\
+1 \\
+\end{bmatrix}
+\f$
+
+\f$
+\begin{bmatrix}
+-1 & 0 & +1 \\
+\end{bmatrix}
+\f$
 
 
 \defgroup signal_func_convolve1 convolve1
@@ -75,14 +127,24 @@ factor is calculated internally based on the input data provided.
 
 \copydoc signal_func_conv_desc
 
-For example, if the input size is m along 0th dimension, then the possible batch operations are as follows.
+For one-dimensional signals (say m is the size along the 0th axis), the following batch operations are possible.
+
+| Signal Size    | Filter Size    | Output Size    | Batch Mode               | Description                                                        |
+| :------------: | :------------: | :------------: | :----------------------: | :----------------------------------------------------------------- |
+| \dims{m,1,1,1} | \dims{m,1,1,1} | \dims{m,1,1,1} | No Batch                 | Output will be a single convolved array                            |
+| \dims{m,1,1,1} | \dims{m,n,1,1} | \dims{m,n,1,1} | Filter is Batched        | n filters applied to same input                                    |
+| \dims{m,n,1,1} | \dims{m,1,1,1} | \dims{m,n,1,1} | Signal is Batched        | 1 filter applied to n inputs                                       |
+| \dims{m,n,p,q} | \dims{m,n,p,q} | \dims{m,n,p,q} | Identical Batches        | n*p*q filters applied to n*p*q inputs in one-to-one correspondence |
+| \dims{m,n,1,1} | \dims{m,1,p,q} | \dims{m,n,p,q} | Non-overlapping batches  | p*q filters applied to n inputs to produce n x p x q results       |
 
-| Input Signal Dimensions | Filter Signal Dimensions | Batch Mode | Explanation |
-|:-----------------------:|:------------------------:|:----------:|:------------|
-| [m 1 1 1] | [m 1 1 1] | One to One  | Output will be a single convolve array |
-| [m 1 1 1] | [m n 1 1] | One to Many | Output will be 2d array with 1st dimension length as n - n filters applied to same input |
-| [m n 1 1] | [m 1 1 1] | Many to One | Output will be 2d array with 1st dimension length as n - 1 filter applied to n inputs |
-| [m n 1 1] | [m n 1 1] | Many to Many| Output will be 2d array with 1st dimension length as n - n filter applied to n inputs in one-to-one correspondence |
+There are various other permutations of signal and filter sizes that fall under
+the category of non-overlapping batch mode that are not listed in the above
+table. For any signal and filter size combination to fall under the
+non-overlapping batch mode, they should satisfy one of the following conditions.
+- Signal and filter size along a given batch axis (\f$ > 1 \f$) should be same.
+- Either signal size or filter size along a given batch axis (\f$ > 1 \f$) should be equal to one.
+
+\note For the above tabular illustrations, we assumed \ref af_conv_mode is \ref AF_CONV_DEFAULT.
 
 
 
@@ -93,55 +155,161 @@ For example, if the input size is m along 0th dimension, then the possible batch
 
 \copydoc signal_func_conv_desc
 
-\copydoc signal_func_conv2_batch_desc
+A detailed explanation of each batch mode for 2D convolutions is provided below.
+Given below are definitions of variables and constants that are used to
+facilitate easy illustration of the operations.
 
+- \f$[M\quad N]\f$ and \f$[A\quad B]\f$ are the signal and filter sizes along the
+  \f$0^{th}\f$ & \f$1^{st}\f$ axes respectively.
+- \f$P\f$ and \f$Q\f$ are two constants, integers greater than one.
+- \f$ p \f$ is an integer variable with range \f$ \ 0 \leq p < P \f$.
+- \f$ q \f$ is an integer variable with range \f$ \ 0 \leq q < Q \f$.
+- O, S and F are notations for Output, Signal and Filter respectively.
 
+We have also used images to showcase some examples which follow the
+below notation.
 
-\defgroup signal_func_convolve3 convolve3
-\ingroup convolve_mat
+- Each blue line is a two dimensional matrix.
+- Each orange line indicates a full 2d convolution operation.
+- Suffix of each letter indicates indices along the \f$ 2^{nd}\f$ and \f$ 3^{rd}\f$
+  axes in the order of appearance from left to right in the suffix.
+- O, S and F are notations for Output, Signal and Filter respectively.
 
-\brief Convolution Integral for three dimensional data
+### No Batch
 
-\copydoc signal_func_conv_desc
+Given below is an example of no batch mode.
 
-For example, if the signal is three dimensional with m, n & p sizes along the 0th, 1st & 2nd dimensions
-respectively, then the possible batch operations are as follows.
+\image html "basic.png" "Single 2d convolution with 2d filter"
 
-| Input Signal Dimensions | Filter Signal Dimensions | Batch Mode | Explanation |
-|:-----------------------:|:------------------------:|:----------:|:------------|
-| [m n p 1] | [m n p 1] | One to One  | Output will be a single convolve array |
-| [m n p 1] | [m n p q] | One to Many | Output will be 4d array with 3rd dimension length as q - q filters applied to same input |
-| [m n p q] | [m n p 1] | Many to One | Output will be 4d array with 3rd dimension length as q - 1 filter applied to q inputs |
-| [m n p q] | [m n p q] | Many to Many| Output will be 4d array with 3rd dimension length as q - q filter applied to q inputs in one-to-one correspondence |
+For input size \dims{M,N,1,1} and filter size \dims{A,B,1,1}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
 
-===============================================================================
+\shape_eq{O,M,N,1,1} = \convolve_eq{\shape_t{S,M,N,1,1},\shape_t{F,A,B,1,1}}
 
-\defgroup signal_func_fft_convolve fftConvolve
-\ingroup convolve_mat
 
-\brief Convolution using Fast Fourier Transform
+### Batched Filter
 
-\copydoc signal_func_conv_desc
+Given below is an example of filter batch mode.
 
-===============================================================================
+\image html "filter.png" "Single signal convolved with many filters independently"
 
-\defgroup signal_func_fft_convolve2 fftConvolve2
-\ingroup convolve_mat
+For input size \dims{M,N,1,1} and filter size \dims{A,B,P,1}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
 
-\brief 2D Convolution using Fast Fourier Transform
+\shape_eq{O,M,N,P,1} = \set_eq{\convolve_t{\shape_t{S,M,N,1,1},\shape_t{f,A,B,p,1}}, \forall \shape_t{f,A,B,p,1} \in \shape_t{F,A,B,P,1}}
 
-\copydoc signal_func_conv_desc
 
-===============================================================================
+### Batched Signal
+
+Given below is an example of signal batch mode.
+
+\image html "signal.png" "Single filter convolved with many signals independently"
+
+For input size \dims{M,N,P,1} and filter size \dims{A,B,1,1}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,P,1} = \set_eq{\convolve_t{\shape_t{s,M,N,p,1},\shape_t{F,A,B,1,1}}, \forall \shape_t{s,M,N,p,1} \in \shape_t{S,M,N,P,1}}
+
+
+### Identical Batch Sizes
+
+Given below is an example of identical batch mode.
+
+\image html "identical.png" "Many signals convolved with many filters in a one-to-one manner"
+
+For input size \dims{M,N,P,Q} and filter size \dims{A,B,P,Q}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,P,Q} = \set_eq{\convolve_t{\shape_t{s,M,N,p,q},\shape_t{f,A,B,p,q}}, \forall \shape_t{s,M,N,p,q} \in \shape_t{S,M,N,P,Q} \land \forall \shape_t{f,A,B,p,q} \in \shape_t{F,A,B,P,Q}}
+
+
+### Non-overlapping Batches
+
+Four different kinds of signal and filter size combinations are handled in this batch mode. Each
+of them is explained in its respective section below.
+
+#### Combination 1
+
+For input size \dims{M,N,P,1} and filter size \dims{A,B,1,Q}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,P,Q} = \set_eq{\set_t{\convolve_t{\shape_t{s,M,N,p,1},\shape_t{f,A,B,1,q}}, \forall \shape_t{s,M,N,p,1} \in \shape_t{S,M,N,P,1}}, \forall \shape_t{f,A,B,1,q} \in \shape_t{F,A,B,1,Q}}
 
-\defgroup signal_func_fft_convolve3 fftConvolve3
+Given below is an example of this batch mode.
+
+\image html "non-overlapping_1.png"
+
+#### Combination 2
+
+For input size \dims{M,N,P,1} and filter size \dims{A,B,P,Q}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,P,Q} = \set_eq{\set_t{\convolve_t{\shape_t{s,M,N,p,1},\shape_t{f,A,B,p,q}}, \forall \shape_t{f,A,B,p,q} \in \shape_t{F,A,B,P,Q}}, \forall \shape_t{s,M,N,p,1} \in \shape_t{S,M,N,P,1}}
+
+Given below is an example of this batch mode.
+
+\image html "non-overlapping_2.png"
+
+#### Combination 3
+
+For input size \dims{M,N,1,P} and filter size \dims{A,B,Q,1}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,Q,P} = \set_eq{\set_t{\convolve_t{\shape_t{s,M,N,1,p},\shape_t{f,A,B,q,1}}, \forall \shape_t{s,M,N,1,p} \in \shape_t{S,M,N,1,P}}, \forall \shape_t{f,A,B,q,1} \in \shape_t{F,A,B,Q,1}}
+
+Given below is an example of this batch mode.
+
+\image html "non-overlapping_3.png"
+
+#### Combination 4
+
+For input size \dims{M,N,P,Q} and filter size \dims{A,B,P,1}, the following set-builder
+notation gives a formal definition of all convolutions performed in this mode.
+
+\shape_eq{O,M,N,P,Q} = \set_eq{\set_t{\convolve_t{\shape_t{s,M,N,p,q},\shape_t{f,A,B,p,1}}, \forall \shape_t{s,M,N,p,q} \in \shape_t{S,M,N,P,Q}}, \forall \shape_t{f,A,B,p,1} \in \shape_t{F,A,B,P,1}}
+
+Given below is an example of this batch mode.
+
+\image html "non-overlapping_4.png"
+
+
+The batching behavior of the convolve2NN functions (\ref af_convolve2_nn() and
+\ref af::convolve2NN()) is different from convolve2. The new functions can perform 2D
+convolution on 3D signals and filters in a way that is more aligned with
+convolutional neural networks.
+
+| Signal Size         | Filter Size         | Output Size         | Batch Mode     | Description                                         |
+| :-----------------: | :-----------------: | :-----------------: | :------------: | :-------------------------------------------------- |
+| \dims{M, N, 1, 1}   | \dims{M, N, 1, 1}   | \dims{M, N, 1, 1}   | No Batch       | Output will be a single convolved array             |
+| \dims{M, N, 1, 1}   | \dims{M, N, P, 1}   | \dims{M, N, P, 1}   | *Invalid*      | Size along second axis should be same               |
+| \dims{M, N, P, 1}   | \dims{M, N, 1, 1}   | \dims{M, N, P, 1}   | *Invalid*      | Size along second axis should be same               |
+| \dims{M, N, P, 1}   | \dims{M, N, P, 1}   | \dims{M, N, 1, 1}   | No Batch       | 3D signal and 3D filter convolved to 2D result      |
+| \dims{M, N, P, Qs}  | \dims{M, N, P, Qf}  | \dims{M, N, Qf, Qs} | Batch Qs * Qf  | Qs signals and Qf filters to create Qs * Qf results |
+
+\note For the above tabular illustrations, we will assume \ref af_conv_mode is \ref AF_CONV_DEFAULT.
+
+
+
+\defgroup signal_func_convolve3 convolve3
 \ingroup convolve_mat
 
-\brief 3D Convolution using Fast Fourier Transform
+\brief Convolution Integral for three dimensional data
 
 \copydoc signal_func_conv_desc
 
-===============================================================================
+For three dimensional inputs with m, n & p sizes along the 0th, 1st & 2nd axes
+respectively, given below are the possible batch operations.
+
+| Signal Size        | Filter Size        | Output Size        | Batch Mode         | Description                                                |
+| :----------------: | :----------------: | :----------------: | :----------------: |:-----------------------------------------------------------|
+| \dims{m, n, p, 1}  | \dims{a, b, c, 1}  | \dims{m, n, p, 1}  | No Batch           | Output will be a single convolved array                    |
+| \dims{m, n, p, 1}  | \dims{a, b, c, d}  | \dims{m, n, p, d}  | Filter is Batched  | d filters applied to same input                            |
+| \dims{m, n, p, q}  | \dims{a, b, c, 1}  | \dims{m, n, p, q}  | Signal is Batched  | 1 filter applied to q inputs                               |
+| \dims{m, n, p, k}  | \dims{a, b, c, k}  | \dims{m, n, p, k}  | Identical Batches  | k filters applied to k inputs in one-to-one correspondence |
+
+\note For the above tabular illustrations, we assumed \ref af_conv_mode is \ref AF_CONV_DEFAULT.
+
+
 
 \defgroup signal_func_fft fft
 \ingroup fft_mat
@@ -191,25 +359,76 @@ respectively, then the possible batch operations are as follows.
 \copydoc signal_func_fft_desc
 
 
+\defgroup signal_func_fft_r2c fftR2C
+\ingroup fft_mat
+
+\brief Real to Complex Fast Fourier Transform
+
+
+\defgroup signal_func_fft_c2r fftC2R
+\ingroup fft_mat
+
+\brief Complex to Real Fast Fourier Transform
+
+
 \defgroup signal_func_approx1 approx1
 \ingroup approx_mat
+\brief Interpolation across a single dimension
+
+Performs interpolation on data along a single dimension.
+
+Interpolation is the process of computing unknown values within a
+continuous range described by a discrete set of known values. These
+known values (`in`) correspond to a uniformly-spaced range of indices
+determined by start and step values, whose defaults are 0.0 and 1.0,
+respectively.
+
+The positions array (`pos`) contains the interpolating points (indices
+whose values we want to find) along a given dimension. Values of **known indices**
+will be looked up in the input array, while values of **unknown indices**
+will be found via interpolation. Indices outside of the index range
+are not extrapolated. Instead, those values are set to `off_grid`, whose
+default value is 0.0.
 
-approx1 interpolates data along the first dimensions.
-It has three options for the type of interpolation to perform:
-- Nearest neighbor  - \ref AF_INTERP_NEAREST
-- Linear interpolation  - \ref AF_INTERP_LINEAR
-- Bilinear interpolation - \ref AF_INTERP_BILINEAR
-- Cubic interpolation - \ref AF_INTERP_CUBIC
+The following image illustrates a simple example (known values
+represented by blue dots, unknown values represented by red dots):
+
+\image html approx1_default_idx.png "approx1() using idx_start=0.0, idx_step=1.0"
+
+Several interpolation methods are supported by approx1:
+
+- Nearest neighbor interpolation - \ref AF_INTERP_NEAREST
+- Linear interpolation (default) - \ref AF_INTERP_LINEAR, \ref AF_INTERP_LINEAR_COSINE
+- Cubic interpolation - \ref AF_INTERP_CUBIC, \ref AF_INTERP_CUBIC_SPLINE
+- Lower interpolation - \ref AF_INTERP_LOWER
+
+Unless specified, linear interpolation is performed by default. Refer
+to \ref af_interp_type for more information about ArrayFire's
+interpolation types.
 
 \defgroup signal_func_approx2 approx2
 \ingroup approx_mat
-
-approx2 performs interpolation on data along the first and second dimensions.
-It has three options for the type of interpolation to perform:
-- Nearest neighbor  - \ref AF_INTERP_NEAREST
-- Linear interpolation  - \ref AF_INTERP_LINEAR
-- Bilinear interpolation - \ref AF_INTERP_BILINEAR
-- Cubic interpolation - \ref AF_INTERP_CUBIC
+\brief Interpolation along two dimensions
+
+Performs interpolation on data along two dimensions.
+
+Interpolation is the process of computing unknown values within a
+continuous range described by a discrete set of known values. These
+known values correspond to a uniformly-spaced range of indices
+determined by start and step values, whose defaults are 0.0 and 1.0,
+respectively.
+
+The positions arrays (`pos0` and `pos1`) contain the interpolating
+points (indices whose values we want to find) along two given
+dimensions. Values of **known indices** will be looked up in the input
+array, while values of **unknown indices** will be found via
+interpolation. Indices outside of the index range are not
+extrapolated. Instead, those values are set to `off_grid`, whose
+default value is 0.0.
+
+All of the interpolation methods defined in \ref af_interp_type are
+supported by approx2. Unless specified, linear interpolation is
+performed by default.
 
 \defgroup signal_func_fir fir
 \ingroup sigfilt_mat
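
Note on the approx1 description in the hunk above: the roles of the start value, the step value and `off_grid` can be made concrete with a small sketch. The code below is plain C++ and does not call ArrayFire. The name `approx1_linear_sketch` and its parameter order are hypothetical, only linear interpolation is covered, and ArrayFire's treatment of positions exactly on the boundary may differ from what is assumed here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an ArrayFire function): linear interpolation of
// uniformly spaced samples. Sample i is assumed to sit at index
// idx_start + i * idx_step. Positions outside that range yield off_grid.
std::vector<double> approx1_linear_sketch(const std::vector<double>& in,
                                          const std::vector<double>& pos,
                                          double idx_start = 0.0,
                                          double idx_step  = 1.0,
                                          double off_grid  = 0.0) {
    assert(!in.empty());
    assert(idx_step != 0.0);
    const double last = static_cast<double>(in.size() - 1);

    std::vector<double> out;
    out.reserve(pos.size());
    for (double p : pos) {
        // Map the requested position onto the 0-based sample grid.
        const double x = (p - idx_start) / idx_step;
        // No extrapolation: anything off the grid (or NaN) becomes off_grid.
        if (!(x >= 0.0 && x <= last)) {
            out.push_back(off_grid);
            continue;
        }
        const std::size_t lo = static_cast<std::size_t>(std::floor(x));
        const std::size_t hi = (lo + 1 < in.size()) ? lo + 1 : lo;
        const double t = x - static_cast<double>(lo);
        // Known indices (t == 0) are looked up; the rest are blended.
        out.push_back((1.0 - t) * in[lo] + t * in[hi]);
    }
    return out;
}
```

With samples `{10, 20, 30}` and the default start and step, position 0.5 falls halfway between the first two samples, while positions 2.5 and -1.0 lie outside the index range and receive `off_grid`.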
diff --git a/docs/details/sparse.dox b/docs/details/sparse.dox
new file mode 100644
index 0000000000..c01827cbbe
--- /dev/null
+++ b/docs/details/sparse.dox
@@ -0,0 +1,119 @@
+/**
+\addtogroup arrayfire_func
+@{
+
+\defgroup sparse_func_create sparse
+
+\brief Create a sparse array
+
+The sparse creation function has 3 different types of inputs it can accept.
+1. Independent \ref af::array for values, row indices and column indices.
+2. Independent host or device native arrays for values, row indices and column
+   indices.
+3. A dense \ref af::array.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_convert_to sparseConvertTo
+
+\brief Convert an existing sparse array into a different storage format
+
+Converting storage formats is allowed between \ref AF_STORAGE_CSR, \ref
+AF_STORAGE_COO and \ref AF_STORAGE_DENSE.
+
+When converting to \ref AF_STORAGE_DENSE, a dense array is returned.
+
+\note \ref AF_STORAGE_CSC is currently not supported.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_dense dense
+
+\brief Returns a dense array from a sparse input
+
+Converts the sparse matrix into a dense matrix and returns it
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_info sparseGetInfo
+
+\brief Returns reference to components of the input sparse array
+
+Returns reference to values, row indices, column indices and storage
+format of an input sparse array
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_values sparseGetValues
+
+\brief Returns reference to the values component of the sparse array
+
+Values is the \ref af::array containing the non-zero elements of the dense
+matrix.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_row_idx sparseGetRowIdx
+
+\brief Returns reference to the row indices component of the sparse array
+
+Row indices is the \ref af::array containing the row indices of the sparse
+array.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_col_idx sparseGetColIdx
+
+\brief Returns reference to the column indices component of the sparse array
+
+Column indices is the \ref af::array containing the column indices of the sparse
+array.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_nnz sparseGetNNZ
+
+\brief Returns the number of non zero elements in the sparse array
+
+This is always equal to the size of the values array.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup sparse_func_storage sparseGetStorage
+
+\brief Returns the storage type of a sparse array
+
+Returns the \ref af::storage type that describes the data storage format of the sparse array.
+
+\ingroup sparse_func
+\ingroup arrayfire_func
+
+=======================================================================
+
+@}
+*/
+
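
Note on the sparse documentation above: the three components it keeps referring to (values, row indices, column indices) can be illustrated with a dense-to-CSR conversion. This is a sketch in plain C++ that does not use ArrayFire. `CsrSketch` and `dense_to_csr_sketch` are hypothetical names, the input is a row-major nested vector chosen for readability, and the row array follows the usual CSR convention of `rows + 1` offsets, which is assumed rather than taken from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three CSR components.
struct CsrSketch {
    std::vector<double> values;   // non-zero elements, row by row
    std::vector<int>    row_idx;  // rows + 1 offsets into values/col_idx
    std::vector<int>    col_idx;  // column of each stored value
};

// Convert a dense row-major matrix into the CSR components.
CsrSketch dense_to_csr_sketch(const std::vector<std::vector<double>>& dense) {
    CsrSketch csr;
    csr.row_idx.push_back(0);
    for (const std::vector<double>& row : dense) {
        for (std::size_t c = 0; c < row.size(); ++c) {
            if (row[c] != 0.0) {
                csr.values.push_back(row[c]);
                csr.col_idx.push_back(static_cast<int>(c));
            }
        }
        // Offset one past the last stored element of this row.
        csr.row_idx.push_back(static_cast<int>(csr.values.size()));
    }
    return csr;
}

// The number of non-zeros always equals the size of the values array.
std::size_t nnz_sketch(const CsrSketch& csr) {
    assert(csr.values.size() == csr.col_idx.size());
    return csr.values.size();
}
```

In this layout the stored elements of row `r` occupy the half-open range `row_idx[r]` to `row_idx[r + 1]` of `values` and `col_idx`.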
diff --git a/docs/details/statistics.dox b/docs/details/statistics.dox
index c605e677ea..dab5bf25d4 100644
--- a/docs/details/statistics.dox
+++ b/docs/details/statistics.dox
@@ -1,7 +1,7 @@
 /*!
 \page batch_detail_stat statistics
 
-This function performs the operation across all batches present in the input simultaneously.
+This function performs the operation across all dimensions of the input array.
 
 */
 
@@ -31,7 +31,7 @@ Find the variance of values in the input
 
 \ingroup basicstats_mat
 
-Find the standar deviation of values in the input
+Find the standard deviation of values in the input
 
 \copydoc batch_detail_stat
 
@@ -62,6 +62,21 @@ Find the correlation coefficient of values in the input
 
 \copydoc batch_detail_stat
 
+========================================================
+\defgroup stat_func_topk topk
+
+\ingroup basicstats_mat
+
+This function returns the top k values along a given dimension of the input
+array. The indices along with their values are returned. If the input is a
+multi-dimensional array, the indices will be the index of the value in that
+dimension. The order of duplicate values is not preserved. This function is
+optimized for small values of k.
+
+\copydoc batch_detail_stat
+
+\note{Currently, topk elements can be found only along dimension 0.}
+
 ========================================================
 @}
 */
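
Note on the topk description above: the pairing of values with their indices along dimension 0 can be sketched for a single column. The code is plain C++, not ArrayFire. `TopkSketch` and `topk_max_sketch` are hypothetical names, and selecting the largest values is an assumption, since the text does not state the ordering. As in the description, ties between equal values may come back in any order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical result type: the k selected values and where they were found.
struct TopkSketch {
    std::vector<double>      values;
    std::vector<std::size_t> indices;
};

// Return the k largest values of a 1D input together with their indices.
TopkSketch topk_max_sketch(const std::vector<double>& in, std::size_t k) {
    assert(k <= in.size());

    // Sort indices instead of values so the original positions survive.
    std::vector<std::size_t> order(in.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::partial_sort(order.begin(),
                      order.begin() + static_cast<std::ptrdiff_t>(k),
                      order.end(),
                      [&in](std::size_t a, std::size_t b) { return in[a] > in[b]; });

    TopkSketch out;
    for (std::size_t i = 0; i < k; ++i) {
        out.indices.push_back(order[i]);
        out.values.push_back(in[order[i]]);
    }
    return out;
}
```

`std::partial_sort` only orders the first k entries, which fits the remark that the function is aimed at small values of k.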
diff --git a/docs/details/util.dox b/docs/details/util.dox
new file mode 100644
index 0000000000..ec8a01ee6e
--- /dev/null
+++ b/docs/details/util.dox
@@ -0,0 +1,137 @@
+/**
+\addtogroup arrayfire_func
+@{
+\defgroup print_func_print print
+
+\brief Print the array to screen
+
+Print Array and dimensions to screen
+
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup print_func_tostring toString
+
+\brief Print the array to a string instead of the screen
+
+This function is similar to af::print, except that it prints to a string
+rather than to screen.
+
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup stream_func_read readArray
+
+\brief Load an array from a file
+
+The readArray function lets users read arrays saved in files.
+Arrays can either be read using the index in the file (0-indexed), or using
+the key that was used along with the Array.
+
+Note that if there are multiple arrays with the same key, only the first one
+will be read.
+
+The format of the file (version 1) is as follows:
+
+Header:
+Description | Data Type | Size (Bytes) | Detailed Desc
+------------|-----------|--------------|--------------
+Version     | Char      | 1            | ArrayFire File Format Version for future use. Currently set to 1
+Array Count | Int       | 4            | No. of Arrays stored in file
+
+
+Per Array:
+Description             | Data Type | Size (Bytes) | Detailed Desc
+------------------------|-----------|--------------|--------------
+Length of Key String    | Int       | 4            | No. of characters (excluding null ending) in the key string
+Key                     | Char []   | length       | Key of the Array. Used when reading from file
+Offset                  | Int64     | 8            | No of bytes between offset and start of next array
+Array Type              | Char      | 1            | Type corresponding to af_dtype enum
+Dims (4 values)         | Int64     | 4 * 8 = 32   | Dimensions of the Array
+Data                    | Type      | sizeof(Type) * dims.elements() | Actual data of the array
+
+The offset is equal to 1 byte (type) + 32 bytes (dims) + size of data.
+
+A file with 2 arrays would look like this (representative):
+
+> 1\n
+> 2\n
+> Array 1 Key Length\n
+> Array 1 Key\n
+> Array 1 Offset\n
+> Array 1 Type\n
+> Array 1 Dims\n
+> Array 1 Data\n
+> Array 2 Key Length\n
+> Array 2 Key\n
+> Array 2 Offset\n
+> Array 2 Type\n
+> Array 2 Dims\n
+> Array 2 Data\n
+
+\ingroup dataio_mat
+\ingroup arrayfire_func
+
+=======================================================================
+
+\defgroup stream_func_save saveArray
+
+\brief Save an array to a binary file
+
+The saveArray and readArray functions are designed to provide store and
+read access to arrays using files written to disk.
+
+The format of the file (version 1) is as follows:
+
+Header:
+Description | Data Type | Size (Bytes) | Detailed Desc
+------------|-----------|--------------|--------------
+Version     | Char      | 1            | ArrayFire File Format Version for future use. Currently set to 1
+Array Count | Int       | 4            | No. of Arrays stored in file
+
+
+Per Array:
+Description             | Data Type | Size (Bytes) | Detailed Desc
+------------------------|-----------|--------------|--------------
+Length of Key String    | Int       | 4            | No. of characters (excluding null ending) in the key string
+Key                     | Char []   | length       | Key of the Array. Used when reading from file
+Offset                  | Int64     | 8            | No of bytes between offset and start of next array
+Array Type              | Char      | 1            | Type corresponding to af_dtype enum
+Dims (4 values)         | Int64     | 4 * 8 = 32   | Dimensions of the Array
+Data                    | Type      | sizeof(Type) * dims.elements() | Actual data of the array
+
+The offset is equal to 1 byte (type) + 32 bytes (dims) + size of data.
+
+A file with 2 arrays would look like this (representative):
+
+> 1\n
+> 2\n
+> Array 1 Key Length\n
+> Array 1 Key\n
+> Array 1 Offset\n
+> Array 1 Type\n
+> Array 1 Dims\n
+> Array 1 Data\n
+> Array 2 Key Length\n
+> Array 2 Key\n
+> Array 2 Offset\n
+> Array 2 Type\n
+> Array 2 Dims\n
+> Array 2 Data\n
+
+saveArray allows you to append any number of arrays to the same file using
+the append argument. If the append argument is false, the contents of the
+file are discarded and only the new array is written.
+
+On each append, the array counter in the header is incremented and the new
+array is written to the end of the file. This function does not check whether
+the key is unique.
+
+\ingroup dataio_mat
+\ingroup arrayfire_func
+
+@}
+*/
+
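
Note on the file format tables above: the byte counts can be checked with a short calculation. The sketch below is plain C++ derived only from the two tables (header of 1 + 4 bytes, per-array key length, key, offset, type, dims, data). All function names are hypothetical, and nothing here reads or writes an actual ArrayFire file.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Header: 1 byte version + 4 bytes array count.
constexpr std::uint64_t kHeaderBytesSketch = 1 + 4;

// Value of the "Offset" field: 1 byte (type) + 32 bytes (dims) + data size.
std::uint64_t offset_field_sketch(std::size_t type_size,
                                  const std::array<std::uint64_t, 4>& dims) {
    assert(type_size > 0);
    std::uint64_t elements = 1;
    for (std::uint64_t d : dims) { elements *= d; }
    return 1 + 32 + static_cast<std::uint64_t>(type_size) * elements;
}

// Bytes occupied by one array record:
// 4 (key length) + key characters + 8 (offset field) + offset value.
std::uint64_t record_bytes_sketch(const std::string& key,
                                  std::size_t type_size,
                                  const std::array<std::uint64_t, 4>& dims) {
    return 4 + static_cast<std::uint64_t>(key.size()) + 8 +
           offset_field_sketch(type_size, dims);
}
```

For a 2 x 3 array of 4-byte elements stored under the key `a`, the data takes 24 bytes, the offset field holds 57, the record takes 70 bytes, and a file containing only that array takes 75 bytes including the header.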
diff --git a/docs/details/vision.dox b/docs/details/vision.dox
new file mode 100644
index 0000000000..c870f18c07
--- /dev/null
+++ b/docs/details/vision.dox
@@ -0,0 +1,240 @@
+/**
+\addtogroup arrayfire_func
+@{
+
+\defgroup cv_func_fast fast
+\ingroup featdetect_mat
+
+\brief FAST feature detector
+
+A circle of radius 3 pixels, translating into a total of 16 pixels, is checked
+for sequential segments of pixels much brighter or much darker than the central
+one. For a pixel p to be considered a feature, there must exist a sequential
+segment of arc_length pixels in the circle around it such that all are greater
+than (p + thr) or smaller than (p - thr). After all features in the image are
+detected, if nonmax is true, non-maximal suppression is applied: each detected
+feature is compared with the features detected in its 8-neighborhood and is
+discarded if its score is not maximal.
+
+=======================================================================
+
+\defgroup cv_func_harris harris
+\ingroup featdetect_mat
+
+\brief Harris corner detector
+
+Compute corners using the Harris corner detector approach. For each pixel, a
+small window is used to calculate the determinant and trace of such a window,
+from which a response is calculated. Pixels are considered corners if they are
+local maxima and have a high positive response.
+
+=======================================================================
+
+\defgroup cv_func_susan susan
+\ingroup featdetect_mat
+
+\brief SUSAN corner detector
+
+SUSAN is an acronym standing for *Smallest Univalue Segment Assimilating Nucleus*. This method
+places a circular disc over the pixel to be tested (a.k.a nucleus) to compute the corner measure
+of that corresponding pixel. The region covered by the circular disc is **M**, and a pixel in this
+region is represented by \f$\vec{m} \in M\f$ where \f$\vec{m}_0\f$ is the nucleus. Every pixel in the region
+is compared to the nucleus using the following comparison function:
+
+\f$ c(\vec{m}) = e^{-{(({I(\vec{m}) - I(\vec{m}_0))} / t})^6}\f$
+
+where *t* is the brightness difference threshold and *I* is the brightness of the pixel.
+
+Response of SUSAN operator is given by the following equation:
+
+\f$ R(M) = \begin{cases} g - n(M) \quad \text{if } n(M) < g\\ 0 \quad \text{otherwise},\\ \end{cases}\f$
+
+where \f$ n(M) =  \sum\nolimits_{\vec{m} \in M} c(\vec{m})\f$, g is named the *geometric threshold* and n is the number
+of pixels in the mask which are within **t** of the nucleus.
+
+Importance of the parameters, **t** and **g** is explained below:
+
+- *t* determines how similar points have to be to the nucleus before they are considered to
+  be a part of the univalue segment
+- *g* determines the minimum size of the univalue segment. For a large enough *g*, the SUSAN operator becomes
+  an edge detector.
+
+=======================================================================
+
+\defgroup cv_func_orb orb
+\ingroup featdescriptor_mat
+
+\brief ORB Feature descriptor
+
+Extract ORB descriptors from FAST features that hold higher Harris responses.
+FAST does not compute orientation, thus, orientation of features is calculated
+using the intensity centroid. As FAST is also not multi-scale enabled, a
+multi-scale pyramid is calculated by downsampling the input image multiple
+times followed by FAST feature detection on each scale.
+
+=======================================================================
+
+\defgroup cv_func_sift sift
+\ingroup featdescriptor_mat
+
+\brief SIFT feature detector and descriptor extractor
+
+Detects features and extracts descriptors using the Scale Invariant Feature
+Transform (SIFT), by David Lowe.
+
+Lowe, D. G., "Distinctive Image Features from Scale-Invariant Keypoints",
+International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
+
+=======================================================================
+
+\defgroup cv_func_gloh gloh
+\ingroup featdescriptor_mat
+
+\brief SIFT feature detector and GLOH descriptor extractor
+
+Detects features using the Scale Invariant Feature Transform (SIFT),
+by David Lowe. Descriptors are extracted using Gradient Location and
+Orientation Histogram (GLOH).
+
+Lowe, D. G., "Distinctive Image Features from Scale-Invariant Keypoints",
+International Journal of Computer Vision, 60, 2, pp. 91-110, 2004.
+
+Mikolajczyk, K., and Schmid, C., "A performance evaluation of local
+descriptors", IEEE Transactions on Pattern Analysis and Machine Intelligence,
+10, 27, pp. 1615-1630, 2005.
+
+=======================================================================
+
+\defgroup cv_func_hamming_matcher hammingMatcher
+\ingroup featmatcher_mat
+
+\brief Hamming Matcher
+
+Calculates Hamming distances between two 2-dimensional arrays containing
+features, one of the arrays containing the training data and the other the
+query data. One dimension must have the same size in both arrays,
+identifying the length of each feature. The other dimension indicates the
+total number of features in each of the training and query arrays. Two
+1-dimensional arrays are created as results, one containing the smallest N
+distances of the query array and another containing the indices of these
+distances in the training array. The resulting 1-dimensional arrays have length
+equal to the number of features contained in the query array.
+
+=======================================================================
+
+\defgroup cv_func_nearest_neighbour nearestNeighbour
+\ingroup featmatcher_mat
+
+\brief Determine the nearest neighbouring points to a given set of points
+
+A "point" is simply a geometric point's coordinates in an n-dimensional space,
+which can be specified along the dimension specified by `dist_dim` (can be 0 or
+1). A list of such points can be enumerated along the dimension other than the
+one specified by `dist_dim` (excluding dim2 and dim3). By default, `dist_dim` is
+0, so a point's coordinates in this case must be specified along dim0, and the
+list of points must be enumerated along dim1. Consequently, if `dist_dim` is 1,
+then a point's coordinates must be specified along dim1, and the list must be
+enumerated along dim0.
+
+The arrays \p train and \p query are both a list of points, and one must have
+the same data layout as the other. This function calculates which points in the
+\p train are nearest to each point in \p query, based on the distance metric
+specified by \p dist_type: \ref AF_SAD (sum of absolute differences), \ref
+AF_SSD (sum of squared differences, the default option), or \ref AF_SHD (hamming
+distance). The resulting \p n_dist nearest neighboring points are described in
+two output arrays:
+- \p idx:  contains the index of each result that corresponds to the point in
+           \p train
+- \p dist: contains the distance from the query point to the result's
+           corresponding point in \p train
+
+In both the output arrays \p idx and \p dist, the nearest neighbor results for a
+single query are enumerated along dim0, in which the \f$ith\f$ result is the
+\f$ith\f$ nearest point to the query point. The result set for each query point
+is placed along dim1 (columns) of \p idx and \p dist, in the order that the
+queries appear in \p query. Therefore, the output arrays will have a shape of \p
+n_dist \f$\times\f$ the number of queries (regardless of the data layout of the
+input arrays, or the value of `dist_dim`).
+
+For illustration, a simple example is given below for 1 query in 1-dimensional
+space. There are 6 points in \p train, and 3 nearest neighbors are queried for
+(\p n_dist is 3), so there are 3 elements in the results for this single query,
+enumerated along dim0. The results \p idx and \p dist contain the 3 points
+closest to 1.25, ordered from nearest to farthest: point 0 in \p train (1.) with
+an SSD distance of 0.0625 from the query, point 1 (2.) with a distance of
+0.5625, and point 2 (3.) with a distance of 3.0625.
+
+\snippet test/nearest_neighbour.cpp ex_nearest_1
+
+A slightly more complicated example is given below. There are 2 \p query points
+and 6 \p train points, and they are in 3-dimensional space (each point's
+coordinates are specified along dim0, and the list of points is enumerated along
+dim1). Note that in the output arrays \p idx and \p dist, there are 2 sets of
+results now, one for each query. The result set located on the first column
+of \p idx and \p dist corresponds to the first query (the first column in \p
+query), and the result set on the second column of \p idx and \p dist corresponds
+to the second query (second column in \p query). Thus, for example, the second
+query point is (7.5, 9., 1.), and the point closest to it in \p train is point
+3, which is (8., 9., 1.), which has a SSD distance of 0.25 from the query point.
+
+\snippet test/nearest_neighbour.cpp ex_nearest_2
+
+Note that it does not make sense for the \p train and \p query array shapes to
+have a third and fourth dimension, because a 2-dimensional array is sufficient
+to describe a list of points, no matter how long the list is or how many
+dimensions in space the points span. Therefore, this function requires both
+input arrays to be at most 2-dimensional.
+
+=======================================================================
+
+\defgroup cv_func_dog dog
+\ingroup featdetect_mat
+
+\brief Difference of Gaussians
+
+Given an image, this function computes two smoothed versions of the
+input image using different smoothing parameters, subtracts one
+from the other, and returns the result.
+
+=======================================================================
+
+\defgroup cv_func_match_template matchTemplate
+\ingroup match_mat
+
+\brief Template Matching
+
+Template matching is an image processing technique to find small patches of an image which match a given template image. Currently, this function does not support the following three metrics:
+- \ref AF_NCC
+- \ref AF_ZNCC
+- \ref AF_SHD
+
+A more in depth discussion about template matching can be found [here](http://en.wikipedia.org/wiki/Template_matching).
+
+=======================================================================
+
+\defgroup cv_func_homography homography
+\ingroup homography_mat
+
+\brief Homography Estimation
+
+Homography estimation finds a perspective transform between two sets of 2D points.
+Currently, two methods are supported for the estimation: RANSAC (RANdom SAmple Consensus)
+and LMedS (Least Median of Squares). Both methods work by randomly selecting a subset
+of 4 points from the set of source points, computing the eigenvectors of that set and
+finding the perspective transform. The process is repeated several times. For RANSAC,
+the maximum number of repetitions is given by the iterations argument. On the CPU
+backend, fewer repetitions are usually performed, depending on the quality of the dataset.
+On the CUDA and OpenCL backends, the transformation is computed exactly as many times
+as specified by the iterations argument. The returned value is the one that yields the
+largest number of inliers, which are all of the points that fall within the maximum L2
+distance given by the inlier_thr argument. For the LMedS case, the
+number of iterations is currently hardcoded to meet the following equation:
+
+\f$ m = \frac{\log(1 - P)}{\log[1 - {(1 - \epsilon)}^{p}]}\f$,
+
+where \f$ P = 0.99\f$, \f$ \epsilon = 40\%\f$ and \f$ p = 4\f$.
+
+
+
+@}
+*/
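
Note on the LMedS iteration formula in the homography section above: plugging in the stated constants gives a concrete number. The sketch below is plain C++ that evaluates the formula as written. The function name is hypothetical, and rounding the result up to a whole number of iterations is an assumption, since the text does not say how the value is rounded.

```cpp
#include <cassert>
#include <cmath>

// Evaluate m = log(1 - P) / log(1 - (1 - eps)^p), where
//   P   = desired confidence of drawing at least one all-inlier sample,
//   eps = assumed fraction of outliers,
//   p   = number of points per random sample.
double lmeds_iterations_sketch(double confidence, double outlier_ratio,
                               int sample_size) {
    assert(confidence > 0.0 && confidence < 1.0);
    assert(outlier_ratio > 0.0 && outlier_ratio < 1.0);
    assert(sample_size > 0);

    // Probability that one random sample consists only of inliers.
    const double all_inliers = std::pow(1.0 - outlier_ratio, sample_size);
    return std::log(1.0 - confidence) / std::log(1.0 - all_inliers);
}
```

With P = 0.99, an outlier ratio of 40% and p = 4, the formula evaluates to roughly 33.2, so 34 iterations if the value is rounded up.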
diff --git a/docs/doxygen-awesome-darkmode-toggle.js b/docs/doxygen-awesome-darkmode-toggle.js
new file mode 100644
index 0000000000..2032f02c0b
--- /dev/null
+++ b/docs/doxygen-awesome-darkmode-toggle.js
@@ -0,0 +1,157 @@
+/**
+
+Doxygen Awesome
+https://github.com/jothepro/doxygen-awesome-css
+
+MIT License
+
+Copyright (c) 2021 - 2022 jothepro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+*/
+
+class DoxygenAwesomeDarkModeToggle extends HTMLElement {
+    // SVG icons from https://fonts.google.com/icons
+    // Licensed under the Apache 2.0 license:
+    // https://www.apache.org/licenses/LICENSE-2.0.html
+    static lightModeIcon = `<svg xmlns="http://www.w3.org/2000/svg" enable-background="new 0 0 24 24" height="24px" viewBox="0 0 24 24" width="24px" fill="#FCBF00"><rect fill="none" height="24" width="24"/><circle cx="12" cy="12" opacity=".3" r="3"/><path d="M12,9c1.65,0,3,1.35,3,3s-1.35,3-3,3s-3-1.35-3-3S10.35,9,12,9 M12,7c-2.76,0-5,2.24-5,5s2.24,5,5,5s5-2.24,5-5 S14.76,7,12,7L12,7z M2,13l2,0c0.55,0,1-0.45,1-1s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S1.45,13,2,13z M20,13l2,0c0.55,0,1-0.45,1-1 s-0.45-1-1-1l-2,0c-0.55,0-1,0.45-1,1S19.45,13,20,13z M11,2v2c0,0.55,0.45,1,1,1s1-0.45,1-1V2c0-0.55-0.45-1-1-1S11,1.45,11,2z M11,20v2c0,0.55,0.45,1,1,1s1-0.45,1-1v-2c0-0.55-0.45-1-1-1C11.45,19,11,19.45,11,20z M5.99,4.58c-0.39-0.39-1.03-0.39-1.41,0 c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0s0.39-1.03,0-1.41L5.99,4.58z M18.36,16.95 c-0.39-0.39-1.03-0.39-1.41,0c-0.39,0.39-0.39,1.03,0,1.41l1.06,1.06c0.39,0.39,1.03,0.39,1.41,0c0.39-0.39,0.39-1.03,0-1.41 L18.36,16.95z M19.42,5.99c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06c-0.39,0.39-0.39,1.03,0,1.41 s1.03,0.39,1.41,0L19.42,5.99z M7.05,18.36c0.39-0.39,0.39-1.03,0-1.41c-0.39-0.39-1.03-0.39-1.41,0l-1.06,1.06 c-0.39,0.39-0.39,1.03,0,1.41s1.03,0.39,1.41,0L7.05,18.36z"/></svg>`
+    static darkModeIcon = `<svg xmlns="http://www.w3.org/2000/svg" enable-background="new 0 0 24 24" height="24px" viewBox="0 0 24 24" width="24px" fill="#FE9700"><rect fill="none" height="24" width="24"/><path d="M9.37,5.51C9.19,6.15,9.1,6.82,9.1,7.5c0,4.08,3.32,7.4,7.4,7.4c0.68,0,1.35-0.09,1.99-0.27 C17.45,17.19,14.93,19,12,19c-3.86,0-7-3.14-7-7C5,9.07,6.81,6.55,9.37,5.51z" opacity=".3"/><path d="M9.37,5.51C9.19,6.15,9.1,6.82,9.1,7.5c0,4.08,3.32,7.4,7.4,7.4c0.68,0,1.35-0.09,1.99-0.27C17.45,17.19,14.93,19,12,19 c-3.86,0-7-3.14-7-7C5,9.07,6.81,6.55,9.37,5.51z M12,3c-4.97,0-9,4.03-9,9s4.03,9,9,9s9-4.03,9-9c0-0.46-0.04-0.92-0.1-1.36 c-0.98,1.37-2.58,2.26-4.4,2.26c-2.98,0-5.4-2.42-5.4-5.4c0-1.81,0.89-3.42,2.26-4.4C12.92,3.04,12.46,3,12,3L12,3z"/></svg>`
+    static title = "Toggle Light/Dark Mode"
+
+    static prefersLightModeInDarkModeKey = "prefers-light-mode-in-dark-mode"
+    static prefersDarkModeInLightModeKey = "prefers-dark-mode-in-light-mode"
+
+    static _staticConstructor = function() {
+        DoxygenAwesomeDarkModeToggle.enableDarkMode(DoxygenAwesomeDarkModeToggle.userPreference)
+        // Update the color scheme when the browser's preference changes
+        // without user interaction on the website.
+        window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', event => {
+            DoxygenAwesomeDarkModeToggle.onSystemPreferenceChanged()
+        })
+        // Update the color scheme when the tab is made visible again.
+        // It is possible that the appearance was changed in another tab 
+        // while this tab was in the background.
+        document.addEventListener("visibilitychange", visibilityState => {
+            if (document.visibilityState === 'visible') {
+                DoxygenAwesomeDarkModeToggle.onSystemPreferenceChanged()
+            }
+        });
+    }()
+
+    static init() {
+        $(function() {
+            $(document).ready(function() {
+                const toggleButton = document.createElement('doxygen-awesome-dark-mode-toggle')
+                toggleButton.title = DoxygenAwesomeDarkModeToggle.title
+                toggleButton.updateIcon()
+
+                window.matchMedia('(prefers-color-scheme: dark)').addEventListener('change', event => {
+                    toggleButton.updateIcon()
+                })
+                document.addEventListener("visibilitychange", visibilityState => {
+                    if (document.visibilityState === 'visible') {
+                        toggleButton.updateIcon()
+                    }
+                });
+
+                $(document).ready(function(){
+                    document.getElementById("togglediv").parentNode.appendChild(toggleButton)
+                })
+                $(window).resize(function(){
+                    document.getElementById("togglediv").parentNode.appendChild(toggleButton)
+                })
+            })
+        })
+    }
+
+    constructor() {
+        super();
+        this.onclick=this.toggleDarkMode
+    }
+
+    /**
+     * @returns `true` for dark-mode, `false` for light-mode system preference
+     */
+    static get systemPreference() {
+        return window.matchMedia('(prefers-color-scheme: dark)').matches
+    }
+
+    /**
+     * @returns `true` for dark-mode, `false` for light-mode user preference
+     */
+    static get userPreference() {
+        return (!DoxygenAwesomeDarkModeToggle.systemPreference && localStorage.getItem(DoxygenAwesomeDarkModeToggle.prefersDarkModeInLightModeKey)) || 
+        (DoxygenAwesomeDarkModeToggle.systemPreference && !localStorage.getItem(DoxygenAwesomeDarkModeToggle.prefersLightModeInDarkModeKey))
+    }
+
+    static set userPreference(userPreference) {
+        DoxygenAwesomeDarkModeToggle.darkModeEnabled = userPreference
+        if(!userPreference) {
+            if(DoxygenAwesomeDarkModeToggle.systemPreference) {
+                localStorage.setItem(DoxygenAwesomeDarkModeToggle.prefersLightModeInDarkModeKey, true)
+            } else {
+                localStorage.removeItem(DoxygenAwesomeDarkModeToggle.prefersDarkModeInLightModeKey)
+            }
+        } else {
+            if(!DoxygenAwesomeDarkModeToggle.systemPreference) {
+                localStorage.setItem(DoxygenAwesomeDarkModeToggle.prefersDarkModeInLightModeKey, true)
+            } else {
+                localStorage.removeItem(DoxygenAwesomeDarkModeToggle.prefersLightModeInDarkModeKey)
+            }
+        }
+        DoxygenAwesomeDarkModeToggle.onUserPreferenceChanged()
+    }
+
+    static enableDarkMode(enable) {
+        if(enable) {
+            DoxygenAwesomeDarkModeToggle.darkModeEnabled = true
+            document.documentElement.classList.add("dark-mode")
+            document.documentElement.classList.remove("light-mode")
+        } else {
+            DoxygenAwesomeDarkModeToggle.darkModeEnabled = false
+            document.documentElement.classList.remove("dark-mode")
+            document.documentElement.classList.add("light-mode")
+        }
+    }
+
+    static onSystemPreferenceChanged() {
+        DoxygenAwesomeDarkModeToggle.darkModeEnabled = DoxygenAwesomeDarkModeToggle.userPreference
+        DoxygenAwesomeDarkModeToggle.enableDarkMode(DoxygenAwesomeDarkModeToggle.darkModeEnabled)
+    }
+
+    static onUserPreferenceChanged() {
+        DoxygenAwesomeDarkModeToggle.enableDarkMode(DoxygenAwesomeDarkModeToggle.darkModeEnabled)
+    }
+
+    toggleDarkMode() {
+        DoxygenAwesomeDarkModeToggle.userPreference = !DoxygenAwesomeDarkModeToggle.userPreference
+        this.updateIcon()
+    }
+
+    updateIcon() {
+        if(DoxygenAwesomeDarkModeToggle.darkModeEnabled) {
+            this.innerHTML = DoxygenAwesomeDarkModeToggle.darkModeIcon
+        } else {
+            this.innerHTML = DoxygenAwesomeDarkModeToggle.lightModeIcon
+        }
+    }
+}
+
+customElements.define("doxygen-awesome-dark-mode-toggle", DoxygenAwesomeDarkModeToggle);
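Review note: the `userPreference` getter in the file above combines the system color scheme with two override flags kept in `localStorage`. A minimal sketch of that decision as a pure function may make the four cases easier to check. The helper name `resolveDarkMode` and the plain `store` object standing in for `localStorage` are assumptions for illustration and are not part of the patch.

```javascript
// Hypothetical helper, not part of doxygen-awesome: mirrors the boolean
// logic of the `userPreference` getter without touching the DOM.
const LIGHT_IN_DARK = "prefers-light-mode-in-dark-mode";
const DARK_IN_LIGHT = "prefers-dark-mode-in-light-mode";

function resolveDarkMode(systemPrefersDark, store) {
    if (systemPrefersDark) {
        // Dark system theme: stay dark unless the user opted into light.
        return !store[LIGHT_IN_DARK];
    }
    // Light system theme: stay light unless the user opted into dark.
    return Boolean(store[DARK_IN_LIGHT]);
}
```

Each override key only matters under the system theme it names, which is why a stored `prefers-light-mode-in-dark-mode` flag has no effect while the system theme is light.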
diff --git a/docs/doxygen-awesome-fragment-copy-button.js b/docs/doxygen-awesome-fragment-copy-button.js
new file mode 100644
index 0000000000..7d06b348d6
--- /dev/null
+++ b/docs/doxygen-awesome-fragment-copy-button.js
@@ -0,0 +1,85 @@
+/**
+
+Doxygen Awesome
+https://github.com/jothepro/doxygen-awesome-css
+
+MIT License
+
+Copyright (c) 2022 jothepro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+*/
+
+class DoxygenAwesomeFragmentCopyButton extends HTMLElement {
+    constructor() {
+        super();
+        this.onclick=this.copyContent
+    }
+    static title = "Copy to clipboard"
+    static copyIcon = `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="24" height="24"><path d="M0 0h24v24H0V0z" fill="none"/><path d="M16 1H4c-1.1 0-2 .9-2 2v14h2V3h12V1zm3 4H8c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h11c1.1 0 2-.9 2-2V7c0-1.1-.9-2-2-2zm0 16H8V7h11v14z"/></svg>`
+    static successIcon = `<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 24 24" width="24" height="24"><path d="M0 0h24v24H0V0z" fill="none"/><path d="M9 16.17L4.83 12l-1.42 1.41L9 19 21 7l-1.41-1.41L9 16.17z"/></svg>`
+    static successDuration = 980
+    static init() {
+        $(function() {
+            $(document).ready(function() {
+                if(navigator.clipboard) {
+                    const fragments = document.getElementsByClassName("fragment")
+                    for(const fragment of fragments) {
+                        const fragmentWrapper = document.createElement("div")
+                        fragmentWrapper.className = "doxygen-awesome-fragment-wrapper"
+                        const fragmentCopyButton = document.createElement("doxygen-awesome-fragment-copy-button")
+                        fragmentCopyButton.innerHTML = DoxygenAwesomeFragmentCopyButton.copyIcon
+                        fragmentCopyButton.title = DoxygenAwesomeFragmentCopyButton.title
+                
+                        fragment.parentNode.replaceChild(fragmentWrapper, fragment)
+                        fragmentWrapper.appendChild(fragment)
+                        fragmentWrapper.appendChild(fragmentCopyButton)
+            
+                    }
+                }
+            })
+        })
+    }
+
+
+    copyContent() {
+        const content = this.previousSibling.cloneNode(true)
+        // filter out line numbers from file listings
+        content.querySelectorAll(".lineno, .ttc").forEach((node) => {
+            node.remove()
+        })
+        let textContent = content.textContent
+        // remove trailing newlines that appear in file listings
+        let numberOfTrailingNewlines = 0
+        while(textContent.charAt(textContent.length - (numberOfTrailingNewlines + 1)) == '\n') {
+            numberOfTrailingNewlines++;
+        }
+        textContent = textContent.substring(0, textContent.length - numberOfTrailingNewlines)
+        navigator.clipboard.writeText(textContent);
+        this.classList.add("success")
+        this.innerHTML = DoxygenAwesomeFragmentCopyButton.successIcon
+        window.setTimeout(() => {
+            this.classList.remove("success")
+            this.innerHTML = DoxygenAwesomeFragmentCopyButton.copyIcon
+        }, DoxygenAwesomeFragmentCopyButton.successDuration);
+    }
+}
+
+customElements.define("doxygen-awesome-fragment-copy-button", DoxygenAwesomeFragmentCopyButton)
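Review note: `copyContent` in the file above trims trailing newlines with a counting loop before writing to the clipboard. As a rough sketch, the same trimming can be expressed as a single regular-expression replace. The helper name `stripTrailingNewlines` is hypothetical and does not exist in the patch.

```javascript
// Hypothetical helper: produces the same string as the `while` loop in
// `copyContent`, which drops every "\n" at the end of the text.
function stripTrailingNewlines(text) {
    return text.replace(/\n+$/, "");
}
```

Only newline characters at the very end are removed; interior line breaks and other trailing whitespace are left untouched, matching the loop's behavior.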
diff --git a/docs/doxygen-awesome-interactive-toc.js b/docs/doxygen-awesome-interactive-toc.js
new file mode 100644
index 0000000000..b049f57331
--- /dev/null
+++ b/docs/doxygen-awesome-interactive-toc.js
@@ -0,0 +1,81 @@
+/**
+
+Doxygen Awesome
+https://github.com/jothepro/doxygen-awesome-css
+
+MIT License
+
+Copyright (c) 2022 jothepro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+*/
+
+class DoxygenAwesomeInteractiveToc {
+    static topOffset = 38
+    static hideMobileMenu = true
+    static headers = []
+
+    static init() {
+        window.addEventListener("load", () => {
+            let toc = document.querySelector(".contents > .toc")
+            if(toc) {
+                toc.classList.add("interactive")
+                if(!DoxygenAwesomeInteractiveToc.hideMobileMenu) {
+                    toc.classList.add("open")
+                }
+                document.querySelector(".contents > .toc > h3")?.addEventListener("click", () => {
+                    if(toc.classList.contains("open")) {
+                        toc.classList.remove("open")
+                    } else {
+                        toc.classList.add("open")
+                    }
+                })
+
+                document.querySelectorAll(".contents > .toc > ul a").forEach((node) => {
+                    let id = node.getAttribute("href").substring(1)
+                    DoxygenAwesomeInteractiveToc.headers.push({
+                        node: node,
+                        headerNode: document.getElementById(id)
+                    })
+                })
+                // Register the scroll listener once rather than once per TOC link.
+                document.getElementById("doc-content")?.addEventListener("scroll", () => {
+                    DoxygenAwesomeInteractiveToc.update()
+                })
+                DoxygenAwesomeInteractiveToc.update()
+            }
+        })
+    }
+
+    static update() {
+        let active = DoxygenAwesomeInteractiveToc.headers[0]?.node
+        DoxygenAwesomeInteractiveToc.headers.forEach((header) => {
+            let position = header.headerNode.getBoundingClientRect().top
+            header.node.classList.remove("active")
+            header.node.classList.remove("aboveActive")
+            if(position < DoxygenAwesomeInteractiveToc.topOffset) {
+                active = header.node
+                active?.classList.add("aboveActive")
+            }
+        })
+        active?.classList.add("active")
+        active?.classList.remove("aboveActive")
+    }
+}
\ No newline at end of file
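Review note: `DoxygenAwesomeInteractiveToc.update()` in the file above marks as active the last header whose top edge sits above `topOffset`, falling back to the first entry. A small sketch of that selection rule, detached from the DOM, is shown here. The function name `activeHeaderIndex` and the array of precomputed `top` values are assumptions made for illustration.

```javascript
// Hypothetical helper: given the viewport-relative top of each header (in
// document order), return the index of the TOC entry that `update()` would
// mark as active, or -1 when there are no headers.
function activeHeaderIndex(tops, topOffset = 38) {
    let active = tops.length > 0 ? 0 : -1;
    tops.forEach((top, index) => {
        if (top < topOffset) {
            active = index; // the last header above the offset wins
        }
    });
    return active;
}
```

With `tops = [-200, 10, 500]` the second entry is selected, because it is the last header that has scrolled above the 38px offset.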
diff --git a/docs/doxygen-awesome-sidebar-only.css b/docs/doxygen-awesome-sidebar-only.css
new file mode 100644
index 0000000000..65e1a71fd2
--- /dev/null
+++ b/docs/doxygen-awesome-sidebar-only.css
@@ -0,0 +1,115 @@
+/**
+
+Doxygen Awesome
+https://github.com/jothepro/doxygen-awesome-css
+
+MIT License
+
+Copyright (c) 2021 jothepro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+ */
+
+html {
+    /* side nav width. MUST be = `TREEVIEW_WIDTH`.
+     * Make sure it is wide enough to contain the page title (logo + title + version)
+     */
+    --side-nav-fixed-width: 335px;
+    --menu-display: none;
+
+    --top-height: 170px;
+    --toc-sticky-top: -25px;
+    --toc-max-height: calc(100vh - 2 * var(--spacing-medium) - 25px);
+}
+
+#projectname {
+    white-space: nowrap;
+}
+
+
+@media screen and (min-width: 768px) {
+    html {
+        --searchbar-background: var(--page-background-color);
+    }
+
+    #side-nav {
+        min-width: var(--side-nav-fixed-width);
+        max-width: var(--side-nav-fixed-width);
+        top: var(--top-height);
+        overflow: visible;
+    }
+
+    #nav-tree, #side-nav {
+        height: calc(100vh - var(--top-height)) !important;
+    }
+
+    #nav-tree {
+        padding: 0;
+    }
+
+    #top {
+        display: block;
+        border-bottom: none;
+        height: var(--top-height);
+        margin-bottom: calc(0px - var(--top-height));
+        max-width: var(--side-nav-fixed-width);
+        overflow: hidden;
+        background: var(--side-nav-background);
+    }
+    #main-nav {
+        float: left;
+        padding-right: 0;
+    }
+
+    .ui-resizable-handle {
+        cursor: default;
+        width: 1px !important;
+        box-shadow: 0 calc(-2 * var(--top-height)) 0 0 var(--separator-color);
+    }
+
+    #nav-path {
+        position: fixed;
+        right: 0;
+        left: var(--side-nav-fixed-width);
+        bottom: 0;
+        width: auto;
+    }
+
+    #doc-content {
+        height: calc(100vh - 31px) !important;
+        padding-bottom: calc(3 * var(--spacing-large));
+        padding-top: calc(var(--top-height) - 80px);
+        box-sizing: border-box;
+        margin-left: var(--side-nav-fixed-width) !important;
+    }
+
+    #MSearchBox {
+        width: calc(var(--side-nav-fixed-width) - calc(2 * var(--spacing-medium)));
+    }
+
+    #MSearchField {
+        width: calc(var(--side-nav-fixed-width) - calc(2 * var(--spacing-medium)) - 65px);
+    }
+
+    #MSearchResultsWindow {
+        left: var(--spacing-medium) !important;
+        right: auto;
+    }
+}
diff --git a/docs/doxygen-awesome.css b/docs/doxygen-awesome.css
new file mode 100644
index 0000000000..e9a1553123
--- /dev/null
+++ b/docs/doxygen-awesome.css
@@ -0,0 +1,2405 @@
+/**
+
+Doxygen Awesome
+https://github.com/jothepro/doxygen-awesome-css
+
+MIT License
+
+Copyright (c) 2021 - 2022 jothepro
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+*/
+
+html {
+    /* primary theme color. This will affect the entire website's color scheme: links, arrows, labels, ... */
+    --primary-color: #1779c4;
+    --primary-dark-color: #335c80;
+    --primary-light-color: #70b1e9;
+
+    /* page base colors */
+    --page-background-color: #ffffff;
+    --page-foreground-color: #2f4153;
+    --page-secondary-foreground-color: #6f7e8e;
+
+    /* color for all separators on the website: hr, borders, ... */
+    --separator-color: #dedede;
+
+    /* border radius for all rounded components. Will affect many components, like dropdowns, memitems, codeblocks, ... */
+    --border-radius-large: 6px;
+    --border-radius-small: 3px;
+    --border-radius-medium: 5px;
+
+    /* default spacings. Most components reference these values for spacing, to provide uniform spacing on the page. */
+    --spacing-small: 5px;
+    --spacing-medium: 8px;
+    --spacing-large: 10px;
+
+    /* default box shadow used for raising an element above the normal content. Used in dropdowns, search result, ... */
+    --box-shadow: 0 2px 8px 0 rgba(0,0,0,.075);
+
+    --odd-color: rgba(0,0,0,.028);
+
+    /* font-families. will affect all text on the website
+     * font-family: the normal font for text, headlines, menus
+     * font-family-monospace: used for preformatted text in memtitle, code, fragments
+     */
+    --font-family: -apple-system,BlinkMacSystemFont,Segoe UI,Roboto,Oxygen,Ubuntu,Cantarell,Fira Sans,Droid Sans,Helvetica Neue,sans-serif;
+    --font-family-monospace: ui-monospace,SFMono-Regular,SF Mono,Menlo,Consolas,Liberation Mono,monospace;
+
+    /* font sizes */
+    --page-font-size: 15.6px;
+    --navigation-font-size: 14.4px;
+    --toc-font-size: 13.4px;
+    --code-font-size: 14px; /* affects code, fragment */
+    --title-font-size: 22px;
+
+    /* content text properties. These only affect the page content, not the navigation or any other ui elements */
+    --content-line-height: 25px;
+    /* The content is centered and constrained in its width. To make the content fill the whole page, set the variable to auto.*/
+    --content-maxwidth: 1050px;
+    --table-line-height: 24px;
+    --toc-sticky-top: var(--spacing-medium);
+    --toc-width: 200px;
+    --toc-max-height: calc(100vh - 2 * var(--spacing-medium) - 85px);
+
+    /* colors for various content boxes: @warning, @note, @deprecated @bug */
+    --warning-color: #f8d1cc;
+    --warning-color-dark: #b61825;
+    --warning-color-darker: #75070f;
+    --note-color: #faf3d8;
+    --note-color-dark: #f3a600;
+    --note-color-darker: #5f4204;
+    --todo-color: #e4f3ff;
+    --todo-color-dark: #1879C4;
+    --todo-color-darker: #274a5c;
+    --deprecated-color: #ecf0f3;
+    --deprecated-color-dark: #5b6269;
+    --deprecated-color-darker: #43454a;
+    --bug-color: #e4dafd;
+    --bug-color-dark: #5b2bdd;
+    --bug-color-darker: #2a0d72;
+    --invariant-color: #d8f1e3;
+    --invariant-color-dark: #44b86f;
+    --invariant-color-darker: #265532;
+
+    /* blockquote colors */
+    --blockquote-background: #f8f9fa;
+    --blockquote-foreground: #636568;
+
+    /* table colors */
+    --tablehead-background: #f1f1f1;
+    --tablehead-foreground: var(--page-foreground-color);
+
+    /* menu-display: block | none
+     * Visibility of the top navigation on screens >= 768px. On smaller screens the menu is always visible.
+     * `GENERATE_TREEVIEW` MUST be enabled!
+     */
+    --menu-display: block;
+
+    --menu-focus-foreground: var(--page-background-color);
+    --menu-focus-background: var(--primary-color);
+    --menu-selected-background: rgba(0,0,0,.05);
+
+
+    --header-background: var(--page-background-color);
+    --header-foreground: var(--page-foreground-color);
+
+    /* searchbar colors */
+    --searchbar-background: var(--side-nav-background);
+    --searchbar-foreground: var(--page-foreground-color);
+
+    /* searchbar size
+     * (`searchbar-width` is only applied on screens >= 768px.
+     * on smaller screens the searchbar will always fill the entire screen width) */
+    --searchbar-height: 33px;
+    --searchbar-width: 210px;
+    --searchbar-border-radius: var(--searchbar-height);
+
+    /* code block colors */
+    --code-background: #f5f5f5;
+    --code-foreground: var(--page-foreground-color);
+
+    /* fragment colors */
+    --fragment-background: #F8F9FA;
+    --fragment-foreground: #37474F;
+    --fragment-keyword: #bb6bb2;
+    --fragment-keywordtype: #8258b3;
+    --fragment-keywordflow: #d67c3b;
+    --fragment-token: #438a59;
+    --fragment-comment: #969696;
+    --fragment-link: #5383d6;
+    --fragment-preprocessor: #46aaa5;
+    --fragment-linenumber-color: #797979;
+    --fragment-linenumber-background: #f4f4f5;
+    --fragment-linenumber-border: #e3e5e7;
+    --fragment-lineheight: 19px;
+
+    /* sidebar navigation (treeview) colors */
+    --side-nav-background: #fbfbfb;
+    --side-nav-foreground: var(--page-foreground-color);
+    --side-nav-arrow-opacity: 0;
+    --side-nav-arrow-hover-opacity: 0.9;
+
+    --toc-background: var(--side-nav-background);
+    --toc-foreground: var(--side-nav-foreground);
+
+    /* height of an item in any tree / collapsible table */
+    --tree-item-height: 27px;
+
+    --memname-font-size: var(--code-font-size);
+    --memtitle-font-size: 18px;
+
+    --webkit-scrollbar-size: 7px;
+    --webkit-scrollbar-padding: 4px;
+    --webkit-scrollbar-color: var(--separator-color);
+}
+
+@media screen and (max-width: 767px) {
+    html {
+        --page-font-size: 16px;
+        --navigation-font-size: 16px;
+        --toc-font-size: 15px;
+        --code-font-size: 15px; /* affects code, fragment */
+        --title-font-size: 22px;
+    }
+}
+
+@media (prefers-color-scheme: dark) {
+    html:not(.light-mode) {
+        color-scheme: dark;
+
+        --primary-color: #1982d2;
+        --primary-dark-color: #86a9c4;
+        --primary-light-color: #4779ac;
+
+        --box-shadow: 0 2px 8px 0 rgba(0,0,0,.35);
+
+        --odd-color: rgba(100,100,100,.06);
+
+        --menu-selected-background: rgba(0,0,0,.4);
+
+        --page-background-color: #1C1D1F;
+        --page-foreground-color: #d2dbde;
+        --page-secondary-foreground-color: #859399;
+        --separator-color: #38393b;
+        --side-nav-background: #252628;
+
+        --code-background: #2a2c2f;
+
+        --tablehead-background: #2a2c2f;
+    
+        --blockquote-background: #222325;
+        --blockquote-foreground: #7e8c92;
+
+        --warning-color: #2e1917;
+        --warning-color-dark: #ad2617;
+        --warning-color-darker: #f5b1aa;
+        --note-color: #3b2e04;
+        --note-color-dark: #f1b602;
+        --note-color-darker: #ceb670;
+        --todo-color: #163750;
+        --todo-color-dark: #1982D2;
+        --todo-color-darker: #dcf0fa;
+        --deprecated-color: #2e323b;
+        --deprecated-color-dark: #738396;
+        --deprecated-color-darker: #abb0bd;
+        --bug-color: #2a2536;
+        --bug-color-dark: #7661b3;
+        --bug-color-darker: #ae9ed6;
+        --invariant-color: #303a35;
+        --invariant-color-dark: #76ce96;
+        --invariant-color-darker: #cceed5;
+
+        --fragment-background: #282c34;
+        --fragment-foreground: #dbe4eb;
+        --fragment-keyword: #cc99cd;
+        --fragment-keywordtype: #ab99cd;
+        --fragment-keywordflow: #e08000;
+        --fragment-token: #7ec699;
+        --fragment-comment: #999999;
+        --fragment-link: #98c0e3;
+        --fragment-preprocessor: #65cabe;
+        --fragment-linenumber-color: #cccccc;
+        --fragment-linenumber-background: #35393c;
+        --fragment-linenumber-border: #1f1f1f;
+    }
+}
+
+/* dark mode variables are defined twice, to support both the dark-mode without and with doxygen-awesome-darkmode-toggle.js */
+html.dark-mode {
+    color-scheme: dark;
+
+    --primary-color: #1982d2;
+    --primary-dark-color: #86a9c4;
+    --primary-light-color: #4779ac;
+
+    --box-shadow: 0 2px 8px 0 rgba(0,0,0,.35);
+
+    --odd-color: rgba(100,100,100,.06);
+
+    --menu-selected-background: rgba(0,0,0,.4);
+
+    --page-background-color: #1C1D1F;
+    --page-foreground-color: #d2dbde;
+    --page-secondary-foreground-color: #859399;
+    --separator-color: #38393b;
+    --side-nav-background: #252628;
+
+    --code-background: #2a2c2f;
+
+    --tablehead-background: #2a2c2f;
+
+    --blockquote-background: #222325;
+    --blockquote-foreground: #7e8c92;
+
+    --warning-color: #2e1917;
+    --warning-color-dark: #ad2617;
+    --warning-color-darker: #f5b1aa;
+    --note-color: #3b2e04;
+    --note-color-dark: #f1b602;
+    --note-color-darker: #ceb670;
+    --todo-color: #163750;
+    --todo-color-dark: #1982D2;
+    --todo-color-darker: #dcf0fa;
+    --deprecated-color: #2e323b;
+    --deprecated-color-dark: #738396;
+    --deprecated-color-darker: #abb0bd;
+    --bug-color: #2a2536;
+    --bug-color-dark: #7661b3;
+    --bug-color-darker: #ae9ed6;
+    --invariant-color: #303a35;
+    --invariant-color-dark: #76ce96;
+    --invariant-color-darker: #cceed5;
+
+    --fragment-background: #282c34;
+    --fragment-foreground: #dbe4eb;
+    --fragment-keyword: #cc99cd;
+    --fragment-keywordtype: #ab99cd;
+    --fragment-keywordflow: #e08000;
+    --fragment-token: #7ec699;
+    --fragment-comment: #999999;
+    --fragment-link: #98c0e3;
+    --fragment-preprocessor: #65cabe;
+    --fragment-linenumber-color: #cccccc;
+    --fragment-linenumber-background: #35393c;
+    --fragment-linenumber-border: #1f1f1f;
+}
+
+body {
+    color: var(--page-foreground-color);
+    background-color: var(--page-background-color);
+    font-size: var(--page-font-size);
+}
+
+body, table, div, p, dl, #nav-tree .label, .title,
+.sm-dox a, .sm-dox a:hover, .sm-dox a:focus, #projectname,
+.SelectItem, #MSearchField, .navpath li.navelem a,
+.navpath li.navelem a:hover, p.reference, p.definition {
+    font-family: var(--font-family);
+}
+
+h1, h2, h3, h4, h5 {
+    margin-top: .9em;
+    font-weight: 600;
+    line-height: initial;
+}
+
+p, div, table, dl, p.reference, p.definition {
+    font-size: var(--page-font-size);
+}
+
+p.reference, p.definition {
+    color: var(--page-secondary-foreground-color);
+}
+
+a:link, a:visited, a:hover, a:focus, a:active {
+    color: var(--primary-color) !important;
+    font-weight: 500;
+}
+
+a.anchor {
+    scroll-margin-top: var(--spacing-large);
+    display: block;
+}
+
+/*
+ Title and top navigation
+ */
+
+#top {
+    background: var(--header-background);
+    border-bottom: 1px solid var(--separator-color);
+}
+
+@media screen and (min-width: 768px) {
+    #top {
+        display: flex;
+        flex-wrap: wrap;
+        justify-content: space-between;
+        align-items: center;
+    }
+}
+
+#main-nav {
+    flex-grow: 5;
+    padding: var(--spacing-small) var(--spacing-medium);
+}
+
+#titlearea {
+    width: auto;
+    padding: var(--spacing-medium) var(--spacing-large);
+    background: none;
+    color: var(--header-foreground);
+    border-bottom: none;
+}
+
+@media screen and (max-width: 767px) {
+    #titlearea {
+        padding-bottom: var(--spacing-small);
+    }
+}
+
+#titlearea table tbody tr {
+    height: auto !important;
+}
+
+#projectname {
+    font-size: var(--title-font-size);
+    font-weight: 600;
+}
+
+#projectnumber {
+    font-family: inherit;
+    font-size: 60%;
+}
+
+#projectbrief {
+    font-family: inherit;
+    font-size: 80%;
+}
+
+#projectlogo {
+    vertical-align: middle;
+}
+
+#projectlogo img {
+    max-height: calc(var(--title-font-size) * 2);
+    margin-right: var(--spacing-small);
+}
+
+.sm-dox, .tabs, .tabs2, .tabs3 {
+    background: none;
+    padding: 0;
+}
+
+.tabs, .tabs2, .tabs3 {
+    border-bottom: 1px solid var(--separator-color);
+    margin-bottom: -1px;
+}
+
+.main-menu-btn-icon, .main-menu-btn-icon:before, .main-menu-btn-icon:after {
+    background: var(--page-secondary-foreground-color);
+}
+
+@media screen and (max-width: 767px) {
+    .sm-dox a span.sub-arrow {
+        background: var(--code-background);
+    }
+
+    #main-menu a.has-submenu span.sub-arrow {
+        color: var(--page-secondary-foreground-color);
+        border-radius: var(--border-radius-medium);
+    }
+
+    #main-menu a.has-submenu:hover span.sub-arrow {
+        color: var(--page-foreground-color);
+    }
+}
+
+@media screen and (min-width: 768px) {
+    .sm-dox li, .tablist li {
+        display: var(--menu-display);
+    }
+
+    .sm-dox a span.sub-arrow {
+        border-color: var(--header-foreground) transparent transparent transparent;
+    }
+
+    .sm-dox a:hover span.sub-arrow {
+        border-color: var(--menu-focus-foreground) transparent transparent transparent;
+    }
+
+    .sm-dox ul a span.sub-arrow {
+        border-color: transparent transparent transparent var(--page-foreground-color);
+    }
+
+    .sm-dox ul a:hover span.sub-arrow {
+        border-color: transparent transparent transparent var(--menu-focus-foreground);
+    }
+}
+
+.sm-dox ul {
+    background: var(--page-background-color);
+    box-shadow: var(--box-shadow);
+    border: 1px solid var(--separator-color);
+    border-radius: var(--border-radius-medium) !important;
+    padding: var(--spacing-small);
+    animation: ease-out 150ms slideInMenu;
+}
+
+@keyframes slideInMenu {
+    from {
+        opacity: 0;
+        transform: translate(0px, -2px);
+    }
+
+    to {
+        opacity: 1;
+        transform: translate(0px, 0px);
+    }
+}
+
+.sm-dox ul a {
+    color: var(--page-foreground-color) !important;
+    background: var(--page-background-color);
+    font-size: var(--navigation-font-size);
+}
+
+.sm-dox>li>ul:after {
+    border-bottom-color: var(--page-background-color) !important;
+}
+
+.sm-dox>li>ul:before {
+    border-bottom-color: var(--separator-color) !important;
+}
+
+.sm-dox ul a:hover, .sm-dox ul a:active, .sm-dox ul a:focus {
+    font-size: var(--navigation-font-size) !important;
+    color: var(--menu-focus-foreground) !important;
+    text-shadow: none;
+    background-color: var(--menu-focus-background);
+    border-radius: var(--border-radius-small) !important;
+}
+
+.sm-dox a, .sm-dox a:focus, .tablist li, .tablist li a, .tablist li.current a {
+    text-shadow: none;
+    background: transparent;
+    background-image: none !important;
+    color: var(--header-foreground) !important;
+    font-weight: normal;
+    font-size: var(--navigation-font-size);
+    border-radius: var(--border-radius-small) !important;
+}
+
+.sm-dox a:focus {
+    outline: auto;
+}
+
+.sm-dox a:hover, .sm-dox a:active, .tablist li a:hover {
+    text-shadow: none;
+    font-weight: normal;
+    background: var(--menu-focus-background);
+    color: var(--menu-focus-foreground) !important;
+    border-radius: var(--border-radius-small) !important;
+    font-size: var(--navigation-font-size);
+}
+
+.tablist li.current {
+    border-radius: var(--border-radius-small);
+    background: var(--menu-selected-background);
+}
+
+.tablist li {
+    margin: var(--spacing-small) 0 var(--spacing-small) var(--spacing-small);
+}
+
+.tablist a {
+    padding: 0 var(--spacing-large);
+}
+
+
+/*
+ Search box
+ */
+
+#MSearchBox {
+    height: var(--searchbar-height);
+    background: var(--searchbar-background);
+    border-radius: var(--searchbar-border-radius);
+    border: 1px solid var(--separator-color);
+    overflow: hidden;
+    width: var(--searchbar-width);
+    position: relative;
+    box-shadow: none;
+    display: block;
+    margin-top: 0;
+}
+
+/* until Doxygen 1.9.4 */
+.left img#MSearchSelect {
+    left: 0;
+    user-select: none;
+    padding-left: 8px;
+}
+
+/* Doxygen 1.9.5 */
+.left span#MSearchSelect {
+    left: 0;
+    user-select: none;
+    margin-left: 8px;
+    padding: 0;
+}
+
+.left #MSearchSelect[src$=".png"] {
+    padding-left: 0;
+}
+
+.SelectionMark {
+    user-select: none;
+}
+
+.tabs .left #MSearchSelect {
+    padding-left: 0;
+}
+
+.tabs #MSearchBox {
+    position: absolute;
+    right: var(--spacing-medium);
+}
+
+@media screen and (max-width: 767px) {
+    .tabs #MSearchBox {
+        position: relative;
+        right: 0;
+        margin-left: var(--spacing-medium);
+        margin-top: 0;
+    }
+}
+
+#MSearchSelectWindow, #MSearchResultsWindow {
+    z-index: 9999;
+}
+
+#MSearchBox.MSearchBoxActive {
+    border-color: var(--primary-color);
+    box-shadow: inset 0 0 0 1px var(--primary-color);
+}
+
+#main-menu > li:last-child {
+    margin-right: 0;
+}
+
+@media screen and (max-width: 767px) {
+    #main-menu > li:last-child {
+        height: 50px;
+    }
+}
+
+#MSearchField {
+    font-size: var(--navigation-font-size);
+    height: calc(var(--searchbar-height) - 2px);
+    background: transparent;
+    width: calc(var(--searchbar-width) - 64px);
+}
+
+.MSearchBoxActive #MSearchField {
+    color: var(--searchbar-foreground);
+}
+
+#MSearchSelect {
+    top: calc(calc(var(--searchbar-height) / 2) - 11px);
+}
+
+#MSearchBox span.left, #MSearchBox span.right {
+    background: none;
+    background-image: none;
+}
+
+#MSearchBox span.right {
+    padding-top: calc(calc(var(--searchbar-height) / 2) - 12px);
+    position: absolute;
+    right: var(--spacing-small);
+}
+
+.tabs #MSearchBox span.right {
+    top: calc(calc(var(--searchbar-height) / 2) - 12px);
+}
+
+@keyframes slideInSearchResults {
+    from {
+        opacity: 0;
+        transform: translate(0, 15px);
+    }
+
+    to {
+        opacity: 1;
+        transform: translate(0, 20px);
+    }
+}
+
+#MSearchResultsWindow {
+    left: auto !important;
+    right: var(--spacing-medium);
+    border-radius: var(--border-radius-large);
+    border: 1px solid var(--separator-color);
+    transform: translate(0, 20px);
+    box-shadow: var(--box-shadow);
+    animation: ease-out 280ms slideInSearchResults;
+    background: var(--page-background-color);
+}
+
+iframe#MSearchResults {
+    margin: 4px;
+}
+
+iframe {
+    color-scheme: normal;
+}
+
+@media (prefers-color-scheme: dark) {
+    html:not(.light-mode) iframe#MSearchResults {
+        filter: invert() hue-rotate(180deg);
+    }
+}
+
+html.dark-mode iframe#MSearchResults {
+    filter: invert() hue-rotate(180deg);
+}
+
+#MSearchResults .SRPage {
+    background-color: transparent;
+}
+
+#MSearchResults .SRPage .SREntry {
+    font-size: 10pt;
+    padding: var(--spacing-small) var(--spacing-medium);
+}
+
+#MSearchSelectWindow {
+    border: 1px solid var(--separator-color);
+    border-radius: var(--border-radius-medium);
+    box-shadow: var(--box-shadow);
+    background: var(--page-background-color);
+    padding-top: var(--spacing-small);
+    padding-bottom: var(--spacing-small);
+}
+
+#MSearchSelectWindow a.SelectItem {
+    font-size: var(--navigation-font-size);
+    line-height: var(--content-line-height);
+    margin: 0 var(--spacing-small);
+    border-radius: var(--border-radius-small);
+    color: var(--page-foreground-color) !important;
+    font-weight: normal;
+}
+
+#MSearchSelectWindow a.SelectItem:hover {
+    background: var(--menu-focus-background);
+    color: var(--menu-focus-foreground) !important;
+}
+
+@media screen and (max-width: 767px) {
+    #MSearchBox {
+        margin-top: var(--spacing-medium);
+        margin-bottom: var(--spacing-medium);
+        width: calc(100vw - 30px);
+    }
+
+    #main-menu > li:last-child {
+        float: none !important;
+    }
+
+    #MSearchField {
+        width: calc(100vw - 110px);
+    }
+
+    @keyframes slideInSearchResultsMobile {
+        from {
+            opacity: 0;
+            transform: translate(0, 15px);
+        }
+
+        to {
+            opacity: 1;
+            transform: translate(0, 20px);
+        }
+    }
+
+    #MSearchResultsWindow {
+        left: var(--spacing-medium) !important;
+        right: var(--spacing-medium);
+        overflow: auto;
+        transform: translate(0, 20px);
+        animation: ease-out 280ms slideInSearchResultsMobile;
+        width: auto !important;
+    }
+
+    /*
+     * Overrides that fix the search box on mobile in Doxygen 1.9.2
+     */
+    label.main-menu-btn ~ #searchBoxPos1 {
+        top: 3px !important;
+        right: 6px !important;
+        left: 45px;
+        display: flex;
+    }
+
+    label.main-menu-btn ~ #searchBoxPos1 > #MSearchBox {
+        margin-top: 0;
+        margin-bottom: 0;
+        flex-grow: 2;
+        float: left;
+    }
+}
+
+/*
+ Tree view
+ */
+
+#side-nav {
+    padding: 0 !important;
+    background: var(--side-nav-background);
+}
+
+@media screen and (max-width: 767px) {
+    #side-nav {
+        display: none;
+    }
+
+    #doc-content {
+        margin-left: 0 !important;
+    }
+}
+
+#nav-tree {
+    background: transparent;
+}
+
+#nav-tree .label {
+    font-size: var(--navigation-font-size);
+}
+
+#nav-tree .item {
+    height: var(--tree-item-height);
+    line-height: var(--tree-item-height);
+}
+
+#nav-sync {
+    bottom: 12px;
+    right: 12px;
+    top: auto !important;
+    user-select: none;
+}
+
+#nav-tree .selected {
+    text-shadow: none;
+    background-image: none;
+    background-color: transparent;
+    position: relative;
+}
+
+#nav-tree .selected::after {
+    content: "";
+    position: absolute;
+    top: 1px;
+    bottom: 1px;
+    left: 0;
+    width: 4px;
+    border-radius: 0 var(--border-radius-small) var(--border-radius-small) 0;
+    background: var(--primary-color);
+}
+
+
+#nav-tree a {
+    color: var(--side-nav-foreground) !important;
+    font-weight: normal;
+}
+
+#nav-tree a:focus {
+    outline-style: auto;
+}
+
+#nav-tree .arrow {
+    opacity: var(--side-nav-arrow-opacity);
+}
+
+.arrow {
+    color: inherit;
+    cursor: pointer;
+    font-size: 45%;
+    vertical-align: middle;
+    margin-right: 2px;
+    font-family: serif;
+    height: auto;
+    text-align: right;
+}
+
+#nav-tree div.item:hover .arrow, #nav-tree a:focus .arrow {
+    opacity: var(--side-nav-arrow-hover-opacity);
+}
+
+#nav-tree .selected a {
+    color: var(--primary-color) !important;
+    font-weight: bolder;
+    font-weight: 600;
+}
+
+.ui-resizable-e {
+    background: var(--separator-color);
+    width: 1px;
+}
+
+/*
+ Contents
+ */
+
+div.header {
+    border-bottom: 1px solid var(--separator-color);
+    background-color: var(--page-background-color);
+    background-image: none;
+}
+
+@media screen and (min-width: 1000px) {
+    #doc-content > div > div.contents,
+    .PageDoc > div.contents {
+        display: flex;
+        flex-direction: row-reverse;
+        flex-wrap: nowrap;
+        align-items: flex-start;
+    }
+    
+    div.contents .textblock {
+        min-width: 200px;
+        flex-grow: 1;
+    }
+}
+
+div.contents, div.header .title, div.header .summary {
+    max-width: var(--content-maxwidth);
+}
+
+div.contents, div.header .title {
+    line-height: initial;
+    margin: calc(var(--spacing-medium) + .2em) auto var(--spacing-medium) auto;
+}
+
+div.header .summary {
+    margin: var(--spacing-medium) auto 0 auto;
+}
+
+div.headertitle {
+    padding: 0;
+}
+
+div.header .title {
+    font-weight: 600;
+    font-size: 225%;
+    padding: var(--spacing-medium) var(--spacing-large);
+    word-break: break-word;
+}
+
+div.header .summary {
+    width: auto;
+    display: block;
+    float: none;
+    padding: 0 var(--spacing-large);
+}
+
+td.memSeparator {
+    border-color: var(--separator-color);
+}
+
+span.mlabel {
+    background: var(--primary-color);
+    border: none;
+    padding: 4px 9px;
+    border-radius: 12px;
+    margin-right: var(--spacing-medium);
+}
+
+span.mlabel:last-of-type {
+    margin-right: 2px;
+}
+
+div.contents {
+    padding: 0 var(--spacing-large);
+}
+
+div.contents p, div.contents li {
+    line-height: var(--content-line-height);
+}
+
+div.contents div.dyncontent {
+    margin: var(--spacing-medium) 0;
+}
+
+@media (prefers-color-scheme: dark) {
+    html:not(.light-mode) div.contents div.dyncontent img,
+    html:not(.light-mode) div.contents center img,
+    html:not(.light-mode) div.contents > table img,
+    html:not(.light-mode) div.contents div.dyncontent iframe,
+    html:not(.light-mode) div.contents center iframe,
+    html:not(.light-mode) div.contents table iframe {
+        filter: hue-rotate(180deg) invert();
+    }
+}
+
+html.dark-mode div.contents div.dyncontent img,
+html.dark-mode div.contents center img,
+html.dark-mode div.contents > table img,
+html.dark-mode div.contents div.dyncontent iframe,
+html.dark-mode div.contents center iframe,
+html.dark-mode div.contents table iframe {
+    filter: hue-rotate(180deg) invert();
+}
+
+h2.groupheader {
+    border-bottom: 0px;
+    color: var(--page-foreground-color);
+    box-shadow: 
+        100px 0 var(--page-background-color), 
+        -100px 0 var(--page-background-color),
+        100px 0.75px var(--separator-color),
+        -100px 0.75px var(--separator-color),
+        500px 0 var(--page-background-color), 
+        -500px 0 var(--page-background-color),
+        500px 0.75px var(--separator-color),
+        -500px 0.75px var(--separator-color),
+        900px 0 var(--page-background-color), 
+        -900px 0 var(--page-background-color),
+        900px 0.75px var(--separator-color),
+        -900px 0.75px var(--separator-color),
+        1400px 0 var(--page-background-color),
+        -1400px 0 var(--page-background-color), 
+        1400px 0.75px var(--separator-color),
+        -1400px 0.75px var(--separator-color),
+        1900px 0 var(--page-background-color),
+        -1900px 0 var(--page-background-color),
+        1900px 0.75px var(--separator-color),
+        -1900px 0.75px var(--separator-color);
+}
+
+blockquote {
+    margin: 0 var(--spacing-medium) 0 var(--spacing-medium);
+    padding: var(--spacing-small) var(--spacing-large);
+    background: var(--blockquote-background);
+    color: var(--blockquote-foreground);
+    border-left: 0;
+    overflow: visible;
+    border-radius: var(--border-radius-medium);
+    overflow: visible;
+    position: relative;
+}
+
+blockquote::before, blockquote::after {
+    font-weight: bold;
+    font-family: serif;
+    font-size: 360%;
+    opacity: .15;
+    position: absolute;
+}
+
+blockquote::before {
+    content: "“";
+    left: -10px;
+    top: 4px;
+}
+
+blockquote::after {
+    content: "”";
+    right: -8px;
+    bottom: -25px;
+}
+
+blockquote p {
+    margin: var(--spacing-small) 0 var(--spacing-medium) 0;
+}
+.paramname {
+    font-weight: 600;
+    color: var(--primary-dark-color);
+}
+
+.paramname > code {
+    border: 0;
+}
+
+table.params .paramname {
+    font-weight: 600;
+    font-family: var(--font-family-monospace);
+    font-size: var(--code-font-size);
+    padding-right: var(--spacing-small);
+    line-height: var(--table-line-height);
+}
+
+h1.glow, h2.glow, h3.glow, h4.glow, h5.glow, h6.glow {
+    text-shadow: 0 0 15px var(--primary-light-color);
+}
+
+.alphachar a {
+    color: var(--page-foreground-color);
+}
+
+/*
+ Table of Contents
+ */
+
+div.contents .toc {
+    max-height: var(--toc-max-height);
+    min-width: var(--toc-width);
+    border: 0;
+    border-left: 1px solid var(--separator-color);
+    border-radius: 0;
+    background-color: transparent;
+    box-shadow: none;
+    position: sticky;
+    top: var(--toc-sticky-top);
+    padding: 0 var(--spacing-large);
+    margin: var(--spacing-small) 0 var(--spacing-large) var(--spacing-large);
+}
+
+div.toc h3 {
+    color: var(--toc-foreground);
+    font-size: var(--navigation-font-size);
+    margin: var(--spacing-large) 0 var(--spacing-medium) 0;
+}
+
+div.toc li {
+    padding: 0;
+    background: none;
+    line-height: var(--toc-font-size);
+    margin: var(--toc-font-size) 0 0 0;
+}
+
+div.toc li::before {
+    display: none;
+}
+
+div.toc ul {
+    margin-top: 0;
+}
+
+div.toc li a {
+    font-size: var(--toc-font-size);
+    color: var(--page-foreground-color) !important;
+    text-decoration: none;
+}
+
+div.toc li a:hover, div.toc li a.active {
+    color: var(--primary-color) !important;
+}
+
+div.toc li a.aboveActive {
+    color: var(--page-secondary-foreground-color) !important;
+}
+
+
+@media screen and (max-width: 999px) {
+    div.contents .toc {
+        max-height: 45vh;
+        float: none;
+        width: auto;
+        margin: 0 0 var(--spacing-medium) 0;
+        position: relative;
+        top: 0;
+        position: relative;
+        border: 1px solid var(--separator-color);
+        border-radius: var(--border-radius-medium);
+        background-color: var(--toc-background);
+        box-shadow: var(--box-shadow);
+    }
+
+    div.contents .toc.interactive {
+        max-height: calc(var(--navigation-font-size) + 2 * var(--spacing-large));
+        overflow: hidden;
+    }
+
+    div.contents .toc > h3 {
+        -webkit-tap-highlight-color: transparent;
+        cursor: pointer;
+        position: sticky;
+        top: 0;
+        background-color: var(--toc-background);
+        margin: 0;
+        padding: var(--spacing-large) 0;
+        display: block;
+    }
+
+    div.contents .toc.interactive > h3::before {
+        content: "";
+        width: 0; 
+        height: 0; 
+        border-left: 4px solid transparent;
+        border-right: 4px solid transparent;
+        border-top: 5px solid var(--primary-color);
+        display: inline-block;
+        margin-right: var(--spacing-small);
+        margin-bottom: calc(var(--navigation-font-size) / 4);
+        transform: rotate(-90deg);
+        transition: transform 0.25s ease-out;
+    }
+
+    div.contents .toc.interactive.open > h3::before {
+        transform: rotate(0deg);
+    }
+
+    div.contents .toc.interactive.open {
+        max-height: 45vh;
+        overflow: auto;
+        transition: max-height 0.2s ease-in-out;
+    }
+
+    div.contents .toc a, div.contents .toc a.active {
+        color: var(--primary-color) !important;
+    }
+
+    div.contents .toc a:hover {
+        text-decoration: underline;
+    }
+}
+
+/*
+ Code & Fragments
+ */
+
+code, div.fragment, pre.fragment {
+    border-radius: var(--border-radius-small);
+    border: 1px solid var(--separator-color);
+    overflow: hidden;
+}
+
+code {
+    display: inline;
+    background: var(--code-background);
+    color: var(--code-foreground);
+    padding: 2px 6px;
+}
+
+div.fragment, pre.fragment {
+    margin: var(--spacing-medium) 0;
+    padding: calc(var(--spacing-large) - (var(--spacing-large) / 6)) var(--spacing-large);
+    background: var(--fragment-background);
+    color: var(--fragment-foreground);
+    overflow-x: auto;
+}
+
+@media screen and (max-width: 767px) {
+    div.fragment, pre.fragment {
+        border-top-right-radius: 0;
+        border-bottom-right-radius: 0;
+        border-right: 0;
+    }
+
+    .contents > div.fragment,
+    .textblock > div.fragment,
+    .textblock > pre.fragment,
+    .contents > .doxygen-awesome-fragment-wrapper > div.fragment,
+    .textblock > .doxygen-awesome-fragment-wrapper > div.fragment,
+    .textblock > .doxygen-awesome-fragment-wrapper > pre.fragment {
+        margin: var(--spacing-medium) calc(0px - var(--spacing-large));
+        border-radius: 0;
+        border-left: 0;
+    }
+
+    .textblock li > .fragment,
+    .textblock li > .doxygen-awesome-fragment-wrapper > .fragment {
+        margin: var(--spacing-medium) calc(0px - var(--spacing-large));
+    }
+
+    .memdoc li > .fragment,
+    .memdoc li > .doxygen-awesome-fragment-wrapper > .fragment {
+        margin: var(--spacing-medium) calc(0px - var(--spacing-medium));
+    }
+
+    .textblock ul, .memdoc ul {
+        overflow: initial;
+    }
+
+    .memdoc > div.fragment,
+    .memdoc > pre.fragment,
+    dl dd > div.fragment,
+    dl dd pre.fragment,
+    .memdoc > .doxygen-awesome-fragment-wrapper > div.fragment,
+    .memdoc > .doxygen-awesome-fragment-wrapper > pre.fragment,
+    dl dd > .doxygen-awesome-fragment-wrapper > div.fragment,
+    dl dd .doxygen-awesome-fragment-wrapper > pre.fragment {
+        margin: var(--spacing-medium) calc(0px - var(--spacing-medium));
+        border-radius: 0;
+        border-left: 0;
+    }
+}
+
+code, code a, pre.fragment, div.fragment, div.fragment .line, div.fragment span, div.fragment .line a, div.fragment .line span {
+    font-family: var(--font-family-monospace);
+    font-size: var(--code-font-size) !important;
+}
+
+div.line:after {
+    margin-right: var(--spacing-medium);
+}
+
+div.fragment .line, pre.fragment {
+    white-space: pre;
+    word-wrap: initial;
+    line-height: var(--fragment-lineheight);
+}
+
+div.fragment span.keyword {
+    color: var(--fragment-keyword);
+}
+
+div.fragment span.keywordtype {
+    color: var(--fragment-keywordtype);
+}
+
+div.fragment span.keywordflow {
+    color: var(--fragment-keywordflow);
+}
+
+div.fragment span.stringliteral {
+    color: var(--fragment-token);
+}
+
+div.fragment span.comment {
+    color: var(--fragment-comment);
+}
+
+div.fragment a.code {
+    color: var(--fragment-link) !important;
+}
+
+div.fragment span.preprocessor {
+    color: var(--fragment-preprocessor);
+}
+
+div.fragment span.lineno {
+    display: inline-block;
+    width: 27px;
+    border-right: none;
+    background: var(--fragment-linenumber-background);
+    color: var(--fragment-linenumber-color);
+}
+
+div.fragment span.lineno a {
+    background: none;
+    color: var(--fragment-link) !important;
+}
+
+div.fragment .line:first-child .lineno {
+    box-shadow: -999999px 0px 0 999999px var(--fragment-linenumber-background), -999998px 0px 0 999999px var(--fragment-linenumber-border);
+}
+
+div.line {
+    border-radius: var(--border-radius-small);
+}
+
+div.line.glow {
+    background-color: var(--primary-light-color);
+    box-shadow: none;
+}
+
+/*
+ dl warning, attention, note, deprecated, bug, ...
+ */
+
+dl.bug dt a, dl.deprecated dt a, dl.todo dt a {
+    font-weight: bold !important;
+}
+
+dl.warning, dl.attention, dl.note, dl.deprecated, dl.bug, dl.invariant, dl.pre, dl.post, dl.todo, dl.remark {
+    padding: var(--spacing-medium);
+    margin: var(--spacing-medium) 0;
+    color: var(--page-background-color);
+    overflow: hidden;
+    margin-left: 0;
+    border-radius: var(--border-radius-small);
+}
+
+dl.section dd {
+    margin-bottom: 2px;
+}
+
+dl.warning, dl.attention {
+    background: var(--warning-color);
+    border-left: 8px solid var(--warning-color-dark);
+    color: var(--warning-color-darker);
+}
+
+dl.warning dt, dl.attention dt {
+    color: var(--warning-color-dark);
+}
+
+dl.note, dl.remark {
+    background: var(--note-color);
+    border-left: 8px solid var(--note-color-dark);
+    color: var(--note-color-darker);
+}
+
+dl.note dt, dl.remark dt {
+    color: var(--note-color-dark);
+}
+
+dl.todo {
+    background: var(--todo-color);
+    border-left: 8px solid var(--todo-color-dark);
+    color: var(--todo-color-darker);
+}
+
+dl.todo dt {
+    color: var(--todo-color-dark);
+}
+
+dl.todo dt a {
+    color: var(--todo-color-dark) !important;
+}
+
+dl.bug {
+    background: var(--bug-color);
+    border-left: 8px solid var(--bug-color-dark);
+    color: var(--bug-color-darker);
+}
+
+dl.bug dt a {
+    color: var(--bug-color-dark) !important;
+}
+
+dl.deprecated {
+    background: var(--deprecated-color);
+    border-left: 8px solid var(--deprecated-color-dark);
+    color: var(--deprecated-color-darker);
+}
+
+dl.deprecated dt a {
+    color: var(--deprecated-color-dark) !important;
+}
+
+dl.section dd, dl.bug dd, dl.deprecated dd, dl.todo dd {
+    margin-inline-start: 0px;
+}
+
+dl.invariant, dl.pre, dl.post {
+    background: var(--invariant-color);
+    border-left: 8px solid var(--invariant-color-dark);
+    color: var(--invariant-color-darker);
+}
+
+dl.invariant dt, dl.pre dt, dl.post dt {
+    color: var(--invariant-color-dark);
+}
+
+/*
+ memitem
+ */
+
+div.memdoc, div.memproto, h2.memtitle {
+    box-shadow: none;
+    background-image: none;
+    border: none;
+}
+
+div.memdoc {
+    padding: 0 var(--spacing-medium);
+    background: var(--page-background-color);
+}
+
+h2.memtitle, div.memitem {
+    border: 1px solid var(--separator-color);
+    box-shadow: var(--box-shadow);
+}
+
+h2.memtitle {
+    box-shadow: 0px var(--spacing-medium) 0 -1px var(--fragment-background), var(--box-shadow);
+}
+
+div.memitem {
+    transition: none;
+}
+
+div.memproto, h2.memtitle {
+    background: var(--fragment-background);
+}
+
+h2.memtitle {
+    font-weight: 500;
+    font-size: var(--memtitle-font-size);
+    font-family: var(--font-family-monospace);
+    border-bottom: none;
+    border-top-left-radius: var(--border-radius-medium);
+    border-top-right-radius: var(--border-radius-medium);
+    word-break: break-all;
+    position: relative;
+}
+
+h2.memtitle:after {
+    content: "";
+    display: block;
+    background: var(--fragment-background);
+    height: var(--spacing-medium);
+    bottom: calc(0px - var(--spacing-medium));
+    left: 0;
+    right: -14px;
+    position: absolute;
+    border-top-right-radius: var(--border-radius-medium);
+}
+
+h2.memtitle > span.permalink {
+    font-size: inherit;
+}
+
+h2.memtitle > span.permalink > a {
+    text-decoration: none;
+    padding-left: 3px;
+    margin-right: -4px;
+    user-select: none;
+    display: inline-block;
+    margin-top: -6px;
+}
+
+h2.memtitle > span.permalink > a:hover {
+    color: var(--primary-dark-color) !important;
+}
+
+a:target + h2.memtitle, a:target + h2.memtitle + div.memitem {
+    border-color: var(--primary-light-color);
+}
+
+div.memitem {
+    border-top-right-radius: var(--border-radius-medium);
+    border-bottom-right-radius: var(--border-radius-medium);
+    border-bottom-left-radius: var(--border-radius-medium);
+    overflow: hidden;
+    display: block !important;
+}
+
+div.memdoc {
+    border-radius: 0;
+}
+
+div.memproto {
+    border-radius: 0 var(--border-radius-small) 0 0;
+    overflow: auto;
+    border-bottom: 1px solid var(--separator-color);
+    padding: var(--spacing-medium);
+    margin-bottom: -1px;
+}
+
+div.memtitle {
+    border-top-right-radius: var(--border-radius-medium);
+    border-top-left-radius: var(--border-radius-medium);
+}
+
+div.memproto table.memname {
+    font-family: var(--font-family-monospace);
+    color: var(--page-foreground-color);
+    font-size: var(--memname-font-size);
+    text-shadow: none;
+}
+
+div.memproto div.memtemplate {
+    font-family: var(--font-family-monospace);
+    color: var(--primary-dark-color);
+    font-size: var(--memname-font-size);
+    margin-left: 2px;
+    text-shadow: none;
+}
+
+table.mlabels, table.mlabels > tbody {
+    display: block;
+}
+
+td.mlabels-left {
+    width: auto;
+}
+
+td.mlabels-right {
+    margin-top: 3px;
+    position: sticky;
+    left: 0;
+}
+
+table.mlabels > tbody > tr:first-child {
+    display: flex;
+    justify-content: space-between;
+    flex-wrap: wrap;
+}
+
+.memname, .memitem span.mlabels {
+    margin: 0;
+}
+
+/*
+ reflist
+ */
+
+dl.reflist {
+    box-shadow: var(--box-shadow);
+    border-radius: var(--border-radius-medium);
+    border: 1px solid var(--separator-color);
+    overflow: hidden;
+    padding: 0;
+}
+
+
+dl.reflist dt, dl.reflist dd {
+    box-shadow: none;
+    text-shadow: none;
+    background-image: none;
+    border: none;
+    padding: 12px;
+}
+
+
+dl.reflist dt {
+    font-weight: 500;
+    border-radius: 0;
+    background: var(--code-background);
+    border-bottom: 1px solid var(--separator-color);
+    color: var(--page-foreground-color);
+}
+
+
+dl.reflist dd {
+    background: none;
+}
+
+/*
+ Table
+ */
+
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname),
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody {
+    display: inline-block;
+    max-width: 100%;
+}
+
+.contents > table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname):not(.classindex) {
+    margin-left: calc(0px - var(--spacing-large));
+    margin-right: calc(0px - var(--spacing-large));
+    max-width: calc(100% + 2 * var(--spacing-large));
+}
+
+table.fieldtable,
+table.markdownTable tbody,
+table.doxtable tbody {
+    border: none;
+    margin: var(--spacing-medium) 0;
+    box-shadow: 0 0 0 1px var(--separator-color);
+    border-radius: var(--border-radius-small);
+}
+
+table.doxtable caption {
+    display: block;
+}
+
+table.fieldtable {
+    border-collapse: collapse;
+    width: 100%;
+}
+
+th.markdownTableHeadLeft,
+th.markdownTableHeadRight,
+th.markdownTableHeadCenter,
+th.markdownTableHeadNone,
+table.doxtable th {
+    background: var(--tablehead-background);
+    color: var(--tablehead-foreground);
+    font-weight: 600;
+    font-size: var(--page-font-size);
+}
+
+th.markdownTableHeadLeft:first-child,
+th.markdownTableHeadRight:first-child,
+th.markdownTableHeadCenter:first-child,
+th.markdownTableHeadNone:first-child,
+table.doxtable tr th:first-child {
+    border-top-left-radius: var(--border-radius-small);
+}
+
+th.markdownTableHeadLeft:last-child,
+th.markdownTableHeadRight:last-child,
+th.markdownTableHeadCenter:last-child,
+th.markdownTableHeadNone:last-child,
+table.doxtable tr th:last-child {
+    border-top-right-radius: var(--border-radius-small);
+}
+
+table.markdownTable td,
+table.markdownTable th,
+table.fieldtable td,
+table.fieldtable th,
+table.doxtable td,
+table.doxtable th {
+    border: 1px solid var(--separator-color);
+    padding: var(--spacing-small) var(--spacing-medium);
+}
+
+table.markdownTable td:last-child,
+table.markdownTable th:last-child,
+table.fieldtable td:last-child,
+table.fieldtable th:last-child,
+table.doxtable td:last-child,
+table.doxtable th:last-child {
+    border-right: none;
+}
+
+table.markdownTable td:first-child,
+table.markdownTable th:first-child,
+table.fieldtable td:first-child,
+table.fieldtable th:first-child,
+table.doxtable td:first-child,
+table.doxtable th:first-child {
+    border-left: none;
+}
+
+table.markdownTable tr:first-child td,
+table.markdownTable tr:first-child th,
+table.fieldtable tr:first-child td,
+table.fieldtable tr:first-child th,
+table.doxtable tr:first-child td,
+table.doxtable tr:first-child th {
+    border-top: none;
+}
+
+table.markdownTable tr:last-child td,
+table.markdownTable tr:last-child th,
+table.fieldtable tr:last-child td,
+table.fieldtable tr:last-child th,
+table.doxtable tr:last-child td,
+table.doxtable tr:last-child th {
+    border-bottom: none;
+}
+
+table.markdownTable tr, table.doxtable tr {
+    border-bottom: 1px solid var(--separator-color);
+}
+
+table.markdownTable tr:last-child, table.doxtable tr:last-child {
+    border-bottom: none;
+}
+
+table.fieldtable th {
+    font-size: var(--page-font-size);
+    font-weight: 600;
+    background-image: none;
+    background-color: var(--tablehead-background);
+    color: var(--tablehead-foreground);
+}
+
+table.fieldtable td.fieldtype, .fieldtable td.fieldname, .fieldtable td.fielddoc, .fieldtable th {
+    border-bottom: 1px solid var(--separator-color);
+    border-right: 1px solid var(--separator-color);
+}
+
+table.fieldtable tr:last-child td:first-child {
+    border-bottom-left-radius: var(--border-radius-small);
+}
+
+table.fieldtable tr:last-child td:last-child {
+    border-bottom-right-radius: var(--border-radius-small);
+}
+
+.memberdecls td.glow, .fieldtable tr.glow {
+    background-color: var(--primary-light-color);
+    box-shadow: none;
+}
+
+table.memberdecls {
+    display: block;
+    -webkit-tap-highlight-color: transparent;
+}
+
+table.memberdecls tr[class^='memitem'] {
+    font-family: var(--font-family-monospace);
+    font-size: var(--code-font-size);
+}
+
+table.memberdecls tr[class^='memitem'] .memTemplParams {
+    font-family: var(--font-family-monospace);
+    font-size: var(--code-font-size);
+    color: var(--primary-dark-color);
+    white-space: normal;
+}
+
+table.memberdecls .memItemLeft,
+table.memberdecls .memItemRight,
+table.memberdecls .memTemplItemLeft,
+table.memberdecls .memTemplItemRight,
+table.memberdecls .memTemplParams {
+    transition: none;
+    padding-top: var(--spacing-small);
+    padding-bottom: var(--spacing-small);
+    border-top: 1px solid var(--separator-color);
+    border-bottom: 1px solid var(--separator-color);
+    background-color: var(--fragment-background);
+}
+
+table.memberdecls .memTemplItemLeft,
+table.memberdecls .memTemplItemRight {
+    padding-top: 2px;
+}
+
+table.memberdecls .memTemplParams {
+    border-bottom: 0;
+    border-left: 1px solid var(--separator-color);
+    border-right: 1px solid var(--separator-color);
+    border-radius: var(--border-radius-small) var(--border-radius-small) 0 0;
+    padding-bottom: var(--spacing-small);
+}
+
+table.memberdecls .memTemplItemLeft {
+    border-radius: 0 0 0 var(--border-radius-small);
+    border-left: 1px solid var(--separator-color);
+    border-top: 0;
+}
+
+table.memberdecls .memTemplItemRight {
+    border-radius: 0 0 var(--border-radius-small) 0;
+    border-right: 1px solid var(--separator-color);
+    padding-left: 0;
+    border-top: 0;
+}
+
+table.memberdecls .memItemLeft {
+    border-radius: var(--border-radius-small) 0 0 var(--border-radius-small);
+    border-left: 1px solid var(--separator-color);
+    padding-left: var(--spacing-medium);
+    padding-right: 0;
+}
+
+table.memberdecls .memItemRight {
+    border-radius: 0 var(--border-radius-small) var(--border-radius-small) 0;
+    border-right: 1px solid var(--separator-color);
+    padding-right: var(--spacing-medium);
+    padding-left: 0;
+
+}
+
+table.memberdecls .mdescLeft, table.memberdecls .mdescRight {
+    background: none;
+    color: var(--page-foreground-color);
+    padding: var(--spacing-small) 0;
+}
+
+table.memberdecls .memItemLeft,
+table.memberdecls .memTemplItemLeft {
+    padding-right: var(--spacing-medium);
+}
+
+table.memberdecls .memSeparator {
+    background: var(--page-background-color);
+    height: var(--spacing-large);
+    border: 0;
+    transition: none;
+}
+
+table.memberdecls .groupheader {
+    margin-bottom: var(--spacing-large);
+}
+
+table.memberdecls .inherit_header td {
+    padding: 0 0 var(--spacing-medium) 0;
+    text-indent: -12px;
+    color: var(--page-secondary-foreground-color);
+}
+
+table.memberdecls img[src="closed.png"],
+table.memberdecls img[src="open.png"],
+div.dynheader img[src="open.png"],
+div.dynheader img[src="closed.png"] {
+    width: 0;
+    height: 0;
+    border-left: 4px solid transparent;
+    border-right: 4px solid transparent;
+    border-top: 5px solid var(--primary-color);
+    margin-top: 8px;
+    display: block;
+    float: left;
+    margin-left: -10px;
+    transition: transform 0.25s ease-out;
+}
+
+table.memberdecls img {
+    margin-right: 10px;
+}
+
+table.memberdecls img[src="closed.png"],
+div.dynheader img[src="closed.png"] {
+    transform: rotate(-90deg);
+    
+}
+
+.compoundTemplParams {
+    font-family: var(--font-family-monospace);
+    color: var(--primary-dark-color);
+    font-size: var(--code-font-size);
+}
+
+@media screen and (max-width: 767px) {
+
+    table.memberdecls .memItemLeft,
+    table.memberdecls .memItemRight,
+    table.memberdecls .mdescLeft,
+    table.memberdecls .mdescRight,
+    table.memberdecls .memTemplItemLeft,
+    table.memberdecls .memTemplItemRight,
+    table.memberdecls .memTemplParams {
+        display: block;
+        text-align: left;
+        padding-left: var(--spacing-large);
+        margin: 0 calc(0px - var(--spacing-large)) 0 calc(0px - var(--spacing-large));
+        border-right: none;
+        border-left: none;
+        border-radius: 0;
+        white-space: normal;
+    }
+
+    table.memberdecls .memItemLeft,
+    table.memberdecls .mdescLeft,
+    table.memberdecls .memTemplItemLeft {
+        border-bottom: 0;
+        padding-bottom: 0;
+    }
+
+    table.memberdecls .memTemplItemLeft {
+        padding-top: 0;
+    }
+
+    table.memberdecls .mdescLeft {
+        margin-bottom: calc(0px - var(--page-font-size));
+    }
+
+    table.memberdecls .memItemRight, 
+    table.memberdecls .mdescRight,
+    table.memberdecls .memTemplItemRight {
+        border-top: 0;
+        padding-top: 0;
+        padding-right: var(--spacing-large);
+        overflow-x: auto;
+    }
+
+    table.memberdecls tr[class^='memitem']:not(.inherit) {
+        display: block;
+        width: calc(100vw - 2 * var(--spacing-large));
+    }
+
+    table.memberdecls .mdescRight {
+        color: var(--page-foreground-color);
+    }
+
+    table.memberdecls tr.inherit {
+        visibility: hidden;
+    }
+
+    table.memberdecls tr[style="display: table-row;"] {
+        display: block !important;
+        visibility: visible;
+        width: calc(100vw - 2 * var(--spacing-large));
+        animation: fade .5s;
+    }
+
+    @keyframes fade {
+        0% {
+            opacity: 0;
+            max-height: 0;
+        }
+
+        100% {
+            opacity: 1;
+            max-height: 200px;
+        }
+    }
+}
+
+
+/*
+ Horizontal Rule
+ */
+
+hr {
+    margin-top: var(--spacing-large);
+    margin-bottom: var(--spacing-large);
+    height: 1px;
+    background-color: var(--separator-color);
+    border: 0;
+}
+
+.contents hr {
+    box-shadow: 100px 0 0 var(--separator-color),
+                -100px 0 0 var(--separator-color),
+                500px 0 0 var(--separator-color),
+                -500px 0 0 var(--separator-color),
+                1500px 0 0 var(--separator-color),
+                -1500px 0 0 var(--separator-color),
+                2000px 0 0 var(--separator-color),
+                -2000px 0 0 var(--separator-color);
+}
+
+.contents img, .contents .center, .contents center, .contents div.image object {
+    max-width: 100%;
+    overflow: auto;
+}
+
+@media screen and (max-width: 767px) {
+    .contents .dyncontent > .center, .contents > center {
+        margin-left: calc(0px - var(--spacing-large));
+        margin-right: calc(0px - var(--spacing-large));
+        max-width: calc(100% + 2 * var(--spacing-large));
+    }
+}
+
+/*
+ Directories
+ */
+div.directory {
+    border-top: 1px solid var(--separator-color);
+    border-bottom: 1px solid var(--separator-color);
+    width: auto;
+}
+
+table.directory {
+    font-family: var(--font-family);
+    font-size: var(--page-font-size);
+    font-weight: normal;
+    width: 100%;
+}
+
+table.directory td.entry, table.directory td.desc {
+    padding: calc(var(--spacing-small) / 2) var(--spacing-small);
+    line-height: var(--table-line-height);
+}
+
+table.directory tr.even td:last-child {
+    border-radius: 0 var(--border-radius-small) var(--border-radius-small) 0;
+}
+
+table.directory tr.even td:first-child {
+    border-radius: var(--border-radius-small) 0 0 var(--border-radius-small);
+}
+
+table.directory tr.even:last-child td:last-child {
+    border-radius: 0 var(--border-radius-small) 0 0;
+}
+
+table.directory tr.even:last-child td:first-child {
+    border-radius: var(--border-radius-small) 0 0 0;
+}
+
+table.directory td.desc {
+    min-width: 250px;
+}
+
+table.directory tr.even {
+    background-color: var(--odd-color);
+}
+
+table.directory tr.odd {
+    background-color: transparent;
+}
+
+.icona {
+    width: auto;
+    height: auto;
+    margin: 0 var(--spacing-small);
+}
+
+.icon {
+    background: var(--primary-color);
+    border-radius: var(--border-radius-small);
+    font-size: var(--page-font-size);
+    padding: calc(var(--page-font-size) / 5);
+    line-height: var(--page-font-size);
+    transform: scale(0.8);
+    height: auto;
+    width: var(--page-font-size);
+    user-select: none;
+}
+
+.iconfopen, .icondoc, .iconfclosed {
+    background-position: center;
+    margin-bottom: 0;
+    height: var(--table-line-height);
+}
+
+.icondoc {
+    filter: saturate(0.2);
+}
+
+@media screen and (max-width: 767px) {
+    div.directory {
+        margin-left: calc(0px - var(--spacing-large));
+        margin-right: calc(0px - var(--spacing-large));
+    }
+}
+
+@media (prefers-color-scheme: dark) {
+    html:not(.light-mode) .iconfopen, html:not(.light-mode) .iconfclosed {
+        filter: hue-rotate(180deg) invert();
+    }
+}
+
+html.dark-mode .iconfopen, html.dark-mode .iconfclosed {
+    filter: hue-rotate(180deg) invert();
+}
+
+/*
+ Class list
+ */
+
+.classindex dl.odd {
+    background: var(--odd-color);
+    border-radius: var(--border-radius-small);
+}
+
+.classindex dl.even {
+    background-color: transparent;
+}
+
+/* 
+ Class Index Doxygen 1.8 
+*/
+
+table.classindex {
+    margin-left: 0;
+    margin-right: 0;
+    width: 100%;
+}
+
+table.classindex table div.ah {
+    background-image: none;
+    background-color: initial;
+    border-color: var(--separator-color);
+    color: var(--page-foreground-color);
+    box-shadow: var(--box-shadow);
+    border-radius: var(--border-radius-large);
+    padding: var(--spacing-small);
+}
+
+div.qindex {
+    background-color: var(--odd-color);
+    border-radius: var(--border-radius-small);
+    border: 1px solid var(--separator-color);
+    padding: var(--spacing-small) 0;
+}
+
+/*
+  Footer and nav-path
+ */
+
+#nav-path {
+    width: 100%;
+}
+
+#nav-path ul {
+    background-image: none;
+    background: var(--page-background-color);
+    border: none;
+    border-top: 1px solid var(--separator-color);
+    border-bottom: 1px solid var(--separator-color);
+    border-bottom: 0;
+    box-shadow: 0 0.75px 0 var(--separator-color);
+    font-size: var(--navigation-font-size);
+}
+
+img.footer {
+    width: 60px;
+}
+
+.navpath li.footer {
+    color: var(--page-secondary-foreground-color);
+}
+
+address.footer {
+    color: var(--page-secondary-foreground-color);
+    margin-bottom: var(--spacing-large);
+}
+
+#nav-path li.navelem {
+    background-image: none;
+    display: flex;
+    align-items: center;
+}
+
+.navpath li.navelem a {
+    text-shadow: none;
+    display: inline-block;
+    color: var(--primary-color) !important;
+}
+
+.navpath li.navelem b {
+    color: var(--primary-dark-color);
+    font-weight: 500;
+}
+
+li.navelem {
+    padding: 0;
+    margin-left: -8px;
+}
+
+li.navelem:first-child {
+    margin-left: var(--spacing-large);
+}
+
+li.navelem:first-child:before {
+    display: none;
+}
+
+#nav-path li.navelem:after {
+    content: '';
+    border: 5px solid var(--page-background-color);
+    border-bottom-color: transparent;
+    border-right-color: transparent;
+    border-top-color: transparent;
+    transform: translateY(-1px) scaleY(4.2);
+    z-index: 10;
+    margin-left: 6px;
+}
+
+#nav-path li.navelem:before {
+    content: '';
+    border: 5px solid var(--separator-color);
+    border-bottom-color: transparent;
+    border-right-color: transparent;
+    border-top-color: transparent;
+    transform: translateY(-1px) scaleY(3.2);
+    margin-right: var(--spacing-small);
+}
+
+.navpath li.navelem a:hover {
+    color: var(--primary-color);
+}
+
+/*
+ Scrollbars for Webkit
+*/
+
+#nav-tree::-webkit-scrollbar,
+div.fragment::-webkit-scrollbar,
+pre.fragment::-webkit-scrollbar,
+div.memproto::-webkit-scrollbar,
+.contents center::-webkit-scrollbar,
+.contents .center::-webkit-scrollbar,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody::-webkit-scrollbar,
+div.contents .toc::-webkit-scrollbar {
+    background: transparent;
+    width: calc(var(--webkit-scrollbar-size) + var(--webkit-scrollbar-padding) + var(--webkit-scrollbar-padding));
+    height: calc(var(--webkit-scrollbar-size) + var(--webkit-scrollbar-padding) + var(--webkit-scrollbar-padding));
+}
+
+#nav-tree::-webkit-scrollbar-thumb,
+div.fragment::-webkit-scrollbar-thumb,
+pre.fragment::-webkit-scrollbar-thumb,
+div.memproto::-webkit-scrollbar-thumb,
+.contents center::-webkit-scrollbar-thumb,
+.contents .center::-webkit-scrollbar-thumb,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody::-webkit-scrollbar-thumb,
+div.contents .toc::-webkit-scrollbar-thumb {
+    background-color: transparent;
+    border: var(--webkit-scrollbar-padding) solid transparent;
+    border-radius: calc(var(--webkit-scrollbar-padding) + var(--webkit-scrollbar-padding));
+    background-clip: padding-box;  
+}
+
+#nav-tree:hover::-webkit-scrollbar-thumb,
+div.fragment:hover::-webkit-scrollbar-thumb,
+pre.fragment:hover::-webkit-scrollbar-thumb,
+div.memproto:hover::-webkit-scrollbar-thumb,
+.contents center:hover::-webkit-scrollbar-thumb,
+.contents .center:hover::-webkit-scrollbar-thumb,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody:hover::-webkit-scrollbar-thumb,
+div.contents .toc:hover::-webkit-scrollbar-thumb {
+    background-color: var(--webkit-scrollbar-color);
+}
+
+#nav-tree::-webkit-scrollbar-track,
+div.fragment::-webkit-scrollbar-track,
+pre.fragment::-webkit-scrollbar-track,
+div.memproto::-webkit-scrollbar-track,
+.contents center::-webkit-scrollbar-track,
+.contents .center::-webkit-scrollbar-track,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody::-webkit-scrollbar-track,
+div.contents .toc::-webkit-scrollbar-track {
+    background: transparent;
+}
+
+#nav-tree::-webkit-scrollbar-corner {
+    background-color: var(--side-nav-background);
+}
+
+#nav-tree,
+div.fragment,
+pre.fragment,
+div.memproto,
+.contents center,
+.contents .center,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody,
+div.contents .toc {
+    overflow-x: auto;
+    overflow-x: overlay;
+}
+
+#nav-tree {
+    overflow-x: auto;
+    overflow-y: auto;
+    overflow-y: overlay;
+}
+
+/*
+ Scrollbars for Firefox
+*/
+
+#nav-tree,
+div.fragment,
+pre.fragment,
+div.memproto,
+.contents center,
+.contents .center,
+.contents table:not(.memberdecls):not(.mlabels):not(.fieldtable):not(.memname) tbody,
+div.contents .toc {
+    scrollbar-width: thin;
+}
+
+/*
+  Optional Dark mode toggle button
+*/
+
+doxygen-awesome-dark-mode-toggle {
+    display: inline-block;
+    margin: 0 0 0 var(--spacing-small);
+    padding: 0;
+    width: var(--searchbar-height);
+    height: var(--searchbar-height);
+    background: none;
+    border: none;
+    border-radius: var(--searchbar-height);
+    vertical-align: middle;
+    text-align: center;
+    line-height: var(--searchbar-height);
+    font-size: 22px;
+    display: flex;
+    align-items: center;
+    justify-content: center;
+    user-select: none;
+    cursor: pointer;
+}
+
+doxygen-awesome-dark-mode-toggle > svg {
+    transition: transform .1s ease-in-out;
+}
+
+doxygen-awesome-dark-mode-toggle:active > svg {
+    transform: scale(.5);
+}
+
+doxygen-awesome-dark-mode-toggle:hover {
+    background-color: rgba(0,0,0,.03);
+}
+
+html.dark-mode doxygen-awesome-dark-mode-toggle:hover {
+    background-color: rgba(0,0,0,.18);
+}
+
+/*
+ Optional fragment copy button
+*/
+.doxygen-awesome-fragment-wrapper {
+    position: relative;
+}
+
+doxygen-awesome-fragment-copy-button {
+    opacity: 0;
+    background: var(--fragment-background);
+    width: 28px;
+    height: 28px;
+    position: absolute;
+    right: calc(var(--spacing-large) - (var(--spacing-large) / 2.5));
+    top: calc(var(--spacing-large) - (var(--spacing-large) / 2.5));
+    border: 1px solid var(--fragment-foreground);
+    cursor: pointer;
+    border-radius: var(--border-radius-small);
+    display: flex;
+    justify-content: center;
+    align-items: center;
+}
+
+.doxygen-awesome-fragment-wrapper:hover doxygen-awesome-fragment-copy-button, doxygen-awesome-fragment-copy-button.success {
+    opacity: .28;
+}
+
+doxygen-awesome-fragment-copy-button:hover, doxygen-awesome-fragment-copy-button.success {
+    opacity: 1 !important;
+}
+
+doxygen-awesome-fragment-copy-button:active:not([class~=success]) svg {
+    transform: scale(.91);
+}
+
+doxygen-awesome-fragment-copy-button svg {
+    fill: var(--fragment-foreground);
+    width: 18px;
+    height: 18px;
+}
+
+doxygen-awesome-fragment-copy-button.success svg {
+    fill: rgb(14, 168, 14);
+}
+
+doxygen-awesome-fragment-copy-button.success {
+    border-color: rgb(14, 168, 14);
+}
+
+@media screen and (max-width: 767px) {
+    .textblock > .doxygen-awesome-fragment-wrapper > doxygen-awesome-fragment-copy-button,
+    .textblock li > .doxygen-awesome-fragment-wrapper > doxygen-awesome-fragment-copy-button,
+    .memdoc li > .doxygen-awesome-fragment-wrapper > doxygen-awesome-fragment-copy-button,
+    .memdoc > .doxygen-awesome-fragment-wrapper > doxygen-awesome-fragment-copy-button,
+    dl dd > .doxygen-awesome-fragment-wrapper > doxygen-awesome-fragment-copy-button {
+        right: 0;
+    }
+}
+
+/*
+ Optional paragraph link button
+*/
+
+a.anchorlink {
+    font-size: 90%;
+    margin-left: var(--spacing-small);
+    color: var(--page-foreground-color) !important;
+    text-decoration: none;
+    opacity: .15;
+    display: none;
+    transition: opacity .1s ease-in-out, color .1s ease-in-out;
+}
+
+a.anchorlink svg {
+    fill: var(--page-foreground-color);
+}
+
+h3 a.anchorlink svg, h4 a.anchorlink svg {
+    margin-bottom: -3px;
+    margin-top: -4px;
+}
+
+a.anchorlink:hover {
+    opacity: .45;
+}
+
+h2:hover a.anchorlink, h1:hover a.anchorlink, h3:hover a.anchorlink, h4:hover a.anchorlink  {
+    display: inline-block;
+}
diff --git a/docs/doxygen.mk b/docs/doxygen.mk
index 5cadd9dbb2..9f46a1e37b 100644
--- a/docs/doxygen.mk
+++ b/docs/doxygen.mk
@@ -1,4 +1,4 @@
-# Doxyfile 1.8.8
+# Doxyfile 1.9.7
 
 # This file describes the settings to be used by the documentation system
 # doxygen (www.doxygen.org) for a project.
@@ -12,16 +12,26 @@
 # For lists, items can also be appended using:
 # TAG += value [value, ...]
 # Values that contain spaces should be placed between quotes (\" \").
+#
+# Note:
+#
+# Use doxygen to compare the used configuration file with the template
+# configuration file:
+# doxygen -x [configFile]
+# Use doxygen to compare the used configuration file with the template
+# configuration file without replacing the environment variables or CMake type
+# replacement variables:
+# doxygen -x_noenv [configFile]
 
 #---------------------------------------------------------------------------
 # Project related configuration options
 #---------------------------------------------------------------------------
 
-# This tag specifies the encoding used for all characters in the config file
-# that follow. The default is UTF-8 which is also the encoding used for all text
-# before the first occurrence of this tag. Doxygen uses libiconv (or the iconv
-# built into libc) for the transcoding. See http://www.gnu.org/software/libiconv
-# for the list of possible encodings.
+# This tag specifies the encoding used for all characters in the configuration
+# file that follow. The default is UTF-8 which is also the encoding used for all
+# text before the first occurrence of this tag. Doxygen uses libiconv (or the
+# iconv built into libc) for the transcoding. See
+# https://www.gnu.org/software/libiconv/ for the list of possible encodings.
 # The default value is: UTF-8.
 
 DOXYFILE_ENCODING      = UTF-8
@@ -32,24 +42,24 @@ DOXYFILE_ENCODING      = UTF-8
 # title of most generated pages and in a few other places.
 # The default value is: My Project.
 
-PROJECT_NAME           = ""
+PROJECT_NAME           = ${PROJECT_NAME}
 
 # The PROJECT_NUMBER tag can be used to enter a project or revision number. This
 # could be handy for archiving the generated documentation or if some version
 # control system is used.
 
-PROJECT_NUMBER         = ""
+PROJECT_NUMBER         = ${AF_VERSION}
 
 # Using the PROJECT_BRIEF tag one can provide an optional one line description
 # for a project that appears at the top of each page and should give viewer a
 # quick idea about the purpose of the project. Keep the description short.
 
-PROJECT_BRIEF          = ""
+PROJECT_BRIEF          = "A high-performance general-purpose compute library"
 
-# With the PROJECT_LOGO tag one can specify an logo or icon that is included in
-# the documentation. The maximum height of the logo should not exceed 55 pixels
-# and the maximum width should not exceed 200 pixels. Doxygen will copy the logo
-# to the output directory.
+# With the PROJECT_LOGO tag one can specify a logo or an icon that is included
+# in the documentation. The maximum height of the logo should not exceed 55
+# pixels and the maximum width should not exceed 200 pixels. Doxygen will copy
+# the logo to the output directory.
 
 PROJECT_LOGO           = ${ASSETS_DIR}/arrayfire_logo.png
 
@@ -58,18 +68,30 @@ PROJECT_LOGO           = ${ASSETS_DIR}/arrayfire_logo.png
 # entered, it will be relative to the location where doxygen was started. If
 # left blank the current directory will be used.
 
-OUTPUT_DIRECTORY       = .
+OUTPUT_DIRECTORY       = ${CMAKE_CURRENT_BINARY_DIR}
 
-# If the CREATE_SUBDIRS tag is set to YES, then doxygen will create 4096 sub-
-# directories (in 2 levels) under the output directory of each output format and
-# will distribute the generated files over these directories. Enabling this
+# If the CREATE_SUBDIRS tag is set to YES then doxygen will create up to 4096
+# sub-directories (in 2 levels) under the output directory of each output format
+# and will distribute the generated files over these directories. Enabling this
 # option can be useful when feeding doxygen a huge amount of source files, where
 # putting all generated files in the same directory would otherwise causes
-# performance problems for the file system.
+# performance problems for the file system. Adapt CREATE_SUBDIRS_LEVEL to
+# control the number of sub-directories.
 # The default value is: NO.
 
 CREATE_SUBDIRS         = NO
 
+# Controls the number of sub-directories that will be created when
+# CREATE_SUBDIRS tag is set to YES. Level 0 represents 16 directories, and every
+# level increment doubles the number of directories, resulting in 4096
+# directories at level 8 which is the default and also the maximum value. The
+# sub-directories are organized in 2 levels, the first level always has a fixed
+# number of 16 directories.
+# Minimum value: 0, maximum value: 8, default value: 8.
+# This tag requires that the tag CREATE_SUBDIRS is set to YES.
+
+CREATE_SUBDIRS_LEVEL   = 8
+
 # If the ALLOW_UNICODE_NAMES tag is set to YES, doxygen will allow non-ASCII
 # characters to appear in the names of generated files. If set to NO, non-ASCII
 # characters will be escaped, for example _xE3_x81_x84 will be used for Unicode
@@ -81,26 +103,26 @@ ALLOW_UNICODE_NAMES    = NO
 # The OUTPUT_LANGUAGE tag is used to specify the language in which all
 # documentation generated by doxygen is written. Doxygen will use this
 # information to generate all constant output in the proper language.
-# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Catalan, Chinese,
-# Chinese-Traditional, Croatian, Czech, Danish, Dutch, English (United States),
-# Esperanto, Farsi (Persian), Finnish, French, German, Greek, Hungarian,
-# Indonesian, Italian, Japanese, Japanese-en (Japanese with English messages),
-# Korean, Korean-en (Korean with English messages), Latvian, Lithuanian,
-# Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese, Romanian, Russian,
-# Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish, Swedish, Turkish,
-# Ukrainian and Vietnamese.
+# Possible values are: Afrikaans, Arabic, Armenian, Brazilian, Bulgarian,
+# Catalan, Chinese, Chinese-Traditional, Croatian, Czech, Danish, Dutch, English
+# (United States), Esperanto, Farsi (Persian), Finnish, French, German, Greek,
+# Hindi, Hungarian, Indonesian, Italian, Japanese, Japanese-en (Japanese with
+# English messages), Korean, Korean-en (Korean with English messages), Latvian,
+# Lithuanian, Macedonian, Norwegian, Persian (Farsi), Polish, Portuguese,
+# Romanian, Russian, Serbian, Serbian-Cyrillic, Slovak, Slovene, Spanish,
+# Swedish, Turkish, Ukrainian and Vietnamese.
 # The default value is: English.
 
 OUTPUT_LANGUAGE        = English
 
-# If the BRIEF_MEMBER_DESC tag is set to YES doxygen will include brief member
+# If the BRIEF_MEMBER_DESC tag is set to YES, doxygen will include brief member
 # descriptions after the members that are listed in the file and class
 # documentation (similar to Javadoc). Set to NO to disable this.
 # The default value is: YES.
 
 BRIEF_MEMBER_DESC      = YES
 
-# If the REPEAT_BRIEF tag is set to YES doxygen will prepend the brief
+# If the REPEAT_BRIEF tag is set to YES, doxygen will prepend the brief
 # description of a member or function before the detailed description
 #
 # Note: If both HIDE_UNDOC_MEMBERS and BRIEF_MEMBER_DESC are set to NO, the
@@ -135,7 +157,7 @@ ALWAYS_DETAILED_SEC    = NO
 
 INLINE_INHERITED_MEMB  = NO
 
-# If the FULL_PATH_NAMES tag is set to YES doxygen will prepend the full path
+# If the FULL_PATH_NAMES tag is set to YES, doxygen will prepend the full path
 # before files name in the file list and in the header files. If set to NO the
 # shortest path that makes the file name unique will be used
 # The default value is: YES.
@@ -180,6 +202,16 @@ SHORT_NAMES            = NO
 
 JAVADOC_AUTOBRIEF      = YES
 
+# If the JAVADOC_BANNER tag is set to YES then doxygen will interpret a line
+# such as
+# /***************
+# as being the beginning of a Javadoc-style comment "banner". If set to NO, the
+# Javadoc-style will behave just like regular comments and it will not be
+# interpreted by doxygen.
+# The default value is: NO.
+
+JAVADOC_BANNER         = NO
+
 # If the QT_AUTOBRIEF tag is set to YES then doxygen will interpret the first
 # line (until the first dot) of a Qt-style comment as the brief description. If
 # set to NO, the Qt-style will behave just like regular Qt-style comments (thus
@@ -200,15 +232,23 @@ QT_AUTOBRIEF           = NO
 
 MULTILINE_CPP_IS_BRIEF = NO
 
+# By default Python docstrings are displayed as preformatted text and doxygen's
+# special commands cannot be used. By setting PYTHON_DOCSTRING to NO,
+# doxygen's special commands can be used and the contents of the docstring
+# documentation blocks are shown as doxygen documentation.
+# The default value is: YES.
+
+PYTHON_DOCSTRING       = YES
+
 # If the INHERIT_DOCS tag is set to YES then an undocumented member inherits the
 # documentation from any documented member that it re-implements.
 # The default value is: YES.
 
 INHERIT_DOCS           = YES
 
-# If the SEPARATE_MEMBER_PAGES tag is set to YES, then doxygen will produce a
-# new page for each member. If set to NO, the documentation of a member will be
-# part of the file/class/namespace that contains it.
+# If the SEPARATE_MEMBER_PAGES tag is set to YES then doxygen will produce a new
+# page for each member. If set to NO, the documentation of a member will be part
+# of the file/class/namespace that contains it.
 # The default value is: NO.
 
 SEPARATE_MEMBER_PAGES  = NO
@@ -217,17 +257,22 @@ SEPARATE_MEMBER_PAGES  = NO
 # uses this value to replace tabs by spaces in code fragments.
 # Minimum value: 1, maximum value: 16, default value: 4.
 
-TAB_SIZE               = 8
+TAB_SIZE               = 4
 
 # This tag can be used to specify a number of aliases that act as commands in
 # the documentation. An alias has the form:
 # name=value
 # For example adding
-# "sideeffect=@par Side Effects:\n"
+# "sideeffect=@par Side Effects:^^"
 # will allow you to put the command \sideeffect (or @sideeffect) in the
 # documentation, which will result in a user-defined paragraph with heading
-# "Side Effects:". You can put \n's in the value part of an alias to insert
-# newlines.
+# "Side Effects:". Note that you cannot put \n's in the value part of an alias
+# to insert newlines (in the resulting output). You can put ^^ in the value part
+# of an alias to insert a newline as if a physical newline was in the original
+# file. When you need a literal { or } or , in the value part of an alias you
+# have to escape them by means of a backslash (\). This can lead to conflicts
+# with the commands \{ and \}; for these it is advised to use the versions @{
+# and @} or a double escape (\\{ and \\}).
 
 ALIASES                = "support{1}=<DIV class=\"support\">\1</DIV>" \
                          "opencl=<IMG src=\"OpenCL.png\" alt=\"OpenCL Support\" />" \
@@ -239,18 +284,20 @@ ALIASES                = "support{1}=<DIV class=\"support\">\1</DIV>" \
                          "jit=<IMG src=\"jit.png\" alt=\"JIT Support\" />" \
                          "democode{1}=\htmlonly \n <pre class=\"cpp afw\"> <code class=\"cpp\"> \1 </code> </pre> \n \endhtmlonly" \
                          "imagegroup{1}=<DIV CLASS=\"image_group\"> \1 </DIV>" \
-                         "smallimage{2}=\htmlonly <div class=\"scaled\"><img src=\"\1\" alt=\"\2\"><div class=\"caption\">\2</div></div> \endhtmlonly" \
+                         "smallimage{2}=\htmlonly <div class=\"scaled\"><img src=\"\1\" alt=\"\2\"></div> \endhtmlonly" \
                          "funcgroups{3}=\ingroup \3 \n @{ \n \defgroup \1 \2 \n @{ \n" \
                          "funcgroups{4}=\ingroup \3 \4 \n @{ \n \defgroup \1 \2 \n @{ \n" \
                          "funcgroups{5}=\ingroup \3 \4 \5 \n @{ \n \defgroup \1 \2 \n @{ \n" \
                          "funcgroups{6}=\ingroup \3 \4 \5 \6 \n @{ \n \defgroup \1 \2 \n @{ \n" \
-                         "endfuncgroups=@} \n @}"
-
-# This tag can be used to specify a number of word-keyword mappings (TCL only).
-# A mapping has the form "name=value". For example adding "class=itcl::class"
-# will allow you to use the command class in the itcl::class meaning.
-
-TCL_SUBST              =
+                         "endfuncgroups=@} \n @}" \
+                         "PR{1}=[[#\1](https://github.com/arrayfire/arrayfire/pull/\1)]" \
+                         "dims{4}=\f$ [\1 \ \2 \ \3 \ \4] \f$" \
+                         "shape_eq{5}=\f$ \underset{[\2 \ \3 \ \4 \ \5]}{\1} \f$" \
+                         "shape_t{5}=\underset{[\2 \ \3 \ \4 \ \5]}{\1}" \
+                         "convolve_eq{2}=\f$ \1 \ast \2 \f$" \
+                         "convolve_t{2}=\1 \ast \2" \
+                         "set_eq{2}=\f$ \left\\{ \1 \ \Bigg\vert \ \2 \right\\} \f$" \
+                         "set_t{2}=\left\\\{ \1 \ \Bigg\vert \ \2 \right\\\}"
 
 # Set the OPTIMIZE_OUTPUT_FOR_C tag to YES if your project consists of C sources
 # only. Doxygen will then generate output that is more tailored for C. For
@@ -280,28 +327,40 @@ OPTIMIZE_FOR_FORTRAN   = NO
 
 OPTIMIZE_OUTPUT_VHDL   = NO
 
+# Set the OPTIMIZE_OUTPUT_SLICE tag to YES if your project consists of Slice
+# sources only. Doxygen will then generate output that is more tailored for that
+# language. For instance, namespaces will be presented as modules, types will be
+# separated into more groups, etc.
+# The default value is: NO.
+
+OPTIMIZE_OUTPUT_SLICE  = NO
+
 # Doxygen selects the parser to use depending on the extension of the files it
 # parses. With this tag you can assign which parser to use for a given
 # extension. Doxygen has a built-in mapping, but you can override or extend it
 # using this tag. The format is ext=language, where ext is a file extension, and
-# language is one of the parsers supported by doxygen: IDL, Java, Javascript,
-# C#, C, C++, D, PHP, Objective-C, Python, Fortran (fixed format Fortran:
-# FortranFixed, free formatted Fortran: FortranFree, unknown formatted Fortran:
-# Fortran. In the later case the parser tries to guess whether the code is fixed
-# or free formatted code, this is the default for Fortran type files), VHDL. For
-# instance to make doxygen treat .inc files as Fortran files (default is PHP),
-# and .f files as C (default is Fortran), use: inc=Fortran f=C.
+# language is one of the parsers supported by doxygen: IDL, Java, JavaScript,
+# Csharp (C#), C, C++, Lex, D, PHP, md (Markdown), Objective-C, Python, Slice,
+# VHDL, Fortran (fixed format Fortran: FortranFixed, free formatted Fortran:
+# FortranFree, unknown formatted Fortran: Fortran. In the latter case the parser
+# tries to guess whether the code is fixed or free formatted code, this is the
+# default for Fortran type files). For instance to make doxygen treat .inc files
+# as Fortran files (default is PHP), and .f files as C (default is Fortran),
+# use: inc=Fortran f=C.
 #
-# Note For files without extension you can use no_extension as a placeholder.
+# Note: For files without extension you can use no_extension as a placeholder.
 #
 # Note that for custom extensions you also need to set FILE_PATTERNS otherwise
-# the files are not read by doxygen.
+# the files are not read by doxygen. When specifying no_extension you should add
+# * to the FILE_PATTERNS.
+#
+# Note: See also the list of default file extension mappings.
 
 EXTENSION_MAPPING      =
 
 # If the MARKDOWN_SUPPORT tag is enabled then doxygen pre-processes all comments
 # according to the Markdown format, which allows for more readable
-# documentation. See http://daringfireball.net/projects/markdown/ for details.
+# documentation. See https://daringfireball.net/projects/markdown/ for details.
 # The output of markdown processing is further processed by doxygen, so you can
 # mix doxygen, HTML, and XML commands with Markdown formatting. Disable only in
 # case of backward compatibilities issues.
@@ -309,10 +368,30 @@ EXTENSION_MAPPING      =
 
 MARKDOWN_SUPPORT       = YES
 
+# When the TOC_INCLUDE_HEADINGS tag is set to a non-zero value, all headings up
+# to that level are automatically included in the table of contents, even if
+# they do not have an id attribute.
+# Note: This feature currently applies only to Markdown headings.
+# Minimum value: 0, maximum value: 99, default value: 5.
+# This tag requires that the tag MARKDOWN_SUPPORT is set to YES.
+
+TOC_INCLUDE_HEADINGS   = 0
+
+# The MARKDOWN_ID_STYLE tag can be used to specify the algorithm used to
+# generate identifiers for the Markdown headings. Note: Every identifier is
+# unique.
+# Possible values are: DOXYGEN Use a fixed 'autotoc_md' string followed by a
+# sequence number starting at 0. and GITHUB Use the lower case version of title
+# with any whitespace replaced by '-' and punctuation characters removed.
+# The default value is: DOXYGEN.
+# This tag requires that the tag MARKDOWN_SUPPORT is set to YES.
+
+MARKDOWN_ID_STYLE      = DOXYGEN
+
 # When enabled doxygen tries to link words that correspond to documented
 # classes, or namespaces to their corresponding documentation. Such a link can
-# be prevented in individual cases by by putting a % sign in front of the word
-# or globally by setting AUTOLINK_SUPPORT to NO.
+# be prevented in individual cases by putting a % sign in front of the word or
+# globally by setting AUTOLINK_SUPPORT to NO.
 # The default value is: YES.
 
 AUTOLINK_SUPPORT       = YES
@@ -334,7 +413,7 @@ BUILTIN_STL_SUPPORT    = NO
 CPP_CLI_SUPPORT        = NO
 
 # Set the SIP_SUPPORT tag to YES if your project consists of sip (see:
-# http://www.riverbankcomputing.co.uk/software/sip/intro) sources only. Doxygen
+# https://www.riverbankcomputing.com/software/sip/intro) sources only. Doxygen
 # will parse them like normal C++ but will assume all classes use public instead
 # of private inheritance when no explicit protection keyword is present.
 # The default value is: NO.
@@ -352,13 +431,20 @@ SIP_SUPPORT            = NO
 IDL_PROPERTY_SUPPORT   = YES
 
 # If member grouping is used in the documentation and the DISTRIBUTE_GROUP_DOC
-# tag is set to YES, then doxygen will reuse the documentation of the first
+# tag is set to YES then doxygen will reuse the documentation of the first
 # member in the group (if any) for the other members of the group. By default
 # all members of a group must be documented explicitly.
 # The default value is: NO.
 
 DISTRIBUTE_GROUP_DOC   = NO
 
+# If one adds a struct or class to a group and this option is enabled, then also
+# any nested class or struct is added to the same group. By default this option
+# is disabled and one has to add nested compounds explicitly via \ingroup.
+# The default value is: NO.
+
+GROUP_NESTED_COMPOUNDS = NO
+
 # Set the SUBGROUPING tag to YES to allow class member groups of the same type
 # (for instance a group of public functions) to be put as a subgroup of that
 # type (e.g. under the Public Functions section). Set it to NO to prevent
@@ -413,11 +499,32 @@ TYPEDEF_HIDES_STRUCT   = NO
 
 LOOKUP_CACHE_SIZE      = 0
 
+# The NUM_PROC_THREADS specifies the number of threads doxygen is allowed to use
+# during processing. When set to 0 doxygen will base this on the number of
+# cores available in the system. You can set it explicitly to a value larger
+# than 0 to get more control over the balance between CPU load and processing
+# speed. At this moment only the input processing can be done using multiple
+# threads. Since this is still an experimental feature the default is set to 1,
+# which effectively disables parallel processing. Please report any issues you
+# encounter. Generating dot graphs in parallel is controlled by the
+# DOT_NUM_THREADS setting.
+# Minimum value: 0, maximum value: 32, default value: 1.
+
+NUM_PROC_THREADS       = 0
+
+# If the TIMESTAMP tag is set to a value other than NO then each generated page
+# will contain the date or date and time when the page was generated. Setting
+# this to NO can help when comparing the output of multiple runs.
+# Possible values are: YES, NO, DATETIME and DATE.
+# The default value is: NO.
+
+TIMESTAMP              = YES
+
 #---------------------------------------------------------------------------
 # Build related configuration options
 #---------------------------------------------------------------------------
 
-# If the EXTRACT_ALL tag is set to YES doxygen will assume all entities in
+# If the EXTRACT_ALL tag is set to YES, doxygen will assume all entities in
 # documentation are documented, even if no documentation was available. Private
 # class members and static file members will be hidden unless the
 # EXTRACT_PRIVATE respectively EXTRACT_STATIC tags are set to YES.
@@ -427,35 +534,41 @@ LOOKUP_CACHE_SIZE      = 0
 
 EXTRACT_ALL            = YES
 
-# If the EXTRACT_PRIVATE tag is set to YES all private members of a class will
+# If the EXTRACT_PRIVATE tag is set to YES, all private members of a class will
 # be included in the documentation.
 # The default value is: NO.
 
 EXTRACT_PRIVATE        = NO
 
-# If the EXTRACT_PACKAGE tag is set to YES all members with package or internal
+# If the EXTRACT_PRIV_VIRTUAL tag is set to YES, documented private virtual
+# methods of a class will be included in the documentation.
+# The default value is: NO.
+
+EXTRACT_PRIV_VIRTUAL   = NO
+
+# If the EXTRACT_PACKAGE tag is set to YES, all members with package or internal
 # scope will be included in the documentation.
 # The default value is: NO.
 
 EXTRACT_PACKAGE        = NO
 
-# If the EXTRACT_STATIC tag is set to YES all static members of a file will be
+# If the EXTRACT_STATIC tag is set to YES, all static members of a file will be
 # included in the documentation.
 # The default value is: NO.
 
 EXTRACT_STATIC         = YES
 
-# If the EXTRACT_LOCAL_CLASSES tag is set to YES classes (and structs) defined
-# locally in source files will be included in the documentation. If set to NO
+# If the EXTRACT_LOCAL_CLASSES tag is set to YES, classes (and structs) defined
+# locally in source files will be included in the documentation. If set to NO,
 # only classes defined in header files are included. Does not have any effect
 # for Java sources.
 # The default value is: YES.
 
 EXTRACT_LOCAL_CLASSES  = YES
 
-# This flag is only useful for Objective-C code. When set to YES local methods,
+# This flag is only useful for Objective-C code. If set to YES, local methods,
 # which are defined in the implementation section but not in the interface are
-# included in the documentation. If set to NO only methods in the interface are
+# included in the documentation. If set to NO, only methods in the interface are
 # included.
 # The default value is: NO.
 
@@ -470,6 +583,13 @@ EXTRACT_LOCAL_METHODS  = NO
 
 EXTRACT_ANON_NSPACES   = NO
 
+# If this flag is set to YES, the name of an unnamed parameter in a declaration
+# will be determined by the corresponding definition. By default unnamed
+# parameters remain unnamed in the output.
+# The default value is: YES.
+
+RESOLVE_UNNAMED_PARAMS = YES
+
 # If the HIDE_UNDOC_MEMBERS tag is set to YES, doxygen will hide all
 # undocumented members inside documented classes or files. If set to NO these
 # members will be included in the various overviews, but no documentation
@@ -480,21 +600,22 @@ HIDE_UNDOC_MEMBERS     = NO
 
 # If the HIDE_UNDOC_CLASSES tag is set to YES, doxygen will hide all
 # undocumented classes that are normally visible in the class hierarchy. If set
-# to NO these classes will be included in the various overviews. This option has
-# no effect if EXTRACT_ALL is enabled.
+# to NO, these classes will be included in the various overviews. This option
+# will also hide undocumented C++ concepts if enabled. This option has no effect
+# if EXTRACT_ALL is enabled.
 # The default value is: NO.
 
 HIDE_UNDOC_CLASSES     = NO
 
 # If the HIDE_FRIEND_COMPOUNDS tag is set to YES, doxygen will hide all friend
-# (class|struct|union) declarations. If set to NO these declarations will be
-# included in the documentation.
+# declarations. If set to NO, these declarations will be included in the
+# documentation.
 # The default value is: NO.
 
 HIDE_FRIEND_COMPOUNDS  = NO
 
 # If the HIDE_IN_BODY_DOCS tag is set to YES, doxygen will hide any
-# documentation blocks found inside the body of a function. If set to NO these
+# documentation blocks found inside the body of a function. If set to NO, these
 # blocks will be appended to the function's detailed documentation block.
 # The default value is: NO.
 
@@ -507,22 +628,43 @@ HIDE_IN_BODY_DOCS      = NO
 
 INTERNAL_DOCS          = NO
 
-# If the CASE_SENSE_NAMES tag is set to NO then doxygen will only generate file
-# names in lower-case letters. If set to YES upper-case letters are also
-# allowed. This is useful if you have classes or files whose names only differ
-# in case and if your file system supports case sensitive file names. Windows
-# and Mac users are advised to set this option to NO.
-# The default value is: system dependent.
+# With the correct setting of option CASE_SENSE_NAMES doxygen will be better
+# able to match the capabilities of the underlying filesystem. In case the
+# filesystem is case sensitive (i.e. it supports files in the same directory
+# whose names only differ in casing), the option must be set to YES to properly
+# deal with such files in case they appear in the input. For filesystems that
+# are not case sensitive the option should be set to NO to properly deal with
+# output files written for symbols that only differ in casing, such as for two
+# classes, one named CLASS and the other named Class, and to also support
+# references to files without having to specify the exact matching casing. On
+# Windows (including Cygwin) and MacOS, users should typically set this option
+# to NO, whereas on Linux or other Unix flavors it should typically be set to
+# YES.
+# Possible values are: SYSTEM, NO and YES.
+# The default value is: SYSTEM.
 
 CASE_SENSE_NAMES       = YES
 
 # If the HIDE_SCOPE_NAMES tag is set to NO then doxygen will show members with
-# their full class and namespace scopes in the documentation. If set to YES the
+# their full class and namespace scopes in the documentation. If set to YES, the
 # scope will be hidden.
 # The default value is: NO.
 
 HIDE_SCOPE_NAMES       = YES
 
+# If the HIDE_COMPOUND_REFERENCE tag is set to NO (default) then doxygen will
+# append additional text to a page's title, such as Class Reference. If set to
+# YES the compound reference will be hidden.
+# The default value is: NO.
+
+HIDE_COMPOUND_REFERENCE= NO
+
+# If the SHOW_HEADERFILE tag is set to YES then the documentation for a class
+# will show which file needs to be included to use the class.
+# The default value is: YES.
+
+SHOW_HEADERFILE        = YES
+
 # If the SHOW_INCLUDE_FILES tag is set to YES then doxygen will put a list of
 # the files that are included by a file in the documentation of that file.
 # The default value is: YES.
@@ -550,14 +692,14 @@ INLINE_INFO            = YES
 
 # If the SORT_MEMBER_DOCS tag is set to YES then doxygen will sort the
 # (detailed) documentation of file and class members alphabetically by member
-# name. If set to NO the members will appear in declaration order.
+# name. If set to NO, the members will appear in declaration order.
 # The default value is: YES.
 
 SORT_MEMBER_DOCS       = YES
 
 # If the SORT_BRIEF_DOCS tag is set to YES then doxygen will sort the brief
 # descriptions of file, namespace and class members alphabetically by member
-# name. If set to NO the members will appear in declaration order. Note that
+# name. If set to NO, the members will appear in declaration order. Note that
 # this will also influence the order of the classes in the class list.
 # The default value is: NO.
 
@@ -602,27 +744,25 @@ SORT_BY_SCOPE_NAME     = NO
 
 STRICT_PROTO_MATCHING  = NO
 
-# The GENERATE_TODOLIST tag can be used to enable ( YES) or disable ( NO) the
-# todo list. This list is created by putting \todo commands in the
-# documentation.
+# The GENERATE_TODOLIST tag can be used to enable (YES) or disable (NO) the todo
+# list. This list is created by putting \todo commands in the documentation.
 # The default value is: YES.
 
 GENERATE_TODOLIST      = NO
 
-# The GENERATE_TESTLIST tag can be used to enable ( YES) or disable ( NO) the
-# test list. This list is created by putting \test commands in the
-# documentation.
+# The GENERATE_TESTLIST tag can be used to enable (YES) or disable (NO) the test
+# list. This list is created by putting \test commands in the documentation.
 # The default value is: YES.
 
 GENERATE_TESTLIST      = NO
 
-# The GENERATE_BUGLIST tag can be used to enable ( YES) or disable ( NO) the bug
+# The GENERATE_BUGLIST tag can be used to enable (YES) or disable (NO) the bug
 # list. This list is created by putting \bug commands in the documentation.
 # The default value is: YES.
 
 GENERATE_BUGLIST       = NO
 
-# The GENERATE_DEPRECATEDLIST tag can be used to enable ( YES) or disable ( NO)
+# The GENERATE_DEPRECATEDLIST tag can be used to enable (YES) or disable (NO)
 # the deprecated list. This list is created by putting \deprecated commands in
 # the documentation.
 # The default value is: YES.
@@ -647,8 +787,8 @@ ENABLED_SECTIONS       =
 MAX_INITIALIZER_LINES  = 30
 
 # Set the SHOW_USED_FILES tag to NO to disable the list of files generated at
-# the bottom of the documentation of classes and structs. If set to YES the list
-# will mention the files that were used to generate the documentation.
+# the bottom of the documentation of classes and structs. If set to YES, the
+# list will mention the files that were used to generate the documentation.
 # The default value is: YES.
 
 SHOW_USED_FILES        = YES
@@ -682,7 +822,8 @@ FILE_VERSION_FILTER    = "/bin/sh -c 'git log --pretty=\"format:%ci, (build %h)\
 # output files in an output format independent way. To create the layout file
 # that represents doxygen's defaults, run doxygen with the -l option. You can
 # optionally specify a file name after the option, if omitted DoxygenLayout.xml
-# will be used as the name of the layout file.
+# will be used as the name of the layout file. See also section "Changing the
+# layout of pages" for information.
 #
 # Note that if you run doxygen from a directory containing a file called
 # DoxygenLayout.xml, doxygen will parse it automatically even if the LAYOUT_FILE
@@ -693,7 +834,7 @@ LAYOUT_FILE            = ${DOCS_DIR}/layout.xml
 # The CITE_BIB_FILES tag can be used to specify one or more bib files containing
 # the reference definitions. This must be a list of .bib files. The .bib
 # extension is automatically appended if omitted. This requires the bibtex tool
-# to be installed. See also http://en.wikipedia.org/wiki/BibTeX for more info.
+# to be installed. See also https://en.wikipedia.org/wiki/BibTeX for more info.
 # For LaTeX the style of the bibliography can be controlled using
 # LATEX_BIB_STYLE. To use this feature you need bibtex and perl available in the
 # search path. See also \cite for info how to create references.
@@ -712,7 +853,7 @@ CITE_BIB_FILES         =
 QUIET                  = YES
 
 # The WARNINGS tag can be used to turn on/off the warning messages that are
-# generated to standard error ( stderr) by doxygen. If WARNINGS is set to YES
+# generated to standard error (stderr) by doxygen. If WARNINGS is set to YES
 # this implies that the warnings are on.
 #
 # Tip: Turn warnings on while writing the documentation.
@@ -720,7 +861,7 @@ QUIET                  = YES
 
 WARNINGS               = YES
 
-# If the WARN_IF_UNDOCUMENTED tag is set to YES, then doxygen will generate
+# If the WARN_IF_UNDOCUMENTED tag is set to YES then doxygen will generate
 # warnings for undocumented members. If EXTRACT_ALL is set to YES then this flag
 # will automatically be disabled.
 # The default value is: YES.
@@ -728,20 +869,53 @@ WARNINGS               = YES
 WARN_IF_UNDOCUMENTED   = YES
 
 # If the WARN_IF_DOC_ERROR tag is set to YES, doxygen will generate warnings for
-# potential errors in the documentation, such as not documenting some parameters
-# in a documented function, or documenting parameters that don't exist or using
-# markup commands wrongly.
+# potential errors in the documentation, such as documenting some parameters in
+# a documented function twice, or documenting parameters that don't exist or
+# using markup commands wrongly.
 # The default value is: YES.
 
 WARN_IF_DOC_ERROR      = YES
 
+# If WARN_IF_INCOMPLETE_DOC is set to YES, doxygen will warn about incomplete
+# function parameter documentation. If set to NO, doxygen will accept that some
+# parameters have no documentation without warning.
+# The default value is: YES.
+
+WARN_IF_INCOMPLETE_DOC = YES
+
 # This WARN_NO_PARAMDOC option can be enabled to get warnings for functions that
 # are documented, but have no documentation for their parameters or return
-# value. If set to NO doxygen will only warn about wrong or incomplete parameter
-# documentation, but not about the absence of documentation.
+# value. If set to NO, doxygen will only warn about wrong parameter
+# documentation, but not about the absence of documentation. If EXTRACT_ALL is
+# set to YES then this flag will automatically be disabled. See also
+# WARN_IF_INCOMPLETE_DOC.
+# The default value is: NO.
+
+WARN_NO_PARAMDOC       = YES
+
+# If WARN_IF_UNDOC_ENUM_VAL option is set to YES, doxygen will warn about
+# undocumented enumeration values. If set to NO, doxygen will accept
+# undocumented enumeration values. If EXTRACT_ALL is set to YES then this flag
+# will automatically be disabled.
 # The default value is: NO.
 
-WARN_NO_PARAMDOC       = NO
+WARN_IF_UNDOC_ENUM_VAL = NO
+
+# If the WARN_AS_ERROR tag is set to YES then doxygen will immediately stop when
+# a warning is encountered. If the WARN_AS_ERROR tag is set to FAIL_ON_WARNINGS
+# then doxygen will continue running as if WARN_AS_ERROR tag is set to NO, but
+# at the end of the doxygen process doxygen will return with a non-zero status.
+# If the WARN_AS_ERROR tag is set to FAIL_ON_WARNINGS_PRINT then doxygen behaves
+# like FAIL_ON_WARNINGS, but if no WARN_LOGFILE is defined doxygen will not
+# write the warning messages in between other messages but write them at the end
+# of a run. If a WARN_LOGFILE is defined, the warning messages will be written
+# to that file and also shown at the end of a run, unless the WARN_LOGFILE is
+# defined as - (i.e. standard output, stdout); in that case the behavior will
+# remain as with the setting FAIL_ON_WARNINGS.
+# Possible values are: NO, YES, FAIL_ON_WARNINGS and FAIL_ON_WARNINGS_PRINT.
+# The default value is: NO.
+
+WARN_AS_ERROR          = NO
 
 # The WARN_FORMAT tag determines the format of the warning messages that doxygen
 # can produce. The string should contain the $file, $line, and $text tags, which
@@ -749,13 +923,27 @@ WARN_NO_PARAMDOC       = NO
 # and the warning text. Optionally the format may contain $version, which will
 # be replaced by the version of the file (if it could be obtained via
 # FILE_VERSION_FILTER)
+# See also: WARN_LINE_FORMAT
 # The default value is: $file:$line: $text.
 
 WARN_FORMAT            = "$file:$line: $text"
 
+# In the $text part of the WARN_FORMAT command it is possible that a reference
+# to a more specific place is given. To make it easier to jump to this place
+# (outside of doxygen) the user can define a custom "cut" / "paste" string.
+# Example:
+# WARN_LINE_FORMAT = "'vi $file +$line'"
+# See also: WARN_FORMAT
+# The default value is: at line $line of file $file.
+
+WARN_LINE_FORMAT       = "at line $line of file $file"
+
 # The WARN_LOGFILE tag can be used to specify a file to which warning and error
 # messages should be written. If left blank the output is written to standard
-# error (stderr).
+# error (stderr). In case the file specified cannot be opened for writing the
+# warning and error messages are written to standard error. When - is specified
+# as the file, the warning and error messages are written to standard output
+# (stdout).
 
 WARN_LOGFILE           =
 
@@ -766,31 +954,51 @@ WARN_LOGFILE           =
 # The INPUT tag is used to specify the files and/or directories that contain
 # documented source files. You may enter file names like myfile.cpp or
 # directories like /usr/src/myproject. Separate the files or directories with
-# spaces.
+# spaces. See also FILE_PATTERNS and EXTENSION_MAPPING.
 # Note: If this tag is empty the current directory is searched.
 
 INPUT                  = ${DOCS_DIR}/pages \
-						 ${INCLUDE_DIR}/ \
+                         ${INCLUDE_DIR}/ \
                          ${INCLUDE_DIR}/af/ \
-						 ${DOCS_DIR}/details
+                         ${DOCS_DIR}/details
 
 # This tag can be used to specify the character encoding of the source files
 # that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
 # libiconv (or the iconv built into libc) for the transcoding. See the libiconv
-# documentation (see: http://www.gnu.org/software/libiconv) for the list of
-# possible encodings.
+# documentation (see:
+# https://www.gnu.org/software/libiconv/) for the list of possible encodings.
+# See also: INPUT_FILE_ENCODING
 # The default value is: UTF-8.
 
 INPUT_ENCODING         = UTF-8
 
+# This tag can be used to specify the character encoding of the source files
+# that doxygen parses. The INPUT_FILE_ENCODING tag can be used to specify
+# character encoding on a per file pattern basis. Doxygen will compare the file
+# name with each pattern and apply the encoding (instead of the default
+# INPUT_ENCODING) if there is a match. The character encodings are a list of the
+# form: pattern=encoding (like *.php=ISO-8859-1). See INPUT_ENCODING for further
+# information on supported encodings.
+
+INPUT_FILE_ENCODING    =
+
 # If the value of the INPUT tag contains directories, you can use the
 # FILE_PATTERNS tag to specify one or more wildcard patterns (like *.cpp and
-# *.h) to filter out the source-files in the directories. If left blank the
-# following patterns are tested:*.c, *.cc, *.cxx, *.cpp, *.c++, *.java, *.ii,
-# *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h, *.hh, *.hxx, *.hpp,
-# *.h++, *.cs, *.d, *.php, *.php4, *.php5, *.phtml, *.inc, *.m, *.markdown,
-# *.md, *.mm, *.dox, *.py, *.f90, *.f, *.for, *.tcl, *.vhd, *.vhdl, *.ucf,
-# *.qsf, *.as and *.js.
+# *.h) to filter out the source-files in the directories.
+#
+# Note that for custom extensions or not directly supported extensions you also
+# need to set EXTENSION_MAPPING for the extension otherwise the files are not
+# read by doxygen.
+#
+# Note the list of default checked file patterns might differ from the list of
+# default file extension mappings.
+#
+# If left blank the following patterns are tested: *.c, *.cc, *.cxx, *.cpp,
+# *.c++, *.java, *.ii, *.ixx, *.ipp, *.i++, *.inl, *.idl, *.ddl, *.odl, *.h,
+# *.hh, *.hxx, *.hpp, *.h++, *.l, *.cs, *.d, *.php, *.php4, *.php5, *.phtml,
+# *.inc, *.m, *.markdown, *.md, *.mm, *.dox (to be provided as doxygen C
+# comment), *.py, *.pyw, *.f90, *.f95, *.f03, *.f08, *.f18, *.f, *.for, *.vhd,
+# *.vhdl, *.ucf, *.qsf and *.ice.
 
 FILE_PATTERNS          =
 
@@ -829,19 +1037,16 @@ EXCLUDE_PATTERNS       = *.cpp
 # (namespaces, classes, functions, etc.) that should be excluded from the
 # output. The symbol name can be a fully qualified name, a word, or if the
 # wildcard * is used, a substring. Examples: ANamespace, AClass,
-# AClass::ANamespace, ANamespace::*Test
-#
-# Note that the wildcards are matched against the file with absolute path, so to
-# exclude all test directories use the pattern */test/*
+# ANamespace::AClass, ANamespace::*Test
 
-EXCLUDE_SYMBOLS        = APPROX
+EXCLUDE_SYMBOLS        =
 
 # The EXAMPLE_PATH tag can be used to specify one or more files or directories
 # that contain example code fragments that are included (see the \include
 # command).
 
 EXAMPLE_PATH           = ${EXAMPLES_DIR}/ \
-			${SNIPPETS_DIR}
+                         ${SNIPPETS_DIR}
 
 # If the value of the EXAMPLE_PATH tag contains directories, you can use the
 # EXAMPLE_PATTERNS tag to specify one or more wildcard pattern (like *.cpp and
@@ -849,6 +1054,7 @@ EXAMPLE_PATH           = ${EXAMPLES_DIR}/ \
 # files are included.
 
 EXAMPLE_PATTERNS       = *.cpp \
+                         *.hpp \
                          *.cu
 
 # If the EXAMPLE_RECURSIVE tag is set to YES then subdirectories will be
@@ -862,7 +1068,8 @@ EXAMPLE_RECURSIVE      = YES
 # that contain images that are to be included in the documentation (see the
 # \image command).
 
-IMAGE_PATH             = ${ASSETS_DIR}
+IMAGE_PATH             = ${ASSETS_DIR} \
+                         ${ASSETS_DIR}/conv_docs_images
 
 # The INPUT_FILTER tag can be used to specify a program that doxygen should
 # invoke to filter for each input file. Doxygen will invoke the filter program
@@ -878,6 +1085,15 @@ IMAGE_PATH             = ${ASSETS_DIR}
 # Note that the filter must not add or remove lines; it is applied before the
 # code is scanned, but not when the output code is generated. If lines are added
 # or removed, the anchors will not be placed correctly.
+#
+# Note that doxygen will use the data processed and written to standard output
+# for further processing, therefore nothing else, like debug statements or used
+# commands (so in case of a Windows batch file always use @echo OFF), should be
+# written to standard output.
+#
+# Note that for custom extensions or not directly supported extensions you also
+# need to set EXTENSION_MAPPING for the extension otherwise the files are not
+# properly processed by doxygen.
 
 INPUT_FILTER           =
 
@@ -887,11 +1103,15 @@ INPUT_FILTER           =
 # (like *.cpp=my_cpp_filter). See INPUT_FILTER for further information on how
 # filters are used. If the FILTER_PATTERNS tag is empty or if none of the
 # patterns match the file name, INPUT_FILTER is applied.
+#
+# Note that for custom extensions or not directly supported extensions you also
+# need to set EXTENSION_MAPPING for the extension otherwise the files are not
+# properly processed by doxygen.
 
 FILTER_PATTERNS        =
 
 # If the FILTER_SOURCE_FILES tag is set to YES, the input filter (if set using
-# INPUT_FILTER ) will also be used to filter the input files that are used for
+# INPUT_FILTER) will also be used to filter the input files that are used for
 # producing the source files to browse (i.e. when SOURCE_BROWSER is set to YES).
 # The default value is: NO.
 
@@ -912,6 +1132,15 @@ FILTER_SOURCE_PATTERNS =
 
 USE_MDFILE_AS_MAINPAGE = ${DOCS_DIR}/pages/README.md
 
+# The Fortran standard specifies that for fixed formatted Fortran code all
+# characters from position 72 are to be considered as comment. A common
+# extension is to allow longer lines before the automatic comment starts. The
+# setting FORTRAN_COMMENT_AFTER will also make it possible that longer lines can
+# be processed before the automatic comment starts.
+# Minimum value: 7, maximum value: 10000, default value: 72.
+
+FORTRAN_COMMENT_AFTER  = 72
+
 #---------------------------------------------------------------------------
 # Configuration options related to source browsing
 #---------------------------------------------------------------------------
@@ -923,13 +1152,13 @@ USE_MDFILE_AS_MAINPAGE = ${DOCS_DIR}/pages/README.md
 # also VERBATIM_HEADERS is set to NO.
 # The default value is: NO.
 
-SOURCE_BROWSER         = NO
+SOURCE_BROWSER         = YES
 
 # Setting the INLINE_SOURCES tag to YES will include the body of functions,
 # classes and enums directly into the documentation.
 # The default value is: NO.
 
-INLINE_SOURCES         = NO
+INLINE_SOURCES         = YES
 
 # Setting the STRIP_CODE_COMMENTS tag to YES will instruct doxygen to hide any
 # special comment blocks from generated source code fragments. Normal C, C++ and
@@ -939,7 +1168,7 @@ INLINE_SOURCES         = NO
 STRIP_CODE_COMMENTS    = YES
 
 # If the REFERENCED_BY_RELATION tag is set to YES then for each documented
-# function all documented functions referencing it will be listed.
+# entity all documented functions referencing it will be listed.
 # The default value is: NO.
 
 REFERENCED_BY_RELATION = NO
@@ -951,7 +1180,7 @@ REFERENCED_BY_RELATION = NO
 REFERENCES_RELATION    = NO
 
 # If the REFERENCES_LINK_SOURCE tag is set to YES and SOURCE_BROWSER tag is set
-# to YES, then the hyperlinks from functions in REFERENCES_RELATION and
+# to YES then the hyperlinks from functions in REFERENCES_RELATION and
 # REFERENCED_BY_RELATION lists will link to the source code. Otherwise they will
 # link to the documentation.
 # The default value is: YES.
@@ -971,12 +1200,12 @@ SOURCE_TOOLTIPS        = YES
 # If the USE_HTAGS tag is set to YES then the references to source code will
 # point to the HTML generated by the htags(1) tool instead of doxygen built-in
 # source browser. The htags tool is part of GNU's global source tagging system
-# (see http://www.gnu.org/software/global/global.html). You will need version
+# (see https://www.gnu.org/software/global/global.html). You will need version
 # 4.8.6 or higher.
 #
 # To use it do the following:
 # - Install the latest version of global
-# - Enable SOURCE_BROWSER and USE_HTAGS in the config file
+# - Enable SOURCE_BROWSER and USE_HTAGS in the configuration file
 # - Make sure the INPUT points to the root of the source tree
 # - Run doxygen as normal
 #
@@ -998,6 +1227,46 @@ USE_HTAGS              = NO
 
 VERBATIM_HEADERS       = YES
 
+# If the CLANG_ASSISTED_PARSING tag is set to YES then doxygen will use the
+# clang parser (see:
+# http://clang.llvm.org/) for more accurate parsing at the cost of reduced
+# performance. This can be particularly helpful with template rich C++ code for
+# which doxygen's built-in parser lacks the necessary type information.
+# Note: The availability of this option depends on whether or not doxygen was
+# generated with the -Duse_libclang=ON option for CMake.
+# The default value is: NO.
+
+CLANG_ASSISTED_PARSING = NO
+
+# If the CLANG_ASSISTED_PARSING tag is set to YES and the CLANG_ADD_INC_PATHS
+# tag is set to YES then doxygen will add the directory of each input to the
+# include path.
+# The default value is: YES.
+# This tag requires that the tag CLANG_ASSISTED_PARSING is set to YES.
+
+CLANG_ADD_INC_PATHS    = YES
+
+# If clang assisted parsing is enabled you can provide the compiler with command
+# line options that you would normally use when invoking the compiler. Note that
+# the include paths will already be set by doxygen for the files and directories
+# specified with INPUT and INCLUDE_PATH.
+# This tag requires that the tag CLANG_ASSISTED_PARSING is set to YES.
+
+CLANG_OPTIONS          =
+
+# If clang assisted parsing is enabled you can provide the clang parser with the
+# path to the directory containing a file called compile_commands.json. This
+# file is the compilation database (see:
+# http://clang.llvm.org/docs/HowToSetupToolingForLLVM.html) containing the
+# options used when the source files were built. This is equivalent to
+# specifying the -p option to a clang tool, such as clang-check. These options
+# will then be passed to the parser. Any options specified with CLANG_OPTIONS
+# will be added as well.
+# Note: The availability of this option depends on whether or not doxygen was
+# generated with the -Duse_libclang=ON option for CMake.
+
+CLANG_DATABASE_PATH    =
+
 #---------------------------------------------------------------------------
 # Configuration options related to the alphabetical class index
 #---------------------------------------------------------------------------
@@ -1009,26 +1278,20 @@ VERBATIM_HEADERS       = YES
 
 ALPHABETICAL_INDEX     = YES
 
-# The COLS_IN_ALPHA_INDEX tag can be used to specify the number of columns in
-# which the alphabetical index list will be split.
-# Minimum value: 1, maximum value: 20, default value: 5.
-# This tag requires that the tag ALPHABETICAL_INDEX is set to YES.
-
-COLS_IN_ALPHA_INDEX    = 5
-
-# In case all classes in a project start with a common prefix, all classes will
-# be put under the same header in the alphabetical index. The IGNORE_PREFIX tag
-# can be used to specify a prefix (or a list of prefixes) that should be ignored
-# while generating the index headers.
+# The IGNORE_PREFIX tag can be used to specify a prefix (or a list of prefixes)
+# that should be ignored while generating the index headers. The IGNORE_PREFIX
+# tag works for class, function and member names. The entity will be placed in
+# the alphabetical list under the first letter of the entity name that remains
+# after removing the prefix.
 # This tag requires that the tag ALPHABETICAL_INDEX is set to YES.
 
-IGNORE_PREFIX          =
+IGNORE_PREFIX          = af_
 
 #---------------------------------------------------------------------------
 # Configuration options related to the HTML output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_HTML tag is set to YES doxygen will generate HTML output
+# If the GENERATE_HTML tag is set to YES, doxygen will generate HTML output.
 # The default value is: YES.
 
 GENERATE_HTML          = YES
@@ -1039,7 +1302,7 @@ GENERATE_HTML          = YES
 # The default directory is: html.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-HTML_OUTPUT            = .
+HTML_OUTPUT            = html
 
 # The HTML_FILE_EXTENSION tag can be used to specify the file extension for each
 # generated HTML page (for example: .htm, .php, .asp).
@@ -1094,14 +1357,21 @@ HTML_STYLESHEET        =
 # cascading style sheets that are included after the standard style sheets
 # created by doxygen. Using this option one can overrule certain style aspects.
 # This is preferred over using HTML_STYLESHEET since it does not replace the
-# standard style sheet and is therefor more robust against future updates.
+# standard style sheet and is therefore more robust against future updates.
 # Doxygen will copy the style sheet files to the output directory.
-# Note: The order of the extra stylesheet files is of importance (e.g. the last
-# stylesheet in the list overrules the setting of the previous ones in the
-# list). For an example see the documentation.
+# Note: The order of the extra style sheet files is of importance (e.g. the last
+# style sheet in the list overrules the setting of the previous ones in the
+# list).
+# Note: Since the styling of scrollbars can currently not be overruled in
+# Webkit/Chromium, the styling will be left out of the default doxygen.css if
+# one or more extra stylesheets have been specified. So if scrollbar
+# customization is desired it has to be added explicitly. For an example see the
+# documentation.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-HTML_EXTRA_STYLESHEET  = ${DOCS_DIR}/arrayfire.css
+HTML_EXTRA_STYLESHEET  = ${DOCS_DIR}/arrayfire.css \
+                         ${DOCS_DIR}/doxygen-awesome.css \
+                         ${DOCS_DIR}/doxygen-awesome-sidebar-only.css
 
 # The HTML_EXTRA_FILES tag can be used to specify one or more extra images or
 # other source files which should be copied to the HTML output directory. Note
@@ -1111,13 +1381,27 @@ HTML_EXTRA_STYLESHEET  = ${DOCS_DIR}/arrayfire.css
 # files will be copied as-is; there are no commands or markers available.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-HTML_EXTRA_FILES       = ${DOCS_DIR}/highlight.pack.js \
-                         ${DOCS_DIR}/highlight_js_doxygen.css
+HTML_EXTRA_FILES       = ${DOCS_DIR}/doxygen-awesome-darkmode-toggle.js \
+                         ${DOCS_DIR}/doxygen-awesome-fragment-copy-button.js \
+                         ${DOCS_DIR}/doxygen-awesome-interactive-toc.js
+
+# The HTML_COLORSTYLE tag can be used to specify if the generated HTML output
+# should be rendered with a dark or light theme.
+# Possible values are: LIGHT always generate light mode output, DARK always
+# generate dark mode output, AUTO_LIGHT automatically set the mode according to
+# the user preference, use light mode if no preference is set (the default),
+# AUTO_DARK automatically set the mode according to the user preference, use
+# dark mode if no preference is set and TOGGLE allow the user to switch between
+# light and dark mode via a button.
+# The default value is: AUTO_LIGHT.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+HTML_COLORSTYLE        = LIGHT
 
 # The HTML_COLORSTYLE_HUE tag controls the color of the HTML output. Doxygen
-# will adjust the colors in the stylesheet and background images according to
-# this color. Hue is specified as an angle on a colorwheel, see
-# http://en.wikipedia.org/wiki/Hue for more information. For instance the value
+# will adjust the colors in the style sheet and background images according to
+# this color. Hue is specified as an angle on a color-wheel, see
+# https://en.wikipedia.org/wiki/Hue for more information. For instance the value
 # 0 represents red, 60 is yellow, 120 is green, 180 is cyan, 240 is blue, 300
 # purple, and 360 is red again.
 # Minimum value: 0, maximum value: 359, default value: 220.
@@ -1126,7 +1410,7 @@ HTML_EXTRA_FILES       = ${DOCS_DIR}/highlight.pack.js \
 HTML_COLORSTYLE_HUE    = 19
 
 # The HTML_COLORSTYLE_SAT tag controls the purity (or saturation) of the colors
-# in the HTML output. For a value of 0 the output will use grayscales only. A
+# in the HTML output. For a value of 0 the output will use gray-scales only. A
 # value of 255 will produce the most vivid colors.
 # Minimum value: 0, maximum value: 255, default value: 100.
 # This tag requires that the tag GENERATE_HTML is set to YES.
@@ -1144,13 +1428,16 @@ HTML_COLORSTYLE_SAT    = 219
 
 HTML_COLORSTYLE_GAMMA  = 70
 
-# If the HTML_TIMESTAMP tag is set to YES then the footer of each generated HTML
-# page will contain the date and time when the page was generated. Setting this
-# to NO can help when comparing the output of multiple runs.
+# If the HTML_DYNAMIC_MENUS tag is set to YES then the generated HTML
+# documentation will contain a main index with vertical navigation menus that
+# are dynamically created via JavaScript. If disabled, the navigation index will
+# consist of multiple levels of tabs that are statically embedded in every HTML
+# page. Disable this option to support browsers that do not have JavaScript,
+# like the Qt help browser.
 # The default value is: YES.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-HTML_TIMESTAMP         = YES
+HTML_DYNAMIC_MENUS     = NO
 
 # If the HTML_DYNAMIC_SECTIONS tag is set to YES then the generated HTML
 # documentation will contain sections that can be hidden and shown after the
@@ -1158,7 +1445,7 @@ HTML_TIMESTAMP         = YES
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-HTML_DYNAMIC_SECTIONS  = YES
+HTML_DYNAMIC_SECTIONS  = NO
 
 # With HTML_INDEX_NUM_ENTRIES one can control the preferred number of entries
 # shown in the various tree structured indices initially; the user can expand
@@ -1175,13 +1462,14 @@ HTML_INDEX_NUM_ENTRIES = 100
 
 # If the GENERATE_DOCSET tag is set to YES, additional index files will be
 # generated that can be used as input for Apple's Xcode 3 integrated development
-# environment (see: http://developer.apple.com/tools/xcode/), introduced with
-# OSX 10.5 (Leopard). To create a documentation set, doxygen will generate a
-# Makefile in the HTML output directory. Running make will produce the docset in
-# that directory and running make install will install the docset in
+# environment (see:
+# https://developer.apple.com/xcode/), introduced with OSX 10.5 (Leopard). To
+# create a documentation set, doxygen will generate a Makefile in the HTML
+# output directory. Running make will produce the docset in that directory and
+# running make install will install the docset in
 # ~/Library/Developer/Shared/Documentation/DocSets so that Xcode will find it at
-# startup. See http://developer.apple.com/tools/creatingdocsetswithdoxygen.html
-# for more information.
+# startup. See https://developer.apple.com/library/archive/featuredarticles/Doxy
+# genXcode/_index.html for more information.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
@@ -1195,6 +1483,13 @@ GENERATE_DOCSET        = NO
 
 DOCSET_FEEDNAME        = "Doxygen generated docs"
 
+# This tag determines the URL of the docset feed. A documentation feed provides
+# an umbrella under which multiple documentation sets from a single provider
+# (such as a company or product suite) can be grouped.
+# This tag requires that the tag GENERATE_DOCSET is set to YES.
+
+DOCSET_FEEDURL         =
+
 # This tag specifies a string that should uniquely identify the documentation
 # set bundle. This should be a reverse domain-name style string, e.g.
 # com.mycompany.MyDocSet. Doxygen will append .docset to the name.
@@ -1220,8 +1515,12 @@ DOCSET_PUBLISHER_NAME  = Publisher
 # If the GENERATE_HTMLHELP tag is set to YES then doxygen generates three
 # additional HTML index files: index.hhp, index.hhc, and index.hhk. The
 # index.hhp is a project file that can be read by Microsoft's HTML Help Workshop
-# (see: http://www.microsoft.com/en-us/download/details.aspx?id=21138) on
-# Windows.
+# on Windows. At the beginning of 2021 Microsoft took the original page, with
+# among others the download links, offline (the HTML Help Workshop had already
+# been in maintenance mode for many years). You can download the HTML Help
+# Workshop installation executable from the web archives (see:
+# http://web.archive.org/web/20160201063255/http://download.microsoft.com/downlo
+# ad/0/A/9/0A939EF6-E31C-430F-A3DF-DFAE7960D564/htmlhelp.exe).
 #
 # The HTML Help Workshop contains a compiler that can convert all HTML output
 # generated by doxygen into a single compiled HTML file (.chm). Compiled HTML
@@ -1243,28 +1542,28 @@ GENERATE_HTMLHELP      = NO
 CHM_FILE               =
 
 # The HHC_LOCATION tag can be used to specify the location (absolute path
-# including file name) of the HTML help compiler ( hhc.exe). If non-empty
+# including file name) of the HTML help compiler (hhc.exe). If non-empty,
 # doxygen will try to run the HTML help compiler on the generated index.hhp.
 # The file has to be specified with full path.
 # This tag requires that the tag GENERATE_HTMLHELP is set to YES.
 
 HHC_LOCATION           =
 
-# The GENERATE_CHI flag controls if a separate .chi index file is generated (
-# YES) or that it should be included in the master .chm file ( NO).
+# The GENERATE_CHI flag controls if a separate .chi index file is generated
+# (YES) or that it should be included in the main .chm file (NO).
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTMLHELP is set to YES.
 
 GENERATE_CHI           = NO
 
-# The CHM_INDEX_ENCODING is used to encode HtmlHelp index ( hhk), content ( hhc)
+# The CHM_INDEX_ENCODING is used to encode HtmlHelp index (hhk), content (hhc)
 # and project file content.
 # This tag requires that the tag GENERATE_HTMLHELP is set to YES.
 
 CHM_INDEX_ENCODING     =
 
-# The BINARY_TOC flag controls whether a binary table of contents is generated (
-# YES) or a normal table of contents ( NO) in the .chm file. Furthermore it
+# The BINARY_TOC flag controls whether a binary table of contents is generated
+# (YES) or a normal table of contents (NO) in the .chm file. Furthermore it
 # enables the Previous and Next buttons.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTMLHELP is set to YES.
@@ -1278,6 +1577,16 @@ BINARY_TOC             = NO
 
 TOC_EXPAND             = NO
 
+# The SITEMAP_URL tag is used to specify the full URL of the place where the
+# generated documentation will be placed on the server by the user during the
+# deployment of the documentation. The generated sitemap is called sitemap.xml
+# and placed in the directory specified by HTML_OUTPUT. In case no SITEMAP_URL
+# is specified no sitemap is generated. For information about the sitemap
+# protocol see https://www.sitemaps.org
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+SITEMAP_URL            =
+
 # If the GENERATE_QHP tag is set to YES and both QHP_NAMESPACE and
 # QHP_VIRTUAL_FOLDER are set, an additional index file will be generated that
 # can be used as input for Qt's qhelpgenerator to generate a Qt Compressed Help
@@ -1296,7 +1605,8 @@ QCH_FILE               =
 
 # The QHP_NAMESPACE tag specifies the namespace to use when generating Qt Help
 # Project output. For more information please see Qt Help Project / Namespace
-# (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#namespace).
+# (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#namespace).
 # The default value is: org.doxygen.Project.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
@@ -1304,8 +1614,8 @@ QHP_NAMESPACE          = org.doxygen.Project
 
 # The QHP_VIRTUAL_FOLDER tag specifies the namespace to use when generating Qt
 # Help Project output. For more information please see Qt Help Project / Virtual
-# Folders (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#virtual-
-# folders).
+# Folders (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#virtual-folders).
 # The default value is: doc.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
@@ -1313,30 +1623,30 @@ QHP_VIRTUAL_FOLDER     = doc
 
 # If the QHP_CUST_FILTER_NAME tag is set, it specifies the name of a custom
 # filter to add. For more information please see Qt Help Project / Custom
-# Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom-
-# filters).
+# Filters (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-filters).
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHP_CUST_FILTER_NAME   =
 
 # The QHP_CUST_FILTER_ATTRS tag specifies the list of the attributes of the
 # custom filter to add. For more information please see Qt Help Project / Custom
-# Filters (see: http://qt-project.org/doc/qt-4.8/qthelpproject.html#custom-
-# filters).
+# Filters (see:
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#custom-filters).
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHP_CUST_FILTER_ATTRS  =
 
 # The QHP_SECT_FILTER_ATTRS tag specifies the list of the attributes this
 # project's filter section matches. Qt Help Project / Filter Attributes (see:
-# http://qt-project.org/doc/qt-4.8/qthelpproject.html#filter-attributes).
+# https://doc.qt.io/archives/qt-4.8/qthelpproject.html#filter-attributes).
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHP_SECT_FILTER_ATTRS  =
 
-# The QHG_LOCATION tag can be used to specify the location of Qt's
-# qhelpgenerator. If non-empty doxygen will try to run qhelpgenerator on the
-# generated .qhp file.
+# The QHG_LOCATION tag can be used to specify the location (absolute path
+# including file name) of Qt's qhelpgenerator. If non-empty doxygen will try to
+# run qhelpgenerator on the generated .qhp file.
 # This tag requires that the tag GENERATE_QHP is set to YES.
 
 QHG_LOCATION           =
@@ -1378,17 +1688,29 @@ DISABLE_INDEX          = NO
 # index structure (just like the one that is generated for HTML Help). For this
 # to work a browser that supports JavaScript, DHTML, CSS and frames is required
 # (i.e. any modern browser). Windows users are probably better off using the
-# HTML help feature. Via custom stylesheets (see HTML_EXTRA_STYLESHEET) one can
-# further fine-tune the look of the index. As an example, the default style
-# sheet generated by doxygen has an example that shows how to put an image at
-# the root of the tree instead of the PROJECT_NAME. Since the tree basically has
-# the same information as the tab index, you could consider setting
-# DISABLE_INDEX to YES when enabling this option.
+# HTML help feature. Via custom style sheets (see HTML_EXTRA_STYLESHEET) one can
+# further fine tune the look of the index (see "Fine-tuning the output"). As an
+# example, the default style sheet generated by doxygen has an example that
+# shows how to put an image at the root of the tree instead of the PROJECT_NAME.
+# Since the tree basically has the same information as the tab index, you could
+# consider setting DISABLE_INDEX to YES when enabling this option.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
 GENERATE_TREEVIEW      = YES
 
+# When both GENERATE_TREEVIEW and DISABLE_INDEX are set to YES, then the
+# FULL_SIDEBAR option determines if the side bar is limited to only the treeview
+# area (value NO) or if it should extend to the full height of the window (value
+# YES). Setting this to YES gives a layout similar to
+# https://docs.readthedocs.io with more room for contents, but less room for the
+# project logo, title, and description. If either GENERATE_TREEVIEW or
+# DISABLE_INDEX is set to NO, this option has no effect.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+FULL_SIDEBAR           = NO
+
 # The ENUM_VALUES_PER_LINE tag can be used to set the number of enum values that
 # doxygen will group on one line in the generated HTML documentation.
 #
@@ -1404,15 +1726,33 @@ ENUM_VALUES_PER_LINE   = 4
 # Minimum value: 0, maximum value: 1500, default value: 250.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
-TREEVIEW_WIDTH         = 250
+TREEVIEW_WIDTH         = 335
 
-# When the EXT_LINKS_IN_WINDOW option is set to YES doxygen will open links to
+# If the EXT_LINKS_IN_WINDOW option is set to YES, doxygen will open links to
 # external symbols imported via tag files in a separate window.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_HTML is set to YES.
 
 EXT_LINKS_IN_WINDOW    = NO
 
+# If the OBFUSCATE_EMAILS tag is set to YES, doxygen will obfuscate email
+# addresses.
+# The default value is: YES.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+OBFUSCATE_EMAILS       = YES
+
+# If the HTML_FORMULA_FORMAT option is set to svg, doxygen will use the pdf2svg
+# tool (see https://github.com/dawbarton/pdf2svg) or inkscape (see
+# https://inkscape.org) to generate formulas as SVG images instead of PNGs for
+# the HTML output. These images will generally look nicer at scaled resolutions.
+# Possible values are: png (the default) and svg (looks nicer but requires the
+# pdf2svg or inkscape tool).
+# The default value is: png.
+# This tag requires that the tag GENERATE_HTML is set to YES.
+
+HTML_FORMULA_FORMAT    = png
+
 # Use this tag to change the font size of LaTeX formulas included as images in
 # the HTML documentation. When you change the font size after a successful
 # doxygen run you need to manually remove any form_*.png images from the HTML
@@ -1422,20 +1762,15 @@ EXT_LINKS_IN_WINDOW    = NO
 
 FORMULA_FONTSIZE       = 12
 
-# Use the FORMULA_TRANPARENT tag to determine whether or not the images
-# generated for formulas are transparent PNGs. Transparent PNGs are not
-# supported properly for IE 6.0, but are supported on all modern browsers.
-#
-# Note that when changing this option you need to delete any form_*.png files in
-# the HTML output directory before the changes have effect.
-# The default value is: YES.
-# This tag requires that the tag GENERATE_HTML is set to YES.
+# The FORMULA_MACROFILE can contain LaTeX \newcommand and \renewcommand commands
+# to create new LaTeX commands to be used in formulas as building blocks. See
+# the section "Including formulas" for details.
 
-FORMULA_TRANSPARENT    = YES
+FORMULA_MACROFILE      =
 
 # Enable the USE_MATHJAX option to render LaTeX formulas using MathJax (see
-# http://www.mathjax.org) which uses client side Javascript for the rendering
-# instead of using prerendered bitmaps. Use this if you do not have LaTeX
+# https://www.mathjax.org) which uses client side JavaScript for the rendering
+# instead of using pre-rendered bitmaps. Use this if you do not have LaTeX
 # installed or if you want to formulas look prettier in the HTML output. When
 # enabled you may also need to install MathJax separately and configure the path
 # to it using the MATHJAX_RELPATH option.
@@ -1444,11 +1779,29 @@ FORMULA_TRANSPARENT    = YES
 
 USE_MATHJAX            = YES
 
+# With MATHJAX_VERSION it is possible to specify the MathJax version to be used.
+# Note that the different versions of MathJax have different requirements with
+# regards to the different settings, so it is possible that also other MathJax
+# settings have to be changed when switching between the different MathJax
+# versions.
+# Possible values are: MathJax_2 and MathJax_3.
+# The default value is: MathJax_2.
+# This tag requires that the tag USE_MATHJAX is set to YES.
+
+MATHJAX_VERSION        = MathJax_2
+
 # When MathJax is enabled you can set the default output format to be used for
-# the MathJax output. See the MathJax site (see:
-# http://docs.mathjax.org/en/latest/output.html) for more details.
+# the MathJax output. For more details about the output format see MathJax
+# version 2 (see:
+# http://docs.mathjax.org/en/v2.7-latest/output.html) and MathJax version 3
+# (see:
+# http://docs.mathjax.org/en/latest/web/components/output.html).
 # Possible values are: HTML-CSS (which is slower, but has the best
-# compatibility), NativeMML (i.e. MathML) and SVG.
+# compatibility. This is the name for MathJax version 2, for MathJax version 3
+# this will be translated into chtml), NativeMML (i.e. MathML. Only supported
+# for MathJax 2. For MathJax version 3 chtml will be used instead.), chtml (This
+# is the name for MathJax version 3, for MathJax version 2 this will be
+# translated into HTML-CSS) and SVG.
 # The default value is: HTML-CSS.
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
@@ -1461,22 +1814,29 @@ MATHJAX_FORMAT         = HTML-CSS
 # MATHJAX_RELPATH should be ../mathjax. The default value points to the MathJax
 # Content Delivery Network so you can quickly see the result without installing
 # MathJax. However, it is strongly recommended to install a local copy of
-# MathJax from http://www.mathjax.org before deployment.
-# The default value is: http://cdn.mathjax.org/mathjax/latest.
+# MathJax from https://www.mathjax.org before deployment. The default value is:
+# - in case of MathJax version 2: https://cdn.jsdelivr.net/npm/mathjax@2
+# - in case of MathJax version 3: https://cdn.jsdelivr.net/npm/mathjax@3
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
-MATHJAX_RELPATH        = http://cdn.mathjax.org/mathjax/latest
+MATHJAX_RELPATH        = https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.1
 
 # The MATHJAX_EXTENSIONS tag can be used to specify one or more MathJax
 # extension names that should be enabled during MathJax rendering. For example
+# for MathJax version 2 (see
+# https://docs.mathjax.org/en/v2.7-latest/tex.html#tex-and-latex-extensions):
 # MATHJAX_EXTENSIONS = TeX/AMSmath TeX/AMSsymbols
+# For example for MathJax version 3 (see
+# http://docs.mathjax.org/en/latest/input/tex/extensions/index.html):
+# MATHJAX_EXTENSIONS = ams
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
 MATHJAX_EXTENSIONS     =
 
 # The MATHJAX_CODEFILE tag can be used to specify a file with javascript pieces
 # of code that will be used on startup of the MathJax code. See the MathJax site
-# (see: http://docs.mathjax.org/en/latest/output.html) for more details. For an
+# (see:
+# http://docs.mathjax.org/en/v2.7-latest/output.html) for more details. For an
 # example see the documentation.
 # This tag requires that the tag USE_MATHJAX is set to YES.
 
@@ -1504,7 +1864,7 @@ MATHJAX_CODEFILE       =
 SEARCHENGINE           = NO
 
 # When the SERVER_BASED_SEARCH tag is enabled the search engine will be
-# implemented using a web server instead of a web client using Javascript. There
+# implemented using a web server instead of a web client using JavaScript. There
 # are two flavors of web server based searching depending on the EXTERNAL_SEARCH
 # setting. When disabled, doxygen will generate a PHP script for searching and
 # an index file used by the script. When EXTERNAL_SEARCH is enabled the indexing
@@ -1521,9 +1881,10 @@ SERVER_BASED_SEARCH    = NO
 # external search engine pointed to by the SEARCHENGINE_URL option to obtain the
 # search results.
 #
-# Doxygen ships with an example indexer ( doxyindexer) and search engine
+# Doxygen ships with an example indexer (doxyindexer) and search engine
 # (doxysearch.cgi) which are based on the open source search engine library
-# Xapian (see: http://xapian.org/).
+# Xapian (see:
+# https://xapian.org/).
 #
 # See the section "External Indexing and Searching" for details.
 # The default value is: NO.
@@ -1534,10 +1895,11 @@ EXTERNAL_SEARCH        = NO
 # The SEARCHENGINE_URL should point to a search engine hosted by a web server
 # which will return the search results when EXTERNAL_SEARCH is enabled.
 #
-# Doxygen ships with an example indexer ( doxyindexer) and search engine
+# Doxygen ships with an example indexer (doxyindexer) and search engine
 # (doxysearch.cgi) which are based on the open source search engine library
-# Xapian (see: http://xapian.org/). See the section "External Indexing and
-# Searching" for details.
+# Xapian (see:
+# https://xapian.org/). See the section "External Indexing and Searching" for
+# details.
 # This tag requires that the tag SEARCHENGINE is set to YES.
 
 SEARCHENGINE_URL       =
@@ -1572,7 +1934,7 @@ EXTRA_SEARCH_MAPPINGS  =
 # Configuration options related to the LaTeX output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_LATEX tag is set to YES doxygen will generate LaTeX output.
+# If the GENERATE_LATEX tag is set to YES, doxygen will generate LaTeX output.
 # The default value is: YES.
 
 GENERATE_LATEX         = NO
@@ -1588,22 +1950,36 @@ LATEX_OUTPUT           = latex
 # The LATEX_CMD_NAME tag can be used to specify the LaTeX command name to be
 # invoked.
 #
-# Note that when enabling USE_PDFLATEX this option is only used for generating
-# bitmaps for formulas in the HTML output, but not in the Makefile that is
-# written to the output directory.
-# The default file is: latex.
+# Note that when USE_PDFLATEX is not enabled the default is latex; when
+# USE_PDFLATEX is enabled the default is pdflatex, and if latex is chosen in
+# the latter case it is overridden by pdflatex. For specific output languages
+# the default may have been set differently; this depends on the implementation
+# of the output language.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_CMD_NAME         = latex
 
 # The MAKEINDEX_CMD_NAME tag can be used to specify the command name to generate
 # index for LaTeX.
+# Note: This tag is used in the Makefile / make.bat.
+# See also: LATEX_MAKEINDEX_CMD for the part in the generated output file
+# (.tex).
 # The default file is: makeindex.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 MAKEINDEX_CMD_NAME     = makeindex
 
-# If the COMPACT_LATEX tag is set to YES doxygen generates more compact LaTeX
+# The LATEX_MAKEINDEX_CMD tag can be used to specify the command name to
+# generate index for LaTeX. In case there is no backslash (\) as first character
+# it will be automatically added in the LaTeX code.
+# Note: This tag is used in the generated output file (.tex).
+# See also: MAKEINDEX_CMD_NAME for the part in the Makefile / make.bat.
+# The default value is: makeindex.
+# This tag requires that the tag GENERATE_LATEX is set to YES.
+
+LATEX_MAKEINDEX_CMD    = makeindex
+
+# If the COMPACT_LATEX tag is set to YES, doxygen generates more compact LaTeX
 # documents. This may be useful for small projects and may help to save some
 # trees in general.
 # The default value is: NO.
@@ -1621,41 +1997,57 @@ COMPACT_LATEX          = NO
 PAPER_TYPE             = a4
 
 # The EXTRA_PACKAGES tag can be used to specify one or more LaTeX package names
-# that should be included in the LaTeX output. To get the times font for
-# instance you can specify
-# EXTRA_PACKAGES=times
+# that should be included in the LaTeX output. The package can be specified just
+# by its name or with the correct syntax as to be used with the LaTeX
+# \usepackage command. To get the times font for instance you can specify:
+# EXTRA_PACKAGES=times or EXTRA_PACKAGES={times}
+# To use the option intlimits with the amsmath package you can specify:
+# EXTRA_PACKAGES=[intlimits]{amsmath}
 # If left blank no extra packages will be included.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 EXTRA_PACKAGES         =
 
-# The LATEX_HEADER tag can be used to specify a personal LaTeX header for the
-# generated LaTeX document. The header should contain everything until the first
-# chapter. If it is left blank doxygen will generate a standard header. See
-# section "Doxygen usage" for information on how to let doxygen write the
-# default header to a separate file.
+# The LATEX_HEADER tag can be used to specify a user-defined LaTeX header for
+# the generated LaTeX document. The header should contain everything until the
+# first chapter. If it is left blank doxygen will generate a standard header. It
+# is highly recommended to start with a default header using
+# doxygen -w latex new_header.tex new_footer.tex new_stylesheet.sty
+# and then modify the file new_header.tex. See also section "Doxygen usage" for
+# information on how to generate the default header that doxygen normally uses.
 #
-# Note: Only use a user-defined header if you know what you are doing! The
-# following commands have a special meaning inside the header: $title,
-# $datetime, $date, $doxygenversion, $projectname, $projectnumber,
-# $projectbrief, $projectlogo. Doxygen will replace $title with the empy string,
-# for the replacement values of the other commands the user is refered to
-# HTML_HEADER.
+# Note: Only use a user-defined header if you know what you are doing!
+# Note: The header is subject to change so you typically have to regenerate the
+# default header when upgrading to a newer version of doxygen. Certain commands
+# have a special meaning inside the header (and footer). For a description of
+# the possible markers and block names see the documentation.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_HEADER           =
 
-# The LATEX_FOOTER tag can be used to specify a personal LaTeX footer for the
-# generated LaTeX document. The footer should contain everything after the last
-# chapter. If it is left blank doxygen will generate a standard footer. See
+# The LATEX_FOOTER tag can be used to specify a user-defined LaTeX footer for
+# the generated LaTeX document. The footer should contain everything after the
+# last chapter. If it is left blank doxygen will generate a standard footer. See
 # LATEX_HEADER for more information on how to generate a default footer and what
-# special commands can be used inside the footer.
-#
-# Note: Only use a user-defined footer if you know what you are doing!
+# special commands can be used inside the footer. See also section "Doxygen
+# usage" for information on how to generate the default footer that doxygen
+# normally uses. Note: Only use a user-defined footer if you know what you are
+# doing!
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_FOOTER           =
 
+# The LATEX_EXTRA_STYLESHEET tag can be used to specify additional user-defined
+# LaTeX style sheets that are included after the standard style sheets created
+# by doxygen. Using this option one can overrule certain style aspects. Doxygen
+# will copy the style sheet files to the output directory.
+# Note: The order of the extra style sheet files is of importance (e.g. the last
+# style sheet in the list overrules the setting of the previous ones in the
+# list).
+# This tag requires that the tag GENERATE_LATEX is set to YES.
+
+LATEX_EXTRA_STYLESHEET =
+
 # The LATEX_EXTRA_FILES tag can be used to specify one or more extra images or
 # other source files which should be copied to the LATEX_OUTPUT output
 # directory. Note that the files will be copied as-is; there are no commands or
@@ -1673,18 +2065,26 @@ LATEX_EXTRA_FILES      =
 
 PDF_HYPERLINKS         = YES
 
-# If the USE_PDFLATEX tag is set to YES, doxygen will use pdflatex to generate
-# the PDF file directly from the LaTeX files. Set this option to YES to get a
-# higher quality PDF documentation.
+# If the USE_PDFLATEX tag is set to YES, doxygen will use the engine as
+# specified with LATEX_CMD_NAME to generate the PDF file directly from the LaTeX
+# files. Set this option to YES to get higher quality PDF documentation.
+#
+# See also section LATEX_CMD_NAME for selecting the engine.
 # The default value is: YES.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 USE_PDFLATEX           = YES
 
-# If the LATEX_BATCHMODE tag is set to YES, doxygen will add the \batchmode
-# command to the generated LaTeX files. This will instruct LaTeX to keep running
-# if errors occur, instead of asking the user for help. This option is also used
-# when generating formulas in HTML.
+# The LATEX_BATCHMODE tag signals the behavior of LaTeX in case of an error.
+# Possible values are: NO same as ERROR_STOP, YES same as BATCH, BATCH In batch
+# mode nothing is printed on the terminal, errors are scrolled as if <return> is
+# hit at every error; missing files that TeX tries to input or request from
+# keyboard input (\read on a not open input stream) cause the job to abort,
+# NON_STOP In nonstop mode the diagnostic message will appear on the terminal,
+# but there is no possibility of user interaction just like in batch mode,
+# SCROLL In scroll mode, TeX will stop only for missing files to input or if
+# keyboard input is necessary and ERROR_STOP In errorstop mode, TeX will stop at
+# each error, asking for user intervention.
 # The default value is: NO.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
@@ -1697,29 +2097,27 @@ LATEX_BATCHMODE        = NO
 
 LATEX_HIDE_INDICES     = NO
 
-# If the LATEX_SOURCE_CODE tag is set to YES then doxygen will include source
-# code with syntax highlighting in the LaTeX output.
-#
-# Note that which sources are shown also depends on other settings such as
-# SOURCE_BROWSER.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_LATEX is set to YES.
-
-LATEX_SOURCE_CODE      = NO
-
 # The LATEX_BIB_STYLE tag can be used to specify the style to use for the
 # bibliography, e.g. plainnat, or ieeetr. See
-# http://en.wikipedia.org/wiki/BibTeX and \cite for more info.
+# https://en.wikipedia.org/wiki/BibTeX and \cite for more info.
 # The default value is: plain.
 # This tag requires that the tag GENERATE_LATEX is set to YES.
 
 LATEX_BIB_STYLE        = plain
 
+# The LATEX_EMOJI_DIRECTORY tag is used to specify the (relative or absolute)
+# path from which the emoji images will be read. If a relative path is entered,
+# it will be relative to the LATEX_OUTPUT directory. If left blank the
+# LATEX_OUTPUT directory will be used.
+# This tag requires that the tag GENERATE_LATEX is set to YES.
+
+LATEX_EMOJI_DIRECTORY  =
+
 #---------------------------------------------------------------------------
 # Configuration options related to the RTF output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_RTF tag is set to YES doxygen will generate RTF output. The
+# If the GENERATE_RTF tag is set to YES, doxygen will generate RTF output. The
 # RTF output is optimized for Word 97 and may not look too pretty with other RTF
 # readers/editors.
 # The default value is: NO.
@@ -1734,7 +2132,7 @@ GENERATE_RTF           = NO
 
 RTF_OUTPUT             = rtf
 
-# If the COMPACT_RTF tag is set to YES doxygen generates more compact RTF
+# If the COMPACT_RTF tag is set to YES, doxygen generates more compact RTF
 # documents. This may be useful for small projects and may help to save some
 # trees in general.
 # The default value is: NO.
@@ -1754,9 +2152,9 @@ COMPACT_RTF            = NO
 
 RTF_HYPERLINKS         = NO
 
-# Load stylesheet definitions from file. Syntax is similar to doxygen's config
-# file, i.e. a series of assignments. You only have to provide replacements,
-# missing definitions are set to their default value.
+# Load stylesheet definitions from file. Syntax is similar to doxygen's
+# configuration file, i.e. a series of assignments. You only have to provide
+# replacements, missing definitions are set to their default value.
 #
 # See also section "Doxygen usage" for information on how to generate the
 # default style sheet that doxygen normally uses.
@@ -1765,8 +2163,8 @@ RTF_HYPERLINKS         = NO
 RTF_STYLESHEET_FILE    =
 
 # Set optional variables used in the generation of an RTF document. Syntax is
-# similar to doxygen's config file. A template extensions file can be generated
-# using doxygen -e rtf extensionFile.
+# similar to doxygen's configuration file. A template extensions file can be
+# generated using doxygen -e rtf extensionFile.
 # This tag requires that the tag GENERATE_RTF is set to YES.
 
 RTF_EXTENSIONS_FILE    =
@@ -1775,7 +2173,7 @@ RTF_EXTENSIONS_FILE    =
 # Configuration options related to the man page output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_MAN tag is set to YES doxygen will generate man pages for
+# If the GENERATE_MAN tag is set to YES, doxygen will generate man pages for
 # classes and files.
 # The default value is: NO.
 
@@ -1813,13 +2211,13 @@ MAN_SUBDIR             =
 # The default value is: NO.
 # This tag requires that the tag GENERATE_MAN is set to YES.
 
-MAN_LINKS              = NO
+MAN_LINKS              = YES
 
 #---------------------------------------------------------------------------
 # Configuration options related to the XML output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_XML tag is set to YES doxygen will generate an XML file that
+# If the GENERATE_XML tag is set to YES, doxygen will generate an XML file that
 # captures the structure of the code including all documentation.
 # The default value is: NO.
 
@@ -1833,7 +2231,7 @@ GENERATE_XML           = NO
 
 XML_OUTPUT             = xml
 
-# If the XML_PROGRAMLISTING tag is set to YES doxygen will dump the program
+# If the XML_PROGRAMLISTING tag is set to YES, doxygen will dump the program
 # listings (including syntax highlighting and cross-referencing information) to
 # the XML output. Note that enabling this will significantly increase the size
 # of the XML output.
@@ -1842,11 +2240,18 @@ XML_OUTPUT             = xml
 
 XML_PROGRAMLISTING     = YES
 
+# If the XML_NS_MEMB_FILE_SCOPE tag is set to YES, doxygen will include
+# namespace members in file scope as well, matching the HTML output.
+# The default value is: NO.
+# This tag requires that the tag GENERATE_XML is set to YES.
+
+XML_NS_MEMB_FILE_SCOPE = NO
+
 #---------------------------------------------------------------------------
 # Configuration options related to the DOCBOOK output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_DOCBOOK tag is set to YES doxygen will generate Docbook files
+# If the GENERATE_DOCBOOK tag is set to YES, doxygen will generate Docbook files
 # that can be used to generate PDF.
 # The default value is: NO.
 
@@ -1860,23 +2265,14 @@ GENERATE_DOCBOOK       = NO
 
 DOCBOOK_OUTPUT         = docbook
 
-# If the DOCBOOK_PROGRAMLISTING tag is set to YES doxygen will include the
-# program listings (including syntax highlighting and cross-referencing
-# information) to the DOCBOOK output. Note that enabling this will significantly
-# increase the size of the DOCBOOK output.
-# The default value is: NO.
-# This tag requires that the tag GENERATE_DOCBOOK is set to YES.
-
-DOCBOOK_PROGRAMLISTING = NO
-
 #---------------------------------------------------------------------------
 # Configuration options for the AutoGen Definitions output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_AUTOGEN_DEF tag is set to YES doxygen will generate an AutoGen
-# Definitions (see http://autogen.sf.net) file that captures the structure of
-# the code including all documentation. Note that this feature is still
-# experimental and incomplete at the moment.
+# If the GENERATE_AUTOGEN_DEF tag is set to YES, doxygen will generate an
+# AutoGen Definitions (see https://autogen.sourceforge.net/) file that captures
+# the structure of the code including all documentation. Note that this feature
+# is still experimental and incomplete at the moment.
 # The default value is: NO.
 
 GENERATE_AUTOGEN_DEF   = NO
@@ -1885,7 +2281,7 @@ GENERATE_AUTOGEN_DEF   = NO
 # Configuration options related to the Perl module output
 #---------------------------------------------------------------------------
 
-# If the GENERATE_PERLMOD tag is set to YES doxygen will generate a Perl module
+# If the GENERATE_PERLMOD tag is set to YES, doxygen will generate a Perl module
 # file that captures the structure of the code including all documentation.
 #
 # Note that this feature is still experimental and incomplete at the moment.
@@ -1893,7 +2289,7 @@ GENERATE_AUTOGEN_DEF   = NO
 
 GENERATE_PERLMOD       = NO
 
-# If the PERLMOD_LATEX tag is set to YES doxygen will generate the necessary
+# If the PERLMOD_LATEX tag is set to YES, doxygen will generate the necessary
 # Makefile rules, Perl scripts and LaTeX code to be able to generate PDF and DVI
 # output from the Perl module output.
 # The default value is: NO.
@@ -1901,9 +2297,9 @@ GENERATE_PERLMOD       = NO
 
 PERLMOD_LATEX          = NO
 
-# If the PERLMOD_PRETTY tag is set to YES the Perl module output will be nicely
+# If the PERLMOD_PRETTY tag is set to YES, the Perl module output will be nicely
 # formatted so it can be parsed by a human reader. This is useful if you want to
-# understand what is going on. On the other hand, if this tag is set to NO the
+# understand what is going on. On the other hand, if this tag is set to NO, the
 # size of the Perl module output will be much smaller and Perl will parse it
 # just the same.
 # The default value is: YES.
@@ -1923,14 +2319,14 @@ PERLMOD_MAKEVAR_PREFIX =
 # Configuration options related to the preprocessor
 #---------------------------------------------------------------------------
 
-# If the ENABLE_PREPROCESSING tag is set to YES doxygen will evaluate all
+# If the ENABLE_PREPROCESSING tag is set to YES, doxygen will evaluate all
 # C-preprocessor directives found in the sources and include files.
 # The default value is: YES.
 
 ENABLE_PREPROCESSING   = YES
 
-# If the MACRO_EXPANSION tag is set to YES doxygen will expand all macro names
-# in the source code. If set to NO only conditional compilation will be
+# If the MACRO_EXPANSION tag is set to YES, doxygen will expand all macro names
+# in the source code. If set to NO, only conditional compilation will be
 # performed. Macro expansion can be done in a controlled way by setting
 # EXPAND_ONLY_PREDEF to YES.
 # The default value is: NO.
@@ -1946,16 +2342,17 @@ MACRO_EXPANSION        = YES
 
 EXPAND_ONLY_PREDEF     = NO
 
-# If the SEARCH_INCLUDES tag is set to YES the includes files in the
+# If the SEARCH_INCLUDES tag is set to YES, the include files in the
 # INCLUDE_PATH will be searched if a #include is found.
 # The default value is: YES.
 # This tag requires that the tag ENABLE_PREPROCESSING is set to YES.
 
-SEARCH_INCLUDES        = YES
+SEARCH_INCLUDES        = NO
 
 # The INCLUDE_PATH tag can be used to specify one or more directories that
 # contain include files that are not input files but should be processed by the
-# preprocessor.
+# preprocessor. Note that the INCLUDE_PATH is not recursive, so the setting of
+# RECURSIVE has no effect here.
 # This tag requires that the tag SEARCH_INCLUDES is set to YES.
 
 INCLUDE_PATH           =
@@ -1978,8 +2375,9 @@ INCLUDE_FILE_PATTERNS  =
 
 PREDEFINED             = __declspec(x)= \
                          __attribute__(x)= \
-                         __cplusplus \
-                         AF_DOC
+                         __cplusplus=99999999999999 \
+                         AF_DOC \
+                         AF_API_VERSION=${ArrayFire_API_VERSION_CURRENT}
 
 # If the MACRO_EXPANSION and EXPAND_ONLY_PREDEF tags are set to YES then this
 # tag can be used to specify a list of macro names that should be expanded. The
@@ -2023,64 +2421,34 @@ TAGFILES               =
 # tag file that is based on the input files it reads. See section "Linking to
 # external documentation" for more information about the usage of tag files.
 
-GENERATE_TAGFILE       =
+GENERATE_TAGFILE       = doxtags.txt
 
-# If the ALLEXTERNALS tag is set to YES all external class will be listed in the
-# class index. If set to NO only the inherited external classes will be listed.
+# If the ALLEXTERNALS tag is set to YES, all external classes will be listed in
+# the class index. If set to NO, only the inherited external classes will be
+# listed.
 # The default value is: NO.
 
 ALLEXTERNALS           = NO
 
-# If the EXTERNAL_GROUPS tag is set to YES all external groups will be listed in
-# the modules index. If set to NO, only the current project's groups will be
+# If the EXTERNAL_GROUPS tag is set to YES, all external groups will be listed
+# in the modules index. If set to NO, only the current project's groups will be
 # listed.
 # The default value is: YES.
 
 EXTERNAL_GROUPS        = YES
 
-# If the EXTERNAL_PAGES tag is set to YES all external pages will be listed in
+# If the EXTERNAL_PAGES tag is set to YES, all external pages will be listed in
 # the related pages index. If set to NO, only the current project's pages will
 # be listed.
 # The default value is: YES.
 
 EXTERNAL_PAGES         = YES
 
-# The PERL_PATH should be the absolute path and name of the perl script
-# interpreter (i.e. the result of 'which perl').
-# The default file (with absolute path) is: /usr/bin/perl.
-
-PERL_PATH              = /usr/bin/perl
-
 #---------------------------------------------------------------------------
-# Configuration options related to the dot tool
+# Configuration options related to diagram generator tools
 #---------------------------------------------------------------------------
 
-# If the CLASS_DIAGRAMS tag is set to YES doxygen will generate a class diagram
-# (in HTML and LaTeX) for classes with base or super classes. Setting the tag to
-# NO turns the diagrams off. Note that this option also works with HAVE_DOT
-# disabled, but it is recommended to install and use dot, since it yields more
-# powerful graphs.
-# The default value is: YES.
-
-CLASS_DIAGRAMS         = YES
-
-# You can define message sequence charts within doxygen comments using the \msc
-# command. Doxygen will then run the mscgen tool (see:
-# http://www.mcternan.me.uk/mscgen/)) to produce the chart and insert it in the
-# documentation. The MSCGEN_PATH tag allows you to specify the directory where
-# the mscgen tool resides. If left empty the tool is assumed to be found in the
-# default search path.
-
-MSCGEN_PATH            =
-
-# You can include diagrams made with dia in doxygen documentation. Doxygen will
-# then run dia to produce the diagram and insert it in the documentation. The
-# DIA_PATH tag allows you to specify the directory where the dia binary resides.
-# If left empty dia is assumed to be found in the default search path.
-
-DIA_PATH               =
-
-# If set to YES, the inheritance and collaboration graphs will hide inheritance
+# If set to YES the inheritance and collaboration graphs will hide inheritance
 # and usage relations if the target is undocumented or is not a class.
 # The default value is: YES.
 
@@ -2088,7 +2456,7 @@ HIDE_UNDOC_RELATIONS   = YES
 
 # If you set the HAVE_DOT tag to YES then doxygen will assume the dot tool is
 # available from the path. This tool is part of Graphviz (see:
-# http://www.graphviz.org/), a graph visualization toolkit from AT&T and Lucent
+# https://www.graphviz.org/), a graph visualization toolkit from AT&T and Lucent
 # Bell Labs. The other options in this section have no effect if this option is
 # set to NO
 # The default value is: NO.
@@ -2105,35 +2473,52 @@ HAVE_DOT               = NO
 
 DOT_NUM_THREADS        = 0
 
-# When you want a differently looking font in the dot files that doxygen
-# generates you can specify the font name using DOT_FONTNAME. You need to make
-# sure dot is able to find the font, which can be done by putting it in a
-# standard location or by setting the DOTFONTPATH environment variable or by
-# setting DOT_FONTPATH to the directory containing the font.
-# The default value is: Helvetica.
+# DOT_COMMON_ATTR sets common attributes for nodes, edges and labels of
+# subgraphs. When you want a different-looking font in the dot files that
+# doxygen generates you can specify fontname, fontcolor and fontsize attributes.
+# For details please see <a href=https://graphviz.org/doc/info/attrs.html>Node,
+# Edge and Graph Attributes specification</a>. You need to make sure dot is able
+# to find the font, which can be done by putting it in a standard location or by
+# setting the DOTFONTPATH environment variable or by setting DOT_FONTPATH to the
+# directory containing the font. Default graphviz fontsize is 14.
+# The default value is: fontname=Helvetica,fontsize=10.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
-DOT_FONTNAME           = Helvetica
+DOT_COMMON_ATTR        = "fontname=Helvetica,fontsize=10"
 
-# The DOT_FONTSIZE tag can be used to set the size (in points) of the font of
-# dot graphs.
-# Minimum value: 4, maximum value: 24, default value: 10.
+# DOT_EDGE_ATTR is concatenated with DOT_COMMON_ATTR. For an elegant style you
+# can add 'arrowhead=open, arrowtail=open, arrowsize=0.5'. <a
+# href=https://graphviz.org/doc/info/arrows.html>Complete documentation about
+# arrow shapes.</a>
+# The default value is: labelfontname=Helvetica,labelfontsize=10.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
-DOT_FONTSIZE           = 10
+DOT_EDGE_ATTR          = "labelfontname=Helvetica,labelfontsize=10"
 
-# By default doxygen will tell dot to use the default font as specified with
-# DOT_FONTNAME. If you specify a different font using DOT_FONTNAME you can set
-# the path where dot can find it using this tag.
+# DOT_NODE_ATTR is concatenated with DOT_COMMON_ATTR. For a view without boxes
+# around nodes set 'shape=plain' or 'shape=plaintext'. See <a
+# href=https://www.graphviz.org/doc/info/shapes.html>Shapes specification</a>.
+# The default value is: shape=box,height=0.2,width=0.4.
+# This tag requires that the tag HAVE_DOT is set to YES.
+
+DOT_NODE_ATTR          = "shape=box,height=0.2,width=0.4"
+
+# You can set the path where dot can find the font specified with fontname in
+# DOT_COMMON_ATTR and other dot attributes.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
 DOT_FONTPATH           =
 
-# If the CLASS_GRAPH tag is set to YES then doxygen will generate a graph for
-# each documented class showing the direct and indirect inheritance relations.
-# Setting this tag to YES will force the CLASS_DIAGRAMS tag to NO.
+# If the CLASS_GRAPH tag is set to YES or GRAPH or BUILTIN then doxygen will
+# generate a graph for each documented class showing the direct and indirect
+# inheritance relations. In case the CLASS_GRAPH tag is set to YES or GRAPH and
+# HAVE_DOT is enabled as well, then dot will be used to draw the graph. In case
+# the CLASS_GRAPH tag is set to YES and HAVE_DOT is disabled or if the
+# CLASS_GRAPH tag is set to BUILTIN, then the built-in generator will be used.
+# If the CLASS_GRAPH tag is set to TEXT the direct and indirect inheritance
+# relations will be shown as texts / links.
+# Possible values are: NO, YES, TEXT, GRAPH and BUILTIN.
 # The default value is: YES.
-# This tag requires that the tag HAVE_DOT is set to YES.
 
 CLASS_GRAPH            = YES
 
@@ -2147,13 +2532,14 @@ CLASS_GRAPH            = YES
 COLLABORATION_GRAPH    = YES
 
 # If the GROUP_GRAPHS tag is set to YES then doxygen will generate a graph for
-# groups, showing the direct groups dependencies.
+# groups, showing the direct groups dependencies. See also the chapter Grouping
+# in the manual.
 # The default value is: YES.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
 GROUP_GRAPHS           = YES
 
-# If the UML_LOOK tag is set to YES doxygen will generate inheritance and
+# If the UML_LOOK tag is set to YES, doxygen will generate inheritance and
 # collaboration diagrams in a style similar to the OMG's Unified Modeling
 # Language.
 # The default value is: NO.
@@ -2170,10 +2556,32 @@ UML_LOOK               = NO
 # but if the number exceeds 15, the total amount of fields shown is limited to
 # 10.
 # Minimum value: 0, maximum value: 100, default value: 10.
-# This tag requires that the tag HAVE_DOT is set to YES.
+# This tag requires that the tag UML_LOOK is set to YES.
 
 UML_LIMIT_NUM_FIELDS   = 10
 
+# If the DOT_UML_DETAILS tag is set to NO, doxygen will show attributes and
+# methods without types and arguments in the UML graphs. If the DOT_UML_DETAILS
+# tag is set to YES, doxygen will add type and arguments for attributes and
+# methods in the UML graphs. If the DOT_UML_DETAILS tag is set to NONE, doxygen
+# will not generate fields with class member information in the UML graphs. The
+# class diagrams will look similar to the default class diagrams but using UML
+# notation for the relationships.
+# Possible values are: NO, YES and NONE.
+# The default value is: NO.
+# This tag requires that the tag UML_LOOK is set to YES.
+
+DOT_UML_DETAILS        = NO
+
+# The DOT_WRAP_THRESHOLD tag can be used to set the maximum number of characters
+# to display on a single line. If the actual line length exceeds this threshold
+# significantly it will be wrapped across multiple lines. Some heuristics are
+# applied to avoid ugly line breaks.
+# Minimum value: 0, maximum value: 1000, default value: 17.
+# This tag requires that the tag HAVE_DOT is set to YES.
+
+DOT_WRAP_THRESHOLD     = 17
+
 # If the TEMPLATE_RELATIONS tag is set to YES then the inheritance and
 # collaboration graphs will show the relations between templates and their
 # instances.
@@ -2205,7 +2613,8 @@ INCLUDED_BY_GRAPH      = YES
 #
 # Note that enabling this option will significantly increase the time of a run.
 # So in most cases it will be better to enable call graphs for selected
-# functions only using the \callgraph command.
+# functions only using the \callgraph command. Disabling a call graph can be
+# accomplished by means of the command \hidecallgraph.
 # The default value is: NO.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
@@ -2216,7 +2625,8 @@ CALL_GRAPH             = NO
 #
 # Note that enabling this option will significantly increase the time of a run.
 # So in most cases it will be better to enable caller graphs for selected
-# functions only using the \callergraph command.
+# functions only using the \callergraph command. Disabling a caller graph can be
+# accomplished by means of the command \hidecallergraph.
 # The default value is: NO.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
@@ -2238,12 +2648,23 @@ GRAPHICAL_HIERARCHY    = YES
 
 DIRECTORY_GRAPH        = YES
 
+# The DIR_GRAPH_MAX_DEPTH tag can be used to limit the maximum number of levels
+# of child directories generated in directory dependency graphs by dot.
+# Minimum value: 1, maximum value: 25, default value: 1.
+# This tag requires that the tag DIRECTORY_GRAPH is set to YES.
+
+DIR_GRAPH_MAX_DEPTH    = 1
+
 # The DOT_IMAGE_FORMAT tag can be used to set the image format of the images
-# generated by dot.
+# generated by dot. For an explanation of the image formats see the section
+# output formats in the documentation of the dot tool (Graphviz (see:
+# https://www.graphviz.org/)).
 # Note: If you choose svg you need to set HTML_FILE_EXTENSION to xhtml in order
 # to make the SVG files visible in IE 9+ (other browsers do not have this
 # requirement).
-# Possible values are: png, jpg, gif and svg.
+# Possible values are: png, jpg, gif, svg, png:gd, png:gd:gd, png:cairo,
+# png:cairo:gd, png:cairo:cairo, png:cairo:gdiplus, png:gdiplus and
+# png:gdiplus:gdiplus.
 # The default value is: png.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
@@ -2274,11 +2695,12 @@ DOT_PATH               =
 
 DOTFILE_DIRS           =
 
-# The MSCFILE_DIRS tag can be used to specify one or more directories that
-# contain msc files that are included in the documentation (see the \mscfile
-# command).
+# You can include diagrams made with dia in doxygen documentation. Doxygen will
+# then run dia to produce the diagram and insert it in the documentation. The
+# DIA_PATH tag allows you to specify the directory where the dia binary resides.
+# If left empty dia is assumed to be found in the default search path.
 
-MSCFILE_DIRS           =
+DIA_PATH               =
 
 # The DIAFILE_DIRS tag can be used to specify one or more directories that
 # contain dia files that are included in the documentation (see the \diafile
@@ -2287,14 +2709,23 @@ MSCFILE_DIRS           =
 DIAFILE_DIRS           =
 
 # When using plantuml, the PLANTUML_JAR_PATH tag should be used to specify the
-# path where java can find the plantuml.jar file. If left blank, it is assumed
-# PlantUML is not used or called during a preprocessing step. Doxygen will
-# generate a warning when it encounters a \startuml command in this case and
-# will not generate output for the diagram.
-# This tag requires that the tag HAVE_DOT is set to YES.
+# path where java can find the plantuml.jar file or the filename of the jar file
+# to be used. If left blank, it is assumed PlantUML is not used or called during
+# a preprocessing step. Doxygen will generate a warning when it encounters a
+# \startuml command in this case and will not generate output for the diagram.
 
 PLANTUML_JAR_PATH      =
 
+# When using plantuml, the PLANTUML_CFG_FILE tag can be used to specify a
+# configuration file for plantuml.
+
+PLANTUML_CFG_FILE      =
+
+# When using plantuml, the specified paths are searched for files specified by
+# the !include statement in a plantuml block.
+
+PLANTUML_INCLUDE_PATH  =
+
 # The DOT_GRAPH_MAX_NODES tag can be used to set the maximum number of nodes
 # that will be shown in the graph. If the number of nodes in a graph becomes
 # larger than this value, doxygen will truncate the graph, which is visualized
@@ -2319,19 +2750,7 @@ DOT_GRAPH_MAX_NODES    = 50
 
 MAX_DOT_GRAPH_DEPTH    = 0
 
-# Set the DOT_TRANSPARENT tag to YES to generate images with a transparent
-# background. This is disabled by default, because dot on Windows does not seem
-# to support this out of the box.
-#
-# Warning: Depending on the platform used, enabling this option may lead to
-# badly anti-aliased labels on the edges of a graph (i.e. they become hard to
-# read).
-# The default value is: NO.
-# This tag requires that the tag HAVE_DOT is set to YES.
-
-DOT_TRANSPARENT        = NO
-
-# Set the DOT_MULTI_TARGETS tag to YES allow dot to generate multiple output
+# Set the DOT_MULTI_TARGETS tag to YES to allow dot to generate multiple output
 # files in one run (i.e. multiple -o and -T options on the command line). This
 # makes dot run faster, but since only newer versions of dot (>1.8.10) support
 # this, this feature is disabled by default.
@@ -2343,14 +2762,34 @@ DOT_MULTI_TARGETS      = NO
 # If the GENERATE_LEGEND tag is set to YES doxygen will generate a legend page
 # explaining the meaning of the various boxes and arrows in the dot generated
 # graphs.
+# Note: This tag requires that UML_LOOK isn't set, i.e. the doxygen internal
+# graphical representation for inheritance and collaboration diagrams is used.
 # The default value is: YES.
 # This tag requires that the tag HAVE_DOT is set to YES.
 
 GENERATE_LEGEND        = YES
 
-# If the DOT_CLEANUP tag is set to YES doxygen will remove the intermediate dot
+# If the DOT_CLEANUP tag is set to YES, doxygen will remove the intermediate
 # files that are used to generate the various graphs.
+#
+# Note: This setting is not only used for dot files but also for msc temporary
+# files.
 # The default value is: YES.
-# This tag requires that the tag HAVE_DOT is set to YES.
 
 DOT_CLEANUP            = YES
+
+# You can define message sequence charts within doxygen comments using the \msc
+# command. If the MSCGEN_TOOL tag is left empty (the default), then doxygen will
+# use a built-in version of mscgen tool to produce the charts. Alternatively,
+# the MSCGEN_TOOL tag can also specify the name of an external tool. For instance,
+# specifying prog as the value, doxygen will call the tool as prog -T
+# <outfile_format> -o <outputfile> <inputfile>. The external tool should support
+# output file formats "png", "eps", "svg", and "ismap".
+
+MSCGEN_TOOL            =
+
+# The MSCFILE_DIRS tag can be used to specify one or more directories that
+# contain msc files that are included in the documentation (see the \mscfile
+# command).
+
+MSCFILE_DIRS           =
diff --git a/docs/footer.htm b/docs/footer.htm
index 5a2af817bf..ca355c3af8 100644
--- a/docs/footer.htm
+++ b/docs/footer.htm
@@ -1,57 +1,17 @@
+<!-- HTML footer for doxygen 1.9.3-->
+<!-- start footer part -->
+<!--BEGIN GENERATE_TREEVIEW-->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+    <ul>
+        $navpath
+        <li class="footer">$generatedby <a href="https://www.doxygen.org/index.html"><img class="footer" src="$relpath^doxygen.svg" width="104" height="31" alt="doxygen"/></a> $doxygenversion </li>
+    </ul>
 </div>
-</div>
-</div>
-</div>
-</div>
-
-<!--Google Analytics-->
-<script type="text/javascript">
-  var _gaq = _gaq || [];
-  _gaq.push(['_setAccount', 'UA-5076919-1']);
-  _gaq.push(['_setDomainName', '.arrayfire.com']);
-  _gaq.push(['_trackPageview']);
-
-  (function() {
-    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
-    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
-    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
-  })();
-</script>
-
-<!--Spectate-->
-<script type="text/javascript">
-  sAId = "151";
-  sCId = "688";
-
-  (function() {
-    function async_load(){
-      var s = document.createElement('script'); s.type = 'text/javascript';
-      s.src = (('https:' == document.location.protocol) ? "https://ssl" : "http://cdn") + ".spectate.com/s.js";
-      var c = document.getElementsByTagName('script')[0]; c.parentNode.insertBefore(s, c);
-    }
-    if(window.attachEvent) { window.attachEvent('onload', async_load); }
-    else { window.addEventListener('load', async_load, false); }
-  })();
-</script>
-
-<!--Adroll-->
-<script type="text/javascript">
-adroll_adv_id = "ZRWI4W4RTRHENOWGXZY5JQ";
-adroll_pix_id = "QLXGBK3MSFB6LOL6PES2MT";
-(function () {
-var oldonload = window.onload;
-window.onload = function(){
-   __adroll_loaded=true;
-   var scr = document.createElement("script");
-   var host = (("https:" == document.location.protocol) ? "https://s.adroll.com" : "http://a.adroll.com");
-   scr.setAttribute('async', 'true');
-   scr.type = "text/javascript";
-   scr.src = host + "/j/roundtrip.js";
-   ((document.getElementsByTagName('head') || [null])[0] ||
-    document.getElementsByTagName('script')[0].parentNode).appendChild(scr);
-   if(oldonload){oldonload()}};
-}());
-</script>
-
+<!--END GENERATE_TREEVIEW-->
+<!--BEGIN !GENERATE_TREEVIEW-->
+<hr class="footer"/><address class="footer"><small>
+    $generatedby&#160;<a href="https://www.doxygen.org/index.html"><img class="footer" src="$relpath^doxygen.svg" width="104" height="31" alt="doxygen"/></a> $doxygenversion
+</small></address>
+<!--END !GENERATE_TREEVIEW-->
 </body>
 </html>
diff --git a/docs/header.htm b/docs/header.htm
index c591b2945f..9d7542fe1b 100644
--- a/docs/header.htm
+++ b/docs/header.htm
@@ -1,76 +1,98 @@
-<!-- HTML header for doxygen 1.8.5-->
-<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
-<html xmlns="http://www.w3.org/1999/xhtml">
+<!-- HTML header for doxygen 1.9.5-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "https://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml" lang="$langISO">
 <head>
+<!-- Global site tag (gtag.js) - Google Analytics -->
+<script async src="https://www.googletagmanager.com/gtag/js?id=UA-130950618-1"></script>
+<script>
+    window.dataLayer = window.dataLayer || [];
+    function gtag(){dataLayer.push(arguments);}
+    gtag('js', new Date());
+
+    gtag('config', 'UA-130950618-1');
+</script>
 <meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
-<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta http-equiv="X-UA-Compatible" content="IE=11"/>
 <meta name="generator" content="Doxygen $doxygenversion"/>
+<meta name="viewport" content="width=device-width, initial-scale=1"/>
 <!--BEGIN PROJECT_NAME--><title>$projectname: $title</title><!--END PROJECT_NAME-->
 <!--BEGIN !PROJECT_NAME--><title>$title</title><!--END !PROJECT_NAME-->
 <link href="$relpath^tabs.css" rel="stylesheet" type="text/css"/>
+<!--BEGIN DISABLE_INDEX-->
+  <!--BEGIN FULL_SIDEBAR-->
+<script type="text/javascript">var page_layout=1;</script>
+  <!--END FULL_SIDEBAR-->
+<!--END DISABLE_INDEX-->
 <script type="text/javascript" src="$relpath^jquery.js"></script>
 <script type="text/javascript" src="$relpath^dynsections.js"></script>
-<script type="text/javascript" src="afw.js"></script>
-<script src="highlight.pack.js"></script>
-<link rel="stylesheet" href="highlight_js_doxygen.css">
-<script>hljs.initHighlightingOnLoad();</script>
 $treeview
 $search
 $mathjax
+$darkmode
 <link href="$relpath^$stylesheet" rel="stylesheet" type="text/css" />
 $extrastylesheet
+<script type="text/javascript" src="$relpath^doxygen-awesome-darkmode-toggle.js"></script>
+<script type="text/javascript" src="$relpath^doxygen-awesome-fragment-copy-button.js"></script>
+<script type="text/javascript" src="$relpath^doxygen-awesome-interactive-toc.js"></script>
+<script type="text/javascript">
+    DoxygenAwesomeDarkModeToggle.init()
+    DoxygenAwesomeInteractiveToc.init()
+	DoxygenAwesomeFragmentCopyButton.init()
+</script>
 </head>
 <body>
+<!--BEGIN DISABLE_INDEX-->
+  <!--BEGIN FULL_SIDEBAR-->
+<div id="side-nav" class="ui-resizable side-nav-resizable"><!-- do not remove this div, it is closed by doxygen! -->
+  <!--END FULL_SIDEBAR-->
+<!--END DISABLE_INDEX-->
+
 <div id="top"><!-- do not remove this div, it is closed by doxygen! -->
 
 <!--BEGIN TITLEAREA-->
 <div id="titlearea">
-<table width="100%">
+<table cellspacing="2" cellpadding="2" width="100%">
  <tbody>
- <tr style="height: 56px;">
+  <tr id="projectrow">
   <!--BEGIN PROJECT_LOGO-->
-  <td id="projectlogo"><img alt="Logo" src="$relpath^$projectlogo"/>
-  </td>
+  <td id="projectlogo"><a href="index.htm"><img alt="Logo" src="$relpath^$projectlogo"/></a></td>
   <!--END PROJECT_LOGO-->
-  <!--BEGIN PROJECT_NAME-->
-  <td style="padding-left: 0.5em;">
-   <div id="projectname">$projectname
-   <!--BEGIN PROJECT_NUMBER-->&#160;<span id="projectnumber">$projectnumber</span><!--END PROJECT_NUMBER-->
-   </div>
-   <!--BEGIN PROJECT_BRIEF--><div id="projectbrief">$projectbrief</div><!--END PROJECT_BRIEF-->
+  </tr>
+  <!--BEGIN PROJECT_BRIEF-->
+  <tr id="projectrow">
+  <td>
+  <div id="projectbrief">$projectbrief</div>
   </td>
-  <!--END PROJECT_NAME-->
-  <!--BEGIN !PROJECT_NAME-->
-   <!--BEGIN PROJECT_BRIEF-->
-    <td style="padding-left: 0.5em;">
-    <div id="projectbrief">$projectbrief</div>
-    </td>
-   <!--END PROJECT_BRIEF-->
+  </tr>
+  <!--END PROJECT_BRIEF-->
   <!--END !PROJECT_NAME-->
   <!--BEGIN DISABLE_INDEX-->
    <!--BEGIN SEARCHENGINE-->
-   <td>$searchbox</td>
+     <!--BEGIN !FULL_SIDEBAR-->
+  <tr>   
+     <!--END !FULL_SIDEBAR-->
    <!--END SEARCHENGINE-->
   <!--END DISABLE_INDEX-->
-	 <td id="gsearch">
-   <div><script>
-	    (function() {
-        var cx = '004356362924927882526:zup3ehe-7bs';
-        var gcse = document.createElement('script');
-        gcse.type = 'text/javascript';
-        gcse.async = true;
-        gcse.src = (document.location.protocol == 'https:' ? 'https:' : 'http:') +
-        '//www.google.com/cse/cse.js?cx=' + cx;
-	    var s = document.getElementsByTagName('script')[0];
-	    s.parentNode.insertBefore(gcse, s);
-	  })();
-  </script>
-  <gcse:search></gcse:search>
-</div>
-	 </td>
+  <div>
+    <td id="gsearch">
+        <script async src="https://cse.google.com/cse.js?cx=004356362924927882526:zup3ehe-7bs"></script>
+        <div class="gcse-search"></div>
+    </td>
+  </div>
+ </tr>
+ <tr>
+  <td>
+    <div id="togglediv"></div>
+  </td>
  </tr>
+  <!--BEGIN SEARCHENGINE-->
+  <!--BEGIN FULL_SIDEBAR-->
+   
+ 
+   <!--END FULL_SIDEBAR-->
+  <!--END SEARCHENGINE-->
  </tbody>
 </table>
 </div>
 <!--END TITLEAREA-->
-<!-- end header part -->
+<!-- end header part -->
\ No newline at end of file
diff --git a/docs/highlight.pack.js b/docs/highlight.pack.js
deleted file mode 100644
index c2a0b04252..0000000000
--- a/docs/highlight.pack.js
+++ /dev/null
@@ -1 +0,0 @@
-var hljs=new function(){function l(o){return o.replace(/&/gm,"&amp;").replace(/</gm,"&lt;").replace(/>/gm,"&gt;")}function b(p){for(var o=p.firstChild;o;o=o.nextSibling){if(o.nodeName.toUpperCase()=="CODE"){return o}if(!(o.nodeType==3&&o.nodeValue.match(/\s+/))){break}}}function h(p,o){return Array.prototype.map.call(p.childNodes,function(q){if(q.nodeType==3){return o?q.nodeValue.replace(/\n/g,""):q.nodeValue}if(q.nodeName.toUpperCase()=="BR"){return"\n"}return h(q,o)}).join("")}function a(q){var p=(q.className+" "+(q.parentNode?q.parentNode.className:"")).split(/\s+/);p=p.map(function(r){return r.replace(/^language-/,"")});for(var o=0;o<p.length;o++){if(e[p[o]]||p[o]=="no-highlight"){return p[o]}}}function c(q){var o=[];(function p(r,s){for(var t=r.firstChild;t;t=t.nextSibling){if(t.nodeType==3){s+=t.nodeValue.length}else{if(t.nodeName.toUpperCase()=="BR"){s+=1}else{if(t.nodeType==1){o.push({event:"start",offset:s,node:t});s=p(t,s);o.push({event:"stop",offset:s,node:t})}}}}return s})(q,0);return o}function j(p,r,v){var q=0;var y="";var s=[];function u(){if(!p.length||!r.length){return p.length?p:r}if(p[0].offset!=r[0].offset){return(p[0].offset<r[0].offset)?p:r}return r[0].event=="start"?p:r}function t(A){function z(B){return" "+B.nodeName+'="'+l(B.value)+'"'}y+="<"+A.nodeName.toLowerCase()+Array.prototype.map.call(A.attributes,z).join("")+">"}function x(z){y+="</"+z.nodeName.toLowerCase()+">"}function o(z){(z.event=="start"?t:x)(z.node)}while(p.length||r.length){var w=u();y+=l(v.substr(q,w[0].offset-q));q=w[0].offset;if(w==p){s.reverse().forEach(x);do{o(w.splice(0,1)[0]);w=u()}while(w==p&&w.length&&w[0].offset==q);s.reverse().forEach(t)}else{if(w[0].event=="start"){s.push(w[0].node)}else{s.pop()}o(w.splice(0,1)[0])}}return y+l(v.substr(q))}function f(r){function o(s){return(s&&s.source)||s}function p(t,s){return RegExp(o(t),"m"+(r.cI?"i":"")+(s?"g":""))}function q(z,x){if(z.compiled){return}z.compiled=true;var u=[];if(z.k){var s={};function 
A(B,t){if(r.cI){t=t.toLowerCase()}t.split(" ").forEach(function(C){var D=C.split("|");s[D[0]]=[B,D[1]?Number(D[1]):1];u.push(D[0])})}z.lR=p(z.l||"\\b"+hljs.IR+"\\b(?!\\.)",true);if(typeof z.k=="string"){A("keyword",z.k)}else{for(var y in z.k){if(!z.k.hasOwnProperty(y)){continue}A(y,z.k[y])}}z.k=s}if(x){if(z.bWK){z.b="\\b("+u.join("|")+")\\b(?!\\.)\\s*"}z.bR=p(z.b?z.b:"\\B|\\b");if(!z.e&&!z.eW){z.e="\\B|\\b"}if(z.e){z.eR=p(z.e)}z.tE=o(z.e)||"";if(z.eW&&x.tE){z.tE+=(z.e?"|":"")+x.tE}}if(z.i){z.iR=p(z.i)}if(z.r===undefined){z.r=1}if(!z.c){z.c=[]}for(var w=0;w<z.c.length;w++){if(z.c[w]=="self"){z.c[w]=z}q(z.c[w],z)}if(z.starts){q(z.starts,x)}var v=[];for(var w=0;w<z.c.length;w++){v.push(o(z.c[w].b))}if(z.tE){v.push(o(z.tE))}if(z.i){v.push(o(z.i))}z.t=v.length?p(v.join("|"),true):{exec:function(t){return null}}}q(r)}function d(E,G,C,M){function o(r,P){for(var O=0;O<P.c.length;O++){var N=P.c[O].bR.exec(r);if(N&&N.index==0){return P.c[O]}}}function s(N,r){if(N.e&&N.eR.test(r)){return N}if(N.eW){return s(N.parent,r)}}function t(r,N){return !C&&N.i&&N.iR.test(r)}function y(O,r){var N=H.cI?r[0].toLowerCase():r[0];return O.k.hasOwnProperty(N)&&O.k[N]}function I(){var N=l(w);if(!B.k){return N}var r="";var Q=0;B.lR.lastIndex=0;var O=B.lR.exec(N);while(O){r+=N.substr(Q,O.index-Q);var P=y(B,O);if(P){v+=P[1];r+='<span class="'+P[0]+'">'+O[0]+"</span>"}else{r+=O[0]}Q=B.lR.lastIndex;O=B.lR.exec(N)}return r+N.substr(Q)}function z(){if(B.sL&&!e[B.sL]){return l(w)}var N=B.subLanguageMode=="continuous"?B.top:undefined;var r=B.sL?d(B.sL,w,true,N):g(w);if(B.r>0){v+=r.keyword_count;A+=r.r}B.top=r.top;return'<span class="'+r.language+'">'+r.value+"</span>"}function L(){return B.sL!==undefined?z():I()}function K(O,r){var N=O.cN?'<span class="'+O.cN+'">':"";if(O.rB){x+=N;w=""}else{if(O.eB){x+=l(r)+N;w=""}else{x+=N;w=r}}B=Object.create(O,{parent:{value:B}})}function D(N,r){w+=N;if(r===undefined){x+=L();return 0}var P=o(r,B);if(P){x+=L();K(P,r);return P.rB?0:r.length}var Q=s(B,r);if(Q){var 
O=B;if(!(O.rE||O.eE)){w+=r}x+=L();do{if(B.cN){x+="</span>"}A+=B.r;B=B.parent}while(B!=Q.parent);if(O.eE){x+=l(r)}w="";if(Q.starts){K(Q.starts,"")}return O.rE?0:r.length}if(t(r,B)){throw new Error('Illegal lexem "'+r+'" for mode "'+(B.cN||"<unnamed>")+'"')}w+=r;return r.length||1}var H=e[E];if(!H){throw new Error('Unknown language: "'+E+'"')}f(H);var B=M||H;var x="";for(var F=B;F!=H;F=F.parent){if(F.cN){x='<span class="'+F.cN+'">'+x}}var w="";var A=0;var v=0;try{var u,q,p=0;while(true){B.t.lastIndex=p;u=B.t.exec(G);if(!u){break}q=D(G.substr(p,u.index-p),u[0]);p=u.index+q}D(G.substr(p));for(var F=B;F.parent;F=F.parent){if(F.cN){x+="</span>"}}return{r:A,keyword_count:v,value:x,language:E,top:B}}catch(J){if(J.message.indexOf("Illegal")!=-1){return{r:0,keyword_count:0,value:l(G)}}else{throw J}}}function g(s){var o={keyword_count:0,r:0,value:l(s)};var q=o;for(var p in e){if(!e.hasOwnProperty(p)){continue}var r=d(p,s,false);r.language=p;if(r.keyword_count+r.r>q.keyword_count+q.r){q=r}if(r.keyword_count+r.r>o.keyword_count+o.r){q=o;o=r}}if(q.language){o.second_best=q}return o}function i(q,p,o){if(p){q=q.replace(/^((<[^>]+>|\t)+)/gm,function(r,v,u,t){return v.replace(/\t/g,p)})}if(o){q=q.replace(/\n/g,"<br>")}return q}function m(r,u,p){var v=h(r,p);var t=a(r);if(t=="no-highlight"){return}var w=t?d(t,v,true):g(v);t=w.language;var o=c(r);if(o.length){var q=document.createElementNS("http://www.w3.org/1999/xhtml","pre");q.innerHTML=w.value;w.value=j(o,c(q),v)}w.value=i(w.value,u,p);var s=r.className;if(!s.match("(\\s|^)(language-)?"+t+"(\\s|$)")){s=s?(s+" "+t):t}r.innerHTML=w.value;r.className=s;r.result={language:t,kw:w.keyword_count,re:w.r};if(w.second_best){r.second_best={language:w.second_best.language,kw:w.second_best.keyword_count,re:w.second_best.r}}}function n(){if(n.called){return}n.called=true;Array.prototype.map.call(document.getElementsByTagNameNS("http://www.w3.org/1999/xhtml","pre"),b).filter(Boolean).forEach(function(o){m(o,hljs.tabReplace)})}function 
k(){window.addEventListener("DOMContentLoaded",n,false);window.addEventListener("load",n,false)}var e={};this.LANGUAGES=e;this.highlight=d;this.highlightAuto=g;this.fixMarkup=i;this.highlightBlock=m;this.initHighlighting=n;this.initHighlightingOnLoad=k;this.IR="[a-zA-Z][a-zA-Z0-9_]*";this.UIR="[a-zA-Z_][a-zA-Z0-9_]*";this.NR="\\b\\d+(\\.\\d+)?";this.CNR="(\\b0[xX][a-fA-F0-9]+|(\\b\\d+(\\.\\d*)?|\\.\\d+)([eE][-+]?\\d+)?)";this.BNR="\\b(0b[01]+)";this.RSR="!|!=|!==|%|%=|&|&&|&=|\\*|\\*=|\\+|\\+=|,|\\.|-|-=|/|/=|:|;|<<|<<=|<=|<|===|==|=|>>>=|>>=|>=|>>>|>>|>|\\?|\\[|\\{|\\(|\\^|\\^=|\\||\\|=|\\|\\||~";this.BE={b:"\\\\[\\s\\S]",r:0};this.ASM={cN:"string",b:"'",e:"'",i:"\\n",c:[this.BE],r:0};this.QSM={cN:"string",b:'"',e:'"',i:"\\n",c:[this.BE],r:0};this.CLCM={cN:"comment",b:"//",e:"$"};this.CBLCLM={cN:"comment",b:"/\\*",e:"\\*/"};this.HCM={cN:"comment",b:"#",e:"$"};this.NM={cN:"number",b:this.NR,r:0};this.CNM={cN:"number",b:this.CNR,r:0};this.BNM={cN:"number",b:this.BNR,r:0};this.REGEXP_MODE={cN:"regexp",b:/\//,e:/\/[gim]*/,i:/\n/,c:[this.BE,{b:/\[/,e:/\]/,r:0,c:[this.BE]}]};this.inherit=function(q,r){var o={};for(var p in q){o[p]=q[p]}if(r){for(var p in r){o[p]=r[p]}}return o}}();hljs.LANGUAGES.cpp=function(a){var b={keyword:"false int float while private char catch export virtual operator sizeof dynamic_cast|10 typedef const_cast|10 const struct for static_cast|10 union namespace unsigned long throw volatile static protected bool template mutable if public friend do return goto auto void enum else break new extern using true class asm case typeid short reinterpret_cast|10 default double register explicit signed typename try this switch continue wchar_t inline delete alignof char16_t char32_t constexpr decltype noexcept nullptr static_assert thread_local restrict _Bool complex",built_in:"std string cin cout cerr clog stringstream istringstream ostringstream auto_ptr deque list queue stack vector map set bitset multiset multimap unordered_set unordered_map 
unordered_multiset unordered_multimap array shared_ptr"};return{k:b,i:"</",c:[a.CLCM,a.CBLCLM,a.QSM,{cN:"string",b:"'\\\\?.",e:"'",i:"."},{cN:"number",b:"\\b(\\d+(\\.\\d*)?|\\.\\d+)(u|U|l|L|ul|UL|f|F)"},a.CNM,{cN:"preprocessor",b:"#",e:"$",c:[{b:"<",e:">",i:"\\n"},a.CLCM]},{cN:"stl_container",b:"\\b(deque|list|queue|stack|vector|map|set|bitset|multiset|multimap|unordered_map|unordered_set|unordered_multiset|unordered_multimap|array)\\s*<",e:">",k:b,r:10,c:["self"]}]}}(hljs);
\ No newline at end of file
diff --git a/docs/highlight_js_doxygen.css b/docs/highlight_js_doxygen.css
deleted file mode 100644
index 1cb145ea4a..0000000000
--- a/docs/highlight_js_doxygen.css
+++ /dev/null
@@ -1,93 +0,0 @@
-/*
-
-Doxygen emulation for highlight.js
-
-*/
-
-pre code {
-  display       :   block;
-  padding-left  :   15px;
-  background    :   #FEFCFB;
-  border        :   1px solid #F9CEBA
-}
-
-pre .comment,
-pre .template_comment,
-pre .diff .header,
-pre .doctype,
-pre .pi,
-pre .lisp .string,
-pre .javadoc {
-  color: #93a1a1;
-  font-style: italic;
-}
-
-pre .keyword,
-pre .winutils,
-pre .method,
-pre .addition,
-pre .css .tag,
-pre .request,
-pre .status,
-pre .nginx .title {
-  color: #859900;
-}
-
-pre .number,
-pre .command,
-pre .string,
-pre .tag .value,
-pre .rules .value,
-pre .phpdoc,
-pre .tex .formula,
-pre .regexp,
-pre .hexcolor {
-  color: #2aa198;
-}
-
-pre .title,
-pre .localvars,
-pre .chunk,
-pre .decorator,
-pre .identifier,
-pre .vhdl .literal,
-pre .id,
-pre .css .function {
-  color: #268bd2;
-}
-
-pre .attribute,
-pre .variable,
-pre .lisp .body,
-pre .smalltalk .number,
-pre .constant,
-pre .class .title,
-pre .parent,
-pre .haskell .type {
-  color: #b58900;
-}
-
-pre .preprocessor,
-pre .preprocessor .keyword,
-pre .pragma,
-pre .shebang,
-pre .symbol,
-pre .symbol .string,
-pre .diff .change,
-pre .special,
-pre .attr_selector,
-pre .important,
-pre .subst,
-pre .cdata,
-pre .clojure .title,
-pre .css .pseudo {
-  color: #cb4b16;
-}
-
-pre .deletion {
-  color: #dc322f;
-}
-
-pre .tex .formula {
-  background: #eee8d5;
-}
diff --git a/docs/layout.xml b/docs/layout.xml
index 6c8c9af358..1f0db6af21 100644
--- a/docs/layout.xml
+++ b/docs/layout.xml
@@ -2,17 +2,8 @@
   <!-- Navigation index tabs for HTML output -->
   <navindex>
     <tab type="mainpage" visible="yes" title="" />
-    <tab type="usergroup" visible="yes" title="Tutorials">
-      <tab type="user" url="\ref gettingstarted" visible="yes" title="Getting Started"/>
-      <tab type="user" url="\ref matrixmanipulation" visible="yes" title="Matrix Manipulation"/>
-      <tab type="user" url="\ref indexing" visible="yes" title="Indexing"/>
-      <tab type="user" url="\ref timing" visible="yes" title="Timing ArrayFire"/>
-      <tab type="user" url="\ref using_on_linux" visible="yes" title="Using on Linux"/>
-      <tab type="user" url="\ref using_on_windows" visible="yes" title="Using on Windows"/>
-      <tab type="user" url="\ref configuring_environment" visible="yes" title="Configuring ArrayFire Environment"/>
-      <tab type="usergroup" url="\ref page_gfor" visible="yes" title="GFOR Usage"/>
-    </tab>
-    <tab type="modules" visible="yes" title="Functions" intro="Documentation grouped according to category:"/>
+    <tab type="user" url="\ref tutorials" visible="yes" title="Tutorials"/>
+    <tab type="modules" visible="yes" title="Functions" intro="Documentation grouped by category:"/>
     <tab type="user" url="\ref releasenotes" visible="yes" title="Release Notes"/>
     <tab type="examples" visible="yes" title="" intro=""/>
   </navindex>
diff --git a/docs/pages/INSTALL.md b/docs/pages/INSTALL.md
deleted file mode 100644
index 3d9983aff9..0000000000
--- a/docs/pages/INSTALL.md
+++ /dev/null
@@ -1,137 +0,0 @@
-ArrayFire binary installation instructions {#installing}
-=====
-
-Installing ArrayFire couldn't be easier. We ship installers for Windows,
-OSX, and several variants of Linux. In general the installation procedure
-proceeds like this:
-
-1. [Download](http://arrayfire.com/download/) the ArrayFire installer for your
-   operating system
-2. Install prerequisites
-3. Install ArrayFire
-4. Test the installation
-5. [Where to go for help?](#GettingHelp)
-
-Below you will find instructions for
-
-* [Windows](#Windows)
-* Linux including
-    * [Debian (.deb) 8](#Debian)
-    * [Ubuntu (.deb) 14.10 and later](#Ubuntu)
-    * [Fedora (.rpm) 21](#Fedora)
-* [Mac OSX (.sh and brew)](#OSX)
-
-# <a name="Windows"></a> Windows
-
-Simply [download](http://arrayfire.com/download/) and run the installer.
-If you wish to use CUDA or OpenCL please ensure that you have also installed
-support for these technologies from your video card vendor's website.
-
-# Linux
-
-## <a name="Debian"></a> Debian 8
-
-First [download](http://arrayfire.com/download/) ArrayFire. Then, using the
-`gdebi` package manager, you can install ArrayFire and all dependencies as
-follows:
-
-    gdebi arrayfire*.deb
-
-If you prefer to use the `.sh` installer, it and all prerequisite packages
-may be installed as follows:
-
-    # Prerequisite packages:
-    apt-get install libfreeimage-dev libatlas3gf-base libfftw3-dev cmake
-
-    # Enable GPU support (OpenCL):
-    apt-get install ocl-icd-libopencl1
-
-    # Run Installer
-    ./arrayfire_3.0.0_Linux_x86_64.sh --exclude-subdir --prefix=/usr/local
-
-To enable CUDA support, edit `/etc/apt/sources.list` and append `non-free`
-to the line containing `deb http://.../debian jessie main`. Then, as root, run
-
-    apt-get update
-    apt-get install nvidia-cuda-dev
-
-## <a name="Fedora"></a> Fedora 21
-
-First [download](http://arrayfire.com/download/) ArrayFire. Then, using the
-`yum` package manager, you can install ArrayFire and all dependencies as
-follows:
-
-    yum --nogpgcheck localinstall arrayfire*.rpm
-
-Or with the self-extracting installer
-
-    # Install prerequiste packages
-    yum install freeimage atlas fftw cmake
-
-    # Run Installer
-    ./arrayfire_3.0.0_Linux_x86_64.sh --exclude-subdir --prefix=/usr/local
-
-## <a name="Ubuntu"></a> Ubuntu 14.10 and later
-
-First [download](http://arrayfire.com/download/) ArrayFire. Then, using the
-`gdebi` package manager, you can install ArrayFire and all dependencies as
-follows:
-
-    sudo apt-get install gdebi
-    gdebi arrayfire*.deb
-
-If you prefer to use the `.sh` installer, it and all prerequisite packages
-may be installed as follows:
-
-    # Prerequisite packages:
-    sudo apt-get install libfreeimage-dev libatlas3gf-base libfftw3-dev cmake
-
-    # Enable GPU support (OpenCL and/or CUDA):
-    sudo apt-get install ocl-icd-libopencl1
-    sudo apt-get install nvidia-cuda-dev
-
-    # Run Installer
-    sudo ./arrayfire_3.0.0_Linux_x86_64.sh --exclude-subdir --prefix=/usr/local
-
-# <a name="OSX"></a> Mac OSX
-
-## Self-extracting zip from ArrayFire website
-
-On OSX there are several dependencies that are not integrated into the
-operating system. It is easiest to install these using [Homebrew](http://brew.sh/),
-but you can also build them yourself if you prefer.
-
-First [download](http://arrayfire.com/download/) ArrayFire. You may install
-ArrayFire to `/usr/local` from XTerm using the following commands:
-
-    brew install boost fftw cmake freeimage
-
-    sudo ./arrayfire_3.0.0_Linux_x86_64.sh --exclude-subdir --prefix=/usr/local
-
-## Brew installation
-
-GitHub user [sutoiku](https://github.com/sutoiku) has been kind enough to
-write a brew installation script for ArrayFire. This installation method will
-download and compile ArrayFire and all prerequisites. Please remember to
-register on the ArrayFire website so we can keep you up to date about new
-versions of our software!
-
-    brew install arrayfire
-
-## Testing installation
-
-After ArrayFire is installed, you can build the example programs as follows:
-
-    cp -r /usr/local/share/doc/arrayfire/examples .
-    cd examples
-    mkdir build
-    cd build
-    cmake ..
-    make
-
-## <a name="GettingHelp"></a> Getting help
-
-* Google Groups: https://groups.google.com/forum/#!forum/arrayfire-users
-* ArrayFire Services:  [Consulting](http://arrayfire.com/consulting/)  |  [Support](http://arrayfire.com/support/)   |  [Training](http://arrayfire.com/training/)
-* ArrayFire Blogs: http://arrayfire.com/blog/
-* Email: <mailto:technical@arrayfire.com>
diff --git a/docs/pages/README.md b/docs/pages/README.md
index 6ec7d4274e..6ecb68ce4e 100644
--- a/docs/pages/README.md
+++ b/docs/pages/README.md
@@ -5,33 +5,35 @@ Overview {#mainpage}
 
 ## About ArrayFire
 
-ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array based function set makes parallel programming more accessible.
+ArrayFire is a high performance software library for parallel computing with
+an easy-to-use API. Its array based function set makes parallel programming
+more accessible.
 
 ## Installing ArrayFire
 
-You can install ArrayFire using either a binary installer for Windows, OSX,
-or Linux or download it from source:
+Install ArrayFire using a binary installer for Windows, OSX, or Linux, or
+build it from source:
 
 * [Binary installers for Windows, OSX, and Linux](\ref installing)
 * [Build from source](https://github.com/arrayfire/arrayfire)
 
 ## Easy to use
 
-The [array](\ref construct_mat) object is beautifully simple.
+The [array](\ref af::array) object is beautifully simple.
 
 Array-based notation effectively expresses computational algorithms in
-readable math-resembling notation. You _do not_ need expertise in
-parallel programming to use ArrayFire.
+readable math-resembling notation. Expertise in parallel programming _is not_
+required to use ArrayFire.
 
-A few lines of ArrayFire code
-accomplishes what can take 100s of complicated lines in CUDA or OpenCL
-kernels.
+A few lines of ArrayFire code accomplish what can take hundreds of complicated
+lines in CUDA, oneAPI, or OpenCL kernels.
 
 ## ArrayFire is extensive!
 
 #### Support for multiple domains
 
-ArrayFire contains [hundreds of functions](\ref arrayfire_func) across various domains including:
+ArrayFire contains [hundreds of functions](\ref arrayfire_func) across various
+domains including:
 - [Vector Algorithms](\ref vector_mat)
 - [Image Processing](\ref image_mat)
 - [Computer Vision](\ref cv_mat)
@@ -40,61 +42,65 @@ ArrayFire contains [hundreds of functions](\ref arrayfire_func) across various d
 - [Statistics](\ref stats_mat)
 - and more.
 
-Each function is hand-tuned by ArrayFire
-developers with all possible low-level optimizations.
+Each function is hand-tuned by ArrayFire developers with all possible
+low-level optimizations.
 
 #### Support for various data types and sizes
 
-ArrayFire operates on common [data shapes and sizes](\ref indexing),
-including vectors, matrices, volumes, and
+ArrayFire operates on common [data shapes and sizes](\ref indexing), including
+vectors, matrices, and volumes.
 
-It supports common [data types](\ref gettingstarted_datatypes),
-including single and double precision floating
-point values, complex numbers, booleans, and 32-bit signed and
-unsigned integers.
+It supports common [data types](\ref gettingstarted_datatypes), including
+single and double precision floating point values, complex numbers, booleans,
+and 8/16/32-bit signed and unsigned integers.
 
 #### Extending ArrayFire
 
-ArrayFire can be used as a stand-alone application or integrated with
-existing CUDA or OpenCL code. All ArrayFire `arrays` can be
-interchanged with other CUDA or OpenCL data structures.
+ArrayFire can be used as a stand-alone application or integrated with existing
+CUDA, oneAPI, or OpenCL code.
 
 ## Code once, run anywhere!
 
-With support for x86, ARM, CUDA, and OpenCL devices, ArrayFire supports for a comprehensive list of devices.
+With support for x86, ARM, CUDA, oneAPI, and OpenCL devices, ArrayFire
+supports a comprehensive list of devices.
 
 Each ArrayFire installation comes with:
- - a CUDA version (named 'libafcuda') for [NVIDIA
- GPUs](https://developer.nvidia.com/cuda-gpus),
- - an OpenCL version (named 'libafopencl') for [OpenCL devices](http://www.khronos.org/conformance/adopters/conformant-products#opencl)
- - a CPU version (named 'libafcpu') to fall back to when CUDA or OpenCL devices are not available.
+- a CUDA backend (named 'libafcuda') for [NVIDIA
+  GPUs](https://developer.nvidia.com/cuda-gpus),
+- a oneAPI backend (named 'libafoneapi') for [oneAPI
+  devices](https://www.intel.com/content/www/us/en/developer/articles/system-requirements/intel-oneapi-base-toolkit-system-requirements.html),
+- an OpenCL backend (named 'libafopencl') for [OpenCL
+  devices](http://www.khronos.org/conformance/adopters/conformant-products#opencl),
+- a CPU backend (named 'libafcpu') to fall back to when CUDA, oneAPI, or
+  OpenCL devices are unavailable.
 
 ## ArrayFire is highly efficient
 
 #### Vectorized and Batched Operations
 
-ArrayFire supports batched operations on N-dimensional arrays.
-Batch operations in ArrayFire are run in parallel ensuring an optimal usage of your CUDA or OpenCL device.
+ArrayFire supports batched operations on N-dimensional arrays. Batch
+operations in ArrayFire run in parallel, ensuring optimal usage of CUDA,
+oneAPI, or OpenCL devices.
 
-You can get the best performance out of ArrayFire using [vectorization techniques]().
+Best performance with ArrayFire is achieved using
+[vectorization techniques](\ref vectorization).
 
 ArrayFire can also execute loop iterations in parallel with
 [the gfor function](\ref gfor).
 
 #### Just in Time compilation
 
-ArrayFire performs run-time analysis of your code to increase
-arithmetic intensity and memory throughput, while avoiding unnecessary
-temporary allocations. It has an awesome internal JIT compiler to make
-optimizations for you.
+ArrayFire performs run-time analysis of code to increase arithmetic intensity
+and memory throughput, while avoiding unnecessary temporary allocations. It
+has an awesome internal JIT compiler to make important optimizations.
 
-Read more about how [ArrayFire JIT](http://arrayfire.com/performance-of-arrayfire-jit-code-generation/) can improve the performance in your application.
+Read more about how [ArrayFire JIT](\ref jit) can improve the performance of
+your application.
 
 ## Simple Example
 
-Here's a live example to let you see ArrayFire code. You create [arrays](\ref
-construct_mat) which reside on CUDA or OpenCL devices. Then you can use
-[ArrayFire functions](modules.htm) on those [arrays](\ref construct_mat).
+Here is an example of ArrayFire code that performs a Monte Carlo estimation of
+PI.
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 // sample 40 million points on the GPU
@@ -111,19 +117,51 @@ af_print(pi);
 
 #### Free Community Options
 
-* [ArrayFire mailing list](https://groups.google.com/forum/#!forum/arrayfire-users) (recommended)
+* [ArrayFire mailing
+  list](https://groups.google.com/forum/#!forum/arrayfire-users) (recommended)
 * [StackOverflow](http://stackoverflow.com/questions/tagged/arrayfire)
 
 #### Premium Support
 
-* Phone Support - available for purchase ([request a quote](mailto:sales@arrayfire.com))
+* Phone Support - available for purchase ([request a
+  quote](mailto:sales@arrayfire.com))
 
 #### Contact Us
 
-* If you need to contact us, visit our
-[contact us page](http://arrayfire.com/company/#contact).
+* If you need to contact us, visit our [contact us
+  page](http://arrayfire.com/company/#contact).
 
 #### Email
 
 * Engineering: technical@arrayfire.com
 * Sales: sales@arrayfire.com
+
+## Citations and Acknowledgements
+
+If you redistribute ArrayFire, please follow the terms established in <a
+href="https://github.com/arrayfire/arrayfire/blob/master/LICENSE">the
+license</a>. If you wish to cite ArrayFire in an academic publication, please
+use the following reference:
+
+Formatted:
+
+    Yalamanchili, P., Arshad, U., Mohammed, Z., Garigipati, P., Entschev, P.,
+    Kloppenborg, B., Malcolm, J. and Melonakos, J. (2015).
+    ArrayFire - A high performance software library for parallel computing with an
+    easy-to-use API. Atlanta: AccelerEyes. Retrieved from https://github.com/arrayfire/arrayfire
+
+BibTeX:
+
+    @misc{Yalamanchili2015,
+    abstract = {ArrayFire is a high performance software library for parallel computing with an easy-to-use API. Its array based function set makes parallel programming simple. ArrayFire's multiple backends (CUDA, OpenCL and native CPU) make it platform independent and highly portable. A few lines of code in ArrayFire can replace dozens of lines of parallel computing code, saving you valuable time and lowering development costs.},
+    address = {Atlanta},
+    author = {Yalamanchili, Pavan and Arshad, Umar and Mohammed, Zakiuddin and Garigipati, Pradeep and Entschev, Peter and Kloppenborg, Brian and Malcolm, James and Melonakos, John},
+    publisher = {AccelerEyes},
+    title = {{ArrayFire - A high performance software library for parallel computing with an easy-to-use API}},
+    url = {https://github.com/arrayfire/arrayfire},
+    year = {2015}
+    }
+
+ArrayFire development is funded by AccelerEyes LLC (dba ArrayFire) and several
+third parties; please see the list of <a
+href="https://github.com/arrayfire/arrayfire/blob/master/ACKNOWLEDGEMENTS.md">acknowledgements</a>.
diff --git a/docs/pages/configuring_arrayfire_environment.md b/docs/pages/configuring_arrayfire_environment.md
index fbb421f17d..7b20be9b4a 100644
--- a/docs/pages/configuring_arrayfire_environment.md
+++ b/docs/pages/configuring_arrayfire_environment.md
@@ -18,6 +18,16 @@ This is the path with ArrayFire gets installed, ie. the includes and libs are
 present in this directory. You can use this variable to add include paths and
 libraries to your projects.
 
+AF_PRINT_ERRORS {#af_print_errors}
+-------------------------------------------------------------------------------
+
+When AF_PRINT_ERRORS is set to 1, the exceptions thrown are more verbose and
+detailed. This helps in locating the exact failure.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AF_PRINT_ERRORS=1 ./myprogram
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 AF_CUDA_DEFAULT_DEVICE {#af_cuda_default_device}
 -------------------------------------------------------------------------------
 
@@ -28,6 +38,16 @@ variable are the device identifiers shown when af::info is run.
 AF_CUDA_DEFAULT_DEVICE=1 ./myprogram_cuda
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
+AF_ONEAPI_DEFAULT_DEVICE {#af_oneapi_default_device}
+-------------------------------------------------------------------------------
+
+Use this variable to set the default oneAPI device. Valid values for this
+variable are the device identifiers shown when af::info is run.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AF_ONEAPI_DEFAULT_DEVICE=1 ./myprogram_oneapi
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
 Note: af::setDevice call in the source code will take precedence over this
 variable.
 
@@ -44,15 +64,211 @@ AF_OPENCL_DEFAULT_DEVICE=1 ./myprogram_opencl
 Note: af::setDevice call in the source code will take precedence over this
 variable.
 
+AF_OPENCL_DEFAULT_DEVICE_TYPE {#af_opencl_default_device_type}
+-------------------------------------------------------------------------------
+
+Use this variable to set the default OpenCL device type. Valid values for this
+variable are: CPU, GPU, ACC (Accelerators).
+
+When set, the first device of the specified type is chosen as the default device.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AF_OPENCL_DEFAULT_DEVICE_TYPE=CPU ./myprogram_opencl
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Note: `AF_OPENCL_DEFAULT_DEVICE` and af::setDevice take precedence over this variable.
+
+AF_OPENCL_DEVICE_TYPE {#af_opencl_device_type}
+-------------------------------------------------------------------------------
+
+Use this variable to choose only OpenCL devices of the specified type. Valid
+values for this variable are:
+
+- ALL: All OpenCL devices. (Default behavior).
+- CPU: CPU devices only.
+- GPU: GPU devices only.
+- ACC: Accelerator devices only.
+
+When set, the remaining OpenCL device types are ignored by the OpenCL backend.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AF_OPENCL_DEVICE_TYPE=CPU ./myprogram_opencl
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+AF_OPENCL_CPU_OFFLOAD {#af_opencl_cpu_offload}
+-------------------------------------------------------------------------------
+
+When ArrayFire runs on devices with unified memory with the host (i.e.
+`CL_DEVICE_HOST_UNIFIED_MEMORY` is true for the device) then certain functions
+are offloaded to run on the CPU using mapped buffers.
+
+ArrayFire takes advantage of fast libraries such as MKL while spending no time
+copying memory from device to host. The device memory is mapped to a host
+pointer which can be used in the offloaded functions.
+
+This functionality can be disabled by using the environment variable
+`AF_OPENCL_CPU_OFFLOAD=0`.
+
+The default behavior of this variable changed in version 3.4.
+
+Prior to v3.4, CPU Offload functionality was used only when the user set
+`AF_OPENCL_CPU_OFFLOAD=1` and disabled otherwise.
+
+From v3.4 onwards, CPU Offload is enabled by default and is disabled only when
+`AF_OPENCL_CPU_OFFLOAD=0` is set.
+
+AF_OPENCL_SHOW_BUILD_INFO {#af_opencl_show_build_info}
+-------------------------------------------------------------------------------
+
+This variable is useful when debugging OpenCL kernel compilation failures. When
+this variable is set to 1 and an error occurs during an OpenCL kernel
+compilation, the log and kernel are printed to the screen.
+
 AF_DISABLE_GRAPHICS {#af_disable_graphics}
 -------------------------------------------------------------------------------
 
-Setting this variable will disable window creation when graphics functions are
-being called. Simply setting this variable will disable functionality, any
-value will suffice. Disabling window creation will disable all other graphics
-calls at runtime as well.
+Setting this variable to 1 will disable window creation when graphics
+functions are being called. Disabling window creation will disable all other
+graphics calls at runtime as well.
 
 This is a useful enviornment variable when running code on servers and systems
 without displays. When graphics calls are run on such machines, they will
 print warning about window creation failing. To suppress those calls, set this
 variable.
+
+AF_SYNCHRONOUS_CALLS {#af_synchronous_calls}
+-------------------------------------------------------------------------------
+
+When this environment variable is set to 1, ArrayFire will execute all
+functions synchronously.
+
+AF_SHOW_LOAD_PATH {#af_show_load_path}
+-------------------------------------------------------------------------------
+
+When using the Unified backend, if this variable is set to 1, it will show the
+path where the ArrayFire backend libraries are loaded from.
+
+If the libraries are loaded from system paths, such as PATH or
+LD_LIBRARY_PATH, then it will print "system path". If the libraries are loaded
+from other paths, then those paths are shown in full.
+
+AF_MEM_DEBUG {#af_mem_debug}
+-------------------------------------------------------------------------------
+
+When AF_MEM_DEBUG is set to 1 (or anything not equal to 0), the caching
+mechanism in the memory manager is disabled. The device buffers are allocated
+using native functions as needed and freed when going out of scope.
+
+When the environment variable is not set, it is treated as zero.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+AF_MEM_DEBUG=1 ./myprogram
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+AF_TRACE {#af_trace}
+-------------------------------------------------------------------------------
+
+If ArrayFire was built with logging support, this environment variable will
+enable tracing of various modules within ArrayFire. This is a comma separated
+list of modules to trace. If enabled, ArrayFire will print relevant information
+to stdout. Currently the following modules are supported:
+
+- all: All trace outputs
+- jit: Logs kernel fetch & respective compile options and any errors.
+- mem: Memory management allocation, free and garbage collection information
+- platform: Device management information
+- unified: Unified backend dynamic loading information
+
+Tracing displays the information that could be useful when debugging or
+optimizing your application. Here is how you would use this variable:
+
+    AF_TRACE=mem,unified ./myprogram
+
+This will print information about memory operations (allocations, frees, and
+garbage collection) and about the unified backend's dynamic library loading.
+
+All trace statements printed to the console have a prefix with the following
+pattern.
+
+**[category][Seconds since Epoch][Thread Id][source file relative path] \<Message\>**
+
+AF_MAX_BUFFERS {#af_max_buffers}
+-------------------------------------------------------------------------
+
+When AF_MAX_BUFFERS is set, this environment variable specifies the maximum
+number of buffers allocated before garbage collection kicks in.
+
+Please note that the total number of buffers that can exist simultaneously can
+be higher than this number. This variable tells the garbage collector that it
+should free any available buffers immediately if the threshold is reached.
+
+When not set, the default value is 1000.
+
+AF_OPENCL_MAX_JIT_LEN {#af_opencl_max_jit_len}
+-------------------------------------------------------------------------------
+
+When set, this environment variable specifies the maximum height of the OpenCL
+JIT tree after which evaluation is forced.
+
+The default value, as of v3.4, is 50 on OSX, 100 everywhere else. This value was
+20 for older versions.
+
+AF_CUDA_MAX_JIT_LEN {#af_cuda_max_jit_len}
+-------------------------------------------------------------------------------
+
+When set, this environment variable specifies the maximum height of the CUDA JIT
+tree after which evaluation is forced.
+
+The default value, as of v3.4, is 100. This value was 20 for older versions.
+
+AF_CPU_MAX_JIT_LEN {#af_cpu_max_jit_len}
+-------------------------------------------------------------------------------
+
+When set, this environment variable specifies the maximum length of the CPU JIT
+tree after which evaluation is forced.
+
+The default value, as of v3.4, is 100. This value was 20 for older versions.
+
+AF_BUILD_LIB_CUSTOM_PATH {#af_build_lib_custom_path}
+-------------------------------------------------------------------------------
+
+When set, this environment variable specifies a custom path along which the
+symbol manager will search for dynamic (shared library) backends to load. This
+is useful for specialized build configurations that use the unified backend and
+build shared libraries separately.
+
+By default, when the variable is unset or empty, no additional path is searched.
+
+
+AF_JIT_KERNEL_TRACE {#af_jit_kernel_trace}
+-------------------------------------------------------------------------------
+
+When set, this environment variable has to be set to one of the following
+three values:
+
+- stdout : generated kernels will be printed to standard output
+- stderr : generated kernels will be printed to standard error stream
+- absolute path to a folder on the disk where generated kernels will be stored
+
+CUDA backend kernels are stored in files with the `.cu` file extension.
+
+OpenCL backend kernels are stored in files with the `.cl` file extension.
+
+AF_JIT_KERNEL_CACHE_DIRECTORY {#af_jit_kernel_cache_directory}
+-------------------------------------------------------------------------------
+
+This variable sets the path to the ArrayFire cache on the filesystem. If set,
+ArrayFire will write the kernels that are compiled at runtime to this directory.
+If the path is not writeable, the default path is used.
+
+This path is different from AF_JIT_KERNEL_TRACE, which stores kernel source
+strings. The cache stores kernel binaries, whose content depends on the
+backend and platforms used.
+
+The default path is determined in the following order:
+  Unix:
+      1. $HOME/.arrayfire
+      2. /tmp/arrayfire
+  Windows:
+      1. ArrayFire application Temp folder (usually
+          C:\\Users\\\<user_name\>\\AppData\\Local\\Temp\\ArrayFire)
diff --git a/docs/pages/debugging.md b/docs/pages/debugging.md
new file mode 100644
index 0000000000..6712900f74
--- /dev/null
+++ b/docs/pages/debugging.md
@@ -0,0 +1,29 @@
+Debugging ArrayFire Issues {#debugging}
+===============================================================================
+
+Using Environment Variables
+---------------------------
+
+ * [`AF_PRINT_ERRORS=1`](configuring_environment.htm#af_print_errors): Makes exception messages more helpful
+ * [`AF_TRACE=all`](configuring_environment.htm#af_trace): Print ArrayFire message stream to console
+ * [`AF_JIT_KERNEL_TRACE=stdout`](configuring_environment.htm#af_jit_kernel_trace): Writes out source code generated by ArrayFire's JIT to the specified target
+ * [`AF_OPENCL_SHOW_BUILD_INFO=1`](configuring_environment.htm#af_opencl_show_build_info): Print OpenCL kernel build log to console
+
+
+Tips in Language Bindings
+-------------------------
+
+### C++
+
+* `af_print_mem_info("message", -1);`: Print table of memory used by ArrayFire on the active GPU
+
+### Python
+
+* `arrayfire.device.print_mem_info("message")`: Print table of memory used by ArrayFire on the active GPU
+
+
+
+Further Reading
+---------------
+
+See the [ArrayFire README](https://github.com/arrayfire/arrayfire) for support information.
diff --git a/docs/pages/forge_visualization.md b/docs/pages/forge_visualization.md
new file mode 100644
index 0000000000..01cffa07eb
--- /dev/null
+++ b/docs/pages/forge_visualization.md
@@ -0,0 +1,165 @@
+Visualizing af::array with Forge {#forge_visualization}
+===================
+
+ArrayFire as a library aims to provide a robust and easy-to-use platform for
+high-performance, parallel and GPU computing.
+
+[TOC]
+
+The goal of [Forge](https://github.com/arrayfire/forge), an OpenGL visualization
+library, is to provide equally robust visualizations that are interoperable
+between ArrayFire data-structures and an OpenGL context.
+
+ArrayFire provides wrapper functions that are designed to be a simple interface
+to visualize af::arrays. These functions perform various interop tasks. In
+particular, instead of wasting time copying and reformatting data from the GPU
+to the host and back to the GPU, we can draw directly from GPU data to GPU
+framebuffers! This saves two memory copies.
+
+Visualizations can be manipulated with a mouse. The following actions are available:
+- zoom (Alt + Mouse Left Click, move up & down)
+- pan (Just left click and drag)
+- rotation (Mouse right click - track ball rotation).
+
+Let's see exactly what visuals we can illuminate with Forge and how ArrayFire
+anneals the data between the two libraries.
+
+# Setup {#setup}
+Before we can call Forge functions, we need to set up the related "canvas" classes.
+Forge functions are tied to the af::Window class. First let's create a window:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+const static int width = 512, height = 512;
+af::Window window(width, height, "2D plot example title");
+
+do{
+
+//drawing functions here
+
+} while( !window.close() );
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+We also added a drawing loop, so now we can use Forge's drawing functions to
+draw to the window.
+The drawing functions present in Forge are listed below.
+
+# Rendering Functions {#render_func}
+
+Documentation for rendering functions can be found [here](\ref gfx_func_draw).
+
+## Image {#image}
+The af::Window::image() function can be used to plot grayscale or color images.
+To plot a grayscale image a 2d array should be passed into the function.
+Let's see this on a static noise example:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array img = constant(0, width, height); //make a black image
+array random = randu(width, height);      //make random [0,1] distribution
+img(random > 0.5) = 1; //set all pixels where distribution > 0.5 to white
+
+window.image(img);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+<img src="gfx_docs_images/noise.png" alt="Forge image plot of noise" width="20%" />
+Tweaking the previous example by giving our image a depth of 3 for the RGB values
+allows us to generate colorful noise:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array img = 255 * randu(width, height, 3);      //make random [0, 255] distribution
+window.image( img.as(u8) );
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+<img src="gfx_docs_images/color_noise.png" alt="Forge image plot of color noise" width="20%" />
+Note that Forge automatically handles any af::array type passed from ArrayFire.
+In the first example we passed in an image of floats in the range [0, 1].
+In the last example we cast our array to an unsigned byte array with the range
+[0, 255]. The type-handling properties are consistent for all Forge drawing functions.
+
+## Plot {#plot}
+The af::Window::plot() function visualizes an array as a 2d-line plot. Let's see
+a simple example:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array X = seq(-af::Pi, af::Pi, 0.01);
+array Y = sin(X);
+window.plot(X, Y);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+<img src="gfx_docs_images/sin_plot.png" alt="Forge 2d line plot of sin() function" width="30%" />
+The plot function has the signature:
+
+> **void plot( const array &X, const array &Y, const char * const title = NULL );**
+
+Both the x and y coordinates of the points are required to plot. This allows for
+non-uniform, or parametric plots:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array t = seq(0, 100, 0.01);
+array X = sin(t) * (exp(cos(t)) - 2 * cos(4 * t) - pow(sin(t / 12), 5));
+array Y = cos(t) * (exp(cos(t)) - 2 * cos(4 * t) - pow(sin(t / 12), 5));
+window.plot(X, Y);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+<img src="gfx_docs_images/butterfly_plot.png" alt="Forge 2d line plot of butterfly function" width="30%" />
+
+## Plot3 {#plot3}
+The af::Window::plot3() function will plot a curve in 3d-space.
+Its signature is:
+> **void plot3 (const array &in, const char * title = NULL);**
+The input array expects xyz-triplets in sequential order. The points can be in a
+flattened one dimensional (*3n x 1*) array, or in one of the (*3 x n*), (*n x 3*) matrix forms.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array Z = seq(0.1f, 10.f, 0.01);
+array Y = sin(10 * Z) / Z;
+array X = cos(10 * Z) / Z;
+
+array Pts = join(1, X, Y, Z);
+//Pts can be passed in as a matrix in the form n x 3 or 3 x n
+//or in the flattened xyz-triplet array with size 3n x 1
+window.plot3(Pts);
+//both of the following are equally valid
+//window.plot3(transpose(Pts));
+//window.plot3(flat(Pts));
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+<img src="gfx_docs_images/spiral_plot3.png" alt="Forge 3d line plot" width="40%" />
+
+## Histogram {#histogram}
+The af::Window::hist() function renders an input array as a histogram.
+In our example, the input array will be created with ArrayFire's histogram()
+function, which actually counts and bins each sample. The output from histogram()
+can directly be fed into the af::Window::hist() rendering function.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+const int BINS = 128, SAMPLES = 9162;
+array norm = randn(SAMPLES);
+array hist_arr = histogram(norm, BINS);
+
+window.hist(hist_arr, 0, BINS);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+In addition to the histogram array with the number of samples in each bin, the
+af::Window::hist() function takes two additional parameters -- the minimum and
+maximum values of all datapoints in the histogram array. This effectively sets
+the range of the binned data. The full signature of af::Window::hist() is:
+> **void hist(const array & X, const double minval, const double maxval, const char * const title = NULL);**
+<img src="gfx_docs_images/norm_histogram.png" alt="Forge histogram of normally distributed samples" width="40%" />
+
+
+## Surface {#surface}
+The af::Window::surface() function will plot af::arrays as a 3d surface.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array Z = randu(21, 21);
+window.surface(Z, "Random Surface");    //equal to next function call
+//window.surface( seq(-1, 1, 0.1), seq(-1, 1, 0.1), Z, "Random Surface");
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+<img src="gfx_docs_images/rand_surface.png" alt="Forge random surface plot" width="30%" />
+There are two overloads for the af::Window::surface() function:
+> **void surface (const array & S, const char * const title )**
+> // Accepts a 2d matrix with the z values of the surface
+
+> **void surface (const array &xVals, const array &yVals, const array &S, const char * const title)**
+> // accepts additional vectors that define the x,y coordinates for the surface points.
+
+The second overload has two options for the x, y coordinate vectors. Assuming a surface grid of size **m x n**:
+ 1. Short vectors defining the spacing along each axis. Vectors will have sizes **m x 1** and **n x 1**.
+ 2. Vectors containing the coordinates of each and every point.
+ Each of the vectors will have length **mn x 1**.
+ This can be used for completely non-uniform or parametric surfaces.
+
+# Conclusion {#conclusion}
+There is a fairly comprehensive collection of methods to visualize data in ArrayFire.
+Thanks to the high-performance GPU plotting library Forge, the provided ArrayFire
+functions not only make visualizations as simple as possible, but keep them as
+robust as the rest of the ArrayFire library.
diff --git a/docs/pages/getting_started.md b/docs/pages/getting_started.md
index 451f994f60..2bd3b4d1f6 100644
--- a/docs/pages/getting_started.md
+++ b/docs/pages/getting_started.md
@@ -3,82 +3,140 @@ Getting Started {#gettingstarted}
 
 [TOC]
 
+# Introduction
+
+ArrayFire is a high performance software library for parallel computing with
+an easy-to-use API. ArrayFire abstracts away many of the details of
+programming parallel architectures by providing a high-level container object,
+the [array](\ref af::array), that represents data stored on a CPU, GPU, FPGA,
+or other type of accelerator. This abstraction permits developers to write
+massively parallel applications in a high-level language where they need
+not be concerned about low-level optimizations that are frequently required to
+achieve high throughput on most parallel architectures.
+
 # Supported data types {#gettingstarted_datatypes}
 
-There is one generic [array](\ref af::array) container object while the
-underlying data may be one of various [basic types](\ref af::af_dtype):
+ArrayFire provides one generic container object, the [array](\ref af::array),
+on which functions and mathematical operations are performed. The `array`
+can represent one of many different [basic data types](\ref af_dtype):
 
-* [b8](\ref b8) 8-bit boolean values (`bool`)
 * [f32](\ref f32) real single-precision (`float`)
 * [c32](\ref c32) complex single-precision (`cfloat`)
-* [s32](\ref s32) 32-bit signed integer (`int`)
-* [u32](\ref u32) 32-bit unsigned integer (`unsigned`)
 * [f64](\ref f64) real double-precision (`double`)
 * [c64](\ref c64) complex double-precision (`cdouble`)
+* [f16](\ref f16) real half-precision (`half_float::half`)
+* [b8](\ref b8) 8-bit boolean values (`bool`)
+* [s32](\ref s32) 32-bit signed integer (`int`)
+* [u32](\ref u32) 32-bit unsigned integer (`unsigned`)
+* [s8](\ref s8) 8-bit signed integer (`signed char`)
+* [u8](\ref u8) 8-bit unsigned integer (`unsigned char`)
 * [s64](\ref s64) 64-bit signed integer (`intl`)
 * [u64](\ref u64) 64-bit unsigned integer (`uintl`)
+* [s16](\ref s16) 16-bit signed integer (`short`)
+* [u16](\ref u16) 16-bit unsigned integer (`unsigned short`)
+
+Most of these data types are supported on all modern GPUs; however, some
+older devices may lack support for double precision arrays. In this case,
+a runtime error will be generated when the array is constructed.
 
-Older devices may not support double precision operations.
+If not specified otherwise, `array`s are created as single precision floating
+point numbers (`f32`).
 
-# Creating an populating an ArrayFire array {#getting_started_af_arrays}
+# Creating and populating an ArrayFire array {#getting_started_af_arrays}
 
-ArrayFire [array](\ref af::array)s always exist on the device. They
-may be populated with data using an ArrayFire function, or filled with data
-found on the host. For example:
+ArrayFire [array](\ref af::array)s represent memory stored on the device.
+As such, creation and population of an array will consume memory on the device
+which cannot be freed until the `array` object goes out of scope. As device memory
+allocation can be expensive, ArrayFire also includes a memory manager which
+will re-use device memory whenever possible.
+
+Arrays can be created using one of the [array constructors](\ref af::array).
+Below we show how to create 1D, 2D, and 3D arrays with uninitialized values:
+
+\snippet test/getting_started.cpp ex_getting_started_constructors
+
+However, uninitialized memory is likely not useful in your application.
+ArrayFire provides several convenient functions for creating arrays that contain
+pre-populated values including constants, uniform random numbers, normally
+distributed random numbers, and the identity matrix:
 
 \snippet test/getting_started.cpp ex_getting_started_gen
 
 A complete list of ArrayFire functions that automatically generate data
 on the device may be found on the [functions to create arrays](\ref data_mat)
-page. The default data type for arrays is [f32](\ref f32) (a
+page. As stated above, the default data type for arrays is [f32](\ref f32) (a
 32-bit floating point number) unless specified otherwise.
 
-ArrayFire arrays may also be populated from data found on the host.
+ArrayFire `array`s may also be populated from data found on the host.
 For example:
 
 \snippet test/getting_started.cpp ex_getting_started_init
 
-ArrayFire also supports array initialization from a device pointer.
-For example ArrayFire can be populated directly by a call to `cudaMemcpy`
+ArrayFire also supports array initialization from memory already on the GPU.
+For example, with CUDA one can populate an `array` directly using a call
+to `cudaMemcpy`:
 
 \snippet test/getting_started.cpp ex_getting_started_dev_ptr
 
-# ArrayFire array contents, dimentions, and properties {#getting_started_array_properties}
+Similar functionality exists for OpenCL too. If you wish to intermingle
+ArrayFire with CUDA or OpenCL code, we suggest you consult the
+[CUDA interoperability](\ref interop_cuda) or
+[OpenCL interoperability](\ref interop_opencl) pages for detailed instructions.
+
+# ArrayFire array contents, dimensions, and properties {#getting_started_array_properties}
 
-The [af_print](\ref af::af_print) function can be used to print arrays that
-have already been generated or an expression involving arrays:
+ArrayFire provides several functions to inspect arrays. These include
+functions to print the contents, query the dimensions, and determine other
+properties such as the underlying type and size.
+
+The [af_print](\ref af_print) function can be used to print arrays that
+have already been generated or any expression involving arrays:
 
 \snippet test/getting_started.cpp ex_getting_started_print
 
-ArrayFire provides several convenient methods for accessing the dimensions.
-You may use either a [dim4](\ref af::dim4) object or access the dimensions
-directly using the [dims()](\ref af::array::dims) and
-[numdims()](\ref af::array::numdims) functions:
+The dimensions of an array may be determined using either a
+[dim4](\ref af::dim4) object or by accessing the dimensions directly using the
+[dims()](\ref af::array::dims) and [numdims()](\ref af::array::numdims)
+functions:
 
 \snippet test/getting_started.cpp ex_getting_started_dims
 
-Arrays also provide functions to determine their properties including:
+In addition to dimensions, arrays also carry several properties including
+methods to determine the underlying type and size (in bytes). You can even
+determine whether the array is empty, real/complex, a row/column, or a scalar
+or a vector:
 
 \snippet test/getting_started.cpp ex_getting_started_prop
 
+For further information on these capabilities, we suggest you consult the
+full documentation on the [array](\ref af::array).
+
 # Writing mathematical expressions in ArrayFire {#getting_started_writing_math}
 
-Most of ArrayFire's functions operate on an element-wise basis.
-This means that function like `c[i] = a[i] + b[i]` could simply be written
-as `c = a + b`.
-ArrayFire has an intelligent runtime JIT compliation engine which converts
-array expressions into the smallest number of OpenCL/CUDA kernels.
-This "kernel fusion" technology not only decreases the number of kernel calls,
-but, more importantly, avoids extraneous global memory operations.
+ArrayFire features an intelligent Just-In-Time (JIT) compilation engine that
+converts expressions using arrays into the smallest number of CUDA/OpenCL
+kernels. For most operations on arrays, ArrayFire functions like a vector library.
+That means that an element-wise operation, like `c[i] = a[i] + b[i]` in C,
+would be written more concisely without indexing, like `c = a + b`.
+When there are multiple expressions involving arrays, ArrayFire's JIT engine
+will merge them together. This "kernel fusion" technology not only decreases
+the number of kernel calls, but, more importantly, avoids extraneous global
+memory operations.
 Our JIT functionality extends across C/C++ function boundaries and only ends
 when a non-JIT function is encountered or a synchronization operation is
 explicitly called by the code.
 
-ArrayFire has [hundreds of functions](\ref arith_mat) for element-wise
-arithmetic. Here are a few examples:
+ArrayFire provides [hundreds of functions](\ref arith_mat) for element-wise
+operations. All of the standard operators (e.g. +,-,\*,/) are supported
+as are most transcendental functions (sin, cos, log, sqrt, etc.).
+Here are a few examples:
 
 \snippet test/getting_started.cpp ex_getting_started_arith
 
+To see the complete list of functions please consult the documentation on
+[mathematical](\ref mathfunc_mat), [linear algebra](\ref linalg_mat),
+[signal processing](\ref signal_mat), and [statistics](\ref stats_mat).
+
 # Mathematical constants {#getting_started_constants}
 
 ArrayFire contains several platform-independent constants, like
@@ -97,11 +155,11 @@ using the `af::` namespace.
 
 # Indexing {#getting_started_indexing}
 
-Like all functions in ArrayFire, indexing is also executed in parallel on
-the OpenCL/CUDA device.
-Because of this, indexing becomes part of a JIT operation and is accomplished
-using parentheses instead of square brackets (i.e. as `A(0)` instead of `A[0]`).
-To index `af::array`s you may use one or a combination of the following functions:
+Like all functions in ArrayFire, indexing is also executed in parallel on the
+OpenCL/CUDA devices. Because of this, indexing becomes part of a JIT operation
+and is accomplished using parentheses instead of square brackets (i.e. as `A(0)`
+instead of `A[0]`). To index `af::array`s you may use one or a combination of
+the following functions:
 
 * integer scalars
 * [seq()](\ref af::seq) representing a linear sequence
@@ -117,11 +175,13 @@ use these functions.
 # Getting access to ArrayFire array memory on the host and device {#getting_started_memory_access}
 
 Memory in `af::array`s may be accessed using the [host()](\ref af::array::host)
-and device()](\ref af::array::device) functions.
+and [device()](\ref af::array::device) functions.
 The `host` function *copies* the data from the device and makes it available
-in a C-style array on the host.
-The `device` function returns a pointer to device memory for interoperability
-with external CUDA/OpenCL kernels.
+in a C-style array on the host. As such, it is up to the developer to manage
+any memory returned by `host`.
+The `device` function returns a pointer/reference to device memory for
+interoperability with external CUDA/OpenCL kernels. As this memory belongs to
+ArrayFire, the programmer should not attempt to free/deallocate the pointer.
 For example, here is how we can interact with both OpenCL and CUDA:
 
 \snippet test/getting_started.cpp ex_getting_started_ptr
@@ -137,7 +197,7 @@ get it using the [scalar()](\ref af::array::scalar) function:
 
 # Bitwise operators {#getting_started_bitwise_operators}
 
-In addition to supporting standard mathematical functions, `af::array`s
+In addition to supporting standard mathematical functions, arrays
 that contain integer data types also support bitwise operators including
 and, or, and shift:
 
@@ -156,15 +216,16 @@ simply include the `arrayfire.h` header file and start coding!
     int main(void)
     {
         // generate random values
-        int n = 10000;
         af_array a;
-        af_randu(&a, n);
+        int n_dims = 1;
+        dim_t dims[] = {10000};
+        af_randu(&a, n_dims, dims, f32);
 
         // sum all the values
-        float result;
-        af_sum_all(&result, a, 0);
+        double result;
+        af_sum_all(&result, 0, a);
+        printf("sum: %g\n", result);
 
-        printf("sum: %g\n", sum);
         return 0;
     }
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/pages/gfor.md b/docs/pages/gfor.md
index 28410a7f18..bbced5d14b 100644
--- a/docs/pages/gfor.md
+++ b/docs/pages/gfor.md
@@ -8,18 +8,17 @@ Run many independent loops simultaneously on the GPU or device.
 Introduction {#gfor_intro}
 ============
 
-The gfor-loop construct may be used to simultaneously launch all of
-the iterations of a for-loop on the GPU or device, as long as the
-iterations are independent. While the standard for-loop performs each
-iteration sequentially, ArrayFire's gfor-loop performs each iteration
-at the same time (in parallel). ArrayFire does this by tiling out the
-values of all loop iterations and then performing computation on those
-tiles in one pass.
-
-You can think of `gfor` as performing auto-vectorization of your
-code, e.g. you write a gfor-loop that increments every element of a
-vector but behind the scenes ArrayFire rewrites it to operate on
-the entire vector in parallel.
+The gfor-loop construct may be used to simultaneously launch all of the
+iterations of a for-loop on the GPU or device, as long as the iterations are
+independent. While the standard for-loop performs each iteration sequentially,
+ArrayFire's gfor-loop performs each iteration at the same time (in
+parallel). ArrayFire does this by tiling out the values of all loop iterations
+and then performing computation on those tiles in one pass.
+
+You can think of `gfor` as performing auto-vectorization of your code,
+e.g. you write a gfor-loop that increments every element of a vector but
+behind the scenes ArrayFire rewrites it to operate on the entire vector in
+parallel.
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 for (int i = 0; i < n; ++i)
@@ -29,19 +28,19 @@ gfor (seq i, n)
    A(i) = A(i) + 1;
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Behind the scenes, ArrayFire rewrites your code into this
-equivalent and faster version:
+Behind the scenes, ArrayFire rewrites your code into this equivalent and
+faster version:
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 A = A + 1;
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-It is best to vectorize computation as much as possible to avoid
-the overhead in both for-loops and gfor-loops.
+It is best to vectorize computation as much as possible to avoid the overhead
+in both for-loops and gfor-loops.
 
-To see another example, you could run an FFT on every 2D slice of a
-volume in a for-loop, or you could "vectorize" and simply do it all
-in one gfor-loop operation:
+To see another example, you could run an FFT on every 2D slice of a volume in
+a for-loop, or you could "vectorize" and simply do it all in one gfor-loop
+operation:
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 for (int i = 0; i < N; ++i)
@@ -54,7 +53,7 @@ gfor (seq i, N)
 There are three formats for instantiating gfor-loops.
 -# gfor(var,n) Creates a sequence _{0, 1, ..., n-1}_
 -# gfor(var,first,last) Creates a sequence _{first, first+1, ..., last}_
--# gfor(var,first,incr,last) Creates a sequence _{first, first+inc, first+2*inc, ..., last}_
+-# gfor(var,first,last,incr) Creates a sequence _{first, first+incr, first+2*incr, ..., last}_
 
 So all of the following represent the equivalent sequence: _0,1,2,3,4_
 
@@ -74,14 +73,6 @@ gfor (seq k, 0, n-1) {
 }
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
-array A = constant(1,n,n,m);
-array B = constant(1,n,n);
-gfor (seq k, 0,m-1) {
-   A(span,span,k) = A(span,span,k) * B; // matrix-matrix multiply
-}
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 array A = randu(n,m);
 array B = constant(0,n,m);
@@ -97,11 +88,11 @@ User Functions called within GFOR {#gfor_user_functions}
 ---------------------------------
 
 If you have defined a function that you want to call within a GFOR loop, then
-that function has to meet all the conditions described in this page in
-order to be able to work as expected.
+that function has to meet all the conditions described in this page in order
+to be able to work as expected.
 
-Consider the (trivial) example below. The function compute() has to satisfy all
-requirements for GFOR Usage, so you cannot use if-else conditions inside
+Consider the (trivial) example below. The function compute() has to satisfy
+all requirements for GFOR Usage, so you cannot use if-else conditions inside
 it.
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
@@ -122,30 +113,6 @@ gfor (seq ii, n)
   H(span,ii) = compute(A(span,ii), B(span,ii), ep);
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-Multiplications {#gfor_mul}
----------------
-
-ArrayFire supports bulk multiplications of vector-vector, matrix-vector, and
-matrix-matrix types using GFOR. This is especially useful with many small
-matrices.
-
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
-array A = constant(1,n,n);
-array B = constant(1,n,1);
-array C = constant(0,n,m);
-gfor (seq k, n)
-  B(k) = A(k,span) * A(span,k); // vector-vector multiply
-
-A = constant(1,n,n,m);
-gfor (seq k, m)
-  C(span,k) = A(span,span,k) * B;  // matrix-vector multiply
-
-A = constant(1,n,n,m);
-B = constant(1,n,n);
-gfor (seq k, m)
-  A(span,span,k) = A(span,span,k) * B;  // matrix-matrix multiply
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
 The Iterator {#gfor_iterator}
 ------------
 
@@ -416,7 +383,8 @@ gfor (seq i, n) {
 }
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-The problem is that every GFOR tile has a different number of elements, something which GFOR cannot yet handle.
+The problem is that every GFOR tile has a different number of elements,
+something which GFOR cannot yet handle.
 
 Similar to the workaround for conditional statements, it might work to use
 masked arithmetic:
@@ -442,14 +410,13 @@ gfor (seq i, n) {
 Memory considerations {#gfor_memory}
 =====================
 
-Since each computation is done in parallel for all iterator values,
-you need to have enough card memory available to do all iterations
-simultaneously. If the problem exceeds memory, it will trigger "out of
-memory" errors.
+Since each computation is done in parallel for all iterator values, you need
+to have enough card memory available to do all iterations simultaneously. If
+the problem exceeds memory, it will trigger "out of memory" errors.
 
-You can work around the memory limitations of your GPU or device by
-breaking the GFOR loop up into segments; however, you might want to
-consider using a larger memory GPU or device.
+You can work around the memory limitations of your GPU or device by breaking
+the GFOR loop up into segments; however, you might want to consider using a
+larger memory GPU or device.
 
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
 // BEFORE
diff --git a/docs/pages/indexing.md b/docs/pages/indexing.md
index 6ca7cd4ab3..7ec779ff91 100644
--- a/docs/pages/indexing.md
+++ b/docs/pages/indexing.md
@@ -1,20 +1,334 @@
 Indexing {#indexing}
 ========
 
-There are several ways of referencing values.  ArrayFire uses
-parenthesis for subscripted referencing instead of the traditional
-square bracket notation.  Indexing is zero-based, i.e. the first
-element is at index zero (<tt>A(0)</tt>).  Indexing can be done
-with mixtures of:
-* integer scalars
-* [seq()](\ref af::seq) representing a linear sequence
-* [end](\ref af::end) representing the last element of a dimension
-* [span](\ref af::span) representing the entire dimension
+Indexing in ArrayFire is a powerful but easy-to-abuse feature of the af::array
+class. This feature allows you to reference or copy subsections of a larger array
+and perform operations on only a subset of elements.
+
+Indexing in ArrayFire can be performed using the parenthesis operator or one of
+the member functions of the af::array class. These functions allow you to
+reference one or a range of elements from the original array.
+
+Here we will demonstrate some of the ways you can use indexing in ArrayFire and
+discuss ways to minimize the memory and performance impact of these operations.
+
+Let's start by creating a new 4x4 matrix of floating point numbers:
+
+\snippet test/index.cpp index_tutorial_1
+
+ArrayFire is column-major so the resulting A array will look like this:
+
+\f[
+\begin{bmatrix}
+    0 & 4 & 8 & 12 \\
+    1 & 5 & 9 & 13 \\
+    2 & 6 & 10 & 14 \\
+    3 & 7 & 11 & 15
+\end{bmatrix}
+\f]
+
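+This is a two dimensional array so we can access the first element of this
+matrix by passing `0,0` into the parenthesis operator of the af::array. Any
+other element is addressed the same way, by row and then column; `A(2, 3)`
+selects the element in the third row and fourth column.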
+
+\snippet test/index.cpp index_tutorial_first_element
+
+\f[ A(2, 3) = [ 14 ] \f]
+
+We can also access the array using linear indexing by passing in one value. Here
+we are accessing the element at linear index 5, which is the sixth element
+because indexing is zero-based.
+
+\snippet test/index.cpp index_tutorial_fifth_element
+
+\f[ A(5) = [ 5 ] \f]
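+
+The relationship between the two-subscript form and the linear form can be
+sketched in plain C++. This is only an illustration of column-major
+addressing, not ArrayFire code; the `linear_index` helper is a hypothetical
+name introduced for this sketch.
+
+```cpp
+#include <cstddef>
+#include <iostream>
+#include <vector>
+
+// Hypothetical helper: maps a (row, col) subscript to the linear index of a
+// column-major matrix that has `rows` rows.
+std::size_t linear_index(std::size_t row, std::size_t col, std::size_t rows) {
+    return col * rows + row;
+}
+
+int main() {
+    const std::size_t rows = 4, cols = 4;
+
+    // Column-major storage of the 4x4 matrix above: values 0..15.
+    std::vector<float> data(rows * cols);
+    for (std::size_t i = 0; i < data.size(); ++i) {
+        data[i] = static_cast<float>(i);
+    }
+
+    std::cout << data[linear_index(2, 3, rows)] << "\n";  // A(2, 3): prints 14
+    std::cout << data[linear_index(1, 1, rows)] << "\n";  // same as A(5): prints 5
+    return 0;
+}
+```
+
+Because columns are stored one after another, walking down a column touches
+consecutive linear indices, while walking along a row jumps by the number of
+rows.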
+
+\note Normally you want to avoid accessing individual elements of the array like this
+for performance reasons.
+
+Indexing with negative values will access from the end of the array. For example,
+the values -1 and -2 will return the last and second to last elements of the
+array, respectively. ArrayFire also provides the `end` alias, which refers to
+the last element of the array.
+
+\snippet test/index.cpp index_tutorial_negative_indexing
+
+## Indexing slices and subarrays
+
+You can access regions of the array via the af::seq and af::span objects. The
+span object allows you to select the entire set of elements across a particular
+dimension/axis of an array. For example, we can select the third column of the
+array by passing span as the first argument and 2 as the second argument to the
+parenthesis operator.
+
+\snippet test/index.cpp index_tutorial_third_column
+
+\f[
+A(span, 2) =
+\begin{bmatrix}
+    8 \\
+    9 \\
+    10 \\
+    11
+\end{bmatrix}
+\f]
+
+You can read that as saying that you want all values across the first dimension,
+but only from index 2 of the second dimension.
+
+You can access the second row by passing (1, span) to the array:
+
+\snippet test/index.cpp index_tutorial_second_row
+
+\f[ A(1, span) = [ 1, 5, 9, 13 ] \f]
+
+You can use the af::seq (short for sequence) object to define a range when
+indexing. For example, if you want to get the first two columns, you can access
+the array by passing af::span for the first argument and af::seq(2) as the
+second argument.
+
+\snippet test/index.cpp index_tutorial_first_two_columns
+
+\f[
+A(span, seq(2)) =
+\begin{bmatrix}
+     0 & 4 \\
+     1 & 5 \\
+     2 & 6 \\
+     3 & 7
+\end{bmatrix}
+\f]
+
+There are three constructors for af::seq.
+
+* af::seq(N): defines a range between 0 and N-1
+* af::seq(begin, end): defines a range between begin and end, inclusive
+* af::seq(begin, end, step): defines a range between begin and end, striding by step values
+
+The last constructor can help create non-contiguous ranges. For example,
+you can select the second and fourth (last) rows by passing (seq(1, end, 2), span)
+to the indexing operator.
+
+\snippet test/index.cpp index_tutorial_second_and_fourth_rows
+
+\f[
+A(seq(1, end, 2), span) =
+\begin{bmatrix}
+     1 & 5 &  9 & 13 \\
+     3 & 7 & 11 & 15
+\end{bmatrix}
+\f]
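+
+The index sets that the three constructor forms describe can be enumerated
+with a small plain C++ sketch. The `expand_seq` helper below is hypothetical
+and is not part of ArrayFire; it only mirrors the inclusive
+(begin, end, step) semantics listed above and assumes a positive step.
+
+```cpp
+#include <iostream>
+#include <vector>
+
+// Hypothetical helper: lists the indices covered by an inclusive range that
+// starts at `begin`, stops at or before `end`, and advances by `step`.
+// Assumes step > 0.
+std::vector<int> expand_seq(int begin, int end, int step = 1) {
+    std::vector<int> out;
+    for (int i = begin; i <= end; i += step) {
+        out.push_back(i);
+    }
+    return out;
+}
+
+int main() {
+    const int last = 3;  // index of the last row of the 4x4 matrix
+
+    // Equivalent of seq(1, end, 2): rows 1 and 3.
+    for (int i : expand_seq(1, last, 2)) std::cout << i << " ";
+    std::cout << "\n";
+
+    // Equivalent of seq(2): indices 0 and 1.
+    for (int i : expand_seq(0, 2 - 1)) std::cout << i << " ";
+    std::cout << "\n";
+    return 0;
+}
+```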
+
+## Indexing using af::array
+
+You can also index using other af::array objects. ArrayFire performs a Cartesian
+product of the input arrays.
+
+\snippet test/index.cpp index_tutorial_array_indexing
+
+\f[
+A =
+\begin{bmatrix}
+    0 & 4 & 8 & 12 \\
+    1 & 5 & 9 & 13 \\
+    2 & 6 & 10 & 14 \\
+    3 & 7 & 11 & 15
+\end{bmatrix}
+\\
+A(
+\begin{bmatrix}
+2 \\ 1 \\ 3
+\end{bmatrix}
+,
+\begin{bmatrix}
+3 \\ 1 \\ 2
+\end{bmatrix}
+) =
+
+\begin{bmatrix}
+(2,3) & (2,1) & (2,2) \\
+(1,3) & (1,1) & (1,2) \\
+(3,3) & (3,1) & (3,2)
+\end{bmatrix}
+=
+\begin{bmatrix}
+14 & 6 & 10 \\
+13 & 5 &  9 \\
+15 & 7 & 11
+\end{bmatrix}
+\f]
+
+
+If you want to index an af::array using coordinate arrays, you can do that using the
+af::approx1 and af::approx2 functions.
+
+\snippet test/index.cpp index_tutorial_approx
+
+\f[
+approx2(A,
+\begin{bmatrix}
+2 \\ 1 \\ 3
+\end{bmatrix}
+,
+\begin{bmatrix}
+3 \\ 1 \\ 2
+\end{bmatrix}
+) =
+\begin{bmatrix}
+(2,3) \\
+(1,1) \\
+(3,2)
+\end{bmatrix}
+=
+\begin{bmatrix}
+14 \\
+ 5 \\
+11
+\end{bmatrix}
+\f]
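+
+The difference between the two pairing rules shown above can be sketched in
+plain C++ on a flat column-major buffer. The helpers `index_product` and
+`index_coords` are hypothetical names used only for this illustration; they
+do not reproduce how ArrayFire implements array indexing or af::approx2.
+
+```cpp
+#include <cstddef>
+#include <iostream>
+#include <vector>
+
+using Index = std::vector<std::size_t>;
+
+// Hypothetical helper: pairs every row index with every column index and
+// returns the r.size() x c.size() result in column-major order.
+std::vector<float> index_product(const std::vector<float>& a, std::size_t rows,
+                                 const Index& r, const Index& c) {
+    std::vector<float> out;
+    for (std::size_t col : c) {
+        for (std::size_t row : r) out.push_back(a[col * rows + row]);
+    }
+    return out;
+}
+
+// Hypothetical helper: pairs the k-th row index with the k-th column index.
+// Assumes r and c have the same length.
+std::vector<float> index_coords(const std::vector<float>& a, std::size_t rows,
+                                const Index& r, const Index& c) {
+    std::vector<float> out;
+    for (std::size_t k = 0; k < r.size(); ++k) {
+        out.push_back(a[c[k] * rows + r[k]]);
+    }
+    return out;
+}
+
+int main() {
+    // Column-major storage of the 4x4 matrix A: values 0..15.
+    std::vector<float> a(16);
+    for (std::size_t i = 0; i < a.size(); ++i) a[i] = static_cast<float>(i);
+
+    const Index r = {2, 1, 3};
+    const Index c = {3, 1, 2};
+
+    // Cartesian product: 14 13 15 6 5 7 10 9 11 (3x3, column-major)
+    for (float v : index_product(a, 4, r, c)) std::cout << v << " ";
+    std::cout << "\n";
+
+    // Coordinate pairs: 14 5 11
+    for (float v : index_coords(a, 4, r, c)) std::cout << v << " ";
+    std::cout << "\n";
+    return 0;
+}
+```
+
+Reading the first output three values at a time gives the columns of the 3x3
+matrix shown for array indexing; the second output matches the three
+coordinate pairs.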
+
+Boolean (b8) arrays can be used to index into another array. In this type of
+indexing, the elements at which the boolean array is non-zero are selected. If
+we want to select all values less than 5, we can pass a boolean expression into
+the parenthesis operator.
+
+\snippet test/index.cpp index_tutorial_boolean
+
+\f[
+out =
+\begin{bmatrix}
+0 \\ 1 \\ 2 \\ 3 \\ 4
+\end{bmatrix}
+\f]
+
+## References and copies
+
+All ArrayFire indexing functions return af::array objects (technically, objects
+of an array_proxy class). These objects may be new arrays or they may reference
+the original array depending on the type of indexing that was performed on them.
+
+- If an array was indexed using another af::array or it was indexed using the
+af::approx functions, then a new array is created. It does not reference the
+original data.
+- If an array was indexed using a scalar, af::seq or af::span, then
+the resulting array will reference the original data IF the first dimension is
+contiguous. The following lines will not allocate additional memory.
+
+\note The new arrays, whether references or newly allocated arrays, are
+independent of the original data. This means that any changes to the original
+array will not propagate to the references. Likewise, any changes to the
+reference arrays will not modify the original data.
+
+\snippet test/index.cpp index_tutorial_references
+
+The following code snippet shows some examples of indexing that will allocate
+new memory.
+
+\snippet test/index.cpp index_tutorial_copies
+
+Notice that even though the copy3 array is referencing contiguous memory in the
+original array, a new array is created because we used an array to index into
+the af::array.
+
+## Assignment
+
+An assignment on an af::array will replace the array with the result of the
+expression on the right-hand side of the equal (=) operator. This means that the
+type and shape of the result can be different from the array on the left-hand
+side of the equal operator. Assignments will not update the array that was
+previously referenced through an indexing operation. Here is an example:
+
+\snippet test/index.cpp index_tutorial_assignment
+
+The `ref` array is created by indexing into the data array. The initialized
+`ref` array points to the data array and does not allocate memory when it is
+created. After the matmul call, the `ref` array will not be pointing to the data
+array. The matmul call will not update the values of the data array.
+
+You can update the contents of an af::array by assigning through the parenthesis
+operator. For example, if you wanted to change the third column of the
+`A` array you can do that by assigning to `A(span, 2)`.
+
+\snippet test/index.cpp index_tutorial_assignment_third_column
+
+\f[
+ref =
+\begin{bmatrix}
+     8  \\
+     9  \\
+    10  \\
+    11
+\end{bmatrix}
+A =
+\begin{bmatrix}
+    0 & 4 & 3.14 & 12 \\
+    1 & 5 & 3.14 & 13 \\
+    2 & 6 & 3.14 & 14 \\
+    3 & 7 & 3.14 & 15
+\end{bmatrix}
+\f]
+
+This will update only the array being modified. If there are arrays that
+are referring to this array because of an indexing operation, those values
+will remain unchanged.
+
+Allocation will only be performed if there are other arrays referencing the data
+at the point of assignment. In the previous example, an allocation will be
+performed when assigning to the `A` array because the `ref` array is pointing
+to the original data. Here is another example demonstrating when an allocation
+will occur:
+
+\snippet test/index.cpp index_tutorial_assignment_alloc
+
+In this example, no allocation will take place because when the `ref` object
+is created, it is pointing to `A`'s data. Once it goes out of scope, nothing
+else references `A`'s data, therefore when the assignment takes place, the data
+is modified in place instead of being copied to a new address.
+
+You can also assign to arrays using another af::array as an indexing array.
+This works in a similar way to the other types of assignment but care must be
+taken to ensure that the indexes are unique. Non-unique indexes will result in a
+race condition which will cause non-deterministic values.
+
+\snippet test/index.cpp index_tutorial_assignment_race_condition
+
+\f[
+idx =
+\begin{bmatrix}
+     4  \\
+     3  \\
+     4  \\
+     0
+\end{bmatrix}
+vals =
+\begin{bmatrix}
+     9  \\
+     8  \\
+     7  \\
+     6
+\end{bmatrix}
+\\
+A =
+\begin{bmatrix}
+    6 & 9\ or\ 7 &  8 & 12 \\
+    1 &   5    &  9 & 13 \\
+    2 &   6    & 10 & 14 \\
+    8 &   7    & 11 & 15
+\end{bmatrix}
+\f]
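+
+One way to guard against this is to check the index values on the host before
+using them for assignment. The sketch below is plain C++ and assumes the
+indices are already available in host memory; `has_duplicates` is a
+hypothetical helper, not an ArrayFire function.
+
+```cpp
+#include <algorithm>
+#include <iostream>
+#include <vector>
+
+// Hypothetical host-side check: returns true if any index appears more than
+// once. The vector is taken by value so the caller's copy is left unsorted.
+bool has_duplicates(std::vector<int> idx) {
+    std::sort(idx.begin(), idx.end());
+    return std::adjacent_find(idx.begin(), idx.end()) != idx.end();
+}
+
+int main() {
+    std::cout << std::boolalpha;
+    std::cout << has_duplicates({4, 3, 4, 0}) << "\n";  // idx from above: true
+    std::cout << has_duplicates({4, 3, 1, 0}) << "\n";  // unique indices: false
+    return 0;
+}
+```
+
+With the `idx` values from the example, index 4 appears twice, which is why
+the corresponding element ends up as either 9 or 7.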
+
+## Member functions
+
+There are several member functions which allow you to index into an af::array. These
+functions have similar functionality but may be easier to parse for some.
+
 * [row(i)](\ref af::array::row) or [col(i)](\ref af::array::col) specifying a single row/column
 * [rows(first,last)](\ref af::array::rows) or [cols(first,last)](\ref af::array::cols)
- specifying a span of rows or columns
+  specifying multiple rows or columns
+* [slice(i)](\ref af::array::slice) or [slices(first, last)](\ref af::array::slices) to
+  select one or a range of slices
+
+## Additional examples
 
-See \ref indexing for the full listing.
+See \ref index_mat for the full listing.
 
 \snippet test/index.cpp ex_indexing_first
 
diff --git a/docs/pages/install.md b/docs/pages/install.md
new file mode 100644
index 0000000000..01b268af34
--- /dev/null
+++ b/docs/pages/install.md
@@ -0,0 +1,127 @@
+# ArrayFire Installer {#installing}
+
+Installing ArrayFire couldn't be easier. Navigate to
+https://arrayfire.com/download and download the appropriate installer for the
+target architecture and operating system. Although ArrayFire can be [built
+from source](https://github.com/arrayfire/arrayfire), the installers
+conveniently package necessary dependencies.
+
+Install the latest device drivers before using ArrayFire. If you target the
+CPU using ArrayFire’s OpenCL backend, install the OpenCL runtime. Drivers and
+runtimes should be downloaded and installed from each device vendor's website.
+
+# Install Instructions {#InstallInstructions}
+
+* [Windows](#Windows)
+* [Linux](#Linux)
+* [macOS](#macOS)
+
+## Windows {#Windows}
+
+Once the ArrayFire installer has been downloaded, run the installer.
+
+The installer offers the option to automatically add ArrayFire to the path for
+all users. If the installer did not do this, simply append `%%AF_PATH%/lib` to
+the PATH variable so that the loader can find ArrayFire DLLs.
+
+For more information on using ArrayFire on Windows, visit the following
+[page](http://arrayfire.org/docs/using_on_windows.htm).
+
+## Linux {#Linux}
+
+There are two ways to install ArrayFire on Linux.
+1. Package Manager
+2. Using the ArrayFire Linux Installer
+
+As of today, approach (1) is only supported for Ubuntu 18.04 and 20.04. Please
+go through [the GitHub wiki
+page](https://github.com/arrayfire/arrayfire/wiki/Install-ArrayFire-From-Linux-Package-Managers)
+for detailed instructions.
+
+For approach (2), once the ArrayFire installer is downloaded, execute the
+installer from the terminal as shown below. Set the `--prefix` argument to the
+target install directory; we recommend `/opt`.
+
+    ./ArrayFire_*_Linux_x86_64.sh --include-subdir --prefix=/opt
+
+Given sudo permissions, the ArrayFire libraries can be added to the path via
+`ldconfig` like so:
+
+    echo /opt/arrayfire/lib64 | sudo tee /etc/ld.so.conf.d/arrayfire.conf
+    sudo ldconfig
+
+Otherwise, the `LD_LIBRARY_PATH` environment variable can be set so that the
+shared library loader can find the ArrayFire libraries.
+
+For more information on using ArrayFire on Linux, visit the following
+[page](http://arrayfire.org/docs/using_on_linux.htm).
+
+### Graphics support
+
+ArrayFire enables high-performance visualizations via the
+[Forge](https://github.com/arrayfire/forge) library. On Linux, there are a few
+dependencies to install to enable graphics support:
+
+* FreeImage
+* Fontconfig
+* GLU (OpenGL Utility Library)
+
+To install these dependencies on common Linux distributions:
+
+__Debian, Ubuntu (14.04 and above), and other Debian derivatives__
+
+    apt install build-essential libfreeimage3 libfontconfig1 libglu1-mesa
+
+__Fedora, Redhat, CentOS__
+
+    yum install freeimage fontconfig mesa-libGLU
+
+
+## macOS {#macOS}
+
+Once the ArrayFire installer has been downloaded, execute the installer by
+either double-clicking on the ArrayFire `pkg` file or running the following
+command:
+
+    sudo installer -pkg Arrayfire-*_OSX.pkg -target /
+
+For more information on using ArrayFire on macOS, visit the following
+[page](http://arrayfire.org/docs/using_on_osx.htm).
+
+## NVIDIA Tegra devices
+
+ArrayFire is capable of running on TX2 devices.
+
+Before installing ArrayFire, make sure the latest version of JetPack (v2.3 and
+above) or L4T (v24.2 and above) is installed.
+
+### Tegra prerequisites
+
+The following dependencies are required for Tegra devices:
+
+    sudo apt install libopenblas-dev liblapacke-dev
+
+## Testing installation
+
+After ArrayFire is finished installing, we recommend building and running a
+few of the provided examples to verify things are working as expected.
+
+On Windows, open the CMakeLists.txt file from CMake-GUI. Once the project is
+configured and generated, build and run the examples from Visual Studio.
+
+On Linux, run the following commands:
+
+    cp -r /opt/arrayfire/share/ArrayFire/examples /tmp/examples
+    cd /tmp/examples
+    mkdir build
+    cd build
+    cmake ..
+    make
+    ./helloworld/helloworld_{cpu,cuda,oneapi,opencl}
+
+## <a name="GettingHelp"></a> Getting help
+
+* Google Groups: https://groups.google.com/forum/#!forum/arrayfire-users
+* ArrayFire Services:  [Consulting](https://arrayfire.com/consulting/)  |  [Training](https://arrayfire.com/training/)
+* ArrayFire Blogs: http://arrayfire.com/blog/
+* Email: <mailto:support@arrayfire.com>
diff --git a/docs/pages/interop_cuda.md b/docs/pages/interop_cuda.md
new file mode 100644
index 0000000000..2132dfcb2c
--- /dev/null
+++ b/docs/pages/interop_cuda.md
@@ -0,0 +1,237 @@
+Interoperability with CUDA {#interop_cuda}
+========
+
+Although ArrayFire is quite extensive, there remain many cases in which you
+may want to write custom kernels in CUDA or [OpenCL](\ref interop_opencl).
+For example, you may wish to add ArrayFire to an existing code base to increase
+your productivity, or you may need to supplement ArrayFire's functionality
+with your own custom implementation of specific algorithms.
+
+ArrayFire manages its own memory, runs within its own CUDA stream, and
+creates custom IDs for devices. As such, most of the interoperability functions
+focus on reducing potential synchronization conflicts between ArrayFire and CUDA.
+
+# Basics
+
+It is fairly straightforward to interface ArrayFire with your own custom CUDA
+code. ArrayFire provides several functions to ease this process including:
+
+| Function              | Purpose                                             |
+|-----------------------|-----------------------------------------------------|
+| af::array(...)        | Construct an ArrayFire Array from device memory     |
+| af::array.device()    | Obtain a pointer to the device memory (implies `lock()`) |
+| af::array.lock()      | Removes ArrayFire's control of a device memory pointer |
+| af::array.unlock()    | Restores ArrayFire's control over a device memory pointer |
+| af::getDevice()       | Gets the current ArrayFire device ID                |
+| af::setDevice()       | Switches ArrayFire to the specified device          |
+| afcu::getNativeId()   | Converts an ArrayFire device ID to a CUDA device ID |
+| afcu::setNativeId()   | Switches ArrayFire to the specified CUDA device ID  |
+| afcu::getStream()     | Get the current CUDA stream used by ArrayFire       |
+
+
+Below we provide two worked examples on how ArrayFire can be integrated
+into new and existing projects.
+
+# Adding custom CUDA kernels to an existing ArrayFire application
+
+By default, ArrayFire manages its own memory and operates in its own CUDA
+stream. Thus there is a slight amount of bookkeeping that needs to be done
+in order to integrate your custom CUDA kernel.
+
+If your kernels can share the ArrayFire CUDA stream, you should:
+
+1. Include the 'af/cuda.h' header in your source code
+2. Use ArrayFire as normal
+3. Ensure any JIT kernels have executed using `af::eval()`
+4. Obtain device pointers from ArrayFire array objects using `array::device()`
+5. Determine ArrayFire's CUDA stream
+6. Set arguments and run your kernel in ArrayFire's stream
+7. Return control of af::array memory to ArrayFire
+8. Compile with `nvcc`, linking with the `afcuda` library.
+
+Notice that since ArrayFire and your kernels are sharing the same CUDA
+stream, there is no need to perform any synchronization operations as
+operations within a stream are executed in order.
+
+This process is best illustrated with a fully worked example:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// 1. Add includes
+#include <arrayfire.h>
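+#include <af/cuda.h>
+
+// Illustrative kernel, assumed here because the original listing does not
+// define it: adds one to each element, one thread per element.
+__global__ void increment(float *x) { x[threadIdx.x] += 1.0f; }
+
+int main() {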
+
+    // 2. Use ArrayFire as normal
+    size_t num = 10;
+    af::array x = af::constant(0, num);
+
+    // ... many ArrayFire operations here
+
+    // 3. Ensure any JIT kernels have executed
+    x.eval();
+    af_print(x);
+
+    // Run a custom CUDA kernel in the ArrayFire CUDA stream
+
+    // 4. Obtain device pointers from ArrayFire array objects using
+    //    the array::device() function:
+    float *d_x = x.device<float>();
+
+    // 5. Determine ArrayFire's CUDA stream
+    int af_id = af::getDevice();
+    cudaStream_t af_cuda_stream = afcu::getStream(af_id);
+
+    // 6. Set arguments and run your kernel in ArrayFire's stream
+    //    Here launch with 1 block of 10 threads
+    increment<<<1, num, 0, af_cuda_stream>>>(d_x);
+
+    // 7. Return control of af::array memory to ArrayFire using
+    //    the array::unlock() function:
+    x.unlock();
+
+    // ... resume ArrayFire operations
+    af_print(x);
+
+    // Because the device pointer `d_x` was returned to ArrayFire's
+    // control by the unlock function, there is no need to free them using
+    // cudaFree()
+
+    return 0;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If your kernels need to operate in their own CUDA stream, the process is
+essentially identical, except you need to instruct ArrayFire to complete
+its computations using the af::sync() function prior to launching your
+own kernel and ensure your kernels are complete using `cudaDeviceSynchronize()`
+(or similar) commands prior to returning control of the memory to ArrayFire:
+
+1. Include the 'af/cuda.h' header in your source code
+2. Use ArrayFire as normal
+3. Ensure any JIT kernels have executed using `af::eval()`
+4. Instruct ArrayFire to finish operations using af::sync()
+5. Obtain device pointers from ArrayFire array objects using `array::device()`
+6. Determine ArrayFire's CUDA stream
+7. Set arguments and run your kernel in your custom stream
+8. Ensure CUDA operations have finished using `cudaDeviceSynchronize()`
+   or similar commands.
+9. Return control of af::array memory to ArrayFire
+10. Compile with `nvcc`, linking with the `afcuda` library.
+
+# Adding ArrayFire to an existing CUDA application
+
+Adding ArrayFire to an existing CUDA application is slightly more involved
+and can be somewhat tricky due to several optimizations we implement. The
+most important are as follows:
+
+* ArrayFire assumes control of all memory provided to it.
+* ArrayFire does not (in general) support in-place memory transactions.
+
+We will discuss the implications of these items below. To add ArrayFire
+to existing code you need to:
+
+1. Include `arrayfire.h` and `af/cuda.h` in your source file
+2. Finish any pending CUDA operations
+   (e.g. use cudaDeviceSynchronize() or similar stream functions)
+3. Create ArrayFire arrays from existing CUDA pointers
+4. Perform operations on ArrayFire arrays
+5. Instruct ArrayFire to finish operations using af::eval() and af::sync()
+6. Obtain pointers to important memory
+7. Continue your CUDA application.
+8. Free non-managed memory
+9. Compile and link with the appropriate paths and the `-lafcuda` flags.
+
+To create the af::array objects, you should use one of the following
+constructors with `src=afDevice`:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// 1D - 4D af::array constructors
+af::array (dim_t dim0, const T *pointer, af::source src=afHost)
+af::array (dim_t dim0, dim_t dim1, const T *pointer, af::source src=afHost)
+af::array (dim_t dim0, dim_t dim1, dim_t dim2, const T *pointer, af::source src=afHost)
+af::array (dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3, const T *pointer, af::source src=afHost)
+
+// af::array constructor using a dim4 object
+af::array (const dim4 &dims, const T *pointer, af::source src=afHost)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*NOTE*: With all of these constructors, ArrayFire's memory manager automatically
+assumes responsibility for any memory provided to it. Thus ArrayFire could free
+or reuse the memory at any later time. If this behavior is not desired, you
+may call `array::lock()` and manage the memory yourself. However, if you do
+so, please be cautious not to free memory when ArrayFire might be using it!
+
+The steps above are best illustrated using a fully-worked example:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// 1. Add includes
+#include <arrayfire.h>
+#include <af/cuda.h>
+
+using namespace std;
+
+int main() {
+
+    // Create and populate CUDA memory objects
+    const int elements = 100;
+    size_t size = elements * sizeof(float);
+    float *cuda_A;
+    cudaMalloc((void**) &cuda_A, size);
+
+    // ... perform many CUDA operations here
+
+    // 2. Finish any pending CUDA operations
+    cudaDeviceSynchronize();
+
+    // 3. Create ArrayFire arrays from existing CUDA pointers.
+    //    Be sure to specify that the memory type is afDevice.
+    af::array d_A(elements, cuda_A, afDevice);
+
+    // NOTE: ArrayFire now manages cuda_A
+
+    // 4. Perform operations on the ArrayFire Arrays.
+    d_A = d_A * 2;
+
+    // NOTE: ArrayFire does not perform the above transaction using
+    // in-place memory, thus the pointer to the memory backing d_A has
+    // likely changed.
+
+    // 5. Instruct ArrayFire to finish pending operations using eval and sync.
+    af::eval(d_A);
+    af::sync();
+
+    // 6. Get pointers to important memory objects.
+    //    Once device is called, ArrayFire will not manage the memory.
+    float * outputValue = d_A.device<float>();
+
+    // 7. continue CUDA application as normal
+
+    // 8. Free non-managed memory
+    //    Because we removed outputValue from ArrayFire's control, we must free it
+    cudaFree(outputValue);
+
+    return 0;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+# Using multiple devices
+
+If you are using multiple devices with ArrayFire and CUDA kernels, there is
+one "gotcha" of which you should be aware. ArrayFire implements its own internal
+order of compute devices, thus a CUDA device ID may not be the same as an
+ArrayFire device ID. Thus when switching between devices it is important
+that you use our interoperability functions to get/set the correct device
+IDs. Below is a quick listing of the various functions needed to switch
+between devices along with some disambiguation as to the device identifiers
+used with each function:
+
+| Function            | ID Type     | Purpose                                 |
+|---------------------|-------------|-----------------------------------------|
+| cudaGetDevice()     | CUDA        | Gets the current CUDA device ID         |
+| cudaSetDevice()     | CUDA        | Sets the current CUDA device            |
+| af::getDevice()     | AF          | Gets the current ArrayFire device ID    |
+| af::setDevice()     | AF          | Sets the current ArrayFire device       |
+| afcu::getNativeId() | AF -> CUDA  | Convert an ArrayFire device ID to a CUDA device ID |
+| afcu::setNativeId() | CUDA -> AF  | Set the current ArrayFire device from a CUDA ID |
+
diff --git a/docs/pages/interop_opencl.md b/docs/pages/interop_opencl.md
new file mode 100644
index 0000000000..6c1a7122c6
--- /dev/null
+++ b/docs/pages/interop_opencl.md
@@ -0,0 +1,138 @@
+Interoperability with OpenCL {#interop_opencl}
+========
+
+Although ArrayFire is quite extensive, there remain many cases in which you
+may want to write custom kernels in OpenCL or [CUDA](\ref interop_cuda).
+For example, you may wish to add ArrayFire to an existing code base to increase
+your productivity, or you may need to supplement ArrayFire's functionality
+with your own custom implementation of specific algorithms.
+
+ArrayFire manages its own context, queue, memory, and creates custom IDs
+for devices. As such, most of the interoperability functions focus on reducing
+potential synchronization conflicts between ArrayFire and OpenCL.
+
+# Basics
+
+It is fairly straightforward to interface ArrayFire with your own custom OpenCL
+code. ArrayFire provides several functions to ease this process including:
+
+| Function              | Purpose                                             |
+|-----------------------|-----------------------------------------------------|
+| af::array(...)        | Construct an ArrayFire array from cl_mem references or cl::Buffer objects |
+| af::array.device()    | Obtain a pointer to the cl_mem reference (implies `lock()`) |
+| af::array.lock()      | Removes ArrayFire's control of a cl_mem buffer            |
+| af::array.unlock()    | Restores ArrayFire's control over a cl_mem buffer         |
+| afcl::getPlatform()   | Get ArrayFire's current cl_platform                       |
+| af::getDevice()       | Get the current ArrayFire Device ID                       |
+| afcl::getDeviceId()   | Get ArrayFire's current cl_device_id                      |
+| af::setDevice()       | Set ArrayFire's device from an ArrayFire device ID        |
+| afcl::setDeviceId()   | Set ArrayFire's device from a cl_device_id                |
+| afcl::setDevice()     | Set ArrayFire's device from a cl_device_id and cl_context |
+| afcl::getContext()    | Get ArrayFire's current cl_context                        |
+| afcl::getQueue()      | Get ArrayFire's current cl_command_queue                  |
+| afcl::getDeviceType() | Get the current afcl_device_type                          |
+
+Additionally, the OpenCL backend permits the programmer to add and remove custom
+devices from the ArrayFire device manager. These permit you to attach ArrayFire
+directly to the OpenCL queue used by other portions of your application.
+
+| Function              | Purpose                                           |
+|-----------------------|---------------------------------------------------|
+| afcl::addDevice()     | Add a new device to ArrayFire's device manager    |
+| afcl::deleteDevice()  | Remove a device from ArrayFire's device manager   |
+
+Below we provide two worked examples on how ArrayFire can be integrated
+into new and existing projects.
+
+# Adding custom OpenCL kernels to an existing ArrayFire application
+
+By default, ArrayFire manages its own context, queue, memory, and creates custom
+IDs for devices. Thus there is some bookkeeping that needs to be done to
+integrate your custom OpenCL kernel.
+
+If your kernels can operate in the same queue as ArrayFire, you should:
+
+1. Add an include for `af/opencl.h` to your project
+2. Obtain the OpenCL context, device, and queue used by ArrayFire
+3. Obtain cl_mem references to af::array objects
+4. Load, build, and use your kernels
+5. Return control of af::array memory to ArrayFire
+
+Note, ArrayFire uses an in-order queue, thus when ArrayFire and your kernels
+are operating in the same queue, there is no need to perform any
+synchronization operations.
+
+This process is best illustrated with a fully worked example:
+
+\snippet test/interop_opencl_custom_kernel_snippet.cpp interop_opencl_custom_kernel_snippet
+
+If your kernels need to operate in their own OpenCL queue, the process is
+essentially identical, except you need to instruct ArrayFire to complete
+its computations using the af::sync() function prior to launching your
+own kernel and ensure your kernels are complete using `clFinish`
+(or similar) commands prior to returning control of the memory to ArrayFire:
+
+1. Add an include for `af/opencl.h` to your project
+2. Obtain the OpenCL context, device, and queue used by ArrayFire
+3. Obtain cl_mem references to af::array objects
+4. Instruct ArrayFire to finish operations using af::sync()
+5. Load, build, and use your kernels
+6. Instruct OpenCL to finish operations using clFinish() or similar commands.
+7. Return control of af::array memory to ArrayFire
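+
+The separate-queue workflow above might look roughly as follows. This is a
+sketch under stated assumptions rather than a definitive implementation:
+`runOnOwnQueue` is a hypothetical helper, `kernel` is assumed to be already
+built and to take a single `__global float *` argument, `myQueue` is assumed to
+have been created on the context returned by afcl::getContext(), and error
+checking is omitted:
+
+```cpp
+#include <arrayfire.h>
+#include <af/opencl.h>
+
+// Hypothetical helper: run a prebuilt kernel on `a` in a user-owned queue.
+void runOnOwnQueue(af::array &a, cl_command_queue myQueue, cl_kernel kernel) {
+    // Obtain the cl_mem reference; this also locks the buffer so that
+    // ArrayFire stops managing it until unlock() is called.
+    cl_mem *d_a = a.device<cl_mem>();
+
+    // Make sure ArrayFire has finished its own work on its queue.
+    af::sync();
+
+    // Launch the custom kernel on the separate queue.
+    const size_t global = static_cast<size_t>(a.elements());
+    clSetKernelArg(kernel, 0, sizeof(cl_mem), d_a);
+    clEnqueueNDRangeKernel(myQueue, kernel, 1, nullptr, &global, nullptr,
+                           0, nullptr, nullptr);
+
+    // Wait for the custom kernel before ArrayFire touches the buffer again.
+    clFinish(myQueue);
+
+    // Return control of the memory to ArrayFire.
+    a.unlock();
+}
+```
+
+When the kernel runs in ArrayFire's own queue (obtained with afcl::getQueue()),
+the `af::sync()` and `clFinish()` calls can be dropped, because the in-order
+queue already serializes the work.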
+
+# Adding ArrayFire to an existing OpenCL application
+
+Adding ArrayFire to an existing OpenCL application is slightly more involved
+and can be somewhat tricky due to several optimizations we implement. The
+most important are as follows:
+
+* ArrayFire assumes control of all memory provided to it.
+* ArrayFire does not (in general) support in-place memory transactions.
+
+We will discuss the implications of these items below. To add ArrayFire
+to existing code you need to:
+
+1. Add includes
+2. Instruct OpenCL to complete its operations using clFinish (or similar)
+3. Instruct ArrayFire to use the user-created OpenCL Context
+4. Create ArrayFire arrays from OpenCL memory objects
+5. Perform ArrayFire operations on the Arrays
+6. Instruct ArrayFire to finish operations using af::sync()
+7. Obtain cl_mem references for important memory
+8. Continue your OpenCL application
+
+To create the af::array objects, you should use one of the following
+constructors:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// 1D - 4D af::array constructors
+static af::array    array (dim_t dim0, cl_mem buf, af::dtype type, bool retain=false)
+static af::array    array (dim_t dim0, dim_t dim1, cl_mem buf, af::dtype type, bool retain=false)
+static af::array    array (dim_t dim0, dim_t dim1, dim_t dim2, cl_mem buf, af::dtype type, bool retain=false)
+static af::array    array (dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3, cl_mem buf, af::dtype type, bool retain=false)
+
+// af::array constructor using a dim4 object
+static af::array    array (af::dim4 idims, cl_mem buf, af::dtype type, bool retain=false)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+*NOTE*: With all of these constructors, ArrayFire's memory manager automatically
+assumes responsibility for any memory provided to it. If you are creating
+an array from a `cl::Buffer`, you should specify `retain=true` to ensure your
+memory is not deallocated if your `cl::Buffer` were to go out of scope.
+We use this technique in the example below.
+If you do not wish for ArrayFire to manage your memory, you may call the
+`array::unlock()` function and manage the memory yourself; however, if you do
+so, please be cautious not to call `clReleaseMemObject` on a `cl_mem` when
+ArrayFire might be using it!
+
+The eight steps above are best illustrated using a fully-worked example. Below we
+use the OpenCL C++ API and omit error checking to keep the code readable.
+
+\snippet test/interop_opencl_external_context_snippet.cpp interop_opencl_external_context_snippet
+
+# Using multiple devices
+
+If you are using ArrayFire and OpenCL with multiple devices be sure to use
+`afcl::addDevice` to add your custom context + device + queue to ArrayFire's
+device manager. This will let you switch ArrayFire devices using your current
+`cl_device_id` and `cl_context`.
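+
+A minimal sketch of that workflow is shown below. The helper name
+`sumOnUserQueue` is hypothetical, the device, context, and queue are assumed to
+have been created by your application, and error checking is omitted.
+Switching back to the previous device before calling `afcl::deleteDevice` is a
+conservative choice made for this sketch:
+
+```cpp
+#include <arrayfire.h>
+#include <af/opencl.h>
+
+// Hypothetical helper: run a small ArrayFire computation on a user-owned
+// context + device + queue, then detach it from ArrayFire again.
+float sumOnUserQueue(cl_device_id dev, cl_context ctx, cl_command_queue que) {
+    const int previous = af::getDevice();  // ArrayFire ID currently in use
+
+    afcl::addDevice(dev, ctx, que);        // register with the device manager
+    afcl::setDevice(dev, ctx);             // make it the active device
+
+    float total = 0.0f;
+    {
+        // Arrays created here live on the user-provided device and queue.
+        af::array ones = af::constant(1.0f, 10);
+        total = af::sum<float>(ones);
+    }  // release the arrays before removing the device
+
+    af::setDevice(previous);               // switch away before removing
+    afcl::deleteDevice(dev, ctx);
+    return total;
+}
+```
+
+After `afcl::addDevice`, the new entry behaves like any other ArrayFire device,
+so it can also be selected later through `afcl::setDevice(dev, ctx)` whenever
+your application switches OpenCL devices.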
diff --git a/docs/pages/jit.md b/docs/pages/jit.md
new file mode 100644
index 0000000000..8b5c783755
--- /dev/null
+++ b/docs/pages/jit.md
@@ -0,0 +1,102 @@
+ArrayFire JIT Code Generation {#jit}
+================
+
+The ArrayFire library offers JIT (Just In Time) compilation for element-wise
+operations. These include arithmetic, trigonometric functions, and
+comparisons.
+
+At runtime, ArrayFire aggregates these function calls using an Abstract Syntax
+Tree (AST) data structure such that whenever a JIT-supported function is
+called, it is added into the AST for a given variable instance. The AST of the
+variable is computed if one of the following conditions is met:
+
+* an explicit evaluation is required by the programmer using the
+  [eval](\ref af::eval) function, or
+* the variable is required to compute a different variable that is not
+  JIT-supported.
+
+When the above occurs, and the variable needs to be evaluated, the functions
+and variables in the AST data structure are used to create a single
+kernel. This is done by creating a customized kernel on-the-fly that is made
+up of all the functions in the AST. The customized function is then executed.
+
+This JIT compilation technique has multiple benefits:
+
+* A reduced number of kernel calls – a kernel call can be a significant
+  overhead for small data sets.
+* Better cache performance – there are many instances in which the memory
+  required by a single element in the array can be reused multiple times, or
+  the temporary value of a computation can be stored in the cache and reused
+  by future computations.
+* Temporary memory allocation and write-back can be reduced – when multiple
+  expressions are evaluated and stored into temporary arrays, these arrays
+  need to be allocated and the results written back to main memory.
+* Avoid computing elements that are not used – there are cases in which the
+  AST is created for a variable; however, the expression is not used later in
+  the computation. Thus, its evaluation can be avoided.
+* Better performance – all the above can help reduce the total execution time.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// As JIT is automatically enabled in ArrayFire, this version of the function
+// forces each expression to be evaluated. If the eval() function calls are
+// removed, then the execution of this code would be equivalent to the
+// following function.
+
+static double pi_no_jit(array x, array y, array temp, int samples) {
+        temp = x * x;
+        temp.eval();
+        temp += y * y;
+        temp.eval();
+        temp = sqrt(temp);
+        temp.eval();
+        temp = temp < 1;
+        temp.eval();
+        return 4.0 * sum<float>(temp) / samples;
+}
+
+static double pi_jit(array x, array y, array temp,int samples){
+        temp = sqrt(x*x + y*y) < 1;
+        temp.eval();
+        return 4.0 * sum<float>(temp) / samples;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The above code computes the value of π using a Monte-Carlo simulation where
+points are randomly generated within the unit square. Each point is tested to
+see if it is within the unit circle. The part of the unit circle inside the
+square has area π/4, so four times the fraction of points that pass the test
+approximates π. The accuracy improves as the number of samples is increased.
+
+There are two implementations above:
+1. an implementation that does not benefit from the JIT (pi\_no\_jit), and
+2. an implementation that takes advantage of the JIT feature (pi\_jit).
+
+Specifically, as JIT is an integral feature of the ArrayFire library, it
+cannot simply be turned on and off. The only way for a programmer to sidestep
+the JIT operations is to manually force the evaluation of expressions. This is
+done in the non-JIT-supported implementation.
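+
+To make the difference concrete, here is a plain C++ analogy of the two
+versions. It is a CPU sketch only and not the code that ArrayFire generates;
+the names `pi_unfused`, `pi_fused`, and `total` are hypothetical. The unfused
+version makes one pass over memory per operation, while the fused version
+evaluates the whole expression for each element in a single pass:
+
+```cpp
+#include <cassert>
+#include <cmath>
+#include <cstddef>
+#include <vector>
+
+// Reduction shared by both versions (stands in for sum()).
+static double total(const std::vector<float> &v) {
+    double s = 0.0;
+    for (float e : v) s += e;
+    return s;
+}
+
+// Analogy for pi_no_jit: every step reads and writes the temporary array.
+double pi_unfused(const std::vector<float> &x, const std::vector<float> &y) {
+    assert(!x.empty() && x.size() == y.size());
+    const std::size_t n = x.size();
+    std::vector<float> temp(n);
+    for (std::size_t i = 0; i < n; ++i) temp[i] = x[i] * x[i];
+    for (std::size_t i = 0; i < n; ++i) temp[i] += y[i] * y[i];
+    for (std::size_t i = 0; i < n; ++i) temp[i] = std::sqrt(temp[i]);
+    for (std::size_t i = 0; i < n; ++i) temp[i] = temp[i] < 1.0f ? 1.0f : 0.0f;
+    return 4.0 * total(temp) / static_cast<double>(n);
+}
+
+// Analogy for pi_jit: one pass, intermediates never leave registers.
+double pi_fused(const std::vector<float> &x, const std::vector<float> &y) {
+    assert(!x.empty() && x.size() == y.size());
+    const std::size_t n = x.size();
+    std::vector<float> temp(n);
+    for (std::size_t i = 0; i < n; ++i) {
+        temp[i] = std::sqrt(x[i] * x[i] + y[i] * y[i]) < 1.0f ? 1.0f : 0.0f;
+    }
+    return 4.0 * total(temp) / static_cast<double>(n);
+}
+```
+
+Both functions return the same estimate; the fused one simply touches `temp`
+once instead of four times, which is the effect the JIT achieves by emitting a
+single kernel.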
+
+Timing these two implementations results in the following performance
+benchmark:
+
+<img src="jit_cuda1.webp" alt="Performance of JIT and Non-JIT implementations"
+width="100%" />
+
+
+The above figure depicts the execution time (ordinate) as a function of the
+number of samples (abscissa) for the two implementations discussed above.
+
+When the number of samples is small, the execution time of pi\_no\_jit is
+dominated by the launch of multiple kernels and the execution time of pi\_jit is
+dominated by on-the-fly compilation of the JIT code required to launch a
+single kernel. Even with this JIT compilation time, pi\_jit outperforms
+pi\_no\_jit by 1.4-2.0X for smaller sample sizes.
+
+When the number of samples is large, both the kernel launch overhead and the
+JIT code creation are no longer the limiting factors – the kernel’s
+computational load dominates the execution time. Here, the pi\_jit outperforms
+pi\_no\_jit by 2.0-2.7X.
+
+Many applications benefit from JIT code generation, although the actual
+performance gain is application-dependent.
+
diff --git a/docs/pages/matrix_manipulation.md b/docs/pages/matrix_manipulation.md
index 8fd7b35355..38e7219069 100644
--- a/docs/pages/matrix_manipulation.md
+++ b/docs/pages/matrix_manipulation.md
@@ -1,39 +1,404 @@
-Matrix Manipulation {#matrixmanipulation}
+Array and Matrix Manipulation {#matrixmanipulation}
 ===================
 
-Many different kinds of [matrix manipulation routines](\ref manip_mat) are available:
-* tile() to repeat a matrix along dimensions
-* join() to concatenate two matrices along a dimension
-* [array()](\ref af::array) to adjust the dimensions of an array
-* [transpose](\ref af::array::T) a matrix or vector
+ArrayFire provides several different methods for
+[manipulating arrays and matrices](\ref manip_mat). The functionality includes:
 
-tile() allows you to repeat a matrix along specified
-dimensions, effectively 'tiling' the matrix.  Please note that the
-dimensions passed in indicate the number of times to replicate the
-matrix in each dimension, not the final dimensions of the matrix.
+* moddims() - change the dimensions of an array without changing the data
+* array() - create a (shallow) copy of an array with different dimensions.
+* flat() - flatten an array to one dimension
+* flip() - flip an array along a dimension
+* join() - join up to 4 arrays
+* reorder() - changes the dimension order within the array
+* shift() - shifts data along a dimension
+* tile() - repeats an array along a dimension
+* transpose() - performs a matrix transpose
+* [T()](\ref af::array::T) - transpose a matrix or vector (shorthand notation)
+* [H()](\ref af::array::H) - Hermitian Transpose (conjugate-transpose) a matrix
 
-\snippet test/matrix_manipulation.cpp ex_matrix_manipulation_tile
+Below we provide several examples of these functions and their use.
 
-join() allows you to joining two matrices together.  Matrix
-dimensions must match along every dimension except the dimension
-of joining (dimensions are 0-indexed). For example, a 2x3 matrix
-can be joined with a 2x4 matrix along dimension 1, but not along
-dimension 0 since {3,4} don`t match up.
+## flat()
 
-\snippet test/matrix_manipulation.cpp ex_matrix_manipulation_join
+The __flat()__ function flattens an array to one dimension:
 
-Construct a regular mesh grid from vectors `x` and `y`. For example, a
-mesh grid of the vectors {1,2,3,4} and {5,6} would result in two matrices:
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [3 3 1 1]
+    1.0000     4.0000     7.0000
+    2.0000     5.0000     8.0000
+    3.0000     6.0000     9.0000
 
-\snippet test/matrix_manipulation.cpp ex_matrix_manipulation_mesh
+flat(a) [9 1 1 1]
+    1.0000
+    2.0000
+    3.0000
+    4.0000
+    5.0000
+    6.0000
+    7.0000
+    8.0000
+    9.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The flat function can be called from C and C++ as follows:
+
+> __af_err af_flat(af_array* out, const af_array in)__
+> --  C interface for flat() function
+
+> __array af::flat(const array& in)__
+> --  C++ interface for flat() function
+
+## flip()
+
+The __flip()__ function flips the contents of an array along a chosen dimension.
+In the example below, we show the 5x2 array flipped along the zeroth (i.e.
+within a column) and first (i.e. across rows) axes:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [5 2 1 1]
+    1.0000     6.0000
+    2.0000     7.0000
+    3.0000     8.0000
+    4.0000     9.0000
+    5.0000    10.0000
+
+flip(a, 0) [5 2 1 1]
+    5.0000    10.0000
+    4.0000     9.0000
+    3.0000     8.0000
+    2.0000     7.0000
+    1.0000     6.0000
+
+flip(a, 1) [5 2 1 1]
+    6.0000     1.0000
+    7.0000     2.0000
+    8.0000     3.0000
+    9.0000     4.0000
+   10.0000     5.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The flip function can be called from C and C++ as follows:
+
+> __af_err af_flip(af_array *out, const af_array in, const unsigned dim)__
+> --  C interface for flip()
+
+> __array af::flip(const array &in, const unsigned dim)__
+> --  C++ interface for flip()
+
+## join()
+
+The __join()__ function joins arrays along a specific dimension. The C++
+interface can join up to four arrays whereas the C interface supports up to 10
+arrays. Here is an example of how to join an array to itself:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [5 1 1 1]
+    1.0000
+    2.0000
+    3.0000
+    4.0000
+    5.0000
+
+join(0, a, a) [10 1 1 1]
+    1.0000
+    2.0000
+    3.0000
+    4.0000
+    5.0000
+    1.0000
+    2.0000
+    3.0000
+    4.0000
+    5.0000
+
+join(1, a, a) [5 2 1 1]
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+    4.0000     4.0000
+    5.0000     5.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The join function has several candidate functions in C:
+
+> __af_err af_join(af_array *out, const int dim, const af_array first, const af_array second)__
+> --  C interface function to join 2 arrays along a dimension
+
+> __af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays, const af_array *inputs)__
+> --  C interface function to join up to 10 arrays along a dimension
+
+and in C++:
+
+> __array af::join(const int dim, const array &first, const array &second)__
+> --  Joins 2 arrays along a dimension
+
+> __array af::join(const int dim, const array &first, const array &second, const array &third)__
+> --  Joins 3 arrays along a dimension.
+
+> __array af::join(const int dim, const array &first, const array &second, const array &third, const array &fourth)__
+> --  Joins 4 arrays along a dimension
+
+
+## moddims()
+
+The __moddims()__ function changes the dimensions of an array without changing
+its data or order. Note that this function modifies only the _metadata_
+associated with the array. It does not modify the content of the array.
+Here is an example of moddims() converting an 8x1 array into a 2x4 and then
+back to a 8x1:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [8 1 1 1]
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+
+af::dim4 new_dims(2, 4);
+moddims(a, new_dims) [2 4 1 1]
+    1.0000     1.0000     1.0000     1.0000
+    2.0000     2.0000     2.0000     2.0000
+
+moddims(a, a.elements(), 1, 1, 1) [8 1 1 1]
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+    1.0000
+    2.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The moddims function has a single form in the C API:
+
+> __af_err af_moddims(af_array *out, const af_array in, const unsigned ndims, const dim_t *const dims)__
+> --  C interface to mod dimensions of an array
+
+And several overloaded candidates in the C++ API:
+
+> __array af::moddims(const array &in, const unsigned ndims, const dim_t *const dims)__
+> --  mods number of dimensions to match _ndims_ as specified in the array _dims_
+
+> __array af::moddims(const array &in, const dim4 &dims)__
+> --  mods dimensions as specified by _dims_
+
+> __array af::moddims(const array &in, const dim_t d0, const dim_t d1=1, const dim_t d2=1, const dim_t d3=1)__
+> --  mods dimensions of an array
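+
+To illustrate what "modifies only the metadata" means, here is a small plain
+C++ model of a column-major view. It does not use ArrayFire; `View2D` and
+`moddims2d` are hypothetical names for this sketch. Changing the dimensions
+produces a new view over the same buffer, and only the index arithmetic
+changes:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// A column-major 2D view: dimensions plus a pointer to shared data.
+struct View2D {
+    const std::vector<float> *data;
+    std::size_t rows, cols;
+
+    // Element (i, j) lives at linear offset i + j * rows.
+    float at(std::size_t i, std::size_t j) const {
+        assert(i < rows && j < cols);
+        return (*data)[i + j * rows];
+    }
+};
+
+// Reinterpret the same buffer with new dimensions; no element is copied.
+View2D moddims2d(const View2D &in, std::size_t rows, std::size_t cols) {
+    assert(rows * cols == in.rows * in.cols);  // element count must match
+    return View2D{in.data, rows, cols};
+}
+```
+
+With the 8x1 data from the example above, the 2x4 view reads element (1, 3)
+from linear offset 1 + 3 * 2 = 7, which is the last element of the original
+column.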
+
+## reorder()
+
+The __reorder()__ function changes the order of the dimensions of an array and
+moves the data to match. Unlike moddims(), this generally changes the linear
+ordering of the data in memory.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [2 2 3 1]
+    1.0000     3.0000
+    2.0000     4.0000
+
+    1.0000     3.0000
+    2.0000     4.0000
+
+    1.0000     3.0000
+    2.0000     4.0000
+
+
+reorder(a, 1, 0, 2) [2 2 3 1]  //equivalent to a transpose
+    1.0000     2.0000
+    3.0000     4.0000
+
+    1.0000     2.0000
+    3.0000     4.0000
+
+    1.0000     2.0000
+    3.0000     4.0000
 
-[array()](\ref af::array) can be used to create a (shallow) copy of a matrix
-with different dimensions.  The number of elements must remain the same as
-the original array.
 
-\snippet test/matrix_manipulation.cpp ex_matrix_manipulation_moddims
+reorder(a, 2, 0, 1) [3 2 2 1]
+    1.0000     2.0000
+    1.0000     2.0000
+    1.0000     2.0000
 
-The [T()](\ref af::array::T) and [H()](\ref af::array::H) methods can be
-used to form the [matrix or vector transpose](\ref af::array::T) .
+    3.0000     4.0000
+    3.0000     4.0000
+    3.0000     4.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The reorder function has one form in each of the C and C++ APIs:
+
+> __af_err af_reorder(af_array *out, const af_array in, const unsigned x, const unsigned y, const unsigned z, const unsigned w)__
+> --  C interface for reordering function
+
+> __array af::reorder(const array &in, const unsigned x, const unsigned y=1, const unsigned z=2, const unsigned w=3)__
+> --  Reorders dimensions of an array
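+
+The following plain C++ sketch models the index mapping behind reorder() for a
+3D column-major array. It does not use ArrayFire; `Array3` and `reorder3` are
+hypothetical names. Output dimension `k` takes its extent and its coordinate
+from input dimension `perm[k]`, so the output buffer is a permuted copy of the
+input:
+
+```cpp
+#include <array>
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+using Dims3 = std::array<std::size_t, 3>;
+
+struct Array3 {
+    Dims3 dims;
+    std::vector<float> data;  // column-major storage
+};
+
+Array3 reorder3(const Array3 &in, const Dims3 &perm) {
+    const Dims3 &d = in.dims;
+    assert(in.data.size() == d[0] * d[1] * d[2]);
+
+    Array3 out;
+    out.dims = {d[perm[0]], d[perm[1]], d[perm[2]]};
+    out.data.resize(in.data.size());
+
+    Dims3 c{};  // coordinate in the input array
+    for (c[2] = 0; c[2] < d[2]; ++c[2])
+        for (c[1] = 0; c[1] < d[1]; ++c[1])
+            for (c[0] = 0; c[0] < d[0]; ++c[0]) {
+                const std::size_t src = c[0] + d[0] * (c[1] + d[1] * c[2]);
+                // Output coordinate k is input coordinate perm[k].
+                const std::size_t dst =
+                    c[perm[0]] +
+                    out.dims[0] * (c[perm[1]] + out.dims[1] * c[perm[2]]);
+                out.data[dst] = in.data[src];
+            }
+    return out;
+}
+```
+
+Applied to the [2 2 3 1] array above with `perm = {2, 0, 1}`, the sketch yields
+dimensions [3 2 2] and the same values as the `reorder(a, 2, 0, 1)` listing.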
+
+## shift()
+
+The __shift()__ function shifts data in a circular buffer fashion along a
+chosen dimension. Consider the following example:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [3 5 1 1]
+    0.0000     0.0000     0.0000     0.0000     0.0000
+    3.0000     4.0000     5.0000     1.0000     2.0000
+    3.0000     4.0000     5.0000     1.0000     2.0000
+
+shift(a, 0, 2 ) [3 5 1 1]
+    0.0000     0.0000     0.0000     0.0000     0.0000
+    1.0000     2.0000     3.0000     4.0000     5.0000
+    1.0000     2.0000     3.0000     4.0000     5.0000
+
+shift(a, -1, 2 ) [3 5 1 1]
+    1.0000     2.0000     3.0000     4.0000     5.0000
+    1.0000     2.0000     3.0000     4.0000     5.0000
+    0.0000     0.0000     0.0000     0.0000     0.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The shift function can be called from C and C++ as follows:
+
+
+> __af_err af_shift(af_array *out, const af_array in, const int x, const int y, const int z, const int w)__
+> --  C interface for shifting an array
+
+> __array af::shift(const array &in, const int x, const int y=0, const int z=0, const int w=0)__
+> --  Shifts array along specified dimensions
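+
+The circular behaviour can be modelled in a few lines of plain C++. This sketch
+does not use ArrayFire; `wrap` and `shift2d` are hypothetical names, and only
+the first two dimensions of a column-major array are handled. Each element
+moves by the requested offset and wraps around at the edges:
+
+```cpp
+#include <cassert>
+#include <cstddef>
+#include <vector>
+
+// Wrap an index into [0, n), also for negative shifts.
+static std::size_t wrap(long long i, long long n) {
+    return static_cast<std::size_t>(((i % n) + n) % n);
+}
+
+// Shift a rows x cols column-major array by sx along dimension 0
+// and by sy along dimension 1.
+std::vector<float> shift2d(const std::vector<float> &in, std::size_t rows,
+                           std::size_t cols, int sx, int sy) {
+    assert(rows > 0 && cols > 0 && in.size() == rows * cols);
+    std::vector<float> out(in.size());
+    for (std::size_t j = 0; j < cols; ++j)
+        for (std::size_t i = 0; i < rows; ++i) {
+            const std::size_t di = wrap(static_cast<long long>(i) + sx,
+                                        static_cast<long long>(rows));
+            const std::size_t dj = wrap(static_cast<long long>(j) + sy,
+                                        static_cast<long long>(cols));
+            out[di + dj * rows] = in[i + j * rows];
+        }
+    return out;
+}
+```
+
+Called with the 3x5 data from the example and the offsets (0, 2) or (-1, 2),
+the sketch reproduces the two shifted listings shown above.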
+
+## tile()
+
+The __tile()__ function repeats an array along the specified dimension.
+The example below tiles an array along the zeroth dimension, along the zeroth
+and first dimensions, and according to a dim4 object:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [3 1 1 1]
+    1.0000
+    2.0000
+    3.0000
+
+// Repeat array a twice in the zeroth dimension
+tile(a, 2) [6 1 1 1]
+    1.0000
+    2.0000
+    3.0000
+    1.0000
+    2.0000
+    3.0000
+
+// Repeat array a twice along both the zeroth and first dimensions
+tile(a, 2, 2) [6 2 1 1]
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+
+// Repeat array a twice along the first and three times along the second
+// dimension.
+af::dim4 tile_dims(1, 2, 3);
+tile(a, tile_dims) [3 2 3 1]
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+
+    1.0000     1.0000
+    2.0000     2.0000
+    3.0000     3.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The C interface for tile is as follows:
+
+> __af_err af_tile(af_array *out, const af_array in, const unsigned x, const unsigned y, const unsigned z, const unsigned w)__
+> --  C interface for tiling an array
+
+The C++ interface has two overloads:
+
+> __array af::tile(const array &in, const unsigned x, const unsigned y=1, const unsigned z=1, const unsigned w=1)__
+> --  Tiles array along specified dimensions
+
+> __array af::tile(const array &in, const dim4 &dims)__
+> --  Tile an array according to a dim4 object
+
+## transpose()
+
+The __transpose()__ function performs a standard matrix transpose. The input
+array must have the dimensions of a 2D-matrix.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+a [3 3 1 1]
+    1.0000     3.0000     3.0000
+    2.0000     1.0000     3.0000
+    2.0000     2.0000     1.0000
+
+transpose(a) [3 3 1 1]
+    1.0000     2.0000     2.0000
+    3.0000     1.0000     2.0000
+    3.0000     3.0000     1.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The C interfaces for transpose are as follows:
+
+> __af_err af_transpose(af_array *out, af_array in, const bool conjugate)__
+> --   C interface to transpose a matrix.
+
+> __af_err af_transpose_inplace(af_array in, const bool conjugate)__
+> --   C interface to transpose a matrix in-place.
+
+The C++ interface has two primary functions and two shorthand versions:
+
+> __array af::transpose(const array &in, const bool conjugate=false)__
+> --   Transposes a matrix.
+
+> __void af::transposeInPlace(array &in, const bool conjugate=false)__
+> --   Transposes a matrix in-place.
+
+> __array af::array::T() const__
+> --   Transpose a matrix
+
+> __array af::array::H() const__
+> --   Conjugate Transpose (Hermitian transpose) of a matrix
+
+Here is an example of how the shorthand versions might be used:
 
 \snippet test/matrix_manipulation.cpp ex_matrix_manipulation_transpose
+
+## array()
+
+[array()](\ref af::array) can be used to create a (shallow) copy of a matrix
+with different dimensions. The total number of elements must remain the same.
+This function is a wrapper over the moddims() function discussed earlier.
+
+# Combining re-ordering functions to enumerate grid coordinates
+
+By using a combination of the array restructuring functions, one can quickly code
+complex manipulation patterns with a few lines of code. For example, consider
+generating (*x,y*) coordinates for a grid where each axis goes from *1 to n*.
+Instead of using several loops to populate our arrays we can just use a small
+combination of the above functions.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+unsigned n=3;
+af::array xy = join(1,
+                tile(seq(1, n), n),
+                flat( transpose(tile(seq(1, n), 1, n)) )
+                   );
+xy [9 2 1 1]
+    1.0000     1.0000
+    2.0000     1.0000
+    3.0000     1.0000
+    1.0000     2.0000
+    2.0000     2.0000
+    3.0000     2.0000
+    1.0000     3.0000
+    2.0000     3.0000
+    3.0000     3.0000
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/docs/pages/release_notes.md b/docs/pages/release_notes.md
index a0045cf26c..525542246f 100644
--- a/docs/pages/release_notes.md
+++ b/docs/pages/release_notes.md
@@ -1,12 +1,1978 @@
 Release Notes {#releasenotes}
 ==============
 
+v3.10.0
+======
+
+## Improvements
+- Added signed int8 support \PR{3661} \PR{3508} \PR{3507} \PR{3503}
+- Increased support for half (fp16) \PR{3680} \PR{3258} \PR{3561} \PR{3627} \PR{3559}
+- Updated oneAPI to use Intel oneAPI (R) 2025.1 \PR{3643} \PR{3573}
+- Updated cl2hpp dependency \PR{3651} \PR{3562}
+- Add support for CUDA 12.3, 12.4, 12.5, 12.6, 12.8, and 12.9 \PR{3657} \PR{3645} \PR{3641} \PR{3636} \PR{3588} \PR{3552} \PR{3586} \PR{3541} 
+- Added minimum driver version check for CUDA GPUs \PR{3648}
+- Add more examples \PR{3530} \PR{3455} \PR{3375} \PR{3612} \PR{3584} \PR{3577}
+- Updated documentation \PR{3496} \PR{3613}
+- Improved performance of matrix multiplication of sparse matrices on the OpenCL backend \PR{3608}
+- Improved cmake configure \PR{3581} \PR{3569} \PR{3567} \PR{3564} \PR{3554}
+- Loosen indexing assertions for assignments \PR{3514}
+
+## Fixes
+- Fix jit tree when doing operations containing moddims and original array \PR{3671} 
+- Fix incorrect behavior of sub-arrays with multiple functions \PR{3679} \PR{3668} \PR{3666} \PR{3665} \PR{3664} \PR{3663} \PR{3658} \PR{3659} \PR{3650} \PR{3611} \PR{3633} \PR{3602}
+- Fix half precision operations in multiple backends \PR{3676} \PR{3662}
+- Fix for join not always respecting the order of parameters \PR{3667} \PR{3513}
+- Fix for cmake building as an external project (needed by arrayfire python wheels) \PR{3669}
+- Fix for cmake build in Windows (including with vcpkg) \PR{3655} \PR{3646} \PR{3644} \PR{3512} \PR{3626} \PR{3566} \PR{3557} \PR{3591} \PR{3592}
+- Fix race condition in OpenCL flood fill \PR{3535}
+- Fix indexing array using sequences `af_seq` that have non-unit steps \PR{3587}
+- Fix padding issue convolve2GradientNN \PR{3519}
+- Fix incorrect axis values for histogram \PR{3590}
+- Fix unified exceptions errors \PR{3617}
+- Fix OpenCL memory migration on devices with different contexts \PR{3510}
+- Fix conversion of COO Sparse to Dense matrix \PR{3589} \PR{3579}
+- Fix `AF_JIT_KERNEL_TRACE` on Windows \PR{3517}
+- Fix cmake build with CUDNN \PR{3521}
+- Fix cmake build with `AF_DISABLE_CPU_ASYNC` \PR{3551}
+
+
+## Contributions
+
+Special thanks to our contributors:
+[Willy Born](https://github.com/willyborn)
+[verstatx](https://github.com/verstatx)
+[Filip Matzner](https://github.com/FloopCZ)
+[Fraser Cormack](https://github.com/frasercrmck)
+[errata-c](https://github.com/errata-c)
+[Tyler Hilbert](https://github.com/Tyler-Hilbert)
+
+v3.9.0
+======
+
+## Improvements
+- Add oneAPI backend \PR{3296}
+- Add support to directly access arrays on other devices \PR{3447}
+- Add broadcast support \PR{2871}
+- Improve OpenCL CPU JIT performance \PR{3257} \PR{3392}
+- Optimize thread/block calculations of several kernels \PR{3144}
+- Add support for fast math compilation when building ArrayFire \PR{3334}
+  \PR{3337}
+- Optimize performance of fftconvolve when using floats \PR{3338}
+- Add support for CUDA 12.1 and 12.2
+- Better handling of empty arrays \PR{3398}
+- Better handling of memory in linear algebra functions in OpenCL \PR{3423}
+- Better logging with JIT kernels \PR{3468}
+- Optimize memory manager/JIT interactions for small number of buffers
+  \PR{3468}
+- Documentation improvements \PR{3485}
+- Optimize reorder function \PR{3488}
+
+## Fixes
+- Improve Errors when creating OpenCL contexts from devices \PR{3257}
+- Improvements to vcpkg builds \PR{3376} \PR{3476}
+- Fix reduce by key when nan's are present \PR{3261}
+- Fix error in convolve where the ndims parameter was forced to be equal to 2
+  \PR{3277}
+- Make constructors that accept dim_t to be explicit to avoid invalid
+  conversions \PR{3259}
+- Fix error in randu when compiling against clang 14 \PR{3333}
+- Fix bug in OpenCL linear algebra functions  \PR{3398}
+- Fix bug with thread local variables when device was changed \PR{3420}
+  \PR{3421}
+- Fix bug in qr related to uninitialized memory \PR{3422}
+- Fix bug in shift where the array had an empty middle dimension \PR{3488}
+
+## Contributions
+
+Special thanks to our contributors:
+[Willy Born](https://github.com/willyborn)
+[Mike Mullen](https://github.com/mfzmullen)
+
+
+v3.8.3
+======
+
+## Improvements
+
+- Add support for CUDA 12 \PR{3352}
+- Modernize documentation style and content \PR{3351}
+- memcpy performance improvements \PR{3144}
+- JIT performance improvements \PR{3144}
+- join performance improvements \PR{3144}
+- Improve support for Intel and newer Clang compilers \PR{3334}
+- CCache support on Windows \PR{3257}
+
+## Fixes
+
+- Fix issue with some locales with OpenCL kernel generation \PR{3294}
+- Internal improvements
+- Fix leak in clfft on exit.
+- Fix some cases where ndims was incorrectly used to calculate shape \PR{3277}
+- Fix issue when setDevice was not called in new threads \PR{3269}
+- Restrict initializer list to just fundamental types \PR{3264}
+
+## Contributions
+
+Special thanks to our contributors:
+[Carlo Cabrera](https://github.com/carlocab)
+[Guillaume Schmid](https://github.com/GuillaumeSchmid)
+[Willy Born](https://github.com/willyborn)
+[ktdq](https://github.com/ktdq)
+
+
+v3.8.2
+======
+
+## Improvements
+
+- Optimize JIT by removing some consecutive cast operations \PR{3031}
+- Add driver checks for CUDA 11.5 and 11.6 \PR{3203}
+- Improve the timing algorithm used for timeit \PR{3185}
+- Dynamically link against CUDA numeric libraries by default \PR{3205}
+- Add support for pruning CUDA binaries to reduce static binary sizes \PR{3234} \PR{3237}
+- Remove unused cuDNN libraries from installations \PR{3235}
+- Add support for statically linking NVRTC libraries after CUDA 11.5 \PR{3236}
+- Add support for compiling with ccache when building the CUDA backend \PR{3241}
+- Make cuSparse an optional runtime dependency \PR{3240}
+
+## Fixes
+
+- Fix issue with consecutive moddims operations in the CPU backend \PR{3232}
+- Better floating point comparisons for tests \PR{3212}
+- Fix several warnings and inconsistencies with doxygen and documentation \PR{3226}
+- Fix issue when passing empty arrays into join \PR{3211}
+- Fix default value for the `AF_COMPUTE_LIBRARY` when not set \PR{3228}
+- Fix missing symbol issue when MKL is statically linked \PR{3244}
+- Remove linking of OpenCL's library to the unified backend \PR{3244}
+
+## Contributions
+
+Special thanks to our contributors:
+[Jacob Kahn](https://github.com/jacobkahn)
+[Willy Born](https://github.com/willyborn)
+
+
+v3.8.1
+======
+
+## Improvements
+
+- moddims now uses JIT approach for certain special cases - \PR{3177}
+- Embed Version Info in Windows DLLs - \PR{3025} 
+- OpenCL device max parameter is now queried from device properties - \PR{3032}
+- JIT Performance Optimization: Unique funcName generation sped up - \PR{3040} 
+- Improved readability of log traces - \PR{3050}
+- Use short function name in non-debug build error messages - \PR{3060} 
+- SIFT/GLOH are now available as part of website binaries - \PR{3071} 
+- Short-circuit zero elements case in detail::copyArray backend function - \PR{3059} 
+- Speedup of kernel caching mechanism - \PR{3043} 
+- Add short-circuit check for empty Arrays in JIT evalNodes - \PR{3072} 
+- Performance optimization of indexing using dynamic thread block sizes - \PR{3111} 
+- Starting with this release, ArrayFire uses the Intel MKL single dynamic library, which resolves many of the linking issues the unified library had when user applications used MKL themselves - \PR{3120}
+- Add shortcut check for zero elements in af_write_array - \PR{3130} 
+- Speedup join by eliminating temp buffers for cascading joins - \PR{3145} 
+- Added batch support for solve - \PR{1705} 
+- Use pinned memory to copy device pointers in CUDA solve - \PR{1705} 
+- Added package manager instructions to docs - \PR{3076} 
+- CMake Build Improvements - \PR{3027} , \PR{3089} , \PR{3037} , \PR{3072} , \PR{3095} , \PR{3096} , \PR{3097} , \PR{3102} , \PR{3106} , \PR{3105} , \PR{3120} , \PR{3136} , \PR{3135} , \PR{3137} , \PR{3119} , \PR{3150} , \PR{3138} , \PR{3156} , \PR{3139} , \PR{1705} , \PR{3162} 
+- CPU backend improvements - \PR{3010} , \PR{3138} , \PR{3161} 
+- CUDA backend improvements - \PR{3066} , \PR{3091} , \PR{3093} , \PR{3125} , \PR{3143} , \PR{3161} 
+- OpenCL backend improvements - \PR{3091} , \PR{3068} , \PR{3127} , \PR{3010} , \PR{3039} , \PR{3138} , \PR{3161} 
+- General (including JIT) performance improvements across backends - \PR{3167}
+- Testing improvements - \PR{3072} , \PR{3131} , \PR{3151} , \PR{3141} , \PR{3153} , \PR{3152} , \PR{3157} , \PR{1705} , \PR{3170} , \PR{3167} 
+- Update CLBlast to latest version - \PR{3135} , \PR{3179} 
+- Improved Otsu threshold computation helper in canny algorithm - \PR{3169} 
+- Modified default parameters for fftR2C and fftC2R C++ API from 0 to 1.0 - \PR{3178} 
+- Use appropriate MKL getrs_batch_strided API based on MKL Versions - \PR{3181} 
+
+## Fixes
+
+- Fixed a bug in JIT kernel disk caching - \PR{3182}
+- Fixed stream used by thrust (CUDA backend) functions - \PR{3029}
+- Added workaround for a new cuSparse API that CUDA introduced between fix releases - \PR{3057}
+- Fixed `const` array indexing inside `gfor` - \PR{3078} 
+- Handle zero elements in copyData to host - \PR{3059} 
+- Fixed double free regression in OpenCL backend - \PR{3091} 
+- Fixed an infinite recursion bug in NaryNode JIT Node - \PR{3072} 
+- Added missing input validation check in sparse-dense arithmetic operations - \PR{3129} 
+- Fixed bug in `getMappedPtr` in OpenCL due to invalid lambda capture - \PR{3163} 
+- Fixed bug in `getMappedPtr` on Arrays that are not ready - \PR{3163} 
+- Fixed edgeTraceKernel for CPU devices on OpenCL backend - \PR{3164} 
+- Fixed Windows build issue(s) with VS2019 - \PR{3048}
+- API documentation fixes - \PR{3075} , \PR{3076} , \PR{3143} , \PR{3161} 
+- CMake Build Fixes - \PR{3088} 
+- Fixed the tutorial link in README - \PR{3033} 
+- Fixed function name typo in timing tutorial - \PR{3028} 
+- Fixed a couple of bugs in the CPU backend canny implementation - \PR{3169}
+- Fixed reference count of array(s) used in JIT operations. This relates to ArrayFire's internal memory bookkeeping. The behavior and accuracy of ArrayFire code were not broken earlier; the fix corrects the reference count to its optimal value in these scenarios, which may reduce memory usage in some narrow cases - \PR{3167}
+- Added assert that checks if topk is called with a negative value for k - \PR{3176} 
+- Fixed an issue where countByKey would give incorrect results for any n > 128 - \PR{3175}
+
+## Contributions
+
+Special thanks to our contributors:
+[HO-COOH](https://github.com/HO-COOH)
+[Willy Born](https://github.com/willyborn)
+[Gilad Avidov](https://github.com/avidov)
+[Pavan Yalamanchili](https://github.com/pavanky)
+
+v3.8.0
+======
+
+Major Updates
+--------
+- Non-uniform (ragged) reductions \PR{2786}
+- Bit-wise not operator support for array and C API (af\_bitnot) \PR{2865}
+- Initialization list constructor for array class \PR{2829} \PR{2987}
+
+Improvements
+------------
+- New API for the following statistics functions: cov, var and stdev - \PR{2986}
+- allocV2 and freeV2 which return cl\_mem on OpenCL backend \PR{2911}
+- Move constructor and move assignment operator for Dim4 class \PR{2946}
+- Support for CUDA 11.1 and Compute 8.6 \PR{3023}
+- Fix af::feature copy constructor for multi-threaded scenarios \PR{3022}
+
+v3.7.3
+======
+
+Improvements
+------------
+- Add f16 support for histogram - \PR{2984}
+- Update confidence connected components example with better illustration - \PR{2968}
+- Enable disk caching of OpenCL kernel binaries - \PR{2970}
+- Refactor extension of kernel binaries stored to disk `.bin` - \PR{2970}
+- Add minimum driver versions for CUDA toolkit 11 in internal map - \PR{2982}
+- Improve warning messages from run-time kernel compilation functions - \PR{2996}
+
+Fixes
+-----
+- Fix bias factor of variance in var_all and cov functions - \PR{2986}
+- Fix a race condition in confidence connected components function for OpenCL backend - \PR{2969}
+- Safely ignore disk cache failures in CUDA backend for compiled kernel binaries - \PR{2970}
+- Fix randn by passing in correct values to Box-Muller - \PR{2980}
+- Fix rounding issues in Box-Muller function used for RNG - \PR{2980}
+- Fix problems in RNG for older compute architectures with fp16 - \PR{2980} \PR{2996}
+- Fix performance regression of approx functions - \PR{2977}
+- Remove assert that checks that signal/filter types are the same - \PR{2993}
+- Fix `checkAndSetDevMaxCompute` when the device cc is greater than max - \PR{2996}
+- Fix documentation errors and warnings - \PR{2973} , \PR{2987}
+- Add missing opencl-arrayfire interoperability functions in unified backend - \PR{2981}
+
+Contributions
+-------------
+Special thanks to our contributors:
+[P. J. Reed](https://github.com/pjreed)
+
+v3.7.2
+======
+
+Improvements
+------------
+- Cache CUDA kernels to disk to improve load times (Thanks to \@cschreib-ibex) \PR{2848}
+- Statically link against CUDA libraries \PR{2785}
+- Make cuDNN an optional build dependency \PR{2836}
+- Improve support for different compilers and OS \PR{2876} \PR{2945} \PR{2925} \PR{2942} \PR{2943} \PR{2958}
+- Improve performance of join and transpose on CPU \PR{2849}
+- Improve documentation \PR{2816} \PR{2821} \PR{2846} \PR{2918} \PR{2928} \PR{2947}
+- Reduce binary size using NVRTC and reducing template instantiations \PR{2849} \PR{2861} \PR{2890} \PR{2957}
+- reduceByKey performance improvements \PR{2851} \PR{2957}
+- Improve support for Intel OpenCL GPUs \PR{2855}
+- Allow statically linking against MKL \PR{2877} (Sponsored by SDL)
+- Better support for older CUDA toolkits \PR{2923}
+- Add support for CUDA 11 \PR{2939}
+- Add support for ccache for faster builds \PR{2931}
+- Add support for the conan package manager on Linux \PR{2875}
+- Propagate build errors up the stack in AFError exceptions \PR{2948} \PR{2957}
+- Improve runtime dependency library loading \PR{2954}
+- Improved cuDNN runtime checks and warnings \PR{2960}
+- Document af\_memory\_manager\_* native memory return values \PR{2911}
+
+Fixes
+-----
+- Fix crash when allocating large arrays \PR{2827}
+- Fix various compiler warnings \PR{2827} \PR{2849} \PR{2872} \PR{2876}
+- Fix minor leaks in OpenCL functions \PR{2913}
+- Various continuous integration related fixes \PR{2819}
+- Fix zero padding with convolve2NN \PR{2820}
+- Fix af_get_memory_pressure_threshold return value \PR{2831}
+- Increased the max filter length for morph
+- Handle empty array inputs for LU, QR, and Rank functions \PR{2838}
+- Fix FindMKL.cmake script for sequential threading library \PR{2840} \PR{2952}
+- Various internal refactoring \PR{2839} \PR{2861} \PR{2864} \PR{2873} \PR{2890} \PR{2891} \PR{2913} \PR{2959}
+- Fix OpenCL 2.0 builtin function name conflict \PR{2851}
+- Fix error caused when releasing memory with multiple devices \PR{2867}
+- Fix missing set stacktrace symbol from unified API \PR{2915}
+- Fix zero padding issue in convolve2NN \PR{2820}
+- Fixed bugs in ReduceByKey \PR{2957}
+
+Contributions
+-------------
+Special thanks to our contributors:
+[Corentin Schreiber](https://github.com/cschreib-ibex)
+[Jacob Kahn](https://github.com/jacobkahn)
+[Paul Jurczak](https://github.com/pauljurczak)
+[Christoph Junghans](https://github.com/junghans)
+
+v3.7.1
+======
+
+Improvements
+------------
+
+- Improve mtx download for test data \PR{2742}
+- Documentation improvements \PR{2754} \PR{2792} \PR{2797}
+- Remove verbose messages in older CMake versions \PR{2773}
+- Reduce binary size with the use of NVRTC \PR{2790}
+- Use texture memory to load LUT in orb and fast \PR{2791}
+- Add missing print function for f16 \PR{2784}
+- Add checks for f16 support in the CUDA backend \PR{2784}
+- Create a thrust policy to intercept tmp buffer allocations \PR{2806}
+
+Fixes
+-----
+
+- Fix segfault on exit when ArrayFire is not initialized in the main thread
+- Fix support for CMake 3.5.1 \PR{2771} \PR{2772} \PR{2760}
+- Fix evalMultiple if the input array sizes aren't the same \PR{2766}
+- Fix error when AF_BACKEND_DEFAULT is passed directly to backend \PR{2769}
+- Workaround name collision with AMD OpenCL implementation \PR{2802}
+- Fix on-exit errors with the unified backend \PR{2769}
+- Fix check for f16 compatibility in OpenCL \PR{2773}
+- Fix matmul on Intel OpenCL when passing same array as input \PR{2774}
+- Fix CPU OpenCL blas batching \PR{2774}
+- Fix memory pressure in the default memory manager \PR{2801}
+
+Contributions
+-------------
+Special thanks to our contributors:
+[padentomasello](https://github.com/padentomasello)
+[glavaux2](https://github.com/glavaux2)
+
+v3.7.0
+======
+
+Major Updates
+-------------
+
+- Added the ability to customize the memory manager (Thanks jacobkahn and flashlight) \PR{2461}
+- Added 16-bit floating point support for several functions \PR{2413} \PR{2587} \PR{2585} \PR{2583}
+- Added sumByKey, productByKey, minByKey, maxByKey, allTrueByKey, anyTrueByKey, countByKey \PR{2254}
+- Added confidence connected components \PR{2748}
+- Added neural network based convolution and gradient functions \PR{2359}
+- Added a padding function \PR{2682}
+- Added pinverse for pseudo inverse \PR{2279}
+- Added support for uniform ranges in approx1 and approx2 functions. \PR{2297}
+- Added support to write to preallocated arrays for some functions \PR{2599} \PR{2481} \PR{2328} \PR{2327}
+- Added meanvar function \PR{2258}
+- Added support for sparse-sparse arithmetic
+- Added rsqrt function for reciprocal square root
+- Added a lower level af_gemm function for general matrix multiplication \PR{2481}
+- Added a function to set the cuBLAS math mode for the CUDA backend \PR{2584}
+- Separate debug symbols into separate files \PR{2535}
+- Print stacktraces on errors \PR{2632}
+- Support move constructor for af::array \PR{2595}
+- Expose events in the public API \PR{2461}
+- Add setAxesLabelFormat to format labels on graphs \PR{2495}
+
+Improvements
+------------
+
+- Better error messages for systems with driver or device incompatibilities \PR{2678} \PR{2448}
+- Optimized unified backend function calls
+- Optimized anisotropic smoothing \PR{2713}
+- Optimized canny filter for CUDA and OpenCL
+- Better MKL search script
+- Better logging of different submodules in ArrayFire \PR{2670} \PR{2669}
+- Improve documentation \PR{2665} \PR{2620} \PR{2615} \PR{2639} \PR{2628} \PR{2633} \PR{2622} \PR{2617} \PR{2558} \PR{2326} \PR{2515}
+- Optimized af::array assignment \PR{2575}
+- Update the k-means example to display the result \PR{2521}
+
+
+Fixes
+-----
+
+- Fix multi-config generators
+- Fix access errors in canny
+- Fix segfault in the unified backend if no backends are available
+- Fix access errors in scan-by-key
+- Fix sobel operator
+- Fix an issue with the random number generator and s16
+- Fix issue with boolean product reduction
+- Fix array_proxy move constructor
+- Fix convolve3 launch configuration
+- Fix an issue where the fft function modified the input array \PR{2520}
+
+Contributions
+-------------
+Special thanks to our contributors:
+[Jacob Kahn](https://github.com/jacobkahn)
+[William Tambellini](https://github.com/WilliamTambellini)
+[Alexey Kuleshevich](https://github.com/lehins)
+[Richard Barnes](https://github.com/r-barnes)
+[Gaika](https://github.com/gaika)
+[ShalokShalom](https://github.com/ShalokShalom)
+
+
+v3.6.4
+======
+
+Bug Fixes
+---------
+- Address a JIT performance regression due to moving kernel arguments to shared memory \PR{2501}
+- Fix the default parameter for setAxisTitle \PR{2491}
+
+v3.6.3
+======
+
+Improvements
+------------
+- Graphics are now a runtime dependency instead of a link time dependency \PR{2365}
+- Reduce the CUDA backend binary size using runtime compilation of kernels \PR{2437}
+- Improved batched matrix multiplication on the CPU backend by using Intel MKL's
+  `cblas_Xgemm_batched` \PR{2206}
+- Print JIT kernels to disk or stream using the `AF_JIT_KERNEL_TRACE`
+  environment variable \PR{2404}
+- `void*` pointers are now allowed as arguments to `af::array::write()` \PR{2367}
+- Slightly improve the efficiency of JITed tile operations \PR{2472}
+- Make random number generation on the CPU backend consistent with
+  CUDA and OpenCL \PR{2435}
+- Handled very large JIT tree generations \PR{2484} \PR{2487}
+
+Bug Fixes
+---------
+- Fixed `af::array::array_proxy` move assignment operator \PR{2479}
+- Fixed input array dimensions validation in svdInplace() \PR{2331}
+- Fixed the typedef declaration for window resource handle \PR{2357}
+- Increase compatibility with GCC 8 \PR{2379}
+- Fixed `af::write` tests \PR{2380}
+- Fixed a bug in broadcast step of 1D exclusive scan \PR{2366}
+- Fixed OpenGL related build errors on OSX \PR{2382}
+- Fixed multiple array evaluation. Performance improvement. \PR{2384}
+- Fixed buffer overflow and expected output of kNN SSD small test \PR{2445}
+- Fixed MKL linking order to enable threaded BLAS \PR{2444}
+- Added validations for forge module plugin availability before calling
+  resource cleanup \PR{2443}
+- Improve compatibility on MSVC toolchain (_MSC_VER > 1914) with the CUDA
+  backend \PR{2443}
+- Fixed BLAS gemm func generators for newest MSVC 19 on VS 2017 \PR{2464}
+- Fix errors on exit when using the CUDA backend with unified \PR{2470}
+
+Documentation
+-------------
+- Updated svdInplace() documentation following a bugfix \PR{2331}
+- Fixed a typo in matrix multiplication documentation \PR{2358}
+- Fixed a code snippet demonstrating C-API use \PR{2406}
+- Updated hamming matcher implementation limitation \PR{2434}
+- Added illustration for the rotate function \PR{2453}
+
+Misc
+----
+- Use cudaMemcpyAsync instead of cudaMemcpy throughout the codebase \PR{2362}
+- Display a more informative error message if CUDA driver is incompatible
+  \PR{2421} \PR{2448}
+- Changed forge resource management to use smart pointers \PR{2452}
+- Deprecated intl and uintl typedefs in API \PR{2360}
+- Enabled graphics by default for all builds starting with v3.6.3 \PR{2365}
+- Fixed several warnings \PR{2344} \PR{2356} \PR{2361}
+- Refactored initArray() calls to use createEmptyArray(). initArray() is for
+  internal use only by Array class. \PR{2361}
+- Refactored `void*` memory allocations to use unsigned char type \PR{2459}
+- Replaced deprecated MKL API with in-house implementations for sparse
+  to sparse/dense conversions \PR{2312}
+- Reorganized and fixed some internal backend API \PR{2356}
+- Updated compilation order of cuda files to speed up compile time \PR{2368}
+- Removed conditional graphics support builds after enabling runtime
+  loading of graphics dependencies \PR{2365}
+- Marked graphics dependencies as optional in CPack RPM config \PR{2365}
+- Refactored a sparse arithmetic backend API \PR{2379}
+- Fixed const correctness of `af_device_array` API \PR{2396}
+- Update Forge to v1.0.4 \PR{2466}
+- Manage Forge resources from the DeviceManager class \PR{2381}
+- Fixed non-mkl & non-batch blas upstream call arguments \PR{2401}
+- Link MKL with OpenMP instead of TBB by default
+- Use clang-format to format source code
+
+Contributions
+-------------
+Special thanks to our contributors:
+[Alessandro Bessi](https://github.com/alessandrobessi)
+[zhihaoy](https://github.com/zhihaoy)
+[Jacob Kahn](https://github.com/jacobkahn)
+[William Tambellini](https://github.com/WilliamTambellini)
+
+v3.6.2
+======
+
+Features
+--------
+- Added support for batching on the `cond` argument in select() \PR{2243}
+- Added support for broadcasting batched matmul() \PR{2315}
+- Added support for multiple nearest neighbors in nearestNeighbour() \PR{2280}
+- Added support for clamp-to-edge padding as an `af_border_type` option \PR{2333}
+
+Improvements
+------------
+- Improved performance of morphological operations \PR{2238}
+- Fixed linking errors when compiling without Freeimage/Graphics \PR{2248}
+- Improved the usage of ArrayFire as a CMake subproject \PR{2290}
+- Enabled configuration of custom library path for loading dynamic backend
+  libraries \PR{2302}
+
+Bug Fixes
+---------
+- Fixed LAPACK definitions and linking errors \PR{2239}
+- Fixed overflow in dim4::ndims() \PR{2289}
+- Fixed pow() precision for integral types \PR{2305}
+- Fixed issues with tile() with a large repeat dimension \PR{2307}
+- Fixed svd() sub-array output on OpenCL \PR{2279}
+- Fixed grid-based indexing calculation in histogram() \PR{2230}
+- Fixed bug in indexing when used after reorder \PR{2311}
+- Fixed errors when exiting on Windows when using
+  [CLBlast](https://github.com/CNugteren/CLBlast) \PR{2222}
+- Fixed fallthrough error in medfilt1 \PR{2349}
+
+Documentation
+-------------
+- Improved unwrap() documentation \PR{2301}
+- Improved wrap() documentation \PR{2320}
+- Improved accum() documentation \PR{2298}
+- Improved tile() documentation \PR{2293}
+- Clarified approx1() and approx2() indexing in documentation \PR{2287}
+- Updated examples of [select()](@ref data_func_select) in detailed documentation
+  \PR{2277}
+- Updated lookup() examples \PR{2288}
+- Updated set operations' documentation \PR{2299}
+
+Misc
+----
+- `af*` libraries and dependencies directory changed to `lib64` \PR{2186}
+- Added new arrayfire ASSERT utility functions \PR{2249} \PR{2256} \PR{2257} \PR{2263}
+- Improved error messages in JIT \PR{2309}
+
+Contributions
+-------------
+Special thanks to our contributors: [Jacob Kahn](https://github.com/jacobkahn),
+[Vardan Akopian](https://github.com/vakopian)
+
+v3.6.1
+======
+
+Improvements
+------------
+- FreeImage is now a run-time dependency [#2164]
+- Reduced binary size by setting the symbol visibility to hidden [#2168]
+- Add memory manager logging using the AF_TRACE=mem environment variable [#2169]
+- Improved CPU Anisotropic Diffusion performance [#2174]
+- Perform normalization after FFT for improved accuracy [#2185][#2192]
+- Updated CLBlast to v1.4.0 [#2178]
+- Added additional validation when using af::seq for indexing [#2153]
+- Perform checks for unsupported cards by the CUDA implementation [#2182]
+
+Bug Fixes
+---------
+- Fixed region when all pixels were the foreground or background [#2152]
+- Fixed several memory leaks [#2202][#2201][#2180][#2179][#2177][#2175]
+- Fixed bug in setDevice which didn't allow you to select the last device [#2189]
+- Fixed bug in min/max where the first element of the array was a NaN value [#2155]
+- Fixed window cell indexing for graphics [#2207]
+
+v3.6.0
+======
+
+The source code with submodules can be downloaded directly from the following link:
+http://arrayfire.com/arrayfire_source/arrayfire-full-3.6.0.tar.bz2
+
+Major Updates
+-------------
+
+- Added the `topk()` function
+  [Documentation](http://arrayfire.org/docs/group__stat__func__topk.htm).
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/2061)</sup>
+- Added batched matrix multiply support.
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1898)</sup>
+  <sup>[3](https://github.com/arrayfire/arrayfire/pull/2059)</sup>
+- Added anisotropic diffusion, `anisotropicDiffusion()`.
+  [Documentation](http://arrayfire.org/docs/group__image__func__anisotropic__diffusion.htm)
+  <sup>[4](https://github.com/arrayfire/arrayfire/pull/1850)</sup>.
+
+Features
+--------
+
+- Added support for batched matrix multiply.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1898)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/2059)</sup>
+- New anisotropic diffusion function, `anisotropicDiffusion()`.
+  [Documentation](http://arrayfire.org/docs/group__image__func__anisotropic__diffusion.htm)
+  <sup>[3](https://github.com/arrayfire/arrayfire/pull/1850)</sup>.
+- New `topk()` function, which returns the top k elements along a given
+  dimension of the input.
+  [Documentation](http://arrayfire.org/docs/group__stat__func__topk.htm).
+  <sup>[4](https://github.com/arrayfire/arrayfire/pull/2061)</sup>
+- New gradient diffusion
+  [example](https://github.com/arrayfire/arrayfire/blob/master/examples/image_processing/gradient_diffusion.cpp).
+
+Improvements
+------------
+
+- JITted `select()` and `shift()` functions for CUDA and OpenCL backends.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/2047)</sup>
+- Significant CMake improvements.
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1861)</sup>
+  <sup>[3](https://github.com/arrayfire/arrayfire/pull/2070)</sup>
+  <sup>[4](https://github.com/arrayfire/arrayfire/pull/2018)</sup>
+- Improved the quality of the random number generator, thanks to Ralf Stubner.
+  <sup>[5](https://github.com/arrayfire/arrayfire/pull/2122)</sup>
+- Modified `af_colormap` struct to match forge's definition.
+  <sup>[6](https://github.com/arrayfire/arrayfire/pull/2082)</sup>
+- Improved Black Scholes example.
+  <sup>[7](https://github.com/arrayfire/arrayfire/pull/2079)</sup>
+- Using CPack to generate installers.
+  <sup>[8](https://github.com/arrayfire/arrayfire/pull/1861)</sup>
+- Refactored
+  [black_scholes_options](https://github.com/arrayfire/arrayfire/blob/master/examples/financial/black_scholes_options.cpp)
+  example to use built-in `af::erfc` function for cumulative normal
+  distribution.<sup>[9](https://github.com/arrayfire/arrayfire/pull/2079)</sup>.
+- Reduced the scope of mutexes in memory manager
+  <sup>[10](https://github.com/arrayfire/arrayfire/pull/2125)</sup>
+- Official installers do not require the CUDA toolkit to be installed
+- Significant CMake improvements have been made. Using CPack to generate
+  installers. <sup>[11](https://github.com/arrayfire/arrayfire/pull/1861)</sup>
+  <sup>[12](https://github.com/arrayfire/arrayfire/pull/2070)</sup>
+  <sup>[13](https://github.com/arrayfire/arrayfire/pull/2018)</sup>
+- Corrected assert function calls in select() tests.
+  <sup>[14](https://github.com/arrayfire/arrayfire/pull/2058)</sup>
+
+Bug fixes
+-----------
+
+- Fixed `shfl_down()` warnings with CUDA 9.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/2040)</sup>
+- Disabled CUDA JIT debug flags on ARM
+  architecture.<sup>[2](https://github.com/arrayfire/arrayfire/pull/2037)</sup>
+- Fixed CLBlast install lib dir for Linux platforms where the `lib` directory has
+  an arch(64) suffix.<sup>[3](https://github.com/arrayfire/arrayfire/pull/2094)</sup>
+- Fixed assert condition in 3D morph OpenCL
+  kernel.<sup>[4](https://github.com/arrayfire/arrayfire/pull/2033)</sup>
+- Fix JIT errors with large non-linear
+  kernels<sup>[5](https://github.com/arrayfire/arrayfire/pull/2127)</sup>
+- Fix bug in CPU JIT after moddims was called
+  <sup>[5](https://github.com/arrayfire/arrayfire/pull/2127)</sup>
+- Fixed deadlock caused by calls to from the worker thread
+  <sup>[6](https://github.com/arrayfire/arrayfire/pull/2124)</sup>
+
+Documentation
+-------------
+
+- Fixed variable name typo in `vectorization.md`.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/2032)</sup>
+- Fixed `AF_API_VERSION` value in Doxygen config file.
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/2053)</sup>
+
+Known issues
+------------
+
+- Several OpenCL tests failing on OSX:
+  - `canny_opencl, fft_opencl, gen_assign_opencl, homography_opencl,
+    reduce_opencl, scan_by_key_opencl, solve_dense_opencl,
+    sparse_arith_opencl, sparse_convert_opencl, where_opencl`
+
+Community contributions
+-----------------------
+
+Special thanks to our contributors:
+[Adrien F. Vincent](https://github.com/afvincent), [Cedric
+Nugteren](https://github.com/CNugteren),
+[Felix](https://github.com/fzimmermann89), [Filip
+Matzner](https://github.com/FloopCZ),
+[HoneyPatouceul](https://github.com/HoneyPatouceul), [Patrick
+Lavin](https://github.com/plavin), [Ralf Stubner](https://github.com/rstub),
+[William Tambellini](https://github.com/WilliamTambellini)
+
+
+v3.5.1
+======
+
+The source code with submodules can be downloaded directly from the following
+link: http://arrayfire.com/arrayfire_source/arrayfire-full-3.5.1.tar.bz2
+
+Installer CUDA Version: 8.0 (Required) Installer OpenCL Version: 1.2 (Minimum)
+
+Improvements
+------------
+- Relaxed `af::unwrap()` function's arguments.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1853)</sup>
+- Changed behavior of af::array::allocated() to specify memory allocated.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1877)</sup>
+- Removed restriction on the number of bins for `af::histogram()` on CUDA and
+  OpenCL kernels. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1895)</sup>
+
+
+Performance
+-----------
+
+- Improved JIT performance.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1864)</sup>
+- Improved CPU element-wise operation performance.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1890)</sup>
+- Improved regions performance using texture objects. <sup>
+  [1](https://github.com/arrayfire/arrayfire/pull/1903)</sup>
+
+
+Bug fixes
+---------
+- Fixed overflow issues in mean.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1849)</sup>
+- Fixed memory leak when chaining indexing operations.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1879)</sup>
+- Fixed bug in array assignment when using an empty array to index.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1897)</sup>
+- Fixed bug with `af::matmul()` which occurred when its RHS argument was an
+  indexed vector.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1883)</sup>
+- Fixed deadlock bug when a sparse array was used with a JIT array.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1889)</sup>
+- Fixed pixel tests for FAST kernels.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1891)</sup>
+- Fixed `af::replace` so that it is now copy-on-write.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1892)</sup>
+- Fixed launch configuration issues in CUDA JIT.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1893)</sup>
+- Fixed segfaults and "Pure Virtual Call" error warnings when exiting on
+  Windows. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1899)
+  [2](https://github.com/arrayfire/arrayfire/pull/1924)</sup>
+- Workaround for `clEnqueueReadBuffer` bug on OSX.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1888)</sup>
+
+Build
+-----
+
+- Fixed issues when compiling with GCC 7.1.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1872)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1876)</sup>
+- Eliminated unnecessary Boost dependency from CPU and CUDA backends.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1857)</sup>
+
+Misc
+----
+
+- Updated support links to point to Slack instead of Gitter.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1905)</sup>
+
+
+
+v3.5.0
+==============
+
+Major Updates
+-------------
+
+* ArrayFire now supports threaded applications.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1706)</sup>
+* Added Canny edge detector.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1743)</sup>
+* Added Sparse-Dense arithmetic operations.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1696)</sup>
+
+Features
+--------
+
+* ArrayFire Threading
+  * \ref af::array can be read by multiple threads
+  * All ArrayFire functions can be executed concurrently by multiple threads
+  * Threads can operate on different devices to simplify multi-device workloads
+* New Canny edge detector function, \ref af::canny().
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1743)</sup>
+  * Can automatically calculate high threshold with `AF_CANNY_THRESHOLD_AUTO_OTSU`
+  * Supports both L1 and L2 Norms to calculate gradients
+* New tuned OpenCL BLAS backend,
+  [CLBlast](https://github.com/arrayfire/arrayfire/pull/1727).
+
+Improvements
+------------
+
+* Converted CUDA JIT to use
+  [NVRTC](http://docs.nvidia.com/cuda/nvrtc/index.html) instead of
+  [NVVM](http://docs.nvidia.com/cuda/nvvm-ir-spec/index.html).
+* Performance improvements in \ref af::reorder().
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1766)</sup>
+* Performance improvements in \ref af::array::scalar<T>().
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1809)</sup>
+* Improved unified backend performance.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1770)</sup>
+* ArrayFire now depends on Forge
+  v1.0. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1800)</sup>
+* Can now specify the FFT plan cache size using the
+  \ref af::setFFTPlanCacheSize() function.
+* Get the number of physical bytes allocated by the memory manager
+  \ref af_get_allocated_bytes(). <sup>[1](https://github.com/arrayfire/arrayfire/pull/1630)</sup>
+* \ref af::dot() can now return a scalar value to the
+  host. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1628)</sup>
+
+Bug Fixes
+---------
+
+* Fixed improper release of default Mersenne random
+  engine. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1716)</sup>
+* Fixed \ref af::randu() and \ref af::randn() ranges for floating point
+  types. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1784)</sup>
+* Fixed assignment bug in CPU
+  backend. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1765)</sup>
+* Fixed complex (`c32`,`c64`) multiplication in OpenCL convolution
+  kernels. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1816)</sup>
+* Fixed inconsistent behavior with \ref af::replace() and \ref
+  af_replace_scalar(). <sup>[1](https://github.com/arrayfire/arrayfire/pull/1773)</sup>
+* Fixed memory leak in \ref
+  af_fir(). <sup>[1](https://github.com/arrayfire/arrayfire/pull/1765)</sup>
+* Fixed memory leaks in \ref af_cast for sparse arrays.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1826)</sup>
+* Fixed correctness of \ref af_pow for complex numbers by using Cartesian
+  form. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1765)</sup>
+* Corrected \ref af::select() with indexing in CUDA and OpenCL
+  backends. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1731)</sup>
+* Workaround for VS2015 compiler ternary
+  bug. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1771)</sup>
+* Fixed memory corruption in
+  `cuda::findPlan()`. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1793)</sup>
+* Argument checks in \ref af_create_sparse_array now reject inputs of type
+  int64. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1747)</sup>
+* Fixed issue with indexing an array with a step size != 1. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1846)</sup>
+
+Build fixes
+-----------
+
+* On OSX, utilize new GLFW package from the brew package
+  manager. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1720)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1775)</sup>
+* Fixed CUDA PTX names generated by CMake
+  v3.7. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1689)</sup>
+* Support `gcc` > 5.x for
+  CUDA. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1708)</sup>
+
+Examples
+--------
+
+* New genetic algorithm example.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1695)</sup>
+
+Documentation
+-------------
+
+* Updated `README.md` to improve readability and
+  formatting. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1726)</sup>
+* Updated `README.md` to mention Julia and Nim
+  wrappers. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1714)</sup>
+* Improved installation instructions -
+  `docs/pages/install.md`. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1740)</sup>
+
+Miscellaneous
+-------------
+
+* A few improvements for ROCm
+  support. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1710)</sup>
+* Removed CUDA 6.5 support.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1687)</sup>
+
+Known issues
+------------
+
+* Windows
+  * The Windows NVIDIA driver version `37x.xx` contains a bug which causes
+    `fftconvolve_opencl` to fail. Upgrade or downgrade to a different version of
+    the driver to avoid this failure.
+  * The following tests fail on Windows with NVIDIA hardware:
+    `threading_cuda`, `qr_dense_opencl`, `solve_dense_opencl`.
+* macOS
+  * The Accelerate framework, used by the CPU backend on macOS, leverages Intel
+    graphics cards (Iris) when there are no discrete GPUs available. This OpenCL
+    implementation is known to give incorrect results on the following tests:
+    `lu_dense_{cpu,opencl}`, `solve_dense_{cpu,opencl}`,
+    `inverse_dense_{cpu,opencl}`.
+  * Certain tests intermittently fail on macOS with NVIDIA GPUs apparently due
+    to inconsistent driver behavior: `fft_large_cuda` and `svd_dense_cuda`.
+  * The following tests are currently failing on macOS with AMD GPUs:
+    `cholesky_dense_opencl` and `scan_by_key_opencl`.
+
+
+v3.4.2
+==============
+
+Deprecation Announcement
+------------------------
+
+This release supports CUDA 6.5 and higher. The next ArrayFire release will
+support CUDA 7.0 and higher, dropping support for CUDA 6.5. Reasons for no
+longer supporting CUDA 6.5 include:
+
+* CUDA 7.0 NVCC supports the C++11 standard (whereas CUDA 6.5 does not), which
+  is used by ArrayFire's CPU and OpenCL backends.
+* Very few ArrayFire users still use CUDA 6.5.
+
+As a result, the older Jetson TK1 / Tegra K1 will no longer be supported in
+the next ArrayFire release. The newer Jetson TX1 / Tegra X1 will continue to
+have full capability with ArrayFire.
+
+Docker
+------
+* [ArrayFire has been Dockerized](https://github.com/arrayfire/arrayfire-docker).
+
+Improvements
+------------
+* Implemented sparse storage format conversions between \ref AF_STORAGE_CSR
+  and \ref AF_STORAGE_COO.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1642)</sup>
+  * Directly convert between \ref AF_STORAGE_COO <--> \ref AF_STORAGE_CSR
+    using the af::sparseConvertTo() function.
+  * af::sparseConvertTo() now also supports converting to dense.
+* Added cast support for [sparse arrays](\ref sparse_func).
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1653)</sup>
+  * Casting only changes the values array and the type. The row and column
+    index arrays are not changed.
+* Reintroduced automated computation of chart axes limits for graphics functions.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1639)</sup>
+  * The axes limits will always be the minimum/maximum of the current and new
+    limit.
+  * The user can still set limits from API calls. If the user sets a limit
+    from the API call, then the automatic limit setting will be disabled.
+* Using `boost::scoped_array` instead of `boost::scoped_ptr` when managing
+  array resources.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1637)</sup>
+* Internal performance improvements to getInfo() by using `const` references
+  to avoid unnecessary copying of `ArrayInfo` objects.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1665)</sup>
+* Added support for scalar af::array inputs for af::convolve() and
+  [set functions](\ref set_mat).
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1660)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/issues/1675)</sup>
+  <sup>[3](https://github.com/arrayfire/arrayfire/pull/1668)</sup>
+* Performance fixes in af::fftConvolve() kernels.
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1679)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1680)</sup>
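+
+To make the CSR/COO conversion listed above more concrete, the following is a
+minimal sketch in plain C++ of what expanding CSR row offsets into COO row
+indices involves. It is an illustration only and not ArrayFire's
+implementation: the `Coo` struct and the `csrToCoo` helper are hypothetical
+names, and the sketch assumes zero-based indices and a row-offset array with
+one more entry than the number of rows.
+
+```cpp
+#include <cstddef>
+#include <vector>
+
+// Hypothetical COO container: one (row, col, value) triple per non-zero.
+struct Coo {
+    std::vector<int> rows;
+    std::vector<int> cols;
+    std::vector<float> values;
+};
+
+// Expand CSR row offsets into explicit row indices.
+// rowPtr has (numRows + 1) entries; row r owns the half-open range
+// [rowPtr[r], rowPtr[r + 1]) of colIdx and values.
+Coo csrToCoo(const std::vector<int>& rowPtr,
+             const std::vector<int>& colIdx,
+             const std::vector<float>& values) {
+    Coo out;
+    out.cols = colIdx;    // column indices carry over unchanged
+    out.values = values;  // so do the stored values
+    out.rows.reserve(values.size());
+    for (std::size_t r = 0; r + 1 < rowPtr.size(); ++r) {
+        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k) {
+            out.rows.push_back(static_cast<int>(r));
+        }
+    }
+    return out;
+}
+```
+
+Only the row information changes shape in this direction; the column indices
+and values are copied as they are, which is also why a cast can touch the
+values array without altering the index arrays.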
+
+Build
+-----
+* Support for Visual Studio 2015 compilation.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1632)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1640)</sup>
+* Fixed `FindCBLAS.cmake` when PkgConfig is used.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1657)</sup>
+
+Bug fixes
+---------
+* Fixes to JIT when tree is large.
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1646)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1638)</sup>
+* Fixed indexing bug when converting dense to sparse af::array as \ref
+  AF_STORAGE_COO.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1642)</sup>
+* Fixed af::bilateral() OpenCL kernel compilation on OS X.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1638)</sup>
+* Fixed memory leak in af::regions() (CPU) and af::rgb2ycbcr().
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1664)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/issues/1664)</sup>
+  <sup>[3](https://github.com/arrayfire/arrayfire/pull/1666)</sup>
+
+Installers
+----------
+* Major OS X installer fixes.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1629)</sup>
+  * Fixed installation scripts.
+  * Fixed installation symlinks for libraries.
+* Windows installer now ships with more pre-built examples.
+
+Examples
+--------
+* Added af::choleskyInPlace() calls to `cholesky.cpp` example.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1671)</sup>
+
+Documentation
+-------------
+* Added `u8` as supported data type in `getting_started.md`.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1661)</sup>
+* Fixed typos.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1652)</sup>
+
+CUDA 8 on OSX
+-------------
+* [CUDA 8.0.55](https://developer.nvidia.com/cuda-toolkit) supports Xcode 8.
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1664)</sup>
+
+Known Issues
+------------
+* Known failures with CUDA 6.5. These include all functions that use
+  sorting. As a result, sparse storage format conversion between \ref
+  AF_STORAGE_COO and \ref AF_STORAGE_CSR has been disabled for CUDA 6.5.
+
+v3.4.1
+==============
+
+Installers
+----------
+* Installers for Linux, OS X and Windows
+  * CUDA backend now uses [CUDA 8.0](https://developer.nvidia.com/cuda-toolkit).
+  * Uses [Intel MKL 2017](https://software.intel.com/en-us/intel-mkl).
+  * CUDA Compute 2.x (Fermi) is no longer compiled into the library.
+* Installer for OS X
+  * The libraries shipping in the OS X Installer are now compiled with Apple
+    Clang v7.3.1 (previously v6.1.0).
+  * The OS X version used is 10.11.6 (previously 10.10.5).
+* Installer for Jetson TX1 / Tegra X1
+  * Requires [JetPack for L4T 2.3](https://developer.nvidia.com/embedded/jetpack)
+    (containing Linux for Tegra r24.2 for TX1).
+  * CUDA backend now uses [CUDA 8.0](https://developer.nvidia.com/cuda-toolkit) 64-bit.
+  * Using CUDA's cusolver instead of CPU fallback.
+  * Uses OpenBLAS for CPU BLAS.
+  * All ArrayFire libraries are now 64-bit.
+
+Improvements
+------------
+* Add [sparse array](\ref sparse_func) support to \ref af::eval().
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1598)</sup>
+* Add OpenCL-CPU fallback support for sparse \ref af::matmul() when running on
+  a unified memory device. Uses MKL Sparse BLAS.
+* When using CUDA libdevice, pick the correct compute version based on device.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1612)</sup>
+* OpenCL FFT now also supports prime factors 7, 11 and 13.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1383)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1619)</sup>
+
+Bug Fixes
+---------
+* Allow CUDA libdevice to be detected from custom directory.
+* Fix `aarch64` detection on Jetson TX1 64-bit OS.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1593)</sup>
+* Add missing definition of `af_set_fft_plan_cache_size` in unified backend.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1591)</sup>
+* Fix initial values for \ref af::min() and \ref af::max() operations.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1594)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1595)</sup>
+* Fix distance calculation in \ref af::nearestNeighbour for CUDA and OpenCL backend.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1596)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1595)</sup>
+* Fix OpenCL bug where scalars were passed incorrectly to compile options.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1595)</sup>
+* Fix bug in \ref af::Window::surface() with respect to dimensions and ranges.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1604)</sup>
+* Fix possible double free corruption in \ref af_assign_seq().
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1605)</sup>
+* Add missing eval for key in \ref af::scanByKey in CPU backend.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1605)</sup>
+* Fixed creation of sparse values array using \ref AF_STORAGE_COO.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1620)</sup>
+  <sup>[2](https://github.com/arrayfire/arrayfire/pull/1621)</sup>
+
+Examples
+--------
+* Add a [Conjugate Gradient solver example](\ref benchmarks/cg.cpp)
+  to demonstrate sparse and dense matrix operations.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1599)</sup>
+
+CUDA Backend
+------------
+* When using [CUDA 8.0](https://developer.nvidia.com/cuda-toolkit),
+  compute 2.x is no longer in the default compute list.
+  * This follows [CUDA 8.0](https://developer.nvidia.com/cuda-toolkit)
+    deprecating computes 2.x.
+  * Default computes for CUDA 8.0 will be 30, 50, 60.
+* When using CUDA pre-8.0, the default selection remains 20, 30, 50.
+* CUDA backend now uses `-arch=sm_30` for PTX compilation as default.
+  * Unless compute 2.0 is enabled.
+
+Known Issues
+------------
+* \ref af::lu() on CPU is known to give incorrect results when run on
+  OS X 10.11 or 10.12 and compiled with the Accelerate Framework.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1617)</sup>
+  * Since the OS X Installer libraries use MKL rather than the Accelerate
+    Framework, this issue does not affect those libraries.
+
+
+v3.4.0
+==============
+
+Major Updates
+-------------
+* [Sparse Matrix and BLAS](\ref sparse_func). <sup>[1](https://github.com/arrayfire/arrayfire/issues/821)
+  [2](https://github.com/arrayfire/arrayfire/pull/1319)</sup>
+* Faster JIT for CUDA and OpenCL. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1472)
+  [2](https://github.com/arrayfire/arrayfire/pull/1462)</sup>
+* Support for [random number generator engines](\ref af::randomEngine).
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/868)
+  [2](https://github.com/arrayfire/arrayfire/pull/1551)</sup>
+* Improvements to graphics. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1555)
+  [2](https://github.com/arrayfire/arrayfire/pull/1566)</sup>
+
+Features
+----------
+* **[Sparse Matrix and BLAS](\ref sparse_func)** <sup>[1](https://github.com/arrayfire/arrayfire/issues/821)
+[2](https://github.com/arrayfire/arrayfire/pull/1319)</sup>
+  * Support for [CSR](\ref AF_STORAGE_CSR) and [COO](\ref AF_STORAGE_COO)
+    [storage types](\ref af_storage).
+  * Sparse-Dense Matrix Multiplication and Matrix-Vector Multiplication as a
+    part of af::matmul() using \ref AF_STORAGE_CSR format for sparse.
+  * Conversion to and from [dense](\ref AF_STORAGE_DENSE) matrix to [CSR](\ref AF_STORAGE_CSR)
+    and [COO](\ref AF_STORAGE_COO) [storage types](\ref af_storage).
+* **Faster JIT** <sup>[1](https://github.com/arrayfire/arrayfire/issues/1472)
+  [2](https://github.com/arrayfire/arrayfire/pull/1462)</sup>
+  * Performance improvements for CUDA and OpenCL JIT functions.
+  * Support for evaluating multiple outputs in a single kernel. See af::array::eval() for more.
+* **[Random Number Generation](\ref af::randomEngine)**
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/868)
+  [2](https://github.com/arrayfire/arrayfire/pull/1551)</sup>
+  * af::randomEngine(): A random engine class to handle setting the
+    [type](\ref af_random_engine_type) and seed for random number generator engines.
+  * Supported engine types are (\ref af_random_engine_type):
+    * [Philox](http://www.thesalmons.org/john/random123/)
+    * [Threefry](http://www.thesalmons.org/john/random123/)
+    * [Mersenne Twister](http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/)
+* **Graphics** <sup>[1](https://github.com/arrayfire/arrayfire/pull/1555)
+  [2](https://github.com/arrayfire/arrayfire/pull/1566)</sup>
+  * Using [Forge v0.9.0](https://github.com/arrayfire/forge/releases/tag/v0.9.0)
+  * [Vector Field](\ref af::Window::vectorField) plotting functionality.
+    <sup>[1](https://github.com/arrayfire/arrayfire/pull/1566)</sup>
+  * Removed [GLEW](http://glew.sourceforge.net/) and replaced with [glbinding](https://github.com/cginternals/glbinding).
+    * Removed usage of GLEW after support for MX (multithreaded) was dropped in v2.0.
+      <sup>[1](https://github.com/arrayfire/arrayfire/issues/1540)</sup>
+  * Multiple overlays on the same window are now possible.
+    * Overlays support for same type of object (2D/3D)
+    * Supported by af::Window::plot, af::Window::hist, af::Window::surface,
+      af::Window::vectorField.
+  * New API to set axes limits for graphs.
+    * Draw calls do not automatically compute the limits. This is now under user control.
+    * af::Window::setAxesLimits can be used to set axes limits automatically or manually.
+    * af::Window::setAxesTitles can be used to set axes titles.
+  * New API for plot and scatter:
+    * af::Window::plot() and af::Window::scatter() now can handle 2D and 3D and determine appropriate order.
+    * af_draw_plot_nd()
+    * af_draw_plot_2d()
+    * af_draw_plot_3d()
+    * af_draw_scatter_nd()
+    * af_draw_scatter_2d()
+    * af_draw_scatter_3d()
+* **New [interpolation methods](\ref af_interp_type)**
+<sup>[1](https://github.com/arrayfire/arrayfire/issues/1562)</sup>
+  * Applies to
+    * \ref af::resize()
+    * \ref af::transform()
+    * \ref af::approx1()
+    * \ref af::approx2()
+* **Support for [complex mathematical functions](\ref mathfunc_mat)**
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1507)</sup>
+  * Add complex support for \ref trig_mat, \ref af::sqrt(), \ref af::log().
+* **af::medfilt1(): Median filter for 1-d signals** <sup>[1](https://github.com/arrayfire/arrayfire/pull/1479)</sup>
+* <b>Generalized scan functions: \ref scan_func_scan and \ref scan_func_scanbykey</b>
+  * Now supports inclusive or exclusive scans
+  * Supports binary operations defined by \ref af_binary_op.
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/388)</sup>
+* **[Image Moments](\ref moments_mat) functions**
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1453)</sup>
+* <b>Add af::getSizeOf() function for \ref af_dtype</b>
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1404)</sup>
+* <b>Explicitly instantiate \ref af::array::device() for `void *`</b>
+  <sup>[1](https://github.com/arrayfire/arrayfire/issues/1503)</sup>
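+
+The difference between the inclusive and exclusive scans mentioned above can
+be sketched in plain C++ as follows. This is a sequential illustration of the
+semantics only, not ArrayFire's kernels; `scanSketch` and its parameters are
+hypothetical names, and the caller is assumed to pass the identity element of
+the chosen binary operation (0 for addition, 1 for multiplication).
+
+```cpp
+#include <cstddef>
+#include <functional>
+#include <vector>
+
+// Sequential scan over `in` with a user-supplied binary operation.
+// inclusive == true : out[i] combines in[0..i]
+// inclusive == false: out[i] combines in[0..i-1], so out[0] is the identity
+template <typename T, typename Op>
+std::vector<T> scanSketch(const std::vector<T>& in, Op op, T identity,
+                          bool inclusive) {
+    std::vector<T> out(in.size());
+    T running = identity;
+    for (std::size_t i = 0; i < in.size(); ++i) {
+        if (inclusive) {
+            running = op(running, in[i]);
+            out[i] = running;
+        } else {
+            out[i] = running;
+            running = op(running, in[i]);
+        }
+    }
+    return out;
+}
+```
+
+For the input `{1, 2, 3, 4}` with addition, the inclusive form yields
+`{1, 3, 6, 10}` and the exclusive form yields `{0, 1, 3, 6}`.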
+
+Bug Fixes
+--------------
+* Fixes to edge-cases in \ref morph_mat. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1564)</sup>
+* Makes JIT tree size consistent between devices. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1457)</sup>
+* Delegate higher-dimension in \ref convolve_mat to correct dimensions. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1445)</sup>
+* Indexing fixes with C++11. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1426) [2](https://github.com/arrayfire/arrayfire/pull/1426)</sup>
+* Handle empty arrays as inputs in various functions. <sup>[1](https://github.com/arrayfire/arrayfire/issues/799)</sup>
+* Fix bug when single element input to af::median. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1423)</sup>
+* Fix bug in calculation of time from af::timeit(). <sup>[1](https://github.com/arrayfire/arrayfire/pull/1414)</sup>
+* Fix bug in floating point numbers in af::seq. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1404)</sup>
+* Fixes for OpenCL graphics interop on NVIDIA devices.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1408/commits/e1f16e6)</sup>
+* Fix bug when compiling large kernels for AMD devices.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1465)</sup>
+* Fix bug in af::bilateral when shared memory is over the limit.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1478)</sup>
+* Fix bug in kernel header compilation tool `bin2cpp`.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1544)</sup>
+* Fix initial values for \ref morph_mat functions.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1547)</sup>
+* Fix bugs in af::homography() CPU and OpenCL kernels.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1584)</sup>
+* Fix bug in CPU TNJ.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1587)</sup>
+
+
+Improvements
+------------
+* CUDA 8 and compute 6.x (Pascal) support; the current installer ships with CUDA 7.5. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1432) [2](https://github.com/arrayfire/arrayfire/pull/1487) [3](https://github.com/arrayfire/arrayfire/pull/1539)</sup>
+* User controlled FFT plan caching. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1448)</sup>
+* CUDA performance improvements for \ref image_func_wrap, \ref image_func_unwrap and \ref approx_mat.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1411)</sup>
+* Fallback for CUDA-OpenGL interop when devices do not support OpenGL.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1415)</sup>
+* Additional forms of batching with the \ref transform_func_transform functions.
+  [New behavior defined here](https://github.com/arrayfire/arrayfire/pull/1412).
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1412)</sup>
+* Update to OpenCL2 headers. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1344)</sup>
+* Support for integration with external OpenCL contexts. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1140)</sup>
+* Performance improvements to internal copy in CPU Backend.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1440)</sup>
+* Performance improvements to af::select and af::replace CUDA kernels.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1587)</sup>
+* Enable OpenCL-CPU offload by default for devices with Unified Host Memory.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1521)</sup>
+  * To disable, use the environment variable `AF_OPENCL_CPU_OFFLOAD=0`.
+
+Build
+------
+* Compilation speedups. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1526)</sup>
+* Build fixes with MKL. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1526)</sup>
+* Error message when CMake CUDA Compute Detection fails. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1535)</sup>
+* Several CMake build issues with Xcode generator fixed.
+  <sup>[1](https://github.com/arrayfire/arrayfire/pull/1493) [2](https://github.com/arrayfire/arrayfire/pull/1499)</sup>
+* Fix multiple OpenCL definitions at link time. <sup>[1](https://github.com/arrayfire/arrayfire/issues/1429)</sup>
+* Fix lapacke detection in CMake. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1423)</sup>
+* Update build tags of
+  * [clBLAS](https://github.com/clMathLibraries/clBLAS)
+  * [clFFT](https://github.com/clMathLibraries/clFFT)
+  * [Boost.Compute](https://github.com/boostorg/compute)
+  * [Forge](https://github.com/arrayfire/forge)
+  * [glbinding](https://github.com/cginternals/glbinding)
+* Fix builds with GCC 6.1.1 and GCC 5.3.0. <sup>[1](https://github.com/arrayfire/arrayfire/pull/1409)</sup>
+
+Installers
+----------
+* All installers now ship with ArrayFire libraries built with MKL 2016.
+* All installers now ship with Forge development files and examples included.
+* CUDA Compute 2.0 has been removed from the installers. Please contact us
+  directly if you have a special need.
+
+Examples
+-------------
+* Added [example simulating gravity](\ref graphics/field.cpp) for
+  demonstration of vector field.
+* Improvements to \ref financial/black_scholes_options.cpp example.
+* Improvements to \ref graphics/gravity_sim.cpp example.
+* Fix graphics examples to use af::Window::setAxesLimits and
+  af::Window::setAxesTitles functions.
+
+Documentation & Licensing
+-------------------------
+* [ArrayFire copyright and trademark policy](http://arrayfire.com/trademark-policy)
+* Fixed grammar in license.
+* Add license information for glbinding.
+* Remove license information for GLEW.
+* Random123 now applies to all backends.
+* Random number functions are now under \ref random_mat.
+
+Deprecations
+------------
+The following functions have been deprecated and may be modified or removed
+permanently from future versions of ArrayFire.
+* \ref af::Window::plot3(): Use \ref af::Window::plot instead.
+* \ref af_draw_plot(): Use \ref af_draw_plot_nd or \ref af_draw_plot_2d instead.
+* \ref af_draw_plot3(): Use \ref af_draw_plot_nd or \ref af_draw_plot_3d instead.
+* \ref af::Window::scatter3(): Use \ref af::Window::scatter instead.
+* \ref af_draw_scatter(): Use \ref af_draw_scatter_nd or \ref af_draw_scatter_2d instead.
+* \ref af_draw_scatter3(): Use \ref af_draw_scatter_nd or \ref af_draw_scatter_3d instead.
+
+Known Issues
+-------------
+Certain CUDA functions are known to be broken on Tegra K1. The following ArrayFire tests are currently failing:
+* assign_cuda
+* harris_cuda
+* homography_cuda
+* median_cuda
+* orb_cuda
+* sort_cuda
+* sort_by_key_cuda
+* sort_index_cuda
+
+
+v3.3.2
+==============
+
+Improvements
+------------
+* Family of [Sort](\ref sort_mat) functions now support
+  [higher order dimensions](https://github.com/arrayfire/arrayfire/pull/1373).
+* Improved performance of batched sort on dim 0 for all [Sort](\ref sort_mat) functions.
+* [Median](\ref stat_func_median) now also supports higher order dimensions.
+
+Bug Fixes
+--------------
+
+* Fixes to [error handling](https://github.com/arrayfire/arrayfire/issues/1352) in C++ API for binary functions.
+* Fixes to [external OpenCL context management](https://github.com/arrayfire/arrayfire/issues/1350).
+* Fixes to [JPEG_GREYSCALE](https://github.com/arrayfire/arrayfire/issues/1360) for FreeImage versions <= 3.154.
+* Fixes for [non-float inputs](https://github.com/arrayfire/arrayfire/issues/1386) to \ref af::rgb2gray().
+
+Build
+------
+* [Disable CPU Async](https://github.com/arrayfire/arrayfire/issues/1378) when building with GCC < 4.8.4.
+* Add option to [disable CPUID](https://github.com/arrayfire/arrayfire/issues/1369) from CMake.
+* More verbose message when [CUDA Compute Detection fails](https://github.com/arrayfire/arrayfire/issues/1362).
+* Print message to use [CUDA library stub](https://github.com/arrayfire/arrayfire/issues/1363)
+  from CUDA Toolkit if CUDA Library is not found from default paths.
+* [Build Fixes](https://github.com/arrayfire/arrayfire/pull/1385) on Windows.
+  * For compiling tests out of source.
+  * For compiling ArrayFire with static MKL.
+* [Exclude <sys/sysctl.h>](https://github.com/arrayfire/arrayfire/pull/1368) when building on GNU Hurd.
+* Add [manual CMake options](https://github.com/arrayfire/arrayfire/pull/1389) to build DEB and RPM packages.
+
+Documentation
+-------------
+* Fixed documentation for \ref af::replace().
+* Fixed images in [Using on OSX](\ref using_on_osx) page.
+
+Installer
+---------
+* Linux x64 installers will now be compiled with GCC 4.9.2.
+* OSX installer gives better error messages on brew failures and
+  now includes a link to [Fixing OS X Installer Failures](https://github.com/arrayfire/arrayfire/wiki/Fixing-Common-OS-X-Installer-Failures)
+  for brew installation failures.
+
+v3.3.1
+==============
+
+Bug Fixes
+--------------
+
+* Fixes to \ref af::array::device()
+    * CPU Backend: [evaluate arrays](https://github.com/arrayfire/arrayfire/issues/1316)
+      before returning pointer with asynchronous calls in CPU backend.
+    * OpenCL Backend: [fix segfaults](https://github.com/arrayfire/arrayfire/issues/1324)
+      when requested for device pointers on empty arrays.
+* Changed \ref af::operator%() [from using rem to using mod](https://github.com/arrayfire/arrayfire/issues/1318).
+* Fixed [array destruction](https://github.com/arrayfire/arrayfire/issues/1321)
+  when backends are switched in Unified API.
+* Fixed [indexing](https://github.com/arrayfire/arrayfire/issues/1331) after
+  \ref af::moddims() is called.
+* Fixes FFT calls for CUDA and OpenCL backends when used on
+  [multiple devices](https://github.com/arrayfire/arrayfire/issues/1332).
+* Fixed [unresolved external](https://github.com/arrayfire/arrayfire/commit/32965ef)
+  for some functions from \ref af::array::array_proxy class.
+
+Build
+------
+* CMake compiles files in alphabetical order.
+* CMake fixes for BLAS and LAPACK on some Linux distributions.
+
+Improvements
+------------
+* Fixed [OpenCL FFT performance](https://github.com/arrayfire/arrayfire/issues/1323) regression.
+* \ref af::array::device() on OpenCL backend [returns](https://github.com/arrayfire/arrayfire/issues/1311)
+  `cl_mem` instead of `(void*)cl::Buffer*`.
+* In Unified backend, [load versioned libraries](https://github.com/arrayfire/arrayfire/issues/1312)
+  at runtime.
+
+Documentation
+------
+* Reorganized, cleaner README file.
+* Replaced non-free lena image in assets with free-to-distribute lena image.
+
+v3.3.0
+==============
+
+Major Updates
+-------------
+
+* CPU backend supports asynchronous execution.
+* Performance improvements to OpenCL BLAS and FFT functions.
+* Improved performance of memory manager.
+* Improvements to visualization functions.
+* Improved sorted order for OpenCL devices.
+* Integration with external OpenCL projects.
+
+Features
+----------
+
+* \ref af::getActiveBackend(): Returns the current backend being used.
+* [Scatter plot](https://github.com/arrayfire/arrayfire/pull/1116) added to graphics.
+* \ref af::transform() now supports perspective transformation matrices.
+* \ref af::infoString(): Returns `af::info()` as a string.
+* \ref af::printMemInfo(): Prints a table showing information about buffers from the memory manager.
+    * The \ref AF_MEM_INFO macro prints numbers and total sizes of all buffers (requires including af/macros.h)
+* \ref af::allocHost(): Allocates memory on host.
+* \ref af::freeHost(): Frees host side memory allocated by ArrayFire.
+* OpenCL functions can now use CPU implementation.
+    * Currently limited to Unified Memory devices (CPU and On-board Graphics).
+    * Functions: af::matmul() and all [LAPACK](\ref linalg_mat) functions.
+    * Takes advantage of optimized libraries such as MKL without doing memory copies.
+    * Use the environment variable `AF_OPENCL_CPU_OFFLOAD=1` to take advantage of this feature.
+* Functions specific to OpenCL backend.
+    * \ref afcl::addDevice(): Adds an external device and context to ArrayFire's device manager.
+    * \ref afcl::deleteDevice(): Removes an external device and context from ArrayFire's device manager.
+    * \ref afcl::setDevice(): Sets an external device and context from ArrayFire's device manager.
+    * \ref afcl::getDeviceType(): Gets the device type of the current device.
+    * \ref afcl::getPlatform(): Gets the platform of the current device.
+* \ref af::createStridedArray() allows [array creation with user-defined strides](https://github.com/arrayfire/arrayfire/issues/1177) and a device pointer.
+* [Expose functions](https://github.com/arrayfire/arrayfire/issues/1131) that provide information
+  about memory layout of Arrays.
+    * \ref af::getStrides(): Gets the strides for each dimension of the array.
+    * \ref af::getOffset(): Gets the offsets for each dimension of the array.
+    * \ref af::getRawPtr(): Gets raw pointer to the location of the array on device.
+    * \ref af::isLinear(): Returns true if all elements in the array are contiguous.
+    * \ref af::isOwner(): Returns true if the array owns the raw pointer, false if it is a sub-array.
+* \ref af::getDeviceId(): Gets the device id on which the array resides.
+* \ref af::isImageIOAvailable(): Returns true if ArrayFire was compiled with FreeImage enabled.
+* \ref af::isLAPACKAvailable(): Returns true if ArrayFire was compiled with LAPACK functions enabled.
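+
+As a rough guide to how the memory-layout queries above relate to each other,
+the sketch below shows in plain C++ how strides and an offset locate an
+element of a column-major array with up to four dimensions. It is not
+ArrayFire code: `Dim4`, `packedStrides`, `linearIndex` and `isPacked` are
+hypothetical names, and the sketch assumes a single scalar offset into the
+parent buffer.
+
+```cpp
+#include <array>
+#include <cstddef>
+
+using Dim4 = std::array<std::size_t, 4>;
+
+// Strides of a densely packed column-major array: dimension 0 is contiguous.
+Dim4 packedStrides(const Dim4& dims) {
+    Dim4 s{};
+    s[0] = 1;
+    for (std::size_t d = 1; d < 4; ++d) {
+        s[d] = s[d - 1] * dims[d - 1];
+    }
+    return s;
+}
+
+// Position of element idx in the underlying buffer for the given strides
+// and starting offset.
+std::size_t linearIndex(const Dim4& idx, const Dim4& strides,
+                        std::size_t offset) {
+    std::size_t pos = offset;
+    for (std::size_t d = 0; d < 4; ++d) {
+        pos += idx[d] * strides[d];
+    }
+    return pos;
+}
+
+// True when the strides match the packed layout, i.e. there are no gaps
+// between consecutive elements.
+bool isPacked(const Dim4& dims, const Dim4& strides) {
+    return strides == packedStrides(dims);
+}
+```
+
+For example, a 2x4 view taken from the first two rows of a 3x4 parent keeps
+the parent's strides `{1, 3, 12, 12}`, so it is not packed even though the
+parent is.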
+
+Bug Fixes
+--------------
+
+* Fixed [errors when using 3D / 4D arrays](https://github.com/arrayfire/arrayfire/pull/1251) in select and replace
+* Fixed [JIT errors on AMD devices](https://github.com/arrayfire/arrayfire/pull/1238) for OpenCL backend.
+* Fixed [imageio bugs](https://github.com/arrayfire/arrayfire/pull/1229) for 16 bit images.
+* Fixed [bugs when loading and storing images](https://github.com/arrayfire/arrayfire/pull/1228) natively.
+* Fixed [bug in FFT for NVIDIA GPUs](https://github.com/arrayfire/arrayfire/issues/615) when using OpenCL backend.
+* Fixed [bug when using external context](https://github.com/arrayfire/arrayfire/pull/1241) with OpenCL backend.
+* Fixed [memory leak](https://github.com/arrayfire/arrayfire/issues/1269) in \ref af_median_all().
+* Fixed [memory leaks and performance](https://github.com/arrayfire/arrayfire/pull/1274) in graphics functions.
+* Fixed [bugs when indexing followed by moddims](https://github.com/arrayfire/arrayfire/issues/1275).
+* \ref af_get_revision() now returns actual commit rather than AF_REVISION.
+* Fixed [releasing arrays](https://github.com/arrayfire/arrayfire/issues/1282) when using different backends.
+* OS X OpenCL: [LAPACK functions](\ref linalg_mat) on CPU devices use OpenCL offload (previously threw errors).
+* [Add support for 32-bit integer image types](https://github.com/arrayfire/arrayfire/pull/1287) in Image IO.
+* Fixed [set operations for row vectors](https://github.com/arrayfire/arrayfire/issues/1300)
+* Fixed [bugs](https://github.com/arrayfire/arrayfire/issues/1243) in \ref af::meanShift() and af::orb().
+
+Improvements
+--------------
+
+* Optionally [offload BLAS and LAPACK](https://github.com/arrayfire/arrayfire/pull/1221) functions to CPU implementations to improve performance.
+* Performance improvements to the memory manager.
+* Error messages are now more detailed.
+* Improved sorted order for OpenCL devices.
+* JIT heuristics can now be tweaked using environment variables. See
+  [Environment Variables](\ref configuring_environment) tutorial.
+* Add `BUILD_<BACKEND>` [options to examples and tests](https://github.com/arrayfire/arrayfire/issues/1286)
+  to toggle backends when compiling independently.
+
+Examples
+----------
+
+* New visualization [example simulating gravity](\ref graphics/gravity_sim.cpp).
+
+Build
+----------
+
+* Support for Intel `icc` compiler
+* Support to compile with Intel MKL as a BLAS and LAPACK provider
+* Tests are now available for building as standalone (like examples)
+* Tests can now be built as a single file for each backend
+* Better handling of NONFREE build options
+* [Searching for GLEW in CMake default paths](https://github.com/arrayfire/arrayfire/pull/1292)
+* Fixes for compiling with MKL on OSX.
+
+Installers
+----------
+* Improvements to OSX Installer
+    * CMake config files are now installed with libraries
+    * Independent options for installing examples and documentation components
+
+Deprecations
+-----------
+
+* `af_lock_device_arr` is now deprecated and will be removed in v4.0.0. Use \ref af_lock_array() instead.
+* `af_unlock_device_arr` is now deprecated and will be removed in v4.0.0. Use \ref af_unlock_array() instead.
+
+Documentation
+--------------
+
+* Fixes to documentation for \ref af::matchTemplate().
+* Improved documentation for deviceInfo.
+* Fixes to documentation for \ref af::exp().
+
+Known Issues
+------------
+
+* [Solve OpenCL fails on NVIDIA Maxwell devices](https://github.com/arrayfire/arrayfire/issues/1246)
+  for f32 and c32 when M > N and K % 4 is 1 or 2.
+
+
+v3.2.2
+==============
+
+Bug Fixes
+--------------
+
+* Fixed [memory leak](https://github.com/arrayfire/arrayfire/pull/1145) in
+  CUDA Random number generators
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1157) in
+  af::select() and af::replace() tests
+* Fixed [exception](https://github.com/arrayfire/arrayfire/issues/1164)
+  thrown when printing empty arrays with af::print()
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1170) in CPU
+  random number generation. Changed the generator to
+  [mt19937](http://en.cppreference.com/w/cpp/numeric/random)
+* Fixed exception handling (internal)
+    * [Exceptions](https://github.com/arrayfire/arrayfire/issues/1188)
+      now show function, short file name and line number
+    * Added [AF_RETURN_ERROR](https://github.com/arrayfire/arrayfire/issues/1186)
+      macro to handle returning errors.
+    * Removed THROW macro, and renamed AF_THROW_MSG to AF_THROW_ERR.
+* Fixed [bug](https://github.com/arrayfire/arrayfire/commit/9459c6)
+  in \ref af::identity() that may have affected CUDA Compute 5.2 cards
+
+
+Build
+------
+* Added a [MIN_BUILD_TIME](https://github.com/arrayfire/arrayfire/issues/1193)
+  option to build with minimum optimization compiler flags resulting in faster
+  compile times
+* Fixed [issue](https://github.com/arrayfire/arrayfire/issues/1143) in CBLAS
+  detection by CMake
+* Fixed tests failing for builds without optional components
+  [FreeImage](https://github.com/arrayfire/arrayfire/issues/1143) and
+  [LAPACK](https://github.com/arrayfire/arrayfire/issues/1167)
+* Added a [test](https://github.com/arrayfire/arrayfire/issues/1192)
+  for unified backend
+* Only [info and backend tests](https://github.com/arrayfire/arrayfire/issues/1192)
+  are now built for unified backend
+* [Sort tests](https://github.com/arrayfire/arrayfire/issues/1199)
+  execution alphabetically
+* Fixed compilation flags and errors in tests and examples
+* [Moved AF_REVISION and AF_COMPILER_STR](https://github.com/arrayfire/arrayfire/commit/2287c5)
+  into src/backend. Because the revision changes with every commit, the old
+  layout required rebuilding all of ArrayFire each time.
+    * v3.3 will add an af_get_revision() function to get the revision string.
+* [Clean up examples](https://github.com/arrayfire/arrayfire/pull/1158)
+    * Remove getchar for Windows (this will be handled by the installer)
+    * Other miscellaneous code cleanup
+    * Fixed bug in [plot3.cpp](\ref graphics/plot3.cpp) example
+* [Rename](https://github.com/arrayfire/arrayfire/commit/35f0fc2) clBLAS/clFFT
+  external project suffix from external -> ext
+* [Add OpenBLAS](https://github.com/arrayfire/arrayfire/pull/1197) as a
+  lapack/lapacke alternative
+
+Improvements
+------------
+* Added \ref AF_MEM_INFO macro to print memory info from ArrayFire's memory
+  manager ([cross issue](https://github.com/arrayfire/arrayfire/issues/1172))
+* Added [additional paths](https://github.com/arrayfire/arrayfire/issues/1184)
+  for searching for `libaf*` for Unified backend on unix-style OS.
+    * Note: This still requires dependencies such as forge, CUDA, NVVM etc to be
+      in `LD_LIBRARY_PATH` as described in [Unified Backend](\ref unifiedbackend)
+* [Create streams](https://github.com/arrayfire/arrayfire/commit/ed0373f)
+  for devices only when required in CUDA Backend
+
+Documentation
+------
+* [Hide scrollbars](https://github.com/arrayfire/arrayfire/commit/9d218a5)
+  appearing for pre and code styles
+* Fix [documentation](https://github.com/arrayfire/arrayfire/commit/ac09f91) for af::replace
+* Add [code sample](https://github.com/arrayfire/arrayfire/commit/4e06483)
+  for converting the output of af::getAvailableBackends() into bools
+* Minor fixes in documentation
+
+v3.2.1
+==============
+
+Bug Fixes
+--------------
+
+* Fixed [bug](https://github.com/arrayfire/arrayfire/pull/1136) in homography()
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1135) in behavior
+  of af::array::device()
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1129) when
+  indexing with span along trailing dimension
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1127) when
+  indexing in [GFor](\ref gfor)
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1122) in CPU
+  information fetching
+* Fixed compilation [bug](https://github.com/arrayfire/arrayfire/issues/1117)
+  in unified backend caused by missing link library
+* Add [missing symbol](https://github.com/arrayfire/arrayfire/pull/1114) for
+  af_draw_surface()
+
+Build
+------
+* Tests can now be used as a [standalone project](https://github.com/arrayfire/arrayfire/pull/1120)
+    * Tests can now be built using pre-compiled libraries
+    * Similar to how the examples are built
+* The install target now installs the examples source irrespective of the
+  BUILD_EXAMPLES value
+    * Examples are not built if BUILD_EXAMPLES is off
+
+Documentation
+------
+* HTML documentation is now [built and installed](https://github.com/arrayfire/arrayfire/pull/1109)
+  in docs/html
+* Added documentation for \ref af::seq class
+* Updated [Matrix Manipulation](\ref matrixmanipulation) tutorial
+* Examples list is now generated by CMake
+    * <a href="examples.htm">Examples</a> are now listed as dir/example.cpp
+* Removed dummy groups used for indexing documentation (affected doxygen < 1.8.9)
+
+v3.2.0
+=================
+
+Major Updates
+-------------
+
+* Added Unified backend
+    * Allows switching backends at runtime
+    * Read [Unified Backend](\ref unifiedbackend) for more.
+* Support for 16-bit integers (\ref s16 and \ref u16)
+    * All functions that support 32-bit integer types (\ref s32, \ref u32)
+      now also support 16-bit integer types
+
+Function Additions
+------------------
+* Unified Backend
+    * \ref af::setBackend() - Sets a backend as active
+    * \ref af::getBackendCount() - Gets the number of backends available for use
+    * \ref af::getAvailableBackends() - Returns information about available backends
+    * \ref af::getBackendId() - Gets the backend enum for an array
+
+* Vision
+    * \ref af::homography() - Homography estimation
+    * \ref af::gloh() - GLOH Descriptor for SIFT
+
+* Image Processing
+    * \ref af::loadImageNative() - Load an image as native data without modification
+    * \ref af::saveImageNative() - Save an image without modifying data or type
+
+* Graphics
+    * \ref af::Window::plot3() - 3-dimensional line plot
+    * \ref af::Window::surface() - 3-dimensional curve plot
+
+* Indexing
+    * \ref af_create_indexers()
+    * \ref af_set_array_indexer()
+    * \ref af_set_seq_indexer()
+    * \ref af_set_seq_param_indexer()
+    * \ref af_release_indexers()
+
+* CUDA Backend Specific
+    * \ref afcu::setNativeId() - Set the CUDA device with given native id as active
+        * ArrayFire uses a modified order for devices. The native id for a
+          device can be retrieved using `nvidia-smi`
+
+* OpenCL Backend Specific
+    * \ref afcl::setDeviceId() - Set the OpenCL device using the `clDeviceId`
+
+Other Improvements
+------------------------
+* Added \ref c32 and \ref c64 support for \ref af::isNaN(), \ref af::isInf() and \ref af::iszero()
+* Added CPU information for `x86` and `x86_64` architectures in CPU backend's \ref af::info()
+* Batch support for \ref af::approx1() and \ref af::approx2()
+    * Now can be used with gfor as well
+* Added \ref s64 and \ref u64 support to:
+    * \ref af::sort() (along with sort index and sort by key)
+    * \ref af::setUnique(), \ref af::setUnion(), \ref af::setIntersect()
+    * \ref af::convolve() and \ref af::fftConvolve()
+    * \ref af::histogram() and \ref af::histEqual()
+    * \ref af::lookup()
+    * \ref af::mean()
+* Added \ref AF_MSG macro
+
+Build Improvements
+------------------
+* Submodules update is now automatically called if not cloned recursively
+* [Fixes for compilation](https://github.com/arrayfire/arrayfire/issues/766) on Visual Studio 2015
+* Option to use [fallback to CPU LAPACK](https://github.com/arrayfire/arrayfire/pull/1053)
+  for linear algebra functions in case of CUDA 6.5 or older versions.
+
+Bug Fixes
+--------------
+* Fixed [memory leak](https://github.com/arrayfire/arrayfire/pull/1096) in \ref af::susan()
+* Fixed [failing test](https://github.com/arrayfire/arrayfire/commit/144a2db)
+  in \ref af::lower() and \ref af::upper() for CUDA compute 53
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1092) in CUDA for indexing out of bounds
+* Fixed [dims check](https://github.com/arrayfire/arrayfire/commit/6975da8) in \ref af::iota()
+* Fixed [out-of-bounds access](https://github.com/arrayfire/arrayfire/commit/7fc3856) in \ref af::sift()
+* Fixed [memory allocation](https://github.com/arrayfire/arrayfire/commit/5e88e4a) in \ref af::fast() OpenCL
+* Fixed [memory leak](https://github.com/arrayfire/arrayfire/pull/994) in image I/O functions
+* \ref af::dog() now returns floating-point type arrays
+
+Documentation Updates
+---------------------
+* Improved tutorials documentation
+    * More detailed Using on [Linux](\ref using_on_linux), [OSX](\ref using_on_osx),
+      [Windows](\ref using_on_windows) pages.
+* Added return type information for functions that return different type
+  arrays
+
+New Examples
+------------
+* Graphics
+    * [Plot3](\ref graphics/plot3.cpp)
+    * [Surface](\ref graphics/surface.cpp)
+* [Shallow Water Equation](\ref pde/swe.cpp)
+* [Basic](\ref unified/basic.cpp) as a Unified backend example
+
+Installers
+-----------
+* All installers now include the Unified backend and corresponding CMake files
+* Visual Studio projects include Unified in the Platform Configurations
+* Added installer for Jetson TX1
+* SIFT and GLOH do not ship with the installers as SIFT is protected by
+  patents that do not allow commercial distribution without licensing.
+
+v3.1.3
+==============
+
+Bug Fixes
+---------
+
+* Fixed [bugs](https://github.com/arrayfire/arrayfire/issues/1042) in various OpenCL kernels without offset additions
+* Remove ARCH_32 and ARCH_64 flags
+* Fix [missing symbols](https://github.com/arrayfire/arrayfire/issues/1040) when freeimage is not found
+* Use CUDA driver version for Windows
+* Improvements to SIFT
+* Fixed [memory leak](https://github.com/arrayfire/arrayfire/issues/1045) in median
+* Fixes for Windows compilation when not using MKL [#1047](https://github.com/arrayfire/arrayfire/issues/1047)
+* Fixes for building without LAPACK
+
+Other
+-------
+
+* Documentation: Fixed documentation for select and replace
+* Documentation: Fixed documentation for af_isnan
+
+v3.1.2
+==============
+
+Bug Fixes
+---------
+
+* Fixed [bug](https://github.com/arrayfire/arrayfire/commit/4698f12) in assign that was causing test to fail
+* Fixed bug in convolve. Frequency condition now depends on kernel size only
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1005) in indexed reductions for complex type in OpenCL backend
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1006) in kernel name generation in ireduce for OpenCL backend
+* Fixed non-linear to linear indices in ireduce
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1011) in reductions for small arrays
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/1010) in histogram for indexed arrays
+* Fixed [compiler error](https://github.com/arrayfire/arrayfire/issues/1015) CPUID for non-compliant devices
+* Fixed [failing tests](https://github.com/arrayfire/arrayfire/issues/1008) on i386 platforms
+* Add missing AFAPI
+
+Other
+-------
+
+* Documentation: Added missing examples and other corrections
+* Documentation: Fixed warnings in documentation building
+* Installers: Send error messages to log file in OSX Installer
+
+v3.1.1
+==============
+
+Installers
+-----------
+
+* CUDA backend now depends on CUDA 7.5 toolkit
+* OpenCL backend now requires OpenCL 1.2 or greater
+
+Bug Fixes
+--------------
+
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/981) in reductions after indexing
+* Fixed [bug](https://github.com/arrayfire/arrayfire/issues/976) in indexing when using reverse indices
+
+Build
+------
+
+* `cmake` now includes `PKG_CONFIG` in the search path for CBLAS and LAPACKE libraries
+* [heston_model.cpp](\ref financial/heston_model.cpp) example now builds with the default ArrayFire cmake files after installation
+
+Other
+------
+
+* Fixed bug in [image_editing.cpp](\ref image_processing/image_editing.cpp)
+
+v3.1.0
+==============
+
+Function Additions
+------------------
+* Computer Vision Functions
+    * \ref af::nearestNeighbour() - Nearest Neighbour with SAD, SSD and SHD distances
+    * \ref af::harris() - Harris Corner Detector
+    * \ref af::susan() - Susan Corner Detector
+    * \ref af::sift() - Scale Invariant Feature Transform (SIFT)
+        * "Method and apparatus for identifying scale invariant features
+          in an image and use of same for locating an object in an image,"
+          David G. Lowe, US Patent 6,711,293 (March 23, 2004). Provisional
+          application filed March 8, 1999. Assignee: The University of British
+          Columbia. For further details, contact David Lowe (lowe@cs.ubc.ca)
+          or the University-Industry Liaison Office of the University of
+          British Columbia.
+        * SIFT is available for compiling but does not ship with ArrayFire
+          hosted installers/pre-built libraries
+    * \ref af::dog() -  Difference of Gaussians
+
+* Image Processing Functions
+    * \ref ycbcr2rgb() and \ref rgb2ycbcr() - RGB <-> YCbCr color space conversion
+    * \ref wrap() and \ref unwrap() - Wrap and Unwrap
+    * \ref sat() - Summed Area Tables
+    * \ref loadImageMem() and \ref saveImageMem() - Load and Save images to/from memory
+        * \ref af_image_format - Added imageFormat (af_image_format) enum
+
+* Array & Data Handling
+    * \ref copy() - Copy
+    * array::lock() and array::unlock() - Lock and Unlock
+    * \ref select() and \ref replace() - Select and Replace
+    * Get array reference count (af_get_data_ref_count)
+
+* Signal Processing
+    * \ref fftInPlace() - 1D in place FFT
+    * \ref fft2InPlace() - 2D in place FFT
+    * \ref fft3InPlace() - 3D in place FFT
+    * \ref ifftInPlace() - 1D in place Inverse FFT
+    * \ref ifft2InPlace() - 2D in place Inverse FFT
+    * \ref ifft3InPlace() - 3D in place Inverse FFT
+    * \ref fftR2C() - Real to complex FFT
+    * \ref fftC2R() - Complex to Real FFT
+
+* Linear Algebra
+    * \ref svd() and \ref svdInPlace() - Singular Value Decomposition
+
+* Other operations
+    * \ref sigmoid() - Sigmoid
+    * Sum (with option to replace NaN values)
+    * Product (with option to replace NaN values)
+
+* Graphics
+    * Window::setSize() - Window resizing using Forge API
+
+* Utility
+    * Allow users to set print precision (print, af_print_array_gen)
+    * \ref saveArray() and \ref readArray() - Stream arrays to binary files
+    * \ref toString() - toString function returns the array and data as a string
+
+* CUDA specific functionality
+    * \ref getStream() - Returns default CUDA stream ArrayFire uses for the current device
+    * \ref getNativeId() - Returns native id of the CUDA device
+
+Improvements
+------------
+* dot
+    * Allow complex inputs with conjugate option
+* AF_INTERP_LOWER interpolation
+    * For resize, rotate and transform based functions
+* 64-bit integer support
+    * For reductions, random, iota, range, diff1, diff2, accum, join, shift
+      and tile
+* convolve
+    * Support for non-overlapping batched convolutions
+* Complex Arrays
+    * Fix binary ops on complex inputs of mixed types
+    * Complex type support for exp
+* tile
+    * Performance improvements by using JIT when possible.
+* Add AF_API_VERSION macro
+    * Allows disabling of API to maintain consistency with previous versions
+* Other Performance Improvements
+    * Use reference counting to reduce unnecessary copies
+* CPU Backend
+    * Device properties for CPU
+    * Improved performance when all buffers are indexed linearly
+* CUDA Backend
+    * Use streams in CUDA (no longer using default stream)
+    * Using async cudaMem ops
+    * Add 64-bit integer support for JIT functions
+    * Performance improvements for CUDA JIT for non-linear 3D and 4D arrays
+* OpenCL Backend
+    * Improve compilation times for OpenCL backend
+    * Performance improvements for non-linear JIT kernels on OpenCL
+    * Improved shared memory load/store in many OpenCL kernels (PR 933)
+    * Using cl.hpp v1.2.7
+
+Bug Fixes
+---------
+* Common
+    * Fix compatibility of c32/c64 arrays when operating with scalars
+    * Fix median for all values of an array
+    * Fix double free issue when indexing (30cbbc7)
+    * Fix [bug](https://github.com/arrayfire/arrayfire/issues/901) in rank
+    * Fix default values for scale throwing exception
+    * Fix conjg raising exception on real input
+    * Fix bug when using conjugate transpose for vector input
+    * Fix issue with const input for array_proxy::get()
+* CPU Backend
+    * Fix randn generating same sequence for multiple calls
+    * Fix setSeed for randu
+    * Fix casting to and from complex
+    * Check NULL values when allocating memory
+    * Fix [offset issue](https://github.com/arrayfire/arrayfire/issues/923) for CPU element-wise operations
+
+New Examples
+------------
+* Match Template
+* Susan
+* Heston Model (contributed by Michael Nowotny)
+
+Installer
+----------
+* Fixed bug in automatic detection of ArrayFire when using with CMake in Windows
+* The Linux libraries are now compiled with static version of FreeImage
+
+Known Issues
+------------
+* OpenBLAS can cause issues with QR factorization in CPU backend
+* FreeImage older than 3.10 can cause issues with loadImageMem and
+  saveImageMem
+* OpenCL backend issues on OSX
+    * AMD GPUs not supported because of driver issues
+    * Intel CPUs not supported
+    * Linear algebra functions do not work on Intel GPUs.
+* Stability and correctness issues with open source OpenCL implementations such as Beignet, GalliumCompute.
+
+v3.0.2
+==============
+
+Bug Fixes
+--------------
+
+* Added missing symbols from the compatible API
+* Fixed a bug affecting corner rows and elements in \ref af::grad()
+* Fixed linear interpolation bugs affecting large images in the following:
+    - \ref af::approx1()
+    - \ref af::approx2()
+    - \ref af::resize()
+    - \ref af::rotate()
+    - \ref af::scale()
+    - \ref af::skew()
+    - \ref af::transform()
+
+Documentation
+-----------------
+
+* Added missing documentation for \ref af::constant()
+* Added missing documentation for `array::scalar()`
+* Added supported input types for functions in `arith.h`
+
+v3.0.1
+==============
+
+Bug Fixes
+--------------
+
+* Fixed header to work in Visual Studio 2015
+* Fixed a bug in batched mode for FFT based convolutions
+* Fixed graphics issues on OSX
+* Fixed various bugs in visualization functions
+
+Other improvements
+---------------
+
+* Improved fractal example
+* New OSX installer
+* Improved Windows installer
+  * Default install path has been changed
+* Fixed bug in machine learning examples
+
+<br>
+
+v3.0.0
+=================
+
 Major Updates
 -------------
 
 * ArrayFire is now open source
-* New backend: CPU fallback for systems without GPUs
-* A new and improved C api
+* Major changes to the visualization library
+* Introducing handle based C API
+* New backend: CPU fallback available for systems without GPUs
+* Dense linear algebra functions available for all backends
 * Support for 64 bit integers
 
 Function Additions
@@ -16,6 +1982,8 @@ Function Additions
     * iota()
 
 * Computer Vision Algorithms
+    * features()
+        * A data structure to hold features
     * fast()
         * FAST feature detector
     * orb()
@@ -24,21 +1992,59 @@ Function Additions
 * Image Processing
     * convolve1(), convolve2(), convolve3()
         * Specialized versions of convolve() to enable better batch support
+    * fftconvolve1(), fftconvolve2(), fftconvolve3()
+        * Convolutions in frequency domain to support larger kernel sizes
+    * dft(), idft()
+        * Unified functions for calling multidimensional FFTs.
     * matchTemplate()
         * Match a kernel in an image
     * sobel()
         * Get sobel gradients of an image
+    * rgb2hsv(), hsv2rgb(), rgb2gray(), gray2rgb()
+        * Explicit function calls to colorspace conversions
+    * erode3d(), dilate3d()
+        * Explicit erode and dilate calls for image morphing
 
-* Matrix Multiply
+* Linear Algebra
     * matmulNT(), matmulTN(), matmulTT()
         * Specialized versions of matmul() for transposed inputs
+    * luInPlace(), choleskyInPlace(), qrInPlace()
+        * In place factorizations to improve memory requirements
+    * solveLU()
+        * Specialized solve routines to improve performance
+    * OpenCL backend now supports Linear Algebra functions
 
 * Other functions
     * lookup() - lookup indices from a table
+    * batchFunc() - helper function to perform batch operations
 
-Deprecated Function APIs
+* Visualization functions
+    * Support for multiple windows
+    * window.hist()
+        * Visualize the output of the histogram
+
+* C API
+    * Removed old pointer based C API
+    * Introducing handle based C API
+    * Just In Time compilation available in C API
+    * C API has feature parity with C++ API
+    * bessel functions removed
+    * cross product functions removed
+    * Kronecker product functions removed
+
+Performance Improvements
 ------------------------
+* Improvements across the board for OpenCL backend
 
+API Changes
+---------------------
+* `print` is now af_print()
+* seq(): The step parameter is now the third input
+    * seq(start, step, end) changed to seq(start, end, step)
+* gfor(): The iterator now needs to be seq()
+
+Deprecated Function APIs
+------------------------
 Deprecated APIs are in af/compatible.h
 
 * devicecount() changed to getDeviceCount()
@@ -47,11 +2053,21 @@ Deprecated APIs are in af/compatible.h
 * loadimage() changed to loadImage()
 * saveimage() changed to saveImage()
 * gaussiankernel() changed to gaussianKernel()
+* alltrue() changed to allTrue()
+* anytrue() changed to anyTrue()
+* setunique() changed to setUnique()
+* setunion() changed to setUnion()
+* setintersect() changed to setIntersect()
+* histequal() changed to histEqual()
+* colorspace() changed to colorSpace()
+* filter() deprecated. Use convolve1() and convolve2()
+* mul() changed to product()
+* deviceprop() changed to deviceProp()
 
-API Changes
----------------------
-* `print` is now af_print()
-
-Performance Improvements
-------------------------
-* Improvements across the board for OpenCL backend
+Known Issues
+----------------------
+* OpenCL backend issues on OSX
+    * AMD GPUs not supported because of driver issues
+    * Intel CPUs not supported
+    * Linear algebra functions do not work on Intel GPUs.
+* Stability and correctness issues with open source OpenCL implementations such as Beignet, GalliumCompute.
diff --git a/docs/pages/timing.md b/docs/pages/timing.md
index 98042a79e8..8c43808a5c 100644
--- a/docs/pages/timing.md
+++ b/docs/pages/timing.md
@@ -1,64 +1,153 @@
-Timing Your Code {#timing}
+Timing ArrayFire Code {#timing}
 ================
 
-timer() : A platform-independent timer with microsecond accuracy:
-* [timer::start()](\ref af::timer::start) starts a timer
+In performance-sensitive applications, it is vital to profile and measure the
+execution time of operations. ArrayFire provides mechanisms to achieve this.
 
-* [timer::start()](\ref af::timer::stop) seconds since last \ref af::timer::start "start"
+ArrayFire employs an asynchronous evaluation model for all of its
+functions. This means that operations are queued to execute but do not
+necessarily complete prior to function return. Hence, directly measuring the
+time taken for an ArrayFire function could be misleading. To accurately
+measure time, one must ensure the operations are evaluated and synchronize the
+ArrayFire stream.
 
-* \ref af::timer::stop(af::timer start) "timer::start(timer start)" seconds since 'start'
+ArrayFire also employs a lazy evaluation model for its elementwise arithmetic
+operations. This means operations are not queued for execution until the
+result is needed by a downstream operation, which then blocks until the
+queued operations are complete.
 
-Example: single timer
+The following describes how to time ArrayFire code using the eval and sync
+functions along with the timer and timeit functions. A final note on kernel
+caching also provides helpful details about ArrayFire runtimes.
 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
-   // start timer
-   timer::start();
-   // run your code
-   printf("elapsed seconds: %g\n", timer::stop());
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+## Using ArrayFire eval and sync functions
 
-Example: multiple timers
+ArrayFire provides functions to force the evaluation of lazy functions and to
+block until all asynchronous operations complete.
 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
-   // start timers
-   timer start1 = timer::start();
-   timer start2 = timer::start();
-   // run some code
-   printf("elapsed seconds: %g\n", timer::stop(start1));
-   // run more code
-   printf("elapsed seconds: %g\n", timer::stop(start2));
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+1. The [eval](\ref af::eval) function:
 
-Accurate and reliable measurement of performance involves several factors:
-* Executing enough iterations to achieve peak performance.
-* Executing enough repetitions to amortize any overhead from system timers.
+   Forces the evaluation of an ArrayFire array. It ensures the execution of
+   operations queued up for a specific array.
 
-To take care of much of this boilerplate, [timeit](\ref af::timeit) provides
-accurate and reliable estimates of both CPU or GPU code.
+   It is only required for timing purposes if elementwise arithmetic functions
+   are called on the array, since these are handled by the ArrayFire JIT.
 
-Here`s a stripped down example of
-[Monte-Carlo estimation of PI](\ref pi.cpp) making use
-of [timeit](\ref af::timeit).  Notice how it expects a `void` function pointer.
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+   af::array A = af::randu(1000, 1000);
+   af::array B = A + A;                 // Elementwise arithmetic operation.
+   B.eval();                            // Forces evaluation of B.
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
-#include <stdio.h>
-#include <arrayfire.h>
-using namespace af;
+   The function initiates the evaluation of the JIT-tree for that array and
+   may return prior to the completion of those operations. To ensure proper
+   timing, combine with a [sync](\ref af::sync) function.
 
-void pi_function() {
-  int n = 20e6; // 20 million random samples
-  array x = randu(n,f32), y = randu(n,f32);
-  // how many fell inside unit circle?
-  float pi = 4.0 * sum<float>(sqrt(x*x + y*y)) < 1) / n;
-}
+2. The [sync](\ref af::sync) function:
 
-int main() {
-  printf("pi_function took %g seconds\n", timeit(pi_function));
-  return 0;
-}
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+   Synchronizes the ArrayFire stream. It waits for all the previous operations
+   in the stream to finish. It is often used after [eval](\ref af::eval) to
+   ensure that operations have indeed been completed.
 
-This produces:
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+   af::sync();  // Waits for all previous operations to complete.
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-	pi_function took 0.007252 seconds
-	(test machine: Core i7 920 @ 2.67GHz with a Tesla C2070)
+## Using ArrayFire timer and timeit functions
+
+ArrayFire provides simple timer functions that return the current time in
+seconds.
+
+1. The [timer](\ref af::timer) function:
+
+   timer() : A platform-independent timer with microsecond accuracy:
+   * [timer::start()](\ref af::timer::start) starts a timer
+
+   * [timer::stop()](\ref af::timer::stop) seconds since last \ref
+     af::timer::start "start"
+
+   * \ref af::timer::stop(af::timer start) "timer::stop(timer start)" seconds
+     since 'start'
+
+   Example: single timer
+
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+       // start timer
+       // - be sure to use the eval and sync functions so that previous code
+       //   does not get timed as part of the execution segment being measured
+       timer::start();
+       // run a code segment
+       // - be sure to use the eval and sync functions to ensure the code
+       //   segment operations have been completed
+       // stop timer
+       printf("elapsed seconds: %g\n", timer::stop());
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   Example: multiple timers
+
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+       // start timers
+       // - be sure to use the eval and sync functions so that previous code
+       //   does not get timed as part of the execution segment being measured
+       timer start1 = timer::start();
+       timer start2 = timer::start();
+       // run a code segment
+       // - be sure to use the eval and sync functions to ensure the code
+       //   segment operations have been completed
+       // stop timer1
+       printf("elapsed seconds: %g\n", timer::stop(start1));
+       // run another code segment
+       // - be sure to use the eval and sync functions to ensure the code
+       //   segment operations have been completed
+       // stop timer2
+       printf("elapsed seconds: %g\n", timer::stop(start2));
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   Accurate and reliable measurement of performance involves several factors:
+   * Executing enough iterations to achieve peak performance.
+   * Executing enough repetitions to amortize any overhead from system timers.
+
+2. The [timeit](\ref af::timeit) function:
+
+   To take care of much of this boilerplate, [timeit](\ref af::timeit) provides
+   accurate and reliable timing estimates for both CPU and GPU code.
+
+   Here is a stripped down example of [Monte-Carlo estimation of PI](\ref
+   benchmarks/pi.cpp) making use of [timeit](\ref af::timeit). Notice how it
+   expects a `void` function pointer.
+
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+   #include <stdio.h>
+   #include <arrayfire.h>
+   using namespace af;
+
+   void pi_function() {
+     int n = 20e6; // 20 million random samples
+     array x = randu(n, f32), y = randu(n, f32);
+     // how many fell inside unit circle?
+     float pi = 4.0 * sum<float>(sqrt(x*x + y*y) < 1) / n;
+   }
+
+   int main() {
+     printf("pi_function took %g seconds\n", timeit(pi_function));
+     return 0;
+   }
+   ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+   This produces:
+
+       pi_function took 0.007252 seconds
+       (test machine: Core i7 920 @ 2.67GHz with a Tesla C2070)
+
+
+## A note on kernel caching
+
+The first run of ArrayFire code exercises any JIT compilation in the
+application, automatically saving a cache of the compilation to
+disk. Subsequent runs load the cache from disk, executing without
+compilation. Therefore, it is typically best to "warm up" the code with one
+run to initiate the application's kernel cache. Afterwards, subsequent runs do
+not include the compile time and tend to be faster than the first run.
+
+Averaging the time taken over several runs is generally the best approach, and
+is one reason why the [timeit](\ref af::timeit) function is helpful.
diff --git a/docs/pages/tutorials.md b/docs/pages/tutorials.md
new file mode 100644
index 0000000000..34b65be12c
--- /dev/null
+++ b/docs/pages/tutorials.md
@@ -0,0 +1,19 @@
+# Tutorials {#tutorials}
+
+* [Installation](\ref installing)
+* [Using on Linux](\ref using_on_linux)
+* [Using on Windows](\ref using_on_windows)
+* [Using on OSX](\ref using_on_osx)
+* [Getting Started](\ref gettingstarted)
+* [Introduction to Vectorization](\ref vectorization)
+* [Array and Matrix Manipulation](\ref matrixmanipulation)
+* [CUDA Interoperability](\ref interop_cuda)
+* [OpenCL Interoperability](\ref interop_opencl)
+* [Unified Backend](\ref unifiedbackend)
+* [Forge Visualization](\ref forge_visualization)
+* [Indexing](\ref indexing)
+* [Timing ArrayFire](\ref timing)
+* [Configuring ArrayFire Environment](\ref configuring_environment)
+* [Debugging ArrayFire Code](\ref debugging)
+* [ArrayFire JIT Code Generation](\ref jit)
+* [GFOR Usage](\ref page_gfor)
diff --git a/docs/pages/unified_backend.md b/docs/pages/unified_backend.md
new file mode 100644
index 0000000000..5a99bff8f4
--- /dev/null
+++ b/docs/pages/unified_backend.md
@@ -0,0 +1,212 @@
+Unified Backend {#unifiedbackend}
+==========
+
+[TOC]
+
+# Introduction
+
+The Unified backend was introduced in ArrayFire with version 3.2.
+While this is not an independent backend, it allows the user to switch between
+the different ArrayFire backends (CPU, CUDA, oneAPI and OpenCL) at runtime.
+
+# Compiling with Unified
+
+The steps to compile with the unified backend are the same as compiling with
+any of the other backends.
+The only change is that the executable needs to be linked with the __af__
+library (`libaf.so` (Linux), `libaf.dylib` (OSX), `af.lib` (Windows)).
+
+See the usage guides for [Linux](\ref using_on_linux), [OSX](\ref using_on_osx),
+and [Windows](\ref using_on_windows) for more details.
+
+To use with CMake, use the __ArrayFire_Unified_LIBRARIES__ variable.
+
+# Using the Unified Backend
+
+The Unified backend will try to dynamically load the backend libraries. The
+priority of backends is __CUDA -> oneAPI -> OpenCL -> CPU__.
+
+The most important aspect to note here is that all the libraries the ArrayFire
+libs depend on need to be in the environment paths:
+
+* `LD_LIBRARY_PATH` -> Linux, Unix, OSX
+* `DYLD_LIBRARY_PATH` -> OSX
+* `PATH` -> Windows
+
+If any of the libs are missing, then the library will fail to load and the
+backend will be marked as unavailable.
+
+Optionally, the ArrayFire libs may be present in `AF_PATH` or `AF_BUILD_PATH`
+environment variables if the path is not in the system paths. These are
+treated as fallback paths in case the files are not found in the system paths.
+However, all the other upstream libraries for ArrayFire libs must be present
+in the system path variables shown above.
+
+# Switching Backends
+
+The af_backend enum stores the possible backends.
+To select a backend, call the af::setBackend function as shown below.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.c}
+af::setBackend(AF_BACKEND_CUDA);    // Sets CUDA as current backend
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+To get the count of the number of backends available (the number of `libaf*`
+backend libraries loaded successfully), call the af::getBackendCount function.
+
+# Example
+
+This example is a shortened form of [basic.cpp](\ref unified/basic.cpp).
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.c}
+#include <arrayfire.h>
+
+void testBackend()
+{
+    af::info();
+    af_print(af::randu(5, 4));
+}
+
+int main()
+{
+    try {
+        printf("Trying CPU Backend\n");
+        af::setBackend(AF_BACKEND_CPU);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying CPU backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    try {
+        printf("Trying oneAPI Backend\n");
+        af::setBackend(AF_BACKEND_ONEAPI);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying oneAPI backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    try {
+        printf("Trying CUDA Backend\n");
+        af::setBackend(AF_BACKEND_CUDA);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying CUDA backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    try {
+        printf("Trying OpenCL Backend\n");
+        af::setBackend(AF_BACKEND_OPENCL);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying OpenCL backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    return 0;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The output would be:
+
+    Trying CPU Backend
+    ArrayFire v3.9.0 (CPU, 64-bit Linux, build 23ee0650e)
+    [0] AMD: AMD Ryzen Threadripper PRO 3955WX 16-Cores     af::randu(5, 4)
+    [5 4 1 1]
+        0.6010     0.5497     0.1583     0.3636
+        0.0278     0.2864     0.3712     0.4165
+        0.9806     0.3410     0.3543     0.5814
+        0.2126     0.7509     0.6450     0.8962
+        0.0655     0.4105     0.9675     0.3712
+
+    Trying oneAPI Backend
+    ArrayFire v3.9.0 (oneAPI, 64-bit Linux, build 23ee0650e)
+    [0] Intel(R) OpenCL: AMD Ryzen Threadripper PRO 3955WX 16-Cores     , 128650 MB (fp64)
+    af::randu(5, 4)
+    [5 4 1 1]
+        0.6010     0.5497     0.1583     0.3636
+        0.0278     0.2864     0.3712     0.4165
+        0.9806     0.3410     0.3543     0.5814
+        0.2126     0.7509     0.6450     0.8962
+        0.0655     0.4105     0.9675     0.3712
+
+    Trying CUDA Backend
+    ArrayFire v3.9.0 (CUDA, 64-bit Linux, build 23ee0650e)
+    Platform: CUDA Runtime 12.2, Driver: 535.104.05
+    [0] NVIDIA RTX A5500, 22721 MB, CUDA Compute 8.6
+    -1- NVIDIA RTX A5500, 22719 MB, CUDA Compute 8.6
+    af::randu(5, 4)
+    [5 4 1 1]
+        0.6010     0.5497     0.1583     0.3636
+        0.0278     0.2864     0.3712     0.4165
+        0.9806     0.3410     0.3543     0.5814
+        0.2126     0.7509     0.6450     0.8962
+        0.0655     0.4105     0.9675     0.3712
+
+    Trying OpenCL Backend
+    ArrayFire v3.9.0 (OpenCL, 64-bit Linux, build 23ee0650e)
+    [0] NVIDIA: NVIDIA RTX A5500, 22720 MB
+    -1- NVIDIA: NVIDIA RTX A5500, 22718 MB
+    -2- Intel(R) FPGA Emulation Platform for OpenCL(TM): Intel(R) FPGA Emulation Device, 128650 MB
+    -3- INTEL: AMD Ryzen Threadripper PRO 3955WX 16-Cores     , 128650 MB
+    af::randu(5, 4)
+    [5 4 1 1]
+        0.6010     0.5497     0.1583     0.3636
+        0.0278     0.2864     0.3712     0.4165
+        0.9806     0.3410     0.3543     0.5814
+        0.2126     0.7509     0.6450     0.8962
+        0.0655     0.4105     0.9675     0.3712
+
+
+# Dos and Don'ts
+
+It is very easy to run into exceptions if you are not careful with the
+switching of backends.
+
+### Don't: Do not use arrays between different backends
+
+ArrayFire checks the input arrays to functions for mismatches with the active
+backend. If an array is created on one backend but used while another backend
+is active, an exception with code 503 (`AF_ERR_ARR_BKND_MISMATCH`) is
+thrown.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.c}
+#include <arrayfire.h>
+
+int main()
+{
+    try {
+        af::setBackend(AF_BACKEND_CUDA);
+        af::array A = af::randu(5, 5);
+
+        af::setBackend(AF_BACKEND_OPENCL);
+        af::array B = af::constant(10, 5, 5);
+        af::array C = af::matmul(A, B);     // This will throw an exception
+
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    return 0;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+### Do: Use a naming scheme to track arrays and backends
+
+We recommend that you use a technique to track the arrays on the backends. One
+suggested technique would be to use a suffix of `_cpu`, `_cuda`, `_opencl`
+with the array names. So an array created on the CUDA backend would be named
+`myarray_cuda`.
+
+If you have not used the af::setBackend function anywhere in your code, then
+you do not have to worry about this as all the arrays will be created on the
+same default backend.
+
+### Don't: Do not use custom kernels (CUDA/OpenCL) with the Unified backend
+
+This is another area that is a no go when using the Unified backend. It is not
+recommended that you use custom kernels with the Unified backend. This is mainly
+because the Unified backend is meant to be ultra portable and should use only
+ArrayFire and native CPU code.
diff --git a/docs/pages/using_on_linux.md b/docs/pages/using_on_linux.md
new file mode 100644
index 0000000000..91035426c5
--- /dev/null
+++ b/docs/pages/using_on_linux.md
@@ -0,0 +1,134 @@
+Using ArrayFire on Linux {#using_on_linux}
+=====
+
+Once you have [installed](\ref installing) ArrayFire on your system, the next
+thing to do is set up your build system. On Linux, you can create ArrayFire
+projects using almost any editor, compiler, or build system. The only
+requirements are that you include the ArrayFire header directories and link
+with the ArrayFire library you intend to use i.e. CUDA, OpenCL, oneAPI, CPU,
+or Unified backends.
+
+## The big picture  {#big-picture-linux}
+
+On Linux, we recommend installing ArrayFire to the `/opt/arrayfire` directory. The
+installer will populate files in the following sub-directories:
+
+    include/arrayfire.h         - Primary ArrayFire include file
+    include/af/*.h              - Additional include files
+    lib/libaf*                  - CPU, CUDA, oneAPI, and OpenCL libraries (.a, .so)
+    lib/libforge*               - Visualization library
+    lib/libcu*                  - CUDA backend dependencies
+    lib/libOpenCL.so            - OpenCL ICD Loader library
+    share/ArrayFire/cmake/*     - CMake config (find) scripts
+    share/ArrayFire/examples/*  - All ArrayFire examples
+
+Because ArrayFire follows standard installation practices, you can use
+basically any build system to create and compile projects that use
+ArrayFire. Among the many possible build systems on Linux we suggest using
+ArrayFire with either CMake or Makefiles with CMake being our preferred build
+system.
+
+## Prerequisite software
+
+To build ArrayFire projects you will need a compiler.
+
+#### Fedora, CentOS and Red Hat
+
+Install EPEL repo (not required for Fedora)
+
+```
+yum install epel-release
+yum update
+```
+
+Install build dependencies
+
+```
+yum install gcc gcc-c++ cmake3 make
+```
+
+#### Debian and its derivatives
+
+Install common dependencies
+
+```
+apt install build-essential cmake cmake-curses-gui
+```
+
+## CMake
+
+We recommend that the CMake build system be used to create ArrayFire projects.
+As [discussed above](#big-picture-linux), ArrayFire ships with a series of
+CMake scripts to make finding and using our library easy.
+
+First create a file called `CMakeLists.txt` in your project directory:
+
+    cd your-project-directory
+    touch CMakeLists.txt
+
+and populate it with the following code:
+
+    find_package(ArrayFire)
+    add_executable(<my_executable> [list your source files here])
+
+    # To use Unified backend, do the following.
+    # Unified backend lets you choose the backend at runtime
+    target_link_libraries(<my_executable> ArrayFire::af)
+
+where `my_executable` is the name of the executable you wish to create. See
+the [CMake documentation](https://cmake.org/documentation/) for more
+information on how to use CMake. To link with a specific backend directly,
+replace the `ArrayFire::af` with the following for their respective backends.
+
+* `ArrayFire::afcpu` for CPU backend.
+* `ArrayFire::afcuda` for CUDA backend.
+* `ArrayFire::afoneapi` for oneAPI backend.
+* `ArrayFire::afopencl` for OpenCL backend.
+
+Next we need to instruct CMake to create build instructions and then
+compile. We suggest using CMake's out-of-source build functionality to keep
+your build and source files cleanly separated. To do this, run:
+
+    cd your-project-directory
+    mkdir build
+    cd build
+    cmake ..
+    make
+
+*NOTE:* If you have installed ArrayFire to a non-standard location, CMake can
+still help you out. When you execute CMake, specify the path to the ArrayFire
+installation root as the `ArrayFire_DIR` variable.
+
+For example, if ArrayFire were installed locally to `/home/user/ArrayFire`
+then you would modify the `cmake` command above to contain the following
+definition:
+
+    cmake -DArrayFire_DIR=/home/user/ArrayFire ..
+
+You can also specify this information in the `ccmake` command-line interface.
+
+## Makefiles
+
+Building ArrayFire projects with Makefiles is fairly similar to CMake except
+you must specify all paths and libraries manually.
+
+As with any `make` project, you need to specify the include path to the
+directory containing `arrayfire.h` file. This should be `-I
+/opt/arrayfire/include` if you followed our installation instructions.
+
+Similarly, you will need to specify the path to the ArrayFire library using
+the `-L` option (e.g. `-L/opt/arrayfire/lib`) followed by the specific
+ArrayFire library you wish to use using the `-l` option (for example
+`-lafcpu`, `-lafopencl`, `-lafoneapi`, `-lafcuda`, or `-laf` for the CPU,
+OpenCL, oneAPI, CUDA, and unified backends, respectively).
+
+Here is a minimal example Makefile which uses ArrayFire's CPU backend:
+
+    LIBS=-lafcpu
+    LIB_PATHS=-L/opt/arrayfire/lib
+    INCLUDES=-I/opt/arrayfire/include
+    CC=g++ $(COMPILER_OPTIONS)
+    COMPILER_OPTIONS=-std=c++11 -g
+
+    all: main.cpp Makefile
+        $(CC) main.cpp -o test $(INCLUDES) $(LIBS) $(LIB_PATHS)
diff --git a/docs/pages/using_on_osx.md b/docs/pages/using_on_osx.md
new file mode 100644
index 0000000000..e851509c4b
--- /dev/null
+++ b/docs/pages/using_on_osx.md
@@ -0,0 +1,107 @@
+Using ArrayFire on OSX {#using_on_osx}
+======================================
+
+Once you have [installed](\ref installing) ArrayFire on your system, the next
+thing to do is set up your build system. On OSX, you may create ArrayFire
+projects using almost any editor, compiler, or build system. The only requirement
+is that you can include the ArrayFire header directory, and link with the
+ArrayFire library you intend to use.
+
+## The big picture {#big-picture-osx}
+
+By default, the ArrayFire OSX installer will place several files in your
+computer's `/opt/arrayfire` directory. The installer will populate this
+directory with files in the following sub-directories:
+
+    include/arrayfire.h         - Primary ArrayFire include file
+    include/af/*.h              - Additional include files
+    lib/libaf*                  - CPU, CUDA, and OpenCL libraries (.a, .so)
+    lib/libforge*               - Visualization library
+    lib/libcu*                  - CUDA backend dependencies
+    share/ArrayFire/cmake/*     - CMake config scripts
+    share/ArrayFire/examples/*  - ArrayFire examples
+
+Because ArrayFire follows standard installation practices, you can use basically
+any build system to create and compile projects that use ArrayFire. Among the
+many possible build systems on OSX we suggest using ArrayFire with either
+CMake or Makefiles with CMake being our preferred build system.
+
+## Build Instructions:
+* [CMake](#CMake)
+* [Makefiles](#Makefiles)
+
+## CMake {#CMake}
+
+The CMake build system can be used to create ArrayFire projects. As [discussed
+above](#big-picture-osx), ArrayFire ships with a series of CMake scripts to make
+finding and using our library easy.
+
+First create a file called `CMakeLists.txt` in your project directory:
+
+    cd your-project-directory
+    touch CMakeLists.txt
+
+and populate it with the following code:
+
+    find_package(ArrayFire)
+    add_executable(<my_executable> [list your source files here])
+
+    # To use Unified backend, do the following.
+    # Unified backend lets you choose the backend at runtime
+    target_link_libraries(<my_executable> ArrayFire::af)
+
+where `my_executable` is the name of the executable you wish to create. See the
+[CMake documentation](https://cmake.org/documentation/) for more information on
+how to use CMake. To link with a specific backend directly, replace the
+`ArrayFire::af` with the following for their respective backends.
+
+* `ArrayFire::afcpu` for CPU backend.
+* `ArrayFire::afcuda` for CUDA backend.
+* `ArrayFire::afopencl` for OpenCL backend.
+
+Next we need to instruct CMake to create build instructions and then compile. We
+suggest using CMake's out-of-source build functionality to keep your build and
+source files cleanly separated. To do this, run:
+
+    cd your-project-directory
+    mkdir build
+    cd build
+    cmake ..
+    make
+
+*NOTE:* If you have installed ArrayFire to a non-standard location, CMake can
+still help you out. When you execute CMake, specify the path to the ArrayFire
+installation root as the `ArrayFire_DIR` variable.
+
+For example, if ArrayFire were installed locally to `/home/user/ArrayFire` then
+you would modify the `cmake` command above to contain the following definition:
+
+    cmake -DArrayFire_DIR=/home/user/ArrayFire ..
+
+You can also specify this information in the `ccmake` command-line interface.
+
+## Makefiles {#Makefiles}
+
+Building ArrayFire projects with Makefiles is fairly similar to CMake except you
+must specify all paths and libraries manually.
+
+As with any make project, you need to specify the include path to the directory
+containing `arrayfire.h` file. This should be `-I /opt/arrayfire/include` if you
+followed our installation instructions.
+
+Similarly, you will need to specify the path to the ArrayFire library using the
+`-L` option (e.g. `-L/opt/arrayfire/lib`) followed by the specific ArrayFire
+library you wish to use using the `-l` option (for example `-lafcpu`,
+`-lafopencl`, `-lafcuda`, or `-laf` for the CPU, OpenCL, CUDA, and unified
+backends, respectively).
+
+Here is a minimal example Makefile which uses ArrayFire's CPU backend:
+
+    LIBS=-lafcpu
+    LIB_PATHS=-L/opt/arrayfire/lib
+    INCLUDES=-I/opt/arrayfire/include
+    CC=g++ $(COMPILER_OPTIONS)
+    COMPILER_OPTIONS=-std=c++11 -g
+
+    all: main.cpp Makefile
+        $(CC) main.cpp -o test $(INCLUDES) $(LIBS) $(LIB_PATHS)
diff --git a/docs/pages/using_on_windows.md b/docs/pages/using_on_windows.md
new file mode 100644
index 0000000000..b9084723d1
--- /dev/null
+++ b/docs/pages/using_on_windows.md
@@ -0,0 +1,150 @@
+Using ArrayFire with Microsoft Windows and Visual Studio {#using_on_windows}
+============================================================================
+
+If you have not already done so, please make sure you have installed,
+configured, and tested ArrayFire following the [installation
+instructions](#installing).
+
+# The big picture {#big-picture-windows}
+
+The ArrayFire Windows installer creates the following:
+1. **AF_PATH** environment variable to point to the installation location. The
+   default install location is `C:\Program Files\ArrayFire\v3`
+2. **AF_PATH/include** : Header files for ArrayFire (include directory)
+3. **AF_PATH/lib** : All ArrayFire backend libraries, dlls, and dependency
+   dlls (library directory)
+4. **AF_PATH/examples** : Examples to get started
+5. **AF_PATH/cmake** : CMake config files
+6. **AF_PATH/uninstall.exe** : Uninstaller
+
+# Build and Run Helloworld {#section1}
+
+This can be done in two ways: either by using the CMake build tool or by using
+Visual Studio directly.
+
+##  Using CMake {#section1part1}
+1. Download and install [CMake](https://cmake.org/download/), preferably the
+   latest version.
+2. Open CMake-GUI and set the field __Where is the source code__ to the root
+   directory of examples.
+3. Set the field __Where to build the binaries__ to
+   **path_to_examples_root_dir/build** and click the `Configure` button.
+4. CMake will prompt you to create the `build` directory if not already
+   present. Click "yes" to create the build directory.
+5. Before the configuration begins, CMake will show you a list (drop-down
+   menu) of available Visual Studio versions. Select one and check the radio
+   button that says **Use default native compilers** and click finish.
+6. CMake will show you errors in red text, if any, once configuration is
+   finished. Sometimes a second configuration is necessary.
+7. Click **Generate** button to generate the Visual Studio solution files for
+   the examples.
+8. Click **Open Project** button that is right next to **Generate** button to
+   open the solution file.
+9. You will see the examples segregated into four sets named after the compute
+   backends of ArrayFire: cpu, cuda, oneapi, & opencl, if you installed all
+   backends. Select the helloworld project from any of the installed backends,
+   mark it as startup project, and hit `F5`.
+10. Once the helloworld example builds, you will see a console window with the
+    output from helloworld program.
+
+## Using Visual Studio {#section1part2}
+
+1. Open Visual Studio and create an empty C++ project.
+2. Right-click the project and add an existing source file
+   `examples/helloworld/helloworld.cpp` to this project.
+3. Add `"$(AF_PATH)/include;"` to _Project Properties -> C/C++ -> General ->
+   Additional Include Directories_.
+4. Add `"$(AF_PATH)/lib;"` to _Project Properties -> Linker -> General ->
+   Additional Library Directories_.
+5. Add `afcpu.lib`, `afcuda.lib`, `afoneapi.lib`, or `afopencl.lib` to
+   _Project Properties -> Linker -> Input -> Additional Dependencies_, based
+   on your preferred backend.
+6. (Optional) You may choose to define `NOMINMAX`,
+   `AF_<CPU/CUDA/ONEAPI/OPENCL>`, or `AF_<DEBUG/RELEASE>` in your
+   projects. This can be added to _Project Properties -> C/C++ -> General ->
+   Preprocessor -> Preprocessor Definitions_.
+7. Build and run the project. You will see a console window with the output
+   from helloworld program.
+
+# Using ArrayFire within Existing Visual Studio Projects {#section2}
+This is divided into three parts:
+* [Part A: Adding ArrayFire to an existing solution (Single Backend)](#section2partA)
+* [Part B: Adding ArrayFire CUDA to a new/existing CUDA project](#section2partB)
+* [Part C: Project with all ArrayFire backends](#section2partC)
+
+## Part A: Adding ArrayFire to an existing solution (Single Backend) {#section2partA}
+
+Note: If you plan on using Native CUDA code in the project, use the steps
+under [Part B](#section2partB).
+
+Adding a single backend to an existing project is quite simple:
+
+1. Add `"$(AF_PATH)/include;"` to _Project Properties -> C/C++ -> General ->
+   Additional Include Directories_.
+2. Add `"$(AF_PATH)/lib;"` to _Project Properties -> Linker -> General ->
+   Additional Library Directories_.
+3. Add `afcpu.lib`, `afcuda.lib`, `afopencl.lib`, or `af.lib` to _Project
+   Properties -> Linker -> Input -> Additional Dependencies_, based on your
+   preferred backend.
+
+## Part B: Adding ArrayFire CUDA to a new/existing CUDA project {#section2partB}
+
+If your project contains custom CUDA code, the instructions are
+slightly different, as it requires using a CUDA NVCC project:
+
+1. Create a custom "CUDA NVCC project" in Visual Studio
+2. Add `"$(AF_PATH)/include;"` to _Project Properties -> CUDA C/C++ -> General
+   -> Additional Include Directories_.
+3. Add `"$(AF_PATH)/lib;"` to _Project Properties -> Linker -> General ->
+   Additional Library Directories_.
+4. Add `afcpu.lib`, `afcuda.lib`, `afopencl.lib`, or `af.lib` to _Project Properties ->
+   Linker -> Input -> Additional Dependencies_, based on your preferred backend.
+
+## Part C: Project with all ArrayFire backends {#section2partC}
+
+If you wish to create a project that allows you to use all the ArrayFire
+backends with ease, you should use `af.lib` in step 3 from [Part
+A](#section2partA).
+
+Alternatively, you can download the template project from [ArrayFire Template
+Projects](https://github.com/arrayfire/arrayfire-project-templates).
+
+# Using ArrayFire with CMake
+
+ArrayFire ships with a series of CMake scripts to make finding and using the
+library easy.
+
+First, create a file called `CMakeLists.txt` in your project directory:
+
+    cd your-project-directory
+    touch CMakeLists.txt
+
+and populate it with the following code:
+
+    find_package(ArrayFire)
+    add_executable(<my_executable> [list your source files here])
+
+    # The Unified backend lets you choose the backend at runtime.
+    # To use the Unified backend, do the following:
+    target_link_libraries(<my_executable> ArrayFire::af)
+
+where `<my_executable>` is the name of the executable to create. See the
+[CMake documentation](https://cmake.org/documentation/) for more information
+on how to use CMake. To link with a specific backend directly, replace the
+`ArrayFire::af` with the following for their respective backends.
+
+* `ArrayFire::afcpu` for CPU backend.
+* `ArrayFire::afcuda` for CUDA backend.
+* `ArrayFire::afoneapi` for oneAPI backend.
+* `ArrayFire::afopencl` for OpenCL backend.
+
+Next, instruct CMake to create build instructions and compile them. We suggest
+using CMake's out-of-source build functionality to keep your build and source
+files cleanly separated. To do this, open the CMake GUI.
+
+* Under "source directory", add the path to your project.
+* Under "build directory", add the path to your project and append /build.
+* Click "configure" and choose a 64-bit Visual Studio generator.
+* If the configuration was successful, click "generate". This will create a
+  my-project.sln file under build. Click `Open Project` in CMake-GUI to open
+  the solution and compile the ALL_BUILD project.
diff --git a/docs/pages/using_with_linux.md b/docs/pages/using_with_linux.md
deleted file mode 100644
index 62610bc009..0000000000
--- a/docs/pages/using_with_linux.md
+++ /dev/null
@@ -1,79 +0,0 @@
-Using ArrayFire on Linux {#using_on_linux}
-=====
-
-Among the many possible build systems on Linux we suggest using ArrayFire with
-either CMake or Makefiles with CMake being the preferred build system.
-
-## CMake
-
-This is the suggested method of using ArrayFire on Linux.
-ArrayFire ships with support for CMake by default, including a series of
-`Find` scripts installed  in the `/usr/local/share/ArrayFire` (or similar)
-directory.
-These scripts will automatically find the CUDA, OpenCL, and CPU versions
-of ArrayFire and automatically choose the most powerful installed backend
-(typically CUDA).
-
-To use ArrayFire, simply insert the `FIND_PACKAGE` command inside of your
-`CMakeLists.txt` file as follows:
-
-    FIND_PACKAGE(ArrayFire)
-    INCLUDE_DIRECTORIES(${ArrayFire_INCLUDE_DIRS})
-    ...
-
-    ADD_EXECUTABLE(some_executable ...)
-    TARGET_LINK_LIBRARIES(some_executable ${ArrayFire_LIBRARIES} )
-
-The find script will automatically define several variables including:
-
-    ArrayFire_INCLUDE_DIRS    - Location of ArrayFire's include directory.
-    ArrayFire_LIBRARIES       - Location of ArrayFire's libraries. This will default
-                                to a GPU backend if one
-    ArrayFire_FOUND           - True if ArrayFire has been located
-
-If you wish to use a specific backend, the find script also defines these variables:
-
-    ArrayFire_CPU_FOUND        - True of the ArrayFire CPU library has been found.
-    ArrayFire_CPU_LIBRARIES    - Location of ArrayFire's CPU library, if found
-    ArrayFire_CUDA_FOUND       - True of the ArrayFire CUDA library has been found.
-    ArrayFire_CUDA_LIBRARIES   - Location of ArrayFire's CUDA library, if found
-    ArrayFire_OpenCL_FOUND     - True of the ArrayFire OpenCL library has been found.
-    ArrayFire_OpenCL_LIBRARIES - Location of ArrayFire's OpenCL library, if found
-
-Therefore, if you wish to target a specific specific backend, switch
-`${ArrayFire_LIBRARIES}` to `${ArrayFire_CPU}` `${ArrayFire_OPENCL}` or
-`${ArrayFire_CUDA}` in the `TARGET_LINK_LIBRARIES` command above.
-
-Finally, if you have installed ArrayFire to a non-standard location, CMake can still help
-you out. When you execute CMake specify the path to the `ArrayFireConfig*` files that
-are found in the `share/ArrayFire` subdirectory of the installation folder.
-For example, if ArrayFire were installed locally to `/opt/ArrayFire` then you would
-modify the `cmake` command above to contain the following definition:
-
-```
-cmake -DArrayFire_DIR=/opt/ArrayFire/share/ArrayFire ...
-```
-
-## MakeFiles
-
-Using ArrayFire with Makefiles is almost as easy as CMake, but you will
-need to specify paths manually. In your makefile specify the include path to
-the directory containing `arrayfire.h`. Typically this will be `-I /usr/include`
-or `-I /usr/local/include` if you installed ArrayFire using our installation
-instructions.
-Then, in your linker line specify the path to ArrayFire using the `-L` option
-(typically `-L/usr/lib` or `-L/usr/local/lib` and the specific ArrayFire backend
-you wish to use with the `-l` option (i.e. `-lafcpu`, `-lafopencl` or `-lafcuda`
-for the CPU, OpenCL and CUDA backends repsectively).
-
-Here is a minimial example MakeFile which uses ArrayFire's CPU backend:
-
-    LIBS=-lafcpu
-    LIB_PATHS=/usr/lib
-    INCLUDES=-I/usr/include
-    CC=g++ $(COMPILER_OPTIONS)
-    COMPILER_OPTIONS=-std=c++11 -g
-
-    all: main.cpp Makefile
-        $(CC) main.cpp -o test $(INCLUDES) $(LIBS) $(LIB_PATHS)
-
diff --git a/docs/pages/using_with_ms_windows.md b/docs/pages/using_with_ms_windows.md
deleted file mode 100644
index 48d986585a..0000000000
--- a/docs/pages/using_with_ms_windows.md
+++ /dev/null
@@ -1,90 +0,0 @@
-Using ArrayFire with Microsoft Windows and Visual Studio {#using_on_windows}
-=====
-
-## Step 1: Adding ArrayFire to PATH for all users
-
-The ArrayFire installer for Windows creates a user `PATH` variable containing
-`%AF_PATH%/lib;`. This is required so that Windows knows where to find the
-ArrayFire DLLs. This variable fixes the DLL finding only for the user that
-installs ArrayFire.
-
-To allow DLL detection for all users, it needs to be added to the system
-`PATH` variable. For this, follow the steps:
-
-1. Open Advanced System Settings:
-  * Windows 8: Move the Mouse pointer to the bottom right corner of the screen,
-    Right click, choose System. Then click "Advanced System Settings"
-  * Windows 7: Open the Start Menu and Right Click on "Computer". Then choose
-    Properties and click "Advanced System Settings"
-
-2. In _Advanced System Settings_ window, click on _Advanced_ tab
-
-3. Click on _Environment Variables_, then under **System Variables**, find
-   `PATH`, and click on it.
-
-4. In edit mode, append `%AF_PATH%/lib;`. NOTE: Ensure that there is a semi-colon
-   separating `%AF_PATH%/lib;` from any existing content (e.g.
-   `EXITINGS_PATHS;%AF_PATH%/lib;`) otherwise other software may not function
-   correctly.
-
-## Step 2: Verify the path addition functions correctly
-
-1. Open Visual Studio 2013. Open the HelloWorld solution which is located at
-   `AF_PATH/examples/helloworld/helloworld.sln`.
-2. Build and run the `helloworld` example. Be sure to, select the
-   platform/configuration of your choice using the platform drop-down
-   (the options are CPU, CUDA, and OpenCL) and Solution Configuration drop down
-   (options of Release and Debug) menus.
-3. Run the `helloworld` example
-
-
-## Step 3: Creating your own Visual Studio Project
-
-### A new project from scratch
-
-If you are creating a new project which is intended to be platform-independent,
-the best option is to simply copy the existing `helloworld` solution files
-and modify them to suit your needs. This will retain all the platform based
-settings that have been configured in the examples.
-
-### Adding ArrayFire CPU/OpenCL to a new/existing project
-
-If you are adding ArrayFire to a new or existing project that will contain
-custom CPU or OpenCL kernels, you only need to make a few modifications to
-your project soultion:
-
-1. Open an existing project or create a new "Empty C/C++ project in Visual Studio"
-2. Add `$(AF_PATH)/include;` to
-   _Project Properties -> C/C++ -> General -> Additional Include Directories_
-3. Add `$(AF_PATH)/lib;` to
-  _Project Properties -> Linker -> General -> Additional Library Directories_
-4. Add `afcpu.lib` or `afcuda.lib` or `afopencl.lib` to
-  _Project Properties -> Linker -> Input -> Additional Dependencies_
-  based on your preferred backend.
-5. (Optional) You make choose to define `NOMINMAX`, `AF_<CPU/CUDA/OPENCL>`
-  and/or `AF_<DEBUG/RELEASE>` in your projects. This can be added to
-  _Project Properties -> C/C++ -> General -> Preprocessor-> Preprocessory definitions_.
-
-### Adding ArrayFire CUDA to a new/existing project
-
-Lastly, if your project contains custom CUDA code, the instructions are slightly
-different:
-
-1. Create a custom "CUDA NVCC project" in Visual Studio
-2. Follow steps 2-5 from the _Adding ArrayFire CPU/OpenCL to a new/existing project_
-   instructions above
-3. Add the following lines to the
-   _Project Properties -> Build Events -> Post Build Events_
-   dialog:
-
-     ```
-     echo copy "$(CUDA_PATH)\nvvm\bin\nvvm64*.dll" "$(OutDir)"
-     copy "$(CUDA_PATH)\nvvm\bin\nvvm64*.dll" "$(OutDir)"
-     ```
-
-4. Ensure that you use x64 based configurations.
-
-Please note that this method will not work with the ArrayFire examples as
-our implementations are built with the Visual Studio CL compiler rather than
-NVCC to ensure they are supported across various platforms.
-
diff --git a/docs/pages/vectorization.md b/docs/pages/vectorization.md
new file mode 100644
index 0000000000..339a1a51ec
--- /dev/null
+++ b/docs/pages/vectorization.md
@@ -0,0 +1,247 @@
+Introduction to Vectorization {#vectorization}
+===================
+
+Programmers and Data Scientists want to take advantage of fast and parallel
+computational devices. Writing vectorized code is necessary to get
+the best performance out of the current generation parallel hardware and
+scientific computing software. However, writing vectorized code may not be
+immediately intuitive. ArrayFire provides many ways to vectorize a given code
+segment. In this tutorial, we present several methods to vectorize code
+using ArrayFire and discuss the benefits and drawbacks associated with each method.
+
+# Generic/Default vectorization
+
+By its very nature, ArrayFire is a vectorized library. Most functions operate on
+arrays as a whole -- on all elements in parallel. Wherever possible, existing
+vectorized functions should be used instead of manually indexing into arrays.
+For example, consider the following code:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+af::array a = af::range(10); // [0,  9]
+for(int i = 0; i < a.dims(0); ++i)
+{
+    a(i) = a(i) + 1;         // [1, 10]
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Although completely valid, this code is very inefficient because each iteration
+launches kernels that operate on a single datum.
+Instead, the developer should have used ArrayFire's overload of the + operator:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+af::array a = af::range(10);  // [0,  9]
+a = a + 1;                    // [1, 10]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This code will result in a single kernel that operates on all 10 elements
+of `a` in parallel.
+
+Most ArrayFire functions are vectorized. A small subset of these include:
+
+Operator Category                                           | Functions
+------------------------------------------------------------|--------------------------
+[Arithmetic operations](\ref arith_mat)                     | [+](\ref arith_func_add), [-](\ref arith_func_sub), [*](\ref arith_func_mul), [/](\ref arith_func_div), [%](\ref arith_func_mod), [>>](\ref arith_func_shiftr), [<<](\ref arith_func_shiftl)
+[Logical operations](\ref logic_mat)                        | [&&](\ref arith_func_and), \|\|[(or)](\ref arith_func_or), [<](\ref arith_func_lt), [>](\ref arith_func_gt), [==](\ref arith_func_eq), [!=](\ref arith_func_neq) etc.
+[Numeric functions](\ref numeric_mat)                       | abs(), floor(), round(), min(), max(), etc.
+[Complex operations](\ref complex_mat)                      | real(), imag(), conj(), etc.
+[Exponential and logarithmic functions](\ref explog_mat)    | exp(), log(), expm1(), log1p(), etc.
+[Trigonometric functions](\ref trig_mat)                    | sin(), cos(), tan(), etc.
+[Hyperbolic functions](\ref hyper_mat)                      | sinh(), cosh(), tanh(), etc.
+
+In addition to element-wise operations, many other functions are also
+vectorized in ArrayFire.
+
+Notice that functions that perform some form of aggregation (e.g. `sum()` or
+`min()`), signal processing functions (like `convolve()`), and even image
+processing functions (e.g. `rotate()`) all support vectorization on different
+columns or images. For example, if we have `NUM` images of size `WIDTH` by
+`HEIGHT`, one could convolve each image in a vectorized fashion as follows:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+float g_coef[] = { 1, 2, 1,
+                   2, 4, 2,
+                   1, 2, 1 };
+
+af::array filter = 1.f/16 * af::array(3, 3, g_coef);
+
+af::array signal = randu(WIDTH, HEIGHT, NUM);
+af::array conv = convolve2(signal, filter);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Similarly, one can rotate 100 images by 45 degrees in a single call using
+code like the following:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// Construct an array of 100 WIDTH x HEIGHT images of random numbers
+af::array imgs = randu(WIDTH, HEIGHT, 100);
+// Rotate all of the images in a single command
+af::array rot_imgs = rotate(imgs, af::Pi / 4);  // angle in radians (45 degrees)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Although *most* functions in ArrayFire do support vectorization, some do not,
+most notably all linear algebra functions. Even though they are not vectorized,
+linear algebra operations still execute in parallel on your hardware.
+
+Using the built-in vectorized operations should be the first
+and preferred method of vectorizing any code written with ArrayFire.
+
+# GFOR: Parallel for-loops
+
+Another novel method of vectorization present in ArrayFire is the GFOR loop
+replacement construct. GFOR allows launching all iterations of a loop in parallel
+on the GPU or device, as long as the iterations are independent. While the
+standard for-loop performs each iteration sequentially, ArrayFire's gfor-loop
+performs each iteration at the same time (in parallel). ArrayFire does this by
+tiling out the values of all loop iterations and then performing computation on
+those tiles in one pass. You can think of gfor as performing auto-vectorization
+of your code, e.g. you write a gfor-loop that increments every element of a vector
+but behind the scenes ArrayFire rewrites it to operate on the entire vector in
+parallel.
+
+The original for-loop example at the beginning of this document could be
+rewritten using GFOR as follows:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+af::array a = af::range(10);
+gfor(seq i, a.dims(0))
+    a(i) = a(i) + 1;
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+In this case, each iteration of the gfor-loop is independent, so ArrayFire
+will automatically tile out the `a` array in device memory and execute the
+increment kernels in parallel.
+
+To see another example, you could run an accum() on every slice of a matrix in a
+for-loop, or you could "vectorize" and simply do it all in one gfor-loop operation:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// runs each accum() in sequence
+for (int i = 0; i < N; ++i)
+   B(span,i) = accum(A(span,i));
+
+// runs N accums in parallel
+gfor (seq i, N)
+   B(span,i) = accum(A(span,i));
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+However, returning to our previous vectorization technique, accum() is already
+vectorized and the operation could be completely replaced with merely:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+    B = accum(A);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+It is best to vectorize computation as much as possible to avoid the overhead in
+both for-loops and gfor-loops. However, the gfor-loop construct is most effective
+in the narrow case of broadcast-style operations. Consider the case when we have
+a vector of constants that we wish to apply to a collection of variables, such as
+expressing the values of a linear combination for multiple vectors. The broadcast
+of one set of constants to many vectors works well with gfor-loops:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+const static int p=4, n=1000;
+af::array consts = af::randu(p);
+af::array var_terms = randn(p, n);
+af::array combination(p, n);  // output: one scaled column per iteration
+gfor(seq i, n)
+    combination(span, i) = consts * var_terms(span, i);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Using GFOR requires following several rules and multiple guidelines for optimal
+performance. The details of this vectorization method can be found in the
+[GFOR documentation](\ref gfor).
+
+# Batching
+
+The batchFunc() function allows the broad application of existing ArrayFire
+functions to multiple sets of data. Effectively, batchFunc() allows ArrayFire
+functions to execute in "batch processing" mode. In this mode, functions will
+find a dimension which contains "batches" of data to be processed and will
+parallelize the procedure.
+
+Consider the following example. Here we create a filter which we would like
+to apply to each of the weight vectors. The naive solution would be using a
+for-loop as we have seen previously:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// Create the filter and the weight vectors
+af::array filter = randn(1, 5);
+af::array weights = randu(5, 5);
+
+// Apply the filter using a for-loop
+af::array filtered_weights = constant(0, 5, 5);
+for(int i=0; i<weights.dims(0); ++i){
+    filtered_weights.row(i) = filter * weights.row(i);
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+However, as we have discussed above, this solution will be very inefficient.
+One may be tempted to implement a vectorized solution as follows:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// Create the filter and the weight vectors
+af::array filter = randn(1, 5);
+af::array weights = randu(5, 5);
+
+af::array filtered_weights = filter * weights; // fails due to dimension mismatch
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+However, the dimensions of `filter` and `weights` do not match, thus ArrayFire
+will generate a runtime error.
+
+`batchFunc()` was created to solve this specific problem.
+The signature of the function is as follows:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+array batchFunc(const array &lhs, const array &rhs, batchFunc_t func);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+where `batchFunc_t` is a function pointer of the form:
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+typedef array (*batchFunc_t) (const array &lhs, const array &rhs);
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+So, to use batchFunc(), we need to provide the function we wish to apply as a
+batch operation. For illustration's sake, let's "implement" a multiplication
+function following the format.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+af::array my_mult (const af::array &lhs, const af::array &rhs){
+    return lhs * rhs;
+}
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Our final batch call is not much more difficult than the ideal
+syntax we imagined.
+
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~{.cpp}
+// Create the filter and the weight vectors
+af::array filter = randn(1, 5);
+af::array weights = randu(5, 5);
+
+// Apply the batch function
+af::array filtered_weights = batchFunc( filter, weights, my_mult );
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The batch function will work with many previously mentioned vectorized ArrayFire
+functions. It can even work with a combination of those functions if they are
+wrapped inside a helper function matching the `batchFunc_t` signature.
+One limitation of `batchFunc()` is that it cannot be used from within a
+`gfor` loop at the present time.
+
+# Advanced Vectorization
+
+We have seen the different methods ArrayFire provides to vectorize our code. Tying
+them all together is a slightly more involved process that needs to consider data
+dimensionality and layout, memory usage, nesting order, etc. An excellent example
+and discussion of these factors can be found on our blog:
+
+http://arrayfire.com/how-to-write-vectorized-code/
+
+It's worth noting that the content discussed in the blog has since been transformed
+into a convenient af::nearestNeighbour() function. Before writing something from
+scratch, check that ArrayFire doesn't already have an implementation. The default
+vectorized nature of ArrayFire and an extensive collection of functions will
+speed things up in addition to replacing dozens of lines of code!
+
diff --git a/examples/.clang-format b/examples/.clang-format
new file mode 100644
index 0000000000..692cbc2f40
--- /dev/null
+++ b/examples/.clang-format
@@ -0,0 +1,144 @@
+---
+Language:        Cpp
+# BasedOnStyle:  Google
+AccessModifierOffset: -1
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: false
+AlignEscapedNewlines: Left
+AlignOperands:   true
+AlignTrailingComments: true
+AllowAllParametersOfDeclarationOnNextLine: true
+AllowShortBlocksOnASingleLine: true
+AllowShortCaseLabelsOnASingleLine: true
+AllowShortFunctionsOnASingleLine: All
+AllowShortIfStatementsOnASingleLine: true
+AllowShortLoopsOnASingleLine: true
+AlwaysBreakAfterReturnType: None
+AlwaysBreakBeforeMultilineStrings: true
+AlwaysBreakTemplateDeclarations: Yes
+BinPackArguments: true
+BinPackParameters: true
+BraceWrapping:   
+  AfterClass:      false
+  AfterControlStatement: false
+  AfterEnum:       false
+  AfterFunction:   false
+  AfterNamespace:  false
+  AfterObjCDeclaration: false
+  AfterStruct:     false
+  AfterUnion:      false
+  AfterExternBlock: false
+  BeforeCatch:     false
+  BeforeElse:      false
+  IndentBraces:    false
+  SplitEmptyFunction: false
+  SplitEmptyRecord: false
+  SplitEmptyNamespace: false
+BreakBeforeBinaryOperators: None
+BreakBeforeBraces: Custom
+BreakInheritanceList: BeforeComma
+BreakBeforeTernaryOperators: true
+BreakConstructorInitializers: BeforeComma
+BreakStringLiterals: true
+ColumnLimit:     80
+CommentPragmas:  '^ IWYU pragma:'
+CompactNamespaces: false
+ConstructorInitializerAllOnOneLineOrOnePerLine: true
+ConstructorInitializerIndentWidth: 4
+ContinuationIndentWidth: 4
+Cpp11BracedListStyle: true
+DerivePointerAlignment: true
+DisableFormat:   false
+ExperimentalAutoDetectBinPacking: false
+FixNamespaceComments: true
+ForEachMacros:
+  - foreach
+  - Q_FOREACH
+  - BOOST_FOREACH
+IncludeBlocks:   Preserve
+IncludeCategories: 
+  - Regex:           '^<af/.*\.h.*>'
+    Priority:        2
+  - Regex:           '^<.*\.h.*>'
+    Priority:        1
+  - Regex:           '^<.*'
+    Priority:        3
+  - Regex:           '.*'
+    Priority:        4
+IncludeIsMainRegex: '([-_](test|unittest))?$'
+IndentCaseLabels: true
+IndentPPDirectives: None
+IndentWidth:     4
+IndentWrappedFunctionNames: false
+JavaScriptQuotes: Leave
+JavaScriptWrapImports: true
+KeepEmptyLinesAtTheStartOfBlocks: false
+MacroBlockBegin: ''
+MacroBlockEnd:   ''
+MaxEmptyLinesToKeep: 1
+NamespaceIndentation: None
+ObjCBinPackProtocolList: Never
+ObjCBlockIndentWidth: 2
+ObjCSpaceAfterProperty: false
+ObjCSpaceBeforeProtocolList: true
+PenaltyBreakAssignment: 2
+PenaltyBreakBeforeFirstCallParameter: 1
+PenaltyBreakComment: 300
+PenaltyBreakFirstLessLess: 120
+PenaltyBreakString: 1000
+PenaltyBreakTemplateDeclaration: 10
+PenaltyExcessCharacter: 1000000
+PenaltyReturnTypeOnItsOwnLine: 200
+PointerAlignment: Right
+RawStringFormats: 
+  - Language:        Cpp
+    Delimiters:      
+      - cc
+      - CC
+      - cpp
+      - Cpp
+      - CPP
+      - 'c++'
+      - 'C++'
+      - R
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+  - Language:        TextProto
+    Delimiters:      
+      - pb
+      - PB
+      - proto
+      - PROTO
+    EnclosingFunctions: 
+      - EqualsProto
+      - EquivToProto
+      - PARSE_PARTIAL_TEXT_PROTO
+      - PARSE_TEST_PROTO
+      - PARSE_TEXT_PROTO
+      - ParseTextOrDie
+      - ParseTextProtoOrDie
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+ReflowComments:  true
+SortIncludes:    true
+SortUsingDeclarations: true
+SpaceAfterCStyleCast: false
+SpaceAfterTemplateKeyword: false
+SpaceBeforeAssignmentOperators: true
+SpaceBeforeCpp11BracedList: false
+SpaceBeforeCtorInitializerColon: true
+SpaceBeforeInheritanceColon: true
+SpaceBeforeParens: ControlStatements
+SpaceBeforeRangeBasedForLoopColon: true
+SpaceInEmptyParentheses: false
+SpacesBeforeTrailingComments: 2
+SpacesInAngles:  false
+SpacesInContainerLiterals: true
+SpacesInCStyleCastParentheses: false
+SpacesInParentheses: false
+SpacesInSquareBrackets: false
+Standard:        Cpp03
+TabWidth:        4
+UseTab:          Never
+
diff --git a/examples/CMakeLists.txt b/examples/CMakeLists.txt
index 9faf02545d..91280e485e 100644
--- a/examples/CMakeLists.txt
+++ b/examples/CMakeLists.txt
@@ -1,81 +1,41 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-PROJECT(arrayfire-examples)
-
-# Find CUDA and OpenCL
-SET(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules")
-FIND_PACKAGE(CUDA)
-FIND_PACKAGE(OpenCL)
-
-# If the examples are not being built at the same time as ArrayFire,
-# we need to first find the ArrayFire library
-if(TARGET afcpu OR TARGET afcuda OR TARGET afopencl)
-    SET(ArrayFire_CPU_FOUND False)
-    SET(ArrayFire_CUDA_FOUND False)
-    SET(ArrayFire_OpenCL_FOUND False)
-    SET(ASSETS_DIR "${CMAKE_CURRENT_SOURCE_DIR}/../assets")
-    IF(NOT EXISTS "${ASSETS_DIR}/LICENSE")
-        MESSAGE(WARNING "Arrayfire assets are not available. Assets will not be installed.")
-        MESSAGE("Did you miss the --recursive option when cloning?")
-        MESSAGE("Run the following commands to correct this:")
-        MESSAGE("git submodule init")
-        MESSAGE("git submodule update")
-        MESSAGE("git submodule foreach git pull origin master")
-    ENDIF()
-else()
-    FIND_PACKAGE(ArrayFire REQUIRED)
-    INCLUDE_DIRECTORIES(${ArrayFire_INCLUDE_DIRS})
-
-    SET(ASSETS_DIR "${CMAKE_CURRENT_SOURCE_DIR}/assets")
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+cmake_policy(VERSION 3.5)
+project(ArrayFire-Examples
+  VERSION 3.7.0
+  LANGUAGES CXX)
+
+set(CMAKE_CXX_STANDARD 14)
+if(NOT EXISTS "${ArrayFire_SOURCE_DIR}/CMakeLists.txt")
+  set(ASSETS_DIR "${CMAKE_CURRENT_SOURCE_DIR}/..")
 endif()
 
-# A macro to build an ArrayFire example
-# For most uses only FIND_PACKAGE(ArrayFire REQUIRED), ADD_EXECUTABLE(...)
-# and TARGET_LINK_LIBRARIES(... ${ARRAYFIRE_LIBRARIES}) are needed
-MACRO(BUILD_EXAMPLE EXAMPLE_NAME EXAMPLE_SOURCE BACKEND_NAME BACKEND_LIBRARIES)
-
-    ADD_EXECUTABLE(example_${EXAMPLE_NAME}_${BACKEND_NAME} ${EXAMPLE_SOURCE})
-    TARGET_LINK_LIBRARIES(example_${EXAMPLE_NAME}_${BACKEND_NAME}
-        ${BACKEND_LIBRARIES} )
-    SET_TARGET_PROPERTIES(example_${EXAMPLE_NAME}_${BACKEND_NAME}
-        PROPERTIES
-        OUTPUT_NAME ${EXAMPLE_NAME}_${BACKEND_NAME}
-        RUNTIME_OUTPUT_DIRECTORY ${DIR_NAME}
-        FOLDER "Examples/${BACKEND_NAME}")
-ENDMACRO()
-
-# Collect the source
-FILE(GLOB FILES "*/*.cpp")
-ADD_DEFINITIONS("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+file(TO_NATIVE_PATH ${ASSETS_DIR} ASSETS_DIR)
 
-FOREACH(FILE ${FILES})
-    GET_FILENAME_COMPONENT(EXAMPLE ${FILE} NAME_WE)
-    GET_FILENAME_COMPONENT(FULL_DIR_NAME ${FILE} PATH)
-    GET_FILENAME_COMPONENT(DIR_NAME ${FULL_DIR_NAME} NAME)
-
-    # Next we build each example using every backend.
-    if(${ArrayFire_CPU_FOUND})  # variable defined by FIND(ArrayFire ...)
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} cpu ${ArrayFire_CPU_LIBRARIES})
-    elseif(TARGET afcpu)        # variable defined by the ArrayFire build tree
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} cpu afcpu)
-    endif()
-
-    if(${ArrayFire_CUDA_FOUND} AND ${CUDA_FOUND})  # variable defined by FIND(ArrayFire ...)
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} cuda ${ArrayFire_CUDA_LIBRARIES})
-    elseif(TARGET afcuda)        # variable defined by the ArrayFire build tree
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} cuda afcuda)
-    endif()
-
-    if(${ArrayFire_OpenCL_FOUND} AND ${OpenCL_FOUND})  # variable defined by FIND(ArrayFire ...)
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} opencl ${ArrayFire_OpenCL_LIBRARIES})
-    elseif(TARGET afopencl)        # variable defined by the ArrayFire build tree
-        BUILD_EXAMPLE(${EXAMPLE} ${FILE} opencl afopencl)
-    endif()
-ENDFOREACH()
-
-INSTALL(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}"
-    DESTINATION "${AF_INSTALL_DOC_DIR}"
-    COMPONENT examples)
+if(WIN32)
+  string(REPLACE "\\" "\\\\" ASSETS_DIR  ${ASSETS_DIR})
+  # - WIN32_LEAN_AND_MEAN & VC_EXTRALEAN reduce the number of
+  #   Windows headers being included.
+  # - NOMINMAX is required for ArrayFire code that uses the functions
+  #   af::min & af::max; namespacing does not stop the Windows min/max macros.
+  add_definitions(-DWIN32_LEAN_AND_MEAN -DVC_EXTRALEAN -DNOMINMAX)
+  unset(CMAKE_RUNTIME_OUTPUT_DIRECTORY)
+endif()
 
-INSTALL(DIRECTORY "${CMAKE_CURRENT_SOURCE_DIR}/../assets/examples"
-    DESTINATION "${AF_INSTALL_DOC_DIR}/examples/assets/"
-)
+add_subdirectory(benchmarks)
+add_subdirectory(computer_vision)
+add_subdirectory(financial)
+add_subdirectory(getting_started)
+add_subdirectory(graphics)
+add_subdirectory(helloworld)
+add_subdirectory(image_processing)
+add_subdirectory(lin_algebra)
+add_subdirectory(machine_learning)
+add_subdirectory(pde)
+add_subdirectory(unified)
diff --git a/examples/CMakeModules/FindOpenCL.cmake b/examples/CMakeModules/FindOpenCL.cmake
deleted file mode 100644
index 80fcf493bc..0000000000
--- a/examples/CMakeModules/FindOpenCL.cmake
+++ /dev/null
@@ -1,168 +0,0 @@
-#.rst:
-# FindOpenCL
-# ----------
-#
-# Try to find OpenCL
-#
-# Once done this will define::
-#
-#   OpenCL_FOUND          - True if OpenCL was found
-#   OpenCL_INCLUDE_DIRS   - include directories for OpenCL
-#   OpenCL_LIBRARIES      - link against this library to use OpenCL
-#   OpenCL_VERSION_STRING - Highest supported OpenCL version (eg. 1.2)
-#   OpenCL_VERSION_MAJOR  - The major version of the OpenCL implementation
-#   OpenCL_VERSION_MINOR  - The minor version of the OpenCL implementation
-#
-# The module will also define two cache variables::
-#
-#   OpenCL_INCLUDE_DIR    - the OpenCL include directory
-#   OpenCL_LIBRARY        - the path to the OpenCL library
-#
-
-#=============================================================================
-# From CMake 3.2
-# Copyright 2014 Matthaeus G. Chajdas
-#
-# Distributed under the OSI-approved BSD License (the "License");
-# see accompanying file Copyright.txt for details.
-#
-# This software is distributed WITHOUT ANY WARRANTY; without even the
-# implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-# See the License for more information.
-
-# CMake - Cross Platform Makefile Generator
-# Copyright 2000-2014 Kitware, Inc.
-# Copyright 2000-2011 Insight Software Consortium
-# All rights reserved.
-# 
-# Redistribution and use in source and binary forms, with or without
-# modification, are permitted provided that the following conditions
-# are met:
-# 
-# * Redistributions of source code must retain the above copyright
-# notice, this list of conditions and the following disclaimer.
-# 
-# * Redistributions in binary form must reproduce the above copyright
-# notice, this list of conditions and the following disclaimer in the
-# documentation and/or other materials provided with the distribution.
-# 
-# * Neither the names of Kitware, Inc., the Insight Software Consortium,
-# nor the names of their contributors may be used to endorse or promote
-# products derived from this software without specific prior written
-# permission.
-# 
-# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
-# "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
-# LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
-# A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
-# HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
-# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
-# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
-# DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
-# THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
-# (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
-# OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-#=============================================================================
-
-function(_FIND_OPENCL_VERSION)
-  include(CheckSymbolExists)
-  include(CMakePushCheckState)
-  set(CMAKE_REQUIRED_QUIET ${OpenCL_FIND_QUIETLY})
-
-  CMAKE_PUSH_CHECK_STATE()
-  foreach(VERSION "2_0" "1_2" "1_1" "1_0")
-    set(CMAKE_REQUIRED_INCLUDES "${OpenCL_INCLUDE_DIR}")
-    if(APPLE)
-      CHECK_SYMBOL_EXISTS(
-        CL_VERSION_${VERSION}
-        "${OpenCL_INCLUDE_DIR}/OpenCL/cl.h"
-        OPENCL_VERSION_${VERSION})
-    else()
-      CHECK_SYMBOL_EXISTS(
-        CL_VERSION_${VERSION}
-        "${OpenCL_INCLUDE_DIR}/CL/cl.h"
-        OPENCL_VERSION_${VERSION})
-    endif()
-
-    if(OPENCL_VERSION_${VERSION})
-      string(REPLACE "_" "." VERSION "${VERSION}")
-      set(OpenCL_VERSION_STRING ${VERSION} PARENT_SCOPE)
-      string(REGEX MATCHALL "[0-9]+" version_components "${VERSION}")
-      list(GET version_components 0 major_version)
-      list(GET version_components 1 minor_version)
-      set(OpenCL_VERSION_MAJOR ${major_version} PARENT_SCOPE)
-      set(OpenCL_VERSION_MINOR ${minor_version} PARENT_SCOPE)
-      break()
-    endif()
-  endforeach()
-  CMAKE_POP_CHECK_STATE()
-endfunction()
-
-find_path(OpenCL_INCLUDE_DIR
-  NAMES
-    CL/cl.h OpenCL/cl.h
-  PATHS
-    ENV "PROGRAMFILES(X86)"
-    ENV NVSDKCOMPUTE_ROOT
-    ENV CUDA_PATH
-    ENV AMDAPPSDKROOT
-    ENV INTELOCLSDKROOT
-    ENV ATISTREAMSDKROOT
-  PATH_SUFFIXES
-    include
-    OpenCL/common/inc
-    "AMD APP/include")
-
-_FIND_OPENCL_VERSION()
-
-if(WIN32)
-  if(CMAKE_SIZEOF_VOID_P EQUAL 4)
-    find_library(OpenCL_LIBRARY
-      NAMES OpenCL
-      PATHS
-        ENV "PROGRAMFILES(X86)"
-        ENV CUDA_PATH
-        ENV NVSDKCOMPUTE_ROOT
-        ENV AMDAPPSDKROOT
-        ENV INTELOCLSDKROOT
-        ENV ATISTREAMSDKROOT
-      PATH_SUFFIXES
-        "AMD APP/lib/x86"
-        lib/x86
-        lib/Win32
-        OpenCL/common/lib/Win32)
-  elseif(CMAKE_SIZEOF_VOID_P EQUAL 8)
-    find_library(OpenCL_LIBRARY
-      NAMES OpenCL
-      PATHS
-        ENV "PROGRAMFILES(X86)"
-        ENV CUDA_PATH
-        ENV NVSDKCOMPUTE_ROOT
-        ENV AMDAPPSDKROOT
-        ENV INTELOCLSDKROOT
-        ENV ATISTREAMSDKROOT
-      PATH_SUFFIXES
-        "AMD APP/lib/x86_64"
-        lib/x86_64
-        lib/x64
-        OpenCL/common/lib/x64)
-  endif()
-else()
-  find_library(OpenCL_LIBRARY
-    NAMES OpenCL)
-endif()
-
-set(OpenCL_LIBRARIES ${OpenCL_LIBRARY})
-set(OpenCL_INCLUDE_DIRS ${OpenCL_INCLUDE_DIR})
-
-#include(${CMAKE_CURRENT_LIST_DIR}/FindPackageHandleStandardArgs.cmake)
-find_package_handle_standard_args(
-  OpenCL
-  FOUND_VAR OpenCL_FOUND
-  REQUIRED_VARS OpenCL_LIBRARY OpenCL_INCLUDE_DIR
-  VERSION_VAR OpenCL_VERSION_STRING)
-
-mark_as_advanced(
-  OpenCL_INCLUDE_DIR
-  OpenCL_LIBRARY)
-
diff --git a/examples/README.md b/examples/README.md
index d1020978e0..9dce225e1e 100644
--- a/examples/README.md
+++ b/examples/README.md
@@ -11,10 +11,10 @@ process; however, the compiled examples are not packaged in the ArrayFire
 installer. After compiling ArrayFire, the examples will be in subdirectories
 located in the `build/examples` directory.
 
-If you wish to disable example compilation, simply set the `BUILD_EXAMPLES`
+If you wish to disable example compilation, simply set the `AF_BUILD_EXAMPLES`
 variable to `OFF` in the CMake GUI or `ccmake` curses wrapper. If you are
 using the command-line version of `cmake`, simply specify 
-`-DBUILD_EXAMPLES=OFF` as an argument.
+`-DAF_BUILD_EXAMPLES=OFF` as an argument.
 
 ## Building examples as a stand-alone project
 
@@ -40,7 +40,7 @@ the directory which contains the `ArrayFireConfig.cmake` as an argument to the
 if you were to install ArrayFire to the `local` directory within your home
 folder, the invocation of `cmake` above would be replaced with the following:
 
-    cmake -DArrayFire_ROOT=~/local/share/ArrayFire/ ..
+    cmake -DArrayFire_DIR=$HOME/local/share/ArrayFire/cmake ..
     
 ### Support and Contact Info
 
diff --git a/examples/benchmarks/CMakeLists.txt b/examples/benchmarks/CMakeLists.txt
new file mode 100644
index 0000000000..4fd0853e58
--- /dev/null
+++ b/examples/benchmarks/CMakeLists.txt
@@ -0,0 +1,69 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Benchmarks
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_CPU_FOUND)
+  add_executable(blas_cpu blas.cpp)
+  target_link_libraries(blas_cpu ArrayFire::afcpu)
+
+  add_executable(cg_cpu cg.cpp)
+  target_link_libraries(cg_cpu ArrayFire::afcpu)
+
+  add_executable(fft_cpu fft.cpp)
+  target_link_libraries(fft_cpu ArrayFire::afcpu)
+
+  add_executable(pi_cpu pi.cpp)
+  target_link_libraries(pi_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(blas_cuda blas.cpp)
+  target_link_libraries(blas_cuda ArrayFire::afcuda)
+
+  add_executable(cg_cuda cg.cpp)
+  target_link_libraries(cg_cuda ArrayFire::afcuda)
+
+  add_executable(fft_cuda fft.cpp)
+  target_link_libraries(fft_cuda ArrayFire::afcuda)
+
+  add_executable(pi_cuda pi.cpp)
+  target_link_libraries(pi_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(blas_opencl blas.cpp)
+  target_link_libraries(blas_opencl ArrayFire::afopencl)
+
+  add_executable(cg_opencl cg.cpp)
+  target_link_libraries(cg_opencl ArrayFire::afopencl)
+
+  add_executable(fft_opencl fft.cpp)
+  target_link_libraries(fft_opencl ArrayFire::afopencl)
+
+  add_executable(pi_opencl pi.cpp)
+  target_link_libraries(pi_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(blas_oneapi blas.cpp)
+  target_link_libraries(blas_oneapi ArrayFire::afoneapi)
+
+  add_executable(cg_oneapi cg.cpp)
+  target_link_libraries(cg_oneapi ArrayFire::afoneapi)
+
+  add_executable(fft_oneapi fft.cpp)
+  target_link_libraries(fft_oneapi ArrayFire::afoneapi)
+
+  add_executable(pi_oneapi pi.cpp)
+  target_link_libraries(pi_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/benchmarks/blas.cpp b/examples/benchmarks/blas.cpp
index 2a6c93d11e..ef0e2818cf 100644
--- a/examples/benchmarks/blas.cpp
+++ b/examples/benchmarks/blas.cpp
@@ -8,37 +8,41 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <stdio.h>
 #include <math.h>
+#include <stdio.h>
 #include <cstdlib>
+#include <string>
 
 using namespace af;
 
 // create a small wrapper to benchmark
-static array A; // populated before each timing
-static void fn()
-{
+static array A;  // populated before each timing
+static void fn() {
     array B = matmul(A, A);  // matrix multiply
-    B.eval();                // ensure evaluated
 }
 
-int main(int argc, char ** argv)
-{
+int main(int argc, char** argv) {
     double peak = 0;
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         setDevice(device);
+
+        const std::string dtype(argc > 2 ? argv[2] : "f32");
+        const af_dtype dt = (dtype == "f16" ? f16 : f32);
+
+        if (dt == f16)
+            printf("Device %d isHalfAvailable ? %s\n", device,
+                   isHalfAvailable(device) ? "yes" : "no");
+
         info();
 
-        printf("Benchmark N-by-N matrix multiply\n");
+        printf("Benchmark N-by-N matrix multiply at %s \n", dtype.c_str());
         for (int n = 128; n <= 2048; n += 128) {
-
             printf("%4d x %4d: ", n, n);
-            A = constant(1,n,n);
-            double time = timeit(fn); // time in seconds
-            double gflops = 2.0 * powf(n,3) / (time * 1e9);
-            if (gflops > peak)
-                peak = gflops;
+            A             = constant(1, n, n, dt);
+            double time   = timeit(fn);  // time in seconds
+            double gflops = 2.0 * powf(n, 3) / (time * 1e9);
+            if (gflops > peak) peak = gflops;
 
             printf(" %4.0f Gflops\n", gflops);
             fflush(stdout);
@@ -48,15 +52,7 @@ int main(int argc, char ** argv)
         throw;
     }
 
-    if (argc == 2 && argv[1][0] == '-')
-        printf(" ### peak %g GFLOPS\n", peak);
+    printf(" ### peak %g GFLOPS\n", peak);
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/benchmarks/cg.cpp b/examples/benchmarks/cg.cpp
new file mode 100644
index 0000000000..cda79cec24
--- /dev/null
+++ b/examples/benchmarks/cg.cpp
@@ -0,0 +1,131 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <arrayfire.h>
+#include <iostream>
+
+using namespace af;
+
+static size_t dimension         = 4 * 1024;
+static const int maxIter        = 10;
+static const int sparsityFactor = 7;
+
+static array A;
+static array spA;  // Sparse A
+static array x0;
+static array b;
+
+void setupInputs() {
+    // Generate a random input: A
+    array T = randu(dimension, dimension, f32);
+    // Create 0s in input.
+    // Anything that is not divisible by sparsityFactor will become 0.
+    A = floor(T * 1000);
+    A = A * ((A % sparsityFactor) == 0) / 1000;
+    // Make it positive definite
+    A = transpose(A) + A + A.dims(0) * identity(A.dims(0), A.dims(0), f32);
+
+    // Make A sparse as spA
+    spA = sparse(A);
+
+    // Generate x0: Random guess
+    x0 = randu(A.dims(0), f32);
+
+    // Generate b
+    b = matmul(A, x0);
+
+    std::cout << "Sparsity of A = "
+              << 100.f * (float)sparseGetNNZ(spA) / (float)spA.elements() << "%"
+              << std::endl;
+    std::cout << "Memory Usage of A = " << A.bytes() / (1024.f * 1024.f)
+              << " MB" << std::endl;
+    std::cout << "Memory Usage of spA = "
+              << (sparseGetValues(spA).bytes() + sparseGetRowIdx(spA).bytes() +
+                  sparseGetColIdx(spA).bytes()) /
+                     (1024.f * 1024.f)
+              << " MB" << std::endl;
+}
+
+void sparseConjugateGradient(void) {
+    array x = constant(0, b.dims(), f32);
+    array r = b - matmul(spA, x);
+    array p = r;
+
+    for (int i = 0; i < maxIter; ++i) {
+        array Ap        = matmul(spA, p);
+        array alpha_num = dot(r, r);
+        array alpha_den = dot(p, Ap);
+        array alpha     = alpha_num / alpha_den;
+        r -= tile(alpha, Ap.dims()) * Ap;
+        x += tile(alpha, Ap.dims()) * p;
+        array beta_num = dot(r, r);
+        array beta     = beta_num / alpha_num;
+        p              = r + tile(beta, p.dims()) * p;
+    }
+}
+
+void denseConjugateGradient(void) {
+    array x = constant(0, b.dims(), f32);
+    array r = b - matmul(A, x);
+    array p = r;
+
+    for (int i = 0; i < maxIter; ++i) {
+        array Ap        = matmul(A, p);
+        array alpha_num = dot(r, r);
+        array alpha_den = dot(p, Ap);
+        array alpha     = alpha_num / alpha_den;
+        r -= tile(alpha, Ap.dims()) * Ap;
+        x += tile(alpha, Ap.dims()) * p;
+        array beta_num = dot(r, r);
+        array beta     = beta_num / alpha_num;
+        p              = r + tile(beta, p.dims()) * p;
+    }
+}
+
+void checkConjugateGradient(const af::array in) {
+    array x = constant(0, b.dims(), f32);
+    array r = b - matmul(in, x);
+    array p = r;
+
+    for (int i = 0; i < maxIter; ++i) {
+        array Ap        = matmul(in, p);
+        array alpha_num = dot(r, r);
+        array alpha_den = dot(p, Ap);
+        array alpha     = alpha_num / alpha_den;
+        r -= tile(alpha, Ap.dims()) * Ap;
+        x += tile(alpha, Ap.dims()) * p;
+        array beta_num = dot(r, r);
+        array beta     = beta_num / alpha_num;
+        p              = r + tile(beta, p.dims()) * p;
+    }
+    array res = x0 - x;
+
+    std::cout << "Final difference in solutions:\n";
+    af_print(dot(res, res));
+}
+
+int main(int, char **) {
+    af::info();
+    setupInputs();
+
+    std::cout << "Verifying Dense Conjugate Gradient:" << std::endl;
+    checkConjugateGradient(A);
+
+    std::cout << "Verifying Sparse Conjugate Gradient:" << std::endl;
+    checkConjugateGradient(spA);
+
+    af::sync();
+
+    std::cout << "Dense Conjugate Gradient Time: "
+              << timeit(denseConjugateGradient) * 1000 << "ms" << std::endl;
+
+    std::cout << "Sparse Conjugate Gradient Time: "
+              << timeit(sparseConjugateGradient) * 1000 << "ms" << std::endl;
+
+    return 0;
+}
diff --git a/examples/benchmarks/fft.cpp b/examples/benchmarks/fft.cpp
index 8063422372..b28873f16a 100644
--- a/examples/benchmarks/fft.cpp
+++ b/examples/benchmarks/fft.cpp
@@ -8,22 +8,20 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <stdio.h>
 #include <math.h>
+#include <stdio.h>
 #include <cstdlib>
 
 using namespace af;
 
 // create a small wrapper to benchmark
-static array A; // populated before each timing
-static void fn()
-{
-    array B = fft2(A);  // matrix multiply
+static array A;  // populated before each timing
+static void fn() {
+    array B = fft2(A);  // 2d fft
     B.eval();           // ensure evaluated
 }
 
-int main(int argc, char ** argv)
-{
+int main(int argc, char** argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         setDevice(device);
@@ -34,23 +32,14 @@ int main(int argc, char ** argv)
             int N = (1 << M);
 
             printf("%4d x %4d: ", N, N);
-            A = randu(N,N);
-            double time = timeit(fn); // time in seconds
+            A             = randu(N, N);
+            double time   = timeit(fn);  // time in seconds
             double gflops = 10.0 * N * N * M / (time * 1e9);
 
             printf(" %4.0f Gflops\n", gflops);
             fflush(stdout);
         }
-    } catch (af::exception& e) {
-        fprintf(stderr, "%s\n", e.what());
-    }
-
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
+    } catch (af::exception& e) { fprintf(stderr, "%s\n", e.what()); }
+
     return 0;
 }
diff --git a/examples/benchmarks/pi.cpp b/examples/benchmarks/pi.cpp
index 0a48f533c0..d4a550b78a 100644
--- a/examples/benchmarks/pi.cpp
+++ b/examples/benchmarks/pi.cpp
@@ -15,10 +15,10 @@
    - count what percent fell inside (top quarter) of unit circle
 */
 
-#include <stdio.h>
+#include <arrayfire.h>
 #include <math.h>
+#include <stdio.h>
 #include <cstdlib>
-#include <arrayfire.h>
 using namespace af;
 
 // generate millions of random samples
@@ -27,51 +27,39 @@ static int samples = 20e6;
 /* Self-contained code to run host and device estimates of PI.  Note that
    each is generating its own random values, so the estimates of PI
    will differ. */
-static double pi_device()
-{
-    array x = randu(samples,f32), y = randu(samples,f32);
-    return 4.0 * sum<float>(sqrt(x*x + y*y) < 1) / samples;
+static double pi_device() {
+    array x = randu(samples, f32), y = randu(samples, f32);
+    return 4.0 * sum<float>(sqrt(x * x + y * y) < 1) / samples;
 }
 
-static double pi_host()
-{
+static double pi_host() {
     int count = 0;
     for (int i = 0; i < samples; ++i) {
-        float x = float(rand()) / RAND_MAX;
-        float y = float(rand()) / RAND_MAX;
-        if (sqrt(x*x + y*y) < 1)
-            count++;
+        float x = float(rand()) / float(RAND_MAX);
+        float y = float(rand()) / float(RAND_MAX);
+        if (sqrt(x * x + y * y) < 1) count++;
     }
     return 4.0 * count / samples;
 }
 
-
-
 // void wrappers for timeit()
 static void device_wrapper() { pi_device(); }
 static void host_wrapper() { pi_host(); }
 
-
-int main(int argc, char ** argv)
-{
+int main(int argc, char** argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         setDevice(device);
         info();
 
-        printf("device:  %.5f seconds to estimate  pi = %.5f\n", timeit(device_wrapper), pi_device());
-        printf("  host:  %.5f seconds to estimate  pi = %.5f\n", timeit(host_wrapper), pi_host());
+        printf("device:  %.5f seconds to estimate  pi = %.5f\n",
+               timeit(device_wrapper), pi_device());
+        printf("  host:  %.5f seconds to estimate  pi = %.5f\n",
+               timeit(host_wrapper), pi_host());
     } catch (exception& e) {
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/common/idxio.h b/examples/common/idxio.h
index 16b710a42b..cc80b6e125 100644
--- a/examples/common/idxio.h
+++ b/examples/common/idxio.h
@@ -9,21 +9,19 @@
 
 #pragma once
 
-#include <iostream>
+#include <arrayfire.h>
+#include <algorithm>
 #include <fstream>
+#include <iostream>
 #include <stdexcept>
 #include <vector>
-#include <algorithm>
-#include <arrayfire.h>
 
-union Data
-{
+union Data {
     unsigned dim;
     char bytes[4];
 };
 
-unsigned char reverse_char(unsigned char b)
-{
+unsigned char reverse_char(unsigned char b) {
     b = (b & 0xF0) >> 4 | (b & 0x0F) << 4;
     b = (b & 0xCC) >> 2 | (b & 0x33) << 2;
     b = (b & 0xAA) >> 1 | (b & 0x55) << 1;
@@ -31,8 +29,7 @@ unsigned char reverse_char(unsigned char b)
 }
 
 // http://stackoverflow.com/a/9144870/2192361
-unsigned reverse(unsigned x)
-{
+unsigned reverse(unsigned x) {
     x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
     x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
     x = ((x >> 4) & 0x0f0f0f0fu) | ((x & 0x0f0f0f0fu) << 4);
@@ -42,24 +39,22 @@ unsigned reverse(unsigned x)
 }
 
 template<class ty>
-void read_idx(std::vector<dim_t> &dims, std::vector<ty> &data, const char *name)
-{
+void read_idx(std::vector<dim_t> &dims, std::vector<ty> &data,
+              const char *name) {
     std::ifstream f(name, std::ios::in | std::ios::binary);
     if (!f.is_open()) throw std::runtime_error("Unable to open file");
 
     Data d;
     f.read(d.bytes, sizeof(d.bytes));
 
-    if (d.bytes[2] != 8) {
-        throw std::runtime_error("Unsupported data type");
-    }
+    if (d.bytes[2] != 8) { throw std::runtime_error("Unsupported data type"); }
 
-    unsigned numdims = d.bytes[3];
+    unsigned numdims  = d.bytes[3];
     unsigned elemsize = 1;
 
     // Read the dimensions
     size_t elem = 1;
-    dims = std::vector<dim_t>(numdims);
+    dims        = std::vector<dim_t>(numdims);
     for (unsigned i = 0; i < numdims; i++) {
         f.read(d.bytes, sizeof(d.bytes));
 
diff --git a/examples/common/progress.h b/examples/common/progress.h
index debb511e1a..90ccdd0abc 100644
--- a/examples/common/progress.h
+++ b/examples/common/progress.h
@@ -10,14 +10,13 @@
 #ifndef __PROGRESS_H
 #define __PROGRESS_H
 
-#include <cmath>
 #include <algorithm>
+#include <cmath>
 
-static bool progress(unsigned iter_curr, af::timer t, double time_total)
-{
+static bool progress(unsigned iter_curr, af::timer t, double time_total) {
     static unsigned iter_prev = 0;
-    static double time_prev = 0;
-    static double max_rate = 0;
+    static double time_prev   = 0;
+    static double max_rate    = 0;
 
     af::sync();
     double time_curr = af::timer::stop(t);
@@ -25,18 +24,17 @@ static bool progress(unsigned iter_curr, af::timer t, double time_total)
     if ((time_curr - time_prev) < 1) return true;
 
     double rate = (iter_curr - iter_prev) / (time_curr - time_prev);
-    printf("  iterations per second: %.0f   (progress %.0f%%)\n",
-            rate, 100.0f * time_curr / time_total);
+    printf("  iterations per second: %.0f   (progress %.0f%%)\n", rate,
+           100.0f * time_curr / time_total);
 
     max_rate = std::max(max_rate, rate);
 
     iter_prev = iter_curr;
     time_prev = time_curr;
 
-
     if (time_curr < time_total) return true;
 
-    printf(" ### vortex %f iterations per second (max)\n", max_rate);
+    printf(" ### %f iterations per second (max)\n", max_rate);
     return false;
 }
 
diff --git a/examples/computer_vision/CMakeLists.txt b/examples/computer_vision/CMakeLists.txt
new file mode 100644
index 0000000000..2683eb1931
--- /dev/null
+++ b/examples/computer_vision/CMakeLists.txt
@@ -0,0 +1,75 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Computer-Vision
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+add_definitions("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+
+if (ArrayFire_CPU_FOUND)
+  # FAST examples
+  add_executable(fast_cpu fast.cpp)
+  target_link_libraries(fast_cpu ArrayFire::afcpu)
+
+  # Harris corner detector examples
+  add_executable(harris_cpu harris.cpp)
+  target_link_libraries(harris_cpu ArrayFire::afcpu)
+
+  # Template Matching examples
+  add_executable(matching_cpu matching.cpp)
+  target_link_libraries(matching_cpu ArrayFire::afcpu)
+
+  # SUSAN corner detector examples
+  add_executable(susan_cpu susan.cpp)
+  target_link_libraries(susan_cpu ArrayFire::afcpu)
+endif()
+
+if (ArrayFire_CUDA_FOUND)
+  add_executable(fast_cuda fast.cpp)
+  target_link_libraries(fast_cuda ArrayFire::afcuda)
+
+  add_executable(harris_cuda harris.cpp)
+  target_link_libraries(harris_cuda ArrayFire::afcuda)
+
+  add_executable(matching_cuda matching.cpp)
+  target_link_libraries(matching_cuda ArrayFire::afcuda)
+
+  add_executable(susan_cuda susan.cpp)
+  target_link_libraries(susan_cuda ArrayFire::afcuda)
+endif()
+
+if (ArrayFire_OpenCL_FOUND)
+  add_executable(fast_opencl fast.cpp)
+  target_link_libraries(fast_opencl ArrayFire::afopencl)
+
+  add_executable(harris_opencl harris.cpp)
+  target_link_libraries(harris_opencl ArrayFire::afopencl)
+
+  add_executable(matching_opencl matching.cpp)
+  target_link_libraries(matching_opencl ArrayFire::afopencl)
+
+  add_executable(susan_opencl susan.cpp)
+  target_link_libraries(susan_opencl ArrayFire::afopencl)
+endif()
+
+if (ArrayFire_oneAPI_FOUND)
+  add_executable(fast_oneapi fast.cpp)
+  target_link_libraries(fast_oneapi ArrayFire::afoneapi)
+
+  add_executable(harris_oneapi harris.cpp)
+  target_link_libraries(harris_oneapi ArrayFire::afoneapi)
+
+  add_executable(matching_oneapi matching.cpp)
+  target_link_libraries(matching_oneapi ArrayFire::afoneapi)
+
+  add_executable(susan_oneapi susan.cpp)
+  target_link_libraries(susan_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/computer_vision/fast.cpp b/examples/computer_vision/fast.cpp
index e010302b40..0dbc12b3b7 100644
--- a/examples/computer_vision/fast.cpp
+++ b/examples/computer_vision/fast.cpp
@@ -7,23 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <cstdio>
 #include <arrayfire.h>
+#include <cstdio>
 #include <cstdlib>
 
 using namespace af;
 
-static void fast_demo(bool console)
-{
+static void fast_demo(bool console) {
     // Load image
     array img_color;
     if (console)
         img_color = loadImage(ASSETS_DIR "/examples/images/square.png", true);
     else
-        img_color = loadImage(ASSETS_DIR "/examples/images/lena.ppm", true);
+        img_color = loadImage(ASSETS_DIR "/examples/images/man.jpg", true);
     // Convert the image from RGB to gray-scale
     array img = colorSpace(img_color, AF_GRAY, AF_RGB);
-    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f] interval
+    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f]
+    // interval
     img_color /= 255.f;
 
     features feat = fast(img, 20.0f, 9, true, 0.05);
@@ -34,46 +34,47 @@ static void fast_demo(bool console)
     // Draw draw_len x draw_len crosshairs where the corners are
     const int draw_len = 3;
     for (size_t f = 0; f < feat.getNumFeatures(); f++) {
-        int x = h_x[f];
-        int y = h_y[f];
-        img_color(y, seq(x-draw_len, x+draw_len), 0) = 0.f;
-        img_color(y, seq(x-draw_len, x+draw_len), 1) = 1.f;
-        img_color(y, seq(x-draw_len, x+draw_len), 2) = 0.f;
+        int x                                            = h_x[f];
+        int y                                            = h_y[f];
+        img_color(y, seq(x - draw_len, x + draw_len), 0) = 0.f;
+        img_color(y, seq(x - draw_len, x + draw_len), 1) = 1.f;
+        img_color(y, seq(x - draw_len, x + draw_len), 2) = 0.f;
 
-        // Draw vertical line of (draw_len * 2 + 1) pixels centered on  the corner
-        // Set only the first channel to 1 (green lines)
-        img_color(seq(y-draw_len, y+draw_len), x, 0) = 0.f;
-        img_color(seq(y-draw_len, y+draw_len), x, 1) = 1.f;
-        img_color(seq(y-draw_len, y+draw_len), x, 2) = 0.f;
+        // Draw vertical line of (draw_len * 2 + 1) pixels centered on the
+        // corner. Set only the green channel (index 1) to 1 (green lines)
+        img_color(seq(y - draw_len, y + draw_len), x, 0) = 0.f;
+        img_color(seq(y - draw_len, y + draw_len), x, 1) = 1.f;
+        img_color(seq(y - draw_len, y + draw_len), x, 2) = 0.f;
     }
 
-    printf("Features found: %lu\n", feat.getNumFeatures());
+    freeHost(h_x);
+    freeHost(h_y);
+
+    printf("Features found: %zu\n", feat.getNumFeatures());
 
     if (!console) {
         af::Window wnd("FAST Feature Detector");
 
         // Previews color image with green crosshairs
-        while(!wnd.close())
-            wnd.image(img_color);
+        while (!wnd.close()) wnd.image(img_color);
     } else {
         af_print(feat.getX());
         af_print(feat.getY());
     }
 }
 
-int main(int argc, char** argv)
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
+int main(int argc, char** argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::setDevice(device);
         af::info();
-        std::cout << "** ArrayFire FAST Feature Detector Demo **" << std::endl << std::endl;
+        printf("** ArrayFire FAST Feature Detector Demo **\n\n");
         fast_demo(console);
 
     } catch (af::exception& ae) {
-        std::cerr << ae.what() << std::endl;
+        fprintf(stderr, "%s\n", ae.what());
         throw;
     }
 
diff --git a/examples/computer_vision/harris.cpp b/examples/computer_vision/harris.cpp
index 6821248012..d97a30d803 100644
--- a/examples/computer_vision/harris.cpp
+++ b/examples/computer_vision/harris.cpp
@@ -7,14 +7,13 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <cstdio>
 #include <arrayfire.h>
+#include <cstdio>
 #include <cstdlib>
 
 using namespace af;
 
-static void harris_demo(bool console)
-{
+static void harris_demo(bool console) {
     af::Window wnd("Harris Corner Detector");
 
     // Load image
@@ -22,10 +21,11 @@ static void harris_demo(bool console)
     if (console)
         img_color = loadImage(ASSETS_DIR "/examples/images/square.png", true);
     else
-        img_color = loadImage(ASSETS_DIR "/examples/images/lena.ppm", true);
+        img_color = loadImage(ASSETS_DIR "/examples/images/man.jpg", true);
     // Convert the image from RGB to gray-scale
     array img = colorSpace(img_color, AF_GRAY, AF_RGB);
-    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f] interval
+    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f]
+    // interval
     img_color /= 255.f;
 
     // Calculate image gradients
@@ -37,8 +37,8 @@ static void harris_demo(bool console)
     array ixy = ix * iy;
     array iyy = iy * iy;
 
-    // Compute a Gaussian kernel with standard deviation of 1.0 and length of 5 pixels
-    // These values can be changed to use a smaller or larger window
+    // Compute a Gaussian kernel with standard deviation of 1.0 and length of 5
+    // pixels. These values can be changed to use a smaller or larger window
     array gauss_filt = gaussianKernel(5, 5, 1.0, 1.0);
 
     // Filter second-order derivatives with Gaussian kernel computed previously
@@ -55,13 +55,13 @@ static void harris_demo(bool console)
     array response = idet - 0.04f * (itr * itr);
 
     // Gets maximum response for each 3x3 neighborhood
-    //array max_resp = maxfilt(response, 3, 3);
-    array mask = constant(1,3,3);
+    // array max_resp = maxfilt(response, 3, 3);
+    array mask     = constant(1, 3, 3);
     array max_resp = dilate(response, mask);
 
     // Discard responses that are not greater than threshold
     array corners = response > 1e5f;
-    corners = corners * response;
+    corners       = corners * response;
 
     // Discard responses that are not equal to maximum neighborhood response,
     // scale them to original response value
@@ -78,29 +78,29 @@ static void harris_demo(bool console)
         for (int x = draw_len; x < img_color.dims(1) - draw_len; x++) {
             // Only draws crosshair if is a corner
             if (h_corners[x * corners.dims(0) + y] > 1e5f) {
-                // Draw horizontal line of (draw_len * 2 + 1) pixels centered on the corner
-                // Set only the first channel to 1 (green lines)
-                img_color(y, seq(x-draw_len, x+draw_len), 0) = 0.f;
-                img_color(y, seq(x-draw_len, x+draw_len), 1) = 1.f;
-                img_color(y, seq(x-draw_len, x+draw_len), 2) = 0.f;
-
-                // Draw vertical line of (draw_len * 2 + 1) pixels centered on  the corner
-                // Set only the first channel to 1 (green lines)
-                img_color(seq(y-draw_len, y+draw_len), x, 0) = 0.f;
-                img_color(seq(y-draw_len, y+draw_len), x, 1) = 1.f;
-                img_color(seq(y-draw_len, y+draw_len), x, 2) = 0.f;
+                // Draw horizontal line of (draw_len * 2 + 1) pixels centered on
+                // the corner. Set only the green channel (index 1) to 1
+                img_color(y, seq(x - draw_len, x + draw_len), 0) = 0.f;
+                img_color(y, seq(x - draw_len, x + draw_len), 1) = 1.f;
+                img_color(y, seq(x - draw_len, x + draw_len), 2) = 0.f;
+
+                // Draw vertical line of (draw_len * 2 + 1) pixels centered on
+                // the corner. Set only the green channel (index 1) to 1
+                img_color(seq(y - draw_len, y + draw_len), x, 0) = 0.f;
+                img_color(seq(y - draw_len, y + draw_len), x, 1) = 1.f;
+                img_color(seq(y - draw_len, y + draw_len), x, 2) = 0.f;
 
                 good_corners++;
             }
         }
     }
+    freeHost(h_corners);
 
     printf("Corners found: %u\n", good_corners);
 
     if (!console) {
         // Previews color image with green crosshairs
-        while(!wnd.close())
-            wnd.image(img_color);
+        while (!wnd.close()) wnd.image(img_color);
     } else {
         // Find corner indexes in the image as 1D indexes
         array idx = where(corners);
@@ -110,26 +110,25 @@ static void harris_demo(bool console)
         array corners_y = idx % corners.dims()[0];
 
         const int good_corners = corners_x.dims()[0];
-        std::cout << "Corners found: " << good_corners << std::endl << std::endl;
+        printf("Corners found: %d\n\n", good_corners);
 
         af_print(corners_x);
         af_print(corners_y);
     }
 }
 
-int main(int argc, char** argv)
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
+int main(int argc, char** argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::setDevice(device);
         af::info();
-        std::cout << "** ArrayFire Harris Corner Detector Demo **" << std::endl << std::endl;
+        printf("** ArrayFire Harris Corner Detector Demo **\n\n");
         harris_demo(console);
 
     } catch (af::exception& ae) {
-        std::cerr << ae.what() << std::endl;
+        fprintf(stderr, "%s\n", ae.what());
         throw;
     }
 
diff --git a/examples/computer_vision/matching.cpp b/examples/computer_vision/matching.cpp
new file mode 100644
index 0000000000..80cc9a80b5
--- /dev/null
+++ b/examples/computer_vision/matching.cpp
@@ -0,0 +1,128 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+
+using namespace af;
+
+array normalize(array a) {
+    float mx = af::max<float>(a);
+    float mn = af::min<float>(a);
+    return (a - mn) / (mx - mn);
+}
+
+void drawRectangle(array& out, unsigned x, unsigned y, unsigned dim0,
+                   unsigned dim1) {
+    printf("\nMatching patch origin = (%u, %u)\n\n", x, y);
+    seq col_span(x, x + dim0, 1);
+    seq row_span(y, y + dim1, 1);
+    // edge on left
+    out(col_span, y, 0) = 0.f;
+    out(col_span, y, 1) = 0.f;
+    out(col_span, y, 2) = 1.f;
+    // edge on right
+    out(col_span, y + dim1, 0) = 0.f;
+    out(col_span, y + dim1, 1) = 0.f;
+    out(col_span, y + dim1, 2) = 1.f;
+    // edge on top
+    out(x, row_span, 0) = 0.f;
+    out(x, row_span, 1) = 0.f;
+    out(x, row_span, 2) = 1.f;
+    // edge on bottom
+    out(x + dim0, row_span, 0) = 0.f;
+    out(x + dim0, row_span, 1) = 0.f;
+    out(x + dim0, row_span, 2) = 1.f;
+}
+
+static void templateMatchingDemo(bool console) {
+    // Load image
+    array img_color;
+    if (console)
+        img_color = loadImage(ASSETS_DIR "/examples/images/square.png", true);
+    else
+        img_color = loadImage(ASSETS_DIR "/examples/images/man.jpg", true);
+
+    // Convert the image from RGB to gray-scale
+    array img  = colorSpace(img_color, AF_GRAY, AF_RGB);
+    dim4 iDims = img.dims();
+    std::cout << "Input image dimensions: " << iDims << std::endl << std::endl;
+    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f]
+    // interval
+
+    // extract a patch from input image
+    unsigned patch_size = 100;
+    array tmp_img =
+        img(seq(100, 100 + patch_size, 1.0), seq(100, 100 + patch_size, 1.0));
+    array result =
+        matchTemplate(img, tmp_img);  // Default disparity metric is
+                                      // Sum of Absolute differences (SAD)
+                                      // Currently supported metrics are
+                                      // AF_SAD, AF_ZSAD, AF_LSAD, AF_SSD,
+                                      // AF_ZSSD, AF_LSSD
+    array disp_img = img / 255.0f;
+    array disp_tmp = tmp_img / 255.0f;
+    array disp_res = normalize(result);
+
+    unsigned minLoc;
+    float minVal;
+    min<float>(&minVal, &minLoc, disp_res);
+    std::cout << "Location(linear index) of minimum disparity value = "
+              << minLoc << std::endl;
+
+    if (!console) {
+        // Draw a rectangle on input image where the template matches
+        array marked_res = tile(disp_img, 1, 1, 3);
+        drawRectangle(marked_res, minLoc % iDims[0], minLoc / iDims[0],
+                      patch_size, patch_size);
+
+        std::cout << "Note: Based on the disparity metric option provided to "
+                     "matchTemplate function\n"
+                     "either minimum or maximum disparity location is the "
+                     "starting corner\n"
+                     "of our best matching patch to template image in the "
+                     "search image"
+                  << std::endl;
+
+        af::Window wnd("Template Matching Demo");
+
+        // Show the search image, template, best match and disparity map
+        while (!wnd.close()) {
+            wnd.setColorMap(AF_COLORMAP_DEFAULT);
+            wnd.grid(2, 2);
+            wnd(0, 0).image(disp_img, "Search Image");
+            wnd(0, 1).image(disp_tmp, "Template Patch");
+            wnd(1, 0).image(marked_res, "Best Match");
+            wnd.setColorMap(AF_COLORMAP_HEAT);
+            wnd(1, 1).image(disp_res, "Disparity values");
+            wnd.show();
+        }
+    }
+}
+
+int main(int argc, char** argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
+    bool console = argc > 2 ? argv[2][0] == '-' : false;
+
+    try {
+        af::setDevice(device);
+        af::info();
+        std::cout << "** ArrayFire template matching Demo **" << std::endl
+                  << std::endl;
+        templateMatchingDemo(console);
+
+    } catch (af::exception& ae) {
+        std::cerr << ae.what() << std::endl;
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/computer_vision/susan.cpp b/examples/computer_vision/susan.cpp
new file mode 100644
index 0000000000..417213de7c
--- /dev/null
+++ b/examples/computer_vision/susan.cpp
@@ -0,0 +1,87 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+static void susan_demo(bool console) {
+    // Load image
+    array img_color;
+    if (console)
+        img_color = loadImage(ASSETS_DIR "/examples/images/square.png", true);
+    else
+        img_color = loadImage(ASSETS_DIR "/examples/images/man.jpg", true);
+    // Convert the image from RGB to gray-scale
+    array img = colorSpace(img_color, AF_GRAY, AF_RGB);
+    // For visualization in ArrayFire, color images must be in the [0.0f-1.0f]
+    // interval
+    img_color /= 255.f;
+
+    features feat = susan(img, 3, 32.0f, 10, 0.05f, 3);
+
+    if (!(feat.getNumFeatures() > 0)) {
+        printf("No features found, exiting\n");
+        return;
+    }
+
+    float* h_x = feat.getX().host<float>();
+    float* h_y = feat.getY().host<float>();
+
+    // Draw draw_len x draw_len crosshairs where the corners are
+    const int draw_len = 3;
+    for (size_t f = 0; f < feat.getNumFeatures(); f++) {
+        int x                                            = h_x[f];
+        int y                                            = h_y[f];
+        img_color(x, seq(y - draw_len, y + draw_len), 0) = 0.f;
+        img_color(x, seq(y - draw_len, y + draw_len), 1) = 1.f;
+        img_color(x, seq(y - draw_len, y + draw_len), 2) = 0.f;
+
+        // Draw vertical line of (draw_len * 2 + 1) pixels centered on the
+        // corner. Set only the second channel to 1 (green lines)
+        img_color(seq(x - draw_len, x + draw_len), y, 0) = 0.f;
+        img_color(seq(x - draw_len, x + draw_len), y, 1) = 1.f;
+        img_color(seq(x - draw_len, x + draw_len), y, 2) = 0.f;
+    }
+    freeHost(h_x);
+    freeHost(h_y);
+
+    printf("Features found: %zu\n", feat.getNumFeatures());
+
+    if (!console) {
+        af::Window wnd("SUSAN Feature Detector");
+
+        // Previews color image with green crosshairs
+        while (!wnd.close()) wnd.image(img_color);
+    } else {
+        af_print(feat.getX());
+        af_print(feat.getY());
+        af_print(feat.getScore());
+    }
+}
+
+int main(int argc, char** argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
+    bool console = argc > 2 ? argv[2][0] == '-' : false;
+
+    try {
+        af::setDevice(device);
+        af::info();
+        printf("** ArrayFire SUSAN Feature Detector Demo **\n\n");
+        susan_demo(console);
+
+    } catch (af::exception& ae) {
+        fprintf(stderr, "%s\n", ae.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/financial/CMakeLists.txt b/examples/financial/CMakeLists.txt
new file mode 100644
index 0000000000..f365f88b47
--- /dev/null
+++ b/examples/financial/CMakeLists.txt
@@ -0,0 +1,60 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Financial
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_CPU_FOUND)
+  # Black-Scholes Options
+  add_executable(black_scholes_options_cpu black_scholes_options.cpp input.h)
+  target_link_libraries(black_scholes_options_cpu ArrayFire::afcpu)
+
+  # Heston Model
+  add_executable(heston_model_cpu heston_model.cpp)
+  target_link_libraries(heston_model_cpu ArrayFire::afcpu)
+
+  # Monte Carlo Options
+  add_executable(monte_carlo_options_cpu monte_carlo_options.cpp)
+  target_link_libraries(monte_carlo_options_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(black_scholes_options_cuda black_scholes_options.cpp input.h)
+  target_link_libraries(black_scholes_options_cuda ArrayFire::afcuda)
+
+  add_executable(heston_model_cuda heston_model.cpp)
+  target_link_libraries(heston_model_cuda ArrayFire::afcuda)
+
+  add_executable(monte_carlo_options_cuda monte_carlo_options.cpp)
+  target_link_libraries(monte_carlo_options_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(monte_carlo_options_opencl monte_carlo_options.cpp)
+  target_link_libraries(monte_carlo_options_opencl ArrayFire::afopencl)
+
+  add_executable(black_scholes_options_opencl black_scholes_options.cpp input.h)
+  target_link_libraries(black_scholes_options_opencl ArrayFire::afopencl)
+
+  add_executable(heston_model_opencl heston_model.cpp)
+  target_link_libraries(heston_model_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(monte_carlo_options_oneapi monte_carlo_options.cpp)
+  target_link_libraries(monte_carlo_options_oneapi ArrayFire::afoneapi)
+
+  add_executable(black_scholes_options_oneapi black_scholes_options.cpp input.h)
+  target_link_libraries(black_scholes_options_oneapi ArrayFire::afoneapi)
+
+  add_executable(heston_model_oneapi heston_model.cpp)
+  target_link_libraries(heston_model_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/financial/black_scholes_options.cpp b/examples/financial/black_scholes_options.cpp
index 03cceb885c..3bc1347d93 100644
--- a/examples/financial/black_scholes_options.cpp
+++ b/examples/financial/black_scholes_options.cpp
@@ -7,25 +7,25 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <iostream>
-#include <stdio.h>
-#include <math.h>
 #include <arrayfire.h>
+#include <math.h>
+#include <stdio.h>
 #include <cstdlib>
+#include <iostream>
 
 #include "input.h"
 using namespace af;
 
-static array cnd(const array& x)
-{
-    static float sqrt2 = sqrtf(2.0f);
-    array temp = (x > 0);
-    array y = temp * (0.5f + erf(x/sqrt2)/2) + (1-temp) * (0.5f - erf((-x)/sqrt2)/2);
-    return y;
+// Use the relationship between the cumulative normal distribution and the
+// (complementary) error function:
+// https://en.wikipedia.org/wiki/Error_function#Cumulative_distribution_function
+array cnd(array x) {
+    const float sqrt05 = sqrt(0.5f);
+    return 0.5f * erfc(-x * sqrt05);
 }
 
-static void black_scholes(array& C, array& P, const array& S, const array& X, const array& R, const array& V, const array& T)
-{
+static void black_scholes(array& C, array& P, const array& S, const array& X,
+                          const array& R, const array& V, const array& T) {
     // This function computes the call and put option prices based on
     // Black-Scholes Model
 
@@ -36,28 +36,27 @@ static void black_scholes(array& C, array& P, const array& S, const array& X, co
     // T = Time to maturity
 
     array d1 = log(S / X);
-    d1 = d1 + (R + (V*V)*0.5) * T;
-    d1 = d1 / (V*sqrt(T));
+    d1       = d1 + (R + (V * V) * 0.5) * T;
+    d1       = d1 / (V * sqrt(T));
 
-    array d2 = d1 - (V*sqrt(T));
+    array d2 = d1 - (V * sqrt(T));
 
     array cnd_d1 = cnd(d1);
     array cnd_d2 = cnd(d2);
 
-    C = S * cnd_d1  - (X * exp((-R)*T) * cnd_d2);
-    P = X * exp((-R)*T) * (1 - cnd_d2) - (S * (1 - cnd_d1));
+    C = S * cnd_d1 - (X * exp((-R) * T) * cnd_d2);
+    P = X * exp((-R) * T) * (1 - cnd_d2) - (S * (1 - cnd_d1));
 }
 
-int main(int argc, char **argv)
-{
-
+int main(int argc, char** argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         setDevice(device);
         info();
 
-        printf("** ArrayFire Black-Scholes Example **\n"
-               "**          by AccelerEyes         **\n\n");
+        printf(
+            "** ArrayFire Black-Scholes Example **\n"
+            "**          by AccelerEyes         **\n\n");
 
         array GC1(4000, 1, C1);
         array GC2(4000, 1, C2);
@@ -65,7 +64,6 @@ int main(int argc, char **argv)
         array GC4(4000, 1, C4);
         array GC5(4000, 1, C5);
 
-
         // Compile kernels
         // Create GPU copies of the data
         array Sg = GC1;
@@ -76,48 +74,40 @@ int main(int argc, char **argv)
         array Cg, Pg;
 
         // Warm up black scholes example
-        black_scholes(Cg, Pg, Sg,Xg,Rg,Vg,Tg);
+        black_scholes(Cg, Pg, Sg, Xg, Rg, Vg, Tg);
         eval(Cg, Pg);
         printf("Warming up done\n");
         af::sync();
 
-
-        int iter = 5;
+        int iter = 1000;
         for (int n = 50; n <= 500; n += 50) {
-
             // Create GPU copies of the data
             Sg = tile(GC1, n, 1);
             Xg = tile(GC2, n, 1);
             Rg = tile(GC3, n, 1);
             Vg = tile(GC4, n, 1);
             Tg = tile(GC5, n, 1);
+            af::eval(Sg, Xg, Rg, Vg, Tg);
 
             dim4 dims = Xg.dims();
-            printf("Input Data Size = %d x %d\n", (int)dims[0], (int)dims[1]);
-
             // Force compute on the GPU
             af::sync();
 
             timer::start();
             for (int i = 0; i < iter; i++) {
-                black_scholes(Cg, Pg, Sg,Xg,Rg,Vg,Tg);
-                eval(Cg,Pg);
+                black_scholes(Cg, Pg, Sg, Xg, Rg, Vg, Tg);
+                eval(Cg, Pg);
             }
             af::sync();
 
-            printf("Mean GPU Time = %0.6fms\n\n\n", 1000 * timer::stop()/iter);
+            double t = timer::stop() / iter;
+            printf("Input Data Size = %8d. Mean GPU Time: %0.6f ms\n",
+                   (int)dims[0], 1000 * t);
         }
-    } catch (af::exception& e){
+    } catch (af::exception& e) {
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] =='-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/financial/heston_model.cpp b/examples/financial/heston_model.cpp
new file mode 100644
index 0000000000..79e7ff9dfe
--- /dev/null
+++ b/examples/financial/heston_model.cpp
@@ -0,0 +1,118 @@
+/**********************************************************************************************
+ * Copyright (c) 2015, Michael Nowotny
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ *modification, are permitted provided that the following conditions are met:
+ *
+ * 1. Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ *
+ * 2. Redistributions in binary form must reproduce the above copyright notice,
+ * this list of conditions and the following disclaimer in the documentation
+ *and/or other materials provided with the distribution.
+ *
+ * 3. Neither the name of the copyright holder nor the names of its contributors
+ *may be used to endorse or promote products derived from this software without
+ *specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ ***********************************************************************************************/
+
+#include <arrayfire.h>
+#include <stdio.h>
+#include <iostream>
+
+using namespace std;
+using namespace af;
+
+void simulateHestonModel(af::array &xres, af::array &vres, float T,
+                         unsigned int N, unsigned int R, float mu, float kappa,
+                         float vBar, float sigmaV, float rho, float x0,
+                         float v0) {
+    float deltaT = T / (float)(N - 1);
+
+    af::array x[] = {af::constant(x0, R), af::constant(0, R)};
+    af::array v[] = {af::constant(v0, R), af::constant(0, R)};
+
+    float sqrtDeltaT = sqrt(deltaT);
+
+    float sqrtOneMinusRhoSquare = sqrt(1 - rho * rho);
+
+    float mArray[] = {rho, sqrtOneMinusRhoSquare};
+    af::array m(2, 1, mArray);
+
+    unsigned int tPrevious = 0, tCurrent = 0;
+    af::array zeroConstant = constant(0, R);
+
+    for (unsigned int t = 1; t < N; t++) {
+        tPrevious = (t + 1) % 2;
+        tCurrent  = t % 2;
+
+        af::array dBt      = randn(R, 2) * sqrtDeltaT;
+        af::array sqrtVLag = af::sqrt(v[tPrevious]);
+
+        x[tCurrent] = x[tPrevious] + (mu - 0.5 * v[tPrevious]) * deltaT +
+                      (sqrtVLag * dBt(span, 0));
+        af::array vTmp = v[tPrevious] + kappa * (vBar - v[tPrevious]) * deltaT +
+                         sigmaV * (sqrtVLag * matmul(dBt, m));
+        v[tCurrent] = max(vTmp, zeroConstant);
+    }
+
+    xres = x[tCurrent];
+    vres = v[tCurrent];
+}
+
+int main() {
+    float T                  = 1;
+    unsigned int nT          = 10 * T;
+    unsigned int R_first_run = 1000;
+    unsigned int R           = 20000000;
+
+    float x0     = 0;              // initial log stock price
+    float v0     = pow(0.087, 2);  // initial variance
+    float r      = log(1.0319);    // risk-free rate
+    float rho    = -0.82;  // instantaneous correlation between Brownian motions
+    float sigmaV = 0.14;   // volatility of variance (vol of vol)
+    float kappa  = 3.46;   // mean reversion speed
+    float vBar   = 0.008;  // mean variance
+    float k      = log(0.95);  // log strike price
+
+    // Price European call option
+    try {
+        af::array x;
+        af::array v;
+
+        // first run
+        simulateHestonModel(x, v, T, nT, R_first_run, r, kappa, vBar, sigmaV,
+                            rho, x0, v0);
+        af::sync();  // Ensure the first run is finished
+
+        timer::start();
+        simulateHestonModel(x, v, T, nT, R, r, kappa, vBar, sigmaV, rho, x0,
+                            v0);
+        af::sync();
+        cout << "Time in simulation: " << timer::stop() << endl;
+
+        af::array K            = exp(constant(k, x.dims()));
+        af::array zeroConstant = constant(0, x.dims());
+        af::array C_CPU =
+            exp(-r * T) * mean(af::max(af::exp(x) - K, zeroConstant));
+
+        af_print(C_CPU);
+        return 0;
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        return 1;
+    }
+}
diff --git a/examples/financial/input.h b/examples/financial/input.h
index 4b44e96f0c..220969ceee 100644
--- a/examples/financial/input.h
+++ b/examples/financial/input.h
@@ -7,2018 +7,3326 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+float C1[] = {5.000000f,   10.000000f,  100.000000f, 100.000000f, 60.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              5.000000f,   10.000000f,  100.000000f, 100.000000f, 60.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              5.000000f,   10.000000f,  100.000000f, 100.000000f, 60.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              5.000000f,   10.000000f,  100.000000f, 100.000000f, 60.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f};
 
-float C1[] = {
-    5.000000f, 10.000000f, 100.000000f, 100.000000f, 60.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    5.000000f, 10.000000f, 100.000000f, 100.000000f, 60.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    5.000000f, 10.000000f, 100.000000f, 100.000000f, 60.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    5.000000f, 10.000000f, 100.000000f, 100.000000f, 60.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000
-};
-
-float C2[] = {
-    5.000000f, 12.000000f, 100.000000f, 100.000000f, 65.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    5.000000f, 12.000000f, 100.000000f, 100.000000f, 65.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    5.000000f, 12.000000f, 100.000000f, 100.000000f, 65.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    5.000000f, 12.000000f, 100.000000f, 100.000000f, 65.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f,
-    90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f,
-    90.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f,
-    100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f,
-    41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 60.750000f, 60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    60.750000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 50.000000f, 50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 41.250000f,
-    41.250000f, 41.250000f, 41.250000f, 41.250000f, 41.250000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f, 50.000000f,
-    50.000000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f, 60.750000f,
-    110.000000f, 110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
-    110.000000f, 90.000000f, 90.000000f, 90.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000
-};
+float C2[] = {5.000000f,   12.000000f,  100.000000f, 100.000000f, 65.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              5.000000f,   12.000000f,  100.000000f, 100.000000f, 65.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              5.000000f,   12.000000f,  100.000000f, 100.000000f, 65.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              5.000000f,   12.000000f,  100.000000f, 100.000000f, 65.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 100.000000f, 100.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  90.000000f,
+              90.000000f,  90.000000f,  100.000000f, 100.000000f, 100.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 110.000000f, 110.000000f, 110.000000f, 41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              41.250000f,  41.250000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  50.000000f,  50.000000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  41.250000f,
+              41.250000f,  41.250000f,  41.250000f,  41.250000f,  41.250000f,
+              50.000000f,  50.000000f,  50.000000f,  50.000000f,  50.000000f,
+              50.000000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              60.750000f,  60.750000f,  60.750000f,  60.750000f,  60.750000f,
+              110.000000f, 110.000000f, 90.000000f,  90.000000f,  90.000000f,
+              100.000000f, 100.000000f, 100.000000f, 110.000000f, 110.000000f,
+              110.000000f, 90.000000f,  90.000000f,  90.000000f,  100.000000f,
+              100.000000f, 100.000000f, 110.000000f, 110.000000f, 110.000000f};
 
 float C3[] = {
-    0.100000f, 0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
-    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
-    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
-    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f
-};
+    0.100000f, 0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.050000f, 0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.050000f,
+    0.050000f, 0.080000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.050000f,
+    0.050000f, 0.050000f, 0.050000f, 0.050000f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f, 0.072500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f, 0.082500f,
+    0.082500f, 0.082500f, 0.082500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.027500f,
+    0.027500f, 0.027500f, 0.027500f, 0.027500f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f};
 
 float C4[] = {
-    0.200000f, 0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.200000f, 0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.200000f, 0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.200000f, 0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
-    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
-    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
-    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f,
-    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
-    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f
-};
+    0.200000f, 0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.200000f,
+    0.200000f, 0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.200000f, 0.200000f,
+    0.150000f, 0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.250000f, 0.350000f, 0.450000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.200000f, 0.200000f, 0.150000f,
+    0.150000f, 0.300000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f,
+    0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.250000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.500000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.450000f, 0.650000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.450000f,
+    0.650000f, 0.250000f, 0.350000f, 0.450000f, 0.250000f, 0.250000f, 0.500000f,
+    0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f, 0.500000f,
+    0.500000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f, 0.100000f,
+    0.100000f, 0.100000f, 0.100000f};
 
 float C5[] = {
-    0.500000f, 0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.500000f, 0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.500000f, 0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.500000f, 0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
-    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
-    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
-    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
-    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f,
-    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
-    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f
-};
+    0.500000f, 0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.500000f,
+    0.500000f, 1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.500000f, 0.500000f,
+    1.000000f, 1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.250000f, 0.350000f, 0.400000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.500000f, 0.500000f, 1.000000f,
+    1.000000f, 0.250000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f,
+    0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f,
+    0.150000f, 0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f,
+    0.250000f, 0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f,
+    0.350000f, 0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f,
+    0.400000f, 0.750000f, 0.050000f, 0.150000f, 0.250000f, 0.350000f, 0.400000f,
+    0.750000f, 0.250000f, 0.350000f, 0.400000f, 0.500000f, 1.000000f, 0.100000f,
+    0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f,
+    1.000000f, 0.100000f, 0.500000f, 1.000000f, 0.100000f, 0.500000f, 1.000000f,
+    0.100000f, 0.500000f, 1.000000f};
diff --git a/examples/financial/monte_carlo_options.cpp b/examples/financial/monte_carlo_options.cpp
index 8f733ceb7e..321b3e966d 100644
--- a/examples/financial/monte_carlo_options.cpp
+++ b/examples/financial/monte_carlo_options.cpp
@@ -7,25 +7,32 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <iostream>
-#include <stdio.h>
-#include <math.h>
 #include <arrayfire.h>
+#include <math.h>
+#include <stdio.h>
 #include <af/util.h>
+#include <iostream>
 
 using namespace af;
-template<class ty> dtype get_dtype();
+template<class ty>
+dtype get_dtype();
 
-template<> dtype get_dtype<float>() { return f32; }
-template<> dtype get_dtype<double>() { return f64; }
+template<>
+dtype get_dtype<float>() {
+    return f32;
+}
+template<>
+dtype get_dtype<double>() {
+    return f64;
+}
 
 template<class ty, bool use_barrier>
-static ty monte_carlo_barrier(int N, ty K, ty t, ty vol, ty r, ty strike, int steps, ty B)
-{
-    dtype pres = get_dtype<ty>();
+static ty monte_carlo_barrier(int N, ty K, ty t, ty vol, ty r, ty strike,
+                              int steps, ty B) {
+    dtype pres   = get_dtype<ty>();
     array payoff = constant(0, N, 1, pres);
 
-    ty dt = t / (ty)(steps - 1);
+    ty dt   = t / (ty)(steps - 1);
     array s = constant(strike, N, 1, pres);
 
     array randmat = randn(N, steps - 1, pres);
@@ -33,52 +40,46 @@ static ty monte_carlo_barrier(int N, ty K, ty t, ty vol, ty r, ty strike, int st
 
     array S = product(join(1, s, randmat), 1);
 
-    if (use_barrier) {
-        S = S * allTrue(S < B, 1);
-    }
+    if (use_barrier) { S = S * allTrue(S < B, 1); }
 
     payoff = max(0.0, S - K);
-    ty P = mean<ty>(payoff) * exp(-r * t);
+    ty P   = mean<ty>(payoff) * exp(-r * t);
     return P;
 }
 
 template<class ty, bool use_barrier>
-double monte_carlo_bench(int N)
-{
-    int steps = 180;
+double monte_carlo_bench(int N) {
+    int steps      = 180;
     ty stock_price = 100.0;
-    ty maturity = 0.5;
-    ty volatility = .30;
-    ty rate = .01;
-    ty strike = 100;
-    ty barrier = 115.0;
+    ty maturity    = 0.5;
+    ty volatility  = .30;
+    ty rate        = .01;
+    ty strike      = 100;
+    ty barrier     = 115.0;
 
     timer::start();
     for (int i = 0; i < 10; i++) {
-        monte_carlo_barrier<ty, use_barrier>(N, stock_price, maturity, volatility,
-                                             rate, strike, steps, barrier);
+        monte_carlo_barrier<ty, use_barrier>(
+            N, stock_price, maturity, volatility, rate, strike, steps, barrier);
     }
     return timer::stop() / 10;
 }
 
-int main()
-{
+int main() {
     try {
-
         // Warm up and caching
         monte_carlo_bench<float, false>(1000);
         monte_carlo_bench<float, true>(1000);
 
         for (int n = 10000; n <= 100000; n += 10000) {
-            printf("Time for %7d paths - "
-                   "vanilla method: %4.3f ms,  "
-                   "barrier method: %4.3f ms\n", n,
-                   1000 * monte_carlo_bench<float, false>(n),
-                   1000 * monte_carlo_bench<float, true>(n));
+            printf(
+                "Time for %7d paths - "
+                "vanilla method: %4.3f ms,  "
+                "barrier method: %4.3f ms\n",
+                n, 1000 * monte_carlo_bench<float, false>(n),
+                1000 * monte_carlo_bench<float, true>(n));
         }
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
     return 0;
 }
diff --git a/examples/getting_started/CMakeLists.txt b/examples/getting_started/CMakeLists.txt
new file mode 100644
index 0000000000..a9d1ce4bcb
--- /dev/null
+++ b/examples/getting_started/CMakeLists.txt
@@ -0,0 +1,73 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Getting-Started
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_CPU_FOUND)
+  # Convolve examples
+  add_executable(convolve_cpu convolve.cpp)
+  target_link_libraries(convolve_cpu ArrayFire::afcpu)
+
+  # Integer examples
+  add_executable(integer_cpu integer.cpp)
+  target_link_libraries(integer_cpu ArrayFire::afcpu)
+
+  # Rainfall examples
+  add_executable(rainfall_cpu rainfall.cpp)
+  target_link_libraries(rainfall_cpu ArrayFire::afcpu)
+
+  # Vectorization examples
+  add_executable(vectorize_cpu vectorize.cpp)
+  target_link_libraries(vectorize_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(convolve_cuda convolve.cpp)
+  target_link_libraries(convolve_cuda ArrayFire::afcuda)
+
+  add_executable(integer_cuda integer.cpp)
+  target_link_libraries(integer_cuda ArrayFire::afcuda)
+
+  add_executable(rainfall_cuda rainfall.cpp)
+  target_link_libraries(rainfall_cuda ArrayFire::afcuda)
+
+  add_executable(vectorize_cuda vectorize.cpp)
+  target_link_libraries(vectorize_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(convolve_opencl convolve.cpp)
+  target_link_libraries(convolve_opencl ArrayFire::afopencl)
+
+  add_executable(integer_opencl integer.cpp)
+  target_link_libraries(integer_opencl ArrayFire::afopencl)
+
+  add_executable(rainfall_opencl rainfall.cpp)
+  target_link_libraries(rainfall_opencl ArrayFire::afopencl)
+
+  add_executable(vectorize_opencl vectorize.cpp)
+  target_link_libraries(vectorize_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(convolve_oneapi convolve.cpp)
+  target_link_libraries(convolve_oneapi ArrayFire::afoneapi)
+
+  add_executable(integer_oneapi integer.cpp)
+  target_link_libraries(integer_oneapi ArrayFire::afoneapi)
+
+  add_executable(rainfall_oneapi rainfall.cpp)
+  target_link_libraries(rainfall_oneapi ArrayFire::afoneapi)
+
+  add_executable(vectorize_oneapi vectorize.cpp)
+  target_link_libraries(vectorize_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/getting_started/convolve.cpp b/examples/getting_started/convolve.cpp
index 50002f2100..7c2d0626ca 100644
--- a/examples/getting_started/convolve.cpp
+++ b/examples/getting_started/convolve.cpp
@@ -7,9 +7,9 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <arrayfire.h>
 #include <stdio.h>
 #include <cstdlib>
-#include <arrayfire.h>
 using namespace af;
 
 // use static variables at file scope so timeit() wrapper functions
@@ -19,48 +19,38 @@ using namespace af;
 static array img;
 
 // 5x5 derivative with separable kernels
-static float h_dx[] = {1.f / 12, -8.f / 12, 0, 8.f / 12, -1.f / 12}; // five point stencil
+static float h_dx[]     = {1.f / 12, -8.f / 12, 0, 8.f / 12,
+                           -1.f / 12};  // five point stencil
 static float h_spread[] = {1.f / 5, 1.f / 5, 1.f / 5, 1.f / 5, 1.f / 5};
-static array dx, spread, kernel; // device kernels
+static array dx, spread, kernel;  // device kernels
 
-static array full_out, dsep_out, hsep_out; // save output for value checks
+static array full_out, dsep_out, hsep_out;  // save output for value checks
 // wrapper functions for timeit() below
-static void full() { full_out = convolve2(img, kernel);}
+static void full() { full_out = convolve2(img, kernel); }
 static void dsep() { dsep_out = convolve(dx, spread, img); }
 
-static bool fail(array &left, array &right)
-{
+static bool fail(array &left, array &right) {
     return (max<float>(abs(left - right)) > 1e-6);
 }
 
-int main(int argc, char **argv)
-{
+int main(int argc, char **argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
         af::info();
 
         // setup image and device copies of kernels
-        img = randu(640, 480);
-        dx = array(5, 1, h_dx); // 5x1 kernel
-        spread = array(1, 5, h_spread); // 1x5 kernel
-        kernel = matmul(dx, spread); // 5x5 kernel
+        img    = randu(640, 480);
+        dx     = array(5, 1, h_dx);      // 5x1 kernel
+        spread = array(1, 5, h_spread);  // 1x5 kernel
+        kernel = matmul(dx, spread);     // 5x5 kernel
 
         printf("full 2D convolution:         %.5f seconds\n", timeit(full));
         printf("separable, device pointers:  %.5f seconds\n", timeit(dsep));
 
         // ensure values are all the same across versions
         if (fail(full_out, dsep_out)) { throw af::exception("full != dsep"); }
-    } catch (af::exception& e) {
-        fprintf(stderr, "%s\n", e.what());
-    }
+    } catch (af::exception &e) { fprintf(stderr, "%s\n", e.what()); }
 
-#ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-#endif
     return 0;
 }
diff --git a/examples/getting_started/integer.cpp b/examples/getting_started/integer.cpp
index 07d1466412..b508e4d711 100644
--- a/examples/getting_started/integer.cpp
+++ b/examples/getting_started/integer.cpp
@@ -7,26 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <af/util.h>
 #include <cstdlib>
 
 using namespace af;
 
-int main(int argc, char ** argv)
-{
+int main(int argc, char** argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
         af::info();
 
-        printf("\n=== ArrayFire signed(s32) / unsigned(u32) Integer Example ===\n");
+        printf(
+            "\n=== ArrayFire signed(s32) / unsigned(u32) Integer Example "
+            "===\n");
 
         int h_A[] = {1, 2, 4, -1, 2, 0, 4, 2, 3};
         int h_B[] = {2, 3, -5, 6, 0, 10, -12, 0, 1};
-        array A = array(3, 3, h_A);
-        array B = array(3, 3, h_B);
+        array A   = array(3, 3, h_A);
+        array B   = array(3, 3, h_B);
 
         printf("--\nSub-refencing and Sub-assignment\n");
         af_print(A);
@@ -36,7 +37,7 @@ int main(int argc, char ** argv)
         A(1) = 100;
         af_print(A);
         af_print(B);
-        A(1,span) = B(2,span);
+        A(1, span) = B(2, span);
         af_print(A);
 
         printf("--Bit-wise operations\n");
@@ -56,8 +57,8 @@ int main(int argc, char ** argv)
 
         printf("\n--Flip Vertically / Horizontally\n");
         af_print(A);
-        af_print(flip(A,0));
-        af_print(flip(A,1));
+        af_print(flip(A, 0));
+        af_print(flip(A, 1));
 
         printf("\n--Sum along columns\n");
         af_print(A);
@@ -88,12 +89,5 @@ int main(int argc, char ** argv)
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/getting_started/rainfall.cpp b/examples/getting_started/rainfall.cpp
index c334b272f7..04e39303f8 100644
--- a/examples/getting_started/rainfall.cpp
+++ b/examples/getting_started/rainfall.cpp
@@ -22,44 +22,41 @@
 //  "Rapid Problem Solving Using Thrust", Nathan Bell, NVIDIA
 
 #include <arrayfire.h>
-#include <af/util.h>
 #include <stdio.h>
+#include <af/util.h>
 #include <cstdlib>
 using namespace af;
 
-int main(int argc, char **argv)
-{
+int main(int argc, char **argv) {
     try {
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
         af::info();
 
         int days = 9, sites = 4;
-        int n = 10; // measurements
-        float day_[] =         {0, 0, 1, 2, 5, 5, 6, 6, 7, 8 }; // ascending
-        float site_[] =        {2, 3, 0, 1, 1, 2, 0, 1, 2, 1 };
-        float measurement_[] = {9, 5, 6, 3, 3, 8, 2, 6, 5, 10}; // inches
-        array day(n,day_);
-        array site(n,site_);
-        array measurement(n,measurement_);
+        int n                = 10;                              // measurements
+        float day_[]         = {0, 0, 1, 2, 5, 5, 6, 6, 7, 8};  // ascending
+        float site_[]        = {2, 3, 0, 1, 1, 2, 0, 1, 2, 1};
+        float measurement_[] = {9, 5, 6, 3, 3, 8, 2, 6, 5, 10};  // inches
+        array day(n, day_);
+        array site(n, site_);
+        array measurement(n, measurement_);
 
         array rainfall = constant(0, sites);
-        gfor (seq s, sites) {
-            rainfall(s) = sum(measurement * (site == s));
-        }
+        gfor(seq s, sites) { rainfall(s) = sum(measurement * (site == s)); }
 
         printf("total rainfall at each site:\n");
         af_print(rainfall);
 
-        array is_between = 1 <= day && day <= 5; // days 1 and 5
+        array is_between   = 1 <= day && day <= 5;  // days 1 through 5
         float rain_between = sum<float>(measurement * is_between);
         printf("rain between days: %g\n", rain_between);
 
-        printf("number of days with rain: %g\n", sum<float>(diff1(day) > 0) + 1);
+        printf("number of days with rain: %g\n",
+               sum<float>(diff1(day) > 0) + 1);
 
-        array per_day = constant(0, days);
-        gfor (seq d, days)
-            per_day(d) = sum(measurement * (day == d));
+        array per_day = constant(0, days);
+        gfor(seq d, days) { per_day(d) = sum(measurement * (day == d)); }
 
         printf("total rainfall each day:\n");
         af_print(per_day);
@@ -70,12 +67,5 @@ int main(int argc, char **argv)
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        std::cout << "hit [enter]...";
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/getting_started/vectorize.cpp b/examples/getting_started/vectorize.cpp
index 0d482234c7..1d3bb4faaf 100644
--- a/examples/getting_started/vectorize.cpp
+++ b/examples/getting_started/vectorize.cpp
@@ -7,24 +7,21 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <af/util.h>
 
 using namespace af;
 
 array A, B;
 
-static array dist_naive(array a, array b)
-{
+static array dist_naive(array a, array b) {
     array dist_mat = constant(0, a.dims(1), (int)b.dims(1));
 
     // Iterate through columns a
     for (int ii = 0; ii < (int)a.dims(1); ii++) {
-
         // Iterate through columns of b
         for (int jj = 0; jj < (int)b.dims(1); jj++) {
-
             // Get the sum of absolute differences
             for (int kk = 0; kk < (int)a.dims(0); kk++) {
                 dist_mat(ii, jj) += abs(a(kk, ii) - b(kk, jj));
@@ -35,8 +32,7 @@ static array dist_naive(array a, array b)
     return dist_mat;
 }
 
-static array dist_vec(array a, array b)
-{
+static array dist_vec(array a, array b) {
     array dist_mat = constant(0, (int)a.dims(1), (int)b.dims(1));
 
     // Iterate through columns a
@@ -55,12 +51,11 @@ static array dist_vec(array a, array b)
     return dist_mat;
 }
 
-static array dist_gfor1(array a, array b)
-{
+static array dist_gfor1(array a, array b) {
     array dist_mat = constant(0, (int)a.dims(1), (int)b.dims(1));
 
     // GFOR along columns of a
-    gfor (seq ii, (int)a.dims(1)) {
+    gfor(seq ii, (int)a.dims(1)) {
         array avec = a(span, ii);
 
         // Itere through columns of b
@@ -75,12 +70,11 @@ static array dist_gfor1(array a, array b)
     return dist_mat;
 }
 
-static array dist_gfor2(array a, array b)
-{
+static array dist_gfor2(array a, array b) {
     array dist_mat = constant(0, (int)a.dims(1), (int)b.dims(1));
 
     // GFOR along columns of b
-    gfor (seq jj, (int)b.dims(1)) {
+    gfor(seq jj, (int)b.dims(1)) {
         array bvec = b(span, jj);
 
         // Iterate through columns of A
@@ -95,8 +89,7 @@ static array dist_gfor2(array a, array b)
     return dist_mat;
 }
 
-static array dist_tile1(array a, array b)
-{
+static array dist_tile1(array a, array b) {
     // int feat_len = (int)a.dims(0); // Same as (int)b.dims(0);
     int alen = (int)a.dims(1);
     int blen = (int)b.dims(1);
@@ -105,7 +98,6 @@ static array dist_tile1(array a, array b)
 
     // Iterate through columns of b
     for (int jj = 0; jj < blen; jj++) {
-
         // Get the column vector of b
         // shape of bvec is (feat_len, 1)
         array bvec = b(span, jj);
@@ -125,11 +117,10 @@ static array dist_tile1(array a, array b)
     return dist_mat;
 }
 
-static array dist_tile2(array a, array b)
-{
+static array dist_tile2(array a, array b) {
     int feat_len = (int)a.dims(0);
-    int alen = (int)a.dims(1);
-    int blen = (int)b.dims(1);
+    int alen     = (int)a.dims(1);
+    int blen     = (int)b.dims(1);
 
     // Shape of a is (feat_len, alen, 1)
     array a_mod = a;
@@ -149,40 +140,20 @@ static array dist_tile2(array a, array b)
     return dist_mat;
 }
 
-static void bench_naive()
-{
-    dist_naive(A, B);
-}
+static void bench_naive() { dist_naive(A, B); }
 
-static void bench_vec()
-{
-    dist_vec(A, B);
-}
+static void bench_vec() { dist_vec(A, B); }
 
-static void bench_gfor1()
-{
-    dist_gfor1(A, B);
-}
+static void bench_gfor1() { dist_gfor1(A, B); }
 
-static void bench_gfor2()
-{
-    dist_gfor2(A, B);
-}
+static void bench_gfor2() { dist_gfor2(A, B); }
 
-static void bench_tile1()
-{
-    dist_tile1(A, B);
-}
+static void bench_tile1() { dist_tile1(A, B); }
 
-static void bench_tile2()
-{
-    dist_tile2(A, B);
-}
+static void bench_tile2() { dist_tile2(A, B); }
 
-int main(int argc, char **argv)
-{
+int main(int, char **) {
     try {
-
         af::info();
 
         // Do not increase the sizes
@@ -191,7 +162,7 @@ int main(int argc, char **argv)
         B = randu(3, 300);
 
         array d1 = dist_naive(A, B);
-        array d2 = dist_vec  (A, B);
+        array d2 = dist_vec(A, B);
         array d3 = dist_gfor1(A, B);
         array d4 = dist_gfor2(A, B);
         array d5 = dist_tile1(A, B);
@@ -206,22 +177,16 @@ int main(int argc, char **argv)
         printf("\n");
 
         printf("Time for dist_naive: %2.2fms\n", 1000 * timeit(bench_naive));
-        printf("Time for dist_vec  : %2.2fms\n", 1000 * timeit(bench_vec  ));
+        printf("Time for dist_vec  : %2.2fms\n", 1000 * timeit(bench_vec));
         printf("Time for dist_gfor1: %2.2fms\n", 1000 * timeit(bench_gfor1));
         printf("Time for dist_gfor2: %2.2fms\n", 1000 * timeit(bench_gfor2));
         printf("Time for dist_tile1: %2.2fms\n", 1000 * timeit(bench_tile1));
         printf("Time for dist_tile2: %2.2fms\n", 1000 * timeit(bench_tile2));
 
-    } catch(af::exception ex) {
+    } catch (const af::exception &ex) {
         fprintf(stderr, "%s\n", ex.what());
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
+    return 0;
 }
diff --git a/examples/graphics/CMakeLists.txt b/examples/graphics/CMakeLists.txt
new file mode 100644
index 0000000000..6140142343
--- /dev/null
+++ b/examples/graphics/CMakeLists.txt
@@ -0,0 +1,143 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Graphics
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+add_definitions("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+
+if(ArrayFire_CPU_FOUND)
+  # Conway Game of Life
+  add_executable(conway_cpu conway.cpp)
+  target_link_libraries(conway_cpu ArrayFire::afcpu)
+
+  # Conway Game of Life with Color
+  add_executable(conway_pretty_cpu conway_pretty.cpp)
+  target_link_libraries(conway_pretty_cpu ArrayFire::afcpu)
+
+  # Vector fields example
+  add_executable(field_cpu field.cpp)
+  target_link_libraries(field_cpu ArrayFire::afcpu)
+
+  # Fractal example
+  add_executable(fractal_cpu fractal.cpp)
+  target_link_libraries(fractal_cpu ArrayFire::afcpu)
+
+  # Gravity Simulation example
+  add_executable(gravity_sim_cpu gravity_sim.cpp gravity_sim_init.h)
+  target_link_libraries(gravity_sim_cpu ArrayFire::afcpu)
+
+  # Histogram example
+  add_executable(histogram_cpu histogram.cpp)
+  target_compile_definitions(histogram_cpu PRIVATE "ASSETS_DIR=\"${ASSETS_DIR}\"")
+  target_link_libraries(histogram_cpu ArrayFire::afcpu)
+
+  # Plot 2D example
+  add_executable(plot2d_cpu plot2d.cpp)
+  target_link_libraries(plot2d_cpu ArrayFire::afcpu)
+
+  # Plot 3 example
+  add_executable(plot3_cpu plot3.cpp)
+  target_link_libraries(plot3_cpu ArrayFire::afcpu)
+
+  # Surface example
+  add_executable(surface_cpu surface.cpp)
+  target_link_libraries(surface_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(conway_cuda conway.cpp)
+  target_link_libraries(conway_cuda ArrayFire::afcuda)
+
+  add_executable(conway_pretty_cuda conway_pretty.cpp)
+  target_link_libraries(conway_pretty_cuda ArrayFire::afcuda)
+
+  add_executable(field_cuda field.cpp)
+  target_link_libraries(field_cuda ArrayFire::afcuda)
+
+  add_executable(fractal_cuda fractal.cpp)
+  target_link_libraries(fractal_cuda ArrayFire::afcuda)
+
+  add_executable(gravity_sim_cuda gravity_sim.cpp gravity_sim_init.h)
+  target_link_libraries(gravity_sim_cuda ArrayFire::afcuda)
+
+  add_executable(histogram_cuda histogram.cpp)
+  target_compile_definitions(histogram_cuda PRIVATE "ASSETS_DIR=\"${ASSETS_DIR}\"")
+  target_link_libraries(histogram_cuda ArrayFire::afcuda)
+
+  add_executable(plot2d_cuda plot2d.cpp)
+  target_link_libraries(plot2d_cuda ArrayFire::afcuda)
+  add_executable(plot3_cuda plot3.cpp)
+  target_link_libraries(plot3_cuda ArrayFire::afcuda)
+
+  add_executable(surface_cuda surface.cpp)
+  target_link_libraries(surface_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(conway_opencl conway.cpp)
+  target_link_libraries(conway_opencl ArrayFire::afopencl)
+
+  add_executable(conway_pretty_opencl conway_pretty.cpp)
+  target_link_libraries(conway_pretty_opencl ArrayFire::afopencl)
+
+  add_executable(field_opencl field.cpp)
+  target_link_libraries(field_opencl ArrayFire::afopencl)
+
+  add_executable(fractal_opencl fractal.cpp)
+  target_link_libraries(fractal_opencl ArrayFire::afopencl)
+
+  add_executable(gravity_sim_opencl gravity_sim.cpp gravity_sim_init.h)
+  target_link_libraries(gravity_sim_opencl ArrayFire::afopencl)
+
+  add_executable(histogram_opencl histogram.cpp)
+  target_compile_definitions(histogram_opencl PRIVATE "ASSETS_DIR=\"${ASSETS_DIR}\"")
+  target_link_libraries(histogram_opencl ArrayFire::afopencl)
+
+  add_executable(plot2d_opencl plot2d.cpp)
+  target_link_libraries(plot2d_opencl ArrayFire::afopencl)
+
+  add_executable(plot3_opencl plot3.cpp)
+  target_link_libraries(plot3_opencl ArrayFire::afopencl)
+
+  add_executable(surface_opencl surface.cpp)
+  target_link_libraries(surface_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(conway_oneapi conway.cpp)
+  target_link_libraries(conway_oneapi ArrayFire::afoneapi)
+
+  add_executable(conway_pretty_oneapi conway_pretty.cpp)
+  target_link_libraries(conway_pretty_oneapi ArrayFire::afoneapi)
+
+  add_executable(field_oneapi field.cpp)
+  target_link_libraries(field_oneapi ArrayFire::afoneapi)
+
+  add_executable(fractal_oneapi fractal.cpp)
+  target_link_libraries(fractal_oneapi ArrayFire::afoneapi)
+
+  add_executable(gravity_sim_oneapi gravity_sim.cpp gravity_sim_init.h)
+  target_link_libraries(gravity_sim_oneapi ArrayFire::afoneapi)
+
+  add_executable(histogram_oneapi histogram.cpp)
+  target_compile_definitions(histogram_oneapi PRIVATE "ASSETS_DIR=\"${ASSETS_DIR}\"")
+  target_link_libraries(histogram_oneapi ArrayFire::afoneapi)
+
+  add_executable(plot2d_oneapi plot2d.cpp)
+  target_link_libraries(plot2d_oneapi ArrayFire::afoneapi)
+
+  add_executable(plot3_oneapi plot3.cpp)
+  target_link_libraries(plot3_oneapi ArrayFire::afoneapi)
+
+  add_executable(surface_oneapi surface.cpp)
+  target_link_libraries(surface_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/graphics/conway.cpp b/examples/graphics/conway.cpp
index f481156efc..745187f75e 100644
--- a/examples/graphics/conway.cpp
+++ b/examples/graphics/conway.cpp
@@ -8,28 +8,39 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <iostream>
 #include <cstdio>
+#include <iostream>
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int, char**) {
     try {
         static const float h_kernel[] = {1, 1, 1, 1, 0, 1, 1, 1, 1};
-        static const int reset = 500;
+        static const int reset        = 500;
         static const int game_w = 128, game_h = 128;
 
         af::info();
 
-        std::cout << "This example demonstrates the Conway's Game of Life using ArrayFire" << std::endl
-                  << "There are 4 simple rules of Conways's Game of Life" << std::endl
-                  << "1. Any live cell with fewer than two live neighbours dies, as if caused by under-population." << std::endl
-                  << "2. Any live cell with two or three live neighbours lives on to the next generation." << std::endl
-                  << "3. Any live cell with more than three live neighbours dies, as if by overcrowding." << std::endl
-                  << "4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction." << std::endl
-                  << "Each white block in the visualization represents 1 alive cell, black space represents dead cells" << std::endl
-                  ;
+        std::cout << "This example demonstrates Conway's Game of Life "
+                     "using ArrayFire"
+                  << std::endl
+                  << "There are 4 simple rules of Conway's Game of Life"
+                  << std::endl
+                  << "1. Any live cell with fewer than two live neighbours "
+                     "dies, as if caused by under-population."
+                  << std::endl
+                  << "2. Any live cell with two or three live neighbours lives "
+                     "on to the next generation."
+                  << std::endl
+                  << "3. Any live cell with more than three live neighbours "
+                     "dies, as if by overcrowding."
+                  << std::endl
+                  << "4. Any dead cell with exactly three live neighbours "
+                     "becomes a live cell, as if by reproduction."
+                  << std::endl
+                  << "Each white block in the visualization represents 1 alive "
+                     "cell, black space represents dead cells"
+                  << std::endl;
 
         af::Window myWindow(512, 512, "Conway's Game of Life using ArrayFire");
 
@@ -40,13 +51,12 @@ int main(int argc, char *argv[])
         array state;
         state = (af::randu(game_h, game_w, f32) > 0.5).as(f32);
 
-        while(!myWindow.close()) {
-
+        while (!myWindow.close()) {
             myWindow.image(state);
             frame_count++;
 
             // Generate a random starting state
-            if(frame_count % reset == 0)
+            if (frame_count % reset == 0)
                 state = (af::randu(game_h, game_w, f32) > 0.5).as(f32);
 
             // Convolve gets neighbors
@@ -67,14 +77,5 @@ int main(int argc, char *argv[])
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
-
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
-
diff --git a/examples/graphics/conway_pretty.cpp b/examples/graphics/conway_pretty.cpp
index 5b40990745..4d8de380bd 100644
--- a/examples/graphics/conway_pretty.cpp
+++ b/examples/graphics/conway_pretty.cpp
@@ -8,40 +8,64 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <iostream>
 #include <cstdio>
+#include <iostream>
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int, char**) {
     try {
         static const float h_kernel[] = {1, 1, 1, 1, 0, 1, 1, 1, 1};
-        static const int reset = 500;
+        static const int reset        = 500;
         static const int game_w = 128, game_h = 128;
 
         af::info();
 
-        std::cout << "This example demonstrates the Conway's Game of Life using ArrayFire" << std::endl
-                  << "There are 4 simple rules of Conways's Game of Life" << std::endl
-                  << "1. Any live cell with fewer than two live neighbours dies, as if caused by under-population." << std::endl
-                  << "2. Any live cell with two or three live neighbours lives on to the next generation." << std::endl
-                  << "3. Any live cell with more than three live neighbours dies, as if by overcrowding." << std::endl
-                  << "4. Any dead cell with exactly three live neighbours becomes a live cell, as if by reproduction." << std::endl
-                  << "Each white block in the visualization represents 1 alive cell, black space represents dead cells" << std::endl
-                  ;
-        std::cout << "The conway_pretty example visualizes all the states in Conway" << std::endl
-                  << "Red   : Cells that have died due to under population"          << std::endl
-                  << "Yellow: Cells that continue to live from previous state"       << std::endl
-                  << "Green : Cells that are new as a result of reproduction"        << std::endl
-                  << "Blue  : Cells that have died due to over population"           << std::endl
-                  ;
-        std::cout << "This examples is throttled so as to be a better visualization" << std::endl;
-
-        af::Window simpleWindow(512, 512, "Conway's Game Of Life - Current State");
-        af::Window prettyWindow(512, 512, "Conway's Game Of Life - Visualizing States");
-        simpleWindow.setPos(25, 25);
-        prettyWindow.setPos(125, 15);
+        std::cout << "This example demonstrates Conway's Game of Life "
+                     "using ArrayFire"
+                  << std::endl
+                  << "There are 4 simple rules of Conway's Game of Life"
+                  << std::endl
+                  << "1. Any live cell with fewer than two live neighbours "
+                     "dies, as if caused by under-population."
+                  << std::endl
+                  << "2. Any live cell with two or three live neighbours lives "
+                     "on to the next generation."
+                  << std::endl
+                  << "3. Any live cell with more than three live neighbours "
+                     "dies, as if by overcrowding."
+                  << std::endl
+                  << "4. Any dead cell with exactly three live neighbours "
+                     "becomes a live cell, as if by reproduction."
+                  << std::endl
+                  << "Each white block in the visualization represents 1 alive "
+                     "cell, black space represents dead cells"
+                  << std::endl
+                  << std::endl;
+
+        std::cout
+            << "The conway_pretty example visualizes all the states in Conway"
+            << std::endl
+            << "Red   : Cells that have died due to under population"
+            << std::endl
+            << "Yellow: Cells that continue to live from previous state"
+            << std::endl
+            << "Green : Cells that are new as a result of reproduction"
+            << std::endl
+            << "Blue  : Cells that have died due to over population"
+            << std::endl
+            << std::endl;
+
+        std::cout
+            << "This example is throttled so as to be a better visualization"
+            << std::endl;
+
+        af::Window simpleWindow(512, 512,
+                                "Conway's Game Of Life - Current State");
+        af::Window prettyWindow(512, 512,
+                                "Conway's Game Of Life - Visualizing States");
+        simpleWindow.setPos(32, 32);
+        prettyWindow.setPos(512 + 32, 32);
 
         int frame_count = 0;
 
@@ -52,15 +76,15 @@ int main(int argc, char *argv[])
 
         array display = tile(state, 1, 1, 3, 1);
 
-        while(!simpleWindow.close() && !prettyWindow.close()) {
+        while (!simpleWindow.close() && !prettyWindow.close()) {
             af::timer delay = timer::start();
 
-            if(!simpleWindow.close())   simpleWindow.image(state);
-            if(!prettyWindow.close())   prettyWindow.image(display);
+            if (!simpleWindow.close()) simpleWindow.image(state);
+            if (!prettyWindow.close()) prettyWindow.image(display);
             frame_count++;
 
             // Generate a random starting state
-            if(frame_count % reset == 0)
+            if (frame_count % reset == 0)
                 state = (af::randu(game_h, game_w, f32) > 0.5).as(f32);
 
             // Convolve gets neighbors
@@ -74,31 +98,22 @@ int main(int argc, char *argv[])
             af::array C0 = (nHood == 2);
             af::array C1 = (nHood == 3);
 
-            array a0 = (state == 1) && (nHood < 2); // Die of under population
-            array a1 = (state != 0) && (C0 || C1);  // Continue to live
-            array a2 = (state == 0) && C1;          // Reproduction
-            array a3 = (state == 1) && (nHood > 3); // Over-population
+            array a0 = (state == 1) && (nHood < 2);  // Die of under population
+            array a1 = (state != 0) && (C0 || C1);   // Continue to live
+            array a2 = (state == 0) && C1;           // Reproduction
+            array a3 = (state == 1) && (nHood > 3);  // Over-population
 
             display = join(2, a0 + a1, a1 + a2, a3).as(f32);
 
             // Update state
             state = state * C0 + C1;
 
-            double fps = 5;
-            while(timer::stop(delay) < (1 / fps)) { }
+            double fps = 30;
+            while (timer::stop(delay) < (1 / fps)) {}
         }
     } catch (af::exception& e) {
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
-
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
-
diff --git a/examples/graphics/field.cpp b/examples/graphics/field.cpp
new file mode 100644
index 0000000000..f493c7ecd6
--- /dev/null
+++ b/examples/graphics/field.cpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <math.h>
+#include <cstdio>
+
+using namespace af;
+
+const static float MINIMUM = -3.0f;
+const static float MAXIMUM = 3.0f;
+const static float STEP    = 0.18f;
+
+int main(int, char**) {
+    try {
+        af::info();
+        af::Window myWindow(1024, 1024, "2D Vector Field example: ArrayFire");
+
+        myWindow.grid(2, 2);
+
+        array dataRange = seq(MINIMUM, MAXIMUM, STEP);
+
+        array x = tile(dataRange, 1, dataRange.dims(0));
+        array y = tile(dataRange.T(), dataRange.dims(0), 1);
+        x.eval();
+        y.eval();
+
+        float scale = 2.0f;
+        do {
+            array points = join(1, flat(x), flat(y));
+
+            array saddle = join(1, flat(x), -1.0f * flat(y));
+
+            array bvals = sin(scale * (x * x + y * y));
+            array hbowl = join(1, constant(1., x.elements()), flat(bvals));
+            hbowl.eval();
+
+            // 2D points
+            myWindow(0, 0).vectorField(points, saddle, "Saddle point");
+            myWindow(0, 1).vectorField(
+                points, hbowl, "hilly bowl (in a loop with varying amplitude)");
+
+            // 2D coordinates
+            myWindow(1, 0).vectorField(2.0 * flat(x), flat(y), flat(x),
+                                       -flat(y), "Saddle point");
+            myWindow(1, 1).vectorField(
+                2.0 * flat(x), flat(y), constant(1., x.elements()), flat(bvals),
+                "hilly bowl (in a loop with varying amplitude)");
+
+            myWindow.show();
+
+            scale -= 0.0010f;
+            if (scale < -0.01f) { scale = 2.0f; }
+        } while (!myWindow.close());
+
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
diff --git a/examples/graphics/fractal.cpp b/examples/graphics/fractal.cpp
index f3002e9c53..a86dd32805 100644
--- a/examples/graphics/fractal.cpp
+++ b/examples/graphics/fractal.cpp
@@ -7,85 +7,94 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
-#include <iostream>
 #include <arrayfire.h>
-#include <math.h>
+#include <stdio.h>
+#include <cmath>
 #include <cstdlib>
+#include <iostream>
 
-#define WIDTH 600 // Width of image
-#define HEIGHT 800 // Width of image
+#define WIDTH 400   // Width of image
+#define HEIGHT 400  // Height of image
 
 using namespace af;
+using std::abs;
 
-array complex_grid(int width, int height, float zoom, float center[2])
-{
-
+array complex_grid(int width, int height, float zoom, float center[2]) {
     // Generate sequences of length width, height
-    array x = (seq(double(height)) - double(height) / 2.0);
-    array y = (seq(double(width )) - double(width)  / 2.0);
-
-    // Tile the sequences to generate grid of image size
-    array X = tile(x.T(), y.elements(), 1) / zoom + center[0];
-    array Y = tile(y    , 1, x.elements()) / zoom + center[1];
+    array X =
+        (iota(dim4(1, height), dim4(width, 1)) - (float)height / 2.0) / zoom +
+        center[0];
+    array Y =
+        (iota(dim4(width, 1), dim4(1, height)) - (float)width / 2.0) / zoom +
+        center[1];
 
     // Return the locations as a complex grid
     return complex(X, Y);
 }
 
-array mandelbrot(const array &in, int iter, float maxval)
-{
-    array C = in;
-    array Z = C;
+array mandelbrot(const array &in, int iter, float maxval) {
+    array C   = in;
+    array Z   = C;
     array mag = constant(0, C.dims());
 
     for (int ii = 1; ii < iter; ii++) {
-
         // Do the calculation
         Z = Z * Z + C;
 
         // Get indices where abs(Z) crosses maxval
         array cond = (abs(Z) > maxval).as(f32);
-        mag = af::max(mag, cond * ii);
+        mag        = af::max(mag, cond * ii);
 
         // If abs(Z) cross maxval, turn off those locations
         C = C * (1 - cond);
         Z = Z * (1 - cond);
+
+        // Ensuring the JIT does not become too large
+        af::eval(C, Z);
+        mag.eval();
     }
 
     // Normalize
     return mag / maxval;
 }
 
-array normalize(array a)
-{
+array normalize(array a) {
     float mx = af::max<float>(a);
     float mn = af::min<float>(a);
-    return (a-mn)/(mx-mn);
+    return (a - mn) / (mx - mn);
 }
 
-int main(int argc, char **argv)
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
-    int iter = argc > 2 ? atoi(argv[2]) : 100;
+int main(int argc, char **argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
+    int iter     = argc > 2 ? atoi(argv[2]) : 100;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     try {
         af::setDevice(device);
         info();
         printf("** ArrayFire Fractals Demo **\n");
-        af::Window wnd("Fractal Demo");
+        af::Window wnd(WIDTH, HEIGHT, "Fractal Demo");
         wnd.setColorMap(AF_COLORMAP_SPECTRUM);
 
-        float center[] = {-0.5, 0};
+        float center[] = {-0.75f, 0.1f};
         // Keep zomming out for each frame
-        for (int zoom = 1000; zoom > 100; zoom -= 1) {
+        for (int i = 10; i < 400; i++) {
+            int zoom = i * i;
+            if (!(i % 10)) {
+                printf("iteration: %d zoom: %d\n", i, zoom);
+                fflush(stdout);
+            }
+
             // Generate the grid at the current zoom factor
             array c = complex_grid(WIDTH, HEIGHT, zoom, center);
 
+            iter = sqrt(abs(2 * sqrt(abs(1 - sqrt(5 * zoom))))) * 100;
             // Generate the mandelbrot image
             array mag = mandelbrot(c, iter, 1000);
-            if(!console) {
-                wnd.image(normalize(mag));
+
+            if (!console) {
+                if (wnd.close()) break;
+                array mag_norm = normalize(mag);
+                wnd.image(mag_norm);
             }
         }
 
diff --git a/examples/graphics/gravity_sim.cpp b/examples/graphics/gravity_sim.cpp
new file mode 100644
index 0000000000..3de7424c1d
--- /dev/null
+++ b/examples/graphics/gravity_sim.cpp
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cstdio>
+#include <iostream>
+#include <vector>
+#include "gravity_sim_init.h"
+
+using namespace af;
+using namespace std;
+
+static const bool is3D           = true;
+const static int total_particles = 4000;
+static const int reset           = 3000;
+static const float min_dist      = 3;
+static const int width = 768, height = 768, depth = 768;
+static const int gravity_constant = 20000;
+
+float mass_range = 0;
+float min_mass   = 0;
+
+void initial_conditions_rand(af::array &mass, vector<af::array> &pos,
+                             vector<af::array> &vels,
+                             vector<af::array> &forces) {
+    for (int i = 0; i < (int)pos.size(); ++i) {
+        pos[i]    = af::randn(total_particles) * width + width;
+        vels[i]   = 0 * af::randu(total_particles) - 0.5;
+        forces[i] = af::constant(0, total_particles);
+    }
+    mass = af::constant(gravity_constant, total_particles);
+}
+
+void initial_conditions_galaxy(af::array &mass, vector<af::array> &pos,
+                               vector<af::array> &vels,
+                               vector<af::array> &forces) {
+    af::array initial_cond_consts(af::dim4(7, total_particles), hbd);
+    initial_cond_consts = initial_cond_consts.T();
+
+    for (int i = 0; i < (int)pos.size(); ++i) {
+        pos[i]    = af::randn(total_particles) * width + width;
+        vels[i]   = 0 * (af::randu(total_particles) - 0.5);
+        forces[i] = af::constant(0, total_particles);
+    }
+
+    mass    = initial_cond_consts(span, 0);
+    pos[0]  = (initial_cond_consts(span, 1) / 32 + 0.6) * width;
+    pos[1]  = (initial_cond_consts(span, 2) / 32 + 0.3) * height;
+    pos[2]  = (initial_cond_consts(span, 3) / 32 + 0.5) * depth;
+    vels[0] = (initial_cond_consts(span, 4) / 32) * width;
+    vels[1] = (initial_cond_consts(span, 5) / 32) * height;
+    vels[2] = (initial_cond_consts(span, 6) / 32) * depth;
+
+    pos[0](seq(0, pos[0].dims(0) - 1, 2)) -= 0.4 * width;
+    pos[1](seq(0, pos[0].dims(0) - 1, 2)) += 0.4 * height;
+    vels[0](seq(0, pos[0].dims(0) - 1, 2)) += 4;
+
+    min_mass   = min<float>(mass);
+    mass_range = max<float>(mass) - min<float>(mass);
+}
+
+af::array ids_from_pos(vector<af::array> &pos) {
+    return (pos[0].as(u32) * height) + pos[1].as(u32);
+}
+
+af::array ids_from_3D(vector<af::array> &pos, float Rx, float Ry, float Rz) {
+    af::array x0 = (pos[0] - width / 2);
+    af::array y0 =
+        (pos[1] - height / 2) * cos(Rx) + (pos[2] - depth / 2) * sin(Rx);
+    af::array z0 =
+        (pos[2] - depth / 2) * cos(Rx) - (pos[1] - height / 2) * sin(Rx);
+
+    af::array x1 = x0 * cos(Ry) - z0 * sin(Ry);
+    af::array y1 = y0;
+
+    af::array x2 = x1 * cos(Rz) + y1 * sin(Rz);
+    af::array y2 = y1 * cos(Rz) - x1 * sin(Rz);
+
+    x2 += width / 2;
+    y2 += height / 2;
+
+    return (x2.as(u32) * height) + y2.as(u32);
+}
+
+af::array ids_from_3D(vector<af::array> &pos, float Rx, float Ry, float Rz,
+                      af::array filter) {
+    af::array x0 = (pos[0](filter) - width / 2);
+    af::array y0 = (pos[1](filter) - height / 2) * cos(Rx) +
+                   (pos[2](filter) - depth / 2) * sin(Rx);
+    af::array z0 = (pos[2](filter) - depth / 2) * cos(Rx) -
+                   (pos[1](filter) - height / 2) * sin(Rx);
+
+    af::array x1 = x0 * cos(Ry) - z0 * sin(Ry);
+    af::array y1 = y0;
+
+    af::array x2 = x1 * cos(Rz) + y1 * sin(Rz);
+    af::array y2 = y1 * cos(Rz) - x1 * sin(Rz);
+
+    x2 += width / 2;
+    y2 += height / 2;
+
+    return (x2.as(u32) * height) + y2.as(u32);
+}
+
+void simulate(af::array &mass, vector<af::array> &pos, vector<af::array> &vels,
+              vector<af::array> &forces, float dt) {
+    for (int i = 0; i < (int)pos.size(); ++i) {
+        pos[i] += vels[i] * dt;
+        pos[i].eval();
+    }
+
+    // calculate forces to each particle
+    vector<af::array> diff(pos.size());
+    af::array dist = af::constant(0, pos[0].dims(0), pos[0].dims(0));
+
+    for (int i = 0; i < (int)pos.size(); ++i) {
+        diff[i] = tile(pos[i], 1, pos[i].dims(0)) -
+                  transpose(tile(pos[i], 1, pos[i].dims(0)));
+        dist += (diff[i] * diff[i]);
+    }
+
+    dist = sqrt(dist);
+    dist = af::max(min_dist, dist);
+    dist *= dist * dist;
+
+    for (int i = 0; i < (int)pos.size(); ++i) {
+        // calculate force vectors
+        forces[i] = diff[i] / dist;
+        forces[i].eval();
+
+        // af::array idx = af::where(af::isNaN(forces[i]));
+        // if(idx.elements() > 0)
+        //    forces[i](idx) = 0;
+        // forces[i] = sum(forces[i]).T();
+        forces[i] = matmul(forces[i].T(), mass);
+
+        // update force scaled to time, magnitude constant
+        forces[i] *= (gravity_constant);
+        forces[i].eval();
+
+        // update velocities from forces
+        vels[i] += forces[i] * dt;
+        vels[i].eval();
+
+        // noise
+        // forces[i] += 0.1 * af::randn(forces[i].dims(0));
+
+        // dampening
+        // vels[i] *= 1 - (0.005*dt);
+    }
+}
+
+void collisions(vector<af::array> &pos, vector<af::array> &vels, bool is3D) {
+    // clamp particles inside screen border
+    af::array invalid_x = -2 * (pos[0] > width - 1 || pos[0] < 0) + 1;
+    af::array invalid_y = -2 * (pos[1] > height - 1 || pos[1] < 0) + 1;
+    // af::array invalid_x = (pos[0] < width-1 || pos[0] > 0);
+    // af::array invalid_y = (pos[1] < height-1 || pos[1] > 0);
+    vels[0] = invalid_x * vels[0];
+    vels[1] = invalid_y * vels[1];
+
+    af::array projected_px = min(width - 1, max(0, pos[0]));
+    af::array projected_py = min(height - 1, max(0, pos[1]));
+    pos[0]                 = projected_px;
+    pos[1]                 = projected_py;
+
+    if (is3D) {
+        af::array invalid_z    = -2 * (pos[2] > depth - 1 || pos[2] < 0) + 1;
+        vels[2]                = invalid_z * vels[2];
+        af::array projected_pz = min(depth - 1, max(0, pos[2]));
+        pos[2]                 = projected_pz;
+    }
+}
+
+int main(int, char **) {
+    try {
+        af::info();
+
+        af::Window myWindow(width, height,
+                            "Gravity Simulation using ArrayFire");
+        myWindow.setColorMap(AF_COLORMAP_HEAT);
+
+        int frame_count = 0;
+
+        // Initialize the kernel array just once
+        const af::array draw_kernel = gaussianKernel(7, 7);
+
+        const int dims = (is3D) ? 3 : 2;
+
+        vector<af::array> pos(dims);
+        vector<af::array> vels(dims);
+        vector<af::array> forces(dims);
+        af::array mass;
+
+        // Generate a random starting state
+        initial_conditions_galaxy(mass, pos, vels, forces);
+
+        af::array image = af::constant(0, width, height);
+        af::array ids(total_particles, u32);
+
+        af::timer timer = af::timer::start();
+        while (!myWindow.close()) {
+            float dt = af::timer::stop(timer);
+            timer    = af::timer::start();
+
+            af::array mid = mass(span) > (min_mass + mass_range / 3);
+            ids = (is3D) ? ids_from_3D(pos, 0, 0 + frame_count / 150.f, 0, mid)
+                         : ids_from_pos(pos);
+            // ids = (is3D)? ids_from_3D(pos, 0, 0, 0, mid) : ids_from_pos(pos);
+            // //uncomment for no 3d rotation
+            image(ids) += 4.f;
+
+            mid = mass(span) > (min_mass + 2 * mass_range / 3);
+            ids = (is3D) ? ids_from_3D(pos, 0, 0 + frame_count / 150.f, 0, mid)
+                         : ids_from_pos(pos);
+            // ids = (is3D)? ids_from_3D(pos, 0, 0, 0, mid) : ids_from_pos(pos);
+            // //uncomment for no 3d rotation
+            image(ids) += 4.f;
+
+            ids = (is3D) ? ids_from_3D(pos, 0, 0 + frame_count / 150.f, 0)
+                         : ids_from_pos(pos);
+            // ids = (is3D)? ids_from_3D(pos, 0, 0, 0) :  ids_from_pos(pos);
+            // //uncomment for no 3d rotation
+            image(ids) += 4.f;
+
+            image = convolve(image, draw_kernel);
+            myWindow.image(image);
+            image = af::constant(0, image.dims());
+
+            frame_count++;
+
+            // Generate a random starting state
+            if (frame_count % reset == 0) {
+                initial_conditions_galaxy(mass, pos, vels, forces);
+            }
+
+            // simulate
+            simulate(mass, pos, vels, forces, dt);
+
+            // check for collisions and adjust positions/velocities accordingly
+            collisions(pos, vels, is3D);
+        }
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/graphics/gravity_sim_init.h b/examples/graphics/gravity_sim_init.h
new file mode 100644
index 0000000000..9b1af92cfa
--- /dev/null
+++ b/examples/graphics/gravity_sim_init.h
@@ -0,0 +1,7005 @@
+const int HBD_NUM_ELEMENTS = 4000 * 7;
+// halo, bulge, and disk particles: 4000 particles, 7 floats each (mass first)
+float hbd[] = {
+    4.9161855e-03f,  -1.5334119e+00f, -8.3381424e+00f, 4.4288845e+00f,
+    -2.3778248e-01f, 4.2592272e-02f,  -4.4895774e-01f, 4.9161855e-03f,
+    1.9886702e-02f,  6.0085773e+00f,  3.1188631e-01f,  8.1422836e-01f,
+    -1.4591325e-02f, 7.5382882e-01f,  4.9161855e-03f,  1.1676190e+00f,
+    -4.6193779e-01f, -5.0477743e-01f, -1.4803666e+00f, 5.6056118e-01f,
+    -2.9858449e-02f, 4.9161855e-03f,  -1.4250363e+00f, 1.0891747e+01f,
+    2.5225203e+00f,  -6.5798134e-02f, -3.5946497e-01f, 1.7471495e-01f,
+    4.9161855e-03f,  -3.7135857e-01f, 4.8796633e-01f,  -3.7898597e-01f,
+    8.5347527e-01f,  2.2493289e-01f,  -2.7678892e-01f, 4.9161855e-03f,
+    2.2072470e+00f,  -2.5046587e+00f, 2.6029270e+00f,  3.0826443e-01f,
+    5.8606583e-01f,  2.0105042e-01f,  4.9161855e-03f,  1.0779227e+00f,
+    -4.0834007e+00f, -3.3965745e+00f, -4.8430148e-01f, -7.1573091e-01f,
+    1.2384786e-01f,  4.9161855e-03f,  -3.8722844e+00f, -4.2357988e+00f,
+    -1.9723746e+00f, 3.5759529e-01f,  4.8990592e-01f,  -4.3040028e-01f,
+    4.9161855e-03f,  -1.3005282e-01f, -2.3483203e-01f, 1.3832784e-01f,
+    1.3746375e+00f,  -1.2947829e+00f, 6.1215276e-01f,  4.9161855e-03f,
+    3.6822948e-01f,  4.2760900e-01f,  1.1544695e+00f,  -2.3177411e-02f,
+    -6.9136995e-01f, -6.6200425e-03f, 4.9161855e-03f,  -1.2485707e+00f,
+    2.0474775e-01f,  -2.1652168e-01f, 2.7034196e-01f,  1.6398503e+00f,
+    -7.8224945e-01f, 4.9161855e-03f,  -3.3862705e+00f, 1.2049110e+00f,
+    1.0672448e+00f,  -1.6531572e-01f, -2.4370559e-01f, 8.7125647e-01f,
+    4.9161855e-03f,  3.4262960e+00f,  3.9102471e+00f,  6.6162848e-01f,
+    7.8005123e-01f,  -1.0415094e-01f, 5.0161743e-01f,  4.9161855e-03f,
+    1.5740298e-01f,  1.3008093e+00f,  7.8130345e+00f,  -1.6444305e-01f,
+    3.3037327e-03f,  1.9713788e-01f,  4.9161855e-03f,  5.6700945e-01f,
+    1.8889900e-01f,  2.7523971e+00f,  -3.4313673e-01f, -6.4287108e-01f,
+    -1.8927544e-01f, 4.9161855e-03f,  1.8354661e+00f,  1.3209668e+00f,
+    1.6966065e+00f,  5.3318393e-01f,  3.4129089e-01f,  -8.0587679e-01f,
+    4.9161855e-03f,  -7.8488460e+00f, 3.2376931e+00f,  2.6638079e+00f,
+    3.4405673e-01f,  -2.1986680e-01f, 1.6776933e-01f,  4.9161855e-03f,
+    3.2422847e-01f,  -1.2311785e+00f, 9.0597588e-01f,  3.6714745e-01f,
+    -1.3913552e-01f, 9.0002306e-02f,  4.9161855e-03f,  -1.9477528e-01f,
+    -2.3987198e+00f, -4.2354431e+00f, -2.1188869e-01f, -6.4195746e-01f,
+    1.5219630e-01f,  4.9161855e-03f,  3.2330542e+00f,  1.1787817e+00f,
+    -1.3654234e+00f, 1.9920348e-01f,  -1.0560199e+00f, -4.0022919e-01f,
+    4.9161855e-03f,  -2.2656450e+00f, 2.3343153e+00f,  3.0343585e+00f,
+    1.3909769e-01f,  -5.8018422e-01f, 7.7305830e-01f,  4.9161855e-03f,
+    1.0106117e+01f,  8.4062157e+00f,  -5.3659506e+00f, -3.3819172e-01f,
+    -5.7871189e-02f, -5.2655820e-02f, 4.9161855e-03f,  -8.4759682e-02f,
+    -2.4386784e-01f, 2.2389056e-01f,  -8.3496273e-01f, 1.1504352e+00f,
+    3.2196254e-03f,  4.9161855e-03f,  -4.8354459e+00f, -1.1709679e+01f,
+    -4.4684467e+00f, -3.7076837e-01f, 2.6136923e-01f,  -1.4268482e-01f,
+    4.9161855e-03f,  -1.3268198e+00f, -2.3238692e+00f, 6.7897618e-01f,
+    3.0518329e-01f,  6.8463421e-01f,  -7.1791840e-01f, 4.9161855e-03f,
+    -5.2054877e+00f, 2.0948052e+00f,  1.9656231e+00f,  7.4416548e-01f,
+    4.4825464e-01f,  -3.2727838e-01f, 4.9161855e-03f,  -8.2616639e-01f,
+    1.0700088e+00f,  3.5586545e+00f,  4.8024514e-01f,  1.1944018e-01f,
+    3.0837712e-01f,  4.9161855e-03f,  -2.9101398e+00f, -3.6366568e+00f,
+    8.7982547e-01f,  3.6643305e-01f,  -3.8197124e-01f, -1.1440479e-01f,
+    4.9161855e-03f,  3.5198438e-01f,  4.9096385e-01f,  -6.6494130e-02f,
+    -1.0383745e-01f, 3.9406076e-01f,  7.3723292e-01f,  4.9161855e-03f,
+    -6.9214082e+00f, -5.5405111e+00f, -2.3041859e+00f, 3.3985880e-01f,
+    1.0167535e-02f,  1.0593475e-01f,  4.9161855e-03f,  1.0908546e+00f,
+    -5.3155913e+00f, -4.5045247e+00f, 1.8077201e-01f,  -4.4904891e-01f,
+    4.7391072e-01f,  4.9161855e-03f,  -1.0766581e-01f, 6.7338924e+00f,
+    6.1174130e+00f,  -2.3362583e-01f, 7.6430768e-02f,  -2.4832390e-01f,
+    4.9161855e-03f,  -4.9775305e-01f, 1.6378751e+00f,  -2.6263945e+00f,
+    -3.0084690e-01f, -5.1551086e-01f, -6.6373748e-01f, 4.9161855e-03f,
+    -3.8946674e+00f, -1.4725525e+00f, 2.4148097e+00f,  -1.7075756e-01f,
+    5.3592271e-01f,  7.2393781e-01f,  4.9161855e-03f,  6.8583161e-02f,
+    -1.5991354e+00f, -3.0150402e-01f, 1.5219669e-01f,  -5.6440836e-01f,
+    1.5284424e+00f,  4.9161855e-03f,  -4.2822695e+00f, 4.0367408e+00f,
+    -2.2387395e+00f, 1.0239060e-01f,  3.2810995e-01f,  -1.4511149e-01f,
+    4.9161855e-03f,  5.3348875e-01f,  -3.6950427e-01f, 1.0364149e+00f,
+    7.8612208e-02f,  -2.7073494e-01f, 1.9663854e-01f,  4.9161855e-03f,
+    -3.3353384e+00f, 4.3220544e+00f,  -1.5343003e+00f, 6.7457032e-01f,
+    -1.8098858e-01f, 7.6241505e-01f,  4.9161855e-03f,  -8.8430309e+00f,
+    6.6101489e+00f,  2.2365890e+00f,  -2.9622875e-03f, -5.7892501e-01f,
+    2.3848678e-01f,  4.9161855e-03f,  -2.7121809e+00f, -3.7584829e+00f,
+    2.4702384e+00f,  3.9350358e-01f,  -6.7748266e-01f, -5.7142133e-01f,
+    4.9161855e-03f,  1.7517463e+00f,  -5.2237463e-01f, 1.2052536e+00f,
+    2.6133826e-01f,  -4.3084338e-01f, -2.8758329e-01f, 4.9161855e-03f,
+    -4.4221100e-01f, 2.4987850e-01f,  -9.0834004e-01f, -1.6435069e+00f,
+    -3.5537782e-01f, -5.6679737e-02f, 4.9161855e-03f,  9.5630264e+00f,
+    7.2472978e-01f,  -2.7188256e+00f, 4.1388586e-01f,  -2.7986884e-01f,
+    9.9171564e-02f,  4.9161855e-03f,  -2.5304942e+00f, -1.9891304e-01f,
+    -1.3565568e+00f, 1.6445565e-01f,  6.5720814e-01f,  8.8133616e-04f,
+    4.9161855e-03f,  -6.8739529e+00f, 6.0871582e+00f,  4.0246663e+00f,
+    -1.1313155e-01f, 2.6078510e-01f,  1.1052500e-02f,  4.9161855e-03f,
+    1.8411478e-01f,  6.3666153e-01f,  -1.7665352e+00f, 7.3893017e-01f,
+    8.2843482e-02f,  1.3584135e-01f,  4.9161855e-03f,  1.2281631e-01f,
+    -4.8358020e-01f, -4.2862403e-01f, -1.4062686e+00f, 2.6675841e-01f,
+    -5.2812093e-01f, 4.9161855e-03f,  -1.8010849e+00f, 2.5018549e+00f,
+    -1.1007906e+00f, -3.0198583e-01f, -2.5083411e-01f, -9.4572407e-01f,
+    4.9161855e-03f,  2.9228494e-02f,  2.8824418e+00f,  -7.7373713e-01f,
+    -8.9457905e-01f, -3.9830649e-01f, -8.2690775e-01f, 4.9161855e-03f,
+    -4.8449464e+00f, -3.5136631e+00f, 2.6319263e+00f,  2.3270021e-01f,
+    6.2155128e-01f,  -6.9675374e-01f, 4.9161855e-03f,  -2.4690704e-01f,
+    -3.6131024e+00f, 5.7440319e+00f,  -5.6087500e-01f, -2.9587632e-01f,
+    -7.5861102e-01f, 4.9161855e-03f,  5.2307582e+00f,  2.1941881e+00f,
+    -4.2112174e+00f, 2.3945954e-01f,  2.5676125e-01f,  3.2575151e-01f,
+    4.9161855e-03f,  4.8397323e-01f,  3.7831066e+00f,  4.4692445e+00f,
+    2.4802294e-02f,  6.5026706e-01f,  -1.1542060e-02f, 4.9161855e-03f,
+    7.9952207e+00f,  4.5379916e-01f,  1.4309001e-01f,  -2.2018740e-01f,
+    -2.1911193e-01f, -4.8267773e-01f, 4.9161855e-03f,  -2.0976503e+00f,
+    -2.4728169e-01f, 6.3614302e+00f,  -7.4839890e-02f, -4.1690156e-01f,
+    -1.7862423e-01f, 4.9161855e-03f,  3.4107253e-01f,  -1.2668414e+00f,
+    1.2606201e+00f,  3.6496368e-01f,  -3.5874972e-01f, -1.0340087e+00f,
+    4.9161855e-03f,  8.9313567e-01f,  3.6050075e-01f,  3.4469640e-01f,
+    -8.6372048e-01f, -6.3587260e-01f, 7.4591488e-01f,  4.9161855e-03f,
+    2.9728930e+00f,  -5.2957177e+00f, -7.3298526e+00f, -1.9522749e-01f,
+    -2.2528295e-01f, 1.9373624e-01f,  4.9161855e-03f,  -1.7334032e+00f,
+    1.9857804e+00f,  -4.9017177e+00f, -6.8124956e-01f, 8.3835334e-01f,
+    -7.8357399e-02f, 4.9161855e-03f,  2.0978465e+00f,  1.9166039e+00f,
+    1.0677823e+00f,  -2.6128739e-01f, -9.3216664e-01f, 8.0752736e-01f,
+    4.9161855e-03f,  -2.6831132e-01f, 1.6412498e-01f,  -5.8062166e-01f,
+    -3.9843372e-01f, 1.5403072e+00f,  -2.5054911e-01f, 4.9161855e-03f,
+    1.7003990e+00f,  3.3006930e+00f,  -1.7119979e+00f, -1.0552487e-01f,
+    -8.4340447e-01f, 9.8853576e-01f,  4.9161855e-03f,  -5.5339479e+00f,
+    4.8888919e-01f,  9.1028652e+00f,  4.6380356e-01f,  -4.4314775e-01f,
+    3.4938701e-03f,  4.9161855e-03f,  -3.9364102e+00f, -3.4606054e+00f,
+    2.2803564e+00f,  1.2712850e-01f,  -3.2586256e-01f, -6.5546811e-02f,
+    4.9161855e-03f,  -6.6842210e-01f, -8.6578093e-02f, -9.9518037e-01f,
+    3.0050567e-01f,  -1.3251954e+00f, -6.3900441e-01f, 4.9161855e-03f,
+    -1.7707565e+00f, -2.3981299e+00f, -2.8610508e+00f, 8.0815405e-02f,
+    2.6192275e-01f,  -4.4141706e-02f, 4.9161855e-03f,  5.2352209e+00f,
+    4.3753624e+00f,  5.2761130e+00f,  -3.6126247e-01f, -3.6049706e-01f,
+    -5.0132203e-01f, 4.9161855e-03f,  4.0741138e+00f,  -2.7320893e+00f,
+    -5.8015996e-01f, -3.3409804e-01f, -7.4342436e-01f, -8.1080115e-01f,
+    4.9161855e-03f,  1.0308882e+01f,  3.3621982e-01f,  -1.2449891e+01f,
+    -2.8561455e-01f, -1.0982110e-01f, -1.0319072e-02f, 4.9161855e-03f,
+    8.3470430e+00f,  -9.4488649e+00f, -6.6161261e+00f, -2.6525149e-01f,
+    5.0971325e-02f,  5.4980908e-02f,  4.9161855e-03f,  -4.8979187e-01f,
+    -2.1835434e+00f, 1.3237199e+00f,  -2.0376731e-01f, -4.8289922e-01f,
+    -1.9313942e-01f, 4.9161855e-03f,  3.8070815e+00f,  -4.1728072e+00f,
+    6.8302398e+00f,  2.1417937e-01f,  -5.6412149e-02f, 9.7045694e-03f,
+    4.9161855e-03f,  -1.7183731e+00f, 1.7611129e+00f,  5.8284336e-01f,
+    1.2992284e-01f,  -1.3527862e+00f, -4.3186599e-01f, 4.9161855e-03f,
+    -1.1291479e+01f, -3.0248559e+00f, -6.1554856e+00f, -6.8934292e-02f,
+    -3.0177805e-01f, -1.8667488e-01f, 4.9161855e-03f,  -2.3688557e+00f,
+    7.7071247e+00f,  -2.0670973e-01f, -2.1208389e-01f, 2.8578773e-01f,
+    2.0644853e-01f,  4.9161855e-03f,  8.2679868e-01f,  -2.1197610e+00f,
+    1.0767980e+00f,  2.4679126e-01f,  -4.0421063e-01f, -5.7845503e-01f,
+    4.9161855e-03f,  4.1475649e+00f,  -4.3077379e-01f, 5.4239964e+00f,
+    7.0667878e-02f,  4.9151066e-01f,  -5.2980289e-02f, 4.9161855e-03f,
+    -7.7668630e-02f, -4.1514721e+00f, -8.0719125e-01f, -4.2308268e-01f,
+    -5.9619360e-03f, -5.4758888e-01f, 4.9161855e-03f,  7.3864212e+00f,
+    -7.1388471e-01f, 4.2682199e+00f,  8.6512074e-02f,  -3.9517093e-01f,
+    3.4532326e-01f,  4.9161855e-03f,  3.1821191e+00f,  5.0156546e+00f,
+    -7.2775478e+00f, 3.8633448e-01f,  4.1517708e-01f,  -4.7167987e-01f,
+    4.9161855e-03f,  -5.5158086e+00f, -1.8736273e+00f, 1.2083918e+00f,
+    -5.2377588e-01f, -5.1698190e-01f, -1.7996560e-01f, 4.9161855e-03f,
+    -7.5245118e-01f, -5.0066152e+00f, -3.6176472e+00f, -1.4140940e-01f,
+    4.9951354e-01f,  -5.1893300e-01f, 4.9161855e-03f,  1.7928425e+00f,
+    2.7725005e+00f,  -2.2401933e-02f, -8.6086380e-01f, -3.3671090e-01f,
+    8.4016019e-01f,  4.9161855e-03f,  5.5359507e+00f,  -1.0514329e+01f,
+    3.6608188e+00f,  -1.5433036e-01f, -7.8473240e-03f, 2.5746456e-01f,
+    4.9161855e-03f,  1.8312926e+00f,  -6.6526437e-01f, -1.4381752e+00f,
+    -1.5768304e-01f, 4.5808712e-01f,  4.9162623e-01f,  4.9161855e-03f,
+    5.4815245e+00f,  -3.7619928e-01f, 3.7529993e-01f,  -3.4403029e-01f,
+    -1.9848712e-02f, 3.1211856e-01f,  4.9161855e-03f,  -2.8452486e-01f,
+    1.0852966e+00f,  -7.1417332e-01f, 8.5701519e-01f,  -1.9785182e-01f,
+    7.2242868e-01f,  4.9161855e-03f,  1.6400850e+00f,  6.0924044e+00f,
+    -6.7533379e+00f, -1.4117804e-01f, -2.7584502e-01f, 1.8720052e-01f,
+    4.9161855e-03f,  5.8992994e-01f,  -1.4057723e+00f, 1.7555045e+00f,
+    3.0828384e-01f,  -1.7618947e-01f, 5.7791591e-01f,  4.9161855e-03f,
+    3.2523406e+00f,  6.4261597e-01f,  -3.2577946e+00f, 4.3461993e-03f,
+    1.6368487e-01f,  -2.7604485e-01f, 4.9161855e-03f,  -4.4885483e+00f,
+    2.9889661e-01f,  7.7495706e-01f,  8.4083831e-01f,  -6.1657476e-01f,
+    -2.8107607e-01f, 4.9161855e-03f,  -8.8879662e+00f, 6.2833142e-01f,
+    -1.1011785e+01f, 4.1822538e-01f,  1.0211676e-01f,  -3.1296456e-01f,
+    4.9161855e-03f,  2.7859297e+00f,  -3.9616172e+00f, -9.8269482e+00f,
+    1.1758713e-01f,  -3.9799199e-01f, 3.1546867e-01f,  4.9161855e-03f,
+    4.7954245e+00f,  -3.0205333e-01f, 2.0376158e+00f,  -8.4786171e-01f,
+    3.1084442e-01f,  -2.9132118e-02f, 4.9161855e-03f,  -2.5424831e+00f,
+    -2.2019272e+00f, 1.2129050e+00f,  -7.6038790e-01f, 1.3783433e-01f,
+    -2.2782549e-02f, 4.9161855e-03f,  -1.7519760e+00f, 4.8521647e-01f,
+    6.5459456e+00f,  2.1810593e-01f,  -1.0864632e-01f, -2.8022933e-01f,
+    4.9161855e-03f,  1.1203793e+01f,  3.8465612e+00f,  -7.5724998e+00f,
+    -3.2845536e-01f, -5.3839471e-02f, -8.3486214e-02f, 4.9161855e-03f,
+    -3.2320779e-02f, -3.1065380e-02f, 6.4219080e-02f,  -2.2246722e-02f,
+    5.6946766e-01f,  1.1582422e-01f,  4.9161855e-03f,  -9.3361330e-01f,
+    4.6081281e+00f,  -3.0114322e+00f, -6.3036418e-01f, -1.4130452e-01f,
+    -7.0592797e-01f, 4.9161855e-03f,  6.5746963e-01f,  -2.6720290e+00f,
+    1.4632640e+00f,  -7.3338515e-01f, -9.7944528e-01f, 1.1936308e-01f,
+    4.9161855e-03f,  -1.2494113e+01f, -1.0112607e+00f, -6.1200657e+00f,
+    -4.6759155e-01f, -1.0928699e-01f, 1.0739395e-02f,  4.9161855e-03f,
+    1.4548665e+00f,  -1.5041708e+00f, 4.7451344e+00f,  5.3424448e-01f,
+    -2.7125362e-01f, 1.3840736e-01f,  4.9161855e-03f,  9.2012796e+00f,
+    -4.8018866e+00f, -6.6422758e+00f, -2.6537961e-01f, 2.8879899e-01f,
+    -2.9193002e-01f, 4.9161855e-03f,  -3.7384963e+00f, 2.0661526e+00f,
+    7.5109011e-01f,  -4.0893826e-01f, 2.1268708e-01f,  -3.2584268e-01f,
+    4.9161855e-03f,  1.2519404e+00f,  7.4001670e+00f,  -4.9840989e+00f,
+    -2.6203468e-01f, -2.9252869e-01f, -1.5676203e-01f, 4.9161855e-03f,
+    1.8744209e+00f,  -2.2234895e+00f, 8.1060524e+00f,  -1.5346730e-01f,
+    -6.9368631e-01f, 2.6046190e-01f,  4.9161855e-03f,  -1.4101373e+00f,
+    1.0645522e+00f,  -5.6520933e-01f, 1.4722762e-01f,  1.4932915e+00f,
+    -1.1569133e-01f, 4.9161855e-03f,  1.4165136e+00f,  3.5563886e+00f,
+    1.1791783e-01f,  -3.3764324e-01f, -7.5716054e-01f, 3.2871431e-01f,
+    4.9161855e-03f,  1.6921350e+00f,  4.4273725e+00f,  -4.7639960e-01f,
+    -5.4349893e-01f, 3.2590839e-01f,  -8.8562638e-01f, 4.9161855e-03f,
+    4.6483329e-01f,  -3.4445742e-01f, 3.6641576e+00f,  -8.6311603e-01f,
+    9.2173032e-03f,  -5.7865018e-01f, 4.9161855e-03f,  -1.0085900e+00f,
+    5.9951057e+00f,  3.0975575e+00f,  -4.4059810e-01f, 3.6342105e-01f,
+    5.4747361e-01f,  4.9161855e-03f,  7.5191727e+00f,  9.0358219e+00f,
+    8.2151717e-01f,  1.8641087e-01f,  4.7217867e-01f,  1.1944959e-01f,
+    4.9161855e-03f,  3.6888385e+00f,  -6.8363433e+00f, -4.2592320e+00f,
+    6.2831676e-01f,  3.1490234e-01f,  7.2379701e-02f,  4.9161855e-03f,
+    3.7106318e+00f,  4.4007950e+00f,  5.8240423e+00f,  7.2762161e-02f,
+    -2.0129098e-01f, -9.5572621e-03f, 4.9161855e-03f,  5.2575201e-02f,
+    -2.1707346e+00f, -3.3260161e-01f, -1.0624429e+00f, -3.8043940e-01f,
+    3.2408518e-01f,  4.9161855e-03f,  -6.7410097e+00f, 8.0306721e+00f,
+    -3.7412791e+00f, -4.4359837e-02f, -5.9044231e-02f, -2.7669320e-01f,
+    4.9161855e-03f,  1.1246946e+00f,  -4.5388550e-01f, -1.5147063e+00f,
+    4.0764180e-01f,  -8.7051743e-01f, -7.1820456e-01f, 4.9161855e-03f,
+    -5.3811870e+00f, -9.9082918e+00f, -4.0152779e-01f, 4.5821959e-01f,
+    -3.2393888e-01f, -1.6364813e-01f, 4.9161855e-03f,  1.3526427e+01f,
+    2.1158383e+00f,  -1.0211465e+01f, 2.2708364e-03f,  9.2716143e-02f,
+    2.6722401e-01f,  4.9161855e-03f,  -2.8869894e+00f, 2.4247556e+00f,
+    -9.4357147e+00f, -1.6119269e-01f, -1.7889833e-01f, -3.1364015e-01f,
+    4.9161855e-03f,  -5.8600578e+00f, 3.2861009e+00f,  3.5497742e+00f,
+    -2.2058662e-02f, -2.8658876e-01f, -6.7721397e-01f, 4.9161855e-03f,
+    -3.9212027e-01f, -3.8397207e+00f, 1.0866520e+00f,  -7.5877708e-01f,
+    4.9582422e-02f,  -4.6942544e-01f, 4.9161855e-03f,  -2.1149487e+00f,
+    -2.9379406e+00f, 3.7844057e+00f,  7.0750105e-01f,  -1.1503395e-01f,
+    1.6959289e-01f,  4.9161855e-03f,  3.8032734e+00f,  3.1186311e+00f,
+    3.3438654e+00f,  3.1028602e-01f,  3.7098780e-01f,  -2.0284407e-01f,
+    4.9161855e-03f,  8.1918567e-02f,  6.2097090e-01f,  4.3812424e-01f,
+    2.5215754e-01f,  3.8848091e-02f,  -8.5251456e-01f, 4.9161855e-03f,
+    4.3727204e-01f,  -4.0447369e+00f, -2.8818288e-01f, -2.0940250e-01f,
+    -8.1814951e-01f, -2.3166551e-01f, 4.9161855e-03f,  -4.9010497e-01f,
+    -1.5526206e+00f, -1.0393566e-02f, -1.1288775e+00f, 1.1438488e+00f,
+    -6.5885745e-02f, 4.9161855e-03f,  -2.1520743e+00f, 6.3760573e-01f,
+    -1.0841924e+00f, -1.2611383e-01f, -9.7003585e-01f, -8.2231325e-01f,
+    4.9161855e-03f,  -1.6600587e+00f, -1.9615304e-01f, 2.0637505e+00f,
+    3.1294438e-01f,  -5.0747823e-02f, 1.3301117e+00f,  4.9161855e-03f,
+    4.8307452e+00f,  2.8194723e-01f,  4.1964173e+00f,  -5.5529791e-01f,
+    3.5737309e-01f,  2.1602839e-01f,  4.9161855e-03f,  4.0863609e+00f,
+    -3.9082122e+00f, 6.0392475e+00f,  -5.8578849e-01f, 3.4978375e-01f,
+    3.4507743e-01f,  4.9161855e-03f,  4.6417685e+00f,  1.1660880e+01f,
+    2.5419605e+00f,  -4.1093502e-02f, -2.1781944e-01f, 2.3564143e-01f,
+    4.9161855e-03f,  5.1196570e+00f,  -4.5010920e+00f, -4.6046415e-01f,
+    -4.9308911e-01f, 2.0530705e-01f,  8.7350450e-02f,  4.9161855e-03f,
+    1.1313407e-01f,  4.8161488e+00f,  2.0587443e-01f,  -7.4091542e-01f,
+    7.4024308e-01f,  -5.1334614e-01f, 4.9161855e-03f,  2.7357507e+00f,
+    -1.9728105e+00f, 1.7016443e+00f,  -7.1896374e-01f, 8.3583705e-03f,
+    -1.8032035e-01f, 4.9161855e-03f,  8.5056558e-02f,  5.3287292e-01f,
+    9.1567415e-01f,  -1.1781330e+00f, 6.0054462e-02f,  6.6040766e-01f,
+    4.9161855e-03f,  -1.2452773e+00f, 3.6445162e+00f,  1.2409434e+00f,
+    3.2620323e-01f,  -1.9191052e-01f, -2.7282682e-01f, 4.9161855e-03f,
+    1.9056360e+00f,  3.5149584e+00f,  -1.0531671e+00f, -3.3422467e-01f,
+    -7.6369601e-01f, -5.0413966e-01f, 4.9161855e-03f,  1.3558551e+00f,
+    1.4875576e-01f,  6.9291228e-01f,  1.3113679e-01f,  -4.2128254e-02f,
+    -4.7609597e-01f, 4.9161855e-03f,  4.8151522e+00f,  1.9904665e+00f,
+    5.7363062e+00f,  9.1349882e-01f,  3.2824841e-01f,  8.0876220e-03f,
+    4.9161855e-03f,  6.5276303e+00f,  -2.5734696e+00f, -7.3017540e+00f,
+    1.6771398e-01f,  -1.6040705e-01f, 2.8028521e-01f,  4.9161855e-03f,
+    -4.9316432e-02f, 4.2286095e-01f,  -1.6050607e-01f, -1.6140953e-02f,
+    4.6242326e-01f,  1.5989579e+00f,  4.9161855e-03f,  -1.2718679e+01f,
+    -2.1632120e-02f, 2.7086315e+00f,  -4.4350330e-02f, 3.8374102e-01f,
+    3.5671154e-01f,  4.9161855e-03f,  1.4095187e+00f,  2.7944331e+00f,
+    -3.1381302e+00f, 6.6803381e-02f,  1.4252694e-01f,  -4.5197245e-01f,
+    4.9161855e-03f,  -4.3704524e+00f, 3.7166533e+00f,  -3.3841777e+00f,
+    1.6926841e-01f,  -2.2037603e-01f, -9.2970982e-02f, 4.9161855e-03f,
+    -3.4041522e+00f, 6.1920571e+00f,  6.1770749e+00f,  1.7624885e-01f,
+    2.3482014e-01f,  2.1265095e-02f,  4.9161855e-03f,  1.8683885e+00f,
+    2.9745255e+00f,  1.5871049e+00f,  9.7957826e-01f,  4.1725907e-01f,
+    2.7069089e-01f,  4.9161855e-03f,  3.2698989e+00f,  2.7192965e-01f,
+    -2.4263704e+00f, -6.2083137e-01f, -9.6088186e-02f, 3.1606305e-01f,
+    4.9161855e-03f,  2.9325829e+00f,  3.7225180e+00f,  1.5989654e+01f,
+    -5.9474718e-02f, -1.6357067e-01f, 2.4941908e-01f,  4.9161855e-03f,
+    -1.8487132e+00f, 1.7842275e-01f,  -2.6162112e+00f, 5.5724651e-01f,
+    1.6877288e-01f,  3.1606191e-01f,  4.9161855e-03f,  2.4827642e+00f,
+    1.3335655e+00f,  2.3972323e+00f,  -8.3342028e-01f, 4.9502304e-01f,
+    -1.8774435e-01f, 4.9161855e-03f,  -2.9442611e+00f, -1.5145620e+00f,
+    -1.0184349e+00f, 4.0914584e-02f,  6.1210513e-01f,  -8.8316077e-01f,
+    4.9161855e-03f,  4.1723294e+00f,  1.5920197e+00f,  1.0446097e+01f,
+    -3.4241676e-01f, -6.3489765e-02f, 1.3304074e-01f,  4.9161855e-03f,
+    1.5766021e+00f,  -7.6417365e+00f, 2.0848337e-01f,  -5.7905573e-01f,
+    4.0479490e-01f,  3.8954058e-01f,  4.9161855e-03f,  6.6417539e-01f,
+    6.1158419e-01f,  -5.0875813e-01f, -3.4595522e-01f, -7.4610633e-01f,
+    1.0812931e+00f,  4.9161855e-03f,  7.9958606e-01f,  3.8196829e-01f,
+    7.1277108e+00f,  -7.5384903e-01f, -1.0171402e-02f, 4.4570059e-01f,
+    4.9161855e-03f,  6.0540199e-02f,  -2.6677737e+00f, 1.8429880e-01f,
+    -8.5555512e-01f, 1.3299481e+00f,  -2.0235173e-01f, 4.9161855e-03f,
+    3.9919739e+00f,  -6.1402979e+00f, -2.2712085e+00f, 4.4366006e-02f,
+    -5.3994328e-01f, -5.2013063e-01f, 4.9161855e-03f,  1.2852119e+00f,
+    -5.1181007e-02f, 3.3027627e+00f,  -6.0097035e-03f, -6.6818082e-01f,
+    -1.0660943e+00f, 4.9161855e-03f,  3.1523392e+00f,  -9.0578318e-01f,
+    -1.6923687e+00f, -1.0864950e+00f, 3.1622055e-01f,  -7.6376736e-02f,
+    4.9161855e-03f,  7.4215269e-01f,  1.5873559e+00f,  -9.5407754e-01f,
+    7.5115144e-01f,  5.8517551e-01f,  1.8402222e-01f,  4.9161855e-03f,
+    1.3492858e+00f,  -6.8291659e+00f, -2.2102982e-01f, -7.7220458e-01f,
+    4.2033842e-01f,  -3.0141455e-01f, 4.9161855e-03f,  -4.3350059e-01f,
+    6.2212191e+00f,  -5.0225635e+00f, 3.7565130e-01f,  -3.3066887e-01f,
+    2.3742668e-01f,  4.9161855e-03f,  6.7826700e-01f,  1.8297392e+00f,
+    2.9780185e+00f,  -9.9050844e-01f, 1.5749370e-01f,  -4.7297102e-01f,
+    4.9161855e-03f,  2.7861264e-01f,  -6.3822955e-01f, -2.5232068e-01f,
+    1.0543227e-01f,  9.1327286e-01f,  1.7127641e-01f,  4.9161855e-03f,
+    -3.6165969e+00f, -4.4523582e+00f, -1.2699959e-01f, -2.9875079e-01f,
+    4.2230520e-01f,  1.6758612e-01f,  4.9161855e-03f,  -5.9345689e+00f,
+    -5.6375158e-01f, 2.8784866e+00f,  -1.1773017e-01f, -7.9442525e-01f,
+    -4.2923176e-01f, 4.9161855e-03f,  -4.5961580e+00f, 8.1358643e+00f,
+    1.3778535e+00f,  7.0015645e-01f,  -9.0196915e-03f, -2.8111514e-01f,
+    4.9161855e-03f,  1.3879143e+00f,  -7.0066613e-01f, -7.9476064e-01f,
+    -4.1934487e-01f, 9.3593562e-01f,  3.5931492e-01f,  4.9161855e-03f,
+    3.5791755e+00f,  8.4959614e-01f,  2.4947805e+00f,  3.3687270e-01f,
+    -2.1417584e-01f, 3.0292150e-01f,  4.9161855e-03f,  -3.7517645e+00f,
+    -2.6368710e-01f, -5.0094962e+00f, -1.8823624e-01f, 7.3051924e-01f,
+    2.1860786e-02f,  4.9161855e-03f,  -2.6936531e-01f, -2.0526983e-01f,
+    6.5954632e-01f,  7.6233715e-02f,  -1.2407604e+00f, -4.5338404e-01f,
+    4.9161855e-03f,  -4.1817716e-01f, 1.0786925e-01f,  3.2741669e-01f,
+    5.4251856e-01f,  1.3131720e+00f,  -3.1557430e-03f, 4.9161855e-03f,
+    2.9697366e+00f,  1.0332178e+00f,  -1.7329675e+00f, -1.0114059e+00f,
+    -4.8704460e-01f, -9.3279220e-02f, 4.9161855e-03f,  -6.6830988e+00f,
+    2.1857018e+00f,  -1.2270736e+00f, -3.7255654e-01f, -2.7769122e-02f,
+    3.4415185e-01f,  4.9161855e-03f,  1.0832707e+00f,  -2.4050269e+00f,
+    2.2816985e+00f,  7.7116030e-01f,  2.4420033e-01f,  -9.3734545e-01f,
+    4.9161855e-03f,  3.3026309e+00f,  1.7810617e-01f,  -2.1904149e+00f,
+    -6.9325995e-01f, 8.8455275e-02f,  3.2489097e-01f,  4.9161855e-03f,
+    2.3270497e+00f,  8.3747327e-01f,  3.5323045e-01f,  1.1793818e-01f,
+    5.4966879e-01f,  -8.1208754e-01f, 4.9161855e-03f,  1.5131900e+00f,
+    -1.5149459e-02f, -5.3584701e-01f, 1.4530161e-02f,  -2.9182155e-02f,
+    7.9910409e-01f,  4.9161855e-03f,  -2.3442965e+00f, -1.3287088e+00f,
+    4.3543211e-01f,  7.9374611e-01f,  -3.0103785e-01f, -9.5739615e-01f,
+    4.9161855e-03f,  -2.3381724e+00f, 8.0385667e-01f,  -8.2279320e+00f,
+    -5.3750402e-01f, 1.4501467e-01f,  1.2893280e-02f,  4.9161855e-03f,
+    4.1073112e+00f,  -3.4530356e+00f, 5.6881213e+00f,  4.1808629e-01f,
+    5.5509534e-02f,  -2.6360124e-01f, 4.9161855e-03f,  1.8762091e+00f,
+    -1.6527932e+00f, -9.3679339e-01f, 3.1534767e-01f,  -1.3423176e-01f,
+    -9.0115553e-01f, 4.9161855e-03f,  1.1706166e+00f,  8.0902272e-01f,
+    1.9191325e+00f,  6.1738718e-01f,  -7.8812784e-01f, -4.3176544e-01f,
+    4.9161855e-03f,  -6.9623942e+00f, 7.8894806e+00f,  2.0476704e+00f,
+    5.1036930e-01f,  4.7420147e-01f,  1.5404034e-01f,  4.9161855e-03f,
+    2.6558321e+00f,  3.9173145e+00f,  -4.8773055e+00f, 5.7064819e-01f,
+    -4.0699664e-01f, -4.5462996e-01f, 4.9161855e-03f,  -8.6401331e-01f,
+    1.3935235e-01f,  4.2587665e-01f,  -7.7478617e-02f, 1.6932582e+00f,
+    -1.2154281e+00f, 4.9161855e-03f,  -2.8499889e+00f, 8.6289811e-01f,
+    -2.2494588e+00f, 6.9739962e-01f,  5.3504556e-01f,  -2.9233766e-01f,
+    4.9161855e-03f,  8.7056971e-01f,  8.0734167e+00f,  -5.2569685e+00f,
+    -1.2045987e-01f, 5.9915550e-02f,  -2.5871423e-01f, 4.9161855e-03f,
+    -7.6902652e-01f, 4.9359465e+00f,  2.0405600e+00f,  6.6449463e-01f,
+    5.9997362e-01f,  -8.0591239e-02f, 4.9161855e-03f,  -6.1418343e-01f,
+    2.2238147e-01f,  1.9433361e+00f,  3.8223696e-01f,  1.6134988e-01f,
+    6.6222048e-01f,  4.9161855e-03f,  2.3634105e+00f,  -5.2483654e+00f,
+    -4.9841018e+00f, 2.2005677e-02f,  1.3641465e-01f,  7.6506054e-01f,
+    4.9161855e-03f,  6.8980312e-01f,  -3.7020442e+00f, 6.5552109e-01f,
+    -8.6253577e-01f, -2.1161395e-01f, -5.1099682e-01f, 4.9161855e-03f,
+    -9.0719271e-01f, 1.0400220e+00f,  -9.2072707e-01f, -2.6235368e-02f,
+    -1.5415086e+00f, -8.5675663e-01f, 4.9161855e-03f,  -2.0826190e+00f,
+    -1.0853169e+00f, 2.7213802e+00f,  -7.2631556e-01f, -2.2817095e-01f,
+    4.3584740e-01f,  4.9161855e-03f,  -1.6827782e+01f, -2.9605379e+00f,
+    -1.0047872e+01f, 2.6563797e-02f,  1.5370090e-01f,  -4.7696620e-02f,
+    4.9161855e-03f,  -9.2662311e-01f, -5.6182045e-01f, -1.2381338e-01f,
+    -7.7099133e-01f, -2.2433902e-01f, -2.7151868e-01f, 4.9161855e-03f,
+    3.8625498e+00f,  6.2779222e+00f,  1.7248056e+00f,  5.4683471e-01f,
+    3.1747159e-01f,  2.0465960e-01f,  4.9161855e-03f,  -5.2857494e-01f,
+    4.9168107e-01f,  7.0973392e+00f,  -2.2720265e-01f, -2.7799189e-01f,
+    -5.4959249e-01f, 4.9161855e-03f,  -8.8942690e+00f, 8.5861343e-01f,
+    1.7127624e+00f,  3.6901340e-02f,  1.2481604e-02f,  8.0296421e-01f,
+    4.9161855e-03f,  4.0336819e+00f,  5.8094540e+00f,  4.5305710e+00f,
+    2.8685197e-01f,  -5.8316555e-02f, -6.0864025e-01f, 4.9161855e-03f,
+    -2.4482727e+00f, -1.9019347e+00f, 1.7246116e+00f,  -7.1854728e-01f,
+    -1.1512666e+00f, -2.1945371e-01f, 4.9161855e-03f,  -9.9501288e-01f,
+    -4.2160991e-01f, -4.5714632e-01f, -7.1073520e-01f, 4.8275924e-01f,
+    -3.2529598e-01f, 4.9161855e-03f,  -1.5558394e+00f, 1.5529529e+00f,
+    2.2523422e+00f,  -8.4167308e-01f, -1.3368995e-01f, -1.6983755e-01f,
+    4.9161855e-03f,  5.5405390e-01f,  1.8711295e+00f,  -1.2510152e+00f,
+    -4.7915465e-01f, 1.0674027e+00f,  2.8612742e-01f,  4.9161855e-03f,
+    1.3904979e+00f,  1.1284027e+00f,  -1.6685362e+00f, 1.6082658e-01f,
+    -5.2100271e-01f, 5.1975566e-01f,  4.9161855e-03f,  2.6165011e+00f,
+    -5.0194263e-01f, 2.1846955e+00f,  -2.3559105e-01f, -2.3662653e-02f,
+    7.4845886e-01f,  4.9161855e-03f,  -5.4110746e+00f, -6.4436674e+00f,
+    1.4341636e+00f,  -5.0812584e-01f, 7.0323184e-02f,  3.9377066e-01f,
+    4.9161855e-03f,  -4.3721943e+00f, -4.8243036e+00f, -3.8223925e+00f,
+    7.9724538e-01f,  2.8923592e-01f,  -5.5999923e-02f, 4.9161855e-03f,
+    -1.7739439e+00f, -5.8599277e+00f, -5.6433570e-01f, -6.5808952e-01f,
+    2.0367002e-01f,  -7.9294957e-02f, 4.9161855e-03f,  -2.2564106e+00f,
+    2.0470109e+00f,  6.9972581e-01f,  6.6688859e-01f,  6.0902584e-01f,
+    6.3632256e-01f,  4.9161855e-03f,  3.6698052e-01f,  -4.3352251e+00f,
+    -5.9899611e+00f, 4.0369263e-01f,  2.6295286e-01f,  4.2630222e-01f,
+    4.9161855e-03f,  -1.4735569e+00f, 1.1467457e+00f,  -1.8791540e-01f,
+    6.3940281e-01f,  -5.8715850e-01f, 9.0234226e-01f,  4.9161855e-03f,
+    -1.5421475e+00f, 7.8114897e-01f,  4.8983026e-01f,  -4.7342235e-01f,
+    -2.4398072e-01f, 4.9046123e-01f,  4.9161855e-03f,  9.7783589e-01f,
+    -2.8461471e+00f, 3.5030347e-01f,  -4.4139645e-01f, 2.0448433e-01f,
+    1.0468356e-01f,  4.9161855e-03f,  -4.0129914e+00f, 1.9731904e+00f,
+    -1.6546636e+00f, 2.2512060e-02f,  1.4075196e-01f,  8.5166425e-01f,
+    4.9161855e-03f,  -1.7307792e+00f, -1.0478389e+00f, -8.8721651e-01f,
+    3.8117144e-02f,  -1.2626181e+00f, 7.4923879e-01f,  4.9161855e-03f,
+    -4.3903942e+00f, -9.8925960e-01f, 6.1441336e+00f,  -2.9261913e-02f,
+    -3.8877898e-01f, 6.0653800e-01f,  4.9161855e-03f,  1.9854151e+00f,
+    1.5335454e+00f,  -7.1224504e+00f, 1.2410113e-01f,  -6.4020097e-01f,
+    4.3765905e-01f,  4.9161855e-03f,  -2.3035769e-01f, 3.1040353e-01f,
+    -5.3409922e-01f, -1.1151735e+00f, -6.5187573e-01f, -1.4604175e+00f,
+    4.9161855e-03f,  6.6836309e-01f,  -1.1001868e+00f, -1.4494388e+00f,
+    -4.9145856e-01f, -9.9138743e-01f, -1.5402541e-02f, 4.9161855e-03f,
+    -3.6307559e+00f, 1.1479833e+00f,  8.0834293e+00f,  -5.0276536e-01f,
+    2.8816018e-01f,  -1.1084123e-01f, 4.9161855e-03f,  8.5108602e-01f,
+    3.4960878e-01f,  -3.7021643e-01f, 9.6607900e-01f,  7.5475499e-04f,
+    1.8197434e-02f,  4.9161855e-03f,  3.9257536e+00f,  1.0273324e+01f,
+    1.3603307e+00f,  -8.6920604e-02f, 2.4439566e-01f,  5.2786553e-01f,
+    4.9161855e-03f,  3.2979140e+00f,  -9.7059011e-01f, 3.9852014e+00f,
+    -3.6814031e-01f, -6.3033557e-01f, -3.0275184e-01f, 4.9161855e-03f,
+    -1.9637458e+00f, -3.7986367e+00f, 1.8776725e-01f,  -7.3836422e-01f,
+    -7.3102927e-01f, -3.2329816e-02f, 4.9161855e-03f,  1.1989680e-01f,
+    1.8742895e-01f,  -2.9862130e-01f, -6.9648969e-01f, -1.3914220e-01f,
+    8.6901551e-01f,  4.9161855e-03f,  4.4827180e+00f,  -6.3484206e+00f,
+    -1.0996312e+01f, 1.1085771e-01f,  2.8751048e-01f,  -3.1339028e-01f,
+    4.9161855e-03f,  -8.4107071e-02f, -1.2915938e+00f, -1.5298724e+00f,
+    1.7467059e-02f,  1.7537315e-01f,  -9.2487389e-01f, 4.9161855e-03f,
+    -1.7147981e+00f, 2.5744505e+00f,  9.4229102e-01f,  -2.0581135e-01f,
+    1.7269771e-01f,  -1.8089809e-02f, 4.9161855e-03f,  7.7855635e-01f,
+    3.9012763e-01f,  -2.2284987e+00f, -6.1369395e-01f, 2.1370943e-01f,
+    -1.0267475e+00f, 4.9161855e-03f,  8.9311361e+00f,  5.5741658e+00f,
+    7.3865414e+00f,  -1.1716497e-01f, -2.5958773e-01f, -1.6851740e-01f,
+    4.9161855e-03f,  5.5872452e-01f,  -5.5642301e-01f, -4.1004235e-01f,
+    -5.3327596e-01f, -3.3521464e-01f, 1.8098779e-01f,  4.9161855e-03f,
+    -5.7718742e-01f, 1.0537529e+01f,  -1.4418954e+00f, 1.3293984e-02f,
+    2.3253456e-01f,  -6.4981383e-01f, 4.9161855e-03f,  2.3259537e+00f,
+    -4.8474255e+00f, -3.8202603e+00f, 5.5202281e-01f,  6.6536266e-01f,
+    -2.7609745e-01f, 4.9161855e-03f,  -3.7997112e-02f, 1.9381075e+00f,
+    -2.5785954e+00f, 6.8127191e-01f,  -1.7897372e-01f, -8.1235218e-01f,
+    4.9161855e-03f,  -3.8103649e-01f, -6.5680504e-01f, 1.5427786e+00f,
+    -9.5525837e-01f, -3.1719565e-01f, 1.1927687e-01f,  4.9161855e-03f,
+    1.4715660e+00f,  -2.0378935e+00f, 1.1417512e+01f,  -1.9282946e-01f,
+    4.2619136e-01f,  -3.1886920e-01f, 4.9161855e-03f,  -1.2326461e+01f,
+    7.1164246e+00f,  -5.4399915e+00f, -1.6626815e-01f, 2.7605408e-01f,
+    -2.2947796e-01f, 4.9161855e-03f,  -1.5963143e+00f, 2.1413229e+00f,
+    -5.2012887e+00f, -9.3113273e-02f, -9.0160382e-01f, -3.2290292e-01f,
+    4.9161855e-03f,  -2.2547686e+00f, -2.1109045e+00f, 9.4487530e-01f,
+    1.2221540e+00f,  -5.8051199e-01f, 1.6429856e-01f,  4.9161855e-03f,
+    6.1478698e-01f,  -3.5675838e+00f, 2.6373148e+00f,  4.3251249e-01f,
+    -8.5788590e-01f, 5.7104155e-02f,  4.9161855e-03f,  -1.3495188e+00f,
+    8.3444464e-01f,  2.6639289e-01f,  5.3358626e-01f,  3.7881872e-01f,
+    9.0911025e-01f,  4.9161855e-03f,  2.5030458e+00f,  -5.6965089e-01f,
+    -2.3113575e+00f, 1.3439518e-01f,  -7.3302060e-01f, 7.5076187e-01f,
+    4.9161855e-03f,  -2.5559316e+00f, -8.9279480e+00f, -1.2572399e+00f,
+    -3.7291369e-01f, -4.4078836e-01f, -2.5859511e-01f, 4.9161855e-03f,
+    1.3601892e+00f,  2.5021265e+00f,  1.5640872e+00f,  -3.1240162e-02f,
+    9.6691996e-01f,  8.3088553e-01f,  4.9161855e-03f,  -2.5284555e+00f,
+    8.0730313e-01f,  -3.3774159e+00f, 6.7637634e-01f,  3.3326253e-01f,
+    -9.2735279e-01f, 4.9161855e-03f,  3.7032542e-01f,  -2.4868140e+00f,
+    -1.1112474e+00f, -9.5413953e-01f, -8.0205697e-01f, 6.7512685e-01f,
+    4.9161855e-03f,  -8.2023449e+00f, -3.6179368e+00f, -6.7208133e+00f,
+    4.1372880e-01f,  -5.2742619e-02f, 2.5393400e-01f,  4.9161855e-03f,
+    -6.7738466e+00f, 1.0515899e+01f,  4.2430286e+00f,  -1.1593546e-01f,
+    9.0816170e-02f,  4.7477886e-01f,  4.9161855e-03f,  3.9372973e+00f,
+    7.1310897e+00f,  -6.9858866e+00f, -3.6591515e-02f, -1.5123883e-01f,
+    3.6657345e-01f,  4.9161855e-03f,  1.0386430e+00f,  2.2649708e+00f,
+    9.1387175e-02f,  -2.3626551e-01f, -1.0093622e+00f, -3.8372061e-01f,
+    4.9161855e-03f,  9.5332122e-01f,  -2.3051651e+00f, 2.4670262e+00f,
+    -6.2529281e-02f, 8.3028495e-02f,  6.9906914e-01f,  4.9161855e-03f,
+    -1.3563960e+00f, 2.5031478e+00f,  -6.2883940e+00f, 1.7311640e-01f,
+    4.9507636e-01f,  2.9234192e-01f,  4.9161855e-03f,  -2.9803047e+00f,
+    1.2159318e+00f,  4.8416948e+00f,  2.8369582e-01f,  -5.6748096e-02f,
+    3.1981486e-01f,  4.9161855e-03f,  6.5630555e-01f,  2.2934692e+00f,
+    2.7370293e+00f,  -7.9501927e-01f, -6.8942112e-01f, -1.6282633e-01f,
+    4.9161855e-03f,  2.3649284e-01f,  4.4992870e-01f,  7.8668839e-01f,
+    -1.2076259e+00f, 4.7268322e-01f,  1.2055985e-01f,  4.9161855e-03f,
+    -3.9686160e+00f, -1.8684902e+00f, 4.2091322e+00f,  4.5759417e-03f,
+    -6.6025454e-01f, 3.0627838e-01f,  4.9161855e-03f,  4.6912169e+00f,
+    1.3108907e+00f,  1.6523095e+00f,  7.4617028e-02f,  -1.5275851e-01f,
+    -1.0304534e+00f, 4.9161855e-03f,  1.6227750e+00f,  -2.9257073e+00f,
+    -2.0109935e+00f, 5.6260967e-01f,  7.3484081e-01f,  -3.3534378e-01f,
+    4.9161855e-03f,  3.2824643e+00f,  1.7195469e+00f,  2.4556370e+00f,
+    -4.3755153e-01f, 3.8373569e-01f,  3.5499743e-01f,  4.9161855e-03f,
+    2.9962518e+00f,  2.1721799e+00f,  1.7336558e+00f,  3.1145018e-01f,
+    7.9644367e-02f,  -1.3956204e-01f, 4.9161855e-03f,  -2.9588618e+00f,
+    4.6151480e-01f,  -4.8934903e+00f, 8.6376870e-01f,  3.8755390e-01f,
+    5.4533780e-01f,  4.9161855e-03f,  8.0634928e-01f,  -4.7410351e-01f,
+    -2.8205675e-01f, 2.6197723e-01f,  1.1508983e+00f,  -5.8419865e-01f,
+    4.9161855e-03f,  1.3148562e+00f,  -2.1508453e+00f, 1.9594790e-01f,
+    5.1325864e-01f,  2.5508407e-01f,  8.2936794e-01f,  4.9161855e-03f,
+    -9.4635022e-01f, -1.5219972e+00f, 1.3732563e+00f,  1.8658447e-01f,
+    -5.0763839e-01f, 6.8416429e-01f,  4.9161855e-03f,  1.9665076e+00f,
+    -1.4183496e+00f, -9.9830639e-01f, 5.1939923e-01f,  5.7319009e-01f,
+    7.6324838e-01f,  4.9161855e-03f,  1.5808804e+00f,  -1.8976219e+00f,
+    8.7504091e+00f,  5.9602886e-01f,  7.5436220e-02f,  1.2904499e-01f,
+    4.9161855e-03f,  1.1003045e+00f,  1.5032083e+00f,  -1.4726260e-01f,
+    5.1224291e-01f,  -7.2072625e-01f, 1.2975526e-01f,  4.9161855e-03f,
+    5.2798715e+00f,  2.5695405e+00f,  3.1592795e-01f,  -7.5408041e-01f,
+    -7.4214637e-02f, -2.8957549e-01f, 4.9161855e-03f,  1.9984113e+00f,
+    1.7264737e-01f,  -1.2801701e+00f, 1.2017699e-01f,  1.2994696e-01f,
+    4.8225260e-01f,  4.9161855e-03f,  4.3436646e+00f,  2.5010517e+00f,
+    -5.0417509e+00f, -6.9469649e-01f, 9.0198889e-02f,  -1.6560705e-01f,
+    4.9161855e-03f,  3.1434805e+00f,  1.2980199e-01f,  1.6128474e+00f,
+    -5.6128830e-01f, -1.0250444e+00f, -3.8510275e-01f, 4.9161855e-03f,
+    2.8277862e-01f,  -2.8451059e+00f, 2.5292377e+00f,  7.6253235e-01f,
+    -1.7996164e-01f, 2.6946926e-01f,  4.9161855e-03f,  3.5885043e+00f,
+    4.0399914e+00f,  -1.3001188e+00f, 7.9189874e-03f,  7.6869708e-01f,
+    1.8452343e-01f,  4.9161855e-03f,  -3.6406140e+00f, -4.4173899e+00f,
+    2.3816900e+00f,  2.3459703e-01f,  -9.6344292e-01f, -1.5342139e-02f,
+    4.9161855e-03f,  5.3718510e+00f,  -1.7088416e+00f, -1.8807746e+00f,
+    -6.1651420e-02f, -6.9086784e-01f, 6.8573050e-02f,  4.9161855e-03f,
+    3.6558161e+00f,  -3.8063710e+00f, -3.0513796e-01f, -8.4415787e-01f,
+    3.4599161e-01f,  -5.5742852e-02f, 4.9161855e-03f,  5.9426804e+00f,
+    4.7330937e+00f,  7.3694414e-01f,  1.8919133e-01f,  4.8421431e-02f,
+    3.0752826e-01f,  4.9161855e-03f,  -1.1473065e-01f, 1.1929753e+00f,
+    -1.4199167e+00f, -7.4282992e-01f, -3.7387276e-01f, 4.0093365e-01f,
+    4.9161855e-03f,  1.8835774e-01f,  5.2445376e-01f,  -1.3755062e+00f,
+    -2.4628344e-01f, -6.3110536e-01f, 5.1000971e-01f,  4.9161855e-03f,
+    2.5405736e+00f,  -6.9903188e+00f, 9.3919051e-01f,  3.3130026e-01f,
+    1.8456288e-01f,  -8.3665240e-01f, 4.9161855e-03f,  5.6979461e+00f,
+    1.0634099e+00f,  5.0504303e+00f,  4.8742417e-01f,  -3.4125265e-01f,
+    -4.8883250e-01f, 4.9161855e-03f,  1.5545113e+00f,  3.1638365e+00f,
+    -1.4146330e+00f, 6.3059294e-01f,  2.2755766e-01f,  -8.6821437e-01f,
+    4.9161855e-03f,  9.4219780e-01f,  -3.0427148e+00f, 1.5069616e+01f,
+    -1.8126942e-01f, -2.8703877e-01f, -1.7763026e-01f, 4.9161855e-03f,
+    5.6406796e-01f,  9.8250061e-02f,  -1.6685426e+00f, -2.5693396e-01f,
+    -5.1183546e-01f, 1.1809591e+00f,  4.9161855e-03f,  4.1753957e-01f,
+    -7.4913788e-01f, -1.5843335e+00f, 1.1937810e+00f,  9.2524104e-03f,
+    5.0497741e-01f,  4.9161855e-03f,  1.4821501e+00f,  2.5209305e+00f,
+    -4.6038327e-01f, 7.6814204e-01f,  -7.3164687e-02f, 3.8332766e-01f,
+    4.9161855e-03f,  -5.6680064e+00f, -1.2447957e+01f, 3.7274573e+00f,
+    -1.2730822e-01f, -1.4861411e-01f, 3.6204612e-01f,  4.9161855e-03f,
+    -2.9226646e+00f, 3.2349854e+00f,  -7.5004943e-02f, 1.0707484e-01f,
+    1.2512811e-02f,  -1.0659227e+00f, 4.9161855e-03f,  -3.4468117e+00f,
+    -2.8624514e-01f, 8.8619429e-01f,  -1.7801450e-01f, -2.1748085e-02f,
+    4.1115180e-01f,  4.9161855e-03f,  1.6176590e+00f,  -2.1753321e+00f,
+    3.1298079e+00f,  7.2549015e-01f,  5.9325063e-01f,  1.4891429e-01f,
+    4.9161855e-03f,  -3.6799617e+00f, -3.9531178e+00f, -2.5695114e+00f,
+    -4.8447725e-01f, -3.9212063e-01f, 6.3521582e-01f,  4.9161855e-03f,
+    -2.8431458e+00f, 2.2023947e+00f,  7.7971797e+00f,  3.6939001e-01f,
+    -5.9056293e-02f, -2.8710604e-01f, 4.9161855e-03f,  -2.7290611e+00f,
+    -2.2683835e+00f, 1.3177802e+01f,  3.4860381e-01f,  1.9552551e-01f,
+    -3.8295232e-02f, 4.9161855e-03f,  -7.3016357e-01f, 2.6567767e+00f,
+    3.4571521e+00f,  -1.9641110e-01f, 7.5739235e-01f,  -6.1690923e-02f,
+    4.9161855e-03f,  4.2920651e+00f,  3.2999296e+00f,  -9.5379755e-02f,
+    -2.5943008e-01f, -8.7894499e-02f, 1.4806598e-01f,  4.9161855e-03f,
+    8.2875853e+00f,  -2.2597928e+00f, 7.8488052e-01f,  -1.0633945e-01f,
+    3.8035643e-01f,  4.2811239e-01f,  4.9161855e-03f,  9.6977365e-01f,
+    4.5958829e+00f,  -1.4316144e+00f, 9.3070194e-02f,  -3.4570369e-01f,
+    2.5216484e-01f,  4.9161855e-03f,  1.9271275e+00f,  -4.5494499e+00f,
+    -1.2852082e+00f, 4.4442824e-01f,  -5.3706849e-01f, 1.3541110e-01f,
+    4.9161855e-03f,  3.8576801e+00f,  -2.9864626e+00f, -7.5119339e-02f,
+    -7.1386874e-02f, 1.0027837e+00f,  4.9816358e-01f,  4.9161855e-03f,
+    -1.1524675e+00f, -6.4670318e-01f, 4.3123364e+00f,  -1.9000579e-01f,
+    8.5365757e-02f,  -1.9686638e-01f, 4.9161855e-03f,  1.8131450e+00f,
+    4.7976389e+00f,  1.5934553e+00f,  -6.6369760e-01f, -1.9696659e-01f,
+    -4.4029149e-01f, 4.9161855e-03f,  -6.6486311e+00f, 1.6121794e-01f,
+    2.6161983e+00f,  -2.6472679e-01f, 5.4675859e-01f,  -2.8940520e-01f,
+    4.9161855e-03f,  -2.9891250e+00f, -2.5974274e+00f, 8.3908844e-01f,
+    1.2454953e+00f,  7.0261940e-02f,  -2.2021371e-01f, 4.9161855e-03f,
+    -5.6700382e+00f, 1.6352696e+00f,  -3.4084382e+00f, 3.8202977e-01f,
+    1.3943486e-01f,  -6.0616112e-01f, 4.9161855e-03f,  -2.1950989e+00f,
+    -1.7341146e+00f, 1.7323859e+00f,  -1.1931682e+00f, 1.9817488e-01f,
+    -2.8878545e-02f, 4.9161855e-03f,  5.3196278e+00f,  3.5861525e-01f,
+    -1.5447701e+00f, -2.9301494e-01f, -3.2944006e-01f, 1.9657442e-01f,
+    4.9161855e-03f,  -5.4176431e+00f, -2.1789110e+00f, 7.9536524e+00f,
+    3.3994129e-01f,  -5.4087561e-02f, -8.6205676e-02f, 4.9161855e-03f,
+    4.2253766e+00f,  2.4311712e+00f,  -2.5541326e-01f, -4.5225611e-01f,
+    3.5217261e-01f,  -6.1695367e-01f, 4.9161855e-03f,  -3.4682634e+00f,
+    -4.7175350e+00f, 1.7459866e-01f,  -4.4882014e-01f, -6.4638937e-01f,
+    -3.0638602e-01f, 4.9161855e-03f,  2.7410993e-01f,  8.0045706e-01f,
+    2.4800158e-01f,  8.1277037e-01f,  -8.1796193e-01f, -7.3142517e-01f,
+    4.9161855e-03f,  -4.0135498e+00f, 6.9434705e+00f,  2.5408168e+00f,
+    -2.2635509e-01f, 4.9111062e-01f,  -5.2405067e-02f, 4.9161855e-03f,
+    6.1405811e+00f,  5.8829279e+00f,  4.2876434e+00f,  6.2422299e-01f,
+    1.2779064e-01f,  2.3671541e-01f,  4.9161855e-03f,  4.1401911e+00f,
+    -1.5639536e+00f, -3.7992470e+00f, -3.2793185e-01f, 1.1091782e-01f,
+    4.3175989e-01f,  4.9161855e-03f,  1.3912787e+00f,  -1.3100153e+00f,
+    -3.0417368e-01f, -1.1173264e+00f, 4.5876667e-01f,  1.7409755e-01f,
+    4.9161855e-03f,  1.7314148e+00f,  -2.9625313e+00f, -1.7712467e+00f,
+    1.2611393e-02f,  -5.9502721e-01f, -8.7409288e-01f, 4.9161855e-03f,
+    -3.3928535e+00f, -5.0355792e+00f, -6.3221753e-01f, -2.2786912e-01f,
+    3.6280593e-01f,  4.9860114e-01f,  4.9161855e-03f,  2.4627335e+00f,
+    7.4708309e+00f,  2.4828105e+00f,  -1.1931285e-01f, 3.8600791e-01f,
+    2.3935346e-01f,  4.9161855e-03f,  2.3079026e+00f,  4.0781622e+00f,
+    3.0667586e+00f,  -6.7254633e-02f, -4.7441235e-01f, 1.0479894e-01f,
+    4.9161855e-03f,  -2.3147500e+00f, 2.0114279e+00f,  2.4293604e+00f,
+    6.2526542e-01f,  -2.5844949e-01f, -6.8185478e-02f, 4.9161855e-03f,
+    1.6617872e+00f,  -4.1353674e+00f, -4.6586909e+00f, 6.1750430e-01f,
+    -2.6955858e-01f, -2.9278165e-01f, 4.9161855e-03f,  2.7149663e+00f,
+    3.6809824e+00f,  2.2618716e+00f,  -1.7421328e-01f, -3.5537606e-01f,
+    4.5174813e-01f,  4.9161855e-03f,  1.1291784e+00f,  -4.5050567e-01f,
+    -2.7562863e-01f, -3.1790689e-01f, 4.2996463e-01f,  6.6389285e-02f,
+    4.9161855e-03f,  -1.8577245e+00f, -3.6221521e+00f, -3.6851006e+00f,
+    8.9392263e-01f,  6.2321472e-01f,  3.2198742e-02f,  4.9161855e-03f,
+    -3.7487407e+00f, 2.8546640e-01f,  7.3861861e-01f,  3.0945167e-01f,
+    -6.9107234e-01f, -1.9396501e-02f, 4.9161855e-03f,  9.6022475e-01f,
+    -1.8548920e+00f, 1.4083722e+00f,  4.5544246e-01f,  8.1362873e-01f,
+    -5.0299495e-01f, 4.9161855e-03f,  1.8613169e+00f,  9.5430905e-01f,
+    -6.0006475e+00f, 6.4573717e-01f,  -4.5540605e-02f, 3.9353642e-01f,
+    4.9161855e-03f,  -5.7576466e-01f, -4.0702939e+00f, 1.4662871e-01f,
+    3.0704650e-01f,  -1.0507205e+00f, 1.9402106e-01f,  4.9161855e-03f,
+    -6.8696761e+00f, -2.3508449e-01f, 5.0098281e+00f,  1.1129197e-01f,
+    -2.0352839e-01f, 3.4785947e-01f,  4.9161855e-03f,  4.9972515e+00f,
+    -5.8319759e-01f, -7.7851087e-01f, -1.4849176e-01f, -9.4275653e-01f,
+    8.8817559e-02f,  4.9161855e-03f,  -8.6972165e-01f, 2.2390528e+00f,
+    -3.2159317e+00f, 6.5020138e-01f,  3.3443257e-01f,  7.1584368e-01f,
+    4.9161855e-03f,  -7.4197614e-01f, 2.3563713e-01f,  -4.4679699e+00f,
+    -6.5029413e-02f, -1.5337236e-02f, -1.4012328e-01f, 4.9161855e-03f,
+    -4.6647656e-01f, -7.8368151e-01f, -6.5655512e-01f, -1.5816532e+00f,
+    -4.6986195e-01f, 2.4150476e-01f,  4.9161855e-03f,  1.8196188e+00f,
+    -3.0113823e+00f, -2.8634396e+00f, 5.4593522e-02f,  -3.9083639e-01f,
+    -3.7897531e-02f, 4.9161855e-03f,  1.8511251e-02f,  -3.0789416e+00f,
+    -9.2857466e+00f, -5.8989190e-03f, 2.4363661e-01f,  -4.0882280e-01f,
+    4.9161855e-03f,  6.3670468e-01f,  -3.4076877e+00f, 2.0029318e+00f,
+    2.5282994e-01f,  6.2503815e-01f,  -1.9735672e-01f, 4.9161855e-03f,
+    7.2272696e+00f,  3.5271869e+00f,  -3.5384431e+00f, -6.4121693e-02f,
+    -3.5999200e-01f, 3.6083081e-01f,  4.9161855e-03f,  -2.0246913e+00f,
+    -6.5362781e-01f, 5.3856421e-01f,  6.6928858e-01f,  7.3955721e-01f,
+    -1.3549697e+00f, 4.9161855e-03f,  -9.5964992e-01f, 6.4670593e-02f,
+    -1.4811364e-01f, 1.6200148e+00f,  -4.5196310e-01f, 1.0413836e+00f,
+    4.9161855e-03f,  3.5101047e+00f,  -3.3526034e+00f, 1.0871273e+00f,
+    6.4286031e-03f,  -6.2434512e-01f, -1.8984480e-01f, 4.9161855e-03f,
+    4.1997194e-02f,  -1.6890702e+00f, 6.2843829e-01f,  -3.1199425e-01f,
+    1.0393422e-02f,  -2.6472378e-01f, 4.9161855e-03f,  -1.0753101e+00f,
+    -2.8216927e+00f, -1.0013848e+01f, -2.1837327e-01f, -2.8217086e-01f,
+    -2.3436151e-01f, 4.9161855e-03f,  2.7256424e+00f,  -2.1598244e-01f,
+    1.1041831e+00f,  -9.7582382e-01f, -6.4714873e-01f, 7.5260535e-02f,
+    4.9161855e-03f,  8.6457081e+00f,  -1.5165756e+00f, -2.0839074e+00f,
+    -4.0601650e-01f, -5.1888924e-02f, 4.3054423e-01f,  4.9161855e-03f,
+    2.1280665e+00f,  4.0284543e+00f,  -1.1783282e-01f, 2.6849008e-01f,
+    -2.0980414e-02f, -5.4006720e-01f, 4.9161855e-03f,  -9.1752825e+00f,
+    1.3060554e+00f,  2.0836954e+00f,  -4.5614180e-01f, 5.4078943e-01f,
+    -1.8295766e-01f, 4.9161855e-03f,  -2.2605104e+00f, -3.8497891e+00f,
+    1.0843127e+01f,  3.3604836e-01f,  -1.9332437e-01f, 2.5260451e-01f,
+    4.9161855e-03f,  4.7182384e+00f,  -2.8978045e+00f, -1.7428281e+00f,
+    1.3794658e-01f,  4.0305364e-01f,  6.6244882e-01f,  4.9161855e-03f,
+    -1.3224255e+00f, 5.2021098e-01f,  -3.3740718e+00f, 4.1427228e-01f,
+    1.0910715e+00f,  -6.5209341e-01f, 4.9161855e-03f,  -1.8185365e+00f,
+    2.5828514e-01f,  6.4289254e-01f,  1.2816476e+00f,  8.3038044e-01f,
+    1.4483032e-01f,  4.9161855e-03f,  3.9466562e+00f,  -1.1976725e+00f,
+    -9.5934469e-01f, -9.1652638e-01f, 2.7758551e-01f,  3.8030837e-02f,
+    4.9161855e-03f,  1.2100216e+00f,  8.4616941e-01f,  -1.4383118e-01f,
+    4.3242332e-01f,  -1.7141787e+00f, -1.6333774e-01f, 4.9161855e-03f,
+    -3.3315253e+00f, 8.9229387e-01f,  -8.6922163e-01f, -3.7541920e-01f,
+    3.6041844e-01f,  5.8519232e-01f,  4.9161855e-03f,  -1.8975563e+00f,
+    5.0625935e+00f,  -6.8447294e+00f, 2.1172547e-01f,  -2.1871617e-01f,
+    -2.3336901e-01f, 4.9161855e-03f,  -1.4570162e-01f, 4.5507040e+00f,
+    -7.0465422e-01f, -3.8589361e-01f, 1.9029337e-01f,  -3.5117975e-01f,
+    4.9161855e-03f,  -1.0140528e+01f, 6.1018895e-02f,  8.7904096e-01f,
+    4.5813575e-01f,  -1.4336927e-01f, -2.0259835e-01f, 4.9161855e-03f,
+    3.1312416e+00f,  2.2074494e+00f,  1.4556658e+00f,  8.4221363e-03f,
+    1.2502237e-01f,  1.3486885e-01f,  4.9161855e-03f,  6.2499490e+00f,
+    -8.0702143e+00f, -9.6102351e-01f, -1.5929534e-01f, 1.3664324e-02f,
+    5.6866592e-01f,  4.9161855e-03f,  4.9385223e+00f,  -6.5970898e+00f,
+    -6.1008911e+00f, -1.5166788e-01f, -1.4117464e-01f, -8.1479117e-02f,
+    4.9161855e-03f,  3.3048346e+00f,  2.3806884e+00f,  3.8274519e+00f,
+    6.1066008e-01f,  -3.2017228e-01f, -8.9838415e-02f, 4.9161855e-03f,
+    2.2271809e-01f,  -7.6123530e-01f, 2.6768461e-01f,  -1.0121994e+00f,
+    -1.3793845e-02f, -3.0452973e-01f, 4.9161855e-03f,  5.3817654e-01f,
+    -1.4470400e+00f, 5.3883266e+00f,  1.3771947e-01f,  3.3305600e-01f,
+    9.3459821e-01f,  4.9161855e-03f,  -3.7886247e-01f, 7.1961087e-01f,
+    3.8818314e+00f,  1.1518018e-01f,  -7.7900052e-01f, -2.4627395e-01f,
+    4.9161855e-03f,  -6.9175474e-02f, 3.0598080e+00f,  -6.8954463e+00f,
+    2.2322592e-01f,  7.9998024e-02f,  6.7966568e-01f,  4.9161855e-03f,
+    -6.0521278e+00f, 4.0208979e+00f,  3.6037574e+00f,  -9.0201005e-02f,
+    -4.9529395e-01f, -2.1849494e-01f, 4.9161855e-03f,  -4.2743959e+00f,
+    2.9045238e+00f,  6.2148004e+00f,  2.8813314e-01f,  6.3006467e-01f,
+    -1.5050417e-01f, 4.9161855e-03f,  4.4486532e-01f,  7.4547344e-01f,
+    9.4860238e-01f,  -9.3737505e-03f, -4.6862206e-01f, 6.7763716e-01f,
+    4.9161855e-03f,  4.5817189e+00f,  2.0669367e+00f,  4.9893899e+00f,
+    6.5484542e-01f,  -1.5561411e-01f, -3.5419935e-01f, 4.9161855e-03f,
+    -5.9296155e-01f, -9.4426107e-01f, 3.3796230e-01f,  -1.5486457e+00f,
+    -7.9331058e-01f, -5.0273466e-01f, 4.9161855e-03f,  4.1594043e+00f,
+    2.8537092e-01f,  -2.9473579e-01f, 1.7084515e-01f,  1.0823333e+00f,
+    4.2415988e-01f,  4.9161855e-03f,  5.3607149e+00f,  -5.6411510e+00f,
+    -1.3724309e-02f, -1.0412186e-03f, 5.3025208e-02f,  -2.1293500e-01f,
+    4.9161855e-03f,  -2.3203860e-01f, -5.6371040e+00f, -6.3359928e-01f,
+    -4.2490710e-02f, -7.5937819e-01f, -5.9297900e-03f, 4.9161855e-03f,
+    2.4609616e-01f,  -1.6647290e+00f, 1.0207754e+00f,  4.0807050e-01f,
+    -1.8156316e-02f, -3.4158570e-01f, 4.9161855e-03f,  7.6231754e-01f,
+    2.1758667e-01f,  -2.6425600e-01f, -4.2366499e-01f, -7.1745002e-01f,
+    -8.4950846e-01f, 4.9161855e-03f,  6.5433443e-01f,  2.3210588e+00f,
+    2.9462072e-01f,  -6.4530611e-01f, -1.4730625e-01f, -8.9621490e-01f,
+    4.9161855e-03f,  1.1421447e+00f,  3.2726744e-01f,  -4.9973121e+00f,
+    -3.0254982e-03f, -6.6178137e-01f, -4.4324645e-01f, 4.9161855e-03f,
+    -9.7846484e-01f, -4.1716191e-01f, -1.5661771e+00f, -7.5795805e-01f,
+    8.0893016e-01f,  -2.5552294e-01f, 4.9161855e-03f,  4.0538306e+00f,
+    1.0624267e+00f,  2.3265336e+00f,  7.2247207e-01f,  -1.0373462e-02f,
+    -1.4599025e-01f, 4.9161855e-03f,  7.6418567e-01f,  -1.6888050e+00f,
+    -1.0930395e+00f, -7.8154355e-02f, 2.6909021e-01f,  3.5038045e-01f,
+    4.9161855e-03f,  -4.8746696e+00f, 5.9930868e+00f,  -6.2591534e+00f,
+    -2.1022651e-01f, 3.3780858e-01f,  -2.2561373e-01f, 4.9161855e-03f,
+    1.0469738e+00f,  7.0248455e-01f,  -7.3410082e-01f, -3.8434425e-01f,
+    6.8571496e-01f,  -2.3600546e-01f, 4.9161855e-03f,  -1.4909858e+00f,
+    2.2121072e-03f,  4.8889652e-01f,  7.0869178e-02f,  1.9885659e-01f,
+    9.6898615e-01f,  4.9161855e-03f,  6.2116122e+00f,  -4.3895874e+00f,
+    -9.9557819e+00f, -2.0628119e-01f, 8.6890794e-03f,  3.4248311e-02f,
+    4.9161855e-03f,  -3.9620697e-01f, 2.1671128e+00f,  7.6029129e-02f,
+    1.2821326e-01f,  -1.7877888e-02f, -7.6138300e-01f, 4.9161855e-03f,
+    -7.7057395e+00f, 6.7583270e+00f,  4.1223164e+00f,  5.0063860e-01f,
+    -3.2260406e-01f, -2.6778015e-01f, 4.9161855e-03f,  2.7386568e+00f,
+    -2.3904824e+00f, -2.8976858e+00f, 8.0731452e-01f,  1.1586739e-01f,
+    4.5557588e-01f,  4.9161855e-03f,  -3.7126637e+00f, 1.2195703e+00f,
+    1.4704031e+00f,  1.4595404e-01f,  -1.2760527e+00f, 1.3700278e-01f,
+    4.9161855e-03f,  -9.1034138e-01f, 2.8166884e-01f,  9.1692306e-02f,
+    -1.2893773e+00f, -1.0068115e+00f, 7.2354060e-01f,  4.9161855e-03f,
+    -2.0368499e-01f, 1.1563526e-01f,  -2.2709820e+00f, 6.9055498e-01f,
+    -9.3631399e-01f, 7.8627145e-01f,  4.9161855e-03f,  -3.1859999e+00f,
+    -2.1765156e+00f, 3.7198505e-01f,  9.5657760e-01f,  7.4806470e-01f,
+    -2.6733288e-01f, 4.9161855e-03f,  -1.8653083e+00f, 1.6296799e+00f,
+    -1.1811743e+00f, 6.7173630e-02f,  9.3116254e-01f,  -8.9083868e-01f,
+    4.9161855e-03f,  -2.2038233e+00f, 9.2086273e-01f,  -5.4128571e+00f,
+    -5.6090122e-01f, 2.4447270e-01f,  1.2071518e-01f,  4.9161855e-03f,
+    -9.3272650e-01f, 8.6203270e+00f,  2.8476541e+00f,  -2.2184102e-01f,
+    4.6709016e-01f,  2.0684598e-01f,  4.9161855e-03f,  4.2462286e-01f,
+    2.6043649e+00f,  2.1567121e+00f,  4.0597555e-01f,  2.4635155e-01f,
+    5.4677874e-01f,  4.9161855e-03f,  -6.9791615e-01f, -7.2394654e-02f,
+    -7.9927075e-01f, -1.1686948e-01f, -4.4786358e-01f, -1.2310307e-01f,
+    4.9161855e-03f,  6.3908732e-01f,  1.5464031e+00f,  -7.2350521e+00f,
+    4.7771034e-01f,  -7.5061113e-02f, -6.0055035e-01f, 4.9161855e-03f,
+    5.4760659e-01f,  -4.0661488e+00f, 3.7574809e+00f,  -4.5561403e-01f,
+    2.0565687e-01f,  -3.3205089e-01f, 4.9161855e-03f,  1.1567845e+00f,
+    -2.1524792e+00f, -3.5894201e+00f, -5.3367224e-02f, 4.1133749e-01f,
+    -1.1288481e-02f, 4.9161855e-03f,  -4.0661426e+00f, 2.3462789e+00f,
+    -9.8737985e-01f, 5.2306634e-01f,  -2.5305262e-01f, -6.9745469e-01f,
+    4.9161855e-03f,  4.0782847e+00f,  -6.9291615e+00f, -1.6262084e+00f,
+    4.2396560e-01f,  -4.8761395e-01f, 2.1209660e-01f,  4.9161855e-03f,
+    -3.6398977e-02f, -8.5710377e-01f, -1.0456041e+00f, -4.2379850e-01f,
+    1.4236011e-01f,  -1.8565869e-01f, 4.9161855e-03f,  -1.0438566e+00f,
+    -1.0525371e+00f, 4.1417345e-01f,  3.3945918e-01f,  -9.1389066e-01f,
+    2.0205980e-02f,  4.9161855e-03f,  -9.3069160e-01f, -1.5719604e+00f,
+    -2.4732697e+00f, -1.5562963e-02f, 4.7170100e-01f,  -1.0558943e+00f,
+    4.9161855e-03f,  -2.6214740e-01f, -1.6777412e+00f, -1.6233773e+00f,
+    -1.8219057e-01f, -3.6187124e-01f, -5.5351281e-03f, 4.9161855e-03f,
+    -3.2747793e+00f, -4.5946374e+00f, -5.3931463e-01f, 7.5467026e-01f,
+    -3.6849698e-01f, 6.3520420e-01f,  4.9161855e-03f,  2.9533076e+00f,
+    -1.0749801e+00f, 7.1191603e-01f,  -3.5945854e-01f, 3.9648840e-01f,
+    -7.2392190e-01f, 4.9161855e-03f,  -1.0939742e+00f, -3.9905021e+00f,
+    -5.1769514e+00f, -1.9660223e-01f, -1.0596719e-02f, 4.3273312e-01f,
+    4.9161855e-03f,  -3.0557539e+00f, -6.6578549e-01f, 1.2200816e+00f,
+    2.2699955e-01f,  -4.1672829e-01f, -2.7230310e-01f, 4.9161855e-03f,
+    -3.1797330e+00f, -3.0303648e+00f, 5.5223483e-01f,  -1.5985982e-01f,
+    -6.3496631e-01f, 5.1583236e-01f,  4.9161855e-03f,  -8.1636095e-01f,
+    -6.1753297e-01f, -2.3677840e+00f, -1.0832779e+00f, -7.1589336e-02f,
+    4.3596086e-01f,  4.9161855e-03f,  -3.0114591e+00f, -3.0822971e-01f,
+    3.7344346e+00f,  3.4873700e-01f,  -2.0172851e-01f, -5.6026226e-01f,
+    4.9161855e-03f,  -1.2339014e+00f, -1.0268744e+00f, 2.3437053e-01f,
+    -8.8729274e-01f, 1.7357446e-01f,  -4.2521077e-01f, 4.9161855e-03f,
+    7.6893506e+00f,  5.8836145e+00f,  -2.0426424e+00f, 1.7266423e-02f,
+    1.1970200e-01f,  -1.4518172e-02f, 4.9161855e-03f,  -1.5856417e+00f,
+    2.5296898e+00f,  -1.6330155e+00f, -1.9896343e-01f, 6.2061214e-01f,
+    -7.6168430e-01f, 4.9161855e-03f,  -2.9207973e+00f, 1.0207623e+00f,
+    -2.1856134e+00f, 7.8229979e-02f,  1.5372838e-01f,  5.7523686e-01f,
+    4.9161855e-03f,  -7.2688259e-02f, 1.4009744e+00f,  8.5709387e-01f,
+    -3.2453546e-01f, 7.5210601e-02f,  5.8245473e-02f,  4.9161855e-03f,
+    1.2019936e+00f,  3.4423873e-01f,  -1.1004268e+00f, 1.4619813e+00f,
+    2.3473673e-01f,  -8.1246912e-01f, 4.9161855e-03f,  9.2013636e+00f,
+    1.5965141e+00f,  9.3494253e+00f,  4.1525030e-01f,  -3.0840111e-01f,
+    -7.5029820e-02f, 4.9161855e-03f,  -2.8596039e+00f, -3.1124935e-01f,
+    2.4989309e+00f,  -2.0422903e-01f, -2.7113402e-01f, -7.7276611e-01f,
+    4.9161855e-03f,  -2.5138488e+00f, 1.2386133e+01f,  3.0402360e+00f,
+    2.6705246e-02f,  -2.0976053e-01f, -9.6279144e-02f, 4.9161855e-03f,
+    -2.7852359e-01f, 3.4290299e-01f,  3.0158368e-01f,  -7.9115462e-01f,
+    4.4737333e-01f,  6.5243357e-01f,  4.9161855e-03f,  8.8802981e-01f,
+    3.3639688e+00f,  -3.2436025e+00f, -1.6130263e-01f, 4.3880481e-01f,
+    1.0564056e-01f,  4.9161855e-03f,  1.3081352e-01f,  -3.2971656e-01f,
+    9.2740881e-01f,  -2.3205736e-01f, 7.0441529e-02f,  -1.4793061e+00f,
+    4.9161855e-03f,  -6.9485197e+00f, -4.7469378e+00f, 7.2799211e+00f,
+    -1.4510322e-01f, 1.1659682e-01f,  -1.5350385e-01f, 4.9161855e-03f,
+    2.5247040e-01f,  -2.2481077e+00f, -5.5699044e-01f, -3.2005566e-01f,
+    -4.1440362e-01f, -8.3654840e-03f, 4.9161855e-03f,  2.1919296e+00f,
+    1.3954902e+00f,  -2.6824844e+00f, -9.2727757e-01f, 2.7820390e-01f,
+    2.0077060e-01f,  4.9161855e-03f,  -2.5565681e+00f, 8.9766016e+00f,
+    -2.0122559e+00f, 3.9176670e-01f,  -2.4847011e-01f, 1.1110017e-01f,
+    4.9161855e-03f,  6.0324121e-01f,  -8.9385861e-01f, -1.2336399e-01f,
+    8.6264330e-01f,  7.4958569e-01f,  8.2861269e-01f,  4.9161855e-03f,
+    -5.7891827e+00f, -2.1946945e+00f, -4.4824104e+00f, 2.5888926e-01f,
+    -3.5696858e-01f, -6.8930852e-01f, 4.9161855e-03f,  2.4704602e+00f,
+    9.4484291e+00f,  6.0409355e+00f,  5.3552705e-01f,  1.4301011e-01f,
+    2.1043065e-01f,  4.9161855e-03f,  6.2216535e+00f,  -1.3350110e-01f,
+    5.0205865e+00f,  -2.3507077e-01f, -6.0848188e-01f, 2.7384153e-01f,
+    4.9161855e-03f,  -1.1331167e+00f, -4.6681752e+00f, 4.7972460e+00f,
+    -2.5069791e-01f, 2.3398107e-01f,  4.1248101e-01f,  4.9161855e-03f,
+    5.2076955e+00f,  -8.2938963e-01f, 5.3475156e+00f,  -4.4323674e-01f,
+    -1.2149593e-01f, -3.4891346e-01f, 4.9161855e-03f,  1.1436806e+00f,
+    -3.8295863e+00f, -5.2244568e+00f, -3.5402426e-01f, -4.7722957e-01f,
+    2.8002101e-01f,  4.9161855e-03f,  -4.1085282e-01f, 7.1546543e-01f,
+    -1.1344000e-01f, -5.1656473e-01f, -1.9136779e-01f, -3.8638729e-01f,
+    4.9161855e-03f,  -1.5009623e+00f, 3.3477488e-01f,  4.1177177e-01f,
+    -7.7530108e-03f, -1.1455448e+00f, -5.5644792e-01f, 4.9161855e-03f,
+    -4.0001779e+00f, -1.5739800e+00f, -2.7977524e+00f, 9.1510427e-01f,
+    -6.9056615e-02f, -1.2942998e-01f, 4.9161855e-03f,  4.5878491e-01f,
+    -6.4639592e-01f, 5.5837858e-01f,  8.9323342e-01f,  5.5044502e-01f,
+    3.9806306e-01f,  4.9161855e-03f,  5.6660228e+00f,  3.7501116e+00f,
+    -4.2122407e+00f, -1.2555529e-01f, 4.6051678e-01f,  -5.2156222e-01f,
+    4.9161855e-03f,  -4.4734424e-01f, 1.3746558e+00f,  5.5306411e+00f,
+    1.1301793e-01f,  -6.5199757e-01f, -3.7271160e-01f, 4.9161855e-03f,
+    -2.7237234e+00f, -1.9530910e+00f, 9.5792544e-01f,  -2.1367524e-02f,
+    6.1001953e-02f,  5.8275521e-02f,  4.9161855e-03f,  -1.6100755e-01f,
+    3.7045591e+00f,  -2.5025744e+00f, 1.4095868e-01f,  5.4430299e-02f,
+    -1.2383699e-01f, 4.9161855e-03f,  -1.7754663e+00f, -1.6746805e+00f,
+    -2.3337072e-01f, -2.0568541e-01f, 2.3082292e-01f,  -1.0832767e+00f,
+    4.9161855e-03f,  3.7021962e-01f,  -7.7780523e+00f, 1.4875294e+00f,
+    1.2266554e-02f,  -7.1301538e-01f, -4.4682795e-01f, 4.9161855e-03f,
+    -2.4607019e+00f, 2.3491945e+00f,  -2.5397232e+00f, -6.2261623e-01f,
+    7.2446340e-01f,  -4.3639538e-01f, 4.9161855e-03f,  -5.6957707e+00f,
+    -2.9954064e+00f, -4.9214292e+00f, 5.7436901e-01f,  -4.0112248e-01f,
+    -1.2796953e-01f, 4.9161855e-03f,  7.6529913e+00f,  -5.7147236e+00f,
+    5.1646070e+00f,  -3.6653347e-02f, 1.9746809e-01f,  -1.6327949e-01f,
+    4.9161855e-03f,  2.5772855e-01f,  -4.6115333e-01f, 1.3816971e-01f,
+    1.8487598e+00f,  -3.3207378e-01f, 1.0512314e+00f,  4.9161855e-03f,
+    -5.2915611e+00f, 2.0870304e+00f,  2.6679549e-01f,  -2.9553398e-01f,
+    1.7010327e-01f,  6.1560780e-01f,  4.9161855e-03f,  3.7104313e+00f,
+    -8.5663140e-01f, 1.5043894e+00f,  -6.3773885e-02f, 6.6316694e-02f,
+    7.1101356e-01f,  4.9161855e-03f,  4.8451677e-01f,  1.8731930e+00f,
+    5.2332506e+00f,  -5.0878936e-01f, 3.0235314e-01f,  7.1813804e-01f,
+    4.9161855e-03f,  -4.1218561e-01f, 7.4095565e-01f,  -3.2884508e-01f,
+    -1.4225919e+00f, -7.9207763e-02f, -5.2490056e-01f, 4.9161855e-03f,
+    4.3497758e+00f,  -4.0700622e+00f, 2.6308778e-01f,  -6.2746292e-01f,
+    -7.3860154e-02f, 6.5638328e-01f,  4.9161855e-03f,  -2.1579653e-02f,
+    4.0641442e-01f,  5.4142561e+00f,  -3.9263438e-02f, 5.0368893e-01f,
+    -7.2989553e-01f, 4.9161855e-03f,  -1.7396202e+00f, -1.2370780e+00f,
+    -7.4541867e-01f, -9.9768794e-01f, -8.6462057e-01f, 8.0447471e-01f,
+    4.9161855e-03f,  2.5507419e+00f,  -2.5318336e+00f, 7.9411879e+00f,
+    -2.9810840e-01f, 5.5283558e-01f,  4.5358066e-02f,  4.9161855e-03f,
+    3.2466240e+00f,  -3.4043659e-02f, 7.7465367e-01f,  3.8771144e-01f,
+    1.6951884e-01f,  -8.2736440e-02f, 4.9161855e-03f,  3.1765196e+00f,
+    2.4791040e+00f,  7.8286749e-01f,  6.5482211e-01f,  4.2056656e-01f,
+    -6.0098726e-01f, 4.9161855e-03f,  5.1316774e-01f,  1.3855555e+00f,
+    1.8478738e+00f,  3.7954280e-01f,  -8.2836556e-01f, -1.2284636e-01f,
+    4.9161855e-03f,  1.2954119e+00f,  9.0436506e-01f,  3.3232520e+00f,
+    4.4694731e-01f,  3.4010820e-03f,  -1.4319934e-01f, 4.9161855e-03f,
+    1.2168367e-01f,  -6.4623189e+00f, 4.1875038e+00f,  3.4066197e-01f,
+    -1.3179915e-01f, 1.1279566e-01f,  4.9161855e-03f,  8.2923877e-01f,
+    3.3003147e+00f,  -1.1322347e-01f, 6.8241709e-01f,  3.9553082e-01f,
+    -6.2505466e-01f, 4.9161855e-03f,  -2.8459623e-02f, -8.9666122e-01f,
+    1.4573698e+00f,  9.5023394e-02f,  -7.6894805e-02f, -2.1677141e-01f,
+    4.9161855e-03f,  -9.6267796e-01f, 1.7573184e-01f,  2.5900939e-01f,
+    -2.6439837e-01f, 9.0278494e-01f,  8.8790357e-01f,  4.9161855e-03f,
+    2.4336672e+00f,  -7.1640553e+00f, 3.6254086e+00f,  6.4685160e-01f,
+    -3.2698211e-01f, 7.0840068e-02f,  4.9161855e-03f,  -5.9096532e+00f,
+    -1.9160348e+00f, 3.9193995e+00f,  -6.7071283e-01f, -1.9056444e-01f,
+    -4.5317072e-01f, 4.9161855e-03f,  -1.4707901e+00f, 1.1910865e-01f,
+    1.1022505e+00f,  2.6277620e-02f,  -3.8275990e-01f, 6.2770671e-01f,
+    4.9161855e-03f,  -7.3789585e-01f, -1.2953321e+00f, -5.2267389e+00f,
+    3.4158260e-02f,  1.5098372e-01f,  1.3004602e-01f,  4.9161855e-03f,
+    3.3035767e+00f,  4.6425954e-01f,  -8.1617832e-01f, 2.1944559e-01f,
+    3.3776700e-01f,  9.5569676e-01f,  4.9161855e-03f,  6.0753441e+00f,
+    -9.4240761e-01f, 4.0869508e+00f,  -7.9642147e-02f, 2.1676794e-02f,
+    3.5323358e-01f,  4.9161855e-03f,  -1.0766250e+01f, 9.0645037e+00f,
+    -4.8881302e+00f, -1.4934587e-01f, 2.2883666e-01f,  -1.6644326e-01f,
+    4.9161855e-03f,  -1.2535204e+00f, 8.5706103e-01f,  1.5652949e-01f,
+    1.1726750e+00f,  2.6057336e-01f,  4.0940413e-01f,  4.9161855e-03f,
+    -1.0702034e+01f, 1.2516937e+00f,  -1.3382761e+00f, -1.4350083e-01f,
+    2.5710282e-01f,  -1.4253895e-01f, 4.9161855e-03f,  6.2700930e+00f,
+    -1.5379217e+00f, -7.3641987e+00f, -3.9090697e-02f, -3.3347785e-01f,
+    3.5581671e-02f,  4.9161855e-03f,  2.9623554e+00f,  -8.8794357e-01f,
+    1.4922516e+00f,  9.2039919e-01f,  7.3257349e-03f,  -9.8296821e-02f,
+    4.9161855e-03f,  8.8694298e-01f,  6.9717664e-01f,  -4.4938159e+00f,
+    -6.6308784e-01f, -2.9959220e-02f, 5.9899336e-01f,  4.9161855e-03f,
+    2.7530522e+00f,  8.1737165e+00f,  -1.4010216e+00f, 1.1748995e-01f,
+    -1.3952407e-01f, 2.1300323e-01f,  4.9161855e-03f,  -8.3862219e+00f,
+    6.6970325e+00f,  8.5669098e+00f,  1.9593265e-02f,  -1.8054524e-01f,
+    8.2735501e-02f,  4.9161855e-03f,  -1.7339755e+00f, 1.7938353e+00f,
+    8.2033026e-01f,  -5.4445755e-01f, -6.2285561e-02f, 2.5855592e-01f,
+    4.9161855e-03f,  -5.2762489e+00f, -4.2943602e+00f, -4.0066252e+00f,
+    -4.3525260e-02f, -2.1258898e-02f, 4.7848368e-01f,  4.9161855e-03f,
+    7.6586235e-01f,  -2.4081889e-01f, -1.6427093e+00f, -2.0026308e-02f,
+    1.2395242e-01f,  6.1082700e-04f,  4.9161855e-03f,  3.3507187e+00f,
+    -1.0240507e+01f, -5.1297288e+00f, 4.3201432e-01f,  4.4983926e-01f,
+    -2.7774861e-01f, 4.9161855e-03f,  -2.8253822e+00f, -7.5929403e-01f,
+    -2.9382997e+00f, 4.7752061e-01f,  4.0330526e-01f,  3.0657032e-01f,
+    4.9161855e-03f,  2.0044863e-01f,  -2.9507504e+00f, -3.2443504e+00f,
+    2.5046369e-01f,  3.0626279e-01f,  -8.9583957e-01f, 4.9161855e-03f,
+    -2.0919750e+00f, 4.3667765e+00f,  -3.0602129e+00f, -3.8770989e-01f,
+    2.8424934e-01f,  -5.2657247e-01f, 4.9161855e-03f,  -3.3979905e+00f,
+    1.4949689e+00f,  -5.1806617e+00f, -1.5795708e-01f, -3.5939518e-02f,
+    5.1160586e-01f,  4.9161855e-03f,  -1.7886322e+00f, 8.9676952e-01f,
+    -8.6497908e+00f, 1.8233211e-01f,  -4.0997352e-02f, 6.4814395e-01f,
+    4.9161855e-03f,  -1.5730165e+00f, 1.7184561e+00f,  -5.0965128e+00f,
+    2.9170886e-01f,  -2.5669548e-01f, -1.8910386e-01f, 4.9161855e-03f,
+    9.1550064e+00f,  -5.8923647e-02f, 5.9311843e+00f,  -1.3799039e-01f,
+    5.6774336e-01f,  -7.2126962e-02f, 4.9161855e-03f,  3.4160118e+00f,
+    4.8486991e+00f,  -4.6832914e+00f, 6.8488821e-02f,  -3.0767199e-01f,
+    2.2700641e-01f,  4.9161855e-03f,  -1.5771277e+00f, 4.7655615e-01f,
+    1.7979294e+00f,  1.0064609e+00f,  -2.2796272e-01f, -8.4801579e-01f,
+    4.9161855e-03f,  5.3412542e+00f,  1.4290444e+00f,  -2.4337921e+00f,
+    1.8301491e-01f,  -7.2091872e-01f, 3.1204930e-01f,  4.9161855e-03f,
+    3.2980211e+00f,  7.2834247e-01f,  -5.7064676e-01f, -3.5967571e-01f,
+    -1.0186039e-01f, -8.8198590e-01f, 4.9161855e-03f,  -3.6528933e+00f,
+    -1.9906701e+00f, -1.5311290e+00f, -1.3554078e-01f, -7.3127121e-01f,
+    -3.3883739e-01f, 4.9161855e-03f,  5.6776178e-01f,  2.5676557e-01f,
+    -1.7308378e+00f, 4.5613620e-01f,  -3.0034539e-01f, -5.2824324e-01f,
+    4.9161855e-03f,  -1.2763550e+00f, 1.8992659e-01f,  1.3920313e+00f,
+    3.3915433e-01f,  -2.5801826e-01f, 3.7367827e-01f,  4.9161855e-03f,
+    2.9597163e+00f,  1.4648328e+00f,  6.6470485e+00f,  4.6583173e-01f,
+    2.9541162e-01f,  1.4314331e-01f,  4.9161855e-03f,  -1.2253593e-01f,
+    3.6476731e-01f,  -2.3429374e-01f, -8.5051000e-01f, -1.5754678e+00f,
+    -1.0546576e+00f, 4.9161855e-03f,  2.7294402e+00f,  3.8883293e+00f,
+    3.0172112e+00f,  4.1178986e-01f,  -7.2390623e-03f, 4.4097424e-01f,
+    4.9161855e-03f,  -4.3637651e-01f, -2.1402721e+00f, 2.6629260e+00f,
+    -8.0778193e-01f, 4.7216830e-01f,  -9.7485429e-01f, 4.9161855e-03f,
+    -3.9435267e+00f, -2.3975267e+00f, 1.4559281e+01f,  2.7717435e-01f,
+    9.1627508e-02f,  -1.8850714e-01f, 4.9161855e-03f,  5.9964097e-01f,
+    -7.2503984e-01f, -4.2790172e-01f, 1.5436234e+00f,  4.5493039e-01f,
+    5.8981228e-01f,  4.9161855e-03f,  -9.6339476e-01f, -8.9544678e-01f,
+    3.3564791e-01f,  -1.0856894e+00f, -7.9496235e-01f, 1.2212116e+00f,
+    4.9161855e-03f,  6.1837864e+00f,  -2.1298322e-01f, -4.8063025e+00f,
+    2.1292269e-01f,  1.1314870e-01f,  3.5606495e-01f,  4.9161855e-03f,
+    -4.7102060e+00f, -3.3512626e+00f, 7.8332210e+00f,  3.7699956e-01f,
+    3.9530000e-01f,  -2.6920196e-01f, 4.9161855e-03f,  -2.9211233e+00f,
+    -1.0305672e+00f, 2.4663877e+00f,  -1.7833069e-01f, 3.3804491e-01f,
+    7.5344557e-01f,  4.9161855e-03f,  6.8797150e+00f,  -6.6251493e+00f,
+    1.8645595e+00f,  -9.5544621e-02f, -4.5911532e-02f, -6.3025075e-01f,
+    4.9161855e-03f,  4.4177470e+00f,  6.7363849e+00f,  -1.1086810e+00f,
+    -9.4687149e-02f, -2.6860729e-01f, 7.5354621e-02f,  4.9161855e-03f,
+    6.6460018e+00f,  3.3235323e+00f,  4.0945444e+00f,  6.9182122e-01f,
+    3.5717290e-02f,  5.2928823e-01f,  4.9161855e-03f,  6.9093585e-01f,
+    5.3657085e-01f,  -2.7217064e+00f, 7.8025711e-01f,  1.0647196e+00f,
+    9.1549769e-02f,  4.9161855e-03f,  5.1078949e+00f,  -4.6708674e+00f,
+    -9.2208271e+00f, -1.5181795e-01f, -8.6041331e-02f, 1.2009077e-02f,
+    4.9161855e-03f,  -9.2331278e-01f, -1.5245067e+01f, -1.8430016e+00f,
+    1.6230610e-01f,  7.5651765e-02f,  -2.0839202e-01f, 4.9161855e-03f,
+    -2.4895720e+00f, -1.3060440e+00f, 8.2995977e+00f,  -3.9603344e-01f,
+    -1.4644308e-01f, -5.3232598e-01f, 4.9161855e-03f,  -5.0348949e-01f,
+    -9.4410628e-01f, 1.0830581e+00f,  -8.0133498e-01f, 8.0811757e-01f,
+    5.9235162e-01f,  4.9161855e-03f,  -3.3763075e+00f, 3.0640872e+00f,
+    4.0426502e+00f,  -5.3082889e-01f, 7.3710519e-01f,  -2.8753296e-01f,
+    4.9161855e-03f,  1.4202030e+00f,  -1.5501769e+00f, -1.2415150e+00f,
+    -6.6869056e-01f, 2.7094612e-01f,  -4.0606999e-01f, 4.9161855e-03f,
+    -7.7039480e-01f, -4.0073175e+00f, 3.0493884e+00f,  -2.6583874e-01f,
+    3.3602440e-01f,  -1.5869410e-01f, 4.9161855e-03f,  1.0002196e+00f,
+    -4.0281076e+00f, -4.3797832e+00f, -2.0664814e-01f, -5.3153837e-01f,
+    -1.8399048e-01f, 4.9161855e-03f,  2.6349607e-01f,  -7.4451178e-01f,
+    -6.0106546e-01f, -7.5970972e-01f, 2.8142974e-01f,  -1.3207905e+00f,
+    4.9161855e-03f,  3.8722780e+00f,  -4.5574789e+00f, 4.0573292e+00f,
+    -6.9357514e-02f, -1.6351803e-01f, -5.8050317e-01f, 4.9161855e-03f,
+    2.1514051e+00f,  -3.1127915e+00f, -2.7818331e-01f, -2.6966959e-01f,
+    -3.0738050e-01f, -2.6039067e-01f, 4.9161855e-03f,  3.1542454e+00f,
+    1.6528401e+00f,  1.5305791e+00f,  -1.1632952e-01f, 3.7422487e-01f,
+    2.7905959e-01f,  4.9161855e-03f,  -4.7130257e-01f, -1.8884267e+00f,
+    5.3116055e+00f,  -1.2791082e-01f, -3.0701835e-02f, 3.7195235e-01f,
+    4.9161855e-03f,  -2.3392570e+00f, 8.2322540e+00f,  8.3583860e+00f,
+    -4.4111077e-02f, 7.8319967e-02f,  -9.6207060e-02f, 4.9161855e-03f,
+    -2.1963356e+00f, -2.9490449e+00f, -5.8961862e-01f, -1.0104504e-01f,
+    9.4426346e-01f,  -5.8387357e-01f, 4.9161855e-03f,  -4.0715724e-01f,
+    -2.7898128e+00f, -4.7324011e-01f, 2.0851484e-01f,  3.9485529e-01f,
+    -3.8530013e-01f, 4.9161855e-03f,  -4.3974891e+00f, -8.4682912e-01f,
+    -3.2423160e+00f, -4.6953207e-01f, -2.3714904e-01f, -2.6994130e-02f,
+    4.9161855e-03f,  -1.0799764e+01f, 4.4622698e+00f,  6.1397690e-01f,
+    3.0125976e-03f,  1.8344313e-01f,  9.8420180e-02f,  4.9161855e-03f,
+    4.5963225e-01f,  5.7316095e-01f,  1.3716172e-01f,  -4.5887467e-01f,
+    -7.0215470e-01f, -8.5560244e-01f, 4.9161855e-03f,  -3.7018690e+00f,
+    4.5754645e-02f,  7.3413754e-01f,  2.8994748e-01f,  -1.2318026e+00f,
+    4.0843673e-02f,  4.9161855e-03f,  -3.8644615e-01f, 4.2327684e-01f,
+    -9.1640666e-02f, 4.8928967e-01f,  -1.3959870e+00f, 1.2630954e+00f,
+    4.9161855e-03f,  1.8139942e+00f,  3.8542380e+00f,  -6.5168285e+00f,
+    1.6067383e-01f,  -5.9492588e-01f, 5.3673685e-02f,  4.9161855e-03f,
+    1.3779532e+00f,  -1.1781169e+01f, 4.7154002e+00f,  1.5091422e-01f,
+    -8.9451134e-02f, 1.2947474e-01f,  4.9161855e-03f,  -1.3260136e+00f,
+    -7.6551027e+00f, -2.2713916e+00f, 4.8155704e-01f,  -3.0485472e-01f,
+    -1.0067774e-01f, 4.9161855e-03f,  -2.8808248e+00f, -1.0482716e+01f,
+    -4.4154463e+00f, 6.7491457e-02f,  -3.6273432e-01f, 2.0917881e-01f,
+    4.9161855e-03f,  6.3390737e+00f,  6.9130831e+00f,  -4.7350311e+00f,
+    8.7844469e-03f,  3.9109352e-01f,  3.5500124e-01f,  4.9161855e-03f,
+    -3.9952296e-01f, -1.1013354e-01f, -2.2021386e-01f, -5.4285401e-01f,
+    -2.3495735e-01f, 1.9557957e-01f,  4.9161855e-03f,  -4.3585640e-01f,
+    -3.7436824e+00f, 1.2239318e+00f,  4.1005331e-01f,  -9.1933674e-01f,
+    5.1098686e-01f,  4.9161855e-03f,  -1.6157585e+00f, -4.8224859e+00f,
+    -5.8910532e+00f, -4.5340981e-02f, -3.8654584e-01f, 1.2313969e-01f,
+    4.9161855e-03f,  1.4624373e+00f,  3.5870013e+00f,  -3.6420727e+00f,
+    1.1446878e-01f,  -1.5249999e-01f, -1.3377556e-01f, 4.9161855e-03f,
+    1.6492217e+00f,  -1.1625522e+00f, 6.4684806e+00f,  -5.5535161e-01f,
+    -6.1164206e-01f, 3.4487322e-01f,  4.9161855e-03f,  -4.1177252e-01f,
+    -1.3457669e-01f, 1.0822372e+00f,  6.0612595e-01f,  5.1498848e-01f,
+    -3.1651068e-01f, 4.9161855e-03f,  1.4677581e-01f,  -2.2483449e+00f,
+    8.4818816e-01f,  7.5509012e-02f,  3.9663109e-01f,  -6.3402826e-01f,
+    4.9161855e-03f,  6.1324382e+00f,  -2.0449994e+00f, 5.8202696e-01f,
+    6.1292440e-01f,  3.5556069e-01f,  2.2752848e-01f,  4.9161855e-03f,
+    -3.0714469e+00f, 1.0777712e+01f,  -1.1295730e+00f, -3.1449816e-01f,
+    3.5032073e-01f,  -3.0413285e-01f, 4.9161855e-03f,  5.2378380e-01f,
+    5.3693795e-01f,  7.1774465e-01f,  7.2248662e-01f,  3.4031644e-01f,
+    6.7593110e-01f,  4.9161855e-03f,  2.4295657e+00f,  -7.7421494e+00f,
+    -5.0242991e+00f, 3.2821459e-01f,  -1.2377231e-01f, 4.4129044e-02f,
+    4.9161855e-03f,  1.3932830e+01f,  -1.8785001e-01f, -2.5588515e+00f,
+    3.1930944e-01f,  -3.5054013e-01f, -4.5028195e-02f, 4.9161855e-03f,
+    -5.8196408e-01f, 6.6886023e-03f,  2.6216498e-01f,  6.4578718e-01f,
+    -5.2356768e-01f, 4.7566593e-01f,  4.9161855e-03f,  4.7260118e+00f,
+    1.2474382e+00f,  5.1553049e+00f,  1.5961643e-01f,  -3.1193703e-01f,
+    -2.3862544e-01f, 4.9161855e-03f,  3.4913974e+00f,  -1.6139863e+00f,
+    2.2464933e+00f,  -5.9063923e-01f, 4.8114887e-01f,  -3.3533069e-01f,
+    4.9161855e-03f,  8.9673018e-01f,  -1.4629961e+00f, -2.1733539e+00f,
+    6.3455045e-01f,  5.7413024e-01f,  5.9105396e-02f,  4.9161855e-03f,
+    3.3593988e+00f,  6.4571220e-01f,  -8.2219487e-01f, -2.8119728e-01f,
+    7.1795964e-01f,  -1.9348176e-01f, 4.9161855e-03f,  -1.6793771e+00f,
+    -9.3323147e-01f, -1.0284096e+00f, 1.7996219e-01f,  -5.4395292e-02f,
+    -5.3295928e-01f, 4.9161855e-03f,  3.6469729e+00f,  2.9210367e+00f,
+    3.3143349e+00f,  2.1656457e-01f,  5.0930542e-01f,  3.2544386e-01f,
+    4.9161855e-03f,  1.0256160e+01f,  5.1387095e+00f,  -2.3690042e-01f,
+    1.2514941e-01f,  4.5106778e-01f,  -4.2391279e-01f, 4.9161855e-03f,
+    2.2757618e+00f,  1.2305504e+00f,  3.8755146e-01f,  -2.1070603e-01f,
+    -7.8005248e-01f, -4.4709837e-01f, 4.9161855e-03f,  -5.1670942e+00f,
+    1.5598483e+00f,  -3.5291243e+00f, 1.6316184e-01f,  -2.0411415e-01f,
+    -5.9437793e-01f, 4.9161855e-03f,  -1.5594204e+01f, -3.7022252e+00f,
+    -3.7550454e+00f, 1.8492374e-01f,  -4.7934514e-02f, -7.7964649e-02f,
+    4.9161855e-03f,  3.1953554e+00f,  2.0546597e-01f,  -3.7095559e-01f,
+    1.9130148e-01f,  -7.1165860e-01f, -1.0573120e+00f, 4.9161855e-03f,
+    -2.7792058e+00f, 9.8535782e-01f,  2.5838134e-01f,  6.6172677e-01f,
+    8.8137114e-01f,  -1.0916281e-02f, 4.9161855e-03f,  -5.0778711e-01f,
+    -3.3756995e-01f, -8.2829469e-01f, -9.9659681e-01f, 1.0217003e+00f,
+    9.3604630e-01f,  4.9161855e-03f,  1.5158432e+00f,  -3.2348025e+00f,
+    1.4036649e+00f,  -1.9708058e-01f, -8.0950028e-01f, 2.9766664e-01f,
+    4.9161855e-03f,  9.8305964e-01f,  -3.4999862e-01f, -1.0570002e+00f,
+    -1.7369969e-01f, 6.2416160e-01f,  3.6124137e-01f,  4.9161855e-03f,
+    -3.3896977e-01f, -2.6897258e-01f, 4.5453751e-01f,  -3.4363815e-01f,
+    1.0429972e+00f,  -1.2775995e-01f, 4.9161855e-03f,  -1.0826423e+00f,
+    -3.3066554e+00f, 1.0597175e-01f,  -2.4241740e-01f, 9.1466504e-01f,
+    4.6157035e-01f,  4.9161855e-03f,  1.1641353e+00f,  -1.1828867e+00f,
+    8.3474927e-02f,  9.2612118e-02f,  -1.0640503e+00f, 6.1718243e-01f,
+    4.9161855e-03f,  -1.5752809e+00f, 3.1991715e+00f,  -9.9801407e+00f,
+    -3.5100287e-01f, -5.0016546e-01f, 1.6660391e-01f,  4.9161855e-03f,
+    -4.2045827e+00f, -3.2866499e+00f, -1.1206657e+00f, -4.5332417e-01f,
+    3.2170776e-01f,  1.7660064e-01f,  4.9161855e-03f,  -1.3083904e+00f,
+    -2.6270282e+00f, 1.9103733e+00f,  -3.7962582e-02f, 5.4677010e-01f,
+    -2.7110046e-01f, 4.9161855e-03f,  1.9824886e-01f,  3.3845697e-02f,
+    -1.3422199e-01f, -1.3416489e+00f, 1.3885272e+00f,  2.8959107e-01f,
+    4.9161855e-03f,  3.7783051e+00f,  -3.0795629e+00f, -5.9362769e-01f,
+    1.0876846e-01f,  4.5782991e-02f,  9.0166003e-01f,  4.9161855e-03f,
+    -3.3900323e+00f, -1.2412339e+00f, -4.0827131e-01f, 1.1136277e-01f,
+    -6.5951711e-01f, -7.5657803e-01f, 4.9161855e-03f,  -8.0518305e-02f,
+    3.6436194e-01f,  -2.6549952e+00f, -3.5231838e-01f, 1.0433834e+00f,
+    -3.7238491e-01f, 4.9161855e-03f,  3.3414989e+00f,  -2.7282398e+00f,
+    -1.0403559e+01f, -1.3802331e-02f, 4.6939823e-01f,  9.7290888e-02f,
+    4.9161855e-03f,  -7.1867938e+00f, 1.0925708e+00f,  8.2917814e+00f,
+    1.7192370e-01f,  4.5020524e-01f,  3.7679866e-01f,  4.9161855e-03f,
+    9.6701646e-01f,  -7.5983357e-01f, 1.1458014e+00f,  3.4344528e-02f,
+    5.6285536e-01f,  -6.2582952e-01f, 4.9161855e-03f,  -2.2120414e+00f,
+    -2.5760954e-02f, -5.7933021e-01f, 1.2068044e-01f,  -7.6880723e-01f,
+    5.1227695e-01f,  4.9161855e-03f,  3.2392139e+00f,  1.4307367e+00f,
+    9.5674601e+00f,  2.5352058e-01f,  -2.3321305e-01f, 1.2310863e-01f,
+    4.9161855e-03f,  -1.2752718e+00f, 4.5532646e+00f,  -1.2888458e+00f,
+    1.9152538e-01f,  -6.2447852e-01f, 1.2212185e-01f,  4.9161855e-03f,
+    -1.2589412e+00f, 5.5781960e-01f,  -6.3506114e-01f, 9.3907797e-01f,
+    1.9405334e-01f,  -3.4146562e-01f, 4.9161855e-03f,  1.9039134e+00f,
+    -6.8664914e-01f, 3.5822120e+00f,  -5.3415704e-01f, -2.7978751e-01f,
+    4.3960336e-01f,  4.9161855e-03f,  -6.4647198e+00f, -4.1601009e+00f,
+    3.7336736e+00f,  -6.3057430e-03f, -5.2555997e-02f, -5.6261116e-01f,
+    4.9161855e-03f,  4.3844986e+00f,  3.1030044e-01f,  -4.4900626e-01f,
+    -6.2084440e-02f, 1.1084561e-01f,  6.9612509e-01f,  4.9161855e-03f,
+    3.6297846e+00f,  7.4393764e+00f,  4.1029959e+00f,  8.4158558e-01f,
+    1.7579438e-01f,  1.7431067e-01f,  4.9161855e-03f,  1.5189036e+00f,
+    1.2657379e+00f,  -8.1859761e-01f, -3.1755473e-02f, -8.2581156e-01f,
+    -4.7878733e-01f, 4.9161855e-03f,  3.5807536e+00f,  2.8411615e+00f,
+    7.1922555e+00f,  2.9297936e-01f,  2.7300882e-01f,  -3.0718929e-01f,
+    4.9161855e-03f,  1.8796552e+00f,  4.8671743e-01f,  1.5402852e+00f,
+    -1.3353029e+00f, 2.7250770e-01f,  -2.5658351e-01f, 4.9161855e-03f,
+    1.1553524e+00f,  -2.7610519e+00f, -5.3075476e+00f, -5.2538043e-01f,
+    -2.1537741e-01f, 6.8323410e-01f,  4.9161855e-03f,  3.0374799e+00f,
+    1.7371255e+00f,  3.3680525e+00f,  3.2494023e-01f,  3.6663204e-01f,
+    -3.6701422e-02f, 4.9161855e-03f,  7.4782655e-02f,  9.2720592e-01f,
+    -4.8526448e-01f, 1.4851030e-02f,  3.2096094e-01f,  -5.2963793e-01f,
+    4.9161855e-03f,  -6.2992406e-01f, -3.6588037e-01f, 2.3253849e+00f,
+    -5.8190042e-01f, -4.1033864e-01f, 8.8333249e-01f,  4.9161855e-03f,
+    1.4884578e+00f,  -1.0439763e+00f, 5.9878411e+00f,  -3.7201801e-01f,
+    2.4588369e-03f,  4.5768097e-01f,  4.9161855e-03f,  3.1809483e+00f,
+    2.5962567e-01f,  -8.4237391e-01f, -1.3639174e-01f, -5.9878516e-01f,
+    -4.1162002e-01f, 4.9161855e-03f,  1.0680166e-01f,  1.0052605e+01f,
+    -6.3342768e-01f, 2.9385975e-01f,  8.4131043e-03f,  -1.8112695e-01f,
+    4.9161855e-03f,  -1.4464878e+00f, 2.6160688e+00f,  -2.5026495e+00f,
+    1.1747682e-01f,  1.0280722e+00f,  -4.8386863e-01f, 4.9161855e-03f,
+    9.4073653e-01f,  -1.4247403e+00f, -1.0551541e+00f, 1.2492497e-01f,
+    -7.0053712e-03f, 1.3082508e+00f,  4.9161855e-03f,  2.2290568e+00f,
+    -6.5506225e+00f, -2.4433014e+00f, 1.2130931e-01f,  -1.1610405e-01f,
+    -4.5584488e-01f, 4.9161855e-03f,  -1.9498895e+00f, 4.6767030e+00f,
+    -3.4168692e+00f, 1.1597754e-01f,  -8.7749928e-01f, -3.8664725e-01f,
+    4.9161855e-03f,  4.6785226e+00f,  2.6460407e+00f,  6.4718187e-01f,
+    -1.6712719e-01f, 5.7993102e-01f,  -4.9562579e-01f, 4.9161855e-03f,
+    2.1456182e+00f,  1.9635123e+00f,  -3.8655360e+00f, -2.7077436e-01f,
+    -1.8299668e-01f, -4.3573025e-01f, 4.9161855e-03f,  -1.9993131e+00f,
+    2.9507306e-01f,  -4.4145888e-01f, -1.6663829e+00f, 1.0946865e-01f,
+    3.7640512e-01f,  4.9161855e-03f,  1.4831481e+00f,  4.8473382e+00f,
+    2.7406850e+00f,  -5.7960081e-01f, 3.3503184e-01f,  4.2113072e-01f,
+    4.9161855e-03f,  1.1654446e+01f,  -3.2936807e+00f, 8.0157871e+00f,
+    -8.8741958e-02f, 1.3227934e-01f,  -2.1814951e-01f, 4.9161855e-03f,
+    -3.4944072e-01f, 7.0909047e-01f,  -1.2318096e+00f, 6.4097571e-01f,
+    -1.4119187e-01f, -7.6075204e-02f, 4.9161855e-03f,  -7.1035066e+00f,
+    1.9865555e+00f,  4.9796591e+00f,  1.8174887e-01f,  -3.2036242e-01f,
+    -7.0522577e-02f, 4.9161855e-03f,  8.1799567e-01f,  6.6474547e+00f,
+    -2.3917232e+00f, -3.0054757e-01f, -4.3092096e-01f, 7.3004472e-03f,
+    4.9161855e-03f,  -1.9377208e+00f, -2.6893675e+00f, 1.4853388e+00f,
+    -3.0860919e-01f, 3.1042361e-01f,  -3.0216944e-01f, 4.9161855e-03f,
+    4.0350935e-01f,  -1.2919564e+00f, -2.7707601e+00f, -1.4096673e-01f,
+    4.8063359e-01f,  1.2655888e-01f,  4.9161855e-03f,  -2.1167871e-01f,
+    1.0147147e+00f,  3.1870842e-01f,  -1.0515012e+00f, 7.5543255e-01f,
+    8.6726433e-01f,  4.9161855e-03f,  -4.6613235e+00f, -3.2844503e+00f,
+    1.5193036e+00f,  -7.0714578e-02f, 1.3104446e-01f,  3.8191986e-01f,
+    4.9161855e-03f,  5.7801533e-01f,  1.2869422e+01f,  -1.0647977e+01f,
+    3.0585650e-01f,  5.4061092e-02f,  -1.0565475e-01f, 4.9161855e-03f,
+    -3.5002222e+00f, -7.0146608e-01f, -6.2259334e-01f, 1.0736943e+00f,
+    -3.9632544e-01f, -2.6976940e-01f, 4.9161855e-03f,  -4.5761476e+00f,
+    4.6518782e-01f,  -8.3545198e+00f, 4.5499223e-01f,  -2.9078165e-01f,
+    4.0210626e-01f,  4.9161855e-03f,  -3.2152455e+00f, -4.4984317e+00f,
+    4.0649209e+00f,  1.3535073e-01f,  -4.9793366e-02f, 6.3251072e-01f,
+    4.9161855e-03f,  -2.2758319e+00f, 2.1843377e-01f,  1.8218734e+00f,
+    4.5802888e-01f,  4.3781579e-01f,  3.6604026e-01f,  4.9161855e-03f,
+    5.2763236e-01f,  -3.6522732e+00f, -4.1599369e+00f, -1.1727697e-01f,
+    -4.1723618e-01f, 5.8072770e-01f,  4.9161855e-03f,  8.4461415e-01f,
+    9.8445374e-01f,  3.5183206e+00f,  5.2661824e-01f,  3.9396206e-01f,
+    4.3828052e-01f,  4.9161855e-03f,  9.4771171e-01f,  -1.1062837e+01f,
+    1.8483003e+00f,  -3.5702106e-01f, 3.6815599e-01f,  -1.9429210e-01f,
+    4.9161855e-03f,  -5.0235379e-01f, -3.3477690e+00f, 1.8850605e+00f,
+    7.7522898e-01f,  8.8844210e-02f,  1.9595140e-01f,  4.9161855e-03f,
+    -9.4192564e-01f, 3.9732727e-01f,  5.7283994e-02f,  -1.3026857e+00f,
+    -6.6133314e-01f, 2.9416299e-01f,  4.9161855e-03f,  -5.0071373e+00f,
+    4.9481745e+00f,  -4.5885653e+00f, -7.2974527e-01f, -2.2810711e-01f,
+    -1.2024256e-01f, 4.9161855e-03f,  7.1727300e-01f,  3.8456815e-01f,
+    1.6282324e+00f,  -5.8138424e-01f, 4.9471337e-01f,  -3.9108536e-01f,
+    4.9161855e-03f,  8.2024693e-01f,  -6.8197541e+00f, -2.0822369e-01f,
+    -3.2457495e-01f, 9.2890322e-02f,  -3.1603387e-01f, 4.9161855e-03f,
+    2.6186655e+00f,  8.4280217e-01f,  1.4586608e+00f,  2.1663409e-01f,
+    1.3719971e-01f,  4.5461830e-01f,  4.9161855e-03f,  2.0187883e+00f,
+    -2.6526947e+00f, -7.1162456e-01f, 6.2822074e-02f,  7.1879733e-01f,
+    -4.9643615e-01f, 4.9161855e-03f,  6.7031212e+00f,  9.5287399e+00f,
+    5.1319051e+00f,  -4.5553867e-02f, 2.4826910e-01f,  -1.7123973e-01f,
+    4.9161855e-03f,  6.6973624e+00f,  -4.0875664e+00f, -3.0615408e+00f,
+    3.8208425e-01f,  -1.1532618e-01f, 2.9913893e-01f,  4.9161855e-03f,
+    2.0527894e+00f,  -8.4256897e+00f, 5.1228266e+00f,  -2.8846246e-01f,
+    -2.7936585e-03f, 4.5650041e-01f,  4.9161855e-03f,  -2.7092569e+00f,
+    -9.3979639e-01f, 3.3981374e-01f,  -1.4305636e-01f, 2.6583475e-01f,
+    1.2018280e-01f,  4.9161855e-03f,  -2.8628296e-01f, -4.5522223e+00f,
+    -1.8526778e+00f, 5.9731436e-01f,  3.5802311e-01f,  -2.2250395e-01f,
+    4.9161855e-03f,  -2.9563310e+00f, 5.0667650e-01f,  1.4143577e+00f,
+    6.1369061e-01f,  3.2685769e-01f,  -4.7347897e-01f, 4.9161855e-03f,
+    5.6968536e+00f,  -2.7288382e+00f, 2.8761234e+00f,  3.4138760e-01f,
+    1.4801402e-01f,  -2.8645852e-01f, 4.9161855e-03f,  -1.9916102e+00f,
+    5.4126325e+00f,  -4.8872595e+00f, 7.6246566e-01f,  2.3227106e-01f,
+    4.7669503e-01f,  4.9161855e-03f,  -2.1705077e+00f, 4.0323458e+00f,
+    4.9479923e+00f,  1.0430798e-01f,  2.3089279e-01f,  -5.2287728e-01f,
+    4.9161855e-03f,  -2.2662840e+00f, 8.9089022e+00f,  -7.7135497e-01f,
+    1.8162894e-01f,  4.0866244e-01f,  5.3680921e-01f,  4.9161855e-03f,
+    -1.0269644e+00f, -1.4122422e-01f, -1.9169942e-01f, -8.8593525e-01f,
+    1.6215587e+00f,  8.8405871e-01f,  4.9161855e-03f,  4.6594944e+00f,
+    -1.6808683e+00f, -6.3804030e+00f, 4.0089998e-01f,  3.2192758e-01f,
+    -6.9397962e-01f, 4.9161855e-03f,  4.1549420e+00f,  8.3110952e+00f,
+    5.8868928e+00f,  2.2127461e-01f,  -7.9492927e-02f, 3.2893412e-02f,
+    4.9161855e-03f,  1.4486778e+00f,  2.2841322e+00f,  -2.5452878e+00f,
+    7.0072806e-01f,  -1.4649132e-01f, 1.0610219e+00f,  4.9161855e-03f,
+    -2.7136266e-01f, 3.3732128e+00f,  -2.0099690e+00f, 3.3958232e-01f,
+    -4.6169385e-01f, -3.6463809e-01f, 4.9161855e-03f,  9.9050653e-01f,
+    1.2195800e+01f,  8.3389235e-01f,  1.0109326e-01f,  6.7902014e-02f,
+    3.6639729e-01f,  4.9161855e-03f,  2.1708052e+00f,  3.2507515e+00f,
+    -1.4772257e+00f, 1.7801300e-01f,  4.4694450e-01f,  3.6328074e-01f,
+    4.9161855e-03f,  -1.0298166e+00f, 3.7731926e+00f,  4.5335650e-01f,
+    1.8615964e-01f,  -1.3147214e-01f, -1.8023507e-01f, 4.9161855e-03f,
+    -6.8271005e-01f, 1.7772504e+00f,  4.4558904e-01f,  -2.9828987e-01f,
+    3.7757024e-01f,  1.2474483e+00f,  4.9161855e-03f,  2.2250241e-01f,
+    -1.6831324e-01f, -2.4957304e+00f, -2.1897994e-01f, -7.1676075e-01f,
+    -6.4455205e-01f, 4.9161855e-03f,  3.8112044e-01f,  -7.1052194e-02f,
+    -2.8060465e+00f, 4.4627541e-01f,  -1.5042870e-01f, -8.0832672e-01f,
+    4.9161855e-03f,  -1.0434804e+01f, -7.9979901e+00f, 5.2915440e+00f,
+    1.8933946e-01f,  -3.7415317e-01f, -3.9454479e-02f, 4.9161855e-03f,
+    -5.5525690e-01f, 2.9763732e+00f,  1.3161091e+00f,  -2.9539576e-01f,
+    1.2798968e-01f,  -1.0036783e+00f, 4.9161855e-03f,  -7.1574326e+00f,
+    6.7528421e-01f,  -6.8135509e+00f, -4.9650958e-01f, -2.6634148e-01f,
+    8.0632843e-02f,  4.9161855e-03f,  -1.9677415e-01f, -3.1772666e-02f,
+    -3.1380123e-01f, 5.2750385e-01f,  -1.2655318e-01f, -5.0206524e-01f,
+    4.9161855e-03f,  -3.7813017e+00f, 3.1822944e+00f,  3.9493024e+00f,
+    2.2256976e-01f,  3.6762279e-01f,  -1.4561446e-01f, 4.9161855e-03f,
+    -2.4210865e+00f, -1.5335252e+00f, 1.2370416e+00f,  4.4264695e-01f,
+    -5.3884721e-01f, 7.0146704e-01f,  4.9161855e-03f,  2.5519440e-01f,
+    -3.1845915e+00f, -1.6156477e+00f, -4.8931929e-01f, -5.0698853e-01f,
+    -2.0260869e-01f, 4.9161855e-03f,  7.2150087e-01f,  -1.6385086e+00f,
+    -3.1234305e+00f, 6.8608865e-02f,  -2.3429663e-01f, -7.6298904e-01f,
+    4.9161855e-03f,  -2.9550021e+00f, 7.5033283e-01f,  5.6401677e+00f,
+    6.5824181e-02f,  -3.4010240e-01f, 3.2443497e-01f,  4.9161855e-03f,
+    -1.5270572e+00f, -3.5373411e+00f, 1.5693500e+00f,  3.7276837e-01f,
+    2.1695007e-01f,  3.8393747e-02f,  4.9161855e-03f,  -5.1589422e+00f,
+    -6.3681526e+00f, 1.0760841e+00f,  -2.5135091e-01f, 3.0708104e-01f,
+    -4.9483731e-01f, 4.9161855e-03f,  1.8361908e+00f,  -4.4602613e+00f,
+    -3.4919205e-01f, -7.2775108e-01f, -2.0868689e-01f, -3.1512517e-01f,
+    4.9161855e-03f,  -3.8785400e+00f, -7.6205726e+00f, -7.8829169e+00f,
+    8.1175379e-04f,  1.0576858e-01f,  1.8129656e-01f,  4.9161855e-03f,
+    7.1177387e-01f,  8.1885141e-01f,  -1.7217830e+00f, -1.9208851e-01f,
+    -1.3030907e+00f, 4.7598522e-02f,  4.9161855e-03f,  -3.6250098e+00f,
+    2.8762753e+00f,  2.9860623e+00f,  2.3144880e-01f,  2.8537375e-01f,
+    -1.1493211e-01f, 4.9161855e-03f,  7.3697476e+00f,  -3.4015975e+00f,
+    -1.8899328e+00f, -1.5028998e-01f, 8.1884658e-01f,  2.3511624e-01f,
+    4.9161855e-03f,  1.2574476e+00f,  -5.2913986e-02f, -5.0422925e-01f,
+    -5.7174575e-01f, 3.9997689e-02f,  -1.3258116e-01f, 4.9161855e-03f,
+    -1.0631522e+01f, 3.2686024e+00f,  4.3932638e+00f,  9.8838761e-02f,
+    -3.1671458e-01f, -9.2160270e-02f, 4.9161855e-03f,  2.5545301e+00f,
+    3.9265974e+00f,  -3.6398952e+00f, 3.6835317e-02f,  -2.1515481e-01f,
+    -4.5866296e-02f, 4.9161855e-03f,  1.0905961e+00f,  3.8440325e+00f,
+    -3.7192562e-01f, 9.2682108e-02f,  -3.4356901e-01f, -5.2209865e-02f,
+    4.9161855e-03f,  8.8744926e-01f,  2.2146291e-01f,  4.7353499e-02f,
+    4.0027612e-01f,  2.1718575e-01f,  1.1241162e+00f,  4.9161855e-03f,
+    7.4782684e-02f,  -5.8573022e+00f, 9.4727010e-01f,  -7.7142745e-02f,
+    -3.9442587e-01f, 3.3397615e-01f,  4.9161855e-03f,  2.5723341e+00f,
+    -1.2086291e+00f, 2.1621540e-01f,  2.0654669e-01f,  8.0818397e-01f,
+    3.2965580e-01f,  4.9161855e-03f,  -9.7928196e-04f, 1.0167804e+00f,
+    1.2956423e+00f,  -1.5153140e-03f, -5.2789587e-01f, -1.6390795e-01f,
+    4.9161855e-03f,  1.2305754e-01f,  -6.3046426e-01f, 9.8316491e-01f,
+    -7.8406316e-01f, 8.6710081e-02f,  8.5524148e-01f,  4.9161855e-03f,
+    -9.9739094e+00f, 5.3992839e+00f,  -6.8508654e+00f, -3.8141125e-01f,
+    4.1228893e-01f,  1.7802539e-01f,  4.9161855e-03f,  -4.6988902e+00f,
+    1.0152538e+00f,  -2.2309287e-01f, 8.4234136e-01f,  -4.0990266e-01f,
+    -2.6733798e-01f, 4.9161855e-03f,  -5.5058222e+00f, 5.7907748e+00f,
+    -2.7843678e+00f, 2.1375868e-01f,  3.8807499e-01f,  -7.7388234e-02f,
+    4.9161855e-03f,  3.3045163e+00f,  -1.1770072e+00f, -1.5641589e-02f,
+    -5.1482927e-02f, -1.8373632e-01f, 4.0466342e-02f,  4.9161855e-03f,
+    1.7315409e+00f,  2.1844769e-01f,  1.4304966e-01f,  -1.0893430e+00f,
+    -2.0861734e-02f, -8.7531722e-01f, 4.9161855e-03f,  1.5424440e+00f,
+    -7.2086272e+00f, 9.1622877e+00f,  -3.6271956e-02f, -4.7172168e-01f,
+    -2.1003175e-01f, 4.9161855e-03f,  -2.7083893e+00f, 8.6804676e+00f,
+    -3.2331553e+00f, 2.6908439e-01f,  -3.4953970e-01f, -2.4492468e-01f,
+    4.9161855e-03f,  -5.1852617e+00f, 9.4568640e-01f,  -5.0578399e+00f,
+    -4.4451976e-01f, 3.1893823e-01f,  -7.9074281e-01f, 4.9161855e-03f,
+    1.1899835e+00f,  1.9693819e+00f,  -3.3153507e-01f, -3.4873661e-01f,
+    -2.0391415e-01f, -4.9932879e-01f, 4.9161855e-03f,  1.1360967e+01f,
+    -3.9719882e+00f, 3.7921674e+00f,  1.0489298e-01f,  -7.5027570e-02f,
+    -3.0018815e-01f, 4.9161855e-03f,  4.6038687e-02f,  -8.5388380e-01f,
+    -3.9826047e+00f, -7.2902948e-01f, 9.6215010e-01f,  3.9737353e-01f,
+    4.9161855e-03f,  -3.0697758e+00f, 3.4199128e+00f,  1.8134683e+00f,
+    3.3476505e-01f,  7.4594718e-01f,  1.2985985e-01f,  4.9161855e-03f,
+    8.6808662e+00f,  1.2434139e+00f,  5.8766375e+00f,  5.2469056e-03f,
+    2.1616346e-01f,  -1.5495627e-01f, 4.9161855e-03f,  -1.5893596e+00f,
+    -8.3871913e-01f, -3.5381632e+00f, -5.4525936e-01f, -3.4302887e-01f,
+    7.9525971e-01f,  4.9161855e-03f,  -3.4713862e+00f, 3.3892400e+00f,
+    -3.1186423e-01f, -8.2310215e-02f, 2.3830847e-01f,  -4.0828380e-01f,
+    4.9161855e-03f,  4.6376261e-01f,  -2.3504751e+00f, 8.7379980e+00f,
+    5.9576607e-01f,  4.3759072e-01f,  -2.9496548e-01f, 4.9161855e-03f,
+    7.3793805e-01f,  -3.1191103e+00f, 1.4759321e+00f,  -7.5425491e-02f,
+    -5.5234438e-01f, -5.0622556e-02f, 4.9161855e-03f,  2.1764961e-01f,
+    5.3867865e+00f,  -4.6210904e+00f, -7.5332618e-01f, 6.0661680e-01f,
+    -2.0945777e-01f, 4.9161855e-03f,  -4.8242340e+00f, 3.4368036e+00f,
+    1.7495153e+00f,  -2.2381353e-01f, 3.3742735e-01f,  -3.2996157e-01f,
+    4.9161855e-03f,  -7.6818025e-01f, 8.5186834e+00f,  -1.6621010e+00f,
+    -4.8525933e-02f, 5.1998466e-01f,  4.6652609e-01f,  4.9161855e-03f,
+    2.9274082e+00f,  1.3605498e+00f,  -1.3835232e+00f, -5.2345884e-01f,
+    -6.5272665e-01f, -8.2079905e-01f, 4.9161855e-03f,  2.4002981e-01f,
+    1.6116447e+00f,  5.7768559e-01f,  5.4355770e-01f,  -6.6993758e-02f,
+    8.4612656e-01f,  4.9161855e-03f,  3.7747231e+00f,  3.9674454e+00f,
+    -2.8348827e+00f, 1.7560831e-01f,  2.9448298e-01f,  1.5694165e-01f,
+    4.9161855e-03f,  -5.0004256e-01f, -6.5786219e+00f, 2.3221543e+00f,
+    1.6767733e-01f,  -4.3491575e-01f, -4.9816232e-02f, 4.9161855e-03f,
+    -1.4260645e-01f, -1.7102236e+00f, 1.1363747e+00f,  6.6301334e-01f,
+    -2.4057649e-01f, -5.2986807e-01f, 4.9161855e-03f,  -4.0897638e-01f,
+    1.3778459e+00f,  -3.2818675e+00f, 3.0937094e-02f,  6.3409823e-01f,
+    1.9686022e-01f,  4.9161855e-03f,  -3.7516546e+00f, 7.8061295e+00f,
+    -3.6109817e+00f, 3.9526541e-02f,  -2.5923508e-01f, 5.5310154e-01f,
+    4.9161855e-03f,  -2.1762199e+00f, 6.0308385e-01f,  -3.6948242e+00f,
+    1.5432464e-01f,  3.8322693e-01f,  3.5903120e-01f,  4.9161855e-03f,
+    9.3360925e-01f,  2.7155597e+00f,  -2.8619468e+00f, 4.4640329e-01f,
+    -9.5445514e-01f, 2.1085814e-01f,  4.9161855e-03f,  4.6537805e+00f,
+    3.6865804e-01f,  -6.2987547e+00f, 9.5986009e-02f,  -3.3649752e-01f,
+    1.7111708e-01f,  4.9161855e-03f,  -3.3964384e+00f, -4.1135290e-01f,
+    3.4448152e+00f,  -2.7269700e-01f, 3.3467367e-02f,  1.3824220e-01f,
+    4.9161855e-03f,  -2.8862083e+00f, 1.4199774e+00f,  1.1956720e+00f,
+    -2.1196423e-01f, 1.6710386e-01f,  -7.8150398e-01f, 4.9161855e-03f,
+    -9.9249439e+00f, -1.1378767e+00f, -5.6529598e+00f, -1.1644518e-01f,
+    -4.4520864e-01f, -3.7078220e-01f, 4.9161855e-03f,  -4.7503757e+00f,
+    -3.5715990e+00f, -6.9564614e+00f, -2.7867481e-01f, -7.9874322e-04f,
+    -1.8117830e-01f, 4.9161855e-03f,  2.7064116e+00f,  -2.6025534e+00f,
+    4.0725183e+00f,  -2.0042401e-02f, 2.1532330e-01f,  5.4155058e-01f,
+    4.9161855e-03f,  -2.3189397e-01f, 2.0117912e+00f,  9.4101083e-01f,
+    -3.6788115e-01f, 1.9799615e-01f,  -5.7828712e-01f, 4.9161855e-03f,
+    6.1443710e-01f,  1.0359978e+01f,  -6.5683085e-01f, -2.9390916e-01f,
+    -1.7937448e-02f, -4.1290057e-01f, 4.9161855e-03f,  -1.6002332e+00f,
+    3.1032276e-01f,  -1.9844985e+00f, -1.0407658e+00f, -1.2830317e-01f,
+    -5.4244572e-01f, 4.9161855e-03f,  -3.3518040e+00f, 4.3048638e-01f,
+    2.9040217e+00f,  -5.7252389e-01f, -3.7053362e-01f, -4.3022564e-01f,
+    4.9161855e-03f,  2.7084321e-01f,  1.3709670e+00f,  5.6227082e-01f,
+    2.4766102e-04f,  -6.2983495e-01f, -6.4000416e-01f, 4.9161855e-03f,
+    3.7130663e+00f,  -1.4099832e+00f, 2.2975676e+00f,  -5.7286900e-01f,
+    3.0302069e-01f,  -8.6501710e-02f, 4.9161855e-03f,  -1.5288106e+00f,
+    5.7587013e+00f,  -2.2268498e+00f, -5.1526409e-01f, 4.1919168e-02f,
+    6.0701624e-02f,  4.9161855e-03f,  -3.5371178e-01f, -1.0611730e+00f,
+    -2.4770358e+00f, -3.1260499e-01f, -1.8756437e-01f, 7.0527822e-01f,
+    4.9161855e-03f,  2.9468551e+00f,  -9.5992953e-01f, -1.6315839e+00f,
+    3.8581538e-01f,  6.2902999e-01f,  4.5568669e-01f,  4.9161855e-03f,
+    2.1884456e-02f,  -3.3141639e+00f, -2.3209243e+00f, 1.2527181e-01f,
+    7.3642576e-01f,  2.6096076e-01f,  4.9161855e-03f,  4.9121472e-01f,
+    -3.3519859e+00f, -2.0783453e+00f, 3.8152084e-01f,  2.9019746e-01f,
+    -1.5313545e-01f, 4.9161855e-03f,  -5.9925079e-01f, 2.3398435e-01f,
+    -5.2470636e-01f, -9.7035193e-01f, -1.3915922e-01f, -6.1820799e-01f,
+    4.9161855e-03f,  1.2211286e-02f,  -2.3050921e+00f, 2.5254521e+00f,
+    9.2945248e-01f,  2.9722992e-01f,  -7.8055942e-01f, 4.9161855e-03f,
+    -1.0353497e+00f, 7.0227325e-01f,  9.7704284e-02f,  1.9950202e-01f,
+    -1.2632115e+00f, -4.6897095e-01f, 4.9161855e-03f,  -1.4119594e+00f,
+    -1.7594622e-01f, -2.2044359e-01f, -1.0035964e+00f, 2.3804934e-01f,
+    -1.0056585e+00f, 4.9161855e-03f,  1.3683796e+00f,  1.2869899e+00f,
+    -3.4951594e-01f, 6.3419992e-01f,  1.8578966e-01f,  -1.1485415e-03f,
+    4.9161855e-03f,  -4.9956730e-01f, 5.8366477e-01f,  -2.4063723e+00f,
+    -1.3337563e+00f, 3.0105230e-01f,  4.9164304e-01f,  4.9161855e-03f,
+    -5.7258811e+00f, 3.1193795e+00f,  6.1532688e+00f,  -2.8648955e-01f,
+    3.7334338e-01f,  4.4397853e-02f,  4.9161855e-03f,  -3.1787193e+00f,
+    -6.1684477e-01f, 7.8470999e-01f,  -2.7169862e-01f, 6.2983268e-01f,
+    -4.0990084e-01f, 4.9161855e-03f,  -5.8536601e+00f, 3.1374009e+00f,
+    1.1196659e+01f,  3.6306509e-01f,  1.2497923e-01f,  -3.2900009e-01f,
+    4.9161855e-03f,  -1.4336401e+00f, 3.6423879e+00f,  2.9455814e-01f,
+    5.0265640e-02f,  1.3367407e-01f,  1.7864491e-01f,  4.9161855e-03f,
+    -6.7320728e-01f, -3.4796970e+00f, 3.0281281e+00f,  8.1557673e-01f,
+    2.8329834e-01f,  6.9728293e-02f,  4.9161855e-03f,  8.7235200e-01f,
+    -6.2127099e+00f, -6.7709522e+00f, -3.3463880e-01f, 2.5431144e-01f,
+    2.1056361e-01f,  4.9161855e-03f,  7.4262130e-01f,  2.8014413e-01f,
+    1.5717365e+00f,  5.2282453e-01f,  -1.4114179e-01f, -2.9954717e-01f,
+    4.9161855e-03f,  -2.8262016e-01f, -2.3039928e-01f, -1.7463644e-01f,
+    -1.2221454e+00f, -1.3235773e-01f, 1.2992574e+00f,  4.9161855e-03f,
+    9.7284031e-01f,  2.6330092e+00f,  -5.6705689e-01f, 4.5766715e-02f,
+    -7.9673088e-01f, 2.4375146e-02f,  4.9161855e-03f,  1.6221833e-01f,
+    1.1455119e+00f,  -7.3165691e-01f, -9.6261966e-01f, -6.7772681e-01f,
+    -5.0895005e-01f, 4.9161855e-03f,  -1.3145079e-01f, -9.8977530e-01f,
+    1.8190552e-01f,  -1.3086063e+00f, -4.5441660e-01f, -1.5140590e-01f,
+    4.9161855e-03f,  3.6631203e-01f,  -5.5953679e+00f, 1.8515537e+00f,
+    -1.1835757e-01f, 3.4308839e-01f,  -7.4142253e-01f, 4.9161855e-03f,
+    1.7894655e+00f,  3.2340016e+00f,  -1.9597653e+00f, 6.0638177e-01f,
+    2.4627247e-01f,  3.7773961e-01f,  4.9161855e-03f,  -2.3644276e+00f,
+    2.2999804e+00f,  3.0362730e+00f,  -1.7229168e-01f, 4.5280039e-01f,
+    2.7328429e-01f,  4.9161855e-03f,  -5.4846001e-01f, -5.3978336e-01f,
+    -1.8764967e-01f, 2.6570693e-01f,  5.1651460e-01f,  1.3129328e+00f,
+    4.9161855e-03f,  -2.0572522e+00f, 1.6284016e+00f,  -1.8220216e+00f,
+    9.3645245e-01f,  -3.2554824e-02f, -3.3085054e-01f, 4.9161855e-03f,
+    2.8688140e+00f,  1.0440081e+00f,  -2.6101885e+00f, 9.1692185e-01f,
+    5.9481817e-01f,  -2.7978235e-01f, 4.9161855e-03f,  -6.8651867e+00f,
+    -5.7501441e-01f, -4.7405205e+00f, -3.0854857e-01f, -3.5015658e-01f,
+    -1.4947073e-01f, 4.9161855e-03f,  -3.0446174e+00f, -1.3189298e+00f,
+    -4.4526964e-01f, -6.5238595e-01f, 2.5125405e-01f,  -5.7521623e-01f,
+    4.9161855e-03f,  1.5872617e+00f,  5.2730882e-01f,  4.1056418e-01f,
+    5.3521061e-01f,  -2.6350120e-01f, 4.5998412e-01f,  4.9161855e-03f,
+    6.9045973e-01f,  1.0874684e+01f,  3.8595419e+00f,  7.3225692e-02f,
+    1.6602789e-01f,  2.9183870e-02f,  4.9161855e-03f,  2.5059824e+00f,
+    3.0164742e-01f,  -2.6125145e+00f, -6.7855960e-01f, 1.4620833e-01f,
+    -4.8753867e-01f, 4.9161855e-03f,  -7.0119238e-01f, -4.6561737e+00f,
+    5.0049788e-01f,  6.3351721e-01f,  -1.2233253e-01f, -1.0171306e+00f,
+    4.9161855e-03f,  -1.4126154e+00f, 1.5292485e+00f,  1.1102905e+00f,
+    5.6266105e-01f,  2.2784410e-01f,  -3.4159967e-01f, 4.9161855e-03f,
+    4.3937855e+00f,  -9.0735254e+00f, 5.3568482e-02f,  -3.6723921e-01f,
+    2.5324371e-02f,  -3.5203284e-01f, 4.9161855e-03f,  1.0691199e+00f,
+    9.1392813e+00f,  -1.8874600e+00f, 4.1842386e-01f,  -3.3132017e-01f,
+    -2.8415892e-01f, 4.9161855e-03f,  6.3374710e-01f,  2.5551131e+00f,
+    -1.3376082e+00f, 8.8185698e-01f,  -3.1284800e-01f, -3.1974831e-01f,
+    4.9161855e-03f,  2.3240130e+00f,  -9.6958154e-01f, 2.2568219e+00f,
+    2.1874893e-01f,  5.4858702e-01f,  1.1796440e+00f,  4.9161855e-03f,
+    -6.4880705e-01f, -4.1643539e-01f, 2.4768062e-01f,  3.8609762e-02f,
+    3.3259016e-01f,  2.8074173e-02f,  4.9161855e-03f,  -3.7597117e+00f,
+    4.8846607e+00f,  -1.0938429e+00f, -6.6467881e-01f, -8.3340719e-02f,
+    4.8689563e-02f,  4.9161855e-03f,  -4.0047793e+00f, -1.4552666e+00f,
+    1.5778184e+00f,  2.4722622e-01f,  -7.8449148e-01f, -3.3435026e-01f,
+    4.9161855e-03f,  -1.8003519e+00f, -3.4933102e-01f, 7.5634164e-01f,
+    1.5913263e-01f,  9.7513661e-02f,  -1.4090157e-01f, 4.9161855e-03f,
+    1.3864951e+00f,  2.6985569e+00f,  2.3058993e-03f,  1.1075522e-01f,
+    -1.2919824e-01f, 1.1517610e-01f,  4.9161855e-03f,  -2.3922668e-01f,
+    2.2126920e+00f,  -2.4308768e-01f, 1.0138559e+00f,  -6.4216942e-01f,
+    9.2315382e-01f,  4.9161855e-03f,  2.8252475e-02f,  -6.9910206e-02f,
+    -8.6733297e-02f, 4.9744871e-01f,  6.7187613e-01f,  -8.3857214e-01f,
+    4.9161855e-03f,  -1.0352776e+00f, -6.1071119e+00f, -6.1352378e-01f,
+    6.1068472e-02f,  1.9980355e-01f,  5.0907719e-01f,  4.9161855e-03f,
+    -3.4014566e+00f, -5.2502894e+00f, -1.7027566e+00f, 7.6231271e-02f,
+    -7.3322898e-01f, 5.5840131e-02f,  4.9161855e-03f,  3.2973871e+00f,
+    9.1803055e+00f,  -2.7369773e+00f, -4.8800196e-02f, 9.0026900e-02f,
+    1.8236783e-01f,  4.9161855e-03f,  1.0630187e+00f,  1.4228784e+00f,
+    1.6523427e+00f,  -5.3679055e-01f, -9.3074685e-01f, 3.0011578e-02f,
+    4.9161855e-03f,  1.1572206e+00f,  -2.5543013e-01f, -2.1824286e+00f,
+    -1.2595724e-01f, -1.0616083e-02f, 2.3030983e-01f,  4.9161855e-03f,
+    2.5068386e+00f,  -1.1058602e+00f, -5.4497904e-01f, 7.7953972e-03f,
+    6.5180337e-01f,  1.0518056e+00f,  4.9161855e-03f,  -3.4099567e+00f,
+    -9.7085774e-01f, -3.2199454e-01f, -4.2888862e-01f, 1.2847167e+00f,
+    -1.9810332e-02f, 4.9161855e-03f,  -7.9507275e+00f, 2.7512937e+00f,
+    -1.2066312e+00f, -5.8048677e-02f, -1.9168517e-01f, 1.5841363e-01f,
+    4.9161855e-03f,  2.0070002e+00f,  8.0848372e-01f,  -5.8306575e-01f,
+    5.6489501e-02f,  1.0400468e+00f,  7.4592821e-02f,  4.9161855e-03f,
+    -3.3075492e+00f, 5.1723868e-03f,  1.2259688e+00f,  -3.7866405e-01f,
+    2.0897435e-01f,  -4.6969283e-01f, 4.9161855e-03f,  3.1639171e+00f,
+    7.9925642e+00f,  8.3530025e+00f,  3.0052868e-01f,  3.7759763e-01f,
+    -1.3571468e-01f, 4.9161855e-03f,  6.7606077e+00f,  -4.7717772e+00f,
+    1.6209762e+00f,  1.2496720e-01f,  6.0480130e-01f,  -1.4095207e-01f,
+    4.9161855e-03f,  -1.8988982e-02f, -8.6652441e+00f, 1.7404547e+00f,
+    -2.0668712e-02f, -3.1590638e-01f, -2.8762558e-01f, 4.9161855e-03f,
+    2.1608517e-01f,  -7.3183303e+00f, 8.7381115e+00f,  3.9131221e-01f,
+    4.4048199e-01f,  3.9590012e-02f,  4.9161855e-03f,  6.7038679e-01f,
+    1.0129324e+00f,  2.9565723e+00f,  4.7108623e-01f,  2.0279680e-01f,
+    2.1021616e-01f,  4.9161855e-03f,  -1.5016085e+00f, -3.0173790e-01f,
+    4.6930580e+00f,  -7.9204187e-02f, 6.1659485e-01f,  1.8992449e-01f,
+    4.9161855e-03f,  -1.0115957e+01f, 7.0272775e+00f,  7.1551585e+00f,
+    3.1140697e-01f,  2.4476580e-01f,  -1.1073206e-02f, 4.9161855e-03f,
+    7.0098214e+00f,  -7.0005975e+00f, 4.2892895e+00f,  -1.6605484e-01f,
+    4.0636766e-01f,  4.3826669e-02f,  4.9161855e-03f,  6.4929256e+00f,
+    2.4614367e+00f,  1.9342548e+00f,  4.6309695e-01f,  -4.0657017e-01f,
+    8.3738111e-02f,  4.9161855e-03f,  -6.8726311e+00f, 1.3984884e+00f,
+    -6.8842149e+00f, -1.8588004e-01f, 2.0669380e-01f,  -4.8805166e-02f,
+    4.9161855e-03f,  1.3889484e+00f,  2.2851789e+00f,  2.1564157e-01f,
+    -5.2115428e-01f, 1.0890797e+00f,  -9.1116257e-02f, 4.9161855e-03f,
+    5.0277815e+00f,  2.2623856e+00f,  -8.9327949e-01f, -5.3414333e-01f,
+    -6.9451642e-01f, -4.1549006e-01f, 4.9161855e-03f,  2.4073415e+00f,
+    -1.1421194e+00f, -2.8969624e+00f, 7.1487963e-01f,  -5.4590124e-01f,
+    7.3180008e-01f,  4.9161855e-03f,  -5.5531693e-01f, 2.2001345e+00f,
+    -2.0116048e+00f, 1.3093981e-01f,  2.5000465e-01f,  -2.1139747e-01f,
+    4.9161855e-03f,  4.2677286e-01f,  -6.0805666e-01f, -9.3171977e-02f,
+    -1.3855063e+00f, 1.1107761e+00f,  -7.2346574e-01f, 4.9161855e-03f,
+    2.4118025e+00f,  -1.0817316e-01f, -1.0635827e+00f, -2.6239228e-01f,
+    3.3911133e-01f,  2.7156833e-01f,  4.9161855e-03f,  -3.1179564e+00f,
+    -3.4902298e+00f, -2.9566779e+00f, 2.6767543e-01f,  -7.4764538e-01f,
+    -4.0841797e-01f, 4.9161855e-03f,  -3.8315830e+00f, -2.8693295e-01f,
+    1.2264606e+00f,  7.1764511e-01f,  2.8744808e-01f,  1.4351748e-01f,
+    4.9161855e-03f,  2.1988783e+00f,  2.5017753e+00f,  -1.5056832e+00f,
+    5.7636356e-01f,  2.7742168e-01f,  7.5629890e-01f,  4.9161855e-03f,
+    1.3267251e+00f,  -2.3888311e+00f, -3.0874431e+00f, -5.5534047e-01f,
+    4.3828189e-01f,  1.8654108e-02f,  4.9161855e-03f,  1.8535814e+00f,
+    6.2623990e-01f,  4.7347913e+00f,  1.2577538e-01f,  1.7349112e-01f,
+    6.9316727e-01f,  4.9161855e-03f,  -2.7529378e+00f, 8.0486965e+00f,
+    -3.1460145e+00f, -3.5349842e-02f, 6.2040991e-01f,  1.2270377e-01f,
+    4.9161855e-03f,  2.7085612e+00f,  -3.1664352e+00f, -6.6098504e+00f,
+    3.9036375e-02f,  2.1786502e-01f,  -2.0975997e-01f, 4.9161855e-03f,
+    -4.3633208e+00f, -3.1873746e+00f, 3.9879792e+00f,  6.1858986e-02f,
+    5.8643478e-01f,  -2.3943076e-02f, 4.9161855e-03f,  4.4895259e-01f,
+    -8.0033627e+00f, -4.2980051e+00f, -3.5628587e-01f, 4.5871198e-02f,
+    -5.0440890e-01f, 4.9161855e-03f,  -2.0766890e+00f, -3.5453114e-01f,
+    9.5316130e-01f,  1.0685886e+00f,  -6.1404473e-01f, 4.3412864e-01f,
+    4.9161855e-03f,  4.6599789e+00f,  7.6321137e-01f,  5.1791161e-01f,
+    7.9362035e-01f,  9.4472134e-01f,  2.7195081e-01f,  4.9161855e-03f,
+    1.4204055e+00f,  1.2976053e+00f,  3.4140759e+00f,  -2.7998051e-01f,
+    9.3910992e-02f,  -2.1845722e-01f, 4.9161855e-03f,  2.0027750e+00f,
+    -5.1036304e-01f, 1.0708960e+00f,  -6.8898842e-02f, -9.0199456e-02f,
+    -6.4016253e-01f, 4.9161855e-03f,  -7.8757644e-01f, -8.2123220e-01f,
+    4.7621093e+00f,  7.5402069e-01f,  8.1605291e-01f,  -4.4496268e-01f,
+    4.9161855e-03f,  3.9144907e+00f,  2.6032176e+00f,  -6.4981570e+00f,
+    6.2727785e-01f,  2.3621082e-01f,  4.1076604e-02f,  4.9161855e-03f,
+    4.6393976e-01f,  -7.0713186e+00f, -5.4097424e+00f, -2.4060065e-01f,
+    -3.0332360e-01f, -7.6152407e-02f, 4.9161855e-03f,  2.9016802e-01f,
+    4.3169793e-01f,  -4.4491177e+00f, -2.8857490e-01f, -1.1805181e-01f,
+    -3.1993431e-01f, 4.9161855e-03f,  2.2315259e+00f,  1.0688721e+01f,
+    -3.7511113e+00f, 6.4517701e-01f,  -1.2526173e-02f, 1.8122954e-02f,
+    4.9161855e-03f,  1.0970393e+00f,  -1.1538004e+00f, 1.4049878e+00f,
+    6.5186866e-02f,  -8.7630033e-02f, 4.5490557e-01f,  4.9161855e-03f,
+    1.1630872e+00f,  -3.3586752e+00f, -5.1886854e+00f, -3.2411623e-01f,
+    -5.9357971e-01f, -1.2593243e-01f, 4.9161855e-03f,  4.1530910e+00f,
+    -3.3933678e+00f, 2.7744570e-01f,  -1.1476377e-01f, 7.1353555e-01f,
+    -1.6184010e-01f, 4.9161855e-03f,  -4.8054910e-01f, 4.0832901e+00f,
+    -6.4635271e-01f, -2.7195120e-01f, -5.6111616e-01f, -5.6885738e-02f,
+    4.9161855e-03f,  -1.0014299e+00f, 8.5553300e-01f,  -1.0487682e+00f,
+    7.9116511e-01f,  -5.8663219e-01f, -8.2652688e-01f, 4.9161855e-03f,
+    -9.7151508e+00f, 2.3307506e-02f,  -6.8767400e+00f, -5.8681035e-01f,
+    -6.3017905e-03f, 1.4554894e-01f,  4.9161855e-03f,  -7.2011065e+00f,
+    3.2089129e-03f,  -2.1682229e+00f, 9.0917677e-01f,  2.4233872e-01f,
+    -2.4455663e-02f, 4.9161855e-03f,  2.7380750e-01f,  1.1398129e-01f,
+    -2.3251954e-01f, -6.2050128e-01f, -9.8904687e-01f, 6.1276555e-01f,
+    4.9161855e-03f,  7.5309634e-01f,  9.1240531e-01f,  -1.4304330e+00f,
+    -2.1415049e-01f, -2.5438640e-01f, 6.6564828e-01f,  4.9161855e-03f,
+    2.2702084e+00f,  -3.4885776e+00f, -1.9519736e+00f, 8.8171542e-01f,
+    6.7572936e-02f,  -2.9678118e-01f, 4.9161855e-03f,  9.8536015e-01f,
+    -3.4591892e-01f, -1.7775294e+00f, 3.6205220e-01f,  4.7126248e-01f,
+    -2.4621746e-01f, 4.9161855e-03f,  2.3693357e+00f,  -2.1991122e+00f,
+    2.3587375e+00f,  -3.0854723e-01f, -2.9487208e-01f, 5.7897805e-03f,
+    4.9161855e-03f,  -4.2711544e+00f, 4.5261446e-01f,  -3.1665640e+00f,
+    5.5260682e-01f,  -1.5946336e-01f, 4.9966860e-01f,  4.9161855e-03f,
+    2.4691024e-01f,  -6.0334170e-01f, 2.8205657e-01f,  9.6880984e-01f,
+    -4.1677353e-01f, -3.7562776e-01f, 4.9161855e-03f,  4.0299382e+00f,
+    -9.7706246e-01f, -3.1289804e+00f, -5.0271988e-01f, -9.5663056e-02f,
+    -5.5597544e-01f, 4.9161855e-03f,  -1.4471877e+00f, 3.3080500e-02f,
+    -6.4930863e+00f, 3.4223673e-01f,  -1.0339795e-01f, -7.8664470e-01f,
+    4.9161855e-03f,  2.8359787e+00f,  -1.1080276e+00f, 1.2509952e-02f,
+    9.0080702e-01f,  1.1740266e-01f,  5.4245752e-01f,  4.9161855e-03f,
+    -3.7335305e+00f, -2.1712480e+00f, -2.3682001e+00f, 4.0681985e-01f,
+    3.5981131e-01f,  -5.3326219e-01f, 4.9161855e-03f,  -4.8090410e+00f,
+    -1.9474498e+00f, 2.4090657e+00f,  8.7456591e-03f,  6.5673703e-01f,
+    -8.0464506e-01f, 4.9161855e-03f,  1.3003083e+00f,  -6.5911740e-01f,
+    -1.0162184e+00f, -5.0886953e-01f, 6.4523989e-01f,  7.5331908e-01f,
+    4.9161855e-03f,  -1.8457617e+00f, 1.8241471e+00f,  4.6184689e-01f,
+    -8.8451785e-01f, -4.9429384e-01f, 6.7950976e-01f,  4.9161855e-03f,
+    -3.0025485e+00f, -9.9487150e-01f, -2.7002697e+00f, 7.0347533e-02f,
+    2.9156083e-01f,  7.6180387e-01f,  4.9161855e-03f,  2.5102882e+00f,
+    2.7117646e+00f,  1.5375283e-01f,  4.7345707e-01f,  6.4748484e-01f,
+    1.9306719e-01f,  4.9161855e-03f,  1.0510226e+00f,  2.7516723e+00f,
+    8.3884163e+00f,  -5.9344631e-01f, -7.9659626e-02f, -5.8666283e-01f,
+    4.9161855e-03f,  -1.0505353e+00f, 3.3535776e+00f,  -6.1254048e+00f,
+    -1.4054072e-01f, -6.8188941e-01f, 1.2014035e-01f,  4.9161855e-03f,
+    -4.7317395e+00f, -1.5050373e+00f, -1.0340016e+00f, -5.4866910e-01f,
+    -6.9549009e-02f, -1.7546920e-02f, 4.9161855e-03f,  -6.3253093e-01f,
+    -2.2239773e+00f, -3.4673421e+00f, -3.8212058e-01f, -4.2768320e-01f,
+    -8.9828700e-01f, 4.9161855e-03f,  -9.1951513e+00f, -2.1846522e-01f,
+    2.2048602e+00f,  3.9210308e-01f,  1.1803684e-01f,  -3.3804283e-01f,
+    4.9161855e-03f,  5.6112452e+00f,  -1.1851096e+00f, -4.7329560e-01f,
+    -4.7372201e-01f, 1.2544686e-01f,  -7.2246857e-02f, 4.9161855e-03f,
+    -4.7142444e+00f, -5.9439855e+00f, 9.1472077e-01f,  -2.4894956e-02f,
+    1.5156128e-01f,  -6.4611149e-01f, 4.9161855e-03f,  -2.7767272e+00f,
+    1.6594193e+00f,  -3.3474880e-01f, -1.1401707e-01f, 2.1313189e-01f,
+    6.8303011e-02f,  4.9161855e-03f,  -5.6905332e+00f, -5.5028739e+00f,
+    -3.0428081e+00f, 1.6842730e-01f,  1.3743103e-01f,  7.1929646e-01f,
+    4.9161855e-03f,  -3.6480770e-01f, 2.5397754e+00f,  6.6113372e+00f,
+    2.6854122e-02f,  8.9688838e-02f,  2.4845721e-01f,  4.9161855e-03f,
+    1.1257753e-02f,  -3.5081968e+00f, -3.8531234e+00f, -8.3623715e-03f,
+    -2.7864194e-01f, 7.5133163e-01f,  4.9161855e-03f,  -2.1186159e+00f,
+    -1.4265026e-01f, -4.7930977e-01f, 7.5187445e-01f,  -3.0659360e-01f,
+    -5.6690919e-01f, 4.9161855e-03f,  -2.1828375e+00f, -1.3879466e+00f,
+    -7.6735836e-01f, -1.0389584e+00f, 4.1437101e-02f,  -1.0000792e+00f,
+    4.9161855e-03f,  6.2090626e+00f,  1.1736553e+00f,  -4.2526636e+00f,
+    1.2142450e-01f,  5.4318744e-01f,  2.0043340e-01f,  4.9161855e-03f,
+    -1.0836146e+00f, 8.9775902e-01f,  3.4197550e+00f,  -2.6557192e-01f,
+    9.2125458e-01f,  9.9024296e-02f,  4.9161855e-03f,  -1.2865182e+00f,
+    -2.3779576e+00f, 1.0267714e+00f,  7.8391838e-01f,  4.7870228e-01f,
+    4.4149358e-02f,  4.9161855e-03f,  -1.7352341e+00f, -1.3976511e+00f,
+    -4.7572774e-01f, 2.7982000e-02f,  7.4574035e-01f,  -2.7491179e-01f,
+    4.9161855e-03f,  5.0951724e+00f,  7.0423117e+00f,  2.5286412e+00f,
+    -2.6083142e-03f, 8.9322343e-02f,  3.2869387e-01f,  4.9161855e-03f,
+    -2.1303716e+00f, 6.0848312e+00f,  -8.3514148e-01f, -3.9567766e-01f,
+    -2.3403384e-01f, -2.9173279e-01f, 4.9161855e-03f,  -1.7515434e+00f,
+    9.4708413e-01f,  3.6215901e-02f,  4.5563179e-01f,  9.5048505e-01f,
+    2.9654810e-01f,  4.9161855e-03f,  1.1950095e+00f,  -1.1710796e+00f,
+    -1.3799815e+00f, 1.6984344e-01f,  7.1953338e-01f,  1.3579403e-01f,
+    4.9161855e-03f,  -4.8623890e-01f, 1.5280105e+00f,  -8.2775407e-02f,
+    -1.3304896e+00f, -3.4810343e-01f, -4.6076256e-01f, 4.9161855e-03f,
+    9.7547221e-01f,  4.9570251e+00f,  -5.1642299e+00f, 3.4099441e-02f,
+    -3.5293561e-01f, 1.0691833e-01f,  4.9161855e-03f,  -5.1215482e+00f,
+    7.6466513e+00f,  4.1682534e+00f,  4.4823301e-01f,  -5.8137152e-02f,
+    2.7662936e-01f,  4.9161855e-03f,  -2.4375920e+00f, -1.7836089e+00f,
+    -1.5079217e+00f, -6.0095286e-01f, -2.9551167e-02f, 2.1610253e-01f,
+    4.9161855e-03f,  7.4673204e+00f,  3.7838652e+00f,  -4.9228561e-01f,
+    6.0762912e-01f,  -2.4980460e-01f, -2.5321558e-01f, 4.9161855e-03f,
+    -4.0324645e+00f, -3.9843252e+00f, -4.5930037e+00f, 2.8964084e-01f,
+    -4.1202495e-01f, -8.5058615e-02f, 4.9161855e-03f,  -8.1824943e-02f,
+    -2.3486829e+00f, 1.0995286e+01f,  3.1956357e-01f,  1.6018158e-01f,
+    4.5054704e-01f,  4.9161855e-03f,  -1.6341938e+00f, 4.7861454e-01f,
+    1.0732051e+00f,  -3.0942813e-01f, 1.6263852e-01f,  -9.0218359e-01f,
+    4.9161855e-03f,  5.1130285e+00f,  1.0251660e+01f,  3.3382361e+00f,
+    -8.8138595e-02f, 4.4114050e-01f,  7.7584289e-02f,  4.9161855e-03f,
+    3.2567406e+00f,  1.3417608e+00f,  3.9642146e+00f,  8.8953912e-01f,
+    -6.5337247e-01f, -3.3107799e-01f, 4.9161855e-03f,  -1.0979061e+00f,
+    -1.8919065e+00f, -4.4125028e+00f, -5.5777244e-03f, -2.9929110e-01f,
+    -1.4782820e-02f, 4.9161855e-03f,  2.9368954e+00f,  1.2449178e+00f,
+    3.7712598e-01f,  -5.6694275e-01f, -1.8658595e-01f, 8.2939780e-01f,
+    4.9161855e-03f,  3.2968307e-01f,  -7.8758967e-01f, 5.5313916e+00f,
+    -2.3851317e-01f, -2.9061828e-02f, 5.1218897e-01f,  4.9161855e-03f,
+    1.6294027e+01f,  1.0013478e+00f,  -1.8814481e+00f, -4.5474652e-02f,
+    -2.5134942e-01f, 2.1463329e-01f,  4.9161855e-03f,  1.9027195e+00f,
+    -4.2396550e+00f, -3.8553664e-01f, 4.0708203e-02f,  4.2400825e-01f,
+    -2.6634154e-01f, 4.9161855e-03f,  5.3483829e+00f,  1.2148019e+00f,
+    1.6272407e+00f,  4.4261432e-01f,  2.3098828e-01f,  4.6488896e-01f,
+    4.9161855e-03f,  -1.0967269e+00f, -2.1727502e+00f, 3.5740285e+00f,
+    4.2795753e-01f,  -2.5582397e-01f, -8.5382843e-01f, 4.9161855e-03f,
+    -1.1308995e+00f, -3.2614260e+00f, 1.0248405e-01f,  4.3666521e-01f,
+    2.0534347e-01f,  1.8441883e-01f,  4.9161855e-03f,  -6.3069844e-01f,
+    -5.5859499e+00f, -2.9028583e+00f, 2.6716343e-01f,  8.6495563e-02f,
+    1.4163621e-01f,  4.9161855e-03f,  -1.0448105e+00f, -2.6915550e+00f,
+    4.3937242e-01f,  1.4905854e-01f,  1.4194788e-01f,  -5.5911583e-01f,
+    4.9161855e-03f,  -1.8201722e-01f, 2.0135620e+00f,  -1.2912718e+00f,
+    -7.3182094e-01f, 3.0119744e-01f,  1.3420664e+00f,  4.9161855e-03f,
+    4.3227882e+00f,  2.8700411e+00f,  3.4082010e+00f,  -2.0630202e-01f,
+    3.9230373e-02f,  -5.2473974e-01f, 4.9161855e-03f,  -2.1911819e+00f,
+    1.7594986e+00f,  4.3557429e-01f,  -4.1739848e-02f, -1.0808419e+00f,
+    4.9515194e-01f,  4.9161855e-03f,  -6.2963595e+00f, 5.6766582e-01f,
+    3.5349863e+00f,  9.1807526e-01f,  -2.1020424e-02f, 7.3577203e-02f,
+    4.9161855e-03f,  1.0022669e+00f,  1.1528041e+00f,  4.1921816e+00f,
+    1.0652335e+00f,  -3.8964850e-01f, -1.4009126e-01f, 4.9161855e-03f,
+    -4.2316961e+00f, 4.2751822e+00f,  -2.8457234e+00f, -4.5489040e-01f,
+    -9.8672390e-02f, -4.5683247e-01f, 4.9161855e-03f,  -5.5923849e-02f,
+    2.0179079e-01f,  -8.5677229e-02f, 1.4024553e+00f,  2.2731241e-02f,
+    1.1460901e+00f,  4.9161855e-03f,  -1.1000372e+00f, -3.4246635e+00f,
+    3.4057906e+00f,  1.4202693e-01f,  6.2597615e-01f,  -1.0738663e-01f,
+    4.9161855e-03f,  -4.4653705e-01f, 1.2775034e+00f,  2.2382529e+00f,
+    5.8476830e-01f,  -4.0535361e-01f, -4.0663313e-02f, 4.9161855e-03f,
+    -4.3897909e-01f, -1.3838578e+00f, 3.3987734e-01f,  1.5138667e-02f,
+    5.0450855e-01f,  5.4602545e-01f,  4.9161855e-03f,  1.8766081e+00f,
+    4.0743130e-01f,  4.3787842e+00f,  -5.4253125e-01f, 1.4950061e-01f,
+    5.9302235e-01f,  4.9161855e-03f,  6.4545207e+00f,  -1.0401627e+01f,
+    4.1183372e+00f,  -1.0839933e-01f, -1.3018763e-01f, 1.5540130e-01f,
+    4.9161855e-03f,  7.2673044e+00f,  -1.0516288e+01f, 2.7968097e+00f,
+    -1.0159393e-01f, 2.5331193e-01f,  1.4689362e-01f,  4.9161855e-03f,
+    6.1752546e-01f,  -6.6539848e-01f, 1.5790042e+00f,  4.6810243e-01f,
+    4.5815071e-01f,  2.2235610e-01f,  4.9161855e-03f,  -2.7761099e+00f,
+    -1.9110548e-01f, -5.2329435e+00f, -3.8739967e-01f, 4.2028257e-01f,
+    -3.2813045e-01f, 4.9161855e-03f,  -4.8406029e+00f, 3.8548832e+00f,
+    -1.8557613e+00f, 2.4498570e-01f,  6.4757206e-03f,  4.0098479e-01f,
+    4.9161855e-03f,  4.7958903e+00f,  8.2540913e+00f,  -4.5972724e+00f,
+    3.2517269e-01f,  -1.9743598e-01f, 3.9116934e-01f,  4.9161855e-03f,
+    -4.0123963e-01f, -6.8897343e-01f, 2.7810795e+00f,  8.6007661e-01f,
+    4.9481943e-01f,  6.3873953e-01f,  4.9161855e-03f,  -1.7793112e-02f,
+    2.3105267e-01f,  1.2126515e+00f,  8.3922762e-01f,  6.6346103e-01f,
+    -3.7485829e-01f, 4.9161855e-03f,  4.3382773e+00f,  1.5613933e+00f,
+    -3.6343262e+00f, 2.1901625e-01f,  -4.1477638e-01f, 2.9508388e-01f,
+    4.9161855e-03f,  -3.0846326e+00f, -2.9579741e-01f, -2.1933334e+00f,
+    -8.2738572e-01f, -3.8238015e-02f, 9.5646584e-01f,  4.9161855e-03f,
+    8.3155890e+00f,  -1.4635040e+00f, -2.0496392e+00f, 2.4219951e-01f,
+    -4.5884025e-01f, 7.0540287e-02f,  4.9161855e-03f,  5.6816280e-01f,
+    -6.2265098e-01f, 3.0707257e+00f,  -2.3038700e-01f, 3.9930439e-01f,
+    5.3365171e-01f,  4.9161855e-03f,  8.1566572e-01f,  -6.9638162e+00f,
+    -7.0388556e+00f, 3.5479505e-02f,  -2.4836056e-01f, -3.9540595e-01f,
+    4.9161855e-03f,  6.9852066e-01f,  1.1095667e+00f,  -9.0286893e-01f,
+    9.0236127e-01f,  -3.9585066e-01f, 1.5052068e-01f,  4.9161855e-03f,
+    1.3402741e+00f,  -1.1388254e+00f, 4.0604967e-01f,  1.7726400e-01f,
+    -6.0314578e-01f, -4.2617448e-02f, 4.9161855e-03f,  2.1614170e-01f,
+    -1.2087345e+00f, 1.2808864e-01f,  -8.6612529e-01f, -1.5024263e-01f,
+    -1.2756826e+00f, 4.9161855e-03f,  -1.7573875e+00f, -7.8019910e+00f,
+    -4.3610120e+00f, -5.0785565e-01f, -1.5262808e-01f, 3.3977672e-01f,
+    4.9161855e-03f,  -4.2444706e+00f, -3.3402276e+00f, 4.5897703e+00f,
+    4.4948584e-01f,  -4.2218447e-01f, -2.3225078e-01f, 4.9161855e-03f,
+    -1.5599895e+00f, 6.0431403e-01f,  -6.1214819e+00f, -3.7734157e-01f,
+    6.6961676e-01f,  -5.8923733e-01f, 4.9161855e-03f,  2.4274066e-03f,
+    2.0610650e-01f,  6.5060280e-02f,  -1.3872069e-01f, -1.5386139e-01f,
+    -1.4900351e-01f, 4.9161855e-03f,  5.8635516e+00f,  -1.5327750e+00f,
+    -9.4521803e-01f, 5.9160584e-01f,  -5.3233933e-01f, 6.1678046e-01f,
+    4.9161855e-03f,  1.2669034e+00f,  -7.7232546e-01f, 4.1323552e+00f,
+    1.9081751e-01f,  4.8949426e-01f,  -6.8394917e-01f, 4.9161855e-03f,
+    -4.4924707e+00f, 4.5738487e+00f,  3.5510623e-01f,  -3.5472098e-01f,
+    -7.2673786e-01f, -6.5104097e-02f, 4.9161855e-03f,  1.5104092e+00f,
+    -4.5632281e+00f, -3.5052586e+00f, 3.5283920e-01f,  -2.9118979e-01f,
+    8.2751143e-01f,  4.9161855e-03f,  4.2982454e+00f,  1.4069428e+00f,
+    -1.4013999e+00f, 6.8027061e-01f,  -6.5819138e-01f, 2.9329258e-01f,
+    4.9161855e-03f,  -4.5217700e+00f, 1.0523435e+00f,  -2.2821283e+00f,
+    8.4219709e-02f,  -2.7584890e-01f, 6.7295456e-01f,  4.9161855e-03f,
+    5.2264719e+00f,  -1.4307837e+00f, -3.2340927e+00f, -7.1228206e-02f,
+    -2.1093068e-01f, -8.1525087e-01f, 4.9161855e-03f,  2.2072789e-01f,
+    3.5226672e+00f,  5.3141117e-01f,  2.0788747e-01f,  -7.2764623e-01f,
+    -2.8564626e-01f, 4.9161855e-03f,  -3.1636074e-02f, 8.5646880e-01f,
+    -3.4173810e-01f, -3.7896153e-02f, -5.9833699e-01f, 1.4943473e+00f,
+    4.9161855e-03f,  -1.2744408e+01f, -6.4827204e+00f, -3.2037690e+00f,
+    1.4006729e-01f,  -1.5453620e-01f, -4.0955124e-03f, 4.9161855e-03f,
+    -1.0058378e+00f, -2.5833434e-01f, 1.4822595e-01f,  -1.1107229e+00f,
+    5.9726620e-01f,  2.0196709e-01f,  4.9161855e-03f,  4.2273268e-01f,
+    -2.8125572e+00f, 2.0296335e+00f,  1.0897195e-01f,  -1.6817221e-01f,
+    -2.0368332e-01f, 4.9161855e-03f,  1.9776979e-01f,  -1.0086494e+01f,
+    -4.6731253e+00f, -5.0744450e-01f, -2.3384772e-01f, -2.9397570e-02f,
+    4.9161855e-03f,  3.2259061e+00f,  3.2881415e+00f,  -7.4322491e+00f,
+    4.0874067e-01f,  8.5466772e-02f,  -6.5932405e-01f, 4.9161855e-03f,
+    -5.1663625e-01f, 1.1784043e+00f,  2.6455090e+00f,  2.0466088e-01f,
+    4.6737006e-01f,  4.2897043e-01f,  4.9161855e-03f,  1.4630719e+00f,
+    2.0680771e+00f,  3.3130009e+00f,  4.1502702e-01f,  -3.7550598e-01f,
+    -4.0496603e-01f, 4.9161855e-03f,  -1.3805447e+00f, 1.4294366e+00f,
+    -5.4358429e-01f, 4.3119603e-01f,  5.1777273e-01f,  -7.8216910e-01f,
+    4.9161855e-03f,  -8.0152440e-01f, 4.0992152e-02f,  3.5590905e-01f,
+    1.0957088e-01f,  -1.2443687e+00f, 1.5310404e-01f,  4.9161855e-03f,
+    -2.9923323e-01f, 9.8219496e-01f,  1.0595788e+00f,  -3.7417653e-01f,
+    -2.7768227e-01f, 4.7627777e-02f,  4.9161855e-03f,  -1.1485790e+00f,
+    1.4198235e+00f,  -1.0913734e+00f, -1.9027448e-01f, 8.7949914e-01f,
+    3.0509982e-01f,  4.9161855e-03f,  1.4250741e+00f,  4.0770733e-01f,
+    3.9183075e+00f,  -5.2151018e-01f, 3.1245175e-01f,  8.5960224e-02f,
+    4.9161855e-03f,  1.0649577e-01f,  2.2454384e-01f,  -1.8816823e-01f,
+    -1.1840330e+00f, 1.1719378e+00f,  -1.7471904e-01f, 4.9161855e-03f,
+    5.8095527e+00f,  4.5163748e-01f,  -1.3569316e+00f, -7.1711606e-01f,
+    4.6302426e-01f,  -1.2976727e-01f, 4.9161855e-03f,  1.2101072e+01f,
+    -3.3772957e+00f, -5.3192800e-01f, -4.1993264e-02f, -1.0637641e-01f,
+    -1.1508505e-01f, 4.9161855e-03f,  2.6165378e+00f,  1.8762544e+00f,
+    -6.6478405e+00f, 4.9833903e-01f,  5.6820488e-01f,  9.6074417e-03f,
+    4.9161855e-03f,  -2.7133231e+00f, -5.9103000e-01f, 4.9870867e-02f,
+    -2.2181080e-01f, -1.8415939e-02f, 5.7156056e-01f,  4.9161855e-03f,
+    1.0539672e+00f,  -7.1663280e+00f, 4.3730845e+00f,  -2.0142028e-01f,
+    4.7404751e-01f,  -2.7490994e-01f, 4.9161855e-03f,  -1.1627064e+01f,
+    -3.0775794e-01f, -5.9770060e+00f, -7.5886458e-02f, 4.0517724e-01f,
+    -1.3981339e-01f, 4.9161855e-03f,  1.0866967e+00f,  -7.9000783e-01f,
+    2.5184824e+00f,  1.1489426e-01f,  -5.5397308e-01f, -9.2689073e-01f,
+    4.9161855e-03f,  -1.8292384e-01f, 3.2646315e+00f,  -1.6746950e+00f,
+    5.0538975e-01f,  -8.1804043e-01f, 7.3222065e-01f,  4.9161855e-03f,
+    1.4929719e+00f,  9.4005907e-01f,  1.8587011e+00f,  4.4272500e-01f,
+    -5.7933551e-01f, 1.1078842e-02f,  4.9161855e-03f,  4.0897088e+00f,
+    -8.3170910e+00f, -7.7612681e+00f, -1.3118382e-01f, 2.2805281e-01f,
+    -5.7812393e-01f, 4.9161855e-03f,  8.6598027e-01f,  -1.0456352e+00f,
+    3.8437498e-01f,  1.6694506e+00f,  -6.2009120e-01f, 5.3192055e-01f,
+    4.9161855e-03f,  -4.8537847e-01f, 9.1856569e-01f,  -1.3051009e+00f,
+    6.5430939e-01f,  -5.9828395e-01f, 1.1575594e+00f,  4.9161855e-03f,
+    -4.2665830e+00f, -3.0704074e+00f, -1.0525151e+00f, -4.6153173e-01f,
+    3.5057652e-01f,  2.7432105e-01f,  4.9161855e-03f,  5.1324239e+00f,
+    -3.9258289e-01f, 2.4644251e+00f,  7.1393543e-01f,  5.6272078e-02f,
+    5.0331020e-01f,  4.9161855e-03f,  2.1729605e+00f,  -2.9398150e+00f,
+    3.8983128e+00f,  -5.7526851e-01f, -5.4395968e-01f, 2.6677924e-01f,
+    4.9161855e-03f,  -4.6834240e+00f, -7.1150680e+00f, 5.3980551e+00f,
+    2.3003122e-01f,  -9.5528945e-02f, 1.0089890e-01f,  4.9161855e-03f,
+    -6.5583615e+00f, 6.1323514e+00f,  3.4290126e-01f,  5.6338448e-02f,
+    -3.6545107e-01f, 6.3475060e-01f,  4.9161855e-03f,  -4.7143194e-01f,
+    -5.2725344e+00f, 1.0759580e+00f,  2.6186921e-02f,  2.0417234e-01f,
+    3.1454092e-01f,  4.9161855e-03f,  1.4883240e+00f,  -2.8093128e+00f,
+    3.0265145e+00f,  -4.0938655e-01f, -8.7190077e-02f, 3.6416546e-01f,
+    4.9161855e-03f,  2.1199739e+00f,  -5.4996886e+00f, 3.2656703e+00f,
+    -1.9891968e-01f, -1.9218311e-01f, 4.7576624e-01f,  4.9161855e-03f,
+    5.6682081e+00f,  9.3008503e-02f,  3.7969866e+00f,  -4.5014992e-01f,
+    -5.4205108e-01f, -1.7190477e-01f, 4.9161855e-03f,  2.9768403e+00f,
+    -4.0278282e+00f, 6.8811315e-01f,  -1.3242954e-01f, -2.6241624e-01f,
+    2.3300681e-01f,  4.9161855e-03f,  3.2816823e+00f,  -1.5965747e+00f,
+    -4.6481495e+00f, -7.3801905e-01f, 2.7248913e-01f,  -4.6172965e-02f,
+    4.9161855e-03f,  -1.2009241e+01f, -3.1461194e+00f, 6.5948210e+00f,
+    2.2816226e-02f,  1.7971846e-01f,  -7.1230225e-02f, 4.9161855e-03f,
+    1.0664890e+00f,  -4.2399839e-02f, -1.1740028e+00f, -2.5743067e-01f,
+    -1.9595818e-01f, -4.6895766e-01f, 4.9161855e-03f,  -4.4604793e-01f,
+    -4.1761667e-01f, -5.9358352e-01f, -1.4772195e-01f, 3.2849824e-01f,
+    9.1546112e-01f,  4.9161855e-03f,  -1.0685309e+00f, -8.3202881e-01f,
+    1.9027503e+00f,  3.7143436e-01f,  1.0500257e+00f,  7.3510087e-01f,
+    4.9161855e-03f,  2.6647577e-01f,  5.7187647e-01f,  -5.4631060e-01f,
+    -7.7697217e-01f, 5.5341065e-01f,  8.8884197e-02f,  4.9161855e-03f,
+    -2.4092264e+00f, -2.3437815e+00f, -5.6990242e+00f, 4.0246669e-02f,
+    -6.9021386e-01f, 4.8528168e-01f,  4.9161855e-03f,  -2.9229283e-01f,
+    2.7454209e+00f,  -1.2440990e+00f, 5.0732434e-01f,  1.6615523e-01f,
+    -5.7657963e-01f, 4.9161855e-03f,  -3.1489432e+00f, 1.2680652e+00f,
+    -5.7047668e+00f, -2.0682169e-01f, -5.2342772e-01f, 3.2621157e-01f,
+    4.9161855e-03f,  -4.2064637e-01f, 8.1609935e-01f,  6.2681526e-01f,
+    3.5374090e-01f,  6.2999052e-01f,  -5.8346725e-01f, 4.9161855e-03f,
+    7.1308404e-02f,  1.8311420e-01f,  4.0706435e-01f,  3.4199366e-01f,
+    9.3160830e-03f,  4.1215700e-01f,  4.9161855e-03f,  5.6278663e+00f,
+    3.3636853e-01f,  -6.4618564e-01f, 1.4624824e-01f,  2.6545855e-01f,
+    -2.6047999e-01f, 4.9161855e-03f,  2.1086318e+00f,  1.4405881e+00f,
+    1.9607490e+00f,  4.1016015e-01f,  -1.0820497e+00f, 5.2126324e-01f,
+    4.9161855e-03f,  2.2687659e+00f,  -3.8944154e+00f, -3.5740595e+00f,
+    5.5470216e-01f,  1.0869193e-01f,  1.2446215e-01f,  4.9161855e-03f,
+    -3.6911979e+00f, -1.6825495e-02f, 2.7175789e+00f,  3.3319286e-01f,
+    4.5574255e-02f,  -2.9945102e-01f, 4.9161855e-03f,  -9.1713123e+00f,
+    -1.1326112e+01f, 8.7793245e+00f,  3.2807869e-01f,  3.1993087e-02f,
+    6.5704375e-03f,  4.9161855e-03f,  -6.3241405e+00f, 4.5917640e+00f,
+    5.2446551e+00f,  8.6806208e-02f,  -1.1900769e-01f, 3.7303127e-02f,
+    4.9161855e-03f,  1.8690332e+00f,  5.1850295e-01f,  -4.2205045e-01f,
+    5.1754210e-02f,  1.0277729e+00f,  -9.3673009e-01f, 4.9161855e-03f,
+    1.1749099e+00f,  1.8220998e+00f,  3.7768686e+00f,  3.2626029e-02f,
+    1.9230081e-01f,  -6.1840069e-01f, 4.9161855e-03f,  -6.4281154e+00f,
+    -3.2852066e+00f, -3.6263623e+00f, 4.3581065e-02f,  -9.3072295e-02f,
+    2.2059004e-01f,  4.9161855e-03f,  -2.8914037e+00f, -8.9913285e-01f,
+    -6.0291066e+00f, -7.3334366e-02f, -1.7908965e-01f, 2.4383314e-01f,
+    4.9161855e-03f,  3.5674961e+00f,  -1.9904513e+00f, -2.8840287e+00f,
+    -2.1585038e-01f, 2.6890549e-01f,  5.7695067e-01f,  4.9161855e-03f,
+    -4.5172372e+00f, -1.2764982e+01f, -6.5555286e+00f, -8.7975547e-02f,
+    -2.8868642e-02f, -2.4445239e-01f, 4.9161855e-03f,  1.1917623e+00f,
+    2.7240102e+00f,  -5.6969924e+00f, 1.5443534e-01f,  8.0268896e-01f,
+    7.6069735e-02f,  4.9161855e-03f,  1.8703443e+00f,  -1.6433734e+00f,
+    -3.6527286e+00f, 9.3277645e-01f,  -2.1267043e-01f, 1.9547650e-01f,
+    4.9161855e-03f,  3.5234538e-01f,  -3.5503694e-01f, -3.5764150e-02f,
+    -2.7299783e-01f, 2.0867128e+00f,  -4.0437704e-01f, 4.9161855e-03f,
+    7.0537286e+00f,  4.2256870e+00f,  -2.3376143e+00f, 1.0489196e-01f,
+    -2.2336484e-01f, -2.2279005e-01f, 4.9161855e-03f,  1.2876858e+00f,
+    7.2569623e+00f,  -2.2856178e+00f, -3.6533204e-01f, -2.2654597e-01f,
+    -3.9202511e-01f, 4.9161855e-03f,  -2.9575005e+00f, 4.0046115e+00f,
+    1.9336003e+00f,  7.7007276e-01f,  1.8195377e-01f,  5.0428671e-01f,
+    4.9161855e-03f,  3.6017182e+00f,  9.1012402e+00f,  -6.7456603e+00f,
+    -1.3861659e-01f, -2.6884264e-01f, -3.9056700e-01f, 4.9161855e-03f,
+    -1.1627531e+00f, 1.7062700e+00f,  -7.1475458e-01f, -1.5973236e-02f,
+    -5.2192539e-01f, 9.2492419e-01f,  4.9161855e-03f,  7.0983272e+00f,
+    4.3586853e-01f,  -3.5620954e+00f, 3.9555708e-01f,  5.6896615e-01f,
+    -3.9723828e-01f, 4.9161855e-03f,  1.4865612e+00f,  -1.0475974e+00f,
+    -8.4833641e+00f, -3.7397227e-01f, 1.3291334e-01f,  3.3054215e-01f,
+    4.9161855e-03f,  3.3097060e+00f,  -4.0853152e+00f, 2.3023739e+00f,
+    -7.3129189e-01f, 4.1393802e-01f,  2.4469729e-01f,  4.9161855e-03f,
+    -6.4677873e+00f, -1.6074709e+00f, 2.2694349e+00f,  2.4836297e-01f,
+    -4.7907314e-01f, -1.2783307e-02f, 4.9161855e-03f,  7.6441946e+00f,
+    -6.5884595e+00f, 8.2836065e+00f,  -6.5808132e-02f, -1.2891619e-01f,
+    -1.0536889e-01f, 4.9161855e-03f,  -6.1940775e+00f, -7.0686564e+00f,
+    2.8182077e+00f,  4.6267312e-02f,  2.1834882e-01f,  -2.8412163e-01f,
+    4.9161855e-03f,  7.5322211e-01f,  4.4226575e-01f,  8.6104780e-01f,
+    -4.5959395e-01f, -1.2565438e+00f, 1.0619931e+00f,  4.9161855e-03f,
+    -3.1116338e+00f, 5.5792129e-01f,  5.3073101e+00f,  3.0462223e-01f,
+    7.5853378e-02f,  -1.9224058e-01f, 4.9161855e-03f,  2.2643218e+00f,
+    2.0357387e+00f,  4.4502897e+00f,  -2.8496760e-01f, 1.2047067e-01f,
+    6.4417034e-01f,  4.9161855e-03f,  -1.4413284e+00f, 3.5867362e+00f,
+    -2.4204571e+00f, 4.2380524e-01f,  -2.1113880e-01f, -1.7703670e-01f,
+    4.9161855e-03f,  -6.8668759e-01f, -9.5317203e-01f, 1.5330289e-01f,
+    5.7356155e-01f,  6.3638610e-01f,  7.7120703e-01f,  4.9161855e-03f,
+    -1.0682197e+00f, -6.9213104e+00f, -5.8608122e+00f, 1.0352087e-01f,
+    -3.3730379e-01f, 1.9342881e-01f,  4.9161855e-03f,  -2.4783916e+00f,
+    1.2663845e+00f,  1.5080407e+00f,  3.5923757e-03f,  5.0929576e-01f,
+    3.1987467e-01f,  4.9161855e-03f,  6.2106740e-01f,  -8.0850184e-01f,
+    6.0432136e-01f,  1.0544959e+00f,  3.5460990e-02f,  7.1798617e-01f,
+    4.9161855e-03f,  5.7629764e-01f,  -4.1872951e-01f, 2.6883879e-01f,
+    -5.7401496e-01f, -5.2689475e-01f, -2.9298371e-01f, 4.9161855e-03f,
+    -6.0079894e+00f, -3.0357261e+00f, 1.1362796e+00f,  1.8514165e-01f,
+    -1.0868914e-02f, -2.6686630e-01f, 4.9161855e-03f,  -6.4743943e+00f,
+    5.0929122e+00f,  4.5632439e+00f,  -8.3602853e-03f, 1.3735165e-01f,
+    -3.0539981e-01f, 4.9161855e-03f,  -1.1718397e+00f, -4.3745694e+00f,
+    4.1264515e+00f,  3.4016520e-01f,  -2.4106152e-01f, -6.2656836e-03f,
+    4.9161855e-03f,  4.5977187e+00f,  9.2932510e-01f,  1.8005730e+00f,
+    7.5450696e-02f,  2.5778416e-01f,  -1.0443735e-01f, 4.9161855e-03f,
+    -1.2225604e+00f, 3.8227065e+00f,  -4.0077796e+00f, 3.7918901e-01f,
+    -3.4038458e-02f, -2.2999659e-01f, 4.9161855e-03f,  -1.6463979e+00f,
+    3.3725232e-01f,  -2.3585579e+00f, -7.5838506e-02f, 7.1057733e-03f,
+    2.9407086e-02f,  4.9161855e-03f,  5.4664793e+00f,  -3.7369993e-01f,
+    1.8591646e+00f,  6.9752198e-01f,  5.2111161e-01f,  -5.1446843e-01f,
+    4.9161855e-03f,  -2.0373304e+00f, 2.6609144e+00f,  -1.8289629e+00f,
+    5.7756305e-01f,  -3.7016757e-03f, -1.2520009e-01f, 4.9161855e-03f,
+    -4.3900475e-01f, 1.6747446e+00f,  4.9002385e+00f,  2.5009772e-01f,
+    -1.8630438e-01f, 3.6023688e-01f,  4.9161855e-03f,  -6.4800224e+00f,
+    1.0171971e+00f,  2.6008205e+00f,  7.6939821e-02f,  3.9370355e-01f,
+    1.5263109e-02f,  4.9161855e-03f,  7.7535975e-01f,  -6.5957302e-01f,
+    -1.4328420e-01f, 1.3423905e-01f,  -1.1076678e+00f, 2.9757038e-01f,
+
+    4.3528955e-04f,  -1.0293683e+00f, -1.4860930e+00f, 1.5695719e-01f,
+    8.1952465e-01f,  -4.9572346e-01f, -5.7644486e-02f, 4.3528955e-04f,
+    -5.3100938e-01f, -5.8876202e-02f, 7.3920354e-02f,  3.6222014e-01f,
+    -8.7741643e-01f, -4.9836982e-02f, 4.3528955e-04f,  1.9436845e+00f,
+    5.1049846e-01f,  1.3180804e-01f,  -2.6122969e-01f, 9.9792713e-01f,
+    -1.1101015e-02f, 4.3528955e-04f,  -2.7033777e+00f, -1.8548988e+00f,
+    -3.8844220e-02f, 4.7028649e-01f,  -7.9503214e-01f, -2.7865918e-02f,
+    4.3528955e-04f,  4.1310158e-01f,  -3.4749858e+00f, 1.5252715e-01f,
+    9.1952014e-01f,  -2.8742326e-02f, -1.9396225e-02f, 4.3528955e-04f,
+    -3.1739223e+00f, -1.7183465e+00f, -1.7481904e-01f, 2.9902828e-01f,
+    -7.2434241e-01f, -2.6387524e-02f, 4.3528955e-04f,  -8.6253613e-01f,
+    -1.3973342e+00f, 1.1655489e-02f,  9.7994268e-01f,  -3.7582502e-01f,
+    2.1397233e-02f,  4.3528955e-04f,  -1.0050631e+00f, 2.2468293e+00f,
+    -1.4665943e-01f, -8.1148869e-01f, -3.0340642e-01f, 3.0684460e-02f,
+    4.3528955e-04f,  -1.4321089e+00f, -8.3064753e-01f, 5.7692427e-02f,
+    4.6401533e-01f,  -5.8835715e-01f, -2.3240988e-01f, 4.3528955e-04f,
+    -1.1840597e+00f, -4.7335869e-01f, -1.0066354e-01f, 3.2861975e-01f,
+    -8.1295985e-01f, 8.1459478e-02f,  4.3528955e-04f,  -5.7204002e-01f,
+    -6.0020667e-01f, -8.7873779e-02f, 8.9714015e-01f,  -6.7748755e-01f,
+    -1.9026755e-01f, 4.3528955e-04f,  -2.9476359e+00f, -1.7011030e+00f,
+    1.3818750e-01f,  6.1435014e-01f,  -7.3296779e-01f, 7.3396176e-02f,
+    4.3528955e-04f,  1.9609587e+00f,  -1.9409456e+00f, -7.0424877e-02f,
+    6.9078994e-01f,  6.1551386e-01f,  1.4795370e-01f,  4.3528955e-04f,
+    1.8401569e-01f,  -1.2294726e+00f, -6.5059900e-02f, 8.3214116e-01f,
+    -1.1039478e-01f, 1.0820668e-02f,  4.3528955e-04f,  -3.2635043e+00f,
+    1.5816216e+00f,  -1.4595885e-02f, -3.5887066e-01f, -8.6088765e-01f,
+    -2.9629178e-02f, 4.3528955e-04f,  -3.9439683e+00f, -2.3541796e+00f,
+    2.0591463e-01f,  3.8780153e-01f,  -8.0070376e-01f, -3.3018999e-02f,
+    4.3528955e-04f,  -2.2674167e+00f, 3.4032989e-01f,  2.8466174e-02f,
+    -2.9337224e-02f, -9.7169715e-01f, -3.5801485e-02f, 4.3528955e-04f,
+    1.8211118e+00f,  6.3323951e-01f,  8.0380157e-02f,  -7.6350129e-01f,
+    6.8511432e-01f,  2.6923558e-02f,  4.3528955e-04f,  1.0825631e-01f,
+    -2.3674943e-01f, -6.8531990e-02f, 7.1723968e-01f,  6.5778261e-01f,
+    -3.8818890e-01f, 4.3528955e-04f,  -1.2199759e+00f, 1.1100285e-02f,
+    3.4947380e-02f,  -4.4695923e-01f, -8.1581652e-01f, 5.8015283e-02f,
+    4.3528955e-04f,  -3.1495280e+00f, -2.4890139e+00f, 6.2988261e-03f,
+    6.1453247e-01f,  -6.6755074e-01f, -4.1738255e-03f, 4.3528955e-04f,
+    1.4966619e+00f,  -3.2968187e-01f, -5.0477613e-02f, 2.4966402e-01f,
+    1.0242459e+00f,  5.2230121e-03f,  4.3528955e-04f,  -8.4482647e-02f,
+    -7.1049720e-02f, -6.0130212e-02f, 9.4271088e-01f,  -2.0089492e-01f,
+    2.3388010e-01f,  4.3528955e-04f,  2.4736483e+00f,  -2.6515591e+00f,
+    9.1419272e-02f,  7.2109270e-01f,  5.8762175e-01f,  1.0272927e-02f,
+    4.3528955e-04f,  -1.7843741e-01f, -2.6111281e-01f, -2.5327990e-02f,
+    9.0371573e-01f,  -3.0383718e-01f, -2.1001785e-01f, 4.3528955e-04f,
+    -1.5343285e-01f, 2.0258040e+00f,  -7.3217832e-02f, -9.4239789e-01f,
+    1.9637553e-01f,  -5.4789580e-02f, 4.3528955e-04f,  3.6094151e+00f,
+    -1.3058611e+00f, 2.8641449e-02f,  4.2085060e-01f,  8.6798662e-01f,
+    5.5175863e-02f,  4.3528955e-04f,  -1.0593317e-01f, -9.4452149e-01f,
+    -1.7858937e-01f, 6.9635260e-01f,  -1.5049441e-01f, -1.3248153e-01f,
+    4.3528955e-04f,  3.7917423e-01f,  -8.9208072e-01f, 7.6984480e-02f,
+    1.0966808e+00f,  4.0643299e-01f,  -6.9561042e-02f, 4.3528955e-04f,
+    3.3198512e-01f,  -5.6812048e-01f, 1.9102082e-01f,  8.6836040e-01f,
+    -1.5086564e-01f, -1.7397478e-01f, 4.3528955e-04f,  -1.4775107e+00f,
+    2.2676902e+00f,  -2.6615953e-02f, -6.4627272e-01f, -7.3115832e-01f,
+    -3.6860257e-04f, 4.3528955e-04f,  -1.3652307e+00f, 1.4607301e+00f,
+    -7.0795878e-03f, -6.4263791e-01f, -8.5862374e-01f, -7.0166513e-02f,
+    4.3528955e-04f,  -2.4315050e-01f, 5.7259303e-01f,  -1.2909895e-01f,
+    -6.7960644e-01f, -3.8035557e-01f, 8.9591220e-02f,  4.3528955e-04f,
+    -8.9654458e-01f, -8.2225668e-01f, -1.5554781e-01f, 2.6332226e-01f,
+    -1.1026720e+00f, -1.4182439e-01f, 4.3528955e-04f,  1.0711229e+00f,
+    -7.8219914e-01f, 7.6412216e-02f,  5.8565933e-01f,  6.1893952e-01f,
+    -1.6858302e-01f, 4.3528955e-04f,  -7.9615515e-01f, 1.4364504e+00f,
+    9.2410203e-03f,  -6.5665913e-01f, -2.1941739e-01f, 1.0833266e-01f,
+    4.3528955e-04f,  -1.6137042e+00f, -2.0602920e+00f, -5.0673138e-02f,
+    7.6305509e-01f,  -5.9941691e-01f, -1.0346474e-01f, 4.3528955e-04f,
+    3.1642308e+00f,  3.1452847e+00f,  -5.0170259e-03f, -7.4229622e-01f,
+    6.7826283e-01f,  4.4823855e-02f,  4.3528955e-04f,  -3.0705388e+00f,
+    2.6966345e-01f,  -1.8887999e-02f, 3.6214914e-02f,  -7.5216961e-01f,
+    -1.0115588e-01f, 4.3528955e-04f,  1.4377837e+00f,  1.8380008e+00f,
+    1.0078024e-02f,  -9.4601542e-01f, 6.7934078e-01f,  -2.2415651e-02f,
+    4.3528955e-04f,  -3.0586500e+00f, -2.3072541e+00f, 8.6151786e-02f,
+    6.1782306e-01f,  -7.6497197e-01f, -2.1772760e-03f, 4.3528955e-04f,
+    -8.0013043e-01f, 1.2293025e+00f,  -5.2432049e-02f, -5.6075841e-01f,
+    -8.7740129e-01f, 6.5895572e-02f,  4.3528955e-04f,  -1.3656047e-01f,
+    1.4744946e+00f,  1.2479756e-01f,  -7.4122250e-01f, -3.8248911e-02f,
+    -2.2064438e-02f, 4.3528955e-04f,  1.0616552e+00f,  1.1348683e+00f,
+    -1.1367176e-01f, -4.8901221e-01f, 1.1293241e+00f,  9.0970963e-02f,
+    4.3528955e-04f,  2.6216686e+00f,  9.4791728e-01f,  4.0192474e-02f,
+    -2.2352676e-01f, 9.1756529e-01f,  -2.0654747e-02f, 4.3528955e-04f,
+    -1.0986848e+00f, -1.7928226e+00f, -8.0955531e-03f, 5.4425591e-01f,
+    -5.4146111e-01f, 5.6186426e-02f,  4.3528955e-04f,  -2.3845494e+00f,
+    6.4246732e-01f,  -2.1160398e-02f, -7.6780915e-02f, -9.5503724e-01f,
+    6.7784131e-02f,  4.3528955e-04f,  -1.9912511e+00f, 3.0141566e+00f,
+    8.3297707e-02f,  -8.3237952e-01f, -5.2035487e-01f, 5.1615741e-02f,
+    4.3528955e-04f,  -9.0560585e-01f, -3.7631898e+00f, 1.6689511e-01f,
+    9.0746129e-01f,  -1.9730194e-01f, -2.3535542e-02f, 4.3528955e-04f,
+    6.3766164e-01f,  -3.8548386e-01f, -3.1122489e-02f, 1.5888071e-01f,
+    4.4760171e-01f,  -4.5795736e-01f, 4.3528955e-04f,  1.5244511e+00f,
+    2.0055573e+00f,  -2.4869658e-02f, -8.0609977e-01f, 6.4100277e-01f,
+    3.8976461e-02f,  4.3528955e-04f,  6.9167578e-01f,  1.4518945e+00f,
+    3.1883813e-02f,  -8.5315329e-01f, 5.8884792e-02f,  -1.2494932e-01f,
+    4.3528955e-04f,  2.9661411e-01f,  1.3043760e+00f,  2.4526106e-02f,
+    -1.1065414e+00f, -1.1344036e-02f, 6.3221857e-02f,  4.3528955e-04f,
+    -8.4016162e-01f, 8.8171500e-01f,  -3.3638831e-02f, -8.7047851e-01f,
+    -7.4371785e-01f, -6.8592496e-02f, 4.3528955e-04f,  -1.0806392e+00f,
+    -8.1659573e-01f, 6.9328718e-02f,  7.9761153e-01f,  -2.6620972e-01f,
+    -4.9550496e-02f, 4.3528955e-04f,  4.6540970e-01f,  2.6671610e+00f,
+    -1.5481386e-01f, -1.0805309e+00f, 1.0314250e-01f,  3.1081898e-02f,
+    4.3528955e-04f,  -7.4959141e-01f, 1.2651914e+00f,  -5.3930525e-02f,
+    -7.1458316e-01f, -1.6966201e-01f, 1.2964334e-01f,  4.3528955e-04f,
+    1.3777412e-01f,  4.5225596e-01f,  7.9039142e-02f,  -8.1627947e-01f,
+    1.7738114e-01f,  -3.1320851e-02f, 4.3528955e-04f,  1.0212445e+00f,
+    -1.5533651e+00f, -8.3980761e-02f, 8.6295778e-01f,  3.0176216e-01f,
+    1.6473895e-01f,  4.3528955e-04f,  3.3092902e+00f,  -2.5739362e+00f,
+    1.7827101e-02f,  5.8178002e-01f,  7.2040093e-01f,  -7.1082853e-02f,
+    4.3528955e-04f,  1.3353622e+00f,  1.8426478e-01f,  -1.2336533e-01f,
+    -1.5237944e-01f, 8.7628794e-01f,  8.9047194e-02f,  4.3528955e-04f,
+    -2.1589763e+00f, -7.4480367e-01f, 1.0698751e-01f,  1.9649486e-01f,
+    -8.3016509e-01f, 2.9976953e-02f,  4.3528955e-04f,  -8.3592318e-02f,
+    1.6698179e+00f,  -5.6423243e-02f, -8.3871675e-01f, 2.1960415e-01f,
+    1.6031240e-01f,  4.3528955e-04f,  7.2103626e-01f,  -2.0886056e+00f,
+    -1.0135887e-02f, 8.1505424e-01f,  2.7959514e-01f,  9.6105590e-02f,
+    4.3528955e-04f,  -2.4309948e-02f, 1.2600120e+00f,  -5.3339738e-02f,
+    -6.1280799e-01f, -1.8306378e-01f, 1.7326172e-01f,  4.3528955e-04f,
+    4.8158026e-01f,  -6.6661340e-01f, 4.5266356e-02f,  9.4537783e-01f,
+    1.9018820e-01f,  2.9867753e-01f,  4.3528955e-04f,  6.9710463e-01f,
+    2.5529363e+00f,  -3.8498882e-02f, -7.2734129e-01f, 1.2338838e-01f,
+    8.0769040e-02f,  4.3528955e-04f,  9.5720708e-01f,  7.9277784e-01f,
+    -5.7742778e-02f, -6.7032278e-01f, 4.7057158e-01f,  1.7988858e-01f,
+    4.3528955e-04f,  -5.9059054e-01f, 1.4429114e+00f,  -2.1938417e-02f,
+    -5.8713347e-01f, -2.0255148e-01f, 1.9287418e-03f,  4.3528955e-04f,
+    -2.0606318e-01f, -6.1336350e-01f, 1.0962017e-01f,  5.3309757e-01f,
+    -2.4695891e-01f, 4.4428447e-01f,  4.3528955e-04f,  1.0315387e+00f,
+    5.0489306e-01f,  4.5739550e-02f,  -5.6967974e-01f, 9.4476599e-01f,
+    1.1259848e-01f,  4.3528955e-04f,  4.6653214e-01f,  -2.1413295e+00f,
+    -7.8291312e-02f, 9.3167323e-01f,  2.8987619e-01f,  6.2450152e-02f,
+    4.3528955e-04f,  -7.5579238e-01f, -1.4824712e+00f, 6.6262364e-02f,
+    8.3839804e-01f,  -1.0729449e-01f, -6.3796237e-02f, 4.3528955e-04f,
+    -2.3352005e+00f, 1.3538911e+00f,  -3.3673003e-02f, -4.4548821e-01f,
+    -8.1517369e-01f, -1.0029911e-01f, 4.3528955e-04f,  7.9074532e-01f,
+    -1.2019353e+00f, 3.2030545e-02f,  6.6592199e-01f,  6.0947978e-01f,
+    1.0519248e-01f,  4.3528955e-04f,  -2.3914580e+00f, -1.5300194e+00f,
+    -7.3386231e-03f, 5.2172303e-01f,  -5.3816289e-01f, 1.3147322e-02f,
+    4.3528955e-04f,  1.5584013e+00f,  1.2237773e+00f,  -2.2644576e-02f,
+    -4.8539612e-01f, 8.1405783e-01f,  2.2524531e-01f,  4.3528955e-04f,
+    2.7545780e-01f,  4.3402547e-01f,  -6.5069459e-02f, -9.3852228e-01f,
+    7.6457936e-01f,  2.9687262e-01f,  4.3528955e-04f,  -1.0373369e+00f,
+    -1.1858125e+00f, 7.9311356e-02f,  7.5912684e-01f,  -7.1744674e-01f,
+    -1.3299203e-03f, 4.3528955e-04f,  -3.6895132e-01f, -5.0010152e+00f,
+    6.5428980e-02f,  8.7311417e-01f,  -6.9538005e-02f, 1.0042680e-02f,
+    4.3528955e-04f,  3.6669555e-01f,  2.1180862e-01f,  9.9992063e-03f,
+    2.7217722e-01f,  1.2377149e+00f,  4.1405495e-02f,  4.3528955e-04f,
+    -9.2516810e-01f, 2.5122499e-01f,  9.0740845e-02f,  -3.1037506e-01f,
+    -5.3703344e-01f, -1.7266656e-01f, 4.3528955e-04f,  -1.3804758e+00f,
+    -1.3297899e+00f, -2.8708819e-01f, 6.7745668e-01f,  -7.3042059e-01f,
+    -5.8776453e-02f, 4.3528955e-04f,  -2.9314404e+00f, -3.2674408e-01f,
+    2.6022336e-03f,  1.1271559e-01f,  -9.9770236e-01f, -1.6199436e-02f,
+    4.3528955e-04f,  7.5596017e-01f,  6.4125985e-01f,  1.3342527e-01f,
+    -7.3403597e-01f, 7.2796106e-01f,  -1.9283566e-01f, 4.3528955e-04f,
+    2.4747379e+00f,  1.7827348e+00f,  -6.9021672e-02f, -5.9692907e-01f,
+    6.9948733e-01f,  -4.2432200e-02f, 4.3528955e-04f,  2.6764268e-01f,
+    -6.7757279e-01f, 5.7690304e-02f,  8.7350392e-01f,  -4.8027195e-02f,
+    -3.0863043e-02f, 4.3528955e-04f,  -2.6360197e+00f, 1.4940584e+00f,
+    2.8475098e-02f,  -4.3170014e-01f, -7.3762143e-01f, 2.6269550e-02f,
+    4.3528955e-04f,  -1.1015791e+00f, -3.0440766e-01f, 6.6284783e-02f,
+    2.0560089e-01f,  -8.5632157e-01f, -5.3701401e-02f, 4.3528955e-04f,
+    8.7469929e-01f,  -4.2660141e-01f, 8.8426486e-02f,  6.4585888e-01f,
+    9.5434201e-01f,  -1.1490559e-01f, 4.3528955e-04f,  -2.5340066e+00f,
+    -1.5883948e+00f, 2.7220825e-02f,  4.8709485e-01f,  -7.3602939e-01f,
+    -2.2645691e-02f, 4.3528955e-04f,  6.6391569e-01f,  5.2166218e-01f,
+    -2.8496210e-02f, -5.6626147e-01f, 6.4786118e-01f,  7.2635375e-02f,
+    4.3528955e-04f,  -2.1902223e+00f, 8.2347983e-01f,  -1.1497141e-01f,
+    -2.8690112e-01f, -4.1086102e-01f, -7.1620151e-02f, 4.3528955e-04f,
+    1.5770845e+00f,  9.1851938e-01f,  1.1258498e-01f,  -4.1776821e-01f,
+    8.8284534e-01f,  1.8577316e-01f,  4.3528955e-04f,  -1.2781682e+00f,
+    6.7074127e-02f,  -6.0735323e-02f, -5.4243341e-02f, -9.4303757e-01f,
+    -1.3638639e-02f, 4.3528955e-04f,  -5.3268588e-01f, 1.0086590e+00f,
+    -8.8331357e-02f, -6.6487861e-01f, -1.7597961e-01f, 1.0273039e-01f,
+    4.3528955e-04f,  -4.1415280e-01f, -3.3356786e+00f, 7.4211016e-02f,
+    9.8400438e-01f,  -1.1658446e-01f, -4.6829078e-03f, 4.3528955e-04f,
+    1.4253725e+00f,  1.9782156e-01f,  2.9133189e-01f,  -7.4195957e-01f,
+    5.5337536e-01f,  -1.6068888e-01f, 4.3528955e-04f,  -1.0491303e+00f,
+    -3.2139263e+00f, 1.1092858e-01f,  8.9176017e-01f,  -2.9428917e-01f,
+    -4.0598955e-02f, 4.3528955e-04f,  7.3543614e-01f,  -1.0327798e+00f,
+    4.2624928e-02f,  5.5009919e-01f,  7.5031644e-01f,  4.2304110e-02f,
+    4.3528955e-04f,  4.1882765e-01f,  5.2894473e-01f,  2.3122119e-02f,
+    -9.0452760e-01f, 7.6079768e-01f,  3.0251063e-02f,  4.3528955e-04f,
+    1.7290962e+00f,  -3.8216734e-01f, -2.3694385e-03f, 1.7573975e-01f,
+    5.5424958e-01f,  -1.0576776e-01f, 4.3528955e-04f,  -4.9047729e-01f,
+    1.8191563e+00f,  -4.9798083e-02f, -8.8397211e-01f, 1.1273885e-02f,
+    -1.0243861e-01f, 4.3528955e-04f,  -3.3216915e+00f, 2.6749082e+00f,
+    -3.5078647e-03f, -6.4118123e-01f, -6.9885534e-01f, 1.2539584e-02f,
+    4.3528955e-04f,  2.0661256e+00f,  -2.5834680e-01f, 3.6938366e-02f,
+    1.2303282e-01f,  1.0086769e+00f,  -3.6050532e-02f, 4.3528955e-04f,
+    -2.1940269e+00f, 1.0349510e+00f,  -7.0236035e-02f, -4.2349803e-01f,
+    -7.5247216e-01f, -3.2610431e-02f, 4.3528955e-04f,  -5.6429607e-01f,
+    1.7274550e-01f,  -1.2418390e-01f, 2.8083679e-01f,  -6.0797828e-01f,
+    1.6303551e-01f,  4.3528955e-04f,  -2.4041736e-01f, -5.2295232e-01f,
+    1.2220953e-01f,  6.5039289e-01f,  -5.4857534e-01f, -6.2998816e-02f,
+    4.3528955e-04f,  -5.5390012e-01f, -2.3208292e+00f, -1.2352142e-02f,
+    9.8400331e-01f,  -2.7417722e-01f, -7.8883640e-02f, 4.3528955e-04f,
+    2.1476331e+00f,  -6.8665481e-01f, -7.3507451e-03f, 3.0319877e-03f,
+    9.4414437e-01f,  2.1496855e-01f,  4.3528955e-04f,  -3.0688529e+00f,
+    1.1516720e+00f,  2.0417161e-01f,  -2.6995751e-01f, -8.8706827e-01f,
+    -5.3957894e-02f, 4.3528955e-04f,  5.7819611e-01f,  2.5423549e-02f,
+    -8.6092122e-02f, 1.1022063e-01f,  1.1623888e+00f,  1.6437319e-01f,
+    4.3528955e-04f,  1.9840709e+00f,  -4.7336960e-01f, -1.4526581e-02f,
+    1.3205178e-01f,  9.4507223e-01f,  1.9238252e-02f,  4.3528955e-04f,
+    -4.6718526e+00f, 9.5738612e-02f,  -1.9311178e-02f, -2.4011239e-02f,
+    -8.6004484e-01f, 1.2756791e-05f,  4.3528955e-04f,  -1.4253048e+00f,
+    3.3447695e-01f,  -1.4148505e-01f, 3.1641260e-01f,  -8.0988580e-01f,
+    -4.1063607e-02f, 4.3528955e-04f,  -4.3422803e-01f, 9.0025520e-01f,
+    5.2156147e-02f,  -5.7631129e-01f, -7.9319668e-01f, 1.4041223e-01f,
+    4.3528955e-04f,  1.2276639e+00f,  -4.6768516e-01f, -6.6567689e-02f,
+    6.2331867e-01f,  6.0804600e-01f,  -8.6065661e-03f, 4.3528955e-04f,
+    1.2209854e+00f,  2.0611868e+00f,  -2.2080135e-02f, -8.3303684e-01f,
+    5.8840591e-01f,  -9.2961803e-02f, 4.3528955e-04f,  2.7590897e+00f,
+    -2.4113996e+00f, 2.1922546e-02f,  6.4421254e-01f,  6.9499773e-01f,
+    3.1200372e-02f,  4.3528955e-04f,  1.7373955e-01f,  -6.9299430e-01f,
+    -8.2973309e-02f, 8.9439744e-01f,  1.4732683e-01f,  1.5092665e-01f,
+    4.3528955e-04f,  3.3027312e-01f,  8.6301500e-01f,  6.2476180e-04f,
+    -1.0291767e+00f, 6.4454619e-03f,  -2.1080287e-01f, 4.3528955e-04f,
+    2.4861829e+00f,  4.0451837e+00f,  8.0902949e-02f,  -7.9118973e-01f,
+    4.8616445e-01f,  7.0306743e-03f,  4.3528955e-04f,  1.4965006e+00f,
+    2.4475951e-01f,  1.0186931e-01f,  -3.4997222e-01f, 9.4842607e-01f,
+    -6.2949613e-02f, 4.3528955e-04f,  2.2916253e+00f,  -7.2003818e-01f,
+    1.3226300e-01f,  3.3129850e-01f,  9.8537338e-01f,  4.3681487e-02f,
+    4.3528955e-04f,  -9.5530534e-01f, 6.0735192e-02f,  6.8596378e-02f,
+    6.6042799e-01f,  -8.4032148e-01f, -2.6502052e-01f, 4.3528955e-04f,
+    6.6460031e-01f,  4.2885369e-01f,  1.3182928e-01f,  1.6623332e-01f,
+    7.6477611e-01f,  2.4471369e-01f,  4.3528955e-04f,  1.0474554e+00f,
+    -1.4935753e-01f, -5.9584882e-02f, -3.7499127e-01f, 9.0489215e-01f,
+    5.9376396e-02f,  4.3528955e-04f,  -2.2020214e+00f, 8.8971096e-01f,
+    5.2402527e-03f,  -2.5808704e-01f, -1.0479920e+00f, -6.4677130e-03f,
+    4.3528955e-04f,  7.3008411e-02f,  1.4000205e+00f,  -1.0999314e-02f,
+    -8.6268264e-01f, 3.8728300e-01f,  1.3624142e-01f,  4.3528955e-04f,
+    1.7595435e+00f,  -2.2820453e-01f, 1.9381622e-02f,  2.7175361e-01f,
+    8.3581573e-01f,  -1.6735129e-01f, 4.3528955e-04f,  6.8509853e-01f,
+    -1.0923694e+00f, -6.5119796e-02f, 8.5533810e-01f,  5.3909045e-01f,
+    -1.1210985e-01f, 4.3528955e-04f,  -4.9187341e-01f, 1.7474970e+00f,
+    7.5579710e-02f,  -6.7014492e-01f, -3.1476149e-01f, -4.2323388e-02f,
+    4.3528955e-04f,  1.1314451e+00f,  -4.0664530e+00f, -5.1949147e-02f,
+    7.2666746e-01f,  2.6192483e-01f,  -6.2984854e-02f, 4.3528955e-04f,
+    4.2365646e-01f,  1.4296100e-01f,  -6.1019380e-02f, 7.5781792e-02f,
+    1.4421431e+00f,  3.7766818e-02f,  4.3528955e-04f,  -5.1406527e-01f,
+    -2.6018875e+00f, 8.8697441e-02f,  8.8988566e-01f,  1.7456422e-02f,
+    4.0939976e-02f,  4.3528955e-04f,  -2.9294605e+00f, -5.4596150e-01f,
+    1.1871128e-01f,  3.6147022e-01f,  -8.9994967e-01f, 4.4900741e-02f,
+    4.3528955e-04f,  -1.9198341e+00f, 1.9872969e-01f,  6.7518577e-02f,
+    -2.9187760e-01f, -9.4867790e-01f, 5.5106424e-02f,  4.3528955e-04f,
+    -1.4682201e-01f, 6.2716529e-02f,  8.5705489e-02f,  -3.5292792e-01f,
+    -1.3333107e+00f, 1.5399890e-01f,  4.3528955e-04f,  5.6458944e-01f,
+    7.4650335e-01f,  2.0964811e-02f,  -7.7980030e-01f, 1.7844588e-01f,
+    -1.0286529e-01f, 4.3528955e-04f,  3.9443350e-01f,  5.5445343e-01f,
+    3.4685973e-02f,  -9.5826283e-02f, 7.2892958e-01f,  4.1770080e-01f,
+    4.3528955e-04f,  -9.6379435e-01f, 7.4746269e-01f,  -1.1238152e-01f,
+    -9.0431488e-01f, -7.1115744e-01f, 1.0492866e-01f,  4.3528955e-04f,
+    1.0993766e+00f,  1.7946624e+00f,  3.5881538e-02f,  -7.7185822e-01f,
+    5.8226192e-01f,  1.0660763e-01f,  4.3528955e-04f,  6.1402404e-01f,
+    3.3699328e-01f,  9.7646080e-03f,  -4.7469679e-01f, 7.4303389e-01f,
+    1.4536295e-02f,  4.3528955e-04f,  3.7222487e-01f,  1.0571420e+00f,
+    -5.5587426e-02f, -6.8102205e-01f, 5.1040512e-01f,  6.2596425e-02f,
+    4.3528955e-04f,  -5.4109651e-01f, -1.9028574e+00f, -1.0337635e-01f,
+    8.7597108e-01f,  -2.6894566e-01f, 1.3261346e-02f,  4.3528955e-04f,
+    2.9783866e+00f,  1.1318161e+00f,  1.1286816e-01f,  -3.7797740e-01f,
+    9.2105252e-01f,  -1.2561412e-02f, 4.3528955e-04f,  -2.4203587e+00f,
+    6.7099535e-01f,  1.6123953e-01f,  -1.9071741e-01f, -8.3741486e-01f,
+    2.2363402e-02f,  4.3528955e-04f,  -2.4060899e-01f, -1.6746978e+00f,
+    -6.3585855e-02f, 6.3713533e-01f,  -1.6243860e-01f, -1.0301367e-01f,
+    4.3528955e-04f,  -2.3374808e-01f, 1.5877067e+00f,  -6.3304029e-02f,
+    -6.8064660e-01f, -1.6111565e-01f, 1.8704011e-01f,  4.3528955e-04f,
+    -3.2001064e+00f, -3.5053986e-01f, -6.7523257e-03f, 2.2389330e-01f,
+    -9.9271786e-01f, 1.3841564e-02f,  4.3528955e-04f,  -9.5942175e-01f,
+    1.2818235e+00f,  3.4953414e-03f,  -5.7093233e-01f, -3.4419948e-01f,
+    -2.6134266e-02f, 4.3528955e-04f,  -1.4307834e-02f, -1.6978773e+00f,
+    5.7517976e-02f,  8.1520927e-01f,  9.1835745e-02f,  -7.7086739e-02f,
+    4.3528955e-04f,  1.6759750e-01f,  1.9545419e+00f,  1.2943475e-01f,
+    -9.2084253e-01f, 2.8578630e-01f,  6.6440463e-02f,  4.3528955e-04f,
+    3.9787703e+00f,  -5.7296115e-01f, 5.5781920e-02f,  1.1391202e-01f,
+    8.7464589e-01f,  4.2658065e-02f,  4.3528955e-04f,  -2.7484705e+00f,
+    9.4179943e-02f,  -2.1561574e-02f, 1.5151599e-01f,  -1.0331128e+00f,
+    -3.2135916e-03f, 4.3528955e-04f,  6.6138101e-01f,  -5.5236793e-01f,
+    5.2268133e-02f,  1.1983306e+00f,  3.1339714e-01f,  8.5346632e-02f,
+    4.3528955e-04f,  9.7141600e-01f,  8.7995207e-01f,  -2.1324303e-02f,
+    -5.2090597e-01f, 3.5178021e-01f,  9.9708922e-02f,  4.3528955e-04f,
+    -1.5719903e+00f, -7.1768105e-02f, -1.2551299e-01f, 1.4229689e-02f,
+    -8.3360845e-01f, 8.1439786e-02f,  4.3528955e-04f,  1.5227333e-01f,
+    5.9486467e-01f,  -1.1525757e-01f, -1.1770222e+00f, -1.1152212e-01f,
+    -1.8600106e-01f, 4.3528955e-04f,  5.4802305e-01f,  3.4771168e-01f,
+    4.9063850e-02f,  -5.0729358e-01f, 1.3604277e+00f,  -1.3778533e-01f,
+    4.3528955e-04f,  9.9639618e-01f,  -1.7845176e+00f, -1.8913926e-01f,
+    6.5115315e-01f,  3.5845143e-01f,  -1.1495365e-01f, 4.3528955e-04f,
+    5.0442761e-01f,  -1.6939765e+00f, 1.3444363e-01f,  7.9765767e-01f,
+    9.5896624e-02f,  2.3449574e-02f,  4.3528955e-04f,  9.1848820e-01f,
+    1.7947282e+00f,  2.3108328e-02f,  -8.1202078e-01f, 7.1194607e-01f,
+    -1.7643306e-01f, 4.3528955e-04f,  1.5751457e+00f,  7.4473113e-01f,
+    6.7701228e-02f,  -3.8270667e-01f, 9.6734154e-01f,  6.8683743e-02f,
+    4.3528955e-04f,  -1.1713362e-01f, -1.3700154e+00f, 3.4804426e-02f,
+    8.2037103e-01f,  7.3533528e-02f,  -1.9467700e-01f, 4.3528955e-04f,
+    5.5485153e-01f,  -1.9637446e+00f, 1.8337615e-01f,  5.1766717e-01f,
+    3.4823027e-01f,  -3.4191165e-02f, 4.3528955e-04f,  -3.2356417e+00f,
+    2.8865299e+00f,  1.3286486e-02f,  -5.5004179e-01f, -7.3694974e-01f,
+    -4.9680071e-03f, 4.3528955e-04f,  6.8383068e-01f,  -1.0171911e+00f,
+    7.6801121e-02f,  5.1768839e-01f,  8.8065892e-01f,  -3.5073467e-02f,
+    4.3528955e-04f,  -2.9700124e-01f, 2.8541234e-01f,  -4.8604775e-02f,
+    1.9351684e-01f,  -6.8938023e-01f, -2.0852907e-02f, 4.3528955e-04f,
+    -1.0927875e-01f, 4.5007253e-01f,  -3.6444936e-02f, -1.1870381e+00f,
+    -4.6954250e-01f, 3.3325869e-01f,  4.3528955e-04f,  1.5838519e-01f,
+    -9.5099694e-01f, 3.9163604e-03f,  8.3429587e-01f,  3.7280244e-01f,
+    1.5489189e-01f,  4.3528955e-04f,  -9.5958948e-01f, -4.0252578e-01f,
+    -1.5193108e-01f, 8.5437566e-01f,  -9.6645850e-01f, -4.2557649e-02f,
+    4.3528955e-04f,  -2.1925392e+00f, 6.1255288e-01f,  1.3726956e-01f,
+    1.0810964e-01f,  -4.7563764e-01f, 1.0408697e-02f,  4.3528955e-04f,
+    8.0056149e-01f,  6.3280797e-01f,  -1.8809592e-02f, -6.2868190e-01f,
+    9.4688636e-01f,  1.9725758e-01f,  4.3528955e-04f,  -2.8070614e+00f,
+    -1.2614650e+00f, -1.1386498e-01f, 4.2355239e-01f,  -8.4566140e-01f,
+    -7.9685450e-03f, 4.3528955e-04f,  4.1955745e-01f,  1.9868320e-01f,
+    -3.1617776e-02f, -5.2684080e-02f, 1.0835853e+00f,  8.0220193e-02f,
+    4.3528955e-04f,  -2.5174224e-01f, -4.4407541e-01f, -4.8306193e-02f,
+    1.2749988e+00f,  -6.6885084e-01f, -1.3335912e-01f, 4.3528955e-04f,
+    7.0725358e-01f,  1.7382908e+00f,  5.2570436e-02f,  -7.3960626e-01f,
+    3.9065564e-01f,  -1.5792915e-01f, 4.3528955e-04f,  7.1034974e-01f,
+    7.0316529e-01f,  1.4520990e-02f,  -3.7738079e-01f, 6.3790071e-01f,
+    -2.6745561e-01f, 4.3528955e-04f,  -1.4448143e+00f, -3.3479691e-01f,
+    -9.1712713e-02f, 3.7903488e-01f,  -1.1852527e+00f, -4.3817163e-02f,
+    4.3528955e-04f,  9.1948193e-01f,  3.3783108e-01f,  -1.7194884e-01f,
+    -3.7194601e-01f, 5.7952046e-01f,  -1.4570314e-01f, 4.3528955e-04f,
+    9.0682703e-01f,  1.1050630e-01f,  1.4422230e-01f,  -6.5633878e-02f,
+    1.0675951e+00f,  -5.5507615e-02f, 4.3528955e-04f,  -1.7482088e+00f,
+    2.0929351e+00f,  4.3209646e-02f,  -7.1878397e-01f, -5.8232319e-01f,
+    1.0525685e-01f,  4.3528955e-04f,  -8.5872394e-01f, -1.0510905e+00f,
+    4.4756822e-02f,  5.2299464e-01f,  -6.0057831e-01f, 1.4777406e-03f,
+    4.3528955e-04f,  1.8123600e+00f,  3.8618393e+00f,  -9.9931516e-02f,
+    -8.7890404e-01f, 4.4283646e-01f,  -1.2992264e-02f, 4.3528955e-04f,
+    -1.7530689e+00f, -2.0681916e-01f, 6.0035437e-02f,  2.8316894e-01f,
+    -9.0348077e-01f, 8.6966164e-02f,  4.3528955e-04f,  3.9494860e+00f,
+    -1.0678519e+00f, -5.0141223e-02f, 2.8560540e-01f,  9.5005929e-01f,
+    7.1510494e-02f,  4.3528955e-04f,  6.9034487e-02f,  3.5403073e-02f,
+    9.8647997e-02f,  9.1302776e-01f,  2.4737068e-01f,  -1.5760049e-01f,
+    4.3528955e-04f,  2.0547771e-01f,  -2.2991155e-01f, -1.1552069e-02f,
+    1.0102785e+00f,  6.6631353e-01f,  3.7846733e-02f,  4.3528955e-04f,
+    -2.4342282e+00f, -1.7840242e+00f, -2.5005478e-02f, 4.5579487e-01f,
+    -7.2240454e-01f, 1.4701856e-02f,  4.3528955e-04f,  1.7980205e+00f,
+    4.6459988e-02f,  -9.0972096e-02f, 7.1831360e-02f,  7.0716530e-01f,
+    -1.0303202e-01f, 4.3528955e-04f,  6.6836852e-01f,  -8.4279782e-01f,
+    9.9698991e-02f,  9.9217761e-01f,  5.7834560e-01f,  1.0746475e-02f,
+    4.3528955e-04f,  -1.9419354e-01f, 2.1292897e-01f,  2.9228097e-02f,
+    -8.8806790e-01f, -4.3216497e-01f, -5.1868367e-01f, 4.3528955e-04f,
+    3.4950113e+00f,  2.0882919e+00f,  -2.0109259e-03f, -5.4297996e-01f,
+    8.1844223e-01f,  2.0715050e-02f,  4.3528955e-04f,  3.9900154e-01f,
+    -7.2100657e-01f, 4.3235887e-02f,  1.0678504e+00f,  5.8101612e-01f,
+    2.1358739e-01f,  4.3528955e-04f,  1.6868560e-01f,  -2.7910845e+00f,
+    8.8336714e-02f,  7.2817665e-01f,  4.1302927e-02f,  -3.5887923e-02f,
+    4.3528955e-04f,  -3.2810414e-01f, 1.1153889e+00f,  -1.0935693e-01f,
+    -8.4676880e-01f, -4.0795302e-01f, 9.6220367e-02f,  4.3528955e-04f,
+    5.9330696e-01f,  -8.7856156e-01f, 4.0405612e-02f,  1.5590812e-01f,
+    1.0231596e+00f,  -3.2103498e-02f, 4.3528955e-04f,  2.2934699e+00f,
+    -1.3399214e+00f, 1.6193487e-01f,  4.5085764e-01f,  8.7768233e-01f,
+    9.4883651e-02f,  4.3528955e-04f,  4.2539656e-01f,  1.7120442e+00f,
+    2.3474370e-03f,  -1.0493259e+00f, -8.8822924e-02f, -3.2525703e-02f,
+    4.3528955e-04f,  9.5551372e-01f,  1.3588370e+00f,  -9.4798066e-02f,
+    -5.7994848e-01f, 6.9469571e-01f,  2.4920452e-02f,  4.3528955e-04f,
+    -5.3601122e-01f, -1.5160134e-01f, -1.7066029e-01f, -2.4359327e-02f,
+    -8.9285105e-01f, 3.2834098e-02f,  4.3528955e-04f,  1.7912328e+00f,
+    -4.4241762e+00f, -1.8812999e-02f, 8.2627416e-01f,  2.5185353e-01f,
+    -4.1162767e-02f, 4.3528955e-04f,  4.9252531e-01f,  1.2937322e+00f,
+    8.7287901e-03f,  -7.9359096e-01f, 4.9362287e-01f,  -1.3503897e-01f,
+    4.3528955e-04f,  3.6142251e-01f,  -5.6030905e-01f, 7.5339459e-02f,
+    6.4163691e-01f,  -1.5302195e-01f, -2.7688584e-01f, 4.3528955e-04f,
+    -1.2219087e+00f, -1.0727100e-01f, -4.5697547e-02f, -1.0294904e-01f,
+    -5.9727466e-01f, -5.4764196e-02f, 4.3528955e-04f,  5.6973231e-01f,
+    -1.7450819e+00f, -5.2026059e-02f, 1.0580206e+00f,  2.8782591e-01f,
+    -5.6884203e-02f, 4.3528955e-04f,  -1.2369975e-03f, -5.8013117e-01f,
+    -5.8974922e-03f, 7.4166512e-01f,  -1.0042721e+00f, 3.5535447e-02f,
+    4.3528955e-04f,  -5.9462953e-01f, 3.7291580e-01f,  8.7686956e-02f,
+    -3.0083433e-01f, -6.2008870e-01f, -9.5102675e-02f, 4.3528955e-04f,
+    -1.3492211e+00f, -3.8983810e+00f, 4.1564964e-02f,  8.8925868e-01f,
+    -2.9106182e-01f, 1.7333703e-02f,  4.3528955e-04f,  2.2741601e+00f,
+    -1.4002832e+00f, -6.0956709e-02f, 5.7429653e-01f,  7.3409754e-01f,
+    -1.0685916e-03f, 4.3528955e-04f,  8.7878656e-01f,  8.5581726e-01f,
+    1.6953863e-02f,  -7.3152947e-01f, 9.7729814e-01f,  -2.9440772e-02f,
+    4.3528955e-04f,  -2.1674078e+00f, 8.6668015e-01f,  6.6175461e-02f,
+    -3.6702636e-01f, -8.9041197e-01f, 6.5649763e-02f,  4.3528955e-04f,
+    -3.8680644e+00f, -1.5904489e+00f, 4.5447830e-02f,  2.5090364e-01f,
+    -8.2827896e-01f, 9.7553588e-02f,  4.3528955e-04f,  -9.0892303e-01f,
+    7.1150476e-01f,  -6.8186812e-02f, -1.4613225e-01f, -1.0603489e+00f,
+    3.1673759e-02f,  4.3528955e-04f,  9.4450384e-02f,  1.3218867e+00f,
+    -6.1349716e-02f, -1.1308742e+00f, -2.4090031e-01f, 2.1951146e-01f,
+    4.3528955e-04f,  -1.5746256e+00f, -1.0470667e+00f, -8.6010061e-04f,
+    5.7288134e-01f,  -7.3114324e-01f, 7.5074382e-02f,  4.3528955e-04f,
+    3.3483618e-01f,  -1.5210630e+00f, 2.2692809e-02f,  9.9551523e-01f,
+    -1.0912625e-01f, 8.1972875e-02f,  4.3528955e-04f,  2.4291334e+00f,
+    -3.4399405e-02f, 9.8094881e-02f,  4.1666031e-03f,  1.0377285e+00f,
+    -9.4893619e-02f, 4.3528955e-04f,  -2.6554995e+00f, -3.7823468e-03f,
+    1.1074498e-01f,  1.0974895e-02f,  -8.8933951e-01f, -5.1945969e-02f,
+    4.3528955e-04f,  6.1343318e-01f,  -5.8305007e-01f, -1.1999760e-01f,
+    -1.3594984e-01f, 1.0025090e+00f,  -3.6953089e-01f, 4.3528955e-04f,
+    -1.5069022e+00f, -4.2256989e+00f, 3.0603308e-02f,  7.7946877e-01f,
+    -1.9843438e-01f, -2.7253902e-02f, 4.3528955e-04f,  1.6633128e+00f,
+    -3.0724102e-01f, -1.0430512e-01f, 2.0687644e-01f,  7.8527009e-01f,
+    1.0578775e-01f,  4.3528955e-04f,  6.6953552e-01f,  -3.2005336e+00f,
+    -6.8019770e-02f, 9.4122666e-01f,  2.3615539e-01f,  9.5739000e-02f,
+    4.3528955e-04f,  2.0587425e+00f,  1.4421044e-01f,  -1.8236460e-01f,
+    -2.1935947e-01f, 9.5859706e-01f,  1.1302254e-02f,  4.3528955e-04f,
+    5.4458785e-01f,  2.4709666e-01f,  -6.6692062e-02f, -6.1524159e-01f,
+    4.7059724e-01f,  -2.2888286e-02f, 4.3528955e-04f,  7.2014111e-01f,
+    7.9029727e-01f,  -5.5218376e-02f, -1.0374172e+00f, 4.6188632e-01f,
+    -3.5084408e-02f, 4.3528955e-04f,  -2.7851671e-01f, 1.9118780e+00f,
+    -3.9301552e-02f, -4.8416391e-01f, -6.9028147e-02f, 1.7330231e-01f,
+    4.3528955e-04f,  -4.7618970e-03f, -1.3079121e+00f, 5.0670872e-03f,
+    7.0901120e-01f,  -3.7587307e-02f, 1.8654242e-01f,  4.3528955e-04f,
+    1.1705364e+00f,  3.2781522e+00f,  -1.2150936e-01f, -9.3055469e-01f,
+    2.4822456e-01f,  -9.2048571e-03f, 4.3528955e-04f,  -8.7524939e-01f,
+    5.6159610e-01f,  2.7534345e-01f,  -2.8852278e-01f, -4.9371830e-01f,
+    -1.8835297e-02f, 4.3528955e-04f,  2.7516374e-01f,  4.1634217e-03f,
+    5.2035462e-02f,  6.2060159e-01f,  8.4537053e-01f,  6.1152805e-02f,
+    4.3528955e-04f,  -4.6639569e-02f, 6.0319412e-01f,  1.6582395e-01f,
+    -1.1448529e+00f, -4.2412379e-01f, 1.9294204e-01f,  4.3528955e-04f,
+    -1.9107878e+00f, 5.4044783e-01f,  8.5509293e-02f,  -3.3519489e-01f,
+    -1.0005618e+00f, 4.8810579e-02f,  4.3528955e-04f,  1.1030688e+00f,
+    6.6738385e-01f,  -7.9510882e-03f, -4.9381998e-01f, 7.9014975e-01f,
+    1.1940150e-02f,  4.3528955e-04f,  1.8371016e+00f,  8.6669391e-01f,
+    7.5896859e-02f,  -5.0557137e-01f, 8.7190735e-01f,  -5.3131428e-02f,
+    4.3528955e-04f,  1.8313445e+00f,  -2.6782351e+00f, 4.7099039e-02f,
+    8.1865788e-01f,  6.2905490e-01f,  -2.0879131e-02f, 4.3528955e-04f,
+    -3.3697784e+00f, 1.3097280e+00f,  3.0998563e-02f,  -2.9466379e-01f,
+    -8.8796097e-01f, -6.9427766e-02f, 4.3528955e-04f,  1.4203578e-01f,
+    -6.6499758e-01f, 8.9194849e-03f,  8.9883035e-01f,  9.5924608e-02f,
+    4.9793622e-01f,  4.3528955e-04f,  3.0249829e+00f,  -2.1223748e+00f,
+    -7.0912436e-02f, 5.2555430e-01f,  8.4553987e-01f,  1.9501643e-02f,
+    4.3528955e-04f,  -1.4647747e+00f, -1.9972241e+00f, -3.1711858e-02f,
+    8.9056128e-01f,  -5.0825512e-01f, -1.3292629e-01f, 4.3528955e-04f,
+    -6.2173331e-01f, 5.5558360e-01f,  2.4999851e-02f,  1.0279559e-01f,
+    -9.7097284e-01f, 1.9347340e-01f,  4.3528955e-04f,  -3.2085264e+00f,
+    -2.0158483e-01f, 1.8398251e-01f,  1.7404564e-01f,  -8.4721696e-01f,
+    -7.3831029e-02f, 4.3528955e-04f,  -5.4112524e-01f, 7.1740001e-01f,
+    1.3377176e-01f,  -9.2220765e-01f, -1.1467383e-01f, 7.8370497e-02f,
+    4.3528955e-04f,  -9.6238494e-01f, 5.0185710e-01f,  -1.2713534e-01f,
+    -1.5316142e-01f, -7.7653420e-01f, -6.3943766e-02f, 4.3528955e-04f,
+    -2.9267105e-01f, -1.3744594e+00f, 2.8937540e-03f,  7.5700682e-01f,
+    -1.7309611e-01f, -6.6314831e-02f, 4.3528955e-04f,  -1.5776924e+00f,
+    -4.8578489e-01f, -4.8243001e-02f, 3.3610919e-01f,  -8.7581962e-01f,
+    -4.4119015e-02f, 4.3528955e-04f,  -3.0739406e-01f, 9.2640734e-01f,
+    -1.0629594e-02f, -7.3125219e-01f, -4.8829660e-01f, 2.7730295e-02f,
+    4.3528955e-04f,  9.0094936e-01f,  -5.1445609e-01f, 4.5214146e-02f,
+    2.4363704e-01f,  8.7138581e-01f,  5.1460029e-03f,  4.3528955e-04f,
+    1.8947197e+00f,  -4.5264080e-02f, -1.9929044e-02f, 9.9856898e-02f,
+    1.0626529e+00f,  1.2824624e-02f,  4.3528955e-04f,  3.7218094e-01f,
+    1.9603282e+00f,  -7.5409426e-03f, -7.6854545e-01f, 4.7003534e-01f,
+    -9.4227314e-02f, 4.3528955e-04f,  1.4814088e+00f,  -1.2769011e+00f,
+    1.4682226e-01f,  3.9976391e-01f,  9.7243237e-01f,  1.4586541e-01f,
+    4.3528955e-04f,  -4.3109617e+00f, -4.9896359e-01f, 3.3415098e-02f,
+    -5.6486018e-03f, -8.7749052e-01f, -1.3384028e-02f, 4.3528955e-04f,
+    -1.6760232e+00f, -2.3582497e+00f, 4.0734350e-03f,  6.0181093e-01f,
+    -4.2854720e-01f, -2.1288920e-02f, 4.3528955e-04f,  4.6388783e-02f,
+    -7.2831231e-01f, -7.8903306e-03f, 7.0105147e-01f,  -1.0184012e-02f,
+    7.8063674e-02f,  4.3528955e-04f,  1.3360603e-01f,  -7.1327165e-02f,
+    -8.0827422e-02f, 6.0449660e-01f,  -2.6237807e-01f, 4.7158456e-01f,
+    4.3528955e-04f,  1.0322180e+00f,  -8.8444710e-02f, -2.4497907e-03f,
+    3.9191729e-01f,  7.1182168e-01f,  1.9472133e-01f,  4.3528955e-04f,
+    -1.6787018e+00f, 1.3936006e-02f,  -2.0376258e-02f, 6.9622561e-02f,
+    -1.1742306e+00f, 2.4491500e-02f,  4.3528955e-04f,  -3.7257534e-01f,
+    -3.3005959e-01f, -3.7603412e-02f, 9.9694157e-01f,  -4.7953185e-03f,
+    -5.2515215e-01f, 4.3528955e-04f,  -2.2508092e+00f, 2.2966847e+00f,
+    -1.1166178e-01f, -8.0095035e-01f, -5.4450750e-01f, 5.4696579e-02f,
+    4.3528955e-04f,  1.5744833e+00f,  2.2859666e+00f,  1.0750927e-01f,
+    -7.5779963e-01f, 6.9149649e-01f,  4.5739256e-02f,  4.3528955e-04f,
+    5.6799734e-01f,  -1.9347568e+00f, -4.4610448e-02f, 8.2075489e-01f,
+    4.2844418e-01f,  5.5462327e-03f,  4.3528955e-04f,  -1.8346767e+00f,
+    -5.0701016e-01f, 4.6626353e-03f,  2.1580164e-01f,  -7.8223664e-01f,
+    1.2091298e-01f,  4.3528955e-04f,  9.2052954e-01f,  1.7963296e+00f,
+    -2.1172108e-01f, -7.0143813e-01f, 5.6263095e-01f,  -6.6501491e-02f,
+    4.3528955e-04f,  -7.3058164e-01f, -4.8458591e-02f, -6.3175932e-02f,
+    -2.8580406e-01f, -7.2346181e-01f, 1.4607534e-01f,  4.3528955e-04f,
+    -1.1606205e+00f, 5.5359739e-01f,  -7.8427941e-02f, -8.4612942e-01f,
+    -6.7815095e-01f, 7.2316304e-02f,  4.3528955e-04f,  3.5085919e+00f,
+    1.1668962e+00f,  -2.4600344e-02f, -9.1878489e-02f, 9.4168979e-01f,
+    -7.2389990e-02f, 4.3528955e-04f,  -1.3216339e-02f, 5.1988158e-02f,
+    1.2235074e-01f,  2.9628184e-01f,  5.5495657e-02f,  -5.9069729e-01f,
+    4.3528955e-04f,  -1.0901203e+00f, 6.0255116e-01f,  4.6301369e-02f,
+    -6.9798350e-01f, -1.2656675e-01f, 2.1526079e-01f,  4.3528955e-04f,
+    -1.0973371e+00f, 2.2718024e+00f,  2.0238444e-01f,  -8.6827409e-01f,
+    -5.5853146e-01f, 8.0269307e-02f,  4.3528955e-04f,  -1.9964811e-01f,
+    -4.1819191e-01f, 1.6384948e-02f,  1.0694578e+00f,  4.3344460e-02f,
+    2.9639563e-01f,  4.3528955e-04f,  -4.6055052e-01f, 8.0910414e-01f,
+    -4.9869474e-02f, -9.4967836e-01f, -5.1311731e-01f, -4.6472646e-02f,
+    4.3528955e-04f,  8.5823262e-01f,  -4.3352618e+00f, -7.6826841e-02f,
+    8.5697871e-01f,  2.2881442e-01f,  2.3213450e-02f,  4.3528955e-04f,
+    1.4068770e+00f,  -2.1306119e+00f, 7.8797340e-02f,  8.1366730e-01f,
+    1.3327995e-01f,  4.3479122e-02f,  4.3528955e-04f,  -3.9261168e-01f,
+    -1.6175076e-01f, -1.8034693e-02f, 5.4976559e-01f,  -9.3817276e-01f,
+    -1.2466094e-02f, 4.3528955e-04f,  -2.0928338e-01f, -2.4221926e+00f,
+    1.3948120e-01f,  8.8001233e-01f,  -4.5026046e-01f, -1.1691218e-02f,
+    4.3528955e-04f,  2.5392240e-01f,  2.5814664e+00f,  -5.6278333e-02f,
+    -9.3892109e-01f, 3.1367335e-03f,  -2.4127369e-01f, 4.3528955e-04f,
+    6.0388062e-02f,  -1.7275724e+00f, -1.1529418e-01f, 9.6161437e-01f,
+    1.4881924e-01f,  -5.9193913e-03f, 4.3528955e-04f,  2.2096753e-01f,
+    -1.9028102e-01f, -9.8590881e-02f, 1.2323563e+00f,  3.3178177e-01f,
+    -6.4575553e-02f, 4.3528955e-04f,  -3.7825681e-02f, -1.4006951e+00f,
+    -1.0015506e-03f, 8.4639901e-01f,  -9.6548952e-02f, 8.0236174e-02f,
+    4.3528955e-04f,  -3.7418777e-01f, 3.8658118e-01f,  -8.0474667e-02f,
+    -1.0075796e+00f, -2.5207719e-01f, 2.3718973e-01f,  4.3528955e-04f,
+    -4.0992048e-01f, -3.0901425e+00f, -7.6425873e-02f, 8.4618926e-01f,
+    -2.5141320e-01f, -7.6960456e-03f, 4.3528955e-04f,  -7.8333372e-01f,
+    -2.2068889e-01f, 1.0356124e-01f,  2.8885379e-01f,  -7.2961676e-01f,
+    6.3103060e-03f,  4.3528955e-04f,  -6.5211147e-01f, -8.1657305e-02f,
+    8.3370291e-02f,  2.0632194e-01f,  -6.1327732e-01f, -1.3197969e-01f,
+    4.3528955e-04f,  -5.3345978e-01f, 6.0345715e-01f,  9.1935411e-02f,
+    -6.1470973e-01f, -1.1198854e+00f, 8.1885017e-02f,  4.3528955e-04f,
+    -5.2436554e-01f, -7.1658295e-01f, 1.1636727e-02f,  7.6223838e-01f,
+    -4.8603621e-01f, 2.8814501e-01f,  4.3528955e-04f,  -2.0485020e+00f,
+    -6.4298987e-01f, 1.4666620e-01f,  2.7898651e-01f,  -9.9010277e-01f,
+    -7.9253661e-03f, 4.3528955e-04f,  -2.6378193e-01f, -8.3037257e-01f,
+    2.2775377e-03f,  1.0320436e+00f,  -5.9847558e-01f, 1.2161526e-01f,
+    4.3528955e-04f,  1.7431035e+00f,  -1.1224538e-01f, 1.2754733e-02f,
+    3.5519913e-01f,  8.9392328e-01f,  2.6083864e-02f,  4.3528955e-04f,
+    -1.9825019e+00f, 1.6631548e+00f,  -6.9976002e-02f, -6.6587645e-01f,
+    -7.8214914e-01f, -1.5668457e-03f, 4.3528955e-04f,  -2.5320234e+00f,
+    4.5381422e+00f,  1.3190304e-01f,  -8.0376834e-01f, -4.5212418e-01f,
+    2.2631714e-02f,  4.3528955e-04f,  -3.8837400e-01f, 4.2758799e-01f,
+    5.5168152e-02f,  -6.5929794e-01f, -6.4117724e-01f, -1.7238241e-01f,
+    4.3528955e-04f,  -6.8755001e-02f, 7.7668369e-01f,  -1.3726029e-01f,
+    -9.5277643e-01f, 9.6169300e-02f,  1.6556144e-01f,  4.3528955e-04f,
+    -4.6988037e-01f, -4.1539826e+00f, -1.8079028e-01f, 8.6600578e-01f,
+    -1.8249425e-01f, -6.0823705e-02f, 4.3528955e-04f,  -6.8252787e-02f,
+    -6.3952750e-01f, 1.2714736e-02f,  1.1548862e+00f,  1.3906900e-03f,
+    3.9105475e-02f,  4.3528955e-04f,  7.1639621e-01f,  -5.9285837e-01f,
+    6.5337978e-02f,  3.0108190e-01f,  1.1175181e+00f,  -4.4194516e-02f,
+    4.3528955e-04f,  1.6847095e-01f,  6.8630397e-01f,  -2.2217111e-01f,
+    -6.4777404e-01f, 1.0786993e-01f,  2.6769736e-01f,  4.3528955e-04f,
+    5.5452812e-01f,  4.4591151e-02f,  -2.6298653e-02f, -5.4346901e-01f,
+    8.6253178e-01f,  6.2286492e-02f,  4.3528955e-04f,  -1.9715778e+00f,
+    -2.8651762e+00f, -4.3898232e-02f, 6.9511735e-01f,  -6.5219259e-01f,
+    6.4324759e-02f,  4.3528955e-04f,  -5.2878326e-01f, 2.1198304e+00f,
+    -1.9936387e-01f, -3.0024999e-01f, -2.7701202e-01f, 2.1257617e-01f,
+    4.3528955e-04f,  -6.4378774e-01f, 7.1667415e-01f,  -1.2004392e-03f,
+    -1.4493372e-01f, -7.8214276e-01f, 4.1184720e-01f,  4.3528955e-04f,
+    2.8002597e-03f,  -1.5346475e+00f, 1.0069033e-01f,  8.1050605e-01f,
+    -5.9705414e-02f, 5.8796592e-03f,  4.3528955e-04f,  1.7117417e+00f,
+    -1.5196555e+00f, -5.8674067e-03f, 8.4071898e-01f,  3.8310093e-01f,
+    1.5986764e-01f,  4.3528955e-04f,  -1.6900882e+00f, 1.5632480e+00f,
+    1.3060671e-01f,  -7.5137240e-01f, -7.3127466e-01f, 4.3170583e-02f,
+    4.3528955e-04f,  -1.0563692e+00f, 1.7401083e-01f,  -1.5488608e-01f,
+    -2.6845968e-01f, -8.3062762e-01f, -1.0629267e-01f, 4.3528955e-04f,
+    1.8455126e+00f,  2.4793074e+00f,  -2.0304371e-02f, -7.9976463e-01f,
+    6.6082877e-01f,  3.2910839e-02f,  4.3528955e-04f,  2.3026595e+00f,
+    -1.5833452e+00f, 1.4882600e-01f,  5.2054495e-01f,  8.3873701e-01f,
+    -5.2865259e-02f, 4.3528955e-04f,  -4.4958181e+00f, -9.6401140e-02f,
+    -2.5703314e-01f, 2.1623902e-02f,  -8.7983537e-01f, 9.3407622e-03f,
+    4.3528955e-04f,  4.3300249e-02f,  -4.8771799e-02f, 2.1109173e-02f,
+    9.8582673e-01f,  1.7438723e-01f,  -2.3309004e-02f, 4.3528955e-04f,
+    2.8359148e-01f,  1.5564251e+00f,  -2.4148966e-01f, -4.3747026e-01f,
+    6.0119651e-02f,  -1.3416407e-01f, 4.3528955e-04f,  1.4433643e+00f,
+    -1.0424025e+00f, 7.6407731e-02f,  8.2782793e-01f,  6.1367387e-01f,
+    6.2737139e-03f,  4.3528955e-04f,  3.0582151e-01f,  2.7324748e-01f,
+    -2.4992649e-02f, -3.3384913e-01f, 1.2366687e+00f,  -3.4787363e-01f,
+    4.3528955e-04f,  8.9164823e-01f,  -1.1180420e+00f, 7.1293809e-03f,
+    7.8573531e-01f,  3.7941489e-01f,  -5.9574958e-02f, 4.3528955e-04f,
+    -8.0749339e-01f, 2.4347856e+00f,  1.8625913e-02f,  -9.1227871e-01f,
+    -3.9105028e-01f, 9.8748900e-02f,  4.3528955e-04f,  9.9036109e-01f,
+    1.5833213e+00f,  -7.2734550e-02f, -1.0118606e+00f, 6.3997787e-01f,
+    7.0183994e-03f,  4.3528955e-04f,  5.1899642e-01f,  -6.8044990e-02f,
+    -2.2436036e-02f, 1.8365455e-01f,  6.1489421e-01f,  -3.4521472e-01f,
+    4.3528955e-04f,  -1.2502953e-01f, 1.9603807e+00f,  7.7139951e-02f,
+    -9.4475204e-01f, 3.9464124e-02f,  -7.0530914e-02f, 4.3528955e-04f,
+    2.1809310e-01f,  -2.8192973e-01f, -8.8177517e-02f, 1.7420800e-01f,
+    3.4734306e-01f,  6.9848076e-02f,  4.3528955e-04f,  -1.7253790e+00f,
+    6.4833987e-01f,  -4.7017597e-02f, -1.5831332e-01f, -1.0773143e+00f,
+    -2.3099646e-02f, 4.3528955e-04f,  3.1200659e-01f,  2.6317425e+00f,
+    -7.5803841e-03f, -9.2410463e-01f, 2.7434048e-01f,  -5.8996426e-03f,
+    4.3528955e-04f,  6.7344916e-01f,  2.3812595e-01f,  -5.3347677e-02f,
+    2.9911479e-01f,  1.0487000e+00f,  -6.4047623e-01f, 4.3528955e-04f,
+    -1.4262769e+00f, -1.5840868e+00f, -1.4185352e-02f, 8.0626714e-01f,
+    -6.6788906e-01f, -1.2527342e-02f, 4.3528955e-04f,  -8.8243270e-01f,
+    -6.6544965e-02f, -4.5219529e-02f, -3.1836036e-01f, -1.0827892e+00f,
+    8.0954842e-02f,  4.3528955e-04f,  8.5320204e-01f,  -4.6619356e-01f,
+    1.8361269e-01f,  1.1744873e-01f,  1.1470025e+00f,  1.3099445e-01f,
+    4.3528955e-04f,  1.5893097e+00f,  3.3359849e-01f,  8.7728597e-02f,
+    -9.4074428e-02f, 8.5558063e-01f,  7.1599372e-02f,  4.3528955e-04f,
+    6.9802475e-01f,  7.0244670e-01f,  -1.2730344e-01f, -7.9351121e-01f,
+    8.6199772e-01f,  2.1429273e-01f,  4.3528955e-04f,  3.9801058e-01f,
+    -1.9619586e-01f, -2.8553704e-02f, 2.6608062e-01f,  9.0531552e-01f,
+    1.0160519e-01f,  4.3528955e-04f,  -2.6663713e+00f, 1.1437129e+00f,
+    -7.9127941e-03f, -2.1553291e-01f, -7.4337685e-01f, 6.1787229e-02f,
+    4.3528955e-04f,  8.2944798e-01f,  -3.9553720e-01f, -2.1320336e-01f,
+    7.3549861e-01f,  5.6847197e-01f,  1.2741445e-01f,  4.3528955e-04f,
+    2.0673868e-01f,  -4.7117770e-03f, -9.5025122e-02f, 1.1885463e-01f,
+    9.6139306e-01f,  7.3349577e-01f,  4.3528955e-04f,  -1.1751581e+00f,
+    -8.8963091e-01f, 5.6728594e-02f,  7.5733441e-01f,  -5.2992356e-01f,
+    -7.2754830e-02f, 4.3528955e-04f,  5.6664163e-01f,  -2.4083002e+00f,
+    -1.1575492e-02f, 9.9481761e-01f,  1.6690493e-01f,  8.4108859e-02f,
+    4.3528955e-04f,  -4.2071491e-01f, 4.0598914e-02f,  4.1631598e-02f,
+    -8.7216872e-01f, -9.8310983e-01f, 2.5905998e-02f,  4.3528955e-04f,
+    -3.1792514e+00f, -2.8342893e+00f, 2.6396619e-02f,  5.7536900e-01f,
+    -6.3687629e-01f, 3.7058637e-02f,  4.3528955e-04f,  -8.5528165e-01f,
+    5.3305882e-01f,  8.0884054e-02f,  -6.9774634e-01f, -8.6514282e-01f,
+    3.2690021e-01f,  4.3528955e-04f,  2.9192681e+00f,  3.2760453e-01f,
+    2.1944508e-02f,  -1.2450788e-02f, 9.8866934e-01f,  1.2543310e-01f,
+    4.3528955e-04f,  2.9221919e-01f,  3.9007831e-01f,  -9.7605832e-02f,
+    -6.3257658e-01f, 7.0576066e-01f,  2.3674605e-02f,  4.3528955e-04f,
+    1.1860079e+00f,  9.9021071e-01f,  -3.5594065e-02f, -7.6199496e-01f,
+    5.8004469e-01f,  -1.0932055e-01f, 4.3528955e-04f,  -1.2753685e+00f,
+    3.1014097e-01f,  1.2885163e-02f,  3.1609413e-01f,  -6.7016387e-01f,
+    5.7022344e-02f,  4.3528955e-04f,  1.2152785e+00f,  3.6533563e+00f,
+    -1.5357046e-01f, -8.2647967e-01f, 3.4494543e-01f,  3.7730463e-02f,
+    4.3528955e-04f,  -3.9361003e-01f, 1.5644358e+00f,  6.6312067e-02f,
+    -7.5193471e-01f, -6.3479301e-03f, 6.3314494e-03f,  4.3528955e-04f,
+    -2.7249730e-01f, -1.6673291e+00f, -1.6021354e-02f, 9.7879130e-01f,
+    -3.8477325e-01f, 1.5680734e-02f,  4.3528955e-04f,  -2.8903919e-01f,
+    -1.1029945e-01f, -1.6943873e-01f, 5.4717648e-01f,  -1.9069647e-02f,
+    -6.8054909e-01f, 4.3528955e-04f,  9.1222882e-02f,  7.1719539e-01f,
+    -2.9452544e-02f, -8.9402622e-01f, -1.0385520e-01f, 3.6462095e-01f,
+    4.3528955e-04f,  4.9034664e-01f,  2.5372047e+00f,  -1.5796764e-01f,
+    -7.8353208e-01f, 3.0035707e-01f,  1.4701201e-01f,  4.3528955e-04f,
+    -1.6712276e+00f, 9.2237347e-01f,  -1.5295211e-02f, -3.9726102e-01f,
+    -9.6922803e-01f, -9.6487127e-02f, 4.3528955e-04f,  -3.3061504e-01f,
+    -2.6439732e-01f, -4.9981024e-02f, 5.9281588e-01f,  -3.9533354e-02f,
+    -7.8602403e-01f, 4.3528955e-04f,  -2.6318662e+00f, -9.9999875e-02f,
+    -1.0537761e-01f, 2.3155998e-01f,  -8.9904398e-01f, -3.5334244e-02f,
+    4.3528955e-04f,  1.0736790e+00f,  -1.0056281e+00f, -3.9341662e-02f,
+    7.4204993e-01f,  7.9801148e-01f,  7.1365498e-02f,  4.3528955e-04f,
+    1.6290334e+00f,  5.3684253e-01f,  8.5536271e-02f,  -5.1997590e-01f,
+    7.1159887e-01f,  -1.3757463e-01f, 4.3528955e-04f,  1.5972921e-01f,
+    5.7883602e-01f,  -3.7885580e-02f, -6.4266074e-01f, 6.0969472e-01f,
+    1.6001739e-01f,  4.3528955e-04f,  -3.6997464e-01f, -9.0999687e-01f,
+    -1.3221473e-02f, 1.1066648e+00f,  -4.2467856e-01f, 1.3324721e-01f,
+    4.3528955e-04f,  -4.0859863e-01f, -5.5761755e-01f, -8.5263021e-02f,
+    8.1594694e-01f,  -4.2623565e-01f, 1.4657044e-01f,  4.3528955e-04f,
+    6.0318547e-01f,  1.6060371e+00f,  7.5351924e-02f,  -6.8833297e-01f,
+    6.2769395e-01f,  3.8721897e-02f,  4.3528955e-04f,  4.6848142e-01f,
+    5.9399033e-01f,  8.6065575e-02f,  -7.5879002e-01f, 5.1864004e-01f,
+    2.3022924e-01f,  4.3528955e-04f,  2.8059611e-01f,  3.5578692e-01f,
+    1.3760082e-01f,  -6.2750471e-01f, 4.9480835e-01f,  6.0928357e-01f,
+    4.3528955e-04f,  2.6870561e+00f,  -3.8201172e+00f, 1.6292152e-01f,
+    7.5746894e-01f,  5.5746984e-01f,  -3.7751743e-04f, 4.3528955e-04f,
+    -6.3296229e-01f, 1.8648008e-01f,  8.3398819e-02f,  -3.6834508e-01f,
+    -1.2584392e+00f, -2.6277814e-02f, 4.3528955e-04f,  -1.7026472e+00f,
+    2.7663729e+00f,  -1.2517599e-02f, -8.2644129e-01f, -5.3506184e-01f,
+    4.6790231e-02f,  4.3528955e-04f,  7.7757531e-01f,  -4.2396235e-01f,
+    4.9392417e-02f,  5.1513946e-01f,  8.3544070e-01f,  3.8013462e-02f,
+    4.3528955e-04f,  1.0379647e-01f,  1.3508245e+00f,  3.7603982e-02f,
+    -7.2131574e-01f, 2.5176909e-03f,  -1.3728854e-01f, 4.3528955e-04f,
+    2.2193615e+00f,  -6.2699205e-01f, -2.8053489e-02f, 1.3227111e-01f,
+    9.5042682e-01f,  -3.8334068e-02f, 4.3528955e-04f,  8.4366590e-01f,
+    7.7615720e-01f,  3.7194576e-02f,  -6.6990256e-01f, 9.9115783e-01f,
+    -1.8025069e-01f, 4.3528955e-04f,  2.6866668e-01f,  -3.6451846e-01f,
+    -5.3256247e-02f, 1.0354757e+00f,  8.0758768e-01f,  4.2162299e-01f,
+    4.3528955e-04f,  4.7384862e-02f,  1.6364790e+00f,  -3.5186723e-02f,
+    -1.0198511e+00f, 3.1282589e-02f,  1.5370726e-02f,  4.3528955e-04f,
+    4.7342142e-01f,  -4.4361076e+00f, -1.0876220e-01f, 8.9444709e-01f,
+    2.8634751e-02f,  -3.7090857e-02f, 4.3528955e-04f,  -1.7024572e+00f,
+    -5.2289593e-01f, 1.2880340e-02f,  -1.6245618e-01f, -5.1097965e-01f,
+    -6.8292372e-02f, 4.3528955e-04f,  4.1192296e-01f,  -2.2673421e-01f,
+    -4.4448368e-02f, 8.6228186e-01f,  8.5851663e-01f,  -3.5524856e-02f,
+    4.3528955e-04f,  -7.9530817e-01f, 4.9255311e-01f,  -3.0509783e-02f,
+    -2.1916683e-01f, -6.6272497e-01f, -6.3844785e-02f, 4.3528955e-04f,
+    -1.6070355e+00f, -3.1690111e+00f, 1.9160762e-03f,  7.9460520e-01f,
+    -3.3164346e-01f, 9.4414561e-04f,  4.3528955e-04f,  -8.9900386e-01f,
+    -1.4264215e+00f, -7.7908426e-03f, 7.6533854e-01f,  -5.6550097e-01f,
+    -5.3219646e-03f, 4.3528955e-04f,  -4.7582126e+00f, 5.1650208e-01f,
+    -3.3228938e-02f, -1.5894417e-02f, -8.4932667e-01f, 2.3929289e-02f,
+    4.3528955e-04f,  1.5043592e+00f,  -3.2150652e+00f, 8.8616714e-02f,
+    8.3122373e-01f,  3.5753649e-01f,  -1.7495936e-02f, 4.3528955e-04f,
+    4.6741363e-01f,  -4.5036831e+00f, 1.4526770e-01f,  8.9116263e-01f,
+    1.0267128e-01f,  -3.0252606e-02f, 4.3528955e-04f,  3.2530186e+00f,
+    -7.8395706e-01f, 7.1479063e-03f,  4.2124763e-01f,  8.3624017e-01f,
+    -6.9495225e-03f, 4.3528955e-04f,  9.4503242e-01f,  -1.1224557e+00f,
+    -9.4798438e-02f, 5.2605218e-01f,  6.8140876e-01f,  -4.9549006e-02f,
+    4.3528955e-04f,  -6.0506040e-01f, -6.1966851e-02f, -2.3466522e-01f,
+    -5.1676905e-01f, -6.8369699e-01f, -3.8264361e-01f, 4.3528955e-04f,
+    1.6045483e+00f,  -2.7520726e+00f, -8.3766520e-02f, 7.7127695e-01f,
+    5.1247066e-01f,  7.8615598e-02f,  4.3528955e-04f,  1.9128742e+00f,
+    2.3965627e-01f,  -9.5662493e-03f, -1.0804710e-01f, 1.2123753e+00f,
+    7.6982170e-02f,  4.3528955e-04f,  -2.1854777e+00f, 1.3149252e+00f,
+    1.7524103e-02f,  -5.5368072e-01f, -8.0884409e-01f, 2.8567716e-02f,
+    4.3528955e-04f,  9.9569321e-02f,  -1.0369093e+00f, 5.5877384e-02f,
+    9.4283545e-01f,  -1.1297291e-01f, 9.0435646e-02f,  4.3528955e-04f,
+    1.5350835e+00f,  1.0402894e+00f,  9.8020531e-02f,  -6.4686710e-01f,
+    6.4278400e-01f,  -2.5993254e-02f, 4.3528955e-04f,  3.8157380e-01f,
+    5.5609173e-01f,  -1.5312885e-01f, -6.0982031e-01f, 4.0178716e-01f,
+    -2.8640175e-02f, 4.3528955e-04f,  1.6251140e+00f,  8.8929707e-01f,
+    5.7938159e-02f,  -5.0785559e-01f, 7.2689855e-01f,  9.2441909e-02f,
+    4.3528955e-04f,  -1.6904168e+00f, -1.9677339e-01f, 1.5659848e-02f,
+    2.3618717e-01f,  -8.7785661e-01f, 2.2973628e-01f,  4.3528955e-04f,
+    2.0531859e+00f,  3.8820082e-01f,  -6.6097088e-02f, -2.2665374e-01f,
+    9.2306036e-01f,  -1.6773471e-01f, 4.3528955e-04f,  3.8406229e-01f,
+    -2.1593191e-01f, -2.3078699e-02f, 5.7673675e-01f,  9.5841962e-01f,
+    -8.7430067e-02f, 4.3528955e-04f,  -4.3663239e-01f, 2.0366621e+00f,
+    -2.1789217e-02f, -8.8247156e-01f, -1.1233694e-01f, -9.1616690e-02f,
+    4.3528955e-04f,  1.7748457e-01f,  -6.9158673e-01f, -8.7322064e-02f,
+    8.7343639e-01f,  1.0697287e-01f,  -1.5493947e-01f, 4.3528955e-04f,
+    1.2355442e+00f,  -3.1532996e+00f, 1.0174315e-01f,  8.0737686e-01f,
+    5.0984770e-01f,  -9.3526579e-03f, 4.3528955e-04f,  2.2214183e-01f,
+    1.1264226e+00f,  -2.9941211e-02f, -8.7924540e-01f, 3.1461455e-02f,
+    -5.4791212e-02f, 4.3528955e-04f,  -1.9551122e-01f, -2.4181418e-01f,
+    3.0132549e-02f,  5.4617471e-01f,  -6.2693703e-01f, 2.5780359e-04f,
+    4.3528955e-04f,  -2.1700785e+00f, 3.1984943e-01f,  -8.9460000e-02f,
+    -2.1540229e-01f, -9.5465070e-01f, 4.7669403e-02f,  4.3528955e-04f,
+    -5.3195304e-01f, -1.9684296e+00f, 3.9524268e-02f,  9.6801132e-01f,
+    -3.2285789e-01f, 1.1956638e-01f,  4.3528955e-04f,  -6.5615916e-01f,
+    1.1563283e+00f,  1.9247431e-01f,  -4.9143904e-01f, -4.4618788e-01f,
+    -2.1971650e-01f, 4.3528955e-04f,  6.1602265e-01f,  -9.9433988e-01f,
+    -4.1660544e-02f, 7.3804343e-01f,  7.8712177e-01f,  -1.2198638e-01f,
+    4.3528955e-04f,  -1.5933486e+00f, 1.4594842e+00f,  -4.7690030e-02f,
+    -4.4272724e-01f, -6.2345684e-01f, 8.3021455e-02f,  4.3528955e-04f,
+    9.9345642e-01f,  3.1415210e+00f,  3.4688767e-02f,  -8.4596556e-01f,
+    2.6290011e-01f,  4.9129397e-02f,  4.3528955e-04f,  -1.3648322e+00f,
+    1.9783546e+00f,  8.1545629e-02f,  -7.7211803e-01f, -6.0017622e-01f,
+    7.2351880e-02f,  4.3528955e-04f,  -1.1991616e+00f, -1.0602750e+00f,
+    2.7752738e-02f,  4.4146535e-01f,  -1.0024675e+00f, 2.4532437e-02f,
+    4.3528955e-04f,  -1.6312784e+00f, -2.6812965e-01f, -1.7275491e-01f,
+    1.4126079e-01f,  -7.8449047e-01f, 1.3337006e-01f,  4.3528955e-04f,
+    1.5738069e+00f,  -4.8046321e-01f, 6.9769025e-03f,  2.3619632e-01f,
+    9.9424917e-01f,  1.8036263e-01f,  4.3528955e-04f,  1.3630193e-01f,
+    -8.9625221e-01f, 1.2522443e-01f,  9.6579987e-01f,  5.1406944e-01f,
+    8.8187136e-02f,  4.3528955e-04f,  -1.9238100e+00f, -1.4972794e+00f,
+    6.1324183e-02f,  3.7533408e-01f,  -9.1988027e-01f, 4.6881530e-03f,
+    4.3528955e-04f,  3.8437709e-01f,  -2.3087962e-01f, -2.0568481e-02f,
+    9.8250937e-01f,  8.2068181e-01f,  -3.3938475e-02f, 4.3528955e-04f,
+    2.5155598e-01f,  3.0733153e-01f,  -7.6396666e-02f, -2.1564269e+00f,
+    1.3396159e-01f,  2.3616552e-01f,  4.3528955e-04f,  2.4270353e+00f,
+    2.0252407e+00f,  -1.2206118e-01f, -5.7060909e-01f, 7.1147025e-01f,
+    1.7456979e-02f,  4.3528955e-04f,  -3.1380148e+00f, -4.2048341e-01f,
+    2.2262061e-01f,  7.2394267e-02f,  -8.6464381e-01f, -4.2650081e-02f,
+    4.3528955e-04f,  5.0957441e-01f,  5.5095655e-01f,  4.3691047e-03f,
+    -1.0152292e+00f, 6.2029988e-01f,  -2.7066347e-01f, 4.3528955e-04f,
+    1.7715843e+00f,  -1.4322764e+00f, 6.8762094e-02f,  4.3271112e-01f,
+    4.1532812e-01f,  -4.3611161e-02f, 4.3528955e-04f,  1.2363526e+00f,
+    6.6573006e-01f,  -6.8292208e-02f, -4.9139750e-01f, 8.8040841e-01f,
+    -4.1231226e-02f, 4.3528955e-04f,  -1.9286144e-01f, -3.9467305e-01f,
+    -4.8507173e-02f, 1.0315835e+00f,  -8.3245188e-01f, -1.8581797e-01f,
+    4.3528955e-04f,  4.5066026e-01f,  -4.4092550e+00f, -3.3616550e-02f,
+    7.8327829e-01f,  5.4905731e-03f,  -1.9805601e-02f, 4.3528955e-04f,
+    2.6148161e-01f,  2.5449258e-01f,  -6.2907793e-02f, -1.2975985e+00f,
+    6.7672646e-01f,  -2.5414193e-01f, 4.3528955e-04f,  -6.6821188e-01f,
+    2.7189221e+00f,  -1.7011145e-01f, -5.9136927e-01f, -3.5449311e-01f,
+    2.1065997e-02f,  4.3528955e-04f,  1.0263144e+00f,  -3.4821565e+00f,
+    2.8970558e-02f,  8.4954894e-01f,  3.3141327e-01f,  -3.1337764e-02f,
+    4.3528955e-04f,  1.7917359e+00f,  1.0374277e+00f,  -4.7528129e-02f,
+    -5.5821693e-01f, 6.6934878e-01f,  -1.2269716e-01f, 4.3528955e-04f,
+    -3.2344837e+00f, 1.0969250e+00f,  -4.1219711e-02f, -2.1609430e-01f,
+    -9.0005237e-01f, 3.4145858e-02f,  4.3528955e-04f,  2.7132065e+00f,
+    1.7104101e+00f,  -1.1803426e-02f, -5.8316255e-01f, 8.0245358e-01f,
+    1.3250545e-02f,  4.3528955e-04f,  -8.6057556e-01f, 4.4934440e-01f,
+    7.8915253e-02f,  -2.6242447e-01f, -5.2418035e-01f, -1.5481699e-01f,
+    4.3528955e-04f,  -1.2536583e+00f, 3.4884179e-01f,  7.1365237e-02f,
+    -5.9308118e-01f, -6.6461545e-01f, -5.6163175e-03f, 4.3528955e-04f,
+    -3.7444763e-02f, 2.7449958e+00f,  -2.6783569e-02f, -7.5007623e-01f,
+    -2.4173772e-01f, -5.3153679e-02f, 4.3528955e-04f,  1.9221568e+00f,
+    1.0940913e+00f,  1.6590813e-03f,  -2.9678077e-01f, 9.5723051e-01f,
+    -4.2738985e-02f, 4.3528955e-04f,  -1.5062639e-01f, -2.4134733e-01f,
+    2.1370363e-01f,  6.9132853e-01f,  -7.5982928e-01f, -6.1713308e-01f,
+    4.3528955e-04f,  -7.4817955e-01f, 6.3022399e-01f,  2.2671606e-01f,
+    1.6890604e-02f,  -7.3694348e-01f, -1.3745776e-01f, 4.3528955e-04f,
+    1.5830293e-01f,  5.6820989e-01f,  -8.2535326e-02f, -1.0003529e+00f,
+    1.1112527e-01f,  1.7493713e-01f,  4.3528955e-04f,  -9.6784127e-01f,
+    -2.4335983e+00f, -4.1545067e-02f, 7.2238094e-01f,  -8.3412014e-02f,
+    3.5448592e-02f,  4.3528955e-04f,  -7.1091568e-01f, 1.6446002e-02f,
+    -4.2873971e-02f, 9.7573504e-02f,  -7.5165647e-01f, -3.5479236e-01f,
+    4.3528955e-04f,  2.9884844e+00f,  -1.1191673e+00f, -6.7899842e-04f,
+    4.2289948e-01f,  8.6072195e-01f,  -3.1748528e-03f, 4.3528955e-04f,
+    -1.3203474e+00f, -7.5833321e-01f, -7.3652901e-04f, 7.4542451e-01f,
+    -6.0491645e-01f, 1.6901693e-01f,  4.3528955e-04f,  2.1955743e-01f,
+    1.6311579e+00f,  1.1617735e-02f,  -9.5133579e-01f, 1.7925636e-01f,
+    6.2991023e-02f,  4.3528955e-04f,  1.6355280e-02f,  5.8594054e-01f,
+    -6.7490734e-02f, -1.3346469e+00f, -1.8123922e-01f, 8.9233108e-03f,
+    4.3528955e-04f,  1.3746215e+00f,  -5.6399333e-01f, -2.4105299e-02f,
+    2.3758389e-01f,  7.7998179e-01f,  -4.5221415e-04f, 4.3528955e-04f,
+    7.8744805e-01f,  -3.9314681e-01f, 8.1214057e-03f,  2.7876157e-02f,
+    9.4434404e-01f,  -1.0846276e-01f, 4.3528955e-04f,  1.4810952e+00f,
+    -2.1380272e+00f, -6.0650213e-03f, 8.4810764e-01f,  5.1461315e-01f,
+    6.1707355e-02f,  4.3528955e-04f,  -9.7949398e-01f, -1.6164738e+00f,
+    4.4522550e-02f,  6.3926369e-01f,  -3.1149176e-01f, 2.8921127e-02f,
+    4.3528955e-04f,  -1.1876075e+00f, -1.0845536e-01f, -1.9894073e-02f,
+    -6.5318549e-01f, -6.6628098e-01f, -1.9788034e-01f, 4.3528955e-04f,
+    -1.6122829e+00f, 3.8713796e+00f,  -1.5886787e-02f, -9.1771579e-01f,
+    -3.0566376e-01f, -8.6156670e-03f, 4.3528955e-04f,  -1.1716690e+00f,
+    5.9551567e-01f,  2.9208615e-02f,  -4.9536821e-01f, -1.1567805e+00f,
+    -2.8405653e-02f, 4.3528955e-04f,  3.8587689e-01f,  4.9823177e-01f,
+    1.2726180e-01f,  -6.9366837e-01f, 4.3446335e-01f,  -7.1376830e-02f,
+    4.3528955e-04f,  1.9513580e+00f,  8.9216268e-01f,  1.2301879e-01f,
+    -3.4953758e-01f, 9.3728948e-01f,  1.0216823e-01f,  4.3528955e-04f,
+    -1.4965385e-01f, 9.8844117e-01f,  4.9270604e-02f,  -7.3628932e-01f,
+    2.8803810e-01f,  1.5445946e-01f,  4.3528955e-04f,  -1.7823491e+00f,
+    -2.1477692e+00f, 5.4760799e-02f,  7.6727223e-01f,  -4.7197568e-01f,
+    4.9263872e-02f,  4.3528955e-04f,  1.0519831e+00f,  3.4746253e-01f,
+    -1.0014322e-01f, -5.7743337e-02f, 7.6023608e-01f,  1.7026998e-02f,
+    4.3528955e-04f,  7.2830725e-01f,  -8.2749277e-01f, -1.6265680e-01f,
+    8.5154420e-01f,  3.5448560e-01f,  7.4506886e-02f,  4.3528955e-04f,
+    -4.9358645e-01f, 9.5173813e-02f,  -1.8176930e-01f, -4.5200279e-01f,
+    -9.1117674e-01f, 2.9977345e-01f,  4.3528955e-04f,  -9.2516476e-01f,
+    2.0893261e+00f,  7.6011741e-03f,  -9.5545310e-01f, -5.6017917e-01f,
+    1.2310679e-02f,  4.3528955e-04f,  1.4659865e+00f,  -4.5523181e+00f,
+    5.0699856e-02f,  8.6746174e-01f,  1.9153556e-01f,  1.7843114e-02f,
+    4.3528955e-04f,  -3.7116027e+00f, -8.9467549e-01f, 2.4957094e-02f,
+    9.0376079e-02f,  -9.4548154e-01f, 1.1932597e-02f,  4.3528955e-04f,
+    -4.2240703e-01f, -4.1375618e+00f, -3.6905449e-02f, 8.7117583e-01f,
+    -1.7874116e-01f, 3.1819992e-02f,  4.3528955e-04f,  -1.2358875e-01f,
+    3.9882213e-01f,  -1.1369313e-01f, -7.8158736e-01f, -4.9872825e-01f,
+    3.8652241e-02f,  4.3528955e-04f,  -3.8232234e+00f, 1.5398806e+00f,
+    -1.1278409e-01f, -3.6745811e-01f, -8.2893586e-01f, 2.2155616e-02f,
+    4.3528955e-04f,  -2.8187122e+00f, 2.0826039e+00f,  1.1314002e-01f,
+    -5.9142959e-01f, -6.7290044e-01f, -1.7845951e-02f, 4.3528955e-04f,
+    6.0383421e-01f,  4.0162153e+00f,  -3.3075336e-02f, -1.0251707e+00f,
+    5.7326861e-02f,  4.2137936e-02f,  4.3528955e-04f,  8.3288366e-01f,
+    1.5265008e+00f,  6.4841017e-02f,  -8.0305076e-01f, 4.9918118e-01f,
+    1.4151365e-02f,  4.3528955e-04f,  -8.1151158e-01f, -1.2768396e+00f,
+    3.4681264e-02f,  1.2412475e-01f,  -5.2803195e-01f, -1.7577392e-01f,
+    4.3528955e-04f,  -1.8769079e+00f, 6.4006555e-01f,  7.4035167e-03f,
+    -7.2778028e-01f, -6.2969059e-01f, -1.2961457e-02f, 4.3528955e-04f,
+    -1.5696118e+00f, 4.0982550e-01f,  -8.4706321e-03f, 9.0089753e-02f,
+    -7.6241112e-01f, 6.6718131e-02f,  4.3528955e-04f,  7.4303883e-01f,
+    1.5716569e+00f,  -1.2976259e-01f, -6.5834260e-01f, 1.3369498e-01f,
+    -9.3228787e-02f, 4.3528955e-04f,  3.7110665e+00f,  -4.1251001e+00f,
+    -6.6280760e-02f, 6.6674542e-01f,  5.8004069e-01f,  -2.1870513e-02f,
+    4.3528955e-04f,  -3.7511417e-01f, 1.1831638e+00f,  -1.6432796e-01f,
+    -1.0193162e+00f, -4.8202363e-01f, -4.7622669e-02f, 4.3528955e-04f,
+    -1.9260553e+00f, -3.1453459e+00f, 8.8775687e-02f,  6.6888523e-01f,
+    -3.0807108e-01f, -4.5079403e-02f, 4.3528955e-04f,  5.4112285e-02f,
+    8.9693761e-01f,  1.3923745e-01f,  -9.7921741e-01f, 2.6900119e-01f,
+    1.0401227e-01f,  4.3528955e-04f,  -2.5086915e+00f, -3.2970846e+00f,
+    4.7606971e-02f,  7.2069007e-01f,  -5.4576069e-01f, -4.2606633e-02f,
+    4.3528955e-04f,  2.4980872e+00f,  1.8294894e+00f,  7.8685269e-02f,
+    -6.3266790e-01f, 7.9928625e-01f,  3.6757085e-02f,  4.3528955e-04f,
+    1.5711740e+00f,  -1.0344864e+00f, 4.5377612e-02f,  7.0911634e-01f,
+    1.6243491e-01f,  -2.9737610e-02f, 4.3528955e-04f,  -3.0429766e-02f,
+    8.0647898e-01f,  -1.2125886e-01f, -8.8272852e-01f, 7.6644921e-01f,
+    2.9131415e-01f,  4.3528955e-04f,  3.1328470e-01f,  6.1781591e-01f,
+    -9.6821584e-02f, -1.2710477e+00f, 4.8463207e-01f,  -2.6319336e-02f,
+    4.3528955e-04f,  5.1604873e-01f,  5.9988356e-01f,  -5.6589913e-02f,
+    -7.9377890e-01f, 5.1439172e-01f,  8.2556061e-02f,  4.3528955e-04f,
+    8.7698802e-02f,  -3.0462918e+00f, 5.4948162e-02f,  7.2130924e-01f,
+    -1.2553822e-01f, -9.5913671e-02f, 4.3528955e-04f,  5.0432914e-01f,
+    -7.4682698e-02f, -1.4939439e-01f, 3.6878958e-01f,  5.4592025e-01f,
+    5.4825163e-01f,  4.3528955e-04f,  -1.9534460e-01f, -2.9175371e-01f,
+    -4.6925806e-02f, 3.9450863e-01f,  -7.0590991e-01f, 3.1190920e-01f,
+    4.3528955e-04f,  -3.6384954e+00f, 1.9180716e+00f,  1.1991622e-01f,
+    -4.5264295e-01f, -6.6719252e-01f, -3.7860386e-02f, 4.3528955e-04f,
+    3.1155198e+00f,  -5.3450364e-01f, 3.1814430e-02f,  1.9506607e-02f,
+    9.5316929e-01f,  8.5243367e-02f,  4.3528955e-04f,  -9.9950671e-01f,
+    -2.2502939e-01f, -2.7965566e-02f, 5.4815624e-02f,  -9.3763602e-01f,
+    3.5604175e-02f,  4.3528955e-04f,  -5.0045854e-01f, -2.1551421e+00f,
+    4.5774583e-02f,  1.0089133e+00f,  -1.5166959e-01f, -4.2454366e-02f,
+    4.3528955e-04f,  1.3195388e+00f,  1.2066299e+00f,  1.3180681e-03f,
+    -5.2966392e-01f, 8.8652050e-01f,  -3.8287186e-03f, 4.3528955e-04f,
+    -2.3197868e+00f, 5.3813154e-01f,  -1.4323013e-01f, -2.0358893e-01f,
+    -7.0593286e-01f, -1.4612174e-03f, 4.3528955e-04f,  -3.8928065e-01f,
+    1.8135694e+00f,  -1.1539131e-01f, -1.0127989e+00f, -5.4707873e-01f,
+    -3.7782935e-03f, 4.3528955e-04f,  1.3128787e-01f,  3.1324604e-01f,
+    -1.1613828e-01f, -9.6565497e-01f, 4.8743463e-01f,  2.2296210e-01f,
+    4.3528955e-04f,  -2.8264084e-01f, -2.0482352e+00f, -1.5862308e-01f,
+    6.4887255e-01f,  -6.2488675e-02f, 5.2259326e-02f,  4.3528955e-04f,
+    -2.2146213e+00f, 8.2265848e-01f,  -4.3692356e-03f, -4.0457764e-01f,
+    -8.6833113e-01f, 1.4349361e-01f,  4.3528955e-04f,  2.8194075e+00f,
+    1.5431981e+00f,  4.6891749e-02f,  -5.2806181e-01f, 9.4605553e-01f,
+    -1.6644672e-02f, 4.3528955e-04f,  1.2291163e+00f,  -1.1094116e+00f,
+    -2.1125948e-02f, 9.1412115e-01f,  6.9120294e-01f,  -2.6790293e-02f,
+    4.3528955e-04f,  4.5774315e-02f,  -7.4914765e-01f, 2.1050863e-02f,
+    7.3184878e-01f,  1.2999527e-01f,  5.6078542e-02f,  4.3528955e-04f,
+    4.1572839e-01f,  2.0098236e+00f,  5.8760777e-02f,  -6.6086060e-01f,
+    2.5880659e-01f,  -9.6063815e-02f, 4.3528955e-04f,  -6.6123319e-01f,
+    -1.0189082e-01f, -3.4447988e-03f, -2.6373081e-03f, -7.7401018e-01f,
+    -1.4497456e-02f, 4.3528955e-04f,  -2.0477908e+00f, -5.8750266e-01f,
+    -1.9196099e-01f, 2.6583609e-01f,  -8.8344193e-01f, -7.0645444e-02f,
+    4.3528955e-04f,  -3.3041394e+00f, -2.2900808e+00f, 1.1528070e-01f,
+    4.5306441e-01f,  -7.3856491e-01f, -3.6893040e-02f, 4.3528955e-04f,
+    2.0154412e+00f,  4.8450238e-01f,  1.5543815e-02f,  -1.8620852e-01f,
+    1.0883974e+00f,  3.6225609e-02f,  4.3528955e-04f,  3.0872491e-01f,
+    4.0224606e-01f,  9.1166705e-02f,  -4.6638316e-01f, 7.7143443e-01f,
+    6.5925515e-01f,  4.3528955e-04f,  8.7760824e-01f,  2.7510577e-01f,
+    1.7797979e-02f,  -2.9797935e-01f, 9.7078758e-01f,  -8.9388855e-02f,
+    4.3528955e-04f,  7.1234787e-01f,  -2.3679936e+00f, 5.0869413e-02f,
+    9.0401238e-01f,  4.7823973e-02f,  -7.6790929e-02f, 4.3528955e-04f,
+    1.3949760e+00f,  2.3945431e-01f,  -3.8810603e-02f, 2.1147342e-01f,
+    7.0634449e-01f,  -1.8859072e-01f, 4.3528955e-04f,  -1.9009757e+00f,
+    -6.0301268e-01f, 4.8257317e-02f,  1.6760142e-01f,  -9.0536672e-01f,
+    -4.4823484e-03f, 4.3528955e-04f,  2.5235028e+00f,  -9.3666130e-01f,
+    7.5783066e-02f,  4.0648574e-01f,  8.8382584e-01f,  -1.0843456e-01f,
+    4.3528955e-04f,  -1.9267662e+00f, 2.5124550e+00f,  1.4117089e-01f,
+    -9.1824472e-01f, -6.4057815e-01f, 3.2649368e-02f,  4.3528955e-04f,
+    -2.9291880e-01f, 5.2158222e-02f,  3.2947254e-03f,  -1.7771052e-01f,
+    -1.0826948e+00f, -1.4147930e-01f, 4.3528955e-04f,  4.2295951e-01f,
+    2.1808259e+00f,  2.2489430e-02f,  -8.7703544e-01f, 6.6168390e-02f,
+    4.3013360e-02f,  4.3528955e-04f,  -1.8220338e+00f, 3.5323131e-01f,
+    -6.6785343e-02f, -3.9568189e-01f, -9.3803746e-01f, -7.6509170e-02f,
+    4.3528955e-04f,  7.8868383e-01f,  5.3664976e-01f,  1.0960373e-01f,
+    -2.7134785e-01f, 9.2691624e-01f,  3.0943942e-01f,  4.3528955e-04f,
+    -1.5222268e+00f, 5.5997258e-01f,  -1.7213039e-01f, -6.6770560e-01f,
+    -3.7135997e-01f, -5.3990912e-03f, 4.3528955e-04f,  4.3032837e+00f,
+    -2.4061038e-01f, 7.6745808e-02f,  6.0499843e-02f,  9.4411939e-01f,
+    -1.3739926e-02f, 4.3528955e-04f,  1.9143574e+00f,  8.8257438e-01f,
+    4.5209240e-02f,  -5.1431066e-01f, 8.4024924e-01f,  8.8160567e-02f,
+    4.3528955e-04f,  -3.9511117e-01f, -2.9672898e-02f, 1.2227301e-01f,
+    5.8551949e-01f,  -4.5785055e-01f, 6.4762509e-01f,  4.3528955e-04f,
+    -9.1726387e-01f, 1.4371368e+00f,  -1.1624065e-01f, -8.2254082e-01f,
+    -4.3494645e-01f, 1.3018741e-01f,  4.3528955e-04f,  1.8678042e-01f,
+    1.3186061e+00f,  1.3237837e-01f,  -6.8897098e-01f, -7.1039751e-02f,
+    7.7484585e-03f,  4.3528955e-04f,  1.0664595e+00f,  -1.2359957e+00f,
+    -3.3773951e-02f, 6.7676556e-01f,  7.1408629e-01f,  -7.7180266e-02f,
+    4.3528955e-04f,  1.0187730e+00f,  -2.8073221e-02f, 5.6223523e-02f,
+    2.6950917e-01f,  8.5886806e-01f,  3.5021219e-02f,  4.3528955e-04f,
+    -4.7467998e-01f, 4.6508598e-01f,  -4.6465926e-02f, -3.2858238e-01f,
+    -7.9678279e-01f, -3.2679009e-01f, 4.3528955e-04f,  -2.7080455e+00f,
+    3.6198139e+00f,  7.4134082e-02f,  -7.7647394e-01f, -5.3970301e-01f,
+    2.5387025e-02f,  4.3528955e-04f,  -6.5683538e-01f, -2.9654315e+00f,
+    1.9688174e-01f,  1.0140966e+00f,  -1.6312833e-01f, 3.7053581e-02f,
+    4.3528955e-04f,  -1.3083253e+00f, -1.1800464e+00f, 3.0229867e-02f,
+    6.9996423e-01f,  -5.9475672e-01f, 1.7552200e-01f,  4.3528955e-04f,
+    1.2114245e+00f,  2.6487134e-02f,  -1.8611832e-01f, -2.0188074e-01f,
+    1.0130707e+00f,  -7.3714547e-02f, 4.3528955e-04f,  2.3404248e+00f,
+    -7.2169399e-01f, -9.8881893e-02f, 1.2805714e-01f,  7.1080410e-01f,
+    -7.6863877e-02f, 4.3528955e-04f,  -1.7738123e+00f, -1.3076222e+00f,
+    1.1182407e-01f,  1.7176364e-01f,  -5.2570903e-01f, 1.1278353e-02f,
+    4.3528955e-04f,  4.3664700e-01f,  -8.3619022e-01f, 1.6352022e-02f,
+    1.1772091e+00f,  -7.8718938e-02f, -1.6953461e-01f, 4.3528955e-04f,
+    7.7987671e-01f,  -1.2544195e-01f, 4.1392475e-02f,  3.7989500e-01f,
+    7.2372407e-01f,  -1.5244494e-01f, 4.3528955e-04f,  -1.3894010e-01f,
+    5.6627977e-01f,  -4.8294205e-02f, -7.2790867e-01f, -5.7502633e-01f,
+    3.8728410e-01f,  4.3528955e-04f,  1.4263835e+00f,  -2.6080363e+00f,
+    -7.1940054e-03f, 8.8656622e-01f,  5.5094117e-01f,  1.6508987e-02f,
+    4.3528955e-04f,  1.0536736e+00f,  5.6991607e-01f,  -8.4239920e-04f,
+    -7.3434517e-02f, 1.0309550e+00f,  -4.5316808e-02f, 4.3528955e-04f,
+    6.7125511e-01f,  -2.2569125e+00f, 1.1688508e-01f,  9.9233747e-01f,
+    1.8324438e-01f,  1.2579346e-02f,  4.3528955e-04f,  -5.0757414e-01f,
+    -2.0540147e-01f, -7.8879267e-02f, -7.9941563e-03f, -7.0739174e-01f,
+    2.1243766e-01f,  4.3528955e-04f,  1.0619334e+00f,  1.1214033e+00f,
+    4.2785410e-02f,  -7.6342660e-01f, 8.0774105e-01f,  -6.1886806e-02f,
+    4.3528955e-04f,  3.4108374e+00f,  1.3031694e+00f,  1.1976974e-01f,
+    -1.6106504e-01f, 8.6888027e-01f,  4.0806949e-02f,  4.3528955e-04f,
+    -7.1255982e-01f, 3.9180893e-01f,  -2.4381752e-01f, -4.9217162e-01f,
+    -4.6334332e-01f, -7.0063815e-02f, 4.3528955e-04f,  1.2156445e-01f,
+    7.7780819e-01f,  6.8712935e-02f,  -1.0467523e+00f, -4.1648708e-02f,
+    7.0878178e-02f,  4.3528955e-04f,  6.4426392e-01f,  7.9680181e-01f,
+    6.4320907e-02f,  -7.3510611e-01f, 3.9533064e-01f,  -1.2439843e-01f,
+    4.3528955e-04f,  -1.1591996e+00f, -1.8134816e-01f, 7.1321055e-03f,
+    1.6338030e-01f,  -9.7992319e-01f, 2.3358957e-01f,  4.3528955e-04f,
+    5.8429587e-01f,  8.1245291e-01f,  -4.7306836e-02f, -7.7145267e-01f,
+    7.2311503e-01f,  -1.7128727e-01f, 4.3528955e-04f,  -1.8336542e+00f,
+    -1.0127969e+00f, 4.2186413e-02f,  1.1395214e-01f,  -8.5738230e-01f,
+    1.9758296e-01f,  4.3528955e-04f,  2.4219635e+00f,  8.4640390e-01f,
+    -7.2520666e-02f, -3.8880214e-01f, 9.6578538e-01f,  -7.3273167e-02f,
+    4.3528955e-04f,  7.1471298e-01f,  8.5783178e-01f,  4.6850712e-04f,
+    -6.9310719e-01f, 5.9186822e-01f,  7.5748019e-02f,  4.3528955e-04f,
+    -3.1481802e+00f, -2.5120802e+00f, -4.0321078e-02f, 6.6684407e-01f,
+    -6.4168000e-01f, -4.8431113e-02f, 4.3528955e-04f,  -9.8410368e-01f,
+    1.2322391e+00f,  4.0922489e-02f,  -2.6022952e-02f, -7.9952800e-01f,
+    -2.0420420e-01f, 4.3528955e-04f,  -3.4441069e-01f, 2.7368968e+00f,
+    -1.2412459e-01f, -9.9065799e-01f, -7.7947192e-02f, -2.2538021e-02f,
+    4.3528955e-04f,  -1.7631243e+00f, -1.2308637e+00f, -1.1188022e-01f,
+    5.8651203e-01f,  -6.7950016e-01f, -7.1616933e-02f, 4.3528955e-04f,
+    2.7291639e+00f,  6.1545968e-01f,  -4.3770082e-02f, -2.2944607e-01f,
+    9.2599034e-01f,  -5.7744779e-02f, 4.3528955e-04f,  9.8342830e-01f,
+    -4.0525049e-01f, -6.0760293e-02f, 3.3344209e-01f,  1.2308379e+00f,
+    1.2935786e-01f,  4.3528955e-04f,  2.8581601e-01f,  -1.4112517e-02f,
+    -1.7678876e-01f, -4.5460242e-01f, 1.5535580e+00f,  -3.6994606e-01f,
+    4.3528955e-04f,  8.6270911e-01f,  9.2712933e-01f,  -3.5473939e-02f,
+    -9.1946012e-01f, 1.0309505e+00f,  6.0221810e-02f,  4.3528955e-04f,
+    -8.9722854e-01f, 1.7029290e+00f,  4.5640755e-02f,  -8.0359757e-01f,
+    -1.8011774e-01f, 1.7072754e-01f,  4.3528955e-04f,  -1.4451771e+00f,
+    1.4134148e+00f,  8.2122207e-02f,  -8.2230687e-01f, -4.5283470e-01f,
+    -6.7036040e-02f, 4.3528955e-04f,  1.6632789e+00f,  -1.9932756e+00f,
+    5.5653471e-02f,  8.1583524e-01f,  5.0974780e-01f,  -4.6123166e-02f,
+    4.3528955e-04f,  -6.4132655e-01f, -2.9846947e+00f, 1.5824383e-02f,
+    7.9289520e-01f,  -1.2155361e-01f, -2.6429862e-02f, 4.3528955e-04f,
+    2.9498377e-01f,  2.1130908e-01f,  -2.3065518e-01f, -8.0761808e-01f,
+    9.1488993e-01f,  6.9834404e-02f,  4.3528955e-04f,  -4.8307291e-01f,
+    -1.3443463e+00f, 3.5763893e-02f,  5.0765014e-01f,  -3.9385077e-01f,
+    8.0975018e-02f,  4.3528955e-04f,  -2.0364411e-03f, 1.2312099e-01f,
+    -1.5632226e-01f, -4.9952552e-01f, -1.0198606e-01f, 8.2385254e-01f,
+    4.3528955e-04f,  -3.0537084e-02f, 4.1151061e+00f,  8.0756713e-03f,
+    -9.2269236e-01f, -9.5245484e-03f, 2.6914662e-02f,  4.3528955e-04f,
+    -3.9534619e-01f, -1.8035842e+00f, 2.7192649e-02f,  7.6255673e-01f,
+    -3.0257186e-01f, -2.0337830e-01f, 4.3528955e-04f,  -3.5672598e+00f,
+    -1.2730845e+00f, 2.4881868e-02f,  2.9876012e-01f,  -7.9164410e-01f,
+    -5.8735903e-02f, 4.3528955e-04f,  -7.5471944e-01f, -4.9377692e-01f,
+    -8.9411046e-03f, 4.0157977e-01f,  -7.4092835e-01f, 1.5000179e-01f,
+    4.3528955e-04f,  1.9819118e+00f,  -4.1295528e-01f, 1.9877127e-01f,
+    4.1145691e-01f,  5.2162260e-01f,  -1.0049545e-01f, 4.3528955e-04f,
+    -5.5425268e-01f, -6.6597354e-01f, 2.9064154e-02f,  6.2021571e-01f,
+    -2.1244894e-01f, -1.5186968e-01f, 4.3528955e-04f,  6.1718738e-01f,
+    4.8425522e+00f,  2.2114774e-02f,  -9.1469938e-01f, 6.4116456e-02f,
+    6.2777116e-03f,  4.3528955e-04f,  1.0847263e-01f,  -2.3458822e+00f,
+    3.7750790e-03f,  9.8158181e-01f,  -2.2117166e-01f, -1.6127359e-02f,
+    4.3528955e-04f,  -1.6747997e+00f, 3.9482909e-01f,  -4.2239107e-02f,
+    2.5999192e-02f,  -8.7887543e-01f, -8.4025450e-02f, 4.3528955e-04f,
+    -6.0559386e-01f, -4.7545546e-01f, 7.0755646e-02f,  6.7131019e-01f,
+    -1.1204072e+00f, 4.0183082e-02f,  4.3528955e-04f,  -1.9433140e+00f,
+    -1.0946375e+00f, 5.5746038e-02f,  2.5335291e-01f,  -9.1574770e-01f,
+    -7.6545686e-02f, 4.3528955e-04f,  2.2360495e-01f,  1.3575339e-01f,
+    -3.3127807e-02f, -3.9031914e-01f, 3.1273517e-01f,  -2.9962015e-01f,
+    4.3528955e-04f,  2.2018628e+00f,  -2.0298283e-01f, 2.3169792e-03f,
+    1.6526647e-01f,  9.5887303e-01f,  -5.3378310e-02f, 4.3528955e-04f,
+    4.6304870e+00f,  -1.2702584e+00f, 2.0059282e-01f,  1.8179649e-01f,
+    8.7383902e-01f,  3.8364134e-04f,  4.3528955e-04f,  -9.8315156e-01f,
+    3.5083795e-01f,  4.3822289e-02f,  -5.8358144e-02f, -8.7237656e-01f,
+    -1.9686761e-01f, 4.3528955e-04f,  1.1127846e-01f,  -4.8046410e-02f,
+    5.3116705e-02f,  1.3340555e+00f,  -1.8583155e-01f, 2.2168294e-01f,
+    4.3528955e-04f,  -6.6988774e-02f, 9.1640338e-02f,  1.5565564e-01f,
+    -1.0844786e-02f, -7.7646786e-01f, -1.7650257e-01f, 4.3528955e-04f,
+    -1.7960348e+00f, -4.9732488e-01f, -4.9041502e-02f, 2.7602810e-01f,
+    -6.8856353e-01f, -8.3671816e-02f, 4.3528955e-04f,  1.5708005e-01f,
+    -1.2277934e-01f, -1.4704129e-01f, 1.1980227e+00f,  6.2525511e-01f,
+    4.0112197e-01f,  4.3528955e-04f,  -9.1938920e-02f, 2.1437123e-02f,
+    6.9828652e-02f,  3.4388134e-01f,  -4.0673524e-01f, 2.8461090e-01f,
+    4.3528955e-04f,  3.0328202e+00f,  1.8111814e+00f,  -5.7537928e-02f,
+    -4.6367425e-01f, 6.8878222e-01f,  1.0565110e-01f,  4.3528955e-04f,
+    2.3395491e+00f,  -1.1238266e+00f, -3.5059210e-02f, 5.1803398e-01f,
+    7.2002441e-01f,  2.4124334e-02f,  4.3528955e-04f,  -3.6012745e-01f,
+    -3.8561423e+00f, 2.9720709e-02f,  7.6672399e-01f,  -1.7622126e-02f,
+    1.3955657e-03f,  4.3528955e-04f,  1.5704383e-01f,  -1.3065981e+00f,
+    1.2118255e-01f,  9.3142033e-01f,  1.8405320e-01f,  5.7355583e-02f,
+    4.3528955e-04f,  -1.1843678e+00f, 1.6676641e-01f,  -1.6413813e-02f,
+    -7.3328927e-02f, -6.1447078e-01f, 1.2300391e-01f,  4.3528955e-04f,
+    1.4284407e+00f,  -2.2257135e+00f, 1.0589403e-01f,  7.4413127e-01f,
+    6.9882792e-01f,  -7.7548631e-02f, 4.3528955e-04f,  1.6204368e+00f,
+    3.0677698e+00f,  -4.5549180e-02f, -8.5601294e-01f, 3.3688101e-01f,
+    -1.6458785e-02f, 4.3528955e-04f,  -4.7250447e-01f, 2.6688607e+00f,
+    1.1184974e-02f,  -8.5653257e-01f, -2.6655164e-01f, 1.8434405e-02f,
+    4.3528955e-04f,  -1.5411100e+00f, 1.6998276e+00f,  -2.4675524e-02f,
+    -5.5652368e-01f, -5.3410023e-01f, 4.8467688e-02f,  4.3528955e-04f,
+    8.6241633e-01f,  4.3443161e-01f,  -5.7756416e-02f, -5.5602342e-01f,
+    4.3863496e-01f,  -2.6363170e-01f, 4.3528955e-04f,  7.3259097e-01f,
+    2.5742469e+00f,  1.3466710e-01f,  -1.0232621e+00f, 3.0628243e-01f,
+    2.4503017e-02f,  4.3528955e-04f,  1.7625883e+00f,  6.7398411e-01f,
+    7.7921219e-02f,  -8.1789419e-02f, 6.6451126e-01f,  1.6876717e-01f,
+    4.3528955e-04f,  2.4401839e+00f,  -1.9271331e-01f, -4.6386715e-02f,
+    1.8522274e-02f,  8.5608590e-01f,  -2.2179447e-02f, 4.3528955e-04f,
+    2.2612375e-01f,  1.1743408e+00f,  6.8118960e-02f,  -1.2793194e+00f,
+    3.5598621e-01f,  6.6667676e-02f,  4.3528955e-04f,  -1.7811886e+00f,
+    -2.5047801e+00f, 6.0402744e-02f,  6.4845675e-01f,  -4.1981152e-01f,
+    3.3660401e-02f,  4.3528955e-04f,  -6.3104606e-01f, 2.3595910e+00f,
+    -6.3560316e-03f, -9.8349065e-01f, -3.0573681e-01f, -7.2268099e-02f,
+    4.3528955e-04f,  7.9656070e-01f,  -1.3980099e+00f, 5.7791550e-02f,
+    8.1901067e-01f,  1.8918321e-01f,  5.2549448e-02f,  4.3528955e-04f,
+    -1.8329369e+00f, 3.4441340e+00f,  -3.0997088e-02f, -9.0326005e-01f,
+    -4.1236532e-01f, 1.3757468e-02f,  4.3528955e-04f,  6.8333846e-01f,
+    -2.7107513e+00f, 1.3411222e-02f,  7.0861971e-01f,  2.8355035e-01f,
+    3.4299016e-02f,  4.3528955e-04f,  1.7861665e+00f,  -1.7971524e+00f,
+    -4.4569779e-02f, 7.1465141e-01f,  6.8738496e-01f,  7.1939677e-02f,
+    4.3528955e-04f,  -4.3149620e-02f, -2.4260783e+00f, 1.0428268e-01f,
+    9.6547621e-01f,  -9.2633329e-02f, 1.9962411e-02f,  4.3528955e-04f,
+    2.0154626e+00f,  -1.4770195e+00f, -6.7135006e-02f, 4.9757031e-01f,
+    8.0167031e-01f,  -3.4165192e-02f, 4.3528955e-04f,  -1.2665753e+00f,
+    -3.1609766e+00f, 6.2783211e-02f,  8.7136996e-01f,  -2.7853277e-01f,
+    2.7160807e-02f,  4.3528955e-04f,  -5.9744531e-01f, -1.3492881e+00f,
+    1.6264983e-02f,  8.4105080e-01f,  -6.3887024e-01f, -7.6508053e-02f,
+    4.3528955e-04f,  1.7431483e-01f,  -6.1369199e-01f, -1.9218560e-02f,
+    1.2443340e+00f,  2.2449757e-01f,  1.3597721e-01f,  4.3528955e-04f,
+    -2.4982634e+00f, 3.6249727e-01f,  7.8495942e-02f,  -2.5531936e-01f,
+    -9.1748792e-01f, -1.0637861e-01f, 4.3528955e-04f,  -1.0899761e+00f,
+    -2.3887362e+00f, 6.1714575e-03f,  9.2460322e-01f,  -5.8469015e-01f,
+    -1.1991275e-02f, 4.3528955e-04f,  1.9592813e-01f,  -2.8561431e-01f,
+    1.1642750e-02f,  1.3663009e+00f,  4.9269965e-01f,  -4.5824900e-02f,
+    4.3528955e-04f,  -1.1651812e+00f, 8.2145983e-01f,  1.0720280e-01f,
+    -8.0819333e-01f, -2.3103577e-01f, 2.8045535e-01f,  4.3528955e-04f,
+    6.7987078e-01f,  -8.3066583e-01f, 9.7249813e-02f,  6.2940931e-01f,
+    2.7587396e-01f,  1.5495064e-02f,  4.3528955e-04f,  1.1262791e+00f,
+    -1.8123887e+00f, 7.0646122e-02f,  8.3865178e-01f,  5.0337481e-01f,
+    -6.4746179e-02f, 4.3528955e-04f,  1.4193350e-01f,  1.5824263e+00f,
+    9.4382159e-02f,  -9.8917478e-01f, -4.0390171e-02f, 5.1472526e-02f,
+    4.3528955e-04f,  -1.4308505e-02f, -4.2588931e-01f, -1.1987735e-01f,
+    1.0691532e+00f,  -4.6046263e-01f, -1.2745146e-01f, 4.3528955e-04f,
+    1.6104525e+00f,  -1.4987866e+00f, 7.8105733e-02f,  8.0087638e-01f,
+    5.6428486e-01f,  1.9304684e-01f,  4.3528955e-04f,  1.4824510e-01f,
+    -9.8579094e-02f, 2.5478493e-02f,  1.2581154e+00f,  4.7554445e-01f,
+    4.8524100e-02f,  4.3528955e-04f,  -3.1068422e-02f, 1.4117844e+00f,
+    7.8013353e-02f,  -6.8690068e-01f, -1.0512276e-02f, 6.2779784e-02f,
+    4.3528955e-04f,  4.2159958e+00f,  1.0499845e-01f,  3.7787180e-02f,
+    1.0284677e-02f,  9.5449471e-01f,  8.7985629e-03f,  4.3528955e-04f,
+    4.3766895e-01f,  -1.4431179e-02f, -4.4127271e-02f, -1.0689002e-02f,
+    1.1839837e+00f,  7.8690276e-02f,  4.3528955e-04f,  -2.0288107e-01f,
+    -1.1865069e+00f, -1.0078384e-01f, 8.1464660e-01f,  1.5657799e-01f,
+    -1.9203810e-01f, 4.3528955e-04f,  -1.0264789e-01f, -5.6801152e-01f,
+    -1.3958214e-01f, 5.8939558e-01f,  -5.3152215e-01f, -3.9276145e-02f,
+    4.3528955e-04f,  1.5926468e+00f,  1.1786140e+00f,  -7.9796407e-03f,
+    -4.1204616e-01f, 8.5197341e-01f,  -8.4198266e-02f, 4.3528955e-04f,
+    1.3705515e+00f,  3.2410514e+00f,  1.0449603e-01f,  -8.3301961e-01f,
+    1.6753218e-01f,  6.2845275e-02f,  4.3528955e-04f,  1.4620272e+00f,
+    -3.6232734e+00f, 8.4449708e-02f,  8.6958987e-01f,  2.5236315e-01f,
+    -1.9011239e-02f, 4.3528955e-04f,  -7.4705929e-01f, -1.1651406e+00f,
+    -1.7225945e-01f, 4.3800959e-01f,  -8.6036104e-01f, -9.9520721e-03f,
+    4.3528955e-04f,  -7.8630024e-01f, 1.3028618e+00f,  1.3693019e-03f,
+    -6.4442724e-01f, -2.9915914e-01f, -2.3320701e-02f, 4.3528955e-04f,
+    -1.7143683e+00f, 2.1112833e+00f,  1.4181955e-01f,  -8.1498456e-01f,
+    -5.6963468e-01f, -1.0815447e-01f, 4.3528955e-04f,  -5.1881768e-02f,
+    -1.0247480e+00f, 9.4329268e-03f,  1.0063796e+00f,  2.2727183e-01f,
+    8.0825649e-02f,  4.3528955e-04f,  -2.0747060e-01f, -1.8810148e+00f,
+    4.2126242e-02f,  6.9233853e-01f,  2.3230591e-01f,  1.1505047e-01f,
+    4.3528955e-04f,  -3.1765503e-01f, -8.7143266e-01f, 6.1031505e-02f,
+    7.7775204e-01f,  -5.5683511e-01f, 1.7974336e-01f,  4.3528955e-04f,
+    -1.2806201e-01f, 7.1208030e-01f,  -9.3974601e-03f, -1.2262242e+00f,
+    -2.8500453e-01f, -1.7780138e-02f, 4.3528955e-04f,  9.3548036e-01f,
+    -1.0710551e+00f, 7.2923496e-02f,  5.4476082e-01f,  2.8654975e-01f,
+    -1.1280643e-01f, 4.3528955e-04f,  -2.6736741e+00f, 1.9258213e+00f,
+    -3.4942929e-02f, -6.0616034e-01f, -6.2834275e-01f, 2.9265374e-02f,
+    4.3528955e-04f,  1.2179046e-01f,  3.7532461e-01f,  -3.2129968e-03f,
+    -1.4078177e+00f, 6.4955163e-01f,  -1.6044824e-01f, 4.3528955e-04f,
+    -6.2316591e-01f, 6.6872501e-01f,  -1.0899656e-01f, -5.5763936e-01f,
+    -4.9174085e-01f, 7.9855770e-02f,  4.3528955e-04f,  -8.2433617e-01f,
+    2.0706795e-01f,  3.7638824e-02f,  -3.6388808e-01f, -8.5323268e-01f,
+    1.3365626e-02f,  4.3528955e-04f,  7.1452552e-01f,  2.0638871e+00f,
+    -1.4155641e-01f, -7.7500802e-01f, 4.7399595e-01f,  4.9572908e-03f,
+    4.3528955e-04f,  1.0178220e+00f,  -1.1636119e+00f, -1.0368702e-01f,
+    1.7123310e-01f,  7.6570213e-01f,  -5.1778797e-02f, 4.3528955e-04f,
+    1.6313007e+00f,  1.0574805e+00f,  -1.1272001e-01f, -4.4341496e-01f,
+    4.5351121e-01f,  -4.6958726e-02f, 4.3528955e-04f,  -2.2179785e-01f,
+    2.5529501e+00f,  4.4721544e-02f,  -1.0274668e+00f, -2.6848814e-02f,
+    -3.1693317e-02f, 4.3528955e-04f,  -2.6112552e+00f, -1.0356460e+00f,
+    -6.4313240e-02f, 3.7682864e-01f,  -6.1232924e-01f, 8.0180794e-02f,
+    4.3528955e-04f,  -8.3890185e-03f, 6.3304371e-01f,  1.4478542e-02f,
+    -1.3545437e+00f, -2.1648714e-01f, -4.3849859e-01f, 4.3528955e-04f,
+    1.2377798e-01f,  7.5291848e-01f,  -6.6793002e-02f, -1.0057472e+00f,
+    4.8518649e-01f,  1.1043333e-01f,  4.3528955e-04f,  -1.3890029e+00f,
+    5.2883124e-01f,  1.8484563e-01f,  -8.6176068e-02f, -7.8057182e-01f,
+    2.9687020e-01f,  4.3528955e-04f,  2.7035382e-01f,  1.6740604e-01f,
+    1.2926026e-01f,  -1.0372140e+00f, 2.0486128e-01f,  2.1212211e-01f,
+    4.3528955e-04f,  1.3022852e+00f,  -3.5823085e+00f, -3.7700269e-02f,
+    8.7681228e-01f,  2.4226135e-01f,  3.5013683e-02f,  4.3528955e-04f,
+    -1.5029714e-02f, 2.2435620e+00f,  -6.2895522e-02f, -1.1589462e+00f,
+    3.5775594e-02f,  -4.1528374e-02f, 4.3528955e-04f,  1.7240156e+00f,
+    -4.4220495e-01f, 1.6840763e-02f,  2.2854407e-01f,  1.0101982e+00f,
+    -6.7374431e-02f, 4.3528955e-04f,  1.1900745e-01f,  8.8163131e-01f,
+    2.6030915e-02f,  -8.9373130e-01f, 6.5033829e-01f,  -1.2208953e-02f,
+    4.3528955e-04f,  -7.1138692e-01f, 1.8521908e-01f,  1.4306283e-01f,
+    -4.1110639e-02f, -7.7178484e-01f, -1.4307649e-01f, 4.3528955e-04f,
+    3.4876852e+00f,  -1.1403059e+00f, -2.9803263e-03f, 2.6173684e-01f,
+    9.1170800e-01f,  -1.5012947e-02f, 4.3528955e-04f,  -1.2220994e+00f,
+    2.1699393e+00f,  -5.4717384e-02f, -8.0290663e-01f, -4.6052444e-01f,
+    1.2861992e-02f,  4.3528955e-04f,  2.3111260e+00f,  1.8687578e+00f,
+    -3.1444930e-02f, -5.6874424e-01f, 6.8459797e-01f,  -1.1363762e-02f,
+    4.3528955e-04f,  7.5213015e-01f,  2.4530648e-01f,  -2.4784634e-02f,
+    -1.0202463e+00f, 9.4235456e-01f,  4.1038880e-01f,  4.3528955e-04f,
+    2.6546800e-01f,  1.2686835e-01f,  3.0590214e-02f,  -6.6983774e-02f,
+    8.7312776e-01f,  3.9297056e-01f,  4.3528955e-04f,  -1.8194910e+00f,
+    1.6053598e+00f,  7.6371878e-02f,  -4.3147522e-01f, -7.0147145e-01f,
+    -1.2057581e-01f, 4.3528955e-04f,  -4.3470521e+00f, 1.5357250e+00f,
+    1.1521611e-02f,  -3.4190372e-01f, -8.5436046e-01f, 6.4401980e-03f,
+    4.3528955e-04f,  2.4718428e+00f,  7.4849766e-01f,  -1.2578441e-01f,
+    -3.0670792e-01f, 9.3496740e-01f,  -9.3041845e-02f, 4.3528955e-04f,
+    1.6245867e+00f,  9.0676534e-01f,  -2.6131051e-02f, -5.0981683e-01f,
+    8.8226199e-01f,  1.4706790e-02f,  4.3528955e-04f,  5.3629357e-02f,
+    -1.9460218e+00f, 1.8931456e-01f,  6.8697190e-01f,  9.0478152e-02f,
+    1.4611387e-01f,  4.3528955e-04f,  1.4326653e-01f,  2.0842566e+00f,
+    7.9307742e-03f,  -9.5330763e-01f, 1.6313007e-02f,  -8.7603740e-02f,
+    4.3528955e-04f,  -3.0684083e+00f, 2.8951976e+00f,  -2.0523956e-01f,
+    -6.8315005e-01f, -5.6792414e-01f, 1.3515852e-02f,  4.3528955e-04f,
+    3.7156016e-01f,  -8.8226348e-02f, -9.0709411e-02f, 7.6120734e-01f,
+    8.9114881e-01f,  4.2123947e-01f,  4.3528955e-04f,  -2.4878051e+00f,
+    -1.3428142e+00f, 1.3648568e-02f,  3.6928186e-01f,  -5.8802229e-01f,
+    -3.1415351e-02f, 4.3528955e-04f,  -8.0916685e-01f, -1.5335155e+00f,
+    -2.3956029e-02f, 8.1454718e-01f,  -5.9393686e-01f, 9.4823241e-02f,
+    4.3528955e-04f,  -3.4465652e+00f, 2.2864447e+00f,  -4.1884389e-02f,
+    -5.0968999e-01f, -8.2923305e-01f, 3.4688734e-03f,  4.3528955e-04f,
+    1.7302960e-01f,  3.8844979e-01f,  2.1224467e-01f,  -5.5934280e-01f,
+    8.2742929e-01f,  -1.5696114e-01f, 4.3528955e-04f,  8.5993123e-01f,
+    4.9684030e-01f,  2.0208281e-01f,  -5.3205526e-01f, 7.9040951e-01f,
+    -1.3906375e-01f, 4.3528955e-04f,  1.2053868e+00f,  1.9082505e+00f,
+    7.9863273e-02f,  -9.3174231e-01f, 4.4501936e-01f,  1.4488532e-02f,
+    4.3528955e-04f,  1.2332289e+00f,  6.6502213e-01f,  2.7194642e-02f,
+    -4.4422036e-01f, 9.9142724e-01f,  -1.3467143e-01f, 4.3528955e-04f,
+    -4.2188945e-01f, 1.1394335e+00f,  7.4561328e-02f,  -3.8032719e-01f,
+    -9.4379687e-01f, 1.5371908e-01f,  4.3528955e-04f,  6.8805552e-01f,
+    -5.0781482e-01f, 8.4537633e-02f,  9.8915055e-02f,  7.2064555e-01f,
+    9.8632440e-02f,  4.3528955e-04f,  -4.6452674e-01f, -6.8949109e-01f,
+    -4.9549226e-02f, 7.8829390e-01f,  -4.1630268e-01f, -4.6720903e-02f,
+    4.3528955e-04f,  9.4517291e-02f,  -1.9617591e+00f, 2.8329676e-01f,
+    8.8471633e-01f,  -3.3164871e-01f, -1.2087487e-01f, 4.3528955e-04f,
+    -1.8062207e+00f, -9.5620090e-01f, 9.5288701e-02f,  5.1075202e-01f,
+    -9.3048662e-01f, -3.0582197e-02f, 4.3528955e-04f,  6.5384638e-01f,
+    -1.5336242e+00f, 9.7270519e-02f,  9.4028151e-01f,  4.2703044e-01f,
+    -4.6439916e-02f, 4.3528955e-04f,  -1.2636801e+00f, -5.3587544e-01f,
+    5.2642107e-02f,  1.7468806e-01f,  -6.6755462e-01f, 1.2143110e-01f,
+    4.3528955e-04f,  8.3303422e-01f,  -8.0496150e-01f, 6.2062754e-03f,
+    7.6811618e-01f,  2.4650210e-01f,  8.4712692e-02f,  4.3528955e-04f,
+    -2.7329252e+00f, 5.7400674e-01f,  -1.3707304e-02f, -3.3052647e-01f,
+    -1.0063365e+00f, -7.6907508e-02f, 4.3528955e-04f,  4.0475959e-01f,
+    -7.3310995e-01f, 1.7290110e-02f,  9.0270841e-01f,  4.7236603e-01f,
+    1.9751348e-01f,  4.3528955e-04f,  8.9114082e-01f,  -3.9041886e+00f,
+    1.4314930e-01f,  8.6452746e-01f,  3.2133898e-01f,  2.3111271e-02f,
+    4.3528955e-04f,  -2.8497865e+00f, 8.7373668e-01f,  7.8135394e-02f,
+    -3.0310807e-01f, -7.8823161e-01f, -6.8280309e-02f, 4.3528955e-04f,
+    2.4931471e+00f,  -2.0805652e+00f, 2.9981118e-01f,  6.9217449e-01f,
+    5.8762097e-01f,  -1.0058647e-01f, 4.3528955e-04f,  3.4743707e+00f,
+    -3.6427355e+00f, 1.1139961e-01f,  6.7770588e-01f,  5.9131593e-01f,
+    -9.4667440e-03f, 4.3528955e-04f,  -2.5808959e+00f, -2.5319693e+00f,
+    6.1932772e-02f,  5.9394115e-01f,  -6.8024421e-01f, 3.7315756e-02f,
+    4.3528955e-04f,  5.7546878e-01f,  7.2117668e-01f,  -1.1854255e-01f,
+    -7.7911931e-01f, 1.7966381e-01f,  8.1078487e-04f,  4.3528955e-04f,
+    -1.9738939e-01f, 2.2021422e+00f,  1.2458548e-01f,  -1.0282260e+00f,
+    -5.5829272e-02f, -1.0241940e-01f, 4.3528955e-04f,  -1.9859957e+00f,
+    6.2058157e-01f,  -5.6927506e-02f, -2.4953787e-01f, -7.8160495e-01f,
+    1.2736998e-01f,  4.3528955e-04f,  2.1928351e+00f,  -2.8004615e+00f,
+    5.8770269e-02f,  7.4881363e-01f,  5.6378692e-01f,  5.0152007e-02f,
+    4.3528955e-04f,  -8.1494164e-01f, 1.7813724e+00f,  -5.2860077e-02f,
+    -7.5254411e-01f, -6.7736650e-01f, 8.0178536e-02f,  4.3528955e-04f,
+    2.1940415e+00f,  2.1297266e+00f,  -9.1236681e-03f, -6.7297322e-01f,
+    7.4085712e-01f,  -9.4919913e-02f, 4.3528955e-04f,  1.2528510e+00f,
+    -1.2292305e+00f, -2.2695884e-03f, 8.1167912e-01f,  6.2831384e-01f,
+    -2.5032112e-02f, 4.3528955e-04f,  2.5438616e+00f,  -4.0069551e+00f,
+    6.3803397e-02f,  7.2150367e-01f,  5.3041196e-01f,  -1.4289888e-04f,
+    4.3528955e-04f,  -8.0390710e-01f, -2.0937443e-02f, 4.4145592e-02f,
+    2.3317467e-01f,  -8.0284691e-01f, 6.4622425e-02f,  4.3528955e-04f,
+    1.9093925e-01f,  -1.2933433e+00f, 8.4598027e-02f,  7.7748722e-01f,
+    4.1109893e-01f,  1.2361845e-01f,  4.3528955e-04f,  1.1618797e+00f,
+    6.3664991e-01f,  -8.4324263e-02f, -5.0661612e-01f, 5.5152196e-01f,
+    1.2249570e-02f,  4.3528955e-04f,  1.1735058e+00f,  3.9594322e-01f,
+    -3.3891432e-02f, -3.7484404e-01f, 5.4143721e-01f,  -6.1145592e-03f,
+    4.3528955e-04f,  3.3215415e-01f,  6.3369465e-01f,  -3.8248058e-02f,
+    -7.7509481e-01f, 6.1869448e-01f,  9.3349330e-03f,  4.3528955e-04f,
+    -5.7882023e-01f, 3.5223794e-01f,  6.3020095e-02f,  -6.5205538e-01f,
+    -2.0266630e-01f, -2.1392727e-01f, 4.3528955e-04f,  8.8722742e-01f,
+    -2.9820807e-02f, -2.5318479e-02f, -4.1306210e-01f, 9.7813344e-01f,
+    -5.2406851e-02f, 4.3528955e-04f,  1.0608631e+00f,  -9.6749049e-01f,
+    -2.1546778e-01f, 5.4097843e-01f,  1.7916377e-01f,  -1.2016536e-01f,
+    4.3528955e-04f,  8.7103558e-01f,  -7.0414519e-01f, 1.3747574e-01f,
+    8.7251282e-01f,  1.9074968e-01f,  -9.7571231e-02f, 4.3528955e-04f,
+    -2.2098136e+00f, 3.1012225e+00f,  -2.7915960e-02f, -7.8782320e-01f,
+    -6.1888069e-01f, 1.6964864e-02f,  4.3528955e-04f,  -2.7419400e+00f,
+    9.5755702e-01f,  6.6877782e-02f,  -4.3573719e-01f, -8.3576477e-01f,
+    1.2340400e-02f,  4.3528955e-04f,  6.2363303e-01f,  -6.4761126e-01f,
+    1.2364513e-01f,  5.4543650e-01f,  4.2302847e-01f,  -1.7439902e-01f,
+    4.3528955e-04f,  -1.3079462e+00f, -6.7402446e-01f, -9.4164431e-02f,
+    2.1264133e-01f,  -8.5664880e-01f, 7.0875064e-02f,  4.3528955e-04f,
+    2.3271184e+00f,  1.0045061e+00f,  8.1497118e-02f,  -4.6193156e-01f,
+    7.7414334e-01f,  -1.0879388e-02f, 4.3528955e-04f,  4.7297290e-01f,
+    -1.2960273e+00f, -4.5066725e-02f, 8.6741769e-01f,  5.1616192e-01f,
+    9.1079697e-03f,  4.3528955e-04f,  -4.0886277e-01f, -1.2489190e+00f,
+    1.7869772e-01f,  1.0724745e+00f,  1.7147663e-01f,  -4.3249011e-02f,
+    4.3528955e-04f,  2.9625025e+00f,  8.9811623e-01f,  1.0366732e-01f,
+    -3.5994434e-01f, 9.9875784e-01f,  5.6906536e-02f,  4.3528955e-04f,
+    -1.4462894e+00f, -8.9719191e-02f, -3.7632052e-02f, 5.9485737e-02f,
+    -9.5634896e-01f, -1.3726316e-01f, 4.3528955e-04f,  1.6132880e+00f,
+    -1.8358498e+00f, 5.9327828e-03f,  5.3722197e-01f,  5.3395593e-01f,
+    -3.8351823e-02f, 4.3528955e-04f,  -1.8009328e+00f, -8.8788676e-01f,
+    7.9495125e-02f,  3.6993861e-01f,  -9.1977715e-01f, 1.4334529e-02f,
+    4.3528955e-04f,  1.3187234e+00f,  2.9230714e+00f,  -7.4055098e-02f,
+    -1.0020747e+00f, 2.4651599e-01f,  -7.0566339e-03f, 4.3528955e-04f,
+    1.0245814e+00f,  -1.2470711e+00f, 6.9593161e-02f,  6.4433324e-01f,
+    4.6833879e-01f,  -1.1757757e-02f, 4.3528955e-04f,  1.4476840e+00f,
+    3.6430258e-01f,  -1.4959517e-01f, -2.6726738e-01f, 8.9678597e-01f,
+    1.7887637e-01f,  4.3528955e-04f,  1.1991001e+00f,  -1.3357672e-01f,
+    9.2097923e-02f,  5.8223921e-01f,  8.9128441e-01f,  1.7508447e-01f,
+    4.3528955e-04f,  -2.5235280e-01f, 2.4037690e-01f,  1.9153684e-02f,
+    -4.5408651e-01f, -1.2068411e+00f, -3.9030842e-02f, 4.3528955e-04f,
+    2.4063656e-01f,  -1.6768345e-01f, -6.5320112e-02f, 5.3654033e-01f,
+    9.1626716e-01f,  2.2374574e-02f,  4.3528955e-04f,  1.7452581e+00f,
+    4.5152801e-01f,  -8.0500610e-02f, -3.0706576e-01f, 9.2148483e-01f,
+    4.1461132e-02f,  4.3528955e-04f,  5.2843964e-01f,  -3.4196645e-02f,
+    -1.0098846e-01f, 1.6464524e-01f,  8.1657040e-01f,  -2.3731372e-01f,
+    4.3528955e-04f,  -3.0751171e+00f, -2.0399392e-02f, -1.7712779e-02f,
+    -1.5751438e-01f, -1.0236182e+00f, 7.5312324e-02f,  4.3528955e-04f,
+    -9.9672365e-01f, -6.0573891e-02f, 2.0338792e-02f,  -4.9611442e-03f,
+    -1.2033057e+00f, 6.6216111e-02f,  4.3528955e-04f,  -8.3427864e-01f,
+    3.5306442e+00f,  1.0248182e-01f,  -8.9954227e-01f, -1.8098161e-01f,
+    2.6785709e-02f,  4.3528955e-04f,  -8.1620008e-01f, 1.1427180e+00f,
+    2.1249359e-02f,  -6.3314486e-01f, -7.5537074e-01f, 6.8656743e-02f,
+    4.3528955e-04f,  -7.2947735e-01f, -2.8773546e-01f, 1.4834255e-02f,
+    4.2110074e-02f,  -1.0107249e+00f, 1.0186988e-01f,  4.3528955e-04f,
+    1.9219340e+00f,  2.0344131e+00f,  1.0537723e-02f,  -8.8453054e-01f,
+    5.6961572e-01f,  1.1592037e-01f,  4.3528955e-04f,  3.9624229e-01f,
+    7.4893737e-01f,  2.5625819e-01f,  -7.8649825e-01f, -1.8142497e-02f,
+    2.7246875e-01f,  4.3528955e-04f,  -9.5972049e-01f, -3.9784238e+00f,
+    -1.2744001e-01f, 8.9626521e-01f,  -2.1719582e-01f, -5.3739928e-02f,
+    4.3528955e-04f,  -2.2209735e+00f, 4.0828973e-01f,  -1.4293413e-03f,
+    4.4912640e-02f,  -9.8741937e-01f, 6.4336501e-02f,  4.3528955e-04f,
+    -1.9072294e-01f, 6.9482073e-02f,  2.8179076e-02f,  -3.4388985e-02f,
+    -7.5702703e-01f, 6.0396558e-01f,  4.3528955e-04f,  -2.1347361e+00f,
+    2.6845937e+00f,  5.1935788e-02f,  -7.7243590e-01f, -6.0209292e-01f,
+    -2.4589475e-03f, 4.3528955e-04f,  3.7380633e-01f,  -1.8558566e-01f,
+    8.8370174e-02f,  2.7392811e-01f,  5.0073767e-01f,  3.8340512e-01f,
+    4.3528955e-04f,  -1.9972539e-01f, -9.9903268e-01f, -1.0925140e-01f,
+    9.1812170e-01f,  -2.0761842e-01f, 8.6280569e-02f,  4.3528955e-04f,
+    -2.4796362e+00f, -2.1080616e+00f, -8.8792235e-02f, 3.7085119e-01f,
+    -7.0346832e-01f, -3.6084629e-04f, 4.3528955e-04f,  -8.0955142e-01f,
+    9.0328604e-02f,  -1.1944088e-01f, 1.8240355e-01f,  -8.1641406e-01f,
+    3.7040301e-02f,  4.3528955e-04f,  1.1111076e+00f,  1.3079691e+00f,
+    1.3121401e-01f,  -7.9988277e-01f, 3.0277237e-01f,  6.3541859e-02f,
+    4.3528955e-04f,  -7.3996657e-01f, 9.9280134e-02f,  -1.0143487e-01f,
+    8.7252170e-02f,  -8.9303696e-01f, -1.0200218e-01f, 4.3528955e-04f,
+    8.6989218e-01f,  -1.2192975e+00f, -1.4109711e-01f, 7.5200081e-01f,
+    3.0269358e-01f,  -2.4913361e-03f, 4.3528955e-04f,  2.7364368e+00f,
+    4.4800675e-01f,  -1.9829268e-02f, -3.2318822e-01f, 9.5497954e-01f,
+    1.4149459e-01f,  4.3528955e-04f,  -1.1395575e+00f, -8.2150316e-01f,
+    -6.2357839e-02f, 7.4103838e-01f,  -8.3848941e-01f, -6.6276886e-02f,
+    4.3528955e-04f,  4.6565396e-01f,  -8.4651977e-01f, 8.1398241e-02f,
+    2.7354741e-01f,  6.8726301e-01f,  -3.0988744e-01f, 4.3528955e-04f,
+    1.0543463e+00f,  1.3841562e+00f,  -9.4186887e-04f, -1.4955588e-01f,
+    8.3551896e-01f,  -4.9011625e-02f, 4.3528955e-04f,  -1.5297432e+00f,
+    6.7655826e-01f,  -1.0511188e-02f, -2.7707219e-01f, -7.8688568e-01f,
+    3.5474356e-02f,  4.3528955e-04f,  -1.1569735e+00f, 1.5199314e+00f,
+    -6.2839692e-03f, -8.7391716e-01f, -6.2095112e-01f, -3.9445881e-02f,
+    4.3528955e-04f,  2.8896003e+00f,  -1.4017584e+00f, 5.9458449e-02f,
+    4.0057647e-01f,  7.7026284e-01f,  -7.0889086e-02f, 4.3528955e-04f,
+    -6.1653548e-01f, 7.4803042e-01f,  -6.6461116e-02f, -7.4472225e-01f,
+    -2.2674614e-01f, 7.5338110e-02f,  4.3528955e-04f,  2.2468379e+00f,
+    1.0900755e+00f,  1.5083292e-01f,  -2.8559774e-01f, 5.5818462e-01f,
+    1.8164465e-01f,  4.3528955e-04f,  -6.6869038e-01f, -5.5123109e-01f,
+    -5.2829117e-02f, 7.0601809e-01f,  -8.0849510e-01f, -2.8608093e-01f,
+    4.3528955e-04f,  -9.1728812e-01f, 1.5100837e-01f,  1.0717191e-02f,
+    -3.3205766e-02f, -9.0089554e-01f, 3.2620288e-03f,  4.3528955e-04f,
+    1.9833508e-01f,  -2.5416875e-01f, -1.1210950e-02f, 7.6340145e-01f,
+    7.6142931e-01f,  -1.2500016e-01f, 4.3528955e-04f,  -6.3136160e-02f,
+    -3.7955418e-02f, -5.0648652e-02f, 1.9443260e-01f,  -9.5924592e-01f,
+    -4.9567673e-01f, 4.3528955e-04f,  -3.3511939e+00f, 1.3763980e+00f,
+    -2.8175980e-01f, -3.3075571e-01f, -7.2215629e-01f, 5.5537324e-02f,
+    4.3528955e-04f,  -7.7278388e-01f, 1.2669877e+00f,  9.9741723e-03f,
+    -1.3017544e+00f, -2.3822296e-01f, 5.6377720e-02f,  4.3528955e-04f,
+    2.3066781e+00f,  1.7438185e+00f,  -3.7814431e-02f, -6.4040411e-01f,
+    7.4742746e-01f,  -1.1747459e-02f, 4.3528955e-04f,  -3.5414958e-01f,
+    6.7642355e-01f,  -1.1737331e-01f, -8.8944966e-01f, -5.5553746e-01f,
+    -6.6356003e-02f, 4.3528955e-04f,  1.9514939e-01f,  5.1513326e-01f,
+    9.0068586e-02f,  -8.9607567e-01f, 9.1939457e-02f,  5.4103935e-01f,
+    4.3528955e-04f,  1.0776924e+00f,  1.1247448e+00f,  1.3590787e-01f,
+    -2.8347340e-01f, 5.9835815e-01f,  -7.2089747e-02f, 4.3528955e-04f,
+    1.3179495e+00f,  1.7951225e+00f,  6.7255691e-02f,  -1.0099132e+00f,
+    5.5739868e-01f,  2.7127409e-02f,  4.3528955e-04f,  2.2312062e+00f,
+    -5.4299039e-01f, 1.4808068e-01f,  7.2737522e-03f,  8.6913300e-01f,
+    5.3679772e-02f,  4.3528955e-04f,  -5.3245026e-01f, 7.5906855e-01f,
+    1.0210465e-01f,  -7.6053566e-01f, -3.0423185e-01f, -9.1883808e-02f,
+    4.3528955e-04f,  -1.9151279e+00f, -1.2326658e+00f, -7.9156891e-02f,
+    4.4597378e-01f,  -7.3878336e-01f, -1.1682343e-01f, 4.3528955e-04f,
+    -4.6890297e+00f, -4.7881648e-02f, 2.5793966e-02f,  -5.7941843e-02f,
+    -8.1397521e-01f, 2.7331932e-02f,  4.3528955e-04f,  -1.1071205e+00f,
+    -3.9004030e+00f, 1.4632164e-02f,  8.2741660e-01f,  -3.3719224e-01f,
+    -8.4945597e-03f, 4.3528955e-04f,  2.8161068e+00f,  2.5371259e-01f,
+    -4.6132848e-02f, -2.4629307e-01f, 9.2917955e-01f,  8.1228957e-02f,
+    4.3528955e-04f,  -2.4190063e+00f, 2.8897872e+00f,  1.4370206e-01f,
+    -5.9525561e-01f, -7.0653802e-01f, 5.4432269e-02f,  4.3528955e-04f,
+    5.6029463e-01f,  2.0975065e+00f,  1.5240030e-02f,  -7.8760713e-01f,
+    1.3256210e-01f,  3.4910530e-02f,  4.3528955e-04f,  -4.3641537e-01f,
+    1.4373167e+00f,  3.3043109e-02f,  -7.9844785e-01f, -2.7614382e-01f,
+    -1.1996660e-01f, 4.3528955e-04f,  -1.4186677e+00f, -1.5117278e+00f,
+    -1.4024404e-01f, 9.2353231e-01f,  -6.2340803e-02f, -8.6422965e-02f,
+    4.3528955e-04f,  8.2067561e-01f,  -1.2150067e+00f, 2.9876277e-02f,
+    8.8452917e-01f,  2.9086155e-01f,  -3.6602367e-02f, 4.3528955e-04f,
+    1.9831281e+00f,  -2.7979410e+00f, -9.8200403e-02f, 8.5055041e-01f,
+    5.4897237e-01f,  -1.9718064e-02f, 4.3528955e-04f,  1.4403319e-01f,
+    1.1965969e+00f,  7.1624294e-02f,  -1.0304714e+00f, 2.8581807e-01f,
+    1.2608708e-01f,  4.3528955e-04f,  -2.1712091e+00f, 2.6044846e+00f,
+    1.5312089e-02f,  -7.2828621e-01f, -5.6067151e-01f, 1.5230587e-02f,
+    4.3528955e-04f,  6.5432943e-02f,  2.8781228e+00f,  5.7560153e-02f,
+    -1.0050591e+00f, -6.3458961e-03f, -3.2405092e-03f, 4.3528955e-04f,
+    -2.4840467e+00f, 1.6254947e-01f,  -2.2345879e-03f, -1.7022824e-01f,
+    -9.2277920e-01f, 1.3186707e-01f,  4.3528955e-04f,  -1.6140789e+00f,
+    -1.2576975e+00f, 3.0457728e-02f,  5.5549473e-01f,  -9.2969650e-01f,
+    -1.3156916e-02f, 4.3528955e-04f,  -1.6935363e+00f, -7.3487413e-01f,
+    -6.1505798e-02f, -9.6553460e-02f, -5.9113693e-01f, -1.2826630e-01f,
+    4.3528955e-04f,  -8.5449976e-01f, -3.0884948e+00f, -3.8969621e-02f,
+    7.3200876e-01f,  -2.9820076e-01f, 5.9529316e-02f,  4.3528955e-04f,
+    1.0351378e+00f,  3.8867459e+00f,  -1.5051538e-02f, -8.9223081e-01f,
+    3.0375513e-01f,  6.2733226e-02f,  4.3528955e-04f,  5.4747328e-02f,
+    6.0016888e-01f,  -1.0423271e-01f, -7.9658186e-01f, -3.8161021e-01f,
+    3.2643098e-01f,  4.3528955e-04f,  1.7992822e+00f,  2.1037467e+00f,
+    -7.0568539e-02f, -6.4013427e-01f, 7.2069573e-01f,  -2.8839797e-02f,
+    4.3528955e-04f,  8.6047316e-01f,  5.0609881e-01f,  -2.3999999e-01f,
+    -6.0632300e-01f, 3.9829370e-01f,  -1.9837283e-01f, 4.3528955e-04f,
+    1.5605989e+00f,  6.2248051e-01f,  -4.0083788e-02f, -5.2638328e-01f,
+    9.3150824e-01f,  -1.2981568e-01f, 4.3528955e-04f,  5.0136089e-01f,
+    1.7221067e+00f,  -4.2231359e-02f, -1.0298797e+00f, 4.7464579e-01f,
+    8.0042973e-02f,  4.3528955e-04f,  -1.1359335e+00f, -7.9333675e-01f,
+    7.6239504e-02f,  6.5233070e-01f,  -9.3884319e-01f, -4.3493770e-02f,
+    4.3528955e-04f,  1.2594597e+00f,  3.0324779e+00f,  -2.0490246e-02f,
+    -9.2858404e-01f, 4.3050870e-01f,  2.2876743e-02f,  4.3528955e-04f,
+    -4.0387809e-02f, -4.1635537e-01f, 7.7664368e-02f,  4.6129367e-01f,
+    -9.6416610e-01f, -3.5914072e-01f, 4.3528955e-04f,  -1.4465107e+00f,
+    8.9203715e-03f,  1.4070280e-01f,  -6.3813701e-02f, -6.6926038e-01f,
+    1.3467934e-02f,  4.3528955e-04f,  1.3855834e+00f,  7.7265239e-01f,
+    -6.8881005e-02f, -3.3959135e-01f, 7.6586396e-01f,  2.4312760e-01f,
+    4.3528955e-04f,  2.3765674e-01f,  -1.5268303e+00f, 3.0190405e-02f,
+    1.0335521e+00f,  2.3334214e-02f,  -7.7476814e-02f, 4.3528955e-04f,
+    2.8210237e+00f,  1.3233345e+00f,  1.6316225e-01f,  -4.2386949e-01f,
+    8.5659707e-01f,  -2.5423197e-02f, 4.3528955e-04f,  -3.4642501e+00f,
+    -7.4352539e-01f, -2.7707780e-02f, 2.3457249e-01f,  -8.6796266e-01f,
+    3.4045599e-02f,  4.3528955e-04f,  -1.3561223e+00f, -1.8002162e+00f,
+    3.1069191e-02f,  6.7489171e-01f,  -5.7943070e-01f, -9.5057584e-02f,
+    4.3528955e-04f,  1.9300683e+00f,  8.0599916e-01f,  -1.5229994e-01f,
+    -5.0685292e-01f, 7.6794749e-01f,  -9.1916397e-02f, 4.3528955e-04f,
+    -3.4507573e+00f, -2.5920522e+00f, -4.4888712e-02f, 5.2828062e-01f,
+    -6.9524604e-01f, 5.1775839e-02f,  4.3528955e-04f,  1.5003972e+00f,
+    -2.7979207e+00f, 8.9141622e-02f,  7.1114129e-01f,  4.8555550e-01f,
+    7.0350133e-02f,  4.3528955e-04f,  1.0986801e+00f,  1.1529102e+00f,
+    -4.2055294e-02f, -6.5066528e-01f, 7.0429492e-01f,  -8.7370969e-02f,
+    4.3528955e-04f,  1.3354640e+00f,  2.0270402e+00f,  6.8740755e-02f,
+    -7.7871448e-01f, 7.1772635e-01f,  3.6650557e-02f,  4.3528955e-04f,
+    -4.3775499e-01f, 2.7882445e-01f,  3.0524455e-02f,  -6.0615760e-01f,
+    -8.3507806e-01f, -2.9027894e-02f, 4.3528955e-04f,  4.3121532e-01f,
+    -1.4993954e-01f, -5.5632360e-02f, 2.0721985e-01f,  6.7359185e-01f,
+    2.1930890e-01f,  4.3528955e-04f,  1.4689544e-01f,  -1.9881763e+00f,
+    -7.6703101e-02f, 7.8135729e-01f,  6.7072563e-02f,  -3.9421905e-02f,
+    4.3528955e-04f,  -8.5320979e-01f, 7.2189003e-01f,  -1.5364744e-01f,
+    -4.7688644e-02f, -7.5285482e-01f, -2.9752398e-01f, 4.3528955e-04f,
+    1.9800025e-01f,  -5.8110315e-01f, -9.2541113e-02f, 1.0283029e+00f,
+    -2.0943272e-01f, -2.8842181e-01f, 4.3528955e-04f,  -2.4393229e+00f,
+    2.6583514e+00f,  4.8695404e-02f,  -7.5314486e-01f, -5.9586817e-01f,
+    1.0460446e-02f,  4.3528955e-04f,  -7.0178407e-01f, -9.4285482e-01f,
+    5.4829378e-02f,  1.0945523e+00f,  3.7516437e-02f,  1.6282859e-01f,
+    4.3528955e-04f,  -6.2866437e-01f, -1.8171599e+00f, 7.8861766e-02f,
+    9.0820384e-01f,  -3.2487518e-01f, -2.0910403e-02f, 4.3528955e-04f,
+    4.6129608e-01f,  1.6117942e-01f,  4.3949358e-02f,  -4.0699169e-04f,
+    1.3041219e+00f,  -2.3300363e-02f, 4.3528955e-04f,  1.7301964e+00f,
+    1.3876000e-01f,  -6.6845804e-02f, -1.4921412e-02f, 9.8644394e-01f,
+    2.4608020e-02f,  4.3528955e-04f,  -1.0126207e-01f, -2.0329518e+00f,
+    -8.8552862e-02f, 5.9389704e-01f,  1.1189844e-01f,  -2.0988469e-01f,
+    4.3528955e-04f,  8.8261557e-01f,  -8.9139241e-01f, 1.4932175e-01f,
+    4.0135559e-01f,  5.2043611e-01f,  3.0155739e-01f,  4.3528955e-04f,
+    1.2824923e+00f,  -3.4021163e+00f, -2.7656909e-03f, 9.4636476e-01f,
+    2.8362173e-01f,  -1.0006161e-02f, 4.3528955e-04f,  2.1780963e+00f,
+    4.6327376e+00f,  -7.1042039e-02f, -8.0766243e-01f, 3.8816705e-01f,
+    1.0733090e-02f,  4.3528955e-04f,  -3.7870679e+00f, 1.2518872e+00f,
+    8.5972399e-03f,  -2.3105516e-01f, -8.4759200e-01f, -3.7824262e-02f,
+    4.3528955e-04f,  1.0975684e-01f,  -1.3838869e+00f, -4.5297753e-02f,
+    9.8044658e-01f,  -1.4709541e-01f, 2.0121284e-02f,  4.3528955e-04f,
+    7.7339929e-01f,  1.3653439e+00f,  -2.0495221e-02f, -1.1255770e+00f,
+    2.8117427e-01f,  5.4144561e-02f,  4.3528955e-04f,  3.1258349e+00f,
+    3.8643211e-01f,  -4.6255188e-03f, -3.0162405e-02f, 9.8489749e-01f,
+    3.8890883e-02f,  4.3528955e-04f,  -1.6936293e-01f, 2.5974452e+00f,
+    -8.6488806e-02f, -1.0584354e+00f, -2.5025776e-01f, 1.4716987e-02f,
+    4.3528955e-04f,  -1.3399552e+00f, -1.9139563e+00f, 3.2249559e-02f,
+    6.1379176e-01f,  -7.4627435e-01f, 7.4899681e-03f,  4.3528955e-04f,
+    -2.1317811e+00f, 3.8002849e-01f,  -4.4216705e-04f, -9.8600686e-02f,
+    -9.4319785e-01f, 1.0316506e-01f,  4.3528955e-04f,  -1.3936301e+00f,
+    7.2360927e-01f,  7.2809696e-02f,  -2.1507695e-01f, -9.8306167e-01f,
+    1.5315999e-01f,  4.3528955e-04f,  -5.5729854e-01f, -1.1458862e-01f,
+    3.7456121e-02f,  -2.7633872e-02f, -7.6591325e-01f, -5.0509727e-01f,
+    4.3528955e-04f,  2.9816165e+00f,  -2.0278728e+00f, 1.3934152e-01f,
+    4.1347894e-01f,  8.0688226e-01f,  -3.0250959e-02f, 4.3528955e-04f,
+    3.5542517e+00f,  1.1715888e+00f,  1.1830042e-01f,  -3.0784884e-01f,
+    9.1164964e-01f,  -4.2073410e-03f, 4.3528955e-04f,  1.9176611e+00f,
+    -3.1886487e+00f, -8.6422734e-02f, 7.3918343e-01f,  3.3372632e-01f,
+    -8.4955148e-02f, 4.3528955e-04f,  -4.9872063e-02f, 8.8426632e-01f,
+    -6.3708678e-02f, -7.0026875e-01f, -1.3340619e-01f, 2.3681629e-01f,
+    4.3528955e-04f,  2.5763712e+00f,  2.9984944e+00f,  2.1613078e-02f,
+    -6.8912709e-01f, 6.2228382e-01f,  -2.6745193e-03f, 4.3528955e-04f,
+    -6.9699663e-01f, 1.0392898e+00f,  6.2197014e-03f,  -7.8517962e-01f,
+    -5.8713794e-01f, 1.2383224e-01f,  4.3528955e-04f,  -3.5416989e+00f,
+    2.5433132e-01f,  -1.2950949e-01f, -3.6350355e-02f, -9.1998512e-01f,
+    -3.6023913e-03f, 4.3528955e-04f,  4.2769015e-03f,  -1.5731010e-01f,
+    -1.3189128e-01f, 9.4763172e-01f,  -3.8673630e-01f, 2.2362442e-01f,
+    4.3528955e-04f,  2.1470485e-02f,  1.6566658e+00f,  5.5455338e-02f,
+    -4.6836373e-01f, 3.0020824e-01f,  3.1271869e-01f,  4.3528955e-04f,
+    -5.2836359e-01f, -1.2473102e-01f, 8.2957618e-02f,  1.0314199e-01f,
+    -8.6117131e-01f, -3.0286810e-01f, 4.3528955e-04f,  3.6164272e-01f,
+    -3.8524553e-02f, 8.7403774e-02f,  4.0763599e-01f,  7.7220082e-01f,
+    2.8372347e-01f,  4.3528955e-04f,  5.0415409e-01f,  1.4986265e+00f,
+    7.5677931e-02f,  -1.0256524e+00f, -1.6927800e-01f, -7.3035225e-02f,
+    4.3528955e-04f,  1.8275669e+00f,  1.3650849e+00f,  -2.8771091e-02f,
+    -5.1965785e-01f, 5.7174367e-01f,  -2.8468019e-03f, 4.3528955e-04f,
+    1.0512679e+00f,  -2.4691534e+00f, -5.7887468e-02f, 9.1211814e-01f,
+    4.1490227e-01f,  -1.3098322e-01f, 4.3528955e-04f,  -3.5785794e+00f,
+    -1.1905481e+00f, -1.1324088e-01f, 2.2581936e-01f,  -8.4135926e-01f,
+    -2.2623695e-03f, 4.3528955e-04f,  8.0188030e-01f,  6.7982012e-01f,
+    9.3623307e-03f,  -4.5117843e-01f, 5.5638522e-01f,  1.7788640e-01f,
+    4.3528955e-04f,  -1.3701813e+00f, -3.8071024e-01f, 9.3546204e-02f,
+    5.8212525e-01f,  -4.9734649e-01f, 9.9848203e-02f,  4.3528955e-04f,
+    -3.2725978e-01f, -4.0023935e-01f, 5.6639640e-03f,  9.1067171e-01f,
+    -4.7602186e-01f, 2.4467991e-01f,  4.3528955e-04f,  1.9343479e+00f,
+    3.0193636e+00f,  6.8569012e-02f,  -8.4729999e-01f, 5.6076455e-01f,
+    -5.1183745e-02f, 4.3528955e-04f,  -6.0957080e-01f, -3.0577326e+00f,
+    -5.1051108e-03f, 8.9770639e-01f,  -6.9119483e-02f, 1.2473267e-01f,
+    4.3528955e-04f,  -4.2946088e-01f, 1.6010027e+00f,  2.4316991e-02f,
+    -7.1165121e-01f, 5.4512881e-02f,  1.8752395e-01f,  4.3528955e-04f,
+    -9.8133349e-01f, 1.7977129e+00f,  -6.0283747e-02f, -7.2630054e-01f,
+    -5.0874031e-01f, 8.8421423e-03f,  4.3528955e-04f,  -1.7559731e-01f,
+    9.3687141e-01f,  -6.8809554e-02f, -8.8663399e-01f, -1.8405901e-01f,
+    2.7374444e-03f,  4.3528955e-04f,  -1.7930398e+00f, -1.1717603e+00f,
+    5.9395190e-02f,  3.9965212e-01f,  -7.3668516e-01f, 9.8224236e-03f,
+    4.3528955e-04f,  2.4054255e+00f,  2.0123062e+00f,  -6.3611940e-02f,
+    -5.8949912e-01f, 6.3997978e-01f,  8.5860461e-02f,  4.3528955e-04f,
+    -1.0959872e+00f, 4.3844223e-01f,  -1.4857452e-02f, 4.1316900e-02f,
+    -7.1704471e-01f, 2.8684292e-02f,  4.3528955e-04f,  -8.6543274e-01f,
+    -1.1746889e+00f, 2.5156501e-01f,  4.3933979e-01f,  -6.5431178e-01f,
+    -3.6804426e-02f, 4.3528955e-04f,  -8.8063931e-01f, 7.4011725e-01f,
+    1.1988863e-02f,  -7.3727340e-01f, -5.1459920e-01f, 1.1973896e-02f,
+    4.3528955e-04f,  4.5342889e-01f,  -1.4656247e+00f, -3.2751220e-03f,
+    6.5903592e-01f,  5.4813701e-01f,  4.8317891e-02f,  4.3528955e-04f,
+    -6.2215602e-01f, -2.4330001e+00f, -1.2228069e-01f, 1.0837550e+00f,
+    -2.3680070e-01f, 6.8860345e-02f,  4.3528955e-04f,  2.2561808e+00f,
+    1.9652840e+00f,  4.1036207e-02f,  -6.1725271e-01f, 7.1676087e-01f,
+    -1.0346054e-01f, 4.3528955e-04f,  2.3330596e-01f,  -6.9760281e-01f,
+    -1.4188291e-01f, 1.2005203e+00f,  7.4251510e-02f,  -4.5390140e-02f,
+    4.3528955e-04f,  -1.2217637e+00f, -7.8242928e-01f, -2.5508818e-03f,
+    7.5887680e-01f,  -5.4948437e-01f, -1.3689803e-01f, 4.3528955e-04f,
+    -1.0756361e+00f, 1.5005352e+00f,  3.0177031e-02f,  -7.8824949e-01f,
+    -7.3508334e-01f, -1.0868519e-01f, 4.3528955e-04f,  -4.5533744e-01f,
+    3.4445763e-01f,  -7.0692286e-02f, -9.4295084e-01f, -2.8744981e-01f,
+    4.4710916e-01f,  4.3528955e-04f,  -1.8019401e+00f, -3.6704779e-01f,
+    9.6709020e-02f,  9.5192313e-02f,  -9.1009527e-01f, 8.9203574e-02f,
+    4.3528955e-04f,  1.9221734e+00f,  -9.2941338e-01f, -4.0699216e-03f,
+    4.7749504e-01f,  8.0222940e-01f,  -3.4183737e-02f, 4.3528955e-04f,
+    -6.4527470e-01f, 3.3370101e-01f,  1.3079448e-01f,  -1.3034980e-01f,
+    -1.3292366e+00f, -1.1417542e-01f, 4.3528955e-04f,  -2.7598083e-01f,
+    -1.6207273e-01f, 2.9560899e-02f,  2.1475042e-01f,  -8.7075871e-01f,
+    4.1573080e-01f,  4.3528955e-04f,  7.1486199e-01f,  -9.9260467e-01f,
+    -2.1619191e-02f, 5.4572046e-01f,  2.1316585e-01f,  -3.5997236e-01f,
+    4.3528955e-04f,  9.3173265e-01f,  -1.2980844e-01f, -1.8667448e-01f,
+    6.9767401e-02f,  6.6200185e-01f,  1.3169025e-01f,  4.3528955e-04f,
+    1.5164829e+00f,  -1.0088232e+00f, 1.1634706e-01f,  5.1049697e-01f,
+    5.3080499e-01f,  1.1189683e-02f,  4.3528955e-04f,  -1.6087041e+00f,
+    1.0644196e+00f,  -5.9477530e-02f, -5.7600254e-01f, -8.6869079e-01f,
+    -6.3658133e-02f, 4.3528955e-04f,  3.4853853e-03f,  1.9572735e+00f,
+    -7.8547396e-02f, -8.7604821e-01f, 1.0742604e-01f,  3.7622731e-02f,
+    4.3528955e-04f,  5.8183050e-01f,  -1.7739646e-01f, 2.9870003e-01f,
+    5.5635202e-01f,  -2.0005694e-01f, -6.2055176e-01f, 4.3528955e-04f,
+    -2.2820008e+00f, -1.3945312e+00f, -7.7892742e-03f, 4.2868552e-01f,
+    -6.9301474e-01f, -9.7477928e-02f, 4.3528955e-04f,  -1.8641583e+00f,
+    2.7465053e-02f,  1.2192180e-01f,  3.0156896e-03f,  -6.8167579e-01f,
+    -8.0299556e-02f, 4.3528955e-04f,  -1.1981364e+00f, 7.0680112e-01f,
+    -3.3857473e-03f, -4.5225790e-01f, -7.0714951e-01f, -8.9042470e-02f,
+    4.3528955e-04f,  6.0733956e-01f,  1.0592633e+00f,  2.8518476e-03f,
+    -8.7947500e-01f, 9.1357589e-01f,  8.1421472e-03f,  4.3528955e-04f,
+    2.3284996e-01f,  -2.3463836e+00f, -1.1872729e-01f, 6.4454567e-01f,
+    1.0177531e-01f,  -5.5570129e-02f, 4.3528955e-04f,  1.0123148e+00f,
+    -4.3642199e-01f, 9.2424653e-02f,  2.7941990e-01f,  7.5670403e-01f,
+    1.8369447e-01f,  4.3528955e-04f,  -2.3166385e+00f, -2.2349715e+00f,
+    -5.8831323e-02f, 6.3332438e-01f,  -7.8983682e-01f, -1.6022406e-03f,
+    4.3528955e-04f,  1.3257864e+00f,  1.5173185e-01f,  -8.5078657e-02f,
+    5.5704767e-01f,  1.0449975e+00f,  -4.2890314e-02f, 4.3528955e-04f,
+    -4.6616891e-01f, 1.1827253e+00f,  6.8474352e-02f,  -9.8163366e-01f,
+    -4.1431677e-01f, -8.3290249e-02f, 4.3528955e-04f,  1.3888853e+00f,
+    -7.0945787e-01f, -2.6485198e-03f, 9.0755951e-01f,  5.8420587e-01f,
+    -6.9841221e-02f, 4.3528955e-04f,  4.0344670e-01f,  -1.9744726e-01f,
+    5.2640639e-02f,  8.9248818e-01f,  5.9592223e-01f,  -3.1512301e-02f,
+    4.3528955e-04f,  -9.3851052e-02f, 1.2325972e-01f,  1.1326956e-02f,
+    -4.1049104e-02f, -8.6170697e-01f, 4.9565232e-01f,  4.3528955e-04f,
+    -2.7608418e-01f, -9.1706961e-01f, -3.9283331e-02f, 6.6629159e-01f,
+    4.6900131e-02f,  -9.6876748e-02f, 4.3528955e-04f,  6.1510152e-01f,
+    -3.1084162e-01f, 3.3496581e-02f,  6.4234143e-01f,  7.0891094e-01f,
+    -1.5240727e-01f, 4.3528955e-04f,  -1.3467759e+00f, 6.5601468e-03f,
+    1.1923847e-01f,  2.4954344e-01f,  -8.0431491e-01f, 1.4003699e-01f,
+    4.3528955e-04f,  1.5015638e+00f,  4.2224205e-01f,  3.7855256e-02f,
+    -3.0567631e-01f, 6.5422416e-01f,  -5.9264053e-02f, 4.3528955e-04f,
+    2.1835573e+00f,  6.3033307e-01f,  -7.5978681e-02f, -1.6632210e-01f,
+    1.0998753e+00f,  -4.1510724e-02f, 4.3528955e-04f,  -2.0947654e+00f,
+    -2.1927676e+00f, 8.4981419e-02f,  6.3444036e-01f,  -5.8818138e-01f,
+    1.5387756e-02f,  4.3528955e-04f,  -1.6005783e+00f, -1.3310740e+00f,
+    6.0040783e-02f,  6.9319654e-01f,  -7.5023818e-01f, 1.6860314e-02f,
+    4.3528955e-04f,  -2.3510771e+00f, 4.9991045e+00f,  -4.8002247e-02f,
+    -7.7929640e-01f, -4.0648994e-01f, -8.1925886e-03f, 4.3528955e-04f,
+    4.9180302e-01f,  2.1565945e-01f,  -9.6070603e-02f, -2.4069451e-01f,
+    9.9891353e-01f,  4.3641704e-01f,  4.3528955e-04f,  -1.4258918e+00f,
+    -2.8863156e-01f, -4.3871175e-02f, 1.4689304e-03f,  -1.0336007e+00f,
+    3.4290813e-02f,  4.3528955e-04f,  -2.1505787e+00f, 1.5565648e+00f,
+    -8.8802092e-03f, -4.0514532e-01f, -8.5340643e-01f, 3.5363320e-02f,
+    4.3528955e-04f,  -7.7668816e-01f, -1.0159142e+00f, -1.0184953e-02f,
+    9.7047758e-01f,  -1.5017816e-01f, -4.9710974e-02f, 4.3528955e-04f,
+    2.4929187e+00f,  9.0935642e-01f,  6.0662776e-03f,  -2.6623783e-01f,
+    8.0046004e-01f,  5.1952224e-02f,  4.3528955e-04f,  1.3683498e-02f,
+    -1.3084476e-01f, -2.0548551e-01f, 1.0873919e+00f,  -1.5618834e-01f,
+    -3.1056911e-01f, 4.3528955e-04f,  5.6075990e-01f,  -1.4416924e+00f,
+    7.1186490e-02f,  9.1688663e-01f,  6.4281619e-01f,  -8.8124141e-02f,
+    4.3528955e-04f,  -3.0944389e-01f, -2.0978789e-01f, 8.5697934e-02f,
+    1.0239930e+00f,  -4.0066984e-01f, 4.0307227e-01f,  4.3528955e-04f,
+    -1.6003882e+00f, 2.3538635e+00f,  3.6375649e-02f,  -7.6307601e-01f,
+    -4.0220189e-01f, 3.0134235e-02f,  4.3528955e-04f,  1.0560352e+00f,
+    -2.2273662e+00f, 7.3063567e-02f,  7.2263932e-01f,  3.7847677e-01f,
+    4.6030346e-02f,  4.3528955e-04f,  -6.4598125e-01f, 8.1129140e-01f,
+    -5.6664143e-02f, -7.4648425e-02f, -7.8997791e-01f, 1.5829606e-01f,
+    4.3528955e-04f,  -2.4379516e+00f, 7.3035315e-02f,  -4.1270629e-04f,
+    6.4617097e-02f,  -8.2543749e-01f, -6.9390438e-02f, 4.3528955e-04f,
+    1.8554060e+00f,  2.2686234e+00f,  6.2723175e-02f,  -8.3886594e-01f,
+    5.4453933e-01f,  2.9522970e-02f,  4.3528955e-04f,  -2.1758134e+00f,
+    2.4692993e+00f,  4.1291825e-02f,  -7.5589931e-01f, -5.8207178e-01f,
+    2.1875396e-02f,  4.3528955e-04f,  -4.0102262e+00f, 2.1402586e+00f,
+    1.4411339e-01f,  -4.7340533e-01f, -7.5536495e-01f, 2.4990121e-02f,
+    4.3528955e-04f,  2.0854461e+00f,  1.0581270e+00f,  -9.4462991e-02f,
+    -4.7763690e-01f, 7.2808206e-01f,  -5.4269750e-02f, 4.3528955e-04f,
+    -3.4809309e-01f, 9.2944306e-01f,  -7.6522999e-02f, -7.1716177e-01f,
+    -1.5862770e-01f, -2.6683810e-01f, 4.3528955e-04f,  -2.2824350e-01f,
+    2.9110308e+00f,  2.2638135e-02f,  -9.0129310e-01f, -8.4137522e-02f,
+    -4.4785440e-02f, 4.3528955e-04f,  -1.6991079e-01f, -6.1489362e-01f,
+    -2.5371367e-02f, 1.0642589e+00f,  -6.7166185e-01f, -1.2231795e-01f,
+    4.3528955e-04f,  6.2697574e-02f,  -8.7367535e-01f, -1.4418544e-01f,
+    8.9939135e-01f,  3.0170986e-01f,  4.7817538e-03f,  4.3528955e-04f,
+    3.0297992e+00f,  2.0787981e+00f,  -7.3474944e-02f, -5.6852180e-01f,
+    8.1469548e-01f,  -3.8897924e-02f, 4.3528955e-04f,  -3.8067240e-01f,
+    -1.1524966e+00f, 3.8516581e-02f,  8.2935613e-01f,  2.4022901e-02f,
+    -1.3954166e-01f, 4.3528955e-04f,  1.1014551e+00f,  -2.5685072e-01f,
+    6.4635614e-04f,  9.9481255e-02f,  9.0067756e-01f,  -2.1589127e-01f,
+    4.3528955e-04f,  -5.7723336e-03f, -3.6178380e-01f, -8.6669117e-02f,
+    1.0192044e+00f,  4.5428507e-02f,  -6.4970207e-01f, 4.3528955e-04f,
+    -2.3682630e+00f, 3.0075445e+00f,  5.6730319e-02f,  -6.8723136e-01f,
+    -6.9053435e-01f, -1.8450310e-02f, 4.3528955e-04f,  1.0060428e+00f,
+    -1.2070980e+00f, 3.7082877e-02f,  1.0089158e+00f,  4.3128464e-01f,
+    1.2174068e-01f,  4.3528955e-04f,  -4.8601833e-01f, -1.4646028e-01f,
+    -1.1447769e-01f, -3.2519069e-02f, -6.5928167e-01f, -6.2041339e-02f,
+    4.3528955e-04f,  -7.9586762e-01f, -5.1124281e-01f, 7.2119661e-02f,
+    6.5245128e-01f,  -6.0699230e-01f, -3.6125593e-02f, 4.3528955e-04f,
+    7.6814789e-01f,  -1.0103707e+00f, -1.7016786e-03f, 7.0108259e-01f,
+    6.9612741e-01f,  -1.7634080e-01f, 4.3528955e-04f,  -1.3888013e-01f,
+    -1.0712302e+00f, 8.7932244e-02f,  5.9174263e-01f,  -1.7615789e-01f,
+    -1.1678394e-01f, 4.3528955e-04f,  3.6192957e-01f,  -1.1191550e+00f,
+    7.2612010e-02f,  9.2398232e-01f,  3.2302028e-01f,  5.5819996e-02f,
+    4.3528955e-04f,  2.0762613e-01f,  3.8743836e-01f,  -1.5759781e-02f,
+    -1.3446941e+00f, 9.9124205e-01f,  -3.9181828e-02f, 4.3528955e-04f,
+    -3.2997631e-02f, -9.1508240e-01f, -4.0426128e-02f, 1.2399937e+00f,
+    2.3933181e-01f,  5.7593007e-03f,  4.3528955e-04f,  -1.9456035e-01f,
+    -2.3826174e-01f, 8.0951400e-02f,  9.3956941e-01f,  -6.4900637e-01f,
+    1.0491522e-01f,  4.3528955e-04f,  -5.1994282e-01f, -5.5935693e-01f,
+    -1.4231588e-01f, 5.4354787e-01f,  -8.2436013e-01f, 4.0677872e-02f,
+    4.3528955e-04f,  -2.0209424e+00f, -1.5723596e+00f, -5.5655923e-02f,
+    5.6295890e-01f,  -6.0998255e-01f, 1.4997948e-02f,  4.3528955e-04f,
+    2.7614758e+00f,  6.0256422e-01f,  7.1232222e-02f,  -2.6086830e-03f,
+    9.8028719e-01f,  -1.1912977e-02f, 4.3528955e-04f,  -1.9922405e+00f,
+    4.7151500e-01f,  -1.7834723e-03f, -1.1477450e-01f, -7.7700359e-01f,
+    -2.7535448e-02f, 4.3528955e-04f,  3.7980145e-01f,  3.4257099e-03f,
+    1.1890216e-01f,  4.6193215e-01f,  1.1608402e+00f,  1.0467423e-01f,
+    4.3528955e-04f,  1.8358094e-01f,  -1.2552780e+00f, -3.7909370e-02f,
+    9.0157223e-01f,  3.6701509e-01f,  9.9518716e-02f,  4.3528955e-04f,
+    1.2123791e+00f,  -1.5972768e+00f, 1.2686159e-01f,  8.1489724e-01f,
+    5.5400294e-01f,  -8.5871525e-02f, 4.3528955e-04f,  -9.4329762e-01f,
+    5.6100458e-02f,  1.7532842e-02f,  -7.8835005e-01f, -7.2736347e-01f,
+    1.0471404e-02f,  4.3528955e-04f,  2.0937004e+00f,  6.3385844e-01f,
+    5.7293497e-02f,  -3.2964948e-01f, 9.0866017e-01f,  3.3154802e-03f,
+    4.3528955e-04f,  -7.0584334e-02f, -9.7772974e-01f, 1.6659202e-01f,
+    4.9047866e-01f,  -2.6394814e-01f, -1.8251322e-02f, 4.3528955e-04f,
+    -1.1481501e+00f, -5.2704561e-01f, -1.8715266e-02f, 5.3857684e-01f,
+    -5.5877143e-01f, -4.1718800e-03f, 4.3528955e-04f,  2.8464165e+00f,
+    4.4943213e-01f,  4.3992575e-02f,  -4.8634093e-02f, 1.0562508e+00f,
+    1.6032696e-02f,  4.3528955e-04f,  -1.0196202e+00f, -2.3240790e+00f,
+    -2.7570516e-02f, 5.7962632e-01f,  -3.4340993e-01f, -4.2130698e-02f,
+    4.3528955e-04f,  -2.8670207e-01f, -1.5506921e+00f, 1.9702598e-01f,
+    7.2750199e-01f,  2.8147116e-01f,  1.5790502e-02f,  4.3528955e-04f,
+    -1.8381362e+00f, -2.0094357e+00f, -3.1918582e-02f, 6.6335338e-01f,
+    -5.2372497e-01f, -1.3898736e-01f, 4.3528955e-04f,  -1.2609208e+00f,
+    2.8901553e+00f,  -3.6906675e-02f, -8.7866908e-01f, -3.5505357e-01f,
+    -4.4401392e-02f, 4.3528955e-04f,  -3.5843959e+00f, -2.1401691e+00f,
+    -1.0643330e-01f, 3.7463492e-01f,  -7.7903843e-01f, -2.0772289e-02f,
+    4.3528955e-04f,  -7.3718268e-01f, 2.3966916e+00f,  1.5484677e-01f,
+    -7.5375187e-01f, -5.2907461e-01f, -5.0237991e-02f, 4.3528955e-04f,
+    -6.3731682e-01f, 1.9150025e+00f,  5.4080207e-03f,  -1.0998387e+00f,
+    -1.8156113e-01f, 7.3647285e-03f,  4.3528955e-04f,  -2.4289921e-01f,
+    -7.4572784e-01f, 8.1248119e-02f,  9.2005670e-01f,  1.2741768e-01f,
+    -1.5394238e-01f, 4.3528955e-04f,  8.6489528e-01f,  9.7779983e-01f,
+    -1.5163459e-01f, -5.2225989e-01f, 5.3084785e-01f,  -2.1541419e-02f,
+    4.3528955e-04f,  7.5544429e-01f,  4.0809071e-01f,  -1.6853604e-01f,
+    -9.3467081e-01f, 5.3369951e-01f,  -2.7258320e-02f, 4.3528955e-04f,
+    -9.1180259e-01f, 3.6572223e+00f,  -1.4079297e-01f, -9.4609094e-01f,
+    -3.5335772e-02f, 7.8737838e-03f,  4.3528955e-04f,  1.5287068e+00f,
+    -7.2364837e-01f, -3.7078999e-02f, 5.7421780e-01f,  5.0547272e-01f,
+    8.3491690e-02f,  4.3528955e-04f,  4.4637341e+00f,  3.2211368e+00f,
+    -1.4458968e-01f, -5.4025429e-01f, 7.3564368e-01f,  -1.7339401e-02f,
+    4.3528955e-04f,  1.4302769e-01f,  1.4696223e+00f,  -9.2452578e-02f,
+    -3.6000121e-01f, 4.2636141e-01f,  -1.9545370e-01f, 4.3528955e-04f,
+    -1.9442877e-01f, -8.5649079e-01f, 7.9957530e-02f,  7.1255511e-01f,
+    -6.6840820e-02f, -2.2177167e-01f, 4.3528955e-04f,  -3.4624767e+00f,
+    -2.8475149e+00f, 5.3151054e-03f,  5.0592685e-01f,  -5.9230888e-01f,
+    3.3296701e-02f,  4.3528955e-04f,  -1.4694417e-01f, 7.9853117e-01f,
+    -1.3091272e-01f, -9.6863246e-01f, -5.1505375e-01f, -8.5718878e-02f,
+    4.3528955e-04f,  -2.6575654e+00f, -3.1684060e+00f, 1.0628834e-01f,
+    7.0591974e-01f,  -6.2780488e-01f, -3.2781709e-02f, 4.3528955e-04f,
+    1.5708895e+00f,  -4.2342246e-01f, 1.6597222e-01f,  4.0844396e-01f,
+    8.7643480e-01f,  9.2204601e-02f,  4.3528955e-04f,  -4.5800325e-01f,
+    1.8205228e-01f,  -1.3429826e-01f, 3.7224445e-02f,  -1.0611209e+00f,
+    2.5574582e-02f,  4.3528955e-04f,  -1.6134286e+00f, -1.7064326e+00f,
+    -8.3588079e-02f, 6.1157286e-01f,  -4.3371844e-01f, -1.0029837e-01f,
+    4.3528955e-04f,  -2.1027794e+00f, -5.1347286e-01f, 1.2565752e-02f,
+    -4.7717791e-02f, -8.2282400e-01f, 1.2548476e-02f,  4.3528955e-04f,
+    -1.8614851e+00f, -2.0677026e-01f, 7.9853842e-03f,  2.0795761e-01f,
+    -9.4659382e-01f, -3.9114386e-02f, 4.3528955e-04f,  5.1289411e+00f,
+    -1.3179317e+00f, 1.0919008e-01f,  1.9358820e-01f,  8.8127631e-01f,
+    -1.9898232e-02f, 4.3528955e-04f,  -1.2269670e+00f, 8.7995011e-01f,
+    2.6177542e-02f,  -3.7419376e-01f, -8.9926326e-01f, -6.7875780e-02f,
+    4.3528955e-04f,  -2.2015564e+00f, -2.1850240e+00f, -3.4390133e-02f,
+    5.6716156e-01f,  -6.4842093e-01f, -5.1432591e-02f, 4.3528955e-04f,
+    1.7781328e+00f,  5.5955946e-03f,  -6.9393143e-02f, -1.3635764e-01f,
+    9.9708903e-01f,  -7.3676907e-02f, 4.3528955e-04f,  1.2529815e+00f,
+    1.9671642e+00f,  -5.1458456e-02f, -8.5457945e-01f, 5.7445496e-01f,
+    5.8118518e-02f,  4.3528955e-04f,  -3.5883725e-02f, -4.4611484e-01f,
+    1.2419444e-01f,  7.5674605e-01f,  7.7487037e-02f,  -3.4017593e-01f,
+    4.3528955e-04f,  1.7376158e+00f,  -1.3196661e-01f, -6.4040616e-02f,
+    -1.9054647e-01f, 7.2107947e-01f,  -2.0503297e-02f, 4.3528955e-04f,
+    -1.4108166e+00f, -2.6815710e+00f, 1.7364021e-01f,  6.0414255e-01f,
+    -4.6622850e-02f, 6.1375309e-02f,  4.3528955e-04f,  1.2403609e+00f,
+    -1.1871028e+00f, -7.2622625e-04f, 4.8537186e-01f,  8.6502784e-01f,
+    -4.5529746e-02f, 4.3528955e-04f,  -1.0622272e+00f, 6.7466962e-01f,
+    -8.1324968e-03f, -5.4996812e-01f, -8.9663553e-01f, 1.3363400e-01f,
+    4.3528955e-04f,  6.3160449e-01f,  1.0832291e+00f,  -1.3951319e-01f,
+    -2.5244159e-01f, 2.9613563e-01f,  1.6045372e-01f,  4.3528955e-04f,
+    3.0216222e+00f,  1.3697159e+00f,  1.1086130e-01f,  -3.5881513e-01f,
+    9.1569012e-01f,  1.4387457e-02f,  4.3528955e-04f,  -2.0275074e-01f,
+    -1.1858085e+00f, -4.1962337e-02f, 9.4528812e-01f,  5.0686747e-01f,
+    -2.0301621e-04f, 4.3528955e-04f,  4.7311044e-01f,  5.4447269e-01f,
+    -1.2514491e-02f, -1.1029322e+00f, 9.5024250e-02f,  -1.4175789e-01f,
+    4.3528955e-04f,  -1.0189817e+00f, 3.6562440e+00f,  -6.8713859e-02f,
+    -9.5296353e-01f, -1.7406097e-01f, -3.1664057e-03f, 4.3528955e-04f,
+    5.6727463e-01f,  -3.8981760e-01f, 2.5054640e-03f,  1.0488477e+00f,
+    3.1072742e-01f,  -1.2332475e-01f, 4.3528955e-04f,  -1.3258146e+00f,
+    -1.9837744e+00f, 3.9975896e-02f,  9.0593606e-01f,  -5.3795701e-01f,
+    -1.0205296e-02f, 4.3528955e-04f,  7.1881181e-01f,  -2.1402523e-02f,
+    1.3678260e-02f,  2.7142560e-01f,  9.5376951e-01f,  -1.8041646e-02f,
+    4.3528955e-04f,  -1.9389488e+00f, -2.1415125e-01f, -1.0841317e-01f,
+    5.7342831e-02f,  -5.0847495e-01f, 1.3656878e-01f,  4.3528955e-04f,
+    -1.6326761e-01f, -5.1064745e-02f, 1.7848399e-02f,  2.8892335e-01f,
+    -7.9173779e-01f, -4.7302136e-01f, 4.3528955e-04f,  1.0485275e+00f,
+    3.5332769e-01f,  1.2982270e-03f,  -1.9968018e-01f, 6.8980163e-01f,
+    -7.6237783e-02f, 4.3528955e-04f,  -2.5742319e+00f, -2.9583421e+00f,
+    1.8703355e-01f,  6.2665957e-01f,  -4.8150995e-01f, 1.9563369e-02f,
+    4.3528955e-04f,  -1.1748800e+00f, -1.8395925e+00f, 1.7355075e-02f,
+    8.4393805e-01f,  -6.1777228e-01f, -1.0812550e-01f, 4.3528955e-04f,
+    -1.7046982e-01f, -3.3545059e-01f, -3.8340945e-02f, 8.2905853e-01f,
+    -8.6214101e-01f, -1.1035544e-01f, 4.3528955e-04f,  1.9859332e+00f,
+    -1.0748569e+00f, 1.7554332e-01f,  6.5117890e-01f,  4.4151530e-01f,
+    -5.7478976e-03f, 4.3528955e-04f,  -4.8137930e-01f, -1.0380815e+00f,
+    6.2740877e-02f,  9.5820153e-01f,  -3.2268471e-01f, -2.0330237e-02f,
+    4.3528955e-04f,  1.9993284e-01f,  4.7916993e-03f,  -1.1501078e-01f,
+    5.4132164e-01f,  1.0889151e+00f,  9.9186122e-02f,  4.3528955e-04f,
+    1.4918215e+00f,  -1.7517672e-01f, -4.2071585e-03f, 2.3835452e-01f,
+    1.0105820e+00f,  2.2959966e-02f,  4.3528955e-04f,  1.1000384e-01f,
+    -1.8607298e+00f, 8.6032413e-03f,  6.1837846e-01f,  1.8448141e-01f,
+    -1.2235850e-01f, 4.3528955e-04f,  7.4714965e-01f,  8.2311636e-01f,
+    8.6190209e-02f,  -8.1194460e-01f, 7.4272507e-01f,  1.2778525e-01f,
+    4.3528955e-04f,  -8.0694818e-01f, 6.5997887e-01f,  -1.2543000e-01f,
+    -2.2628681e-01f, -8.9708114e-01f, -1.7915092e-02f, 4.3528955e-04f,
+    -1.9006928e+00f, -1.1035321e+00f, 1.2985554e-01f,  5.1029456e-01f,
+    -6.5535706e-01f, 1.3560024e-01f,  4.3528955e-04f,  7.9528493e-01f,
+    2.0771511e-01f,  -7.9479553e-02f, -4.1508588e-01f, 8.0105984e-01f,
+    1.1802185e-01f,  4.3528955e-04f,  7.7923566e-01f,  -9.3095750e-01f,
+    4.4589967e-02f,  4.6303719e-01f,  9.5302033e-01f,  -2.9389910e-02f,
+    4.3528955e-04f,  -8.0144441e-01f, 9.4559604e-01f,  -7.2412767e-02f,
+    -7.1672493e-01f, -4.7348544e-01f, 1.2321755e-01f,  4.3528955e-04f,
+    5.3762770e-01f,  1.2744187e+00f,  -5.8605229e-03f, -1.2614549e+00f,
+    3.5339037e-01f,  -1.6787355e-01f, 4.3528955e-04f,  7.6284856e-01f,
+    -1.6233295e-01f, 6.1773930e-02f,  8.2883573e-01f,  8.7790263e-01f,
+    -8.1958450e-02f, 4.3528955e-04f,  -5.2454346e-01f, -6.1496943e-01f,
+    -1.9552670e-02f, 4.4897813e-01f,  -3.6256817e-01f, 1.2949856e-01f,
+    4.3528955e-04f,  -3.8461151e+00f, 1.2541501e-01f,  -8.0122240e-03f,
+    -8.9983657e-02f, -8.6990678e-01f, 6.9923857e-03f,  4.3528955e-04f,
+    -5.6383818e-01f, 8.6860374e-02f,  3.2924853e-02f,  4.7320196e-01f,
+    -7.6533908e-01f, 3.3768967e-01f,  4.3528955e-04f,  -5.7940447e-01f,
+    1.5289838e+00f,  -7.3831968e-02f, -1.1263613e+00f, -4.4460875e-01f,
+    5.1841764e-03f,  4.3528955e-04f,  -7.1055532e-01f, 5.5944264e-01f,
+    -4.5113482e-02f, -1.0527459e+00f, -3.3881494e-01f, -9.9038325e-02f,
+    4.3528955e-04f,  1.8563226e-01f,  1.7411098e-01f,  1.6449820e-01f,
+    -3.5436359e-01f, 6.8351567e-01f,  3.1219614e-01f,  4.3528955e-04f,
+    -1.0154796e+00f, -1.0835079e+00f, -7.3488481e-02f, 5.3158391e-02f,
+    -6.2301379e-01f, -2.7723985e-02f, 4.3528955e-04f,  -2.2134202e+00f,
+    7.3299915e-01f,  1.7523475e-01f,  6.0554836e-02f,  -9.4136065e-01f,
+    -1.0506817e-01f, 4.3528955e-04f,  4.6099508e-01f,  -9.2228657e-01f,
+    1.4527591e-02f,  7.0180815e-01f,  4.2765200e-01f,  -1.5324836e-02f,
+    4.3528955e-04f,  6.5343939e-03f,  1.1797009e+00f,  -5.8897626e-02f,
+    -9.5656049e-01f, -1.6282392e-01f, 1.7877306e-01f,  4.3528955e-04f,
+    1.1906117e+00f,  -3.7206614e-01f, 9.4158962e-02f,  1.3012047e-01f,
+    6.5927243e-01f,  5.0930791e-03f,  4.3528955e-04f,  -6.6487736e-01f,
+    -2.5282249e+00f, -1.9405337e-02f, 1.0161960e+00f,  -2.8220263e-01f,
+    2.2747150e-02f,  4.3528955e-04f,  -1.7089003e-01f, -8.6037171e-01f,
+    5.8650199e-02f,  1.1990469e+00f,  1.6698247e-01f,  -8.3592370e-02f,
+    4.3528955e-04f,  -2.6541048e-01f, 2.4239509e+00f,  4.8654035e-02f,
+    -1.0686468e+00f, -2.0613025e-01f, 1.4137380e-01f,  4.3528955e-04f,
+    1.8762881e-01f,  -1.6466684e+00f, -2.2188762e-02f, 1.0790110e+00f,
+    -5.6329168e-02f, 1.2611476e-01f,  4.3528955e-04f,  7.3261432e-02f,
+    1.4107574e+00f,  -1.1429172e-02f, -8.1988406e-01f, -1.5144719e-01f,
+    -1.3026617e-02f, 4.3528955e-04f,  3.1307274e-01f,  1.0335001e+00f,
+    9.8183732e-03f,  -6.7743176e-01f, -2.1390469e-01f, -1.8410927e-01f,
+    4.3528955e-04f,  5.4605675e-01f,  3.3160114e-01f,  7.4838951e-02f,
+    -2.4828947e-01f, 9.7398758e-01f,  -2.9874480e-01f, 4.3528955e-04f,
+    2.1224871e+00f,  1.5692554e+00f,  5.1408213e-02f,  -2.9297063e-01f,
+    8.1840754e-01f,  5.9465937e-02f,  4.3528955e-04f,  1.2108782e-01f,
+    -3.6355174e-01f, 2.4715219e-02f,  8.1516707e-01f,  -4.5604333e-01f,
+    -4.4499004e-01f, 4.3528955e-04f,  1.4930522e+00f,  3.7219711e-02f,
+    2.0906310e-01f,  -1.8597896e-01f, 4.4531906e-01f,  -3.4445338e-02f,
+    4.3528955e-04f,  4.8279342e-01f,  -6.4908266e-02f, -6.2609978e-02f,
+    -4.1552576e-01f, 1.3617489e+00f,  8.3189823e-02f,  4.3528955e-04f,
+    2.3535299e-01f,  -4.0749011e+00f, -6.5424107e-02f, 9.2983747e-01f,
+    1.4911497e-02f,  4.9508303e-02f,  4.3528955e-04f,  1.6287059e+00f,
+    3.9972339e-02f,  -1.4355247e-01f, -4.6433851e-01f, 8.4203392e-01f,
+    7.2183562e-03f,  4.3528955e-04f,  -2.6358588e+00f, -1.0662490e+00f,
+    -5.7905734e-02f, 3.0415908e-01f,  -8.5408950e-01f, 8.8994861e-02f,
+    4.3528955e-04f,  2.8376031e-01f,  -1.6345096e+00f, 4.8293866e-02f,
+    1.0505075e+00f,  -5.0440140e-02f, -7.7698499e-02f, 4.3528955e-04f,
+    -7.9914778e-03f, -1.9271202e+00f, 4.8289364e-03f,  1.0989825e+00f,
+    1.2260172e-01f,  -7.7416264e-02f, 4.3528955e-04f,  -2.3075923e-01f,
+    9.1273814e-01f,  -3.4187678e-01f, -5.9044671e-01f, -9.1118586e-01f,
+    6.1275695e-02f,  4.3528955e-04f,  1.4958969e+00f,  -3.1960080e+00f,
+    -4.8200447e-02f, 6.8350804e-01f,  4.4107708e-01f,  -3.0134398e-02f,
+    4.3528955e-04f,  2.1625829e+00f,  2.7377813e+00f,  -9.7442865e-02f,
+    -7.0911628e-01f, 5.2445948e-01f,  -4.3417690e-03f, 4.3528955e-04f,
+    9.6111894e-01f,  -5.1419926e-01f, -1.3526724e-01f, 7.4907434e-01f,
+    6.7704141e-01f,  -5.9062440e-02f, 4.3528955e-04f,  -1.6256415e+00f,
+    -1.5777866e+00f, -3.6580645e-02f, 7.1544939e-01f,  -5.5809951e-01f,
+    8.3573341e-02f,  4.3528955e-04f,  -1.6731998e+00f, -2.4314709e+00f,
+    3.3555571e-02f,  6.3186103e-01f,  -5.7202983e-01f, -6.7715906e-02f,
+    4.3528955e-04f,  1.0573283e+00f,  -1.0114421e+00f, -1.1656055e-02f,
+    7.8174746e-01f,  5.6242734e-01f,  -2.9390889e-01f, 4.3528955e-04f,
+    2.6305386e-01f,  -2.8429443e-01f, 8.7543577e-02f,  1.0864745e+00f,
+    3.8376942e-01f,  2.0973831e-01f,  4.3528955e-04f,  1.1670362e+00f,
+    -2.2380533e+00f, 9.9300154e-02f,  7.5512397e-01f,  5.6637782e-01f,
+    8.7429225e-02f,  4.3528955e-04f,  -1.6146168e-02f, 6.8004206e-02f,
+    7.6125632e-03f,  -1.0034001e-01f, -3.4705663e-01f, -6.7245531e-01f,
+    4.3528955e-04f,  2.7375526e+00f,  1.1401169e-02f,  1.1018647e-01f,
+    -8.4448820e-03f, 9.6227181e-01f,  1.1195991e-01f,  4.3528955e-04f,
+    1.8180557e+00f,  -1.4997587e+00f, -1.3250807e-01f, 1.4759028e-01f,
+    6.3660324e-01f,  7.9367891e-02f,  4.3528955e-04f,  8.3871174e-01f,
+    6.2382191e-01f,  1.1371982e-01f,  -2.7235886e-01f, 6.8314743e-01f,
+    3.3996525e-01f,  4.3528955e-04f,  9.4798401e-02f,  3.6791215e+00f,
+    1.7718750e-01f,  -9.8299026e-01f, 5.1193323e-02f,  -1.3795390e-02f,
+    4.3528955e-04f,  -9.9388814e-01f, -3.0705106e-01f, -4.2720366e-02f,
+    6.2940913e-01f,  -8.9266956e-01f, -6.9085239e-03f, 4.3528955e-04f,
+    1.6557571e-01f,  6.3235916e-02f,  1.0805068e-01f,  -8.3343908e-02f,
+    1.3096606e+00f,  1.0076551e-01f,  4.3528955e-04f,  3.9439764e+00f,
+    -9.6169835e-01f, 1.2606251e-01f,  1.8587218e-01f,  9.6314937e-01f,
+    9.4104260e-02f,  4.3528955e-04f,  -2.7005553e-01f, -7.3374242e-01f,
+    3.1435903e-02f,  3.6802042e-01f,  -1.0938375e+00f, -1.9657716e-01f,
+    4.3528955e-04f,  2.0184970e+00f,  1.4490035e-01f,  1.0753000e-02f,
+    -3.4436679e-01f, 1.0664097e+00f,  9.9087574e-02f,  4.3528955e-04f,
+    -5.2792066e-01f, 2.2600219e-01f,  -8.2622312e-02f, 6.8859786e-02f,
+    -9.4563073e-01f, 7.0459567e-02f,  4.3528955e-04f,  1.5100290e+00f,
+    -1.2275963e+00f, 1.0864139e-01f,  4.3059167e-01f,  8.6904675e-01f,
+    -3.3088846e-03f, 4.3528955e-04f,  1.0350852e+00f,  -6.0096484e-01f,
+    -7.7713229e-02f, 1.9289660e-01f,  4.0997708e-01f,  3.6208606e-01f,
+    4.3528955e-04f,  1.2842970e-01f,  -7.9557902e-01f, 1.7465273e-02f,
+    1.2862564e+00f,  6.1845370e-02f,  -7.6268420e-02f, 4.3528955e-04f,
+    -2.6823273e+00f, 2.9990748e-02f,  -5.9826102e-02f, -3.1797245e-02f,
+    -9.2061770e-01f, -1.1706609e-02f, 4.3528955e-04f,  -6.4967436e-01f,
+    -3.7262255e-01f, 9.2040181e-02f,  2.9023966e-01f,  -7.7643305e-01f,
+    3.7028827e-02f,  4.3528955e-04f,  -9.2506272e-01f, -3.0456748e+00f,
+    4.1766157e-03f,  9.0810478e-01f,  -2.1976584e-01f, 2.9321671e-02f,
+    4.3528955e-04f,  2.0766442e+00f,  -1.5329702e+00f, -1.9721813e-02f,
+    7.4043196e-01f,  5.8739161e-01f,  -4.8219319e-02f, 4.3528955e-04f,
+    -1.9482245e+00f, 1.6142071e+00f,  4.6485271e-02f,  -5.6103772e-01f,
+    -7.7759343e-01f, 1.0513947e-02f,  4.3528955e-04f,  2.7206964e+00f,
+    1.8737583e-01f,  1.2213083e-02f,  4.1202411e-02f,  6.6523236e-01f,
+    -6.1461490e-02f, 4.3528955e-04f,  -6.7600235e-02f, 4.3994719e-01f,
+    7.3636910e-03f,  -9.0833330e-01f, -6.2696552e-01f, 8.5546352e-02f,
+    4.3528955e-04f,  -4.4148512e-02f, -1.2488033e+00f, -1.3494247e-01f,
+    1.1119843e+00f,  3.4055412e-01f,  2.3770684e-02f,  4.3528955e-04f,
+    -3.0167198e-01f, 1.1546028e+00f,  -6.4071968e-02f, -9.3968511e-01f,
+    -2.5761208e-02f, 1.3900064e-01f,  4.3528955e-04f,  -9.0253097e-01f,
+    1.3158634e+00f,  -7.1968846e-02f, -1.0172766e+00f, -4.4377348e-01f,
+    4.4611204e-02f,  4.3528955e-04f,  2.0198661e-01f,  -1.6705064e+00f,
+    1.8185452e-01f,  8.9591777e-01f,  -2.1160556e-02f, 1.4230640e-01f,
+    4.3528955e-04f,  -2.9650918e-01f, -4.2986673e-01f, 1.3220521e-03f,
+    8.9759272e-01f,  -3.1360859e-01f, 1.6539155e-01f,  4.3528955e-04f,
+    3.3151308e-01f,  2.3956138e-01f,  5.3603165e-03f,  -3.1100404e-01f,
+    1.0404416e+00f,  -3.0668038e-01f, 4.3528955e-04f,  3.0479354e-01f,
+    -2.6506382e-01f, 1.2983680e-02f,  6.7710102e-01f,  6.3456041e-01f,
+    1.3437311e-02f,  4.3528955e-04f,  -6.7611599e-01f, 4.3690008e-01f,
+    -3.1045577e-01f, -3.7357938e-02f, -7.8385937e-01f, 1.0408919e-01f,
+    4.3528955e-04f,  -1.0499145e+00f, -1.5928968e+00f, -7.0203431e-02f,
+    6.3339651e-01f,  -2.8351557e-01f, -3.3504464e-02f, 4.3528955e-04f,
+    1.0707893e-01f,  -3.3282703e-01f, 1.7217811e-03f,  8.9257437e-01f,
+    1.2634313e-01f,  2.7407736e-01f,  4.3528955e-04f,  -4.7306743e-01f,
+    -3.6627409e+00f, 1.5279453e-01f,  9.3670958e-01f,  -1.8703133e-01f,
+    5.0045211e-02f,  4.3528955e-04f,  -1.4954550e+00f, -5.9864527e-01f,
+    -1.5149713e-02f, 2.6646069e-01f,  -4.8936108e-01f, -3.9969370e-02f,
+    4.3528955e-04f,  1.1929190e-01f,  4.4882655e-01f,  7.2918423e-02f,
+    -1.1234986e+00f, 7.9892772e-01f,  -1.3599160e-01f, 4.3528955e-04f,
+    4.9773327e-01f,  2.8081048e+00f,  -1.1645658e-01f, -1.0271441e+00f,
+    3.9698875e-01f,  -1.7881766e-02f, 4.3528955e-04f,  -2.9830910e-02f,
+    4.6643651e-01f,  1.9431780e-01f,  -9.3132663e-01f, -1.2520614e-01f,
+    -1.1692639e-01f, 4.3528955e-04f,  -1.4534796e+00f, -4.5605296e-01f,
+    -3.5628919e-02f, -1.2298536e-01f, -7.8542739e-01f, 5.8641203e-02f,
+    4.3528955e-04f,  -2.2793181e+00f, 2.7725875e+00f,  8.8588126e-02f,
+    -8.0416983e-01f, -5.8885109e-01f, 1.4368521e-02f,  4.3528955e-04f,
+    -4.6122566e-01f, -7.8167868e-01f, 9.8654822e-02f,  8.7647152e-01f,
+    -7.9687977e-01f, -2.4707097e-01f, 4.3528955e-04f,  2.0904486e+00f,
+    1.0376852e+00f,  7.0791371e-02f,  -5.3256816e-01f, 7.8894460e-01f,
+    -2.8891042e-02f, 4.3528955e-04f,  3.8026032e-01f,  -4.9832368e-01f,
+    1.8887039e-01f,  7.0771533e-01f,  5.1972377e-01f,  3.6633459e-01f,
+    4.3528955e-04f,  -3.5792905e-01f, -2.6193041e-01f, -7.1674432e-03f,
+    7.5479984e-01f,  -9.4663501e-01f, 4.0715303e-02f,  4.3528955e-04f,
+    -6.1932057e-03f, -1.3730650e+00f, -4.1603837e-02f, 6.8032396e-01f,
+    1.7864835e-02f,  -1.3640624e-02f, 4.3528955e-04f,  2.8921986e+00f,
+    2.3249514e+00f,  3.4847200e-02f,  -6.0075969e-01f, 7.6154184e-01f,
+    1.1830403e-02f,  4.3528955e-04f,  -2.1998569e-01f, -4.9023718e-01f,
+    4.2779185e-02f,  7.3325759e-01f,  -5.2059662e-01f, 3.2752699e-01f,
+    4.3528955e-04f,  -1.5461591e-01f, 1.8904281e-01f,  -6.3959934e-02f,
+    -6.2173307e-01f, -1.1407357e+00f, 6.1282977e-02f,  4.3528955e-04f,
+    -3.8895585e-02f, 1.7250928e-01f,  -1.6933821e-01f, -8.1387419e-01f,
+    -3.9619806e-01f, -3.0375746e-01f, 4.3528955e-04f,  -3.3404639e+00f,
+    1.3588730e+00f,  1.1133709e-01f,  -3.3143991e-01f, -7.0095521e-01f,
+    -1.4090304e-01f, 4.3528955e-04f,  -3.7851903e-01f, -3.0163314e+00f,
+    -1.4368688e-01f, 6.9236600e-01f,  7.0703499e-02f,  -2.8352518e-02f,
+    4.3528955e-04f,  6.1538601e-01f,  -1.3256779e+00f, -1.4643701e-02f,
+    9.5752370e-01f,  1.1659830e-01f,  1.7112301e-01f,  4.3528955e-04f,
+    3.2170019e-01f,  1.4347588e+00f,  2.5810661e-02f,  -6.0353881e-01f,
+    4.0167218e-01f,  -1.4890793e-01f, 4.3528955e-04f,  -5.8682722e-01f,
+    -8.7550503e-01f, 4.6326362e-02f,  4.5287761e-01f,  -5.6461084e-01f,
+    7.9910100e-02f,  4.3528955e-04f,  -1.8315905e+00f, -1.2754096e+00f,
+    9.8193102e-02f,  4.4478399e-01f,  -7.4075782e-01f, -1.8747212e-02f,
+    4.3528955e-04f,  1.0348213e+00f,  -1.0755039e+00f, -8.9135602e-02f,
+    5.3079355e-01f,  6.6031629e-01f,  5.8911089e-03f,  4.3528955e-04f,
+    -1.5423750e+00f, 7.3739409e-02f,  6.5554954e-02f,  1.8010707e-01f,
+    -8.6153692e-01f, 2.2073705e-01f,  4.3528955e-04f,  -6.8071413e-01f,
+    4.5609671e-01f,  -1.0735729e-01f, -7.8286487e-01f, -5.4729235e-01f,
+    -2.4990644e-01f, 4.3528955e-04f,  -2.7767408e-01f, -6.9126791e-01f,
+    1.9910909e-02f,  6.7783260e-01f,  -3.0832037e-01f, 5.9241347e-02f,
+    4.3528955e-04f,  -3.5970547e+00f, -2.5972850e+00f, 1.6296315e-01f,
+    5.1405609e-01f,  -7.1724749e-01f, -8.0069108e-03f, 4.3528955e-04f,
+    3.8337631e+00f,  -8.9045924e-01f, 2.3608359e-02f,  2.3156445e-01f,
+    9.3124580e-01f,  2.7664650e-02f,  4.3528955e-04f,  5.6023246e-01f,
+    5.1318008e-01f,  -1.1374960e-01f, -5.3413296e-01f, 6.3600975e-01f,
+    -7.5137310e-02f, 4.3528955e-04f,  -1.9966480e+00f, 1.8639064e+00f,
+    -9.2274494e-02f, -5.8248508e-01f, -4.2127529e-01f, 2.3446491e-03f,
+    4.3528955e-04f,  -3.8483953e-01f, -2.6815424e+00f, 1.6271441e-01f,
+    1.0225492e+00f,  -2.7065614e-01f, 7.0752278e-02f,  4.3528955e-04f,
+    -2.7943122e+00f, -9.2417616e-01f, 5.5039857e-02f,  1.8194324e-01f,
+    -9.3876076e-01f, -9.3954921e-02f, 4.3528955e-04f,  2.5156322e-01f,
+    6.7252028e-01f,  2.8501073e-02f,  -9.7412181e-01f, 8.2829905e-01f,
+    -7.2806947e-02f, 4.3528955e-04f,  -4.5402804e-01f, -5.6674677e-01f,
+    3.3780172e-02f,  9.7904491e-01f,  -3.0355367e-01f, -5.3886857e-02f,
+    4.3528955e-04f,  1.2318275e+00f,  1.2848774e+00f,  5.6275468e-02f,
+    -6.9665396e-01f, 8.1444532e-01f,  -1.9171304e-01f, 4.3528955e-04f,
+    2.9597955e+00f,  -2.2112701e+00f, 1.3052535e-01f,  5.6582713e-01f,
+    6.5637624e-01f,  -2.7025109e-02f, 4.3528955e-04f,  2.6054648e-01f,
+    -8.7282604e-01f, -1.8033467e-02f, 4.1854987e-01f,  2.1290404e-01f,
+    3.2835931e-02f,  4.3528955e-04f,  -3.5986719e+00f, -1.1810741e+00f,
+    9.5569789e-03f,  2.1664216e-01f,  -8.7209958e-01f, -9.7756861e-03f,
+    4.3528955e-04f,  2.1074045e+00f,  -1.1561445e+00f, 4.4246547e-02f,
+    3.7912285e-01f,  6.6237265e-01f,  1.0121474e-01f,  4.3528955e-04f,
+    -1.3832897e-01f, 8.4710020e-01f,  -6.9346197e-02f, -1.3777165e+00f,
+    1.5742433e-01f,  1.2203322e-01f,  4.3528955e-04f,  2.0753182e-02f,
+    3.9955264e-01f,  -2.7554768e-01f, -1.1058495e+00f, -1.5051392e-01f,
+    1.9915180e-01f,  4.3528955e-04f,  1.4598426e+00f,  -1.3529322e+00f,
+    3.7644319e-02f,  7.2704870e-01f,  5.9285808e-01f,  4.2472545e-02f,
+    4.3528955e-04f,  2.6423690e+00f,  1.4939207e+00f,  8.8385031e-02f,
+    -4.2193824e-01f, 9.3664753e-01f,  -1.1821534e-01f, 4.3528955e-04f,
+    2.5713961e+00f,  7.8146976e-01f,  -8.1882693e-02f, -2.6940665e-01f,
+    1.0678909e+00f,  -6.9690935e-02f, 4.3528955e-04f,  -1.1324745e-01f,
+    -2.5124974e+00f, -4.9715236e-02f, 9.2106593e-01f,  3.3960119e-02f,
+    -6.2996157e-02f, 4.3528955e-04f,  2.1336923e+00f,  -1.8130362e-02f,
+    -2.4351154e-02f, -1.6986061e-02f, 1.0555445e+00f,  -1.0552599e-01f,
+    4.3528955e-04f,  -7.2807205e-01f, -2.8566003e+00f, -4.9511544e-02f,
+    8.1608152e-01f,  -1.2436134e-01f, 1.3725357e-01f,  4.3528955e-04f,
+    -1.8783914e+00f, -2.1083527e+00f, -2.8764749e-02f, 7.3369449e-01f,
+    -6.0933912e-01f, -9.2682175e-02f, 4.3528955e-04f,  -2.7893338e+00f,
+    -1.7798558e+00f, -1.8015411e-04f, 6.0538352e-01f,  -7.3042506e-01f,
+    -9.3424451e-03f, 4.3528955e-04f,  2.9287165e-01f,  -1.5416672e+00f,
+    2.6843274e-02f,  5.9380108e-01f,  1.5043337e-03f,  -1.2819768e-01f,
+    4.3528955e-04f,  -2.2610130e+00f, 2.2696810e+00f,  6.3132428e-02f,
+    -6.6285449e-01f, -6.4354956e-01f, 5.8074877e-02f,  4.3528955e-04f,
+    7.8735745e-01f,  8.5398847e-01f,  -1.6297294e-02f, -8.5082054e-01f,
+    3.0274916e-01f,  1.1572878e-01f,  4.3528955e-04f,  -1.5628734e-01f,
+    -1.0101542e+00f, -8.2847036e-02f, 6.3570660e-01f,  1.7086607e-01f,
+    1.1028584e-01f,  4.3528955e-04f,  -5.2681404e-01f, 8.7790108e-01f,
+    8.2027487e-02f,  -9.7193962e-01f, -5.3704953e-01f, 2.7792022e-01f,
+    4.3528955e-04f,  1.9321035e+00f,  5.0077569e-01f,  -5.6551203e-02f,
+    -3.0770919e-01f, 9.6809697e-01f,  6.3143492e-02f,  4.3528955e-04f,
+    -1.5871102e+00f, -2.1219168e+00f, 4.1558765e-02f,  8.2326877e-01f,
+    -6.2389600e-01f, 5.9018593e-02f,  4.3528955e-04f,  -5.7469386e-01f,
+    -3.4515615e+00f, -1.4231116e-02f, 8.7869537e-01f,  -2.5454178e-01f,
+    -3.7191322e-03f, 4.3528955e-04f,  4.8901832e-01f,  2.2117412e+00f,
+    1.1363933e-01f,  -1.0149391e+00f, 1.7654455e-01f,  -1.1379423e-01f,
+    4.3528955e-04f,  -3.7083549e+00f, 1.3323400e+00f,  -7.8991532e-02f,
+    -2.9162118e-01f, -8.4995252e-01f, -6.2496278e-02f, 4.3528955e-04f,
+    3.8349299e+00f,  -2.7336266e+00f, 7.9552934e-02f,  5.4274660e-01f,
+    7.2438288e-01f,  1.8397825e-02f,  4.3528955e-04f,  -3.0832487e-01f,
+    6.0209662e-01f,  -4.8062760e-02f, -6.0332894e-01f, -4.5253173e-01f,
+    -3.3754000e-01f, 4.3528955e-04f,  3.6994793e+00f,  -1.8041264e+00f,
+    3.1641226e-02f,  5.8278185e-01f,  7.6064533e-01f,  1.0918153e-02f,
+    4.3528955e-04f,  6.4364201e-01f,  5.5878413e-01f,  -1.4481905e-01f,
+    -6.3611990e-01f, 2.0818824e-01f,  -2.1410342e-01f, 4.3528955e-04f,
+    1.1414441e-01f,  6.7824519e-01f,  4.2857490e-02f,  -9.6829146e-01f,
+    -7.9413235e-02f, -2.9731828e-01f, 4.3528955e-04f,  -2.0117333e+00f,
+    -1.0564096e+00f, 8.8811286e-02f,  5.5271786e-01f,  -6.8994069e-01f,
+    9.2843883e-02f,  4.3528955e-04f,  -9.9609113e-01f, -4.5489306e+00f,
+    1.3366992e-02f,  8.0767977e-01f,  -2.0808670e-01f, 6.1939154e-02f,
+    4.3528955e-04f,  1.9365237e+00f,  -6.7173406e-02f, 2.2906030e-02f,
+    -6.0663488e-02f, 1.0816253e+00f,  -7.5663649e-02f, 4.3528955e-04f,
+    2.4029985e-01f,  -9.8966271e-01f, 5.6717385e-02f,  9.9983931e-01f,
+    -1.3784690e-01f, 2.0507769e-01f,  4.3528955e-04f,  1.4357585e+00f,
+    7.9042166e-01f,  -1.6159797e-01f, -7.8169286e-01f, 5.9861195e-01f,
+    2.8152885e-02f,  4.3528955e-04f,  -6.1679220e-01f, -1.4942179e+00f,
+    -3.5028741e-02f, 1.0947024e+00f,  -5.0869727e-01f, 2.5930246e-02f,
+    4.3528955e-04f,  4.9062002e-01f,  -1.9358006e+00f, -1.8508570e-01f,
+    1.0616637e+00f,  5.3897917e-01f,  5.7820920e-02f,  4.3528955e-04f,
+    -4.0902686e+00f, 2.5500209e+00f,  5.0642667e-03f,  -5.0217628e-01f,
+    -6.9344664e-01f, 4.4363633e-02f,  4.3528955e-04f,  2.1371348e+00f,
+    -9.6668249e-01f, 2.2174895e-02f,  4.8959759e-01f,  7.5785708e-01f,
+    -1.1038192e-01f, 4.3528955e-04f,  7.2684348e-01f,  1.9258839e+00f,
+    -1.1434177e-02f, -9.4844007e-01f, 5.0505900e-01f,  5.9823863e-02f,
+    4.3528955e-04f,  2.8537784e+00f,  7.8416628e-01f,  2.3138697e-01f,
+    -2.5215584e-01f, 8.5236835e-01f,  4.2985030e-02f,  4.3528955e-04f,
+    -1.3713766e+00f, 1.0107807e+00f,  1.2526506e-01f,  -3.9959380e-01f,
+    -7.9186046e-01f, -7.1961898e-03f, 4.3528955e-04f,  -7.9162103e-01f,
+    -2.5221694e-01f, -1.9174539e-01f, -5.5946928e-02f, -6.9069123e-01f,
+    2.1735723e-01f,  4.3528955e-04f,  1.2948725e-01f,  2.7282624e+00f,
+    -1.7954864e-01f, -9.9496114e-01f, 2.6061144e-01f,  1.1808296e-01f,
+    4.3528955e-04f,  1.2148030e+00f,  -8.8033485e-01f, -6.6679493e-02f,
+    8.0099094e-01f,  5.2974063e-01f,  9.3057208e-02f,  4.3528955e-04f,
+    -3.4162641e-02f, 8.1898622e-02f,  2.6320390e-02f,  -2.2519495e-01f,
+    -2.7510282e-01f, -3.0823622e-02f, 4.3528955e-04f,  4.3423142e+00f,
+    -1.7333056e+00f, 1.0204320e-01f,  3.4049618e-01f,  8.1502122e-01f,
+    -9.3927560e-03f, 4.3528955e-04f,  1.6532332e+00f,  9.9396139e-02f,
+    2.8352195e-02f,  2.3957507e-01f,  7.7475399e-01f,  -8.9055233e-02f,
+    4.3528955e-04f,  -2.1650789e+00f, -2.9435515e+00f, -5.1053729e-02f,
+    7.3570138e-01f,  -5.3210324e-01f, 4.4819564e-02f,  4.3528955e-04f,
+    1.9316502e+00f,  -2.1113153e+00f, -1.1650901e-02f, 6.9894534e-01f,
+    6.4164501e-01f,  2.3008680e-02f,  4.3528955e-04f,  -1.2457354e+00f,
+    6.2464523e-01f,  3.4685433e-02f,  -4.7738412e-01f, -4.2005464e-01f,
+    -1.4766881e-01f, 4.3528955e-04f,  4.6656862e-02f,  5.1911861e-01f,
+    -4.5168288e-03f, -6.4022231e-01f, -5.4546297e-02f, -1.6100281e-01f,
+    4.3528955e-04f,  1.4976403e-01f,  -4.1653311e-01f, 6.4794824e-02f,
+    8.2851422e-01f,  4.6674559e-01f,  3.1138441e-02f,  4.3528955e-04f,
+    2.0364673e+00f,  -5.6869376e-01f, -1.1721701e-01f, 2.5139630e-01f,
+    6.3513911e-01f,  -6.9114387e-02f, 4.3528955e-04f,  5.6533396e-01f,
+    -2.9771359e+00f, 8.5961826e-02f,  8.8263297e-01f,  3.6188456e-01f,
+    -1.0716740e-01f, 4.3528955e-04f,  7.2091389e-01f,  5.2500606e-01f,
+    6.1953660e-02f,  -4.8243961e-01f, 6.9620436e-01f,  2.4841698e-01f,
+    4.3528955e-04f,  -8.9312828e-01f, 1.9610918e+00f,  2.0854339e-02f,
+    -8.8598889e-01f, -3.8192347e-01f, -1.2908104e-01f, 4.3528955e-04f,
+    2.7533177e-01f,  -6.6252732e-01f, -7.7119558e-03f, 6.2045109e-01f,
+    5.9049714e-01f,  4.4615041e-02f,  4.3528955e-04f,  9.9512279e-02f,
+    4.9117060e+00f,  -9.1942511e-02f, -8.9817631e-01f, 1.2457497e-01f,
+    -1.1684052e-02f, 4.3528955e-04f,  2.4695549e+00f,  8.4684980e-01f,
+    -1.4236942e-01f, -2.2739069e-01f, 8.4526575e-01f,  -6.2005814e-02f,
+    4.3528955e-04f,  5.8002388e-01f,  -5.0662756e-02f, -1.0917556e-01f,
+    -1.1214761e-01f, 1.2224433e+00f,  5.8882039e-02f,  4.3528955e-04f,
+    1.1481456e-01f,  -3.6071277e-01f, -3.4040589e-02f, 9.1737640e-01f,
+    4.7087023e-01f,  -2.6846689e-01f, 4.3528955e-04f,  -9.5788606e-02f,
+    6.1594993e-01f,  -7.4897461e-02f, -1.2510046e+00f, -7.0367806e-02f,
+    7.8754380e-02f,  4.3528955e-04f,  -2.3139198e+00f, 1.8622417e+00f,
+    2.5392897e-02f,  -7.2513646e-01f, -7.0665389e-01f, 2.7216619e-02f,
+    4.3528955e-04f,  -7.6869798e-01f, 2.6406727e+00f,  -4.3668617e-02f,
+    -8.0409122e-01f, -3.5779837e-01f, -9.0380087e-02f, 4.3528955e-04f,
+    2.9259999e+00f,  2.8035247e-01f,  -9.1116037e-03f, -1.5076195e-01f,
+    9.8557174e-01f,  -3.0311644e-02f, 4.3528955e-04f,  -7.0659488e-01f,
+    4.9059771e-02f,  2.1892056e-02f,  -2.2827113e-01f, -1.1742016e+00f,
+    1.0347778e-01f,  4.3528955e-04f,  -8.8512979e-02f, 1.7443842e+00f,
+    -2.0811846e-03f, -9.2541069e-01f, 1.1917360e-01f,  -4.8809119e-02f,
+    4.3528955e-04f,  -2.6482065e+00f, -8.4476119e-01f, -4.6996381e-02f,
+    3.5090873e-01f,  -8.6814374e-01f, 9.1328397e-02f,  4.3528955e-04f,
+    4.6940386e-01f,  -1.0593832e+00f, 1.5178430e-01f,  6.8659186e-01f,
+    -3.0276364e-02f, -4.6777604e-03f, 4.3528955e-04f,  1.5848714e+00f,
+    -1.4916527e-01f, -2.6565265e-02f, 1.3248552e-01f,  1.1715372e+00f,
+    -1.0514425e-01f, 4.3528955e-04f,  1.0449916e+00f,  -1.3765699e+00f,
+    3.6671285e-02f,  4.2873380e-01f,  7.0018327e-01f,  -1.5365869e-01f,
+    4.3528955e-04f,  3.5516554e-01f,  -2.3877062e-01f, 2.8328702e-02f,
+    8.7580144e-01f,  3.6978224e-01f,  -1.6347423e-01f, 4.3528955e-04f,
+    -5.1586218e-02f, -4.9940819e-01f, 2.3702430e-02f,  8.0487645e-01f,
+    -5.3927445e-01f, -4.1542139e-02f, 4.3528955e-04f,  -1.6342874e+00f,
+    8.0254287e-02f,  -1.3023959e-01f, -2.7415314e-01f, -8.1079578e-01f,
+    1.6113514e-01f,  4.3528955e-04f,  9.9607629e-01f,  1.6057771e-01f,
+    2.7852099e-02f,  -6.3055730e-01f, 7.5461149e-01f,  5.0627336e-02f,
+    4.3528955e-04f,  4.1896597e-01f,  -1.3559813e+00f, 7.6034740e-02f,
+    7.0934403e-01f,  3.7345123e-01f,  1.1380436e-01f,  4.3528955e-04f,
+    2.4989717e+00f,  4.7813785e-01f,  7.1747281e-02f,  -3.0444887e-01f,
+    8.4101593e-01f,  2.0305611e-02f,  4.3528955e-04f,  2.5578160e+00f,
+    -2.0705419e+00f, -1.5488301e-01f, 5.7151622e-01f,  7.3673505e-01f,
+    -2.3731153e-02f, 4.3528955e-04f,  -1.1450069e+00f, 3.6527624e+00f,
+    6.7007110e-02f,  -8.4978175e-01f, -3.0415943e-01f, 5.3995717e-02f,
+    4.3528955e-04f,  -5.4308951e-01f, 3.6215967e-01f,  1.0802917e-02f,
+    1.8584866e-02f,  -1.3201767e+00f, -2.9364263e-03f, 4.3528955e-04f,
+    -6.2927997e-01f, 1.1413135e-01f,  1.7718564e-01f,  3.2364946e-02f,
+    -5.8863801e-01f, 1.1266248e-01f,  4.3528955e-04f,  2.8551705e+00f,
+    2.0976958e+00f,  1.4925882e-01f,  -5.2651268e-01f, 7.5732607e-01f,
+    2.5851406e-02f,  4.3528955e-04f,  1.2036195e+00f,  2.8665383e+00f,
+    1.5537447e-01f,  -7.8631097e-01f, 2.4137463e-01f,  1.1834016e-01f,
+    4.3528955e-04f,  3.4964231e-01f,  3.0681980e+00f,  7.6762475e-02f,
+    -1.0214239e+00f, 1.5388754e-01f,  3.4457453e-02f,  4.3528955e-04f,
+    2.7903166e+00f,  -1.3887703e-02f, 1.0573205e-01f,  -1.3349533e-01f,
+    1.0134724e+00f,  -4.2535365e-02f, 4.3528955e-04f,  -2.8503016e-03f,
+    9.4427115e-01f,  1.8092738e-01f,  -8.0727476e-01f, -1.8088737e-01f,
+    1.0860105e-01f,  4.3528955e-04f,  1.3551986e+00f,  -1.3261968e+00f,
+    -2.7844800e-02f, 7.6242667e-01f,  8.9592588e-01f,  -1.5105624e-01f,
+    4.3528955e-04f,  2.1887197e+00f,  3.6513486e+00f,  1.7426091e-01f,
+    -7.8259623e-01f, 4.5992842e-01f,  4.2433566e-03f,  4.3528955e-04f,
+    -1.1633087e-01f, -2.5007532e+00f, 3.1969756e-02f,  1.0141793e+00f,
+    -1.3605224e-02f, 1.0070011e-01f,  4.3528955e-04f,  -1.1178275e+00f,
+    -1.9615002e+00f, 2.3799002e-02f,  8.4087062e-01f,  -3.0315670e-01f,
+    2.7463300e-02f,  4.3528955e-04f,  1.0193319e+00f,  -6.0979861e-01f,
+    -8.5366696e-02f, 3.8635477e-01f,  9.4630706e-01f,  9.2234582e-02f,
+    4.3528955e-04f,  6.1059576e-01f,  -1.0273169e+00f, 1.0398774e-01f,
+    4.9673298e-01f,  7.4835974e-01f,  5.2939426e-02f,  4.3528955e-04f,
+    -6.2917399e-01f, -5.3145862e-01f, 1.0937455e-01f,  3.1942454e-01f,
+    -8.1239611e-01f, -4.1080832e-02f, 4.3528955e-04f,  1.4435854e+00f,
+    -1.3752466e+00f, -3.5463274e-02f, 4.9324831e-01f,  7.7532083e-01f,
+    6.5710872e-02f,  4.3528955e-04f,  -1.5666409e+00f, 2.2342752e-01f,
+    -2.5046464e-02f, 1.3053726e-01f,  -3.8456565e-01f, -1.7621049e-01f,
+    4.3528955e-04f,  -1.4269531e+00f, -1.2496956e-01f, 1.2053710e-01f,
+    1.5873128e-01f,  -8.5627282e-01f, -1.6349185e-01f, 4.3528955e-04f,
+    1.6998104e+00f,  -3.5379630e-01f, -1.1419363e-02f, 4.3013114e-02f,
+    1.0524825e+00f,  -1.4391161e-02f, 4.3528955e-04f,  1.5938376e+00f,
+    7.7961379e-01f,  -3.9500888e-02f, -2.7346954e-01f, 8.2697076e-01f,
+    -1.3334219e-02f, 4.3528955e-04f,  3.3854014e-01f,  1.3544029e+00f,
+    -1.0902530e-01f, -7.3772508e-01f, 4.0016377e-01f,  1.8909087e-02f,
+    4.3528955e-04f,  -1.7641886e+00f, 6.9318902e-01f,  -3.3644080e-02f,
+    -3.3604053e-01f, -1.1467367e+00f, 5.0702966e-03f,  4.3528955e-04f,
+    -5.9459485e-02f, -2.7143254e+00f, -6.4295657e-02f, 9.9523795e-01f,
+    1.4044885e-01f,  -8.9944728e-02f, 4.3528955e-04f,  -1.3121885e-01f,
+    -6.8054110e-02f, -8.2871497e-02f, 5.4027569e-01f,  -4.8616377e-01f,
+    -4.8952267e-01f, 4.3528955e-04f,  -2.1056252e+00f, 3.6807826e+00f,
+    4.9550813e-02f,  -8.5520977e-01f, -4.6826419e-01f, -2.2465989e-02f,
+    4.3528955e-04f,  1.3879967e-01f,  -4.0380722e-01f, 4.3947432e-02f,
+    7.0244670e-01f,  4.3364462e-01f,  -3.9753953e-01f, 4.3528955e-04f,
+    9.4499546e-01f,  1.1988112e-01f,  -3.6229710e-03f, 2.1144216e-01f,
+    7.8064919e-01f,  1.5716030e-01f,  4.3528955e-04f,  -9.9016178e-01f,
+    1.2585963e+00f,  1.3307227e-01f,  -9.3445593e-01f, -2.9257739e-01f,
+    5.0386125e-03f,  4.3528955e-04f,  -2.8244774e+00f, 3.0761113e+00f,
+    -1.0555249e-01f, -7.1019751e-01f, -6.2095588e-01f, 2.8437562e-02f,
+    4.3528955e-04f,  -6.4424741e-01f, -8.1264913e-01f, 2.4255415e-02f,
+    6.4037544e-01f,  -4.1565210e-01f, 6.0177236e-03f,  4.3528955e-04f,
+    -1.0265695e-01f, -3.8579804e-01f, -4.1423313e-02f, 8.5103071e-01f,
+    -7.1083266e-01f, -1.4424540e-01f, 4.3528955e-04f,  4.3182299e-01f,
+    7.1545839e-02f,  2.3786619e-02f,  2.0408225e-01f,  1.2518615e+00f,
+    4.7981966e-02f,  4.3528955e-04f,  1.0000545e-01f,  2.3483059e-01f,
+    9.5230013e-02f,  -3.2118905e-01f, 1.6068284e-01f,  -1.1516461e+00f,
+    4.3528955e-04f,  1.7350295e-01f,  1.0323133e+00f,  -1.5317515e-02f,
+    -9.3399709e-01f, 2.7316827e-03f,  -1.2255983e-01f, 4.3528955e-04f,
+    -1.8259174e-01f, 1.6869284e-01f,  7.2316505e-02f,  1.4797674e-01f,
+    -7.4447143e-01f, -1.2733582e-01f, 4.3528955e-04f,  6.2912571e-01f,
+    -4.1652191e-01f, 1.3232289e-01f,  8.6860955e-01f,  2.9575959e-01f,
+    1.4060289e-01f,  4.3528955e-04f,  -1.2275702e+00f, 1.8783921e+00f,
+    1.8988673e-01f,  -7.1296537e-01f, -9.7856484e-02f, -3.6823254e-02f,
+    4.3528955e-04f,  3.5731812e+00f,  8.5277569e-01f,  1.7320411e-01f,
+    -2.6022583e-01f, 9.9511296e-01f,  1.7672656e-02f,  4.3528955e-04f,
+    -3.2547247e-01f, 1.0493282e+00f,  -4.6118867e-02f, -8.8639891e-01f,
+    -3.5033399e-01f, -2.7874088e-01f, 4.3528955e-04f,  -2.1683335e+00f,
+    2.8940396e+00f,  -3.0216346e-02f, -7.1029037e-01f, -4.7064987e-01f,
+    -1.6873490e-02f, 4.3528955e-04f,  -3.3068368e+00f, -3.1251514e-01f,
+    -4.1395524e-03f, 5.4402400e-02f,  -9.8918092e-01f, 1.8423792e-02f,
+    4.3528955e-04f,  -1.1528666e+00f, 4.5874470e-01f,  -3.7055109e-02f,
+    -4.4845080e-01f, -9.2169225e-01f, -8.6142374e-03f, 4.3528955e-04f,
+    -1.1858754e+00f, -1.2992933e+00f, -9.3087547e-02f, 7.4892771e-01f,
+    -3.4115070e-01f, -6.4444065e-02f, 4.3528955e-04f,  3.6193785e-01f,
+    8.3436614e-01f,  -1.4228393e-01f, -9.1417694e-01f, -1.0367716e-01f,
+    5.6777382e-01f,  4.3528955e-04f,  1.1210346e+00f,  1.5218471e+00f,
+    9.1662899e-02f,  -4.3306598e-01f, 5.4189026e-01f,  -7.3980235e-02f,
+    4.3528955e-04f,  -1.9737762e-01f, -2.8221097e+00f, -1.9571712e-02f,
+    8.8556200e-01f,  -6.7572035e-02f, -9.2143659e-03f, 4.3528955e-04f,
+    9.1818577e-01f,  -2.3148041e+00f, -7.9780087e-02f, 4.7388119e-01f,
+    5.4029591e-02f,  1.3003300e-01f,  4.3528955e-04f,  2.5585835e+00f,
+    1.1267759e+00f,  5.7470653e-02f,  -4.0843529e-01f, 7.3637956e-01f,
+    -2.4560466e-04f, 4.3528955e-04f,  -1.2836168e+00f, -7.4546921e-01f,
+    -5.0261978e-02f, 4.5069140e-01f,  -6.2581319e-01f, -1.5148738e-01f,
+    4.3528955e-04f,  1.2226480e-01f,  -1.5138268e+00f, 1.0142729e-01f,
+    6.1069036e-01f,  4.2878330e-01f,  1.5189332e-01f,  4.3528955e-04f,
+    -9.0388876e-01f, -1.2489145e-01f, -1.2365433e-01f, -1.3448201e-01f,
+    -5.9487671e-01f, -1.4365520e-01f, 4.3528955e-04f,  7.3593616e-01f,
+    2.0408962e+00f,  8.3824441e-02f,  -6.5857732e-01f, 1.5184176e-01f,
+    1.0317023e-01f,  4.3528955e-04f,  -1.7122892e+00f, 3.8581634e+00f,
+    -7.3656075e-02f, -8.9505386e-01f, -3.3179438e-01f, 3.7388578e-02f,
+    4.3528955e-04f,  -5.3468537e-01f, -4.7434717e-02f, 6.7179985e-02f,
+    8.6435848e-01f,  -6.7851961e-01f, 1.4579338e-01f,  4.3528955e-04f,
+    -2.4165223e+00f, 3.7271965e-01f,  -7.6431237e-02f, -2.2839461e-01f,
+    -9.8714507e-01f, 1.0885678e-01f,  4.3528955e-04f,  -4.7036663e-02f,
+    -1.0399392e-01f, -1.3034745e-01f, 7.2965717e-01f,  -4.8684612e-01f,
+    -7.4093901e-03f, 4.3528955e-04f,  7.4288279e-01f,  1.4353273e+00f,
+    -1.9567568e-02f, -9.8934579e-01f, 4.7643331e-01f,  1.1580731e-01f,
+    4.3528955e-04f,  2.0246121e-01f,  1.4431593e+00f,  1.6159782e-01f,
+    -8.1355417e-01f, -1.3663541e-01f, -3.2037806e-02f, 4.3528955e-04f,
+    1.6350821e+00f,  -1.7458792e+00f, 2.3793463e-02f,  5.7912129e-01f,
+    5.6457114e-01f,  1.7141799e-02f,  4.3528955e-04f,  -2.0551649e-01f,
+    -1.3543899e-01f, -4.1872516e-02f, 4.0893802e-01f,  -8.0225229e-01f,
+    -2.4241829e-01f, 4.3528955e-04f,  2.3305878e-01f,  2.5113597e+00f,
+    2.1840546e-01f,  -5.9460878e-01f, 3.5240728e-01f,  1.3851382e-01f,
+    4.3528955e-04f,  2.6124325e+00f,  -3.8102064e+00f, -4.3306615e-02f,
+    6.9091278e-01f,  4.8474282e-01f,  1.4768303e-02f,  4.3528955e-04f,
+    -2.4161020e-01f, 1.3587803e-01f,  -6.9224834e-02f, -3.9775196e-01f,
+    -6.3200921e-01f, -7.9936790e-01f, 4.3528955e-04f,  -1.3482593e+00f,
+    -2.5195771e-01f, -9.9038035e-03f, -3.3324938e-02f, -9.3111509e-01f,
+    7.4540854e-02f,  4.3528955e-04f,  -1.1981162e+00f, -8.8335890e-01f,
+    6.8965092e-02f,  2.8144574e-01f,  -5.8030558e-01f, -1.1548749e-01f,
+    4.3528955e-04f,  2.9708712e+00f,  -1.1089207e-01f, -3.4816068e-02f,
+    -1.5190066e-01f, 9.4288164e-01f,  6.0724258e-02f,  4.3528955e-04f,
+    3.1330743e-01f,  9.9292338e-01f,  -2.2172625e-01f, -8.7515223e-01f,
+    5.4050171e-01f,  1.3345526e-01f,  4.3528955e-04f,  1.0850617e+00f,
+    5.4578710e-01f,  -1.4380048e-01f, -6.2867448e-02f, 8.4845167e-01f,
+    4.6961077e-02f,  4.3528955e-04f,  -3.0208912e-01f, 1.8179843e-01f,
+    -8.6565815e-02f, 1.0579349e-01f,  -1.0855350e+00f, -2.1380183e-01f,
+    4.3528955e-04f,  3.3557911e+00f,  1.7753253e+00f,  2.1769961e-03f,
+    -4.3604359e-01f, 8.5013366e-01f,  3.3371430e-02f,  4.3528955e-04f,
+    -1.2968292e+00f, 2.7070138e+00f,  -7.1533243e-03f, -7.1641332e-01f,
+    -5.1094538e-01f, -1.1688570e-02f, 4.3528955e-04f,  -1.9913765e+00f,
+    -1.7756146e+00f, -4.3387286e-02f, 6.8172240e-01f,  -8.1636375e-01f,
+    2.8521253e-02f,  4.3528955e-04f,  2.7705827e+00f,  3.0667574e+00f,
+    4.2296227e-02f,  -5.9592640e-01f, 5.5296630e-01f,  -2.9462561e-02f,
+    4.3528955e-04f,  -8.3098304e-01f, 6.5962231e-01f,  2.6122395e-02f,
+    -3.5789123e-01f, -2.4934024e-01f, -6.8857037e-02f, 4.3528955e-04f,
+    2.1062651e+00f,  1.7009193e+00f,  4.6212338e-03f,  -5.6595540e-01f,
+    8.0170381e-01f,  -8.7768763e-02f, 4.3528955e-04f,  8.6214018e-01f,
+    -2.1982454e-01f, 5.5245426e-02f,  2.7128986e-01f,  1.0102823e+00f,
+    6.2986396e-02f,  4.3528955e-04f,  -2.3220477e+00f, -1.9201686e+00f,
+    -6.8302671e-03f, 6.5915823e-01f,  -5.2721488e-01f, 7.4514419e-02f,
+    4.3528955e-04f,  2.7097025e+00f,  1.2808559e+00f,  -3.5829075e-02f,
+    -2.8512707e-01f, 8.6724371e-01f,  -1.0604612e-01f, 4.3528955e-04f,
+    1.6352291e+00f,  -7.1214700e-01f, 1.2250543e-01f,  -8.0792114e-02f,
+    4.9566245e-01f,  3.5645124e-02f,  4.3528955e-04f,  -7.5146157e-01f,
+    1.5912848e+00f,  1.0614011e-01f,  -8.1132913e-01f, -4.4495651e-01f,
+    -1.8113302e-01f, 4.3528955e-04f,  1.4523309e+00f,  6.7063606e-01f,
+    -1.6688326e-01f, 1.6911168e-02f,  1.1126206e+00f,  -1.2194833e-01f,
+    4.3528955e-04f,  -8.4702277e-01f, 4.1258387e-02f,  2.3520105e-01f,
+    -3.8654116e-01f, -5.1819432e-01f, 7.8933001e-02f,  4.3528955e-04f,
+    -1.1487185e+00f, -9.9123007e-01f, -8.2986981e-02f, 2.7650914e-01f,
+    -5.3549790e-01f, 6.7036390e-02f,  4.3528955e-04f,  -1.2094220e-01f,
+    2.1623321e-02f,  7.2681710e-02f,  4.9753383e-01f,  -8.5398209e-01f,
+    -1.2832917e-01f, 4.3528955e-04f,  1.7979431e+00f,  -1.6102600e+00f,
+    3.2386094e-02f,  6.0534787e-01f,  7.4632061e-01f,  -8.5255355e-02f,
+    4.3528955e-04f,  -2.7590358e-01f, 1.4006134e+00f,  6.6706948e-02f,
+    -8.2671946e-01f, 1.4065933e-01f,  -3.2705441e-02f, 4.3528955e-04f,
+    1.0134294e+00f,  2.6530507e+00f,  -1.0000309e-01f, -8.9642572e-01f,
+    2.5590906e-01f,  -1.4502455e-01f, 4.3528955e-04f,  1.2263640e-01f,
+    -1.2401736e+00f, 4.4685442e-02f,  1.0572802e+00f,  9.7505040e-02f,
+    -1.1213637e-01f, 4.3528955e-04f,  -2.9113993e-01f, 2.4090378e+00f,
+    -5.9561726e-02f, -8.8974959e-01f, -1.9136673e-01f, 1.6485028e-02f,
+    4.3528955e-04f,  1.2612617e+00f,  -3.3669984e-01f, -4.0124498e-02f,
+    8.5429823e-01f,  7.3775476e-01f,  -1.6983813e-01f, 4.3528955e-04f,
+    5.8132738e-01f,  -6.1585069e-01f, -3.2657955e-02f, 7.6578617e-01f,
+    2.5307181e-01f,  2.4746701e-02f,  4.3528955e-04f,  -2.3786433e+00f,
+    4.7847595e+00f,  -6.9858521e-02f, -8.0182946e-01f, -3.5937512e-01f,
+    4.5570474e-02f,  4.3528955e-04f,  2.1276598e+00f,  -2.2034548e-02f,
+    -3.3164397e-02f, -8.3605975e-02f, 1.0985366e+00f,  5.3330835e-02f,
+    4.3528955e-04f,  -9.8296821e-01f, 9.2811710e-01f,  6.8162978e-02f,
+    -1.0059860e+00f, -1.5224475e-01f, -1.4412822e-01f, 4.3528955e-04f,
+    2.0265555e+00f,  -3.7009642e+00f, 4.2261393e-03f,  7.8852266e-01f,
+    4.2059430e-01f,  -2.6934424e-02f, 4.3528955e-04f,  1.0188012e-01f,
+    3.1628230e+00f,  -1.0311620e-02f, -9.7405827e-01f, -1.7689633e-01f,
+    -3.6586020e-02f, 4.3528955e-04f,  2.5105762e-01f,  -1.4537195e+00f,
+    -6.7538922e-03f, 6.4909959e-01f,  1.8300374e-01f,  1.5452889e-01f,
+    4.3528955e-04f,  -3.5887149e-01f, 1.0217121e+00f,  5.5621106e-02f,
+    -4.6745801e-01f, -3.5040429e-01f, 1.4017221e-01f,  4.3528955e-04f,
+    -3.6363474e-01f, -2.0791252e+00f, 9.9280544e-02f,  7.4064577e-01f,
+    2.4910280e-02f,  -1.3761082e-02f, 4.3528955e-04f,  2.5299704e+00f,
+    2.6565437e+00f,  -1.5974584e-01f, -7.8995067e-01f, 5.5792981e-01f,
+    1.6029423e-02f,  4.3528955e-04f,  8.5832125e-01f,  8.6110926e-01f,
+    1.5052030e-02f,  -1.0571755e-01f, 9.5851374e-01f,  -5.5006362e-02f,
+    4.3528955e-04f,  -3.6132884e-01f, -5.6717098e-01f, 1.2858142e-01f,
+    4.4388393e-01f,  -6.4576554e-01f, -7.0728026e-02f, 4.3528955e-04f,
+    -5.2491522e-01f, 1.4241612e+00f,  8.6118802e-02f,  -8.0211616e-01f,
+    -2.0621885e-01f, 4.6976794e-02f,  4.3528955e-04f,  7.4335837e-01f,
+    4.5022494e-01f,  2.1805096e-02f,  -2.8159657e-01f, 6.9618279e-01f,
+    1.1087923e-01f,  4.3528955e-04f,  2.4685440e+00f,  -1.7992185e+00f,
+    -2.4382826e-02f, 3.3877319e-01f,  7.1341413e-01f,  1.3980274e-01f,
+    4.3528955e-04f,  -5.6947696e-01f, -1.3093477e-01f, 3.4981940e-02f,
+    -3.9349020e-01f, -1.0065408e+00f, 1.3161841e-01f,  4.3528955e-04f,
+    3.0076389e+00f,  -3.0053742e+00f, -1.2630166e-01f, 5.9211147e-01f,
+    5.5681252e-01f,  5.0325658e-02f,  4.3528955e-04f,  2.4450483e+00f,
+    -8.3323008e-01f, -6.1835062e-02f, 3.9228153e-01f,  6.7553335e-01f,
+    4.6432964e-03f,  4.3528955e-04f,  -7.2692263e-01f, 3.2394440e+00f,
+    2.0450163e-01f,  -8.2043678e-01f, -3.3575037e-01f, 1.3271794e-01f,
+    4.3528955e-04f,  -4.7058865e-02f, 5.2744985e-01f,  3.0579763e-02f,
+    -1.3292233e+00f, 4.1714913e-01f,  2.4538927e-01f,  4.3528955e-04f,
+    -3.3970461e+00f, -2.2253754e+00f, -4.7939584e-02f, 4.3698314e-01f,
+    -7.8352094e-01f, 7.6068230e-02f,  4.3528955e-04f,  -4.0937471e-01f,
+    8.5695320e-01f,  -5.2578688e-02f, -1.0477607e+00f, -2.6653007e-01f,
+    1.5041941e-01f,  4.3528955e-04f,  4.2821819e-01f,  9.2341995e-01f,
+    -3.1434563e-01f, -2.8239945e-01f, 1.1230114e+00f,  1.4065085e-03f,
+    4.3528955e-04f,  -3.8736677e-01f, -2.9319978e-01f, -1.2894061e-01f,
+    1.1640970e+00f,  -5.0897682e-01f, -2.5595438e-03f, 4.3528955e-04f,
+    -1.8897545e+00f, -1.4387591e+00f, 1.6922385e-01f,  4.4390589e-01f,
+    -6.3282561e-01f, 1.7320186e-02f,  4.3528955e-04f,  -4.1135919e-01f,
+    -3.1203837e+00f, -9.8678328e-02f, 9.4173104e-01f,  -1.1044490e-01f,
+    -4.9056496e-02f, 4.3528955e-04f,  7.9128230e-01f,  3.0273194e+00f,
+    1.4116533e-02f,  -9.3604863e-01f, 2.5930220e-01f,  6.6329516e-02f,
+    4.3528955e-04f,  -8.1456822e-01f, -2.1186852e+00f, 2.3557574e-02f,
+    7.6779854e-01f,  -5.8944011e-01f, 3.7813656e-02f,  4.3528955e-04f,
+    -3.9661205e-01f, 1.2244097e+00f,  -6.1554950e-02f, -6.5904826e-01f,
+    -5.0002450e-01f, 2.0916667e-02f,  4.3528955e-04f,  1.1140013e+00f,
+    -5.7227570e-01f, -1.1597091e-02f, 7.5421071e-01f,  4.2004368e-01f,
+    -2.6281213e-03f, 4.3528955e-04f,  -1.6199192e+00f, -5.9800673e-01f,
+    -5.4581806e-02f, 4.4851816e-01f,  -9.0041524e-01f, 8.5989453e-02f,
+    4.3528955e-04f,  3.7264368e-01f,  6.6021419e-01f,  -6.7245439e-02f,
+    -1.1887774e+00f, -1.0028941e-01f, -3.6440849e-01f, 4.3528955e-04f,
+    5.6499505e-01f,  2.2261598e+00f,  1.1118982e-01f,  -6.5138388e-01f,
+    2.8424475e-01f,  -1.3678367e-01f, 4.3528955e-04f,  1.5373086e+00f,
+    -8.1240553e-01f, 9.2809029e-02f,  3.9106521e-01f,  8.1601411e-01f,
+    2.3013812e-01f,  4.3528955e-04f,  -4.9126324e-01f, -4.3590438e-01f,
+    1.1421021e-02f,  2.2640009e-01f,  -9.1928256e-01f, 2.0942467e-01f,
+    4.3528955e-04f,  -6.8653744e-01f, 2.2561247e+00f,  8.5459329e-02f,
+    -1.0358773e+00f, -2.9513091e-01f, 1.7248828e-02f,  4.3528955e-04f,
+    1.8069242e+00f,  -1.2037444e+00f, 4.5799825e-02f,  3.5944691e-01f,
+    9.1103619e-01f,  -7.9826497e-02f, 4.3528955e-04f,  2.0575259e+00f,
+    -3.1763389e+00f, -1.8279422e-02f, 7.8307521e-01f,  4.7109488e-01f,
+    -8.4028229e-02f, 4.3528955e-04f,  -8.7674581e-02f, -5.4540098e-02f,
+    1.5677622e-02f,  7.6661813e-01f,  3.3778343e-01f,  -4.3066570e-01f,
+    4.3528955e-04f,  9.5024467e-02f,  1.0252072e+00f,  2.1677898e-02f,
+    -7.9040045e-01f, -2.5232789e-01f, 4.1211635e-02f,  4.3528955e-04f,
+    5.4908508e-01f,  -1.3499315e+00f, -3.3463866e-02f, 8.7109840e-01f,
+    2.7386010e-01f,  5.1668398e-02f,  4.3528955e-04f,  1.5357281e+00f,
+    2.8483450e+00f,  -4.2783320e-02f, -9.3107170e-01f, 2.6026526e-01f,
+    5.4807654e-03f,  4.3528955e-04f,  1.9799074e+00f,  -8.8433012e-02f,
+    -1.4484942e-02f, -1.9528493e-01f, 7.2130388e-01f,  -2.0275770e-01f,
+    4.3528955e-04f,  -4.7000352e-01f, -1.2445089e+00f, 9.7627677e-03f,
+    6.3890266e-01f,  -2.7233315e-01f, 1.4536087e-01f,  4.3528955e-04f,
+    6.5441293e-01f,  -1.1488899e+00f, -4.8015434e-02f, 1.1887335e+00f,
+    2.7288523e-01f,  -1.9322780e-01f, 4.3528955e-04f,  1.2705033e+00f,
+    6.1883949e-02f,  2.1166829e-03f,  1.0357748e-01f,  8.9628267e-01f,
+    -1.2037895e-01f, 4.3528955e-04f,  -5.6938869e-01f, 6.6062771e-02f,
+    -1.8949907e-01f, -2.9908726e-01f, -7.2934484e-01f, 2.1711026e-01f,
+    4.3528955e-04f,  2.2395673e+00f,  -1.3461827e+00f, 1.9536251e-02f,
+    4.5044413e-01f,  5.6432700e-01f,  2.3857189e-02f,  4.3528955e-04f,
+    8.7322974e-01f,  1.5577562e+00f,  1.1960505e-01f,  -9.3819404e-01f,
+    4.6257854e-01f,  -1.4560352e-01f, 4.3528955e-04f,  9.0846598e-02f,
+    -5.4425433e-02f, -3.0641647e-02f, 4.8880920e-01f,  3.3609447e-01f,
+    -6.3160634e-01f, 4.3528955e-04f,  -2.3527200e+00f, -1.1870589e+00f,
+    1.0995490e-02f,  4.0187258e-01f,  -7.9024297e-01f, -5.7241295e-02f,
+    4.3528955e-04f,  2.4190569e+00f,  8.5987353e-01f,  1.9392224e-03f,
+    -6.4576805e-01f, 8.9911377e-01f,  -1.0872603e-02f, 4.3528955e-04f,
+    1.0541587e-01f,  5.4475451e-01f,  9.7522043e-02f,  -9.8095751e-01f,
+    9.9578626e-02f,  -3.8274810e-02f, 4.3528955e-04f,  -3.6179907e+00f,
+    -9.8762876e-01f, 6.7393772e-02f,  2.3076908e-01f,  -8.0047822e-01f,
+    -9.5403321e-02f, 4.3528955e-04f,  -5.7545960e-01f, -3.6404073e-01f,
+    -1.6558149e-01f, 7.6639628e-01f,  -2.5322661e-01f, -1.8760782e-01f,
+    4.3528955e-04f,  1.4494503e+00f,  1.3635819e-01f,  4.8340175e-02f,
+    -2.3426367e-02f, 8.0758417e-01f,  -2.9483119e-03f, 4.3528955e-04f,
+    1.0875323e+00f,  1.3451964e-01f,  -8.7131791e-02f, -2.1103024e-01f,
+    9.2205608e-01f,  2.8308816e-02f,  4.3528955e-04f,  -1.4242743e+00f,
+    2.7765086e+00f,  -1.2147181e-01f, -7.6130933e-01f, -2.9025900e-01f,
+    1.0861298e-01f,  4.3528955e-04f,  2.0784769e+00f,  -1.2349559e+00f,
+    1.0810343e-01f,  3.5329786e-01f,  4.6846032e-01f,  -1.6740002e-01f,
+    4.3528955e-04f,  1.4749795e-01f,  7.9844761e-01f,  -4.3843905e-03f,
+    -4.7300124e-01f, 8.7693036e-01f,  6.8800561e-02f,  4.3528955e-04f,
+    4.0119499e-01f,  -1.7291172e-01f, -1.2399731e-01f, 1.5388921e+00f,
+    7.7274776e-01f,  -2.3911048e-01f, 4.3528955e-04f,  7.3464863e-02f,
+    7.9866445e-01f,  6.2581743e-03f,  -8.5985190e-01f, 5.4649860e-01f,
+    -2.5982010e-01f, 4.3528955e-04f,  7.1442699e-01f,  -2.4070177e+00f,
+    8.9704074e-02f,  8.3865607e-01f,  2.1499628e-01f,  -1.5801724e-02f,
+    4.3528955e-04f,  8.3317614e-01f,  4.8940234e+00f,  -5.3537861e-02f,
+    -8.8109714e-01f, 2.1456513e-01f,  8.3016999e-02f,  4.3528955e-04f,
+    -1.7785053e+00f, 3.2734346e-01f,  6.1488722e-02f,  -7.6552361e-02f,
+    -9.5409876e-01f, 6.5554485e-02f,  4.3528955e-04f,  1.3497580e+00f,
+    -1.1932336e+00f, -3.3121523e-02f, 6.5040576e-01f,  8.5196728e-01f,
+    1.4664665e-01f,  4.3528955e-04f,  2.2499648e-01f,  -6.7828220e-01f,
+    -3.2244403e-02f, 1.2074751e+00f,  -3.3725122e-01f, -7.4476950e-02f,
+    4.3528955e-04f,  2.6168017e+00f,  -1.6076787e+00f, 1.9562436e-02f,
+    4.6444046e-01f,  8.2248992e-01f,  -4.8805386e-02f, 4.3528955e-04f,
+    -5.9902161e-01f, 2.4308178e+00f,  6.4808153e-02f,  -9.8294455e-01f,
+    -3.4821844e-01f, -1.7830840e-01f, 4.3528955e-04f,  1.1604474e+00f,
+    -1.6884667e+00f, 3.0157642e-02f,  8.8682789e-01f,  4.4615921e-01f,
+    3.4490395e-02f,  4.3528955e-04f,  -6.9408745e-01f, -5.1984382e-01f,
+    -7.2689377e-02f, 3.8508376e-01f,  -7.8935212e-01f, -1.7347808e-01f,
+    4.3528955e-04f,  -7.1409100e-01f, -1.4477054e+00f, 4.2847276e-02f,
+    8.6936325e-01f,  -5.7924348e-01f, 1.8125609e-01f,  4.3528955e-04f,
+    -4.6812585e-01f, 3.2654230e-02f,  -7.3437296e-02f, -7.3721573e-02f,
+    -9.5559794e-01f, 6.6486284e-02f,  4.3528955e-04f,  -1.1950930e+00f,
+    1.1448176e+00f,  4.5032661e-02f,  -5.8202130e-01f, -5.1685882e-01f,
+    -1.6979301e-01f, 4.3528955e-04f,  -3.5134771e-01f, 3.7821102e-01f,
+    4.0321019e-02f,  -4.7109327e-01f, -7.0669609e-01f, -2.8876856e-01f,
+    4.3528955e-04f,  -2.5681963e+00f, -1.6003565e+00f, -7.2119567e-03f,
+    5.2001029e-01f,  -7.5785911e-01f, -6.2797545e-03f, 4.3528955e-04f,
+    -8.8664222e-01f, -8.1197131e-01f, -5.3504933e-02f, 3.3268660e-01f,
+    -5.3778893e-01f, -7.9499856e-02f, 4.3528955e-04f,  -2.7094047e+00f,
+    2.9598814e-01f,  -7.1768537e-02f, -1.6321209e-01f, -1.1034260e+00f,
+    -3.7640940e-02f, 4.3528955e-04f,  -1.9633139e+00f, -1.6689534e+00f,
+    -3.2633558e-02f, 5.9074330e-01f,  -7.9040700e-01f, -2.1121839e-02f,
+    4.3528955e-04f,  -5.4326040e-01f, -1.9437907e+00f, 9.7472832e-02f,
+    8.7752557e-01f,  -4.8503622e-01f, 1.2190759e-01f,  4.3528955e-04f,
+    -3.4569380e+00f, -1.0447805e+00f, -9.9200681e-03f, 2.5297007e-01f,
+    -9.3736821e-01f, -4.2041242e-02f, 4.3528955e-04f,  -7.9708016e-01f,
+    -1.9970255e-01f, -4.3558534e-02f, 6.7883605e-01f,  -5.2064997e-01f,
+    -1.6564825e-01f, 4.3528955e-04f,  -2.9726634e+00f, -1.7741922e+00f,
+    -6.3677475e-02f, 4.7023273e-01f,  -7.7728236e-01f, -5.3127848e-02f,
+    4.3528955e-04f,  5.1731479e-01f,  -1.4780343e-01f, 1.2331359e-02f,
+    1.1335959e-01f,  9.6430969e-01f,  5.2361697e-01f,  4.3528955e-04f,
+    6.2453508e-01f,  9.0577215e-01f,  9.1513470e-03f,  -9.9412370e-01f,
+    2.6023936e-01f,  -9.7256288e-02f, 4.3528955e-04f,  -2.0287299e+00f,
+    -1.0946856e+00f, 1.1962408e-02f,  6.5835631e-01f,  -6.1281985e-01f,
+    1.2128092e-01f,  4.3528955e-04f,  2.6431584e-01f,  1.3354558e-01f,
+    9.8433338e-02f,  1.4912300e-01f,  1.1693451e+00f,  6.3731897e-01f,
+    4.3528955e-04f,  -1.7521005e+00f, -8.8002577e-02f, 1.5880217e-01f,
+    -3.3194533e-01f, -8.0388534e-01f, 2.0541638e-02f,  4.3528955e-04f,
+    -1.4229740e+00f, -2.1968081e+00f, 4.1129375e-03f,  7.6746833e-01f,
+    -5.2362108e-01f, -9.5837966e-02f, 4.3528955e-04f,  1.0743963e+00f,
+    4.6837765e-01f,  6.4699970e-02f,  -5.5894613e-01f, 9.0261793e-01f,
+    9.4317570e-02f,  4.3528955e-04f,  -8.5575664e-01f, -7.0606029e-01f,
+    8.9422494e-02f,  6.2036633e-01f,  -4.2148536e-01f, 1.8065149e-01f,
+    4.3528955e-04f,  2.3299632e+00f,  1.4127278e+00f,  6.6580819e-03f,
+    -5.3752929e-01f, 8.3643514e-01f,  -1.5355662e-01f, 4.3528955e-04f,
+    9.3130213e-01f,  2.8616208e-01f,  8.5462220e-02f,  -5.1858466e-02f,
+    1.0053108e+00f,  2.4221528e-01f,  4.3528955e-04f,  4.2765731e-01f,
+    9.0449750e-01f,  -1.6891049e-01f, -7.9796612e-01f, -3.1156367e-01f,
+    5.3547237e-02f,  4.3528955e-04f,  1.9845707e+00f,  3.4831560e+00f,
+    -4.7044829e-02f, -8.2068503e-01f, 4.0651965e-01f,  -1.3465271e-02f,
+    4.3528955e-04f,  -4.2305651e-01f, 6.0528225e-01f,  -2.3967813e-01f,
+    -3.0473635e-01f, -4.6031299e-01f, 3.9196101e-01f,  4.3528955e-04f,
+    8.5102820e-01f,  1.8474413e+00f,  -7.7416305e-04f, -7.4688625e-01f,
+    6.0994893e-01f,  3.1251919e-02f,  4.3528955e-04f,  5.4253709e-01f,
+    3.0557680e-01f,  -4.2302590e-02f, -6.0393506e-01f, 8.8126141e-01f,
+    -1.0627985e-01f, 4.3528955e-04f,  1.2939869e+00f,  -3.3022356e-01f,
+    -5.8827806e-02f, 6.7232513e-01f,  8.3248162e-01f,  -1.5342577e-01f,
+    4.3528955e-04f,  -2.4763982e+00f, -5.5538550e-02f, -2.7557008e-02f,
+    -6.7884222e-02f, -1.1428419e+00f, -4.6435285e-02f, 4.3528955e-04f,
+    -1.8661380e-01f, -2.0990010e-01f, -3.0606449e-01f, 7.7871537e-01f,
+    -4.4663510e-01f, 3.0201361e-01f,  4.3528955e-04f,  4.8322433e-01f,
+    -2.9237643e-02f, 5.7876904e-02f,  -3.8807693e-01f, 1.1019963e+00f,
+    -1.3166371e-01f, 4.3528955e-04f,  -8.4067845e-01f, 2.6345208e-01f,
+    -5.0317522e-02f, -4.0172011e-01f, -5.9563518e-01f, 8.2385927e-02f,
+    4.3528955e-04f,  2.3207787e-01f,  1.8103322e-01f,  -3.9755636e-01f,
+    9.7397976e-03f,  2.5413173e-01f,  -2.1863239e-01f, 4.3528955e-04f,
+    -6.5926468e-01f, -1.4410347e+00f, -7.4673556e-02f, 8.0999804e-01f,
+    -3.0382311e-02f, -2.3229431e-02f, 4.3528955e-04f,  -3.2831180e+00f,
+    -1.7271242e+00f, -4.1410003e-02f, 4.5661017e-01f,  -7.6089084e-01f,
+    7.8279510e-02f,  4.3528955e-04f,  1.6963539e+00f,  3.8021936e+00f,
+    -9.9510681e-03f, -8.1427753e-01f, 4.4077647e-01f,  1.5613039e-02f,
+    4.3528955e-04f,  1.3873883e-01f,  -1.8982550e+00f, 6.1575405e-02f,
+    4.5881829e-01f,  5.2736378e-01f,  1.3334970e-01f,  4.3528955e-04f,
+    8.6772814e-04f,  1.1601824e-01f,  -3.3122517e-02f, -5.6568939e-02f,
+    -1.5768901e-01f, -1.1994604e+00f, 4.3528955e-04f,  3.6489058e-01f,
+    2.2780013e+00f,  1.3434218e-01f,  -8.4435463e-01f, 3.9021924e-02f,
+    -1.3476358e-01f, 4.3528955e-04f,  4.3782651e-02f,  8.3711252e-02f,
+    -6.8130195e-02f, 2.5425407e-01f,  -8.3281243e-01f, -2.0019041e-01f,
+    4.3528955e-04f,  5.7107091e-01f,  1.5243270e+00f,  -1.3825943e-01f,
+    -5.2632976e-01f, -6.1366729e-02f, 5.5990737e-02f,  4.3528955e-04f,
+    3.3662832e-01f,  -6.8193883e-01f, 7.2840653e-02f,  1.0177697e+00f,
+    5.4933047e-01f,  6.9054075e-02f,  4.3528955e-04f,  -6.6073990e-01f,
+    -3.7196856e+00f, -5.0830446e-02f, 8.9156741e-01f,  -1.7090544e-01f,
+    -6.4102180e-02f, 4.3528955e-04f,  -5.0844455e-01f, -6.8513364e-01f,
+    -3.5965420e-02f, 5.9760863e-01f,  -4.7735396e-01f, -1.8299666e-01f,
+    4.3528955e-04f,  -6.8350154e-01f, 1.2145416e+00f,  1.6988605e-02f,
+    -9.6489954e-01f, -4.0220964e-01f, -5.7150863e-02f, 4.3528955e-04f,
+    2.6657023e-03f,  2.8361964e+00f,  1.3727842e-01f,  -9.2848885e-01f,
+    -2.3802651e-02f, -2.9893067e-02f, 4.3528955e-04f,  7.1484679e-01f,
+    -1.7558552e-02f, 6.5233268e-02f,  2.3428868e-01f,  1.2097244e+00f,
+    1.8551530e-01f,  4.3528955e-04f,  2.4974546e+00f,  -2.8424222e+00f,
+    -6.0842179e-02f, 7.2119719e-01f,  6.1807090e-01f,  4.4848886e-03f,
+    4.3528955e-04f,  -7.2637606e-01f, 2.0696627e-01f,  4.9142040e-02f,
+    -5.8697104e-01f, -1.1860815e+00f, -2.2350742e-02f, 4.3528955e-04f,
+    2.3579032e+00f,  -9.2522246e-01f, 4.0857952e-02f,  4.1979638e-01f,
+    1.0660518e+00f,  -6.8881184e-02f, 4.3528955e-04f,  5.6819302e-01f,
+    -6.5006769e-01f, -1.9551549e-02f, 6.0341620e-01f,  3.2316363e-01f,
+    -1.4131443e-01f, 4.3528955e-04f,  2.4865353e+00f,  1.8973608e+00f,
+    -1.7097190e-01f, -5.5020934e-01f, 5.8800060e-01f,  2.5497884e-02f,
+    4.3528955e-04f,  6.1875159e-01f,  -1.0255457e+00f, -1.9710729e-02f,
+    1.2166758e+00f,  -1.1979587e-01f, 1.1895105e-01f,  4.3528955e-04f,
+    1.8889960e+00f,  4.4113177e-01f,  3.5475913e-02f,  -1.4306320e-01f,
+    7.6067019e-01f,  -6.8022832e-02f, 4.3528955e-04f,  -1.0049478e+00f,
+    2.0558472e+00f,  -7.3774904e-02f, -7.4023187e-01f, -5.5185401e-01f,
+    3.7878823e-02f,  4.3528955e-04f,  5.7862115e-01f,  9.9097723e-01f,
+    1.6117774e-01f,  -7.5559306e-01f, 2.3866206e-01f,  -6.8879575e-02f,
+    4.3528955e-04f,  6.7603087e-01f,  1.2947229e+00f,  1.7446222e-02f,
+    -7.8521651e-01f, 2.9222745e-01f,  1.8735348e-01f,  4.3528955e-04f,
+    8.9647853e-01f,  -5.1956713e-01f, 2.4297573e-02f,  5.7326376e-01f,
+    5.8633041e-01f,  8.8684745e-02f,  4.3528955e-04f,  -2.6681957e+00f,
+    -3.6744459e+00f, -7.8220870e-03f, 7.3944151e-01f,  -5.1488256e-01f,
+    -1.4767495e-02f, 4.3528955e-04f,  -1.5683670e+00f, -3.2788195e-02f,
+    -7.6718442e-02f, 9.9740848e-02f,  -1.0113243e+00f, 3.3560790e-02f,
+    4.3528955e-04f,  1.5289804e+00f,  -1.9233367e+00f, -1.3894814e-01f,
+    6.0772854e-01f,  6.2203312e-01f,  9.6978344e-02f,  4.3528955e-04f,
+    2.4105768e+00f,  2.0855658e+00f,  5.3614336e-03f,  -6.1464190e-01f,
+    8.3017898e-01f,  -8.3853111e-02f, 4.3528955e-04f,  3.0580890e-01f,
+    -1.7872522e+00f, 5.1492233e-02f,  1.0887216e+00f,  3.4208119e-01f,
+    -3.9914541e-02f, 4.3528955e-04f,  8.2199591e-01f,  -8.4657177e-02f,
+    5.1774617e-02f,  4.9161799e-03f,  9.3774903e-01f,  1.5778178e-01f,
+    4.3528955e-04f,  3.4976749e+00f,  8.5384987e-02f,  1.0628924e-01f,
+    1.3552208e-01f,  9.4745260e-01f,  -1.7629931e-02f, 4.3528955e-04f,
+    -2.4719608e+00f, -1.2636092e+00f, -3.4360029e-02f, 3.0628666e-01f,
+    -7.9305702e-01f, 3.0154097e-03f,  4.3528955e-04f,  5.4926354e-02f,
+    5.2475423e-01f,  3.9143164e-02f,  -1.5864406e+00f, -1.5850060e-01f,
+    1.0531772e-01f,  4.3528955e-04f,  7.4198604e-01f,  9.2351431e-01f,
+    -3.7047196e-02f, -5.0775450e-01f, 4.2936420e-01f,  -1.1653668e-01f,
+    4.3528955e-04f,  1.1112170e+00f,  -2.7738097e+00f, -1.7497780e-02f,
+    5.5628884e-01f,  3.2689962e-01f,  -3.7064776e-04f, 4.3528955e-04f,
+    -1.0530510e+00f, -6.0071993e-01f, 1.2673734e-01f,  5.0024051e-02f,
+    -8.2949370e-01f, -2.9796121e-01f, 4.3528955e-04f,  -1.6241739e+00f,
+    1.3345010e+00f,  -1.1588360e-01f, -2.6951846e-01f, -8.2361335e-01f,
+    -5.0801218e-02f, 4.3528955e-04f,  -1.7419720e-01f, 5.2164137e-01f,
+    9.8528922e-02f,  -1.0291586e+00f, 3.3354655e-01f,  -1.5960336e-01f,
+    4.3528955e-04f,  -6.0565019e-01f, -5.5609035e-01f, 3.1082552e-02f,
+    7.5958008e-01f,  -1.9538224e-01f, -1.4633027e-01f, 4.3528955e-04f,
+    -4.9053571e-01f, 2.6430783e+00f,  -3.5154559e-02f, -8.0469090e-01f,
+    -9.4265632e-02f, -9.3485467e-02f, 4.3528955e-04f,  -7.0439494e-01f,
+    -2.0787339e+00f, -2.0756021e-01f, 8.3007181e-01f,  -1.6426764e-01f,
+    -7.2128408e-02f, 4.3528955e-04f,  -4.4035116e-01f, -3.3813620e-01f,
+    2.4307882e-02f,  9.1928631e-01f,  -6.0499167e-01f, 4.5926848e-01f,
+    4.3528955e-04f,  1.8527824e-01f,  3.8168532e-01f,  2.0983349e-01f,
+    -1.2506202e+00f, 2.3404452e-01f,  3.7371102e-01f,  4.3528955e-04f,
+    -1.2636013e+00f, -5.9784985e-01f, -4.7899146e-02f, 2.6908675e-01f,
+    -8.4778076e-01f, 2.2155586e-01f,  4.3528955e-04f,  7.3441261e-01f,
+    3.3533065e+00f,  2.3495506e-02f,  -9.7689992e-01f, 2.2297400e-01f,
+    5.0885610e-02f,  4.3528955e-04f,  -4.3284786e-01f, 1.5768865e+00f,
+    -1.3119726e-01f, -3.9913717e-01f, 6.4090211e-03f,  1.5286538e-01f,
+    4.3528955e-04f,  -1.6225419e+00f, 3.1184757e-01f,  -1.5585758e-01f,
+    -3.4648874e-01f, -8.7082028e-01f, -1.3506371e-01f, 4.3528955e-04f,
+    2.2161245e+00f,  4.6904075e-01f,  -5.6632236e-02f, -5.0753099e-01f,
+    9.4770229e-01f,  5.4372478e-02f,  4.3528955e-04f,  -2.5575384e-01f,
+    3.5101867e-01f,  4.0780365e-02f,  -8.7618387e-01f, -2.8381410e-01f,
+    7.8601778e-01f,  4.3528955e-04f,  -5.2588731e-01f, -4.5831239e-01f,
+    -4.0714860e-02f, 6.1667013e-01f,  -7.3502094e-01f, -1.4056404e-01f,
+    4.3528955e-04f,  1.8513770e+00f,  -7.0006624e-03f, -7.0344448e-02f,
+    4.5605299e-01f,  9.5424765e-01f,  -2.1301979e-02f, 4.3528955e-04f,
+    -1.6321905e+00f, 3.3895607e+00f,  5.7503361e-02f,  -8.6464560e-01f,
+    -3.8077244e-01f, -2.0179151e-02f, 4.3528955e-04f,  -1.0064033e+00f,
+    -2.5638180e+00f, 1.7124342e-02f,  8.9349258e-01f,  -5.7391059e-01f,
+    1.0868723e-02f,  4.3528955e-04f,  1.6346438e+00f,  8.3005965e-01f,
+    -3.2662919e-01f, -2.2681291e-01f, 2.7908221e-01f,  -5.9719056e-02f,
+    4.3528955e-04f,  2.2292199e+00f,  -1.1050543e+00f, 1.0730445e-02f,
+    2.6269138e-01f,  7.1185613e-01f,  -3.6181048e-02f, 4.3528955e-04f,
+    1.4036174e+00f,  1.1911034e-01f,  -7.1851350e-02f, 3.8490844e-01f,
+    7.7112746e-01f,  2.0386507e-01f,  4.3528955e-04f,  1.5732681e+00f,
+    1.9649107e+00f,  -5.1828143e-03f, -6.3068891e-01f, 7.0427275e-01f,
+    7.4060582e-02f,  4.3528955e-04f,  -9.4116902e-01f, 5.2349406e-01f,
+    4.6097331e-02f,  -3.3958930e-01f, -1.1173369e+00f, 5.0133470e-02f,
+    4.3528955e-04f,  3.6216076e-02f,  -6.6199940e-01f, 8.9318037e-02f,
+    6.6798460e-01f,  3.1147206e-01f,  2.9319344e-02f,  4.3528955e-04f,
+    -1.9645029e-01f, -1.0114925e-01f, 1.2631127e-01f,  2.5635052e-01f,
+    -1.0783873e+00f, 6.8749827e-01f,  4.3528955e-04f,  5.2444690e-01f,
+    2.3602283e+00f,  -8.3572835e-02f, -6.4519852e-01f, 8.0025628e-02f,
+    -1.3552377e-01f, 4.3528955e-04f,  -1.6568463e+00f, 4.4634086e-01f,
+    9.2762329e-02f,  -1.4402235e-01f, -8.4352988e-01f, -7.2363071e-02f,
+    4.3528955e-04f,  1.9485572e-01f,  -1.0336198e-01f, -5.1944387e-01f,
+    1.0494876e+00f,  3.9715716e-01f,  -2.1683177e-01f, 4.3528955e-04f,
+    -2.5671093e+00f, 1.0086215e+00f,  1.9796669e-02f,  -3.8691205e-01f,
+    -8.5182667e-01f, -5.2516472e-02f, 4.3528955e-04f,  -6.8475443e-01f,
+    8.0488014e-01f,  -5.3428616e-02f, -6.0934180e-01f, -5.5340040e-01f,
+    1.0262435e-01f,  4.3528955e-04f,  -2.7989755e+00f, 1.6411934e+00f,
+    1.1240622e-02f,  -3.2449642e-01f, -7.7580637e-01f, 7.4721649e-02f,
+    4.3528955e-04f,  -1.6455792e+00f, -3.8826019e-01f, 2.6373168e-02f,
+    3.1206760e-01f,  -8.5127658e-01f, 1.4375688e-01f,  4.3528955e-04f,
+    1.6801897e-01f,  1.2080152e-01f,  3.2445569e-02f,  -4.5004186e-01f,
+    5.0862789e-01f,  -3.7546745e-01f, 4.3528955e-04f,  -8.1845067e-02f,
+    6.6978371e-01f,  -2.6640799e-03f, -1.0906885e+00f, 2.3516981e-01f,
+    -1.9243948e-01f, 4.3528955e-04f,  -2.4199150e+00f, -2.4490683e+00f,
+    9.0220533e-02f,  7.2695744e-01f,  -4.6335566e-01f, 1.2076426e-02f,
+    4.3528955e-04f,  -1.6315820e+00f, 1.9164609e+00f,  9.1761731e-02f,
+    -7.0615059e-01f, -5.8519530e-01f, 1.7396139e-02f,  4.3528955e-04f,
+    1.7057887e+00f,  -4.1499596e+00f, -1.0884849e-01f, 8.3480477e-01f,
+    3.9828756e-01f,  1.9042855e-02f,  4.3528955e-04f,  -1.3012112e+00f,
+    1.5476942e-03f,  -6.9730930e-02f, 2.0261635e-01f,  -1.0344921e+00f,
+    -9.6373409e-02f, 4.3528955e-04f,  -3.4074442e+00f, 8.9113665e-01f,
+    8.4849717e-03f,  -1.7843123e-01f, -9.3914807e-01f, -1.5416148e-03f,
+    4.3528955e-04f,  3.1464972e+00f,  1.1707810e+00f,  -9.0123832e-02f,
+    -3.9649948e-01f, 8.9776999e-01f,  5.2308809e-02f,  4.3528955e-04f,
+    -2.0385325e+00f, -3.7286061e-01f, -6.4106174e-03f, 2.0919327e-02f,
+    -1.0702337e+00f, 4.5696404e-02f,  4.3528955e-04f,  8.0258048e-01f,
+    1.0938566e+00f,  -4.0008679e-02f, -1.0327832e+00f, 6.8696415e-01f,
+    -4.0962655e-02f, 4.3528955e-04f,  -1.8550175e+00f, -8.1463999e-01f,
+    -1.2179890e-01f, 4.6979740e-01f,  -8.0964887e-01f, 9.3179317e-03f,
+    4.3528955e-04f,  -1.0081606e+00f, 6.3990313e-01f,  -1.7731649e-01f,
+    -2.4444751e-01f, -6.5339428e-01f, -2.3890449e-01f, 4.3528955e-04f,
+    -5.8583635e-01f, -7.7241272e-01f, -8.5141376e-02f, 3.8316825e-01f,
+    -1.2590183e+00f, 1.3741040e-01f,  4.3528955e-04f,  3.6858296e-01f,
+    1.2729882e+00f,  -4.8333712e-02f, -1.0705950e+00f, 1.7838275e-01f,
+    -5.5438329e-02f, 4.3528955e-04f,  -9.3251050e-01f, -4.2383528e+00f,
+    -6.6728279e-02f, 9.3908644e-01f,  -1.1615617e-01f, -5.2799676e-02f,
+    4.3528955e-04f,  -8.6092806e-01f, -2.0961054e-01f, -2.3576934e-02f,
+    2.0899075e-01f,  -7.1604538e-01f, 6.4252585e-02f,  4.3528955e-04f,
+    8.9336425e-01f,  3.7537756e+00f,  -9.9117264e-02f, -8.9663672e-01f,
+    8.4996365e-02f,  9.4953980e-03f,  4.3528955e-04f,  5.1324695e-02f,
+    -2.3619716e-01f, 1.5474382e-01f,  1.0846313e+00f,  5.0602829e-01f,
+    2.6798308e-01f,  4.3528955e-04f,  1.3966159e+00f,  1.1771947e+00f,
+    -1.8398192e-02f, -7.1102077e-01f, 7.4281359e-01f,  1.0411168e-01f,
+    4.3528955e-04f,  -8.1604296e-01f, -2.5322747e-01f, 1.0084441e-01f,
+    2.2354032e-01f,  -9.0091413e-01f, 1.1915623e-01f,  4.3528955e-04f,
+    -1.1094052e+00f, -9.8612660e-01f, 3.8676581e-03f,  6.2351507e-01f,
+    -6.3881022e-01f, -5.3403387e-03f, 4.3528955e-04f,  -6.9642477e-03f,
+    5.8675390e-01f,  -9.8690011e-02f, -1.1098785e+00f, 4.5250601e-01f,
+    9.7602949e-02f,  4.3528955e-04f,  1.4921622e+00f,  9.9850911e-01f,
+    3.6655348e-02f,  -4.2746153e-01f, 9.3349844e-01f,  -1.5393926e-01f,
+    4.3528955e-04f,  -4.3362916e-02f, 1.9002694e-01f,  -2.4391308e-01f,
+    1.1959513e-01f,  -9.4393528e-01f, -3.5541323e-01f, 4.3528955e-04f,
+    -1.6305867e-01f, 2.7544081e+00f,  2.3556391e-02f,  -1.0627011e+00f,
+    8.3287004e-03f,  -1.6898345e-02f, 4.3528955e-04f,  -2.5126570e-01f,
+    -1.1028790e+00f, 1.2480201e-02f,  1.1590999e+00f,  -3.3019397e-01f,
+    -2.7436974e-02f, 4.3528955e-04f,  7.6877773e-01f,  2.1375852e+00f,
+    -5.3492442e-02f, -9.5682347e-01f, 2.5794798e-01f,  7.8800865e-02f,
+    4.3528955e-04f,  -2.1496334e+00f, -1.0704225e+00f, 1.1438736e-01f,
+    2.8073487e-01f,  -8.7501281e-01f, 1.8004082e-02f,  4.3528955e-04f,
+    1.1157215e-01f,  7.9269248e-01f,  3.7419826e-02f,  -6.3435560e-01f,
+    1.2309564e-01f,  5.2916104e-01f,  4.3528955e-04f,  1.6215664e-01f,
+    1.1370910e-01f,  6.4360604e-02f,  -6.2368357e-01f, 8.4098363e-01f,
+    -9.9017851e-02f, 4.3528955e-04f,  -6.8055756e-02f, 2.3591816e-01f,
+    -2.5371104e-02f, -1.3670915e+00f, -4.9924645e-01f, 1.5492143e-01f,
+    4.3528955e-04f,  -4.0576079e-01f, 5.6428093e-01f,  -1.9955214e-02f,
+    -9.1716069e-01f, -4.4390258e-01f, 1.5487632e-01f,  4.3528955e-04f,
+    4.3698698e-01f,  -1.0678458e+00f, 8.5466886e-03f,  6.9053429e-01f,
+    9.1374926e-02f,  -1.9639452e-01f, 4.3528955e-04f,  2.8086762e+00f,
+    2.5153184e-01f,  -4.0938362e-02f, -9.7816929e-02f, 8.8989162e-01f,
+    4.6607042e-03f,  4.3528955e-04f,  1.1914734e-01f,  4.0094848e+00f,
+    1.0656284e-02f,  -9.5877469e-01f, 9.0464726e-02f,  1.7575035e-02f,
+    4.3528955e-04f,  1.6897477e+00f,  7.1507531e-01f,  -5.9396248e-02f,
+    -6.7981321e-01f, 5.3341699e-01f,  8.1921957e-02f,  4.3528955e-04f,
+    -4.5945135e-01f, 1.8109561e+00f,  1.5357164e-01f,  -5.7724774e-01f,
+    -4.5341298e-01f, 1.0999590e-02f,  4.3528955e-04f,  -2.5735629e-01f,
+    -1.6450499e-01f, -3.3048809e-02f, 2.3319890e-01f,  -1.0194401e+00f,
+    1.4819548e-01f,  4.3528955e-04f,  -2.9380193e+00f, 2.9020257e+00f,
+    1.2768960e-01f,  -6.8581039e-01f, -6.0388863e-01f, 6.3929163e-02f,
+    4.3528955e-04f,  -3.3355658e+00f, 3.7097627e-01f,  -1.6426476e-02f,
+    -1.4267203e-01f, -9.3935430e-01f, 2.9711194e-02f,  4.3528955e-04f,
+    -2.2200632e-01f, 4.0952307e-01f,  -8.0037072e-02f, -9.8318177e-01f,
+    -6.0100824e-01f, 1.7267324e-01f,  4.3528955e-04f,  8.2259077e-01f,
+    8.7124079e-01f,  -8.3791822e-02f, -6.2109888e-01f, 7.6965737e-01f,
+    6.0943950e-02f,  4.3528955e-04f,  -2.2446665e-01f, 1.7140871e-01f,
+    7.8605991e-03f,  -8.9853778e-02f, -1.0530010e+00f, -8.7917328e-02f,
+    4.3528955e-04f,  1.2459519e+00f,  1.2814091e+00f,  3.8547529e-04f,
+    -6.3570970e-01f, 7.9840595e-01f,  1.0589287e-01f,  4.3528955e-04f,
+    2.8930590e-01f,  -3.8139060e+00f, -4.2835061e-02f, 9.4835585e-01f,
+    1.2672128e-02f,  1.8978270e-02f,  4.3528955e-04f,  1.8269278e+00f,
+    -2.1155013e-01f, 1.8428129e-01f,  -7.6016873e-02f, 8.4313256e-01f,
+    -1.2577550e-01f, 4.3528955e-04f,  -8.2367474e-01f, 1.3297483e+00f,
+    2.1322951e-01f,  -4.2771319e-01f, -3.7157148e-01f, 8.1101425e-02f,
+    4.3528955e-04f,  5.9127861e-01f,  1.7910275e-01f,  -1.6246950e-02f,
+    2.3466773e-01f,  7.3523319e-01f,  -2.9090303e-01f, 4.3528955e-04f,
+    -3.7655036e+00f, 3.5006323e+00f,  6.3238884e-03f,  -5.5551112e-01f,
+    -6.7227048e-01f, 7.6655988e-03f,  4.3528955e-04f,  5.9508973e-01f,
+    7.2618502e-01f,  -8.8602163e-02f, -4.5080820e-01f, 5.2040845e-01f,
+    6.7065634e-02f,  4.3528955e-04f,  3.2980368e-01f,  -1.7854273e+00f,
+    -2.1650448e-01f, 2.9855502e-01f,  -9.6578516e-02f, -9.8223321e-02f,
+    4.3528955e-04f,  -3.3137244e-01f, -6.8169302e-01f, -1.0712819e-01f,
+    7.6684791e-01f,  2.8122064e-01f,  -1.8704651e-01f, 4.3528955e-04f,
+    -1.7878211e+00f, -1.0538491e+00f, -1.5644399e-02f, 7.9419822e-01f,
+    -4.2358670e-01f, -9.8685756e-02f, 4.3528955e-04f,  -9.7568142e-01f,
+    7.7385145e-01f,  -2.1355547e-01f, -1.9552529e-01f, -7.6208937e-01f,
+    -1.4855327e-01f, 4.3528955e-04f,  -2.2184894e+00f, 1.0024046e+00f,
+    -1.9181224e-02f, -4.0252090e-01f, -8.0438477e-01f, -3.6284115e-02f,
+    4.3528955e-04f,  1.2718947e+00f,  -1.9417124e+00f, -3.3894055e-02f,
+    8.6667842e-01f,  5.7730848e-01f,  9.3426570e-02f,  4.3528955e-04f,
+    -5.6498152e-01f, 7.8492409e-01f,  2.6734818e-02f,  -5.5854064e-01f,
+    -8.0737895e-01f, 7.1064390e-02f,  4.3528955e-04f,  1.2081359e-01f,
+    -1.2480589e+00f, 1.1791831e-01f,  6.9548279e-01f,  3.3834264e-01f,
+    -9.5034026e-02f, 4.3528955e-04f,  2.9568866e-01f,  1.1014072e+00f,
+    6.8822131e-03f,  -9.4739729e-01f, 3.9713380e-01f,  -1.7567205e-01f,
+    4.3528955e-04f,  2.1950048e-01f,  -3.9876034e+00f, 7.0023626e-02f,
+    9.3209529e-01f,  8.2507066e-02f,  2.3696572e-02f,  4.3528955e-04f,
+    1.1599778e+00f,  9.0154648e-01f,  -6.8345033e-02f, -1.0062222e-01f,
+    8.6254150e-01f,  3.0084860e-02f,  4.3528955e-04f,  -5.7001747e-02f,
+    7.5215265e-02f,  1.3424559e-02f,  1.9119906e-01f,  -6.0607195e-01f,
+    6.7939466e-01f,  4.3528955e-04f,  -1.5581040e+00f, -2.8974302e-02f,
+    -7.9841040e-02f, -1.7738071e-01f, -1.0669515e+00f, -2.7056780e-01f,
+    4.3528955e-04f,  7.0702147e-01f,  -3.6933174e+00f, 1.9497527e-02f,
+    8.8557082e-01f,  2.1751013e-01f,  6.3531302e-02f,  4.3528955e-04f,
+    -1.6335356e-01f, -2.9317279e+00f, -1.6834711e-01f, 9.8811316e-01f,
+    -8.1094854e-02f, 3.3062451e-02f,  4.3528955e-04f,  9.0739131e-02f,
+    -5.1758832e-01f, 8.8841178e-02f,  7.2591561e-01f,  -1.0517586e-01f,
+    -8.2685344e-02f, 4.3528955e-04f,  -5.7260650e-01f, -9.0562886e-01f,
+    8.3358377e-02f,  5.5093777e-01f,  -4.1084892e-01f, -4.6392474e-02f,
+    4.3528955e-04f,  1.2737091e+00f,  2.7629447e-01f,  3.7284549e-02f,
+    6.8509805e-01f,  7.5068486e-01f,  -1.0516246e-01f, 4.3528955e-04f,
+    -2.4347022e+00f, -1.7949612e+00f, -1.8526115e-02f, 6.7247599e-01f,
+    -6.8816906e-01f, 1.7638974e-02f,  4.3528955e-04f,  -1.5200208e+00f,
+    1.5637147e+00f,  1.0973434e-01f,  -6.6884202e-01f, -7.7969164e-01f,
+    5.0851673e-02f,  4.3528955e-04f,  5.1161200e-01f,  3.8622718e-02f,
+    6.6024130e-03f,  -1.5395860e-01f, 9.1854596e-01f,  -2.5614029e-01f,
+    4.3528955e-04f,  -3.7677197e+00f, 8.4657282e-01f,  -1.5020480e-02f,
+    -2.0146538e-01f, -8.4772021e-01f, -2.3069715e-03f, 4.3528955e-04f,
+    5.9362096e-01f,  -1.5864100e+00f, -9.1443270e-02f, 7.6800126e-01f,
+    4.4464819e-02f,  1.1317293e-01f,  4.3528955e-04f,  7.3869061e-01f,
+    -6.2976104e-01f, 1.1063350e-02f,  1.1470231e+00f,  3.0875951e-01f,
+    9.1939501e-02f,  4.3528955e-04f,  1.6043411e+00f,  1.9707416e+00f,
+    -4.2025648e-02f, -7.6199579e-01f, 7.5675797e-01f,  5.0798316e-02f,
+    4.3528955e-04f,  -6.0735106e-01f, 1.6198444e-01f,  -7.4657939e-02f,
+    -9.7073400e-01f, -5.9605372e-01f, -3.0286152e-02f, 4.3528955e-04f,
+    -4.4805044e-01f, -3.6328363e-01f, 5.0451230e-02f,  6.9956982e-01f,
+    -4.7329658e-01f, -3.6083928e-01f, 4.3528955e-04f,  -5.5008179e-01f,
+    4.6926290e-01f,  -2.5039613e-02f, -5.0417352e-01f, -7.1628958e-01f,
+    -1.2449065e-01f, 4.3528955e-04f,  1.2112204e+00f,  2.5448508e+00f,
+    -4.8774365e-02f, -9.1844630e-01f, 4.0397832e-01f,  -4.4887317e-03f,
+    4.3528955e-04f,  -2.9167037e+00f, 2.0292599e+00f,  -1.0764054e-01f,
+    -4.6339211e-01f, -8.8704228e-01f, -1.2210441e-02f, 4.3528955e-04f,
+    -3.0024853e-01f, -2.6243842e+00f, -2.7856708e-02f, 9.1413563e-01f,
+    -2.5428391e-01f, 5.8676489e-02f,  4.3528955e-04f,  -6.9345802e-01f,
+    1.1563340e+00f,  -2.7709706e-02f, -5.8406997e-01f, -5.2306485e-01f,
+    1.0372675e-01f,  4.3528955e-04f,  -2.3971882e+00f, 2.0427179e+00f,
+    1.3696840e-01f,  -7.2759467e-01f, -6.1194903e-01f, -1.0065847e-02f,
+    4.3528955e-04f,  2.0362825e+00f,  7.3831427e-01f,  -4.4516232e-02f,
+    -1.6300862e-01f, 8.3612442e-01f,  -4.7003511e-02f, 4.3528955e-04f,
+    -2.5562041e+00f, 2.5596871e+00f,  -3.0471930e-01f, -6.2111938e-01f,
+    -6.7165303e-01f, 7.2957994e-03f,  4.3528955e-04f,  -8.6126786e-01f,
+    2.0725191e+00f,  4.4238310e-02f,  -7.3105526e-01f, -5.9656131e-01f,
+    -1.7619677e-02f, 4.3528955e-04f,  2.2616807e-01f,  1.5636193e+00f,
+    1.3607819e-01f,  -8.9862406e-01f, 9.4763957e-02f,  2.1043155e-02f,
+    4.3528955e-04f,  -1.2514881e+00f, 9.3834186e-01f,  2.3435390e-02f,
+    -4.8734823e-01f, -1.1040633e+00f, 2.3340965e-02f,  4.3528955e-04f,
+    5.1974452e-01f,  -1.7965607e-01f, -1.3495775e-01f, 9.1229510e-01f,
+    5.1830798e-01f,  -6.2726423e-02f, 4.3528955e-04f,  -1.0466781e+00f,
+    -3.1497540e+00f, 4.2369030e-03f,  8.3298695e-01f,  -2.3912063e-01f,
+    1.3725986e-01f,  4.3528955e-04f,  1.4996642e+00f,  -6.3317561e-01f,
+    -1.3875329e-01f, 6.5494668e-01f,  2.8372374e-01f,  -6.4453498e-02f,
+    4.3528955e-04f,  6.7979348e-01f,  -8.6266232e-01f, -1.8181077e-01f,
+    4.8073509e-01f,  4.2268249e-01f,  5.7765439e-02f,  4.3528955e-04f,
+    1.0127212e+00f,  2.8691180e+00f,  1.4520818e-01f,  -8.9089566e-01f,
+    3.3802062e-01f,  2.9917264e-02f,  4.3528955e-04f,  1.1285409e+00f,
+    -2.0512657e+00f, -7.2895803e-02f, 7.7414680e-01f,  5.8141363e-01f,
+    -3.2790303e-02f, 4.3528955e-04f,  -5.4898793e-01f, -1.0925920e+00f,
+    1.4790798e-02f,  5.8497632e-01f,  -4.9906954e-01f, -1.3408850e-01f,
+    4.3528955e-04f,  1.8547895e+00f,  7.5891048e-01f,  -1.1300622e-01f,
+    -1.9531547e-01f, 8.4286511e-01f,  -6.0534757e-02f, 4.3528955e-04f,
+    -1.5619370e-01f, 5.0376248e-01f,  -1.5048762e-01f, -5.9292632e-01f,
+    2.7502129e-02f,  4.5008907e-01f,  4.3528955e-04f,  -2.4245486e+00f,
+    3.0552418e+00f,  -9.0995952e-02f, -7.4486291e-01f, -5.9469736e-01f,
+    5.7195913e-02f,  4.3528955e-04f,  -2.1045104e-01f, 3.8308334e-02f,
+    -2.5949482e-02f, -4.5150450e-01f, -1.2878006e+00f, -1.8114355e-01f,
+    4.3528955e-04f,  -8.9615721e-01f, -7.9790503e-01f, -5.7245653e-02f,
+    2.7550218e-01f,  -7.7383637e-01f, -2.6006527e-02f, 4.3528955e-04f,
+    -1.2192070e+00f, 4.3795848e-01f,  8.8043459e-02f,  -3.9574137e-01f,
+    -7.3006749e-01f, -2.3289280e-01f, 4.3528955e-04f,  5.7600814e-01f,
+    5.7239056e-01f,  1.1158274e-02f,  -6.7376745e-01f, 8.0945325e-01f,
+    4.3004999e-01f,  4.3528955e-04f,  8.4171593e-01f,  4.5059452e+00f,
+    1.8946409e-02f,  -8.6993152e-01f, 1.0886719e-01f,  -2.6487883e-03f,
+    4.3528955e-04f,  -1.2104394e+00f, -1.0746313e+00f, 8.5864976e-02f,
+    3.8149878e-01f,  -7.9153347e-01f, -8.9847140e-02f, 4.3528955e-04f,
+    7.6207250e-01f,  -2.4612079e+00f, 5.5308964e-02f,  8.5729891e-01f,
+    3.5495734e-01f,  2.8557098e-02f,  4.3528955e-04f,  -1.2764996e+00f,
+    1.2638018e-01f,  4.7172405e-02f,  1.9839977e-01f,  -9.3802983e-01f,
+    1.2576167e-01f,  4.3528955e-04f,  -9.8363101e-01f, 3.3320966e+00f,
+    -9.0550825e-02f, -8.5163009e-01f, -2.5881630e-01f, 1.0692760e-01f,
+    4.3528955e-04f,  2.0959687e-01f,  5.4823637e-01f,  -8.5499078e-02f,
+    -1.1279593e+00f, 3.4983492e-01f,  -3.0262256e-01f, 4.3528955e-04f,
+    9.9516106e-01f,  1.9588314e+00f,  4.8181053e-02f,  -9.0679944e-01f,
+    4.2551869e-01f,  3.8964249e-02f,  4.3528955e-04f,  3.7819797e-01f,
+    -1.5989514e-01f, -5.9645571e-02f, 9.2092061e-01f,  5.2631885e-01f,
+    -2.0210028e-01f, 4.3528955e-04f,  2.5110004e+00f,  -4.1302282e-01f,
+    6.7394197e-02f,  3.9537970e-02f,  8.7502909e-01f,  6.5297350e-02f,
+    4.3528955e-04f,  1.5388039e+00f,  3.4164953e+00f,  9.3482010e-02f,
+    -7.8816193e-01f, 4.3080750e-01f,  5.0545413e-02f,  4.3528955e-04f,
+    3.7057083e+00f,  -1.0462193e-01f, -8.9247450e-02f, 3.0612472e-02f,
+    8.9961845e-01f,  -1.4465281e-02f, 4.3528955e-04f,  -1.0818894e+00f,
+    -1.1630299e+00f, 1.4436081e-01f,  8.1967473e-01f,  -1.9441366e-01f,
+    7.7438325e-02f,  4.3528955e-04f,  2.3743379e+00f,  -1.7002003e+00f,
+    -1.0236253e-01f, 5.5478513e-01f,  8.5615385e-01f,  -8.9464933e-02f,
+    4.3528955e-04f,  3.7671420e-01f,  9.0493518e-01f,  1.1918984e-01f,
+    -7.4727112e-01f, -2.6686406e-02f, -1.9342436e-01f, 4.3528955e-04f,
+    1.9037235e+00f,  1.3729904e+00f,  -4.6921659e-02f, -4.2820409e-01f,
+    8.9062947e-01f,  1.2489375e-01f,  4.3528955e-04f,  -1.3872921e-01f,
+    1.4897095e+00f,  9.2962429e-02f,  -8.0646181e-01f, 1.6383314e-01f,
+    8.0240101e-02f,  4.3528955e-04f,  1.3954884e+00f,  1.2202871e+00f,
+    -1.8442497e-02f, -7.6338565e-01f, 8.8603896e-01f,  -2.3846455e-02f,
+    4.3528955e-04f,  1.7231604e+00f,  -1.1676563e+00f, 4.1976538e-02f,
+    5.5980057e-01f,  8.3625561e-01f,  9.6121132e-03f,  4.3528955e-04f,
+    6.7529219e-01f,  2.5274205e+00f,  2.2876974e-02f,  -9.4442844e-01f,
+    3.1208906e-01f,  3.5907201e-02f,  4.3528955e-04f,  3.6658883e-01f,
+    1.6318053e+00f,  1.4524971e-01f,  -9.0861118e-01f, 7.3152386e-02f,
+    -1.5498987e-01f, 4.3528955e-04f,  -1.9651648e+00f, -1.0190165e+00f,
+    -1.8812520e-02f, 5.4479897e-01f,  -7.4715436e-01f, -6.8588316e-02f,
+    4.3528955e-04f,  6.9712752e-01f,  4.2073470e-01f,  -4.8981700e-02f,
+    -1.0108217e+00f, 4.0945417e-01f,  -8.6281255e-02f, 4.3528955e-04f,
+    -2.8558317e-01f, 1.5860125e-01f,  1.6407922e-02f,  1.9218779e-01f,
+    -8.0845189e-01f, 1.0272555e-01f,  4.3528955e-04f,  -2.6523151e+00f,
+    -6.0006446e-01f, 9.7568378e-02f,  2.8018847e-01f,  -9.3188751e-01f,
+    -3.6490981e-02f, 4.3528955e-04f,  1.0336689e+00f,  -5.6825382e-01f,
+    -1.2851429e-01f, 9.3970770e-01f,  7.4681407e-01f,  -1.5457554e-01f,
+    4.3528955e-04f,  1.3597071e+00f,  -1.4079829e+00f, -2.7288316e-02f,
+    6.6944152e-01f,  6.0485977e-01f,  -5.7927025e-03f, 4.3528955e-04f,
+    -5.8578831e-01f, -1.2727202e+00f, -2.5643412e-02f, 7.8866029e-01f,
+    -1.4117014e-01f, 2.3036511e-01f,  4.3528955e-04f,  -1.7312343e+00f,
+    3.3680038e+00f,  4.4771219e-03f,  -8.1990951e-01f, -4.2098597e-01f,
+    -8.5249305e-02f, 4.3528955e-04f,  -1.0405728e+00f, -8.5226637e-01f,
+    -1.0848474e-01f, 1.1366485e-01f,  -9.6413314e-01f, 1.9264795e-02f,
+    4.3528955e-04f,  -2.7307552e-01f, 4.7384363e-01f,  -2.1503374e-02f,
+    -9.7624016e-01f, -9.4466591e-01f, -1.6574259e-01f, 4.3528955e-04f,
+    1.1287458e+00f,  -7.4803412e-02f, -1.4842857e-02f, 3.8621345e-01f,
+    9.6026760e-01f,  -7.7019036e-03f, 4.3528955e-04f,  8.8729101e-01f,
+    3.8754907e+00f,  7.7574313e-02f,  -9.5098931e-01f, 1.9620788e-01f,
+    1.1897304e-02f,  4.3528955e-04f,  -1.5685564e+00f, 8.8353086e-01f,
+    9.8379202e-02f,  -2.0420526e-01f, -8.1917644e-01f, 2.3540005e-02f,
+    4.3528955e-04f,  -5.3475881e-01f, -9.8349386e-01f, 6.6125005e-02f,
+    5.2085739e-01f,  -5.8555913e-01f, -4.4677358e-02f, 4.3528955e-04f,
+    2.3079140e+00f,  -5.1909924e-01f, 1.1040982e-01f,  2.0891288e-01f,
+    9.1342264e-01f,  -4.9720295e-02f, 4.3528955e-04f,  -2.0523021e-01f,
+    -2.5413078e-01f, 1.6585601e-02f,  8.9484131e-01f,  -4.2910656e-01f,
+    1.3762525e-01f,  4.3528955e-04f,  2.7051359e-01f,  6.8913192e-02f,
+    3.6018617e-02f,  -1.2088288e-01f, 1.1989725e+00f,  1.2030299e-01f,
+    4.3528955e-04f,  -5.4640657e-01f, -1.6111522e+00f, 1.6444338e-02f,
+    7.4032789e-01f,  -6.1348403e-01f, 1.8584894e-02f,  4.3528955e-04f,
+    4.1983490e+00f,  -1.2601284e+00f, -3.5975501e-03f, 2.9173368e-01f,
+    9.4391131e-01f,  4.1886199e-02f,  4.3528955e-04f,  -3.9821665e+00f,
+    1.9979814e+00f,  -6.9255069e-02f, -4.1014221e-01f, -8.2415241e-01f,
+    -6.8018422e-02f, 4.3528955e-04f,  3.5476141e+00f,  -1.2111750e+00f,
+    -5.8824390e-02f, 3.0536789e-01f,  9.2630279e-01f,  -2.9742632e-03f,
+    4.3528955e-04f,  -1.1615095e+00f, -2.3852022e-01f, -2.8973524e-02f,
+    4.9668172e-01f,  -8.7224269e-01f, 7.1406364e-02f,  4.3528955e-04f,
+    1.5332398e-01f,  1.3596921e+00f,  1.3258819e-01f,  -1.0093648e+00f,
+    9.3414992e-02f,  -4.3266524e-02f, 4.3528955e-04f,  -1.3535298e+00f,
+    -7.0600986e-01f, -5.1231913e-02f, 2.8028187e-01f,  -9.0465486e-01f,
+    5.8381137e-02f,  4.3528955e-04f,  -4.9374047e-01f, -1.0416018e+00f,
+    -4.6476625e-02f, 7.6618212e-01f,  -5.5441868e-01f, 5.6809504e-02f,
+    4.3528955e-04f,  -4.7189376e-01f, 3.8589547e+00f,  1.2832280e-02f,
+    -9.3225902e-01f, -2.4875471e-01f, 2.0174583e-02f,  4.3528955e-04f,
+    5.5079544e-01f,  -1.8957899e+00f, -4.2841781e-02f, 7.2026002e-01f,
+    7.5219327e-01f,  6.9695532e-02f,  4.3528955e-04f,  -3.3094582e-01f,
+    1.2722793e-01f,  -6.6396751e-02f, -3.5630241e-01f, -8.7708467e-01f,
+    5.8051753e-01f,  4.3528955e-04f,  -1.0450090e+00f, -1.5599365e+00f,
+    2.3441900e-02f,  8.5639393e-01f,  -4.4026792e-01f, -5.1518515e-02f,
+    4.3528955e-04f,  -4.2583503e-02f, 1.9797888e-01f,  1.6281050e-02f,
+    -4.6430993e-01f, 9.3911640e-02f,  1.2131768e-01f,  4.3528955e-04f,
+    -7.2316462e-01f, -1.9096277e+00f, 1.1448264e-02f,  9.4615114e-01f,
+    -4.6997347e-01f, 6.1756140e-03f,  4.3528955e-04f,  1.2396161e-01f,
+    4.7320187e-01f,  -1.3348117e-01f, -8.8700473e-01f, 7.1571791e-01f,
+    -5.4665333e-01f, 4.3528955e-04f,  2.6467159e+00f,  2.8925023e+00f,
+    -2.5051776e-02f, -8.2216859e-01f, 5.7632196e-01f,  2.8916688e-03f,
+    4.3528955e-04f,  5.4453725e-01f,  3.1491206e+00f,  -3.5153538e-02f,
+    -9.8076981e-01f, 1.3098146e-01f,  6.2335346e-02f,  4.3528955e-04f,
+    -2.3856969e+00f, -2.6147289e+00f, 6.0943261e-02f,  6.9825500e-01f,
+    -6.5027004e-01f, 6.2381513e-02f,  4.3528955e-04f,  -1.6453477e+00f,
+    2.1736367e+00f,  9.1570474e-02f,  -8.2088917e-01f, -4.9630114e-01f,
+    -1.7054358e-01f, 4.3528955e-04f,  -2.9096308e-01f, 1.4960054e+00f,
+    4.4649333e-02f,  -9.4812638e-01f, -2.2034323e-02f, 3.0471999e-02f,
+    4.3528955e-04f,  2.5705126e-01f,  -1.7059978e+00f, -5.0124573e-03f,
+    1.0575900e+00f,  4.2924985e-02f,  -6.2346641e-02f, 4.3528955e-04f,
+    -3.2236746e-01f, 1.2268270e+00f,  1.0807484e-01f,  -1.2428317e+00f,
+    -1.2133651e-01f, 1.8217901e-03f,  4.3528955e-04f,  -7.5437051e-01f,
+    2.4948754e+00f,  -3.2978155e-02f, -6.6221327e-01f, -3.4020078e-01f,
+    4.7263868e-02f,  4.3528955e-04f,  9.1396177e-01f,  -2.3598522e-02f,
+    3.3893380e-02f,  4.9727133e-01f,  5.8316690e-01f,  -3.8547286e-01f,
+    4.3528955e-04f,  -4.5447782e-01f, 3.8704854e-01f,  1.5221456e-01f,
+    -7.3568207e-01f, -7.9415363e-01f, 9.0918615e-02f,  4.3528955e-04f,
+    -1.1942922e+00f, -3.7777569e+00f, 8.9142486e-02f,  8.2024539e-01f,
+    -2.5728244e-01f, -4.9606271e-02f, 4.3528955e-04f,  -1.8145802e+00f,
+    -2.1623027e+00f, -1.7036948e-01f, 6.5701401e-01f,  -7.4781722e-01f,
+    6.3691260e-03f,  4.3528955e-04f,  -1.3579884e+00f, -1.2774499e-01f,
+    1.6477738e-01f,  -1.8205714e-01f, -6.6548419e-01f, 1.4582828e-01f,
+    4.3528955e-04f,  7.6307982e-01f,  2.3985915e+00f,  -1.8217307e-01f,
+    -6.2741482e-01f, 5.9460855e-01f,  -3.7461333e-02f, 4.3528955e-04f,
+    2.7248065e+00f,  -9.7323701e-02f, 9.4873714e-04f,  -8.0090165e-03f,
+    1.0248001e+00f,  4.7593981e-02f,  4.3528955e-04f,  4.0494514e-01f,
+    -1.7076757e+00f, 6.0300831e-02f,  6.5458477e-01f,  -3.0174097e-02f,
+    3.0299872e-01f,  4.3528955e-04f,  5.5512011e-01f,  -1.5427257e+00f,
+    -1.3540138e-01f, 5.0493968e-01f,  -2.2801584e-02f, 4.1451145e-02f,
+    4.3528955e-04f,  -2.6594165e-01f, -2.2374497e-01f, -1.6572826e-02f,
+    6.9475102e-01f,  -6.3849425e-01f, 1.9156420e-01f,  4.3528955e-04f,
+    -1.9018272e-01f, 1.0402828e-01f,  1.0295907e-01f,  -5.2856040e-01f,
+    -1.3460129e+00f, -2.1459198e-02f, 4.3528955e-04f,  8.7110943e-01f,
+    2.6789827e+00f,  6.2334035e-02f,  -1.0540189e+00f, 3.6506024e-01f,
+    -7.0551559e-02f, 4.3528955e-04f,  -1.3534036e+00f, 9.8344284e-01f,
+    -9.5344849e-02f, -6.3147657e-03f, -6.6060781e-01f, -2.7683666e-02f,
+    4.3528955e-04f,  -1.9527997e+00f, -9.0062207e-01f, -1.1916086e-01f,
+    2.7223077e-01f,  -6.8923974e-01f, -1.0182928e-01f, 4.3528955e-04f,
+    1.3325390e+00f,  5.1013416e-01f,  -7.7212118e-02f, -5.1809126e-01f,
+    8.3726990e-01f,  -2.5215286e-01f, 4.3528955e-04f,  1.3690144e-03f,
+    2.3803756e-01f,  1.1822183e-01f,  -1.1467549e+00f, -2.9533285e-01f,
+    -9.4087422e-01f, 4.3528955e-04f,  5.0958484e-01f,  2.6217079e+00f,
+    -1.7888878e-01f, -9.5177180e-01f, 1.2383390e-01f,  -1.1383964e-01f,
+    4.3528955e-04f,  -2.0679591e+00f, 5.1125401e-01f,  4.7355525e-02f,
+    -1.8207365e-01f, -9.0480518e-01f, -7.7205896e-02f, 4.3528955e-04f,
+    2.5221562e-01f,  3.4834096e+00f,  -1.5396927e-02f, -9.3149149e-01f,
+    -7.8072228e-02f, 6.2066786e-02f,  4.3528955e-04f,  -1.0056190e+00f,
+    -3.0093341e+00f, 6.9895267e-02f,  8.6499333e-01f,  -3.6967728e-01f,
+    4.5798913e-02f,  4.3528955e-04f,  -6.6400284e-01f, 1.0649313e+00f,
+    -6.0387310e-02f, -8.7511110e-01f, -5.5720150e-01f, 1.9067825e-01f,
+    4.3528955e-04f,  -2.1069946e+00f, -8.6024761e-02f, -1.5838312e-03f,
+    3.1795013e-01f,  -9.9185598e-01f, -1.6532454e-03f, 4.3528955e-04f,
+    -1.1820407e+00f, 7.5370824e-01f,  -1.4696887e-01f, -1.1333437e-01f,
+    -8.2410812e-01f, 1.1523645e-01f,  4.3528955e-04f,  3.6485159e+00f,
+    4.6599621e-01f,  4.9893394e-02f,  -1.2093516e-01f, 9.6110195e-01f,
+    -6.0557786e-02f, 4.3528955e-04f,  2.9180310e+00f,  -5.9231848e-01f,
+    -1.7903703e-01f, 1.8331002e-01f,  9.1739738e-01f,  2.2560727e-02f,
+    4.3528955e-04f,  2.9935882e+00f,  -6.7790806e-02f, 6.5868042e-02f,
+    1.0487460e-01f,  1.0445405e+00f,  -6.4174188e-03f, 4.3528955e-04f,
+    -6.4532429e-01f, -6.8605250e-01f, -1.4488655e-01f, 1.1493319e-01f,
+    -5.4606605e-01f, -2.7601516e-01f, 4.3528955e-04f,  -2.0982425e+00f,
+    1.7860962e+00f,  -2.8782960e-02f, -7.9984480e-01f, -7.5186372e-01f,
+    2.0369323e-02f,  4.3528955e-04f,  -4.4549170e-01f, 1.6178877e+00f,
+    -3.8676765e-02f, -1.0438180e+00f, -2.7898571e-01f, 1.0418458e-02f,
+    4.3528955e-04f,  -1.7700337e+00f, -1.7657231e+00f, -7.2059020e-02f,
+    6.7140365e-01f,  -3.8700148e-01f, 1.3125168e-02f,  4.3528955e-04f,
+    -4.5103803e-01f, -2.0279837e+00f, 5.8646653e-02f,  5.7469481e-01f,
+    -6.4571321e-01f, -1.0075834e-02f, 4.3528955e-04f,  4.4553784e-01f,
+    2.4988653e-01f,  -7.2691694e-02f, -7.0793366e-01f, 1.2757463e+00f,
+    -4.7956280e-02f, 4.3528955e-04f,  1.6271150e-01f,  -3.6476851e-01f,
+    1.8391132e-03f,  8.3276445e-01f,  5.1784122e-01f,  2.1124071e-01f,
+    4.3528955e-04f,  -4.6798834e-01f, -7.5996757e-01f, -3.2432474e-02f,
+    7.8802240e-01f,  -5.9308678e-01f, -1.4162706e-01f, 4.3528955e-04f,
+    5.4028773e-01f,  5.3296846e-01f,  -8.3538912e-02f, -3.7790295e-01f,
+    7.3052102e-01f,  -9.4607435e-02f, 4.3528955e-04f,  -6.8664205e-01f,
+    1.7994770e+00f,  -6.0592983e-02f, -9.3366623e-01f, -4.1699055e-01f,
+    8.2532942e-02f,  4.3528955e-04f,  -2.7477753e+00f, -9.4542521e-01f,
+    1.3412552e-01f,  2.9221523e-01f,  -9.2532194e-01f, -6.8571437e-03f,
+    4.3528955e-04f,  3.9611607e+00f,  -1.6998433e+00f, -3.3285711e-02f,
+    3.6287051e-01f,  8.2579440e-01f,  1.1172022e-01f,  4.3528955e-04f,
+    -3.5593696e+00f, 5.2940363e-01f,  1.4374801e-03f,  -1.7416896e-01f,
+    -9.7423416e-01f, 4.8327565e-02f,  4.3528955e-04f,  -1.6343122e+00f,
+    -4.0770593e+00f, -9.7174659e-02f, 8.0503315e-01f,  -3.1813151e-01f,
+    2.9277258e-02f,  4.3528955e-04f,  1.2493931e-01f,  1.2530937e+00f,
+    1.2892409e-01f,  -5.7238287e-01f, 5.6570396e-02f,  1.6242205e-01f,
+    4.3528955e-04f,  1.3675431e+00f,  1.1522626e+00f,  4.5292370e-02f,
+    -4.9448878e-01f, 7.3247099e-01f,  5.7881400e-02f,  4.3528955e-04f,
+    -8.7553388e-01f, -9.9820405e-01f, -8.8758171e-02f, 4.5438942e-01f,
+    -5.0031185e-01f, 2.6445565e-01f,  4.3528955e-04f,  -1.3285303e-01f,
+    -1.4549898e+00f, -6.2589854e-02f, 8.9190900e-01f,  -8.4938258e-02f,
+    -7.6705620e-02f, 4.3528955e-04f,  3.8288185e-01f,  4.8173326e-01f,
+    -1.1687278e-01f, -6.8072104e-01f, 4.0710297e-01f,  -1.2324533e-02f,
+    4.3528955e-04f,  -3.8460371e-01f, 1.4502571e+00f,  -6.3802418e-04f,
+    -1.1821383e+00f, -4.7251841e-01f, -3.5038650e-02f, 4.3528955e-04f,
+    -8.0586421e-01f, -2.7991285e+00f, 1.1072625e-01f,  8.7624949e-01f,
+    -2.5870457e-01f, -1.1539051e-02f, 4.3528955e-04f,  -1.4186472e+00f,
+    -1.4843867e+00f, -1.0522312e-02f, 7.1792740e-01f,  -7.6803923e-01f,
+    9.3310356e-02f,  4.3528955e-04f,  1.6886408e+00f,  -1.7995821e-01f,
+    8.0749907e-02f,  -2.3811387e-01f, 8.3095574e-01f,  -6.1882090e-02f,
+    4.3528955e-04f,  2.0625069e+00f,  -1.0948033e+00f, -1.2192495e-02f,
+    3.1321755e-01f,  5.2816421e-01f,  -7.1500465e-02f, 4.3528955e-04f,
+    -6.1242390e-01f, -8.7926608e-01f, 1.2543145e-01f,  8.4517622e-01f,
+    -5.7011390e-01f, 2.1984421e-01f,  4.3528955e-04f,  -7.5987798e-01f,
+    1.3912635e+00f,  -2.0182172e-02f, -7.9840899e-01f, -7.7869654e-01f,
+    1.4088672e-02f,  4.3528955e-04f,  -3.9298868e-01f, -2.8862453e-01f,
+    -8.1597745e-02f, 5.2318060e-01f,  -1.1571109e+00f, -1.8697374e-01f,
+    4.3528955e-04f,  4.7451174e-01f,  -1.1179104e-02f, 3.7253283e-02f,
+    3.2569370e-01f,  1.2251990e+00f,  6.5762773e-02f,  4.3528955e-04f,
+    1.0792337e-02f,  7.8594178e-02f,  -2.6993725e-02f, -2.0019929e-01f,
+    -5.6868637e-01f, -1.9563165e-01f, 4.3528955e-04f,  -3.8857719e-01f,
+    1.9374442e+00f,  -1.8273048e-01f, -9.3475777e-01f, -4.6683502e-01f,
+    1.1114738e-01f,  4.3528955e-04f,  1.2963934e+00f,  -6.7159343e-01f,
+    -1.3374300e-01f, 5.0010496e-01f,  3.3541355e-01f,  -1.0686360e-01f,
+    4.3528955e-04f,  9.9916643e-01f,  -1.1889771e+00f, -1.0282318e-01f,
+    4.4557598e-01f,  5.5142176e-01f,  -8.8094465e-02f, 4.3528955e-04f,
+    -1.6356015e-01f, -8.0835998e-01f, 3.9010193e-02f,  6.2061238e-01f,
+    -4.8144999e-01f, -5.1244486e-02f, 4.3528955e-04f,  6.8447632e-01f,
+    9.2427576e-01f,  4.6838801e-02f,  -4.9955562e-01f, 7.2605830e-01f,
+    5.7618115e-02f,  4.3528955e-04f,  2.2405025e-01f,  -1.3472018e+00f,
+    1.5691324e-01f,  4.8615828e-01f,  2.5671595e-01f,  -1.4230360e-01f,
+    4.3528955e-04f,  1.3670226e+00f,  -4.3759456e+00f, -8.9703046e-02f,
+    7.7314514e-01f,  3.5450846e-01f,  -1.8391579e-02f, 4.3528955e-04f,
+    -1.2941103e+00f, 1.2218703e-01f,  3.2809410e-02f,  -2.0816748e-01f,
+    -6.7822468e-01f, -1.8481281e-01f, 4.3528955e-04f,  -2.4493298e-01f,
+    2.0341442e+00f,  6.3670613e-02f,  -7.4761653e-01f, 8.3838478e-02f,
+    4.1290127e-02f,  4.3528955e-04f,  -1.4132887e-01f, 1.3877538e+00f,
+    4.4341624e-02f,  -7.6937199e-01f, 1.0638619e-02f,  3.6105726e-02f,
+    4.3528955e-04f,  2.0952966e+00f,  -2.8692162e-01f, 1.1670630e-01f,
+    1.8731152e-01f,  1.0991420e+00f,  6.1124761e-02f,  4.3528955e-04f,
+    1.6503605e+00f,  5.4014015e-01f,  -8.2514189e-02f, -3.4011504e-01f,
+    9.5166874e-01f,  -5.5066114e-03f, 4.3528955e-04f,  -1.5648913e-01f,
+    -2.4208955e-01f, 2.2790931e-01f,  4.7919461e-01f,  -4.9989387e-01f,
+    7.7578805e-02f,  4.3528955e-04f,  3.8997129e-01f,  5.9603822e-01f,
+    1.6656693e-02f,  -1.0930487e+00f, 3.3865607e-01f,  -1.6377477e-01f,
+    4.3528955e-04f,  -2.2519155e+00f, 1.8109068e+00f,  6.0729474e-02f,
+    -5.8358651e-01f, -5.7778323e-01f, -3.0137261e-03f, 4.3528955e-04f,
+    1.5509482e-01f,  8.7820691e-01f,  2.5316522e-01f,  -7.1079797e-01f,
+    1.2084845e-01f,  2.2468922e-01f,  4.3528955e-04f,  -1.7193223e+00f,
+    9.3528844e-02f,  2.7771333e-01f,  -5.9042636e-02f, -9.4178385e-01f,
+    7.7764288e-02f,  4.3528955e-04f,  -3.4292325e-01f, -1.2804180e+00f,
+    4.5774568e-02f,  6.4114916e-01f,  -1.7751029e-02f, 2.0540750e-01f,
+    4.3528955e-04f,  -2.4732573e+00f, 4.2800623e-01f,  -2.2071728e-01f,
+    -2.7107227e-01f, -8.3930904e-01f, -2.2108711e-02f, 4.3528955e-04f,
+    -1.8878070e+00f, -1.5216388e+00f, 9.2556905e-03f,  5.5208969e-01f,
+    -8.1766576e-01f, 4.7230836e-02f,  4.3528955e-04f,  2.0385439e+00f,
+    1.0357767e+00f,  -1.1173534e-01f, -2.3991930e-01f, 1.0468161e+00f,
+    -4.9607392e-02f, 4.3528955e-04f,  -2.2448735e+00f, 1.4612150e+00f,
+    -4.5607056e-02f, -3.6662754e-01f, -6.6416806e-01f, -6.0418028e-02f,
+    4.3528955e-04f,  4.3112999e-01f,  -9.3915299e-02f, -3.4610718e-02f,
+    7.6084805e-01f,  5.8051246e-01f,  -1.2327053e-01f, 4.3528955e-04f,
+    -7.0689857e-02f, 1.3491998e+00f,  -1.3018163e-01f, -6.6273326e-01f,
+    -2.3712924e-02f, 2.4565625e-01f,  4.3528955e-04f,  1.9162495e+00f,
+    -8.7369758e-01f, 5.5904616e-02f,  1.9205941e-01f,  1.1560354e+00f,
+    6.7258276e-02f,  4.3528955e-04f,  2.9890555e-01f,  9.7531840e-02f,
+    -8.7200277e-02f, 3.2498977e-01f,  9.1155422e-01f,  5.6371200e-01f,
+    4.3528955e-04f,  -8.6528158e-01f, -6.9603741e-01f, -1.4524853e-01f,
+    8.6132050e-01f,  -2.7327960e-02f, -2.9232392e-01f, 4.3528955e-04f,
+    -5.6015968e-01f, -4.1615945e-01f, -6.9669168e-04f, -2.1004122e-02f,
+    -1.0432649e+00f, 9.1503166e-02f,  4.3528955e-04f,  1.0157115e+00f,
+    1.9242755e-01f,  -2.3935972e-02f, -6.2428232e-02f, 1.4072335e+00f,
+    -1.6973090e-01f, 4.3528955e-04f,  -6.0287219e-01f, -1.9685695e+00f,
+    2.4660975e-02f,  7.5017011e-01f,  -3.2379976e-01f, 1.7308933e-01f,
+    4.3528955e-04f,  -1.6159343e+00f, 1.7992778e+00f,  7.1512192e-02f,
+    -7.3574579e-01f, -5.3867769e-01f, -3.7051849e-02f, 4.3528955e-04f,
+    3.0524909e+00f,  -2.6691272e+00f, -3.6431113e-03f, 5.6007671e-01f,
+    7.8476959e-01f,  2.6392115e-02f,  4.3528955e-04f,  2.3750465e+00f,
+    -1.6454605e+00f, 2.0899134e-02f,  6.6186678e-01f,  7.6208746e-01f,
+    -6.6577658e-02f, 4.3528955e-04f,  -6.0734844e-01f, -5.1653833e+00f,
+    1.4422098e-02f,  8.5125679e-01f,  -1.2111279e-01f, -1.2907423e-02f,
+    4.3528955e-04f,  -4.1808081e+00f, 1.4798176e-01f,  -5.1333621e-02f,
+    1.9679084e-02f,  -9.4517273e-01f, -1.9125776e-02f, 4.3528955e-04f,
+    3.3448637e-01f,  3.0092809e-02f,  4.0015150e-02f,  2.4407066e-01f,
+    6.8381166e-01f,  -2.1186674e-01f, 4.3528955e-04f,  7.8013420e-01f,
+    8.2585865e-01f,  -2.2564691e-02f, -3.6610603e-01f, 9.7480893e-01f,
+    -2.9952146e-02f, 4.3528955e-04f,  -9.2882639e-01f, -3.1231135e-01f,
+    5.9644815e-02f,  4.6298921e-01f,  -7.5595623e-01f, -2.9574696e-02f,
+    4.3528955e-04f,  -1.0230860e+00f, -2.7598971e-01f, -6.9766805e-02f,
+    2.5314578e-01f,  -9.7938597e-01f, -3.7754945e-02f, 4.3528955e-04f,
+    -1.1349750e+00f, 1.4884578e+00f,  -1.3225291e-02f, -7.5129330e-01f,
+    -4.4310510e-01f, 1.0445925e-01f,  4.3528955e-04f,  -6.8604094e-01f,
+    1.4765683e-01f,  5.0536733e-02f,  -2.8366095e-01f, -9.6699065e-01f,
+    -1.7195180e-01f, 4.3528955e-04f,  1.4630882e+00f,  2.1969626e+00f,
+    -3.5170887e-02f, -5.3911299e-01f, 5.1588982e-01f,  6.7967400e-03f,
+    4.3528955e-04f,  -6.4872611e-01f, -5.6172144e-01f, -2.8991232e-02f,
+    1.0992563e+00f,  -6.7389756e-01f, 2.3791783e-01f,  4.3528955e-04f,
+    1.9306623e+00f,  7.2589642e-01f,  -4.2036962e-02f, -3.9409670e-01f,
+    9.9232477e-01f,  -7.0616663e-02f, 4.3528955e-04f,  3.5170476e+00f,
+    -1.9456553e+00f, 8.5132733e-02f,  4.5417547e-01f,  8.5303015e-01f,
+    3.0960012e-02f,  4.3528955e-04f,  -9.4035275e-02f, 5.3067827e-01f,
+    9.6327901e-02f,  -6.0828340e-01f, -6.7246795e-01f, 8.3590642e-02f,
+    4.3528955e-04f,  -1.6374981e+00f, -2.6582122e-01f, 5.3988576e-02f,
+    -1.9594476e-01f, -9.3965095e-01f, -3.9802559e-02f, 4.3528955e-04f,
+    2.2275476e+00f,  2.1025052e+00f,  -1.4453633e-01f, -8.2154346e-01f,
+    6.5899682e-01f,  -1.6214257e-02f, 4.3528955e-04f,  1.2220950e-01f,
+    -9.5152229e-02f, 1.3285591e-01f,  2.9470280e-01f,  4.3845960e-01f,
+    -5.4876179e-01f, 4.3528955e-04f,  6.6600613e-02f,  -2.4312320e+00f,
+    9.1123924e-02f,  7.0076609e-01f,  -2.1273872e-01f, 9.7542375e-02f,
+    4.3528955e-04f,  8.6681414e-01f,  1.0810934e+00f,  -1.8393439e-03f,
+    -7.4163288e-01f, 4.1683033e-01f,  7.8498840e-02f,  4.3528955e-04f,
+    -1.0561835e+00f, -4.4492245e-01f, 2.6711103e-01f,  2.8104088e-01f,
+    -7.7446014e-01f, -1.5831502e-01f, 4.3528955e-04f,  -7.8084111e-01f,
+    -9.3195683e-01f, 8.6887293e-03f,  1.0046687e+00f,  -4.8012564e-01f,
+    1.7115332e-02f,  4.3528955e-04f,  1.0442106e-01f,  9.3464601e-01f,
+    -1.3329314e-01f, -7.7637440e-01f, -9.6685424e-02f, -1.2922850e-01f,
+    4.3528955e-04f,  6.2351577e-02f,  5.8165771e-01f,  1.5642247e-01f,
+    -1.1904174e+00f, -1.7163813e-01f, 7.0839494e-02f,  4.3528955e-04f,
+    1.7299000e-02f,  2.8929749e-01f,  4.4131834e-02f,  -6.4061195e-01f,
+    -1.8535906e-01f, 3.9543688e-01f,  4.3528955e-04f,  -1.3890398e-01f,
+    1.9820398e+00f,  -4.1813083e-02f, -9.1835827e-01f, -3.9189634e-01f,
+    -6.2801339e-02f, 4.3528955e-04f,  -6.8080679e-02f, 3.0978892e+00f,
+    -5.8721703e-02f, -1.0253625e+00f, 1.3610230e-01f,  1.8367138e-02f,
+    4.3528955e-04f,  -9.0800756e-01f, -2.0518456e+00f, -2.2642942e-01f,
+    8.1299829e-01f,  -3.6434501e-01f, 5.6466818e-02f,  4.3528955e-04f,
+    -8.2330006e-01f, 4.3676692e-01f,  -8.8993654e-02f, -2.8599471e-01f,
+    -1.0141680e+00f, -2.1483710e-02f, 4.3528955e-04f,  -1.4321284e+00f,
+    2.0607890e-01f,  6.9554985e-02f,  2.9289412e-01f,  -4.8543891e-01f,
+    -1.2651734e-01f, 4.3528955e-04f,  -9.6482050e-01f, -2.1460772e+00f,
+    2.5596139e-03f,  9.2225760e-01f,  -4.2899844e-01f, 2.1118892e-02f,
+    4.3528955e-04f,  3.3674090e+00f,  4.0090528e+00f,  1.4332980e-01f,
+    -6.7465740e-01f, 6.0516548e-01f,  2.5385963e-02f,  4.3528955e-04f,
+    6.5007663e-01f,  2.0894101e+00f,  -1.4739278e-01f, -7.8564119e-01f,
+    5.9481180e-01f,  -1.0251867e-01f, 4.3528955e-04f,  -6.4447731e-01f,
+    7.7349758e-01f,  -2.8033048e-02f, -6.2545609e-01f, -6.0664898e-01f,
+    1.6450648e-01f,  4.3528955e-04f,  -3.2056984e-01f, -4.8122391e-02f,
+    8.8302776e-02f,  7.9358011e-02f,  -8.9642841e-01f, -9.2320271e-02f,
+    4.3528955e-04f,  3.1719546e+00f,  1.7128017e+00f,  -3.0302418e-02f,
+    -5.5962664e-01f, 6.2397093e-01f,  4.8231881e-02f,  4.3528955e-04f,
+    1.0599283e+00f,  -2.6612856e+00f, -4.6775889e-02f, 6.9994020e-01f,
+    4.3284380e-01f,  -9.3522474e-02f, 4.3528955e-04f,  -1.8474191e-02f,
+    8.0135071e-01f,  -5.9352741e-02f, -8.7077856e-01f, -5.7212907e-01f,
+    3.8131893e-01f,  4.3528955e-04f,  -1.0494272e+00f, -1.3914202e-01f,
+    2.1598944e-01f,  6.5014946e-01f,  -4.3245336e-01f, -1.4375189e-01f,
+    4.3528955e-04f,  5.4281282e-01f,  -1.3113482e-01f, 1.3185102e-01f,
+    2.1724258e-01f,  7.8620857e-01f,  4.7211680e-01f,  4.3528955e-04f,
+    7.5968391e-01f,  -1.7907287e-01f, 1.8164312e-02f,  1.3938058e-02f,
+    1.3369875e+00f,  2.8104940e-02f,  4.3528955e-04f,  5.2703846e-01f,
+    -3.5202062e-01f, -8.8826090e-02f, -9.8660484e-02f, 9.0747762e-01f,
+    2.2789402e-02f,  4.3528955e-04f,  -1.5599674e-01f, -1.4303715e+00f,
+    4.6144847e-02f,  9.5154881e-01f,  -1.2000827e-01f, -6.1274441e-03f,
+    4.3528955e-04f,  1.7105310e+00f,  6.4772415e-01f,  6.1802126e-02f,
+    -2.0703207e-01f, 9.2258567e-01f,  2.9194435e-02f,  4.3528955e-04f,
+    5.1064003e-01f,  1.6453859e-01f,  2.4838235e-02f,  -2.0034991e-01f,
+    1.4291912e+00f,  1.8037251e-01f,  4.3528955e-04f,  -9.6249200e-02f,
+    5.5289620e-01f,  2.3231117e-01f,  -5.6639469e-01f, -4.6671432e-01f,
+    1.7237876e-01f,  4.3528955e-04f,  3.0957062e+00f,  2.1662505e+00f,
+    -2.6947286e-02f, -5.5842191e-01f, 6.8165332e-01f,  -3.5938643e-02f,
+    4.3528955e-04f,  -4.3388373e-01f, -9.4529146e-01f, -1.3737644e-01f,
+    6.2122089e-01f,  -4.3809488e-01f, -1.1201017e-01f, 4.3528955e-04f,
+    1.8064566e+00f,  -9.4404835e-01f, -2.0395242e-02f, 4.6822482e-01f,
+    8.7938130e-01f,  2.2304822e-03f,  4.3528955e-04f,  7.1512711e-01f,
+    -1.8945515e+00f, -1.0164935e-02f, 8.6844039e-01f,  -2.4637526e-02f,
+    1.3754247e-01f,  4.3528955e-04f,  -5.9193283e-02f, 9.3404841e-01f,
+    4.0031165e-02f,  -9.2452937e-01f, -3.0482365e-02f, -3.4428015e-01f,
+    4.3528955e-04f,  -3.1682181e-01f, -4.4349790e-02f, 4.5898333e-02f,
+    -1.4738195e-01f, -1.2687914e+00f, -1.7005651e-01f, 4.3528955e-04f,
+    -6.0217631e-01f, 2.6832187e+00f,  -1.7019261e-01f, -9.0972215e-01f,
+    -5.1237017e-01f, -2.5846313e-03f, 4.3528955e-04f,  1.0459696e-01f,
+    4.0892011e-01f,  -5.0248113e-02f, -1.3328296e+00f, 6.1958063e-01f,
+    -2.3817251e-02f, 4.3528955e-04f,  3.4942657e-01f,  -5.3258038e-01f,
+    1.2674794e-01f,  1.6390590e-01f,  1.0199207e+00f,  -2.4471459e-01f,
+    4.3528955e-04f,  4.8576221e-01f,  -1.6881601e+00f, 3.7511133e-02f,
+    7.0576733e-01f,  1.7810932e-01f,  -7.2185293e-02f, 4.3528955e-04f,
+    -9.0147740e-01f, 1.6665719e+00f,  -1.5640621e-01f, -4.6505028e-01f,
+    -3.5920501e-01f, -1.2220404e-01f, 4.3528955e-04f,  1.7284967e+00f,
+    -4.8968053e-01f, -8.3691098e-02f, 2.6083806e-01f,  7.5472921e-01f,
+    -1.1336222e-01f, 4.3528955e-04f,  -2.6162329e+00f, 1.3804768e+00f,
+    -5.8043871e-02f, -3.6274192e-01f, -7.1767229e-01f, -1.3694651e-01f,
+    4.3528955e-04f,  -1.5626290e+00f, -2.9593856e+00f, 2.1055960e-03f,
+    7.8441155e-01f,  -3.7136063e-01f, 8.3678123e-03f,  4.3528955e-04f,
+    -2.0550177e+00f, 1.6195004e+00f,  8.8773422e-02f,  -7.9358667e-01f,
+    -7.8342104e-01f, 2.4659721e-02f,  4.3528955e-04f,  -3.4250553e+00f,
+    -7.7338284e-01f, 1.8137273e-01f,  2.9323843e-01f,  -8.5327971e-01f,
+    -1.2494276e-02f, 4.3528955e-04f,  -1.0928006e+00f, -9.8063856e-01f,
+    -3.5813272e-02f, 8.6911207e-01f,  -3.6709440e-01f, 1.0829409e-01f,
+    4.3528955e-04f,  -1.5037622e+00f, -2.6505890e+00f, -8.1888154e-02f,
+    7.1912748e-01f,  -3.3060527e-01f, 3.0391361e-03f,  4.3528955e-04f,
+    -1.8642495e+00f, -1.0241684e+00f, 2.2789132e-02f,  4.5018724e-01f,
+    -7.5242269e-01f, 1.0928122e-01f,  4.3528955e-04f,  1.5637577e-01f,
+    2.0454708e-01f,  -3.1532091e-03f, -9.2234260e-01f, 2.5889906e-01f,
+    1.1085278e+00f,  4.3528955e-04f,  -1.0646159e-01f, -2.3127935e+00f,
+    8.6346846e-03f,  6.7511958e-01f,  3.3803451e-01f,  3.2426551e-02f,
+    4.3528955e-04f,  3.8002166e-01f,  -4.9412841e-01f, -2.1785410e-02f,
+    7.1336085e-01f,  8.8995880e-01f,  -2.3885676e-01f, 4.3528955e-04f,
+    -2.5872514e-04f, 9.6659374e-01f,  1.0173360e-02f,  -9.8121423e-01f,
+    3.9377183e-01f,  2.4319079e-02f,  4.3528955e-04f,  1.1910295e+00f,
+    1.9076605e+00f,  -2.8408753e-02f, -8.9064270e-01f, 7.6573288e-01f,
+    3.8091257e-02f,  4.3528955e-04f,  5.0160426e-01f,  8.0534053e-01f,
+    4.0923987e-02f,  -5.7160139e-01f, 6.7943436e-01f,  9.8406978e-02f,
+    4.3528955e-04f,  -1.1994266e-01f, -1.1840980e+00f, -1.2843851e-02f,
+    8.7393749e-01f,  2.4980435e-02f,  1.3133699e-01f,  4.3528955e-04f,
+    -5.3161716e-01f, -1.7649425e+00f, 7.4960520e-03f,  9.1179603e-01f,
+    4.8043512e-02f,  -4.6563847e-03f, 4.3528955e-04f,  4.0527468e+00f,
+    -8.1622916e-01f, 7.5294048e-02f,  2.2883870e-01f,  8.8913989e-01f,
+    -1.8112550e-03f, 4.3528955e-04f,  5.1311258e-02f,  -6.5259296e-01f,
+    1.8828791e-02f,  8.7199658e-01f,  4.1920915e-01f,  1.4764397e-01f,
+    4.3528955e-04f,  1.1982348e+00f,  -1.0025470e+00f, 5.8512413e-03f,
+    6.5866423e-01f,  7.3078775e-01f,  -1.0948446e-01f, 4.3528955e-04f,
+    -5.7380664e-01f, 3.0134225e+00f,  3.4402102e-02f,  -9.1990477e-01f,
+    -2.8737250e-01f, 1.7441360e-02f,  4.3528955e-04f,  -3.5960561e-01f,
+    1.6457498e-01f,  6.0220505e-03f,  3.2237384e-01f,  -8.9993221e-01f,
+    1.6651231e-01f,  4.3528955e-04f,  -4.7114947e-01f, -3.1367221e+00f,
+    -1.7482856e-02f, 1.0110542e+00f,  -5.1265862e-03f, 7.3640600e-02f,
+    4.3528955e-04f,  2.9541917e+00f,  1.8186599e-01f,  8.9627750e-02f,
+    -1.1978638e-01f, 8.2598686e-01f,  5.2585863e-02f,  4.3528955e-04f,
+    3.1605814e+00f,  1.4804116e+00f,  -7.2326181e-03f, -3.5264218e-01f,
+    9.7272635e-01f,  1.5132143e-03f,  4.3528955e-04f,  2.1143963e+00f,
+    3.3559614e-01f,  1.1881064e-01f,  -8.0633223e-02f, 1.0973618e+00f,
+    -3.8899735e-03f, 4.3528955e-04f,  3.1001277e+00f,  2.8451636e+00f,
+    -2.9366398e-02f, -6.8751752e-01f, 6.5671217e-01f,  -2.5278979e-03f,
+    4.3528955e-04f,  -1.1604156e+00f, -5.4868358e-01f, -7.0652761e-02f,
+    2.4676095e-01f,  -9.4454223e-01f, -2.5924295e-02f, 4.3528955e-04f,
+    -7.4018097e-01f, -2.3911142e+00f, -2.5208769e-02f, 9.5126021e-01f,
+    -1.8476564e-01f, -5.3207301e-02f, 4.3528955e-04f,  1.8137285e-01f,
+    1.8002636e+00f,  -7.6774806e-02f, -8.1196320e-01f, -2.0312734e-01f,
+    -3.3981767e-02f, 4.3528955e-04f,  -8.8973665e-01f, 8.8048881e-01f,
+    -1.5304311e-01f, -4.6352151e-01f, -4.0352288e-01f, 1.3185799e-02f,
+    4.3528955e-04f,  6.2880623e-01f,  -2.3269174e+00f, 1.0132728e-01f,
+    7.5453192e-01f,  2.0464706e-01f,  -3.0325487e-02f, 4.3528955e-04f,
+    -1.6192812e+00f, 2.9005671e-01f,  8.6403497e-02f,  -4.2344549e-01f,
+    -9.2111617e-01f, -1.4405136e-02f, 4.3528955e-04f,  -2.0216768e+00f,
+    -1.7361889e+00f, 4.8458237e-02f,  5.6719553e-01f,  -5.3164411e-01f,
+    2.8369453e-02f,  4.3528955e-04f,  -1.7314348e-01f, 2.4393530e+00f,
+    1.9312203e-01f,  -9.4708359e-01f, -2.0663981e-01f, -3.0613426e-02f,
+    4.3528955e-04f,  -2.0798292e+00f, -2.1245657e-01f, -6.2375542e-02f,
+    1.4876083e-01f,  -8.6537892e-01f, -1.6776482e-02f, 4.3528955e-04f,
+    1.2424555e+00f,  -4.9340600e-01f, 3.8074714e-04f,  4.8663029e-01f,
+    1.1846467e+00f,  3.0666193e-02f,  4.3528955e-04f,  5.8551413e-01f,
+    -1.3404931e-01f, 2.9275170e-02f,  2.0949099e-02f,  6.5356815e-01f,
+    3.2296926e-01f,  4.3528955e-04f,  -2.2607148e-01f, 4.6342981e-01f,
+    1.9588798e-02f,  -6.2120587e-01f, -8.0679303e-01f, -5.5665299e-03f,
+    4.3528955e-04f,  4.8794228e-01f,  -1.5677538e+00f, 1.3222785e-01f,
+    9.8567438e-01f,  1.5833491e-01f,  1.1192162e-01f,  4.3528955e-04f,
+    -2.8819375e+00f, -4.3850827e-01f, -4.6859730e-02f, 3.4049299e-02f,
+    -9.0175933e-01f, -2.8249625e-02f, 4.3528955e-04f,  -3.3821573e+00f,
+    1.4153132e+00f,  4.7825798e-02f,  -4.5967886e-01f, -8.8771540e-01f,
+    -3.2246891e-02f, 4.3528955e-04f,  5.2379435e-01f,  2.1959323e-01f,
+    6.8631507e-02f,  3.5518754e-01f,  1.2534918e+00f,  -2.7986285e-01f,
+    4.3528955e-04f,  -7.5409085e-01f, -4.4856060e-01f, -1.1702770e-02f,
+    8.6026728e-02f,  -5.1055199e-01f, -1.1338430e-01f, 4.3528955e-04f,
+    -3.7166458e-01f, 4.2601299e+00f,  -2.6265597e-01f, -9.7686023e-01f,
+    -1.1489559e-01f, 2.7066329e-04f,  4.3528955e-04f,  -2.2153363e-01f,
+    2.6231911e+00f,  -9.5289782e-02f, -9.9855661e-01f, -1.3385244e-01f,
+    -3.1422805e-02f, 4.3528955e-04f,  7.8053570e-01f,  -9.8473448e-01f,
+    7.7782407e-02f,  8.9362705e-01f,  1.2495216e-01f,  1.4302009e-01f,
+    4.3528955e-04f,  -3.0539626e-01f, -3.3046138e+00f, -1.9005127e-02f,
+    8.7618279e-01f,  7.8633547e-02f,  9.7274203e-03f,  4.3528955e-04f,
+    -4.0694186e-01f, -1.6044971e+00f, 1.8410461e-01f,  6.1722302e-01f,
+    -9.0403587e-02f, -1.9891663e-02f, 4.3528955e-04f,  -1.0182806e+00f,
+    -3.1936564e+00f, -8.8086955e-02f, 8.2385814e-01f,  -3.8647696e-01f,
+    3.3644222e-02f,  4.3528955e-04f,  -2.4010088e+00f, -1.3584445e+00f,
+    -6.4757846e-02f, 3.5135934e-01f,  -7.4257511e-01f, 5.9980165e-02f,
+    4.3528955e-04f,  2.1665096e+00f,  6.8750298e-01f,  6.1138242e-02f,
+    -1.0285388e-01f, 1.0637898e+00f,  2.3372352e-02f,  4.3528955e-04f,
+    2.8401596e-02f,  -5.3743833e-01f, -4.9962223e-02f, 8.7825376e-01f,
+    -9.1578364e-01f, 1.7603993e-02f,  4.3528955e-04f,  -1.4481920e+00f,
+    -1.6172411e-01f, -5.8283173e-02f, -4.0988695e-02f, -8.6975026e-01f,
+    4.2644206e-02f,  4.3528955e-04f,  8.9154214e-01f,  -1.5530504e+00f,
+    6.9267112e-03f,  8.0952418e-01f,  6.0299855e-01f,  -2.9141452e-02f,
+    4.3528955e-04f,  4.4740546e-01f,  -8.5090563e-02f, 9.5522925e-03f,
+    6.8516874e-01f,  7.3528737e-01f,  6.2354665e-02f,  4.3528955e-04f,
+    3.8142238e+00f,  1.4170536e+00f,  7.6347967e-03f,  -3.3032110e-01f,
+    9.2062008e-01f,  8.4167987e-02f,  4.3528955e-04f,  4.3107897e-01f,
+    1.5380681e+00f,  8.9293651e-02f,  -1.0154482e+00f, -1.5598691e-01f,
+    7.4538076e-03f,  4.3528955e-04f,  9.0402043e-01f,  -2.9644141e+00f,
+    4.9292978e-02f,  8.8341254e-01f,  3.3673137e-01f,  3.4312230e-02f,
+    4.3528955e-04f,  1.2360678e+00f,  1.2461649e+00f,  1.2621503e-01f,
+    -7.5785065e-01f, 3.6909667e-01f,  1.0272077e-01f,  4.3528955e-04f,
+    -3.5386041e-02f, 8.3406943e-01f,  1.4718983e-02f,  -6.8749017e-01f,
+    -3.4632576e-01f, -8.5831143e-02f, 4.3528955e-04f,  -4.7062373e+00f,
+    -3.9321250e-01f, 1.3624497e-01f,  1.1087300e-01f,  -8.7108040e-01f,
+    -3.5730356e-03f, 4.3528955e-04f,  5.4503357e-01f,  8.0585349e-01f,
+    4.2364020e-03f,  -1.1494517e+00f, 5.0595313e-01f,  -1.0082168e-01f,
+    4.3528955e-04f,  -7.5158603e-02f, 9.5326018e-01f,  -8.8700153e-02f,
+    -1.0292276e+00f, -1.9819370e-01f, -1.8738037e-01f, 4.3528955e-04f,
+    5.4983836e-01f,  1.5210698e+00f,  4.3404628e-02f,  -1.2261977e+00f,
+    2.2023894e-01f,  7.5706698e-02f,  4.3528955e-04f,  -2.3999243e+00f,
+    2.1804373e+00f,  -1.0860875e-01f, -5.5760336e-01f, -7.1863830e-01f,
+    -2.3669039e-03f, 4.3528955e-04f,  3.1456679e-02f,  1.3726859e+00f,
+    3.7169342e-03f,  -9.5063037e-01f, 3.3770549e-01f,  -1.6761926e-01f,
+    4.3528955e-04f,  1.1985265e+00f,  7.4975020e-01f,  9.7618625e-03f,
+    -8.0065006e-01f, 6.5643001e-01f,  -1.2000196e-01f, 4.3528955e-04f,
+    -1.8628707e+00f, -2.1035333e-01f, 5.1831488e-02f,  3.6422512e-01f,
+    -9.8096609e-01f, -1.1301040e-01f, 4.3528955e-04f,  -1.8695948e-01f,
+    4.7098018e-02f,  -5.8505986e-02f, 6.7684507e-01f,  -9.7887170e-01f,
+    -7.1284488e-02f, 4.3528955e-04f,  1.2337499e+00f,  7.3599190e-01f,
+    -9.4945922e-02f, -6.0338819e-01f, 7.5461215e-01f,  -5.2646041e-02f,
+    4.3528955e-04f,  -8.0929905e-01f, -9.2185253e-01f, -1.0670380e-01f,
+    2.9095286e-01f,  -1.0370268e+00f, -1.4131424e-01f, 4.3528955e-04f,
+    -1.9641546e+00f, -3.7608240e+00f, 1.1018326e-01f,  8.2998341e-01f,
+    -4.3341470e-01f, 2.4326162e-02f,  4.3528955e-04f,  1.0984576e-01f,
+    5.6369001e-01f,  2.8241631e-02f,  -1.0328488e+00f, -4.1240555e-01f,
+    2.2188593e-01f,  4.3528955e-04f,  -6.0087287e-01f, -3.3414786e+00f,
+    2.1135636e-01f,  8.3026862e-01f,  -2.0112723e-01f, 1.8008851e-02f,
+    4.3528955e-04f,  1.4048605e+00f,  2.2681718e-01f,  8.5497804e-02f,
+    -5.9159223e-02f, 7.6656753e-01f,  -1.8471763e-01f, 4.3528955e-04f,
+    8.6701041e-01f,  -8.8834208e-01f, -5.4960161e-02f, 4.8620775e-01f,
+    5.5222017e-01f,  1.9075315e-02f,  4.3528955e-04f,  5.7406324e-01f,
+    1.0137316e+00f,  1.0804778e-01f,  -8.7813210e-01f, 1.8815668e-01f,
+    -8.7215542e-04f, 4.3528955e-04f,  2.0986035e+00f,  4.4738829e-02f,
+    1.8902699e-02f,  1.3665456e-01f,  1.0593314e+00f,  2.9838247e-02f,
+    4.3528955e-04f,  2.8635178e-02f,  1.6977284e+00f,  -7.5980671e-02f,
+    -7.4267983e-01f, 3.1753719e-02f,  4.9654372e-02f,  4.3528955e-04f,
+    4.4197792e-01f,  -8.8677621e-01f, 2.8880674e-01f,  5.5002004e-01f,
+    -2.3852623e-01f, -2.0448004e-01f, 4.3528955e-04f,  1.3324966e+00f,
+    6.2308347e-01f,  4.9173497e-02f,  -6.7105263e-01f, 8.5418338e-01f,
+    9.8057032e-02f,  4.3528955e-04f,  2.9794130e+00f,  -1.1382123e+00f,
+    3.6870189e-02f,  1.6805904e-01f,  8.0307668e-01f,  3.3715449e-02f,
+    4.3528955e-04f,  5.2165823e+00f,  7.9412901e-01f,  -2.6963159e-02f,
+    -1.2525870e-01f, 9.1279143e-01f,  2.7232314e-02f,  4.3528955e-04f,
+    1.5893443e+00f,  -3.1180762e-02f, 8.8540994e-02f,  1.2388450e-01f,
+    8.7858939e-01f,  3.2170609e-02f,  4.3528955e-04f,  -1.9729308e+00f,
+    -5.4301143e-01f, -1.0044137e-01f, 1.9859129e-01f,  -7.8461170e-01f,
+    1.3711540e-01f,  4.3528955e-04f,  -2.1488801e-02f, -8.9241862e-02f,
+    -9.0094492e-02f, -1.5251940e-01f, -7.8768557e-01f, -2.0239474e-01f,
+    4.3528955e-04f,  2.3853872e+00f,  5.8108550e-01f,  -1.6810659e-01f,
+    -5.9231204e-01f, 7.1739310e-01f,  -4.4527709e-02f, 4.3528955e-04f,
+    -8.4816611e-01f, -5.5872023e-01f, 6.2930591e-02f,  4.5399958e-01f,
+    -6.3848078e-01f, -1.3562729e-02f, 4.3528955e-04f,  2.4202998e+00f,
+    1.7121294e+00f,  5.1325999e-02f,  -5.5129248e-01f, 9.0952402e-01f,
+    -6.4055942e-02f, 4.3528955e-04f,  -4.4007868e-01f, 2.3427620e+00f,
+    7.4197814e-02f,  -6.3222665e-01f, -3.8390066e-03f, -1.2377399e-01f,
+    4.3528955e-04f,  -5.0934166e-01f, -1.3589574e+00f, 8.1578583e-02f,
+    5.5459166e-01f,  -6.8251216e-01f, 1.5072592e-01f,  4.3528955e-04f,
+    1.1867840e+00f,  6.2355483e-01f,  -1.4367016e-01f, -4.8990968e-01f,
+    8.7113827e-01f,  -3.3855990e-02f, 4.3528955e-04f,  -1.0341714e-01f,
+    2.1972027e+00f,  -8.5866004e-02f, -7.8301811e-01f, -5.2546956e-02f,
+    5.9950132e-02f,  4.3528955e-04f,  -6.8855725e-02f, -1.8209658e+00f,
+    9.4503239e-02f,  8.7841380e-01f,  1.6200399e-01f,  -9.4188489e-02f,
+    4.3528955e-04f,  -1.8718420e+00f, -2.5654843e+00f, -2.2279415e-02f,
+    7.0856446e-01f,  -6.5598333e-01f, 2.9622724e-02f,  4.3528955e-04f,
+    -9.0099084e-01f, -6.7630947e-01f, 1.2118616e-01f,  3.7618360e-01f,
+    -5.7120287e-01f, -1.7196420e-01f, 4.3528955e-04f,  -3.8416438e+00f,
+    -1.3796822e+00f, -1.9073356e-02f, 3.1241691e-01f,  -7.5429314e-01f,
+    4.6409406e-02f,  4.3528955e-04f,  2.8541243e-01f,  -3.6865935e+00f,
+    1.1118159e-01f,  8.0215394e-01f,  3.1592183e-02f,  5.6100197e-02f,
+    4.3528955e-04f,  3.3909471e+00f,  1.3730515e+00f,  -1.6735382e-02f,
+    -3.3026043e-01f, 8.8571084e-01f,  1.8637992e-02f,  4.3528955e-04f,
+    -1.0838163e+00f, 2.6683095e-01f,  -2.0475921e-01f, -1.7158101e-01f,
+    -6.5997642e-01f, -1.0635884e-02f, 4.3528955e-04f,  1.0041045e+00f,
+    1.2981331e-01f,  1.2747457e-02f,  -4.0641734e-01f, 8.1512636e-01f,
+    5.7096124e-02f,  4.3528955e-04f,  2.0038724e-01f,  -2.8984964e-01f,
+    -3.4706522e-02f, 1.1086525e+00f,  -1.2541127e-01f, 1.8057032e-01f,
+    4.3528955e-04f,  2.3104987e+00f,  -9.3613738e-01f, 6.3051313e-02f,
+    2.3807044e-01f,  9.8435211e-01f,  7.5864337e-02f,  4.3528955e-04f,
+    -2.0072730e+00f, 1.5337367e-01f,  7.6500647e-02f,  -1.3493069e-01f,
+    -1.0448799e+00f, -8.0492944e-02f, 4.3528955e-04f,  1.4438511e+00f,
+    4.9439639e-01f,  -8.5409455e-02f, -2.5178692e-01f, 7.3167127e-01f,
+    -1.4277172e-01f, 4.3528955e-04f,  -6.6208012e-02f, -1.6607817e-01f,
+    -3.3608258e-02f, 9.3574381e-01f,  -8.7886870e-01f, -4.5337468e-02f,
+    4.3528955e-04f,  5.8382565e-01f,  7.0541620e-01f,  4.5698363e-02f,
+    -1.0761838e+00f, 1.0414816e+00f,  8.1107780e-02f,  4.3528955e-04f,
+    4.9990299e-01f,  -1.6385348e-01f, -2.0624353e-02f, 1.1487038e-01f,
+    8.6193627e-01f,  -1.6885158e-01f, 4.3528955e-04f,  8.2547039e-01f,
+    -1.2059232e+00f, 5.1281963e-02f,  1.0258828e+00f,  2.2830784e-01f,
+    1.4370824e-01f,  4.3528955e-04f,  1.8418908e+00f,  9.5211905e-01f,
+    1.8969165e-02f,  -8.8576987e-02f, 4.8172790e-01f,  -1.4431679e-02f,
+    4.3528955e-04f,  -1.0114060e-01f, 1.6351238e-01f,  1.1543112e-01f,
+    -1.3514526e-01f, -1.0041178e+00f, 5.0662822e-01f,  4.3528955e-04f,
+    -4.2023335e+00f, 2.5431943e+00f,  -2.3773095e-02f, -4.5392498e-01f,
+    -7.6611948e-01f, 2.2688242e-02f,  4.3528955e-04f,  -8.1866479e-01f,
+    -6.0003787e-02f, -2.6448397e-06f, -4.3320069e-01f, -1.1364709e+00f,
+    2.0287114e-01f,  4.3528955e-04f,  2.2553949e+00f,  1.1285099e-01f,
+    -2.6196759e-02f, 3.8254209e-02f,  9.9790680e-01f,  4.6921276e-02f,
+    4.3528955e-04f,  2.5182300e+00f,  -8.7583530e-01f, 3.0350743e-02f,
+    2.1050508e-01f,  9.0025115e-01f,  -3.4214903e-02f, 4.3528955e-04f,
+    -1.3982513e+00f, 1.4634587e+00f,  1.0058690e-01f,  -5.5063361e-01f,
+    -8.0921721e-01f, 9.0333037e-03f,  4.3528955e-04f,  -1.0804394e+00f,
+    3.8848275e-01f,  6.0744066e-02f,  -1.3133051e-01f, -1.0311453e+00f,
+    3.1966725e-01f,  4.3528955e-04f,  -2.3210543e-01f, -1.4428994e-01f,
+    1.9665647e-01f,  5.8106953e-01f,  -4.1862264e-01f, -3.8007462e-01f,
+    4.3528955e-04f,  -2.3794636e-01f, 1.8890817e+00f,  -1.0230808e-01f,
+    -8.7130427e-01f, -4.1642734e-01f, 6.0796987e-02f,  4.3528955e-04f,
+    1.6616440e-01f,  8.0680639e-02f,  2.6312670e-02f,  -1.7039967e-01f,
+    9.4767940e-01f,  -4.9309337e-01f, 4.3528955e-04f,  -9.4497152e-02f,
+    6.2487996e-01f,  6.1155513e-02f,  -7.9731864e-01f, -4.8194578e-01f,
+    -6.5751120e-02f, 4.3528955e-04f,  5.9881383e-01f,  -1.0572406e+00f,
+    1.6778144e-01f,  4.4907954e-01f,  3.5768199e-01f,  -2.8938442e-01f,
+    4.3528955e-04f,  -2.1272349e+00f, -2.1148062e+00f, 1.9391527e-02f,
+    7.7905750e-01f,  -6.6755265e-01f, -2.2257227e-02f, 4.3528955e-04f,
+    2.6295462e+00f,  1.3879784e+00f,  1.1420004e-01f,  -4.4877172e-01f,
+    7.8877288e-01f,  -2.1199992e-02f, 4.3528955e-04f,  -2.0311728e+00f,
+    3.0221815e+00f,  6.8797758e-03f,  -7.2903228e-01f, -6.2226057e-01f,
+    -2.0611718e-02f, 4.3528955e-04f,  3.7315726e-01f,  1.9459890e+00f,
+    2.5346349e-03f,  -1.0972291e+00f, 2.3041408e-01f,  -5.9966482e-02f,
+    4.3528955e-04f,  6.2169200e-01f,  6.8652660e-01f,  -4.2650372e-02f,
+    -5.5223274e-01f, 7.3954892e-01f,  -1.9205309e-01f, 4.3528955e-04f,
+    6.6241843e-01f,  -4.5871633e-01f, 5.8407433e-02f,  2.0236804e-01f,
+    8.2332999e-01f,  2.9627156e-01f,  4.3528955e-04f,  2.1948621e-01f,
+    -2.8386688e-01f, 1.7493246e-01f,  8.2440829e-01f,  5.7249331e-01f,
+    -4.8702273e-01f, 4.3528955e-04f,  -1.4504439e+00f, 7.5814360e-01f,
+    -4.9124647e-02f, 2.9103994e-01f,  -8.9323312e-01f, 6.0043307e-03f,
+    4.3528955e-04f,  -1.0889474e+00f, -2.4433215e+00f, -6.4297408e-02f,
+    8.1158328e-01f,  -5.1451206e-01f, -2.0037789e-02f, 4.3528955e-04f,
+    7.2146070e-01f,  1.4136108e+00f,  -1.1201730e-02f, -7.5682038e-01f,
+    2.6541027e-01f,  -1.4377570e-01f, 4.3528955e-04f,  -2.5747868e-01f,
+    1.7068375e+00f,  -5.5693714e-03f, -5.2365309e-01f, -4.5422253e-01f,
+    9.8637320e-02f,  4.3528955e-04f,  4.4472823e-01f,  -8.8799697e-01f,
+    -3.5425290e-02f, 1.1954638e+00f,  -3.5426028e-02f, 5.7817161e-02f,
+    4.3528955e-04f,  1.3884593e-02f,  9.2989475e-01f,  1.1478577e-02f,
+    -7.5093061e-01f, 4.9144611e-02f,  9.6518300e-02f,  4.3528955e-04f,
+    3.0604446e+00f,  -1.1337315e+00f, -1.6526009e-01f, 2.1201716e-01f,
+    8.9217579e-01f,  -6.5360993e-02f, 4.3528955e-04f,  3.4266669e-01f,
+    -7.2600329e-01f, -2.5429339e-03f, 8.5793829e-01f,  5.4191905e-01f,
+    -2.0769665e-01f, 4.3528955e-04f,  -7.5925958e-01f, -2.4081950e-01f,
+    5.7799730e-02f,  1.5387757e-01f,  -7.6540476e-01f, -2.4511655e-01f,
+    4.3528955e-04f,  -1.0051786e+00f, -8.3961689e-01f, 2.8288592e-02f,
+    2.5145975e-01f,  -5.3426260e-01f, -7.9483189e-02f, 4.3528955e-04f,
+    1.7681268e-01f,  -4.0305942e-01f, 1.1047284e-01f,  9.6816206e-01f,
+    -9.0308256e-02f, 1.4949383e-01f,  4.3528955e-04f,  -1.0000279e+00f,
+    -4.1142410e-01f, -2.7344343e-01f, 6.5402395e-01f,  -4.5772868e-01f,
+    -4.0693965e-02f, 4.3528955e-04f,  1.8190960e+00f,  1.0242250e+00f,
+    -1.2690410e-01f, -4.6323961e-01f, 8.7463975e-01f,  1.8906144e-02f,
+    4.3528955e-04f,  -2.3929676e-01f, -9.1626137e-02f, 6.6445947e-02f,
+    1.0927068e+00f,  -9.2601752e-01f, -1.0192335e-01f, 4.3528955e-04f,
+    -3.3619612e-01f, -1.6351171e+00f, -1.0829730e-01f, 9.3116677e-01f,
+    -1.2086093e-01f, -4.5214906e-02f, 4.3528955e-04f,  1.0487654e+00f,
+    1.4507966e+00f,  -6.9856480e-02f, -7.8931224e-01f, 6.4676195e-01f,
+    -1.6027933e-02f, 4.3528955e-04f,  2.2815628e+00f,  5.8520377e-01f,
+    6.3243248e-02f,  -1.1186641e-01f, 9.8382092e-01f,  3.4892559e-02f,
+    4.3528955e-04f,  -3.7675142e-01f, -3.6345005e-01f, -5.2205354e-02f,
+    9.5492166e-01f,  -3.3363086e-01f, 1.0352491e-02f,  4.3528955e-04f,
+    -4.5937338e-01f, 4.3260610e-01f,  -6.0182167e-03f, -5.5746216e-01f,
+    -9.3278813e-01f, -1.0016717e-01f, 4.3528955e-04f,  -3.3373523e+00f,
+    3.0411497e-01f,  -3.2898132e-02f, -8.4115162e-02f, -9.9490058e-01f,
+    -3.2587412e-03f, 4.3528955e-04f,  -3.5499209e-01f, 1.2015631e+00f,
+    -5.5038612e-02f, -8.1605363e-01f, -4.0526313e-01f, 2.2949298e-01f,
+    4.3528955e-04f,  3.1604643e+00f,  -7.8258580e-01f, -9.9870756e-02f,
+    2.5978702e-01f,  8.1878477e-01f,  -1.7514464e-02f, 4.3528955e-04f,
+    6.7056261e-02f,  3.5691661e-01f,  -1.9738054e-02f, -6.9410777e-01f,
+    -1.9574766e-01f, 5.1850796e-01f,  4.3528955e-04f,  1.1690015e-01f,
+    1.5015254e+00f,  -1.6527115e-01f, -5.5864418e-01f, -3.8039735e-01f,
+    -2.1213351e-01f, 4.3528955e-04f,  -2.3876333e+00f, -1.6791182e+00f,
+    -5.8586076e-02f, 4.8861942e-01f,  -7.9862112e-01f, 8.7745395e-03f,
+    4.3528955e-04f,  5.4289335e-01f,  -8.9135349e-01f, 1.3314066e-02f,
+    4.4611534e-01f,  6.0574269e-01f,  -9.2228288e-03f, 4.3528955e-04f,
+    1.1757390e+00f,  -1.8771855e+00f, -3.0992141e-02f, 7.4466050e-01f,
+    4.0080741e-01f,  -3.4046450e-03f, 4.3528955e-04f,  3.5755274e+00f,
+    -6.3194543e-02f, 6.3506410e-02f,  -7.7472851e-02f, 9.3657905e-01f,
+    -1.6487084e-02f, 4.3528955e-04f,  2.0063922e+00f,  3.2654190e+00f,
+    -2.1489026e-01f, -8.4615904e-01f, 5.8452976e-01f,  -3.7852157e-02f,
+    4.3528955e-04f,  -2.2301111e+00f, -4.9555558e-01f, 1.4013952e-02f,
+    1.9073595e-01f,  -9.8883343e-01f, 2.6132664e-02f,  4.3528955e-04f,
+    -3.8411880e-01f, 1.6699871e+00f,  1.2264084e-02f,  -7.7501184e-01f,
+    -2.5391611e-01f, 7.7651799e-02f,  4.3528955e-04f,  9.5724076e-01f,
+    -8.4852898e-01f, 3.2571293e-02f,  5.2113032e-01f,  3.1918830e-01f,
+    1.3111247e-01f,  4.3528955e-04f,  -7.2317463e-01f, 5.8346587e-01f,
+    -8.4612876e-02f, -6.7789853e-01f, -1.0422281e+00f, -2.2353124e-02f,
+    4.3528955e-04f,  -1.1005304e+00f, -7.1903718e-01f, 2.9965490e-02f,
+    6.1634111e-01f,  -4.5465007e-01f, 7.8139126e-02f,  4.3528955e-04f,
+    -5.8435827e-01f, -2.2243567e-01f, 1.8944655e-02f,  3.6041191e-01f,
+    -3.4012070e-01f, -1.0267268e-01f, 4.3528955e-04f,  -1.5928942e+00f,
+    -2.6601809e-01f, -1.5099826e-01f, 1.6530070e-01f,  -8.8970184e-01f,
+    -6.5056160e-03f, 4.3528955e-04f,  -5.5076301e-02f, -1.8858309e-01f,
+    -5.1450022e-03f, 1.1228209e+00f,  2.9563385e-01f,  1.2502153e-01f,
+    4.3528955e-04f,  4.6305737e-01f,  -7.0927739e-01f, -1.9761238e-01f,
+    7.4018991e-01f,  -1.6856745e-01f, 8.9101888e-02f,  4.3528955e-04f,
+    3.5158052e+00f,  1.5233570e+00f,  -6.8500131e-02f, -2.8081557e-01f,
+    8.8278562e-01f,  1.8513286e-03f,  4.3528955e-04f,  -9.1508400e-01f,
+    -6.3259953e-01f, 3.8570073e-02f,  2.7261195e-01f,  -6.0721052e-01f,
+    -1.1852893e-01f, 4.3528955e-04f,  -1.0153127e+00f, 1.5829891e+00f,
+    -9.2706099e-02f, -5.9940714e-01f, -3.4442145e-01f, 9.2178218e-02f,
+    4.3528955e-04f,  -9.3551725e-01f, 9.5979649e-01f,  1.6506889e-01f,
+    -3.5330006e-01f, -7.9785210e-01f, -2.4093373e-02f, 4.3528955e-04f,
+    8.3512700e-01f,  -6.6445595e-01f, -7.3245666e-03f, 4.8541847e-01f,
+    9.8541915e-01f,  4.0799093e-02f,  4.3528955e-04f,  1.5766785e+00f,
+    3.5204580e+00f,  -5.0451625e-02f, -8.7230116e-01f, 4.1938159e-01f,
+    -8.1619648e-03f, 4.3528955e-04f,  -6.5286535e-01f, 2.0373333e+00f,
+    2.4839008e-02f,  -1.1652042e+00f, -3.3069769e-01f, -1.5820867e-01f,
+    4.3528955e-04f,  2.5837932e+00f,  1.0146980e+00f,  9.6991612e-04f,
+    -2.6156408e-01f, 8.5991192e-01f,  -1.0327504e-02f, 4.3528955e-04f,
+    -2.8940508e+00f, -2.4332553e-02f, -3.9269019e-02f, -8.2175329e-02f,
+    -8.5269511e-01f, -9.9542759e-02f, 4.3528955e-04f,  9.3731785e-01f,
+    -6.7471057e-01f, -1.1561787e-01f, 5.5656171e-01f,  3.6980581e-01f,
+    -8.1335299e-02f, 4.3528955e-04f,  2.2433418e-01f,  -1.9317548e+00f,
+    8.1712186e-02f,  9.7610009e-01f,  1.4621246e-01f,  6.8972103e-02f,
+    4.3528955e-04f,  9.6183723e-01f,  9.4192392e-01f,  1.7784914e-01f,
+    -9.9932361e-01f, 8.1023282e-01f,  -1.4741683e-01f, 4.3528955e-04f,
+    -2.4142542e+00f, -1.7644544e+00f, -4.0611704e-03f, 5.8124423e-01f,
+    -7.9773635e-01f, 9.1162033e-02f,  4.3528955e-04f,  2.5832012e-01f,
+    5.5883294e-01f,  -2.0291265e-02f, -1.0141363e+00f, 4.5042962e-01f,
+    9.2277065e-02f,  4.3528955e-04f,  -7.3965859e-01f, -1.0336103e+00f,
+    2.0964693e-02f,  2.4407096e-01f,  -7.6147139e-01f, -5.6517750e-02f,
+    4.3528955e-04f,  -1.2813196e-02f, 1.1440427e+00f,  -7.7077255e-02f,
+    -6.6795129e-01f, 4.8633784e-01f,  -2.4881299e-01f, 4.3528955e-04f,
+    2.5763817e+00f,  6.5523589e-01f,  -2.0384356e-02f, -4.7724381e-01f,
+    9.9749619e-01f,  -6.2102389e-02f, 4.3528955e-04f,  -2.4898973e-01f,
+    1.5939019e+00f,  -5.4233521e-02f, -9.9215376e-01f, -1.7488678e-01f,
+    -2.0961907e-02f, 4.3528955e-04f,  -1.8919522e+00f, -8.6752456e-01f,
+    6.9907911e-02f,  1.1650918e-01f,  -8.2493776e-01f, 1.5631513e-01f,
+    4.3528955e-04f,  1.4105057e+00f,  1.2156030e+00f,  1.0391846e-02f,
+    -7.8242904e-01f, 7.9300386e-01f,  -8.1698708e-02f, 4.3528955e-04f,
+    -9.6875899e-02f, 8.4136868e-01f,  1.5631573e-01f,  -6.9397932e-01f,
+    -4.2214730e-01f, -2.4216896e-01f, 4.3528955e-04f,  -1.4999424e+00f,
+    -9.7090620e-01f, 4.5710560e-02f,  -3.5041165e-02f, -8.9813638e-01f,
+    5.7672128e-02f,  4.3528955e-04f,  3.4523553e-01f,  -1.4340541e+00f,
+    5.6771271e-02f,  9.9525058e-01f,  4.6583526e-02f,  -1.9556314e-01f,
+    4.3528955e-04f,  1.1589792e+00f,  1.0217384e-01f,  -6.0573280e-02f,
+    4.6792346e-01f,  5.8281821e-01f,  -2.6106960e-01f, 4.3528955e-04f,
+    1.7685134e+00f,  7.5564779e-02f,  1.0923827e-01f,  -1.3139416e-01f,
+    9.6387523e-01f,  1.1992331e-01f,  4.3528955e-04f,  2.3585455e+00f,
+    -6.8175250e-01f, 6.3085712e-02f,  5.2321166e-01f,  9.5160639e-01f,
+    7.9756327e-02f,  4.3528955e-04f,  3.8741854e-01f,  -1.2380295e+00f,
+    -2.2081703e-01f, 4.8930815e-01f,  6.2844567e-02f,  6.0501765e-02f,
+    4.3528955e-04f,  -1.3577280e+00f, 9.0405315e-01f,  -8.2100511e-02f,
+    -4.9176940e-01f, -5.8622926e-01f, 2.1141709e-01f,  4.3528955e-04f,
+    2.1870217e+00f,  1.2079951e-01f,  3.1100186e-02f,  5.9182119e-02f,
+    6.8686843e-01f,  1.2959583e-01f,  4.3528955e-04f,  5.1665968e-01f,
+    3.3336937e-01f,  -1.1554714e-01f, -7.5879931e-01f, 2.5859886e-01f,
+    -1.1940341e-01f, 4.3528955e-04f,  -1.5278515e+00f, -3.1039636e+00f,
+    2.6547540e-02f,  7.0372438e-01f,  -4.6665913e-01f, -4.4643864e-02f,
+    4.3528955e-04f,  3.7159592e-02f,  -3.0733523e+00f, -5.2456588e-02f,
+    9.3483585e-01f,  8.5434876e-04f,  -1.3978018e-02f, 4.3528955e-04f,
+    -3.2946808e+00f, 2.3075864e+00f,  -6.9768272e-02f, -4.9566206e-01f,
+    -7.4619639e-01f, 1.3188319e-02f,  4.3528955e-04f,  4.9639660e-01f,
+    -3.9338440e-01f, -5.1259022e-02f, 7.5609314e-01f,  6.0839701e-01f,
+    2.0302209e-01f,  4.3528955e-04f,  -2.4058826e+00f, -3.2263417e+00f,
+    8.7073809e-03f,  7.2810167e-01f,  -5.0219864e-01f, 1.6857944e-02f,
+    4.3528955e-04f,  -9.6789634e-01f, 1.0031608e-01f,  1.0254135e-01f,
+    -5.5085337e-01f, -8.6377656e-01f, -3.4736189e-01f, 4.3528955e-04f,
+    1.7804682e-01f,  9.1845757e-01f,  -8.8900819e-02f, -8.1845421e-01f,
+    -2.7530786e-01f, -2.5303239e-01f, 4.3528955e-04f,  2.4283483e+00f,
+    1.0381964e+00f,  1.7149288e-02f,  -2.9458046e-01f, 7.7037472e-01f,
+    -5.7029113e-02f, 4.3528955e-04f,  -6.1018097e-01f, -6.9027001e-01f,
+    -1.3602732e-02f, 9.5917797e-01f,  -2.4647385e-01f, -1.0742184e-01f,
+    4.3528955e-04f,  -9.8558879e-01f, 1.4008402e+00f,  7.8846797e-02f,
+    -7.0550716e-01f, -6.2944043e-01f, -5.2106116e-02f, 4.3528955e-04f,
+    -4.3886936e-01f, -1.7004576e+00f, -5.0112486e-02f, 6.5699106e-01f,
+    -2.1699683e-01f, 4.9702950e-02f,  4.3528955e-04f,  2.7989200e-01f,
+    2.0351968e+00f,  -1.9291516e-02f, -9.4905597e-01f, 1.4831617e-01f,
+    1.5469903e-01f,  4.3528955e-04f,  -1.0940150e+00f, 1.2038294e+00f,
+    7.8553759e-02f,  -8.2914346e-01f, -4.5516059e-01f, -3.4970205e-02f,
+    4.3528955e-04f,  1.2369618e+00f,  -2.3469685e-01f, -4.6742926e-03f,
+    2.7868232e-01f,  9.8370445e-01f,  3.2809574e-02f,  4.3528955e-04f,
+    -1.1512040e+00f, 4.9605519e-01f,  5.4150194e-02f,  -1.4205958e-01f,
+    -7.9160959e-01f, -3.0626097e-01f, 4.3528955e-04f,  6.2758458e-01f,
+    -3.3829021e+00f, 1.6355248e-02f,  7.8983319e-01f,  1.1399511e-01f,
+    5.7745036e-02f,  4.3528955e-04f,  -6.6862237e-01f, -3.9799011e-01f,
+    4.7872785e-02f,  4.7939542e-01f,  -6.4601874e-01f, 1.6010832e-05f,
+    4.3528955e-04f,  2.3462856e-01f,  -1.2898934e+00f, 1.1523023e-02f,
+    9.5837194e-01f,  7.4089825e-02f,  9.0424165e-02f,  4.3528955e-04f,
+    1.1259102e+00f,  8.7618515e-02f,  -1.3456899e-01f, -2.9205632e-01f,
+    6.7723966e-01f,  -4.6079099e-02f, 4.3528955e-04f,  -8.7704882e-03f,
+    -1.1725254e+00f, -8.8250719e-02f, 4.4035894e-01f,  -1.6670430e-02f,
+    1.4089695e-01f,  4.3528955e-04f,  2.2584291e+00f,  1.4189466e+00f,
+    -1.8443355e-02f, -4.3839177e-01f, 8.6954474e-01f,  -4.5087278e-02f,
+    4.3528955e-04f,  -4.6254298e-01f, 4.8147935e-01f,  7.9244468e-03f,
+    -2.4719588e-01f, -9.0382683e-01f, 1.2646266e-04f,  4.3528955e-04f,
+    1.5133755e+00f,  -4.1474123e+00f, -1.4019597e-01f, 8.8256359e-01f,
+    3.0353436e-01f,  2.5529342e-02f,  4.3528955e-04f,  4.0004826e-01f,
+    -6.1617059e-01f, -1.1821052e-02f, 8.6504596e-01f,  4.9651924e-01f,
+    7.3513277e-02f,  4.3528955e-04f,  8.2862830e-01f,  2.3726277e+00f,
+    1.2705037e-01f,  -8.0391479e-01f, 3.8536501e-01f,  -1.0712823e-01f,
+    4.3528955e-04f,  2.5729899e+00f,  1.1411077e+00f,  -1.5030988e-02f,
+    -3.7253910e-01f, 7.6552385e-01f,  -4.9367297e-02f, 4.3528955e-04f,
+    8.8084817e-01f,  -1.3029621e+00f, 1.0845469e-01f,  5.8690238e-01f,
+    2.8065485e-01f,  3.5188537e-02f,  4.3528955e-04f,  -8.6291587e-01f,
+    -3.3691412e-01f, -9.3317881e-02f, 1.0001194e+00f,  -5.3239751e-01f,
+    -3.6933172e-02f, 4.3528955e-04f,  1.5546671e-01f,  9.7376794e-01f,
+    3.7359867e-02f,  -1.2189692e+00f, 1.0986128e-01f,  1.9549276e-04f,
+    4.3528955e-04f,  8.3077073e-01f,  -8.0026269e-01f, -1.5794440e-01f,
+    9.3238616e-01f,  4.0641621e-01f,  7.9029009e-02f,  4.3528955e-04f,
+    7.9840970e-01f,  -7.4233145e-01f, -4.8840925e-02f, 4.8868039e-01f,
+    6.7256373e-01f,  -1.3452559e-02f, 4.3528955e-04f,  -2.4638307e+00f,
+    -2.0854096e+00f, 3.3859923e-02f,  5.7639414e-01f,  -6.8748325e-01f,
+    3.9054889e-02f,  4.3528955e-04f,  -2.2930008e-01f, 2.8647637e-01f,
+    -1.6853252e-02f, -4.3840051e-01f, -1.3793395e+00f, 1.5072146e-01f,
+    4.3528955e-04f,  1.1410736e+00f,  7.8702398e-02f,  -3.3943098e-02f,
+    8.3931476e-02f,  8.1018960e-01f,  1.0001824e-01f,  4.3528955e-04f,
+    -4.4735882e-01f, 5.9994358e-01f,  6.2245611e-02f,  -7.1681690e-01f,
+    -3.9871550e-01f, -3.5942882e-02f, 4.3528955e-04f,  3.9692515e-01f,
+    -1.6514966e+00f, 1.6477087e-03f,  6.4856076e-01f,  -1.0229707e-01f,
+    -7.8090116e-02f, 4.3528955e-04f,  -2.0031521e-01f, 7.6972604e-01f,
+    7.1372345e-02f,  -8.2351524e-01f, -5.2152121e-01f, -3.4135514e-01f,
+    4.3528955e-04f,  -1.2074282e+00f, -1.4437757e-01f, -2.4055962e-02f,
+    5.2797568e-01f,  -7.7709115e-01f, 1.4448223e-01f,  4.3528955e-04f,
+    -6.2191188e-01f, -1.4273003e-01f, 1.0740837e-02f,  3.2151988e-01f,
+    -8.3749884e-01f, 1.6508783e-01f,  4.3528955e-04f,  -9.5489168e-01f,
+    -1.4336501e+00f, 8.4054336e-02f,  9.0721631e-01f,  -4.3047437e-01f,
+    -1.1153458e-02f, 4.3528955e-04f,  -3.4103441e+00f, 5.4458630e-01f,
+    -1.6016087e-03f, -2.2567050e-01f, -9.1743398e-01f, -1.1477491e-02f,
+    4.3528955e-04f,  1.4689618e+00f,  1.2086695e+00f,  -1.7923877e-01f,
+    -4.6484870e-01f, 5.5787706e-01f,  5.2227408e-02f,  4.3528955e-04f,
+    1.0726677e+00f,  1.2007883e+00f,  -7.8215607e-02f, -5.6627440e-01f,
+    7.7395010e-01f,  -9.1796324e-02f, 4.3528955e-04f,  2.6825041e-01f,
+    -6.8653381e-01f, -5.9507266e-02f, 9.6391803e-01f,  1.3338681e-01f,
+    8.0276683e-02f,  4.3528955e-04f,  2.8571851e+00f,  1.3082524e-01f,
+    -2.5722018e-01f, -1.3769688e-01f, 8.8655663e-01f,  -1.2759742e-02f,
+    4.3528955e-04f,  -1.9995936e+00f, 6.3053393e-01f,  1.3657334e-01f,
+    -3.1497157e-01f, -1.0123312e+00f, -1.4504001e-01f, 4.3528955e-04f,
+    -2.6333756e+00f, -1.1284588e-01f, 9.2306368e-02f,  -1.4584465e-01f,
+    -9.8003829e-01f, -8.1853099e-02f, 4.3528955e-04f,  -1.0313479e+00f,
+    -6.0844243e-01f, -5.8772981e-02f, 5.9872878e-01f,  -6.3945311e-01f,
+    2.7889737e-01f,  4.3528955e-04f,  -4.3594353e-03f, 7.7320230e-01f,
+    -3.1139882e-02f, -9.0527725e-01f, -2.0195818e-01f, 8.0879487e-02f,
+    4.3528955e-04f,  -2.1225788e-02f, 3.4976608e-01f,  3.0058688e-02f,
+    -1.6547097e+00f, 5.7853663e-01f,  -2.4616165e-01f, 4.3528955e-04f,
+    3.9255556e-01f,  3.2994020e-01f,  -8.2096547e-02f, -7.2169863e-03f,
+    5.0819004e-01f,  -6.0960871e-01f, 4.3528955e-04f,  -1.0141527e-01f,
+    9.8233062e-01f,  4.8593893e-03f,  -1.0525788e+00f, 4.0393576e-01f,
+    -8.3111404e-03f, 4.3528955e-04f,  -3.7638038e-01f, 1.2485307e+00f,
+    -4.6990685e-02f, -8.3900607e-01f, -3.7799808e-01f, -2.5249180e-01f,
+    4.3528955e-04f,  1.6465228e+00f,  -1.3082031e+00f, -3.0403731e-02f,
+    8.4443563e-01f,  6.6095126e-01f,  -2.3875806e-02f, 4.3528955e-04f,
+    -5.3227174e-01f, 7.4791506e-02f,  8.2121052e-02f,  -4.5901912e-01f,
+    -1.0037072e+00f, -2.0886606e-01f, 4.3528955e-04f,  -1.1895345e+00f,
+    2.7053397e+00f,  4.9947992e-02f,  -1.0490944e+00f, -2.5759271e-01f,
+    -9.9375071e-03f, 4.3528955e-04f,  -5.2512074e-01f, -1.1978335e+00f,
+    -3.5515487e-02f, 3.3485553e-01f,  -6.6308874e-01f, -1.8835375e-02f,
+    4.3528955e-04f,  -2.9846373e-01f, -3.7469918e-01f, -6.2433038e-02f,
+    2.0564352e-01f,  -3.1001776e-01f, -6.9941175e-01f, 4.3528955e-04f,
+    1.4412087e-01f,  3.9398068e-01f,  -4.3605398e-03f, -9.6136671e-01f,
+    3.4699216e-01f,  -3.3387709e-01f, 4.3528955e-04f,  9.0004724e-01f,
+    4.3466396e+00f,  -1.7010966e-02f, -9.0652692e-01f, 1.1844695e-01f,
+    -4.9140183e-03f, 4.3528955e-04f,  2.1525836e+00f,  -2.3640323e+00f,
+    9.3771614e-02f,  6.9751871e-01f,  4.8896772e-01f,  -3.3206567e-02f,
+    4.3528955e-04f,  -6.5681291e-01f, -1.1626377e+00f, 1.6823588e-02f,
+    6.1292183e-01f,  -4.9727377e-01f, -7.3625118e-02f, 4.3528955e-04f,
+    3.0889399e+00f,  -1.7847513e+00f, -1.8108279e-01f, 4.7052261e-01f,
+    7.3794258e-01f,  7.1605951e-02f,  4.3528955e-04f,  3.1459191e-01f,
+    9.8673105e-01f,  -1.9277580e-02f, -9.4081938e-01f, 2.2592145e-01f,
+    -1.2418746e-03f, 4.3528955e-04f,  -5.2789465e-02f, -3.2204080e-01f,
+    5.1925527e-03f,  9.0869290e-01f,  -6.4428222e-01f, -1.8813097e-01f,
+    4.3528955e-04f,  1.8455359e+00f,  6.9745862e-01f,  -1.2718292e-02f,
+    -4.1566870e-01f, 6.8618339e-01f,  -4.4232357e-02f, 4.3528955e-04f,
+    -4.9682930e-01f, 1.9522797e+00f,  2.8703390e-02f,  -4.4792947e-01f,
+    -2.2602636e-01f, 2.2362003e-02f,  4.3528955e-04f,  -3.4793615e+00f,
+    2.3711872e-01f,  -1.4545543e-01f, -8.3394885e-02f, -7.8745657e-01f,
+    -9.3304045e-02f, 4.3528955e-04f,  1.2784964e+00f,  -7.6302290e-01f,
+    7.2182991e-02f,  1.9082169e-01f,  8.5911638e-01f,  1.0819277e-01f,
+    4.3528955e-04f,  -5.5421162e-01f, 1.9772859e+00f,  8.0356188e-02f,
+    -9.6426272e-01f, 2.1338969e-01f,  4.3936344e-03f,  4.3528955e-04f,
+    5.6763339e-01f,  -7.8151935e-01f, -3.2130316e-01f, 6.4369994e-01f,
+    4.1616973e-01f,  -2.1497588e-01f, 4.3528955e-04f,  2.2931125e+00f,
+    -1.4712989e+00f, -8.0254532e-02f, 5.6852537e-01f,  7.7674639e-01f,
+    5.3321277e-03f,  4.3528955e-04f,  8.4126033e-03f,  -1.1700789e+00f,
+    -6.6257310e-03f, 9.8439240e-01f,  5.0111767e-03f,  2.5956127e-01f,
+    4.3528955e-04f,  4.0027924e+00f,  1.5303530e-01f,  2.6014443e-02f,
+    2.6190531e-02f,  9.3899882e-01f,  -2.6878801e-03f, 4.3528955e-04f,
+    -2.1070203e-01f, 2.0315614e-02f,  7.8653321e-02f,  -5.5834639e-01f,
+    -1.5306228e+00f, -1.9095647e-01f, 4.3528955e-04f,  1.2188442e-03f,
+    -5.8485001e-01f, -1.6234182e-01f, 1.0869372e+00f,  -4.2889737e-02f,
+    1.5446429e-01f,  4.3528955e-04f,  4.3049747e-01f,  -9.8857820e-02f,
+    -1.0185509e-01f, 5.4686821e-01f,  6.4180177e-01f,  2.5540575e-01f,
+
+    4.2524221e-04f,  -6.8952002e-02f, -3.7609130e-01f, 2.0454033e-01f,
+    4.6934392e-02f,  3.6518586e-01f,  -6.3908052e-01f, 4.2524221e-04f,
+    1.7167262e-03f,  2.7662572e-01f,  1.7233780e-02f,  1.1780310e-01f,
+    7.4727722e-02f,  -2.7824235e-01f, 4.2524221e-04f,  -6.4021356e-02f,
+    4.9878994e-01f,  1.1780857e-01f,  -7.2630882e-02f, -1.9749036e-01f,
+    4.1274959e-01f,  4.2524221e-04f,  -1.4642769e-01f, 7.2956882e-02f,
+    -2.1209341e-01f, -1.9561304e-01f, 4.3640116e-01f,  -1.4216131e-01f,
+    4.2524221e-04f,  4.4984859e-01f,  -2.0571905e-01f, 1.6579893e-01f,
+    2.3007728e-01f,  3.3259624e-01f,  -1.2255534e-01f, 4.2524221e-04f,
+    1.0123267e-01f,  -1.1069166e-01f, 1.2146676e-01f,  6.9276756e-01f,
+    1.5651067e-01f,  7.2201669e-02f,  4.2524221e-04f,  3.5509726e-01f,
+    -2.4750148e-01f, -7.0419729e-02f, -1.6315883e-01f, 2.7629051e-01f,
+    4.0912119e-01f,  4.2524221e-04f,  6.7211971e-02f,  3.6541705e-03f,
+    6.1872799e-02f,  -2.4400305e-02f, -2.8594831e-01f, 2.6267496e-01f,
+    4.2524221e-04f,  1.7564896e-02f,  2.2714512e-02f,  5.5567864e-02f,
+    1.6080794e-01f,  6.3173026e-01f,  -7.0765656e-01f, 4.2524221e-04f,
+    6.2095644e-03f,  1.6922535e-02f,  6.7964457e-02f,  -6.4950210e-01f,
+    1.1511780e-01f,  -2.3005176e-01f, 4.2524221e-04f,  8.1252515e-02f,
+    -2.4793835e-01f, 2.5017133e-02f,  1.0366057e-01f,  -1.0383766e+00f,
+    6.8862158e-01f,  4.2524221e-04f,  7.9731531e-03f,  6.2441554e-02f,
+    3.5850534e-01f,  -8.4335662e-02f, 2.3078813e-01f,  2.8442800e-01f,
+    4.2524221e-04f,  8.4318154e-02f,  6.3358635e-02f,  8.0232881e-02f,
+    7.4251097e-01f,  -5.9694689e-02f, -9.8565477e-01f, 4.2524221e-04f,
+    -3.5627842e-01f, 1.5056185e-01f,  1.2423660e-01f,  -3.0809689e-01f,
+    -5.7333690e-01f, 8.0326796e-02f,  4.2524221e-04f,  -8.0495151e-03f,
+    -1.0587189e-01f, -1.8965110e-01f, -8.8318896e-01f, 3.3843562e-01f,
+    2.1881117e-01f,  4.2524221e-04f,  1.4790270e-01f,  5.6889802e-02f,
+    -5.9076946e-02f, 1.6111375e-01f,  2.3636131e-01f,  -5.2197134e-01f,
+    4.2524221e-04f,  4.6059892e-01f,  3.8570845e-01f,  -2.4108456e-01f,
+    -5.6617850e-01f, 3.9318663e-01f,  2.6764247e-01f,  4.2524221e-04f,
+    2.6320845e-01f,  5.7858221e-02f,  -2.7922782e-01f, -5.6394571e-01f,
+    3.8956839e-01f,  1.2278712e-02f,  4.2524221e-04f,  -2.1918103e-01f,
+    -5.2948242e-01f, -2.0025180e-01f, -4.0323091e-01f, -5.6623662e-01f,
+    -1.9914013e-01f, 4.2524221e-04f,  -5.9552908e-02f, -1.0246649e-01f,
+    3.3934865e-02f,  1.0694876e+00f,  -2.3483194e-01f, 5.1456535e-01f,
+    4.2524221e-04f,  -3.0072188e-01f, -1.5119925e-01f, -9.4813794e-02f,
+    2.3947287e-01f,  -2.8111663e-02f, 4.7549266e-01f,  4.2524221e-04f,
+    -3.1408378e-01f, -2.4881051e-01f, -1.0178679e-01f, -3.5335216e-01f,
+    -3.3296376e-01f, 1.7537035e-01f,  4.2524221e-04f,  5.0441384e-02f,
+    -2.3857759e-01f, -2.0189323e-01f, 6.4591801e-01f,  7.4821287e-01f,
+    3.0161458e-01f,  4.2524221e-04f,  -2.1398225e-01f, 1.3716324e-01f,
+    2.6415381e-01f,  -1.0239993e-01f, 4.3141305e-02f,  3.9933646e-01f,
+    4.2524221e-04f,  -2.1833763e-02f, 7.7776663e-02f,  -1.1644596e-01f,
+    -1.3218959e-02f, -5.3083044e-01f, -2.2752643e-01f, 4.2524221e-04f,
+    5.9864126e-02f,  3.7901759e-02f,  2.4226917e-02f,  -1.1346813e-01f,
+    2.9795706e-01f,  2.2305934e-01f,  4.2524221e-04f,  -1.5093227e-01f,
+    1.9989584e-01f,  -6.6760153e-02f, -8.5909933e-01f, 1.0792204e+00f,
+    5.6337440e-01f,  4.2524221e-04f,  -1.2258115e-01f, -1.6773552e-01f,
+    1.1542997e-01f,  -2.4039291e-01f, -4.2407429e-01f, 9.4057155e-01f,
+    4.2524221e-04f,  -1.0204029e-01f, 4.7917057e-02f,  -1.3586305e-02f,
+    1.0611955e-02f,  -6.4236182e-01f, -4.9220425e-01f, 4.2524221e-04f,
+    -1.3242331e-01f, -1.5490770e-01f, -2.4436052e-01f, 7.8819454e-01f,
+    8.9990437e-01f,  -2.7850788e-02f, 4.2524221e-04f,  -1.1431516e-01f,
+    -5.7896734e-03f, -5.8673549e-02f, 4.0131390e-02f,  4.1823924e-02f,
+    3.5253352e-01f,  4.2524221e-04f,  1.3416216e-01f,  1.2450522e-01f,
+    -4.6916567e-02f, -1.1810165e-01f, 5.7470405e-01f,  4.6782512e-02f,
+    4.2524221e-04f,  9.1884322e-03f,  3.2225549e-02f,  -7.7325888e-02f,
+    -2.1032813e-01f, -4.8966500e-01f, 6.4191252e-01f,  4.2524221e-04f,
+    -2.1961327e-01f, -1.5659723e-01f, 1.2278610e-01f,  -7.4027401e-01f,
+    -6.3348526e-01f, -6.4378178e-01f, 4.2524221e-04f,  -8.8809431e-02f,
+    -1.0160245e-01f, -2.3898444e-01f, 1.1571468e-01f,  -1.5239573e-02f,
+    -7.1836734e-01f, 4.2524221e-04f,  -2.8333729e-02f, -1.2737048e-01f,
+    -1.8874502e-01f, 4.1093016e-01f,  -1.5388297e-01f, -9.9330693e-01f,
+    4.2524221e-04f,  1.3488932e-01f,  -2.8850915e-02f, -8.5983714e-03f,
+    -1.7177103e-01f, 2.4053304e-01f,  -6.3560623e-01f, 4.2524221e-04f,
+    -3.1490156e-01f, -9.9333093e-02f, 3.5978910e-01f,  6.6598135e-01f,
+    -3.3750072e-01f, -1.0837636e-01f, 4.2524221e-04f,  7.8173153e-02f,
+    1.5342808e-01f,  -7.4844666e-02f, 1.9755471e-01f,  7.4251711e-01f,
+    -1.9265547e-01f, 4.2524221e-04f,  5.4524943e-02f,  8.6015537e-02f,
+    7.9116998e-03f,  -3.3082482e-01f, 1.1510558e-01f,  -4.8080977e-02f,
+    4.2524221e-04f,  2.3899309e-01f,  2.0232114e-01f,  2.4308579e-01f,
+    -4.8312342e-01f, -7.6722562e-02f, -7.1023846e-01f, 4.2524221e-04f,
+    -1.1035525e-01f, 1.1003480e-01f,  7.8218743e-02f,  1.4598185e-01f,
+    2.8957045e-01f,  4.5391402e-01f,  4.2524221e-04f,  3.8056824e-01f,
+    -4.2662463e-01f, -2.9796240e-01f, -2.9642835e-01f, 2.7845275e-01f,
+    9.6103340e-02f,  4.2524221e-04f,  -2.1471562e-02f, -9.6082248e-02f,
+    6.3268065e-02f,  4.4057620e-01f,  -1.9100349e-01f, 4.3734275e-02f,
+    4.2524221e-04f,  1.6843402e-01f,  1.2867293e-02f,  -1.7205054e-01f,
+    -1.6690819e-01f, 4.0759605e-01f,  -1.2986995e-01f, 4.2524221e-04f,
+    1.0996082e-01f,  -6.6473335e-02f, 4.2397708e-01f,  -5.6338054e-01f,
+    4.0538439e-01f,  4.7354269e-01f,  4.2524221e-04f,  3.8981259e-01f,
+    -7.8386031e-02f, -1.2684372e-01f, 4.5999810e-01f,  1.4793024e-02f,
+    2.9288986e-01f,  4.2524221e-04f,  3.8427915e-02f,  -9.3180403e-02f,
+    5.2034128e-02f,  2.2621906e-01f,  2.4933131e-01f,  -2.6412728e-01f,
+    4.2524221e-04f,  1.7695948e-01f,  1.1208335e-01f,  9.4689289e-03f,
+    -4.7762734e-01f, 4.2272797e-01f,  -1.9553494e-01f, 4.2524221e-04f,
+    2.9530343e-01f,  5.4565635e-02f,  -9.3569167e-02f, -1.0310185e+00f,
+    -2.1791783e-01f, 1.1310533e-01f,  4.2524221e-04f,  3.6427479e-02f,
+    8.3433479e-02f,  -5.0965570e-02f, -7.0311046e-01f, -7.7300471e-01f,
+    7.8911895e-01f,  4.2524221e-04f,  -6.0537711e-02f, 2.0016704e-02f,
+    6.2623121e-02f,  -5.0709176e-01f, -6.9080782e-01f, -3.8370842e-01f,
+    4.2524221e-04f,  -2.4078569e-01f, -2.0172992e-01f, -1.7282113e-01f,
+    -1.9933814e-01f, -4.1384608e-01f, -4.2155632e-01f, 4.2524221e-04f,
+    1.7356554e-01f,  -8.2822353e-02f, 2.4565151e-01f,  2.4235701e-02f,
+    1.9959936e-01f,  -8.4004021e-01f, 4.2524221e-04f,  2.5406668e-01f,
+    -2.3104405e-02f, 8.9151785e-02f,  -1.5854710e-01f, 1.7603678e-01f,
+    4.9781209e-01f,  4.2524221e-04f,  -4.6918225e-02f, 3.1394951e-02f,
+    1.2196216e-01f,  5.3416461e-01f,  -7.8365993e-01f, 2.3617971e-01f,
+    4.2524221e-04f,  4.1943249e-01f,  -2.1520613e-01f, -2.9915211e-01f,
+    -4.2922956e-01f, 3.4326318e-01f,  -4.0416589e-01f, 4.2524221e-04f,
+    1.8558493e-02f,  2.3149431e-01f,  2.8412763e-02f,  -3.2613638e-01f,
+    -6.7272943e-01f, -2.7935442e-01f, 4.2524221e-04f,  6.7606665e-02f,
+    1.0590034e-01f,  -2.9134644e-02f, -2.8848764e-01f, 1.8802702e-01f,
+    -2.5352947e-02f, 4.2524221e-04f,  3.1923872e-01f,  2.0859796e-01f,
+    1.9689572e-01f,  -3.4045419e-01f, -1.1567620e-02f, -2.2331662e-01f,
+    4.2524221e-04f,  8.6090438e-02f,  -9.7899623e-02f, 3.7183642e-01f,
+    5.7801574e-01f,  -8.4642863e-01f, 3.7232456e-01f,  4.2524221e-04f,
+    -6.3343510e-02f, 5.1692825e-02f,  -2.2670483e-02f, 4.2227164e-01f,
+    -1.0418820e+00f, -4.3066531e-01f, 4.2524221e-04f,  7.7797174e-02f,
+    2.0468737e-01f,  -1.8630002e-02f, -2.6646578e-01f, 3.5000020e-01f,
+    1.7281543e-03f,  4.2524221e-04f,  1.6326034e-01f,  -7.6127653e-03f,
+    -1.9875813e-01f, 3.0400047e-01f,  -1.0095369e+00f, 3.0630016e-01f,
+    4.2524221e-04f,  -3.0587640e-01f, 3.6862275e-01f,  -1.6716866e-01f,
+    -1.5076877e-01f, 6.4900644e-02f,  -3.9979839e-01f, 4.2524221e-04f,
+    5.1980961e-02f,  -1.7389877e-02f, -6.5868706e-02f, 4.4816044e-01f,
+    -1.1290047e-01f, 1.0578583e-01f,  4.2524221e-04f,  -2.6579666e-01f,
+    1.5276420e-01f,  1.6454442e-01f,  -2.3063077e-01f, -1.1864688e-01f,
+    -2.7325454e-01f, 4.2524221e-04f,  2.3888920e-01f,  -1.0952530e-01f,
+    1.2845880e-02f,  6.3121682e-01f,  -1.2560226e-01f, -2.7487582e-01f,
+    4.2524221e-04f,  4.5389226e-03f,  3.1511687e-02f,  2.2977088e-02f,
+    4.9845091e-01f,  1.0308616e+00f,  6.6393840e-01f,  4.2524221e-04f,
+    -1.2475225e-01f, 1.9281661e-02f,  2.9971752e-01f,  3.3750951e-01f,
+    5.9152752e-01f,  -2.1105433e-02f, 4.2524221e-04f,  -2.1485806e-02f,
+    -6.7377828e-02f, 2.5713644e-03f,  4.6789891e-01f,  4.5696682e-01f,
+    -7.1609730e-01f, 4.2524221e-04f,  -1.0586022e-01f, 3.5893656e-02f,
+    2.2575684e-01f,  3.2815951e-01f,  1.2089105e+00f,  1.4042576e-01f,
+    4.2524221e-04f,  -1.2319917e-01f, -1.0005784e-02f, 1.5479188e-01f,
+    1.8208984e-01f,  1.2132756e+00f,  2.6527673e-01f,  4.2524221e-04f,
+    6.4620353e-02f,  1.7364240e-01f,  -1.4148856e-02f, 9.8386899e-02f,
+    -9.3257673e-02f, -4.5248473e-01f, 4.2524221e-04f,  2.1988168e-01f,
+    9.3818128e-02f,  2.6402268e-01f,  1.3119745e+00f,  8.3785437e-02f,
+    2.7858006e-02f,  4.2524221e-04f,  -1.4317329e-03f, 2.2498498e-02f,
+    -4.2581409e-03f, 7.6423578e-02f,  3.0879802e-01f,  -2.7642739e-01f,
+    4.2524221e-04f,  5.2082442e-02f,  -2.4966290e-02f, -3.3147499e-01f,
+    3.1459096e-01f,  -9.5654421e-02f, -4.9177298e-01f, 4.2524221e-04f,
+    2.1968150e-01f,  -3.1709429e-02f, -3.2633208e-02f, 6.6882968e-01f,
+    -8.7069683e-02f, -4.2155117e-01f, 4.2524221e-04f,  -1.5947688e-02f,
+    -6.6355400e-02f, -1.3427764e-01f, 8.1017509e-02f,  1.9732222e-02f,
+    9.7736377e-01f,  4.2524221e-04f,  3.3350714e-02f,  -2.5489935e-01f,
+    -4.5514282e-02f, 2.7353206e-01f,  9.3509305e-01f,  1.0290121e+00f,
+    4.2524221e-04f,  8.6571544e-02f,  -4.5660064e-02f, 5.3154297e-02f,
+    1.4696455e-01f,  -4.9930936e-01f, -5.4527204e-02f, 4.2524221e-04f,
+    -2.6918665e-01f, -2.2388337e-02f, 1.3400359e-01f,  -1.4872725e-01f,
+    4.6425454e-02f,  -8.6459154e-01f, 4.2524221e-04f,  -3.6714253e-01f,
+    4.7211602e-01f,  4.0126577e-02f,  -4.2214575e-01f, -3.5977527e-01f,
+    2.0702907e-01f,  4.2524221e-04f,  1.6364980e-01f,  4.1913200e-02f,
+    1.1654653e-01f,  3.3425164e-01f,  4.0906391e-01f,  4.2066461e-01f,
+    4.2524221e-04f,  -1.6987796e-01f, -8.7366281e-03f, -2.2486734e-01f,
+    -2.5333986e-02f, 1.3398515e-01f,  1.6617914e-01f,  4.2524221e-04f,
+    3.6583528e-02f,  -2.0342648e-01f, 2.4907716e-02f,  2.7443549e-01f,
+    -5.3054279e-01f, -2.1271352e-02f, 4.2524221e-04f,  -1.5638576e-01f,
+    -1.1497077e-01f, -2.6429644e-01f, 8.8159114e-02f,  -4.2751932e-01f,
+    4.1617098e-01f,  4.2524221e-04f,  -4.8269001e-01f, -2.9227877e-01f,
+    2.1283831e-03f,  -2.8166375e-01f, -8.0320311e-01f, -5.5873245e-02f,
+    4.2524221e-04f,  -3.0324167e-01f, 1.0270053e-01f,  -5.2782591e-02f,
+    2.4762978e-01f,  -5.2626616e-01f, 5.1518279e-01f,  4.2524221e-04f,
+    5.0096340e-02f,  -1.0615882e-01f, 1.0685217e-01f,  3.1090322e-01f,
+    5.4539001e-01f,  -7.7919763e-01f, 4.2524221e-04f,  6.8489499e-02f,
+    -8.5862644e-02f, 8.7295607e-02f,  1.1211764e+00f,  1.7104091e-01f,
+    -5.9566104e-01f, 4.2524221e-04f,  -3.1594849e-01f, 3.6219910e-01f,
+    9.6204855e-02f,  -3.6034283e-01f, -5.5798465e-01f, 3.6521727e-01f,
+    4.2524221e-04f,  8.9752123e-02f,  -3.7980074e-01f, 2.2659194e-01f,
+    2.5259364e-01f,  8.7990636e-01f,  -6.6328472e-01f, 4.2524221e-04f,
+    -1.2885086e-01f, 4.2518385e-02f,  -9.9296935e-02f, -2.9014772e-01f,
+    2.8919721e-01f,  7.2803092e-01f,  4.2524221e-04f,  1.0833747e-01f,
+    -2.3551908e-01f, -2.2371200e-01f, -6.8503207e-01f, 8.4255002e-02f,
+    -1.7699188e-01f, 4.2524221e-04f,  -4.5774442e-01f, -5.7774043e-01f,
+    -1.9628638e-01f, -1.6585727e-01f, -2.4805409e-01f, 3.2597375e-01f,
+    4.2524221e-04f,  9.4905041e-02f,  -1.2196866e-01f, -2.8854272e-01f,
+    1.2401120e-02f,  -5.5150861e-01f, -1.6573331e-01f, 4.2524221e-04f,
+    1.7654218e-01f,  2.8887981e-01f,  8.1515826e-02f,  -4.4433424e-01f,
+    -3.4858069e-01f, -7.5954390e-01f, 4.2524221e-04f,  2.0875847e-01f,
+    -3.4767810e-02f, -1.1624666e-01f, 5.1564693e-01f,  3.0314165e-01f,
+    8.9838400e-02f,  4.2524221e-04f,  -6.6830531e-02f, 6.5703589e-01f,
+    -1.4869122e-01f, -5.7415849e-01f, 1.4813814e-01f,  -8.1861876e-02f,
+    4.2524221e-04f,  -4.4457048e-02f, -1.5921470e-02f, -1.7754057e-02f,
+    -3.9143625e-01f, -6.3085490e-01f, -5.0749278e-01f, 4.2524221e-04f,
+    1.3718459e-01f,  1.7940737e-02f,  -2.0972039e-01f, -3.8703054e-01f,
+    3.6758363e-01f,  -4.0641344e-01f, 4.2524221e-04f,  -2.8808230e-01f,
+    -2.0762348e-01f, 1.0456783e-01f,  4.8344731e-01f,  -1.6193020e-01f,
+    2.6533803e-01f,  4.2524221e-04f,  -6.6829704e-02f, 6.8833500e-02f,
+    1.3597858e-02f,  3.2421193e-01f,  -5.3849036e-01f, 5.5469674e-01f,
+    4.2524221e-04f,  6.4109176e-02f,  1.7209695e-01f,  -1.2461232e-01f,
+    1.4659126e-02f,  5.3120416e-02f,  -7.5313765e-01f, 4.2524221e-04f,
+    1.8690982e-01f,  -8.1217997e-02f, -6.6295050e-02f, 3.9599022e-01f,
+    -1.9595018e-02f, 2.1561284e-01f,  4.2524221e-04f,  -1.6437256e-01f,
+    5.5488598e-02f,  3.7080717e-01f,  6.9631052e-01f,  -3.9775252e-01f,
+    -1.3562378e-01f, 4.2524221e-04f,  1.4495592e-01f,  3.1467380e-03f,
+    4.7463287e-02f,  -4.8221394e-01f, 3.0006620e-01f,  6.8734378e-01f,
+    4.2524221e-04f,  -2.4718483e-01f, 4.3802378e-01f,  -1.2592521e-01f,
+    -9.3917716e-01f, -3.4067336e-01f, -6.1952457e-02f, 4.2524221e-04f,
+    -3.0145645e-03f, -5.5502173e-02f, -6.6558704e-02f, 8.0767912e-01f,
+    -7.2791821e-01f, 3.4372488e-01f,  4.2524221e-04f,  1.0529807e-01f,
+    -2.1401968e-02f, 3.0527771e-01f,  -2.3833787e-01f, 4.1347948e-01f,
+    -1.7507052e-01f, 4.2524221e-04f,  -2.0485507e-01f, 1.6946118e-02f,
+    -1.1887775e-01f, -5.5250818e-01f, 8.3265829e-01f,  -1.0794708e+00f,
+    4.2524221e-04f,  -6.9180802e-02f, -1.3027902e-01f, -3.3495542e-02f,
+    -6.1051086e-02f, 4.4654012e-01f,  -9.2303656e-02f, 4.2524221e-04f,
+    6.2695004e-02f,  1.1709655e-01f,  7.4203797e-02f,  -2.8380197e-01f,
+    9.8839939e-01f,  4.0534791e-01f,  4.2524221e-04f,  -6.7415205e-03f,
+    -1.6664900e-01f, -6.5682314e-02f, 1.3035889e-02f,  4.5636165e-01f,
+    1.1176190e+00f,  4.2524221e-04f,  4.4184174e-02f,  -1.0161553e-01f,
+    1.1528383e-01f,  -1.0171146e-01f, -3.9852467e-01f, -1.7381568e-01f,
+    4.2524221e-04f,  -1.3380414e-01f, 2.4257090e-02f,  -2.1958955e-01f,
+    -3.3342477e-02f, -8.9707208e-01f, -4.0108163e-02f, 4.2524221e-04f,
+    1.6900148e-02f,  2.9698364e-02f,  7.4210748e-02f,  -9.5453638e-01f,
+    -6.0268533e-01f, -5.5909032e-01f, 4.2524221e-04f,  2.4844069e-02f,
+    1.1051752e-01f,  1.5278517e-01f,  1.8424262e-01f,  3.5749307e-01f,
+    1.0936087e-01f,  4.2524221e-04f,  -2.1159546e-03f, 9.1907848e-03f,
+    -2.7174723e-01f, -1.0244959e-01f, -3.3070275e-01f, 4.0042453e-02f,
+    4.2524221e-04f,  -4.2243101e-02f, -6.5984592e-02f, 6.5521769e-02f,
+    1.3259922e-01f,  9.9356227e-02f,  6.0295296e-01f,  4.2524221e-04f,
+    -3.7986684e-01f, -8.4376909e-02f, -4.6467561e-01f, -4.0422253e-02f,
+    3.8832929e-02f,  -1.3807257e-01f, 4.2524221e-04f,  -4.4804137e-02f,
+    1.9461249e-01f,  2.2816639e-01f,  9.9834325e-03f,  -8.2412779e-01f,
+    2.9902148e-01f,  4.2524221e-04f,  1.6407421e-01f,  1.8706313e-01f,
+    -5.6105852e-02f, -5.3491122e-01f, -3.3660775e-01f, 2.0109148e-01f,
+    4.2524221e-04f,  1.6713662e-01f,  -1.6991425e-01f, -1.0838299e-02f,
+    -3.7599638e-01f, 7.2962892e-01f,  3.9814565e-01f,  4.2524221e-04f,
+    -3.3015433e-01f, -1.8460733e-01f, -4.4423167e-02f, 1.0523954e-01f,
+    -5.9694952e-01f, -6.4566493e-02f, 4.2524221e-04f,  1.1639766e-01f,
+    -3.1477085e-01f, 4.5773551e-02f,  -8.9321405e-01f, 1.1365779e-01f,
+    -7.1910912e-01f, 4.2524221e-04f,  -1.0533749e-01f, -3.1784004e-01f,
+    -1.5684947e-01f, 3.9584538e-01f,  -2.2732932e-02f, -6.0109550e-01f,
+    4.2524221e-04f,  4.5312498e-02f,  -1.9773558e-02f, 3.4627101e-01f,
+    5.4061049e-01f,  2.3837478e-01f,  -9.5680386e-02f, 4.2524221e-04f,
+    1.9376430e-01f,  -3.5261887e-01f, -4.9361214e-02f, 4.4859773e-01f,
+    -1.3448930e-01f, -8.9390594e-01f, 4.2524221e-04f,  -3.8522416e-01f,
+    9.2452608e-02f,  -2.6977092e-01f, -7.6717246e-01f, -2.9236799e-01f,
+    8.6921006e-02f,  4.2524221e-04f,  -1.6161923e-01f, 4.8933748e-02f,
+    -7.2273888e-02f, 1.5900373e-02f,  -7.2096430e-02f, 2.5568214e-01f,
+    4.2524221e-04f,  7.4408822e-02f,  -9.5708661e-02f, 1.4543767e-01f,
+    4.2973867e-01f,  5.5417758e-01f,  -5.4315889e-01f, 4.2524221e-04f,
+    -1.2334914e-01f, -9.9942110e-02f, 6.0258025e-01f,  3.2969009e-02f,
+    -4.5631373e-01f, -3.1362407e-02f, 4.2524221e-04f,  -3.2407489e-02f,
+    1.2413250e-01f,  1.6033049e-01f,  -9.2026776e-01f, -4.0695891e-01f,
+    -6.5506846e-02f, 4.2524221e-04f,  1.9608337e-01f,  1.5339334e-01f,
+    -1.2951589e-03f, -4.1046813e-01f, 9.4732940e-02f,  2.2254905e-01f,
+    4.2524221e-04f,  3.7786314e-01f,  -9.9551268e-02f, 3.8753081e-02f,
+    2.7791873e-01f,  -5.2459854e-01f, 3.6625686e-01f,  4.2524221e-04f,
+    -2.6350039e-01f, 2.6152608e-01f,  -5.1885027e-01f, 3.9182296e-01f,
+    1.1261506e-01f,  4.1865278e-04f,  4.2524221e-04f,  -2.6930717e-01f,
+    8.7540634e-02f,  1.2011307e-01f,  -1.1454076e+00f, -2.5378546e-01f,
+    6.1277378e-01f,  4.2524221e-04f,  -5.1620595e-02f, -2.6162295e-02f,
+    1.9923788e-01f,  2.7361688e-01f,  6.8161465e-02f,  -2.4300206e-01f,
+    4.2524221e-04f,  8.3302639e-02f,  2.2153300e-01f,  7.5539924e-02f,
+    -6.4125758e-01f, -7.7184010e-01f, -5.9240508e-01f, 4.2524221e-04f,
+    -3.0167353e-01f, 1.0594812e-02f,  1.2207054e-01f,  4.2790112e-01f,
+    -7.3408598e-01f, -3.9747646e-01f, 4.2524221e-04f,  -1.3518098e-01f,
+    -1.1491226e-01f, 4.1219320e-02f,  6.6870731e-01f,  -5.6439346e-01f,
+    4.0781486e-01f,  4.2524221e-04f,  -2.2646338e-01f, -3.0869287e-01f,
+    1.9442609e-01f,  -8.5085193e-03f, -6.7781836e-01f, -1.4396685e-01f,
+    4.2524221e-04f,  2.3570412e-01f,  1.1237728e-01f,  4.0442336e-02f,
+    -3.9925253e-01f, -1.6827437e-01f, 2.5520343e-01f,  4.2524221e-04f,
+    1.9304930e-01f,  1.1386839e-01f,  -8.5760280e-03f, -6.7270681e-02f,
+    -1.5150026e+00f, 6.6858315e-01f,  4.2524221e-04f,  -3.5064521e-01f,
+    -3.4985831e-01f, -3.5266012e-02f, -4.9565598e-01f, 1.3284029e-01f,
+    6.4472258e-02f,  4.2524221e-04f,  6.4109452e-02f,  -5.6340277e-02f,
+    -1.0794429e-02f, 2.2326846e-01f,  6.3473828e-02f,  -5.3538460e-02f,
+    4.2524221e-04f,  -3.9694209e-02f, -1.2667970e-01f, 2.3774163e-01f,
+    -4.6629366e-01f, -8.2533091e-01f, 6.1826462e-01f,  4.2524221e-04f,
+    8.5494265e-02f,  4.6677209e-02f,  -2.6996067e-01f, 7.4071027e-02f,
+    -1.5797757e-01f, 8.9741655e-02f,  4.2524221e-04f,  1.4822495e-01f,
+    2.2652625e-01f,  -4.8856965e-01f, -4.7975492e-01f, 4.9277475e-01f,
+    1.3168377e-01f,  4.2524221e-04f,  2.2816645e-01f,  -2.3273047e-02f,
+    -3.2374825e-02f, 9.7304344e-01f,  1.0055114e+00f,  2.1530831e-01f,
+    4.2524221e-04f,  8.3597168e-02f,  -1.3374551e-01f, -1.2723055e-01f,
+    -4.4947600e-01f, -3.5162202e-01f, -3.4399763e-02f, 4.2524221e-04f,
+    1.6541488e-03f,  -1.3681918e-01f, -4.1941923e-01f, 2.8933066e-01f,
+    -1.1583021e-02f, -5.3825384e-01f, 4.2524221e-04f,  2.9779421e-02f,
+    -1.5177579e-01f, 9.4169438e-02f,  4.4210202e-01f,  7.0079613e-01f,
+    -2.4269655e-01f, 4.2524221e-04f,  3.2962313e-01f,  1.6373262e-01f,
+    -1.5794045e-01f, -3.6219120e-01f, -4.7019762e-01f, 5.4578936e-01f,
+    4.2524221e-04f,  2.5949749e-01f,  1.8039217e-02f,  -1.1556581e-01f,
+    1.2094127e-01f,  4.5777643e-01f,  4.9251959e-01f,  4.2524221e-04f,
+    -5.6016678e-04f, 2.2403972e-02f,  -1.2018181e-01f, -8.2266659e-01f,
+    5.3497875e-01f,  -5.6298089e-01f, 4.2524221e-04f,  1.2481754e-01f,
+    -6.5662614e-03f, 5.3280041e-02f,  1.0728637e-01f,  -3.6629236e-01f,
+    -7.7740186e-01f, 4.2524221e-04f,  -4.1662586e-01f, 6.2680237e-02f,
+    9.7843848e-02f,  9.7386146e-01f,  3.8152301e-01f,  -2.5823554e-01f,
+    4.2524221e-04f,  2.1547250e-01f,  -1.2857819e-01f, -7.6247320e-02f,
+    -5.1177174e-01f, 3.1464252e-01f,  -6.8949533e-01f, 4.2524221e-04f,
+    2.9243115e-01f,  1.8561119e-01f,  -1.4730722e-01f, 3.0295816e-01f,
+    -3.3570644e-01f, -6.4829089e-02f, 4.2524221e-04f,  -2.2853667e-01f,
+    -2.5666663e-03f, 3.2791372e-02f,  5.3857273e-01f,  2.5546068e-01f,
+    6.9839621e-01f,  4.2524221e-04f,  -8.5519083e-02f, 2.3358732e-01f,
+    -3.0836293e-01f, 4.0918893e-01f,  1.4886762e-01f,  -3.0877927e-01f,
+    4.2524221e-04f,  -5.8168643e-03f, 2.1029846e-01f,  -2.9014656e-02f,
+    -2.0898664e-01f, -5.5743361e-01f, -4.5692864e-01f, 4.2524221e-04f,
+    -3.2677907e-01f, -1.0963698e-01f, -3.0066803e-01f, -3.7513415e-03f,
+    -1.5595903e-01f, 3.7734365e-01f,  4.2524221e-04f,  -1.3074595e-01f,
+    5.1295745e-01f,  3.5618369e-02f,  -1.7757949e-01f, -2.7773422e-01f,
+    3.9297932e-01f,  4.2524221e-04f,  -4.6054059e-01f, 6.0361652e-03f,
+    4.3036997e-02f,  3.8986228e-02f,  -8.3808303e-02f, 1.3503957e-01f,
+    4.2524221e-04f,  6.3202726e-03f,  -6.9838986e-02f, 1.5222572e-01f,
+    7.8630304e-01f,  2.6035765e-01f,  1.9565882e-01f,  4.2524221e-04f,
+    2.2549452e-01f,  -2.9688054e-01f, -2.7452132e-01f, -3.4705338e-01f,
+    3.6365744e-02f,  -1.0018203e-01f, 4.2524221e-04f,  1.5116841e-01f,
+    1.1157162e-01f,  1.7717762e-01f,  9.5377460e-02f,  4.2657778e-01f,
+    7.9067266e-01f,  4.2524221e-04f,  1.1627000e-01f,  3.1979695e-01f,
+    -2.3524921e-02f, -1.9304131e-01f, -5.6617779e-01f, 4.6106350e-01f,
+    4.2524221e-04f,  1.4094487e-01f,  -1.9466771e-02f, -1.7018557e-01f,
+    -2.9211339e-01f, 3.1522620e-01f,  6.0243982e-01f,  4.2524221e-04f,
+    -3.0885851e-01f, 2.9579160e-01f,  1.9645715e-01f,  -7.4288589e-01f,
+    3.8729620e-01f,  -8.1753030e-02f, 4.2524221e-04f,  -4.9316991e-02f,
+    -6.7639120e-02f, 2.5503930e-02f,  1.2886477e-01f,  -4.2468214e-01f,
+    -4.2489755e-01f, 4.2524221e-04f,  1.0325251e-01f,  -1.2351098e-02f,
+    1.7995405e-01f,  -2.1645944e-01f, 1.1531074e-01f,  3.6774522e-01f,
+    4.2524221e-04f,  3.5494290e-02f,  1.3159359e-02f,  -8.9783361e-03f,
+    1.7681575e-01f,  5.7864314e-01f,  8.8688540e-01f,  4.2524221e-04f,
+    3.5579283e-02f,  -7.3573656e-02f, -4.6684593e-02f, 1.5158363e-01f,
+    2.5255179e-01f,  4.2681909e-01f,  4.2524221e-04f,  -4.1004341e-02f,
+    1.8314843e-01f,  -6.8004340e-02f, -6.4569753e-01f, -2.4601080e-01f,
+    -3.1736583e-01f, 4.2524221e-04f,  -3.5372970e-01f, -5.9734895e-03f,
+    -2.8878167e-01f, -3.8437065e-01f, 1.7586154e-01f,  4.8325151e-01f,
+    4.2524221e-04f,  2.8341490e-01f,  -1.9644819e-01f, -4.4990307e-01f,
+    -2.3372483e-01f, 1.8916056e-01f,  6.2253021e-02f,  4.2524221e-04f,
+    -7.9060040e-02f, 1.5312298e-01f,  -1.0657817e-01f, -6.4908840e-02f,
+    -1.1005557e-01f, -7.5388640e-01f, 4.2524221e-04f,  2.0811087e-01f,
+    -1.9149394e-01f, 6.8917416e-02f,  -6.9214320e-01f, 5.5273730e-01f,
+    -5.6367290e-01f, 4.2524221e-04f,  -1.6809903e-01f, 5.8745518e-02f,
+    6.9941558e-02f,  -6.0666478e-01f, -6.5189815e-01f, 9.6965067e-02f,
+    4.2524221e-04f,  2.8204435e-01f,  -2.8034040e-01f, -7.1355954e-02f,
+    5.7155037e-01f,  -4.7989607e-01f, -7.2021770e-01f, 4.2524221e-04f,
+    -9.9452965e-02f, 4.5155536e-02f,  -2.4321860e-01f, 5.0501686e-01f,
+    -6.7397219e-01f, 1.7940566e-01f,  4.2524221e-04f,  -4.1623276e-02f,
+    3.9544967e-01f,  1.3260084e-01f,  -7.2416043e-01f, 1.4999984e-01f,
+    3.2439882e-01f,  4.2524221e-04f,  2.0130565e-02f,  1.2174799e-01f,
+    1.0116580e-01f,  1.9213442e-02f,  4.4725251e-01f,  -9.9276684e-02f,
+    4.2524221e-04f,  -1.0185787e-02f, -1.1597388e-01f, -6.3543066e-02f,
+    7.0375061e-01f,  5.4625505e-01f,  1.1020880e-02f,  4.2524221e-04f,
+    -1.4459246e-01f, -4.2153552e-02f, 5.1556714e-03f,  -1.7952865e-01f,
+    -1.4147119e-01f, -1.2319133e-01f, 4.2524221e-04f,  3.1651965e-01f,
+    1.5370397e-01f,  -1.2385482e-01f, 2.6936245e-01f,  5.1711929e-01f,
+    6.8931890e-01f,  4.2524221e-04f,  -1.8418087e-01f, 1.1000612e-01f,
+    -4.1877508e-02f, 4.4682097e-01f,  -1.1498260e+00f, 4.1496921e-01f,
+    4.2524221e-04f,  -1.7385487e-02f, -1.2207379e-02f, -1.0904098e-01f,
+    6.5351778e-01f,  5.2470589e-01f,  -6.7526615e-01f, 4.2524221e-04f,
+    7.6974042e-02f,  -7.6170996e-02f, 4.1331150e-02f,  4.8798278e-01f,
+    -1.9912766e-01f, 8.6295828e-03f,  4.2524221e-04f,  -1.4817707e-01f,
+    -2.0577714e-01f, -2.1492377e-02f, 2.4804904e-01f,  -1.2062914e-01f,
+    1.0923308e+00f,  4.2524221e-04f,  2.2829910e-01f,  -8.7852478e-02f,
+    -2.1651746e-01f, -4.4923654e-01f, 2.0100503e-01f,  -6.6667879e-01f,
+    4.2524221e-04f,  -4.8959386e-02f, -1.7829145e-01f, -2.3248585e-01f,
+    3.1803364e-01f,  3.5625470e-01f,  -2.5345606e-01f, 4.2524221e-04f,
+    1.6019389e-01f,  -3.7726101e-02f, 2.0012274e-02f,  4.9065647e-01f,
+    -7.5336702e-02f, 4.2830771e-01f,  4.2524221e-04f,  9.2950560e-02f,
+    8.1110984e-02f,  -2.3080249e-01f, -4.1963845e-01f, 3.9410618e-01f,
+    2.6502368e-01f,  4.2524221e-04f,  -3.6329120e-02f, -2.4835167e-02f,
+    -1.0468025e-01f, 1.9597606e-01f,  7.7190138e-02f,  -1.2021227e-02f,
+    4.2524221e-04f,  -1.3207236e-01f, 4.9700566e-02f,  -9.6392229e-02f,
+    6.9591385e-01f,  -5.2213931e-01f, 6.6702977e-02f,  4.2524221e-04f,
+    -2.0891565e-01f, -1.0401086e-01f, -3.2914687e-02f, 2.0268060e-01f,
+    3.7300891e-01f,  -3.3493122e-01f, 4.2524221e-04f,  1.2298333e-02f,
+    -9.9019654e-02f, -2.2296559e-02f, 7.6882094e-01f,  4.8216751e-01f,
+    -5.0929153e-01f, 4.2524221e-04f,  5.1383042e-01f,  -3.6587961e-02f,
+    -7.9039536e-02f, -2.1929415e-02f, 4.9749163e-01f,  -7.5092280e-01f,
+    4.2524221e-04f,  6.7488663e-02f,  -1.5047796e-01f, -1.4453510e-02f,
+    9.8474354e-02f,  -1.2553598e-01f, 3.9576173e-01f,  4.2524221e-04f,
+    1.1320779e-01f,  4.3312490e-01f,  2.7788210e-01f,  3.5148668e-01f,
+    6.7258972e-01f,  3.2266015e-01f,  4.2524221e-04f,  2.8387174e-01f,
+    -2.8136987e-03f, 2.3146036e-01f,  7.0104808e-01f,  7.3719531e-01f,
+    6.8759960e-01f,  4.2524221e-04f,  5.7004183e-04f,  1.5941652e-02f,
+    1.1747324e-01f,  -7.6000273e-01f, -8.0573308e-01f, -3.8474363e-01f,
+    4.2524221e-04f,  1.3412678e-01f,  3.7177584e-01f,  -2.1013385e-01f,
+    2.6601321e-01f,  -2.0963144e-02f, -2.9721808e-01f, 4.2524221e-04f,
+    2.1684797e-02f,  -2.6148316e-02f, 2.8448166e-02f,  9.2044830e-02f,
+    4.1631389e-01f,  -3.9086950e-01f, 4.2524221e-04f,  1.7701186e-01f,
+    -1.3335569e-01f, -3.6527786e-02f, -1.4598356e-01f, -7.9653859e-02f,
+    -1.4612840e-01f, 4.2524221e-04f,  -7.9964489e-02f, -7.2931051e-02f,
+    -7.5731846e-03f, -5.6401604e-01f, 1.2140471e+00f,  2.5044760e-01f,
+    4.2524221e-04f,  5.0528418e-02f,  -1.8493372e-01f, -6.1973616e-02f,
+    1.0893459e+00f,  -7.3226017e-01f, -2.1861200e-01f, 4.2524221e-04f,
+    3.4899175e-01f,  -2.5673649e-01f, 2.3801270e-01f,  7.6705992e-02f,
+    2.3739794e-01f,  -2.2271127e-01f, 4.2524221e-04f,  -7.7574551e-02f,
+    -3.0072361e-01f, 8.9991860e-02f,  6.6169918e-01f,  7.5497506e-03f,
+    6.2827820e-01f,  4.2524221e-04f,  -4.1395541e-02f, -7.8363165e-02f,
+    -8.3268642e-02f, -3.6674482e-01f, 7.7186143e-01f,  -1.0884032e+00f,
+    4.2524221e-04f,  9.6079461e-02f,  1.9487463e-02f,  2.3446827e-01f,
+    -1.0828437e+00f, -1.0212445e-01f, 9.9640623e-02f,  4.2524221e-04f,
+    1.4852007e-01f,  1.7112080e-03f,  3.8287804e-02f,  4.6748403e-01f,
+    1.6748184e-01f,  -8.9558132e-02f, 4.2524221e-04f,  1.4533061e-01f,
+    1.1604913e-01f,  3.8661499e-02f,  4.3679410e-01f,  3.2537764e-01f,
+    -1.6830467e-01f, 4.2524221e-04f,  6.3480716e-03f,  -2.9074901e-01f,
+    1.9355851e-01f,  2.4606030e-01f,  -4.5717901e-01f, 1.7724554e-01f,
+    4.2524221e-04f,  3.8538933e-02f,  1.5341087e-01f,  -2.1069755e-03f,
+    -1.3919342e-01f, -7.7286698e-03f, -2.1324106e-01f, 4.2524221e-04f,
+    -1.9423309e-01f, -2.7765973e-02f, 7.2532348e-02f,  -9.3437082e-01f,
+    -8.2011551e-01f, -3.7270465e-01f, 4.2524221e-04f,  -3.7831109e-02f,
+    -1.2140978e-01f, 8.3114251e-02f,  5.6028736e-01f,  -6.1968172e-01f,
+    -1.3356548e-02f, 4.2524221e-04f,  -1.3984148e-01f, -1.1420244e-01f,
+    -9.0169579e-02f, 5.0556421e-01f,  3.6176574e-01f,  -2.8551257e-01f,
+    4.2524221e-04f,  5.1702183e-01f,  2.4532214e-01f,  -5.3291619e-02f,
+    5.1580917e-02f,  9.9806339e-02f,  1.5374357e-01f,  4.2524221e-04f,
+    4.1164238e-02f,  3.4978740e-02f,  -2.0140600e-01f, -1.0250385e-01f,
+    -1.9244492e-01f, 1.8400574e-01f,  4.2524221e-04f,  1.2606457e-01f,
+    3.7513068e-01f,  -6.0696520e-02f, 1.3621079e-02f,  -3.0291584e-01f,
+    3.3647969e-01f,  4.2524221e-04f,  -7.8076832e-02f, 8.4872216e-02f,
+    4.0365901e-02f,  3.7071791e-01f,  -5.9098870e-01f, 3.2774529e-01f,
+    4.2524221e-04f,  -2.3923574e-01f, -1.9211575e-01f, -1.7924082e-01f,
+    1.1655916e-01f,  -8.9026643e-03f, 7.0101243e-01f,  4.2524221e-04f,
+    2.3605846e-01f,  -1.0494024e-01f, -2.4913140e-02f, 1.1304358e-01f,
+    6.5852076e-01f,  5.3815949e-01f,  4.2524221e-04f,  1.5325595e-01f,
+    -4.6264112e-01f, -2.3033744e-01f, -3.9882928e-01f, 1.7055394e-01f,
+    2.3903577e-01f,  4.2524221e-04f,  9.9315541e-03f,  -1.3098700e-01f,
+    -1.4456044e-01f, 6.4630371e-01f,  7.7154741e-02f,  -3.8918430e-01f,
+    4.2524221e-04f,  -1.3281367e-02f, 1.8642080e-01f,  -6.7488782e-02f,
+    -5.8416975e-01f, 2.6503220e-01f,  6.2699541e-02f,  4.2524221e-04f,
+    1.5622652e-01f,  2.2385602e-01f,  -2.1002635e-01f, -1.0025834e+00f,
+    -1.3972777e-01f, -5.0823522e-01f, 4.2524221e-04f,  -5.7256967e-02f,
+    1.1900938e-02f,  6.6375956e-02f,  8.4001499e-01f,  3.4220794e-01f,
+    1.5207663e-01f,  4.2524221e-04f,  1.2499033e-01f,  1.8016313e-01f,
+    1.4031498e-01f,  2.2304562e-01f,  4.9709120e-01f,  -5.1419491e-01f,
+    4.2524221e-04f,  -2.4887011e-03f, 2.4914053e-01f,  6.9757082e-02f,
+    -3.2718769e-01f, 1.4410229e-01f,  6.2968469e-01f,  4.2524221e-04f,
+    -2.1348311e-01f, -1.4920866e-01f, 3.5942373e-01f,  -3.3802181e-01f,
+    -6.3084590e-01f, -3.5703820e-01f, 4.2524221e-04f,  -1.3208719e-01f,
+    -4.3626528e-02f, 1.1525477e-01f,  -8.9622033e-01f, -5.2570760e-01f,
+    7.1209446e-02f,  4.2524221e-04f,  2.0180137e-01f,  3.0973798e-01f,
+    -4.7396217e-02f, 8.0733806e-02f,  -4.7801504e-01f, 1.2905307e-01f,
+    4.2524221e-04f,  -3.9405990e-02f, -1.3421042e-01f, 2.1364555e-01f,
+    1.1934844e-01f,  4.1275540e-01f,  -7.2598690e-01f, 4.2524221e-04f,
+    3.0317783e-01f,  1.5446717e-01f,  1.8932924e-01f,  1.7827491e-01f,
+    -5.5765957e-01f, 8.5686105e-01f,  4.2524221e-04f,  9.7126581e-02f,
+    -3.2171151e-01f, 1.4782944e-01f,  1.8760729e-01f,  3.6745262e-01f,
+    -7.9939204e-01f, 4.2524221e-04f,  1.2204078e-01f,  1.7390806e-02f,
+    2.5008461e-02f,  7.7841687e-01f,  6.4786148e-01f,  -4.6705741e-01f,
+    4.2524221e-04f,  -4.2586967e-01f, -1.2234707e-01f, -1.7680998e-01f,
+    1.1388376e-01f,  2.5348544e-01f,  -4.4659165e-01f, 4.2524221e-04f,
+    5.0176810e-02f,  2.9768664e-01f,  -4.9092501e-02f, -3.5374787e-01f,
+    -1.0155331e+00f, -4.5657374e-02f, 4.2524221e-04f,  -5.8098711e-02f,
+    -7.4126154e-02f, 1.5455529e-01f,  -5.5758113e-01f, -5.7496008e-02f,
+    -3.1105158e-01f, 4.2524221e-04f,  1.5905772e-01f,  -5.2595858e-02f,
+    4.3390177e-02f,  -2.4082197e-01f, 1.0542246e-01f,  5.6913577e-02f,
+    4.2524221e-04f,  6.3337363e-02f,  -5.2784737e-02f, -7.1843952e-02f,
+    1.8084645e-01f,  5.8992529e-01f,  6.9003922e-01f,  4.2524221e-04f,
+    -1.1659018e-02f, -3.1661659e-02f, 2.1552466e-01f,  3.8084796e-01f,
+    -7.5515735e-01f, 1.0805442e-01f,  4.2524221e-04f,  -6.7320108e-02f,
+    4.2530239e-01f,  -8.3224047e-03f, 2.5150040e-01f,  3.4304920e-01f,
+    5.3361142e-01f,  4.2524221e-04f,  -1.3554615e-01f, -6.2619518e-03f,
+    -9.4313443e-02f, -7.6799446e-01f, -4.6307662e-01f, -1.0057564e+00f,
+    4.2524221e-04f,  3.8533989e-02f,  6.1796192e-02f,  8.6112045e-02f,
+    -4.8534065e-01f, 5.1081574e-01f,  -5.8071470e-01f, 4.2524221e-04f,
+    -1.5230169e-02f, -1.2033883e-01f, 7.3942550e-02f,  4.6739280e-01f,
+    8.4132425e-02f,  1.6251507e-01f,  4.2524221e-04f,  1.7331967e-02f,
+    -1.3612761e-01f, 1.5314302e-01f,  -1.4125380e-01f, -2.9499152e-01f,
+    -2.2088945e-01f, 4.2524221e-04f,  3.7615474e-02f,  -1.0014044e-01f,
+    2.0233028e-02f,  7.9775847e-02f,  6.8863159e-01f,  1.6004965e-02f,
+    4.2524221e-04f,  -9.6063040e-02f, 3.0204907e-01f,  -9.4360553e-02f,
+    -4.8655292e-01f, -6.1724377e-01f, -9.5279491e-01f, 4.2524221e-04f,
+    2.4641979e-02f,  2.7688531e-02f,  3.5698675e-02f,  7.2061479e-01f,
+    5.7431215e-01f,  -2.3499139e-01f, 4.2524221e-04f,  -2.3308350e-01f,
+    -1.5859704e-01f, 1.6264288e-01f,  -5.4998243e-01f, -8.7624407e-01f,
+    -2.4391791e-01f, 4.2524221e-04f,  2.0213775e-02f,  -8.3087897e-03f,
+    7.2641168e-03f,  -2.6261470e-01f, 8.9763856e-01f,  -2.9689264e-01f,
+    4.2524221e-04f,  -1.3720414e-01f, 3.9747078e-02f,  3.9863430e-02f,
+    -9.9515754e-01f, -4.1642633e-01f, -2.7768940e-01f, 4.2524221e-04f,
+    4.1457537e-01f,  -1.5103568e-01f, -4.7678750e-02f, 6.0775268e-01f,
+    6.3027298e-01f,  -8.2766257e-02f, 4.2524221e-04f,  -9.1587752e-02f,
+    2.0771132e-01f,  -1.1949047e-01f, -1.0162098e+00f, 6.4729214e-01f,
+    -2.8647608e-01f, 4.2524221e-04f,  6.9776617e-02f,  -1.4391021e-01f,
+    6.6905238e-02f,  4.4330075e-01f,  -5.4359299e-01f, 5.8366980e-02f,
+    4.2524221e-04f,  -2.1080155e-02f, 1.0876700e-01f,  -1.8273705e-01f,
+    -2.7334785e-01f, 1.2370202e-02f,  -5.0732791e-01f, 4.2524221e-04f,
+    2.9365107e-01f,  -3.7552178e-02f, 1.7366202e-01f,  3.7093323e-01f,
+    5.1931971e-01f,  2.2042035e-01f,  4.2524221e-04f,  -5.8714446e-02f,
+    -1.1625898e-01f, 8.9958400e-02f,  9.4603442e-02f,  -6.6513252e-01f,
+    -3.3096021e-01f, 4.2524221e-04f,  1.7270938e-01f,  -1.3684744e-01f,
+    -2.3963401e-02f, 5.1071239e-01f,  -5.2210022e-02f, 2.0341723e-01f,
+    4.2524221e-04f,  4.3902349e-02f,  5.8340929e-02f,  -1.8696614e-01f,
+    -3.8711539e-01f, 4.6378964e-01f,  -3.5242509e-02f, 4.2524221e-04f,
+    -2.2016709e-01f, -4.1709796e-02f, -1.2825581e-01f, 2.8010187e-01f,
+    8.4135972e-02f,  -3.2970226e-01f, 4.2524221e-04f,  4.4807252e-02f,
+    -3.1309262e-02f, 5.5173505e-02f,  3.5304120e-01f,  4.7825992e-01f,
+    -6.9327480e-01f, 4.2524221e-04f,  2.6006943e-01f,  3.9229229e-01f,
+    4.1401561e-02f,  2.5688058e-01f,  4.6096367e-01f,  -3.8301066e-02f,
+    4.2524221e-04f,  -5.7207685e-02f, 2.1041496e-01f,  -5.5592977e-02f,
+    7.3871851e-01f,  7.6392311e-01f,  5.5508763e-01f,  4.2524221e-04f,
+    2.0028868e-01f,  1.7377455e-02f,  -1.7383717e-02f, -1.0210022e-01f,
+    1.0636880e-01f,  9.4883746e-01f,  4.2524221e-04f,  -2.3191158e-01f,
+    1.7112093e-01f,  -5.7223786e-02f, 1.4026723e-02f,  -2.8560868e-01f,
+    -3.1835638e-02f, 4.2524221e-04f,  3.2962020e-02f,  7.8223407e-02f,
+    -1.3360938e-01f, -1.5919517e-01f, 3.3523160e-01f,  -8.9049095e-01f,
+    4.2524221e-04f,  6.5701969e-02f,  -2.1277949e-01f, 2.2916125e-01f,
+    3.0556580e-01f,  3.8131914e-01f,  -1.8459332e-01f, 4.2524221e-04f,
+    1.6372159e-01f,  1.3252127e-01f,  3.3026242e-01f,  6.6534467e-02f,
+    5.8466011e-01f,  -2.1187198e-01f, 4.2524221e-04f,  -2.0388210e-02f,
+    -2.6837876e-01f, -1.3936328e-02f, 5.5595392e-01f,  -1.9173568e-01f,
+    -3.1564653e-02f, 4.2524221e-04f,  4.2142672e-03f,  4.5444127e-02f,
+    -1.9033318e-02f, 2.6706985e-01f,  5.0933296e-03f,  -6.9982624e-01f,
+    4.2524221e-04f,  1.3599768e-01f,  -1.2645385e-01f, 5.4887198e-02f,
+    3.5913065e-02f,  -1.9649075e-01f, 3.3240259e-01f,  4.2524221e-04f,
+    1.4553209e-01f,  1.5071960e-02f,  -3.5280336e-02f, -1.2737115e-01f,
+    -8.2368088e-01f, -5.0747889e-01f, 4.2524221e-04f,  5.6710010e-03f,
+    4.6061239e-01f,  -2.5774138e-02f, 9.0305610e-03f,  -4.3211180e-01f,
+    -2.6158375e-01f, 4.2524221e-04f,  -6.4997308e-02f, 1.2228046e-01f,
+    -1.1081608e-01f, 2.5118258e-02f,  -5.0499208e-02f, 4.2089400e-01f,
+    4.2524221e-04f,  9.8428808e-02f,  9.2591822e-02f,  -1.7282183e-01f,
+    -4.8170805e-01f, -5.3339947e-02f, -5.6675595e-01f, 4.2524221e-04f,
+    -8.4237829e-02f, 1.4253823e-01f,  4.9275521e-02f,  -2.6992768e-01f,
+    -1.0569313e+00f, -9.4031647e-02f, 4.2524221e-04f,  -3.6385587e-01f,
+    1.5330490e-01f,  -4.9633920e-02f, 5.4262120e-01f,  3.7485160e-02f,
+    2.3123855e-03f,  4.2524221e-04f,  6.8289131e-02f,  2.2379410e-01f,
+    1.2773418e-01f,  -6.0800686e-02f, -1.1601755e-01f, 7.9482615e-02f,
+    4.2524221e-04f,  -3.2236850e-01f, 9.3640193e-02f,  2.2959833e-01f,
+    -5.3192180e-01f, -1.7132016e-01f, -8.4394589e-02f, 4.2524221e-04f,
+    3.8027413e-02f,  3.0569202e-01f,  -1.0576937e-01f, -4.3119910e-01f,
+    -3.3379223e-02f, 4.6473461e-01f,  4.2524221e-04f,  -8.8825256e-02f,
+    1.2526524e-01f,  -1.2704808e-01f, -1.5238588e-01f, 2.9670548e-02f,
+    2.7259463e-01f,  4.2524221e-04f,  2.0480262e-01f,  8.0929454e-03f,
+    -1.4154667e-02f, 2.3045730e-02f,  1.9490622e-01f,  5.9769058e-01f,
+    4.2524221e-04f,  -5.8878306e-02f, -1.4916752e-01f, -5.9504360e-02f,
+    -9.8221682e-02f, 5.7103390e-01f,  2.3102944e-01f,  4.2524221e-04f,
+    -1.7225789e-01f, 1.6756587e-01f,  -3.4342483e-01f, 4.1942871e-01f,
+    -2.2000684e-01f, 5.9689343e-01f,  4.2524221e-04f,  4.9882624e-01f,
+    -5.2865523e-01f, 4.1927774e-02f,  -2.8362114e-02f, 1.7950779e-01f,
+    -1.0107930e-01f, 4.2524221e-04f,  4.3928962e-02f,  -5.0005370e-01f,
+    8.7134331e-02f,  2.9411346e-01f,  -6.6736117e-03f, -1.4562376e-01f,
+    4.2524221e-04f,  -2.3325227e-01f, 1.7272754e-01f,  1.1977511e-01f,
+    -2.5740722e-01f, -4.2455325e-01f, -3.8168076e-01f, 4.2524221e-04f,
+    -1.7286746e-01f, 1.3987499e-01f,  5.1732048e-02f,  -3.8814163e-01f,
+    -5.4394585e-01f, -3.0911514e-01f, 4.2524221e-04f,  -7.4005872e-02f,
+    -2.0171419e-01f, 1.4349639e-02f,  1.0695112e+00f,  1.1055440e-01f,
+    4.7104073e-01f,  4.2524221e-04f,  -1.7483431e-01f, 1.8443911e-01f,
+    9.3163140e-02f,  -5.4278409e-01f, -4.9097329e-01f, -3.6492816e-01f,
+    4.2524221e-04f,  -1.0440959e-01f, 7.9506375e-02f,  1.6197237e-01f,
+    -4.9952024e-01f, -4.2269015e-01f, -1.9747719e-01f, 4.2524221e-04f,
+    -1.2244813e-01f, -3.9496835e-02f, 1.8504363e-02f,  2.7968970e-01f,
+    -2.1333002e-01f, 1.6160218e-01f,  4.2524221e-04f,  -1.2212741e-02f,
+    -2.0384742e-01f, -8.1245027e-02f, 6.5038508e-01f,  -5.9658372e-01f,
+    5.6763679e-01f,  4.2524221e-04f,  7.7157073e-02f,  3.8423132e-02f,
+    -7.9533443e-02f, 1.2899141e-01f,  2.2250174e-01f,  1.1144681e+00f,
+    4.2524221e-04f,  2.5630978e-01f,  -2.8503829e-01f, -7.5279221e-02f,
+    2.1920022e-01f,  -3.9966124e-01f, -3.6230826e-01f, 4.2524221e-04f,
+    -4.6040479e-02f, 1.7492487e-01f,  2.3670094e-02f,  1.5322700e-01f,
+    2.5319836e-01f,  -2.1926530e-01f, 4.2524221e-04f,  -2.6434872e-01f,
+    1.1163855e-01f,  1.1856534e-01f,  5.0888735e-01f,  1.0870682e+00f,
+    7.5545561e-01f,  4.2524221e-04f,  1.0934912e-02f,  -4.3975078e-03f,
+    -1.1050128e-01f, 5.7726038e-01f,  3.7376204e-01f,  -2.3798217e-01f,
+    4.2524221e-04f,  -1.0933757e-01f, -6.6509068e-02f, 5.9324563e-02f,
+    3.3751070e-01f,  1.9518003e-02f,  3.5434687e-01f,  4.2524221e-04f,
+    -5.0406039e-02f, 8.2527936e-02f,  5.8949720e-02f,  6.7421651e-01f,
+    7.2308058e-01f,  2.1764995e-01f,  4.2524221e-04f,  1.1794189e-01f,
+    -7.9106942e-02f, 7.3252164e-02f,  -1.7614780e-01f, 2.3364004e-01f,
+    -3.0955884e-01f, 4.2524221e-04f,  -3.8525936e-01f, 5.5291604e-02f,
+    3.0769013e-02f,  -2.8718120e-01f, -3.2775763e-01f, -6.8145633e-01f,
+    4.2524221e-04f,  -8.3880804e-02f, -7.4246824e-02f, -1.0636127e-01f,
+    2.2840117e-01f,  -3.4262979e-01f, -5.7159841e-02f, 4.2524221e-04f,
+    5.0429620e-02f,  1.7814779e-01f,  -1.3876863e-02f, -4.4347802e-01f,
+    2.2670373e-01f,  -5.2523874e-02f, 4.2524221e-04f,  8.4244743e-02f,
+    -1.2254165e-02f, 1.1833207e-01f,  4.9478766e-01f,  -5.9280358e-02f,
+    -6.6570687e-01f, 4.2524221e-04f,  4.2142691e-03f,  -2.6322320e-01f,
+    4.6141140e-02f,  -5.8571142e-01f, -1.9575717e-01f, 4.8644492e-01f,
+    4.2524221e-04f,  -8.6440565e-03f, -8.5276507e-02f, -1.0299275e-01f,
+    7.3558384e-01f,  1.9185032e-01f,  2.4474934e-03f,  4.2524221e-04f,
+    1.3430876e-01f,  7.4964397e-02f,  -4.4637624e-02f, 2.6200864e-01f,
+    -7.9147875e-01f, -1.3670044e-01f, 4.2524221e-04f,  1.5115394e-01f,
+    -5.0288949e-02f, 2.3326008e-03f,  4.5250246e-04f,  2.8048915e-01f,
+    6.7418523e-02f,  4.2524221e-04f,  7.9589985e-02f,  1.3198530e-02f,
+    9.5524024e-03f,  8.5114585e-03f,  4.9257568e-01f,  -2.1437393e-01f,
+    4.2524221e-04f,  8.8119820e-02f,  2.5465485e-01f,  2.9621312e-01f,
+    -6.9950558e-02f, 1.7136092e-01f,  1.5482426e-01f,  4.2524221e-04f,
+    3.9575586e-01f,  5.9830304e-02f,  2.7040720e-01f,  6.3961577e-01f,
+    -5.5998546e-01f, -5.2251714e-01f, 4.2524221e-04f,  2.1911263e-02f,
+    -1.0367694e-01f, 4.0058735e-01f,  -8.9272209e-02f, 9.4631839e-01f,
+    -3.8487363e-01f, 4.2524221e-04f,  3.4385122e-02f,  -1.3864669e-01f,
+    7.0193097e-02f,  4.5142362e-01f,  -2.2504972e-01f, -2.2282520e-01f,
+    4.2524221e-04f,  -2.2051957e-02f, 7.1768552e-02f,  3.2341501e-01f,
+    2.8539574e-01f,  1.4694886e-01f,  2.4218261e-01f,  4.2524221e-04f,
+    6.6477126e-03f,  -1.3585331e-01f, 1.6215855e-01f,  -9.2444402e-01f,
+    4.5748672e-01f,  -9.5693076e-01f, 4.2524221e-04f,  1.1732336e-02f,
+    7.6583289e-02f,  2.9326558e-02f,  -4.2848232e-01f, 8.9529181e-01f,
+    -5.0278997e-01f, 4.2524221e-04f,  -2.3169242e-01f, -7.7865161e-02f,
+    -6.8586029e-02f, 4.4346309e-01f,  4.3703821e-01f,  -1.3984813e-01f,
+    4.2524221e-04f,  2.1005182e-03f,  -1.0630068e-01f, -2.0478789e-03f,
+    4.2731187e-01f,  2.6764956e-01f,  6.9885917e-02f,  4.2524221e-04f,
+    4.3287359e-02f,  1.2680691e-01f,  -1.2716265e-01f, 1.4064538e+00f,
+    6.3669197e-02f,  2.9268086e-01f,  4.2524221e-04f,  2.1253993e-01f,
+    2.0032486e-02f,  -2.8352332e-01f, 6.1502069e-02f,  5.0910527e-01f,
+    2.5406623e-01f,  4.2524221e-04f,  -1.5371208e-01f, -1.5454817e-02f,
+    1.5976922e-01f,  3.8749605e-01f,  3.9152686e-02f,  2.0116392e-01f,
+    4.2524221e-04f,  -2.7467856e-01f, 2.0516390e-01f,  -8.8419601e-02f,
+    3.8022807e-01f,  1.8368958e-01f,  1.4313021e-01f,  4.2524221e-04f,
+    -1.9867215e-02f, 3.4233467e-03f,  2.6920827e-02f,  -4.9890375e-01f,
+    4.7998118e-01f,  -3.5384160e-01f, 4.2524221e-04f,  1.2394261e-01f,
+    -1.1514547e-01f, 1.8832713e-01f,  -1.4639932e-01f, 6.3231164e-01f,
+    -8.3366609e-01f, 4.2524221e-04f,  -7.1992099e-02f, 1.7378470e-02f,
+    -8.7242328e-02f, -3.2707125e-01f, -3.4206405e-01f, 1.1849549e-01f,
+    4.2524221e-04f,  1.3675264e-03f,  -1.0161220e-01f, 1.1794197e-01f,
+    -6.5400422e-01f, -1.9380212e-01f, 7.5254047e-01f,  4.2524221e-04f,
+    -1.1318323e-02f, -1.4939188e-02f, -4.1370645e-02f, -5.7902420e-01f,
+    -3.8736048e-01f, -6.4805365e-01f, 4.2524221e-04f,  2.2059079e-01f,
+    1.4307103e-01f,  5.2751834e-03f,  -7.1066815e-01f, -3.0571124e-01f,
+    -3.4100422e-01f, 4.2524221e-04f,  5.6093033e-02f,  1.6691233e-01f,
+    -7.0807494e-02f, 4.1625056e-01f,  -3.5175082e-01f, -2.9024789e-01f,
+    4.2524221e-04f,  -4.0760136e-01f, 1.6963206e-01f,  -1.2793277e-01f,
+    3.6916226e-01f,  -5.4585361e-01f, 4.1789886e-01f,  4.2524221e-04f,
+    2.8393698e-01f,  4.1604429e-02f,  -1.2255738e-01f, 4.1957131e-01f,
+    -6.0227048e-01f, -4.8008409e-01f, 4.2524221e-04f,  -5.1685097e-03f,
+    -4.1770671e-02f, 1.1320186e-02f,  6.9697315e-01f,  2.4219675e-01f,
+    4.5528144e-01f,  4.2524221e-04f,  -9.2784591e-02f, 7.7345654e-02f,
+    -7.9850294e-02f, 1.3106990e-01f,  -1.9888917e-01f, -6.0424030e-01f,
+    4.2524221e-04f,  -1.3671900e-01f, 5.6742132e-01f,  -1.8450902e-01f,
+    -1.5915504e-01f, -4.7375256e-01f, -1.3214935e-01f, 4.2524221e-04f,
+    -1.3770567e-01f, -5.6745846e-02f, -1.7213717e-02f, 8.8353807e-01f,
+    7.5317748e-02f,  -7.0693886e-01f, 4.2524221e-04f,  -1.8708508e-01f,
+    4.6241707e-03f,  1.7348535e-01f,  3.2163820e-01f,  8.2489528e-02f,
+    8.9861996e-02f,  4.2524221e-04f,  1.1482391e-01f,  1.6983777e-02f,
+    -1.1581448e-01f, -9.1527492e-01f, 2.3806203e-02f,  -6.1438274e-01f,
+    4.2524221e-04f,  -3.1089416e-02f, -2.0857678e-01f, 2.5814833e-02f,
+    2.1466513e-01f,  2.3788901e-01f,  -1.9398540e-02f, 4.2524221e-04f,
+    2.0071122e-01f,  -4.0954822e-01f, 5.4813763e-03f,  7.6764196e-01f,
+    -2.0557307e-01f, -1.5184893e-01f, 4.2524221e-04f,  -2.6855219e-02f,
+    5.3103637e-02f,  2.1054579e-01f,  -3.6030203e-01f, -5.0415200e-01f,
+    -1.0134627e+00f, 4.2524221e-04f,  -1.5320569e-01f, 2.1357769e-02f,
+    8.7219886e-02f,  -1.5428744e-01f, -2.0351259e-01f, 3.5907809e-02f,
+    4.2524221e-04f,  -1.8138912e-01f, -6.2948622e-02f, 7.4828513e-02f,
+    5.4962214e-02f,  -3.9846934e-02f, 6.8441704e-02f,  4.2524221e-04f,
+    -2.1332590e-02f, -8.0781348e-02f, 2.4442689e-02f,  1.7267960e-01f,
+    -3.7693899e-02f, -1.4580774e-01f, 4.2524221e-04f,  -2.7519673e-01f,
+    9.5269039e-02f,  -3.0745631e-02f, -9.9950932e-02f, -1.6695404e-01f,
+    1.3081552e-01f,  4.2524221e-04f,  1.5914220e-01f,  1.2361299e-01f,
+    1.3808930e-01f,  -3.7719634e-01f, 2.6418731e-01f,  -4.7624576e-01f,
+    4.2524221e-04f,  -4.6288930e-02f, -2.7458856e-01f, -2.4868591e-02f,
+    1.1211086e-01f,  -3.9368961e-04f, 6.0995859e-01f,  4.2524221e-04f,
+    -1.4516614e-01f, 9.5639445e-02f,  1.4521341e-02f,  -6.2749809e-01f,
+    -4.3474460e-01f, -6.3850440e-02f, 4.2524221e-04f,  1.2344169e-02f,
+    1.4936069e-01f,  7.7420339e-02f,  -5.5614072e-01f, 2.5198197e-01f,
+    1.2065966e-01f,  4.2524221e-04f,  1.7828740e-02f,  -5.0150797e-02f,
+    5.6068067e-02f,  -1.8056634e-01f, 5.0351298e-01f,  4.4432919e-02f,
+    4.2524221e-04f,  -1.4966798e-01f, 3.4953775e-03f,  5.8820792e-02f,
+    1.6740252e-01f,  -5.1562709e-01f, -1.2772369e-01f, 4.2524221e-04f,
+    1.8065150e-01f,  -2.2810679e-02f, 1.6292809e-01f,  -1.6482958e-01f,
+    1.0195982e+00f,  -2.3254627e-01f, 4.2524221e-04f,  -5.1958021e-05f,
+    -3.9097309e-01f, 8.2227796e-02f,  8.4267575e-01f,  5.7388678e-02f,
+    4.6285605e-01f,  4.2524221e-04f,  2.3226891e-02f,  -1.2692873e-01f,
+    -3.9916083e-01f, 3.1418437e-01f,  1.9673482e-01f,  1.7627418e-01f,
+    4.2524221e-04f,  -6.7505077e-02f, -1.0467784e-02f, 2.1655914e-01f,
+    -4.5411238e-01f, -4.9429080e-01f, -5.9390020e-01f, 4.2524221e-04f,
+    -3.1186458e-01f, 6.6885553e-02f,  -3.1015936e-01f, 2.3163263e-01f,
+    -3.1050909e-01f, -5.2182868e-02f, 4.2524221e-04f,  6.4003430e-02f,
+    1.0722633e-01f,  1.2855037e-02f,  6.4192277e-01f,  -1.1274775e-01f,
+    4.2818221e-01f,  4.2524221e-04f,  6.9713057e-04f,  -1.7024882e-01f,
+    1.1969007e-01f,  -4.8345292e-01f, 3.3571637e-01f,  2.2751006e-01f,
+    4.2524221e-04f,  2.5624090e-01f,  1.9991541e-01f,  2.7345872e-01f,
+    -8.3251333e-01f, -1.2804669e-01f, -2.8672218e-01f, 4.2524221e-04f,
+    1.8683919e-01f,  -3.6161101e-01f, 1.0703325e-02f,  3.3986914e-01f,
+    4.8497844e-02f,  2.3756032e-01f,  4.2524221e-04f,  -1.4104228e-01f,
+    -1.5553111e-01f, -1.3147251e-01f, 1.0852005e+00f,  -2.5680059e-01f,
+    2.5069383e-01f,  4.2524221e-04f,  -1.9770128e-01f, -1.4175245e-01f,
+    1.8448097e-01f,  -5.0913215e-01f, -5.9743571e-01f, -1.6894864e-02f,
+    4.2524221e-04f,  2.1237466e-02f,  -3.6086017e-01f, -1.9249740e-01f,
+    -5.9351578e-02f, 5.3578866e-01f,  -7.1674514e-01f, 4.2524221e-04f,
+    -3.3627223e-02f, -1.6906269e-01f, 2.2338827e-01f,  9.3727306e-02f,
+    9.1755494e-02f,  -5.7371092e-01f, 4.2524221e-04f,  4.7952205e-01f,
+    6.7791358e-02f,  -2.9310691e-01f, 4.1324478e-01f,  1.7141986e-01f,
+    2.4409248e-01f,  4.2524221e-04f,  1.7890526e-01f,  1.2169579e-01f,
+    -2.9259530e-01f, 5.4734105e-01f,  6.9304323e-01f,  7.3535725e-02f,
+    4.2524221e-04f,  2.1919321e-02f,  -3.1845599e-01f, -2.4307689e-01f,
+    4.4567209e-01f,  3.9958793e-01f,  -9.1936581e-02f, 4.2524221e-04f,
+    7.6360904e-02f,  -9.9568665e-02f, -3.6729082e-02f, 4.4655576e-01f,
+    -4.9103443e-02f, 5.6398445e-01f,  4.2524221e-04f,  -3.2680893e-01f,
+    3.4060474e-03f,  -9.5601030e-02f, 1.8501686e-01f,  -4.5118406e-01f,
+    -7.8546248e-02f, 4.2524221e-04f,  9.5919959e-02f,  1.7357532e-02f,
+    -6.2571138e-02f, 1.5893191e-01f,  -6.5006995e-01f, 2.5034849e-02f,
+    4.2524221e-04f,  -9.3976893e-02f, 7.4858761e-01f,  -2.6612282e-01f,
+    -2.1494505e-01f, -1.8607964e-01f, -1.1622455e-02f, 4.2524221e-04f,
+    -1.9914754e-01f, -1.4597380e-01f, -6.2302649e-02f, 1.1021204e-02f,
+    -6.7020303e-01f, -3.3657350e-02f, 4.2524221e-04f,  1.4431569e-01f,
+    2.4171654e-02f,  1.6881478e-01f,  -6.6591549e-01f, -3.4065247e-01f,
+    -7.5222605e-01f, 4.2524221e-04f,  1.4121325e-02f,  9.5259473e-02f,
+    -4.8137712e-01f, 6.9373988e-02f,  4.1705778e-01f,  -5.6761068e-01f,
+    4.2524221e-04f,  2.6314303e-01f,  5.4131560e-02f,  5.2006942e-01f,
+    -6.8592948e-01f, -1.8287517e-02f, 9.7879067e-02f,  4.2524221e-04f,
+    2.7169415e-01f,  -6.3688450e-02f, -2.1294890e-02f, -1.9359666e-01f,
+    1.0400132e+00f,  -1.9963259e-01f, 4.2524221e-04f,  -2.1797970e-01f,
+    -8.5340932e-02f, 1.1264686e-01f,  5.0285482e-01f,  -1.6192405e-01f,
+    3.8625699e-01f,  4.2524221e-04f,  -2.3507127e-01f, -1.2652132e-01f,
+    -2.2202699e-01f, 5.0801891e-01f,  1.9383451e-01f,  -6.6151083e-01f,
+    4.2524221e-04f,  -5.6993598e-03f, -5.0626114e-02f, -1.1308940e-01f,
+    1.0160903e+00f,  1.1862794e-01f,  2.7474642e-01f,  4.2524221e-04f,
+    4.8629191e-02f,  1.2844987e-01f,  3.8468280e-01f,  1.4983997e-01f,
+    -8.5667557e-01f, -1.8279985e-01f, 4.2524221e-04f,  -1.3248117e-01f,
+    -1.0631329e-01f, 7.5321319e-03f,  2.8159514e-01f,  -5.4962975e-01f,
+    -4.3660015e-01f, 4.2524221e-04f,  1.3241449e-03f,  -1.5634854e-01f,
+    -1.7225713e-01f, -4.2000353e-01f, 1.6989522e-02f,  1.0302254e+00f,
+    4.2524221e-04f,  6.0261134e-03f,  7.9409704e-03f,  9.1440484e-02f,
+    -3.0220580e-01f, -7.7151561e-01f, 4.2543150e-02f,  4.2524221e-04f,
+    2.0895573e-01f,  -2.1937467e-01f, -5.1814243e-02f, -3.0285525e-01f,
+    6.2322158e-01f,  -4.7911149e-01f, 4.2524221e-04f,  -9.8498203e-02f,
+    -5.9885830e-02f, -3.1867433e-02f, -1.2152094e+00f, 5.4904381e-03f,
+    -4.1258970e-01f, 4.2524221e-04f,  -4.8488066e-02f, 4.4104416e-02f,
+    1.5862907e-01f,  -4.4825897e-01f, 9.7611815e-02f,  -3.7502378e-01f,
+    4.2524221e-04f,  2.3262146e-01f,  3.2365641e-01f,  1.1808707e-01f,
+    -9.0573706e-02f, 1.5945364e-02f,  5.0722408e-01f,  4.2524221e-04f,
+    -1.1470696e-01f, 8.9340523e-02f,  -6.4827114e-02f, -2.9209036e-01f,
+    -3.6173090e-01f, -3.0526412e-01f, 4.2524221e-04f,  9.5129684e-02f,
+    -1.2038415e-01f, 2.4554672e-02f,  3.1021306e-01f,  -8.0452330e-02f,
+    -7.0555747e-01f, 4.2524221e-04f,  4.5191955e-02f,  2.2878443e-01f,
+    -2.3190710e-01f, 1.3439280e-01f,  9.4422090e-01f,  4.5181891e-01f,
+    4.2524221e-04f,  -1.1008850e-01f, -7.7886850e-02f, -6.5560035e-02f,
+    3.2681102e-01f,  -2.3604423e-01f, 1.2092002e-01f,  4.2524221e-04f,
+    -1.6582491e-01f, -6.4504117e-02f, 1.6040473e-01f,  -3.0520931e-01f,
+    -5.4780841e-01f, -6.8909246e-01f, 4.2524221e-04f,  1.4898033e-01f,
+    6.4304672e-02f,  1.8339977e-01f,  -3.9272609e-01f, 1.4390137e+00f,
+    -4.3225473e-01f, 4.2524221e-04f,  -4.9138270e-02f, -8.2813941e-02f,
+    -1.9770658e-01f, -1.0563649e-01f, -3.7128425e-01f, 7.4610549e-01f,
+    4.2524221e-04f,  -3.2529008e-01f, -4.6994045e-01f, -8.3219528e-02f,
+    2.3760368e-01f,  -9.3971521e-02f, 3.5663474e-01f,  4.2524221e-04f,
+    8.7377906e-02f,  -1.8962690e-01f, -1.4496110e-02f, 4.8985398e-01f,
+    1.9304378e-01f,  -3.4295464e-01f, 4.2524221e-04f,  2.4414150e-01f,
+    5.8528569e-02f,  7.7077024e-02f,  5.5549634e-01f,  1.9856468e-01f,
+    -8.5791957e-01f, 4.2524221e-04f,  -4.9084622e-02f, -9.5591195e-02f,
+    1.6564789e-01f,  2.9922199e-01f,  -9.8501690e-02f, -2.2108212e-01f,
+    4.2524221e-04f,  -5.0639343e-02f, -1.4512147e-01f, 7.7068340e-03f,
+    4.7224876e-02f,  -5.7675552e-01f, 2.4847232e-01f,  4.2524221e-04f,
+    -2.7882235e-02f, -2.5087783e-01f, -1.2902394e-01f, 4.2801958e-02f,
+    -3.6119899e-01f, 2.1516395e-01f,  4.2524221e-04f,  -4.6722639e-02f,
+    -1.1919469e-01f, 2.3033876e-02f,  1.0368994e-01f,  -3.9297837e-01f,
+    -9.0560585e-01f, 4.2524221e-04f,  -9.8877840e-02f, 8.3310038e-02f,
+    2.2861077e-02f,  -2.9519450e-02f, -4.3397459e-01f, 1.0293537e+00f,
+    4.2524221e-04f,  1.5239653e-01f,  2.5422654e-01f,  -1.7482758e-02f,
+    -4.2586017e-02f, 4.7841224e-01f,  -5.9156500e-02f, 4.2524221e-04f,
+    -4.7107911e-01f, -1.1996613e-01f, 6.2203579e-02f,  -9.6767664e-02f,
+    -4.0281779e-01f, 6.7321354e-01f,  4.2524221e-04f,  4.6411004e-02f,
+    5.5707924e-02f,  1.9377133e-01f,  4.0077385e-02f,  2.9719681e-01f,
+    -1.1192318e+00f, 4.2524221e-04f,  -1.9413696e-01f, -4.4348843e-02f,
+    1.0236490e-01f,  -8.2978594e-01f, -7.9887435e-02f, -1.3073830e-01f,
+    4.2524221e-04f,  5.4713640e-02f,  -2.9570219e-01f, 6.6040419e-02f,
+    5.4418570e-01f,  5.9043342e-01f,  -8.7340188e-01f, 4.2524221e-04f,
+    1.9088466e-02f,  1.7759448e-02f,  1.9595300e-01f,  -2.3816055e-01f,
+    -3.5885778e-01f, 5.0142020e-01f,  4.2524221e-04f,  3.5848218e-01f,
+    3.5156542e-01f,  8.8914238e-02f,  -8.4306836e-01f, -2.9635224e-01f,
+    5.0449312e-01f,  4.2524221e-04f,  -8.8375499e-03f, -2.6108938e-01f,
+    -4.8876982e-03f, -6.1897114e-02f, -4.1726297e-01f, -1.4984097e-01f,
+    4.2524221e-04f,  2.9446623e-01f,  -4.6997136e-01f, 1.9041170e-01f,
+    -3.1315902e-01f, 2.5396582e-02f,  2.5422072e-01f,  4.2524221e-04f,
+    3.3144456e-01f,  -4.7518802e-01f, 1.3028762e-01f,  9.1121584e-02f,
+    3.7702811e-01f,  2.4763432e-01f,  4.2524221e-04f,  2.8906846e-02f,
+    -2.7012853e-02f, 7.4882455e-02f,  -7.3651665e-01f, -1.3228054e-01f,
+    -2.5014046e-01f, 4.2524221e-04f,  -2.1941566e-01f, 1.7864147e-01f,
+    -8.1385314e-02f, -2.7048141e-01f, 1.6695546e-01f,  5.8578587e-01f,
+    4.2524221e-04f,  3.8897455e-02f,  -1.9677906e-01f, -1.6548048e-01f,
+    3.2346794e-01f,  5.9345144e-01f,  -1.3332494e-01f, 4.2524221e-04f,
+    -1.7442798e-02f, -2.8085416e-02f, 1.2957196e-01f,  -7.7560896e-01f,
+    -1.1487541e+00f, 6.1335992e-02f,  4.2524221e-04f,  -6.6024922e-02f,
+    1.1588415e-01f,  6.7844316e-02f,  -2.7552110e-01f, 6.2179494e-01f,
+    5.7581806e-01f,  4.2524221e-04f,  3.7913716e-01f,  -6.3323379e-02f,
+    -9.0205953e-02f, 2.0326111e-01f,  -7.8349888e-01f, 1.2221128e-01f,
+    4.2524221e-04f,  2.6661048e-02f,  -2.5068019e-02f, 1.4274968e-01f,
+    9.4247788e-02f,  1.4586176e-01f,  6.4317578e-01f,  4.2524221e-04f,
+    -3.0924156e-01f, -7.8534998e-02f, -6.9818869e-02f, 2.0920417e-01f,
+    -5.7607746e-01f, 1.1970257e+00f,  4.2524221e-04f,  -7.9141982e-02f,
+    -3.5169861e-01f, -1.9536397e-01f, 4.2081746e-01f,  -7.0208210e-01f,
+    5.1061481e-01f,  4.2524221e-04f,  -1.9229406e-01f, -1.4870661e-01f,
+    2.1185999e-01f,  8.3023351e-01f,  -2.7605864e-01f, -3.0809650e-01f,
+    4.2524221e-04f,  -2.1153130e-02f, -1.2270647e-01f, 2.7843162e-02f,
+    1.7671824e-01f,  -1.6691629e-04f, -9.6530452e-02f, 4.2524221e-04f,
+    2.6757956e-01f,  -6.6474929e-02f, -3.9959319e-02f, -4.0775532e-01f,
+    -5.6668681e-01f, -1.6157649e-01f, 4.2524221e-04f,  6.9529399e-02f,
+    -2.0434815e-01f, -1.5643069e-01f, 2.7118540e-01f,  -1.1553574e+00f,
+    3.7761849e-01f,  4.2524221e-04f,  -1.0081946e-01f, 1.1525136e-01f,
+    1.4974597e-01f,  -5.1787722e-01f, -2.0310085e-02f, 1.2351452e+00f,
+    4.2524221e-04f,  -5.7900643e-01f, -2.9167721e-01f, -1.4271416e-01f,
+    2.5774074e-01f,  -2.4057569e-01f, 1.1240454e-02f,  4.2524221e-04f,
+    2.0044571e-02f,  -1.2469979e-01f, 9.5384248e-02f,  2.7102938e-01f,
+    5.7413213e-02f,  -2.4517176e-01f, 4.2524221e-04f,  1.6620056e-01f,
+    4.7757544e-02f,  -2.0400334e-02f, 3.5164309e-01f,  -5.6205180e-02f,
+    1.3554877e-01f,  4.2524221e-04f,  3.1053850e-01f,  1.2239582e-01f,
+    1.1081365e-01f,  3.2454273e-01f,  -4.1576099e-01f, 4.3368453e-01f,
+    4.2524221e-04f,  -6.1997168e-02f, 6.8293571e-02f,  -2.1686632e-02f,
+    -1.1829304e+00f, -7.2746319e-01f, -6.3295043e-01f, 4.2524221e-04f,
+    -4.6507712e-02f, -1.8335190e-01f, 2.5036236e-02f,  5.9028554e-01f,
+    1.0557675e+00f,  -2.3586641e-01f, 4.2524221e-04f,  -1.9321825e-01f,
+    -3.3254452e-02f, 7.6559506e-02f,  6.4760417e-01f,  -2.4937464e-01f,
+    -1.9823854e-01f, 4.2524221e-04f,  9.6437842e-02f,  1.3186246e-01f,
+    9.5916361e-02f,  -3.5984623e-01f, -3.2689348e-01f, 5.9379440e-02f,
+    4.2524221e-04f,  7.6694958e-02f,  -1.3702771e-02f, -2.1995303e-01f,
+    8.1270732e-02f,  7.6408625e-01f,  2.0720795e-02f,  4.2524221e-04f,
+    2.6512283e-01f,  2.3807710e-02f,  -5.8690600e-02f, -5.9104975e-02f,
+    3.6571422e-01f,  -2.6530063e-01f, 4.2524221e-04f,  1.1985373e-01f,
+    8.8621952e-02f,  -2.9940531e-01f, -1.1448269e-01f, 1.1017141e-01f,
+    5.6789166e-01f,  4.2524221e-04f,  -1.2263313e-01f, -2.3629392e-02f,
+    5.3131497e-03f,  2.6857898e-01f,  1.1421818e-01f,  7.0165527e-01f,
+    4.2524221e-04f,  4.8763152e-02f,  -3.2277855e-01f, 2.0200168e-01f,
+    1.8440504e-01f,  -8.1272709e-01f, -2.7759212e-01f, 4.2524221e-04f,
+    9.3498468e-02f,  -4.1367030e-01f, 1.8555576e-01f,  2.9281719e-02f,
+    -5.5220705e-01f, 2.0397153e-02f,  4.2524221e-04f,  1.8687698e-01f,
+    -3.7513354e-01f, -3.5006168e-01f, -3.4435531e-01f, -7.3252641e-02f,
+    -7.9778379e-01f, 4.2524221e-04f,  4.0210519e-02f,  -4.4312064e-02f,
+    2.0531718e-02f,  6.8555629e-01f,  1.2600437e-01f,  5.8994955e-01f,
+    4.2524221e-04f,  9.7262099e-02f,  -2.4695326e-01f, 1.5161885e-01f,
+    6.3341367e-01f,  -7.2936422e-01f, 5.6940907e-01f,  4.2524221e-04f,
+    -3.4016535e-02f, -7.3744408e-03f, -1.1691462e-01f, 2.6614013e-01f,
+    -3.5331360e-01f, -8.8386804e-01f, 4.2524221e-04f,  1.3624603e-01f,
+    -1.7998964e-01f, 3.4350563e-02f,  1.9105835e-01f,  -4.1896972e-01f,
+    3.3572388e-01f,  4.2524221e-04f,  1.5011507e-01f,  -6.9377556e-02f,
+    -2.0842755e-01f, -1.0781676e+00f, -1.4453362e-01f, -4.6691768e-02f,
+    4.2524221e-04f,  -5.4555935e-01f, -1.3987549e-01f, 3.0308160e-01f,
+    -5.9472028e-02f, 1.9802932e-01f,  -8.6025819e-02f, 4.2524221e-04f,
+    4.9332839e-02f,  1.3310361e-03f,  -5.0368089e-02f, -3.0621833e-01f,
+    2.5460938e-01f,  -5.1256549e-01f, 4.2524221e-04f,  -4.7801822e-02f,
+    -3.4593850e-02f, 8.9611582e-02f,  1.8572922e-01f,  -6.0846277e-02f,
+    -1.8172133e-01f, 4.2524221e-04f,  -3.6373314e-01f, 6.6289470e-02f,
+    7.3245563e-02f,  8.9139789e-02f,  4.3985420e-01f,  -5.0775284e-01f,
+    4.2524221e-04f,  -1.4245206e-01f, 6.0951833e-02f,  -2.5649929e-01f,
+    2.8157827e-01f,  -3.2649705e-01f, -4.6543762e-01f, 4.2524221e-04f,
+    -2.4361274e-01f, -4.1191485e-02f, 2.5792071e-01f,  4.3440372e-01f,
+    -4.6756613e-01f, 1.6077581e-01f,  4.2524221e-04f,  3.3604893e-01f,
+    -1.3733134e-01f, 3.6824477e-01f,  9.4274664e-01f,  3.0627247e-02f,
+    2.0665247e-02f,  4.2524221e-04f,  -1.0862888e-01f, 1.7238052e-01f,
+    -8.3285324e-02f, -9.6792758e-01f, 1.4696856e-01f,  -9.0619934e-01f,
+    4.2524221e-04f,  5.4265555e-02f,  8.6158134e-02f,  1.7487629e-01f,
+    -4.4634727e-01f, -6.2019285e-02f, 3.9177588e-01f,  4.2524221e-04f,
+    -5.6538235e-02f, -5.9880339e-02f, 2.9278052e-01f,  1.1517015e+00f,
+    -1.4973013e-03f, -6.2995279e-01f, 4.2524221e-04f,  2.7599217e-02f,
+    -5.8020987e-02f, 4.7509563e-03f,  -2.3244345e-01f, 1.0103332e+00f,
+    4.6963906e-01f,  4.2524221e-04f,  9.3664825e-03f,  7.3502227e-03f,
+    4.6138402e-02f,  -1.3345490e-01f, 5.9955823e-01f,  -4.9404097e-01f,
+    4.2524221e-04f,  5.9396394e-02f,  3.3342212e-01f,  -1.0094202e-01f,
+    -4.7451437e-01f, 4.7322938e-01f,  -5.5454910e-01f, 4.2524221e-04f,
+    -2.7876474e-02f, 2.6822351e-02f,  1.8973917e-02f,  -1.6320571e-01f,
+    -1.8942030e-01f, -2.4480176e-01f, 4.2524221e-04f,  1.3889100e-01f,
+    -4.0123284e-02f, -1.0625365e-01f, 4.3459002e-02f,  7.0615810e-01f,
+    -5.2301788e-01f, 4.2524221e-04f,  1.5139003e-01f,  -1.8260507e-01f,
+    1.0779282e-01f,  -1.4358564e-01f, -2.6157531e-01f, 8.8461274e-01f,
+    4.2524221e-04f,  -2.8099319e-01f, -3.1833488e-01f, 1.3126114e-01f,
+    -2.3910215e-01f, 1.4543295e-01f,  -4.0892178e-01f, 4.2524221e-04f,
+    -1.4075463e-01f, 2.8643187e-02f,  2.4450511e-01f,  -3.6961821e-01f,
+    -1.4252850e-01f, -2.4521539e-01f, 4.2524221e-04f,  -7.4808247e-02f,
+    5.3461105e-01f,  -1.8508192e-02f, 8.0533735e-02f,  -6.9441730e-01f,
+    7.3116846e-02f,  4.2524221e-04f,  -1.6346678e-02f, 7.9455497e-03f,
+    -9.9148363e-02f, 3.1443191e-01f,  -5.4373699e-01f, 4.3133399e-01f,
+    4.2524221e-04f,  2.9067984e-02f,  -3.3523466e-02f, 3.0538375e-02f,
+    -1.1886040e+00f, 4.7290227e-01f,  -3.0723882e-01f, 4.2524221e-04f,
+    1.5234210e-01f,  1.9771519e-01f,  -2.4682826e-01f, -1.4036484e-01f,
+    -1.1035047e-01f, 8.4115155e-02f,  4.2524221e-04f,  -2.1906562e-01f,
+    -1.6002099e-01f, -9.2091426e-02f, 6.4754307e-01f,  -3.7645406e-01f,
+    1.2181389e-01f,  4.2524221e-04f,  -9.1878235e-02f, 1.2432076e-01f,
+    -8.0166101e-02f, 5.0367552e-01f,  -6.5015817e-01f, -8.8551737e-02f,
+    4.2524221e-04f,  3.6087655e-02f,  -2.6747819e-02f, -3.4746157e-03f,
+    9.9200827e-01f,  2.6657633e-02f,  -3.7900978e-01f, 4.2524221e-04f,
+    2.6048768e-02f,  2.3242475e-02f,  8.9528844e-02f,  -3.9793146e-01f,
+    7.2130662e-01f,  -1.0542603e+00f, 4.2524221e-04f,  -2.4949808e-02f,
+    -2.5223804e-01f, -3.0647239e-01f, 3.3407366e-01f,  -1.9705334e-01f,
+    2.5395662e-01f,  4.2524221e-04f,  -4.0463626e-02f, -1.9470181e-01f,
+    1.1714090e-01f,  2.1699083e-01f,  -4.6391746e-01f, 6.9011539e-01f,
+    4.2524221e-04f,  -3.6179063e-01f, 2.5796738e-01f,  -2.2714870e-01f,
+    6.8880364e-02f,  -5.1768059e-01f, 3.1510383e-01f,  4.2524221e-04f,
+    -1.2567266e-02f, -1.3621120e-01f, 1.8899418e-02f,  -2.5503978e-01f,
+    -4.4750300e-01f, -5.5090672e-01f, 4.2524221e-04f,  1.2223324e-01f,
+    1.6272777e-01f,  -7.7560306e-02f, -1.0317849e+00f, -2.8434926e-01f,
+    -3.4523854e-01f, 4.2524221e-04f,  -6.1004322e-02f, -5.9227122e-04f,
+    -2.1554500e-02f, 2.4792428e-01f,  9.2429572e-01f,  5.4870909e-01f,
+    4.2524221e-04f,  -1.9842461e-01f, -6.4582884e-02f, 1.3064224e-01f,
+    5.5808347e-01f,  -1.8904553e-01f, -6.2413597e-01f, 4.2524221e-04f,
+    2.1097521e-01f,  -9.7741969e-02f, -4.8862401e-01f, -1.5172134e-01f,
+    4.1083209e-03f,  -3.8696522e-01f, 4.2524221e-04f,  -4.1763911e-01f,
+    2.8503893e-02f,  2.3253348e-01f,  6.0633165e-01f,  -5.2774370e-01f,
+    -4.4324151e-01f, 4.2524221e-04f,  5.1180962e-02f,  -1.9705455e-01f,
+    -1.6887939e-01f, 1.5589913e-02f,  -2.5575042e-02f, -1.1669157e-01f,
+    4.2524221e-04f,  2.4728218e-01f,  -1.0551698e-01f, 7.4217469e-02f,
+    9.6258569e-01f,  -6.2713939e-01f, -1.8557775e-01f, 4.2524221e-04f,
+    2.1752425e-01f,  -4.7557138e-02f, 1.0900661e-01f,  1.3654574e-02f,
+    -3.1104892e-01f, -1.5954138e-01f, 4.2524221e-04f,  -8.5164877e-03f,
+    6.9203183e-02f,  -8.2244650e-02f, 8.6040825e-02f,  2.9945150e-01f,
+    7.0226085e-01f,  4.2524221e-04f,  3.1293556e-01f,  1.5429822e-02f,
+    -4.2168817e-01f, 1.1221366e-01f,  2.8672639e-01f,  -4.9470222e-01f,
+    4.2524221e-04f,  -1.7686468e-01f, -1.1348136e-01f, 1.0469711e-01f,
+    -7.0500970e-02f, -4.1212380e-01f, 1.9760063e-01f,  4.2524221e-04f,
+    8.3808228e-03f,  1.0910257e-02f,  -1.8213235e-02f, 4.4389714e-02f,
+    -7.7154768e-01f, -3.5982323e-01f, 4.2524221e-04f,  6.8500482e-02f,
+    -1.1419601e-01f, 1.4834467e-02f,  1.3472405e-01f,  1.4658807e-01f,
+    4.5247668e-01f,  4.2524221e-04f,  1.2863684e-04f,  4.7902670e-02f,
+    4.4644019e-03f,  6.1397803e-01f,  6.4297414e-01f,  -4.2464599e-01f,
+    4.2524221e-04f,  -1.4640845e-01f, 6.2301353e-02f,  1.7238835e-01f,
+    5.3890556e-01f,  2.9199031e-01f,  9.2200214e-01f,  4.2524221e-04f,
+    -2.3965839e-01f, 3.2009163e-01f,  -3.8611110e-02f, 8.6142951e-01f,
+    1.4380187e-01f,  -6.2833118e-01f, 4.2524221e-04f,  4.4654030e-01f,
+    1.0163968e-01f,  5.3189643e-02f,  -4.4938076e-01f, 5.7065886e-01f,
+    5.1487476e-01f,  4.2524221e-04f,  9.1271382e-03f,  5.7840168e-02f,
+    2.4090679e-01f,  -4.0559599e-01f, -7.3929489e-01f, -6.9430506e-01f,
+    4.2524221e-04f,  9.4600774e-02f,  5.1817168e-02f,  2.1506846e-01f,
+    -3.0376458e-01f, 1.1441462e-01f,  -6.2610811e-01f, 4.2524221e-04f,
+    -8.5917406e-02f, -9.6700184e-02f, 9.7186953e-02f,  7.2733891e-01f,
+    -1.0870229e+00f, -5.6539588e-02f, 4.2524221e-04f,  1.7685313e-02f,
+    -1.4662553e-03f, -1.7001009e-02f, -2.6348737e-01f, 9.5344022e-02f,
+    8.1280392e-01f,  4.2524221e-04f,  -1.7505834e-01f, -3.3343634e-01f,
+    -1.2530324e-01f, -2.8169325e-01f, 2.0131937e-01f,  -9.1824895e-01f,
+    4.2524221e-04f,  -1.4605665e-01f, -6.4788614e-03f, -6.0053490e-02f,
+    -7.8159940e-01f, -9.4004035e-02f, -1.6656834e-01f, 4.2524221e-04f,
+    -1.4236464e-01f, 9.5513508e-02f,  2.5040861e-02f,  3.2381487e-01f,
+    -4.1220659e-01f, 1.1228602e-01f,  4.2524221e-04f,  3.1168388e-02f,
+    3.5280091e-01f,  -1.4528583e-01f, -5.7546836e-01f, -3.9822334e-01f,
+    2.4046797e-01f,  4.2524221e-04f,  -1.2098387e-01f, 1.8265340e-01f,
+    -2.2984284e-01f, 1.3183025e-01f,  5.5871445e-01f,  -4.6467310e-01f,
+    4.2524221e-04f,  -4.2758569e-02f, 2.7958041e-01f,  1.3604170e-01f,
+    -4.2580155e-01f, 3.9972100e-01f,  4.8495343e-01f,  4.2524221e-04f,
+    1.0593699e-01f,  9.5284186e-02f,  4.9210130e-03f,  -4.8137295e-01f,
+    4.3073782e-01f,  4.2313659e-01f,  4.2524221e-04f,  3.4906089e-02f,
+    3.1306069e-02f,  -4.8974056e-02f, 1.9962604e-01f,  3.7843320e-01f,
+    2.6260796e-01f,  4.2524221e-04f,  -7.9922788e-02f, 1.5572652e-01f,
+    -4.2344011e-02f, -1.1441834e+00f, -1.2938149e-01f, 2.1325669e-01f,
+    4.2524221e-04f,  -1.9084260e-01f, 2.2564901e-01f,  -3.2097334e-01f,
+    1.6154413e-01f,  3.8027555e-01f,  3.4719923e-01f,  4.2524221e-04f,
+    -2.9850133e-02f, -3.8303677e-02f, 6.0475506e-02f,  6.9679272e-01f,
+    -5.5996644e-01f, -8.0641109e-01f, 4.2524221e-04f,  4.1167522e-03f,
+    2.6246420e-01f,  -1.5513101e-01f, -5.9974313e-01f, -4.0403536e-01f,
+    -1.7390466e-01f, 4.2524221e-04f,  -8.8623181e-02f, -2.1573004e-01f,
+    1.0872442e-01f,  -6.7163609e-02f, 7.3392200e-01f,  -6.1311746e-01f,
+    4.2524221e-04f,  3.4234326e-02f,  3.5096583e-01f,  -1.8464302e-01f,
+    -2.9789469e-01f, -2.9916745e-01f, -1.5300374e-01f, 4.2524221e-04f,
+    1.4820539e-02f,  2.8811511e-01f,  2.1999674e-01f,  -6.0168439e-01f,
+    2.1821584e-01f,  -9.0731859e-01f, 4.2524221e-04f,  1.3500918e-05f,
+    1.6290896e-02f,  -3.2978594e-01f, -2.6417324e-01f, -2.5580767e-01f,
+    -4.8237646e-01f, 4.2524221e-04f,  1.6280727e-01f,  -1.3910933e-02f,
+    9.0576991e-02f,  -3.5292417e-01f, 3.3175802e-01f,  2.6203001e-01f,
+    4.2524221e-04f,  3.6940601e-02f,  1.0942241e-01f,  -4.4244016e-04f,
+    -2.5942552e-01f, 5.0203174e-01f,  1.7998736e-02f,  4.2524221e-04f,
+    -7.2300643e-02f, -3.5532361e-01f, -1.1836357e-01f, 6.6084677e-01f,
+    1.0762968e-02f,  -3.3973151e-01f, 4.2524221e-04f,  -5.9891965e-02f,
+    -1.0563817e-01f, 3.3721972e-02f,  1.0326222e-01f,  3.2457301e-01f,
+    -5.3301256e-02f, 4.2524221e-04f,  -1.4665352e-01f, -9.1687031e-03f,
+    5.8719823e-03f,  -6.6473037e-01f, -2.8615147e-01f, -2.0601395e-01f,
+    4.2524221e-04f,  7.2293468e-02f,  2.6938063e-01f,  -5.6877002e-02f,
+    -2.3897879e-01f, -3.5202929e-01f, 5.5343825e-01f,  4.2524221e-04f,
+    1.9221555e-01f,  -2.1067508e-01f, 1.3436309e-01f,  -1.8503526e-01f,
+    1.8404932e-01f,  -5.8186956e-02f, 4.2524221e-04f,  1.3180923e-01f,
+    9.1396950e-02f,  -1.4538786e-01f, -3.3797005e-01f, 1.5660138e-01f,
+    5.4058945e-01f,  4.2524221e-04f,  -9.3225665e-02f, 1.4030679e-01f,
+    3.8216069e-01f,  -6.0168129e-01f, 6.8035245e-01f,  -3.1379357e-02f,
+    4.2524221e-04f,  1.5006550e-01f,  -2.5975293e-01f, 2.9107177e-01f,
+    2.6915145e-01f,  -3.5880175e-01f, 7.1583249e-02f,  4.2524221e-04f,
+    -9.4202636e-03f, -9.4279245e-02f, 4.4590913e-02f,  1.4364957e+00f,
+    -2.1902028e-01f, 9.6744083e-02f,  4.2524221e-04f,  3.0494422e-01f,
+    -2.5591444e-02f, 1.3159279e-02f,  1.2551376e-01f,  2.9426169e-01f,
+    8.9648157e-01f,  4.2524221e-04f,  8.9394294e-02f,  -8.8125467e-03f,
+    -7.3673509e-02f, 1.2743057e-01f,  5.1298594e-01f,  3.8048950e-01f,
+    4.2524221e-04f,  2.7601722e-01f,  3.1614223e-01f,  -8.8885389e-02f,
+    5.2427125e-01f,  3.5057170e-03f,  -3.2713708e-01f, 4.2524221e-04f,
+    -3.6194470e-02f, 1.5230738e-01f,  7.9578511e-02f,  -2.5105590e-01f,
+    1.4376603e-01f,  -8.4517467e-01f, 4.2524221e-04f,  -5.8516286e-02f,
+    -2.8070486e-01f, -1.1328175e-01f, -7.7989556e-02f, -8.5450399e-01f,
+    1.1351100e+00f,  4.2524221e-04f,  -2.9097018e-01f, 1.2985972e-01f,
+    -1.2366821e-02f, -8.3323711e-01f, 2.8012127e-01f,  1.6539182e-01f,
+    4.2524221e-04f,  3.0149514e-02f,  -2.8825521e-01f, 2.0892709e-01f,
+    1.7042273e-01f,  -2.1943188e-01f, 1.4729333e-01f,  4.2524221e-04f,
+    -3.8237656e-03f, -8.4436283e-02f, -6.5656848e-02f, 3.9715600e-01f,
+    -1.6315429e-01f, -2.1582417e-02f, 4.2524221e-04f,  -2.6904994e-01f,
+    -2.0234157e-01f, -2.4654223e-01f, -2.4513899e-01f, -3.8557103e-01f,
+    -4.3605319e-01f, 4.2524221e-04f,  6.1712354e-02f,  1.1876680e-01f,
+    4.5614880e-02f,  1.0898942e-01f,  3.4832779e-01f,  -1.1438330e-01f,
+    4.2524221e-04f,  2.9162480e-02f,  4.4080630e-01f,  -1.5951470e-01f,
+    -4.9014933e-02f, -9.3625681e-03f, 2.7527571e-01f,  4.2524221e-04f,
+    7.3062986e-02f,  -6.6397418e-03f, 1.7950128e-01f,  7.0830888e-01f,
+    1.2978782e-01f,  1.3472284e+00f,  4.2524221e-04f,  2.8972799e-01f,
+    5.6850761e-02f,  -5.7165205e-02f, -4.1536343e-01f, 6.4233094e-01f,
+    6.0319901e-01f,  4.2524221e-04f,  -3.0865413e-01f, 9.8037556e-02f,
+    3.5747847e-01f,  2.8535318e-01f,  -2.4099323e-01f, 5.6222606e-01f,
+    4.2524221e-04f,  2.3440693e-01f,  1.2845822e-01f,  8.4975455e-03f,
+    -4.5008373e-01f, 8.2154036e-01f,  2.8282517e-01f,  4.2524221e-04f,
+    -4.2209426e-01f, -2.8859657e-01f, -1.1607920e-02f, -4.4304460e-01f,
+    3.9312372e-01f,  1.9169927e-01f,  4.2524221e-04f,  1.2468050e-01f,
+    -5.2792262e-02f, 1.6926090e-01f,  -4.1853818e-01f, 9.2529470e-01f,
+    5.7520006e-02f,  4.2524221e-04f,  -4.0745918e-02f, -2.8348507e-02f,
+    7.5871006e-02f,  -1.5704729e-01f, 1.5866600e-02f,  -4.5703375e-01f,
+    4.2524221e-04f,  -7.0983037e-02f, -1.5641823e-01f, 1.5488678e-01f,
+    4.4416137e-02f,  -3.3845279e-01f, -4.2281461e-01f, 4.2524221e-04f,
+    -1.3118438e-01f, -5.2733809e-02f, 1.1520351e-01f,  -4.3224317e-01f,
+    -8.4300148e-01f, 6.3205147e-01f,  4.2524221e-04f,  7.8757547e-02f,
+    1.9275019e-01f,  1.9086936e-01f,  -2.5372884e-01f, -1.7555788e-01f,
+    -9.6621037e-01f, 4.2524221e-04f,  6.1421297e-02f,  8.8217385e-02f,
+    3.4060486e-02f,  -9.7399390e-01f, -4.3419144e-01f, 5.9618312e-01f,
+    4.2524221e-04f,  -1.2274663e-01f, 2.5060901e-01f,  -1.1468112e-02f,
+    -7.8941458e-01f, 2.7341384e-01f,  -6.1515898e-01f, 4.2524221e-04f,
+    1.6099273e-01f,  -1.2691557e-01f, -3.2513205e-02f, -1.4611143e-01f,
+    1.5527645e-01f,  -7.2558486e-01f, 4.2524221e-04f,  1.8519001e-01f,
+    2.0532405e-01f,  -1.6910744e-01f, -4.5328170e-01f, 5.8765030e-01f,
+    -1.4862502e-01f, 4.2524221e-04f,  -1.5140006e-01f, -8.6458258e-02f,
+    -1.6047309e-01f, -4.8886415e-02f, -1.0672981e+00f, 3.1179312e-01f,
+    4.2524221e-04f,  -8.3587386e-02f, -1.2287346e-02f, -8.7571703e-02f,
+    7.1086633e-01f,  -9.1293323e-01f, -3.1528232e-01f, 4.2524221e-04f,
+    -3.2128260e-01f, 8.4963381e-02f,  1.5987569e-01f,  1.0224266e-01f,
+    6.4008594e-01f,  2.9395220e-01f,  4.2524221e-04f,  1.5786476e-01f,
+    5.3590890e-03f,  -5.5616912e-02f, 5.0357819e-01f,  1.8937828e-01f,
+    -5.5346996e-02f, 4.2524221e-04f,  -1.4033395e-02f, 4.7902409e-02f,
+    1.6469944e-02f,  -7.3634845e-01f, -8.4391439e-01f, -5.7997006e-01f,
+    4.2524221e-04f,  4.6139669e-02f,  4.9407732e-01f,  8.4475011e-02f,
+    -8.7242141e-02f, -1.4178436e-01f, 3.1666979e-01f,  4.2524221e-04f,
+    -4.6616276e-03f, 1.0166116e-01f,  -1.5386216e-02f, -7.0224798e-01f,
+    -9.4707720e-02f, -6.7165381e-01f, 4.2524221e-04f,  -9.6739337e-02f,
+    -1.2548956e-01f, 7.3886842e-02f,  3.3122525e-01f,  -3.5799292e-01f,
+    -5.1508605e-01f, 4.2524221e-04f,  -1.3676272e-01f, 1.6589473e-01f,
+    -9.8882364e-03f, -1.7261167e-01f, 8.3302140e-02f,  9.0863913e-01f,
+    4.2524221e-04f,  1.8726122e-02f,  4.0612534e-02f,  -1.7925741e-01f,
+    2.8181347e-01f,  -3.4807554e-01f, 5.5549745e-02f,  4.2524221e-04f,
+    4.9839888e-02f,  7.4148856e-02f,  -1.8405744e-01f, 1.0743636e-01f,
+    6.7921108e-01f,  6.4675426e-01f,  4.2524221e-04f,  -3.0354818e-02f,
+    -1.3061531e-01f, -8.6205132e-02f, 1.8774085e-01f,  2.0533919e-01f,
+    -1.0565798e+00f, 4.2524221e-04f,  -9.4455130e-02f, 4.2605065e-02f,
+    -1.3030939e-01f, -7.8845370e-01f, -3.1062564e-01f, 4.7709572e-01f,
+    4.2524221e-04f,  3.1350471e-02f,  3.4500074e-02f,  7.0534945e-03f,
+    -6.9176936e-01f, 1.1310098e-01f,  -1.3413320e-01f, 4.2524221e-04f,
+    2.4395806e-01f,  7.5176328e-02f,  -3.3296991e-02f, 3.1648970e-01f,
+    5.6398427e-01f,  6.1850160e-01f,  4.2524221e-04f,  2.1897383e-02f,
+    2.8146941e-02f,  -6.2531494e-02f, -1.3465967e+00f, 3.7773412e-01f,
+    7.7484167e-01f,  4.2524221e-04f,  -2.6686126e-02f, 3.1228539e-01f,
+    -4.6987804e-03f, -1.3626312e-02f, -2.4467166e-01f, 7.5986612e-01f,
+    4.2524221e-04f,  1.5947264e-01f,  -8.0746040e-02f, -1.7094454e-01f,
+    -5.1279521e-01f, 1.6267106e-01f,  8.6997056e-01f,  4.2524221e-04f,
+    4.9272887e-02f,  1.4466125e-02f,  -7.4413516e-02f, 6.9271445e-01f,
+    4.4001666e-01f,  1.5345718e+00f,  4.2524221e-04f,  -9.1197841e-02f,
+    1.4876856e-01f,  5.7679560e-02f,  -2.4695964e-01f, 2.9359481e-01f,
+    -5.4799247e-01f, 4.2524221e-04f,  4.9863290e-02f,  -2.2775574e-01f,
+    2.3091725e-01f,  -4.0654394e-01f, -5.9075952e-01f, -4.0582088e-01f,
+    4.2524221e-04f,  -1.2353448e-01f, 2.5295690e-01f,  -1.6882554e-01f,
+    4.5849243e-01f,  -4.4755647e-01f, 7.6170802e-01f,  4.2524221e-04f,
+    3.4737591e-02f,  -5.2162796e-02f, -1.8833358e-02f, 3.8493788e-01f,
+    -4.4356552e-01f, -4.3135676e-01f, 4.2524221e-04f,  -1.0027516e-02f,
+    8.8445835e-02f,  -2.4178887e-02f, -2.6687092e-01f, 1.2641342e+00f,
+    3.9741747e-02f,  4.2524221e-04f,  1.3629331e-01f,  3.0274885e-02f,
+    -4.9603201e-02f, -2.0525749e-01f, 1.5462255e-01f,  -1.0581635e-02f,
+    4.2524221e-04f,  1.7440473e-01f,  1.7528504e-02f,  4.7165579e-01f,
+    1.2549154e-01f,  3.7338325e-01f,  1.5051016e-01f,  4.2524221e-04f,
+    7.0206814e-02f,  -9.5578976e-02f, -9.7290255e-02f, 1.0440143e+00f,
+    -1.7338488e-02f, 4.5162535e-01f,  4.2524221e-04f,  1.4842103e-01f,
+    -3.5338032e-01f, 7.4242488e-02f,  -7.7942592e-01f, -3.6993718e-01f,
+    -2.6660410e-01f, 4.2524221e-04f,  -2.0005354e-01f, -1.2306155e-01f,
+    1.8234999e-01f,  1.8517707e-02f,  -2.8440616e-01f, -4.6026167e-01f,
+    4.2524221e-04f,  -3.1091446e-01f, 4.1638911e-03f,  9.4440445e-02f,
+    -3.7516692e-01f, -6.2092733e-02f, -9.0215683e-02f, 4.2524221e-04f,
+    2.2883268e-01f,  1.8635769e-01f,  -1.2636398e-01f, -3.3906421e-01f,
+    4.5099068e-01f,  3.3371735e-01f,  4.2524221e-04f,  -9.3010657e-02f,
+    1.0265566e-02f,  -2.5101772e-01f, 4.2943428e-03f,  -1.6055083e-01f,
+    1.4742446e-01f,  4.2524221e-04f,  -8.4397286e-02f, 1.1820391e-01f,
+    5.0900407e-02f,  -1.6558273e-01f, 6.0947084e-01f,  -1.7589842e-01f,
+    4.2524221e-04f,  -8.5256398e-02f, 3.7663754e-02f,  1.1899337e-01f,
+    -4.3835071e-01f, 1.1705777e-01f,  7.3433155e-01f,  4.2524221e-04f,
+    2.2138724e-01f,  -1.9364721e-01f, 6.9743916e-02f,  9.8557949e-02f,
+    3.2159248e-03f,  -5.3981431e-02f, 4.2524221e-04f,  -2.5661740e-01f,
+    -1.1817967e-02f, 8.2025968e-02f,  2.4509899e-01f,  8.9409232e-01f,
+    2.4008162e-01f,  4.2524221e-04f,  -1.5285490e-01f, -4.4015872e-01f,
+    -6.8000995e-02f, -4.9648851e-01f, 3.9301586e-01f,  -1.1496496e-01f,
+    4.2524221e-04f,  -3.1353790e-02f, -1.3127027e-01f, 7.3963152e-03f,
+    -1.4538987e-02f, -2.6664889e-01f, -7.1776815e-02f, 4.2524221e-04f,
+    1.7971347e-01f,  8.9776315e-02f,  -6.6823706e-02f, 6.0679549e-01f,
+    -4.0313128e-01f, 1.7176071e-01f,  4.2524221e-04f,  -1.9183575e-01f,
+    9.9225312e-02f,  -7.4943341e-02f, -5.9748727e-01f, 3.6232822e-02f,
+    -7.1996677e-01f, 4.2524221e-04f,  4.4172558e-01f,  -4.0398613e-01f,
+    8.7670349e-02f,  5.4896683e-02f,  1.5191953e-02f,  2.2789274e-01f,
+    4.2524221e-04f,  2.2650942e-01f,  -1.7019360e-01f, -1.3765001e-01f,
+    -6.3071078e-01f, -2.0227708e-01f, -3.9755610e-01f, 4.2524221e-04f,
+    -6.0228016e-02f, -1.7750199e-01f, 5.6910969e-02f,  6.0434830e-03f,
+    -1.1737429e-01f, 4.2684477e-02f,  4.2524221e-04f,  -2.8057194e-01f,
+    2.5394902e-01f,  1.3704218e-01f,  -1.5781705e-01f, -2.5474310e-01f,
+    4.2928544e-01f,  4.2524221e-04f,  2.9724023e-01f,  2.6418313e-01f,
+    -1.8010649e-01f, -2.1657844e-01f, 4.7013920e-02f,  -4.7393724e-01f,
+    4.2524221e-04f,  2.7483977e-02f,  3.2736838e-02f,  2.4906708e-02f,
+    -3.0411181e-01f, 3.4564175e-05f,  -3.4402776e-01f, 4.2524221e-04f,
+    -1.9265959e-01f, -3.2971239e-01f, 2.6822144e-02f,  -6.5512590e-02f,
+    -7.4751413e-01f, 1.4770815e-01f,  4.2524221e-04f,  1.4458855e-02f,
+    -2.7778953e-01f, -5.1451754e-03f, 1.5581207e-01f,  1.6314049e-01f,
+    -4.2182133e-01f, 4.2524221e-04f,  7.0643820e-02f,  -1.1189459e-01f,
+    -5.6847006e-02f, 4.5946556e-01f,  -4.3224385e-01f, 5.1544166e-01f,
+    4.2524221e-04f,  -3.5764132e-02f, 2.1091269e-01f,  5.6935500e-02f,
+    -8.4074467e-02f, -1.4390823e-01f, -9.8180163e-01f, 4.2524221e-04f,
+    1.3896167e-01f,  1.9723510e-02f,  1.7714357e-01f,  -1.7278649e-01f,
+    -4.5862481e-01f, 3.7431630e-01f,  4.2524221e-04f,  -2.1221504e-02f,
+    -1.3576227e-04f, -2.9894554e-03f, -3.3511296e-01f, -2.8855109e-01f,
+    2.3762321e-01f,  4.2524221e-04f,  -2.2072981e-01f, -2.9615086e-01f,
+    -1.6249447e-01f, 1.9396010e-01f,  -2.3452900e-01f, -6.8934381e-01f,
+    4.2524221e-04f,  -2.4711587e-01f, 6.6215292e-02f,  2.9459327e-01f,
+    2.2967811e-01f,  -6.3108307e-01f, 6.5611404e-01f,  4.2524221e-04f,
+    -2.1285322e-02f, -1.2386114e-01f, 6.2201191e-02f,  5.3436661e-01f,
+    -4.0431392e-01f, -7.7562147e-01f, 4.2524221e-04f,  -8.6382926e-02f,
+    -3.3706561e-01f, 1.0842432e-01f,  5.1179561e-03f,  -4.7464913e-01f,
+    2.0684363e-02f,  4.2524221e-04f,  9.6528884e-03f,  4.3087178e-01f,
+    -1.1043572e-01f, -4.9431446e-01f, 1.8031393e-01f,  2.6970196e-01f,
+    4.2524221e-04f,  -2.6531018e-02f, -1.9610430e-01f, -1.6790607e-03f,
+    1.1281374e+00f,  1.5136592e-01f,  9.8486796e-02f,  4.2524221e-04f,
+    -1.8034083e-01f, -1.3662821e-01f, -1.3259698e-01f, -8.6151391e-02f,
+    -2.8930221e-02f, -1.9516864e-01f, 4.2524221e-04f,  -1.6123053e-01f,
+    5.1227976e-02f,  1.4094310e-01f,  7.2831273e-02f,  -6.0214359e-01f,
+    3.6388621e-01f,  4.2524221e-04f,  -2.4341675e-02f, -3.0543881e-02f,
+    6.9366746e-02f,  5.9653524e-02f,  -5.3063637e-01f, 1.7783808e-02f,
+    4.2524221e-04f,  1.3313243e-01f,  9.9556588e-02f,  7.0932761e-02f,
+    -7.2326390e-03f, 3.9656582e-01f,  1.8637327e-02f,  4.2524221e-04f,
+    -1.3823928e-01f, -3.5957817e-02f, 5.6716511e-03f,  8.5180300e-01f,
+    -3.3381844e-01f, -5.4434454e-01f, 4.2524221e-04f,  -3.7100065e-02f,
+    1.1523914e-02f,  2.5128178e-02f,  7.7173285e-02f,  4.3894690e-01f,
+    -4.3848313e-02f, 4.2524221e-04f,  -7.6498985e-03f, -1.1426557e-01f,
+    -1.8219030e-01f, -3.2270139e-01f, 1.9955225e-01f,  1.9636966e-01f,
+    4.2524221e-04f,  -3.2669120e-02f, -7.9211906e-02f, 7.4755155e-02f,
+    6.2405288e-01f,  -1.7592129e-01f, 8.4854907e-01f,  4.2524221e-04f,
+    -1.9327438e-01f, -1.0056755e-01f, 2.1392666e-02f,  -9.8348242e-01f,
+    5.6787902e-01f,  -5.0179607e-01f, 4.2524221e-04f,  3.9088953e-02f,
+    2.5658950e-01f,  1.9277962e-01f,  9.7212851e-02f,  -5.3468066e-01f,
+    1.2522656e-01f,  4.2524221e-04f,  1.1882245e-01f,  3.5993233e-01f,
+    -3.4517404e-01f, 1.1876222e-01f,  6.2315524e-01f,  -4.8743585e-01f,
+    4.2524221e-04f,  -4.0051651e-01f, -1.0897187e-01f, -7.4801184e-03f,
+    6.8073675e-02f,  4.1849717e-02f,  8.5073948e-01f,  4.2524221e-04f,
+    4.7407817e-02f,  -1.9368078e-01f, -1.7201653e-01f, -7.0505485e-02f,
+    3.6740083e-01f,  8.0027008e-01f,  4.2524221e-04f,  -1.3267617e-01f,
+    1.9472872e-01f,  -4.0064894e-02f, -1.0380410e-01f, 6.3962227e-01f,
+    2.3921097e-02f,  4.2524221e-04f,  2.7988908e-01f,  -6.2925845e-02f,
+    -1.7611413e-01f, -5.0337654e-01f, 2.7330443e-01f,  -5.0476772e-01f,
+    4.2524221e-04f,  3.4515928e-02f,  -9.3930382e-03f, -3.0169618e-01f,
+    -3.1043866e-01f, 3.9833727e-01f,  -6.8845254e-01f, 4.2524221e-04f,
+    -3.4974125e-01f, -7.9577379e-03f, -3.0059164e-02f, -7.0850009e-01f,
+    -2.4121274e-01f, -2.8753868e-01f, 4.2524221e-04f,  -7.7691572e-03f,
+    -2.0413874e-02f, -1.2392884e-01f, 3.0408052e-01f,  -6.8857402e-02f,
+    -3.5033783e-01f, 4.2524221e-04f,  -1.5277613e-02f, -1.7419693e-01f,
+    3.0105142e-04f,  5.7307982e-01f,  -2.8771883e-01f, -2.3910010e-01f,
+    4.2524221e-04f,  -4.0721068e-01f, -4.4756867e-03f, -7.0407726e-02f,
+    2.7276587e-01f,  -5.8952087e-01f, 6.2534916e-01f,  4.2524221e-04f,
+    -6.2416784e-02f, 2.4753070e-01f,  -3.9489728e-01f, -5.6489557e-01f,
+    -1.7005162e-01f, 3.2263398e-01f,  4.2524221e-04f,  3.4809310e-02f,
+    1.7183147e-01f,  1.1291619e-01f,  4.0835243e-02f,  8.4092546e-01f,
+    1.0386057e-01f,  4.2524221e-04f,  9.9502884e-02f,  -8.9014553e-02f,
+    1.4327242e-02f,  -1.3415192e-01f, 2.0539683e-01f,  5.1225615e-01f,
+    4.2524221e-04f,  -9.9338576e-02f, 7.7903412e-02f,  7.8683093e-02f,
+    -4.4619256e-01f, -3.8642880e-01f, -4.5288616e-01f, 4.2524221e-04f,
+    -6.6464217e-03f, 7.2777376e-02f,  -1.0936357e-01f, -5.5160701e-01f,
+    4.2614067e-01f,  -5.7428426e-01f, 4.2524221e-04f,  2.0513022e-01f,
+    2.3137546e-01f,  -1.1580054e-01f, -2.6082063e-01f, -2.2664042e-03f,
+    1.8098317e-01f,  4.2524221e-04f,  2.5404522e-01f,  1.9739975e-01f,
+    -1.3916019e-01f, -1.0633951e-01f, 4.8841217e-01f,  4.0106681e-01f,
+    4.2524221e-04f,  4.6066976e-01f,  4.3471590e-02f,  -2.2038933e-02f,
+    -2.6529682e-01f, 1.9761522e-01f,  -1.5468059e-01f, 4.2524221e-04f,
+    -1.0868851e-01f, 1.8440472e-01f,  -2.0887006e-02f, -2.9455331e-01f,
+    3.4735510e-01f,  3.9640254e-01f,  4.2524221e-04f,  6.4529307e-02f,
+    5.6022227e-02f,  -2.0796317e-01f, -9.1954306e-02f, 2.9907936e-01f,
+    1.0605063e-01f,  4.2524221e-04f,  -2.8637618e-01f, 3.6168817e-01f,
+    -1.7773281e-01f, -3.5550937e-01f, 5.5719107e-02f,  2.8447077e-01f,
+    4.2524221e-04f,  1.4367229e-01f,  3.6790896e-02f,  -8.9957513e-02f,
+    -3.4482917e-01f, 3.0745074e-01f,  -3.3021083e-01f, 4.2524221e-04f,
+    -3.7273146e-02f, 4.6586398e-02f,  -2.8032130e-01f, 5.1836554e-02f,
+    -5.1946968e-01f, -3.9904383e-03f, 4.2524221e-04f,  5.5017443e-03f,
+    1.4061913e-01f,  3.2810003e-01f,  -1.8671514e-02f, -1.3396165e-01f,
+    7.7566516e-01f,  4.2524221e-04f,  1.2836756e-01f,  3.2673013e-01f,
+    1.0522574e-01f,  -3.9210036e-01f, 1.9058160e-01f,  6.0012627e-01f,
+    4.2524221e-04f,  -2.8322670e-03f, 8.1709050e-02f,  1.5856279e-01f,
+    -2.0207804e-01f, -6.5358698e-01f, 3.0881688e-01f,  4.2524221e-04f,
+    -1.8327482e-01f, 1.7410596e-01f,  2.7175525e-01f,  -5.8174741e-01f,
+    5.7829767e-01f,  -3.0759615e-01f, 4.2524221e-04f,  1.8862121e-01f,
+    2.3421846e-02f,  -1.4547379e-01f, -1.0047355e+00f, -9.5609769e-02f,
+    -5.0194430e-01f, 4.2524221e-04f,  -2.5877842e-01f, 7.4365117e-02f,
+    5.3207774e-02f,  2.4205221e-01f,  -7.7687895e-01f, 6.5718162e-01f,
+    4.2524221e-04f,  8.3015468e-03f,  -1.3867578e-01f, 7.8228295e-02f,
+    8.8911873e-01f,  3.1582989e-02f,  -3.2893449e-01f, 4.2524221e-04f,
+    2.8517511e-01f,  2.2674799e-01f,  -5.3789582e-02f, 2.1177682e-01f,
+    6.9943660e-01f,  1.0750194e+00f,  4.2524221e-04f,  -8.4114768e-02f,
+    8.7255299e-02f,  -5.8825564e-01f, -1.6866541e-01f, -2.9444021e-01f,
+    4.5898318e-01f,  4.2524221e-04f,  1.8694002e-02f,  -9.8854899e-03f,
+    -4.0483117e-02f, 3.2066804e-01f,  4.1060719e-01f,  -4.5368248e-01f,
+    4.2524221e-04f,  2.5169483e-01f,  -4.2046070e-01f, 2.2424984e-01f,
+    1.8642014e-01f,  5.0467944e-01f,  4.7185245e-01f,  4.2524221e-04f,
+    1.9922593e-01f,  -1.3122274e-01f, 1.2862726e-01f,  -4.6471819e-01f,
+    4.1538861e-01f,  -1.5472211e-01f, 4.2524221e-04f,  -1.0976720e-01f,
+    -3.8183514e-02f, -2.9475859e-03f, -1.5112279e-01f, -3.9564857e-01f,
+    -4.2611513e-01f, 4.2524221e-04f,  5.5980727e-02f,  -3.3356067e-02f,
+    -1.2449604e-01f, 3.6787327e-02f,  -2.9011074e-01f, 6.8637788e-01f,
+    4.2524221e-04f,  8.7973373e-03f,  2.7395710e-02f,  -4.3055974e-02f,
+    2.7709210e-01f,  9.3438959e-01f,  2.6971966e-01f,  4.2524221e-04f,
+    3.3903524e-02f,  4.4548274e-03f,  -8.2844555e-02f, 8.1345606e-01f,
+    2.5008738e-02f,  1.2615150e-01f,  4.2524221e-04f,  5.4220194e-01f,
+    1.4434942e-02f,  4.7721926e-02f,  2.2486478e-01f,  4.9673972e-01f,
+    -1.7291072e-01f, 4.2524221e-04f,  -1.1954618e-01f, -3.9789897e-01f,
+    1.5299262e-01f,  -1.0768209e-02f, -2.4667594e-01f, -3.0026221e-01f,
+    4.2524221e-04f,  4.6828151e-02f,  -1.1296233e-01f, -2.8746171e-02f,
+    7.7913769e-02f,  6.7700285e-01f,  4.6074694e-01f,  4.2524221e-04f,
+    2.0316719e-01f,  1.8546565e-02f,  -1.8656729e-01f, 5.0312415e-02f,
+    -5.4829341e-01f, -2.4150999e-01f, 4.2524221e-04f,  7.5555742e-02f,
+    -2.8670877e-01f, 3.7772983e-01f,  -5.2546021e-03f, 7.6198977e-01f,
+    1.3225211e-01f,  4.2524221e-04f,  -3.5418484e-01f, 2.5971153e-01f,
+    -4.0895811e-01f, -4.2870775e-02f, -1.9482996e-01f, -4.0891513e-01f,
+    4.2524221e-04f,  1.9957203e-01f,  -1.2344085e-01f, 1.2681608e-01f,
+    3.6128989e-01f,  2.5084922e-01f,  -2.1348737e-01f, 4.2524221e-04f,
+    -8.4972858e-02f, -7.6948851e-02f, 1.4991978e-02f,  -2.2722845e-01f,
+    1.3533474e+00f,  -9.1036373e-01f, 4.2524221e-04f,  4.0499222e-02f,
+    1.5458107e-01f,  9.1433093e-02f,  -9.8637152e-01f, 6.8798542e-01f,
+    1.2652132e-01f,  4.2524221e-04f,  -1.3328849e-01f, 5.2899730e-01f,
+    2.5426340e-01f,  2.9279964e-02f,  6.7669886e-01f,  8.7504014e-02f,
+    4.2524221e-04f,  2.1768717e-02f,  -2.0213337e-01f, -6.5388098e-02f,
+    -2.9381168e-01f, -1.9073659e-01f, -5.1278132e-01f, 4.2524221e-04f,
+    1.3310824e-01f,  -2.7460909e-02f, -1.0676764e-01f, 1.2132843e+00f,
+    2.2298340e-01f,  8.2831341e-01f,  4.2524221e-04f,  2.3097621e-01f,
+    8.5518554e-02f,  -1.2092958e-01f, -3.5663152e-01f, 2.7573928e-01f,
+    -1.9825563e-01f, 4.2524221e-04f,  1.0934645e-01f,  -8.7501816e-02f,
+    -2.4669701e-01f, 7.6741141e-01f,  5.0448716e-01f,  -1.0834196e-01f,
+    4.2524221e-04f,  1.8530484e-01f,  3.4174684e-02f,  1.5646201e-01f,
+    9.4139254e-01f,  2.5214201e-01f,  -4.9693108e-01f, 4.2524221e-04f,
+    -1.2585643e-01f, -1.7891359e-01f, -1.3805175e-01f, -5.5314928e-01f,
+    5.7860100e-01f,  1.0814093e-02f,  4.2524221e-04f,  -8.7974980e-02f,
+    1.8139005e-01f,  1.9811335e-01f,  -8.6020619e-01f, 3.7998101e-01f,
+    -6.0617048e-01f, 4.2524221e-04f,  -2.1366538e-01f, -2.8991837e-02f,
+    1.6314709e-01f,  1.8656220e-01f,  4.5131448e-01f,  3.3050379e-01f,
+    4.2524221e-04f,  1.1256606e-01f,  -9.6497804e-02f, 7.0928104e-02f,
+    2.7094325e-01f,  -8.0149263e-01f, 1.2670897e-02f,  4.2524221e-04f,
+    2.4347697e-01f,  1.3383057e-02f,  -2.6464200e-01f, -1.7431870e-01f,
+    -3.7662300e-01f, 8.3716944e-02f,  4.2524221e-04f,  -3.1822246e-01f,
+    5.7659373e-02f,  -1.2617953e-01f, -3.1177822e-01f, -3.1086314e-01f,
+    -1.6085684e-01f, 4.2524221e-04f,  2.4692762e-01f,  -3.1178862e-01f,
+    1.9952995e-01f,  3.9238483e-01f,  -4.2550820e-01f, -5.5569744e-01f,
+    4.2524221e-04f,  1.5500219e-01f,  5.7150112e-03f,  -1.1340847e-02f,
+    1.4945309e-01f,  2.7379009e-01f,  2.0625734e-01f,  4.2524221e-04f,
+    1.6768256e-01f,  -4.7128350e-01f, 5.3742554e-02f,  8.4879495e-02f,
+    2.3286544e-01f,  7.4328578e-01f,  4.2524221e-04f,  2.4838540e-01f,
+    8.7162726e-02f,  6.2655974e-03f,  -1.6034657e-01f, -3.8968045e-01f,
+    4.9244452e-01f,  4.2524221e-04f,  -6.2987030e-02f, -1.3182718e-01f,
+    -1.6978437e-01f, 2.1902704e-01f,  -7.0577306e-01f, -3.3472535e-01f,
+    4.2524221e-04f,  -2.8039575e-01f, 4.7684874e-02f,  -1.7875251e-01f,
+    -1.2335522e+00f, -4.3686339e-01f, -4.3411765e-02f, 4.2524221e-04f,
+    -8.3724588e-02f, -7.2850031e-03f, 1.6124761e-01f,  -4.5697114e-01f,
+    4.9202301e-02f,  3.4172356e-01f,  4.2524221e-04f,  1.2950442e-02f,
+    -7.2970480e-02f, 8.7202005e-02f,  1.1089588e-01f,  1.4220235e-01f,
+    1.0735790e+00f,  4.2524221e-04f,  -2.3068037e-02f, -5.3824164e-02f,
+    -9.9369422e-02f, -1.3626503e+00f, 3.7142697e-01f,  3.2872483e-01f,
+    4.2524221e-04f,  -9.4487056e-02f, 2.0781608e-01f,  2.6805231e-01f,
+    8.2815714e-02f,  -6.4598866e-02f, -1.1031324e+00f, 4.2524221e-04f,
+    3.0240315e-01f,  -3.2626951e-01f, -2.0183936e-01f, -3.3096763e-01f,
+    4.7207242e-01f,  4.0066612e-01f,  4.2524221e-04f,  4.0568952e-02f,
+    -5.7891309e-03f, -2.1880756e-03f, 3.6196655e-01f,  6.7969316e-01f,
+    7.7404845e-01f,  4.2524221e-04f,  -1.2602168e-01f, -8.8083550e-02f,
+    -1.5483154e-01f, 1.1978400e+00f,  -3.9826334e-02f, -8.5664429e-02f,
+    4.2524221e-04f,  2.7540667e-02f,  3.8233176e-01f,  -3.1928834e-01f,
+    -4.9729136e-01f, 5.1598358e-01f,  2.1719547e-01f,  4.2524221e-04f,
+    4.9473715e-01f,  -1.5038919e-01f, 1.6167887e-01f,  1.0019143e-01f,
+    -6.4764369e-01f, 2.7181607e-01f,  4.2524221e-04f,  -4.5583122e-03f,
+    1.8841159e-02f,  9.0789218e-03f,  -3.4894064e-01f, 1.1940507e+00f,
+    -2.0905848e-01f, 4.2524221e-04f,  4.1136804e-01f,  4.5303986e-03f,
+    -5.2229241e-02f, -4.3855041e-01f, -5.6924307e-01f, 6.8723637e-01f,
+    4.2524221e-04f,  9.3354201e-03f,  1.1280259e-01f,  2.5641006e-01f,
+    3.5463244e-01f,  3.1278756e-01f,  1.8794464e-01f,  4.2524221e-04f,
+    -8.3529964e-02f, -1.5178075e-01f, 3.0708858e-01f,  4.2004418e-01f,
+    7.7655578e-01f,  -2.5741482e-01f, 4.2524221e-04f,  2.2518004e-01f,
+    -5.2192833e-02f, -2.1948409e-01f, -8.4531838e-01f, -3.9843234e-01f,
+    -1.9529273e-01f, 4.2524221e-04f,  9.4479308e-02f,  2.9467750e-01f,
+    8.9064136e-02f,  -4.2378661e-01f, -8.1728941e-01f, 2.1463831e-01f,
+    4.2524221e-04f,  2.6042691e-01f,  2.2843987e-01f,  4.1091021e-02f,
+    1.7020476e-01f,  3.3711955e-01f,  -6.9305815e-02f, 4.2524221e-04f,
+    -4.3036529e-01f, -3.0244246e-01f, -1.0803536e-01f, 5.7014644e-01f,
+    -6.7048460e-02f, 6.1771977e-01f,  4.2524221e-04f,  -4.8004159e-01f,
+    2.1672672e-01f,  -3.1727981e-02f, -2.6590165e-01f, -2.9074933e-02f,
+    -3.7910530e-01f, 4.2524221e-04f,  7.7203013e-02f,  2.3495296e-02f,
+    -2.1834677e-02f, 1.4777166e-01f,  -1.8331994e-01f, 3.8823250e-01f,
+    4.2524221e-04f,  8.0698798e-04f,  -2.0181616e-01f, -2.8987734e-02f,
+    6.3677335e-01f,  -7.3155540e-01f, -1.7035645e-01f, 4.2524221e-04f,
+    -6.4415105e-02f, -8.5588455e-02f, -1.2076505e-02f, 8.9396638e-01f,
+    -2.3984405e-01f, 5.3203154e-01f,  4.2524221e-04f,  1.5581731e-01f,
+    4.0706173e-01f,  -3.2788519e-02f, -3.8853493e-02f, -1.0616943e-01f,
+    1.5764322e-02f,  4.2524221e-04f,  -6.5745108e-02f, -1.8022074e-01f,
+    3.0143541e-01f,  5.2947521e-02f,  -3.3689898e-01f, 4.5815796e-02f,
+    4.2524221e-04f,  -1.1555911e-01f, -1.1878532e-01f, 1.7281310e-01f,
+    7.2894138e-01f,  3.3655125e-01f,  5.9280120e-02f,  4.2524221e-04f,
+    -2.8272390e-01f, 2.8440881e-01f,  2.6604033e-01f,  -3.4913486e-01f,
+    -1.9567727e-01f, 8.0797118e-01f,  4.2524221e-04f,  1.4249170e-01f,
+    -3.2275257e-01f, 3.3360582e-02f,  -8.3627719e-01f, 4.4384214e-01f,
+    -5.7542598e-01f, 4.2524221e-04f,  2.1481293e-01f,  2.6621398e-01f,
+    -1.2833585e-01f, 5.6968081e-01f,  3.1035224e-01f,  -4.5199507e-01f,
+    4.2524221e-04f,  -1.4219360e-01f, -4.3803088e-02f, -4.6387129e-02f,
+    8.5476321e-01f,  -2.3036179e-01f, -1.9935262e-01f, 4.2524221e-04f,
+    -1.2206751e-01f, -1.2761718e-01f, 2.3713002e-02f,  -1.1154665e-01f,
+    -3.4599584e-01f, -3.4939817e-01f, 4.2524221e-04f,  2.2550231e-02f,
+    -1.2879626e-01f, -1.4580293e-01f, 3.6900163e-02f,  -1.1923765e+00f,
+    -3.5290870e-01f, 4.2524221e-04f,  5.7361704e-01f,  1.0135137e-01f,
+    1.1580420e-01f,  8.2064427e-02f,  2.6263624e-01f,  2.9979834e-01f,
+    4.2524221e-04f,  6.9515154e-02f,  -2.4413483e-01f, -5.2721616e-02f,
+    -3.8506284e-01f, -6.4620906e-01f, -5.9624743e-01f, 4.2524221e-04f,
+    -6.1243935e-03f, 6.7365482e-02f,  -9.0251490e-02f, -3.6948121e-01f,
+    1.0993323e-01f,  -1.1918696e-01f, 4.2524221e-04f,  -5.9633836e-02f,
+    -4.3678004e-02f, 8.8739648e-02f,  -1.3570778e-01f, 8.3517295e-01f,
+    1.0714117e-01f,  4.2524221e-04f,  3.1671870e-01f,  -4.7124809e-01f,
+    1.3508266e-01f,  3.3855671e-01f,  4.7528154e-01f,  -5.8971047e-01f,
+    4.2524221e-04f,  -2.8101292e-01f, 3.2524601e-01f,  1.8996252e-01f,
+    3.4437977e-02f,  -8.9535552e-01f, -1.1821542e-01f, 4.2524221e-04f,
+    8.7360397e-02f,  -6.4803854e-02f, -3.5562407e-02f, -1.9053020e-01f,
+    -2.2582971e-01f, -6.2472306e-02f, 4.2524221e-04f,  -2.9329324e-01f,
+    -2.7417824e-01f, 1.1810481e-01f,  8.4965724e-01f,  -6.5472744e-02f,
+    1.5417866e-01f,  4.2524221e-04f,  4.8945490e-02f,  -9.2547052e-02f,
+    1.0741279e-02f,  6.8655288e-01f,  -1.1046035e+00f, 2.7061203e-01f,
+    4.2524221e-04f,  1.5586349e-01f,  -2.5229111e-01f, 2.3776799e-02f,
+    9.8775005e-01f,  -2.7451345e-01f, -2.0263436e-01f, 4.2524221e-04f,
+    1.8664643e-03f,  -8.8074543e-02f, 7.6768715e-03f,  3.8581857e-01f,
+    2.8611168e-01f,  -5.3370991e-03f, 4.2524221e-04f,  -1.7549123e-01f,
+    1.7310123e-01f,  2.2062732e-01f,  -2.0185371e-01f, -4.9658203e-01f,
+    -3.6814332e-01f, 4.2524221e-04f,  -3.4427583e-01f, -5.1099622e-01f,
+    7.0683092e-02f,  5.4417121e-01f,  -1.5044780e-01f, 2.4605605e-01f,
+    4.2524221e-04f,  9.5470153e-02f,  1.1968660e-01f,  -2.8386766e-01f,
+    3.6326036e-01f,  6.5153170e-01f,  7.5427431e-01f,  4.2524221e-04f,
+    -1.7596592e-01f, -3.6929369e-01f, 1.7650379e-01f,  1.8982802e-01f,
+    -3.3434723e-02f, -1.7100264e-01f, 4.2524221e-04f,  5.9746332e-02f,
+    -5.4291566e-03f, 2.7417295e-02f,  7.2204918e-01f,  -4.1095205e-02f,
+    1.3860859e-01f,  4.2524221e-04f,  -1.8077110e-01f, 1.5358247e-01f,
+    -2.4541134e-02f, -4.3253544e-01f, -3.4169495e-01f, -1.8532450e-01f,
+    4.2524221e-04f,  -1.5047994e-01f, -1.7405728e-01f, -1.0708266e-01f,
+    1.7643359e-01f,  -1.9239874e-01f, -9.0829039e-01f, 4.2524221e-04f,
+    -1.0832275e-01f, -2.7016816e-01f, -3.5729785e-02f, -3.0720302e-01f,
+    -5.2063406e-02f, -2.5750580e-01f, 4.2524221e-04f,  -4.6826981e-02f,
+    -4.8485696e-02f, -1.5099053e-01f, 3.5306349e-01f,  1.2127876e+00f,
+    -1.4873780e-02f, 4.2524221e-04f,  5.9326794e-03f,  4.7747534e-02f,
+    -8.0543414e-02f, 3.3139968e-01f,  2.4390240e-01f,  -2.3859148e-01f,
+    4.2524221e-04f,  -2.8181419e-01f, 3.9076668e-01f,  8.2394131e-02f,
+    -1.0311078e-01f, -1.5051240e-02f, -1.1317210e-02f, 4.2524221e-04f,
+    -3.9636351e-02f, 6.4322941e-02f,  2.2112089e-01f,  -9.2929608e-01f,
+    -4.4111279e-01f, -1.8459518e-01f, 4.2524221e-04f,  -8.0882527e-02f,
+    -5.3482848e-01f, -4.4907089e-02f, 5.7603568e-01f,  1.0898951e-01f,
+    -8.8375248e-02f, 4.2524221e-04f,  1.0426223e-01f,  -1.9884385e-01f,
+    -1.6454972e-01f, -7.7765323e-02f, 2.4396433e-01f,  4.1170165e-01f,
+    4.2524221e-04f,  6.7491367e-02f,  -2.2494389e-01f, 2.3740250e-01f,
+    -7.1736908e-01f, 6.8990833e-01f,  3.2261533e-01f,  4.2524221e-04f,
+    2.8791195e-02f,  7.8626890e-03f,  -1.0650118e-01f, 1.2547076e-01f,
+    -1.5376982e-01f, -3.9602396e-01f, 4.2524221e-04f,  -2.1179552e-01f,
+    -1.8070774e-01f, 8.1818618e-02f,  -2.1070567e-01f, 1.1403233e-01f,
+    9.0927385e-02f,  4.2524221e-04f,  -1.8575308e-03f, -6.1437313e-02f,
+    1.5328768e-02f,  -9.9276930e-01f, 4.4626612e-02f,  -1.6329136e-01f,
+    4.2524221e-04f,  3.5620552e-01f,  -7.5357705e-02f, -2.0542692e-02f,
+    3.6689162e-02f,  1.5991510e-01f,  4.8423269e-01f,  4.2524221e-04f,
+    -2.7537715e-01f, -8.8701747e-02f, -1.0147815e-01f, -1.0574761e-01f,
+    5.4233819e-01f,  1.9430749e-01f,  4.2524221e-04f,  -1.6808774e-02f,
+    -2.4182665e-01f, -5.2863855e-02f, 1.6076769e-01f,  3.1808126e-01f,
+    5.4979670e-01f,  4.2524221e-04f,  7.8577407e-02f,  4.0045127e-02f,
+    -1.4603028e-01f, 4.2129436e-01f,  6.0073954e-01f,  -6.6608900e-01f,
+    4.2524221e-04f,  9.5670983e-02f,  2.4700850e-01f,  4.5635734e-02f,
+    -4.7728243e-01f, 1.9680637e-01f,  -2.7621496e-01f, 4.2524221e-04f,
+    -2.6276016e-01f, -3.1463605e-01f, 4.6054568e-02f,  1.8232624e-01f,
+    5.4714763e-01f,  -3.2517221e-02f, 4.2524221e-04f,  1.5802158e-02f,
+    -2.0750746e-01f, -1.9261293e-02f, 4.4261548e-01f,  -7.9906650e-02f,
+    -3.7069431e-01f, 4.2524221e-04f,  -1.7820776e-01f, -2.0312509e-01f,
+    1.0928279e-02f,  7.7818090e-01f,  5.3738102e-02f,  6.1469358e-01f,
+    4.2524221e-04f,  -4.7285169e-02f, -8.1754826e-02f, 3.5087305e-01f,
+    -1.7471641e-01f, -3.7182125e-01f, -2.8422785e-01f, 4.2524221e-04f,
+    1.8552251e-01f,  -2.7961100e-02f, 1.0576315e-02f,  1.6873041e-01f,
+    1.2618817e-01f,  2.3374677e-02f,  4.2524221e-04f,  6.2451422e-02f,
+    2.1975082e-01f,  -8.0675185e-02f, -1.0115409e+00f, 3.5902664e-01f,
+    9.4094712e-01f,  4.2524221e-04f,  1.7549230e-01f,  3.0224830e-01f,
+    6.1378583e-02f,  -3.7785816e-01f, -3.1121659e-01f, -6.4453804e-01f,
+    4.2524221e-04f,  -1.1562916e-02f, -4.3279074e-02f, 2.1968156e-01f,
+    7.6314092e-01f,  2.7365914e-01f,  1.2414942e+00f,  4.2524221e-04f,
+    2.4942562e-02f,  -2.2669297e-01f, -4.2426489e-02f, -5.8109152e-01f,
+    -9.5140174e-02f, 1.8856217e-01f,  4.2524221e-04f,  2.3500895e-02f,
+    -2.6258335e-01f, 3.5159636e-02f,  -2.2540273e-01f, 1.3349633e-01f,
+    2.4041383e-01f,  4.2524221e-04f,  3.0685884e-01f,  -7.5942799e-02f,
+    -1.9636050e-01f, -4.3826777e-01f, 8.7217337e-01f,  -1.1831326e-01f,
+    4.2524221e-04f,  -5.4000854e-01f, -4.9547851e-02f, 9.5842272e-02f,
+    -3.0425093e-01f, 5.5910662e-02f,  3.9586414e-02f,  4.2524221e-04f,
+    -6.6837423e-02f, -2.7452702e-02f, 6.5130323e-02f,  5.6197387e-01f,
+    -9.0140574e-02f, 7.7510601e-01f,  4.2524221e-04f,  -1.2255727e-01f,
+    1.4311929e-01f,  4.0784118e-01f,  -2.0621242e-01f, -8.3209503e-01f,
+    -7.9739869e-02f, 4.2524221e-04f,  3.1605421e-03f,  6.5458536e-02f,
+    8.0096193e-02f,  2.8463723e-02f,  -7.3167956e-01f, 6.2876046e-01f,
+    4.2524221e-04f,  2.1385050e-01f,  -1.2446000e-01f, -7.7775151e-02f,
+    -3.6479920e-01f, 2.9188228e-01f,  4.9462464e-01f,  4.2524221e-04f,
+    9.7945176e-02f,  5.0228184e-01f,  1.2532781e-01f,  -1.6820884e-01f,
+    5.4619871e-02f,  -2.2341976e-01f, 4.2524221e-04f,  1.6906865e-01f,
+    2.3230301e-01f,  -7.9778165e-02f, -1.3981427e-01f, 2.0445855e-01f,
+    1.4598115e-01f,  4.2524221e-04f,  -2.3083951e-01f, -1.2815353e-01f,
+    -8.2986437e-02f, -3.8741472e-01f, -9.6694821e-01f, -2.0893198e-01f,
+    4.2524221e-04f,  -2.8678268e-01f, 3.3133966e-01f,  -3.8621360e-01f,
+    -3.1751993e-01f, 6.1450683e-02f,  1.2512209e-01f,  4.2524221e-04f,
+    2.3860487e-01f,  9.1560215e-02f,  3.4467034e-02f,  3.8503122e-03f,
+    -5.9466463e-01f, 1.4045978e+00f,  4.2524221e-04f,  2.2791898e-02f,
+    -2.4371918e-01f, -1.1899748e-01f, -3.3875480e-02f, 1.0718188e+00f,
+    -3.3057433e-01f, 4.2524221e-04f,  6.0494401e-02f,  -4.0027436e-02f,
+    4.6315026e-03f,  3.7647781e-01f,  -6.1523962e-01f, -4.4806430e-01f,
+    4.2524221e-04f,  -1.4398930e-02f, 8.8689297e-02f,  2.1196980e-02f,
+    -8.1722900e-02f, 4.7885597e-01f,  -2.8925687e-01f, 4.2524221e-04f,
+    -1.5524706e-01f, 1.4301302e-01f,  1.9916880e-01f,  -2.7829605e-01f,
+    -1.6239963e-01f, -5.1179785e-01f, 4.2524221e-04f,  1.7143184e-01f,
+    1.0019513e-01f,  1.5578574e-01f,  -1.9651586e-01f, 9.2729092e-02f,
+    -1.5538944e-02f, 4.2524221e-04f,  -4.7408080e-01f, 5.0612073e-02f,
+    -2.1197836e-01f, 9.1675021e-02f,  2.6731426e-01f,  4.9677739e-01f,
+    4.2524221e-04f,  1.2808032e-01f,  1.2442170e-01f,  -3.3044627e-01f,
+    1.9096320e-02f,  2.2950390e-01f,  1.8157041e-02f,  4.2524221e-04f,
+    6.6089116e-02f,  -2.6629618e-01f, 3.4804799e-02f,  3.3293316e-01f,
+    2.2796112e-01f,  -3.8085213e-01f, 4.2524221e-04f,  9.2263952e-02f,
+    -6.5684423e-04f, -4.9896240e-02f, 5.7995224e-01f,  3.9322713e-01f,
+    9.3843347e-01f,  4.2524221e-04f,  5.7055873e-01f,  -6.9591566e-03f,
+    -1.1013345e-01f, -8.4581479e-02f, 1.2417093e-01f,  6.0987943e-01f,
+    4.2524221e-04f,  8.6895220e-02f,  5.8952796e-01f,  1.0544782e-01f,
+    2.0634830e-01f,  -3.0626750e-01f, -4.4669414e-01f, 4.2524221e-04f,
+    7.7322349e-03f,  -2.0595033e-02f, 9.6146993e-02f,  5.2338964e-01f,
+    -3.3208278e-01f, -6.5161020e-01f, 4.2524221e-04f,  2.4041528e-01f,
+    1.2178984e-01f,  -1.4620358e-02f, 5.6683809e-02f,  -1.5925193e-01f,
+    1.1477942e-01f,  4.2524221e-04f,  2.6970300e-01f,  2.8292149e-01f,
+    -1.4419414e-01f, 3.0248770e-01f,  2.3761137e-01f,  7.9628110e-02f,
+    4.2524221e-04f,  -1.8196186e-03f, 1.0339138e-01f,  1.5589855e-02f,
+    -6.1143917e-01f, 5.8870763e-02f,  -5.5185825e-01f, 4.2524221e-04f,
+    -5.8955574e-01f, 5.0430399e-01f,  1.0446996e-01f,  3.3214679e-01f,
+    1.1066406e-01f,  2.1336867e-01f,  4.2524221e-04f,  3.6503878e-01f,
+    4.7822750e-01f,  2.1800978e-01f,  2.8266385e-01f,  -5.2650284e-02f,
+    -1.0749738e-01f, 4.2524221e-04f,  -2.5026042e-02f, -1.3568670e-01f,
+    8.8454850e-02f,  5.0228643e-01f,  7.2195143e-01f,  -3.6857009e-01f,
+    4.2524221e-04f,  3.3050784e-01f,  1.1087789e-03f,  7.7116556e-02f,
+    -1.3000013e-01f, 2.0656547e-01f,  -3.1055239e-01f, 4.2524221e-04f,
+    1.0038084e-01f,  2.9623389e-01f,  -2.8594765e-01f, -6.3773435e-01f,
+    -2.2472218e-01f, 2.7194136e-01f,  4.2524221e-04f,  -1.1816387e-01f,
+    -4.4781701e-03f, 2.2403985e-02f,  -2.9971334e-01f, -3.3830848e-02f,
+    7.4560910e-01f,  4.2524221e-04f,  -4.3074316e-03f, 2.2711021e-01f,
+    -5.6205500e-02f, -2.5100843e-03f, 3.0221465e-01f,  2.9007548e-02f,
+    4.2524221e-04f,  -2.3735079e-01f, 2.8882644e-01f,  7.3939011e-02f,
+    2.2294943e-01f,  -3.0588943e-01f, 3.1963449e-02f,  4.2524221e-04f,
+    -1.7048031e-01f, -1.3972566e-01f, 1.1619692e-01f,  6.2545680e-02f,
+    -1.4198409e-01f, 8.5753149e-01f,  4.2524221e-04f,  -1.6298614e-02f,
+    -8.2994640e-02f, 4.6882477e-02f,  2.9218301e-01f,  -1.0170504e-01f,
+    -4.2390954e-01f, 4.2524221e-04f,  -8.9525767e-03f, -2.5133255e-01f,
+    8.3229411e-03f,  1.4413431e-01f,  -4.7341764e-01f, 1.7939579e-01f,
+    4.2524221e-04f,  3.4318164e-02f,  3.6988214e-01f,  -4.0235329e-02f,
+    -3.3286434e-01f, 1.1149145e+00f,  3.0910656e-01f,  4.2524221e-04f,
+    -3.7121230e-01f, 3.1041780e-01f,  2.4160075e-01f,  -2.7346233e-02f,
+    -1.5404283e-01f, 5.0396878e-01f,  4.2524221e-04f,  -2.1208663e-02f,
+    1.5269564e-01f,  -6.8493679e-02f, 2.4583252e-02f,  -2.8066137e-01f,
+    4.7748199e-01f,  4.2524221e-04f,  -2.1734355e-01f, 2.5201303e-01f,
+    -3.2862380e-02f, 1.6177589e-02f,  -3.4582311e-01f, -1.2821641e+00f,
+    4.2524221e-04f,  4.4924536e-01f,  7.4113816e-02f,  -7.3689610e-02f,
+    1.7220579e-01f,  -6.3622075e-01f, -1.5600935e-01f, 4.2524221e-04f,
+    -2.4427678e-01f, -1.8103082e-01f, 8.4029436e-02f,  6.2840384e-01f,
+    -1.0204503e-01f, -1.2746918e+00f, 4.2524221e-04f,  -7.7623174e-02f,
+    -1.1538806e-01f, 1.0955370e-01f,  2.1155287e-01f,  -1.8333985e-02f,
+    -8.5965082e-02f, 4.2524221e-04f,  1.9285780e-01f,  5.4857415e-01f,
+    4.8495352e-02f,  -6.5345681e-01f, 6.8900383e-01f,  5.7032607e-02f,
+    4.2524221e-04f,  1.5831296e-01f,  2.8919354e-01f,  -7.7110849e-02f,
+    -4.8351768e-01f, -4.9834508e-02f, 3.6463663e-02f,  4.2524221e-04f,
+    6.4799570e-02f,  -3.2731708e-02f, -2.7273929e-02f, 8.1991071e-01f,
+    9.5503010e-02f,  2.9027075e-01f,  4.2524221e-04f,  -1.1201077e-02f,
+    5.4656636e-02f,  -1.4434703e-02f, -9.3639143e-02f, -1.8136314e-01f,
+    9.5906240e-01f,  4.2524221e-04f,  -3.9398316e-01f, -3.9860523e-01f,
+    2.1285461e-01f,  -6.9376923e-02f, 4.3563950e-01f,  1.4931425e-01f,
+    4.2524221e-04f,  -4.4031635e-02f, 6.0925055e-02f,  1.2944406e-02f,
+    1.4925966e-01f,  -2.0842522e-01f, 3.6399025e-01f,  4.2524221e-04f,
+    -7.4377365e-02f, -4.6327910e-01f, 1.3271235e-01f,  4.1344625e-01f,
+    -2.2608940e-01f, 4.4854322e-01f,  4.2524221e-04f,  -7.4429356e-02f,
+    9.7148471e-02f,  6.2793352e-02f,  1.5341394e-01f,  -8.4888637e-01f,
+    -3.6653098e-01f, 4.2524221e-04f,  2.2618461e-01f,  2.2315122e-02f,
+    -2.3498254e-01f, -6.1160840e-02f, 2.5365597e-01f,  5.4208982e-01f,
+    4.2524221e-04f,  -3.1962454e-01f, 3.9163461e-01f,  4.2871829e-02f,
+    6.0472304e-01f,  1.3251632e-02f,  5.9459621e-01f,  4.2524221e-04f,
+    5.1799797e-02f,  2.3819485e-01f,  9.1572301e-03f,  7.0380992e-03f,
+    8.0354142e-01f,  8.3409584e-01f,  4.2524221e-04f,  -1.5994681e-02f,
+    7.8938596e-02f,  6.6703215e-02f,  4.1910246e-02f,  2.8412926e-01f,
+    7.2893983e-01f,  4.2524221e-04f,  -2.1006101e-01f, 2.4578594e-01f,
+    4.8922536e-01f,  -1.0057293e-03f, -3.2497483e-01f, -2.5029007e-01f,
+    4.2524221e-04f,  -3.5587311e-01f, -3.5273769e-01f, 1.5821952e-01f,
+    2.9952317e-01f,  5.5395550e-01f,  -3.4648269e-02f, 4.2524221e-04f,
+    -1.6086802e-01f, -2.3201960e-01f, 5.4741569e-02f,  -3.2486397e-01f,
+    -5.3650331e-01f, 6.5752223e-02f,  4.2524221e-04f,  1.9204400e-01f,
+    1.2761375e-01f,  -3.9251870e-04f, -2.0936428e-01f, -5.3058326e-02f,
+    -3.0527651e-02f, 4.2524221e-04f,  -3.0021596e-01f, 1.5909308e-01f,
+    1.7731556e-01f,  4.2238137e-01f,  3.1060129e-01f,  5.7609707e-01f,
+    4.2524221e-04f,  -9.1755381e-03f, -4.5280188e-02f, 5.0950889e-03f,
+    -1.7395033e-01f, 3.4041181e-01f,  -6.2415045e-01f, 4.2524221e-04f,
+    1.0376621e-01f,  7.4777119e-02f,  -7.4621383e-03f, -8.7899685e-02f,
+    1.5269575e-01f,  2.4027891e-01f,  4.2524221e-04f,  -9.5581291e-03f,
+    -3.4383759e-02f, 5.3069271e-02f,  3.5880011e-01f,  -3.5557917e-01f,
+    2.0991372e-01f,  4.2524221e-04f,  3.6124307e-01f,  1.8159066e-01f,
+    -8.2019433e-02f, -3.2876030e-02f, 2.1423176e-01f,  -2.3691888e-01f,
+    4.2524221e-04f,  5.2591050e-01f,  1.4223778e-01f,  -2.3596896e-01f,
+    -2.4888556e-01f, 8.0744885e-02f,  -2.8598624e-01f, 4.2524221e-04f,
+    3.7822265e-02f,  -3.0359248e-02f, 1.2920305e-01f,  1.3964597e+00f,
+    -5.0595063e-01f, 3.7915143e-01f,  4.2524221e-04f,  -2.0440121e-01f,
+    -8.2971528e-02f, 2.4363218e-02f,  5.5374378e-01f,  -4.2351457e-01f,
+    2.6157996e-01f,  4.2524221e-04f,  -1.5342065e-02f, -1.1447024e-01f,
+    8.9309372e-02f,  -1.6897373e-01f, -3.8053963e-01f, -3.2147244e-01f,
+    4.2524221e-04f,  -4.7150299e-01f, 2.0515873e-01f,  -1.3660602e-01f,
+    -7.0529729e-01f, -3.4735793e-01f, 5.8833256e-02f,  4.2524221e-04f,
+    -1.2456580e-01f, 4.2049769e-02f,  2.8410503e-01f,  -4.3436193e-01f,
+    -8.4273821e-01f, -1.3157543e-02f, 4.2524221e-04f,  7.5538613e-02f,
+    3.9626577e-01f,  -1.5217549e-01f, -1.5618332e-01f, -3.3695772e-01f,
+    5.9022270e-02f,  4.2524221e-04f,  -1.5459322e-02f, 1.5710446e-01f,
+    -5.1338539e-02f, -5.5148184e-01f, -1.3073370e+00f, -4.2774591e-01f,
+    4.2524221e-04f,  1.0272874e-02f,  -2.7489871e-01f, 4.5325002e-03f,
+    4.8323011e-01f,  -4.8259729e-01f, -3.7467831e-01f, 4.2524221e-04f,
+    1.2912191e-01f,  1.2607241e-01f,  2.3619874e-01f,  -1.5429191e-01f,
+    -1.1406326e-02f, 7.4113697e-01f,  4.2524221e-04f,  -5.8898546e-02f,
+    1.0400093e-01f,  2.5439359e-02f,  -2.2700197e-01f, -6.9284344e-01f,
+    5.9191513e-01f,  4.2524221e-04f,  -1.3326290e-01f, 2.8317794e-01f,
+    -1.1651643e-01f, -2.0354472e-01f, 2.4168920e-02f,  -2.9111835e-01f,
+    4.2524221e-04f,  4.6675056e-01f,  1.8015167e-01f,  -2.7656639e-01f,
+    6.0998124e-01f,  1.1838278e-01f,  4.4735509e-01f,  4.2524221e-04f,
+    -7.8548267e-02f, 1.3879402e-01f,  2.9531106e-02f,  -3.2241312e-01f,
+    3.5146353e-01f,  -1.3042176e+00f, 4.2524221e-04f,  3.6139764e-02f,
+    1.2170444e-01f,  -2.3465194e-01f, -2.9680032e-01f, -6.8796831e-03f,
+    6.8688500e-01f,  4.2524221e-04f,  -1.4219068e-01f, 2.1623276e-02f,
+    1.5299717e-01f,  -7.4627483e-01f, -2.1742058e-01f, 3.2532772e-01f,
+    4.2524221e-04f,  -6.3564241e-02f, -2.9572992e-02f, -3.2649133e-02f,
+    5.9788638e-01f,  3.6870297e-02f,  -8.7102300e-01f, 4.2524221e-04f,
+    -2.0794891e-01f, 8.1371635e-02f,  3.3638042e-01f,  2.0494652e-01f,
+    -5.9626132e-01f, -1.5380038e-01f, 4.2524221e-04f,  -1.0159838e-01f,
+    -2.8721320e-02f, 2.7015638e-02f,  -2.7380022e-01f, -9.4103739e-02f,
+    -6.7215502e-02f, 4.2524221e-04f,  6.7924291e-02f,  9.6439593e-02f,
+    -1.2461703e-01f, 4.5358276e-01f,  -6.4580995e-01f, -2.7629402e-01f,
+    4.2524221e-04f,  1.1018521e-01f,  -2.0825058e-01f, -3.5493972e-03f,
+    3.0831328e-01f,  -2.9231513e-01f, 2.7853895e-02f,  4.2524221e-04f,
+    -4.6187687e-01f, 1.3196044e-02f,  -3.5266578e-01f, -7.5263560e-01f,
+    -1.1318106e-01f, 2.7656075e-01f,  4.2524221e-04f,  6.7048810e-02f,
+    -5.1194650e-01f, 1.1785375e-01f,  8.8861950e-02f,  -4.7610909e-01f,
+    -1.6243374e-01f, 4.2524221e-04f,  -6.6284803e-03f, -8.3670825e-02f,
+    -1.2508593e-01f, -3.8224804e-01f, -1.5937123e-02f, 1.0452353e+00f,
+    4.2524221e-04f,  -1.3160370e-01f, -9.5955923e-02f, -8.4739611e-02f,
+    1.9278596e-01f,  -1.1568629e-01f, 4.2249944e-02f,  4.2524221e-04f,
+    -2.1267873e-01f, 2.8323093e-01f,  -3.1590623e-01f, -4.9953362e-01f,
+    -6.5009966e-02f, 1.1061162e-02f,  4.2524221e-04f,  1.3268466e-01f,
+    -1.0461405e-02f, -8.3998583e-02f, -3.5246205e-01f, 2.2906788e-01f,
+    2.3335723e-02f,  4.2524221e-04f,  7.6434441e-02f,  -2.4937626e-02f,
+    -2.7596179e-02f, 7.4442047e-01f,  2.5470009e-01f,  -2.2758165e-01f,
+    4.2524221e-04f,  -7.3667087e-02f, -1.7799268e-02f, -5.9537459e-03f,
+    -5.1536787e-01f, -1.7191459e-01f, -5.3793174e-01f, 4.2524221e-04f,
+    3.2908652e-02f,  -6.8867397e-03f, 2.7038795e-01f,  4.1145402e-01f,
+    1.0897535e-01f,  3.5777646e-01f,  4.2524221e-04f,  1.7472942e-01f,
+    -4.1650254e-02f, -2.4139067e-02f, 5.2082646e-01f,  1.4688045e-01f,
+    2.5017604e-02f,  4.2524221e-04f,  3.8611683e-01f,  -2.1606129e-02f,
+    -4.6873342e-02f, -4.2890063e-01f, 5.4671443e-01f,  -4.8172039e-01f,
+    4.2524221e-04f,  2.4685478e-01f,  7.0533797e-02f,  4.4634484e-02f,
+    -9.0525120e-01f, -1.0043499e-01f, -7.0548397e-01f, 4.2524221e-04f,
+    9.6239939e-02f,  -2.2564979e-01f, 1.8903369e-01f,  5.6831491e-01f,
+    -2.5603232e-01f, 9.4581522e-02f,  4.2524221e-04f,  -3.2893878e-01f,
+    6.0157795e-03f,  -9.9098258e-02f, 2.5037730e-01f,  7.8038769e-03f,
+    2.9051918e-01f,  4.2524221e-04f,  -1.2168298e-02f, -4.0631089e-02f,
+    3.7083067e-02f,  -4.8783138e-01f, 3.5017189e-01f,  8.4070042e-02f,
+    4.2524221e-04f,  -4.2874196e-01f, 3.2063863e-01f,  -4.9277123e-02f,
+    -1.7415829e-01f, 1.0225703e-01f,  -7.5167364e-01f, 4.2524221e-04f,
+    3.2780454e-02f,  -7.5571574e-02f, 1.9622628e-02f,  8.4614986e-01f,
+    1.0693860e-01f,  -1.2419286e+00f, 4.2524221e-04f,  1.7366207e-01f,
+    3.9584300e-01f,  2.6937449e-01f,  -4.8690364e-01f, -4.9973553e-01f,
+    -3.2570970e-01f, 4.2524221e-04f,  1.9942973e-02f,  2.0214912e-01f,
+    4.2972099e-02f,  -8.2332152e-01f, -4.3931123e-02f, -6.0235494e-01f,
+    4.2524221e-04f,  2.0768560e-01f,  2.8317720e-02f,  4.1160220e-01f,
+    -1.0679507e-01f, 7.3761070e-01f,  -2.3942986e-01f, 4.2524221e-04f,
+    2.1720865e-01f,  -1.9589297e-01f, 2.1523495e-01f,  6.2263809e-02f,
+    1.8949240e-01f,  1.0847020e+00f,  4.2524221e-04f,  2.4538104e-01f,
+    -2.5909713e-01f, 2.0987009e-01f,  1.2600332e-01f,  1.5175544e-01f,
+    6.0273927e-01f,  4.2524221e-04f,  2.7597550e-02f,  -5.6118514e-02f,
+    -5.9334390e-02f, 4.0022990e-01f,  -6.6226465e-01f, -2.5346693e-01f,
+    4.2524221e-04f,  -2.8687498e-02f, -1.3005561e-01f, -1.6967385e-01f,
+    4.4480300e-01f,  -3.2221052e-01f, 9.4727051e-01f,  4.2524221e-04f,
+    -2.2392456e-01f, 9.9042743e-02f,  1.3410835e-01f,  2.6153162e-01f,
+    3.6460832e-01f,  5.3761798e-01f,  4.2524221e-04f,  -2.9815484e-02f,
+    -1.9565192e-01f, 1.5263952e-01f,  3.1450984e-01f,  -6.3300407e-01f,
+    -1.4046330e+00f, 4.2524221e-04f,  4.1146070e-01f,  -1.8429661e-01f,
+    7.8496866e-02f,  -5.7638370e-02f, 1.2995465e-01f,  -6.7994076e-01f,
+    4.2524221e-04f,  2.5325531e-01f,  3.7003466e-01f,  -1.3726011e-01f,
+    -4.5850614e-01f, -6.3685037e-02f, -1.7873959e-01f, 4.2524221e-04f,
+    -1.5031013e-01f, 1.5252687e-02f,  1.1144777e-01f,  -5.4487520e-01f,
+    -4.4944713e-01f, 3.7658595e-02f,  4.2524221e-04f,  -1.4412788e-01f,
+    -4.5210607e-02f, -1.8119146e-01f, -4.8468155e-01f, -2.1693365e-01f,
+    -2.6204476e-01f, 4.2524221e-04f,  9.3633771e-02f,  3.1804737e-02f,
+    -8.9491466e-03f, -5.5857754e-01f, 6.2144250e-01f,  4.5324361e-01f,
+    4.2524221e-04f,  -2.1607183e-01f, -3.5096270e-01f, 1.1616316e-01f,
+    3.1337175e-01f,  5.6796402e-01f,  -4.6863672e-01f, 4.2524221e-04f,
+    1.2146773e-01f,  -2.9970589e-01f, -9.3484394e-02f, -1.3636754e-01f,
+    1.8527946e-01f,  3.7086871e-01f,  4.2524221e-04f,  6.3321716e-04f,
+    1.9271399e-01f,  -1.3901092e-02f, -1.8197080e-01f, -3.2543473e-02f,
+    4.0833443e-01f,  4.2524221e-04f,  3.1323865e-01f,  -9.9166080e-02f,
+    1.6559476e-01f,  -1.1429023e-01f, 2.6936495e-01f,  -8.1836838e-01f,
+    4.2524221e-04f,  -3.2788602e-01f, 2.6309913e-01f,  -7.6578714e-02f,
+    1.7135184e-01f,  7.6391011e-01f,  -2.2268695e-01f, 4.2524221e-04f,
+    9.1498777e-02f,  -2.7498001e-02f, -2.3773773e-02f, -1.2034925e-01f,
+    -1.2773737e-01f, 6.2424815e-01f,  4.2524221e-04f,  1.5177734e-01f,
+    -3.5075852e-01f, -7.1983606e-02f, 2.8897448e-02f,  4.0577650e-01f,
+    2.2001588e-01f,  4.2524221e-04f,  -2.2474186e-01f, -1.5482238e-02f,
+    2.1841341e-01f,  -2.4401657e-02f, -1.5976839e-01f, 7.6759452e-01f,
+    4.2524221e-04f,  -1.9837938e-01f, -1.9819458e-01f, 1.0244832e-01f,
+    2.5585452e-01f,  -6.2405187e-01f, -1.2208650e-01f, 4.2524221e-04f,
+    1.0785859e-01f,  -4.7728598e-02f, -7.1606390e-02f, -3.0540991e-01f,
+    -1.3558470e-01f, -4.7501847e-02f, 4.2524221e-04f,  8.2393557e-02f,
+    -3.0366284e-01f, -2.4622783e-01f, 4.2844865e-01f,  5.1157504e-01f,
+    -1.3205969e-01f, 4.2524221e-04f,  -5.0696820e-02f, 2.0262659e-01f,
+    -1.7887448e-01f, -1.2609152e+00f, -3.5461038e-01f, -3.9882436e-01f,
+    4.2524221e-04f,  5.4839436e-02f,  -3.5092220e-02f, 1.1367126e-02f,
+    2.3117255e-01f,  3.8602617e-01f,  -7.5130589e-02f, 4.2524221e-04f,
+    -3.6607772e-02f, -1.0679845e-01f, -5.7734322e-02f, 1.2356401e-01f,
+    -4.4628922e-02f, 4.5649070e-01f,  4.2524221e-04f,  -1.9838469e-01f,
+    1.4024511e-01f,  1.2040158e-01f,  -1.9388847e-02f, 2.0905096e-02f,
+    1.0355227e-01f,  4.2524221e-04f,  2.3764308e-01f,  3.5117786e-02f,
+    -3.1436324e-02f, 8.5178584e-01f,  1.1339028e+00f,  1.1008400e-01f,
+    4.2524221e-04f,  -7.3822118e-02f, 6.9310486e-02f,  4.9703155e-02f,
+    -4.6891728e-01f, -4.8981270e-01f, 9.2132203e-02f,  4.2524221e-04f,
+    -2.4658789e-01f, -3.6811281e-02f, 5.3509071e-02f,  1.4401472e-01f,
+    -5.9464717e-01f, -4.7781080e-01f, 4.2524221e-04f,  -7.7872813e-02f,
+    -2.6063239e-02f, 2.0965867e-02f,  -3.8868725e-02f, -1.1606826e+00f,
+    6.7060548e-01f,  4.2524221e-04f,  -4.5830272e-02f, 1.1310847e-01f,
+    -8.1722803e-02f, -9.1091514e-02f, -3.6987996e-01f, -5.6169915e-01f,
+    4.2524221e-04f,  1.2683717e-02f,  -2.0634931e-02f, -8.5185498e-02f,
+    -4.8645809e-01f, -1.3408487e-01f, -2.7973619e-01f, 4.2524221e-04f,
+    1.0893838e-01f,  -2.1178136e-02f, -2.1285720e-03f, 1.5344471e-01f,
+    -3.4493029e-01f, -6.7877275e-01f, 4.2524221e-04f,  -3.2412663e-01f,
+    3.9371975e-02f,  -4.4002077e-01f, -5.3908128e-02f, 1.5829736e-01f,
+    2.6969984e-01f,  4.2524221e-04f,  2.2543361e-02f,  4.8779223e-02f,
+    4.3569636e-02f,  -3.4519175e-01f, 2.1664266e-01f,  9.3308222e-01f,
+    4.2524221e-04f,  -3.5433710e-01f, -2.9060904e-02f, 6.4444318e-02f,
+    -1.3577543e-01f, -1.4957221e-01f, -5.4734117e-01f, 4.2524221e-04f,
+    -2.2653489e-01f, 9.9744573e-02f,  -1.1482056e-01f, 3.1762671e-01f,
+    4.6666378e-01f,  1.9599502e-01f,  4.2524221e-04f,  4.3308473e-01f,
+    7.3437119e-01f,  -3.0044449e-02f, -8.3082899e-02f, -3.2125901e-02f,
+    -1.2847716e-02f, 4.2524221e-04f,  -1.8438119e-01f, -1.9283429e-01f,
+    3.5797872e-02f,  1.3573840e-01f,  -3.7481323e-02f, 1.1818637e+00f,
+    4.2524221e-04f,  1.0874497e-02f,  -6.1415236e-02f, 9.8641105e-02f,
+    1.1666699e-01f,  1.0087410e+00f,  -5.6476429e-02f, 4.2524221e-04f,
+    -3.7848192e-01f, -1.3981105e-01f, -5.3778347e-03f, 2.0008039e-01f,
+    -1.1830221e+00f, -3.6353923e-02f, 4.2524221e-04f,  8.3630599e-02f,
+    7.6356381e-02f,  -8.8009313e-02f, 2.8433867e-02f,  2.1191142e-02f,
+    6.8432979e-02f,  4.2524221e-04f,  5.2260540e-02f,  1.1663198e-01f,
+    1.0381171e-01f,  -5.1648277e-01f, 5.2234846e-01f,  -6.6856992e-01f,
+    4.2524221e-04f,  -2.2434518e-01f, 9.4649620e-02f,  -2.2770822e-01f,
+    1.1058451e-02f,  -5.2965415e-01f, -3.6854854e-01f, 4.2524221e-04f,
+    -1.8068549e-01f, -1.3638383e-01f, -2.5140682e-01f, -2.8262353e-01f,
+    -2.5481758e-01f, 6.2844765e-01f,  4.2524221e-04f,  1.0108690e-01f,
+    2.0101190e-01f,  1.3750127e-01f,  2.7563637e-01f,  -5.7106084e-01f,
+    -8.7128246e-01f, 4.2524221e-04f,  -1.0044957e-01f, -9.4999395e-02f,
+    -1.8605889e-01f, 1.8979494e-01f,  -8.5543871e-01f, 5.3148580e-01f,
+    4.2524221e-04f,  -2.4865381e-01f, 2.2518732e-01f,  -1.0148249e-01f,
+    -2.2050242e-01f, 5.3008753e-01f,  -3.9897123e-01f, 4.2524221e-04f,
+    7.3146023e-02f,  -1.3554707e-01f, -2.5761548e-01f, 3.1436664e-01f,
+    -8.2433552e-01f, 2.7389117e-02f,  4.2524221e-04f,  5.5880195e-01f,
+    -1.7010997e-01f, 3.7886339e-01f,  3.4537455e-01f,  1.6899250e-01f,
+    -4.0871644e-01f, 4.2524221e-04f,  3.3027393e-01f,  5.2694689e-02f,
+    -3.2332891e-01f, 2.3347795e-01f,  3.2150295e-01f,  2.1555850e-01f,
+    4.2524221e-04f,  1.4437835e-02f,  -1.4030455e-01f, -2.8837410e-01f,
+    3.0297443e-01f,  -5.1224962e-02f, -5.0067031e-01f, 4.2524221e-04f,
+    2.8251413e-01f,  2.2796902e-01f,  -3.2044646e-01f, -2.3228103e-01f,
+    -1.6037621e-01f, -2.6131482e-03f, 4.2524221e-04f,  5.2314814e-02f,
+    -2.0229014e-02f, -6.8570655e-03f, 2.0827544e-01f,  -2.2427905e-02f,
+    -3.7649903e-02f, 4.2524221e-04f,  -9.2880584e-02f, 9.8891854e-03f,
+    -3.9208323e-02f, -6.0296351e-01f, 6.1879003e-01f,  -3.7303507e-01f,
+    4.2524221e-04f,  -1.9322397e-01f, 2.0262747e-01f,  8.0153726e-02f,
+    -2.3856657e-02f, 4.0623334e-01f,  6.2071621e-01f,  4.2524221e-04f,
+    -4.4426578e-01f, 2.0553674e-01f,  -2.6441025e-02f, -1.6482647e-01f,
+    -8.7054305e-02f, -8.2128918e-01f, 4.2524221e-04f,  -2.8677690e-01f,
+    -1.0196485e-01f, 1.3304503e-01f,  -7.6817560e-01f, 1.9562703e-01f,
+    -4.6528971e-01f, 4.2524221e-04f,  -2.0077555e-01f, -1.5366915e-01f,
+    1.1841840e-01f,  -1.7148955e-01f, 9.5784628e-01f,  7.9418994e-02f,
+    4.2524221e-04f,  -1.2745425e-01f, 3.1222694e-02f,  -1.9043627e-01f,
+    4.9706772e-02f,  -1.8966989e-01f, -1.1206242e-01f, 4.2524221e-04f,
+    -7.4478179e-02f, 1.3656577e-02f,  -1.2854090e-01f, 3.0771527e-01f,
+    7.3823595e-01f,  6.9908720e-01f,  4.2524221e-04f,  -1.7966473e-01f,
+    -2.9162148e-01f, -2.1245839e-02f, -2.6599333e-01f, 1.9704431e-01f,
+    5.4458129e-01f,  4.2524221e-04f,  1.1969655e-01f,  -3.1876512e-02f,
+    1.9230773e-01f,  9.9345565e-01f,  -2.2614142e-01f, -7.7471659e-02f,
+    4.2524221e-04f,  7.2612032e-02f,  7.9093436e-03f,  9.1707774e-02f,
+    3.9948497e-02f,  -7.6741409e-01f, -2.7649629e-01f, 4.2524221e-04f,
+    -3.1801498e-01f, 9.1305524e-02f,  1.1569420e-01f,  -1.2343646e-01f,
+    6.5492535e-01f,  -1.5559088e-01f, 4.2524221e-04f,  8.8576578e-02f,
+    -1.1602592e-01f, 3.0858183e-02f,  4.6493343e-01f,  4.3753752e-01f,
+    1.5579678e-01f,  4.2524221e-04f,  -2.3568103e-01f, -3.1387237e-01f,
+    1.7740901e-01f,  -2.2428825e-01f, -7.9772305e-01f, 2.2299300e-01f,
+    4.2524221e-04f,  1.0266142e-01f,  -3.9200943e-02f, -1.6250725e-01f,
+    -2.1084811e-01f, 4.7313869e-01f,  7.5736183e-01f,  4.2524221e-04f,
+    -5.2503270e-01f, -2.5550249e-01f, 2.4210323e-01f,  4.2290211e-01f,
+    -1.1937749e-03f, -2.8803447e-01f, 4.2524221e-04f,  6.8656705e-02f,
+    2.3230983e-01f,  -1.0208790e-02f, -1.9244626e-01f, 8.1877112e-01f,
+    -2.5449389e-01f, 4.2524221e-04f,  -5.4129776e-02f, 2.9140076e-01f,
+    -4.6895444e-01f, -2.3883762e-02f, -1.9746602e-01f, -1.4508346e-02f,
+    4.2524221e-04f,  -3.0830520e-01f, -2.6217067e-01f, -2.6785174e-01f,
+    6.7281228e-01f,  3.7336886e-01f,  -1.4304060e-01f, 4.2524221e-04f,
+    1.5217099e-01f,  2.0078890e-01f,  7.7753231e-02f,  -3.3346283e-01f,
+    -1.2821050e-01f, -4.3130264e-01f, 4.2524221e-04f,  3.8476987e-04f,
+    -7.6562621e-02f, -4.8909627e-02f, -1.1036193e-01f, 2.4940021e-01f,
+    2.4720046e-01f,  4.2524221e-04f,  1.9815315e-01f,  1.9162391e-01f,
+    6.0125452e-02f,  -7.7126014e-01f, 4.2003978e-02f,  6.3951693e-02f,
+    4.2524221e-04f,  9.2402853e-02f,  -1.9484653e-01f, -1.4663309e-01f,
+    1.7251915e-01f,  -1.6592954e-01f, -3.1574631e-01f, 4.2524221e-04f,
+    1.4493692e-01f,  -3.1712703e-02f, -1.5764284e-01f, -1.6178896e-01f,
+    3.3917201e-01f,  -4.9173659e-01f, 4.2524221e-04f,  2.1914667e-01f,
+    -7.4241884e-02f, -9.9493600e-02f, -1.7168714e-01f, 1.7520438e-01f,
+    1.1748855e+00f,  4.2524221e-04f,  -1.6493322e-01f, 2.1094975e-01f,
+    2.6855225e-02f,  8.0839500e-02f,  6.4471591e-01f,  2.5444278e-01f,
+    4.2524221e-04f,  -1.0818439e-01f, 5.0222378e-02f,  1.0443858e-01f,
+    7.3543733e-01f,  -5.2923161e-01f, 2.3857592e-02f,  4.2524221e-04f,
+    -1.3066588e-01f, 3.3706114e-01f,  -6.5367684e-02f, -1.9584729e-01f,
+    -9.6636809e-02f, 5.7062846e-01f,  4.2524221e-04f,  8.9271449e-02f,
+    -1.5417366e-02f, -8.2307503e-02f, -5.0039625e-01f, 2.5350851e-01f,
+    -2.4847549e-01f, 4.2524221e-04f,  -2.8799692e-01f, -1.0268785e-01f,
+    -6.9768213e-02f, 1.9839688e-01f,  -9.6014850e-02f, 1.1959620e-02f,
+    4.2524221e-04f,  -7.6331727e-02f, 1.0289106e-01f,  2.5628258e-02f,
+    -9.5651820e-02f, -3.1599486e-01f, 3.4648609e-01f,  4.2524221e-04f,
+    -4.9910601e-02f, 8.5599929e-02f,  -3.1449606e-03f, -1.6781870e-01f,
+    1.0333546e+00f,  -6.6645592e-01f, 4.2524221e-04f,  8.2493991e-02f,
+    -9.5790043e-02f, 4.3036491e-02f,  1.8140252e-01f,  5.4385066e-01f,
+    3.2726720e-02f,  4.2524221e-04f,  2.2156011e-01f,  3.1133004e-02f,
+    -1.4379646e-01f, -5.9910184e-01f, 1.0038698e+00f,  -3.0557862e-01f,
+    4.2524221e-04f,  3.7525645e-01f,  7.0815518e-02f,  2.8620017e-01f,
+    6.9975668e-01f,  1.0616329e-01f,  1.8318458e-01f,  4.2524221e-04f,
+    9.5496923e-02f,  -3.8357295e-02f, 7.5472467e-02f,  1.4580189e-02f,
+    1.3419588e-01f,  -2.0312097e-02f, 4.2524221e-04f,  4.9029529e-02f,
+    1.7314212e-01f,  -4.9041037e-02f, -2.6927444e-01f, -2.4882385e-01f,
+    -2.5494534e-01f, 4.2524221e-04f,  -6.4100541e-02f, 2.6978979e-01f,
+    2.4858065e-02f,  -8.1361562e-01f, -3.7216064e-01f, 4.3392561e-02f,
+    4.2524221e-04f,  6.9799364e-02f,  -1.3860419e-01f, 1.0984455e-01f,
+    4.8301801e-01f,  5.5070144e-01f,  -3.3188796e-01f, 4.2524221e-04f,
+    -8.2801402e-02f, -6.8652697e-02f, -1.9647431e-02f, 1.8623030e-01f,
+    -1.3855183e-01f, 3.1506360e-01f,  4.2524221e-04f,  3.6300448e-01f,
+    -8.0298670e-02f, -3.1002939e-01f, -3.3787906e-01f, -3.0862695e-01f,
+    2.7613443e-01f,  4.2524221e-04f,  3.7739474e-01f,  1.1907437e-01f,
+    -3.9434172e-02f, 5.8045042e-01f,  4.5934165e-01f,  2.9962903e-01f,
+    4.2524221e-04f,  2.9385680e-02f,  1.1072745e-01f,  5.8579307e-02f,
+    -2.8264758e-01f, -1.0784884e-01f, 1.2321078e+00f,  4.2524221e-04f,
+    7.9958871e-02f,  1.2411897e-01f,  9.8061837e-02f,  3.3262360e-01f,
+    -8.3796644e-01f, 4.0548918e-01f,  4.2524221e-04f,  7.8290664e-02f,
+    4.5500584e-02f,  9.9731199e-02f,  -4.6239632e-01f, 3.0574635e-01f,
+    -4.3212789e-01f, 4.2524221e-04f,  3.6696273e-01f,  5.7200775e-03f,
+    5.3992327e-02f,  -1.6632666e-01f, -3.1065517e-03f, -1.1606836e-01f,
+    4.2524221e-04f,  2.3191632e-01f,  3.3108935e-01f,  2.0009531e-02f,
+    4.3141481e-01f,  7.1523404e-01f,  -4.0791895e-02f, 4.2524221e-04f,
+    -2.0644982e-01f, 3.2929885e-01f,  -2.1481182e-01f, 3.4483513e-01f,
+    8.7951744e-01f,  2.2883956e-01f,  4.2524221e-04f,  -2.4269024e-02f,
+    8.0496661e-02f,  -2.2875665e-02f, -4.7301382e-02f, -1.2039685e-01f,
+    -4.8519605e-01f, 4.2524221e-04f,  -3.5178763e-01f, -1.1468551e-01f,
+    -7.2022155e-02f, 7.1914357e-01f,  -1.8774068e-01f, 2.9152307e-01f,
+    4.2524221e-04f,  1.5231021e-01f,  2.1161540e-01f,  -1.1754553e-01f,
+    -7.1294534e-01f, -6.2154621e-01f, -1.9393834e-01f, 4.2524221e-04f,
+    -7.8070223e-02f, 1.7216440e-01f,  1.7939833e-01f,  4.8407644e-01f,
+    -1.7517121e-01f, 4.1451525e-02f,  4.2524221e-04f,  1.9436933e-02f,
+    4.3368284e-02f,  -3.5639319e-03f, 6.7544144e-01f,  5.4782498e-01f,
+    3.4879735e-01f,  4.2524221e-04f,  -1.3366042e-01f, -8.3979061e-03f,
+    -8.7891303e-02f, -9.8265654e-01f, -4.2677250e-02f, -1.1890029e-01f,
+    4.2524221e-04f,  1.2091810e-01f,  -1.8473221e-01f, 3.7591079e-01f,
+    1.7912203e-01f,  7.1378611e-03f,  5.6433028e-01f,  4.2524221e-04f,
+    -3.0588778e-02f, -8.0224700e-02f, 2.0911565e-01f,  1.7871276e-01f,
+    -4.5090526e-01f, 1.7313591e-01f,  4.2524221e-04f,  2.1592773e-01f,
+    -1.0682704e-01f, -1.4687291e-01f, -2.1309285e-01f, 3.2003528e-01f,
+    9.6824163e-01f,  4.2524221e-04f,  -7.1326107e-02f, -1.8375346e-01f,
+    1.6073698e-01f,  6.6706583e-02f,  -2.2058874e-01f, -1.6864805e-01f,
+    4.2524221e-04f,  -4.4198960e-02f, -1.1312663e-01f, 1.0822348e-01f,
+    1.3487945e-01f,  -7.0401341e-01f, -1.2007080e+00f, 4.2524221e-04f,
+    -2.9746767e-02f, -1.3425194e-01f, -2.5086749e-01f, -1.1511848e-01f,
+    -8.7276441e-01f, 1.6036594e-01f,  4.2524221e-04f,  1.7037044e-01f,
+    1.7299759e-01f,  4.6205060e-03f,  5.1056665e-01f,  1.0041865e+00f,
+    2.3419438e-01f,  4.2524221e-04f,  1.6252996e-01f,  1.1271755e-01f,
+    4.6216175e-02f,  5.6226152e-01f,  6.6637951e-01f,  5.3371119e-01f,
+    4.2524221e-04f,  -1.9546813e-01f, 1.3906172e-01f,  -5.5975009e-02f,
+    -1.0969467e-01f, -1.2633232e+00f, -4.3421894e-02f, 4.2524221e-04f,
+    -1.4044075e-01f, -2.6630515e-01f, 6.1962787e-02f,  4.6771467e-01f,
+    -6.9051319e-01f, 2.6465434e-01f,  4.2524221e-04f,  1.7195286e-01f,
+    -5.2851868e-01f, -1.6422449e-01f, 1.1703679e-01f,  7.2824037e-01f,
+    -3.6378372e-01f, 4.2524221e-04f,  1.0194746e-01f,  -9.7751893e-02f,
+    1.6529745e-01f,  2.4984296e-01f,  3.8181201e-02f,  2.7078211e-01f,
+    4.2524221e-04f,  2.0533490e-01f,  1.9480339e-01f,  -6.6993818e-02f,
+    3.9745870e-01f,  -7.9133675e-02f, -1.1942380e-01f, 4.2524221e-04f,
+    -3.9208923e-02f, 9.8150961e-02f,  1.0030308e-01f,  -5.7831265e-02f,
+    -6.4350224e-01f, 8.4775603e-01f,  4.2524221e-04f,  1.3816082e-01f,
+    -1.4092979e-02f, -1.0894109e-01f, 2.8519067e-01f,  5.8030725e-01f,
+    6.5652287e-01f,  4.2524221e-04f,  3.1362314e-02f,  -6.5740333e-03f,
+    6.7480214e-02f,  4.2265895e-01f,  -5.1995921e-01f, -2.8980300e-02f,
+    4.2524221e-04f,  -1.1953717e-01f, 1.5453845e-01f,  1.3720915e-01f,
+    -1.5399654e-01f, -1.2724885e-01f, 6.4902240e-01f,  4.2524221e-04f,
+    -2.4549389e-01f, -7.9987049e-02f, 8.9279823e-02f,  -9.2930816e-02f,
+    -6.1336237e-01f, 4.7973198e-01f,  4.2524221e-04f,  2.5360553e-02f,
+    -2.6513871e-02f, 5.4526389e-02f,  -9.8100655e-02f, 6.5327984e-01f,
+    -5.2721924e-01f, 4.2524221e-04f,  -1.0606319e-01f, -6.9447577e-02f,
+    4.3061398e-02f,  -1.0653659e+00f, 6.2340677e-01f,  4.6419606e-02f};
diff --git a/examples/graphics/histogram.cpp b/examples/graphics/histogram.cpp
index e91f6d51ea..7986e44dd5 100644
--- a/examples/graphics/histogram.cpp
+++ b/examples/graphics/histogram.cpp
@@ -8,25 +8,27 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <cstdio>
 #include <math.h>
+#include <cstdio>
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int, char**) {
     try {
         // Initialize the kernel array just once
         af::info();
         af::Window myWindow(512, 512, "Histogram example using ArrayFire");
-        af::Window imgWnd("Input Image");
+        af::Window imgWnd(480, 640, "Input Image");
 
-        array img = loadImage(ASSETS_DIR"/examples/images/lena.ppm", false);
+        array img = loadImage(ASSETS_DIR "/examples/images/arrow.jpg", false);
         array hist_out = histogram(img, 256, 0, 255);
 
+        myWindow.setAxesTitles("Bins", "Frequency");
+        myWindow.setPos(480, 0);
+
         while (!myWindow.close() && !imgWnd.close()) {
             myWindow.hist(hist_out, 0, 255);
-            imgWnd.image(img/255);
+            imgWnd.image(img.as(u8));
         }
     }
 
@@ -34,14 +36,5 @@ int main(int argc, char *argv[])
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
-
-#ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-#endif
     return 0;
 }
-
diff --git a/examples/graphics/plot2d.cpp b/examples/graphics/plot2d.cpp
index 7e28d34ebd..1bc72ac1e5 100644
--- a/examples/graphics/plot2d.cpp
+++ b/examples/graphics/plot2d.cpp
@@ -8,37 +8,41 @@
  ********************************************************/
 
 #include <arrayfire.h>
-#include <cstdio>
 #include <math.h>
+#include <cstdio>
 
 using namespace af;
 
-static const int ITERATIONS = 100;
-static const float PRECISION = 1.0f/ITERATIONS;
+static const int ITERATIONS  = 50;
+static const float PRECISION = 1.0f / ITERATIONS;
 
-int main(int argc, char *argv[])
-{
+int main(int, char**) {
     try {
         // Initialize the kernel array just once
         af::info();
-        af::Window myWindow(512, 512, "2D Plot example: ArrayFire");
+        af::Window myWindow(800, 800, "2D Plot example: ArrayFire");
 
         array Y;
-        int sign = 1;
-        array X = seq(-af::Pi, af::Pi, PRECISION);
+        int sign    = 1;
+        array X     = seq(-af::Pi, af::Pi, PRECISION);
+        array noise = randn(X.dims(0)) / 5.f;
 
-        for (double val=-af::Pi; !myWindow.close(); ) {
+        myWindow.grid(2, 1);
 
+        for (double val = 0; !myWindow.close();) {
             Y = sin(X);
 
-            myWindow.plot(X, Y);
+            myWindow(0, 0).plot(X, Y);
+            myWindow(1, 0).scatter(X, Y + noise, AF_MARKER_POINT);
+
+            myWindow.show();
 
             X = X + PRECISION * float(sign);
             val += PRECISION * float(sign);
 
-            if (val>af::Pi) {
+            if (val > af::Pi) {
                 sign = -1;
-            } else if (val<-af::Pi) {
+            } else if (val < -af::Pi) {
                 sign = 1;
             }
         }
@@ -47,14 +51,5 @@ int main(int argc, char *argv[])
         fprintf(stderr, "%s\n", e.what());
         throw;
     }
-
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
-
diff --git a/examples/graphics/plot3.cpp b/examples/graphics/plot3.cpp
new file mode 100644
index 0000000000..9be0d4f308
--- /dev/null
+++ b/examples/graphics/plot3.cpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <math.h>
+#include <cstdio>
+
+using namespace af;
+
+static const int ITERATIONS  = 200;
+static const float PRECISION = 1.0f / ITERATIONS;
+
+int main(int, char**) {
+    try {
+        // Print ArrayFire version and device information
+        af::info();
+        af::Window myWindow(800, 800, "3D Line Plot example: ArrayFire");
+
+        static float t = 0.1;
+        array Z        = seq(0.1f, 10.f, PRECISION);
+
+        do {
+            array Y = sin((Z * t) + t) / Z;
+            array X = cos((Z * t) + t) / Z;
+            X       = max(min(X, 1.0), -1.0);
+            Y       = max(min(Y, 1.0), -1.0);
+
+            // Points can be passed in as a matrix of the form n x 3 or 3 x n,
+            // or as a flattened xyz-triplet array of size 3n x 1
+            myWindow.plot(X, Y, Z);
+
+            t += 0.01;
+        } while (!myWindow.close());
+
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
diff --git a/examples/graphics/surface.cpp b/examples/graphics/surface.cpp
new file mode 100644
index 0000000000..f9e89ed835
--- /dev/null
+++ b/examples/graphics/surface.cpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <math.h>
+#include <cstdio>
+
+using namespace af;
+
+static const int M = 30;
+static const int N = 2 * M;
+
+int main(int, char**) {
+    try {
+        // Print device info and create the window once, before the render loop
+        af::info();
+        af::Window myWindow(800, 800, "3D Surface example: ArrayFire");
+
+        // Create a grid over [-1, 1) with a step of 1 / M
+        const array x = iota(dim4(N, 1), dim4(1, N)) / M - 1;
+        const array y = iota(dim4(1, N), dim4(N, 1)) / M - 1;
+
+        static float t = 0;
+        while (!myWindow.close()) {
+            t += 0.07;
+            array z = 10 * x * -abs(y) * cos(x * x * (y + t)) +
+                      sin(y * (x + t)) - 1.5;
+            myWindow.surface(x, y, z);
+        }
+
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
diff --git a/examples/helloworld/CMakeLists.txt b/examples/helloworld/CMakeLists.txt
new file mode 100644
index 0000000000..b3a02e9fc6
--- /dev/null
+++ b/examples/helloworld/CMakeLists.txt
@@ -0,0 +1,34 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-HelloWorld
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_CPU_FOUND)
+  # Hello World example
+  add_executable(helloworld_cpu helloworld.cpp)
+  target_link_libraries(helloworld_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(helloworld_cuda helloworld.cpp)
+  target_link_libraries(helloworld_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(helloworld_opencl helloworld.cpp)
+  target_link_libraries(helloworld_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(helloworld_oneapi helloworld.cpp)
+  target_link_libraries(helloworld_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/helloworld/helloworld.cpp b/examples/helloworld/helloworld.cpp
index 46a1fc44dd..d0b8ca20a5 100644
--- a/examples/helloworld/helloworld.cpp
+++ b/examples/helloworld/helloworld.cpp
@@ -13,18 +13,15 @@
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int argc, char* argv[]) {
     try {
-
-
         // Select a device and display arrayfire info
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
         af::info();
 
         printf("Create a 5-by-3 matrix of random floats on the GPU\n");
-        array A = randu(5,3, f32);
+        array A = randu(5, 3, f32);
         af_print(A);
 
         printf("Element-wise arithmetic\n");
@@ -43,8 +40,17 @@ int main(int argc, char *argv[])
         array c = C.row(end);
         af_print(c);
 
+        printf("Scan Test\n");
+        dim4 dims(16, 4, 1, 1);
+        array r = constant(2, dims);
+        af_print(r);
+
+        printf("Scan\n");
+        array S = af::scan(r, 0, AF_BINARY_MUL);
+        af_print(S);
+
         printf("Create 2-by-3 matrix from host data\n");
-        float d[] = { 1, 2, 3, 4, 5, 6 };
+        float d[] = {1, 2, 3, 4, 5, 6};
         array D(2, 3, d, afHost);
         af_print(D);
 
@@ -64,12 +70,5 @@ int main(int argc, char *argv[])
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
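Note on the scan test added to helloworld.cpp above: `r` is a 16 x 4 array filled with 2, so a multiplicative scan along dimension 0 should give running products down each column, assuming `af::scan` performs an inclusive scan by default. The following is a host-only sketch of the same operation on a single column, using only the C++ standard library; `inclusiveProductScan` is a hypothetical helper for illustration and is not part of ArrayFire.

```cpp
#include <functional>
#include <numeric>
#include <vector>

// Inclusive scan with multiplication: out[i] = in[0] * in[1] * ... * in[i].
// Host-side stand-in for one column of a multiplicative scan; this is an
// illustrative helper, not an ArrayFire call.
std::vector<float> inclusiveProductScan(const std::vector<float>& in) {
    std::vector<float> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin(),
                     std::multiplies<float>());
    return out;
}
```

For a 16-element column of 2.0f this yields 2, 4, 8, and so on up to 65536, which is a quick sanity check for the printed `S`.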
diff --git a/examples/image_processing/CMakeLists.txt b/examples/image_processing/CMakeLists.txt
new file mode 100644
index 0000000000..e4ab1d3d8a
--- /dev/null
+++ b/examples/image_processing/CMakeLists.txt
@@ -0,0 +1,202 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Image-Processing
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+add_definitions("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+
+if(ArrayFire_CPU_FOUND)
+  # Adaptive Thresholding example
+  add_executable(adaptive_thresholding_cpu adaptive_thresholding.cpp)
+  target_link_libraries(adaptive_thresholding_cpu ArrayFire::afcpu)
+
+  # Binary Thresholding example
+  add_executable(binary_thresholding_cpu binary_thresholding.cpp)
+  target_link_libraries(binary_thresholding_cpu ArrayFire::afcpu)
+
+  # Brain Segmentation example
+  add_executable(brain_segmentation_cpu brain_segmentation.cpp)
+  target_link_libraries(brain_segmentation_cpu ArrayFire::afcpu)
+
+  # Confidence Connected Components example
+  add_executable(confidence_connected_components_cpu
+      confidence_connected_components.cpp)
+  target_link_libraries(confidence_connected_components_cpu ArrayFire::afcpu)
+
+  # Edge detection example
+  add_executable(edge_cpu edge.cpp)
+  target_link_libraries(edge_cpu ArrayFire::afcpu)
+
+  # Filters example
+  add_executable(filters_cpu filters.cpp)
+  target_link_libraries(filters_cpu ArrayFire::afcpu)
+
+  # Image example
+  add_executable(image_demo_cpu image_demo.cpp)
+  target_link_libraries(image_demo_cpu ArrayFire::afcpu)
+
+  # Image Editing example
+  add_executable(image_editing_cpu image_editing.cpp)
+  target_link_libraries(image_editing_cpu ArrayFire::afcpu)
+
+  # Morph example
+  add_executable(morphing_cpu morphing.cpp)
+  target_link_libraries(morphing_cpu ArrayFire::afcpu)
+
+  # Optical Flow example
+  add_executable(optical_flow_cpu optical_flow.cpp)
+  target_link_libraries(optical_flow_cpu ArrayFire::afcpu)
+
+  # Pyramids example
+  add_executable(pyramids_cpu pyramids.cpp)
+  target_link_libraries(pyramids_cpu ArrayFire::afcpu)
+
+  # Gradient anisotropic diffusion example
+  add_executable(gradient_diffusion_cpu gradient_diffusion.cpp)
+  target_link_libraries(gradient_diffusion_cpu ArrayFire::afcpu)
+
+  # Image Deconvolution example
+  add_executable(deconvolution_cpu deconvolution.cpp)
+  target_link_libraries(deconvolution_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(adaptive_thresholding_cuda adaptive_thresholding.cpp)
+  target_link_libraries(adaptive_thresholding_cuda ArrayFire::afcuda)
+
+  add_executable(binary_thresholding_cuda binary_thresholding.cpp)
+  target_link_libraries(binary_thresholding_cuda ArrayFire::afcuda)
+
+  add_executable(brain_segmentation_cuda brain_segmentation.cpp)
+  target_link_libraries(brain_segmentation_cuda ArrayFire::afcuda)
+
+  add_executable(confidence_connected_components_cuda
+      confidence_connected_components.cpp)
+  target_link_libraries(confidence_connected_components_cuda ArrayFire::afcuda)
+
+  add_executable(edge_cuda edge.cpp)
+  target_link_libraries(edge_cuda ArrayFire::afcuda)
+
+  add_executable(filters_cuda filters.cpp)
+  target_link_libraries(filters_cuda ArrayFire::afcuda)
+
+  add_executable(image_demo_cuda image_demo.cpp)
+  target_link_libraries(image_demo_cuda ArrayFire::afcuda)
+
+  add_executable(image_editing_cuda image_editing.cpp)
+  target_link_libraries(image_editing_cuda ArrayFire::afcuda)
+
+  add_executable(morphing_cuda morphing.cpp)
+  target_link_libraries(morphing_cuda ArrayFire::afcuda)
+
+  add_executable(optical_flow_cuda optical_flow.cpp)
+  target_link_libraries(optical_flow_cuda ArrayFire::afcuda)
+
+  add_executable(pyramids_cuda pyramids.cpp)
+  target_link_libraries(pyramids_cuda ArrayFire::afcuda)
+
+  # Gradient anisotropic diffusion example
+  add_executable(gradient_diffusion_cuda gradient_diffusion.cpp)
+  target_link_libraries(gradient_diffusion_cuda ArrayFire::afcuda)
+
+  # Image Deconvolution example
+  add_executable(deconvolution_cuda deconvolution.cpp)
+  target_link_libraries(deconvolution_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(adaptive_thresholding_opencl adaptive_thresholding.cpp)
+  target_link_libraries(adaptive_thresholding_opencl ArrayFire::afopencl)
+
+  add_executable(binary_thresholding_opencl binary_thresholding.cpp)
+  target_link_libraries(binary_thresholding_opencl ArrayFire::afopencl)
+
+  add_executable(brain_segmentation_opencl brain_segmentation.cpp)
+  target_link_libraries(brain_segmentation_opencl ArrayFire::afopencl)
+
+  add_executable(confidence_connected_components_opencl
+      confidence_connected_components.cpp)
+  target_link_libraries(confidence_connected_components_opencl ArrayFire::afopencl)
+
+  add_executable(edge_opencl edge.cpp)
+  target_link_libraries(edge_opencl ArrayFire::afopencl)
+
+  add_executable(filters_opencl filters.cpp)
+  target_link_libraries(filters_opencl ArrayFire::afopencl)
+
+  add_executable(image_demo_opencl image_demo.cpp)
+  target_link_libraries(image_demo_opencl ArrayFire::afopencl)
+
+  add_executable(image_editing_opencl image_editing.cpp)
+  target_link_libraries(image_editing_opencl ArrayFire::afopencl)
+
+  add_executable(morphing_opencl morphing.cpp)
+  target_link_libraries(morphing_opencl ArrayFire::afopencl)
+
+  add_executable(optical_flow_opencl optical_flow.cpp)
+  target_link_libraries(optical_flow_opencl ArrayFire::afopencl)
+
+  add_executable(pyramids_opencl pyramids.cpp)
+  target_link_libraries(pyramids_opencl ArrayFire::afopencl)
+
+  # Gradient anisotropic diffusion example
+  add_executable(gradient_diffusion_opencl gradient_diffusion.cpp)
+  target_link_libraries(gradient_diffusion_opencl ArrayFire::afopencl)
+
+  # Image Deconvolution example
+  add_executable(deconvolution_opencl deconvolution.cpp)
+  target_link_libraries(deconvolution_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(adaptive_thresholding_oneapi adaptive_thresholding.cpp)
+  target_link_libraries(adaptive_thresholding_oneapi ArrayFire::afoneapi)
+
+  add_executable(binary_thresholding_oneapi binary_thresholding.cpp)
+  target_link_libraries(binary_thresholding_oneapi ArrayFire::afoneapi)
+
+  add_executable(brain_segmentation_oneapi brain_segmentation.cpp)
+  target_link_libraries(brain_segmentation_oneapi ArrayFire::afoneapi)
+
+  add_executable(confidence_connected_components_oneapi
+      confidence_connected_components.cpp)
+  target_link_libraries(confidence_connected_components_oneapi ArrayFire::afoneapi)
+
+  add_executable(edge_oneapi edge.cpp)
+  target_link_libraries(edge_oneapi ArrayFire::afoneapi)
+
+  add_executable(filters_oneapi filters.cpp)
+  target_link_libraries(filters_oneapi ArrayFire::afoneapi)
+
+  add_executable(image_demo_oneapi image_demo.cpp)
+  target_link_libraries(image_demo_oneapi ArrayFire::afoneapi)
+
+  add_executable(image_editing_oneapi image_editing.cpp)
+  target_link_libraries(image_editing_oneapi ArrayFire::afoneapi)
+
+  add_executable(morphing_oneapi morphing.cpp)
+  target_link_libraries(morphing_oneapi ArrayFire::afoneapi)
+
+  add_executable(optical_flow_oneapi optical_flow.cpp)
+  target_link_libraries(optical_flow_oneapi ArrayFire::afoneapi)
+
+  add_executable(pyramids_oneapi pyramids.cpp)
+  target_link_libraries(pyramids_oneapi ArrayFire::afoneapi)
+
+  # Gradient anisotropic diffusion example
+  add_executable(gradient_diffusion_oneapi gradient_diffusion.cpp)
+  target_link_libraries(gradient_diffusion_oneapi ArrayFire::afoneapi)
+
+  # Image Deconvolution example
+  add_executable(deconvolution_oneapi deconvolution.cpp)
+  target_link_libraries(deconvolution_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/image_processing/adaptive_thresholding.cpp b/examples/image_processing/adaptive_thresholding.cpp
new file mode 100644
index 0000000000..db2a1d1697
--- /dev/null
+++ b/examples/image_processing/adaptive_thresholding.cpp
@@ -0,0 +1,99 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+using std::abs;
+
+typedef enum { MEAN = 0, MEDIAN, MINMAX_AVG } LocalThresholdType;
+
+array threshold(const array &in, float thresholdValue) {
+    int channels  = in.dims(2);
+    array ret_val = in.copy();
+    if (channels > 1) ret_val = colorSpace(in, AF_GRAY, AF_RGB);
+    ret_val =
+        (ret_val < thresholdValue) * 0.0f + 255.0f * (ret_val > thresholdValue);
+    return ret_val;
+}
+
+array adaptiveThreshold(const array &in, LocalThresholdType kind,
+                        int window_size, int constnt) {
+    int wr        = window_size;
+    array ret_val = colorSpace(in, AF_GRAY, AF_RGB);
+    if (kind == MEAN) {
+        array wind = constant(1, wr, wr) / (wr * wr);
+        array mean = convolve(ret_val, wind);
+        array diff = mean - ret_val;
+        ret_val    = (diff < constnt) * 0.f + 255.f * (diff > constnt);
+    } else if (kind == MEDIAN) {
+        array medf = medfilt(ret_val, wr, wr);
+        array diff = medf - ret_val;
+        ret_val    = (diff < constnt) * 0.f + 255.f * (diff > constnt);
+    } else if (kind == MINMAX_AVG) {
+        array minf = minfilt(ret_val, wr, wr);
+        array maxf = maxfilt(ret_val, wr, wr);
+        array mean = (minf + maxf) / 2.0f;
+        array diff = mean - ret_val;
+        ret_val    = (diff < constnt) * 0.f + 255.f * (diff > constnt);
+    }
+    ret_val = 255.f - ret_val;
+    return ret_val;
+}
+
+array iterativeThreshold(const array &in) {
+    array ret_val   = colorSpace(in, AF_GRAY, AF_RGB);
+    float T         = mean<float>(ret_val);
+    bool isContinue = true;
+    while (isContinue) {
+        array region1 = (ret_val > T) * ret_val;
+        array region2 = (ret_val <= T) * ret_val;
+        float r1_avg  = mean<float>(region1);
+        float r2_avg  = mean<float>(region2);
+        float tempT   = (r1_avg + r2_avg) / 2.0f;
+        if (abs(tempT - T) < 0.01f) { break; }
+        T = tempT;
+    }
+    return threshold(ret_val, T);
+}
+
+int main(int argc, char **argv) {
+    try {
+        int device = argc > 1 ? atoi(argv[1]) : 0;
+        af::setDevice(device);
+        af::info();
+
+        array sudoku =
+            loadImage(ASSETS_DIR "/examples/images/sudoku.jpg", true);
+
+        array mnt = adaptiveThreshold(sudoku, MEAN, 37, 10);
+        array mdt = adaptiveThreshold(sudoku, MEDIAN, 7, 4);
+        array mmt = adaptiveThreshold(sudoku, MINMAX_AVG, 11, 4);
+        array itt = 255.0f - iterativeThreshold(sudoku);
+
+        af::Window wnd("Adaptive Thresholding Algorithms");
+        printf("Press ESC while the window is in focus to exit\n");
+        while (!wnd.close()) {
+            wnd.grid(2, 3);
+            wnd(0, 0).image(sudoku / 255, "Input");
+            wnd(1, 0).image(mnt, "Adap. Threshold(Mean)");
+            wnd(0, 1).image(mdt, "Adap. Threshold(Median)");
+            wnd(1, 1).image(mmt, "Adap. Threshold(Avg. Min,Max)");
+            wnd(0, 2).image(itt, "Iterative Threshold");
+            wnd.show();
+        }
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
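Note on `iterativeThreshold` in adaptive_thresholding.cpp above: the loop masks the image with `(ret_val > T) * ret_val` and then takes `mean<float>` over the whole array, so each "region average" is divided by the total pixel count rather than by the region's own size. The sketch below shows the textbook isodata-style variant, where each mean is taken over its own region only. It is a plain C++ illustration under that assumption, not the ArrayFire code; `iterativeThresholdValue`, its `tolerance` parameter, and the iteration cap are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Iterative (isodata-style) threshold selection on a list of gray values.
// Assumption: each region mean is computed over that region's pixels only.
float iterativeThresholdValue(const std::vector<float>& pixels,
                              float tolerance = 0.01f) {
    assert(!pixels.empty());
    double sum = 0.0;
    for (float p : pixels) sum += p;
    float T = static_cast<float>(sum / pixels.size());  // start at global mean

    for (int iter = 0; iter < 1000; ++iter) {  // cap guards against cycling
        double hiSum = 0.0, loSum = 0.0;
        std::size_t hiCount = 0, loCount = 0;
        for (float p : pixels) {
            if (p > T) {
                hiSum += p;
                ++hiCount;
            } else {
                loSum += p;
                ++loCount;
            }
        }
        // All pixels on one side of T: nothing left to split, keep T.
        if (hiCount == 0 || loCount == 0) break;
        float next =
            static_cast<float>((hiSum / hiCount + loSum / loCount) / 2.0);
        if (std::fabs(next - T) < tolerance) break;  // converged
        T = next;
    }
    return T;
}
```

With two well-separated clusters the returned value settles midway between the cluster means, which is the behaviour the threshold is meant to have on a bimodal image.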
diff --git a/examples/image_processing/binary_thresholding.cpp b/examples/image_processing/binary_thresholding.cpp
new file mode 100644
index 0000000000..73a376b982
--- /dev/null
+++ b/examples/image_processing/binary_thresholding.cpp
@@ -0,0 +1,102 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+array threshold(const array& in, float thresholdValue) {
+    int channels  = in.dims(2);
+    array ret_val = in.copy();
+    if (channels > 1) ret_val = colorSpace(in, AF_GRAY, AF_RGB);
+    ret_val =
+        (ret_val < thresholdValue) * 0.0f + 255.0f * (ret_val > thresholdValue);
+    return ret_val;
+}
+
+/**
+ * Note:
+ * suffix B: gray levels up to and including the current gray level
+ * suffix F: gray levels after the current gray level
+ */
+array otsu(const array& in) {
+    array gray;
+    int channels = in.dims(2);
+    if (channels > 1)
+        gray = colorSpace(in, AF_GRAY, AF_RGB);
+    else
+        gray = in;
+    unsigned total = gray.elements();
+    array hist     = histogram(gray, 256, 0.0f, 255.0f);
+    array wts      = range(256);
+
+    array wtB   = accum(hist);
+    array wtF   = total - wtB;
+    array sumB  = accum(wts * hist);
+    array meanB = sumB / wtB;
+    float lastElemInSumB;
+    sumB(seq(255, 255, 1)).host((void*)&lastElemInSumB);
+    array meanF = (lastElemInSumB - sumB) / wtF;
+    array mDiff = meanB - meanF;
+
+    array interClsVar = wtB * wtF * mDiff * mDiff;
+
+    float max        = af::max<float>(interClsVar);
+    float threshold2 = where(interClsVar == max).scalar<unsigned>();
+    array threshIdx  = where(interClsVar >= max);
+    float threshold1 =
+        threshIdx.elements() > 0 ? threshIdx.scalar<unsigned>() : 0.0f;
+
+    return threshold(gray, (threshold1 + threshold2) / 2.0f);
+}
+
+int main(int argc, char** argv) {
+    try {
+        int device = argc > 1 ? atoi(argv[1]) : 0;
+        af::setDevice(device);
+        af::info();
+
+        array bimodal =
+            loadImage(ASSETS_DIR "/examples/images/noisy_square.png", false);
+        bimodal = resize(0.75f, bimodal);
+
+        array bt         = threshold(bimodal, 180.0f);
+        array ot         = otsu(bimodal);
+        array bimodHist  = histogram(bimodal, 256, 0, 255);
+        array smooth     = convolve(bimodal, gaussianKernel(5, 5));
+        array smoothHist = histogram(smooth, 256, 0, 255);
+
+        af::Window wnd(1536, 1024, "Binary Thresholding Algorithms");
+        printf("Press ESC while the window is in focus to proceed to exit\n");
+
+        wnd.grid(3, 3);
+        wnd(0, 1).setAxesTitles("Bins", "Frequency");
+        wnd(1, 1).setAxesTitles("Bins", "Frequency");
+        wnd(2, 1).setAxesTitles("Bins", "Frequency");
+        while (!wnd.close()) {
+            wnd(0, 0).image(bimodal / 255, "Input Image");
+            wnd(1, 0).image(bimodal / 255, "Input Image");
+            wnd(2, 0).image(smooth / 255, "Input Smoothed by Gaussian Filter");
+            wnd(0, 1).hist(bimodHist, 0, 255, "Input Histogram");
+            wnd(1, 1).hist(bimodHist, 0, 255, "Input Histogram");
+            wnd(2, 1).hist(smoothHist, 0, 255, "Smoothed Input Histogram");
+            wnd(0, 2).image(bt, "Simple Binary threshold");
+            wnd(1, 2).image(ot, "Otsu's Threshold");
+            wnd(2, 2).image(otsu(smooth), "Otsu's Threshold on Smoothed Image");
+            wnd.show();
+        }
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
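Note on `otsu` in binary_thresholding.cpp above: the function picks the gray level that maximizes the between-class variance `wtB * wtF * (meanB - meanF)^2`, computed from cumulative histogram sums. The sketch below walks through the same criterion on a plain histogram in standard C++, as an aid for checking the array version. `otsuThresholdBin` is a hypothetical helper; unlike the array code it skips levels where one class is empty, which is an added guard against division by zero rather than something the patch does.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the bin index t that maximizes the between-class variance
// wB * wF * (meanB - meanF)^2, where class B covers bins [0, t] and
// class F covers the remaining bins.
std::size_t otsuThresholdBin(const std::vector<unsigned>& hist) {
    assert(!hist.empty());
    double total = 0.0, weightedTotal = 0.0;
    for (std::size_t i = 0; i < hist.size(); ++i) {
        total += hist[i];
        weightedTotal += static_cast<double>(i) * hist[i];
    }

    double wB = 0.0, sumB = 0.0, bestVar = -1.0;
    std::size_t bestBin = 0;
    for (std::size_t t = 0; t < hist.size(); ++t) {
        wB += hist[t];                              // running count (wtB)
        sumB += static_cast<double>(t) * hist[t];   // running weighted sum
        double wF = total - wB;                     // remaining count (wtF)
        if (wB == 0.0 || wF == 0.0) continue;       // skip empty classes
        double mDiff = sumB / wB - (weightedTotal - sumB) / wF;
        double var = wB * wF * mDiff * mDiff;
        if (var > bestVar) {                        // keep the first maximum
            bestVar = var;
            bestBin = t;
        }
    }
    return bestBin;
}
```

For a histogram with two equal spikes, every bin between the spikes gives the same variance, and this sketch returns the first of them.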
diff --git a/examples/image_processing/brain_segmentation.cpp b/examples/image_processing/brain_segmentation.cpp
index 92c29bdd90..316d1508a2 100644
--- a/examples/image_processing/brain_segmentation.cpp
+++ b/examples/image_processing/brain_segmentation.cpp
@@ -7,45 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <string.h>
-#include <stdio.h>
-#include <math.h>
 #include <arrayfire.h>
+#include <math.h>
+#include <stdio.h>
+#include <string.h>
 #include "../common/progress.h"
 
 using namespace af;
 
-const float h_sx_kernel[] = {  1,  2,  1,
-    0,  0,  0,
-    -1, -2, -1
-};
-const float h_sy_kernel[] = { -1, 0, 1,
-    -2, 0, 2,
-    -1, 0, 1
-};
-const float h_lp_kernel[] = { -0.5f, -1.0f, -0.5f,
-    -1.0f,  6.0f, -1.0f,
-    -0.5f, -1.0f, -0.5f
-};
-
-array edges_slice(array x)
-{
+const float h_sx_kernel[] = {1, 2, 1, 0, 0, 0, -1, -2, -1};
+const float h_sy_kernel[] = {-1, 0, 1, -2, 0, 2, -1, 0, 1};
+
+// Unused
+// const float h_lp_kernel[] = { -0.5f, -1.0f, -0.5f,
+//    -1.0f,  6.0f, -1.0f,
+//    -0.5f, -1.0f, -0.5f
+//};
+
+array edges_slice(array x) {
     array ret;
     static array kernelx = array(dim4(3, 3), h_sx_kernel);
     static array kernely = array(dim4(3, 3), h_sy_kernel);
-    ret = convolve(x, kernelx) + convolve(x, kernely);
+    ret                  = convolve(x, kernelx) + convolve(x, kernely);
     return abs(ret);
 }
 
-array gauss(array x, float u, float s)
-{
+array gauss(array x, float u, float s) {
     double f = 1 / sqrt(2 * af::Pi * s * s);
-    array e = exp(-pow((x - u), 2) / (2 * s * s));
+    array e  = exp(-pow((x - u), 2) / (2 * s * s));
     return f * e;
 }
 
-array segment_volume(array A, int k)
-{
+array segment_volume(array A, int k) {
     array I1 = A(span, span, k);
 
     float mx = max<float>(I1);
@@ -84,26 +77,25 @@ array segment_volume(array A, int k)
     L12_old = L12;
 
     array L1 = (L10 + L11 + L12) / 3;
-    array S = (L0 > L1);
+    array S  = (L0 > L1);
     return S.as(A.type());
 }
 
-void brain_seg(bool console)
-{
+void brain_seg(bool console) {
     af::Window wnd("Brain Segmentation Demo");
     wnd.setColorMap(AF_COLORMAP_HEAT);
 
-    double time_total = 30; // run for N seconds
+    double time_total = 30;  // run for N seconds
 
-    array B = loadImage(ASSETS_DIR "/examples/images/brain.png");
+    array B    = loadImage(ASSETS_DIR "/examples/images/brain.png");
     int slices = 256;
 
-    B = moddims(B, dim4(B.dims(0), B.dims(1)/slices, slices));
+    B = moddims(B, dim4(B.dims(0), B.dims(1) / slices, slices));
     af::sync();
 
     int N = 2 * slices - 1;
 
-    timer t = timer::start();
+    timer t  = timer::start();
     int iter = 0;
 
     /* loop forward and backward for 100 frames
@@ -113,8 +105,8 @@ void brain_seg(bool console)
     for (int i = 0; !wnd.close(); i++) {
         iter++;
 
-        int j = i % N;
-        int k = std::min(j, N - j);
+        int j    = i % N;
+        int k    = std::min(j, N - j);
         array Bi = B(span, span, k);
 
         /* process */
@@ -126,9 +118,9 @@ void brain_seg(bool console)
         if (!console) {
             wnd.grid(2, 2);
 
-            wnd(0, 0).image(Bi/255.f, "Input");
+            wnd(0, 0).image(Bi / 255.f, "Input");
             wnd(1, 0).image(Ei, "Edges");
-            wnd(0, 1).image(Mi/255.f, "Meanshift");
+            wnd(0, 1).image(Mi / 255.f, "Meanshift");
             wnd(1, 1).image(Si, "Segmented");
 
             wnd.show();
@@ -141,16 +133,13 @@ void brain_seg(bool console)
 
         /* we have had ran throuh simlation results
          * exit the rendering loop */
-        if (!progress(iter, t, time_total))
-            break;
-        if (!(i<100*N))
-            break;
+        if (!progress(iter, t, time_total)) break;
+        if (!(i < 100 * N)) break;
     }
 }
 
-int main(int argc, char* argv[])
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
+int main(int argc, char* argv[]) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
@@ -160,8 +149,7 @@ int main(int argc, char* argv[])
         printf("Brain segmentation example\n");
         brain_seg(console);
 
-    } catch (af::exception& e) {
-        fprintf(stderr, "%s\n", e.what());
-    }
+    } catch (af::exception& e) { fprintf(stderr, "%s\n", e.what()); }
+
     return 0;
 }
diff --git a/examples/image_processing/confidence_connected_components.cpp b/examples/image_processing/confidence_connected_components.cpp
new file mode 100644
index 0000000000..4671253bc1
--- /dev/null
+++ b/examples/image_processing/confidence_connected_components.cpp
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cassert>
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+
+using namespace af;
+
+array normalize01(const array& in) {
+    float min = af::min<float>(in);
+    float max = af::max<float>(in);
+    return (in - min) / (max - min);
+}
+
+void markCrossHair(array& in, const unsigned x, const unsigned y,
+                   const float val) {
+    const int draw_len = 5;
+    for (int i = -1; i < 2; i++) {
+        in(x + i, seq(y - draw_len, y + draw_len), 0) = val;
+        in(x + i, seq(y - draw_len, y + draw_len), 1) = 0.f;
+        in(x + i, seq(y - draw_len, y + draw_len), 2) = 0.f;
+
+        in(seq(x - draw_len, x + draw_len), y + i, 0) = val;
+        in(seq(x - draw_len, x + draw_len), y + i, 1) = 0.f;
+        in(seq(x - draw_len, x + draw_len), y + i, 2) = 0.f;
+    }
+}
+
+int main(int argc, char* argv[]) {
+    try {
+        unsigned radius     = 3;
+        unsigned multiplier = 2;
+        int iter            = 3;
+
+        array input =
+            loadImage(ASSETS_DIR "/examples/images/depression.jpg", false);
+        array normIn = normalize01(input);
+
+        unsigned seedx = 162;
+        unsigned seedy = 126;
+        array blob = confidenceCC(input, 1, &seedx, &seedy, radius, multiplier,
+                                  iter, 255);
+
+        array colorIn  = colorSpace(normIn, AF_RGB, AF_GRAY);
+        array colorOut = colorSpace(blob, AF_RGB, AF_GRAY);
+
+        markCrossHair(colorIn, seedx, seedy, 1);
+        markCrossHair(colorOut, seedx, seedy, 255);
+
+        af::Window wnd("Confidence Connected Components Demo");
+        while (!wnd.close()) {
+            wnd.grid(1, 2);
+            wnd(0, 0).image(colorIn, "Input Brain Scan");
+            wnd(0, 1).image(colorOut, "Region connected to Seed(162, 126)");
+            wnd.show();
+        }
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/image_processing/deconvolution.cpp b/examples/image_processing/deconvolution.cpp
new file mode 100644
index 0000000000..201d3a8a43
--- /dev/null
+++ b/examples/image_processing/deconvolution.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <stdio.h>
+#include <af/util.h>
+#include <cstdlib>
+
+using namespace af;
+
+const unsigned ITERATIONS     = 96;
+const float RELAXATION_FACTOR = 0.05f;
+
+array normalize(const array &in) {
+    float mx = max<float>(in.as(f32));
+    float mn = min<float>(in.as(f32));
+    return (in - mn) / (mx - mn);
+}
+
+int main(int argc, char *argv[]) {
+    int device = argc > 1 ? atoi(argv[1]) : 0;
+
+    try {
+        af::setDevice(device);
+        af::info();
+
+        printf("** ArrayFire Image Deconvolution Demo **\n");
+        af::Window myWindow("Image Deconvolution");
+
+        array in = loadImage(ASSETS_DIR "/examples/images/house.jpg", false);
+        array kernel  = gaussianKernel(13, 13, 2.25, 2.25);
+        array blurred = convolve(in, kernel);
+        array tikhonov =
+            inverseDeconv(blurred, kernel, 0.05, AF_INVERSE_DECONV_TIKHONOV);
+
+        array landweber =
+            iterativeDeconv(blurred, kernel, ITERATIONS, RELAXATION_FACTOR,
+                            AF_ITERATIVE_DECONV_LANDWEBER);
+
+        array richlucy =
+            iterativeDeconv(blurred, kernel, ITERATIONS, RELAXATION_FACTOR,
+                            AF_ITERATIVE_DECONV_RICHARDSONLUCY);
+
+        while (!myWindow.close()) {
+            myWindow.grid(2, 3);
+
+            myWindow(0, 0).image(normalize(in), "Input Image");
+            myWindow(1, 0).image(normalize(blurred), "Blurred Image");
+            myWindow(0, 1).image(normalize(tikhonov), "Tikhonov");
+            myWindow(1, 1).image(normalize(landweber), "Landweber");
+            myWindow(0, 2).image(normalize(richlucy), "Richardson-Lucy");
+
+            myWindow.show();
+        }
+
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/image_processing/edge.cpp b/examples/image_processing/edge.cpp
index 1c17320816..a8c29a26df 100644
--- a/examples/image_processing/edge.cpp
+++ b/examples/image_processing/edge.cpp
@@ -7,16 +7,15 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <af/util.h>
 #include <cstdlib>
 
 using namespace af;
 
-void prewitt(array &mag, array &dir, const array &in)
-{
-    static float h1[] = { 1, 1, 1};
+void prewitt(array &mag, array &dir, const array &in) {
+    static float h1[] = {1, 1, 1};
     static float h2[] = {-1, 0, 1};
     static array colf(3, 1, h1);
     static array rowf(3, 1, h2);
@@ -30,8 +29,7 @@ void prewitt(array &mag, array &dir, const array &in)
     dir = atan2(Gy, Gx);
 }
 
-void sobelFilter(array &mag, array &dir, const array &in)
-{
+void sobelFilter(array &mag, array &dir, const array &in) {
     array Gx, Gy;
     sobel(Gx, Gy, in, 3);
     // Find magnitude and direction
@@ -39,55 +37,58 @@ void sobelFilter(array &mag, array &dir, const array &in)
     dir = atan2(Gy, Gx);
 }
 
-array normalize(const array &in)
-{
+array normalize(const array &in) {
     float mx = max<float>(in);
     float mn = min<float>(in);
-    return (in-mn)/(mx-mn);
+    return (in - mn) / (mx - mn);
 }
 
-array edge(const array &in, int method = 0)
-{
+array edge(const array &in, int method = 0) {
     int w = 5;
-    if (in.dims(0) <  512) w = 3;
+    if (in.dims(0) < 512) w = 3;
     if (in.dims(0) > 2048) w = 7;
 
     int h = 5;
-    if (in.dims(0) <  512) h = 3;
+    if (in.dims(0) < 512) h = 3;
     if (in.dims(0) > 2048) h = 7;
 
-    array ker = gaussianKernel(w, h);
+    array ker    = gaussianKernel(w, h);
     array smooth = convolve(in, ker);
     array mag, dir;
 
-    switch(method) {
-        case  1: prewitt(mag, dir, smooth); break;
-        case  2: sobelFilter(mag, dir, smooth);   break;
+    switch (method) {
+        case 1: prewitt(mag, dir, smooth); break;
+        case 2: sobelFilter(mag, dir, smooth); break;
+        case 3:
+            mag = canny(in, AF_CANNY_THRESHOLD_AUTO_OTSU, 0.18, 0.54).as(f32);
+            break;
         default: throw af::exception("Unsupported type");
     }
 
     return normalize(mag);
 }
 
-void edge(bool console)
-{
+void edge() {
     af::Window myWindow("Edge Dectectors");
     af::Window myWindow2(512, 512, "Histogram");
 
-    array in = loadImage(ASSETS_DIR "/examples/images/lena.ppm", false);
+    array in = loadImage(ASSETS_DIR "/examples/images/trees_ctm.jpg", false);
 
-    array prewitt = edge(in, 1);
-    array sobelFilter   = edge(in, 2);
-    array hst = histogram(in, 256, 0, 255);
+    array prewitt     = edge(in, 1);
+    array sobelFilter = edge(in, 2);
+    array hst         = histogram(in, 256, 0, 255);
+    array cny         = edge(in, 3);
 
-    while(!myWindow.close() && !myWindow2.close()) {
+    myWindow2.setAxesTitles("Bins", "Frequency");
 
+    while (!myWindow.close() && !myWindow2.close()) {
         /* show input, prewitt and sobel edge detectors in a grid */
         myWindow.grid(2, 2);
 
-        myWindow(0,0).image(in/255     , "Input Image");
-        myWindow(0,1).image(prewitt    , "Prewitt"    );
-        myWindow(1,0).image(sobelFilter, "Sobel"      );
+        myWindow(0, 0).image(in / 255, "Input Image");
+        myWindow(0, 1).image(prewitt, "Prewitt");
+        myWindow(1, 0).image(sobelFilter, "Sobel");
+        myWindow(1, 1).image(cny, "Canny");
 
         myWindow.show();
 
@@ -96,17 +97,15 @@ void edge(bool console)
     }
 }
 
-int main(int argc, char* argv[])
-{
+int main(int argc, char *argv[]) {
     int device = argc > 1 ? atoi(argv[1]) : 0;
-    bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::setDevice(device);
         af::info();
 
         printf("** ArrayFire Edge Detection Demo **\n");
-        edge(console);
+        edge();
 
     } catch (af::exception &e) {
         fprintf(stderr, "%s\n", e.what());
diff --git a/examples/image_processing/filters.cpp b/examples/image_processing/filters.cpp
new file mode 100644
index 0000000000..0c8ba11fd2
--- /dev/null
+++ b/examples/image_processing/filters.cpp
@@ -0,0 +1,245 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+/**
+ * randomization - controls % of total number of pixels in the image
+ * that will be affected by random noise
+ * repeat - # of times the process is carried out on the previous step's output
+ */
+array hurl(const array &in, int randomization, int repeat) {
+    int w         = in.dims(0);
+    int h         = in.dims(1);
+    float f       = randomization / 100.0f;
+    int dim       = (int)(f * w * h);
+    array ret_val = in.copy();
+    array temp    = moddims(ret_val, w * h, 3);
+    for (int i = 0; i < repeat; ++i) {
+        array idxs    = (w * h) * randu(dim);
+        array rndR    = 255.0f * randu(dim);
+        array rndG    = 255.0f * randu(dim);
+        array rndB    = 255.0f * randu(dim);
+        temp(idxs, 0) = rndR;
+        temp(idxs, 1) = rndG;
+        temp(idxs, 2) = rndB;
+    }
+    ret_val = moddims(temp, in.dims());
+    return ret_val;
+}
+
+/**
+ * Retrieve a new image with the same dimensions as the original image,
+ * where each pixel of the original image is replaced by a randomly picked
+ * neighbor from the provided local neighborhood window
+ */
+array getRandomNeighbor(const array &in, int windW, int windH) {
+    array rnd = 2.0f * randu(in.dims(0), in.dims(1)) - 1.0f;
+    array sx  = seq(in.dims(0));
+    array sy  = seq(in.dims(1));
+    array vx  = tile(sx, 1, in.dims(1)) + floor(rnd * windW);
+    array vy  = tile(sy.T(), in.dims(0), 1) + floor(rnd * windH);
+    array vxx = clamp(vx, 0, in.dims(0) - 1);
+    array vyy = clamp(vy, 0, in.dims(1) - 1);
+    array in2 = moddims(in, vx.elements(), 3);
+    return moddims(in2(vyy * in.dims(0) + vxx, span), in.dims());
+}
+
+/**
+ * Randomly pick a neighbor from a window of the given size and replace
+ * the current pixel with the randomly chosen color.
+ * No new colors are introduced, unlike hurl.
+ */
+array spread(const array &in, int window_width, int window_height) {
+    return getRandomNeighbor(in, window_width, window_height);
+}
+
+/**
+ * randomization - controls % of total number of pixels in the image
+ * that will be affected by random noise
+ * repeat - # of times the process is carried out on the previous step's output
+ */
+array pick(const array &in, int randomization, int repeat) {
+    int w         = in.dims(0);
+    int h         = in.dims(1);
+    float f       = randomization / 100.0f;
+    int dim       = (int)(f * w * h);
+    array ret_val = in.copy();
+    for (int i = 0; i < repeat; ++i) {
+        array idxs           = (w * h) * randu(dim);
+        array rnd            = getRandomNeighbor(ret_val, 1, 1);
+        array temp_src       = moddims(rnd, w * h, 3);
+        array temp_dst       = moddims(ret_val, w * h, 3);
+        temp_dst(idxs, span) = temp_src(idxs, span);
+        ret_val              = moddims(temp_dst, in.dims());
+    }
+    return ret_val;
+}
+
+void prewitt(array &mag, array &dir, const array &in) {
+    static float h1[] = {1, 1, 1};
+    static float h2[] = {-1, 0, 1};
+    static array h1d(3, h1);
+    static array h2d(3, h2);
+
+    // Find the gradients
+    array Gy = af::convolve(h2d, h1d, in) / 6;
+    array Gx = af::convolve(h1d, h2d, in) / 6;
+
+    // Find magnitude and direction
+    mag = hypot(Gx, Gy);
+    dir = atan2(Gy, Gx);
+}
+
+void sobelFilter(array &mag, array &dir, const array &in) {
+    // Find the gradients
+    array Gy, Gx;
+    af::sobel(Gx, Gy, in);
+
+    // Find magnitude and direction
+    mag = hypot(Gx, Gy);
+    dir = atan2(Gy, Gx);
+}
+
+void normalizeImage(array &in) {
+    float min = af::min<float>(in);
+    float max = af::max<float>(in);
+    in        = 255.0f * ((in - min) / (max - min));
+}
+
+array DifferenceOfGaussian(const array &in, int window_radius1,
+                           int window_radius2) {
+    array ret_val;
+    int w1   = 2 * window_radius1 + 1;
+    int w2   = 2 * window_radius2 + 1;
+    array g1 = gaussianKernel(w1, w1);
+    array g2 = gaussianKernel(w2, w2);
+    ret_val  = (convolve(in, g1) - convolve(in, g2));
+    normalizeImage(ret_val);
+    return ret_val;
+}
+
+array medianfilter(const array &in, int window_width, int window_height) {
+    array ret_val(in.dims());
+    ret_val(span, span, 0) =
+        medfilt(in(span, span, 0), window_width, window_height);
+    ret_val(span, span, 1) =
+        medfilt(in(span, span, 1), window_width, window_height);
+    ret_val(span, span, 2) =
+        medfilt(in(span, span, 2), window_width, window_height);
+    return ret_val;
+}
+
+array gaussianblur(const array &in, int window_width, int window_height,
+                   double sigma) {
+    array g = gaussianKernel(window_width, window_height, sigma, sigma);
+    return convolve(in, g);
+}
+
+/**
+ * azimuth range is [0-360]
+ * elevation range is [0-180]
+ * depth range is [1-100]
+ * Note: this function is modeled after
+ * the emboss implementation in the GIMP editor
+ **/
+array emboss(const array &input, float azimuth, float elevation, float depth) {
+    if (depth < 1 || depth > 100) {
+        printf("Depth should be in the range of 1-100\n");
+        return input;
+    }
+    static float x[3] = {-1, 0, 1};
+    static array hg(3, x);
+    static array vg = hg.T();
+
+    array in = input;
+    if (in.dims(2) > 1)
+        in = colorSpace(input, AF_GRAY, AF_RGB);
+    else
+        in = input;
+
+    // convert angles to radians
+    float phi   = elevation * af::Pi / 180.0f;
+    float theta = azimuth * af::Pi / 180.0f;
+
+    // compute light pos in cartesian coordinates
+    // and scale with maximum intensity
+    // phi will affect the amount of light we intend to put
+    // on a pixel
+    float pos[3];
+    pos[0] = 255.99f * cos(phi) * cos(theta);
+    pos[1] = 255.99f * cos(phi) * sin(theta);
+    pos[2] = 255.99f * sin(phi);
+
+    // compute gradient vector
+    array gx = convolve(in, vg);
+    array gy = convolve(in, hg);
+
+    float pxlz   = (6 * 255.0f) / depth;
+    array zdepth = constant(pxlz, gx.dims());
+    array vdot   = gx * pos[0] + gy * pos[1] + pxlz * pos[2];
+    array outwd  = vdot < 0.0f;
+    array norm   = vdot / sqrt(gx * gx + gy * gy + zdepth * zdepth);
+
+    array color = outwd * 0.0f + (1 - outwd) * norm;
+    return color;
+}
+
+int main(int argc, char **argv) {
+    try {
+        int device = argc > 1 ? atoi(argv[1]) : 0;
+        af::setDevice(device);
+        af::info();
+
+        array img =
+            loadImage(ASSETS_DIR "/examples/images/vegetable-woman.jpg", true);
+
+        array prew_mag, prew_dir;
+        array sob_mag, sob_dir;
+        array img1ch = colorSpace(img, AF_GRAY, AF_RGB);
+        prewitt(prew_mag, prew_dir, img1ch);
+        sobelFilter(sob_mag, sob_dir, img1ch);
+        array sprd  = spread(img, 3, 3);
+        array hrl   = hurl(img, 10, 1);
+        array pckng = pick(img, 40, 2);
+        array difog = DifferenceOfGaussian(img, 1, 2);
+        array bil   = bilateral(hrl, 3.0f, 40.0f);
+        array mf    = medianfilter(hrl, 5, 5);
+        array gb    = gaussianblur(hrl, 3, 3, 0.8);
+        array emb   = emboss(img, 45, 20, 10);
+
+        af::Window wnd("Image Filters Demo");
+        printf("Press ESC while the window is in focus to exit\n");
+        while (!wnd.close()) {
+            wnd.grid(2, 5);
+            wnd(0, 0).image(hrl / 255, "Hurl noise");
+            wnd(1, 0).image(gb / 255, "Gaussian blur");
+            wnd(0, 1).image(bil / 255, "Bilateral filter on hurl noise");
+            wnd(1, 1).image(mf / 255, "Median filter on hurl noise");
+            wnd(0, 2).image(prew_mag / 255, "Prewitt edge filter");
+            wnd(1, 2).image(sob_mag / 255, "Sobel edge filter");
+            wnd(0, 3).image(sprd / 255, "Spread filter");
+            wnd(1, 3).image(pckng / 255, "Pick filter");
+            wnd(0, 4).image(difog / 255,
+                            "Difference of Gaussians (3x3 and 5x5)");
+            wnd(1, 4).image(emb / 255, "Emboss effect");
+            wnd.show();
+        }
+
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
diff --git a/examples/image_processing/gradient_diffusion.cpp b/examples/image_processing/gradient_diffusion.cpp
new file mode 100644
index 0000000000..cf4cc402e0
--- /dev/null
+++ b/examples/image_processing/gradient_diffusion.cpp
@@ -0,0 +1,98 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <stdio.h>
+#include <af/util.h>
+#include <cstdlib>
+
+using namespace af;
+
+static const unsigned ITERS = 64;
+
+array normalize(const array &p_in) {
+    float mx = max<float>(p_in);
+    float mn = min<float>(p_in);
+    return (p_in - mn) / (mx - mn);
+}
+
+array sobelFilter(const array &p_in) {
+    int w = 5;
+    if (p_in.dims(0) < 512) w = 3;
+    if (p_in.dims(0) > 2048) w = 7;
+
+    int h = 5;
+    if (p_in.dims(0) < 512) h = 3;
+    if (p_in.dims(0) > 2048) h = 7;
+
+    array ker    = gaussianKernel(w, h);
+    array smooth = convolve(p_in, ker);
+
+    for (unsigned i = 1; i < ITERS; ++i) smooth = convolve(smooth, ker);
+
+    array Gx, Gy;
+    sobel(Gx, Gy, smooth, 3);
+
+    return normalize(hypot(Gx, Gy));
+}
+
+array in, edges, smoothed;
+
+void anisotropicSmoothing() {
+    smoothed = anisotropicDiffusion(in, 0.125, 0.35f, ITERS);
+}
+
+int main(int argc, char *argv[]) {
+    int device = argc > 1 ? atoi(argv[1]) : 0;
+
+    try {
+        setDevice(device);
+        info();
+
+        printf("** ArrayFire Gradient Anisotropic Smoothing Demo **\n");
+
+        Window myWindow("Gradient Anisotropic Smoothing");
+
+        in = loadImage(ASSETS_DIR "/examples/images/man.jpg", false);
+
+        array sEdges = sobelFilter(in);
+
+        anisotropicSmoothing();
+
+        array Gx, Gy;
+        sobel(Gx, Gy, smoothed, 3);
+
+        edges = normalize(hypot(Gx, Gy));
+
+        while (!myWindow.close()) {
+            myWindow.grid(2, 2);
+
+            myWindow(0, 0).image(in / 255.0f, "Input Image");
+            myWindow(0, 1).image(normalize(smoothed),
+                                 "Anisotropically smoothed Input");
+            myWindow(1, 0).image(normalize(sEdges),
+                                 "Gradient Magnitude after gaussian blur t=64");
+            myWindow(1, 1).image(normalize(edges),
+                                 "Gradient Magnitude after diffusion t=64");
+
+            myWindow.show();
+        }
+
+        printf(
+            "\nAnisotropic Diffusion avg runtime for current image in Seconds: "
+            "%g\n",
+            timeit(anisotropicSmoothing));
+
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/image_processing/image_demo.cpp b/examples/image_processing/image_demo.cpp
index 01b13a91bf..6a73c2cb91 100644
--- a/examples/image_processing/image_demo.cpp
+++ b/examples/image_processing/image_demo.cpp
@@ -7,13 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <cstdlib>
 
 using namespace af;
 
-
 // Split a MxNx3 image into 3 separate channel matrices.
 static void channel_split(array& rgb, array& outr, array& outg, array& outb) {
     outr = rgb(span, span, 0);
@@ -23,23 +22,17 @@ static void channel_split(array& rgb, array& outr, array& outg, array& outb) {
 
 // 5x5 sigma-3 gaussian blur weights
 static const float h_gauss[] = {
-    0.0318,  0.0375,  0.0397,  0.0375,  0.0318,
-    0.0375,  0.0443,  0.0469,  0.0443,  0.0375,
-    0.0397,  0.0469,  0.0495,  0.0469,  0.0397,
-    0.0375,  0.0443,  0.0469,  0.0443,  0.0375,
-    0.0318,  0.0375,  0.0397,  0.0375,  0.0318,
+    0.0318f, 0.0375f, 0.0397f, 0.0375f, 0.0318f, 0.0375f, 0.0443f,
+    0.0469f, 0.0443f, 0.0375f, 0.0397f, 0.0469f, 0.0495f, 0.0469f,
+    0.0397f, 0.0375f, 0.0443f, 0.0469f, 0.0443f, 0.0375f, 0.0318f,
+    0.0375f, 0.0397f, 0.0375f, 0.0318f,
 };
 
 // 3x3 sobel weights
-static const float h_sobel[] = {
-    -2.0, -1.0,  0.0,
-    -1.0,  0.0,  1.0,
-    0.0,  1.0,  2.0
-};
+static const float h_sobel[] = {-2.0, -1.0, 0.0, -1.0, 0.0, 1.0, 0.0, 1.0, 2.0};
 
 // Demonstrates various image manipulations.
-static void img_test_demo(bool console)
-{
+static void img_test_demo() {
     af::Window wnd("Image Demo");
 
     // load convolution kernels
@@ -47,12 +40,15 @@ static void img_test_demo(bool console)
     array sobel_k = array(3, 3, h_sobel);
 
     // load images
-    array img_gray = loadImage(ASSETS_DIR "/examples/images/trees_ctm.jpg", false);         // 1 channel grayscale [0-255]
-    array img_rgb  = loadImage(ASSETS_DIR "/examples/images/sunset_emp.jpg", true) / 255.f; // 3 channel RGB       [0-1]
+    array img_gray = loadImage(ASSETS_DIR "/examples/images/trees_ctm.jpg",
+                               false);  // 1 channel grayscale [0-255]
+    array img_rgb =
+        loadImage(ASSETS_DIR "/examples/images/sunset_emp.jpg", true) /
+        255.f;  // 3 channel RGB       [0-1]
 
-    array rotatedImg = rotate(img_gray, Pi / 2, false)/255.f;
-    //array thrs_img = (img_gray < 130.f).as(s32);
-    array thrs_img = (img_gray<130.f).as(f32);
+    array rotatedImg = rotate(img_gray, Pi / 2, false) / 255.f;
+    // array thrs_img = (img_gray < 130.f).as(s32);
+    array thrs_img = (img_gray < 130.f).as(f32);
 
     // rgb channels
     array rr, gg, bb;
@@ -65,10 +61,10 @@ static void img_test_demo(bool console)
 
     // image histogram equalization
     array ihist = histogram(img_gray, 256, 0, 255);
-    array inorm = histEqual(img_gray, ihist)/255.f;
+    array inorm = histEqual(img_gray, ihist) / 255.f;
 
-    array edge_det = abs(convolve(img_gray, sobel_k))/255.f;
-    array smt = convolve(img_gray, gauss_k)/255.f;
+    array edge_det = abs(convolve(img_gray, sobel_k)) / 255.f;
+    array smt      = convolve(img_gray, gauss_k) / 255.f;
 
     while (!wnd.close()) {
         wnd.grid(2, 4);
@@ -90,18 +86,14 @@ static void img_test_demo(bool console)
     }
 }
 
-
-
-int main(int argc, char** argv)
-{
+int main(int argc, char** argv) {
     int device = argc > 1 ? atoi(argv[1]) : 0;
-    bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::setDevice(device);
         af::info();
         printf("** ArrayFire Image Demo **\n\n");
-        img_test_demo(console);
+        img_test_demo();
 
     } catch (af::exception& e) {
         fprintf(stderr, "%s\n", e.what());
diff --git a/examples/image_processing/image_editing.cpp b/examples/image_processing/image_editing.cpp
new file mode 100644
index 0000000000..41c9390656
--- /dev/null
+++ b/examples/image_processing/image_editing.cpp
@@ -0,0 +1,132 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cmath>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+/**
+ * contrast value should be in the range [-1,1]
+ * */
+array changeContrast(const array &in, const float contrast) {
+    float scale = tan((contrast + 1) * Pi / 4);
+    return (((in / 255.0f - 0.5f) * scale + 0.5f) * 255.0f);
+}
+
+/**
+ * brightness value should be in the range [0,1]
+ * */
+array changeBrightness(const array &in, const float brightness,
+                       const float channelMax = 255.0f) {
+    float factor = brightness * channelMax;
+    return (in + factor);
+}
+
+array clamp(const array &in, float min = 0.0f, float max = 255.0f) {
+    return ((in < min) * 0.0f + (in > max) * 255.0f +
+            (in >= min && in <= max) * in);
+}
+
+/**
+ * radius affects the level of detail that will be affected during the
+ * sharpening process. amount should be in the range [0,1] or [1,inf).
+ * Note: an amount of 1.0 results in unsharp masking; values > 1.0 result
+ * in a highboost filter effect
+ * */
+array usm(const array &in, float radius, float amount) {
+    int gKernelLen   = 2 * radius + 1;
+    array blurKernel = gaussianKernel(gKernelLen, gKernelLen);
+    array blur       = convolve(in, blurKernel);
+    return (in + amount * (in - blur));
+}
+
+/**
+ * x,y - starting position of zoom
+ * width, height - dimensions of the rectangle to zoom in to
+ * */
+array digZoom(const array &in, int x, int y, int width, int height) {
+    array cropped = in(seq(x, width - 1), seq(y, height - 1), span);
+    return resize(cropped, (unsigned)in.dims(0), (unsigned)in.dims(1));
+}
+
+/**
+ * a - foreground image
+ * b - background image
+ * mask - mask map
+ * */
+array alphaBlend(const array &a, const array &b, const array &mask) {
+    array tiledMask = mask;
+    if (mask.dims(2) != a.dims(2)) tiledMask = tile(mask, 1, 1, a.dims(2));
+    return a * tiledMask + (1.0f - tiledMask) * b;
+}
+
+void normalizeImage(array &in) {
+    float min = af::min<float>(in);
+    float max = af::max<float>(in);
+    in        = 255.0f * ((in - min) / (max - min));
+}
+
+/**
+ * dimensions of the mask control the thickness of the boundary that
+ * will be extracted by the following function
+ */
+array boundary(const array &in, const array &mask) {
+    array ret_val = in - erode(in, mask);
+    normalizeImage(ret_val);
+    return ret_val;
+}
+
+int main(int argc, char **argv) {
+    try {
+        int device = argc > 1 ? atoi(argv[1]) : 0;
+        af::setDevice(device);
+        af::info();
+
+        array man   = loadImage(ASSETS_DIR "/examples/images/man.jpg", true);
+        array fight = loadImage(ASSETS_DIR "/examples/images/fight.jpg", true);
+        array nature =
+            resize(loadImage(ASSETS_DIR "/examples/images/nature.jpg", true),
+                   fight.dims(0), fight.dims(1));
+
+        array intensity  = colorSpace(fight, AF_GRAY, AF_RGB);
+        array mask       = clamp(intensity, 10.0f, 255.0f) > 0.0f;
+        array blend      = alphaBlend(fight, nature, mask);
+        array highcon    = changeContrast(man, 0.3);
+        array highbright = changeBrightness(man, 0.2);
+        array translated = translate(man, 100, 100, 200, 126);
+        array sharp      = usm(man, 3, 1.2);
+        array zoom       = digZoom(man, 28, 10, 192, 192);
+        array morph_mask = constant(1, 3, 3);
+        array bdry       = boundary(man, morph_mask);
+
+        af::Window wnd("Image Editing Operations");
+        printf("Press ESC while the window is in focus to exit\n");
+        while (!wnd.close()) {
+            wnd.grid(2, 5);
+            wnd(0, 0).image(man / 255, "Input");
+            wnd(1, 0).image(highcon / 255, "High Contrast");
+            wnd(0, 1).image(highbright / 255, "High Brightness");
+            wnd(1, 1).image(translated / 255, "Translation");
+            wnd(0, 2).image(sharp / 255, "Unsharp Masking");
+            wnd(1, 2).image(zoom / 255, "Digital Zoom");
+            wnd(0, 3).image(nature / 255, "Background for blend");
+            wnd(1, 3).image(fight / 255, "Foreground for blend");
+            wnd(0, 4).image(blend / 255, "Alpha blend");
+            wnd(1, 4).image(bdry / 255, "Boundary extraction");
+            wnd.show();
+        }
+    } catch (af::exception &e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+    return 0;
+}
diff --git a/examples/image_processing/morphing.cpp b/examples/image_processing/morphing.cpp
index 8a4413f9e8..ad66b7ea2a 100644
--- a/examples/image_processing/morphing.cpp
+++ b/examples/image_processing/morphing.cpp
@@ -7,80 +7,71 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <af/util.h>
 #include <cstdlib>
 
 using namespace af;
 
-array morphopen(const array& img, const array& mask)
-{
+array morphopen(const array& img, const array& mask) {
     return dilate(erode(img, mask), mask);
 }
 
-array morphclose(const array& img, const array& mask)
-{
+array morphclose(const array& img, const array& mask) {
     return erode(dilate(img, mask), mask);
 }
 
-array morphgrad(const array& img, const array& mask)
-{
+array morphgrad(const array& img, const array& mask) {
     return (dilate(img, mask) - erode(img, mask));
 }
 
-array tophat(const array& img, const array& mask)
-{
+array tophat(const array& img, const array& mask) {
     return (img - morphopen(img, mask));
 }
 
-array bottomhat(const array& img, const array& mask)
-{
+array bottomhat(const array& img, const array& mask) {
     return (morphclose(img, mask) - img);
 }
 
-array border(const array& img, const int left, const int right,
-        const int top, const int bottom,
-        const float value = 0.0)
-{
-    if((int)img.dims(0) < (top + bottom))
-        std::cerr << "input does not have enough rows" << std::endl;
-    if((int)img.dims(1) < (left + right))
-        std::cerr << "input does not have enough columns" << std::endl;
+array border(const array& img, const int left, const int right, const int top,
+             const int bottom, const float value = 0.0) {
+    if ((int)img.dims(0) < (top + bottom))
+        fprintf(stderr, "input does not have enough rows\n");
+    if ((int)img.dims(1) < (left + right))
+        fprintf(stderr, "input does not have enough columns\n");
 
     dim4 imgDims = img.dims();
-    array ret = constant(value, imgDims);
-    ret(seq(top, imgDims[0]-bottom), seq(left, imgDims[1]-right), span, span) =
-        img(seq(top, imgDims[0]-bottom), seq(left, imgDims[1]-right), span, span);
+    array ret    = constant(value, imgDims);
+    ret(seq(top, imgDims[0] - bottom), seq(left, imgDims[1] - right), span,
+        span)    = img(seq(top, imgDims[0] - bottom),
+                       seq(left, imgDims[1] - right), span, span);
 
     return ret;
 }
 
 array border(const array& img, const int w, const int h,
-        const float value = 0.0)
-{
+             const float value = 0.0) {
     return border(img, w, w, h, h, value);
 }
 
-array border(const array& img, const int size, const float value = 0.0)
-{
+array border(const array& img, const int size, const float value = 0.0) {
     return border(img, size, size, size, size, value);
 }
 
-array blur(const array& img, const array mask = gaussianKernel(3,3))
-{
+array blur(const array& img, const array mask = gaussianKernel(3, 3)) {
     array blurred = array(img.dims(), img.type());
-    for(int i = 0; i < (int)blurred.dims(2); i++)
+    for (int i = 0; i < (int)blurred.dims(2); i++)
         blurred(span, span, i) = convolve(img(span, span, i), mask);
     return blurred;
 }
 
 // Demonstrates various image morphing manipulations.
-static void morphing_demo(bool console)
-{
+static void morphing_demo() {
     af::Window wnd(1280, 720, "Morphological Operations");
     // load images
-    array img_rgb = loadImage(ASSETS_DIR "/examples/images/lena.ppm", true) / 255.f; // 3 channel RGB       [0-1]
+    array img_rgb = loadImage(ASSETS_DIR "/examples/images/man.jpg", true) /
+                    255.f;  // 3 channel RGB       [0-1]
 
     array mask = constant(1, 5, 5);
 
@@ -91,42 +82,40 @@ static void morphing_demo(bool console)
     array gr = morphgrad(img_rgb, mask);
     array th = tophat(img_rgb, mask);
     array bh = bottomhat(img_rgb, mask);
-    array bl = blur(img_rgb, gaussianKernel(5,5));
+    array bl = blur(img_rgb, gaussianKernel(5, 5));
     array bp = border(img_rgb, 20, 30, 40, 50, 0.5);
     array bo = border(img_rgb, 20);
 
     while (!wnd.close()) {
         wnd.grid(3, 4);
 
-        wnd(0, 0).image(img_rgb, "Input"          );
-        wnd(1, 0).image(er     , "Erosion"        );
-        wnd(2, 0).image(di     , "Dilation"       );
+        wnd(0, 0).image(img_rgb, "Input");
+        wnd(1, 0).image(er, "Erosion");
+        wnd(2, 0).image(di, "Dilation");
 
-        wnd(0, 1).image(op     , "Opening"        );
-        wnd(1, 1).image(cl     , "Closing"        );
-        wnd(2, 1).image(gr     , "Gradient"       );
+        wnd(0, 1).image(op, "Opening");
+        wnd(1, 1).image(cl, "Closing");
+        wnd(2, 1).image(gr, "Gradient");
 
-        wnd(0, 2).image(th     , "TopHat"         );
-        wnd(1, 2).image(bh     , "BottomHat"      );
-        wnd(2, 2).image(bl     , "Blur"           );
+        wnd(0, 2).image(th, "TopHat");
+        wnd(1, 2).image(bh, "BottomHat");
+        wnd(2, 2).image(bl, "Blur");
 
-        wnd(0, 3).image(bp     , "Border to Gray" );
-        wnd(1, 3).image(bo     , "Border to black");
+        wnd(0, 3).image(bp, "Border to Gray");
+        wnd(1, 3).image(bo, "Border to black");
 
         wnd.show();
     }
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char** argv) {
     int device = argc > 1 ? atoi(argv[1]) : 0;
-    bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::info();
         af::setDevice(device);
         printf("** ArrayFire Image Morphing Demo **\n\n");
-        morphing_demo(console);
+        morphing_demo();
 
     } catch (af::exception& e) {
         fprintf(stderr, "%s\n", e.what());
diff --git a/examples/image_processing/optical_flow.cpp b/examples/image_processing/optical_flow.cpp
index 5c4b60b33e..6d859068a7 100644
--- a/examples/image_processing/optical_flow.cpp
+++ b/examples/image_processing/optical_flow.cpp
@@ -7,52 +7,49 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <string.h>
-#include <stdio.h>
+#include <arrayfire.h>
 #include <math.h>
+#include <stdio.h>
+#include <string.h>
 #include <algorithm>
-#include <arrayfire.h>
 
 using namespace af;
 
-static void diffs(array& Ix, array& Iy, array& It, array I1, array I2)
-{
-        //  3x3 derivative kernels
-    float dx_kernel[] = { -1.0f / 6.0f, -1.0f / 6.0f, -1.0f / 6.0f,
-        0.0f / 6.0f,  0.0f / 6.0f,  0.0f / 6.0f,
-        1.0f / 6.0f,  1.0f / 6.0f,  1.0f / 6.0f
-    };
-    float dy_kernel[] = { -1.0f / 6.0f,  0.0f / 6.0f,  1.0f / 6.0f,
-        -1.0f / 6.0f,  0.0f / 6.0f,  1.0f / 6.0f,
-        -1.0f / 6.0f,  0.0f / 6.0f,  1.0f / 6.0f
-    };
-    array dx = array(dim4(3, 3), dx_kernel);
-    array dy = array(dim4(3, 3), dy_kernel);
-    array dt = constant(1,1, 2) / 4.0;
+static void diffs(array& Ix, array& Iy, array& It, array I1, array I2) {
+    //  3x3 derivative kernels
+    float dx_kernel[] = {-1.0f / 6.0f, -1.0f / 6.0f, -1.0f / 6.0f,
+                         0.0f / 6.0f,  0.0f / 6.0f,  0.0f / 6.0f,
+                         1.0f / 6.0f,  1.0f / 6.0f,  1.0f / 6.0f};
+    float dy_kernel[] = {-1.0f / 6.0f, 0.0f / 6.0f, 1.0f / 6.0f,
+                         -1.0f / 6.0f, 0.0f / 6.0f, 1.0f / 6.0f,
+                         -1.0f / 6.0f, 0.0f / 6.0f, 1.0f / 6.0f};
+    array dx          = array(dim4(3, 3), dx_kernel);
+    array dy          = array(dim4(3, 3), dy_kernel);
+    array dt          = constant(1, 1, 2) / 4.0;
 
     Ix = convolve(I1, dx) + convolve(I2, dx);
     Iy = convolve(I1, dy) + convolve(I2, dy);
     It = convolve(I2, dt) - convolve(I1, dt);
 }
 
-static void optical_flow_demo(bool console)
-{
+static void optical_flow_demo(bool console) {
     af::Window wnd("Horn-Schunck Optical Flow Demo");
     wnd.setColorMap(AF_COLORMAP_COLORS);
 
-    double time_total = 10; // run for N seconds
+    double time_total = 10;  // run for N seconds
 
     const float h_mean_kernel[] = {1.0f / 12.0f, 2.0f / 12.0f, 1.0f / 12.0f,
-        2.0f / 12.0f,        0.0f,  2.0f / 12.0f,
-        1.0f / 12.0f, 2.0f / 12.0f, 1.1f / 12.0f
-    };
-    array mean_kernel = array(dim4(3, 3), h_mean_kernel, afHost);
+                                   2.0f / 12.0f, 0.0f,         2.0f / 12.0f,
+                                   1.0f / 12.0f, 2.0f / 12.0f, 1.0f / 12.0f};
+    array mean_kernel           = array(dim4(3, 3), h_mean_kernel, afHost);
 
-    array I1 = loadImage(ASSETS_DIR "/examples/images/circle_left.ppm"); // grayscale
+    array I1 =
+        loadImage(ASSETS_DIR "/examples/images/circle_left.ppm");  // grayscale
     array I2 = loadImage(ASSETS_DIR "/examples/images/circle_center.ppm");
 
-    array u = constant(0,I1.dims()), v = constant(0,I1.dims());
-    array Ix, Iy, It; diffs(Ix, Iy, It, I1, I2);
+    array u = constant(0, I1.dims()), v = constant(0, I1.dims());
+    array Ix, Iy, It;
+    diffs(Ix, Iy, It, I1, I2);
 
     timer time_start, time_last;
     time_start = time_last = timer::start();
@@ -65,15 +62,15 @@ static void optical_flow_demo(bool console)
         array v_ = convolve(v, mean_kernel);
 
         const float alphasq = 0.1f;
-        array num = Ix * u_ + Iy * v_ + It;
-        array den = alphasq + Ix * Ix + Iy * Iy;
+        array num           = Ix * u_ + Iy * v_ + It;
+        array den           = alphasq + Ix * Ix + Iy * Iy;
 
         array tmp = 0.01 * num;
-        u = u_ - (Ix * tmp) / den;
-        v = v_ - (Iy * tmp) / den;
+        u         = u_ - (Ix * tmp) / den;
+        v         = v_ - (Iy * tmp) / den;
 
         if (!console) {
-            wnd.grid(2,2);
+            wnd.grid(2, 2);
 
             wnd(0, 0).image(I1, "I1");
             wnd(1, 0).image(I2, "I2");
@@ -85,29 +82,25 @@ static void optical_flow_demo(bool console)
 
         double elapsed = timer::stop(time_last);
         if (elapsed > 1) {
-            double rate = (iter - iter_last) / elapsed;
+            double rate          = (iter - iter_last) / elapsed;
             double total_elapsed = timer::stop(time_start);
-            time_last = timer::start();
-            iter_last = iter;
-            max_rate = std::max(max_rate, rate);
-            if (total_elapsed >= time_total) {
-                break;
-            }
+            time_last            = timer::start();
+            iter_last            = iter;
+            max_rate             = std::max(max_rate, rate);
+            if (total_elapsed >= time_total) { break; }
             if (!console)
                 printf("  iterations per second: %.0f   (progress %.0f%%)\n",
-                        rate, 100.0f * total_elapsed / time_total);
+                       rate, 100.0f * total_elapsed / time_total);
         }
     }
 
     if (console) {
         printf(" ### optical_flow %f iterations per second (max)\n", max_rate);
     }
-
 }
 
-int main(int argc, char* argv[])
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
+int main(int argc, char* argv[]) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
@@ -121,12 +114,5 @@ int main(int argc, char* argv[])
         throw;
     }
 
-#ifdef WIN32 // pause in Windows
-    if (!console) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-#endif
     return 0;
 }
diff --git a/examples/image_processing/pyramids.cpp b/examples/image_processing/pyramids.cpp
index 0f96732ada..b09e895d6a 100644
--- a/examples/image_processing/pyramids.cpp
+++ b/examples/image_processing/pyramids.cpp
@@ -7,36 +7,34 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <cstdlib>
 
 using namespace af;
 
-static const float pyramid_kernel[] = {
-    1,  4,  6,  4, 1,
-    4, 16, 24, 16, 4,
-    6, 24, 36, 24, 6,
-    4, 16, 24, 16, 4,
-    1,  4,  6,  4, 1
-};
+static const float pyramid_kernel[] = {1,  4, 6,  4,  1,  4, 16, 24, 16,
+                                       4,  6, 24, 36, 24, 6, 4,  16, 24,
+                                       16, 4, 1,  4,  6,  4, 1};
 
-array pyramid(const array& img, const int level, const bool sampling)
-{
+array pyramid(const array& img, const int level, const bool sampling) {
     array pyr = img.copy();
     array kernel(5, 5, pyramid_kernel);
     kernel = kernel / 256.f;
-    if(sampling) {                              //Downsample
-        for(int i = 0; i < level; i++) {
-            for(int j = 0; j < pyr.dims(2); j++)
+    if (sampling) {  // Downsample
+        for (int i = 0; i < level; i++) {
+            for (int j = 0; j < pyr.dims(2); j++)
                 pyr(span, span, j) = convolve(pyr(span, span, j), kernel);
-            pyr = pyr(seq(0, pyr.dims(0)-1, 2), seq(0, pyr.dims(1)-1, 2), span);
+            pyr = pyr(seq(0, pyr.dims(0) - 1, 2), seq(0, pyr.dims(1) - 1, 2),
+                      span);
         }
-    } else {                                    // Up sample
-        for(int i = 0; i < level; i++) {
-            array tmp = constant(0, pyr.dims(0) * 2, pyr.dims(1) * 2, pyr.dims(2));
-            tmp(seq(0, 2*pyr.dims(0)-1, 2), seq(0, 2*pyr.dims(1)-1, 2), span) = pyr;
-            for(int j = 0; j < pyr.dims(2); j++)
+    } else {  // Upsample
+        for (int i = 0; i < level; i++) {
+            array tmp =
+                constant(0, pyr.dims(0) * 2, pyr.dims(1) * 2, pyr.dims(2));
+            tmp(seq(0, 2 * pyr.dims(0) - 1, 2), seq(0, 2 * pyr.dims(1) - 1, 2),
+                span) = pyr;
+            for (int j = 0; j < pyr.dims(2); j++)
                 tmp(span, span, j) = convolve(tmp(span, span, j), kernel * 4.f);
             pyr = tmp;
         }
@@ -44,20 +42,21 @@ array pyramid(const array& img, const int level, const bool sampling)
     return pyr;
 }
 
-void pyramids_demo(bool console)
-{
+void pyramids_demo() {
     af::Window wnd_rgb("Image Pyramids - RGB Images");
     af::Window wnd_gray("Image Pyramids - Grayscale Images");
     wnd_rgb.setPos(25, 25);
     wnd_gray.setPos(150, 150);
 
-    array img_rgb = loadImage(ASSETS_DIR "/examples/images/atlantis.png", true) / 255.f; // 3 channel RGB       [0-1]
+    array img_rgb =
+        loadImage(ASSETS_DIR "/examples/images/atlantis.png", true) /
+        255.f;  // 3-channel RGB, values in [0, 1]
     array img_gray = colorSpace(img_rgb, AF_GRAY, AF_RGB);
 
-    array downc1 = pyramid(img_rgb,  1, true);
-    array downc2 = pyramid(img_rgb,  2, true);
-    array upc1   = pyramid(img_rgb,  1, false);
-    array upc2   = pyramid(img_rgb,  2, false);
+    array downc1 = pyramid(img_rgb, 1, true);
+    array downc2 = pyramid(img_rgb, 2, true);
+    array upc1   = pyramid(img_rgb, 1, false);
+    array upc2   = pyramid(img_rgb, 2, false);
 
     array downg1 = pyramid(img_gray, 1, true);
     array downg2 = pyramid(img_gray, 2, true);
@@ -65,7 +64,6 @@ void pyramids_demo(bool console)
     array upg2   = pyramid(img_gray, 2, false);
 
     while (!wnd_rgb.close() && !wnd_gray.close()) {
-
         wnd_rgb.grid(2, 3);
         wnd_rgb(0, 0).image(img_rgb, "color image");
         wnd_rgb(1, 0).image(downc1, "downsample 1 level");
@@ -74,7 +72,6 @@ void pyramids_demo(bool console)
         wnd_rgb(0, 2).image(upc2, "upsample 2 level");
         wnd_rgb.show();
 
-
         wnd_gray.grid(2, 3);
         wnd_gray(0, 0).image(img_gray, "grayscale image");
         wnd_gray(1, 0).image(downg1, "downsample 1 level");
@@ -85,16 +82,14 @@ void pyramids_demo(bool console)
     }
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char** argv) {
     int device = argc > 1 ? atoi(argv[1]) : 0;
-    bool console = argc > 2 ? argv[2][0] == '-' : false;
 
     try {
         af::setDevice(device);
         af::info();
         printf("** ArrayFire Image Pyramids Demo **\n\n");
-        pyramids_demo(console);
+        pyramids_demo();
 
     } catch (af::exception& e) {
         fprintf(stderr, "%s\n", e.what());
diff --git a/examples/lin_algebra/CMakeLists.txt b/examples/lin_algebra/CMakeLists.txt
new file mode 100644
index 0000000000..89b9c89600
--- /dev/null
+++ b/examples/lin_algebra/CMakeLists.txt
@@ -0,0 +1,73 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Linear-Algebra
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_CPU_FOUND)
+  # Cholesky example
+  add_executable(cholesky_cpu cholesky.cpp)
+  target_link_libraries(cholesky_cpu ArrayFire::afcpu)
+
+  # LU example
+  add_executable(lu_cpu lu.cpp)
+  target_link_libraries(lu_cpu ArrayFire::afcpu)
+
+  # QR example
+  add_executable(qr_cpu qr.cpp)
+  target_link_libraries(qr_cpu ArrayFire::afcpu)
+
+  # SVD example
+  add_executable(svd_cpu svd.cpp)
+  target_link_libraries(svd_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(cholesky_cuda cholesky.cpp)
+  target_link_libraries(cholesky_cuda ArrayFire::afcuda)
+
+  add_executable(lu_cuda lu.cpp)
+  target_link_libraries(lu_cuda ArrayFire::afcuda)
+
+  add_executable(qr_cuda qr.cpp)
+  target_link_libraries(qr_cuda ArrayFire::afcuda)
+
+  add_executable(svd_cuda svd.cpp)
+  target_link_libraries(svd_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(cholesky_opencl cholesky.cpp)
+  target_link_libraries(cholesky_opencl ArrayFire::afopencl)
+
+  add_executable(lu_opencl lu.cpp)
+  target_link_libraries(lu_opencl ArrayFire::afopencl)
+
+  add_executable(qr_opencl qr.cpp)
+  target_link_libraries(qr_opencl ArrayFire::afopencl)
+
+  add_executable(svd_opencl svd.cpp)
+  target_link_libraries(svd_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(cholesky_oneapi cholesky.cpp)
+  target_link_libraries(cholesky_oneapi ArrayFire::afoneapi)
+
+  add_executable(lu_oneapi lu.cpp)
+  target_link_libraries(lu_oneapi ArrayFire::afoneapi)
+
+  add_executable(qr_oneapi qr.cpp)
+  target_link_libraries(qr_oneapi ArrayFire::afoneapi)
+
+  add_executable(svd_oneapi svd.cpp)
+  target_link_libraries(svd_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/lin_algebra/cholesky.cpp b/examples/lin_algebra/cholesky.cpp
index 999597c913..56524314ce 100644
--- a/examples/lin_algebra/cholesky.cpp
+++ b/examples/lin_algebra/cholesky.cpp
@@ -13,23 +13,27 @@
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int argc, char* argv[]) {
     try {
-
         // Select a device and display arrayfire info
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
         af::info();
 
-        int n = 5;
-        array t = randu(n, n);
+        int n    = 5;
+        array t  = randu(n, n);
         array in = matmulNT(t, t) + identity(n, n) * n;
         af_print(in);
 
         printf("Running Cholesky InPlace\n");
-        array cin = in.copy();
-        af_print(cin);
+        array cin_upper = in.copy();
+        array cin_lower = in.copy();
+
+        choleskyInPlace(cin_upper, true);
+        choleskyInPlace(cin_lower, false);
+
+        af_print(cin_upper);
+        af_print(cin_lower);
 
         printf("Running Cholesky Out of place\n");
         array out_upper;
@@ -46,12 +50,5 @@ int main(int argc, char *argv[])
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/lin_algebra/lu.cpp b/examples/lin_algebra/lu.cpp
index 06cbeb1e9e..a162ce1962 100644
--- a/examples/lin_algebra/lu.cpp
+++ b/examples/lin_algebra/lu.cpp
@@ -13,8 +13,7 @@
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int argc, char* argv[]) {
     try {
         // Select a device and display arrayfire info
         int device = argc > 1 ? atoi(argv[1]) : 0;
@@ -44,12 +43,5 @@ int main(int argc, char *argv[])
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/lin_algebra/qr.cpp b/examples/lin_algebra/qr.cpp
index 6f0fe8eac3..334e53b876 100644
--- a/examples/lin_algebra/qr.cpp
+++ b/examples/lin_algebra/qr.cpp
@@ -13,10 +13,8 @@
 
 using namespace af;
 
-int main(int argc, char *argv[])
-{
+int main(int argc, char* argv[]) {
     try {
-
         // Select a device and display arrayfire info
         int device = argc > 1 ? atoi(argv[1]) : 0;
         af::setDevice(device);
@@ -47,12 +45,5 @@ int main(int argc, char *argv[])
         throw;
     }
 
-    #ifdef WIN32 // pause in Windows
-    if (!(argc == 2 && argv[1][0] == '-')) {
-        printf("hit [enter]...");
-        fflush(stdout);
-        getchar();
-    }
-    #endif
     return 0;
 }
diff --git a/examples/lin_algebra/svd.cpp b/examples/lin_algebra/svd.cpp
new file mode 100644
index 0000000000..e731532d1d
--- /dev/null
+++ b/examples/lin_algebra/svd.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+int main(int argc, char* argv[]) {
+    try {
+        // Select a device and display arrayfire info
+        int device = argc > 1 ? atoi(argv[1]) : 0;
+        af::setDevice(device);
+        af::info();
+
+        float h_buffer[] = {1, 4, 2, 5, 3, 6};  // host array
+        array in(2, 3, h_buffer);               // copy host data to device
+
+        array u;
+        array s_vec;
+        array vt;
+        svd(u, s_vec, vt, in);
+
+        array s_mat    = diag(s_vec, 0, false);
+        array in_recon = matmul(u, s_mat, vt(seq(2), span));
+
+        af_print(in);
+        af_print(s_vec);
+        af_print(u);
+        af_print(s_mat);
+        af_print(vt);
+        af_print(in_recon);
+
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/machine_learning/CMakeLists.txt b/examples/machine_learning/CMakeLists.txt
new file mode 100644
index 0000000000..480f3f7f12
--- /dev/null
+++ b/examples/machine_learning/CMakeLists.txt
@@ -0,0 +1,153 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Machine-Learning
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+add_definitions("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+
+if(ArrayFire_CPU_FOUND)
+  # Bagging example
+  add_executable(bagging_cpu bagging.cpp)
+  target_link_libraries(bagging_cpu ArrayFire::afcpu)
+
+  # Deep Belief Network example
+  add_executable(deep_belief_net_cpu deep_belief_net.cpp)
+  target_link_libraries(deep_belief_net_cpu ArrayFire::afcpu)
+
+  # Genetic Algorithm example
+  add_executable(geneticalgorithm_cpu geneticalgorithm.cpp)
+  target_link_libraries(geneticalgorithm_cpu ArrayFire::afcpu)
+
+  # k-means example
+  add_executable(kmeans_cpu kmeans.cpp)
+  target_link_libraries(kmeans_cpu ArrayFire::afcpu)
+
+  # Logistic Regression example
+  add_executable(logistic_regression_cpu logistic_regression.cpp)
+  target_link_libraries(logistic_regression_cpu ArrayFire::afcpu)
+
+  # Naive Bayes example
+  add_executable(naive_bayes_cpu naive_bayes.cpp)
+  target_link_libraries(naive_bayes_cpu ArrayFire::afcpu)
+
+  # Neural Network example
+  add_executable(neural_network_cpu neural_network.cpp)
+  target_link_libraries(neural_network_cpu ArrayFire::afcpu)
+
+  # Perceptron example
+  add_executable(perceptron_cpu perceptron.cpp)
+  target_link_libraries(perceptron_cpu ArrayFire::afcpu)
+
+  # Restricted Boltzmann Machine example
+  add_executable(rbm_cpu rbm.cpp)
+  target_link_libraries(rbm_cpu ArrayFire::afcpu)
+
+  # Softmax Regression example
+  add_executable(softmax_regression_cpu softmax_regression.cpp)
+  target_link_libraries(softmax_regression_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(bagging_cuda bagging.cpp)
+  target_link_libraries(bagging_cuda ArrayFire::afcuda)
+
+  add_executable(deep_belief_net_cuda deep_belief_net.cpp)
+  target_link_libraries(deep_belief_net_cuda ArrayFire::afcuda)
+
+  add_executable(geneticalgorithm_cuda geneticalgorithm.cpp)
+  target_link_libraries(geneticalgorithm_cuda ArrayFire::afcuda)
+
+  add_executable(kmeans_cuda kmeans.cpp)
+  target_link_libraries(kmeans_cuda ArrayFire::afcuda)
+
+  add_executable(logistic_regression_cuda logistic_regression.cpp)
+  target_link_libraries(logistic_regression_cuda ArrayFire::afcuda)
+
+  add_executable(naive_bayes_cuda naive_bayes.cpp)
+  target_link_libraries(naive_bayes_cuda ArrayFire::afcuda)
+
+  add_executable(neural_network_cuda neural_network.cpp)
+  target_link_libraries(neural_network_cuda ArrayFire::afcuda)
+
+  add_executable(perceptron_cuda perceptron.cpp)
+  target_link_libraries(perceptron_cuda ArrayFire::afcuda)
+
+  add_executable(rbm_cuda rbm.cpp)
+  target_link_libraries(rbm_cuda ArrayFire::afcuda)
+
+  add_executable(softmax_regression_cuda softmax_regression.cpp)
+  target_link_libraries(softmax_regression_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(bagging_opencl bagging.cpp)
+  target_link_libraries(bagging_opencl ArrayFire::afopencl)
+
+  add_executable(deep_belief_net_opencl deep_belief_net.cpp)
+  target_link_libraries(deep_belief_net_opencl ArrayFire::afopencl)
+
+  add_executable(geneticalgorithm_opencl geneticalgorithm.cpp)
+  target_link_libraries(geneticalgorithm_opencl ArrayFire::afopencl)
+
+  add_executable(kmeans_opencl kmeans.cpp)
+  target_link_libraries(kmeans_opencl ArrayFire::afopencl)
+
+  add_executable(logistic_regression_opencl logistic_regression.cpp)
+  target_link_libraries(logistic_regression_opencl ArrayFire::afopencl)
+
+  add_executable(naive_bayes_opencl naive_bayes.cpp)
+  target_link_libraries(naive_bayes_opencl ArrayFire::afopencl)
+
+  add_executable(neural_network_opencl neural_network.cpp)
+  target_link_libraries(neural_network_opencl ArrayFire::afopencl)
+
+  add_executable(perceptron_opencl perceptron.cpp)
+  target_link_libraries(perceptron_opencl ArrayFire::afopencl)
+
+  add_executable(rbm_opencl rbm.cpp)
+  target_link_libraries(rbm_opencl ArrayFire::afopencl)
+
+  add_executable(softmax_regression_opencl softmax_regression.cpp)
+  target_link_libraries(softmax_regression_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(bagging_oneapi bagging.cpp)
+  target_link_libraries(bagging_oneapi ArrayFire::afoneapi)
+
+  add_executable(deep_belief_net_oneapi deep_belief_net.cpp)
+  target_link_libraries(deep_belief_net_oneapi ArrayFire::afoneapi)
+
+  add_executable(geneticalgorithm_oneapi geneticalgorithm.cpp)
+  target_link_libraries(geneticalgorithm_oneapi ArrayFire::afoneapi)
+
+  add_executable(kmeans_oneapi kmeans.cpp)
+  target_link_libraries(kmeans_oneapi ArrayFire::afoneapi)
+
+  add_executable(logistic_regression_oneapi logistic_regression.cpp)
+  target_link_libraries(logistic_regression_oneapi ArrayFire::afoneapi)
+
+  add_executable(naive_bayes_oneapi naive_bayes.cpp)
+  target_link_libraries(naive_bayes_oneapi ArrayFire::afoneapi)
+
+  add_executable(neural_network_oneapi neural_network.cpp)
+  target_link_libraries(neural_network_oneapi ArrayFire::afoneapi)
+
+  add_executable(perceptron_oneapi perceptron.cpp)
+  target_link_libraries(perceptron_oneapi ArrayFire::afoneapi)
+
+  add_executable(rbm_oneapi rbm.cpp)
+  target_link_libraries(rbm_oneapi ArrayFire::afoneapi)
+
+  add_executable(softmax_regression_oneapi softmax_regression.cpp)
+  target_link_libraries(softmax_regression_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/machine_learning/bagging.cpp b/examples/machine_learning/bagging.cpp
index 862ffa25e3..2a52cc554e 100644
--- a/examples/machine_learning/bagging.cpp
+++ b/examples/machine_learning/bagging.cpp
@@ -8,50 +8,46 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
 // Get accuracy of the predicted results
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     return 100 * count<float>(predicted == target) / target.elements();
 }
 
 // Calculate all the distances from testing set to training set
-array distance(array train, array test)
-{
-    const int feat_len = train.dims(1);
+array distance(array train, array test) {
+    const int feat_len  = train.dims(1);
     const int num_train = train.dims(0);
-    const int num_test  =  test.dims(0);
-    array dist = constant(0, num_train, num_test);
+    const int num_test  = test.dims(0);
+    array dist          = constant(0, num_train, num_test);
 
     // Iterate over each attribute
     for (int ii = 0; ii < feat_len; ii++) {
-
         // Get a attribute vectors
         array train_i = train(span, ii);
-        array test_i  = test (span, ii).T();
+        array test_i  = test(span, ii).T();
 
         // Tile the vectors to generate matrices
-        array train_tiled = tile(train_i, 1,   num_test);
-        array test_tiled  = tile( test_i, num_train, 1 );
+        array train_tiled = tile(train_i, 1, num_test);
+        array test_tiled  = tile(test_i, num_train, 1);
 
         // Add the distance for this attribute
         dist = dist + abs(train_tiled - test_tiled);
-        dist.eval(); // Necessary to free up train_i, test_i
+        dist.eval();  // Necessary to free up train_i, test_i
     }
 
     return dist;
 }
 
-array knn(array &train_feats, array &test_feats, array &train_labels)
-{
+array knn(array &train_feats, array &test_feats, array &train_labels) {
     // Find distances between training and testing sets
     array dist = distance(train_feats, test_feats);
 
@@ -63,16 +59,14 @@ array knn(array &train_feats, array &test_feats, array &train_labels)
     return train_labels(idx);
 }
 
-
 array bagging(array &train_feats, array &test_feats, array &train_labels,
-              int num_classes, int num_models, int sample_size)
-{
+              int num_classes, int num_models, int sample_size) {
     int num_train = train_feats.dims(0);
-    int num_test  =  test_feats.dims(0);
+    int num_test  = test_feats.dims(0);
 
-    array idx = floor(randu(sample_size, num_models) * num_train);
+    array idx        = floor(randu(sample_size, num_models) * num_train);
     array labels_all = constant(0, num_test, num_classes);
-    array off = seq(num_test);
+    array off        = seq(num_test);
 
     for (int i = 0; i < num_models; i++) {
         array ii = idx(span, i);
@@ -82,7 +76,7 @@ array bagging(array &train_feats, array &test_feats, array &train_labels,
 
         // Get the predicted results
         array labels_ii = knn(train_feats_ii, test_feats, train_labels_ii);
-        array lidx = labels_ii * num_test + off;
+        array lidx      = labels_ii * num_test + off;
 
         labels_all(lidx) = labels_all(lidx) + 1;
     }
@@ -93,24 +87,21 @@ array bagging(array &train_feats, array &test_feats, array &train_labels,
     return labels;
 }
 
-
-void bagging_demo(bool console, int perc)
-{
+void bagging_demo(bool console, int perc) {
     array train_images, train_labels;
     array test_images, test_labels;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<false>(&num_classes, &num_train, &num_test,
-                       train_images, test_images,
-                       train_labels, test_labels, frac);
+    setup_mnist<false>(&num_classes, &num_train, &num_test, train_images,
+                       test_images, train_labels, test_labels, frac);
 
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train).T();
-    array test_feats  = moddims(test_images , feature_length, num_test ).T();
+    array train_feats  = moddims(train_images, feature_length, num_train).T();
+    array test_feats   = moddims(test_images, feature_length, num_test).T();
 
-    int num_models = 10;
+    int num_models  = 10;
     int sample_size = 1000;
 
     timer::start();
@@ -121,30 +112,26 @@ void bagging_demo(bool console, int perc)
 
     // Results
     printf("Accuracy on testing  data: %2.2f\n",
-           accuracy(res_labels , test_labels));
+           accuracy(res_labels, test_labels));
 
     printf("Prediction time: %4.4f\n", test_time);
 
     if (false && !console) {
         display_results<false>(test_images, res_labels, test_labels.T(), 20);
     }
-
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         setDevice(device);
         af::info();
         bagging_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/deep_belief_net.cpp b/examples/machine_learning/deep_belief_net.cpp
index 1f1706a57c..a9f3296c1c 100644
--- a/examples/machine_learning/deep_belief_net.cpp
+++ b/examples/machine_learning/deep_belief_net.cpp
@@ -8,83 +8,64 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 using std::vector;
 
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-// Activation function
-array sigmoid(const array &val)
-{
-    return 1 / (1 + exp(-val));
-}
-
 // Derivative of the activation function
-array deriv(const array &out)
-{
-    return out * (1 - out);
-}
+array deriv(const array &out) { return out * (1 - out); }
 
 // Cost function
-double error(const array &out,
-             const array &pred)
-{
+double error(const array &out, const array &pred) {
     array dif = (out - pred);
     return sqrt((double)(sum<float>(dif * dif)));
 }
 
-array sigmoid_binary(const array in)
-{
+array sigmoid_binary(const array in) {
     // Choosing "1" with probability sigmoid(in)
     return (sigmoid(in) > randu(in.dims())).as(f32);
 }
 
 class rbm {
-
-private:
+   private:
     array weights;
     array h_bias;
     array v_bias;
 
-public:
-    rbm(int v_size, int h_size) :
-        weights(randu(h_size, v_size)/100.f),
-        h_bias(constant(0, 1, h_size)),
-        v_bias(constant(0, 1, v_size))
-    {
-    }
+   public:
+    rbm(int v_size, int h_size)
+        : weights(randu(h_size, v_size) / 100.f)
+        , h_bias(constant(0, 1, h_size))
+        , v_bias(constant(0, 1, v_size)) {}
 
-    array get_weights()
-    {
+    array get_weights() {
         return transpose(join(1, weights, transpose(h_bias)));
     }
 
-    void train(const array &in, double lr, int num_epochs, int batch_size, bool verbose)
-    {
+    void train(const array &in, double lr, int num_epochs, int batch_size,
+               bool verbose) {
         const int num_samples = in.dims(0);
         const int num_batches = num_samples / batch_size;
 
-        for (int i = 0; i <  num_epochs; i++) {
-
+        for (int i = 0; i < num_epochs; i++) {
             double err = 0;
 
             for (int j = 0; j < num_batches - 1; j++) {
-
-                int st = j * batch_size;
-                int en = std::min(num_samples - 1, st + batch_size);
+                int st  = j * batch_size;
+                int en  = std::min(num_samples - 1, st + batch_size - 1);
                 int num = en - st + 1;
 
                 array v_pos = in(seq(st, en), span);
@@ -92,17 +73,16 @@ class rbm {
                 array h_pos = sigmoid_binary(tile(h_bias, num) +
                                              matmulNT(v_pos, weights));
 
-                array v_neg = sigmoid_binary(tile(v_bias, num) +
-                                             matmul(h_pos, weights));
+                array v_neg =
+                    sigmoid_binary(tile(v_bias, num) + matmul(h_pos, weights));
 
                 array h_neg = sigmoid_binary(tile(h_bias, num) +
                                              matmulNT(v_neg, weights));
 
-
                 array c_pos = matmulTN(h_pos, v_pos);
                 array c_neg = matmulTN(h_neg, v_neg);
 
-                array delta_w = lr * (c_pos - c_neg) / num;
+                array delta_w  = lr * (c_pos - c_neg) / num;
                 array delta_vb = lr * sum(v_pos - v_neg) / num;
                 array delta_hb = lr * sum(h_pos - h_neg) / num;
 
@@ -110,27 +90,23 @@ class rbm {
                 v_bias += delta_vb;
                 h_bias += delta_hb;
 
-                if (verbose) {
-                    err += error(v_pos, v_neg);
-                }
+                if (verbose) { err += error(v_pos, v_neg); }
             }
 
             if (verbose) {
-                printf("Epoch %d: Reconstruction error: %0.4f\n", i + 1, err / num_batches);
+                printf("Epoch %d: Reconstruction error: %0.4f\n", i + 1,
+                       err / num_batches);
             }
         }
     }
 
-    array prop_up(const array &in)
-    {
-        return sigmoid(tile(h_bias, in.dims(0)) +
-                       matmulNT(in, weights));
+    array prop_up(const array &in) {
+        return sigmoid(tile(h_bias, in.dims(0)) + matmulNT(in, weights));
     }
 };
 
 class dbn {
-
-private:
+   private:
     const int in_size;
     const int out_size;
     const int num_hidden;
@@ -138,39 +114,34 @@ class dbn {
     std::vector<array> weights;
     std::vector<int> hidden;
 
-    array add_bias(const array &in)
-    {
+    array add_bias(const array &in) {
         // Bias input is added on top of given input
         return join(1, constant(1, in.dims(0), 1), in);
     }
 
-    vector<array> forward_propagate(const array& input)
-    {
+    vector<array> forward_propagate(const array &input) {
         // Get activations at each layer
         vector<array> signal(num_total);
         signal[0] = input;
 
         for (int i = 0; i < num_total - 1; i++) {
-            array in = add_bias(signal[i]);
-            array out = matmul(in, weights[i]);
+            array in      = add_bias(signal[i]);
+            array out     = matmul(in, weights[i]);
             signal[i + 1] = sigmoid(out);
         }
 
         return signal;
     }
 
-    void back_propagate(const vector<array> signal,
-                        const array &target,
-                        const double &alpha)
-    {
-
+    void back_propagate(const vector<array> signal, const array &target,
+                        const double &alpha) {
         // Get error for output layer
-        array out = signal[num_total  - 1];
+        array out = signal[num_total - 1];
         array err = (out - target);
-        int m = target.dims(0);
+        int m     = target.dims(0);
 
         for (int i = num_total - 2; i >= 0; i--) {
-            array in = add_bias(signal[i]);
+            array in    = add_bias(signal[i]);
             array delta = (deriv(out) * err).T();
 
             // Adjust weights
@@ -186,78 +157,62 @@ class dbn {
         }
     }
 
-public:
-
-    dbn(const int in_sz, const int out_sz,
-        const std::vector<int> hidden_layers) :
-        in_size(in_sz),
-        out_size(out_sz),
-        num_hidden(hidden_layers.size()),
-        num_total(hidden_layers.size() + 2),
-        weights(hidden_layers.size() + 1),
-        hidden(hidden_layers)
-    {
-    }
-
-    void train(const array &input, const array &target,
-               double lr_rbm = 1.0,
-               double lr_nn  = 1.0,
-               const int epochs_rbm = 15,
-               const int epochs_nn = 300,
-               const int batch_size = 100,
-               double maxerr = 1.0, bool verbose=false)
-    {
-
+   public:
+    dbn(const int in_sz, const int out_sz, const std::vector<int> hidden_layers)
+        : in_size(in_sz)
+        , out_size(out_sz)
+        , num_hidden(hidden_layers.size())
+        , num_total(hidden_layers.size() + 2)
+        , weights(hidden_layers.size() + 1)
+        , hidden(hidden_layers) {}
+
+    void train(const array &input, const array &target, double lr_rbm = 1.0,
+               double lr_nn = 1.0, const int epochs_rbm = 15,
+               const int epochs_nn = 300, const int batch_size = 100,
+               double maxerr = 1.0, bool verbose = false) {
         // Pre-training hidden layers
         array X = input;
         for (int i = 0; i < num_hidden; i++) {
-
-            if (verbose) {
-                printf("Training Hidden Layer %d\n", i);
-            }
+            if (verbose) { printf("Training Hidden Layer %d\n", i); }
 
             int visible = (i == 0) ? in_size : hidden[i - 1];
 
             rbm r(visible, hidden[i]);
             r.train(X, lr_rbm, epochs_rbm, batch_size, verbose);
 
-            X = r.prop_up(X);
+            X          = r.prop_up(X);
             weights[i] = r.get_weights();
 
-            if (verbose) {
-                printf("\n");
-            }
+            if (verbose) { printf("\n"); }
         }
 
-        weights[num_hidden] = 0.05 * randu(hidden[num_hidden - 1] + 1, out_size) - 0.0025;
+        weights[num_hidden] =
+            0.05 * randu(hidden[num_hidden - 1] + 1, out_size) - 0.0025;
 
         const int num_samples = input.dims(0);
         const int num_batches = num_samples / batch_size;
 
         // Training the entire network
         for (int i = 0; i < epochs_nn; i++) {
-
             for (int j = 0; j < num_batches; j++) {
-
                 int st = j * batch_size;
-                int en = std::min(num_samples - 1, st + batch_size);
+                int en = std::min(num_samples - 1, st + batch_size - 1);
 
                 array x = input(seq(st, en), span);
                 array y = target(seq(st, en), span);
 
                 // Propagate the inputs forward
                 vector<array> signals = forward_propagate(x);
-                array out = signals[num_total - 1];
+                array out             = signals[num_total - 1];
 
                 // Propagate the error backward
                 back_propagate(signals, y, lr_nn);
             }
 
-
             // Validate with last batch
-            int st = (num_batches - 1) * batch_size;
-            int en = num_samples - 1;
-            array out = predict(input(seq(st, en), span));
+            int st     = (num_batches - 1) * batch_size;
+            int en     = num_samples - 1;
+            array out  = predict(input(seq(st, en), span));
             double err = error(out, target(seq(st, en), span));
 
             // Check if convergence criteria has been met
@@ -267,22 +222,20 @@ class dbn {
             }
 
             if (verbose) {
-                if ((i + 1) % 10 == 0) printf("Epoch: %4d, Error: %0.4f\n", i+1, err);
+                if ((i + 1) % 10 == 0)
+                    printf("Epoch: %4d, Error: %0.4f\n", i + 1, err);
             }
         }
     }
 
-    array predict(const array &input)
-    {
+    array predict(const array &input) {
         vector<array> signal = forward_propagate(input);
-        array out = signal[num_total - 1];
+        array out            = signal[num_total - 1];
         return out;
     }
-
 };
 
-int dbn_demo(bool console, int perc)
-{
+int dbn_demo(bool console, int perc) {
     printf("** ArrayFire DBN Demo **\n\n");
 
     array train_images, test_images;
@@ -291,14 +244,14 @@ int dbn_demo(bool console, int perc)
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images, train_target, test_target, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_target, test_target, frac);
 
     int feature_size = train_images.elements() / num_train;
 
     // Reshape images into feature vectors
     array train_feats = moddims(train_images, feature_size, num_train).T();
-    array test_feats  = moddims(test_images , feature_size, num_test ).T();
+    array test_feats  = moddims(test_images, feature_size, num_test).T();
 
     train_target = train_target.T();
     test_target  = test_target.T();
@@ -314,27 +267,24 @@ int dbn_demo(bool console, int perc)
     // Train network
     timer::start();
     network.train(train_feats, train_target,
-                  0.2,  // rbm learning rate
-                  4.0,  // nn learning rate
-                  15,   // rbm epochs
-                  250,  // nn epochs
-                  100,  // batch_size
-                  0.5,  // max error
-                  true);// verbose
+                  0.2,    // rbm learning rate
+                  4.0,    // nn learning rate
+                  15,     // rbm epochs
+                  250,    // nn epochs
+                  100,    // batch_size
+                  0.5,    // max error
+                  true);  // verbose
     af::sync();
     double train_time = timer::stop();
 
     // Run the trained network and test accuracy.
     array train_output = network.predict(train_feats);
-    array test_output  = network.predict(test_feats );
-
+    array test_output  = network.predict(test_feats);
 
     // Benchmark prediction
     af::sync();
     timer::start();
-    for (int i = 0; i < 100; i++) {
-        network.predict(test_feats);
-    }
+    for (int i = 0; i < 100; i++) { network.predict(test_feats); }
     af::sync();
     double test_time = timer::stop() / 100;
 
@@ -344,7 +294,7 @@ int dbn_demo(bool console, int perc)
 
     printf("\nTest set:\n");
     printf("Accuracy on testing  data: %2.2f\n",
-           accuracy(test_output , test_target ));
+           accuracy(test_output, test_target));
 
     printf("\nTraining time: %4.4lf s\n", train_time);
     printf("Prediction time: %4.4lf s\n\n", test_time);
@@ -358,20 +308,17 @@ int dbn_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         return dbn_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/geneticalgorithm.cpp b/examples/machine_learning/geneticalgorithm.cpp
new file mode 100644
index 0000000000..184bc9914e
--- /dev/null
+++ b/examples/machine_learning/geneticalgorithm.cpp
@@ -0,0 +1,190 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <climits>
+#include <cstdio>
+#include <cstring>
+#include <ctime>
+
+using namespace af;
+static const float DefaultTopFittest = 0.5;
+
+array update(const array& searchSpace, const array& sampleX,
+             const array& sampleY, const int n) {
+    return searchSpace(sampleY * n + sampleX);
+}
+
+array selectFittest(const array& sampleZ, const int nSamples,
+                    const float topFit = DefaultTopFittest) {
+    // pick top fittest
+    array indices, values;
+    sort(values, indices, sampleZ);
+    int topFitElem = topFit * nSamples;
+    int n          = indices.elements();
+    return (n > topFitElem) ? indices(seq(n - topFitElem, n - 1)) : indices;
+}
+
+void reproduce(array& searchSpace, array& sampleX, array& sampleY,
+               array& sampleZ, const int nSamples, const int n) {
+    // Get fittest parents
+    array selection = selectFittest(sampleZ, nSamples);
+    array parentsX  = sampleX(selection);
+    array parentsY  = sampleY(selection);
+    int bits        = (int)log2(n);
+
+    // Divide selection in two
+    array parentsX1 = parentsX.rows(0, parentsX.elements() / 2 - 1);
+    array parentsX2 =
+        parentsX.rows(parentsX.elements() / 2, parentsX.elements() - 1);
+    array parentsY1 = parentsY.rows(0, parentsY.elements() / 2 - 1);
+    array parentsY2 =
+        parentsY.rows(parentsY.elements() / 2, parentsY.elements() - 1);
+
+    // Get crossover points (at which bit to crossover) and construct bit masks
+    // from them
+    array crossover = randu(nSamples / 4, u32) % bits;
+    array lowermask = (1 << crossover) - 1;
+    array uppermask = INT_MAX - lowermask;
+
+    // Create children as the cross between two parents
+    array childrenX1 = (parentsX1 & uppermask) + (parentsX2 & lowermask);
+    array childrenY1 = (parentsY1 & uppermask) + (parentsY2 & lowermask);
+
+    array childrenX2 = (parentsX2 & uppermask) + (parentsX1 & lowermask);
+    array childrenY2 = (parentsY2 & uppermask) + (parentsY1 & lowermask);
+
+    // Join two new sets
+    sampleX = join(0, childrenX1, childrenX2);
+    sampleY = join(0, childrenY1, childrenY2);
+
+    // Create mutant children
+    array mutantX = sampleX;
+    array mutantY = sampleY;
+
+    // Flip a random bit to vary the gene pool
+    mutantX = mutantX ^ (1 << (randu(nSamples / 2, u32) % bits));
+    mutantY = mutantY ^ (1 << (randu(nSamples / 2, u32) % bits));
+
+    sampleX = join(0, sampleX, mutantX);
+    sampleY = join(0, sampleY, mutantY);
+
+    // Update the value of each sample with the new coordinates
+    sampleZ = update(searchSpace, sampleX, sampleY, n);
+}
+
+void initSamples(array& searchSpace, array& sampleX, array& sampleY,
+                 array& sampleZ, const int nSamples, const int n) {
+    setSeed(time(NULL));
+    sampleX = randu(nSamples, u32) % n;
+    sampleY = randu(nSamples, u32) % n;
+    sampleZ = update(searchSpace, sampleX, sampleY, n);
+}
+
+void init(array& searchSpace, array& searchSpaceXDisplay,
+          array& searchSpaceYDisplay, array& sampleX, array& sampleY,
+          array& sampleZ, const int nSamples, const int n) {
+    // initialize space
+    searchSpace = range(dim4(n / 2, n / 2), 0) + range(dim4(n / 2, n / 2), 1);
+    searchSpace = join(0, searchSpace, flip(searchSpace, 0));
+    searchSpace = join(1, searchSpace, flip(searchSpace, 1));
+
+    // initialize display data
+    searchSpaceXDisplay = iota(dim4(n, 1), dim4(1, n));
+    searchSpaceYDisplay = iota(dim4(1, n), dim4(n, 1));
+
+    // initialize searchers
+    initSamples(searchSpace, sampleX, sampleY, sampleZ, nSamples, n);
+}
+
+void reproducePrint(float& currentMax, array& searchSpace, array& sampleX,
+                    array& sampleY, array& sampleZ, const float trueMax,
+                    const int nSamples, const int n) {
+    if (currentMax < trueMax * 0.99) {
+        float maximum = max<float>(sampleZ);
+        array whereM  = where(sampleZ == maximum);
+        if (maximum < trueMax * 0.99) {
+            printf("Current max at ");
+        } else {
+            printf("\nMax found at ");
+        }
+        printf("(%d,%d): %f (trueMax %f)\n",
+               sampleX(whereM).scalar<unsigned int>(),
+               sampleY(whereM).scalar<unsigned int>(), maximum, trueMax);
+        currentMax = maximum;
+        reproduce(searchSpace, sampleX, sampleY, sampleZ, nSamples, n);
+    }
+}
+
+void geneticSearch(bool console, const int nSamples, const int n) {
+    array searchSpaceXDisplay;
+    array searchSpaceYDisplay;
+    array searchSpace;
+    array sampleX;
+    array sampleY;
+    array sampleZ;
+
+    init(searchSpace, searchSpaceXDisplay, searchSpaceYDisplay, sampleX,
+         sampleY, sampleZ, nSamples, n);
+    float trueMax = max<float>(searchSpace);
+    float maximum = -trueMax;
+
+    if (!console) {
+        af::Window win(1600, 800, "ArrayFire Genetic Algorithm Search Demo");
+        win.grid(1, 2);
+        do {
+            reproducePrint(maximum, searchSpace, sampleX, sampleY, sampleZ,
+                           trueMax, nSamples, n);
+            win(0, 0).setAxesTitles("IdX", "IdY", "Search Space");
+            win(0, 1).setAxesTitles("IdX", "IdY", "Search Space");
+            win(0, 0).surface(searchSpaceXDisplay, searchSpaceYDisplay,
+                              searchSpace);
+            win(0, 1).scatter(sampleX.as(f32), sampleY.as(f32), sampleZ.as(f32),
+                              AF_MARKER_CIRCLE);
+            win.show();
+        } while (!win.close());
+    } else {
+        do {
+            reproducePrint(maximum, searchSpace, sampleX, sampleY, sampleZ,
+                           trueMax, nSamples, n);
+        } while (maximum < trueMax * 0.99);
+    }
+}
+
+int main(int argc, char** argv) {
+    bool console       = false;
+    const int n        = 32;
+    const int nSamples = 16;
+    if (argc > 2 || (argc == 2 && strcmp(argv[1], "-"))) {
+        printf("usage: %s [-]\n", argv[0]);
+        return -1;
+    } else if (argc == 2 && argv[1][0] == '-') {
+        console = true;
+    }
+
+    try {
+        af::info();
+        printf(
+            "** ArrayFire Genetic Algorithm Search Demo **\n\n"
+            "Search for trueMax in a search space where the objective "
+            "function is defined as:\n\n"
+            "SS(x, y) = min(x, n - (x + 1)) + min(y, n - (y + 1))\n\n"
+            "(x, y) belongs to RxR; R = [0, n); n = %d\n\n",
+            n);
+        if (!console) {
+            printf(
+                "The left figure shows the objective function.\n"
+                "The right figure shows the current generation's "
+                "parameters and function values.\n\n");
+        }
+        geneticSearch(console, nSamples, n);
+    } catch (af::exception& e) { fprintf(stderr, "%s\n", e.what()); }
+
+    return 0;
+}
diff --git a/examples/machine_learning/kmeans.cpp b/examples/machine_learning/kmeans.cpp
index c13b5e63ea..963d6a609f 100644
--- a/examples/machine_learning/kmeans.cpp
+++ b/examples/machine_learning/kmeans.cpp
@@ -7,20 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <iostream>
-#include <stdio.h>
 #include <arrayfire.h>
+#include <stdio.h>
 #include <af/util.h>
 #include <cstdlib>
+#include <iostream>
+#include <string>
 
 using namespace af;
 
-array distance(array data, array means)
-{
-    int n = data.dims(0); // Number of features
-    int k = means.dims(1); // Number of means
+array distance(array data, array means) {
+    int n = data.dims(0);   // Number of data points
+    int k = means.dims(1);  // Number of means
 
-    array data2  = tile(data , 1, k, 1);
+    array data2  = tile(data, 1, k, 1);
     array means2 = tile(means, n, 1, 1);
 
     // Currently using manhattan distance
@@ -29,8 +29,7 @@ array distance(array data, array means)
 }
 
 // Get cluster id of each location in data
-array clusterize(const array data, const array means)
-{
+array clusterize(const array data, const array means) {
     // Get manhattan distance
     array dists = distance(data, means);
 
@@ -42,14 +41,14 @@ array clusterize(const array data, const array means)
     return idx;
 }
 
-array new_means(array data, array clusters, int k)
-{
-    int d = data.dims(2);
-    array means = constant(0, 1, k, d);
+array new_means(array data, array clusters, int k) {
+    int d           = data.dims(2);
+    array means     = constant(0, 1, k, d);
     array clustersd = tile(clusters, 1, 1, d);
 
-    gfor (seq ii, k) {
-        means(span, ii, span) = sum(data * (clustersd == ii)) / (sum(clusters == ii) + 1e-5);
+    gfor(seq ii, k) {
+        means(span, ii, span) =
+            sum(data * (clustersd == ii)) / (sum(clusters == ii) + 1e-5);
     }
 
     return means;
@@ -59,10 +58,10 @@ array new_means(array data, array clusters, int k)
 // data:  input,  1D or 2D (range > [0-1])
 // k:     input,  # desired means (k > 1)
 // means: output, vector of means
-void kmeans(array &means, array &clusters, const array in, int k, int iter=100)
-{
-    unsigned n = in.dims(0); // Num features
-    unsigned d = in.dims(2); // feature length
+void kmeans(array &means, array &clusters, const array in, int k,
+            int iter = 100) {
+    unsigned n = in.dims(0);  // Num of data points
+    unsigned d = in.dims(2);  // Num of features (1 for the spider image)
 
     // reshape input
     array data = in * 0;
@@ -72,11 +71,13 @@ void kmeans(array &means, array &clusters, const array in, int k, int iter=100)
     array maximum = max(in);
 
     gfor(seq ii, d) {
-        data(span, span, ii) = (in(span, span, ii) - minimum(ii)) / maximum(ii);
+        data(span, span, ii) =
+            (in(span, span, ii) - minimum(ii).scalar<float>()) /
+            maximum(ii).scalar<float>();
     }
 
     // Initial guess of means
-    means = randu(1, k, d);
+    means               = randu(1, k, d);
     array curr_clusters = constant(0, data.dims(0)) - 1;
     array prev_clusters;
 
@@ -91,7 +92,7 @@ void kmeans(array &means, array &clusters, const array in, int k, int iter=100)
         // Break early if clusters not changing
         unsigned num_changed = count<unsigned>(prev_clusters != curr_clusters);
 
-        if (num_changed < (n/1000) + 1) break;
+        if (num_changed < (n / 1000) + 1) break;
 
         // Update current means for new clusters
         means = new_means(data, curr_clusters, k);
@@ -99,20 +100,20 @@ void kmeans(array &means, array &clusters, const array in, int k, int iter=100)
 
     // Scale up means
     gfor(seq ii, d) {
-        means(span, span, ii) = maximum(ii) * means(span, span, ii) + minimum(ii);
+        means(span, span, ii) =
+            maximum(ii) * means(span, span, ii) + minimum(ii);
     }
 
     clusters = prev_clusters;
-
 }
 
 // K-Means image recoloring.
 // Shifts the hues of an image to the k mean hues.
-int kmeans_demo(int k, bool console)
-{
+int kmeans_demo(int k, bool console) {
     printf("** ArrayFire K-Means Demo (k = %d) **\n\n", k);
 
-    array img = loadImage(ASSETS_DIR"/examples/images/lena.ppm", true) / 255; // [0-255]
+    array img =
+        loadImage(ASSETS_DIR "/examples/images/spider.jpg") / 255;  // [0-255]
 
     int w = img.dims(0), h = img.dims(1), c = img.dims(2);
     array vec = moddims(img, w * h, 1, c);
@@ -127,52 +128,52 @@ int kmeans_demo(int k, bool console)
     kmeans(means_dbl, clusters_dbl, vec, k * 2);
 
     if (!console) {
-#if 0
-        array out_full = moddims(means_full(span, clusters_full, span), img.dims());
-        array out_half = moddims(means_half(span, clusters_half, span), img.dims());
-        array out_dbl  = moddims(means_dbl (span, clusters_dbl , span), img.dims());
-
-        char str_full[32], str_half[32], str_dbl[32];
-        sprintf(str_full, "%2d clusters", k);
-        sprintf(str_half, "%2d clusters", k/2);
-        sprintf(str_dbl , "%2d clusters", k*2);
-
-        fig("color","default");
-        fig("sub",2,2,1); image(img); fig("title","input");
-        fig("sub",2,2,2); image(out_full); fig("title", str_full);
-        fig("sub",2,2,3); image(out_half); fig("title", str_half);
-        fig("sub",2,2,4); image(out_dbl ); fig("title", str_dbl );
-        printf("Hit enter to finish\n");
-        getchar();
-#else
-        printf("Graphics not implemented yet\n");
-#endif
+        array out_full =
+            moddims(means_full(span, clusters_full, span), img.dims());
+        array out_half =
+            moddims(means_half(span, clusters_half, span), img.dims());
+        array out_dbl =
+            moddims(means_dbl(span, clusters_dbl, span), img.dims());
+
+        af::Window wnd(800, 800, "ArrayFire K-Means Demo");
+        wnd.grid(2, 2);
+        std::stringstream out_full_caption, out_half_caption, out_dbl_caption;
+        out_full_caption << "k = " << k;
+        out_half_caption << "k = " << k / 2;
+        out_dbl_caption << "k = " << k * 2;
+        while (!wnd.close()) {
+            wnd(0, 0).image(img, "Input Image");
+            wnd(0, 1).image(out_full, out_full_caption.str().c_str());
+            wnd(1, 0).image(out_half, out_half_caption.str().c_str());
+            wnd(1, 1).image(out_dbl, out_dbl_caption.str().c_str());
+            wnd.show();
+        }
     } else {
-        means_full = moddims(means_full, means_full.dims(1), means_full.dims(2));
-        means_half = moddims(means_half, means_half.dims(1), means_half.dims(2));
-        means_dbl  = moddims(means_dbl , means_dbl.dims(1) , means_dbl.dims(2) );
+        means_full =
+            moddims(means_full, means_full.dims(1), means_full.dims(2));
+        means_half =
+            moddims(means_half, means_half.dims(1), means_half.dims(2));
+        means_dbl = moddims(means_dbl, means_dbl.dims(1), means_dbl.dims(2));
 
         af_print(means_full);
         af_print(means_half);
-        af_print(means_dbl );
+        af_print(means_dbl);
     }
 
     return 0;
 }
 
-int main(int argc, char** argv)
-{
-    int device = argc > 1 ? atoi(argv[1]) : 0;
+int main(int argc, char **argv) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
-    int k = argc > 3 ? atoi(argv[3]) : 16;
+    int k        = argc > 3 ? atoi(argv[3]) : 8;
 
     try {
-
         af::setDevice(device);
         af::info();
         return kmeans_demo(k, console);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
+
+    return 0;
 }
diff --git a/examples/machine_learning/knn.cpp b/examples/machine_learning/knn.cpp
index bd86ae0e13..a23a62db53 100644
--- a/examples/machine_learning/knn.cpp
+++ b/examples/machine_learning/knn.cpp
@@ -8,52 +8,47 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
 // Get accuracy of the predicted results
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     return 100 * count<float>(predicted == target) / target.elements();
 }
 
 // Calculate all the distances from testing set to training set
-array distance(array train, array test)
-{
-
-    const int feat_len = train.dims(1);
+array distance(array train, array test) {
+    const int feat_len  = train.dims(1);
     const int num_train = train.dims(0);
-    const int num_test  =  test.dims(0);
+    const int num_test  = test.dims(0);
 
     array dist = constant(0, num_train, num_test);
 
     // Iterate over each attribute
     for (int ii = 0; ii < feat_len; ii++) {
-
         // Get a attribute vectors
         array train_i = train(span, ii);
-        array test_i  = test (span, ii).T();
+        array test_i  = test(span, ii).T();
 
         // Tile the vectors to generate matrices
-        array train_tiled = tile(train_i, 1,   num_test);
-        array test_tiled  = tile( test_i, num_train, 1 );
+        array train_tiled = tile(train_i, 1, num_test);
+        array test_tiled  = tile(test_i, num_train, 1);
 
         // Add the distance for this attribute
         dist = dist + abs(train_tiled - test_tiled);
-        dist.eval(); // Necessary to free up train_i, test_i
+        dist.eval();  // Necessary to free up train_i, test_i
     }
 
     return dist;
 }
 
-array knn(array &train_feats, array &test_feats, array &train_labels)
-{
+array knn(array &train_feats, array &test_feats, array &train_labels) {
     // Find distances between training and testing sets
     array dist = distance(train_feats, test_feats);
 
@@ -65,21 +60,19 @@ array knn(array &train_feats, array &test_feats, array &train_labels)
     return train_labels(idx);
 }
 
-void knn_demo(bool console, int perc)
-{
+void knn_demo(bool console, int perc) {
     array train_images, train_labels;
     array test_images, test_labels;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<false>(&num_classes, &num_train, &num_test,
-                       train_images, test_images,
-                       train_labels, test_labels, frac);
+    setup_mnist<false>(&num_classes, &num_train, &num_test, train_images,
+                       test_images, train_labels, test_labels, frac);
 
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train).T();
-    array test_feats  = moddims(test_images , feature_length, num_test ).T();
+    array train_feats  = moddims(train_images, feature_length, num_train).T();
+    array test_feats   = moddims(test_images, feature_length, num_test).T();
 
     timer::start();
     // Get the predicted results
@@ -88,7 +81,7 @@ void knn_demo(bool console, int perc)
 
     // Results
     printf("Accuracy on testing  data: %2.2f\n",
-           accuracy(res_labels , test_labels));
+           accuracy(res_labels, test_labels));
 
     printf("Prediction time: %4.4f\n", test_time);
 
@@ -97,20 +90,17 @@ void knn_demo(bool console, int perc)
     }
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         knn_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/logistic_regression.cpp b/examples/machine_learning/logistic_regression.cpp
index e7b4b2d5aa..c77f251474 100644
--- a/examples/machine_learning/logistic_regression.cpp
+++ b/examples/machine_learning/logistic_regression.cpp
@@ -8,17 +8,16 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
@@ -26,27 +25,18 @@ float accuracy(const array& predicted, const array& target)
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-float abserr(const array& predicted, const array& target)
-{
+float abserr(const array &predicted, const array &target) {
     return 100 * sum<float>(abs(predicted - target)) / predicted.elements();
 }
 
-// Activation function
-array sigmoid(const array &val)
-{
-    return 1 / (1 + exp(-val));
-}
-
 // Predict based on given parameters
-array predict(const array &X, const array &Weights)
-{
+array predict(const array &X, const array &Weights) {
     array Z = matmul(X, Weights);
     return sigmoid(Z);
 }
 
-void cost(array &J, array &dJ, const array &Weights,
-          const array &X, const array &Y, double lambda = 1.0)
-{
+void cost(array &J, array &dJ, const array &Weights, const array &X,
+          const array &Y, double lambda = 1.0) {
     // Number of samples
     int m = Y.dims(0);
 
@@ -60,7 +50,7 @@ void cost(array &J, array &dJ, const array &Weights,
     array H = predict(X, Weights);
 
     // Cost of misprediction
-    array Jerr =  -sum(Y * log(H) + (1 - Y) * log(1 - H));
+    array Jerr = -sum(Y * log(H) + (1 - Y) * log(1 - H));
 
     // Regularization cost
     array Jreg = 0.5 * sum(lambdat * Weights * Weights);
@@ -70,17 +60,12 @@ void cost(array &J, array &dJ, const array &Weights,
 
     // Find the gradient of cost
     array D = (H - Y);
-    dJ = (matmulTN(X, D) + lambdat * Weights) / m;
+    dJ      = (matmulTN(X, D) + lambdat * Weights) / m;
 }
 
-array train(const array &X, const array &Y,
-            double alpha = 0.1,
-            double lambda = 1.0,
-            double maxerr = 0.01,
-            int maxiter = 1000,
-            bool verbose = false)
-{
-
+array train(const array &X, const array &Y, double alpha = 0.1,
+            double lambda = 1.0, double maxerr = 0.01, int maxiter = 1000,
+            bool verbose = false) {
     // Initialize parameters to 0
     array Weights = constant(0, X.dims(1), Y.dims(1));
 
@@ -88,7 +73,6 @@ array train(const array &X, const array &Y,
     float err = 0;
 
     for (int i = 0; i < maxiter; i++) {
-
         // Get the cost and gradient
         cost(J, dJ, Weights, X, Y, lambda);
 
@@ -113,8 +97,7 @@ array train(const array &X, const array &Y,
 
 void benchmark_logistic_regression(const array &train_feats,
                                    const array &train_targets,
-                                   const array test_feats)
-{
+                                   const array test_feats) {
     timer::start();
     array Weights = train(train_feats, train_targets, 0.1, 1.0, 0.01, 1000);
     af::sync();
@@ -123,7 +106,7 @@ void benchmark_logistic_regression(const array &train_feats,
     timer::start();
     const int iter = 100;
     for (int i = 0; i < iter; i++) {
-        array test_outputs  = predict(test_feats , Weights);
+        array test_outputs = predict(test_feats, Weights);
         test_outputs.eval();
     }
     af::sync();
@@ -131,50 +114,49 @@ void benchmark_logistic_regression(const array &train_feats,
 }
 
 // Demo of one vs all logistic regression
-int logit_demo(bool console, int perc)
-{
+int logit_demo(bool console, int perc) {
     array train_images, train_targets;
     array test_images, test_targets;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images,
-                      train_targets, test_targets, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_targets, test_targets, frac);
 
     // Reshape images into feature vectors
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train).T();
-    array test_feats  = moddims(test_images , feature_length, num_test ).T();
+    array train_feats  = moddims(train_images, feature_length, num_train).T();
+    array test_feats   = moddims(test_images, feature_length, num_test).T();
 
     train_targets = train_targets.T();
     test_targets  = test_targets.T();
 
     // Add a bias that is always 1
     train_feats = join(1, constant(1, num_train, 1), train_feats);
-    test_feats  = join(1, constant(1, num_test , 1), test_feats );
+    test_feats  = join(1, constant(1, num_test, 1), test_feats);
 
     // Train logistic regression parameters
-    array Weights = train(train_feats, train_targets,
-                          0.1,  // learning rate (aka alpha)
-                          1.0,  // regularization constant (aka weight decay, aka lamdba)
-                          0.01, // maximum error
-                          1000, // maximum iterations
-                          true);// verbose
+    array Weights =
+        train(train_feats, train_targets,
+              0.1,    // learning rate (aka alpha)
+              1.0,    // regularization constant (aka weight decay, aka lambda)
+              0.01,   // maximum error
+              1000,   // maximum iterations
+              true);  // verbose
 
     // Predict the results
     array train_outputs = predict(train_feats, Weights);
-    array test_outputs  = predict(test_feats , Weights);
+    array test_outputs  = predict(test_feats, Weights);
 
     printf("Accuracy on training data: %2.2f\n",
-           accuracy(train_outputs, train_targets ));
+           accuracy(train_outputs, train_targets));
 
     printf("Accuracy on testing data: %2.2f\n",
-           accuracy(test_outputs , test_targets ));
+           accuracy(test_outputs, test_targets));
 
     printf("Maximum error on testing data: %2.2f\n",
-           abserr(test_outputs , test_targets ));
+           abserr(test_outputs, test_targets));
 
     benchmark_logistic_regression(train_feats, train_targets, test_feats);
 
@@ -187,20 +169,17 @@ int logit_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         return logit_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/mnist_common.h b/examples/machine_learning/mnist_common.h
index 0133f652a7..8d079df75a 100644
--- a/examples/machine_learning/mnist_common.h
+++ b/examples/machine_learning/mnist_common.h
@@ -7,26 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <sstream>
 #include <algorithm>
+#include <sstream>
 #include <utility>
 #include "../common/idxio.h"
 
-bool compare(const std::pair<float, int> l,
-             const std::pair<float, int> r)
-{
-    return l.first >= r.first;
+bool compare(const std::pair<float, int> l, const std::pair<float, int> r) {
+    return l.first > r.first;
 }
 
 typedef std::pair<float, int> sort_type;
 
 template<bool expand_labels>
-std::string classify(af::array arr, int k)
-{
+std::string classify(af::array arr, int k) {
     std::stringstream ss;
     if (expand_labels) {
         af::array vec = arr(af::span, k).as(f32);
-        float *h_vec = vec.host<float>();
+        float *h_vec  = vec.host<float>();
         std::vector<sort_type> data;
 
         for (int i = 0; i < (int)vec.elements(); i++)
@@ -34,6 +31,7 @@ std::string classify(af::array arr, int k)
 
         std::stable_sort(data.begin(), data.end(), compare);
 
+        af::freeHost(h_vec);
         ss << data[0].second;
     } else {
         ss << (int)(arr(k).as(f32).scalar<float>());
@@ -44,53 +42,53 @@ std::string classify(af::array arr, int k)
 template<bool expand_labels>
 static void setup_mnist(int *num_classes, int *num_train, int *num_test,
                         af::array &train_images, af::array &test_images,
-                        af::array &train_labels, af::array &test_labels, float frac)
-{
+                        af::array &train_labels, af::array &test_labels,
+                        float frac) {
     std::vector<dim_t> idims;
-    std::vector<float   > idata;
-    read_idx(idims, idata, ASSETS_DIR"/examples/data/mnist/images-subset");
+    std::vector<float> idata;
+    read_idx(idims, idata, ASSETS_DIR "/examples/data/mnist/images-subset");
 
     std::vector<dim_t> ldims;
     std::vector<unsigned> ldata;
-    read_idx(ldims, ldata, ASSETS_DIR"/examples/data/mnist/labels-subset");
+    read_idx(ldims, ldata, ASSETS_DIR "/examples/data/mnist/labels-subset");
 
     std::reverse(idims.begin(), idims.end());
     unsigned numdims = idims.size();
     af::array images = af::array(af::dim4(numdims, &idims[0]), &idata[0]);
 
-    af::array R = af::randu(10000, 1);
-    af::array cond = R < std::min(frac, 0.8f);
-    af::array train_indices = where( cond);
+    af::array R             = af::randu(10000, 1);
+    af::array cond          = R < std::min(frac, 0.8f);
+    af::array train_indices = where(cond);
     af::array test_indices  = where(!cond);
 
     train_images = lookup(images, train_indices, 2) / 255;
-    test_images  = lookup(images, test_indices , 2) / 255;
+    test_images  = lookup(images, test_indices, 2) / 255;
 
     *num_classes = 10;
-    *num_train = train_images.dims(2);
-    *num_test  = test_images.dims(2);
+    *num_train   = train_images.dims(2);
+    *num_test    = test_images.dims(2);
 
     if (expand_labels) {
         train_labels = af::constant(0, *num_classes, *num_train);
-        test_labels  = af::constant(0, *num_classes, *num_test );
+        test_labels  = af::constant(0, *num_classes, *num_test);
 
-        unsigned  *h_train_idx = train_indices.host<unsigned>();
-        unsigned  *h_test_idx  =  test_indices.host<unsigned>();
+        unsigned *h_train_idx = train_indices.host<unsigned>();
+        unsigned *h_test_idx  = test_indices.host<unsigned>();
 
         for (int ii = 0; ii < *num_train; ii++) {
             train_labels(ldata[h_train_idx[ii]], ii) = 1;
         }
 
         for (int ii = 0; ii < *num_test; ii++) {
-            test_labels(ldata[ h_test_idx[ii]], ii) = 1;
+            test_labels(ldata[h_test_idx[ii]], ii) = 1;
         }
 
-        delete[] h_train_idx;
-        delete[] h_test_idx;
+        af::freeHost(h_train_idx);
+        af::freeHost(h_test_idx);
     } else {
         af::array labels = af::array(ldims[0], &ldata[0]);
-        train_labels = labels(train_indices);
-        test_labels  = labels( test_indices);
+        train_labels     = labels(train_indices);
+        test_labels      = labels(test_indices);
     }
 
     return;
@@ -108,14 +106,10 @@ static af::array randidx(int num, int total)
 }
 #endif
 
-
-
 template<bool expand_labels>
 static void display_results(const af::array &test_images,
                             const af::array &test_output,
-                            const af::array &test_actual,
-                            int num_display)
-{
+                            const af::array &test_actual, int num_display) {
 #if 0
     af::array locs = randidx(num_display, test_images.dims(2));
 
@@ -141,18 +135,21 @@ static void display_results(const af::array &test_images,
     }
 #else
     using namespace af;
-    for(int i = 0; i < num_display; i++) {
-        std::cout << "Predicted: " << classify<expand_labels>(test_output, i) << std::endl;
-        std::cout << "Actual: " << classify<expand_labels>(test_actual, i) << std::endl;
-
-        unsigned char* img = (test_images(span, span, i) > 0.1f).as(u8).host<unsigned char>();
-        for(int j = 0; j < 28; j++) {
-            for(int k = 0; k < 28; k++) {
-                std::cout << (img[j*28+k] ? "\u2588" : " ") << " ";
+    for (int i = 0; i < num_display; i++) {
+        std::cout << "Predicted: " << classify<expand_labels>(test_output, i)
+                  << std::endl;
+        std::cout << "Actual: " << classify<expand_labels>(test_actual, i)
+                  << std::endl;
+
+        unsigned char *img =
+            (test_images(span, span, i) > 0.1f).as(u8).host<unsigned char>();
+        for (int j = 0; j < 28; j++) {
+            for (int k = 0; k < 28; k++) {
+                std::cout << (img[k * 28 + j] ? "\u2588" : " ") << " ";
             }
             std::cout << std::endl;
         }
-        delete[] img;
+        af::freeHost(img);
         getchar();
     }
 #endif
diff --git a/examples/machine_learning/naive_bayes.cpp b/examples/machine_learning/naive_bayes.cpp
index 1703205b90..aadca32bc0 100644
--- a/examples/machine_learning/naive_bayes.cpp
+++ b/examples/machine_learning/naive_bayes.cpp
@@ -8,42 +8,38 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
 // Get accuracy of the predicted results
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     return 100 * count<float>(predicted == target) / target.elements();
 }
 
-void naive_bayes_train(float *priors,
-                       array &mu, array &sig2,
-                       const array &train_feats,
-                       const array &train_classes,
-                       int num_classes)
-{
-    const int feat_len = train_feats.dims(0);
+void naive_bayes_train(float *priors, array &mu, array &sig2,
+                       const array &train_feats, const array &train_classes,
+                       int num_classes) {
+    const int feat_len    = train_feats.dims(0);
     const int num_samples = train_classes.elements();
 
     // Get mean and variance from trianing data
-    mu  = constant(0, feat_len, num_classes);
+    mu   = constant(0, feat_len, num_classes);
     sig2 = constant(0, feat_len, num_classes);
 
     for (int ii = 0; ii < num_classes; ii++) {
-        array idx = where(train_classes == ii);
+        array idx            = where(train_classes == ii);
         array train_feats_ii = lookup(train_feats, idx, 1);
 
-        mu(span, ii)  = mean(train_feats_ii, 1);
+        mu(span, ii) = mean(train_feats_ii, 1);
 
         // Some pixels are always 0. Add a small variance.
-        sig2(span,ii) = var(train_feats_ii, 0, 1) + 0.01;
+        sig2(span, ii) = var(train_feats_ii, AF_VARIANCE_SAMPLE, 1) + 0.01;
 
         // Calculate priors
         priors[ii] = (float)idx.elements() / (float)num_samples;
@@ -53,10 +49,8 @@ void naive_bayes_train(float *priors,
     sig2.eval();
 }
 
-array naive_bayes_predict(float *priors,
-                          const array &mu, const array &sig2,
-                          const array &test_feats, int num_classes)
-{
+array naive_bayes_predict(float *priors, const array &mu, const array &sig2,
+                          const array &test_feats, int num_classes) {
     int num_test = test_feats.dims(1);
 
     // Predict the probabilities for testing data
@@ -64,16 +58,16 @@ array naive_bayes_predict(float *priors,
     array log_probs = constant(1, num_test, num_classes);
 
     for (int ii = 0; ii < num_classes; ii++) {
-
         // Tile the current mean and variance to the testing data size
-        array Mu  = tile(mu (span, ii), 1, num_test);
+        array Mu   = tile(mu(span, ii), 1, num_test);
         array Sig2 = tile(sig2(span, ii), 1, num_test);
 
         // This is the same as log of the CDF of the normal distribution
-        array Df = test_feats - Mu;
-        array log_P =  (-(Df * Df) / (2 * Sig2))  - log(sqrt(2 * af::Pi * Sig2));
+        array Df    = test_feats - Mu;
+        array log_P = (-(Df * Df) / (2 * Sig2)) - log(sqrt(2 * af::Pi * Sig2));
 
-        // Accumulate the probabilities, multiply with priors (add log of priors)
+        // Accumulate the probabilities, multiply with priors (add log of
+        // priors)
         log_probs(span, ii) = log(priors[ii]) + sum(log_P).T();
     }
 
@@ -84,15 +78,15 @@ array naive_bayes_predict(float *priors,
 }
 
 void benchmark_nb(const array &train_feats, const array test_feats,
-                  const array &train_labels, int num_classes)
-{
+                  const array &train_labels, int num_classes) {
     array mu, sig2;
-    int iter = 25;
+    int iter      = 25;
     float *priors = new float[num_classes];
 
     timer::start();
     for (int i = 0; i < iter; i++) {
-        naive_bayes_train(priors, mu, sig2, train_feats, train_labels, num_classes);
+        naive_bayes_train(priors, mu, sig2, train_feats, train_labels,
+                          num_classes);
     }
     af::sync();
     printf("Training time: %4.4lf s\n", timer::stop() / iter);
@@ -107,21 +101,19 @@ void benchmark_nb(const array &train_feats, const array test_feats,
     delete[] priors;
 }
 
-void naive_bayes_demo(bool console, int perc)
-{
+void naive_bayes_demo(bool console, int perc) {
     array train_images, train_labels;
     array test_images, test_labels;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<false>(&num_classes, &num_train, &num_test,
-                       train_images, test_images,
-                       train_labels, test_labels, frac);
+    setup_mnist<false>(&num_classes, &num_train, &num_test, train_images,
+                       test_images, train_labels, test_labels, frac);
 
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train);
-    array test_feats  = moddims(test_images , feature_length, num_test );
+    array train_feats  = moddims(train_images, feature_length, num_train);
+    array test_feats   = moddims(test_images, feature_length, num_test);
 
     // Get training parameters
     array mu, sig2;
@@ -129,38 +121,36 @@ void naive_bayes_demo(bool console, int perc)
     naive_bayes_train(priors, mu, sig2, train_feats, train_labels, num_classes);
 
     // Predict the classes
-    array res_labels = naive_bayes_predict(priors, mu, sig2, test_feats, num_classes);
+    array res_labels =
+        naive_bayes_predict(priors, mu, sig2, test_feats, num_classes);
     delete[] priors;
 
     // Results
     printf("Trainng samples: %4d, Testing samples: %4d\n", num_train, num_test);
     printf("Accuracy on testing  data: %2.2f\n",
-           accuracy(res_labels , test_labels));
+           accuracy(res_labels, test_labels));
 
     benchmark_nb(train_feats, test_feats, train_labels, num_classes);
 
     if (!console) {
         test_images = test_images.T();
         test_labels = test_labels.T();
-        // FIXME: Crashing in mnist_common.h::classify
-        //display_results<false>(test_images, res_labels, test_labels , 20);
+
+        display_results<false>(test_images, res_labels, test_labels, 20);
     }
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         naive_bayes_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/neural_network.cpp b/examples/machine_learning/neural_network.cpp
index 496dae17e5..f480977706 100644
--- a/examples/machine_learning/neural_network.cpp
+++ b/examples/machine_learning/neural_network.cpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2019, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -8,116 +8,106 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 using std::vector;
 
-float accuracy(const array& predicted, const array& target)
-{
+std::string toStr(const dtype dt) {
+    switch (dt) {
+        case f32: return "f32";
+        case f16: return "f16";
+        default: return "N/A";
+    }
+}
+
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-// Activation function
-array sigmoid(const array &val)
-{
-    return 1 / (1 + exp(-val));
-}
-
 // Derivative of the activation function
-array deriv(const array &out)
-{
-    return out * (1 - out);
-}
+array deriv(const array &out) { return out * (1 - out); }
 
 // Cost function
-double error(const array &out,
-             const array &pred)
-{
+double error(const array &out, const array &pred) {
     array dif = (out - pred);
     return sqrt((double)(sum<float>(dif * dif)));
 }
 
 class ann {
-
-private:
+   private:
     int num_layers;
     vector<array> weights;
+    dtype datatype;
 
     // Add bias input to the output from previous layer
     array add_bias(const array &in);
 
-    vector<array> forward_propagate(const array& input);
+    vector<array> forward_propagate(const array &input);
 
-    void back_propagate(const vector<array> signal,
-                        const array &pred,
+    void back_propagate(const vector<array> signal, const array &pred,
                         const double &alpha);
-public:
 
+   public:
     // Create a network with given parameters
-    ann(vector<int> layers, double range=0.05);
+    ann(vector<int> layers, double range, dtype dt = f32);
 
     // Output after single pass of forward propagation
     array predict(const array &input);
 
-    // Method to trian the neural net
-    double train(const array &input, const array &target,
-                 double alpha = 1.0,
-                 int max_epochs = 300,
-                 int batch_size = 100,
-                 double maxerr = 1.0,
-                 bool verbose = false);
+    // Method to train the neural net
+    double train(const array &input, const array &target, double alpha = 1.0,
+                 int max_epochs = 300, int batch_size = 100,
+                 double maxerr = 1.0, bool verbose = false);
 };
 
-array ann::add_bias(const array &in)
-{
+array ann::add_bias(const array &in) {
     // Bias input is added on top of given input
-    return join(1, constant(1, in.dims(0), 1), in);
+    return join(1, constant(1, in.dims(0), 1, datatype), in);
 }
 
-vector<array> ann::forward_propagate(const array& input)
-{
+vector<array> ann::forward_propagate(const array &input) {
     // Get activations at each layer
     vector<array> signal(num_layers);
     signal[0] = input;
 
     for (int i = 0; i < num_layers - 1; i++) {
-        array in = add_bias(signal[i]);
-        array out = matmul(in, weights[i]);
+        array in      = add_bias(signal[i]);
+        array out     = matmul(in, weights[i]);
         signal[i + 1] = sigmoid(out);
     }
 
     return signal;
 }
 
-void ann::back_propagate(const vector<array> signal,
-                         const array &target,
-                         const double &alpha)
-{
-
+void ann::back_propagate(const vector<array> signal, const array &target,
+                         const double &alpha) {
     // Get error for output layer
-    array out = signal[num_layers  - 1];
+    array out = signal[num_layers - 1];
     array err = (out - target);
+
     int m = target.dims(0);
 
     for (int i = num_layers - 2; i >= 0; i--) {
-        array in = add_bias(signal[i]);
+        array in    = add_bias(signal[i]);
         array delta = (deriv(out) * err).T();
 
         // Adjust weights
-        array grad = -(alpha * matmul(delta, in)) / m;
+        array tg   = alpha * matmul(delta, in);
+        array grad = -(tg) / m;
         weights[i] += grad.T();
 
         // Input to current layer is output of previous
         out = signal[i];
+
         err = matmulTT(delta, weights[i]);
 
         // Remove the error of bias and propagate backward
@@ -125,28 +115,26 @@ void ann::back_propagate(const vector<array> signal,
     }
 }
 
-ann::ann(vector<int> layers, double range) :
-    num_layers(layers.size()),
-    weights(layers.size() - 1)
-{
-    // Generate uniformly distributed random numbers between [-range/2,range/2]
+ann::ann(vector<int> layers, double range, dtype dt)
+    : num_layers(layers.size()), weights(layers.size() - 1), datatype(dt) {
+    std::cout
+        << "Initializing weights using a uniform random distribution between "
+        << -range / 2 << " and " << range / 2 << " at precision "
+        << toStr(datatype) << std::endl;
     for (int i = 0; i < num_layers - 1; i++) {
-        weights[i] = range * randu(layers[i] + 1, layers[i + 1]) - range/2;
+        weights[i] = range * randu(layers[i] + 1, layers[i + 1]) - range / 2;
+        if (datatype != f32) weights[i] = weights[i].as(datatype);
     }
 }
 
-array ann::predict(const array &input)
-{
+array ann::predict(const array &input) {
     vector<array> signal = forward_propagate(input);
-    array out = signal[num_layers - 1];
+    array out            = signal[num_layers - 1];
     return out;
 }
 
-double ann::train(const array &input, const array &target,
-                  double alpha, int max_epochs, int batch_size,
-                  double maxerr, bool verbose)
-{
-
+double ann::train(const array &input, const array &target, double alpha,
+                  int max_epochs, int batch_size, double maxerr, bool verbose) {
     const int num_samples = input.dims(0);
     const int num_batches = num_samples / batch_size;
 
@@ -154,29 +142,26 @@ double ann::train(const array &input, const array &target,
 
     // Training the entire network
     for (int i = 0; i < max_epochs; i++) {
-
         for (int j = 0; j < num_batches - 1; j++) {
-
             int st = j * batch_size;
-            int en = st + batch_size;
+            int en = st + batch_size - 1;
 
             array x = input(seq(st, en), span);
             array y = target(seq(st, en), span);
 
             // Propagate the inputs forward
             vector<array> signals = forward_propagate(x);
-            array out = signals[num_layers - 1];
-
+            array out             = signals[num_layers - 1];
 
             // Propagate the error backward
             back_propagate(signals, y, alpha);
         }
 
         // Validate with last batch
-        int st = (num_batches - 1) * batch_size;
-        int en = num_samples - 1;
+        int st    = (num_batches - 1) * batch_size;
+        int en    = num_samples - 1;
         array out = predict(input(seq(st, en), span));
-        err = error(out, target(seq(st, en), span));
+        err       = error(out, target(seq(st, en), span));
 
         // Check if convergence criteria has been met
         if (err < maxerr) {
@@ -185,14 +170,14 @@ double ann::train(const array &input, const array &target,
         }
 
         if (verbose) {
-            if ((i + 1) % 10 == 0) printf("Epoch: %4d, Error: %0.4f\n", i+1, err);
+            if ((i + 1) % 10 == 0)
+                printf("Epoch: %4d, Error: %0.4f\n", i + 1, err);
         }
     }
     return err;
 }
 
-int ann_demo(bool console, int perc)
-{
+int ann_demo(bool console, int perc, const dtype dt) {
     printf("** ArrayFire ANN Demo **\n\n");
 
     array train_images, test_images;
@@ -201,14 +186,19 @@ int ann_demo(bool console, int perc)
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images, train_target, test_target, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_target, test_target, frac);
+    if (dt != f32) {
+        train_images = train_images.as(dt);
+        test_images  = test_images.as(dt);
+        train_target = train_target.as(dt);
+    }
 
     int feature_size = train_images.elements() / num_train;
 
     // Reshape images into feature vectors
     array train_feats = moddims(train_images, feature_size, num_train).T();
-    array test_feats  = moddims(test_images , feature_size, num_test ).T();
+    array test_feats  = moddims(test_images, feature_size, num_test).T();
 
     train_target = train_target.T();
     test_target  = test_target.T();
@@ -220,31 +210,28 @@ int ann_demo(bool console, int perc)
     layers.push_back(50);
     layers.push_back(num_classes);
 
-    // Create network
-    ann network(layers);
+    // Create network: architecture, range, datatype
+    ann network(layers, 0.05, dt);
 
     // Train network
     timer::start();
     network.train(train_feats, train_target,
-                  2.0, // learning rate / alpha
-                  250, // max epochs
-                  100, // batch size
-                  0.5, // max error
-                  true); // verbose
+                  2.0,    // learning rate / alpha
+                  250,    // max epochs
+                  100,    // batch size
+                  0.5,    // max error
+                  true);  // verbose
     af::sync();
     double train_time = timer::stop();
 
     // Run the trained network and test accuracy.
     array train_output = network.predict(train_feats);
-    array test_output  = network.predict(test_feats );
-
+    array test_output  = network.predict(test_feats);
 
     // Benchmark prediction
     af::sync();
     timer::start();
-    for (int i = 0; i < 100; i++) {
-        network.predict(test_feats);
-    }
+    for (int i = 0; i < 100; i++) { network.predict(test_feats); }
     af::sync();
     double test_time = timer::stop() / 100;
 
@@ -254,7 +241,7 @@ int ann_demo(bool console, int perc)
 
     printf("\nTest set:\n");
     printf("Accuracy on testing  data: %2.2f\n",
-           accuracy(test_output , test_target ));
+           accuracy(test_output, test_target));
 
     printf("\nTraining time: %4.4lf s\n", train_time);
     printf("Prediction time: %4.4lf s\n\n", test_time);
@@ -268,20 +255,36 @@ int ann_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
+    // usage:  neural_network_xxx (device) (console on/off) (percentage
+    // training/test set) (f32|f16)
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
+    if (perc < 0 || perc > 100) {
+        std::cerr << "Bad perc arg: " << perc << std::endl;
+        return EXIT_FAILURE;
+    }
+    std::string dts = argc > 4 ? argv[4] : "f32";
+    dtype dt        = f32;
+    if (dts == "f16")
+        dt = f16;
+    else if (dts != "f32") {
+        std::cerr << "Unsupported datatype " << dts << ". Supported: f32 or f16"
+                  << std::endl;
+        return EXIT_FAILURE;
+    }
 
-    try {
+    if (dts == "f16" && !af::isHalfAvailable(device)) {
+        std::cerr << "Half not available for device " << device << std::endl;
+        return EXIT_FAILURE;
+    }
 
+    try {
         af::setDevice(device);
         af::info();
-        return ann_demo(console, perc);
-
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+        return ann_demo(console, perc, dt);
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/perceptron.cpp b/examples/machine_learning/perceptron.cpp
index 374edd0457..49845461eb 100644
--- a/examples/machine_learning/perceptron.cpp
+++ b/examples/machine_learning/perceptron.cpp
@@ -8,17 +8,16 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
@@ -26,33 +25,22 @@ float accuracy(const array& predicted, const array& target)
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-// Activation function
-array sigmoid(const array &val)
-{
-    return 1 / (1 + exp(-val));
-}
-
 // Predict based on given parameters
-array predict(const array &X, const array &Weights)
-{
+array predict(const array &X, const array &Weights) {
     return sigmoid(matmul(X, Weights));
 }
 
-array train(const array &X, const array &Y,
-            double alpha = 0.1,
-            double maxerr = 0.05,
-            int maxiter = 1000, bool verbose = false)
-{
-
+array train(const array &X, const array &Y, double alpha = 0.1,
+            double maxerr = 0.05, int maxiter = 1000, bool verbose = false) {
     // Initialize parameters to 0
     array Weights = constant(0, X.dims(1), Y.dims(1));
 
     for (int i = 0; i < maxiter; i++) {
-        array P = predict(X, Weights);
+        array P   = predict(X, Weights);
         array err = Y - P;
 
         float mean_abs_err = mean<float>(abs(err));
-        if (mean_abs_err  < maxerr) break;
+        if (mean_abs_err < maxerr) break;
 
         if (verbose && (i + 1) % 25 == 0) {
             printf("Iter: %d, Err: %.4f\n", i + 1, mean_abs_err);
@@ -64,10 +52,8 @@ array train(const array &X, const array &Y,
     return Weights;
 }
 
-void benchmark_perceptron(const array &train_feats,
-                          const array &train_targets,
-                          const array test_feats)
-{
+void benchmark_perceptron(const array &train_feats, const array &train_targets,
+                          const array test_feats) {
     timer::start();
     array Weights = train(train_feats, train_targets, 0.1, 0.01, 1000);
     af::sync();
@@ -76,7 +62,7 @@ void benchmark_perceptron(const array &train_feats,
     timer::start();
     const int iter = 100;
     for (int i = 0; i < iter; i++) {
-        array test_outputs  = predict(test_feats , Weights);
+        array test_outputs = predict(test_feats, Weights);
         test_outputs.eval();
     }
     af::sync();
@@ -84,42 +70,40 @@ void benchmark_perceptron(const array &train_feats,
 }
 
 // Demo of one vs all logistic regression
-int perceptron_demo(bool console, int perc)
-{
+int perceptron_demo(bool console, int perc) {
     array train_images, train_targets;
     array test_images, test_targets;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images,
-                      train_targets, test_targets, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_targets, test_targets, frac);
 
     // Reshape images into feature vectors
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train).T();
-    array test_feats  = moddims(test_images , feature_length, num_test ).T();
+    array train_feats  = moddims(train_images, feature_length, num_train).T();
+    array test_feats   = moddims(test_images, feature_length, num_test).T();
 
     train_targets = train_targets.T();
     test_targets  = test_targets.T();
 
     // Add a bias that is always 1
     train_feats = join(1, constant(1, num_train, 1), train_feats);
-    test_feats  = join(1, constant(1, num_test , 1), test_feats );
+    test_feats  = join(1, constant(1, num_test, 1), test_feats);
 
     // Train logistic regression parameters
     array Weights = train(train_feats, train_targets, 0.1, 0.01, 1000, true);
 
     // Predict the results
     array train_outputs = predict(train_feats, Weights);
-    array test_outputs  = predict(test_feats , Weights);
+    array test_outputs  = predict(test_feats, Weights);
 
     printf("Accuracy on training data: %2.2f\n",
-           accuracy(train_outputs, train_targets ));
+           accuracy(train_outputs, train_targets));
 
     printf("Accuracy on testing data: %2.2f\n",
-           accuracy(test_outputs , test_targets ));
+           accuracy(test_outputs, test_targets));
 
     benchmark_perceptron(train_feats, train_targets, test_feats);
 
@@ -133,20 +117,17 @@ int perceptron_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         return perceptron_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/rbm.cpp b/examples/machine_learning/rbm.cpp
index caa583ce82..7da01cc7f3 100644
--- a/examples/machine_learning/rbm.cpp
+++ b/examples/machine_learning/rbm.cpp
@@ -8,93 +8,67 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 using std::vector;
 
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-// Activation function
-array sigmoid(const array &val)
-{
-    return 1 / (1 + exp(-val));
-}
-
 // Derivative of the activation function
-array deriv(const array &out)
-{
-    return out * (1 - out);
-}
+array deriv(const array &out) { return out * (1 - out); }
 
 // Cost function
-double error(const array &out,
-             const array &pred)
-{
+double error(const array &out, const array &pred) {
     array dif = (out - pred);
     return sqrt((double)(sum<float>(dif * dif)));
 }
 
-array binary(const array in)
-{
+array binary(const array in) {
     // Choosing "1" with probability sigmoid(in)
     return (in > randu(in.dims())).as(f32);
 }
 
 class rbm {
-
-private:
+   private:
     array weights;
     array h_bias;
     array v_bias;
 
     // Add bias input to the output from previous layer
-    array vtoh(const array &v)
-    {
-        return binary(prop_up(v));
-    }
-
-    array htov(const array &h)
-    {
-        return binary(prop_down(h));
-    }
+    array vtoh(const array &v) { return binary(prop_up(v)); }
 
-public:
+    array htov(const array &h) { return binary(prop_down(h)); }
 
+   public:
     rbm() {}
 
-    rbm(int v_size, int h_size) :
-        weights(randu(h_size, v_size)/100 - 0.05),
-        h_bias(constant(0, 1, h_size)),
-        v_bias(constant(0, 1, v_size))
-    {
-    }
+    rbm(int v_size, int h_size)
+        : weights(randu(h_size, v_size) / 100 - 0.05)
+        , h_bias(constant(0, 1, h_size))
+        , v_bias(constant(0, 1, v_size)) {}
 
-    array prop_up(const array &v)
-    {
+    array prop_up(const array &v) {
         array h_bias_tile = tile(h_bias, v.dims(0));
         return sigmoid(h_bias_tile + matmulNT(v, weights));
     }
 
-    array prop_down(const array &h)
-    {
+    array prop_down(const array &h) {
         array v_bias_tile = tile(v_bias, h.dims(0));
         return sigmoid(v_bias_tile + matmul(h, weights));
     }
 
-    void gibbs_vhv(array &vt, array &ht, const array &v, int k = 1)
-    {
+    void gibbs_vhv(array &vt, array &ht, const array &v, int k = 1) {
         vt = v;
         for (int i = 0; i < k; i++) {
             ht = vtoh(vt);
@@ -102,8 +76,7 @@ class rbm {
         }
     }
 
-    void gibbs_hvh(array &vt, array &ht, const array &h, int k = 1)
-    {
+    void gibbs_hvh(array &vt, array &ht, const array &h, int k = 1) {
         ht = h;
         for (int i = 0; i < k; i++) {
             vt = htov(ht);
@@ -111,23 +84,17 @@ class rbm {
         }
     }
 
-    void train(const array &in,
-               double lr = 0.1,
-               int num_epochs = 15,
-               int batch_size = 100,
-               int k = 1, bool verbose = false)
-    {
+    void train(const array &in, double lr = 0.1, int num_epochs = 15,
+               int batch_size = 100, int k = 1, bool verbose = false) {
         const int num_samples = in.dims(0);
         const int num_batches = num_samples / batch_size;
 
-        for (int i = 0; i <  num_epochs; i++) {
-
+        for (int i = 0; i < num_epochs; i++) {
             double err = 0;
 
             for (int j = 0; j < num_batches - 1; j++) {
-
-                int st = j * batch_size;
-                int en = std::min(num_samples - 1, st + batch_size);
+                int st  = j * batch_size;
+                int en  = std::min(num_samples - 1, st + batch_size - 1);
                 int num = en - st + 1;
 
                 array v_pos = in(seq(st, en), span);
@@ -142,7 +109,7 @@ class rbm {
                 array c_pos = matmulTN(h_pos, v_pos);
                 array c_neg = matmulTN(h_neg, v_neg);
 
-                array delta_w = lr * (c_pos - c_neg) / num;
+                array delta_w  = lr * (c_pos - c_neg) / num;
                 array delta_vb = lr * sum(v_pos - v_neg) / num;
                 array delta_hb = lr * sum(h_pos - h_neg) / num;
 
@@ -150,13 +117,12 @@ class rbm {
                 v_bias += delta_vb;
                 h_bias += delta_hb;
 
-                if (verbose) {
-                    err += error(v_pos, v_neg);
-                }
+                if (verbose) { err += error(v_pos, v_neg); }
             }
 
             if (verbose) {
-                printf("Epoch %d: Reconstruction error: %0.4f\n", i + 1, err / num_batches);
+                printf("Epoch %d: Reconstruction error: %0.4f\n", i + 1,
+                       err / num_batches);
             }
         }
 
@@ -164,8 +130,7 @@ class rbm {
     }
 };
 
-int rbm_demo(bool console, int perc)
-{
+int rbm_demo(bool /*console*/, int perc) {
     printf("** ArrayFire RBM Demo **\n\n");
 
     array train_images, test_images;
@@ -174,8 +139,8 @@ int rbm_demo(bool console, int perc)
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images, train_target, test_target, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_target, test_target, frac);
 
     dim4 dims = train_images.dims();
 
@@ -183,7 +148,7 @@ int rbm_demo(bool console, int perc)
 
     // Reshape images into feature vectors
     array train_feats = moddims(train_images, feature_size, num_train).T();
-    array test_feats  = moddims(test_images , feature_size, num_test ).T();
+    array test_feats  = moddims(test_images, feature_size, num_test).T();
 
     train_target = train_target.T();
     test_target  = test_target.T();
@@ -191,24 +156,23 @@ int rbm_demo(bool console, int perc)
     rbm network(train_feats.dims(1), 2000);
 
     network.train(train_feats,
-                  0.1, // learning rate
-                  15,  // num epochs
-                  100, // batch size
-                  1,   // k
+                  0.1,  // learning rate
+                  15,   // num epochs
+                  100,  // batch size
+                  1,    // k
                   true);
 
     // Test reconstructed images
     for (int ii = 0; ii < 5; ii++) {
-
         array in = test_feats(ii, span);
         array res, tmp;
 
         network.gibbs_vhv(res, tmp, in);
 
-        in  = moddims(in , dims[0], dims[1]);
+        in  = moddims(in, dims[0], dims[1]);
         res = moddims(res, dims[0], dims[1]);
 
-        in = round(in);
+        in  = round(in);
         res = round(res);
 
         printf("Reconstructed Error for image %2d: %.4f\n", ii,
@@ -218,20 +182,17 @@ int rbm_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         return rbm_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/machine_learning/softmax_regression.cpp b/examples/machine_learning/softmax_regression.cpp
index 45d253c8b9..452ab8950c 100644
--- a/examples/machine_learning/softmax_regression.cpp
+++ b/examples/machine_learning/softmax_regression.cpp
@@ -8,17 +8,16 @@
  ********************************************************/
 
 #include <arrayfire.h>
+#include <math.h>
 #include <stdio.h>
-#include <vector>
-#include <string>
 #include <af/util.h>
-#include <math.h>
+#include <string>
+#include <vector>
 #include "mnist_common.h"
 
 using namespace af;
 
-float accuracy(const array& predicted, const array& target)
-{
+float accuracy(const array &predicted, const array &target) {
     array val, plabels, tlabels;
     max(val, tlabels, target, 1);
     max(val, plabels, predicted, 1);
@@ -26,28 +25,22 @@ float accuracy(const array& predicted, const array& target)
     return 100 * count<float>(plabels == tlabels) / tlabels.elements();
 }
 
-float abserr(const array& predicted, const array& target)
-{
+float abserr(const array &predicted, const array &target) {
     return 100 * sum<float>(abs(predicted - target)) / predicted.elements();
 }
 
-array divide(const array &a, const array &b)
-{
-    return a / b;
-}
+array divide(const array &a, const array &b) { return a / b; }
 
 // Predict based on given parameters
-array predict(const array &X, const array &Weights)
-{
-    array Z = matmul(X, Weights);
-    array EZ = exp(Z);
+array predict(const array &X, const array &Weights) {
+    array Z   = matmul(X, Weights);
+    array EZ  = exp(Z);
     array nrm = sum(EZ, 1);
     return batchFunc(EZ, nrm, divide);
 }
 
-void cost(array &J, array &dJ, const array &Weights,
-          const array &X, const array &Y, double lambda = 1.0)
-{
+void cost(array &J, array &dJ, const array &Weights, const array &X,
+          const array &Y, double lambda = 1.0) {
     // Number of samples
     int m = Y.dims(0);
 
@@ -61,7 +54,7 @@ void cost(array &J, array &dJ, const array &Weights,
     array H = predict(X, Weights);
 
     // Cost of misprediction
-    array Jerr =  -sum(Y * log(H));
+    array Jerr = -sum(Y * log(H));
 
     // Regularization cost
     array Jreg = 0.5 * sum(lambdat * Weights * Weights);
@@ -71,17 +64,12 @@ void cost(array &J, array &dJ, const array &Weights,
 
     // Find the gradient of cost
     array D = (H - Y);
-    dJ = (matmulTN(X, D) + lambdat * Weights) / m;
+    dJ      = (matmulTN(X, D) + lambdat * Weights) / m;
 }
 
-array train(const array &X, const array &Y,
-            double alpha = 0.1,
-            double lambda = 1.0,
-            double maxerr = 0.01,
-            int maxiter = 1000,
-            bool verbose = false)
-{
-
+array train(const array &X, const array &Y, double alpha = 0.1,
+            double lambda = 1.0, double maxerr = 0.01, int maxiter = 1000,
+            bool verbose = false) {
     // Initialize parameters to 0
     array Weights = constant(0, X.dims(1), Y.dims(1));
 
@@ -89,7 +77,6 @@ array train(const array &X, const array &Y,
     float err = 0;
 
     for (int i = 0; i < maxiter; i++) {
-
         // Get the cost and gradient
         cost(J, dJ, Weights, X, Y, lambda);
 
@@ -114,8 +101,7 @@ array train(const array &X, const array &Y,
 
 void benchmark_softmax_regression(const array &train_feats,
                                   const array &train_targets,
-                                  const array test_feats)
-{
+                                  const array test_feats) {
     timer::start();
     array Weights = train(train_feats, train_targets, 0.1, 1.0, 0.01, 1000);
     af::sync();
@@ -124,7 +110,7 @@ void benchmark_softmax_regression(const array &train_feats,
     timer::start();
     const int iter = 100;
     for (int i = 0; i < iter; i++) {
-        array test_outputs  = predict(test_feats , Weights);
+        array test_outputs = predict(test_feats, Weights);
         test_outputs.eval();
     }
     af::sync();
@@ -132,50 +118,49 @@ void benchmark_softmax_regression(const array &train_feats,
 }
 
 // Demo of one vs all logistic regression
-int logit_demo(bool console, int perc)
-{
+int logit_demo(bool console, int perc) {
     array train_images, train_targets;
     array test_images, test_targets;
     int num_train, num_test, num_classes;
 
     // Load mnist data
     float frac = (float)(perc) / 100.0;
-    setup_mnist<true>(&num_classes, &num_train, &num_test,
-                      train_images, test_images,
-                      train_targets, test_targets, frac);
+    setup_mnist<true>(&num_classes, &num_train, &num_test, train_images,
+                      test_images, train_targets, test_targets, frac);
 
     // Reshape images into feature vectors
     int feature_length = train_images.elements() / num_train;
-    array train_feats = moddims(train_images, feature_length, num_train).T();
-    array test_feats  = moddims(test_images , feature_length, num_test ).T();
+    array train_feats  = moddims(train_images, feature_length, num_train).T();
+    array test_feats   = moddims(test_images, feature_length, num_test).T();
 
     train_targets = train_targets.T();
     test_targets  = test_targets.T();
 
     // Add a bias that is always 1
     train_feats = join(1, constant(1, num_train, 1), train_feats);
-    test_feats  = join(1, constant(1, num_test , 1), test_feats );
+    test_feats  = join(1, constant(1, num_test, 1), test_feats);
 
     // Train logistic regression parameters
-    array Weights = train(train_feats, train_targets,
-                          0.1,  // learning rate (aka alpha)
-                          1.0,  // regularization constant (aka weight decay, aka lamdba)
-                          0.01, // maximum error
-                          1000, // maximum iterations
-                          true);// verbose
+    array Weights =
+        train(train_feats, train_targets,
+              0.1,    // learning rate (aka alpha)
+              1.0,    // regularization constant (aka weight decay, aka lambda)
+              0.01,   // maximum error
+              1000,   // maximum iterations
+              true);  // verbose
 
     // Predict the results
     array train_outputs = predict(train_feats, Weights);
-    array test_outputs  = predict(test_feats , Weights);
+    array test_outputs  = predict(test_feats, Weights);
 
     printf("Accuracy on training data: %2.2f\n",
-           accuracy(train_outputs, train_targets ));
+           accuracy(train_outputs, train_targets));
 
     printf("Accuracy on testing data: %2.2f\n",
-           accuracy(test_outputs , test_targets ));
+           accuracy(test_outputs, test_targets));
 
     printf("Maximum error on testing data: %2.2f\n",
-           abserr(test_outputs , test_targets ));
+           abserr(test_outputs, test_targets));
 
     benchmark_softmax_regression(train_feats, train_targets, test_feats);
 
@@ -188,20 +173,17 @@ int logit_demo(bool console, int perc)
     return 0;
 }
 
-int main(int argc, char** argv)
-{
+int main(int argc, char **argv) {
     int device   = argc > 1 ? atoi(argv[1]) : 0;
     bool console = argc > 2 ? argv[2][0] == '-' : false;
     int perc     = argc > 3 ? atoi(argv[3]) : 60;
 
     try {
-
         af::setDevice(device);
         af::info();
         return logit_demo(console, perc);
 
-    } catch (af::exception &ae) {
-        std::cerr << ae.what() << std::endl;
-    }
+    } catch (af::exception &ae) { std::cerr << ae.what() << std::endl; }
 
+    return 0;
 }
diff --git a/examples/pde/CMakeLists.txt b/examples/pde/CMakeLists.txt
new file mode 100644
index 0000000000..57f689a9e9
--- /dev/null
+++ b/examples/pde/CMakeLists.txt
@@ -0,0 +1,58 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-PDE
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+add_definitions("-DASSETS_DIR=\"${ASSETS_DIR}\"")
+
+if(ArrayFire_CPU_FOUND)
+  # Shallow Water simulation example
+  add_executable(swe_cpu swe.cpp)
+  target_link_libraries(swe_cpu ArrayFire::afcpu)
+
+  # Black Hole Raytracing example
+  add_executable(bhrt_cpu bhrt.cpp)
+  target_link_libraries(bhrt_cpu ArrayFire::afcpu)
+
+  add_executable(boltzmann_cfd_cpu boltzmann_cfd.cpp)
+  target_link_libraries(boltzmann_cfd_cpu ArrayFire::afcpu)
+endif()
+
+if(ArrayFire_CUDA_FOUND)
+  add_executable(swe_cuda swe.cpp)
+  target_link_libraries(swe_cuda ArrayFire::afcuda)
+
+  add_executable(bhrt_cuda bhrt.cpp)
+  target_link_libraries(bhrt_cuda ArrayFire::afcuda)
+
+  add_executable(boltzmann_cfd_cuda boltzmann_cfd.cpp)
+  target_link_libraries(boltzmann_cfd_cuda ArrayFire::afcuda)
+endif()
+
+if(ArrayFire_OpenCL_FOUND)
+  add_executable(swe_opencl swe.cpp)
+  target_link_libraries(swe_opencl ArrayFire::afopencl)
+
+  add_executable(bhrt_opencl bhrt.cpp)
+  target_link_libraries(bhrt_opencl ArrayFire::afopencl)
+
+  add_executable(boltzmann_cfd_opencl boltzmann_cfd.cpp)
+  target_link_libraries(boltzmann_cfd_opencl ArrayFire::afopencl)
+endif()
+
+if(ArrayFire_oneAPI_FOUND)
+  add_executable(swe_oneapi swe.cpp)
+  target_link_libraries(swe_oneapi ArrayFire::afoneapi)
+
+  add_executable(boltzmann_cfd_oneapi boltzmann_cfd.cpp)
+  target_link_libraries(boltzmann_cfd_oneapi ArrayFire::afoneapi)
+endif()
diff --git a/examples/pde/bhrt.cpp b/examples/pde/bhrt.cpp
new file mode 100644
index 0000000000..55e116a330
--- /dev/null
+++ b/examples/pde/bhrt.cpp
@@ -0,0 +1,1139 @@
+/*******************************************************
+ * Copyright (c) 2024, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*
+    This is a Black Hole Raytracer.
+    For this raytracer we are using backwards path tracing to compute the
+   resulting image. The paths of the rays shot from the camera are simulated
+   step by step along the null geodesics that light follows in spacetime. The
+   geodesics are computed from the spacetime metric. This project has three
+   metrics that can be used: Schwarzschild, Kerr, and Ellis.
+
+    For more information on the black hole raytracing, check out
+    Riazuelo, A. (2015). Seeing relativity -- I. Ray tracing in a Schwarzschild
+   metric to explore the maximal analytic extension of the metric and making a
+   proper rendering of the stars. ArXiv.
+   https://doi.org/10.1142/S0218271819500421
+
+    For more information on raytracing, check out
+    Ray Tracing in One Weekend series, https://raytracing.github.io/
+
+    Image being used for the background is Westerlund 2 from
+    NASA, ESA, the Hubble Heritage Team (STScI/AURA), A. Nota (ESA/STScI), and
+   the Westerlund 2 Science Team. See
+   http://www.spacetelescope.org/images/heic1509a/ for details.
+
+    The default scene is the rotating black hole using the Kerr metric, set by
+   the global variable 'scene'. The parameters of the black holes/wormholes may
+   be changed at the top with the simulation constants. The parameters of the
+   image may be changed in the 'raytracing' function.
+*/
+#include <arrayfire.h>
+
+#include <chrono>
+#include <iomanip>
+#include <iostream>
+#include <memory>
+#include <string>
+#include <vector>
+
+enum class Scene { ROTATE_BH, STATIC_BH, WORMHOLE };
+
+// Scene being computed
+static constexpr Scene scene = Scene::ROTATE_BH;
+
+// **** Simulation Constants ****
+static constexpr double M = 0.5;    // Black Hole Mass
+static constexpr double J = 0.249;  // Black Hole Rotation (J < M^2)
+static constexpr double b = 3.0;    // Wormhole drainhole parameter
+
+/**
+ * @brief Generates a string progress bar
+ *
+ * @param current current job
+ * @param total total number of jobs
+ * @param start_info progress bar prior info
+ */
+void status_bar(int64_t current, int64_t total, const std::string& start_info) {
+    auto precision         = std::cout.precision();
+    static auto prev_time  = std::chrono::high_resolution_clock::now();
+    static auto prev       = current - 1;
+    static auto prev2      = prev;
+    static auto prev2_time = prev_time;
+
+    auto curr_time = std::chrono::high_resolution_clock::now();
+
+    double percent  = 100.0 * (double)(current + 1) / (double)total;
+    std::string str = "[";
+    for (int i = 0; i < 50; ++i) {
+        if (percent >= i * 2)
+            str += "=";
+        else
+            str += " ";
+    }
+    str += "]";
+
+    auto time =
+        current != prev
+            ? (total - current) * (curr_time - prev_time) / (current - prev)
+            : (total - current) * (curr_time - prev2_time) / (current - prev2);
+
+    if (current != prev && prev != prev2) {
+        prev2      = prev;
+        prev2_time = prev_time;
+    }
+    prev      = current;
+    prev_time = curr_time;
+
+    if (current != total) {
+        using namespace std::chrono_literals;
+        std::cout << start_info << " " << std::fixed << std::setprecision(1)
+                  << percent << "%  " << str << " Time Remaining: ";
+        if (std::chrono::duration_cast<std::chrono::seconds>(time).count() >
+            300)
+            std::cout << std::chrono::duration_cast<std::chrono::minutes>(time)
+                             .count()
+                      << " min";
+        else
+            std::cout << std::chrono::duration_cast<std::chrono::seconds>(time)
+                             .count()
+                      << " s";
+
+        std::cout << std::string(5, ' ') << '\r';
+    } else
+        std::cout << "\rDone!" << std::string(120, ' ') << std::endl;
+
+    std::cout << std::setprecision(precision) << std::defaultfloat;
+}
+
+/**
+ * @brief Returns the euclidean dot product for two cartesian vectors with 3
+ * coords
+ *
+ * @param lhs
+ * @param rhs
+ * @return af::array
+ */
+af::array dot3(const af::array& lhs, const af::array& rhs) {
+    return af::sum(lhs * rhs, 0);
+}
+
+/**
+ * @brief Returns the euclidean norm for a cartesian vector with 3 coords
+ *
+ * @param vector
+ * @return af::array
+ */
+af::array norm3(const af::array& vector) {
+    return af::sqrt(dot3(vector, vector));
+}
+
+/**
+ * @brief Returns the normalized vector for a cartesian vector with 3 coords
+ *
+ * @param vector
+ * @return af::array
+ */
+af::array normalize3(const af::array& vector) { return vector / norm3(vector); }
+
+af::exception make_error(const char* string) {
+    std::cout << string << std::endl;
+    return af::exception(string);
+}
+
+/**
+ * @brief Transforms degrees to radians
+ *
+ * @param degrees
+ * @return double
+ */
+double radians(double degrees) { return degrees * af::Pi / 180.0; }
+
+/**
+ * @brief Computes the cross product of two euclidean vectors
+ *
+ * @param lhs
+ * @param rhs
+ * @return af::array
+ */
+af::array cross_product(const af::array& lhs, const af::array& rhs) {
+    if (lhs.dims() != rhs.dims())
+        throw make_error("Arrays must have the same dimensions");
+    else if (lhs.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    return af::join(
+        0,
+        lhs(1, af::span, af::span) * rhs(2, af::span, af::span) -
+            lhs(2, af::span, af::span) * rhs(1, af::span, af::span),
+        lhs(2, af::span, af::span) * rhs(0, af::span, af::span) -
+            lhs(0, af::span, af::span) * rhs(2, af::span, af::span),
+        lhs(0, af::span, af::span) * rhs(1, af::span, af::span) -
+            lhs(1, af::span, af::span) * rhs(0, af::span, af::span));
+}
+
+/**
+ * @brief Transform the position vectors from cartesian to spherical coordinates
+ *
+ * @param pos
+ * @return af::array
+ */
+af::array cart_to_sph_position(const af::array& pos) {
+    if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array x = pos(0, af::span);
+    af::array y = pos(1, af::span);
+    af::array z = pos(2, af::span);
+
+    af::array r = af::sqrt(x * x + y * y + z * z);
+    af::array o = af::acos(z / r);
+    af::array p = af::atan2(y, x);
+
+    af::array transformed_pos = af::join(0, r, o, p);
+
+    return transformed_pos;
+}
+
+/**
+ * @brief Transform the velocity vectors from cartesian to spherical coordinates
+ *
+ * @param vel
+ * @param pos
+ * @return af::array
+ */
+af::array cart_to_sph_velocity(const af::array& vel, const af::array& pos) {
+    if (vel.dims() != pos.dims())
+        throw make_error("Arrays must have the same dimensions");
+    else if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array x = pos(0, af::span);
+    af::array y = pos(1, af::span);
+    af::array z = pos(2, af::span);
+
+    af::array r = af::sqrt(x * x + y * y + z * z);
+    af::array o = af::acos(z / r);
+    af::array p = af::atan2(y, x);
+
+    af::array ux = vel(0, af::span);
+    af::array uy = vel(1, af::span);
+    af::array uz = vel(2, af::span);
+
+    af::array ur = (ux * x + uy * y + uz * z) / r;
+    af::array up = (uy * af::cos(p) - ux * af::sin(p)) / (r * af::sin(o));
+    af::array uo =
+        (af::cos(o) * (ux * af::cos(p) + uy * af::sin(p)) - uz * af::sin(o)) /
+        r;
+    af::array transformed_vel = af::join(0, ur, uo, up);
+
+    return transformed_vel;
+}
+
+/**
+ * @brief Transform the velocity vectors from spherical to cartesian coordinates
+ *
+ * @param vel
+ * @param pos
+ * @return af::array
+ */
+af::array sph_to_cart_velocity(const af::array& vel, const af::array& pos) {
+    if (vel.dims() != pos.dims())
+        throw make_error("Arrays must have the same dimensions");
+    else if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array r = pos(0, af::span);
+    af::array o = pos(1, af::span);
+    af::array p = pos(2, af::span);
+
+    af::array ur = vel(0, af::span);
+    af::array uo = vel(1, af::span);
+    af::array up = vel(2, af::span);
+
+    af::array ux = (ur * af::sin(o) + uo * r * af::cos(o)) * af::cos(p) -
+                   up * r * af::sin(o) * af::sin(p);
+    af::array uy = (ur * af::sin(o) + uo * r * af::cos(o)) * af::sin(p) +
+                   up * r * af::sin(o) * af::cos(p);
+    af::array uz              = ur * af::cos(o) - uo * r * af::sin(o);
+    af::array transformed_vel = af::join(0, ux, uy, uz);
+
+    return transformed_vel;
+}
+
+/**
+ * @brief Transform the position vectors from cartesian to oblate coordinates
+ *
+ * @param pos cartesian positions, with the 3 coordinates (x, y, z) along
+ *            the first dimension
+ * @return af::array
+ */
+af::array cart_to_oblate_position(const af::array& pos) {
+    if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array x = pos(0, af::span);
+    af::array y = pos(1, af::span);
+    af::array z = pos(2, af::span);
+    auto a      = J / M;
+    auto diff   = x * x + y * y + z * z - a * a;
+
+    af::array r =
+        af::sqrt((diff + af::sqrt(diff * diff + z * z * a * a * 4.0)) / 2.0);
+    af::array o = af::acos(z / r);
+    af::array p = af::atan2(y, x);
+
+    af::array transformed_pos = af::join(0, r, o, p);
+
+    return transformed_pos;
+}
+
+/**
+ * @brief Transform the position vectors from oblate to cartesian coordinates
+ *
+ * @param pos oblate positions, with the 3 coordinates (r, theta, phi) along
+ *            the first dimension
+ * @return af::array
+ */
+af::array oblate_to_cart_position(const af::array& pos) {
+    if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array r = pos(0, af::span);
+    af::array o = pos(1, af::span);
+    af::array p = pos(2, af::span);
+    auto a      = J / M;
+    auto R      = af::sqrt(r * r + a * a);
+
+    af::array x = R * af::sin(o) * af::cos(p);
+    af::array y = R * af::sin(o) * af::sin(p);
+    af::array z = r * af::cos(o);
+
+    af::array transformed_pos = af::join(0, x, y, z);
+
+    return transformed_pos;
+}
+
+/**
+ * @brief Transform the velocity vectors from oblate to cartesian coordinates
+ *
+ * @param vel
+ * @param pos
+ * @return af::array
+ */
+af::array oblate_to_cart_velocity(const af::array& vel, const af::array& pos) {
+    if (vel.dims() != pos.dims())
+        throw make_error("Arrays must have the same dimensions");
+    else if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array r = pos(0, af::span);
+    af::array o = pos(1, af::span);
+    af::array p = pos(2, af::span);
+
+    af::array ur = vel(0, af::span);
+    af::array uo = vel(1, af::span);
+    af::array up = vel(2, af::span);
+
+    double a     = J / M;
+    af::array ra = af::sqrt(r * r + a * a);
+
+    af::array ux =
+        (ur * r * af::sin(o) / ra + uo * ra * af::cos(o)) * af::cos(p) -
+        up * r * af::sin(o) * af::sin(p);
+    af::array uy =
+        (ur * r * af::sin(o) / ra + uo * ra * af::cos(o)) * af::sin(p) +
+        up * r * af::sin(o) * af::cos(p);
+    af::array uz              = ur * af::cos(o) - uo * r * af::sin(o);
+    af::array transformed_vel = af::join(0, ux, uy, uz);
+
+    return transformed_vel;
+}
+
+/**
+ * @brief Transform the velocity vectors from cartesian to oblate coordinates
+ *
+ * @param vel
+ * @param pos
+ * @return af::array
+ */
+af::array cart_to_oblate_velocity(const af::array& vel, const af::array& pos) {
+    if (vel.dims() != pos.dims())
+        throw make_error("Arrays must have the same dimensions");
+    else if (pos.dims()[0] != 3)
+        throw make_error("Arrays must have 3 principal coordinates");
+
+    af::array x = pos(0, af::span);
+    af::array y = pos(1, af::span);
+    af::array z = pos(2, af::span);
+
+    auto a    = J / M;
+    auto diff = x * x + y * y + z * z - a * a;
+
+    af::array r =
+        af::sqrt((diff + af::sqrt(diff * diff + z * z * a * a * 4.0)) / 2.0);
+    af::array o = af::acos(z / r);
+    af::array p = af::atan2(y, x);
+
+    af::array ux = vel(0, af::span);
+    af::array uy = vel(1, af::span);
+    af::array uz = vel(2, af::span);
+
+    af::array ra = r * r + a * a;
+    af::array ur = ((ux * x + uy * y) * r + uz * ra * z / r) /
+                   (r * r + af::pow(a * af::cos(o), 2.0));
+    af::array up = (uy * x - ux * y) / (x * x + y * y);
+    af::array uo = ((ux * x + uy * y) / af::tan(o) - uz * z * af::tan(o)) /
+                   (r * r + af::pow(a * af::cos(o), 2.0));
+    af::array transformed_vel = af::join(0, ur, uo, up);
+
+    return transformed_vel;
+}
+
+/**
+ * @brief Transform the position vectors from spherical to cartesian coordinates
+ *
+ * @param pos
+ * @return af::array
+ */
+af::array sph_to_cart_position(const af::array& pos) {
+    af::array r = pos(0, af::span);
+    af::array o = pos(1, af::span);
+    af::array p = pos(2, af::span);
+
+    af::array x = r * af::sin(o) * af::cos(p);
+    af::array y = r * af::sin(o) * af::sin(p);
+    af::array z = r * af::cos(o);
+
+    af::array transformed_pos = af::join(0, x, y, z);
+
+    return transformed_pos;
+}
+
+/**
+ * @brief Computes the inverse of a 4x4 matrix with the layout
+ *          [ a 0 0 b ]
+ *          [ 0 c 0 0 ]
+ *          [ 0 0 d 0 ]
+ *          [ b 0 0 e ]
+ *
+ * @param metric af::array with the shape af::dim4(4, 4, M, N)
+ *
+ * @return af::array with the shape af::dim4(4, 4, M, N)
+ */
+af::array inv_metric(const af::array& metric) {
+    af::array a = metric(0, 0, af::span);
+    af::array b = metric(3, 0, af::span);
+    af::array c = metric(1, 1, af::span);
+    af::array d = metric(2, 2, af::span);
+    af::array e = metric(3, 3, af::span);
+
+    af::array det = b * b - a * e;
+
+    auto res = af::constant(0, 4, 4, metric.dims()[2], metric.dims()[3], f64);
+
+    res(0, 0, af::span) = -e / det;
+    res(0, 3, af::span) = b / det;
+    res(3, 0, af::span) = b / det;
+    res(1, 1, af::span) = 1.0 / c;
+    res(2, 2, af::span) = 1.0 / d;
+    res(3, 3, af::span) = -a / det;
+
+    return res;
+}
+
+/**
+ * @brief Computes the 4x4 metric matrix for the given 4-vector positions
+ *
+ * @param pos af::dim4(4, N)
+ * @return af::array af::dim4(4, 4, N)
+ */
+af::array metric4(const af::array& pos) {
+    if (pos.dims()[0] != 4)
+        throw make_error("Arrays must have 4 principal coordinates");
+
+    auto dims = pos.dims();
+
+    af::array t = af::moddims(pos(0, af::span), 1, 1, dims[1]);
+    af::array r = af::moddims(pos(1, af::span), 1, 1, dims[1]);
+    af::array o = af::moddims(pos(2, af::span), 1, 1, dims[1]);
+    af::array p = af::moddims(pos(3, af::span), 1, 1, dims[1]);
+
+    af::array gtt, gtr, gto, gtp, grt, grr, gro, grp, got, gor, goo, gop, gpt,
+        gpr, gpo, gpp;
+
+    switch (scene) {
+        // ******* Kerr Black Hole Metric *******
+        case Scene::ROTATE_BH: {
+            auto rs    = 2.0 * M;
+            auto a     = J / M;
+            auto delta = (r - rs) * r + a * a;
+            auto sigma = r * r + af::pow(a * af::cos(o), 2);
+
+            gtt = 1.0 - r * rs / sigma;
+            gtr = af::constant(0.0, 1, 1, dims[1], f64);
+            gto = af::constant(0.0, 1, 1, dims[1], f64);
+            gtp = rs * r * a * af::pow(af::sin(o), 2.0) / sigma;
+            grr = -sigma / delta;
+            gro = af::constant(0.0, 1, 1, dims[1], f64);
+            grp = af::constant(0.0, 1, 1, dims[1], f64);
+            goo = -sigma;
+            gop = af::constant(0.0, 1, 1, dims[1], f64);
+            gpp =
+                -(r * r + a * a + rs * r * af::pow(a * af::sin(o), 2) / sigma) *
+                af::pow(af::sin(o), 2);
+
+            break;
+        }
+
+        // ******* Schwarzschild Black Hole Metric *******
+        case Scene::STATIC_BH: {
+            gtt = 1.0 - 2.0 * M / r;
+            gtr = af::constant(0.0, 1, 1, dims[1], f64);
+            gto = af::constant(0.0, 1, 1, dims[1], f64);
+            gtp = af::constant(0.0, 1, 1, dims[1], f64);
+            grr = -1.0 / (1.0 - 2.0 * M / r);
+            gro = af::constant(0.0, 1, 1, dims[1], f64);
+            grp = af::constant(0.0, 1, 1, dims[1], f64);
+            goo = -r * r;
+            gop = af::constant(0.0, 1, 1, dims[1], f64);
+            gpp = -af::pow(r * af::sin(o), 2);
+
+            break;
+        }
+
+        // ******* Ellis Wormhole Metric *******
+        case Scene::WORMHOLE: {
+            gtt = af::constant(1.0, 1, 1, dims[1], f64);
+            gtr = af::constant(0.0, 1, 1, dims[1], f64);
+            gto = af::constant(0.0, 1, 1, dims[1], f64);
+            gtp = af::constant(0.0, 1, 1, dims[1], f64);
+            grr = -af::constant(1.0, 1, 1, dims[1], f64);
+            gro = af::constant(0.0, 1, 1, dims[1], f64);
+            grp = af::constant(0.0, 1, 1, dims[1], f64);
+            goo = -(r * r + b * b);
+            gop = af::constant(0.0, 1, 1, dims[1], f64);
+            gpp = -(r * r + b * b) * af::pow(af::sin(o), 2);
+
+            break;
+        }
+
+        default: throw make_error("Unknown scene");
+    }
+
+    auto res = af::join(
+        0, af::join(1, gtt, gtr, gto, gtp), af::join(1, gtr, grr, gro, grp),
+        af::join(1, gto, gro, goo, gop), af::join(1, gtp, grp, gop, gpp));
+
+    return res;
+}
+
+/**
+ * @brief Computes the dot product as defined by a metric between two 4-vector
+ * velocities
+ *
+ * @param pos
+ * @param lhs
+ * @param rhs
+ * @return af::array
+ */
+af::array dot_product(const af::array& pos, const af::array& lhs,
+                      const af::array& rhs) {
+    if (pos.dims() != lhs.dims())
+        throw make_error(
+            "Position and lhs velocity must have the same dimensions");
+    else if (lhs.dims() != rhs.dims())
+        throw make_error(
+            "Position and rhs velocity must have the same dimensions");
+    else if (rhs.dims()[0] != 4)
+        throw make_error("Arrays must have 4 principal coordinates");
+
+    return af::matmul(af::moddims(lhs, 1, 4, lhs.dims()[1]), metric4(pos),
+                      af::moddims(rhs, 4, 1, rhs.dims()[1]));
+}
+
+af::array norm4(const af::array& pos, const af::array& vel) {
+    return dot_product(pos, vel, vel);
+}
+
+af::array partials(const af::array& pos4, uint32_t index, double rel_diff,
+                   double abs_diff) {
+    double arr[4] = {0.0};
+    arr[index]    = 1.0;
+
+    auto pos_diff = pos4 * rel_diff + abs_diff;
+    auto h4       = pos_diff * af::array(af::dim4(4, 1), arr);
+    af::array h =
+        af::moddims(pos_diff(index, af::span), af::dim4(1, 1, pos4.dims()[1]));
+
+    return (-metric4(pos4 + h4 * 2.0) + metric4(pos4 + h4) * 8.0 -
+            metric4(pos4 - h4) * 8.0 + metric4(pos4 - h4 * 2.0)) /
+           (h * 12.0);
+}
+
+/**
+ * @brief Computes the geodesics from the established metric, 4-vector positions
+ * and velocities
+ *
+ * @param pos4
+ * @param vel4
+ * @return af::array
+ */
+af::array geodesics(const af::array& pos4, const af::array& vel4) {
+    auto N = vel4.dims()[1];
+
+    af::array uu = af::matmul(af::moddims(vel4, af::dim4(4, 1, N)),
+                              af::moddims(vel4, af::dim4(1, 4, N)));
+    uu           = af::moddims(uu, af::dim4(1, 4, 4, N));
+
+    af::array metric    = metric4(pos4);
+    af::array invmetric = af::moddims(inv_metric(metric), af::dim4(4, 4, 1, N));
+
+    // Compute the partials of the metric with respect to each coordinate
+    af::array dt = af::constant(0, 4, 4, 1, N, f64);
+
+    auto dr     = partials(pos4, 1, 1e-6, 1e-12);
+    auto dtheta = partials(pos4, 2, 1e-6, 1e-12);
+    auto dphi   = partials(pos4, 3, 1e-6, 1e-12);
+
+    dr     = af::moddims(dr, af::dim4(4, 4, 1, N));
+    dtheta = af::moddims(dtheta, af::dim4(4, 4, 1, N));
+    dphi   = af::moddims(dphi, af::dim4(4, 4, 1, N));
+
+    // Compute the einsum for each of the Christoffel terms
+    af::array partials = af::join(2, dt, dr, dtheta, dphi);
+    af::array p1       = af::matmul(invmetric, partials);
+    af::array p2       = af::reorder(p1, 0, 2, 1, 3);
+    af::array p3 = af::matmul(invmetric, af::reorder(partials, 2, 0, 1, 3));
+
+    auto christoffels = -0.5 * (p1 + p2 - p3);
+
+    // Use the geodesic equation to find the 4-vector acceleration
+    return af::moddims(af::sum(af::sum(christoffels * uu, 1), 2),
+                       af::dim4(4, N));
+}
+
+/**
+ * @brief Camera struct
+ *
+ * Contains all the data pertaining to the parameters for the image as seen from
+ * the camera
+ *
+ */
+struct Camera {
+    af::array position;
+    af::array lookat;
+    double fov;
+    double focal_length;
+    uint32_t width;
+    uint32_t height;
+
+    af::array direction;
+    af::array vertical;
+    af::array horizontal;
+    double aspect_ratio;
+
+    Camera(const af::array& position_, const af::array& lookat_, double fov_,
+           double focal_length_, uint32_t viewport_width_,
+           uint32_t viewport_height_)
+        : position(position_)
+        , lookat(lookat_)
+        , fov(fov_)
+        , focal_length(focal_length_)
+        , width(viewport_width_)
+        , height(viewport_height_) {
+        auto global_vertical = af::array(3, {0.0, 0.0, 1.0});
+
+        // Compute the camera's three main axes
+        direction  = normalize3(lookat - position);
+        horizontal = normalize3(cross_product(direction, global_vertical));
+        vertical   = normalize3(cross_product(direction, horizontal));
+
+        aspect_ratio = (double)width / (double)height;
+    }
+
+    /**
+     * @brief Generates the initial rays' 4-vector positions and velocities
+     * (directions) for the simulation
+     *
+     * @return std::pair<af::array, af::array> (pos4, vel4)
+     */
+    std::pair<af::array, af::array> generate_viewport_4rays() {
+        auto& camera_direction  = direction;
+        auto& camera_horizontal = horizontal;
+        auto& camera_vertical   = vertical;
+        auto& camera_position   = position;
+        auto vfov               = fov;
+
+        double viewport_height = 2.0 * focal_length * std::tan(vfov / 2.0);
+        double viewport_width  = aspect_ratio * viewport_height;
+
+        // Create rays in equally spaced directions of the viewport
+        af::array viewport_rays = af::constant(0, 3, width, height, f64);
+        viewport_rays +=
+            (af::iota(af::dim4(1, width, 1), af::dim4(1, 1, height), f64) /
+                 (width - 1) -
+             0.5) *
+            viewport_width * camera_horizontal;
+        viewport_rays +=
+            (af::iota(af::dim4(1, 1, height), af::dim4(1, width, 1), f64) /
+                 (height - 1) -
+             0.5) *
+            viewport_height * camera_vertical;
+        viewport_rays += focal_length * camera_direction;
+        viewport_rays = af::moddims(af::reorder(viewport_rays, 1, 2, 0),
+                                    af::dim4(width * height, 3))
+                            .T();
+
+        // Compute the initial position from which the rays are launched
+        af::array viewport_position = viewport_rays + camera_position;
+        af::array viewport_sph_pos;
+        if (scene != Scene::ROTATE_BH)
+            viewport_sph_pos = cart_to_sph_position(viewport_position);
+        else
+            viewport_sph_pos = cart_to_oblate_position(viewport_position);
+
+        // Normalize the ray directions
+        viewport_rays = normalize3(viewport_rays);
+
+        // Generate the position 4-vector
+        af::array camera_sph_pos;
+        if (scene != Scene::ROTATE_BH)
+            camera_sph_pos = cart_to_sph_position(camera_position);
+        else
+            camera_sph_pos = cart_to_oblate_position(camera_position);
+
+        af::array camera_pos4 =
+            af::join(0, af::constant(0.0, 1, f64), camera_sph_pos);
+        double camera_velocity =
+            1.0 /
+            af::sqrt(norm4(camera_pos4, af::array(4, {1.0, 0.0, 0.0, 0.0})))
+                .scalar<double>();
+        af::array camera_vel4 = af::array(4, {camera_velocity, 0.0, 0.0, 0.0});
+
+        af::array viewport_rays_pos4 = af::join(
+            0, af::constant(0.0, 1, width * height, f64), viewport_sph_pos);
+
+        // Generate the velocity 4-vector by setting the camera to be stationary
+        // with respect to an observer at infinity
+        af::array vv;
+        if (scene != Scene::ROTATE_BH)
+            vv = cart_to_sph_velocity(viewport_rays, viewport_position);
+        else
+            vv = cart_to_oblate_velocity(viewport_rays, viewport_position);
+
+        af::array vvr = vv(0, af::span);
+        af::array vvo = vv(1, af::span);
+        af::array vvp = vv(2, af::span);
+        auto viewport_sph_rays4 =
+            af::join(0, af::constant(1, 1, width * height, f64), vvr, vvo, vvp);
+
+        af::array dot = af::moddims(
+            af::matmul(metric4(viewport_rays_pos4),
+                       af::moddims(viewport_sph_rays4 * viewport_sph_rays4,
+                                   af::dim4(4, 1, width * height))),
+            af::dim4(4, width * height));
+
+        // Normalize the 4-velocity vectors
+        af::array viewport_vel =
+            af::sqrt(-af::array(dot(0, af::span)) /
+                     (dot(1, af::span) + dot(2, af::span) + dot(3, af::span)));
+        af::array viewport_rays_vel4 =
+            af::join(0, af::constant(camera_velocity, 1, width * height, f64),
+                     vv * viewport_vel * camera_velocity);
+
+        return {viewport_rays_pos4, viewport_rays_vel4};
+    }
+};
+
+/**
+ * @brief Object struct
+ *
+ * Contains the methods for testing if a ray has collided with the object
+ */
+struct Object {
+    using HasHit = af::array;
+    using HitPos = af::array;
+    virtual ~Object() = default;
+
+    /**
+     * @brief Gets the color of the pixel that corresponds to the ray that has
+     * intersected with the object
+     *
+     * @param ray_begin beginning of the ray segment
+     * @param ray_end end of the ray segment
+     * @return af::array
+     */
+    virtual af::array get_color(const af::array& ray_begin,
+                                const af::array& ray_end) const = 0;
+
+    /**
+     * @brief Returns a bool array of which rays have hit the object and the
+     * corresponding positions where the rays have hit
+     *
+     * @param ray_begin
+     * @param ray_end
+     * @return std::pair<HasHit, HitPos>
+     */
+    virtual std::pair<HasHit, HitPos> intersect(
+        const af::array& ray_begin, const af::array& ray_end) const = 0;
+};
+
+struct AccretionDisk : public Object {
+    af::array disk_color;
+    af::array center;
+    af::array normal;
+    double inner_radius;
+    double outter_radius;
+
+    AccretionDisk(const af::array& center, const af::array& normal,
+                  double inner_radius, double outter_radius)
+        : disk_color(af::array(3, {209.f, 77.f, 0.f}))
+        , center(center)
+        , normal(normal)
+        , inner_radius(inner_radius)
+        , outter_radius(outter_radius) {
+        // disk_color = af::array(3, {254.f, 168.f, 29.f});
+    }
+
+    std::pair<HasHit, HitPos> intersect(
+        const af::array& ray_begin, const af::array& ray_end) const override {
+        uint32_t count = ray_begin.dims()[1];
+
+        // Compute intersection of ray with a plane
+        af::array has_hit = af::constant(0, count).as(b8);
+        af::array hit_pos = ray_end;
+        af::array a       = dot3(normal, center - ray_begin);
+        af::array b       = dot3(normal, ray_end - ray_begin);
+        af::array t       = af::select(b != 0.0, a / b, (double)0.0);
+
+        af::array plane_intersect = (ray_end - ray_begin) * t + ray_begin;
+        af::array dist            = norm3(plane_intersect - center);
+
+        t = af::abs(t);
+
+        // Determine if the intersection falls between the disk radii and
+        // occurs within the current ray segment
+        has_hit = af::moddims((dist < outter_radius) && (t <= 1.0) &&
+                                  (t > 0.0) && (dist > inner_radius),
+                              af::dim4(count));
+        hit_pos = plane_intersect;
+
+        return {has_hit, hit_pos};
+    }
+
+    af::array get_color(const af::array& ray_begin,
+                        const af::array& ray_end) const override {
+        auto pair = intersect(ray_begin, ray_end);
+        af::array hit = pair.first;
+        af::array pos = pair.second;
+
+        auto val = 1.f - (norm3(pos - center).T() - inner_radius) /
+                             (outter_radius - inner_radius);
+
+        af::array color =
+            disk_color.T() * 1.5f * (val * val * (val * -2.f + 3.f)).as(f32);
+
+        return af::select(af::tile(hit, af::dim4(1, 3)), color, 0.f);
+    }
+};
+/**
+ * @brief Background struct
+ *
+ * Contains the methods for getting the color of the background image
+ *
+ */
+struct Background {
+    af::array image;
+
+    Background(const af::array& image_) { image = image_; }
+
+    af::array get_color(const af::array& ray_dir) const {
+        auto spherical_dir = cart_to_sph_position(ray_dir);
+
+        auto img_height = image.dims()[0];
+        auto img_width  = image.dims()[1];
+        auto count      = ray_dir.dims()[1];
+
+        // Spherical mapping of the direction to a pixel of the image
+        af::array o = spherical_dir(1, af::span);
+        af::array p = spherical_dir(2, af::span);
+
+        auto x = (p / af::Pi + 1.0) * img_width / 2.0;
+        auto y = (o / af::Pi) * img_height;
+
+        // Interpolate the colors of the image from the calculated pixel
+        // positions
+        af::array colors = af::approx2(image, af::moddims(y.as(f32), count),
+                                       af::moddims(x.as(f32), count),
+                                       af::interpType::AF_INTERP_CUBIC_SPLINE);
+
+        // Zero out the color of any null rays
+        colors = af::moddims(colors, af::dim4(count, 3));
+        af::replace(colors, !af::isNaN(colors), 0.f);
+
+        return colors;
+    }
+};
+
+/**
+ * @brief Transform the array of pixels to the correct image format to display
+ *
+ * @param image
+ * @param width
+ * @param height
+ * @return af::array
+ */
+af::array rearrange_image(const af::array& image, uint32_t width,
+                          uint32_t height) {
+    return af::clamp(af::moddims(image, af::dim4(width, height, 3)).T(), 0.0,
+                     255.0)
+               .as(f32) /
+           255.f;
+}
+
+/**
+ * @brief Returns an rgb image containing the raytraced black hole from the
+ * camera rays, spacetime metric, objects living in the space, and background
+ *
+ * @param initial_pos initial position from where the rays are launched
+ * @param initial_vel initial velocities (directions) the rays have
+ * @param objects the objects the rays can collide with
+ * @param background the background of the scene
+ * @param width width of the image the camera produces
+ * @param height height of the image the camera produces
+ * @param time how long the rays are traced through space
+ * @param tol error tolerance used to adapt the integration step size
+ * @param checks the number of steps between checks for whether the rays have
+ * collided with an object
+ * @return af::array
+ */
+af::array generate_image(const af::array& initial_pos,
+                         const af::array& initial_vel,
+                         const std::vector<std::unique_ptr<Object> >& objects,
+                         const Background& background, uint32_t width,
+                         uint32_t height, double time, double tol,
+                         uint32_t checks = 10) {
+    uint32_t lines = initial_pos.dims()[1];
+
+    auto def_step = 0.5 * pow(tol, 0.25);
+    auto dt       = af::constant(def_step, 1, lines, f64);
+    auto t        = af::constant(0.0, 1, lines, f64);
+    auto index    = af::iota(lines);
+    auto selected = t < time;
+
+    auto result = af::constant(0, lines, 3, f32);
+
+    auto pos = initial_pos;
+    auto vel = initial_vel;
+
+    af::Window window{(int)width, (int)height, "Black Hole Raytracing"};
+
+    af::array bg_col = af::constant(0.f, lines, 3);
+    af::array begin_pos, end_pos;
+    af::array bh_nohit;
+
+    if (scene != Scene::ROTATE_BH)
+        begin_pos = sph_to_cart_position(pos(af::seq(1, 3), af::span));
+    else
+        begin_pos = oblate_to_cart_position(pos(af::seq(1, 3), af::span));
+    end_pos = begin_pos;
+
+    int i = 0;
+
+    while (t.dims()[1] != 0 && af::anyTrue<bool>(t < time) &&
+           af::anyTrue<bool>(dt != 0.0)) {
+        // Displays the current progress and approximate time needed to finish
+        // it
+        status_bar((lines - t.dims()[1]) * time +
+                       af::sum<double>(af::clamp(t, 0.0, time)),
+                   time * lines, "Progress:");
+
+        // RK34 method for second order differential equation
+        auto dt2 = dt * dt;
+        auto k1  = geodesics(pos, vel);
+        auto k2  = geodesics(pos + vel * dt / 4.0 + k1 * dt2 / 32.0,
+                             vel + k1 * dt / 4.0);
+        auto k3  = geodesics(pos + vel * dt / 2.0 + (k1 + k2) * dt2 / 16.0,
+                             vel + k2 * dt / 2.0);
+        auto k4  = geodesics(pos + vel * dt + (k1 - k2 + k3 * 2.0) * dt2 / 4.0,
+                             vel + (k1 - k2 * 2.0 + 2.0 * k3) * dt);
+
+        auto diff4 = (k1 + k2 * 8.0 + k3 * 2.0 + k4) / 24.0;
+        auto diff3 = (k2 * 8.0 + k4) / 18.0;
+
+        auto err    = (af::max)(af::abs(diff4 - diff3), 0) * dt2;
+        auto maxerr = tol * (1.0 + (af::max)(af::abs(pos), 0));
+
+        auto rdt = af::constant(0, 1, dt.dims()[1], f64);
+        af::replace(rdt, err > maxerr, dt);
+
+        auto rdt2 = rdt * rdt;
+
+        pos += vel * rdt + (k1 + k2 * 8.0 + k3 * 2.0 + k4) * rdt2 / 24.0;
+        vel += (k1 + k3 * 4.0 + k4) * rdt / 6.0;
+        t += rdt;
+
+        auto q = af::clamp(0.8 * af::pow(maxerr / err, 0.25), 0.0, 5.0);
+
+        // Select the next time step
+        dt = af::select(q * dt < (time - t), q * dt, af::abs(time - t));
+
+        // Update image
+        if (i % checks == (checks - 1)) {
+            af::array ray_dir;
+            if (scene != Scene::ROTATE_BH) {
+                end_pos(af::span, index) =
+                    sph_to_cart_position(pos(af::seq(1, 3), af::span));
+                ray_dir = sph_to_cart_velocity(vel(af::seq(1, 3), af::span),
+                                               pos(af::seq(1, 3), af::span));
+            } else {
+                end_pos(af::span, index) =
+                    oblate_to_cart_position(pos(af::seq(1, 3), af::span));
+                ray_dir = oblate_to_cart_velocity(vel(af::seq(1, 3), af::span),
+                                                  pos(af::seq(1, 3), af::span));
+            }
+
+            af::array s_begin_pos = begin_pos(af::span, index);
+            af::array s_end_pos   = end_pos(af::span, index);
+
+            // Check if the light rays intersect an object
+            for (const auto& obj : objects) {
+                result(index, af::span) +=
+                    obj->get_color(s_begin_pos, s_end_pos);
+            }
+
+            // Update background colors from rays
+            bg_col(index, af::span) = background.get_color(ray_dir);
+
+            // Display image
+            window.image(rearrange_image(result + bg_col, width, height));
+
+            begin_pos = end_pos;
+        }
+
+        // Stop rays entering the event horizon
+        switch (scene) {
+            case Scene::ROTATE_BH: {
+                auto a = J / M;
+                bh_nohit =
+                    (pos(1, af::span) > 1.01 * (M + std::sqrt(M * M - a * a)));
+                selected = bh_nohit && (t < time);
+
+                break;
+            }
+
+            case Scene::STATIC_BH: {
+                bh_nohit = pos(1, af::span) > 2.0 * M * 1.01;
+                selected = bh_nohit && (t < time);
+
+                break;
+            }
+
+            case Scene::WORMHOLE: {
+                selected = (t < time);
+            }
+            default: break;
+        }
+
+        // Remove finished rays from computation
+        if (af::sum<float>(selected.as(f32)) / (float)index.dims()[0] < 0.75) {
+            if (scene == Scene::STATIC_BH || scene == Scene::ROTATE_BH)
+                bg_col(af::array(index(!bh_nohit)), af::span) = 0.f;
+
+            index = index(selected);
+            pos   = pos(af::span, selected);
+            vel   = vel(af::span, selected);
+            dt    = dt(af::span, selected);
+            t     = t(af::span, selected);
+
+            // Free finished rays memory
+            af::deviceGC();
+        }
+
+        ++i;
+    }
+
+    result += bg_col;
+
+    return rearrange_image(result, width, height);
+}
+
+void raytracing(uint32_t width, uint32_t height) {
+    // Set the parameters of the raytraced image
+    double vfov         = radians(90.0);
+    double focal_length = 0.01;
+
+    // Set the parameters of the camera
+    af::array global_vertical            = af::array(3, {0.0, 0.0, 1.0});
+    af::array camera_position            = af::array(3, {-7.0, 6.0, 2.0});
+    af::array camera_lookat              = af::array(3, {0.0, 0.0, 0.0});
+    double accretion_inner_radius        = M * 3.0;
+    double accretion_outter_radius       = M * 8.0;
+    double simulation_tolerance          = 1e-6;
+    double max_simulation_time           = 12.;
+    uint32_t num_steps_per_collide_check = 1;
+
+    // Set the background of the scene
+    auto bg_image =
+        af::loadImage(ASSETS_DIR "/examples/images/westerlund.jpg", true);
+    auto background = Background(bg_image);
+
+    // Set the objects living in the scene
+    std::vector<std::unique_ptr<Object> > objects;
+    if (scene != Scene::WORMHOLE)
+        objects.push_back(std::make_unique<AccretionDisk>(
+            af::array(3, {0.0, 0.0, 0.0}), af::array(3, {0.0, 0.0, 1.0}),
+            accretion_inner_radius, accretion_outter_radius));
+
+    // Generate rays from the camera
+    auto camera = Camera(camera_position, camera_lookat, vfov, focal_length,
+                         width, height);
+    auto pair   = camera.generate_viewport_4rays();
+
+    auto ray4_pos = pair.first;
+    auto ray4_vel = pair.second;
+
+    auto begin = std::chrono::high_resolution_clock::now();
+    // Generate raytraced image
+    auto image = generate_image(
+        ray4_pos, ray4_vel, objects, background, width, height,
+        max_simulation_time, simulation_tolerance, num_steps_per_collide_check);
+
+    auto end = std::chrono::high_resolution_clock::now();
+
+    std::cout
+        << "\nSimulation took: "
+        << std::chrono::duration_cast<std::chrono::seconds>(end - begin).count()
+        << " s" << std::endl;
+
+    // Save image
+    af::saveImage("result.png", image);
+}
+
+int main(int argc, char** argv) {
+    int device = argc > 1 ? std::atoi(argv[1]) : 0;
+
+    int width  = argc > 2 ? std::atoi(argv[2]) : 200;
+    int height = argc > 3 ? std::atoi(argv[3]) : 200;
+
+    try {
+        af::setDevice(device);
+        af::info();
+
+        std::cout << "** ArrayFire Black Hole Raytracing Demo\n\n";
+
+        raytracing(width, height);
+    } catch (const af::exception& e) {
+        std::cerr << e.what() << std::endl;
+        return -1;
+    }
+
+    return 0;
+}
\ No newline at end of file
diff --git a/examples/pde/boltzmann_cfd.cpp b/examples/pde/boltzmann_cfd.cpp
new file mode 100644
index 0000000000..38882f3c5c
--- /dev/null
+++ b/examples/pde/boltzmann_cfd.cpp
@@ -0,0 +1,570 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*
+    This is a Computational Fluid Dynamics simulation using the Lattice
+   Boltzmann Method. For this simulation we are using D2N9 (2 dimensions, 9
+   neighbors) with bounce-back boundary conditions. For more information on
+   the simulation equations, check out
+   https://en.wikipedia.org/wiki/Lattice_Boltzmann_methods#Mathematical_equations_for_simulations
+
+    The initial conditions of the fluid are obtained from three images that
+   specify their properties using the function read_initial_condition_arrays.
+   These images can be modified to simulate different cases.
+*/
+
+#include <arrayfire.h>
+#include <chrono>
+#include <iostream>
+#include <thread>
+
+/*
+    Values of the D2N9 grid are ordered as follows:
+
+
+          -1      0       1
+      * ----------------------> x
+  -1   |   6      3       0
+       |
+   0   |   7      4       1
+       |
+   1   |   8      5       2
+       |
+       v
+       y
+
+    The (-1, 0, 1) refer to the x and y offsets with respect to a single cell
+    and the (0-8) refer to indices of each cell in the 3x3 grid
+
+    E.g., the element with index 4 is the center of the grid, which has an
+  x-offset = ex_vals[4] = 0 and a y-offset = ey_vals[4] = 0, with its
+  quantities being weighted with weight wt_vals[4] = 16/36
+*/
+
+static const float ex_vals[] = {1.0, 1.0, 1.0, 0.0, 0.0, 0.0, -1.0, -1.0, -1.0};
+
+static const float ey_vals[] = {1.0, 0.0, -1.0, 1.0, 0.0, -1.0, 1.0, 0.0, -1.0};
+
+static const float wt_vals[] = {1.0f / 36.0f, 4.0f / 36.0f,  1.0f / 36.0f,
+                                4.0f / 36.0f, 16.0f / 36.0f, 4.0f / 36.0f,
+                                1.0f / 36.0f, 4.0f / 36.0f,  1.0f / 36.0f};
+
+static const int opposite_indices[] = {8, 7, 6, 5, 4, 3, 2, 1, 0};
+
+struct Simulation {
+    // Fluid quantities
+    af::array ux;
+    af::array uy;
+    af::array rho;
+    af::array sigma;
+    af::array f;
+    af::array feq;
+
+    // Constant velocity boundary conditions positions
+    af::array set_boundaries;
+
+    // Simulation Parameters
+    size_t grid_width;
+    size_t grid_height;
+    float density;
+    float velocity;
+    float reynolds;
+
+    // Helper arrays stored for computation
+    af::array ex;
+    af::array ey;
+    af::array wt;
+
+    af::array ex_T;
+    af::array ey_T;
+    af::array wt_T;
+
+    af::array ex_;
+    af::array ey_;
+};
+
+/**
+ * @brief Create a simulation object containing all the initial parameters and
+ * condition of the simulation
+ *
+ * @details
+ * For the ux, uy, and boundary images, we use RGB values to define the
+ * specific quantities for each grid cell/pixel
+ *
+ * /// R & B for ux & uy
+ *
+ * For ux and uy, Red means positive value while Blue means negative value. The
+ * speed value for both ux and uy is computed as $(R - B) * velocity / 255$.
+ *
+ * For example, if the same pixel in the two images had ux = RGB(255,0,0) and
+ * uy = RGB(0,0,255), that cell's fluid has an x-velocity of +v and a
+ * y-velocity of -v, where v is the velocity quantity passed to this function.
+ *
+ * Note that having the same value in the R and B components will cancel each
+ * other out, i.e., the fluid has 0 velocity in that direction, the same as if
+ * both components were 0.
+ *
+ * /// G for ux & uy
+ *
+ * The G component is reserved for an object or obstacle. Any non-zero value for
+ * the green component represents a hard boundary in the simulation
+ *
+ * /// RGB for boundary
+ *
+ * Any non-zero value for any of the components in the RGB value of the pixel
+ * means that the initial values passed for ux and uy will remain constant
+ * throughout the simulation
+ *
+ */
+Simulation create_simulation(uint32_t grid_width, uint32_t grid_height,
+                             float density, float velocity, float reynolds,
+                             const char* ux_image_filename,
+                             const char* uy_image_filename,
+                             const char* boundaries_filename) {
+    Simulation sim;
+
+    sim.grid_width  = grid_width;
+    sim.grid_height = grid_height;
+    sim.velocity    = velocity;
+    sim.density     = density;
+    sim.reynolds    = reynolds;
+
+    try {
+        sim.ux = af::loadImage(ux_image_filename, true);
+    } catch (const af::exception& e) {
+        std::cerr << e.what() << std::endl;
+        sim.ux = af::constant(0, grid_width, grid_height, 3);
+    }
+
+    auto ux_dim = sim.ux.dims();
+    if (ux_dim[0] != grid_width || ux_dim[1] != grid_height) {
+        std::cerr
+            << "Fluid flow ux image has dimensions different to the simulation"
+            << std::endl;
+        throw std::runtime_error{
+            "Fluid flow ux image has dimensions different to the simulation"};
+    }
+
+    try {
+        sim.uy = af::loadImage(uy_image_filename, true);
+    } catch (const af::exception& e) {
+        std::cerr << e.what() << std::endl;
+        sim.uy = af::constant(0, grid_width, grid_height, 3);
+    }
+
+    auto uy_dim = sim.uy.dims();
+    if (uy_dim[0] != grid_width || uy_dim[1] != grid_height) {
+        std::cerr
+            << "Fluid flow uy image has dimensions different to the simulation"
+            << std::endl;
+        throw std::runtime_error{
+            "Fluid flow uy image has dimensions different to the simulation"};
+    }
+
+    try {
+        sim.set_boundaries = af::loadImage(boundaries_filename, false);
+    } catch (const af::exception& e) {
+        std::cerr << e.what() << std::endl;
+        sim.set_boundaries = af::constant(0, grid_width, grid_height);
+    }
+
+    auto b_dim = sim.set_boundaries.dims();
+    if (b_dim[0] != grid_width || b_dim[1] != grid_height) {
+        std::cerr
+            << "Fluid boundary image has dimensions different to the simulation"
+            << std::endl;
+        throw std::runtime_error{
+            "Fluid boundary image has dimensions different to the simulation"};
+    }
+
+    sim.ux = (sim.ux(af::span, af::span, 0).T() -
+              sim.ux(af::span, af::span, 2).T()) *
+             velocity / 255.f;
+    sim.uy = (sim.uy(af::span, af::span, 0).T() -
+              sim.uy(af::span, af::span, 2).T()) *
+             velocity / 255.f;
+    sim.set_boundaries = sim.set_boundaries.T() > 0;
+
+    return sim;
+}
+
+/**
+ * @brief Initializes internal values used for computation
+ *
+ */
+void initialize(Simulation& sim) {
+    auto& ux    = sim.ux;
+    auto& uy    = sim.uy;
+    auto& rho   = sim.rho;
+    auto& sigma = sim.sigma;
+    auto& f     = sim.f;
+    auto& feq   = sim.feq;
+
+    auto& ex   = sim.ex;
+    auto& ey   = sim.ey;
+    auto& wt   = sim.wt;
+    auto& ex_  = sim.ex_;
+    auto& ey_  = sim.ey_;
+    auto& ex_T = sim.ex_T;
+    auto& ey_T = sim.ey_T;
+    auto& wt_T = sim.wt_T;
+
+    auto density  = sim.density;
+    auto velocity = sim.velocity;
+    auto xcount   = sim.grid_width;
+    auto ycount   = sim.grid_height;
+
+    ex = af::array(1, 1, 9, ex_vals);
+    ey = af::array(1, 1, 9, ey_vals);
+    wt = af::array(1, 1, 9, wt_vals);
+
+    ex_T = af::array(1, 9, ex_vals);
+    ey_T = af::array(1, 9, ey_vals);
+    wt_T = af::moddims(wt, af::dim4(1, 9));
+
+    rho   = af::constant(density, xcount, ycount, f32);
+    sigma = af::constant(0, xcount, ycount, f32);
+
+    f = af::constant(0, xcount, ycount, 9, f32);
+
+    ex_ = af::tile(ex, xcount, ycount, 1);
+    ey_ = af::tile(ey, xcount, ycount, 1);
+
+    // Initialization of the distribution function
+    auto edotu = ex_ * ux + ey_ * uy;
+    auto udotu = ux * ux + uy * uy;
+
+    feq = rho * wt *
+          ((edotu * edotu * 4.5f) - (udotu * 1.5f) + (edotu * 3.0f) + 1.0f);
+    f = feq;
+}
+
+/**
+ * @brief Updates the particle distribution functions for the new simulation
+ * frame
+ *
+ */
+void collide_stream(Simulation& sim) {
+    auto& ux             = sim.ux;
+    auto& uy             = sim.uy;
+    auto& rho            = sim.rho;
+    auto& sigma          = sim.sigma;
+    auto& f              = sim.f;
+    auto& feq            = sim.feq;
+    auto& set_boundaries = sim.set_boundaries;
+
+    auto& ex   = sim.ex;
+    auto& ey   = sim.ey;
+    auto& wt   = sim.wt;
+    auto& ex_  = sim.ex_;
+    auto& ey_  = sim.ey_;
+    auto& ex_T = sim.ex_T;
+    auto& ey_T = sim.ey_T;
+    auto& wt_T = sim.wt_T;
+
+    auto density  = sim.density;
+    auto velocity = sim.velocity;
+    auto reynolds = sim.reynolds;
+    auto xcount   = sim.grid_width;
+    auto ycount   = sim.grid_height;
+
+    const float viscosity =
+        velocity * std::sqrt(static_cast<float>(xcount * ycount)) / reynolds;
+    const float tau  = 0.5f + 3.0f * viscosity;
+    const float csky = 0.16f;
+
+    auto edotu = ex_ * ux + ey_ * uy;
+    auto udotu = ux * ux + uy * uy;
+
+    // Compute the new distribution function
+    feq =
+        rho * wt * (edotu * edotu * 4.5f - udotu * 1.5f + edotu * 3.0f + 1.0f);
+
+    auto taut =
+        af::sqrt(sigma * (csky * csky * 18.0f * 0.25f) + (tau * tau * 0.25f)) -
+        (tau * 0.5f);
+
+    // Compute the shifted distribution functions
+    auto fplus = f - (f - feq) / (taut + tau);
+
+    // Stream each particle distribution along its corresponding D2Q9
+    // lattice velocity
+    for (int i = 0; i < 9; ++i) {
+        int xshift = static_cast<int>(ex_vals[i]);
+        int yshift = static_cast<int>(ey_vals[i]);
+
+        fplus(af::span, af::span, i) =
+            af::shift(fplus(af::span, af::span, i), xshift, yshift);
+    }
+
+    // Keep the boundary conditions at the borders the same
+    af::replace(fplus, af::tile(!set_boundaries, af::dim4(1, 1, 9)), f);
+
+    // Update the particle distribution
+    f = fplus;
+
+    // Computing u dot e at each of the boundaries
+    af::array ux_top = ux.rows(0, 2);
+    ux_top =
+        af::moddims(af::tile(ux_top, af::dim4(1, 3)).T(), af::dim4(ycount, 9));
+    af::array ux_bot = ux.rows(xcount - 3, xcount - 1);
+    ux_bot =
+        af::moddims(af::tile(ux_bot, af::dim4(1, 3)).T(), af::dim4(ycount, 9));
+
+    af::array uy_top = uy.rows(0, 2);
+    uy_top =
+        af::moddims(af::tile(uy_top, af::dim4(1, 3)).T(), af::dim4(ycount, 9));
+    af::array uy_bot = uy.rows(xcount - 3, xcount - 1);
+    uy_bot =
+        af::moddims(af::tile(uy_bot, af::dim4(1, 3)).T(), af::dim4(ycount, 9));
+
+    auto ux_lft = af::tile(ux.cols(0, 2), af::dim4(1, 3));
+    auto uy_lft = af::tile(uy.cols(0, 2), af::dim4(1, 3));
+    auto ux_rht = af::tile(ux.cols(ycount - 3, ycount - 1), af::dim4(1, 3));
+    auto uy_rht = af::tile(uy.cols(ycount - 3, ycount - 1), af::dim4(1, 3));
+
+    auto ubdoute_top = ux_top * ex_T + uy_top * ey_T;
+    auto ubdoute_bot = ux_bot * ex_T + uy_bot * ey_T;
+    auto ubdoute_lft = ux_lft * ex_T + uy_lft * ey_T;
+    auto ubdoute_rht = ux_rht * ex_T + uy_rht * ey_T;
+
+    // Computing bounce-back boundary conditions
+    auto fnew_top = af::moddims(fplus.row(1), af::dim4(ycount, 9)) -
+                    6.0 * density * wt_T * ubdoute_top;
+    auto fnew_bot = af::moddims(fplus.row(xcount - 2), af::dim4(ycount, 9)) -
+                    6.0 * density * wt_T * ubdoute_bot;
+    auto fnew_lft = af::moddims(fplus.col(1), af::dim4(xcount, 9)) -
+                    6.0 * density * wt_T * ubdoute_lft;
+    auto fnew_rht = af::moddims(fplus.col(ycount - 2), af::dim4(xcount, 9)) -
+                    6.0 * density * wt_T * ubdoute_rht;
+
+    // Update the values near the boundaries with the correct bounce-back
+    // boundary
+    for (int i = 0; i < 9; ++i) {
+        int xshift = static_cast<int>(ex_vals[i]);
+        int yshift = static_cast<int>(ey_vals[i]);
+        if (xshift == 1)
+            f(1, af::span, opposite_indices[i]) = fnew_top(af::span, i);
+        if (xshift == -1)
+            f(xcount - 2, af::span, opposite_indices[i]) =
+                fnew_bot(af::span, i);
+        if (yshift == 1)
+            f(af::span, 1, opposite_indices[i]) = fnew_lft(af::span, i);
+        if (yshift == -1)
+            f(af::span, ycount - 2, opposite_indices[i]) =
+                fnew_rht(af::span, i);
+    }
+}
+
+/**
+ * @brief Updates the velocity field, density and strain at each point in the
+ * grid
+ *
+ */
+void update(Simulation& sim) {
+    auto& ux    = sim.ux;
+    auto& uy    = sim.uy;
+    auto& rho   = sim.rho;
+    auto& sigma = sim.sigma;
+    auto& f     = sim.f;
+    auto& feq   = sim.feq;
+    auto& ex    = sim.ex;
+    auto& ey    = sim.ey;
+
+    auto e_tile = af::join(3, af::constant(1, 1, 1, 9), ex, ey);
+    auto result = af::sum(f * e_tile, 2);
+
+    rho = result(af::span, af::span, af::span, 0);
+    result /= rho;
+    ux = result(af::span, af::span, af::span, 1);
+    uy = result(af::span, af::span, af::span, 2);
+
+    // Above code equivalent to
+    // rho = af::sum(f, 2);
+    // ux = af::sum(f * ex, 2) / rho;
+    // uy = af::sum(f * ey, 2) / rho;
+
+    auto product   = f - feq;
+    auto e_product = af::join(3, ex * ex, ex * ey * std::sqrt(2), ey * ey);
+
+    sigma = af::sqrt(af::sum(af::pow(af::sum(product * e_product, 2), 2), 3));
+
+    // Above code equivalent to
+
+    // auto xx = af::sum(product * ex * ex, 2);
+    // auto xy = af::sum(product * ex * ey, 2);
+    // auto yy = af::sum(product * ey * ey, 2);
+
+    // sigma = af::sqrt(xx * xx + xy * xy * 2 + yy * yy);
+}
+
+af::array generate_image(size_t width, size_t height, const Simulation& sim) {
+    const auto& ux         = sim.ux;
+    const auto& uy         = sim.uy;
+    const auto& boundaries = sim.set_boundaries;
+    auto velocity          = sim.velocity;
+
+    float image_scale =
+        static_cast<float>(width) / static_cast<float>(sim.grid_width - 1);
+
+    // Relative Flow speed at each cell
+    auto val = af::sqrt(ux * ux + uy * uy) / velocity;
+
+    af::replace(val, val != 0 || !boundaries, -1.0);
+
+    // Scaling and interpolating flow speed to the window size
+    if (width != sim.grid_width || height != sim.grid_height)
+        val =
+            af::approx2(val, af::iota(width, af::dim4(1, height)) / image_scale,
+                        af::iota(height, af::dim4(1, width)).T() / image_scale);
+
+    // Transpose image
+    val = val.T();
+
+    auto image  = af::constant(0, height, width, 3);
+    auto image2 = image;
+
+    // Add custom coloring
+    image(af::span, af::span, 0) = val * 2;
+    image(af::span, af::span, 1) = val * 2;
+    image(af::span, af::span, 2) = 1.0 - val * 2;
+
+    image2(af::span, af::span, 0) = 1;
+    image2(af::span, af::span, 1) = -2 * val + 2;
+    image2(af::span, af::span, 2) = 0;
+
+    auto tile_val = af::tile(val, 1, 1, 3);
+    af::replace(image, tile_val < 0.5, image2);
+    af::replace(image, tile_val >= 0, 0.0);
+
+    return image;
+}
+
+void lattice_boltzmann_cfd_demo() {
+    // Define the lattice for the simulation
+    const size_t len         = 128;
+    const size_t grid_width  = len;
+    const size_t grid_height = len;
+
+    // Scale factor from grid cells to displayed window pixels
+    float scale = 4.0f;
+
+    // Forge window initialization
+    int width  = static_cast<int>(grid_width * scale);
+    int height = static_cast<int>(grid_height * scale);
+    af::Window window(width, height, "Driven Cavity Flow");
+
+    int frame_count       = 0;
+    int max_frames        = 20000;
+    int simulation_frames = 100;
+    float total_time      = 0;
+    float total_time2     = 0;
+
+    // CFD fluid parameters
+    const float density  = 2.7f;
+    const float velocity = 0.35f;
+    const float reynolds = 1e5f;
+
+    const char* ux_image = ASSETS_DIR "/examples/images/default_ux.bmp";
+    const char* uy_image = ASSETS_DIR "/examples/images/default_uy.bmp";
+    const char* set_boundary_image =
+        ASSETS_DIR "/examples/images/default_boundary.bmp";
+
+    // Tesla Valve Fluid Simulation - entering from constricted side
+    {
+        //           ux_image = ASSETS_DIR "/examples/images/left_tesla_ux.bmp";
+        //           uy_image = ASSETS_DIR "/examples/images/left_tesla_uy.bmp";
+        // set_boundary_image = ASSETS_DIR
+        // "/examples/images/left_tesla_boundary.bmp";
+    }
+
+    // Tesla Valve Fluid Simulation - entering from transfer side
+    {
+        //           ux_image = ASSETS_DIR
+        //           "/examples/images/right_tesla_ux.bmp"; uy_image =
+        //           ASSETS_DIR "/examples/images/right_tesla_uy.bmp";
+        // set_boundary_image = ASSETS_DIR
+        // "/examples/images/right_tesla_boundary.bmp";
+    }
+
+    // Reads the initial values of fluid quantities and simulation parameters
+    Simulation sim =
+        create_simulation(grid_width, grid_height, density, velocity, reynolds,
+                          ux_image, uy_image, set_boundary_image);
+
+    // Initializes the simulation quantities
+    initialize(sim);
+
+    while (!window.close() && frame_count != max_frames) {
+        af::sync();
+        auto begin = std::chrono::high_resolution_clock::now();
+
+        // Computes the new particle distribution functions for the new
+        // simulation frame
+        collide_stream(sim);
+
+        // Updates the velocity, density, and stress fields
+        update(sim);
+
+        af::sync();
+        auto end = std::chrono::high_resolution_clock::now();
+
+        // Calculate computation time of 1 simulation frame
+        auto duration =
+            std::chrono::duration_cast<std::chrono::microseconds>(end - begin)
+                .count();
+
+        // Used for computing the distribution of frame computation time
+        total_time += duration;
+        total_time2 += duration * duration;
+
+        // Every `simulation_frames` frames, display the last computed frame
+        // on the screen
+        if (frame_count % simulation_frames == 0) {
+            auto image = generate_image(width, height, sim);
+
+            // Display colored image
+            window.image(image);
+
+            float avg_time  = total_time / (float)simulation_frames;
+            float stdv_time = std::sqrt(total_time2 * simulation_frames -
+                                        total_time * total_time) /
+                              (float)simulation_frames;
+
+            std::cout << "Average Simulation Step Time: (" << avg_time
+                      << " +/- " << stdv_time
+                      << ") us; Total simulation time: " << total_time
+                      << " us; Simulation Frames: " << simulation_frames
+                      << std::endl;
+
+            total_time  = 0;
+            total_time2 = 0;
+        }
+
+        frame_count++;
+    }
+}
+
+int main(int argc, char** argv) {
+    int device = argc > 1 ? std::atoi(argv[1]) : 0;
+
+    try {
+        af::setDevice(device);
+        af::info();
+
+        std::cout << "** ArrayFire CFD Simulation Demo\n\n";
+
+        lattice_boltzmann_cfd_demo();
+    } catch (const af::exception& e) {
+        std::cerr << e.what() << std::endl;
+        return -1;
+    }
+
+    return 0;
+}
\ No newline at end of file
diff --git a/examples/pde/swe.cpp b/examples/pde/swe.cpp
new file mode 100644
index 0000000000..7e5a9af017
--- /dev/null
+++ b/examples/pde/swe.cpp
@@ -0,0 +1,118 @@
+#include <arrayfire.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+using namespace af;
+
+Window* win;
+
+array normalize(array a, float max) {
+    float mx = max * 0.5;
+    float mn = -max * 0.5;
+    return (a - mn) / (mx - mn);
+}
+
+static void swe(bool console) {
+    // Grid length, number and spacing
+    const unsigned Lx = 1600, nx = Lx + 1;
+    const unsigned Ly = 1600, ny = Ly + 1;
+    const float dx = Lx / (nx - 1);
+    const float dy = Ly / (ny - 1);
+
+    array ZERO = constant(0, nx, ny);
+    array um = ZERO, vm = ZERO;
+    unsigned io = (unsigned)floor(Lx / 6.0f), jo = (unsigned)floor(Ly / 6.0f),
+             k = 15;
+    array x    = tile(range(nx), 1, ny);
+    array y    = tile(range(dim4(1, ny), 1), nx, 1);
+
+    // initial condition
+    array etam =
+        0.01f * exp((-((x - io) * (x - io) + (y - jo) * (y - jo))) / (k * k));
+    float m_eta = max<float>(etam);
+    array eta   = etam;
+    float dt    = 0.5;
+
+    // conv kernels
+    float h_diff_kernel[] = {9.81f * (dt / dx), 0, -9.81f * (dt / dx)};
+    float h_lap_kernel[]  = {0, 1, 0, 1, -4, 1, 0, 1, 0};
+
+    array h_diff_kernel_arr(3, h_diff_kernel);
+    array h_lap_kernel_arr(3, 3, h_lap_kernel);
+
+    if (!console) {
+        win = new Window(1536, 768, "Shallow Water Equations");
+        win->grid(2, 2);
+    }
+
+    unsigned iter            = 0;
+    unsigned random_interval = 30;
+
+    while (console ? iter < 2000 : !win->close()) {  // console: no window
+        if (iter > 2000) {
+            // Initial condition
+            etam  = 0.01f * exp((-((x - io) * (x - io) + (y - jo) * (y - jo))) /
+                                (k * k));
+            m_eta = max<float>(etam);
+            eta   = etam;
+            iter  = 0;
+        }
+
+        // raindrops
+        if (iter % 100 == 0 || iter % 130 == 0 || iter % random_interval == 0) {
+            unsigned io     = (unsigned)floor(rand() % Lx),
+                     jo     = (unsigned)floor(rand() % Ly);
+            random_interval = rand() % 200 + 1;
+            eta += 0.01f * exp((-((x - io) * (x - io) + (y - jo) * (y - jo))) /
+                               (k * k));
+        }
+
+        // compute
+        array up   = um + convolve(eta, h_diff_kernel_arr);
+        array vp   = vm + convolve(eta, h_diff_kernel_arr.T());
+        array e    = convolve(eta, h_lap_kernel_arr);
+        array etap = 2 * eta - etam + (2 * dt * dt) / (dx * dy) * e;
+
+        etam = eta;
+        eta  = etap;
+
+        m_eta = max<float>(etam);
+        if (!console) {
+            (*win)(0, 0).setColorMap(AF_COLORMAP_BLUE);
+            array hist_out = histogram(normalize(eta, m_eta), 15);
+
+            (*win)(0, 1).setAxesLimits(0, hist_out.elements(), 0,
+                                       max<float>(hist_out));
+
+            (*win)(0, 0).image(normalize(eta, m_eta));
+            (*win)(0, 1).hist(hist_out, 0, 1,
+                              "Normalized Pressure Distribution");
+            (*win)(1, 0).plot(range(up.dims(1)), vp.col(0),
+                              "Pressure at left boundary");
+            (*win)(1, 1).plot(
+                flat(eta.col(0)), flat(up.col(0)), flat(vp.col(0)),
+                "Gradients versus Magnitude at left boundary");  // viz
+            win->show();
+        } else
+            eval(eta, up, vp);
+        iter++;
+    }
+}
+
+int main(int argc, char* argv[]) {
+    int device   = argc > 1 ? atoi(argv[1]) : 0;
+    bool console = argc > 2 ? argv[2][0] == '-' : false;
+    try {
+        af::setDevice(device);
+        af::info();
+        printf("Simulation of shallow water equations\n");
+        swe(console);
+    } catch (af::exception& e) {
+        fprintf(stderr, "%s\n", e.what());
+        throw;
+    }
+
+    return 0;
+}
diff --git a/examples/unified/CMakeLists.txt b/examples/unified/CMakeLists.txt
new file mode 100644
index 0000000000..a399f58c00
--- /dev/null
+++ b/examples/unified/CMakeLists.txt
@@ -0,0 +1,19 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.5)
+project(ArrayFire-Example-Unified
+  VERSION 3.5.0
+  LANGUAGES CXX)
+
+find_package(ArrayFire REQUIRED)
+
+if(ArrayFire_Unified_FOUND)
+  # Simple unified backend example
+  add_executable(basic_unified basic.cpp)
+  target_link_libraries(basic_unified ArrayFire::af)
+endif()
diff --git a/examples/unified/basic.cpp b/examples/unified/basic.cpp
new file mode 100644
index 0000000000..89d4eed1a0
--- /dev/null
+++ b/examples/unified/basic.cpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <algorithm>
+#include <cstdio>
+#include <vector>
+
+using namespace af;
+
+std::vector<float> input(100);
+
+// Generate a uniformly distributed random number
+// in the closed interval [0, 1].
+double unifRand() { return rand() / double(RAND_MAX); }
+
+void testBackend() {
+    af::info();
+
+    af::dim4 dims(10, 10, 1, 1);
+
+    af::array A(dims, &input.front());
+    af_print(A);
+
+    af::array B = af::constant(0.5, dims, f32);
+    af_print(B);
+}
+
+int main(int, char**) {
+    std::generate(input.begin(), input.end(), unifRand);
+
+    try {
+        printf("Trying CPU Backend\n");
+        af::setBackend(AF_BACKEND_CPU);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying CPU backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    try {
+        printf("Trying CUDA Backend\n");
+        af::setBackend(AF_BACKEND_CUDA);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying CUDA backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    try {
+        printf("Trying OpenCL Backend\n");
+        af::setBackend(AF_BACKEND_OPENCL);
+        testBackend();
+    } catch (af::exception& e) {
+        printf("Caught exception when trying OpenCL backend\n");
+        fprintf(stderr, "%s\n", e.what());
+    }
+
+    return 0;
+}
diff --git a/extern/half/include/half.hpp b/extern/half/include/half.hpp
new file mode 100644
index 0000000000..e8dfc1995a
--- /dev/null
+++ b/extern/half/include/half.hpp
@@ -0,0 +1,3081 @@
+// half - IEEE 754-based half-precision floating point library.
+//
+// Copyright (c) 2012-2017 Christian Rau <rauy@users.sourceforge.net>
+//
+// Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation
+// files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy,
+// modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the
+// Software is furnished to do so, subject to the following conditions:
+//
+// The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+//
+// THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE
+// WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
+// COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
+// ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
+
+// Version 1.12.0
+
+/// \file
+/// Main header file for half precision functionality.
+
+#ifndef HALF_HALF_HPP
+#define HALF_HALF_HPP
+
+/// Combined gcc version number.
+#define HALF_GNUC_VERSION (__GNUC__*100+__GNUC_MINOR__)
+
+//check C++11 language features
+#if defined(__clang__)										//clang
+	#if __has_feature(cxx_static_assert) && !defined(HALF_ENABLE_CPP11_STATIC_ASSERT)
+		#define HALF_ENABLE_CPP11_STATIC_ASSERT 1
+	#endif
+	#if __has_feature(cxx_constexpr) && !defined(HALF_ENABLE_CPP11_CONSTEXPR)
+		#define HALF_ENABLE_CPP11_CONSTEXPR 1
+	#endif
+	#if __has_feature(cxx_noexcept) && !defined(HALF_ENABLE_CPP11_NOEXCEPT)
+		#define HALF_ENABLE_CPP11_NOEXCEPT 1
+	#endif
+	#if __has_feature(cxx_user_literals) && !defined(HALF_ENABLE_CPP11_USER_LITERALS)
+		#define HALF_ENABLE_CPP11_USER_LITERALS 1
+	#endif
+	#if (defined(__GXX_EXPERIMENTAL_CXX0X__) || __cplusplus >= 201103L) && !defined(HALF_ENABLE_CPP11_LONG_LONG)
+		#define HALF_ENABLE_CPP11_LONG_LONG 1
+	#endif
+/*#elif defined(__INTEL_COMPILER)								//Intel C++
+	#if __INTEL_COMPILER >= 1100 && !defined(HALF_ENABLE_CPP11_STATIC_ASSERT)		????????
+		#define HALF_ENABLE_CPP11_STATIC_ASSERT 1
+	#endif
+	#if __INTEL_COMPILER >= 1300 && !defined(HALF_ENABLE_CPP11_CONSTEXPR)			????????
+		#define HALF_ENABLE_CPP11_CONSTEXPR 1
+	#endif
+	#if __INTEL_COMPILER >= 1300 && !defined(HALF_ENABLE_CPP11_NOEXCEPT)			????????
+		#define HALF_ENABLE_CPP11_NOEXCEPT 1
+	#endif
+	#if __INTEL_COMPILER >= 1100 && !defined(HALF_ENABLE_CPP11_LONG_LONG)			????????
+		#define HALF_ENABLE_CPP11_LONG_LONG 1
+	#endif*/
+#elif defined(__GNUC__)										//gcc
+	#if defined(__GXX_EXPERIMENTAL_CXX0X__) || __cplusplus >= 201103L
+		#if HALF_GNUC_VERSION >= 403 && !defined(HALF_ENABLE_CPP11_STATIC_ASSERT)
+			#define HALF_ENABLE_CPP11_STATIC_ASSERT 1
+		#endif
+		#if HALF_GNUC_VERSION >= 406 && !defined(HALF_ENABLE_CPP11_CONSTEXPR)
+			#define HALF_ENABLE_CPP11_CONSTEXPR 1
+		#endif
+		#if HALF_GNUC_VERSION >= 406 && !defined(HALF_ENABLE_CPP11_NOEXCEPT)
+			#define HALF_ENABLE_CPP11_NOEXCEPT 1
+		#endif
+		#if HALF_GNUC_VERSION >= 407 && !defined(HALF_ENABLE_CPP11_USER_LITERALS)
+			#define HALF_ENABLE_CPP11_USER_LITERALS 1
+		#endif
+		#if !defined(HALF_ENABLE_CPP11_LONG_LONG)
+			#define HALF_ENABLE_CPP11_LONG_LONG 1
+		#endif
+	#endif
+#elif defined(_MSC_VER)										//Visual C++
+	#if _MSC_VER >= 1900 && !defined(HALF_ENABLE_CPP11_CONSTEXPR)
+		#define HALF_ENABLE_CPP11_CONSTEXPR 1
+	#endif
+	#if _MSC_VER >= 1900 && !defined(HALF_ENABLE_CPP11_NOEXCEPT)
+		#define HALF_ENABLE_CPP11_NOEXCEPT 1
+	#endif
+	#if _MSC_VER >= 1900 && !defined(HALF_ENABLE_CPP11_USER_LITERALS)
+		#define HALF_ENABLE_CPP11_USER_LITERALS 1
+	#endif
+	#if _MSC_VER >= 1600 && !defined(HALF_ENABLE_CPP11_STATIC_ASSERT)
+		#define HALF_ENABLE_CPP11_STATIC_ASSERT 1
+	#endif
+	#if _MSC_VER >= 1310 && !defined(HALF_ENABLE_CPP11_LONG_LONG)
+		#define HALF_ENABLE_CPP11_LONG_LONG 1
+	#endif
+	#define HALF_POP_WARNINGS 1
+	#pragma warning(push)
+	#pragma warning(disable : 4099 4127 4146)	//struct vs class, constant in if, negative unsigned
+#endif
+
+//check C++11 library features
+#include <utility>
+#if defined(_LIBCPP_VERSION)								//libc++
+	#if defined(__GXX_EXPERIMENTAL_CXX0X__) || __cplusplus >= 201103
+		#ifndef HALF_ENABLE_CPP11_TYPE_TRAITS
+			#define HALF_ENABLE_CPP11_TYPE_TRAITS 1
+		#endif
+		#ifndef HALF_ENABLE_CPP11_CSTDINT
+			#define HALF_ENABLE_CPP11_CSTDINT 1
+		#endif
+		#ifndef HALF_ENABLE_CPP11_CMATH
+			#define HALF_ENABLE_CPP11_CMATH 1
+		#endif
+		#ifndef HALF_ENABLE_CPP11_HASH
+			#define HALF_ENABLE_CPP11_HASH 1
+		#endif
+	#endif
+#elif defined(__GLIBCXX__)									//libstdc++
+	#if defined(__GXX_EXPERIMENTAL_CXX0X__) || __cplusplus >= 201103
+		#ifdef __clang__
+			#if __GLIBCXX__ >= 20080606 && !defined(HALF_ENABLE_CPP11_TYPE_TRAITS)
+				#define HALF_ENABLE_CPP11_TYPE_TRAITS 1
+			#endif
+			#if __GLIBCXX__ >= 20080606 && !defined(HALF_ENABLE_CPP11_CSTDINT)
+				#define HALF_ENABLE_CPP11_CSTDINT 1
+			#endif
+			#if __GLIBCXX__ >= 20080606 && !defined(HALF_ENABLE_CPP11_CMATH)
+				#define HALF_ENABLE_CPP11_CMATH 1
+			#endif
+			#if __GLIBCXX__ >= 20080606 && !defined(HALF_ENABLE_CPP11_HASH)
+				#define HALF_ENABLE_CPP11_HASH 1
+			#endif
+		#else
+			#if HALF_GNUC_VERSION >= 403 && !defined(HALF_ENABLE_CPP11_CSTDINT)
+				#define HALF_ENABLE_CPP11_CSTDINT 1
+			#endif
+			#if HALF_GNUC_VERSION >= 403 && !defined(HALF_ENABLE_CPP11_CMATH)
+				#define HALF_ENABLE_CPP11_CMATH 1
+			#endif
+			#if HALF_GNUC_VERSION >= 403 && !defined(HALF_ENABLE_CPP11_HASH)
+				#define HALF_ENABLE_CPP11_HASH 1
+			#endif
+		#endif
+	#endif
+#elif defined(_CPPLIB_VER)									//Dinkumware/Visual C++
+	#if _CPPLIB_VER >= 520
+		#ifndef HALF_ENABLE_CPP11_TYPE_TRAITS
+			#define HALF_ENABLE_CPP11_TYPE_TRAITS 1
+		#endif
+		#ifndef HALF_ENABLE_CPP11_CSTDINT
+			#define HALF_ENABLE_CPP11_CSTDINT 1
+		#endif
+		#ifndef HALF_ENABLE_CPP11_HASH
+			#define HALF_ENABLE_CPP11_HASH 1
+		#endif
+	#endif
+	#if _CPPLIB_VER >= 610
+		#ifndef HALF_ENABLE_CPP11_CMATH
+			#define HALF_ENABLE_CPP11_CMATH 1
+		#endif
+	#endif
+#endif
+#undef HALF_GNUC_VERSION
+
+//support constexpr
+#if HALF_ENABLE_CPP11_CONSTEXPR
+	#define HALF_CONSTEXPR			constexpr
+	#define HALF_CONSTEXPR_CONST	constexpr
+#else
+	#define HALF_CONSTEXPR
+	#define HALF_CONSTEXPR_CONST	const
+#endif
+
+//support noexcept
+#if HALF_ENABLE_CPP11_NOEXCEPT
+	#define HALF_NOEXCEPT	noexcept
+	#define HALF_NOTHROW	noexcept
+#else
+	#define HALF_NOEXCEPT
+	#define HALF_NOTHROW	throw()
+#endif
+
+#include <algorithm>
+#include <iostream>
+#include <limits>
+#include <climits>
+#include <cmath>
+#include <cstring>
+#if HALF_ENABLE_CPP11_TYPE_TRAITS
+	#include <type_traits>
+#endif
+#if HALF_ENABLE_CPP11_CSTDINT
+	#include <cstdint>
+#endif
+#if HALF_ENABLE_CPP11_HASH
+	#include <functional>
+#endif
+
+
+/// Default rounding mode.
+/// This specifies the rounding mode used for all conversions between [half](\ref half_float::half)s and `float`s as well as
+/// for the half_cast() if not specifying a rounding mode explicitly. It can be redefined (before including half.hpp) to one
+/// of the standard rounding modes using their respective constants or the equivalent values of `std::float_round_style`:
+///
+/// `std::float_round_style`         | value | rounding
+/// ---------------------------------|-------|-------------------------
+/// `std::round_indeterminate`       | -1    | fastest (default)
+/// `std::round_toward_zero`         | 0     | toward zero
+/// `std::round_to_nearest`          | 1     | to nearest
+/// `std::round_toward_infinity`     | 2     | toward positive infinity
+/// `std::round_toward_neg_infinity` | 3     | toward negative infinity
+///
+/// By default this is set to `-1` (`std::round_indeterminate`), which uses truncation (round toward zero, but with overflows
+/// set to infinity) and is the fastest rounding mode possible. It can even be set to `std::numeric_limits<float>::round_style`
+/// to synchronize the rounding mode with that of the underlying single-precision implementation.
+#ifndef HALF_ROUND_STYLE
+	#define HALF_ROUND_STYLE	-1			// = std::round_indeterminate
+#endif
+
+/// Tie-breaking behaviour for round to nearest.
+/// This specifies if ties in round to nearest should be resolved by rounding to the nearest even value. By default this is
+/// defined to `0` resulting in the faster but slightly more biased behaviour of rounding away from zero in half-way cases (and
+/// thus equal to the round() function), but can be redefined to `1` (before including half.hpp) if more IEEE-conformant
+/// behaviour is needed.
+#ifndef HALF_ROUND_TIES_TO_EVEN
+	#define HALF_ROUND_TIES_TO_EVEN	0		// ties away from zero
+#endif
+
+/// Value signaling overflow.
+/// In correspondence with `HUGE_VAL[F|L]` from `<cmath>` this symbol expands to a positive value signaling the overflow of an
+/// operation, in particular it just evaluates to positive infinity.
+#define HUGE_VALH	std::numeric_limits<half_float::half>::infinity()
+
+/// Fast half-precision fma function.
+/// This symbol is only defined if the fma() function generally executes as fast as, or faster than, a separate
+/// half-precision multiplication followed by an addition. Due to the internal single-precision implementation of all
+/// arithmetic operations, this is in fact always the case.
+#define FP_FAST_FMAH	1
+
+#ifndef FP_ILOGB0
+	#define FP_ILOGB0		INT_MIN
+#endif
+#ifndef FP_ILOGBNAN
+	#define FP_ILOGBNAN		INT_MAX
+#endif
+#ifndef FP_SUBNORMAL
+	#define FP_SUBNORMAL	0
+#endif
+#ifndef FP_ZERO
+	#define FP_ZERO			1
+#endif
+#ifndef FP_NAN
+	#define FP_NAN			2
+#endif
+#ifndef FP_INFINITE
+	#define FP_INFINITE		3
+#endif
+#ifndef FP_NORMAL
+	#define FP_NORMAL		4
+#endif
+
+
+/// Main namespace for half precision functionality.
+/// This namespace contains all the functionality provided by the library.
+namespace half_float
+{
+	class half;
+
+#if HALF_ENABLE_CPP11_USER_LITERALS
+	/// Library-defined half-precision literals.
+	/// Import this namespace to enable half-precision floating point literals:
+	/// ~~~~{.cpp}
+	/// using namespace half_float::literal;
+	/// half_float::half h = 4.2_h;
+	/// ~~~~
+	namespace literal
+	{
+		half operator""_h(long double);
+	}
+#endif
+
+	/// \internal
+	/// \brief Implementation details.
+	namespace detail
+	{
+	#if HALF_ENABLE_CPP11_TYPE_TRAITS
+		/// Conditional type.
+		template<bool B,typename T,typename F> struct conditional : std::conditional<B,T,F> {};
+
+		/// Helper for tag dispatching.
+		template<bool B> struct bool_type : std::integral_constant<bool,B> {};
+		using std::true_type;
+		using std::false_type;
+
+		/// Type traits for floating point types.
+		template<typename T> struct is_float : std::is_floating_point<T> {};
+	#else
+		/// Conditional type.
+		template<bool,typename T,typename> struct conditional { typedef T type; };
+		template<typename T,typename F> struct conditional<false,T,F> { typedef F type; };
+
+		/// Helper for tag dispatching.
+		template<bool> struct bool_type {};
+		typedef bool_type<true> true_type;
+		typedef bool_type<false> false_type;
+
+		/// Type traits for floating point types.
+		template<typename> struct is_float : false_type {};
+		template<typename T> struct is_float<const T> : is_float<T> {};
+		template<typename T> struct is_float<volatile T> : is_float<T> {};
+		template<typename T> struct is_float<const volatile T> : is_float<T> {};
+		template<> struct is_float<float> : true_type {};
+		template<> struct is_float<double> : true_type {};
+		template<> struct is_float<long double> : true_type {};
+	#endif
+
+		/// Type traits for floating point bits.
+		template<typename T> struct bits { typedef unsigned char type; };
+		template<typename T> struct bits<const T> : bits<T> {};
+		template<typename T> struct bits<volatile T> : bits<T> {};
+		template<typename T> struct bits<const volatile T> : bits<T> {};
+
+	#if HALF_ENABLE_CPP11_CSTDINT
+		/// Unsigned integer of (at least) 16 bits width.
+		typedef std::uint_least16_t uint16;
+
+		/// Unsigned integer of (at least) 32 bits width.
+		template<> struct bits<float> { typedef std::uint_least32_t type; };
+
+		/// Unsigned integer of (at least) 64 bits width.
+		template<> struct bits<double> { typedef std::uint_least64_t type; };
+	#else
+		/// Unsigned integer of (at least) 16 bits width.
+		typedef unsigned short uint16;
+
+		/// Unsigned integer of (at least) 32 bits width.
+		template<> struct bits<float> : conditional<std::numeric_limits<unsigned int>::digits>=32,unsigned int,unsigned long> {};
+
+		#if HALF_ENABLE_CPP11_LONG_LONG
+			/// Unsigned integer of (at least) 64 bits width.
+			template<> struct bits<double> : conditional<std::numeric_limits<unsigned long>::digits>=64,unsigned long,unsigned long long> {};
+		#else
+			/// Unsigned integer of (at least) 64 bits width.
+			template<> struct bits<double> { typedef unsigned long type; };
+		#endif
+	#endif
+
+		/// Tag type for binary construction.
+		struct binary_t {};
+
+		/// Tag for binary construction.
+		HALF_CONSTEXPR_CONST binary_t binary = binary_t();
+
+		/// Temporary half-precision expression.
+		/// This class represents a half-precision expression which just stores a single-precision value internally.
+		struct expr
+		{
+			/// Conversion constructor.
+			/// \param f single-precision value to convert
+			explicit HALF_CONSTEXPR expr(float f) HALF_NOEXCEPT : value_(f) {}
+
+			/// Conversion to single-precision.
+			/// \return single-precision value representing the expression value
+			HALF_CONSTEXPR operator float() const HALF_NOEXCEPT { return value_; }
+
+		private:
+			/// Internal expression value stored in single-precision.
+			float value_;
+		};
+
+		/// SFINAE helper for generic half-precision functions.
+		/// This class template has to be specialized for each valid combination of argument types to provide a corresponding
+		/// `type` member equivalent to \a T.
+		/// \tparam T type to return
+		template<typename T,typename,typename=void,typename=void> struct enable {};
+		template<typename T> struct enable<T,half,void,void> { typedef T type; };
+		template<typename T> struct enable<T,expr,void,void> { typedef T type; };
+		template<typename T> struct enable<T,half,half,void> { typedef T type; };
+		template<typename T> struct enable<T,half,expr,void> { typedef T type; };
+		template<typename T> struct enable<T,expr,half,void> { typedef T type; };
+		template<typename T> struct enable<T,expr,expr,void> { typedef T type; };
+		template<typename T> struct enable<T,half,half,half> { typedef T type; };
+		template<typename T> struct enable<T,half,half,expr> { typedef T type; };
+		template<typename T> struct enable<T,half,expr,half> { typedef T type; };
+		template<typename T> struct enable<T,half,expr,expr> { typedef T type; };
+		template<typename T> struct enable<T,expr,half,half> { typedef T type; };
+		template<typename T> struct enable<T,expr,half,expr> { typedef T type; };
+		template<typename T> struct enable<T,expr,expr,half> { typedef T type; };
+		template<typename T> struct enable<T,expr,expr,expr> { typedef T type; };
+
+		/// Return type for specialized generic 2-argument half-precision functions.
+		/// This class template has to be specialized for each valid combination of argument types to provide a corresponding
+		/// `type` member denoting the appropriate return type.
+		/// \tparam T first argument type
+		/// \tparam U second argument type
+		template<typename T,typename U> struct result : enable<expr,T,U> {};
+		template<> struct result<half,half> { typedef half type; };
+
+		/// \name Classification helpers
+		/// \{
+
+		/// Check for infinity.
+		/// \tparam T argument type (builtin floating point type)
+		/// \param arg value to query
+		/// \retval true if infinity
+		/// \retval false else
+		template<typename T> bool builtin_isinf(T arg)
+		{
+		#if HALF_ENABLE_CPP11_CMATH
+#ifdef __clang__
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wtautological-constant-compare"
+#endif
+			return std::isinf(arg);
+#ifdef __clang__
+#pragma GCC diagnostic pop
+#endif
+		#elif defined(_MSC_VER)
+			return !::_finite(static_cast<double>(arg)) && !::_isnan(static_cast<double>(arg));
+		#else
+			return arg == std::numeric_limits<T>::infinity() || arg == -std::numeric_limits<T>::infinity();
+		#endif
+		}
+
+		/// Check for NaN.
+		/// \tparam T argument type (builtin floating point type)
+		/// \param arg value to query
+		/// \retval true if not a number
+		/// \retval false else
+		template<typename T> bool builtin_isnan(T arg)
+		{
+		#if HALF_ENABLE_CPP11_CMATH
+#ifdef __clang__
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wtautological-constant-compare"
+#endif
+			return std::isnan(arg);
+#ifdef __clang__
+#pragma GCC diagnostic pop
+#endif
+		#elif defined(_MSC_VER)
+			return ::_isnan(static_cast<double>(arg)) != 0;
+		#else
+			return arg != arg;
+		#endif
+		}
+
+		/// Check sign.
+		/// \tparam T argument type (builtin floating point type)
+		/// \param arg value to query
+		/// \retval true if signbit set
+		/// \retval false else
+		template<typename T> bool builtin_signbit(T arg)
+		{
+		#if HALF_ENABLE_CPP11_CMATH
+			return std::signbit(arg);
+		#else
+			return arg < T() || (arg == T() && T(1)/arg < T());
+		#endif
+		}
+
+		/// \}
+		/// \name Conversion
+		/// \{
+
+		/// Convert IEEE single-precision to half-precision.
+		/// Credit for this goes to [Jeroen van der Zijp](ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf).
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \param value single-precision value
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R> uint16 float2half_impl(float value, true_type)
+		{
+			typedef bits<float>::type uint32;
+			uint32 bits;	// filled via memcpy below; *reinterpret_cast<uint32*>(&value) would violate strict aliasing
+			std::memcpy(&bits, &value, sizeof(float));
+/*			uint16 hbits = (bits>>16) & 0x8000;
+			bits &= 0x7FFFFFFF;
+			int exp = bits >> 23;
+			if(exp == 255)
+				return hbits | 0x7C00 | (0x3FF&-static_cast<unsigned>((bits&0x7FFFFF)!=0));
+			if(exp > 142)
+			{
+				if(R == std::round_toward_infinity)
+					return hbits | 0x7C00 - (hbits>>15);
+				if(R == std::round_toward_neg_infinity)
+					return hbits | 0x7BFF + (hbits>>15);
+				return hbits | 0x7BFF + (R!=std::round_toward_zero);
+			}
+			int g, s;
+			if(exp > 112)
+			{
+				g = (bits>>12) & 1;
+				s = (bits&0xFFF) != 0;
+				hbits |= ((exp-112)<<10) | ((bits>>13)&0x3FF);
+			}
+			else if(exp > 101)
+			{
+				int i = 125 - exp;
+				bits = (bits&0x7FFFFF) | 0x800000;
+				g = (bits>>i) & 1;
+				s = (bits&((1L<<i)-1)) != 0;
+				hbits |= bits >> (i+1);
+			}
+			else
+			{
+				g = 0;
+				s = bits != 0;
+			}
+			if(R == std::round_to_nearest)
+				#if HALF_ROUND_TIES_TO_EVEN
+					hbits += g & (s|hbits);
+				#else
+					hbits += g;
+				#endif
+			else if(R == std::round_toward_infinity)
+				hbits += ~(hbits>>15) & (s|g);
+			else if(R == std::round_toward_neg_infinity)
+				hbits += (hbits>>15) & (g|s);
+*/			static const uint16 base_table[512] = {
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+				0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0001, 0x0002, 0x0004, 0x0008, 0x0010, 0x0020, 0x0040, 0x0080, 0x0100,
+				0x0200, 0x0400, 0x0800, 0x0C00, 0x1000, 0x1400, 0x1800, 0x1C00, 0x2000, 0x2400, 0x2800, 0x2C00, 0x3000, 0x3400, 0x3800, 0x3C00,
+				0x4000, 0x4400, 0x4800, 0x4C00, 0x5000, 0x5400, 0x5800, 0x5C00, 0x6000, 0x6400, 0x6800, 0x6C00, 0x7000, 0x7400, 0x7800, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+				0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8001, 0x8002, 0x8004, 0x8008, 0x8010, 0x8020, 0x8040, 0x8080, 0x8100,
+				0x8200, 0x8400, 0x8800, 0x8C00, 0x9000, 0x9400, 0x9800, 0x9C00, 0xA000, 0xA400, 0xA800, 0xAC00, 0xB000, 0xB400, 0xB800, 0xBC00,
+				0xC000, 0xC400, 0xC800, 0xCC00, 0xD000, 0xD400, 0xD800, 0xDC00, 0xE000, 0xE400, 0xE800, 0xEC00, 0xF000, 0xF400, 0xF800, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+				0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00 };
+			static const unsigned char shift_table[512] = {
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
+				13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 13,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
+				13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+				24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 13 };
+			uint16 hbits = base_table[bits>>23] + static_cast<uint16>((bits&0x7FFFFF)>>shift_table[bits>>23]);
+			if(R == std::round_to_nearest)
+				hbits += (((bits&0x7FFFFF)>>(shift_table[bits>>23]-1))|(((bits>>23)&0xFF)==102)) & ((hbits&0x7C00)!=0x7C00)
+				#if HALF_ROUND_TIES_TO_EVEN
+					& (((((static_cast<uint32>(1)<<(shift_table[bits>>23]-1))-1)&bits)!=0)|hbits)
+				#endif
+				;
+			else if(R == std::round_toward_zero)
+				hbits -= ((hbits&0x7FFF)==0x7C00) & ~shift_table[bits>>23];
+			else if(R == std::round_toward_infinity)
+				hbits += ((((bits&0x7FFFFF&((static_cast<uint32>(1)<<(shift_table[bits>>23]))-1))!=0)|(((bits>>23)<=102)&
+					((bits>>23)!=0)))&(hbits<0x7C00)) - ((hbits==0xFC00)&((bits>>23)!=511));
+			else if(R == std::round_toward_neg_infinity)
+				hbits += ((((bits&0x7FFFFF&((static_cast<uint32>(1)<<(shift_table[bits>>23]))-1))!=0)|(((bits>>23)<=358)&
+					((bits>>23)!=256)))&(hbits<0xFC00)&(hbits>>15)) - ((hbits==0x7C00)&((bits>>23)!=255));
+			return hbits;
+		}
+
+		/// Convert IEEE double-precision to half-precision.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \param value double-precision value
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R> uint16 float2half_impl(double value, true_type)
+		{
+			typedef bits<float>::type uint32;
+			typedef bits<double>::type uint64;
+			uint64 bits;	// filled via memcpy below; *reinterpret_cast<uint64*>(&value) would violate strict aliasing
+			std::memcpy(&bits, &value, sizeof(double));
+			uint32 hi = bits >> 32, lo = bits & 0xFFFFFFFF;
+			uint16 hbits = (hi>>16) & 0x8000;
+			hi &= 0x7FFFFFFF;
+			int exp = hi >> 20;
+			if(exp == 2047)
+				return hbits | 0x7C00 | (0x3FF&-static_cast<unsigned>((bits&0xFFFFFFFFFFFFF)!=0));
+			if(exp > 1038)
+			{
+				if(R == std::round_toward_infinity)
+					return hbits | 0x7C00 - (hbits>>15);
+				if(R == std::round_toward_neg_infinity)
+					return hbits | 0x7BFF + (hbits>>15);
+				return hbits | 0x7BFF + (R!=std::round_toward_zero);
+			}
+			int g, s = lo != 0;
+			if(exp > 1008)
+			{
+				g = (hi>>9) & 1;
+				s |= (hi&0x1FF) != 0;
+				hbits |= ((exp-1008)<<10) | ((hi>>10)&0x3FF);
+			}
+			else if(exp > 997)
+			{
+				int i = 1018 - exp;
+				hi = (hi&0xFFFFF) | 0x100000;
+				g = (hi>>i) & 1;
+				s |= (hi&((1L<<i)-1)) != 0;
+				hbits |= hi >> (i+1);
+			}
+			else
+			{
+				g = 0;
+				s |= hi != 0;
+			}
+			if(R == std::round_to_nearest)
+				#if HALF_ROUND_TIES_TO_EVEN
+					hbits += g & (s|hbits);
+				#else
+					hbits += g;
+				#endif
+			else if(R == std::round_toward_infinity)
+				hbits += ~(hbits>>15) & (s|g);
+			else if(R == std::round_toward_neg_infinity)
+				hbits += (hbits>>15) & (g|s);
+			return hbits;
+		}
+
+		/// Convert non-IEEE floating point to half-precision.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam T source type (builtin floating point type)
+		/// \param value floating point value
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R,typename T> uint16 float2half_impl(T value, ...)
+		{
+			uint16 hbits = static_cast<unsigned>(builtin_signbit(value)) << 15;
+			if(value == T())
+				return hbits;
+			if(builtin_isnan(value))
+				return hbits | 0x7FFF;
+			if(builtin_isinf(value))
+				return hbits | 0x7C00;
+			int exp;
+			std::frexp(value, &exp);
+			if(exp > 16)
+			{
+				if(R == std::round_toward_infinity)
+					return hbits | 0x7C00 - (hbits>>15);
+				else if(R == std::round_toward_neg_infinity)
+					return hbits | 0x7BFF + (hbits>>15);
+				return hbits | 0x7BFF + (R!=std::round_toward_zero);
+			}
+			if(exp < -13)
+				value = std::ldexp(value, 24);
+			else
+			{
+				value = std::ldexp(value, 11-exp);
+				hbits |= ((exp+13)<<10);
+			}
+			T ival, frac = std::modf(value, &ival);
+			hbits += static_cast<uint16>(std::abs(static_cast<int>(ival)));
+			if(R == std::round_to_nearest)
+			{
+				frac = std::abs(frac);
+				#if HALF_ROUND_TIES_TO_EVEN
+					hbits += (frac>T(0.5)) | ((frac==T(0.5))&hbits);
+				#else
+					hbits += frac >= T(0.5);
+				#endif
+			}
+			else if(R == std::round_toward_infinity)
+				hbits += frac > T();
+			else if(R == std::round_toward_neg_infinity)
+				hbits += frac < T();
+			return hbits;
+		}
+
+		/// Convert floating point to half-precision.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam T source type (builtin floating point type)
+		/// \param value floating point value
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R,typename T> uint16 float2half(T value)
+		{
+			return float2half_impl<R>(value, bool_type<std::numeric_limits<T>::is_iec559&&sizeof(typename bits<T>::type)==sizeof(T)>());
+		}
+
+		/// Convert integer to half-precision floating point.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam S `true` if value negative, `false` else
+		/// \tparam T type to convert (builtin integer type)
+		/// \param value integral value (negated internally if \a S is `true`)
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R,bool S,typename T> uint16 int2half_impl(T value)
+		{
+		#if HALF_ENABLE_CPP11_STATIC_ASSERT && HALF_ENABLE_CPP11_TYPE_TRAITS
+			static_assert(std::is_integral<T>::value, "int to half conversion only supports builtin integer types");
+		#endif
+			if(S)
+				value = -value;
+			uint16 bits = S << 15;
+			if(value > 0xFFFF)
+			{
+				if(R == std::round_toward_infinity)
+					bits |= 0x7C00 - S;
+				else if(R == std::round_toward_neg_infinity)
+					bits |= 0x7BFF + S;
+				else
+					bits |= 0x7BFF + (R!=std::round_toward_zero);
+			}
+			else if(value)
+			{
+				unsigned int m = value, exp = 24;
+				for(; m<0x400; m<<=1,--exp) ;
+				for(; m>0x7FF; m>>=1,++exp) ;
+				bits |= (exp<<10) + m;
+				if(exp > 24)
+				{
+					if(R == std::round_to_nearest)
+						bits += (value>>(exp-25)) & 1
+						#if HALF_ROUND_TIES_TO_EVEN
+							& (((((1<<(exp-25))-1)&value)!=0)|bits)
+						#endif
+						;
+					else if(R == std::round_toward_infinity)
+						bits += ((value&((1<<(exp-24))-1))!=0) & !S;
+					else if(R == std::round_toward_neg_infinity)
+						bits += ((value&((1<<(exp-24))-1))!=0) & S;
+				}
+			}
+			return bits;
+		}
+
+		/// Convert integer to half-precision floating point.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam T type to convert (builtin integer type)
+		/// \param value integral value
+		/// \return binary representation of half-precision value
+		template<std::float_round_style R,typename T> uint16 int2half(T value)
+		{
+			return (value<0) ? int2half_impl<R,true>(value) : int2half_impl<R,false>(value);
+		}
+
+		/// Convert half-precision to IEEE single-precision.
+		/// Credit for this goes to [Jeroen van der Zijp](ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf).
+		/// \param value binary representation of half-precision value
+		/// \return single-precision value
+		inline float half2float_impl(uint16 value, float, true_type)
+		{
+			typedef bits<float>::type uint32;
+/*			uint32 bits = static_cast<uint32>(value&0x8000) << 16;
+			int abs = value & 0x7FFF;
+			if(abs)
+			{
+				bits |= 0x38000000 << static_cast<unsigned>(abs>=0x7C00);
+				for(; abs<0x400; abs<<=1,bits-=0x800000) ;
+				bits += static_cast<uint32>(abs) << 13;
+			}
+*/			static const uint32 mantissa_table[2048] = {
+				0x00000000, 0x33800000, 0x34000000, 0x34400000, 0x34800000, 0x34A00000, 0x34C00000, 0x34E00000, 0x35000000, 0x35100000, 0x35200000, 0x35300000, 0x35400000, 0x35500000, 0x35600000, 0x35700000,
+				0x35800000, 0x35880000, 0x35900000, 0x35980000, 0x35A00000, 0x35A80000, 0x35B00000, 0x35B80000, 0x35C00000, 0x35C80000, 0x35D00000, 0x35D80000, 0x35E00000, 0x35E80000, 0x35F00000, 0x35F80000,
+				0x36000000, 0x36040000, 0x36080000, 0x360C0000, 0x36100000, 0x36140000, 0x36180000, 0x361C0000, 0x36200000, 0x36240000, 0x36280000, 0x362C0000, 0x36300000, 0x36340000, 0x36380000, 0x363C0000,
+				0x36400000, 0x36440000, 0x36480000, 0x364C0000, 0x36500000, 0x36540000, 0x36580000, 0x365C0000, 0x36600000, 0x36640000, 0x36680000, 0x366C0000, 0x36700000, 0x36740000, 0x36780000, 0x367C0000,
+				0x36800000, 0x36820000, 0x36840000, 0x36860000, 0x36880000, 0x368A0000, 0x368C0000, 0x368E0000, 0x36900000, 0x36920000, 0x36940000, 0x36960000, 0x36980000, 0x369A0000, 0x369C0000, 0x369E0000,
+				0x36A00000, 0x36A20000, 0x36A40000, 0x36A60000, 0x36A80000, 0x36AA0000, 0x36AC0000, 0x36AE0000, 0x36B00000, 0x36B20000, 0x36B40000, 0x36B60000, 0x36B80000, 0x36BA0000, 0x36BC0000, 0x36BE0000,
+				0x36C00000, 0x36C20000, 0x36C40000, 0x36C60000, 0x36C80000, 0x36CA0000, 0x36CC0000, 0x36CE0000, 0x36D00000, 0x36D20000, 0x36D40000, 0x36D60000, 0x36D80000, 0x36DA0000, 0x36DC0000, 0x36DE0000,
+				0x36E00000, 0x36E20000, 0x36E40000, 0x36E60000, 0x36E80000, 0x36EA0000, 0x36EC0000, 0x36EE0000, 0x36F00000, 0x36F20000, 0x36F40000, 0x36F60000, 0x36F80000, 0x36FA0000, 0x36FC0000, 0x36FE0000,
+				0x37000000, 0x37010000, 0x37020000, 0x37030000, 0x37040000, 0x37050000, 0x37060000, 0x37070000, 0x37080000, 0x37090000, 0x370A0000, 0x370B0000, 0x370C0000, 0x370D0000, 0x370E0000, 0x370F0000,
+				0x37100000, 0x37110000, 0x37120000, 0x37130000, 0x37140000, 0x37150000, 0x37160000, 0x37170000, 0x37180000, 0x37190000, 0x371A0000, 0x371B0000, 0x371C0000, 0x371D0000, 0x371E0000, 0x371F0000,
+				0x37200000, 0x37210000, 0x37220000, 0x37230000, 0x37240000, 0x37250000, 0x37260000, 0x37270000, 0x37280000, 0x37290000, 0x372A0000, 0x372B0000, 0x372C0000, 0x372D0000, 0x372E0000, 0x372F0000,
+				0x37300000, 0x37310000, 0x37320000, 0x37330000, 0x37340000, 0x37350000, 0x37360000, 0x37370000, 0x37380000, 0x37390000, 0x373A0000, 0x373B0000, 0x373C0000, 0x373D0000, 0x373E0000, 0x373F0000,
+				0x37400000, 0x37410000, 0x37420000, 0x37430000, 0x37440000, 0x37450000, 0x37460000, 0x37470000, 0x37480000, 0x37490000, 0x374A0000, 0x374B0000, 0x374C0000, 0x374D0000, 0x374E0000, 0x374F0000,
+				0x37500000, 0x37510000, 0x37520000, 0x37530000, 0x37540000, 0x37550000, 0x37560000, 0x37570000, 0x37580000, 0x37590000, 0x375A0000, 0x375B0000, 0x375C0000, 0x375D0000, 0x375E0000, 0x375F0000,
+				0x37600000, 0x37610000, 0x37620000, 0x37630000, 0x37640000, 0x37650000, 0x37660000, 0x37670000, 0x37680000, 0x37690000, 0x376A0000, 0x376B0000, 0x376C0000, 0x376D0000, 0x376E0000, 0x376F0000,
+				0x37700000, 0x37710000, 0x37720000, 0x37730000, 0x37740000, 0x37750000, 0x37760000, 0x37770000, 0x37780000, 0x37790000, 0x377A0000, 0x377B0000, 0x377C0000, 0x377D0000, 0x377E0000, 0x377F0000,
+				0x37800000, 0x37808000, 0x37810000, 0x37818000, 0x37820000, 0x37828000, 0x37830000, 0x37838000, 0x37840000, 0x37848000, 0x37850000, 0x37858000, 0x37860000, 0x37868000, 0x37870000, 0x37878000,
+				0x37880000, 0x37888000, 0x37890000, 0x37898000, 0x378A0000, 0x378A8000, 0x378B0000, 0x378B8000, 0x378C0000, 0x378C8000, 0x378D0000, 0x378D8000, 0x378E0000, 0x378E8000, 0x378F0000, 0x378F8000,
+				0x37900000, 0x37908000, 0x37910000, 0x37918000, 0x37920000, 0x37928000, 0x37930000, 0x37938000, 0x37940000, 0x37948000, 0x37950000, 0x37958000, 0x37960000, 0x37968000, 0x37970000, 0x37978000,
+				0x37980000, 0x37988000, 0x37990000, 0x37998000, 0x379A0000, 0x379A8000, 0x379B0000, 0x379B8000, 0x379C0000, 0x379C8000, 0x379D0000, 0x379D8000, 0x379E0000, 0x379E8000, 0x379F0000, 0x379F8000,
+				0x37A00000, 0x37A08000, 0x37A10000, 0x37A18000, 0x37A20000, 0x37A28000, 0x37A30000, 0x37A38000, 0x37A40000, 0x37A48000, 0x37A50000, 0x37A58000, 0x37A60000, 0x37A68000, 0x37A70000, 0x37A78000,
+				0x37A80000, 0x37A88000, 0x37A90000, 0x37A98000, 0x37AA0000, 0x37AA8000, 0x37AB0000, 0x37AB8000, 0x37AC0000, 0x37AC8000, 0x37AD0000, 0x37AD8000, 0x37AE0000, 0x37AE8000, 0x37AF0000, 0x37AF8000,
+				0x37B00000, 0x37B08000, 0x37B10000, 0x37B18000, 0x37B20000, 0x37B28000, 0x37B30000, 0x37B38000, 0x37B40000, 0x37B48000, 0x37B50000, 0x37B58000, 0x37B60000, 0x37B68000, 0x37B70000, 0x37B78000,
+				0x37B80000, 0x37B88000, 0x37B90000, 0x37B98000, 0x37BA0000, 0x37BA8000, 0x37BB0000, 0x37BB8000, 0x37BC0000, 0x37BC8000, 0x37BD0000, 0x37BD8000, 0x37BE0000, 0x37BE8000, 0x37BF0000, 0x37BF8000,
+				0x37C00000, 0x37C08000, 0x37C10000, 0x37C18000, 0x37C20000, 0x37C28000, 0x37C30000, 0x37C38000, 0x37C40000, 0x37C48000, 0x37C50000, 0x37C58000, 0x37C60000, 0x37C68000, 0x37C70000, 0x37C78000,
+				0x37C80000, 0x37C88000, 0x37C90000, 0x37C98000, 0x37CA0000, 0x37CA8000, 0x37CB0000, 0x37CB8000, 0x37CC0000, 0x37CC8000, 0x37CD0000, 0x37CD8000, 0x37CE0000, 0x37CE8000, 0x37CF0000, 0x37CF8000,
+				0x37D00000, 0x37D08000, 0x37D10000, 0x37D18000, 0x37D20000, 0x37D28000, 0x37D30000, 0x37D38000, 0x37D40000, 0x37D48000, 0x37D50000, 0x37D58000, 0x37D60000, 0x37D68000, 0x37D70000, 0x37D78000,
+				0x37D80000, 0x37D88000, 0x37D90000, 0x37D98000, 0x37DA0000, 0x37DA8000, 0x37DB0000, 0x37DB8000, 0x37DC0000, 0x37DC8000, 0x37DD0000, 0x37DD8000, 0x37DE0000, 0x37DE8000, 0x37DF0000, 0x37DF8000,
+				0x37E00000, 0x37E08000, 0x37E10000, 0x37E18000, 0x37E20000, 0x37E28000, 0x37E30000, 0x37E38000, 0x37E40000, 0x37E48000, 0x37E50000, 0x37E58000, 0x37E60000, 0x37E68000, 0x37E70000, 0x37E78000,
+				0x37E80000, 0x37E88000, 0x37E90000, 0x37E98000, 0x37EA0000, 0x37EA8000, 0x37EB0000, 0x37EB8000, 0x37EC0000, 0x37EC8000, 0x37ED0000, 0x37ED8000, 0x37EE0000, 0x37EE8000, 0x37EF0000, 0x37EF8000,
+				0x37F00000, 0x37F08000, 0x37F10000, 0x37F18000, 0x37F20000, 0x37F28000, 0x37F30000, 0x37F38000, 0x37F40000, 0x37F48000, 0x37F50000, 0x37F58000, 0x37F60000, 0x37F68000, 0x37F70000, 0x37F78000,
+				0x37F80000, 0x37F88000, 0x37F90000, 0x37F98000, 0x37FA0000, 0x37FA8000, 0x37FB0000, 0x37FB8000, 0x37FC0000, 0x37FC8000, 0x37FD0000, 0x37FD8000, 0x37FE0000, 0x37FE8000, 0x37FF0000, 0x37FF8000,
+				0x38000000, 0x38004000, 0x38008000, 0x3800C000, 0x38010000, 0x38014000, 0x38018000, 0x3801C000, 0x38020000, 0x38024000, 0x38028000, 0x3802C000, 0x38030000, 0x38034000, 0x38038000, 0x3803C000,
+				0x38040000, 0x38044000, 0x38048000, 0x3804C000, 0x38050000, 0x38054000, 0x38058000, 0x3805C000, 0x38060000, 0x38064000, 0x38068000, 0x3806C000, 0x38070000, 0x38074000, 0x38078000, 0x3807C000,
+				0x38080000, 0x38084000, 0x38088000, 0x3808C000, 0x38090000, 0x38094000, 0x38098000, 0x3809C000, 0x380A0000, 0x380A4000, 0x380A8000, 0x380AC000, 0x380B0000, 0x380B4000, 0x380B8000, 0x380BC000,
+				0x380C0000, 0x380C4000, 0x380C8000, 0x380CC000, 0x380D0000, 0x380D4000, 0x380D8000, 0x380DC000, 0x380E0000, 0x380E4000, 0x380E8000, 0x380EC000, 0x380F0000, 0x380F4000, 0x380F8000, 0x380FC000,
+				0x38100000, 0x38104000, 0x38108000, 0x3810C000, 0x38110000, 0x38114000, 0x38118000, 0x3811C000, 0x38120000, 0x38124000, 0x38128000, 0x3812C000, 0x38130000, 0x38134000, 0x38138000, 0x3813C000,
+				0x38140000, 0x38144000, 0x38148000, 0x3814C000, 0x38150000, 0x38154000, 0x38158000, 0x3815C000, 0x38160000, 0x38164000, 0x38168000, 0x3816C000, 0x38170000, 0x38174000, 0x38178000, 0x3817C000,
+				0x38180000, 0x38184000, 0x38188000, 0x3818C000, 0x38190000, 0x38194000, 0x38198000, 0x3819C000, 0x381A0000, 0x381A4000, 0x381A8000, 0x381AC000, 0x381B0000, 0x381B4000, 0x381B8000, 0x381BC000,
+				0x381C0000, 0x381C4000, 0x381C8000, 0x381CC000, 0x381D0000, 0x381D4000, 0x381D8000, 0x381DC000, 0x381E0000, 0x381E4000, 0x381E8000, 0x381EC000, 0x381F0000, 0x381F4000, 0x381F8000, 0x381FC000,
+				0x38200000, 0x38204000, 0x38208000, 0x3820C000, 0x38210000, 0x38214000, 0x38218000, 0x3821C000, 0x38220000, 0x38224000, 0x38228000, 0x3822C000, 0x38230000, 0x38234000, 0x38238000, 0x3823C000,
+				0x38240000, 0x38244000, 0x38248000, 0x3824C000, 0x38250000, 0x38254000, 0x38258000, 0x3825C000, 0x38260000, 0x38264000, 0x38268000, 0x3826C000, 0x38270000, 0x38274000, 0x38278000, 0x3827C000,
+				0x38280000, 0x38284000, 0x38288000, 0x3828C000, 0x38290000, 0x38294000, 0x38298000, 0x3829C000, 0x382A0000, 0x382A4000, 0x382A8000, 0x382AC000, 0x382B0000, 0x382B4000, 0x382B8000, 0x382BC000,
+				0x382C0000, 0x382C4000, 0x382C8000, 0x382CC000, 0x382D0000, 0x382D4000, 0x382D8000, 0x382DC000, 0x382E0000, 0x382E4000, 0x382E8000, 0x382EC000, 0x382F0000, 0x382F4000, 0x382F8000, 0x382FC000,
+				0x38300000, 0x38304000, 0x38308000, 0x3830C000, 0x38310000, 0x38314000, 0x38318000, 0x3831C000, 0x38320000, 0x38324000, 0x38328000, 0x3832C000, 0x38330000, 0x38334000, 0x38338000, 0x3833C000,
+				0x38340000, 0x38344000, 0x38348000, 0x3834C000, 0x38350000, 0x38354000, 0x38358000, 0x3835C000, 0x38360000, 0x38364000, 0x38368000, 0x3836C000, 0x38370000, 0x38374000, 0x38378000, 0x3837C000,
+				0x38380000, 0x38384000, 0x38388000, 0x3838C000, 0x38390000, 0x38394000, 0x38398000, 0x3839C000, 0x383A0000, 0x383A4000, 0x383A8000, 0x383AC000, 0x383B0000, 0x383B4000, 0x383B8000, 0x383BC000,
+				0x383C0000, 0x383C4000, 0x383C8000, 0x383CC000, 0x383D0000, 0x383D4000, 0x383D8000, 0x383DC000, 0x383E0000, 0x383E4000, 0x383E8000, 0x383EC000, 0x383F0000, 0x383F4000, 0x383F8000, 0x383FC000,
+				0x38400000, 0x38404000, 0x38408000, 0x3840C000, 0x38410000, 0x38414000, 0x38418000, 0x3841C000, 0x38420000, 0x38424000, 0x38428000, 0x3842C000, 0x38430000, 0x38434000, 0x38438000, 0x3843C000,
+				0x38440000, 0x38444000, 0x38448000, 0x3844C000, 0x38450000, 0x38454000, 0x38458000, 0x3845C000, 0x38460000, 0x38464000, 0x38468000, 0x3846C000, 0x38470000, 0x38474000, 0x38478000, 0x3847C000,
+				0x38480000, 0x38484000, 0x38488000, 0x3848C000, 0x38490000, 0x38494000, 0x38498000, 0x3849C000, 0x384A0000, 0x384A4000, 0x384A8000, 0x384AC000, 0x384B0000, 0x384B4000, 0x384B8000, 0x384BC000,
+				0x384C0000, 0x384C4000, 0x384C8000, 0x384CC000, 0x384D0000, 0x384D4000, 0x384D8000, 0x384DC000, 0x384E0000, 0x384E4000, 0x384E8000, 0x384EC000, 0x384F0000, 0x384F4000, 0x384F8000, 0x384FC000,
+				0x38500000, 0x38504000, 0x38508000, 0x3850C000, 0x38510000, 0x38514000, 0x38518000, 0x3851C000, 0x38520000, 0x38524000, 0x38528000, 0x3852C000, 0x38530000, 0x38534000, 0x38538000, 0x3853C000,
+				0x38540000, 0x38544000, 0x38548000, 0x3854C000, 0x38550000, 0x38554000, 0x38558000, 0x3855C000, 0x38560000, 0x38564000, 0x38568000, 0x3856C000, 0x38570000, 0x38574000, 0x38578000, 0x3857C000,
+				0x38580000, 0x38584000, 0x38588000, 0x3858C000, 0x38590000, 0x38594000, 0x38598000, 0x3859C000, 0x385A0000, 0x385A4000, 0x385A8000, 0x385AC000, 0x385B0000, 0x385B4000, 0x385B8000, 0x385BC000,
+				0x385C0000, 0x385C4000, 0x385C8000, 0x385CC000, 0x385D0000, 0x385D4000, 0x385D8000, 0x385DC000, 0x385E0000, 0x385E4000, 0x385E8000, 0x385EC000, 0x385F0000, 0x385F4000, 0x385F8000, 0x385FC000,
+				0x38600000, 0x38604000, 0x38608000, 0x3860C000, 0x38610000, 0x38614000, 0x38618000, 0x3861C000, 0x38620000, 0x38624000, 0x38628000, 0x3862C000, 0x38630000, 0x38634000, 0x38638000, 0x3863C000,
+				0x38640000, 0x38644000, 0x38648000, 0x3864C000, 0x38650000, 0x38654000, 0x38658000, 0x3865C000, 0x38660000, 0x38664000, 0x38668000, 0x3866C000, 0x38670000, 0x38674000, 0x38678000, 0x3867C000,
+				0x38680000, 0x38684000, 0x38688000, 0x3868C000, 0x38690000, 0x38694000, 0x38698000, 0x3869C000, 0x386A0000, 0x386A4000, 0x386A8000, 0x386AC000, 0x386B0000, 0x386B4000, 0x386B8000, 0x386BC000,
+				0x386C0000, 0x386C4000, 0x386C8000, 0x386CC000, 0x386D0000, 0x386D4000, 0x386D8000, 0x386DC000, 0x386E0000, 0x386E4000, 0x386E8000, 0x386EC000, 0x386F0000, 0x386F4000, 0x386F8000, 0x386FC000,
+				0x38700000, 0x38704000, 0x38708000, 0x3870C000, 0x38710000, 0x38714000, 0x38718000, 0x3871C000, 0x38720000, 0x38724000, 0x38728000, 0x3872C000, 0x38730000, 0x38734000, 0x38738000, 0x3873C000,
+				0x38740000, 0x38744000, 0x38748000, 0x3874C000, 0x38750000, 0x38754000, 0x38758000, 0x3875C000, 0x38760000, 0x38764000, 0x38768000, 0x3876C000, 0x38770000, 0x38774000, 0x38778000, 0x3877C000,
+				0x38780000, 0x38784000, 0x38788000, 0x3878C000, 0x38790000, 0x38794000, 0x38798000, 0x3879C000, 0x387A0000, 0x387A4000, 0x387A8000, 0x387AC000, 0x387B0000, 0x387B4000, 0x387B8000, 0x387BC000,
+				0x387C0000, 0x387C4000, 0x387C8000, 0x387CC000, 0x387D0000, 0x387D4000, 0x387D8000, 0x387DC000, 0x387E0000, 0x387E4000, 0x387E8000, 0x387EC000, 0x387F0000, 0x387F4000, 0x387F8000, 0x387FC000,
+				0x38000000, 0x38002000, 0x38004000, 0x38006000, 0x38008000, 0x3800A000, 0x3800C000, 0x3800E000, 0x38010000, 0x38012000, 0x38014000, 0x38016000, 0x38018000, 0x3801A000, 0x3801C000, 0x3801E000,
+				0x38020000, 0x38022000, 0x38024000, 0x38026000, 0x38028000, 0x3802A000, 0x3802C000, 0x3802E000, 0x38030000, 0x38032000, 0x38034000, 0x38036000, 0x38038000, 0x3803A000, 0x3803C000, 0x3803E000,
+				0x38040000, 0x38042000, 0x38044000, 0x38046000, 0x38048000, 0x3804A000, 0x3804C000, 0x3804E000, 0x38050000, 0x38052000, 0x38054000, 0x38056000, 0x38058000, 0x3805A000, 0x3805C000, 0x3805E000,
+				0x38060000, 0x38062000, 0x38064000, 0x38066000, 0x38068000, 0x3806A000, 0x3806C000, 0x3806E000, 0x38070000, 0x38072000, 0x38074000, 0x38076000, 0x38078000, 0x3807A000, 0x3807C000, 0x3807E000,
+				0x38080000, 0x38082000, 0x38084000, 0x38086000, 0x38088000, 0x3808A000, 0x3808C000, 0x3808E000, 0x38090000, 0x38092000, 0x38094000, 0x38096000, 0x38098000, 0x3809A000, 0x3809C000, 0x3809E000,
+				0x380A0000, 0x380A2000, 0x380A4000, 0x380A6000, 0x380A8000, 0x380AA000, 0x380AC000, 0x380AE000, 0x380B0000, 0x380B2000, 0x380B4000, 0x380B6000, 0x380B8000, 0x380BA000, 0x380BC000, 0x380BE000,
+				0x380C0000, 0x380C2000, 0x380C4000, 0x380C6000, 0x380C8000, 0x380CA000, 0x380CC000, 0x380CE000, 0x380D0000, 0x380D2000, 0x380D4000, 0x380D6000, 0x380D8000, 0x380DA000, 0x380DC000, 0x380DE000,
+				0x380E0000, 0x380E2000, 0x380E4000, 0x380E6000, 0x380E8000, 0x380EA000, 0x380EC000, 0x380EE000, 0x380F0000, 0x380F2000, 0x380F4000, 0x380F6000, 0x380F8000, 0x380FA000, 0x380FC000, 0x380FE000,
+				0x38100000, 0x38102000, 0x38104000, 0x38106000, 0x38108000, 0x3810A000, 0x3810C000, 0x3810E000, 0x38110000, 0x38112000, 0x38114000, 0x38116000, 0x38118000, 0x3811A000, 0x3811C000, 0x3811E000,
+				0x38120000, 0x38122000, 0x38124000, 0x38126000, 0x38128000, 0x3812A000, 0x3812C000, 0x3812E000, 0x38130000, 0x38132000, 0x38134000, 0x38136000, 0x38138000, 0x3813A000, 0x3813C000, 0x3813E000,
+				0x38140000, 0x38142000, 0x38144000, 0x38146000, 0x38148000, 0x3814A000, 0x3814C000, 0x3814E000, 0x38150000, 0x38152000, 0x38154000, 0x38156000, 0x38158000, 0x3815A000, 0x3815C000, 0x3815E000,
+				0x38160000, 0x38162000, 0x38164000, 0x38166000, 0x38168000, 0x3816A000, 0x3816C000, 0x3816E000, 0x38170000, 0x38172000, 0x38174000, 0x38176000, 0x38178000, 0x3817A000, 0x3817C000, 0x3817E000,
+				0x38180000, 0x38182000, 0x38184000, 0x38186000, 0x38188000, 0x3818A000, 0x3818C000, 0x3818E000, 0x38190000, 0x38192000, 0x38194000, 0x38196000, 0x38198000, 0x3819A000, 0x3819C000, 0x3819E000,
+				0x381A0000, 0x381A2000, 0x381A4000, 0x381A6000, 0x381A8000, 0x381AA000, 0x381AC000, 0x381AE000, 0x381B0000, 0x381B2000, 0x381B4000, 0x381B6000, 0x381B8000, 0x381BA000, 0x381BC000, 0x381BE000,
+				0x381C0000, 0x381C2000, 0x381C4000, 0x381C6000, 0x381C8000, 0x381CA000, 0x381CC000, 0x381CE000, 0x381D0000, 0x381D2000, 0x381D4000, 0x381D6000, 0x381D8000, 0x381DA000, 0x381DC000, 0x381DE000,
+				0x381E0000, 0x381E2000, 0x381E4000, 0x381E6000, 0x381E8000, 0x381EA000, 0x381EC000, 0x381EE000, 0x381F0000, 0x381F2000, 0x381F4000, 0x381F6000, 0x381F8000, 0x381FA000, 0x381FC000, 0x381FE000,
+				0x38200000, 0x38202000, 0x38204000, 0x38206000, 0x38208000, 0x3820A000, 0x3820C000, 0x3820E000, 0x38210000, 0x38212000, 0x38214000, 0x38216000, 0x38218000, 0x3821A000, 0x3821C000, 0x3821E000,
+				0x38220000, 0x38222000, 0x38224000, 0x38226000, 0x38228000, 0x3822A000, 0x3822C000, 0x3822E000, 0x38230000, 0x38232000, 0x38234000, 0x38236000, 0x38238000, 0x3823A000, 0x3823C000, 0x3823E000,
+				0x38240000, 0x38242000, 0x38244000, 0x38246000, 0x38248000, 0x3824A000, 0x3824C000, 0x3824E000, 0x38250000, 0x38252000, 0x38254000, 0x38256000, 0x38258000, 0x3825A000, 0x3825C000, 0x3825E000,
+				0x38260000, 0x38262000, 0x38264000, 0x38266000, 0x38268000, 0x3826A000, 0x3826C000, 0x3826E000, 0x38270000, 0x38272000, 0x38274000, 0x38276000, 0x38278000, 0x3827A000, 0x3827C000, 0x3827E000,
+				0x38280000, 0x38282000, 0x38284000, 0x38286000, 0x38288000, 0x3828A000, 0x3828C000, 0x3828E000, 0x38290000, 0x38292000, 0x38294000, 0x38296000, 0x38298000, 0x3829A000, 0x3829C000, 0x3829E000,
+				0x382A0000, 0x382A2000, 0x382A4000, 0x382A6000, 0x382A8000, 0x382AA000, 0x382AC000, 0x382AE000, 0x382B0000, 0x382B2000, 0x382B4000, 0x382B6000, 0x382B8000, 0x382BA000, 0x382BC000, 0x382BE000,
+				0x382C0000, 0x382C2000, 0x382C4000, 0x382C6000, 0x382C8000, 0x382CA000, 0x382CC000, 0x382CE000, 0x382D0000, 0x382D2000, 0x382D4000, 0x382D6000, 0x382D8000, 0x382DA000, 0x382DC000, 0x382DE000,
+				0x382E0000, 0x382E2000, 0x382E4000, 0x382E6000, 0x382E8000, 0x382EA000, 0x382EC000, 0x382EE000, 0x382F0000, 0x382F2000, 0x382F4000, 0x382F6000, 0x382F8000, 0x382FA000, 0x382FC000, 0x382FE000,
+				0x38300000, 0x38302000, 0x38304000, 0x38306000, 0x38308000, 0x3830A000, 0x3830C000, 0x3830E000, 0x38310000, 0x38312000, 0x38314000, 0x38316000, 0x38318000, 0x3831A000, 0x3831C000, 0x3831E000,
+				0x38320000, 0x38322000, 0x38324000, 0x38326000, 0x38328000, 0x3832A000, 0x3832C000, 0x3832E000, 0x38330000, 0x38332000, 0x38334000, 0x38336000, 0x38338000, 0x3833A000, 0x3833C000, 0x3833E000,
+				0x38340000, 0x38342000, 0x38344000, 0x38346000, 0x38348000, 0x3834A000, 0x3834C000, 0x3834E000, 0x38350000, 0x38352000, 0x38354000, 0x38356000, 0x38358000, 0x3835A000, 0x3835C000, 0x3835E000,
+				0x38360000, 0x38362000, 0x38364000, 0x38366000, 0x38368000, 0x3836A000, 0x3836C000, 0x3836E000, 0x38370000, 0x38372000, 0x38374000, 0x38376000, 0x38378000, 0x3837A000, 0x3837C000, 0x3837E000,
+				0x38380000, 0x38382000, 0x38384000, 0x38386000, 0x38388000, 0x3838A000, 0x3838C000, 0x3838E000, 0x38390000, 0x38392000, 0x38394000, 0x38396000, 0x38398000, 0x3839A000, 0x3839C000, 0x3839E000,
+				0x383A0000, 0x383A2000, 0x383A4000, 0x383A6000, 0x383A8000, 0x383AA000, 0x383AC000, 0x383AE000, 0x383B0000, 0x383B2000, 0x383B4000, 0x383B6000, 0x383B8000, 0x383BA000, 0x383BC000, 0x383BE000,
+				0x383C0000, 0x383C2000, 0x383C4000, 0x383C6000, 0x383C8000, 0x383CA000, 0x383CC000, 0x383CE000, 0x383D0000, 0x383D2000, 0x383D4000, 0x383D6000, 0x383D8000, 0x383DA000, 0x383DC000, 0x383DE000,
+				0x383E0000, 0x383E2000, 0x383E4000, 0x383E6000, 0x383E8000, 0x383EA000, 0x383EC000, 0x383EE000, 0x383F0000, 0x383F2000, 0x383F4000, 0x383F6000, 0x383F8000, 0x383FA000, 0x383FC000, 0x383FE000,
+				0x38400000, 0x38402000, 0x38404000, 0x38406000, 0x38408000, 0x3840A000, 0x3840C000, 0x3840E000, 0x38410000, 0x38412000, 0x38414000, 0x38416000, 0x38418000, 0x3841A000, 0x3841C000, 0x3841E000,
+				0x38420000, 0x38422000, 0x38424000, 0x38426000, 0x38428000, 0x3842A000, 0x3842C000, 0x3842E000, 0x38430000, 0x38432000, 0x38434000, 0x38436000, 0x38438000, 0x3843A000, 0x3843C000, 0x3843E000,
+				0x38440000, 0x38442000, 0x38444000, 0x38446000, 0x38448000, 0x3844A000, 0x3844C000, 0x3844E000, 0x38450000, 0x38452000, 0x38454000, 0x38456000, 0x38458000, 0x3845A000, 0x3845C000, 0x3845E000,
+				0x38460000, 0x38462000, 0x38464000, 0x38466000, 0x38468000, 0x3846A000, 0x3846C000, 0x3846E000, 0x38470000, 0x38472000, 0x38474000, 0x38476000, 0x38478000, 0x3847A000, 0x3847C000, 0x3847E000,
+				0x38480000, 0x38482000, 0x38484000, 0x38486000, 0x38488000, 0x3848A000, 0x3848C000, 0x3848E000, 0x38490000, 0x38492000, 0x38494000, 0x38496000, 0x38498000, 0x3849A000, 0x3849C000, 0x3849E000,
+				0x384A0000, 0x384A2000, 0x384A4000, 0x384A6000, 0x384A8000, 0x384AA000, 0x384AC000, 0x384AE000, 0x384B0000, 0x384B2000, 0x384B4000, 0x384B6000, 0x384B8000, 0x384BA000, 0x384BC000, 0x384BE000,
+				0x384C0000, 0x384C2000, 0x384C4000, 0x384C6000, 0x384C8000, 0x384CA000, 0x384CC000, 0x384CE000, 0x384D0000, 0x384D2000, 0x384D4000, 0x384D6000, 0x384D8000, 0x384DA000, 0x384DC000, 0x384DE000,
+				0x384E0000, 0x384E2000, 0x384E4000, 0x384E6000, 0x384E8000, 0x384EA000, 0x384EC000, 0x384EE000, 0x384F0000, 0x384F2000, 0x384F4000, 0x384F6000, 0x384F8000, 0x384FA000, 0x384FC000, 0x384FE000,
+				0x38500000, 0x38502000, 0x38504000, 0x38506000, 0x38508000, 0x3850A000, 0x3850C000, 0x3850E000, 0x38510000, 0x38512000, 0x38514000, 0x38516000, 0x38518000, 0x3851A000, 0x3851C000, 0x3851E000,
+				0x38520000, 0x38522000, 0x38524000, 0x38526000, 0x38528000, 0x3852A000, 0x3852C000, 0x3852E000, 0x38530000, 0x38532000, 0x38534000, 0x38536000, 0x38538000, 0x3853A000, 0x3853C000, 0x3853E000,
+				0x38540000, 0x38542000, 0x38544000, 0x38546000, 0x38548000, 0x3854A000, 0x3854C000, 0x3854E000, 0x38550000, 0x38552000, 0x38554000, 0x38556000, 0x38558000, 0x3855A000, 0x3855C000, 0x3855E000,
+				0x38560000, 0x38562000, 0x38564000, 0x38566000, 0x38568000, 0x3856A000, 0x3856C000, 0x3856E000, 0x38570000, 0x38572000, 0x38574000, 0x38576000, 0x38578000, 0x3857A000, 0x3857C000, 0x3857E000,
+				0x38580000, 0x38582000, 0x38584000, 0x38586000, 0x38588000, 0x3858A000, 0x3858C000, 0x3858E000, 0x38590000, 0x38592000, 0x38594000, 0x38596000, 0x38598000, 0x3859A000, 0x3859C000, 0x3859E000,
+				0x385A0000, 0x385A2000, 0x385A4000, 0x385A6000, 0x385A8000, 0x385AA000, 0x385AC000, 0x385AE000, 0x385B0000, 0x385B2000, 0x385B4000, 0x385B6000, 0x385B8000, 0x385BA000, 0x385BC000, 0x385BE000,
+				0x385C0000, 0x385C2000, 0x385C4000, 0x385C6000, 0x385C8000, 0x385CA000, 0x385CC000, 0x385CE000, 0x385D0000, 0x385D2000, 0x385D4000, 0x385D6000, 0x385D8000, 0x385DA000, 0x385DC000, 0x385DE000,
+				0x385E0000, 0x385E2000, 0x385E4000, 0x385E6000, 0x385E8000, 0x385EA000, 0x385EC000, 0x385EE000, 0x385F0000, 0x385F2000, 0x385F4000, 0x385F6000, 0x385F8000, 0x385FA000, 0x385FC000, 0x385FE000,
+				0x38600000, 0x38602000, 0x38604000, 0x38606000, 0x38608000, 0x3860A000, 0x3860C000, 0x3860E000, 0x38610000, 0x38612000, 0x38614000, 0x38616000, 0x38618000, 0x3861A000, 0x3861C000, 0x3861E000,
+				0x38620000, 0x38622000, 0x38624000, 0x38626000, 0x38628000, 0x3862A000, 0x3862C000, 0x3862E000, 0x38630000, 0x38632000, 0x38634000, 0x38636000, 0x38638000, 0x3863A000, 0x3863C000, 0x3863E000,
+				0x38640000, 0x38642000, 0x38644000, 0x38646000, 0x38648000, 0x3864A000, 0x3864C000, 0x3864E000, 0x38650000, 0x38652000, 0x38654000, 0x38656000, 0x38658000, 0x3865A000, 0x3865C000, 0x3865E000,
+				0x38660000, 0x38662000, 0x38664000, 0x38666000, 0x38668000, 0x3866A000, 0x3866C000, 0x3866E000, 0x38670000, 0x38672000, 0x38674000, 0x38676000, 0x38678000, 0x3867A000, 0x3867C000, 0x3867E000,
+				0x38680000, 0x38682000, 0x38684000, 0x38686000, 0x38688000, 0x3868A000, 0x3868C000, 0x3868E000, 0x38690000, 0x38692000, 0x38694000, 0x38696000, 0x38698000, 0x3869A000, 0x3869C000, 0x3869E000,
+				0x386A0000, 0x386A2000, 0x386A4000, 0x386A6000, 0x386A8000, 0x386AA000, 0x386AC000, 0x386AE000, 0x386B0000, 0x386B2000, 0x386B4000, 0x386B6000, 0x386B8000, 0x386BA000, 0x386BC000, 0x386BE000,
+				0x386C0000, 0x386C2000, 0x386C4000, 0x386C6000, 0x386C8000, 0x386CA000, 0x386CC000, 0x386CE000, 0x386D0000, 0x386D2000, 0x386D4000, 0x386D6000, 0x386D8000, 0x386DA000, 0x386DC000, 0x386DE000,
+				0x386E0000, 0x386E2000, 0x386E4000, 0x386E6000, 0x386E8000, 0x386EA000, 0x386EC000, 0x386EE000, 0x386F0000, 0x386F2000, 0x386F4000, 0x386F6000, 0x386F8000, 0x386FA000, 0x386FC000, 0x386FE000,
+				0x38700000, 0x38702000, 0x38704000, 0x38706000, 0x38708000, 0x3870A000, 0x3870C000, 0x3870E000, 0x38710000, 0x38712000, 0x38714000, 0x38716000, 0x38718000, 0x3871A000, 0x3871C000, 0x3871E000,
+				0x38720000, 0x38722000, 0x38724000, 0x38726000, 0x38728000, 0x3872A000, 0x3872C000, 0x3872E000, 0x38730000, 0x38732000, 0x38734000, 0x38736000, 0x38738000, 0x3873A000, 0x3873C000, 0x3873E000,
+				0x38740000, 0x38742000, 0x38744000, 0x38746000, 0x38748000, 0x3874A000, 0x3874C000, 0x3874E000, 0x38750000, 0x38752000, 0x38754000, 0x38756000, 0x38758000, 0x3875A000, 0x3875C000, 0x3875E000,
+				0x38760000, 0x38762000, 0x38764000, 0x38766000, 0x38768000, 0x3876A000, 0x3876C000, 0x3876E000, 0x38770000, 0x38772000, 0x38774000, 0x38776000, 0x38778000, 0x3877A000, 0x3877C000, 0x3877E000,
+				0x38780000, 0x38782000, 0x38784000, 0x38786000, 0x38788000, 0x3878A000, 0x3878C000, 0x3878E000, 0x38790000, 0x38792000, 0x38794000, 0x38796000, 0x38798000, 0x3879A000, 0x3879C000, 0x3879E000,
+				0x387A0000, 0x387A2000, 0x387A4000, 0x387A6000, 0x387A8000, 0x387AA000, 0x387AC000, 0x387AE000, 0x387B0000, 0x387B2000, 0x387B4000, 0x387B6000, 0x387B8000, 0x387BA000, 0x387BC000, 0x387BE000,
+				0x387C0000, 0x387C2000, 0x387C4000, 0x387C6000, 0x387C8000, 0x387CA000, 0x387CC000, 0x387CE000, 0x387D0000, 0x387D2000, 0x387D4000, 0x387D6000, 0x387D8000, 0x387DA000, 0x387DC000, 0x387DE000,
+				0x387E0000, 0x387E2000, 0x387E4000, 0x387E6000, 0x387E8000, 0x387EA000, 0x387EC000, 0x387EE000, 0x387F0000, 0x387F2000, 0x387F4000, 0x387F6000, 0x387F8000, 0x387FA000, 0x387FC000, 0x387FE000 };
+			static const uint32 exponent_table[64] = {
+				0x00000000, 0x00800000, 0x01000000, 0x01800000, 0x02000000, 0x02800000, 0x03000000, 0x03800000, 0x04000000, 0x04800000, 0x05000000, 0x05800000, 0x06000000, 0x06800000, 0x07000000, 0x07800000,
+				0x08000000, 0x08800000, 0x09000000, 0x09800000, 0x0A000000, 0x0A800000, 0x0B000000, 0x0B800000, 0x0C000000, 0x0C800000, 0x0D000000, 0x0D800000, 0x0E000000, 0x0E800000, 0x0F000000, 0x47800000,
+				0x80000000, 0x80800000, 0x81000000, 0x81800000, 0x82000000, 0x82800000, 0x83000000, 0x83800000, 0x84000000, 0x84800000, 0x85000000, 0x85800000, 0x86000000, 0x86800000, 0x87000000, 0x87800000,
+				0x88000000, 0x88800000, 0x89000000, 0x89800000, 0x8A000000, 0x8A800000, 0x8B000000, 0x8B800000, 0x8C000000, 0x8C800000, 0x8D000000, 0x8D800000, 0x8E000000, 0x8E800000, 0x8F000000, 0xC7800000 };
+			static const unsigned short offset_table[64] = {
+				   0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
+				   0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024 };
+			uint32 bits = mantissa_table[offset_table[value>>10]+(value&0x3FF)] + exponent_table[value>>10];
+//			return *reinterpret_cast<float*>(&bits);			//violating strict aliasing!
+			float out;
+			std::memcpy(&out, &bits, sizeof(float));
+			return out;
+		}
+
+		/// Convert half-precision to IEEE double-precision.
+		/// \param value binary representation of half-precision value
+		/// \return double-precision value
+		inline double half2float_impl(uint16 value, double, true_type)
+		{
+			typedef bits<float>::type uint32;
+			typedef bits<double>::type uint64;
+			uint32 hi = static_cast<uint32>(value&0x8000) << 16;
+			int abs = value & 0x7FFF;
+			if(abs)
+			{
+				hi |= 0x3F000000 << static_cast<unsigned>(abs>=0x7C00);
+				for(; abs<0x400; abs<<=1,hi-=0x100000) ;
+				hi += static_cast<uint32>(abs) << 10;
+			}
+			uint64 bits = static_cast<uint64>(hi) << 32;
+//			return *reinterpret_cast<double*>(&bits);			//violating strict aliasing!
+			double out;
+			std::memcpy(&out, &bits, sizeof(double));
+			return out;
+		}
+
+		/// Convert half-precision to non-IEEE floating point.
+		/// \tparam T type to convert to (builtin floating point type)
+		/// \param value binary representation of half-precision value
+		/// \return floating point value
+		template<typename T> T half2float_impl(uint16 value, T, ...)
+		{
+			T out;
+			int abs = value & 0x7FFF;
+			if(abs > 0x7C00)
+				out = std::numeric_limits<T>::has_quiet_NaN ? std::numeric_limits<T>::quiet_NaN() : T();
+			else if(abs == 0x7C00)
+				out = std::numeric_limits<T>::has_infinity ? std::numeric_limits<T>::infinity() : std::numeric_limits<T>::max();
+			else if(abs > 0x3FF)
+				out = std::ldexp(static_cast<T>((abs&0x3FF)|0x400), (abs>>10)-25);
+			else
+				out = std::ldexp(static_cast<T>(abs), -24);
+			return (value&0x8000) ? -out : out;
+		}
+
+		/// Convert half-precision to floating point.
+		/// \tparam T type to convert to (builtin floating point type)
+		/// \param value binary representation of half-precision value
+		/// \return floating point value
+		template<typename T> T half2float(uint16 value)
+		{
+			return half2float_impl(value, T(), bool_type<std::numeric_limits<T>::is_iec559&&sizeof(typename bits<T>::type)==sizeof(T)>());
+		}
+
+		/// Convert half-precision floating point to integer.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam E `true` for round to even, `false` for round away from zero
+		/// \tparam T type to convert to (builtin integer type with at least 16 bits precision, excluding any implicit sign bits)
+		/// \param value binary representation of half-precision value
+		/// \return integral value
+		template<std::float_round_style R,bool E,typename T> T half2int_impl(uint16 value)
+		{
+		#if HALF_ENABLE_CPP11_STATIC_ASSERT && HALF_ENABLE_CPP11_TYPE_TRAITS
+			static_assert(std::is_integral<T>::value, "half to int conversion only supports builtin integer types");
+		#endif
+			unsigned int e = value & 0x7FFF;
+			if(e >= 0x7C00)
+				return (value&0x8000) ? std::numeric_limits<T>::min() : std::numeric_limits<T>::max();
+			if(e < 0x3800)
+			{
+				if(R == std::round_toward_infinity)
+					return T(~(value>>15)&(e!=0));
+				else if(R == std::round_toward_neg_infinity)
+					return -T(value>0x8000);
+				return T();
+			}
+			unsigned int m = (value&0x3FF) | 0x400;
+			e >>= 10;
+			if(e < 25)
+			{
+				if(R == std::round_to_nearest)
+					m += (1<<(24-e)) - (~(m>>(25-e))&E);
+				else if(R == std::round_toward_infinity)
+					m += ((value>>15)-1) & ((1<<(25-e))-1U);
+				else if(R == std::round_toward_neg_infinity)
+					m += -(value>>15) & ((1<<(25-e))-1U);
+				m >>= 25 - e;
+			}
+			else
+				m <<= e - 25;
+			return (value&0x8000) ? -static_cast<T>(m) : static_cast<T>(m);
+		}
+
+		/// Convert half-precision floating point to integer.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam T type to convert to (builtin integer type with at least 16 bits precision, excluding any implicit sign bits)
+		/// \param value binary representation of half-precision value
+		/// \return integral value
+		template<std::float_round_style R,typename T> T half2int(uint16 value) { return half2int_impl<R,HALF_ROUND_TIES_TO_EVEN,T>(value); }
+
+		/// Convert half-precision floating point to integer using round-to-nearest-away-from-zero.
+		/// \tparam T type to convert to (builtin integer type with at least 16 bits precision, excluding any implicit sign bits)
+		/// \param value binary representation of half-precision value
+		/// \return integral value
+		template<typename T> T half2int_up(uint16 value) { return half2int_impl<std::round_to_nearest,0,T>(value); }
+
+		/// Round half-precision number to nearest integer value.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \tparam E `true` for round to even, `false` for round away from zero
+		/// \param value binary representation of half-precision value
+		/// \return half-precision bits for nearest integral value
+		template<std::float_round_style R,bool E> uint16 round_half_impl(uint16 value)
+		{
+			unsigned int e = value & 0x7FFF;
+			uint16 result = value;
+			if(e < 0x3C00)
+			{
+				result &= 0x8000;
+				if(R == std::round_to_nearest)
+					result |= 0x3C00U & -(e>=(0x3800+E));
+				else if(R == std::round_toward_infinity)
+					result |= 0x3C00U & -(~(value>>15)&(e!=0));
+				else if(R == std::round_toward_neg_infinity)
+					result |= 0x3C00U & -(value>0x8000);
+			}
+			else if(e < 0x6400)
+			{
+				e = 25 - (e>>10);
+				unsigned int mask = (1<<e) - 1;
+				if(R == std::round_to_nearest)
+					result += (1<<(e-1)) - (~(result>>e)&E);
+				else if(R == std::round_toward_infinity)
+					result += mask & ((value>>15)-1);
+				else if(R == std::round_toward_neg_infinity)
+					result += mask & -(value>>15);
+				result &= ~mask;
+			}
+			return result;
+		}
+
+		/// Round half-precision number to nearest integer value.
+		/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest rounding
+		/// \param value binary representation of half-precision value
+		/// \return half-precision bits for nearest integral value
+		template<std::float_round_style R> uint16 round_half(uint16 value) { return round_half_impl<R,HALF_ROUND_TIES_TO_EVEN>(value); }
+
+		/// Round half-precision number to nearest integer value using round-to-nearest-away-from-zero.
+		/// \param value binary representation of half-precision value
+		/// \return half-precision bits for nearest integral value
+		inline uint16 round_half_up(uint16 value) { return round_half_impl<std::round_to_nearest,0>(value); }
+		/// \}
+
+		struct functions;
+		template<typename> struct unary_specialized;
+		template<typename,typename> struct binary_specialized;
+		template<typename,typename,std::float_round_style> struct half_caster;
+	}
+
+	/// Half-precision floating point type.
+	/// This class implements an IEEE-conformant half-precision floating point type with the usual arithmetic operators and
+	/// conversions. It is implicitly convertible to single-precision floating point, so arithmetic expressions and
+	/// functions with mixed-type operands take the most precise operand type. Additionally all arithmetic operations
+	/// (and many mathematical functions) are carried out in single-precision internally. All conversions from single- to
+	/// half-precision are done using the library's default rounding mode, but temporary results inside chained arithmetic
+	/// expressions are kept in single-precision as long as possible (while of course still maintaining a strong half-precision type).
+	///
+	/// According to the C++98/03 definition, the half type is not a POD type. But according to C++11's less strict and
+	/// extended definitions it is both a standard layout type and a trivially copyable type (even if not a POD type), which
+	/// means it can be standard-conformantly copied using raw binary copies. A few words on the actual size of the type
+	/// are in order here. Although half represents an IEEE 16-bit type, it does not necessarily have to be exactly
+	/// 16 bits in size. On any reasonable implementation the actual binary representation of this type will most
+	/// probably not involve any additional "magic" or padding beyond the simple binary representation of the underlying 16-bit
+	/// IEEE number, even if this is not strictly guaranteed by the standard. Even then it only has an actual size of 16 bits if
+	/// your C++ implementation supports an unsigned integer type of exactly 16 bits width, which should be the case on
+	/// nearly any reasonable platform.
+	///
+	/// So if your C++ implementation is not totally exotic and does not impose special alignment requirements, it is a
+	/// reasonable assumption that the data of a half consists of just the 2 bytes of the underlying IEEE representation.
+	class half
+	{
+		friend struct detail::functions;
+		friend struct detail::unary_specialized<half>;
+		friend struct detail::binary_specialized<half,half>;
+		template<typename,typename,std::float_round_style> friend struct detail::half_caster;
+		friend class std::numeric_limits<half>;
+	#if HALF_ENABLE_CPP11_HASH
+		friend struct std::hash<half>;
+	#endif
+	#if HALF_ENABLE_CPP11_USER_LITERALS
+		friend half literal::operator""_h(long double);
+	#endif
+
+	public:
+		/// Default constructor.
+		/// This initializes the half to 0. Although this does not match the builtin types' default-initialization semantics
+		/// and may be less efficient than no initialization, it is needed to provide proper value-initialization semantics.
+		HALF_CONSTEXPR half() HALF_NOEXCEPT : data_() {}
+
+		/// Copy constructor.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to copy from
+		half(detail::expr rhs) : data_(detail::float2half<round_style>(static_cast<float>(rhs))) {}
+
+		/// Conversion constructor.
+		/// \param rhs float to convert
+		explicit half(float rhs) : data_(detail::float2half<round_style>(rhs)) {}
+	
+		/// Conversion to single-precision.
+		/// \return single precision value representing expression value
+		operator float() const { return detail::half2float<float>(data_); }
+
+		/// Assignment operator.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to copy from
+		/// \return reference to this half
+		half& operator=(detail::expr rhs) { return *this = static_cast<float>(rhs); }
+
+		/// Arithmetic assignment.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to add
+		/// \return reference to this half
+		template<typename T> typename detail::enable<half&,T>::type operator+=(T rhs) { return *this += static_cast<float>(rhs); }
+
+		/// Arithmetic assignment.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to subtract
+		/// \return reference to this half
+		template<typename T> typename detail::enable<half&,T>::type operator-=(T rhs) { return *this -= static_cast<float>(rhs); }
+
+		/// Arithmetic assignment.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to multiply with
+		/// \return reference to this half
+		template<typename T> typename detail::enable<half&,T>::type operator*=(T rhs) { return *this *= static_cast<float>(rhs); }
+
+		/// Arithmetic assignment.
+		/// \tparam T type of concrete half expression
+		/// \param rhs half expression to divide by
+		/// \return reference to this half
+		template<typename T> typename detail::enable<half&,T>::type operator/=(T rhs) { return *this /= static_cast<float>(rhs); }
+
+		/// Assignment operator.
+		/// \param rhs single-precision value to copy from
+		/// \return reference to this half
+		half& operator=(float rhs) { data_ = detail::float2half<round_style>(rhs); return *this; }
+
+		/// Arithmetic assignment.
+		/// \param rhs single-precision value to add
+		/// \return reference to this half
+		half& operator+=(float rhs) { data_ = detail::float2half<round_style>(detail::half2float<float>(data_)+rhs); return *this; }
+
+		/// Arithmetic assignment.
+		/// \param rhs single-precision value to subtract
+		/// \return reference to this half
+		half& operator-=(float rhs) { data_ = detail::float2half<round_style>(detail::half2float<float>(data_)-rhs); return *this; }
+
+		/// Arithmetic assignment.
+		/// \param rhs single-precision value to multiply with
+		/// \return reference to this half
+		half& operator*=(float rhs) { data_ = detail::float2half<round_style>(detail::half2float<float>(data_)*rhs); return *this; }
+
+		/// Arithmetic assignment.
+		/// \param rhs single-precision value to divide by
+		/// \return reference to this half
+		half& operator/=(float rhs) { data_ = detail::float2half<round_style>(detail::half2float<float>(data_)/rhs); return *this; }
+
+		/// Prefix increment.
+		/// \return incremented half value
+		half& operator++() { return *this += 1.0f; }
+
+		/// Prefix decrement.
+		/// \return decremented half value
+		half& operator--() { return *this -= 1.0f; }
+
+		/// Postfix increment.
+		/// \return non-incremented half value
+		half operator++(int) { half out(*this); ++*this; return out; }
+
+		/// Postfix decrement.
+		/// \return non-decremented half value
+		half operator--(int) { half out(*this); --*this; return out; }
+	
+	private:
+		/// Rounding mode to use
+		static const std::float_round_style round_style = (std::float_round_style)(HALF_ROUND_STYLE);
+
+		/// Constructor.
+		/// \param bits binary representation to set half to
+		HALF_CONSTEXPR half(detail::binary_t, detail::uint16 bits) HALF_NOEXCEPT : data_(bits) {}
+
+		/// Internal binary representation
+		detail::uint16 data_;
+	};
+
+#if HALF_ENABLE_CPP11_USER_LITERALS
+	namespace literal
+	{
+		/// Half literal.
+		/// While this returns an actual half-precision value, half literals can unfortunately not be constant expressions due
+		/// to rather involved conversions.
+		/// \param value literal value
+		/// \return half with given value (if representable)
+		inline half operator""_h(long double value) { return half(detail::binary, detail::float2half<half::round_style>(value)); }
+	}
+#endif
+
+	namespace detail
+	{
+		/// Wrapper implementing unspecialized half-precision functions.
+		struct functions
+		{
+			/// Addition implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision sum stored in single-precision
+			static expr plus(float x, float y) { return expr(x+y); }
+
+			/// Subtraction implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision difference stored in single-precision
+			static expr minus(float x, float y) { return expr(x-y); }
+
+			/// Multiplication implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision product stored in single-precision
+			static expr multiplies(float x, float y) { return expr(x*y); }
+
+			/// Division implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision quotient stored in single-precision
+			static expr divides(float x, float y) { return expr(x/y); }
+
+			/// Output implementation.
+			/// \param out stream to write to
+			/// \param arg value to write
+			/// \return reference to stream
+			template<typename charT,typename traits> static std::basic_ostream<charT,traits>& write(std::basic_ostream<charT,traits> &out, float arg) { return out << arg; }
+
+			/// Input implementation.
+			/// \param in stream to read from
+			/// \param arg half to read into
+			/// \return reference to stream
+			template<typename charT,typename traits> static std::basic_istream<charT,traits>& read(std::basic_istream<charT,traits> &in, half &arg)
+			{
+				float f;
+				if(in >> f)
+					arg = f;
+				return in;
+			}
+
+			/// Modulo implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision division remainder stored in single-precision
+			static expr fmod(float x, float y) { return expr(std::fmod(x, y)); }
+
+			/// Remainder implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Half-precision division remainder stored in single-precision
+			static expr remainder(float x, float y)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::remainder(x, y));
+			#else
+				if(builtin_isnan(x) || builtin_isnan(y))
+					return expr(std::numeric_limits<float>::quiet_NaN());
+				float ax = std::fabs(x), ay = std::fabs(y);
+				if(ax >= 65536.0f || ay < std::ldexp(1.0f, -24))
+					return expr(std::numeric_limits<float>::quiet_NaN());
+				if(ay >= 65536.0f)
+					return expr(x);
+				if(ax == ay)
+					return expr(builtin_signbit(x) ? -0.0f : 0.0f);
+				ax = std::fmod(ax, ay+ay);
+				float y2 = 0.5f * ay;
+				if(ax > y2)
+				{
+					ax -= ay;
+					if(ax >= y2)
+						ax -= ay;
+				}
+				return expr(builtin_signbit(x) ? -ax : ax);
+			#endif
+			}
+
+			/// Remainder implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \param quo address to store quotient bits at
+			/// \return Half-precision division remainder stored in single-precision
+			static expr remquo(float x, float y, int *quo)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::remquo(x, y, quo));
+			#else
+				if(builtin_isnan(x) || builtin_isnan(y))
+					return expr(std::numeric_limits<float>::quiet_NaN());
+				bool sign = builtin_signbit(x), qsign = static_cast<bool>(sign^builtin_signbit(y));
+				float ax = std::fabs(x), ay = std::fabs(y);
+				if(ax >= 65536.0f || ay < std::ldexp(1.0f, -24))
+					return expr(std::numeric_limits<float>::quiet_NaN());
+				if(ay >= 65536.0f)
+					return expr(x);
+				if(ax == ay)
+					return *quo = qsign ? -1 : 1, expr(sign ? -0.0f : 0.0f);
+				ax = std::fmod(ax, 8.0f*ay);
+				int cquo = 0;
+				if(ax >= 4.0f * ay)
+				{
+					ax -= 4.0f * ay;
+					cquo += 4;
+				}
+				if(ax >= 2.0f * ay)
+				{
+					ax -= 2.0f * ay;
+					cquo += 2;
+				}
+				float y2 = 0.5f * ay;
+				if(ax > y2)
+				{
+					ax -= ay;
+					++cquo;
+					if(ax >= y2)
+					{
+						ax -= ay;
+						++cquo;
+					}
+				}
+				return *quo = qsign ? -cquo : cquo, expr(sign ? -ax : ax);
+			#endif
+			}
+
+			/// Positive difference implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return Positive difference stored in single-precision
+			static expr fdim(float x, float y)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::fdim(x, y));
+			#else
+				return expr((x<=y) ? 0.0f : (x-y));
+			#endif
+			}
+
+			/// Fused multiply-add implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \param z third operand
+			/// \return \a x * \a y + \a z stored in single-precision
+			static expr fma(float x, float y, float z)
+			{
+			#if HALF_ENABLE_CPP11_CMATH && defined(FP_FAST_FMAF)
+				return expr(std::fma(x, y, z));
+			#else
+				return expr(x*y+z);
+			#endif
+			}
+
+			/// Get NaN.
+			/// \return Half-precision quiet NaN
+			static half nanh() { return half(binary, 0x7FFF); }
+
+			/// Exponential implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr exp(float arg) { return expr(std::exp(arg)); }
+
+			/// Exponential minus one implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr expm1(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::expm1(arg));
+			#else
+				return expr(static_cast<float>(std::exp(static_cast<double>(arg))-1.0));
+			#endif
+			}
+
+			/// Binary exponential implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr exp2(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::exp2(arg));
+			#else
+				return expr(static_cast<float>(std::exp(arg*0.69314718055994530941723212145818)));
+			#endif
+			}
+
+			/// Logarithm implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr log(float arg) { return expr(std::log(arg)); }
+
+			/// Common logarithm implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr log10(float arg) { return expr(std::log10(arg)); }
+
+			/// Logarithm of one plus argument implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr log1p(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::log1p(arg));
+			#else
+				return expr(static_cast<float>(std::log(1.0+arg)));
+			#endif
+			}
+
+			/// Binary logarithm implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr log2(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::log2(arg));
+			#else
+				return expr(static_cast<float>(std::log(static_cast<double>(arg))*1.4426950408889634073599246810019));
+			#endif
+			}
+
+			/// Square root implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr sqrt(float arg) { return expr(std::sqrt(arg)); }
+
+			/// Cube root implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr cbrt(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::cbrt(arg));
+			#else
+				if(builtin_isnan(arg) || builtin_isinf(arg))
+					return expr(arg);
+				return expr(builtin_signbit(arg) ? -static_cast<float>(std::pow(-static_cast<double>(arg), 1.0/3.0)) :
+					static_cast<float>(std::pow(static_cast<double>(arg), 1.0/3.0)));
+			#endif
+			}
+
+			/// Hypotenuse implementation.
+			/// \param x first argument
+			/// \param y second argument
+			/// \return function value stored in single-precision
+			static expr hypot(float x, float y)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::hypot(x, y));
+			#else
+				return expr((builtin_isinf(x) || builtin_isinf(y)) ? std::numeric_limits<float>::infinity() :
+					static_cast<float>(std::sqrt(static_cast<double>(x)*x+static_cast<double>(y)*y)));
+			#endif
+			}
+
+			/// Power implementation.
+			/// \param base value to exponentiate
+			/// \param exp power to exponentiate to
+			/// \return function value stored in single-precision
+			static expr pow(float base, float exp) { return expr(std::pow(base, exp)); }
+
+			/// Sine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr sin(float arg) { return expr(std::sin(arg)); }
+
+			/// Cosine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr cos(float arg) { return expr(std::cos(arg)); }
+
+			/// Tangent implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr tan(float arg) { return expr(std::tan(arg)); }
+
+			/// Arc sine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr asin(float arg) { return expr(std::asin(arg)); }
+
+			/// Arc cosine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr acos(float arg) { return expr(std::acos(arg)); }
+
+			/// Arc tangent implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr atan(float arg) { return expr(std::atan(arg)); }
+
+			/// Arc tangent implementation.
+			/// \param x first argument
+			/// \param y second argument
+			/// \return function value stored in single-precision
+			static expr atan2(float x, float y) { return expr(std::atan2(x, y)); }
+
+			/// Hyperbolic sine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr sinh(float arg) { return expr(std::sinh(arg)); }
+
+			/// Hyperbolic cosine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr cosh(float arg) { return expr(std::cosh(arg)); }
+
+			/// Hyperbolic tangent implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr tanh(float arg) { return expr(std::tanh(arg)); }
+
+			/// Hyperbolic area sine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr asinh(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::asinh(arg));
+			#else
+				return expr((arg==-std::numeric_limits<float>::infinity()) ? arg : static_cast<float>(std::log(arg+std::sqrt(arg*arg+1.0))));
+			#endif
+			}
+
+			/// Hyperbolic area cosine implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr acosh(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::acosh(arg));
+			#else
+				return expr((arg<-1.0f) ? std::numeric_limits<float>::quiet_NaN() : static_cast<float>(std::log(arg+std::sqrt(arg*arg-1.0))));
+			#endif
+			}
+
+			/// Hyperbolic area tangent implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr atanh(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::atanh(arg));
+			#else
+				return expr(static_cast<float>(0.5*std::log((1.0+arg)/(1.0-arg))));
+			#endif
+			}
+
+			/// Error function implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr erf(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::erf(arg));
+			#else
+				return expr(static_cast<float>(erf(static_cast<double>(arg))));
+			#endif
+			}
+
+			/// Complementary error function implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr erfc(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::erfc(arg));
+			#else
+				return expr(static_cast<float>(1.0-erf(static_cast<double>(arg))));
+			#endif
+			}
+
+			/// Gamma logarithm implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr lgamma(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::lgamma(arg));
+			#else
+				if(builtin_isinf(arg))
+					return expr(std::numeric_limits<float>::infinity());
+				if(arg < 0.0f)
+				{
+					float i, f = std::modf(-arg, &i);
+					if(f == 0.0f)
+						return expr(std::numeric_limits<float>::infinity());
+					return expr(static_cast<float>(1.1447298858494001741434273513531-
+						std::log(std::abs(std::sin(3.1415926535897932384626433832795*f)))-lgamma(1.0-arg)));
+				}
+				return expr(static_cast<float>(lgamma(static_cast<double>(arg))));
+			#endif
+			}
+
+			/// Gamma implementation.
+			/// \param arg function argument
+			/// \return function value stored in single-precision
+			static expr tgamma(float arg)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::tgamma(arg));
+			#else
+				if(arg == 0.0f)
+					return builtin_signbit(arg) ? expr(-std::numeric_limits<float>::infinity()) : expr(std::numeric_limits<float>::infinity());
+				if(arg < 0.0f)
+				{
+					float i, f = std::modf(-arg, &i);
+					if(f == 0.0f)
+						return expr(std::numeric_limits<float>::quiet_NaN());
+					double value = 3.1415926535897932384626433832795 / (std::sin(3.1415926535897932384626433832795*f)*std::exp(lgamma(1.0-arg)));
+					return expr(static_cast<float>((std::fmod(i, 2.0f)==0.0f) ? -value : value));
+				}
+				if(builtin_isinf(arg))
+					return expr(arg);
+				return expr(static_cast<float>(std::exp(lgamma(static_cast<double>(arg)))));
+			#endif
+			}
+
+			/// Floor implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static half floor(half arg) { return half(binary, round_half<std::round_toward_neg_infinity>(arg.data_)); }
+
+			/// Ceiling implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static half ceil(half arg) { return half(binary, round_half<std::round_toward_infinity>(arg.data_)); }
+
+			/// Truncation implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static half trunc(half arg) { return half(binary, round_half<std::round_toward_zero>(arg.data_)); }
+
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static half round(half arg) { return half(binary, round_half_up(arg.data_)); }
+
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static long lround(half arg) { return detail::half2int_up<long>(arg.data_); }
+
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static half rint(half arg) { return half(binary, round_half<half::round_style>(arg.data_)); }
+
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static long lrint(half arg) { return detail::half2int<half::round_style,long>(arg.data_); }
+
+		#if HALF_ENABLE_CPP11_LONG_LONG
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static long long llround(half arg) { return detail::half2int_up<long long>(arg.data_); }
+
+			/// Nearest integer implementation.
+			/// \param arg value to round
+			/// \return rounded value
+			static long long llrint(half arg) { return detail::half2int<half::round_style,long long>(arg.data_); }
+		#endif
+
+			/// Decomposition implementation.
+			/// \param arg number to decompose
+			/// \param exp address to store exponent at
+			/// \return normalized significand
+			static half frexp(half arg, int *exp)
+			{
+				int m = arg.data_ & 0x7FFF, e = -14;
+				if(m >= 0x7C00 || !m)
+					return *exp = 0, arg;
+				for(; m<0x400; m<<=1,--e) ;
+				return *exp = e+(m>>10), half(binary, (arg.data_&0x8000)|0x3800|(m&0x3FF));
+			}
+
+			/// Decomposition implementation.
+			/// \param arg number to decompose
+			/// \param iptr address to store integer part at
+			/// \return fractional part
+			static half modf(half arg, half *iptr)
+			{
+				unsigned int e = arg.data_ & 0x7FFF;
+				if(e >= 0x6400)
+					return *iptr = arg, half(binary, arg.data_&(0x8000U|-(e>0x7C00)));
+				if(e < 0x3C00)
+					return iptr->data_ = arg.data_ & 0x8000, arg;
+				e >>= 10;
+				unsigned int mask = (1<<(25-e)) - 1, m = arg.data_ & mask;
+				iptr->data_ = arg.data_ & ~mask;
+				if(!m)
+					return half(binary, arg.data_&0x8000);
+				for(; m<0x400; m<<=1,--e) ;
+				return half(binary, static_cast<uint16>((arg.data_&0x8000)|(e<<10)|(m&0x3FF)));
+			}
+
+			/// Scaling implementation.
+			/// \param arg number to scale
+			/// \param exp power of two to scale by
+			/// \return scaled number
+			static half scalbln(half arg, long exp)
+			{
+				unsigned int m = arg.data_ & 0x7FFF;
+				if(m >= 0x7C00 || !m)
+					return arg;
+				for(; m<0x400; m<<=1,--exp) ;
+				exp += m >> 10;
+				uint16 value = arg.data_ & 0x8000;
+				if(exp > 30)
+				{
+					if(half::round_style == std::round_toward_zero)
+						value |= 0x7BFF;
+					else if(half::round_style == std::round_toward_infinity)
+						value |= 0x7C00 - (value>>15);
+					else if(half::round_style == std::round_toward_neg_infinity)
+						value |= 0x7BFF + (value>>15);
+					else
+						value |= 0x7C00;
+				}
+				else if(exp > 0)
+					value |= (exp<<10) | (m&0x3FF);
+				else if(exp > -11)
+				{
+					m = (m&0x3FF) | 0x400;
+					if(half::round_style == std::round_to_nearest)
+					{
+						m += 1 << -exp;
+					#if HALF_ROUND_TIES_TO_EVEN
+						m -= (m>>(1-exp)) & 1;
+					#endif
+					}
+					else if(half::round_style == std::round_toward_infinity)
+						m += ((value>>15)-1) & ((1<<(1-exp))-1U);
+					else if(half::round_style == std::round_toward_neg_infinity)
+						m += -(value>>15) & ((1<<(1-exp))-1U);
+					value |= m >> (1-exp);
+				}
+				else if(half::round_style == std::round_toward_infinity)
+					value -= (value>>15) - 1;
+				else if(half::round_style == std::round_toward_neg_infinity)
+					value += value >> 15;
+				return half(binary, value);
+			}
+
+			/// Exponent implementation.
+			/// \param arg number to query
+			/// \return floating point exponent
+			static int ilogb(half arg)
+			{
+				int abs = arg.data_ & 0x7FFF;
+				if(!abs)
+					return FP_ILOGB0;
+				if(abs < 0x7C00)
+				{
+					int exp = (abs>>10) - 15;
+					if(abs < 0x400)
+						for(; abs<0x200; abs<<=1,--exp) ;
+					return exp;
+				}
+				if(abs > 0x7C00)
+					return FP_ILOGBNAN;
+				return INT_MAX;
+			}
+
+			/// Exponent implementation.
+			/// \param arg number to query
+			/// \return floating point exponent
+			static half logb(half arg)
+			{
+				int abs = arg.data_ & 0x7FFF;
+				if(!abs)
+					return half(binary, 0xFC00);
+				if(abs < 0x7C00)
+				{
+					int exp = (abs>>10) - 15;
+					if(abs < 0x400)
+						for(; abs<0x200; abs<<=1,--exp) ;
+					uint16 bits = (exp<0) << 15;
+					if(exp)
+					{
+						unsigned int m = std::abs(exp) << 6, e = 18;
+						for(; m<0x400; m<<=1,--e) ;
+						bits |= (e<<10) + m;
+					}
+					return half(binary, bits);
+				}
+				if(abs > 0x7C00)
+					return arg;
+				return half(binary, 0x7C00);
+			}
+
+			/// Enumeration implementation.
+			/// \param from number to increase/decrease
+			/// \param to direction to enumerate into
+			/// \return next representable number
+			static half nextafter(half from, half to)
+			{
+				uint16 fabs = from.data_ & 0x7FFF, tabs = to.data_ & 0x7FFF;
+				if(fabs > 0x7C00)
+					return from;
+				if(tabs > 0x7C00 || from.data_ == to.data_ || !(fabs|tabs))
+					return to;
+				if(!fabs)
+					return half(binary, (to.data_&0x8000)+1);
+				bool lt = ((fabs==from.data_) ? static_cast<int>(fabs) : -static_cast<int>(fabs)) <
+					((tabs==to.data_) ? static_cast<int>(tabs) : -static_cast<int>(tabs));
+				return half(binary, from.data_+(((from.data_>>15)^static_cast<unsigned>(lt))<<1)-1);
+			}
+
+			/// Enumeration implementation.
+			/// \param from number to increase/decrease
+			/// \param to direction to enumerate into
+			/// \return next representable number
+			static half nexttoward(half from, long double to)
+			{
+				if(isnan(from))
+					return from;
+				long double lfrom = static_cast<long double>(from);
+				if(builtin_isnan(to) || lfrom == to)
+					return half(static_cast<float>(to));
+				if(!(from.data_&0x7FFF))
+					return half(binary, (static_cast<detail::uint16>(builtin_signbit(to))<<15)+1);
+				return half(binary, from.data_+(((from.data_>>15)^static_cast<unsigned>(lfrom<to))<<1)-1);
+			}
+
+			/// Sign implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return composed value
+			static half copysign(half x, half y) { return half(binary, x.data_^((x.data_^y.data_)&0x8000)); }
+
+			/// Classification implementation.
+			/// \param arg value to classify
+			/// \return floating point class of \a arg, one of FP_ZERO, FP_SUBNORMAL,
+			/// FP_NORMAL, FP_INFINITE or FP_NAN
+			static int fpclassify(half arg)
+			{
+				unsigned int abs = arg.data_ & 0x7FFF;
+				return abs ? ((abs>0x3FF) ? ((abs>=0x7C00) ? ((abs>0x7C00) ? FP_NAN : FP_INFINITE) : FP_NORMAL) :FP_SUBNORMAL) : FP_ZERO;
+			}
+
+			/// Classification implementation.
+			/// \param arg value to classify
+			/// \retval true if finite number
+			/// \retval false else
+			static bool isfinite(half arg) { return (arg.data_&0x7C00) != 0x7C00; }
+
+			/// Classification implementation.
+			/// \param arg value to classify
+			/// \retval true if infinite number
+			/// \retval false else
+			static bool isinf(half arg) { return (arg.data_&0x7FFF) == 0x7C00; }
+
+			/// Classification implementation.
+			/// \param arg value to classify
+			/// \retval true if not a number
+			/// \retval false else
+			static bool isnan(half arg) { return (arg.data_&0x7FFF) > 0x7C00; }
+
+			/// Classification implementation.
+			/// \param arg value to classify
+			/// \retval true if normal number
+			/// \retval false else
+			static bool isnormal(half arg) { return ((arg.data_&0x7C00)!=0) & ((arg.data_&0x7C00)!=0x7C00); }
+
+			/// Sign bit implementation.
+			/// \param arg value to check
+			/// \retval true if signed
+			/// \retval false if unsigned
+			static bool signbit(half arg) { return (arg.data_&0x8000) != 0; }
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if operands equal
+			/// \retval false else
+			static bool isequal(half x, half y) { return (x.data_==y.data_ || !((x.data_|y.data_)&0x7FFF)) && !isnan(x); }
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if operands not equal
+			/// \retval false else
+			static bool isnotequal(half x, half y) { return (x.data_!=y.data_ && ((x.data_|y.data_)&0x7FFF)) || isnan(x); }
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if \a x > \a y
+			/// \retval false else
+			static bool isgreater(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				return xabs<=0x7C00 && yabs<=0x7C00 && (((xabs==x.data_) ? xabs : -xabs) > ((yabs==y.data_) ? yabs : -yabs));
+			}
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if \a x >= \a y
+			/// \retval false else
+			static bool isgreaterequal(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				return xabs<=0x7C00 && yabs<=0x7C00 && (((xabs==x.data_) ? xabs : -xabs) >= ((yabs==y.data_) ? yabs : -yabs));
+			}
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if \a x < \a y
+			/// \retval false else
+			static bool isless(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				return xabs<=0x7C00 && yabs<=0x7C00 && (((xabs==x.data_) ? xabs : -xabs) < ((yabs==y.data_) ? yabs : -yabs));
+			}
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if \a x <= \a y
+			/// \retval false else
+			static bool islessequal(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				return xabs<=0x7C00 && yabs<=0x7C00 && (((xabs==x.data_) ? xabs : -xabs) <= ((yabs==y.data_) ? yabs : -yabs));
+			}
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if either \a x > \a y or \a x < \a y
+			/// \retval false else
+			static bool islessgreater(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				if(xabs > 0x7C00 || yabs > 0x7C00)
+					return false;
+				int a = (xabs==x.data_) ? xabs : -xabs, b = (yabs==y.data_) ? yabs : -yabs;
+				return a < b || a > b;
+			}
+
+			/// Comparison implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \retval true if operands unordered
+			/// \retval false else
+			static bool isunordered(half x, half y) { return isnan(x) || isnan(y); }
+
+		private:
+			static double erf(double arg)
+			{
+				if(builtin_isinf(arg))
+					return (arg<0.0) ? -1.0 : 1.0;
+				double x2 = arg * arg, ax2 = 0.147 * x2, value = std::sqrt(1.0-std::exp(-x2*(1.2732395447351626861510701069801+ax2)/(1.0+ax2)));
+				return builtin_signbit(arg) ? -value : value;
+			}
+
+			static double lgamma(double arg)
+			{
+				double v = 1.0;
+				for(; arg<8.0; ++arg) v *= arg;
+				double w = 1.0 / (arg*arg);
+				return (((((((-0.02955065359477124183006535947712*w+0.00641025641025641025641025641026)*w+
+					-0.00191752691752691752691752691753)*w+8.4175084175084175084175084175084e-4)*w+
+					-5.952380952380952380952380952381e-4)*w+7.9365079365079365079365079365079e-4)*w+
+					-0.00277777777777777777777777777778)*w+0.08333333333333333333333333333333)/arg +
+					0.91893853320467274178032973640562 - std::log(v) - arg + (arg-0.5) * std::log(arg);
+			}
+		};
+
+		/// Wrapper for unary half-precision functions needing specialization for individual argument types.
+		/// \tparam T argument type
+		template<typename T> struct unary_specialized
+		{
+			/// Negation implementation.
+			/// \param arg value to negate
+			/// \return negated value
+			static HALF_CONSTEXPR half negate(half arg) { return half(binary, arg.data_^0x8000); }
+
+			/// Absolute value implementation.
+			/// \param arg function argument
+			/// \return absolute value
+			static half fabs(half arg) { return half(binary, arg.data_&0x7FFF); }
+		};
+		template<> struct unary_specialized<expr>
+		{
+			static HALF_CONSTEXPR expr negate(float arg) { return expr(-arg); }
+			static expr fabs(float arg) { return expr(std::fabs(arg)); }
+		};
+
+		/// Wrapper for binary half-precision functions needing specialization for individual argument types.
+		/// \tparam T first argument type
+		/// \tparam U second argument type
+		template<typename T,typename U> struct binary_specialized
+		{
+			/// Minimum implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return minimum value
+			static expr fmin(float x, float y)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::fmin(x, y));
+			#else
+				if(builtin_isnan(x))
+					return expr(y);
+				if(builtin_isnan(y))
+					return expr(x);
+				return expr(std::min(x, y));
+			#endif
+			}
+
+			/// Maximum implementation.
+			/// \param x first operand
+			/// \param y second operand
+			/// \return maximum value
+			static expr fmax(float x, float y)
+			{
+			#if HALF_ENABLE_CPP11_CMATH
+				return expr(std::fmax(x, y));
+			#else
+				if(builtin_isnan(x))
+					return expr(y);
+				if(builtin_isnan(y))
+					return expr(x);
+				return expr(std::max(x, y));
+			#endif
+			}
+		};
+		template<> struct binary_specialized<half,half>
+		{
+			static half fmin(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				if(xabs > 0x7C00)
+					return y;
+				if(yabs > 0x7C00)
+					return x;
+				return (((xabs==x.data_) ? xabs : -xabs) > ((yabs==y.data_) ? yabs : -yabs)) ? y : x;
+			}
+			static half fmax(half x, half y)
+			{
+				int xabs = x.data_ & 0x7FFF, yabs = y.data_ & 0x7FFF;
+				if(xabs > 0x7C00)
+					return y;
+				if(yabs > 0x7C00)
+					return x;
+				return (((xabs==x.data_) ? xabs : -xabs) < ((yabs==y.data_) ? yabs : -yabs)) ? y : x;
+			}
+		};
+
+		/// Helper class for half casts.
+		/// This class template has to be specialized for all valid cast arguments to define an appropriate static `cast` member
+		/// function and a corresponding `type` member denoting its return type.
+		/// \tparam T destination type
+		/// \tparam U source type
+		/// \tparam R rounding mode to use
+		template<typename T,typename U,std::float_round_style R=(std::float_round_style)(HALF_ROUND_STYLE)> struct half_caster {};
+		template<typename U,std::float_round_style R> struct half_caster<half,U,R>
+		{
+		#if HALF_ENABLE_CPP11_STATIC_ASSERT && HALF_ENABLE_CPP11_TYPE_TRAITS
+			static_assert(std::is_arithmetic<U>::value, "half_cast from non-arithmetic type unsupported");
+		#endif
+
+			static half cast(U arg) { return cast_impl(arg, is_float<U>()); }
+
+		private:
+			static half cast_impl(U arg, true_type) { return half(binary, float2half<R>(arg)); }
+			static half cast_impl(U arg, false_type) { return half(binary, int2half<R>(arg)); }
+		};
+		template<typename T,std::float_round_style R> struct half_caster<T,half,R>
+		{
+		#if HALF_ENABLE_CPP11_STATIC_ASSERT && HALF_ENABLE_CPP11_TYPE_TRAITS
+			static_assert(std::is_arithmetic<T>::value, "half_cast to non-arithmetic type unsupported");
+		#endif
+
+			static T cast(half arg) { return cast_impl(arg, is_float<T>()); }
+
+		private:
+			static T cast_impl(half arg, true_type) { return half2float<T>(arg.data_); }
+			static T cast_impl(half arg, false_type) { return half2int<R,T>(arg.data_); }
+		};
+		template<typename T,std::float_round_style R> struct half_caster<T,expr,R>
+		{
+		#if HALF_ENABLE_CPP11_STATIC_ASSERT && HALF_ENABLE_CPP11_TYPE_TRAITS
+			static_assert(std::is_arithmetic<T>::value, "half_cast to non-arithmetic type unsupported");
+		#endif
+
+			static T cast(expr arg) { return cast_impl(arg, is_float<T>()); }
+
+		private:
+			static T cast_impl(float arg, true_type) { return static_cast<T>(arg); }
+			static T cast_impl(half arg, false_type) { return half2int<R,T>(arg.data_); }
+		};
+		template<std::float_round_style R> struct half_caster<half,half,R>
+		{
+			static half cast(half arg) { return arg; }
+		};
+		template<std::float_round_style R> struct half_caster<half,expr,R> : half_caster<half,half,R> {};
+
+		/// \name Comparison operators
+		/// \{
+
+		/// Comparison for equality.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if operands equal
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator==(T x, U y) { return functions::isequal(x, y); }
+
+		/// Comparison for inequality.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if operands not equal
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator!=(T x, U y) { return functions::isnotequal(x, y); }
+
+		/// Comparison for less than.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x less than \a y
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator<(T x, U y) { return functions::isless(x, y); }
+
+		/// Comparison for greater than.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x greater than \a y
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator>(T x, U y) { return functions::isgreater(x, y); }
+
+		/// Comparison for less equal.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x less equal \a y
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator<=(T x, U y) { return functions::islessequal(x, y); }
+
+		/// Comparison for greater equal.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x greater equal \a y
+		/// \retval false else
+		template<typename T,typename U> typename enable<bool,T,U>::type operator>=(T x, U y) { return functions::isgreaterequal(x, y); }
+
+		/// \}
+		/// \name Arithmetic operators
+		/// \{
+
+		/// Add halfs.
+		/// \param x left operand
+		/// \param y right operand
+		/// \return sum of half expressions
+		template<typename T,typename U> typename enable<expr,T,U>::type operator+(T x, U y) { return functions::plus(x, y); }
+
+		/// Subtract halfs.
+		/// \param x left operand
+		/// \param y right operand
+		/// \return difference of half expressions
+		template<typename T,typename U> typename enable<expr,T,U>::type operator-(T x, U y) { return functions::minus(x, y); }
+
+		/// Multiply halfs.
+		/// \param x left operand
+		/// \param y right operand
+		/// \return product of half expressions
+		template<typename T,typename U> typename enable<expr,T,U>::type operator*(T x, U y) { return functions::multiplies(x, y); }
+
+		/// Divide halfs.
+		/// \param x left operand
+		/// \param y right operand
+		/// \return quotient of half expressions
+		template<typename T,typename U> typename enable<expr,T,U>::type operator/(T x, U y) { return functions::divides(x, y); }
+
+		/// Identity.
+		/// \param arg operand
+		/// \return unchanged operand
+		template<typename T> HALF_CONSTEXPR typename enable<T,T>::type operator+(T arg) { return arg; }
+
+		/// Negation.
+		/// \param arg operand
+		/// \return negated operand
+		template<typename T> HALF_CONSTEXPR typename enable<T,T>::type operator-(T arg) { return unary_specialized<T>::negate(arg); }
+
+		/// \}
+		/// \name Input and output
+		/// \{
+
+		/// Output operator.
+		/// \param out output stream to write into
+		/// \param arg half expression to write
+		/// \return reference to output stream
+		template<typename T,typename charT,typename traits> typename enable<std::basic_ostream<charT,traits>&,T>::type
+			operator<<(std::basic_ostream<charT,traits> &out, T arg) { return functions::write(out, arg); }
+
+		/// Input operator.
+		/// \param in input stream to read from
+		/// \param arg half to read into
+		/// \return reference to input stream
+		template<typename charT,typename traits> std::basic_istream<charT,traits>&
+			operator>>(std::basic_istream<charT,traits> &in, half &arg) { return functions::read(in, arg); }
+
+		/// \}
+		/// \name Basic mathematical operations
+		/// \{
+
+		/// Absolute value.
+		/// \param arg operand
+		/// \return absolute value of \a arg
+//		template<typename T> typename enable<T,T>::type abs(T arg) { return unary_specialized<T>::fabs(arg); }
+		inline half abs(half arg) { return unary_specialized<half>::fabs(arg); }
+		inline expr abs(expr arg) { return unary_specialized<expr>::fabs(arg); }
+
+		/// Absolute value.
+		/// \param arg operand
+		/// \return absolute value of \a arg
+//		template<typename T> typename enable<T,T>::type fabs(T arg) { return unary_specialized<T>::fabs(arg); }
+		inline half fabs(half arg) { return unary_specialized<half>::fabs(arg); }
+		inline expr fabs(expr arg) { return unary_specialized<expr>::fabs(arg); }
+
+		/// Remainder of division.
+		/// \param x first operand
+		/// \param y second operand
+		/// \return remainder of floating point division.
+//		template<typename T,typename U> typename enable<expr,T,U>::type fmod(T x, U y) { return functions::fmod(x, y); }
+		inline expr fmod(half x, half y) { return functions::fmod(x, y); }
+		inline expr fmod(half x, expr y) { return functions::fmod(x, y); }
+		inline expr fmod(expr x, half y) { return functions::fmod(x, y); }
+		inline expr fmod(expr x, expr y) { return functions::fmod(x, y); }
+
+		/// Remainder of division.
+		/// \param x first operand
+		/// \param y second operand
+		/// \return remainder of floating point division.
+//		template<typename T,typename U> typename enable<expr,T,U>::type remainder(T x, U y) { return functions::remainder(x, y); }
+		inline expr remainder(half x, half y) { return functions::remainder(x, y); }
+		inline expr remainder(half x, expr y) { return functions::remainder(x, y); }
+		inline expr remainder(expr x, half y) { return functions::remainder(x, y); }
+		inline expr remainder(expr x, expr y) { return functions::remainder(x, y); }
+
+		/// Remainder of division.
+		/// \param x first operand
+		/// \param y second operand
+		/// \param quo address to store some bits of quotient at
+		/// \return remainder of floating point division.
+//		template<typename T,typename U> typename enable<expr,T,U>::type remquo(T x, U y, int *quo) { return functions::remquo(x, y, quo); }
+		inline expr remquo(half x, half y, int *quo) { return functions::remquo(x, y, quo); }
+		inline expr remquo(half x, expr y, int *quo) { return functions::remquo(x, y, quo); }
+		inline expr remquo(expr x, half y, int *quo) { return functions::remquo(x, y, quo); }
+		inline expr remquo(expr x, expr y, int *quo) { return functions::remquo(x, y, quo); }
+
+		/// Fused multiply add.
+		/// \param x first operand
+		/// \param y second operand
+		/// \param z third operand
+		/// \return ( \a x * \a y ) + \a z rounded as one operation.
+//		template<typename T,typename U,typename V> typename enable<expr,T,U,V>::type fma(T x, U y, V z) { return functions::fma(x, y, z); }
+		inline expr fma(half x, half y, half z) { return functions::fma(x, y, z); }
+		inline expr fma(half x, half y, expr z) { return functions::fma(x, y, z); }
+		inline expr fma(half x, expr y, half z) { return functions::fma(x, y, z); }
+		inline expr fma(half x, expr y, expr z) { return functions::fma(x, y, z); }
+		inline expr fma(expr x, half y, half z) { return functions::fma(x, y, z); }
+		inline expr fma(expr x, half y, expr z) { return functions::fma(x, y, z); }
+		inline expr fma(expr x, expr y, half z) { return functions::fma(x, y, z); }
+		inline expr fma(expr x, expr y, expr z) { return functions::fma(x, y, z); }
+
+		/// Maximum of half expressions.
+		/// \param x first operand
+		/// \param y second operand
+		/// \return maximum of operands
+//		template<typename T,typename U> typename result<T,U>::type fmax(T x, U y) { return binary_specialized<T,U>::fmax(x, y); }
+		inline half fmax(half x, half y) { return binary_specialized<half,half>::fmax(x, y); }
+		inline expr fmax(half x, expr y) { return binary_specialized<half,expr>::fmax(x, y); }
+		inline expr fmax(expr x, half y) { return binary_specialized<expr,half>::fmax(x, y); }
+		inline expr fmax(expr x, expr y) { return binary_specialized<expr,expr>::fmax(x, y); }
+
+		/// Minimum of half expressions.
+		/// \param x first operand
+		/// \param y second operand
+		/// \return minimum of operands
+//		template<typename T,typename U> typename result<T,U>::type fmin(T x, U y) { return binary_specialized<T,U>::fmin(x, y); }
+		inline half fmin(half x, half y) { return binary_specialized<half,half>::fmin(x, y); }
+		inline expr fmin(half x, expr y) { return binary_specialized<half,expr>::fmin(x, y); }
+		inline expr fmin(expr x, half y) { return binary_specialized<expr,half>::fmin(x, y); }
+		inline expr fmin(expr x, expr y) { return binary_specialized<expr,expr>::fmin(x, y); }
+
+		/// Positive difference.
+		/// \param x first operand
+		/// \param y second operand
+		/// \return \a x - \a y or 0 if difference negative
+//		template<typename T,typename U> typename enable<expr,T,U>::type fdim(T x, U y) { return functions::fdim(x, y); }
+		inline expr fdim(half x, half y) { return functions::fdim(x, y); }
+		inline expr fdim(half x, expr y) { return functions::fdim(x, y); }
+		inline expr fdim(expr x, half y) { return functions::fdim(x, y); }
+		inline expr fdim(expr x, expr y) { return functions::fdim(x, y); }
+
+		/// Get NaN value.
+		/// \return quiet NaN
+		inline half nanh(const char*) { return functions::nanh(); }
+
+		/// \}
+		/// \name Exponential functions
+		/// \{
+
+		/// Exponential function.
+		/// \param arg function argument
+		/// \return e raised to \a arg
+//		template<typename T> typename enable<expr,T>::type exp(T arg) { return functions::exp(arg); }
+		inline expr exp(half arg) { return functions::exp(arg); }
+		inline expr exp(expr arg) { return functions::exp(arg); }
+
+		/// Exponential minus one.
+		/// \param arg function argument
+		/// \return e raised to \a arg subtracted by 1
+//		template<typename T> typename enable<expr,T>::type expm1(T arg) { return functions::expm1(arg); }
+		inline expr expm1(half arg) { return functions::expm1(arg); }
+		inline expr expm1(expr arg) { return functions::expm1(arg); }
+
+		/// Binary exponential.
+		/// \param arg function argument
+		/// \return 2 raised to \a arg
+//		template<typename T> typename enable<expr,T>::type exp2(T arg) { return functions::exp2(arg); }
+		inline expr exp2(half arg) { return functions::exp2(arg); }
+		inline expr exp2(expr arg) { return functions::exp2(arg); }
+
+		/// Natural logarithm.
+		/// \param arg function argument
+		/// \return logarithm of \a arg to base e
+//		template<typename T> typename enable<expr,T>::type log(T arg) { return functions::log(arg); }
+		inline expr log(half arg) { return functions::log(arg); }
+		inline expr log(expr arg) { return functions::log(arg); }
+
+		/// Common logarithm.
+		/// \param arg function argument
+		/// \return logarithm of \a arg to base 10
+//		template<typename T> typename enable<expr,T>::type log10(T arg) { return functions::log10(arg); }
+		inline expr log10(half arg) { return functions::log10(arg); }
+		inline expr log10(expr arg) { return functions::log10(arg); }
+
+		/// Natural logarithm of 1 plus argument.
+		/// \param arg function argument
+		/// \return logarithm of \a arg plus 1 to base e
+//		template<typename T> typename enable<expr,T>::type log1p(T arg) { return functions::log1p(arg); }
+		inline expr log1p(half arg) { return functions::log1p(arg); }
+		inline expr log1p(expr arg) { return functions::log1p(arg); }
+
+		/// Binary logarithm.
+		/// \param arg function argument
+		/// \return logarithm of \a arg to base 2
+//		template<typename T> typename enable<expr,T>::type log2(T arg) { return functions::log2(arg); }
+		inline expr log2(half arg) { return functions::log2(arg); }
+		inline expr log2(expr arg) { return functions::log2(arg); }
+
+		/// \}
+		/// \name Power functions
+		/// \{
+
+		/// Square root.
+		/// \param arg function argument
+		/// \return square root of \a arg
+//		template<typename T> typename enable<expr,T>::type sqrt(T arg) { return functions::sqrt(arg); }
+		inline expr sqrt(half arg) { return functions::sqrt(arg); }
+		inline expr sqrt(expr arg) { return functions::sqrt(arg); }
+
+		/// Cube root.
+		/// \param arg function argument
+		/// \return cube root of \a arg
+//		template<typename T> typename enable<expr,T>::type cbrt(T arg) { return functions::cbrt(arg); }
+		inline expr cbrt(half arg) { return functions::cbrt(arg); }
+		inline expr cbrt(expr arg) { return functions::cbrt(arg); }
+
+		/// Hypotenuse function.
+		/// \param x first argument
+		/// \param y second argument
+		/// \return square root of sum of squares without internal over- or underflows
+//		template<typename T,typename U> typename enable<expr,T,U>::type hypot(T x, U y) { return functions::hypot(x, y); }
+		inline expr hypot(half x, half y) { return functions::hypot(x, y); }
+		inline expr hypot(half x, expr y) { return functions::hypot(x, y); }
+		inline expr hypot(expr x, half y) { return functions::hypot(x, y); }
+		inline expr hypot(expr x, expr y) { return functions::hypot(x, y); }
+
+		/// Power function.
+		/// \param base first argument
+		/// \param exp second argument
+		/// \return \a base raised to \a exp
+//		template<typename T,typename U> typename enable<expr,T,U>::type pow(T base, U exp) { return functions::pow(base, exp); }
+		inline expr pow(half base, half exp) { return functions::pow(base, exp); }
+		inline expr pow(half base, expr exp) { return functions::pow(base, exp); }
+		inline expr pow(expr base, half exp) { return functions::pow(base, exp); }
+		inline expr pow(expr base, expr exp) { return functions::pow(base, exp); }
+
+		/// \}
+		/// \name Trigonometric functions
+		/// \{
+
+		/// Sine function.
+		/// \param arg function argument
+		/// \return sine value of \a arg
+//		template<typename T> typename enable<expr,T>::type sin(T arg) { return functions::sin(arg); }
+		inline expr sin(half arg) { return functions::sin(arg); }
+		inline expr sin(expr arg) { return functions::sin(arg); }
+
+		/// Cosine function.
+		/// \param arg function argument
+		/// \return cosine value of \a arg
+//		template<typename T> typename enable<expr,T>::type cos(T arg) { return functions::cos(arg); }
+		inline expr cos(half arg) { return functions::cos(arg); }
+		inline expr cos(expr arg) { return functions::cos(arg); }
+
+		/// Tangent function.
+		/// \param arg function argument
+		/// \return tangent value of \a arg
+//		template<typename T> typename enable<expr,T>::type tan(T arg) { return functions::tan(arg); }
+		inline expr tan(half arg) { return functions::tan(arg); }
+		inline expr tan(expr arg) { return functions::tan(arg); }
+
+		/// Arc sine.
+		/// \param arg function argument
+		/// \return arc sine value of \a arg
+//		template<typename T> typename enable<expr,T>::type asin(T arg) { return functions::asin(arg); }
+		inline expr asin(half arg) { return functions::asin(arg); }
+		inline expr asin(expr arg) { return functions::asin(arg); }
+
+		/// Arc cosine function.
+		/// \param arg function argument
+		/// \return arc cosine value of \a arg
+//		template<typename T> typename enable<expr,T>::type acos(T arg) { return functions::acos(arg); }
+		inline expr acos(half arg) { return functions::acos(arg); }
+		inline expr acos(expr arg) { return functions::acos(arg); }
+
+		/// Arc tangent function.
+		/// \param arg function argument
+		/// \return arc tangent value of \a arg
+//		template<typename T> typename enable<expr,T>::type atan(T arg) { return functions::atan(arg); }
+		inline expr atan(half arg) { return functions::atan(arg); }
+		inline expr atan(expr arg) { return functions::atan(arg); }
+
+		/// Arc tangent function.
+		/// \param x first argument
+		/// \param y second argument
+		/// \return arc tangent value
+//		template<typename T,typename U> typename enable<expr,T,U>::type atan2(T x, U y) { return functions::atan2(x, y); }
+		inline expr atan2(half x, half y) { return functions::atan2(x, y); }
+		inline expr atan2(half x, expr y) { return functions::atan2(x, y); }
+		inline expr atan2(expr x, half y) { return functions::atan2(x, y); }
+		inline expr atan2(expr x, expr y) { return functions::atan2(x, y); }
+
+		/// \}
+		/// \name Hyperbolic functions
+		/// \{
+
+		/// Hyperbolic sine.
+		/// \param arg function argument
+		/// \return hyperbolic sine value of \a arg
+//		template<typename T> typename enable<expr,T>::type sinh(T arg) { return functions::sinh(arg); }
+		inline expr sinh(half arg) { return functions::sinh(arg); }
+		inline expr sinh(expr arg) { return functions::sinh(arg); }
+
+		/// Hyperbolic cosine.
+		/// \param arg function argument
+		/// \return hyperbolic cosine value of \a arg
+//		template<typename T> typename enable<expr,T>::type cosh(T arg) { return functions::cosh(arg); }
+		inline expr cosh(half arg) { return functions::cosh(arg); }
+		inline expr cosh(expr arg) { return functions::cosh(arg); }
+
+		/// Hyperbolic tangent.
+		/// \param arg function argument
+		/// \return hyperbolic tangent value of \a arg
+//		template<typename T> typename enable<expr,T>::type tanh(T arg) { return functions::tanh(arg); }
+		inline expr tanh(half arg) { return functions::tanh(arg); }
+		inline expr tanh(expr arg) { return functions::tanh(arg); }
+
+		/// Hyperbolic area sine.
+		/// \param arg function argument
+		/// \return area sine value of \a arg
+//		template<typename T> typename enable<expr,T>::type asinh(T arg) { return functions::asinh(arg); }
+		inline expr asinh(half arg) { return functions::asinh(arg); }
+		inline expr asinh(expr arg) { return functions::asinh(arg); }
+
+		/// Hyperbolic area cosine.
+		/// \param arg function argument
+		/// \return area cosine value of \a arg
+//		template<typename T> typename enable<expr,T>::type acosh(T arg) { return functions::acosh(arg); }
+		inline expr acosh(half arg) { return functions::acosh(arg); }
+		inline expr acosh(expr arg) { return functions::acosh(arg); }
+
+		/// Hyperbolic area tangent.
+		/// \param arg function argument
+		/// \return area tangent value of \a arg
+//		template<typename T> typename enable<expr,T>::type atanh(T arg) { return functions::atanh(arg); }
+		inline expr atanh(half arg) { return functions::atanh(arg); }
+		inline expr atanh(expr arg) { return functions::atanh(arg); }
+
+		/// \}
+		/// \name Error and gamma functions
+		/// \{
+
+		/// Error function.
+		/// \param arg function argument
+		/// \return error function value of \a arg
+//		template<typename T> typename enable<expr,T>::type erf(T arg) { return functions::erf(arg); }
+		inline expr erf(half arg) { return functions::erf(arg); }
+		inline expr erf(expr arg) { return functions::erf(arg); }
+
+		/// Complementary error function.
+		/// \param arg function argument
+		/// \return 1 minus error function value of \a arg
+//		template<typename T> typename enable<expr,T>::type erfc(T arg) { return functions::erfc(arg); }
+		inline expr erfc(half arg) { return functions::erfc(arg); }
+		inline expr erfc(expr arg) { return functions::erfc(arg); }
+
+		/// Natural logarithm of gamma function.
+		/// \param arg function argument
+		/// \return natural logarithm of gamma function for \a arg
+//		template<typename T> typename enable<expr,T>::type lgamma(T arg) { return functions::lgamma(arg); }
+		inline expr lgamma(half arg) { return functions::lgamma(arg); }
+		inline expr lgamma(expr arg) { return functions::lgamma(arg); }
+
+		/// Gamma function.
+		/// \param arg function argument
+		/// \return gamma function value of \a arg
+//		template<typename T> typename enable<expr,T>::type tgamma(T arg) { return functions::tgamma(arg); }
+		inline expr tgamma(half arg) { return functions::tgamma(arg); }
+		inline expr tgamma(expr arg) { return functions::tgamma(arg); }
+
+		/// \}
+		/// \name Rounding
+		/// \{
+
+		/// Nearest integer not less than half value.
+		/// \param arg half to round
+		/// \return nearest integer not less than \a arg
+//		template<typename T> typename enable<half,T>::type ceil(T arg) { return functions::ceil(arg); }
+		inline half ceil(half arg) { return functions::ceil(arg); }
+		inline half ceil(expr arg) { return functions::ceil(arg); }
+
+		/// Nearest integer not greater than half value.
+		/// \param arg half to round
+		/// \return nearest integer not greater than \a arg
+//		template<typename T> typename enable<half,T>::type floor(T arg) { return functions::floor(arg); }
+		inline half floor(half arg) { return functions::floor(arg); }
+		inline half floor(expr arg) { return functions::floor(arg); }
+
+		/// Nearest integer not greater in magnitude than half value.
+		/// \param arg half to round
+		/// \return nearest integer not greater in magnitude than \a arg
+//		template<typename T> typename enable<half,T>::type trunc(T arg) { return functions::trunc(arg); }
+		inline half trunc(half arg) { return functions::trunc(arg); }
+		inline half trunc(expr arg) { return functions::trunc(arg); }
+
+		/// Nearest integer.
+		/// \param arg half to round
+		/// \return nearest integer, rounded away from zero in half-way cases
+//		template<typename T> typename enable<half,T>::type round(T arg) { return functions::round(arg); }
+		inline half round(half arg) { return functions::round(arg); }
+		inline half round(expr arg) { return functions::round(arg); }
+
+		/// Nearest integer.
+		/// \param arg half to round
+		/// \return nearest integer, rounded away from zero in half-way cases
+//		template<typename T> typename enable<long,T>::type lround(T arg) { return functions::lround(arg); }
+		inline long lround(half arg) { return functions::lround(arg); }
+		inline long lround(expr arg) { return functions::lround(arg); }
+
+		/// Nearest integer using half's internal rounding mode.
+		/// \param arg half expression to round
+		/// \return nearest integer using default rounding mode
+//		template<typename T> typename enable<half,T>::type nearbyint(T arg) { return functions::nearbyint(arg); }
+		inline half nearbyint(half arg) { return functions::rint(arg); }
+		inline half nearbyint(expr arg) { return functions::rint(arg); }
+
+		/// Nearest integer using half's internal rounding mode.
+		/// \param arg half expression to round
+		/// \return nearest integer using default rounding mode
+//		template<typename T> typename enable<half,T>::type rint(T arg) { return functions::rint(arg); }
+		inline half rint(half arg) { return functions::rint(arg); }
+		inline half rint(expr arg) { return functions::rint(arg); }
+
+		/// Nearest integer using half's internal rounding mode.
+		/// \param arg half expression to round
+		/// \return nearest integer using default rounding mode
+//		template<typename T> typename enable<long,T>::type lrint(T arg) { return functions::lrint(arg); }
+		inline long lrint(half arg) { return functions::lrint(arg); }
+		inline long lrint(expr arg) { return functions::lrint(arg); }
+	#if HALF_ENABLE_CPP11_LONG_LONG
+		/// Nearest integer.
+		/// \param arg half to round
+		/// \return nearest integer, rounded away from zero in half-way cases
+//		template<typename T> typename enable<long long,T>::type llround(T arg) { return functions::llround(arg); }
+		inline long long llround(half arg) { return functions::llround(arg); }
+		inline long long llround(expr arg) { return functions::llround(arg); }
+
+		/// Nearest integer using half's internal rounding mode.
+		/// \param arg half expression to round
+		/// \return nearest integer using default rounding mode
+//		template<typename T> typename enable<long long,T>::type llrint(T arg) { return functions::llrint(arg); }
+		inline long long llrint(half arg) { return functions::llrint(arg); }
+		inline long long llrint(expr arg) { return functions::llrint(arg); }
+	#endif
+
+		/// \}
+		/// \name Floating point manipulation
+		/// \{
+
+		/// Decompress floating point number.
+		/// \param arg number to decompress
+		/// \param exp address to store exponent at
+		/// \return significand in range [0.5, 1)
+//		template<typename T> typename enable<half,T>::type frexp(T arg, int *exp) { return functions::frexp(arg, exp); }
+		inline half frexp(half arg, int *exp) { return functions::frexp(arg, exp); }
+		inline half frexp(expr arg, int *exp) { return functions::frexp(arg, exp); }
+
+		/// Multiply by power of two.
+		/// \param arg number to modify
+		/// \param exp power of two to multiply with
+		/// \return \a arg multiplied by 2 raised to \a exp
+//		template<typename T> typename enable<half,T>::type ldexp(T arg, int exp) { return functions::scalbln(arg, exp); }
+		inline half ldexp(half arg, int exp) { return functions::scalbln(arg, exp); }
+		inline half ldexp(expr arg, int exp) { return functions::scalbln(arg, exp); }
+
+		/// Extract integer and fractional parts.
+		/// \param arg number to decompress
+		/// \param iptr address to store integer part at
+		/// \return fractional part
+//		template<typename T> typename enable<half,T>::type modf(T arg, half *iptr) { return functions::modf(arg, iptr); }
+		inline half modf(half arg, half *iptr) { return functions::modf(arg, iptr); }
+		inline half modf(expr arg, half *iptr) { return functions::modf(arg, iptr); }
+
+		/// Multiply by power of two.
+		/// \param arg number to modify
+		/// \param exp power of two to multiply with
+		/// \return \a arg multiplied by 2 raised to \a exp
+//		template<typename T> typename enable<half,T>::type scalbn(T arg, int exp) { return functions::scalbln(arg, exp); }
+		inline half scalbn(half arg, int exp) { return functions::scalbln(arg, exp); }
+		inline half scalbn(expr arg, int exp) { return functions::scalbln(arg, exp); }
+
+		/// Multiply by power of two.
+		/// \param arg number to modify
+		/// \param exp power of two to multiply with
+		/// \return \a arg multiplied by 2 raised to \a exp
+//		template<typename T> typename enable<half,T>::type scalbln(T arg, long exp) { return functions::scalbln(arg, exp); }
+		inline half scalbln(half arg, long exp) { return functions::scalbln(arg, exp); }
+		inline half scalbln(expr arg, long exp) { return functions::scalbln(arg, exp); }
+
+		/// Extract exponent.
+		/// \param arg number to query
+		/// \return floating point exponent
+		/// \retval FP_ILOGB0 for zero
+		/// \retval FP_ILOGBNAN for NaN
+		/// \retval INT_MAX for infinity
+//		template<typename T> typename enable<int,T>::type ilogb(T arg) { return functions::ilogb(arg); }
+		inline int ilogb(half arg) { return functions::ilogb(arg); }
+		inline int ilogb(expr arg) { return functions::ilogb(arg); }
+
+		/// Extract exponent.
+		/// \param arg number to query
+		/// \return floating point exponent
+//		template<typename T> typename enable<half,T>::type logb(T arg) { return functions::logb(arg); }
+		inline half logb(half arg) { return functions::logb(arg); }
+		inline half logb(expr arg) { return functions::logb(arg); }
+
+		/// Next representable value.
+		/// \param from value to compute next representable value for
+		/// \param to direction towards which to compute next value
+		/// \return next representable value after \a from in direction towards \a to
+//		template<typename T,typename U> typename enable<half,T,U>::type nextafter(T from, U to) { return functions::nextafter(from, to); }
+		inline half nextafter(half from, half to) { return functions::nextafter(from, to); }
+		inline half nextafter(half from, expr to) { return functions::nextafter(from, to); }
+		inline half nextafter(expr from, half to) { return functions::nextafter(from, to); }
+		inline half nextafter(expr from, expr to) { return functions::nextafter(from, to); }
+
+		/// Next representable value.
+		/// \param from value to compute next representable value for
+		/// \param to direction towards which to compute next value
+		/// \return next representable value after \a from in direction towards \a to
+//		template<typename T> typename enable<half,T>::type nexttoward(T from, long double to) { return functions::nexttoward(from, to); }
+		inline half nexttoward(half from, long double to) { return functions::nexttoward(from, to); }
+		inline half nexttoward(expr from, long double to) { return functions::nexttoward(from, to); }
+
+		/// Take sign.
+		/// \param x value to change sign for
+		/// \param y value to take sign from
+		/// \return value equal to \a x in magnitude and to \a y in sign
+//		template<typename T,typename U> typename enable<half,T,U>::type copysign(T x, U y) { return functions::copysign(x, y); }
+		inline half copysign(half x, half y) { return functions::copysign(x, y); }
+		inline half copysign(half x, expr y) { return functions::copysign(x, y); }
+		inline half copysign(expr x, half y) { return functions::copysign(x, y); }
+		inline half copysign(expr x, expr y) { return functions::copysign(x, y); }
+
+		/// \}
+		/// \name Floating point classification
+		/// \{
+
+
+		/// Classify floating point value.
+		/// \param arg number to classify
+		/// \retval FP_ZERO for positive and negative zero
+		/// \retval FP_SUBNORMAL for subnormal numbers
+		/// \retval FP_INFINITE for positive and negative infinity
+		/// \retval FP_NAN for NaNs
+		/// \retval FP_NORMAL for all other (normal) values
+//		template<typename T> typename enable<int,T>::type fpclassify(T arg) { return functions::fpclassify(arg); }
+		inline int fpclassify(half arg) { return functions::fpclassify(arg); }
+		inline int fpclassify(expr arg) { return functions::fpclassify(arg); }
+
+		/// Check if finite number.
+		/// \param arg number to check
+		/// \retval true if neither infinity nor NaN
+		/// \retval false else
+//		template<typename T> typename enable<bool,T>::type isfinite(T arg) { return functions::isfinite(arg); }
+		inline bool isfinite(half arg) { return functions::isfinite(arg); }
+		inline bool isfinite(expr arg) { return functions::isfinite(arg); }
+
+		/// Check for infinity.
+		/// \param arg number to check
+		/// \retval true for positive or negative infinity
+		/// \retval false else
+//		template<typename T> typename enable<bool,T>::type isinf(T arg) { return functions::isinf(arg); }
+		inline bool isinf(half arg) { return functions::isinf(arg); }
+		inline bool isinf(expr arg) { return functions::isinf(arg); }
+
+		/// Check for NaN.
+		/// \param arg number to check
+		/// \retval true for NaNs
+		/// \retval false else
+//		template<typename T> typename enable<bool,T>::type isnan(T arg) { return functions::isnan(arg); }
+		inline bool isnan(half arg) { return functions::isnan(arg); }
+		inline bool isnan(expr arg) { return functions::isnan(arg); }
+
+		/// Check if normal number.
+		/// \param arg number to check
+		/// \retval true if normal number
+		/// \retval false if either subnormal, zero, infinity or NaN
+//		template<typename T> typename enable<bool,T>::type isnormal(T arg) { return functions::isnormal(arg); }
+		inline bool isnormal(half arg) { return functions::isnormal(arg); }
+		inline bool isnormal(expr arg) { return functions::isnormal(arg); }
+
+		/// Check sign.
+		/// \param arg number to check
+		/// \retval true for negative number
+		/// \retval false for positive number
+//		template<typename T> typename enable<bool,T>::type signbit(T arg) { return functions::signbit(arg); }
+		inline bool signbit(half arg) { return functions::signbit(arg); }
+		inline bool signbit(expr arg) { return functions::signbit(arg); }
+
+		/// \}
+		/// \name Comparison
+		/// \{
+
+		/// Comparison for greater than.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x greater than \a y
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type isgreater(T x, U y) { return functions::isgreater(x, y); }
+		inline bool isgreater(half x, half y) { return functions::isgreater(x, y); }
+		inline bool isgreater(half x, expr y) { return functions::isgreater(x, y); }
+		inline bool isgreater(expr x, half y) { return functions::isgreater(x, y); }
+		inline bool isgreater(expr x, expr y) { return functions::isgreater(x, y); }
+
+		/// Comparison for greater equal.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x greater equal \a y
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type isgreaterequal(T x, U y) { return functions::isgreaterequal(x, y); }
+		inline bool isgreaterequal(half x, half y) { return functions::isgreaterequal(x, y); }
+		inline bool isgreaterequal(half x, expr y) { return functions::isgreaterequal(x, y); }
+		inline bool isgreaterequal(expr x, half y) { return functions::isgreaterequal(x, y); }
+		inline bool isgreaterequal(expr x, expr y) { return functions::isgreaterequal(x, y); }
+
+		/// Comparison for less than.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x less than \a y
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type isless(T x, U y) { return functions::isless(x, y); }
+		inline bool isless(half x, half y) { return functions::isless(x, y); }
+		inline bool isless(half x, expr y) { return functions::isless(x, y); }
+		inline bool isless(expr x, half y) { return functions::isless(x, y); }
+		inline bool isless(expr x, expr y) { return functions::isless(x, y); }
+
+		/// Comparison for less equal.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if \a x less equal \a y
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type islessequal(T x, U y) { return functions::islessequal(x, y); }
+		inline bool islessequal(half x, half y) { return functions::islessequal(x, y); }
+		inline bool islessequal(half x, expr y) { return functions::islessequal(x, y); }
+		inline bool islessequal(expr x, half y) { return functions::islessequal(x, y); }
+		inline bool islessequal(expr x, expr y) { return functions::islessequal(x, y); }
+
+		/// Comparison for less or greater.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if either less or greater
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type islessgreater(T x, U y) { return functions::islessgreater(x, y); }
+		inline bool islessgreater(half x, half y) { return functions::islessgreater(x, y); }
+		inline bool islessgreater(half x, expr y) { return functions::islessgreater(x, y); }
+		inline bool islessgreater(expr x, half y) { return functions::islessgreater(x, y); }
+		inline bool islessgreater(expr x, expr y) { return functions::islessgreater(x, y); }
+
+		/// Check if unordered.
+		/// \param x first operand
+		/// \param y second operand
+		/// \retval true if unordered (one or two NaN operands)
+		/// \retval false else
+//		template<typename T,typename U> typename enable<bool,T,U>::type isunordered(T x, U y) { return functions::isunordered(x, y); }
+		inline bool isunordered(half x, half y) { return functions::isunordered(x, y); }
+		inline bool isunordered(half x, expr y) { return functions::isunordered(x, y); }
+		inline bool isunordered(expr x, half y) { return functions::isunordered(x, y); }
+		inline bool isunordered(expr x, expr y) { return functions::isunordered(x, y); }
+
+		/// \}
+		/// \name Casting
+		/// \{
+
+		/// Cast to or from half-precision floating point number.
+		/// This casts between [half](\ref half_float::half) and any built-in arithmetic type. The values are converted
+		/// directly using the default rounding mode, without any roundtrip over `float` that a `static_cast` would otherwise do.
+		/// Use the overload with an explicit `std::float_round_style` parameter to select a different rounding mode.
+		///
+		/// Using this cast when neither type is a [half](\ref half_float::half), or when either type is not a built-in
+		/// arithmetic type (apart from [half](\ref half_float::half), of course), results in a compiler error. Casting
+		/// between [half](\ref half_float::half)s is just a no-op.
+		/// \tparam T destination type (half or built-in arithmetic type)
+		/// \tparam U source type (half or built-in arithmetic type)
+		/// \param arg value to cast
+		/// \return \a arg converted to destination type
+		template<typename T,typename U> T half_cast(U arg) { return half_caster<T,U>::cast(arg); }
+
+		/// Cast to or from half-precision floating point number.
+		/// This casts between [half](\ref half_float::half) and any built-in arithmetic type. The values are converted
+		/// directly using the given rounding mode, without any roundtrip over `float` that a `static_cast` would otherwise do.
+		///
+		/// Using this cast when neither type is a [half](\ref half_float::half), or when either type is not a built-in
+		/// arithmetic type (apart from [half](\ref half_float::half), of course), results in a compiler error. Casting
+		/// between [half](\ref half_float::half)s is just a no-op.
+		/// \tparam T destination type (half or built-in arithmetic type)
+		/// \tparam R rounding mode to use.
+		/// \tparam U source type (half or built-in arithmetic type)
+		/// \param arg value to cast
+		/// \return \a arg converted to destination type
+		template<typename T,std::float_round_style R,typename U> T half_cast(U arg) { return half_caster<T,U,R>::cast(arg); }
+		/// \}
+	}
+
+	using detail::operator==;
+	using detail::operator!=;
+	using detail::operator<;
+	using detail::operator>;
+	using detail::operator<=;
+	using detail::operator>=;
+	using detail::operator+;
+	using detail::operator-;
+	using detail::operator*;
+	using detail::operator/;
+	using detail::operator<<;
+	using detail::operator>>;
+
+	using detail::abs;
+	using detail::fabs;
+	using detail::fmod;
+	using detail::remainder;
+	using detail::remquo;
+	using detail::fma;
+	using detail::fmax;
+	using detail::fmin;
+	using detail::fdim;
+	using detail::nanh;
+	using detail::exp;
+	using detail::expm1;
+	using detail::exp2;
+	using detail::log;
+	using detail::log10;
+	using detail::log1p;
+	using detail::log2;
+	using detail::sqrt;
+	using detail::cbrt;
+	using detail::hypot;
+	using detail::pow;
+	using detail::sin;
+	using detail::cos;
+	using detail::tan;
+	using detail::asin;
+	using detail::acos;
+	using detail::atan;
+	using detail::atan2;
+	using detail::sinh;
+	using detail::cosh;
+	using detail::tanh;
+	using detail::asinh;
+	using detail::acosh;
+	using detail::atanh;
+	using detail::erf;
+	using detail::erfc;
+	using detail::lgamma;
+	using detail::tgamma;
+	using detail::ceil;
+	using detail::floor;
+	using detail::trunc;
+	using detail::round;
+	using detail::lround;
+	using detail::nearbyint;
+	using detail::rint;
+	using detail::lrint;
+#if HALF_ENABLE_CPP11_LONG_LONG
+	using detail::llround;
+	using detail::llrint;
+#endif
+	using detail::frexp;
+	using detail::ldexp;
+	using detail::modf;
+	using detail::scalbn;
+	using detail::scalbln;
+	using detail::ilogb;
+	using detail::logb;
+	using detail::nextafter;
+	using detail::nexttoward;
+	using detail::copysign;
+	using detail::fpclassify;
+	using detail::isfinite;
+	using detail::isinf;
+	using detail::isnan;
+	using detail::isnormal;
+	using detail::signbit;
+	using detail::isgreater;
+	using detail::isgreaterequal;
+	using detail::isless;
+	using detail::islessequal;
+	using detail::islessgreater;
+	using detail::isunordered;
+
+	using detail::half_cast;
+}
+
+
+/// Extensions to the C++ standard library.
+namespace std
+{
+	/// Numeric limits for half-precision floats.
+	/// Because of the underlying single-precision implementation of many operations, it inherits some properties from
+	/// `std::numeric_limits<float>`.
+	template<> class numeric_limits<half_float::half> : public numeric_limits<float>
+	{
+	public:
+		/// Supports signed values.
+		static HALF_CONSTEXPR_CONST bool is_signed = true;
+
+		/// Is not exact.
+		static HALF_CONSTEXPR_CONST bool is_exact = false;
+
+		/// Doesn't provide modulo arithmetic.
+		static HALF_CONSTEXPR_CONST bool is_modulo = false;
+
+		/// IEEE conformant.
+		static HALF_CONSTEXPR_CONST bool is_iec559 = true;
+
+		/// Supports infinity.
+		static HALF_CONSTEXPR_CONST bool has_infinity = true;
+
+		/// Supports quiet NaNs.
+		static HALF_CONSTEXPR_CONST bool has_quiet_NaN = true;
+
+		/// Supports subnormal values.
+		static HALF_CONSTEXPR_CONST float_denorm_style has_denorm = denorm_present;
+
+		/// Rounding mode.
+		/// Due to the mix of internal single-precision computations (using the rounding mode of the underlying
+		/// single-precision implementation) with the rounding mode of the single-to-half conversions, the actual rounding
+		/// mode might be `std::round_indeterminate` if the default half-precision rounding mode doesn't match the
+		/// single-precision rounding mode.
+		static HALF_CONSTEXPR_CONST float_round_style round_style = (std::numeric_limits<float>::round_style==
+			half_float::half::round_style) ? half_float::half::round_style : round_indeterminate;
+
+		/// Significant digits.
+		static HALF_CONSTEXPR_CONST int digits = 11;
+
+		/// Significant decimal digits.
+		static HALF_CONSTEXPR_CONST int digits10 = 3;
+
+		/// Required decimal digits to represent all possible values.
+		static HALF_CONSTEXPR_CONST int max_digits10 = 5;
+
+		/// Number base.
+		static HALF_CONSTEXPR_CONST int radix = 2;
+
+		/// One more than smallest exponent.
+		static HALF_CONSTEXPR_CONST int min_exponent = -13;
+
+		/// Smallest normalized representable power of 10.
+		static HALF_CONSTEXPR_CONST int min_exponent10 = -4;
+
+		/// One more than largest exponent.
+		static HALF_CONSTEXPR_CONST int max_exponent = 16;
+
+		/// Largest finitely representable power of 10.
+		static HALF_CONSTEXPR_CONST int max_exponent10 = 4;
+
+		/// Smallest positive normal value.
+		static HALF_CONSTEXPR half_float::half min() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x0400); }
+
+		/// Lowest (most negative) finite value.
+		static HALF_CONSTEXPR half_float::half lowest() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0xFBFF); }
+
+		/// Largest finite value.
+		static HALF_CONSTEXPR half_float::half max() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x7BFF); }
+
+		/// Difference between one and next representable value.
+		static HALF_CONSTEXPR half_float::half epsilon() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x1400); }
+
+		/// Maximum rounding error.
+		static HALF_CONSTEXPR half_float::half round_error() HALF_NOTHROW
+			{ return half_float::half(half_float::detail::binary, (round_style==std::round_to_nearest) ? 0x3800 : 0x3C00); }
+
+		/// Positive infinity.
+		static HALF_CONSTEXPR half_float::half infinity() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x7C00); }
+
+		/// Quiet NaN.
+		static HALF_CONSTEXPR half_float::half quiet_NaN() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x7FFF); }
+
+		/// Signalling NaN.
+		static HALF_CONSTEXPR half_float::half signaling_NaN() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x7DFF); }
+
+		/// Smallest positive subnormal value.
+		static HALF_CONSTEXPR half_float::half denorm_min() HALF_NOTHROW { return half_float::half(half_float::detail::binary, 0x0001); }
+	};
+
+#if HALF_ENABLE_CPP11_HASH
+	/// Hash function for half-precision floats.
+	/// This is only defined if C++11 `std::hash` is supported and enabled.
+	template<> struct hash<half_float::half> //: unary_function<half_float::half,size_t>
+	{
+		/// Type of function argument.
+		typedef half_float::half argument_type;
+
+		/// Function return type.
+		typedef size_t result_type;
+
+		/// Compute hash function.
+		/// \param arg half to hash
+		/// \return hash value
+		result_type operator()(argument_type arg) const
+			{ return hash<half_float::detail::uint16>()(static_cast<unsigned>(arg.data_)&-(arg.data_!=0x8000)); }
+	};
+#endif
+}
+
+
+#undef HALF_CONSTEXPR
+#undef HALF_CONSTEXPR_CONST
+#undef HALF_NOEXCEPT
+#undef HALF_NOTHROW
+#ifdef HALF_POP_WARNINGS
+	#pragma warning(pop)
+	#undef HALF_POP_WARNINGS
+#endif
+
+#endif
diff --git a/include/.gitignore b/include/.gitignore
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/include/af/algorithm.h b/include/af/algorithm.h
index 65402aadf2..4949d0894d 100644
--- a/include/af/algorithm.h
+++ b/include/af/algorithm.h
@@ -16,339 +16,677 @@ namespace af
     class array;
 
     /**
-       C++ Interface for sum of elements in an array
+       C++ Interface to sum array elements over a given dimension.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the add operation occurs
-       \return    result of sum all values along dimension \p dim
+       \param[in] in  input array
+       \param[in] dim dimension along which the summation occurs, -1 denotes
+                      the first non-singleton dimension
+       \return        sum
 
        \ingroup reduce_func_sum
-
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
     */
     AFAPI array sum(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 31
     /**
-       C++ Interface for product of elements in an array
+       C++ Interface to sum array elements over a given dimension, replacing
+       any NaNs with a specified value.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the multiply operation occurs
-       \return    result of product all values along dimension \p dim
+       \param[in] in     input array
+       \param[in] dim    dimension along which the summation occurs
+       \param[in] nanval value that replaces NaNs
+       \return           sum
 
-       \ingroup reduce_func_product
+       \ingroup reduce_func_sum
+    */
+    AFAPI array sum(const array &in, const int dim, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+       C++ Interface to sum array elements over a given dimension, according to
+       an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out sum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the summation occurs, -1
+                            denotes the first non-singleton dimension
+
+       \ingroup reduce_func_sum_by_key
+    */
+    AFAPI void sumByKey(array &keys_out, array &vals_out,
+                        const array &keys, const array &vals,
+                        const int dim = -1);
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+    /**
+       C++ Interface to sum array elements over a given dimension, replacing
+       any NaNs with a specified value, according to an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out sum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the summation occurs
+       \param[in]  nanval   value that replaces NaNs
+
+       \ingroup reduce_func_sum_by_key
+    */
+    AFAPI void sumByKey(array &keys_out, array &vals_out,
+                        const array &keys, const array &vals,
+                        const int dim, const double nanval);
+#endif
+
+    /**
+       C++ Interface to multiply array elements over a given dimension.
+
+       \param[in] in  input array
+       \param[in] dim dimension along which the product occurs, -1 denotes the
+                      first non-singleton dimension
+       \return        product
+
+       \ingroup reduce_func_product
     */
     AFAPI array product(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 31
     /**
-       C++ Interface for minimum values in an array
+       C++ Interface to multiply array elements over a given dimension,
+       replacing any NaNs with a specified value.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the minimum value needs to be extracted
-       \return    result of minimum all values along dimension \p dim
+       \param[in] in     input array
+       \param[in] dim    dimension along which the product occurs
+       \param[in] nanval value that replaces NaNs
+       \return           product
 
-       \ingroup reduce_func_min
+       \ingroup reduce_func_product
+    */
+    AFAPI array product(const array &in, const int dim, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+       C++ Interface to multiply array elements over a given dimension,
+       according to an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out product
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the product occurs, -1
+                            denotes the first non-singleton dimension
+
+       \ingroup reduce_func_product_by_key
+    */
+    AFAPI void productByKey(array &keys_out, array &vals_out,
+                            const array &keys, const array &vals,
+                            const int dim = -1);
+
+    /**
+       C++ Interface to multiply array elements over a given dimension,
+       replacing any NaNs with a specified value, according to an array of
+       keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out product
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the product occurs
+       \param[in]  nanval   value that replaces NaNs
+
+       \ingroup reduce_func_product_by_key
+
+    */
+    AFAPI void productByKey(array &keys_out, array &vals_out,
+                            const array &keys, const array &vals,
+                            const int dim, const double nanval);
+#endif
+
+    /**
+       C++ Interface to return the minimum along a given dimension.
+
+       NaN values are ignored.
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \param[in] in  input array
+       \param[in] dim dimension along which the minimum is found, -1 denotes
+                      the first non-singleton dimension
+       \return        minimum
+
+       \ingroup reduce_func_min
     */
     AFAPI array min(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 37
     /**
-       C++ Interface for maximum values in an array
+       C++ Interface to return the minimum along a given dimension, according
+       to an array of keys.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the maximum value needs to be extracted
-       \return    result of maximum all values along dimension \p dim
+       NaN values are ignored.
 
-       \ingroup reduce_func_max
+       \param[out] keys_out reduced keys
+       \param[out] vals_out minimum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the minimum is found, -1
+                            denotes the first non-singleton dimension
+
+       \ingroup reduce_func_min_by_key
+    */
+    AFAPI void minByKey(array &keys_out, array &vals_out,
+                        const array &keys, const array &vals,
+                        const int dim = -1);
+#endif
+
+    /**
+       C++ Interface to return the maximum along a given dimension.
+
+       NaN values are ignored.
+
+       \param[in] in  input array
+       \param[in] dim dimension along which the maximum is found, -1 denotes
+                      the first non-singleton dimension
+       \return        maximum
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_max
     */
     AFAPI array max(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 37
     /**
-       C++ Interface for checking all true values in an array
+       C++ Interface to return the maximum along a given dimension, according
+       to an array of keys.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the values are checked to be all true
-       \return    result of checking if values along dimension \p dim are all true
+       NaN values are ignored.
 
-       \ingroup reduce_func_all_true
+       \param[out] keys_out reduced keys
+       \param[out] vals_out maximum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the maximum is found, -1
+                            denotes the first non-singleton dimension
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_max_by_key
+    */
+    AFAPI void maxByKey(array &keys_out, array &vals_out,
+                        const array &keys, const array &vals,
+                        const int dim = -1);
+#endif
+
+#if AF_API_VERSION >= 38
+    /**
+       C++ Interface to return the ragged maximum along a given dimension.
+
+       Input parameter `ragged_len` sets the number of elements to consider.
+
+       NaN values are ignored.
+
+       \param[out] val        ragged maximum
+       \param[out] idx        locations of the maximum ragged values
+       \param[in]  in         input array
+       \param[in]  ragged_len array containing the number of elements to use
+       \param[in]  dim        dimension along which the maximum is found
+
+       \ingroup reduce_func_max
+    */
+    AFAPI void max(array &val, array &idx, const array &in, const array &ragged_len, const int dim);
+#endif
+
+    /**
+       C++ Interface to check if all values along a given dimension are true.
+
+       NaN values are ignored.
+
+       \param[in] in  input array
+       \param[in] dim dimension along which the check occurs, -1 denotes the
+                      first non-singleton dimension
+       \return        array containing 1's if all true; 0's otherwise
+
+       \ingroup reduce_func_all_true
     */
     AFAPI array allTrue(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 37
     /**
-       C++ Interface for checking any true values in an array
+       C++ Interface to check if all values along a given dimension are true,
+       according to an array of keys.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the values are checked to be any true
-       \return    result of checking if values along dimension \p dim are any true
+       NaN values are ignored.
 
-       \ingroup reduce_func_any_true
+       \param[out] keys_out reduced keys
+       \param[out] vals_out array containing 1's if all true; 0's otherwise
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the check occurs
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_alltrue_by_key
+    */
+    AFAPI void allTrueByKey(array &keys_out, array &vals_out,
+                            const array &keys, const array &vals,
+                            const int dim = -1);
+#endif
+
+    /**
+       C++ Interface to check if any values along a given dimension are true.
+
+       NaN values are ignored.
+
+       \param[in] in  input array
+       \param[in] dim dimension along which the check occurs, -1 denotes the
+                      first non-singleton dimension
+       \return        array containing 1's if any true; 0's otherwise
+
+       \ingroup reduce_func_any_true
     */
     AFAPI array anyTrue(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 37
     /**
-       C++ Interface for counting non zero values in an array
+       C++ Interface to check if any values along a given dimension are true,
+       according to an array of keys.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the the number of non-zero values are counted
-       \return    the number of non-zero values along dimension \p dim
+       NaN values are ignored.
 
-       \ingroup reduce_func_count
+       \param[out] keys_out reduced keys
+       \param[out] vals_out array containing 1's if any true; 0's otherwise
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the check occurs
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_anytrue_by_key
+    */
+    AFAPI void anyTrueByKey(array &keys_out, array &vals_out,
+                            const array &keys, const array &vals,
+                            const int dim = -1);
+#endif
+
+    /**
+       C++ Interface to count non-zero values in an array along a given
+       dimension.
+
+       NaN values are treated as non-zero.
+
+       \param[in] in  input array
+       \param[in] dim dimension along which the count occurs, -1 denotes the
+                      first non-singleton dimension
+       \return        count
+
+       \ingroup reduce_func_count
     */
     AFAPI array count(const array &in, const int dim = -1);
 
+#if AF_API_VERSION >= 37
+    /**
+       C++ Interface to count non-zero values in an array, according to an
+       array of keys.
+
+       NaN values are treated as non-zero.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out count
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the count occurs, -1 denotes
+                            the first non-singleton dimension
+
+       \ingroup reduce_func_count_by_key
+    */
+    AFAPI void countByKey(array &keys_out, array &vals_out,
+                          const array &keys, const array &vals,
+                          const int dim = -1);
+#endif
+
     /**
-       C++ Interface for sum of all elements in an array
+       C++ Interface to sum array elements over all dimensions.
 
-       \param[in] in is the input array
-       \return    the sum of all values of \p in
+       Results in a single value as an output, which may be a single element
+       `af::array`.
+
+       \param[in] in  input array
+       \return        sum
 
        \ingroup reduce_func_sum
     */
     template<typename T> T sum(const array &in);
 
+#if AF_API_VERSION >= 31
     /**
-       C++ Interface for product of all elements in an array
+       C++ Interface to sum array elements over all dimensions, replacing any
+       NaNs with a specified value.
+
+       Results in a single value as an output, which may be a single element
+       `af::array`.
 
-       \param[in] in is the input array
-       \return    the product of all values of \p in
+       \param[in] in     input array
+       \param[in] nanval value that replaces NaNs
+       \return           sum
+
+       \ingroup reduce_func_sum
+    */
+    template<typename T> T sum(const array &in, double nanval);
+#endif
+
+    /**
+       C++ Interface to multiply array elements over all dimensions of the
+       array.
+
+       \param[in] in input array
+       \return       product
 
        \ingroup reduce_func_product
     */
     template<typename T> T product(const array &in);
 
+#if AF_API_VERSION >= 31
     /**
-       C++ Interface for getting minimum value of an array
+       C++ Interface to multiply array elements over all dimensions,
+       replacing any NaNs with a specified value.
+
+       \param[in] in     input array
+       \param[in] nanval value that replaces NaNs
+       \return           product
 
-       \param[in] in is the input array
-       \return    the minimum of all values of \p in
+       \ingroup reduce_func_product
+    */
+    template<typename T> T product(const array &in, double nanval);
+#endif
+
+    /**
+       C++ Interface to return the minimum value over all dimensions of
+       the array.
+
+       NaN values are ignored.
+
+       \param[in] in input array
+       \return       minimum
 
        \ingroup reduce_func_min
     */
     template<typename T> T min(const array &in);
 
     /**
-       C++ Interface for getting maximum value of an array
+       C++ Interface to return the maximum value over all dimensions of
+       the array.
+
+       NaN values are ignored.
 
-       \param[in] in is the input array
-       \return    the maximum of all values of \p in
+       \param[in] in input array
+       \return       maximum
 
        \ingroup reduce_func_max
     */
     template<typename T> T max(const array &in);
 
     /**
-       C++ Interface for checking if all values in an array are true
+       C++ Interface to check if all values over all dimensions of the
+       array are true.
 
-       \param[in] in is the input array
-       \return    true if all values of \p in are true, false otherwise
+       NaN values are ignored.
+
+       \param[in] in input array
+       \return       true if all values are true; false otherwise
 
        \ingroup reduce_func_all_true
     */
     template<typename T> T allTrue(const array &in);
 
     /**
-       C++ Interface for checking if any values in an array are true
+       C++ Interface to check if any values over all dimensions of the
+       array are true.
+
+       NaN values are ignored.
 
-       \param[in] in is the input array
-       \return    true if any values of \p in are true, false otherwise
+       \param[in] in input array
+       \return       true if any values are true; false otherwise
 
        \ingroup reduce_func_any_true
     */
     template<typename T> T anyTrue(const array &in);
 
     /**
-       C++ Interface for counting total number of non zero values in an array
+       C++ Interface to count non-zero values over all dimensions of the
+       array.
 
-       \param[in] in is the input array
-       \return    the number of non-zero values in \p in
+       NaN values are treated as non-zero.
+
+       \param[in] in input array
+       \return       count
 
        \ingroup reduce_func_count
     */
     template<typename T> T count(const array &in);
 
     /**
-       C++ Interface for getting minimum values and their locations in an array
+       C++ Interface to return the minimum and its location along a given
+       dimension.
 
-       \param[out] val will contain the minimum values along dimension \p dim
-       \param[out] idx will contain the locations of minimum all values along dimension \p dim
-       \param[in]  in is the input array
-       \param[in]  dim The dimension along which the minimum value needs to be extracted
+       NaN values are ignored.
 
-       \ingroup reduce_func_min
+       \param[out] val minimum
+       \param[out] idx location
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the minimum is found, -1 denotes
+                       the first non-singleton dimension
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_min
     */
     AFAPI void min(array &val, array &idx, const array &in, const int dim = -1);
 
     /**
-       C++ Interface for getting maximum values and their locations in an array
+       C++ Interface to return the maximum and its location along a given
+       dimension.
 
-       \param[out] val will contain the maximum values along dimension \p dim
-       \param[out] idx will contain the locations of maximum all values along dimension \p dim
-       \param[in]  in is the input array
-       \param[in]  dim The dimension along which the maximum value needs to be extracted
+       NaN values are ignored.
 
-       \ingroup reduce_func_max
+       \param[out] val maximum
+       \param[out] idx location
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the maximum is found, -1 denotes
+                       the first non-singleton dimension
 
-       \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+       \ingroup reduce_func_max
     */
     AFAPI void max(array &val, array &idx, const array &in, const int dim = -1);
 
     /**
-       C++ Interface for getting minimum value and it's location from the entire array
+       C++ Interface to return the minimum and its location over all
+       dimensions.
+
+       NaN values are ignored.
+
+       Often used to return values directly to the host.
 
-       \param[out] val will contain the minimum values in the input
-       \param[out] idx will contain the locations of minimum all values in the input
-       \param[in]  in is the input array
+       \param[out] val minimum
+       \param[out] idx location
+       \param[in]  in  input array
 
        \ingroup reduce_func_min
     */
     template<typename T> void min(T *val, unsigned *idx, const array &in);
 
     /**
-       C++ Interface for getting maximum value and it's location from the entire array
+       C++ Interface to return the maximum and its location over all
+       dimensions.
+
+       NaN values are ignored.
+
+       Often used to return values directly to the host.
 
-       \param[out] val contains the maximum values in the input
-       \param[out] idx contains the locations of maximum all values in the input
-       \param[in]  in is the input array
+       \param[out] val maximum
+       \param[out] idx location
+       \param[in]  in  input array
 
        \ingroup reduce_func_max
     */
     template<typename T> void max(T *val, unsigned *idx, const array &in);
 
     /**
-       C++ Interface exclusive sum (cumulative sum) of an array
+       C++ Interface to evaluate the cumulative sum (inclusive) along a given
+       dimension.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which exclusive sum is performed
-       \return the output containing exclusive sums of the input
+       \param[in] in  input array
+       \param[in] dim dimension along which the sum is accumulated, 0 denotes
+                      the first dimension
+       \return        cumulative sum
 
        \ingroup scan_func_accum
     */
     AFAPI array accum(const array &in, const int dim = 0);
 
+#if AF_API_VERSION >= 34
+    /**
+       C++ Interface to scan an array (generalized) over a given dimension.
+
+       \param[in] in             input array
+       \param[in] dim            dimension along which the scan occurs, 0
+                                 denotes the first dimension
+       \param[in] op             type of binary operation used
+       \param[in] inclusive_scan flag specifying whether the scan is inclusive
+       \return                   scan
+
+       \ingroup scan_func_scan
+    */
+    AFAPI array scan(const array &in, const int dim = 0,
+                     binaryOp op = AF_BINARY_ADD, bool inclusive_scan = true);
+
     /**
-       C++ Interface for finding the locations of non-zero values in an array
+       C++ Interface to scan an array (generalized) over a given dimension,
+       according to an array of keys.
+
+       \param[in] key            keys array
+       \param[in] in             input array
+       \param[in] dim            dimension along which the scan occurs, 0
+                                 denotes the first dimension
+       \param[in] op             type of binary operation used
+       \param[in] inclusive_scan flag specifying whether the scan is inclusive
+       \return                   scan
+
+       \ingroup scan_func_scanbykey
+    */
+    AFAPI array scanByKey(const array &key, const array &in, const int dim = 0,
+                          binaryOp op = AF_BINARY_ADD, bool inclusive_scan = true);
+#endif
 
-       \param[in] in is the input array.
-       \return linear indices where \p in is non-zero
+    /**
+       C++ Interface to locate the indices of the non-zero values in an array.
+
+       \param[in] in input array
+       \return       linear indices where `in` is non-zero
 
        \ingroup scan_func_where
     */
     AFAPI array where(const array &in);
 
     /**
-       C++ Interface for calculating first order differences in an array
+       C++ Interface to calculate the first order difference in an array over a
+       given dimension.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \return output of first order numerical difference
+       \param[in] in  input array
+       \param[in] dim dimension along which the difference occurs, 0
+                      denotes the first non-singleton dimension
+       \return        first order numerical difference
 
        \ingroup calc_func_diff1
     */
     AFAPI array diff1(const array &in, const int dim = 0);
 
     /**
-       C++ Interface for calculating second order differences in an array
+       C++ Interface to calculate the second order difference in an array over
+       a given dimension.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \return output of second order numerical difference
+       \param[in] in  input array
+       \param[in] dim dimension along which the difference occurs, 0
+                      denotes the first non-singleton dimension
+       \return        second order numerical difference
 
        \ingroup calc_func_diff2
     */
     AFAPI array diff2(const array &in, const int dim = 0);
 
     /**
-       C++ Interface for sorting an array
+       C++ Interface to sort an array over a given dimension.
 
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
+       \param[in] in          input array
+       \param[in] dim         dimension along which the sort occurs, 0 denotes
+                              the first non-singleton dimension
        \param[in] isAscending specifies the sorting order
-       \return the sorted output
+       \return                sorted output
 
        \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
     */
-    AFAPI array sort(const array &in, const unsigned dim = 0, const bool isAscending = true);
+    AFAPI array sort(const array &in, const unsigned dim = 0,
+                     const bool isAscending = true);
 
     /**
-       C++ Interface for sorting an array and getting original indices
+       C++ Interface to sort an array over a given dimension and to return the
+       original indices.
 
-       \param[out] out will contain the sorted output
-       \param[out] indices will contain the indices in the original input
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \param[in] isAscending specifies the sorting order
+       \param[out] out         sorted output
+       \param[out] indices     indices from the input
+       \param[in]  in          input array
+       \param[in]  dim         dimension along which the sort occurs, 0 denotes
+                               the first non-singleton dimension
+       \param[in]  isAscending specifies the sorting order
 
-       \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
+       \ingroup sort_func_sort_index
     */
     AFAPI void  sort(array &out, array &indices, const array &in, const unsigned dim = 0,
                      const bool isAscending = true);
-    /**
-       C++ Interface for sorting an array based on keys
 
-       \param[out] out_keys will contain the keys based on sorted values
-       \param[out] out_values will contain the sorted values
-       \param[in] keys is the input array
-       \param[in] values The dimension along which numerical difference is performed
-       \param[in] dim The dimension along which numerical difference is performed
-       \param[in] isAscending specifies the sorting order
-
-       \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
+    /**
+       C++ Interface to sort an array over a given dimension, according to an
+       array of keys.
+
+       \param[out] out_keys    sorted keys
+       \param[out] out_values  sorted output
+       \param[in]  keys        keys array
+       \param[in]  values      input array
+       \param[in]  dim         dimension along which the sort occurs, 0 denotes
+                               the first non-singleton dimension
+       \param[in]  isAscending specifies the sorting order
+
+       \ingroup sort_func_sort_keys
     */
-    AFAPI void  sort(array &out_keys, array &out_values, const array &keys, const array &values,
-                     const unsigned dim = 0, const bool isAscending = true);
+    AFAPI void  sort(array &out_keys, array &out_values, const array &keys,
+                     const array &values, const unsigned dim = 0,
+                     const bool isAscending = true);
 
     /**
-       C++ Interface for getting unique values
+       C++ Interface to return the unique values in an array.
 
-       \param[in] in is the input array
-       \param[in] is_sorted if true, skips the sorting steps internally
-       \return the unique values from \p in
+       \param[in] in        input array
+       \param[in] is_sorted if true, skip the sorting steps internally
+       \return              unique values
 
        \ingroup set_func_unique
     */
     AFAPI array setUnique(const array &in, const bool is_sorted=false);
 
     /**
-       C++ Interface for performing union of two arrays
+       C++ Interface to evaluate the union of two arrays.
 
-       \param[in] first is the first array
-       \param[in] second is the second array
-       \param[in] is_unique if true, skips calling unique internally
-       \return the union of \p first and \p second
+       \param[in] first     input array
+       \param[in] second    input array
+       \param[in] is_unique if true, skip calling setUnique internally
+       \return              union, values in increasing order
 
        \ingroup set_func_union
     */
-    AFAPI array setUnion(const array &first, const array &second, const bool is_unique=false);
+    AFAPI array setUnion(const array &first, const array &second,
+                         const bool is_unique=false);
 
     /**
-       C++ Interface for performing intersect of two arrays
+       C++ Interface to evaluate the intersection of two arrays.
 
-       \param[in] first is the first array
-       \param[in] second is the second array
-       \param[in] is_unique if true, skips calling unique internally
-       \return the intersection of \p first and \p second
+       \param[in] first     input array
+       \param[in] second    input array
+       \param[in] is_unique if true, skip calling setUnique internally
+       \return              intersection, values in increasing order
 
        \ingroup set_func_intersect
     */
-    AFAPI array setIntersect(const array &first, const array &second, const bool is_unique=false);
+    AFAPI array setIntersect(const array &first, const array &second,
+                             const bool is_unique=false);
 }
 #endif
 
@@ -357,377 +695,864 @@ extern "C" {
 #endif
 
     /**
-       C Interface for sum of elements in an array
+       C Interface to sum array elements over a given dimension.
 
-       \param[out] out will contain the sum of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the add operation occurs
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out sum
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the summation occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_sum
     */
     AFAPI af_err af_sum(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for product of elements in an array
+       C Interface to sum array elements over all dimensions.
+
+       Results in a single element `af::array`.
 
-       \param[out] out will contain the product of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the multiply operation occurs
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out sum
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum
+    */
+    AFAPI af_err af_sum_all_array(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface to sum array elements over a given dimension, replacing any
+       NaNs with a specified value.
+
+       \param[out] out    sum
+       \param[in]  in     input array
+       \param[in]  dim    dimension along which the summation occurs
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum
+    */
+    AFAPI af_err af_sum_nan(af_array *out, const af_array in,
+                            const int dim, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 39
+    /**
+       C Interface to sum array elements over all dimensions, replacing any
+       NaNs with a specified value.
+
+       Results in a single element `af::array`.
+
+       \param[out] out    sum
+       \param[in]  in     input array
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum
+    */
+    AFAPI af_err af_sum_nan_all_array(af_array *out, const af_array in, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+       C Interface to sum array elements over a given dimension, according to
+       an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out sum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the summation occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum_by_key
+    */
+    AFAPI af_err af_sum_by_key(af_array *keys_out, af_array *vals_out,
+                               const af_array keys, const af_array vals, const int dim);
+
+    /**
+       C Interface to sum array elements over a given dimension, replacing any
+       NaNs with a specified value, according to an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out sum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the summation occurs
+       \param[in]  nanval   value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum_by_key
+    */
+    AFAPI af_err af_sum_by_key_nan(af_array *keys_out, af_array *vals_out,
+                                   const af_array keys, const af_array vals,
+                                   const int dim, const double nanval);
+#endif
+
+    /**
+       C Interface to multiply array elements over a given dimension.
+
+       \param[out] out product
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the product occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_product
     */
     AFAPI af_err af_product(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 39
+    /**
+       C Interface to multiply array elements over all dimensions.
+
+       Results in a single element `af::array`.
+
+       \param[out] out product
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product
+    */
+    AFAPI af_err af_product_all_array(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface to multiply array elements over a given dimension, replacing
+       any NaNs with a specified value.
+
+       \param[out] out    product
+       \param[in]  in     input array
+       \param[in]  dim    dimension along which the product occurs
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product
+    */
+    AFAPI af_err af_product_nan(af_array *out, const af_array in, const int dim, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 39
+    /**
+       C Interface to multiply array elements over all dimensions, replacing
+       any NaNs with a specified value.
+
+       \param[out] out    product
+       \param[in]  in     input array
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product
+    */
+    AFAPI af_err af_product_nan_all_array(af_array *out, const af_array in, const double nanval);
+#endif
+
+#if AF_API_VERSION >= 37
     /**
-       C Interface for minimum values in an array
+       C Interface to multiply array elements over a given dimension, according
+       to an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out product
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the product occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product_by_key
+    */
+    AFAPI af_err af_product_by_key(af_array *keys_out, af_array *vals_out,
+                                   const af_array keys, const af_array vals, const int dim);
+
+    /**
+       C Interface to multiply array elements over a given dimension, replacing
+       any NaNs with a specified value, according to an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out product
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the product occurs
+       \param[in]  nanval   value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product_by_key
+    */
+    AFAPI af_err af_product_by_key_nan(af_array *keys_out, af_array *vals_out,
+                                       const af_array keys, const af_array vals,
+                                       const int dim, const double nanval);
+#endif
 
-       \param[out] out will contain the minimum of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the minimum value is extracted
-       \return \ref AF_SUCCESS if the execution completes properly
+    /**
+       C Interface to return the minimum along a given dimension.
+
+       \param[out] out minimum
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the minimum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_min
     */
     AFAPI af_err af_min(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 37
+    /**
+       C Interface to return the minimum along a given dimension, according to
+       an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out minimum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the minimum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_min_by_key
+    */
+    AFAPI af_err af_min_by_key(af_array *keys_out, af_array *vals_out,
+                               const af_array keys, const af_array vals,
+                               const int dim);
+#endif
+
     /**
-       C Interface for maximum values in an array
+       C Interface to return the maximum along a given dimension.
 
-       \param[out] out will contain the maximum of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the maximum value is extracted
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out maximum
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the maximum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_max
     */
     AFAPI af_err af_max(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 37
+    /**
+       C Interface to return the maximum along a given dimension, according to
+       an array of keys.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out maximum
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the maximum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_max_by_key
+    */
+    AFAPI af_err af_max_by_key(af_array *keys_out, af_array *vals_out,
+                               const af_array keys, const af_array vals,
+                               const int dim);
+#endif
+
+#if AF_API_VERSION >= 38
     /**
-       C Interface for checking all true values in an array
+       C Interface to return the ragged maximum over a given dimension.
 
-       \param[out] out will contain the result of "and" operation all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the "and" operation occurs
-       \return \ref AF_SUCCESS if the execution completes properly
+       Input parameter `ragged_len` sets the number of elements to consider.
+
+       NaN values are ignored.
+
+       \param[out] val        ragged maximum
+       \param[out] idx        locations of the maximum ragged values
+       \param[in]  in         input array
+       \param[in]  ragged_len array containing the number of elements to use
+       \param[in]  dim        dimension along which the maximum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_max
+    */
+    AFAPI af_err af_max_ragged(af_array *val, af_array *idx, const af_array in, const af_array ragged_len, const int dim);
+#endif
+
+    /**
+       C Interface to check if all values along a given dimension are true.
+
+       NaN values are ignored.
+
+       \param[out] out array containing 1's if all true; 0's otherwise
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the check occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_all_true
     */
     AFAPI af_err af_all_true(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 37
     /**
-       C Interface for checking any true values in an array
+       C Interface to check if all values along a given dimension are true,
+       according to an array of keys.
 
-       \param[out] out will contain the result of "or" operation all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the "or" operation occurs
-       \return \ref AF_SUCCESS if the execution completes properly
+       NaN values are ignored.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out array containing 1's if all true; 0's otherwise
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the check occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_alltrue_by_key
+    */
+    AFAPI af_err af_all_true_by_key(af_array *keys_out, af_array *vals_out,
+                                    const af_array keys, const af_array vals,
+                                    const int dim);
+#endif
+
+    /**
+       C Interface to check if any values along a given dimension are true.
+
+       NaN values are ignored.
+
+       \param[out] out array containing 1's if any true; 0's otherwise
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the check occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_any_true
     */
     AFAPI af_err af_any_true(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 37
     /**
-       C Interface for counting non zero values in an array
+       C Interface to check if any values along a given dimension are true, according to an array of keys.
+
+       NaN values are ignored.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out array containing 1's if any true; 0's otherwise
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the check occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \param[out] out will contain the number of non-zero values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the non-zero values are counted
-       \return \ref AF_SUCCESS if the execution completes properly
+       \ingroup reduce_func_anytrue_by_key
+    */
+    AFAPI af_err af_any_true_by_key(af_array *keys_out, af_array *vals_out,
+                                    const af_array keys, const af_array vals,
+                                    const int dim);
+#endif
+
+    /**
+       C Interface to count non-zero values in an array along a given
+       dimension.
+
+       NaN values are treated as non-zero.
+
+       \param[out] out count
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the count occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_count
     */
     AFAPI af_err af_count(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 37
+    /**
+       C Interface to count non-zero values in an array, according to an array
+       of keys.
+
+       NaN values are treated as non-zero.
+
+       \param[out] keys_out reduced keys
+       \param[out] vals_out count
+       \param[in]  keys     keys array
+       \param[in]  vals     input array
+       \param[in]  dim      dimension along which the count occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_count_by_key
+    */
+    AFAPI af_err af_count_by_key(af_array *keys_out, af_array *vals_out,
+                                 const af_array keys, const af_array vals,
+                                 const int dim);
+#endif
+
     /**
-       C Interface for sum of all elements in an array
+       C Interface to sum array elements over all dimensions.
 
-       \param[out] real will contain the real part of adding all elements in input \p in
-       \param[out] imag will contain the imaginary part of adding all elements in input \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       If `in` is real, `imag` will be set to zero.
 
-       \note \p imag is always set to 0 when \p in is real
+       \param[out] real sum of all real components
+       \param[out] imag sum of all imaginary components
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_sum
     */
     AFAPI af_err af_sum_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 31
+    /**
+       C Interface to sum array elements over all dimensions, replacing any
+       NaNs with a specified value.
+
+       If `in` is real, `imag` will be set to zero.
+
+       \param[out] real   sum of all real components
+       \param[out] imag   sum of all imaginary components
+       \param[in]  in     input array
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_sum
+    */
+    AFAPI af_err af_sum_nan_all(double *real, double *imag,
+                                const af_array in, const double nanval);
+#endif
+
     /**
-       C Interface for product of all elements in an array
+       C Interface to multiply array elements over all dimensions.
 
-       \param[out] real will contain the real part of multiplying all elements in input \p in
-       \param[out] imag will contain the imaginary part of multiplying all elements in input \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       If `in` is real, `imag` will be set to zero.
 
-       \note \p imag is always set to 0 when \p in is real
+       \param[out] real real component of the product
+       \param[out] imag imaginary component of the product
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_product
     */
     AFAPI af_err af_product_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 31
     /**
-       C Interface for getting minimum value of an array
+       C Interface to multiply array elements over all dimensions, replacing
+       any NaNs with a specified value.
 
-       \param[out] real will contain the real part of minimum value of all elements in input \p in
-       \param[out] imag will contain the imaginary part of minimum value of all elements in input \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       If `in` is real, `imag` will be set to zero.
 
-       \note \p imag is always set to 0 when \p in is real.
+       \param[out] real   real component of the product
+       \param[out] imag   imaginary component of the product
+       \param[in]  in     input array
+       \param[in]  nanval value that replaces NaNs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_product
+    */
+    AFAPI af_err af_product_nan_all(double *real, double *imag,
+                                    const af_array in, const double nanval);
+#endif
+
+    /**
+       C Interface to return the minimum over all dimensions.
+
+       If `in` is real, `imag` will be set to zero.
+
+       \param[out] real real component of the minimum
+       \param[out] imag imaginary component of the minimum
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_min
     */
     AFAPI af_err af_min_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for getting maximum value of an array
+       C Interface to return the minimum over all dimensions.
 
-       \param[out] real will contain the real part of maximum value of all elements in input \p in
-       \param[out] imag will contain the imaginary part of maximum value of all elements in input \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out minimum
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \note \p imag is always set to 0 when \p in is real.
+       \ingroup reduce_func_min
+    */
+    AFAPI af_err af_min_all_array(af_array *out, const af_array in);
+#endif
+
+    /**
+       C Interface to return the maximum over all dimensions.
+
+       If `in` is real, `imag` will be set to zero.
+
+       \param[out] real real component of the maximum
+       \param[out] imag imaginary component of the maximum
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_max
     */
     AFAPI af_err af_max_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for checking if all values in an array are true
+       C Interface to return the maximum over all dimensions.
 
-       \param[out] real is 1 if all values of input \p in are true. 0 otherwise.
-       \param[out] imag is always set to 0.
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out maximum
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \note \p imag is always set to 0.
+       \ingroup reduce_func_max
+    */
+    AFAPI af_err af_max_all_array(af_array *out, const af_array in);
+#endif
+
+    /**
+       C Interface to check if all values over all dimensions are true.
+
+       \param[out] real 1 if all true; 0 otherwise
+       \param[out] imag 0
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_all_true
     */
     AFAPI af_err af_all_true_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for checking if any values in an array are true
+       C Interface to check if all values over all dimensions are true.
+
+       \param[out] out 1 if all true; 0 otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \param[out] real is 1 if any value of input \p in is true. 0 otherwise.
-       \param[out] imag is always set to 0.
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \ingroup reduce_func_all_true
+    */
+    AFAPI af_err af_all_true_all_array(af_array *out, const af_array in);
+#endif
+
+    /**
+       C Interface to check if any values over all dimensions are true.
 
-       \note \p imag is always set to 0.
+       \param[out] real 1 if any true; 0 otherwise
+       \param[out] imag 0
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_any_true
     */
     AFAPI af_err af_any_true_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for counting total number of non zero values in an array
+       C Interface to check if any values over all dimensions are true.
+
+       \param[out] out 1 if any true; 0 otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_any_true
+    */
+    AFAPI af_err af_any_true_all_array(af_array *out, const af_array in);
+#endif
 
-       \param[out] real will contain the number of non-zero values in \p in.
-       \param[out] imag is always set to 0.
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+    /**
+       C Interface to count non-zero values over all dimensions.
 
-       \note \p imag is always set to 0.
+       \param[out] real count
+       \param[out] imag 0
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_count
     */
     AFAPI af_err af_count_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 39
     /**
-       C Interface for getting minimum values and their locations in an array
+       C Interface to count non-zero values over all dimensions.
 
-       \param[out] out will contain the minimum of all values in \p in along \p dim
-       \param[out] idx will contain the location of minimum of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the minimum value is extracted
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out count
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup reduce_func_count
+    */
+    AFAPI af_err af_count_all_array(af_array *out, const af_array in);
+#endif
+
+    /**
+       C Interface to return the minimum and its location along a given
+       dimension.
+
+       \param[out] out minimum
+       \param[out] idx location
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the minimum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_min
     */
-    AFAPI af_err af_imin(af_array *out, af_array *idx, const af_array in, const int dim);
+    AFAPI af_err af_imin(af_array *out, af_array *idx, const af_array in,
+                         const int dim);
 
     /**
-       C Interface for getting maximum values and their locations in an array
+       C Interface to return the maximum and its location along a given
+       dimension.
 
-       \param[out] out will contain the maximum of all values in \p in along \p dim
-       \param[out] idx will contain the location of maximum of all values in \p in along \p dim
-       \param[in] in is the input array
-       \param[in] dim The dimension along which the maximum value is extracted
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out maximum
+       \param[out] idx location
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the maximum is found
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_max
     */
-    AFAPI af_err af_imax(af_array *out, af_array *idx, const af_array in, const int dim);
+    AFAPI af_err af_imax(af_array *out, af_array *idx, const af_array in,
+                         const int dim);
 
     /**
-       C Interface for getting minimum value and it's location from the entire array
+       C Interface to return the minimum and its location over all dimensions.
 
-       \param[out] real will contain the real part of minimum value of all elements in input \p in
-       \param[out] imag will contain the imaginary part of minimum value of all elements in input \p in
-       \param[out] idx will contain the location of minimum of all values in \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       NaN values are ignored.
 
-       \note \p imag is always set to 0 when \p in is real.
+       \param[out] real real component of the minimum
+       \param[out] imag imaginary component of the minimum; 0 if `in` is real
+       \param[out] idx  location
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_min
     */
-    AFAPI af_err af_imin_all(double *real, double *imag, unsigned *idx, const af_array in);
+    AFAPI af_err af_imin_all(double *real, double *imag, unsigned *idx,
+                             const af_array in);
 
     /**
-       C Interface for getting maximum value and it's location from the entire array
+       C Interface to return the maximum and its location over all dimensions.
 
-       \param[out] real will contain the real part of maximum value of all elements in input \p in
-       \param[out] imag will contain the imaginary part of maximum value of all elements in input \p in
-       \param[out] idx will contain the location of maximum of all values in \p in
-       \param[in] in is the input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       NaN values are ignored.
 
-       \note \p imag is always set to 0 when \p in is real.
+       \param[out] real real component of the maximum
+       \param[out] imag imaginary component of the maximum; 0 if `in` is real
+       \param[out] idx  location
+       \param[in]  in   input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup reduce_func_max
     */
     AFAPI af_err af_imax_all(double *real, double *imag, unsigned *idx, const af_array in);
 
     /**
-       C Interface exclusive sum (cumulative sum) of an array
+       C Interface to evaluate the cumulative sum (inclusive) along a given
+       dimension.
 
-       \param[out] out will contain exclusive sums of the input
-       \param[in] in is the input array
-       \param[in] dim The dimension along which exclusive sum is performed
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out cumulative sum
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the sum is accumulated
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup scan_func_accum
     */
     AFAPI af_err af_accum(af_array *out, const af_array in, const int dim);
 
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to scan an array (generalized) over a given dimension.
+
+       \param[out] out            scan
+       \param[in]  in             input array
+       \param[in]  dim            dimension along which the scan occurs
+       \param[in]  op             type of binary operation used
+       \param[in]  inclusive_scan flag specifying whether the scan is inclusive
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup scan_func_scan
+    */
+    AFAPI af_err af_scan(af_array *out, const af_array in, const int dim,
+                         af_binary_op op, bool inclusive_scan);
+
+    /**
+       C Interface to scan an array (generalized) over a given dimension,
+       according to an array of keys.
+
+       \param[out] out            scan
+       \param[in]  key            keys array
+       \param[in]  in             input array
+       \param[in]  dim            dimension along which the scan occurs
+       \param[in]  op             type of binary operation used
+       \param[in]  inclusive_scan flag specifying whether the scan is inclusive
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup scan_func_scanbykey
+    */
+    AFAPI af_err af_scan_by_key(af_array *out, const af_array key,
+                                const af_array in, const int dim,
+                                af_binary_op op, bool inclusive_scan);
+
+#endif
+
     /**
-       C Interface for finding the locations of non-zero values in an array
+       C Interface to locate the indices of the non-zero values in an array.
 
-       \param[out] idx will contain indices where \p in is non-zero
-       \param[in] in is the input array.
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] idx linear indices where `in` is non-zero
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup scan_func_where
     */
     AFAPI af_err af_where(af_array *idx, const af_array in);
 
     /**
-       C Interface for calculating first order differences in an array
+       C Interface to calculate the first order difference in an array over a
+       given dimension.
 
-       \param[out] out will contain the first order numerical differences of \p in
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out first order numerical difference
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the difference occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup calc_func_diff1
     */
     AFAPI af_err af_diff1(af_array *out, const af_array in, const int dim);
 
     /**
-       C Interface for calculating second order differences in an array
+       C Interface to calculate the second order difference in an array over a
+       given dimension.
 
-       \param[out] out will contain the second order numerical differences of \p in
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out second order numerical difference
+       \param[in]  in  input array
+       \param[in]  dim dimension along which the difference occurs
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup calc_func_diff2
     */
     AFAPI af_err af_diff2(af_array *out, const af_array in, const int dim);
 
     /**
-       C Interface for sorting an array
+       C Interface to sort an array over a given dimension.
 
-       \param[out] out will contain the sorted output
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \param[in] isAscending specifies the sorting order
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out         sorted output
+       \param[in]  in          input array
+       \param[in]  dim         dimension along which the sort occurs
+       \param[in]  isAscending specifies the sorting order
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
     */
-    AFAPI af_err af_sort(af_array *out, const af_array in, const unsigned dim, const bool isAscending);
+    AFAPI af_err af_sort(af_array *out, const af_array in, const unsigned dim,
+                         const bool isAscending);
 
     /**
-       C Interface for sorting an array and getting original indices
-
-       \param[out] out will contain the sorted output
-       \param[out] indices will contain the indices in the original input
-       \param[in] in is the input array
-       \param[in] dim The dimension along which numerical difference is performed
-       \param[in] isAscending specifies the sorting order
-       \return \ref AF_SUCCESS if the execution completes properly
-
-       \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
+       C Interface to sort an array over a given dimension and to return the
+       original indices.
+
+       \param[out] out         sorted output
+       \param[out] indices     indices from the input
+       \param[in]  in          input array
+       \param[in]  dim         dimension along which the sort occurs
+       \param[in]  isAscending specifies the sorting order
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup sort_func_sort_index
     */
     AFAPI af_err af_sort_index(af_array *out, af_array *indices, const af_array in,
                                const unsigned dim, const bool isAscending);
     /**
-       C Interface for sorting an array based on keys
-
-       \param[out] out_keys will contain the keys based on sorted values
-       \param[out] out_values will contain the sorted values
-       \param[in] keys is the input array
-       \param[in] values The dimension along which numerical difference is performed
-       \param[in] dim The dimension along which numerical difference is performed
-       \param[in] isAscending specifies the sorting order
-       \return \ref AF_SUCCESS if the execution completes properly
-
-       \ingroup sort_func_sort
-
-       \note \p dim is currently restricted to 0.
+       C Interface to sort an array over a given dimension, according to an
+       array of keys.
+
+       \param[out] out_keys    sorted keys
+       \param[out] out_values  sorted output
+       \param[in]  keys        keys array
+       \param[in]  values      input array
+       \param[in]  dim         dimension along which the sort occurs
+       \param[in]  isAscending specifies the sorting order
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup sort_func_sort_keys
     */
     AFAPI af_err af_sort_by_key(af_array *out_keys, af_array *out_values,
                                 const af_array keys, const af_array values,
                                 const unsigned dim, const bool isAscending);
 
     /**
-       C Interface for getting unique values
+       C Interface to return the unique values in an array.
 
-       \param[out] out will contain the unique values from \p in
-       \param[in] in is the input array
-       \param[in] is_sorted if true, skips the sorting steps internally
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out       unique values
+       \param[in]  in        input array
+       \param[in]  is_sorted if true, skip the sorting steps internally
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup set_func_unique
     */
     AFAPI af_err af_set_unique(af_array *out, const af_array in, const bool is_sorted);
 
     /**
-       C Interface for performing union of two arrays
+       C Interface to evaluate the union of two arrays.
 
-       \param[out] out will contain the union of \p first and \p second
-       \param[in] first is the first array
-       \param[in] second is the second array
-       \param[in] is_unique if true, skips calling unique internally
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out       union, values in increasing order
+       \param[in]  first     input array
+       \param[in]  second    input array
+       \param[in]  is_unique if true, skip calling unique internally
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup set_func_union
     */
-    AFAPI af_err af_set_union(af_array *out, const af_array first, const af_array second, const bool is_unique);
+    AFAPI af_err af_set_union(af_array *out, const af_array first,
+                              const af_array second, const bool is_unique);
 
     /**
-       C Interface for performing intersect of two arrays
+       C Interface to evaluate the intersection of two arrays.
 
-       \param[out] out will contain the intersection of \p first and \p second
-       \param[in] first is the first array
-       \param[in] second is the second array
-       \param[in] is_unique if true, skips calling unique internally
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out       intersection, values in increasing order
+       \param[in]  first     input array
+       \param[in]  second    input array
+       \param[in]  is_unique if true, skip calling unique internally
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup set_func_intersect
     */
-    AFAPI af_err af_set_intersect(af_array *out, const af_array first, const af_array second, const bool is_unique);
+    AFAPI af_err af_set_intersect(af_array *out, const af_array first,
+                                  const af_array second, const bool is_unique);
 
 #ifdef __cplusplus
 }
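Note on the `inclusive_scan` flag documented for `af_scan` and `af_scan_by_key` above: the sketch below illustrates the usual textbook distinction between an inclusive and an exclusive prefix sum on a plain `std::vector`. It does not call ArrayFire; the helper names `inclusive_sum` and `exclusive_sum` are hypothetical and exist only for this illustration, and the assumption that ArrayFire's flag follows the same convention should be checked against the library documentation.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + ... + in[i].
// Hypothetical helper for illustration; not part of the ArrayFire API.
std::vector<int> inclusive_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}

// Exclusive prefix sum: out[i] = in[0] + ... + in[i-1], with out[0] = 0.
// Hypothetical helper for illustration; not part of the ArrayFire API.
std::vector<int> exclusive_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;  // sum of all elements strictly before index i
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;
        running += in[i];
    }
    return out;
}
```

For the input `{1, 2, 3, 4}`, the inclusive form yields `{1, 3, 6, 10}` and the exclusive form yields `{0, 1, 3, 6}`; the exclusive result is the inclusive one shifted right by one position with the identity of the operation (0 for addition) in front.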
diff --git a/include/af/arith.h b/include/af/arith.h
index 0b2daa349b..c75544a5ab 100644
--- a/include/af/arith.h
+++ b/include/af/arith.h
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -14,46 +14,104 @@ namespace af
 {
     class array;
 
-    /// \ingroup arith_func_min
-    /// @{
-    /// \brief C++ interface for min of two arrays
+    /// C++ Interface to find the elementwise minimum between two arrays.
     ///
-    /// \param[in] lhs first input
-    /// \param[in] rhs second input
-    /// \return minimum of \p lhs and \p rhs
+    /// \param[in] lhs input array
+    /// \param[in] rhs input array
+    /// \return        minimum
     ///
+    /// \ingroup arith_func_min
     AFAPI array min    (const array &lhs, const array &rhs);
 
-    /// \copydoc min(const array&, const array &)
+    /// C++ Interface to find the elementwise minimum between an array and a
+    /// scalar value.
+    ///
+    /// \param[in] lhs input array
+    /// \param[in] rhs scalar value
+    /// \return        minimum
+    ///
+    /// \ingroup arith_func_min
     AFAPI array min    (const array &lhs, const double rhs);
 
-    /// \copydoc min(const array&, const array &)
+    /// C++ Interface to find the elementwise minimum between an array and a
+    /// scalar value.
+    ///
+    /// \param[in] lhs scalar value
+    /// \param[in] rhs input array
+    /// \return        minimum
+    ///
+    /// \ingroup arith_func_min
     AFAPI array min    (const double lhs, const array &rhs);
-    /// @}
 
-    /// \ingroup arith_func_max
-    /// @{
-    /// C++ Interface for max of two arrays or an array and a scalar
+    /// C++ Interface to find the elementwise maximum between two arrays.
     ///
-    /// \param[in] lhs first input
-    /// \param[in] rhs second input
-    /// \return maximum of \p lhs and \p rhs
+    /// \param[in] lhs input array
+    /// \param[in] rhs input array
+    /// \return        maximum
+    ///
+    /// \ingroup arith_func_max
     AFAPI array max    (const array &lhs, const array &rhs);
 
-    /// \copydoc max(const array&, const array&)
+    /// C++ Interface to find the elementwise maximum between an array and a
+    /// scalar value.
+    ///
+    /// \param[in] lhs input array
+    /// \param[in] rhs scalar value
+    /// \return        maximum
+    ///
+    /// \ingroup arith_func_max
     AFAPI array max    (const array &lhs, const double rhs);
 
-    /// \copydoc max(const array&, const array&)
+    /// C++ Interface to find the elementwise maximum between an array and a
+    /// scalar value.
+    ///
+    /// \param[in] lhs scalar value
+    /// \param[in] rhs input array
+    /// \return        maximum
+    ///
+    /// \ingroup arith_func_max
     AFAPI array max    (const double lhs, const array &rhs);
+
+#if AF_API_VERSION >= 34
+    /// @{
+    /// C++ Interface to clamp an array between an upper and a lower limit.
+    ///
+    /// \param[in] in input array
+    /// \param[in] lo lower limit; can be an array or a scalar
+    /// \param[in] hi upper limit; can be an array or a scalar
+    /// \return       clamped array
+    ///
+    /// \ingroup arith_func_clamp
+    AFAPI array clamp(const array &in, const array &lo, const array &hi);
+#endif
+
+#if AF_API_VERSION >= 34
+    /// \copydoc clamp(const array&, const array&, const array&)
+    AFAPI array clamp(const array &in, const array &lo, const double hi);
+#endif
+
+#if AF_API_VERSION >= 34
+    /// \copydoc clamp(const array&, const array&, const array&)
+    AFAPI array clamp(const array &in, const double lo, const array &hi);
+#endif
+
+#if AF_API_VERSION >= 34
+    /// \copydoc clamp(const array&, const array&, const array&)
+    AFAPI array clamp(const array &in, const double lo, const double hi);
+#endif
     /// @}
 
-    /// \ingroup arith_func_rem
     /// @{
-    /// C++ Interface for remainder when array divides array
+    /// C++ Interface to calculate the remainder.
+    ///
+    /// For integers, it returns the same output as modulus (% operator).
+    /// For floating point numbers, it returns the same as std::remainder from <cmath>.
+    ///
+    /// \param[in] lhs numerator; can be an array or a scalar
+    /// \param[in] rhs denominator; can be an array or a scalar
+    /// \return        remainder
     ///
-    /// \param[in] lhs is numerator
-    /// \param[in] rhs is denominator
-    /// \return remainder when \p rhs divides \p lhs
+    /// \ingroup arith_func_rem
     AFAPI array rem    (const array &lhs, const array &rhs);
 
     /// \copydoc rem(const array&, const array&)
@@ -63,13 +121,17 @@ namespace af
     AFAPI array rem    (const double lhs, const array &rhs);
     /// @}
 
-    /// \ingroup arith_func_mod
     /// @{
-    /// C++ Interface for modulus when dividend and divisor are arrays
+    /// C++ Interface to calculate the modulus.
+    ///
+    /// For integers, it returns the same output as modulus (% operator).
+    /// For floating point numbers, it returns the same as std::fmod from <cmath>.
     ///
-    /// \param[in] lhs is dividend
-    /// \param[in] rhs is divisor
-    /// \return \p lhs modulo \p rhs
+    /// \param[in] lhs dividend; can be an array or a scalar
+    /// \param[in] rhs divisor; can be an array or a scalar
+    /// \return        modulus
+    ///
+    /// \ingroup arith_func_mod
     AFAPI array mod    (const array &lhs, const array &rhs);
 
     /// \copydoc mod(const array&, const array&)
@@ -79,83 +141,73 @@ namespace af
     AFAPI array mod    (const double lhs, const array &rhs);
     /// @}
 
-    /// C++ Interface for absolute value
+    /// C++ Interface to calculate the absolute value.
     ///
-    /// \param[in] in is input array
-    /// \return absolute value of \p in
+    /// \param[in] in input array
+    /// \return       absolute value
     ///
     /// \ingroup arith_func_abs
     AFAPI array abs    (const array &in);
 
-    /**
-       C++ Interface for arg
-
-       \param[in] in is input array
-       \return phase of \p in
-
-       \ingroup arith_func_arg
-    */
+    /// C++ Interface to calculate the phase angle (in radians) of a complex
+    /// array.
+    ///
+    /// \param[in] in input array, typically complex
+    /// \return       phase angle (in radians)
+    ///
+    /// \ingroup arith_func_arg
     AFAPI array arg    (const array &in);
 
-    /**
-       C++ Interface for getting the sign of input
-
-       \param[in] in is input array
-       \return the sign of each element of input
-
-       \note output is 1 for negative numbers and 0 for positive numbers
-
-       \ingroup arith_func_sign
-    */
+    /// C++ Interface to return the sign of elements in an array.
+    ///
+    /// \param[in] in input array
+    /// \return       array containing 1's for negative values; 0's otherwise
+    ///
+    /// \ingroup arith_func_sign
     AFAPI array sign  (const array &in);
 
-    ///C++ Interface for rounding an array of numbers
+    /// C++ Interface to round numbers.
     ///
-    ///\param[in] in is input array
-    ///\return values rounded to nearest integer
+    /// \param[in] in input array
+    /// \return       nearest integer
     ///
-    ///\note The values are rounded to nearest integer
-    ///
-    ///\ingroup arith_func_round
+    /// \ingroup arith_func_round
     AFAPI array round  (const array &in);
 
-    /**
-       C++ Interface for truncating an array of numbers
-
-       \param[in] in is input array
-       \return values truncated to nearest integer not greater than input values
-
-       \ingroup arith_func_trunc
-    */
+    /// C++ Interface to truncate numbers.
+    ///
+    /// \param[in] in input array
+    /// \return       nearest integer not greater in magnitude than `in`
+    ///
+    /// \ingroup arith_func_trunc
     AFAPI array trunc  (const array &in);
 
-
-    /// C++ Interface for flooring an array of numbers
+    /// C++ Interface to floor numbers.
     ///
-    /// \param[in] in is input array
-    /// \return values rounded to nearest integer less than or equal to current value
+    /// \param[in] in input array
+    /// \return       nearest integer less than or equal to `in`
     ///
     /// \ingroup arith_func_floor
     AFAPI array floor  (const array &in);
 
-    /// C++ Interface for ceiling an array of numbers
+    /// C++ Interface to ceil numbers.
     ///
-    /// \param[in] in is input array
-    /// \return values rounded to nearest integer greater than or equal to current value
+    /// \param[in] in input array
+    /// \return       nearest integer greater than or equal to `in`
     ///
     /// \ingroup arith_func_ceil
     AFAPI array ceil   (const array &in);
 
     /// \ingroup arith_func_hypot
     /// @{
-    /// \brief C++ Interface for getting length of hypotenuse of two inputs
+    /// C++ Interface to calculate the length of the hypotenuse of two inputs.
     ///
-    /// Calculates the hypotenuse of two inputs. The inputs can be arrays or a
-    /// double.
+    /// Calculates the hypotenuse of two inputs. The inputs can be both arrays
+    /// or can be an array and a scalar.
     ///
-    /// \param[in] lhs is the length of first side
-    /// \param[in] rhs is the length of second side
-    /// \return the length of the hypotenuse
+    /// \param[in] lhs length of first side
+    /// \param[in] rhs length of second side
+    /// \return        length of the hypotenuse
     AFAPI array hypot  (const array &lhs, const array &rhs);
 
     /// \copydoc hypot(const array&, const array&)
@@ -165,61 +217,61 @@ namespace af
     AFAPI array hypot  (const double lhs, const array &rhs);
     /// @}
 
-    /// C++ Interface for sin
+    /// C++ Interface to evaluate the sine function.
     ///
-    /// \param[in] in is input array
-    /// \return sin of input
+    /// \param[in] in input array
+    /// \return       sine
     ///
     /// \ingroup arith_func_sin
     AFAPI array sin    (const array &in);
 
-    /// C++ Interface for cos
+    /// C++ Interface to evaluate the cosine function.
     ///
-    /// \param[in] in is input array
-    /// \return cos of input
+    /// \param[in] in input array
+    /// \return       cosine
     ///
     /// \ingroup arith_func_cos
     AFAPI array cos    (const array &in);
 
-    /// C++ Interface for tan
+    /// C++ Interface to evaluate the tangent function.
     ///
-    /// \param[in] in is input array
-    /// \return tan of input
+    /// \param[in] in input array
+    /// \return       tangent
     ///
     /// \ingroup arith_func_tan
     AFAPI array tan    (const array &in);
 
-    /// C++ Interface for arc sin (sin inverse)
+    /// C++ Interface to evaluate the inverse sine function.
     ///
-    /// \param[in] in is input array
-    /// \return arc sin of input
+    /// \param[in] in input array
+    /// \return       inverse sine
     ///
     /// \ingroup arith_func_asin
     AFAPI array asin   (const array &in);
 
-    /// C++ Interface for arc cos (cos inverse)
+    /// C++ Interface to evaluate the inverse cosine function.
     ///
-    /// \param[in] in is input array
-    /// \return arc cos of input
+    /// \param[in] in input array
+    /// \return       inverse cosine
     ///
     /// \ingroup arith_func_acos
     AFAPI array acos   (const array &in);
 
-    /// C++ Interface for arc tan (tan inverse)
+    /// C++ Interface to evaluate the inverse tangent function.
     ///
-    /// \param[in] in is input array
-    /// \return arc tan of input
+    /// \param[in] in input array
+    /// \return       inverse tangent
     ///
     /// \ingroup arith_func_atan
     AFAPI array atan   (const array &in);
 
     /// \ingroup arith_func_atan
     /// @{
-    /// C++ Interface for arc tan of two arrays
+    /// C++ Interface to evaluate the inverse tangent of two arrays.
     ///
     /// \param[in] lhs value of numerator
     /// \param[in] rhs value of denominator
-    /// \return arc tan of the inputs
+    /// \return        inverse tangent of the inputs
     AFAPI array atan2  (const array &lhs, const array &rhs);
 
     /// \copydoc atan2(const array&, const array&)
@@ -229,289 +281,322 @@ namespace af
     AFAPI array atan2  (const double lhs, const array &rhs);
     /// @}
 
-    /// \ingroup trig_func_cplx2
-    /// @{
-    /// C++ Interface for creating complex array from two inputs
-    ///
-    /// Creates a complex number from two sets of inputs. The left hand side is
-    /// the real part and the right hand side is the imaginary part. This
-    /// function accepts both \ref af::array and doubles as inputs. for both
-    /// inputs.
-    ///
-    /// \param[in] lhs is real value(s)
-    /// \param[in] rhs is imaginary value(s)
-    /// \return complex array from inputs
-    AFAPI array complex(const array &lhs, const array &rhs);
-
-    /// \copydoc complex(const array&, const array&)
-    AFAPI array complex(const array &lhs, const double rhs);
-
-    /// \copydoc complex(const array&, const array&)
-    AFAPI array complex(const double lhs, const array &rhs);
-
-    /// C++ Interface for creating complex array from real array
+    /// C++ Interface to evaluate the hyperbolic sine function.
     ///
-    /// \param[in] in is real array
-    /// \return complex array from \ref in
+    /// \param[in] in input array
+    /// \return       hyperbolic sine
     ///
-    /// \ingroup arith_func_cplx
-    AFAPI array complex(const array &in);
-    /// @}
+    /// \ingroup arith_func_sinh
+    AFAPI array sinh(const array& in);
 
-    /// C++ Interface for getting real part from complex array
+    /// C++ Interface to evaluate the hyperbolic cosine function.
     ///
-    /// \param[in] in is complex array
-    /// \return the real part of \p in
+    /// \param[in] in input array
+    /// \return       hyperbolic cosine
     ///
-    /// \ingroup arith_func_real
-    AFAPI array real   (const array &in);
+    /// \ingroup arith_func_cosh
+    AFAPI array cosh(const array& in);
 
-    /// C++ Interface for getting imaginary part from complex array
+    /// C++ Interface to evaluate the hyperbolic tangent function.
     ///
-    /// \param[in] in is complex array
-    /// \return the imaginary part of \p in
+    /// \param[in] in input array
+    /// \return       hyperbolic tangent
     ///
-    /// \ingroup arith_func_imag
-    AFAPI array imag   (const array &in);
+    /// \ingroup arith_func_tanh
+    AFAPI array tanh(const array& in);
 
-    /// C++ Interface for getting the complex conjugate of input array
+    /// C++ Interface to evaluate the inverse hyperbolic sine function.
     ///
-    /// \param[in] in is complex array
-    /// \return the complex conjugate of \p in
+    /// \param[in] in input array
+    /// \return       inverse hyperbolic sine
     ///
-    /// \ingroup arith_func_conjg
-    AFAPI array conjg  (const array &in);
+    /// \ingroup arith_func_asinh
+    AFAPI array asinh(const array& in);
 
-    /// C++ Interface for sinh
+    /// C++ Interface to evaluate the inverse hyperbolic cosine function.
     ///
-    /// \param[in] in is input array
-    /// \return sinh of input
+    /// \param[in] in input array
+    /// \return       inverse hyperbolic cosine
     ///
-    /// \ingroup arith_func_sinh
-    AFAPI array sinh    (const array &in);
+    /// \ingroup arith_func_acosh
+    AFAPI array acosh(const array& in);
 
-    /// C++ Interface for cosh
+    /// C++ Interface to evaluate the inverse hyperbolic tangent function.
     ///
-    /// \param[in] in is input array
-    /// \return cosh of input
+    /// \param[in] in input array
+    /// \return       inverse hyperbolic tangent
     ///
-    /// \ingroup arith_func_cosh
-    AFAPI array cosh    (const array &in);
+    /// \ingroup arith_func_atanh
+    AFAPI array atanh(const array& in);
 
-    /// C++ Interface for tanh
-    ///
-    /// \param[in] in is input array
-    /// \return tanh of input
-    ///
-    /// \ingroup arith_func_tanh
-    AFAPI array tanh    (const array &in);
+    /// \ingroup arith_func_cplx
+    /// @{
+    /// C++ Interface to create a complex array from a single real array.
+    ///
+    /// \param[in] in input array
+    /// \return       complex array
+    AFAPI array complex(const array& in);
+
+    /// C++ Interface to create a complex array from two real arrays.
+    ///
+    /// \param[in] real_ input array to be assigned as the real component of
+    ///                  the returned complex array
+    /// \param[in] imag_ input array to be assigned as the imaginary component
+    ///                  of the returned complex array
+    /// \return          complex array
+    AFAPI array complex(const array &real_, const array &imag_);
+
+    /// C++ Interface to create a complex array from a single real array for
+    /// the real component and a single scalar for each imaginary component.
+    ///
+    /// \param[in] real_ input array to be assigned as the real component of
+    ///                  the returned complex array
+    /// \param[in] imag_ single scalar to be assigned as the imaginary
+    ///                  component of each value of the returned complex array
+    /// \return          complex array
+    AFAPI array complex(const array &real_, const double imag_);
+
+    /// C++ Interface to create a complex array from a single scalar for each
+    /// real component and a single real array for the imaginary component.
+    ///
+    /// \param[in] real_ single scalar to be assigned as the real component of
+    ///                  each value of the returned complex array
+    /// \param[in] imag_ input array to be assigned as the imaginary component
+    ///                  of the returned complex array
+    /// \return          complex array
+    AFAPI array complex(const double real_, const array &imag_);
+    /// @}
 
-    /// C++ Interface for sinh inverse
+    /// C++ Interface to return the real part of a complex array.
     ///
-    /// \param[in] in is input array
-    /// \return sinh inverse of input
+    /// \param[in] in input complex array
+    /// \return       real part
     ///
-    /// \ingroup arith_func_asinh
-    AFAPI array asinh   (const array &in);
+    /// \ingroup arith_func_real
+    AFAPI array real   (const array &in);
 
-    /// C++ Interface for cosh inverse
+    /// C++ Interface to return the imaginary part of a complex array.
     ///
-    /// \param[in] in is input array
-    /// \return cosh inverse of input
+    /// \param[in] in input complex array
+    /// \return       imaginary part
     ///
-    /// \ingroup arith_func_acosh
-    AFAPI array acosh   (const array &in);
+    /// \ingroup arith_func_imag
+    AFAPI array imag   (const array &in);
 
-    /// C++ Interface for tanh inverse
+    /// C++ Interface to calculate the complex conjugate of an input array.
     ///
-    /// \param[in] in is input array
-    /// \return tanh inverse of input
+    /// \param[in] in input complex array
+    /// \return       complex conjugate
     ///
-    /// \ingroup arith_func_atanh
-    AFAPI array atanh   (const array &in);
+    /// \ingroup arith_func_conjg
+    AFAPI array conjg  (const array &in);
 
-    /// C++ Interface for nth root
+    /// C++ Interface to evaluate the nth root.
     ///
-    /// \param[in] lhs is nth root
-    /// \param[in] rhs is value
-    /// \return \p lhs th root of \p rhs
+    /// \param[in] nth_root nth root
+    /// \param[in] value    value
+    /// \return             `nth_root` th root of `value`
     ///
     /// \ingroup arith_func_root
-    AFAPI array root    (const array &lhs, const array &rhs);
+    AFAPI array root    (const array &nth_root, const array &value);
 
-    /// C++ Interface for nth root
+    /// C++ Interface to evaluate the nth root.
     ///
-    /// \param[in] lhs is nth root
-    /// \param[in] rhs is value
-    /// \return \p lhs th root of \p rhs
+    /// \param[in] nth_root nth root
+    /// \param[in] value    value
+    /// \return             `nth_root` th root of `value`
     ///
     /// \ingroup arith_func_root
-    AFAPI array root    (const array &lhs, const double rhs);
+    AFAPI array root    (const array &nth_root, const double value);
 
-    /// C++ Interface for nth root
+    /// C++ Interface to evaluate the nth root.
     ///
-    /// \param[in] lhs is nth root
-    /// \param[in] rhs is value
-    /// \return \p lhs th root of \p rhs
+    /// \param[in] nth_root nth root
+    /// \param[in] value    value
+    /// \return             `nth_root` th root of `value`
     ///
     /// \ingroup arith_func_root
-    AFAPI array root    (const double lhs, const array &rhs);
-
+    AFAPI array root    (const double nth_root, const array &value);
 
     /// \ingroup arith_func_pow
     /// @{
-    /// \brief C++ Interface for power
+    /// C++ Interface to raise a base to a power (or exponent).
     ///
-    /// Computes the value of \p lhs rased to the power of \p rhs. The inputs
-    /// can be an array or a double value.
+    /// Computes the value of `base` raised to the power of `exponent`. The
+    /// inputs can be two arrays or an array and a scalar.
     ///
-    /// \param[in] lhs is base
-    /// \param[in] rhs is exponent
-    /// \return \p lhs raised to power \p rhs
-    AFAPI array pow    (const array &lhs, const array &rhs);
+    /// \param[in] base     base
+    /// \param[in] exponent exponent
+    /// \return             `base` raised to the power of `exponent`
+    AFAPI array pow    (const array &base, const array &exponent);
 
     /// \copydoc pow(const array&, const array&)
-    AFAPI array pow    (const array &lhs, const double rhs);
+    AFAPI array pow    (const array &base, const double exponent);
 
     /// \copydoc pow(const array&, const array&)
-    AFAPI array pow    (const double lhs, const array &rhs);
+    AFAPI array pow    (const double base, const array &exponent);
 
-    /// C++ Interface for power of 2
-    ///
-    /// \param[in] in is exponent
-    /// \return 2 raised to power of \p in
+    /// C++ Interface to raise 2 to a power (or exponent).
     ///
+    /// \param[in] in power
+    /// \return       2 raised to the power
     AFAPI array pow2    (const array &in);
     /// @}
 
+#if AF_API_VERSION >= 31
+    /// C++ Interface to evaluate the logistic sigmoid function.
+    ///
+    /// Computes \f$\frac{1}{1+e^{-x}}\f$.
+    ///
+    /// \param[in] in input
+    /// \return       sigmoid
+    ///
+    /// \ingroup arith_func_sigmoid
+    AFAPI array sigmoid (const array &in);
+#endif
 
-    /// C++ Interface for exponential of an array
+    /// C++ Interface to evaluate the exponential.
     ///
-    /// \param[in] in is exponent
-    /// \return the exponential of \p in
+    /// \param[in] in exponent
+    /// \return       exponential
     ///
     /// \ingroup arith_func_exp
     AFAPI array exp    (const array &in);
 
-    /// C++ Interface for exponential of an array minus 1
+    /// C++ Interface to evaluate the exponential of an array minus 1,
+    /// `exp(in) - 1`.
+    ///
+    /// Useful when `in` is small, where `exp(in) - 1` would lose precision.
     ///
-    /// \param[in] in is exponent
-    /// \return the exponential of \p in - 1
+    /// \param[in] in exponent
+    /// \return       exponential minus 1
     ///
-    /// \note This function is useful when \p in is small
     /// \ingroup arith_func_expm1
     AFAPI array expm1  (const array &in);
 
-    /// C++ Interface for error function value
+    /// C++ Interface to evaluate the error function.
     ///
-    /// \param[in] in is input
-    /// \return the error function value
+    /// \param[in] in input array
+    /// \return       error function
     ///
     /// \ingroup arith_func_erf
     AFAPI array erf    (const array &in);
 
-    /// C++ Interface for complementary error function value
+    /// C++ Interface to evaluate the complementary error function.
     ///
-    /// \param[in] in is input
-    /// \return the complementary error function value
+    /// \param[in] in input array
+    /// \return       complementary error function
     ///
     /// \ingroup arith_func_erfc
     AFAPI array erfc   (const array &in);
 
-    /// C++ Interface for natural logarithm
+    /// C++ Interface to evaluate the natural logarithm.
     ///
-    /// \param[in] in is input
-    /// \return the natural logarithm of input
+    /// \param[in] in input array
+    /// \return       natural logarithm
     ///
     /// \ingroup arith_func_log
     AFAPI array log    (const array &in);
 
-    /// C++ Interface for natural logarithm of 1 + input
+    /// C++ Interface to evaluate the natural logarithm of 1 + input,
+    /// `ln(1+in)`.
     ///
-    /// \param[in] in is input
-    /// \return the natural logarithm of (1 + input)
+    /// Useful when `in` is small, where `log(1 + in)` would lose precision.
+    ///
+    /// \param[in] in input
+    /// \return       natural logarithm of `1 + in`
     ///
-    /// \note This function is useful when \p is small
     /// \ingroup arith_func_log1p
     AFAPI array log1p  (const array &in);
 
-    /// C++ Interface for logarithm base 10
+    /// C++ Interface to evaluate the base 10 logarithm.
     ///
-    /// \param[in] in is input
-    /// \return the logarithm of input in base 10
+    /// \param[in] in input
+    /// \return       base 10 logarithm
     ///
     /// \ingroup arith_func_log10
     AFAPI array log10  (const array &in);
 
-    /// C++ Interface for logarithm base 2
+    /// C++ Interface to evaluate the base 2 logarithm.
     ///
-    /// \param[in] in is input
-    /// \return the logarithm of input in base 2
+    /// \param[in] in input
+    /// \return       base 2 logarithm
     ///
     /// \ingroup explog_func_log2
     AFAPI array log2   (const array &in);
 
-    /// C++ Interface for square root of input
+    /// C++ Interface to evaluate the square root.
     ///
-    /// \param[in] in is input
-    /// \return the square root of input
+    /// \param[in] in input
+    /// \return       square root
     ///
     /// \ingroup arith_func_sqrt
     AFAPI array sqrt   (const array &in);
 
-    /// C++ Interface for cube root of input
+#if AF_API_VERSION >= 37
+    /// C++ Interface to evaluate the reciprocal square root.
+    ///
+    /// \param[in] in input
+    /// \return       reciprocal square root
+    ///
+    /// \ingroup arith_func_rsqrt
+    AFAPI array rsqrt   (const array &in);
+#endif
+
+    /// C++ Interface to evaluate the cube root.
     ///
-    /// \param[in] in is input
-    /// \return the cube root of input
+    /// \param[in] in input
+    /// \return       cube root
     ///
     /// \ingroup arith_func_cbrt
     AFAPI array cbrt   (const array &in);
 
+    /// C++ Interface to calculate the factorial.
     ///
-    /// C++ Interface for factorial of input
-    ///
-    /// \param[in] in is input
-    /// \return the factorial function of input
+    /// \param[in] in input
+    /// \return       factorial
     ///
     /// \ingroup arith_func_factorial
     AFAPI array factorial (const array &in);
 
-    /// C++ Interface for gamma function of input
+    /// C++ Interface to evaluate the gamma function.
     ///
-    /// \param[in] in is input
-    /// \return the gamma function of input
+    /// \param[in] in input
+    /// \return       gamma function
     ///
     /// \ingroup arith_func_tgamma
     AFAPI array tgamma (const array &in);
 
-    /// C++ Interface for logarithm of absolute value of gamma function of input
+    /// C++ Interface to evaluate the logarithm of the absolute value of the
+    /// gamma function.
     ///
-    /// \param[in] in is input
-    /// \return the logarithm of absolute value of gamma function of input
+    /// \param[in] in input
+    /// \return       logarithm of the absolute value of the gamma function
     ///
-    /// \ingroup arith_func_tgamma
+    /// \ingroup arith_func_lgamma
     AFAPI array lgamma (const array &in);
 
-    /// C++ Interface for checking values are zero
+    /// C++ Interface to check which values are zero.
     ///
-    /// \param[in] in is input
-    /// \return array containing 1's where input is 0, and 0 otherwise.
+    /// \param[in] in input
+    /// \return       array containing 1's where input is 0; 0's otherwise
     ///
     /// \ingroup arith_func_iszero
     AFAPI array iszero (const array &in);
 
-    /// C++ Interface for checking values are Infinities
+    /// C++ Interface to check if values are infinite.
     ///
-    /// \param[in] in is input
-    /// \return array containing 1's where input is Inf or -Inf, and 0 otherwise.
+    /// \param[in] in input
+    /// \return       array containing 1's where input is Inf or -Inf; 0's
+    ///               otherwise
     ///
     /// \ingroup arith_func_isinf
     AFAPI array isInf  (const array &in);
 
-    /// C++ Interface for checking values are NaNs
+    /// C++ Interface to check if values are NaN.
     ///
-    /// \param[in] in is input
-    /// \return array containing 1's where input is NaN, and 0 otherwise.
+    /// \param[in] in input
+    /// \return       array containing 1's where input is NaN; 0's otherwise
     ///
     /// \ingroup arith_func_isnan
     AFAPI array isNaN  (const array &in);
@@ -523,605 +608,748 @@ extern "C" {
 #endif
 
     /**
-       C Interface for adding arrays
+       C Interface to add two arrays.
 
-       \param[out] out will contain sum of \p lhs and \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs + rhs`
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_add
     */
     AFAPI af_err af_add   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for subtracting an array from another
+       C Interface to subtract one array from another array.
 
-       \param[out] out will contain result of \p lhs - \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs - rhs`
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_sub
     */
     AFAPI af_err af_sub   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for multiplying two arrays
+       C Interface to multiply two arrays.
 
-       \param[out] out will contain the product of \p lhs and  \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs * rhs`
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_mul
     */
     AFAPI af_err af_mul   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for dividing an array from another
+       C Interface to divide one array by another array.
 
-       \param[out] out will contain result of \p lhs / \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs / rhs`
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_div
     */
     AFAPI af_err af_div   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is less than another
+       C Interface to perform a less-than comparison between corresponding
+       elements of two arrays.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs < \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs < rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup logic_func_lt
     */
     AFAPI af_err af_lt    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is greater than another
+       C Interface to perform a greater-than comparison between corresponding
+       elements of two arrays.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs > \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs > rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_gt
     */
     AFAPI af_err af_gt    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is less or equal to than another
+       C Interface to perform a less-than-or-equal comparison between
+       corresponding elements of two arrays.
 
-       \param[out] out will contain result of \p lhs <= \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       Output type is b8.
+
+       \param[out] out   1's where `lhs <= rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_le
     */
     AFAPI af_err af_le    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is greater than or equal another
+       C Interface to perform a greater-than-or-equal comparison between
+       corresponding elements of two arrays.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs >= \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs >= rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_ge
     */
     AFAPI af_err af_ge    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is equal to another
+       C Interface to check if corresponding elements of two arrays are equal.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs == \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs == rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_eq
     */
     AFAPI af_err af_eq    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for checking if an array is not equal to another
+       C Interface to check if corresponding elements of two arrays are not
+       equal.
 
-       \param[out] out will contain result of \p lhs != \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       Output type is b8.
+
+       \param[out] out   1's where `lhs != rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_neq
     */
     AFAPI af_err af_neq   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for performing logical and on two arrays
+       C Interface to evaluate the logical AND of two arrays.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs && \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs && rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_and
     */
     AFAPI af_err af_and   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for performing logical or on two arrays
+       C Interface to evaluate the logical OR of two arrays.
+
+       Output type is b8.
 
-       \param[out] out will contain result of \p lhs || \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   1's where `lhs || rhs`, else 0's
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_or
     */
     AFAPI af_err af_or    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for performing logical not on input
+       C Interface to evaluate the logical NOT of an array.
 
-       \param[out] out will contain result of logical not of \p in
-       \param[in] in is the input
-       \return \ref AF_SUCCESS if the execution completes properly
+       Output type is b8.
+
+       \param[out] out !, logical NOT
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_not
     */
     AFAPI af_err af_not   (af_array *out, const af_array in);
 
+#if AF_API_VERSION >= 38
+    /**
+       C Interface to evaluate the bitwise NOT of an array.
+
+       \param[out] out ~, bitwise NOT
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup arith_func_bitnot
+    */
+    AFAPI af_err af_bitnot   (af_array *out, const af_array in);
+#endif
+
     /**
-       C Interface for performing bitwise and on two arrays
+       C Interface to evaluate the bitwise AND of two arrays.
 
-       \param[out] out will contain result of \p lhs & \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   &, bitwise AND
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_bitand
     */
     AFAPI af_err af_bitand   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for performing bitwise or on two arrays
+       C Interface to evaluate the bitwise OR of two arrays.
 
-       \param[out] out will contain result of \p lhs & \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   |, bitwise OR
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_bitor
     */
     AFAPI af_err af_bitor    (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for performing bitwise xor on two arrays
+       C Interface to evaluate the bitwise XOR of two arrays.
 
-       \param[out] out will contain result of \p lhs ^ \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   ^, bitwise XOR
+       \param[in]  lhs   first input
+       \param[in]  rhs   second input
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_bitxor
     */
     AFAPI af_err af_bitxor   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for left shfit on integer arrays
+       C Interface to shift the bits of integer arrays left.
 
-       \param[out] out will contain result of the left shift
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   left shift
+       \param[in]  lhs   values to shift
+       \param[in]  rhs   number of bits to shift
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_shiftl
     */
     AFAPI af_err af_bitshiftl(af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for right shfit on integer arrays
+       C Interface to shift the bits of integer arrays right.
 
-       \param[out] out will contain result of the right shift
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   right shift
+       \param[in]  lhs   values to shift
+       \param[in]  rhs   n bits to shift
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_shiftr
     */
     AFAPI af_err af_bitshiftr(af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for casting an array from one type to another
-
-       \param[out] out will contain the values in the specified type
-       \param[in] in is the input
-       \param[in] type is the target data type \ref af_dtype
-       \return \ref AF_SUCCESS if the execution completes properly
+       C Interface to cast an array from one type to another.
+
+       This function casts an af_array object from one type to another. If the
+       type of the original array is the same as `type` then the same array is
+       returned.
+
+       Consecutive casting operations may be optimized out if the
+       original type of the af_array is the same as the final type. For example
+       if the original type is f64, which is cast to f32 and then back to
+       f64, then the cast to f32 is skipped and that operation will *NOT*
+       be performed by ArrayFire. The following table shows which casts will
+       be optimized out in an outer -> inner -> outer chain (rows: outer type).
+
+       | inner-> | f32 | f64 | c32 | c64 | s32 | u32 | s8 | u8 | b8 | s64 | u64 | s16 | u16 | f16 |
+       |---------|-----|-----|-----|-----|-----|-----|----|----|----|-----|-----|-----|-----|-----|
+       | f32     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+       | f64     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+       | c32     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+       | c64     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+       | s32     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   |     |     | x   |
+       | u32     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   |     |     | x   |
+       | s8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+       | u8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+       | b8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+       | s64     | x   | x   | x   | x   |     |     |    |    |    | x   | x   |     |     | x   |
+       | u64     | x   | x   | x   | x   |     |     |    |    |    | x   | x   |     |     | x   |
+       | s16     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   | x   | x   | x   |
+       | u16     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   | x   | x   | x   |
+       | f16     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+
+       If you want to avoid this behavior, use af_eval after the first cast
+       operation. This will ensure that the cast operation is performed on the
+       af_array.
+
+       \param[out] out  values in the specified type
+       \param[in]  in   input
+       \param[in]  type target data type \ref af_dtype
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_cast
     */
     AFAPI af_err af_cast    (af_array *out, const af_array in, const af_dtype type);
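
The comment above says an f64 -> f32 -> f64 chain may skip the f32 step. The sketch below uses plain C casts, not ArrayFire calls, to show why that matters: the narrowing cast can change the value, so a skipped cast and a performed cast can give different results. The helper and demo function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>

/* Hypothetical helper (not part of ArrayFire): returns 1 if x is unchanged
   by a double -> float -> double round trip, 0 otherwise. */
int survives_f32_round_trip(double x) {
    float inner = (float)x;        /* narrowing cast, may round */
    double outer = (double)inner;  /* widening cast, always exact */
    return outer == x;
}

void cast_round_trip_demo(void) {
    assert(survives_f32_round_trip(0.5));   /* 0.5 is exact in both types */
    assert(!survives_f32_round_trip(0.1));  /* 0.1 rounds differently in float */
}
```

If the rounding of the intermediate type is the point of the cast, the documented remedy is to call af_eval after the first cast.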
 
     /**
-       C Interface for min of two arrays
+       C Interface to find the elementwise minimum between two arrays.
 
-       \param[out] out will contain minimum of \p lhs and \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   minimum
+       \param[in]  lhs   input array
+       \param[in]  rhs   input array
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_min
     */
     AFAPI af_err af_minof (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for max of two arrays
+       C Interface to find the elementwise maximum between two input
+       arrays.
 
-       \param[out] out will contain maximum of \p lhs and \p rhs
-       \param[in] lhs first input
-       \param[in] rhs second input
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   maximum
+       \param[in]  lhs   input array
+       \param[in]  rhs   input array
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_max
     */
     AFAPI af_err af_maxof (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
+#if AF_API_VERSION >= 34
     /**
-       C Interface for remainder
+       C Interface to clamp an array between an upper and a lower limit.
 
-       \param[out] out will contain the remainder of \p lhs divided by \p rhs
-       \param[in] lhs is numerator
-       \param[in] rhs is denominator
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   clamped array
+       \param[in]  in    input array
+       \param[in]  lo    lower limit array
+       \param[in]  hi    upper limit array
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup arith_func_clamp
+    */
+    AFAPI af_err af_clamp(af_array *out, const af_array in,
+                          const af_array lo, const af_array hi, const bool batch);
+#endif
+
+    /**
+       C Interface to calculate the remainder.
+
+       For integers, it returns the same output as modulus (% operator).
+       For floating point numbers, it returns the same as `remainder` from <math.h>.
+
+       \param[out] out   remainder
+       \param[in]  lhs   numerator
+       \param[in]  rhs   denominator
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_rem
     */
     AFAPI af_err af_rem   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for modulus
+       C Interface to calculate the modulus.
 
-       \param[out] out will contain the output of \p lhs modulo \p rhs
-       \param[in] lhs is dividend
-       \param[in] rhs is divisor
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       For integers, it returns the same output as modulus (% operator).
+       For floating point numbers, it returns the same as `fmod` from <math.h>.
+
+       \param[out] out   modulus
+       \param[in]  lhs   dividend
+       \param[in]  rhs   divisor
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_mod
     */
     AFAPI af_err af_mod   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for absolute value
+       C Interface to calculate the absolute value.
 
-       \param[out] out will contain the absolute value of \p in
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out absolute value
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_abs
     */
     AFAPI af_err af_abs     (af_array *out, const af_array in);
 
     /**
-       C Interface for finding the phase
+       C Interface to calculate the phase angle (in radians) of a complex
+       array.
 
-       \param[out] out will the phase of \p in
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out phase angle (in radians)
+       \param[in]  in  input array, typically complex
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_arg
     */
     AFAPI af_err af_arg     (af_array *out, const af_array in);
 
     /**
-       C Interface for finding the sign of the input
-
-       \param[out] out will contain the sign of each element of the input arrays
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       C Interface to calculate the sign of elements in an array.
 
-       \note output is 1 for negative numbers and 0 for positive numbers
+       \param[out] out array containing 1's for negative values; 0's otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_round
+       \ingroup arith_func_sign
     */
     AFAPI af_err af_sign   (af_array *out, const af_array in);
 
     /**
-       C Interface for rounding an array of numbers
-
-       \param[out] out will contain values rounded to nearest integer
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       C Interface to round numbers.
 
-       \note The values are rounded to nearest integer
+       \param[out] out nearest integer
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_round
     */
     AFAPI af_err af_round   (af_array *out, const af_array in);
 
     /**
-       C Interface for truncing an array of numbers
+       C Interface to truncate numbers.
 
-       \param[out] out will contain values truncated to nearest integer not greater than input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out nearest integer not greater in magnitude than `in`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_trunc
     */
     AFAPI af_err af_trunc   (af_array *out, const af_array in);
 
     /**
-       C Interface for flooring an array of numbers
+       C Interface to floor numbers.
 
-       \param[out] out will contain values rounded to nearest integer less than or equal to in
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out nearest integer less than or equal to `in`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_floor
     */
     AFAPI af_err af_floor   (af_array *out, const af_array in);
 
     /**
-       C Interface for ceiling an array of numbers
+       C Interface to ceil numbers.
 
-       \param[out] out will contain values rounded to nearest integer greater than or equal to in
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out nearest integer greater than or equal to `in`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_ceil
     */
     AFAPI af_err af_ceil    (af_array *out, const af_array in);
 
     /**
-       C Interface for getting length of hypotenuse of two arrays
+       C Interface to calculate the length of the hypotenuse of two inputs.
 
-       \param[out] out will contain the length of the hypotenuse
-       \param[in] lhs is the length of first side
-       \param[in] rhs is the length of second side
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   length of the hypotenuse
+       \param[in]  lhs   length of first side
+       \param[in]  rhs   length of second side
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_floor
     */
     AFAPI af_err af_hypot (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for sin
+       C Interface to evaluate the sine function.
 
-       \param[out] out will contain sin of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out sine
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_sin
     */
     AFAPI af_err af_sin     (af_array *out, const af_array in);
 
     /**
-       C Interface for cos
+       C Interface to evaluate the cosine function.
 
-       \param[out] out will contain cos of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out cosine
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_cos
     */
     AFAPI af_err af_cos     (af_array *out, const af_array in);
 
     /**
-       C Interface for tan
+       C Interface to evaluate the tangent function.
 
-       \param[out] out will contain tan of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out tangent
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_tan
     */
     AFAPI af_err af_tan     (af_array *out, const af_array in);
 
     /**
-       C Interface for arc sin
+       C Interface to evaluate the inverse sine function.
 
-       \param[out] out will contain arc sin of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse sine
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_asin
     */
     AFAPI af_err af_asin    (af_array *out, const af_array in);
 
     /**
-       C Interface for arc cos
+       C Interface to evaluate the inverse cosine function.
 
-       \param[out] out will contain arc cos of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse cosine
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_acos
     */
     AFAPI af_err af_acos    (af_array *out, const af_array in);
 
     /**
-       C Interface for arc tan
+       C Interface to evaluate the inverse tangent function.
 
-       \param[out] out will contain arc tan of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse tangent
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_atan
     */
     AFAPI af_err af_atan    (af_array *out, const af_array in);
 
     /**
-       C Interface for arc tan of two inputs
+       C Interface to evaluate the inverse tangent of two arrays.
 
-       \param[out] out will arc tan of the inputs
-       \param[in] lhs value of numerator
-       \param[in] rhs value of denominator
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   inverse tangent of two arrays
+       \param[in]  lhs   numerator
+       \param[in]  rhs   denominator
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_atan
     */
     AFAPI af_err af_atan2 (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for creating complex array from two input arrays
+       C Interface to evaluate the hyperbolic sine function.
 
-       \param[out] out will contain the complex array generated from inputs
-       \param[in] lhs is real array
-       \param[in] rhs is imaginary array
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out hyperbolic sine
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_cplx
+       \ingroup arith_func_sinh
     */
-    AFAPI af_err af_cplx2 (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
+    AFAPI af_err af_sinh    (af_array *out, const af_array in);
 
     /**
-       C Interface for creating complex array from real array
+       C Interface to evaluate the hyperbolic cosine function.
 
-       \param[out] out will contain complex array created from real input \p in
-       \param[in] in is real array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out hyperbolic cosine
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_cplx
+       \ingroup arith_func_cosh
     */
-    AFAPI af_err af_cplx    (af_array *out, const af_array in);
+    AFAPI af_err af_cosh    (af_array *out, const af_array in);
 
     /**
-       C Interface for getting real part from complex array
+       C Interface to evaluate the hyperbolic tangent function.
 
-       \param[out] out will contain the real part of \p in
-       \param[in] in is complex array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out hyperbolic tangent
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_real
+       \ingroup arith_func_tanh
     */
-    AFAPI af_err af_real    (af_array *out, const af_array in);
+    AFAPI af_err af_tanh    (af_array *out, const af_array in);
 
     /**
-       C Interface for getting imaginary part from complex array
+       C Interface to evaluate the inverse hyperbolic sine function.
 
-       \param[out] out will contain the imaginary part of \p in
-       \param[in] in is complex array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse hyperbolic sine
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_imag
+       \ingroup arith_func_asinh
     */
-    AFAPI af_err af_imag    (af_array *out, const af_array in);
+    AFAPI af_err af_asinh   (af_array *out, const af_array in);
 
     /**
-       C Interface for getting the complex conjugate of input array
+       C Interface to evaluate the inverse hyperbolic cosine function.
 
-       \param[out] out will contain the complex conjugate of \p in
-       \param[in] in is complex array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse hyperbolic cosine
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_conjg
+       \ingroup arith_func_acosh
     */
-    AFAPI af_err af_conjg   (af_array *out, const af_array in);
+    AFAPI af_err af_acosh   (af_array *out, const af_array in);
 
     /**
-       C Interface for sinh
+       C Interface to evaluate the inverse hyperbolic tangent function.
 
-       \param[out] out will contain sinh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out inverse hyperbolic tangent
+       \param[in]  in  input
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_sinh
+       \ingroup arith_func_atanh
     */
-    AFAPI af_err af_sinh    (af_array *out, const af_array in);
+    AFAPI af_err af_atanh   (af_array *out, const af_array in);
 
     /**
-       C Interface for cosh
+       C Interface to create a complex array from a single real array.
 
-       \param[out] out will contain cosh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out complex array
+       \param[in]  in  real array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_cosh
+       \ingroup arith_func_cplx
     */
-    AFAPI af_err af_cosh    (af_array *out, const af_array in);
+    AFAPI af_err af_cplx(af_array* out, const af_array in);
 
     /**
-       C Interface for tanh
+       C Interface to create a complex array from two real arrays.
 
-       \param[out] out will contain tanh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   complex array
+       \param[in]  real  real array to be assigned as the real component of the
+                         returned complex array
+       \param[in]  imag  real array to be assigned as the imaginary component
+                         of the returned complex array
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_tanh
+       \ingroup arith_func_cplx
     */
-    AFAPI af_err af_tanh    (af_array *out, const af_array in);
+    AFAPI af_err af_cplx2(af_array* out, const af_array real, const af_array imag, const bool batch);
 
     /**
-       C Interface for asinh
+       C Interface to return the real part of a complex array.
 
-       \param[out] out will contain inverse sinh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out real part
+       \param[in]  in  complex array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_asinh
+       \ingroup arith_func_real
     */
-    AFAPI af_err af_asinh   (af_array *out, const af_array in);
+    AFAPI af_err af_real(af_array* out, const af_array in);
 
     /**
-       C Interface for acosh
+       C Interface to return the imaginary part of a complex array.
 
-       \param[out] out will contain inverse cosh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out imaginary part
+       \param[in]  in  complex array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_acosh
+       \ingroup arith_func_imag
     */
-    AFAPI af_err af_acosh   (af_array *out, const af_array in);
+    AFAPI af_err af_imag(af_array* out, const af_array in);
 
     /**
-       C Interface for atanh
+       C Interface to evaluate the complex conjugate of an input array.
 
-       \param[out] out will contain inverse tanh of input
-       \param[in] in is input array
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out complex conjugate
+       \param[in]  in  complex array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup arith_func_atanh
+       \ingroup arith_func_conjg
     */
-    AFAPI af_err af_atanh   (af_array *out, const af_array in);
+    AFAPI af_err af_conjg(af_array* out, const af_array in);
 
     /**
-       C Interface for root
+       C Interface to evaluate the nth root.
 
-       \param[out] out will contain \p lhs th root of \p rhs
-       \param[in] lhs is nth root
-       \param[in] rhs is value
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs` th root of `rhs`
+       \param[in]  lhs   nth root
+       \param[in]  rhs   value
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_root
     */
@@ -1129,204 +1357,255 @@ extern "C" {
 
 
     /**
-       C Interface for power
+       C Interface to raise a base to a power (or exponent).
 
-       \param[out] out will contain \p lhs raised to power \p rhs
-       \param[in] lhs is base
-       \param[in] rhs is exponent
-       \param[in] batch specifies if operations need to be performed in batch mode
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out   `lhs` raised to the power of `rhs`
+       \param[in]  lhs   base
+       \param[in]  rhs   exponent
+       \param[in]  batch batch mode
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_pow
     */
     AFAPI af_err af_pow   (af_array *out, const af_array lhs, const af_array rhs, const bool batch);
 
     /**
-       C Interface for power of two
+       C Interface to raise 2 to a power (or exponent).
 
-       \param[out] out will contain the values of 2 to the power \p in
-       \param[in] in is exponent
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out 2 raised to the power of `in`
+       \param[in]  in  exponent
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_pow2
     */
     AFAPI af_err af_pow2     (af_array *out, const af_array in);
 
+#if AF_API_VERSION >= 31
     /**
-       C Interface for exponential of an array
+       C Interface to evaluate the logistic sigmoid function.
+
+       Computes \f$\frac{1}{1+e^{-x}}\f$.
 
-       \param[out] out will contain the exponential of \p in
-       \param[in] in is exponent
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out output of the logistic sigmoid function
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup arith_func_sigmoid
+    */
+    AFAPI af_err af_sigmoid(af_array* out, const af_array in);
+#endif
+
+    /**
+       C Interface to evaluate the exponential.
+
+       \param[out] out e raised to the power of `in`
+       \param[in]  in  exponent
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_exp
     */
     AFAPI af_err af_exp     (af_array *out, const af_array in);
 
     /**
-       C Interface for exponential of an array
+       C Interface to evaluate the exponential of an array minus 1,
+       `exp(in) - 1`.
 
-       \param[out] out will contain the exponential of \p in - 1
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out exponential of `in`, minus 1, `exp(in) - 1`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_expm1
     */
     AFAPI af_err af_expm1   (af_array *out, const af_array in);
 
     /**
-       C Interface for error function value
+       C Interface to evaluate the error function.
 
-       \param[out] out will contain the error function value of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out error function value
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_erf
     */
     AFAPI af_err af_erf     (af_array *out, const af_array in);
 
     /**
-       C Interface for complementary error function value
+       C Interface to evaluate the complementary error function.
 
-       \param[out] out will contain the complementary error function value of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out complementary error function
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_erfc
     */
     AFAPI af_err af_erfc    (af_array *out, const af_array in);
 
     /**
-       C Interface for natural logarithm
+       C Interface to evaluate the natural logarithm.
 
-       \param[out] out will contain the natural logarithm of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out natural logarithm
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_log
     */
     AFAPI af_err af_log     (af_array *out, const af_array in);
 
     /**
-       C Interface for logarithm of (in + 1)
+       C Interface to evaluate the natural logarithm of 1 + input, `ln(1+in)`.
 
-       \param[out] out will contain the logarithm of of (in + 1)
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out natural logarithm of `in + 1`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_log1p
     */
     AFAPI af_err af_log1p   (af_array *out, const af_array in);
 
     /**
-       C Interface for logarithm base 10
+       C Interface to evaluate the base 10 logarithm.
 
-       \param[out] out will contain the base 10 logarithm of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out base 10 logarithm
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_log10
     */
     AFAPI af_err af_log10   (af_array *out, const af_array in);
 
     /**
-       C Interface for logarithm base 2
+       C Interface to evaluate the base 2 logarithm.
 
-       \param[out] out will contain the base 2 logarithm of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out base 2 logarithm
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup explog_func_log2
     */
     AFAPI af_err af_log2   (af_array *out, const af_array in);
 
     /**
-       C Interface for square root
+       C Interface to evaluate the square root.
 
-       \param[out] out will contain the square root of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out square root
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_sqrt
     */
     AFAPI af_err af_sqrt    (af_array *out, const af_array in);
 
+#if AF_API_VERSION >= 37
     /**
-       C Interface for cube root
+       C Interface to evaluate the reciprocal square root.
 
-       \param[out] out will contain the cube root of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out reciprocal square root
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup arith_func_rsqrt
+    */
+    AFAPI af_err af_rsqrt    (af_array *out, const af_array in);
+#endif
+    /**
+       C Interface to evaluate the cube root.
+
+       \param[out] out cube root
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_cbrt
     */
     AFAPI af_err af_cbrt    (af_array *out, const af_array in);
 
     /**
-       C Interface for the factorial
+       C Interface to calculate the factorial.
 
-       \param[out] out will contain the result of factorial of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out factorial
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_factorial
     */
     AFAPI af_err af_factorial   (af_array *out, const af_array in);
 
     /**
-       C Interface for the gamma function
+       C Interface to evaluate the gamma function.
 
-       \param[out] out will contain the result of gamma function of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out gamma function
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_tgamma
     */
     AFAPI af_err af_tgamma   (af_array *out, const af_array in);
 
     /**
-       C Interface for the logarithm of absolute values of gamma function
+       C Interface to evaluate the logarithm of the absolute value of the
+       gamma function.
 
-       \param[out] out will contain the result of logarithm of absolute values of gamma function of \p in
-       \param[in] in is input
-       \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out logarithm of the absolute value of the gamma function
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup arith_func_lgamma
     */
     AFAPI af_err af_lgamma   (af_array *out, const af_array in);
 
     /**
-        C Interface for checking values are zero
+       C Interface to check if values are zero.
 
-        \param[out] out will contain 1's where input is 0, and 0 otherwise.
-        \param[in] in is input
-        \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out array containing 1's where input is 0; 0's otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup arith_func_iszero
+       \ingroup arith_func_iszero
     */
     AFAPI af_err af_iszero  (af_array *out, const af_array in);
 
     /**
-        C Interface for checking values are infinities
+       C Interface to check if values are infinite.
 
-        \param[out] out will contain 1's where input is Inf or -Inf, and 0 otherwise.
-        \param[in] in is input
-        \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out array containing 1's where input is Inf or -Inf; 0's
+                       otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup arith_func_isinf
+       \ingroup arith_func_isinf
     */
     AFAPI af_err af_isinf   (af_array *out, const af_array in);
 
     /**
-        C Interface for checking values are NaNs
+       C Interface to check if values are NaN.
 
-        \param[out] out will contain 1's where input is NaN, and 0 otherwise.
-        \param[in] in is input
-        \return \ref AF_SUCCESS if the execution completes properly
+       \param[out] out array containing 1's where input is NaN; 0's otherwise
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup arith_func_nan
+       \ingroup arith_func_isnan
     */
     AFAPI af_err af_isnan   (af_array *out, const af_array in);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/array.h b/include/af/array.h
index 1ae14a9b22..672c2716eb 100644
--- a/include/af/array.h
+++ b/include/af/array.h
@@ -8,14 +8,24 @@
  ********************************************************/
 
 #pragma once
+#include <af/compilers.h>
 #include <af/defines.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/exception.h>
+#include <af/index.h>
 #include <af/seq.h>
 #include <af/util.h>
-#include <af/index.h>
 
 #ifdef __cplusplus
 #include <af/traits.hpp>
-#include <vector>
+
+#if AF_API_VERSION >= 38
+#if AF_COMPILER_CXX_GENERALIZED_INITIALIZERS
+#include <initializer_list>
+#endif
+#endif
+
 namespace af
 {
 
@@ -23,7 +33,7 @@ namespace af
 
     ///
     /// \brief A multi dimensional data container
-    ///
+    /// \ingroup arrayfire_class
     class AFAPI array {
         af_array   arr;
 
@@ -32,13 +42,16 @@ namespace af
         ///
         /// \brief Updates the internal \ref af_array object
         ///
-        /// /note This function will reduce the reference of the previous
+        /// \note This function will reduce the reference of the previous
         ///       \ref af_array object
         ///
         void set(af_array tmp);
 
         ///
-        /// \brief Intermediate data class. Used for assignment and indexing
+        /// \brief Intermediate data class. Used for assignment and indexing.
+        ///
+        /// \note This class is for internal bookkeeping while indexing. This
+        ///       class is not intended for use in user code.
         ///
         class AFAPI array_proxy
         {
@@ -48,7 +61,7 @@ namespace af
         public:
             array_proxy(array& par, af_index_t *ssss, bool linear = false);
             array_proxy(const array_proxy &other);
-#if __cplusplus > 199711L
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
             array_proxy(array_proxy &&other);
             array_proxy & operator=(array_proxy &&other);
 #endif
@@ -58,7 +71,7 @@ namespace af
             operator array() const;
             operator array();
 
-#define ASSIGN(OP)                                                  \
+#define ASSIGN_(OP)                                                 \
             array_proxy& operator OP(const array_proxy &a);         \
             array_proxy& operator OP(const array &a);               \
             array_proxy& operator OP(const double &a);              \
@@ -73,7 +86,27 @@ namespace af
             array_proxy& operator OP(const long  &a);               \
             array_proxy& operator OP(const unsigned long &a);       \
             array_proxy& operator OP(const long long  &a);          \
-            array_proxy& operator OP(const unsigned long long &a);  \
+            array_proxy& operator OP(const unsigned long long &a);
+
+#if AF_API_VERSION >= 32
+#define ASSIGN_32(OP)                                               \
+            array_proxy& operator OP(const short &a);               \
+            array_proxy& operator OP(const unsigned short &a);
+#else
+#define ASSIGN_32(OP)
+#endif
+
+#if AF_API_VERSION >= 310
+#define ASSIGN_310(OP)                                              \
+            array_proxy& operator OP(const signed char &a);
+#else
+#define ASSIGN_310(OP)
+#endif
+
+#define ASSIGN(OP)              \
+            ASSIGN_(OP)         \
+            ASSIGN_32(OP)       \
+            ASSIGN_310(OP)
 
             ASSIGN(=)
             ASSIGN(+=)
@@ -81,6 +114,9 @@ namespace af
             ASSIGN(*=)
             ASSIGN(/=)
 #undef ASSIGN
+#undef ASSIGN_
+#undef ASSIGN_32
+#undef ASSIGN_310
 
             // af::array member functions. same behavior as those below
             af_array get();
@@ -93,6 +129,7 @@ namespace af
             dim_t dims(unsigned dim) const;
             unsigned numdims() const;
             size_t bytes() const;
+            size_t allocated() const;
             array copy() const;
             bool isempty() const;
             bool isscalar() const;
@@ -103,10 +140,16 @@ namespace af
             inline bool isreal() const { return !iscomplex(); }
             bool isdouble() const;
             bool issingle() const;
+#if AF_API_VERSION >= 37
+            bool ishalf() const;
+#endif
             bool isrealfloating() const;
             bool isfloating() const;
             bool isinteger() const;
             bool isbool() const;
+#if AF_API_VERSION >= 34
+            bool issparse() const;
+#endif
             void eval() const;
             array as(dtype type) const;
             array T() const;
@@ -114,15 +157,34 @@ namespace af
             template<typename T> T scalar() const;
             template<typename T> T* device() const;
             void unlock() const;
+#if AF_API_VERSION >= 31
+            void lock() const;
+#endif
+
+#if AF_API_VERSION >= 34
+            bool isLocked() const;
+#endif
+
+                  array::array_proxy row(int index);
+            const array::array_proxy row(int index) const;
+
+                  array::array_proxy rows(int first, int last);
+            const array::array_proxy rows(int first, int last) const;
+
+                  array::array_proxy col(int index);
+            const array::array_proxy col(int index) const;
+                  array::array_proxy cols(int first, int last);
+            const array::array_proxy cols(int first, int last) const;
+
+                  array::array_proxy slice(int index);
+            const array::array_proxy slice(int index) const;
+
+                  array::array_proxy slices(int first, int last);
+            const array::array_proxy slices(int first, int last) const;
         };
 
-        //array(af_array in, const array *par, af_index_t seqs[4]);
         /**
-            \ingroup construct_mat
-            @{
-        */
-        /**
-            Create non-dimensioned array (no data, undefined size)
+            Create an uninitialized array (no data, undefined size)
 
             \code
             array A, B, C;   // creates three arrays called A, B and C
@@ -130,8 +192,36 @@ namespace af
         */
         array();
 
+#if AF_API_VERSION >= 37
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
+        /**
+            Move constructor
+
+            Moves the \p other af::array into the current af::array. After this
+            operation, the \p other array will be left uninitialized.
+
+            \param[in] other The array to be moved
+        */
+        array(array &&other) AF_NOEXCEPT;
+
+        /**
+            Move assignment operator
+
+            Moves the array into the current array. After this operation the
+            \p other array is left uninitialized. The previously referenced
+            af_array of the current object is released.
+
+            \param[in] other The array to be moved
+            \returns the reference to the current array
+        */
+        array &operator=(array &&other) AF_NOEXCEPT;
+#endif
+#endif
         /**
-            Creates an array from an \ref af_array handle
+            Creates an array from an \ref af_array handle. Does not increment
+            a reference counter: the array assumes ownership of the handle. To
+            share the array between multiple objects, use this in conjunction
+            with \ref af_retain_array.
             \param handle the af_array object.
          */
         explicit
@@ -166,6 +256,7 @@ namespace af
                        (default is f32)
 
         */
+        explicit
         array(dim_t dim0, dtype ty = f32);
 
         /**
@@ -191,6 +282,7 @@ namespace af
                        (default is f32)
 
         */
+        explicit
         array(dim_t dim0, dim_t dim1, dtype ty = f32);
 
         /**
@@ -217,6 +309,7 @@ namespace af
                        (default is f32)
 
         */
+        explicit
         array(dim_t dim0, dim_t dim1, dim_t dim2, dtype ty = f32);
 
         /**
@@ -244,6 +337,7 @@ namespace af
                        (default is f32)
 
         */
+        explicit
         array(dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3, dtype ty = f32);
 
         /**
@@ -277,15 +371,6 @@ namespace af
         /**
             Create a column vector on the device using a host/device pointer
 
-            This function can be used to transfer data from a host or device
-            pointer to an array object on the device with one column. The type
-            of the array is automatically matched to the type of the data.
-
-            Depending on the specified size of the column vector, the data will
-            be copied partially or completely. However, the user needs to be
-            careful to ensure that the array size is not larger than the number
-            of elements in the input buffer.
-
             \param[in] dim0     number of elements in the column vector
             \param[in] pointer  pointer (points to a buffer on the host/device)
             \param[in] src      source of the data (default is afHost, can also
@@ -297,14 +382,21 @@ namespace af
 
             array A(4, h_buffer);   // copy host data to device
                                     //
-                                    // A = 23
-                                    //   = 34
-                                    //   = 18
-                                    //   = 99
+                                    // A = [23]
+                                    //     [34]
+                                    //     [18]
+                                    //     [99]
 
             \endcode
+
+            \note If \p src is \ref afHost, the first \p dim0 elements are
+                  copied. If \p src is \ref afDevice, no copy is done; the
+                  array object wraps the device pointer AND takes ownership
+                  of the underlying memory.
+
         */
         template<typename T>
+        explicit
         array(dim_t dim0,
               const T *pointer, af::source src=afHost);
 
@@ -312,14 +404,6 @@ namespace af
         /**
             Create a 2D array on the device using a host/device pointer
 
-            This function copies data from the location specified by the
-            pointer to a 2D array on the device. The data is arranged in
-            "column-major" format (similar to that used by FORTRAN).
-
-            Note that this is an synchronous copy. The elements are not
-            actually filled until this array is evaluated or used in the
-            evaluation of some other expression that uses this array object.
-
             \param[in] dim0     number of rows
             \param[in] dim1     number of columns
             \param[in] pointer  pointer (points to a buffer on the host/device)
@@ -332,8 +416,15 @@ namespace af
             \endcode
 
             \image html 2dArray.png
+
+            \note If \p src is \ref afHost, the first \p dim0 * \p dim1 elements
+                  are copied. If \p src is \ref afDevice, no copy is done; the
+                  array object wraps the device pointer AND takes ownership of
+                  the underlying memory. The data is treated as column major
+                  format when performing linear algebra operations.
         */
         template<typename T>
+        explicit
         array(dim_t dim0, dim_t dim1,
               const T *pointer, af::source src=afHost);
 
@@ -341,10 +432,6 @@ namespace af
         /**
             Create a 3D array on the device using a host/device pointer
 
-            This function copies data from the location specified by the pointer
-            to a 3D array on the device. The data is arranged in "column-major"
-            format (similar to that used by FORTRAN).
-
             \param[in] dim0     first dimension
             \param[in] dim1     second dimension
             \param[in] dim2     third dimension
@@ -359,9 +446,17 @@ namespace af
             array A(3, 3, 2,  h_buffer);   // copy host data to 3D device array
             \endcode
 
+            \note If \p src is \ref afHost, the first \p dim0 * \p dim1 *
+                  \p dim2 elements are copied. If \p src is \ref afDevice, no
+                  copy is done; the array object wraps the device pointer AND
+                  takes ownership of the underlying memory. The data is treated
+                  as column major format when performing linear algebra
+                  operations.
+
             \image html 3dArray.png
         */
         template<typename T>
+        explicit
         array(dim_t dim0, dim_t dim1, dim_t dim2,
               const T *pointer, af::source src=afHost);
 
@@ -369,10 +464,6 @@ namespace af
         /**
             Create a 4D array on the device using a host/device pointer
 
-            This function copies data from the location specified by the pointer
-            to a 4D array on the device. The data is arranged in "column-major"
-            format (similar to that used by FORTRAN).
-
             \param[in] dim0     first dimension
             \param[in] dim1     second dimension
             \param[in] dim2     third dimension
@@ -389,8 +480,17 @@ namespace af
 
             array A(2, 2, 2, 2, h_buffer);   // copy host data to 4D device array
             \endcode
+
+            \note If \p src is \ref afHost, the first \p dim0 * \p dim1 *
+                  \p dim2 * \p dim3 elements are copied. If \p src is
+                  \ref afDevice, no copy is done; the array object wraps the
+                  device pointer AND takes ownership of the underlying
+                  memory. The data is treated as column major format when
+                  performing linear algebra operations.
+
         */
         template<typename T>
+        explicit
         array(dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3,
               const T *pointer, af::source src=afHost);
 
@@ -424,18 +524,64 @@ namespace af
                                              // Note the "column-major" ordering
                                              // used in ArrayFire
             \endcode
+
+            \note If \p src is \ref afHost, the first dims.elements() elements
+                  are copied. If \p src is \ref afDevice, no copy is done; the
+                  array object wraps the device pointer AND takes ownership of
+                  the underlying memory. The data is treated as column major
+                  format when performing linear algebra operations.
+
         */
         template<typename T>
         explicit
         array(const dim4& dims,
               const T *pointer, af::source src=afHost);
 
+#if AF_API_VERSION >= 38
+#if AF_COMPILER_CXX_GENERALIZED_INITIALIZERS
+        /// \brief Initializer list constructor
+        template <typename T, typename = typename std::enable_if<
+                                  std::is_fundamental<T>::value, void>::type>
+        array(std::initializer_list<T> list)
+        : arr(nullptr) {
+          dim_t size = list.size();
+          if (af_err __aferr = af_create_array(&arr, list.begin(), 1, &size,
+                              static_cast<af_dtype>(af::dtype_traits<T>::af_type))) {
+            char *msg = NULL;
+            af_get_last_error(&msg, NULL);
+            af::exception ex(msg, __PRETTY_FUNCTION__, "include/af/array.h",
+                             __LINE__, __aferr);
+            af_free_host(msg);
+            throw std::move(ex);
+          }
+        }
+
+        /// \brief Initializer list constructor
+        template <typename T, typename = typename std::enable_if<
+                                  std::is_fundamental<T>::value, void>::type>
+        array(const af::dim4 &dims, std::initializer_list<T> list)
+            : arr(nullptr) {
+          const dim_t *size = dims.get();
+          if (af_err __aferr = af_create_array(
+              &arr, list.begin(), AF_MAX_DIMS, size,
+              static_cast<af_dtype>(af::dtype_traits<T>::af_type))) {
+            char *msg = NULL;
+            af_get_last_error(&msg, NULL);
+            af::exception ex(msg, __PRETTY_FUNCTION__, "include/af/array.h",
+                             __LINE__, __aferr);
+            af_free_host(msg);
+            throw std::move(ex);
+          }
+        }
+#endif
+#endif
+
         /**
            Adjust the dimensions of an N-D array (fast).
 
-           This operation simply rearranges the description of the array
-           on the host device. No memory transfers or transformations are
-           performed. The total number of elements must not change.
+           This operation simply rearranges the description of the array.
+           No memory transfers or transformations are performed. The total
+           number of elements must not change.
 
            \code
            float f[] = {1,2,3,4};
@@ -455,16 +601,15 @@ namespace af
 
            \param[in] input
            \param[in] dims total number of elements must not change.
-           \return same underlying array data with different dimensions
         */
         array(const array& input, const dim4& dims);
 
         /**
            Adjust the dimensions of an N-D array (fast).
 
-           This operation simply rearranges the description of the array
-           on the host device. No memory transfers or transformations are
-           performed. The total number of elements must not change.
+           This operation simply rearranges the description of the array.
+           No memory transfers or transformations are performed. The total
+           number of elements must not change.
 
            \code
 
@@ -488,21 +633,11 @@ namespace af
            \param[in] dim1 second dimension
            \param[in] dim2 third dimension
            \param[in] dim3 fourth dimension
-           \return same underlying array data with different dimensions
         */
         array(  const array& input,
                 const dim_t dim0, const dim_t dim1 = 1,
                 const dim_t dim2 = 1, const dim_t dim3 = 1);
 
-        /**
-            @}
-        */
-
-        /**
-           \ingroup method_mat
-           @{
-        */
-
         /**
            get the \ref af_array handle
         */
@@ -514,7 +649,7 @@ namespace af
         af_array get() const;
 
         /**
-           get the number of elements in array
+           Get the total number of elements across all dimensions of the array
         */
         dim_t elements() const;
 
@@ -529,7 +664,8 @@ namespace af
         void host(void *ptr) const;
 
         /**
-           Perform deep from host/device pointer to an existing array
+           Perform deep copy from host/device pointer to an existing array
+           \note Unlike all other assignment operations, this does NOT result in a copy on write.
         */
         template<typename T> void write(const T *ptr, const size_t bytes, af::source src = afHost);
 
@@ -558,6 +694,12 @@ namespace af
         */
         size_t bytes() const;
 
+        /**
+           Get the size of the array in memory. This will return the parent's
+           bytes() if the array is indexed.
+        */
+        size_t allocated() const;
+
         /**
            Perform deep copy of the array
         */
@@ -574,17 +716,20 @@ namespace af
         bool isscalar() const;
 
         /**
-           \brief Returns true if only one of the array dimensions has more than one elment
+           \brief Returns true if only one of the array dimensions has more
+                  than one element
         */
         bool isvector() const;
 
         /**
-           \brief Returns true if only the second dimension has more than one element
+           \brief Returns true if only the second dimension has more than one
+                  element
         */
         bool isrow() const;
 
         /**
-           \brief Returns true if only the first dimension has more than one element
+           \brief Returns true if only the first dimension has more than one
+                  element
         */
         bool iscolumn() const;
 
@@ -604,22 +749,31 @@ namespace af
         bool isdouble() const;
 
         /**
-           \brief Returns true if the array type is neither \ref f64 nor \ref c64
+           \brief Returns true if the array type is either \ref f32 or \ref c32
         */
         bool issingle() const;
 
+#if AF_API_VERSION >= 37
         /**
-           \brief Returns true if the array type is \ref f32 or \ref f64
+           \brief Returns true if the array type is \ref f16
+        */
+        bool ishalf() const;
+#endif
+
+        /**
+           \brief Returns true if the array type is \ref f16, \ref f32 or \ref f64
         */
         bool isrealfloating() const;
 
         /**
-           \brief Returns true if the array type is \ref f32, \ref f64, \ref c32 or \ref c64
+           \brief Returns true if the array type is \ref f16, \ref f32, \ref f64,
+                  \ref c32 or \ref c64
         */
         bool isfloating() const;
 
         /**
-           \brief Returns true if the array type is \ref u8, \ref b8, \ref s32 \ref u32, \ref s64, \ref u64
+           \brief Returns true if the array type is \ref s8, \ref u8, \ref b8,
+                  \ref s32, \ref u32, \ref s64, \ref u64, \ref s16, \ref u16
         */
         bool isinteger() const;
 
@@ -628,78 +782,98 @@ namespace af
         */
         bool isbool() const;
 
+#if AF_API_VERSION >= 34
+        /**
+           \brief Returns true if the array is a sparse array
+        */
+        bool issparse() const;
+#endif
+
         /**
            \brief Evaluate any JIT expressions to generate data for the array
         */
         void eval() const;
 
         /**
-           @}
+           \brief Get the first element of the array as a scalar
+
+           \note The scalar function is recommended for use while debugging.
+                 Calling this method often will affect performance.
         */
         template<typename T> T scalar() const;
 
-
         /**
-           \defgroup device_func_device array::device<T>
+           \brief Get the device pointer from the array and lock the buffer in memory manager.
 
-           Get the device pointer from the array
-           @{
+           The device memory returned by this function is not freed until unlock() is called.
 
-           \ingroup arrayfire_func
-           \ingroup device_mat
+           \note When using the OpenCL backend and using the cl_mem template argument, the
+                 delete function should be called on the pointer returned by this function.
         */
         template<typename T> T* device() const;
+
+        // INDEXING
+        // Single arguments
+
         /**
-           @}
-        */
+            \brief This operator returns a reference of the original array at a given coordinate.
 
-        void unlock() const;
+            You can pass \ref af::seq, \ref af::array, or an int as its parameters.
+            These references can be used for assignment or returning references
+            to \ref af::array objects.
 
+            If the \ref af::array is a multi-dimensional array then this coordinate
+            will be treated as a linear index into the data.
 
-        // INDEXING
-        // Single arguments
+            \param[in] s0   is sequence of linear indices
 
+            \returns A reference to the array at the given index
 
-        /// \ingroup array_mem_operator_paren
-        /// @{
-        ///
-        /// \brief Gets a reference to a set of linear elements
-        ///
-        /// \copydetails array_mem_operator_paren_one
-        ///
-        /// \param[in] s0   is sequence of linear indices
-        ///
-        /// \returns A reference to the array at the given index
-        ///
-              array::array_proxy operator()(const index &s0);
+            \ingroup array_mem_operator_paren
 
-        /// \copydoc operator()(const index &)
+        */
+        array::array_proxy operator()(const index &s0);
+
+        /**
+            \copydoc operator()(const index &)
+
+            \ingroup array_mem_operator_paren
+        */
         const array::array_proxy operator()(const index &s0) const;
 
 
-        ///
-        /// \brief Gets a reference to a sub array
-        ///
-        /// \copydetails array_mem_operator_paren_many
-        ///
-        /// \param[in] s0   is sequence of indices along the first dimension
-        /// \param[in] s1   is sequence of indices along the second dimension
-        /// \param[in] s2   is sequence of indices along the third dimension
-        /// \param[in] s3   is sequence of indices along the fourth dimension
-        ///
-        /// \returns A reference to the array at the given index
-        ///
-              array::array_proxy operator()(const index &s0,
-                                            const index &s1,
-                                            const index &s2 = span,
-                                            const index &s3 = span);
+        /**
+            \brief This operator returns a reference to the original array at a
+            given coordinate.
+
+            You can pass \ref af::seq, \ref af::array, or an int as its parameters.
+            These references can be used for assignment or for returning references
+            to \ref af::array objects.
+
+            \param[in] s0   is the sequence of indices along the first dimension
+            \param[in] s1   is the sequence of indices along the second dimension
+            \param[in] s2   is the sequence of indices along the third dimension
+            \param[in] s3   is the sequence of indices along the fourth dimension
+
+            \returns A reference to the array at the given index
 
-        /// \copydoc operator()(const index &, const index &, const index &, const index &)
+            \ingroup array_mem_operator_paren
+        */
+        array::array_proxy operator()(const index &s0,
+                                      const index &s1,
+                                      const index &s2 = span,
+                                      const index &s3 = span);
+
+        /**
+            \copydoc operator()(const index &, const index &, const index &, const index &)
+
+            \ingroup array_mem_operator_paren
+        */
         const array::array_proxy operator()(const index &s0,
                                             const index &s1,
                                             const index &s2 = span,
                                             const index &s3 = span) const;
-        /// @}
+
 
         /// \ingroup array_mem_row
         /// @{
@@ -781,35 +955,85 @@ namespace af
         const array::array_proxy slices(int first, int last) const; ///< \copydoc slices
         /// @}
 
-        /// \brief Converts the array into another type
+        /// \brief Casts the array into another data type
+        ///
+        /// \note Consecutive casting operations may be optimized out if
+        /// the original type of the af::array is the same as the final type.
+        /// For example, if the original type is f64, which is then cast to f32
+        /// and then back to f64, the cast to f32 is skipped and that operation
+        /// will *NOT* be performed by ArrayFire. The table below marks the casts
+        /// optimized out for outer -> inner -> outer (rows: outer, columns: inner).
+        /// | inner-> | f32 | f64 | c32 | c64 | s32 | u32 | s8 | u8 | b8 | s64 | u64 | s16 | u16 | f16 |
+        /// |---------|-----|-----|-----|-----|-----|-----|----|----|----|-----|-----|-----|-----|-----|
+        /// | f32     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+        /// | f64     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+        /// | c32     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+        /// | c64     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+        /// | s32     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   |     |     | x   |
+        /// | u32     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   |     |     | x   |
+        /// | s8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+        /// | u8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+        /// | b8      | x   | x   | x   | x   | x   | x   | x  | x  | x  | x   | x   | x   | x   | x   |
+        /// | s64     | x   | x   | x   | x   |     |     |    |    |    | x   | x   |     |     | x   |
+        /// | u64     | x   | x   | x   | x   |     |     |    |    |    | x   | x   |     |     | x   |
+        /// | s16     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   | x   | x   | x   |
+        /// | u16     | x   | x   | x   | x   | x   | x   |    |    |    | x   | x   | x   | x   | x   |
+        /// | f16     | x   | x   | x   | x   |     |     |    |    |    |     |     |     |     | x   |
+        /// If you want to avoid this behavior, use af_eval after the first cast
+        /// operation. This will ensure that the cast operation is performed on
+        /// the af::array.
         ///
-        ///  \param[in] type is the desired type(f32, s64, etc.)
+        /// \param[in] type is the desired type (f32, s64, etc.)
         /// \returns an array with the type specified by \p type
-        /// \ingroup method_mat
         const array as(dtype type) const;
 
 
         ~array();
 
-        // Transpose and Conjugate Tranpose
+        /// \brief Get the transpose of the array
+        ///
+        /// \returns Transposed matrix
         array T() const;
+        /// \brief Get the conjugate-transpose of the current array
+        ///
+        /// \returns conjugate-transpose matrix
         array H() const;
 
-#define ASSIGN(OP)                                                                      \
-        array& OP(const array &val);                                                    \
-        array& OP(const double &val);              /**< \copydoc OP (const array &) */  \
-        array& OP(const cdouble &val);             /**< \copydoc OP (const array &) */  \
-        array& OP(const cfloat &val);              /**< \copydoc OP (const array &) */  \
-        array& OP(const float &val);               /**< \copydoc OP (const array &) */  \
-        array& OP(const int &val);                 /**< \copydoc OP (const array &) */  \
-        array& OP(const unsigned &val);            /**< \copydoc OP (const array &) */  \
-        array& OP(const bool &val);                /**< \copydoc OP (const array &) */  \
-        array& OP(const char &val);                /**< \copydoc OP (const array &) */  \
-        array& OP(const unsigned char &val);       /**< \copydoc OP (const array &) */  \
-        array& OP(const long  &val);               /**< \copydoc OP (const array &) */  \
-        array& OP(const unsigned long &val);       /**< \copydoc OP (const array &) */  \
-        array& OP(const long long  &val);          /**< \copydoc OP (const array &) */  \
-        array& OP(const unsigned long long &val);  /**< \copydoc OP (const array &) */  \
+#define ASSIGN_(OP2)                                                                      \
+        array& OP2(const array &val);                                                     \
+        array& OP2(const double &val);              /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const cdouble &val);             /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const cfloat &val);              /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const float &val);               /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const int &val);                 /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const unsigned &val);            /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const bool &val);                /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const char &val);                /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const unsigned char &val);       /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const long  &val);               /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const unsigned long &val);       /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const long long  &val);          /**< \copydoc OP2##(const array &) */ \
+        array& OP2(const unsigned long long &val);
+
+#if AF_API_VERSION >= 32
+#define ASSIGN_32(OP)                                                                    \
+        array& OP(const short  &val);               /**< \copydoc OP##(const array &) */ \
+        array& OP(const unsigned short &val);
+#else
+#define ASSIGN_32(OP)
+#endif
+
+#if AF_API_VERSION >= 310
+#define ASSIGN_310(OP)                                                                   \
+        array& OP(const signed char &val);          /**< \copydoc OP##(const array &) */
+#else
+#define ASSIGN_310(OP)
+#endif
+
+#define ASSIGN(OP)          \
+        ASSIGN_(OP)         \
+        ASSIGN_32(OP)       \
+        ASSIGN_310(OP)
 
         /// \ingroup array_mem_operator_eq
         /// @{
@@ -874,6 +1098,9 @@ namespace af
 
 
 #undef ASSIGN
+#undef ASSIGN_
+#undef ASSIGN_32
+#undef ASSIGN_310
 
         ///
         /// \brief Negates the values of the array
@@ -889,42 +1116,101 @@ namespace af
         /// \returns an \ref array with negated values
         array operator !() const;
 
+#if AF_API_VERSION >= 38
         ///
-        /// \brief Get the count of non zero elements in the array
+        /// \brief Performs a bitwise not operation on the values of the array
+        /// \ingroup arith_func_bitnot
+        ///
+        /// \returns an \ref array with inverted values
+        array operator ~() const;
+#endif
+
+        ///
+        /// \brief Get the count of non-zero elements in the array
         ///
         /// For dense matrix, this is the same as count<int>(arr);
         int nonzeros() const;
+
+
+        ///
+        /// \brief Locks the device buffer in the memory manager.
+        ///
+        /// This method can be called to take control of the device pointer from the memory manager.
+        /// While a buffer is locked, the memory manager doesn't free the memory until unlock() is invoked.
+        void lock() const;
+
+
+#if AF_API_VERSION >= 34
+        ///
+        /// \brief Query if the array has been locked by the user.
+        ///
+        /// An array can be locked by the user by calling `arr.lock`, `arr.device`,
+        /// or the `getRawPtr` function.
+        bool isLocked() const;
+#endif
+
+
+        ///
+        /// \brief Unlocks the device buffer in the memory manager.
+        ///
+        /// This method can be called after calling \ref array::lock().
+        /// Calling this method gives control of the device pointer back to the memory manager.
+        void unlock() const;
     };
     // end of class array
 
-#define BIN_OP(OP)                                                                                                       \
+#define BIN_OP_(OP)                                                                                                      \
     AFAPI array OP (const array& lhs, const array& rhs);                                                                 \
-    AFAPI array OP (const bool& lhs, const array& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const int& lhs, const array& rhs);                  /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const unsigned& lhs, const array& rhs);             /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const char& lhs, const array& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const unsigned char& lhs, const array& rhs);        /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const long& lhs, const array& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const unsigned long& lhs, const array& rhs);        /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const long long& lhs, const array& rhs);            /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const unsigned long long& lhs, const array& rhs);   /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const double& lhs, const array& rhs);               /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const float& lhs, const array& rhs);                /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const cfloat& lhs, const array& rhs);               /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const cdouble& lhs, const array& rhs);              /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const bool& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const int& rhs);                  /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const unsigned& rhs);             /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const char& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const unsigned char& rhs);        /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const long& rhs);                 /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const unsigned long& rhs);        /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const long long& rhs);            /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const unsigned long long& rhs);   /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const double& rhs);               /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const float& rhs);                /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const cfloat& rhs);               /**< \copydoc OP (const array&, const array&) */ \
-    AFAPI array OP (const array& lhs, const cdouble& rhs);              /**< \copydoc OP (const array&, const array&) */ \
+    AFAPI array OP (const bool& lhs, const array& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const int& lhs, const array& rhs);                /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const unsigned& lhs, const array& rhs);           /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const char& lhs, const array& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const unsigned char& lhs, const array& rhs);      /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const long& lhs, const array& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const unsigned long& lhs, const array& rhs);      /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const long long& lhs, const array& rhs);          /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const unsigned long long& lhs, const array& rhs); /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const double& lhs, const array& rhs);             /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const float& lhs, const array& rhs);              /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const cfloat& lhs, const array& rhs);             /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const cdouble& lhs, const array& rhs);            /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const bool& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const int& rhs);                /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const unsigned& rhs);           /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const char& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const unsigned char& rhs);      /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const long& rhs);               /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const unsigned long& rhs);      /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const long long& rhs);          /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const unsigned long long& rhs); /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const double& rhs);             /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const float& rhs);              /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const cfloat& rhs);             /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const cdouble& rhs);
+
+#if AF_API_VERSION >= 32
+#define BIN_OP_32(OP)                                                                                                    \
+        AFAPI array OP (const short& lhs, const array& rhs);           /**< \copydoc OP##(const array&, const array&) */ \
+        AFAPI array OP (const unsigned short& lhs, const array& rhs);  /**< \copydoc OP##(const array&, const array&) */ \
+        AFAPI array OP (const array& lhs, const short& rhs);           /**< \copydoc OP##(const array&, const array&) */ \
+        AFAPI array OP (const array& lhs, const unsigned short& rhs);
+
+#else
+#define BIN_OP_32(OP)
+#endif
+
+#if AF_API_VERSION >= 310
+#define BIN_OP_310(OP)                                                                                                  \
+    AFAPI array OP (const signed char& lhs, const array& rhs);        /**< \copydoc OP##(const array&, const array&) */ \
+    AFAPI array OP (const array& lhs, const signed char& rhs);        /**< \copydoc OP##(const array&, const array&) */
+#else
+#define BIN_OP_310(OP)
+#endif
+
+#define BIN_OP(OP)          \
+        BIN_OP_(OP)         \
+        BIN_OP_32(OP)       \
+        BIN_OP_310(OP)
 
     /// \ingroup arith_func_add
     /// @{
@@ -977,83 +1263,69 @@ namespace af
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns an array with the equality operation performed on each element
+    /// \returns an array of type b8 with the equality operation performed on each element
     BIN_OP(operator==)
     /// @}
 
     /// \ingroup arith_func_neq
     /// @{
-    /// \brief Performs an equality operation on two arrays or an array and a value.
+    /// \brief Performs an inequality operation on two arrays or an array and a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with the != operation performed on each element
+    /// \returns    an array of type b8 with the != operation performed on each element
     ///             of \p lhs and \p rhs
     BIN_OP(operator!=)
     /// @}
 
     /// \ingroup arith_func_lt
     /// @{
-    /// \brief Performs an < operation on two arrays or an array and a value.
+    /// \brief Performs a less-than operation on two arrays or an array and a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with the < operation performed on each element
+    /// \returns    an array of type b8 with the < operation performed on each element
     ///             of \p lhs and \p rhs
     BIN_OP(operator< )
     /// @}
 
     /// \ingroup arith_func_le
     /// @{
-    /// \brief Performs an <= operation on two arrays or an array and a value.
+    /// \brief Performs a less-than-or-equal operation on two arrays or an array and a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with the <= operation performed on each element
-    ///             of \p lhs and \p rhs
+    /// \returns    an array of type b8 with the <= operation performed on each element of \p lhs and \p rhs
     BIN_OP(operator<=)
     /// @}
 
     /// \ingroup arith_func_gt
     /// @{
-    /// \brief Performs an > operation on two arrays or an array and a value.
+    /// \brief Performs a greater-than operation on two arrays or an array and a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with the > operation performed on each element
+    /// \returns    an array of type b8 with the > operation performed on each element
     ///             of \p lhs and \p rhs
     BIN_OP(operator> )
     /// @}
 
     /// \ingroup arith_func_ge
     /// @{
-    /// \brief Performs an >= operation on two arrays or an array and a value.
+    /// \brief Performs a greater-than-or-equal operation on two arrays or an array and a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with the >= operation performed on each element
+    /// \returns    an array of type b8 with the >= operation performed on each element
     ///             of \p lhs and \p rhs
     BIN_OP(operator>=)
     /// @}
 
-    /// \ingroup arith_func_and
-    /// @{
-    /// \brief  Performs a logical AND operation on two arrays or an array and a
-    ///         value.
-    ///
-    /// \param[in] lhs the left hand side value of the operand
-    /// \param[in] rhs the right hand side value of the operand
-    ///
-    /// \returns    an array with a logical AND operation performed on each
-    ///             element of \p lhs and \p rhs
-    BIN_OP(operator&&)
-    /// @}
-
     /// \ingroup arith_func_or
     /// @{
     /// \brief  Performs an logical OR operation on two arrays or an array and a
@@ -1062,12 +1334,12 @@ namespace af
     /// \param[in] lhs the left hand side value of the operand
     /// \param[in] rhs the right hand side value of the operand
     ///
-    /// \returns    an array with a logical OR operation performed on each
+    /// \returns    an array of type b8 with a logical OR operation performed on each
     ///             element of \p lhs and \p rhs
     BIN_OP(operator||)
     /// @}
 
-    /// \ingroup numeric_func_rem
+    /// \ingroup arith_func_mod
     /// @{
     /// \brief Performs an modulo operation on two arrays or an array and a value.
     ///
@@ -1079,22 +1351,9 @@ namespace af
     BIN_OP(operator% )
     /// @}
 
-    /// \ingroup arith_func_bitand
-    /// @{
-    /// \brief  Performs an bitwise AND operation on two arrays or an array and
-    ///         a value.
-    ///
-    /// \param[in] lhs the left hand side value of the operand
-    /// \param[in] rhs the right hand side value of the operand
-    ///
-    /// \returns    an array with a bitwise AND operation performed on each
-    ///             element of \p lhs and \p rhs
-    BIN_OP(operator& )
-    /// @}
-
     /// \ingroup arith_func_bitor
     /// @{
-    /// \brief  Performs an bitwise AND operation on two arrays or an array and
+    /// \brief  Performs a bitwise OR operation on two arrays or an array and
     ///         a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
@@ -1107,7 +1366,7 @@ namespace af
 
     /// \ingroup arith_func_bitxor
     /// @{
-    /// \brief  Performs an bitwise AND operation on two arrays or an array and
+    /// \brief  Performs a bitwise XOR operation on two arrays or an array and
     ///         a value.
     ///
     /// \param[in] lhs the left hand side value of the operand
@@ -1145,18 +1404,227 @@ namespace af
     /// @}
 
 #undef BIN_OP
+#undef BIN_OP_
+#undef BIN_OP_32
+#undef BIN_OP_310
+
+    /// \ingroup arith_func_bitand
+    /// @{
+    /// \brief  Performs a bitwise AND operation on two arrays or an array and
+    ///         a value.
+    ///
+    /// \param[in] lhs the left hand side value of the operand
+    /// \param[in] rhs the right hand side value of the operand
+    ///
+    /// \returns    an array with a bitwise AND operation performed on each
+    ///             element of \p lhs and \p rhs
+    AFAPI array operator&(const array& lhs, const array& rhs);
+    AFAPI array operator&(const array& lhs, const bool& rhs);
+    AFAPI array operator&(const array& lhs, const cdouble& rhs);
+    AFAPI array operator&(const array& lhs, const cfloat& rhs);
+    AFAPI array operator&(const array& lhs, const char& rhs);
+    AFAPI array operator&(const array& lhs, const double& rhs);
+    AFAPI array operator&(const array& lhs, const float& rhs);
+    AFAPI array operator&(const array& lhs, const int& rhs);
+    AFAPI array operator&(const array& lhs, const long long& rhs);
+    AFAPI array operator&(const array& lhs, const long& rhs);
+    AFAPI array operator&(const array& lhs, const short& rhs);
+    AFAPI array operator&(const array& lhs, const signed char& rhs);
+    AFAPI array operator&(const array& lhs, const unsigned char& rhs);
+    AFAPI array operator&(const array& lhs, const unsigned long long& rhs);
+    AFAPI array operator&(const array& lhs, const unsigned long& rhs);
+    AFAPI array operator&(const array& lhs, const unsigned short& rhs);
+    AFAPI array operator&(const array& lhs, const unsigned& rhs);
+    AFAPI array operator&(const bool& lhs, const array& rhs);
+    AFAPI array operator&(const cdouble& lhs, const array& rhs);
+    AFAPI array operator&(const cfloat& lhs, const array& rhs);
+    AFAPI array operator&(const char& lhs, const array& rhs);
+    AFAPI array operator&(const double& lhs, const array& rhs);
+    AFAPI array operator&(const float& lhs, const array& rhs);
+    AFAPI array operator&(const int& lhs, const array& rhs);
+    AFAPI array operator&(const long long& lhs, const array& rhs);
+    AFAPI array operator&(const long& lhs, const array& rhs);
+    AFAPI array operator&(const short& lhs, const array& rhs);
+    AFAPI array operator&(const signed char& lhs, const array& rhs);
+    AFAPI array operator&(const unsigned char& lhs, const array& rhs);
+    AFAPI array operator&(const unsigned long long& lhs, const array& rhs);
+    AFAPI array operator&(const unsigned long& lhs, const array& rhs);
+    AFAPI array operator&(const unsigned short& lhs, const array& rhs);
+    AFAPI array operator&(const unsigned& lhs, const array& rhs);
+    /// @}
+
+    /// \ingroup arith_func_and
+    /// @{
+    /// \brief  Performs a logical AND operation on two arrays or an array and a
+    ///         value.
+    ///
+    /// \param[in] lhs the left hand side value of the operand
+    /// \param[in] rhs the right hand side value of the operand
+    ///
+    /// \returns    an array of type b8 with a logical AND operation performed on each
+    ///             element of \p lhs and \p rhs
+    AFAPI array operator&&(const array& lhs, const array& rhs);
+    AFAPI array operator&&(const array& lhs, const bool& rhs);
+    AFAPI array operator&&(const array& lhs, const cdouble& rhs);
+    AFAPI array operator&&(const array& lhs, const cfloat& rhs);
+    AFAPI array operator&&(const array& lhs, const char& rhs);
+    AFAPI array operator&&(const array& lhs, const double& rhs);
+    AFAPI array operator&&(const array& lhs, const float& rhs);
+    AFAPI array operator&&(const array& lhs, const int& rhs);
+    AFAPI array operator&&(const array& lhs, const long long& rhs);
+    AFAPI array operator&&(const array& lhs, const long& rhs);
+    AFAPI array operator&&(const array& lhs, const short& rhs);
+    AFAPI array operator&&(const array& lhs, const signed char& rhs);
+    AFAPI array operator&&(const array& lhs, const unsigned char& rhs);
+    AFAPI array operator&&(const array& lhs, const unsigned long long& rhs);
+    AFAPI array operator&&(const array& lhs, const unsigned long& rhs);
+    AFAPI array operator&&(const array& lhs, const unsigned short& rhs);
+    AFAPI array operator&&(const array& lhs, const unsigned& rhs);
+    AFAPI array operator&&(const bool& lhs, const array& rhs);
+    AFAPI array operator&&(const cdouble& lhs, const array& rhs);
+    AFAPI array operator&&(const cfloat& lhs, const array& rhs);
+    AFAPI array operator&&(const char& lhs, const array& rhs);
+    AFAPI array operator&&(const double& lhs, const array& rhs);
+    AFAPI array operator&&(const float& lhs, const array& rhs);
+    AFAPI array operator&&(const int& lhs, const array& rhs);
+    AFAPI array operator&&(const long long& lhs, const array& rhs);
+    AFAPI array operator&&(const long& lhs, const array& rhs);
+    AFAPI array operator&&(const short& lhs, const array& rhs);
+    AFAPI array operator&&(const signed char& lhs, const array& rhs);
+    AFAPI array operator&&(const unsigned char& lhs, const array& rhs);
+    AFAPI array operator&&(const unsigned long long& lhs, const array& rhs);
+    AFAPI array operator&&(const unsigned long& lhs, const array& rhs);
+    AFAPI array operator&&(const unsigned short& lhs, const array& rhs);
+    AFAPI array operator&&(const unsigned& lhs, const array& rhs);
+    /// @}
+
 
     /// Evaluate an expression (nonblocking).
     /**
-       \ingroup method_mat
+       \ingroup data_mat
        @{
     */
     inline array &eval(array &a) { a.eval(); return a; }
-    inline void eval(array &a, array &b) { eval(a); b.eval(); }
-    inline void eval(array &a, array &b, array &c) { eval(a, b); c.eval(); }
-    inline void eval(array &a, array &b, array &c, array &d) { eval(a, b, c); d.eval(); }
-    inline void eval(array &a, array &b, array &c, array &d, array &e) { eval(a, b, c, d); e.eval(); }
-    inline void eval(array &a, array &b, array &c, array &d, array &e, array &f) { eval(a, b, c, d, e); f.eval(); }
+
+#if AF_API_VERSION >= 34
+    ///
+    /// Evaluate multiple arrays simultaneously
+    ///
+    AFAPI void eval(int num, array **arrays);
+#endif
+
+    inline void eval(array &a, array &b)
+    {
+#if AF_API_VERSION >= 34
+        array *arrays[] = {&a, &b};
+        return eval(2, arrays);
+#else
+        eval(a); b.eval();
+#endif
+    }
+
+    inline void eval(array &a, array &b, array &c)
+    {
+#if AF_API_VERSION >= 34
+        array *arrays[] = {&a, &b, &c};
+        return eval(3, arrays);
+#else
+        eval(a, b); c.eval();
+#endif
+    }
+
+    inline void eval(array &a, array &b, array &c, array &d)
+    {
+#if AF_API_VERSION >= 34
+        array *arrays[] = {&a, &b, &c, &d};
+        return eval(4, arrays);
+#else
+        eval(a, b, c); d.eval();
+#endif
+
+    }
+
+    inline void eval(array &a, array &b, array &c, array &d, array &e)
+    {
+#if AF_API_VERSION >= 34
+        array *arrays[] = {&a, &b, &c, &d, &e};
+        return eval(5, arrays);
+#else
+        eval(a, b, c, d); e.eval();
+#endif
+    }
+
+    inline void eval(array &a, array &b, array &c, array &d, array &e, array &f)
+    {
+#if AF_API_VERSION >= 34
+        array *arrays[] = {&a, &b, &c, &d, &e, &f};
+        return eval(6, arrays);
+#else
+        eval(a, b, c, d, e); f.eval();
+#endif
+    }
+
+#if AF_API_VERSION >= 37
+
+    /// Evaluate an expression (nonblocking).
+    inline const array &eval(const array &a) { a.eval(); return a; }
+
+#if AF_COMPILER_CXX_VARIADIC_TEMPLATES
+    template <typename... ARRAYS>
+    inline void eval(ARRAYS... in) {
+        array *arrays[] = {const_cast<array *>(&in)...};
+        eval((int)sizeof...(in), arrays);
+    }
+
+#else
+
+    inline void eval(const array &a, const array &b)
+    {
+        const array *arrays[] = {&a, &b};
+        return eval(2, const_cast<array **>(arrays));
+    }
+
+    inline void eval(const array &a, const array &b, const array &c)
+    {
+        const array *arrays[] = {&a, &b, &c};
+        return eval(3, const_cast<array **>(arrays));
+    }
+
+    inline void eval(const array &a, const array &b, const array &c,
+                     const array &d)
+    {
+        const array *arrays[] = {&a, &b, &c, &d};
+        return eval(4, const_cast<array **>(arrays));
+    }
+
+    inline void eval(const array &a, const array &b, const array &c,
+                     const array &d, const array &e)
+    {
+        const array *arrays[] = {&a, &b, &c, &d, &e};
+        return eval(5, const_cast<array **>(arrays));
+    }
+
+    inline void eval(const array &a, const array &b, const array &c,
+                     const array &d, const array &e, const array &f)
+    {
+        const array *arrays[] = {&a, &b, &c, &d, &e, &f};
+        return eval(6, const_cast<array **>(arrays));
+    }
+#endif // AF_COMPILER_CXX_VARIADIC_TEMPLATES
+#endif
+
+#if AF_API_VERSION >= 34
+    ///
+    /// Turn the manual eval flag on or off
+    ///
+    AFAPI void setManualEvalFlag(bool flag);
+#endif
+
+#if AF_API_VERSION >= 34
+    /// Get the manual eval flag
+    AFAPI bool getManualEvalFlag();
+#endif
+
     /**
        @}
     */
@@ -1167,17 +1635,18 @@ namespace af
 #ifdef __cplusplus
 extern "C" {
 #endif
+
     /**
-       \ingroup construct_mat
+       \ingroup c_api_mat
        @{
     */
 
     /**
        Create an \ref af_array handle initialized with user defined data
 
-       This function will create an \ref af_array handle from the memory provided in \p data
+       This function will create an \ref af_array handle from the memory provided in \p data.
 
-       \param[out]  arr The pointer to the retured object.
+       \param[out]  arr The pointer to the returned object.
        \param[in]   data The data which will be loaded into the array
        \param[in]   ndims The number of dimensions read from the \p dims parameter
        \param[in]   dims A C pointer with \p ndims elements. Each value represents the size of that dimension
@@ -1190,6 +1659,9 @@ extern "C" {
     /**
        Create af_array handle
 
+       To release the memory allocated by this call, call
+       \ref af_release_array once you are done using this \ref af_array.
+
        \param[out]  arr The pointer to the retured object.
        \param[in]   ndims The number of dimensions read from the \p dims parameter
        \param[in]   dims A C pointer with \p ndims elements. Each value represents the size of that dimension
@@ -1200,19 +1672,12 @@ extern "C" {
     AFAPI af_err af_create_handle(af_array *arr, const unsigned ndims, const dim_t * const dims, const af_dtype type);
 
     /**
-    @}
-    */
-
-    /**
-       \ingroup method_mat
-       @{
-
        Deep copy an array to another
     */
     AFAPI af_err af_copy_array(af_array *arr, const af_array in);
 
     /**
-       Copy data from an C pointer (host/device) to an existing array.
+       Copy data from a C pointer (host/device) to an existing array.
     */
     AFAPI af_err af_write_array(af_array arr, const void *data, const size_t bytes, af_source src);
 
@@ -1225,21 +1690,267 @@ extern "C" {
 
     /**
        \brief Reduce the reference count of the \ref af_array
+
+       \note Zero initialized af_arrays can be accepted after version 3.7
     */
     AFAPI af_err af_release_array(af_array arr);
 
     /**
-       Increments an \ref af_array reference count
+       Increments an \ref af_array reference count.
     */
     AFAPI af_err af_retain_array(af_array *out, const af_array in);
 
+#if AF_API_VERSION >= 31
+    /**
+       Get the reference count of \ref af_array
+    */
+    AFAPI af_err af_get_data_ref_count(int *use_count, const af_array in);
+#endif
+
     /**
        Evaluate any expressions in the Array
     */
     AFAPI af_err af_eval(af_array in);
 
+#if AF_API_VERSION >= 34
+    /**
+       Evaluate multiple arrays together
+    */
+    AFAPI af_err af_eval_multiple(const int num, af_array *arrays);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       Turn the manual eval flag on or off
+    */
+    AFAPI af_err af_set_manual_eval_flag(bool flag);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       Get the manual eval flag
+    */
+    AFAPI af_err af_get_manual_eval_flag(bool *flag);
+#endif
+
+    /**
+        \brief Get the total number of elements across all dimensions of the array
+
+        \param[out] elems is the output that contains number of elements of \p arr
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_get_elements(dim_t *elems, const af_array arr);
+
+    /**
+        \brief Gets the type of an array.
+
+        \param[out] type is the output that contains the type of \p arr
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_get_type(af_dtype *type, const af_array arr);
+
+    /**
+        \brief Gets the dimensions of an array.
+
+        \param[out] d0 is the output that contains the size of first dimension of \p arr
+        \param[out] d1 is the output that contains the size of second dimension of \p arr
+        \param[out] d2 is the output that contains the size of third dimension of \p arr
+        \param[out] d3 is the output that contains the size of fourth dimension of \p arr
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_get_dims(dim_t *d0, dim_t *d1, dim_t *d2, dim_t *d3,
+                             const af_array arr);
+
+    /**
+        \brief Gets the number of dimensions of an array.
+
+        \param[out] result is the output that contains the number of dims of \p arr
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_get_numdims(unsigned *result, const af_array arr);
+
+    /**
+        \brief Check if an array is empty.
+
+        \param[out] result is true if the number of elements in \p arr is 0, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_empty        (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is scalar, i.e. has a single element.
+
+        \param[out] result is true if the number of elements in \p arr is 1, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_scalar       (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is a row vector.
+
+        \param[out] result is true if arr has dims [1 x 1 1], false otherwise
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_row          (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is a column vector
+
+        \param[out] result is true if arr has dims [x 1 1 1], false otherwise
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_column       (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is a vector
+
+        A vector is any array that has exactly 1 dimension not equal to 1.
+
+        \param[out] result is true if arr is a vector, false otherwise
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_vector       (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is complex type
+
+        \param[out] result is true if arr is of type \ref c32 or \ref c64, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_complex      (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is real type
+
+        This is mutually exclusive with \ref af_is_complex
+
+        \param[out] result is true if arr is NOT \ref c32 or \ref c64, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_real         (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is double precision type
+
+        \param[out] result is true if arr is of type \ref f64 or \ref c64, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_double       (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is single precision type
+
+        \param[out] result is true if arr is of type \ref f32 or \ref c32, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_single       (bool *result, const af_array arr);
+
+#if AF_API_VERSION >= 37
+    /**
+        \brief Check if an array is 16 bit floating point type
+
+        \param[out] result is true if arr is of type \ref f16, otherwise false
+        \param[in] arr     is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_half(bool *result, const af_array arr);
+#endif
+
+    /**
+        \brief Check if an array is real floating point type
+
+        \param[out] result is true if arr is of type \ref f32 or \ref f64, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_realfloating (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is floating point type
+
+        This is a combination of \ref af_is_realfloating and \ref af_is_complex
+
+        \param[out] result is true if arr is of type \ref f16, \ref f32, \ref
+                           f64, \ref c32 or \ref c64, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_floating     (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is integer type
+
+        \param[out] result is true if arr is of integer types, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_integer      (bool *result, const af_array arr);
+
+    /**
+        \brief Check if an array is bool type
+
+        \param[out] result is true if arr is of \ref b8 type, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_bool         (bool *result, const af_array arr);
+
+#if AF_API_VERSION >= 34
+    /**
+        \brief Check if an array is sparse
+
+        \param[out] result is true if arr is sparse, otherwise false
+        \param[in] arr is the input array
+
+        \returns error codes
+    */
+    AFAPI af_err af_is_sparse       (bool *result, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 35
+    /**
+        \brief Get first element from an array
+
+        \param[out] output_value is the element requested
+        \param[in] arr is the input array
+        \return \ref AF_SUCCESS if the execution completes properly
+    */
+    AFAPI af_err af_get_scalar(void* output_value, const af_array arr);
+#endif
+
     /**
-      @}
+        @}
     */
 
 #ifdef __cplusplus
diff --git a/include/af/backend.h b/include/af/backend.h
new file mode 100644
index 0000000000..ddf38d17aa
--- /dev/null
+++ b/include/af/backend.h
@@ -0,0 +1,172 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   \param[in] bknd takes one of the values of enum \ref af_backend
+   \returns \ref af_err error code
+
+   \ingroup unified_func_setbackend
+ */
+AFAPI af_err af_set_backend(const af_backend bknd);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   \param[out] num_backends Number of available backends
+   \returns \ref af_err error code
+
+   \ingroup unified_func_getbackendcount
+ */
+AFAPI af_err af_get_backend_count(unsigned* num_backends);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   Returns a flag of all available backends
+
+   \code{.cpp}
+   int backends = 0;
+   af_get_available_backends(&backends);
+
+   if(backends & AF_BACKEND_CUDA) {
+       // The CUDA backend is available
+   }
+   \endcode
+
+   \param[out] backends A flag of all available backends. Use the & (bitwise
+   AND) operator to check if a particular backend is available
+
+   \returns \ref af_err error code
+
+   \ingroup unified_func_getavailbackends
+ */
+AFAPI af_err af_get_available_backends(int* backends);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   \param[out] backend takes one of the values of enum \ref af_backend
+   \param[in] in is the array whose backend is to be queried
+   \returns \ref af_err error code
+
+   \ingroup unified_func_getbackendid
+ */
+AFAPI af_err af_get_backend_id(af_backend *backend, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   \param[out] backend takes one of the values of enum \ref af_backend
+   from the backend that is currently set to active
+   \returns \ref af_err error code
+
+   \ingroup unified_func_getactivebackend
+ */
+AFAPI af_err af_get_active_backend(af_backend *backend);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   \param[out] device contains the device on which \p in was created.
+   \param[in] in is the array whose device is to be queried.
+   \returns \ref af_err error code
+
+   \ingroup unified_func_getdeviceid
+ */
+AFAPI af_err af_get_device_id(int *device, const af_array in);
+#endif
+
+
+#ifdef __cplusplus
+}
+#endif
+
+#ifdef __cplusplus
+namespace af
+{
+class array;
+
+#if AF_API_VERSION >= 32
+/**
+   \param[in] bknd takes one of the values of enum \ref af_backend
+
+   \ingroup unified_func_setbackend
+ */
+AFAPI void setBackend(const Backend bknd);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   \returns Number of available backends
+
+   \ingroup unified_func_getbackendcount
+ */
+AFAPI unsigned getBackendCount();
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   Returns a flag of all available backends
+
+   \code{.cpp}
+   int backends = getAvailableBackends();
+
+   if(backends & AF_BACKEND_CUDA) {
+       // The CUDA backend is available
+   }
+   \endcode
+
+   \returns A flag of available backends
+
+   \ingroup unified_func_getavailbackends
+ */
+AFAPI int getAvailableBackends();
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   \param[in] in is the array whose backend is to be queried
+   \returns \ref af_backend which is the backend on which the array is created
+
+   \ingroup unified_func_getbackendid
+ */
+AFAPI af::Backend getBackendId(const array &in);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   \returns \ref af_backend which is the backend that is currently active
+
+   \ingroup unified_func_getactivebackend
+ */
+AFAPI af::Backend getActiveBackend();
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   \param[in] in is the array whose device is to be queried.
+   \returns The id of the device on which this array was created.
+
+   \note Device ID can be the same for arrays belonging to different backends.
+
+   \ingroup unified_func_getdeviceid
+ */
+AFAPI int getDeviceId(const array &in);
+#endif
+
+}
+#endif
diff --git a/include/af/blas.h b/include/af/blas.h
index e47f2e2413..05434ee861 100644
--- a/include/af/blas.h
+++ b/include/af/blas.h
@@ -1,4 +1,4 @@
-/*******************************************************
+/********************************************************
  * Copyright (c) 2014, ArrayFire
  * All rights reserved.
  *
@@ -7,15 +7,7 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-/** \file blas.h
- *
- * Contains BLAS related functions
- *
- * Contains functions for basic BLAS functionallity
- */
-
 #pragma once
-
 #include <af/defines.h>
 
 #ifdef __cplusplus
@@ -23,84 +15,95 @@ namespace af
 {
     class array;
     /**
-        \brief Matrix multiply on two arrays
+       C++ Interface to multiply two matrices.
+
+       \copydetails blas_func_matmul
 
-        \copydetails blas_func_matmul
+       `optLhs` and `optRhs` can only be one of \ref AF_MAT_NONE,
+       \ref AF_MAT_TRANS, \ref AF_MAT_CTRANS.
 
-        \param[in] lhs The array object on the left hand side
-        \param[in] rhs The array object on the right hand side
-        \param[in] optLhs Transpose operation before the function is performed
-        \param[in] optRhs Transpose operation before the function is performed
-        \return The result of the matrix multiplication of lhs, rhs
+       This function is not supported in GFOR.
 
-        \note optLhs and optRhs can only be one of \ref AF_MAT_NONE, \ref
-                AF_MAT_TRANS, \ref AF_MAT_CTRANS \note This function is not supported
-                in GFOR
+       \note <b>The following applies for Sparse-Dense matrix multiplication.</b>
+       \note This function can be used with one sparse input. The sparse input
+             must always be the \p lhs and the dense matrix must be \p rhs.
+       \note The sparse array can only be of \ref AF_STORAGE_CSR format.
+       \note The returned array is always dense.
+       \note \p optLhs can only be one of \ref AF_MAT_NONE, \ref AF_MAT_TRANS,
+             \ref AF_MAT_CTRANS.
+       \note \p optRhs can only be \ref AF_MAT_NONE.
 
-        \ingroup blas_func_matmul
+       \param[in] lhs    input array on the left-hand side
+       \param[in] rhs    input array on the right-hand side
+       \param[in] optLhs transpose the left-hand side prior to multiplication
+       \param[in] optRhs transpose the right-hand side prior to multiplication
+       \return    `lhs` * `rhs`
 
-     */
+       \ingroup blas_func_matmul
+    */
     AFAPI array matmul(const array &lhs, const array &rhs,
                        const matProp optLhs = AF_MAT_NONE,
                        const matProp optRhs = AF_MAT_NONE);
 
     /**
-       \brief Matrix multiply on two arrays
+       C++ Interface to multiply two matrices.
+       The second matrix will be transposed.
 
        \copydetails blas_func_matmul
 
-       \param[in] lhs The array object on the left hand side
-       \param[in] rhs The array object on the right hand side
-       \return The result of the matrix multiplication of \p lhs, transpose(\p rhs)
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[in] lhs input array on the left-hand side
+       \param[in] rhs input array on the right-hand side
+       \return    `lhs` * transpose(`rhs`)
 
        \ingroup blas_func_matmul
     */
     AFAPI array matmulNT(const array &lhs, const array &rhs);
 
     /**
-       \brief Matrix multiply on two arrays
+       C++ Interface to multiply two matrices.
+       The first matrix will be transposed.
 
        \copydetails blas_func_matmul
 
-       \param[in] lhs The array object on the left hand side
-       \param[in] rhs The array object on the right hand side
-       \return The result of the matrix multiplication of transpose(\p lhs), \p rhs
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[in] lhs input array on the left-hand side
+       \param[in] rhs input array on the right-hand side
+       \return    transpose(`lhs`) * `rhs`
 
        \ingroup blas_func_matmul
     */
     AFAPI array matmulTN(const array &lhs, const array &rhs);
 
     /**
-       \brief Matrix multiply on two arrays
+       C++ Interface to multiply two matrices.
+       Both matrices will be transposed.
 
        \copydetails blas_func_matmul
 
-       \param[in] lhs The array object on the left hand side
-       \param[in] rhs The array object on the right hand side
-       \return The result of the matrix multiplication of transpose(\p lhs), transpose(\p rhs)
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[in] lhs input array on the left-hand side
+       \param[in] rhs input array on the right-hand side
+       \return    transpose(`lhs`) * transpose(`rhs`)
 
        \ingroup blas_func_matmul
     */
     AFAPI array matmulTT(const array &lhs, const array &rhs);
 
     /**
-       \brief Chain 2 matrix multiplications
+       C++ Interface to chain multiply three matrices.
 
-       The matrix multiplications are done in a way to reduce temporary memory
+       The matrix multiplications are done in a way to reduce temporary memory.
+
+       This function is not supported in GFOR.
 
        \param[in] a The first array
        \param[in] b The second array
        \param[in] c The third array
-
-       \returns out = a x b x c
-
-       \note This function is not supported in GFOR
+       \return    a x b x c
 
        \ingroup blas_func_matmul
     */
@@ -108,67 +111,86 @@ namespace af
 
 
     /**
-       \brief Chain 3 matrix multiplications
+       C++ Interface to chain multiply four matrices.
 
-       The matrix multiplications are done in a way to reduce temporary memory
+       The matrix multiplications are done in a way to reduce temporary memory.
+
+       This function is not supported in GFOR.
 
        \param[in] a The first array
        \param[in] b The second array
        \param[in] c The third array
        \param[in] d The fourth array
-
-       \returns out = a x b x c x d
-
-       \note This function is not supported in GFOR
+       \returns   a x b x c x d
 
        \ingroup blas_func_matmul
     */
     AFAPI array matmul(const array &a, const array &b, const array &c, const array &d);
 
-
+#if AF_API_VERSION >= 35
     /**
-        \brief Dot Product
+        C++ Interface to compute the dot product.
 
-        Scalar dot product between two vectors.  Also referred to as the inner
+        Scalar dot product between two vectors, also referred to as the inner
         product.
 
-        \democode{
-        // compute scalar dot product
-        array x = randu(100), y = randu(100);
-        af_print(dot(x,y));
-        }
+        \code
+          // compute scalar dot product
+          array x = randu(100), y = randu(100);
+
+          af_print(dot(x, y));
+          // OR
+          printf("%f\n", dot<float>(x, y));
+        \endcode
+
+       Parameters `optLhs` and `optRhs` can only be one of \ref AF_MAT_NONE or
+       \ref AF_MAT_CONJ. The conjugate dot product can be computed by setting
+       `optLhs = AF_MAT_CONJ` and `optRhs = AF_MAT_NONE`.
+
+       This function is not supported in GFOR.
 
-        \note This function is not supported in GFOR
+        \tparam    T      type of the output
+        \param[in] lhs    input array on the left-hand side
+        \param[in] rhs    input array on the right-hand side
+        \param[in] optLhs `lhs` options, only \ref AF_MAT_NONE and \ref
+                          AF_MAT_CONJ are supported
+        \param[in] optRhs `rhs` options, only \ref AF_MAT_NONE and \ref
+                          AF_MAT_CONJ are supported
+        \return    dot product of `lhs` and `rhs`
 
         \ingroup blas_func_dot
     */
-    AFAPI array dot   (const array &lhs, const array &rhs,
-                       const matProp optLhs = AF_MAT_NONE,
-                       const matProp optRhs = AF_MAT_NONE);
+    template <typename T>
+    T dot(const array &lhs, const array &rhs,
+          const matProp optLhs = AF_MAT_NONE,
+          const matProp optRhs = AF_MAT_NONE);
+#endif
+
+    /// \ingroup blas_func_dot
+    AFAPI array dot(const array &lhs, const array &rhs,
+                    const matProp optLhs = AF_MAT_NONE,
+                    const matProp optRhs = AF_MAT_NONE);
 
     /**
-        \brief Transposes a matrix
+        C++ Interface to transpose a matrix.
 
-        \copydetails blas_func_transpose
+        \param[in] in        input array
+        \param[in] conjugate if true, conjugate transposition is performed
+        \return    transpose
 
-        \param[in] in Input Matrix
-        \param[in] conjugate If true a congugate transposition is performed
-        \return Transposed matrix
         \ingroup blas_func_transpose
     */
-    AFAPI array transpose(const array& in, const bool conjugate = false);
+    AFAPI array transpose(const array &in, const bool conjugate = false);
 
     /**
-        \brief Transposes a matrix
-
-        \copydetails blas_func_transpose
+        C++ Interface to transpose a matrix in-place.
 
-        \param[in,out] in is the matrix to be transposed in place
-        \param[in] conjugate If true a congugate transposition is performed
+        \param[in,out] in        input array to be transposed in-place
+        \param[in]     conjugate if true, conjugate transposition is performed
 
         \ingroup blas_func_transpose
     */
-    AFAPI void transposeInPlace(array& in, const bool conjugate = false);
+    AFAPI void transposeInPlace(array &in, const bool conjugate = false);
 }
 #endif
 
@@ -176,61 +198,176 @@ namespace af
 extern "C" {
 #endif
 
+#if AF_API_VERSION >= 37
     /**
-        \brief Matrix multiply on two \ref af_array
+        C Interface to multiply two matrices.
+
+        This provides an interface to the BLAS level 3 general matrix multiply
+        (GEMM) of two \ref af_array objects, which is generally defined as:
+
+        \f[
+        C = \alpha * opA(A)opB(B) + \beta * C
+        \f]
+
+        where \f$\alpha\f$ (\p alpha) and \f$\beta\f$ (\p beta) are both scalars;
+        \f$A\f$ and \f$B\f$ are the matrix multiply operands; and \f$opA\f$ and
+        \f$opB\f$ are noop (if \p AF_MAT_NONE) or transpose (if \p AF_MAT_TRANS)
+        operations on \f$A\f$ or \f$B\f$ before the actual GEMM operation. Batched
+        GEMM is supported if at least one of \f$A\f$ or \f$B\f$ has more than
+        two dimensions (see \ref af::matmul for more details on broadcasting).
+        However, only one \p alpha and one \p beta can be used for all of the
+        batched matrix operands.
+
+        The \ref af_array that \p out points to can be used both as an input and
+        output. An allocation will be performed if you pass a null \ref af_array
+        handle (i.e. `af_array c = 0;`). If a valid \ref af_array is passed as
+        \f$C\f$, the operation will be performed on that \ref af_array itself. The C
+        \ref af_array must be the correct type and shape; otherwise, an error will
+        be thrown.
+
+        \note Passing an af_array that has not been initialized as the C array
+        will cause undefined behavior.
+
+        This example demonstrates the usage of the af_gemm function on two
+        matrices. The \f$C\f$ \ref af_array handle is initialized to zero here,
+        so \ref af_gemm will perform an allocation.
+
+        \snippet test/blas.cpp ex_af_gemm_alloc
+
+        The following example shows how you can write to a previously allocated \ref
+        af_array using the \ref af_gemm call. Here we are going to use the \ref
+        af_array s from the previous example and index into the first slice. Only
+        the first slice of the original \f$C\f$ af_array will be modified by this
+        operation.
+
+        \snippet test/blas.cpp ex_af_gemm_overwrite
+
+        \note <b>s8 Support</b>
+        \note Starting with ArrayFire v3.10.0, the CUDA backend supports
+        \p A, \p B input arrays of type \ref s8.
+        \note Scalars \p alpha, \p beta must be of type \ref f32.
+        \note Output array \p C will be of type \ref f32.
+        \note <br><b>Requires</b>
+        \note CUDA version >= 10 on devices with compute capability >= 5.0
+
+        \param[in,out] C     `A` * `B` = `C`
+        \param[in]     opA   operation to perform on A before the multiplication
+        \param[in]     opB   operation to perform on B before the multiplication
+        \param[in]     alpha alpha value; must be the same type as `A` and `B`
+        \param[in]     A     input array on the left-hand side
+        \param[in]     B     input array on the right-hand side
+        \param[in]     beta  beta value; must be the same type as `A` and `B`
+        \return        \ref AF_SUCCESS, if function returns successfully, else
+                       an \ref af_err code is given
 
-        \details Performs a matrix multiplication on two arrays (lhs, rhs).
+        \ingroup blas_func_matmul
+    */
+    AFAPI af_err af_gemm(af_array *C, const af_mat_prop opA, const af_mat_prop opB,
+                         const void *alpha, const af_array A, const af_array B,
+                         const void *beta);
+#endif
 
-        \param[out] out Pointer to the output \ref af_array
-        \param[in] lhs A 2D matrix \ref af_array object
-        \param[in] rhs A 2D matrix \ref af_array object
-        \param[in] optLhs Transpose operation before the function is performed
-        \param[in] optRhs Transpose operation before the function is performed
+    /**
+        C Interface to multiply two matrices.
+
+        Performs matrix multiplication on two arrays.
+
+        \note <b> The following applies for Sparse-Dense matrix multiplication.</b>
+        \note This function can be used with one sparse input. The sparse input
+              must always be the \p lhs and the dense matrix must be \p rhs.
+        \note The sparse array can only be of \ref AF_STORAGE_CSR format.
+        \note The returned array is always dense.
+        \note \p optLhs can only be one of \ref AF_MAT_NONE, \ref AF_MAT_TRANS,
+              \ref AF_MAT_CTRANS.
+        \note \p optRhs can only be \ref AF_MAT_NONE.
+
+        \param[out] out    `lhs` * `rhs` = `out`
+        \param[in]  lhs    input array on the left-hand side
+        \param[in]  rhs    input array on the right-hand side
+        \param[in]  optLhs transpose `lhs` before the function is performed
+        \param[in]  optRhs transpose `rhs` before the function is performed
+        \return     \ref AF_SUCCESS, if function returns successfully, else
+                    an \ref af_err code is given
 
-        \return AF_SUCCESS if the process is successful.
         \ingroup blas_func_matmul
      */
     AFAPI af_err af_matmul( af_array *out ,
                             const af_array lhs, const af_array rhs,
                             const af_mat_prop optLhs, const af_mat_prop optRhs);
 
-
     /**
-        Scalar dot product between two vectors.  Also referred to as the inner
+        C Interface to compute the dot product.
+
+        Scalar dot product between two vectors, also referred to as the inner
         product.
 
-        \democode{
-        // compute scalar dot product
-        array x = randu(100), y = randu(100);
-        print(dot<float>(x,y));
-        }
+        \code
+          // compute scalar dot product
+          array x = randu(100), y = randu(100);
+          af_print(dot(x, y));
+        \endcode
+
+        \param[out] out    dot product of `lhs` and `rhs`
+        \param[in]  lhs    input array on the left-hand side
+        \param[in]  rhs    input array on the right-hand side
+        \param[in]  optLhs `lhs` options, only \ref AF_MAT_NONE and \ref
+                           AF_MAT_CONJ are supported
+        \param[in]  optRhs `rhs` options, only \ref AF_MAT_NONE and \ref
+                           AF_MAT_CONJ are supported
+        \return     \ref AF_SUCCESS, if function returns successfully, else
+                    an \ref af_err code is given
+
+        \ingroup blas_func_dot
+    */
+    AFAPI af_err af_dot(af_array *out,
+                        const af_array lhs, const af_array rhs,
+                        const af_mat_prop optLhs, const af_mat_prop optRhs);
+
+#if AF_API_VERSION >= 35
+    /**
+        C Interface to compute the dot product, scalar result returned on host.
+
+        Scalar dot product between two vectors. Also referred to as the inner
+        product. Returns the result as a host scalar.
+
+        \param[out] real   real component of the dot product
+        \param[out] imag   imaginary component of the dot product
+        \param[in]  lhs    input array on the left-hand side
+        \param[in]  rhs    input array on the right-hand side
+        \param[in]  optLhs `lhs` options, only \ref AF_MAT_NONE and \ref
+                           AF_MAT_CONJ are supported
+        \param[in]  optRhs `rhs` options, only \ref AF_MAT_NONE and \ref
+                           AF_MAT_CONJ are supported
+        \return     \ref AF_SUCCESS, if function returns successfully, else
+                    an \ref af_err code is given
+
         \ingroup blas_func_dot
     */
-    AFAPI af_err af_dot(    af_array *out,
+    AFAPI af_err af_dot_all(double *real, double *imag,
                             const af_array lhs, const af_array rhs,
                             const af_mat_prop optLhs, const af_mat_prop optRhs);
+#endif
 
     /**
-        \brief Transposes a matrix
+        C Interface to transpose a matrix.
 
-        This funciton will tranpose the matrix in.
+        \param[out] out       transpose
+        \param[in]  in        input array
+        \param[in]  conjugate if true, conjugate transposition is performed
+        \return     \ref AF_SUCCESS, if function returns successfully, else
+                    an \ref af_err code is given
 
-        \param[out] out The transposed matrix
-        \param[in] in Input matrix which will be transposed
-        \param[in] conjugate Perform a congugate transposition
-
-        \return AF_SUCCESS if the process is successful.
         \ingroup blas_func_transpose
     */
     AFAPI af_err af_transpose(af_array *out, af_array in, const bool conjugate);
 
     /**
-        \brief Transposes a matrix
-
-        \copydetails blas_func_transpose
+        C Interface to transpose a matrix in-place.
 
-        \param[in,out] in is the matrix to be transposed in place
-        \param[in] conjugate If true a congugate transposition is performed
+        \param[in,out] in        input array to be transposed in-place
+        \param[in]     conjugate if true, conjugate transposition is performed
+        \return        \ref AF_SUCCESS, if function returns successfully, else
+                       an \ref af_err code is given
 
         \ingroup blas_func_transpose
     */
diff --git a/include/af/compatible.h b/include/af/compatible.h
index b3432946c3..cecf3dc0e4 100644
--- a/include/af/compatible.h
+++ b/include/af/compatible.h
@@ -18,99 +18,132 @@ class array;
 /// \ingroup device_func_count
 /// \copydoc getDeviceCount()
 /// \deprecated Use getDeviceCount() instead
-DEPRECATED("Use getDeviceCount instead")
+AF_DEPRECATED("Use getDeviceCount instead")
 AFAPI int devicecount();
 
 /// \ingroup device_func_get
 /// \copydoc getDevice()
 /// \deprecated Use getDevice() instead
-DEPRECATED("Use getDevice instead")
+AF_DEPRECATED("Use getDevice instead")
 AFAPI int deviceget();
 
 /// \ingroup device_func_set
 /// \copydoc setDevice()
 /// \deprecated Use setDevice() instead
-DEPRECATED("Use setDevice instead")
+AF_DEPRECATED("Use setDevice instead")
 AFAPI void deviceset(const int device);
 
 /// \ingroup imageio_func_load
 /// \copydoc loadImage
 /// \deprecated Use \ref loadImage instead
-DEPRECATED("Use loadImage instead")
+AF_DEPRECATED("Use loadImage instead")
 AFAPI array loadimage(const char* filename, const bool is_color=false);
 
 /// \ingroup imageio_func_save
 /// \copydoc saveImage
 /// \deprecated Use \ref saveImage instead
-DEPRECATED("Use saveImage instead")
+AF_DEPRECATED("Use saveImage instead")
 AFAPI void saveimage(const char* filename, const array& in);
 
 /// \ingroup image_func_gauss
 /// \copydoc image_func_gauss
 /// \deprecated Use \ref gaussianKernel instead
-DEPRECATED("Use gaussianKernel instead")
+AF_DEPRECATED("Use gaussianKernel instead")
 AFAPI array gaussiankernel(const int rows, const int cols, const double sig_r = 0, const double sig_c = 0);
 
 /// \ingroup reduce_func_all_true
 /// \copydoc af::allTrue(const array&)
 /// \deprecated Use \ref af::allTrue(const array&) instead
 template<typename T>
-DEPRECATED("Use allTrue instead")
+AF_DEPRECATED("Use allTrue instead")
 T alltrue(const array &in);
 
 /// \ingroup reduce_func_any_true
 /// \copydoc af::allTrue(const array&)
 /// \deprecated Use \ref af::anyTrue(const array&) instead
 template<typename T>
-DEPRECATED("Use anyTrue instead")
+AF_DEPRECATED("Use anyTrue instead")
 T anytrue(const array &in);
 
 /// \ingroup reduce_func_all_true
 /// \copydoc allTrue
 /// \deprecated Use \ref af::allTrue instead
-DEPRECATED("Use allTrue instead")
+AF_DEPRECATED("Use allTrue instead")
 AFAPI array alltrue(const array &in, const int dim = -1);
 
 /// \ingroup reduce_func_any_true
 /// \copydoc anyTrue
 /// \deprecated Use \ref af::anyTrue instead
-DEPRECATED("Use anyTrue instead")
+AF_DEPRECATED("Use anyTrue instead")
 AFAPI array anytrue(const array &in, const int dim = -1);
 
 /// \ingroup set_func_unique
 /// \copydoc setUnique
 /// \deprecated Use \ref setUnique instead
-DEPRECATED("Use setUnique instead")
+AF_DEPRECATED("Use setUnique instead")
 AFAPI array setunique(const array &in, const bool is_sorted=false);
 
 /// \ingroup set_func_union
 /// \copydoc setUnion
 /// \deprecated Use \ref setUnion instead
-DEPRECATED("Use setUnion instead")
+AF_DEPRECATED("Use setUnion instead")
 AFAPI array setunion(const array &first, const array &second, const bool is_unique=false);
 
 /// \ingroup set_func_intersect
 /// \copydoc setIntersect
 /// \deprecated Use \ref setIntersect instead
-DEPRECATED("Use setIntersect instead")
+AF_DEPRECATED("Use setIntersect instead")
 AFAPI array setintersect(const array &first, const array &second, const bool is_unique=false);
 
 /// \ingroup image_func_histequal
 /// \copydoc histEqual
 /// \deprecated Use \ref histEqual instead
-DEPRECATED("Use histEqual instead")
+AF_DEPRECATED("Use histEqual instead")
 AFAPI array histequal(const array& in, const array& hist);
 
 /// \ingroup image_func_colorspace
 /// \copydoc colorSpace
 /// \deprecated Use \ref colorSpace instead
-DEPRECATED("Use colorSpace instead")
+AF_DEPRECATED("Use colorSpace instead")
 AFAPI array colorspace(const array& image, const CSpace to, const CSpace from);
 
+/// Image Filtering
+/// \code
+/// // filter (convolve) an image with a 3x3 sobel kernel
+/// const float h_kernel[] = { -2.0, -1.0,  0.0,
+///                            -1.0,  0.0,  1.0,
+///                             0.0,  1.0,  2.0 };
+/// array kernel = array(3,3,h_kernel);
+/// array img_out = filter(img_in, kernel);
+/// \endcode
+///
+/// \param[in] image input image
+/// \param[in] kernel coefficient matrix
+/// \returns filtered image (same size as input)
+///
+/// \note Filtering is done using correlation. Array values outside the bounds are assumed to be zero.
+/// \ingroup image_func_filter
+/// \deprecated Use \ref af::convolve instead
+AF_DEPRECATED("Use af::convolve instead")
+AFAPI array filter(const array& image, const array& kernel);
+
+/// \ingroup reduce_func_product
+/// \copydoc product(const array&, const int)
+/// \deprecated Use \ref product instead
+AF_DEPRECATED("Use af::product instead")
+AFAPI array mul(const array& in, const int dim = -1);
+
+/// \ingroup reduce_func_product
+/// \copydoc product(const array&)
+/// \deprecated Use \ref product instead
+template<typename T>
+AF_DEPRECATED("Use af::product instead")
+T mul(const array& in);
+
 /// \ingroup device_func_prop
 /// \copydoc deviceInfo
 /// \deprecated Use \ref deviceInfo instead
-DEPRECATED("Use deviceInfo instead")
+AF_DEPRECATED("Use deviceInfo instead")
 AFAPI void deviceprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
 
 }
diff --git a/include/af/complex.h b/include/af/complex.h
new file mode 100644
index 0000000000..5f1baf2110
--- /dev/null
+++ b/include/af/complex.h
@@ -0,0 +1,123 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include "af/defines.h"
+
+
+#ifdef __cplusplus
+#include <ostream>
+#include <istream>
+
+namespace af{
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+typedef struct af_cfloat {
+    float real;
+    float imag;
+#ifdef __cplusplus
+    af_cfloat(const float _real = 0, const float _imag = 0) :real(_real), imag(_imag) {}
+#endif
+} af_cfloat;
+
+typedef struct af_cdouble {
+    double real;
+    double imag;
+#ifdef __cplusplus
+    af_cdouble(const double _real = 0, const double _imag = 0) :real(_real), imag(_imag) {}
+#endif
+} af_cdouble;
+#ifdef __cplusplus
+}
+#endif
+
+#ifdef __cplusplus
+typedef af::af_cfloat   cfloat;
+typedef af::af_cdouble  cdouble;
+
+AFAPI float real(af_cfloat val);
+AFAPI double real(af_cdouble val);
+
+AFAPI float imag(af_cfloat val);
+AFAPI double imag(af_cdouble val);
+
+// +,-,*,/ for (cfloat, cfloat) and (cdouble, cdouble)
+#define DEFINE_OP(OP)                                                               \
+    AFAPI af::cfloat  operator OP(const af::cfloat  &lhs, const af::cfloat  &rhs);  \
+    AFAPI af::cdouble operator OP(const af::cdouble &lhs, const af::cdouble &rhs);  \
+
+DEFINE_OP(+)
+DEFINE_OP(-)
+DEFINE_OP(*)
+DEFINE_OP(/)
+
+#undef DEFINE_OP
+
+// +,/ for (cfloat, double) and (cdouble, double)
+#define DEFINE_OP(OP)                                                               \
+    AFAPI af::cfloat  operator OP(const af::cfloat  &lhs, const     double  &rhs);  \
+    AFAPI af::cdouble operator OP(const af::cdouble &lhs, const     double  &rhs);  \
+
+DEFINE_OP(+)
+DEFINE_OP(/)
+
+#undef DEFINE_OP
+
+#if AF_API_VERSION >= 31
+// -,* for (cfloat, double) and (cdouble, double)
+#define DEFINE_OP(OP)                                                               \
+    AFAPI af::cfloat  operator OP(const af::cfloat  &lhs, const     double  &rhs);  \
+    AFAPI af::cdouble operator OP(const af::cdouble &lhs, const     double  &rhs);  \
+
+DEFINE_OP(-)
+DEFINE_OP(*)
+
+#undef DEFINE_OP
+#endif  // AF_API_VERSION
+
+#if AF_API_VERSION >= 31
+// +, -, *, / for (double, cfloat/cdouble) and (cfloat/cdouble, cdouble/cfloat)
+#define DEFINE_OP(OP)                                                               \
+    AFAPI af::cfloat  operator OP(const double      &lhs, const af::cfloat  &rhs);  \
+    AFAPI af::cdouble operator OP(const double      &lhs, const af::cdouble &rhs);  \
+    AFAPI af::cdouble operator OP(const af::cfloat  &lhs, const af::cdouble &rhs);  \
+    AFAPI af::cdouble operator OP(const af::cdouble &lhs, const af::cfloat  &rhs);  \
+
+DEFINE_OP(+)
+DEFINE_OP(-)
+DEFINE_OP(*)
+DEFINE_OP(/)
+
+#undef DEFINE_OP
+#endif  // AF_API_VERSION
+
+AFAPI bool operator==(const cfloat &lhs, const cfloat &rhs);
+AFAPI bool operator==(const cdouble &lhs, const cdouble &rhs);
+
+AFAPI bool operator!=(const cfloat &lhs, const cfloat &rhs);
+AFAPI bool operator!=(const cdouble &lhs, const cdouble &rhs);
+
+AFAPI std::istream& operator>> (std::istream &is, cfloat &in);
+AFAPI std::istream& operator>> (std::istream &is, cdouble &in);
+
+AFAPI std::ostream& operator<< (std::ostream &os, const cfloat &in);
+AFAPI std::ostream& operator<< (std::ostream &os, const cdouble &in);
+
+
+AFAPI float abs(const cfloat &val);
+AFAPI double abs(const cdouble &val);
+
+AFAPI cfloat conj(const cfloat &val);
+AFAPI cdouble conj(const cdouble &val);
+
+}
+#endif
diff --git a/include/af/constants.h b/include/af/constants.h
index 3cd60f6948..a57c924577 100644
--- a/include/af/constants.h
+++ b/include/af/constants.h
@@ -1,7 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 #include <af/defines.h>
+#ifdef __cplusplus
 namespace af
 {
     AFAPI extern const double NaN;
     AFAPI extern const double Inf;
     AFAPI extern const double Pi;
 }
+#endif
diff --git a/include/af/cuda.h b/include/af/cuda.h
new file mode 100644
index 0000000000..27908427e7
--- /dev/null
+++ b/include/af/cuda.h
@@ -0,0 +1,156 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+#include <af/exception.h>
+
+/// This file contains functions that apply only to the CUDA backend. It
+/// includes the CUDA headers when built with NVCC. Otherwise, you can define
+/// AF_DEFINE_CUDA_TYPES before including this file, and it will define the
+/// CUDA types used in this header.
+
+#ifdef __NVCC__
+#include <cuda.h>
+#include <cuda_runtime.h>
+#include <cublas_v2.h>
+#else
+#ifdef AF_DEFINE_CUDA_TYPES
+typedef struct CUstream_st *cudaStream_t;
+
+/*Enum for default math mode/tensor operation*/
+typedef enum {
+  CUBLAS_DEFAULT_MATH = 0,
+  CUBLAS_TENSOR_OP_MATH = 1
+} cublasMath_t;
+#endif
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+#if AF_API_VERSION >= 31
+/**
+   Get the stream for the CUDA device with \p id in ArrayFire context
+
+   \param[out] stream CUDA Stream of device with \p id in ArrayFire context
+   \param[in] id ArrayFire device id
+   \returns \ref af_err error code
+
+   \ingroup cuda_mat
+ */
+AFAPI af_err afcu_get_stream(cudaStream_t* stream, int id);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   Get the native device id of the CUDA device with \p id in ArrayFire context
+
+   \param[out] nativeid native device id of the CUDA device with \p id in ArrayFire context
+   \param[in] id ArrayFire device id
+   \returns \ref af_err error code
+
+   \ingroup cuda_mat
+ */
+AFAPI af_err afcu_get_native_id(int* nativeid, int id);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   Set the CUDA device with given native id as the active device for ArrayFire
+
+   \param[in] nativeid native device id of the CUDA device
+   \returns \ref af_err error code
+
+   \ingroup cuda_mat
+ */
+AFAPI af_err afcu_set_native_id(int nativeid);
+#endif
+
+#if AF_API_VERSION >= 37
+/**
+    Sets the cuBLAS math mode for the internal handle
+
+    See the cuBLAS documentation for additional details
+
+    \param[in] mode The cublasMath_t type to set
+    \returns \ref af_err error code
+
+    \ingroup cuda_mat
+*/
+AFAPI af_err afcu_cublasSetMathMode(cublasMath_t mode);
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#ifdef __cplusplus
+
+namespace afcu
+{
+
+#if AF_API_VERSION >= 31
+/**
+   Get the stream for the CUDA device with \p id in ArrayFire context
+
+   \param[in] id ArrayFire device id
+   \returns CUDA stream used by the CUDA device
+
+   \ingroup cuda_mat
+ */
+static inline cudaStream_t getStream(int id)
+{
+    cudaStream_t retVal;
+    af_err err = afcu_get_stream(&retVal, id);
+    if (err!=AF_SUCCESS)
+        throw af::exception("Failed to get CUDA stream from ArrayFire");
+    return retVal;
+}
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   Get the native device id of the CUDA device with \p id in ArrayFire context
+
+   \param[in] id ArrayFire device id
+   \returns native CUDA id of the device
+
+   \ingroup cuda_mat
+ */
+static inline int getNativeId(int id)
+{
+    int retVal;
+    af_err err = afcu_get_native_id(&retVal, id);
+    if (err!=AF_SUCCESS)
+        throw af::exception("Failed to get CUDA device native id from ArrayFire");
+    return retVal;
+}
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   Set the CUDA device with given native id as the active device for ArrayFire
+
+   \param[in] nativeId native device id of the CUDA device
+
+   \ingroup cuda_mat
+ */
+static inline void setNativeId(int nativeId)
+{
+    af_err err = afcu_set_native_id(nativeId);
+    if (err!=AF_SUCCESS)
+        throw af::exception("Failed to change active CUDA device to the device with given native id");
+}
+#endif
+
+}
+#endif
diff --git a/include/af/data.h b/include/af/data.h
index b2ab511da4..22e1874439 100644
--- a/include/af/data.h
+++ b/include/af/data.h
@@ -12,693 +12,995 @@
 
 #ifdef __cplusplus
 #include <af/dim4.hpp>
+#include <af/traits.hpp>
 namespace af
 {
     class array;
 
-    /**
-       \defgroup data_func_constant constant
-       Create constant array from the specified dimensions
-       @{
+    /// C++ Interface to generate an array with elements set to a specified
+    /// value.
+    ///
+    /// \param[in] val  constant value
+    /// \param[in] dims dimensions of the array to be generated
+    /// \param[in] ty   type
+    /// \return         constant array
+    ///
+    /// \ingroup data_func_constant
+    template<typename T>
+    array constant(T val, const dim4 &dims, const dtype ty=(af_dtype)dtype_traits<T>::ctype);
+
+    /// C++ Interface to generate a 1-D array with elements set to a specified
+    /// value.
+    ///
+    /// \param[in] val constant value
+    /// \param[in] d0  size of the first dimension
+    /// \param[in] ty  type
+    /// \return        constant 1-D array
+    ///
+    /// \ingroup data_func_constant
+    template<typename T>
+    array constant(T val, const dim_t d0, const af_dtype ty=(af_dtype)dtype_traits<T>::ctype);
+
+    /// C++ Interface to generate a 2-D array with elements set to a specified
+    /// value.
+    ///
+    /// \param[in] val constant value
+    /// \param[in] d0  size of the first dimension
+    /// \param[in] d1  size of the second dimension
+    /// \param[in] ty  type
+    /// \return        constant 2-D array
+    ///
+    /// \ingroup data_func_constant
+    template<typename T>
+    array constant(T val, const dim_t d0, const dim_t d1, const af_dtype ty=(af_dtype)dtype_traits<T>::ctype);
+
+    /// C++ Interface to generate a 3-D array with elements set to a specified
+    /// value.
+    ///
+    /// \param[in] val constant value
+    /// \param[in] d0  size of the first dimension
+    /// \param[in] d1  size of the second dimension
+    /// \param[in] d2  size of the third dimension
+    /// \param[in] ty  type
+    /// \return        constant 3-D array
+    ///
+    /// \ingroup data_func_constant
+    template<typename T>
+    array constant(T val, const dim_t d0, const dim_t d1, const dim_t d2, const af_dtype ty=(af_dtype)dtype_traits<T>::ctype);
+
+    /// C++ Interface to generate a 4-D array with elements set to a specified
+    /// value.
+    ///
+    /// \param[in] val constant value
+    /// \param[in] d0  size of the first dimension
+    /// \param[in] d1  size of the second dimension
+    /// \param[in] d2  size of the third dimension
+    /// \param[in] d3  size of the fourth dimension
+    /// \param[in] ty  type
+    /// \return        constant 4-D array
+    ///
+    /// \ingroup data_func_constant
+    template<typename T>
+    array constant(T val, const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3, const af_dtype ty=(af_dtype)dtype_traits<T>::ctype);
+
+    /// C++ Interface to generate an identity array.
+    ///
+    /// \param[in] dims size
+    /// \param[in] ty   type
+    /// \return         identity array
+    ///
+    /// \ingroup data_func_identity
+    AFAPI array identity(const dim4 &dims, const dtype ty=f32);
 
-       \ingroup arrayfire_func
-       \ingroup move_mat
-    */
-#define CONSTANT(TYPE, TY)                                              \
-    AFAPI array constant(TYPE val, const dim4 &dims, const dtype ty=TY); \
-    AFAPI array constant(TYPE val, const dim_t d0, const dtype ty=TY);  \
-    AFAPI array constant(TYPE val, const dim_t d0,                      \
-                         const dim_t d1, const dtype ty=TY);            \
-    AFAPI array constant(TYPE val, const dim_t d0,                      \
-                         const dim_t d1, const dim_t d2,                \
-                         const dtype ty=TY);                            \
-    AFAPI array constant(TYPE val, const dim_t d0,                      \
-                         const dim_t d1, const dim_t d2,                \
-                         const dim_t d3, const dtype ty=TY);            \
-
-    CONSTANT(double             , f32)
-    CONSTANT(float              , f32)
-    CONSTANT(int                , f32)
-    CONSTANT(unsigned           , f32)
-    CONSTANT(char               , f32)
-    CONSTANT(unsigned char      , f32)
-    CONSTANT(cfloat             , c32)
-    CONSTANT(cdouble            , c64)
-    CONSTANT(long               , s64)
-    CONSTANT(unsigned long      , u64)
-    CONSTANT(long long          , s64)
-    CONSTANT(unsigned long long , u64)
-    CONSTANT(bool               ,  b8)
-
-#undef CONSTANT
+    /// C++ Interface to generate a 1-D identity array.
+    ///
+    /// \param[in] d0 size of the first dimension
+    /// \param[in] ty type
+    /// \return       identity array
+    ///
+    /// \ingroup data_func_identity
+    AFAPI array identity(const dim_t d0, const dtype ty=f32);
 
-    /**
-       @}
-    */
+    /// C++ Interface to generate a 2-D identity array.
+    ///
+    /// \param[in] d0 size of the first dimension
+    /// \param[in] d1 size of the second dimension
+    /// \param[in] ty type
+    /// \return       identity array
+    ///
+    /// \ingroup data_func_identity
+    AFAPI array identity(const dim_t d0, const dim_t d1, const dtype ty=f32);
 
-    /**
-        \param[in] dims is the dimensions of the array to be generated
-        \param[in] ty is the type of the array
+    /// C++ Interface to generate a 3-D identity array.
+    ///
+    /// \param[in] d0 size of the first dimension
+    /// \param[in] d1 size of the second dimension
+    /// \param[in] d2 size of the third dimension
+    /// \param[in] ty type
+    /// \return       identity array
+    ///
+    /// \ingroup data_func_identity
+    AFAPI array identity(const dim_t d0, const dim_t d1,
+                         const dim_t d2, const dtype ty=f32);
 
-        \return array of size \p dims
+    /// C++ Interface to generate a 4-D identity array.
+    ///
+    /// \param[in] d0 size of the first dimension
+    /// \param[in] d1 size of the second dimension
+    /// \param[in] d2 size of the third dimension
+    /// \param[in] d3 size of the fourth dimension
+    /// \param[in] ty type
+    /// \return       identity array
+    ///
+    /// \ingroup data_func_identity
+    AFAPI array identity(const dim_t d0, const dim_t d1,
+                         const dim_t d2, const dim_t d3, const dtype ty=f32);
 
-        \ingroup data_func_randu
-    */
-    AFAPI array randu(const dim4 &dims, const dtype ty=f32);
+    /// C++ Interface to generate an array with `[0, n-1]` values along the
+    /// `seq_dim` dimension and tiled across other dimensions of shape `dim4`.
+    ///
+    /// \param[in] dims    size
+    /// \param[in] seq_dim dimension along which the range is created
+    /// \param[in] ty      type
+    /// \return            range array
+    ///
+    /// \ingroup data_func_range
+    AFAPI array range(const dim4 &dims, const int seq_dim = -1, const dtype ty=f32);
 
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] ty is the type of the array
+    /// C++ Interface to generate an array with `[0, n-1]` values along the
+    /// `seq_dim` dimension and tiled across other dimensions described by
+    /// dimension parameters.
+    ///
+    /// \param[in] d0      size of the first dimension
+    /// \param[in] d1      size of the second dimension
+    /// \param[in] d2      size of the third dimension
+    /// \param[in] d3      size of the fourth dimension
+    /// \param[in] seq_dim dimension along which the range is created
+    /// \param[in] ty      type
+    /// \return            range array
+    ///
+    /// \ingroup data_func_range
+    AFAPI array range(const dim_t d0, const dim_t d1 = 1, const dim_t d2 = 1,
+                      const dim_t d3 = 1, const int seq_dim = -1, const dtype ty=f32);
 
-        \return array of size \p d0
+    /// C++ Interface to generate an array with `[0, n-1]` values modified to
+    /// specified dimensions and tiling.
+    ///
+    /// \param[in] dims      size
+    /// \param[in] tile_dims number of tiled repetitions in each dimension
+    /// \param[in] ty        type
+    /// \return              iota array
+    ///
+    /// \ingroup data_func_iota
+    AFAPI array iota(const dim4 &dims, const dim4 &tile_dims = dim4(1), const dtype ty=f32);
 
-        \ingroup data_func_randu
-    */
-    AFAPI array randu(const dim_t d0, const dtype ty=f32);
+    /// C++ Interface to extract the diagonal from an array.
+    ///
+    /// \param[in] in      input array
+    /// \param[in] num     diagonal index
+    /// \param[in] extract if true, returns an array containing diagonal of the
+    ///                    matrix; if false, returns a diagonal matrix
+    /// \return            diagonal array (or matrix)
+    ///
+    /// \ingroup data_func_diag
+    AFAPI array diag(const array &in, const int num = 0, const bool extract = true);
 
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] ty is the type of the array
+    /// C++ Interface to join 2 arrays along a dimension.
+    ///
+    /// Empty arrays are ignored.
+    ///
+    /// \param[in] dim    dimension along which the join occurs
+    /// \param[in] first  input array
+    /// \param[in] second input array
+    /// \return           joined array
+    ///
+    /// \ingroup manip_func_join
+    AFAPI array join(const int dim, const array &first, const array &second);
 
-        \return array of size \p d0 x \p d1
+    /// C++ Interface to join 3 arrays along a dimension.
+    ///
+    /// Empty arrays are ignored.
+    ///
+    /// \param[in] dim    dimension along which the join occurs
+    /// \param[in] first  input array
+    /// \param[in] second input array
+    /// \param[in] third  input array
+    /// \return           joined array
+    ///
+    /// \ingroup manip_func_join
+    AFAPI array join(const int dim, const array &first, const array &second, const array &third);
 
-        \ingroup data_func_randu
-    */
-    AFAPI array randu(const dim_t d0,
-                      const dim_t d1, const dtype ty=f32);
+    /// C++ Interface to join 4 arrays along a dimension.
+    ///
+    /// Empty arrays are ignored.
+    ///
+    /// \param[in] dim    dimension along which the join occurs
+    /// \param[in] first  input array
+    /// \param[in] second input array
+    /// \param[in] third  input array
+    /// \param[in] fourth input array
+    /// \return           joined array
+    ///
+    /// \ingroup manip_func_join
+    AFAPI array join(const int dim, const array &first, const array &second,
+                     const array &third, const array &fourth);
 
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] d2 is the size of the third dimension
-        \param[in] ty is the type of the array
+    /// C++ Interface to generate a tiled array.
+    ///
+    /// Note, `x`, `y`, `z`, and `w` include the original in the count.
+    ///
+    /// \param[in] in input array
+    /// \param[in] x  number of tiles along the first dimension
+    /// \param[in] y  number of tiles along the second dimension
+    /// \param[in] z  number of tiles along the third dimension
+    /// \param[in] w  number of tiles along the fourth dimension
+    /// \return       tiled array
+    ///
+    /// \ingroup manip_func_tile
+    AFAPI array tile(const array &in, const unsigned x, const unsigned y=1,
+                     const unsigned z=1, const unsigned w=1);
 
-        \return array of size \p d0 x \p d1 x \p d2
+    /// C++ Interface to generate a tiled array.
+    ///
+    /// Each component of `dims` includes the original in the count. Thus, if
+    /// no duplicates are needed in a certain dimension, it is left as 1, the
+    /// default value for just one copy.
+    ///
+    /// \param[in] in   input array
+    /// \param[in] dims number of times `in` is copied along each dimension
+    /// \return         tiled array
+    ///
+    /// \ingroup manip_func_tile
+    AFAPI array tile(const array &in, const dim4 &dims);
 
-        \ingroup data_func_randu
-    */
-    AFAPI array randu(const dim_t d0,
-                      const dim_t d1, const dim_t d2, const dtype ty=f32);
+    /// C++ Interface to reorder an array.
+    ///
+    /// \param[in] in input array
+    /// \param[in] x  specifies which dimension should be first
+    /// \param[in] y  specifies which dimension should be second
+    /// \param[in] z  specifies which dimension should be third
+    /// \param[in] w  specifies which dimension should be fourth
+    /// \return       reordered array
+    ///
+    /// \ingroup manip_func_reorder
+    AFAPI array reorder(const array& in, const unsigned x,
+                        const unsigned y=1, const unsigned z=2, const unsigned w=3);
 
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] d2 is the size of the third dimension
-        \param[in] d3 is the size of the fourth dimension
-        \param[in] ty is the type of the array
+    /// C++ Interface to shift an array.
+    ///
+    /// \param[in] in input array
+    /// \param[in] x  specifies the shift along the first dimension
+    /// \param[in] y  specifies the shift along the second dimension
+    /// \param[in] z  specifies the shift along the third dimension
+    /// \param[in] w  specifies the shift along the fourth dimension
+    /// \return       shifted array
+    ///
+    /// \ingroup manip_func_shift
+    AFAPI array shift(const array& in, const int x, const int y=0, const int z=0, const int w=0);
 
-        \return array of size \p d0 x \p d1 x \p d2 x \p d3
+    /// C++ Interface to modify the dimensions of an input array to a specified
+    /// shape.
+    ///
+    /// \param[in] in   input array
+    /// \param[in] dims new dimension sizes
+    /// \return         modded output
+    ///
+    /// \ingroup manip_func_moddims
+    AFAPI array moddims(const array& in, const dim4& dims);
 
-        \ingroup data_func_randu
-    */
-    AFAPI array randu(const dim_t d0,
-                      const dim_t d1, const dim_t d2,
-                      const dim_t d3, const dtype ty=f32);
+    /// C++ Interface to modify the dimensions of an input array to a specified
+    /// shape.
+    ///
+    /// \param[in] in input array
+    /// \param[in] d0 new size of the first dimension
+    /// \param[in] d1 new size of the second dimension (optional)
+    /// \param[in] d2 new size of the third dimension (optional)
+    /// \param[in] d3 new size of the fourth dimension (optional)
+    /// \return       modded output
+    ///
+    /// \ingroup manip_func_moddims
+    AFAPI array moddims(const array& in, const dim_t d0, const dim_t d1=1, const dim_t d2=1, const dim_t d3=1);
 
-    /**
-        \param[in] dims is the dimensions of the array to be generated
-        \param[in] ty is the type of the array
+    /// C++ Interface to modify the dimensions of an input array to a specified
+    /// shape.
+    ///
+    /// \param[in] in    input array
+    /// \param[in] ndims number of dimensions
+    /// \param[in] dims  new dimension sizes
+    /// \return          modded output
+    ///
+    /// \ingroup manip_func_moddims
+    AFAPI array moddims(const array& in, const unsigned ndims, const dim_t* const dims);
+
+    /// C++ Interface to flatten an array.
+    ///
+    /// \param[in] in input array
+    /// \return       flat array
+    ///
+    /// \ingroup manip_func_flat
+    AFAPI array flat(const array &in);
 
-        \return array of size \p dims
+    /// C++ Interface to flip an array.
+    ///
+    /// \param[in] in  input array
+    /// \param[in] dim dimension to flip
+    /// \return        flipped array
+    ///
+    /// \ingroup manip_func_flip
+    AFAPI array flip(const array &in, const unsigned dim);
 
-        \ingroup data_func_randn
-    */
-    AFAPI array randn(const dim4 &dims, const dtype ty=f32);
+    /// C++ Interface to return the lower triangle array.
+    ///
+    /// \param[in] in           input array
+    /// \param[in] is_unit_diag boolean specifying if diagonal elements are 1's
+    /// \return                 lower triangle array
+    ///
+    /// \ingroup data_func_lower
+    AFAPI array lower(const array &in, bool is_unit_diag=false);
 
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] ty is the type of the array
+    /// C++ Interface to return the upper triangle array.
+    ///
+    /// \param[in] in           input array
+    /// \param[in] is_unit_diag boolean specifying if diagonal elements are 1's
+    /// \return                 upper triangle array
+    ///
+    /// \ingroup data_func_upper
+    AFAPI array upper(const array &in, bool is_unit_diag=false);
 
-        \return array of size \p d0
+#if AF_API_VERSION >= 31
+    /// C++ Interface to select elements based on a conditional array.
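+    ///
+    /// A minimal usage sketch (illustrative only; the input shapes and the 0.5
+    /// threshold are assumptions chosen for the example):
+    /// \code{.cpp}
+    /// af::array a    = af::randu(3, 3);
+    /// af::array b    = af::constant(0, 3, 3);
+    /// af::array cond = a > 0.5;
+    /// af::array out  = af::select(cond, a, b);  // a where cond is true, else b
+    /// \endcode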
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select array element
+    /// \param[in] b    when false, select array element
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const array  &a, const array  &b);
+#endif
 
-        \ingroup data_func_randn
-    */
-    AFAPI array randn(const dim_t d0, const dtype ty=f32);
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] ty is the type of the array
+#if AF_API_VERSION >= 31
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select array element
+    /// \param[in] b    when false, select scalar value
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const array  &a, const double &b);
+#endif
 
-        \return array of size \p d0 x \p d1
+#if AF_API_VERSION >= 31
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select scalar value
+    /// \param[in] b    when false, select array element
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const double &a, const array  &b);
+#endif
 
-        \ingroup data_func_randn
-    */
-    AFAPI array randn(const dim_t d0,
-                      const dim_t d1, const dtype ty=f32);
-    /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] d2 is the size of the third dimension
-        \param[in] ty is the type of the array
+#if AF_API_VERSION >= 31
+    /// C++ Interface to replace elements of an array with elements of another
+    /// array.
+    ///
+    /// Elements of `a` are replaced with corresponding elements of `b` when
+    /// `cond` is false.
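+    ///
+    /// A minimal usage sketch (illustrative only; the input shapes and the 0.5
+    /// threshold are assumptions chosen for the example):
+    /// \code{.cpp}
+    /// af::array a    = af::randu(3, 3);
+    /// af::array b    = af::constant(0, 3, 3);
+    /// af::array cond = a > 0.5;
+    /// af::replace(a, cond, b);  // a keeps its values where cond is true;
+    ///                           // elements where cond is false come from b
+    /// \endcode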
+    ///
+    /// \param[inout] a    input array
+    /// \param[in]    cond conditional array
+    /// \param[in]    b    replacement array
+    ///
+    /// \ingroup data_func_replace
+    AFAPI void replace(array &a, const array  &cond, const array  &b);
+#endif
 
-        \return array of size \p d0 x \p d1 x \p d2
+#if AF_API_VERSION >= 31
+    /// C++ Interface to replace elements of an array with a scalar value.
+    ///
+    /// Elements of `a` are replaced with a scalar value when `cond` is false.
+    ///
+    /// \param[inout] a    input array
+    /// \param[in]    cond conditional array
+    /// \param[in]    b    replacement scalar value
+    ///
+    /// \ingroup data_func_replace
+    AFAPI void replace(array &a, const array  &cond, const double &b);
+#endif
 
-        \ingroup data_func_randn
-    */
-    AFAPI array randn(const dim_t d0,
-                      const dim_t d1, const dim_t d2, const dtype ty=f32);
+#if AF_API_VERSION >= 37
+    /// C++ Interface to pad an array.
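+    ///
+    /// A minimal usage sketch (illustrative only; the input shape and padding
+    /// widths are assumptions chosen for the example):
+    /// \code{.cpp}
+    /// af::array in  = af::randu(3, 3);
+    /// // Pad one element at both ends of the first two dimensions with zeros.
+    /// af::array out = af::pad(in, af::dim4(1, 1, 0, 0), af::dim4(1, 1, 0, 0),
+    ///                         AF_PAD_ZERO);
+    /// \endcode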
+    ///
+    /// \param[in] in           input array
+    /// \param[in] beginPadding number of elements to be padded at the start of
+    ///                         each dimension
+    /// \param[in] endPadding   number of elements to be padded at the end of
+    ///                         each dimension
+    /// \param[in] padFillType  values to fill into the padded region
+    /// \return                 padded array
+    ///
+    /// \ingroup data_func_pad
+    AFAPI array pad(const array &in, const dim4 &beginPadding,
+                    const dim4 &endPadding, const borderType padFillType);
+#endif
 
+#if AF_API_VERSION >= 39
+    /// C++ Interface to replace elements of an array with a scalar value.
+    ///
+    /// Elements of `a` are replaced with a scalar value when `cond` is false.
+    ///
+    /// \param[inout] a    input array
+    /// \param[in]    cond conditional array
+    /// \param[in]    b    replacement scalar value
+    ///
+    /// \ingroup data_func_replace
+    AFAPI void replace(array &a, const array &cond, const long long b);
+
+    /// C++ Interface to replace elements of an array with a scalar value.
+    ///
+    /// Elements of `a` are replaced with a scalar value when `cond` is false.
+    ///
+    /// \param[inout] a    input array
+    /// \param[in]    cond conditional array
+    /// \param[in]    b    replacement scalar value
+    ///
+    /// \ingroup data_func_replace
+    AFAPI void replace(array &a, const array &cond,
+                       const unsigned long long b);
+
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select array element
+    /// \param[in] b    when false, select scalar value
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const array &a, const long long b);
+
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select array element
+    /// \param[in] b    when false, select scalar value
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const array &a,
+                       const unsigned long long b);
+
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select scalar value
+    /// \param[in] b    when false, select array element
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const long long a, const array &b);
+
+    /// C++ Interface to select elements based on a conditional array.
+    ///
+    /// \param[in] cond conditional array
+    /// \param[in] a    when true, select scalar value
+    /// \param[in] b    when false, select array element
+    /// \return         `a` when `cond` is true, else `b`
+    ///
+    /// \ingroup data_func_select
+    AFAPI array select(const array &cond, const unsigned long long a,
+                       const array &b);
+#endif
+}
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
     /**
-        \param[in] d0 is the size of the first dimension
-        \param[in] d1 is the size of the second dimension
-        \param[in] d2 is the size of the third dimension
-        \param[in] d3 is the size of the fourth dimension
-        \param[in] ty is the type of the array
+       C Interface to generate an array with elements set to a specified value.
 
-        \return array of size \p d0 x \p d1 x \p d2 x \p d3
+       \param[out] arr   constant array
+       \param[in]  val   constant value
+       \param[in]  ndims size of the dimension array
+       \param[in]  dims  dimensions of the array to be generated
+       \param[in]  type  type
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_randn
+       \ingroup data_func_constant
     */
-    AFAPI array randn(const dim_t d0,
-                      const dim_t d1, const dim_t d2,
-                      const dim_t d3, const dtype ty=f32);
+    AFAPI af_err af_constant(af_array *arr, const double val, const unsigned ndims, const dim_t * const dims, const af_dtype type);
 
     /**
-        \defgroup data_func_setseed setSeed
-        Set the seed for the random number generator
-
+       C Interface to generate a complex array with elements set to a specified
+       value.
 
-        \param[in] seed is a 64 bit unsigned integer
+       \param[out] arr   constant complex array
+       \param[in]  real  real constant value
+       \param[in]  imag  imaginary constant value
+       \param[in]  ndims size of the dimension array
+       \param[in]  dims  dimensions of the array to be generated
+       \param[in]  type  type, \ref c32 or \ref c64
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_mat
-        \ingroup arrayfire_func
+       \ingroup data_func_constant
     */
-    AFAPI void setSeed(const uintl seed);
+    AFAPI af_err af_constant_complex(af_array *arr, const double real, const double imag,
+                                     const unsigned ndims, const dim_t * const dims, const af_dtype type);
 
     /**
-        \defgroup data_func_getseed getSeed
-        Get the seed for the random number generator
+       C Interface to generate an array with elements set to a specified value.
 
+       Output type is \ref s64.
 
-        \returns seed which is a 64 bit unsigned integer
+       \param[out] arr   constant array
+       \param[in]  val   constant value
+       \param[in]  ndims size of the dimension array
+       \param[in]  dims  dimensions of the array to be generated
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_mat
-        \ingroup arrayfire_func
+       \ingroup data_func_constant
     */
-    AFAPI uintl getSeed();
-
+    AFAPI af_err af_constant_long (af_array *arr, const long long val, const unsigned ndims, const dim_t * const dims);
 
     /**
-        \param[in] dims is dim4 for size of all dimensions
-        \param[in] ty is the type of array to generate
+       C Interface to generate an array with elements set to a specified value.
 
-        \returns an identity array of specified dimension and type
+       Output type is \ref u64.
 
-        \ingroup data_func_identity
-    */
-    AFAPI array identity(const dim4 &dims, const dtype ty=f32);
-
-    /**
-        \param[in] d0 is size of first dimension
-        \param[in] ty is the type of array to generate
+       \param[out] arr   constant array
+       \param[in]  val   constant value
+       \param[in]  ndims size of the dimension array
+       \param[in]  dims  dimensions of the array to be generated
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \returns an identity array of specified dimension and type
-
-        \ingroup data_func_identity
+       \ingroup data_func_constant
     */
-    AFAPI array identity(const dim_t d0, const dtype ty=f32);
-
-    /**
-        \param[in] d0 is size of first dimension
-        \param[in] d1 is size of second dimension
-        \param[in] ty is the type of array to generate
 
-        \returns an identity array of specified dimension and type
-
-        \ingroup data_func_identity
-    */
-    AFAPI array identity(const dim_t d0, const dim_t d1, const dtype ty=f32);
+    AFAPI af_err af_constant_ulong(af_array *arr, const unsigned long long val, const unsigned ndims, const dim_t * const dims);
 
     /**
-        \param[in] d0 is size of first dimension
-        \param[in] d1 is size of second dimension
-        \param[in] d2 is size of third dimension
-        \param[in] ty is the type of array to generate
+       C Interface to generate an identity array.
 
-        \returns an identity array of specified dimension and type
+       \param[out] out   identity array
+       \param[in]  ndims number of dimensions
+       \param[in]  dims  size of each dimension
+       \param[in]  type  type
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_identity
+       \ingroup data_func_identity
     */
-    AFAPI array identity(const dim_t d0, const dim_t d1,
-                         const dim_t d2, const dtype ty=f32);
+    AFAPI af_err af_identity(af_array* out, const unsigned ndims, const dim_t* const dims, const af_dtype type);
 
     /**
-        \param[in] d0 is size of first dimension
-        \param[in] d1 is size of second dimension
-        \param[in] d2 is size of third dimension
-        \param[in] d3 is size of fourth dimension
-        \param[in] ty is the type of array to generate
+       C Interface to generate an array with `[0, n-1]` values along the
+       `seq_dim` dimension, tiled across the other dimensions given by `dims`.
 
-        \returns an identity array of specified dimension and type
+       \param[out] out     range array
+       \param[in]  ndims   number of dimensions, specified by the size of `dims`
+       \param[in]  dims    size of each dimension
+       \param[in]  seq_dim dimension along which the range is created
+       \param[in]  type    type
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_identity
+       \ingroup data_func_range
     */
-    AFAPI array identity(const dim_t d0, const dim_t d1,
-                         const dim_t d2, const dim_t d3, const dtype ty=f32);
+    AFAPI af_err af_range(af_array *out, const unsigned ndims, const dim_t * const dims,
+                          const int seq_dim, const af_dtype type);
 
     /**
-        \param[in] dims is dim4 for size of all dimensions
-        \param[in] seq_dim is dimesion along which [0, dim[seq_dim] - 1] is generated
-        \param[in] ty is the type of array to generate
+       C Interface to generate an array with `[0, n-1]` values modified to
+       specified dimensions and tiling.
 
-        \returns an array of integral range specified dimension and type
+       \param[out] out     iota array
+       \param[in]  ndims   number of dimensions
+       \param[in]  dims    size of each dimension
+       \param[in]  t_ndims number of dimensions of tiled array
+       \param[in]  tdims   number of tiled repetitions in each dimension
+       \param[in]  type    type
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_range
+       \ingroup data_func_iota
     */
-    AFAPI array range(const dim4 &dims, const int seq_dim = -1, const dtype ty=f32);
+    AFAPI af_err af_iota(af_array *out, const unsigned ndims, const dim_t * const dims,
+                         const unsigned t_ndims, const dim_t * const tdims, const af_dtype type);
 
     /**
-        \param[in] d0 is size of first dimension
-        \param[in] d1 is size of second dimension
-        \param[in] d2 is size of third dimension
-        \param[in] d3 is size of fourth dimension
-        \param[in] seq_dim is dimesion along which [0, dim[seq_dim] - 1] is generated
-        \param[in] ty is the type of array to generate
+       C Interface to create a diagonal matrix from an extracted diagonal
+       array.
 
-        \returns an array of integral range specified dimension and type
+       See also, \ref af_diag_extract.
 
-        \ingroup data_func_range
-    */
-    AFAPI array range(const dim_t d0, const dim_t d1 = 1, const dim_t d2 = 1,
-                      const dim_t d3 = 1, const int seq_dim = -1, const dtype ty=f32);
+       \param[out] out diagonal matrix
+       \param[in]  in  diagonal array
+       \param[in]  num diagonal index
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[in] dims is dim4 for unit dimensions of the sequence to be generated
-        \param[in] tile_dims is dim4 for the number of repetitions of the unit dimensions
-        \param[in] ty is the type of array to generate
-
-        \returns an array of integral range specified dimension and type
-
-        \ingroup data_func_iota
+       \ingroup data_func_diag
     */
-    AFAPI array iota(const dim4 &dims, const dim4 &tile_dims = dim4(1), const dtype ty=f32);
+    AFAPI af_err af_diag_create(af_array *out, const af_array in, const int num);
 
     /**
-        \param[in] in is the input array
-        \param[in] num is the diagonal index
-        \param[in] extract when true returns an array containing diagonal of tha matrix
-        and when false returns a matrix with \p in as diagonal
+       C Interface to extract the diagonal from an array.
+
+       See also, \ref af_diag_create.
 
-        \returns an array with either the diagonal or the matrix based on \p extract
+       \param[out] out `num`-th diagonal array
+       \param[in]  in  input array
+       \param[in]  num diagonal index
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_diag
+       \ingroup data_func_diag
     */
-    AFAPI array diag(const array &in, const int num = 0, const bool extract = true);
+    AFAPI af_err af_diag_extract(af_array *out, const af_array in, const int num);
 
     /**
-        \brief Join 2 arrays along \p dim
+       C Interface to join 2 arrays along a dimension.
+
+       Empty arrays are ignored.
 
-        \param[in] dim is the dimension along which join occurs
-        \param[in] first is the first input array
-        \param[in] second is the second input array
-        \return the array that joins input arrays along the given dimension
+       \param[out] out    joined array
+       \param[in]  dim    dimension along which the join occurs
+       \param[in]  first  input array
+       \param[in]  second input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_join
+       \ingroup manip_func_join
     */
-    AFAPI array join(const int dim, const array &first, const array &second);
+    AFAPI af_err af_join(af_array *out, const int dim, const af_array first, const af_array second);
 
     /**
-        \brief Join 3 arrays along \p dim
+       C Interface to join many arrays along a dimension.
+
+       Limited to 10 arrays. Empty arrays are ignored.
 
-        \param[in] dim is the dimension along which join occurs
-        \param[in] first is the first input array
-        \param[in] second is the second input array
-        \param[in] third is the third input array
-        \return the array that joins input arrays along the given dimension
+       \param[out] out      joined array
+       \param[in]  dim      dimension along which the join occurs
+       \param[in]  n_arrays number of arrays to join
+       \param[in]  inputs   array of af_arrays containing handles to the
+                             arrays to be joined
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_join
+       \ingroup manip_func_join
     */
-    AFAPI array join(const int dim, const array &first, const array &second, const array &third);
+    AFAPI af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays, const af_array *inputs);
 
     /**
-        \brief Join 4 arrays along \p dim
+       C Interface to generate a tiled array.
 
-        \param[in] dim is the dimension along which join occurs
-        \param[in] first is the first input array
-        \param[in] second is the second input array
-        \param[in] third is the third input array
-        \param[in] fourth is the fourth input array
-        \return the array that joins input arrays along the given dimension
+       Note, `x`, `y`, `z`, and `w` include the original in the count.
 
-        \ingroup manip_func_join
-    */
-    AFAPI array join(const int dim, const array &first, const array &second,
-                     const array &third, const array &fourth);
+       \param[out] out tiled array
+       \param[in]  in  input array
+       \param[in]  x   number of tiles along the first dimension
+       \param[in]  y   number of tiles along the second dimension
+       \param[in]  z   number of tiles along the third dimension
+       \param[in]  w   number of tiles along the fourth dimension
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[in] in is the input array
-        \param[in] x is the number of times \p in is tiled along first dimension
-        \param[in] y is the number of times \p in is tiled along second dimension
-        \param[in] z is the number of times \p in is tiled along third dimension
-        \param[in] w is the number of times \p in is tiled along fourth dimension
-        \return the tiled output
-
-        \ingroup manip_func_tile
+       \ingroup manip_func_tile
     */
-    AFAPI array tile(const array &in, const unsigned x, const unsigned y=1,
-                     const unsigned z=1, const unsigned w=1);
+    AFAPI af_err af_tile(af_array *out, const af_array in,
+                         const unsigned x, const unsigned y, const unsigned z, const unsigned w);
 
     /**
-        \param[in] in is the input array
-        \param[in] dims dim4 of tile dimensions
-        \return the tiled output
+       C Interface to reorder an array.
 
-        \ingroup manip_func_tile
-    */
-    AFAPI array tile(const array &in, const dim4 &dims);
+       \param[out] out reordered array
+       \param[in]  in  input array
+       \param[in]  x   specifies which dimension should be first
+       \param[in]  y   specifies which dimension should be second
+       \param[in]  z   specifies which dimension should be third
+       \param[in]  w   specifies which dimension should be fourth
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[in] in is the input array
-        \param[in] x specifies which dimension should be first
-        \param[in] y specifies which dimension should be second
-        \param[in] z specifies which dimension should be third
-        \param[in] w specifies which dimension should be fourth
-        \return the reordered output
-
-        \ingroup manip_func_reorder
+       \ingroup manip_func_reorder
     */
-    AFAPI array reorder(const array& in, const unsigned x,
-                        const unsigned y=1, const unsigned z=2, const unsigned w=3);
+    AFAPI af_err af_reorder(af_array *out, const af_array in,
+                            const unsigned x, const unsigned y, const unsigned z, const unsigned w);
 
     /**
-        \param[in] in is the input array
-        \param[in] x specifies the shift along first dimension
-        \param[in] y specifies the shift along second dimension
-        \param[in] z specifies the shift along third dimension
-        \param[in] w specifies the shift along fourth dimension
+       C Interface to shift an array.
 
-        \return the shifted output
+       \param[out] out shifted array
+       \param[in]  in  input array
+       \param[in]  x   specifies the shift along the first dimension
+       \param[in]  y   specifies the shift along the second dimension
+       \param[in]  z   specifies the shift along the third dimension
+       \param[in]  w   specifies the shift along the fourth dimension
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_shift
+       \ingroup manip_func_shift
     */
-    AFAPI array shift(const array& in, const int x, const int y=0, const int z=0, const int w=0);
+    AFAPI af_err af_shift(af_array *out, const af_array in, const int x, const int y, const int z, const int w);
 
     /**
-        \param[in] in is the input array
-        \param[in] ndims is the number of dimensions
-        \param[in] dims is the array containing the new dimensions
-        \return the modded output
+       C Interface to modify the dimensions of an input array to a specified
+       shape.
 
-        \ingroup manip_func_moddims
-    */
-    AFAPI array moddims(const array& in, const unsigned ndims, const dim_t * const dims);
+       \param[out] out   array with the modified dimensions
+       \param[in]  in    input array
+       \param[in]  ndims number of dimensions
+       \param[in]  dims  new dimension sizes
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[in] in is the input array
-        \param[in] dims is the new dimensions
-        \return the modded output
-
-        \ingroup manip_func_moddims
+       \ingroup manip_func_moddims
     */
-    AFAPI array moddims(const array& in, const dim4& dims);
+    AFAPI af_err af_moddims(af_array *out, const af_array in, const unsigned ndims, const dim_t * const dims);
 
     /**
-        \param[in] in is the input array
-        \param[in] d0 specifies the new size of the first dimension
-        \param[in] d1 specifies the new size of the second dimension
-        \param[in] d2 specifies the new size of the third dimension
-        \param[in] d3 specifies the new size of the fourth dimension
-        \return the modded array
-
-        \ingroup manip_func_moddims
-    */
-    AFAPI array moddims(const array& in, const dim_t d0, const dim_t d1=1, const dim_t d2=1, const dim_t d3=1);
+       C Interface to flatten an array.
 
-    /**
-        \param[in] in is the input array
-        \return the flat array
+       \param[out] out flat array
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_flat
+       \ingroup manip_func_flat
     */
-    AFAPI array flat(const array &in);
+    AFAPI af_err af_flat(af_array *out, const af_array in);
 
     /**
-        \param[in] in is the input array
-        \param[in] dim is the dimensions to flip the array
-        \return the flipped array
+       C Interface to flip an array.
 
-        \ingroup manip_func_flip
-    */
-    AFAPI array flip(const array &in, const unsigned dim);
-
-    /**
-        \param[in] in is the input matrix
-        \param[in] is_unit_diag is a boolean parameter specifying if the diagonal elements should be 1
-        \return the lower triangle array
+       \param[out] out flipped array
+       \param[in]  in  input array
+       \param[in]  dim dimension to flip
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_lower
+       \ingroup manip_func_flip
     */
-    AFAPI array lower(const array &in, bool is_unit_diag=false);
+    AFAPI af_err af_flip(af_array *out, const af_array in, const unsigned dim);
 
     /**
-        \param[in] in is the input matrix
-        \param[in] is_unit_diag is a boolean parameter specifying if the diagonal elements should be 1
-        \return the upper triangle matrix
+       C Interface to return the lower triangle array.
 
-        \ingroup data_func_upper
-    */
-    AFAPI array upper(const array &in, bool is_unit_diag=false);
+       \param[out] out          lower triangle array
+       \param[in]  in           input array
+       \param[in]  is_unit_diag boolean specifying if diagonal elements are 1's
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-      @}
+       \ingroup data_func_lower
     */
-}
-#endif
-
-#ifdef __cplusplus
-extern "C" {
-#endif
+    AFAPI af_err af_lower(af_array *out, const af_array in, bool is_unit_diag);
 
     /**
-       \defgroup data_func_constant constant
-       Create constant array from the specified dimensions
-       @{
+       C Interface to return the upper triangle array.
 
-       \ingroup arrayfire_func
-       \ingroup data_mat
-    */
-    AFAPI af_err af_constant(af_array *arr, const double val, const unsigned ndims, const dim_t * const dims, const af_dtype type);
-
-    AFAPI af_err af_constant_complex(af_array *arr, const double real, const double imag,
-                                     const unsigned ndims, const dim_t * const dims, const af_dtype type);
-
-    AFAPI af_err af_constant_long (af_array *arr, const  intl val, const unsigned ndims, const dim_t * const dims);
+       \param[out] out          upper triangle array
+       \param[in]  in           input array
+       \param[in]  is_unit_diag boolean specifying if diagonal elements are 1's
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    AFAPI af_err af_constant_ulong(af_array *arr, const uintl val, const unsigned ndims, const dim_t * const dims);
-    /**
-       @}
+       \ingroup data_func_upper
     */
+    AFAPI af_err af_upper(af_array *out, const af_array in, bool is_unit_diag);
 
+#if AF_API_VERSION >= 31
     /**
-        \param[out] out is the generated array
-        \param[in] ndims is size of dimension array \p dims
-        \param[in] dims is the array containing sizes of the dimension
-        \param[in] seq_dim is dimension along which [0, dim[seq_dim] - 1] is generated
-        \param[in] type is the type of array to generate
+       C Interface to select elements based on a conditional array.
 
-        \ingroup data_func_range
-    */
-    AFAPI af_err af_range(af_array *out, const unsigned ndims, const dim_t * const dims,
-                          const int seq_dim, const af_dtype type);
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select array element
+       \param[in]  b    when false, select array element
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[out] out is the generated array
-        \param[in] ndims is size of dimension array \p dims
-        \param[in] dims is the array containing sizes of the dimension
-        \param[in] t_ndims is size of tile array \p tdims
-        \param[in] tdims is array containing the number of repetitions of the unit dimensions
-        \param[in] type is the type of array to generate
-
-        \ingroup data_func_iota
+       \ingroup data_func_select
     */
-    AFAPI af_err af_iota(af_array *out, const unsigned ndims, const dim_t * const dims,
-                         const unsigned t_ndims, const dim_t * const tdims, const af_dtype type);
+    AFAPI af_err af_select(af_array *out, const af_array cond, const af_array a, const af_array b);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \param[out] out is the generated array
-        \param[in] ndims is size of dimension array \p dims
-        \param[in] dims is the array containing sizes of the dimension
-        \param[in] type is the type of array to generate
+       C Interface to select elements based on a conditional array.
+
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select array element
+       \param[in]  b    when false, select scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \ingroup data_func_randu
+       \ingroup data_func_select
     */
-    AFAPI af_err af_randu(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type);
+    AFAPI af_err af_select_scalar_r(af_array *out, const af_array cond, const af_array a, const double b);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \param[out] out is the generated array
-        \param[in] ndims is size of dimension array \p dims
-        \param[in] dims is the array containing sizes of the dimension
-        \param[in] type is the type of array to generate
+       C Interface to select elements based on a conditional array.
 
-       \ingroup data_func_randn
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select scalar value
+       \param[in]  b    when false, select array element
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup data_func_select
     */
-    AFAPI af_err af_randn(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type);
+    AFAPI af_err af_select_scalar_l(af_array *out, const af_array cond, const double a, const af_array b);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \defgroup data_func_setseed setSeed
-        Set the seed for the random number generator
+       C Interface to replace elements of an array with elements of another
+       array.
 
+       Elements of `a` are replaced with corresponding elements of `b` when
+       `cond` is false.
 
-        \param[in] seed is a 64 bit unsigned integer
+       \param[inout]  a    input array
+       \param[in]     cond conditional array
+       \param[in]     b    replacement array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_mat
-        \ingroup arrayfire_func
+       \ingroup data_func_replace
     */
-    AFAPI af_err af_set_seed(const uintl seed);
+    AFAPI af_err af_replace(af_array a, const af_array cond, const af_array b);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \defgroup data_func_getseed getSeed
-        Get the seed for the random number generator
+       C Interface to replace elements of an array with a scalar value.
 
+       Elements of `a` are replaced with a scalar value when `cond` is false.
 
-        \param[out] seed which is a 64 bit unsigned integer
+       \param[inout] a    input array
+       \param[in]    cond conditional array
+       \param[in]    b    replacement scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_mat
-        \ingroup arrayfire_func
+       \ingroup data_func_replace
     */
-    AFAPI af_err af_get_seed(uintl *seed);
+    AFAPI af_err af_replace_scalar(af_array a, const af_array cond, const double b);
+#endif
 
+#if AF_API_VERSION >= 37
+    /**
+       C Interface to pad an array.
+
+       \param[out] out           padded array
+       \param[in]  in            input array
+       \param[in]  begin_ndims   number of dimensions for start padding
+       \param[in]  begin_dims    number of elements to be padded at the start
+                                 of each dimension
+       \param[in]  end_ndims     number of dimensions for end padding
+       \param[in]  end_dims      number of elements to be padded at the end of
+                                 each dimension
+       \param[in]  pad_fill_type values to fill into the padded region
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup data_func_pad
+    */
+    AFAPI af_err af_pad(af_array *out, const af_array in,
+                        const unsigned begin_ndims,
+                        const dim_t *const begin_dims, const unsigned end_ndims,
+                        const dim_t *const end_dims,
+                        const af_border_type pad_fill_type);
+#endif
 
+#if AF_API_VERSION >= 39
     /**
-        \param[out] out is the generated array
-        \param[in] ndims is size of dimension array \p dims
-        \param[in] dims is the array containing sizes of the dimension
-        \param[in] type is the type of array to generate
+       C Interface to replace elements of an array with a scalar value.
 
-        \ingroup data_func_identity
-    */
-    AFAPI af_err af_identity(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type);
+       Elements of `a` are replaced with a scalar value when `cond` is false.
 
-    /**
-        \param[out] out is the array created from the input array \p in
-        \param[in] in is the input array which is the diagonal
-        \param[in] num is the diagonal index
+       \param[inout] a    input array
+       \param[in]    cond conditional array
+       \param[in]    b    replacement scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_diag
+       \ingroup data_func_replace
     */
-    AFAPI af_err af_diag_create(af_array *out, const af_array in, const int num);
+    AFAPI af_err af_replace_scalar_long(af_array a, const af_array cond,
+                                        const long long b);
 
     /**
-        \param[out] out is the \p num -th diagonal of \p in
-        \param[in] in is the input matrix
-        \param[in] num is the diagonal index
+       C Interface to replace elements of an array with a scalar value.
 
-        \ingroup data_func_diag
-    */
-    AFAPI af_err af_diag_extract(af_array *out, const af_array in, const int num);
+       Elements of `a` are replaced with a scalar value when `cond` is false.
 
-    /**
-        \brief Join 2 arrays along \p dim
-
-        \param[out] out is the generated array
-        \param[in] dim is the dimension along which join occurs
-        \param[in] first is the first input array
-        \param[in] second is the second input array
+       \param[inout] a    input array
+       \param[in]    cond conditional array
+       \param[in]    b    replacement scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_join
+       \ingroup data_func_replace
     */
-    AFAPI af_err af_join(af_array *out, const int dim, const af_array first, const af_array second);
+    AFAPI af_err af_replace_scalar_ulong(af_array a, const af_array cond,
+                                         const unsigned long long b);
 
     /**
-        \brief Join many arrays along \p dim
+       C Interface to select elements based on a conditional array.
 
-        Current limit is set to 10 arrays.
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select array element
+       \param[in]  b    when false, select scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \param[out] out is the generated array
-        \param[in] dim is the dimension along which join occurs
-        \param[in] n_arrays number of arrays to join
-        \param[in] inputs is an array of af_arrays containing handles to the arrays to be joined
-
-        \ingroup manip_func_join
+       \ingroup data_func_select
     */
-    AFAPI af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays, const af_array *inputs);
+    AFAPI af_err af_select_scalar_r_long(af_array *out, const af_array cond,
+                                         const af_array a, const long long b);
 
     /**
-        \param[out] out is the generated array
-        \param[in] in is the input matrix
-        \param[in] x is the number of times \p in is tiled along first dimension
-        \param[in] y is the number of times \p in is tiled along second dimension
-        \param[in] z is the number of times \p in is tiled along third dimension
-        \param[in] w is the number of times \p in is tiled along fourth dimension
-
-        \ingroup manip_func_tile
-    */
-    AFAPI af_err af_tile(af_array *out, const af_array in,
-                         const unsigned x, const unsigned y, const unsigned z, const unsigned w);
+       C Interface to select elements based on a conditional array.
 
-    /**
-        \param[out] out is the reordered array
-        \param[in] in is the input matrix
-        \param[in] x specifies which dimension should be first
-        \param[in] y specifies which dimension should be second
-        \param[in] z specifies which dimension should be third
-        \param[in] w specifies which dimension should be fourth
-
-        \ingroup manip_func_reorder
-    */
-    AFAPI af_err af_reorder(af_array *out, const af_array in,
-                            const unsigned x, const unsigned y, const unsigned z, const unsigned w);
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select array element
+       \param[in]  b    when false, select scalar value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-    /**
-        \param[in] out is the shifted array
-        \param[in] in is the input array
-        \param[in] x specifies the shift along first dimension
-        \param[in] y specifies the shift along second dimension
-        \param[in] z specifies the shift along third dimension
-        \param[in] w specifies the shift along fourth dimension
-
-        \ingroup manip_func_shift
+       \ingroup data_func_select
     */
-    AFAPI af_err af_shift(af_array *out, const af_array in, const int x, const int y, const int z, const int w);
+    AFAPI af_err af_select_scalar_r_ulong(af_array *out, const af_array cond,
+                                          const af_array a,
+                                          const unsigned long long b);
 
     /**
-        \param[out] out is the modded array
-        \param[in] in is the input array
-        \param[in] ndims is the number of dimensions
-        \param[in] dims is the array containing the new dimensions
-
-        \ingroup manip_func_moddims
-    */
-    AFAPI af_err af_moddims(af_array *out, const af_array in, const unsigned ndims, const dim_t * const dims);
+       C Interface to select elements based on a conditional array.
 
-    /**
-        \param[out] out is the flat array
-        \param[in] in is the input array
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select scalar value
+       \param[in]  b    when false, select array element
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup manip_func_flat
+       \ingroup data_func_select
     */
-    AFAPI af_err af_flat(af_array *out, const af_array in);
+    AFAPI af_err af_select_scalar_l_long(af_array *out, const af_array cond,
+                                         const long long a, const af_array b);
 
     /**
-        \param[out] out is the flipped array
-        \param[in] in is the input array
-        \param[in] dim is the dimensions to flip the array
+       C Interface to select elements based on a conditional array.
 
-        \ingroup manip_func_flip
-    */
-    AFAPI af_err af_flip(af_array *out, const af_array in, const unsigned dim);
-
-    /**
-        \param[out] out is the lower traingle matrix
-        \param[in] in is the input matrix
-        \param[in] is_unit_diag is a boolean parameter specifying if the diagonal elements should be 1
+       \param[out] out  `a` when `cond` is true, else `b`
+       \param[in]  cond conditional array
+       \param[in]  a    when true, select scalar value
+       \param[in]  b    when false, select array element
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-        \ingroup data_func_lower
+       \ingroup data_func_select
     */
-    AFAPI af_err af_lower(af_array *out, const af_array in, bool is_unit_diag);
-
-    /**
-        \param[out] out is the upper triangle matrix
-        \param[in] in is the input matrix
-        \param[in] is_unit_diag is a boolean parameter specifying if the diagonal elements should be 1
+    AFAPI af_err af_select_scalar_l_ulong(af_array *out, const af_array cond,
+                                          const unsigned long long a,
+                                          const af_array b);
+#endif
 
-        \ingroup data_func_upper
-    */
-    AFAPI af_err af_upper(af_array *out, const af_array in, bool is_unit_diag);
-    /**
-      @}
-    */
 #ifdef __cplusplus
 }
 #endif
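Reviewer note: the element-wise behaviour documented above for `af_select`, the `af_select_scalar_r*` variants, and `af_replace` can be modelled on plain host buffers. The block below is a minimal sketch in C under stated assumptions, not ArrayFire's implementation. The names `model_select`, `model_select_scalar_r`, and `model_replace` are hypothetical, and the arrays are ordinary `double` buffers rather than `af_array` handles.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of the documented af_select semantics:
 * out[i] is a[i] where cond[i] is true, else b[i]. */
void model_select(double *out, const unsigned char *cond,
                  const double *a, const double *b, size_t n) {
    size_t i;
    assert(out && cond && a && b);
    for (i = 0; i < n; ++i) out[i] = cond[i] ? a[i] : b[i];
}

/* Hypothetical model of the af_select_scalar_r family:
 * the "false" branch is a single scalar instead of an array. */
void model_select_scalar_r(double *out, const unsigned char *cond,
                           const double *a, double b, size_t n) {
    size_t i;
    assert(out && cond && a);
    for (i = 0; i < n; ++i) out[i] = cond[i] ? a[i] : b;
}

/* Hypothetical model of the documented af_replace semantics:
 * a is updated in place, and only where cond[i] is FALSE. */
void model_replace(double *a, const unsigned char *cond,
                   const double *b, size_t n) {
    size_t i;
    assert(a && cond && b);
    for (i = 0; i < n; ++i) {
        if (!cond[i]) a[i] = b[i];
    }
}
```

The point worth checking in review is the polarity: `af_select` picks `a` where `cond` is true, while `af_replace` overwrites `a` where `cond` is false, so both leave `a`'s value in the "true" positions.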
diff --git a/include/af/defines.h b/include/af/defines.h
index 83f95c8170..42f71024fa 100644
--- a/include/af/defines.h
+++ b/include/af/defines.h
@@ -9,6 +9,10 @@
 
 #pragma once
 
+#ifndef __CUDACC_RTC__
+#include <af/compilers.h>
+#endif
+
 #if defined(_WIN32) || defined(_MSC_VER)
     // http://msdn.microsoft.com/en-us/library/b0084kay(v=VS.80).aspx
     // http://msdn.microsoft.com/en-us/library/3y1sfaz2%28v=VS.80%29.aspx
@@ -18,24 +22,27 @@
         #define AFAPI  __declspec(dllimport)
     #endif
 
-// bool
+    // bool
     #ifndef __cplusplus
         #define bool unsigned char
         #define false 0
         #define true  1
     #endif
     #define __PRETTY_FUNCTION__ __FUNCSIG__
-    #define snprintf sprintf_s
-    #define STATIC_ static
     #define SIZE_T_FRMT_SPECIFIER "%Iu"
-    #define DEPRECATED(msg) __declspec(deprecated( msg ))
+    #define AF_DEPRECATED(msg) __declspec(deprecated( msg ))
+    #if _MSC_VER >= 1800
+        #define AF_HAS_VARIADIC_TEMPLATES
+    #endif
 #else
     #define AFAPI   __attribute__((visibility("default")))
     #include <stdbool.h>
-    #define __PRETTY_FUNCTION__ __func__
-    #define STATIC_ inline
     #define SIZE_T_FRMT_SPECIFIER "%zu"
-    #define DEPRECATED(msg) __attribute__((deprecated( msg )))
+#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ > 4)
+    #define AF_DEPRECATED(msg) __attribute__((deprecated( msg )))
+#else
+    #define AF_DEPRECATED(msg) __attribute__((deprecated))
+#endif
 #endif
 
 // Known 64-bit x86 and ARM architectures use long long
@@ -49,140 +56,185 @@
     typedef long long   dim_t;
 #endif
 
-#ifdef __cplusplus
-#include <complex>
-#include <cstddef>
-
-typedef std::complex<float> af_cfloat;
-typedef std::complex<double> af_cdouble;
-
-#else
 #include <stdlib.h>
 
-typedef struct {
-    float x;
-    float y;
-} af_cfloat;
-
-typedef struct {
-    double x;
-    double y;
-} af_cdouble;
-
+#ifndef AFDLL  // prevents the use of these types internally
+typedef AF_DEPRECATED("intl is deprecated. Use long long instead.") long long intl;
+typedef AF_DEPRECATED("uintl is deprecated. Use unsigned long long instead.") unsigned long long uintl;
 #endif
 
-typedef long long intl;
-typedef unsigned long long uintl;
+#include <af/version.h>
+#ifndef AF_API_VERSION
+#define AF_API_VERSION AF_API_VERSION_CURRENT
+#endif
 
 typedef enum {
     ///
     /// The function returned successfully
     ///
-    AF_SUCCESS            =   0,
+    AF_SUCCESS            =   0
 
     // 100-199 Errors in environment
 
     ///
     /// The system or device ran out of memory
     ///
-    AF_ERR_NO_MEM         = 101,
+    , AF_ERR_NO_MEM         = 101
 
     ///
     /// There was an error in the device driver
     ///
-    AF_ERR_DRIVER         = 102,
+    , AF_ERR_DRIVER         = 102
 
     ///
     /// There was an error with the runtime environment
     ///
-    AF_ERR_RUNTIME        = 103,
+    , AF_ERR_RUNTIME        = 103
 
     // 200-299 Errors in input parameters
 
     ///
     /// The input array is not a valid af_array object
     ///
-    AF_ERR_INVALID_ARRAY  = 201,
+    , AF_ERR_INVALID_ARRAY  = 201
 
     ///
     /// One of the function arguments is incorrect
     ///
-    AF_ERR_ARG            = 202,
+    , AF_ERR_ARG            = 202
 
     ///
     /// The size is incorrect
     ///
-    AF_ERR_SIZE           = 203,
+    , AF_ERR_SIZE           = 203
 
     ///
     /// The type is not suppported by this function
     ///
-    AF_ERR_TYPE           = 204,
+    , AF_ERR_TYPE           = 204
 
     ///
     /// The type of the input arrays are not compatible
     ///
-    AF_ERR_DIFF_TYPE      = 205,
+    , AF_ERR_DIFF_TYPE      = 205
 
     ///
     /// Function does not support GFOR / batch mode
     ///
-    AF_ERR_BATCH          = 207,
+    , AF_ERR_BATCH          = 207
 
 
+#if AF_API_VERSION >= 33
+    ///
+    /// Input does not belong to the current device.
+    ///
+    , AF_ERR_DEVICE         = 208
+#endif
+
     // 300-399 Errors for missing software features
 
     ///
     /// The option is not supported
     ///
-    AF_ERR_NOT_SUPPORTED  = 301,
+    , AF_ERR_NOT_SUPPORTED  = 301
 
     ///
     /// This build of ArrayFire does not support this feature
     ///
-    AF_ERR_NOT_CONFIGURED = 302,
+    , AF_ERR_NOT_CONFIGURED = 302
+
+#if AF_API_VERSION >= 32
+    ///
+    /// This build of ArrayFire is not compiled with "nonfree" algorithms
+    ///
+    , AF_ERR_NONFREE        = 303
+#endif
+
     // 400-499 Errors for missing hardware features
 
     ///
     /// This device does not support double
     ///
-    AF_ERR_NO_DBL         = 401,
+    , AF_ERR_NO_DBL         = 401
 
     ///
     /// This build of ArrayFire was not built with graphics or this device does
     /// not support graphics
     ///
-    AF_ERR_NO_GFX         = 402,
+    , AF_ERR_NO_GFX         = 402
+
+#if AF_API_VERSION >= 37
+    ///
+    /// This device does not support half
+    ///
+    , AF_ERR_NO_HALF        = 403
+#endif
+
+    // 500-599 Errors specific to heterogeneous API
+
+#if AF_API_VERSION >= 32
+    ///
+    /// There was an error when loading the libraries
+    ///
+    , AF_ERR_LOAD_LIB       = 501
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// There was an error when loading the symbols
+    ///
+    , AF_ERR_LOAD_SYM       = 502
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// There was a mismatch between the input array and the active backend
+    ///
+    , AF_ERR_ARR_BKND_MISMATCH    = 503
+#endif
+
     // 900-999 Errors from upstream libraries and runtimes
 
     ///
     /// There was an internal error either in ArrayFire or in a project
     /// upstream
     ///
-    AF_ERR_INTERNAL       = 998,
+    , AF_ERR_INTERNAL       = 998
 
     ///
     /// Unknown Error
     ///
-    AF_ERR_UNKNOWN        = 999
+    , AF_ERR_UNKNOWN        = 999
 } af_err;
 
 typedef enum {
     f32,    ///< 32-bit floating point values
     c32,    ///< 32-bit complex floating point values
-    f64,    ///< 64-bit complex floating point values
+    f64,    ///< 64-bit floating point values
     c64,    ///< 64-bit complex floating point values
-    b8,     ///< 8-bit boolean values
+    b8 ,    ///< 8-bit boolean values
     s32,    ///< 32-bit signed integral values
     u32,    ///< 32-bit unsigned integral values
-    u8,     ///< 8-bit unsigned integral values
+    u8 ,    ///< 8-bit unsigned integral values
     s64,    ///< 64-bit signed integral values
-    u64     ///< 64-bit unsigned integral values
+    u64    ///< 64-bit unsigned integral values
+#if AF_API_VERSION >= 32
+    , s16    ///< 16-bit signed integral values
+#endif
+#if AF_API_VERSION >= 32
+    , u16    ///< 16-bit unsigned integral values
+#endif
+#if AF_API_VERSION >= 37
+    , f16    ///< 16-bit floating point values
+#endif
+#if AF_API_VERSION >= 310
+    , s8     ///< 8-bit signed integral values
+#endif
 } af_dtype;
 
 typedef enum {
     afDevice,   ///< Device pointer
-    afHost,     ///< Host pointer
+    afHost     ///< Host pointer
 } af_source;
 
 #define AF_MAX_DIMS 4
@@ -191,10 +243,27 @@ typedef enum {
 typedef void * af_array;
 
 typedef enum {
-    AF_INTERP_NEAREST,  ///< Nearest Interpolation
-    AF_INTERP_LINEAR,   ///< Linear Interpolation
-    AF_INTERP_BILINEAR, ///< Bilinear Interpolation
-    AF_INTERP_CUBIC     ///< Cubic Interpolation
+    AF_INTERP_NEAREST,         ///< Nearest Interpolation
+    AF_INTERP_LINEAR,          ///< Linear Interpolation
+    AF_INTERP_BILINEAR,        ///< Bilinear Interpolation
+    AF_INTERP_CUBIC,           ///< Cubic Interpolation
+    AF_INTERP_LOWER           ///< Floor Indexed
+#if AF_API_VERSION >= 34
+    , AF_INTERP_LINEAR_COSINE   ///< Linear Interpolation with cosine smoothing
+#endif
+#if AF_API_VERSION >= 34
+    , AF_INTERP_BILINEAR_COSINE ///< Bilinear Interpolation with cosine smoothing
+#endif
+#if AF_API_VERSION >= 34
+    , AF_INTERP_BICUBIC         ///< Bicubic Interpolation
+#endif
+#if AF_API_VERSION >= 34
+    , AF_INTERP_CUBIC_SPLINE    ///< Cubic Interpolation with Catmull-Rom splines
+#endif
+#if AF_API_VERSION >= 34
+    , AF_INTERP_BICUBIC_SPLINE  ///< Bicubic Interpolation with Catmull-Rom splines
+#endif
+
 } af_interp_type;
 
 typedef enum {
@@ -206,7 +275,17 @@ typedef enum {
     ///
     /// Out of bound values are symmetric over the edge
     ///
-    AF_PAD_SYM
+    AF_PAD_SYM,
+
+    ///
+    /// Out of bound values are clamped to the edge
+    ///
+    AF_PAD_CLAMP_TO_EDGE,
+
+    ///
+    /// Out of bound values are mapped to the range of the dimension in a cyclic fashion
+    ///
+    AF_PAD_PERIODIC
 } af_border_type;
 
 typedef enum {
@@ -231,13 +310,13 @@ typedef enum {
     ///
     /// Output of the convolution is signal_len + filter_len - 1
     ///
-    AF_CONV_EXPAND,
+    AF_CONV_EXPAND
 } af_conv_mode;
 
 typedef enum {
     AF_CONV_AUTO,    ///< ArrayFire automatically picks the right convolution algorithm
     AF_CONV_SPATIAL, ///< Perform convolution in spatial domain
-    AF_CONV_FREQ,    ///< Perform convolution in frequency domain
+    AF_CONV_FREQ     ///< Perform convolution in frequency domain
 } af_conv_domain;
 
 typedef enum {
@@ -252,16 +331,28 @@ typedef enum {
     AF_SHD        ///< Match based on Sum of Hamming Distances (SHD)
 } af_match_type;
 
+#if AF_API_VERSION >= 31
+typedef enum {
+    AF_YCC_601 = 601,  ///< ITU-R BT.601 (formerly CCIR 601) standard
+    AF_YCC_709 = 709,  ///< ITU-R BT.709 standard
+    AF_YCC_2020 = 2020 ///< ITU-R BT.2020 standard
+} af_ycc_std;
+#endif
+
 typedef enum {
     AF_GRAY = 0, ///< Grayscale
     AF_RGB,      ///< 3-channel RGB
     AF_HSV       ///< 3-channel HSV
+#if AF_API_VERSION >= 31
+    , AF_YCbCr     ///< 3-channel YCbCr
+#endif
 } af_cspace_t;
 
 typedef enum {
     AF_MAT_NONE       = 0,    ///< Default
     AF_MAT_TRANS      = 1,    ///< Data needs to be transposed
     AF_MAT_CTRANS     = 2,    ///< Data needs to be conjugate tansposed
+    AF_MAT_CONJ       = 4,    ///< Data needs to be conjugated
     AF_MAT_UPPER      = 32,   ///< Matrix is upper triangular
     AF_MAT_LOWER      = 64,   ///< Matrix is lower triangular
     AF_MAT_DIAG_UNIT  = 128,  ///< Matrix diagonal contains unitary values
@@ -282,18 +373,54 @@ typedef enum {
     AF_NORM_MATRIX_2,      ///< returns the max singular value). Currently NOT SUPPORTED
     AF_NORM_MATRIX_L_PQ,   ///< returns Lpq-norm
 
-    AF_NORM_EUCLID = AF_NORM_VECTOR_2, ///< The default. Same as AF_NORM_VECTOR_2
+    AF_NORM_EUCLID = AF_NORM_VECTOR_2 ///< The default. Same as AF_NORM_VECTOR_2
 } af_norm_type;
 
+#if AF_API_VERSION >= 31
 typedef enum {
-    AF_COLORMAP_DEFAULT = 0,    ///< Default grayscale map
-    AF_COLORMAP_SPECTRUM= 1,    ///< Spectrum map
-    AF_COLORMAP_COLORS  = 2,    ///< Colors
-    AF_COLORMAP_RED     = 3,    ///< Red hue map
-    AF_COLORMAP_MOOD    = 4,    ///< Mood map
-    AF_COLORMAP_HEAT    = 5,    ///< Heat map
-    AF_COLORMAP_BLUE    = 6     ///< Blue hue map
-} af_colormap;
+    AF_FIF_BMP          = 0,    ///< FreeImage Enum for Bitmap File
+    AF_FIF_ICO          = 1,    ///< FreeImage Enum for Windows Icon File
+    AF_FIF_JPEG         = 2,    ///< FreeImage Enum for JPEG File
+    AF_FIF_JNG          = 3,    ///< FreeImage Enum for JPEG Network Graphics File
+    AF_FIF_PNG          = 13,   ///< FreeImage Enum for Portable Network Graphics File
+    AF_FIF_PPM          = 14,   ///< FreeImage Enum for Portable Pixelmap (ASCII) File
+    AF_FIF_PPMRAW       = 15,   ///< FreeImage Enum for Portable Pixelmap (Binary) File
+    AF_FIF_TIFF         = 18,   ///< FreeImage Enum for Tagged Image File Format File
+    AF_FIF_PSD          = 20,   ///< FreeImage Enum for Adobe Photoshop File
+    AF_FIF_HDR          = 26,   ///< FreeImage Enum for High Dynamic Range File
+    AF_FIF_EXR          = 29,   ///< FreeImage Enum for ILM OpenEXR File
+    AF_FIF_JP2          = 31,   ///< FreeImage Enum for JPEG-2000 File
+    AF_FIF_RAW          = 34    ///< FreeImage Enum for RAW Camera Image File
+} af_image_format;
+#endif
+
+#if AF_API_VERSION >= 34
+typedef enum {
+    AF_MOMENT_M00 = 1,
+    AF_MOMENT_M01 = 2,
+    AF_MOMENT_M10 = 4,
+    AF_MOMENT_M11 = 8,
+    AF_MOMENT_FIRST_ORDER = AF_MOMENT_M00 | AF_MOMENT_M01 | AF_MOMENT_M10 | AF_MOMENT_M11
+} af_moment_type;
+#endif
+
+#if AF_API_VERSION >= 32
+typedef enum {
+    AF_HOMOGRAPHY_RANSAC = 0,   ///< Computes homography using RANSAC
+    AF_HOMOGRAPHY_LMEDS  = 1    ///< Computes homography using Least Median of Squares
+} af_homography_type;
+#endif
+
+#if AF_API_VERSION >= 32
+// These enums should be 2^x
+typedef enum {
+    AF_BACKEND_DEFAULT = 0,  ///< Default backend order: OpenCL -> CUDA -> CPU
+    AF_BACKEND_CPU     = 1,  ///< CPU, a.k.a. sequential algorithms
+    AF_BACKEND_CUDA    = 2,  ///< CUDA Compute Backend
+    AF_BACKEND_OPENCL  = 4,  ///< OpenCL Compute Backend
+    AF_BACKEND_ONEAPI  = 8   ///< OneAPI Compute Backend
+} af_backend;
+#endif
 
 // Below enum is purely added for example purposes
 // it doesn't and shoudn't be used anywhere in the
@@ -302,11 +429,130 @@ typedef enum {
     AF_ID = 0
 } af_someenum_t;
 
+#if AF_API_VERSION >= 34
+typedef enum {
+    AF_BINARY_ADD  = 0,
+    AF_BINARY_MUL  = 1,
+    AF_BINARY_MIN  = 2,
+    AF_BINARY_MAX  = 3
+} af_binary_op;
+#endif
+
+#if AF_API_VERSION >= 34
+typedef enum {
+    AF_RANDOM_ENGINE_PHILOX_4X32_10     = 100,                                  ///< Philox variant with N = 4, W = 32 and Rounds = 10
+    AF_RANDOM_ENGINE_THREEFRY_2X32_16   = 200,                                  ///< Threefry variant with N = 2, W = 32 and Rounds = 16
+    AF_RANDOM_ENGINE_MERSENNE_GP11213   = 300,                                  ///< Mersenne variant with MEXP = 11213
+    AF_RANDOM_ENGINE_PHILOX             = AF_RANDOM_ENGINE_PHILOX_4X32_10,      ///< Resolves to Philox 4x32_10
+    AF_RANDOM_ENGINE_THREEFRY           = AF_RANDOM_ENGINE_THREEFRY_2X32_16,    ///< Resolves to Threefry 2X32_16
+    AF_RANDOM_ENGINE_MERSENNE           = AF_RANDOM_ENGINE_MERSENNE_GP11213,    ///< Resolves to Mersenne GP 11213
+    AF_RANDOM_ENGINE_DEFAULT            = AF_RANDOM_ENGINE_PHILOX               ///< Resolves to Philox
+} af_random_engine_type;
+#endif
+
+////////////////////////////////////////////////////////////////////////////////
+// FORGE / Graphics Related Enums
+// These enums have values corresponding to Forge enums in forge defines.h
+////////////////////////////////////////////////////////////////////////////////
+typedef enum {
+    AF_COLORMAP_DEFAULT = 0,    ///< Default grayscale map
+    AF_COLORMAP_SPECTRUM= 1,    ///< Spectrum map (390nm-830nm, in sRGB colorspace)
+    AF_COLORMAP_COLORS  = 2,    ///< Colors, a.k.a. Rainbow
+    AF_COLORMAP_RED     = 3,    ///< Red hue map
+    AF_COLORMAP_MOOD    = 4,    ///< Mood map
+    AF_COLORMAP_HEAT    = 5,    ///< Heat map
+    AF_COLORMAP_BLUE    = 6,    ///< Blue hue map
+    AF_COLORMAP_INFERNO = 7,    ///< Perceptually uniform shades of black-red-yellow
+    AF_COLORMAP_MAGMA   = 8,    ///< Perceptually uniform shades of black-red-white
+    AF_COLORMAP_PLASMA  = 9,    ///< Perceptually uniform shades of blue-red-yellow
+    AF_COLORMAP_VIRIDIS = 10    ///< Perceptually uniform shades of blue-green-yellow
+} af_colormap;
+
+#if AF_API_VERSION >= 32
+typedef enum {
+    AF_MARKER_NONE         = 0,
+    AF_MARKER_POINT        = 1,
+    AF_MARKER_CIRCLE       = 2,
+    AF_MARKER_SQUARE       = 3,
+    AF_MARKER_TRIANGLE     = 4,
+    AF_MARKER_CROSS        = 5,
+    AF_MARKER_PLUS         = 6,
+    AF_MARKER_STAR         = 7
+} af_marker_type;
+#endif
+////////////////////////////////////////////////////////////////////////////////
+
+#if AF_API_VERSION >= 35
+typedef enum {
+    AF_CANNY_THRESHOLD_MANUAL    = 0, ///< User has to define Canny thresholds manually
+    AF_CANNY_THRESHOLD_AUTO_OTSU = 1  ///< Determine Canny algorithm thresholds using the Otsu algorithm
+} af_canny_threshold;
+#endif
+
+#if AF_API_VERSION >= 34
+typedef enum {
+    AF_STORAGE_DENSE     = 0,   ///< Storage type is dense
+    AF_STORAGE_CSR       = 1,   ///< Storage type is CSR
+    AF_STORAGE_CSC       = 2,   ///< Storage type is CSC
+    AF_STORAGE_COO       = 3    ///< Storage type is COO
+} af_storage;
+#endif
+
+#if AF_API_VERSION >= 36
+typedef enum {
+    AF_FLUX_QUADRATIC   = 1,    ///< Quadratic flux function
+    AF_FLUX_EXPONENTIAL = 2,    ///< Exponential flux function
+    AF_FLUX_DEFAULT     = 0     ///< Default flux function is exponential
+} af_flux_function;
+
+typedef enum {
+    AF_DIFFUSION_GRAD = 1,      ///< Gradient diffusion equation
+    AF_DIFFUSION_MCDE = 2,      ///< Modified curvature diffusion equation
+    AF_DIFFUSION_DEFAULT = 0    ///< Default option is same as AF_DIFFUSION_GRAD
+} af_diffusion_eq;
+
+typedef enum {
+    AF_TOPK_MIN         = 1,  ///< Top k min values
+    AF_TOPK_MAX         = 2,  ///< Top k max values
+    AF_TOPK_STABLE      = 4,  ///< Preserve order of indices for equal values
+    AF_TOPK_STABLE_MIN  = AF_TOPK_STABLE | AF_TOPK_MIN, ///< Top k min with stable indices
+    AF_TOPK_STABLE_MAX  = AF_TOPK_STABLE | AF_TOPK_MAX, ///< Top k max with stable indices
+    AF_TOPK_DEFAULT = 0   ///< Default option (max)
+} af_topk_function;
+#endif
+
+#if AF_API_VERSION >= 37
+typedef enum {
+    AF_VARIANCE_DEFAULT    = 0, ///< Default (Population) variance
+    AF_VARIANCE_SAMPLE     = 1, ///< Sample variance
+    AF_VARIANCE_POPULATION = 2  ///< Population variance
+} af_var_bias;
+
+typedef enum {
+    AF_ITERATIVE_DECONV_LANDWEBER       = 1,        ///< Landweber Deconvolution
+    AF_ITERATIVE_DECONV_RICHARDSONLUCY  = 2,        ///< Richardson-Lucy Deconvolution
+    AF_ITERATIVE_DECONV_DEFAULT         = 0         ///< Default is Landweber deconvolution
+} af_iterative_deconv_algo;
+
+typedef enum {
+    AF_INVERSE_DECONV_TIKHONOV       = 1,        ///< Tikhonov Inverse deconvolution
+    AF_INVERSE_DECONV_DEFAULT        = 0         ///< Default is Tikhonov deconvolution
+} af_inverse_deconv_algo;
+
+#endif
+
+#if AF_API_VERSION >= 37
+typedef enum {
+    AF_CONV_GRADIENT_DEFAULT = 0,
+    AF_CONV_GRADIENT_FILTER  = 1,
+    AF_CONV_GRADIENT_DATA    = 2,
+    AF_CONV_GRADIENT_BIAS    = 3
+} af_conv_gradient_type;
+#endif
+
 #ifdef __cplusplus
 namespace af
 {
-    typedef af_cfloat cfloat;
-    typedef af_cdouble  cdouble;
     typedef af_dtype dtype;
     typedef af_source source;
     typedef af_interp_type interpType;
@@ -321,6 +567,44 @@ namespace af
     typedef af_mat_prop matProp;
     typedef af_colormap ColorMap;
     typedef af_norm_type normType;
+#if AF_API_VERSION >= 31
+    typedef af_ycc_std YCCStd;
+#endif
+#if AF_API_VERSION >= 31
+    typedef af_image_format imageFormat;
+#endif
+#if AF_API_VERSION >= 32
+    typedef af_backend Backend;
+#endif
+#if AF_API_VERSION >= 32
+    typedef af_marker_type markerType;
+#endif
+#if AF_API_VERSION >= 34
+    typedef af_moment_type momentType;
+#endif
+#if AF_API_VERSION >= 34
+    typedef af_storage storage;
+#endif
+#if AF_API_VERSION >= 34
+    typedef af_binary_op binaryOp;
+#endif
+#if AF_API_VERSION >= 34
+    typedef af_random_engine_type randomEngineType;
+#endif
+#if AF_API_VERSION >= 35
+    typedef af_canny_threshold cannyThreshold;
+#endif
+#if AF_API_VERSION >= 36
+    typedef af_flux_function fluxFunction;
+    typedef af_diffusion_eq diffusionEq;
+    typedef af_topk_function topkFunction;
+#endif
+#if AF_API_VERSION >= 37
+    typedef af_var_bias varBias;
+    typedef af_iterative_deconv_algo iterativeDeconvAlgo;
+    typedef af_inverse_deconv_algo inverseDeconvAlgo;
+    typedef af_conv_gradient_type convGradientType;
+#endif
 }
 
 #endif
diff --git a/include/af/device.h b/include/af/device.h
index 1020ceb9c0..f081394d65 100644
--- a/include/af/device.h
+++ b/include/af/device.h
@@ -29,20 +29,33 @@ namespace af
     */
 
     /**
-       \defgroup device_func_prop deviceInfo
+       \defgroup device_func_info_string infoString
 
-       Get device information
+       Get af::info() as a string
 
        @{
 
+       \brief Returns the output of af::info() as a string
+
+       \param[in] verbose flag to return verbose info
+
+       \returns string containing output of af::info()
+
        \ingroup arrayfire_func
        \ingroup device_mat
     */
-    AFAPI void deviceInfo(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+    AFAPI const char* infoString(const bool verbose = false);
     /**
        @}
     */
 
+    /**
+        \copydoc device_func_prop
+
+        \ingroup device_func_prop
+    */
+    AFAPI void deviceInfo(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+
     /// \brief Gets the number of devices
     ///
     /// \copydoc device_func_count
@@ -62,10 +75,21 @@ namespace af
     ///
     /// \param[in] device the ID of the device to query
     ///
-    /// \returns true if the \p device supports double precision operations. false otherwise
+    /// \returns true if the \p device supports double precision operations.
+    ///          false otherwise
     /// \ingroup device_func_dbl
     AFAPI bool isDoubleAvailable(const int device);
 
+    /// \brief Queries the current device for half precision floating point
+    ///        support
+    ///
+    /// \param[in] device the ID of the device to query
+    ///
+    /// \returns true if the \p device supports half precision operations.
+    ///          false otherwise
+    /// \ingroup device_func_half
+    AFAPI bool isHalfAvailable(const int device);
+
     /// \brief Sets the current device
     ///
     /// \param[in] device The ID of the target device
@@ -82,28 +106,82 @@ namespace af
     /// @{
     /// \brief Allocates memory using ArrayFire's memory manager
     ///
-    /// \copydoc device_func_alloc
     /// \param[in] elements the number of elements to allocate
     /// \param[in] type is the type of the elements to allocate
-    /// \returns the pointer to the memory
+    /// \returns Pointer to the device memory on the current device. This is a
+    ///          CUDA device pointer for the CUDA backend. A cl::Buffer pointer
+    ///          from the cl2.hpp header on the OpenCL backend and a C pointer
+    ///          for the CPU backend
     ///
+    /// \note The device memory returned by this function is only freed if
+    ///       af::free() is called explicitly
+    /// \deprecated Use allocV2 instead. allocV2 accepts number of bytes
+    ///             instead of number of elements and returns a cl_mem object
+    ///             instead of the cl::Buffer object for the OpenCL backend.
+    ///             Otherwise the functionality is identical to af::alloc.
+    AF_DEPRECATED("Use af::allocV2 instead")
     AFAPI void *alloc(const size_t elements, const dtype type);
 
+#if AF_API_VERSION >= 38
+    /// \brief Allocates memory using ArrayFire's memory manager
+    ///
+    /// \param[in] bytes the number of bytes to allocate
+    /// \returns Pointer to the device memory on the current device. This is a
+    ///          CUDA device pointer for the CUDA backend. A cl_mem pointer
+    ///          on the OpenCL backend and a C pointer for the CPU backend
+    ///
+    /// \note The device memory returned by this function is only freed if
+    ///       af::freeV2() is called explicitly
+    AFAPI void *allocV2(const size_t bytes);
+#endif
+
     /// \brief Allocates memory using ArrayFire's memory manager
     //
-    /// \copydoc device_func_alloc
     /// \param[in] elements the number of elements to allocate
-    /// \returns the pointer to the memory
+    /// \returns Pointer to the device memory on the current device. This is a
+    ///          CUDA device pointer for the CUDA backend. A cl::Buffer pointer
+    ///          from the cl2.hpp header on the OpenCL backend and a C pointer
+    ///          for the CPU backend
     ///
     /// \note the size of the memory allocated is the number of \p elements *
-    ///         sizeof(type)
-    template<typename T>
-    T* alloc(const size_t elements);
+    ///       sizeof(type)
+    /// \note The device memory returned by this function is only freed if
+    ///       af::free() is called explicitly
+    /// \deprecated Use allocV2 instead. allocV2 accepts number of bytes
+    ///             instead of number of elements and returns a cl_mem object
+    ///             instead of the cl::Buffer object for the OpenCL backend.
+    ///             Otherwise the functionality is identical to af::alloc.
+    template <typename T>
+    AF_DEPRECATED("Use af::allocV2 instead")
+    T *alloc(const size_t elements);
     /// @}
 
+    /// \ingroup device_func_free
+    ///
+    /// \copydoc device_func_free
+    /// \param[in] ptr the memory allocated by the af::alloc function that
+    ///                will be freed
+    ///
+    /// \note This function will free a device pointer even if it has been
+    ///       previously locked.
+    /// \deprecated Use af::freeV2 instead. af_alloc_device_v2 returns a
+    ///             cl_mem object instead of the cl::Buffer object for the
+    ///             OpenCL backend. Otherwise the functionality is identical.
+    AF_DEPRECATED("Use af::freeV2 instead")
+    AFAPI void free(const void *ptr);
+
+#if AF_API_VERSION >= 38
+    /// \ingroup device_func_free
+    /// \copydoc device_func_free
+    /// \param[in] ptr The pointer returned by af::allocV2
+    ///
+    /// This function will free a device pointer even if it has been previously
+    /// locked.
+    AFAPI void freeV2(const void *ptr);
+#endif
+
     /// \ingroup device_func_pinned
     /// @{
-    ///
     /// \copydoc device_func_pinned
     ///
     /// \param[in] elements the number of elements to allocate
@@ -119,15 +197,51 @@ namespace af
     T* pinned(const size_t elements);
     /// @}
 
-    /// \ingroup device_func_free
-    /// @{
-    /// \copydoc device_func_free
+    /// \ingroup device_func_free_pinned
+    ///
+    /// \copydoc device_func_free_pinned
     /// \param[in] ptr the memory to free
-    AFAPI void free(const void *ptr);
-
-    /// \copydoc free()
     AFAPI void freePinned(const void *ptr);
-    ///@}
+
+#if AF_API_VERSION >= 33
+    /// \brief Allocate memory on host
+    ///
+    /// \copydoc device_func_alloc_host
+    ///
+    /// \param[in] elements the number of elements to allocate
+    /// \param[in] type is the type of the elements to allocate
+    /// \returns the pointer to the memory
+    ///
+    /// \ingroup device_func_alloc_host
+    AFAPI void *allocHost(const size_t elements, const dtype type);
+#endif
+
+#if AF_API_VERSION >= 33
+    /// \brief Allocate memory on host
+    ///
+    /// \copydoc device_func_alloc_host
+    ///
+    /// \param[in] elements the number of elements to allocate
+    /// \returns the pointer to the memory
+    ///
+    /// \note the size of the memory allocated is the number of \p elements *
+    ///         sizeof(type)
+    ///
+    /// \ingroup device_func_alloc_host
+    template<typename T>
+    AFAPI T* allocHost(const size_t elements);
+#endif
+
+#if AF_API_VERSION >= 33
+    /// \brief Free memory allocated internally by ArrayFire
+    ///
+    /// \copydoc device_func_free_host
+    ///
+    /// \param[in] ptr the memory to free
+    ///
+    /// \ingroup device_func_free_host
+    AFAPI void freeHost(const void *ptr);
+#endif
 
     /// \ingroup device_func_mem
     /// @{
@@ -139,21 +253,39 @@ namespace af
     //                              manager
     /// \param[out] lock_bytes The number of bytes in use
     /// \param[out] lock_buffers The number of buffers in use
+    ///
+    /// \note This function performs a synchronization operation
     AFAPI void deviceMemInfo(size_t *alloc_bytes, size_t *alloc_buffers,
                              size_t *lock_bytes, size_t *lock_buffers);
 
+#if AF_API_VERSION >= 33
+    ///
+    /// Prints buffer details from the ArrayFire Device Manager
+    ///
+    /// \param [in] msg A message to print before the table
+    /// \param [in] device_id print the memory info of the specified device.
+    ///  -1 signifies active device.
+    ///
+    /// \ingroup device_func_mem
+    ///
+    /// \note This function performs a synchronization operation
+    AFAPI void printMemInfo(const char *msg = NULL, const int device_id = -1);
+#endif
+
     /// \brief Call the garbage collection function in the memory manager
     ///
     /// \ingroup device_func_mem
     AFAPI void deviceGC();
     /// @}
 
-    /// \brief Set the resolution of memory chunks
+    /// \brief Set the resolution of memory chunks. Works only with the default
+    /// memory manager - throws if a custom memory manager is set.
     ///
     /// \ingroup device_func_mem
     AFAPI void setMemStepSize(const size_t size);
 
-    /// \brief Get the resolution of memory chunks
+    /// \brief Get the resolution of memory chunks. Works only with the default
+    /// memory manager - throws if a custom memory manager is set.
     ///
     /// \ingroup device_func_mem
     AFAPI size_t getMemStepSize();
@@ -169,10 +301,25 @@ extern "C" {
     */
     AFAPI af_err af_info();
 
+    /**
+       \ingroup device_func_info
+    */
     AFAPI af_err af_init();
 
     /**
-       \ingroup device_func_info
+       \brief Gets the output of af_info() as a string
+
+       \param[out] str contains the string
+       \param[in] verbose flag to return verbose info
+
+       \ingroup device_func_info_string
+    */
+    AFAPI af_err af_info_string(char** str, const bool verbose);
+
+    /**
+        \copydoc device_func_prop
+
+        \ingroup device_func_prop
     */
     AFAPI af_err af_device_info(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
 
@@ -186,6 +333,11 @@ extern "C" {
     */
     AFAPI af_err af_get_dbl_support(bool* available, const int device);
 
+    /**
+       \ingroup device_func_half
+    */
+    AFAPI af_err af_get_half_support(bool *available, const int device);
+
     /**
        \ingroup device_func_set
     */
@@ -202,35 +354,103 @@ extern "C" {
     AFAPI af_err af_sync(const int device);
 
     /**
-       \ingroup device_func_device
+       \brief Allocates memory using ArrayFire's memory manager
+       \ingroup device_func_alloc
+
+       The device memory returned by this function can only be freed using
+       af_free_device
+
+       \param [out] ptr Pointer to the device memory on the current device. This
+                        is a CUDA device pointer for the CUDA backend. A
+                        cl::Buffer pointer on the OpenCL backend and a C pointer
+                        for the CPU backend
+       \param [in] bytes The number of bytes to allocate on the device
+
+       \returns AF_SUCCESS if a pointer could be allocated. AF_ERR_NO_MEM if
+                there is no memory
+       \deprecated Use af_alloc_device_v2 instead. af_alloc_device_v2 returns a
+                   cl_mem object instead of the cl::Buffer object for the OpenCL
+                   backend. Otherwise the functionality is identical
     */
-    AFAPI af_err af_get_device_ptr(void **ptr, const af_array arr);
+    AF_DEPRECATED("Use af_alloc_device_v2 instead")
+    AFAPI af_err af_alloc_device(void **ptr, const dim_t bytes);
 
     /**
-       \ingroup device_func_alloc
+       \brief Returns memory to ArrayFire's memory manager.
+
+       This function will free a device pointer even if it has been previously
+       locked.
+
+       \param[in] ptr The pointer allocated by af_alloc_device to be freed
+
+       \deprecated Use af_free_device_v2 instead. The new function handles the
+                   new behavior of the af_alloc_device_v2 function.
+       \ingroup device_func_free
     */
-    AFAPI af_err af_alloc_device(void **ptr, const dim_t bytes);
+    AF_DEPRECATED("Use af_free_device_v2 instead")
+    AFAPI af_err af_free_device(void *ptr);
 
+#if AF_API_VERSION >= 38
     /**
-       \ingroup device_func_pinned
+       \brief Allocates memory using ArrayFire's memory manager
+
+       The device memory returned by this function can only be freed using
+       af_free_device_v2.
+
+       \param [out] ptr Pointer to the device memory on the current device. This
+                        is a CUDA device pointer for the CUDA backend. A
+                        cl_mem pointer on the OpenCL backend and a C pointer
+                        for the CPU backend
+       \param [in] bytes The number of bytes to allocate on the device
+
+       \returns AF_SUCCESS if a pointer could be allocated. AF_ERR_NO_MEM if
+                there is no memory
+       \ingroup device_func_alloc
     */
-    AFAPI af_err af_alloc_pinned(void **ptr, const dim_t bytes);
+    AFAPI af_err af_alloc_device_v2(void **ptr, const dim_t bytes);
 
     /**
+       \brief Returns memory to ArrayFire's memory manager.
+
+       This function will free a device pointer even if it has been previously
+       locked.
+
+       \param[in] ptr The pointer allocated by af_alloc_device_v2 to be freed
+       \note this function will not work for pointers allocated using the
+             af_alloc_device function for all backends
        \ingroup device_func_free
     */
-    AFAPI af_err af_free_device(void *ptr);
+    AFAPI af_err af_free_device_v2(void *ptr);
+#endif
+    /**
+       \ingroup device_func_pinned
+    */
+    AFAPI af_err af_alloc_pinned(void **ptr, const dim_t bytes);
 
     /**
        \ingroup device_func_free_pinned
     */
     AFAPI af_err af_free_pinned(void *ptr);
 
+#if AF_API_VERSION >= 33
+    /**
+       \ingroup device_func_alloc_host
+    */
+    AFAPI af_err af_alloc_host(void **ptr, const dim_t bytes);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \ingroup device_func_free_host
+    */
+    AFAPI af_err af_free_host(void *ptr);
+#endif
+
     /**
        Create array from device memory
-       \ingroup construct_mat
+       \ingroup c_api_mat
     */
-    AFAPI af_err af_device_array(af_array *arr, const void *data, const unsigned ndims, const dim_t * const dims, const af_dtype type);
+    AFAPI af_err af_device_array(af_array *arr, void *data, const unsigned ndims, const dim_t * const dims, const af_dtype type);
 
     /**
        Get memory information from the memory manager
@@ -239,6 +459,32 @@ extern "C" {
     AFAPI af_err af_device_mem_info(size_t *alloc_bytes, size_t *alloc_buffers,
                                     size_t *lock_bytes, size_t *lock_buffers);
 
+#if AF_API_VERSION >= 33
+    /**
+       Prints buffer details from the ArrayFire Device Manager.
+
+       The result is a table with several columns:
+
+        * POINTER:   The hex address of the array's device or pinned-memory
+                     pointer
+        * SIZE:      Human-readable size of the array
+        * AF LOCK:   Indicates whether ArrayFire is using this chunk of memory.
+                     If not, the chunk is ready for reuse.
+        * USER LOCK: If set, ArrayFire is prevented from freeing this memory.
+                     The chunk is not ready for re-use even if all ArrayFire's
+                     references to it go out of scope.
+
+       \param [in] msg A message to print before the table
+       \param [in] device_id print the memory info of the specified device.
+       -1 signifies active device.
+
+       \returns AF_SUCCESS if successful
+
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_print_mem_info(const char *msg, const int device_id);
+#endif
+
     /**
        Call the garbage collection routine
        \ingroup device_func_mem
@@ -246,17 +492,133 @@ extern "C" {
     AFAPI af_err af_device_gc();
 
     /**
-       Set the minimum memory chunk size
+       Set the minimum memory chunk size. Works only with the default
+       memory manager - returns an error if a custom memory manager is set.
+
        \ingroup device_func_mem
     */
     AFAPI af_err af_set_mem_step_size(const size_t step_bytes);
 
     /**
-       Get the minimum memory chunk size
+       Get the minimum memory chunk size. Works only with the default
+       memory manager - returns an error if a custom memory manager is set.
+
        \ingroup device_func_mem
     */
     AFAPI af_err af_get_mem_step_size(size_t *step_bytes);
 
+#if AF_API_VERSION >= 31
+    /**
+       Lock the device buffer in the memory manager.
+
+       Locked buffers are not freed by the memory manager until \ref af_unlock_array is called.
+       \ingroup device_func_mem
+    */
+#if AF_API_VERSION >= 33
+    AF_DEPRECATED("Use af_lock_array instead")
+#endif
+    AFAPI af_err af_lock_device_ptr(const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       Unlock device buffer in the memory manager.
+
+       This function will give back the control over the device pointer to the memory manager.
+       \ingroup device_func_mem
+    */
+#if AF_API_VERSION >= 33
+    AF_DEPRECATED("Use af_unlock_array instead")
+#endif
+    AFAPI af_err af_unlock_device_ptr(const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       Lock the device buffer in the memory manager.
+
+       Locked buffers are not freed by the memory manager until \ref af_unlock_array is called.
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_lock_array(const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       Unlock device buffer in the memory manager.
+
+       This function will give back the control over the device pointer to the memory manager.
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_unlock_array(const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       Query if the array has been locked by the user.
+
+       An array can be locked by the user by calling `af_lock_array`
+       or `af_get_device_ptr` or `af_get_raw_ptr` function.
+
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_is_locked_array(bool *res, const af_array arr);
+#endif
+
+    /**
+       Get the device pointer and lock the buffer in memory manager.
+
+       The device pointer \p ptr is not freed by the memory manager until \ref af_unlock_device_ptr is called.
+       \ingroup device_func_mem
+
+       \note For OpenCL backend *ptr should be cast to cl_mem.
+    */
+    AFAPI af_err af_get_device_ptr(void **ptr, const af_array arr);
+
+#if AF_API_VERSION >= 38
+    /**
+       Sets the path where the kernels generated at runtime will be cached
+
+       Sets the path where the kernels generated at runtime will be stored to
+       cache for later use. The files in this directory can be safely deleted.
+       The default location for these kernels is in $HOME/.arrayfire on Unix
+       systems and in the ArrayFire temp directory on Windows.
+
+       \param[in] path The location where the kernels will be stored
+       \param[in] override_env if true this path will take precedence over the
+                               AF_JIT_KERNEL_CACHE_DIRECTORY environment variable.
+                               If false, the environment variable takes precedence
+                               over this path.
+
+       \returns AF_SUCCESS if the variable is set. AF_ERR_ARG if path is NULL.
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_set_kernel_cache_directory(const char* path,
+                                               int override_env);
+
+    /**
+       Gets the path where the kernels generated at runtime will be cached
+
+       Gets the path where the kernels generated at runtime will be stored to
+       cache for later use. The files in this directory can be safely deleted.
+       The default location for these kernels is in $HOME/.arrayfire on Unix
+       systems and in the ArrayFire temp directory on Windows.
+
+       \param[out] length The length of the path array. If \p path is NULL, the
+                          length of the current path is assigned to this pointer
+       \param[out] path The path of the runtime generated kernel cache
+                         variable. If NULL, the current path length is assigned
+                         to \p length
+       \returns AF_SUCCESS if the variable is set.
+                AF_ERR_ARG if path and length are null at the same time.
+                AF_ERR_SIZE if \p length is not sufficient to store the
+                            path
+       \ingroup device_func_mem
+    */
+    AFAPI af_err af_get_kernel_cache_directory(size_t *length, char *path);
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/dim4.hpp b/include/af/dim4.hpp
index 7511d8416c..db78e67228 100644
--- a/include/af/dim4.hpp
+++ b/include/af/dim4.hpp
@@ -14,49 +14,112 @@
 #include <ostream>
 #include <istream>
 #include <vector>
-#if __cplusplus > 199711L // Necessary for NVCC
-//#include <initializer_list>
-#endif
 #include <af/defines.h>
 #include <af/seq.h>
 
 
 namespace af
 {
+/// \brief Generic object that represents size and shape
+/// \ingroup arrayfire_class
 class AFAPI dim4
 {
-    public:
-    dim_t dims[4]; //FIXME: Make this C compatiable
-    dim4(); //deleted
 public:
-#if __cplusplus > 199711L
-    //dim4(std::initializer_list<dim_t> dim_vals);
-#endif
+    dim_t dims[4];
+    /// Default constructor. Creates an invalid dim4 object
+    dim4();
+
+    /// Creates a new dim4 given a set of dimensions
     dim4(   dim_t first,
             dim_t second = 1,
             dim_t third = 1,
             dim_t fourth = 1);
+
+    /// Copy constructor
+    ///
+    /// \param[in] other The dim4 that will be copied
     dim4(const dim4& other);
-    dim4(const unsigned ndims, const dim_t * const dims);
+
+#if AF_API_VERSION >= 38
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
+    /// Default move constructor
+    ///
+    /// \param[in] other The dim4 that will be moved
+    dim4(dim4 &&other) AF_NOEXCEPT = default;
+
+    /// Default move assignment operator
+    ///
+    /// \param[in] other The dim4 that will be moved
+    dim4 &operator=(dim4 other) AF_NOEXCEPT;
+#endif
+#endif
+
+    /// Constructs a dim4 object from a C array of dim_t objects
+    ///
+    /// Creates a new dim4 from a C array. If the C array has fewer than 4
+    /// elements, all values past \p ndims will be assigned the value 1.
+    ///
+    /// \param[in] ndims The number of elements in the C array. Must be at
+    ///                  most 4
+    /// \param[in] dims  The values to assign to each element of dim4
+    dim4(const unsigned ndims, const dim_t *const dims);
+
+    /// Returns the number of elements represented by this dim4
     dim_t elements();
+
+    /// Returns the number of elements represented by this dim4
     dim_t elements() const;
+
+    /// Returns the number of axes whose values are greater than one
     dim_t ndims();
+
+    /// Returns the number of axes whose values are greater than one
     dim_t ndims() const;
-    bool operator==(const dim4& other) const;
-    bool operator!=(const dim4& other) const;
-    dim4& operator*=(const dim4& other);
-    dim4& operator+=(const dim4& other);
-    dim4& operator-=(const dim4& other);
-    dim_t& operator[](const unsigned dim);
-    const dim_t& operator[](const unsigned dim) const;
-            dim_t* get()         { return dims; }
-    const   dim_t* get() const   { return dims; }
+
+    /// Returns true if the two dim4 represent the same shape
+    bool operator==(const dim4 &other) const;
+
+    /// Returns true if two dim4s store different values
+    bool operator!=(const dim4 &other) const;
+
+    /// Element-wise multiplication of the dim4 objects
+    dim4 &operator*=(const dim4 &other);
+
+    /// Element-wise addition of the dim4 objects
+    dim4 &operator+=(const dim4 &other);
+
+    /// Element-wise subtraction of the dim4 objects
+    dim4 &operator-=(const dim4 &other);
+
+    /// Returns the reference to the element at a given index. (Must be less than
+    /// 4)
+    dim_t &operator[](const unsigned dim);
+
+    /// Returns the reference to the element at a given index. (Must be less than
+    /// 4)
+    const dim_t &operator[](const unsigned dim) const;
+
+    /// Returns the underlying pointer to the dim4 object
+    dim_t *get() { return dims; }
+
+    /// Returns the underlying pointer to the dim4 object
+    const dim_t *get() const { return dims; }
 };
 
-dim4 operator+(const dim4& first, const dim4& second);
-dim4 operator-(const dim4& first, const dim4& second);
-dim4 operator*(const dim4& first, const dim4& second);
+/// Performs an element-wise addition of two dim4 objects
+AFAPI dim4 operator+(const dim4& first, const dim4& second);
+
+/// Performs an element-wise subtraction of two dim4 objects
+AFAPI dim4 operator-(const dim4& first, const dim4& second);
+
+/// Performs an element-wise multiplication of two dim4 objects
+AFAPI dim4 operator*(const dim4& first, const dim4& second);
 
+/// Prints the elements of the dim4 array separated by spaces
+///
+/// \param[inout] ostr An ostream object
+/// \param[in] dims The dim4 object to be printed
+/// \returns the reference to the \p ostr after the dim4 string has been streamed in
 static inline
 std::ostream&
 operator<<(std::ostream& ostr, const dim4& dims)
@@ -68,6 +131,11 @@ operator<<(std::ostream& ostr, const dim4& dims)
     return ostr;
 }
 
+/// Reads 4 dim_t values from an input stream and stores the results in a dim4
+///
+/// \param[inout] istr An istream object
+/// \param[out] dims The dim4 object that will store the values
+/// \return The \p istr object after 4 dim_t values have been read from the input
 static inline
 std::istream&
 operator>>(std::istream& istr, dim4& dims)
@@ -79,10 +147,13 @@ operator>>(std::istream& istr, dim4& dims)
     return istr;
 }
 
+/// Returns true if the af_seq object represents the entire range of an axis
 AFAPI bool isSpan(const af_seq &seq);
 
+/// Returns the number of elements that the af_seq object represents
 AFAPI size_t seqElements(const af_seq &seq);
 
+/// Returns the number of elements that will be represented by seq if applied on an array
 AFAPI dim_t calcDim(const af_seq &seq, const dim_t &parentDim);
 }
 
diff --git a/include/af/event.h b/include/af/event.h
new file mode 100644
index 0000000000..1a5b1718c0
--- /dev/null
+++ b/include/af/event.h
@@ -0,0 +1,130 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/defines.h>
+
+#if AF_API_VERSION >= 37
+
+/**
+    Handle to an event object
+
+    \ingroup event_api
+*/
+typedef void* af_event;
+
+#ifdef __cplusplus
+namespace af {
+
+/**
+    C++ RAII interface for manipulating events
+    \ingroup arrayfire_class
+    \ingroup event_api
+*/
+class AFAPI event {
+    af_event e_;
+
+   public:
+    /// Create a new event using the C af_event handle
+    event(af_event e);
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
+    /// Move constructor
+    event(event&& other);
+
+    /// Move assignment operator
+    event& operator=(event&& other);
+#endif
+    /// Create a new event object
+    event();
+
+    /// event Destructor
+    ~event();
+
+    /// Return the underlying C af_event handle
+    af_event get() const;
+
+    /// \brief Adds the event on the default ArrayFire queue. Once this point
+    ///        on the program is executed, the event is considered complete.
+    void mark();
+
+    /// \brief Block the ArrayFire queue until this event has occurred
+    void enqueue();
+
+    /// \brief Block the calling thread until this event has occurred
+    void block() const;
+
+   private:
+    event& operator=(const event& other);
+    event(const event& other);
+};
+
+}  // namespace af
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+   \brief Create a new \ref af_event handle
+
+   \param[out] eventHandle the newly created event handle
+
+   \ingroup event_api
+*/
+AFAPI af_err af_create_event(af_event* eventHandle);
+
+/**
+   \brief Release the \ref af_event handle
+
+   \param[in] eventHandle the input event handle
+
+   \ingroup event_api
+*/
+AFAPI af_err af_delete_event(af_event eventHandle);
+
+/**
+   Marks the \ref af_event on the active computation stream. If the \ref
+   af_event is enqueued/waited on later, any operations that are currently
+   enqueued on the event stream will be completed before any events that are
+   enqueued after the call to enqueue
+
+   \param[in] eventHandle the input event handle
+
+   \ingroup event_api
+*/
+AFAPI af_err af_mark_event(const af_event eventHandle);
+
+/**
+   Enqueues the \ref af_event and all enqueued events on the active stream.
+   All operations enqueued after a call to enqueue will not be executed
+   until operations on the stream when mark was called are complete
+
+   \param[in] eventHandle the input event handle
+
+   \ingroup event_api
+*/
+AFAPI af_err af_enqueue_wait_event(const af_event eventHandle);
+
+/**
+   Blocks the calling thread on events until all events on the computation
+   stream before mark was called are complete
+
+   \param[in] eventHandle the input event handle
+
+   \ingroup event_api
+*/
+AFAPI af_err af_block_event(const af_event eventHandle);
+
+#ifdef __cplusplus
+}
+#endif  // __cplusplus
+
+#endif  // AF_API_VERSION >= 37
diff --git a/include/af/exception.h b/include/af/exception.h
index ee10c5db7b..aaa566e9bc 100644
--- a/include/af/exception.h
+++ b/include/af/exception.h
@@ -11,11 +11,13 @@
 
 #ifdef __cplusplus
 
-#include <iostream>
+#include <ostream>
 #include <af/defines.h>
 
 namespace af {
 
+/// An ArrayFire exception class
+/// \ingroup arrayfire_class
 class AFAPI exception : public std::exception
 {
 private:
@@ -24,16 +26,31 @@ class AFAPI exception : public std::exception
 public:
     af_err err() { return m_err; }
     exception();
+    /// Creates a new af::exception given a message. The error code is AF_ERR_UNKNOWN
     exception(const char *msg);
+
+    /// Creates a new exception with a formatted error message for a given file
+    /// and line number in the source code.
     exception(const char *file, unsigned line, af_err err);
+
+    /// Creates a new af::exception with a formatted error message for a given
+    /// message, error code, file and line number in the source code.
     exception(const char *msg, const char *file, unsigned line, af_err err);
+#if AF_API_VERSION >= 33
+    /// Creates a new exception given a message, function name, file name, line number and
+    /// error code.
+    exception(const char *msg, const char *func, const char *file, unsigned line, af_err err);
+#endif
     virtual ~exception() throw() {}
+    /// Returns an error message for the exception in a string format
     virtual const char *what() const throw() { return m_msg; }
+
+    /// Writes the exception to a stream
     friend inline std::ostream& operator<<(std::ostream &s, const exception &e)
     { return s << e.what(); }
 };
 
-}
+}  // namespace af
 
 #endif
 
@@ -41,7 +58,15 @@ class AFAPI exception : public std::exception
 extern "C" {
 #endif
 
+/// Returns the last error message that occurred and its length
+///
+/// \param[out] msg The message of the previous error
+/// \param[out] len The number of characters in the msg object
 AFAPI void af_get_last_error(char **msg, dim_t *len);
+
+/// Converts the af_err error code to its string representation
+///
+/// \param[in] err The ArrayFire error code
 AFAPI const char *af_err_to_string(const af_err err);
 
 #ifdef __cplusplus
diff --git a/include/af/features.h b/include/af/features.h
index 69ee72489d..0f3a146883 100644
--- a/include/af/features.h
+++ b/include/af/features.h
@@ -17,25 +17,61 @@ namespace af
 {
     class array;
 
+    /// Represents a feature returned by a feature detector
+    ///
+    /// \ingroup arrayfire_class
+    /// \ingroup features_group_features
     class AFAPI features {
     private:
         af_features feat;
 
     public:
+        /// Default constructor. Creates a features object with new features
         features();
+
+        /// Creates a features object with n features with undefined locations
         features(const size_t n);
+
+        /// Creates a features object from a C af_features object
         features(af_features f);
 
         ~features();
 
-        features& operator= (const features& f);
+        /// Copy assignment operator
+        features& operator= (const features& other);
+
+#if AF_API_VERSION >= 38
+        /// Copy constructor
+        features(const features &other);
+
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
+        /// Move constructor
+        features(features &&other);
+
+        /// Move assignment operator
+        features &operator=(features &&other);
+#endif
+#endif
 
+        /// Returns the number of features represented by this object
         size_t getNumFeatures() const;
+
+        /// Returns an af::array which represents the x locations of the features
         array getX() const;
+
+        /// Returns an af::array which represents the y locations of the features
         array getY() const;
+
+        /// Returns an array with the score of the features
         array getScore() const;
+
+        /// Returns an array with the orientations of the features
         array getOrientation() const;
+
+        /// Returns an array that represents the size of the features
         array getSize() const;
+
+        /// Returns the underlying C af_features object
         af_features get() const;
     };
 
@@ -46,23 +82,72 @@ namespace af
 extern "C" {
 #endif
 
+    /// Creates a new af_features object with \p num features
+    ///
+    /// \param[out] feat The new feature that will be created
+    /// \param[in] num The number of features that will be in the new features
+    ///                object
+    /// \returns AF_SUCCESS if successful
+    /// \ingroup features_group_features
     AFAPI af_err af_create_features(af_features *feat, dim_t num);
 
+    /// Increases the reference count of the feature and all of its associated
+    /// arrays
+    ///
+    /// \param[out] out The retained features object
+    /// \param[in] feat The features object whose reference count will be
+    ///                 incremented
+    /// \returns AF_SUCCESS if successful
+    /// \ingroup features_group_features
     AFAPI af_err af_retain_features(af_features *out, const af_features feat);
 
+    /// Returns the number of features associated with this object
+    ///
+    /// \param[out] num The number of features in the object
+    /// \param[in] feat The feature whose count will be returned
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_num(dim_t *num, const af_features feat);
 
+    /// Returns the x positions of the features
+    ///
+    /// \param[out] out An array with all x positions of the features
+    /// \param[in] feat The features object
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_xpos(af_array *out, const af_features feat);
 
+    /// Returns the y positions of the features
+    ///
+    /// \param[out] out An array with all y positions of the features
+    /// \param[in] feat The features object
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_ypos(af_array *out, const af_features feat);
 
+    /// Returns the scores of the features
+    ///
+    /// \param[out] score An array with scores of the features
+    /// \param[in] feat The features object
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_score(af_array *score, const af_features feat);
 
+    /// Returns the orientations of the features
+    ///
+    /// \param[out] orientation An array with the orientations of the features
+    /// \param[in] feat The features object
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_orientation(af_array *orientation, const af_features feat);
 
+    /// Returns the size of the features
+    ///
+    /// \param[out] size An array with the sizes of the features
+    /// \param[in] feat The features object
+    /// \ingroup features_group_features
     AFAPI af_err af_get_features_size(af_array *size, const af_features feat);
 
-    // Destroy af_features
+    /// Reduces the reference count of each of the features
+    ///
+    /// \param[in] feat The features object whose reference count will be
+    ///                 reduced
+    /// \ingroup features_group_features
     AFAPI af_err af_release_features(af_features feat);
 
 #ifdef __cplusplus
diff --git a/include/af/graphics.h b/include/af/graphics.h
index 675082d654..d6ffa208fb 100644
--- a/include/af/graphics.h
+++ b/include/af/graphics.h
@@ -12,7 +12,7 @@
 #include <af/defines.h>
 #include <af/array.h>
 
-typedef unsigned long long af_window;
+typedef void* af_window;
 
 typedef struct {
     int row;
@@ -30,6 +30,8 @@ namespace af
 
    \brief Window object to render af::arrays
 
+   Windows are not CopyConstructible or CopyAssignable.
+
    \ingroup graphics_func
  */
 class AFAPI Window {
@@ -43,10 +45,15 @@ class AFAPI Window {
 
         void initWindow(const int width, const int height, const char* const title);
 
+        Window(const Window&);                 // Prevent copy-construction
+        Window& operator=(const Window&);      // Prevent assignment
+
     public:
         /**
            Creates a window object with default width
            and height with title set to "ArrayFire"
+
+           \ingroup gfx_func_window
          */
         Window();
 
@@ -55,6 +62,8 @@ class AFAPI Window {
            and height using the title provided by the user
 
            \param[in] title is the window title
+
+           \ingroup gfx_func_window
          */
         Window(const char* const title);
 
@@ -65,6 +74,8 @@ class AFAPI Window {
            \param[in] width is the window width
            \param[in] height is the window height
            \param[in] title is the window title with default value as "ArrayFire"
+
+           \ingroup gfx_func_window
          */
         Window(const int width, const int height, const char* const title="ArrayFire");
 
@@ -72,12 +83,18 @@ class AFAPI Window {
            Creates a window object with default width
            and height with title set to "ArrayFire"
 
-           \param[in] wnd is an \ref af_window handle which can be retrieved by
+           \param[in] window is an \ref af_window handle which can be retrieved
+                             by
            doing a get call on any \ref Window object
+
+           \ingroup gfx_func_window
          */
-        Window(const af_window wnd);
+        Window(const af_window window);
+
         /**
            Destroys the window handle
+
+           \ingroup gfx_func_window
          */
         ~Window();
 
@@ -85,6 +102,8 @@ class AFAPI Window {
 
         /**
            \return Returns the \ref af_window window handle.
+
+           \ingroup gfx_func_window
          */
         af_window get() const { return wnd; }
 
@@ -93,6 +112,8 @@ class AFAPI Window {
 
            \param[in] x is horizontal coordinate
            \param[in] y is vertical coordinate
+
+           \ingroup gfx_func_window
          */
         void setPos(const unsigned x, const unsigned y);
 
@@ -100,13 +121,29 @@ class AFAPI Window {
            Set the window title
 
            \param[in] title is the window title
+
+           \ingroup gfx_func_window
          */
         void setTitle(const char* const title);
 
+#if AF_API_VERSION >= 31
+        /**
+           Set the window size
+
+           \param[in]   w is target width of the window
+           \param[in]   h is target height of the window
+
+           \ingroup gfx_func_window
+         */
+        void setSize(const unsigned w, const unsigned h);
+#endif
+
         /**
            Set the colormap to be used for subsequent rendering calls
 
            \param[in] cmap should be one of the enum values from \ref ColorMap
+
+           \ingroup gfx_func_window
          */
         void setColorMap(const ColorMap cmap);
 
@@ -117,9 +154,57 @@ class AFAPI Window {
            \param[in] title parameter is used when this function is called in grid mode
 
            \note \p in should be 2d array or 3d array with 3 channels.
+
+           \ingroup gfx_func_draw
          */
         void image(const array& in, const char* title=NULL);
 
+#if AF_API_VERSION >= 32
+        /**
+           Renders the input array as a 3d line plot to the window
+
+           \param[in] in is an \ref array
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p in should be 1d array of size 3n or 2d array with (3 x n) or (n x 3) channels.
+
+           \ingroup gfx_func_draw
+         */
+        AF_DEPRECATED("Use plot instead")
+        void plot3(const array& in, const char* title=NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 2D or 3D plot to the window
+
+           \param[in] in is an \ref array with the data points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p in must be 2d and of the form [n, order], where order is either 2 or 3.
+                 If order is 2, then chart is 2D and if order is 3, then chart is 3D.
+
+           \ingroup gfx_func_draw
+         */
+        void plot(const array& in, const char* const title=NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 3D plot to the window
+
+           \param[in] X is an \ref array with the x-axis data points
+           \param[in] Y is an \ref array with the y-axis data points
+           \param[in] Z is an \ref array with the z-axis data points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p X, \p Y and \p Z should be vectors.
+
+           \ingroup gfx_func_draw
+         */
+        void plot(const array& X, const array& Y, const array& Z, const char* const title=NULL);
+#endif
+
         /**
            Renders the input arrays as a 2D plot to the window
 
@@ -128,52 +213,349 @@ class AFAPI Window {
            \param[in] title parameter is used when this function is called in grid mode
 
            \note \p X and \p Y should be vectors.
+
+           \ingroup gfx_func_draw
          */
         void plot(const array& X, const array& Y, const char* const title=NULL);
 
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 2D or 3D scatter-plot to the window
+
+           \param[in] in is an \ref array with the data points
+           \param[in] marker is an \ref markerType enum specifying which marker to use in the scatter plot
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p in must be 2d and of the form [n, order], where order is either 2 or 3.
+                 If order is 2, then chart is 2D and if order is 3, then chart is 3D.
+
+           \ingroup gfx_func_draw
+         */
+        void scatter(const array& in, const af::markerType marker = AF_MARKER_POINT,
+                     const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 3D scatter-plot to the window
+
+           \param[in] X is an \ref array with the x-axis data points
+           \param[in] Y is an \ref array with the y-axis data points
+           \param[in] Z is an \ref array with the z-axis data points
+           \param[in] marker is an \ref markerType enum specifying which marker to use in the scatter plot
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p X, \p Y and \p Z should be vectors.
+
+           \ingroup gfx_func_draw
+         */
+        void scatter(const array& X, const array& Y, const array& Z,
+                     const af::markerType marker = AF_MARKER_POINT, const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 33
+        /**
+           Renders the input arrays as a 2D scatter-plot to the window
+
+           \param[in] X is an \ref array with the x-axis data points
+           \param[in] Y is an \ref array with the y-axis data points
+           \param[in] marker is an \ref markerType enum specifying which marker to use in the scatter plot
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p X and \p Y should be vectors.
+
+           \ingroup gfx_func_draw
+         */
+        void scatter(const array& X, const array& Y,
+                     const af::markerType marker = AF_MARKER_POINT, const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 33
+        /**
+           Renders the input arrays as a 3D scatter-plot to the window
+
+           \param[in] P is an \ref af_array or matrix with the xyz-values of the points
+           \param[in] marker is an \ref markerType enum specifying which marker to use in the scatter plot
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \ingroup gfx_func_draw
+         */
+        AF_DEPRECATED("Use scatter instead")
+        void scatter3(const array& P, const af::markerType marker = AF_MARKER_POINT,
+                      const char* const title = NULL);
+#endif
+
         /**
            Renders the input array as a histogram to the window
 
            \param[in] X is the data frequency \ref array
            \param[in] minval is the value of the minimum data point of the array whose histogram(\p X) is going to be rendered.
            \param[in] maxval is the value of the maximum data point of the array whose histogram(\p X) is going to be rendered.
+           \param[in] title parameter is used when this function is called in grid mode
 
            \note \p X should be a vector.
+
+           \ingroup gfx_func_draw
          */
         void hist(const array& X, const double minval, const double maxval, const char* const title=NULL);
 
+#if AF_API_VERSION >= 32
+        /**
+           Renders the input arrays as a 3D surface plot to the window
+
+           \param[in] S is an \ref array with the z-axis data points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p S should be a 2D array
+
+           \ingroup gfx_func_draw
+         */
+        void surface(const array& S, const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 32
+        /**
+           Renders the input arrays as a 3D surface plot to the window
+
+           \param[in] xVals is an \ref array with the x-axis data points
+           \param[in] yVals is an \ref array with the y-axis data points
+           \param[in] S is an \ref array with the z-axis data points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p xVals and \p yVals should be vectors or 2D arrays; \p S should be a 2D array
+
+           \ingroup gfx_func_draw
+         */
+        void surface(const array& xVals, const array& yVals, const array& S, const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 2D or 3D vector field plot to the window
+
+           \param[in] points is an \ref array with the points
+           \param[in] directions is an \ref array with the directions at the points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note \p points and \p directions should have the same size and must
+           be 2D.
+           The number of rows (dim 0) determines the number of points and the
+           number of columns determines the type of plot. If the number of
+           columns is 2, then the plot is 2D and if there are 3 columns, then
+           the plot is 3D.
+
+           \ingroup gfx_func_draw
+         */
+        void vectorField(const array& points, const array& directions, const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 3D vector field plot to the window
+
+           \param[in] xPoints is an \ref array with the x-coordinate points
+           \param[in] yPoints is an \ref array with the y-coordinate points
+           \param[in] zPoints is an \ref array with the z-coordinate points
+           \param[in] xDirs is an \ref array with the x-coordinate directions at the points
+           \param[in] yDirs is an \ref array with the y-coordinate directions at the points
+           \param[in] zDirs is an \ref array with the z-coordinate directions at the points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note All the array inputs must be vectors and must have the same sizes.
+
+           \ingroup gfx_func_draw
+         */
+        void vectorField(const array& xPoints, const array& yPoints, const array& zPoints,
+                         const array& xDirs  , const array& yDirs  , const array& zDirs  ,
+                         const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Renders the input arrays as a 2D vector field plot to the window
+
+           \param[in] xPoints is an \ref array with the x-coordinate points
+           \param[in] yPoints is an \ref array with the y-coordinate points
+           \param[in] xDirs is an \ref array with the x-coordinate directions at the points
+           \param[in] yDirs is an \ref array with the y-coordinate directions at the points
+           \param[in] title parameter is used when this function is called in grid mode
+
+           \note All the array inputs must be vectors and must have the same sizes.
+
+           \ingroup gfx_func_draw
+         */
+        void vectorField(const array& xPoints, const array& yPoints,
+                         const array& xDirs  , const array& yDirs  ,
+                         const char* const title = NULL);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Setup the axes limits for a 2D histogram/plot/vector field
+
+           This function computes the minimum and maximum for each dimension
+
+           \param[in] x the data to compute the limits for x-axis.
+           \param[in] y the data to compute the limits for y-axis.
+           \param[in] exact is for using the exact min/max values from \p x and \p y.
+                      If exact is false then the most significant digit is rounded up
+                      to next power of 2 and the magnitude remains the same.
+
+           \ingroup gfx_func_window
+        */
+        void setAxesLimits(const array &x, const array &y, const bool exact = false);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Setup the axes limits for a histogram/plot/surface/vector field
+
+           This function computes the minimum and maximum for each dimension
+
+           \param[in] x the data to compute the limits for x-axis.
+           \param[in] y the data to compute the limits for y-axis.
+           \param[in] z the data to compute the limits for z-axis.
+           \param[in] exact is for using the exact min/max values from \p x, \p y and \p z.
+                      If exact is false then the most significant digit is rounded up
+                      to next power of 2 and the magnitude remains the same.
+
+           \ingroup gfx_func_window
+        */
+        void setAxesLimits(const array &x, const array &y, const array &z,
+                           const bool exact = false);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Setup the axes limits for a histogram/plot/surface/vector field
+
+           This function sets the axes limits to the ones provided by the user.
+
+           \param[in] xmin is the minimum on x-axis
+           \param[in] xmax is the maximum on x-axis
+           \param[in] ymin is the minimum on y-axis
+           \param[in] ymax is the maximum on y-axis
+           \param[in] exact is for using the provided min/max values exactly.
+                      If exact is false then the most significant digit is rounded up
+                      to next power of 2 and the magnitude remains the same.
+
+           \ingroup gfx_func_window
+        */
+        void setAxesLimits(const float xmin, const float xmax,
+                           const float ymin, const float ymax,
+                           const bool exact = false);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Setup the axes limits for a histogram/plot/surface/vector field
+
+           This function sets the axes limits to the ones provided by the user.
+
+           \param[in] xmin is the minimum on x-axis
+           \param[in] xmax is the maximum on x-axis
+           \param[in] ymin is the minimum on y-axis
+           \param[in] ymax is the maximum on y-axis
+           \param[in] zmin is the minimum on z-axis
+           \param[in] zmax is the maximum on z-axis
+           \param[in] exact is for using the provided min/max values exactly.
+                      If exact is false then the most significant digit is rounded up
+                      to next power of 2 and the magnitude remains the same.
+
+           \ingroup gfx_func_window
+        */
+        void setAxesLimits(const float xmin, const float xmax,
+                           const float ymin, const float ymax,
+                           const float zmin, const float zmax,
+                           const bool exact = false);
+#endif
+
+#if AF_API_VERSION >= 34
+        /**
+           Setup the axes titles for a plot/surface/vector field
+
+           This function creates the axis titles for a chart.
+
+           \param[in] xtitle is the name of the x-axis
+           \param[in] ytitle is the name of the y-axis
+           \param[in] ztitle is the name of the z-axis
+
+           \ingroup gfx_func_window
+        */
+        void setAxesTitles(const char * const xtitle = "X-Axis",
+                           const char * const ytitle = "Y-Axis",
+                           const char * const ztitle = NULL);
+#endif
+
+#if AF_API_VERSION >= 37
+        /**
+           Setup the axes label formats for charts
+
+           \param[in] xformat is a printf-style format specifier for x-axis
+           \param[in] yformat is a printf-style format specifier for y-axis
+           \param[in] zformat is a printf-style format specifier for z-axis
+
+           \ingroup gfx_func_window
+        */
+        void setAxesLabelFormat(const char *const xformat = "%4.1f",
+                                const char *const yformat = "%4.1f",
+                                const char *const zformat = NULL);
+#endif
+
         /**
            Setup grid layout for multiview mode in a window
 
-           \param[in]   rows is number of rows you want to show in a window
-           \param[in]   cols is number of coloumns you want to show in a window
+           \param[in]   rows is the number of rows to divide the display area into
+           \param[in]   cols is the number of columns to divide the display area into
+
+           \ingroup gfx_func_window
         */
         void grid(const int rows, const int cols);
 
         /**
            This function swaps the background buffer to current view
            and polls for any key strokes while the window was in focus
+
+           \ingroup gfx_func_window
         */
         void show();
 
         /**
-           To check if window is marked for close. This usually
-            happens when user presses ESC key while the window is in focus.
+           Check if window is marked for close. This usually
+           happens when user presses ESC key while the window is in focus.
 
            \return     \ref AF_SUCCESS if window show is successful, otherwise an appropriate error code
            is returned.
+
+           \ingroup gfx_func_window
         */
         bool close();
 
+#if AF_API_VERSION >= 33
+        /**
+           Hide/Show the window
+
+           \param[in] isVisible indicates if the window is to be hidden or brought into focus
+
+           \ingroup gfx_func_window
+         */
+        void setVisibility(const bool isVisible);
+#endif
+
         /**
            This function is used to keep track of which cell in the grid mode is
            being currently rendered. When a user does Window(0,0), we internally
-           store the cell coordinates and return an reference to the very object that
+           store the cell coordinates and return a reference to the very object that
            called upon this function. This reference can be used later to issue
            draw calls using rendering functions.
 
-           \return returns a reference to the object pointed by this
+           \param[in] r is row identifier where current object has to be rendered
+           \param[in] c is column identifier where current object has to be rendered
+
+           \return a reference to the object pointed by this
            to enable cascading this call with rendering functions.
+
+           \ingroup gfx_func_window
          */
         inline Window& operator()(const int r, const int c) {
             _r = r; _c = c;
@@ -199,12 +581,12 @@ extern "C" {
    \return     \ref AF_SUCCESS if window creation is successful, otherwise an appropriate error code
    is returned.
 
-   \ingroup gfx_window_func
+   \ingroup gfx_func_window
 */
 AFAPI af_err af_create_window(af_window *out, const int width, const int height, const char* const title);
 
 /**
-   C Interface wrapper for setting the start position when window display
+   C Interface wrapper for setting the start position when window is displayed
 
    \param[in]   wind is the window handle
    \param[in]   x is horizontal start coordinate
@@ -230,6 +612,22 @@ AFAPI af_err af_set_position(const af_window wind, const unsigned x, const unsig
 */
 AFAPI af_err af_set_title(const af_window wind, const char* const title);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface wrapper for setting window size
+
+   \param[in]   wind is the window handle
+   \param[in]   w is target width of the window
+   \param[in]   h is target height of the window
+
+   \return     \ref AF_SUCCESS if set size for window is successful, otherwise an appropriate error code
+   is returned.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_size(const af_window wind, const unsigned w, const unsigned h);
+#endif
+
 /**
    C Interface wrapper for drawing an array as an image
 
@@ -263,8 +661,204 @@ AFAPI af_err af_draw_image(const af_window wind, const af_array in, const af_cel
 
    \ingroup gfx_func_draw
 */
+AF_DEPRECATED("Use af_draw_plot_nd or af_draw_plot_2d instead")
 AFAPI af_err af_draw_plot(const af_window wind, const af_array X, const af_array Y, const af_cell* const props);
 
+#if AF_API_VERSION >= 32
+/**
+   C Interface wrapper for drawing an array as a plot
+
+   \param[in]   wind is the window handle
+   \param[in]   P is an \ref af_array or matrix with the xyz-values of the points
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p P should be a 3n x 1 vector or one of a 3xn or nx3 matrices.
+
+   \ingroup gfx_func_draw
+*/
+AF_DEPRECATED("Use af_draw_plot_nd or af_draw_plot_3d instead")
+AFAPI af_err af_draw_plot3(const af_window wind, const af_array P, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 2D or 3D plot
+
+   \param[in]   wind is the window handle
+   \param[in]   P is an \ref af_array or matrix with the xyz-values of the points
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p P must be 2d and of the form [n, order], where order is either 2 or 3.
+         If order is 2, then chart is 2D and if order is 3, then chart is 3D.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_plot_nd(const af_window wind, const af_array P, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 2D plot
+
+   \param[in]   wind is the window handle
+   \param[in]   X is an \ref af_array with the x-axis data points
+   \param[in]   Y is an \ref af_array with the y-axis data points
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p X and \p Y should be vectors.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_plot_2d(const af_window wind, const af_array X, const af_array Y,
+                             const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 3D plot
+
+   \param[in]   wind is the window handle
+   \param[in]   X is an \ref af_array with the x-axis data points
+   \param[in]   Y is an \ref af_array with the y-axis data points
+   \param[in]   Z is an \ref af_array with the z-axis data points
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p X, \p Y and \p Z should be vectors.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_plot_3d(const af_window wind,
+                             const af_array X, const af_array Y, const af_array Z,
+                             const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   C Interface wrapper for drawing an array as a scatter plot
+
+   \param[in]   wind is the window handle
+   \param[in]   X is an \ref af_array with the x-axis data points
+   \param[in]   Y is an \ref af_array with the y-axis data points
+   \param[in]   marker is an \ref af_marker_type enum specifying which marker to use in the scatter plot
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p X and \p Y should be vectors.
+
+   \ingroup gfx_func_draw
+*/
+AF_DEPRECATED("Use af_draw_scatter_nd or af_draw_scatter_2d instead")
+AFAPI af_err af_draw_scatter(const af_window wind, const af_array X, const af_array Y,
+                             const af_marker_type marker, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   C Interface wrapper for drawing an array as a 3D scatter plot
+
+   \param[in]   wind is the window handle
+   \param[in]   P is an \ref af_array or matrix with the xyz-values of the points
+   \param[in]   marker is an \ref af_marker_type enum specifying which marker to use in the scatter plot
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \ingroup gfx_func_draw
+*/
+AF_DEPRECATED("Use af_draw_scatter_nd or af_draw_scatter_3d instead")
+AFAPI af_err af_draw_scatter3(const af_window wind, const af_array P,
+                              const af_marker_type marker, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 2D or 3D scatter plot
+
+   \param[in]   wind is the window handle
+   \param[in]   P is an \ref af_array or matrix with the xyz-values of the points
+   \param[in]   marker is an \ref af_marker_type enum specifying which marker to use in the scatter plot
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p P must be 2D and of the form [n, order], where order is either 2 or 3.
+         If order is 2, then the chart is 2D and if order is 3, then the chart is 3D.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_scatter_nd(const af_window wind, const af_array P,
+                                const af_marker_type marker, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 2D scatter plot
+
+   \param[in]   wind is the window handle
+   \param[in]   X is an \ref af_array with the x-axis data points
+   \param[in]   Y is an \ref af_array with the y-axis data points
+   \param[in]   marker is an \ref af_marker_type enum specifying which marker to use in the scatter plot
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p X and \p Y should be vectors.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_scatter_2d(const af_window wind, const af_array X, const af_array Y,
+                                const af_marker_type marker, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing an array as a 3D scatter plot
+
+   \param[in]   wind is the window handle
+   \param[in]   X is an \ref af_array with the x-axis data points
+   \param[in]   Y is an \ref af_array with the y-axis data points
+   \param[in]   Z is an \ref af_array with the z-axis data points
+   \param[in]   marker is an \ref af_marker_type enum specifying which marker to use in the scatter plot
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p X, \p Y and \p Z should be vectors.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_scatter_3d(const af_window wind,
+                                const af_array X, const af_array Y, const af_array Z,
+                                const af_marker_type marker, const af_cell* const props);
+#endif
+
 /**
    C Interface wrapper for drawing an array as a histogram
 
@@ -284,6 +878,110 @@ AFAPI af_err af_draw_plot(const af_window wind, const af_array X, const af_array
 */
 AFAPI af_err af_draw_hist(const af_window wind, const af_array X, const double minval, const double maxval, const af_cell* const props);
 
+#if AF_API_VERSION >= 32
+/**
+   C Interface wrapper for drawing arrays as a surface
+
+   \param[in]   wind is the window handle
+   \param[in]   xVals is an \ref af_array with the x-axis data points
+   \param[in]   yVals is an \ref af_array with the y-axis data points
+   \param[in]   S is an \ref af_array with the z-axis data points
+   \param[in]   props is structure \ref af_cell that has the properties that are used
+   for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p xVals and \p yVals should be vectors. \p S should be a 2D array.
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_surface(const af_window wind, const af_array xVals, const af_array yVals, const af_array S, const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing arrays as a 2D or 3D vector field
+
+   \param[in]   wind is the window handle
+   \param[in]   points is an \ref af_array with the points
+   \param[in]   directions is an \ref af_array with the directions
+   \param[in]   props is structure \ref af_cell that has the properties that
+                are used for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note \p points and \p directions should have the same size and must
+   be 2D.
+   The number of rows (dim 0) determines the number of points and the
+   number of columns determines the type of plot. If the number of columns
+   is 2, then the plot is 2D and if there are 3 columns, then the plot
+   is 3D.
+
+   \note all the \ref af_array inputs should be the same size
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_vector_field_nd(const af_window wind,
+                const af_array points, const af_array directions,
+                const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing arrays as a 3D vector field
+
+   \param[in]   wind is the window handle
+   \param[in]   xPoints is an \ref af_array with the x-axis points
+   \param[in]   yPoints is an \ref af_array with the y-axis points
+   \param[in]   zPoints is an \ref af_array with the z-axis points
+   \param[in]   xDirs is an \ref af_array with the x-axis directions
+   \param[in]   yDirs is an \ref af_array with the y-axis directions
+   \param[in]   zDirs is an \ref af_array with the z-axis directions
+   \param[in]   props is structure \ref af_cell that has the properties that
+                are used for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note all the \ref af_array inputs should be vectors and the same size
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_vector_field_3d(
+                const af_window wind,
+                const af_array xPoints, const af_array yPoints, const af_array zPoints,
+                const af_array xDirs, const af_array yDirs, const af_array zDirs,
+                const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for drawing arrays as a 2D vector field
+
+   \param[in]   wind is the window handle
+   \param[in]   xPoints is an \ref af_array with the x-axis points
+   \param[in]   yPoints is an \ref af_array with the y-axis points
+   \param[in]   xDirs is an \ref af_array with the x-axis directions
+   \param[in]   yDirs is an \ref af_array with the y-axis directions
+   \param[in]   props is structure \ref af_cell that has the properties that
+                are used for the current rendering.
+
+   \return     \ref AF_SUCCESS if rendering is successful, otherwise an appropriate error code
+   is returned.
+
+   \note all the \ref af_array inputs should be vectors and the same size
+
+   \ingroup gfx_func_draw
+*/
+AFAPI af_err af_draw_vector_field_2d(
+                const af_window wind,
+                const af_array xPoints, const af_array yPoints,
+                const af_array xDirs, const af_array yDirs,
+                const af_cell* const props);
+#endif
+
 /**
    C Interface wrapper for grid setup in a window
 
@@ -298,6 +996,144 @@ AFAPI af_err af_draw_hist(const af_window wind, const af_array X, const double m
 */
 AFAPI af_err af_grid(const af_window wind, const int rows, const int cols);
 
+#if AF_API_VERSION >= 34
+/**
+   C Interface for setting axes limits for a histogram/plot/surface/vector field
+
+   This function computes the minimum and maximum for each dimension
+
+   \param[in] wind is the window handle
+   \param[in] x the data to compute the limits for x-axis.
+   \param[in] y the data to compute the limits for y-axis.
+   \param[in] z the data to compute the limits for z-axis.
+   \param[in] exact is for using the exact min/max values from \p x, \p y and \p z.
+              If exact is false then the most significant digit is rounded up
+              to next power of 2 and the magnitude remains the same.
+   \param[in] props is structure \ref af_cell that has the properties that
+              are used for the current rendering.
+
+   \note Set \p z to NULL if the chart is 2D.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_axes_limits_compute(const af_window wind,
+                                        const af_array x, const af_array y, const af_array z,
+                                        const bool exact,
+                                        const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface for setting axes limits for a 2D histogram/plot/vector field
+
+   This function sets the axes limits to the ones provided by the user.
+
+   \param[in] wind is the window handle
+   \param[in] xmin is the minimum on x-axis
+   \param[in] xmax is the maximum on x-axis
+   \param[in] ymin is the minimum on y-axis
+   \param[in] ymax is the maximum on y-axis
+   \param[in] exact is for using the exact min/max values given for the x and y axes.
+              If exact is false then the most significant digit is rounded up
+              to next power of 2 and the magnitude remains the same.
+   \param[in] props is structure \ref af_cell that has the properties that
+              are used for the current rendering.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_axes_limits_2d(const af_window wind,
+                                   const float xmin, const float xmax,
+                                   const float ymin, const float ymax,
+                                   const bool exact,
+                                   const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface for setting axes limits for a 3D plot/surface/vector field
+
+   This function sets the axes limits to the ones provided by the user.
+
+   \param[in] wind is the window handle
+   \param[in] xmin is the minimum on x-axis
+   \param[in] xmax is the maximum on x-axis
+   \param[in] ymin is the minimum on y-axis
+   \param[in] ymax is the maximum on y-axis
+   \param[in] zmin is the minimum on z-axis
+   \param[in] zmax is the maximum on z-axis
+   \param[in] exact is for using the exact min/max values given for the x, y and z axes.
+              If exact is false then the most significant digit is rounded up
+              to next power of 2 and the magnitude remains the same.
+   \param[in] props is structure \ref af_cell that has the properties that
+              are used for the current rendering.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_axes_limits_3d(const af_window wind,
+                                   const float xmin, const float xmax,
+                                   const float ymin, const float ymax,
+                                   const float zmin, const float zmax,
+                                   const bool exact,
+                                   const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface wrapper for setting axes titles for histogram/plot/surface/vector
+   field
+
+   The value passed to \p ztitle determines whether the chart is treated as 2D
+   or 3D when the axes titles are set. If the user is targeting a two
+   dimensional chart on the window \p wind, then the user needs to pass NULL to
+   \p ztitle so that the internal caching mechanism understands this window requires
+   a 2D chart. Any non-NULL value passed to \p ztitle will result in ArrayFire
+   treating \p wind as intended for a 3D chart.
+
+   \param[in] wind is the window handle
+   \param[in] xtitle is the name of the x-axis
+   \param[in] ytitle is the name of the y-axis
+   \param[in] ztitle is the name of the z-axis
+   \param[in] props is structure \ref af_cell that has the properties that
+              are used for the current rendering.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_axes_titles(const af_window wind,
+                                const char * const xtitle,
+                                const char * const ytitle,
+                                const char * const ztitle,
+                                const af_cell* const props);
+#endif
+
+#if AF_API_VERSION >= 37
+/**
+   C Interface wrapper for setting axes label formats for charts
+
+   Axes labels use printf-style format specifiers. The default specifier for the
+   data displayed as labels is `%4.1f`. This function lets the user change the
+   label formatting to whichever format fits their data range and precision.
+
+   \param[in] wind is the window handle
+   \param[in] xformat is a printf-style format specifier for x-axis
+   \param[in] yformat is a printf-style format specifier for y-axis
+   \param[in] zformat is a printf-style format specifier for z-axis
+   \param[in] props is structure \ref af_cell that has the properties that
+              are used for the current rendering.
+
+   \note \p zformat can be NULL, in which case ArrayFire understands that the
+   label formats are meant for a 2D chart corresponding to this \p wind
+   or a specific cell in multi-viewport mode (provided via \p props argument).
+   A non-NULL value for \p zformat means the label formats belong to a 3D chart.
+
+   \ingroup gfx_func_window
+*/
+AFAPI af_err af_set_axes_label_format(const af_window wind,
+                                      const char *const xformat,
+                                      const char *const yformat,
+                                      const char *const zformat,
+                                      const af_cell *const props);
+#endif
+
 /**
    C Interface wrapper for showing a window
 
@@ -324,6 +1160,18 @@ AFAPI af_err af_show(const af_window wind);
 */
 AFAPI af_err af_is_window_closed(bool *out, const af_window wind);
 
+#if AF_API_VERSION >= 33
+/**
+   Hide/Show a window
+
+   \param[in] wind is the window whose visibility is to be changed
+   \param[in] is_visible indicates if the window is to be hidden or brought into focus
+
+   \ingroup gfx_func_window
+ */
+AFAPI af_err af_set_visibility(const af_window wind, const bool is_visible);
+#endif
+
 /**
    C Interface wrapper for destroying a window handle
 
diff --git a/include/af/half.h b/include/af/half.h
new file mode 100644
index 0000000000..961dd5f099
--- /dev/null
+++ b/include/af/half.h
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+typedef struct {
+    union {
+        unsigned short data_ : 16;
+        struct {
+            unsigned short fraction : 10;
+            unsigned short exponent : 5;
+            unsigned short sign : 1;
+        };
+    };
+} af_half;
+
+#ifdef __cplusplus
+namespace af {
+#endif
+typedef af_half half;
+#ifdef __cplusplus
+}
+#endif
diff --git a/include/af/image.h b/include/af/image.h
index c145b5bd2a..b28d0b5395 100644
--- a/include/af/image.h
+++ b/include/af/image.h
@@ -8,6 +8,7 @@
  ********************************************************/
 
 #pragma once
+#include <af/defines.h>
 #include <af/features.h>
 
 #ifdef __cplusplus
@@ -47,6 +48,115 @@ AFAPI array loadImage(const char* filename, const bool is_color=false);
 */
 AFAPI void saveImage(const char* filename, const array& in);
 
+#if AF_API_VERSION >= 31
+/**
+    C++ Interface for loading an image from memory
+
+    \param[in] ptr is the location of the image data in memory. This is the pointer
+    created by \ref saveImageMem.
+    \return image loaded as \ref af::array()
+
+    \note The pointer used is a void* cast of the FreeImage type FIMEMORY which is
+    created using the FreeImage_OpenMemory API. If the user is opening a FreeImage
+    stream external to ArrayFire, that pointer can be passed to this function as well.
+
+    \ingroup imagemem_func_load
+*/
+AFAPI array loadImageMem(const void *ptr);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+    C++ Interface for saving an image to memory
+
+    \param[in] in is the arrayfire array to be saved as an image
+    \param[in] format is the type of image to create in memory. The enum borrows from
+    the FREE_IMAGE_FORMAT enum of FreeImage. Other values not included in imageFormat
+    but included in FREE_IMAGE_FORMAT can also be passed to this function.
+
+    \return a void* pointer which is a type cast of the FreeImage type FIMEMORY* pointer.
+
+    \note Ensure that \ref deleteImageMem is called on this pointer. Otherwise the
+    memory it refers to will be leaked.
+
+    \ingroup imagemem_func_save
+*/
+AFAPI void* saveImageMem(const array& in, const imageFormat format = AF_FIF_PNG);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+    C++ Interface for deleting memory created by \ref saveImageMem or
+    \ref af_save_image_memory
+
+    \param[in] ptr is the pointer to the FreeImage stream created by saveImageMem.
+
+    \ingroup imagemem_func_delete
+*/
+AFAPI void deleteImageMem(void *ptr);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+    C++ Interface for loading an image as its original type
+
+    This load image function allows you to load images as u8, u16 or f32
+    depending on the type of input image as shown by the table below.
+
+     Bits per Color (Gray/RGB/RGBA Bits Per Pixel) | Array Type  | Range
+    -----------------------------------------------|-------------|---------------
+      8 ( 8/24/32  BPP)                            | u8          | 0 - 255
+     16 (16/48/64  BPP)                            | u16         | 0 - 65535
+     32 (32/96/128 BPP)                            | f32         | 0 - 1
+
+    \param[in] filename is name of file to be loaded
+    \return image loaded as \ref af::array()
+
+    \ingroup imageio_func_load
+*/
+AFAPI array loadImageNative(const char* filename);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+    C++ Interface for saving an image without modifications
+
+    This function only accepts u8, u16, f32 arrays. These arrays are saved to
+    images without any modifications.
+
+    Note that not all image types support 16 or 32 bit images.
+
+    The best options for 16 bit images are PNG, PPM and TIFF.
+    The best option for 32 bit images is TIFF.
+    These allow lossless storage.
+
+    The images stored have the following properties:
+
+     Array Type  | Bits per Color (Gray/RGB/RGBA Bits Per Pixel) | Range
+    -------------|-----------------------------------------------|---------------
+     u8          |  8 ( 8/24/32  BPP)                            | 0 - 255
+     u16         | 16 (16/48/64  BPP)                            | 0 - 65535
+     f32         | 32 (32/96/128 BPP)                            | 0 - 1
+
+    \param[in] filename is name of file to be saved
+    \param[in] in is the array to be saved. Should be u8 for saving 8-bit image,
+    u16 for 16-bit image, and f32 for 32-bit image.
+
+    \ingroup imageio_func_save
+*/
+AFAPI void saveImageNative(const char* filename, const array& in);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+    Function to check if Image IO is available
+
+    \returns true if ArrayFire was compiled with ImageIO support, false otherwise.
+    \ingroup imageio_func_available
+*/
+AFAPI bool isImageIOAvailable();
+#endif
+
 /**
     C++ Interface for resizing an image to specified dimensions
 
@@ -111,7 +221,22 @@ AFAPI array rotate(const array& in, const float theta, const bool crop=true, con
 
     \ingroup transform_func_transform
 */
-AFAPI array transform(const array& in, const array& transform, const dim_t odim0 = 0, const dim_t odim1 = 0, const interpType method=AF_INTERP_NEAREST, const bool inverse=true);
+AFAPI array transform(const array& in, const array& transform, const dim_t odim0 = 0, const dim_t odim1 = 0,
+                      const interpType method=AF_INTERP_NEAREST, const bool inverse=true);
+
+#if AF_API_VERSION >= 33
+/**
+    C++ Interface for transforming coordinates
+
+    \param[in] tf is transformation matrix
+    \param[in] d0 is the first input dimension
+    \param[in] d1 is the second input dimension
+    \return the transformed coordinates
+
+    \ingroup transform_func_coordinates
+*/
+AFAPI array transformCoordinates(const array& tf, const float d0, const float d1);
+#endif
 
 /**
     C++ Interface for translating an image
@@ -163,8 +288,8 @@ AFAPI array skew(const array& in, const float skew0, const float skew1, const di
     C++ Interface for bilateral filter
 
     \param[in]  in array is the input image
-    \param[in]  spatial_sigma is the spatial variance paramter that decides the filter window
-    \param[in]  chromatic_sigma is the chromatic variance paramter
+    \param[in]  spatial_sigma is the spatial variance parameter that decides the filter window
+    \param[in]  chromatic_sigma is the chromatic variance parameter
     \param[in]  is_color indicates if the input \p in is color image or grayscale
     \return     the processed image
 
@@ -181,7 +306,7 @@ AFAPI array bilateral(const array &in, const float spatial_sigma, const float ch
    \param[in]  nbins  Number of bins to populate between min and max
    \param[in]  minval minimum bin value (accumulates -inf to min)
    \param[in]  maxval minimum bin value (accumulates max to +inf)
-   \return     histogram array
+   \return     histogram array of type u32
 
    \ingroup image_func_histogram
  */
@@ -194,7 +319,7 @@ AFAPI array histogram(const array &in, const unsigned nbins, const double minval
 
    \param[in]  in is the input array
    \param[in]  nbins  Number of bins to populate between min and max
-   \return     histogram array
+   \return     histogram array of type u32
 
    \ingroup image_func_histogram
  */
@@ -204,8 +329,8 @@ AFAPI array histogram(const array &in, const unsigned nbins);
     C++ Interface for mean shift
 
     \param[in]  in array is the input image
-    \param[in]  spatial_sigma is the spatial variance paramter that decides the filter window
-    \param[in]  chromatic_sigma is the chromatic variance paramter
+    \param[in]  spatial_sigma is the spatial variance parameter that decides the filter window
+    \param[in]  chromatic_sigma is the chromatic variance parameter
     \param[in]  iter is the number of iterations filter operation is performed
     \param[in]  is_color indicates if the input \p in is color image or grayscale
     \return     the processed image
@@ -214,22 +339,6 @@ AFAPI array histogram(const array &in, const unsigned nbins);
 */
 AFAPI array meanShift(const array& in, const float spatial_sigma, const float chromatic_sigma, const unsigned iter, const bool is_color=false);
 
-/**
-    C++ Interface for median filter
-
-    \snippet test/medfilt.cpp ex_image_medfilt
-
-    \param[in]  in array is the input image
-    \param[in]  wind_length is the kernel height
-    \param[in]  wind_width is the kernel width
-    \param[in]  edge_pad value will decide what happens to border when running
-                filter in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
-    \return     the processed image
-
-    \ingroup image_func_medfilt
-*/
-AFAPI array medfilt(const array& in, const dim_t wind_length = 3, const dim_t wind_width = 3, const borderType edge_pad = AF_PAD_ZERO);
-
 /**
     C++ Interface for minimum filter
 
@@ -335,28 +444,11 @@ AFAPI array erode3(const array& in, const array& mask);
 */
 AFAPI array regions(const array& in, const af::connectivity connectivity=AF_CONNECTIVITY_4, const dtype type=f32);
 
-/**
-   C++ Interface for image template matching
-
-   \param[in]  searchImg is an array with image data
-   \param[in]  templateImg is the template we are looking for in the image
-   \param[in]  mType is metric that should be used to calculate the disparity
-               between window in the image and the template image. It can one of
-               the values defined by the enum \ref af_match_type
-   \return     array with dispartiy values for the window starting at
-               corresponding pixel position
-
-   \note If \p search_img is 3d array, a batch operation will be performed.
-
-   \ingroup cv_func_match_template
- */
-AFAPI array matchTemplate(const array &searchImg, const array &templateImg, const matchType mType=AF_SAD);
-
 /**
    C++ Interface for extracting sobel gradients
 
-   \param[out] dx is derivate along horizontal direction
-   \param[out] dy is derivate along vertical direction
+   \param[out] dx is derivative along horizontal direction
+   \param[out] dy is derivative along vertical direction
    \param[in]  img is an array with image data
    \param[in]  ker_size sobel kernel size or window size
 
@@ -416,7 +508,7 @@ AFAPI array gray2rgb(const array& in, const float rFactor=1.0, const float gFact
    \snippet test/histogram.cpp ex_image_histequal
 
    \param[in]  in is the input array, non-normalized input (!! assumes values [0-255] !!)
-   \param[in]  hist target histogram to approximate in output (based on # of bins)
+   \param[in]  hist target histogram to approximate in output (based on number of bins)
    \return     data with histogram approximately equal to histogram
 
    \note \p in must be two dimensional.
@@ -428,8 +520,8 @@ AFAPI array histEqual(const array& in, const array& hist);
 /**
    C++ Interface for generating gausian kernels
 
-   \param[in]  rows
-   \param[in]  cols
+   \param[in]  rows number of rows of the kernel
+   \param[in]  cols number of columns of the kernel
    \param[in]  sig_r (default 0) (calculated internally as 0.25 * rows + 0.75)
    \param[in]  sig_c (default 0) (calculated internally as 0.25 * cols + 0.75)
    \return     an array with values generated using gaussian function
@@ -478,6 +570,306 @@ AFAPI array rgb2hsv(const array& in);
  */
 AFAPI array colorSpace(const array& image, const CSpace to, const CSpace from);
 
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for rearranging windowed sections of an input into columns
+   (or rows)
+
+   \param[in]  in is the input array
+   \param[in]  wx is the window size along dimension 0
+   \param[in]  wy is the window size along dimension 1
+   \param[in]  sx is the stride along dimension 0
+   \param[in]  sy is the stride along dimension 1
+   \param[in]  px is the padding along dimension 0
+   \param[in]  py is the padding along dimension 1
+   \param[in]  is_column determines whether the section becomes a column (if
+               true) or a row (if false)
+   \returns    an array with the input's sections rearranged as columns (or rows)
+
+   \note \p in can hold multiple images for processing if it is three or
+         four-dimensional
+   \note \p wx and \p wy must be in [1, input.dims(0) + px] and [1, input.dims(1) + py], respectively
+   \note \p sx and \p sy must be at least 1
+   \note \p px and \p py must be in [0, wx - 1] and [0, wy - 1], respectively. Padding becomes part of
+         the input image prior to the windowing
+
+   \ingroup image_func_unwrap
+*/
+AFAPI array unwrap(const array& in, const dim_t wx, const dim_t wy,
+                   const dim_t sx, const dim_t sy, const dim_t px=0, const dim_t py=0,
+                   const bool is_column = true);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for performing the opposite of \ref unwrap
+
+   \param[in]  in is the input array
+   \param[in]  ox is the output's dimension 0 size
+   \param[in]  oy is the output's dimension 1 size
+   \param[in]  wx is the window size along dimension 0
+   \param[in]  wy is the window size along dimension 1
+   \param[in]  sx is the stride along dimension 0
+   \param[in]  sy is the stride along dimension 1
+   \param[in]  px is the padding along dimension 0
+   \param[in]  py is the padding along dimension 1
+   \param[in]  is_column determines whether an output patch is formed from a
+               column (if true) or a row (if false)
+   \returns    an array with the input's columns (or rows) reshaped as patches
+
+   \note Wrap is typically used to recompose an unwrapped image. If this is the
+         case, use the same parameters that were used in \ref unwrap(). Also
+         use the original image size (before unwrap) for \p ox and \p oy.
+   \note The window/patch size, \p wx \f$\times\f$ \p wy, must equal
+         `input.dims(0)` (or `input.dims(1)` if \p is_column is false).
+   \note \p sx and \p sy must be at least 1
+   \note \p px and \p py must be between [0, wx) and [0, wy), respectively
+   \note The number of patches, `input.dims(1)` (or `input.dims(0)` if
+         \p is_column is false), must equal \f$nx \times\ ny\f$, where
+         \f$\displaystyle nx = \frac{ox + 2px - wx}{sx} + 1\f$ and
+         \f$\displaystyle ny = \frac{oy + 2py - wy}{sy} + 1\f$
+   \note Batched wrap can be performed on multiple 2D slices at once if \p in
+         is three or four-dimensional
+
+   \ingroup image_func_wrap
+*/
+AFAPI array wrap(const array& in,
+                 const dim_t ox, const dim_t oy,
+                 const dim_t wx, const dim_t wy,
+                 const dim_t sx, const dim_t sy,
+                 const dim_t px = 0, const dim_t py = 0,
+                 const bool is_column = true);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface wrapper for summed area tables
+
+   \param[in]  in is the input array
+   \returns the summed area table of input image
+
+   \ingroup image_func_sat
+*/
+AFAPI array sat(const array& in);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for converting YCbCr to RGB
+
+   \param[in]  in is an array in the YCbCr colorspace
+   \param[in]  standard specifies the ITU-R BT "xyz" standard which determines the Kb, Kr values
+   used in colorspace conversion equation
+   \return     array in RGB colorspace
+
+   \note \p in must be three dimensional and values should lie in the range [0,1]
+
+   \ingroup image_func_ycbcr2rgb
+ */
+AFAPI array ycbcr2rgb(const array& in, const YCCStd standard=AF_YCC_601);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for converting RGB to YCbCr
+
+   \param[in]  in is an array in the RGB colorspace
+   \param[in]  standard specifies the ITU-R BT "xyz" standard which determines the Kb, Kr values
+   used in colorspace conversion equation
+   \return     array in YCbCr colorspace
+
+   \note \p in must be three dimensional and values should lie in the range [0,1]
+
+   \ingroup image_func_rgb2ycbcr
+ */
+AFAPI array rgb2ycbcr(const array& in, const YCCStd standard=AF_YCC_601);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C++ Interface for calculating an image moment
+
+   \param[out] out is a pointer to a pre-allocated array where the calculated moment(s) will be placed.
+   User is responsible for ensuring enough space to hold all requested moments
+   \param[in]  in is the input image
+   \param[in] moment is moment(s) to calculate
+
+   \ingroup image_func_moments
+ */
+AFAPI void moments(double* out, const array& in, const momentType moment=AF_MOMENT_FIRST_ORDER);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+   C++ Interface for calculating image moments
+
+   \param[in]  in contains the input image(s)
+   \param[in] moment is moment(s) to calculate
+   \return array containing the requested moment of each image
+
+   \ingroup image_func_moments
+ */
+AFAPI array moments(const array& in, const momentType moment=AF_MOMENT_FIRST_ORDER);
+#endif
+
+#if AF_API_VERSION >= 35
+/**
+   C++ Interface for canny edge detector
+
+   \param[in] in                  is the input image
+   \param[in] thresholdType       determines if user set high threshold is to be used or not. It
+                                  can take values defined by the enum \ref af_canny_threshold
+   \param[in] lowThresholdRatio   is the lower threshold % of maximum or auto-derived high threshold
+   \param[in] highThresholdRatio  is the higher threshold % of maximum value in gradient image used
+                                  in hysteresis procedure. This value is ignored if
+                                  \ref AF_CANNY_THRESHOLD_AUTO_OTSU is chosen as
+                                  \ref af_canny_threshold
+   \param[in] sobelWindow     is the window size of sobel kernel for computing gradient direction and
+                              magnitude
+   \param[in] isFast     indicates if L<SUB>1</SUB> norm (faster but less accurate) is used to compute
+                         image gradient magnitude instead of L<SUB>2</SUB> norm.
+   \return binary array containing edges
+
+   \ingroup image_func_canny
+*/
+AFAPI array canny(const array& in, const cannyThreshold thresholdType,
+                  const float lowThresholdRatio, const float highThresholdRatio,
+                  const unsigned sobelWindow = 3, const bool isFast = false);
+#endif
+
+#if AF_API_VERSION >= 36
+/**
+   C++ Interface for gradient anisotropic (non-linear diffusion) smoothing
+
+   \param[in] in is the input image, expects non-integral (float/double) typed af::array
+   \param[in] timestep is the time step used in solving the diffusion equation.
+   \param[in] conductance parameter controls the sensitivity of conductance in diffusion equation.
+   \param[in] iterations is the number of times the diffusion step is performed.
+   \param[in] fftype indicates whether quadratic or exponential flux function is used by algorithm.
+   \param[in] diffusionKind selects the kind of diffusion method to perform. It can take
+              any value of enum \ref diffusionEq
+   \return A filtered image that is of same size as the input.
+
+   \ingroup image_func_anisotropic_diffusion
+*/
+AFAPI array anisotropicDiffusion(const af::array& in, const float timestep,
+                                 const float conductance, const unsigned iterations,
+                                 const fluxFunction fftype=AF_FLUX_EXPONENTIAL,
+                                 const diffusionEq diffusionKind=AF_DIFFUSION_GRAD);
+#endif
+
+#if AF_API_VERSION >= 37
+/**
+  C++ Interface for Iterative deconvolution algorithm
+
+  \param[in] in is the blurred input image
+  \param[in] ker is the kernel (point spread function) known to have caused
+             the blur in the system
+  \param[in] iterations is the number of iterations the algorithm will run
+  \param[in] relaxFactor is the relaxation factor multiplied with distance
+             of estimate from observed image.
+  \param[in] algo takes value of type enum \ref af_iterative_deconv_algo
+             indicating the iterative deconvolution algorithm to be used
+  \return sharp image estimate generated from the blurred input
+
+  \note The \p relaxFactor argument is ignored when the
+  \ref AF_ITERATIVE_DECONV_RICHARDSONLUCY algorithm is used.
+
+  \ingroup image_func_iterative_deconv
+ */
+AFAPI array iterativeDeconv(const array& in, const array& ker,
+                            const unsigned iterations, const float relaxFactor,
+                            const iterativeDeconvAlgo algo);
+
+/**
+   C++ Interface for Tikhonov deconvolution algorithm
+
+   \param[in] in is the blurred input image
+   \param[in] psf is the kernel (point spread function) known to have caused
+              the blur in the system
+   \param[in] gamma is a user defined regularization constant
+   \param[in] algo selects the inverse deconvolution algorithm and thereby
+              the meaning of \p gamma. If \p algo is AF_INVERSE_DECONV_TIKHONOV,
+              then \p gamma is a user defined regularization constant.
+   \return sharp image estimate generated from the blurred input
+
+   \ingroup image_func_inverse_deconv
+ */
+AFAPI array inverseDeconv(const array& in, const array& psf,
+                          const float gamma, const inverseDeconvAlgo algo);
+
+/**
+   C++ Interface for confidence connected components
+
+   \param[in] in is the input image, expects non-integral (float/double)
+              typed af::array
+   \param[in] seeds is an af::array of x & y coordinates of the seed points
+              with coordinate values along columns of this af::array i.e. they
+              are not stored in interleaved fashion.
+   \param[in] radius is the neighborhood region to be considered around
+              each seed point
+   \param[in] multiplier controls the threshold range computed from
+              the mean and variance of seed point neighborhoods
+   \param[in] iter is number of iterations
+   \param[in] segmentedValue is the value to which the valid pixels of
+              the output array are set.
+   \return the output af::array containing the connected components
+
+   \ingroup image_func_confidence_cc
+*/
+AFAPI array confidenceCC(const array &in, const array &seeds,
+                         const unsigned radius,
+                         const unsigned multiplier, const int iter,
+                         const double segmentedValue);
+
+/**
+   C++ Interface for confidence connected components
+
+   \param[in] in is the input image, expects non-integral (float/double)
+              typed af::array
+   \param[in] seedx is an af::array of x coordinates of the seed points
+   \param[in] seedy is an af::array of y coordinates of the seed points
+   \param[in] radius is the neighborhood region to be considered around
+              each seed point
+   \param[in] multiplier controls the threshold range computed from
+              the mean and variance of seed point neighborhoods
+   \param[in] iter is number of iterations
+   \param[in] segmentedValue is the value to which the valid pixels of
+              the output array are set.
+   \return the output af::array containing the connected components
+
+   \ingroup image_func_confidence_cc
+*/
+AFAPI array confidenceCC(const array &in, const array &seedx,
+                         const array &seedy, const unsigned radius,
+                         const unsigned multiplier, const int iter,
+                         const double segmentedValue);
+
+/**
+   C++ Interface for confidence connected components
+
+   \param[in] in is the input image, expects non-integral (float/double)
+              typed af::array
+   \param[in] num_seeds is the total number of seeds
+   \param[in] seedx is an array of x coordinates of the seed points
+   \param[in] seedy is an array of y coordinates of the seed points
+   \param[in] radius is the neighborhood region to be considered around
+              each seed point
+   \param[in] multiplier controls the threshold range computed from
+              the mean and variance of seed point neighborhoods
+   \param[in] iter is number of iterations
+   \param[in] segmentedValue is the value to which the valid pixels of
+              the output array are set.
+   \return the output af::array containing the connected components
+
+   \ingroup image_func_confidence_cc
+*/
+AFAPI array confidenceCC(const array &in, const size_t num_seeds,
+                         const unsigned *seedx, const unsigned *seedy,
+                         const unsigned radius, const unsigned multiplier,
+                         const int iter, const double segmentedValue);
+
+#endif
 }
 #endif
 
@@ -486,43 +878,154 @@ extern "C" {
 #endif
 
     /**
-       C Interface for calculating the gradients
+        C Interface for calculating the gradients
 
-       \param[out] dx the gradient along first dimension
-       \param[out] dy the gradient along second dimension
-       \param[in]  in is the input array
-       \return     \ref AF_SUCCESS if the color transformation is successful,
-       otherwise an appropriate error code is returned.
+        \param[out] dx the gradient along first dimension
+        \param[out] dy the gradient along second dimension
+        \param[in]  in is the input array
+        \return     \ref AF_SUCCESS if the gradients are computed successfully,
+        otherwise an appropriate error code is returned.
 
-       \ingroup calc_func_grad
+        \ingroup calc_func_grad
     */
     AFAPI af_err af_gradient(af_array *dx, af_array *dy, const af_array in);
 
     /**
-       C Interface for loading an image
+        C Interface for loading an image
 
-       \param[out] out will contain the image
-       \param[in] filename is name of file to be loaded
-       \param[in] isColor boolean denoting if the image should be loaded as 1 channel or 3 channel
-       \return     \ref AF_SUCCESS if the color transformation is successful,
-       otherwise an appropriate error code is returned.
+        \param[out] out will contain the image
+        \param[in] filename is name of file to be loaded
+        \param[in] isColor boolean denoting if the image should be loaded as 1 channel or 3 channel
+        \return     \ref AF_SUCCESS if the image is loaded successfully,
+        otherwise an appropriate error code is returned.
 
-       \ingroup imageio_func_load
+        \ingroup imageio_func_load
     */
     AFAPI af_err af_load_image(af_array *out, const char* filename, const bool isColor);
 
-   /**
-      C Interface for saving an image
+    /**
+        C Interface for saving an image
 
-      \param[in] filename is name of file to be loaded
-      \param[in] in is the arrayfire array to be saved as an image
-      \return     \ref AF_SUCCESS if the color transformation is successful,
-      otherwise an appropriate error code is returned.
+        \param[in] filename is name of file to be saved
+        \param[in] in is the arrayfire array to be saved as an image
+        \return     \ref AF_SUCCESS if the image is saved successfully,
+        otherwise an appropriate error code is returned.
 
-      \ingroup imageio_func_save
-   */
+        \ingroup imageio_func_save
+    */
     AFAPI af_err af_save_image(const char* filename, const af_array in);
 
+#if AF_API_VERSION >= 31
+    /**
+        C Interface for loading an image from memory
+
+        \param[out] out is an array that will contain the image
+        \param[in] ptr is the FIMEMORY pointer created by the saveImageMem function, the
+        af_save_image_memory function, or the FreeImage_OpenMemory API.
+        \return     \ref AF_SUCCESS if successful
+
+        \ingroup imagemem_func_load
+    */
+    AFAPI af_err af_load_image_memory(af_array *out, const void* ptr);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+        C Interface for saving an image to memory using FreeImage
+
+        \param[out] ptr is the FIMEMORY pointer created by FreeImage.
+        \param[in] in is the arrayfire array to be saved as an image
+        \param[in] format is the type of image to create in memory. The enum borrows from
+        the FREE_IMAGE_FORMAT enum of FreeImage. Other values not included in af_image_format
+        but included in FREE_IMAGE_FORMAT can also be passed to this function.
+        \return     \ref AF_SUCCESS if successful.
+
+        \ingroup imagemem_func_save
+    */
+    AFAPI af_err af_save_image_memory(void** ptr, const af_array in, const af_image_format format);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+        C Interface for deleting an image from memory
+
+        \param[in] ptr is the FIMEMORY pointer created by the saveImageMem function, the
+        af_save_image_memory function, or the FreeImage_OpenMemory API.
+        \return     \ref AF_SUCCESS if successful
+
+        \ingroup imagemem_func_delete
+    */
+    AFAPI af_err af_delete_image_memory(void* ptr);
+#endif
+
+#if AF_API_VERSION >= 32
+    /**
+        C Interface for loading an image as its original type
+
+        This load image function allows you to load images as u8, u16 or f32
+        depending on the type of input image as shown by the table below.
+
+         Bits per Color (Gray/RGB/RGBA Bits Per Pixel) | Array Type  | Range
+        -----------------------------------------------|-------------|---------------
+          8 ( 8/24/32  BPP)                            | u8          | 0 - 255
+         16 (16/48/64  BPP)                            | u16         | 0 - 65535
+         32 (32/96/128 BPP)                            | f32         | 0 - 1
+
+        \param[out] out contains the image
+        \param[in] filename is name of file to be loaded
+        \return     \ref AF_SUCCESS if successful
+
+        \ingroup imageio_func_load
+    */
+    AFAPI af_err af_load_image_native(af_array *out, const char* filename);
+#endif
+
+#if AF_API_VERSION >= 32
+    /**
+        C Interface for saving an image without modifications
+
+        This function only accepts u8, u16, f32 arrays. These arrays are saved to
+        images without any modifications.
+
+        Note that not all image types support 16 or 32 bit images.
+
+        The best options for 16 bit images are PNG, PPM and TIFF.
+        The best option for 32 bit images is TIFF.
+        These allow lossless storage.
+
+        The images stored have the following properties:
+
+         Array Type  | Bits per Color (Gray/RGB/RGBA Bits Per Pixel) | Range
+        -------------|-----------------------------------------------|---------------
+         u8          |  8 ( 8/24/32  BPP)                            | 0 - 255
+         u16         | 16 (16/48/64  BPP)                            | 0 - 65535
+         f32         | 32 (32/96/128 BPP)                            | 0 - 1
+
+        \param[in] filename is name of file to be saved
+        \param[in] in is the array to be saved. Should be u8 for saving 8-bit image,
+        u16 for 16-bit image, and f32 for 32-bit image.
+
+        \return     \ref AF_SUCCESS if successful
+
+        \ingroup imageio_func_save
+    */
+    AFAPI af_err af_save_image_native(const char* filename, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+        Function to check if Image IO is available
+
+        \param[out] out is true if ArrayFire was compiled with ImageIO support,
+        false otherwise.
+
+        \return     \ref AF_SUCCESS if successful
+
+        \ingroup imageio_func_available
+    */
+    AFAPI af_err af_is_image_io_available(bool *out);
+#endif
+
     /**
        C Interface for resizing an image to specified dimensions
 
@@ -542,14 +1045,15 @@ extern "C" {
     /**
        C Interface for transforming an image
 
-       \param[out] out will contain the transformed image
-       \param[in] in is input image
-       \param[in] transform is transformation matrix
-       \param[in] odim0 is the first output dimension
-       \param[in] odim1 is the second output dimension
-       \param[in] method is the interpolation type (Nearest by default)
-       \param[in] inverse if true applies inverse transform, if false applies forward transoform
-       \return     \ref AF_SUCCESS if the color transformation is successful,
+       \param[out] out       will contain the transformed image
+       \param[in]  in        is input image
+       \param[in]  transform is transformation matrix
+       \param[in]  odim0     is the first output dimension
+       \param[in]  odim1     is the second output dimension
+       \param[in]  method    is the interpolation type (Nearest by default)
+       \param[in]  inverse   if true applies inverse transform, if false applies forward transform
+
+       \return \ref AF_SUCCESS if the transformation is successful,
        otherwise an appropriate error code is returned.
 
        \ingroup transform_func_transform
@@ -558,6 +1062,50 @@ extern "C" {
                               const dim_t odim0, const dim_t odim1,
                               const af_interp_type method, const bool inverse);
 
+#if AF_API_VERSION >= 37
+    /**
+       C Interface for the version of \ref af_transform that accepts a
+       preallocated output array
+
+       \param[out] out       will contain the transformed image
+       \param[in]  in        is input image
+       \param[in]  transform is transformation matrix
+       \param[in]  odim0     is the first output dimension
+       \param[in]  odim1     is the second output dimension
+       \param[in]  method    is the interpolation type (Nearest by default)
+       \param[in]  inverse   if true applies inverse transform, if false applies forward transform
+
+       \return \ref AF_SUCCESS if the transformation is successful,
+       otherwise an appropriate error code is returned.
+
+       \note \p out can be either null or an existing `af_array` object. If it is a
+             sub-array of an existing `af_array`, only the corresponding portion of
+             the `af_array` will be overwritten
+       \note Passing an `af_array` that has not been initialized to \p out will
+             cause undefined behavior.
+
+       \ingroup transform_func_transform
+    */
+    AFAPI af_err af_transform_v2(af_array *out, const af_array in, const af_array transform,
+                                 const dim_t odim0, const dim_t odim1,
+                                 const af_interp_type method, const bool inverse);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       C Interface for transforming coordinates
+
+
+       \param[out] out the transformed coordinates
+       \param[in] tf is transformation matrix
+       \param[in] d0 is the first input dimension
+       \param[in] d1 is the second input dimension
+
+       \ingroup transform_func_coordinates
+    */
+    AFAPI af_err af_transform_coordinates(af_array *out, const af_array tf, const float d0, const float d1);
+#endif
+
     /**
        C Interface for rotating an image
 
@@ -630,7 +1178,7 @@ extern "C" {
     /**
        C Interface for histogram
 
-       \param[out] out is the histogram for input array in
+       \param[out] out (type u32) is the histogram for input array in
        \param[in]  in is the input array
        \param[in]  nbins  Number of bins to populate between min and max
        \param[in]  minval minimum bin value (accumulates -inf to min)
@@ -703,8 +1251,8 @@ extern "C" {
 
         \param[out] out array is the processed image
         \param[in]  in array is the input image
-        \param[in]  spatial_sigma is the spatial variance paramter that decides the filter window
-        \param[in]  chromatic_sigma is the chromatic variance paramter
+        \param[in]  spatial_sigma is the spatial variance parameter that decides the filter window
+        \param[in]  chromatic_sigma is the chromatic variance parameter
         \param[in]  isColor indicates if the input \p in is color image or grayscale
         \return     \ref AF_SUCCESS if the filter is applied successfully,
         otherwise an appropriate error code is returned.
@@ -718,8 +1266,8 @@ extern "C" {
 
         \param[out] out array is the processed image
         \param[in]  in array is the input image
-        \param[in]  spatial_sigma is the spatial variance paramter that decides the filter window
-        \param[in]  chromatic_sigma is the chromatic variance paramter
+        \param[in]  spatial_sigma is the spatial variance parameter that decides the filter window
+        \param[in]  chromatic_sigma is the chromatic variance parameter
         \param[in]  iter is the number of iterations filter operation is performed
         \param[in]  is_color indicates if the input \p in is color image or grayscale
         \return     \ref AF_SUCCESS if the filter is applied successfully,
@@ -729,22 +1277,6 @@ extern "C" {
     */
     AFAPI af_err af_mean_shift(af_array *out, const af_array in, const float spatial_sigma, const float chromatic_sigma, const unsigned iter, const bool is_color);
 
-    /**
-        C Interface for median filter
-
-        \param[out] out array is the processed image
-        \param[in]  in array is the input image
-        \param[in]  wind_length is the kernel height
-        \param[in]  wind_width is the kernel width
-        \param[in]  edge_pad value will decide what happens to border when running
-                    filter in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
-        \return     \ref AF_SUCCESS if the median filter is applied successfully,
-        otherwise an appropriate error code is returned.
-
-        \ingroup image_func_medfilt
-    */
-    AFAPI af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length, const dim_t wind_width, const af_border_type edge_pad);
-
     /**
         C Interface for minimum filter
 
@@ -791,33 +1323,14 @@ extern "C" {
     */
     AFAPI af_err af_regions(af_array *out, const af_array in, const af_connectivity connectivity, const af_dtype ty);
 
-    /**
-       C Interface for image template matching
-
-       \param[out] out will have dispartiy values for the window starting at
-                   corresponding pixel position
-       \param[in]  search_img is an array with image data
-       \param[in]  template_img is the template we are looking for in the image
-       \param[in]  m_type is metric that should be used to calculate the disparity
-                   between window in the image and the template image. It can one of
-                   the values defined by the enum \ref af_match_type
-       \return     \ref AF_SUCCESS if disparity metric is computed successfully,
-       otherwise an appropriate error code is returned.
-
-       \note If \p search_img is 3d array, a batch operation will be performed.
-
-       \ingroup cv_func_match_template
-    */
-    AFAPI af_err af_match_template(af_array *out, const af_array search_img, const af_array template_img, const af_match_type m_type);
-
     /**
        C Interface for getting sobel gradients
 
-       \param[out] dx is derivate along horizontal direction
-       \param[out] dy is derivate along vertical direction
+       \param[out] dx is derivative along horizontal direction
+       \param[out] dy is derivative along vertical direction
        \param[in]  img is an array with image data
        \param[in]  ker_size sobel kernel size or window size
-       \return     \ref AF_SUCCESS if sobel derivates are computed successfully,
+       \return     \ref AF_SUCCESS if sobel derivatives are computed successfully,
        otherwise an appropriate error code is returned.
 
        \note If \p img is 3d array, a batch operation will be performed.
@@ -865,7 +1378,7 @@ extern "C" {
 
        \param[out] out is an array with data that has histogram approximately equal to histogram
        \param[in]  in is the input array, non-normalized input (!! assumes values [0-255] !!)
-       \param[in]  hist target histogram to approximate in output (based on # of bins)
+       \param[in]  hist target histogram to approximate in output (based on number of bins)
        \return     \ref AF_SUCCESS if the color transformation is successful,
        otherwise an appropriate error code is returned.
 
@@ -879,8 +1392,8 @@ extern "C" {
        C Interface generating gaussian kernels
 
        \param[out] out is an array with values generated using gaussian function
-       \param[in]  rows
-       \param[in]  cols
+       \param[in]  rows number of rows of the gaussian kernel
+       \param[in]  cols number of columns of the gaussian kernel
        \param[in]  sigma_r (default 0) (calculated internally as 0.25 * rows + 0.75)
        \param[in]  sigma_c (default 0) (calculated internally as 0.25 * cols + 0.75)
        \return     \ref AF_SUCCESS if gaussian distribution values are generated successfully,
@@ -923,12 +1436,10 @@ extern "C" {
     /**
        C Interface wrapper for color space conversion
 
-       \param[out] out is an array in target color space \param[in]  image is
-       the input array
-
+       \param[out] out is an array in target color space
+       \param[in]  image is the input array
        \param[in]  to is the target array color space \param[in]
        from is the input array color space
-
        \return     \ref AF_SUCCESS if the color transformation is successful,
        otherwise an appropriate error code
        is returned.
@@ -941,6 +1452,344 @@ extern "C" {
     */
     AFAPI af_err af_color_space(af_array *out, const af_array image, const af_cspace_t to, const af_cspace_t from);
 
+#if AF_API_VERSION >= 31
+    /**
+       C Interface for rearranging windowed sections of an input into columns
+       (or rows)
+
+       \param[out] out is an array with the input's sections rearranged as columns
+                   (or rows)
+       \param[in]  in is the input array
+       \param[in]  wx is the window size along dimension 0
+       \param[in]  wy is the window size along dimension 1
+       \param[in]  sx is the stride along dimension 0
+       \param[in]  sy is the stride along dimension 1
+       \param[in]  px is the padding along dimension 0
+       \param[in]  py is the padding along dimension 1
+       \param[in]  is_column determines whether the section becomes a column (if
+                   true) or a row (if false)
+       \return     \ref AF_SUCCESS if unwrap is successful,
+                   otherwise an appropriate error code is returned.
+
+       \note \p in can hold multiple images for processing if it is three or
+             four-dimensional
+       \note \p wx must be in [1, input.dims(0) + px] and \p wy in [1, input.dims(1) + py]
+       \note \p sx and \p sy must be at least 1
+       \note \p px and \p py must be in [0, wx - 1] and [0, wy - 1], respectively.
+             Padding becomes part of the input image prior to the windowing
+
+       \ingroup image_func_unwrap
+    */
+    AFAPI af_err af_unwrap(af_array *out, const af_array in, const dim_t wx, const dim_t wy,
+                           const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                           const bool is_column);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface for performing the opposite of \ref af::unwrap()
+
+       \param[out] out is an array with the input's columns (or rows) reshaped as
+                   patches
+       \param[in]  in is the input array
+       \param[in]  ox is the output's dimension 0 size
+       \param[in]  oy is the output's dimension 1 size
+       \param[in]  wx is the window size along dimension 0
+       \param[in]  wy is the window size along dimension 1
+       \param[in]  sx is the stride along dimension 0
+       \param[in]  sy is the stride along dimension 1
+       \param[in]  px is the padding along dimension 0
+       \param[in]  py is the padding along dimension 1
+       \param[in]  is_column determines whether an output patch is formed from a
+                   column (if true) or a row (if false)
+       \return     \ref AF_SUCCESS if wrap is successful,
+       otherwise an appropriate error code is returned.
+
+       \note Wrap is typically used to recompose an unwrapped image. If this is the
+             case, use the same parameters that were used in \ref af::unwrap(). Also
+             use the original image size (before unwrap) for \p ox and \p oy.
+       \note The window/patch size, \p wx \f$\times\f$ \p wy, must equal
+             `input.dims(0)` (or `input.dims(1)` if \p is_column is false).
+       \note \p sx and \p sy must be at least 1
+       \note \p px and \p py must be between [0, wx) and [0, wy), respectively
+       \note The number of patches, `input.dims(1)` (or `input.dims(0)` if
+             \p is_column is false), must equal \f$nx \times\ ny\f$, where
+             \f$\displaystyle nx = \frac{ox + 2px - wx}{sx} + 1\f$ and
+             \f$\displaystyle ny = \frac{oy + 2py - wy}{sy} + 1\f$
+       \note Batched wrap can be performed on multiple 2D slices at once if \p in
+             is three or four-dimensional
+
+       \ingroup image_func_wrap
+    */
+    AFAPI af_err af_wrap(af_array *out,
+                         const af_array in,
+                         const dim_t ox, const dim_t oy,
+                         const dim_t wx, const dim_t wy,
+                         const dim_t sx, const dim_t sy,
+                         const dim_t px, const dim_t py,
+                         const bool is_column);
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+       C Interface for the version of \ref af_wrap that accepts a
+       preallocated output array
+
+       \param[out] out is an array with the input's columns (or rows) reshaped as
+                   patches
+       \param[in]  in is the input array
+       \param[in]  ox is the output's dimension 0 size
+       \param[in]  oy is the output's dimension 1 size
+       \param[in]  wx is the window size along dimension 0
+       \param[in]  wy is the window size along dimension 1
+       \param[in]  sx is the stride along dimension 0
+       \param[in]  sy is the stride along dimension 1
+       \param[in]  px is the padding along dimension 0
+       \param[in]  py is the padding along dimension 1
+       \param[in]  is_column determines whether an output patch is formed from a
+                   column (if true) or a row (if false)
+       \return     \ref AF_SUCCESS if wrap is successful,
+       otherwise an appropriate error code is returned.
+
+       \note Wrap is typically used to recompose an unwrapped image. If this is the
+             case, use the same parameters that were used in \ref af::unwrap(). Also
+             use the original image size (before unwrap) for \p ox and \p oy.
+       \note The window/patch size, \p wx \f$\times\f$ \p wy, must equal
+             `input.dims(0)` (or `input.dims(1)` if \p is_column is false).
+       \note \p sx and \p sy must be at least 1
+       \note \p px and \p py must be between [0, wx) and [0, wy), respectively
+       \note The number of patches, `input.dims(1)` (or `input.dims(0)` if
+             \p is_column is false), must equal \f$nx \times\ ny\f$, where
+             \f$\displaystyle nx = \frac{ox + 2px - wx}{sx} + 1\f$ and
+             \f$\displaystyle ny = \frac{oy + 2py - wy}{sy} + 1\f$
+       \note Batched wrap can be performed on multiple 2D slices at once if \p in
+             is three or four-dimensional
+
+       \ingroup image_func_wrap
+    */
+    AFAPI af_err af_wrap_v2(af_array *out,
+                            const af_array in,
+                            const dim_t ox, const dim_t oy,
+                            const dim_t wx, const dim_t wy,
+                            const dim_t sx, const dim_t sy,
+                            const dim_t px, const dim_t py,
+                            const bool is_column);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface wrapper for summed area tables
+
+       \param[out] out is the summed area table on input image(s)
+       \param[in]  in is the input array
+       \return \ref AF_SUCCESS if the sat computation is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_sat
+    */
+    AFAPI af_err af_sat(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface for converting YCbCr to RGB
+
+       \param[out] out is an array in the RGB color space
+       \param[in]  in is an array in the YCbCr color space
+       \param[in]  standard specifies the ITU-R BT "xyz" standard which determines the Kb, Kr values
+       used in colorspace conversion equation
+       \return     \ref AF_SUCCESS if the color transformation is successful,
+       otherwise an appropriate error code is returned.
+
+       \note \p in must be three dimensional and values should lie in the range [0,1]
+
+       \ingroup image_func_ycbcr2rgb
+    */
+    AFAPI af_err af_ycbcr2rgb(af_array* out, const af_array in, const af_ycc_std standard);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface for converting RGB to YCbCr
+
+       \param[out] out is an array in the YCbCr color space
+       \param[in]  in is an array in the RGB color space
+       \param[in]  standard specifies the ITU-R BT "xyz" standard which determines the Kb, Kr values
+       used in colorspace conversion equation
+       \return     \ref AF_SUCCESS if the color transformation is successful,
+       otherwise an appropriate error code is returned.
+
+       \note \p in must be three dimensional and values should lie in the range [0,1]
+
+       \ingroup image_func_rgb2ycbcr
+    */
+    AFAPI af_err af_rgb2ycbcr(af_array* out, const af_array in, const af_ycc_std standard);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface for finding image moments
+
+       \param[out] out is an array containing the calculated moments
+       \param[in]  in is an array of image(s)
+       \param[in] moment is moment(s) to calculate
+       \return     \ref AF_SUCCESS if the moment calculation is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_moments
+    */
+    AFAPI af_err af_moments(af_array *out, const af_array in, const af_moment_type moment);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface for calculating image moment(s) of a single image
+
+       \param[out] out is a pointer to a pre-allocated array where the calculated moment(s) will be placed.
+       User is responsible for ensuring enough space to hold all requested moments
+       \param[in] in is the input image
+       \param[in] moment is moment(s) to calculate
+       \return     \ref AF_SUCCESS if the moment calculation is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_moments
+    */
+    AFAPI af_err af_moments_all(double* out, const af_array in, const af_moment_type moment);
+#endif
+
+#if AF_API_VERSION >= 35
+    /**
+       C Interface for canny edge detector
+
+       \param[out] out is a binary array containing edges
+       \param[in] in is the input image
+       \param[in] threshold_type     determines if user set high threshold is to be used or not. It
+                                     can take values defined by the enum \ref af_canny_threshold
+       \param[in] low_threshold_ratio   is the lower threshold % of the maximum or auto-derived high
+                                        threshold
+       \param[in] high_threshold_ratio  is the higher threshold % of maximum value in gradient image
+                                        used in hysteresis procedure. This value is ignored if
+                                        \ref AF_CANNY_THRESHOLD_AUTO_OTSU is chosen as
+                                        \ref af_canny_threshold
+       \param[in] sobel_window      is the window size of sobel kernel for computing gradient direction
+                                    and magnitude
+       \param[in] is_fast indicates   if L<SUB>1</SUB> norm (faster but less accurate) is used to
+                                      compute image gradient magnitude instead of L<SUB>2</SUB> norm.
+       \return    \ref AF_SUCCESS if the edge detection is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_canny
+    */
+    AFAPI af_err af_canny(af_array* out, const af_array in,
+                          const af_canny_threshold threshold_type,
+                          const float low_threshold_ratio,
+                          const float high_threshold_ratio,
+                          const unsigned sobel_window, const bool is_fast);
+#endif
+
+#if AF_API_VERSION >= 36
+    /**
+       C Interface for anisotropic diffusion
+
+       It can do both gradient and curvature based anisotropic smoothing.
+
+       \param[out] out is an af_array containing anisotropically smoothed image pixel values
+       \param[in] in is the input image, expects non-integral (float/double) typed af_array
+       \param[in] timestep is the time step used in solving the diffusion equation.
+       \param[in] conductance parameter controls the sensitivity of conductance in diffusion equation.
+       \param[in] iterations is the number of times the diffusion step is performed.
+       \param[in] fftype indicates whether quadratic or exponential flux function is used by algorithm.
+       \param[in] diffusion_kind will let the user choose what kind of diffusion method to perform. It will take
+                  any value of enum \ref af_diffusion_eq
+       \return \ref AF_SUCCESS if the anisotropic diffusion is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_anisotropic_diffusion
+    */
+    AFAPI af_err af_anisotropic_diffusion(af_array* out, const af_array in,
+                                          const float timestep,
+                                          const float conductance,
+                                          const unsigned iterations,
+                                          const af_flux_function fftype,
+                                          const af_diffusion_eq diffusion_kind);
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+       C Interface for Iterative deconvolution algorithm
+
+       \param[out] out is the sharp estimate generated from the blurred input
+       \param[in] in is the blurred input image
+       \param[in] ker is the kernel(point spread function) known to have caused
+                  the blur in the system
+       \param[in] iterations is the number of iterations the algorithm will run
+       \param[in] relax_factor is the relaxation factor multiplied with
+                  distance of estimate from observed image.
+       \param[in] algo takes value of type enum \ref af_iterative_deconv_algo
+                  indicating the iterative deconvolution algorithm to be used
+       \return \ref AF_SUCCESS if the deconvolution is successful,
+       otherwise an appropriate error code is returned.
+
+       \note \p relax_factor argument is ignored when the
+       \ref AF_ITERATIVE_DECONV_RICHARDSONLUCY algorithm is used.
+
+       \ingroup image_func_iterative_deconv
+     */
+    AFAPI af_err af_iterative_deconv(af_array* out,
+                                     const af_array in, const af_array ker,
+                                     const unsigned iterations,
+                                     const float relax_factor,
+                                     const af_iterative_deconv_algo algo);
+
+    /**
+       C Interface for Tikhonov deconvolution algorithm
+
+       \param[out] out is the sharp estimate generated from the blurred input
+       \param[in] in is the blurred input image
+       \param[in] psf is the kernel(point spread function) known to have caused
+                  the blur in the system
+       \param[in] gamma takes different meaning depending on the algorithm
+                  chosen. If \p algo is AF_INVERSE_DECONV_TIKHONOV, then
+                  \p gamma is a user defined regularization constant.
+       \param[in] algo takes value of type enum \ref af_inverse_deconv_algo
+                  indicating the inverse deconvolution algorithm to be used
+       \return \ref AF_SUCCESS if the deconvolution is successful,
+       otherwise an appropriate error code is returned.
+
+       \ingroup image_func_inverse_deconv
+     */
+    AFAPI af_err af_inverse_deconv(af_array* out, const af_array in,
+                                   const af_array psf, const float gamma,
+                                   const af_inverse_deconv_algo algo);
+
+    /**
+       C Interface for confidence connected components
+
+       \param[out] out is the output af_array having the connected components
+       \param[in] in is the input image, expects non-integral (float/double)
+                  typed af_array
+       \param[in] seedx is an af_array of x coordinates of the seed points
+       \param[in] seedy is an af_array of y coordinates of the seed points
+       \param[in] radius is the neighborhood region to be considered around
+                  each seed point
+       \param[in] multiplier controls the threshold range computed from
+                  the mean and variance of seed point neighborhoods
+       \param[in] iter is the number of iterations
+       \param[in] segmented_value is the value to which valid pixels of
+                  the output array are set.
+       \return \ref AF_SUCCESS if the execution is successful, otherwise an
+       appropriate error code is returned.
+
+       \ingroup image_func_confidence_cc
+    */
+    AFAPI af_err af_confidence_cc(af_array *out, const af_array in,
+                                  const af_array seedx, const af_array seedy,
+                                  const unsigned radius,
+                                  const unsigned multiplier, const int iter,
+                                  const double segmented_value);
+
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/index.h b/include/af/index.h
index be6a9d2821..8eaaeaa0a5 100644
--- a/include/af/index.h
+++ b/include/af/index.h
@@ -12,15 +12,14 @@
 #include <af/seq.h>
 
 ///
-/// \brief Struct used while indexing af_array
+/// \brief Struct used to index an af_array
 ///
-/// This struct represents objects which can be used to index into an afa_array
+/// This struct represents objects which can be used to index into an af_array
 /// Object. It contains a union object which can be an \ref af_seq or an
 /// \ref af_array. Indexing with an int can be represented using a \ref af_seq
 /// object with the same \ref af_seq::begin and \ref af_seq::end with an
 /// af_seq::step of 1
-///
-typedef struct af_index_t{
+typedef struct af_index_t {
     union {
         af_array arr;   ///< The af_array used for indexing
         af_seq   seq;   ///< The af_seq used for indexing
@@ -46,15 +45,16 @@ class seq;
 /// allows implicit type conversion from valid indexing types like int,
 /// \ref af::seq, \ref af_seq, and \ref af::array.
 ///
-/// \note This is a helper class and does not necessarly need to be created
+/// \note This is a helper class and does not necessarily need to be created
 /// explicitly. It is used in the operator() overloads to simplify the API.
 ///
+/// \ingroup arrayfire_class
 class AFAPI index {
 
     af_index_t impl;
     public:
     ///
-    /// \brief Default constructor. Equivilant to \ref af::span
+    /// \brief Default constructor. Equivalent to \ref af::span
     ///
     index();
     ~index();
@@ -71,7 +71,7 @@ class AFAPI index {
     index(const int idx);
 
     ///
-    /// \brief Implicit int converter
+    /// \brief Implicit seq converter
     ///
     /// Indexes the af::array using an \ref af::seq object
     ///
@@ -82,7 +82,7 @@ class AFAPI index {
     index(const af::seq& s0);
 
     ///
-    /// \brief Implicit int converter
+    /// \brief Implicit seq converter
     ///
     /// Indexes the af::array using an \ref af_seq object
     ///
@@ -103,6 +103,17 @@ class AFAPI index {
     ///
     index(const af::array& idx0);
 
+#if AF_API_VERSION >= 31
+    ///
+    /// \brief Copy constructor
+    ///
+    /// \param[in] idx0 is index to copy.
+    ///
+    /// \sa indexing
+    ///
+    index(const index& idx0);
+#endif
+
     ///
     /// \brief Returns true if the \ref af::index represents a af::span object
     ///
@@ -116,21 +127,66 @@ class AFAPI index {
     /// \returns the af_index_t represented by this object
     ///
     const af_index_t& get() const;
+
+#if AF_API_VERSION >= 31
+    ///
+    /// \brief Assigns idx0 to this index
+    ///
+    /// \param[in] idx0 is the index to be assigned to the \ref af::index
+    /// \returns the reference to this
+    ///
+    ///
+    index & operator=(const index& idx0);
+
+#if AF_COMPILER_CXX_RVALUE_REFERENCES
+    ///
+    /// \brief Move constructor
+    ///
+    /// \param[in] idx0 is the index to move from.
+    ///
+    index(index &&idx0);
+    ///
+    /// \brief Move assignment operator
+    ///
+    /// \param[in] idx0 is the index to be assigned to the \ref af::index
+    /// \returns a reference to this
+    ///
+    index& operator=(index &&idx0);
+#endif
+#endif // AF_API_VERSION
 };
 
 ///
-/// Lookup the values of input array based on index
+/// Lookup the values of an input array by indexing with another array
 ///
-/// \param[in] in is input lookup array
-/// \param[in] idx is lookup indices
+/// \param[in] in is the input array that will be queried
+/// \param[in] idx are the lookup indices
 /// \param[in] dim specifies the dimension for indexing
-/// \returns an array containing values at locations specified by \p index
+/// \returns an array containing values of \p in at locations specified by \p idx
 ///
-/// \ingroup index_func_index
+/// \ingroup index_func_lookup
 ///
-
 AFAPI array lookup(const array &in, const array &idx, const int dim = -1);
 
+#if AF_API_VERSION >= 31
+///
+/// Copy the values of an input array based on index
+///
+/// \param[out] dst The destination array
+/// \param[in] src The source array
+/// \param[in] idx0 The first index
+/// \param[in] idx1 The second index (defaults to \ref af::span)
+/// \param[in] idx2 The third index (defaults to \ref af::span)
+/// \param[in] idx3 The fourth index (defaults to \ref af::span)
+/// \ingroup index_func_index
+///
+AFAPI void copy(array &dst, const array &src,
+                const index &idx0,
+                const index &idx1 = span,
+                const index &idx2 = span,
+                const index &idx3 = span);
+#endif
+
 }
 #endif
 
@@ -141,31 +197,29 @@ extern "C" {
     ///
     /// Lookup the values of input array based on sequences
     ///
-    /// \param[out] out  will contain an array containing values at indexed by the
+    /// \param[out] out  output array containing values indexed by the
     ///                  sequences
     /// \param[in] in    is the input array
     /// \param[in] ndims is the number of sequences provided
     /// \param[in] index is an array of sequences
     ///
     /// \ingroup index_func_index
-
     AFAPI af_err af_index(  af_array *out,
                             const af_array in,
                             const unsigned ndims, const af_seq* const index);
 
 
     ///
-    /// Lookup the values of input array based on index
+    /// Lookup the values of an input array by indexing with another array
     ///
-    /// \param[out] out      will contain an array containing values at locations
-    ///                      specified by \p index
-    /// \param[in] in        is input lookup array
-    /// \param[in] indices   is lookup indices
+    /// \param[out] out      output array containing values of \p in at locations
+    ///                      specified by \p indices
+    /// \param[in] in        is the input array that will be queried
+    /// \param[in] indices   are the lookup indices
     /// \param[in] dim       specifies the dimension for indexing
     ///
-    /// \ingroup index_func_index
+    /// \ingroup index_func_lookup
     ///
-
     AFAPI af_err af_lookup( af_array *out,
                             const af_array in, const af_array indices,
                             const unsigned dim);
@@ -173,7 +227,7 @@ extern "C" {
     ///
     /// Copy and write values in the locations specified by the sequences
     ///
-    /// \param[out] out     will contain an array with values of \p rhs copied to
+    /// \param[out] out     output array with values of \p rhs copied to
     ///                     locations specified by \p index and values from
     ///                     \p lhs in all other locations.
     /// \param[in] lhs      is array whose values are used for indices NOT
@@ -185,7 +239,6 @@ extern "C" {
     ///
     /// \ingroup index_func_assign
     ///
-
     AFAPI af_err af_assign_seq( af_array *out,
                                 const af_array lhs,
                                 const unsigned ndims, const af_seq* const indices,
@@ -194,11 +247,11 @@ extern "C" {
     ///
     /// \brief Indexing an array using \ref af_seq, or \ref af_array
     ///
-    /// generalized indexing function that accepts either af_array or af_seq
+    /// Generalized indexing function that accepts either af_array or af_seq
     /// along a dimension to index the input array and create the corresponding
     /// output array
     ///
-    /// \param[out] out     will contain an array containing values at indexed by
+    /// \param[out] out     output array containing values indexed by
     ///                     the sequences
     /// \param[in] in       is the input array
     /// \param[in] ndims    is the number of \ref af_index_t provided
@@ -214,23 +267,110 @@ extern "C" {
     /// \brief Assignment of an array using \ref af_seq, or \ref af_array
     ///
     /// Generalized assignment function that accepts either af_array or af_seq
-    /// along a dimension to assing elements form an input array to an output
+    /// along a dimension to assign elements from an input array to an output
     /// array
     ///
-    /// \param[out] out     will contain an array containing values at indexed by
+    /// \param[out] out     output array containing values indexed by
     ///                     the sequences
     /// \param[in] lhs      is the input array
     /// \param[in] ndims    is the number of \ref af_index_t provided
-    /// \param[in] indices  is an af_array of \ref af_index_t objects
-    /// \param[in] rhs      is the array whos values will be assigned to \p lhs
+    /// \param[in] indices  is a C array of \ref af_index_t objects
+    /// \param[in] rhs      is the array whose values will be assigned to \p lhs
     ///
-    /// \ingroup index_func_index
+    /// \ingroup index_func_assign
     ///
     AFAPI af_err af_assign_gen( af_array *out,
                                 const af_array lhs,
                                 const dim_t ndims, const af_index_t* indices,
                                 const af_array rhs);
 
+#if AF_API_VERSION >= 32
+    ///
+    /// \brief Create a quadruple af_index_t array
+    ///
+    /// \snippet test/index.cpp ex_index_util_0
+    ///
+    /// \param[out] indexers pointer to location where quadruple af_index_t array is created
+    /// \returns \ref af_err error code
+    ///
+    /// \ingroup index_func_index
+    ///
+    AFAPI af_err af_create_indexers(af_index_t** indexers);
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// \brief set \p dim to given indexer af_array \p idx
+    ///
+    /// \snippet test/index.cpp ex_index_util_0
+    ///
+    /// \param[in] indexer pointer to location where quadruple af_index_t array was created
+    /// \param[in] idx is the af_array indexer for given dimension \p dim
+    /// \param[in] dim is the dimension to be indexed
+    /// \returns \ref af_err error code
+    ///
+    /// \ingroup index_func_index
+    ///
+    AFAPI af_err af_set_array_indexer(af_index_t* indexer, const af_array idx, const dim_t dim);
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// \brief set \p dim to given indexer af_seq \p idx
+    ///
+    /// This function is similar to \ref af_set_array_indexer in terms of functionality except
+    /// that this version accepts an object of type \ref af_seq instead of \ref af_array.
+    ///
+    /// \snippet test/index.cpp ex_index_util_0
+    ///
+    /// \param[in] indexer pointer to location where quadruple af_index_t array was created
+    /// \param[in] idx is the af_seq indexer for given dimension \p dim
+    /// \param[in] dim is the dimension to be indexed
+    /// \param[in] is_batch indicates if the sequence based indexing is inside a batch operation
+    ///
+    /// \ingroup index_func_index
+    ///
+    AFAPI af_err af_set_seq_indexer(af_index_t* indexer, const af_seq* idx,
+                                  const dim_t dim, const bool is_batch);
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// \brief set \p dim to an af_seq indexer built from \p begin, \p end and \p step
+    ///
+    ///  This function is an alternative to \ref af_set_seq_indexer where instead of passing
+    ///  in an already prepared \ref af_seq object, you pass the arguments necessary for
+    ///  creating an af_seq directly.
+    ///
+    /// \param[in] indexer pointer to location where quadruple af_index_t array was created
+    /// \param[in] begin is the beginning index along dimension \p dim
+    /// \param[in] end is the ending index along dimension \p dim
+    /// \param[in] step is the step size along dimension \p dim
+    /// \param[in] dim is the dimension to be indexed
+    /// \param[in] is_batch indicates if the sequence based indexing is inside a batch operation
+    /// \returns \ref af_err error code
+    ///
+    /// \ingroup index_func_index
+    ///
+    AFAPI af_err af_set_seq_param_indexer(af_index_t* indexer,
+                                        const double begin, const double end, const double step,
+                                        const dim_t dim, const bool is_batch);
+#endif
+
+#if AF_API_VERSION >= 32
+    ///
+    /// \brief Releases the memory resources used by the quadruple af_index_t array
+    ///
+    /// \snippet test/index.cpp ex_index_util_0
+    ///
+    /// \param[in] indexers is the pointer to the location where the quadruple af_index_t array was created
+    /// \returns \ref af_err error code
+    ///
+    /// \ingroup index_func_index
+    ///
+    AFAPI af_err af_release_indexers(af_index_t* indexers);
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/internal.h b/include/af/internal.h
new file mode 100644
index 0000000000..c441919107
--- /dev/null
+++ b/include/af/internal.h
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+#ifdef __cplusplus
+namespace af
+{
+    class array;
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] data is the raw data pointer.
+       \param[in] offset specifies the number of elements to skip.
+       \param[in] dims specifies the dimensions for the region of interest.
+       \param[in] strides specifies the distance between each element of a given dimension.
+       \param[in] ty specifies the data type of \p data.
+       \param[in] location specifies if the data is on host or the device.
+
+       \note If \p location is `afHost`, a memory copy is performed.
+
+       \returns an af::array() with specified offset, dimensions and strides.
+
+       \ingroup internal_func_create
+    */
+    AFAPI array createStridedArray(const void *data, const dim_t offset,
+                                   const dim4 dims, const dim4 strides,
+                                   const af::dtype ty,
+                                   const af::source location);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] in A multi-dimensional array.
+       \returns af::dim4() containing distance between consecutive elements in each dimension.
+
+       \ingroup internal_func_strides
+    */
+    AFAPI dim4 getStrides(const array &in);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] in A multi-dimensional array.
+       \returns offset from the starting location of data pointer specified in number of elements.
+
+       \ingroup internal_func_offset
+    */
+    AFAPI dim_t getOffset(const array &in);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] in A multi-dimensional array.
+       \returns the raw pointer location of the array.
+
+       \note This pointer may be shared with other arrays. Use this function with caution.
+
+       \ingroup internal_func_rawptr
+    */
+    AFAPI void *getRawPtr(const array &in);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] in A multi-dimensional array.
+       \returns a boolean specifying if all elements in the array are contiguous.
+
+       \ingroup internal_func_linear
+    */
+    AFAPI bool isLinear(const array &in);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] in A multi-dimensional array.
+       \returns a boolean specifying if the array owns the raw pointer. It is false if it is a sub array.
+
+       \ingroup internal_func_owner
+    */
+    AFAPI bool isOwner(const array &in);
+#endif
+}
+#endif
+
+#ifdef __cplusplus
+extern "C"
+{
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[out] arr an af_array with specified offset, dimensions and strides.
+       \param[in] data is the raw data pointer.
+       \param[in] offset specifies the number of elements to skip.
+       \param[in] ndims specifies the number of array dimensions.
+       \param[in] dims specifies the dimensions for the region of interest.
+       \param[in] strides specifies the distance between each element of a given dimension.
+       \param[in] ty specifies the data type of \p data.
+       \param[in] location specifies if the data is on host or the device.
+
+       \note If \p location is `afHost`, a memory copy is performed.
+
+       \ingroup internal_func_create
+    */
+    AFAPI af_err af_create_strided_array(af_array *arr,
+                                         const void *data,
+                                         const dim_t offset,
+                                         const unsigned ndims,
+                                         const dim_t *const dims,
+                                         const dim_t *const strides,
+                                         const af_dtype ty,
+                                         const af_source location);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] arr A multi-dimensional array.
+       \param[out] s0 distance between each consecutive element along first  dimension.
+       \param[out] s1 distance between each consecutive element along second dimension.
+       \param[out] s2 distance between each consecutive element along third  dimension.
+       \param[out] s3 distance between each consecutive element along fourth dimension.
+
+       \ingroup internal_func_strides
+    */
+    AFAPI af_err af_get_strides(dim_t *s0, dim_t *s1, dim_t *s2, dim_t *s3, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] arr A multi-dimensional array.
+       \param[out] offset Offset from the starting location of data pointer specified in number of elements.
+
+       \ingroup internal_func_offset
+    */
+    AFAPI af_err af_get_offset(dim_t *offset, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] arr A multi-dimensional array.
+       \param[out] ptr the raw pointer location of the array.
+
+       \note This pointer may be shared with other arrays. Use this function with caution.
+
+       \ingroup internal_func_rawptr
+    */
+    AFAPI af_err af_get_raw_ptr(void **ptr, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] arr A multi-dimensional array.
+       \param[out] result a boolean specifying if all elements in the array are contiguous.
+
+       \ingroup internal_func_linear
+    */
+    AFAPI af_err af_is_linear(bool *result, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 33
+    /**
+       \param[in] arr A multi-dimensional array.
+       \param[out] result a boolean specifying if the array owns the raw pointer. It is false if it is a sub array.
+
+       \ingroup internal_func_owner
+    */
+    AFAPI af_err af_is_owner(bool *result, const af_array arr);
+#endif
+
+#if AF_API_VERSION >= 35
+    /**
+       \param[out] bytes the number of physically allocated bytes. This is the size
+       of the parent/owner if \p arr is an indexed array.
+       \param[in] arr the input array.
+
+       \ingroup internal_func_allocatedbytes
+    */
+    AFAPI af_err af_get_allocated_bytes(size_t *bytes, const af_array arr);
+#endif
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/include/af/lapack.h b/include/af/lapack.h
index ea696acd99..be30cd5900 100644
--- a/include/af/lapack.h
+++ b/include/af/lapack.h
@@ -14,202 +14,288 @@
 #ifdef __cplusplus
 namespace af
 {
+#if AF_API_VERSION >= 31
+    /**
+       C++ Interface to perform singular value decomposition.
+
+       \param[out] u  U
+       \param[out] s  diagonal values of sigma (singular values of the input 
+                      matrix)
+       \param[out] vt V^H
+       \param[in]  in input array
+
+       \ingroup lapack_factor_func_svd
+    */
+    AFAPI void svd(array &u, array &s, array &vt, const array &in);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C++ Interface to perform in-place singular value decomposition.
+
+       This function minimizes memory usage if `in` is dispensable. Input array
+       `in` is limited to arrays where `dim0` \f$\geq\f$ `dim1`.
+
+       \param[out]   u  U
+       \param[out]   s  diagonal values of sigma (singular values of the input
+                        matrix)
+       \param[out]   vt V^H
+       \param[inout] in input array; contains random data after this operation
+
+       \ingroup lapack_factor_func_svd
+    */
+    AFAPI void svdInPlace(array &u, array &s, array &vt, array &in);
+#endif
 
     /**
-       C++ Interface for LU decomposition in packed format
+       C++ Interface to perform LU decomposition in packed format.
 
-       \param[out] out is the output array containing the packed LU decomposition
-       \param[out] pivot will contain the permutation indices to map the input to the decomposition
-       \param[in] in is the input matrix
-       \param[in] is_lapack_piv specifies if the pivot is returned in original LAPACK compliant format
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[out] out           packed LU decomposition
+       \param[out] pivot         permutation indices mapping the input to the
+                                 decomposition
+       \param[in]  in            input array
+       \param[in]  is_lapack_piv specifies if the pivot is returned in original
+                                 LAPACK compliant format
 
        \ingroup lapack_factor_func_lu
     */
     AFAPI void lu(array &out, array &pivot, const array &in, const bool is_lapack_piv=true);
 
     /**
-       C++ Interface for LU decomposition
+       C++ Interface to perform LU decomposition.
 
-       \param[out] lower will contain the lower triangular matrix of the LU decomposition
-       \param[out] upper will contain the upper triangular matrix of the LU decomposition
-       \param[out] pivot will contain the permutation indices to map the input to the decomposition
-       \param[in] in is the input matrix
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[out] lower lower triangular matrix of the LU decomposition
+       \param[out] upper upper triangular matrix of the LU decomposition
+       \param[out] pivot permutation indices mapping the input to the
+                         decomposition
+       \param[in]  in    input array
 
        \ingroup lapack_factor_func_lu
     */
     AFAPI void lu(array &lower, array &upper, array &pivot, const array &in);
 
     /**
-      C++ Interface for in place LU decomposition
+       C++ Interface to perform in-place LU decomposition.
 
-      \param[out] pivot will contain the permutation indices to map the input to the decomposition
-      \param[inout] in contains the input on entry, the packed LU decomposition on exit
-      \param[in] is_lapack_piv specifies if the pivot is returned in original LAPACK compliant format
+       This function is not supported in GFOR.
 
-      \note This function is not supported in GFOR
+       \param[out]   pivot         permutation indices mapping the input to the
+                                   decomposition
+       \param[inout] in            input array on entry; packed LU
+                                   decomposition on exit
+       \param[in]    is_lapack_piv specifies if the pivot is returned in
+                                   original LAPACK-compliant format
 
-      \ingroup lapack_factor_func_lu
+       \ingroup lapack_factor_func_lu
     */
     AFAPI void luInPlace(array &pivot, array &in, const bool is_lapack_piv=true);
 
     /**
-       C++ Interface for QR decomposition in packed format
+       C++ Interface to perform QR decomposition in packed format.
 
-       \param[out] out is the output array containing the packed QR decomposition
-       \param[out] tau will contain additional information needed for unpacking the data
-       \param[in] in is the input matrix
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[out] out packed QR decomposition
+       \param[out] tau additional information needed for unpacking the data
+       \param[in]  in  input array
 
        \ingroup lapack_factor_func_qr
     */
     AFAPI void qr(array &out, array &tau, const array &in);
 
     /**
-       C++ Interface for QR decomposition
+       C++ Interface to perform QR decomposition.
 
-       \param[out] q is the orthogonal matrix from QR decomposition
-       \param[out] r is the upper triangular matrix from QR decomposition
-       \param[out] tau will contain additional information needed for solving a least squares problem using \p q and \p r
-       \param[in] in is the input matrix
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[out] q   orthogonal matrix from QR decomposition
+       \param[out] r   upper triangular matrix from QR decomposition
+       \param[out] tau additional information needed for solving a
+                       least-squares problem using `q` and `r`
+       \param[in]  in  input array
 
        \ingroup lapack_factor_func_qr
     */
     AFAPI void qr(array &q, array &r, array &tau, const array &in);
 
     /**
-       C++ Interface for QR decomposition
+       C++ Interface to perform in-place QR decomposition.
 
-       \param[out] tau will contain additional information needed for unpacking the data
-       \param[inout] in is the input matrix on entry. It contains packed QR decomposition on exit
+       This function is not supported in GFOR.
 
-       \note This function is not supported in GFOR
+       \param[out]   tau additional information needed for unpacking the data
+       \param[inout] in  input array on entry; packed QR decomposition on exit
 
        \ingroup lapack_factor_func_qr
     */
     AFAPI void qrInPlace(array &tau, array &in);
 
     /**
-       C++ Interface for cholesky decomposition
-
-       \param[out] out contains the triangular matrix. Multiply \p out with it conjugate transpose reproduces the input \p in.
-       \param[in] in is the input matrix
-       \param[in] is_upper a boolean determining if \p out is upper or lower triangular
-
-       \returns \p 0 if cholesky decomposition passes. If not returns the rank at which the decomposition failed.
-
-       \note The input matrix **has** to be a positive definite matrix. If it is not zero, the cholesky decomposition functions return a non zero output.
-       \note This function is not supported in GFOR
+       C++ Interface to perform Cholesky decomposition.
+
+       Multiplying `out` with its conjugate transpose reproduces the input
+       `in`.
+
+       The input must be positive definite.
+
+       This function is not supported in GFOR.
+
+       \param[out] out      triangular matrix
+       \param[in]  in       input matrix
+       \param[in]  is_upper boolean determining if `out` is upper or lower
+                            triangular
+       \returns    `0` if Cholesky decomposition succeeds, else the rank at
+                   which the decomposition fails
 
        \ingroup lapack_factor_func_cholesky
     */
     AFAPI int cholesky(array &out, const array &in, const bool is_upper = true);
 
     /**
-       C++ Interface for in place cholesky decomposition
+       C++ Interface to perform in-place Cholesky decomposition.
 
-       \param[inout] in is the input matrix on entry. It contains the triangular matrix on exit.
-       \param[in] is_upper a boolean determining if \p in is upper or lower triangular
+       The input must be positive definite.
 
-       \returns \p 0 if cholesky decomposition passes. If not returns the rank at which the decomposition failed.
+       This function is not supported in GFOR.
 
-       \note The input matrix **has** to be a positive definite matrix. If it is not zero, the cholesky decomposition functions return a non zero output.
-       \note This function is not supported in GFOR
+       \param[inout] in       input matrix on entry; triangular matrix on exit
+       \param[in]    is_upper boolean determining if `in` is upper or lower
+                              triangular
+       \returns      `0` if Cholesky decomposition succeeds, else the rank at
+                     which the decomposition fails
 
        \ingroup lapack_factor_func_cholesky
     */
     AFAPI int choleskyInPlace(array &in, const bool is_upper = true);
 
     /**
-       C++ Interface for solving a system of equations
+       C++ Interface to solve a system of equations.
+
+       The `options` parameter must be one of \ref AF_MAT_NONE,
+       \ref AF_MAT_LOWER or \ref AF_MAT_UPPER.
 
-       \param[in] a is the coefficient matrix
-       \param[in] b is the measured values
-       \param[in] options determining various properties of matrix \p a
-       \returns \p x, the matrix of unknown variables
+       This function is not supported in GFOR.
 
-       \note \p options needs to be one of \ref AF_MAT_NONE, \ref AF_MAT_LOWER or \ref AF_MAT_UPPER
-       \note This function is not supported in GFOR
+       \param[in] a       coefficient matrix
+       \param[in] b       measured values
+       \param[in] options determines various properties of matrix `a`
+       \returns   `x`, the matrix of unknown variables
 
        \ingroup lapack_solve_func_gen
     */
     AFAPI array solve(const array &a, const array &b, const matProp options = AF_MAT_NONE);
 
-
     /**
-       C++ Interface for solving a system of equations
+       C++ Interface to solve a system of equations.
 
-       \param[in] a is the output matrix from packed LU decomposition of the coefficient matrix
-       \param[in] piv is the pivot array from packed LU decomposition of the coefficient matrix
-       \param[in] b is the matrix of measured values
-       \param[in] options determining various properties of matrix \p a
-       \returns \p x, the matrix of unknown variables
+       The `options` parameter currently must be \ref AF_MAT_NONE.
 
-       \ingroup lapack_solve_lu_func_gen
+       This function is not supported in GFOR.
+
+       \param[in] a       packed LU decomposition of the coefficient matrix
+       \param[in] piv     pivot array from the packed LU decomposition of the
+                          coefficient matrix
+       \param[in] b       measured values
+       \param[in] options determines various properties of matrix `a`
+       \returns   `x`, the matrix of unknown variables
 
-       \note \p options currently needs to be \ref AF_MAT_NONE
-       \note This function is not supported in GFOR
+       \ingroup lapack_solve_lu_func_gen
     */
     AFAPI array solveLU(const array &a, const array &piv,
                         const array &b, const matProp options = AF_MAT_NONE);
 
     /**
-       C++ Interface for inverting a matrix
+       C++ Interface to invert a matrix.
+
+       The `options` parameter currently must be \ref AF_MAT_NONE.
 
-       \param[in] in is input matrix
-       \param[in] options determining various properties of matrix \p in
-       \returns \p x, the inverse of the input matrix
+       This function is not supported in GFOR.
 
-       \note \p options currently needs to be \ref AF_MAT_NONE
-       \note This function is not supported in GFOR
+       \param[in] in      input matrix
+       \param[in] options determines various properties of matrix `in`
+       \returns   inverse matrix
 
        \ingroup lapack_ops_func_inv
     */
     AFAPI array inverse(const array &in, const matProp options = AF_MAT_NONE);
 
+#if AF_API_VERSION >= 37
     /**
-       C++ Interface for finding the rank of a matrix
+       C++ Interface to pseudo-invert (Moore-Penrose) a matrix.
+
+       Currently uses the SVD-based approach.
+
+       Parameter `tol` is not the actual lower threshold, but it is passed in
+       as a parameter to the calculation of the actual threshold relative to
+       the shape and contents of `in`.
+
+       This function is not supported in GFOR.
+
+       \param[in] in      input matrix
+       \param[in] tol     defines the lower threshold for singular values from
+                          SVD
+       \param[in] options must be AF_MAT_NONE (more options might be supported
+                          in the future)
+       \returns   pseudo-inverse matrix
 
-       \param[in] in is input matrix
-       \param[in] tol is the tolerance value
+       \ingroup lapack_ops_func_pinv
+    */
+    AFAPI array pinverse(const array &in, const double tol=1E-6,
+                         const matProp options = AF_MAT_NONE);
+#endif
 
-       \returns the rank of the matrix
+    /**
+       C++ Interface to find the rank of a matrix.
+
+       \param[in] in  input matrix
+       \param[in] tol tolerance value
+       \returns   rank
 
        \ingroup lapack_ops_func_rank
     */
     AFAPI unsigned rank(const array &in, const double tol=1E-5);
 
     /**
-       C++ Interface for finding the determinant of a matrix
-
-       \param[in] in is input matrix
+       C++ Interface to find the determinant of a matrix.
 
-       \returns the determinant of the matrix
+       \param[in] in input matrix
+       \returns   determinant
 
        \ingroup lapack_ops_func_det
     */
     template<typename T> T det(const array &in);
 
     /**
-       C++ Interface for norm of a matrix
+       C++ Interface to find the norm of a matrix.
 
-       \param[in] in is the input matrix
-       \param[in] type specifies the \ref af::normType. Default: \ref AF_NORM_VECTOR_1
-       \param[in] p specifies the value of P when \p type is one of \ref AF_NORM_VECTOR_P, AF_NORM_MATRIX_L_PQ is used. It is ignored for other values of \p type
-       \param[in] q specifies the value of Q when \p type is AF_NORM_MATRIX_L_PQ. This parameter is ignored if \p type is anything else
-
-       \returns the norm of \p inbased on \p type
+       \param[in] in   input matrix
+       \param[in] type \ref af::normType. Default: \ref AF_NORM_EUCLID
+       \param[in] p    value of P when `type` is \ref AF_NORM_VECTOR_P or
+                       \ref AF_NORM_MATRIX_L_PQ, else ignored
+       \param[in] q    value of Q when `type` is \ref AF_NORM_MATRIX_L_PQ, else
+                       ignored
+       \returns   norm
 
        \ingroup lapack_ops_func_norm
     */
     AFAPI double norm(const array &in, const normType type=AF_NORM_EUCLID,
                       const double p=1, const double q=1);
+
+#if AF_API_VERSION >= 33
+    /**
+       Returns true if ArrayFire is compiled with LAPACK support.
+
+       \returns true if LAPACK support is available; false otherwise
+
+       \ingroup lapack_helper_func_available
+    */
+    AFAPI bool isLAPACKAvailable();
+#endif
+
 }
 #endif
 
@@ -217,159 +303,284 @@ namespace af
 extern "C" {
 #endif
 
+#if AF_API_VERSION >= 31
     /**
-       C Interface for LU decomposition
+       C Interface to perform singular value decomposition.
 
-       \param[out] lower will contain the lower triangular matrix of the LU decomposition
-       \param[out] upper will contain the upper triangular matrix of the LU decomposition
-       \param[out] pivot will contain the permutation indices to map the input to the decomposition
-       \param[in] in is the input matrix
+       \param[out] u  U
+       \param[out] s  diagonal values of sigma (singular values of the input
+                      matrix)
+       \param[out] vt V^H
+       \param[in]  in input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup lapack_factor_func_svd
+    */
+    AFAPI af_err af_svd(af_array *u, af_array *s, af_array *vt, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface to perform in-place singular value decomposition.
+
+       This function minimizes memory usage if `in` is dispensable. Input array
+       `in` is limited to arrays where `dim0` \f$\geq\f$ `dim1`.
+
+       \param[out]   u  U
+       \param[out]   s  diagonal values of sigma (singular values of the input
+                        matrix)
+       \param[out]   vt V^H
+       \param[inout] in input array; contains random data after this operation
+       \return       \ref AF_SUCCESS, if function returns successfully, else
+                     an \ref af_err code is given
+
+       \ingroup lapack_factor_func_svd
+    */
+    AFAPI af_err af_svd_inplace(af_array *u, af_array *s, af_array *vt, af_array in);
+#endif
+
+    /**
+       C Interface to perform LU decomposition.
+
+       \param[out] lower lower triangular matrix of the LU decomposition
+       \param[out] upper upper triangular matrix of the LU decomposition
+       \param[out] pivot permutation indices mapping the input to the
+                         decomposition
+       \param[in]  in    input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_factor_func_lu
     */
     AFAPI af_err af_lu(af_array *lower, af_array *upper, af_array *pivot, const af_array in);
 
     /**
-       C Interface for in place LU decomposition
+       C Interface to perform in-place LU decomposition.
+
+       This function is not supported in GFOR.
 
-       \param[out] pivot will contain the permutation indices to map the input to the decomposition
-       \param[inout] in contains the input on entry, the packed LU decomposition on exit
-       \param[in] is_lapack_piv specifies if the pivot is returned in original LAPACK compliant format
+       \param[out]   pivot         permutation indices mapping the input to the
+                                   decomposition
+       \param[inout] in            input array on entry; packed LU
+                                   decomposition on exit
+       \param[in]    is_lapack_piv specifies if the pivot is returned in
+                                   original LAPACK-compliant format
+       \return       \ref AF_SUCCESS, if function returns successfully, else
+                     an \ref af_err code is given
 
        \ingroup lapack_factor_func_lu
     */
     AFAPI af_err af_lu_inplace(af_array *pivot, af_array in, const bool is_lapack_piv);
 
     /**
-       C Interface for QR decomposition
+       C Interface to perform QR decomposition.
+
+       This function is not supported in GFOR.
 
-       \param[out] q is the orthogonal matrix from QR decomposition
-       \param[out] r is the upper triangular matrix from QR decomposition
-       \param[out] tau will contain additional information needed for solving a least squares problem using \p q and \p r
-       \param[in] in is the input matrix
+       \param[out] q   orthogonal matrix from QR decomposition
+       \param[out] r   upper triangular matrix from QR decomposition
+       \param[out] tau additional information needed for solving a
+                       least-squares problem using `q` and `r`
+       \param[in]  in  input array
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_factor_func_qr
     */
     AFAPI af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in);
 
     /**
-       C Interface for QR decomposition
+       C Interface to perform in-place QR decomposition.
 
-       \param[out] tau will contain additional information needed for unpacking the data
-       \param[inout] in is the input matrix on entry. It contains packed QR decomposition on exit
+       This function is not supported in GFOR.
+
+       \param[out]   tau additional information needed for unpacking the data
+       \param[inout] in  input array on entry; packed QR decomposition on exit
+       \return       \ref AF_SUCCESS, if function returns successfully, else
+                     an \ref af_err code is given
 
        \ingroup lapack_factor_func_qr
     */
     AFAPI af_err af_qr_inplace(af_array *tau, af_array in);
 
     /**
-       C++ Interface for cholesky decomposition
+       C Interface to perform Cholesky decomposition.
+
+       Multiplying `out` with its conjugate transpose reproduces the input
+       `in`.
 
-       \param[out] out contains the triangular matrix. Multiply \p out with it conjugate transpose reproduces the input \p in.
-       \param[out] info is \p 0 if cholesky decomposition passes. If not returns the rank at which the decomposition failed.
-       \param[in] in is the input matrix
-       \param[in] is_upper a boolean determining if \p out is upper or lower triangular
+       The input must be positive definite.
 
-       \note The input matrix **has** to be a positive definite matrix. If it is not zero, the cholesky decomposition functions return a non zero output.
+       \param[out] out      triangular matrix
+       \param[out] info     `0` if Cholesky decomposition succeeds, else the
+                            rank at which the decomposition fails
+       \param[in]  in       input matrix
+       \param[in]  is_upper boolean determining if `out` is upper or lower
+                            triangular
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_factor_func_cholesky
     */
     AFAPI af_err af_cholesky(af_array *out, int *info, const af_array in, const bool is_upper);
 
     /**
-       C Interface for in place cholesky decomposition
+       C Interface to perform in-place Cholesky decomposition.
 
-       \param[out] info is \p 0 if cholesky decomposition passes. If not returns the rank at which the decomposition failed.
-       \param[inout] in is the input matrix on entry. It contains the triangular matrix on exit.
-       \param[in] is_upper a boolean determining if \p in is upper or lower triangular
+       The input must be positive definite.
 
-       \note The input matrix **has** to be a positive definite matrix. If it is not zero, the cholesky decomposition functions return a non zero output.
+       \param[out]   info     `0` if Cholesky decomposition succeeds, else the
+                              rank at which the decomposition fails
+       \param[inout] in       input matrix on entry; triangular matrix on exit
+       \param[in]    is_upper boolean determining if `in` is upper or lower
+                              triangular
+       \return       \ref AF_SUCCESS, if function returns successfully, else
+                     an \ref af_err code is given
 
        \ingroup lapack_factor_func_cholesky
     */
     AFAPI af_err af_cholesky_inplace(int *info, af_array in, const bool is_upper);
 
     /**
-       C Interface for solving a system of equations
+       C Interface to solve a system of equations.
 
-       \param[out] x is the matrix of unknown variables
-       \param[in] a is the coefficient matrix
-       \param[in] b is the measured values
-       \param[in] options determining various properties of matrix \p a
+       The `options` parameter must be one of \ref AF_MAT_NONE,
+       \ref AF_MAT_LOWER or \ref AF_MAT_UPPER.
 
-       \ingroup lapack_solve_func_gen
+       This function is not supported in GFOR.
+
+       \param[out] x       matrix of unknown variables
+       \param[in]  a       coefficient matrix
+       \param[in]  b       measured values
+       \param[in]  options determines various properties of matrix `a`
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \note \p options needs to be one of \ref AF_MAT_NONE, \ref AF_MAT_LOWER or \ref AF_MAT_UPPER
+       \ingroup lapack_solve_func_gen
     */
     AFAPI af_err af_solve(af_array *x, const af_array a, const af_array b,
                           const af_mat_prop options);
 
     /**
-       C Interface for solving a system of equations
+       C Interface to solve a system of equations.
 
-       \param[out] x will contain the matrix of unknown variables
-       \param[in] a is the output matrix from packed LU decomposition of the coefficient matrix
-       \param[in] piv is the pivot array from packed LU decomposition of the coefficient matrix
-       \param[in] b is the matrix of measured values
-       \param[in] options determining various properties of matrix \p a
+       The `options` parameter currently must be \ref AF_MAT_NONE.
 
-       \ingroup lapack_solve_lu_func_gen
+       \param[out] x       matrix of unknown variables
+       \param[in]  a       packed LU decomposition of the coefficient matrix
+       \param[in]  piv     pivot array from the packed LU decomposition of the
+                           coefficient matrix
+       \param[in]  b       measured values
+       \param[in]  options determines various properties of matrix `a`
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \note \p options currently needs to be \ref AF_MAT_NONE
-       \note This function is not supported in GFOR
+       \ingroup lapack_solve_lu_func_gen
     */
     AFAPI af_err af_solve_lu(af_array *x, const af_array a, const af_array piv,
                              const af_array b, const af_mat_prop options);
 
     /**
-       C Interface for inverting a matrix
+       C Interface to invert a matrix.
 
-       \param[out] out will contain the inverse of matrix \p in
-       \param[in] in is input matrix
-       \param[in] options determining various properties of matrix \p in
+       The `options` parameter currently must be \ref AF_MAT_NONE.
 
-       \ingroup lapack_ops_func_inv
+       \param[out] out     inverse matrix
+       \param[in]  in      input matrix
+       \param[in]  options determines various properties of matrix `in`
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
-       \note currently options needs to be \ref AF_MAT_NONE
+       \ingroup lapack_ops_func_inv
     */
     AFAPI af_err af_inverse(af_array *out, const af_array in, const af_mat_prop options);
 
+#if AF_API_VERSION >= 37
     /**
-       C Interface for finding the rank of a matrix
+       C Interface to pseudo-invert (Moore-Penrose) a matrix.
+
+       Currently uses the SVD-based approach.
+
+       Parameter `tol` is not the actual lower threshold, but it is passed in
+       as a parameter to the calculation of the actual threshold relative to
+       the shape and contents of `in`.
+
+       Suggested values for `tol`: 1e-6 for single precision and 1e-12 for
+       double precision.
 
-       \param[out] rank will contain the rank of \p in
-       \param[in] in is input matrix
-       \param[in] tol is the tolerance value
+       \param[out] out     pseudo-inverse matrix
+       \param[in]  in      input matrix
+       \param[in]  tol     defines the lower threshold for singular values from
+                           SVD
+       \param[in]  options must be AF_MAT_NONE (more options might be supported
+                           in the future)
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup lapack_ops_func_pinv
+    */
+    AFAPI af_err af_pinverse(af_array *out, const af_array in, const double tol,
+                             const af_mat_prop options);
+#endif
+
+    /**
+       C Interface to find the rank of a matrix.
+
+       \param[out] rank rank
+       \param[in]  in   input matrix
+       \param[in]  tol  tolerance value
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_ops_func_rank
     */
     AFAPI af_err af_rank(unsigned *rank, const af_array in, const double tol);
 
     /**
-       C Interface for finding the determinant of a matrix
+       C Interface to find the determinant of a matrix.
 
-       \param[out] det_real will contain the real part of the determinant of \p in
-       \param[out] det_imag will contain the imaginary part of the determinant of \p in
-       \param[in] in is input matrix
+       \param[out] det_real real part of the determinant
+       \param[out] det_imag imaginary part of the determinant
+       \param[in]  in       input matrix
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_ops_func_det
     */
     AFAPI af_err af_det(double *det_real, double *det_imag, const af_array in);
 
     /**
-       C Interface for norm of a matrix
-
-       \param[out] out will contain the norm of \p in
-       \param[in] in is the input matrix
-       \param[in] type specifies the \ref af::normType. Default: \ref AF_NORM_VECTOR_1
-       \param[in] p specifies the value of P when \p type is one of \ref AF_NORM_VECTOR_P,  AF_NORM_MATRIX_L_PQ is used. It is ignored for other values of \p type
-       \param[in] q specifies the value of Q when \p type is AF_NORM_MATRIX_L_PQ. This parameter is ignored if \p type is anything else
-
+       C Interface to find the norm of a matrix.
+
+       \param[out] out  norm
+       \param[in]  in   input matrix
+       \param[in]  type \ref af::normType
+       \param[in]  p    value of P when `type` is \ref AF_NORM_VECTOR_P or
+                        \ref AF_NORM_MATRIX_L_PQ, else ignored
+       \param[in]  q    value of Q when `type` is \ref AF_NORM_MATRIX_L_PQ, else
+                        ignored
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
 
        \ingroup lapack_ops_func_norm
     */
     AFAPI af_err af_norm(double *out, const af_array in, const af_norm_type type, const double p, const double q);
 
+#if AF_API_VERSION >= 33
+    /**
+       Returns true if ArrayFire is compiled with LAPACK support.
+
+       \param[out] out true if LAPACK support is available; false otherwise
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given; does not depend on the value
+                   of `out`
+
+       \ingroup lapack_helper_func_available
+    */
+    AFAPI af_err af_is_lapack_available(bool *out);
+#endif
+
 
 #ifdef __cplusplus
 }
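Reviewer note on the Cholesky return convention documented in `cholesky`, `choleskyInPlace`, and the `info` output of the C interface: the value is `0` on success and otherwise identifies where the factorization broke down. The sketch below is plain C++ and is not ArrayFire's implementation. The name `choleskyLower` and the row-major `std::vector<double>` layout are hypothetical. It also assumes that "the rank at which the decomposition fails" follows the LAPACK-style convention of reporting the 1-based order of the first leading minor that is not positive definite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ArrayFire: factors a symmetric n x n
// row-major matrix `a` as L * L^T and writes the lower triangle to `l`.
// Returns 0 on success; otherwise the 1-based order of the first leading
// minor that is not positive definite (assumed LAPACK-style convention).
int choleskyLower(std::vector<double> &l, const std::vector<double> &a,
                  std::size_t n) {
    assert(a.size() == n * n);
    l.assign(n * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        // Diagonal entry: a_jj minus the squares already placed in row j.
        double d = a[j * n + j];
        for (std::size_t k = 0; k < j; ++k) d -= l[j * n + k] * l[j * n + k];
        if (!(d > 0.0)) return static_cast<int>(j) + 1;  // not positive definite
        l[j * n + j] = std::sqrt(d);
        // Entries below the diagonal in column j.
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = a[i * n + j];
            for (std::size_t k = 0; k < j; ++k) s -= l[i * n + k] * l[j * n + k];
            l[i * n + j] = s / l[j * n + j];
        }
    }
    return 0;
}
```

With the ArrayFire API, a caller would check the returned `int` (or `info` in the C interface) in the same way before using `out`.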
diff --git a/include/af/macros.h b/include/af/macros.h
new file mode 100644
index 0000000000..3c52321c5c
--- /dev/null
+++ b/include/af/macros.h
@@ -0,0 +1,84 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <stdio.h>
+
+///
+/// Print a line on screen using printf syntax.
+/// Usage: Uses same syntax and semantics as printf.
+/// Output: \<filename\>:\<line number\>: \<message\>
+///
+#ifndef AF_MSG
+#define AF_MSG(fmt,...) do {            \
+        printf("%s:%d: " fmt "\n",      \
+                 __FILE__, __LINE__, ##__VA_ARGS__);      \
+        } while (0)
+#endif
+
+/**
+ * AF_MEM_INFO macro can be used to print the current stats of ArrayFire's memory
+ * manager.
+ *
+ * AF_MEM_INFO prints 4 values:
+ *
+ * ---------------------------------------------------
+ *  Name                    | Description
+ * -------------------------|-------------------------
+ *  Allocated Bytes         | Total number of bytes allocated by the memory manager
+ *  Allocated Buffers       | Total number of buffers allocated
+ *  Locked (In Use) Bytes   | Number of bytes that are in use by active arrays
+ *  Locked (In Use) Buffers | Number of buffers that are in use by active arrays
+ * ---------------------------------------------------
+ *
+ *  The `Allocated Bytes` is always a multiple of the memory step size. The
+ *  default step size is 1024 bytes. This means when a buffer is to be
+ *  allocated, the size is always rounded up to a multiple of the step size.
+ *  You can use af::getMemStepSize() to check the current step size and
+ *  af::setMemStepSize() to set a custom step size.
+ *
+ *  The `Allocated Buffers` is the number of buffers that use up the allocated
+ *  bytes. This includes buffers currently in scope, as well as buffers marked
+ *  as free, i.e., from arrays that have gone out of scope. The free buffers
+ *  are available for use by new arrays that might be created.
+ *
+ *  The `Locked Bytes` is the number of bytes in use that cannot be
+ *  reallocated at the moment. The difference of Allocated Bytes and Locked
+ *  Bytes is the total bytes available for reallocation.
+ *
+ *  The `Locked Buffers` is the number of buffers in use that cannot be
+ *  reallocated at the moment. The difference of Allocated Buffers and Locked
+ *  Buffers is the number of buffers available for reallocation.
+ *
+ * The AF_MEM_INFO macro accepts a string literal argument that is printed to screen.
+ *
+ * \param[in] msg (Optional) A message that is printed to screen
+ *
+ * \code
+ *     AF_MEM_INFO("At start");
+ * \endcode
+ *
+ * Output:
+ *
+ *     AF Memory at /workspace/myfile.cpp:41: At start
+ *     Allocated [ Bytes | Buffers ] = [ 4096 | 4 ]
+ *     In Use    [ Bytes | Buffers ] = [ 2048 | 2 ]
+ */
+#define AF_MEM_INFO(msg) do {                                                           \
+    size_t abytes = 0, abuffs = 0, lbytes = 0, lbuffs = 0;                              \
+    af_err err = af_device_mem_info(&abytes, &abuffs, &lbytes, &lbuffs);                \
+    if(err == AF_SUCCESS) {                                                             \
+        printf("AF Memory at %s:%d: " msg "\n", __FILE__, __LINE__);                    \
+        printf("Allocated [ Bytes | Buffers ] = [ %zu | %zu ]\n", abytes, abuffs);      \
+        printf("In Use    [ Bytes | Buffers ] = [ %zu | %zu ]\n", lbytes, lbuffs);      \
+    } else {                                                                            \
+        fprintf(stderr, "AF Memory at %s:%d: " msg "\nAF Error %d\n",                   \
+                __FILE__, __LINE__, err);                                               \
+    }                                                                                   \
+} while(0)
diff --git a/include/af/memory.h b/include/af/memory.h
new file mode 100644
index 0000000000..6c53837a6c
--- /dev/null
+++ b/include/af/memory.h
@@ -0,0 +1,599 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/defines.h>
+#include <af/event.h>
+
+#include <stddef.h>
+
+#if AF_API_VERSION >= 37
+
+/**
+   \ingroup memory_manager_api
+*/
+typedef void* af_memory_manager;
+
+#ifdef __cplusplus
+extern "C" {
+#endif  // __cplusplus
+
+/**
+   \brief Called after a memory manager is set and becomes active.
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_initialize_fn)(af_memory_manager handle);
+
+/**
+   \brief Called after a memory manager is unset and becomes unused
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_shutdown_fn)(af_memory_manager handle);
+
+/**
+   \brief Function pointer that will be called by ArrayFire to allocate memory.
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[out] ptr pointer to the allocated buffer
+   \param[in] user_lock a truthy value corresponding to whether or not the
+   memory should have a user lock associated with it
+   \param[in] ndims the number of dimensions associated with the allocated
+   memory. This value is currently always 1
+   \param[in,out] dims a \ref dim_t containing the dimensions of the allocation
+   by number of elements. After the function returns, the pointer contains the
+   shape of the allocated tensor
+   \param[in] element_size the number of bytes per element of allocated memory
+
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_alloc_fn)(af_memory_manager handle,
+                                             void** ptr,
+                                             /* bool */ int user_lock,
+                                             const unsigned ndims, dim_t* dims,
+                                             const unsigned element_size);
+
+/**
+   \brief Checks the amount of allocated memory for a pointer
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[out] size the size of the allocated memory for the pointer
+   \param[in] ptr the pointer to query
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_allocated_fn)(af_memory_manager handle,
+                                                 size_t* size, void* ptr);
+
+/**
+   \brief Unlocks memory from use
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[in] ptr the pointer to unlock
+   \param[in] user_unlock if truthy, frees the memory from user lock
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_unlock_fn)(af_memory_manager handle,
+                                              void* ptr, /* bool */ int user_unlock);
+
+/**
+   \brief Called to signal the memory manager should free memory if possible
+
+   Called by some external functions that allocate their own memory when they
+   receive an out-of-memory error, in order to free up other memory on a device.
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_signal_memory_cleanup_fn)(
+    af_memory_manager handle);
+
+/**
+   \brief Populates a character array with human readable information about the
+   current state of the memory manager.
+
+   Prints useful information about the memory manager and its state. No format
+   is enforced; the output can include any information that could be useful to
+   the user. This function is only called by \ref af_print_mem_info.
+
+   \param[in]  handle a pointer to the active \ref af_memory_manager handle
+   \param[out] buffer a buffer to which a message will be populated
+   \param[in]  id     the device id for which to print memory
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_print_info_fn)(af_memory_manager handle,
+                                                  char* buffer, int id);
+
+/**
+   \brief Called to lock a buffer as user-owned memory
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[in] ptr pointer to the buffer to lock
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_user_lock_fn)(af_memory_manager handle,
+                                                 void* ptr);
+
+/**
+   \brief Called to unlock a buffer from user-owned memory
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[in] ptr pointer to the buffer to unlock
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_user_unlock_fn)(af_memory_manager handle,
+                                                   void* ptr);
+
+/**
+   \brief Queries if a buffer is user locked
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[out] out a truthy value corresponding to if the buffer is user locked
+   \param[in] ptr pointer to the buffer to query
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_is_user_locked_fn)(af_memory_manager handle,
+                                                      int* out, void* ptr);
+
+/**
+   \brief Gets memory pressure for a memory manager
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[out] pressure the memory pressure value
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_get_memory_pressure_fn)(
+    af_memory_manager handle, float* pressure);
+
+/**
+   \brief Called to query if additions to the JIT tree would exert too much
+   memory pressure
+
+   The ArrayFire JIT compiler will call this function to determine if the number
+   of bytes referenced by the buffers in the JIT tree are causing too much
+   memory pressure on the system.
+
+   If the memory manager decides that the pressure is too great, the JIT tree
+   will be evaluated and this COULD result in some buffers being freed if they
+   are not referenced by other af_arrays. If the memory pressure is not too
+   great the JIT tree may not be evaluated and could continue to get bigger.
+
+   The default memory manager will trigger an evaluation if the buffers in the
+   JIT tree account for half of all buffers allocated.
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[out] out a truthy value if too much memory pressure is exerted
+   \param[in] size the total number of bytes allocated by all the buffer nodes
+   in the current JIT tree
+   \returns AF_SUCCESS
+
+   \ingroup memory_manager_api
+*/
+typedef af_err (*af_memory_manager_jit_tree_exceeds_memory_pressure_fn)(
+    af_memory_manager handle, int* out, size_t size);
+
+/**
+   \brief Adds a new device to the memory manager (OpenCL only)
+
+   \param[in] handle a pointer to the active \ref af_memory_manager handle
+   \param[in] id the id of the device to add
+   \note This callback returns void
+
+   \ingroup memory_manager_api
+*/
+typedef void (*af_memory_manager_add_memory_management_fn)(
+    af_memory_manager handle, int id);
+
+/**
+    \brief Removes a device from the memory manager (OpenCL only)
+
+    \param[in] handle a pointer to the active \ref af_memory_manager handle
+    \param[in] id the id of the device to remove
+    \note This callback returns void
+
+    \ingroup memory_manager_api
+*/
+typedef void (*af_memory_manager_remove_memory_management_fn)(
+    af_memory_manager handle, int id);
+
+/**
+   \brief Creates an \ref af_memory_manager handle
+
+   Creates a blank af_memory_manager with no attached function pointers.
+
+   \param[out] out \ref af_memory_manager
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_create_memory_manager(af_memory_manager* out);
+
+/**
+   \brief Destroys an \ref af_memory_manager handle.
+
+   Destroys a memory manager handle. Does NOT call the
+   \ref af_memory_manager_shutdown_fn associated with the af_memory_manager.
+
+   \param[in] handle the \ref af_memory_manager handle to be destroyed
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_release_memory_manager(af_memory_manager handle);
+
+/**
+   \brief Sets an af_memory_manager to be the default memory manager for
+   non-pinned memory allocations in ArrayFire.
+
+   Registers the given memory manager as the AF memory manager for non-pinned
+   memory allocations. Does NOT shut down or release the existing memory
+   manager or free any associated memory.
+
+   \param[in] handle the \ref af_memory_manager handle to use
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_set_memory_manager(af_memory_manager handle);
+
+/**
+   \brief Sets an af_memory_manager to be the default memory manager for
+   pinned memory allocations in ArrayFire.
+
+   Registers the given memory manager as the AF memory manager for pinned
+   memory allocations - does NOT shut down or release the existing memory
+   manager or free any associated memory.
+
+   \param[in] handle the \ref af_memory_manager handle to use
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_set_memory_manager_pinned(af_memory_manager handle);
+
+/**
+   \brief Reset the memory manager being used in ArrayFire to the default
+   memory manager, shutting down the existing memory manager.
+
+   Calls the associated af_memory_manager_shutdown_fn on
+   the existing memory manager. If the default memory manager is set,
+   ALL associated memory will be freed on shutdown. Custom behavior that
+   does not free all memory can be defined for a custom memory manager
+   as per the specific implementation of its associated
+   af_memory_manager_shutdown_fn.
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_unset_memory_manager();
+
+/**
+   \brief Reset the pinned memory manager being used in ArrayFire to the
+   default memory manager, shutting down the existing pinned memory manager.
+
+   Calls the associated af_memory_manager_shutdown_fn on
+   the existing pinned memory manager. If the default memory manager is set,
+   ALL associated pinned memory will be freed on shutdown. Custom behavior that
+   does not free all pinned memory can be defined for a custom memory manager
+   as per the specific implementation of its associated
+   af_memory_manager_shutdown_fn.
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_unset_memory_manager_pinned();
+
+/**
+   \brief Gets the payload ptr from an \ref af_memory_manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[out] payload pointer to the payload pointer
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_get_payload(af_memory_manager handle,
+                                           void** payload);
+
+/**
+   \brief Sets the payload ptr for an \ref af_memory_manager
+
+   A payload can be any user defined memory associated with the memory manager
+   and can be used to track state of the memory manager. It is not used directly
+   by ArrayFire.
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] payload the payload pointer to associate with the handle
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_payload(af_memory_manager handle,
+                                           void* payload);
+
+/**
+   \brief Sets an \ref af_memory_manager_initialize_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_initialize_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_initialize_fn(
+    af_memory_manager handle, af_memory_manager_initialize_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_shutdown_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_shutdown_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_shutdown_fn(
+    af_memory_manager handle, af_memory_manager_shutdown_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_alloc_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_alloc_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_alloc_fn(af_memory_manager handle,
+                                            af_memory_manager_alloc_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_allocated_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_allocated_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_allocated_fn(
+    af_memory_manager handle, af_memory_manager_allocated_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_unlock_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_unlock_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_unlock_fn(af_memory_manager handle,
+                                             af_memory_manager_unlock_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_signal_memory_cleanup_fn for a memory
+   manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_signal_memory_cleanup_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_signal_memory_cleanup_fn(
+    af_memory_manager handle, af_memory_manager_signal_memory_cleanup_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_print_info_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_print_info_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_print_info_fn(
+    af_memory_manager handle, af_memory_manager_print_info_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_user_lock_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_user_lock_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_user_lock_fn(
+    af_memory_manager handle, af_memory_manager_user_lock_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_user_unlock_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_user_unlock_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_user_unlock_fn(
+    af_memory_manager handle, af_memory_manager_user_unlock_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_is_user_locked_fn for a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_is_user_locked_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_is_user_locked_fn(
+    af_memory_manager handle, af_memory_manager_is_user_locked_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_get_memory_pressure_fn for a memory
+   manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_get_memory_pressure_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_get_memory_pressure_fn(
+    af_memory_manager handle, af_memory_manager_get_memory_pressure_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_jit_tree_exceeds_memory_pressure_fn for
+   a memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_jit_tree_exceeds_memory_pressure_fn
+   to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn(
+    af_memory_manager handle,
+    af_memory_manager_jit_tree_exceeds_memory_pressure_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_add_memory_management_fn for a memory
+   manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_add_memory_management_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_add_memory_management_fn(
+    af_memory_manager handle, af_memory_manager_add_memory_management_fn fn);
+
+/**
+   \brief Sets an \ref af_memory_manager_remove_memory_management_fn for a
+   memory manager
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[in] fn the \ref af_memory_manager_remove_memory_management_fn to set
+
+   \returns AF_SUCCESS
+   \ingroup memory_manager_utils
+*/
+AFAPI af_err af_memory_manager_set_remove_memory_management_fn(
+    af_memory_manager handle, af_memory_manager_remove_memory_management_fn fn);
+
+////////////////// Native memory interface functions
+
+/**
+   \brief Gets the id of the currently-active device
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[out] id the id of the active device
+
+   \returns AF_SUCCESS
+   \ingroup native_memory_interface
+*/
+AFAPI af_err af_memory_manager_get_active_device_id(af_memory_manager handle,
+                                                    int* id);
+
+/**
+   \brief Allocates memory with a native memory function for the active backend
+
+   \param[in] handle the \ref af_memory_manager handle
+   \param[out] ptr the pointer to the allocated buffer (for the CUDA and CPU
+   backends). For the OpenCL backend, this is a pointer to a cl_mem, which
+   can be cast accordingly
+   \param[in] size the size of the pointer allocation
+
+   \returns AF_SUCCESS
+   \ingroup native_memory_interface
+*/
+AFAPI af_err af_memory_manager_native_alloc(af_memory_manager handle,
+                                            void** ptr, size_t size);
+
+/**
+    \brief Frees a pointer with a native memory function for the active backend
+
+    \param[in] handle the \ref af_memory_manager handle
+    \param[in] ptr the pointer to free
+
+    \returns AF_SUCCESS
+    \ingroup native_memory_interface
+*/
+AFAPI af_err af_memory_manager_native_free(af_memory_manager handle, void* ptr);
+
+/** \brief Gets the maximum memory size for a managed device.
+
+  \param[in] handle the \ref af_memory_manager handle
+  \param[out] size the max memory size for the device
+  \param[in] id the device id
+
+  \returns AF_SUCCESS
+  \ingroup native_memory_interface */
+AFAPI af_err af_memory_manager_get_max_memory_size(af_memory_manager handle,
+                                                   size_t* size, int id);
+
+/**
+  \brief Gets the memory pressure threshold for a memory manager.
+
+  \param[in] handle the \ref af_memory_manager handle
+  \param[out] value the memory pressure threshold
+
+  \returns AF_SUCCESS
+  \ingroup native_memory_interface
+*/
+AFAPI af_err af_memory_manager_get_memory_pressure_threshold(
+    af_memory_manager handle, float* value);
+
+/**
+    \brief Sets the memory pressure threshold for a memory manager.
+
+    The memory pressure threshold determines when the JIT tree evaluates based
+    on how much memory usage there is. If the value returned by \ref
+    af_memory_manager_get_memory_pressure_fn exceeds the memory pressure
+    threshold, the JIT will evaluate a subtree if generated kernels are valid.
+
+    \param[in] handle the \ref af_memory_manager handle
+    \param[in] value the new threshold value
+
+    \returns AF_SUCCESS
+    \ingroup native_memory_interface
+*/
+AFAPI af_err af_memory_manager_set_memory_pressure_threshold(
+    af_memory_manager handle, float value);
+
+#ifdef __cplusplus
+}
+#endif  // __cplusplus
+#endif  // AF_API_VERSION >= 37
diff --git a/include/af/ml.h b/include/af/ml.h
new file mode 100644
index 0000000000..33feff9112
--- /dev/null
+++ b/include/af/ml.h
@@ -0,0 +1,98 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+
+#ifdef __cplusplus
+namespace af
+{
+class array;
+class dim4;
+
+#if AF_API_VERSION >= 37
+    /**
+        C++ interface for calculating the backward pass gradient of 2D convolution.
+        This function calculates the gradient with respect to the output
+        of the \ref convolve2NN function that uses the machine learning
+        formulation for the dimensions of the signals and filters.
+
+        \param[in]  incoming_gradient gradients to be distributed in backwards pass
+        \param[in]  original_signal input signal to forward pass of convolution
+                    assumed structure of input is ( d0 x d1 x d2 x N )
+        \param[in]  original_filter input filter to forward pass of convolution
+                    assumed structure of input is ( d0 x d1 x d2 x N )
+        \param[in]  convolved_output output from forward pass of convolution
+        \param[in]  stride specifies strides along each dimension for original convolution
+        \param[in]  padding specifies padding width along each dimension for original convolution
+        \param[in]  dilation specifies filter dilation along each dimension for original convolution
+        \param[in]  grad_type specifies which gradient to return
+        \return     gradient with respect to \p grad_type
+
+        \note Make sure you pass in both dim0 and dim1 in your dim4 arguments. The third
+        and fourth dimensions are currently ignored.
+
+        \ingroup ml_convolution
+    */
+    AFAPI array convolve2GradientNN(const array& incoming_gradient,
+                                    const array& original_signal,
+                                    const array& original_filter,
+                                    const array& convolved_output,
+                                    const dim4 stride, const dim4 padding, const dim4 dilation,
+                                    convGradientType grad_type);
+
+#endif
+
+}
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if AF_API_VERSION >= 37
+    /**
+        C interface for calculating the backward pass gradient of 2D convolution.
+        This function calculates the gradient with respect to the output
+        of the \ref af::convolve2NN() function that uses the machine learning
+        formulation for the dimensions of the signals and filters.
+
+        \param[out] out gradient with respect to \p grad_type
+        \param[in]  incoming_gradient gradients to be distributed in backwards pass
+        \param[in]  original_signal input signal to forward pass of convolution
+                    assumed structure of input is ( d0 x d1 x d2 x N )
+        \param[in]  original_filter input filter to forward pass of convolution
+                    assumed structure of input is ( d0 x d1 x d2 x N )
+        \param[in]  convolved_output output from forward pass of convolution
+        \param[in]  stride_dims specifies number of stride dimensions
+        \param[in]  strides array of stride values
+        \param[in]  padding_dims number of padding dimensions
+        \param[in]  paddings array of padding values
+        \param[in]  dilation_dims number of dilation dimensions
+        \param[in]  dilations array of dilation values
+        \param[in]  grad_type specifies which gradient to return
+        \return     \ref AF_SUCCESS if the execution completes properly
+
+        \ingroup ml_convolution
+    */
+    AFAPI af_err af_convolve2_gradient_nn(af_array *out,
+                                          const af_array incoming_gradient,
+                                          const af_array original_signal,
+                                          const af_array original_filter,
+                                          const af_array convolved_output,
+                                          const unsigned stride_dims,   const dim_t *strides,
+                                          const unsigned padding_dims,  const dim_t *paddings,
+                                          const unsigned dilation_dims, const dim_t *dilations,
+                                          af_conv_gradient_type grad_type);
+#endif
+
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/include/af/oneapi.h b/include/af/oneapi.h
new file mode 100644
index 0000000000..b6a3da15fa
--- /dev/null
+++ b/include/af/oneapi.h
@@ -0,0 +1,429 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/defines.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if AF_API_VERSION >= 39
+typedef enum
+{
+    //TODO: update? are these relevant in sycl
+    AF_ONEAPI_PLATFORM_AMD     = 0,
+    AF_ONEAPI_PLATFORM_APPLE   = 1,
+    AF_ONEAPI_PLATFORM_INTEL   = 2,
+    AF_ONEAPI_PLATFORM_NVIDIA  = 3,
+    AF_ONEAPI_PLATFORM_BEIGNET = 4,
+    AF_ONEAPI_PLATFORM_POCL    = 5,
+    AF_ONEAPI_PLATFORM_UNKNOWN = -1
+} af_oneapi_platform;
+#endif
+
+#if 0
+/**
+    \ingroup opencl_mat
+    @{
+*/
+/**
+  Get a handle to ArrayFire's OpenCL context
+
+  \param[out] ctx the current context being used by ArrayFire
+  \param[in] retain if true calls clRetainContext prior to returning the context
+  \returns \ref af_err error code
+
+  \note Set \p retain to true if this value will be passed to a cl::Context constructor
+*/
+AFAPI af_err afcl_get_context(cl_context *ctx, const bool retain);
+
+/**
+  Get a handle to ArrayFire's OpenCL command queue
+
+  \param[out] queue the current command queue being used by ArrayFire
+  \param[in] retain if true calls clRetainCommandQueue prior to returning the queue
+  \returns \ref af_err error code
+
+  \note Set \p retain to true if this value will be passed to a cl::CommandQueue constructor
+*/
+AFAPI af_err afcl_get_queue(cl_command_queue *queue, const bool retain);
+
+/**
+   Get the device ID for ArrayFire's current active device
+
+   \param[out] id the cl_device_id of the current device
+   \returns \ref af_err error code
+*/
+AFAPI af_err afcl_get_device_id(cl_device_id *id);
+
+#if AF_API_VERSION >= 39
+/**
+   Set ArrayFire's active device based on \p id of type cl_device_id
+
+   \param[in] id the cl_device_id of the device to be set as active device
+   \returns \ref af_err error code
+*/
+AFAPI af_err afcl_set_device_id(cl_device_id id);
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Push user provided device control constructs into the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to use a
+   user-generated OpenCL context and related objects for ArrayFire operations.
+
+   \param[in] dev is the OpenCL device for which user provided context will be used by ArrayFire
+   \param[in] ctx is the user provided OpenCL cl_context to be used by ArrayFire
+   \param[in] que is the user provided OpenCL cl_command_queue to be used by ArrayFire. If this
+                  parameter is NULL, then we create a command queue for the user using the OpenCL
+                  context they provided us.
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+AFAPI af_err afcl_add_device_context(cl_device_id dev, cl_context ctx, cl_command_queue que);
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Set active device using cl_context and cl_device_id
+
+   \param[in] dev is the OpenCL device id that is to be set as Active device inside ArrayFire
+   \param[in] ctx is the OpenCL cl_context being used by ArrayFire
+*/
+AFAPI af_err afcl_set_device_context(cl_device_id dev, cl_context ctx);
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Remove the user provided device control constructs from the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to remove an already
+   pushed user generated OpenCL context and related objects.
+
+   \param[in] dev is the OpenCL device id that has to be popped
+   \param[in] ctx is the cl_context object to be removed from ArrayFire pool
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+AFAPI af_err afcl_delete_device_context(cl_device_id dev, cl_context ctx);
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Get the type of the current device
+*/
+AFAPI af_err afcl_get_device_type(afcl_device_type *res);
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Get the platform of the current device
+*/
+AFAPI af_err afcl_get_platform(afcl_platform *res);
+#endif
+
+/**
+  @}
+*/
+#endif //if 0 comment
+
+#ifdef __cplusplus
+}
+#endif
+
+#ifdef __cplusplus
+
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/exception.h>
+#include <af/device.h>
+#include <stdio.h>
+
+namespace afoneapi
+{
+
+#if 0
+ /**
+     \addtogroup opencl_mat
+     @{
+ */
+
+ /**
+ Get a handle to ArrayFire's OpenCL context
+
+ \param[in] retain if true calls clRetainContext prior to returning the context
+ \returns the current context being used by ArrayFire
+
+ \note Set \p retain to true if this value will be passed to a cl::Context constructor
+ */
+ static inline cl_context getContext(bool retain = false)
+ {
+     cl_context ctx;
+     af_err err = afcl_get_context(&ctx, retain);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL context from arrayfire");
+     return ctx;
+ }
+
+ /**
+ Get a handle to ArrayFire's OpenCL command queue
+
+ \param[in] retain if true calls clRetainCommandQueue prior to returning the command queue
+ \returns the current command queue being used by ArrayFire
+
+ \note Set \p retain to true if this value will be passed to a cl::CommandQueue constructor
+ */
+ static inline cl_command_queue getQueue(bool retain = false)
+ {
+     cl_command_queue queue;
+     af_err err = afcl_get_queue(&queue, retain);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL command queue from arrayfire");
+     return queue;
+ }
+
+ /**
+    Get the device ID for ArrayFire's current active device
+    \returns the cl_device_id of the current device
+ */
+ static inline cl_device_id getDeviceId()
+ {
+     cl_device_id id;
+     af_err err = afcl_get_device_id(&id);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL device ID");
+
+     return id;
+ }
+
+#if AF_API_VERSION >= 39
+ /**
+   Set ArrayFire's active device based on \p id of type cl_device_id
+
+   \param[in] id the cl_device_id of the device to be set as active device
+ */
+ static inline void setDeviceId(cl_device_id id)
+ {
+     af_err err = afcl_set_device_id(id);
+     if (err != AF_SUCCESS) throw af::exception("Failed to set OpenCL device as active device");
+ }
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Push user provided device control constructs into the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to use a
+   user-generated OpenCL context and related objects for ArrayFire operations.
+
+   \param[in] dev is the OpenCL device for which user provided context will be used by ArrayFire
+   \param[in] ctx is the user provided OpenCL cl_context to be used by ArrayFire
+   \param[in] que is the user provided OpenCL cl_command_queue to be used by ArrayFire. If this
+                  parameter is NULL, then we create a command queue for the user using the OpenCL
+                  context they provided us.
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+static inline void addDevice(cl_device_id dev, cl_context ctx, cl_command_queue que)
+{
+    af_err err = afcl_add_device_context(dev, ctx, que);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to push user provided device/context to ArrayFire pool");
+}
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Set active device using cl_context and cl_device_id
+
+   \param[in] dev is the OpenCL device id that is to be set as Active device inside ArrayFire
+   \param[in] ctx is the OpenCL cl_context being used by ArrayFire
+*/
+static inline void setDevice(cl_device_id dev, cl_context ctx)
+{
+    af_err err = afcl_set_device_context(dev, ctx);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to set device based on cl_device_id & cl_context");
+}
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Remove the user provided device control constructs from the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to remove an already
+   pushed user generated OpenCL context and related objects.
+
+   \param[in] dev is the OpenCL device id that has to be popped
+   \param[in] ctx is the cl_context object to be removed from ArrayFire pool
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+static inline void deleteDevice(cl_device_id dev, cl_context ctx)
+{
+    af_err err = afcl_delete_device_context(dev, ctx);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to remove the requested device from ArrayFire device pool");
+}
+#endif
+
+
+#if AF_API_VERSION >= 39
+ typedef afcl_device_type deviceType;
+ typedef afcl_platform platform;
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Get the type of the current device
+*/
+static inline deviceType getDeviceType()
+{
+    afcl_device_type res = AFCL_DEVICE_TYPE_UNKNOWN;
+    af_err err = afcl_get_device_type(&res);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to get OpenCL device type");
+    return res;
+}
+#endif
+
+#if AF_API_VERSION >= 39
+/**
+   Get a vendor enumeration for the current platform
+*/
+static inline platform getPlatform()
+{
+    afcl_platform res = AFCL_PLATFORM_UNKNOWN;
+    af_err err = afcl_get_platform(&res);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to get OpenCL platform");
+    return res;
+}
+#endif
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] idims the dimensions of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(af::dim4 idims, cl_mem buf, af::dtype type, bool retain=false)
+ {
+     const unsigned ndims = (unsigned)idims.ndims();
+     const dim_t *dims = idims.get();
+
+     cl_context context;
+     cl_int clerr = clGetMemObjectInfo(buf, CL_MEM_CONTEXT, sizeof(cl_context), &context, NULL);
+     if (clerr != CL_SUCCESS) {
+         throw af::exception("Failed to get context from cl_mem object \"buf\" ");
+     }
+
+     if (context != getContext()) {
+         throw(af::exception("Context mismatch between input \"buf\" and arrayfire"));
+     }
+
+
+     if (retain) clerr = clRetainMemObject(buf);
+
+     af_array out;
+     af_err err = af_device_array(&out, buf, ndims, dims, type);
+
+     if (err != AF_SUCCESS || clerr != CL_SUCCESS) {
+         if (retain && clerr == CL_SUCCESS) clReleaseMemObject(buf);
+         throw af::exception("Failed to create device array");
+     }
+
+     return af::array(out);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] dim2 the length of the third dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               dim_t dim2,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1, dim2), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] dim2 the length of the third dimension of the buffer
+ \param[in] dim3 the length of the fourth dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               dim_t dim2, dim_t dim3,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1, dim2, dim3), buf, type, retain);
+ }
+
+/**
+   @}
+*/
+#endif //#IF 0 tmp comment
+
+}
+
+
+#endif
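
The C++ wrappers in this header all follow one pattern: call the C entry point, check the returned error code, and throw on failure. The sketch below illustrates that pattern in isolation. It assumes nothing about ArrayFire itself: `demo_err`, `demo_get_device_type`, and `demoGetDeviceType` are hypothetical stand-ins, and `std::runtime_error` takes the place of `af::exception` so the example compiles without the library.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for illustration only; these are NOT ArrayFire types.
enum demo_err { DEMO_SUCCESS = 0, DEMO_ERR_DEVICE = 1 };

// Hypothetical C-style entry point: the status comes back as the return
// value and the result is written through an out-parameter.
static demo_err demo_get_device_type(int *res, bool simulate_failure) {
    if (res == nullptr || simulate_failure) return DEMO_ERR_DEVICE;
    *res = 4;  // arbitrary demo value
    return DEMO_SUCCESS;
}

// C++ wrapper in the same shape as the header's wrappers: initialise the
// result to a sentinel, call the C function, and throw if it did not succeed.
static int demoGetDeviceType(bool simulate_failure = false) {
    int res = -1;
    demo_err err = demo_get_device_type(&res, simulate_failure);
    if (err != DEMO_SUCCESS) throw std::runtime_error("Failed to get demo device type");
    return res;
}

// Helper that reports whether the wrapper throws when the C call fails.
static bool demoThrows() {
    try {
        demoGetDeviceType(true);
    } catch (const std::runtime_error &) {
        return true;
    }
    return false;
}
```

Callers of such a wrapper get a plain return value on success and handle failures with `try`/`catch` rather than checking a status code after every call.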
diff --git a/include/af/opencl.h b/include/af/opencl.h
index 35d3b67e4f..d055804d6d 100644
--- a/include/af/opencl.h
+++ b/include/af/opencl.h
@@ -7,16 +7,149 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
+#ifndef CL_TARGET_OPENCL_VERSION
+#define CL_TARGET_OPENCL_VERSION 120
+#endif
+#if defined(__APPLE__) || defined(__MACOSX)
+#include <OpenCL/cl.h>
+#else
+#include <CL/cl.h>
+#endif
+
+#include <af/defines.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-    AFAPI af_err afcl_get_context(cl_context *ctx, const bool retain);
+#if AF_API_VERSION >= 33
+typedef enum
+{
+    AFCL_DEVICE_TYPE_CPU     = CL_DEVICE_TYPE_CPU,
+    AFCL_DEVICE_TYPE_GPU     = CL_DEVICE_TYPE_GPU,
+    AFCL_DEVICE_TYPE_ACC     = CL_DEVICE_TYPE_ACCELERATOR,
+    AFCL_DEVICE_TYPE_UNKNOWN = -1
+} afcl_device_type;
+#endif
 
-    AFAPI af_err afcl_get_queue(cl_command_queue *queue, const bool retain);
+#if AF_API_VERSION >= 33
+typedef enum
+{
+    AFCL_PLATFORM_AMD     = 0,
+    AFCL_PLATFORM_APPLE   = 1,
+    AFCL_PLATFORM_INTEL   = 2,
+    AFCL_PLATFORM_NVIDIA  = 3,
+    AFCL_PLATFORM_BEIGNET = 4,
+    AFCL_PLATFORM_POCL    = 5,
+    AFCL_PLATFORM_UNKNOWN = -1
+} afcl_platform;
+#endif
+
+/**
+    \ingroup opencl_mat
+    @{
+*/
+/**
+  Get a handle to ArrayFire's OpenCL context
+
+  \param[out] ctx the current context being used by ArrayFire
+  \param[in] retain if true calls clRetainContext prior to returning the context
+  \returns \ref af_err error code
+
+  \note Set \p retain to true if this value will be passed to a cl::Context constructor
+*/
+AFAPI af_err afcl_get_context(cl_context *ctx, const bool retain);
+
+/**
+  Get a handle to ArrayFire's OpenCL command queue
 
-    AFAPI af_err afcl_get_device_id(cl_device_id *id);
+  \param[out] queue the current command queue being used by ArrayFire
+  \param[in] retain if true calls clRetainCommandQueue prior to returning the command queue
+  \returns \ref af_err error code
+
+  \note Set \p retain to true if this value will be passed to a cl::CommandQueue constructor
+*/
+AFAPI af_err afcl_get_queue(cl_command_queue *queue, const bool retain);
+
+/**
+   Get the device ID for ArrayFire's current active device
+
+   \param[out] id the cl_device_id of the current device
+   \returns \ref af_err error code
+*/
+AFAPI af_err afcl_get_device_id(cl_device_id *id);
+
+#if AF_API_VERSION >= 32
+/**
+   Set ArrayFire's active device based on \p id of type cl_device_id
+
+   \param[in] id the cl_device_id of the device to be set as active device
+   \returns \ref af_err error code
+*/
+AFAPI af_err afcl_set_device_id(cl_device_id id);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Push user provided device control constructs into the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to use a
+   user-generated OpenCL context and related objects for ArrayFire operations.
+
+   \param[in] dev is the OpenCL device for which user provided context will be used by ArrayFire
+   \param[in] ctx is the user provided OpenCL cl_context to be used by ArrayFire
+   \param[in] que is the user provided OpenCL cl_command_queue to be used by ArrayFire. If this
+                  parameter is NULL, then we create a command queue for the user using the OpenCL
+                  context they provided us.
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+AFAPI af_err afcl_add_device_context(cl_device_id dev, cl_context ctx, cl_command_queue que);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Set active device using cl_context and cl_device_id
+
+   \param[in] dev is the OpenCL device id that is to be set as Active device inside ArrayFire
+   \param[in] ctx is the OpenCL cl_context being used by ArrayFire
+*/
+AFAPI af_err afcl_set_device_context(cl_device_id dev, cl_context ctx);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Remove the user provided device control constructs from the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to remove an already
+   pushed user generated OpenCL context and related objects.
+
+   \param[in] dev is the OpenCL device id that has to be popped
+   \param[in] ctx is the cl_context object to be removed from ArrayFire pool
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+AFAPI af_err afcl_delete_device_context(cl_device_id dev, cl_context ctx);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Get the type of the current device
+*/
+AFAPI af_err afcl_get_device_type(afcl_device_type *res);
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Get the platform of the current device
+*/
+AFAPI af_err afcl_get_platform(afcl_platform *res);
+#endif
+
+/**
+  @}
+*/
 
 #ifdef __cplusplus
 }
@@ -24,13 +157,6 @@ extern "C" {
 
 #ifdef __cplusplus
 
-#if defined(__APPLE__) || defined(__MACOSX)
-#include <OpenCL/cl.h>
-#else
-#include <CL/cl.h>
-#endif
-
-#include <af/defines.h>
 #include <af/array.h>
 #include <af/dim4.hpp>
 #include <af/exception.h>
@@ -39,176 +165,277 @@ extern "C" {
 
 namespace afcl
 {
-   /**
-
-    */
-    /**
-        \ingroup opencl_mat
-        @{
-    */
-    /**
-    Get a handle to ArrayFire's OpenCL context
-
-    \param[in] retain If true calls clRetainContext prior to returning the context.
-    \returns the current context being used by ArrayFire
-
-    \note Set \p retain to true if this value will be passed to a cl::Context constructor
-    */
-    static inline cl_context getContext(bool retain = false)
-    {
-        cl_context ctx;
-        af_err err = afcl_get_context(&ctx, retain);
-        if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL context from arrayfire");
-        return ctx;
-    }
-
-    /**
-    Get a handle to ArrayFire's OpenCL command queue
-
-    \param[in] retain If true calls clRetainCommandQueue prior to returning the context.
-    \returns the current command queue being used by ArrayFire
-
-    \note Set \p retain to true if this value will be passed to a cl::CommandQueue constructor
-    */
-    static inline cl_command_queue getQueue(bool retain = false)
-    {
-        cl_command_queue queue;
-        af_err err = afcl_get_queue(&queue, retain);
-        if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL command queue from arrayfire");
-    }
-
-    /**
-       Get the device ID for ArrayFire's current active device
-       \returns the cl_device_id of the current device
-    */
-    static inline cl_device_id getDeviceId()
-    {
-        cl_device_id id;
-        af_err err = afcl_get_device_id(&id);
-        if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL device ID");
-
-        return id;
-    }
-
-    /**
-    Create an af::array object from an OpenCL cl_mem buffer
-
-    \param[in] idims the dimensions of the buffer
-    \param[in] buf the OpenCL memory object.
-    \param[in] type the data type contained in the buffer
-    \param[in] retain If true, instructs ArrayFire to retain the memory object.
-    \returns an array object created from the OpenCL buffer
-
-    \note Set \p retain to true if the memory originates from a cl::Buffer object
-     */
-    static inline af::array array(af::dim4 idims, cl_mem buf, af::dtype type, bool retain=false)
-    {
-        const unsigned ndims = (unsigned)idims.ndims();
-        const dim_t *dims = idims.get();
-
-        cl_context context;
-        cl_int clerr = clGetMemObjectInfo(buf, CL_MEM_CONTEXT, sizeof(cl_context), &context, NULL);
-        if (clerr != CL_SUCCESS) {
-            throw af::exception("Failed to get context from cl_mem object \"buf\" ");
-        }
-
-        if (context != getContext()) {
-            throw(af::exception("Context mismatch between input \"buf\" and arrayfire"));
-        }
-
-
-        if (retain) clerr = clRetainMemObject(buf);
-
-        af_array out;
-        af_err err = af_device_array(&out, buf, ndims, dims, type);
-
-        if (err != AF_SUCCESS || clerr != CL_SUCCESS) {
-            if (retain && clerr == CL_SUCCESS) clReleaseMemObject(buf);
-            throw af::exception("Failed to create device array");
-        }
-
-        return af::array(out);
-    }
-
-    /**
-    Create an af::array object from an OpenCL cl_mem buffer
-
-    \param[in] dim0 the length of the first dimension of the buffer
-    \param[in] buf the OpenCL memory object.
-    \param[in] type the data type contained in the buffer
-    \param[in] retain If true, instructs ArrayFire to retain the memory object.
-    \returns an array object created from the OpenCL buffer
-
-    \note Set \p retain to true if the memory originates from a cl::Buffer object
-     */
-    static inline af::array array(dim_t dim0,
-                                  cl_mem buf, af::dtype type, bool retain=false)
-    {
-        return afcl::array(af::dim4(dim0), buf, type, retain);
-    }
-
-    /**
-    Create an af::array object from an OpenCL cl_mem buffer
-
-    \param[in] dim0 the length of the first dimension of the buffer
-    \param[in] dim1 the length of the second dimension of the buffer
-    \param[in] buf the OpenCL memory object.
-    \param[in] type the data type contained in the buffer
-    \param[in] retain If true, instructs ArrayFire to retain the memory object.
-    \returns an array object created from the OpenCL buffer
-
-    \note Set \p retain to true if the memory originates from a cl::Buffer object
-     */
-    static inline af::array array(dim_t dim0, dim_t dim1,
-                                  cl_mem buf, af::dtype type, bool retain=false)
-    {
-        return afcl::array(af::dim4(dim0, dim1), buf, type, retain);
-    }
-
-    /**
-    Create an af::array object from an OpenCL cl_mem buffer
-
-    \param[in] dim0 the length of the first dimension of the buffer
-    \param[in] dim1 the length of the second dimension of the buffer
-    \param[in] dim2 the length of the third dimension of the buffer
-    \param[in] buf the OpenCL memory object.
-    \param[in] type the data type contained in the buffer
-    \param[in] retain If true, instructs ArrayFire to retain the memory object.
-    \returns an array object created from the OpenCL buffer
-
-    \note Set \p retain to true if the memory originates from a cl::Buffer object
-     */
-    static inline af::array array(dim_t dim0, dim_t dim1,
-                                  dim_t dim2,
-                                  cl_mem buf, af::dtype type, bool retain=false)
-    {
-        return afcl::array(af::dim4(dim0, dim1, dim2), buf, type, retain);
-    }
-
-    /**
-    Create an af::array object from an OpenCL cl_mem buffer
-
-    \param[in] dim0 the length of the first dimension of the buffer
-    \param[in] dim1 the length of the second dimension of the buffer
-    \param[in] dim2 the length of the third dimension of the buffer
-    \param[in] dim3 the length of the fourth dimension of the buffer
-    \param[in] buf the OpenCL memory object.
-    \param[in] type the data type contained in the buffer
-    \param[in] retain If true, instructs ArrayFire to retain the memory object.
-    \returns an array object created from the OpenCL buffer
-
-    \note Set \p retain to true if the memory originates from a cl::Buffer object
-     */
-    static inline af::array array(dim_t dim0, dim_t dim1,
-                                  dim_t dim2, dim_t dim3,
-                                  cl_mem buf, af::dtype type, bool retain=false)
-    {
-        return afcl::array(af::dim4(dim0, dim1, dim2, dim3), buf, type, retain);
-    }
-
-    /**
-      @}
-    */
+
+ /**
+     \addtogroup opencl_mat
+     @{
+ */
+
+ /**
+ Get a handle to ArrayFire's OpenCL context
+
+ \param[in] retain if true calls clRetainContext prior to returning the context
+ \returns the current context being used by ArrayFire
+
+ \note Set \p retain to true if this value will be passed to a cl::Context constructor
+ */
+ static inline cl_context getContext(bool retain = false)
+ {
+     cl_context ctx;
+     af_err err = afcl_get_context(&ctx, retain);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL context from arrayfire");
+     return ctx;
+ }
+
+ /**
+ Get a handle to ArrayFire's OpenCL command queue
+
+ \param[in] retain if true calls clRetainCommandQueue prior to returning the command queue
+ \returns the current command queue being used by ArrayFire
+
+ \note Set \p retain to true if this value will be passed to a cl::CommandQueue constructor
+ */
+ static inline cl_command_queue getQueue(bool retain = false)
+ {
+     cl_command_queue queue;
+     af_err err = afcl_get_queue(&queue, retain);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL command queue from arrayfire");
+     return queue;
+ }
+
+ /**
+    Get the device ID for ArrayFire's current active device
+    \returns the cl_device_id of the current device
+ */
+ static inline cl_device_id getDeviceId()
+ {
+     cl_device_id id;
+     af_err err = afcl_get_device_id(&id);
+     if (err != AF_SUCCESS) throw af::exception("Failed to get OpenCL device ID");
+
+     return id;
+ }
+
+#if AF_API_VERSION >= 32
+ /**
+   Set ArrayFire's active device based on \p id of type cl_device_id
+
+   \param[in] id the cl_device_id of the device to be set as active device
+ */
+ static inline void setDeviceId(cl_device_id id)
+ {
+     af_err err = afcl_set_device_id(id);
+     if (err != AF_SUCCESS) throw af::exception("Failed to set OpenCL device as active device");
+ }
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Push user provided device control constructs into the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to use a
+   user-generated OpenCL context and related objects for ArrayFire operations.
+
+   \param[in] dev is the OpenCL device for which user provided context will be used by ArrayFire
+   \param[in] ctx is the user provided OpenCL cl_context to be used by ArrayFire
+   \param[in] que is the user provided OpenCL cl_command_queue to be used by ArrayFire. If this
+                  parameter is NULL, then we create a command queue for the user using the OpenCL
+                  context they provided us.
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+static inline void addDevice(cl_device_id dev, cl_context ctx, cl_command_queue que)
+{
+    af_err err = afcl_add_device_context(dev, ctx, que);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to push user provided device/context to ArrayFire pool");
+}
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Set active device using cl_context and cl_device_id
+
+   \param[in] dev is the OpenCL device id that is to be set as Active device inside ArrayFire
+   \param[in] ctx is the OpenCL cl_context being used by ArrayFire
+*/
+static inline void setDevice(cl_device_id dev, cl_context ctx)
+{
+    af_err err = afcl_set_device_context(dev, ctx);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to set device based on cl_device_id & cl_context");
+}
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Remove the user provided device control constructs from the ArrayFire device manager pool
+
+   This function should be used only when the user would like ArrayFire to remove an already
+   pushed user generated OpenCL context and related objects.
+
+   \param[in] dev is the OpenCL device id that has to be popped
+   \param[in] ctx is the cl_context object to be removed from ArrayFire pool
+
+   \note ArrayFire does not take control of releasing the objects passed to it. The user needs to release them appropriately.
+*/
+static inline void deleteDevice(cl_device_id dev, cl_context ctx)
+{
+    af_err err = afcl_delete_device_context(dev, ctx);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to remove the requested device from ArrayFire device pool");
+}
+#endif
+
+
+#if AF_API_VERSION >= 33
+ typedef afcl_device_type deviceType;
+ typedef afcl_platform platform;
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Get the type of the current device
+*/
+static inline deviceType getDeviceType()
+{
+    afcl_device_type res = AFCL_DEVICE_TYPE_UNKNOWN;
+    af_err err = afcl_get_device_type(&res);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to get OpenCL device type");
+    return res;
+}
+#endif
+
+#if AF_API_VERSION >= 33
+/**
+   Get a vendor enumeration for the current platform
+*/
+static inline platform getPlatform()
+{
+    afcl_platform res = AFCL_PLATFORM_UNKNOWN;
+    af_err err = afcl_get_platform(&res);
+    if (err!=AF_SUCCESS) throw af::exception("Failed to get OpenCL platform");
+    return res;
 }
+#endif
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] idims the dimensions of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(af::dim4 idims, cl_mem buf, af::dtype type, bool retain=false)
+ {
+     const unsigned ndims = (unsigned)idims.ndims();
+     const dim_t *dims = idims.get();
+
+     cl_context context;
+     cl_int clerr = clGetMemObjectInfo(buf, CL_MEM_CONTEXT, sizeof(cl_context), &context, NULL);
+     if (clerr != CL_SUCCESS) {
+         throw af::exception("Failed to get context from cl_mem object \"buf\" ");
+     }
+
+     if (context != getContext()) {
+         throw(af::exception("Context mismatch between input \"buf\" and arrayfire"));
+     }
+
+
+     if (retain) clerr = clRetainMemObject(buf);
+
+     af_array out;
+     af_err err = af_device_array(&out, buf, ndims, dims, type);
+
+     if (err != AF_SUCCESS || clerr != CL_SUCCESS) {
+         if (retain && clerr == CL_SUCCESS) clReleaseMemObject(buf);
+         throw af::exception("Failed to create device array");
+     }
+
+     return af::array(out);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] dim2 the length of the third dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               dim_t dim2,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1, dim2), buf, type, retain);
+ }
+
+ /**
+ Create an af::array object from an OpenCL cl_mem buffer
+
+ \param[in] dim0 the length of the first dimension of the buffer
+ \param[in] dim1 the length of the second dimension of the buffer
+ \param[in] dim2 the length of the third dimension of the buffer
+ \param[in] dim3 the length of the fourth dimension of the buffer
+ \param[in] buf the OpenCL memory object
+ \param[in] type the data type contained in the buffer
+ \param[in] retain if true, instructs ArrayFire to retain the memory object
+ \returns an array object created from the OpenCL buffer
+
+ \note Set \p retain to true if the memory originates from a cl::Buffer object
+  */
+ static inline af::array array(dim_t dim0, dim_t dim1,
+                               dim_t dim2, dim_t dim3,
+                               cl_mem buf, af::dtype type, bool retain=false)
+ {
+     return afcl::array(af::dim4(dim0, dim1, dim2, dim3), buf, type, retain);
+ }
+
+/**
+   @}
+*/
+}
+
 
 #endif
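
The `retain` notes in this header say to pass `true` when the returned handle will be given to an owning C++ wrapper such as `cl::Context`. The sketch below is one way to see why, using a mock reference count instead of real OpenCL. Every name in it (`DemoHandle`, `demoGetHandle`, `DemoOwner`, and so on) is hypothetical, and it assumes an owning wrapper that releases the handle in its destructor without retaining it on construction.

```cpp
#include <cassert>

// Hypothetical reference-counted handle standing in for a cl_context.
struct DemoHandle { int refs; };

static void demoRetain(DemoHandle *h)  { ++h->refs; }
static void demoRelease(DemoHandle *h) { --h->refs; }

// Stand-in for a library that holds exactly one reference to its handle.
static DemoHandle g_library_handle = {1};

// Same shape as getContext(bool retain): optionally bump the reference
// count before handing the raw handle to the caller.
static DemoHandle *demoGetHandle(bool retain = false) {
    if (retain) demoRetain(&g_library_handle);
    return &g_library_handle;
}

// Stand-in for an owning wrapper: it takes the handle as-is and releases
// one reference when it is destroyed.
class DemoOwner {
public:
    explicit DemoOwner(DemoHandle *h) : h_(h) {}
    ~DemoOwner() { demoRelease(h_); }
    DemoOwner(const DemoOwner &) = delete;
    DemoOwner &operator=(const DemoOwner &) = delete;
private:
    DemoHandle *h_;
};

// Returns the library's reference count after an owner built from the
// handle has gone out of scope.
static int demoRefsAfterOwnerScope(bool retain) {
    g_library_handle.refs = 1;  // reset the mock library state
    {
        DemoOwner owner(demoGetHandle(retain));
    }  // owner releases one reference here
    return g_library_handle.refs;
}
```

With `retain = true` the count returns to 1 after the owner is destroyed, so the library's own reference survives. With `retain = false` it drops to 0, which in real OpenCL would mean the object could be destroyed while the library is still using it.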
diff --git a/include/af/random.h b/include/af/random.h
new file mode 100644
index 0000000000..53939be226
--- /dev/null
+++ b/include/af/random.h
@@ -0,0 +1,551 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+
+///
+/// \brief Handle for a random engine object.
+///
+/// This handle is used to reference the internal random engine object.
+///
+/// \ingroup random_mat
+typedef void * af_random_engine;
+
+#ifdef __cplusplus
+namespace af
+{
+    class array;
+    class dim4;
+#if AF_API_VERSION >= 34
+    /// C++ Interface - Random Number Generation Engine Class
+    ///
+    /// The \ref af::randomEngine class is used to set the type and seed of
+    /// random number generation engine based on \ref af::randomEngineType.
+    ///
+    /// \ingroup arrayfire_class
+    /// \ingroup random_mat
+    class AFAPI randomEngine {
+    private:
+      ///
+      /// \brief Handle to the internal random engine object
+      af_random_engine engine;
+
+    public:
+      /**
+          C++ Interface to create a \ref af::randomEngine object with a \ref
+          af::randomEngineType and a seed.
+
+          \code
+            // create a random engine of default type with seed = 1
+            randomEngine r(AF_RANDOM_ENGINE_DEFAULT, 1);
+          \endcode
+      */
+      explicit randomEngine(randomEngineType typeIn = AF_RANDOM_ENGINE_DEFAULT,
+                            unsigned long long seedIn = 0);
+
+      /**
+          C++ Interface copy constructor for a \ref af::randomEngine.
+
+          \param[in] other input random engine object
+      */
+      randomEngine(const randomEngine &other);
+
+      /**
+          C++ Interface to create a copy of the random engine object from a
+          \ref af_random_engine handle.
+
+          \param[in] engine The input random engine object
+      */
+      randomEngine(af_random_engine engine);
+
+      /**
+          C++ Interface destructor for a \ref af::randomEngine.
+      */
+      ~randomEngine();
+
+      /**
+          C++ Interface to assign the internal state of the random engine.
+
+          \param[in] other object to be assigned to the random engine
+
+          \return the reference to this
+      */
+      randomEngine &operator=(const randomEngine &other);
+
+      /**
+          C++ Interface to set the random type of the random engine.
+
+          \param[in] type type of the random number generator
+      */
+      void setType(const randomEngineType type);
+
+      /**
+          C++ Interface to get the random type of the random engine.
+
+          \return \ref af::randomEngineType associated with random engine
+      */
+      randomEngineType getType(void);
+
+      /**
+          C++ Interface to set the seed of the random engine.
+
+          \param[in] seed initializing seed of the random number generator
+      */
+      void setSeed(const unsigned long long seed);
+
+      /**
+          C++ Interface to return the seed of the random engine.
+
+          \return seed associated with random engine
+      */
+      unsigned long long getSeed(void) const;
+
+      /**
+          C++ Interface to return the af_random_engine handle of this object.
+
+          \return handle to the af_random_engine associated with this random
+                  engine
+      */
+      af_random_engine get(void) const;
+    };
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] dims dimensions of the array to be generated
+        \param[in] ty   type of the array
+        \param[in] r    random engine object
+        \return    random number array of size `dims`
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim4 &dims, const dtype ty, randomEngine &r);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] dims dimensions of the array to be generated
+        \param[in] ty   type of the array
+        \param[in] r    random engine object
+        \return    random number array of size `dims`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim4 &dims, const dtype ty, randomEngine &r);
+#endif
+
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] dims dimensions of the array to be generated
+        \param[in] ty   type of the array
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim4 &dims, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0`
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim_t d0, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1`
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim_t d0,
+                      const dim_t d1, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] d2 size of the third dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1` x `d2`
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim_t d0,
+                      const dim_t d1, const dim_t d2, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers uniformly
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] d2 size of the third dimension
+        \param[in] d3 size of the fourth dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1` x `d2` x `d3`
+
+        \ingroup random_func_randu
+    */
+    AFAPI array randu(const dim_t d0,
+                      const dim_t d1, const dim_t d2,
+                      const dim_t d3, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] dims dimensions of the array to be generated
+        \param[in] ty   type of the array
+        \return    random number array of size `dims`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim4 &dims, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim_t d0, const dtype ty=f32);
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim_t d0,
+                      const dim_t d1, const dtype ty=f32);
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] d2 size of the third dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1` x `d2`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim_t d0,
+                      const dim_t d1, const dim_t d2, const dtype ty=f32);
+
+    /**
+        C++ Interface to create an array of random numbers normally
+        distributed.
+
+        \param[in] d0 size of the first dimension
+        \param[in] d1 size of the second dimension
+        \param[in] d2 size of the third dimension
+        \param[in] d3 size of the fourth dimension
+        \param[in] ty type of the array
+        \return    random number array of size `d0` x `d1` x `d2` x `d3`
+
+        \ingroup random_func_randn
+    */
+    AFAPI array randn(const dim_t d0,
+                      const dim_t d1, const dim_t d2,
+                      const dim_t d3, const dtype ty=f32);
+
+#if AF_API_VERSION >= 34
+    /**
+        C++ Interface to set the default random engine type.
+
+        \param[in] rtype type of the random number generator
+
+        \ingroup random_func_set_default_engine
+    */
+    AFAPI void setDefaultRandomEngineType(randomEngineType rtype);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+        C++ Interface to get the default random engine.
+
+        \return \ref af::randomEngine object for the default random engine
+
+        \ingroup random_func_get_default_engine
+    */
+    AFAPI randomEngine getDefaultRandomEngine(void);
+#endif
+
+    /**
+        C++ Interface to set the seed of the default random number generator.
+
+        \param[in] seed 64-bit unsigned integer
+
+        \ingroup random_func_set_seed
+    */
+    AFAPI void setSeed(const unsigned long long seed);
+
+    /**
+        C++ Interface to get the seed of the default random number generator.
+
+        \return seed 64-bit unsigned integer
+
+        \ingroup random_func_get_seed
+    */
+    AFAPI unsigned long long getSeed();
+
+}
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to create a random engine.
+
+       \param[out] engine pointer to the returned random engine object
+       \param[in]  rtype  type of the random number generator
+       \param[in]  seed   initializing seed of the random number generator
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_create_random_engine(af_random_engine *engine,
+                                         af_random_engine_type rtype,
+                                         unsigned long long seed);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to retain a random engine.
+
+       \param[out] out    pointer to the returned random engine object
+       \param[in]  engine random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_retain_random_engine(af_random_engine *out,
+                                         const af_random_engine engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to change random engine type.
+
+       \param[in]  engine random engine object
+       \param[in]  rtype  type of the random number generator
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_random_engine_set_type(af_random_engine *engine,
+                                           const af_random_engine_type rtype);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to get random engine type.
+
+       \param[out] rtype  type of the random number generator
+       \param[in]  engine random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_random_engine_get_type(af_random_engine_type *rtype,
+                                           const af_random_engine engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to create an array of uniform numbers using a random engine.
+
+       \param[out] out    pointer to the returned object
+       \param[in]  ndims  number of dimensions
+       \param[in]  dims   C pointer with `ndims` elements; each value
+                          represents the size of that dimension
+       \param[in]  type   type of the \ref af_array object
+       \param[in]  engine random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_randu
+    */
+    AFAPI af_err af_random_uniform(af_array *out, const unsigned ndims,
+                                   const dim_t * const dims, const af_dtype type,
+                                   af_random_engine engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to create an array of normal numbers using a random engine.
+
+       \param[out] out    pointer to the returned object
+       \param[in]  ndims  number of dimensions
+       \param[in]  dims   C pointer with `ndims` elements; each value
+                          represents the size of that dimension
+       \param[in]  type   type of the \ref af_array object
+       \param[in]  engine random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_randn
+    */
+    AFAPI af_err af_random_normal(af_array *out, const unsigned ndims,
+                                  const dim_t * const dims, const af_dtype type,
+                                  af_random_engine engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to set the seed of a random engine.
+
+       \param[out] engine pointer to the random engine object to be re-seeded
+       \param[in]  seed   initializing seed of the random number generator
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_random_engine_set_seed(af_random_engine *engine,
+                                           const unsigned long long seed);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to get the default random engine.
+
+       \param[out] engine pointer to the returned default random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_get_default_engine
+    */
+    AFAPI af_err af_get_default_random_engine(af_random_engine *engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to set the type of the default random engine.
+
+       \param[in]  rtype type of the random number generator
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_set_default_engine
+    */
+    AFAPI af_err af_set_default_random_engine_type(const af_random_engine_type rtype);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to get the seed of a random engine.
+
+       \param[out] seed   pointer to the returned seed
+       \param[in]  engine random engine object
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_random_engine_get_seed(unsigned long long * const seed,
+                                           af_random_engine engine);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       C Interface to release a random engine.
+
+       \param[in] engine random engine object
+       \return    \ref AF_SUCCESS, if function returns successfully, else
+                  an \ref af_err code is given
+
+       \ingroup random_func_random_engine
+    */
+    AFAPI af_err af_release_random_engine(af_random_engine engine);
+#endif
+
+    /**
+       \param[out] out   generated array
+       \param[in]  ndims number of dimensions
+       \param[in]  dims  array containing sizes of the dimension
+       \param[in]  type  type of array to generate
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_randu
+    */
+    AFAPI af_err af_randu(af_array *out, const unsigned ndims,
+                          const dim_t * const dims, const af_dtype type);
+
+    /**
+       \param[out] out   generated array
+       \param[in]  ndims number of dimensions
+       \param[in]  dims  array containing sizes of the dimension
+       \param[in]  type  type of array to generate
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+       \ingroup random_func_randn
+    */
+    AFAPI af_err af_randn(af_array *out, const unsigned ndims,
+                          const dim_t * const dims, const af_dtype type);
+
+    /**
+       \param[in] seed a 64-bit unsigned integer
+       \return    \ref AF_SUCCESS, if function returns successfully, else
+                  an \ref af_err code is given
+
+        \ingroup random_func_set_seed
+    */
+    AFAPI af_err af_set_seed(const unsigned long long seed);
+
+    /**
+       \param[out] seed a 64-bit unsigned integer
+       \return     \ref AF_SUCCESS, if function returns successfully, else
+                   an \ref af_err code is given
+
+        \ingroup random_func_get_seed
+    */
+    AFAPI af_err af_get_seed(unsigned long long *seed);
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/include/af/seq.h b/include/af/seq.h
index 3776accda6..5a19921b1f 100644
--- a/include/af/seq.h
+++ b/include/af/seq.h
@@ -10,8 +10,21 @@
 #pragma once
 #include <af/defines.h>
 
+/**
+    \struct af_seq
+
+    \brief C-style struct for creating sequences for indexing
+
+    \ingroup index_mat
+*/
 typedef struct af_seq {
-    double begin, end;
+    /// Start position of the sequence
+    double begin;
+
+    /// End position of the sequence (inclusive)
+    double end;
+
+    /// Step size between sequence values
     double step;
 } af_seq;
 
@@ -22,31 +35,167 @@ namespace af
 {
 class array;
 
+/**
+    \class seq
+
+    \brief seq is used to create sequences for indexing af::array
+
+    \ingroup arrayfire_class
+*/
 class AFAPI seq
 {
 public:
+    ///
+    /// \brief Get the \ref af_seq C-style struct
+    ///
     af_seq s;
+
+    ///
+    /// \brief Gets the length of the sequence
+    ///
     size_t size;
+
+    ///
+    /// \brief Flag for gfor
+    ///
     bool m_gfor;
 
-    seq(double = 0);
+    /**
+        \brief Creates a sequence of size length as [0, 1, 2..., length - 1]
+
+        The sequence has begin as 0, end as length - 1 and step as 1.
+
+        \note When doing seq(-n), where n is > 0, then the sequence is generated as
+        0...-n but step remains +1. This is because when such a sequence is
+        used for indexing af::array, then -n represents n elements from the
+        end. That is, seq(-2) will imply indexing an array 0...dimSize - 2.
+
+        \code
+                            // [begin, end, step]
+        seq a(10);          // [0, 9, 1]    => 0, 1, 2....9
+        \endcode
+
+        \param[in] length is the size of the seq to be created.
+    */
+    seq(double length = 0);
+
+    /**
+        \brief Destructor
+    */
     ~seq();
 
-    // begin, end, step
+    /**
+        \brief Creates a sequence starting at begin,
+        ending at or before end (inclusive) with increments as step.
+
+        The sequence will be [begin, begin + step, begin + 2 * step...., begin + n * step]
+        where the begin + n * step <= end.
+
+        \code
+                            // [begin, end, step]
+        seq a(10, 20);      // [10, 20, 1]  => 10, 11, 12....20
+        seq b(10, 20, 2);   // [10, 20, 2]  => 10, 12, 14....20
+        seq c(-5, 5);       // [-5, 5, 1]   => -5, -4, -3....0, 1....5
+        seq d(-5, -15, -1); // [-5,-15, -1] => -5, -6, -7....-15
+        seq e(-15, -5, 1);  // [-15, -5, 1] => -15, -14, -13....-5
+        \endcode
+
+        \param[in] begin is the start of the sequence
+        \param[in] end is the maximum value a sequence can take (inclusive)
+        \param[in] step is the increment or decrement size (default is 1)
+    */
     seq(double begin, double end, double step = 1);
 
-    seq(seq afs, bool is_gfor);
+    /**
+        \brief Copy constructor
 
+        Creates a copy seq from another sequence.
+
+        \param[in] other sequence to be copied
+        \param[in] is_gfor is the gfor flag
+    */
+    seq(seq other, bool is_gfor);
+
+    /**
+        \brief Create a seq object from an \ref af_seq struct
+
+        \param[in] s_ is the \ref af_seq struct
+    */
     seq(const af_seq& s_);
 
+    /**
+        \brief Assignment operator to create a new sequence from an af_seq
+
+        This operator creates a new sequence using the begin, end and step
+        from the input sequence.
+
+        \param[in] s is the input sequence
+    */
     seq& operator=(const af_seq& s);
 
+    /**
+        \brief Negation operator creates a sequence with the signs negated
+
+        begin is changed to -begin
+        end is changed to -end
+        step is changed to -step
+
+        \code
+                        // [begin, end, step]
+        seq a(1, 10);   // [ 1, 10, 1] => 1, 2, 3....10
+        seq b = -a;     // [-1,-10,-1] => -1, -2, -3...-10
+        \endcode
+    */
     inline seq operator-()         { return seq(-s.begin, -s.end,  -s.step); }
 
+    /**
+        \brief Addition operator offsets the begin and end by x. There is no
+        change in step.
+
+        begin is changed to begin + x
+        end is changed to end + x
+
+        \code
+                            // [begin, end, step]
+        seq a(2, 20, 2);    // [2, 20, 2] => 2, 4, 6....20
+        seq b = a + 3;      // [5, 23, 2] => 5, 7, 9....23
+        \endcode
+    */
     inline seq operator+(double x) { return seq(s.begin + x, s.end + x, s.step); }
 
+    /**
+        \brief Subtraction operator offsets the begin and end by x. There is no
+        change in step.
+
+        begin is changed to begin - x
+        end is changed to end - x
+
+        \code
+                            // [begin, end, step]
+        seq a(10, 20, 2);   // [10, 20, 2] => 10, 12, 14....20
+        seq b(2, 10);       // [ 2, 10, 1] => 2, 3, 4....10
+        seq c = a - 3;      // [ 7, 17, 2] => 7, 9, 11....17
+        seq d = b - 3;      // [-1,  7, 1] => -1, 0, 1....7
+        \endcode
+    */
     inline seq operator-(double x) { return seq(s.begin - x, s.end - x, s.step); }
 
+    /**
+        \brief Multiplication operator spaces the sequence by a factor x.
+
+        begin is changed to begin * x
+        end is changed to end * x
+        step is changed to step * x
+
+        \code
+                            // [begin, end, step]
+        seq a(10, 20, 2);   // [10, 20, 2] => 10, 12, 14....20
+        seq b(-5, 5);       // [-5, 5, 1] => -5, -4, -3....0, 1....5
+        seq c = a * 3;      // [30, 60, 6] => 30, 36, 42....60
+        seq d = b * 3;      // [-15, 15, 3] => -15, -12, -9....0, 3....15
+        seq e = a * 0.5;    // [5, 10, 1] => 5, 6, 7....10
+        \endcode
+    */
     inline seq operator*(double x) { return seq(s.begin * x, s.end * x, s.step * x); }
 
     friend inline seq operator+(double x, seq y) { return  y + x; }
@@ -55,16 +204,43 @@ class AFAPI seq
 
     friend inline seq operator*(double x, seq y) { return  y * x; }
 
+    /**
+        \brief Implicit conversion operator from seq to af::array
+
+        Converts a seq object into an af::array object. The contents of the
+        af::array will be the explicit values from the seq.
+
+        \note Do not use this to create arrays of sequences. Use \ref range.
+
+        \code
+                            // [begin, end, step]
+        seq s(10, 20, 2);   // [10, 20, 2] => 10, 12, 14....20
+        array arr = s;
+        af_print(arr);      // 10    12    14    16    18    20
+        \endcode
+    */
     operator array() const;
 
     private:
     void init(double begin, double end, double step);
 };
 
+/// A special value representing the last value of an axis
 extern AFAPI int end;
+
+/// A special value representing the entire axis of an af::array
 extern AFAPI seq span;
 
 }
 #endif
 
-extern "C" AFAPI af_seq af_make_seq(double begin, double end, double step);
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/// Create a new af_seq object.
+AFAPI af_seq af_make_seq(double begin, double end, double step);
+
+#ifdef __cplusplus
+}
+#endif
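
The `[begin, end, step]` arithmetic documented in the seq.h comments above can be checked with a small standalone model. This is only a sketch: `SeqModel`, `expand`, `negate`, `offset` and `scale` are hypothetical names introduced here for illustration, they are not part of ArrayFire, and the model ignores the special handling of `seq(-n)`, `af::end` and `af::span` during indexing.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for af_seq: only the three documented fields.
struct SeqModel {
    double begin;
    double end;
    double step;
};

// Expand [begin, end, step] into explicit values, with `end` inclusive,
// i.e. begin + i * step for every i >= 0 that does not pass `end`.
std::vector<double> expand(const SeqModel &s) {
    std::vector<double> out;
    if (s.step == 0.0) return out;                 // no progress possible
    const double span = (s.end - s.begin) / s.step;
    if (span < 0.0) return out;                    // step points away from end
    // Small tolerance guards against floating-point round-off in the division.
    const long n = static_cast<long>(std::floor(span + 1e-9));
    for (long i = 0; i <= n; ++i) out.push_back(s.begin + i * s.step);
    return out;
}

// Mirrors the documented unary minus: all three fields change sign.
SeqModel negate(const SeqModel &s) { return {-s.begin, -s.end, -s.step}; }

// Mirrors the documented + and - with a scalar: step is left unchanged.
SeqModel offset(const SeqModel &s, double x) {
    return {s.begin + x, s.end + x, s.step};
}

// Mirrors the documented * with a scalar: step is scaled as well.
SeqModel scale(const SeqModel &s, double x) {
    return {s.begin * x, s.end * x, s.step * x};
}
```

Under this model, `offset({2.0, 10.0, 1.0}, -3.0)` yields `[-1, 7, 1]`, so the step stays at 1 after a scalar subtraction, while `scale({10.0, 20.0, 2.0}, 0.5)` yields `[5, 10, 1]` because multiplication rescales the step too.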
diff --git a/include/af/signal.h b/include/af/signal.h
index 41810f9a82..f24e4df3df 100644
--- a/include/af/signal.h
+++ b/include/af/signal.h
@@ -18,42 +18,119 @@ class array;
 class dim4;
 
 /**
-   C++ Interface for data interpolation on one dimensional data
+   C++ Interface for data interpolation on one-dimensional signals.
 
-   \param[in]  in is the input array
-   \param[in]  pos array contains the interpolation locations
-   \param[in]  method is the interpolation type, it can take one of the values defined by the
-               enum \ref af_interp_type
-   \param[in]  offGrid is the value that will set in the output array when certain index is out of bounds
-   \return     the array with interpolated values
+   \param[in]  in is the multidimensional input array. Values are assumed to lie on uniformly spaced indices in the range of `[0, n)`, where `n` is the number of elements in the array.
+   \param[in]  pos positions of the interpolation points along the first dimension.
+   \param[in]  method is the interpolation method to be used. The following types (defined in enum \ref af_interp_type) are supported: nearest neighbor, linear, and cubic.
+   \param[in]  off_grid is the default value for any indices outside the valid range of indices.
+   \returns    the interpolated array.
+
+   The code sample below demonstrates approx1()'s usage:
+
+   \snippet test/approx1.cpp ex_signal_approx1
 
    \ingroup signal_func_approx1
  */
 AFAPI array approx1(const array &in, const array &pos,
-                    const interpType method = AF_INTERP_LINEAR, const float offGrid = 0.0f);
+                    const interpType method = AF_INTERP_LINEAR, const float off_grid = 0.0f);
 
 /**
-   C++ Interface for data interpolation on two dimensional data
+   C++ Interface for data interpolation on two-dimensional signals.
 
-   \param[in]  in is the input array
-   \param[in]  pos0 array contains the interpolation locations for 0th dimension
-   \param[in]  pos1 array contains the interpolation locations for 1st dimension
-   \param[in]  method is the interpolation type, it can take one of the values defined by the
-               enum \ref af_interp_type
-   \param[in]  offGrid is the value that will set in the output array when certain index is out of bounds
-   \return     the array with interpolated values
+   \param[in]  in is the multidimensional input array. Values are assumed to lie on uniformly spaced indices in the range of `[0, n)` along both interpolation dimensions. `n` is the number of elements in the array.
+   \param[in]  pos0 positions of the interpolation points along the first dimension.
+   \param[in]  pos1 positions of the interpolation points along the second dimension.
+   \param[in]  method is the interpolation method to be used. All interpolation types defined in \ref af_interp_type are supported.
+   \param[in]  off_grid is the default value for any indices outside the valid range of indices.
+   \returns    the interpolated array.
+
+   The code sample below demonstrates approx2()'s usage:
+
+   \snippet test/approx2.cpp ex_signal_approx2
 
    \ingroup signal_func_approx2
  */
 AFAPI array approx2(const array &in, const array &pos0, const array &pos1,
-                    const interpType method = AF_INTERP_LINEAR, const float offGrid = 0.0f);
+                    const interpType method = AF_INTERP_LINEAR, const float off_grid = 0.0f);
+
+
+#if AF_API_VERSION >= 37
+/**
+   C++ Interface for data interpolation on one-dimensional signals.
+
+   This version of approx1() accepts the dimension of the input along
+   which to interpolate. It also accepts start and step values which
+   define the uniform range of corresponding indices.
+
+   The following image illustrates what the range of indices
+   corresponding to the input values look like if `idx_start` and
+   `idx_step` are set to an arbitrary value of 10,
+
+   \image html approx1_arbitrary_idx.png "approx1() using idx_start=10.0, idx_step=10.0"
+
+   The blue dots represent indices whose values are known. The red dots
+   represent indices whose values are unknown.
+
+   \param[in]  in is the multidimensional input array. Values lie on uniformly spaced indices determined by `idx_start` and `idx_step`.
+   \param[in]  pos positions of the interpolation points along `interp_dim`.
+   \param[in]  interp_dim is the dimension to perform interpolation across.
+   \param[in]  idx_start is the first index value along `interp_dim`.
+   \param[in]  idx_step is the uniform spacing value between subsequent indices along `interp_dim`.
+   \param[in]  method is the interpolation method to be used. The following types (defined in enum \ref af_interp_type) are supported: nearest neighbor, linear, and cubic.
+   \param[in]  off_grid is the default value for any indices outside the valid range of indices.
+   \returns    the interpolated array.
+
+   The code sample below demonstrates usage:
+
+   \snippet test/approx1.cpp ex_signal_approx1_uniform
+
+   \ingroup signal_func_approx1
+ */
+AFAPI array approx1(const array &in,
+                    const array &pos, const int interp_dim,
+                    const double idx_start, const double idx_step,
+                    const interpType method = AF_INTERP_LINEAR, const float off_grid = 0.0f);
+
+/**
+   C++ Interface for data interpolation on two-dimensional signals.
+
+   This version of approx2() accepts the two dimensions of the input
+   along which to interpolate. It also accepts start and step values
+   which define the uniform range of corresponding indices along
+   each of those dimensions.
+
+   \param[in]  in is the multidimensional input array.
+   \param[in]  pos0 positions of the interpolation points along `interp_dim0`.
+   \param[in]  interp_dim0 is the first dimension to perform interpolation across.
+   \param[in]  idx_start_dim0 is the first index value along `interp_dim0`.
+   \param[in]  idx_step_dim0 is the uniform spacing value between subsequent indices along `interp_dim0`.
+   \param[in]  pos1 positions of the interpolation points along `interp_dim1`.
+   \param[in]  interp_dim1 is the second dimension to perform interpolation across.
+   \param[in]  idx_start_dim1 is the first index value along `interp_dim1`.
+   \param[in]  idx_step_dim1 is the uniform spacing value between subsequent indices along `interp_dim1`.
+   \param[in]  method is the interpolation method to be used. All interpolation types defined in \ref af_interp_type are supported.
+   \param[in]  off_grid is the default value for any indices outside the valid range of indices.
+   \returns    the interpolated array.
+
+   The code sample below demonstrates usage:
+
+   \snippet test/approx2.cpp ex_signal_approx2_uniform
+
+   \ingroup signal_func_approx2
+ */
+AFAPI array approx2(const array &in,
+                    const array &pos0, const int interp_dim0, const double idx_start_dim0, const double idx_step_dim0,
+                    const array &pos1, const int interp_dim1, const double idx_start_dim1, const double idx_step_dim1,
+                    const interpType method = AF_INTERP_LINEAR, const float off_grid = 0.0f);
+#endif
 
 /**
-   C++ Interface for fast fourier transform on one dimensional data
+   C++ Interface for fast fourier transform on one dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -61,12 +138,12 @@ AFAPI array approx2(const array &in, const array &pos0, const array &pos1,
 AFAPI array fftNorm(const array& in, const double norm_factor, const dim_t odim0=0);
 
 /**
-   C++ Interface for fast fourier transform on two dimensional data
+   C++ Interface for fast fourier transform on two dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_fft2
@@ -74,27 +151,69 @@ AFAPI array fftNorm(const array& in, const double norm_factor, const dim_t odim0
 AFAPI array fft2Norm(const array& in, const double norm_factor, const dim_t odim0=0, const dim_t odim1=0);
 
 /**
-   C++ Interface for fast fourier transform on three dimensional data
+   C++ Interface for fast fourier transform on three dimensional signals
 
-   \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  in is the input array
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_fft3
  */
 AFAPI array fft3Norm(const array& in, const double norm_factor, const dim_t odim0=0, const dim_t odim1=0, const dim_t odim2=0);
 
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for fast fourier transform on one dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 1D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_fft
+ */
+AFAPI void fftInPlace(array& in, const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for fast fourier transform on two dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 2D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_fft2
+ */
+AFAPI void fft2InPlace(array& in, const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for fast fourier transform on three dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 3D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_fft3
+ */
+AFAPI void fft3InPlace(array& in, const double norm_factor = 1.0);
+#endif
+
 /**
-   C++ Interface for fast fourier transform on one dimensional data
+   C++ Interface for fast fourier transform on one dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -102,14 +221,14 @@ AFAPI array fft3Norm(const array& in, const double norm_factor, const dim_t odim
 AFAPI array fft(const array& in, const dim_t odim0=0);
 
 /**
-   C++ Interface for fast fourier transform on two dimensional data
+   C++ Interface for fast fourier transform on two dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_fft2
@@ -117,15 +236,15 @@ AFAPI array fft(const array& in, const dim_t odim0=0);
 AFAPI array fft2(const array& in, const dim_t odim0=0, const dim_t odim1=0);
 
 /**
-   C++ Interface for fast fourier transform on three dimensional data
+   C++ Interface for fast fourier transform on three dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_fft3
@@ -133,11 +252,11 @@ AFAPI array fft2(const array& in, const dim_t odim0=0, const dim_t odim1=0);
 AFAPI array fft3(const array& in, const dim_t odim0=0, const dim_t odim1=0, const dim_t odim2=0);
 
 /**
-   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -145,13 +264,13 @@ AFAPI array fft3(const array& in, const dim_t odim0=0, const dim_t odim1=0, cons
 AFAPI array dft(const array& in, const double norm_factor, const dim4 outDims);
 
 /**
-   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input data
+   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -159,10 +278,10 @@ AFAPI array dft(const array& in, const double norm_factor, const dim4 outDims);
 AFAPI array dft(const array& in, const dim4 outDims);
 
 /**
-   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
    \return     the transformed array
@@ -172,11 +291,11 @@ AFAPI array dft(const array& in, const dim4 outDims);
 AFAPI array dft(const array& in);
 
 /**
-   C++ Interface for inverse fast fourier transform on one dimensional data
+   C++ Interface for inverse fast fourier transform on one dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_ifft
@@ -184,12 +303,12 @@ AFAPI array dft(const array& in);
 AFAPI array ifftNorm(const array& in, const double norm_factor, const dim_t odim0=0);
 
 /**
-   C++ Interface for inverse fast fourier transform on two dimensional data
+   C++ Interface for inverse fast fourier transform on two dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_ifft2
@@ -197,27 +316,69 @@ AFAPI array ifftNorm(const array& in, const double norm_factor, const dim_t odim
 AFAPI array ifft2Norm(const array& in, const double norm_factor, const dim_t odim0=0, const dim_t odim1=0);
 
 /**
-   C++ Interface for inverse fast fourier transform on three dimensional data
+   C++ Interface for inverse fast fourier transform on three dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_ifft3
  */
 AFAPI array ifft3Norm(const array& in, const double norm_factor, const dim_t odim0=0, const dim_t odim1=0, const dim_t odim2=0);
 
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for inverse fast fourier transform on one dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 1D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_ifft
+ */
+AFAPI void ifftInPlace(array& in, const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for inverse fast fourier transform on two dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 2D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_ifft2
+ */
+AFAPI void ifft2InPlace(array& in, const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for inverse fast fourier transform on three dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 3D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+
+   \note The input \p in must be complex
+
+   \ingroup signal_func_ifft3
+ */
+AFAPI void ifft3InPlace(array& in, const double norm_factor = 1.0);
+#endif
+
 /**
-   C++ Interface for inverse fast fourier transform on one dimensional data
+   C++ Interface for inverse fast fourier transform on one dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_ifft
@@ -225,14 +386,14 @@ AFAPI array ifft3Norm(const array& in, const double norm_factor, const dim_t odi
 AFAPI array ifft(const array& in, const dim_t odim0=0);
 
 /**
-   C++ Interface for inverse fast fourier transform on two dimensional data
+   C++ Interface for inverse fast fourier transform on two dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_ifft2
@@ -240,15 +401,15 @@ AFAPI array ifft(const array& in, const dim_t odim0=0);
 AFAPI array ifft2(const array& in, const dim_t odim0=0, const dim_t odim1=0);
 
 /**
-   C++ Interface for inverse fast fourier transform on three dimensional data
+   C++ Interface for inverse fast fourier transform on three dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     the transformed array
 
    \ingroup signal_func_ifft3
@@ -256,11 +417,11 @@ AFAPI array ifft2(const array& in, const dim_t odim0=0, const dim_t odim1=0);
 AFAPI array ifft3(const array& in, const dim_t odim0=0, const dim_t odim1=0, const dim_t odim2=0);
 
 /**
-   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -268,13 +429,13 @@ AFAPI array ifft3(const array& in, const dim_t odim0=0, const dim_t odim1=0, con
 AFAPI array idft(const array& in, const double norm_factor, const dim4 outDims);
 
 /**
-   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
-   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input data
+   \param[in]  outDims is an object of \ref dim4 that has the output array dimensions - used to either truncate or pad the input signals
    \return     the transformed array
 
    \ingroup signal_func_fft
@@ -282,10 +443,10 @@ AFAPI array idft(const array& in, const double norm_factor, const dim4 outDims);
 AFAPI array idft(const array& in, const dim4 outDims);
 
 /**
-   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional data
+   C++ Interface for inverse fast fourier transform on any(1d, 2d, 3d) dimensional signals
 
-   This version of fft function uses a default norm_factor paramter that is calculated internally
-   based on the input data.
+   This version of fft function uses a default norm_factor parameter that is calculated internally
+   based on the input signals.
 
    \param[in]  in is the input array
    \return     the transformed array
@@ -294,8 +455,62 @@ AFAPI array idft(const array& in, const dim4 outDims);
  */
 AFAPI array idft(const array& in);
 
+#if AF_API_VERSION >= 31
 /**
-   C++ Interface for convolution any(one through three) dimensional data
+   C++ Interface for real to complex fast fourier transform
+
+   \param[in]  in is a real array
+   \param[in]  dims is the requested padded dimensions before the transform is applied
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     a complex array containing the non redundant parts of the transform of \p in along the first dimension.
+
+   \note The first dimension of the output will be of size (dims[0] / 2) + 1. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_r2c
+*/
+template<int rank>
+array fftR2C(const array &in,
+             const dim4& dims,
+             const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for real to complex fast fourier transform
+
+   \param[in]  in is a real array
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     a complex array containing the non redundant parts of the transform of \p in along the first dimension.
+
+   \note The first dimension of the output will be of size (in.dims(0) / 2) + 1. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_r2c
+*/
+template<int rank>
+array fftR2C(const array &in,
+             const double norm_factor = 1.0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for complex to real fast fourier transform
+
+   \param[in]  in is a complex array containing only the non redundant parts of the signals
+   \param[in]  is_odd is a flag signifying if the output should be even or odd size
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \tparam     rank signifies the dimensionality of the transform
+   \return     A real array of size [2 * idim0 - 2 + is_odd, idim1, idim2, idim3] where idim{0,1,2,3} signify input dimensions
+
+   \ingroup signal_func_fft_c2r
+*/
+
+template<int rank>
+array fftC2R(const array &in, bool is_odd = false,
+                 const double norm_factor = 1.0);
+#endif
+
+/**
+   C++ Interface for convolution on any (one through three) dimensional signals
 
    Example for convolution on one dimensional signal in one to one batch mode
    \snippet test/convolve.cpp ex_image_convolve_1d
@@ -308,131 +523,161 @@ AFAPI array idft(const array& in);
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency or spatial domain
    \return     the convolved array
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve
  */
 AFAPI array convolve(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT, const convDomain domain=AF_CONV_AUTO);
 
 /**
-   C++ Interface for separable convolution on two dimensional data
+   C++ Interface for separable convolution on two dimensional signals
 
    \snippet test/convolve.cpp ex_image_conv2_sep
 
    \param[in]  signal is the input signal
    \param[in]  col_filter is the filter that shall be applied along columns
    \param[in]  row_filter is the filter that shall be applied along rows
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \return     the convolved array
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \note Separable convolution only supports two(ONE-to-ONE and MANY-to-ONE) batch modes from the ones described in the detailed description section.
 
-   \ingroup signal_func_convolve
+   \ingroup signal_func_convolve_sep
  */
 AFAPI array convolve(const array& col_filter, const array& row_filter, const array& signal, const convMode mode=AF_CONV_DEFAULT);
 
 /**
-   C++ Interface for convolution on one dimensional data
+   C++ Interface for convolution on one dimensional signals
 
    \snippet test/convolve.cpp ex_image_convolve1
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency or spatial domain
    \return     the convolved array
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve1
  */
 AFAPI array convolve1(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT, const convDomain domain=AF_CONV_AUTO);
 
 /**
-   C++ Interface for convolution on two dimensional data
+   C++ Interface for convolution on two dimensional signals
 
    \snippet test/convolve.cpp ex_image_convolve2
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency or spatial domain
    \return     the convolved array
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve2
  */
 AFAPI array convolve2(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT, const convDomain domain=AF_CONV_AUTO);
 
 /**
-   C++ Interface for convolution on three dimensional data
+   C++ Interface for 2D convolution
+
+   This version of convolution is consistent with the machine learning
+   formulation that will spatially convolve a filter on 2-dimensions against a
+   signal. Multiple signals and filters can be batched against each other.
+   Furthermore, the signals and filters can be multi-dimensional; however, their
+   dimensions must match.
+
+   Example:
+   Signals with dimensions: d0 x d1 x d2 x Ns
+   Filters with dimensions: d0 x d1 x d2 x Nf
+
+   Resulting Convolution: d0 x d1 x Nf x Ns
+
+   \param[in]  signal   is the input signal
+   \param[in]  filter   is the filter that will be used for the convolution operation
+   \param[in]  stride   specifies the filter strides along each dimension
+   \param[in]  padding  specifies the padding along each dimension
+   \param[in]  dilation specifies the amount to dilate the filter before convolution
+   \return              the convolved array
+
+   \note Make sure you pass in both dim0 and dim1 in your dim4 arguments. The third
+   and fourth dimensions are currently ignored.
+
+   \ingroup signal_func_convolve2
+ */
+AFAPI array convolve2NN(const array& signal, const array& filter,
+                        const dim4 stride, const dim4 padding, const dim4 dilation);
+
+/**
+   C++ Interface for convolution on three dimensional signals
 
    \snippet test/convolve.cpp ex_image_convolve3
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency or spatial domain
    \return     the convolved array
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve3
  */
 AFAPI array convolve3(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT, const convDomain domain=AF_CONV_AUTO);
 
 /**
-   C++ Interface for FFT-based convolution any(one through three) dimensional data
+   C++ Interface for FFT-based convolution on any (one through three) dimensional signals
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input)
    \return     the convolved array
 
-   \ingroup signal_func_fft_convolve
+   \ingroup signal_func_convolve
  */
 AFAPI array fftConvolve(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT);
 
 /**
-   C++ Interface for convolution on one dimensional data
+   C++ Interface for convolution on 1D signals using FFT
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     the convolved array
 
-   \ingroup signal_func_fft_convolve1
+   \ingroup signal_func_convolve1
  */
 AFAPI array fftConvolve1(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT);
 
 /**
-   C++ Interface for convolution on two dimensional data
+   C++ Interface for convolution on 2D signals using FFT
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     the convolved array
 
-   \ingroup signal_func_fft_convolve2
+   \ingroup signal_func_convolve2
  */
 AFAPI array fftConvolve2(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT);
 
 /**
-   C++ Interface for convolution on three dimensional data
+   C++ Interface for convolution on 3D signals using FFT
 
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     the convolved array
 
-   \ingroup signal_func_fftconvolve3
+   \ingroup signal_func_convolve3
  */
 AFAPI array fftConvolve3(const array& signal, const array& filter, const convMode mode=AF_CONV_DEFAULT);
 
@@ -461,6 +706,69 @@ AFAPI array fir(const array &b, const array &x);
 */
 AFAPI array iir(const array &b, const array &a, const array &x);
 
+/**
+    C++ Interface for median filter
+
+    \snippet test/medfilt.cpp ex_image_medfilt
+
+    \param[in]  in array is the input image
+    \param[in]  wind_length is the kernel height
+    \param[in]  wind_width is the kernel width
+    \param[in]  edge_pad determines how border pixels are handled when the
+                filter runs in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+    \return     the processed image
+
+    \ingroup image_func_medfilt
+*/
+AFAPI array medfilt(const array& in, const dim_t wind_length = 3, const dim_t wind_width = 3, const borderType edge_pad = AF_PAD_ZERO);
+
+#if AF_API_VERSION >= 34
+/**
+    C++ Interface for 1D median filter
+
+    \snippet test/medfilt.cpp ex_image_medfilt
+
+    \param[in]  in array is the input signal
+    \param[in]  wind_width is the kernel width
+    \param[in]  edge_pad determines how border elements are handled when the
+                filter runs in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+    \return     the processed signal
+
+    \ingroup image_func_medfilt
+*/
+AFAPI array medfilt1(const array& in, const dim_t wind_width = 3, const borderType edge_pad = AF_PAD_ZERO);
+#endif
+
+#if AF_API_VERSION >= 34
+/**
+    C++ Interface for 2D median filter
+
+    \snippet test/medfilt.cpp ex_image_medfilt
+
+    \param[in]  in array is the input image
+    \param[in]  wind_length is the kernel height
+    \param[in]  wind_width is the kernel width
+    \param[in]  edge_pad determines how border pixels are handled when the
+                filter runs in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+    \return     the processed image
+
+    \ingroup image_func_medfilt
+*/
+AFAPI array medfilt2(const array& in, const dim_t wind_length = 3, const dim_t wind_width = 3, const borderType edge_pad = AF_PAD_ZERO);
+#endif
+
+#if AF_API_VERSION >= 35
+/**
+   C++ Interface for setting plan cache size
+
+   This function has no effect when the CPU backend is active. The plans associated with
+   the most recently used array sizes are cached.
+
+   \param[in] cacheSize is the number of plans that shall be cached
+*/
+AFAPI void setFFTPlanCacheSize(size_t cacheSize);
+#endif
+
 }
 #endif
 
@@ -469,47 +777,311 @@ extern "C" {
 #endif
 
 /**
-   C Interface for data interpolation on one dimensional data
-
-   \param[out] out is the array with interpolated values
-   \param[in]  in is the input array
-   \param[in]  pos array contains the interpolation locations
-   \param[in]  method is the interpolation type, it can take one of the values defined by the
-               enum \ref af_interp_type
-   \param[in]  offGrid is the value that will set in the output array when certain index is out of bounds
-   \return     \ref AF_SUCCESS if the interpolation operation is successful,
-               otherwise an appropriate error code is returned.
+   C Interface for interpolation on one dimensional signals.
+
+   \param[out] out      is the interpolated array.
+   \param[in]  in       is the multidimensional input array. Values are assumed
+                        to lie on uniformly spaced indices in the range of
+                        `[0, n)`, where `n` is the number of elements in the
+                        array.
+   \param[in]  pos      positions of the interpolation points along the first
+                        dimension.
+   \param[in]  method   is the interpolation method to be used. The following
+                        types (defined in enum \ref af_interp_type)
+                        are supported: nearest neighbor, linear, and cubic.
+   \param[in]  off_grid is the default value for any indices outside the
+                        valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
 
    \ingroup signal_func_approx1
  */
 AFAPI af_err af_approx1(af_array *out, const af_array in, const af_array pos,
-                        const af_interp_type method, const float offGrid);
+                        const af_interp_type method, const float off_grid);
 
+#if AF_API_VERSION >= 37
 /**
-   C Interface for data interpolation on two dimensional data
+   C Interface for the version of \ref af_approx1 that accepts a preallocated
+   output array
+
+   \param[in,out] out      is the interpolated array (can be preallocated).
+   \param[in]     in       is the multidimensional input array. Values are
+                           assumed to lie on uniformly spaced indices in the
+                           range of `[0, n)`, where `n` is the number of
+                           elements in the array.
+   \param[in]     pos      positions of the interpolation points along the first
+                           dimension.
+   \param[in]     method   is the interpolation method to be used. The following
+                           types (defined in enum \ref af_interp_type)
+                           are supported: nearest neighbor, linear, and cubic.
+   \param[in]     off_grid is the default value for any indices outside the
+                           valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \note \p out can be either null or an existing `af_array` object. If it is a
+         sub-array of an existing `af_array`, only the corresponding portion of
+         the `af_array` will be overwritten.
+   \note Passing an `af_array` that has not been initialized to \p out will
+         cause undefined behavior.
 
-   \param[out] out is the array with interpolated values
-   \param[in]  in is the input array
-   \param[in]  pos0 array contains the interpolation locations for 0th dimension
-   \param[in]  pos1 array contains the interpolation locations for 1st dimension
-   \param[in]  method is the interpolation type, it can take one of the values defined by the
-               enum \ref af_interp_type
-   \param[in]  offGrid is the value that will set in the output array when certain index is out of bounds
-   \return     \ref AF_SUCCESS if the interpolation operation is successful,
-               otherwise an appropriate error code is returned.
+   \ingroup signal_func_approx1
+ */
+AFAPI af_err af_approx1_v2(af_array *out, const af_array in, const af_array pos,
+                           const af_interp_type method, const float off_grid);
+#endif
+
+/**
+   C Interface for interpolation on two dimensional signals.
+
+   \param[out] out      the interpolated array.
+   \param[in]  in       is the multidimensional input array. Values are assumed
+                        to lie on uniformly spaced indices in the range of
+                        `[0, n)` along both interpolation dimensions. `n` is
+                        the number of elements in the array.
+   \param[in]  pos0     positions of the interpolation points along the first
+                        dimension.
+   \param[in]  pos1     positions of the interpolation points along the second
+                        dimension.
+   \param[in]  method   is the interpolation method to be used. All
+                        interpolation types defined in \ref af_interp_type are
+                        supported.
+   \param[in]  off_grid is the default value for any indices outside the valid
+                        range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \ingroup signal_func_approx2
+ */
+AFAPI af_err af_approx2(af_array *out, const af_array in,
+                        const af_array pos0, const af_array pos1,
+                        const af_interp_type method, const float off_grid);
+
+#if AF_API_VERSION >= 37
+/**
+   C Interface for the version of \ref af_approx2 that accepts a preallocated
+   output array
+
+   \param[in,out] out      the interpolated array (can be preallocated).
+   \param[in]     in       is the multidimensional input array. Values are
+                           assumed to lie on uniformly spaced indices in the
+                           range of `[0, n)` along both interpolation
+                           dimensions. `n` is the number of elements in the array.
+   \param[in]     pos0     positions of the interpolation points along the first
+                           dimension.
+   \param[in]     pos1     positions of the interpolation points along the
+                           second dimension.
+   \param[in]     method   is the interpolation method to be used. All
+                           interpolation types defined in \ref af_interp_type
+                           are supported.
+   \param[in]     off_grid is the default value for any indices outside the
+                           valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \note \p out can be either null or an existing `af_array` object. If it is a
+         sub-array of an existing `af_array`, only the corresponding portion of
+         the `af_array` will be overwritten.
+   \note Passing an `af_array` to \p out that has not been initialized will
+         cause undefined behavior.
+
+   \ingroup signal_func_approx2
+ */
+AFAPI af_err af_approx2_v2(af_array *out, const af_array in,
+                           const af_array pos0, const af_array pos1,
+                           const af_interp_type method, const float off_grid);
+#endif
+
+
+#if AF_API_VERSION >= 37
+/**
+   C Interface for interpolation on one dimensional signals along a
+   specified dimension.
+
+   af_approx1_uniform() accepts the dimension of the input along which to
+   perform the interpolation. It also accepts start and step values which
+   define the uniform range of corresponding indices.
+
+   The following image illustrates what the range of indices corresponding to
+   the input values looks like if `idx_start` and `idx_step` are both set to an
+   arbitrary value of 10:
+
+   \image html approx1_arbitrary_idx.png "approx1() using idx_start=10.0, idx_step=10.0"
+
+   The blue dots represent indices whose values are known. The red dots
+   represent indices whose values are unknown.
+
+   \param[out] out        the interpolated array.
+   \param[in]  in         is the multidimensional input array. Values lie on
+                          uniformly spaced indices determined by `idx_start`
+                          and `idx_step`.
+   \param[in]  pos        positions of the interpolation points along
+                          `interp_dim`.
+   \param[in]  interp_dim is the dimension to perform interpolation across.
+   \param[in]  idx_start  is the first index value along `interp_dim`.
+   \param[in]  idx_step   is the uniform spacing value between subsequent
+                          indices along `interp_dim`.
+   \param[in]  method     is the interpolation method to be used. The
+                          following types (defined in enum
+                          \ref af_interp_type) are supported: nearest
+                          neighbor, linear, and cubic.
+   \param[in]  off_grid   is the default value for any indices outside the
+                          valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \ingroup signal_func_approx1
+ */
+AFAPI af_err af_approx1_uniform(af_array *out, const af_array in,
+                                const af_array pos, const int interp_dim,
+                                const double idx_start, const double idx_step,
+                                const af_interp_type method,
+                                const float off_grid);
+
+/**
+   C Interface for the version of \ref af_approx1_uniform that accepts a
+   preallocated output array
+
+   \param[in,out] out        the interpolated array (can be preallocated).
+   \param[in]     in         is the multidimensional input array. Values lie on
+                             uniformly spaced indices determined by `idx_start`
+                             and `idx_step`.
+   \param[in]     pos        positions of the interpolation points along
+                             `interp_dim`.
+   \param[in]     interp_dim is the dimension to perform interpolation across.
+   \param[in]     idx_start  is the first index value along `interp_dim`.
+   \param[in]     idx_step   is the uniform spacing value between subsequent
+                             indices along `interp_dim`.
+   \param[in]     method     is the interpolation method to be used. The
+                             following types (defined in enum
+                             \ref af_interp_type) are supported: nearest
+                             neighbor, linear, and cubic.
+   \param[in]     off_grid   is the default value for any indices outside the
+                             valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \note \p out can be either null or an existing `af_array` object. If it is a
+         sub-array of an existing `af_array`, only the corresponding portion of
+         the `af_array` will be overwritten.
+   \note Passing an `af_array` to \p out that has not been initialized will
+         cause undefined behavior.
+
+   \ingroup signal_func_approx1
+ */
+AFAPI af_err af_approx1_uniform_v2(af_array *out, const af_array in,
+                                   const af_array pos, const int interp_dim,
+                                   const double idx_start,
+                                   const double idx_step,
+                                   const af_interp_type method,
+                                   const float off_grid);
+
+/**
+   C Interface for interpolation on two dimensional signals along
+   specified dimensions.
+
+   af_approx2_uniform() accepts the two dimensions of the input along which to
+   perform the interpolation. It also accepts start and step values which
+   define the uniform range of corresponding indices.
+
+   \param[out] out            the interpolated array.
+   \param[in]  in             is the multidimensional input array.
+   \param[in]  pos0           positions of the interpolation points along
+                              `interp_dim0`.
+   \param[in]  interp_dim0    is the first dimension to perform interpolation
+                              across.
+   \param[in]  idx_start_dim0 is the first index value along `interp_dim0`.
+   \param[in]  idx_step_dim0  is the uniform spacing value between subsequent
+                              indices along `interp_dim0`.
+   \param[in]  pos1           positions of the interpolation points along
+                              `interp_dim1`.
+   \param[in]  interp_dim1    is the second dimension to perform interpolation
+                              across.
+   \param[in]  idx_start_dim1 is the first index value along `interp_dim1`.
+   \param[in]  idx_step_dim1  is the uniform spacing value between subsequent
+                              indices along `interp_dim1`.
+   \param[in]  method         is the interpolation method to be used. All
+                              interpolation types defined in \ref af_interp_type
+                              are supported.
+   \param[in]  off_grid       is the default value for any indices outside the
+                              valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \ingroup signal_func_approx2
+ */
+AFAPI af_err af_approx2_uniform(af_array *out, const af_array in,
+                                const af_array pos0, const int interp_dim0,
+                                const double idx_start_dim0,
+                                const double idx_step_dim0,
+                                const af_array pos1, const int interp_dim1,
+                                const double idx_start_dim1,
+                                const double idx_step_dim1,
+                                const af_interp_type method,
+                                const float off_grid);
+
+/**
+   C Interface for the version of \ref af_approx2_uniform that accepts a
+   preallocated output array
+
+   \param[in,out] out            the interpolated array (can be preallocated).
+   \param[in]     in             is the multidimensional input array.
+   \param[in]     pos0           positions of the interpolation points along
+                                 `interp_dim0`.
+   \param[in]     interp_dim0    is the first dimension to perform interpolation
+                                 across.
+   \param[in]     idx_start_dim0 is the first index value along `interp_dim0`.
+   \param[in]     idx_step_dim0  is the uniform spacing value between subsequent
+                                 indices along `interp_dim0`.
+   \param[in]     pos1           positions of the interpolation points along
+                                 `interp_dim1`.
+   \param[in]     interp_dim1    is the second dimension to perform
+                                 interpolation across.
+   \param[in]     idx_start_dim1 is the first index value along `interp_dim1`.
+   \param[in]     idx_step_dim1  is the uniform spacing value between subsequent
+                                 indices along `interp_dim1`.
+   \param[in]     method         is the interpolation method to be used. All
+                                 interpolation types defined in
+                                 \ref af_interp_type are supported.
+   \param[in]     off_grid       is the default value for any indices outside
+                                 the valid range of indices.
+
+   \return \ref AF_SUCCESS if the interpolation operation is successful,
+           otherwise an appropriate error code is returned.
+
+   \note \p out can be either null or an existing `af_array` object. If it is a
+         sub-array of an existing `af_array`, only the corresponding portion of
+         the `af_array` will be overwritten.
+   \note Passing an `af_array` to \p out that has not been initialized will
+         cause undefined behavior.
 
    \ingroup signal_func_approx2
  */
-AFAPI af_err af_approx2(af_array *out, const af_array in, const af_array pos0, const af_array pos1,
-                        const af_interp_type method, const float offGrid);
+AFAPI af_err af_approx2_uniform_v2(af_array *out, const af_array in,
+                                   const af_array pos0, const int interp_dim0,
+                                   const double idx_start_dim0,
+                                   const double idx_step_dim0,
+                                   const af_array pos1, const int interp_dim1,
+                                   const double idx_start_dim1,
+                                   const double idx_step_dim1,
+                                   const af_interp_type method,
+                                   const float off_grid);
+#endif
 
 /**
-   C Interface for fast fourier transform on one dimensional data
+   C Interface for fast fourier transform on one dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -517,14 +1089,30 @@ AFAPI af_err af_approx2(af_array *out, const af_array in, const af_array pos0, c
  */
 AFAPI af_err af_fft(af_array *out, const af_array in, const double norm_factor, const dim_t odim0);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for fast fourier transform on one dimensional signals
+
+   \param[in,out]  in is the input array on entry and the output of 1D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_fft
+*/
+AFAPI af_err af_fft_inplace(af_array in, const double norm_factor);
+#endif
+
 /**
-   C Interface for fast fourier transform on two dimensional data
+   C Interface for fast fourier transform on two dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -532,15 +1120,31 @@ AFAPI af_err af_fft(af_array *out, const af_array in, const double norm_factor,
  */
 AFAPI af_err af_fft2(af_array *out, const af_array in, const double norm_factor, const dim_t odim0, const dim_t odim1);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for fast fourier transform on two dimensional signals
+
+   \param[in,out]  in is the input array on entry and the output of 2D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_fft2
+ */
+AFAPI af_err af_fft2_inplace(af_array in, const double norm_factor);
+#endif
+
 /**
-   C Interface for fast fourier transform on three dimensional data
+   C Interface for fast fourier transform on three dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -548,13 +1152,29 @@ AFAPI af_err af_fft2(af_array *out, const af_array in, const double norm_factor,
  */
 AFAPI af_err af_fft3(af_array *out, const af_array in, const double norm_factor, const dim_t odim0, const dim_t odim1, const dim_t odim2);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for fast fourier transform on three dimensional signals
+
+   \param[in,out]  in is the input array on entry and the output of 3D forward fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_fft3
+ */
+AFAPI af_err af_fft3_inplace(af_array in, const double norm_factor);
+#endif
+
 /**
-   C Interface for inverse fast fourier transform on one dimensional data
+   C Interface for inverse fast fourier transform on one dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data - used to either truncate or pad the input data
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals - used to either truncate or pad the input signals
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -562,14 +1182,30 @@ AFAPI af_err af_fft3(af_array *out, const af_array in, const double norm_factor,
  */
 AFAPI af_err af_ifft(af_array *out, const af_array in, const double norm_factor, const dim_t odim0);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for inverse fast fourier transform on one dimensional signals
+
+   \param[in,out]  in is the input array on entry and the output of 1D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the ifft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_ifft
+*/
+AFAPI af_err af_ifft_inplace(af_array in, const double norm_factor);
+#endif
+
 /**
-   C Interface for inverse fast fourier transform on two dimensional data
+   C Interface for inverse fast fourier transform on two dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -577,15 +1213,31 @@ AFAPI af_err af_ifft(af_array *out, const af_array in, const double norm_factor,
  */
 AFAPI af_err af_ifft2(af_array *out, const af_array in, const double norm_factor, const dim_t odim0, const dim_t odim1);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for inverse fast fourier transform on two dimensional signals
+
+   \param[in,out]  in is the input array on entry and the output of 2D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the ifft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_ifft2
+*/
+AFAPI af_err af_ifft2_inplace(af_array in, const double norm_factor);
+#endif
+
 /**
-   C Interface for inverse fast fourier transform on three dimensional data
+   C Interface for inverse fast fourier transform on three dimensional signals
 
    \param[out] out is the transformed array
    \param[in]  in is the input array
-   \param[in]  norm_factor is the normalization factor with which the input is scaled before the transformation is applied
-   \param[in]  odim0 is the length of output data along 0th dimension - used to either truncate/pad the input
-   \param[in]  odim1 is the length of output data along 1st dimension - used to either truncate/pad the input
-   \param[in]  odim2 is the length of output data along 2nd dimension - used to either truncate/pad the input
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  odim0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  odim1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  odim2 is the length of output signals along third dimension - used to either truncate/pad the input
    \return     \ref AF_SUCCESS if the fft transform is successful,
                otherwise an appropriate error code is returned.
 
@@ -593,119 +1245,282 @@ AFAPI af_err af_ifft2(af_array *out, const af_array in, const double norm_factor
  */
 AFAPI af_err af_ifft3(af_array *out, const af_array in, const double norm_factor, const dim_t odim0, const dim_t odim1, const dim_t odim2);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface for in-place inverse fast fourier transform on three dimensional signals
+
+   \param[inout]  in is the input array on entry and the output of 3D inverse fourier transform on exit
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \return     \ref AF_SUCCESS if the ifft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The input \p in must be a complex array
+
+   \ingroup signal_func_ifft3
+*/
+AFAPI af_err af_ifft3_inplace(af_array in, const double norm_factor);
+#endif
+
+#if AF_API_VERSION >= 31
 /**
-   C Interface for convolution on one dimensional data
+   C Interface for real to complex fast fourier transform for one dimensional signals
+
+   \param[out] out is a complex array containing the non redundant parts of \p in.
+   \param[in]  in is a real array
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  pad0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be of size (pad0 / 2) + 1. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_r2c
+*/
+AFAPI af_err af_fft_r2c (af_array *out, const af_array in, const double norm_factor, const dim_t pad0);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C Interface for real to complex fast fourier transform for two dimensional signals
+
+   \param[out] out is a complex array containing the non redundant parts of \p in.
+   \param[in]  in is a real array
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  pad0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  pad1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be of size (pad0 / 2) + 1. The second dimension of the output will be pad1. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_r2c
+*/
+AFAPI af_err af_fft2_r2c(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C Interface for real to complex fast fourier transform for three dimensional signals
+
+   \param[out] out is a complex array containing the non redundant parts of \p in.
+   \param[in]  in is a real array
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  pad0 is the length of output signals along first dimension - used to either truncate/pad the input
+   \param[in]  pad1 is the length of output signals along second dimension - used to either truncate/pad the input
+   \param[in]  pad2 is the length of output signals along third dimension - used to either truncate/pad the input
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be of size (pad0 / 2) + 1. The second dimension of the output will be pad1. The third dimension of the output will be pad2.
+
+   \ingroup signal_func_fft_r2c
+*/
+AFAPI af_err af_fft3_r2c(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1, const dim_t pad2);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C Interface for complex to real fast fourier transform for one dimensional signals
+
+   \param[out] out is a real array containing the output of the transform.
+   \param[in]  in is a complex array containing only the non redundant parts of the signals.
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  is_odd is a flag signifying if the output should be even or odd size
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be 2 * dim0 - 1 if is_odd is true else 2 * dim0 - 2 where dim0 is the first dimension of the input. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_c2r
+*/
+
+AFAPI af_err af_fft_c2r (af_array *out, const af_array in, const double norm_factor, const bool is_odd);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C Interface for complex to real fast fourier transform for two dimensional signals
+
+   \param[out] out is a real array containing the output of the transform.
+   \param[in]  in is a complex array containing only the non redundant parts of the signals.
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  is_odd is a flag signifying if the output should be even or odd size
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be 2 * dim0 - 1 if is_odd is true else 2 * dim0 - 2 where dim0 is the first dimension of the input. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_c2r
+*/
+AFAPI af_err af_fft2_c2r(af_array *out, const af_array in, const double norm_factor, const bool is_odd);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C Interface for complex to real fast fourier transform for three dimensional signals
+
+   \param[out] out is a real array containing the output of the transform.
+   \param[in]  in is a complex array containing only the non redundant parts of the signals.
+   \param[in]  norm_factor is the normalization factor with which the input is scaled after the transformation is applied
+   \param[in]  is_odd is a flag signifying if the output should be even or odd size
+   \return     \ref AF_SUCCESS if the fft transform is successful,
+               otherwise an appropriate error code is returned.
+
+   \note The first dimension of the output will be 2 * dim0 - 1 if is_odd is true else 2 * dim0 - 2 where dim0 is the first dimension of the input. The remaining dimensions are unchanged.
+
+   \ingroup signal_func_fft_c2r
+*/
+AFAPI af_err af_fft3_c2r(af_array *out, const af_array in, const double norm_factor, const bool is_odd);
+#endif
+
+/**
+   C Interface for convolution on one dimensional signals
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency os spatial domain
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve1
  */
 AFAPI af_err af_convolve1(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain);
 
 /**
-   C Interface for convolution on two dimensional data
+   C Interface for convolution on two dimensional signals
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency os spatial domain
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve2
  */
 AFAPI af_err af_convolve2(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain);
 
 /**
-   C Interface for convolution on three dimensional data
+   C Interface for 2D convolution
+
+   This version of convolution follows the machine learning formulation:
+   a filter is spatially convolved against a signal along two dimensions.
+   Multiple signals and filters can be batched against each other.
+   Furthermore, the signals and filters can be multi-dimensional; however,
+   their dimensions must match.
+
+   Example:
+   Signals with dimensions: d0 x d1 x d2 x Ns
+   Filters with dimensions: d0 x d1 x d2 x Nf
+
+   Resulting Convolution: d0 x d1 x Nf x Ns
+
+   \param[out] out is convolved array
+   \param[in]  signal is the input signal
+   \param[in]  filter is the filter that will be used for the convolution operation
+   \param[in]  stride_dims specifies the number of stride dimension parameters
+   \param[in]  strides array of values specifying the amounts the filter strides along each dimension
+   \param[in]  padding_dims specifies the number of padding dimension parameters
+   \param[in]  paddings array of values specifying the amounts to pad along each dimension
+   \param[in]  dilation_dims specifies the number of dilation dimension parameters
+   \param[in]  dilations array of values specifying the amounts to dilate the filter
+               before convolving along each dimension
+   \return     \ref AF_SUCCESS if the convolution is successful,
+               otherwise an appropriate error code is returned.
+
+   \ingroup signal_func_convolve2
+ */
+AFAPI af_err af_convolve2_nn(af_array *out, const af_array signal, const af_array filter,
+                             const unsigned stride_dims,   const dim_t *strides,
+                             const unsigned padding_dims,  const dim_t *paddings,
+                             const unsigned dilation_dims, const dim_t *dilations);
+
+/**
+   C Interface for convolution on three dimensional signals
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be flipped for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \param[in]  domain specifies if the convolution should be performed in frequency os spatial domain
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \note The default paramter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
+   \note The default parameter of \p domain, \ref AF_CONV_AUTO, heuristically switches between frequency and spatial domain.
 
    \ingroup signal_func_convolve3
  */
 AFAPI af_err af_convolve3(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain);
 
 /**
-   C Interface for separable convolution on two dimensional data
+   C Interface for separable convolution on two dimensional signals
 
    \param[out] out is convolved array
    \param[in]  col_filter is filter that has to be applied along the coloumns
    \param[in]  row_filter is filter that has to be applied along the rows
    \param[in]  signal is the input array
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
    \note Separable convolution only supports two(ONE-to-ONE and MANY-to-ONE) batch modes from the ones described
          in the detailed description section.
 
-   \ingroup signal_func_convolve
+   \ingroup signal_func_convolve_sep
  */
 AFAPI af_err af_convolve2_sep(af_array *out, const af_array col_filter, const af_array row_filter, const af_array signal, const af_conv_mode mode);
 
 /**
-   C Interface for FFT-based convolution on one dimensional data
+   C Interface for convolution on 1D signals using FFT
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \ingroup signal_func_fft_convolve1
+   \ingroup signal_func_convolve1
  */
 AFAPI af_err af_fft_convolve1(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode);
 
 /**
-   C Interface for FFT-based convolution on two dimensional data
+   C Interface for convolution on 2D signals using FFT
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \ingroup signal_func_fft_convolve2
+   \ingroup signal_func_convolve2
  */
 AFAPI af_err af_fft_convolve2(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode);
 
 /**
-   C Interface for FFT-based convolution on three dimensional data
+   C Interface for convolution on 3D signals using FFT
 
    \param[out] out is convolved array
    \param[in]  signal is the input signal
    \param[in]  filter is the signal that shall be used for the convolution operation
-   \param[in]  mode indicates if the convolution should be expanded or not(where output size equals input).
+   \param[in]  mode indicates if the convolution should be expanded or not (where output size equals input)
    \return     \ref AF_SUCCESS if the convolution is successful,
                otherwise an appropriate error code is returned.
 
-   \ingroup signal_func_fft_convolve3
+   \ingroup signal_func_convolve3
  */
 AFAPI af_err af_fft_convolve3(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode);
 
 /**
-   C++ Interface for finite impulse response  filter
+   C Interface for finite impulse response filter
 
    \param[out] y is the output signal from the filter
    \param[in] b is the array containing the coefficients of the filter
@@ -716,7 +1531,7 @@ AFAPI af_err af_fft_convolve3(af_array *out, const af_array signal, const af_arr
 AFAPI af_err af_fir(af_array *y, const af_array b, const af_array x);
 
 /**
-   C++ Interface for infinite impulse response filter
+   C Interface for infinite impulse response filter
 
    \param[out] y is the output signal from the filter
    \param[in] b is the array containing the feedforward coefficients
@@ -728,6 +1543,73 @@ AFAPI af_err af_fir(af_array *y, const af_array b, const af_array x);
    \ingroup signal_func_iir
 */
 AFAPI af_err af_iir(af_array *y, const af_array b, const af_array a, const af_array x);
+
+    /**
+        C Interface for median filter
+
+        \param[out] out array is the processed image
+        \param[in]  in array is the input image
+        \param[in]  wind_length is the kernel height
+        \param[in]  wind_width is the kernel width
+        \param[in]  edge_pad value will decide what happens to border when running
+                    filter in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+        \return     \ref AF_SUCCESS if the median filter is applied successfully,
+        otherwise an appropriate error code is returned.
+
+        \ingroup image_func_medfilt
+    */
+    AFAPI af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length, const dim_t wind_width, const af_border_type edge_pad);
+
+#if AF_API_VERSION >= 34
+    /**
+        C Interface for 1D median filter
+
+        \param[out] out array is the processed signal
+        \param[in]  in array is the input signal
+        \param[in]  wind_width is the kernel width
+        \param[in]  edge_pad value will decide what happens to border when running
+                    filter in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+        \return     \ref AF_SUCCESS if the median filter is applied successfully,
+        otherwise an appropriate error code is returned.
+
+        \ingroup image_func_medfilt
+    */
+    AFAPI af_err af_medfilt1(af_array *out, const af_array in, const dim_t wind_width, const af_border_type edge_pad);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+        C Interface for 2D median filter
+
+        \param[out] out array is the processed image
+        \param[in]  in array is the input image
+        \param[in]  wind_length is the kernel height
+        \param[in]  wind_width is the kernel width
+        \param[in]  edge_pad value will decide what happens to border when running
+                    filter in their neighborhood. It takes one of the values [\ref AF_PAD_ZERO | \ref AF_PAD_SYM]
+        \return     \ref AF_SUCCESS if the median filter is applied successfully,
+        otherwise an appropriate error code is returned.
+
+        \ingroup image_func_medfilt
+    */
+    AFAPI af_err af_medfilt2(af_array *out, const af_array in, const dim_t wind_length, const dim_t wind_width, const af_border_type edge_pad);
+#endif
+
+
+#if AF_API_VERSION >= 34
+/**
+   C Interface for setting plan cache size
+
+   This function has no effect when the CPU backend is active. The plans associated with
+   the most recently used array sizes are cached.
+
+   \param[in] cache_size is the number of plans that shall be cached
+
+   \ingroup signal_func_fft
+*/
+AFAPI af_err af_set_fft_plan_cache_size(size_t cache_size);
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/sparse.h b/include/af/sparse.h
new file mode 100644
index 0000000000..1bda32d8fb
--- /dev/null
+++ b/include/af/sparse.h
@@ -0,0 +1,367 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/defines.h>
+
+#ifdef __cplusplus
+namespace af
+{
+    class array;
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts \ref af::array of values, row indices and column
+       indices into a sparse array.
+
+       \note This function only creates references to these arrays in the
+             sparse data structure and does not do deep copies.
+
+       \param[in] nRows is the number of rows in the dense matrix
+       \param[in] nCols is the number of columns in the dense matrix
+       \param[in] values is the \ref af::array containing the non-zero elements
+                  of the matrix
+       \param[in] rowIdx is the row indices for the sparse array
+       \param[in] colIdx is the column indices for the sparse array
+       \param[in] stype is the storage format of the sparse array
+       \return \ref af::array for the sparse array
+
+       \snippet test/sparse.cpp ex_sparse_af_arrays
+
+       \ingroup sparse_func_create
+     */
+    AFAPI array sparse(const dim_t nRows, const dim_t nCols,
+                       const array values, const array rowIdx, const array colIdx,
+                       const af::storage stype = AF_STORAGE_CSR);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts host or device arrays of values, row indices and
+       column indices into a sparse array on the device.
+
+       \note The rules for deep copy/shallow copy/reference are the same as for
+             creating a regular \ref af::array.
+
+       \param[in] nRows is the number of rows in the dense matrix
+       \param[in] nCols is the number of columns in the dense matrix
+       \param[in] nNZ is the number of non zero elements in the dense matrix
+       \param[in] values is the host array containing the non-zero elements
+                  of the matrix
+       \param[in] rowIdx is the row indices for the sparse array
+       \param[in] colIdx is the column indices for the sparse array
+       \param[in] type is the data type for the matrix
+       \param[in] stype is the storage format of the sparse array
+       \param[in] src is \ref afHost if inputs are host arrays and \ref afDevice
+                  if the arrays are device arrays.
+       \return \ref af::array for the sparse array
+
+       \snippet test/sparse.cpp ex_sparse_host_arrays
+
+       \ingroup sparse_func_create
+     */
+    AFAPI array sparse(const dim_t nRows, const dim_t nCols, const dim_t nNZ,
+                       const void* const values,
+                       const int * const rowIdx, const int * const colIdx,
+                       const dtype type = f32, const af::storage stype = AF_STORAGE_CSR,
+                       const af::source src = afHost);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts a dense \ref af::array into a sparse array.
+
+       \param[in] dense is the source dense matrix
+       \param[in] stype is the storage format of the sparse array
+       \return \ref af::array for the sparse array with the given storage type
+
+       \snippet test/sparse.cpp ex_sparse_from_dense
+
+       \ingroup sparse_func_create
+     */
+    AFAPI array sparse(const array dense, const af::storage stype = AF_STORAGE_CSR);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the source sparse matrix to be converted
+       \param[in] destStorage is the storage format of the output sparse array
+       \return \ref af::array for the sparse array with the given storage type
+
+       \ingroup sparse_func_convert_to
+     */
+    AFAPI array sparseConvertTo(const array in, const af::storage destStorage);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] sparse is the source sparse matrix
+       \return dense \ref af::array from sparse
+
+       \snippet test/sparse.cpp ex_dense_from_sparse
+
+       \ingroup sparse_func_dense
+     */
+    AFAPI array dense(const array sparse);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] values stores the non-zero elements component of the sparse array
+       \param[out] rowIdx stores the row indices component of the sparse array
+       \param[out] colIdx stores the column indices component of the sparse array
+       \param[out] stype stores the storage type of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \ingroup sparse_func_info
+     */
+    AFAPI void sparseGetInfo(array &values, array &rowIdx, array &colIdx, af::storage &stype,
+                             const array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the input sparse matrix
+       \return \ref af::array for the non-zero elements component of the sparse array
+
+       \ingroup sparse_func_values
+     */
+    AFAPI array sparseGetValues(const array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the input sparse matrix
+       \return \ref af::array for the row indices component of the sparse array
+
+       \ingroup sparse_func_row_idx
+     */
+    AFAPI array sparseGetRowIdx(const array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the input sparse matrix
+       \return \ref af::array for the column indices component of the sparse array
+
+       \ingroup sparse_func_col_idx
+     */
+    AFAPI array sparseGetColIdx(const array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the input sparse matrix
+       \return the number of non-zero elements of the sparse array
+
+       \ingroup sparse_func_nnz
+     */
+    AFAPI dim_t sparseGetNNZ(const array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[in] in is the input sparse matrix
+       \return \ref af::storage for the storage type of the sparse array
+
+       \ingroup sparse_func_storage
+     */
+    AFAPI af::storage sparseGetStorage(const array in);
+#endif
+}
+#endif
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts \ref af_array of values, row indices and column
+       indices into a sparse array.
+
+       \note This function only creates references to these arrays in the
+             sparse data structure and does not do deep copies.
+
+       \param[out] out \ref af_array for the sparse array
+       \param[in] nRows is the number of rows in the dense matrix
+       \param[in] nCols is the number of columns in the dense matrix
+       \param[in] values is the \ref af_array containing the non-zero elements
+                  of the matrix
+       \param[in] rowIdx is the row indices for the sparse array
+       \param[in] colIdx is the column indices for the sparse array
+       \param[in] stype is the storage format of the sparse array
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_create
+     */
+    AFAPI af_err af_create_sparse_array(
+                 af_array *out,
+                 const dim_t nRows, const dim_t nCols,
+                 const af_array values, const af_array rowIdx, const af_array colIdx,
+                 const af_storage stype);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts host or device arrays of values, row indices and
+       column indices into a sparse array on the device.
+
+       \note The rules for deep copy/shallow copy/reference are the same as for
+             creating a regular \ref af_array.
+
+       \param[out] out \ref af_array for the sparse array
+       \param[in] nRows is the number of rows in the dense matrix
+       \param[in] nCols is the number of columns in the dense matrix
+       \param[in] nNZ is the number of non zero elements in the dense matrix
+       \param[in] values is the host array containing the non-zero elements
+                  of the matrix
+       \param[in] rowIdx is the row indices for the sparse array
+       \param[in] colIdx is the column indices for the sparse array
+       \param[in] type is the data type for the matrix
+       \param[in] stype is the storage format of the sparse array
+       \param[in] src is \ref afHost if inputs are host arrays and \ref afDevice
+                  if the arrays are device arrays.
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_create
+     */
+    AFAPI af_err af_create_sparse_array_from_ptr(
+                 af_array *out,
+                 const dim_t nRows, const dim_t nCols, const dim_t nNZ,
+                 const void * const values,
+                 const int * const rowIdx, const int * const colIdx,
+                 const af_dtype type, const af_storage stype,
+                 const af_source src);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       This function converts a dense \ref af_array into a sparse array.
+
+       \param[out] out \ref af_array for the sparse array with the given storage type
+       \param[in] dense is the source dense matrix
+       \param[in] stype is the storage format of the sparse array
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_create
+     */
+    AFAPI af_err af_create_sparse_array_from_dense(
+                 af_array *out, const af_array dense,
+                 const af_storage stype);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out \ref af_array for the sparse array with the given storage type
+       \param[in] in is the source sparse matrix to be converted
+       \param[in] destStorage is the storage format of the output sparse array
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_convert_to
+     */
+    AFAPI af_err af_sparse_convert_to(af_array *out, const af_array in,
+                                      const af_storage destStorage);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out dense \ref af_array from sparse
+       \param[in] sparse is the source sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_dense
+     */
+    AFAPI af_err af_sparse_to_dense(af_array *out, const af_array sparse);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] values stores the non-zero elements component of the sparse array
+       \param[out] rowIdx stores the row indices component of the sparse array
+       \param[out] colIdx stores the column indices component of the sparse array
+       \param[out] stype stores the storage type of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_info
+     */
+    AFAPI af_err af_sparse_get_info(af_array *values, af_array *rowIdx, af_array *colIdx, af_storage *stype,
+                                    const af_array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out \ref af_array for the non-zero elements component of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_values
+     */
+    AFAPI af_err af_sparse_get_values(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out \ref af_array for the row indices component of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_row_idx
+     */
+    AFAPI af_err af_sparse_get_row_idx(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out \ref af_array for the column indices component of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_col_idx
+     */
+    AFAPI af_err af_sparse_get_col_idx(af_array *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out the number of non-zero elements of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_nnz
+     */
+    AFAPI af_err af_sparse_get_nnz(dim_t *out, const af_array in);
+#endif
+
+#if AF_API_VERSION >= 34
+    /**
+       \param[out] out contains \ref af_storage for the storage type of the sparse array
+       \param[in] in is the input sparse matrix
+
+       \return \ref AF_SUCCESS if the execution completes properly
+
+       \ingroup sparse_func_storage
+     */
+    AFAPI af_err af_sparse_get_storage(af_storage *out, const af_array in);
+#endif
+
+#ifdef __cplusplus
+}
+#endif
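Note on the sparse accessors above: af_sparse_get_info returns three components (values, rowIdx, colIdx) plus a storage type. The sketch below is a plain host-side illustration of what those three components hold under CSR storage. It does not use ArrayFire; `Csr` and `denseToCsr` are hypothetical names, and the element ordering of the real library is not confirmed by this patch.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side container mirroring the three components that
// the sparse accessors expose. Not an ArrayFire type.
struct Csr {
    std::vector<double> values;  // non-zero elements, row by row
    std::vector<int> rowIdx;     // rows + 1 offsets into values/colIdx
    std::vector<int> colIdx;     // column index of each non-zero element
};

// Build the CSR components from a dense row-major matrix.
inline Csr denseToCsr(const std::vector<std::vector<double> >& dense) {
    Csr out;
    out.rowIdx.push_back(0);
    for (std::size_t r = 0; r < dense.size(); ++r) {
        const std::vector<double>& row = dense[r];
        for (std::size_t c = 0; c < row.size(); ++c) {
            if (row[c] != 0.0) {
                out.values.push_back(row[c]);
                out.colIdx.push_back(static_cast<int>(c));
            }
        }
        // After each row, record how many non-zeros have been seen so far.
        out.rowIdx.push_back(static_cast<int>(out.values.size()));
    }
    return out;
}
```

In this layout the number of non-zero elements is values.size(), which is the quantity af_sparse_get_nnz reports for a sparse array.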
diff --git a/include/af/statistics.h b/include/af/statistics.h
index 5eaf56f1e5..86851a3a7b 100644
--- a/include/af/statistics.h
+++ b/include/af/statistics.h
@@ -19,12 +19,12 @@ class array;
    C++ Interface for mean
 
    \param[in] in is the input array
-   \param[in] dim The dimension along which the mean is extracted
+   \param[in] dim the dimension along which the mean is extracted
    \return    the mean of the input array along dimension \p dim
 
    \ingroup stat_func_mean
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
 */
 AFAPI array mean(const array& in, const dim_t dim=-1);
 
@@ -33,12 +33,12 @@ AFAPI array mean(const array& in, const dim_t dim=-1);
 
    \param[in] in is the input array
    \param[in] weights is used to scale input \p in before getting mean
-   \param[in] dim The dimension along which the mean is extracted
+   \param[in] dim the dimension along which the mean is extracted
    \return    the mean of the weighted input array along dimension \p dim
 
    \ingroup stat_func_mean
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
 */
 AFAPI array mean(const array& in, const array& weights, const dim_t dim=-1);
 
@@ -46,66 +46,143 @@ AFAPI array mean(const array& in, const array& weights, const dim_t dim=-1);
    C++ Interface for variance
 
    \param[in] in is the input array
-   \param[in] isbiased is boolean denoting Population variance (false) or Sample Variance (true)
-   \param[in] dim The dimension along which the variance is extracted
-   \return    the variance of the input array along dimension \p dim
+   \param[in] isbiased is boolean denoting Population variance (false) or Sample
+              Variance (true)
+   \param[in] dim the dimension along which the variance is extracted
+   \return the variance of the input array along dimension \p dim
 
    \ingroup stat_func_var
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
+
+   \deprecated Use \ref af::var that takes \ref af_var_bias instead
 */
+AF_DEPRECATED("Use af::var(const array&, const af_var_bias, const dim_t)")
 AFAPI array var(const array& in, const bool isbiased=false, const dim_t dim=-1);
 
+#if AF_API_VERSION >= 38
+/**
+   C++ Interface for variance
+
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias.
+   \param[in] dim the dimension along which the variance is extracted
+   \return the variance of the input array along dimension \p dim
+
+   \ingroup stat_func_var
+
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
+*/
+AFAPI array var(const array &in, const af_var_bias bias, const dim_t dim = -1);
+#endif
+
 /**
    C++ Interface for variance of weighted inputs
 
    \param[in] in is the input array
    \param[in] weights is used to scale input \p in before getting variance
-   \param[in] dim The dimension along which the variance is extracted
+   \param[in] dim the dimension along which the variance is extracted
    \return    the variance of the weighted input array along dimension \p dim
 
    \ingroup stat_func_var
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
 */
 AFAPI array var(const array& in, const array &weights, const dim_t dim=-1);
 
+#if AF_API_VERSION >= 37
+/**
+   C++ Interface for mean and variance
+
+   \param[out] mean     The mean of the input array along \p dim dimension
+   \param[out] var      The variance of the input array along the \p dim dimension
+   \param[in]  in       The input array
+   \param[in]  weights  The weights to scale the input array before calculating
+                        the mean and variance. If empty, the input is not scaled
+   \param[in] bias      The type of bias used for variance calculation
+   \param[in] dim       The dimension along which the variance and mean are
+                        calculated. Default is -1, the first non-singleton dim
+  */
+AFAPI void meanvar(array& mean, array& var, const array& in, const array& weights,
+                   const af_var_bias bias = AF_VARIANCE_POPULATION, const dim_t dim=-1);
+#endif
+
 /**
    C++ Interface for standard deviation
 
    \param[in] in is the input array
-   \param[in] dim The dimension along which the standard deviation is extracted
+   \param[in] dim the dimension along which the standard deviation is extracted
    \return    the standard deviation of the input array along dimension \p dim
 
    \ingroup stat_func_stdev
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
+
+   \deprecated Use \ref af::stdev that takes \ref af_var_bias instead
 */
+AF_DEPRECATED("Use af::stdev(const array&, const af_var_bias, const dim_t)")
 AFAPI array stdev(const array& in, const dim_t dim=-1);
 
+#if AF_API_VERSION >= 38
+/**
+   C++ Interface for standard deviation
+
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias.
+   \param[in] dim the dimension along which the standard deviation is extracted
+   \return    the standard deviation of the input array along dimension \p dim
+
+   \ingroup stat_func_stdev
+
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
+*/
+AFAPI array stdev(const array &in, const af_var_bias bias,
+                  const dim_t dim = -1);
+#endif
 
 /**
    C++ Interface for covariance
 
    \param[in] X is the first input array
    \param[in] Y is the second input array
-   \param[in] isbiased is boolean specifying if biased estiamte should be taken (default: false)
+   \param[in] isbiased is boolean specifying if biased estimate should be
+              taken (default: false)
    \return    the covariance of the input arrays
 
    \ingroup stat_func_cov
+
+   \deprecated Use af::cov(const array&, const array&, const af_var_bias)
 */
+AF_DEPRECATED("Use af::cov(const af::array&, const array&, const af_var_bias)")
 AFAPI array cov(const array& X, const array& Y, const bool isbiased=false);
 
+#if AF_API_VERSION >= 38
+/**
+   C++ Interface for covariance
+
+   \param[in] X is the first input array
+   \param[in] Y is the second input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias.
+   \return the covariance of the input arrays
+
+   \ingroup stat_func_cov
+*/
+AFAPI array cov(const array &X, const array &Y, const af_var_bias bias);
+#endif
+
 /**
    C++ Interface for median
 
    \param[in] in is the input array
-   \param[in] dim The dimension along which the median is extracted
+   \param[in] dim the dimension along which the median is extracted
    \return    the median of the input array along dimension \p dim
 
    \ingroup stat_func_median
 
-   \note \p dim is -1 by default. -1 denotes the first non-signleton dimension.
+   \note \p dim is -1 by default. -1 denotes the first non-singleton dimension.
 */
 AFAPI array median(const array& in, const dim_t dim=-1);
 
@@ -136,13 +213,31 @@ AFAPI T mean(const array& in, const array& weights);
    C++ Interface for variance of all elements
 
    \param[in] in is the input array
-   \param[in] isbiased is boolean denoting Population variance (false) or Sample Variance (true)
-   \return    variance of the entire input array
+   \param[in] isbiased is boolean denoting Population variance (false) or Sample
+              Variance (true)
+   \return variance of the entire input array
 
    \ingroup stat_func_var
+
+   \deprecated Use \ref af::var that takes \ref af_var_bias instead
 */
-template<typename T>
-AFAPI T var(const array& in, const bool isbiased=false);
+template <typename T>
+AF_DEPRECATED("Use af::var(const af::array&, const af_var_bias)")
+AFAPI T var(const array &in, const bool isbiased = false);
+
+#if AF_API_VERSION >= 38
+/**
+   C++ Interface for variance of all elements
+
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias.
+   \return variance of the \p in array
+
+   \ingroup stat_func_var
+*/
+template <typename T> AFAPI T var(const array &in, const af_var_bias bias);
+#endif
 
 /**
    C++ Interface for variance of all elements in weighted input
@@ -163,9 +258,26 @@ AFAPI T var(const array& in, const array& weights);
    \return    standard deviation of the entire input array
 
    \ingroup stat_func_stdev
+
+   \deprecated Use \ref af::stdev that takes \ref af_var_bias instead
 */
-template<typename T>
-AFAPI T stdev(const array& in);
+template <typename T>
+AF_DEPRECATED("Use af::stdev(const array&, const af_var_bias)")
+AFAPI T stdev(const array &in);
+
+#if AF_API_VERSION >= 38
+/**
+   C++ Interface for standard deviation of all elements
+
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias.
+   \return    standard deviation of the entire input array
+
+   \ingroup stat_func_stdev
+*/
+template <typename T> AFAPI T stdev(const array &in, const af_var_bias bias);
+#endif
 
 /**
    C++ Interface for median of all elements
@@ -192,6 +304,29 @@ AFAPI T median(const array& in);
 template<typename T>
 AFAPI T corrcoef(const array& X, const array& Y);
 
+#if AF_API_VERSION >= 36
+/**
+   C++ Interface for finding top k elements along a given dimension
+
+   \param[out] values  The values of the top k elements along the \p dim dimension
+   \param[out] indices The indices of the top k elements along the \p dim dimension
+   \param[in]  in      Input \ref af::array with at least \p k elements along
+                       \p dim
+   \param[in]  k       The number of elements to be retrieved along the \p dim dimension
+   \param[in]  dim     The dimension along which top k elements are extracted.
+                       (Must be 0)
+   \param[in]  order   If Descending the highest values are returned. Otherwise
+                       the lowest values are returned
+
+   \note{This function is optimized for small values of k.}
+   \note{The order of the returned keys may not be in the same order as they
+   appear in the input array. For a stable topk, set the AF_TOPK_STABLE flag
+   in the order param. These are equivalent to AF_TOPK_STABLE_MAX and AF_TOPK_STABLE_MIN}
+   \ingroup stat_func_topk
+*/
+AFAPI void topk(array &values, array &indices, const array& in, const int k,
+                const int dim = -1, const topkFunction order = AF_TOPK_MAX);
+#endif
 }
 #endif
 
@@ -204,8 +339,8 @@ extern "C" {
 
    \param[out] out will contain the mean of the input array along dimension \p dim
    \param[in] in is the input array
-   \param[in] dim The dimension along which the mean is extracte
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the mean is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_mean
@@ -218,8 +353,8 @@ AFAPI af_err af_mean(af_array *out, const af_array in, const dim_t dim);
    \param[out] out will contain the mean of the input array along dimension \p dim
    \param[in] in is the input array
    \param[in] weights is used to scale input \p in before getting mean
-   \param[in] dim The dimension along which the mean is extracted
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the mean is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_mean
@@ -232,23 +367,45 @@ AFAPI af_err af_mean_weighted(af_array *out, const af_array in, const af_array w
    \param[out] out will contain the variance of the input array along dimension \p dim
    \param[in] in is the input array
    \param[in] isbiased is boolean denoting Population variance (false) or Sample Variance (true)
-   \param[in] dim The dimension along which the variance is extracted
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the variance is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_var
 
+   \deprecated Use \ref af_var_v2 instead
 */
+AF_DEPRECATED("Use af_var_v2")
 AFAPI af_err af_var(af_array *out, const af_array in, const bool isbiased, const dim_t dim);
 
+#if AF_API_VERSION >= 38
+/**
+   C Interface for variance
+
+   \param[out] out will contain the variance of the input array along dimension
+               \p dim
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias
+   \param[in] dim the dimension along which the variance is extracted
+   \return \ref AF_SUCCESS if the operation is successful, otherwise an
+           appropriate error code is returned.
+
+   \ingroup stat_func_var
+
+*/
+AFAPI af_err af_var_v2(af_array *out, const af_array in, const af_var_bias bias,
+                       const dim_t dim);
+#endif
+
 /**
    C Interface for variance of weighted input array
 
    \param[out] out will contain the variance of the input array along dimension \p dim
    \param[in] in is the input array
    \param[in] weights is used to scale input \p in before getting variance
-   \param[in] dim The dimension along which the variance is extracted
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the variance is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_var
@@ -256,41 +413,101 @@ AFAPI af_err af_var(af_array *out, const af_array in, const bool isbiased, const
 */
 AFAPI af_err af_var_weighted(af_array *out, const af_array in, const af_array weights, const dim_t dim);
 
+#if AF_API_VERSION >= 37
+/**
+   C Interface for mean and variance
+
+   \param[out] mean     The mean of the input array along \p dim dimension
+   \param[out] var      The variance of the input array along the \p dim dimension
+   \param[in]  in       The input array
+   \param[in]  weights  The weights to scale the input array before calculating
+                        the mean and variance. If empty, the input is not scaled
+   \param[in]  bias     The type of bias used for variance calculation
+   \param[in]  dim      The dimension along which the variance and mean are
+                        calculated. Default is -1, the first non-singleton dim
+  */
+AFAPI af_err af_meanvar(af_array *mean, af_array *var, const af_array in,
+                        const af_array weights, const af_var_bias bias, const dim_t dim);
+#endif
+
 /**
    C Interface for standard deviation
 
    \param[out] out will contain the standard deviation of the input array along dimension \p dim
    \param[in] in is the input array
-   \param[in] dim The dimension along which the standard deviation is extracted
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the standard deviation is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_stdev
 
+   \deprecated Use \ref af_stdev_v2 instead
 */
+AF_DEPRECATED("Use af_stdev_v2")
 AFAPI af_err af_stdev(af_array *out, const af_array in, const dim_t dim);
 
+#if AF_API_VERSION >= 38
+/**
+   C Interface for standard deviation
+
+   \param[out] out will contain the standard deviation of the input array along
+               dimension \p dim
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias
+   \param[in] dim the dimension along which the standard deviation is extracted
+   \return \ref AF_SUCCESS if the operation is successful, otherwise an
+           appropriate error code is returned.
+
+   \ingroup stat_func_stdev
+
+*/
+AFAPI af_err af_stdev_v2(af_array *out, const af_array in,
+                         const af_var_bias bias, const dim_t dim);
+#endif
+
 /**
    C Interface for covariance
 
    \param[out] out will the covariance of the input arrays
    \param[in] X is the first input array
    \param[in] Y is the second input array
-   \param[in] isbiased is boolean specifying if biased estiamte should be taken (default: false)
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] isbiased is boolean specifying if biased estimate should be taken (default: false)
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_cov
+
+   \deprecated Use \ref af_cov_v2 instead
 */
+AF_DEPRECATED("Use af_cov_v2")
 AFAPI af_err af_cov(af_array* out, const af_array X, const af_array Y, const bool isbiased);
 
+#if AF_API_VERSION >= 38
+/**
+   C Interface for covariance
+
+   \param[out] out will contain the covariance of the input arrays
+   \param[in] X is the first input array
+   \param[in] Y is the second input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias
+   \return \ref AF_SUCCESS if the operation is successful, otherwise an
+           appropriate error code is returned.
+
+   \ingroup stat_func_cov
+*/
+AFAPI af_err af_cov_v2(af_array *out, const af_array X, const af_array Y,
+                       const af_var_bias bias);
+#endif
+
 /**
    C Interface for median
 
    \param[out] out will contain the median of the input array along dimension \p dim
    \param[in] in is the input array
-   \param[in] dim The dimension along which the median is extracted
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \param[in] dim the dimension along which the median is extracted
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_median
@@ -303,7 +520,7 @@ AFAPI af_err af_median(af_array* out, const af_array in, const dim_t dim);
    \param[out] real will contain the real part of mean of the entire input array
    \param[out] imag will contain the imaginary part of mean of the entire input array
    \param[in] in is the input array
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_mean
@@ -317,7 +534,7 @@ AFAPI af_err af_mean_all(double *real, double *imag, const af_array in);
    \param[out] imag will contain the imaginary part of mean of the entire weighted input array
    \param[in] in is the input array
    \param[in] weights  is used to scale input \p in before getting mean
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_mean
@@ -332,13 +549,36 @@ AFAPI af_err af_mean_all_weighted(double *real, double *imag, const af_array in,
    \param[out] imagVal will contain the imaginary part of variance of the entire input array
    \param[in] in is the input array
    \param[in] isbiased is boolean denoting Population variance (false) or Sample Variance (true)
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_var
+
+   \deprecated Use \ref af_var_all_v2 instead
 */
+AF_DEPRECATED("Use af_var_all_v2")
 AFAPI af_err af_var_all(double *realVal, double *imagVal, const af_array in, const bool isbiased);
 
+#if AF_API_VERSION >= 38
+/**
+   C Interface for variance of all elements
+
+   \param[out] realVal will contain the real part of variance of the entire
+               input array
+   \param[out] imagVal will contain the imaginary part of variance
+               of the entire input array
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias
+   \return \ref AF_SUCCESS if the operation is successful, otherwise an
+           appropriate error code is returned.
+
+   \ingroup stat_func_var
+*/
+AFAPI af_err af_var_all_v2(double *realVal, double *imagVal, const af_array in,
+                           const af_var_bias bias);
+#endif
+
 /**
    C Interface for variance of all elements in weighted input
 
@@ -346,7 +586,7 @@ AFAPI af_err af_var_all(double *realVal, double *imagVal, const af_array in, con
    \param[out] imagVal will contain the imaginary part of variance of the entire weighted input array
    \param[in] in is the input array
    \param[in] weights  is used to scale input \p in before getting variance
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_var
@@ -359,20 +599,43 @@ AFAPI af_err af_var_all_weighted(double *realVal, double *imagVal, const af_arra
    \param[out] real will contain the real part of standard deviation of the entire input array
    \param[out] imag will contain the imaginary part of standard deviation of the entire input array
    \param[in] in is the input array
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_stdev
+
+   \deprecated Use \ref af_stdev_all_v2 instead
 */
+AF_DEPRECATED("Use af_stdev_all_v2")
 AFAPI af_err af_stdev_all(double *real, double *imag, const af_array in);
 
+#if AF_API_VERSION >= 38
+/**
+   C Interface for standard deviation of all elements
+
+   \param[out] real will contain the real part of standard deviation of the
+               entire input array
+   \param[out] imag will contain the imaginary part of standard deviation
+               of the entire input array
+   \param[in] in is the input array
+   \param[in] bias The type of bias used for variance calculation. Takes a
+              value of type \ref af_var_bias
+   \return     \ref AF_SUCCESS if the operation is successful,
+   otherwise an appropriate error code is returned.
+
+   \ingroup stat_func_stdev
+*/
+AFAPI af_err af_stdev_all_v2(double *real, double *imag, const af_array in,
+                             const af_var_bias bias);
+#endif
+
 /**
    C Interface for median
 
    \param[out] realVal will contain the real part of median of the entire input array
    \param[out] imagVal will contain the imaginary part of median of the entire input array
    \param[in] in is the input array
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \ingroup stat_func_median
@@ -386,16 +649,39 @@ AFAPI af_err af_median_all(double *realVal, double *imagVal, const af_array in);
    \param[out] imagVal will contain the imaginary part of correlation coefficient of the inputs
    \param[in] X is the first input array
    \param[in] Y is the second input array
-   \return     \ref AF_SUCCESS if the color transformation is successful,
+   \return     \ref AF_SUCCESS if the operation is successful,
    otherwise an appropriate error code is returned.
 
    \note There are many ways correlation coefficient is calculated. This algorithm returns Pearson product-moment correlation coefficient.
 
    \ingroup stat_func_corrcoef
 */
-
 AFAPI af_err af_corrcoef(double *realVal, double *imagVal, const af_array X, const af_array Y);
 
+#if AF_API_VERSION >= 36
+/**
+   C Interface for finding top k elements along a given dimension
+
+   \param[out] values  The values of the top k elements along the \p dim dimension
+   \param[out] indices The indices of the top k elements along the \p dim dimension
+   \param[in]  in      Input \ref af::array with at least \p k elements along
+                       \p dim
+   \param[in]  k       The number of elements to be retrieved along the \p dim dimension
+   \param[in]  dim     The dimension along which top k elements are extracted.
+                       (Must be 0)
+   \param[in]  order   If Descending the highest values are returned. Otherwise
+                       the lowest values are returned
+
+   \note{This function is optimized for small values of k.}
+   \note{The order of the returned keys may not be in the same order as they
+   appear in the input array. For a stable topk, set the AF_TOPK_STABLE flag
+   in the order param. These are equivalent to AF_TOPK_STABLE_MAX and AF_TOPK_STABLE_MIN}
+   \ingroup stat_func_topk
+*/
+AFAPI af_err af_topk(af_array *values, af_array *indices, const af_array in,
+                     const int k, const int dim, const af_topk_function order);
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/af/traits.hpp b/include/af/traits.hpp
index bb75b8c472..4216c3f046 100644
--- a/include/af/traits.hpp
+++ b/include/af/traits.hpp
@@ -13,6 +13,8 @@
 
 #include <complex>
 #include <af/defines.h>
+#include <af/complex.h>
+#include <af/half.h>
 
 namespace af {
 
@@ -20,74 +22,171 @@ template<typename T> struct dtype_traits;
 
 template<>
 struct dtype_traits<float> {
-    enum { af_type = f32 };
+    enum {
+        af_type = f32,
+        ctype = f32
+    };
     typedef float base_type;
     static const char* getName() { return "float"; }
 };
 
 template<>
 struct dtype_traits<af::cfloat> {
-    enum { af_type = c32 };
+    enum {
+        af_type = c32 ,
+        ctype = c32
+    };
+    typedef float base_type;
+    static const char* getName() { return "std::complex<float>"; }
+};
+
+template<>
+struct dtype_traits<std::complex<float> > {
+    enum {
+        af_type = c32 ,
+        ctype = c32
+    };
     typedef float base_type;
     static const char* getName() { return "std::complex<float>"; }
 };
 
 template<>
 struct dtype_traits<double> {
-    enum { af_type = f64 };
+    enum {
+        af_type = f64 ,
+        ctype = f32
+    };
     typedef double base_type;
     static const char* getName() { return "double"; }
 };
 
 template<>
 struct dtype_traits<af::cdouble> {
-    enum { af_type = c64 };
+    enum {
+        af_type = c64 ,
+        ctype = c64
+    };
+    typedef double base_type;
+    static const char* getName() { return "std::complex<double>"; }
+};
+
+template<>
+struct dtype_traits<std::complex<double> > {
+    enum {
+        af_type = c64 ,
+        ctype = c64
+    };
     typedef double base_type;
     static const char* getName() { return "std::complex<double>"; }
 };
 
 template<>
 struct dtype_traits<char> {
-    enum { af_type = b8 };
+    enum {
+        af_type = b8 ,
+        ctype = f32
+    };
     typedef char base_type;
     static const char* getName() { return "char"; }
 };
 
 template<>
 struct dtype_traits<int> {
-    enum { af_type = s32 };
+    enum {
+        af_type = s32 ,
+        ctype = f32
+    };
     typedef int base_type;
     static const char* getName() { return "int"; }
 };
 
 template<>
 struct dtype_traits<unsigned> {
-    enum { af_type = u32 };
+    enum {
+        af_type = u32 ,
+        ctype = f32
+    };
     typedef unsigned base_type;
     static const char* getName() { return "uint"; }
 };
 
 template<>
 struct dtype_traits<unsigned char> {
-    enum { af_type = u8 };
+    enum {
+        af_type = u8 ,
+        ctype = f32
+    };
     typedef unsigned char base_type;
     static const char* getName() { return "uchar"; }
 };
 
 template<>
 struct dtype_traits<long long> {
-    enum { af_type = s64 };
+    enum {
+        af_type = s64 ,
+        ctype = s64
+    };
     typedef long long base_type;
     static const char* getName() { return "long"; }
 };
 
 template<>
 struct dtype_traits<unsigned long long> {
-    enum { af_type = u64 };
+    enum {
+        af_type = u64 ,
+        ctype = u64
+    };
     typedef unsigned long long base_type;
     static const char* getName() { return "ulong"; }
 };
 
+#if AF_API_VERSION >= 32
+template<>
+struct dtype_traits<short> {
+    enum {
+        af_type = s16 ,
+        ctype = s16
+    };
+    typedef short base_type;
+    static const char* getName() { return "short"; }
+};
+#endif
+
+#if AF_API_VERSION >= 32
+template<>
+struct dtype_traits<unsigned short> {
+    enum {
+        af_type = u16 ,
+        ctype = u16
+    };
+    typedef unsigned short base_type;
+    static const char* getName() { return "ushort"; }
+};
+#endif
+
+#if AF_API_VERSION >= 37
+template<>
+struct dtype_traits<half> {
+    enum {
+        af_type = f16 ,
+        ctype = f16
+    };
+    typedef half base_type;
+    static const char* getName() { return "half"; }
+};
+#endif
+
+#if AF_API_VERSION >= 310
+template<>
+struct dtype_traits<signed char> {
+    enum {
+        af_type = s8 ,
+        ctype = f32
+    };
+    typedef signed char base_type;
+    static const char* getName() { return "schar"; }
+};
+#endif
 }
 
 #endif
diff --git a/include/af/util.h b/include/af/util.h
index 7c86c7ce4f..49a16b43ec 100644
--- a/include/af/util.h
+++ b/include/af/util.h
@@ -15,13 +15,6 @@ namespace af
 {
     class array;
 
-    /**
-        \defgroup print_func_print print
-
-        \brief Print the array to screen
-
-        \ingroup arrayfire_func
-    */
     /**
         \param[in] exp is an expression, generally the name of the array
         \param[in] arr is the input array
@@ -30,216 +23,269 @@ namespace af
     */
     AFAPI void print(const char *exp, const array &arr);
 
-    // Purpose of Addition: "How to add Function" documentation
-    AFAPI array exampleFunction(const array& in, const af_someenum_t param);
-}
-
-#define af_print(exp) af::print(#exp, exp);
-
-#endif //__cplusplus
-
-#ifdef __cplusplus
-extern "C" {
-#endif
-
-    /**
-        \ingroup method_mat
-        @{
-    */
+#if AF_API_VERSION >= 31
     /**
-        \brief Gets the number of elements in an array.
-
-        \param[out] elems is the output that contains number of elements of \p arr
+        \param[in] exp is an expression, generally the name of the array
         \param[in] arr is the input array
+        \param[in] precision is the precision length for display
 
-        \returns error codes
+        \ingroup print_func_print
     */
-    AFAPI af_err af_get_elements(dim_t *elems, const af_array arr);
+    AFAPI void print(const char *exp, const array &arr, const int precision);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Gets the type of an array.
+        \param[in] key is an expression used as tag/key for the array during \ref readArray
+        \param[in] arr is the array to be written
+        \param[in] filename is the path to the location on disk
+        \param[in] append is used to append to an existing file when true and create or
+        overwrite an existing file when false
 
-        \param[out] type is the output that contains the type of \p arr
-        \param[in] arr is the input array
+        \returns index of the saved array in the file
 
-        \returns error codes
+        \ingroup stream_func_save
     */
-    AFAPI af_err af_get_type(af_dtype *type, const af_array arr);
+    AFAPI int saveArray(const char *key, const array &arr, const char *filename, const bool append = false);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Gets the dimseions of an array.
+        \param[in] filename is the path to the location on disk
+        \param[in] index is the 0-based sequential location of the array to be read
 
-        \param[out] d0 is the output that contains the size of first dimension of \p arr
-        \param[out] d1 is the output that contains the size of first dimension of \p arr
-        \param[out] d2 is the output that contains the size of first dimension of \p arr
-        \param[out] d3 is the output that contains the size of first dimension of \p arr
-        \param[in] arr is the input array
+        \returns array read from the index location
 
-        \returns error codes
-    */
-    AFAPI af_err af_get_dims(dim_t *d0, dim_t *d1, dim_t *d2, dim_t *d3,
-                             const af_array arr);
-
-    /**
-        \brief Gets the number of dimensions of an array.
+        \note This function will throw an exception if the index is out of bounds
 
-        \param[out] result is the output that contains the number of dims of \p arr
-        \param[in] arr is the input array
-
-        \returns error codes
+        \ingroup stream_func_read
     */
-    AFAPI af_err af_get_numdims(unsigned *result, const af_array arr);
+    AFAPI array readArray(const char *filename, const unsigned index);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is empty.
+        \param[in] filename is the path to the location on disk
+        \param[in] key is the tag/name of the array to be read. The key needs to have an exact match.
 
-        \param[out] result is true if elements of arr is 0, otherwise false
-        \param[in] arr is the input array
+        \returns array read by key
 
-        \returns error codes
+        \note This function will throw an exception if the key is not found.
+
+        \ingroup stream_func_read
     */
-    AFAPI af_err af_is_empty        (bool *result, const af_array arr);
+    AFAPI array readArray(const char *filename, const char *key);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is scalar, ie. single element.
+        When reading by key, it may be a good idea to run this function first to check for the key
+        and then call the readArray using the index. This will avoid exceptions in case of key not found.
 
-        \param[out] result is true if elements of arr is 1, otherwise false
-        \param[in] arr is the input array
+        \param[in] filename is the path to the location on disk
+        \param[in] key is the tag/name of the array to be read. The key needs to have an exact match.
 
-        \returns error codes
+        \returns index of the array in the file if the key is found. -1 if key is not found.
+
+        \ingroup stream_func_read
     */
-    AFAPI af_err af_is_scalar       (bool *result, const af_array arr);
+    AFAPI int readArrayCheck(const char *filename, const char *key);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is row vector.
-
-        \param[out] result is true if arr has dims [1 x 1 1], false otherwise
+        \param[out] output is the pointer to the c-string that will hold the data. The memory for
+        output is allocated by the function. The user is responsible for deleting the memory using
+        af::freeHost() or af_free_host().
+        \param[in] exp is an expression, generally the name of the array
         \param[in] arr is the input array
+        \param[in] precision is the precision length for display
+        \param[in] transpose determines whether or not to transpose the array before storing it in
+        the string
 
-        \returns error codes
+        \ingroup print_func_tostring
     */
-    AFAPI af_err af_is_row          (bool *result, const af_array arr);
+    AFAPI void toString(char **output, const char *exp, const array &arr,
+                        const int precision = 4, const bool transpose = true);
+#endif
 
+#if AF_API_VERSION >= 33
     /**
-        \brief Check if an array is a column vector
-
-        \param[out] result is true if arr has dims [x 1 1 1], false otherwise
+        \param[in] exp is an expression, generally the name of the array
         \param[in] arr is the input array
+        \param[in] precision is the precision length for display
+        \param[in] transpose determines whether or not to transpose the array before storing it in
+        the string
 
-        \returns error codes
+        \return output is the pointer to the c-string that will hold the data. The memory for
+        output is allocated by the function. The user is responsible for deleting the memory using
+        af::freeHost() or af_free_host().
+
+        \ingroup print_func_tostring
     */
-    AFAPI af_err af_is_column       (bool *result, const af_array arr);
+    AFAPI const char* toString(const char *exp, const array &arr,
+                               const int precision = 4, const bool transpose = true);
+#endif
 
-    /**
-        \brief Check if an array is a vector
+    // Purpose of Addition: "How to add Function" documentation
+    AFAPI array exampleFunction(const array& in, const af_someenum_t param);
 
-        A vector is any array that has exactly 1 dimension not equal to 1.
+#if AF_API_VERSION >= 34
+    ///
+    /// Get the size of the type represented by an af_dtype enum
+    ///
+    AFAPI size_t getSizeOf(af::dtype type);
+#endif
+}
 
-        \param[out] result is true if arr is a vector, false otherwise
-        \param[in] arr is the input array
+#if AF_API_VERSION >= 31
 
-        \returns error codes
-    */
-    AFAPI af_err af_is_vector       (bool *result, const af_array arr);
+#define AF_PRINT1(exp)            af::print(#exp, exp);
+#define AF_PRINT2(exp, precision) af::print(#exp, exp, precision);
 
-    /**
-        \brief Check if an array is complex type
+#define GET_PRINT_MACRO(_1, _2, NAME, ...) NAME
 
-        \param[out] result is true if arr is of type \ref c32 or \ref c64, otherwise false
-        \param[in] arr is the input array
+#define af_print(...) GET_PRINT_MACRO(__VA_ARGS__, AF_PRINT2, AF_PRINT1)(__VA_ARGS__)
 
-        \returns error codes
-    */
-    AFAPI af_err af_is_complex      (bool *result, const af_array arr);
+#else // AF_API_VERSION
 
-    /**
-        \brief Check if an array is real type
+#define af_print(exp) af::print(#exp, exp);
 
-        This is mutually exclusive to \ref af_is_complex
+#endif // AF_API_VERSION
 
-        \param[out] result is true if arr is NOT of type \ref c32 or \ref c64, otherwise false
-        \param[in] arr is the input array
+#endif //__cplusplus
 
-        \returns error codes
-    */
-    AFAPI af_err af_is_real         (bool *result, const af_array arr);
+#ifdef __cplusplus
+extern "C" {
+#endif
 
     /**
-        \brief Check if an array is double precision type
-
-        \param[out] result is true if arr is of type \ref f64 or \ref c64, otherwise false
         \param[in] arr is the input array
 
         \returns error codes
+
+        \ingroup print_func_print
     */
-    AFAPI af_err af_is_double       (bool *result, const af_array arr);
+    AFAPI af_err af_print_array(af_array arr);
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is single precision type
-
-        \param[out] result is true if arr is of type \ref f32 or \ref c32, otherwise false
+        \param[in] exp is the expression or name of the array
         \param[in] arr is the input array
+        \param[in] precision is the precision length for display
 
         \returns error codes
+
+        \ingroup print_func_print
     */
-    AFAPI af_err af_is_single       (bool *result, const af_array arr);
+    AFAPI af_err af_print_array_gen(const char *exp, const af_array arr, const int precision);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is real floating point type
-
-        \param[out] result is true if arr is of type \ref f32 or \ref f64, otherwise false
-        \param[in] arr is the input array
-
-        \returns error codes
+        \param[out] index is the index location of the array in the file
+        \param[in] key is an expression used as tag/key for the array during \ref af::readArray()
+        \param[in] arr is the array to be written
+        \param[in] filename is the path to the location on disk
+        \param[in] append is used to append to an existing file when true and create or
+        overwrite an existing file when false
+
+        \ingroup stream_func_save
     */
-    AFAPI af_err af_is_realfloating (bool *result, const af_array arr);
+    AFAPI af_err af_save_array(int *index, const char* key, const af_array arr, const char *filename, const bool append);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is floating precision type
+        \param[out] out is the array read from index
+        \param[in] filename is the path to the location on disk
+        \param[in] index is the 0-based sequential location of the array to be read
 
-        This is a combination of \ref af_is_realfloating and \ref af_is_complex
+        \note This function will throw an exception if the index is out of bounds.
 
-        \param[out] result is true if arr is of type \ref f32, \ref f64, \ref c32 or \ref c64, otherwise false
-        \param[in] arr is the input array
-
-        \returns error codes
+        \ingroup stream_func_read
     */
-    AFAPI af_err af_is_floating     (bool *result, const af_array arr);
+    AFAPI af_err af_read_array_index(af_array *out, const char *filename, const unsigned index);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is integer type
+        \param[out] out is the array read from key
+        \param[in] filename is the path to the location on disk
+        \param[in] key is the tag/name of the array to be read. The key needs to have an exact match.
 
-        \param[out] result is true if arr is of integer types, otherwise false
-        \param[in] arr is the input array
+        \note This function will throw an exception if the key is not found.
 
-        \returns error codes
+        \ingroup stream_func_read
     */
-    AFAPI af_err af_is_integer      (bool *result, const af_array arr);
+    AFAPI af_err af_read_array_key(af_array *out, const char *filename, const char* key);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
-        \brief Check if an array is bool type
+        When reading by key, it may be a good idea to run this function first to check for the key
+        and then call the readArray using the index. This will avoid exceptions in case of key not found.
 
-        \param[out] result is true if arr is of \ref b8 type, otherwise false
-        \param[in] arr is the input array
+        \param[out] index is the index of the array in the file if the key is found, or -1 if the key is not found.
+        \param[in] filename is the path to the location on disk
+        \param[in] key is the tag/name of the array to be read. The key needs to have an exact match.
 
-        \returns error codes
-    */
-    AFAPI af_err af_is_bool         (bool *result, const af_array arr);
-    /**
-        @}
+        \ingroup stream_func_read
     */
+    AFAPI af_err af_read_array_key_check(int *index, const char *filename, const char* key);
+#endif
 
+#if AF_API_VERSION >= 31
     /**
+        \param[out] output is the pointer to the c-string that will hold the data. The memory for
+        output is allocated by the function. The user is responsible for deleting the memory.
+        \param[in] exp is an expression, generally the name of the array
         \param[in] arr is the input array
+        \param[in] precision is the precision length for display
+        \param[in] transpose determines whether or not to transpose the array before storing it in
+        the string
 
-        \returns error codes
-
-        \ingroup print_func_print
+        \ingroup print_func_tostring
     */
-    AFAPI af_err af_print_array(af_array arr);
+    AFAPI af_err af_array_to_string(char **output, const char *exp, const af_array arr,
+                                    const int precision, const bool transpose);
+#endif
 
     // Purpose of Addition: "How to add Function" documentation
-    AFAPI af_err af_example_function(af_array* out, const af_array in, const af_someenum_t param);
+    AFAPI af_err af_example_function(af_array* out,
+                                     const af_array in,
+                                     const af_someenum_t param);
+
+    ///
+    /// Get the version information of the library
+    ///
+    AFAPI af_err af_get_version(int *major, int *minor, int *patch);
+
+
+#if AF_API_VERSION >= 33
+    ///
+    /// Get the revision (commit) information of the library.
+    /// This returns a constant string from compile time and should not be
+    /// freed by the user.
+    ///
+    AFAPI const char *af_get_revision();
+#endif
+
+#if AF_API_VERSION >= 34
+    ///
+    /// Get the size of the type represented by an af_dtype enum
+    ///
+    AFAPI af_err af_get_size_of(size_t *size, af_dtype type);
+#endif
+
+#if AF_API_VERSION >= 37
+    /// Enable (default) or disable error messages that display the stacktrace.
+    ///
+    /// \param[in] is_enabled If zero, stacktraces are not shown with the error
+    ///                       messages
+    /// \returns Always returns AF_SUCCESS
+    AFAPI af_err af_set_enable_stacktrace(int is_enabled);
+#endif
 
 #ifdef __cplusplus
 }
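
Reviewer sketch (not part of the patch): the new `af_print(...)` macro in this file selects `AF_PRINT1` or `AF_PRINT2` by argument count through `GET_PRINT_MACRO`. The standalone example below reproduces that dispatch trick without depending on ArrayFire. The `demo::print` overloads, the `DEMO_*` macros, `demo_print` and `demo_usage` are hypothetical stand-ins, and the trailing `0` sentinel is an addition of this sketch that the header does not have.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for af::print: they return a description instead of
// printing, so the overload that was selected is easy to inspect.
namespace demo {
inline std::string print(const char* exp, int /*value*/) {
    return std::string(exp) + " (default precision)";
}
inline std::string print(const char* exp, int /*value*/, int precision) {
    return std::string(exp) + " (precision " + std::to_string(precision) + ")";
}
} // namespace demo

#define DEMO_PRINT1(exp)            demo::print(#exp, exp)
#define DEMO_PRINT2(exp, precision) demo::print(#exp, exp, precision)

// NAME receives whichever macro name lands in the third position after the
// caller's arguments have shifted the candidate list to the right.
#define DEMO_GET_PRINT_MACRO(a1, a2, NAME, ...) NAME

// The trailing 0 is specific to this sketch: it keeps the variadic part
// non-empty for strictly conforming pre-C++20 preprocessors.
#define demo_print(...) \
    DEMO_GET_PRINT_MACRO(__VA_ARGS__, DEMO_PRINT2, DEMO_PRINT1, 0)(__VA_ARGS__)

// Example calls; x is an arbitrary local used only for illustration.
inline void demo_usage() {
    int x = 7;
    assert(demo_print(x) == "x (default precision)");
    assert(demo_print(x, 3) == "x (precision 3)");
}
```

With one argument the selector sees `(x, DEMO_PRINT2, DEMO_PRINT1, 0)` and picks `DEMO_PRINT1`; with two arguments the list shifts by one and `DEMO_PRINT2` is picked. Some preprocessors forward `__VA_ARGS__` as a single token, in which case this idiom needs an extra expansion step.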
diff --git a/include/af/vision.h b/include/af/vision.h
index 125af67cec..5400112fe4 100644
--- a/include/af/vision.h
+++ b/include/af/vision.h
@@ -8,6 +8,7 @@
  ********************************************************/
 
 #pragma once
+#include <af/defines.h>
 #include <af/features.h>
 
 #ifdef __cplusplus
@@ -27,7 +28,7 @@ class array;
     \param[in] non_max performs non-maximal suppression if true
     \param[in] feature_ratio maximum ratio of features to detect, the maximum
                number of features is calculated by feature_ratio * in.elements().
-               The maximum number of features is not based on the score, instead
+               The maximum number of features is not based on the score, instead,
                features detected after the limit is reached are discarded
     \param[in] edge is the length of the edges in the image to be discarded
                by FAST (minimum is 3, as the radius of the circle)
@@ -38,7 +39,41 @@ class array;
 
     \ingroup cv_func_fast
  */
-AFAPI features fast(const array& in, const float thr=20.0f, const unsigned arc_length=9, const bool non_max=true, const float feature_ratio=0.05, const unsigned edge=3);
+AFAPI features fast(const array& in, const float thr=20.0f, const unsigned arc_length=9,
+                    const bool non_max=true, const float feature_ratio=0.05f,
+                    const unsigned edge=3);
+
+#if AF_API_VERSION >= 31
+/**
+    C++ Interface for Harris corner detector
+
+    \param[in] in array containing a grayscale image (color images are not
+               supported)
+    \param[in] max_corners maximum number of corners to keep, only retains
+               those with highest Harris responses
+    \param[in] min_response minimum response in order for a corner to be
+               retained, only used if max_corners = 0
+    \param[in] sigma the standard deviation of a circular window (its
+               dimensions will be calculated according to the standard
+               deviation), the covariation matrix will be calculated to a
+               circular neighborhood of this standard deviation (only used
+               when block_size == 0, must be >= 0.5f and <= 5.0f)
+    \param[in] block_size square window size, the covariation matrix will be
+               calculated to a square neighborhood of this size (must be
+               >= 3 and <= 31)
+    \param[in] k_thr Harris constant, usually set empirically to 0.04f (must
+               be >= 0.01f)
+    \return    features object containing arrays for x and y coordinates and
+               score (Harris response), while arrays orientation and size are
+               set to 0 and 1, respectively, because Harris does not compute
+               that information
+
+    \ingroup cv_func_harris
+ */
+AFAPI features harris(const array& in, const unsigned max_corners=500,
+                      const float min_response=1e5f, const float sigma=1.f,
+                      const unsigned block_size=0, const float k_thr=0.04f);
+#endif
 
 /**
     C++ Interface for ORB feature descriptor
@@ -51,19 +86,97 @@ AFAPI features fast(const array& in, const float thr=20.0f, const unsigned arc_l
                 supported)
     \param[in]  fast_thr FAST threshold for which a pixel of the circle around
                 the central pixel is considered to be brighter or darker
-    \param[in]  max_feat Maximum number of features to hold (will only keep the
+    \param[in]  max_feat maximum number of features to hold (will only keep the
                 max_feat features with higher Harris responses)
-    \param[in]  scl_fctr Factor to downsample the input image, meaning that
+    \param[in]  scl_fctr factor to downsample the input image, meaning that
                 each level will hold prior level dimensions divided by scl_fctr
-    \param[in]  levels Number of levels to be computed for the image pyramid
-    \param[in]  blur_img Blur image with a Gaussian filter with sigma=2 before
+    \param[in]  levels number of levels to be computed for the image pyramid
+    \param[in]  blur_img blur image with a Gaussian filter with sigma=2 before
                 computing descriptors to increase robustness against noise if
                 true
 
     \ingroup cv_func_orb
  */
-AFAPI void orb(features& feat, array& desc, const array& image, const float fast_thr=20.f, const unsigned max_feat=400, const float scl_fctr=1.5f, const unsigned levels=4, const bool blur_img=false);
+AFAPI void orb(features& feat, array& desc, const array& image,
+               const float fast_thr=20.f, const unsigned max_feat=400,
+               const float scl_fctr=1.5f, const unsigned levels=4,
+               const bool blur_img=false);
+
+#if AF_API_VERSION >= 31
+/**
+    C++ Interface for SIFT feature detector and descriptor
+
+    \param[out] feat features object composed of arrays for x and y
+                coordinates, score, orientation and size of selected features
+    \param[out] desc Nx128 array containing extracted descriptors, where N is the
+                number of features found by SIFT
+    \param[in]  in array containing a grayscale image (color images are not
+                supported)
+    \param[in]  n_layers number of layers per octave, the number of octaves is
+                computed automatically according to the input image dimensions,
+                the original SIFT paper suggests 3
+    \param[in]  contrast_thr threshold used to filter out features that have
+                low contrast, the original SIFT paper suggests 0.04
+    \param[in]  edge_thr threshold used to filter out features that are too
+                edge-like, the original SIFT paper suggests 10.0
+    \param[in]  init_sigma the sigma value used to filter the input image at
+                the first octave, the original SIFT paper suggests 1.6
+    \param[in]  double_input if true, the input image dimensions will be
+                doubled and the doubled image will be used for the first octave
+    \param[in]  intensity_scale the inverse of the difference between the minimum
+                and maximum grayscale intensity value, e.g.: if the ranges are
+                0-256, the proper intensity_scale value is 1/256, if the ranges
+                are 0-1, the proper intensity_scale value is 1/1
+    \param[in]  feature_ratio maximum ratio of features to detect, the maximum
+                number of features is calculated by feature_ratio * in.elements().
+                The maximum number of features is not based on the score, instead,
+                features detected after the limit is reached are discarded
+
+    \ingroup cv_func_sift
+ */
+AFAPI void sift(features& feat, array& desc, const array& in, const unsigned n_layers=3,
+                const float contrast_thr=0.04f, const float edge_thr=10.f,
+                const float init_sigma=1.6f, const bool double_input=true,
+                const float intensity_scale=0.00390625f, const float feature_ratio=0.05f);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+    C++ Interface for SIFT feature detector and GLOH descriptor
+
+    \param[out] feat features object composed of arrays for x and y
+                coordinates, score, orientation and size of selected features
+    \param[out] desc Nx272 array containing extracted GLOH descriptors, where N
+                is the number of features found by SIFT
+    \param[in]  in array containing a grayscale image (color images are not
+                supported)
+    \param[in]  n_layers number of layers per octave, the number of octaves is
+                computed automatically according to the input image dimensions,
+                the original SIFT paper suggests 3
+    \param[in]  contrast_thr threshold used to filter out features that have
+                low contrast, the original SIFT paper suggests 0.04
+    \param[in]  edge_thr threshold used to filter out features that are too
+                edge-like, the original SIFT paper suggests 10.0
+    \param[in]  init_sigma the sigma value used to filter the input image at
+                the first octave, the original SIFT paper suggests 1.6
+    \param[in]  double_input if true, the input image dimensions will be
+                doubled and the doubled image will be used for the first octave
+    \param[in]  intensity_scale the inverse of the difference between the minimum
+                and maximum grayscale intensity value, e.g.: if the ranges are
+                0-256, the proper intensity_scale value is 1/256, if the ranges
+                are 0-1, the proper intensity_scale value is 1/1
+    \param[in]  feature_ratio maximum ratio of features to detect, the maximum
+                number of features is calculated by feature_ratio * in.elements().
+                The maximum number of features is not based on the score, instead,
+                features detected after the limit is reached are discarded
 
+    \ingroup cv_func_sift
+ */
+AFAPI void gloh(features& feat, array& desc, const array& in, const unsigned n_layers=3,
+                const float contrast_thr=0.04f, const float edge_thr=10.f,
+                const float init_sigma=1.6f, const bool double_input=true,
+                const float intensity_scale=0.00390625f, const float feature_ratio=0.05f);
+#endif
 
 /**
    C++ Interface wrapper for Hamming matcher
@@ -78,11 +191,14 @@ AFAPI void orb(features& feat, array& desc, const array& image, const float fast
                the Hamming distance of the Jth smallest distance to the Ith query
                value in the train data array.
    \param[in]  query is the array containing the data to be queried
-   \param[in]  train is the array containing the data stored as training data
+   \param[in]  train is the array containing the data used as training data
    \param[in]  dist_dim indicates the dimension to analyze for distance (the dimension
                indicated here must be of equal length for both query and train arrays)
-   \param[in]  n_dist is the number of smallest distances to return (currently, only 1
-               is supported)
+   \param[in]  n_dist is the number of smallest distances to return (currently, only
+               values <= 256 are supported)
+
+   \note This is a special case of the \ref nearestNeighbour function with AF_SHD
+    as dist_type
 
    \ingroup cv_func_hamming_matcher
  */
@@ -90,6 +206,137 @@ AFAPI void hammingMatcher(array& idx, array& dist,
                           const array& query, const array& train,
                           const dim_t dist_dim=0, const unsigned n_dist=1);
 
+#if AF_API_VERSION >= 31
+/**
+   C++ interface wrapper for determining the nearest neighbouring points to a
+   given set of points
+
+   \param[out] idx       is an array of \f$M \times N\f$ size, where \f$M\f$ is
+                         \p n_dist and \f$N\f$ is the number of queries. The
+                         value at position \f$i,j\f$ is the index of the point
+                         in \p train along dim1 (if \p dist_dim is 0) or along
+                         dim0 (if \p dist_dim is 1), with the \f$ith\f$
+                         smallest distance to the \f$jth\f$ \p query point.
+   \param[out] dist      is an array of \f$M \times N\f$ size, where \f$M\f$ is
+                         \p n_dist and \f$N\f$ is the number of queries. The
+                         value at position \f$i,j\f$ is the distance from the
+                         \f$jth\f$ query point to the point in \p train referred
+                         to by \p idx(\f$i,j\f$). This distance is computed
+                         according to the \p dist_type chosen.
+   \param[in]  query     is the array containing the points to be queried. The
+                         points must be described along dim0 and listed along
+                         dim1 if \p dist_dim is 0, or vice versa if \p dist_dim
+                         is 1.
+   \param[in]  train     is the array containing the points used as training
+                         data. The points must be described along dim0 and
+                         listed along dim1 if \p dist_dim is 0, or vice versa if
+                         \p dist_dim is 1.
+   \param[in]  dist_dim  indicates the dimension that the distance computation
+                         will use to determine a point's coordinates. The \p
+                         train and \p query arrays must both use this dimension
+                         for describing a point's coordinates
+   \param[in]  n_dist    is the number of nearest neighbour points to return
+                         (currently only values <= 256 are supported)
+   \param[in]  dist_type is the distance computation type. Currently \ref
+                         AF_SAD (sum of absolute differences), \ref AF_SSD (sum
+                         of squared differences), and \ref AF_SHD (hamming
+                         distances) are supported.
+
+   \ingroup cv_func_nearest_neighbour
+ */
+AFAPI void nearestNeighbour(array& idx, array& dist,
+                            const array& query, const array& train,
+                            const dim_t dist_dim=0, const unsigned n_dist=1,
+                            const af_match_type dist_type = AF_SSD);
+#endif
+
+/**
+   C++ Interface for image template matching
+
+   \param[in]  searchImg is an array with image data
+   \param[in]  templateImg is the template we are looking for in the image
+   \param[in]  mType is the metric that should be used to calculate the disparity
+               between a window in the image and the template image. It can be one of
+               the values defined by the enum \ref af_match_type
+   \return     array with disparity values for the window starting at
+               corresponding pixel position
+
+   \note If \p searchImg is a 3d array, a batch operation will be performed.
+
+   \ingroup cv_func_match_template
+ */
+AFAPI array matchTemplate(const array &searchImg, const array &templateImg, const matchType mType=AF_SAD);
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface for SUSAN corner detector
+
+   \param[in]  in is input grayscale/intensity image
+   \param[in]  radius nucleus radius for each pixel neighborhood
+   \param[in]  diff_thr intensity difference threshold
+   \param[in]  geom_thr geometric threshold a.k.a **t** from equations in description
+   \param[in]  feature_ratio is the maximum ratio of features that will be returned by the function
+   \param[in]  edge indicates the width, in pixels, of the border area that is skipped during corner detection
+   \return If SUSAN corner detection is successful, returns an object of the features class, composed of arrays for x and y
+               coordinates, score, orientation and size of selected features; otherwise an exception is thrown.
+
+   \note If \p in is a 3d array, a batch operation will be performed.
+
+   \ingroup cv_func_susan
+*/
+AFAPI features susan(const array& in,
+                     const unsigned radius=3,
+                     const float diff_thr=32.0f,
+                     const float geom_thr=10.0f,
+                     const float feature_ratio=0.05f,
+                     const unsigned edge=3);
+#endif
+
+#if AF_API_VERSION >= 31
+/**
+   C++ Interface wrapper for Difference of Gaussians
+
+   \param[in] in is input image
+   \param[in] radius1 is the radius of first gaussian kernel
+   \param[in] radius2 is the radius of second gaussian kernel
+   \return    Difference of smoothed inputs
+
+   \ingroup cv_func_dog
+ */
+AFAPI array dog(const array& in, const int radius1, const int radius2);
+#endif
+
+#if AF_API_VERSION >= 32
+/**
+   C++ Interface for Homography estimation
+
+   \param[out] H is a 3x3 array containing the estimated homography.
+   \param[out] inliers is the number of inliers that the homography was estimated to comprise.
+               If htype is AF_HOMOGRAPHY_RANSAC, a higher inlier_thr value will increase the
+               estimated inliers. Note that if the number of inliers is too low, it is likely
+               that a bad homography will be returned.
+   \param[in]  x_src x coordinates of the source points.
+   \param[in]  y_src y coordinates of the source points.
+   \param[in]  x_dst x coordinates of the destination points.
+   \param[in]  y_dst y coordinates of the destination points.
+   \param[in]  htype can be AF_HOMOGRAPHY_RANSAC, for which a RANdom SAmple Consensus will be
+               used to evaluate the homography quality (e.g., number of inliers), or AF_HOMOGRAPHY_LMEDS,
+               which will use Least Median of Squares method to evaluate homography quality
+   \param[in]  inlier_thr if htype is AF_HOMOGRAPHY_RANSAC, this parameter will give the maximum L2-distance
+               for a point to be considered an inlier.
+   \param[in]  iterations maximum number of iterations when htype is AF_HOMOGRAPHY_RANSAC and backend is CPU;
+               if backend is CUDA or OpenCL, iterations is the total number of iterations. An
+               iteration is a selection of 4 random points for which the homography is estimated
+               and evaluated for number of inliers.
+   \param[in]  otype the array type for the homography output.
+
+   \ingroup cv_func_homography
+*/
+AFAPI void homography(array& H, int& inliers, const array& x_src, const array& y_src,
+                      const array& x_dst, const array& y_dst, const af_homography_type htype=AF_HOMOGRAPHY_RANSAC,
+                      const float inlier_thr=3.f, const unsigned iterations=1000, const dtype otype=f32);
+#endif
+
 }
 #endif
 
@@ -114,7 +361,7 @@ extern "C" {
         \param[in]  feature_ratio maximum ratio of features to detect, the
                     maximum number of features is calculated by
                     feature_ratio * in.elements(). The maximum number of
-                    features is not based on the score, instead features
+                    features is not based on the score, instead, features
                     detected after the limit is reached are discarded
         \param[in]  edge is the length of the edges in the image to be
                     discarded by FAST (minimum is 3, as the radius of the
@@ -122,7 +369,40 @@ extern "C" {
 
         \ingroup cv_func_fast
     */
-    AFAPI af_err af_fast(af_features *out, const af_array in, const float thr, const unsigned arc_length, const bool non_max, const float feature_ratio, const unsigned edge);
+    AFAPI af_err af_fast(af_features *out, const af_array in, const float thr, const unsigned arc_length,
+                         const bool non_max, const float feature_ratio, const unsigned edge);
+
+#if AF_API_VERSION >= 31
+    /**
+        C Interface for Harris corner detector
+
+        \param[out] out struct containing arrays for x and y
+                    coordinates and score (Harris response), while arrays
+                    orientation and size are set to 0 and 1, respectively,
+                    because Harris does not compute that information
+        \param[in]  in array containing a grayscale image (color images are not
+                    supported)
+        \param[in]  max_corners maximum number of corners to keep, only retains
+                    those with highest Harris responses
+        \param[in]  min_response minimum response in order for a corner to be
+                    retained, only used if max_corners = 0
+        \param[in]  sigma the standard deviation of a circular window (its
+                    dimensions will be calculated according to the standard
+                    deviation), the covariation matrix will be calculated to a
+                    circular neighborhood of this standard deviation (only used
+                    when block_size == 0, must be >= 0.5f and <= 5.0f)
+        \param[in]  block_size square window size, the covariation matrix will be
+                    calculated to a square neighborhood of this size (must be
+                    >= 3 and <= 31)
+        \param[in]  k_thr Harris constant, usually set empirically to 0.04f (must
+                    be >= 0.01f)
+
+        \ingroup cv_func_harris
+    */
+    AFAPI af_err af_harris(af_features *out, const af_array in, const unsigned max_corners,
+                           const float min_response, const float sigma,
+                           const unsigned block_size, const float k_thr);
+#endif
 
     /**
         C Interface for ORB feature descriptor
@@ -135,18 +415,96 @@ extern "C" {
                     supported)
         \param[in]  fast_thr FAST threshold for which a pixel of the circle around
                     the central pixel is considered to be brighter or darker
-        \param[in]  max_feat Maximum number of features to hold (will only keep the
+        \param[in]  max_feat maximum number of features to hold (will only keep the
                     max_feat features with higher Harris responses)
-        \param[in]  scl_fctr Factor to downsample the input image, meaning that
+        \param[in]  scl_fctr factor to downsample the input image, meaning that
                     each level will hold prior level dimensions divided by scl_fctr
-        \param[in]  levels Number of levels to be computed for the image pyramid
-        \param[in]  blur_img Blur image with a Gaussian filter with sigma=2 before
+        \param[in]  levels number of levels to be computed for the image pyramid
+        \param[in]  blur_img blur image with a Gaussian filter with sigma=2 before
                     computing descriptors to increase robustness against noise if
                     true
 
         \ingroup cv_func_orb
     */
-    AFAPI af_err af_orb(af_features *feat, af_array *desc, const af_array in, const float fast_thr, const unsigned max_feat, const float scl_fctr, const unsigned levels, const bool blur_img);
+    AFAPI af_err af_orb(af_features *feat, af_array *desc, const af_array in,
+                        const float fast_thr, const unsigned max_feat, const float scl_fctr,
+                        const unsigned levels, const bool blur_img);
+
+#if AF_API_VERSION >= 31
+    /**
+        C Interface for SIFT feature detector and descriptor
+
+        \param[out] feat af_features object composed of arrays for x and y
+                    coordinates, score, orientation and size of selected features
+        \param[out] desc Nx128 array containing extracted descriptors, where N is the
+                    number of features found by SIFT
+        \param[in]  in array containing a grayscale image (color images are not
+                    supported)
+        \param[in]  n_layers number of layers per octave, the number of octaves is
+                    computed automatically according to the input image dimensions,
+                    the original SIFT paper suggests 3
+        \param[in]  contrast_thr threshold used to filter out features that have
+                    low contrast, the original SIFT paper suggests 0.04
+        \param[in]  edge_thr threshold used to filter out features that are too
+                    edge-like, the original SIFT paper suggests 10.0
+        \param[in]  init_sigma the sigma value used to filter the input image at
+                    the first octave, the original SIFT paper suggests 1.6
+        \param[in]  double_input if true, the input image dimensions will be
+                    doubled and the doubled image will be used for the first octave
+        \param[in]  intensity_scale the inverse of the difference between the minimum
+                    and maximum grayscale intensity value, e.g.: if the ranges are
+                    0-256, the proper intensity_scale value is 1/256, if the ranges
+                    are 0-1, the proper intensity_scale value is 1/1
+        \param[in]  feature_ratio maximum ratio of features to detect, the maximum
+                    number of features is calculated by feature_ratio * in.elements().
+                    The maximum number of features is not based on the score, instead,
+                    features detected after the limit is reached are discarded
+
+        \ingroup cv_func_sift
+    */
+    AFAPI af_err af_sift(af_features *feat, af_array *desc, const af_array in,
+                         const unsigned n_layers, const float contrast_thr, const float edge_thr,
+                         const float init_sigma, const bool double_input,
+                         const float intensity_scale, const float feature_ratio);
+#endif
+
+#if AF_API_VERSION >= 32
+    /**
+        C Interface for SIFT feature detector and GLOH descriptor
+
+        \param[out] feat af_features object composed of arrays for x and y
+                    coordinates, score, orientation and size of selected features
+        \param[out] desc Nx272 array containing extracted GLOH descriptors, where N
+                    is the number of features found by SIFT
+        \param[in]  in array containing a grayscale image (color images are not
+                    supported)
+        \param[in]  n_layers number of layers per octave, the number of octaves is
+                    computed automatically according to the input image dimensions,
+                    the original SIFT paper suggests 3
+        \param[in]  contrast_thr threshold used to filter out features that have
+                    low contrast, the original SIFT paper suggests 0.04
+        \param[in]  edge_thr threshold used to filter out features that are too
+                    edge-like, the original SIFT paper suggests 10.0
+        \param[in]  init_sigma the sigma value used to filter the input image at
+                    the first octave, the original SIFT paper suggests 1.6
+        \param[in]  double_input if true, the input image dimensions will be
+                    doubled and the doubled image will be used for the first octave
+        \param[in]  intensity_scale the inverse of the difference between the minimum
+                    and maximum grayscale intensity value, e.g.: if the ranges are
+                    0-256, the proper intensity_scale value is 1/256, if the ranges
+                    are 0-1, the proper intensity_scale value is 1/1
+        \param[in]  feature_ratio maximum ratio of features to detect, the maximum
+                    number of features is calculated by feature_ratio * in.elements().
+                    The maximum number of features is not based on the score, instead,
+                    features detected after the limit is reached are discarded
+
+        \ingroup cv_func_sift
+    */
+    AFAPI af_err af_gloh(af_features *feat, af_array *desc, const af_array in,
+                         const unsigned n_layers, const float contrast_thr,
+                         const float edge_thr, const float init_sigma, const bool double_input,
+                         const float intensity_scale, const float feature_ratio);
+#endif
 
     /**
        C Interface wrapper for Hamming matcher
@@ -161,11 +519,11 @@ extern "C" {
                    the Hamming distance of the Jth smallest distance to the Ith query
                    value in the train data array.
        \param[in]  query is the array containing the data to be queried
-       \param[in]  train is the array containing the data stored as training data
+       \param[in]  train is the array containing the data used as training data
        \param[in]  dist_dim indicates the dimension to analyze for distance (the dimension
                    indicated here must be of equal length for both query and train arrays)
-       \param[in]  n_dist is the number of smallest distances to return (currently, only 1
-                   is supported)
+       \param[in]  n_dist is the number of smallest distances to return (currently, only
+                   values <= 256 are supported)
 
        \ingroup cv_func_hamming_matcher
     */
@@ -173,6 +531,144 @@ extern "C" {
                                     const af_array query, const af_array train,
                                     const dim_t dist_dim, const unsigned n_dist);
 
+#if AF_API_VERSION >= 31
+/**
+   C Interface wrapper for determining the nearest neighbouring points to a
+   given set of points
+
+   \param[out] idx       is an array of \f$M \times N\f$ size, where \f$M\f$ is
+                         \p n_dist and \f$N\f$ is the number of queries. The
+                         value at position \f$i,j\f$ is the index of the point
+                         in \p train along dim1 (if \p dist_dim is 0) or along
+                         dim0 (if \p dist_dim is 1), with the \f$ith\f$
+                         smallest distance to the \f$jth\f$ \p query point.
+   \param[out] dist      is an array of \f$M \times N\f$ size, where \f$M\f$ is
+                         \p n_dist and \f$N\f$ is the number of queries. The
+                         value at position \f$i,j\f$ is the distance from the
+                         \f$jth\f$ query point to the point in \p train referred
+                         to by \p idx(\f$i,j\f$). This distance is computed
+                         according to the \p dist_type chosen.
+   \param[in]  query     is the array containing the points to be queried. The
+                         points must be described along dim0 and listed along
+                         dim1 if \p dist_dim is 0, or vice versa if \p dist_dim
+                         is 1.
+   \param[in]  train     is the array containing the points used as training
+                         data. The points must be described along dim0 and
+                         listed along dim1 if \p dist_dim is 0, or vice versa if
+                         \p dist_dim is 1.
+   \param[in]  dist_dim  indicates the dimension that the distance computation
+                         will use to determine a point's coordinates. The \p
+                         train and \p query arrays must both use this dimension
+                         for describing a point's coordinates
+   \param[in]  n_dist    is the number of nearest neighbour points to return
+                         (currently only values <= 256 are supported)
+   \param[in]  dist_type is the distance computation type. Currently \ref
+                         AF_SAD (sum of absolute differences), \ref AF_SSD (sum
+                         of squared differences), and \ref AF_SHD (hamming
+                         distances) are supported.
+
+   \ingroup cv_func_nearest_neighbour
+ */
+AFAPI af_err af_nearest_neighbour(af_array* idx, af_array* dist,
+                                  const af_array query, const af_array train,
+                                  const dim_t dist_dim, const unsigned n_dist,
+                                  const af_match_type dist_type);
+#endif
+
+    /**
+       C Interface for image template matching
+
+       \param[out] out will have disparity values for the window starting at
+                   corresponding pixel position
+       \param[in]  search_img is an array with image data
+       \param[in]  template_img is the template we are looking for in the image
+       \param[in]  m_type is the metric that should be used to calculate the disparity
+                   between window in the image and the template image. It can be one of
+                   the values defined by the enum \ref af_match_type
+       \return     \ref AF_SUCCESS if disparity metric is computed successfully,
+       otherwise an appropriate error code is returned.
+
+       \note If \p search_img is a 3d array, a batch operation will be performed.
+
+       \ingroup cv_func_match_template
+    */
+    AFAPI af_err af_match_template(af_array *out, const af_array search_img,
+                                   const af_array template_img, const af_match_type m_type);
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface for SUSAN corner detector
+
+       \param[out] out is af_features struct composed of arrays for x and y
+                   coordinates, score, orientation and size of selected features
+       \param[in]  in is input grayscale/intensity image
+       \param[in]  radius nucleus radius for each pixel neighborhood
+       \param[in]  diff_thr intensity difference threshold a.k.a **t** from equations in description
+       \param[in]  geom_thr geometric threshold
+       \param[in]  feature_ratio is maximum number of features that will be returned by the function
+       \param[in]  edge indicates the width, in pixels, of the border area that is skipped during corner detection
+       \return \ref AF_SUCCESS if SUSAN corner detection is successful, otherwise an appropriate
+       error code is returned.
+
+       \note If \p in is a 3d array, a batch operation will be performed.
+
+       \ingroup cv_func_susan
+    */
+    AFAPI af_err af_susan(af_features* out, const af_array in, const unsigned radius,
+                          const float diff_thr, const float geom_thr,
+                          const float feature_ratio, const unsigned edge);
+#endif
+
+#if AF_API_VERSION >= 31
+    /**
+       C Interface wrapper for Difference of Gaussians
+
+       \param[out] out is difference of smoothed inputs
+       \param[in] in is input image
+       \param[in] radius1 is the radius of first gaussian kernel
+       \param[in] radius2 is the radius of second gaussian kernel
+       \return    \ref AF_SUCCESS if the computation is successful,
+                  otherwise an appropriate error code is returned.
+
+       \ingroup cv_func_dog
+     */
+    AFAPI af_err af_dog(af_array *out, const af_array in, const int radius1, const int radius2);
+#endif
+
+#if AF_API_VERSION >= 32
+    /**
+       C Interface wrapper for Homography estimation
+
+       \param[out] H is a 3x3 array containing the estimated homography.
+       \param[out] inliers is the number of inliers that the homography was estimated to comprise.
+                   When htype is AF_HOMOGRAPHY_RANSAC, a higher inlier_thr value will increase the
+                   estimated inliers. Note that if the number of inliers is too low, it is likely
+                   that a bad homography will be returned.
+       \param[in]  x_src x coordinates of the source points.
+       \param[in]  y_src y coordinates of the source points.
+       \param[in]  x_dst x coordinates of the destination points.
+       \param[in]  y_dst y coordinates of the destination points.
+       \param[in]  htype can be AF_HOMOGRAPHY_RANSAC, for which a RANdom SAmple Consensus will be
+                   used to evaluate the homography quality (e.g., number of inliers), or AF_HOMOGRAPHY_LMEDS,
+                   which will use the Least Median of Squares method to evaluate homography quality.
+       \param[in]  inlier_thr if htype is AF_HOMOGRAPHY_RANSAC, this parameter will give the maximum L2-distance
+                   for a point to be considered an inlier.
+       \param[in]  iterations maximum number of iterations when htype is AF_HOMOGRAPHY_RANSAC and backend is CPU;
+                   if backend is CUDA or OpenCL, iterations is the total number of iterations. An
+                   iteration is a selection of 4 random points for which the homography is estimated
+                   and evaluated for number of inliers.
+       \param[in]  otype the array type for the homography output.
+       \return     \ref AF_SUCCESS if the computation is successful,
+                   otherwise an appropriate error code is returned.
+
+       \ingroup cv_func_homography
+     */
+    AFAPI af_err af_homography(af_array *H, int *inliers, const af_array x_src, const af_array y_src,
+                               const af_array x_dst, const af_array y_dst,
+                               const af_homography_type htype, const float inlier_thr,
+                               const unsigned iterations, const af_dtype otype);
+#endif
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/include/arrayfire.h b/include/arrayfire.h
index 051d5ad656..4c9e50da47 100644
--- a/include/arrayfire.h
+++ b/include/arrayfire.h
@@ -11,7 +11,11 @@
 
 /**
 
-\defgroup arrayfire_func Complete List of ArrayFire Functions
+\defgroup arrayfire_func ArrayFire Functions
+@{
+@}
+
+\defgroup arrayfire_class ArrayFire Classes
 @{
 @}
 
@@ -27,19 +31,16 @@
 
       Array constructors, random number generation, transpose, indexing, etc.
 
-      @defgroup construct_mat Constructors of array class
-      Construct an array object
-
-      @defgroup method_mat Methods of array class
-      Get information about the array object
-
       @defgroup device_mat Managing devices in ArrayFire
       getting device pointer, allocating and freeing memory
 
       @defgroup data_mat Functions to create arrays.
       constant, random, range, etc.
 
-      @defgroup index_mat Indexing operation on arrays
+      @defgroup c_api_mat C API to manage arrays
+      Create, release, copy, fetch-properties of \ref af_array
+
+      @defgroup index_mat Assignment & Indexing operation on arrays
       Access sub regions of an array object
 
       @defgroup manip_mat Move and Reorder array content
@@ -97,10 +98,77 @@
       diff, gradient, etc.
    @}
 
+   @defgroup memory_manager Memory Management
+   @{
+      Interfaces for writing custom memory managers.
+
+      Create and set a custom memory manager by first defining the relevant
+      closures for each required function, for example:
+
+      \code{.cpp}
+          af_err my_initialize(af_memory_manager manager) {
+              void* myPayload = malloc(sizeof(MyPayload_t));
+              af_memory_manager_set_payload(manager, myPayload);
+              // ...
+          }
+
+          af_err my_allocated(af_memory_manager handle, size_t* size, void* ptr) {
+              void* myPayload;
+              af_memory_manager_get_payload(handle, &myPayload);
+              // ...
+          }
+      \endcode
+
+      Create an \ref af_memory_manager and attach relevant closures:
+
+      \code{.cpp}
+          af_memory_manager manager;
+          af_create_memory_manager(&manager);
+
+          af_memory_manager_set_initialize_fn(manager, my_initialize);
+          af_memory_manager_set_allocated_fn(manager, my_allocated);
+
+          // ...
+      \endcode
+
+      Set the memory manager to be active, which shuts down the existing memory
+      manager:
+
+      \code{.cpp}
+          af_set_memory_manager(manager);
+      \endcode
+
+      Unset to re-create and reset an instance of the default memory manager:
+
+      \code{.cpp}
+          af_unset_memory_manager();
+      \endcode
+
+      @defgroup native_memory_interface Native Memory Interface
+      \brief Native alloc, native free, get device id, etc.
+
+      @defgroup memory_manager_utils Memory Manager Utils
+      \brief Set and unset memory managers, set and get manager payloads,
+              function setters
+
+      @defgroup memory_manager_api Memory Manager API
+      \brief Functions for defining custom memory managers
+   @}
+
+   @defgroup event Events
+   @{
+
+      \brief Managing ArrayFire Events, which allow manipulation of operations
+              on computation queues.
+
+      \defgroup event_api Event API
+      \brief af_create_event, af_mark_event, etc.
+   @}
+
    @defgroup linalg_mat Linear Algebra
    @{
 
-     Matrix multiply, solve, decompositions
+     Matrix multiply, solve, decompositions, sparse matrix
 
      @defgroup blas_mat BLAS operations
      Matrix multiply, dot product, etc.
@@ -113,6 +181,49 @@
 
      @defgroup lapack_ops_mat Matrix operations
      inverse, det, rank, norm etc.
+
+     @defgroup lapack_helper LAPACK Helper functions
+
+     @defgroup sparse_func Sparse functions
+        \brief Functions to create and handle sparse arrays and matrix operations
+
+        Sparse arrays in ArrayFire use the same \ref af::array (or \ref af_array)
+        handle as normal. Internally, this handle is used to maintain a structure
+        of the sparse array (components listed below).
+
+        Description     | Data Type
+        ----------------|-------------------
+        Values          | T (one of \ref f32, \ref f64, \ref c32, \ref c64)
+        Row Indices     | Int (\ref s32)
+        Column Indices  | Int (\ref s32)
+        Storage         | \ref af::storage
+
+        The value array contains the non-zero elements of the matrix. The
+        \ref af::dtype of the value array is the same as that of the matrix.
+        The size of this array is the same as the number of non-zero elements
+        of the matrix.
+
+        The row indices and column indices contain the indices based on
+        \ref af::storage type. These \ref af::array are always of type \ref s32.
+
+        The \ref af::storage is used to determine the type of storage to use.
+        Currently \ref AF_STORAGE_CSR and \ref AF_STORAGE_COO are available.
+
+        A sparse array can be identified using the \ref af::array::issparse()
+        function.  This function will return true for a sparse array and false
+        for a regular \ref af::array.
+
+        The valid operations on sparse arrays are \ref af::matmul (sparse-dense).
+        When calling matmul for sparse matrices, the sparse array is required to
+        be the left hand side matrix and can be used with transposing options.
+        The dense matrix on the right hand side cannot be used with any transpose
+        options.
+
+        Most functions cannot use sparse arrays and will throw an error with
+        \ref AF_ERR_ARG if a sparse array is given as input.
+
+        \note Sparse functionality support was added to ArrayFire in v3.4.0.
+
    @}
 
    @defgroup image_mat Image Processing
@@ -126,6 +237,9 @@
      @defgroup hist_mat Histograms
      Image and data histograms
 
+     @defgroup moments_mat Image moments
+     Centroids, areas, etc.
+
      @defgroup transform_mat Image transformations
      rotate, skew, etc.
 
@@ -138,6 +252,9 @@
      @defgroup connected_comps_mat Connected Components & Labeling
      regions
 
+     @defgroup image_mod_mat Wrapping and unwrapping image windows
+     wrap, unwrap, etc.
+
      @defgroup utility_mat Utility Functions
      loadImage, saveImage, gaussianKernel
    @}
@@ -153,6 +270,9 @@
      @defgroup featdescriptor_mat Feature descriptors
      ORB feature descriptor
 
+     @defgroup featmatcher_mat Feature matchers
+     Feature matchers
+
      @defgroup match_mat Template matching
    @}
 
@@ -194,10 +314,26 @@
      Reading and writing images
    @}
 
+   @defgroup unified_func Unified API Functions
+   @{
+
+     Functions to set current backend and utilities
+
+   @}
+
+   @defgroup internal_func Functions to work with internal array layout
+   @{
+
+     Functions to work with ArrayFire's internal data structure.
+
+     Note: The behavior of these functions is not promised to be consistent across versions.
+
+   @}
+
    @defgroup external Interface Functions
    @{
 
-     CUDA/OpenCL specific functions
+     Backend specific functions
 
      @defgroup opencl_mat OpenCL specific functions
 
@@ -213,61 +349,56 @@
         upload data to `cl_mem` objects from separate threads, but the thread which
         instantiated ArrayFire must do the `cl_mem` to \ref af::array conversion.
 
+     @defgroup cuda_mat CUDA specific functions
+
+        \brief Accessing ArrayFire's stream, and native device id with other CUDA code.
+
+        If your software is using ArrayFire's CUDA backend, you can also write custom
+        kernels and do custom memory operations using native CUDA commands. The functions
+        contained in the \p afcu namespace provide methods to get the stream and native
+        device id that ArrayFire is using.
    @}
-@}
 
+   @defgroup ml Machine Learning
+   @{
+
+     Machine learning functions
+
+     @defgroup ml_convolution Convolutions
+     Forward and backward convolution passes
+   @}
+@}
 
-*/
 
-/**
-\example helloworld.cpp
-\example pi.cpp
-\example integer.cpp
-\example rainfall.cpp
-\example vectorize.cpp
-\example black_scholes_options.cpp
-\example monte_carlo_options.cpp
-\example harris.cpp
-\example kmeans.cpp
-\example knn.cpp
-\example bagging.cpp
-\example naive_bayes.cpp
-\example perceptron.cpp
-\example neural_network.cpp
-\example rbm.cpp
-\example deep_belief_net.cpp
-\example logistic_regression.cpp
-\example conway.cpp
-\example conway_pretty.cpp
-\example fractal.cpp
-\example histogram.cpp
-\example plot2d.cpp
-\example brain_segmentation.cpp
-\example image_demo.cpp
-\example morphing.cpp
-\example optical_flow.cpp
-\example pyramids.cpp
-\example edge.cpp
 */
 
 #include "af/compatible.h"
 #include "af/algorithm.h"
 #include "af/arith.h"
 #include "af/array.h"
+#include "af/backend.h"
 #include "af/blas.h"
+#include "af/complex.h"
 #include "af/constants.h"
 #include "af/data.h"
 #include "af/device.h"
+#include "af/event.h"
 #include "af/exception.h"
 #include "af/features.h"
 #include "af/gfor.h"
 #include "af/graphics.h"
+#include "af/half.h"
 #include "af/image.h"
 #include "af/index.h"
 #include "af/lapack.h"
+#include "af/memory.h"
+#include "af/ml.h"
+#include "af/random.h"
 #include "af/seq.h"
 #include "af/signal.h"
+#include "af/sparse.h"
 #include "af/statistics.h"
 #include "af/timing.h"
 #include "af/util.h"
+#include "af/version.h"
 #include "af/vision.h"
diff --git a/src/.clang-format b/src/.clang-format
new file mode 100644
index 0000000000..47afdf3208
--- /dev/null
+++ b/src/.clang-format
@@ -0,0 +1,144 @@
+---
+Language:        Cpp
+# BasedOnStyle:  Google
+AccessModifierOffset: -1
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: false
+AlignEscapedNewlines: Left
+AlignOperands:   true
+AlignTrailingComments: true
+AllowAllParametersOfDeclarationOnNextLine: true
+AllowShortBlocksOnASingleLine: true
+AllowShortCaseLabelsOnASingleLine: true
+AllowShortFunctionsOnASingleLine: All
+AllowShortIfStatementsOnASingleLine: true
+AllowShortLoopsOnASingleLine: true
+AlwaysBreakAfterReturnType: None
+AlwaysBreakBeforeMultilineStrings: true
+AlwaysBreakTemplateDeclarations: Yes
+BinPackArguments: true
+BinPackParameters: true
+BraceWrapping:   
+  AfterClass:      false
+  AfterControlStatement: false
+  AfterEnum:       false
+  AfterFunction:   false
+  AfterNamespace:  false
+  AfterObjCDeclaration: false
+  AfterStruct:     false
+  AfterUnion:      false
+  AfterExternBlock: false
+  BeforeCatch:     false
+  BeforeElse:      false
+  IndentBraces:    false
+  SplitEmptyFunction: false
+  SplitEmptyRecord: false
+  SplitEmptyNamespace: false
+BreakBeforeBinaryOperators: None
+BreakBeforeBraces: Custom
+BreakInheritanceList: BeforeComma
+BreakBeforeTernaryOperators: true
+BreakConstructorInitializers: BeforeComma
+BreakStringLiterals: true
+ColumnLimit:     80
+CommentPragmas:  '^ IWYU pragma:'
+CompactNamespaces: false
+ConstructorInitializerAllOnOneLineOrOnePerLine: true
+ConstructorInitializerIndentWidth: 4
+ContinuationIndentWidth: 4
+Cpp11BracedListStyle: true
+DerivePointerAlignment: true
+DisableFormat:   false
+ExperimentalAutoDetectBinPacking: false
+FixNamespaceComments: true
+ForEachMacros:
+  - foreach
+  - Q_FOREACH
+  - BOOST_FOREACH
+IncludeBlocks:   Preserve
+IncludeCategories: 
+  - Regex:           '^<af/.*\.h.*>'
+    Priority:        2
+  - Regex:           '^<.*\.h.*>'
+    Priority:        1
+  - Regex:           '^<.*'
+    Priority:        3
+  - Regex:           '.*'
+    Priority:        4
+IncludeIsMainRegex: '([-_](test|unittest))?$'
+IndentCaseLabels: true
+IndentPPDirectives: None
+IndentWidth:     4
+IndentWrappedFunctionNames: false
+JavaScriptQuotes: Leave
+JavaScriptWrapImports: true
+KeepEmptyLinesAtTheStartOfBlocks: false
+MacroBlockBegin: ''
+MacroBlockEnd:   ''
+MaxEmptyLinesToKeep: 1
+NamespaceIndentation: None
+ObjCBinPackProtocolList: Never
+ObjCBlockIndentWidth: 2
+ObjCSpaceAfterProperty: false
+ObjCSpaceBeforeProtocolList: true
+PenaltyBreakAssignment: 2
+PenaltyBreakBeforeFirstCallParameter: 1
+PenaltyBreakComment: 300
+PenaltyBreakFirstLessLess: 120
+PenaltyBreakString: 1000
+PenaltyBreakTemplateDeclaration: 10
+PenaltyExcessCharacter: 1000000
+PenaltyReturnTypeOnItsOwnLine: 200
+PointerAlignment: Right
+RawStringFormats: 
+  - Language:        Cpp
+    Delimiters:      
+      - cc
+      - CC
+      - cpp
+      - Cpp
+      - CPP
+      - 'c++'
+      - 'C++'
+      - R
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+  - Language:        TextProto
+    Delimiters:      
+      - pb
+      - PB
+      - proto
+      - PROTO
+    EnclosingFunctions: 
+      - EqualsProto
+      - EquivToProto
+      - PARSE_PARTIAL_TEXT_PROTO
+      - PARSE_TEST_PROTO
+      - PARSE_TEXT_PROTO
+      - ParseTextOrDie
+      - ParseTextProtoOrDie
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+ReflowComments:  true
+SortIncludes:    true
+SortUsingDeclarations: true
+SpaceAfterCStyleCast: false
+SpaceAfterTemplateKeyword: false
+SpaceBeforeAssignmentOperators: true
+SpaceBeforeCpp11BracedList: false
+SpaceBeforeCtorInitializerColon: true
+SpaceBeforeInheritanceColon: true
+SpaceBeforeParens: ControlStatements
+SpaceBeforeRangeBasedForLoopColon: true
+SpaceInEmptyParentheses: false
+SpacesBeforeTrailingComments: 2
+SpacesInAngles:  false
+SpacesInContainerLiterals: true
+SpacesInCStyleCastParentheses: false
+SpacesInParentheses: false
+SpacesInSquareBrackets: false
+Standard:        Cpp11
+TabWidth:        4
+UseTab:          Never
+
diff --git a/src/.clang-tidy b/src/.clang-tidy
new file mode 100644
index 0000000000..549c784606
--- /dev/null
+++ b/src/.clang-tidy
@@ -0,0 +1,391 @@
+---
+Checks:          'clang-diagnostic-*,clang-analyzer-*,*,-fuchsia-*,-cppcoreguidelines-*,-misc-misplaced-const,-hicpp-no-array-decay,-readability-implicit-bool-conversion,bugprone-*,performance-*,modernize-*,-llvm-header-guard,-hicpp-use-auto,-modernize-use-trailing-return-type,-hicpp-uppercase-literal-suffix,-hicpp-use-nullptr,-modernize-use-nullptr,-google-runtime-int,-llvm-include-order,-google-runtime-references,-readability-magic-numbers,-readability-isolate-declaration,-hicpp-vararg,-google-readability-todo,-bugprone-macro-parentheses,-misc-unused-using-decls,-readability-else-after-return,-hicpp-avoid-c-arrays,-modernize-avoid-c-arrays,-hicpp-braces-around-statements,-hicpp-noexcept-move,-llvmlibc-*,-altera-*,-hicpp-explicit-conversions'
+WarningsAsErrors: ''
+HeaderFilterRegex: ''
+AnalyzeTemporaryDtors: true
+FormatStyle:     file
+User:            arrayfire
+CheckOptions:
+  - key:             abseil-string-find-startswith.AbseilStringsMatchHeader
+    value:           'absl/strings/match.h'
+  - key:             abseil-string-find-startswith.IncludeStyle
+    value:           llvm
+  - key:             abseil-string-find-startswith.StringLikeClasses
+    value:           '::std::basic_string'
+  - key:             bugprone-argument-comment.CommentBoolLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentCharacterLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentFloatLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentIntegerLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentNullPtrs
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentStringLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.CommentUserDefinedLiterals
+    value:           '0'
+  - key:             bugprone-argument-comment.StrictMode
+    value:           '0'
+  - key:             bugprone-assert-side-effect.AssertMacros
+    value:           assert
+  - key:             bugprone-assert-side-effect.CheckFunctionCalls
+    value:           '0'
+  - key:             bugprone-dangling-handle.HandleClasses
+    value:           'std::basic_string_view;std::experimental::basic_string_view'
+  - key:             bugprone-exception-escape.FunctionsThatShouldNotThrow
+    value:           ''
+  - key:             bugprone-exception-escape.IgnoredExceptions
+    value:           ''
+  - key:             bugprone-misplaced-widening-cast.CheckImplicitCasts
+    value:           '0'
+  - key:             bugprone-sizeof-expression.WarnOnSizeOfCompareToConstant
+    value:           '1'
+  - key:             bugprone-sizeof-expression.WarnOnSizeOfConstant
+    value:           '1'
+  - key:             bugprone-sizeof-expression.WarnOnSizeOfIntegerExpression
+    value:           '0'
+  - key:             bugprone-sizeof-expression.WarnOnSizeOfThis
+    value:           '1'
+  - key:             bugprone-string-constructor.LargeLengthThreshold
+    value:           '8388608'
+  - key:             bugprone-string-constructor.WarnOnLargeLength
+    value:           '1'
+  - key:             bugprone-suspicious-enum-usage.StrictMode
+    value:           '0'
+  - key:             bugprone-suspicious-missing-comma.MaxConcatenatedTokens
+    value:           '5'
+  - key:             bugprone-suspicious-missing-comma.RatioThreshold
+    value:           '0.200000'
+  - key:             bugprone-suspicious-missing-comma.SizeThreshold
+    value:           '5'
+  - key:             bugprone-suspicious-string-compare.StringCompareLikeFunctions
+    value:           ''
+  - key:             bugprone-suspicious-string-compare.WarnOnImplicitComparison
+    value:           '1'
+  - key:             bugprone-suspicious-string-compare.WarnOnLogicalNotComparison
+    value:           '0'
+  - key:             bugprone-too-small-loop-variable.MagnitudeBitsUpperLimit
+    value:           '16'
+  - key:             bugprone-unhandled-self-assignment.WarnOnlyIfThisHasSuspiciousField
+    value:           '1'
+  - key:             bugprone-unused-return-value.CheckedFunctions
+    value:           '::std::async;::std::launder;::std::remove;::std::remove_if;::std::unique;::std::unique_ptr::release;::std::basic_string::empty;::std::vector::empty'
+  - key:             cert-dcl16-c.IgnoreMacros
+    value:           '1'
+  - key:             cert-dcl16-c.NewSuffixes
+    value:           'L;LL;LU;LLU'
+  - key:             cert-dcl59-cpp.HeaderFileExtensions
+    value:           ',h,hh,hpp,hxx'
+  - key:             cert-err09-cpp.CheckThrowTemporaries
+    value:           '1'
+  - key:             cert-err61-cpp.CheckThrowTemporaries
+    value:           '1'
+  - key:             cert-msc32-c.DisallowedSeedTypes
+    value:           'time_t,std::time_t'
+  - key:             cert-msc51-cpp.DisallowedSeedTypes
+    value:           'time_t,std::time_t'
+  - key:             cert-oop11-cpp.IncludeStyle
+    value:           llvm
+  - key:             cert-oop54-cpp.WarnOnlyIfThisHasSuspiciousField
+    value:           '0'
+  - key:             cppcoreguidelines-avoid-magic-numbers.IgnoredFloatingPointValues
+    value:           '1.0;100.0;'
+  - key:             cppcoreguidelines-avoid-magic-numbers.IgnoredIntegerValues
+    value:           '1;2;3;4;'
+  - key:             cppcoreguidelines-explicit-virtual-functions.FinalSpelling
+    value:           final
+  - key:             cppcoreguidelines-explicit-virtual-functions.IgnoreDestructors
+    value:           '1'
+  - key:             cppcoreguidelines-explicit-virtual-functions.OverrideSpelling
+    value:           override
+  - key:             cppcoreguidelines-macro-usage.AllowedRegexp
+    value:           '^DEBUG_*'
+  - key:             cppcoreguidelines-macro-usage.CheckCapsOnly
+    value:           '0'
+  - key:             cppcoreguidelines-macro-usage.IgnoreCommandLineMacros
+    value:           '1'
+  - key:             cppcoreguidelines-no-malloc.Allocations
+    value:           '::malloc;::calloc'
+  - key:             cppcoreguidelines-no-malloc.Deallocations
+    value:           '::free'
+  - key:             cppcoreguidelines-no-malloc.Reallocations
+    value:           '::realloc'
+  - key:             cppcoreguidelines-non-private-member-variables-in-classes.IgnoreClassesWithAllMemberVariablesBeingPublic
+    value:           '1'
+  - key:             cppcoreguidelines-owning-memory.LegacyResourceConsumers
+    value:           '::free;::realloc;::freopen;::fclose'
+  - key:             cppcoreguidelines-owning-memory.LegacyResourceProducers
+    value:           '::malloc;::aligned_alloc;::realloc;::calloc;::fopen;::freopen;::tmpfile'
+  - key:             cppcoreguidelines-pro-bounds-constant-array-index.GslHeader
+    value:           ''
+  - key:             cppcoreguidelines-pro-bounds-constant-array-index.IncludeStyle
+    value:           '0'
+  - key:             cppcoreguidelines-pro-type-member-init.IgnoreArrays
+    value:           '0'
+  - key:             cppcoreguidelines-pro-type-member-init.UseAssignment
+    value:           '0'
+  - key:             cppcoreguidelines-special-member-functions.AllowMissingMoveFunctions
+    value:           '0'
+  - key:             cppcoreguidelines-special-member-functions.AllowSoleDefaultDtor
+    value:           '0'
+  - key:             fuchsia-header-anon-namespaces.HeaderFileExtensions
+    value:           ',h,hh,hpp,hxx'
+  - key:             fuchsia-restrict-system-includes.Includes
+    value:           '*'
+  - key:             google-build-namespaces.HeaderFileExtensions
+    value:           ',h,hh,hpp,hxx'
+  - key:             google-global-names-in-headers.HeaderFileExtensions
+    value:           ',h,hh,hpp,hxx'
+  - key:             google-readability-braces-around-statements.ShortStatementLines
+    value:           '1'
+  - key:             google-readability-function-size.BranchThreshold
+    value:           '4294967295'
+  - key:             google-readability-function-size.LineThreshold
+    value:           '4294967295'
+  - key:             google-readability-function-size.NestingThreshold
+    value:           '4294967295'
+  - key:             google-readability-function-size.ParameterThreshold
+    value:           '4294967295'
+  - key:             google-readability-function-size.StatementThreshold
+    value:           '800'
+  - key:             google-readability-function-size.VariableThreshold
+    value:           '4294967295'
+  - key:             google-readability-namespace-comments.ShortNamespaceLines
+    value:           '10'
+  - key:             google-readability-namespace-comments.SpacesBeforeComments
+    value:           '2'
+  - key:             google-runtime-int.SignedTypePrefix
+    value:           int
+  - key:             google-runtime-int.TypeSuffix
+    value:           ''
+  - key:             google-runtime-int.UnsignedTypePrefix
+    value:           uint
+  - key:             google-runtime-references.WhiteListTypes
+    value:           ''
+  - key:             hicpp-braces-around-statements.ShortStatementLines
+    value:           '0'
+  - key:             hicpp-function-size.BranchThreshold
+    value:           '4294967295'
+  - key:             hicpp-function-size.LineThreshold
+    value:           '4294967295'
+  - key:             hicpp-function-size.NestingThreshold
+    value:           '4294967295'
+  - key:             hicpp-function-size.ParameterThreshold
+    value:           '4294967295'
+  - key:             hicpp-function-size.StatementThreshold
+    value:           '800'
+  - key:             hicpp-function-size.VariableThreshold
+    value:           '4294967295'
+  - key:             hicpp-member-init.IgnoreArrays
+    value:           '0'
+  - key:             hicpp-member-init.UseAssignment
+    value:           '0'
+  - key:             hicpp-move-const-arg.CheckTriviallyCopyableMove
+    value:           '1'
+  - key:             hicpp-multiway-paths-covered.WarnOnMissingElse
+    value:           '0'
+  - key:             hicpp-named-parameter.IgnoreFailedSplit
+    value:           '0'
+  - key:             hicpp-no-malloc.Allocations
+    value:           '::malloc;::calloc'
+  - key:             hicpp-no-malloc.Deallocations
+    value:           '::free'
+  - key:             hicpp-no-malloc.Reallocations
+    value:           '::realloc'
+  - key:             hicpp-signed-bitwise.IgnorePositiveIntegerLiterals
+    value:           'true'
+  - key:             hicpp-special-member-functions.AllowMissingMoveFunctions
+    value:           '0'
+  - key:             hicpp-special-member-functions.AllowSoleDefaultDtor
+    value:           '0'
+  - key:             hicpp-uppercase-literal-suffix.IgnoreMacros
+    value:           '1'
+  - key:             hicpp-uppercase-literal-suffix.NewSuffixes
+    value:           ''
+  - key:             hicpp-use-auto.MinTypeNameLength
+    value:           '5'
+  - key:             hicpp-use-auto.RemoveStars
+    value:           '0'
+  - key:             hicpp-use-emplace.ContainersWithPushBack
+    value:           '::std::vector;::std::list;::std::deque'
+  - key:             hicpp-use-emplace.SmartPointers
+    value:           '::std::shared_ptr;::std::unique_ptr;::std::auto_ptr;::std::weak_ptr'
+  - key:             hicpp-use-emplace.TupleMakeFunctions
+    value:           '::std::make_pair;::std::make_tuple'
+  - key:             hicpp-use-emplace.TupleTypes
+    value:           '::std::pair;::std::tuple'
+  - key:             hicpp-use-equals-default.IgnoreMacros
+    value:           '1'
+  - key:             hicpp-use-equals-delete.IgnoreMacros
+    value:           '1'
+  - key:             hicpp-use-noexcept.ReplacementString
+    value:           ''
+  - key:             hicpp-use-noexcept.UseNoexceptFalse
+    value:           '1'
+  - key:             hicpp-use-nullptr.NullMacros
+    value:           ''
+  - key:             hicpp-use-override.FinalSpelling
+    value:           final
+  - key:             hicpp-use-override.IgnoreDestructors
+    value:           '0'
+  - key:             hicpp-use-override.OverrideSpelling
+    value:           override
+  - key:             llvm-namespace-comment.ShortNamespaceLines
+    value:           '1'
+  - key:             llvm-namespace-comment.SpacesBeforeComments
+    value:           '1'
+  - key:             misc-definitions-in-headers.HeaderFileExtensions
+    value:           ',h,hh,hpp,hxx'
+  - key:             misc-definitions-in-headers.UseHeaderFileExtension
+    value:           '1'
+  - key:             misc-throw-by-value-catch-by-reference.CheckThrowTemporaries
+    value:           '1'
+  - key:             misc-unused-parameters.StrictMode
+    value:           '0'
+  - key:             modernize-loop-convert.MaxCopySize
+    value:           '16'
+  - key:             modernize-loop-convert.MinConfidence
+    value:           reasonable
+  - key:             modernize-loop-convert.NamingStyle
+    value:           CamelCase
+  - key:             modernize-make-shared.IgnoreMacros
+    value:           '1'
+  - key:             modernize-make-shared.IncludeStyle
+    value:           '0'
+  - key:             modernize-make-shared.MakeSmartPtrFunction
+    value:           'std::make_shared'
+  - key:             modernize-make-shared.MakeSmartPtrFunctionHeader
+    value:           memory
+  - key:             modernize-make-unique.IgnoreMacros
+    value:           '1'
+  - key:             modernize-make-unique.IncludeStyle
+    value:           '0'
+  - key:             modernize-make-unique.MakeSmartPtrFunction
+    value:           'std::make_unique'
+  - key:             modernize-make-unique.MakeSmartPtrFunctionHeader
+    value:           memory
+  - key:             modernize-pass-by-value.IncludeStyle
+    value:           llvm
+  - key:             modernize-pass-by-value.ValuesOnly
+    value:           '0'
+  - key:             modernize-raw-string-literal.ReplaceShorterLiterals
+    value:           '0'
+  - key:             modernize-replace-auto-ptr.IncludeStyle
+    value:           llvm
+  - key:             modernize-replace-random-shuffle.IncludeStyle
+    value:           llvm
+  - key:             modernize-use-auto.MinTypeNameLength
+    value:           '5'
+  - key:             modernize-use-auto.RemoveStars
+    value:           '0'
+  - key:             modernize-use-default-member-init.IgnoreMacros
+    value:           '1'
+  - key:             modernize-use-default-member-init.UseAssignment
+    value:           '0'
+  - key:             modernize-use-emplace.ContainersWithPushBack
+    value:           '::std::vector;::std::list;::std::deque'
+  - key:             modernize-use-emplace.SmartPointers
+    value:           '::std::shared_ptr;::std::unique_ptr;::std::auto_ptr;::std::weak_ptr'
+  - key:             modernize-use-emplace.TupleMakeFunctions
+    value:           '::std::make_pair;::std::make_tuple'
+  - key:             modernize-use-emplace.TupleTypes
+    value:           '::std::pair;::std::tuple'
+  - key:             modernize-use-equals-default.IgnoreMacros
+    value:           '1'
+  - key:             modernize-use-equals-delete.IgnoreMacros
+    value:           '1'
+  - key:             modernize-use-nodiscard.ReplacementString
+    value:           '[[nodiscard]]'
+  - key:             modernize-use-noexcept.ReplacementString
+    value:           ''
+  - key:             modernize-use-noexcept.UseNoexceptFalse
+    value:           '1'
+  - key:             modernize-use-nullptr.NullMacros
+    value:           'NULL'
+  - key:             modernize-use-override.FinalSpelling
+    value:           final
+  - key:             modernize-use-override.IgnoreDestructors
+    value:           '0'
+  - key:             modernize-use-override.OverrideSpelling
+    value:           override
+  - key:             modernize-use-transparent-functors.SafeMode
+    value:           '0'
+  - key:             modernize-use-using.IgnoreMacros
+    value:           '1'
+  - key:             objc-forbidden-subclassing.ForbiddenSuperClassNames
+    value:           'ABNewPersonViewController;ABPeoplePickerNavigationController;ABPersonViewController;ABUnknownPersonViewController;NSHashTable;NSMapTable;NSPointerArray;NSPointerFunctions;NSTimer;UIActionSheet;UIAlertView;UIImagePickerController;UITextInputMode;UIWebView'
+  - key:             openmp-exception-escape.IgnoredExceptions
+    value:           ''
+  - key:             performance-faster-string-find.StringLikeClasses
+    value:           'std::basic_string'
+  - key:             performance-for-range-copy.AllowedTypes
+    value:           ''
+  - key:             performance-for-range-copy.WarnOnAllAutoCopies
+    value:           '0'
+  - key:             performance-inefficient-string-concatenation.StrictMode
+    value:           '0'
+  - key:             performance-inefficient-vector-operation.VectorLikeClasses
+    value:           '::std::vector'
+  - key:             performance-move-const-arg.CheckTriviallyCopyableMove
+    value:           '1'
+  - key:             performance-move-constructor-init.IncludeStyle
+    value:           llvm
+  - key:             performance-type-promotion-in-math-fn.IncludeStyle
+    value:           llvm
+  - key:             performance-unnecessary-copy-initialization.AllowedTypes
+    value:           'Array$;SparseArray*'
+  - key:             performance-unnecessary-value-param.AllowedTypes
+    value:           'CParam'
+  - key:             performance-unnecessary-value-param.IncludeStyle
+    value:           llvm
+  - key:             portability-simd-intrinsics.Std
+    value:           ''
+  - key:             portability-simd-intrinsics.Suggest
+    value:           '0'
+  - key:             readability-braces-around-statements.ShortStatementLines
+    value:           '0'
+  - key:             readability-function-size.BranchThreshold
+    value:           '4294967295'
+  - key:             readability-function-size.LineThreshold
+    value:           '4294967295'
+  - key:             readability-function-size.NestingThreshold
+    value:           '4294967295'
+  - key:             readability-function-size.ParameterThreshold
+    value:           '4294967295'
+  - key:             readability-function-size.StatementThreshold
+    value:           '800'
+  - key:             readability-function-size.VariableThreshold
+    value:           '4294967295'
+  - key:             readability-identifier-naming.IgnoreFailedSplit
+    value:           '0'
+  - key:             readability-implicit-bool-conversion.AllowIntegerConditions
+    value:           '0'
+  - key:             readability-implicit-bool-conversion.AllowPointerConditions
+    value:           '0'
+  - key:             readability-inconsistent-declaration-parameter-name.IgnoreMacros
+    value:           '1'
+  - key:             readability-inconsistent-declaration-parameter-name.Strict
+    value:           '0'
+  - key:             readability-magic-numbers.IgnoredFloatingPointValues
+    value:           '1.0;100.0;'
+  - key:             readability-magic-numbers.IgnoredIntegerValues
+    value:           '1;2;3;4;'
+  - key:             readability-redundant-smartptr-get.IgnoreMacros
+    value:           '1'
+  - key:             readability-simplify-boolean-expr.ChainedConditionalAssignment
+    value:           '0'
+  - key:             readability-simplify-boolean-expr.ChainedConditionalReturn
+    value:           '0'
+  - key:             readability-simplify-subscript-expr.Types
+    value:           '::std::basic_string;::std::basic_string_view;::std::vector;::std::array'
+  - key:             readability-static-accessed-through-instance.NameSpecifierNestingThreshold
+    value:           '3'
+  - key:             readability-uppercase-literal-suffix.IgnoreMacros
+    value:           '1'
+  - key:             readability-uppercase-literal-suffix.NewSuffixes
+    value:           'f,U,L,UL,LL,ULL'
+  - key:             zircon-temporary-objects.Names
+    value:           ''
+...
diff --git a/src/api/c/CMakeLists.txt b/src/api/c/CMakeLists.txt
new file mode 100644
index 0000000000..d374b9a669
--- /dev/null
+++ b/src/api/c/CMakeLists.txt
@@ -0,0 +1,207 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+add_library(c_api_interface INTERFACE)
+
+target_sources(c_api_interface
+  INTERFACE
+  ${ArrayFire_SOURCE_DIR}/include/arrayfire.h
+  ${ArrayFire_SOURCE_DIR}/include/af/algorithm.h
+  ${ArrayFire_SOURCE_DIR}/include/af/arith.h
+  ${ArrayFire_SOURCE_DIR}/include/af/array.h
+  ${ArrayFire_SOURCE_DIR}/include/af/backend.h
+  ${ArrayFire_SOURCE_DIR}/include/af/blas.h
+  ${ArrayFire_SOURCE_DIR}/include/af/compatible.h
+  ${ArrayFire_SOURCE_DIR}/include/af/complex.h
+  ${ArrayFire_SOURCE_DIR}/include/af/constants.h
+  ${ArrayFire_SOURCE_DIR}/include/af/cuda.h
+  ${ArrayFire_SOURCE_DIR}/include/af/data.h
+  ${ArrayFire_SOURCE_DIR}/include/af/defines.h
+  ${ArrayFire_SOURCE_DIR}/include/af/device.h
+  ${ArrayFire_SOURCE_DIR}/include/af/dim4.hpp
+  ${ArrayFire_SOURCE_DIR}/include/af/event.h
+  ${ArrayFire_SOURCE_DIR}/include/af/exception.h
+  ${ArrayFire_SOURCE_DIR}/include/af/features.h
+  ${ArrayFire_SOURCE_DIR}/include/af/gfor.h
+  ${ArrayFire_SOURCE_DIR}/include/af/graphics.h
+  ${ArrayFire_SOURCE_DIR}/include/af/image.h
+  ${ArrayFire_SOURCE_DIR}/include/af/index.h
+  ${ArrayFire_SOURCE_DIR}/include/af/internal.h
+  ${ArrayFire_SOURCE_DIR}/include/af/lapack.h
+  ${ArrayFire_SOURCE_DIR}/include/af/macros.h
+  ${ArrayFire_SOURCE_DIR}/include/af/ml.h
+  ${ArrayFire_SOURCE_DIR}/include/af/memory.h
+  ${ArrayFire_SOURCE_DIR}/include/af/opencl.h
+  ${ArrayFire_SOURCE_DIR}/include/af/random.h
+  ${ArrayFire_SOURCE_DIR}/include/af/seq.h
+  ${ArrayFire_SOURCE_DIR}/include/af/signal.h
+  ${ArrayFire_SOURCE_DIR}/include/af/sparse.h
+  ${ArrayFire_SOURCE_DIR}/include/af/statistics.h
+  ${ArrayFire_SOURCE_DIR}/include/af/timing.h
+  ${ArrayFire_SOURCE_DIR}/include/af/traits.hpp
+  ${ArrayFire_SOURCE_DIR}/include/af/util.h
+  ${ArrayFire_SOURCE_DIR}/include/af/vision.h
+  ${ArrayFire_BINARY_DIR}/include/af/version.h
+  )
+
+target_sources(c_api_interface
+  INTERFACE
+    ${CMAKE_CURRENT_SOURCE_DIR}/anisotropic_diffusion.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/approx.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/array.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/assign.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/bilateral.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/blas.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/canny.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/cast.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/cholesky.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/clamp.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/colorspace.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/complex.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/confidence_connected.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/convolve.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/corrcoef.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/covariance.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/data.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/deconvolution.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/det.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/diff.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/dog.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/events.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/events.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/error.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/exampleFunction.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fast.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/features.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/features.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fft_common.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fftconvolve.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/filters.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/flip.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/gaussian_kernel.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/gradient.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/hamming.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/handle.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/handle.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/harris.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/hist.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/histeq.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/histogram.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/homography.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/hsv_rgb.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/iir.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/image.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/imageio.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/imageio2.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/implicit.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/implicit.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/imgproc_common.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/index.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/internal.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/inverse.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit_test_api.h
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit_test_api.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/join.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/lu.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/match_template.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/mean.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/meanshift.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/median.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/memory.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/memoryapi.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/moddims.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/moments.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/morph.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/nearest_neighbour.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/norm.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/optypes.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/orb.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/pinverse.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/plot.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/print.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/qr.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/random.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/rank.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/regions.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/reorder.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/replace.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/resize.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/rgb_gray.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/rotate.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sat.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/select.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/set.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/shift.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sift.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sobel.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/solve.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sparse.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sparse_handle.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/stdev.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/stream.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/surface.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/susan.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/svd.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/tile.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/topk.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transform.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transform_coordinates.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transpose.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/type_util.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/type_util.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/unary.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/unwrap.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/var.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/vector_field.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/version.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/where.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/window.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/wrap.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ycbcr_rgb.cpp
+    )
+
+if(FreeImage_FOUND AND AF_WITH_IMAGEIO)
+  target_compile_definitions(c_api_interface INTERFACE WITH_FREEIMAGE)
+  if (AF_WITH_STATIC_FREEIMAGE)
+    target_compile_definitions(c_api_interface INTERFACE FREEIMAGE_STATIC)
+    target_link_libraries(c_api_interface INTERFACE FreeImage::FreeImage_STATIC)
+  else ()
+    target_include_directories(c_api_interface SYSTEM INTERFACE $<TARGET_PROPERTY:FreeImage::FreeImage,INTERFACE_INCLUDE_DIRECTORIES>)
+    if (WIN32 AND AF_INSTALL_STANDALONE)
+      install(FILES $<TARGET_FILE:FreeImage::FreeImage>
+        DESTINATION ${AF_INSTALL_BIN_DIR}
+        COMPONENT common_backend_dependencies)
+    endif ()
+  endif ()
+endif()
+
+if(BUILD_WITH_MKL)
+  # Set the MKL thread layer compile definition from the MKL_THREAD_LAYER CMake cache variable
+  if(MKL_THREAD_LAYER STREQUAL "Sequential")
+    target_compile_definitions(c_api_interface INTERFACE AF_MKL_THREAD_LAYER=0)
+  elseif(MKL_THREAD_LAYER STREQUAL "GNU OpenMP")
+    target_compile_definitions(c_api_interface INTERFACE AF_MKL_THREAD_LAYER=1)
+  elseif(MKL_THREAD_LAYER STREQUAL "Intel OpenMP")
+    target_compile_definitions(c_api_interface INTERFACE AF_MKL_THREAD_LAYER=2)
+  else() # Default: Intel thread layer for ArrayFire
+    target_compile_definitions(c_api_interface INTERFACE AF_MKL_THREAD_LAYER=3)
+  endif()
+endif()
+
+target_include_directories(c_api_interface
+  INTERFACE
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${CMAKE_SOURCE_DIR}/src/backend
+    ${CMAKE_SOURCE_DIR}/include
+    $<TARGET_PROPERTY:afcommon_interface,INTERFACE_INCLUDE_DIRECTORIES>
+    )
diff --git a/src/api/c/anisotropic_diffusion.cpp b/src/api/c/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..6268accb3b
--- /dev/null
+++ b/src/api/c/anisotropic_diffusion.cpp
@@ -0,0 +1,104 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <anisotropic_diffusion.hpp>
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <gradient.hpp>
+#include <handle.hpp>
+#include <reduce.hpp>
+
+#include <af/dim4.hpp>
+#include <af/image.h>
+
+#include <type_traits>
+
+using af::dim4;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::getScalar;
+using detail::gradient;
+using detail::reduce_all;
+
+template<typename T>
+af_array diffusion(const Array<float>& in, const float dt, const float K,
+                   const unsigned iterations, const af_flux_function fftype,
+                   const af::diffusionEq eq) {
+    auto out  = copyArray(in);
+    auto dims = out.dims();
+    auto g0   = createEmptyArray<float>(dims);
+    auto g1   = createEmptyArray<float>(dims);
+    float cnst =
+        -2.0f * K * K / dims.elements();  // NOLINT(readability-magic-numbers)
+
+    for (unsigned i = 0; i < iterations; ++i) {
+        gradient<float>(g0, g1, out);
+
+        auto g0Sqr = arithOp<float, af_mul_t>(g0, g0, dims);
+        auto g1Sqr = arithOp<float, af_mul_t>(g1, g1, dims);
+        auto sumd  = arithOp<float, af_add_t>(g0Sqr, g1Sqr, dims);
+        float avg =
+            getScalar<float>(reduce_all<af_add_t, float, float>(sumd, true, 0));
+
+        anisotropicDiffusion(out, dt, 1.0f / (cnst * avg), fftype, eq);
+    }
+
+    return getHandle(cast<T, float>(out));
+}
+
+af_err af_anisotropic_diffusion(af_array* out, const af_array in,
+                                const float dt, const float K,
+                                const unsigned iterations,
+                                const af_flux_function fftype,
+                                const af_diffusion_eq eq) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+
+        const af::dim4& inputDimensions = info.dims();
+        const af_dtype inputType        = info.getType();
+        const unsigned inputNumDims     = inputDimensions.ndims();
+
+        DIM_ASSERT(1, (inputNumDims >= 2));
+
+        ARG_ASSERT(3, (K > 0 || K < 0));
+        ARG_ASSERT(4, (iterations > 0));
+
+        const af_flux_function F =
+            (fftype == AF_FLUX_DEFAULT ? AF_FLUX_EXPONENTIAL : fftype);
+
+        auto input = castArray<float>(in);
+
+        af_array output = nullptr;
+        switch (inputType) {
+            case f64:
+                output = diffusion<double>(input, dt, K, iterations, F, eq);
+                break;
+            case f32:
+            case s32:
+            case u32:
+            case s16:
+            case u16:
+            case s8:
+            case u8:
+                output = diffusion<float>(input, dt, K, iterations, F, eq);
+                break;
+            default: TYPE_ERROR(1, inputType);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
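
Reviewer note: the scalar that `diffusion()` passes to `anisotropicDiffusion` on each iteration works out to -N / (2 K^2 * sum(g0^2 + g1^2)), where N is the element count. The standalone sketch below reproduces only that arithmetic on plain vectors. `conductanceParam` is a hypothetical helper name, not an ArrayFire function, and the gradient values are assumed to be given rather than computed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an ArrayFire function): mirrors the scalar that the
// diffusion() wrapper derives each iteration from the two gradient arrays.
float conductanceParam(const std::vector<float>& g0,
                       const std::vector<float>& g1, float K) {
    assert(g0.size() == g1.size() && !g0.empty());

    // Sum of squared gradient magnitudes, like reduce_all over g0^2 + g1^2.
    float sum = 0.0f;
    for (std::size_t i = 0; i < g0.size(); ++i) {
        sum += g0[i] * g0[i] + g1[i] * g1[i];
    }

    // cnst = -2 K^2 / N; the wrapper then passes 1 / (cnst * sum).
    const float cnst = -2.0f * K * K / static_cast<float>(g0.size());
    return 1.0f / (cnst * sum);
}
```

For four elements with unit gradients in both directions and K = 1, the sum is 8, cnst is -0.5, and the result is -0.25. A constant image gives a zero sum and a non-finite result; this sketch, like the arithmetic it mirrors, does not guard against that.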
diff --git a/src/api/c/approx.cpp b/src/api/c/approx.cpp
index 1bc7723fdf..5d5f6acb00 100644
--- a/src/api/c/approx.cpp
+++ b/src/api/c/approx.cpp
@@ -7,95 +7,288 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <approx.hpp>
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+
 #include <af/array.h>
-#include <af/signal.h>
 #include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <approx.hpp>
+#include <af/signal.h>
 
 using af::dim4;
-using namespace detail;
+using detail::approx1;
+using detail::approx2;
+using detail::cdouble;
+using detail::cfloat;
 
+namespace {
 template<typename Ty, typename Tp>
-static inline af_array approx1(const af_array in, const af_array pos,
-                               const af_interp_type method, const float offGrid)
-{
-    return getHandle(approx1<Ty>(getArray<Ty>(in), getArray<Tp>(pos), method, offGrid));
+inline void approx1(af_array *yo, const af_array yi, const af_array xo,
+                    const int xdim, const Tp &xi_beg, const Tp &xi_step,
+                    const af_interp_type method, const float offGrid) {
+    approx1<Ty>(getArray<Ty>(*yo), getArray<Ty>(yi), getArray<Tp>(xo), xdim,
+                xi_beg, xi_step, method, offGrid);
 }
+}  // namespace
 
 template<typename Ty, typename Tp>
-static inline af_array approx2(const af_array in, const af_array pos0, const af_array pos1,
-                               const af_interp_type method, const float offGrid)
-{
-    return getHandle(approx2<Ty>(getArray<Ty>(in), getArray<Tp>(pos0), getArray<Tp>(pos1),
-                                 method, offGrid));
+inline void approx2(af_array *zo, const af_array zi, const af_array xo,
+                    const int xdim, const Tp &xi_beg, const Tp &xi_step,
+                    const af_array yo, const int ydim, const Tp &yi_beg,
+                    const Tp &yi_step, const af_interp_type method,
+                    const float offGrid) {
+    approx2<Ty>(getArray<Ty>(*zo), getArray<Ty>(zi), getArray<Tp>(xo), xdim,
+                xi_beg, xi_step, getArray<Tp>(yo), ydim, yi_beg, yi_step,
+                method, offGrid);
 }
 
-af_err af_approx1(af_array *out, const af_array in, const af_array pos,
-                  const af_interp_type method, const float offGrid)
-{
-    try {
-        ArrayInfo i_info = getInfo(in);
-        ArrayInfo p_info = getInfo(pos);
-
-        af_dtype itype = i_info.getType();
-
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
-        ARG_ASSERT(2, p_info.isRealFloating());                   // Only floating types
-        ARG_ASSERT(1, i_info.isSingle() == p_info.isSingle());    // Must have same precision
-        ARG_ASSERT(1, i_info.isDouble() == p_info.isDouble());    // Must have same precision
-        DIM_ASSERT(2, p_info.isColumn());                         // Only 1D input allowed
-        ARG_ASSERT(3, (method == AF_INTERP_LINEAR || method == AF_INTERP_NEAREST));
-
-        af_array output;
-
-        switch(itype) {
-            case f32: output = approx1<float  , float >(in, pos, method, offGrid);  break;
-            case f64: output = approx1<double , double>(in, pos, method, offGrid);  break;
-            case c32: output = approx1<cfloat , float >(in, pos, method, offGrid);  break;
-            case c64: output = approx1<cdouble, double>(in, pos, method, offGrid);  break;
-            default:  TYPE_ERROR(1, itype);
+void af_approx1_common(af_array *yo, const af_array yi, const af_array xo,
+                       const int xdim, const double xi_beg,
+                       const double xi_step, const af_interp_type method,
+                       const float offGrid, const bool allocate_yo) {
+    ARG_ASSERT(0, yo != 0);  // *yo (the af_array) can be null, but not yo
+    ARG_ASSERT(1, yi != 0);
+    ARG_ASSERT(2, xo != 0);
+
+    const ArrayInfo &yi_info = getInfo(yi);
+    const ArrayInfo &xo_info = getInfo(xo);
+
+    const dim4 &yi_dims = yi_info.dims();
+    const dim4 &xo_dims = xo_info.dims();
+    dim4 yo_dims        = yi_dims;
+    yo_dims[xdim]       = xo_dims[xdim];
+
+    ARG_ASSERT(1, yi_info.isFloating());      // Only floating and complex types
+    ARG_ASSERT(2, xo_info.isRealFloating());  // Only floating types
+    ARG_ASSERT(1, yi_info.isSingle() ==
+                      xo_info.isSingle());  // Must have same precision
+    ARG_ASSERT(1, yi_info.isDouble() ==
+                      xo_info.isDouble());  // Must have same precision
+    ARG_ASSERT(3, xdim >= 0 && xdim < 4);
+
+    // xo should either be a vector along xdim, or match yi_dims in every
+    // dimension other than xdim
+    if (xo_dims[xdim] != xo_dims.elements()) {
+        for (int i = 0; i < 4; i++) {
+            if (xdim != i) { DIM_ASSERT(2, xo_dims[i] == yi_dims[i]); }
         }
-        std::swap(*out,output);
+    }
+
+    ARG_ASSERT(5, xi_step != 0);
+    ARG_ASSERT(
+        6, (method == AF_INTERP_CUBIC || method == AF_INTERP_CUBIC_SPLINE ||
+            method == AF_INTERP_LINEAR || method == AF_INTERP_LINEAR_COSINE ||
+            method == AF_INTERP_LOWER || method == AF_INTERP_NEAREST));
+
+    if (yi_dims.ndims() == 0 || xo_dims.ndims() == 0) {
+        af_create_handle(yo, 0, nullptr, yi_info.getType());
+        return;
+    }
+
+    if (allocate_yo) { *yo = createHandle(yo_dims, yi_info.getType()); }
+
+    DIM_ASSERT(0, getInfo(*yo).dims() == yo_dims);
+
+    switch (yi_info.getType()) {
+        case f32:
+            approx1<float, float>(yo, yi, xo, xdim, xi_beg, xi_step, method,
+                                  offGrid);
+            break;
+        case f64:
+            approx1<double, double>(yo, yi, xo, xdim, xi_beg, xi_step, method,
+                                    offGrid);
+            break;
+        case c32:
+            approx1<cfloat, float>(yo, yi, xo, xdim, xi_beg, xi_step, method,
+                                   offGrid);
+            break;
+        case c64:
+            approx1<cdouble, double>(yo, yi, xo, xdim, xi_beg, xi_step, method,
+                                     offGrid);
+            break;
+        default: TYPE_ERROR(1, yi_info.getType());
+    }
+}
+
+af_err af_approx1_uniform(af_array *yo, const af_array yi, const af_array xo,
+                          const int xdim, const double xi_beg,
+                          const double xi_step, const af_interp_type method,
+                          const float offGrid) {
+    try {
+        af_approx1_common(yo, yi, xo, xdim, xi_beg, xi_step, method, offGrid,
+                          true);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_approx2(af_array *out, const af_array in, const af_array pos0, const af_array pos1,
-                  const af_interp_type method, const float offGrid)
-{
+af_err af_approx1_uniform_v2(af_array *yo, const af_array yi, const af_array xo,
+                             const int xdim, const double xi_beg,
+                             const double xi_step, const af_interp_type method,
+                             const float offGrid) {
     try {
-        ArrayInfo i_info = getInfo(in);
-        ArrayInfo p_info = getInfo(pos0);
-        ArrayInfo q_info = getInfo(pos1);
-
-        af_dtype itype = i_info.getType();
-
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
-        ARG_ASSERT(2, p_info.isRealFloating());                   // Only floating types
-        ARG_ASSERT(3, q_info.isRealFloating());                   // Only floating types
-        ARG_ASSERT(1, p_info.getType() == q_info.getType());      // Must have same type
-        ARG_ASSERT(1, i_info.isSingle() == p_info.isSingle());    // Must have same precision
-        ARG_ASSERT(1, i_info.isDouble() == p_info.isDouble());    // Must have same precision
-        DIM_ASSERT(2, p_info.dims() == q_info.dims());            // POS0 and POS1 must have same dims
-        DIM_ASSERT(2, p_info.ndims() < 3);// Allowing input batch but not positions. Output dims = (px, py, iz, iw)
-        ARG_ASSERT(3, (method == AF_INTERP_LINEAR || method == AF_INTERP_NEAREST));
-
-        af_array output;
-
-        switch(itype) {
-            case f32: output = approx2<float  , float >(in, pos0, pos1, method, offGrid);  break;
-            case f64: output = approx2<double , double>(in, pos0, pos1, method, offGrid);  break;
-            case c32: output = approx2<cfloat , float >(in, pos0, pos1, method, offGrid);  break;
-            case c64: output = approx2<cdouble, double>(in, pos0, pos1, method, offGrid);  break;
-            default:  TYPE_ERROR(1, itype);
+        ARG_ASSERT(0, yo != 0);  // need to dereference yo in next call
+        af_approx1_common(yo, yi, xo, xdim, xi_beg, xi_step, method, offGrid,
+                          *yo == 0);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_approx1(af_array *yo, const af_array yi, const af_array xo,
+                  const af_interp_type method, const float offGrid) {
+    try {
+        af_approx1_common(yo, yi, xo, 0, 0.0, 1.0, method, offGrid, true);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_approx1_v2(af_array *yo, const af_array yi, const af_array xo,
+                     const af_interp_type method, const float offGrid) {
+    try {
+        ARG_ASSERT(0, yo != 0);  // need to dereference yo in next call
+        af_approx1_common(yo, yi, xo, 0, 0.0, 1.0, method, offGrid, *yo == 0);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+void af_approx2_common(af_array *zo, const af_array zi, const af_array xo,
+                       const int xdim, const double xi_beg,
+                       const double xi_step, const af_array yo, const int ydim,
+                       const double yi_beg, const double yi_step,
+                       const af_interp_type method, const float offGrid,
+                       bool allocate_zo) {
+    ARG_ASSERT(0, zo != 0);  // *zo (the af_array) can be null, but not zo
+    ARG_ASSERT(1, zi != 0);
+    ARG_ASSERT(2, xo != 0);
+    ARG_ASSERT(6, yo != 0);
+
+    const ArrayInfo &zi_info = getInfo(zi);
+    const ArrayInfo &xo_info = getInfo(xo);
+    const ArrayInfo &yo_info = getInfo(yo);
+
+    dim4 zi_dims = zi_info.dims();
+    dim4 xo_dims = xo_info.dims();
+    dim4 yo_dims = yo_info.dims();
+
+    ARG_ASSERT(1, zi_info.isFloating());      // Only floating and complex types
+    ARG_ASSERT(2, xo_info.isRealFloating());  // Only floating types
+    ARG_ASSERT(6, yo_info.isRealFloating());  // Only floating types
+    ARG_ASSERT(2,
+               xo_info.getType() == yo_info.getType());  // Must have same type
+    ARG_ASSERT(1, zi_info.isSingle() ==
+                      xo_info.isSingle());  // Must have same precision
+    ARG_ASSERT(1, zi_info.isDouble() ==
+                      xo_info.isDouble());  // Must have same precision
+    DIM_ASSERT(2, xo_dims == yo_dims);      // xo and yo must have same dims
+
+    ARG_ASSERT(3, xdim >= 0 && xdim < 4);
+    ARG_ASSERT(7, ydim >= 0 && ydim < 4);
+    ARG_ASSERT(5, xi_step != 0);
+    ARG_ASSERT(9, yi_step != 0);
+
+    // xo should either span only xdim and ydim, or match zi_dims elsewhere
+    if (xo_dims[xdim] * xo_dims[ydim] != xo_dims.elements()) {
+        for (int i = 0; i < 4; i++) {
+            if (xdim != i && ydim != i) {
+                DIM_ASSERT(2, xo_dims[i] == zi_dims[i]);
+            }
         }
-        std::swap(*out,output);
+    }
+
+    if (zi_dims.ndims() == 0 || xo_dims.ndims() == 0 || yo_dims.ndims() == 0) {
+        af_create_handle(zo, 0, nullptr, zi_info.getType());
+        return;
+    }
+
+    dim4 zo_dims  = zi_info.dims();
+    zo_dims[xdim] = xo_info.dims()[xdim];
+    zo_dims[ydim] = xo_info.dims()[ydim];
+
+    if (allocate_zo) { *zo = createHandle(zo_dims, zi_info.getType()); }
+
+    DIM_ASSERT(0, getInfo(*zo).dims() == zo_dims);
+
+    switch (zi_info.getType()) {
+        case f32:
+            approx2<float, float>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                  yi_beg, yi_step, method, offGrid);
+            break;
+        case f64:
+            approx2<double, double>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, method, offGrid);
+            break;
+        case c32:
+            approx2<cfloat, float>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                   yi_beg, yi_step, method, offGrid);
+            break;
+        case c64:
+            approx2<cdouble, double>(zo, zi, xo, xdim, xi_beg, xi_step, yo,
+                                     ydim, yi_beg, yi_step, method, offGrid);
+            break;
+        default: TYPE_ERROR(1, zi_info.getType());
+    }
+}
+
+af_err af_approx2_uniform(af_array *zo, const af_array zi, const af_array xo,
+                          const int xdim, const double xi_beg,
+                          const double xi_step, const af_array yo,
+                          const int ydim, const double yi_beg,
+                          const double yi_step, const af_interp_type method,
+                          const float offGrid) {
+    try {
+        af_approx2_common(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim, yi_beg,
+                          yi_step, method, offGrid, true);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_approx2_uniform_v2(af_array *zo, const af_array zi, const af_array xo,
+                             const int xdim, const double xi_beg,
+                             const double xi_step, const af_array yo,
+                             const int ydim, const double yi_beg,
+                             const double yi_step, const af_interp_type method,
+                             const float offGrid) {
+    try {
+        ARG_ASSERT(0, zo != 0);  // need to dereference zo in next call
+        af_approx2_common(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim, yi_beg,
+                          yi_step, method, offGrid, *zo == 0);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_approx2(af_array *zo, const af_array zi, const af_array xo,
+                  const af_array yo, const af_interp_type method,
+                  const float offGrid) {
+    try {
+        af_approx2_common(zo, zi, xo, 0, 0.0, 1.0, yo, 1, 0.0, 1.0, method,
+                          offGrid, true);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_approx2_v2(af_array *zo, const af_array zi, const af_array xo,
+                     const af_array yo, const af_interp_type method,
+                     const float offGrid) {
+    try {
+        ARG_ASSERT(0, zo != 0);  // need to dereference zo in next call
+        af_approx2_common(zo, zi, xo, 0, 0.0, 1.0, yo, 1, 0.0, 1.0, method,
+                          offGrid, *zo == 0);
     }
     CATCHALL;
 
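
Reviewer note: the new `xi_beg` / `xi_step` parameters describe a uniformly spaced input grid, and `offGrid` is the value returned for query positions outside it. The scalar sketch below shows that mapping for linear interpolation only. `approx1UniformRef` is a hypothetical reference function written for this note, not ArrayFire code, and its treatment of the exact grid boundaries is an assumption that the backends may not share.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical scalar reference (not ArrayFire code): linear interpolation of
// samples yi taken at xi_beg, xi_beg + xi_step, xi_beg + 2 * xi_step, ...
// Query positions outside the sampled range yield offGrid.
std::vector<double> approx1UniformRef(const std::vector<double>& yi,
                                      const std::vector<double>& xo,
                                      double xi_beg, double xi_step,
                                      double offGrid) {
    assert(xi_step != 0.0 && !yi.empty());

    std::vector<double> yo;
    yo.reserve(xo.size());
    const double last = static_cast<double>(yi.size() - 1);

    for (double x : xo) {
        // Map the query position to a fractional index into yi.
        const double idx = (x - xi_beg) / xi_step;
        // Written as a negated range test so that NaN also counts as off-grid.
        if (!(idx >= 0.0 && idx <= last)) {
            yo.push_back(offGrid);
            continue;
        }
        const std::size_t lo = static_cast<std::size_t>(std::floor(idx));
        const std::size_t hi = (lo + 1 < yi.size()) ? lo + 1 : lo;
        const double t       = idx - static_cast<double>(lo);
        yo.push_back((1.0 - t) * yi[lo] + t * yi[hi]);
    }
    return yo;
}
```

With `yi = {10, 20, 30}` sampled at 1.0, 1.5 and 2.0 (`xi_beg = 1.0`, `xi_step = 0.5`), the query 1.25 maps to index 0.5 and interpolates to 15, 2.0 lands on the last sample, and 3.0 falls off the grid and returns `offGrid`.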
diff --git a/src/api/c/array.cpp b/src/api/c/array.cpp
new file mode 100644
index 0000000000..d164faabdb
--- /dev/null
+++ b/src/api/c/array.cpp
@@ -0,0 +1,483 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <platform.hpp>
+#include <sparse.hpp>
+#include <sparse_handle.hpp>
+#include <af/sparse.h>
+
+using af::dim4;
+using arrayfire::copyData;
+using arrayfire::copySparseArray;
+using arrayfire::getSparseArrayBase;
+using arrayfire::getUseCount;
+using arrayfire::releaseHandle;
+using arrayfire::releaseSparseHandle;
+using arrayfire::retainSparseHandle;
+using arrayfire::common::half;
+using arrayfire::common::SparseArrayBase;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createDeviceDataArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+af_err af_get_data_ptr(void *data, const af_array arr) {
+    try {
+        af_dtype type = getInfo(arr).getType();
+        // clang-format off
+        switch (type) {
+            case f32: copyData(static_cast<float*   >(data), arr); break;
+            case c32: copyData(static_cast<cfloat*  >(data), arr); break;
+            case f64: copyData(static_cast<double*  >(data), arr); break;
+            case c64: copyData(static_cast<cdouble* >(data), arr); break;
+            case b8:  copyData(static_cast<char*    >(data), arr); break;
+            case s32: copyData(static_cast<int*     >(data), arr); break;
+            case u32: copyData(static_cast<unsigned*>(data), arr); break;
+            case s8:  copyData(static_cast<schar*   >(data), arr); break;
+            case u8:  copyData(static_cast<uchar*   >(data), arr); break;
+            case s64: copyData(static_cast<intl*    >(data), arr); break;
+            case u64: copyData(static_cast<uintl*   >(data), arr); break;
+            case s16: copyData(static_cast<short*   >(data), arr); break;
+            case u16: copyData(static_cast<ushort*  >(data), arr); break;
+            case f16: copyData(static_cast<half*    >(data), arr); break;
+            default: TYPE_ERROR(1, type);
+        }
+        // clang-format on
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+// Strong Exception Guarantee
+af_err af_create_array(af_array *result, const void *const data,
+                       const unsigned ndims, const dim_t *const dims,
+                       const af_dtype type) {
+    try {
+        af_array out;
+        AF_CHECK(af_init());
+
+        dim4 d = verifyDims(ndims, dims);
+
+        switch (type) {
+            case f32:
+                out = createHandleFromData(d, static_cast<const float *>(data));
+                break;
+            case c32:
+                out =
+                    createHandleFromData(d, static_cast<const cfloat *>(data));
+                break;
+            case f64:
+                out =
+                    createHandleFromData(d, static_cast<const double *>(data));
+                break;
+            case c64:
+                out =
+                    createHandleFromData(d, static_cast<const cdouble *>(data));
+                break;
+            case b8:
+                out = createHandleFromData(d, static_cast<const char *>(data));
+                break;
+            case s32:
+                out = createHandleFromData(d, static_cast<const int *>(data));
+                break;
+            case u32:
+                out = createHandleFromData(d, static_cast<const uint *>(data));
+                break;
+            case s8:
+                out = createHandleFromData(d, static_cast<const schar *>(data));
+                break;
+            case u8:
+                out = createHandleFromData(d, static_cast<const uchar *>(data));
+                break;
+            case s64:
+                out = createHandleFromData(d, static_cast<const intl *>(data));
+                break;
+            case u64:
+                out = createHandleFromData(d, static_cast<const uintl *>(data));
+                break;
+            case s16:
+                out = createHandleFromData(d, static_cast<const short *>(data));
+                break;
+            case u16:
+                out =
+                    createHandleFromData(d, static_cast<const ushort *>(data));
+                break;
+            case f16:
+                out = createHandleFromData(d, static_cast<const half *>(data));
+                break;
+            default: TYPE_ERROR(4, type);
+        }
+        std::swap(*result, out);
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+// Strong Exception Guarantee
+af_err af_create_handle(af_array *result, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype type) {
+    try {
+        AF_CHECK(af_init());
+
+        if (ndims > 0) { ARG_ASSERT(2, dims != NULL); }
+
+        dim4 d(0);
+        for (unsigned i = 0; i < ndims; i++) { d[i] = dims[i]; }
+
+        af_array out = createHandle(d, type);
+        std::swap(*result, out);
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+// Strong Exception Guarantee
+af_err af_copy_array(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &info = getInfo(in, false);
+        const af_dtype type   = info.getType();
+
+        af_array res = 0;
+        if (info.isSparse()) {
+            const SparseArrayBase sbase = getSparseArrayBase(in);
+            if (info.ndims() == 0) {
+                return af_create_sparse_array_from_ptr(
+                    out, info.dims()[0], info.dims()[1], 0, nullptr, nullptr,
+                    nullptr, type, sbase.getStorage(), afDevice);
+            }
+            switch (type) {
+                case f32: res = copySparseArray<float>(in); break;
+                case f64: res = copySparseArray<double>(in); break;
+                case c32: res = copySparseArray<cfloat>(in); break;
+                case c64: res = copySparseArray<cdouble>(in); break;
+                default: TYPE_ERROR(1, type);
+            }
+
+        } else {
+            if (info.ndims() == 0) {
+                return af_create_handle(out, 0, nullptr, type);
+            }
+            switch (type) {
+                case f32: res = copyArray<float>(in); break;
+                case c32: res = copyArray<cfloat>(in); break;
+                case f64: res = copyArray<double>(in); break;
+                case c64: res = copyArray<cdouble>(in); break;
+                case b8: res = copyArray<char>(in); break;
+                case s32: res = copyArray<int>(in); break;
+                case u32: res = copyArray<uint>(in); break;
+                case s8: res = copyArray<schar>(in); break;
+                case u8: res = copyArray<uchar>(in); break;
+                case s64: res = copyArray<intl>(in); break;
+                case u64: res = copyArray<uintl>(in); break;
+                case s16: res = copyArray<short>(in); break;
+                case u16: res = copyArray<ushort>(in); break;
+                case f16: res = copyArray<half>(in); break;
+                default: TYPE_ERROR(1, type);
+            }
+        }
+        std::swap(*out, res);
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+// Strong Exception Guarantee
+af_err af_get_data_ref_count(int *use_count, const af_array in) {
+    try {
+        const ArrayInfo &info = getInfo(in, false);
+        const af_dtype type   = info.getType();
+
+        int res;
+        switch (type) {
+            case f32: res = getUseCount<float>(in); break;
+            case c32: res = getUseCount<cfloat>(in); break;
+            case f64: res = getUseCount<double>(in); break;
+            case c64: res = getUseCount<cdouble>(in); break;
+            case b8: res = getUseCount<char>(in); break;
+            case s32: res = getUseCount<int>(in); break;
+            case u32: res = getUseCount<uint>(in); break;
+            case s8: res = getUseCount<schar>(in); break;
+            case u8: res = getUseCount<uchar>(in); break;
+            case s64: res = getUseCount<intl>(in); break;
+            case u64: res = getUseCount<uintl>(in); break;
+            case s16: res = getUseCount<short>(in); break;
+            case u16: res = getUseCount<ushort>(in); break;
+            case f16: res = getUseCount<half>(in); break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*use_count, res);
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_release_array(af_array arr) {
+    try {
+        if (arr == 0) { return AF_SUCCESS; }
+        const ArrayInfo &info = getInfo(arr, false);
+        af_dtype type         = info.getType();
+
+        if (info.isSparse()) {
+            switch (type) {
+                case f32: releaseSparseHandle<float>(arr); break;
+                case f64: releaseSparseHandle<double>(arr); break;
+                case c32: releaseSparseHandle<cfloat>(arr); break;
+                case c64: releaseSparseHandle<cdouble>(arr); break;
+                default: TYPE_ERROR(0, type);
+            }
+        } else {
+            switch (type) {
+                case f32: releaseHandle<float>(arr); break;
+                case c32: releaseHandle<cfloat>(arr); break;
+                case f64: releaseHandle<double>(arr); break;
+                case c64: releaseHandle<cdouble>(arr); break;
+                case b8: releaseHandle<char>(arr); break;
+                case s32: releaseHandle<int>(arr); break;
+                case u32: releaseHandle<uint>(arr); break;
+                case s8: releaseHandle<schar>(arr); break;
+                case u8: releaseHandle<uchar>(arr); break;
+                case s64: releaseHandle<intl>(arr); break;
+                case u64: releaseHandle<uintl>(arr); break;
+                case s16: releaseHandle<short>(arr); break;
+                case u16: releaseHandle<ushort>(arr); break;
+                case f16: releaseHandle<half>(arr); break;
+                default: TYPE_ERROR(0, type);
+            }
+        }
+    }
+    CATCHALL
+
+    return AF_SUCCESS;
+}
+
+af_err af_retain_array(af_array *out, const af_array in) {
+    try {
+        *out = retain(in);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<typename T>
+void write_array(af_array arr, const T *const data, const size_t bytes,
+                 af_source src) {
+    if (src == afHost) {
+        writeHostDataArray(getArray<T>(arr), data, bytes);
+    } else {
+        writeDeviceDataArray(getArray<T>(arr), data, bytes);
+    }
+}
+
+af_err af_write_array(af_array arr, const void *data, const size_t bytes,
+                      af_source src) {
+    if (bytes == 0) { return AF_SUCCESS; }
+    try {
+        af_dtype type = getInfo(arr).getType();
+        ARG_ASSERT(1, (data != nullptr));
+        ARG_ASSERT(3, (src == afHost || src == afDevice));
+        // FIXME: ArrayInfo has no bytes() method, so this check is disabled
+        // DIM_ASSERT(2, bytes <= getInfo(arr).bytes());
+
+        switch (type) {
+            case f32:
+                write_array(arr, static_cast<const float *>(data), bytes, src);
+                break;
+            case c32:
+                write_array(arr, static_cast<const cfloat *>(data), bytes, src);
+                break;
+            case f64:
+                write_array(arr, static_cast<const double *>(data), bytes, src);
+                break;
+            case c64:
+                write_array(arr, static_cast<const cdouble *>(data), bytes,
+                            src);
+                break;
+            case b8:
+                write_array(arr, static_cast<const char *>(data), bytes, src);
+                break;
+            case s32:
+                write_array(arr, static_cast<const int *>(data), bytes, src);
+                break;
+            case u32:
+                write_array(arr, static_cast<const uint *>(data), bytes, src);
+                break;
+            case s8:
+                write_array(arr, static_cast<const schar *>(data), bytes, src);
+                break;
+            case u8:
+                write_array(arr, static_cast<const uchar *>(data), bytes, src);
+                break;
+            case s64:
+                write_array(arr, static_cast<const intl *>(data), bytes, src);
+                break;
+            case u64:
+                write_array(arr, static_cast<const uintl *>(data), bytes, src);
+                break;
+            case s16:
+                write_array(arr, static_cast<const short *>(data), bytes, src);
+                break;
+            case u16:
+                write_array(arr, static_cast<const ushort *>(data), bytes, src);
+                break;
+            case f16:
+                write_array(arr, static_cast<const half *>(data), bytes, src);
+                break;
+            default: TYPE_ERROR(0, type);
+        }
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_get_elements(dim_t *elems, const af_array arr) {
+    try {
+        // Do not check for device mismatch
+        *elems = getInfo(arr, false).elements();
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_get_type(af_dtype *type, const af_array arr) {
+    try {
+        // Do not check for device mismatch
+        *type = getInfo(arr, false).getType();
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_get_dims(dim_t *d0, dim_t *d1, dim_t *d2, dim_t *d3,
+                   const af_array in) {
+    try {
+        // Do not check for device mismatch
+        const ArrayInfo &info = getInfo(in, false);
+        *d0                   = info.dims()[0];
+        *d1                   = info.dims()[1];
+        *d2                   = info.dims()[2];
+        *d3                   = info.dims()[3];
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_get_numdims(unsigned *nd, const af_array in) {
+    try {
+        // Do not check for device mismatch
+        const ArrayInfo &info = getInfo(in, false);
+        *nd                   = info.ndims();
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+#undef INSTANTIATE
+#define INSTANTIATE(fn1, fn2)                           \
+    af_err fn1(bool *result, const af_array in) {       \
+        try {                                           \
+            const ArrayInfo &info = getInfo(in, false); \
+            *result               = info.fn2();         \
+        }                                               \
+        CATCHALL                                        \
+        return AF_SUCCESS;                              \
+    }
+
+INSTANTIATE(af_is_empty, isEmpty)
+INSTANTIATE(af_is_scalar, isScalar)
+INSTANTIATE(af_is_row, isRow)
+INSTANTIATE(af_is_column, isColumn)
+INSTANTIATE(af_is_vector, isVector)
+INSTANTIATE(af_is_complex, isComplex)
+INSTANTIATE(af_is_real, isReal)
+INSTANTIATE(af_is_double, isDouble)
+INSTANTIATE(af_is_single, isSingle)
+INSTANTIATE(af_is_half, isHalf)
+INSTANTIATE(af_is_realfloating, isRealFloating)
+INSTANTIATE(af_is_floating, isFloating)
+INSTANTIATE(af_is_integer, isInteger)
+INSTANTIATE(af_is_bool, isBool)
+INSTANTIATE(af_is_sparse, isSparse)
+
+#undef INSTANTIATE
+
+template<typename T>
+inline void getScalar(T *out, const af_array &arr) {
+    out[0] = getScalar<T>(getArray<T>(arr));
+}
+
+af_err af_get_scalar(void *output_value, const af_array arr) {
+    try {
+        ARG_ASSERT(0, (output_value != NULL));
+
+        const ArrayInfo &info = getInfo(arr);
+        const af_dtype type   = info.getType();
+
+        switch (type) {
+            case f32:
+                getScalar<float>(reinterpret_cast<float *>(output_value), arr);
+                break;
+            case f64:
+                getScalar<double>(reinterpret_cast<double *>(output_value),
+                                  arr);
+                break;
+            case b8:
+                getScalar<char>(reinterpret_cast<char *>(output_value), arr);
+                break;
+            case s32:
+                getScalar<int>(reinterpret_cast<int *>(output_value), arr);
+                break;
+            case u32:
+                getScalar<uint>(reinterpret_cast<uint *>(output_value), arr);
+                break;
+            case s8:
+                getScalar<schar>(reinterpret_cast<schar *>(output_value), arr);
+                break;
+            case u8:
+                getScalar<uchar>(reinterpret_cast<uchar *>(output_value), arr);
+                break;
+            case s64:
+                getScalar<intl>(reinterpret_cast<intl *>(output_value), arr);
+                break;
+            case u64:
+                getScalar<uintl>(reinterpret_cast<uintl *>(output_value), arr);
+                break;
+            case s16:
+                getScalar<short>(reinterpret_cast<short *>(output_value), arr);
+                break;
+            case u16:
+                getScalar<ushort>(reinterpret_cast<ushort *>(output_value),
+                                  arr);
+                break;
+            case c32:
+                getScalar<cfloat>(reinterpret_cast<cfloat *>(output_value),
+                                  arr);
+                break;
+            case c64:
+                getScalar<cdouble>(reinterpret_cast<cdouble *>(output_value),
+                                   arr);
+                break;
+            case f16:
+                getScalar<half>(static_cast<half *>(output_value), arr);
+                break;
+            default: TYPE_ERROR(4, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/assign.cpp b/src/api/c/assign.cpp
index 386b18dd04..bdf505048d 100644
--- a/src/api/c/assign.cpp
+++ b/src/api/c/assign.cpp
@@ -6,175 +6,234 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <assign.hpp>
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/index.h>
-#include <af/data.h>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
 #include <Array.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/complex.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/moddims.hpp>
+#include <common/tile.hpp>
 #include <copy.hpp>
-#include <assign.hpp>
+#include <handle.hpp>
+#include <indexing_common.hpp>
 #include <math.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/index.h>
 
-using namespace detail;
-using std::vector;
+using std::signbit;
 using std::swap;
+using std::vector;
 
-// From src/api/c/moddims.cpp TODO: move to header?
-template<typename T>
-Array<T> modDims(const Array<T>& in, const af::dim4 &newDims);
+using af::dim4;
+using arrayfire::common::convert2Canonical;
+using arrayfire::common::createSpanIndex;
+using arrayfire::common::half;
+using arrayfire::common::if_complex;
+using arrayfire::common::if_real;
+using arrayfire::common::modDims;
+using arrayfire::common::tile;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createSubArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename Tout, typename Tin>
-static
-void assign(Array<Tout> &out, const unsigned &ndims, const af_seq *index, const Array<Tin> &in_)
-{
-    dim4 const outDs = out.dims();
-    dim4 const iDims = in_.dims();
+static void assign(Array<Tout>& out, const vector<af_seq> seqs,
+                   const Array<Tin>& in) {
+    size_t ndims      = seqs.size();
+    const dim4& outDs = out.dims();
+    const dim4& iDims = in.dims();
 
-    DIM_ASSERT(0, (outDs.ndims()>=iDims.ndims()));
-    DIM_ASSERT(0, (outDs.ndims()>=(dim_t)ndims));
+    if (iDims.elements() == 0) { return; }
 
-    evalArray(out);
+    out.eval();
 
-    vector<af_seq> index_(index, index+ndims);
+    dim4 oDims = toDims(seqs, outDs);
 
-    dim4 oDims = toDims(index_, outDs);
+    bool isVec = true;
+    for (int i = 0; isVec && i < static_cast<int>(oDims.ndims()) - 1; i++) {
+        isVec &= oDims[i] == 1;
+    }
 
-    bool is_vector = true;
-    for (int i = 0; is_vector && i < (int)oDims.ndims() - 1; i++) {
-        is_vector &= oDims[i] == 1;
+    isVec &= in.isVector() || in.isScalar();
+
+    for (auto i = static_cast<dim_t>(ndims); i < in.ndims(); i++) {
+        oDims[i] = 1;
     }
 
-    if (is_vector && in_.isVector()) {
-        if (oDims.elements() != (dim_t)in_.elements()) {
+    if (isVec) {
+        if (oDims.elements() != in.elements() && in.elements() != 1) {
             AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
         }
 
-        // If both out and in are vectors of equal elements, reshape in to out dims
-        Array<Tin> in = modDims(in_, oDims);
-        Array<Tout> dst = createSubArray<Tout>(out, index_, false);
+        // If in is a scalar, tile it to out dims; otherwise both out and
+        // in are vectors of equal elements, so reshape in to out dims
+        Array<Tin> in_ = in.elements() == 1 ? arrayfire::common::tile(in, oDims)
+                                            : modDims(in, oDims);
+        auto dst       = createSubArray<Tout>(out, seqs, false);
 
-        copyArray<Tin , Tout>(dst, in);
+        copyArray<Tin, Tout>(dst, in_);
     } else {
-        for (int i = 0; i < 4; i++) {
+        for (int i = 0; i < AF_MAX_DIMS; i++) {
             if (oDims[i] != iDims[i]) {
                 AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
             }
         }
-        Array<Tout> dst = createSubArray<Tout>(out, index_, false);
+        Array<Tout> dst = createSubArray<Tout>(out, seqs, false);
 
-        copyArray<Tin , Tout>(dst, in_);
+        copyArray<Tin, Tout>(dst, in);
     }
 }
 
 template<typename T>
-static
-void assign_helper(Array<T> &out, const unsigned &ndims, const af_seq *index, const af_array &in_)
-{
-    ArrayInfo iInfo = getInfo(in_);
-    af_dtype iType  = iInfo.getType();
-
-    if(out.getType() == c64 || out.getType() == c32)
-    {
-
-        switch(iType) {
-            case c64: assign<T, cdouble>(out, ndims, index, getArray<cdouble  >(in_));  break;
-            case c32: assign<T, cfloat >(out, ndims, index, getArray<cfloat   >(in_));  break;
-            default : TYPE_ERROR(1, iType); break;
-        }
+static if_complex<T> assign(Array<T>& out, const vector<af_seq> iv,
+                            const af_array& in) {
+    const ArrayInfo& iInfo = getInfo(in);
+    af_dtype iType         = iInfo.getType();
+    switch (iType) {
+        case c64: assign<T, cdouble>(out, iv, getArray<cdouble>(in)); break;
+        case c32: assign<T, cfloat>(out, iv, getArray<cfloat>(in)); break;
+        default: TYPE_ERROR(1, iType); break;
     }
-    else
-    {
-        switch(iType) {
-            case f64: assign<T, double >(out, ndims, index, getArray<double   >(in_));  break;
-            case f32: assign<T, float  >(out, ndims, index, getArray<float    >(in_));  break;
-            case s32: assign<T, int    >(out, ndims, index, getArray<int      >(in_));  break;
-            case u32: assign<T, uint   >(out, ndims, index, getArray<uint     >(in_));  break;
-            case s64: assign<T, intl   >(out, ndims, index, getArray<intl     >(in_));  break;
-            case u64: assign<T, uintl  >(out, ndims, index, getArray<uintl    >(in_));  break;
-            case u8 : assign<T, uchar  >(out, ndims, index, getArray<uchar    >(in_));  break;
-            case b8 : assign<T, char   >(out, ndims, index, getArray<char     >(in_));  break;
-            default : TYPE_ERROR(1, iType); break;
-        }
+}
+
+template<typename T>
+static if_real<T> assign(Array<T>& out, const vector<af_seq> iv,
+                         const af_array& in) {
+    const ArrayInfo& iInfo = getInfo(in);
+    af_dtype iType         = iInfo.getType();
+
+    switch (iType) {
+        case f64: assign<T, double>(out, iv, getArray<double>(in)); break;
+        case f32: assign<T, float>(out, iv, getArray<float>(in)); break;
+        case s32: assign<T, int>(out, iv, getArray<int>(in)); break;
+        case u32: assign<T, uint>(out, iv, getArray<uint>(in)); break;
+        case s64: assign<T, intl>(out, iv, getArray<intl>(in)); break;
+        case u64: assign<T, uintl>(out, iv, getArray<uintl>(in)); break;
+        case s16: assign<T, short>(out, iv, getArray<short>(in)); break;
+        case u16: assign<T, ushort>(out, iv, getArray<ushort>(in)); break;
+        case s8: assign<T, schar>(out, iv, getArray<schar>(in)); break;
+        case u8: assign<T, uchar>(out, iv, getArray<uchar>(in)); break;
+        case b8: assign<T, char>(out, iv, getArray<char>(in)); break;
+        case f16: assign<T, half>(out, iv, getArray<half>(in)); break;
+        default: TYPE_ERROR(1, iType); break;
     }
 }
 
-af_err af_assign_seq(af_array *out,
-                     const af_array lhs, const unsigned ndims,
-                     const af_seq *index, const af_array rhs)
-{
+af_err af_assign_seq(af_array* out, const af_array lhs, const unsigned ndims,
+                     const af_seq* index, const af_array rhs) {
     try {
-        ARG_ASSERT(0, (lhs!=0));
-        ARG_ASSERT(1, (ndims>0));
-        ARG_ASSERT(3, (rhs!=0));
-
-        for(dim_t i=0; i<(dim_t)ndims; ++i) {
-            ARG_ASSERT(2, (index[i].step>=0));
+        ARG_ASSERT(2, (ndims > 0 && ndims <= AF_MAX_DIMS));
+        ARG_ASSERT(1, (lhs != 0));
+        ARG_ASSERT(4, (rhs != 0));
+
+        const ArrayInfo& lInfo = getInfo(lhs);
+
+        if (ndims == 1 && ndims != lInfo.ndims()) {
+            af_array tmp_in;
+            af_array tmp_out;
+            AF_CHECK(af_flat(&tmp_in, lhs));
+            AF_CHECK(af_assign_seq(&tmp_out, tmp_in, ndims, index, rhs));
+            AF_CHECK(
+                af_moddims(out, tmp_out, lInfo.ndims(), lInfo.dims().get()));
+            AF_CHECK(af_release_array(tmp_in));
+            // This can run into a double free issue if tmp_in == tmp_out
+            // The condition ensures release only if both are different
+            // Issue found on Tegra X1
+            if (tmp_in != tmp_out) { AF_CHECK(af_release_array(tmp_out)); }
+            return AF_SUCCESS;
         }
 
-        af_array res;
-        if (*out != lhs) AF_CHECK(af_copy_array(&res, lhs));
-        else             res = retain(lhs);
+        af_array res = 0;
 
-        try {
+        if (*out != lhs) {
+            int count = 0;
+            AF_CHECK(af_get_data_ref_count(&count, lhs));
+            if (count > 1) {
+                AF_CHECK(af_copy_array(&res, lhs));
+            } else {
+                res = retain(lhs);
+            }
+        } else {
+            res = lhs;
+        }
 
+        try {
             if (lhs != rhs) {
-                ArrayInfo oInfo = getInfo(lhs);
-                af_dtype oType  = oInfo.getType();
-                switch(oType) {
-                case c64: assign_helper<cdouble>(getWritableArray<cdouble>(res), ndims, index, rhs);  break;
-                case c32: assign_helper<cfloat >(getWritableArray<cfloat >(res), ndims, index, rhs);  break;
-                case f64: assign_helper<double >(getWritableArray<double >(res), ndims, index, rhs);  break;
-                case f32: assign_helper<float  >(getWritableArray<float  >(res), ndims, index, rhs);  break;
-                case s32: assign_helper<int    >(getWritableArray<int    >(res), ndims, index, rhs);  break;
-                case u32: assign_helper<uint   >(getWritableArray<uint   >(res), ndims, index, rhs);  break;
-                case s64: assign_helper<intl   >(getWritableArray<intl   >(res), ndims, index, rhs);  break;
-                case u64: assign_helper<uintl  >(getWritableArray<uintl  >(res), ndims, index, rhs);  break;
-                case u8 : assign_helper<uchar  >(getWritableArray<uchar  >(res), ndims, index, rhs);  break;
-                case b8 : assign_helper<char   >(getWritableArray<char   >(res), ndims, index, rhs);  break;
-                default : TYPE_ERROR(1, oType); break;
+                const dim4& outDims = getInfo(res).dims();
+                const dim4& inDims  = getInfo(rhs).dims();
+
+                vector<af_seq> inSeqs(ndims, af_span);
+                for (unsigned i = 0; i < ndims; ++i) {
+                    inSeqs[i] = convert2Canonical(index[i], outDims[i]);
+                    ARG_ASSERT(3,
+                               (inSeqs[i].begin >= 0. || inSeqs[i].end >= 0.));
+                    if (signbit(inSeqs[i].step)) {
+                        ARG_ASSERT(3, inSeqs[i].begin >= inSeqs[i].end);
+                    } else {
+                        ARG_ASSERT(3, inSeqs[i].begin <= inSeqs[i].end);
+                    }
+                }
+                DIM_ASSERT(0, (outDims.ndims() >= inDims.ndims()));
+                DIM_ASSERT(0, (outDims.ndims() >= (dim_t)ndims));
+
+                const ArrayInfo& oInfo = getInfo(res);
+                af_dtype oType         = oInfo.getType();
+                switch (oType) {
+                    case c64:
+                        assign(getArray<cdouble>(res), inSeqs, rhs);
+                        break;
+                    case c32: assign(getArray<cfloat>(res), inSeqs, rhs); break;
+                    case f64: assign(getArray<double>(res), inSeqs, rhs); break;
+                    case f32: assign(getArray<float>(res), inSeqs, rhs); break;
+                    case s32: assign(getArray<int>(res), inSeqs, rhs); break;
+                    case u32: assign(getArray<uint>(res), inSeqs, rhs); break;
+                    case s64: assign(getArray<intl>(res), inSeqs, rhs); break;
+                    case u64: assign(getArray<uintl>(res), inSeqs, rhs); break;
+                    case s16: assign(getArray<short>(res), inSeqs, rhs); break;
+                    case u16: assign(getArray<ushort>(res), inSeqs, rhs); break;
+                    case s8: assign(getArray<schar>(res), inSeqs, rhs); break;
+                    case u8: assign(getArray<uchar>(res), inSeqs, rhs); break;
+                    case b8: assign(getArray<char>(res), inSeqs, rhs); break;
+                    case f16: assign(getArray<half>(res), inSeqs, rhs); break;
+                    default: TYPE_ERROR(1, oType); break;
                 }
             }
-        } catch(...) {
+        } catch (...) {
             af_release_array(res);
             throw;
         }
-        std::swap(*out, res);
+        swap(*out, res);
     }
     CATCHALL;
-
     return AF_SUCCESS;
 }
 
 template<typename T>
-static void genAssign(af_array& out, const af_index_t* indexs, const af_array& rhs)
-{
-    detail::assign<T>(getWritableArray<T>(out), indexs, getArray<T>(rhs));
+inline void genAssign(af_array& out, const af_index_t* indexs,
+                      const af_array& rhs) {
+    detail::assign<T>(getArray<T>(out), indexs, getArray<T>(rhs));
 }
 
-af_err af_assign_gen(af_array *out,
-                    const af_array lhs,
-                    const dim_t ndims, const af_index_t* indexs,
-                    const af_array rhs_)
-{
-    af_array output = 0;
-    af_array rhs = rhs_;
-    // spanner is sequence index used for indexing along the
-    // dimensions after ndims
-    af_index_t spanner;
-    spanner.idx.seq = af_span;
-    spanner.isSeq = true;
-
+af_err af_assign_gen(af_array* out, const af_array lhs, const dim_t ndims,
+                     const af_index_t* indexs, const af_array rhs_) {
     try {
-        ARG_ASSERT(2, (ndims>0));
-        ARG_ASSERT(3, (indexs!=NULL));
+        ARG_ASSERT(2, (ndims > 0 && ndims <= AF_MAX_DIMS));
+        ARG_ASSERT(3, (indexs != NULL));
 
         int track = 0;
-        vector<af_seq> seqs(4, af_span);
+        vector<af_seq> seqs(AF_MAX_DIMS, af_span);
         for (dim_t i = 0; i < ndims; i++) {
             if (indexs[i].isSeq) {
                 track++;
@@ -182,111 +241,166 @@ af_err af_assign_gen(af_array *out,
             }
         }
 
-        if (track==(int)ndims) {
+        af_array rhs = rhs_;
+        if (track == static_cast<int>(ndims)) {
             // all indexs are sequences, redirecting to af_assign
-            return af_assign_seq(out, lhs, ndims, &(seqs.front()), rhs);
+            return af_assign_seq(out, lhs, ndims, seqs.data(), rhs);
         }
 
-        ARG_ASSERT(1, (lhs!=0));
-        ARG_ASSERT(4, (rhs!=0));
+        ARG_ASSERT(1, (lhs != 0));
+        ARG_ASSERT(4, (rhs != 0));
+
+        const ArrayInfo& lInfo = getInfo(lhs);
+        const ArrayInfo& rInfo = getInfo(rhs);
+        const dim4& lhsDims    = lInfo.dims();
+        const dim4& rhsDims    = rInfo.dims();
+        af_dtype lhsType       = lInfo.getType();
+        af_dtype rhsType       = rInfo.getType();
 
-        if (*out != lhs) AF_CHECK(af_copy_array(&output, lhs));
-        else             output = lhs;
+        if (rhsDims.ndims() == 0) { return af_retain_array(out, lhs); }
 
-        ArrayInfo lInfo = getInfo(lhs);
-        ArrayInfo rInfo = getInfo(rhs);
-        dim4 lhsDims    = lInfo.dims();
-        dim4 rhsDims    = rInfo.dims();
-        af_dtype lhsType= lInfo.getType();
-        af_dtype rhsType= rInfo.getType();
+        if (lhsDims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, lhsType);
+        }
+
+        if (ndims == 1 && ndims != static_cast<dim_t>(lInfo.ndims())) {
+            af_array tmp_in  = 0;
+            af_array tmp_out = 0;
+            AF_CHECK(af_flat(&tmp_in, lhs));
+            AF_CHECK(af_assign_gen(&tmp_out, tmp_in, ndims, indexs, rhs_));
+            AF_CHECK(
+                af_moddims(out, tmp_out, lInfo.ndims(), lInfo.dims().get()));
+            AF_CHECK(af_release_array(tmp_in));
+            // This can run into a double free issue if tmp_in == tmp_out
+            // The condition ensures release only if both are different
+            // Issue found on Tegra X1
+            if (tmp_in != tmp_out) { AF_CHECK(af_release_array(tmp_out)); }
+            return AF_SUCCESS;
+        }
+
+        ARG_ASSERT(1, (lhsType == rhsType));
+        ARG_ASSERT(1, (lhsDims.ndims() >= rhsDims.ndims()));
 
-        ARG_ASSERT(1, (lhsType==rhsType));
-        ARG_ASSERT(3, (rhsDims.ndims()>0));
-        ARG_ASSERT(1, (lhsDims.ndims()>=rhsDims.ndims()));
-        ARG_ASSERT(2, (lhsDims.ndims()>=ndims));
+        af_array output = 0;
+        if (*out != lhs) {
+            int count = 0;
+            AF_CHECK(af_get_data_ref_count(&count, lhs));
+            if (count > 1) {
+                AF_CHECK(af_copy_array(&output, lhs));
+            } else {
+                output = retain(lhs);
+            }
+        } else {
+            output = lhs;
+        }
 
         dim4 oDims = toDims(seqs, lhsDims);
         // if af_array are indexs along any
         // particular dimension, set the length of
         // that dimension accordingly before any checks
-        for (dim_t i=0; i<ndims; i++) {
+        for (dim_t i = 0; i < ndims; i++) {
             if (!indexs[i].isSeq) {
                 oDims[i] = getInfo(indexs[i].idx.arr).elements();
             }
         }
 
-        bool is_vector = true;
-        for (uint i = 0; is_vector && i < oDims.ndims() - 1; i++) {
-            is_vector &= oDims[i] == 1;
+        for (dim_t i = ndims; i < static_cast<dim_t>(lInfo.ndims()); i++) {
+            oDims[i] = 1;
+        }
+
+        bool isVec = true;
+        for (int i = 0; isVec && i < oDims.ndims() - 1; i++) {
+            isVec &= oDims[i] == 1;
         }
 
-        //TODO: Move logic out of this
-        is_vector &= rInfo.isVector();
-        if (is_vector) {
-            if (oDims.elements() != (dim_t)rInfo.elements()) {
+        // TODO(umar): Move logic out of this
+        isVec &= rInfo.isVector() || rInfo.isScalar();
+        if (isVec) {
+            if (oDims.elements() != static_cast<dim_t>(rInfo.elements()) &&
+                rInfo.elements() != 1) {
                 AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
             }
-
-            // If both out and rhs are vectors of equal elements, reshape rhs to out dims
-            AF_CHECK(af_moddims(&rhs, rhs_, oDims.ndims(), oDims.get()));
+            if (rInfo.elements() == 1) {
+                AF_CHECK(af_tile(&rhs, rhs_, oDims[0], oDims[1], oDims[2],
+                                 oDims[3]));
+            } else {
+                // If both out and rhs are vectors of equal
+                // elements, reshape rhs to out dims
+                AF_CHECK(af_moddims(&rhs, rhs_, oDims.ndims(), oDims.get()));
+            }
         } else {
-            for (int i = 0; i < 4; i++) {
+            for (int i = 0; i < AF_MAX_DIMS; i++) {
                 if (oDims[i] != rhsDims[i]) {
-                    AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
+                    AF_ERROR("Size mismatch between input and output",
+                             AF_ERR_SIZE);
                 }
             }
         }
 
-        af_index_t idxrs[4];
-        // set all dimensions above ndims to spanner index
-        for (dim_t i=ndims; i<4; ++i) idxrs[i] = spanner;
-
-        for (dim_t i=0; i<ndims; ++i) {
-            if (!indexs[i].isSeq) {
-                // check if all af_arrays have atleast one value
-                // to enable indexing along that dimension
-                ArrayInfo idxInfo = getInfo(indexs[i].idx.arr);
-                af_dtype idxType  = idxInfo.getType();
-
-                ARG_ASSERT(3, (idxType!=c32));
-                ARG_ASSERT(3, (idxType!=c64));
-                ARG_ASSERT(3, (idxType!=b8 ));
-
-                idxrs[i].idx.arr = indexs[i].idx.arr;
-                idxrs[i].isSeq = indexs[i].isSeq;
+        std::array<af_index_t, AF_MAX_DIMS> idxrs{};
+        for (dim_t i = 0; i < AF_MAX_DIMS; ++i) {
+            if (i < ndims) {
+                bool isSeq = indexs[i].isSeq;
+                if (!isSeq) {
+                    // check if all af_arrays have at least one value
+                    // to enable indexing along that dimension
+                    const ArrayInfo& idxInfo = getInfo(indexs[i].idx.arr);
+                    af_dtype idxType         = idxInfo.getType();
+
+                    ARG_ASSERT(3, (idxType != c32));
+                    ARG_ASSERT(3, (idxType != c64));
+                    ARG_ASSERT(3, (idxType != b8));
+
+                    idxrs[i] = {{indexs[i].idx.arr}, isSeq, indexs[i].isBatch};
+                } else {
+                    af_seq inSeq =
+                        convert2Canonical(indexs[i].idx.seq, lhsDims[i]);
+                    ARG_ASSERT(3, (inSeq.begin >= 0 || inSeq.end >= 0));
+                    if (signbit(inSeq.step)) {
+                        ARG_ASSERT(3, inSeq.begin >= inSeq.end);
+                    } else {
+                        ARG_ASSERT(3, inSeq.begin <= inSeq.end);
+                    }
+
+                    idxrs[i].idx.seq = inSeq;
+                    idxrs[i].isSeq   = isSeq;
+                    idxrs[i].isBatch = indexs[i].isBatch;
+                }
             } else {
-                // af_seq is being used for this dimension
-                // just copy the index to local variable
-                idxrs[i] = indexs[i];
+                // set all dimensions above ndims to a span index
+                idxrs[i] = createSpanIndex();
             }
         }
+        af_index_t* ptr = idxrs.data();
 
         try {
-            switch(rhsType) {
-                case c64: genAssign<cdouble>(output, idxrs, rhs); break;
-                case f64: genAssign<double >(output, idxrs, rhs); break;
-                case c32: genAssign<cfloat >(output, idxrs, rhs); break;
-                case f32: genAssign<float  >(output, idxrs, rhs); break;
-                case u64: genAssign<uintl  >(output, idxrs, rhs); break;
-                case u32: genAssign<uint   >(output, idxrs, rhs); break;
-                case s64: genAssign<intl   >(output, idxrs, rhs); break;
-                case s32: genAssign<int    >(output, idxrs, rhs); break;
-                case  u8: genAssign<uchar  >(output, idxrs, rhs); break;
-                case  b8: genAssign<char   >(output, idxrs, rhs); break;
+            switch (rhsType) {
+                case c64: genAssign<cdouble>(output, ptr, rhs); break;
+                case f64: genAssign<double>(output, ptr, rhs); break;
+                case c32: genAssign<cfloat>(output, ptr, rhs); break;
+                case f32: genAssign<float>(output, ptr, rhs); break;
+                case u64: genAssign<uintl>(output, ptr, rhs); break;
+                case u32: genAssign<uint>(output, ptr, rhs); break;
+                case s64: genAssign<intl>(output, ptr, rhs); break;
+                case s32: genAssign<int>(output, ptr, rhs); break;
+                case s16: genAssign<short>(output, ptr, rhs); break;
+                case u16: genAssign<ushort>(output, ptr, rhs); break;
+                case s8: genAssign<schar>(output, ptr, rhs); break;
+                case u8: genAssign<uchar>(output, ptr, rhs); break;
+                case b8: genAssign<char>(output, ptr, rhs); break;
+                case f16: genAssign<half>(output, ptr, rhs); break;
                 default: TYPE_ERROR(1, rhsType);
             }
-        } catch(...) {
+        } catch (...) {
             if (*out != lhs) {
                 AF_CHECK(af_release_array(output));
-                if (is_vector) { AF_CHECK(af_release_array(rhs)); }
+                if (isVec) { AF_CHECK(af_release_array(rhs)); }
             }
             throw;
         }
-        if (is_vector) { AF_CHECK(af_release_array(rhs)); }
+        if (isVec) { AF_CHECK(af_release_array(rhs)); }
+        swap(*out, output);
     }
     CATCHALL;
-
-    std::swap(*out, output);
-
     return AF_SUCCESS;
 }
diff --git a/src/api/c/bilateral.cpp b/src/api/c/bilateral.cpp
index c83c7ef8db..aeec279ea5 100644
--- a/src/api/c/bilateral.cpp
+++ b/src/api/c/bilateral.cpp
@@ -7,54 +7,59 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
 #include <backend.hpp>
 #include <bilateral.hpp>
-#include <err_common.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+
+#include <type_traits>
 
 using af::dim4;
-using namespace detail;
+using detail::bilateral;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::conditional;
+using std::is_same;
 
-template<typename inType, typename outType, bool isColor>
-static inline af_array bilateral(const af_array &in, const float &sp_sig, const float &chr_sig)
-{
-    return getHandle(bilateral<inType, outType, isColor>(getArray<inType>(in), sp_sig, chr_sig));
+template<typename T>
+inline af_array bilateral(const af_array &in, const float &sp_sig,
+                          const float &chr_sig) {
+    using OutType =
+        typename conditional<is_same<T, double>::value, double, float>::type;
+    return getHandle(bilateral<T, OutType>(getArray<T>(in), sp_sig, chr_sig));
 }
 
-template<bool isColor>
-static af_err bilateral(af_array *out, const af_array &in, const float &s_sigma, const float &c_sigma)
-{
+af_err af_bilateral(af_array *out, const af_array in, const float ssigma,
+                    const float csigma, const bool iscolor) {
+    UNUSED(iscolor);
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type  = info.getType();
-        af::dim4 dims  = info.dims();
-
-        DIM_ASSERT(1, (dims.ndims()>=2));
-
-        af_array output;
-        switch(type) {
-            case f64: output = bilateral<double, double, isColor> (in, s_sigma, c_sigma); break;
-            case f32: output = bilateral<float ,  float, isColor> (in, s_sigma, c_sigma); break;
-            case b8 : output = bilateral<char  ,  float, isColor> (in, s_sigma, c_sigma); break;
-            case s32: output = bilateral<int   ,  float, isColor> (in, s_sigma, c_sigma); break;
-            case u32: output = bilateral<uint  ,  float, isColor> (in, s_sigma, c_sigma); break;
-            case u8 : output = bilateral<uchar ,  float, isColor> (in, s_sigma, c_sigma); break;
-            default : TYPE_ERROR(1, type);
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
+
+        DIM_ASSERT(1, (dims.ndims() >= 2));
+
+        af_array output = nullptr;
+        switch (type) {
+            case f64: output = bilateral<double>(in, ssigma, csigma); break;
+            case f32: output = bilateral<float>(in, ssigma, csigma); break;
+            case b8: output = bilateral<char>(in, ssigma, csigma); break;
+            case s32: output = bilateral<int>(in, ssigma, csigma); break;
+            case u32: output = bilateral<uint>(in, ssigma, csigma); break;
+            case s8: output = bilateral<schar>(in, ssigma, csigma); break;
+            case u8: output = bilateral<uchar>(in, ssigma, csigma); break;
+            case s16: output = bilateral<short>(in, ssigma, csigma); break;
+            case u16: output = bilateral<ushort>(in, ssigma, csigma); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
-
-af_err af_bilateral(af_array *out, const af_array in, const float spatial_sigma, const float chromatic_sigma, const bool isColor)
-{
-    if (isColor)
-        return bilateral<true>(out,in,spatial_sigma,chromatic_sigma);
-    else
-        return bilateral<false>(out,in,spatial_sigma,chromatic_sigma);
-}
diff --git a/src/api/c/binary.cpp b/src/api/c/binary.cpp
index 8a6ae465a6..eebe62bdbb 100644
--- a/src/api/c/binary.cpp
+++ b/src/api/c/binary.cpp
@@ -7,57 +7,207 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/defines.h>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/moddims.hpp>
+#include <common/tile.hpp>
+#include <handle.hpp>
+#include <implicit.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+#include <sparse_handle.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <ArrayInfo.hpp>
-#include <optypes.hpp>
-#include <implicit.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
+#include <af/defines.h>
 
 #include <arith.hpp>
 #include <logic.hpp>
+#include <sparse_arith.hpp>
+
+#include <common/half.hpp>
 
-using namespace detail;
 using af::dim4;
+using af::dtype;
+using arrayfire::castSparse;
+using arrayfire::getSparseArray;
+using arrayfire::getSparseArrayBase;
+using arrayfire::common::half;
+using arrayfire::common::modDims;
+using arrayfire::common::SparseArrayBase;
+using arrayfire::common::tile;
+using detail::arithOp;
+using detail::arithOpD;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T, af_op_t op>
 static inline af_array arithOp(const af_array lhs, const af_array rhs,
-                               const dim4 &odims)
-{
-    af_array res = getHandle(arithOp<T, op>(castArray<T>(lhs), castArray<T>(rhs), odims));
-    return res;
+                               const dim4 &odims) {
+    const ArrayInfo &linfo = getInfo(lhs);
+    const ArrayInfo &rinfo = getInfo(rhs);
+
+    dtype type = static_cast<af::dtype>(af::dtype_traits<T>::af_type);
+
+    const detail::Array<T> &l =
+        linfo.getType() == type ? getArray<T>(lhs) : castArray<T>(lhs);
+    const detail::Array<T> &r =
+        rinfo.getType() == type ? getArray<T>(rhs) : castArray<T>(rhs);
+
+    return getHandle(arithOp<T, op>(l, r, odims));
+}
+
+template<typename T, af_op_t op>
+static inline af_array arithOpBroadcast(const af_array lhs,
+                                        const af_array rhs) {
+    const ArrayInfo &linfo = getInfo(lhs);
+    const ArrayInfo &rinfo = getInfo(rhs);
+
+    dim4 odims(1), ltile(1), rtile(1);
+    dim4 lshape = linfo.dims();
+    dim4 rshape = rinfo.dims();
+
+    for (int d = 0; d < AF_MAX_DIMS; ++d) {
+        DIM_ASSERT(
+            1, ((lshape[d] == rshape[d]) || (lshape[d] == 1 && rshape[d] > 1) ||
+                (lshape[d] > 1 && rshape[d] == 1)));
+        odims[d] = std::max(lshape[d], rshape[d]);
+        if (lshape[d] == rshape[d]) {
+            ltile[d] = rtile[d] = 1;
+        } else if (lshape[d] == 1 && rshape[d] > 1) {
+            ltile[d] = odims[d];
+        } else if (lshape[d] > 1 && rshape[d] == 1) {
+            rtile[d] = odims[d];
+        }
+    }
+
+    Array<T> lhst =
+        arrayfire::common::tile<T>(modDims(getArray<T>(lhs), lshape), ltile);
+    Array<T> rhst =
+        arrayfire::common::tile<T>(modDims(getArray<T>(rhs), rshape), rtile);
+
+    return getHandle(arithOp<T, op>(lhst, rhst, odims));
+}
+
+template<typename T, af_op_t op>
+static inline af_array sparseArithOp(const af_array lhs, const af_array rhs) {
+    auto res = arithOp<T, op>(getSparseArray<T>(lhs), getSparseArray<T>(rhs));
+    return getHandle(res);
+}
+
+template<typename T, af_op_t op>
+static inline af_array arithSparseDenseOp(const af_array lhs,
+                                          const af_array rhs,
+                                          const bool reverse) {
+    if (op == af_add_t || op == af_sub_t) {
+        return getHandle(
+            arithOpD<T, op>(castSparse<T>(lhs), castArray<T>(rhs), reverse));
+    }
+    if (op == af_mul_t || op == af_div_t) {
+        return getHandle(
+            arithOp<T, op>(castSparse<T>(lhs), castArray<T>(rhs), reverse));
+    }
 }
 
 template<af_op_t op>
-static af_err af_arith(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+static af_err af_arith(af_array *out, const af_array lhs, const af_array rhs,
+                       const bool batchMode) {
     try {
-        const af_dtype otype = implicit(lhs, rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const af_dtype otype = implicit(linfo.getType(), rinfo.getType());
+        af_array res;
+
+        if (batchMode || linfo.dims() == rinfo.dims()) {
+            dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
+            if (odims.ndims() == 0) {
+                return af_create_handle(out, 0, nullptr, otype);
+            }
+
+            switch (otype) {
+                case f32: res = arithOp<float, op>(lhs, rhs, odims); break;
+                case f64: res = arithOp<double, op>(lhs, rhs, odims); break;
+                case c32: res = arithOp<cfloat, op>(lhs, rhs, odims); break;
+                case c64: res = arithOp<cdouble, op>(lhs, rhs, odims); break;
+                case s32: res = arithOp<int, op>(lhs, rhs, odims); break;
+                case u32: res = arithOp<uint, op>(lhs, rhs, odims); break;
+                case s8: res = arithOp<schar, op>(lhs, rhs, odims); break;
+                case u8: res = arithOp<uchar, op>(lhs, rhs, odims); break;
+                case b8: res = arithOp<char, op>(lhs, rhs, odims); break;
+                case s64: res = arithOp<intl, op>(lhs, rhs, odims); break;
+                case u64: res = arithOp<uintl, op>(lhs, rhs, odims); break;
+                case s16: res = arithOp<short, op>(lhs, rhs, odims); break;
+                case u16: res = arithOp<ushort, op>(lhs, rhs, odims); break;
+                case f16: res = arithOp<half, op>(lhs, rhs, odims); break;
+                default: TYPE_ERROR(0, otype);
+            }
+        } else {
+            if (linfo.ndims() == 0 && rinfo.ndims() == 0) {
+                return af_create_handle(out, 0, nullptr, otype);
+            }
+            switch (otype) {
+                case f32: res = arithOpBroadcast<float, op>(lhs, rhs); break;
+                case f64: res = arithOpBroadcast<double, op>(lhs, rhs); break;
+                case c32: res = arithOpBroadcast<cfloat, op>(lhs, rhs); break;
+                case c64: res = arithOpBroadcast<cdouble, op>(lhs, rhs); break;
+                case s32: res = arithOpBroadcast<int, op>(lhs, rhs); break;
+                case u32: res = arithOpBroadcast<uint, op>(lhs, rhs); break;
+                case s8: res = arithOpBroadcast<schar, op>(lhs, rhs); break;
+                case u8: res = arithOpBroadcast<uchar, op>(lhs, rhs); break;
+                case b8: res = arithOpBroadcast<char, op>(lhs, rhs); break;
+                case s64: res = arithOpBroadcast<intl, op>(lhs, rhs); break;
+                case u64: res = arithOpBroadcast<uintl, op>(lhs, rhs); break;
+                case s16: res = arithOpBroadcast<short, op>(lhs, rhs); break;
+                case u16: res = arithOpBroadcast<ushort, op>(lhs, rhs); break;
+                case f16: res = arithOpBroadcast<half, op>(lhs, rhs); break;
+                default: TYPE_ERROR(0, otype);
+            }
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<af_op_t op>
+static af_err af_arith_real(af_array *out, const af_array lhs,
+                            const af_array rhs, const bool batchMode) {
+    try {
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
         dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
+        const af_dtype otype = implicit(linfo.getType(), rinfo.getType());
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, otype);
+        }
 
         af_array res;
         switch (otype) {
-        case f32: res = arithOp<float  , op>(lhs, rhs, odims); break;
-        case f64: res = arithOp<double , op>(lhs, rhs, odims); break;
-        case c32: res = arithOp<cfloat , op>(lhs, rhs, odims); break;
-        case c64: res = arithOp<cdouble, op>(lhs, rhs, odims); break;
-        case s32: res = arithOp<int    , op>(lhs, rhs, odims); break;
-        case u32: res = arithOp<uint   , op>(lhs, rhs, odims); break;
-        case u8 : res = arithOp<uchar  , op>(lhs, rhs, odims); break;
-        case b8 : res = arithOp<char   , op>(lhs, rhs, odims); break;
-        case s64: res = arithOp<intl   , op>(lhs, rhs, odims); break;
-        case u64: res = arithOp<uintl  , op>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, otype);
+            case f32: res = arithOp<float, op>(lhs, rhs, odims); break;
+            case f64: res = arithOp<double, op>(lhs, rhs, odims); break;
+            case s32: res = arithOp<int, op>(lhs, rhs, odims); break;
+            case u32: res = arithOp<uint, op>(lhs, rhs, odims); break;
+            case s8: res = arithOp<schar, op>(lhs, rhs, odims); break;
+            case u8: res = arithOp<uchar, op>(lhs, rhs, odims); break;
+            case b8: res = arithOp<char, op>(lhs, rhs, odims); break;
+            case s64: res = arithOp<intl, op>(lhs, rhs, odims); break;
+            case u64: res = arithOp<uintl, op>(lhs, rhs, odims); break;
+            case s16: res = arithOp<short, op>(lhs, rhs, odims); break;
+            case u16: res = arithOp<ushort, op>(lhs, rhs, odims); break;
+            case f16: res = arithOp<half, op>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, otype);
         }
-
         std::swap(*out, res);
     }
     CATCHALL;
@@ -65,27 +215,62 @@ static af_err af_arith(af_array *out, const af_array lhs, const af_array rhs, co
 }
 
 template<af_op_t op>
-static af_err af_arith_real(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+static af_err af_arith_sparse(af_array *out, const af_array lhs,
+                              const af_array rhs) {
     try {
-        const af_dtype otype = implicit(lhs, rhs);
+        const SparseArrayBase linfo = getSparseArrayBase(lhs);
+        const SparseArrayBase rinfo = getSparseArrayBase(rhs);
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        ARG_ASSERT(1, (linfo.getStorage() == rinfo.getStorage()));
+        ARG_ASSERT(1, (linfo.dims() == rinfo.dims()));
+        ARG_ASSERT(1, (linfo.getStorage() == AF_STORAGE_CSR));
 
-        dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
+        const af_dtype otype = implicit(linfo.getType(), rinfo.getType());
+        af_array res;
+        switch (otype) {
+            case f32: res = sparseArithOp<float, op>(lhs, rhs); break;
+            case f64: res = sparseArithOp<double, op>(lhs, rhs); break;
+            case c32: res = sparseArithOp<cfloat, op>(lhs, rhs); break;
+            case c64: res = sparseArithOp<cdouble, op>(lhs, rhs); break;
+            default: TYPE_ERROR(0, otype);
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
+template<af_op_t op>
+static af_err af_arith_sparse_dense(af_array *out, const af_array lhs,
+                                    const af_array rhs,
+                                    const bool reverse = false) {
+    try {
+        const SparseArrayBase linfo = getSparseArrayBase(lhs);
+        if (linfo.ndims() > 2) {
+            AF_ERROR(
+                "Sparse-Dense arithmetic operations cannot be used in batch "
+                "mode",
+                AF_ERR_BATCH);
+        }
+        const ArrayInfo &rinfo = getInfo(rhs);
+
+        const af_dtype otype = implicit(linfo.getType(), rinfo.getType());
         af_array res;
         switch (otype) {
-        case f32: res = arithOp<float  , op>(lhs, rhs, odims); break;
-        case f64: res = arithOp<double , op>(lhs, rhs, odims); break;
-        case s32: res = arithOp<int    , op>(lhs, rhs, odims); break;
-        case u32: res = arithOp<uint   , op>(lhs, rhs, odims); break;
-        case u8 : res = arithOp<uchar  , op>(lhs, rhs, odims); break;
-        case b8 : res = arithOp<char   , op>(lhs, rhs, odims); break;
-        case s64: res = arithOp<intl   , op>(lhs, rhs, odims); break;
-        case u64: res = arithOp<uintl  , op>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, otype);
+            case f32:
+                res = arithSparseDenseOp<float, op>(lhs, rhs, reverse);
+                break;
+            case f64:
+                res = arithSparseDenseOp<double, op>(lhs, rhs, reverse);
+                break;
+            case c32:
+                res = arithSparseDenseOp<cfloat, op>(lhs, rhs, reverse);
+                break;
+            case c64:
+                res = arithSparseDenseOp<cdouble, op>(lhs, rhs, reverse);
+                break;
+            default: TYPE_ERROR(0, otype);
         }
 
         std::swap(*out, res);
@@ -94,70 +279,188 @@ static af_err af_arith_real(af_array *out, const af_array lhs, const af_array rh
     return AF_SUCCESS;
 }
 
-af_err af_add(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
-    return af_arith<af_add_t>(out, lhs, rhs, batchMode);
+af_err af_add(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
+    try {
+        // Check if inputs are sparse
+        const ArrayInfo &linfo = getInfo(lhs, false);
+        const ArrayInfo &rinfo = getInfo(rhs, false);
+
+        if (linfo.isSparse() && rinfo.isSparse()) {
+            return af_arith_sparse<af_add_t>(out, lhs, rhs);
+        }
+        if (linfo.isSparse() && !rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_add_t>(out, lhs, rhs);
+        }
+        if (!linfo.isSparse() && rinfo.isSparse()) {
+            // dense Array must be the second operand of af_arith_sparse_dense
+            return af_arith_sparse_dense<af_add_t>(out, rhs, lhs, true);
+        }
+        return af_arith<af_add_t>(out, lhs, rhs, batchMode);
+    }
+    CATCHALL;
 }
 
-af_err af_mul(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
-    return af_arith<af_mul_t>(out, lhs, rhs, batchMode);
+af_err af_mul(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
+    try {
+        // Check if inputs are sparse
+        const ArrayInfo &linfo = getInfo(lhs, false);
+        const ArrayInfo &rinfo = getInfo(rhs, false);
+
+        if (linfo.isSparse() && rinfo.isSparse()) {
+            // return af_arith_sparse<af_mul_t>(out, lhs, rhs);
+            // MKL doesn't have mul or div support yet, hence this is
+            // commented out, although alternative CPU code exists
+            return AF_ERR_NOT_SUPPORTED;
+        }
+        if (linfo.isSparse() && !rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_mul_t>(out, lhs, rhs);
+        }
+        if (!linfo.isSparse() && rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_mul_t>(
+                out, rhs, lhs,
+                true);  // dense should be rhs
+        }
+        return af_arith<af_mul_t>(out, lhs, rhs, batchMode);
+    }
+    CATCHALL;
 }
 
-af_err af_sub(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
-    return af_arith<af_sub_t>(out, lhs, rhs, batchMode);
+af_err af_sub(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
+    try {
+        // Check if inputs are sparse
+        const ArrayInfo &linfo = getInfo(lhs, false);
+        const ArrayInfo &rinfo = getInfo(rhs, false);
+
+        if (linfo.isSparse() && rinfo.isSparse()) {
+            return af_arith_sparse<af_sub_t>(out, lhs, rhs);
+        }
+        if (linfo.isSparse() && !rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_sub_t>(out, lhs, rhs);
+        }
+        if (!linfo.isSparse() && rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_sub_t>(
+                out, rhs, lhs,
+                true);  // dense should be rhs
+        }
+        return af_arith<af_sub_t>(out, lhs, rhs, batchMode);
+    }
+    CATCHALL;
 }
 
-af_err af_div(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
-    return af_arith<af_div_t>(out, lhs, rhs, batchMode);
+af_err af_div(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
+    try {
+        // Check if inputs are sparse
+        const ArrayInfo &linfo = getInfo(lhs, false);
+        const ArrayInfo &rinfo = getInfo(rhs, false);
+
+        if (linfo.isSparse() && rinfo.isSparse()) {
+            // return af_arith_sparse<af_div_t>(out, lhs, rhs);
+            // MKL doesn't have mul or div support yet, hence this is
+            // commented out, although alternative CPU code exists
+            return AF_ERR_NOT_SUPPORTED;
+        }
+        if (linfo.isSparse() && !rinfo.isSparse()) {
+            return af_arith_sparse_dense<af_div_t>(out, lhs, rhs);
+        }
+        if (!linfo.isSparse() && rinfo.isSparse()) {
+            // Division by sparse is currently not allowed - for convenience of
+            // dealing with division by 0
+            // return af_arith_sparse_dense<af_div_t>(out, rhs, lhs, true); //
+            // dense should be rhs
+            return AF_ERR_NOT_SUPPORTED;
+        }
+        return af_arith<af_div_t>(out, lhs, rhs, batchMode);
+    }
+    CATCHALL;
 }
 
-af_err af_maxof(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_maxof(af_array *out, const af_array lhs, const af_array rhs,
+                const bool batchMode) {
     return af_arith<af_max_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_minof(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_minof(af_array *out, const af_array lhs, const af_array rhs,
+                const bool batchMode) {
     return af_arith<af_min_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_rem(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_rem(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
     return af_arith_real<af_rem_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_mod(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_mod(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
     return af_arith_real<af_mod_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_pow(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_pow(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
     try {
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
-        if (linfo.isComplex() || rinfo.isComplex()) {
-            AF_ERROR("Powers of Complex numbers not supported", AF_ERR_NOT_SUPPORTED);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
+        if (rinfo.isComplex()) {
+            af_array log_lhs, log_res;
+            af_array res;
+            AF_CHECK(af_log(&log_lhs, lhs));
+            AF_CHECK(af_mul(&log_res, log_lhs, rhs, batchMode));
+            AF_CHECK(af_exp(&res, log_res));
+            AF_CHECK(af_release_array(log_lhs));
+            AF_CHECK(af_release_array(log_res));
+            std::swap(*out, res);
+            return AF_SUCCESS;
         }
-    } CATCHALL;
+        if (linfo.isComplex()) {
+            af_array mag, angle;
+            af_array mag_res, angle_res;
+            af_array real_res, imag_res, cplx_res;
+            af_array res;
+            AF_CHECK(af_abs(&mag, lhs));
+            AF_CHECK(af_arg(&angle, lhs));
+            AF_CHECK(af_pow(&mag_res, mag, rhs, batchMode));
+            AF_CHECK(af_mul(&angle_res, angle, rhs, batchMode));
+            AF_CHECK(af_cos(&real_res, angle_res));
+            AF_CHECK(af_sin(&imag_res, angle_res));
+            AF_CHECK(af_cplx2(&cplx_res, real_res, imag_res, batchMode));
+            AF_CHECK(af_mul(&res, mag_res, cplx_res, batchMode));
+            AF_CHECK(af_release_array(mag));
+            AF_CHECK(af_release_array(angle));
+            AF_CHECK(af_release_array(mag_res));
+            AF_CHECK(af_release_array(angle_res));
+            AF_CHECK(af_release_array(real_res));
+            AF_CHECK(af_release_array(imag_res));
+            AF_CHECK(af_release_array(cplx_res));
+            std::swap(*out, res);
+            return AF_SUCCESS;
+        }
+    }
+    CATCHALL;
 
     return af_arith_real<af_pow_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_root(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_root(af_array *out, const af_array lhs, const af_array rhs,
+               const bool batchMode) {
     try {
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
         if (linfo.isComplex() || rinfo.isComplex()) {
-            AF_ERROR("Powers of Complex numbers not supported", AF_ERR_NOT_SUPPORTED);
+            af_array log_lhs, log_res;
+            af_array res;
+            AF_CHECK(af_log(&log_lhs, lhs));
+            AF_CHECK(af_div(&log_res, log_lhs, rhs, batchMode));
+            AF_CHECK(af_exp(&res, log_res));
+            std::swap(*out, res);
+            return AF_SUCCESS;
         }
 
         af_array one;
-        AF_CHECK(af_constant(&one, 1, linfo.ndims(), linfo.dims().get(), linfo.getType()));
+        AF_CHECK(af_constant(&one, 1, linfo.ndims(), linfo.dims().get(),
+                             linfo.getType()));
 
         af_array inv_lhs;
         AF_CHECK(af_div(&inv_lhs, one, lhs, batchMode));
@@ -166,33 +469,36 @@ af_err af_root(af_array *out, const af_array lhs, const af_array rhs, const bool
 
         AF_CHECK(af_release_array(one));
         AF_CHECK(af_release_array(inv_lhs));
-
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_atan2(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_atan2(af_array *out, const af_array lhs, const af_array rhs,
+                const bool batchMode) {
     try {
-
         const af_dtype type = implicit(lhs, rhs);
 
-        if (type != f32 && type != f64) {
+        if (type != f16 && type != f32 && type != f64) {
             AF_ERROR("Only floating point arrays are supported for atan2 ",
                      AF_ERR_NOT_SUPPORTED);
         }
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
         dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
 
         af_array res;
         switch (type) {
-        case f32: res = arithOp<float , af_atan2_t>(lhs, rhs, odims); break;
-        case f64: res = arithOp<double, af_atan2_t>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, type);
+            case f16: res = arithOp<half, af_atan2_t>(lhs, rhs, odims); break;
+            case f32: res = arithOp<float, af_atan2_t>(lhs, rhs, odims); break;
+            case f64: res = arithOp<double, af_atan2_t>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -201,27 +507,31 @@ af_err af_atan2(af_array *out, const af_array lhs, const af_array rhs, const boo
     return AF_SUCCESS;
 }
 
-af_err af_hypot(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_hypot(af_array *out, const af_array lhs, const af_array rhs,
+                const bool batchMode) {
     try {
-
         const af_dtype type = implicit(lhs, rhs);
 
-        if (type != f32 && type != f64) {
+        if (type != f16 && type != f32 && type != f64) {
             AF_ERROR("Only floating point arrays are supported for hypot ",
                      AF_ERR_NOT_SUPPORTED);
         }
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
         dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
 
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
         af_array res;
         switch (type) {
-        case f32: res = arithOp<float , af_hypot_t>(lhs, rhs, odims); break;
-        case f64: res = arithOp<double, af_hypot_t>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, type);
+            case f16: res = arithOp<half, af_hypot_t>(lhs, rhs, odims); break;
+            case f32: res = arithOp<float, af_hypot_t>(lhs, rhs, odims); break;
+            case f64: res = arithOp<double, af_hypot_t>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -231,36 +541,45 @@ af_err af_hypot(af_array *out, const af_array lhs, const af_array rhs, const boo
 }
 
 template<typename T, af_op_t op>
-static inline af_array logicOp(const af_array lhs, const af_array rhs, const dim4 &odims)
-{
-    af_array res = getHandle(logicOp<T, op>(castArray<T>(lhs), castArray<T>(rhs), odims));
+static inline af_array logicOp(const af_array lhs, const af_array rhs,
+                               const dim4 &odims) {
+    af_array res =
+        getHandle(logicOp<T, op>(castArray<T>(lhs), castArray<T>(rhs), odims));
     return res;
 }
 
 template<af_op_t op>
-static af_err af_logic(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+static af_err af_logic(af_array *out, const af_array lhs, const af_array rhs,
+                       const bool batchMode) {
     try {
         const af_dtype type = implicit(lhs, rhs);
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
         dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
 
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
         af_array res;
         switch (type) {
-        case f32: res = logicOp<float  , op>(lhs, rhs, odims); break;
-        case f64: res = logicOp<double , op>(lhs, rhs, odims); break;
-        case c32: res = logicOp<cfloat , op>(lhs, rhs, odims); break;
-        case c64: res = logicOp<cdouble, op>(lhs, rhs, odims); break;
-        case s32: res = logicOp<int    , op>(lhs, rhs, odims); break;
-        case u32: res = logicOp<uint   , op>(lhs, rhs, odims); break;
-        case u8 : res = logicOp<uchar  , op>(lhs, rhs, odims); break;
-        case b8 : res = logicOp<char   , op>(lhs, rhs, odims); break;
-        case s64: res = logicOp<intl   , op>(lhs, rhs, odims); break;
-        case u64: res = logicOp<uintl  , op>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, type);
+            case f32: res = logicOp<float, op>(lhs, rhs, odims); break;
+            case f64: res = logicOp<double, op>(lhs, rhs, odims); break;
+            case c32: res = logicOp<cfloat, op>(lhs, rhs, odims); break;
+            case c64: res = logicOp<cdouble, op>(lhs, rhs, odims); break;
+            case s32: res = logicOp<int, op>(lhs, rhs, odims); break;
+            case u32: res = logicOp<uint, op>(lhs, rhs, odims); break;
+            case s8: res = logicOp<schar, op>(lhs, rhs, odims); break;
+            case u8: res = logicOp<uchar, op>(lhs, rhs, odims); break;
+            case b8: res = logicOp<char, op>(lhs, rhs, odims); break;
+            case s64: res = logicOp<intl, op>(lhs, rhs, odims); break;
+            case u64: res = logicOp<uintl, op>(lhs, rhs, odims); break;
+            case s16: res = logicOp<short, op>(lhs, rhs, odims); break;
+            case u16: res = logicOp<ushort, op>(lhs, rhs, odims); break;
+            case f16: res = logicOp<half, op>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -269,73 +588,81 @@ static af_err af_logic(af_array *out, const af_array lhs, const af_array rhs, co
     return AF_SUCCESS;
 }
 
-af_err af_eq(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_eq(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_eq_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_neq(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_neq(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
     return af_logic<af_neq_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_gt(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_gt(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_gt_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_ge(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_ge(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_ge_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_lt(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_lt(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_lt_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_le(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_le(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_le_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_and(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_and(af_array *out, const af_array lhs, const af_array rhs,
+              const bool batchMode) {
     return af_logic<af_and_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_or(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_or(af_array *out, const af_array lhs, const af_array rhs,
+             const bool batchMode) {
     return af_logic<af_or_t>(out, lhs, rhs, batchMode);
 }
 
 template<typename T, af_op_t op>
-static inline af_array bitOp(const af_array lhs, const af_array rhs, const dim4 &odims)
-{
-    af_array res = getHandle(bitOp<T, op>(castArray<T>(lhs), castArray<T>(rhs), odims));
+static inline af_array bitOp(const af_array lhs, const af_array rhs,
+                             const dim4 &odims) {
+    af_array res =
+        getHandle(bitOp<T, op>(castArray<T>(lhs), castArray<T>(rhs), odims));
     return res;
 }
 
 template<af_op_t op>
-static af_err af_bitwise(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+static af_err af_bitwise(af_array *out, const af_array lhs, const af_array rhs,
+                         const bool batchMode) {
     try {
         const af_dtype type = implicit(lhs, rhs);
 
-        ArrayInfo linfo = getInfo(lhs);
-        ArrayInfo rinfo = getInfo(rhs);
+        const ArrayInfo &linfo = getInfo(lhs);
+        const ArrayInfo &rinfo = getInfo(rhs);
 
         dim4 odims = getOutDims(linfo.dims(), rinfo.dims(), batchMode);
 
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
         af_array res;
         switch (type) {
-        case s32: res = bitOp<int    , op>(lhs, rhs, odims); break;
-        case u32: res = bitOp<uint   , op>(lhs, rhs, odims); break;
-        case u8 : res = bitOp<uchar  , op>(lhs, rhs, odims); break;
-        case b8 : res = bitOp<char   , op>(lhs, rhs, odims); break;
-        case s64: res = bitOp<intl   , op>(lhs, rhs, odims); break;
-        case u64: res = bitOp<uintl  , op>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, type);
+            case s32: res = bitOp<int, op>(lhs, rhs, odims); break;
+            case u32: res = bitOp<uint, op>(lhs, rhs, odims); break;
+            case s8: res = bitOp<schar, op>(lhs, rhs, odims); break;
+            case u8: res = bitOp<uchar, op>(lhs, rhs, odims); break;
+            case b8: res = bitOp<char, op>(lhs, rhs, odims); break;
+            case s64: res = bitOp<intl, op>(lhs, rhs, odims); break;
+            case u64: res = bitOp<uintl, op>(lhs, rhs, odims); break;
+            case s16: res = bitOp<short, op>(lhs, rhs, odims); break;
+            case u16: res = bitOp<ushort, op>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -344,27 +671,27 @@ static af_err af_bitwise(af_array *out, const af_array lhs, const af_array rhs,
     return AF_SUCCESS;
 }
 
-af_err af_bitand(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_bitand(af_array *out, const af_array lhs, const af_array rhs,
+                 const bool batchMode) {
     return af_bitwise<af_bitand_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_bitor(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_bitor(af_array *out, const af_array lhs, const af_array rhs,
+                const bool batchMode) {
     return af_bitwise<af_bitor_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_bitxor(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_bitxor(af_array *out, const af_array lhs, const af_array rhs,
+                 const bool batchMode) {
     return af_bitwise<af_bitxor_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_bitshiftl(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_bitshiftl(af_array *out, const af_array lhs, const af_array rhs,
+                    const bool batchMode) {
     return af_bitwise<af_bitshiftl_t>(out, lhs, rhs, batchMode);
 }
 
-af_err af_bitshiftr(af_array *out, const af_array lhs, const af_array rhs, const bool batchMode)
-{
+af_err af_bitshiftr(af_array *out, const af_array lhs, const af_array rhs,
+                    const bool batchMode) {
     return af_bitwise<af_bitshiftr_t>(out, lhs, rhs, batchMode);
 }
diff --git a/src/api/c/blas.cpp b/src/api/c/blas.cpp
index 203c1bb264..f42bc7d57c 100644
--- a/src/api/c/blas.cpp
+++ b/src/api/c/blas.cpp
@@ -8,105 +8,334 @@
  ********************************************************/
 
 #include <af/blas.h>
+
+#include <Array.hpp>
+#include <backend.hpp>
 #include <blas.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
 #include <handle.hpp>
-#include <Array.hpp>
+#include <sparse_blas.hpp>
+#include <sparse_handle.hpp>
+
+#include <type_util.hpp>
 #include <af/array.h>
+#include <af/data.h>
 #include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::getSparseArray;
+using arrayfire::getSparseArrayBase;
+using arrayfire::common::half;
+using arrayfire::common::SparseArrayBase;
+using detail::cdouble;
+using detail::cfloat;
+using detail::gemm;
+using detail::matmul;
+using detail::schar;
 
+namespace {
 template<typename T>
-static inline af_array matmul(const af_array lhs, const af_array rhs,
-                    af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    return getHandle(detail::matmul<T>(getArray<T>(lhs), getArray<T>(rhs), optLhs, optRhs));
+static inline af_array sparseMatmul(const af_array lhs, const af_array rhs,
+                                    af_mat_prop optLhs, af_mat_prop optRhs) {
+    return getHandle(
+        matmul<T>(getSparseArray<T>(lhs), getArray<T>(rhs), optLhs, optRhs));
+}
+
+template<typename Ti, typename To = Ti>
+static inline void gemm(af_array *out, af_mat_prop optLhs, af_mat_prop optRhs,
+                        const To *alpha, const af_array lhs, const af_array rhs,
+                        const To *betas) {
+    gemm<Ti, To>(getArray<To>(*out), optLhs, optRhs, alpha, getArray<Ti>(lhs),
+                 getArray<Ti>(rhs), betas);
 }
 
 template<typename T>
 static inline af_array dot(const af_array lhs, const af_array rhs,
-                    af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    return getHandle(detail::dot<T>(getArray<T>(lhs), getArray<T>(rhs), optLhs, optRhs));
+                           af_mat_prop optLhs, af_mat_prop optRhs) {
+    return getHandle(
+        dot<T>(getArray<T>(lhs), getArray<T>(rhs), optLhs, optRhs));
 }
 
-af_err af_matmul(   af_array *out,
-                    const af_array lhs, const af_array rhs,
-                    const af_mat_prop optLhs, const af_mat_prop optRhs)
-{
-    using namespace detail;
+template<typename T>
+static inline T dotAll(af_array out) {
+    T res{};
+    AF_CHECK(af_eval(out));
+    AF_CHECK(af_get_data_ptr((void *)&res, out));
+    return res;
+}
+
+}  // namespace
+
+af_err af_sparse_matmul(af_array *out, const af_array lhs, const af_array rhs,
+                        const af_mat_prop optLhs, const af_mat_prop optRhs) {
+    try {
+        const SparseArrayBase lhsBase = getSparseArrayBase(lhs);
+        const ArrayInfo &rhsInfo      = getInfo(rhs);
+
+        ARG_ASSERT(2,
+                   lhsBase.isSparse() == true && rhsInfo.isSparse() == false);
+
+        af_dtype lhs_type = lhsBase.getType();
+        af_dtype rhs_type = rhsInfo.getType();
+
+        ARG_ASSERT(1, lhsBase.getStorage() == AF_STORAGE_CSR);
+
+        if (!(optLhs == AF_MAT_NONE || optLhs == AF_MAT_TRANS ||
+              optLhs == AF_MAT_CTRANS)) {  // Note the ! operator.
+            AF_ERROR(
+                "Using this property is not yet supported in sparse matmul",
+                AF_ERR_NOT_SUPPORTED);
+        }
+
+        // No transpose options for RHS
+        if (optRhs != AF_MAT_NONE) {
+            AF_ERROR("Using this property is not yet supported in matmul",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        if (rhsInfo.ndims() > 2) {
+            AF_ERROR("Sparse matmul can not be used in batch mode",
+                     AF_ERR_BATCH);
+        }
+
+        TYPE_ASSERT(lhs_type == rhs_type);
+
+        af::dim4 ldims = lhsBase.dims();
+        int lColDim    = (optLhs == AF_MAT_NONE) ? 1 : 0;
+        int rRowDim    = (optRhs == AF_MAT_NONE) ? 0 : 1;
+
+        DIM_ASSERT(1, ldims[lColDim] == rhsInfo.dims()[rRowDim]);
+
+        af_array output = 0;
+        switch (lhs_type) {
+            case f32:
+                output = sparseMatmul<float>(lhs, rhs, optLhs, optRhs);
+                break;
+            case c32:
+                output = sparseMatmul<cfloat>(lhs, rhs, optLhs, optRhs);
+                break;
+            case f64:
+                output = sparseMatmul<double>(lhs, rhs, optLhs, optRhs);
+                break;
+            case c64:
+                output = sparseMatmul<cdouble>(lhs, rhs, optLhs, optRhs);
+                break;
+            default: TYPE_ERROR(1, lhs_type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
 
+af_err af_gemm(af_array *out, const af_mat_prop optLhs,
+               const af_mat_prop optRhs, const void *alpha, const af_array lhs,
+               const af_array rhs, const void *beta) {
     try {
-        ArrayInfo lhsInfo = getInfo(lhs);
-        ArrayInfo rhsInfo = getInfo(rhs);
+        const ArrayInfo &lhsInfo = getInfo(lhs, false);
+        const ArrayInfo &rhsInfo = getInfo(rhs, true);
 
         af_dtype lhs_type = lhsInfo.getType();
         af_dtype rhs_type = rhsInfo.getType();
 
-        if (!(optLhs == AF_MAT_NONE ||
-              optLhs == AF_MAT_TRANS ||
+        if (!(optLhs == AF_MAT_NONE || optLhs == AF_MAT_TRANS ||
               optLhs == AF_MAT_CTRANS)) {
-            AF_ERROR("Using this property is not yet supported in matmul", AF_ERR_NOT_SUPPORTED);
+            AF_ERROR("Using this property is not yet supported in matmul",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
-        if (!(optRhs == AF_MAT_NONE ||
-              optRhs == AF_MAT_TRANS ||
+        if (!(optRhs == AF_MAT_NONE || optRhs == AF_MAT_TRANS ||
               optRhs == AF_MAT_CTRANS)) {
-            AF_ERROR("Using this property is not yet supported in matmul", AF_ERR_NOT_SUPPORTED);
+            AF_ERROR("Using this property is not yet supported in matmul",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
+        af::dim4 lDims = lhsInfo.dims();
+        af::dim4 rDims = rhsInfo.dims();
 
-        if (lhsInfo.ndims() > 2 ||
-            rhsInfo.ndims() > 2) {
-            AF_ERROR("matmul can not be used in batch mode", AF_ERR_BATCH);
+        if (lDims.ndims() > 2 && rDims.ndims() > 2) {
+            DIM_ASSERT(3, lDims.ndims() == rDims.ndims());
+            if (lDims[2] != rDims[2] && lDims[2] != 1 && rDims[2] != 1) {
+                AF_ERROR("Batch size mismatch along dimension 2", AF_ERR_BATCH);
+            }
+            if (lDims[3] != rDims[3] && lDims[3] != 1 && rDims[3] != 1) {
+                AF_ERROR("Batch size mismatch along dimension 3", AF_ERR_BATCH);
+            }
         }
 
         TYPE_ASSERT(lhs_type == rhs_type);
-        af_array output = 0;
 
         int aColDim = (optLhs == AF_MAT_NONE) ? 1 : 0;
         int bRowDim = (optRhs == AF_MAT_NONE) ? 0 : 1;
 
         DIM_ASSERT(1, lhsInfo.dims()[aColDim] == rhsInfo.dims()[bRowDim]);
 
-        switch(lhs_type) {
-        case f32: output = matmul<float  >(lhs, rhs, optLhs, optRhs);   break;
-        case c32: output = matmul<cfloat >(lhs, rhs, optLhs, optRhs);   break;
-        case f64: output = matmul<double >(lhs, rhs, optLhs, optRhs);   break;
-        case c64: output = matmul<cdouble>(lhs, rhs, optLhs, optRhs);   break;
-        default:  TYPE_ERROR(1, lhs_type);
+        // Assume that *out is either initialized to null or a valid af_array;
+        // otherwise, this function has undefined behavior.
+        af_array output = 0;
+        if (*out) {
+            output = *out;
+        } else {
+            af_dtype out_type = (lhs_type != s8) ? lhs_type : f32;
+
+            const int aRowDim    = (optLhs == AF_MAT_NONE) ? 0 : 1;
+            const int bColDim    = (optRhs == AF_MAT_NONE) ? 1 : 0;
+            const int M          = lDims[aRowDim];
+            const int N          = rDims[bColDim];
+            const dim_t d2       = std::max(lDims[2], rDims[2]);
+            const dim_t d3       = std::max(lDims[3], rDims[3]);
+            const af::dim4 oDims = af::dim4(M, N, d2, d3);
+            AF_CHECK(af_create_handle(&output, lhsInfo.ndims(), oDims.get(),
+                                      out_type));
         }
+
+        switch (lhs_type) {
+            case f32:
+                gemm<float>(&output, optLhs, optRhs,
+                            static_cast<const float *>(alpha), lhs, rhs,
+                            static_cast<const float *>(beta));
+                break;
+            case c32:
+                gemm<cfloat>(&output, optLhs, optRhs,
+                             static_cast<const cfloat *>(alpha), lhs, rhs,
+                             static_cast<const cfloat *>(beta));
+                break;
+            case f64:
+                gemm<double>(&output, optLhs, optRhs,
+                             static_cast<const double *>(alpha), lhs, rhs,
+                             static_cast<const double *>(beta));
+                break;
+            case c64:
+                gemm<cdouble>(&output, optLhs, optRhs,
+                              static_cast<const cdouble *>(alpha), lhs, rhs,
+                              static_cast<const cdouble *>(beta));
+                break;
+            case f16:
+                gemm<half>(&output, optLhs, optRhs,
+                           static_cast<const half *>(alpha), lhs, rhs,
+                           static_cast<const half *>(beta));
+                break;
+            case s8:
+                gemm<schar, float>(&output, optLhs, optRhs,
+                                   static_cast<const float *>(alpha), lhs, rhs,
+                                   static_cast<const float *>(beta));
+                break;
+            default: TYPE_ERROR(3, lhs_type);
+        }
+
         std::swap(*out, output);
     }
     CATCHALL
     return AF_SUCCESS;
 }
 
-af_err af_dot(      af_array *out,
-                    const af_array lhs, const af_array rhs,
-                    const af_mat_prop optLhs, const af_mat_prop optRhs)
-{
-    using namespace detail;
+af_err af_matmul(af_array *out, const af_array lhs, const af_array rhs,
+                 const af_mat_prop optLhs, const af_mat_prop optRhs) {
+    try {
+        const ArrayInfo &lhsInfo = getInfo(lhs, false);
+        const ArrayInfo &rhsInfo = getInfo(rhs, true);
+
+        if (lhsInfo.isSparse()) {
+            return af_sparse_matmul(out, lhs, rhs, optLhs, optRhs);
+        }
+
+        const int aRowDim = (optLhs == AF_MAT_NONE) ? 0 : 1;
+        const int bColDim = (optRhs == AF_MAT_NONE) ? 1 : 0;
+
+        const af::dim4 &lDims = lhsInfo.dims();
+        const af::dim4 &rDims = rhsInfo.dims();
+        const int M           = lDims[aRowDim];
+        const int N           = rDims[bColDim];
+
+        const dim_t d2       = std::max(lDims[2], rDims[2]);
+        const dim_t d3       = std::max(lDims[3], rDims[3]);
+        const af::dim4 oDims = af::dim4(M, N, d2, d3);
+
+        af_dtype lhs_type = lhsInfo.getType();
+
+        af_array gemm_out      = 0;
+        af_dtype gemm_out_type = (lhs_type != s8) ? lhs_type : f32;
+        AF_CHECK(af_create_handle(&gemm_out, oDims.ndims(), oDims.get(),
+                                  gemm_out_type));
+
+        switch (lhs_type) {
+            case f16: {
+                static const half alpha(1.0f);
+                static const half beta(0.0f);
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            case f32: {
+                float alpha = 1.f;
+                float beta  = 0.f;
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            case c32: {
+                cfloat alpha{1.f, 0.f};
+                cfloat beta{0.f, 0.f};
+
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            case f64: {
+                double alpha = 1.0;
+                double beta  = 0.0;
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            case c64: {
+                cdouble alpha{1.0, 0.0};
+                cdouble beta{0.0, 0.0};
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            case s8: {
+                float alpha = 1.0;
+                float beta  = 0.0;
+                AF_CHECK(af_gemm(&gemm_out, optLhs, optRhs, &alpha, lhs, rhs,
+                                 &beta));
+                break;
+            }
+            default: TYPE_ERROR(1, lhs_type);
+        }
+
+        std::swap(*out, gemm_out);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
+af_err af_dot(af_array *out, const af_array lhs, const af_array rhs,
+              const af_mat_prop optLhs, const af_mat_prop optRhs) {
     try {
-        ArrayInfo lhsInfo = getInfo(lhs);
-        ArrayInfo rhsInfo = getInfo(rhs);
+        const ArrayInfo &lhsInfo = getInfo(lhs);
+        const ArrayInfo &rhsInfo = getInfo(rhs);
 
-        if (optLhs != AF_MAT_NONE) {
-            AF_ERROR("Using this property is not yet supported in dot", AF_ERR_NOT_SUPPORTED);
+        if (optLhs != AF_MAT_NONE && optLhs != AF_MAT_CONJ) {
+            AF_ERROR("Using this property is not yet supported in dot",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
-        if (optRhs != AF_MAT_NONE) {
-            AF_ERROR("Using this property is not yet supported in dot", AF_ERR_NOT_SUPPORTED);
+        if (optRhs != AF_MAT_NONE && optRhs != AF_MAT_CONJ) {
+            AF_ERROR("Using this property is not yet supported in dot",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
         DIM_ASSERT(1, lhsInfo.dims()[0] == rhsInfo.dims()[0]);
         af_dtype lhs_type = lhsInfo.getType();
         af_dtype rhs_type = rhsInfo.getType();
 
-        if (lhsInfo.ndims() > 2 ||
-            rhsInfo.ndims() > 2) {
+        if (lhsInfo.ndims() == 0) { return af_retain_array(out, lhs); }
+        if (lhsInfo.ndims() > 1 || rhsInfo.ndims() > 1) {
             AF_ERROR("dot can not be used in batch mode", AF_ERR_BATCH);
         }
 
@@ -114,15 +343,55 @@ af_err af_dot(      af_array *out,
 
         af_array output = 0;
 
-        switch(lhs_type) {
-        case f32: output = dot<float  >(lhs, rhs, optLhs, optRhs);   break;
-            //case c32: output = dot<cfloat >(lhs, rhs, optLhs, optRhs);   break;
-        case f64: output = dot<double >(lhs, rhs, optLhs, optRhs);   break;
-            //case c64: output = dot<cdouble>(lhs, rhs, optLhs, optRhs);   break;
-        default:  TYPE_ERROR(1, lhs_type);
+        switch (lhs_type) {
+            case f16: output = dot<half>(lhs, rhs, optLhs, optRhs); break;
+            case f32: output = dot<float>(lhs, rhs, optLhs, optRhs); break;
+            case c32: output = dot<cfloat>(lhs, rhs, optLhs, optRhs); break;
+            case f64: output = dot<double>(lhs, rhs, optLhs, optRhs); break;
+            case c64: output = dot<cdouble>(lhs, rhs, optLhs, optRhs); break;
+            default: TYPE_ERROR(1, lhs_type);
         }
         std::swap(*out, output);
     }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_dot_all(double *rval, double *ival, const af_array lhs,
+                  const af_array rhs, const af_mat_prop optLhs,
+                  const af_mat_prop optRhs) {
+    using namespace detail;  // NOLINT needed for imag and real functions
+                             // name resolution
+
+    try {
+        *rval = 0;
+        if (ival) { *ival = 0; }
+
+        af_array out = 0;
+        AF_CHECK(af_dot(&out, lhs, rhs, optLhs, optRhs));
+
+        const ArrayInfo &lhsInfo = getInfo(lhs);
+        af_dtype lhs_type        = lhsInfo.getType();
+
+        switch (lhs_type) {
+            case f16: *rval = static_cast<double>(dotAll<half>(out)); break;
+            case f32: *rval = dotAll<float>(out); break;
+            case f64: *rval = dotAll<double>(out); break;
+            case c32: {
+                cfloat temp = dotAll<cfloat>(out);
+                *rval       = real(temp);
+                if (ival) { *ival = imag(temp); }
+            } break;
+            case c64: {
+                cdouble temp = dotAll<cdouble>(out);
+                *rval        = real(temp);
+                if (ival) { *ival = imag(temp); }
+            } break;
+            default: TYPE_ERROR(1, lhs_type);
+        }
+
+        if (out != 0) { AF_CHECK(af_release_array(out)); }
+    }
     CATCHALL
-        return AF_SUCCESS;
+    return AF_SUCCESS;
 }
diff --git a/src/api/c/canny.cpp b/src/api/c/canny.cpp
new file mode 100644
index 0000000000..b68b8d4ed0
--- /dev/null
+++ b/src/api/c/canny.cpp
@@ -0,0 +1,285 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <canny.hpp>
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/tile.hpp>
+#include <complex.hpp>
+#include <convolve.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <histogram.hpp>
+#include <iota.hpp>
+#include <ireduce.hpp>
+#include <logic.hpp>
+#include <reduce.hpp>
+#include <scan.hpp>
+#include <sobel.hpp>
+#include <transpose.hpp>
+#include <unary.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include <af/seq.h>
+#include <utility>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::cast;
+using arrayfire::common::tile;
+using detail::arithOp;
+using detail::Array;
+using detail::convolve2;
+using detail::createEmptyArray;
+using detail::createHostDataArray;
+using detail::createSubArray;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::histogram;
+using detail::iota;
+using detail::ireduce;
+using detail::logicOp;
+using detail::reduce;
+using detail::reduce_all;
+using detail::scan;
+using detail::schar;
+using detail::sobelDerivatives;
+using detail::uchar;
+using detail::uint;
+using detail::unaryOp;
+using detail::ushort;
+using std::make_pair;
+using std::pair;
+using std::vector;
+
+namespace {
+Array<float> gradientMagnitude(const Array<float>& gx, const Array<float>& gy,
+                               const bool& isf) {
+    using detail::abs;
+    if (isf) {
+        Array<float> gx2 = abs<float, float>(gx);
+        Array<float> gy2 = abs<float, float>(gy);
+        return arithOp<float, af_add_t>(gx2, gy2, gx2.dims());
+    } else {
+        Array<float> gx2 = arithOp<float, af_mul_t>(gx, gx, gx.dims());
+        Array<float> gy2 = arithOp<float, af_mul_t>(gy, gy, gy.dims());
+        Array<float> sg  = arithOp<float, af_add_t>(gx2, gy2, gx2.dims());
+        return unaryOp<float, af_sqrt_t>(sg);
+    }
+}
+
+Array<float> otsuThreshold(const Array<float>& in, const unsigned NUM_BINS,
+                           const float maxVal) {
+    Array<uint> hist = histogram<float>(in, NUM_BINS, 0, maxVal, false);
+
+    const dim4& inDims = in.dims();
+    const dim4& hDims  = hist.dims();
+
+    const dim4 oDims(1, hDims[1], hDims[2], hDims[3]);
+    vector<af_seq> seqBegin(4, af_span);
+    vector<af_seq> seqRest(4, af_span);
+    vector<af_seq> sliceIndex(4, af_span);
+
+    seqBegin[0] = af_make_seq(0, static_cast<double>(hDims[0] - 1), 1);
+    seqRest[0]  = af_make_seq(0, static_cast<double>(hDims[0] - 1), 1);
+
+    Array<float> UnitP  = createValueArray<float>(oDims, 1.0f);
+    Array<float> histf  = cast<float, uint>(hist);
+    Array<float> totals = createValueArray<float>(hDims, inDims[0] * inDims[1]);
+    Array<float> weights =
+        iota<float>(dim4(NUM_BINS), oDims);  // a.k.a. histogram shape
+
+    // pixel frequency probabilities
+    auto freqs        = arithOp<float, af_div_t>(histf, totals, hDims);
+    auto cumFreqs     = scan<af_add_t, float, float>(freqs, 0);
+    auto oneMCumFreqs = arithOp<float, af_sub_t>(UnitP, cumFreqs, hDims);
+    auto qLqH         = arithOp<float, af_mul_t>(cumFreqs, oneMCumFreqs, hDims);
+    auto product      = arithOp<float, af_mul_t>(weights, freqs, hDims);
+    auto cumProduct   = scan<af_add_t, float, float>(product, 0);
+    auto weightedSum  = reduce<af_add_t, float, float>(product, 0);
+
+    dim4 sigmaDims(NUM_BINS - 1, hDims[1], hDims[2], hDims[3]);
+    Array<float> sigmas = createEmptyArray<float>(sigmaDims);
+    for (unsigned b = 0; b < (NUM_BINS - 1); ++b) {
+        const dim4 fDims(b + 1, hDims[1], hDims[2], hDims[3]);
+        const dim4 eDims(NUM_BINS - 1 - b, hDims[1], hDims[2], hDims[3]);
+
+        sliceIndex[0]    = {double(b), double(b), 1};
+        seqBegin[0].end  = static_cast<double>(b);
+        seqRest[0].begin = static_cast<double>(b + 1);
+
+        auto qL    = createSubArray(cumFreqs, sliceIndex, false);
+        auto qH    = arithOp<float, af_sub_t>(UnitP, qL, oDims);
+        auto _muL  = createSubArray(cumProduct, sliceIndex, false);
+        auto _muH  = arithOp<float, af_sub_t>(weightedSum, _muL, oDims);
+        auto muL   = arithOp<float, af_div_t>(_muL, qL, oDims);
+        auto muH   = arithOp<float, af_div_t>(_muH, qH, oDims);
+        auto diff  = arithOp<float, af_sub_t>(muL, muH, oDims);
+        auto sqrd  = arithOp<float, af_mul_t>(diff, diff, oDims);
+        auto op2   = createSubArray(qLqH, sliceIndex, false);
+        auto sigma = arithOp<float, af_mul_t>(sqrd, op2, oDims);
+
+        auto binRes = createSubArray<float>(sigmas, sliceIndex, false);
+        copyArray(binRes, sigma);
+    }
+
+    Array<float> thresh = createEmptyArray<float>(oDims);
+    Array<uint> locs    = createEmptyArray<uint>(oDims);
+
+    ireduce<af_max_t, float>(thresh, locs, sigmas, 0);
+
+    return cast<float, uint>(
+        arrayfire::common::tile(locs, dim4(inDims[0], inDims[1])));
+}
+
+Array<float> normalize(const Array<float>& supEdges, const float minVal,
+                       const float maxVal) {
+    auto minArray = createValueArray<float>(supEdges.dims(), minVal);
+    auto diff  = arithOp<float, af_sub_t>(supEdges, minArray, supEdges.dims());
+    auto denom = createValueArray<float>(supEdges.dims(), (maxVal - minVal));
+    return arithOp<float, af_div_t>(diff, denom, supEdges.dims());
+}
+
+pair<Array<char>, Array<char>> computeCandidates(const Array<float>& supEdges,
+                                                 const float t1,
+                                                 const af_canny_threshold ct,
+                                                 const float t2) {
+    float maxVal =
+        getScalar<float>(reduce_all<af_max_t, float, float>(supEdges));
+    // Use the (truncated) maximum edge value as the histogram bin count
+    auto NUM_BINS = static_cast<unsigned>(maxVal);
+
+    auto lowRatio = createValueArray<float>(supEdges.dims(), t1);
+
+    switch (ct) {  // NOLINT(hicpp-multiway-paths-covered)
+        case AF_CANNY_THRESHOLD_AUTO_OTSU: {
+            auto T2 = otsuThreshold(supEdges, NUM_BINS, maxVal);
+            auto T1 = arithOp<float, af_mul_t>(T2, lowRatio, T2.dims());
+            Array<char> weak1 =
+                logicOp<float, af_ge_t>(supEdges, T1, supEdges.dims());
+            Array<char> weak2 =
+                logicOp<float, af_lt_t>(supEdges, T2, supEdges.dims());
+            Array<char> weak =
+                logicOp<char, af_and_t>(weak1, weak2, weak1.dims());
+            Array<char> strong =
+                logicOp<float, af_ge_t>(supEdges, T2, supEdges.dims());
+            return make_pair(strong, weak);
+        }
+        default: {
+            float minVal =
+                getScalar<float>(reduce_all<af_min_t, float, float>(supEdges));
+            auto normG = normalize(supEdges, minVal, maxVal);
+            auto T2    = createValueArray<float>(supEdges.dims(), t2);
+            auto T1    = createValueArray<float>(supEdges.dims(), t1);
+            Array<char> weak1 =
+                logicOp<float, af_ge_t>(normG, T1, normG.dims());
+            Array<char> weak2 =
+                logicOp<float, af_lt_t>(normG, T2, normG.dims());
+            Array<char> weak =
+                logicOp<char, af_and_t>(weak1, weak2, weak1.dims());
+            Array<char> strong =
+                logicOp<float, af_ge_t>(normG, T2, normG.dims());
+            return std::make_pair(strong, weak);
+        }
+    }
+}
+
+template<typename T>
+af_array cannyHelper(const Array<T>& in, const float t1,
+                     const af_canny_threshold ct, const float t2,
+                     const unsigned sw, const bool isf) {
+    static const vector<float> v{-0.11021f, -0.23691f, -0.30576f, -0.23691f,
+                                 -0.11021f};
+    Array<float> cFilter = createHostDataArray<float>(dim4(5, 1), v.data());
+    Array<float> rFilter = createHostDataArray<float>(dim4(1, 5), v.data());
+
+    // Run separable convolution to smooth the input image
+    Array<float> smt =
+        convolve2<float, float>(cast<float, T>(in), cFilter, rFilter, false);
+
+    auto g          = sobelDerivatives<float, float>(smt, sw);
+    Array<float> gx = g.first;
+    Array<float> gy = g.second;
+
+    Array<float> gmag = gradientMagnitude(gx, gy, isf);
+
+    Array<float> supEdges = nonMaximumSuppression(gmag, gx, gy);
+
+    auto swpair = computeCandidates(supEdges, t1, ct, t2);
+
+    return getHandle(edgeTrackingByHysteresis(swpair.first, swpair.second));
+}
+
+}  // namespace
+
+af_err af_canny(af_array* out, const af_array in, const af_canny_threshold ct,
+                const float t1, const float t2, const unsigned sw,
+                const bool isf) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        DIM_ASSERT(2, (dims.ndims() >= 2));
+        // The input should be at least a 5x5 image because the
+        // Gaussian filter used to smooth it is 5x5. This is not
+        // strictly mandatory, but the result is essentially of
+        // no use if the image is smaller than 5x5.
+        DIM_ASSERT(2, (dims[0] >= 5 && dims[1] >= 5));
+        ARG_ASSERT(5, (sw == 3));
+
+        af_array output;
+
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                output = cannyHelper<float>(getArray<float>(in), t1, ct, t2, sw,
+                                            isf);
+                break;
+            case f64:
+                output = cannyHelper<double>(getArray<double>(in), t1, ct, t2,
+                                             sw, isf);
+                break;
+            case s32:
+                output =
+                    cannyHelper<int>(getArray<int>(in), t1, ct, t2, sw, isf);
+                break;
+            case u32:
+                output =
+                    cannyHelper<uint>(getArray<uint>(in), t1, ct, t2, sw, isf);
+                break;
+            case s16:
+                output = cannyHelper<short>(getArray<short>(in), t1, ct, t2, sw,
+                                            isf);
+                break;
+            case u16:
+                output = cannyHelper<ushort>(getArray<ushort>(in), t1, ct, t2,
+                                             sw, isf);
+                break;
+            case s8:
+                output = cannyHelper<schar>(getArray<schar>(in), t1, ct, t2, sw,
+                                            isf);
+                break;
+            case u8:
+                output = cannyHelper<uchar>(getArray<uchar>(in), t1, ct, t2, sw,
+                                            isf);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        // The output array is a binary array
+        std::swap(output, *out);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/cast.cpp b/src/api/c/cast.cpp
index 379b2df91b..7b421d28bb 100644
--- a/src/api/c/cast.cpp
+++ b/src/api/c/cast.cpp
@@ -7,65 +7,88 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+#include <sparse_handle.hpp>
+#include <af/arith.h>
 #include <af/array.h>
 #include <af/defines.h>
-#include <af/arith.h>
-#include <ArrayInfo.hpp>
-#include <optypes.hpp>
+#include <af/dim4.hpp>
 
-#include <cast.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-
-using namespace detail;
+using af::dim4;
+using arrayfire::castSparse;
+using arrayfire::getHandle;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
-static af_array cast(const af_array in, const af_dtype type)
-{
-    const ArrayInfo info = getInfo(in);
+static af_array cast(const af_array in, const af_dtype type) {
+    const ArrayInfo& info = getInfo(in, false);
 
-    if (info.getType() == type) {
-        return retain(in);
-    }
+    if (info.getType() == type) { return retain(in); }
 
-    switch (type) {
-    case f32: return getHandle(castArray<float   >(in));
-    case f64: return getHandle(castArray<double  >(in));
-    case c32: return getHandle(castArray<cfloat  >(in));
-    case c64: return getHandle(castArray<cdouble >(in));
-    case s32: return getHandle(castArray<int     >(in));
-    case u32: return getHandle(castArray<uint    >(in));
-    case u8 : return getHandle(castArray<uchar   >(in));
-    case b8 : return getHandle(castArray<char    >(in));
-    case s64: return getHandle(castArray<intl    >(in));
-    case u64: return getHandle(castArray<uintl   >(in));
-    default: TYPE_ERROR(2, type);
+    if (info.isSparse()) {
+        switch (type) {
+            case f32: return getHandle(castSparse<float>(in));
+            case f64: return getHandle(castSparse<double>(in));
+            case c32: return getHandle(castSparse<cfloat>(in));
+            case c64: return getHandle(castSparse<cdouble>(in));
+            default: TYPE_ERROR(2, type);
+        }
+    } else {
+        switch (type) {
+            case f32: return getHandle(castArray<float>(in));
+            case f64: return getHandle(castArray<double>(in));
+            case c32: return getHandle(castArray<cfloat>(in));
+            case c64: return getHandle(castArray<cdouble>(in));
+            case s32: return getHandle(castArray<int>(in));
+            case u32: return getHandle(castArray<uint>(in));
+            case s8: return getHandle(castArray<schar>(in));
+            case u8: return getHandle(castArray<uchar>(in));
+            case b8: return getHandle(castArray<char>(in));
+            case s64: return getHandle(castArray<intl>(in));
+            case u64: return getHandle(castArray<uintl>(in));
+            case s16: return getHandle(castArray<short>(in));
+            case u16: return getHandle(castArray<ushort>(in));
+            case f16: return getHandle(castArray<half>(in));
+            default: TYPE_ERROR(2, type);
+        }
     }
 }
 
-af_err af_cast(af_array *out, const af_array in, const af_dtype type)
-{
+af_err af_cast(af_array* out, const af_array in, const af_dtype type) {
     try {
-        af_array res = cast(in, type);
-        std::swap(*out, res);
-    }
-    CATCHALL;
+        const ArrayInfo& info = getInfo(in, false);
 
-    return AF_SUCCESS;
-}
-
-af_err af_cplx(af_array *out, const af_array in, const af_dtype type)
-{
-    try {
-        af_array res;
-        ArrayInfo in_info = getInfo(in);
+        af_dtype inType = info.getType();
+        if ((inType == c32 || inType == c64) &&
+            (type == f32 || type == f64 || type == f16)) {
+            AF_ERROR(
+                "Casting is not allowed from complex (c32/c64) to real "
+                "(f16/f32/f64) types.\n"
+                "Use abs, real, imag etc to convert complex to floating type.",
+                AF_ERR_TYPE);
+        }
 
-        if (in_info.isDouble()) {
-            res = cast(in, c64);
-        } else {
-            res = cast(in, c32);
+        dim4 idims = info.dims();
+        if (idims.elements() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
         }
 
+        af_array res = cast(in, type);
+
         std::swap(*out, res);
     }
     CATCHALL;
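
The new guard in `af_cast` above rejects casts from complex to real floating-point types before any work is dispatched. A minimal standalone sketch of that check follows. The `DType` enum and the helper names are hypothetical stand-ins for illustration, not part of the ArrayFire API, and the sketch mirrors only this one check, not the later per-type dispatch.

```cpp
#include <cassert>

// Hypothetical stand-in for af_dtype; only the values needed here.
enum class DType { f16, f32, f64, c32, c64, s32 };

// True for the two complex types named in the guard.
inline bool isComplex(DType t) { return t == DType::c32 || t == DType::c64; }

// True for the real floating-point targets named in the guard.
inline bool isRealFloat(DType t) {
    return t == DType::f16 || t == DType::f32 || t == DType::f64;
}

// Mirrors the condition in the diff: complex -> real floating is rejected,
// every other pair is left to the later per-type dispatch.
inline bool passesComplexGuard(DType from, DType to) {
    return !(isComplex(from) && isRealFloat(to));
}
```

Under this sketch, `passesComplexGuard(DType::c32, DType::f32)` is false, while a real-to-complex cast such as `f32` to `c64` passes the guard.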
diff --git a/src/api/c/cholesky.cpp b/src/api/c/cholesky.cpp
index d738568e06..1a662c649f 100644
--- a/src/api/c/cholesky.cpp
+++ b/src/api/c/cholesky.cpp
@@ -7,34 +7,35 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <cholesky.hpp>
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <af/array.h>
-#include <af/lapack.h>
 #include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <cholesky.hpp>
+#include <af/lapack.h>
 
-using af::dim4;
-using namespace detail;
+using arrayfire::getArray;
+using detail::cdouble;
+using detail::cfloat;
 
 template<typename T>
-static inline af_array cholesky(int *info, const af_array in, const bool is_upper)
-{
+static inline af_array cholesky(int *info, const af_array in,
+                                const bool is_upper) {
     return getHandle(cholesky<T>(info, getArray<T>(in), is_upper));
 }
 
 template<typename T>
-static inline int cholesky_inplace(af_array in, const bool is_upper)
-{
-     return cholesky_inplace<T>(getWritableArray<T>(in), is_upper);
+static inline int cholesky_inplace(af_array in, const bool is_upper) {
+    return cholesky_inplace<T>(getArray<T>(in), is_upper);
 }
 
-af_err af_cholesky(af_array *out, int *info, const af_array in, const bool is_upper)
-{
+af_err af_cholesky(af_array *out, int *info, const af_array in,
+                   const bool is_upper) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("cholesky can not be used in batch mode", AF_ERR_BATCH);
@@ -42,16 +43,20 @@ af_err af_cholesky(af_array *out, int *info, const af_array in, const bool is_up
 
         af_dtype type = i_info.getType();
 
-        ARG_ASSERT(2, i_info.isFloating());                  // Only floating and complex types
-        DIM_ASSERT(1, i_info.dims()[0] == i_info.dims()[1]); // Only square matrices
+        if (i_info.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+        DIM_ASSERT(
+            1, i_info.dims()[0] == i_info.dims()[1]);  // Only square matrices
+        ARG_ASSERT(2, i_info.isFloating());  // Only floating and complex types
 
         af_array output;
-        switch(type) {
-            case f32: output = cholesky<float  >(info, in, is_upper);  break;
-            case f64: output = cholesky<double >(info, in, is_upper);  break;
-            case c32: output = cholesky<cfloat >(info, in, is_upper);  break;
-            case c64: output = cholesky<cdouble>(info, in, is_upper);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = cholesky<float>(info, in, is_upper); break;
+            case f64: output = cholesky<double>(info, in, is_upper); break;
+            case c32: output = cholesky<cfloat>(info, in, is_upper); break;
+            case c64: output = cholesky<cdouble>(info, in, is_upper); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
@@ -60,28 +65,28 @@ af_err af_cholesky(af_array *out, int *info, const af_array in, const bool is_up
     return AF_SUCCESS;
 }
 
-af_err af_cholesky_inplace(int *info, af_array in, const bool is_upper)
-{
+af_err af_cholesky_inplace(int *info, af_array in, const bool is_upper) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("cholesky can not be used in batch mode", AF_ERR_BATCH);
         }
 
         af_dtype type = i_info.getType();
-
-        ARG_ASSERT(1, i_info.isFloating()); // Only floating and complex types
-        DIM_ASSERT(1, i_info.dims()[0] == i_info.dims()[1]); // Only square matrices
+        if (i_info.ndims() == 0) { return AF_SUCCESS; }
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
+        DIM_ASSERT(
+            1, i_info.dims()[0] == i_info.dims()[1]);  // Only square matrices
 
         int out;
 
-        switch(type) {
-            case f32: out = cholesky_inplace<float  >(in, is_upper);  break;
-            case f64: out = cholesky_inplace<double >(in, is_upper);  break;
-            case c32: out = cholesky_inplace<cfloat >(in, is_upper);  break;
-            case c64: out = cholesky_inplace<cdouble>(in, is_upper);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: out = cholesky_inplace<float>(in, is_upper); break;
+            case f64: out = cholesky_inplace<double>(in, is_upper); break;
+            case c32: out = cholesky_inplace<cfloat>(in, is_upper); break;
+            case c64: out = cholesky_inplace<cdouble>(in, is_upper); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*info, out);
     }
diff --git a/src/api/c/clamp.cpp b/src/api/c/clamp.cpp
new file mode 100644
index 0000000000..8c31469e55
--- /dev/null
+++ b/src/api/c/clamp.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <implicit.hpp>
+#include <logic.hpp>
+#include <optypes.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
+
+using af::dim4;
+using arrayfire::common::half;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T>
+static inline af_array clampOp(const af_array in, const af_array lo,
+                               const af_array hi, const dim4& odims) {
+    const Array<T> L = castArray<T>(lo);
+    const Array<T> H = castArray<T>(hi);
+    const Array<T> I = castArray<T>(in);
+    return getHandle(
+        arithOp<T, af_min_t>(arithOp<T, af_max_t>(I, L, odims), H, odims));
+}
+
+af_err af_clamp(af_array* out, const af_array in, const af_array lo,
+                const af_array hi, const bool batch) {
+    try {
+        const ArrayInfo& linfo = getInfo(lo);
+        const ArrayInfo& hinfo = getInfo(hi);
+        const ArrayInfo& iinfo = getInfo(in);
+
+        DIM_ASSERT(2, linfo.dims() == hinfo.dims());
+        TYPE_ASSERT(linfo.getType() == hinfo.getType());
+
+        dim4 odims           = getOutDims(iinfo.dims(), linfo.dims(), batch);
+        const af_dtype otype = implicit(iinfo.getType(), linfo.getType());
+
+        af_array res;
+        switch (otype) {
+            case f32: res = clampOp<float>(in, lo, hi, odims); break;
+            case f64: res = clampOp<double>(in, lo, hi, odims); break;
+            case c32: res = clampOp<cfloat>(in, lo, hi, odims); break;
+            case c64: res = clampOp<cdouble>(in, lo, hi, odims); break;
+            case s32: res = clampOp<int>(in, lo, hi, odims); break;
+            case u32: res = clampOp<uint>(in, lo, hi, odims); break;
+            case s8: res = clampOp<schar>(in, lo, hi, odims); break;
+            case u8: res = clampOp<uchar>(in, lo, hi, odims); break;
+            case b8: res = clampOp<char>(in, lo, hi, odims); break;
+            case s64: res = clampOp<intl>(in, lo, hi, odims); break;
+            case u64: res = clampOp<uintl>(in, lo, hi, odims); break;
+            case s16: res = clampOp<short>(in, lo, hi, odims); break;
+            case u16: res = clampOp<ushort>(in, lo, hi, odims); break;
+            case f16: res = clampOp<half>(in, lo, hi, odims); break;
+            default: TYPE_ERROR(0, otype);
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
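
`clampOp` in the new file above composes two element-wise operations: a max against `lo`, then a min against `hi`. The sketch below shows the same composition on plain `std::vector` values. `clampSketch` is a hypothetical helper for illustration only; it does not use ArrayFire and ignores batching and type promotion.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not an ArrayFire function: element-wise
// min(max(in, lo), hi), the same composition used by clampOp.
template<typename T>
std::vector<T> clampSketch(const std::vector<T>& in, const std::vector<T>& lo,
                           const std::vector<T>& hi) {
    // All three inputs are assumed to have the same length.
    assert(in.size() == lo.size() && lo.size() == hi.size());
    std::vector<T> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = std::min(std::max(in[i], lo[i]), hi[i]);
    }
    return out;
}
```

Because the min is applied last, an element whose lower bound exceeds its upper bound comes out equal to the upper bound.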
diff --git a/src/api/c/colorspace.cpp b/src/api/c/colorspace.cpp
index e86b262d92..8fe078d6a5 100644
--- a/src/api/c/colorspace.cpp
+++ b/src/api/c/colorspace.cpp
@@ -7,25 +7,75 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/err_common.hpp>
+#include <af/array.h>
 #include <af/defines.h>
 #include <af/image.h>
-#include <err_common.hpp>
 
-af_err af_color_space(af_array *out, const af_array image, const af_cspace_t to, const af_cspace_t from)
-{
-    bool hsv2rgb  = (from==AF_HSV  && to==AF_RGB );
-    bool rgb2hsv  = (from==AF_RGB  && to==AF_HSV );
-    bool gray2rgb = (from==AF_GRAY && to==AF_RGB );
-    bool rgb2gray = (from==AF_RGB  && to==AF_GRAY);
+template<af_cspace_t FROM, af_cspace_t TO>
+void color_space(af_array *out, const af_array image) {
+    UNUSED(out);
+    UNUSED(image);
+    AF_ERROR(
+        "Color Space: Conversion from source type to output type not supported",
+        AF_ERR_NOT_SUPPORTED);
+}
+
+#define INSTANTIATE_CSPACE_DEFS1(F, T, FUNC)                       \
+    template<>                                                     \
+    void color_space<F, T>(af_array * out, const af_array image) { \
+        AF_CHECK(FUNC(out, image));                                \
+    }
+
+#define INSTANTIATE_CSPACE_DEFS2(F, T, FUNC, ...)                  \
+    template<>                                                     \
+    void color_space<F, T>(af_array * out, const af_array image) { \
+        AF_CHECK(FUNC(out, image, __VA_ARGS__));                   \
+    }
+
+INSTANTIATE_CSPACE_DEFS1(AF_HSV, AF_RGB, af_hsv2rgb);
+INSTANTIATE_CSPACE_DEFS1(AF_RGB, AF_HSV, af_rgb2hsv);
+
+INSTANTIATE_CSPACE_DEFS2(AF_RGB, AF_GRAY, af_rgb2gray, 0.2126f, 0.7152f,
+                         0.0722f);
+INSTANTIATE_CSPACE_DEFS2(AF_GRAY, AF_RGB, af_gray2rgb, 1.0f, 1.0f, 1.0f);
+INSTANTIATE_CSPACE_DEFS2(AF_YCbCr, AF_RGB, af_ycbcr2rgb, AF_YCC_601);
+INSTANTIATE_CSPACE_DEFS2(AF_RGB, AF_YCbCr, af_rgb2ycbcr, AF_YCC_601);
+
+template<af_cspace_t FROM>
+static void color_space(af_array *out, const af_array image,
+                        const af_cspace_t to) {
+    switch (to) {
+        case AF_GRAY: color_space<FROM, AF_GRAY>(out, image); break;
+        case AF_RGB: color_space<FROM, AF_RGB>(out, image); break;
+        case AF_HSV: color_space<FROM, AF_HSV>(out, image); break;
+        case AF_YCbCr: color_space<FROM, AF_YCbCr>(out, image); break;
+        default:
+            AF_ERROR("Incorrect enum value for output color type", AF_ERR_ARG);
+    }
+}
 
-    ARG_ASSERT(2, (hsv2rgb || rgb2hsv || gray2rgb || rgb2gray));
+af_err af_color_space(af_array *out, const af_array image, const af_cspace_t to,
+                      const af_cspace_t from) {
+    try {
+        if (from == to) { return af_retain_array(out, image); }
 
-    af_err result = AF_SUCCESS;
+        ARG_ASSERT(2, (to == AF_GRAY || to == AF_RGB || to == AF_HSV ||
+                       to == AF_YCbCr));
+        ARG_ASSERT(2, (from == AF_GRAY || from == AF_RGB || from == AF_HSV ||
+                       from == AF_YCbCr));
 
-    if (hsv2rgb)  result = af_hsv2rgb(out, image);
-    if (rgb2hsv)  result = af_rgb2hsv(out, image);
-    if (gray2rgb) result = af_gray2rgb(out, image, 1.0f, 1.0f, 1.0f);
-    if (rgb2gray) result = af_rgb2gray(out, image, 0.2126f, 0.7152f, 0.0722f);
+        switch (from) {
+            case AF_GRAY: color_space<AF_GRAY>(out, image, to); break;
+            case AF_RGB: color_space<AF_RGB>(out, image, to); break;
+            case AF_HSV: color_space<AF_HSV>(out, image, to); break;
+            case AF_YCbCr: color_space<AF_YCbCr>(out, image, to); break;
+            default:
+                AF_ERROR("Incorrect enum value for input color type",
+                         AF_ERR_ARG);
+        }
+    }
+    CATCHALL;
 
-    return result;
+    return AF_SUCCESS;
 }
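
The rewritten `af_color_space` above turns two runtime enum values into a compile-time pair in two `switch` stages. Explicit specializations supply the supported conversions, and the primary template is the "not supported" fallback. The sketch below reproduces that pattern in isolation. The `Space` enum, the function names, and the string results are all hypothetical and stand in for the real conversion calls.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for af_cspace_t.
enum class Space { Gray, Rgb, Hsv };

// Primary template: any (FROM, TO) pair without a specialization is
// treated as unsupported.
template<Space FROM, Space TO>
std::string convert() {
    throw std::runtime_error("conversion not supported");
}

// Supported pairs are provided as explicit specializations.
template<>
inline std::string convert<Space::Hsv, Space::Rgb>() { return "hsv2rgb"; }
template<>
inline std::string convert<Space::Rgb, Space::Hsv>() { return "rgb2hsv"; }

// Stage 2: FROM is already a template argument; resolve TO at run time.
template<Space FROM>
std::string dispatchTo(Space to) {
    switch (to) {
        case Space::Gray: return convert<FROM, Space::Gray>();
        case Space::Rgb: return convert<FROM, Space::Rgb>();
        case Space::Hsv: return convert<FROM, Space::Hsv>();
    }
    throw std::invalid_argument("bad output space");
}

// Stage 1: resolve FROM at run time; identical spaces short-circuit.
inline std::string dispatch(Space from, Space to) {
    if (from == to) { return "identity"; }
    switch (from) {
        case Space::Gray: return dispatchTo<Space::Gray>(to);
        case Space::Rgb: return dispatchTo<Space::Rgb>(to);
        case Space::Hsv: return dispatchTo<Space::Hsv>(to);
    }
    throw std::invalid_argument("bad input space");
}
```

With this layout, supporting a new pair takes one more specialization and leaves the dispatch code unchanged.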
diff --git a/src/api/c/complex.cpp b/src/api/c/complex.cpp
index 8bc7608b8e..afa24d8483 100644
--- a/src/api/c/complex.cpp
+++ b/src/api/c/complex.cpp
@@ -7,49 +7,57 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/defines.h>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <implicit.hpp>
+#include <optypes.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <ArrayInfo.hpp>
-#include <optypes.hpp>
-#include <implicit.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
+#include <af/defines.h>
 
 #include <complex.hpp>
 
-using namespace detail;
 using af::dim4;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::conj;
+using detail::imag;
+using detail::real;
 
 template<typename To, typename Ti>
 static inline af_array cplx(const af_array lhs, const af_array rhs,
-                            const dim4 &odims)
-{
-    af_array res = getHandle(cplx<To, Ti>(castArray<Ti>(lhs), castArray<Ti>(rhs), odims));
+                            const dim4 &odims) {
+    af_array res =
+        getHandle(cplx<To, Ti>(castArray<Ti>(lhs), castArray<Ti>(rhs), odims));
     return res;
 }
 
-af_err af_cplx2(af_array *out, const af_array lhs, const af_array rhs, bool batchMode)
-{
+af_err af_cplx2(af_array *out, const af_array lhs, const af_array rhs,
+                bool batchMode) {
     try {
-
         af_dtype type = implicit(lhs, rhs);
 
         if (type == c32 || type == c64) {
             AF_ERROR("Inputs to cplx2 can not be of complex type", AF_ERR_ARG);
         }
 
-        if (type != f64) type = f32;
-
-        dim4 odims = getOutDims(getInfo(lhs).dims(), getInfo(rhs).dims(), batchMode);
+        if (type != f64) { type = f32; }
+        dim4 odims =
+            getOutDims(getInfo(lhs).dims(), getInfo(rhs).dims(), batchMode);
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
 
         af_array res;
         switch (type) {
-        case f32: res = cplx<cfloat , float >(lhs, rhs, odims); break;
-        case f64: res = cplx<cdouble, double>(lhs, rhs, odims); break;
-        default: TYPE_ERROR(0, type);
+            case f32: res = cplx<cfloat, float>(lhs, rhs, odims); break;
+            case f64: res = cplx<cdouble, double>(lhs, rhs, odims); break;
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -58,30 +66,25 @@ af_err af_cplx2(af_array *out, const af_array lhs, const af_array rhs, bool batc
     return AF_SUCCESS;
 }
 
-af_err af_cplx(af_array *out, const af_array in)
-{
+af_err af_cplx(af_array *out, const af_array in) {
     try {
-
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
         if (type == c32 || type == c64) {
             AF_ERROR("Inputs to cplx2 can not be of complex type", AF_ERR_ARG);
         }
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
 
         af_array tmp;
-        AF_CHECK(af_constant(&tmp,
-                             0, info.ndims(),
-                             info.dims().get(),
-                             type));
+        AF_CHECK(af_constant(&tmp, 0, info.ndims(), info.dims().get(), type));
 
         af_array res;
         switch (type) {
+            case f32: res = cplx<cfloat, float>(in, tmp, info.dims()); break;
+            case f64: res = cplx<cdouble, double>(in, tmp, info.dims()); break;
 
-        case f32: res = cplx<cfloat , float >(in, tmp, info.dims()); break;
-        case f64: res = cplx<cdouble, double>(in, tmp, info.dims()); break;
-
-        default: TYPE_ERROR(0, type);
+            default: TYPE_ERROR(0, type);
         }
 
         AF_CHECK(af_release_array(tmp));
@@ -92,24 +95,24 @@ af_err af_cplx(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-af_err af_real(af_array *out, const af_array in)
-{
+af_err af_real(af_array *out, const af_array in) {
     try {
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-
-        if (type != c32 && type != c64) {
-            return af_retain_array(out, in);
-        }
+        if (type != c32 && type != c64) { return af_retain_array(out, in); }
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
 
         af_array res;
         switch (type) {
-
-        case c32: res = getHandle(real<float , cfloat >(getArray<cfloat >(in))); break;
-        case c64: res = getHandle(real<double, cdouble>(getArray<cdouble>(in))); break;
-
-        default: TYPE_ERROR(0, type);
+            case c32:
+                res = getHandle(real<float, cfloat>(getArray<cfloat>(in)));
+                break;
+            case c64:
+                res = getHandle(real<double, cdouble>(getArray<cdouble>(in)));
+                break;
+
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -118,24 +121,26 @@ af_err af_real(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-af_err af_imag(af_array *out, const af_array in)
-{
+af_err af_imag(af_array *out, const af_array in) {
     try {
-
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
         if (type != c32 && type != c64) {
             return af_constant(out, 0, info.ndims(), info.dims().get(), type);
         }
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
 
         af_array res;
         switch (type) {
-
-        case c32: res = getHandle(imag<float , cfloat >(getArray<cfloat >(in))); break;
-        case c64: res = getHandle(imag<double, cdouble>(getArray<cdouble>(in))); break;
-
-        default: TYPE_ERROR(0, type);
+            case c32:
+                res = getHandle(imag<float, cfloat>(getArray<cfloat>(in)));
+                break;
+            case c64:
+                res = getHandle(imag<double, cdouble>(getArray<cdouble>(in)));
+                break;
+
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -144,24 +149,24 @@ af_err af_imag(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-af_err af_conjg(af_array *out, const af_array in)
-{
+af_err af_conjg(af_array *out, const af_array in) {
     try {
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-
-        if (type != c32 && type != c64) {
-            AF_ERROR("Inputs to imag must be of complex type", AF_ERR_ARG);
-        }
+        if (type != c32 && type != c64) { return af_retain_array(out, in); }
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
 
         af_array res;
         switch (type) {
-
-        case c32: res = getHandle(conj<cfloat >(getArray<cfloat >(in))); break;
-        case c64: res = getHandle(conj<cdouble>(getArray<cdouble>(in))); break;
-
-        default: TYPE_ERROR(0, type);
+            case c32:
+                res = getHandle(conj<cfloat>(getArray<cfloat>(in)));
+                break;
+            case c64:
+                res = getHandle(conj<cdouble>(getArray<cdouble>(in)));
+                break;
+
+            default: TYPE_ERROR(0, type);
         }
 
         std::swap(*out, res);
@@ -170,24 +175,26 @@ af_err af_conjg(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-af_err af_abs(af_array *out, const af_array in)
-{
+af_err af_abs(af_array *out, const af_array in) {
     try {
-
-        ArrayInfo in_info = getInfo(in);
-        af_dtype in_type = in_info.getType();
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype in_type         = in_info.getType();
         af_array res;
 
         // Convert all inputs to floats / doubles
         af_dtype type = implicit(in_type, f32);
+        if (in_type == f16) { type = f16; }
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
         switch (type) {
-        case f32: res = getHandle(abs<float ,  float >(castArray<float  >(in))); break;
-        case f64: res = getHandle(abs<double,  double>(castArray<double >(in))); break;
-        case c32: res = getHandle(abs<float , cfloat >(castArray<cfloat >(in))); break;
-        case c64: res = getHandle(abs<double, cdouble>(castArray<cdouble>(in))); break;
-        default:
-            TYPE_ERROR(1, in_type); break;
+            // clang-format off
+            case f32: res = getHandle(detail::abs<float, float>(castArray<float>(in))); break;
+            case f64: res = getHandle(detail::abs<double, double>(castArray<double>(in))); break;
+            case c32: res = getHandle(detail::abs<float, cfloat>(castArray<cfloat>(in))); break;
+            case c64: res = getHandle(detail::abs<double, cdouble>(castArray<cdouble>(in))); break;
+            case f16: res = getHandle(detail::abs<half, half>(getArray<half>(in))); break;
+            // clang-format on
+            default: TYPE_ERROR(1, in_type); break;
         }
 
         std::swap(*out, res);
diff --git a/src/api/c/confidence_connected.cpp b/src/api/c/confidence_connected.cpp
new file mode 100644
index 0000000000..903c06f87b
--- /dev/null
+++ b/src/api/c/confidence_connected.cpp
@@ -0,0 +1,250 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/image.h>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <flood_fill.hpp>
+#include <handle.hpp>
+#include <imgproc_common.hpp>
+#include <index.hpp>
+#include <indexing_common.hpp>
+#include <reduce.hpp>
+
+#include <array>
+#include <cmath>
+#include <type_traits>
+
+using af::dim4;
+using arrayfire::common::cast;
+using arrayfire::common::convRange;
+using arrayfire::common::createSpanIndex;
+using arrayfire::common::integralImage;
+using detail::arithOp;
+using detail::Array;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::reduce_all;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::conditional;
+using std::is_same;
+using std::sqrt;
+using std::swap;
+
+/// Index corner points of given seed points
+template<typename T>
+Array<T> pointList(const Array<T>& in, const Array<uint>& x,
+                   const Array<uint>& y) {
+
+    // TODO: Temporary fix; subarray handling must be fixed upstream.
+    // The index array must be basic (zero offset, linear) to go in af_index_t.
+    Array<uint> x_ = (x.getOffset() == 0 && x.isLinear()) ? x : copyArray(x);
+    Array<uint> y_ = (y.getOffset() == 0 && y.isLinear()) ? y : copyArray(y);
+
+    af_array xcoords = getHandle<uint>(x_);
+    af_array ycoords = getHandle<uint>(y_);
+
+    std::array<af_index_t, AF_MAX_DIMS> idxrs = {{{{xcoords}, false, false},
+                                                  {{ycoords}, false, false},
+                                                  createSpanIndex(),
+                                                  createSpanIndex()}};
+
+    Array<T> retVal = detail::index(in, idxrs.data());
+
+    // detail::index keeps its own reference to the detail::Array objects
+    // created from the xcoords/ycoords handles passed via idxrs, so it is
+    // safe to release both handles here.
+    releaseHandle<uint>(xcoords);
+    releaseHandle<uint>(ycoords);
+
+    return retVal;
+}
+
+/// Returns the sum of all values given the four corner points of the region of
+/// interest in the integral-image/summed-area-table of an input image.
+///
+///  +-------------------------------------+
+///  |           |                |        |
+///  |  A(_x, _y)|       B(_x, y_)|        |
+///  |-----------+----------------+        |
+///  |           |@@@@@@@@@@@@@@@@|        |
+///  |           |@@@@@@@@@@@@@@@@|        |
+///  |           |@@@@@@@@@@@@@@@@|        |
+///  |           |@@@@@@@@@@@@@@@@|        |
+///  |-----------+----------------+        |
+///  |  C(x_, _y)        D(x_, y_)         |
+///  |                                     |
+///  +-------------------------------------+
+template<typename T>
+Array<T> sum(const Array<T>& sat, const Array<uint>& _x, const Array<uint>& x_,
+             const Array<uint>& _y, const Array<uint>& y_) {
+    Array<T> A  = pointList(sat, _x, _y);
+    Array<T> B  = pointList(sat, _x, y_);
+    Array<T> C  = pointList(sat, x_, _y);
+    Array<T> D  = pointList(sat, x_, y_);
+    Array<T> DA = arithOp<T, af_add_t>(D, A, D.dims());
+    Array<T> BC = arithOp<T, af_add_t>(B, C, B.dims());
+    return arithOp<T, af_sub_t>(DA, BC, DA.dims());
+}
+
+template<typename T>
+af_array ccHelper(const Array<T>& img, const Array<uint>& seedx,
+                  const Array<uint>& seedy, const unsigned radius,
+                  const unsigned mult, const unsigned iterations,
+                  const double segmentedValue) {
+    using CT =
+        typename conditional<is_same<T, double>::value, double, float>::type;
+    constexpr CT epsilon = 1.0e-6;
+
+    auto calcVar = [](CT s2, CT s1, CT n) -> CT {
+        CT retVal = CT(0);
+        if (n > 1) { retVal = (s2 - (s1 * s1 / n)) / (n - CT(1)); }
+        return retVal;
+    };
+
+    const dim4& inDims       = img.dims();
+    const dim4& seedDims     = seedx.dims();
+    const size_t numSeeds    = seedx.elements();
+    const unsigned nhoodLen  = 2 * radius + 1;
+    const unsigned nhoodSize = nhoodLen * nhoodLen;
+
+    auto labelSegmented = [segmentedValue, inDims](const Array<CT>& segmented) {
+        Array<CT> newVals = createValueArray(inDims, CT(segmentedValue));
+        Array<CT> result  = arithOp<CT, af_mul_t>(newVals, segmented, inDims);
+        // cast final result to input type
+        return cast<T, CT>(result);
+    };
+
+    Array<uint> radiip = createValueArray<uint>(seedDims, radius + 1);
+    Array<uint> radii  = createValueArray<uint>(seedDims, radius);
+    Array<uint> _x     = arithOp<uint, af_sub_t>(seedx, radiip, seedDims);
+    Array<uint> x_     = arithOp<uint, af_add_t>(seedx, radii, seedDims);
+    Array<uint> _y     = arithOp<uint, af_sub_t>(seedy, radiip, seedDims);
+    Array<uint> y_     = arithOp<uint, af_add_t>(seedy, radii, seedDims);
+    Array<CT> in       = convRange<CT, T>(img, CT(1), CT(2));
+    Array<CT> in_2     = arithOp<CT, af_mul_t>(in, in, inDims);
+    Array<CT> I1       = integralImage<CT>(in);
+    Array<CT> I2       = integralImage<CT>(in_2);
+    Array<CT> S1       = sum(I1, _x, x_, _y, y_);
+    Array<CT> S2       = sum(I2, _x, x_, _y, y_);
+    CT totSum          = getScalar<CT>(reduce_all<af_add_t, CT, CT>(S1));
+    CT totSumSq        = getScalar<CT>(reduce_all<af_add_t, CT, CT>(S2));
+    CT totalNum        = numSeeds * nhoodSize;
+    CT s1mean          = totSum / totalNum;
+    CT s1var           = calcVar(totSumSq, totSum, totalNum);
+    CT s1stddev        = sqrt(s1var);
+    CT lower           = s1mean - mult * s1stddev;
+    CT upper           = s1mean + mult * s1stddev;
+
+    Array<CT> seedIntensities = pointList(in, seedx, seedy);
+    CT maxSeedIntensity =
+        getScalar<CT>(reduce_all<af_max_t, CT, CT>(seedIntensities));
+    CT minSeedIntensity =
+        getScalar<CT>(reduce_all<af_min_t, CT, CT>(seedIntensities));
+
+    if (lower > minSeedIntensity) { lower = minSeedIntensity; }
+    if (upper < maxSeedIntensity) { upper = maxSeedIntensity; }
+
+    Array<CT> segmented = floodFill(in, seedx, seedy, CT(1), lower, upper);
+
+    if (std::abs<CT>(s1var) < epsilon) {
+        // If variance is close to zero, stop after initial segmentation
+        return getHandle(labelSegmented(segmented));
+    }
+
+    bool continueLoop = true;
+    for (uint i = 0; (i < iterations) && continueLoop; ++i) {
+        // The segmented image holds only 1s and 0s, so it acts as a mask
+        // over the input image in each iteration.
+
+        uint sampleCount = getScalar<uint>(
+            reduce_all<af_notzero_t, CT, uint>(segmented, true));
+        if (sampleCount == 0) {
+            // If no valid pixels are found, stop iterating
+            break;
+        }
+        Array<CT> valids = arithOp<CT, af_mul_t>(segmented, in, inDims);
+        Array<CT> vsqrd  = arithOp<CT, af_mul_t>(valids, valids, inDims);
+
+        CT validsSum =
+            getScalar<CT>(reduce_all<af_add_t, CT, CT>(valids, true));
+        CT sumOfSqs = getScalar<CT>(reduce_all<af_add_t, CT, CT>(vsqrd, true));
+        CT validsMean = validsSum / sampleCount;
+        CT validsVar  = calcVar(sumOfSqs, validsSum, CT(sampleCount));
+        CT stddev     = sqrt(validsVar);
+        CT newLow     = validsMean - mult * stddev;
+        CT newHigh    = validsMean + mult * stddev;
+
+        if (newLow > minSeedIntensity) { newLow = minSeedIntensity; }
+        if (newHigh < maxSeedIntensity) { newHigh = maxSeedIntensity; }
+
+        if (std::abs<CT>(validsVar) < epsilon) {
+            // If variance is close to zero, discontinue iterating.
+            continueLoop = false;
+        }
+        segmented = floodFill(in, seedx, seedy, CT(1), newLow, newHigh);
+    }
+
+    return getHandle(labelSegmented(segmented));
+}
+
+af_err af_confidence_cc(af_array* out, const af_array in, const af_array seedx,
+                        const af_array seedy, const unsigned radius,
+                        const unsigned multiplier, const int iter,
+                        const double segmented_value) {
+    try {
+        const ArrayInfo& inInfo         = getInfo(in);
+        const ArrayInfo& seedxInfo      = getInfo(seedx);
+        const ArrayInfo& seedyInfo      = getInfo(seedy);
+        const af::dim4& inputDimensions = inInfo.dims();
+        const af::dtype inputArrayType  = inInfo.getType();
+
+        // TODO(pradeep) handle seeds near the image border, where
+        //              neighborhood indexing may throw an exception
+        // TODO(pradeep) add batch support later
+        ARG_ASSERT(
+            1, (inputDimensions.ndims() > 0 && inputDimensions.ndims() <= 2));
+
+        ARG_ASSERT(2, (seedxInfo.ndims() == 1));
+        ARG_ASSERT(3, (seedyInfo.ndims() == 1));
+        ARG_ASSERT(2, (seedxInfo.elements() == seedyInfo.elements()));
+
+        af_array output = 0;
+        switch (inputArrayType) {
+            case f32:
+                output = ccHelper(getArray<float>(in), getArray<uint>(seedx),
+                                  getArray<uint>(seedy), radius, multiplier,
+                                  iter, segmented_value);
+                break;
+            case u32:
+                output = ccHelper(getArray<uint>(in), getArray<uint>(seedx),
+                                  getArray<uint>(seedy), radius, multiplier,
+                                  iter, segmented_value);
+                break;
+            case u16:
+                output = ccHelper(getArray<ushort>(in), getArray<uint>(seedx),
+                                  getArray<uint>(seedy), radius, multiplier,
+                                  iter, segmented_value);
+                break;
+            case u8:
+                output = ccHelper(getArray<uchar>(in), getArray<uint>(seedx),
+                                  getArray<uint>(seedy), radius, multiplier,
+                                  iter, segmented_value);
+                break;
+            default: TYPE_ERROR(0, inputArrayType);
+        }
+        swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
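The thresholding rule that `ccHelper` applies, both for the initial flood fill and on every iteration, can be sketched on plain host data. This is only an illustration under stated assumptions: `Range` and `confidenceRange` are hypothetical names that do not exist in ArrayFire, and the sketch uses the population variance, whereas the `calcVar` helper called in the diff is not shown here and may normalize differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type, not part of ArrayFire.
struct Range {
    double lower;
    double upper;
};

// Computes mean +/- mult * stddev over `samples`, then widens the range so
// that every seed intensity lies inside it (the same clamping the diff does
// with minSeedIntensity / maxSeedIntensity).
// Assumption: population variance E[x^2] - E[x]^2 stands in for calcVar.
Range confidenceRange(const std::vector<double>& samples,
                      const std::vector<double>& seedIntensities,
                      double mult) {
    assert(!samples.empty() && !seedIntensities.empty());

    double sum   = 0.0;
    double sumSq = 0.0;
    for (double v : samples) {
        sum += v;
        sumSq += v * v;
    }
    const double n      = static_cast<double>(samples.size());
    const double mean   = sum / n;
    const double var    = std::max(0.0, sumSq / n - mean * mean);
    const double stddev = std::sqrt(var);

    Range r{mean - mult * stddev, mean + mult * stddev};

    // Never exclude a seed pixel from its own region.
    const auto mm = std::minmax_element(seedIntensities.begin(),
                                        seedIntensities.end());
    r.lower = std::min(r.lower, *mm.first);
    r.upper = std::max(r.upper, *mm.second);
    return r;
}
```

In the diff, the first pass takes its statistics from the seed neighborhoods via integral images, while later iterations take them from the pixels currently marked as segmented; the range computation itself is the same in both cases.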
diff --git a/src/api/c/convolve.cpp b/src/api/c/convolve.cpp
index 2ca7939786..8d37c5d285 100644
--- a/src/api/c/convolve.cpp
+++ b/src/api/c/convolve.cpp
@@ -6,213 +6,472 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/signal.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
 #include <convolve.hpp>
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/tile.hpp>
 #include <fftconvolve.hpp>
-#include <convolve_common.hpp>
+#include <handle.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/ml.h>
+#include <af/signal.h>
 
 #include <cstdio>
 
 using af::dim4;
-using namespace detail;
-
-template<typename T, typename accT, dim_t baseDim, bool expand>
-inline static af_array convolve(const af_array &s, const af_array &f, ConvolveBatchKind kind)
-{
-    return getHandle(convolve<T, accT, baseDim, expand>(getArray<T>(s), castArray<accT>(f), kind));
+using arrayfire::common::cast;
+using arrayfire::common::half;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::convolve;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T, typename accT>
+inline af_array convolve(const af_array &s, const af_array &f,
+                         AF_BATCH_KIND kind, const int rank,
+                         const bool expand) {
+    return getHandle(convolve<T, accT>(getArray<T>(s), castArray<accT>(f), kind,
+                                       rank, expand));
 }
 
-template<typename T, typename accT, bool expand>
-inline static af_array convolve2(const af_array &s, const af_array &c_f, const af_array &r_f)
-{
-    return getHandle(convolve2<T, accT, expand>(getArray<T>(s),
-                                                castArray<accT>(c_f),
-                                                castArray<accT>(r_f)));
+template<typename T, typename accT>
+inline af_array convolve2(const af_array &s, const af_array &c_f,
+                          const af_array &r_f, const bool expand) {
+    const Array<accT> colFilter = castArray<accT>(c_f);
+    const Array<accT> rowFilter = castArray<accT>(r_f);
+    const Array<accT> signal    = castArray<accT>(s);
+
+    if (colFilter.isScalar() && rowFilter.isScalar()) {
+        Array<accT> colArray =
+            arrayfire::common::tile(colFilter, signal.dims());
+        Array<accT> rowArray =
+            arrayfire::common::tile(rowFilter, signal.dims());
+
+        Array<accT> filter =
+            arithOp<accT, af_mul_t>(colArray, rowArray, signal.dims());
+
+        return getHandle(cast<T, accT>(
+            arithOp<accT, af_mul_t>(signal, filter, signal.dims())));
+    }
+
+    ARG_ASSERT(2, colFilter.isVector());
+    ARG_ASSERT(3, rowFilter.isVector());
+
+    return getHandle(
+        convolve2<T, accT>(getArray<T>(s), colFilter, rowFilter, expand));
 }
 
-template<dim_t baseDim>
-ConvolveBatchKind identifyBatchKind(const dim4 &sDims, const dim4 &fDims)
-{
+AF_BATCH_KIND identifyBatchKind(const int rank, const dim4 &sDims,
+                                const dim4 &fDims) {
     dim_t sn = sDims.ndims();
     dim_t fn = fDims.ndims();
 
-    if (sn==baseDim && fn==baseDim)
-        return ONE2ONE;
-    else if (sn==baseDim && (fn>baseDim && fn<=4))
-        return ONE2MANY;
-    else if ((sn>baseDim && sn<=4) && fn==baseDim)
-        return MANY2ONE;
-    else if ((sn>baseDim && sn<=4) && (fn>baseDim && fn<=4)) {
+    if (sn == rank && fn == rank) { return AF_BATCH_NONE; }
+    if (sn == rank && (fn > rank && fn <= AF_MAX_DIMS)) { return AF_BATCH_RHS; }
+    if ((sn > rank && sn <= AF_MAX_DIMS) && fn == rank) { return AF_BATCH_LHS; }
+    if ((sn > rank && sn <= AF_MAX_DIMS) && (fn > rank && fn <= AF_MAX_DIMS)) {
         bool doesDimensionsMatch = true;
-        for (dim_t i=baseDim; i<4; i++) {
-            if (sDims[i]!=fDims[i]) {
-                doesDimensionsMatch = false;
-                break;
-            }
+        bool isInterleaved       = true;
+        for (dim_t i = rank; i < AF_MAX_DIMS; i++) {
+            doesDimensionsMatch &= (sDims[i] == fDims[i]);
+            isInterleaved &=
+                (sDims[i] == 1 || fDims[i] == 1 || sDims[i] == fDims[i]);
         }
-        return (doesDimensionsMatch ? MANY2MANY : CONVOLVE_UNSUPPORTED_BATCH_MODE);
+        if (doesDimensionsMatch) { return AF_BATCH_SAME; }
+        return (isInterleaved ? AF_BATCH_DIFF : AF_BATCH_UNSUPPORTED);
+    }
+    return AF_BATCH_UNSUPPORTED;
+}
+
+bool isFreqDomain(const int rank, const af_array &signal, const af_array filter,
+                  af_conv_domain domain) {
+    if (domain == AF_CONV_FREQ) { return true; }
+    if (domain != AF_CONV_AUTO) { return false; }
+
+    const ArrayInfo &sInfo = getInfo(signal);
+    const ArrayInfo &fInfo = getInfo(filter);
+
+    const dim4 &sdims = sInfo.dims();
+    dim4 fdims        = fInfo.dims();
+
+    if (identifyBatchKind(rank, sdims, fdims) == AF_BATCH_DIFF) { return true; }
+
+    int kbatch = 1;
+    for (int i = 3; i >= rank; i--) { kbatch *= fdims[i]; }
+
+    if (kbatch >= 10) { return true; }
+    if (rank == 1) {
+        if (fdims[0] > 128) { return true; }
+    }
+    if (rank == 2) {
+        // Maximum filter size supported for 2D spatial convolution
+        if (fdims[0] > 17 || fdims[1] > 17) { return true; }
+
+        // Maximum supported non-square filter size
+        if (fdims[0] != fdims[1] && fdims[0] > 5) { return true; }
     }
-    else
-        return CONVOLVE_UNSUPPORTED_BATCH_MODE;
+    if (rank == 3) {
+        if (fdims[0] > 5 || fdims[1] > 5 || fdims[2] > 5) { return true; }
+    }
+    return false;
 }
 
-template<dim_t baseDim, bool expand>
-af_err convolve(af_array *out, const af_array signal, const af_array filter)
-{
+af_err convolve(af_array *out, const af_array signal, const af_array filter,
+                const af_conv_mode mode, const int rank) {
     try {
-        ArrayInfo sInfo = getInfo(signal);
-        ArrayInfo fInfo = getInfo(filter);
+        const ArrayInfo &sInfo = getInfo(signal);
+        const ArrayInfo &fInfo = getInfo(filter);
 
-        af_dtype stype  = sInfo.getType();
+        af_dtype stype = sInfo.getType();
 
         dim4 sdims = sInfo.dims();
         dim4 fdims = fInfo.dims();
 
-        ConvolveBatchKind convBT = identifyBatchKind<baseDim>(sdims, fdims);
+        if (fdims.ndims() == 0 || sdims.ndims() == 0) {
+            return af_retain_array(out, signal);
+        }
+
+        AF_BATCH_KIND convBT = identifyBatchKind(rank, sdims, fdims);
 
-        ARG_ASSERT(1, (convBT != CONVOLVE_UNSUPPORTED_BATCH_MODE));
+        ARG_ASSERT(1,
+                   (convBT != AF_BATCH_UNSUPPORTED && convBT != AF_BATCH_DIFF));
+
+        const bool expand = mode == AF_CONV_EXPAND;
 
         af_array output;
-        switch(stype) {
-            case c32: output = convolve<cfloat ,  cfloat, baseDim, expand>(signal, filter, convBT); break;
-            case c64: output = convolve<cdouble, cdouble, baseDim, expand>(signal, filter, convBT); break;
-            case f32: output = convolve<float  ,   float, baseDim, expand>(signal, filter, convBT); break;
-            case f64: output = convolve<double ,  double, baseDim, expand>(signal, filter, convBT); break;
-            case u32: output = convolve<uint   ,   float, baseDim, expand>(signal, filter, convBT); break;
-            case s32: output = convolve<int    ,   float, baseDim, expand>(signal, filter, convBT); break;
-            case u8:  output = convolve<uchar  ,   float, baseDim, expand>(signal, filter, convBT); break;
-            case b8:  output = convolve<char   ,   float, baseDim, expand>(signal, filter, convBT); break;
+        switch (stype) {
+            case c32:
+                output = convolve<cfloat, cfloat>(signal, filter, convBT, rank,
+                                                  expand);
+                break;
+            case c64:
+                output = convolve<cdouble, cdouble>(signal, filter, convBT,
+                                                    rank, expand);
+                break;
+            case f32:
+                output = convolve<float, float>(signal, filter, convBT, rank,
+                                                expand);
+                break;
+            case f64:
+                output = convolve<double, double>(signal, filter, convBT, rank,
+                                                  expand);
+                break;
+            case u32:
+                output =
+                    convolve<uint, float>(signal, filter, convBT, rank, expand);
+                break;
+            case s32:
+                output =
+                    convolve<int, float>(signal, filter, convBT, rank, expand);
+                break;
+            case u16:
+                output = convolve<ushort, float>(signal, filter, convBT, rank,
+                                                 expand);
+                break;
+            case s16:
+                output = convolve<short, float>(signal, filter, convBT, rank,
+                                                expand);
+                break;
+            case u64:
+                output = convolve<uintl, float>(signal, filter, convBT, rank,
+                                                expand);
+                break;
+            case s64:
+                output =
+                    convolve<intl, float>(signal, filter, convBT, rank, expand);
+                break;
+            case u8:
+                output = convolve<uchar, float>(signal, filter, convBT, rank,
+                                                expand);
+                break;
+            case s8:
+                output = convolve<schar, float>(signal, filter, convBT, rank,
+                                                expand);
+                break;
+            case b8:
+                output =
+                    convolve<char, float>(signal, filter, convBT, rank, expand);
+                break;
             default: TYPE_ERROR(1, stype);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-template<bool expand>
-af_err convolve2_sep(af_array *out, af_array col_filter, af_array row_filter, const af_array signal)
-{
+af_err af_convolve1(af_array *out, const af_array signal, const af_array filter,
+                    const af_conv_mode mode, af_conv_domain domain) {
     try {
-        ArrayInfo sInfo = getInfo(signal);
-        ArrayInfo cfInfo= getInfo(col_filter);
-        ArrayInfo rfInfo= getInfo(row_filter);
+        if (isFreqDomain(1, signal, filter, domain)) {
+            return af_fft_convolve1(out, signal, filter, mode);
+        }
+        return convolve(out, signal, filter, mode, 1);
+    }
+    CATCHALL;
+}
 
-        af_dtype signalType  = sInfo.getType();
+af_err af_convolve2(af_array *out, const af_array signal, const af_array filter,
+                    const af_conv_mode mode, af_conv_domain domain) {
+    try {
+        if (getInfo(signal).dims().ndims() < 2 ||
+            getInfo(filter).dims().ndims() < 2) {
+            return af_convolve1(out, signal, filter, mode, domain);
+        }
+        if (isFreqDomain(2, signal, filter, domain)) {
+            return af_fft_convolve2(out, signal, filter, mode);
+        }
+        return convolve(out, signal, filter, mode, 2);
+    }
+    CATCHALL;
+}
 
-        dim4 signalDims = sInfo.dims();
+af_err af_convolve3(af_array *out, const af_array signal, const af_array filter,
+                    const af_conv_mode mode, af_conv_domain domain) {
+    try {
+        if (getInfo(signal).dims().ndims() < 3 ||
+            getInfo(filter).dims().ndims() < 3) {
+            return af_convolve2(out, signal, filter, mode, domain);
+        }
+        if (isFreqDomain(3, signal, filter, domain)) {
+            return af_fft_convolve3(out, signal, filter, mode);
+        }
+        return convolve(out, signal, filter, mode, 3);
+    }
+    CATCHALL;
+}
 
-        ARG_ASSERT(1, (signalDims.ndims()>=2));
-        ARG_ASSERT(2, cfInfo.isVector());
-        ARG_ASSERT(3, rfInfo.isVector());
+af_err af_convolve2_sep(af_array *out, const af_array col_filter,
+                        const af_array row_filter, const af_array signal,
+                        const af_conv_mode mode) {
+    try {
+        const ArrayInfo &sInfo = getInfo(signal);
 
-        af_array output;
-        switch(signalType) {
-            case c32: output = convolve2<cfloat ,  cfloat, expand>(signal, col_filter, row_filter); break;
-            case c64: output = convolve2<cdouble, cdouble, expand>(signal, col_filter, row_filter); break;
-            case f32: output = convolve2<float  ,   float, expand>(signal, col_filter, row_filter); break;
-            case f64: output = convolve2<double ,  double, expand>(signal, col_filter, row_filter); break;
-            case u32: output = convolve2<uint   ,   float, expand>(signal, col_filter, row_filter); break;
-            case s32: output = convolve2<int    ,   float, expand>(signal, col_filter, row_filter); break;
-            case u8:  output = convolve2<uchar  ,   float, expand>(signal, col_filter, row_filter); break;
-            case b8:  output = convolve2<char   ,   float, expand>(signal, col_filter, row_filter); break;
+        const dim4 &sdims = sInfo.dims();
+
+        const af_dtype signalType = sInfo.getType();
+
+        ARG_ASSERT(1, (sdims.ndims() >= 2));
+
+        af_array output = 0;
+
+        const bool expand = mode == AF_CONV_EXPAND;
+
+        switch (signalType) {
+            case c32:
+                output = convolve2<cfloat, cfloat>(signal, col_filter,
+                                                   row_filter, expand);
+                break;
+            case c64:
+                output = convolve2<cdouble, cdouble>(signal, col_filter,
+                                                     row_filter, expand);
+                break;
+            case f32:
+                output = convolve2<float, float>(signal, col_filter, row_filter,
+                                                 expand);
+                break;
+            case f64:
+                output = convolve2<double, double>(signal, col_filter,
+                                                   row_filter, expand);
+                break;
+            case u32:
+                output = convolve2<uint, float>(signal, col_filter, row_filter,
+                                                expand);
+                break;
+            case s32:
+                output = convolve2<int, float>(signal, col_filter, row_filter,
+                                               expand);
+                break;
+            case u16:
+                output = convolve2<ushort, float>(signal, col_filter,
+                                                  row_filter, expand);
+                break;
+            case s16:
+                output = convolve2<short, float>(signal, col_filter, row_filter,
+                                                 expand);
+                break;
+            case u64:
+                output = convolve2<uintl, float>(signal, col_filter, row_filter,
+                                                 expand);
+                break;
+            case s64:
+                output = convolve2<intl, float>(signal, col_filter, row_filter,
+                                                expand);
+                break;
+            case u8:
+                output = convolve2<uchar, float>(signal, col_filter, row_filter,
+                                                 expand);
+                break;
+            case s8:
+                output = convolve2<schar, float>(signal, col_filter, row_filter,
+                                                 expand);
+                break;
+            case b8:
+                output = convolve2<char, float>(signal, col_filter, row_filter,
+                                                expand);
+                break;
             default: TYPE_ERROR(1, signalType);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
+template<typename T>
+inline af_array convolve2Strided(const af_array &s, const af_array &f,
+                                 const dim4 stride, const dim4 padding,
+                                 const dim4 dilation) {
+    return getHandle(convolve2<T>(getArray<T>(s), getArray<T>(f), stride,
+                                  padding, dilation));
+}
 
-template<int baseDim>
-bool isFreqDomain(const af_array &signal, const af_array filter, af_conv_domain domain)
-{
-    if (domain == AF_CONV_FREQ) return true;
-    if (domain != AF_CONV_AUTO) return false;
+af_err af_convolve2_nn(af_array *out, const af_array signal,
+                       const af_array filter, const unsigned stride_dims,
+                       const dim_t *strides, const unsigned padding_dims,
+                       const dim_t *paddings, const unsigned dilation_dims,
+                       const dim_t *dilations) {
+    try {
+        const ArrayInfo &sInfo = getInfo(signal);
+        const ArrayInfo &fInfo = getInfo(filter);
 
-    ArrayInfo sInfo = getInfo(signal);
-    ArrayInfo fInfo = getInfo(filter);
+        af::dim4 sDims = sInfo.dims();
+        af::dim4 fDims = fInfo.dims();
 
-    dim4 sdims = sInfo.dims();
-    dim4 fdims = fInfo.dims();
+        const af_dtype signalType = sInfo.getType();
 
-    int batch = 1;
-    for(int i = 3; i >= baseDim; i--) {
-        batch *= std::max(fdims[i], sdims[i]);
-    }
+        dim4 stride(stride_dims, strides);
+        dim4 padding(padding_dims, paddings);
+        dim4 dilation(dilation_dims, dilations);
 
-    if (batch >= 10) return true;
+        size_t stride_ndims   = stride.ndims();
+        size_t padding_ndims  = padding.ndims();
+        size_t dilation_ndims = dilation.ndims();
+        ARG_ASSERT(3, stride_ndims > 0 && stride_ndims <= 2);
+        ARG_ASSERT(5, padding_ndims >= 0 && padding_ndims <= 2);
+        ARG_ASSERT(7, dilation_ndims > 0 && dilation_ndims <= 2);
 
-    if (baseDim == 1) {
-        if (fdims[0] > 128) return true;
-    }
+        // The signal and filter must have the same number of features
+        DIM_ASSERT(1, sDims[2] == fDims[2]);
 
-    if (baseDim == 2) {
-        // maximum supported size in 2D domain
-        if (fdims[0] > 17 || fdims[1] > 17) return true;
-
-        // Maximum supported non square size
-        if (fdims[0] != fdims[1] && fdims[0] > 5) return true;
+        af_array output;
+        switch (signalType) {
+            case f32:
+                output = convolve2Strided<float>(signal, filter, stride,
+                                                 padding, dilation);
+                break;
+            case f64:
+                output = convolve2Strided<double>(signal, filter, stride,
+                                                  padding, dilation);
+                break;
+            case f16:
+                output = convolve2Strided<half>(signal, filter, stride, padding,
+                                                dilation);
+                break;
+            default: TYPE_ERROR(1, signalType);
+        }
+        std::swap(*out, output);
     }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-    if (baseDim == 3) {
-        if (fdims[0] > 5 || fdims[1] > 5 || fdims[2] > 5) return true;
+template<typename T>
+af_array conv2GradCall(const af_array incoming_gradient,
+                       const af_array original_signal,
+                       const af_array original_filter,
+                       const af_array convolved_output, const dim4 &stride,
+                       const dim4 &padding, const dim4 &dilation,
+                       af_conv_gradient_type grad_type) {
+    if (grad_type == AF_CONV_GRADIENT_FILTER) {
+        return getHandle(detail::conv2FilterGradient<T>(
+            getArray<T>(incoming_gradient), getArray<T>(original_signal),
+            getArray<T>(original_filter), getArray<T>(convolved_output), stride,
+            padding, dilation));
+    } else {
+        return getHandle(detail::conv2DataGradient<T>(
+            getArray<T>(incoming_gradient), getArray<T>(original_signal),
+            getArray<T>(original_filter), getArray<T>(convolved_output), stride,
+            padding, dilation));
     }
-
-    return false;
 }
 
-af_err af_convolve1(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain)
-{
+af_err af_convolve2_gradient_nn(
+    af_array *out, const af_array incoming_gradient,
+    const af_array original_signal, const af_array original_filter,
+    const af_array convolved_output, const unsigned stride_dims,
+    const dim_t *strides, const unsigned padding_dims, const dim_t *paddings,
+    const unsigned dilation_dims, const dim_t *dilations,
+    af_conv_gradient_type grad_type) {
     try {
-        if (isFreqDomain<1>(signal, filter, domain))
-            return af_fft_convolve1(out, signal, filter, mode);
-    } CATCHALL;
+        const ArrayInfo &iinfo = getInfo(incoming_gradient);
+        const af::dim4 &iDims  = iinfo.dims();
 
-    if (mode == AF_CONV_EXPAND)
-            return convolve<1, true >(out, signal, filter);
-    else
-        return convolve<1, false>(out, signal, filter);
-}
+        const ArrayInfo &sinfo = getInfo(original_signal);
+        af::dim4 sDims         = sinfo.dims();
 
-af_err af_convolve2(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain)
-{
-    try {
-        if (isFreqDomain<2>(signal, filter, domain))
-            return af_fft_convolve2(out, signal, filter, mode);
-    } CATCHALL;
+        const ArrayInfo &finfo = getInfo(original_filter);
+        af::dim4 fDims         = finfo.dims();
 
-    if (mode == AF_CONV_EXPAND)
-        return convolve<2, true >(out, signal, filter);
-    else
-        return convolve<2, false>(out, signal, filter);
-}
+        const ArrayInfo &oinfo = getInfo(convolved_output);
+        af::dim4 oDims         = oinfo.dims();
 
-af_err af_convolve3(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode, af_conv_domain domain)
-{
-    try {
-        if (isFreqDomain<3>(signal, filter, domain))
-            return af_fft_convolve3(out, signal, filter, mode);
-    } CATCHALL;
+        DIM_ASSERT(1, iDims == oDims);
+        DIM_ASSERT(3, oDims[2] == fDims[3]);
+        DIM_ASSERT(3, oDims[3] == sDims[3]);
+        DIM_ASSERT(2, sDims[2] == fDims[2]);
 
-    if (mode == AF_CONV_EXPAND)
-        return convolve<3, true >(out, signal, filter);
-    else
-        return convolve<3, false>(out, signal, filter);
-}
+        af_array output;
 
-af_err af_convolve2_sep(af_array *out, const af_array signal, const af_array col_filter, const af_array row_filter, const af_conv_mode mode)
-{
-    if (mode == AF_CONV_EXPAND)
-        return convolve2_sep<true >(out, signal, col_filter, row_filter);
-    else
-        return convolve2_sep<false>(out, signal, col_filter, row_filter);
+        af::dim4 stride(stride_dims, strides);
+        af::dim4 padding(padding_dims, paddings);
+        af::dim4 dilation(dilation_dims, dilations);
+
+        size_t stride_ndims   = stride.ndims();
+        size_t padding_ndims  = padding.ndims();
+        size_t dilation_ndims = dilation.ndims();
+        ARG_ASSERT(3, stride_ndims > 0 && stride_ndims <= 2);
+        ARG_ASSERT(5, padding_ndims >= 0 && padding_ndims <= 2);
+        ARG_ASSERT(7, dilation_ndims > 0 && dilation_ndims <= 2);
+
+        af_dtype type = oinfo.getType();
+        switch (type) {
+            case f32:
+                output = conv2GradCall<float>(
+                    incoming_gradient, original_signal, original_filter,
+                    convolved_output, stride, padding, dilation, grad_type);
+                break;
+            case f64:
+                output = conv2GradCall<double>(
+                    incoming_gradient, original_signal, original_filter,
+                    convolved_output, stride, padding, dilation, grad_type);
+                break;
+            case f16:
+                output = conv2GradCall<half>(
+                    incoming_gradient, original_signal, original_filter,
+                    convolved_output, stride, padding, dilation, grad_type);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        // output array is pooled array
+        std::swap(output, *out);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
 }
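The batch classification performed by the new `identifyBatchKind` can be reproduced without ArrayFire to see which shape combinations map to which kind. The sketch below is a standalone approximation: `BatchKind`, `Dims`, `ndims` and `classifyBatch` are hypothetical stand-ins for `AF_BATCH_KIND`, `af::dim4`, `dim4::ndims()` and `identifyBatchKind`, and the `ndims` rule (trailing dimensions of size 1 do not count) is an assumption about how `af::dim4` behaves.

```cpp
#include <array>
#include <cassert>

// Hypothetical mirror of AF_BATCH_KIND.
enum class BatchKind { None, Lhs, Rhs, Same, Diff, Unsupported };

using Dims = std::array<long long, 4>;

// Assumed to mirror af::dim4::ndims(): trailing dimensions of size 1 are
// ignored, an empty shape has 0 dimensions and a scalar has 1.
int ndims(const Dims& d) {
    const long long elems = d[0] * d[1] * d[2] * d[3];
    if (elems == 0) { return 0; }
    if (elems == 1) { return 1; }
    int n = 4;
    while (n > 1 && d[n - 1] == 1) { --n; }
    return n;
}

// Same decision sequence as identifyBatchKind in the diff.
BatchKind classifyBatch(int rank, const Dims& s, const Dims& f) {
    assert(rank >= 1 && rank <= 3);
    const int sn = ndims(s);
    const int fn = ndims(f);

    if (sn == rank && fn == rank) { return BatchKind::None; }
    if (sn == rank && fn > rank) { return BatchKind::Rhs; }
    if (sn > rank && fn == rank) { return BatchKind::Lhs; }
    if (sn > rank && fn > rank) {
        bool same        = true;
        bool interleaved = true;
        for (int i = rank; i < 4; ++i) {
            same        = same && (s[i] == f[i]);
            interleaved = interleaved &&
                          (s[i] == 1 || f[i] == 1 || s[i] == f[i]);
        }
        if (same) { return BatchKind::Same; }
        return interleaved ? BatchKind::Diff : BatchKind::Unsupported;
    }
    return BatchKind::Unsupported;
}
```

Per the diff, the spatial-domain `convolve` rejects both `AF_BATCH_UNSUPPORTED` and `AF_BATCH_DIFF`, while `isFreqDomain` routes the `AF_BATCH_DIFF` case to the FFT path when the domain is `AF_CONV_AUTO`.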
diff --git a/src/api/c/convolve_common.hpp b/src/api/c/convolve_common.hpp
deleted file mode 100644
index fb4738cdab..0000000000
--- a/src/api/c/convolve_common.hpp
+++ /dev/null
@@ -1,18 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-
-typedef enum {
-    CONVOLVE_UNSUPPORTED_BATCH_MODE = -1, /* invalid inputs */
-    ONE2ONE,            /* one signal, one filter   */
-    MANY2ONE,           /* many signal, one filter  */
-    MANY2MANY,          /* many signal, many filter */
-    ONE2MANY            /* one signal, many filter  */
-} ConvolveBatchKind;
diff --git a/src/api/c/corrcoef.cpp b/src/api/c/corrcoef.cpp
index d6d98006a9..fde3788dac 100644
--- a/src/api/c/corrcoef.cpp
+++ b/src/api/c/corrcoef.cpp
@@ -7,73 +7,93 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
-#include <af/defines.h>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
 #include <handle.hpp>
-#include <reduce.hpp>
-#include <arith.hpp>
 #include <math.hpp>
-#include <cast.hpp>
-#include <cmath>
+#include <reduce.hpp>
+#include <stats.h>
+#include <types.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 
-#include "stats.h"
+#include <cmath>
 
-using namespace detail;
+using af::dim4;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::getScalar;
+using detail::intl;
+using detail::reduce_all;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename Ti, typename To>
-static To corrcoef(const af_array& X, const af_array& Y)
-{
+static To corrcoef(const af_array& X, const af_array& Y) {
     Array<To> xIn = cast<To>(getArray<Ti>(X));
     Array<To> yIn = cast<To>(getArray<Ti>(Y));
 
-    dim4 dims = xIn.dims();
-    dim_t n= xIn.elements();
+    const dim4& dims = xIn.dims();
+    dim_t n          = xIn.elements();
 
-    To xSum = detail::reduce_all<af_add_t, To, To>(xIn);
-    To ySum = detail::reduce_all<af_add_t, To, To>(yIn);
+    To xSum = getScalar<To>(reduce_all<af_add_t, To, To>(xIn));
+    To ySum = getScalar<To>(reduce_all<af_add_t, To, To>(yIn));
 
-    Array<To> xSq = detail::arithOp<To, af_mul_t>(xIn, xIn, dims);
-    Array<To> ySq = detail::arithOp<To, af_mul_t>(yIn, yIn, dims);
-    Array<To> xy  = detail::arithOp<To, af_mul_t>(xIn, yIn, dims);
+    Array<To> xSq = arithOp<To, af_mul_t>(xIn, xIn, dims);
+    Array<To> ySq = arithOp<To, af_mul_t>(yIn, yIn, dims);
+    Array<To> xy  = arithOp<To, af_mul_t>(xIn, yIn, dims);
 
-    To xSqSum = detail::reduce_all<af_add_t, To, To>(xSq);
-    To ySqSum = detail::reduce_all<af_add_t, To, To>(ySq);
-    To xySum  = detail::reduce_all<af_add_t, To, To>(xy);
+    To xSqSum = getScalar<To>(reduce_all<af_add_t, To, To>(xSq));
+    To ySqSum = getScalar<To>(reduce_all<af_add_t, To, To>(ySq));
+    To xySum  = getScalar<To>(reduce_all<af_add_t, To, To>(xy));
 
-    To result = (n*xySum - xSum*ySum)/(sqrt(n*xSqSum-xSum*xSum)*sqrt(n*ySqSum-ySum*ySum));
+    To result =
+        (n * xySum - xSum * ySum) / (std::sqrt(n * xSqSum - xSum * xSum) *
+                                     std::sqrt(n * ySqSum - ySum * ySum));
 
     return result;
 }
 
-af_err af_corrcoef(double *realVal, double *imagVal, const af_array X, const af_array Y)
-{
+// NOLINTNEXTLINE
+af_err af_corrcoef(double* realVal, double* imagVal, const af_array X,
+                   const af_array Y) {
+    UNUSED(imagVal);  // TODO(umar): implement for complex types
     try {
-        ArrayInfo xInfo = getInfo(X);
-        ArrayInfo yInfo = getInfo(Y);
-        dim4 xDims      = xInfo.dims();
-        dim4 yDims      = yInfo.dims();
-        af_dtype xType  = xInfo.getType();
-        af_dtype yType  = yInfo.getType();
+        const ArrayInfo& xInfo = getInfo(X);
+        const ArrayInfo& yInfo = getInfo(Y);
+        dim4 xDims             = xInfo.dims();
+        dim4 yDims             = yInfo.dims();
+        af_dtype xType         = xInfo.getType();
+        af_dtype yType         = yInfo.getType();
 
-        ARG_ASSERT(2, (xType==yType));
-        ARG_ASSERT(2, (xDims.ndims()==yDims.ndims()));
+        ARG_ASSERT(2, (xType == yType));
+        ARG_ASSERT(2, (xDims.ndims() == yDims.ndims()));
 
-        for (dim_t i=0; i<xDims.ndims(); ++i)
-            ARG_ASSERT(2, (xDims[i]==yDims[i]));
+        for (dim_t i = 0; i < xDims.ndims(); ++i) {
+            ARG_ASSERT(2, (xDims[i] == yDims[i]));
+        }
 
-        switch(xType) {
+        switch (xType) {
             case f64: *realVal = corrcoef<double, double>(X, Y); break;
-            case f32: *realVal = corrcoef<float , float >(X, Y); break;
-            case s32: *realVal = corrcoef<int   , float >(X, Y); break;
-            case u32: *realVal = corrcoef<uint  , float >(X, Y); break;
-            case s64: *realVal = corrcoef<intl  , double>(X, Y); break;
-            case u64: *realVal = corrcoef<uintl , double>(X, Y); break;
-            case  u8: *realVal = corrcoef<uchar , float >(X, Y); break;
-            case  b8: *realVal = corrcoef<char  , float >(X, Y); break;
-            default : TYPE_ERROR(1, xType);
+            case f32: *realVal = corrcoef<float, float>(X, Y); break;
+            case s32: *realVal = corrcoef<int, float>(X, Y); break;
+            case u32: *realVal = corrcoef<uint, float>(X, Y); break;
+            case s64: *realVal = corrcoef<intl, double>(X, Y); break;
+            case u64: *realVal = corrcoef<uintl, double>(X, Y); break;
+            case s16: *realVal = corrcoef<short, float>(X, Y); break;
+            case u16: *realVal = corrcoef<ushort, float>(X, Y); break;
+            case s8: *realVal = corrcoef<schar, float>(X, Y); break;
+            case u8: *realVal = corrcoef<uchar, float>(X, Y); break;
+            case b8: *realVal = corrcoef<char, float>(X, Y); break;
+            default: TYPE_ERROR(1, xType);
         }
     }
     CATCHALL;
diff --git a/src/api/c/covariance.cpp b/src/api/c/covariance.cpp
index 80b391d1a7..a4241a8f0a 100644
--- a/src/api/c/covariance.cpp
+++ b/src/api/c/covariance.cpp
@@ -7,72 +7,100 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
-#include <af/defines.h>
+#include <arith.hpp>
 #include <backend.hpp>
-#include <reduce.hpp>
+#include <common/cast.hpp>
 #include <handle.hpp>
-#include <arith.hpp>
-#include <unary.hpp>
 #include <math.hpp>
-#include <cast.hpp>
+#include <mean.hpp>
+#include <reduce.hpp>
 #include <tile.hpp>
+#include <unary.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 
 #include "stats.h"
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::createValueArray;
+using detail::intl;
+using detail::mean;
+using detail::reduce;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T, typename cType>
-static af_array cov(const af_array& X, const af_array& Y, const bool isbiased)
-{
-    Array<cType> xArr = cast<cType>(getArray<T>(X));
-    Array<cType> yArr = cast<cType>(getArray<T>(Y));
+static af_array cov(const af_array& X, const af_array& Y,
+                    const af_var_bias bias) {
+    using weightType  = typename baseOutType<cType>::type;
+    const Array<T> _x = getArray<T>(X);
+    const Array<T> _y = getArray<T>(Y);
+    Array<cType> xArr = cast<cType>(_x);
+    Array<cType> yArr = cast<cType>(_y);
 
     dim4 xDims = xArr.dims();
-    dim_t N = isbiased ? xDims[0] : xDims[0]-1;
+    dim_t N    = (bias == AF_VARIANCE_SAMPLE ? xDims[0] - 1 : xDims[0]);
 
-    Array<cType> xmArr = createValueArray<cType>(xDims, mean<cType>(xArr));
-    Array<cType> ymArr = createValueArray<cType>(xDims, mean<cType>(yArr));
-    Array<cType> nArr  = createValueArray<cType>(xDims, scalar<cType>(N));
+    Array<cType> xmArr =
+        createValueArray<cType>(xDims, mean<T, weightType, cType>(_x));
+    Array<cType> ymArr =
+        createValueArray<cType>(xDims, mean<T, weightType, cType>(_y));
+    Array<cType> nArr = createValueArray<cType>(xDims, scalar<cType>(N));
 
-    Array<cType> diffX = detail::arithOp<cType, af_sub_t>(xArr, xmArr, xDims);
-    Array<cType> diffY = detail::arithOp<cType, af_sub_t>(yArr, ymArr, xDims);
-    Array<cType> mulXY = detail::arithOp<cType, af_mul_t>(diffX, diffY, xDims);
-    Array<cType> redArr= detail::reduce<af_add_t, cType, cType>(mulXY, 0);
-    xDims[0] = 1;
-    Array<cType> result= detail::arithOp<cType, af_div_t>(redArr, nArr, xDims);
+    Array<cType> diffX  = arithOp<cType, af_sub_t>(xArr, xmArr, xDims);
+    Array<cType> diffY  = arithOp<cType, af_sub_t>(yArr, ymArr, xDims);
+    Array<cType> mulXY  = arithOp<cType, af_mul_t>(diffX, diffY, xDims);
+    Array<cType> redArr = reduce<af_add_t, cType, cType>(mulXY, 0);
+    xDims[0]            = 1;
+    Array<cType> result = arithOp<cType, af_div_t>(redArr, nArr, xDims);
 
     return getHandle<cType>(result);
 }
 
-af_err af_cov(af_array* out, const af_array X, const af_array Y, const bool isbiased)
-{
+af_err af_cov(af_array* out, const af_array X, const af_array Y,
+              const bool isbiased) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_POPULATION : AF_VARIANCE_SAMPLE);
+    return af_cov_v2(out, X, Y, bias);
+}
+
+af_err af_cov_v2(af_array* out, const af_array X, const af_array Y,
+                 const af_var_bias bias) {
     try {
-        ArrayInfo xInfo = getInfo(X);
-        ArrayInfo yInfo = getInfo(Y);
-        dim4 xDims      = xInfo.dims();
-        dim4 yDims      = yInfo.dims();
-        af_dtype xType  = xInfo.getType();
-        af_dtype yType  = yInfo.getType();
+        const ArrayInfo& xInfo = getInfo(X);
+        const ArrayInfo& yInfo = getInfo(Y);
+        dim4 xDims             = xInfo.dims();
+        dim4 yDims             = yInfo.dims();
+        af_dtype xType         = xInfo.getType();
+        af_dtype yType         = yInfo.getType();
 
-        ARG_ASSERT(1, (xDims.ndims()<=2));
-        ARG_ASSERT(2, (xDims.ndims()==yDims.ndims()));
-        ARG_ASSERT(2, (xDims[0]==yDims[0]));
-        ARG_ASSERT(2, (xDims[1]==yDims[1]));
-        ARG_ASSERT(2, (xType==yType));
+        ARG_ASSERT(1, (xDims.ndims() <= 2));
+        ARG_ASSERT(2, (xDims.ndims() == yDims.ndims()));
+        ARG_ASSERT(2, (xDims[0] == yDims[0]));
+        ARG_ASSERT(2, (xDims[1] == yDims[1]));
+        ARG_ASSERT(2, (xType == yType));
 
         af_array output = 0;
-        switch(xType) {
-            case f64: output = cov<double, double>(X, Y, isbiased); break;
-            case f32: output = cov<float , float >(X, Y, isbiased); break;
-            case s32: output = cov<int   , float >(X, Y, isbiased); break;
-            case u32: output = cov<uint  , float >(X, Y, isbiased); break;
-            case s64: output = cov<intl  , double>(X, Y, isbiased); break;
-            case u64: output = cov<uintl , double>(X, Y, isbiased); break;
-            case  u8: output = cov<uchar , float >(X, Y, isbiased); break;
-            default : TYPE_ERROR(1, xType);
+        switch (xType) {
+            case f64: output = cov<double, double>(X, Y, bias); break;
+            case f32: output = cov<float, float>(X, Y, bias); break;
+            case s32: output = cov<int, float>(X, Y, bias); break;
+            case u32: output = cov<uint, float>(X, Y, bias); break;
+            case s64: output = cov<intl, double>(X, Y, bias); break;
+            case u64: output = cov<uintl, double>(X, Y, bias); break;
+            case s16: output = cov<short, float>(X, Y, bias); break;
+            case u16: output = cov<ushort, float>(X, Y, bias); break;
+            case s8: output = cov<schar, float>(X, Y, bias); break;
+            case u8: output = cov<uchar, float>(X, Y, bias); break;
+            default: TYPE_ERROR(1, xType);
         }
         std::swap(*out, output);
     }
diff --git a/src/api/c/data.cpp b/src/api/c/data.cpp
index 32d458375f..324936e76e 100644
--- a/src/api/c/data.cpp
+++ b/src/api/c/data.cpp
@@ -7,146 +7,94 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/array.h>
-#include <af/data.h>
-#include <af/device.h>
-#include <af/util.h>
-#include <copy.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <diagonal.hpp>
 #include <handle.hpp>
-#include <random.hpp>
+#include <identity.hpp>
+#include <iota.hpp>
 #include <math.hpp>
+#include <platform.hpp>
 #include <range.hpp>
-#include <iota.hpp>
-#include <identity.hpp>
-#include <diagonal.hpp>
 #include <triangle.hpp>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/util.h>
 
 using af::dim4;
-using namespace detail;
-using namespace std;
-
-static inline dim4 verifyDims(const unsigned ndims, const dim_t * const dims)
-{
-
-    DIM_ASSERT(1, ndims >= 1);
-
-    dim4 d(1, 1, 1, 1);
-
-    for(unsigned i = 0; i < ndims; i++) {
-        d[i] = dims[i];
-        DIM_ASSERT(2, dims[i] >= 1);
-    }
-
-    return d;
-}
-
-af_err af_get_data_ptr(void *data, const af_array arr)
-{
-    try {
-        af_dtype type = getInfo(arr).getType();
-        switch(type) {
-        case f32:   copyData(static_cast<float    *>(data), arr);  break;
-        case c32:   copyData(static_cast<cfloat   *>(data), arr);  break;
-        case f64:   copyData(static_cast<double   *>(data), arr);  break;
-        case c64:   copyData(static_cast<cdouble  *>(data), arr);  break;
-        case b8:    copyData(static_cast<char     *>(data), arr);  break;
-        case s32:   copyData(static_cast<int      *>(data), arr);  break;
-        case u32:   copyData(static_cast<unsigned *>(data), arr);  break;
-        case u8:    copyData(static_cast<uchar    *>(data), arr);  break;
-        case s64:   copyData(static_cast<intl     *>(data), arr);  break;
-        case u64:   copyData(static_cast<uintl    *>(data), arr);  break;
-        default:    TYPE_ERROR(1, type);
-        }
-    }
-    CATCHALL
-        return AF_SUCCESS;
-}
-
-//Strong Exception Guarantee
-af_err af_create_array(af_array *result, const void * const data,
-                       const unsigned ndims, const dim_t * const dims,
-                       const af_dtype type)
-{
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createValueArray;
+using detail::intl;
+using detail::iota;
+using detail::padArrayBorders;
+using detail::range;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+// Strong Exception Guarantee
+af_err af_constant(af_array *result, const double value, const unsigned ndims,
+                   const dim_t *const dims, const af_dtype type) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, type); }
         dim4 d = verifyDims(ndims, dims);
 
-        switch(type) {
-        case f32:   out = createHandleFromData(d, static_cast<const float   *>(data)); break;
-        case c32:   out = createHandleFromData(d, static_cast<const cfloat  *>(data)); break;
-        case f64:   out = createHandleFromData(d, static_cast<const double  *>(data)); break;
-        case c64:   out = createHandleFromData(d, static_cast<const cdouble *>(data)); break;
-        case b8:    out = createHandleFromData(d, static_cast<const char    *>(data)); break;
-        case s32:   out = createHandleFromData(d, static_cast<const int     *>(data)); break;
-        case u32:   out = createHandleFromData(d, static_cast<const uint    *>(data)); break;
-        case u8:    out = createHandleFromData(d, static_cast<const uchar   *>(data)); break;
-        case s64:   out = createHandleFromData(d, static_cast<const intl    *>(data)); break;
-        case u64:   out = createHandleFromData(d, static_cast<const uintl   *>(data)); break;
-        default:    TYPE_ERROR(4, type);
-        }
-        std::swap(*result, out);
-    }
-    CATCHALL
-        return AF_SUCCESS;
-}
-
-//Strong Exception Guarantee
-af_err af_constant(af_array *result, const double value,
-                   const unsigned ndims, const dim_t * const dims,
-                   const af_dtype type)
-{
-    try {
-        af_array out;
-        AF_CHECK(af_init());
-
-        dim4 d = verifyDims(ndims, dims);
-
-        switch(type) {
-        case f32:   out = createHandleFromValue<float  >(d, value); break;
-        case c32:   out = createHandleFromValue<cfloat >(d, value); break;
-        case f64:   out = createHandleFromValue<double >(d, value); break;
-        case c64:   out = createHandleFromValue<cdouble>(d, value); break;
-        case b8:    out = createHandleFromValue<char   >(d, value); break;
-        case s32:   out = createHandleFromValue<int    >(d, value); break;
-        case u32:   out = createHandleFromValue<uint   >(d, value); break;
-        case u8:    out = createHandleFromValue<uchar  >(d, value); break;
-        case s64:   out = createHandleFromValue<intl   >(d, value); break;
-        case u64:   out = createHandleFromValue<uintl  >(d, value); break;
-        default:    TYPE_ERROR(4, type);
+        switch (type) {
+            case f32: out = createHandleFromValue<float>(d, value); break;
+            case c32: out = createHandleFromValue<cfloat>(d, value); break;
+            case f64: out = createHandleFromValue<double>(d, value); break;
+            case c64: out = createHandleFromValue<cdouble>(d, value); break;
+            case b8: out = createHandleFromValue<char>(d, value); break;
+            case s32: out = createHandleFromValue<int>(d, value); break;
+            case u32: out = createHandleFromValue<uint>(d, value); break;
+            case s8: out = createHandleFromValue<schar>(d, value); break;
+            case u8: out = createHandleFromValue<uchar>(d, value); break;
+            case s64: out = createHandleFromValue<intl>(d, value); break;
+            case u64: out = createHandleFromValue<uintl>(d, value); break;
+            case s16: out = createHandleFromValue<short>(d, value); break;
+            case u16: out = createHandleFromValue<ushort>(d, value); break;
+            case f16: out = createHandleFromValue<half>(d, value); break;
+            default: TYPE_ERROR(4, type);
         }
         std::swap(*result, out);
     }
     CATCHALL
-        return AF_SUCCESS;
+    return AF_SUCCESS;
 }
 
 template<typename To, typename Ti>
-static inline af_array createCplx(dim4 dims, const Ti real, const Ti imag)
-{
-    To cval = scalar<To, Ti>(real, imag);
+static inline af_array createCplx(dim4 dims, const Ti real, const Ti imag) {
+    To cval      = scalar<To, Ti>(real, imag);
     af_array out = getHandle(createValueArray<To>(dims, cval));
     return out;
 }
 
-af_err af_constant_complex(af_array *result, const double real, const double imag,
-                           const unsigned ndims, const dim_t * const dims, af_dtype type)
-{
+af_err af_constant_complex(af_array *result, const double real,
+                           const double imag, const unsigned ndims,
+                           const dim_t *const dims, af_dtype type) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, type); }
         dim4 d = verifyDims(ndims, dims);
 
         switch (type) {
-        case c32: out = createCplx<cfloat , float >(d, real, imag); break;
-        case c64: out = createCplx<cdouble, double>(d, real, imag); break;
-        default:   TYPE_ERROR(5, type);
+            case c32: out = createCplx<cfloat, float>(d, real, imag); break;
+            case c64: out = createCplx<cdouble, double>(d, real, imag); break;
+            default: TYPE_ERROR(5, type);
         }
 
         std::swap(*result, out);
@@ -155,199 +103,77 @@ af_err af_constant_complex(af_array *result, const double real, const double ima
     return AF_SUCCESS;
 }
 
-af_err af_constant_long(af_array *result, const intl val,
-                        const unsigned ndims, const dim_t * const dims)
-{
+af_err af_constant_long(af_array *result, const intl val, const unsigned ndims,
+                        const dim_t *const dims) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, s64); }
         dim4 d = verifyDims(ndims, dims);
 
         out = getHandle(createValueArray<intl>(d, val));
 
         std::swap(*result, out);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
 af_err af_constant_ulong(af_array *result, const uintl val,
-                         const unsigned ndims, const dim_t * const dims)
-{
+                         const unsigned ndims, const dim_t *const dims) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, u64); }
         dim4 d = verifyDims(ndims, dims);
-        out = getHandle(createValueArray<uintl>(d, val));
-
-        std::swap(*result, out);
-    } CATCHALL;
-
-    return AF_SUCCESS;
-}
 
-//Strong Exception Guarantee
-af_err af_create_handle(af_array *result, const unsigned ndims, const dim_t * const dims,
-                        const af_dtype type)
-{
-    try {
-        af_array out;
-        AF_CHECK(af_init());
-
-        dim4 d((size_t)dims[0]);
-        for(unsigned i = 1; i < ndims; i++) {
-            d[i] = dims[i];
-        }
+        out = getHandle(createValueArray<uintl>(d, val));
 
-        switch(type) {
-        case f32:   out = createHandle<float  >(d); break;
-        case c32:   out = createHandle<cfloat >(d); break;
-        case f64:   out = createHandle<double >(d); break;
-        case c64:   out = createHandle<cdouble>(d); break;
-        case b8:    out = createHandle<char   >(d); break;
-        case s32:   out = createHandle<int    >(d); break;
-        case u32:   out = createHandle<uint   >(d); break;
-        case u8:    out = createHandle<uchar  >(d); break;
-        case s64:   out = createHandle<intl   >(d); break;
-        case u64:   out = createHandle<uintl  >(d); break;
-        default:    TYPE_ERROR(3, type);
-        }
         std::swap(*result, out);
     }
-    CATCHALL
-    return AF_SUCCESS;
-}
+    CATCHALL;
 
-//Strong Exception Guarantee
-af_err af_copy_array(af_array *out, const af_array in)
-{
-    try {
-        ArrayInfo info = getInfo(in);
-        const af_dtype type = info.getType();
-
-        af_array res;
-        switch(type) {
-        case f32:   res = copyArray<float   >(in); break;
-        case c32:   res = copyArray<cfloat  >(in); break;
-        case f64:   res = copyArray<double  >(in); break;
-        case c64:   res = copyArray<cdouble >(in); break;
-        case b8:    res = copyArray<char    >(in); break;
-        case s32:   res = copyArray<int     >(in); break;
-        case u32:   res = copyArray<uint    >(in); break;
-        case u8:    res = copyArray<uchar   >(in); break;
-        case s64:   res = copyArray<intl    >(in); break;
-        case u64:   res = copyArray<uintl   >(in); break;
-        default:    TYPE_ERROR(1, type);
-        }
-        std::swap(*out, res);
-    }
-    CATCHALL
     return AF_SUCCESS;
 }
 
 template<typename T>
-static inline af_array randn_(const af::dim4 &dims)
-{
-    return getHandle(randn<T>(dims));
-}
-
-template<typename T>
-static inline af_array randu_(const af::dim4 &dims)
-{
-    return getHandle(randu<T>(dims));
-}
-
-template<typename T>
-static inline af_array identity_(const af::dim4 &dims)
-{
+static inline af_array identity_(const af::dim4 &dims) {
     return getHandle(detail::identity<T>(dims));
 }
 
-af_err af_randu(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type)
-{
+af_err af_identity(af_array *out, const unsigned ndims, const dim_t *const dims,
+                   const af_dtype type) {
     try {
         af_array result;
         AF_CHECK(af_init());
 
-        dim4 d = verifyDims(ndims, dims);
-
-        switch(type) {
-        case f32:   result = randu_<float  >(d);    break;
-        case c32:   result = randu_<cfloat >(d);    break;
-        case f64:   result = randu_<double >(d);    break;
-        case c64:   result = randu_<cdouble>(d);    break;
-        case s32:   result = randu_<int    >(d);    break;
-        case u32:   result = randu_<uint   >(d);    break;
-        case u8:    result = randu_<uchar  >(d);    break;
-        case b8:    result = randu_<char  >(d);    break;
-        default:    TYPE_ERROR(3, type);
-        }
-        std::swap(*out, result);
-    }
-    CATCHALL
-    return AF_SUCCESS;
-}
-
-af_err af_randn(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type)
-{
-    try {
-        af_array result;
-        AF_CHECK(af_init());
+        if (ndims == 0) { return af_create_handle(out, 0, nullptr, type); }
 
         dim4 d = verifyDims(ndims, dims);
 
-        switch(type) {
-        case f32:   result = randn_<float  >(d);    break;
-        case c32:   result = randn_<cfloat >(d);    break;
-        case f64:   result = randn_<double >(d);    break;
-        case c64:   result = randn_<cdouble>(d);    break;
-        default:    TYPE_ERROR(3, type);
-        }
-        std::swap(*out, result);
-    }
-    CATCHALL
-    return AF_SUCCESS;
-}
-
-af_err af_set_seed(const uintl seed)
-{
-    try {
-        setSeed(seed);
-    } CATCHALL;
-    return AF_SUCCESS;
-}
-
-af_err af_get_seed(uintl *seed)
-{
-    try {
-        *seed = getSeed();
-    } CATCHALL;
-    return AF_SUCCESS;
-}
-
-af_err af_identity(af_array *out, const unsigned ndims, const dim_t * const dims, const af_dtype type)
-{
-    try {
-        af_array result;
-        AF_CHECK(af_init());
-
-        dim4 d = verifyDims(ndims, dims);
-
-        switch(type) {
-        case f32:   result = identity_<float  >(d);    break;
-        case c32:   result = identity_<cfloat >(d);    break;
-        case f64:   result = identity_<double >(d);    break;
-        case c64:   result = identity_<cdouble>(d);    break;
-        case s32:   result = identity_<int    >(d);    break;
-        case u32:   result = identity_<uint   >(d);    break;
-        case u8:    result = identity_<uchar  >(d);    break;
-        case u64:   result = identity_<uintl  >(d);    break;
-        case s64:   result = identity_<intl   >(d);    break;
-            // Removed because of bool type. Functions implementations exist.
-        case b8:    result = identity_<char   >(d);    break;
-        default:    TYPE_ERROR(3, type);
+        switch (type) {
+            case f32: result = identity_<float>(d); break;
+            case c32: result = identity_<cfloat>(d); break;
+            case f64: result = identity_<double>(d); break;
+            case c64: result = identity_<cdouble>(d); break;
+            case s32: result = identity_<int>(d); break;
+            case u32: result = identity_<uint>(d); break;
+            case s8: result = identity_<schar>(d); break;
+            case u8: result = identity_<uchar>(d); break;
+            case u64: result = identity_<uintl>(d); break;
+            case s64: result = identity_<intl>(d); break;
+            case u16: result = identity_<ushort>(d); break;
+            case s16:
+                result = identity_<short>(d);
+                break;
+                // Removed because of bool type. Function implementations
+                // exist.
+            case b8: result = identity_<char>(d); break;
+            case f16: result = identity_<half>(d); break;
+            default: TYPE_ERROR(3, type);
         }
         std::swap(*out, result);
     }
@@ -355,91 +181,34 @@ af_err af_identity(af_array *out, const unsigned ndims, const dim_t * const dims
     return AF_SUCCESS;
 }
 
-af_err af_release_array(af_array arr)
-{
-    try {
-        af_dtype type = getInfo(arr).getType();
-
-        switch(type) {
-        case f32:   releaseHandle<float   >(arr); break;
-        case c32:   releaseHandle<cfloat  >(arr); break;
-        case f64:   releaseHandle<double  >(arr); break;
-        case c64:   releaseHandle<cdouble >(arr); break;
-        case b8:    releaseHandle<char    >(arr); break;
-        case s32:   releaseHandle<int     >(arr); break;
-        case u32:   releaseHandle<uint    >(arr); break;
-        case u8:    releaseHandle<uchar   >(arr); break;
-        case s64:   releaseHandle<intl    >(arr); break;
-        case u64:   releaseHandle<uintl   >(arr); break;
-        default:    TYPE_ERROR(0, type);
-        }
-    }
-    CATCHALL
-
-    return AF_SUCCESS;
-}
-
-
-template<typename T>
-static af_array retainHandle(const af_array in)
-{
-    detail::Array<T> *A = reinterpret_cast<detail::Array<T> *>(in);
-    detail::Array<T> *out = detail::initArray<T>();
-    *out= *A;
-    return reinterpret_cast<af_array>(out);
-}
-
-af_array retain(const af_array in)
-{
-    af_dtype ty = getInfo(in).getType();
-    switch(ty) {
-    case f32: return retainHandle<float           >(in);
-    case f64: return retainHandle<double          >(in);
-    case s32: return retainHandle<int             >(in);
-    case u32: return retainHandle<uint            >(in);
-    case u8:  return retainHandle<uchar           >(in);
-    case c32: return retainHandle<detail::cfloat  >(in);
-    case c64: return retainHandle<detail::cdouble >(in);
-    case b8:  return retainHandle<char            >(in);
-    case s64: return retainHandle<intl            >(in);
-    case u64: return retainHandle<uintl           >(in);
-    default:
-        TYPE_ERROR(1, ty);
-    }
-}
-
-af_err af_retain_array(af_array *out, const af_array in)
-{
-    try {
-        *out = retain(in);
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-}
-
 template<typename T>
-static inline af_array range_(const dim4& d, const int seq_dim)
-{
+static inline af_array range_(const dim4 &d, const int seq_dim) {
     return getHandle(range<T>(d, seq_dim));
 }
 
-//Strong Exception Guarantee
-af_err af_range(af_array *result, const unsigned ndims, const dim_t * const dims,
-               const int seq_dim, const af_dtype type)
-{
+// Strong Exception Guarantee
+af_err af_range(af_array *result, const unsigned ndims, const dim_t *const dims,
+                const int seq_dim, const af_dtype type) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, type); }
         dim4 d = verifyDims(ndims, dims);
 
-        switch(type) {
-        case f32:   out = range_<float  >(d, seq_dim); break;
-        case f64:   out = range_<double >(d, seq_dim); break;
-        case s32:   out = range_<int    >(d, seq_dim); break;
-        case u32:   out = range_<uint   >(d, seq_dim); break;
-        case u8:    out = range_<uchar  >(d, seq_dim); break;
-        default:    TYPE_ERROR(4, type);
+        switch (type) {
+            case f32: out = range_<float>(d, seq_dim); break;
+            case f64: out = range_<double>(d, seq_dim); break;
+            case s32: out = range_<int>(d, seq_dim); break;
+            case u32: out = range_<uint>(d, seq_dim); break;
+            case s64: out = range_<intl>(d, seq_dim); break;
+            case u64: out = range_<uintl>(d, seq_dim); break;
+            case s16: out = range_<short>(d, seq_dim); break;
+            case u16: out = range_<ushort>(d, seq_dim); break;
+            case s8: out = range_<schar>(d, seq_dim); break;
+            case u8: out = range_<uchar>(d, seq_dim); break;
+            case f16: out = range_<half>(d, seq_dim); break;
+            default: TYPE_ERROR(4, type);
         }
         std::swap(*result, out);
     }
@@ -448,39 +217,39 @@ af_err af_range(af_array *result, const unsigned ndims, const dim_t * const dims
 }
 
 template<typename T>
-static inline af_array iota_(const dim4 &dims, const dim4 &tile_dims)
-{
+static inline af_array iota_(const dim4 &dims, const dim4 &tile_dims) {
     return getHandle(iota<T>(dims, tile_dims));
 }
 
-//Strong Exception Guarantee
-af_err af_iota(af_array *result, const unsigned ndims, const dim_t * const dims,
-               const unsigned t_ndims, const dim_t * const tdims, const af_dtype type)
-{
+// Strong Exception Guarantee
+af_err af_iota(af_array *result, const unsigned ndims, const dim_t *const dims,
+               const unsigned t_ndims, const dim_t *const tdims,
+               const af_dtype type) {
     try {
         af_array out;
         AF_CHECK(af_init());
 
+        if (ndims == 0) { return af_create_handle(result, 0, nullptr, type); }
+
         DIM_ASSERT(1, ndims > 0 && ndims <= 4);
         DIM_ASSERT(3, t_ndims > 0 && t_ndims <= 4);
-        dim4 d;
-        dim4 t;
-        for(unsigned i = 0; i < 4; i++) {
-            d[i] = dims[i];
-            DIM_ASSERT(2, d[i] >= 1);
-        }
-        for(unsigned i = 0; i < 4; i++) {
-            t[i] = tdims[i];
-            DIM_ASSERT(4, t[i] >= 1);
-        }
 
-        switch(type) {
-        case f32:   out = iota_<float  >(d, t); break;
-        case f64:   out = iota_<double >(d, t); break;
-        case s32:   out = iota_<int    >(d, t); break;
-        case u32:   out = iota_<uint   >(d, t); break;
-        case u8:    out = iota_<uchar  >(d, t); break;
-        default:    TYPE_ERROR(4, type);
+        dim4 d = verifyDims(ndims, dims);
+        dim4 t = verifyDims(t_ndims, tdims);
+
+        switch (type) {
+            case f32: out = iota_<float>(d, t); break;
+            case f64: out = iota_<double>(d, t); break;
+            case s32: out = iota_<int>(d, t); break;
+            case u32: out = iota_<uint>(d, t); break;
+            case s64: out = iota_<intl>(d, t); break;
+            case u64: out = iota_<uintl>(d, t); break;
+            case s16: out = iota_<short>(d, t); break;
+            case u16: out = iota_<ushort>(d, t); break;
+            case s8: out = iota_<schar>(d, t); break;
+            case u8: out = iota_<uchar>(d, t); break;
+            case f16: out = iota_<half>(d, t); break;
+            default: TYPE_ERROR(4, type);
         }
         std::swap(*result, out);
     }
@@ -488,245 +257,214 @@ af_err af_iota(af_array *result, const unsigned ndims, const dim_t * const dims,
     return AF_SUCCESS;
 }
 
-#undef INSTANTIATE
-#define INSTANTIATE(fn1, fn2)                   \
-    af_err fn1(bool *result, const af_array in) \
-    {                                           \
-        try {                                   \
-            ArrayInfo info = getInfo(in);       \
-            *result = info.fn2();               \
-        }                                       \
-        CATCHALL                                \
-            return AF_SUCCESS;                  \
-    }
-
-INSTANTIATE(af_is_empty       , isEmpty       )
-INSTANTIATE(af_is_scalar      , isScalar      )
-INSTANTIATE(af_is_row         , isRow         )
-INSTANTIATE(af_is_column      , isColumn      )
-INSTANTIATE(af_is_vector      , isVector      )
-INSTANTIATE(af_is_complex     , isComplex     )
-INSTANTIATE(af_is_real        , isReal        )
-INSTANTIATE(af_is_double      , isDouble      )
-INSTANTIATE(af_is_single      , isSingle      )
-INSTANTIATE(af_is_realfloating, isRealFloating)
-INSTANTIATE(af_is_floating    , isFloating    )
-INSTANTIATE(af_is_integer     , isInteger     )
-INSTANTIATE(af_is_bool        , isBool        )
-
-#undef INSTANTIATE
-
-af_err af_get_dims(dim_t *d0, dim_t *d1, dim_t *d2, dim_t *d3,
-                   const af_array in)
-{
-    try {
-        ArrayInfo info = getInfo(in);
-        *d0 = info.dims()[0];
-        *d1 = info.dims()[1];
-        *d2 = info.dims()[2];
-        *d3 = info.dims()[3];
-    }
-    CATCHALL
-    return AF_SUCCESS;
-}
-
-af_err af_get_numdims(unsigned *nd, const af_array in)
-{
-    try {
-        ArrayInfo info = getInfo(in);
-        *nd = info.ndims();
-    }
-    CATCHALL
-    return AF_SUCCESS;
-}
-
 template<typename T>
-static inline void eval(af_array arr)
-{
-    evalArray(getArray<T>(arr));
-    return;
-}
-
-af_err af_eval(af_array arr)
-{
-    try {
-        af_dtype type = getInfo(arr).getType();
-        switch (type) {
-        case f32: eval<float  >(arr); break;
-        case f64: eval<double >(arr); break;
-        case c32: eval<cfloat >(arr); break;
-        case c64: eval<cdouble>(arr); break;
-        case s32: eval<int    >(arr); break;
-        case u32: eval<uint   >(arr); break;
-        case u8 : eval<uchar  >(arr); break;
-        case b8 : eval<char   >(arr); break;
-        case s64: eval<intl   >(arr); break;
-        case u64: eval<uintl  >(arr); break;
-        default:
-            TYPE_ERROR(0, type);
-        }
-    } CATCHALL;
-
-    return AF_SUCCESS;
-}
-
-template<typename T>
-static inline af_array diagCreate(const af_array in, const int num)
-{
+static inline af_array diagCreate(const af_array in, const int num) {
     return getHandle(diagCreate<T>(getArray<T>(in), num));
 }
 
 template<typename T>
-static inline af_array diagExtract(const af_array in, const int num)
-{
+static inline af_array diagExtract(const af_array in, const int num) {
     return getHandle(diagExtract<T>(getArray<T>(in), num));
 }
 
-af_err af_diag_create(af_array *out, const af_array in, const int num)
-{
+af_err af_diag_create(af_array *out, const af_array in, const int num) {
     try {
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
         DIM_ASSERT(1, in_info.ndims() <= 2);
         af_dtype type = in_info.getType();
 
         af_array result;
-        switch(type) {
-        case f32:   result = diagCreate<float  >(in, num);    break;
-        case c32:   result = diagCreate<cfloat >(in, num);    break;
-        case f64:   result = diagCreate<double >(in, num);    break;
-        case c64:   result = diagCreate<cdouble>(in, num);    break;
-        case s32:   result = diagCreate<int    >(in, num);    break;
-        case u32:   result = diagCreate<uint   >(in, num);    break;
-        case s64:   result = diagCreate<intl   >(in, num);    break;
-        case u64:   result = diagCreate<uintl  >(in, num);    break;
-        case u8:    result = diagCreate<uchar  >(in, num);    break;
-            // Removed because of bool type. Functions implementations exist.
-        case b8:    result = diagCreate<char   >(in, num);    break;
-        default:    TYPE_ERROR(1, type);
+
+        if (in_info.dims()[0] == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
+        switch (type) {
+            case f32: result = diagCreate<float>(in, num); break;
+            case c32: result = diagCreate<cfloat>(in, num); break;
+            case f64: result = diagCreate<double>(in, num); break;
+            case c64: result = diagCreate<cdouble>(in, num); break;
+            case s32: result = diagCreate<int>(in, num); break;
+            case u32: result = diagCreate<uint>(in, num); break;
+            case s64: result = diagCreate<intl>(in, num); break;
+            case u64: result = diagCreate<uintl>(in, num); break;
+            case s16: result = diagCreate<short>(in, num); break;
+            case u16: result = diagCreate<ushort>(in, num); break;
+            case s8: result = diagCreate<schar>(in, num); break;
+            case u8:
+                result = diagCreate<uchar>(in, num);
+                break;
+                // Removed because of bool type. Function implementations
+                // exist.
+            case b8: result = diagCreate<char>(in, num); break;
+            case f16: result = diagCreate<half>(in, num); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, result);
-    } CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_diag_extract(af_array *out, const af_array in, const int num)
-{
-
+af_err af_diag_extract(af_array *out, const af_array in, const int num) {
     try {
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
+
+        if (in_info.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
         DIM_ASSERT(1, in_info.ndims() >= 2);
-        af_dtype type = in_info.getType();
 
-        af_array result;
-        switch(type) {
-        case f32:   result = diagExtract<float  >(in, num);    break;
-        case c32:   result = diagExtract<cfloat >(in, num);    break;
-        case f64:   result = diagExtract<double >(in, num);    break;
-        case c64:   result = diagExtract<cdouble>(in, num);    break;
-        case s32:   result = diagExtract<int    >(in, num);    break;
-        case u32:   result = diagExtract<uint   >(in, num);    break;
-        case s64:   result = diagExtract<intl   >(in, num);    break;
-        case u64:   result = diagExtract<uintl  >(in, num);    break;
-        case u8:    result = diagExtract<uchar  >(in, num);    break;
-            // Removed because of bool type. Functions implementations exist.
-        case b8:    result = diagExtract<char   >(in, num);    break;
-        default:    TYPE_ERROR(1, type);
+        af_array result = nullptr;
+        switch (type) {
+            case f32: result = diagExtract<float>(in, num); break;
+            case c32: result = diagExtract<cfloat>(in, num); break;
+            case f64: result = diagExtract<double>(in, num); break;
+            case c64: result = diagExtract<cdouble>(in, num); break;
+            case s32: result = diagExtract<int>(in, num); break;
+            case u32: result = diagExtract<uint>(in, num); break;
+            case s64: result = diagExtract<intl>(in, num); break;
+            case u64: result = diagExtract<uintl>(in, num); break;
+            case s16: result = diagExtract<short>(in, num); break;
+            case u16: result = diagExtract<ushort>(in, num); break;
+            case s8: result = diagExtract<schar>(in, num); break;
+            case u8:
+                result = diagExtract<uchar>(in, num);
+                break;
+                // Removed because of bool type. Function implementations
+                // exist.
+            case b8: result = diagExtract<char>(in, num); break;
+            case f16: result = diagExtract<half>(in, num); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, result);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
 template<typename T>
-void write_array(af_array arr, const T * const data, const size_t bytes, af_source src)
-{
-    if(src == afHost) {
-        writeHostDataArray(getWritableArray<T>(arr), data, bytes);
-    } else {
-        writeDeviceDataArray(getWritableArray<T>(arr), data, bytes);
-    }
-    return;
+inline af_array triangle(const af_array in, const bool is_upper,
+                         const bool is_unit_diag) {
+    return getHandle(triangle<T>(getArray<T>(in), is_upper, is_unit_diag));
 }
 
-af_err af_write_array(af_array arr, const void *data, const size_t bytes, af_source src)
-{
+af_err af_lower(af_array *out, const af_array in, bool is_unit_diag) {
     try {
-        af_dtype type = getInfo(arr).getType();
-        //DIM_ASSERT(2, bytes <= getInfo(arr).bytes());
-
-        switch(type) {
-        case f32:   write_array(arr, static_cast<const float   *>(data), bytes, src); break;
-        case c32:   write_array(arr, static_cast<const cfloat  *>(data), bytes, src); break;
-        case f64:   write_array(arr, static_cast<const double  *>(data), bytes, src); break;
-        case c64:   write_array(arr, static_cast<const cdouble *>(data), bytes, src); break;
-        case b8:    write_array(arr, static_cast<const char    *>(data), bytes, src); break;
-        case s32:   write_array(arr, static_cast<const int     *>(data), bytes, src); break;
-        case u32:   write_array(arr, static_cast<const uint    *>(data), bytes, src); break;
-        case u8:    write_array(arr, static_cast<const uchar   *>(data), bytes, src); break;
-        case s64:   write_array(arr, static_cast<const intl    *>(data), bytes, src); break;
-        case u64:   write_array(arr, static_cast<const uintl   *>(data), bytes, src); break;
-        default:    TYPE_ERROR(4, type);
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
+
+        af_array res = nullptr;
+        switch (type) {
+            case f32: res = triangle<float>(in, false, is_unit_diag); break;
+            case f64: res = triangle<double>(in, false, is_unit_diag); break;
+            case c32: res = triangle<cfloat>(in, false, is_unit_diag); break;
+            case c64: res = triangle<cdouble>(in, false, is_unit_diag); break;
+            case s32: res = triangle<int>(in, false, is_unit_diag); break;
+            case u32: res = triangle<uint>(in, false, is_unit_diag); break;
+            case s64: res = triangle<intl>(in, false, is_unit_diag); break;
+            case u64: res = triangle<uintl>(in, false, is_unit_diag); break;
+            case s16: res = triangle<short>(in, false, is_unit_diag); break;
+            case u16: res = triangle<ushort>(in, false, is_unit_diag); break;
+            case s8: res = triangle<schar>(in, false, is_unit_diag); break;
+            case u8: res = triangle<uchar>(in, false, is_unit_diag); break;
+            case b8: res = triangle<char>(in, false, is_unit_diag); break;
+            case f16: res = triangle<half>(in, false, is_unit_diag); break;
         }
+        std::swap(*out, res);
     }
     CATCHALL
-        return AF_SUCCESS;
-}
-
-template<typename T, bool is_upper>
-af_array triangle(const af_array in, bool is_unit_diag)
-{
-    if (is_unit_diag)
-        return getHandle(triangle<T, is_upper,  true>(getArray<T>(in)));
-    else
-        return getHandle(triangle<T, is_upper, false>(getArray<T>(in)));
+    return AF_SUCCESS;
 }
 
-af_err af_lower(af_array *out, const af_array in, bool is_unit_diag)
-{
+af_err af_upper(af_array *out, const af_array in, bool is_unit_diag) {
     try {
-        af_dtype type = getInfo(in).getType();
-        af_array res;
-        switch(type) {
-        case f32: res = triangle<float   , false>(in, is_unit_diag); break;
-        case f64: res = triangle<double  , false>(in, is_unit_diag); break;
-        case c32: res = triangle<cfloat  , false>(in, is_unit_diag); break;
-        case c64: res = triangle<cdouble , false>(in, is_unit_diag); break;
-        case s32: res = triangle<int     , false>(in, is_unit_diag); break;
-        case s64: res = triangle<intl    , false>(in, is_unit_diag); break;
-        case u32: res = triangle<uint    , false>(in, is_unit_diag); break;
-        case u64: res = triangle<uintl   , false>(in, is_unit_diag); break;
-        case u8 : res = triangle<uchar   , false>(in, is_unit_diag); break;
-        case b8 : res = triangle<char    , false>(in, is_unit_diag); break;
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
+
+        af_array res = nullptr;
+        switch (type) {
+            case f32: res = triangle<float>(in, true, is_unit_diag); break;
+            case f64: res = triangle<double>(in, true, is_unit_diag); break;
+            case c32: res = triangle<cfloat>(in, true, is_unit_diag); break;
+            case c64: res = triangle<cdouble>(in, true, is_unit_diag); break;
+            case s32: res = triangle<int>(in, true, is_unit_diag); break;
+            case u32: res = triangle<uint>(in, true, is_unit_diag); break;
+            case s64: res = triangle<intl>(in, true, is_unit_diag); break;
+            case u64: res = triangle<uintl>(in, true, is_unit_diag); break;
+            case s16: res = triangle<short>(in, true, is_unit_diag); break;
+            case u16: res = triangle<ushort>(in, true, is_unit_diag); break;
+            case s8: res = triangle<schar>(in, true, is_unit_diag); break;
+            case u8: res = triangle<uchar>(in, true, is_unit_diag); break;
+            case b8: res = triangle<char>(in, true, is_unit_diag); break;
+            case f16: res = triangle<half>(in, true, is_unit_diag); break;
         }
         std::swap(*out, res);
     }
     CATCHALL
-        return AF_SUCCESS;
+    return AF_SUCCESS;
 }
 
+template<typename T>
+inline af_array pad(const af_array in, const dim4 &lPad, const dim4 &uPad,
+                    const af::borderType ptype) {
+    return getHandle(padArrayBorders<T>(getArray<T>(in), lPad, uPad, ptype));
+}
 
-af_err af_upper(af_array *out, const af_array in, bool is_unit_diag)
-{
+af_err af_pad(af_array *out, const af_array in, const unsigned begin_ndims,
+              const dim_t *const begin_dims, const unsigned end_ndims,
+              const dim_t *const end_dims, const af_border_type pad_type) {
     try {
-        af_dtype type = getInfo(in).getType();
-        af_array res;
-        switch(type) {
-        case f32: res = triangle<float   , true>(in, is_unit_diag); break;
-        case f64: res = triangle<double  , true>(in, is_unit_diag); break;
-        case c32: res = triangle<cfloat  , true>(in, is_unit_diag); break;
-        case c64: res = triangle<cdouble , true>(in, is_unit_diag); break;
-        case s32: res = triangle<int     , true>(in, is_unit_diag); break;
-        case s64: res = triangle<intl    , true>(in, is_unit_diag); break;
-        case u32: res = triangle<uint    , true>(in, is_unit_diag); break;
-        case u64: res = triangle<uintl   , true>(in, is_unit_diag); break;
-        case u8 : res = triangle<uchar   , true>(in, is_unit_diag); break;
-        case b8 : res = triangle<char    , true>(in, is_unit_diag); break;
+        DIM_ASSERT(2, begin_ndims > 0 && begin_ndims <= 4);
+        DIM_ASSERT(4, end_ndims > 0 && end_ndims <= 4);
+        ARG_ASSERT(3, begin_dims != NULL);
+        ARG_ASSERT(5, end_dims != NULL);
+        ARG_ASSERT(6, (pad_type >= AF_PAD_ZERO && pad_type <= AF_PAD_PERIODIC));
+        for (unsigned i = 0; i < begin_ndims; i++) {
+            DIM_ASSERT(3, begin_dims[i] >= 0);
+        }
+        for (unsigned i = 0; i < end_ndims; i++) {
+            DIM_ASSERT(5, end_dims[i] >= 0);
+        }
+
+        dim4 lPad(begin_ndims, begin_dims);
+        dim4 uPad(end_ndims, end_dims);
+        for (unsigned i = begin_ndims; i < AF_MAX_DIMS; i++) { lPad[i] = 0; }
+        for (unsigned i = end_ndims; i < AF_MAX_DIMS; i++) { uPad[i] = 0; }
+
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
+
+        af_array res = 0;
+        switch (type) {
+            case f32: res = pad<float>(in, lPad, uPad, pad_type); break;
+            case f64: res = pad<double>(in, lPad, uPad, pad_type); break;
+            case c32: res = pad<cfloat>(in, lPad, uPad, pad_type); break;
+            case c64: res = pad<cdouble>(in, lPad, uPad, pad_type); break;
+            case s32: res = pad<int>(in, lPad, uPad, pad_type); break;
+            case u32: res = pad<uint>(in, lPad, uPad, pad_type); break;
+            case s64: res = pad<intl>(in, lPad, uPad, pad_type); break;
+            case u64: res = pad<uintl>(in, lPad, uPad, pad_type); break;
+            case s16: res = pad<short>(in, lPad, uPad, pad_type); break;
+            case u16: res = pad<ushort>(in, lPad, uPad, pad_type); break;
+            case s8: res = pad<schar>(in, lPad, uPad, pad_type); break;
+            case u8: res = pad<uchar>(in, lPad, uPad, pad_type); break;
+            case b8: res = pad<char>(in, lPad, uPad, pad_type); break;
+            case f16: res = pad<half>(in, lPad, uPad, pad_type); break;
         }
         std::swap(*out, res);
     }
     CATCHALL
-        return AF_SUCCESS;
+    return AF_SUCCESS;
 }
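Note on the `triangle` change above: `is_upper` and `is_unit_diag` move from template parameters to runtime arguments, so `af_lower` and `af_upper` differ only in the boolean they pass. The sketch below illustrates the semantics this appears to rely on, using a plain `std::vector`. It assumes column-major storage and that a unit diagonal overwrites the diagonal entries with 1. `triangleSketch` is a hypothetical name and is not the backend implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of ArrayFire: keeps the upper or lower
// triangle of a rows x cols column-major matrix and zeroes the rest.
// When is_unit_diag is true, diagonal entries are set to 1.
std::vector<float> triangleSketch(const std::vector<float>& in,
                                  std::size_t rows, std::size_t cols,
                                  bool is_upper, bool is_unit_diag) {
    assert(in.size() == rows * cols);
    std::vector<float> out(in.size(), 0.0f);
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            const std::size_t idx = c * rows + r;  // column-major offset
            const bool on_diag    = (r == c);
            const bool keep       = is_upper ? (r <= c) : (r >= c);
            if (on_diag && is_unit_diag) {
                out[idx] = 1.0f;
            } else if (keep) {
                out[idx] = in[idx];
            }
        }
    }
    return out;
}
```

Calling `triangleSketch(m, 2, 2, false, false)` mirrors what `af_lower` requests, and passing `true` for `is_upper` mirrors `af_upper`.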
diff --git a/src/api/c/deconvolution.cpp b/src/api/c/deconvolution.cpp
new file mode 100644
index 0000000000..19ad89e5db
--- /dev/null
+++ b/src/api/c/deconvolution.cpp
@@ -0,0 +1,336 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/dispatch.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <fft_common.hpp>
+#include <fftconvolve.hpp>
+#include <handle.hpp>
+#include <logic.hpp>
+#include <math.hpp>
+#include <reduce.hpp>
+#include <select.hpp>
+#include <shift.hpp>
+#include <unary.hpp>
+#include <af/image.h>
+
+#include <algorithm>
+#include <array>
+#include <cmath>
+#include <type_traits>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createSubArray;
+using detail::createValueArray;
+using detail::logicOp;
+using detail::padArrayBorders;
+using detail::scalar;
+using detail::schar;
+using detail::select_scalar;
+using detail::shift;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::array;
+using std::vector;
+
+const int BASE_DIM = 2;
+
+#if defined(AF_CPU)
+// CPU backend uses FFTW or MKL
+// FFTW works with any data size, but is optimized for
+// size decomposition with prime factors up to
+// 13.
+const dim_t GREATEST_PRIME_FACTOR = 13;
+#else
+// cuFFT/clFFT works with any data size, but is optimized
+// for size decomposition with prime factors up to
+// 7.
+const dim_t GREATEST_PRIME_FACTOR = 7;
+#endif
+
+template<typename T, typename CT>
+Array<T> complexNorm(const Array<CT>& input) {
+    auto mag = detail::abs<T, CT>(input);
+    return arithOp<T, af_mul_t>(mag, mag, input.dims());
+}
+
+std::vector<af_seq> calcPadInfo(dim4& inLPad, dim4& psfLPad, dim4& inUPad,
+                                dim4& psfUPad, dim4& odims, dim_t nElems,
+                                const dim4& idims, const dim4& fdims) {
+    vector<af_seq> index(4);
+
+    for (int d = 0; d < 4; ++d) {
+        if (d < BASE_DIM) {
+            dim_t pad = idims[d] + fdims[d];
+
+            while (greatestPrimeFactor(pad) > GREATEST_PRIME_FACTOR) { pad++; }
+
+            dim_t diffLen  = pad - idims[d];
+            inLPad[d]      = diffLen / 2;
+            inUPad[d]      = diffLen / 2 + diffLen % 2;
+            psfLPad[d]     = 0;
+            psfUPad[d]     = pad - fdims[d];
+            odims[d]       = pad;
+            index[d].begin = inLPad[d];
+            index[d].end   = index[d].begin + idims[d] - 1;
+            index[d].step  = 1;
+
+            nElems *= odims[d];
+        } else {
+            inLPad[d]  = 0;
+            psfLPad[d] = 0;
+            inUPad[d]  = 0;
+            psfUPad[d] = 0;
+            odims[d]   = std::max(idims[d], fdims[d]);
+            index[d]   = af_span;
+        }
+    }
+    return index;
+}
+
+template<typename T, typename CT>
+void richardsonLucy(Array<T>& currentEstimate, const Array<T>& in,
+                    const Array<CT>& P, const Array<CT>& Pc,
+                    const unsigned iters, const float normFactor,
+                    const dim4 odims) {
+    for (unsigned i = 0; i < iters; ++i) {
+        auto fft1  = fft_r2c<CT, T>(currentEstimate, BASE_DIM);
+        auto cmul1 = arithOp<CT, af_mul_t>(fft1, P, P.dims());
+        auto ifft1 = fft_c2r<CT, T>(cmul1, normFactor, odims, BASE_DIM);
+        auto div1  = arithOp<T, af_div_t>(in, ifft1, in.dims());
+        auto fft2  = fft_r2c<CT, T>(div1, BASE_DIM);
+        auto cmul2 = arithOp<CT, af_mul_t>(fft2, Pc, Pc.dims());
+        auto ifft2 = fft_c2r<CT, T>(cmul2, normFactor, odims, BASE_DIM);
+
+        currentEstimate =
+            arithOp<T, af_mul_t>(currentEstimate, ifft2, ifft2.dims());
+    }
+}
+
+template<typename T, typename CT>
+void landweber(Array<T>& currentEstimate, const Array<T>& in,
+               const Array<CT>& P, const Array<CT>& Pc, const unsigned iters,
+               const float relaxFactor, const float normFactor,
+               const dim4 odims) {
+    const dim4& dims = P.dims();
+
+    auto I        = fft_r2c<CT, T>(in, BASE_DIM);
+    auto Pn       = complexNorm<T, CT>(P);
+    auto ONE      = createValueArray(dims, scalar<T>(1.0));
+    auto alpha    = createValueArray(dims, scalar<T>(relaxFactor));
+    auto alphaC   = cast<CT>(alpha);
+    auto prod     = arithOp<T, af_mul_t>(alpha, Pn, dims);
+    auto lhsFac   = arithOp<T, af_sub_t>(ONE, prod, dims);
+    auto lhs      = cast<CT>(lhsFac);
+    auto rhsFac   = arithOp<CT, af_mul_t>(Pc, I, dims);
+    auto rhs      = arithOp<CT, af_mul_t>(rhsFac, alphaC, dims);
+    auto iterTemp = I;
+
+    for (unsigned i = 0; i < iters; ++i) {
+        auto mul = arithOp<CT, af_mul_t>(iterTemp, lhs, dims);
+        iterTemp = arithOp<CT, af_add_t>(mul, rhs, dims);
+    }
+    currentEstimate = fft_c2r<CT, T>(iterTemp, normFactor, odims, BASE_DIM);
+}
+
+template<typename InputType, typename RealType = float>
+af_array iterDeconv(const af_array in, const af_array ker, const uint iters,
+                    const float rfactor, const af_iterative_deconv_algo algo) {
+    using T    = RealType;
+    using CT   = typename std::conditional<std::is_same<T, double>::value,
+                                         cdouble, cfloat>::type;
+    auto input = castArray<T>(in);
+    auto psf   = castArray<T>(ker);
+    const dim4& idims = input.dims();
+    const dim4& fdims = psf.dims();
+    dim_t nElems      = 1;
+
+    dim4 inUPad, psfUPad, inLPad, psfLPad, odims(1);
+
+    auto index = calcPadInfo(inLPad, psfLPad, inUPad, psfUPad, odims, nElems,
+                             idims, fdims);
+    auto paddedIn =
+        padArrayBorders<T>(input, inLPad, inUPad, AF_PAD_CLAMP_TO_EDGE);
+    auto paddedPsf = padArrayBorders<T>(psf, psfLPad, psfUPad, AF_PAD_ZERO);
+
+    const std::array<int, 4> shiftDims = {-int(fdims[0] / 2),
+                                          -int(fdims[1] / 2), 0, 0};
+    auto shiftedPsf                    = shift(paddedPsf, shiftDims.data());
+
+    auto P  = fft_r2c<CT, T>(shiftedPsf, BASE_DIM);
+    auto Pc = conj(P);
+
+    Array<T> currentEstimate = paddedIn;
+    const double normFactor  = 1 / static_cast<double>(nElems);
+
+    switch (algo) {
+        case AF_ITERATIVE_DECONV_RICHARDSONLUCY:
+            richardsonLucy(currentEstimate, paddedIn, P, Pc, iters, normFactor,
+                           odims);
+            break;
+        case AF_ITERATIVE_DECONV_LANDWEBER:
+        default:
+            landweber(currentEstimate, paddedIn, P, Pc, iters, rfactor,
+                      normFactor, odims);
+    }
+    return getHandle(createSubArray<T>(currentEstimate, index));
+}
+
+af_err af_iterative_deconv(af_array* out, const af_array in, const af_array ker,
+                           const unsigned iterations, const float relax_factor,
+                           const af_iterative_deconv_algo algo) {
+    try {
+        const ArrayInfo& inputInfo  = getInfo(in);
+        const dim4& inputDims       = inputInfo.dims();
+        const ArrayInfo& kernelInfo = getInfo(ker);
+        const dim4& kernelDims      = kernelInfo.dims();
+
+        DIM_ASSERT(2, (inputDims.ndims() == 2));
+        DIM_ASSERT(3, (kernelDims.ndims() == 2));
+        ARG_ASSERT(4, (iterations > 0));
+        ARG_ASSERT(5, std::isfinite(relax_factor));
+        ARG_ASSERT(5, (relax_factor > 0));
+        ARG_ASSERT(6, (algo == AF_ITERATIVE_DECONV_DEFAULT ||
+                       algo == AF_ITERATIVE_DECONV_LANDWEBER ||
+                       algo == AF_ITERATIVE_DECONV_RICHARDSONLUCY));
+        af_array res   = 0;
+        unsigned iters = iterations;
+        float rfac     = relax_factor;
+
+        af_dtype inputType = inputInfo.getType();
+        switch (inputType) {
+            case f32:
+                res = iterDeconv<float>(in, ker, iters, rfac, algo);
+                break;
+            case s16:
+                res = iterDeconv<short>(in, ker, iters, rfac, algo);
+                break;
+            case u16:
+                res = iterDeconv<ushort>(in, ker, iters, rfac, algo);
+                break;
+            case s8: res = iterDeconv<schar>(in, ker, iters, rfac, algo); break;
+            case u8: res = iterDeconv<uchar>(in, ker, iters, rfac, algo); break;
+            default: TYPE_ERROR(1, inputType);
+        }
+        std::swap(res, *out);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<typename CT>
+Array<CT> denominator(const Array<CT>& I, const Array<CT>& P, const float gamma,
+                      const af_inverse_deconv_algo algo) {
+    using T = typename af::dtype_traits<CT>::base_type;
+
+    auto RCNST = createValueArray(I.dims(), scalar<T>(gamma));
+
+    if (algo == AF_INVERSE_DECONV_TIKHONOV) {
+        auto normP = complexNorm<T, CT>(P);
+        auto denom = arithOp<T, af_add_t>(normP, RCNST, normP.dims());
+
+        return cast<CT, T>(denom);
+    } else {
+        // TODO(pradeep) Wiener Filter code path is disabled.
+        // This code path is not exposed through the current API.
+        auto normI = complexNorm<T, CT>(I);
+        auto sRes  = arithOp<T, af_sub_t>(normI, RCNST, normI.dims());
+        auto dRes  = arithOp<T, af_div_t>(RCNST, sRes, RCNST.dims());
+        auto normP = complexNorm<T, CT>(P);
+        auto denom = arithOp<T, af_add_t>(normP, dRes, normP.dims());
+
+        return cast<CT, T>(denom);
+    }
+}
+
+template<typename InputType, typename RealType = float>
+af_array invDeconv(const af_array in, const af_array ker, const float gamma,
+                   const af_inverse_deconv_algo algo) {
+    using T    = RealType;
+    using CT   = typename std::conditional<std::is_same<T, double>::value,
+                                         cdouble, cfloat>::type;
+    auto input = castArray<T>(in);
+    auto psf   = castArray<T>(ker);
+    const dim4& idims = input.dims();
+    const dim4& fdims = psf.dims();
+    dim_t nElems      = 1;
+
+    dim4 inUPad, psfUPad, inLPad, psfLPad, odims(1);
+
+    auto index = calcPadInfo(inLPad, psfLPad, inUPad, psfUPad, odims, nElems,
+                             idims, fdims);
+    auto paddedIn =
+        padArrayBorders<T>(input, inLPad, inUPad, AF_PAD_CLAMP_TO_EDGE);
+    auto paddedPsf = padArrayBorders<T>(psf, psfLPad, psfUPad, AF_PAD_ZERO);
+    const array<int, 4> shiftDims = {-int(fdims[0] / 2), -int(fdims[1] / 2), 0,
+                                     0};
+
+    auto shiftedPsf = shift(paddedPsf, shiftDims.data());
+
+    auto I      = fft_r2c<CT, T>(paddedIn, BASE_DIM);
+    auto P      = fft_r2c<CT, T>(shiftedPsf, BASE_DIM);
+    auto Pc     = conj(P);
+    auto numer  = arithOp<CT, af_mul_t>(I, Pc, I.dims());
+    auto denom  = denominator(I, P, gamma, algo);
+    auto absVal = detail::abs<T, CT>(denom);
+    auto THRESH = createValueArray(I.dims(), scalar<T>(gamma));
+    auto cond   = logicOp<T, af_ge_t>(absVal, THRESH, absVal.dims());
+    auto val    = arithOp<CT, af_div_t>(numer, denom, numer.dims());
+
+    select_scalar<CT, false>(val, cond, val, scalar<CT>(0.0));
+
+    auto ival =
+        fft_c2r<CT, T>(val, 1 / static_cast<double>(nElems), odims, BASE_DIM);
+
+    return getHandle(createSubArray<T>(ival, index));
+}
+
+af_err af_inverse_deconv(af_array* out, const af_array in, const af_array psf,
+                         const float gamma, const af_inverse_deconv_algo algo) {
+    try {
+        const ArrayInfo& inputInfo = getInfo(in);
+        const dim4& inputDims      = inputInfo.dims();
+        const ArrayInfo& psfInfo   = getInfo(psf);
+        const dim4& psfDims        = psfInfo.dims();
+
+        DIM_ASSERT(2, (inputDims.ndims() == 2));
+        DIM_ASSERT(3, (psfDims.ndims() == 2));
+        ARG_ASSERT(4, std::isfinite(gamma));
+        ARG_ASSERT(4, (gamma > 0));
+        ARG_ASSERT(5, (algo == AF_INVERSE_DECONV_DEFAULT ||
+                       algo == AF_INVERSE_DECONV_TIKHONOV));
+        af_array res = 0;
+
+        af_dtype inputType = inputInfo.getType();
+        switch (inputType) {
+            case f32: res = invDeconv<float>(in, psf, gamma, algo); break;
+            case s16: res = invDeconv<short>(in, psf, gamma, algo); break;
+            case u16: res = invDeconv<ushort>(in, psf, gamma, algo); break;
+            case s8: res = invDeconv<schar>(in, psf, gamma, algo); break;
+            case u8: res = invDeconv<uchar>(in, psf, gamma, algo); break;
+            default: TYPE_ERROR(1, inputType);
+        }
+        std::swap(res, *out);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
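Note on `calcPadInfo` in the file above: along each of the first two axes, the padded length starts at input length plus PSF length and grows until its greatest prime factor is at most `GREATEST_PRIME_FACTOR`. The difference is then split between the lower and upper borders, with the extra element going to the upper side when it is odd. The sketch below reproduces that arithmetic for a single axis. `greatestPrimeFactorSketch` is a hypothetical trial-division stand-in for the `greatestPrimeFactor` helper, which the diff calls but does not show, and `AxisPad` and `axisPadSketch` are likewise illustrative names.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the greatestPrimeFactor helper used by
// calcPadInfo; plain trial division, returns 1 for n == 1.
std::int64_t greatestPrimeFactorSketch(std::int64_t n) {
    assert(n >= 1);
    std::int64_t largest = 1;
    for (std::int64_t p = 2; p * p <= n; ++p) {
        while (n % p == 0) {
            largest = p;
            n /= p;
        }
    }
    // Whatever remains above 1 is a prime larger than every factor removed.
    return n > 1 ? n : largest;
}

struct AxisPad {
    std::int64_t padded;  // FFT length along this axis
    std::int64_t lower;   // elements added before the input
    std::int64_t upper;   // elements added after the input
};

// Mirrors the per-axis arithmetic in calcPadInfo for d < BASE_DIM.
// maxPrime is assumed to be at least 2 so the loop terminates.
AxisPad axisPadSketch(std::int64_t inLen, std::int64_t psfLen,
                      std::int64_t maxPrime) {
    assert(maxPrime >= 2);
    std::int64_t padded = inLen + psfLen;
    while (greatestPrimeFactorSketch(padded) > maxPrime) { ++padded; }
    const std::int64_t diff = padded - inLen;
    return {padded, diff / 2, diff / 2 + diff % 2};
}
```

For an input of length 100 and a PSF of length 11 with a limit of 7, the starting length 111 = 3 x 37 is rejected and 112 = 2^4 x 7 is accepted, so six elements are added on each side.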
diff --git a/src/api/c/det.cpp b/src/api/c/det.cpp
index bdad4b665e..8507675b85 100644
--- a/src/api/c/det.cpp
+++ b/src/api/c/det.cpp
@@ -7,32 +7,43 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <math.hpp>
-#include <lu.hpp>
-#include <diagonal.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
 #include <copy.hpp>
+#include <diagonal.hpp>
+#include <handle.hpp>
+#include <lu.hpp>
+#include <math.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::imag;
+using detail::real;
+using detail::scalar;
 
 template<typename T>
-T det(const af_array a)
-{
+T det(const af_array a) {
+    using namespace detail;
     const Array<T> A = getArray<T>(a);
 
     const int num = A.dims()[0];
 
+    if (num == 0) {
+        T res = scalar<T>(1.0);
+        return res;
+    }
+
     std::vector<T> hD(num);
     std::vector<int> hP(num);
 
-    Array<T> D = createEmptyArray<T>(dim4());
+    Array<T> D       = createEmptyArray<T>(dim4());
     Array<int> pivot = createEmptyArray<int>(dim4());
 
     // Free memory as soon as possible
@@ -47,22 +58,20 @@ T det(const af_array a)
     }
 
     bool is_neg = false;
-    T res = scalar<T>(is_neg ? -1 : 1);
+    T res       = scalar<T>(is_neg ? -1 : 1);
     for (int i = 0; i < num; i++) {
         res = res * hD[i];
-        is_neg ^= (hP[i] != (i+1));
+        is_neg ^= (hP[i] != (i + 1));
     }
 
-    if (is_neg) res = res * scalar<T>(-1);
+    if (is_neg) { res = res * scalar<T>(-1); }
 
     return res;
 }
 
-af_err af_det(double *real_val, double *imag_val, const af_array in)
-{
-
+af_err af_det(double *real_val, double *imag_val, const af_array in) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("solve can not be used in batch mode", AF_ERR_BATCH);
@@ -70,8 +79,11 @@ af_err af_det(double *real_val, double *imag_val, const af_array in)
 
         af_dtype type = i_info.getType();
 
-        DIM_ASSERT(1, i_info.dims()[0] == i_info.dims()[1]);      // Only square matrices
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
+        if (i_info.dims()[0]) {
+            DIM_ASSERT(1, i_info.dims()[0] ==
+                              i_info.dims()[1]);  // Only square matrices
+        }
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
 
         *real_val = 0;
         *imag_val = 0;
@@ -79,20 +91,20 @@ af_err af_det(double *real_val, double *imag_val, const af_array in)
         cfloat cfval;
         cdouble cdval;
 
-        switch(type) {
-        case f32: *real_val = det<float  >(in);  break;
-        case f64: *real_val = det<double >(in);  break;
-        case c32:
-            cfval = det<cfloat >(in);
-            *real_val = real(cfval);
-            *imag_val = imag(cfval);
-            break;
-        case c64:
-            cdval = det<cdouble>(in);
-            *real_val = real(cdval);
-            *imag_val = imag(cdval);
-            break;
-        default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: *real_val = det<float>(in); break;
+            case f64: *real_val = det<double>(in); break;
+            case c32:
+                cfval     = det<cfloat>(in);
+                *real_val = real(cfval);
+                *imag_val = imag(cfval);
+                break;
+            case c64:
+                cdval     = det<cdouble>(in);
+                *real_val = real(cdval);
+                *imag_val = imag(cdval);
+                break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
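Review note on the `det.cpp` change above: the determinant is the product of the LU diagonal, negated once for every row swap recorded in the pivot vector, and the new early return gives a 0x0 matrix a determinant of 1. The sketch below illustrates that idea on the host. `detViaLU` is a hypothetical helper written for this note (plain C++, row-major storage, swaps counted directly), not ArrayFire's backend code, which reads 1-based pivots from its own `lu` routine.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper: determinant of an n x n row-major matrix via LU
// factorisation with partial pivoting. The matrix is taken by value
// because the factorisation overwrites it.
double detViaLU(std::vector<double> a, std::size_t n) {
    assert(a.size() == n * n);
    if (n == 0) { return 1.0; }  // empty product, like the new early return

    bool isNeg = false;
    double res = 1.0;
    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude in column k as the pivot.
        std::size_t p = k;
        for (std::size_t r = k + 1; r < n; ++r) {
            if (std::fabs(a[r * n + k]) > std::fabs(a[p * n + k])) { p = r; }
        }
        if (a[p * n + k] == 0.0) { return 0.0; }  // singular matrix

        if (p != k) {
            for (std::size_t c = 0; c < n; ++c) {
                std::swap(a[k * n + c], a[p * n + c]);
            }
            isNeg = !isNeg;  // every row swap flips the sign
        }

        res *= a[k * n + k];  // accumulate the diagonal of U

        // Eliminate column k below the pivot.
        for (std::size_t r = k + 1; r < n; ++r) {
            const double f = a[r * n + k] / a[k * n + k];
            for (std::size_t c = k; c < n; ++c) {
                a[r * n + c] -= f * a[k * n + c];
            }
        }
    }
    return isNeg ? -res : res;
}
```

Calling `detViaLU({1.0, 2.0, 3.0, 4.0}, 2)` performs one row swap, so the sign of the diagonal product is flipped once.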
diff --git a/src/api/c/device.cpp b/src/api/c/device.cpp
index fca88f7a7e..7427a1a4e5 100644
--- a/src/api/c/device.cpp
+++ b/src/api/c/device.cpp
@@ -7,206 +7,411 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/device.h>
-#include <backend.hpp>
-#include <platform.hpp>
 #include <Array.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/util.hpp>
 #include <handle.hpp>
-#include <memory.hpp>
-#include "err_common.hpp"
-
-using namespace detail;
-
-af_err af_init()
-{
+#include <platform.hpp>
+#include <sparse_handle.hpp>
+#include <af/backend.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/version.h>
+
+#if defined(USE_MKL)
+#include <mkl_service.h>
+#endif
+
+#include <cstring>
+#include <string>
+
+using af::dim4;
+using arrayfire::getSparseArray;
+using arrayfire::common::getCacheDirectory;
+using arrayfire::common::getEnvVar;
+using arrayfire::common::half;
+using arrayfire::common::JIT_KERNEL_CACHE_DIRECTORY_ENV_NAME;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::devprop;
+using detail::evalFlag;
+using detail::getActiveDeviceId;
+using detail::getBackend;
+using detail::getDeviceCount;
+using detail::getDeviceInfo;
+using detail::init;
+using detail::intl;
+using detail::isDoubleSupported;
+using detail::isHalfSupported;
+using detail::schar;
+using detail::setDevice;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+af_err af_set_backend(const af_backend bknd) {
     try {
-        static bool first = true;
-        if(first) {
-            getInfo();
-            first = false;
+        if (bknd != getBackend() && bknd != AF_BACKEND_DEFAULT) {
+            return AF_ERR_ARG;
         }
-    } CATCHALL;
+    }
+    CATCHALL;
+
     return AF_SUCCESS;
 }
 
-af_err af_info()
-{
-    printf("%s", getInfo().c_str());
+af_err af_get_backend_count(unsigned* num_backends) {
+    *num_backends = 1;
     return AF_SUCCESS;
 }
 
-af_err af_device_info(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-{
+af_err af_get_available_backends(int* result) {
     try {
-        devprop(d_name, d_platform, d_toolkit, d_compute);
-    } CATCHALL;
+        *result = getBackend();
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_dbl_support(bool* available, const int device)
-{
+af_err af_get_backend_id(af_backend* result, const af_array in) {
     try {
-        *available = isDoubleSupported(device);
-    } CATCHALL;
+        if (in) {
+            const ArrayInfo& info = getInfo(in, false);
+            *result               = info.getBackendId();
+        } else {
+            return AF_ERR_ARG;
+        }
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_device_count(int *nDevices)
-{
+af_err af_get_device_id(int* device, const af_array in) {
     try {
-        *nDevices = getDeviceCount();
-    } CATCHALL;
-
+        if (in) {
+            const ArrayInfo& info = getInfo(in, false);
+            *device               = static_cast<int>(info.getDevId());
+        } else {
+            return AF_ERR_ARG;
+        }
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_device(int *device)
-{
-    try {
-        *device = getActiveDeviceId();
-    } CATCHALL;
+af_err af_get_active_backend(af_backend* result) {
+    *result = static_cast<af_backend>(getBackend());
     return AF_SUCCESS;
 }
 
-af_err af_set_device(const int device)
-{
+af_err af_init() {
     try {
-        ARG_ASSERT(0, device >= 0);
-        ARG_ASSERT(0, setDevice(device) >= 0);
-    } CATCHALL;
-
+        thread_local std::once_flag flag;
+        std::call_once(flag, []() {
+            init();
+#if defined(USE_MKL) && !defined(USE_STATIC_MKL)
+            int errCode = -1;
+            // AF_MKL_INTERFACE_SIZE is tested with regular if statements
+            // (rather than #if) so that a missing definition shows up as
+            // a compilation error when building with MKL.
+            if (AF_MKL_INTERFACE_SIZE == 4) {
+                errCode = mkl_set_interface_layer(MKL_INTERFACE_LP64);
+            } else if (AF_MKL_INTERFACE_SIZE == 8) {
+                errCode = mkl_set_interface_layer(MKL_INTERFACE_ILP64);
+            }
+            if (errCode == -1) {
+                AF_ERROR(
+                    "Intel MKL Interface layer was not specified prior to the "
+                    "call and the input parameter is incorrect.",
+                    AF_ERR_RUNTIME);
+            }
+            switch (AF_MKL_THREAD_LAYER) {
+                case 0:
+                    errCode = mkl_set_threading_layer(MKL_THREADING_SEQUENTIAL);
+                    break;
+                case 1:
+                    errCode = mkl_set_threading_layer(MKL_THREADING_GNU);
+                    break;
+                case 2:
+                    errCode = mkl_set_threading_layer(MKL_THREADING_INTEL);
+                    break;
+                case 3:
+                    errCode = mkl_set_threading_layer(MKL_THREADING_TBB);
+                    break;
+            }
+            if (errCode == -1) {
+                AF_ERROR(
+                    "Intel MKL Thread layer was not specified prior to the "
+                    "call and the input parameter is incorrect.",
+                    AF_ERR_RUNTIME);
+            }
+#endif
+        });
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_sync(const int device)
-{
+af_err af_info() {
     try {
-        int dev = device == -1 ? getActiveDeviceId() : device;
-        detail::sync(dev);
-    } CATCHALL;
+        printf("%s", getDeviceInfo().c_str());  // NOLINT
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_device_array(af_array *arr, const void *data,
-                       const unsigned ndims,
-                       const dim_t * const dims,
-                       const af_dtype type)
-{
+af_err af_info_string(char** str, const bool verbose) {
+    UNUSED(verbose);  // TODO(umar): Add something useful
     try {
-        AF_CHECK(af_init());
+        std::string infoStr = getDeviceInfo();
+        void* halloc_ptr    = nullptr;
+        af_alloc_host(&halloc_ptr, sizeof(char) * (infoStr.size() + 1));
+        memcpy(str, &halloc_ptr, sizeof(void*));
+
+        // A deep copy is needed here;
+        // returning infoStr.c_str() won't cut it
+        infoStr.copy(*str, infoStr.size());
+        (*str)[infoStr.size()] = '\0';
+    }
+    CATCHALL;
 
-        af_array res;
-        af::dim4 d((size_t)dims[0]);
-        for(unsigned i = 1; i < ndims; i++) {
-            d[i] = dims[i];
-        }
+    return AF_SUCCESS;
+}
 
-        switch (type) {
-        case f32: res = getHandle(createDeviceDataArray<float  >(d, data)); break;
-        case f64: res = getHandle(createDeviceDataArray<double >(d, data)); break;
-        case c32: res = getHandle(createDeviceDataArray<cfloat >(d, data)); break;
-        case c64: res = getHandle(createDeviceDataArray<cdouble>(d, data)); break;
-        case s32: res = getHandle(createDeviceDataArray<int    >(d, data)); break;
-        case u32: res = getHandle(createDeviceDataArray<uint   >(d, data)); break;
-        case u8 : res = getHandle(createDeviceDataArray<uchar  >(d, data)); break;
-        case b8 : res = getHandle(createDeviceDataArray<char   >(d, data)); break;
-        default: TYPE_ERROR(4, type);
-        }
+af_err af_device_info(char* d_name, char* d_platform, char* d_toolkit,
+                      char* d_compute) {
+    try {
+        devprop(d_name, d_platform, d_toolkit, d_compute);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-        std::swap(*arr, res);
-    } CATCHALL;
+af_err af_get_dbl_support(bool* available, const int device) {
+    try {
+        *available = isDoubleSupported(device);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
+af_err af_get_half_support(bool* available, const int device) {
+    try {
+        *available = isHalfSupported(device);
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_device_ptr(void **data, const af_array arr)
-{
+af_err af_get_device_count(int* nDevices) {
     try {
+        *nDevices = getDeviceCount();
+    }
+    CATCHALL;
 
-        // Make sure all kernels and memcopies are done before getting device pointer
-        detail::sync(getActiveDeviceId());
+    return AF_SUCCESS;
+}
 
-        af_dtype type = getInfo(arr).getType();
+af_err af_get_device(int* device) {
+    try {
+        *device = static_cast<int>(getActiveDeviceId());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-        switch (type) {
-            //FIXME: Perform copy if memory not continuous
-        case f32: *data = getDevicePtr(getArray<float  >(arr)); break;
-        case f64: *data = getDevicePtr(getArray<double >(arr)); break;
-        case c32: *data = getDevicePtr(getArray<cfloat >(arr)); break;
-        case c64: *data = getDevicePtr(getArray<cdouble>(arr)); break;
-        case s32: *data = getDevicePtr(getArray<int    >(arr)); break;
-        case u32: *data = getDevicePtr(getArray<uint   >(arr)); break;
-        case u8 : *data = getDevicePtr(getArray<uchar  >(arr)); break;
-        case b8 : *data = getDevicePtr(getArray<char   >(arr)); break;
-
-        default: TYPE_ERROR(4, type);
+af_err af_set_device(const int device) {
+    try {
+        ARG_ASSERT(0, device >= 0);
+        if (setDevice(device) < 0) {
+            int ndevices = getDeviceCount();
+            if (ndevices == 0) {
+                AF_ERROR(
+                    "No devices were found on this system. Ensure "
+                    "you have installed the device driver as well as the "
+                    "necessary runtime libraries for your platform.",
+                    AF_ERR_RUNTIME);
+            } else {
+                char buf[512];
+                char err_msg[] =
+                    "The device index of %d is out of range. Use a value "
+                    "between 0 and %d.";
+                snprintf(buf, 512, err_msg, device, ndevices - 1);  // NOLINT
+                AF_ERROR(buf, AF_ERR_ARG);
+            }
         }
-
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_alloc_device(void **ptr, const dim_t bytes)
-{
+af_err af_sync(const int device) {
     try {
-        AF_CHECK(af_init());
-        *ptr = (void *)memAlloc<char>(bytes);
-    } CATCHALL;
+        int dev = device == -1 ? static_cast<int>(getActiveDeviceId()) : device;
+        detail::sync(dev);
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_alloc_pinned(void **ptr, const dim_t bytes)
-{
-    try {
-        AF_CHECK(af_init());
-        *ptr = (void *)pinnedAlloc<char>(bytes);
-    } CATCHALL;
-    return AF_SUCCESS;
+template<typename T>
+static inline void eval(af_array arr) {
+    getArray<T>(arr).eval();
+}
+
+template<typename T>
+static inline void sparseEval(af_array arr) {
+    getSparseArray<T>(arr).eval();
 }
 
-af_err af_free_device(void *ptr)
-{
+af_err af_eval(af_array arr) {
     try {
-        memFree<char>((char *)ptr);
-    } CATCHALL;
+        const ArrayInfo& info = getInfo(arr, false);
+        af_dtype type         = info.getType();
+
+        if (info.isSparse()) {
+            switch (type) {
+                case f32: sparseEval<float>(arr); break;
+                case f64: sparseEval<double>(arr); break;
+                case c32: sparseEval<cfloat>(arr); break;
+                case c64: sparseEval<cdouble>(arr); break;
+                default: TYPE_ERROR(0, type);
+            }
+        } else {
+            switch (type) {
+                case f32: eval<float>(arr); break;
+                case f64: eval<double>(arr); break;
+                case c32: eval<cfloat>(arr); break;
+                case c64: eval<cdouble>(arr); break;
+                case s32: eval<int>(arr); break;
+                case u32: eval<uint>(arr); break;
+                case s8: eval<schar>(arr); break;
+                case u8: eval<uchar>(arr); break;
+                case b8: eval<char>(arr); break;
+                case s64: eval<intl>(arr); break;
+                case u64: eval<uintl>(arr); break;
+                case s16: eval<short>(arr); break;
+                case u16: eval<ushort>(arr); break;
+                case f16: eval<half>(arr); break;
+                default: TYPE_ERROR(0, type);
+            }
+        }
+    }
+    CATCHALL;
+
     return AF_SUCCESS;
 }
 
-af_err af_free_pinned(void *ptr)
-{
+template<typename T>
+static inline void evalMultiple(int num, af_array* arrayPtrs) {
+    Array<T> empty = createEmptyArray<T>(dim4());
+    std::vector<Array<T>*> arrays(num, &empty);
+
+    for (int i = 0; i < num; i++) {
+        arrays[i] = reinterpret_cast<Array<T>*>(arrayPtrs[i]);
+    }
+
+    evalMultiple<T>(arrays);
+}
+
+af_err af_eval_multiple(int num, af_array* arrays) {
     try {
-        pinnedFree<char>((char *)ptr);
-    } CATCHALL;
+        const ArrayInfo& info = getInfo(arrays[0]);
+        af_dtype type         = info.getType();
+        const dim4& dims      = info.dims();
+
+        for (int i = 1; i < num; i++) {
+            const ArrayInfo& currInfo = getInfo(arrays[i]);
+
+            // FIXME: This needs to be removed when new functionality is added
+            if (type != currInfo.getType()) {
+                AF_ERROR("All arrays must be of same type", AF_ERR_TYPE);
+            }
+
+            if (dims != currInfo.dims()) {
+                AF_ERROR("All arrays must be of same size", AF_ERR_SIZE);
+            }
+        }
+
+        switch (type) {
+            case f32: evalMultiple<float>(num, arrays); break;
+            case f64: evalMultiple<double>(num, arrays); break;
+            case c32: evalMultiple<cfloat>(num, arrays); break;
+            case c64: evalMultiple<cdouble>(num, arrays); break;
+            case s32: evalMultiple<int>(num, arrays); break;
+            case u32: evalMultiple<uint>(num, arrays); break;
+            case s8: evalMultiple<schar>(num, arrays); break;
+            case u8: evalMultiple<uchar>(num, arrays); break;
+            case b8: evalMultiple<char>(num, arrays); break;
+            case s64: evalMultiple<intl>(num, arrays); break;
+            case u64: evalMultiple<uintl>(num, arrays); break;
+            case s16: evalMultiple<short>(num, arrays); break;
+            case u16: evalMultiple<ushort>(num, arrays); break;
+            case f16: evalMultiple<half>(num, arrays); break;
+            default: TYPE_ERROR(0, type);
+        }
+    }
+    CATCHALL;
+
     return AF_SUCCESS;
 }
 
-af_err af_device_gc()
-{
+af_err af_set_manual_eval_flag(bool flag) {
     try {
-        garbageCollect();
-    } CATCHALL;
+        bool& backendFlag = evalFlag();
+        backendFlag       = !flag;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_device_mem_info(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers)
-{
+af_err af_get_manual_eval_flag(bool* flag) {
     try {
-        deviceMemoryInfo(alloc_bytes, alloc_buffers, lock_bytes, lock_buffers);
-    } CATCHALL;
+        bool backendFlag = evalFlag();
+        *flag            = !backendFlag;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_set_mem_step_size(const size_t step_bytes)
-{
-    detail::setMemStepSize(step_bytes);
+af_err af_get_kernel_cache_directory(size_t* length, char* path) {
+    try {
+        std::string& cache_path = getCacheDirectory();
+        if (path == nullptr) {
+            ARG_ASSERT(0, length != nullptr);
+            *length = cache_path.size();
+        } else {
+            size_t min_len = cache_path.size();
+            if (length) {
+                if (*length < cache_path.size()) {
+                    AF_ERROR("Length not sufficient to store the path",
+                             AF_ERR_SIZE);
+                }
+                min_len = std::min(*length, cache_path.size());
+            }
+            memcpy(path, cache_path.c_str(), min_len);
+        }
+    }
+    CATCHALL
     return AF_SUCCESS;
 }
 
-af_err af_get_mem_step_size(size_t *step_bytes)
-{
-    *step_bytes =  detail::getMemStepSize();
+af_err af_set_kernel_cache_directory(const char* path, int override_env) {
+    try {
+        ARG_ASSERT(0, path != nullptr);
+        if (override_env) {
+            getCacheDirectory() = std::string(path);
+        } else {
+            auto env_path = getEnvVar(JIT_KERNEL_CACHE_DIRECTORY_ENV_NAME);
+            if (env_path.empty()) { getCacheDirectory() = std::string(path); }
+        }
+    }
+    CATCHALL
     return AF_SUCCESS;
 }
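Review note on `af_get_kernel_cache_directory` above: the function follows a size-then-copy protocol, where a null `path` turns the call into a length query and a non-null `path` receives the characters. The sketch below models that protocol without ArrayFire. `cacheDirectory`, `queryCacheDirectory`, `readCacheDirectory` and the `/tmp/af_cache` value are all hypothetical names chosen for illustration, and the `bool` return stands in for the error codes of the real API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical stand-in for the backend's cache path; not ArrayFire code.
static std::string& cacheDirectory() {
    static std::string dir = "/tmp/af_cache";
    return dir;
}

// Sketch of the size-then-copy protocol. Returns false where the real
// function would report an error.
bool queryCacheDirectory(std::size_t* length, char* path) {
    const std::string& dir = cacheDirectory();
    if (path == nullptr) {
        if (length == nullptr) { return false; }  // nothing to write to
        *length = dir.size();                     // size query only
        return true;
    }
    if (length != nullptr && *length < dir.size()) {
        return false;  // caller's buffer is too small
    }
    // As the hunk above appears to do, copy the characters without a
    // trailing '\0'; terminating the buffer is left to the caller.
    std::memcpy(path, dir.c_str(), dir.size());
    return true;
}

// Caller side: query the size, allocate, then fetch the path.
std::string readCacheDirectory() {
    std::size_t len = 0;
    if (!queryCacheDirectory(&len, nullptr)) { return {}; }
    std::string buf(len, '\0');
    if (len > 0 && !queryCacheDirectory(&len, &buf[0])) { return {}; }
    return buf;
}
```

Using a `std::string` as the destination sidesteps the missing terminator, since the string tracks its own length.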
diff --git a/src/api/c/diff.cpp b/src/api/c/diff.cpp
index c3f8c5c4e9..f75d5c1ab1 100644
--- a/src/api/c/diff.cpp
+++ b/src/api/c/diff.cpp
@@ -7,88 +7,107 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/algorithm.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
 #include <diff.hpp>
+#include <handle.hpp>
+#include <af/algorithm.h>
+#include <af/defines.h>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::getArray;
+using arrayfire::getHandle;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array diff1(const af_array in, const int dim)
-{
+static inline af_array diff1(const af_array in, const int dim) {
     return getHandle(diff1<T>(getArray<T>(in), dim));
 }
 
 template<typename T>
-static inline af_array diff2(const af_array in, const int dim)
-{
+static inline af_array diff2(const af_array in, const int dim) {
     return getHandle(diff2<T>(getArray<T>(in), dim));
 }
 
-af_err af_diff1(af_array *out, const af_array in, const int dim)
-{
+af_err af_diff1(af_array* out, const af_array in, const int dim) {
     try {
-
         ARG_ASSERT(2, ((dim >= 0) && (dim < 4)));
 
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
 
         af::dim4 in_dims = info.dims();
+        if (in_dims[dim] < 2) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
         DIM_ASSERT(1, in_dims[dim] >= 2);
 
         af_array output;
 
-        switch(type) {
-            case f32: output = diff1<float  >(in,dim);  break;
-            case c32: output = diff1<cfloat >(in,dim);  break;
-            case f64: output = diff1<double >(in,dim);  break;
-            case c64: output = diff1<cdouble>(in,dim);  break;
-            case b8:  output = diff1<char   >(in,dim);  break;
-            case s32: output = diff1<int    >(in,dim);  break;
-            case u32: output = diff1<uint   >(in,dim);  break;
-            case u8:  output = diff1<uchar  >(in,dim);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = diff1<float>(in, dim); break;
+            case c32: output = diff1<cfloat>(in, dim); break;
+            case f64: output = diff1<double>(in, dim); break;
+            case c64: output = diff1<cdouble>(in, dim); break;
+            case b8: output = diff1<char>(in, dim); break;
+            case s32: output = diff1<int>(in, dim); break;
+            case u32: output = diff1<uint>(in, dim); break;
+            case s64: output = diff1<intl>(in, dim); break;
+            case u64: output = diff1<uintl>(in, dim); break;
+            case s16: output = diff1<short>(in, dim); break;
+            case u16: output = diff1<ushort>(in, dim); break;
+            case s8: output = diff1<schar>(in, dim); break;
+            case u8: output = diff1<uchar>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_diff2(af_array *out, const af_array in, const int dim)
-{
-
+af_err af_diff2(af_array* out, const af_array in, const int dim) {
     try {
-
         ARG_ASSERT(2, ((dim >= 0) && (dim < 4)));
 
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
 
         af::dim4 in_dims = info.dims();
+        if (in_dims[dim] < 3) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
         DIM_ASSERT(1, in_dims[dim] >= 3);
 
         af_array output;
 
-        switch(type) {
-            case f32: output = diff2<float  >(in,dim);  break;
-            case c32: output = diff2<cfloat >(in,dim);  break;
-            case f64: output = diff2<double >(in,dim);  break;
-            case c64: output = diff2<cdouble>(in,dim);  break;
-            case b8:  output = diff2<char   >(in,dim);  break;
-            case s32: output = diff2<int    >(in,dim);  break;
-            case u32: output = diff2<uint   >(in,dim);  break;
-            case u8:  output = diff2<uchar  >(in,dim);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = diff2<float>(in, dim); break;
+            case c32: output = diff2<cfloat>(in, dim); break;
+            case f64: output = diff2<double>(in, dim); break;
+            case c64: output = diff2<cdouble>(in, dim); break;
+            case b8: output = diff2<char>(in, dim); break;
+            case s32: output = diff2<int>(in, dim); break;
+            case u32: output = diff2<uint>(in, dim); break;
+            case s64: output = diff2<intl>(in, dim); break;
+            case u64: output = diff2<uintl>(in, dim); break;
+            case s16: output = diff2<short>(in, dim); break;
+            case u16: output = diff2<ushort>(in, dim); break;
+            case s8: output = diff2<schar>(in, dim); break;
+            case u8: output = diff2<uchar>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
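Review note on the `diff.cpp` change above: `af_diff1` and `af_diff2` now return an empty array instead of raising a dimension error when the chosen dimension has fewer than 2 or 3 elements. The one-dimensional sketch below shows the same behaviour on `std::vector`. `firstDiff` and `secondDiff` are hypothetical helpers, and the formulas are the usual forward-difference definitions, assumed here rather than taken from the backend kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// First-order difference: out[i] = in[i + 1] - in[i].
std::vector<int> firstDiff(const std::vector<int>& in) {
    if (in.size() < 2) { return {}; }  // too short: empty result, no error
    std::vector<int> out(in.size() - 1);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = in[i + 1] - in[i];
    }
    return out;
}

// Second-order difference: out[i] = in[i + 2] - 2 * in[i + 1] + in[i].
std::vector<int> secondDiff(const std::vector<int>& in) {
    if (in.size() < 3) { return {}; }  // too short: empty result, no error
    std::vector<int> out(in.size() - 2);
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = in[i + 2] - 2 * in[i + 1] + in[i];
    }
    return out;
}
```

Applying `firstDiff` twice gives the same values as `secondDiff`, which is a quick way to sanity-check either helper.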
diff --git a/src/api/c/dispatch.cpp b/src/api/c/dispatch.cpp
deleted file mode 100644
index c2d5d54053..0000000000
--- a/src/api/c/dispatch.cpp
+++ /dev/null
@@ -1,21 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "dispatch.hpp"
-
-unsigned nextpow2(unsigned x)
-{
-       x = x - 1;
-       x = x | (x >> 1);
-       x = x | (x >> 2);
-       x = x | (x >> 4);
-       x = x | (x >> 8);
-       x = x | (x >>16);
-       return x + 1;
-}
diff --git a/src/api/c/dispatch.hpp b/src/api/c/dispatch.hpp
deleted file mode 100644
index 19343280b8..0000000000
--- a/src/api/c/dispatch.hpp
+++ /dev/null
@@ -1,14 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-
-#define divup(a, b) (((a)+(b)-1)/(b))
-
-unsigned nextpow2(unsigned x);
diff --git a/src/api/c/dog.cpp b/src/api/c/dog.cpp
new file mode 100644
index 0000000000..848262daab
--- /dev/null
+++ b/src/api/c/dog.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <convolve.hpp>
+#include <handle.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include <af/vision.h>
+
+using af::dim4;
+using detail::arithOp;
+using detail::Array;
+using detail::convolve;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+template<typename T, typename accT>
+static af_array dog(const af_array& in, const int radius1, const int radius2) {
+    af_array g1, g2;
+    g1 = g2 = 0;
+    AF_CHECK(
+        af_gaussian_kernel(&g1, 2 * radius1 + 1, 2 * radius1 + 1, 0.0, 0.0));
+    AF_CHECK(
+        af_gaussian_kernel(&g2, 2 * radius2 + 1, 2 * radius2 + 1, 0.0, 0.0));
+
+    Array<accT> input = castArray<accT>(in);
+    dim4 iDims        = input.dims();
+
+    AF_BATCH_KIND bkind = iDims[2] > 1 ? AF_BATCH_LHS : AF_BATCH_NONE;
+
+    Array<accT> smth1 =
+        convolve<accT, accT>(input, castArray<accT>(g1), bkind, 2, false);
+    Array<accT> smth2 =
+        convolve<accT, accT>(input, castArray<accT>(g2), bkind, 2, false);
+    Array<accT> retVal = arithOp<accT, af_sub_t>(smth1, smth2, iDims);
+
+    AF_CHECK(af_release_array(g1));
+    AF_CHECK(af_release_array(g2));
+
+    return getHandle<accT>(retVal);
+}
+
+af_err af_dog(af_array* out, const af_array in, const int radius1,
+              const int radius2) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        dim4 inDims           = info.dims();
+        ARG_ASSERT(1, (inDims.ndims() >= 2));
+        ARG_ASSERT(1, (inDims.ndims() <= 3));
+
+        af_array output;
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32: output = dog<float, float>(in, radius1, radius2); break;
+            case f64: output = dog<double, double>(in, radius1, radius2); break;
+            case b8: output = dog<char, float>(in, radius1, radius2); break;
+            case s32: output = dog<int, float>(in, radius1, radius2); break;
+            case u32: output = dog<uint, float>(in, radius1, radius2); break;
+            case s16: output = dog<short, float>(in, radius1, radius2); break;
+            case u16: output = dog<ushort, float>(in, radius1, radius2); break;
+            case s8: output = dog<schar, float>(in, radius1, radius2); break;
+            case u8: output = dog<uchar, float>(in, radius1, radius2); break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/err_common.cpp b/src/api/c/err_common.cpp
deleted file mode 100644
index fdbe82f78d..0000000000
--- a/src/api/c/err_common.cpp
+++ /dev/null
@@ -1,244 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/exception.h>
-#include <err_common.hpp>
-#include <type_util.hpp>
-#include <string>
-#include <sstream>
-#include <cstring>
-#include <cstdio>
-#include <algorithm>
-
-#if defined(WITH_GRAPHICS)
-#include <graphics_common.hpp>
-#endif
-
-using std::string;
-using std::stringstream;
-
-AfError::AfError(const char * const funcName,
-                 const int line,
-                 const char * const message, af_err err)
-    : logic_error   (message),
-      functionName  (funcName),
-      lineNumber(line),
-      error(err)
-{}
-
-AfError::AfError(string funcName,
-                 const int line,
-                 string message, af_err err)
-    : logic_error   (message),
-      functionName  (funcName),
-      lineNumber(line),
-      error(err)
-{}
-
-const string&
-AfError::getFunctionName() const
-{
-    return functionName;
-}
-
-int
-AfError::getLine() const
-{
-    return lineNumber;
-}
-
-af_err
-AfError::getError() const
-{
-    return error;
-}
-
-AfError::~AfError() throw() {}
-
-TypeError::TypeError(const char * const  funcName,
-                     const int line,
-                     const int index, const af_dtype type)
-    : AfError (funcName, line, "Invalid data type", AF_ERR_TYPE),
-      argIndex(index),
-      errTypeName(getName(type))
-{}
-
-const string& TypeError::getTypeName() const
-{
-    return errTypeName;
-}
-
-int TypeError::getArgIndex() const
-{
-    return argIndex;
-}
-
-ArgumentError::ArgumentError(const char * const  funcName,
-                             const int line,
-                             const int index,
-                             const char * const  expectString)
-    : AfError(funcName, line, "Invalid argument", AF_ERR_ARG),
-      argIndex(index),
-      expected(expectString)
-{
-
-}
-
-const string& ArgumentError::getExpectedCondition() const
-{
-    return expected;
-}
-
-int ArgumentError::getArgIndex() const
-{
-    return argIndex;
-}
-
-
-SupportError::SupportError(const char * const funcName,
-                           const int line,
-                           const char * const back)
-    : AfError(funcName, line, "Unsupported Error", AF_ERR_NOT_SUPPORTED),
-      backend(back)
-{}
-
-const string& SupportError::getBackendName() const
-{
-    return backend;
-}
-
-DimensionError::DimensionError(const char * const  funcName,
-                             const int line,
-                             const int index,
-                             const char * const  expectString)
-    : AfError(funcName, line, "Invalid size", AF_ERR_SIZE),
-      argIndex(index),
-      expected(expectString)
-{
-
-}
-
-const string& DimensionError::getExpectedCondition() const
-{
-    return expected;
-}
-
-int DimensionError::getArgIndex() const
-{
-    return argIndex;
-}
-
-static const int MAX_ERR_SIZE = 1024;
-static std::string global_err_string;
-
-void
-print_error(const stringstream &msg)
-{
-    const char* perr = getenv("AF_PRINT_ERRORS");
-    if(perr != nullptr) {
-        if(std::strncmp(perr, "0", 1) != 0)
-            fprintf(stderr, "%s\n", msg.str().c_str());
-    }
-    global_err_string = msg.str();
-}
-
-void af_get_last_error(char **str, dim_t *len)
-{
-    *len = std::min(MAX_ERR_SIZE, (int)global_err_string.size());
-
-    if (*len == 0) {
-        *str = NULL;
-    }
-
-    *str = new char[*len + 1];
-    memcpy(*str, global_err_string.c_str(), *len * sizeof(char));
-
-    (*str)[*len] = '\0';
-    global_err_string = std::string("");
-}
-
-const char *af_err_to_string(const af_err err)
-{
-    switch (err) {
-    case AF_SUCCESS:            return "Success";
-    case AF_ERR_INTERNAL:       return "Internal error";
-    case AF_ERR_NO_MEM:         return "Device out of memory";
-    case AF_ERR_DRIVER:         return "Driver not available or incompatible";
-    case AF_ERR_RUNTIME:        return "Runtime error ";
-    case AF_ERR_INVALID_ARRAY:  return "Invalid array";
-    case AF_ERR_ARG:            return "Invalid input argument";
-    case AF_ERR_SIZE:           return "Invalid input size";
-    case AF_ERR_DIFF_TYPE:      return "Input types are not the same";
-    case AF_ERR_NOT_SUPPORTED:  return "Function not supported";
-    case AF_ERR_NOT_CONFIGURED: return "Function not configured to build";
-    case AF_ERR_TYPE:           return "Function does not support this data type";
-    case AF_ERR_NO_DBL:         return "Double precision not supported for this device";
-    case AF_ERR_UNKNOWN:
-    default:
-        return "Unknown error";
-    }
-}
-
-af_err processException()
-{
-    stringstream    ss;
-    af_err          err= AF_ERR_INTERNAL;
-
-    try {
-        throw;
-    } catch (const DimensionError &ex) {
-        ss << "In function " << ex.getFunctionName()
-           << "(" << ex.getLine() << "):\n"
-           << "Invalid dimension for argument " << ex.getArgIndex() << "\n"
-           << "Expected: " << ex.getExpectedCondition() << "\n";
-
-        print_error(ss);
-        err = AF_ERR_SIZE;
-    } catch (const ArgumentError &ex) {
-        ss << "In function " << ex.getFunctionName()
-           << "(" << ex.getLine() << "):\n"
-           << "Invalid argument at index " << ex.getArgIndex() << "\n"
-           << "Expected: " << ex.getExpectedCondition() << "\n";
-
-        print_error(ss);
-        err = AF_ERR_ARG;
-    } catch (const SupportError &ex) {
-        ss << ex.getFunctionName()
-           << " not supported for " << ex.getBackendName()
-           << " backend\n";
-
-        print_error(ss);
-        err = AF_ERR_NOT_SUPPORTED;
-    } catch (const TypeError &ex) {
-        ss << "In function " << ex.getFunctionName()
-           << "(" << ex.getLine() << "):\n"
-           << "Invalid type for argument " << ex.getArgIndex() << "\n";
-
-        print_error(ss);
-        err = AF_ERR_TYPE;
-    } catch (const AfError &ex) {
-        ss << "Error in " << ex.getFunctionName()
-           << "(" << ex.getLine() << "):\n"
-           << ex.what() << "\n";
-
-        print_error(ss);
-        err = ex.getError();
-#if defined(WITH_GRAPHICS)
-    } catch (const fg::Error &ex) {
-        ss << ex << "\n";
-        print_error(ss);
-        err = AF_ERR_INTERNAL;
-#endif
-    } catch (...) {
-        print_error(ss);
-        err = AF_ERR_UNKNOWN;
-    }
-
-    return err;
-}
diff --git a/src/api/c/err_common.hpp b/src/api/c/err_common.hpp
deleted file mode 100644
index 7c4a6f23cd..0000000000
--- a/src/api/c/err_common.hpp
+++ /dev/null
@@ -1,167 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-
-#include <stdexcept>
-#include <string>
-#include <cassert>
-#include <af/defines.h>
-#include <vector>
-
-class AfError   : public std::logic_error
-{
-    std::string functionName;
-    int lineNumber;
-    af_err error;
-    AfError();
-
-public:
-
-    AfError(const char * const funcName,
-            const int line,
-            const char * const message, af_err err);
-
-    AfError(std::string funcName,
-            const int line,
-            std::string message, af_err err);
-
-    const std::string&
-    getFunctionName() const;
-
-    int getLine() const;
-
-    af_err getError() const;
-
-    virtual ~AfError() throw();
-};
-
-// TODO: Perhaps add a way to return supported types
-class TypeError : public AfError
-{
-    int argIndex;
-    std::string errTypeName;
-    TypeError();
-
-public:
-
-    TypeError(const char * const  funcName,
-              const int line,
-              const int index,
-              const af_dtype type);
-
-    const std::string&
-    getTypeName() const;
-
-    int getArgIndex() const;
-
-    ~TypeError() throw() {}
-};
-
-class ArgumentError : public AfError
-{
-    int argIndex;
-    std::string    expected;
-    ArgumentError();
-
-public:
-    ArgumentError(const char * const funcName,
-                   const int line,
-                   const int index,
-                   const char * const expectString);
-
-    const std::string&
-    getExpectedCondition() const;
-
-    int getArgIndex() const;
-
-    ~ArgumentError() throw(){}
-};
-
-class SupportError  :   public AfError
-{
-    std::string backend;
-    SupportError();
-public:
-    SupportError(const char * const funcName,
-                 const int line,
-                 const char * const back);
-    ~SupportError()throw() {}
-    const std::string&
-    getBackendName() const;
-};
-
-class DimensionError : public AfError
-{
-    int argIndex;
-    std::string    expected;
-    DimensionError();
-
-public:
-    DimensionError(const char * const funcName,
-                   const int line,
-                   const int index,
-                   const char * const expectString);
-
-    const std::string&
-    getExpectedCondition() const;
-
-    int getArgIndex() const;
-
-    ~DimensionError() throw(){}
-};
-
-af_err processException();
-
-#define DIM_ASSERT(INDEX, COND) do {                    \
-        if((COND) == false) {                           \
-            throw DimensionError(__FILE__, __LINE__,    \
-                                 INDEX, #COND);         \
-        }                                               \
-    } while(0)
-
-#define ARG_ASSERT(INDEX, COND) do {                \
-        if((COND) == false) {                       \
-            throw ArgumentError(__FILE__, __LINE__, \
-                                INDEX, #COND);      \
-        }                                           \
-    } while(0)
-
-#define TYPE_ERROR(INDEX, type) do {            \
-        throw TypeError(__FILE__, __LINE__,     \
-                        INDEX, type);           \
-    } while(0)                                  \
-
-
-#define AF_ERROR(MSG, ERR_TYPE) do {            \
-        throw AfError(__FILE__, __LINE__,       \
-                      MSG, ERR_TYPE);           \
-    } while(0)
-
-#define TYPE_ASSERT(COND) do {                  \
-        if ((COND) == false) {                  \
-            AF_ERROR("Type mismatch inputs",    \
-                     AF_ERR_DIFF_TYPE);         \
-        }                                       \
-    } while(0)
-
-#define AF_ASSERT(COND, MESSAGE)                \
-    assert(MESSAGE && COND)
-
-#define CATCHALL                                \
-    catch(...) {                                \
-        return processException();              \
-    }
-
-#define AF_CHECK(fn) do {                       \
-        af_err __err = fn;                      \
-        if (__err == AF_SUCCESS) break;         \
-        throw AfError(__FILE__, __LINE__,       \
-                      "\n", __err);             \
-    } while(0)
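
Note on the removed `CATCHALL` / `processException` pair: a catch-all handler calls one shared function, which rethrows the in-flight exception with a bare `throw;` and maps it to an error code. The sketch below illustrates that idiom only. Every name in it (`sketch_err`, `SketchArgError`, `sketchProcessException`, `sketch_store_nonnegative`) is hypothetical and not part of ArrayFire, and the error codes merely stand in for `af_err`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical error codes standing in for af_err (illustrative only).
enum sketch_err {
    SKETCH_SUCCESS     = 0,
    SKETCH_ERR_ARG     = 1,
    SKETCH_ERR_UNKNOWN = 2
};

// Hypothetical exception type playing the role of ArgumentError.
class SketchArgError : public std::logic_error {
    int argIndex_;

   public:
    SketchArgError(const std::string &expected, int index)
        : std::logic_error(expected), argIndex_(index) {}
    int argIndex() const { return argIndex_; }
};

// Must only be called while an exception is being handled (inside a catch
// block). The bare `throw;` rethrows that exception so it can be classified
// in one central place.
inline sketch_err sketchProcessException() {
    try {
        throw;
    } catch (const SketchArgError &) {
        return SKETCH_ERR_ARG;
    } catch (...) {
        return SKETCH_ERR_UNKNOWN;
    }
}

// C-style entry point: no exception escapes; failures become error codes.
inline sketch_err sketch_store_nonnegative(double *out, double in) {
    assert(out != nullptr);
    try {
        if (in < 0.0) { throw SketchArgError("in >= 0", 1); }
        *out = in;
    } catch (...) {
        return sketchProcessException();
    }
    return SKETCH_SUCCESS;
}
```

Keeping the classification in one function means each C entry point needs only a single catch-all clause, which is what the `CATCHALL` macro expands to.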
diff --git a/src/api/c/error.cpp b/src/api/c/error.cpp
new file mode 100644
index 0000000000..91a84b3ff3
--- /dev/null
+++ b/src/api/c/error.cpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <af/device.h>
+#include <af/exception.h>
+#include <af/util.h>
+
+#include <algorithm>
+#include <cstring>
+#include <string>
+
+void af_get_last_error(char **str, dim_t *len) {
+    std::string &global_error_string = get_global_error_string();
+    dim_t slen =
+        std::min(MAX_ERR_SIZE, static_cast<int>(global_error_string.size()));
+
+    if (len && slen == 0) {
+        *len = 0;
+        *str = NULL;
+        return;
+    }
+
+    void *halloc_ptr = nullptr;
+    af_alloc_host(&halloc_ptr, sizeof(char) * (slen + 1));
+    memcpy(str, &halloc_ptr, sizeof(void *));
+    global_error_string.copy(*str, slen);
+
+    (*str)[slen]        = '\0';
+    global_error_string = std::string("");
+
+    if (len) { *len = slen; }
+}
+
+af_err af_set_enable_stacktrace(int is_enabled) {
+    arrayfire::common::is_stacktrace_enabled() = is_enabled;
+
+    return AF_SUCCESS;
+}
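
Note on `af_get_last_error` above: it copies a length-capped error message into a freshly allocated, NUL-terminated buffer, clears the stored message, and reports the length only when `len` is non-null. The sketch below shows the same hand-off pattern under stated assumptions: it uses `std::malloc` in place of `af_alloc_host`, a function-local static in place of `get_global_error_string()`, and a hypothetical 1024-byte cap. All `sketch_` names are illustrative, not ArrayFire API. Unlike the code above, it returns early on an empty message even when `len` is null.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical cap on the reported message length (assumption).
static const std::size_t SKETCH_MAX_ERR_SIZE = 1024;

// Hypothetical stand-in for the library's stored error string.
inline std::string &sketch_last_error() {
    static std::string message;
    return message;
}

// Hands the stored message to the caller as a malloc'ed C string and clears
// it. The caller owns the buffer and releases it with std::free. `len` is
// optional and may be null.
inline void sketch_get_last_error(char **str, std::size_t *len) {
    assert(str != nullptr);
    std::string &msg       = sketch_last_error();
    const std::size_t slen = std::min(SKETCH_MAX_ERR_SIZE, msg.size());

    if (slen == 0) {
        *str = nullptr;
        if (len) { *len = 0; }
        return;
    }

    char *buf = static_cast<char *>(std::malloc(slen + 1));
    if (buf == nullptr) {  // allocation failed: report "no message"
        *str = nullptr;
        if (len) { *len = 0; }
        return;
    }

    msg.copy(buf, slen);  // copies at most slen characters, no terminator
    buf[slen] = '\0';
    msg.clear();

    *str = buf;
    if (len) { *len = slen; }
}
```

Allocating through the library's own allocator, as the real function does, lets callers release the buffer with the matching library free routine instead of guessing between `delete[]` and `free`.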
diff --git a/src/api/c/events.cpp b/src/api/c/events.cpp
new file mode 100644
index 0000000000..112373672d
--- /dev/null
+++ b/src/api/c/events.cpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <events.hpp>
+
+#include <Event.hpp>
+#include <common/err_common.hpp>
+#include <af/device.h>
+#include <af/event.h>
+
+using detail::block;
+using detail::createEvent;
+using detail::enqueueWaitOnActiveQueue;
+using detail::Event;
+using detail::markEventOnActiveQueue;
+
+Event &getEvent(af_event handle) {
+    Event &event = *static_cast<Event *>(handle);
+    return event;
+}
+
+af_event getHandle(Event &event) { return static_cast<af_event>(&event); }
+
+af_err af_create_event(af_event *handle) {
+    try {
+        AF_CHECK(af_init());
+        *handle = createEvent();
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_delete_event(af_event handle) {
+    try {
+        delete &getEvent(handle);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_mark_event(const af_event handle) {
+    try {
+        markEventOnActiveQueue(handle);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_enqueue_wait_event(const af_event handle) {
+    try {
+        enqueueWaitOnActiveQueue(handle);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_block_event(const af_event handle) {
+    try {
+        block(handle);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/events.hpp b/src/api/c/events.hpp
new file mode 100644
index 0000000000..488cb204e4
--- /dev/null
+++ b/src/api/c/events.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Event.hpp>
+#include <backend.hpp>
+#include <af/event.h>
+
+af_event getHandle(detail::Event& event);
+
+detail::Event& getEvent(af_event eventHandle);
diff --git a/src/api/c/exampleFunction.cpp b/src/api/c/exampleFunction.cpp
index f8fe157c04..a58336f90c 100644
--- a/src/api/c/exampleFunction.cpp
+++ b/src/api/c/exampleFunction.cpp
@@ -7,94 +7,96 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>          // Needed if you use dim4 class
+#include <af/dim4.hpp>  // Needed if you use dim4 class
 
-#include <af/util.h>            // Include header where function is delcared
+#include <af/util.h>  // Include header where function is declared
 
-#include <af/defines.h>         // Include this header to access any enums,
-                                // #defines or constants declared
+#include <af/defines.h>  // Include this header to access any enums,
+                         // #defines or constants declared
 
-#include <err_common.hpp>       // Header with error checking functions & macros
+#include <common/err_common.hpp>  // Header with error checking functions & macros
 
-#include <backend.hpp>          // This header make sures appropriate backend
-                                // related namespace is being used
+#include <backend.hpp>  // This header makes sure the appropriate backend
+                        // related namespace is being used
 
-#include <Array.hpp>            // Header in which backend specific Array class
-                                // is defined
+#include <Array.hpp>  // Header in which backend specific Array class
+                      // is defined
 
-#include <handle.hpp>           // Header that helps you retrieve backend specific
-                                // Arrays based on the af_array
-                                // (typedef in defines.h) handle.
+#include <handle.hpp>  // Header that helps you retrieve backend specific
+                       // Arrays based on the af_array
+                       // (typedef in defines.h) handle.
 
 #include <exampleFunction.hpp>  // This is the backend specific header
                                 // where your new function declaration
                                 // is written
 
-using namespace detail;         // detail is an alias to appropriate backend
-                                // defined in backend.hpp. You don't need to
-                                // change this
+// NOLINTNEXTLINE(google-build-using-namespace)
+using namespace detail;  // detail is an alias to appropriate backend
+                         // defined in backend.hpp. You don't need to
+                         // change this
 
 template<typename T>
-af_array example(const af_array& in, const af_someenum_t& param)
-{
+af_array example(const af_array& a, const af_array& b,
+                 const af_someenum_t& param) {
     // getArray<T> function is defined in handle.hpp
     // and it returns backend specific Array, namely one of the following
     //      * cpu::Array<T>
-    //      * cuda::Array<T>
+    //      * arrayfire::cuda::Array<T>
     //      * opencl::Array<T>
     // getHandle<T> function is defined in handle.hpp takes one of the
     // above backend specific detail::Array<T> and returns the
     // universal array handle af_array
-    return getHandle<T>( exampleFunction(getArray<T>(in), param) );
+    return getHandle<T>(exampleFunction(getArray<T>(a), getArray<T>(b), param));
 }
 
-af_err af_example_function(af_array* out,
-                            const af_array in,
-                            const af_someenum_t param)
-{
+af_err af_example_function(af_array* out, const af_array a,
+                           const af_someenum_t param) {
     try {
         af_array output = 0;
-        ArrayInfo info = getInfo(in);       // ArrayInfo is the base class which
-                                            // each backend specific Array inherits
-                                            // This class stores the basic array meta-data
-                                            // such as type of data, dimensions,
-                                            // offsets and strides. This class is declared
-                                            // in src/backend/ArrayInfo.hpp
-
+        const ArrayInfo& info =
+            getInfo(a);  // ArrayInfo is the base class which
+                         // each backend specific Array inherits
+                         // This class stores the basic array meta-data
+                         // such as type of data, dimensions,
+                         // offsets and strides. This class is declared
+                         // in src/backend/common/ArrayInfo.hpp
         af::dim4 dims = info.dims();
 
-
-        ARG_ASSERT(2, (dims.ndims()>=0 && dims.ndims()<=3));
-                                            // defined in err_common.hpp
-                                            // there are other useful Macros
-                                            // for different purposes, feel free
-                                            // to look at the header
+        ARG_ASSERT(2, (dims.ndims() >= 0 && dims.ndims() <= 3));
+        // defined in err_common.hpp
+        // there are other useful Macros
+        // for different purposes, feel free
+        // to look at the header
 
         af_dtype type = info.getType();
 
-        switch(type) {                      // Based on the data type, call backend specific
-                                            // implementation
-            case f64: output = example<double >(in, param); break;
-            case f32: output = example<float  >(in, param); break;
-            case s32: output = example<int    >(in, param); break;
-            case u32: output = example<uint   >(in, param); break;
-            case  u8: output = example<uchar  >(in, param); break;
-            case  b8: output = example<char   >(in, param); break;
-            case c32: output = example<cfloat >(in, param); break;
-            case c64: output = example<cdouble>(in, param); break;
-            default : TYPE_ERROR(1, type);  // Another helpful macro from err_common.hpp
-                                            // that helps throw type based error messages
+        switch (type) {  // Based on the data type, call backend specific
+                         // implementation
+            case f64: output = example<double>(a, a, param); break;
+            case f32: output = example<float>(a, a, param); break;
+            case s32: output = example<int>(a, a, param); break;
+            case u32: output = example<uint>(a, a, param); break;
+            case s8: output = example<schar>(a, a, param); break;
+            case u8: output = example<uchar>(a, a, param); break;
+            case b8: output = example<char>(a, a, param); break;
+            case c32: output = example<cfloat>(a, a, param); break;
+            case c64: output = example<cdouble>(a, a, param); break;
+            default:
+                TYPE_ERROR(1,
+                           type);  // Another helpful macro from err_common.hpp
+                                   // that helps throw type based error messages
         }
 
-        std::swap(*out, output);            // if the function has returned successfully,
-                                            // swap the temporary 'output' variable with
-                                            // '*out'
+        std::swap(*out, output);  // if the function has returned successfully,
+                                  // swap the temporary 'output' variable with
+                                  // '*out'
     }
-    CATCHALL;                               // All throws/exceptions from any internal
-                                            // implementations are caught by this CATCHALL
-                                            // macro and handled appropriately.
-
-    return AF_SUCCESS;                      // In case of successfull completion, return AF_SUCCESS
-                                            // There are set of error codes defined in defines.h
-                                            // which you are used by CATCHALL to return approriate code
+    CATCHALL;  // All throws/exceptions from any internal
+               // implementations are caught by this CATCHALL
+               // macro and handled appropriately.
+
+    return AF_SUCCESS;  // On successful completion, return AF_SUCCESS.
+                        // A set of error codes is defined in defines.h;
+                        // CATCHALL uses them to return the appropriate
+                        // code
 }
diff --git a/src/api/c/fast.cpp b/src/api/c/fast.cpp
index e28f590768..08834ce4f4 100644
--- a/src/api/c/fast.cpp
+++ b/src/api/c/fast.cpp
@@ -7,35 +7,40 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <fast.hpp>
+#include <features.hpp>
+#include <handle.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/features.h>
 #include <af/vision.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <features.hpp>
-#include <fast.hpp>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
 
 template<typename T>
 static af_features fast(af_array const &in, const float thr,
                         const unsigned arc_length, const bool non_max,
-                        const float feature_ratio, const unsigned edge)
-{
-    Array<float> x = createEmptyArray<float>(dim4());
-    Array<float> y = createEmptyArray<float>(dim4());
+                        const float feature_ratio, const unsigned edge) {
+    Array<float> x     = createEmptyArray<float>(dim4());
+    Array<float> y     = createEmptyArray<float>(dim4());
     Array<float> score = createEmptyArray<float>(dim4());
 
     af_features_t feat;
-    feat.n = fast<T>(x, y, score,
-                     getArray<T>(in), thr,
-                     arc_length, non_max, feature_ratio, edge);
+    feat.n = fast<T>(x, y, score, getArray<T>(in), thr, arc_length, non_max,
+                     feature_ratio, edge);
 
     Array<float> orientation = createValueArray<float>(feat.n, 0.0);
-    Array<float> size = createValueArray<float>(feat.n, 1.0);
+    Array<float> size        = createValueArray<float>(feat.n, 1.0);
 
     feat.x           = getHandle(x);
     feat.y           = getHandle(y);
@@ -46,32 +51,61 @@ static af_features fast(af_array const &in, const float thr,
     return getFeaturesHandle(feat);
 }
 
-
 af_err af_fast(af_features *out, const af_array in, const float thr,
                const unsigned arc_length, const bool non_max,
-               const float feature_ratio, const unsigned edge)
-{
+               const float feature_ratio, const unsigned edge) {
     try {
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
 
-        ARG_ASSERT(2, (dims[0] >= (dim_t)(2*edge+1) || dims[1] >= (dim_t)(2*edge+1)));
+        ARG_ASSERT(2, (dims[0] >= (dim_t)(2 * edge + 1) ||
+                       dims[1] >= (dim_t)(2 * edge + 1)));
         ARG_ASSERT(3, thr > 0.0f);
         ARG_ASSERT(4, (arc_length >= 9 && arc_length <= 16));
         ARG_ASSERT(6, (feature_ratio > 0.0f && feature_ratio <= 1.0f));
 
         dim_t in_ndims = dims.ndims();
-        DIM_ASSERT(1, (in_ndims <= 3 && in_ndims >= 2));
+        DIM_ASSERT(1, (in_ndims == 2));
 
-        af_dtype type  = info.getType();
-        switch(type) {
-            case f32: *out = fast<float >(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            case f64: *out = fast<double>(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            case b8 : *out = fast<char  >(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            case s32: *out = fast<int   >(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            case u32: *out = fast<uint  >(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            case u8 : *out = fast<uchar >(in, thr, arc_length, non_max, feature_ratio, edge); break;
-            default : TYPE_ERROR(1, type);
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                *out = fast<float>(in, thr, arc_length, non_max, feature_ratio,
+                                   edge);
+                break;
+            case f64:
+                *out = fast<double>(in, thr, arc_length, non_max, feature_ratio,
+                                    edge);
+                break;
+            case b8:
+                *out = fast<char>(in, thr, arc_length, non_max, feature_ratio,
+                                  edge);
+                break;
+            case s32:
+                *out = fast<int>(in, thr, arc_length, non_max, feature_ratio,
+                                 edge);
+                break;
+            case u32:
+                *out = fast<uint>(in, thr, arc_length, non_max, feature_ratio,
+                                  edge);
+                break;
+            case s16:
+                *out = fast<short>(in, thr, arc_length, non_max, feature_ratio,
+                                   edge);
+                break;
+            case u16:
+                *out = fast<ushort>(in, thr, arc_length, non_max, feature_ratio,
+                                    edge);
+                break;
+            case s8:
+                *out = fast<schar>(in, thr, arc_length, non_max, feature_ratio,
+                                   edge);
+                break;
+            case u8:
+                *out = fast<uchar>(in, thr, arc_length, non_max, feature_ratio,
+                                   edge);
+                break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
diff --git a/src/api/c/features.cpp b/src/api/c/features.cpp
index ffa7fe0c75..06b048e830 100644
--- a/src/api/c/features.cpp
+++ b/src/api/c/features.cpp
@@ -7,39 +7,37 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
-#include <af/array.h>
 #include <features.hpp>
 #include <handle.hpp>
+#include <af/array.h>
+#include <af/features.h>
 
-af_err af_release_features(af_features featHandle)
-{
-
+af_err af_release_features(af_features featHandle) {
     try {
-        af_features_t feat = *(af_features_t *)featHandle;
+        af_features_t feat = *static_cast<af_features_t *>(featHandle);
         if (feat.n > 0) {
-            if (feat.x != 0)           AF_CHECK(af_release_array(feat.x));
-            if (feat.y != 0)           AF_CHECK(af_release_array(feat.y));
-            if (feat.score != 0)       AF_CHECK(af_release_array(feat.score));
-            if (feat.orientation != 0) AF_CHECK(af_release_array(feat.orientation));
-            if (feat.size != 0)        AF_CHECK(af_release_array(feat.size));
+            if (feat.x != 0) { AF_CHECK(af_release_array(feat.x)); }
+            if (feat.y != 0) { AF_CHECK(af_release_array(feat.y)); }
+            if (feat.score != 0) { AF_CHECK(af_release_array(feat.score)); }
+            if (feat.orientation != 0) {
+                AF_CHECK(af_release_array(feat.orientation));
+            }
+            if (feat.size != 0) { AF_CHECK(af_release_array(feat.size)); }
             feat.n = 0;
         }
-        delete (af_features_t *)featHandle;
+        delete static_cast<af_features_t *>(featHandle);
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_features getFeaturesHandle(const af_features_t feat)
-{
-    af_features_t *featHandle = new af_features_t;
-    *featHandle = feat;
-    return (af_features)featHandle;
+af_features getFeaturesHandle(const af_features_t feat) {
+    auto *featHandle = new af_features_t;
+    *featHandle      = feat;
+    return static_cast<af_features>(featHandle);
 }
 
-af_err af_create_features(af_features *featHandle, dim_t num)
-{
+af_err af_create_features(af_features *featHandle, dim_t num) {
     try {
         af_features_t feat;
         feat.n = num;
@@ -54,20 +52,19 @@ af_err af_create_features(af_features *featHandle, dim_t num)
         }
 
         *featHandle = getFeaturesHandle(feat);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_features_t getFeatures(const af_features featHandle)
-{
-    return *(af_features_t *)featHandle;
+af_features_t getFeatures(const af_features featHandle) {
+    return *static_cast<af_features_t *>(featHandle);
 }
 
-af_err af_retain_features(af_features *outHandle, const af_features featHandle)
-{
+af_err af_retain_features(af_features *outHandle,
+                          const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
         af_features_t out;
 
@@ -79,68 +76,62 @@ af_err af_retain_features(af_features *outHandle, const af_features featHandle)
         AF_CHECK(af_retain_array(&out.size, feat.size));
 
         *outHandle = getFeaturesHandle(out);
-
-    } CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_num(dim_t *num, const af_features featHandle)
-{
+af_err af_get_features_num(dim_t *num, const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *num = feat.n;
-
-    } CATCHALL;
+        *num               = feat.n;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_xpos(af_array *out, const af_features featHandle)
-{
+af_err af_get_features_xpos(af_array *out, const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *out = feat.x;
-    } CATCHALL;
+        *out               = feat.x;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_ypos(af_array *out, const af_features featHandle)
-{
+af_err af_get_features_ypos(af_array *out, const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *out = feat.y;
-    } CATCHALL;
+        *out               = feat.y;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_score(af_array *out, const af_features featHandle)
-{
+af_err af_get_features_score(af_array *out, const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *out = feat.score;
-    } CATCHALL;
+        *out               = feat.score;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_orientation(af_array *out, const af_features featHandle)
-{
+af_err af_get_features_orientation(af_array *out,
+                                   const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *out = feat.orientation;
-    } CATCHALL;
+        *out               = feat.orientation;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_get_features_size(af_array *out, const af_features featHandle)
-{
+af_err af_get_features_size(af_array *out, const af_features featHandle) {
     try {
-
         af_features_t feat = getFeatures(featHandle);
-        *out = feat.size;
-    } CATCHALL;
+        *out               = feat.size;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
diff --git a/src/api/c/features.hpp b/src/api/c/features.hpp
index 5bb891540d..9cd977576a 100644
--- a/src/api/c/features.hpp
+++ b/src/api/c/features.hpp
@@ -1,4 +1,15 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 #pragma once
+#include <af/array.h>
+#include <af/features.h>
+#include <cstddef>
 
 typedef struct {
     size_t n;
diff --git a/src/api/c/fft.cpp b/src/api/c/fft.cpp
index e7f361f333..ec3586f839 100644
--- a/src/api/c/fft.cpp
+++ b/src/api/c/fft.cpp
@@ -7,109 +7,292 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <fft_common.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/signal.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <fft.hpp>
+
+#include <type_traits>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::multiply_inplace;
+using std::conditional;
+using std::is_same;
 
-template<typename inType, typename outType, int rank, bool isR2C>
-static af_array fft(const af_array in, const double norm_factor, const dim_t npad, const dim_t  * const pad)
-{
-    return getHandle(fft<inType, outType, rank, isR2C>(getArray<inType>(in), norm_factor, npad, pad));
+void computePaddedDims(dim4 &pdims, const dim4 &idims, const dim_t npad,
+                       dim_t const *const pad) {
+    for (int i = 0; i < 4; i++) {
+        pdims[i] = (i < static_cast<int>(npad)) ? pad[i] : idims[i];
+    }
 }
 
-template<typename T, int rank>
-static af_array ifft(const af_array in, const double norm_factor, const dim_t npad, const dim_t * const pad)
-{
-    return getHandle(ifft<T, rank>(getArray<T>(in), norm_factor, npad, pad));
+template<typename InType>
+af_array fft(const af_array in, const double norm_factor, const dim_t npad,
+             const dim_t *const pad, int rank, bool direction) {
+    using OutType = typename conditional<is_same<InType, double>::value ||
+                                             is_same<InType, cdouble>::value,
+                                         cdouble, cfloat>::type;
+    return getHandle(fft<InType, OutType>(getArray<InType>(in), norm_factor,
+                                          npad, pad, rank, direction));
 }
 
-template<int rank>
-static af_err fft(af_array *out, const af_array in, const double norm_factor, const dim_t npad, const dim_t * const pad)
-{
+af_err fft(af_array *out, const af_array in, const double norm_factor,
+           const dim_t npad, const dim_t *const pad, const int rank,
+           const bool direction) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type  = info.getType();
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        const dim4 &dims      = info.dims();
+
+        if (dims.ndims() == 0) { return af_retain_array(out, in); }
 
-        DIM_ASSERT(1, (dims.ndims()>=rank));
+        DIM_ASSERT(1, (dims.ndims() >= rank));
 
         af_array output;
-        switch(type) {
-            case c32: output = fft<cfloat , cfloat , rank, false>(in, norm_factor, npad, pad); break;
-            case c64: output = fft<cdouble, cdouble, rank, false>(in, norm_factor, npad, pad); break;
-            case f32: output = fft<float , cfloat  , rank, true >(in, norm_factor, npad, pad); break;
-            case f64: output = fft<double, cdouble , rank, true >(in, norm_factor, npad, pad); break;
+        switch (type) {
+            case c32:
+                output =
+                    fft<cfloat>(in, norm_factor, npad, pad, rank, direction);
+                break;
+            case c64:
+                output =
+                    fft<cdouble>(in, norm_factor, npad, pad, rank, direction);
+                break;
+            case f32:
+                output =
+                    fft<float>(in, norm_factor, npad, pad, rank, direction);
+                break;
+            case f64:
+                output =
+                    fft<double>(in, norm_factor, npad, pad, rank, direction);
+                break;
             default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-template<int rank>
-static af_err ifft(af_array *out, const af_array in, const double norm_factor, const dim_t npad, const dim_t * const pad)
-{
+af_err af_fft(af_array *out, const af_array in, const double norm_factor,
+              const dim_t pad0) {
+    const dim_t pad[1] = {pad0};
+    return fft(out, in, norm_factor, (pad0 > 0 ? 1 : 0), pad, 1, true);
+}
+
+af_err af_fft2(af_array *out, const af_array in, const double norm_factor,
+               const dim_t pad0, const dim_t pad1) {
+    const dim_t pad[2] = {pad0, pad1};
+    return fft(out, in, norm_factor, (pad0 > 0 && pad1 > 0 ? 2 : 0), pad, 2,
+               true);
+}
+
+af_err af_fft3(af_array *out, const af_array in, const double norm_factor,
+               const dim_t pad0, const dim_t pad1, const dim_t pad2) {
+    const dim_t pad[3] = {pad0, pad1, pad2};
+    return fft(out, in, norm_factor, (pad0 > 0 && pad1 > 0 && pad2 > 0 ? 3 : 0),
+               pad, 3, true);
+}
+
+af_err af_ifft(af_array *out, const af_array in, const double norm_factor,
+               const dim_t pad0) {
+    const dim_t pad[1] = {pad0};
+    return fft(out, in, norm_factor, (pad0 > 0 ? 1 : 0), pad, 1, false);
+}
+
+af_err af_ifft2(af_array *out, const af_array in, const double norm_factor,
+                const dim_t pad0, const dim_t pad1) {
+    const dim_t pad[2] = {pad0, pad1};
+    return fft(out, in, norm_factor, (pad0 > 0 && pad1 > 0 ? 2 : 0), pad, 2,
+               false);
+}
+
+af_err af_ifft3(af_array *out, const af_array in, const double norm_factor,
+                const dim_t pad0, const dim_t pad1, const dim_t pad2) {
+    const dim_t pad[3] = {pad0, pad1, pad2};
+    return fft(out, in, norm_factor, (pad0 > 0 && pad1 > 0 && pad2 > 0 ? 3 : 0),
+               pad, 3, false);
+}
+
+template<typename T>
+void fft_inplace(af_array in, const double norm_factor, int rank,
+                 bool direction) {
+    Array<T> &input = getArray<T>(in);
+    fft_inplace<T>(input, rank, direction);
+    if (norm_factor != 1) { multiply_inplace<T>(input, norm_factor); }
+}
+
+af_err fft_inplace(af_array in, const double norm_factor, int rank,
+                   bool direction) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type  = info.getType();
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
 
-        DIM_ASSERT(1, (dims.ndims()>=rank));
+        if (dims.ndims() == 0) { return AF_SUCCESS; }
+        DIM_ASSERT(1, (dims.ndims() >= rank));
 
-        af_array output;
-        switch(type) {
-            case c32: output = ifft<cfloat , rank>(in, norm_factor, npad, pad); break;
-            case c64: output = ifft<cdouble, rank>(in, norm_factor, npad, pad); break;
+        switch (type) {
+            case c32:
+                fft_inplace<cfloat>(in, norm_factor, rank, direction);
+                break;
+            case c64:
+                fft_inplace<cdouble>(in, norm_factor, rank, direction);
+                break;
             default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_fft(af_array *out, const af_array in, const double norm_factor, const dim_t pad0)
-{
-    const dim_t pad[1] = {pad0};
-    return fft<1>(out, in, norm_factor, (pad0>0?1:0), pad);
+af_err af_fft_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 1, true);
 }
 
-af_err af_fft2(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1)
-{
-    const dim_t pad[2] = {pad0, pad1};
-    return fft<2>(out, in, norm_factor, (pad0>0&&pad1>0?2:0), pad);
+af_err af_fft2_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 2, true);
 }
 
-af_err af_fft3(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1, const dim_t pad2)
-{
-    const dim_t pad[3] = {pad0, pad1, pad2};
-    return fft<3>(out, in, norm_factor, (pad0>0&&pad1>0&&pad2>0?3:0), pad);
+af_err af_fft3_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 3, true);
+}
+
+af_err af_ifft_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 1, false);
+}
+
+af_err af_ifft2_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 2, false);
+}
+
+af_err af_ifft3_inplace(af_array in, const double norm_factor) {
+    return fft_inplace(in, norm_factor, 3, false);
+}
+
+template<typename InType>
+af_array fft_r2c(const af_array in, const double norm_factor, const dim_t npad,
+                 const dim_t *const pad, const int rank) {
+    using OutType = typename conditional<is_same<InType, double>::value,
+                                         cdouble, cfloat>::type;
+    return getHandle(fft_r2c<InType, OutType>(getArray<InType>(in), norm_factor,
+                                              npad, pad, rank));
 }
 
-af_err af_ifft(af_array *out, const af_array in, const double norm_factor, const dim_t pad0)
-{
+af_err fft_r2c(af_array *out, const af_array in, const double norm_factor,
+               const dim_t npad, const dim_t *const pad, const int rank) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
+
+        if (dims.ndims() == 0) { return af_retain_array(out, in); }
+        DIM_ASSERT(1, (dims.ndims() >= rank));
+
+        af_array output;
+        switch (type) {
+            case f32:
+                output = fft_r2c<float>(in, norm_factor, npad, pad, rank);
+                break;
+            case f64:
+                output = fft_r2c<double>(in, norm_factor, npad, pad, rank);
+                break;
+            default: {
+                TYPE_ERROR(1, type);
+            }
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_fft_r2c(af_array *out, const af_array in, const double norm_factor,
+                  const dim_t pad0) {
     const dim_t pad[1] = {pad0};
-    return ifft<1>(out, in, norm_factor, (pad0>0?1:0), pad);
+    return fft_r2c(out, in, norm_factor, (pad0 > 0 ? 1 : 0), pad, 1);
 }
 
-af_err af_ifft2(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1)
-{
+af_err af_fft2_r2c(af_array *out, const af_array in, const double norm_factor,
+                   const dim_t pad0, const dim_t pad1) {
     const dim_t pad[2] = {pad0, pad1};
-    return ifft<2>(out, in, norm_factor, (pad0>0&&pad1>0?2:0), pad);
+    return fft_r2c(out, in, norm_factor, (pad0 > 0 && pad1 > 0 ? 2 : 0), pad,
+                   2);
 }
 
-af_err af_ifft3(af_array *out, const af_array in, const double norm_factor, const dim_t pad0, const dim_t pad1, const dim_t pad2)
-{
+af_err af_fft3_r2c(af_array *out, const af_array in, const double norm_factor,
+                   const dim_t pad0, const dim_t pad1, const dim_t pad2) {
     const dim_t pad[3] = {pad0, pad1, pad2};
-    return ifft<3>(out, in, norm_factor, (pad0>0&&pad1>0&&pad2>0?3:0), pad);
+    return fft_r2c(out, in, norm_factor,
+                   (pad0 > 0 && pad1 > 0 && pad2 > 0 ? 3 : 0), pad, 3);
+}
+
+template<typename InType>
+static af_array fft_c2r(const af_array in, const double norm_factor,
+                        const dim4 &odims, const int rank) {
+    using OutType = typename conditional<is_same<InType, cdouble>::value,
+                                         double, float>::type;
+    return getHandle(fft_c2r<InType, OutType>(getArray<InType>(in), norm_factor,
+                                              odims, rank));
+}
+
+af_err fft_c2r(af_array *out, const af_array in, const double norm_factor,
+               const bool is_odd, const int rank) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 idims        = info.dims();
+
+        if (idims.ndims() == 0) { return af_retain_array(out, in); }
+        DIM_ASSERT(1, (idims.ndims() >= rank));
+
+        dim4 odims = idims;
+        odims[0]   = 2 * (odims[0] - 1) + (is_odd ? 1 : 0);
+
+        af_array output;
+        switch (type) {
+            case c32:
+                output = fft_c2r<cfloat>(in, norm_factor, odims, rank);
+                break;
+            case c64:
+                output = fft_c2r<cdouble>(in, norm_factor, odims, rank);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_fft_c2r(af_array *out, const af_array in, const double norm_factor,
+                  const bool is_odd) {
+    return fft_c2r(out, in, norm_factor, is_odd, 1);
+}
+
+af_err af_fft2_c2r(af_array *out, const af_array in, const double norm_factor,
+                   const bool is_odd) {
+    return fft_c2r(out, in, norm_factor, is_odd, 2);
+}
+
+af_err af_fft3_c2r(af_array *out, const af_array in, const double norm_factor,
+                   const bool is_odd) {
+    return fft_c2r(out, in, norm_factor, is_odd, 3);
+}
+
+af_err af_set_fft_plan_cache_size(size_t cache_size) {
+    try {
+        detail::setFFTPlanCacheSize(cache_size);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
 }
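
Reviewer note on the `fft_c2r` size computation above: the line `odims[0] = 2 * (odims[0] - 1) + (is_odd ? 1 : 0)` recovers the real signal length from a half-spectrum, and `is_odd` is needed because two different real lengths map to the same half-spectrum length. A minimal standalone sketch follows. It assumes the usual `n / 2 + 1` half-spectrum convention for the forward r2c transform, which this diff does not show. `dim_type`, `c2rOutputLength` and `r2cOutputLength` are hypothetical names used only for illustration, not ArrayFire API.

```cpp
#include <cstdint>

// Hypothetical stand-in for ArrayFire's dim_t; illustration only.
using dim_type = std::int64_t;

// Real samples recovered along dimension 0 from a half-spectrum holding
// `half_len` complex values. Mirrors the expression used in fft_c2r:
//   odims[0] = 2 * (odims[0] - 1) + (is_odd ? 1 : 0);
constexpr dim_type c2rOutputLength(dim_type half_len, bool is_odd) {
    return 2 * (half_len - 1) + (is_odd ? 1 : 0);
}

// ASSUMPTION: an r2c transform of n real samples keeps n / 2 + 1 complex
// values along dimension 0 (the common half-spectrum layout).
constexpr dim_type r2cOutputLength(dim_type n) { return n / 2 + 1; }

// Lengths 8 and 9 both produce 5 complex values, so the caller has to
// say which one to reconstruct.
static_assert(r2cOutputLength(8) == r2cOutputLength(9), "ambiguous length");
static_assert(c2rOutputLength(r2cOutputLength(8), false) == 8, "even case");
static_assert(c2rOutputLength(r2cOutputLength(9), true) == 9, "odd case");
```

Under that assumption, passing the wrong `is_odd` flag yields an output that is one sample too long or too short rather than an error.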
diff --git a/src/api/c/fft_common.hpp b/src/api/c/fft_common.hpp
new file mode 100644
index 0000000000..aacc637982
--- /dev/null
+++ b/src/api/c/fft_common.hpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+#include <fft.hpp>
+#include <handle.hpp>
+
+void computePaddedDims(af::dim4 &pdims, const af::dim4 &idims, const dim_t npad,
+                       dim_t const *const pad);
+
+template<typename inType, typename outType>
+detail::Array<outType> fft(const detail::Array<inType> input,
+                           const double norm_factor, const dim_t npad,
+                           const dim_t *const pad, const int rank,
+                           const bool direction) {
+    using af::dim4;
+    using detail::fft_inplace;
+    using detail::reshape;
+    using detail::scalar;
+
+    dim4 pdims(1);
+    computePaddedDims(pdims, input.dims(), npad, pad);
+    auto res = reshape(input, pdims, scalar<outType>(0));
+
+    fft_inplace<outType>(res, rank, direction);
+    if (norm_factor != 1.0) multiply_inplace(res, norm_factor);
+
+    return res;
+}
+
+template<typename inType, typename outType>
+detail::Array<outType> fft_r2c(const detail::Array<inType> input,
+                               const double norm_factor, const dim_t npad,
+                               const dim_t *const pad, const int rank) {
+    using af::dim4;
+    using detail::Array;
+    using detail::fft_r2c;
+    using detail::multiply_inplace;
+    using detail::reshape;
+    using detail::scalar;
+
+    const dim4 &idims = input.dims();
+
+    bool is_pad = false;
+    for (int i = 0; i < npad; i++) { is_pad |= (pad[i] != idims[i]); }
+
+    Array<inType> tmp = input;
+
+    if (is_pad) {
+        dim4 pdims(1);
+        computePaddedDims(pdims, input.dims(), npad, pad);
+        tmp = reshape(input, pdims, scalar<inType>(0));
+    }
+
+    auto res = fft_r2c<outType, inType>(tmp, rank);
+    if (norm_factor != 1.0) multiply_inplace(res, norm_factor);
+
+    return res;
+}
+
+template<typename inType, typename outType>
+detail::Array<outType> fft_c2r(const detail::Array<inType> input,
+                               const double norm_factor, const af::dim4 &odims,
+                               const int rank) {
+    using detail::Array;
+    using detail::fft_c2r;
+    using detail::multiply_inplace;
+
+    Array<outType> output = fft_c2r<outType, inType>(input, odims, rank);
+
+    if (norm_factor != 1) {
+        // Normalize output here; fft_c2r above does not apply norm_factor
+        multiply_inplace(output, norm_factor);
+    }
+
+    return output;
+}
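
Reviewer note on padding: `computePaddedDims` takes the first `npad` dimensions from `pad` and the remaining ones from the input, and the `af_fft2` wrapper passes `npad = 2` only when both pad values are positive. The sketch below reproduces those two rules without ArrayFire types so the interaction is easy to check. `Dims4`, `paddedDims` and `padCount2` are hypothetical names for illustration, and `std::array` stands in for `af::dim4`.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-ins for dim_t and af::dim4; illustration only.
using dim_type = std::int64_t;
using Dims4    = std::array<dim_type, 4>;

// Same rule as computePaddedDims in the diff: the first `npad` entries are
// taken from `pad`, the remaining entries from the input dimensions.
Dims4 paddedDims(const Dims4 &idims, dim_type npad, const dim_type *pad) {
    Dims4 pdims{{1, 1, 1, 1}};
    for (int i = 0; i < 4; i++) {
        pdims[i] = (i < static_cast<int>(npad)) ? pad[i] : idims[i];
    }
    return pdims;
}

// Same rule as the af_fft2 wrapper: padding is applied only when every
// requested pad length is positive, otherwise it is disabled entirely.
dim_type padCount2(dim_type pad0, dim_type pad1) {
    return (pad0 > 0 && pad1 > 0) ? 2 : 0;
}
```

One consequence visible in this sketch: a call with pad lengths `{16, 0}` leaves the input dimensions untouched, because a single non-positive entry turns padding off for both dimensions.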
diff --git a/src/api/c/fftconvolve.cpp b/src/api/c/fftconvolve.cpp
index 86696bd081..ead2247c51 100644
--- a/src/api/c/fftconvolve.cpp
+++ b/src/api/c/fftconvolve.cpp
@@ -6,154 +6,262 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <af/dim4.hpp>
+
+#include <fftconvolve.hpp>
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/dispatch.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <fft_common.hpp>
+#include <handle.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/signal.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <arith.hpp>
-#include <fft.hpp>
-#include <fftconvolve.hpp>
-#include <convolve_common.hpp>
-#include <dispatch.hpp>
+
+#include <algorithm>
+#include <type_traits>
+#include <vector>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createSubArray;
+using detail::fftconvolve;
+using detail::intl;
+using detail::real;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::conditional;
+using std::is_integral;
+using std::is_same;
+using std::max;
+using std::swap;
+using std::vector;
 
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-inline static af_array fftconvolve(const af_array &s, const af_array &f, const bool expand, ConvolveBatchKind kind)
-{
-    return getHandle(fftconvolve<T, convT, cT, isDouble, roundOut, baseDim>(getArray<T>(s), castArray<T>(f), expand, kind));
-}
+template<typename T>
+af_array fftconvolve_fallback(const af_array signal, const af_array filter,
+                              const bool expand, const int baseDim) {
+    using convT = typename conditional<is_integral<T>::value ||
+                                           is_same<T, float>::value ||
+                                           is_same<T, cfloat>::value,
+                                       float, double>::type;
+    using cT    = typename conditional<is_same<convT, float>::value, cfloat,
+                                    cdouble>::type;
 
-template<dim_t baseDim>
-ConvolveBatchKind identifyBatchKind(const dim4 &sDims, const dim4 &fDims)
-{
-    dim_t sn = sDims.ndims();
-    dim_t fn = fDims.ndims();
-
-    if (sn==baseDim && fn==baseDim)
-        return ONE2ONE;
-    else if (sn==baseDim && (fn>baseDim && fn<=4))
-        return ONE2MANY;
-    else if ((sn>baseDim && sn<=4) && fn==baseDim)
-        return MANY2ONE;
-    else if ((sn>baseDim && sn<=4) && (fn>baseDim && fn<=4)) {
-        bool doesDimensionsMatch = true;
-        for (dim_t i=baseDim; i<4; i++) {
-            if (sDims[i]!=fDims[i]) {
-                doesDimensionsMatch = false;
-                break;
-            }
-        }
-        return (doesDimensionsMatch ? MANY2MANY : CONVOLVE_UNSUPPORTED_BATCH_MODE);
-    }
-    else
-        return CONVOLVE_UNSUPPORTED_BATCH_MODE;
-}
+    const Array<cT> S = castArray<cT>(signal);
+    const Array<cT> F = castArray<cT>(filter);
+    const dim4 &sdims = S.dims();
+    const dim4 &fdims = F.dims();
+    dim4 odims(1, 1, 1, 1);
+    dim4 psdims(1, 1, 1, 1);
+    dim4 pfdims(1, 1, 1, 1);
 
-template<typename T, int baseDim>
-static inline
-af_array fftconvcplx(const af_array signal, const af_array filter, bool expand,
-                     ConvolveBatchKind kind)
-{
-    const Array<T> S = getArray<T>(signal);
-    const Array<T> F = castArray<T>(filter);
-    const dim4 sdims = S.dims();
-    const dim4 fdims = F.dims();
-    dim4 tdims(1, 1, 1, 1);
-    std::vector<af_seq> index(4);
+    vector<af_seq> index(AF_MAX_DIMS);
 
     int count = 1;
     for (int i = 0; i < baseDim; i++) {
         dim_t tdim_i = sdims[i] + fdims[i] - 1;
 
         // Pad temporary buffers to power of 2 for performance
-        tdims[i] = nextpow2(tdim_i);
+        odims[i]  = nextpow2(tdim_i);
+        psdims[i] = nextpow2(tdim_i);
+        pfdims[i] = nextpow2(tdim_i);
 
         // The normalization factor
-        count *= tdims[i];
+        count *= odims[i];
 
         // Get the indexing params for output
         if (expand) {
-            index[i].begin = 0;
-            index[i].end = tdim_i - 1;
+            index[i].begin = 0.;
+            index[i].end   = static_cast<double>(tdim_i) - 1.;
         } else {
-            index[i].begin = fdims[i] / 2;
-            index[i].end = index[i].begin + sdims[i] - 1;
+            index[i].begin = static_cast<double>(fdims[i]) / 2.0;
+            index[i].end = static_cast<double>(index[i].begin + sdims[i]) - 1.;
         }
-        index[i].step = 1;
+        index[i].step = 1.;
     }
 
-    for (int i = baseDim; i < 4; i++) {
-        tdims[i] = std::max(sdims[i], fdims[i]);
-        index[i] = af_span;
+    for (int i = baseDim; i < AF_MAX_DIMS; i++) {
+        odims[i]  = max(sdims[i], fdims[i]);
+        psdims[i] = sdims[i];
+        pfdims[i] = fdims[i];
+        index[i]  = af_span;
     }
 
     // fft(signal)
-    Array<T> T1 = fft<T, T, baseDim, false>(S, 1.0, baseDim, tdims.get());
+    Array<cT> T1 = fft<cT, cT>(S, 1.0, baseDim, psdims.get(), baseDim, true);
 
     // fft(filter)
-    Array<T> T2 = fft<T, T, baseDim, false>(F, 1.0, baseDim, tdims.get());
+    Array<cT> T2 = fft<cT, cT>(F, 1.0, baseDim, pfdims.get(), baseDim, true);
 
     // fft(signal) * fft(filter)
-    T1 = arithOp<T, af_mul_t>(T1, T2, tdims);
+    T1 = arithOp<cT, af_mul_t>(T1, T2, odims);
 
     // ifft(ffit(signal) * fft(filter))
-    T1 = ifft<T, baseDim>(T1, 1.0/(double)count, baseDim, tdims.get());
+    T1 = fft<cT, cT>(T1, 1.0 / static_cast<double>(count), baseDim, odims.get(),
+                     baseDim, false);
 
     // Index to proper offsets
-    T1 = createSubArray<T>(T1, index);
-    return getHandle(T1);
+    T1 = createSubArray<cT>(T1, index);
+
+    if (getInfo(signal).isComplex() || getInfo(filter).isComplex()) {
+        return getHandle(cast<T>(T1));
+    } else {
+        return getHandle(cast<T>(real<convT>(T1)));
+    }
 }
 
-template<dim_t baseDim>
-af_err fft_convolve(af_array *out, const af_array signal, const af_array filter, const bool expand)
-{
+template<typename T>
+inline af_array fftconvolve(const af_array &s, const af_array &f,
+                            const bool expand, AF_BATCH_KIND kind,
+                            const int baseDim) {
+    if (kind == AF_BATCH_DIFF) {
+        return fftconvolve_fallback<T>(s, f, expand, baseDim);
+    } else {
+        return getHandle(fftconvolve<T>(getArray<T>(s), castArray<T>(f), expand,
+                                        kind, baseDim));
+    }
+}
+
+AF_BATCH_KIND identifyBatchKind(const dim4 &sDims, const dim4 &fDims,
+                                const int baseDim) {
+    dim_t sn = sDims.ndims();
+    dim_t fn = fDims.ndims();
+
+    if (sn == baseDim && fn == baseDim) { return AF_BATCH_NONE; }
+    if (sn == baseDim && (fn > baseDim && fn <= AF_MAX_DIMS)) {
+        return AF_BATCH_RHS;
+    }
+    if ((sn > baseDim && sn <= AF_MAX_DIMS) && fn == baseDim) {
+        return AF_BATCH_LHS;
+    } else if ((sn > baseDim && sn <= AF_MAX_DIMS) &&
+               (fn > baseDim && fn <= AF_MAX_DIMS)) {
+        bool doesDimensionsMatch = true;
+        bool isInterleaved       = true;
+        for (dim_t i = baseDim; i < AF_MAX_DIMS; i++) {
+            doesDimensionsMatch &= (sDims[i] == fDims[i]);
+            isInterleaved &=
+                (sDims[i] == 1 || fDims[i] == 1 || sDims[i] == fDims[i]);
+        }
+        if (doesDimensionsMatch) { return AF_BATCH_SAME; }
+        return (isInterleaved ? AF_BATCH_DIFF : AF_BATCH_UNSUPPORTED);
+    } else {
+        return AF_BATCH_UNSUPPORTED;
+    }
+}
+
+af_err fft_convolve(af_array *out, const af_array signal, const af_array filter,
+                    const bool expand, const int baseDim) {
     try {
-        ArrayInfo sInfo = getInfo(signal);
-        ArrayInfo fInfo = getInfo(filter);
+        const ArrayInfo &sInfo = getInfo(signal);
+        const ArrayInfo &fInfo = getInfo(filter);
 
-        af_dtype stype  = sInfo.getType();
+        af_dtype signalType = sInfo.getType();
 
-        dim4 sdims = sInfo.dims();
-        dim4 fdims = fInfo.dims();
+        const dim4 &sdims = sInfo.dims();
+        const dim4 &fdims = fInfo.dims();
 
-        ConvolveBatchKind convBT = identifyBatchKind<baseDim>(sdims, fdims);
+        AF_BATCH_KIND convBT = identifyBatchKind(sdims, fdims, baseDim);
 
-        ARG_ASSERT(1, (convBT != CONVOLVE_UNSUPPORTED_BATCH_MODE));
+        ARG_ASSERT(1, (convBT != AF_BATCH_UNSUPPORTED));
 
         af_array output;
-        switch(stype) {
-            case f64: output = fftconvolve<double, double, cdouble, true , false, baseDim>(signal, filter, expand, convBT); break;
-            case f32: output = fftconvolve<float , float,  cfloat,  false, false, baseDim>(signal, filter, expand, convBT); break;
-            case u32: output = fftconvolve<uint  , float,  cfloat,  false, true,  baseDim>(signal, filter, expand, convBT); break;
-            case s32: output = fftconvolve<int   , float,  cfloat,  false, true,  baseDim>(signal, filter, expand, convBT); break;
-            case u8:  output = fftconvolve<uchar , float,  cfloat,  false, true,  baseDim>(signal, filter, expand, convBT); break;
-            case b8:  output = fftconvolve<char  , float,  cfloat,  false, true,  baseDim>(signal, filter, expand, convBT); break;
-            case c32: output = fftconvcplx<cfloat , baseDim>(signal, filter, expand, convBT); break;
-            case c64: output = fftconvcplx<cdouble, baseDim>(signal, filter, expand, convBT); break;
-            default: TYPE_ERROR(1, stype);
+        switch (signalType) {
+            case f64:
+                output = fftconvolve<double>(signal, filter, expand, convBT,
+                                             baseDim);
+                break;
+            case f32:
+                output =
+                    fftconvolve<float>(signal, filter, expand, convBT, baseDim);
+                break;
+            case u32:
+                output =
+                    fftconvolve<uint>(signal, filter, expand, convBT, baseDim);
+                break;
+            case s32:
+                output =
+                    fftconvolve<int>(signal, filter, expand, convBT, baseDim);
+                break;
+            case u64:
+                output =
+                    fftconvolve<uintl>(signal, filter, expand, convBT, baseDim);
+                break;
+            case s64:
+                output =
+                    fftconvolve<intl>(signal, filter, expand, convBT, baseDim);
+                break;
+            case u16:
+                output = fftconvolve<ushort>(signal, filter, expand, convBT,
+                                             baseDim);
+                break;
+            case s16:
+                output =
+                    fftconvolve<short>(signal, filter, expand, convBT, baseDim);
+                break;
+            case u8:
+                output =
+                    fftconvolve<uchar>(signal, filter, expand, convBT, baseDim);
+                break;
+            case s8:
+                output =
+                    fftconvolve<schar>(signal, filter, expand, convBT, baseDim);
+                break;
+            case b8:
+                output =
+                    fftconvolve<char>(signal, filter, expand, convBT, baseDim);
+                break;
+            case c32:
+                output = fftconvolve_fallback<cfloat>(signal, filter, expand,
+                                                      baseDim);
+                break;
+            case c64:
+                output = fftconvolve_fallback<cdouble>(signal, filter, expand,
+                                                       baseDim);
+                break;
+            default: TYPE_ERROR(1, signalType);
         }
-        std::swap(*out,output);
+        swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_fft_convolve1(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode)
-{
-    return fft_convolve<1>(out, signal, filter, mode == AF_CONV_EXPAND);
+af_err af_fft_convolve1(af_array *out, const af_array signal,
+                        const af_array filter, const af_conv_mode mode) {
+    return fft_convolve(out, signal, filter, mode == AF_CONV_EXPAND, 1);
 }
 
-af_err af_fft_convolve2(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode)
-{
-    return fft_convolve<2>(out, signal, filter, mode == AF_CONV_EXPAND);
+af_err af_fft_convolve2(af_array *out, const af_array signal,
+                        const af_array filter, const af_conv_mode mode) {
+    try {
+        if (getInfo(signal).dims().ndims() < 2 &&
+            getInfo(filter).dims().ndims() < 2) {
+            return fft_convolve(out, signal, filter, mode == AF_CONV_EXPAND, 1);
+        }
+        return fft_convolve(out, signal, filter, mode == AF_CONV_EXPAND, 2);
+    }
+    CATCHALL;
 }
 
-af_err af_fft_convolve3(af_array *out, const af_array signal, const af_array filter, const af_conv_mode mode)
-{
-    return fft_convolve<3>(out, signal, filter, mode == AF_CONV_EXPAND);
+af_err af_fft_convolve3(af_array *out, const af_array signal,
+                        const af_array filter, const af_conv_mode mode) {
+    try {
+        if (getInfo(signal).dims().ndims() < 3 &&
+            getInfo(filter).dims().ndims() < 3) {
+            return fft_convolve(out, signal, filter, mode == AF_CONV_EXPAND, 2);
+        }
+        return fft_convolve(out, signal, filter, mode == AF_CONV_EXPAND, 3);
+    }
+    CATCHALL;
 }
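
Reviewer note on batching: the new `identifyBatchKind` adds an `AF_BATCH_DIFF` case for signal and filter batches that differ but are broadcast-compatible, and `fftconvolve` routes that case to `fftconvolve_fallback`. The standalone sketch below mirrors the classification logic so the cases can be exercised in isolation. `BatchKind`, `ndims` and `classify` are hypothetical names for illustration. The `ndims` helper assumes that trailing dimensions of size 1 do not count and ignores empty arrays, which is how `dim4::ndims()` is assumed to behave here.

```cpp
#include <array>
#include <cstdint>

// Hypothetical stand-in for af::dim4; illustration only.
using Dims4 = std::array<std::int64_t, 4>;

// Hypothetical mirror of the AF_BATCH_* values used in the diff.
enum class BatchKind { None, Lhs, Rhs, Same, Diff, Unsupported };

// ASSUMPTION: trailing dimensions of size 1 are not counted, and the
// array is non-empty.
int ndims(const Dims4 &d) {
    int n = 4;
    while (n > 1 && d[n - 1] == 1) { --n; }
    return n;
}

// Follows the branch structure of identifyBatchKind in the diff.
BatchKind classify(const Dims4 &s, const Dims4 &f, int baseDim) {
    const int sn = ndims(s);
    const int fn = ndims(f);

    if (sn == baseDim && fn == baseDim) { return BatchKind::None; }
    if (sn == baseDim && fn > baseDim) { return BatchKind::Rhs; }
    if (sn > baseDim && fn == baseDim) { return BatchKind::Lhs; }
    if (sn > baseDim && fn > baseDim) {
        bool same        = true;
        bool interleaved = true;
        for (int i = baseDim; i < 4; i++) {
            same        = same && (s[i] == f[i]);
            interleaved = interleaved &&
                          (s[i] == 1 || f[i] == 1 || s[i] == f[i]);
        }
        if (same) { return BatchKind::Same; }
        return interleaved ? BatchKind::Diff : BatchKind::Unsupported;
    }
    return BatchKind::Unsupported;
}
```

For example, with `baseDim = 2`, a signal of shape `{64, 64, 8, 1}` and a filter of shape `{5, 5, 1, 4}` classify as `Diff`, while a filter of shape `{5, 5, 3, 1}` against the same signal is `Unsupported` because 8 and 3 cannot be broadcast against each other.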
diff --git a/src/api/c/filters.cpp b/src/api/c/filters.cpp
index 4658604937..4c154c16fb 100644
--- a/src/api/c/filters.cpp
+++ b/src/api/c/filters.cpp
@@ -7,58 +7,139 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <af/data.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <medfilt.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include <af/signal.h>
 
 using af::dim4;
-using namespace detail;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length,
+                  const dim_t wind_width, const af_border_type edge_pad) {
+    return af_medfilt2(out, in, wind_length, wind_width, edge_pad);
+}
 
 template<typename T>
-static af_array medfilt(af_array const &in, dim_t w_len, dim_t w_wid, af_border_type edge_pad)
-{
-    switch(edge_pad) {
-        case AF_PAD_ZERO : return getHandle<T>(medfilt<T, AF_PAD_ZERO>(getArray<T>(in), w_len, w_wid)); break;
-        case AF_PAD_SYM  : return getHandle<T>(medfilt<T, AF_PAD_SYM >(getArray<T>(in), w_len, w_wid)); break;
-        default          : return getHandle<T>(medfilt<T, AF_PAD_ZERO>(getArray<T>(in), w_len, w_wid)); break;
+static af_array medfilt1(af_array const &in, dim_t w_wid,
+                         af_border_type edge_pad) {
+    return getHandle<T>(
+        medfilt1<T>(getArray<T>(in), static_cast<int>(w_wid), edge_pad));
+}
+
+af_err af_medfilt1(af_array *out, const af_array in, const dim_t wind_width,
+                   const af_border_type edge_pad) {
+    try {
+        ARG_ASSERT(2, (wind_width > 0));
+        ARG_ASSERT(3, (edge_pad >= AF_PAD_ZERO && edge_pad <= AF_PAD_SYM));
+
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        dim_t input_ndims = dims.ndims();
+        DIM_ASSERT(1, (input_ndims >= 1));
+
+        if (wind_width == 1) {
+            *out = retain(in);
+            return AF_SUCCESS;
+        }
+        af_array output = nullptr;
+        af_dtype type   = info.getType();
+        switch (type) {
+            case f32: output = medfilt1<float>(in, wind_width, edge_pad); break;
+            case f64:
+                output = medfilt1<double>(in, wind_width, edge_pad);
+                break;
+            case b8: output = medfilt1<char>(in, wind_width, edge_pad); break;
+            case s32: output = medfilt1<int>(in, wind_width, edge_pad); break;
+            case u32: output = medfilt1<uint>(in, wind_width, edge_pad); break;
+            case s16: output = medfilt1<short>(in, wind_width, edge_pad); break;
+            case u16:
+                output = medfilt1<ushort>(in, wind_width, edge_pad);
+                break;
+            case s8: output = medfilt1<schar>(in, wind_width, edge_pad); break;
+            case u8: output = medfilt1<uchar>(in, wind_width, edge_pad); break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
     }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T>
+inline af_array medfilt2(af_array const &in, dim_t w_len, dim_t w_wid,
+                         af_border_type edge_pad) {
+    return getHandle(medfilt2<T>(getArray<T>(in), static_cast<int>(w_len),
+                                 static_cast<int>(w_wid), edge_pad));
 }
 
-af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length, const dim_t wind_width, const af_border_type edge_pad)
-{
+af_err af_medfilt2(af_array *out, const af_array in, const dim_t wind_length,
+                   const dim_t wind_width, const af_border_type edge_pad) {
     try {
-        ARG_ASSERT(2, (wind_length==wind_width));
-        ARG_ASSERT(2, (wind_length>0));
-        ARG_ASSERT(3, (wind_width>0));
-        ARG_ASSERT(4, (edge_pad>=AF_PAD_ZERO && edge_pad<=AF_PAD_SYM));
+        ARG_ASSERT(2, (wind_length == wind_width));
+        ARG_ASSERT(2, (wind_length > 0));
+        ARG_ASSERT(3, (wind_width > 0));
+        ARG_ASSERT(4, (edge_pad >= AF_PAD_ZERO && edge_pad <= AF_PAD_SYM));
 
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        if (info.isColumn()) {
+            return af_medfilt1(out, in, wind_width, edge_pad);
+        }
 
         dim_t input_ndims = dims.ndims();
         DIM_ASSERT(1, (input_ndims >= 2));
 
-        if (wind_length==1) {
+        if (wind_length == 1) {
             *out = retain(in);
-        } else {
-            af_array output;
-            af_dtype type  = info.getType();
-            switch(type) {
-                case f32: output = medfilt<float >(in, wind_length, wind_width, edge_pad); break;
-                case f64: output = medfilt<double>(in, wind_length, wind_width, edge_pad); break;
-                case b8 : output = medfilt<char  >(in, wind_length, wind_width, edge_pad); break;
-                case s32: output = medfilt<int   >(in, wind_length, wind_width, edge_pad); break;
-                case u32: output = medfilt<uint  >(in, wind_length, wind_width, edge_pad); break;
-                case u8 : output = medfilt<uchar >(in, wind_length, wind_width, edge_pad); break;
-                default : TYPE_ERROR(1, type);
-            }
-            std::swap(*out, output);
+            return AF_SUCCESS;
+        }
+        af_array output = nullptr;
+        af_dtype type   = info.getType();
+        switch (type) {
+            case f32:
+                output = medfilt2<float>(in, wind_length, wind_width, edge_pad);
+                break;
+            case f64:
+                output =
+                    medfilt2<double>(in, wind_length, wind_width, edge_pad);
+                break;
+            case b8:
+                output = medfilt2<char>(in, wind_length, wind_width, edge_pad);
+                break;
+            case s32:
+                output = medfilt2<int>(in, wind_length, wind_width, edge_pad);
+                break;
+            case u32:
+                output = medfilt2<uint>(in, wind_length, wind_width, edge_pad);
+                break;
+            case s16:
+                output = medfilt2<short>(in, wind_length, wind_width, edge_pad);
+                break;
+            case u16:
+                output =
+                    medfilt2<ushort>(in, wind_length, wind_width, edge_pad);
+                break;
+            case s8:
+                output = medfilt2<schar>(in, wind_length, wind_width, edge_pad);
+                break;
+            case u8:
+                output = medfilt2<uchar>(in, wind_length, wind_width, edge_pad);
+                break;
+            default: TYPE_ERROR(1, type);
         }
+        std::swap(*out, output);
     }
     CATCHALL;
 
@@ -66,16 +147,15 @@ af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length, con
 }
 
 af_err af_minfilt(af_array *out, const af_array in, const dim_t wind_length,
-                  const dim_t wind_width, const af_border_type edge_pad)
-{
+                  const dim_t wind_width, const af_border_type edge_pad) {
     try {
-        ARG_ASSERT(2, (wind_length==wind_width));
-        ARG_ASSERT(2, (wind_length>0));
-        ARG_ASSERT(3, (wind_width>0));
-        ARG_ASSERT(4, (edge_pad==AF_PAD_ZERO));
+        ARG_ASSERT(2, (wind_length == wind_width));
+        ARG_ASSERT(2, (wind_length > 0));
+        ARG_ASSERT(3, (wind_width > 0));
+        ARG_ASSERT(4, (edge_pad == AF_PAD_ZERO));
 
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
 
         dim_t input_ndims = dims.ndims();
         DIM_ASSERT(1, (input_ndims >= 2));
@@ -94,16 +174,15 @@ af_err af_minfilt(af_array *out, const af_array in, const dim_t wind_length,
 }
 
 af_err af_maxfilt(af_array *out, const af_array in, const dim_t wind_length,
-                  const dim_t wind_width, const af_border_type edge_pad)
-{
+                  const dim_t wind_width, const af_border_type edge_pad) {
     try {
-        ARG_ASSERT(2, (wind_length==wind_width));
-        ARG_ASSERT(2, (wind_length>0));
-        ARG_ASSERT(3, (wind_width>0));
-        ARG_ASSERT(4, (edge_pad==AF_PAD_ZERO));
+        ARG_ASSERT(2, (wind_length == wind_width));
+        ARG_ASSERT(2, (wind_length > 0));
+        ARG_ASSERT(3, (wind_width > 0));
+        ARG_ASSERT(4, (edge_pad == AF_PAD_ZERO));
 
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
 
         dim_t input_ndims = dims.ndims();
         DIM_ASSERT(1, (input_ndims >= 2));
diff --git a/src/api/c/flip.cpp b/src/api/c/flip.cpp
index b7778e8fa7..4aea98ec73 100644
--- a/src/api/c/flip.cpp
+++ b/src/api/c/flip.cpp
@@ -7,50 +7,40 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <vector>
-#include <cassert>
-
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/half.hpp>
+#include <common/indexing_helpers.hpp>
+#include <handle.hpp>
 #include <af/array.h>
 #include <af/data.h>
-#include <af/index.h>
-#include <af/seq.h>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <Array.hpp>
-#include <lookup.hpp>
 
-using namespace detail;
-using std::vector;
+#include <cassert>
+
+using af::dim4;
+using arrayfire::getArray;
+using arrayfire::common::flip;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uintl;
+using detail::ushort;
 using std::swap;
 
 template<typename T>
-static af_array flipArray(const af_array in, const unsigned dim)
-{
-    const Array<T> &input = getArray<T>(in);
-    vector<af_seq> index(4);
-
-    for (int i = 0; i < 4; i++) {
-        index[i] = af_span;
-    }
-
-    // Reverse "dim"
-    dim4 in_dims = input.dims();
-    af_seq s = {(double)(in_dims[dim] - 1), 0, -1};
-
-    index[dim] = s;
-
-    Array<T> dst =  createSubArray(input, index);
-
-    return getHandle(dst);
+static inline af_array flip(const af_array in, const unsigned dim) {
+    return getHandle(
+        flip(getArray<T>(in), {dim == 0, dim == 1, dim == 2, dim == 3}));
 }
 
-af_err af_flip(af_array *result, const af_array in, const unsigned dim)
-{
+af_err af_flip(af_array *result, const af_array in, const unsigned dim) {
     af_array out;
     try {
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
 
         if (in_info.ndims() <= dim) {
             *result = retain(in);
@@ -59,20 +49,26 @@ af_err af_flip(af_array *result, const af_array in, const unsigned dim)
 
         af_dtype in_type = in_info.getType();
 
-        switch(in_type) {
-        case f32:    out = flipArray<float>   (in, dim);  break;
-        case c32:    out = flipArray<cfloat>  (in, dim);  break;
-        case f64:    out = flipArray<double>  (in, dim);  break;
-        case c64:    out = flipArray<cdouble> (in, dim);  break;
-        case b8:     out = flipArray<char>    (in, dim);  break;
-        case s32:    out = flipArray<int>     (in, dim);  break;
-        case u32:    out = flipArray<unsigned>(in, dim);  break;
-        case u8:     out = flipArray<uchar>   (in, dim);  break;
-        default:    TYPE_ERROR(1, in_type);
+        switch (in_type) {
+            case f16: out = flip<half>(in, dim); break;
+            case f32: out = flip<float>(in, dim); break;
+            case c32: out = flip<cfloat>(in, dim); break;
+            case f64: out = flip<double>(in, dim); break;
+            case c64: out = flip<cdouble>(in, dim); break;
+            case b8: out = flip<char>(in, dim); break;
+            case s32: out = flip<int>(in, dim); break;
+            case u32: out = flip<unsigned>(in, dim); break;
+            case s64: out = flip<intl>(in, dim); break;
+            case u64: out = flip<uintl>(in, dim); break;
+            case s16: out = flip<short>(in, dim); break;
+            case u16: out = flip<ushort>(in, dim); break;
+            case s8: out = flip<schar>(in, dim); break;
+            case u8: out = flip<uchar>(in, dim); break;
+            default: TYPE_ERROR(1, in_type);
         }
+        swap(*result, out);
     }
     CATCHALL
 
-    swap(*result, out);
     return AF_SUCCESS;
 }
diff --git a/src/api/c/gaussian_kernel.cpp b/src/api/c/gaussian_kernel.cpp
index 981488cf8b..529aa378e9 100644
--- a/src/api/c/gaussian_kernel.cpp
+++ b/src/api/c/gaussian_kernel.cpp
@@ -7,27 +7,36 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
 #include <handle.hpp>
-#include <reduce.hpp>
-#include <arith.hpp>
 #include <math.hpp>
-#include <unary.hpp>
 #include <range.hpp>
 #include <reduce.hpp>
 #include <transpose.hpp>
+#include <unary.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 
-using namespace detail;
+using af::dim4;
+using detail::arithOp;
+using detail::Array;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::range;
+using detail::reduce_all;
+using detail::scalar;
+using detail::transpose;
+using detail::unaryOp;
 
 template<typename T>
-Array<T> gaussianKernel(const int rows, const int cols, const double sigma_r, const double sigma_c)
-{
+Array<T> gaussianKernel(const int rows, const int cols, const double sigma_r,
+                        const double sigma_c) {
     const dim4 odims = dim4(rows, cols);
-    double sigma = 0;
+    double sigma     = 0;
 
     Array<T> tmp  = createValueArray<T>(odims, scalar<T>(0));
     Array<T> half = createValueArray<T>(odims, 0.5);
@@ -37,28 +46,30 @@ Array<T> gaussianKernel(const int rows, const int cols, const double sigma_r, co
         Array<T> wt = range<T>(dim4(cols, rows), 0);
         Array<T> w  = transpose<T>(wt, false);
 
-        Array<T> c = createValueArray<T>(odims, scalar<T>((double)(cols - 1) / 2.0));
+        Array<T> c = createValueArray<T>(
+            odims, scalar<T>(static_cast<double>(cols - 1) / 2.0));
         w = arithOp<T, af_sub_t>(w, c, odims);
 
-        sigma = sigma_c > 0 ? sigma_c : 0.25 * cols;
+        sigma        = sigma_c > 0 ? sigma_c : 0.25 * cols;
         Array<T> sig = createValueArray<T>(odims, sigma);
-        w = arithOp<T, af_div_t>(w, sig, odims);
+        w            = arithOp<T, af_div_t>(w, sig, odims);
 
-        w = arithOp<T, af_mul_t>(w, w, odims);
+        w   = arithOp<T, af_mul_t>(w, w, odims);
         tmp = arithOp<T, af_add_t>(w, tmp, odims);
     }
 
     if (rows > 1) {
         Array<T> w = range<T>(dim4(rows, cols), 0);
 
-        Array<T> r = createValueArray<T>(odims, scalar<T>((double)(rows - 1) / 2.0));
+        Array<T> r = createValueArray<T>(
+            odims, scalar<T>(static_cast<double>(rows - 1) / 2.0));
         w = arithOp<T, af_sub_t>(w, r, odims);
 
-        sigma = sigma_r > 0 ? sigma_r : 0.25 * rows;
+        sigma        = sigma_r > 0 ? sigma_r : 0.25 * rows;
         Array<T> sig = createValueArray<T>(odims, sigma);
 
-        w = arithOp<T, af_div_t>(w, sig, odims);
-        w = arithOp<T, af_mul_t>(w, w, odims);
+        w   = arithOp<T, af_div_t>(w, sig, odims);
+        w   = arithOp<T, af_mul_t>(w, w, odims);
         tmp = arithOp<T, af_add_t>(w, tmp, odims);
     }
 
@@ -68,22 +79,22 @@ Array<T> gaussianKernel(const int rows, const int cols, const double sigma_r, co
 
     // Use this instead of (2 * pi * sig^2);
     // This ensures the window adds up to 1
-    T norm_factor = reduce_all<af_add_t, T, T>(tmp);
+    T norm_factor = getScalar<T>(reduce_all<af_add_t, T, T>(tmp));
 
     Array<T> norm = createValueArray(odims, norm_factor);
-    Array<T> res = arithOp<T, af_div_t>(tmp, norm, odims);
+    Array<T> res  = arithOp<T, af_div_t>(tmp, norm, odims);
 
     return res;
 }
 
-af_err af_gaussian_kernel(af_array *out,
-                          const int rows, const int cols,
-                          const double sigma_r, const double sigma_c)
-{
+af_err af_gaussian_kernel(af_array *out, const int rows, const int cols,
+                          const double sigma_r, const double sigma_c) {
     try {
         af_array res;
-        res = getHandle<float>(gaussianKernel<float>(rows, cols, sigma_r, sigma_c));
+        res = getHandle<float>(
+            gaussianKernel<float>(rows, cols, sigma_r, sigma_c));
         std::swap(*out, res);
-    }CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
diff --git a/src/api/c/gradient.cpp b/src/api/c/gradient.cpp
index d6adfdff13..e99f4e6e64 100644
--- a/src/api/c/gradient.cpp
+++ b/src/api/c/gradient.cpp
@@ -7,29 +7,30 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
 #include <gradient.hpp>
+#include <handle.hpp>
+#include <af/defines.h>
+#include <af/image.h>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::getArray;
+using detail::cdouble;
+using detail::cfloat;
 
 template<typename T>
-static inline void gradient(af_array *grad0, af_array *grad1, const af_array in)
-{
-    gradient<T>(getWritableArray<T>(*grad0), getWritableArray<T>(*grad1), getArray<T>(in));
+static inline void gradient(af_array *grad0, af_array *grad1,
+                            const af_array in) {
+    gradient<T>(getArray<T>(*grad0), getArray<T>(*grad1), getArray<T>(in));
 }
 
-af_err af_gradient(af_array *grows, af_array *gcols, const af_array in)
-{
+af_err af_gradient(af_array *grows, af_array *gcols, const af_array in) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        af::dim4 idims = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 idims        = info.dims();
 
         DIM_ASSERT(2, info.elements() > 0);
 
@@ -38,12 +39,12 @@ af_err af_gradient(af_array *grows, af_array *gcols, const af_array in)
         AF_CHECK(af_create_handle(&grad0, idims.ndims(), idims.get(), type));
         AF_CHECK(af_create_handle(&grad1, idims.ndims(), idims.get(), type));
 
-        switch(type) {
-            case f32: gradient<float  >(&grad0, &grad1, in);  break;
-            case c32: gradient<cfloat >(&grad0, &grad1, in);  break;
-            case f64: gradient<double >(&grad0, &grad1, in);  break;
-            case c64: gradient<cdouble>(&grad0, &grad1, in);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: gradient<float>(&grad0, &grad1, in); break;
+            case c32: gradient<cfloat>(&grad0, &grad1, in); break;
+            case f64: gradient<double>(&grad0, &grad1, in); break;
+            case c64: gradient<cdouble>(&grad0, &grad1, in); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*grows, grad0);
         std::swap(*gcols, grad1);
diff --git a/src/api/c/graphics_common.cpp b/src/api/c/graphics_common.cpp
deleted file mode 100644
index 2c861928f7..0000000000
--- a/src/api/c/graphics_common.cpp
+++ /dev/null
@@ -1,217 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined(WITH_GRAPHICS)
-
-#include <graphics_common.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <platform.hpp>
-
-using namespace std;
-
-template<typename T>
-GLenum getGLType() { return GL_FLOAT; }
-
-#define INSTANTIATE_GET_GL_TYPE(T, OpenGLEnum)\
-    template<> GLenum getGLType<T>() { return OpenGLEnum; }
-
-INSTANTIATE_GET_GL_TYPE(float, GL_FLOAT);
-INSTANTIATE_GET_GL_TYPE(int  , GL_INT);
-INSTANTIATE_GET_GL_TYPE(unsigned, GL_UNSIGNED_INT);
-INSTANTIATE_GET_GL_TYPE(char, GL_BYTE);
-INSTANTIATE_GET_GL_TYPE(unsigned char, GL_UNSIGNED_BYTE);
-
-GLenum glErrorSkip(const char *msg, const char* file, int line)
-{
-#ifndef NDEBUG
-    GLenum x = glGetError();
-    if (x != GL_NO_ERROR) {
-        char buf[1024];
-        sprintf(buf, "GL Error Skipped at: %s:%d Message: %s Error Code: %d \"%s\"\n", file, line, msg, x, gluErrorString(x));
-        AF_ERROR(buf, AF_ERR_INTERNAL);
-    }
-    return x;
-#else
-    return 0;
-#endif
-}
-
-GLenum glErrorCheck(const char *msg, const char* file, int line)
-{
-// Skipped in release mode
-#ifndef NDEBUG
-    GLenum x = glGetError();
-
-    if (x != GL_NO_ERROR) {
-        char buf[1024];
-        sprintf(buf, "GL Error at: %s:%d Message: %s Error Code: %d \"%s\"\n", file, line, msg, x, gluErrorString(x));
-        AF_ERROR(buf, AF_ERR_INTERNAL);
-    }
-    return x;
-#else
-    return 0;
-#endif
-}
-
-GLenum glForceErrorCheck(const char *msg, const char* file, int line)
-{
-    GLenum x = glGetError();
-
-    if (x != GL_NO_ERROR) {
-        char buf[1024];
-        sprintf(buf, "GL Error at: %s:%d Message: %s Error Code: %d \"%s\"\n", file, line, msg, x, gluErrorString(x));
-        AF_ERROR(buf, AF_ERR_INTERNAL);
-    }
-    return x;
-}
-
-size_t getTypeSize(GLenum type)
-{
-    switch(type) {
-        case GL_FLOAT:          return sizeof(float);
-        case GL_INT:            return sizeof(int  );
-        case GL_UNSIGNED_INT:   return sizeof(unsigned);
-        case GL_BYTE:           return sizeof(char );
-        case GL_UNSIGNED_BYTE:  return sizeof(unsigned char);
-        default: return sizeof(float);
-    }
-}
-
-namespace graphics
-{
-
-ForgeManager& ForgeManager::getInstance()
-{
-    static ForgeManager my_instance;
-    return my_instance;
-}
-
-ForgeManager::~ForgeManager()
-{
-    destroyResources();
-}
-
-fg::Font* ForgeManager::getFont(const bool dontCreate)
-{
-    static bool flag = true;
-    static fg::Font* fnt = NULL;
-
-    CheckGL("Begin ForgeManager::getFont");
-
-    if (flag && !dontCreate) {
-        fnt = new fg::Font();
-#if defined(_WIN32) || defined(_MSC_VER)
-        fnt->loadSystemFont("Arial", 32);
-#else
-        fnt->loadSystemFont("Vera", 32);
-#endif
-        CheckGL("End ForgeManager::getFont");
-        flag = false;
-    };
-
-    return fnt;
-}
-
-fg::Window* ForgeManager::getMainWindow(const bool dontCreate)
-{
-    static bool flag = true;
-    static fg::Window* wnd = NULL;
-
-    // Define AF_DISABLE_GRAPHICS with any value to disable initialization
-    const char* noGraphicsENV = getenv("AF_DISABLE_GRAPHICS");
-    if(!noGraphicsENV) { // If AF_DISABLE_GRAPHICS is not defined
-        if (flag && !dontCreate) {
-            wnd = new fg::Window(WIDTH, HEIGHT, "ArrayFire", NULL, true);
-            CheckGL("End ForgeManager::getMainWindow");
-            flag = false;
-        };
-    }
-    return wnd;
-}
-
-fg::Image* ForgeManager::getImage(int w, int h, fg::ColorMode mode, GLenum type)
-{
-    /* w, h needs to fall in the range of [0, 2^16]
-     * for the ForgeManager to correctly retrieve
-     * the necessary Forge Image object. So, this implementation
-     * is a limitation on how big of an image can be rendered
-     * using arrayfire graphics funtionality */
-    assert(w <= 2ll<<16);
-    assert(h <= 2ll<<16);
-    long long key = ((w & _16BIT) << 16) | (h & _16BIT);
-    key = (((key << 16) | mode) << 16) | type;
-
-    ImgMapIter iter = mImgMap.find(key);
-    if (iter==mImgMap.end()) {
-        fg::Image* temp = new fg::Image(w, h, mode, type);
-        mImgMap[key] = temp;
-    }
-
-    return mImgMap[key];
-}
-
-fg::Plot* ForgeManager::getPlot(int nPoints, GLenum type)
-{
-    /* nPoints needs to fall in the range of [0, 2^48]
-     * for the ForgeManager to correctly retrieve
-     * the necessary Forge Plot object. So, this implementation
-     * is a limitation on how big of an plot graph can be rendered
-     * using arrayfire graphics funtionality */
-    assert(nPoints <= 2ll<<48);
-    long long key = ((nPoints & _48BIT) << 48) | (type & _16BIT);
-
-    PltMapIter iter = mPltMap.find(key);
-    if (iter==mPltMap.end()) {
-        fg::Plot* temp = new fg::Plot(nPoints, type);
-        mPltMap[key] = temp;
-    }
-
-    return mPltMap[key];
-}
-
-fg::Histogram* ForgeManager::getHistogram(int nBins, GLenum type)
-{
-    /* nBins needs to fall in the range of [0, 2^48]
-     * for the ForgeManager to correctly retrieve
-     * the necessary Forge Histogram object. So, this implementation
-     * is a limitation on how big of an histogram data can be rendered
-     * using arrayfire graphics funtionality */
-    assert(nBins <= 2ll<<48);
-    long long key = ((nBins & _48BIT) << 48) | (type & _16BIT);
-
-    HstMapIter iter = mHstMap.find(key);
-    if (iter==mHstMap.end()) {
-        fg::Histogram* temp = new fg::Histogram(nBins, type);
-        mHstMap[key] = temp;
-    }
-
-    return mHstMap[key];
-}
-
-void ForgeManager::destroyResources()
-{
-    /* clear all OpenGL resource objects (images, plots, histograms etc) first
-     * and then delete the windows */
-    for(ImgMapIter iter = mImgMap.begin(); iter != mImgMap.end(); iter++)
-        delete (iter->second);
-
-    for(PltMapIter iter = mPltMap.begin(); iter != mPltMap.end(); iter++)
-        delete (iter->second);
-
-    for(HstMapIter iter = mHstMap.begin(); iter != mHstMap.end(); iter++)
-        delete (iter->second);
-
-    delete getFont(true);
-    delete getMainWindow(true);
-}
-
-}
-
-#endif
diff --git a/src/api/c/graphics_common.hpp b/src/api/c/graphics_common.hpp
deleted file mode 100644
index 54984cb328..0000000000
--- a/src/api/c/graphics_common.hpp
+++ /dev/null
@@ -1,91 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-
-#if defined(WITH_GRAPHICS)
-
-#include <af/graphics.h>
-#include <forge.h>
-
-#include <map>
-
-// default to f32(float) type
-template<typename T>
-GLenum getGLType();
-
-// Print for OpenGL errors
-// Returns 1 if an OpenGL error occurred, 0 otherwise.
-GLenum glErrorSkip(const char *msg, const char* file, int line);
-GLenum glErrorCheck(const char *msg, const char* file, int line);
-GLenum glForceErrorCheck(const char *msg, const char* file, int line);
-
-#define CheckGL(msg)      glErrorCheck     (msg, __FILE__, __LINE__)
-#define ForceCheckGL(msg) glForceErrorCheck(msg, __FILE__, __LINE__)
-#define CheckGLSkip(msg)  glErrorSkip      (msg, __FILE__, __LINE__)
-
-namespace graphics
-{
-
-enum Defaults {
-    WIDTH = 1280,
-    HEIGHT= 720
-};
-
-static const long long _16BIT = 0x000000000000FFFF;
-static const long long _32BIT = 0x00000000FFFFFFFF;
-static const long long _48BIT = 0x0000FFFFFFFFFFFF;
-
-typedef std::map<long long, fg::Image*> ImageMap_t;
-typedef std::map<long long, fg::Plot*> PlotMap_t;
-typedef std::map<long long, fg::Histogram*> HistogramMap_t;
-
-typedef ImageMap_t::iterator ImgMapIter;
-typedef PlotMap_t::iterator PltMapIter;
-typedef HistogramMap_t::iterator HstMapIter;
-
-/**
- * ForgeManager class follows a single pattern. Any user of this class, has
- * to call ForgeManager::getInstance inorder to use Forge resources for rendering.
- * It manages the windows, and other renderables (given below) that are drawed
- * onto chosen window.
- * Renderables:
- *             fg::Image
- *             fg::Plot
- *             fg::Histogram
- * */
-class ForgeManager
-{
-    private:
-        ImageMap_t      mImgMap;
-        PlotMap_t       mPltMap;
-        HistogramMap_t  mHstMap;
-
-    public:
-        static ForgeManager& getInstance();
-        ~ForgeManager();
-
-        fg::Font* getFont(const bool dontCreate=false);
-        fg::Window* getMainWindow(const bool dontCreate=false);
-        fg::Image* getImage(int w, int h, fg::ColorMode mode, GLenum type);
-        fg::Plot* getPlot(int nPoints, GLenum type);
-        fg::Histogram* getHistogram(int nBins, GLenum type);
-
-    protected:
-        ForgeManager() {}
-        ForgeManager(ForgeManager const&);
-        void operator=(ForgeManager const&);
-        void destroyResources();
-};
-
-}
-
-#define MAIN_WINDOW graphics::ForgeManager::getInstance().getMainWindow(true)
-
-#endif
diff --git a/src/api/c/hamming.cpp b/src/api/c/hamming.cpp
index 8a1dd910e9..d38a1f0bf1 100644
--- a/src/api/c/hamming.cpp
+++ b/src/api/c/hamming.cpp
@@ -7,59 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <af/defines.h>
 #include <af/vision.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <hamming.hpp>
 
-using af::dim4;
-using namespace detail;
-
-template<typename T>
-static void hamming_matcher(af_array* idx, af_array* dist, const af_array query, const af_array train, const dim_t dist_dim, const uint n_dist)
-{
-    Array<uint> oIdxArray = createEmptyArray<uint>(af::dim4());
-    Array<uint> oDistArray = createEmptyArray<uint>(af::dim4());
-
-    hamming_matcher<T>(oIdxArray, oDistArray, getArray<T>(query), getArray<T>(train), dist_dim, n_dist);
-
-    *idx  = getHandle<uint>(oIdxArray);
-    *dist = getHandle<uint>(oDistArray);
-}
-
-af_err af_hamming_matcher(af_array* idx, af_array* dist, const af_array query, const af_array train, const dim_t dist_dim, const uint n_dist)
-{
-    try {
-        ArrayInfo qInfo = getInfo(query);
-        ArrayInfo tInfo = getInfo(train);
-        af_dtype qType  = qInfo.getType();
-        af_dtype tType  = tInfo.getType();
-        af::dim4 qDims  = qInfo.dims();
-        af::dim4 tDims  = tInfo.dims();
-
-        uint train_samples = (dist_dim == 0) ? 1 : 0;
-
-        DIM_ASSERT(3, qDims[dist_dim] == tDims[dist_dim]);
-        DIM_ASSERT(3, qDims[2] == 1 && qDims[3] == 1);
-        DIM_ASSERT(3, qType == tType);
-        DIM_ASSERT(4, tDims[2] == 1 && tDims[3] == 1);
-        DIM_ASSERT(5, (dist_dim == 0 || dist_dim == 1));
-        DIM_ASSERT(6, n_dist > 0 && n_dist <= (uint)tDims[train_samples]);
-
-        af_array oIdx;
-        af_array oDist;
-        switch(qType) {
-            case u8:  hamming_matcher<uchar>(&oIdx, &oDist, query, train, dist_dim, n_dist); break;
-            case u32: hamming_matcher<uint >(&oIdx, &oDist, query, train, dist_dim, n_dist); break;
-            default : TYPE_ERROR(1, qType);
-        }
-        std::swap(*idx, oIdx);
-        std::swap(*dist, oDist);
-    }
-    CATCHALL;
-
-    return AF_SUCCESS;
+af_err af_hamming_matcher(af_array* idx, af_array* dist, const af_array query,
+                          const af_array train, const dim_t dist_dim,
+                          const unsigned n_dist) {
+    return af_nearest_neighbour(idx, dist, query, train, dist_dim, n_dist,
+                                AF_SHD);
 }
diff --git a/src/api/c/handle.cpp b/src/api/c/handle.cpp
new file mode 100644
index 0000000000..d67f4ae9a1
--- /dev/null
+++ b/src/api/c/handle.cpp
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <handle.hpp>
+
+#include <backend.hpp>
+#include <platform.hpp>
+#include <sparse_handle.hpp>
+
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createDeviceDataArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+namespace arrayfire {
+
+af_array retain(const af_array in) {
+    const ArrayInfo &info = getInfo(in, false);
+    af_dtype ty           = info.getType();
+
+    if (info.isSparse()) {
+        switch (ty) {
+            case f32: return retainSparseHandle<float>(in);
+            case f64: return retainSparseHandle<double>(in);
+            case c32: return retainSparseHandle<detail::cfloat>(in);
+            case c64: return retainSparseHandle<detail::cdouble>(in);
+            default: TYPE_ERROR(1, ty);
+        }
+    } else {
+        switch (ty) {
+            case f32: return retainHandle<float>(in);
+            case f64: return retainHandle<double>(in);
+            case s32: return retainHandle<int>(in);
+            case u32: return retainHandle<uint>(in);
+            case s8: return retainHandle<schar>(in);
+            case u8: return retainHandle<uchar>(in);
+            case c32: return retainHandle<detail::cfloat>(in);
+            case c64: return retainHandle<detail::cdouble>(in);
+            case b8: return retainHandle<char>(in);
+            case s64: return retainHandle<intl>(in);
+            case u64: return retainHandle<uintl>(in);
+            case s16: return retainHandle<short>(in);
+            case u16: return retainHandle<ushort>(in);
+            case f16: return retainHandle<half>(in);
+            default: TYPE_ERROR(1, ty);
+        }
+    }
+}
+
+af_array createHandle(const dim4 &d, af_dtype dtype) {
+    // clang-format off
+    switch (dtype) {
+        case f32: return createHandle<float  >(d);
+        case c32: return createHandle<cfloat >(d);
+        case f64: return createHandle<double >(d);
+        case c64: return createHandle<cdouble>(d);
+        case b8:  return createHandle<char   >(d);
+        case s32: return createHandle<int    >(d);
+        case u32: return createHandle<uint   >(d);
+        case s8:  return createHandle<schar  >(d);
+        case u8:  return createHandle<uchar  >(d);
+        case s64: return createHandle<intl   >(d);
+        case u64: return createHandle<uintl  >(d);
+        case s16: return createHandle<short  >(d);
+        case u16: return createHandle<ushort >(d);
+        case f16: return createHandle<half   >(d);
+        default: TYPE_ERROR(3, dtype);
+    }
+    // clang-format on
+}
+
+af_array createHandleFromValue(const dim4 &d, double val, af_dtype dtype) {
+    // clang-format off
+    switch (dtype) {
+        case f32: return createHandleFromValue<float  >(d, val);
+        case c32: return createHandleFromValue<cfloat >(d, val);
+        case f64: return createHandleFromValue<double >(d, val);
+        case c64: return createHandleFromValue<cdouble>(d, val);
+        case b8:  return createHandleFromValue<char   >(d, val);
+        case s32: return createHandleFromValue<int    >(d, val);
+        case u32: return createHandleFromValue<uint   >(d, val);
+        case s8:  return createHandleFromValue<schar  >(d, val);
+        case u8:  return createHandleFromValue<uchar  >(d, val);
+        case s64: return createHandleFromValue<intl   >(d, val);
+        case u64: return createHandleFromValue<uintl  >(d, val);
+        case s16: return createHandleFromValue<short  >(d, val);
+        case u16: return createHandleFromValue<ushort >(d, val);
+        case f16: return createHandleFromValue<half   >(d, val);
+        default: TYPE_ERROR(3, dtype);
+    }
+    // clang-format on
+}
+
+af_array createHandleFromDeviceData(const af::dim4 &d, af_dtype dtype,
+                                    void *data) {
+    // clang-format off
+    switch (dtype) {
+        case f32: return getHandle(createDeviceDataArray<float  >(d, data, false));
+        case c32: return getHandle(createDeviceDataArray<cfloat >(d, data, false));
+        case f64: return getHandle(createDeviceDataArray<double >(d, data, false));
+        case c64: return getHandle(createDeviceDataArray<cdouble>(d, data, false));
+        case b8:  return getHandle(createDeviceDataArray<char   >(d, data, false));
+        case s32: return getHandle(createDeviceDataArray<int    >(d, data, false));
+        case u32: return getHandle(createDeviceDataArray<uint   >(d, data, false));
+        case s8:  return getHandle(createDeviceDataArray<schar  >(d, data, false));
+        case u8:  return getHandle(createDeviceDataArray<uchar  >(d, data, false));
+        case s64: return getHandle(createDeviceDataArray<intl   >(d, data, false));
+        case u64: return getHandle(createDeviceDataArray<uintl  >(d, data, false));
+        case s16: return getHandle(createDeviceDataArray<short  >(d, data, false));
+        case u16: return getHandle(createDeviceDataArray<ushort >(d, data, false));
+        case f16: return getHandle(createDeviceDataArray<half   >(d, data, false));
+        default: TYPE_ERROR(2, dtype);
+    }
+    // clang-format on
+}
+
+dim4 verifyDims(const unsigned ndims, const dim_t *const dims) {
+    DIM_ASSERT(1, ndims >= 1);
+
+    dim4 d(1, 1, 1, 1);
+
+    for (unsigned i = 0; i < ndims; i++) {
+        d[i] = dims[i];
+        DIM_ASSERT(2, dims[i] >= 1);
+    }
+
+    return d;
+}
+
+template<typename T>
+void releaseHandle(const af_array arr) {
+    auto &info     = getInfo(arr);
+    int old_device = detail::getActiveDeviceId();
+    int array_id   = info.getDevId();
+    if (array_id != old_device) {
+        detail::setDevice(array_id);
+        detail::destroyArray(static_cast<detail::Array<T> *>(arr));
+        detail::setDevice(old_device);
+    } else {
+        detail::destroyArray(static_cast<detail::Array<T> *>(arr));
+    }
+}
+
+template<typename T>
+detail::Array<T> &getCopyOnWriteArray(const af_array &arr) {
+    detail::Array<T> *A = static_cast<detail::Array<T> *>(arr);
+
+    if ((af_dtype)af::dtype_traits<T>::af_type != A->getType())
+        AF_ERROR("Invalid type for input array.", AF_ERR_INTERNAL);
+
+    ARG_ASSERT(0, A->isSparse() == false);
+
+    if (A->useCount() > 1) { *A = copyArray(*A); }
+
+    return *A;
+}
+
+#define INSTANTIATE(TYPE)                                  \
+    template void releaseHandle<TYPE>(const af_array arr); \
+    template detail::Array<TYPE> &getCopyOnWriteArray<TYPE>(const af_array &arr)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(uint);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(half);
+INSTANTIATE(schar);
+
+}  // namespace arrayfire
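
Note on getCopyOnWriteArray above: it hands back a writable array and copies the data only when useCount() reports more than one owner. Below is a minimal standalone sketch of that pattern, assuming std::shared_ptr as a stand-in for the reference-counted detail::Array. SharedBuffer and getWritable are hypothetical names used only for illustration and are not part of ArrayFire.

```cpp
#include <memory>
#include <vector>

// Hypothetical stand-in for detail::Array<T>: a handle that shares its
// storage with other handles through std::shared_ptr.
struct SharedBuffer {
    std::shared_ptr<std::vector<float>> data;

    // Number of handles currently sharing the storage.
    long useCount() const { return data.use_count(); }
};

// Returns storage that is safe to modify in place. If another handle shares
// the storage, this handle is first re-pointed at a private copy, which is
// the same decision getCopyOnWriteArray makes with `useCount() > 1`.
inline std::vector<float> &getWritable(SharedBuffer &handle) {
    if (handle.useCount() > 1) {
        handle.data = std::make_shared<std::vector<float>>(*handle.data);
    }
    return *handle.data;
}
```

Writing through getWritable on one of two handles that share storage leaves the other handle's data untouched. A handle that is already the sole owner is modified in place without a copy.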
diff --git a/src/api/c/handle.hpp b/src/api/c/handle.hpp
index beb8393907..b2e3df97cc 100644
--- a/src/api/c/handle.hpp
+++ b/src/api/c/handle.hpp
@@ -8,99 +8,130 @@
  ********************************************************/
 
 #pragma once
-#include <af/array.h>
 #include <Array.hpp>
 #include <backend.hpp>
-#include <err_common.hpp>
-#include <math.hpp>
+#include <common/err_common.hpp>
+#include <common/traits.hpp>
 #include <copy.hpp>
-#include <cast.hpp>
+#include <math.hpp>
+#include <types.hpp>
+
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+
+af_array retain(const af_array in);
+
+af::dim4 verifyDims(const unsigned ndims, const dim_t *const dims);
+
+af_array createHandle(const af::dim4 &d, af_dtype dtype);
+
+af_array createHandleFromValue(const af::dim4 &d, double val, af_dtype dtype);
+
+/// Creates an af_array handle from a memory handle on the device.
+///
+/// \param[in] d The shape of the new af_array
+/// \param[in] dtype The type of the new af_array
+/// \param[in] data The handle to the device memory
+/// \returns a new af_array with a view to the \p data pointer
+af_array createHandleFromDeviceData(const af::dim4 &d, af_dtype dtype,
+                                    void *data);
+
+namespace common {
+const ArrayInfo &getInfo(const af_array arr, bool sparse_check = true);
+
+template<typename To>
+detail::Array<To> castArray(const af_array &in);
+
+}  // namespace common
 
 template<typename T>
-static const detail::Array<T> &
-getArray(const af_array &arr)
-{
-    detail::Array<T> *A = reinterpret_cast<detail::Array<T>*>(arr);
+const detail::Array<T> &getArray(const af_array &arr) {
+    const detail::Array<T> *A = static_cast<const detail::Array<T> *>(arr);
+    if ((af_dtype)af::dtype_traits<T>::af_type != A->getType())
+        AF_ERROR("Invalid type for input array.", AF_ERR_INTERNAL);
+    checkAndMigrate(*const_cast<detail::Array<T> *>(A));
     return *A;
 }
 
-template<typename To>
-detail::Array<To> castArray(const af_array &in)
-{
-    using detail::cfloat;
-    using detail::cdouble;
-    using detail::uint;
-    using detail::uchar;
-
-    const ArrayInfo info = getInfo(in);
-    switch (info.getType()) {
-    case f32: return detail::cast<To, float  >(getArray<float  >(in));
-    case f64: return detail::cast<To, double >(getArray<double >(in));
-    case c32: return detail::cast<To, cfloat >(getArray<cfloat >(in));
-    case c64: return detail::cast<To, cdouble>(getArray<cdouble>(in));
-    case s32: return detail::cast<To, int    >(getArray<int    >(in));
-    case u32: return detail::cast<To, uint   >(getArray<uint   >(in));
-    case u8 : return detail::cast<To, uchar  >(getArray<uchar  >(in));
-    case b8 : return detail::cast<To, char   >(getArray<char   >(in));
-    case s64: return detail::cast<To, intl   >(getArray<intl   >(in));
-    case u64: return detail::cast<To, uintl  >(getArray<uintl  >(in));
-    default: TYPE_ERROR(1, info.getType());
-    }
+template<typename T>
+detail::Array<T> &getArray(af_array &arr) {
+    detail::Array<T> *A = static_cast<detail::Array<T> *>(arr);
+    if ((af_dtype)af::dtype_traits<T>::af_type != A->getType())
+        AF_ERROR("Invalid type for input array.", AF_ERR_INTERNAL);
+    checkAndMigrate(*A);
+    return *A;
 }
 
+/// Returns the use count
+///
+/// \note This function is kept separate because getArray cannot be called
+/// when the data was built on a different context; it deliberately skips the
+/// check-and-migrate step that getArray performs.
 template<typename T>
-static detail::Array<T> &
-getWritableArray(const af_array &arr)
-{
-    const detail::Array<T> &A = getArray<T>(arr);
-    return const_cast<detail::Array<T>&>(A);
+int getUseCount(const af_array &arr) {
+    detail::Array<T> *A = static_cast<detail::Array<T> *>(arr);
+    return A->useCount();
 }
 
 template<typename T>
-static af_array
-getHandle(const detail::Array<T> &A)
-{
-    detail::Array<T> *ret = detail::initArray<T>();
-    *ret = A;
-    af_array arr = reinterpret_cast<af_array>(ret);
-    return arr;
+af_array getHandle(const detail::Array<T> &A) {
+    detail::Array<T> *ret = new detail::Array<T>(A);
+    return static_cast<af_array>(ret);
 }
 
 template<typename T>
-static af_array createHandle(af::dim4 d)
-{
+af_array retainHandle(const af_array in) {
+    detail::Array<T> *A   = static_cast<detail::Array<T> *>(in);
+    detail::Array<T> *out = new detail::Array<T>(*A);
+    return static_cast<af_array>(out);
+}
+
+template<typename T>
+af_array createHandle(const af::dim4 &d) {
     return getHandle(detail::createEmptyArray<T>(d));
 }
 
 template<typename T>
-static af_array createHandleFromValue(af::dim4 d, double val)
-{
+af_array createHandleFromValue(const af::dim4 &d, double val) {
     return getHandle(detail::createValueArray<T>(d, detail::scalar<T>(val)));
 }
 
 template<typename T>
-static af_array createHandleFromData(af::dim4 d, const T * const data)
-{
+af_array createHandleFromData(const af::dim4 &d, const T *const data) {
     return getHandle(detail::createHostDataArray<T>(d, data));
 }
 
 template<typename T>
-static void copyData(T *data, const af_array &arr)
-{
+void copyData(T *data, const af_array &arr) {
     return detail::copyData(data, getArray<T>(arr));
 }
 
 template<typename T>
-static af_array copyArray(const af_array in)
-{
+af_array copyArray(const af_array in) {
     const detail::Array<T> &inArray = getArray<T>(in);
     return getHandle<T>(detail::copyArray<T>(inArray));
 }
 
 template<typename T>
-static void releaseHandle(const af_array arr)
-{
-    detail::destroyArray(reinterpret_cast<detail::Array<T>*>(arr));
-}
+void releaseHandle(const af_array arr);
 
-af_array retain(const af_array in);
+template<typename T>
+detail::Array<T> &getCopyOnWriteArray(const af_array &arr);
+
+}  // namespace arrayfire
+
+using arrayfire::copyArray;
+using arrayfire::copyData;
+using arrayfire::createHandle;
+using arrayfire::createHandleFromData;
+using arrayfire::createHandleFromValue;
+using arrayfire::getArray;
+using arrayfire::getHandle;
+using arrayfire::releaseHandle;
+using arrayfire::retain;
+using arrayfire::verifyDims;
+using arrayfire::common::castArray;
+using arrayfire::common::getInfo;
diff --git a/src/api/c/harris.cpp b/src/api/c/harris.cpp
new file mode 100644
index 0000000000..c55beb3fc5
--- /dev/null
+++ b/src/api/c/harris.cpp
@@ -0,0 +1,95 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <features.hpp>
+#include <handle.hpp>
+#include <harris.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/features.h>
+#include <af/vision.h>
+
+#include <cmath>
+
+using af::dim4;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using std::floor;
+
+template<typename T, typename convAccT>
+static af_features harris(af_array const &in, const unsigned max_corners,
+                          const float min_response, const float sigma,
+                          const unsigned filter_len, const float k_thr) {
+    Array<float> x     = createEmptyArray<float>(dim4());
+    Array<float> y     = createEmptyArray<float>(dim4());
+    Array<float> score = createEmptyArray<float>(dim4());
+
+    af_features_t feat;
+    feat.n = harris<T, convAccT>(x, y, score, getArray<T>(in), max_corners,
+                                 min_response, sigma, filter_len, k_thr);
+
+    Array<float> orientation = createValueArray<float>(feat.n, 0.0);
+    Array<float> size        = createValueArray<float>(feat.n, 1.0);
+
+    feat.x           = getHandle(x);
+    feat.y           = getHandle(y);
+    feat.score       = getHandle(score);
+    feat.orientation = getHandle(orientation);
+    feat.size        = getHandle(size);
+
+    return getFeaturesHandle(feat);
+}
+
+af_err af_harris(af_features *out, const af_array in,
+                 const unsigned max_corners, const float min_response,
+                 const float sigma, const unsigned block_size,
+                 const float k_thr) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        dim4 dims             = info.dims();
+        dim_t in_ndims        = dims.ndims();
+
+        unsigned filter_len = (block_size == 0)
+                                  ? static_cast<unsigned>(floor(6.f * sigma))
+                                  : block_size;
+        if (block_size == 0 && filter_len % 2 == 0) { filter_len--; }
+
+        const unsigned edge =
+            (block_size > 0) ? block_size / 2 : filter_len / 2;
+
+        DIM_ASSERT(1, (in_ndims == 2));
+        ARG_ASSERT(1, (dims[0] >= (dim_t)(2 * edge + 1) ||
+                       dims[1] >= (dim_t)(2 * edge + 1)));
+        ARG_ASSERT(3, (max_corners > 0) || (min_response > 0.0f));
+        ARG_ASSERT(7, (k_thr >= 0.01f));
+        // Upper limits for sigma and block_size come from the convolve2
+        // template, which caps the filter at 31 elements in OpenCL
+        ARG_ASSERT(4, (block_size > 2) || (sigma >= 0.5f && sigma <= 5.f));
+        ARG_ASSERT(5, (block_size <= 32));
+
+        af_dtype type = info.getType();
+        switch (type) {
+            case f64:
+                *out = harris<double, double>(in, max_corners, min_response,
+                                              sigma, filter_len, k_thr);
+                break;
+            case f32:
+                *out = harris<float, float>(in, max_corners, min_response,
+                                            sigma, filter_len, k_thr);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
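
Note on the filter-length selection in af_harris above: a non-zero block_size is used as given, while block_size == 0 derives the length from sigma as floor(6 * sigma) and then forces it to be odd. The sketch below isolates that arithmetic. harrisFilterLength and harrisEdge are hypothetical helper names, and sigma is assumed to already lie in the range the ARG_ASSERT checks accept (0.5 to 5).

```cpp
#include <cmath>

// Mirrors the selection in af_harris: a non-zero block_size is used as-is;
// otherwise the length is derived from sigma and decremented if it is even.
inline unsigned harrisFilterLength(unsigned block_size, float sigma) {
    unsigned filter_len = (block_size == 0)
                              ? static_cast<unsigned>(std::floor(6.f * sigma))
                              : block_size;
    if (block_size == 0 && filter_len % 2 == 0) { filter_len--; }
    return filter_len;
}

// Half-width of the filter. In af_harris this is block_size / 2 when
// block_size > 0 and filter_len / 2 otherwise; both equal filter_len / 2.
inline unsigned harrisEdge(unsigned block_size, float sigma) {
    return harrisFilterLength(block_size, sigma) / 2;
}
```

Only the sigma-derived length is forced to be odd; an even block_size passed by the caller is left unchanged.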
diff --git a/src/api/c/hist.cpp b/src/api/c/hist.cpp
index 8cf1ac16a4..0d8f9bfe6b 100644
--- a/src/api/c/hist.cpp
+++ b/src/api/c/hist.cpp
@@ -7,82 +7,155 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/graphics.h>
-#include <graphics_common.hpp>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
-#include <reduce.hpp>
-#include <cast.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <copy.hpp>
 #include <handle.hpp>
 #include <hist_graphics.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <af/graphics.h>
 
-using af::dim4;
-using namespace detail;
-
-#if defined(WITH_GRAPHICS)
-using namespace graphics;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::getGLType;
+using arrayfire::common::makeContextCurrent;
+using arrayfire::common::step_round;
+using detail::Array;
+using detail::copy_histogram;
+using detail::forgeManager;
+using detail::getScalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
 
 template<typename T>
-fg::Histogram* setup_histogram(const af_array in, const double minval, const double maxval)
-{
-    Array<T> histogramInput = getArray<T>(in);
-    dim_t nBins = histogramInput.elements();
-
-    T freqMax = detail::reduce_all<af_max_t, T, T>(histogramInput);
-
-    /* retrieve Forge Histogram with nBins and array type */
-    ForgeManager& fgMngr = ForgeManager::getInstance();
-    fg::Histogram* hist = fgMngr.getHistogram(nBins, getGLType<T>());
-    /* set histogram bar colors to orange */
-    hist->setBarColor(0.929f, 0.486f, 0.2745f);
-    /* set x axis limits to maximum and minimum values of data
-     * and y axis limits to range [0, nBins]*/
-    hist->setAxesLimits(maxval, minval, double(freqMax), 0.0f);
-    hist->setXAxisTitle("Bins");
-    hist->setYAxisTitle("Frequency");
+fg_chart setup_histogram(fg_window const window, const af_array in,
+                         const double minval, const double maxval,
+                         const af_cell* const props) {
+    ForgeModule& _ = forgePlugin();
+
+    const Array<T> histogramInput = getArray<T>(in);
+    dim_t nBins                   = histogramInput.elements();
+
+    // Retrieve the Forge manager used to look up charts and histograms
+    ForgeManager& fgMngr = forgeManager();
+
+    // Get the chart for the current grid position (if any)
+    fg_chart chart = NULL;
+    if (props->col > -1 && props->row > -1) {
+        chart = fgMngr.getChart(window, props->row, props->col, FG_CHART_2D);
+    } else {
+        chart = fgMngr.getChart(window, 0, 0, FG_CHART_2D);
+    }
+
+    // Create a histogram for the chart
+    fg_histogram hist = fgMngr.getHistogram(chart, nBins, getGLType<T>());
+
+    // Set histogram bar colors to ArrayFire's orange
+    FG_CHECK(_.fg_set_histogram_color(hist, 0.929f, 0.486f, 0.2745f, 1.0f));
+
+    // If chart axes limits do not have a manual override
+    // then compute and set axes limits
+    if (!fgMngr.getChartAxesOverride(chart)) {
+        float xMin, xMax, yMin, yMax, zMin, zMax;
+        FG_CHECK(_.fg_get_chart_axes_limits(&xMin, &xMax, &yMin, &yMax, &zMin,
+                                            &zMax, chart));
+        T freqMax =
+            getScalar<T>(detail::reduce_all<af_max_t, T, T>(histogramInput));
+
+        // For histogram, xMin and xMax should always be the first
+        // and last bin respectively and should not be rounded
+        if (xMin == 0 && xMax == 0 && yMin == 0 && yMax == 0) {
+            // No previous limits. Set without checking
+            xMin = static_cast<float>(minval);
+            xMax = static_cast<float>(maxval);
+            yMax = static_cast<float>(step_round(freqMax, true));
+            // For histogram, always set yMin to 0.
+            yMin = 0;
+        } else {
+            if (xMin > minval) {
+                xMin = static_cast<float>(minval);
+            }
+            if (xMax < maxval) {
+                xMax = static_cast<float>(maxval);
+            }
+            if (yMax < freqMax) {
+                yMax = static_cast<float>(step_round(freqMax, true));
+            }
+            // For histogram, always set yMin to 0.
+            yMin = 0;
+        }
+        FG_CHECK(_.fg_set_chart_axes_limits(chart, xMin, xMax, yMin, yMax, zMin,
+                                            zMax));
+    }
 
     copy_histogram<T>(histogramInput, hist);
 
-    return hist;
+    return chart;
 }
-#endif
-
-af_err af_draw_hist(const af_window wind, const af_array X, const double minval, const double maxval,
-                    const af_cell* const props)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
 
+af_err af_draw_hist(const af_window window, const af_array X,
+                    const double minval, const double maxval,
+                    const af_cell* const props) {
     try {
-        ArrayInfo Xinfo = getInfo(X);
-        af_dtype Xtype  = Xinfo.getType();
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& Xinfo = getInfo(X);
+        af_dtype Xtype         = Xinfo.getType();
 
         ARG_ASSERT(0, Xinfo.isVector());
 
-        fg::Window* window = reinterpret_cast<fg::Window*>(wind);
-        window->makeCurrent();
-        fg::Histogram* hist = NULL;
+        makeContextCurrent(window);
 
-        switch(Xtype) {
-            case f32: hist = setup_histogram<float  >(X, minval, maxval); break;
-            case s32: hist = setup_histogram<int    >(X, minval, maxval); break;
-            case u32: hist = setup_histogram<uint   >(X, minval, maxval); break;
-            case u8 : hist = setup_histogram<uchar  >(X, minval, maxval); break;
-            default:  TYPE_ERROR(1, Xtype);
+        fg_chart chart = NULL;
+
+        switch (Xtype) {
+            case f32:
+                chart =
+                    setup_histogram<float>(window, X, minval, maxval, props);
+                break;
+            case s32:
+                chart = setup_histogram<int>(window, X, minval, maxval, props);
+                break;
+            case u32:
+                chart = setup_histogram<uint>(window, X, minval, maxval, props);
+                break;
+            case s16:
+                chart =
+                    setup_histogram<short>(window, X, minval, maxval, props);
+                break;
+            case u16:
+                chart =
+                    setup_histogram<ushort>(window, X, minval, maxval, props);
+                break;
+            case s8:
+                chart =
+                    setup_histogram<schar>(window, X, minval, maxval, props);
+                break;
+            case u8:
+                chart =
+                    setup_histogram<uchar>(window, X, minval, maxval, props);
+                break;
+            default: TYPE_ERROR(1, Xtype);
         }
+        auto gridDims = forgeManager().getWindowGrid(window);
 
-        if (props->col>-1 && props->row>-1)
-            window->draw(props->col, props->row, *hist, props->title);
-        else
-            window->draw(*hist);
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
     }
     CATCHALL;
     return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
 }
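
Note on hist_equal in the src/api/c/histeq.cpp diff that follows: it turns the histogram into a cumulative sum, subtracts the minimum, scales by (grayLevels - 1) / (maxCdf - minCdf), and uses the result as a lookup table indexed by the input values. The sketch below shows the same arithmetic on plain std::vector data, assuming an inclusive running sum. equalizationTable and equalize are hypothetical names, and nothing here calls ArrayFire.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Builds the lookup table: running sum (CDF) of the bin counts, shifted by
// its minimum and scaled so the largest entry maps to the last gray level.
// Like the diff, this does not guard against maxCdf == minCdf.
inline std::vector<float> equalizationTable(const std::vector<unsigned> &hist) {
    std::vector<float> cdf(hist.begin(), hist.end());
    if (cdf.empty()) { return cdf; }
    std::partial_sum(cdf.begin(), cdf.end(), cdf.begin());

    const float minCdf = *std::min_element(cdf.begin(), cdf.end());
    const float maxCdf = *std::max_element(cdf.begin(), cdf.end());
    const float factor = static_cast<float>(cdf.size() - 1) / (maxCdf - minCdf);

    for (float &v : cdf) { v = (v - minCdf) * factor; }
    return cdf;
}

// Maps every pixel through the table. Assumption: each pixel value is a
// valid bin index, i.e. smaller than hist.size().
inline std::vector<unsigned char> equalize(
    const std::vector<unsigned char> &pixels,
    const std::vector<unsigned> &hist) {
    const std::vector<float> table = equalizationTable(hist);
    std::vector<unsigned char> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        out[i] = static_cast<unsigned char>(table[pixels[i]]);
    }
    return out;
}
```

The final cast truncates toward zero, as the cast back to the input type does for integer inputs in the diff.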
diff --git a/src/api/c/histeq.cpp b/src/api/c/histeq.cpp
index 1b14ae54b2..faed6a238c 100644
--- a/src/api/c/histeq.cpp
+++ b/src/api/c/histeq.cpp
@@ -7,80 +7,100 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
-#include <af/index.h>
-#include <af/data.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <cast.hpp>
-#include <scan.hpp>
 #include <arith.hpp>
-#include <reduce.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/moddims.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
 #include <lookup.hpp>
+#include <reduce.hpp>
+#include <scan.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/image.h>
+#include <af/index.h>
 
-using namespace detail;
+using af::dim4;
+using arrayfire::common::cast;
+using arrayfire::common::modDims;
+using detail::arithOp;
+using detail::Array;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::intl;
+using detail::lookup;
+using detail::reduce_all;
+using detail::scan;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T, typename hType>
-static af_array hist_equal(const af_array& in, const af_array& hist)
-{
+static af_array hist_equal(const af_array& in, const af_array& hist) {
     const Array<T> input = getArray<T>(in);
 
     af_array vInput = 0;
     AF_CHECK(af_flat(&vInput, in));
 
-    Array<float> fHist  = cast<float>(getArray<hType>(hist));
+    Array<float> fHist = cast<float>(getArray<hType>(hist));
 
-    dim4 hDims = fHist.dims();
-    dim_t grayLevels = fHist.elements();
+    const dim4& hDims = fHist.dims();
+    dim_t grayLevels  = fHist.elements();
 
     Array<float> cdf = scan<af_add_t, float, float>(fHist, 0);
 
-    float minCdf = reduce_all<af_min_t, float, float>(cdf);
-    float maxCdf = reduce_all<af_max_t, float, float>(cdf);
-    float factor = (float)(grayLevels-1)/(maxCdf - minCdf);
+    float minCdf = getScalar<float>(reduce_all<af_min_t, float, float>(cdf));
+    float maxCdf = getScalar<float>(reduce_all<af_max_t, float, float>(cdf));
+    float factor = static_cast<float>(grayLevels - 1) / (maxCdf - minCdf);
 
     // constant array of min value from cdf
     Array<float> minCnst = createValueArray<float>(hDims, minCdf);
     // constant array of factor variable
     Array<float> facCnst = createValueArray<float>(hDims, factor);
     // cdf(i) - min for all elements
-    Array<float> diff    = arithOp<float, af_sub_t>(cdf, minCnst, hDims);
+    Array<float> diff = arithOp<float, af_sub_t>(cdf, minCnst, hDims);
     // multiply factor with difference
     Array<float> normCdf = arithOp<float, af_mul_t>(diff, facCnst, hDims);
     // index input array with normalized cdf array
-    Array<float> idxArr  = lookup<float, T>(normCdf, getArray<T>(vInput), 0);
+    Array<float> idxArr = lookup<float, T>(normCdf, getArray<T>(vInput), 0);
 
     Array<T> result = cast<T>(idxArr);
-    result.modDims(input.dims());
+    result          = modDims(result, input.dims());
 
     AF_CHECK(af_release_array(vInput));
 
     return getHandle<T>(result);
 }
 
-af_err af_hist_equal(af_array *out, const af_array in, const af_array hist)
-{
+af_err af_hist_equal(af_array* out, const af_array in, const af_array hist) {
     try {
-        ArrayInfo dataInfo = getInfo(in);
-        ArrayInfo histInfo = getInfo(hist);
+        const ArrayInfo& dataInfo = getInfo(in);
+        const ArrayInfo& histInfo = getInfo(hist);
 
-        af_dtype dataType  = dataInfo.getType();
-        af::dim4 histDims  = histInfo.dims();
+        af_dtype dataType = dataInfo.getType();
+        af::dim4 histDims = histInfo.dims();
 
-        ARG_ASSERT(2, (histDims.ndims()==1));
+        ARG_ASSERT(2, (histDims.ndims() == 1));
 
         af_array output = 0;
-        switch(dataType) {
+        switch (dataType) {
             case f64: output = hist_equal<double, uint>(in, hist); break;
-            case f32: output = hist_equal<float , uint>(in, hist); break;
-            case s32: output = hist_equal<int   , uint>(in, hist); break;
-            case u32: output = hist_equal<uint  , uint>(in, hist); break;
-            case u8 : output = hist_equal<uchar , uint>(in, hist); break;
-            default : TYPE_ERROR(1, dataType);
+            case f32: output = hist_equal<float, uint>(in, hist); break;
+            case s32: output = hist_equal<int, uint>(in, hist); break;
+            case u32: output = hist_equal<uint, uint>(in, hist); break;
+            case s16: output = hist_equal<short, uint>(in, hist); break;
+            case u16: output = hist_equal<ushort, uint>(in, hist); break;
+            case s64: output = hist_equal<intl, uint>(in, hist); break;
+            case u64: output = hist_equal<uintl, uint>(in, hist); break;
+            case s8: output = hist_equal<schar, uint>(in, hist); break;
+            case u8: output = hist_equal<uchar, uint>(in, hist); break;
+            default: TYPE_ERROR(1, dataType);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
diff --git a/src/api/c/histogram.cpp b/src/api/c/histogram.cpp
index 2d5477e753..69c6d71de5 100644
--- a/src/api/c/histogram.cpp
+++ b/src/api/c/histogram.cpp
@@ -7,41 +7,89 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/image.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <histogram.hpp>
+#include <af/dim4.hpp>
+#include <af/image.h>
 
-using af::dim4;
-using namespace detail;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
-template<typename inType,typename outType>
-static inline af_array histogram(const af_array in, const unsigned &nbins,
-                                 const double &minval, const double &maxval)
-{
-    return getHandle(histogram<inType,outType>(getArray<inType>(in),nbins,minval,maxval));
+template<typename T>
+inline af_array histogram(const af_array in, const unsigned &nbins,
+                          const double &minval, const double &maxval,
+                          const bool islinear) {
+    return getHandle(
+        histogram<T>(getArray<T>(in), nbins, minval, maxval, islinear));
 }
 
-af_err af_histogram(af_array *out, const af_array in,
-                    const unsigned nbins, const double minval, const double maxval)
-{
+af_err af_histogram(af_array *out, const af_array in, const unsigned nbins,
+                    const double minval, const double maxval) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type  = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
 
         af_array output;
-        switch(type) {
-            case f32: output = histogram<float , uint>(in, nbins, minval, maxval); break;
-            case f64: output = histogram<double, uint>(in, nbins, minval, maxval); break;
-            case b8 : output = histogram<char  , uint>(in, nbins, minval, maxval); break;
-            case s32: output = histogram<int   , uint>(in, nbins, minval, maxval); break;
-            case u32: output = histogram<uint  , uint>(in, nbins, minval, maxval); break;
-            case u8 : output = histogram<uchar , uint>(in, nbins, minval, maxval); break;
-            default : TYPE_ERROR(1, type);
+        switch (type) {
+            case f32:
+                output = histogram<float>(in, nbins, minval, maxval,
+                                          info.isLinear());
+                break;
+            case f64:
+                output = histogram<double>(in, nbins, minval, maxval,
+                                           info.isLinear());
+                break;
+            case b8:
+                output =
+                    histogram<char>(in, nbins, minval, maxval, info.isLinear());
+                break;
+            case s32:
+                output =
+                    histogram<int>(in, nbins, minval, maxval, info.isLinear());
+                break;
+            case u32:
+                output =
+                    histogram<uint>(in, nbins, minval, maxval, info.isLinear());
+                break;
+            case s16:
+                output = histogram<short>(in, nbins, minval, maxval,
+                                          info.isLinear());
+                break;
+            case u16:
+                output = histogram<ushort>(in, nbins, minval, maxval,
+                                           info.isLinear());
+                break;
+            case s64:
+                output =
+                    histogram<intl>(in, nbins, minval, maxval, info.isLinear());
+                break;
+            case u64:
+                output = histogram<uintl>(in, nbins, minval, maxval,
+                                          info.isLinear());
+                break;
+            case s8:
+                output = histogram<schar>(in, nbins, minval, maxval,
+                                          info.isLinear());
+                break;
+            case u8:
+                output = histogram<uchar>(in, nbins, minval, maxval,
+                                          info.isLinear());
+                break;
+            case f16:
+                output = histogram<arrayfire::common::half>(
+                    in, nbins, minval, maxval, info.isLinear());
+                break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
diff --git a/src/api/c/homography.cpp b/src/api/c/homography.cpp
new file mode 100644
index 0000000000..9d6f0f9a39
--- /dev/null
+++ b/src/api/c/homography.cpp
@@ -0,0 +1,104 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <homography.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/random.h>
+#include <af/vision.h>
+
+#include <utility>
+
+using af::dim4;
+using detail::Array;
+using detail::createEmptyArray;
+using std::swap;
+
+template<typename T>
+static inline void homography(af_array& H, int& inliers, const af_array x_src,
+                              const af_array y_src, const af_array x_dst,
+                              const af_array y_dst,
+                              const af_homography_type htype,
+                              const float inlier_thr,
+                              const unsigned iterations) {
+    Array<T> bestH = createEmptyArray<T>(af::dim4(3, 3));
+    af_array initial;
+    unsigned d    = (iterations + 256 - 1) / 256;
+    dim_t rdims[] = {4, d * 256};
+    AF_CHECK(af_randu(&initial, 2, rdims, f32));
+    inliers =
+        homography<T>(bestH, getArray<float>(x_src), getArray<float>(y_src),
+                      getArray<float>(x_dst), getArray<float>(y_dst),
+                      getArray<float>(initial), htype, inlier_thr, iterations);
+    AF_CHECK(af_release_array(initial));
+
+    H = getHandle<T>(bestH);
+}
+
+af_err af_homography(af_array* H, int* inliers, const af_array x_src,
+                     const af_array y_src, const af_array x_dst,
+                     const af_array y_dst, const af_homography_type htype,
+                     const float inlier_thr, const unsigned iterations,
+                     const af_dtype otype) {
+    try {
+        const ArrayInfo& xsinfo = getInfo(x_src);
+        const ArrayInfo& ysinfo = getInfo(y_src);
+        const ArrayInfo& xdinfo = getInfo(x_dst);
+        const ArrayInfo& ydinfo = getInfo(y_dst);
+
+        af::dim4 xsdims = xsinfo.dims();
+        af::dim4 ysdims = ysinfo.dims();
+        af::dim4 xddims = xdinfo.dims();
+        af::dim4 yddims = ydinfo.dims();
+
+        af_dtype xstype = xsinfo.getType();
+        af_dtype ystype = ysinfo.getType();
+        af_dtype xdtype = xdinfo.getType();
+        af_dtype ydtype = ydinfo.getType();
+
+        if (xstype != f32) { TYPE_ERROR(1, xstype); }
+        if (ystype != f32) { TYPE_ERROR(2, ystype); }
+        if (xdtype != f32) { TYPE_ERROR(3, xdtype); }
+        if (ydtype != f32) { TYPE_ERROR(4, ydtype); }
+
+        ARG_ASSERT(1, (xsdims[0] > 0));
+        ARG_ASSERT(2, (ysdims[0] == xsdims[0]));
+        ARG_ASSERT(3, (xddims[0] > 0));
+        ARG_ASSERT(4, (yddims[0] == xddims[0]));
+
+        ARG_ASSERT(5, (inlier_thr >= 0.1f));
+        ARG_ASSERT(6, (iterations > 0));
+        ARG_ASSERT(
+            7, (htype == AF_HOMOGRAPHY_RANSAC || htype == AF_HOMOGRAPHY_LMEDS));
+
+        af_array outH;
+        int outInl;
+
+        switch (otype) {
+            case f32:
+                homography<float>(outH, outInl, x_src, y_src, x_dst, y_dst,
+                                  htype, inlier_thr, iterations);
+                break;
+            case f64:
+                homography<double>(outH, outInl, x_src, y_src, x_dst, y_dst,
+                                   htype, inlier_thr, iterations);
+                break;
+            default: TYPE_ERROR(1, otype);
+        }
+        swap(*H, outH);
+        swap(*inliers, outInl);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
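
Review note on the sample buffer in `homography<T>` above: the wrapper rounds `iterations` up to the next multiple of 256 before requesting the `4 x (d * 256)` random array from `af_randu`. The diff does not state why 256 is used. The sketch below only reproduces the ceiling-division arithmetic; the names `kBlock` and `paddedIterations` are hypothetical and are not part of ArrayFire.

```cpp
// Hypothetical helper (not an ArrayFire API): round `iterations` up to the
// next multiple of kBlock, mirroring `(iterations + 256 - 1) / 256` followed
// by `d * 256` in the wrapper above.
constexpr unsigned kBlock = 256;

constexpr unsigned paddedIterations(unsigned iterations) {
    // Integer ceiling division, then scale back up to a whole block count.
    return ((iterations + kBlock - 1) / kBlock) * kBlock;
}
```

Under this arithmetic, any request from 1 to 256 iterations allocates 256 columns of random samples, and 257 allocates 512.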
diff --git a/src/api/c/hsv_rgb.cpp b/src/api/c/hsv_rgb.cpp
index 9188e03e60..4661a255cc 100644
--- a/src/api/c/hsv_rgb.cpp
+++ b/src/api/c/hsv_rgb.cpp
@@ -1,49 +1,52 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <hsv_rgb.hpp>
 #include <af/defines.h>
 #include <af/dim4.hpp>
 #include <af/image.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <hsv_rgb.hpp>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::hsv2rgb;
+using detail::rgb2hsv;
 
 template<typename T, bool isHSV2RGB>
-static af_array convert(const af_array& in)
-{
+static af_array convert(const af_array& in) {
     const Array<T> input = getArray<T>(in);
     if (isHSV2RGB) {
         return getHandle<T>(hsv2rgb<T>(input));
-    }
-    else {
+    } else {
         return getHandle<T>(rgb2hsv<T>(input));
     }
 }
 
 template<bool isHSV2RGB>
-af_err convert(af_array* out, const af_array& in)
-{
+af_err convert(af_array* out, const af_array& in) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype iType = info.getType();
-        af::dim4 inputDims = info.dims();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype iType        = info.getType();
+        af::dim4 inputDims    = info.dims();
+
+        if (info.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, iType);
+        }
 
         ARG_ASSERT(1, (inputDims.ndims() >= 3));
 
         af_array output = 0;
         switch (iType) {
             case f64: output = convert<double, isHSV2RGB>(in); break;
-            case f32: output = convert<float , isHSV2RGB>(in); break;
+            case f32: output = convert<float, isHSV2RGB>(in); break;
             default: TYPE_ERROR(1, iType); break;
         }
         std::swap(*out, output);
@@ -52,12 +55,10 @@ af_err convert(af_array* out, const af_array& in)
     return AF_SUCCESS;
 }
 
-af_err af_hsv2rgb(af_array* out, const af_array in)
-{
+af_err af_hsv2rgb(af_array* out, const af_array in) {
     return convert<true>(out, in);
 }
 
-af_err af_rgb2hsv(af_array* out, const af_array in)
-{
+af_err af_rgb2hsv(af_array* out, const af_array in) {
     return convert<false>(out, in);
 }
diff --git a/src/api/c/iir.cpp b/src/api/c/iir.cpp
index 640d9a51c2..2c56011cc2 100644
--- a/src/api/c/iir.cpp
+++ b/src/api/c/iir.cpp
@@ -6,54 +6,53 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/signal.h>
-#include <af/arith.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
 #include <convolve.hpp>
+#include <handle.hpp>
 #include <iir.hpp>
+#include <af/arith.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/signal.h>
 
 #include <cstdio>
 
 using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
 
-af_err af_fir(af_array *y, const af_array b, const af_array x)
-{
+af_err af_fir(af_array* y, const af_array b, const af_array x) {
     try {
         af_array out;
         AF_CHECK(af_convolve1(&out, x, b, AF_CONV_EXPAND, AF_CONV_AUTO));
 
-        dim4 xdims = getInfo(x).dims();
+        dim4 xdims    = getInfo(x).dims();
         af_seq seqs[] = {af_span, af_span, af_span, af_span};
-        seqs[0].begin = 0;
-        seqs[0].end = xdims[0] - 1;
-        seqs[0].step = 1;
+        seqs[0].begin = 0.;
+        seqs[0].end   = static_cast<double>(xdims[0]) - 1.;
+        seqs[0].step  = 1.;
         af_array res;
         AF_CHECK(af_index(&res, out, 4, seqs));
+        AF_CHECK(af_release_array(out));
         std::swap(*y, res);
-
-    } CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
 template<typename T>
-inline static af_array iir(const af_array b, const af_array a, const af_array x)
-{
-    return getHandle(iir<T>(getArray<T>(b),
-                            getArray<T>(a),
-                            getArray<T>(x)));
+inline static af_array iir(const af_array b, const af_array a,
+                           const af_array x) {
+    return getHandle(iir<T>(getArray<T>(b), getArray<T>(a), getArray<T>(x)));
 }
 
-af_err af_iir(af_array *y, const af_array b, const af_array a, const af_array x)
-{
+af_err af_iir(af_array* y, const af_array b, const af_array a,
+              const af_array x) {
     try {
-        ArrayInfo ainfo = getInfo(a);
-        ArrayInfo binfo = getInfo(b);
-        ArrayInfo xinfo = getInfo(x);
+        const ArrayInfo& ainfo = getInfo(a);
+        const ArrayInfo& binfo = getInfo(b);
+        const ArrayInfo& xinfo = getInfo(x);
 
         af_dtype xtype = xinfo.getType();
 
@@ -65,6 +64,8 @@ af_err af_iir(af_array *y, const af_array b, const af_array a, const af_array x)
         dim4 bdims = binfo.dims();
         dim4 xdims = xinfo.dims();
 
+        if (xinfo.ndims() == 0) { return af_retain_array(y, x); }
+
         if (xinfo.ndims() > 1) {
             if (binfo.ndims() > 1) {
                 for (int i = 1; i < 3; i++) {
@@ -84,14 +85,15 @@ af_err af_iir(af_array *y, const af_array b, const af_array a, const af_array x)
 
         af_array res;
         switch (xtype) {
-        case f32: res = iir<float  >(b, a, x); break;
-        case f64: res = iir<double >(b, a, x); break;
-        case c32: res = iir<cfloat >(b, a, x); break;
-        case c64: res = iir<cdouble>(b, a, x); break;
-        default: TYPE_ERROR(1, xtype);
+            case f32: res = iir<float>(b, a, x); break;
+            case f64: res = iir<double>(b, a, x); break;
+            case c32: res = iir<cfloat>(b, a, x); break;
+            case c64: res = iir<cdouble>(b, a, x); break;
+            default: TYPE_ERROR(1, xtype);
         }
 
         std::swap(*y, res);
-    } CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
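
Review note on `af_fir` above: it runs an expanding convolution (`AF_CONV_EXPAND`) and then indexes the first `xdims[0]` samples along dimension 0, so the result has the same length as the input signal. A minimal host-side sketch of the same idea follows, assuming 1-D `double` data in `std::vector`; `convolveExpand` and `firSketch` are hypothetical names for illustration, not ArrayFire functions.

```cpp
#include <cstddef>
#include <vector>

// Full ("expand") 1-D convolution: the output has x.size() + b.size() - 1
// samples, or is empty if either input is empty.
std::vector<double> convolveExpand(const std::vector<double>& x,
                                   const std::vector<double>& b) {
    if (x.empty() || b.empty()) { return {}; }
    std::vector<double> out(x.size() + b.size() - 1, 0.0);
    for (std::size_t i = 0; i < x.size(); ++i) {
        for (std::size_t j = 0; j < b.size(); ++j) {
            out[i + j] += x[i] * b[j];
        }
    }
    return out;
}

// FIR filtering as "convolve, then keep the first x.size() samples",
// analogous to the seqs[0] range [0, xdims[0] - 1] in the diff.
std::vector<double> firSketch(const std::vector<double>& b,
                              const std::vector<double>& x) {
    std::vector<double> full = convolveExpand(x, b);
    full.resize(x.size());
    return full;
}
```

In the sketch the intermediate vector is freed automatically. The C API version has to release the intermediate handle explicitly, which is what the added `af_release_array(out)` call does.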
diff --git a/src/api/c/image.cpp b/src/api/c/image.cpp
index d51ac9c0cf..4650c0ec3d 100644
--- a/src/api/c/image.cpp
+++ b/src/api/c/image.cpp
@@ -7,234 +7,120 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-
+#include <af/data.h>
 #include <af/graphics.h>
 #include <af/image.h>
 #include <af/index.h>
-#include <af/data.h>
 
-#include <ArrayInfo.hpp>
-#include <graphics_common.hpp>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
-#include <image.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
 #include <handle.hpp>
+#include <image.hpp>
+#include <join.hpp>
+#include <platform.hpp>
 #include <reorder.hpp>
 #include <tile.hpp>
-#include <join.hpp>
 
-#include <iostream>
+#include <limits>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::getGLType;
+using arrayfire::common::makeContextCurrent;
+using detail::arithOp;
+using detail::Array;
+using detail::copy_image;
+using detail::createValueArray;
+using detail::forgeManager;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+template<typename T>
+Array<T> normalizePerType(const Array<T>& in) {
+    Array<float> inFloat = cast<float, T>(in);
+
+    Array<float> cnst = createValueArray<float>(in.dims(), 1.0 - 1.0e-6f);
 
-#if defined(WITH_GRAPHICS)
-using namespace graphics;
+    Array<float> scaled = arithOp<float, af_mul_t>(inFloat, cnst, in.dims());
+
+    return cast<T, float>(scaled);
+}
+
+template<>
+Array<float> normalizePerType<float>(const Array<float>& in) {
+    return in;
+}
 
 template<typename T>
-static fg::Image* convert_and_copy_image(const af_array in)
-{
-    ArrayInfo info      = getInfo(in);
-    const Array<T> _in  = getArray<T>(in);
-    dim4 inDims = _in.dims();
+static fg_image convert_and_copy_image(const af_array in) {
+    const Array<T> _in = getArray<T>(in);
+    dim4 inDims        = _in.dims();
 
-    dim4 rdims = (inDims[2]>1 ? dim4(2, 1, 0, 3) : dim4(1, 0, 2, 3));
+    dim4 rdims = (inDims[2] > 1 ? dim4(2, 1, 0, 3) : dim4(1, 0, 2, 3));
 
     Array<T> imgData = reorder(_in, rdims);
 
-    ForgeManager& fgMngr = ForgeManager::getInstance();
-
-    fg::Image* ret_val = fgMngr.getImage(inDims[1], inDims[0], (fg::ColorMode)inDims[2], getGLType<T>());
+    ForgeManager& fgMngr = forgeManager();
 
-    copy_image<T>(imgData, ret_val);
+    // The inDims[2] * 100 is a hack to convert to fg_channel_format
+    // TODO(pradeep): Write a proper conversion function
+    fg_image ret_val = fgMngr.getImage(
+        inDims[1], inDims[0], static_cast<fg_channel_format>(inDims[2] * 100),
+        getGLType<T>());
+    copy_image<T>(normalizePerType<T>(imgData), ret_val);
 
     return ret_val;
 }
-#endif
-
-af_err af_draw_image(const af_window wind, const af_array in, const af_cell* const props)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
 
+af_err af_draw_image(const af_window window, const af_array in,
+                     const af_cell* const props) {
     try {
-        ArrayInfo info = getInfo(in);
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& info = getInfo(in);
 
         af::dim4 in_dims = info.dims();
         af_dtype type    = info.getType();
         DIM_ASSERT(0, in_dims[2] == 1 || in_dims[2] == 3 || in_dims[2] == 4);
         DIM_ASSERT(0, in_dims[3] == 1);
 
-        fg::Window* window = reinterpret_cast<fg::Window*>(wind);
-        window->makeCurrent();
-        fg::Image* image = NULL;
+        makeContextCurrent(window);
+        fg_image image = NULL;
 
-        switch(type) {
+        switch (type) {
             case f32: image = convert_and_copy_image<float>(in); break;
-            case b8 : image = convert_and_copy_image<char >(in); break;
-            case s32: image = convert_and_copy_image<int  >(in); break;
-            case u32: image = convert_and_copy_image<uint >(in); break;
-            case u8 : image = convert_and_copy_image<uchar>(in); break;
-            default:  TYPE_ERROR(1, type);
-        }
-
-        window->setColorMap((fg::ColorMap)props->cmap);
-        if (props->col>-1 && props->row>-1)
-            window->draw(props->col, props->row, *image, props->title);
-        else
-            window->draw(*image);
-    }
-    CATCHALL;
-
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_create_window(af_window *out, const int width, const int height, const char* const title)
-{
-#if defined(WITH_GRAPHICS)
-    fg::Window* wnd;
-    try {
-        graphics::ForgeManager& fgMngr = graphics::ForgeManager::getInstance();
-        fg::Window* mainWnd = NULL;
-
-        try {
-            mainWnd = fgMngr.getMainWindow();
-        } catch(...) {
-            std::cerr<<"OpenGL context creation failed"<<std::endl;
+            case b8: image = convert_and_copy_image<char>(in); break;
+            case s32: image = convert_and_copy_image<int>(in); break;
+            case u32: image = convert_and_copy_image<uint>(in); break;
+            case s16: image = convert_and_copy_image<short>(in); break;
+            case u16: image = convert_and_copy_image<ushort>(in); break;
+            case s8: image = convert_and_copy_image<schar>(in); break;
+            case u8: image = convert_and_copy_image<uchar>(in); break;
+            default: TYPE_ERROR(1, type);
         }
 
-        if(mainWnd==0) {
-            std::cerr<<"Not a valid window"<<std::endl;
-            return AF_SUCCESS;
+        ForgeModule& _ = forgePlugin();
+        auto gridDims  = forgeManager().getWindowGrid(window);
+        FG_CHECK(_.fg_set_window_colormap(window, (fg_color_map)props->cmap));
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_image_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, image, props->title,
+                true));
+        } else {
+            FG_CHECK(_.fg_draw_image(window, image, true));
         }
-
-        wnd = new fg::Window(width, height, title, mainWnd);
-        wnd->setFont(fgMngr.getFont());
     }
     CATCHALL;
-    *out = reinterpret_cast<af_window>(wnd);
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_set_position(const af_window wind, const unsigned x, const unsigned y)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        wnd->setPos(x, y);
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
 
-af_err af_set_title(const af_window wind, const char* const title)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        wnd->setTitle(title);
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_grid(const af_window wind, const int rows, const int cols)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        wnd->grid(rows, cols);
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_show(const af_window wind)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        wnd->draw();
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_is_window_closed(bool *out, const af_window wind)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        *out = wnd->close();
-    }
-    CATCHALL;
-    return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
-}
-
-af_err af_destroy_window(const af_window wind)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
-    }
-
-    try {
-        fg::Window* wnd = reinterpret_cast<fg::Window*>(wind);
-        delete wnd;
-    }
-    CATCHALL;
     return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
 }
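
Review note on the grid call in `af_draw_image` above: the cell passed to `fg_draw_image_to_cell` is computed as `props->row * gridDims.second + props->col`, which is a row-major flattening. The sketch below assumes `gridDims` is ordered as (rows, columns); that ordering is inferred from how the pair is used in the diff and is not confirmed here. `cellIndex` is a hypothetical helper.

```cpp
#include <utility>

// Hypothetical helper: flatten (row, col) into a row-major cell index for a
// grid described as (rows, cols). Only the column count affects the result.
inline int cellIndex(std::pair<int, int> gridDims, int row, int col) {
    return row * gridDims.second + col;
}
```

For a 2 x 3 grid this maps (0, 0) to cell 0 and (1, 2) to cell 5, so cells are numbered left to right, then top to bottom.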
diff --git a/src/api/c/imageio.cpp b/src/api/c/imageio.cpp
index 2f670aa428..be5f528922 100644
--- a/src/api/c/imageio.cpp
+++ b/src/api/c/imageio.cpp
@@ -9,152 +9,209 @@
 
 #if defined(WITH_FREEIMAGE)
 
-#include <af/array.h>
-#include <af/arith.h>
+#include "imageio_helper.h"
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
 #include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/array.h>
 #include <af/blas.h>
 #include <af/data.h>
+#include <af/dim4.hpp>
 #include <af/image.h>
 #include <af/index.h>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <traits.hpp>
-#include <memory.hpp>
 
-#include <FreeImage.h>
-#include <string>
-#include <cstring>
+#include <common/DependencyModule.hpp>
+
 #include <cstdio>
 #include <cstdlib>
+#include <cstring>
+#include <memory>
+#include <string>
 
 using af::dim4;
-using namespace detail;
-
-class FI_Manager
-{
-    public:
-    static bool initialized;
-    FI_Manager()
-    {
-#ifdef FREEIMAGE_LIB
-        FreeImage_Initialise();
-#endif
-        initialized = true;
-    }
-
-    ~FI_Manager()
-    {
-#ifdef FREEIMAGE_LIB
-        FreeImage_DeInitialise();
-#endif
-    }
-};
-
-bool FI_Manager::initialized = false;
-
-static void FI_Init()
-{
-    static FI_Manager manager = FI_Manager();
-}
+using arrayfire::AFFI_GRAY;
+using arrayfire::AFFI_RGB;
+using arrayfire::AFFI_RGBA;
+using arrayfire::bitmap_ptr;
+using arrayfire::channel_split;
+using arrayfire::FI_CHANNELS;
+using arrayfire::FreeImage_Module;
+using arrayfire::FreeImageErrorHandler;
+using arrayfire::getFreeImagePlugin;
+using arrayfire::make_bitmap_ptr;
+using detail::pinnedAlloc;
+using detail::pinnedFree;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::string;
+using std::swap;
 
-// Helpers
-void FreeImageErrorHandler(FREE_IMAGE_FORMAT oFif, const char* zMessage);
-
-typedef unsigned short ushort;
-
-// Error handler for FreeImage library.
-// In case this handler is invoked, it throws an af exception.
-void FreeImageErrorHandler(FREE_IMAGE_FORMAT oFif, const char* zMessage)
-{
-    printf("FreeImage Error Handler: %s\n", zMessage);
-}
-
-//  Split a MxNx3 image into 3 separate channel matrices.
-//  Produce 3 channels if needed
-static af_err channel_split(const af_array rgb, const af::dim4 &dims,
-                            af_array *outr, af_array *outg, af_array *outb, af_array *outa)
-{
-    try {
-        af_seq idx[4][3] = {{af_span, af_span, {0, 0, 1}},
-                            {af_span, af_span, {1, 1, 1}},
-                            {af_span, af_span, {2, 2, 1}},
-                            {af_span, af_span, {3, 3, 1}}
-                           };
-
-        if (dims[2] == 4) {
-            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
-            AF_CHECK(af_index(outg, rgb, dims.ndims(), idx[1]));
-            AF_CHECK(af_index(outb, rgb, dims.ndims(), idx[2]));
-            AF_CHECK(af_index(outa, rgb, dims.ndims(), idx[3]));
-        } else if (dims[2] == 3) {
-            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
-            AF_CHECK(af_index(outg, rgb, dims.ndims(), idx[1]));
-            AF_CHECK(af_index(outb, rgb, dims.ndims(), idx[2]));
-        } else {
-            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
-        }
-    } CATCHALL;
-    return AF_SUCCESS;
-}
+namespace arrayfire {
 
-template<typename T, int fi_color, int fo_color>
-static af_err readImage(af_array *rImage, const uchar* pSrcLine, const int nSrcPitch,
-                        const uint fi_w, const uint fi_h)
-{
+template<typename T, FI_CHANNELS fi_color, FI_CHANNELS fo_color>
+static af_err readImage(af_array* rImage, const uchar* pSrcLine,
+                        const int nSrcPitch, const uint fi_w, const uint fi_h) {
     // create an array to receive the loaded image data.
     AF_CHECK(af_init());
-    float *pDst = pinnedAlloc<float>(fi_w * fi_h * 4); // 4 channels is max
+    auto* pDst   = pinnedAlloc<float>(fi_w * fi_h * 4);  // 4 channels is max
     float* pDst0 = pDst;
     float* pDst1 = pDst + (fi_w * fi_h * 1);
     float* pDst2 = pDst + (fi_w * fi_h * 2);
     float* pDst3 = pDst + (fi_w * fi_h * 3);
 
-    int offR = 2; int offG = 1; int offB = 0; int offA = 3;
-    if (fo_color == 3 && fi_color == 1) {       //Convert gray to color
-        offG = 0; offR = 0;
-    }
     uint indx = 0;
     uint step = fi_color;
 
     for (uint x = 0; x < fi_w; ++x) {
         for (uint y = 0; y < fi_h; ++y) {
-            const T *src = (T*)(pSrcLine - y * nSrcPitch);
-                               pDst2[indx] = (float) *(src + (x * step + offB));
-            if (fo_color >= 3) pDst1[indx] = (float) *(src + (x * step + offG));
-            if (fo_color >= 3) pDst0[indx] = (float) *(src + (x * step + offR));
-            if (fo_color == 4) pDst3[indx] = (float) *(src + (x * step + offA));
+            const T* src = reinterpret_cast<const T*>(pSrcLine - y * nSrcPitch);
+            if (fo_color == 1) {
+                pDst0[indx] = static_cast<T>(*(src + (x * step)));
+            } else if (fo_color >= 3) {
+                if (static_cast<af_dtype>(af::dtype_traits<T>::af_type) == u8) {
+                    pDst0[indx] =
+                        static_cast<float>(*(src + (x * step + FI_RGBA_RED)));
+                    pDst1[indx] =
+                        static_cast<float>(*(src + (x * step + FI_RGBA_GREEN)));
+                    pDst2[indx] =
+                        static_cast<float>(*(src + (x * step + FI_RGBA_BLUE)));
+                    if (fo_color == 4) {
+                        pDst3[indx] = static_cast<float>(
+                            *(src + (x * step + FI_RGBA_ALPHA)));
+                    }
+                } else {
+                    // Non 8-bit types do not use ordering
+                    // See Pixel Access Functions Chapter in FreeImage Doc
+                    pDst0[indx] = static_cast<float>(*(src + (x * step + 0)));
+                    pDst1[indx] = static_cast<float>(*(src + (x * step + 1)));
+                    pDst2[indx] = static_cast<float>(*(src + (x * step + 2)));
+                    if (fo_color == 4) {
+                        pDst3[indx] =
+                            static_cast<float>(*(src + (x * step + 3)));
+                    }
+                }
+            }
             indx++;
         }
     }
 
-    // TODO
     af::dim4 dims(fi_h, fi_w, fo_color, 1);
-    af_err err = af_create_array(rImage, pDst, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<float>::af_type);
+    af_err err = af_create_array(rImage, pDst, dims.ndims(), dims.get(),
+                                 (af_dtype)af::dtype_traits<float>::af_type);
     pinnedFree(pDst);
     return err;
 }
 
-template<typename T, int fo_color>
-static af_err readImage(af_array *rImage, const uchar* pSrcLine, const int nSrcPitch,
-                        const uint fi_w, const uint fi_h)
-{
+#ifdef FREEIMAGE_STATIC
+// NOTE: Redefine the MODULE_FUNCTION_INIT macro to call the static functions
+// instead of dynamically loaded symbols in case we are building with a static
+// FreeImage library
+#undef MODULE_FUNCTION_INIT
+#define MODULE_FUNCTION_INIT(NAME) NAME = &::NAME
+
+FreeImage_Module::FreeImage_Module() : module(nullptr, nullptr) {
+    // We don't care if the module loaded if we are statically linking against
+    // FreeImage
+    ::FreeImage_Initialise(false);
+#else
+FreeImage_Module::FreeImage_Module() : module("freeimage", nullptr) {
+    if (!module.isLoaded()) {
+        string error_message =
+            "Error loading FreeImage: " +
+            common::DependencyModule::getErrorMessage() +
+            "\nFreeImage or one of its dependencies failed to "
+            "load. Try installing FreeImage or check if FreeImage is in the "
+            "search path.";
+        AF_ERROR(error_message.c_str(), AF_ERR_LOAD_LIB);
+    }
+#endif
+    MODULE_FUNCTION_INIT(FreeImage_Allocate);
+    MODULE_FUNCTION_INIT(FreeImage_AllocateT);
+    MODULE_FUNCTION_INIT(FreeImage_CloseMemory);
+    MODULE_FUNCTION_INIT(FreeImage_DeInitialise);
+    MODULE_FUNCTION_INIT(FreeImage_FIFSupportsReading);
+    MODULE_FUNCTION_INIT(FreeImage_GetBPP);
+    MODULE_FUNCTION_INIT(FreeImage_GetBits);
+    MODULE_FUNCTION_INIT(FreeImage_GetColorType);
+    MODULE_FUNCTION_INIT(FreeImage_GetFIFFromFilename);
+    MODULE_FUNCTION_INIT(FreeImage_GetFileType);
+    MODULE_FUNCTION_INIT(FreeImage_GetFileTypeFromMemory);
+    MODULE_FUNCTION_INIT(FreeImage_GetHeight);
+    MODULE_FUNCTION_INIT(FreeImage_GetImageType);
+    MODULE_FUNCTION_INIT(FreeImage_GetPitch);
+    MODULE_FUNCTION_INIT(FreeImage_GetWidth);
+    MODULE_FUNCTION_INIT(FreeImage_Initialise);
+    MODULE_FUNCTION_INIT(FreeImage_Load);
+    MODULE_FUNCTION_INIT(FreeImage_LoadFromMemory);
+    MODULE_FUNCTION_INIT(FreeImage_OpenMemory);
+    MODULE_FUNCTION_INIT(FreeImage_Save);
+    MODULE_FUNCTION_INIT(FreeImage_SaveToMemory);
+    MODULE_FUNCTION_INIT(FreeImage_SeekMemory);
+    MODULE_FUNCTION_INIT(FreeImage_SetOutputMessage);
+    MODULE_FUNCTION_INIT(FreeImage_Unload);
+
+#ifndef FREEIMAGE_STATIC
+    if (!module.symbolsLoaded()) {
+        string error_message =
+            "Error loading FreeImage: " +
+            common::DependencyModule::getErrorMessage() +
+            "\nThe installed version of FreeImage is not compatible with "
+            "ArrayFire. Please create an issue that includes this error message.";
+        AF_ERROR(error_message.c_str(), AF_ERR_LOAD_LIB);
+    }
+#endif
+}
+
+FreeImage_Module::~FreeImage_Module() {  // NOLINT(hicpp-use-equals-default,
+                                         // modernize-use-equals-default)
+#ifdef FREEIMAGE_STATIC
+    getFreeImagePlugin().FreeImage_DeInitialise();
+#endif
+}
+
+FreeImage_Module& getFreeImagePlugin() {
+    static auto* plugin = new FreeImage_Module();
+    return *plugin;
+}
+
+bitmap_ptr make_bitmap_ptr(FIBITMAP* ptr) {
+    return bitmap_ptr(ptr, getFreeImagePlugin().FreeImage_Unload);
+}
+
+template<typename T, FI_CHANNELS fo_color>
+static af_err readImage(af_array* rImage, const uchar* pSrcLine,
+                        const int nSrcPitch, const uint fi_w, const uint fi_h) {
     // create an array to receive the loaded image data.
     AF_CHECK(af_init());
-    float *pDst = pinnedAlloc<float>(fi_w * fi_h);
+    auto* pDst = pinnedAlloc<float>(fi_w * fi_h);
 
     uint indx = 0;
     uint step = nSrcPitch / (fi_w * sizeof(T));
     T r, g, b;
     for (uint x = 0; x < fi_w; ++x) {
         for (uint y = 0; y < fi_h; ++y) {
-            const T *src = (T*)(pSrcLine - y * nSrcPitch);
+            const T* src = reinterpret_cast<const T*>(pSrcLine - y * nSrcPitch);
             if (fo_color == 1) {
-                pDst[indx] = (float) *(src + (x * step));
-            } else if (fo_color >=3) {
-                r = (float) *(src + (x * step + 2));
-                g = (float) *(src + (x * step + 1));
-                b = (float) *(src + (x * step + 0));
+                pDst[indx] = static_cast<float>(*(src + (x * step)));
+            } else if (fo_color >= 3) {
+                if (static_cast<af_dtype>(af::dtype_traits<T>::af_type) == u8) {
+                    r = *(src + (x * step + FI_RGBA_RED));
+                    g = *(src + (x * step + FI_RGBA_GREEN));
+                    b = *(src + (x * step + FI_RGBA_BLUE));
+                } else {
+                    // Non 8-bit types do not use ordering
+                    // See Pixel Access Functions Chapter in FreeImage Doc
+                    r = *(src + (x * step + 0));
+                    g = *(src + (x * step + 1));
+                    b = *(src + (x * step + 2));
+                }
                 pDst[indx] = r * 0.2989f + g * 0.5870f + b * 0.1140f;
             }
             indx++;
@@ -162,153 +219,678 @@ static af_err readImage(af_array *rImage, const uchar* pSrcLine, const int nSrcP
     }
 
     af::dim4 dims(fi_h, fi_w, 1, 1);
-    af_err err = af_create_array(rImage, pDst, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<float>::af_type);
+    af_err err = af_create_array(rImage, pDst, dims.ndims(), dims.get(),
+                                 (af_dtype)af::dtype_traits<float>::af_type);
     pinnedFree(pDst);
     return err;
 }
 
-/// Load a gray-scale image from disk.
-AFAPI af_err af_load_image(af_array *out, const char* filename, const bool isColor)
-{
+}  // namespace arrayfire
+
+////////////////////////////////////////////////////////////////////////////////
+// File IO
+////////////////////////////////////////////////////////////////////////////////
+// Load image from disk.
+af_err af_load_image(af_array* out, const char* filename, const bool isColor) {
+    using arrayfire::readImage;
     try {
         ARG_ASSERT(1, filename != NULL);
 
-        // for statically linked FI
-        FI_Init();
+        FreeImage_Module& _ = getFreeImagePlugin();
 
         // set your own FreeImage error handler
-        FreeImage_SetOutputMessage(FreeImageErrorHandler);
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
 
         // try to guess the file format from the file extension
-        FREE_IMAGE_FORMAT fif = FreeImage_GetFileType(filename);
+        FREE_IMAGE_FORMAT fif = _.FreeImage_GetFileType(filename, 0);
         if (fif == FIF_UNKNOWN) {
-            fif = FreeImage_GetFIFFromFilename(filename);
+            fif = _.FreeImage_GetFIFFromFilename(filename);
         }
 
-        if(fif == FIF_UNKNOWN) {
-            AF_ERROR("FreeImage Error: Unknown File or Filetype", AF_ERR_NOT_SUPPORTED);
+        if (fif == FIF_UNKNOWN) {
+            AF_ERROR("FreeImage Error: Unknown File or Filetype",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_ACCURATE);
+        }
+#ifdef JPEG_GREYSCALE
+        if (fif == FIF_JPEG && !isColor) {
+            flags = flags | static_cast<unsigned>(JPEG_GREYSCALE);
+        }
+#endif
+
         // check that the plugin has reading capabilities ...
-        FIBITMAP* pBitmap = NULL;
-        if (FreeImage_FIFSupportsReading(fif)) {
-            pBitmap = FreeImage_Load(fif, filename);
+        bitmap_ptr pBitmap = make_bitmap_ptr(NULL);
+        if (_.FreeImage_FIFSupportsReading(fif)) {
+            pBitmap.reset(
+                _.FreeImage_Load(fif, filename, static_cast<int>(flags)));
         }
 
-        if(pBitmap == NULL) {
-            AF_ERROR("FreeImage Error: Error reading image or file does not exist", AF_ERR_RUNTIME);
+        if (pBitmap == NULL) {
+            AF_ERROR(
+                "FreeImage Error: Error reading image or file does not exist",
+                AF_ERR_RUNTIME);
         }
 
         // check image color type
-        uint color_type = FreeImage_GetColorType(pBitmap);
-        const uint fi_bpp = FreeImage_GetBPP(pBitmap);
-        //int fi_color = (int)((fi_bpp / 8.0) + 0.5);        //ceil
-        int fi_color;
-        if      (color_type == 1) fi_color = 1;
-        else if (color_type == 2) fi_color = 3;
-        else if (color_type == 4) fi_color = 4;
-        else                      fi_color = 3;
-        const int fi_bpc = fi_bpp / fi_color;
-        if(fi_bpc != 8 && fi_bpc != 16 && fi_bpc != 32) {
-            AF_ERROR("FreeImage Error: Bits per channel not supported", AF_ERR_NOT_SUPPORTED);
+        uint color_type   = _.FreeImage_GetColorType(pBitmap.get());
+        const uint fi_bpp = _.FreeImage_GetBPP(pBitmap.get());
+        // int fi_color = (int)((fi_bpp / 8.0) + 0.5);        //ceil
+        uint fi_color;
+        switch (color_type) {
+            case 0:  // FIC_MINISBLACK
+            case 1:  // FIC_MINISWHITE
+                fi_color = 1;
+                break;
+            case 2:  // FIC_PALETTE
+            case 3:  // FIC_RGB
+                fi_color = 3;
+                break;
+            case 4:  // FIC_RGBALPHA
+            case 5:  // FIC_CMYK
+                fi_color = 4;
+                break;
+            default:  // Should not come here
+                fi_color = 3;
+                break;
+        }
+
+        const uint fi_bpc = fi_bpp / fi_color;
+        if (fi_bpc != 8 && fi_bpc != 16 && fi_bpc != 32) {
+            AF_ERROR("FreeImage Error: Bits per channel not supported",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
+        // data type
+        FREE_IMAGE_TYPE image_type = _.FreeImage_GetImageType(pBitmap.get());
+
         // sizes
-        uint fi_w = FreeImage_GetWidth(pBitmap);
-        uint fi_h = FreeImage_GetHeight(pBitmap);
+        uint fi_w = _.FreeImage_GetWidth(pBitmap.get());
+        uint fi_h = _.FreeImage_GetHeight(pBitmap.get());
 
         // FI = row major | AF = column major
-        uint nSrcPitch = FreeImage_GetPitch(pBitmap);
-        const uchar* pSrcLine = FreeImage_GetBits(pBitmap) + nSrcPitch * (fi_h - 1);
+        uint nSrcPitch = _.FreeImage_GetPitch(pBitmap.get());
+        const uchar* pSrcLine =
+            _.FreeImage_GetBits(pBitmap.get()) + nSrcPitch * (fi_h - 1);
 
         // result image
         af_array rImage;
         if (isColor) {
-            if(fi_color == 4) {     //4 channel image
-                if(fi_bpc == 8)
-                    AF_CHECK((readImage<uchar, 4, 4>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 16)
-                    AF_CHECK((readImage<ushort, 4, 4>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 32)
-                    AF_CHECK((readImage<float, 4, 4>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
+            if (fi_color == 4) {  // 4 channel image
+                if (fi_bpc == 8) {
+                    AF_CHECK((readImage<uchar, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                } else if (fi_bpc == 16) {
+                    AF_CHECK(
+                        (readImage<ushort, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                  pSrcLine,
+                                                                  nSrcPitch,
+                                                                  fi_w, fi_h));
+                } else if (fi_bpc == 32) {
+                    switch (image_type) {
+                        case FIT_UINT32:
+                            AF_CHECK((readImage<uint, AFFI_RGBA,
+                                                AFFI_RGBA>)(&rImage, pSrcLine,
+                                                            nSrcPitch, fi_w,
+                                                            fi_h));
+                            break;
+                        case FIT_INT32:
+                            AF_CHECK((
+                                readImage<int, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                            break;
+                        case FIT_FLOAT:
+                            AF_CHECK((readImage<float, AFFI_RGBA,
+                                                AFFI_RGBA>)(&rImage, pSrcLine,
+                                                            nSrcPitch, fi_w,
+                                                            fi_h));
+                            break;
+                        default:
+                            AF_ERROR("FreeImage Error: Unknown image type",
+                                     AF_ERR_NOT_SUPPORTED);
+                            break;
+                    }
+                }
             } else if (fi_color == 1) {
-                if(fi_bpc == 8)
-                    AF_CHECK((readImage<uchar, 1, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 16)
-                    AF_CHECK((readImage<ushort, 1, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 32)
-                    AF_CHECK((readImage<float, 1, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-            } else {             //3 channel image
-                if(fi_bpc == 8)
-                    AF_CHECK((readImage<uchar, 3, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 16)
-                    AF_CHECK((readImage<ushort, 3, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 32)
-                    AF_CHECK((readImage<float, 3, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
+                if (fi_bpc == 8) {
+                    AF_CHECK((readImage<uchar, AFFI_GRAY, AFFI_RGB>)(&rImage,
+                                                                     pSrcLine,
+                                                                     nSrcPitch,
+                                                                     fi_w,
+                                                                     fi_h));
+                } else if (fi_bpc == 16) {
+                    AF_CHECK((readImage<ushort, AFFI_GRAY, AFFI_RGB>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                } else if (fi_bpc == 32) {
+                    switch (image_type) {
+                        case FIT_UINT32:
+                            AF_CHECK((
+                                readImage<uint, AFFI_GRAY, AFFI_RGB>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                            break;
+                        case FIT_INT32:
+                            AF_CHECK(
+                                (readImage<int, AFFI_GRAY, AFFI_RGB>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                            break;
+                        case FIT_FLOAT:
+                            AF_CHECK((readImage<float, AFFI_GRAY,
+                                                AFFI_RGB>)(&rImage, pSrcLine,
+                                                           nSrcPitch, fi_w,
+                                                           fi_h));
+                            break;
+                        default:
+                            AF_ERROR("FreeImage Error: Unknown image type",
+                                     AF_ERR_NOT_SUPPORTED);
+                            break;
+                    }
+                }
+            } else {  // 3 channel image
+                if (fi_bpc == 8) {
+                    AF_CHECK((
+                        readImage<uchar, AFFI_RGB, AFFI_RGB>)(&rImage, pSrcLine,
+                                                              nSrcPitch, fi_w,
+                                                              fi_h));
+                } else if (fi_bpc == 16) {
+                    AF_CHECK((readImage<ushort, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                     pSrcLine,
+                                                                     nSrcPitch,
+                                                                     fi_w,
+                                                                     fi_h));
+                } else if (fi_bpc == 32) {
+                    switch (image_type) {
+                        case FIT_UINT32:
+                            AF_CHECK(
+                                (readImage<uint, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                            break;
+                        case FIT_INT32:
+                            AF_CHECK(
+                                (readImage<int, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                     pSrcLine,
+                                                                     nSrcPitch,
+                                                                     fi_w,
+                                                                     fi_h));
+                            break;
+                        case FIT_FLOAT:
+                            AF_CHECK((
+                                readImage<float, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                      pSrcLine,
+                                                                      nSrcPitch,
+                                                                      fi_w,
+                                                                      fi_h));
+                            break;
+                        default:
+                            AF_ERROR("FreeImage Error: Unknown image type",
+                                     AF_ERR_NOT_SUPPORTED);
+                            break;
+                    }
+                }
             }
-        } else {                    //output gray irrespective
-            if(fi_color == 1) {     //4 channel image
-                if(fi_bpc == 8)
-                    AF_CHECK((readImage<uchar, 1>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 16)
-                    AF_CHECK((readImage<ushort, 1>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 32)
-                    AF_CHECK((readImage<float, 1>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
+        } else {                  // output gray irrespective
+            if (fi_color == 1) {  // 1 channel image
+                if (fi_bpc == 8) {
+                    AF_CHECK((readImage<uchar, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                           nSrcPitch, fi_w,
+                                                           fi_h));
+                } else if (fi_bpc == 16) {
+                    AF_CHECK((readImage<ushort, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                            nSrcPitch, fi_w,
+                                                            fi_h));
+                } else if (fi_bpc == 32) {
+                    switch (image_type) {
+                        case FIT_UINT32:
+                            AF_CHECK((readImage<uint, AFFI_GRAY>)(&rImage,
+                                                                  pSrcLine,
+                                                                  nSrcPitch,
+                                                                  fi_w, fi_h));
+                            break;
+                        case FIT_INT32:
+                            AF_CHECK((readImage<int, AFFI_GRAY>)(&rImage,
+                                                                 pSrcLine,
+                                                                 nSrcPitch,
+                                                                 fi_w, fi_h));
+                            break;
+                        case FIT_FLOAT:
+                            AF_CHECK((readImage<float, AFFI_GRAY>)(&rImage,
+                                                                   pSrcLine,
+                                                                   nSrcPitch,
+                                                                   fi_w, fi_h));
+                            break;
+                        default:
+                            AF_ERROR("FreeImage Error: Unknown image type",
+                                     AF_ERR_NOT_SUPPORTED);
+                            break;
+                    }
+                }
             } else if (fi_color == 3 || fi_color == 4) {
-                if(fi_bpc == 8)
-                    AF_CHECK((readImage<uchar, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 16)
-                    AF_CHECK((readImage<ushort, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
-                else if(fi_bpc == 32)
-                    AF_CHECK((readImage<float, 3>)(&rImage, pSrcLine, nSrcPitch, fi_w, fi_h));
+                if (fi_bpc == 8) {
+                    AF_CHECK((readImage<uchar, AFFI_RGB>)(&rImage, pSrcLine,
+                                                          nSrcPitch, fi_w,
+                                                          fi_h));
+                } else if (fi_bpc == 16) {
+                    AF_CHECK((readImage<ushort, AFFI_RGB>)(&rImage, pSrcLine,
+                                                           nSrcPitch, fi_w,
+                                                           fi_h));
+                } else if (fi_bpc == 32) {
+                    switch (image_type) {
+                        case FIT_UINT32:
+                            AF_CHECK((readImage<uint, AFFI_RGB>)(&rImage,
+                                                                 pSrcLine,
+                                                                 nSrcPitch,
+                                                                 fi_w, fi_h));
+                            break;
+                        case FIT_INT32:
+                            AF_CHECK((readImage<int, AFFI_RGB>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+                            break;
+                        case FIT_FLOAT:
+                            AF_CHECK((readImage<float, AFFI_RGB>)(&rImage,
+                                                                  pSrcLine,
+                                                                  nSrcPitch,
+                                                                  fi_w, fi_h));
+                            break;
+                        default:
+                            AF_ERROR("FreeImage Error: Unknown image type",
+                                     AF_ERR_NOT_SUPPORTED);
+                            break;
+                    }
+                }
             }
         }
 
-        FreeImage_Unload(pBitmap);
-        std::swap(*out,rImage);
-    } CATCHALL;
+        swap(*out, rImage);
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
 // Save an image to disk.
-af_err af_save_image(const char* filename, const af_array in_)
-{
+af_err af_save_image(const char* filename, const af_array in_) {
     try {
-
         ARG_ASSERT(0, filename != NULL);
 
-        FI_Init();
+        FreeImage_Module& _ = getFreeImagePlugin();
 
         // set your own FreeImage error handler
-        FreeImage_SetOutputMessage(FreeImageErrorHandler);
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
 
         // try to guess the file format from the file extension
-        FREE_IMAGE_FORMAT fif = FreeImage_GetFileType(filename);
+        FREE_IMAGE_FORMAT fif = _.FreeImage_GetFileType(filename, 0);
         if (fif == FIF_UNKNOWN) {
-            fif = FreeImage_GetFIFFromFilename(filename);
+            fif = _.FreeImage_GetFIFFromFilename(filename);
         }
 
-        if(fif == FIF_UNKNOWN) {
+        if (fif == FIF_UNKNOWN) {
             AF_ERROR("FreeImage Error: Unknown Filetype", AF_ERR_NOT_SUPPORTED);
         }
 
-        ArrayInfo info = getInfo(in_);
+        const ArrayInfo& info = getInfo(in_);
         // check image color type
         uint channels = info.dims()[2];
         DIM_ASSERT(1, channels <= 4);
         DIM_ASSERT(1, channels != 2);
 
-        int fi_bpp = channels * 8;
+        uint fi_bpp = channels * 8;
 
         // sizes
         uint fi_w = info.dims()[1];
         uint fi_h = info.dims()[0];
 
         // create the result image storage using FreeImage
-        FIBITMAP* pResultBitmap = FreeImage_Allocate(fi_w, fi_h, fi_bpp);
-        if(pResultBitmap == NULL) {
-            AF_ERROR("FreeImage Error: Error creating image or file", AF_ERR_RUNTIME);
+        bitmap_ptr pResultBitmap = make_bitmap_ptr(_.FreeImage_Allocate(
+            fi_w, fi_h, static_cast<int>(fi_bpp), 0, 0, 0));
+        if (pResultBitmap == NULL) {
+            AF_ERROR("FreeImage Error: Error creating image or file",
+                     AF_ERR_RUNTIME);
+        }
+
+        // FI assumes [0-255]
+        // If array is in 0-1 range, multiply by 255
+        af_array in;
+        double max_real, max_imag;
+        bool free_in = false;
+        AF_CHECK(af_max_all(&max_real, &max_imag, in_));
+        if (max_real <= 1) {
+            af_array c255 = 0;
+            AF_CHECK(af_constant(&c255, 255.0, info.ndims(), info.dims().get(),
+                                 f32));
+            AF_CHECK(af_mul(&in, in_, c255, false));
+            AF_CHECK(af_release_array(c255));
+            free_in = true;
+        } else if (max_real < 256) {  // NOLINT(bugprone-branch-clone)
+            in = in_;
+        } else if (max_real < 65536) {
+            af_array c257 = 0;
+            AF_CHECK(af_constant(&c257, 257.0, info.ndims(), info.dims().get(),
+                                 f32));
+            AF_CHECK(af_div(&in, in_, c257, false));
+            AF_CHECK(af_release_array(c257));
+            free_in = true;
+        } else {
+            in = (in_);
+        }
+
+        // FI = row major | AF = column major
+        uint nDstPitch = _.FreeImage_GetPitch(pResultBitmap.get());
+        uchar* pDstLine =
+            _.FreeImage_GetBits(pResultBitmap.get()) + nDstPitch * (fi_h - 1);
+        af_array rr = 0, gg = 0, bb = 0, aa = 0;
+        AF_CHECK(channel_split(in, info.dims(), &rr, &gg, &bb,
+                               &aa));  // convert array to 3 channels if needed
+
+        uint step = channels;  // bytes per pixel: one uchar per channel
+        uint indx = 0;
+
+        af_array rrT = 0, ggT = 0, bbT = 0, aaT = 0;
+        if (channels == 4) {
+            AF_CHECK(af_transpose(&rrT, rr, false));
+            AF_CHECK(af_transpose(&ggT, gg, false));
+            AF_CHECK(af_transpose(&bbT, bb, false));
+            AF_CHECK(af_transpose(&aaT, aa, false));
+
+            const ArrayInfo& cinfo = getInfo(rrT);
+
+            auto* pSrc0 = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc1 = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc2 = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc3 = pinnedAlloc<float>(cinfo.elements());
+
+            AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
+            AF_CHECK(af_get_data_ptr((void*)pSrc1, ggT));
+            AF_CHECK(af_get_data_ptr((void*)pSrc2, bbT));
+            AF_CHECK(af_get_data_ptr((void*)pSrc3, aaT));
+
+            // Copy the array into FreeImage buffer
+            for (uint y = 0; y < fi_h; ++y) {
+                for (uint x = 0; x < fi_w; ++x) {
+                    *(pDstLine + x * step + FI_RGBA_RED) =
+                        static_cast<uchar>(pSrc0[indx]);  // r
+                    *(pDstLine + x * step + FI_RGBA_GREEN) =
+                        static_cast<uchar>(pSrc1[indx]);  // g
+                    *(pDstLine + x * step + FI_RGBA_BLUE) =
+                        static_cast<uchar>(pSrc2[indx]);  // b
+                    *(pDstLine + x * step + FI_RGBA_ALPHA) =
+                        static_cast<uchar>(pSrc3[indx]);  // a
+                    ++indx;
+                }
+                pDstLine -= nDstPitch;
+            }
+            pinnedFree(pSrc0);
+            pinnedFree(pSrc1);
+            pinnedFree(pSrc2);
+            pinnedFree(pSrc3);
+        } else if (channels == 3) {
+            AF_CHECK(af_transpose(&rrT, rr, false));
+            AF_CHECK(af_transpose(&ggT, gg, false));
+            AF_CHECK(af_transpose(&bbT, bb, false));
+
+            const ArrayInfo& cinfo = getInfo(rrT);
+
+            auto* pSrc0 = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc1 = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc2 = pinnedAlloc<float>(cinfo.elements());
+
+            AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
+            AF_CHECK(af_get_data_ptr((void*)pSrc1, ggT));
+            AF_CHECK(af_get_data_ptr((void*)pSrc2, bbT));
+
+            // Copy the array into FreeImage buffer
+            for (uint y = 0; y < fi_h; ++y) {
+                for (uint x = 0; x < fi_w; ++x) {
+                    *(pDstLine + x * step + FI_RGBA_RED) =
+                        static_cast<uchar>(pSrc0[indx]);  // r
+                    *(pDstLine + x * step + FI_RGBA_GREEN) =
+                        static_cast<uchar>(pSrc1[indx]);  // g
+                    *(pDstLine + x * step + FI_RGBA_BLUE) =
+                        static_cast<uchar>(pSrc2[indx]);  // b
+                    ++indx;
+                }
+                pDstLine -= nDstPitch;
+            }
+            pinnedFree(pSrc0);
+            pinnedFree(pSrc1);
+            pinnedFree(pSrc2);
+        } else {
+            AF_CHECK(af_transpose(&rrT, rr, false));
+            const ArrayInfo& cinfo = getInfo(rrT);
+            auto* pSrc0            = pinnedAlloc<float>(cinfo.elements());
+            AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
+
+            for (uint y = 0; y < fi_h; ++y) {
+                for (uint x = 0; x < fi_w; ++x) {
+                    *(pDstLine + x * step) = static_cast<uchar>(pSrc0[indx]);
+                    ++indx;
+                }
+                pDstLine -= nDstPitch;
+            }
+            pinnedFree(pSrc0);
+        }
+
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_QUALITYSUPERB);
+        }
+
+        // now save the result image
+        if (_.FreeImage_Save(fif, pResultBitmap.get(), filename,
+                             static_cast<int>(flags)) == FALSE) {
+            AF_ERROR("FreeImage Error: Failed to save image", AF_ERR_RUNTIME);
+        }
+
+        if (free_in) { AF_CHECK(af_release_array(in)); }
+        if (rr != 0) { AF_CHECK(af_release_array(rr)); }
+        if (gg != 0) { AF_CHECK(af_release_array(gg)); }
+        if (bb != 0) { AF_CHECK(af_release_array(bb)); }
+        if (aa != 0) { AF_CHECK(af_release_array(aa)); }
+        if (rrT != 0) { AF_CHECK(af_release_array(rrT)); }
+        if (ggT != 0) { AF_CHECK(af_release_array(ggT)); }
+        if (bbT != 0) { AF_CHECK(af_release_array(bbT)); }
+        if (aaT != 0) { AF_CHECK(af_release_array(aaT)); }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Memory IO
+////////////////////////////////////////////////////////////////////////////////
+/// Load image from memory.
+af_err af_load_image_memory(af_array* out, const void* ptr) {
+    using arrayfire::readImage;
+    try {
+        ARG_ASSERT(1, ptr != NULL);
+
+        FreeImage_Module& _ = getFreeImagePlugin();
+
+        // set your own FreeImage error handler
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
+
+        auto* stream = static_cast<FIMEMORY*>(const_cast<void*>(ptr));
+        _.FreeImage_SeekMemory(stream, 0L, SEEK_SET);
+
+        // try to guess the file format from the stream's signature
+        FREE_IMAGE_FORMAT fif = _.FreeImage_GetFileTypeFromMemory(stream, 0);
+        // if (fif == FIF_UNKNOWN) {
+        //    fif = FreeImage_GetFIFFromFilenameFromMemory(filename);
+        //}
+
+        if (fif == FIF_UNKNOWN) {
+            AF_ERROR("FreeImage Error: Unknown File or Filetype",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_ACCURATE);
+        }
+
+        // check that the plugin has reading capabilities ...
+        bitmap_ptr pBitmap = make_bitmap_ptr(NULL);
+        if (_.FreeImage_FIFSupportsReading(fif)) {
+            pBitmap.reset(_.FreeImage_LoadFromMemory(fif, stream,
+                                                     static_cast<int>(flags)));
+        }
+
+        if (pBitmap == NULL) {
+            AF_ERROR(
+                "FreeImage Error: Error reading image or file does not exist",
+                AF_ERR_RUNTIME);
+        }
+
+        // check image color type
+        uint color_type   = _.FreeImage_GetColorType(pBitmap.get());
+        const uint fi_bpp = _.FreeImage_GetBPP(pBitmap.get());
+        // int fi_color = (int)((fi_bpp / 8.0) + 0.5);        //ceil
+        uint fi_color;
+        switch (color_type) {
+            case 0:  // FIC_MINISBLACK
+            case 1:  // FIC_MINISWHITE
+                fi_color = 1;
+                break;
+            case 2:  // FIC_PALETTE
+            case 3:  // FIC_RGB
+                fi_color = 3;
+                break;
+            case 4:  // FIC_RGBALPHA
+            case 5:  // FIC_CMYK
+                fi_color = 4;
+                break;
+            default:  // Should not come here
+                fi_color = 3;
+                break;
+        }
+        const uint fi_bpc = fi_bpp / fi_color;
+        if (fi_bpc != 8 && fi_bpc != 16 && fi_bpc != 32) {
+            AF_ERROR("FreeImage Error: Bits per channel not supported",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        // sizes
+        uint fi_w = _.FreeImage_GetWidth(pBitmap.get());
+        uint fi_h = _.FreeImage_GetHeight(pBitmap.get());
+
+        // FI = row major | AF = column major
+        uint nSrcPitch = _.FreeImage_GetPitch(pBitmap.get());
+        const uchar* pSrcLine =
+            _.FreeImage_GetBits(pBitmap.get()) + nSrcPitch * (fi_h - 1);
+
+        // result image
+        af_array rImage;
+        if (fi_color == 4) {  // 4 channel image
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage<uchar, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                  pSrcLine,
+                                                                  nSrcPitch,
+                                                                  fi_w, fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage<ushort, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                   pSrcLine,
+                                                                   nSrcPitch,
+                                                                   fi_w, fi_h));
+            } else if (fi_bpc == 32) {
+                AF_CHECK((readImage<float, AFFI_RGBA, AFFI_RGBA>)(&rImage,
+                                                                  pSrcLine,
+                                                                  nSrcPitch,
+                                                                  fi_w, fi_h));
+            }
+        } else if (fi_color == 1) {  // 1 channel image
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage<uchar, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                       nSrcPitch, fi_w, fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage<ushort, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                        nSrcPitch, fi_w, fi_h));
+            } else if (fi_bpc == 32) {
+                AF_CHECK((readImage<float, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                       nSrcPitch, fi_w, fi_h));
+            }
+        } else {  // 3 channel image
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage<uchar, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage<ushort, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                 pSrcLine,
+                                                                 nSrcPitch,
+                                                                 fi_w, fi_h));
+            } else if (fi_bpc == 32) {
+                AF_CHECK((readImage<float, AFFI_RGB, AFFI_RGB>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+            }
+        }
+
+        swap(*out, rImage);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+// Save an image to memory.
+af_err af_save_image_memory(void** ptr, const af_array in_,
+                            const af_image_format format) {
+    try {
+        FreeImage_Module& _ = getFreeImagePlugin();
+
+        // set our own FreeImage error handler
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
+
+        // the caller supplies the output format explicitly
+        auto fif = static_cast<FREE_IMAGE_FORMAT>(format);
+
+        if (fif == FIF_UNKNOWN || fif > 34) {  // FreeImage FREE_IMAGE_FORMAT
+                                               // has up to 34 enums as of 3.17
+            AF_ERROR("FreeImage Error: Unknown Filetype", AF_ERR_NOT_SUPPORTED);
+        }
+
+        const ArrayInfo& info = getInfo(in_);
+        // check image color type
+        uint channels = info.dims()[2];
+        DIM_ASSERT(1, channels <= 4);
+        DIM_ASSERT(1, channels != 2);
+
+        uint fi_bpp = channels * 8;
+
+        // sizes
+        uint fi_w = info.dims()[1];
+        uint fi_h = info.dims()[0];
+
+        // create the result image storage using FreeImage
+        bitmap_ptr pResultBitmap = make_bitmap_ptr(_.FreeImage_Allocate(
+            fi_w, fi_h, static_cast<int>(fi_bpp), 0, 0, 0));
+        if (pResultBitmap == NULL) {
+            AF_ERROR("FreeImage Error: Error creating image or file",
+                     AF_ERR_RUNTIME);
         }
 
         // FI assumes [0-255]
@@ -319,7 +901,8 @@ af_err af_save_image(const char* filename, const af_array in_)
         AF_CHECK(af_max_all(&max_real, &max_imag, in_));
         if (max_real <= 1) {
             af_array c255;
-            AF_CHECK(af_constant(&c255, 255.0, info.ndims(), info.dims().get(), f32));
+            AF_CHECK(af_constant(&c255, 255.0, info.ndims(), info.dims().get(),
+                                 f32));
             AF_CHECK(af_mul(&in, in_, c255, false));
             AF_CHECK(af_release_array(c255));
             free_in = true;
@@ -328,27 +911,28 @@ af_err af_save_image(const char* filename, const af_array in_)
         }
 
         // FI = row major | AF = column major
-        uint nDstPitch = FreeImage_GetPitch(pResultBitmap);
-        uchar* pDstLine = FreeImage_GetBits(pResultBitmap) + nDstPitch * (fi_h - 1);
+        uint nDstPitch = _.FreeImage_GetPitch(pResultBitmap.get());
+        uchar* pDstLine =
+            _.FreeImage_GetBits(pResultBitmap.get()) + nDstPitch * (fi_h - 1);
         af_array rr = 0, gg = 0, bb = 0, aa = 0;
-        AF_CHECK(channel_split(in, info.dims(), &rr, &gg, &bb, &aa)); // convert array to 3 channels if needed
+        AF_CHECK(channel_split(in, info.dims(), &rr, &gg, &bb,
+                               &aa));  // convert array to 3 channels if needed
 
-        uint step = channels; // force 3 channels saving
+        uint step = channels;  // force 3 channels saving
         uint indx = 0;
 
         af_array rrT = 0, ggT = 0, bbT = 0, aaT = 0;
-        if(channels == 4) {
-
+        if (channels == 4) {
             AF_CHECK(af_transpose(&rrT, rr, false));
             AF_CHECK(af_transpose(&ggT, gg, false));
             AF_CHECK(af_transpose(&bbT, bb, false));
             AF_CHECK(af_transpose(&aaT, aa, false));
 
-            ArrayInfo cinfo = getInfo(rrT);
-            float* pSrc0 = pinnedAlloc<float>(cinfo.elements());
-            float* pSrc1 = pinnedAlloc<float>(cinfo.elements());
-            float* pSrc2 = pinnedAlloc<float>(cinfo.elements());
-            float* pSrc3 = pinnedAlloc<float>(cinfo.elements());
+            const ArrayInfo& cinfo = getInfo(rrT);
+            auto* pSrc0            = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc1            = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc2            = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc3            = pinnedAlloc<float>(cinfo.elements());
 
             AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
             AF_CHECK(af_get_data_ptr((void*)pSrc1, ggT));
@@ -358,10 +942,14 @@ af_err af_save_image(const char* filename, const af_array in_)
             // Copy the array into FreeImage buffer
             for (uint y = 0; y < fi_h; ++y) {
                 for (uint x = 0; x < fi_w; ++x) {
-                    *(pDstLine + x * step + 2) = (uchar) pSrc0[indx]; // b
-                    *(pDstLine + x * step + 1) = (uchar) pSrc1[indx]; // g
-                    *(pDstLine + x * step + 0) = (uchar) pSrc2[indx]; // r
-                    *(pDstLine + x * step + 3) = (uchar) pSrc3[indx]; // a
+                    *(pDstLine + x * step + FI_RGBA_RED) =
+                        static_cast<uchar>(pSrc0[indx]);  // r
+                    *(pDstLine + x * step + FI_RGBA_GREEN) =
+                        static_cast<uchar>(pSrc1[indx]);  // g
+                    *(pDstLine + x * step + FI_RGBA_BLUE) =
+                        static_cast<uchar>(pSrc2[indx]);  // b
+                    *(pDstLine + x * step + FI_RGBA_ALPHA) =
+                        static_cast<uchar>(pSrc3[indx]);  // a
                     ++indx;
                 }
                 pDstLine -= nDstPitch;
@@ -370,15 +958,15 @@ af_err af_save_image(const char* filename, const af_array in_)
             pinnedFree(pSrc1);
             pinnedFree(pSrc2);
             pinnedFree(pSrc3);
-        } else if(channels == 3) {
+        } else if (channels == 3) {
             AF_CHECK(af_transpose(&rrT, rr, false));
             AF_CHECK(af_transpose(&ggT, gg, false));
             AF_CHECK(af_transpose(&bbT, bb, false));
 
-            ArrayInfo cinfo = getInfo(rrT);
-            float* pSrc0 = pinnedAlloc<float>(cinfo.elements());
-            float* pSrc1 = pinnedAlloc<float>(cinfo.elements());
-            float* pSrc2 = pinnedAlloc<float>(cinfo.elements());
+            const ArrayInfo& cinfo = getInfo(rrT);
+            auto* pSrc0            = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc1            = pinnedAlloc<float>(cinfo.elements());
+            auto* pSrc2            = pinnedAlloc<float>(cinfo.elements());
 
             AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
             AF_CHECK(af_get_data_ptr((void*)pSrc1, ggT));
@@ -387,9 +975,12 @@ af_err af_save_image(const char* filename, const af_array in_)
             // Copy the array into FreeImage buffer
             for (uint y = 0; y < fi_h; ++y) {
                 for (uint x = 0; x < fi_w; ++x) {
-                    *(pDstLine + x * step + 2) = (uchar) pSrc0[indx]; // b
-                    *(pDstLine + x * step + 1) = (uchar) pSrc1[indx]; // g
-                    *(pDstLine + x * step + 0) = (uchar) pSrc2[indx]; // r
+                    *(pDstLine + x * step + FI_RGBA_RED) =
+                        static_cast<uchar>(pSrc0[indx]);  // r
+                    *(pDstLine + x * step + FI_RGBA_GREEN) =
+                        static_cast<uchar>(pSrc1[indx]);  // g
+                    *(pDstLine + x * step + FI_RGBA_BLUE) =
+                        static_cast<uchar>(pSrc2[indx]);  // b
                     ++indx;
                 }
                 pDstLine -= nDstPitch;
@@ -399,13 +990,13 @@ af_err af_save_image(const char* filename, const af_array in_)
             pinnedFree(pSrc2);
         } else {
             AF_CHECK(af_transpose(&rrT, rr, false));
-            ArrayInfo cinfo = getInfo(rrT);
-            float* pSrc0 = pinnedAlloc<float>(cinfo.elements());
+            const ArrayInfo& cinfo = getInfo(rrT);
+            auto* pSrc0            = pinnedAlloc<float>(cinfo.elements());
             AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
 
             for (uint y = 0; y < fi_h; ++y) {
                 for (uint x = 0; x < fi_w; ++x) {
-                    *(pDstLine + x * step) = (uchar) pSrc0[indx];
+                    *(pDstLine + x * step) = static_cast<uchar>(pSrc0[indx]);
                     ++indx;
                 }
                 pDstLine -= nDstPitch;
@@ -413,40 +1004,91 @@ af_err af_save_image(const char* filename, const af_array in_)
             pinnedFree(pSrc0);
         }
 
+        uint8_t* data          = nullptr;
+        uint32_t size_in_bytes = 0;
+        FIMEMORY* stream       = _.FreeImage_OpenMemory(data, size_in_bytes);
+
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_QUALITYSUPERB);
+        }
+
         // now save the result image
-        if (!(FreeImage_Save(fif, pResultBitmap, filename, 0) == TRUE)) {
+        if (_.FreeImage_SaveToMemory(fif, pResultBitmap.get(), stream,
+                                     static_cast<int>(flags)) == FALSE) {
             AF_ERROR("FreeImage Error: Failed to save image", AF_ERR_RUNTIME);
         }
 
-        FreeImage_Unload(pResultBitmap);
+        *ptr = stream;
+
+        if (free_in) { AF_CHECK(af_release_array(in)); }
+        if (rr != 0) { AF_CHECK(af_release_array(rr)); }
+        if (gg != 0) { AF_CHECK(af_release_array(gg)); }
+        if (bb != 0) { AF_CHECK(af_release_array(bb)); }
+        if (aa != 0) { AF_CHECK(af_release_array(aa)); }
+        if (rrT != 0) { AF_CHECK(af_release_array(rrT)); }
+        if (ggT != 0) { AF_CHECK(af_release_array(ggT)); }
+        if (bbT != 0) { AF_CHECK(af_release_array(bbT)); }
+        if (aaT != 0) { AF_CHECK(af_release_array(aaT)); }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_delete_image_memory(void* ptr) {
+    try {
+        ARG_ASSERT(0, ptr != NULL);
+
+        FreeImage_Module& _ = getFreeImagePlugin();
 
-        if(free_in) AF_CHECK(af_release_array(in ));
-        if(rr != 0) AF_CHECK(af_release_array(rr ));
-        if(gg != 0) AF_CHECK(af_release_array(gg ));
-        if(bb != 0) AF_CHECK(af_release_array(bb ));
-        if(aa != 0) AF_CHECK(af_release_array(aa ));
-        if(rrT!= 0) AF_CHECK(af_release_array(rrT));
-        if(ggT!= 0) AF_CHECK(af_release_array(ggT));
-        if(bbT!= 0) AF_CHECK(af_release_array(bbT));
-        if(aaT!= 0) AF_CHECK(af_release_array(aaT));
+        // set our own FreeImage error handler
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
+
+        auto* stream = static_cast<FIMEMORY*>(ptr);
+        _.FreeImage_SeekMemory(stream, 0L, SEEK_SET);
+
+        // Ensure the data is FreeImage-compatible
+        FREE_IMAGE_FORMAT fif =
+            _.FreeImage_GetFileTypeFromMemory(static_cast<FIMEMORY*>(ptr), 0);
+        if (fif == FIF_UNKNOWN) {
+            AF_ERROR("FreeImage Error: Unknown Filetype", AF_ERR_NOT_SUPPORTED);
+        }
 
-    } CATCHALL
+        _.FreeImage_CloseMemory(static_cast<FIMEMORY*>(ptr));
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-#else   // WITH_FREEIMAGE
-#include <af/image.h>
+#else  // WITH_FREEIMAGE
+#include <common/err_common.hpp>
 #include <stdio.h>
-AFAPI af_err af_load_image(af_array *out, const char* filename, const bool isColor)
-{
-    printf("Error: Image IO requires FreeImage. See https://github.com/arrayfire/arrayfire\n");
-    return AF_ERR_NOT_CONFIGURED;
+#include <af/image.h>
+af_err af_load_image(af_array *out, const char *filename, const bool isColor) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
+}
+
+af_err af_save_image(const char *filename, const af_array in_) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
+}
+
+af_err af_load_image_memory(af_array *out, const void *ptr) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
+}
+
+af_err af_save_image_memory(void **ptr, const af_array in_,
+                            const af_image_format format) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
 }
 
-af_err af_save_image(const char* filename, const af_array in_)
-{
-    printf("Error: Image IO requires FreeImage. See https://github.com/arrayfire/arrayfire\n");
-    return AF_ERR_NOT_CONFIGURED;
+af_err af_delete_image_memory(void *ptr) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
 }
 #endif  // WITH_FREEIMAGE
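
Note on the "FI = row major | AF = column major" conversion used by the loaders in this patch: FreeImage stores scanlines bottom-up with a byte pitch, while the output array is column-major with one plane per channel. The following is a minimal standalone sketch of that index mapping, not the code from this patch. The helper name `toColumnMajorPlanar` is hypothetical, it handles 8-bit data only, and it copies channels in memory order, so it deliberately ignores the FI_RGBA_RED/GREEN/BLUE reordering that the real loaders apply to 8-bit bitmaps.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of ArrayFire or FreeImage).
// Converts a bottom-up, row-major, interleaved 8-bit buffer into
// column-major planar storage. `pitch` is the number of bytes per scanline,
// which may include padding after the last pixel of a row.
std::vector<std::uint8_t> toColumnMajorPlanar(const std::uint8_t* src,
                                              std::size_t pitch,
                                              std::size_t width,
                                              std::size_t height,
                                              std::size_t channels) {
    if (width == 0 || height == 0 || channels == 0) { return {}; }
    assert(pitch >= width * channels);

    std::vector<std::uint8_t> dst(width * height * channels);

    // The last scanline in memory is the top row of the picture.
    const std::uint8_t* top = src + pitch * (height - 1);

    std::size_t indx = 0;
    for (std::size_t x = 0; x < width; ++x) {       // columns are outermost
        for (std::size_t y = 0; y < height; ++y) {  // rows vary fastest
            const std::uint8_t* row = top - y * pitch;
            for (std::size_t c = 0; c < channels; ++c) {
                // Each channel goes to its own width * height plane.
                dst[c * width * height + indx] = row[x * channels + c];
            }
            ++indx;
        }
    }
    return dst;
}
```

Walking columns in the outer loop and rows in the inner loop is what lets a single running index fill the column-major destination, which is the same loop order `readImage_t` uses.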
diff --git a/src/api/c/imageio2.cpp b/src/api/c/imageio2.cpp
new file mode 100644
index 0000000000..7130202397
--- /dev/null
+++ b/src/api/c/imageio2.cpp
@@ -0,0 +1,565 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_FREEIMAGE)
+
+#include "imageio_helper.h"
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+#include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/blas.h>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include <af/index.h>
+
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <string>
+
+using af::dim4;
+using arrayfire::AFFI_GRAY;
+using arrayfire::AFFI_RGB;
+using arrayfire::AFFI_RGBA;
+using arrayfire::bitmap_ptr;
+using arrayfire::channel_split;
+using arrayfire::FI_CHANNELS;
+using arrayfire::FreeImage_Module;
+using arrayfire::FreeImageErrorHandler;
+using arrayfire::getFreeImagePlugin;
+using arrayfire::make_bitmap_ptr;
+using detail::pinnedAlloc;
+using detail::pinnedFree;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+namespace {
+template<typename T, FI_CHANNELS fi_color>
+static af_err readImage_t(af_array* rImage, const uchar* pSrcLine,
+                          const int nSrcPitch, const uint fi_w,
+                          const uint fi_h) {
+    // create an array to receive the loaded image data.
+    AF_CHECK(af_init());
+    T* pDst  = pinnedAlloc<T>(fi_w * fi_h * 4);  // 4 channels is max
+    T* pDst0 = pDst;
+    T* pDst1 = pDst + (fi_w * fi_h * 1);
+    T* pDst2 = pDst + (fi_w * fi_h * 2);
+    T* pDst3 = pDst + (fi_w * fi_h * 3);
+
+    uint indx = 0;
+    uint step = fi_color;
+
+    for (uint x = 0; x < fi_w; ++x) {
+        for (uint y = 0; y < fi_h; ++y) {
+            const T* src = reinterpret_cast<T*>(const_cast<uchar*>(pSrcLine) -
+                                                y * nSrcPitch);
+            if (fi_color == 1) {
+                pDst0[indx] = *(src + (x * step));
+            } else if (fi_color >= 3) {
+                if (static_cast<af_dtype>(af::dtype_traits<T>::af_type) == u8) {
+                    pDst0[indx] = *(src + (x * step + FI_RGBA_RED));
+                    pDst1[indx] = *(src + (x * step + FI_RGBA_GREEN));
+                    pDst2[indx] = *(src + (x * step + FI_RGBA_BLUE));
+                    if (fi_color == 4) {
+                        pDst3[indx] = *(src + (x * step + FI_RGBA_ALPHA));
+                    }
+                } else {
+                    // Non 8-bit types do not use ordering
+                    // See Pixel Access Functions Chapter in FreeImage Doc
+                    pDst0[indx] = *(src + (x * step + 0));
+                    pDst1[indx] = *(src + (x * step + 1));
+                    pDst2[indx] = *(src + (x * step + 2));
+                    if (fi_color == 4) {
+                        pDst3[indx] = *(src + (x * step + 3));
+                    }
+                }
+            }
+            indx++;
+        }
+    }
+
+    af::dim4 dims(fi_h, fi_w, fi_color, 1);
+    af_err err =
+        af_create_array(rImage, pDst, dims.ndims(), dims.get(),
+                        static_cast<af_dtype>(af::dtype_traits<T>::af_type));
+    pinnedFree(pDst);
+    return err;
+}
+
+FREE_IMAGE_TYPE getFIT(FI_CHANNELS channels, af_dtype type) {
+    if (channels == AFFI_GRAY) {
+        if (type == u8) { return FIT_BITMAP; }
+        if (type == u16) {
+            return FIT_UINT16;
+        } else if (type == f32) {
+            return FIT_FLOAT;
+        }
+    } else if (channels == AFFI_RGB) {
+        if (type == u8) { return FIT_BITMAP; }
+        if (type == u16) {
+            return FIT_RGB16;
+        } else if (type == f32) {
+            return FIT_RGBF;
+        }
+    } else if (channels == AFFI_RGBA) {
+        if (type == u8) { return FIT_BITMAP; }
+        if (type == u16) {
+            return FIT_RGBA16;
+        } else if (type == f32) {
+            return FIT_RGBAF;
+        }
+    }
+    return FIT_BITMAP;
+}
+
+}  // namespace
+
+////////////////////////////////////////////////////////////////////////////////
+// File IO
+////////////////////////////////////////////////////////////////////////////////
+// Load image from disk.
+af_err af_load_image_native(af_array* out, const char* filename) {
+    try {
+        ARG_ASSERT(1, filename != NULL);
+
+        FreeImage_Module& _ = getFreeImagePlugin();
+
+        // set our own FreeImage error handler
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
+
+        // detect the file format from its signature, then from the extension
+        FREE_IMAGE_FORMAT fif = _.FreeImage_GetFileType(filename, 0);
+        if (fif == FIF_UNKNOWN) {
+            fif = _.FreeImage_GetFIFFromFilename(filename);
+        }
+
+        if (fif == FIF_UNKNOWN) {
+            AF_ERROR("FreeImage Error: Unknown File or Filetype",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_ACCURATE);
+        }
+
+        // check that the plugin has reading capabilities ...
+        bitmap_ptr pBitmap = make_bitmap_ptr(nullptr);
+        if (_.FreeImage_FIFSupportsReading(fif)) {
+            pBitmap.reset(
+                _.FreeImage_Load(fif, filename, static_cast<int>(flags)));
+        }
+
+        if (pBitmap == NULL) {
+            AF_ERROR(
+                "FreeImage Error: Error reading image or file does not exist",
+                AF_ERR_RUNTIME);
+        }
+
+        // check image color type
+        uint color_type   = _.FreeImage_GetColorType(pBitmap.get());
+        const uint fi_bpp = _.FreeImage_GetBPP(pBitmap.get());
+        // int fi_color = (int)((fi_bpp / 8.0) + 0.5);        //ceil
+        uint fi_color;
+        switch (color_type) {
+            case 0:  // FIC_MINISBLACK
+            case 1:  // FIC_MINISWHITE
+                fi_color = 1;
+                break;
+            case 2:  // FIC_PALETTE
+            case 3:  // FIC_RGB
+                fi_color = 3;
+                break;
+            case 4:  // FIC_RGBALPHA
+            case 5:  // FIC_CMYK
+                fi_color = 4;
+                break;
+            default:  // Should not come here
+                fi_color = 3;
+                break;
+        }
+
+        const uint fi_bpc = fi_bpp / fi_color;
+        if (fi_bpc != 8 && fi_bpc != 16 && fi_bpc != 32) {
+            AF_ERROR("FreeImage Error: Bits per channel not supported",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        // data type
+        FREE_IMAGE_TYPE image_type = _.FreeImage_GetImageType(pBitmap.get());
+
+        // sizes
+        uint fi_w = _.FreeImage_GetWidth(pBitmap.get());
+        uint fi_h = _.FreeImage_GetHeight(pBitmap.get());
+
+        // FI = row major | AF = column major
+        uint nSrcPitch = _.FreeImage_GetPitch(pBitmap.get());
+        const uchar* pSrcLine =
+            _.FreeImage_GetBits(pBitmap.get()) + nSrcPitch * (fi_h - 1);
+
+        // result image
+        af_array rImage;
+        if (fi_color == 4) {  // 4 channel image
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage_t<uchar, AFFI_RGBA>)(&rImage, pSrcLine,
+                                                         nSrcPitch, fi_w,
+                                                         fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage_t<ushort, AFFI_RGBA>)(&rImage, pSrcLine,
+                                                          nSrcPitch, fi_w,
+                                                          fi_h));
+            } else if (fi_bpc == 32) {
+                switch (image_type) {
+                    case FIT_UINT32:
+                        AF_CHECK((readImage_t<uint, AFFI_RGBA>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+                        break;
+                    case FIT_INT32:
+                        AF_CHECK((readImage_t<int, AFFI_RGBA>)(&rImage,
+                                                               pSrcLine,
+                                                               nSrcPitch, fi_w,
+                                                               fi_h));
+                        break;
+                    case FIT_FLOAT:
+                        AF_CHECK((readImage_t<float, AFFI_RGBA>)(&rImage,
+                                                                 pSrcLine,
+                                                                 nSrcPitch,
+                                                                 fi_w, fi_h));
+                        break;
+                    default:
+                        AF_ERROR("FreeImage Error: Unknown image type",
+                                 AF_ERR_NOT_SUPPORTED);
+                        break;
+                }
+            }
+        } else if (fi_color == 1) {
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage_t<uchar, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                         nSrcPitch, fi_w,
+                                                         fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage_t<ushort, AFFI_GRAY>)(&rImage, pSrcLine,
+                                                          nSrcPitch, fi_w,
+                                                          fi_h));
+            } else if (fi_bpc == 32) {
+                switch (image_type) {
+                    case FIT_UINT32:
+                        AF_CHECK((readImage_t<uint, AFFI_GRAY>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+                        break;
+                    case FIT_INT32:
+                        AF_CHECK((readImage_t<int, AFFI_GRAY>)(&rImage,
+                                                               pSrcLine,
+                                                               nSrcPitch, fi_w,
+                                                               fi_h));
+                        break;
+                    case FIT_FLOAT:
+                        AF_CHECK((readImage_t<float, AFFI_GRAY>)(&rImage,
+                                                                 pSrcLine,
+                                                                 nSrcPitch,
+                                                                 fi_w, fi_h));
+                        break;
+                    default:
+                        AF_ERROR("FreeImage Error: Unknown image type",
+                                 AF_ERR_NOT_SUPPORTED);
+                        break;
+                }
+            }
+        } else {  // 3 channel image
+            if (fi_bpc == 8) {
+                AF_CHECK((readImage_t<uchar, AFFI_RGB>)(&rImage, pSrcLine,
+                                                        nSrcPitch, fi_w, fi_h));
+            } else if (fi_bpc == 16) {
+                AF_CHECK((readImage_t<ushort, AFFI_RGB>)(&rImage, pSrcLine,
+                                                         nSrcPitch, fi_w,
+                                                         fi_h));
+            } else if (fi_bpc == 32) {
+                switch (image_type) {
+                    case FIT_UINT32:
+                        AF_CHECK((readImage_t<uint, AFFI_RGB>)(&rImage,
+                                                               pSrcLine,
+                                                               nSrcPitch, fi_w,
+                                                               fi_h));
+                        break;
+                    case FIT_INT32:
+                        AF_CHECK((readImage_t<int, AFFI_RGB>)(&rImage, pSrcLine,
+                                                              nSrcPitch, fi_w,
+                                                              fi_h));
+                        break;
+                    case FIT_FLOAT:
+                        AF_CHECK((readImage_t<float, AFFI_RGB>)(&rImage,
+                                                                pSrcLine,
+                                                                nSrcPitch, fi_w,
+                                                                fi_h));
+                        break;
+                    default:
+                        AF_ERROR("FreeImage Error: Unknown image type",
+                                 AF_ERR_NOT_SUPPORTED);
+                        break;
+                }
+            }
+        }
+
+        std::swap(*out, rImage);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T, FI_CHANNELS channels>
+static void save_t(T* pDstLine, const af_array in, const dim4& dims,
+                   uint nDstPitch) {
+    af_array rr = 0, gg = 0, bb = 0, aa = 0;
+    AF_CHECK(channel_split(in, dims, &rr, &gg, &bb,
+                           &aa));  // convert array to 3 channels if needed
+
+    af_array rrT = 0, ggT = 0, bbT = 0, aaT = 0;
+    T *pSrc0 = 0, *pSrc1 = 0, *pSrc2 = 0, *pSrc3 = 0;
+
+    uint step = channels;  // elements per interleaved pixel
+    uint indx = 0;
+
+    AF_CHECK(af_transpose(&rrT, rr, false));
+    if (channels >= 3) { AF_CHECK(af_transpose(&ggT, gg, false)); }
+    if (channels >= 3) { AF_CHECK(af_transpose(&bbT, bb, false)); }
+    if (channels >= 4) { AF_CHECK(af_transpose(&aaT, aa, false)); }
+
+    const ArrayInfo& cinfo = getInfo(rrT);
+    pSrc0                  = pinnedAlloc<T>(cinfo.elements());
+    if (channels >= 3) { pSrc1 = pinnedAlloc<T>(cinfo.elements()); }
+    if (channels >= 3) { pSrc2 = pinnedAlloc<T>(cinfo.elements()); }
+    if (channels >= 4) { pSrc3 = pinnedAlloc<T>(cinfo.elements()); }
+
+    AF_CHECK(af_get_data_ptr((void*)pSrc0, rrT));
+    if (channels >= 3) { AF_CHECK(af_get_data_ptr((void*)pSrc1, ggT)); }
+    if (channels >= 3) { AF_CHECK(af_get_data_ptr((void*)pSrc2, bbT)); }
+    if (channels >= 4) { AF_CHECK(af_get_data_ptr((void*)pSrc3, aaT)); }
+
+    const uint fi_w = dims[1];
+    const uint fi_h = dims[0];
+
+    // Copy the array into FreeImage buffer
+    for (uint y = 0; y < fi_h; ++y) {
+        for (uint x = 0; x < fi_w; ++x) {
+            if (channels == 1) {
+                *(pDstLine + x * step) = pSrc0[indx];  // r -> 0
+            } else if (channels >= 3) {
+                if (static_cast<af_dtype>(af::dtype_traits<T>::af_type) == u8) {
+                    *(pDstLine + x * step + FI_RGBA_RED) =
+                        pSrc0[indx];  // r -> 0
+                    *(pDstLine + x * step + FI_RGBA_GREEN) =
+                        pSrc1[indx];  // g -> 1
+                    *(pDstLine + x * step + FI_RGBA_BLUE) =
+                        pSrc2[indx];  // b -> 2
+                    if (channels >= 4) {
+                        *(pDstLine + x * step + FI_RGBA_ALPHA) =
+                            pSrc3[indx];  // a
+                    }
+                } else {
+                    // Non 8-bit types do not use ordering
+                    // See Pixel Access Functions Chapter in FreeImage Doc
+                    *(pDstLine + x * step + 0) = pSrc0[indx];  // r -> 0
+                    *(pDstLine + x * step + 1) = pSrc1[indx];  // g -> 1
+                    *(pDstLine + x * step + 2) = pSrc2[indx];  // b -> 2
+                    if (channels >= 4) {
+                        *(pDstLine + x * step + 3) = pSrc3[indx];  // a
+                    }
+                }
+            }
+            ++indx;
+        }
+        pDstLine = reinterpret_cast<T*>(reinterpret_cast<uchar*>(pDstLine) -
+                                        nDstPitch);
+    }
+    pinnedFree(pSrc0);
+    if (channels >= 3) { pinnedFree(pSrc1); }
+    if (channels >= 3) { pinnedFree(pSrc2); }
+    if (channels >= 4) { pinnedFree(pSrc3); }
+
+    if (rr != 0) { AF_CHECK(af_release_array(rr)); }
+    if (gg != 0) { AF_CHECK(af_release_array(gg)); }
+    if (bb != 0) { AF_CHECK(af_release_array(bb)); }
+    if (aa != 0) { AF_CHECK(af_release_array(aa)); }
+    if (rrT != 0) { AF_CHECK(af_release_array(rrT)); }
+    if (ggT != 0) { AF_CHECK(af_release_array(ggT)); }
+    if (bbT != 0) { AF_CHECK(af_release_array(bbT)); }
+    if (aaT != 0) { AF_CHECK(af_release_array(aaT)); }
+}
+
+// Save an image to disk.
+af_err af_save_image_native(const char* filename, const af_array in) {
+    try {
+        ARG_ASSERT(0, filename != NULL);
+
+        FreeImage_Module& _ = getFreeImagePlugin();
+
+        // install ArrayFire's FreeImage error handler
+        _.FreeImage_SetOutputMessage(FreeImageErrorHandler);
+
+        // detect the format from the file signature, else from the extension
+        FREE_IMAGE_FORMAT fif = _.FreeImage_GetFileType(filename, 0);
+        if (fif == FIF_UNKNOWN) {
+            fif = _.FreeImage_GetFIFFromFilename(filename);
+        }
+
+        if (fif == FIF_UNKNOWN) {
+            AF_ERROR("FreeImage Error: Unknown Filetype", AF_ERR_NOT_SUPPORTED);
+        }
+
+        const ArrayInfo& info = getInfo(in);
+        // check image color type
+        auto channels = static_cast<FI_CHANNELS>(info.dims()[2]);
+        DIM_ASSERT(1, channels <= 4);
+        DIM_ASSERT(1, channels != 2);
+
+        // sizes
+        uint fi_w = info.dims()[1];
+        uint fi_h = info.dims()[0];
+
+        af_dtype type = info.getType();
+
+        // FI assumes [0-255] for u8
+        // FI assumes [0-65k] for u16
+        // FI assumes [0-1]   for f32
+        int fi_bpp = 0;
+        switch (type) {
+            case u8: fi_bpp = channels * 8; break;
+            case u16: fi_bpp = channels * 16; break;
+            case f32: fi_bpp = channels * 32; break;
+            default: TYPE_ERROR(1, type);
+        }
+
+        FREE_IMAGE_TYPE fit_type = getFIT(channels, type);
+
+        // create the result image storage using FreeImage
+        bitmap_ptr pResultBitmap = make_bitmap_ptr(nullptr);
+        switch (type) {
+            case u8:
+            case u16:
+            case f32:
+                pResultBitmap.reset(_.FreeImage_AllocateT(fit_type, fi_w, fi_h,
+                                                          fi_bpp, 0, 0, 0));
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+
+        if (pResultBitmap == NULL) {
+            AF_ERROR("FreeImage Error: Error creating image or file",
+                     AF_ERR_RUNTIME);
+        }
+
+        // FI = row major | AF = column major
+        uint nDstPitch = _.FreeImage_GetPitch(pResultBitmap.get());
+        void* pDstLine =
+            _.FreeImage_GetBits(pResultBitmap.get()) + nDstPitch * (fi_h - 1);
+
+        if (channels == AFFI_GRAY) {
+            switch (type) {
+                case u8:
+                    save_t<uchar, AFFI_GRAY>(static_cast<uchar*>(pDstLine), in,
+                                             info.dims(), nDstPitch);
+                    break;
+                case u16:
+                    save_t<ushort, AFFI_GRAY>(static_cast<ushort*>(pDstLine),
+                                              in, info.dims(), nDstPitch);
+                    break;
+                case f32:
+                    save_t<float, AFFI_GRAY>(static_cast<float*>(pDstLine), in,
+                                             info.dims(), nDstPitch);
+                    break;
+                default: TYPE_ERROR(1, type);
+            }
+        } else if (channels == AFFI_RGB) {
+            switch (type) {
+                case u8:
+                    save_t<uchar, AFFI_RGB>(static_cast<uchar*>(pDstLine), in,
+                                            info.dims(), nDstPitch);
+                    break;
+                case u16:
+                    save_t<ushort, AFFI_RGB>(static_cast<ushort*>(pDstLine), in,
+                                             info.dims(), nDstPitch);
+                    break;
+                case f32:
+                    save_t<float, AFFI_RGB>(static_cast<float*>(pDstLine), in,
+                                            info.dims(), nDstPitch);
+                    break;
+                default: TYPE_ERROR(1, type);
+            }
+        } else {
+            switch (type) {
+                case u8:
+                    save_t<uchar, AFFI_RGBA>(static_cast<uchar*>(pDstLine), in,
+                                             info.dims(), nDstPitch);
+                    break;
+                case u16:
+                    save_t<ushort, AFFI_RGBA>(static_cast<ushort*>(pDstLine),
+                                              in, info.dims(), nDstPitch);
+                    break;
+                case f32:
+                    save_t<float, AFFI_RGBA>(static_cast<float*>(pDstLine), in,
+                                             info.dims(), nDstPitch);
+                    break;
+                default: TYPE_ERROR(1, type);
+            }
+        }
+
+        unsigned flags = 0;
+        if (fif == FIF_JPEG) {
+            flags = flags | static_cast<unsigned>(JPEG_QUALITYSUPERB);
+        }
+
+        // now save the result image
+        if (!(_.FreeImage_Save(fif, pResultBitmap.get(), filename,
+                               static_cast<int>(flags)) == TRUE)) {
+            AF_ERROR("FreeImage Error: Failed to save image", AF_ERR_RUNTIME);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_is_image_io_available(bool* out) {
+    *out = true;
+    return AF_SUCCESS;
+}
+
+#else  // WITH_FREEIMAGE
+#include <common/err_common.hpp>
+#include <stdio.h>
+#include <af/image.h>
+af_err af_load_image_native(af_array* out, const char* filename) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
+}
+
+af_err af_save_image_native(const char* filename, const af_array in) {
+    AF_RETURN_ERROR("ArrayFire compiled without Image IO (FreeImage) support",
+                    AF_ERR_NOT_CONFIGURED);
+}
+
+af_err af_is_image_io_available(bool* out) {
+    *out = false;
+    return AF_SUCCESS;
+}
+#endif  // WITH_FREEIMAGE
diff --git a/src/api/c/imageio_helper.h b/src/api/c/imageio_helper.h
new file mode 100644
index 0000000000..e9ef818bf3
--- /dev/null
+++ b/src/api/c/imageio_helper.h
@@ -0,0 +1,107 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifndef IMAGEIO_HELPER_H
+#define IMAGEIO_HELPER_H
+
+#include <common/DependencyModule.hpp>
+#include <common/err_common.hpp>
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/index.h>
+
+#include <FreeImage.h>
+
+#include <functional>
+#include <memory>
+
+namespace arrayfire {
+
+class FreeImage_Module {
+    common::DependencyModule module;
+
+   public:
+    MODULE_MEMBER(FreeImage_Allocate);
+    MODULE_MEMBER(FreeImage_AllocateT);
+    MODULE_MEMBER(FreeImage_CloseMemory);
+    MODULE_MEMBER(FreeImage_DeInitialise);
+    MODULE_MEMBER(FreeImage_FIFSupportsReading);
+    MODULE_MEMBER(FreeImage_GetBPP);
+    MODULE_MEMBER(FreeImage_GetBits);
+    MODULE_MEMBER(FreeImage_GetColorType);
+    MODULE_MEMBER(FreeImage_GetFIFFromFilename);
+    MODULE_MEMBER(FreeImage_GetFileType);
+    MODULE_MEMBER(FreeImage_GetFileTypeFromMemory);
+    MODULE_MEMBER(FreeImage_GetHeight);
+    MODULE_MEMBER(FreeImage_GetImageType);
+    MODULE_MEMBER(FreeImage_GetPitch);
+    MODULE_MEMBER(FreeImage_GetWidth);
+    MODULE_MEMBER(FreeImage_Initialise);
+    MODULE_MEMBER(FreeImage_Load);
+    MODULE_MEMBER(FreeImage_LoadFromMemory);
+    MODULE_MEMBER(FreeImage_OpenMemory);
+    MODULE_MEMBER(FreeImage_Save);
+    MODULE_MEMBER(FreeImage_SaveToMemory);
+    MODULE_MEMBER(FreeImage_SeekMemory);
+    MODULE_MEMBER(FreeImage_SetOutputMessage);
+    MODULE_MEMBER(FreeImage_Unload);
+
+    FreeImage_Module();
+    ~FreeImage_Module();
+};
+
+FreeImage_Module &getFreeImagePlugin();
+
+using bitmap_ptr = std::unique_ptr<FIBITMAP, std::function<void(FIBITMAP *)>>;
+bitmap_ptr make_bitmap_ptr(FIBITMAP *);
+
+typedef enum {
+    AFFI_GRAY = 1,  //< gray
+    AFFI_RGB  = 3,  //< rgb
+    AFFI_RGBA = 4   //< rgba
+} FI_CHANNELS;
+
+// Error handler for FreeImage library.
+// When invoked, this handler prints the message to stdout; it does not throw.
+static void FreeImageErrorHandler(FREE_IMAGE_FORMAT oFif,
+                                  const char *zMessage) {
+    UNUSED(oFif);
+    printf("FreeImage Error Handler: %s\n", zMessage);
+}
+
+//  Split an MxNxC image (C = 1, 3 or 4) into separate channel arrays.
+//  Only the outputs matching the channel count are written.
+static af_err channel_split(const af_array rgb, const af::dim4 &dims,
+                            af_array *outr, af_array *outg, af_array *outb,
+                            af_array *outa) {
+    try {
+        af_seq idx[4][3] = {{af_span, af_span, {0, 0, 1}},
+                            {af_span, af_span, {1, 1, 1}},
+                            {af_span, af_span, {2, 2, 1}},
+                            {af_span, af_span, {3, 3, 1}}};
+
+        if (dims[2] == 4) {
+            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
+            AF_CHECK(af_index(outg, rgb, dims.ndims(), idx[1]));
+            AF_CHECK(af_index(outb, rgb, dims.ndims(), idx[2]));
+            AF_CHECK(af_index(outa, rgb, dims.ndims(), idx[3]));
+        } else if (dims[2] == 3) {
+            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
+            AF_CHECK(af_index(outg, rgb, dims.ndims(), idx[1]));
+            AF_CHECK(af_index(outb, rgb, dims.ndims(), idx[2]));
+        } else {
+            AF_CHECK(af_index(outr, rgb, dims.ndims(), idx[0]));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+}  // namespace arrayfire
+#endif  // IMAGEIO_HELPER_H
diff --git a/src/api/c/imgproc_common.hpp b/src/api/c/imgproc_common.hpp
new file mode 100644
index 0000000000..f4abcb0907
--- /dev/null
+++ b/src/api/c/imgproc_common.hpp
@@ -0,0 +1,82 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <copy.hpp>
+#include <logic.hpp>
+#include <reduce.hpp>
+#include <scan.hpp>
+
+#include <cmath>
+
+namespace arrayfire {
+namespace common {
+
+template<typename To, typename Ti = To>
+detail::Array<To> integralImage(const detail::Array<Ti>& in) {
+    auto input                       = common::cast<To, Ti>(in);
+    detail::Array<To> horizontalScan = detail::scan<af_add_t, To, To>(input, 0);
+    return detail::scan<af_add_t, To, To>(horizontalScan, 1);
+}
+
+template<typename T>
+detail::Array<T> threshold(const detail::Array<T>& in, T min, T max) {
+    const af::dim4 inDims = in.dims();
+
+    auto MN    = detail::createValueArray(inDims, min);
+    auto MX    = detail::createValueArray(inDims, max);
+    auto below = detail::logicOp<T, af_le_t>(in, MX, inDims);
+    auto above = detail::logicOp<T, af_ge_t>(in, MN, inDims);
+    auto valid = detail::logicOp<char, af_and_t>(below, above, inDims);
+
+    return detail::arithOp<T, af_mul_t>(in, common::cast<T, char>(valid),
+                                        inDims);
+}
+
+template<typename To, typename Ti>
+detail::Array<To> convRange(const detail::Array<Ti>& in,
+                            const To newLow = To(0), const To newHigh = To(1)) {
+    auto dims  = in.dims();
+    auto input = common::cast<To, Ti>(in);
+    To high =
+        detail::getScalar<To>(detail::reduce_all<af_max_t, To, To>(input));
+    To low = detail::getScalar<To>(detail::reduce_all<af_min_t, To, To>(input));
+    To range = high - low;
+
+    if (std::abs(range) < 1.0e-6) {
+        if (low == To(0) && newLow == To(0)) {
+            return input;
+        } else {
+            // Input is constant, use high as constant in converted range
+            return detail::createValueArray(dims, newHigh);
+        }
+    }
+
+    auto minArray = detail::createValueArray(dims, low);
+    auto invDen   = detail::createValueArray(dims, To(1.0 / range));
+    auto numer    = detail::arithOp<To, af_sub_t>(input, minArray, dims);
+    auto result   = detail::arithOp<To, af_mul_t>(numer, invDen, dims);
+
+    if (newLow != To(0) || newHigh != To(1)) {
+        To newRange    = newHigh - newLow;
+        auto newRngArr = detail::createValueArray(dims, newRange);
+        auto newMinArr = detail::createValueArray(dims, newLow);
+        auto scaledArr = detail::arithOp<To, af_mul_t>(result, newRngArr, dims);
+
+        result = detail::arithOp<To, af_add_t>(newMinArr, scaledArr, dims);
+    }
+    return result;
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/api/c/implicit.cpp b/src/api/c/implicit.cpp
index b7a661d67c..d045769cbd 100644
--- a/src/api/c/implicit.cpp
+++ b/src/api/c/implicit.cpp
@@ -14,52 +14,39 @@ Implicit type mimics C/C++ behavior.
 
 Order of precedence:
 - complex > real
-- double > float > uintl > intl > uint > int > uchar > char
+- double > float > half > uintl > intl > uint > int > ushort > short > uchar > schar > char
 */
 
-af_dtype implicit(const af_dtype lty, const af_dtype rty)
-{
-    if (lty == rty) {
-        return lty;
-    }
+af_dtype implicit(const af_dtype lty, const af_dtype rty) {
+    if (lty == rty) { return lty; }
 
-    if (lty == c64 || rty == c64) {
-           return c64;
-    }
+    if (lty == c64 || rty == c64) { return c64; }
 
     if (lty == c32 || rty == c32) {
-        if (lty == f64 || rty == f64)  return c64;
+        if (lty == f64 || rty == f64) { return c64; }
         return c32;
     }
 
-    if (lty == f64 || rty == f64) return f64;
-    if (lty == f32 || rty == f32) return f32;
-
-    if ((lty == u64) ||
-        (rty == u64)) return u64;
-
-    if ((lty == s64) ||
-        (rty == s64)) return s64;
-
-    if ((lty == u32) ||
-        (rty == u32)) return u32;
-
-    if ((lty == s32) ||
-        (rty == s32)) return s32;
-
-    if ((lty == u8 ) ||
-        (rty == u8 )) return u8;
+    if (lty == f64 || rty == f64) { return f64; }
+    if (lty == f32 || rty == f32) { return f32; }
+    if ((lty == f16) || (rty == f16)) { return f16; }
 
-    if ((lty == b8 ) &&
-        (rty == b8 )) return b8;
+    if ((lty == u64) || (rty == u64)) { return u64; }
+    if ((lty == s64) || (rty == s64)) { return s64; }
+    if ((lty == u32) || (rty == u32)) { return u32; }
+    if ((lty == s32) || (rty == s32)) { return s32; }
+    if ((lty == u16) || (rty == u16)) { return u16; }
+    if ((lty == s16) || (rty == s16)) { return s16; }
+    if ((lty == u8) || (rty == u8)) { return u8; }
+    if ((lty == s8) || (rty == s8)) { return s8; }
+    if ((lty == b8) && (rty == b8)) { return b8; }
 
     return f32;
 }
 
-af_dtype implicit(const af_array lhs, const af_array rhs)
-{
-    ArrayInfo lInfo = getInfo(lhs);
-    ArrayInfo rInfo = getInfo(rhs);
+af_dtype implicit(const af_array lhs, const af_array rhs) {
+    const ArrayInfo& lInfo = getInfo(lhs);
+    const ArrayInfo& rInfo = getInfo(rhs);
 
     return implicit(lInfo.getType(), rInfo.getType());
 }
diff --git a/src/api/c/implicit.hpp b/src/api/c/implicit.hpp
index d3e455b645..d70240e33a 100644
--- a/src/api/c/implicit.hpp
+++ b/src/api/c/implicit.hpp
@@ -8,16 +8,14 @@
  ********************************************************/
 
 #pragma once
-#include <af/array.h>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <optypes.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <handle.hpp>
+#include <optypes.hpp>
 #include <types.hpp>
-#include <cast.hpp>
-
-using namespace detail;
+#include <af/array.h>
+#include <af/defines.h>
 
 af_dtype implicit(const af_array lhs, const af_array rhs);
 af_dtype implicit(const af_dtype lty, const af_dtype rty);
diff --git a/src/api/c/index.cpp b/src/api/c/index.cpp
index c8bc259aed..792a5a5af7 100644
--- a/src/api/c/index.cpp
+++ b/src/api/c/index.cpp
@@ -7,147 +7,244 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <vector>
-#include <cassert>
+#include <index.hpp>
+#include <indexing_common.hpp>
 
-#include <af/array.h>
-#include <af/index.h>
-#include <af/arith.h>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
 #include <Array.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/moddims.hpp>
+#include <handle.hpp>
 #include <lookup.hpp>
-#include <index.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/index.h>
 
-using namespace detail;
-using std::vector;
+#include <array>
+#include <cassert>
+#include <cmath>
+#include <vector>
+
+using std::signbit;
 using std::swap;
+using std::vector;
 
-template<typename T>
-static void indexArray(af_array &dest, const af_array &src, const unsigned ndims, const af_seq *index)
-{
-    const Array<T> &parent = getArray<T>(src);
-    vector<af_seq> index_(index, index+ndims);
-    Array<T> dst =  createSubArray(parent, index_);
+using af::dim4;
+using arrayfire::common::convert2Canonical;
+using arrayfire::common::createSpanIndex;
+using arrayfire::common::flat;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::index;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+namespace arrayfire {
+namespace common {
+af_index_t createSpanIndex() {
+    static af_index_t s = [] {
+        af_index_t s;
+        s.idx.seq = af_span;
+        s.isSeq   = true;
+        s.isBatch = false;
+        return s;
+    }();
+    return s;
+}
 
-    dest = getHandle(dst);
+af_seq convert2Canonical(const af_seq s, const dim_t len) {
+    double begin = signbit(s.begin) ? (len + s.begin) : s.begin;
+    double end   = signbit(s.end) ? (len + s.end) : s.end;
+
+    return af_seq{begin, end, s.step};
 }
+}  // namespace common
+}  // namespace arrayfire
 
-af_err af_index(af_array *result, const af_array in, const unsigned ndims, const af_seq* index)
-{
-    af_array out;
+template<typename T>
+static af_array indexBySeqs(const af_array& src,
+                            const vector<af_seq>& indicesV) {
+    auto ndims        = static_cast<dim_t>(indicesV.size());
+    const auto& input = getArray<T>(src);
+
+    if (ndims == 1U && ndims != input.ndims()) {
+        return getHandle(createSubArray(flat(input), indicesV));
+    } else {
+        return getHandle(createSubArray(input, indicesV));
+    }
+}
+
+af_err af_index(af_array* result, const af_array in, const unsigned ndims,
+                const af_seq* indices) {
     try {
-        af_dtype in_type = getInfo(in).getType();
-
-        switch(in_type) {
-        case f32:    indexArray<float>   (out, in, ndims, index);  break;
-        case c32:    indexArray<cfloat>  (out, in, ndims, index);  break;
-        case f64:    indexArray<double>  (out, in, ndims, index);  break;
-        case c64:    indexArray<cdouble> (out, in, ndims, index);  break;
-        case b8:     indexArray<char>    (out, in, ndims, index);  break;
-        case s32:    indexArray<int>     (out, in, ndims, index);  break;
-        case u32:    indexArray<unsigned>(out, in, ndims, index);  break;
-        case s64:    indexArray<intl>    (out, in, ndims, index);  break;
-        case u64:    indexArray<uintl>   (out, in, ndims, index);  break;
-        case u8:     indexArray<uchar>   (out, in, ndims, index);  break;
-        default:    TYPE_ERROR(1, in_type);
+        ARG_ASSERT(2, (ndims > 0 && ndims <= AF_MAX_DIMS));
+
+        const ArrayInfo& inInfo = getInfo(in);
+        af_dtype type           = inInfo.getType();
+        const dim4& iDims       = inInfo.dims();
+
+        vector<af_seq> indices_(ndims, af_span);
+        for (unsigned i = 0; i < ndims; ++i) {
+            indices_[i] = convert2Canonical(indices[i], iDims[i]);
+
+            ARG_ASSERT(3, (indices_[i].begin >= 0. && indices_[i].end >= 0.));
+            if (signbit(indices_[i].step)) {
+                ARG_ASSERT(3, indices_[i].begin >= indices_[i].end);
+            } else {
+                ARG_ASSERT(3, indices_[i].begin <= indices_[i].end);
+            }
         }
+
+        af_array out = 0;
+
+        switch (type) {
+            case f32: out = indexBySeqs<float>(in, indices_); break;
+            case c32: out = indexBySeqs<cfloat>(in, indices_); break;
+            case f64: out = indexBySeqs<double>(in, indices_); break;
+            case c64: out = indexBySeqs<cdouble>(in, indices_); break;
+            case b8: out = indexBySeqs<char>(in, indices_); break;
+            case s32: out = indexBySeqs<int>(in, indices_); break;
+            case u32: out = indexBySeqs<unsigned>(in, indices_); break;
+            case s16: out = indexBySeqs<short>(in, indices_); break;
+            case u16: out = indexBySeqs<ushort>(in, indices_); break;
+            case s64: out = indexBySeqs<intl>(in, indices_); break;
+            case u64: out = indexBySeqs<uintl>(in, indices_); break;
+            case s8: out = indexBySeqs<schar>(in, indices_); break;
+            case u8: out = indexBySeqs<uchar>(in, indices_); break;
+            case f16: out = indexBySeqs<half>(in, indices_); break;
+            default: TYPE_ERROR(1, type);
+        }
+        swap(*result, out);
     }
     CATCHALL
-
-    swap(*result, out);
     return AF_SUCCESS;
 }
 
+template<typename T, typename idx_t>
+inline af_array lookup(const af_array& in, const af_array& idx,
+                       const unsigned dim) {
+    return getHandle(lookup(getArray<T>(in), getArray<idx_t>(idx), dim));
+}
+
 template<typename idx_t>
-static af_array lookup(const af_array &in, const af_array &idx, const unsigned dim)
-{
-    ArrayInfo inInfo = getInfo(in);
-
-    af_dtype inType  = inInfo.getType();
-
-    switch(inType) {
-        case f32: return getHandle(lookup<float   , idx_t > (getArray<float   >(in), getArray<idx_t>(idx), dim));
-        case c32: return getHandle(lookup<cfloat  , idx_t > (getArray<cfloat  >(in), getArray<idx_t>(idx), dim));
-        case f64: return getHandle(lookup<double  , idx_t > (getArray<double  >(in), getArray<idx_t>(idx), dim));
-        case c64: return getHandle(lookup<cdouble , idx_t > (getArray<cdouble >(in), getArray<idx_t>(idx), dim));
-        case s32: return getHandle(lookup<int     , idx_t > (getArray<int     >(in), getArray<idx_t>(idx), dim));
-        case u32: return getHandle(lookup<unsigned, idx_t > (getArray<unsigned>(in), getArray<idx_t>(idx), dim));
-        case s64: return getHandle(lookup<intl    , idx_t > (getArray<intl    >(in), getArray<idx_t>(idx), dim));
-        case u64: return getHandle(lookup<uintl   , idx_t > (getArray<uintl   >(in), getArray<idx_t>(idx), dim));
-        case  u8: return getHandle(lookup<uchar   , idx_t > (getArray<uchar   >(in), getArray<idx_t>(idx), dim));
-        case  b8: return getHandle(lookup<char    , idx_t > (getArray<char    >(in), getArray<idx_t>(idx), dim));
-        default : TYPE_ERROR(1, inType);
+static af_array lookup(const af_array& in, const af_array& idx,
+                       const unsigned dim) {
+    const ArrayInfo& inInfo = getInfo(in);
+    af_dtype inType         = inInfo.getType();
+
+    switch (inType) {
+        case f32: return lookup<float, idx_t>(in, idx, dim);
+        case c32: return lookup<cfloat, idx_t>(in, idx, dim);
+        case f64: return lookup<double, idx_t>(in, idx, dim);
+        case c64: return lookup<cdouble, idx_t>(in, idx, dim);
+        case s32: return lookup<int, idx_t>(in, idx, dim);
+        case u32: return lookup<unsigned, idx_t>(in, idx, dim);
+        case s64: return lookup<intl, idx_t>(in, idx, dim);
+        case u64: return lookup<uintl, idx_t>(in, idx, dim);
+        case s16: return lookup<short, idx_t>(in, idx, dim);
+        case u16: return lookup<ushort, idx_t>(in, idx, dim);
+        case s8: return lookup<schar, idx_t>(in, idx, dim);
+        case u8: return lookup<uchar, idx_t>(in, idx, dim);
+        case b8: return lookup<char, idx_t>(in, idx, dim);
+        case f16: return lookup<half, idx_t>(in, idx, dim);
+        default: TYPE_ERROR(1, inType);
     }
 }
 
-af_err af_lookup(af_array *out, const af_array in, const af_array indices, const unsigned dim)
-{
-    af_array output = 0;
-
+af_err af_lookup(af_array* out, const af_array in, const af_array indices,
+                 const unsigned dim) {
     try {
-        ARG_ASSERT(3, (dim>=0 && dim<=3));
+        const ArrayInfo& idxInfo = getInfo(indices);
 
-        ArrayInfo inInfo = getInfo(in);
-        ArrayInfo idxInfo= getInfo(indices);
+        if (idxInfo.ndims() == 0) {
+            *out = retain(indices);
+            return AF_SUCCESS;
+        }
 
+        ARG_ASSERT(3, (dim <= 3));
         ARG_ASSERT(2, idxInfo.isVector() || idxInfo.isScalar());
 
         af_dtype idxType = idxInfo.getType();
 
-        ARG_ASSERT(2, (idxType!=c32));
-        ARG_ASSERT(2, (idxType!=c64));
-        ARG_ASSERT(2, (idxType!=b8));
-
-        switch(idxType) {
-            case f32: output = lookup<float   >(in, indices, dim); break;
-            case f64: output = lookup<double  >(in, indices, dim); break;
-            case s32: output = lookup<int     >(in, indices, dim); break;
-            case u32: output = lookup<unsigned>(in, indices, dim); break;
-            case  u8: output = lookup<uchar   >(in, indices, dim); break;
-            default : TYPE_ERROR(1, idxType);
+        ARG_ASSERT(2, (idxType != c32));
+        ARG_ASSERT(2, (idxType != c64));
+        ARG_ASSERT(2, (idxType != b8));
+
+        af_array output = 0;
+        af_array idx = 0;
+
+        if (!idxInfo.isColumn()) {
+            // Deep copy to flatten; handles sub-arrays of non-column vectors
+            AF_CHECK(af_copy_array(&idx, indices));
+        } else {
+            idx = indices;
         }
-    }
-    CATCHALL;
 
-    std::swap(*out, output);
+        switch (idxType) {
+            case f32: output = lookup<float>(in, idx, dim); break;
+            case f64: output = lookup<double>(in, idx, dim); break;
+            case s32: output = lookup<int>(in, idx, dim); break;
+            case u32: output = lookup<unsigned>(in, idx, dim); break;
+            case s16: output = lookup<short>(in, idx, dim); break;
+            case u16: output = lookup<ushort>(in, idx, dim); break;
+            case s64: output = lookup<intl>(in, idx, dim); break;
+            case u64: output = lookup<uintl>(in, idx, dim); break;
+            case s8: output = lookup<schar>(in, idx, dim); break;
+            case u8: output = lookup<uchar>(in, idx, dim); break;
+            case f16: output = lookup<half>(in, idx, dim); break;
+            default: TYPE_ERROR(1, idxType);
+        }
+        std::swap(*out, output);
 
+        if (idx != indices) {
+            AF_CHECK(af_release_array(idx)); // Release indices array if a copy has been made
+        }
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_seq
-af_make_seq(double begin, double end, double step) {
-    af_seq seq = {begin, end, step};
-    return seq;
-}
-
 // idxrs parameter to the below static function
 // expects 4 values which is handled appropriately
 // by the C-API af_index_gen
 template<typename T>
-static inline
-af_array genIndex(const af_array& in,  const af_index_t idxrs[])
-{
+static inline af_array genIndex(const af_array& in, const af_index_t idxrs[]) {
     return getHandle<T>(index<T>(getArray<T>(in), idxrs));
 }
 
-af_err af_index_gen(af_array *out, const af_array in, const dim_t ndims, const af_index_t* indexs)
-{
-    af_array output = 0;
-    // spanner is sequence index used for indexing along the
-    // dimensions after ndims
-    af_index_t spanner;
-    spanner.idx.seq = af_span;
-    spanner.isSeq = true;
-
+af_err af_index_gen(af_array* out, const af_array in, const dim_t ndims,
+                    const af_index_t* indexs) {
     try {
-        ARG_ASSERT(2, (ndims>0));
-        ARG_ASSERT(3, (indexs!=NULL));
+        ARG_ASSERT(2, (ndims > 0 && ndims <= AF_MAX_DIMS));
+        ARG_ASSERT(3, (indexs != NULL));
+
+        const ArrayInfo& iInfo = getInfo(in);
+        const dim4& iDims      = iInfo.dims();
+        af_dtype inType        = iInfo.getType();
+
+        if (iDims.ndims() <= 0) {
+            *out = createHandle(dim4(0), inType);
+            return AF_SUCCESS;
+        }
+
+        if (ndims == 1 && ndims != static_cast<dim_t>(iInfo.ndims())) {
+            af_array in_ = 0;
+            AF_CHECK(af_flat(&in_, in));
+            AF_CHECK(af_index_gen(out, in_, ndims, indexs));
+            AF_CHECK(af_release_array(in_));
+            return AF_SUCCESS;
+        }
 
         int track = 0;
-        af_seq seqs[] = {af_span, af_span, af_span, af_span};
+        std::array<af_seq, AF_MAX_DIMS> seqs{};
+        seqs.fill(af_span);
         for (dim_t i = 0; i < ndims; i++) {
             if (indexs[i].isSeq) {
                 track++;
@@ -155,58 +252,134 @@ af_err af_index_gen(af_array *out, const af_array in, const dim_t ndims, const a
             }
         }
 
-        if (track==(int)ndims) {
-            // all indexs are sequences, redirecting to af_index
-            return af_index(out, in, ndims, seqs);
+        if (track == static_cast<int>(ndims)) {
+            return af_index(out, in, ndims, seqs.data());
         }
 
-        af_index_t idxrs[4];
-        // set all dimensions above ndims to spanner index
-        for (dim_t i=ndims; i<4; ++i) idxrs[i] = spanner;
-
-        for (dim_t i=0; i<ndims; ++i) {
-            if (!indexs[i].isSeq) {
-                // check if all af_arrays have atleast one value
-                // to enable indexing along that dimension
-                ArrayInfo idxInfo = getInfo(indexs[i].idx.arr);
-                af_dtype idxType  = idxInfo.getType();
-
-                ARG_ASSERT(3, (idxType!=c32));
-                ARG_ASSERT(3, (idxType!=c64));
-                ARG_ASSERT(3, (idxType!=b8 ));
-
-                idxrs[i].idx.arr = indexs[i].idx.arr;
-                idxrs[i].isSeq = indexs[i].isSeq;
+        std::array<af_index_t, AF_MAX_DIMS> idxrs{};
+
+        for (dim_t i = 0; i < AF_MAX_DIMS; ++i) {
+            if (i < ndims) {
+                bool isSeq = indexs[i].isSeq;
+                if (!isSeq) {
+                    // check if all af_arrays have at least one value
+                    // to enable indexing along that dimension
+                    const ArrayInfo& idxInfo = getInfo(indexs[i].idx.arr);
+                    af_dtype idxType         = idxInfo.getType();
+
+                    ARG_ASSERT(3, (idxType != c32));
+                    ARG_ASSERT(3, (idxType != c64));
+                    ARG_ASSERT(3, (idxType != b8));
+
+                    idxrs[i] = {{indexs[i].idx.arr}, isSeq, indexs[i].isBatch};
+                } else {
+                    // copy the af_seq to local variable
+                    af_seq inSeq =
+                        convert2Canonical(indexs[i].idx.seq, iDims[i]);
+                    ARG_ASSERT(3, (inSeq.begin >= 0. || inSeq.end >= 0.));
+                    if (signbit(inSeq.step)) {
+                        ARG_ASSERT(3, inSeq.begin >= inSeq.end);
+                    } else {
+                        ARG_ASSERT(3, inSeq.begin <= inSeq.end);
+                    }
+                    idxrs[i].idx.seq = inSeq;
+                    idxrs[i].isSeq   = isSeq;
+                    idxrs[i].isBatch = indexs[i].isBatch;
+                }
             } else {
-                // af_seq is being used for this dimension
-                // just copy the index to local variable
-                idxrs[i] = indexs[i];
+                // set all dimensions above ndims to spanner
+                idxrs[i] = createSpanIndex();
             }
         }
-
-        ArrayInfo iInfo = getInfo(in);
-        dim4 iDims = iInfo.dims();
-
-        ARG_ASSERT(1, (iDims.ndims()>0));
-
-        af_dtype inType = getInfo(in).getType();
-        switch(inType) {
-            case c64: output = genIndex<cdouble>(in, idxrs); break;
-            case f64: output = genIndex<double >(in, idxrs); break;
-            case c32: output = genIndex<cfloat >(in, idxrs); break;
-            case f32: output = genIndex<float  >(in, idxrs); break;
-            case u64: output = genIndex<uintl  >(in, idxrs); break;
-            case u32: output = genIndex<uint   >(in, idxrs); break;
-            case s64: output = genIndex<intl   >(in, idxrs); break;
-            case s32: output = genIndex<int    >(in, idxrs); break;
-            case  u8: output = genIndex<uchar  >(in, idxrs); break;
-            case  b8: output = genIndex<char   >(in, idxrs); break;
+        af_index_t* ptr = idxrs.data();
+
+        af_array output = 0;
+        switch (inType) {
+            case c64: output = genIndex<cdouble>(in, ptr); break;
+            case f64: output = genIndex<double>(in, ptr); break;
+            case c32: output = genIndex<cfloat>(in, ptr); break;
+            case f32: output = genIndex<float>(in, ptr); break;
+            case u64: output = genIndex<uintl>(in, ptr); break;
+            case s64: output = genIndex<intl>(in, ptr); break;
+            case u32: output = genIndex<uint>(in, ptr); break;
+            case s32: output = genIndex<int>(in, ptr); break;
+            case u16: output = genIndex<ushort>(in, ptr); break;
+            case s16: output = genIndex<short>(in, ptr); break;
+            case s8: output = genIndex<schar>(in, ptr); break;
+            case u8: output = genIndex<uchar>(in, ptr); break;
+            case b8: output = genIndex<char>(in, ptr); break;
+            case f16: output = genIndex<half>(in, ptr); break;
             default: TYPE_ERROR(1, inType);
         }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_seq af_make_seq(double begin, double end, double step) {
+    return af_seq{begin, end, step};
+}
+
+af_err af_create_indexers(af_index_t** indexers) {
+    try {
+        auto* out = new af_index_t[AF_MAX_DIMS];
+        for (int i = 0; i < AF_MAX_DIMS; ++i) {
+            out[i].idx.seq = af_span;
+            out[i].isSeq   = true;
+            out[i].isBatch = false;
+        }
+        std::swap(*indexers, out);
     }
     CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_array_indexer(af_index_t* indexer, const af_array idx,
+                            const dim_t dim) {
+    try {
+        ARG_ASSERT(0, (indexer != NULL));
+        ARG_ASSERT(1, (idx != NULL));
+        ARG_ASSERT(2, (dim >= 0 && dim <= 3));
+        indexer[dim] = af_index_t{{idx}, false, false};
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_seq_indexer(af_index_t* indexer, const af_seq* idx,
+                          const dim_t dim, const bool is_batch) {
+    try {
+        ARG_ASSERT(0, (indexer != NULL));
+        ARG_ASSERT(1, (idx != NULL));
+        ARG_ASSERT(2, (dim >= 0 && dim <= 3));
+        indexer[dim].idx.seq = *idx;
+        indexer[dim].isSeq   = true;
+        indexer[dim].isBatch = is_batch;
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-    std::swap(*out, output);
+af_err af_set_seq_param_indexer(af_index_t* indexer, const double begin,
+                                const double end, const double step,
+                                const dim_t dim, const bool is_batch) {
+    try {
+        ARG_ASSERT(0, (indexer != NULL));
+        ARG_ASSERT(4, (dim >= 0 && dim <= 3));
+        af_seq s             = af_make_seq(begin, end, step);
+        indexer[dim].idx.seq = s;
+        indexer[dim].isSeq   = true;
+        indexer[dim].isBatch = is_batch;
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
+af_err af_release_indexers(af_index_t* indexers) {
+    try {
+        delete[] indexers;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
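Reviewer note: the sequence branch of af_index_gen above rejects a canonical sequence unless at least one bound is non-negative and the bounds are ordered consistently with the sign of the step. The sketch below restates those three ARG_ASSERT checks as a single predicate. `Seq` and `isValidCanonical` are hypothetical names used only for illustration; they are not ArrayFire types or functions.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for af_seq (begin, end, step); not the ArrayFire type.
struct Seq {
    double begin;
    double end;
    double step;
};

// Returns true when a canonical sequence would pass the checks shown in the
// diff: at least one bound is non-negative, and begin/end are ordered
// according to the sign of step.
bool isValidCanonical(const Seq& s) {
    if (!(s.begin >= 0.0 || s.end >= 0.0)) { return false; }
    if (std::signbit(s.step)) {
        return s.begin >= s.end;  // negative step walks downwards
    }
    return s.begin <= s.end;  // zero or positive step walks upwards
}
```

Keeping the rule in one predicate makes it easy to see that a forward step with begin > end, or a backward step with begin < end, is rejected.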
diff --git a/src/api/c/indexing_common.hpp b/src/api/c/indexing_common.hpp
new file mode 100644
index 0000000000..85a5d9562a
--- /dev/null
+++ b/src/api/c/indexing_common.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/index.h>
+
+namespace arrayfire {
+namespace common {
+/// Creates an af_index_t object that represents an af_span value
+af_index_t createSpanIndex();
+
+/// Converts an af_seq to canonical form, which is composed of non-negative
+/// values for begin and end. The step value is not modified.
+///
+/// af_seq objects represent a range of values. You can create an af_seq object
+/// with the af::end value, which is represented as -1. For example, a sequence
+/// from 1 to end-5 is composed of all values in an array except the first
+/// value and the last five values. This function converts such values to
+/// non-negative positions, taking the array size into account.
+///
+/// \param[in] s   is sequence that may have negative values
+/// \param[in] len is the length of a given array along a given dimension.
+///
+/// \returns A sequence with begin and end values in the range [0,len).
+///          Step value is not modified.
+///
+/// \note No error checks are performed.
+///
+/// Sample outputs of convert2Canonical for given sequence s:
+/// // Assume the array's len is 10 along dimension 0
+/// s{1, end-2, 1}   will return the sequence af_seq(1, 7, 1)
+/// s{1, 2, 1}       will return the same sequence
+/// s{-1, 2, -1}     will return the sequence af_seq(9, 2, -1)
+af_seq convert2Canonical(const af_seq s, const dim_t len);
+}  // namespace common
+}  // namespace arrayfire
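Reviewer note: the header only declares convert2Canonical, so the sketch below shows one way the documented behaviour could be realised. It assumes that a negative bound is an offset from `len`, which is what the sample outputs in the comment suggest; the actual implementation may differ. `Seq` and `toCanonical` are hypothetical names, not part of ArrayFire.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for af_seq; not the ArrayFire type.
struct Seq {
    double begin;
    double end;
    double step;
};

// Maps end-relative (negative) bounds to absolute positions for an axis of
// length len. The step is passed through unchanged and, like the documented
// function, no error checks are performed.
Seq toCanonical(const Seq& s, long long len) {
    const double n     = static_cast<double>(len);
    const double begin = std::signbit(s.begin) ? n + s.begin : s.begin;
    const double end   = std::signbit(s.end) ? n + s.end : s.end;
    return Seq{begin, end, s.step};
}
```

With `len` equal to 10, `end-2` is stored as -3 and maps to 7, which matches the first sample in the comment.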
diff --git a/src/api/c/internal.cpp b/src/api/c/internal.cpp
new file mode 100644
index 0000000000..c0314981cb
--- /dev/null
+++ b/src/api/c/internal.cpp
@@ -0,0 +1,261 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <platform.hpp>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/internal.h>
+#include <af/version.h>
+#include <cstring>
+
+using af::dim4;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createStridedArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+af_err af_create_strided_array(af_array *arr, const void *data,
+                               const dim_t offset, const unsigned ndims,
+                               const dim_t *const dims_,
+                               const dim_t *const strides_, const af_dtype ty,
+                               const af_source location) {
+    try {
+        ARG_ASSERT(2, offset >= 0);
+        ARG_ASSERT(3, ndims >= 1 && ndims <= 4);
+        ARG_ASSERT(4, dims_ != NULL);
+        ARG_ASSERT(5, strides_ != NULL);
+        ARG_ASSERT(5, strides_[0] == 1);
+
+        for (int i = 1; i < static_cast<int>(ndims); i++) {
+            ARG_ASSERT(5, strides_[i] > 0);
+        }
+
+        dim4 dims(ndims, dims_);
+        dim4 strides(ndims, strides_);
+
+        for (int i = static_cast<int>(ndims); i < 4; i++) {
+            strides[i] = strides[i - 1] * dims[i - 1];
+        }
+
+        bool isdev = location == afDevice;
+
+        af_array res;
+        AF_CHECK(af_init());
+
+        void *in_data = const_cast<void *>(
+            data);  // const cast because the api cannot change
+        switch (ty) {
+            case f32:
+                res = getHandle(createStridedArray<float>(
+                    dims, strides, offset, static_cast<float *>(in_data),
+                    isdev));
+                break;
+            case f64:
+                res = getHandle(createStridedArray<double>(
+                    dims, strides, offset, static_cast<double *>(in_data),
+                    isdev));
+                break;
+            case c32:
+                res = getHandle(createStridedArray<cfloat>(
+                    dims, strides, offset, static_cast<cfloat *>(in_data),
+                    isdev));
+                break;
+            case c64:
+                res = getHandle(createStridedArray<cdouble>(
+                    dims, strides, offset, static_cast<cdouble *>(in_data),
+                    isdev));
+                break;
+            case u32:
+                res = getHandle(createStridedArray<uint>(
+                    dims, strides, offset, static_cast<uint *>(in_data),
+                    isdev));
+                break;
+            case s32:
+                res = getHandle(createStridedArray<int>(
+                    dims, strides, offset, static_cast<int *>(in_data), isdev));
+                break;
+            case u64:
+                res = getHandle(createStridedArray<uintl>(
+                    dims, strides, offset, static_cast<uintl *>(in_data),
+                    isdev));
+                break;
+            case s64:
+                res = getHandle(createStridedArray<intl>(
+                    dims, strides, offset, static_cast<intl *>(in_data),
+                    isdev));
+                break;
+            case u16:
+                res = getHandle(createStridedArray<ushort>(
+                    dims, strides, offset, static_cast<ushort *>(in_data),
+                    isdev));
+                break;
+            case s16:
+                res = getHandle(createStridedArray<short>(
+                    dims, strides, offset, static_cast<short *>(in_data),
+                    isdev));
+                break;
+            case b8:
+                res = getHandle(createStridedArray<char>(
+                    dims, strides, offset, static_cast<char *>(in_data),
+                    isdev));
+                break;
+            case u8:
+                res = getHandle(createStridedArray<uchar>(
+                    dims, strides, offset, static_cast<uchar *>(in_data),
+                    isdev));
+                break;
+            case s8:
+                res = getHandle(createStridedArray<schar>(
+                    dims, strides, offset, static_cast<schar *>(in_data),
+                    isdev));
+                break;
+            case f16:
+                res = getHandle(createStridedArray<half>(
+                    dims, strides, offset, static_cast<half *>(in_data),
+                    isdev));
+                break;
+            default: TYPE_ERROR(6, ty);
+        }
+
+        std::swap(*arr, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_get_strides(dim_t *s0, dim_t *s1, dim_t *s2, dim_t *s3,
+                      const af_array in) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        *s0                   = info.strides()[0];
+        *s1                   = info.strides()[1];
+        *s2                   = info.strides()[2];
+        *s3                   = info.strides()[3];
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_get_offset(dim_t *offset, const af_array arr) {
+    try {
+        dim_t res = getInfo(arr).getOffset();
+        std::swap(*offset, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_get_raw_ptr(void **ptr, const af_array arr) {
+    try {
+        void *res = NULL;
+
+        af_dtype ty = getInfo(arr).getType();
+
+        switch (ty) {
+            case f32: res = getRawPtr(getArray<float>(arr)); break;
+            case f64: res = getRawPtr(getArray<double>(arr)); break;
+            case c32: res = getRawPtr(getArray<cfloat>(arr)); break;
+            case c64: res = getRawPtr(getArray<cdouble>(arr)); break;
+            case u32: res = getRawPtr(getArray<uint>(arr)); break;
+            case s32: res = getRawPtr(getArray<int>(arr)); break;
+            case u64: res = getRawPtr(getArray<uintl>(arr)); break;
+            case s64: res = getRawPtr(getArray<intl>(arr)); break;
+            case u16: res = getRawPtr(getArray<ushort>(arr)); break;
+            case s16: res = getRawPtr(getArray<short>(arr)); break;
+            case b8: res = getRawPtr(getArray<char>(arr)); break;
+            case u8: res = getRawPtr(getArray<uchar>(arr)); break;
+            case s8: res = getRawPtr(getArray<schar>(arr)); break;
+            case f16: res = getRawPtr(getArray<half>(arr)); break;
+            default: TYPE_ERROR(1, ty);
+        }
+
+        std::swap(*ptr, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_is_linear(bool *result, const af_array arr) {
+    try {
+        *result = getInfo(arr).isLinear();
+    }
+    CATCHALL
+    return AF_SUCCESS;
+}
+
+af_err af_is_owner(bool *result, const af_array arr) {
+    try {
+        bool res = false;
+
+        af_dtype ty = getInfo(arr).getType();
+
+        switch (ty) {
+            case f32: res = getArray<float>(arr).isOwner(); break;
+            case f64: res = getArray<double>(arr).isOwner(); break;
+            case c32: res = getArray<cfloat>(arr).isOwner(); break;
+            case c64: res = getArray<cdouble>(arr).isOwner(); break;
+            case u32: res = getArray<uint>(arr).isOwner(); break;
+            case s32: res = getArray<int>(arr).isOwner(); break;
+            case u64: res = getArray<uintl>(arr).isOwner(); break;
+            case s64: res = getArray<intl>(arr).isOwner(); break;
+            case u16: res = getArray<ushort>(arr).isOwner(); break;
+            case s16: res = getArray<short>(arr).isOwner(); break;
+            case b8: res = getArray<char>(arr).isOwner(); break;
+            case u8: res = getArray<uchar>(arr).isOwner(); break;
+            case s8: res = getArray<schar>(arr).isOwner(); break;
+            case f16: res = getArray<half>(arr).isOwner(); break;
+            default: TYPE_ERROR(1, ty);
+        }
+
+        std::swap(*result, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_get_allocated_bytes(size_t *bytes, const af_array arr) {
+    try {
+        af_dtype ty = getInfo(arr).getType();
+
+        size_t res = 0;
+
+        switch (ty) {
+            case f32: res = getArray<float>(arr).getAllocatedBytes(); break;
+            case f64: res = getArray<double>(arr).getAllocatedBytes(); break;
+            case c32: res = getArray<cfloat>(arr).getAllocatedBytes(); break;
+            case c64: res = getArray<cdouble>(arr).getAllocatedBytes(); break;
+            case u32: res = getArray<uint>(arr).getAllocatedBytes(); break;
+            case s32: res = getArray<int>(arr).getAllocatedBytes(); break;
+            case u64: res = getArray<uintl>(arr).getAllocatedBytes(); break;
+            case s64: res = getArray<intl>(arr).getAllocatedBytes(); break;
+            case u16: res = getArray<ushort>(arr).getAllocatedBytes(); break;
+            case s16: res = getArray<short>(arr).getAllocatedBytes(); break;
+            case b8: res = getArray<char>(arr).getAllocatedBytes(); break;
+            case u8: res = getArray<uchar>(arr).getAllocatedBytes(); break;
+            case s8: res = getArray<schar>(arr).getAllocatedBytes(); break;
+            case f16: res = getArray<half>(arr).getAllocatedBytes(); break;
+            default: TYPE_ERROR(1, ty);
+        }
+
+        std::swap(*bytes, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
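Reviewer note: af_create_strided_array accepts strides for the first `ndims` axes only and derives the rest as `strides[i] = strides[i - 1] * dims[i - 1]`. The sketch below reproduces that fill-in on plain arrays. It assumes that dimensions beyond `ndims` are padded with 1, and it relies on the same precondition the function asserts (`ndims` between 1 and 4). `Dims`, `padDims` and `completeStrides` are hypothetical names for illustration.

```cpp
#include <array>
#include <cassert>

using Dims = std::array<long long, 4>;

// Copies the first ndims entries and pads the remaining axes with 1
// (assumed padding behaviour; see the note above).
Dims padDims(unsigned ndims, const long long* dims) {
    Dims out{1, 1, 1, 1};
    for (unsigned i = 0; i < ndims && i < 4; ++i) { out[i] = dims[i]; }
    return out;
}

// Keeps the caller's strides for the first ndims axes and derives the rest
// so that each trailing axis steps over the whole previous axis.
// Precondition: 1 <= ndims <= 4.
Dims completeStrides(unsigned ndims, const long long* strides,
                     const Dims& dims) {
    Dims out{1, 1, 1, 1};
    for (unsigned i = 0; i < ndims && i < 4; ++i) { out[i] = strides[i]; }
    for (unsigned i = ndims; i < 4; ++i) { out[i] = out[i - 1] * dims[i - 1]; }
    return out;
}
```

For a 3x4 array with strides {1, 3}, the derived strides for axes 2 and 3 are both 12.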
diff --git a/src/api/c/inverse.cpp b/src/api/c/inverse.cpp
index 04603db26e..a2b9b5c90b 100644
--- a/src/api/c/inverse.cpp
+++ b/src/api/c/inverse.cpp
@@ -7,28 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <inverse.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
 
 template<typename T>
-static inline af_array inverse(const af_array in)
-{
+static inline af_array inverse(const af_array in) {
     return getHandle(inverse<T>(getArray<T>(in)));
 }
 
-af_err af_inverse(af_array *out, const af_array in, const af_mat_prop options)
-{
+af_err af_inverse(af_array* out, const af_array in, const af_mat_prop options) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo& i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("solve can not be used in batch mode", AF_ERR_BATCH);
@@ -37,20 +35,24 @@ af_err af_inverse(af_array *out, const af_array in, const af_mat_prop options)
         af_dtype type = i_info.getType();
 
         if (options != AF_MAT_NONE) {
-            AF_ERROR("Using this property is not yet supported in inverse", AF_ERR_NOT_SUPPORTED);
+            AF_ERROR("Using this property is not yet supported in inverse",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
-        DIM_ASSERT(1, i_info.dims()[0] == i_info.dims()[1]);      // Only square matrices
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
+        DIM_ASSERT(
+            1, i_info.dims()[0] == i_info.dims()[1]);  // Only square matrices
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
 
         af_array output;
 
-        switch(type) {
-            case f32: output = inverse<float  >(in);  break;
-            case f64: output = inverse<double >(in);  break;
-            case c32: output = inverse<cfloat >(in);  break;
-            case c64: output = inverse<cdouble>(in);  break;
-            default:  TYPE_ERROR(1, type);
+        if (i_info.ndims() == 0) { return af_retain_array(out, in); }
+
+        switch (type) {
+            case f32: output = inverse<float>(in); break;
+            case f64: output = inverse<double>(in); break;
+            case c32: output = inverse<cfloat>(in); break;
+            case c64: output = inverse<cdouble>(in); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
diff --git a/src/api/c/jit_test_api.cpp b/src/api/c/jit_test_api.cpp
new file mode 100644
index 0000000000..784994f267
--- /dev/null
+++ b/src/api/c/jit_test_api.cpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <jit_test_api.h>
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <platform.hpp>
+
+af_err af_get_max_jit_len(int *jitLen) {
+    *jitLen = detail::getMaxJitSize();
+    return AF_SUCCESS;
+}
+
+af_err af_set_max_jit_len(const int maxJitLen) {
+    try {
+        ARG_ASSERT(1, maxJitLen > 0);
+        detail::getMaxJitSize() = maxJitLen;
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/jit_test_api.h b/src/api/c/jit_test_api.h
new file mode 100644
index 0000000000..d99bc3b077
--- /dev/null
+++ b/src/api/c/jit_test_api.h
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/defines.h>
+
+#ifdef __cplusplus
+namespace af {
+/// Get the maximum jit tree length for active backend
+///
+/// \returns the maximum length of jit tree from root to any leaf
+AFAPI int getMaxJitLen(void);
+
+/// Set the maximum jit tree length for active backend
+///
+/// \param[in] jitLen is the maximum length of jit tree from root to any
+/// leaf
+AFAPI void setMaxJitLen(const int jitLen);
+}  // namespace af
+#endif  //__cplusplus
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/// Get the maximum jit tree length for active backend
+///
+/// \param[out] jit_len is the maximum length of jit tree from root to any
+/// leaf
+///
+/// \returns Always returns AF_SUCCESS
+AFAPI af_err af_get_max_jit_len(int *jit_len);
+
+/// Set the maximum jit tree length for active backend
+///
+/// \param[in] jit_len is the maximum length of jit tree from root to any
+/// leaf
+///
+/// \returns AF_SUCCESS, or AF_ERR_ARG if jit_len is not positive
+AFAPI af_err af_set_max_jit_len(const int jit_len);
+
+#ifdef __cplusplus
+}
+#endif
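Reviewer note: the setter in jit_test_api.cpp assigns through `detail::getMaxJitSize()`, which implies that the function returns a mutable reference. The sketch below shows that pattern in isolation. `maxJitSize`, `getLen` and `setLen` are hypothetical names, the default of 100 is arbitrary, and the real storage may well be per-thread or per-device rather than a single static.

```cpp
#include <cassert>

// Hypothetical stand-in for detail::getMaxJitSize(): returns a reference to
// the stored limit so callers can both read and assign it.
int& maxJitSize() {
    static int length = 100;  // arbitrary default, for illustration only
    return length;
}

// Reads the current limit.
int getLen() { return maxJitSize(); }

// Updates the limit; rejects non-positive values, as the ARG_ASSERT in the
// diff does. Returns false when the value was rejected.
bool setLen(int len) {
    if (len <= 0) { return false; }
    maxJitSize() = len;
    return true;
}
```

Returning a reference keeps a single storage location behind both accessors, at the cost of exposing it to any caller that can reach the function.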
diff --git a/src/api/c/join.cpp b/src/api/c/join.cpp
index 423859c495..d3e9cda6b5 100644
--- a/src/api/c/join.cpp
+++ b/src/api/c/join.cpp
@@ -7,120 +7,176 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/data.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
 #include <join.hpp>
+#include <af/data.h>
+
+#include <algorithm>
+#include <climits>
 #include <vector>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::swap;
+using std::vector;
 
-template<typename Tx, typename Ty>
-static inline af_array join(const int dim, const af_array first, const af_array second)
-{
-    return getHandle(join<Tx, Ty>(dim, getArray<Tx>(first), getArray<Ty>(second)));
+template<typename T>
+static inline af_array join(const int dim, const af_array first,
+                            const af_array second) {
+    return getHandle(join<T>(dim, getArray<T>(first), getArray<T>(second)));
 }
 
 template<typename T>
-static inline af_array join_many(const int dim, const unsigned n_arrays, const af_array *inputs)
-{
-    std::vector<Array<T>> inputs_;
+static inline af_array join_many(const int dim, const unsigned n_arrays,
+                                 const af_array *inputs) {
+    vector<Array<T>> inputs_;
     inputs_.reserve(n_arrays);
 
-    for(int i = 0; i < (int)n_arrays; i++) {
-        inputs_.push_back(getArray<T>(inputs[i]));
+    dim_t dim_size{0};
+    for (unsigned i{0}; i < n_arrays; ++i) {
+        const Array<T> &iArray = getArray<T>(inputs[i]);
+        if (!iArray.isEmpty()) {
+            inputs_.push_back(iArray);
+            dim_size += iArray.dims().dims[dim];
+        }
     }
-    return getHandle(join<T>(dim, inputs_));
+
+    // All dimensions except join dimension must be equal
+    // calculate odims size
+    af::dim4 odims{inputs_[0].dims()};
+    odims.dims[dim] = dim_size;
+
+    Array<T> out{createEmptyArray<T>(odims)};
+    join<T>(out, dim, inputs_);
+    return getHandle(out);
 }
 
-af_err af_join(af_array *out, const int dim, const af_array first, const af_array second)
-{
+af_err af_join(af_array *out, const int dim, const af_array first,
+               const af_array second) {
     try {
-        ArrayInfo finfo = getInfo(first);
-        ArrayInfo sinfo = getInfo(second);
-        af::dim4  fdims = finfo.dims();
-        af::dim4  sdims = sinfo.dims();
+        const ArrayInfo &finfo{getInfo(first)};
+        const ArrayInfo &sinfo{getInfo(second)};
+        const dim4 &fdims{finfo.dims()};
+        const dim4 &sdims{sinfo.dims()};
 
         ARG_ASSERT(1, dim >= 0 && dim < 4);
         ARG_ASSERT(2, finfo.getType() == sinfo.getType());
-        DIM_ASSERT(2, sinfo.elements() > 0);
-        DIM_ASSERT(3, finfo.elements() > 0);
+        if (sinfo.elements() == 0) { return af_retain_array(out, first); }
+        if (finfo.elements() == 0) { return af_retain_array(out, second); }
+        DIM_ASSERT(2, finfo.elements() > 0);
+        DIM_ASSERT(3, sinfo.elements() > 0);
 
         // All dimensions except join dimension must be equal
-        // Compute output dims
-        for(int i = 0; i < 4; i++) {
-            if(i != dim) DIM_ASSERT(2, fdims[i] == sdims[i]);
+        for (int i{0}; i < AF_MAX_DIMS; i++) {
+            if (i != dim) { DIM_ASSERT(2, fdims.dims[i] == sdims.dims[i]); }
         }
 
         af_array output;
 
-        switch(finfo.getType()) {
-            case f32: output = join<float  , float  >(dim, first, second);  break;
-            case c32: output = join<cfloat , cfloat >(dim, first, second);  break;
-            case f64: output = join<double , double >(dim, first, second);  break;
-            case c64: output = join<cdouble, cdouble>(dim, first, second);  break;
-            case b8:  output = join<char   , char   >(dim, first, second);  break;
-            case s32: output = join<int    , int    >(dim, first, second);  break;
-            case u32: output = join<uint   , uint   >(dim, first, second);  break;
-            case s64: output = join<intl   , intl   >(dim, first, second);  break;
-            case u64: output = join<uintl  , uintl  >(dim, first, second);  break;
-            case u8:  output = join<uchar  , uchar  >(dim, first, second);  break;
-            default:  TYPE_ERROR(1, finfo.getType());
+        switch (finfo.getType()) {
+            case f32: output = join<float>(dim, first, second); break;
+            case c32: output = join<cfloat>(dim, first, second); break;
+            case f64: output = join<double>(dim, first, second); break;
+            case c64: output = join<cdouble>(dim, first, second); break;
+            case b8: output = join<char>(dim, first, second); break;
+            case s32: output = join<int>(dim, first, second); break;
+            case u32: output = join<uint>(dim, first, second); break;
+            case s64: output = join<intl>(dim, first, second); break;
+            case u64: output = join<uintl>(dim, first, second); break;
+            case s16: output = join<short>(dim, first, second); break;
+            case u16: output = join<ushort>(dim, first, second); break;
+            case s8: output = join<schar>(dim, first, second); break;
+            case u8: output = join<uchar>(dim, first, second); break;
+            case f16: output = join<half>(dim, first, second); break;
+            default: TYPE_ERROR(1, finfo.getType());
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays, const af_array *inputs)
-{
+af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays,
+                    const af_array *inputs) {
     try {
-        ARG_ASSERT(3, n_arrays > 1 && n_arrays <= 10);
-
-        std::vector<ArrayInfo> info;
-        info.reserve(n_arrays);
-        std::vector<af::dim4> dims(n_arrays);
-        for(int i = 0; i < (int)n_arrays; i++) {
-            info.push_back(getInfo(inputs[i]));
-            dims[i] = info[i].dims();
+        ARG_ASSERT(3, inputs != nullptr);
+
+        if (n_arrays == 1) {
+            af_array ret{nullptr};
+            AF_CHECK(af_retain_array(&ret, *inputs));
+            std::swap(*out, ret);
+            return AF_SUCCESS;
         }
 
-        ARG_ASSERT(1, dim >= 0 && dim < 4);
+        ARG_ASSERT(1, dim >= 0 && dim < AF_MAX_DIMS);
+        ARG_ASSERT(2, n_arrays > 0);
 
-        for(int i = 1; i < (int)n_arrays; i++) {
-            ARG_ASSERT(3, info[0].getType() == info[i].getType());
-            DIM_ASSERT(3, info[i].elements() > 0);
+        const af_array *inputIt{inputs};
+        const af_array *inputEnd{inputs + n_arrays};
+        while ((inputIt != inputEnd) && (getInfo(*inputIt).elements() == 0)) {
+            ++inputIt;
+        }
+        if (inputIt == inputEnd) {
+            // All arrays have 0 elements
+            af_array ret = nullptr;
+            AF_CHECK(af_retain_array(&ret, *inputs));
+            std::swap(*out, ret);
+            return AF_SUCCESS;
         }
 
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        for(int i = 0; i < 4; i++) {
-            if(i != dim) {
-                for(int j = 1; j < (int)n_arrays; j++) {
-                    DIM_ASSERT(3, dims[0][i] == dims[j][i]);
+        // inputIt points to first non empty array
+        const af_dtype assertType{getInfo(*inputIt).getType()};
+        const dim4 &assertDims{getInfo(*inputIt).dims()};
+
+        // Check all remaining arrays on assertType and assertDims
+        while (++inputIt != inputEnd) {
+            const ArrayInfo &info = getInfo(*inputIt);
+            if (info.elements() > 0) {
+                ARG_ASSERT(3, assertType == info.getType());
+                const dim4 &infoDims{getInfo(*inputIt).dims()};
+                // All dimensions except join dimension must be equal
+                for (int i{0}; i < AF_MAX_DIMS; i++) {
+                    if (i != dim) {
+                        DIM_ASSERT(3, assertDims.dims[i] == infoDims.dims[i]);
+                    }
                 }
             }
         }
-
         af_array output;
 
-        switch(info[0].getType()) {
-            case f32: output = join_many<float  >(dim, n_arrays, inputs);  break;
-            case c32: output = join_many<cfloat >(dim, n_arrays, inputs);  break;
-            case f64: output = join_many<double >(dim, n_arrays, inputs);  break;
-            case c64: output = join_many<cdouble>(dim, n_arrays, inputs);  break;
-            case b8:  output = join_many<char   >(dim, n_arrays, inputs);  break;
-            case s32: output = join_many<int    >(dim, n_arrays, inputs);  break;
-            case u32: output = join_many<uint   >(dim, n_arrays, inputs);  break;
-            case u8:  output = join_many<uchar  >(dim, n_arrays, inputs);  break;
-            default:  TYPE_ERROR(1, info[0].getType());
+        switch (assertType) {
+            case f32: output = join_many<float>(dim, n_arrays, inputs); break;
+            case c32: output = join_many<cfloat>(dim, n_arrays, inputs); break;
+            case f64: output = join_many<double>(dim, n_arrays, inputs); break;
+            case c64: output = join_many<cdouble>(dim, n_arrays, inputs); break;
+            case b8: output = join_many<char>(dim, n_arrays, inputs); break;
+            case s32: output = join_many<int>(dim, n_arrays, inputs); break;
+            case u32: output = join_many<uint>(dim, n_arrays, inputs); break;
+            case s64: output = join_many<intl>(dim, n_arrays, inputs); break;
+            case u64: output = join_many<uintl>(dim, n_arrays, inputs); break;
+            case s16: output = join_many<short>(dim, n_arrays, inputs); break;
+            case u16: output = join_many<ushort>(dim, n_arrays, inputs); break;
+            case s8: output = join_many<schar>(dim, n_arrays, inputs); break;
+            case u8: output = join_many<uchar>(dim, n_arrays, inputs); break;
+            case f16: output = join_many<half>(dim, n_arrays, inputs); break;
+            default: TYPE_ERROR(1, assertType);
         }
-        std::swap(*out,output);
+        swap(*out, output);
     }
     CATCHALL;
 
diff --git a/src/api/c/lu.cpp b/src/api/c/lu.cpp
index c6004bc6cf..761f7b3dcd 100644
--- a/src/api/c/lu.cpp
+++ b/src/api/c/lu.cpp
@@ -7,24 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <lu.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::isLAPACKAvailable;
 
 template<typename T>
 static inline void lu(af_array *lower, af_array *upper, af_array *pivot,
-                      const af_array in)
-{
-    Array<T> lowerArray = createEmptyArray<T>(af::dim4());
-    Array<T> upperArray = createEmptyArray<T>(af::dim4());
+                      const af_array in) {
+    Array<T> lowerArray   = createEmptyArray<T>(af::dim4());
+    Array<T> upperArray   = createEmptyArray<T>(af::dim4());
     Array<int> pivotArray = createEmptyArray<int>(af::dim4());
 
     lu<T>(lowerArray, upperArray, pivotArray, getArray<T>(in));
@@ -35,15 +38,14 @@ static inline void lu(af_array *lower, af_array *upper, af_array *pivot,
 }
 
 template<typename T>
-static inline af_array lu_inplace(af_array in, bool is_lapack_piv)
-{
-    return getHandle(lu_inplace<T>(getWritableArray<T>(in), !is_lapack_piv));
+static inline af_array lu_inplace(af_array in, bool is_lapack_piv) {
+    return getHandle(lu_inplace<T>(getArray<T>(in), !is_lapack_piv));
 }
 
-af_err af_lu(af_array *lower, af_array *upper, af_array *pivot, const af_array in)
-{
+af_err af_lu(af_array *lower, af_array *upper, af_array *pivot,
+             const af_array in) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("lu can not be used in batch mode", AF_ERR_BATCH);
@@ -51,14 +53,24 @@ af_err af_lu(af_array *lower, af_array *upper, af_array *pivot, const af_array i
 
         af_dtype type = i_info.getType();
 
-        ARG_ASSERT(3, i_info.isFloating());                       // Only floating and complex types
+        ARG_ASSERT(0, lower != nullptr);
+        ARG_ASSERT(1, upper != nullptr);
+        ARG_ASSERT(2, pivot != nullptr);
+        ARG_ASSERT(3, i_info.isFloating());  // Only floating and complex types
 
-        switch(type) {
-            case f32: lu<float  >(lower, upper, pivot, in);  break;
-            case f64: lu<double >(lower, upper, pivot, in);  break;
-            case c32: lu<cfloat >(lower, upper, pivot, in);  break;
-            case c64: lu<cdouble>(lower, upper, pivot, in);  break;
-            default:  TYPE_ERROR(1, type);
+        if (i_info.ndims() == 0) {
+            AF_CHECK(af_create_handle(lower, 0, nullptr, type));
+            AF_CHECK(af_create_handle(upper, 0, nullptr, type));
+            AF_CHECK(af_create_handle(pivot, 0, nullptr, type));
+            return AF_SUCCESS;
+        }
+
+        switch (type) {
+            case f32: lu<float>(lower, upper, pivot, in); break;
+            case f64: lu<double>(lower, upper, pivot, in); break;
+            case c32: lu<cfloat>(lower, upper, pivot, in); break;
+            case c64: lu<cdouble>(lower, upper, pivot, in); break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
@@ -66,30 +78,40 @@ af_err af_lu(af_array *lower, af_array *upper, af_array *pivot, const af_array i
     return AF_SUCCESS;
 }
 
-af_err af_lu_inplace(af_array *pivot, af_array in, const bool is_lapack_piv)
-{
+af_err af_lu_inplace(af_array *pivot, af_array in, const bool is_lapack_piv) {
     try {
-
-        ArrayInfo i_info = getInfo(in);
-        af_dtype type = i_info.getType();
+        const ArrayInfo &i_info = getInfo(in);
+        af_dtype type           = i_info.getType();
 
         if (i_info.ndims() > 2) {
             AF_ERROR("lu can not be used in batch mode", AF_ERR_BATCH);
         }
 
-        ARG_ASSERT(1, i_info.isFloating()); // Only floating and complex types
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(0, pivot != nullptr);
 
-        af_array out;
+        if (i_info.ndims() == 0) {
+            return af_create_handle(pivot, 0, nullptr, type);
+        }
 
-        switch(type) {
-            case f32: out = lu_inplace<float  >(in, is_lapack_piv);  break;
-            case f64: out = lu_inplace<double >(in, is_lapack_piv);  break;
-            case c32: out = lu_inplace<cfloat >(in, is_lapack_piv);  break;
-            case c64: out = lu_inplace<cdouble>(in, is_lapack_piv);  break;
-            default:  TYPE_ERROR(1, type);
+        af_array out;
+        switch (type) {
+            case f32: out = lu_inplace<float>(in, is_lapack_piv); break;
+            case f64: out = lu_inplace<double>(in, is_lapack_piv); break;
+            case c32: out = lu_inplace<cfloat>(in, is_lapack_piv); break;
+            case c64: out = lu_inplace<cdouble>(in, is_lapack_piv); break;
+            default: TYPE_ERROR(1, type);
         }
-        if(pivot != NULL)
-            std::swap(*pivot, out);
+        std::swap(*pivot, out);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_is_lapack_available(bool *out) {
+    try {
+        *out = isLAPACKAvailable();
     }
     CATCHALL;
 
diff --git a/src/api/c/match_template.cpp b/src/api/c/match_template.cpp
index 1532497e31..91d81c383c 100644
--- a/src/api/c/match_template.cpp
+++ b/src/api/c/match_template.cpp
@@ -7,62 +7,91 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <match_template.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+#include <af/vision.h>
+
+#include <type_traits>
 
 using af::dim4;
-using namespace detail;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::conditional;
+using std::is_same;
 
-template<typename inType, typename outType>
-static
-af_array match_template(const af_array &sImg, const af_array tImg, af_match_type mType)
-{
-    switch(mType) {
-        case AF_SAD : return getHandle(match_template<inType, outType, AF_SAD >(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_ZSAD: return getHandle(match_template<inType, outType, AF_ZSAD>(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_LSAD: return getHandle(match_template<inType, outType, AF_LSAD>(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_SSD : return getHandle(match_template<inType, outType, AF_SSD >(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_ZSSD: return getHandle(match_template<inType, outType, AF_ZSSD>(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_LSSD: return getHandle(match_template<inType, outType, AF_LSSD>(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_NCC : return getHandle(match_template<inType, outType, AF_NCC >(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_ZNCC: return getHandle(match_template<inType, outType, AF_ZNCC>(getArray<inType>(sImg), getArray<inType>(tImg)));
-        case AF_SHD : return getHandle(match_template<inType, outType, AF_SHD >(getArray<inType>(sImg), getArray<inType>(tImg)));
-        default:      return getHandle(match_template<inType, outType, AF_SAD >(getArray<inType>(sImg), getArray<inType>(tImg)));
-    }
+template<typename InType>
+static af_array match_template(const af_array& sImg, const af_array tImg,
+                               af_match_type mType) {
+    using OutType = typename conditional<is_same<InType, double>::value, double,
+                                         float>::type;
+    return getHandle(match_template<InType, OutType>(
+        getArray<InType>(sImg), getArray<InType>(tImg), mType));
 }
 
-af_err af_match_template(af_array *out, const af_array search_img, const af_array template_img, const af_match_type m_type)
-{
+af_err af_match_template(af_array* out, const af_array search_img,
+                         const af_array template_img,
+                         const af_match_type m_type) {
     try {
-        ARG_ASSERT(3, (m_type>=AF_SAD && m_type<=AF_LSSD));
+        ARG_ASSERT(3, (m_type >= AF_SAD && m_type <= AF_LSSD));
 
-        ArrayInfo sInfo = getInfo(search_img);
-        ArrayInfo tInfo = getInfo(template_img);
+        const ArrayInfo& sInfo = getInfo(search_img);
+        const ArrayInfo& tInfo = getInfo(template_img);
 
-        dim4 const sDims = sInfo.dims();
-        dim4 const tDims = tInfo.dims();
+        dim4 const& sDims = sInfo.dims();
+        dim4 const& tDims = tInfo.dims();
 
-        dim_t sNumDims= sDims.ndims();
-        dim_t tNumDims= tDims.ndims();
-        ARG_ASSERT(1, (sNumDims>=2));
-        ARG_ASSERT(2, (tNumDims==2));
+        dim_t sNumDims = sDims.ndims();
+        dim_t tNumDims = tDims.ndims();
+        ARG_ASSERT(1, (sNumDims >= 2));
+        ARG_ASSERT(2, (tNumDims == 2));
 
         af_dtype sType = sInfo.getType();
-        ARG_ASSERT(1, (sType==tInfo.getType()));
+        ARG_ASSERT(1, (sType == tInfo.getType()));
 
         af_array output = 0;
-        switch(sType) {
-            case f64: output = match_template<double, double>(search_img, template_img, m_type); break;
-            case f32: output = match_template<float ,  float>(search_img, template_img, m_type); break;
-            case s32: output = match_template<int   ,  float>(search_img, template_img, m_type); break;
-            case u32: output = match_template<uint  ,  float>(search_img, template_img, m_type); break;
-            case  b8: output = match_template<char  ,  float>(search_img, template_img, m_type); break;
-            case  u8: output = match_template<uchar ,  float>(search_img, template_img, m_type); break;
-            default : TYPE_ERROR(1, sType);
+        switch (sType) {
+            case f64:
+                output =
+                    match_template<double>(search_img, template_img, m_type);
+                break;
+            case f32:
+                output =
+                    match_template<float>(search_img, template_img, m_type);
+                break;
+            case s32:
+                output = match_template<int>(search_img, template_img, m_type);
+                break;
+            case u32:
+                output = match_template<uint>(search_img, template_img, m_type);
+                break;
+            case s16:
+                output =
+                    match_template<short>(search_img, template_img, m_type);
+                break;
+            case u16:
+                output =
+                    match_template<ushort>(search_img, template_img, m_type);
+                break;
+            case b8:
+                output = match_template<char>(search_img, template_img, m_type);
+                break;
+            case s8:
+                output =
+                    match_template<schar>(search_img, template_img, m_type);
+                break;
+            case u8:
+                output =
+                    match_template<uchar>(search_img, template_img, m_type);
+                break;
+            default: TYPE_ERROR(1, sType);
         }
         std::swap(*out, output);
     }
diff --git a/src/api/c/mean.cpp b/src/api/c/mean.cpp
index 1f71a85a41..65fe057155 100644
--- a/src/api/c/mean.cpp
+++ b/src/api/c/mean.cpp
@@ -7,83 +7,84 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
-#include <af/defines.h>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
 #include <handle.hpp>
-#include <reduce.hpp>
-#include <arith.hpp>
 #include <math.hpp>
-#include <cast.hpp>
+#include <mean.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 
 #include "stats.h"
 
-using namespace detail;
-
-template<typename inType, typename outType>
-static outType mean(const af_array &in)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    outType result = mean<outType>(input); /* defined in stats.h */
-    return result;
+using af::dim4;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::imag;
+using detail::intl;
+using detail::mean;
+using detail::real;
+using detail::schar;
+using detail::uchar;
+using detail::uintl;
+using detail::ushort;
+
+template<typename Ti, typename To>
+static To mean(const af_array &in) {
+    using Tw = typename baseOutType<To>::type;
+    return mean<Ti, Tw, To>(getArray<Ti>(in));
 }
 
-template<typename inType, typename outType>
-static outType mean(const af_array &in, const af_array &weights)
-{
-    typedef typename baseOutType<outType>::type bType;
-
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    Array<outType> wts   = cast<outType>(getArray<bType>(weights));
-
-    outType result = mean<outType, bType>(input, getArray<bType>(weights)); /* defined in stats.h */
-
-    return result;
+template<typename T>
+static T mean(const af_array &in, const af_array &weights) {
+    using Tw = typename baseOutType<T>::type;
+    return mean<T, Tw>(castArray<T>(in), castArray<Tw>(weights));
 }
 
-template<typename inType, typename outType>
-static af_array mean(const af_array &in, const dim_t dim)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    Array<outType>  result= mean<outType>(input, dim); /* defined in stats.h */
-
-    return getHandle<outType>(result);
+template<typename Ti, typename To>
+static af_array mean(const af_array &in, const dim_t dim) {
+    using Tw = typename baseOutType<To>::type;
+    return getHandle<To>(mean<Ti, Tw, To>(getArray<Ti>(in), dim));
 }
 
-template<typename inType, typename outType>
-static af_array mean(const af_array &in, const af_array &weights, const dim_t dim)
-{
-    typedef typename baseOutType<outType>::type bType;
-
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    Array<outType> wts   = cast<outType>(getArray<bType>(weights));
-    Array<outType> retVal= mean<outType>(input, wts, dim); /* defined in stats.h */
-
-    return getHandle<outType>(retVal);
+template<typename T>
+static af_array mean(const af_array &in, const af_array &weights,
+                     const dim_t dim) {
+    using Tw = typename baseOutType<T>::type;
+    return getHandle<T>(
+        mean<T, Tw>(castArray<T>(in), castArray<Tw>(weights), dim));
 }
 
-af_err af_mean(af_array *out, const af_array in, const dim_t dim)
-{
+af_err af_mean(af_array *out, const af_array in, const dim_t dim) {
     try {
-        ARG_ASSERT(2, (dim>=0 && dim<=3));
-
-        af_array output = 0;
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
-            case f64: output = mean<double,  double>(in, dim); break;
-            case f32: output = mean<float ,  float >(in, dim); break;
-            case s32: output = mean<int   ,  float >(in, dim); break;
-            case u32: output = mean<uint  ,  float >(in, dim); break;
-            case s64: output = mean<intl  ,  double>(in, dim); break;
-            case u64: output = mean<uintl ,  double>(in, dim); break;
-            case  u8: output = mean<uchar ,  float >(in, dim); break;
-            case  b8: output = mean<char  ,  float >(in, dim); break;
-            case c32: output = mean<cfloat,  cfloat>(in, dim); break;
-            case c64: output = mean<cdouble,cdouble>(in, dim); break;
-            default : TYPE_ERROR(1, type);
+        ARG_ASSERT(2, (dim >= 0 && dim <= 3));
+
+        af_array output       = 0;
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        switch (type) {
+            case f64: output = mean<double, double>(in, dim); break;
+            case f32: output = mean<float, float>(in, dim); break;
+            case s32: output = mean<int, float>(in, dim); break;
+            case u32: output = mean<unsigned, float>(in, dim); break;
+            case s64: output = mean<intl, double>(in, dim); break;
+            case u64: output = mean<uintl, double>(in, dim); break;
+            case s16: output = mean<short, float>(in, dim); break;
+            case u16: output = mean<ushort, float>(in, dim); break;
+            case s8: output = mean<schar, float>(in, dim); break;
+            case u8: output = mean<uchar, float>(in, dim); break;
+            case b8: output = mean<char, float>(in, dim); break;
+            case c32: output = mean<cfloat, cfloat>(in, dim); break;
+            case c64: output = mean<cdouble, cdouble>(in, dim); break;
+            case f16: output = mean<half, half>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
@@ -91,99 +92,136 @@ af_err af_mean(af_array *out, const af_array in, const dim_t dim)
     return AF_SUCCESS;
 }
 
-af_err af_mean_weighted(af_array *out, const af_array in, const af_array weights, const dim_t dim)
-{
+af_err af_mean_weighted(af_array *out, const af_array in,
+                        const af_array weights, const dim_t dim) {
     try {
-        ARG_ASSERT(2, (dim>=0 && dim<=3));
-
-        af_array output = 0;
-        ArrayInfo iInfo = getInfo(in);
-        ArrayInfo wInfo = getInfo(weights);
-        af_dtype iType  = iInfo.getType();
-        af_dtype wType  = wInfo.getType();
-
-        ARG_ASSERT(2, (wType==f32 || wType==f64)); /* verify that weights are non-complex real numbers */
-
-        switch(iType) {
-            case f64: output = mean<double,  double>(in, weights, dim); break;
-            case f32: output = mean<float ,  float >(in, weights, dim); break;
-            case s32: output = mean<int   ,  float >(in, weights, dim); break;
-            case u32: output = mean<uint  ,  float >(in, weights, dim); break;
-            case s64: output = mean<intl  ,  double>(in, weights, dim); break;
-            case u64: output = mean<uintl ,  double>(in, weights, dim); break;
-            case  u8: output = mean<uchar ,  float >(in, weights, dim); break;
-            case  b8: output = mean<char  ,  float >(in, weights, dim); break;
-            case c32: output = mean<cfloat,  cfloat>(in, weights, dim); break;
-            case c64: output = mean<cdouble,cdouble>(in, weights, dim); break;
-            default : TYPE_ERROR(1, iType);
+        ARG_ASSERT(3, (dim >= 0 && dim <= 3));
+
+        af_array output        = 0;
+        const ArrayInfo &iInfo = getInfo(in);
+        const ArrayInfo &wInfo = getInfo(weights);
+        af_dtype iType         = iInfo.getType();
+        af_dtype wType         = wInfo.getType();
+
+        ARG_ASSERT(
+            2,
+            (wType == f32 ||
+             wType ==
+                 f64)); /* verify that weights are non-complex real numbers */
+
+        // FIXME: We should avoid additional copies
+        af_array w = weights;
+        if (iInfo.dims() != wInfo.dims()) {
+            dim4 iDims = iInfo.dims();
+            dim4 wDims = wInfo.dims();
+            dim4 tDims(1, 1, 1, 1);
+            for (int i = 0; i < 4; i++) {
+                ARG_ASSERT(2, wDims[i] == 1 || wDims[i] == iDims[i]);
+                tDims[i] = iDims[i] / wDims[i];
+            }
+            AF_CHECK(
+                af_tile(&w, weights, tDims[0], tDims[1], tDims[2], tDims[3]));
+        }
+
+        switch (iType) {
+            case f32:
+            case s32:
+            case u32:
+            case s16:
+            case u16:
+            case s8:
+            case u8:
+            case b8: output = mean<float>(in, w, dim); break;
+            case f64:
+            case s64:
+            case u64: output = mean<double>(in, w, dim); break;
+            case c32: output = mean<cfloat>(in, w, dim); break;
+            case c64: output = mean<cdouble>(in, w, dim); break;
+            case f16: output = mean<half>(in, w, dim); break;
+            default: TYPE_ERROR(1, iType);
         }
+
+        if (w != weights) { AF_CHECK(af_release_array(w)); }
         std::swap(*out, output);
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_mean_all(double *realVal, double *imagVal, const af_array in)
-{
+af_err af_mean_all(double *realVal, double *imagVal, const af_array in) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        switch (type) {
             case f64: *realVal = mean<double, double>(in); break;
-            case f32: *realVal = mean<float ,  float>(in); break;
-            case s32: *realVal = mean<int   ,  float>(in); break;
-            case u32: *realVal = mean<uint  ,  float>(in); break;
-            case s64: *realVal = mean<intl  , double>(in); break;
-            case u64: *realVal = mean<uintl , double>(in); break;
-            case  u8: *realVal = mean<uchar ,  float>(in); break;
-            case  b8: *realVal = mean<char  ,  float>(in); break;
+            case f32: *realVal = mean<float, float>(in); break;
+            case s32: *realVal = mean<int, float>(in); break;
+            case u32: *realVal = mean<unsigned, float>(in); break;
+            case s64: *realVal = mean<intl, double>(in); break;
+            case u64: *realVal = mean<uintl, double>(in); break;
+            case s16: *realVal = mean<short, float>(in); break;
+            case u16: *realVal = mean<ushort, float>(in); break;
+            case s8: *realVal = mean<schar, float>(in); break;
+            case u8: *realVal = mean<uchar, float>(in); break;
+            case b8: *realVal = mean<char, float>(in); break;
+            case f16:
+                *realVal = mean<arrayfire::common::half, float>(in);
+                break;
             case c32: {
-                cfloat tmp = mean<cfloat,cfloat>(in);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
+                cfloat tmp = mean<cfloat, cfloat>(in);
+                *realVal   = real(tmp);
+                *imagVal   = imag(tmp);
+            } break;
             case c64: {
-                cdouble tmp = mean<cdouble,cdouble>(in);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
-            default : TYPE_ERROR(1, type);
+                cdouble tmp = mean<cdouble, cdouble>(in);
+                *realVal    = real(tmp);
+                *imagVal    = imag(tmp);
+            } break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_mean_all_weighted(double *realVal, double *imagVal, const af_array in, const af_array weights)
-{
+af_err af_mean_all_weighted(double *realVal, double *imagVal, const af_array in,
+                            const af_array weights) {
     try {
-        ArrayInfo iInfo = getInfo(in);
-        ArrayInfo wInfo = getInfo(weights);
-        af_dtype iType  = iInfo.getType();
-        af_dtype wType  = wInfo.getType();
-
-        ARG_ASSERT(3, (wType==f32 || wType==f64)); /* verify that weights are non-complex real numbers */
-
-        switch(iType) {
-            case f64: *realVal = mean<double, double>(in, weights); break;
-            case f32: *realVal = mean<float ,  float>(in, weights); break;
-            case s32: *realVal = mean<int   ,  float>(in, weights); break;
-            case u32: *realVal = mean<uint  ,  float>(in, weights); break;
-            case s64: *realVal = mean<intl  , double>(in, weights); break;
-            case u64: *realVal = mean<uintl , double>(in, weights); break;
-            case  u8: *realVal = mean<uchar ,  float>(in, weights); break;
-            case  b8: *realVal = mean<char  ,  float>(in, weights); break;
+        const ArrayInfo &iInfo = getInfo(in);
+        const ArrayInfo &wInfo = getInfo(weights);
+        af_dtype iType         = iInfo.getType();
+        af_dtype wType         = wInfo.getType();
+
+        ARG_ASSERT(
+            3,
+            (wType == f32 ||
+             wType ==
+                 f64)); /* verify that weights are non-complex real numbers */
+
+        switch (iType) {
+            case f32:
+            case s32:
+            case u32:
+            case s16:
+            case u16:
+            case s8:
+            case u8:
+            case b8:
+            case f16: *realVal = mean<float>(in, weights); break;
+            case f64:
+            case s64:
+            case u64: *realVal = mean<double>(in, weights); break;
             case c32: {
-                cfloat tmp = mean<cfloat,cfloat>(in);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
+                cfloat tmp = mean<cfloat>(in, weights);
+                *realVal   = real(tmp);
+                *imagVal   = imag(tmp);
+            } break;
             case c64: {
-                cdouble tmp = mean<cdouble,cdouble>(in);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
-            default : TYPE_ERROR(1, iType);
+                cdouble tmp = mean<cdouble>(in, weights);
+                *realVal    = real(tmp);
+                *imagVal    = imag(tmp);
+            } break;
+            default: TYPE_ERROR(1, iType);
         }
     }
     CATCHALL;
diff --git a/src/api/c/meanshift.cpp b/src/api/c/meanshift.cpp
index 6c938548d4..bf09bc4d2a 100644
--- a/src/api/c/meanshift.cpp
+++ b/src/api/c/meanshift.cpp
@@ -7,59 +7,97 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <meanshift.hpp>
-#include <err_common.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 
 using af::dim4;
-using namespace detail;
+using detail::intl;
+using detail::meanshift;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
-template<typename T, bool is_color>
-static inline af_array mean_shift(const af_array &in, const float &s_sigma, const float &c_sigma, const unsigned iter)
-{
-    return getHandle(meanshift<T, is_color>(getArray<T>(in), s_sigma, c_sigma, iter));
+template<typename T>
+static inline af_array mean_shift(const af_array &in, const float &s_sigma,
+                                  const float &c_sigma, const unsigned niters,
+                                  const bool is_color) {
+    return getHandle(
+        meanshift<T>(getArray<T>(in), s_sigma, c_sigma, niters, is_color));
 }
 
-template<bool is_color>
-af_err mean_shift(af_array *out, const af_array in, const float s_sigma, const float c_sigma, const unsigned iter)
-{
+af_err af_mean_shift(af_array *out, const af_array in,
+                     const float spatial_sigma, const float chromatic_sigma,
+                     const unsigned num_iterations, const bool is_color) {
     try {
-        ARG_ASSERT(2, (s_sigma>=0));
-        ARG_ASSERT(3, (c_sigma>=0));
-        ARG_ASSERT(4, (iter>0));
+        ARG_ASSERT(2, (spatial_sigma >= 0));
+        ARG_ASSERT(3, (chromatic_sigma >= 0));
+        ARG_ASSERT(4, (num_iterations > 0));
 
-        ArrayInfo info = getInfo(in);
-        af_dtype type  = info.getType();
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
 
-        DIM_ASSERT(1, (dims.ndims()>=2));
-        if (is_color) DIM_ASSERT(1, (dims[2]==3));
+        DIM_ASSERT(1, (dims.ndims() >= 2));
+        if (is_color) { DIM_ASSERT(1, (dims[2] == 3)); }
 
         af_array output;
-        switch(type) {
-            case f32: output = mean_shift<float , is_color>(in, s_sigma, c_sigma, iter); break;
-            case f64: output = mean_shift<double, is_color>(in, s_sigma, c_sigma, iter); break;
-            case b8 : output = mean_shift<char  , is_color>(in, s_sigma, c_sigma, iter); break;
-            case s32: output = mean_shift<int   , is_color>(in, s_sigma, c_sigma, iter); break;
-            case u32: output = mean_shift<uint  , is_color>(in, s_sigma, c_sigma, iter); break;
-            case u8 : output = mean_shift<uchar , is_color>(in, s_sigma, c_sigma, iter); break;
-            default : TYPE_ERROR(1, type);
+        switch (type) {
+            case f32:
+                output = mean_shift<float>(in, spatial_sigma, chromatic_sigma,
+                                           num_iterations, is_color);
+                break;
+            case f64:
+                output = mean_shift<double>(in, spatial_sigma, chromatic_sigma,
+                                            num_iterations, is_color);
+                break;
+            case b8:
+                output = mean_shift<char>(in, spatial_sigma, chromatic_sigma,
+                                          num_iterations, is_color);
+                break;
+            case s32:
+                output = mean_shift<int>(in, spatial_sigma, chromatic_sigma,
+                                         num_iterations, is_color);
+                break;
+            case u32:
+                output = mean_shift<uint>(in, spatial_sigma, chromatic_sigma,
+                                          num_iterations, is_color);
+                break;
+            case s16:
+                output = mean_shift<short>(in, spatial_sigma, chromatic_sigma,
+                                           num_iterations, is_color);
+                break;
+            case u16:
+                output = mean_shift<ushort>(in, spatial_sigma, chromatic_sigma,
+                                            num_iterations, is_color);
+                break;
+            case s64:
+                output = mean_shift<intl>(in, spatial_sigma, chromatic_sigma,
+                                          num_iterations, is_color);
+                break;
+            case u64:
+                output = mean_shift<uintl>(in, spatial_sigma, chromatic_sigma,
+                                           num_iterations, is_color);
+                break;
+            case s8:
+                output = mean_shift<schar>(in, spatial_sigma, chromatic_sigma,
+                                           num_iterations, is_color);
+                break;
+            case u8:
+                output = mean_shift<uchar>(in, spatial_sigma, chromatic_sigma,
+                                           num_iterations, is_color);
+                break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
-
-af_err af_mean_shift(af_array *out, const af_array in, const float spatial_sigma, const float chromatic_sigma, const unsigned iter, const bool is_color)
-{
-    if (is_color)
-        return mean_shift<true >(out, in, spatial_sigma, chromatic_sigma, iter);
-    else
-        return mean_shift<false>(out, in, spatial_sigma, chromatic_sigma, iter);
-}
diff --git a/src/api/c/median.cpp b/src/api/c/median.cpp
index f53b180556..2fd0de18d8 100644
--- a/src/api/c/median.cpp
+++ b/src/api/c/median.cpp
@@ -7,95 +7,131 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/statistics.h>
-#include <af/index.h>
-#include <af/arith.h>
-#include <af/data.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
-#include <sort.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <math.hpp>
-#include <cast.hpp>
+#include <sort.hpp>
+#include <af/arith.h>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/index.h>
+#include <af/statistics.h>
 
-using namespace detail;
 using af::dim4;
+using detail::Array;
+using detail::division;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::sort;
 
 template<typename T>
-static double median(const af_array& in)
-{
-    const Array<T> input  = getArray<T>(in);
-    dim_t nElems = input.elements();
-    double mid      = (nElems + 1) / 2;
-    af_seq mdSpan[1]= {af_make_seq(mid-1, mid, 1)};
+static double median(const af_array& in) {
+    dim_t nElems = getInfo(in).elements();
     dim4 dims(nElems, 1, 1, 1);
+    ARG_ASSERT(0, nElems > 0);
 
     af_array temp = 0;
     AF_CHECK(af_moddims(&temp, in, 1, dims.get()));
-    Array<T> sortedArr = sort<T, true>(input, 0);
+    const Array<T>& input = getArray<T>(temp);
+
+    // Shortcut cases for 1 or 2 elements
+    if (nElems == 1) {
+        T result;
+        AF_CHECK(af_get_data_ptr((void*)&result, in));
+        return result;
+    }
+    if (nElems == 2) {
+        T result[2];
+        AF_CHECK(af_get_data_ptr((void*)&result, in));
+        return division(
+            (static_cast<double>(result[0]) + static_cast<double>(result[1])),
+            2.0);
+    }
+
+    double mid       = static_cast<double>(nElems + 1) / 2.0;
+    af_seq mdSpan[1] = {af_make_seq(mid - 1, mid, 1)};
+
+    Array<T> sortedArr = sort<T>(input, 0, true);
+
+    af_array sarrHandle = getHandle<T>(sortedArr);
 
     double result;
     T resPtr[2];
     af_array res = 0;
-    AF_CHECK(af_index(&res, getHandle<T>(sortedArr), 1, mdSpan));
+    AF_CHECK(af_index(&res, sarrHandle, 1, mdSpan));
     AF_CHECK(af_get_data_ptr((void*)&resPtr, res));
 
+    AF_CHECK(af_release_array(res));
+    AF_CHECK(af_release_array(sarrHandle));
+    AF_CHECK(af_release_array(temp));
+
     if (nElems % 2 == 1) {
         result = resPtr[0];
     } else {
-        if (input.isFloating()) {
-            result = division(resPtr[0] + resPtr[1], 2);
-        } else {
-            result = division((float)resPtr[0] + (float)resPtr[1], 2);
-        }
+        result = division(
+            static_cast<double>(resPtr[0]) + static_cast<double>(resPtr[1]),
+            2.0);
     }
 
-    AF_CHECK(af_release_array(res));
-    AF_CHECK(af_release_array(temp));
-
     return result;
 }
 
 template<typename T>
-static af_array median(const af_array& in, const dim_t dim)
-{
+static af_array median(const af_array& in, const dim_t dim) {
     const Array<T> input = getArray<T>(in);
-    Array<T> sortedIn   = sort<T, true>(input, dim);
 
-    int nElems    = input.dims()[0];
-    double mid    = (nElems + 1) / 2;
-    af_array left = 0;
+    // Shortcut case: a single element along the selected dimension
+    if (input.dims()[dim] == 1) {
+        Array<T> result = copyArray<T>(input);
+        return getHandle<T>(result);
+    }
+
+    Array<T> sortedIn = sort<T>(input, dim, true);
+
+    size_t dimLength = input.dims()[dim];
+    double mid       = static_cast<double>(dimLength + 1) / 2.0;
+    af_array left    = 0;
 
     af_seq slices[4] = {af_span, af_span, af_span, af_span};
-    slices[dim] = af_make_seq(mid-1.0, mid-1.0, 1.0);
+    slices[dim]      = af_make_seq(mid - 1.0, mid - 1.0, 1.0);
 
-    AF_CHECK(af_index(&left, getHandle<T>(sortedIn), input.ndims(), slices));
+    af_array sortedIn_handle = getHandle<T>(sortedIn);
+    AF_CHECK(af_index(&left, sortedIn_handle, input.ndims(), slices));
 
-    if (nElems % 2 == 1) {
+    af_array out = nullptr;
+    if (dimLength % 2 == 1) {
         // mid-1 is our guy
-        if (input.isFloating()) return left;
+        if (input.isFloating()) {
+            AF_CHECK(af_release_array(sortedIn_handle));
+            return left;
+        }
 
         // Return as floats for consistency
         af_array out;
         AF_CHECK(af_cast(&out, left, f32));
         AF_CHECK(af_release_array(left));
+        AF_CHECK(af_release_array(sortedIn_handle));
         return out;
     } else {
         // ((mid-1)+mid)/2 is our guy
-        dim4  dims = input.dims();
+        dim4 dims      = input.dims();
         af_array right = 0;
-        slices[dim] = af_make_seq(mid, mid, 1.0);
+        slices[dim]    = af_make_seq(mid, mid, 1.0);
 
-        AF_CHECK(af_index(&right, getHandle<T>(sortedIn), dims.ndims(), slices));
+        AF_CHECK(af_index(&right, sortedIn_handle, dims.ndims(), slices));
 
         af_array sumarr = 0;
         af_array carr   = 0;
-        af_array result = 0;
 
-        dim4 cdims = dim4(1, dims[1], dims[2], dims[3]);
-        AF_CHECK(af_constant(&carr, 0.5, cdims.ndims(), cdims.get(), input.isDouble() ? f64 : f32));
+        dim4 cdims = dims;
+        cdims[dim] = 1;
+        AF_CHECK(af_constant(&carr, 0.5, cdims.ndims(), cdims.get(),
+                             input.isDouble() ? f64 : f32));
 
         if (!input.isFloating()) {
             af_array lleft, rright;
@@ -103,54 +139,65 @@ static af_array median(const af_array& in, const dim_t dim)
             AF_CHECK(af_cast(&rright, right, f32));
             AF_CHECK(af_release_array(left));
             AF_CHECK(af_release_array(right));
-            left = lleft;
+            left  = lleft;
             right = rright;
         }
 
         AF_CHECK(af_add(&sumarr, left, right, false));
-        AF_CHECK(af_mul(&result, sumarr, carr, false));
+        AF_CHECK(af_mul(&out, sumarr, carr, false));
 
         AF_CHECK(af_release_array(left));
         AF_CHECK(af_release_array(right));
         AF_CHECK(af_release_array(sumarr));
         AF_CHECK(af_release_array(carr));
-        return result;
+        AF_CHECK(af_release_array(sortedIn_handle));
     }
+    return out;
 }
 
-af_err af_median_all(double *realVal, double *imagVal, const af_array in)
-{
+af_err af_median_all(double* realVal, double* imagVal,  // NOLINT
+                     const af_array in) {
+    UNUSED(imagVal);
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        ARG_ASSERT(2, info.ndims() > 0);
+        switch (type) {
             case f64: *realVal = median<double>(in); break;
-            case f32: *realVal = median<float >(in); break;
-            case s32: *realVal = median<int   >(in); break;
-            case u32: *realVal = median<uint  >(in); break;
-            case  u8: *realVal = median<uchar >(in); break;
-            default : TYPE_ERROR(1, type);
+            case f32: *realVal = median<float>(in); break;
+            case s32: *realVal = median<int>(in); break;
+            case u32: *realVal = median<uint>(in); break;
+            case s16: *realVal = median<short>(in); break;
+            case u16: *realVal = median<ushort>(in); break;
+            case s8: *realVal = median<schar>(in); break;
+            case u8: *realVal = median<uchar>(in); break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_median(af_array* out, const af_array in, const dim_t dim)
-{
+af_err af_median(af_array* out, const af_array in, const dim_t dim) {
     try {
-        ARG_ASSERT(2, (dim>=0 && dim<=0));
+        ARG_ASSERT(2, (dim >= 0 && dim < 4));
+
+        af_array output       = 0;
+        const ArrayInfo& info = getInfo(in);
 
-        af_array output = 0;
-        ArrayInfo info = getInfo(in);
+        ARG_ASSERT(1, info.ndims() > 0);
         af_dtype type = info.getType();
-        switch(type) {
+        switch (type) {
             case f64: output = median<double>(in, dim); break;
-            case f32: output = median<float >(in, dim); break;
-            case s32: output = median<int   >(in, dim); break;
-            case u32: output = median<uint  >(in, dim); break;
-            case  u8: output = median<uchar >(in, dim); break;
-            default : TYPE_ERROR(1, type);
+            case f32: output = median<float>(in, dim); break;
+            case s32: output = median<int>(in, dim); break;
+            case u32: output = median<uint>(in, dim); break;
+            case s16: output = median<short>(in, dim); break;
+            case u16: output = median<ushort>(in, dim); break;
+            case s8: output = median<schar>(in, dim); break;
+            case u8: output = median<uchar>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
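Review note on the median.cpp changes above: the flat `median<T>(const af_array&)` path sorts the flattened data, takes the middle element for an odd count, and averages the two middle elements in double precision for an even count. The host-only sketch below illustrates that arithmetic with the standard library. It is not ArrayFire code: `host_median` is a hypothetical name, and it assumes the patch's `(nElems + 1) / 2.0` midpoint is effectively truncated to an integer index when the sequence is built.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-only sketch of the flat median logic. `host_median` is a
// hypothetical helper used for illustration; it is not part of ArrayFire.
template<typename T>
double host_median(std::vector<T> values) {
    assert(!values.empty());  // plays the role of ARG_ASSERT(0, nElems > 0)
    std::sort(values.begin(), values.end());

    const std::size_t n = values.size();
    // 1-based midpoint (n + 1) / 2, truncated; subtracting 1 gives the
    // 0-based index of the lower middle element.
    const std::size_t lower = (n + 1) / 2 - 1;

    // Odd count: the lower middle element is the median.
    if (n % 2 == 1) { return static_cast<double>(values[lower]); }

    // Even count: average the two middle elements in double precision,
    // so integer inputs do not lose the fractional part.
    return (static_cast<double>(values[lower]) +
            static_cast<double>(values[lower + 1])) /
           2.0;
}
```

Casting both operands to double before adding matches the patch's switch from float to double casts, and it also avoids overflow when two large values of a narrow integer type such as `unsigned char` are summed.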
diff --git a/src/api/c/memory.cpp b/src/api/c/memory.cpp
new file mode 100644
index 0000000000..665a51ac9c
--- /dev/null
+++ b/src/api/c/memory.cpp
@@ -0,0 +1,832 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <memoryapi.hpp>
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <events.hpp>
+#include <handle.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <af/backend.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/memory.h>
+#include <af/version.h>
+
+#include <utility>
+
+using af::dim4;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createDeviceDataArray;
+using detail::deviceMemoryInfo;
+using detail::getActiveDeviceId;
+using detail::getDeviceCount;
+using detail::intl;
+using detail::isLocked;
+using detail::memAllocUser;
+using detail::memFreeUser;
+using detail::memLock;
+using detail::memUnlock;
+using detail::pinnedAlloc;
+using detail::pinnedFree;
+using detail::printMemInfo;
+using detail::schar;
+using detail::signalMemoryCleanup;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::move;
+using std::swap;
+
+af_err af_device_array(af_array *arr, void *data, const unsigned ndims,
+                       const dim_t *const dims, const af_dtype type) {
+    try {
+        AF_CHECK(af_init());
+
+        af_array res;
+
+        DIM_ASSERT(1, ndims >= 1);
+        dim4 d(1, 1, 1, 1);
+        for (unsigned i = 0; i < ndims; i++) {
+            d[i] = dims[i];
+            DIM_ASSERT(3, dims[i] >= 1);
+        }
+
+        switch (type) {
+            case f32:
+                res = getHandle(createDeviceDataArray<float>(d, data));
+                break;
+            case f64:
+                res = getHandle(createDeviceDataArray<double>(d, data));
+                break;
+            case c32:
+                res = getHandle(createDeviceDataArray<cfloat>(d, data));
+                break;
+            case c64:
+                res = getHandle(createDeviceDataArray<cdouble>(d, data));
+                break;
+            case s32:
+                res = getHandle(createDeviceDataArray<int>(d, data));
+                break;
+            case u32:
+                res = getHandle(createDeviceDataArray<uint>(d, data));
+                break;
+            case s64:
+                res = getHandle(createDeviceDataArray<intl>(d, data));
+                break;
+            case u64:
+                res = getHandle(createDeviceDataArray<uintl>(d, data));
+                break;
+            case s16:
+                res = getHandle(createDeviceDataArray<short>(d, data));
+                break;
+            case u16:
+                res = getHandle(createDeviceDataArray<ushort>(d, data));
+                break;
+            case s8:
+                res = getHandle(createDeviceDataArray<schar>(d, data));
+                break;
+            case u8:
+                res = getHandle(createDeviceDataArray<uchar>(d, data));
+                break;
+            case b8:
+                res = getHandle(createDeviceDataArray<char>(d, data));
+                break;
+            case f16:
+                res = getHandle(createDeviceDataArray<half>(d, data));
+                break;
+            default: TYPE_ERROR(4, type);
+        }
+
+        swap(*arr, res);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_get_device_ptr(void **data, const af_array arr) {
+    try {
+        af_dtype type = getInfo(arr).getType();
+
+        switch (type) {
+            // FIXME: Perform copy if memory is not contiguous
+            case f32: *data = getDevicePtr(getArray<float>(arr)); break;
+            case f64: *data = getDevicePtr(getArray<double>(arr)); break;
+            case c32: *data = getDevicePtr(getArray<cfloat>(arr)); break;
+            case c64: *data = getDevicePtr(getArray<cdouble>(arr)); break;
+            case s32: *data = getDevicePtr(getArray<int>(arr)); break;
+            case u32: *data = getDevicePtr(getArray<uint>(arr)); break;
+            case s64: *data = getDevicePtr(getArray<intl>(arr)); break;
+            case u64: *data = getDevicePtr(getArray<uintl>(arr)); break;
+            case s16: *data = getDevicePtr(getArray<short>(arr)); break;
+            case u16: *data = getDevicePtr(getArray<ushort>(arr)); break;
+            case s8: *data = getDevicePtr(getArray<schar>(arr)); break;
+            case u8: *data = getDevicePtr(getArray<uchar>(arr)); break;
+            case b8: *data = getDevicePtr(getArray<char>(arr)); break;
+            case f16: *data = getDevicePtr(getArray<half>(arr)); break;
+
+            default: TYPE_ERROR(4, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T>
+inline void lockArray(const af_array arr) {
+    memLock(getArray<T>(arr).get());
+}
+
+af_err af_lock_device_ptr(const af_array arr) { return af_lock_array(arr); }
+
+af_err af_lock_array(const af_array arr) {
+    try {
+        af_dtype type = getInfo(arr).getType();
+
+        switch (type) {
+            case f32: lockArray<float>(arr); break;
+            case f64: lockArray<double>(arr); break;
+            case c32: lockArray<cfloat>(arr); break;
+            case c64: lockArray<cdouble>(arr); break;
+            case s32: lockArray<int>(arr); break;
+            case u32: lockArray<uint>(arr); break;
+            case s64: lockArray<intl>(arr); break;
+            case u64: lockArray<uintl>(arr); break;
+            case s16: lockArray<short>(arr); break;
+            case u16: lockArray<ushort>(arr); break;
+            case s8: lockArray<schar>(arr); break;
+            case u8: lockArray<uchar>(arr); break;
+            case b8: lockArray<char>(arr); break;
+            case f16: lockArray<half>(arr); break;
+            default: TYPE_ERROR(4, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T>
+inline bool checkUserLock(const af_array arr) {
+    detail::Array<T> &out = const_cast<detail::Array<T> &>(getArray<T>(arr));
+    return isLocked(static_cast<void *>(out.get()));
+}
+
+af_err af_is_locked_array(bool *res, const af_array arr) {
+    try {
+        af_dtype type = getInfo(arr).getType();
+
+        switch (type) {
+            case f32: *res = checkUserLock<float>(arr); break;
+            case f64: *res = checkUserLock<double>(arr); break;
+            case c32: *res = checkUserLock<cfloat>(arr); break;
+            case c64: *res = checkUserLock<cdouble>(arr); break;
+            case s32: *res = checkUserLock<int>(arr); break;
+            case u32: *res = checkUserLock<uint>(arr); break;
+            case s64: *res = checkUserLock<intl>(arr); break;
+            case u64: *res = checkUserLock<uintl>(arr); break;
+            case s16: *res = checkUserLock<short>(arr); break;
+            case u16: *res = checkUserLock<ushort>(arr); break;
+            case s8: *res = checkUserLock<schar>(arr); break;
+            case u8: *res = checkUserLock<uchar>(arr); break;
+            case b8: *res = checkUserLock<char>(arr); break;
+            case f16: *res = checkUserLock<half>(arr); break;
+            default: TYPE_ERROR(4, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T>
+inline void unlockArray(const af_array arr) {
+    memUnlock(getArray<T>(arr).get());
+}
+
+af_err af_unlock_device_ptr(const af_array arr) { return af_unlock_array(arr); }
+
+af_err af_unlock_array(const af_array arr) {
+    try {
+        af_dtype type = getInfo(arr).getType();
+
+        switch (type) {
+            case f32: unlockArray<float>(arr); break;
+            case f64: unlockArray<double>(arr); break;
+            case c32: unlockArray<cfloat>(arr); break;
+            case c64: unlockArray<cdouble>(arr); break;
+            case s32: unlockArray<int>(arr); break;
+            case u32: unlockArray<uint>(arr); break;
+            case s64: unlockArray<intl>(arr); break;
+            case u64: unlockArray<uintl>(arr); break;
+            case s16: unlockArray<short>(arr); break;
+            case u16: unlockArray<ushort>(arr); break;
+            case s8: unlockArray<schar>(arr); break;
+            case u8: unlockArray<uchar>(arr); break;
+            case b8: unlockArray<char>(arr); break;
+            case f16: unlockArray<half>(arr); break;
+            default: TYPE_ERROR(4, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_alloc_device(void **ptr, const dim_t bytes) {
+    try {
+        AF_CHECK(af_init());
+        *ptr = memAllocUser(bytes);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_alloc_device_v2(void **ptr, const dim_t bytes) {
+    try {
+        AF_CHECK(af_init());
+#ifdef AF_OPENCL
+        auto *buf = static_cast<cl::Buffer *>(memAllocUser(bytes));
+        *ptr      = buf->operator()();
+
+        // Call retain to offset the reference-count decrement performed by
+        // the destructor of cl::Buffer
+        clRetainMemObject(cl_mem(*ptr));
+        delete buf;
+#else
+        *ptr = static_cast<void *>(memAllocUser(bytes));
+#endif
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_alloc_pinned(void **ptr, const dim_t bytes) {
+    try {
+        AF_CHECK(af_init());
+        *ptr = static_cast<void *>(pinnedAlloc<char>(bytes));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_free_device(void *ptr) {
+    try {
+        memFreeUser(ptr);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_free_device_v2(void *ptr) {
+    try {
+#ifdef AF_OPENCL
+        auto mem = static_cast<cl_mem>(ptr);
+        memFreeUser(new cl::Buffer(mem, false));
+#else
+        memFreeUser(ptr);
+#endif
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_free_pinned(void *ptr) {
+    try {
+        pinnedFree(ptr);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_alloc_host(void **ptr, const dim_t bytes) {
+    if ((*ptr = malloc(bytes))) {  // NOLINT(hicpp-no-malloc)
+        return AF_SUCCESS;
+    }
+    return AF_ERR_NO_MEM;
+}
+
+af_err af_free_host(void *ptr) {
+    free(ptr);  // NOLINT(hicpp-no-malloc)
+    return AF_SUCCESS;
+}
+
+af_err af_print_mem_info(const char *msg, const int device_id) {
+    try {
+        int device = device_id;
+        if (device == -1) { device = static_cast<int>(getActiveDeviceId()); }
+
+        if (msg != nullptr) {
+            ARG_ASSERT(0, strlen(msg) < 256);  // 256 character limit on msg
+        }
+        ARG_ASSERT(1, device >= 0 && device < getDeviceCount());
+
+        printMemInfo(msg ? msg : "", device);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_device_gc() {
+    try {
+        signalMemoryCleanup();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_device_mem_info(size_t *alloc_bytes, size_t *alloc_buffers,
+                          size_t *lock_bytes, size_t *lock_buffers) {
+    try {
+        deviceMemoryInfo(alloc_bytes, alloc_buffers, lock_bytes, lock_buffers);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_mem_step_size(const size_t step_bytes) {
+    try {
+        detail::setMemStepSize(step_bytes);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_get_mem_step_size(size_t *step_bytes) {
+    try {
+        *step_bytes = detail::getMemStepSize();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Memory Manager API
+////////////////////////////////////////////////////////////////////////////////
+
+MemoryManager &getMemoryManager(const af_memory_manager handle) {
+    return *static_cast<MemoryManager *>(handle);
+}
+
+af_memory_manager getHandle(MemoryManager &manager) {
+    MemoryManager *handle;
+    handle = &manager;
+    return static_cast<af_memory_manager>(handle);
+}
+
+af_err af_create_memory_manager(af_memory_manager *manager) {
+    try {
+        AF_CHECK(af_init());
+        std::unique_ptr<MemoryManager> m(new MemoryManager());
+        *manager = getHandle(*m.release());
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_release_memory_manager(af_memory_manager handle) {
+    try {
+        // NB: does NOT reset the internal memory manager to be the default:
+        // af_unset_memory_manager_pinned must be used to fully-reset with a new
+        // AF default memory manager
+        delete static_cast<MemoryManager *>(handle);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_set_memory_manager(af_memory_manager mgr) {
+    try {
+        std::unique_ptr<MemoryManagerFunctionWrapper> newManager(
+            new MemoryManagerFunctionWrapper(mgr));
+        // Calls shutdown() on the existing memory manager, but does not free
+        // the associated handle, if there is one
+        detail::setMemoryManager(std::move(newManager));
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_unset_memory_manager() {
+    try {
+        detail::resetMemoryManager();
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_set_memory_manager_pinned(af_memory_manager mgr) {
+    try {
+        // NB: does NOT free if a non-default implementation is set as the
+        // current memory manager - the user is responsible for freeing any
+        // controlled memory
+        std::unique_ptr<MemoryManagerFunctionWrapper> newManager(
+            new MemoryManagerFunctionWrapper(mgr));
+
+        // Calls shutdown() on the existing memory manager
+        detail::setMemoryManagerPinned(std::move(newManager));
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_unset_memory_manager_pinned() {
+    try {
+        detail::resetMemoryManagerPinned();
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_get_payload(af_memory_manager handle, void **payload) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        *payload               = manager.payload;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_payload(af_memory_manager handle, void *payload) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.payload        = payload;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Native memory interface wrapper implementations
+
+af_err af_memory_manager_get_active_device_id(af_memory_manager handle,
+                                              int *id) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        *id                    = manager.wrapper->getActiveDeviceId();
+    }
+
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_native_alloc(af_memory_manager handle, void **ptr,
+                                      size_t size) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        *ptr                   = manager.wrapper->nativeAlloc(size);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_native_free(af_memory_manager handle, void *ptr) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.wrapper->nativeFree(ptr);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_get_max_memory_size(af_memory_manager handle,
+                                             size_t *size, int id) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        *size                  = manager.wrapper->getMaxMemorySize(id);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_get_memory_pressure_threshold(af_memory_manager handle,
+                                                       float *value) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        *value                 = manager.wrapper->getMemoryPressureThreshold();
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_memory_pressure_threshold(af_memory_manager handle,
+                                                       float value) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.wrapper->setMemoryPressureThreshold(value);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Function setters
+
+af_err af_memory_manager_set_initialize_fn(af_memory_manager handle,
+                                           af_memory_manager_initialize_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.initialize_fn  = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_shutdown_fn(af_memory_manager handle,
+                                         af_memory_manager_shutdown_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.shutdown_fn    = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_alloc_fn(af_memory_manager handle,
+                                      af_memory_manager_alloc_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.alloc_fn       = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_allocated_fn(af_memory_manager handle,
+                                          af_memory_manager_allocated_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.allocated_fn   = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_unlock_fn(af_memory_manager handle,
+                                       af_memory_manager_unlock_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.unlock_fn      = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_signal_memory_cleanup_fn(
+    af_memory_manager handle, af_memory_manager_signal_memory_cleanup_fn fn) {
+    try {
+        MemoryManager &manager           = getMemoryManager(handle);
+        manager.signal_memory_cleanup_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_print_info_fn(af_memory_manager handle,
+                                           af_memory_manager_print_info_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.print_info_fn  = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_user_lock_fn(af_memory_manager handle,
+                                          af_memory_manager_user_lock_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.user_lock_fn   = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_user_unlock_fn(
+    af_memory_manager handle, af_memory_manager_user_unlock_fn fn) {
+    try {
+        MemoryManager &manager = getMemoryManager(handle);
+        manager.user_unlock_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_is_user_locked_fn(
+    af_memory_manager handle, af_memory_manager_is_user_locked_fn fn) {
+    try {
+        MemoryManager &manager    = getMemoryManager(handle);
+        manager.is_user_locked_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_get_memory_pressure_fn(
+    af_memory_manager handle, af_memory_manager_get_memory_pressure_fn fn) {
+    try {
+        MemoryManager &manager         = getMemoryManager(handle);
+        manager.get_memory_pressure_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn(
+    af_memory_manager handle,
+    af_memory_manager_jit_tree_exceeds_memory_pressure_fn fn) {
+    try {
+        MemoryManager &manager                      = getMemoryManager(handle);
+        manager.jit_tree_exceeds_memory_pressure_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_add_memory_management_fn(
+    af_memory_manager handle, af_memory_manager_add_memory_management_fn fn) {
+    try {
+        MemoryManager &manager           = getMemoryManager(handle);
+        manager.add_memory_management_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_memory_manager_set_remove_memory_management_fn(
+    af_memory_manager handle,
+    af_memory_manager_remove_memory_management_fn fn) {
+    try {
+        MemoryManager &manager              = getMemoryManager(handle);
+        manager.remove_memory_management_fn = fn;
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Memory Manager wrapper implementations
+
+MemoryManagerFunctionWrapper::MemoryManagerFunctionWrapper(
+    af_memory_manager handle)
+    : handle_(handle) {
+    MemoryManager &manager = getMemoryManager(handle_);
+    manager.wrapper        = this;
+}
+
+MemoryManagerFunctionWrapper::~MemoryManagerFunctionWrapper() {
+    MemoryManager &manager = getMemoryManager(handle_);
+    manager.wrapper        = nullptr;
+}
+
+void MemoryManagerFunctionWrapper::initialize() {
+    AF_CHECK(getMemoryManager(handle_).initialize_fn(handle_));
+}
+
+void MemoryManagerFunctionWrapper::shutdown() {
+    AF_CHECK(getMemoryManager(handle_).shutdown_fn(handle_));
+}
+
+void *MemoryManagerFunctionWrapper::alloc(bool user_lock, const unsigned ndims,
+                                          dim_t *dims,
+                                          const unsigned element_size) {
+    void *ptr;
+    AF_CHECK(getMemoryManager(handle_).alloc_fn(handle_, &ptr, (int)user_lock,
+                                                ndims, dims, element_size));
+
+    return ptr;
+}
+
+size_t MemoryManagerFunctionWrapper::allocated(void *ptr) {
+    size_t out;
+    AF_CHECK(getMemoryManager(handle_).allocated_fn(handle_, &out, ptr));
+    return out;
+}
+
+void MemoryManagerFunctionWrapper::unlock(void *ptr, bool user_unlock) {
+    AF_CHECK(
+        getMemoryManager(handle_).unlock_fn(handle_, ptr, (int)user_unlock));
+}
+
+void MemoryManagerFunctionWrapper::signalMemoryCleanup() {
+    AF_CHECK(getMemoryManager(handle_).signal_memory_cleanup_fn(handle_));
+}
+
+void MemoryManagerFunctionWrapper::printInfo(const char *msg,
+                                             const int device) {
+    AF_CHECK(getMemoryManager(handle_).print_info_fn(
+        handle_, const_cast<char *>(msg), device));
+}
+
+void MemoryManagerFunctionWrapper::userLock(const void *ptr) {
+    AF_CHECK(getMemoryManager(handle_).user_lock_fn(handle_,
+                                                    const_cast<void *>(ptr)));
+}
+
+void MemoryManagerFunctionWrapper::userUnlock(const void *ptr) {
+    AF_CHECK(getMemoryManager(handle_).user_unlock_fn(handle_,
+                                                      const_cast<void *>(ptr)));
+}
+
+bool MemoryManagerFunctionWrapper::isUserLocked(const void *ptr) {
+    int out;
+    AF_CHECK(getMemoryManager(handle_).is_user_locked_fn(
+        handle_, &out, const_cast<void *>(ptr)));
+    return static_cast<bool>(out);
+}
+
+void MemoryManagerFunctionWrapper::usageInfo(size_t * /*alloc_bytes*/,
+                                             size_t * /*alloc_buffers*/,
+                                             size_t * /*lock_bytes*/,
+                                             size_t * /*lock_buffers*/) {
+    // Not implemented in the public memory manager API, but for backward
+    // compatibility it must be part of the common memory manager interface
+    // so that it can be used with the default memory manager. Called from a
+    // backend's deviceMemoryInfo; throws so the error propagates to the caller
+    AF_ERROR(
+        "Device memory info/usage info not supported "
+        "for custom memory manager",
+        AF_ERR_NOT_SUPPORTED);
+}
+
+float MemoryManagerFunctionWrapper::getMemoryPressure() {
+    float out;
+    AF_CHECK(getMemoryManager(handle_).get_memory_pressure_fn(handle_, &out));
+    return out;
+}
+
+bool MemoryManagerFunctionWrapper::jitTreeExceedsMemoryPressure(size_t bytes) {
+    int out;
+    AF_CHECK(getMemoryManager(handle_).jit_tree_exceeds_memory_pressure_fn(
+        handle_, &out, bytes));
+    return static_cast<bool>(out);
+}
+
+size_t MemoryManagerFunctionWrapper::getMemStepSize() {
+    // Not implemented in the public memory manager API, but for backward
+    // compatibility it must be part of the common memory manager interface
+    // so that it can be used with the default memory manager. Throws so the
+    // error propagates to the caller
+    AF_ERROR("Memory step size API not implemented for custom memory manager",
+             AF_ERR_NOT_SUPPORTED);
+}
+
+void MemoryManagerFunctionWrapper::setMemStepSize(size_t new_step_size) {
+    // Not implemented in the public memory manager API, but for backward
+    // compatibility reasons, needs to be in the common memory manager interface
+    // so that it can be used with the default memory manager.
+    UNUSED(new_step_size);
+    AF_ERROR("Memory step size API not implemented for custom memory manager",
+             AF_ERR_NOT_SUPPORTED);
+}
+
+void MemoryManagerFunctionWrapper::addMemoryManagement(int device) {
+    getMemoryManager(handle_).add_memory_management_fn(handle_, device);
+}
+
+void MemoryManagerFunctionWrapper::removeMemoryManagement(int device) {
+    getMemoryManager(handle_).remove_memory_management_fn(handle_, device);
+}
diff --git a/src/api/c/memoryapi.hpp b/src/api/c/memoryapi.hpp
new file mode 100644
index 0000000000..a52947dce0
--- /dev/null
+++ b/src/api/c/memoryapi.hpp
@@ -0,0 +1,81 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/MemoryManagerBase.hpp>
+
+#include <af/memory.h>
+
+////////////////////////////////////////////////////////////////////////////////
+// Memory Manager API
+////////////////////////////////////////////////////////////////////////////////
+
+/**
+ * An internal wrapper around an af_memory_manager that forwards calls made
+ * through the MemoryManagerBase interface to the handle's function pointers
+ */
+class MemoryManagerFunctionWrapper final
+    : public arrayfire::common::MemoryManagerBase {
+    af_memory_manager handle_;
+
+   public:
+    MemoryManagerFunctionWrapper(af_memory_manager handle);
+    ~MemoryManagerFunctionWrapper();
+    void initialize() override;
+    void shutdown() override;
+    void *alloc(bool user_lock, const unsigned ndims, dim_t *dims,
+                const unsigned element_size) override;
+    size_t allocated(void *ptr) override;
+    void unlock(void *ptr, bool user_unlock) override;
+    void signalMemoryCleanup() override;
+    void printInfo(const char *msg, const int device) override;
+    void usageInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                   size_t *lock_bytes, size_t *lock_buffers) override;
+    void userLock(const void *ptr) override;
+    void userUnlock(const void *ptr) override;
+    bool isUserLocked(const void *ptr) override;
+    size_t getMemStepSize() override;
+    void setMemStepSize(size_t new_step_size) override;
+    float getMemoryPressure() override;
+    bool jitTreeExceedsMemoryPressure(size_t bytes) override;
+
+    void addMemoryManagement(int device) override;
+    void removeMemoryManagement(int device) override;
+};
+
+struct MemoryManager {
+    // Callbacks from public API
+    af_memory_manager_initialize_fn initialize_fn;
+    af_memory_manager_shutdown_fn shutdown_fn;
+    af_memory_manager_alloc_fn alloc_fn;
+    af_memory_manager_allocated_fn allocated_fn;
+    af_memory_manager_unlock_fn unlock_fn;
+    af_memory_manager_print_info_fn print_info_fn;
+    af_memory_manager_user_lock_fn user_lock_fn;
+    af_memory_manager_user_unlock_fn user_unlock_fn;
+    af_memory_manager_is_user_locked_fn is_user_locked_fn;
+    af_memory_manager_get_memory_pressure_fn get_memory_pressure_fn;
+    af_memory_manager_signal_memory_cleanup_fn signal_memory_cleanup_fn;
+    af_memory_manager_add_memory_management_fn add_memory_management_fn;
+    af_memory_manager_remove_memory_management_fn remove_memory_management_fn;
+    af_memory_manager_jit_tree_exceeds_memory_pressure_fn
+        jit_tree_exceeds_memory_pressure_fn;
+    // A generic payload for storing arbitrary data on the af_memory_manager;
+    // it is accessible from the handle
+    void *payload;
+    // A pointer to the MemoryManagerFunctionWrapper wrapping this struct that
+    // facilitates calling native memory functions directly from the handle. The
+    // lifetime of the wrapper is controlled by the relevant device manager
+    MemoryManagerFunctionWrapper *wrapper;
+};
+
+MemoryManager &getMemoryManager(const af_memory_manager handle);
+
+af_memory_manager getHandle(MemoryManager &manager);
diff --git a/src/api/c/moddims.cpp b/src/api/c/moddims.cpp
index e43efa067c..f419a2fb04 100644
--- a/src/api/c/moddims.cpp
+++ b/src/api/c/moddims.cpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2018, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -7,86 +7,112 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/data.h>
-#include <backend.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/moddims.hpp>
 #include <copy.hpp>
+#include <handle.hpp>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
 using af::dim4;
-using namespace detail;
-
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+namespace {
 template<typename T>
-Array<T> modDims(const Array<T>& in, const af::dim4 &newDims)
-{
-    //FIXME: Figure out a better way
-    evalArray<T>(in);
-
-    Array<T> Out = in;
-
-    if (!in.isOwner()) {
-        Out = copyArray<T>(in);
-    }
-    Out.modDims(newDims);
-
-    return Out;
+af_array modDims(const af_array in, const dim4& newDims) {
+    return getHandle(arrayfire::common::modDims(getArray<T>(in), newDims));
+}
+template<typename T>
+af_array flat(const af_array in) {
+    return getHandle(arrayfire::common::flat(getArray<T>(in)));
 }
+}  // namespace
 
-af_err af_moddims(af_array *out, const af_array in,
-                  const unsigned ndims, const dim_t * const dims)
-{
+af_err af_moddims(af_array* out, const af_array in, const unsigned ndims,
+                  const dim_t* const dims) {
     try {
+        if (ndims == 0) {
+            *out = retain(in);
+            return AF_SUCCESS;
+        }
         ARG_ASSERT(2, ndims >= 1);
         ARG_ASSERT(3, dims != NULL);
 
         af_array output = 0;
         dim4 newDims(ndims, dims);
-        ArrayInfo info = getInfo(in);
-        dim_t in_elements = info.elements();
-        dim_t new_elements = newDims.elements();
+        const ArrayInfo& info = getInfo(in);
+        dim_t in_elements     = info.elements();
+        dim_t new_elements    = newDims.elements();
 
         DIM_ASSERT(1, in_elements == new_elements);
 
         af_dtype type = info.getType();
 
-        switch(type) {
-        case f32: output = getHandle(modDims<float  >(getArray<float  >(in), newDims)); break;
-        case c32: output = getHandle(modDims<cfloat >(getArray<cfloat >(in), newDims)); break;
-        case f64: output = getHandle(modDims<double >(getArray<double >(in), newDims)); break;
-        case c64: output = getHandle(modDims<cdouble>(getArray<cdouble>(in), newDims)); break;
-        case b8:  output = getHandle(modDims<char   >(getArray<char   >(in), newDims)); break;
-        case s32: output = getHandle(modDims<int    >(getArray<int    >(in), newDims)); break;
-        case u32: output = getHandle(modDims<uint   >(getArray<uint   >(in), newDims)); break;
-        case u8:  output = getHandle(modDims<uchar  >(getArray<uchar  >(in), newDims)); break;
-        case s64: output = getHandle(modDims<intl   >(getArray<intl   >(in), newDims)); break;
-        case u64: output = getHandle(modDims<uintl  >(getArray<uintl  >(in), newDims)); break;
-        default: TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = modDims<float>(in, newDims); break;
+            case c32: output = modDims<cfloat>(in, newDims); break;
+            case f64: output = modDims<double>(in, newDims); break;
+            case c64: output = modDims<cdouble>(in, newDims); break;
+            case b8: output = modDims<char>(in, newDims); break;
+            case s32: output = modDims<int>(in, newDims); break;
+            case u32: output = modDims<uint>(in, newDims); break;
+            case s8: output = modDims<schar>(in, newDims); break;
+            case u8: output = modDims<uchar>(in, newDims); break;
+            case s64: output = modDims<intl>(in, newDims); break;
+            case u64: output = modDims<uintl>(in, newDims); break;
+            case s16: output = modDims<short>(in, newDims); break;
+            case u16: output = modDims<ushort>(in, newDims); break;
+            case f16: output = modDims<half>(in, newDims); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL
 
     return AF_SUCCESS;
 }
 
-af_err af_flat(af_array *out, const af_array in)
-{
-    af_array res;
+af_err af_flat(af_array* out, const af_array in) {
     try {
+        const ArrayInfo& info = getInfo(in);
 
-        ArrayInfo in_info = getInfo(in);
-
-        if (in_info.ndims() == 1) {
-            AF_CHECK(af_retain_array(&res, in));
+        if (info.ndims() == 1) {
+            *out = retain(in);
         } else {
-            const dim_t num = (dim_t)(in_info.elements());
-            AF_CHECK(af_moddims(&res, in, 1, &num));
+            af_array output = 0;
+            af_dtype type   = info.getType();
+
+            switch (type) {
+                case f32: output = flat<float>(in); break;
+                case c32: output = flat<cfloat>(in); break;
+                case f64: output = flat<double>(in); break;
+                case c64: output = flat<cdouble>(in); break;
+                case b8: output = flat<char>(in); break;
+                case s32: output = flat<int>(in); break;
+                case u32: output = flat<uint>(in); break;
+                case s8: output = flat<schar>(in); break;
+                case u8: output = flat<uchar>(in); break;
+                case s64: output = flat<intl>(in); break;
+                case u64: output = flat<uintl>(in); break;
+                case s16: output = flat<short>(in); break;
+                case u16: output = flat<ushort>(in); break;
+                case f16: output = flat<half>(in); break;
+                default: TYPE_ERROR(1, type);
+            }
+            std::swap(*out, output);
         }
-
-        std::swap(*out, res);
-    } CATCHALL;
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
diff --git a/src/api/c/moments.cpp b/src/api/c/moments.cpp
new file mode 100644
index 0000000000..ecef793a50
--- /dev/null
+++ b/src/api/c/moments.cpp
@@ -0,0 +1,88 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/data.h>
+#include <af/image.h>
+#include <af/index.h>
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <moments.hpp>
+#include <reorder.hpp>
+#include <tile.hpp>
+
+#include <limits>
+#include <vector>
+
+using af::dim4;
+
+using detail::Array;
+using std::vector;
+
+template<typename T>
+static inline void moments(af_array* out, const af_array in,
+                           af_moment_type moment) {
+    Array<float> temp = moments<T>(getArray<T>(in), moment);
+    *out              = getHandle<float>(temp);
+}
+
+af_err af_moments(af_array* out, const af_array in,
+                  const af_moment_type moment) {
+    try {
+        const ArrayInfo& in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
+
+        switch (type) {
+            case f32: moments<float>(out, in, moment); break;
+            case f64: moments<double>(out, in, moment); break;
+            case u32: moments<unsigned>(out, in, moment); break;
+            case s32: moments<int>(out, in, moment); break;
+            case u16: moments<unsigned short>(out, in, moment); break;
+            case s16: moments<short>(out, in, moment); break;
+            case b8: moments<char>(out, in, moment); break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+template<typename T>
+static inline void moment_copy(double* out, const af_array moments) {
+    const auto& info = getInfo(moments);
+    vector<T> h_moments(info.elements());
+    copyData(h_moments.data(), moments);
+
+    // convert to double
+    copy(begin(h_moments), end(h_moments), out);
+}
+
+af_err af_moments_all(double* out, const af_array in,
+                      const af_moment_type moment) {
+    try {
+        const ArrayInfo& in_info = getInfo(in);
+        dim4 idims               = in_info.dims();
+        DIM_ASSERT(1, idims[2] == 1 && idims[3] == 1);
+
+        af_array moments_arr;
+        AF_CHECK(af_moments(&moments_arr, in, moment));
+        moment_copy<float>(out, moments_arr);
+        AF_CHECK(af_release_array(moments_arr));
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/morph.cpp b/src/api/c/morph.cpp
index 980097c9f4..418b84e8a9 100644
--- a/src/api/c/morph.cpp
+++ b/src/api/c/morph.cpp
@@ -7,59 +7,140 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/indexing_helpers.hpp>
+#include <copy.hpp>
+#include <fftconvolve.hpp>
+#include <handle.hpp>
+#include <logic.hpp>
+#include <math.hpp>
 #include <morph.hpp>
+#include <unary.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using arrayfire::common::flip;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::logicOp;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::unaryOp;
+using detail::ushort;
 
-template<typename T, bool isDilation>
-static inline af_array morph(const af_array &in, const af_array &mask)
-{
-    const Array<T> &input = getArray<T>(in);
+template<typename T>
+af_array morph(const af_array &in, const af_array &mask, bool isDilation) {
+    const Array<T> &input  = getArray<T>(in);
     const Array<T> &filter = castArray<T>(mask);
-    Array<T> out = morph<T, isDilation>(input, filter);
+    Array<T> out           = morph<T>(input, filter, isDilation);
     return getHandle(out);
 }
 
-template<typename T, bool isDilation>
-static inline af_array morph3d(const af_array &in, const af_array &mask)
-{
-    const Array<T> &input = getArray<T>(in);
+template<>
+af_array morph<char>(const af_array &input, const af_array &mask,
+                     const bool isDilation) {
+    using detail::fftconvolve;
+
+#if defined(AF_CPU)
+#if defined(USE_MKL)
+    constexpr unsigned fftMethodThreshold = 11;
+#else
+    constexpr unsigned fftMethodThreshold = 27;
+#endif  // defined(USE_MKL)
+#elif defined(AF_CUDA)
+    constexpr unsigned fftMethodThreshold = 17;
+#elif defined(AF_OPENCL)
+    constexpr unsigned fftMethodThreshold = 19;
+#elif defined(AF_ONEAPI)
+    constexpr unsigned fftMethodThreshold = 19;
+#endif  // defined(AF_CPU)
+
+    const Array<float> se = castArray<float>(mask);
+    const dim4 &seDims    = se.dims();
+
+    if (seDims[0] <= fftMethodThreshold) {
+        auto out =
+            morph(getArray<char>(input), castArray<char>(mask), isDilation);
+        return getHandle(out);
+    }
+
+    DIM_ASSERT(2, (seDims[0] == seDims[1]));
+
+    const Array<char> in = getArray<char>(input);
+    const dim4 &inDims   = in.dims();
+    const auto paddedSe =
+        padArrayBorders(se,
+                        {static_cast<dim_t>(seDims[0] % 2 == 0),
+                         static_cast<dim_t>(seDims[1] % 2 == 0), 0, 0},
+                        {0, 0, 0, 0}, AF_PAD_ZERO);
+    if (isDilation) {
+        Array<float> dft =
+            fftconvolve(cast<float>(in), paddedSe, false, AF_BATCH_LHS, 2);
+
+        return getHandle(cast<char>(unaryOp<float, af_round_t>(dft)));
+    } else {
+        const Array<char> ONES   = createValueArray(inDims, scalar<char>(1));
+        const Array<float> ZEROS = createValueArray(inDims, scalar<float>(0));
+        const Array<char> inv    = arithOp<char, af_sub_t>(ONES, in, inDims);
+
+        Array<float> dft =
+            fftconvolve(cast<float>(inv), paddedSe, false, AF_BATCH_LHS, 2);
+
+        Array<float> rounded = unaryOp<float, af_round_t>(dft);
+        Array<char> thrshd   = logicOp<float, af_gt_t>(rounded, ZEROS, inDims);
+        Array<char> inverted = arithOp<char, af_sub_t>(ONES, thrshd, inDims);
+
+        return getHandle(inverted);
+    }
+}
+
+template<typename T>
+static inline af_array morph3d(const af_array &in, const af_array &mask,
+                               bool isDilation) {
+    const Array<T> &input  = getArray<T>(in);
     const Array<T> &filter = castArray<T>(mask);
-    Array<T> out = morph3d<T, isDilation>(input, filter);
+    Array<T> out           = morph3d<T>(input, filter, isDilation);
     return getHandle(out);
 }
 
-template<bool isDilation>
-static af_err morph(af_array *out, const af_array &in, const af_array &mask)
-{
+af_err morph(af_array *out, const af_array &in, const af_array &mask,
+             bool isDilation) {
     try {
-        ArrayInfo info = getInfo(in);
-        ArrayInfo mInfo= getInfo(mask);
-        af::dim4 dims  = info.dims();
-        af::dim4 mdims = mInfo.dims();
-        dim_t in_ndims = dims.ndims();
-        dim_t mask_ndims = mdims.ndims();
+        const ArrayInfo &info  = getInfo(in);
+        const ArrayInfo &mInfo = getInfo(mask);
+        af::dim4 dims          = info.dims();
+        af::dim4 mdims         = mInfo.dims();
+        dim_t in_ndims         = dims.ndims();
+        dim_t mask_ndims       = mdims.ndims();
 
         DIM_ASSERT(1, (in_ndims >= 2));
         DIM_ASSERT(2, (mask_ndims == 2));
 
         af_array output;
-        af_dtype type  = info.getType();
-        switch(type) {
-            case f32: output = morph<float , isDilation>(in, mask);      break;
-            case f64: output = morph<double, isDilation>(in, mask);      break;
-            case b8 : output = morph<char  , isDilation>(in, mask);      break;
-            case s32: output = morph<int   , isDilation>(in, mask);      break;
-            case u32: output = morph<uint  , isDilation>(in, mask);      break;
-            case u8 : output = morph<uchar , isDilation>(in, mask);      break;
-            default : TYPE_ERROR(1, type);
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32: output = morph<float>(in, mask, isDilation); break;
+            case f64: output = morph<double>(in, mask, isDilation); break;
+            case b8: output = morph<char>(in, mask, isDilation); break;
+            case s32: output = morph<int>(in, mask, isDilation); break;
+            case u32: output = morph<uint>(in, mask, isDilation); break;
+            case s16: output = morph<short>(in, mask, isDilation); break;
+            case u16: output = morph<ushort>(in, mask, isDilation); break;
+            case s8: output = morph<schar>(in, mask, isDilation); break;
+            case u8: output = morph<uchar>(in, mask, isDilation); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
@@ -68,30 +149,32 @@ static af_err morph(af_array *out, const af_array &in, const af_array &mask)
     return AF_SUCCESS;
 }
 
-template<bool isDilation>
-static af_err morph3d(af_array *out, const af_array &in, const af_array &mask)
-{
+af_err morph3d(af_array *out, const af_array &in, const af_array &mask,
+               bool isDilation) {
     try {
-        ArrayInfo info = getInfo(in);
-        ArrayInfo mInfo= getInfo(mask);
-        af::dim4 dims  = info.dims();
-        af::dim4 mdims = mInfo.dims();
-        dim_t in_ndims = dims.ndims();
-        dim_t mask_ndims = mdims.ndims();
+        const ArrayInfo &info  = getInfo(in);
+        const ArrayInfo &mInfo = getInfo(mask);
+        af::dim4 dims          = info.dims();
+        af::dim4 mdims         = mInfo.dims();
+        dim_t in_ndims         = dims.ndims();
+        dim_t mask_ndims       = mdims.ndims();
 
         DIM_ASSERT(1, (in_ndims >= 3));
         DIM_ASSERT(2, (mask_ndims == 3));
 
         af_array output;
-        af_dtype type  = info.getType();
-        switch(type) {
-            case f32: output = morph3d<float , isDilation>(in, mask);       break;
-            case f64: output = morph3d<double, isDilation>(in, mask);       break;
-            case b8 : output = morph3d<char  , isDilation>(in, mask);       break;
-            case s32: output = morph3d<int   , isDilation>(in, mask);       break;
-            case u32: output = morph3d<uint  , isDilation>(in, mask);       break;
-            case u8 : output = morph3d<uchar , isDilation>(in, mask);       break;
-            default : TYPE_ERROR(1, type);
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32: output = morph3d<float>(in, mask, isDilation); break;
+            case f64: output = morph3d<double>(in, mask, isDilation); break;
+            case b8: output = morph3d<char>(in, mask, isDilation); break;
+            case s32: output = morph3d<int>(in, mask, isDilation); break;
+            case u32: output = morph3d<uint>(in, mask, isDilation); break;
+            case s16: output = morph3d<short>(in, mask, isDilation); break;
+            case u16: output = morph3d<ushort>(in, mask, isDilation); break;
+            case s8: output = morph3d<schar>(in, mask, isDilation); break;
+            case u8: output = morph3d<uchar>(in, mask, isDilation); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
@@ -99,22 +182,19 @@ static af_err morph3d(af_array *out, const af_array &in, const af_array &mask)
 
     return AF_SUCCESS;
 }
-af_err af_dilate(af_array *out, const af_array in, const af_array mask)
-{
-    return morph<true>(out,in,mask);
+
+af_err af_dilate(af_array *out, const af_array in, const af_array mask) {
+    return morph(out, in, mask, true);
 }
 
-af_err af_erode(af_array *out, const af_array in, const af_array mask)
-{
-    return morph<false>(out,in,mask);
+af_err af_erode(af_array *out, const af_array in, const af_array mask) {
+    return morph(out, in, mask, false);
 }
 
-af_err af_dilate3(af_array *out, const af_array in, const af_array mask)
-{
-    return morph3d<true>(out,in,mask);
+af_err af_dilate3(af_array *out, const af_array in, const af_array mask) {
+    return morph3d(out, in, mask, true);
 }
 
-af_err af_erode3(af_array *out, const af_array in, const af_array mask)
-{
-    return morph3d<false>(out,in,mask);
+af_err af_erode3(af_array *out, const af_array in, const af_array mask) {
+    return morph3d(out, in, mask, false);
 }
diff --git a/src/api/c/nearest_neighbour.cpp b/src/api/c/nearest_neighbour.cpp
new file mode 100644
index 0000000000..10543649d9
--- /dev/null
+++ b/src/api/c/nearest_neighbour.cpp
@@ -0,0 +1,149 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <nearest_neighbour.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/vision.h>
+
+using af::dim4;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename Ti, typename To>
+static void nearest_neighbour(af_array* idx, af_array* dist,
+                              const af_array query, const af_array train,
+                              const dim_t dist_dim, const uint n_dist,
+                              const af_match_type dist_type) {
+    Array<uint> oIdxArray = createEmptyArray<uint>(af::dim4());
+    Array<To> oDistArray  = createEmptyArray<To>(af::dim4());
+
+    nearest_neighbour<Ti, To>(oIdxArray, oDistArray, getArray<Ti>(query),
+                              getArray<Ti>(train), dist_dim, n_dist, dist_type);
+
+    *idx  = getHandle<uint>(oIdxArray);
+    *dist = getHandle<To>(oDistArray);
+}
+
+af_err af_nearest_neighbour(af_array* idx, af_array* dist, const af_array query,
+                            const af_array train, const dim_t dist_dim,
+                            const uint n_dist, const af_match_type dist_type) {
+    try {
+        const ArrayInfo& qInfo = getInfo(query);
+        const ArrayInfo& tInfo = getInfo(train);
+        af_dtype qType         = qInfo.getType();
+        af_dtype tType         = tInfo.getType();
+        af::dim4 qDims         = qInfo.dims();
+        af::dim4 tDims         = tInfo.dims();
+
+        uint train_samples = (dist_dim == 0) ? 1 : 0;
+
+        DIM_ASSERT(4, (dist_dim == 0 || dist_dim == 1));
+        DIM_ASSERT(2, qDims[dist_dim] == tDims[dist_dim]);
+        DIM_ASSERT(2, qDims[2] == 1 && qDims[3] == 1);
+        DIM_ASSERT(3, tDims[2] == 1 && tDims[3] == 1);
+        DIM_ASSERT(5, n_dist > 0 && n_dist <= (uint)tDims[train_samples]);
+        ARG_ASSERT(5, n_dist > 0 && n_dist <= 256);
+        ARG_ASSERT(6, dist_type == AF_SAD || dist_type == AF_SSD ||
+                          dist_type == AF_SHD);
+        TYPE_ASSERT(qType == tType);
+
+        // Hamming distance (AF_SHD) supports only u8, u16, u32 and u64 inputs.
+        af_array oIdx;
+        af_array oDist;
+
+        if (dist_type == AF_SHD) {
+            TYPE_ASSERT(qType == u8 || qType == u16 || qType == u32 ||
+                        qType == u64);
+            switch (qType) {
+                case u8:
+                    nearest_neighbour<uchar, uint>(&oIdx, &oDist, query, train,
+                                                   dist_dim, n_dist, AF_SHD);
+                    break;
+                case u16:
+                    nearest_neighbour<ushort, uint>(&oIdx, &oDist, query, train,
+                                                    dist_dim, n_dist, AF_SHD);
+                    break;
+                case u32:
+                    nearest_neighbour<uint, uint>(&oIdx, &oDist, query, train,
+                                                  dist_dim, n_dist, AF_SHD);
+                    break;
+                case u64:
+                    nearest_neighbour<uintl, uint>(&oIdx, &oDist, query, train,
+                                                   dist_dim, n_dist, AF_SHD);
+                    break;
+                default: TYPE_ERROR(1, qType);
+            }
+        } else {
+            switch (qType) {
+                case f32:
+                    nearest_neighbour<float, float>(&oIdx, &oDist, query, train,
+                                                    dist_dim, n_dist,
+                                                    dist_type);
+                    break;
+                case f64:
+                    nearest_neighbour<double, double>(&oIdx, &oDist, query,
+                                                      train, dist_dim, n_dist,
+                                                      dist_type);
+                    break;
+                case s32:
+                    nearest_neighbour<int, int>(&oIdx, &oDist, query, train,
+                                                dist_dim, n_dist, dist_type);
+                    break;
+                case u32:
+                    nearest_neighbour<uint, uint>(&oIdx, &oDist, query, train,
+                                                  dist_dim, n_dist, dist_type);
+                    break;
+                case s64:
+                    nearest_neighbour<intl, intl>(&oIdx, &oDist, query, train,
+                                                  dist_dim, n_dist, dist_type);
+                    break;
+                case u64:
+                    nearest_neighbour<uintl, uintl>(&oIdx, &oDist, query, train,
+                                                    dist_dim, n_dist,
+                                                    dist_type);
+                    break;
+                case s16:
+                    nearest_neighbour<short, int>(&oIdx, &oDist, query, train,
+                                                  dist_dim, n_dist, dist_type);
+                    break;
+                case u16:
+                    nearest_neighbour<ushort, uint>(&oIdx, &oDist, query, train,
+                                                    dist_dim, n_dist,
+                                                    dist_type);
+                    break;
+                case s8:
+                    nearest_neighbour<schar, int>(&oIdx, &oDist, query, train,
+                                                  dist_dim, n_dist, dist_type);
+                    break;
+                case u8:
+                    nearest_neighbour<uchar, uint>(&oIdx, &oDist, query, train,
+                                                   dist_dim, n_dist, dist_type);
+                    break;
+                default: TYPE_ERROR(1, qType);
+            }
+        }
+        std::swap(*idx, oIdx);
+        std::swap(*dist, oDist);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/norm.cpp b/src/api/c/norm.cpp
index 17012ea7e6..7eef41afcc 100644
--- a/src/api/c/norm.cpp
+++ b/src/api/c/norm.cpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -7,143 +7,142 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <af/traits.hpp>
-#include <af/constants.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <math.hpp>
 #include <arith.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
 #include <lu.hpp>
+#include <math.hpp>
 #include <reduce.hpp>
-#include <complex.hpp>
+#include <af/array.h>
+#include <af/constants.h>
+#include <af/defines.h>
+#include <af/lapack.h>
+#include <af/traits.hpp>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::reduce;
+using detail::reduce_all;
+using detail::scalar;
 
 template<typename T>
-double matrixNorm(const Array<T> &A, double p)
-{
+using normReductionResult =
+    typename std::conditional<std::is_same<T, arrayfire::common::half>::value, float,
+                              T>::type;
+
+template<typename T>
+double matrixNorm(const Array<T> &A, double p) {
+    using RT = normReductionResult<T>;
     if (p == 1) {
-        Array<T> colSum = reduce<af_add_t, T, T>(A, 0);
-        return reduce_all<af_max_t, T, T>(colSum);
-    } else if (p == af::Inf) {
-        Array<T> rowSum = reduce<af_add_t, T, T>(A, 1);
-        return reduce_all<af_max_t, T, T>(rowSum);
+        Array<RT> colSum = reduce<af_add_t, T, normReductionResult<T>>(A, 0);
+        return getScalar<RT>(reduce_all<af_max_t, RT, RT>(colSum));
+    }
+    if (p == af::Inf) {
+        Array<RT> rowSum = reduce<af_add_t, T, RT>(A, 1);
+        return getScalar<RT>(reduce_all<af_max_t, RT, RT>(rowSum));
     }
 
-    AF_ERROR("This type of norm is not supported in ArrayFire\n", AF_ERR_NOT_SUPPORTED);
+    AF_ERROR("This type of norm is not supported in ArrayFire\n",
+             AF_ERR_NOT_SUPPORTED);
 }
 
 template<typename T>
-double vectorNorm(const Array<T> &A, double p)
-{
-    if (p == 1) {
-        return reduce_all<af_add_t, T, T>(A);
-    } else if (p == af::Inf) {
-        return reduce_all<af_max_t, T, T>(A);
+double vectorNorm(const Array<T> &A, double p) {
+    using RT = normReductionResult<T>;
+    if (p == 1) { return getScalar<RT>(reduce_all<af_add_t, T, RT>(A)); }
+    if (p == af::Inf) {
+        return getScalar<RT>(reduce_all<af_max_t, RT, RT>(cast<RT>(A)));
     } else if (p == 2) {
         Array<T> A_sq = arithOp<T, af_mul_t>(A, A, A.dims());
-        return std::sqrt(reduce_all<af_add_t, T, T>(A_sq));
+        return std::sqrt(getScalar<RT>(reduce_all<af_add_t, T, RT>(A_sq)));
     }
 
-    Array<T> P = createValueArray<T>(A.dims(), scalar<T>(p));
+    Array<T> P   = createValueArray<T>(A.dims(), scalar<T>(p));
     Array<T> A_p = arithOp<T, af_pow_t>(A, P, A.dims());
-    return std::pow(reduce_all<af_add_t, T, T>(A_p), 1.0/p);
+    return std::pow(getScalar<RT>(reduce_all<af_add_t, T, RT>(A_p)), (1.0 / p));
 }
 
 template<typename T>
-double LPQNorm(const Array<T> &A, double p, double q)
-{
-    Array<T> A_p_norm = createEmptyArray<T>(dim4());
+double LPQNorm(const Array<T> &A, double p, double q) {
+    using RT           = normReductionResult<T>;
+    Array<RT> A_p_norm = createEmptyArray<RT>(dim4());
 
     if (p == 1) {
-        A_p_norm = reduce<af_add_t, T, T>(A, 0);
+        A_p_norm = reduce<af_add_t, T, RT>(A, 0);
     } else {
-        Array<T> P = createValueArray<T>(A.dims(), scalar<T>(p));
-        Array<T> invP = createValueArray<T>(A.dims(), scalar<T>(1.0/p));
+        Array<T> P     = createValueArray<T>(A.dims(), scalar<T>(p));
+        Array<RT> invP = createValueArray<RT>(A.dims(), scalar<RT>(1.0 / p));
 
-        Array<T> A_p = arithOp<T, af_pow_t>(A, P, A.dims());
-        Array<T> A_p_sum = reduce<af_add_t, T, T>(A_p, 0);
-        A_p_norm = arithOp<T, af_pow_t>(A_p_sum, invP, invP.dims());
+        Array<T> A_p      = arithOp<T, af_pow_t>(A, P, A.dims());
+        Array<RT> A_p_sum = reduce<af_add_t, T, RT>(A_p, 0);
+        A_p_norm          = arithOp<RT, af_pow_t>(A_p_sum, invP, invP.dims());
     }
 
     if (q == 1) {
-        return reduce_all<af_add_t, T, T>(A_p_norm);
+        return getScalar<RT>(reduce_all<af_add_t, RT, RT>(A_p_norm));
     }
 
-    Array<T> Q = createValueArray<T>(A_p_norm.dims(), scalar<T>(q));
-    Array<T> A_p_norm_q = arithOp<T, af_pow_t>(A_p_norm, Q, Q.dims());
+    Array<RT> Q          = createValueArray<RT>(A_p_norm.dims(), scalar<RT>(q));
+    Array<RT> A_p_norm_q = arithOp<RT, af_pow_t>(A_p_norm, Q, Q.dims());
 
-    return std::pow(reduce_all<af_add_t, T, T>(A_p_norm_q), 1.0/q);
+    return std::pow(getScalar<RT>(reduce_all<af_add_t, RT, RT>(A_p_norm_q)),
+                    (1.0 / q));
 }
 
 template<typename T>
-double norm(const af_array a, const af_norm_type type, const double p, const double q)
-{
+double norm(const af_array a, const af_norm_type type, const double p,
+            const double q) {
+    using BT = typename af::dtype_traits<T>::base_type;
 
-    typedef typename af::dtype_traits<T>::base_type BT;
-
-    const Array<BT> A = abs<BT, T>(getArray<T>(a));
+    const Array<BT> A = detail::abs<BT, T>(getArray<T>(a));
 
     switch (type) {
-
-    case AF_NORM_EUCLID:
-        return vectorNorm(A, 2);
-
-    case AF_NORM_VECTOR_1:
-        return vectorNorm(A, 1);
-
-    case AF_NORM_VECTOR_INF:
-        return vectorNorm(A, af::Inf);
-
-    case AF_NORM_VECTOR_P:
-        return vectorNorm(A, p);
-
-    case AF_NORM_MATRIX_1:
-        return matrixNorm(A, 1);
-
-    case AF_NORM_MATRIX_INF:
-        return matrixNorm(A, af::Inf);
-
-    case AF_NORM_MATRIX_2:
-        return matrixNorm(A, 2);
-
-    case AF_NORM_MATRIX_L_PQ:
-        return LPQNorm(A, p, q);
-
-    default:
-        AF_ERROR("This type of norm is not supported in ArrayFire\n", AF_ERR_NOT_SUPPORTED);
+        case AF_NORM_EUCLID: return vectorNorm(A, 2);
+        case AF_NORM_VECTOR_1: return vectorNorm(A, 1);
+        case AF_NORM_VECTOR_INF: return vectorNorm(A, af::Inf);
+        case AF_NORM_VECTOR_P: return vectorNorm(A, p);
+        case AF_NORM_MATRIX_1: return matrixNorm(A, 1);
+        case AF_NORM_MATRIX_INF: return matrixNorm(A, af::Inf);
+        case AF_NORM_MATRIX_2: return matrixNorm(A, 2);
+        case AF_NORM_MATRIX_L_PQ: return LPQNorm(A, p, q);
+        default:
+            AF_ERROR("This type of norm is not supported in ArrayFire\n",
+                     AF_ERR_NOT_SUPPORTED);
     }
 }
 
-af_err af_norm(double *out, const af_array in,
-               const af_norm_type type, const double p, const double q)
-{
-
+af_err af_norm(double *out, const af_array in, const af_norm_type type,
+               const double p, const double q) {
     try {
-        ArrayInfo i_info = getInfo(in);
-
+        const ArrayInfo &i_info = getInfo(in);
         if (i_info.ndims() > 2) {
             AF_ERROR("solve can not be used in batch mode", AF_ERR_BATCH);
         }
 
         af_dtype i_type = i_info.getType();
-
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
-
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
         *out = 0;
-
-        switch(i_type) {
-        case f32: *out = norm<float  >(in, type, p, q);  break;
-        case f64: *out = norm<double >(in, type, p, q);  break;
-        case c32: *out = norm<cfloat >(in, type, p, q);  break;
-        case c64: *out = norm<cdouble>(in, type, p, q);  break;
-        default:  TYPE_ERROR(1, i_type);
+        if (i_info.ndims() == 0) { return AF_SUCCESS; }
+
+        switch (i_type) {
+            case f32: *out = norm<float>(in, type, p, q); break;
+            case f64: *out = norm<double>(in, type, p, q); break;
+            case c32: *out = norm<cfloat>(in, type, p, q); break;
+            case c64: *out = norm<cdouble>(in, type, p, q); break;
+            case f16: *out = norm<arrayfire::common::half>(in, type, p, q); break;
+            default: TYPE_ERROR(1, i_type);
         }
     }
     CATCHALL;
diff --git a/src/api/c/ops.hpp b/src/api/c/ops.hpp
deleted file mode 100644
index a94b984852..0000000000
--- a/src/api/c/ops.hpp
+++ /dev/null
@@ -1,260 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <backend.hpp>
-#include <math.hpp>
-
-#ifndef __DH__
-#define __DH__
-#endif
-
-#include "optypes.hpp"
-
-using namespace detail;
-
-template<typename T, af_op_t op>
-struct Binary
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(0);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs + rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_add_t>
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(0);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs + rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_mul_t>
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(1);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs * rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_or_t>
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(0);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs || rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_and_t>
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(1);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs && rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_notzero_t>
-{
-    __DH__ T init()
-    {
-        return detail::scalar<T>(0);
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return lhs + rhs;
-    }
-};
-
-template<typename T>
-struct Binary<T, af_min_t>
-{
-    __DH__ T init()
-    {
-        return detail::limit_max<T>();
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return detail::min(lhs, rhs);
-    }
-};
-
-#define SPECIALIZE_COMPLEX_MIN(T, Tr)           \
-    template<>                                  \
-    struct Binary<T, af_min_t>                  \
-    {                                           \
-        __DH__ T init()                         \
-        {                                       \
-            return detail::scalar<T>(           \
-                detail::limit_max<Tr>()         \
-                );                              \
-        }                                       \
-                                                \
-        __DH__ T operator() (T lhs, T rhs)      \
-        {                                       \
-            return detail::min(lhs, rhs);       \
-        }                                       \
-    };                                          \
-
-SPECIALIZE_COMPLEX_MIN(cfloat, float)
-SPECIALIZE_COMPLEX_MIN(cdouble, double)
-
-#undef SPECIALIZE_COMPLEX_MIN
-
-template<typename T>
-struct Binary<T, af_max_t>
-{
-    __DH__ T init()
-    {
-        return detail::limit_min<T>();
-    }
-
-    __DH__ T operator() (T lhs, T rhs)
-    {
-        return detail::max(lhs, rhs);
-    }
-};
-
-template<>
-struct Binary<char, af_max_t>
-{
-    __DH__ char init()
-    {
-        return 0;
-    }
-
-    __DH__ char operator() (char lhs, char rhs)
-    {
-        return detail::max(lhs > 0, rhs > 0);
-    }
-};
-
-template<>
-struct Binary<char, af_min_t>
-{
-    __DH__ char init()
-    {
-        return 1;
-    }
-
-    __DH__ char operator() (char lhs, char rhs)
-    {
-        return detail::min(lhs > 0, rhs > 0);
-    }
-};
-
-#define SPECIALIZE_FLOATING_MAX(T, Tr)          \
-    template<>                                  \
-    struct Binary<T, af_max_t>                  \
-    {                                           \
-        __DH__ T init()                         \
-        {                                       \
-            return detail::scalar<T>(           \
-                -detail::limit_max<Tr>()        \
-                );                              \
-        }                                       \
-                                                \
-        __DH__ T operator() (T lhs, T rhs)      \
-        {                                       \
-            return detail::max(lhs, rhs);       \
-        }                                       \
-    };                                          \
-
-SPECIALIZE_FLOATING_MAX(float, float)
-SPECIALIZE_FLOATING_MAX(double, double)
-
-#define SPECIALIZE_COMPLEX_MAX(T, Tr)           \
-    template<>                                  \
-    struct Binary<T, af_max_t>                  \
-    {                                           \
-        __DH__ T init()                         \
-        {                                       \
-            return detail::scalar<T>(           \
-                detail::scalar<Tr>(0)           \
-                );                              \
-        }                                       \
-                                                \
-        __DH__ T operator() (T lhs, T rhs)      \
-        {                                       \
-            return detail::max(lhs, rhs);       \
-        }                                       \
-    };                                          \
-
-SPECIALIZE_COMPLEX_MAX(cfloat, float)
-SPECIALIZE_COMPLEX_MAX(cdouble, double)
-
-#undef SPECIALIZE_FLOATING_MAX
-
-template<typename Ti, typename To, af_op_t op>
-struct Transform
-{
-    __DH__ To operator ()(Ti in)
-    {
-        return (To)(in);
-    }
-};
-
-template<typename Ti, typename To>
-struct Transform<Ti, To, af_or_t>
-{
-    __DH__ To operator ()(Ti in)
-    {
-        return (in != detail::scalar<Ti>(0));
-    }
-};
-
-template<typename Ti, typename To>
-struct Transform<Ti, To, af_and_t>
-{
-    __DH__ To operator ()(Ti in)
-    {
-        return (in != detail::scalar<Ti>(0));
-    }
-};
-
-template<typename Ti, typename To>
-struct Transform<Ti, To, af_notzero_t>
-{
-    __DH__ To operator ()(Ti in)
-    {
-        return (in != detail::scalar<Ti>(0));
-    }
-};
diff --git a/src/api/c/optypes.hpp b/src/api/c/optypes.hpp
index b9514daea5..44f1fd68d6 100644
--- a/src/api/c/optypes.hpp
+++ b/src/api/c/optypes.hpp
@@ -9,8 +9,9 @@
 
 #pragma once
 
-typedef enum {
-    af_add_t = 0,
+enum af_op_t : int {
+    af_none_t = -1,
+    af_add_t  = 0,
     af_sub_t,
     af_mul_t,
     af_div_t,
@@ -29,6 +30,7 @@ typedef enum {
     af_bitxor_t,
     af_bitshiftl_t,
     af_bitshiftr_t,
+    af_bitnot_t,
 
     af_min_t,
     af_max_t,
@@ -75,7 +77,7 @@ typedef enum {
     af_ceil_t,
     af_round_t,
     af_trunc_t,
-    af_sign_t,
+    af_signbit_t,
 
     af_rem_t,
     af_mod_t,
@@ -89,5 +91,13 @@ typedef enum {
     af_isinf_t,
     af_isnan_t,
 
-    af_noop_t
-} af_op_t;
+    af_sigmoid_t,
+
+    af_noop_t,
+
+    af_select_t,
+    af_not_select_t,
+    af_rsqrt_t,
+
+    af_moddims_t
+};
diff --git a/src/api/c/orb.cpp b/src/api/c/orb.cpp
index 98f5170027..7608553170 100644
--- a/src/api/c/orb.cpp
+++ b/src/api/c/orb.cpp
@@ -7,37 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <features.hpp>
+#include <handle.hpp>
+#include <orb.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/features.h>
 #include <af/vision.h>
-#include <handle.hpp>
-#include <err_common.hpp>
-#include <backend.hpp>
-#include <orb.hpp>
-#include <features.hpp>
 
 using af::dim4;
-using namespace detail;
+
+using detail::Array;
+using detail::createEmptyArray;
+using detail::uint;
 
 template<typename T, typename convAccT>
-static void orb(af_features& feat_, af_array& descriptor,
-                const af_array& in, const float fast_thr,
-                const unsigned max_feat, const float scl_fctr,
-                const unsigned levels, const bool blur_img)
-{
+static void orb(af_features& feat_, af_array& descriptor, const af_array& in,
+                const float fast_thr, const unsigned max_feat,
+                const float scl_fctr, const unsigned levels,
+                const bool blur_img) {
     Array<float> x     = createEmptyArray<float>(dim4());
     Array<float> y     = createEmptyArray<float>(dim4());
     Array<float> score = createEmptyArray<float>(dim4());
     Array<float> ori   = createEmptyArray<float>(dim4());
     Array<float> size  = createEmptyArray<float>(dim4());
-    Array<uint > desc  = createEmptyArray<uint >(dim4());
+    Array<uint> desc   = createEmptyArray<uint>(dim4());
 
     af_features_t feat;
 
-    feat.n = orb<T, convAccT>(x, y, score, ori, size, desc,
-                              getArray<T>(in), fast_thr, max_feat,
-                              scl_fctr, levels, blur_img);
+    feat.n = orb<T, convAccT>(x, y, score, ori, size, desc, getArray<T>(in),
+                              fast_thr, max_feat, scl_fctr, levels, blur_img);
 
     feat.x           = getHandle(x);
     feat.y           = getHandle(y);
@@ -45,36 +46,40 @@ static void orb(af_features& feat_, af_array& descriptor,
     feat.orientation = getHandle(ori);
     feat.size        = getHandle(size);
 
-    feat_ = getFeaturesHandle(feat);
+    feat_      = getFeaturesHandle(feat);
     descriptor = getHandle<unsigned>(desc);
 }
 
-af_err af_orb(af_features* feat, af_array* desc,
-              const af_array in, const float fast_thr,
-              const unsigned max_feat, const float scl_fctr,
-              const unsigned levels, const bool blur_img)
-{
+af_err af_orb(af_features* feat, af_array* desc, const af_array in,
+              const float fast_thr, const unsigned max_feat,
+              const float scl_fctr, const unsigned levels,
+              const bool blur_img) {
     try {
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
 
-        ARG_ASSERT(2, (dims[0] >= 7 && dims[1] >= 7 && dims[2] == 1 && dims[3] == 1));
+        ARG_ASSERT(
+            2, (dims[0] >= 7 && dims[1] >= 7 && dims[2] == 1 && dims[3] == 1));
         ARG_ASSERT(3, fast_thr > 0.0f);
         ARG_ASSERT(4, max_feat > 0);
         ARG_ASSERT(5, scl_fctr > 1.0f);
         ARG_ASSERT(6, levels > 0);
 
         dim_t in_ndims = dims.ndims();
-        DIM_ASSERT(1, (in_ndims <= 3 && in_ndims >= 2));
+        DIM_ASSERT(1, (in_ndims == 2));
 
         af_array tmp_desc;
-        af_dtype type  = info.getType();
-        switch(type) {
-            case f32: orb<float , float >(*feat, tmp_desc, in, fast_thr, max_feat,
-                                          scl_fctr, levels, blur_img); break;
-            case f64: orb<double, double>(*feat, tmp_desc, in, fast_thr, max_feat,
-                                          scl_fctr, levels, blur_img); break;
-            default : TYPE_ERROR(1, type);
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                orb<float, float>(*feat, tmp_desc, in, fast_thr, max_feat,
+                                  scl_fctr, levels, blur_img);
+                break;
+            case f64:
+                orb<double, double>(*feat, tmp_desc, in, fast_thr, max_feat,
+                                    scl_fctr, levels, blur_img);
+                break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*desc, tmp_desc);
     }
diff --git a/src/api/c/pinverse.cpp b/src/api/c/pinverse.cpp
new file mode 100644
index 0000000000..55c5cf8d7d
--- /dev/null
+++ b/src/api/c/pinverse.cpp
@@ -0,0 +1,201 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <backend.hpp>
+
+#include <arith.hpp>
+#include <blas.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/moddims.hpp>
+#include <diagonal.hpp>
+#include <handle.hpp>
+#include <logic.hpp>
+#include <math.hpp>
+#include <reduce.hpp>
+#include <select.hpp>
+#include <svd.hpp>
+#include <tile.hpp>
+#include <transpose.hpp>
+#include <af/array.h>
+#include <af/complex.h>
+#include <af/defines.h>
+#include <af/lapack.h>
+
+using af::dim4;
+using af::dtype_traits;
+using arrayfire::common::cast;
+using arrayfire::common::modDims;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createSelectNode;
+using detail::createSubArray;
+using detail::createValueArray;
+using detail::diagCreate;
+using detail::gemm;
+using detail::logicOp;
+using detail::max;
+using detail::min;
+using detail::reduce;
+using detail::scalar;
+using detail::svd;
+using detail::tile;
+using detail::uint;
+using std::swap;
+using std::vector;
+
+template<typename T>
+Array<T> getSubArray(const Array<T> &in, const bool copy, uint dim0begin = 0,
+                     uint dim0end = 0, uint dim1begin = 0, uint dim1end = 0,
+                     uint dim2begin = 0, uint dim2end = 0, uint dim3begin = 0,
+                     uint dim3end = 0) {
+    vector<af_seq> seqs = {
+        {static_cast<double>(dim0begin), static_cast<double>(dim0end), 1.},
+        {static_cast<double>(dim1begin), static_cast<double>(dim1end), 1.},
+        {static_cast<double>(dim2begin), static_cast<double>(dim2end), 1.},
+        {static_cast<double>(dim3begin), static_cast<double>(dim3end), 1.}};
+    return createSubArray<T>(in, seqs, copy);
+}
+
+// Moore-Penrose Pseudoinverse
+template<typename T>
+Array<T> pinverseSvd(const Array<T> &in, const double tol) {
+    in.eval();
+    dim_t M = in.dims()[0];
+    dim_t N = in.dims()[1];
+    dim_t P = in.dims()[2];
+    dim_t Q = in.dims()[3];
+
+    // Compute SVD
+    using Tr = typename dtype_traits<T>::base_type;
+    // Ideally, these initializations should use createEmptyArray(), but for
+    // some reason, linux-opencl-k80 will produce wrong results for large arrays
+    Array<T> u  = createValueArray<T>(dim4(M, M, P, Q), scalar<T>(0));
+    Array<T> vT = createValueArray<T>(dim4(N, N, P, Q), scalar<T>(0));
+    Array<Tr> sVec =
+        createValueArray<Tr>(dim4(min(M, N), 1, P, Q), scalar<Tr>(0));
+    for (dim_t j = 0; j < Q; ++j) {
+        for (dim_t i = 0; i < P; ++i) {
+            Array<T> inSlice =
+                getSubArray(in, false, 0, M - 1, 0, N - 1, i, i, j, j);
+            Array<Tr> sVecSlice = getSubArray(
+                sVec, false, 0, sVec.dims()[0] - 1, 0, 0, i, i, j, j);
+            Array<T> uSlice  = getSubArray(u, false, 0, u.dims()[0] - 1, 0,
+                                           u.dims()[1] - 1, i, i, j, j);
+            Array<T> vTSlice = getSubArray(vT, false, 0, vT.dims()[0] - 1, 0,
+                                           vT.dims()[1] - 1, i, i, j, j);
+            svd<T, Tr>(sVecSlice, uSlice, vTSlice, inSlice);
+        }
+    }
+
+    // Cast s back to original data type for matmul later
+    // (since svd() makes s' type the base type of T)
+    Array<T> sVecCast = cast<T, Tr>(sVec);
+
+    Array<T> v = transpose(vT, true);
+
+    // Build relative tolerance array
+    Array<Tr> sVecMax    = reduce<af_max_t, Tr, Tr>(sVec, 0);
+    Array<T> sVecMaxCast = cast<T, Tr>(sVecMax);
+    double tolMulShape   = tol * static_cast<double>(max(M, N));
+    Array<T> tolMulShapeArr =
+        createValueArray<T>(sVecMaxCast.dims(), scalar<T>(tolMulShape));
+    Array<T> relTol =
+        arithOp<T, af_mul_t>(tolMulShapeArr, sVecMaxCast, sVecMaxCast.dims());
+    Array<T> relTolArr = tile<T>(relTol, dim4(sVecCast.dims()[0]));
+
+    // Take the reciprocal of sVec's values to build s pinverse, but zero
+    // out any value below relTol (including exact zeros) in order to avoid
+    // very large reciprocals
+    Array<T> ones      = createValueArray<T>(sVecCast.dims(), scalar<T>(1.));
+    Array<T> sVecRecip = arithOp<T, af_div_t>(ones, sVecCast, sVecCast.dims());
+    Array<char> cond =
+        logicOp<T, af_ge_t>(sVecCast, relTolArr, sVecCast.dims());
+    Array<T> zeros = createValueArray<T>(sVecCast.dims(), scalar<T>(0.));
+    sVecRecip = createSelectNode<T>(cond, sVecRecip, zeros, sVecRecip.dims());
+
+    // Make s vector into s pinverse array
+    Array<T> sVecRecipMod = modDims<T>(
+        sVecRecip,
+        dim4(sVecRecip.dims()[0], (sVecRecip.dims()[2] * sVecRecip.dims()[3])));
+    Array<T> sPinv = diagCreate<T>(sVecRecipMod, 0);
+    sPinv          = modDims<T>(sPinv, dim4(sPinv.dims()[0], sPinv.dims()[1],
+                                            sVecRecip.dims()[2], sVecRecip.dims()[3]));
+
+    Array<T> uT = transpose(u, true);
+
+    // Crop v and u* for final matmul later based on s+'s size, because
+    // sVec produced by svd() has minimal dim length (no extra zeroes).
+    // Thus s+ produced by diagCreate() will have minimal dims as well,
+    // and v could have extra columns (dim1) or u* extra rows (dim0)
+    if (v.dims()[1] > sPinv.dims()[0]) {
+        v = getSubArray(v, false, 0, v.dims()[0] - 1, 0, sPinv.dims()[0] - 1, 0,
+                        v.dims()[2] - 1, 0, v.dims()[3] - 1);
+    }
+    if (uT.dims()[0] > sPinv.dims()[1]) {
+        uT = getSubArray(uT, false, 0, sPinv.dims()[1] - 1, 0, uT.dims()[1] - 1,
+                         0, uT.dims()[2] - 1, 0, uT.dims()[3] - 1);
+    }
+
+    Array<T> vsPinv =
+        createEmptyArray<T>(dim4(v.dims()[0], sPinv.dims()[1], P, Q));
+    Array<T> out =
+        createEmptyArray<T>(dim4(vsPinv.dims()[0], uT.dims()[1], P, Q));
+
+    T alpha = scalar<T>(1.0);
+    T beta  = scalar<T>(0.0);
+
+    gemm<T>(vsPinv, AF_MAT_NONE, AF_MAT_NONE, &alpha, v, sPinv, &beta);
+    gemm<T>(out, AF_MAT_NONE, AF_MAT_NONE, &alpha, vsPinv, uT, &beta);
+
+    return out;
+}
+
+template<typename T>
+static inline af_array pinverse(const af_array in, const double tol) {
+    return getHandle(pinverseSvd<T>(getArray<T>(in), tol));
+}
+
+af_err af_pinverse(af_array *out, const af_array in, const double tol,
+                   const af_mat_prop options) {
+    try {
+        const ArrayInfo &i_info = getInfo(in);
+
+        af_dtype type = i_info.getType();
+
+        if (options != AF_MAT_NONE) {
+            AF_ERROR("Using this property is not yet supported in pinverse",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(2, tol >= 0.);            // Ensure tolerance is not negative
+
+        af_array output;
+
+        if (i_info.ndims() == 0) { return af_retain_array(out, in); }
+
+        switch (type) {
+            case f32: output = pinverse<float>(in, tol); break;
+            case f64: output = pinverse<double>(in, tol); break;
+            case c32: output = pinverse<cfloat>(in, tol); break;
+            case c64: output = pinverse<cdouble>(in, tol); break;
+            default: TYPE_ERROR(1, type);
+        }
+        swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/plot.cpp b/src/api/c/plot.cpp
index 0723a30eff..be5aab06b1 100644
--- a/src/api/c/plot.cpp
+++ b/src/api/c/plot.cpp
@@ -7,100 +7,462 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/data.h>
 #include <af/graphics.h>
 #include <af/image.h>
 
-#include <ArrayInfo.hpp>
-#include <graphics_common.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <platform.hpp>
 #include <plot.hpp>
 #include <reduce.hpp>
-#include <join.hpp>
 #include <reorder.hpp>
-#include <handle.hpp>
+#include <transpose.hpp>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::getFGMarker;
+using arrayfire::common::getGLType;
+using arrayfire::common::makeContextCurrent;
+using arrayfire::common::step_round;
+using detail::Array;
+using detail::copy_plot;
+using detail::forgeManager;
+using detail::reduce;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+// Requires in_ to be in either [order, n] or [n, order] format
+template<typename T, int order>
+fg_chart setup_plot(fg_window window, const af_array in_,
+                    const af_cell* const props, fg_plot_type ptype,
+                    fg_marker_type mtype) {
+    ForgeModule& _ = forgePlugin();
+
+    Array<T> in = getArray<T>(in_);
+
+    af::dim4 dims = in.dims();
+
+    DIM_ASSERT(1, dims.ndims() == 2);
+    DIM_ASSERT(1, (dims[0] == order || dims[1] == order));
+
+    // The backend expects 2D data laid out as [order, n]
+    if (dims[1] == order) { in = transpose(in, false); }
+
+    af::dim4 tdims = in.dims();  // transposed dimensions
+
+    ForgeManager& fgMngr = forgeManager();
+
+    // Get the chart for the current grid position (if any)
+    fg_chart chart      = NULL;
+    fg_chart_type ctype = order == 2 ? FG_CHART_2D : FG_CHART_3D;
+
+    if (props->col > -1 && props->row > -1) {
+        chart = fgMngr.getChart(window, props->row, props->col, ctype);
+    } else {
+        chart = fgMngr.getChart(window, 0, 0, ctype);
+    }
 
-#if defined(WITH_GRAPHICS)
-using namespace graphics;
+    fg_plot plot =
+        fgMngr.getPlot(chart, tdims[1], getGLType<T>(), ptype, mtype);
+
+    // ArrayFire LOGO Orange shade
+    FG_CHECK(_.fg_set_plot_color(plot, 0.929f, 0.529f, 0.212f, 1.0));
+
+    // If chart axes limits do not have a manual override
+    // then compute and set axes limits
+    if (!fgMngr.getChartAxesOverride(chart)) {
+        float cmin[3], cmax[3];
+        T dmin[3], dmax[3];
+        FG_CHECK(_.fg_get_chart_axes_limits(
+            &cmin[0], &cmax[0], &cmin[1], &cmax[1], &cmin[2], &cmax[2], chart));
+        copyData(dmin, reduce<af_min_t, T, T>(in, 1));
+        copyData(dmax, reduce<af_max_t, T, T>(in, 1));
+
+        if (cmin[0] == 0 && cmax[0] == 0 && cmin[1] == 0 && cmax[1] == 0 &&
+            cmin[2] == 0 && cmax[2] == 0) {
+            // No previous limits. Set without checking
+            cmin[0] = step_round(dmin[0], false);
+            cmax[0] = step_round(dmax[0], true);
+            cmin[1] = step_round(dmin[1], false);
+            cmax[1] = step_round(dmax[1], true);
+            if (order == 3) { cmin[2] = step_round(dmin[2], false); }
+            if (order == 3) { cmax[2] = step_round(dmax[2], true); }
+        } else {
+            if (cmin[0] > dmin[0]) { cmin[0] = step_round(dmin[0], false); }
+            if (cmax[0] < dmax[0]) { cmax[0] = step_round(dmax[0], true); }
+            if (cmin[1] > dmin[1]) { cmin[1] = step_round(dmin[1], false); }
+            if (cmax[1] < dmax[1]) { cmax[1] = step_round(dmax[1], true); }
+            if (order == 3) {
+                if (cmin[2] > dmin[2]) { cmin[2] = step_round(dmin[2], false); }
+                if (cmax[2] < dmax[2]) { cmax[2] = step_round(dmax[2], true); }
+            }
+        }
+        FG_CHECK(_.fg_set_chart_axes_limits(chart, cmin[0], cmax[0], cmin[1],
+                                            cmax[1], cmin[2], cmax[2]));
+    }
+    copy_plot<T>(in, plot);
+
+    return chart;
+}
 
 template<typename T>
-fg::Plot* setup_plot(const af_array X, const af_array Y)
-{
-    Array<T> xIn = getArray<T>(X);
-    Array<T> yIn = getArray<T>(Y);
+fg_chart setup_plot(fg_window window, const af_array in_, const int order,
+                    const af_cell* const props, fg_plot_type ptype,
+                    fg_marker_type mtype) {
+    if (order == 2) {
+        return setup_plot<T, 2>(window, in_, props, ptype, mtype);
+    }
+    if (order == 3) {
+        return setup_plot<T, 3>(window, in_, props, ptype, mtype);
+    }
+    // Dummy to avoid warnings
+    return NULL;
+}
 
-    T xmax = reduce_all<af_max_t, T, T>(xIn);
-    T xmin = reduce_all<af_min_t, T, T>(xIn);
-    T ymax = reduce_all<af_max_t, T, T>(yIn);
-    T ymin = reduce_all<af_min_t, T, T>(yIn);
+af_err plotWrapper(const af_window window, const af_array in,
+                   const int order_dim, const af_cell* const props,
+                   fg_plot_type ptype    = FG_PLOT_LINE,
+                   fg_marker_type marker = FG_MARKER_NONE) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
+        af_dtype type         = info.getType();
 
-    dim4 rdims(1, 0, 2, 3);
+        DIM_ASSERT(0, dims.ndims() == 2);
+        DIM_ASSERT(0, dims[order_dim] == 2 || dims[order_dim] == 3);
 
-    Array<T> Z = join(1, xIn, yIn);
-    Array<T> P = reorder(Z, rdims);
+        makeContextCurrent(window);
 
-    ArrayInfo Xinfo = getInfo(X);
-    af::dim4 X_dims = Xinfo.dims();
+        fg_chart chart = NULL;
 
-    ForgeManager& fgMngr = ForgeManager::getInstance();
-    fg::Plot* plot = fgMngr.getPlot(X_dims.elements(), getGLType<T>());
-    plot->setColor(1.0, 0.0, 0.0);
-    plot->setAxesLimits(xmax, xmin, ymax, ymin);
-    plot->setXAxisTitle("X Axis");
-    plot->setYAxisTitle("Y Axis");
+        switch (type) {
+            case f32:
+                chart = setup_plot<float>(window, in, dims[order_dim], props,
+                                          ptype, marker);
+                break;
+            case s32:
+                chart = setup_plot<int>(window, in, dims[order_dim], props,
+                                        ptype, marker);
+                break;
+            case u32:
+                chart = setup_plot<uint>(window, in, dims[order_dim], props,
+                                         ptype, marker);
+                break;
+            case s16:
+                chart = setup_plot<short>(window, in, dims[order_dim], props,
+                                          ptype, marker);
+                break;
+            case u16:
+                chart = setup_plot<ushort>(window, in, dims[order_dim], props,
+                                           ptype, marker);
+                break;
+            case s8:
+                chart = setup_plot<schar>(window, in, dims[order_dim], props,
+                                          ptype, marker);
+                break;
+            case u8:
+                chart = setup_plot<uchar>(window, in, dims[order_dim], props,
+                                          ptype, marker);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
 
-    copy_plot<T>(P, plot);
+        auto gridDims = forgeManager().getWindowGrid(window);
 
-    return plot;
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
 }
-#endif
-
-af_err af_draw_plot(const af_window wind, const af_array X, const af_array Y, const af_cell* const props)
-{
-#if defined(WITH_GRAPHICS)
-    if(wind==0) {
-        std::cerr<<"Not a valid window"<<std::endl;
-        return AF_SUCCESS;
+
+af_err plotWrapper(const af_window window, const af_array X, const af_array Y,
+                   const af_array Z, const af_cell* const props,
+                   fg_plot_type ptype    = FG_PLOT_LINE,
+                   fg_marker_type marker = FG_MARKER_NONE) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& xInfo = getInfo(X);
+        const af::dim4& xDims  = xInfo.dims();
+        af_dtype xType         = xInfo.getType();
+
+        const ArrayInfo& yInfo = getInfo(Y);
+        const af::dim4& yDims  = yInfo.dims();
+        af_dtype yType         = yInfo.getType();
+
+        const ArrayInfo& zInfo = getInfo(Z);
+        const af::dim4& zDims  = zInfo.dims();
+        af_dtype zType         = zInfo.getType();
+
+        DIM_ASSERT(0, xDims == yDims);
+        DIM_ASSERT(0, xDims == zDims);
+        DIM_ASSERT(0, xInfo.isVector());
+
+        TYPE_ASSERT(xType == yType);
+        TYPE_ASSERT(xType == zType);
+
+        // Join X, Y and Z along dim 1 into one array for setup_plot
+        af_array in    = 0;
+        af_array pIn[] = {X, Y, Z};
+        AF_CHECK(af_join_many(&in, 1, 3, pIn));
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        switch (xType) {
+            case f32:
+                chart = setup_plot<float>(window, in, 3, props, ptype, marker);
+                break;
+            case s32:
+                chart = setup_plot<int>(window, in, 3, props, ptype, marker);
+                break;
+            case u32:
+                chart = setup_plot<uint>(window, in, 3, props, ptype, marker);
+                break;
+            case s16:
+                chart = setup_plot<short>(window, in, 3, props, ptype, marker);
+                break;
+            case u16:
+                chart = setup_plot<ushort>(window, in, 3, props, ptype, marker);
+                break;
+            case s8:
+                chart = setup_plot<schar>(window, in, 3, props, ptype, marker);
+                break;
+            case u8:
+                chart = setup_plot<uchar>(window, in, 3, props, ptype, marker);
+                break;
+            default: TYPE_ERROR(1, xType);
+        }
+        auto gridDims = forgeManager().getWindowGrid(window);
+
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+
+        AF_CHECK(af_release_array(in));
     }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
+af_err plotWrapper(const af_window window, const af_array X, const af_array Y,
+                   const af_cell* const props,
+                   fg_plot_type ptype    = FG_PLOT_LINE,
+                   fg_marker_type marker = FG_MARKER_NONE) {
     try {
-        ArrayInfo Xinfo = getInfo(X);
-        af::dim4 X_dims = Xinfo.dims();
-        af_dtype Xtype  = Xinfo.getType();
-
-        ArrayInfo Yinfo = getInfo(Y);
-        af::dim4 Y_dims = Yinfo.dims();
-        af_dtype Ytype  = Yinfo.getType();
-
-        DIM_ASSERT(0, X_dims == Y_dims);
-        DIM_ASSERT(0, X_dims == Y_dims);
-        DIM_ASSERT(0, Xinfo.isVector());
-
-        TYPE_ASSERT(Xtype == Ytype);
-
-        fg::Window* window = reinterpret_cast<fg::Window*>(wind);
-        window->makeCurrent();
-        fg::Plot* plot = NULL;
-
-        switch(Xtype) {
-            case f32: plot = setup_plot<float  >(X, Y); break;
-            case s32: plot = setup_plot<int    >(X, Y); break;
-            case u32: plot = setup_plot<uint   >(X, Y); break;
-            case u8 : plot = setup_plot<uchar  >(X, Y); break;
-            default:  TYPE_ERROR(1, Xtype);
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& xInfo = getInfo(X);
+        const af::dim4& xDims  = xInfo.dims();
+        af_dtype xType         = xInfo.getType();
+
+        const ArrayInfo& yInfo = getInfo(Y);
+        const af::dim4& yDims  = yInfo.dims();
+        af_dtype yType         = yInfo.getType();
+
+        DIM_ASSERT(0, xDims == yDims);
+        DIM_ASSERT(0, xInfo.isVector());
+
+        TYPE_ASSERT(xType == yType);
+
+        // Join X and Y along dim 1 into one array for setup_plot
+        af_array in = 0;
+        AF_CHECK(af_join(&in, 1, X, Y));
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        switch (xType) {
+            case f32:
+                chart = setup_plot<float>(window, in, 2, props, ptype, marker);
+                break;
+            case s32:
+                chart = setup_plot<int>(window, in, 2, props, ptype, marker);
+                break;
+            case u32:
+                chart = setup_plot<uint>(window, in, 2, props, ptype, marker);
+                break;
+            case s16:
+                chart = setup_plot<short>(window, in, 2, props, ptype, marker);
+                break;
+            case u16:
+                chart = setup_plot<ushort>(window, in, 2, props, ptype, marker);
+                break;
+            case s8:
+                chart = setup_plot<schar>(window, in, 2, props, ptype, marker);
+                break;
+            case u8:
+                chart = setup_plot<uchar>(window, in, 2, props, ptype, marker);
+                break;
+            default: TYPE_ERROR(1, xType);
         }
+        auto gridDims = forgeManager().getWindowGrid(window);
 
-        if (props->col>-1 && props->row>-1)
-            window->draw(props->col, props->row, *plot, props->title);
-        else
-            window->draw(*plot);
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+
+        AF_CHECK(af_release_array(in));
     }
     CATCHALL;
     return AF_SUCCESS;
-#else
-    return AF_ERR_NO_GFX;
-#endif
+}
+
+// Plot API
+af_err af_draw_plot_nd(const af_window wind, const af_array in,
+                       const af_cell* const props) {
+    return plotWrapper(wind, in, 1, props);
+}
+
+af_err af_draw_plot_2d(const af_window wind, const af_array X, const af_array Y,
+                       const af_cell* const props) {
+    return plotWrapper(wind, X, Y, props);
+}
+
+af_err af_draw_plot_3d(const af_window wind, const af_array X, const af_array Y,
+                       const af_array Z, const af_cell* const props) {
+    return plotWrapper(wind, X, Y, Z, props);
+}
+
+// Deprecated Plot API
+af_err af_draw_plot(const af_window wind, const af_array X, const af_array Y,
+                    const af_cell* const props) {
+    return plotWrapper(wind, X, Y, props);
+}
+
+af_err af_draw_plot3(const af_window wind, const af_array P,
+                     const af_cell* const props) {
+    try {
+        const ArrayInfo& info = getInfo(P);
+        af::dim4 dims         = info.dims();
+
+        if (dims.ndims() == 2 && dims[1] == 3) {
+            return plotWrapper(wind, P, 1, props);
+        }
+        if (dims.ndims() == 2 && dims[0] == 3) {
+            return plotWrapper(wind, P, 0, props);
+        } else if (dims.ndims() == 1 && dims[0] % 3 == 0) {
+            dim4 rdims(dims.elements() / 3, 3, 1, 1);
+            af_array in = 0;
+            AF_CHECK(af_moddims(&in, P, rdims.ndims(), rdims.get()));
+            af_err err = plotWrapper(wind, in, 1, props);
+            AF_CHECK(af_release_array(in));
+            return err;
+        } else {
+            AF_RETURN_ERROR(
+                "Input needs to be either [n, 3] or [3, n] or [3n, 1]",
+                AF_ERR_SIZE);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+// Scatter API
+af_err af_draw_scatter_nd(const af_window wind, const af_array in,
+                          const af_marker_type af_marker,
+                          const af_cell* const props) {
+    try {
+        fg_marker_type fg_marker = getFGMarker(af_marker);
+        return plotWrapper(wind, in, 1, props, FG_PLOT_SCATTER, fg_marker);
+    }
+    CATCHALL;
+}
+
+af_err af_draw_scatter_2d(const af_window wind, const af_array X,
+                          const af_array Y, const af_marker_type af_marker,
+                          const af_cell* const props) {
+    try {
+        fg_marker_type fg_marker = getFGMarker(af_marker);
+        return plotWrapper(wind, X, Y, props, FG_PLOT_SCATTER, fg_marker);
+    }
+    CATCHALL;
+}
+
+af_err af_draw_scatter_3d(const af_window wind, const af_array X,
+                          const af_array Y, const af_array Z,
+                          const af_marker_type af_marker,
+                          const af_cell* const props) {
+    try {
+        fg_marker_type fg_marker = getFGMarker(af_marker);
+        return plotWrapper(wind, X, Y, Z, props, FG_PLOT_SCATTER, fg_marker);
+    }
+    CATCHALL;
+}
+
+// Deprecated Scatter API
+af_err af_draw_scatter(const af_window wind, const af_array X, const af_array Y,
+                       const af_marker_type af_marker,
+                       const af_cell* const props) {
+    try {
+        fg_marker_type fg_marker = getFGMarker(af_marker);
+        return plotWrapper(wind, X, Y, props, FG_PLOT_SCATTER, fg_marker);
+    }
+    CATCHALL;
+}
+
+af_err af_draw_scatter3(const af_window wind, const af_array P,
+                        const af_marker_type af_marker,
+                        const af_cell* const props) {
+    try {
+        fg_marker_type fg_marker = getFGMarker(af_marker);
+        const ArrayInfo& info    = getInfo(P);
+        af::dim4 dims            = info.dims();
+
+        if (dims.ndims() == 2 && dims[1] == 3) {
+            return plotWrapper(wind, P, 1, props, FG_PLOT_SCATTER, fg_marker);
+        }
+        if (dims.ndims() == 2 && dims[0] == 3) {
+            return plotWrapper(wind, P, 0, props, FG_PLOT_SCATTER, fg_marker);
+        } else if (dims.ndims() == 1 && dims[0] % 3 == 0) {
+            dim4 rdims(dims.elements() / 3, 3, 1, 1);
+            af_array in = 0;
+            AF_CHECK(af_moddims(&in, P, rdims.ndims(), rdims.get()));
+            af_err err =
+                plotWrapper(wind, in, 1, props, FG_PLOT_SCATTER, fg_marker);
+            AF_CHECK(af_release_array(in));
+            return err;
+        } else {
+            AF_RETURN_ERROR(
+                "Input needs to be either [n, 3] or [3, n] or [3n, 1]",
+                AF_ERR_SIZE);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
 }
diff --git a/src/api/c/print.cpp b/src/api/c/print.cpp
index a85505df6c..2f1ae15c8d 100644
--- a/src/api/c/print.cpp
+++ b/src/api/c/print.cpp
@@ -7,45 +7,62 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <iostream>
-#include <iomanip>
-#include <vector>
-#include <af/array.h>
-#include <af/data.h>
-#include <copy.hpp>
 #include <print.hpp>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
+
 #include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <sparse_handle.hpp>
 #include <type_util.hpp>
 
+#include <af/array.h>
+#include <af/data.h>
+#include <af/internal.h>
+
+#include <iomanip>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <vector>
+
 #include <af/index.h>
 
-using namespace detail;
-using std::ostream;
+using arrayfire::getSparseArray;
+using arrayfire::common::half;
+using arrayfire::common::SparseArray;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 using std::cout;
 using std::endl;
+using std::ostream;
 using std::vector;
 
 template<typename T>
-static void printer(ostream &out, const T* ptr, const ArrayInfo &info, unsigned dim)
-{
-
-    dim_t stride =   info.strides()[dim];
-    dim_t d      =   info.dims()[dim];
+static void printer(ostream &out, const T *ptr, const ArrayInfo &info,
+                    unsigned dim, const int precision) {
+    dim_t stride = info.strides()[dim];
+    dim_t d      = info.dims()[dim];
     ToNum<T> toNum;
+    using namespace detail;  // NOLINT
 
-    if(dim == 0) {
-        for(dim_t i = 0, j = 0; i < d; i++, j+=stride) {
-            out<<   std::fixed <<
-                    std::setw(10) <<
-                    std::setprecision(4) << toNum(ptr[j]) << " ";
+    if (dim == 0) {
+        for (dim_t i = 0, j = 0; i < d; i++, j += stride) {
+            out << std::fixed << std::setw(precision + 6)
+                << std::setprecision(precision) << toNum(ptr[j]) << " ";
         }
         out << endl;
-    }
-    else {
-        for(dim_t i = 0; i < d; i++) {
-            printer(out, ptr, info, dim - 1);
+    } else {
+        for (dim_t i = 0; i < d; i++) {
+            printer(out, ptr, info, dim - 1, precision);
             ptr += stride;
         }
         out << endl;
@@ -53,51 +70,228 @@ static void printer(ostream &out, const T* ptr, const ArrayInfo &info, unsigned
 }
 
 template<typename T>
-static void print(af_array arr)
-{
-    const ArrayInfo info = getInfo(arr);
+static void print(const char *exp, af_array arr, const int precision,
+                  std::ostream &os = std::cout, bool transpose = true) {
+    if (exp == NULL) {
+        os << "No Name Array" << std::endl;
+    } else {
+        os << exp << std::endl;
+    }
+
+    const ArrayInfo &info = getInfo(arr);
+
+    std::ios_base::fmtflags backup = os.flags();
+
+    os << "[" << info.dims() << "]\n";
+#ifndef NDEBUG
+    os << "   Offset: " << info.getOffset() << std::endl;
+    os << "   Strides: [" << info.strides() << "]" << std::endl;
+#endif
+
+    // Handle empty array
+    if (info.elements() == 0) {
+        os << "<empty>" << std::endl;
+        os.flags(backup);
+        return;
+    }
+
     vector<T> data(info.elements());
 
     af_array arrT;
-    AF_CHECK(af_reorder(&arrT, arr, 1, 0, 2, 3));
+    if (transpose) {
+        AF_CHECK(af_reorder(&arrT, arr, 1, 0, 2, 3));
+    } else {
+        arrT = arr;
+    }
 
-    //FIXME: Use alternative function to avoid copies if possible
+    // FIXME: Use alternative function to avoid copies if possible
     AF_CHECK(af_get_data_ptr(&data.front(), arrT));
-    const ArrayInfo infoT = getInfo(arrT);
-    AF_CHECK(af_release_array(arrT));
+    const ArrayInfo &infoT = getInfo(arrT);
 
-    std::ios_base::fmtflags backup = std::cout.flags();
+    printer(os, &data.front(), infoT, infoT.ndims() - 1, precision);
 
-    std::cout << "[" << info.dims() << "]\n";
-#ifndef NDEBUG
-    std::cout <<"   Offsets: ["<<info.offsets()<<"]"<<std::endl;
-    std::cout <<"   Strides: ["<<info.strides()<<"]"<<std::endl;
-#endif
+    if (transpose) { AF_CHECK(af_release_array(arrT)); }
+
+    os.flags(backup);
+}
+
+template<typename T>
+static void printSparse(const char *exp, af_array arr, const int precision,
+                        std::ostream &os = std::cout, bool transpose = true) {
+    SparseArray<T> sparse = getSparseArray<T>(arr);
+    std::string name("No Name Sparse Array");
 
-    printer(std::cout, &data.front(), infoT, infoT.ndims() - 1);
+    if (exp != NULL) { name = std::string(exp); }
+    os << name << std::endl;
+    os << "Storage Format : ";
+    switch (sparse.getStorage()) {
+        case AF_STORAGE_DENSE: os << "AF_STORAGE_DENSE\n"; break;
+        case AF_STORAGE_CSR: os << "AF_STORAGE_CSR\n"; break;
+        case AF_STORAGE_CSC: os << "AF_STORAGE_CSC\n"; break;
+        case AF_STORAGE_COO: os << "AF_STORAGE_COO\n"; break;
+    }
+    os << "[" << sparse.dims() << "]\n";
 
-    std::cout.flags(backup);
+    print<T>(std::string(name + ": Values").c_str(),
+             getHandle(sparse.getValues()), precision, os, transpose);
+    print<int>(std::string(name + ": RowIdx").c_str(),
+               getHandle(sparse.getRowIdx()), precision, os, transpose);
+    print<int>(std::string(name + ": ColIdx").c_str(),
+               getHandle(sparse.getColIdx()), precision, os, transpose);
 }
 
-af_err af_print_array(af_array arr)
-{
+af_err af_print_array(af_array arr) {
     try {
-        ArrayInfo info = getInfo(arr);
+        const ArrayInfo &info =
+            getInfo(arr, false);  // Don't assert sparse/dense
         af_dtype type = info.getType();
-        switch(type)
-        {
-        case f32:   print<float>(arr);    break;
-        case c32:   print<cfloat>(arr);   break;
-        case f64:   print<double>(arr);   break;
-        case c64:   print<cdouble>(arr);  break;
-        case b8:    print<char>(arr);     break;
-        case s32:   print<int>(arr);      break;
-        case u32:   print<unsigned>(arr); break;
-        case u8:    print<uchar>(arr);    break;
-        case s64:   print<intl>(arr);     break;
-        case u64:   print<uintl>(arr);    break;
-        default:    TYPE_ERROR(1, type);
+
+        if (info.isSparse()) {
+            switch (type) {
+                case f32: printSparse<float>(NULL, arr, 4); break;
+                case f64: printSparse<double>(NULL, arr, 4); break;
+                case c32: printSparse<cfloat>(NULL, arr, 4); break;
+                case c64: printSparse<cdouble>(NULL, arr, 4); break;
+                default: TYPE_ERROR(0, type);
+            }
+        } else {
+            switch (type) {
+                case f32: print<float>(NULL, arr, 4); break;
+                case c32: print<cfloat>(NULL, arr, 4); break;
+                case f64: print<double>(NULL, arr, 4); break;
+                case c64: print<cdouble>(NULL, arr, 4); break;
+                case b8: print<char>(NULL, arr, 4); break;
+                case s32: print<int>(NULL, arr, 4); break;
+                case u32: print<unsigned>(NULL, arr, 4); break;
+                case s8: print<schar>(NULL, arr, 4); break;
+                case u8: print<uchar>(NULL, arr, 4); break;
+                case s64: print<intl>(NULL, arr, 4); break;
+                case u64: print<uintl>(NULL, arr, 4); break;
+                case s16: print<short>(NULL, arr, 4); break;
+                case u16: print<ushort>(NULL, arr, 4); break;
+                case f16: print<half>(NULL, arr, 4); break;
+                default: TYPE_ERROR(1, type);
+            }
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_print_array_gen(const char *exp, const af_array arr,
+                          const int precision) {
+    try {
+        ARG_ASSERT(0, exp != NULL);
+        const ArrayInfo &info =
+            getInfo(arr, false);  // Don't assert sparse/dense
+        af_dtype type = info.getType();
+
+        if (info.isSparse()) {
+            switch (type) {
+                case f32: printSparse<float>(exp, arr, precision); break;
+                case f64: printSparse<double>(exp, arr, precision); break;
+                case c32: printSparse<cfloat>(exp, arr, precision); break;
+                case c64: printSparse<cdouble>(exp, arr, precision); break;
+                default: TYPE_ERROR(0, type);
+            }
+        } else {
+            switch (type) {
+                case f32: print<float>(exp, arr, precision); break;
+                case c32: print<cfloat>(exp, arr, precision); break;
+                case f64: print<double>(exp, arr, precision); break;
+                case c64: print<cdouble>(exp, arr, precision); break;
+                case b8: print<char>(exp, arr, precision); break;
+                case s32: print<int>(exp, arr, precision); break;
+                case u32: print<unsigned>(exp, arr, precision); break;
+                case s8: print<schar>(exp, arr, precision); break;
+                case u8: print<uchar>(exp, arr, precision); break;
+                case s64: print<intl>(exp, arr, precision); break;
+                case u64: print<uintl>(exp, arr, precision); break;
+                case s16: print<short>(exp, arr, precision); break;
+                case u16: print<ushort>(exp, arr, precision); break;
+                case f16: print<half>(exp, arr, precision); break;
+                default: TYPE_ERROR(1, type);
+            }
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_array_to_string(char **output, const char *exp, const af_array arr,
+                          const int precision, bool transpose) {
+    try {
+        ARG_ASSERT(0, exp != NULL);
+        const ArrayInfo &info =
+            getInfo(arr, false);  // Don't assert sparse/dense
+        af_dtype type = info.getType();
+        std::stringstream ss;
+
+        if (info.isSparse()) {
+            switch (type) {
+                case f32:
+                    printSparse<float>(exp, arr, precision, ss, transpose);
+                    break;
+                case f64:
+                    printSparse<double>(exp, arr, precision, ss, transpose);
+                    break;
+                case c32:
+                    printSparse<cfloat>(exp, arr, precision, ss, transpose);
+                    break;
+                case c64:
+                    printSparse<cdouble>(exp, arr, precision, ss, transpose);
+                    break;
+                default: TYPE_ERROR(0, type);
+            }
+        } else {
+            switch (type) {
+                case f32:
+                    print<float>(exp, arr, precision, ss, transpose);
+                    break;
+                case c32:
+                    print<cfloat>(exp, arr, precision, ss, transpose);
+                    break;
+                case f64:
+                    print<double>(exp, arr, precision, ss, transpose);
+                    break;
+                case c64:
+                    print<cdouble>(exp, arr, precision, ss, transpose);
+                    break;
+                case b8: print<char>(exp, arr, precision, ss, transpose); break;
+                case s32: print<int>(exp, arr, precision, ss, transpose); break;
+                case u32:
+                    print<unsigned>(exp, arr, precision, ss, transpose);
+                    break;
+                case s8:
+                    print<schar>(exp, arr, precision, ss, transpose);
+                    break;
+                case u8:
+                    print<uchar>(exp, arr, precision, ss, transpose);
+                    break;
+                case s64:
+                    print<intl>(exp, arr, precision, ss, transpose);
+                    break;
+                case u64:
+                    print<uintl>(exp, arr, precision, ss, transpose);
+                    break;
+                case s16:
+                    print<short>(exp, arr, precision, ss, transpose);
+                    break;
+                case u16:
+                    print<ushort>(exp, arr, precision, ss, transpose);
+                    break;
+                case f16:
+                    print<half>(exp, arr, precision, ss, transpose);
+                    break;
+                default: TYPE_ERROR(1, type);
+            }
         }
+        std::string str  = ss.str();
+        void *halloc_ptr = nullptr;
+        AF_CHECK(af_alloc_host(&halloc_ptr, sizeof(char) * (str.size() + 1)));
+        memcpy(output, &halloc_ptr, sizeof(void *));
+        str.copy(*output, str.size());
+        (*output)[str.size()] = '\0';  // don't forget the terminating 0
     }
     CATCHALL;
     return AF_SUCCESS;
diff --git a/src/api/c/qr.cpp b/src/api/c/qr.cpp
index 95eb53adab..8d74a0d3f9 100644
--- a/src/api/c/qr.cpp
+++ b/src/api/c/qr.cpp
@@ -7,24 +7,28 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <qr.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using std::swap;
 
 template<typename T>
-static inline void qr(af_array *q, af_array *r, af_array *tau, const af_array in)
-{
-    Array<T> qArray = createEmptyArray<T>(af::dim4());
-    Array<T> rArray = createEmptyArray<T>(af::dim4());
-    Array<T> tArray = createEmptyArray<T>(af::dim4());
+static inline void qr(af_array *q, af_array *r, af_array *tau,
+                      const af_array in) {
+    Array<T> qArray = createEmptyArray<T>(dim4());
+    Array<T> rArray = createEmptyArray<T>(dim4());
+    Array<T> tArray = createEmptyArray<T>(dim4());
 
     qr<T>(qArray, rArray, tArray, getArray<T>(in));
 
@@ -34,15 +38,13 @@ static inline void qr(af_array *q, af_array *r, af_array *tau, const af_array in
 }
 
 template<typename T>
-static inline af_array qr_inplace(af_array in)
-{
-    return getHandle(qr_inplace<T>(getWritableArray<T>(in)));
+static inline af_array qr_inplace(af_array in) {
+    return getHandle(qr_inplace<T>(getArray<T>(in)));
 }
 
-af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in)
-{
+af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("qr can not be used in batch mode", AF_ERR_BATCH);
@@ -50,14 +52,24 @@ af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in)
 
         af_dtype type = i_info.getType();
 
-        ARG_ASSERT(3, i_info.isFloating());                       // Only floating and complex types
+        if (i_info.ndims() == 0) {
+            AF_CHECK(af_create_handle(q, 0, nullptr, type));
+            AF_CHECK(af_create_handle(r, 0, nullptr, type));
+            AF_CHECK(af_create_handle(tau, 0, nullptr, type));
+            return AF_SUCCESS;
+        }
 
-        switch(type) {
-            case f32: qr<float  >(q, r, tau, in);  break;
-            case f64: qr<double >(q, r, tau, in);  break;
-            case c32: qr<cfloat >(q, r, tau, in);  break;
-            case c64: qr<cdouble>(q, r, tau, in);  break;
-            default:  TYPE_ERROR(1, type);
+        ARG_ASSERT(0, q != nullptr);
+        ARG_ASSERT(1, r != nullptr);
+        ARG_ASSERT(2, tau != nullptr);
+        ARG_ASSERT(3, i_info.isFloating());  // Only floating and complex types
+
+        switch (type) {
+            case f32: qr<float>(q, r, tau, in); break;
+            case f64: qr<double>(q, r, tau, in); break;
+            case c32: qr<cfloat>(q, r, tau, in); break;
+            case c64: qr<cdouble>(q, r, tau, in); break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
@@ -65,10 +77,9 @@ af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in)
     return AF_SUCCESS;
 }
 
-af_err af_qr_inplace(af_array *tau, af_array in)
-{
+af_err af_qr_inplace(af_array *tau, af_array in) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo &i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("qr can not be used in batch mode", AF_ERR_BATCH);
@@ -76,19 +87,22 @@ af_err af_qr_inplace(af_array *tau, af_array in)
 
         af_dtype type = i_info.getType();
 
-        ARG_ASSERT(1, i_info.isFloating()); // Only floating and complex types
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(0, tau != nullptr);
 
-        af_array out;
+        if (i_info.ndims() == 0) {
+            return af_create_handle(tau, 0, nullptr, type);
+        }
 
-        switch(type) {
-            case f32: out = qr_inplace<float  >(in);  break;
-            case f64: out = qr_inplace<double >(in);  break;
-            case c32: out = qr_inplace<cfloat >(in);  break;
-            case c64: out = qr_inplace<cdouble>(in);  break;
-            default:  TYPE_ERROR(1, type);
+        af_array out;
+        switch (type) {
+            case f32: out = qr_inplace<float>(in); break;
+            case f64: out = qr_inplace<double>(in); break;
+            case c32: out = qr_inplace<cfloat>(in); break;
+            case c64: out = qr_inplace<cdouble>(in); break;
+            default: TYPE_ERROR(1, type);
         }
-        if(tau != NULL)
-            std::swap(*tau, out);
+        swap(*tau, out);
     }
     CATCHALL;
 
diff --git a/src/api/c/random.cpp b/src/api/c/random.cpp
new file mode 100644
index 0000000000..6508786f53
--- /dev/null
+++ b/src/api/c/random.cpp
@@ -0,0 +1,424 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/random.h>
+
+#include <backend.hpp>
+#include <common/MersenneTwister.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <random_engine.hpp>
+#include <types.hpp>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <map>
+#include <memory>
+
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::mask;
+using arrayfire::common::MaxBlocks;
+using arrayfire::common::MtStateLength;
+using arrayfire::common::pos;
+using arrayfire::common::recursion_tbl;
+using arrayfire::common::sh1;
+using arrayfire::common::sh2;
+using arrayfire::common::TableLength;
+using arrayfire::common::temper_tbl;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createHostDataArray;
+using detail::intl;
+using detail::normalDistribution;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::uniformDistribution;
+using detail::ushort;
+
+Array<uint> emptyArray() { return createEmptyArray<uint>(dim4(0)); }
+
+struct RandomEngine {
+    // clang-format off
+    af_random_engine_type type{AF_RANDOM_ENGINE_DEFAULT}; // NOLINT(misc-non-private-member-variables-in-classes)
+    std::shared_ptr<uintl> seed;                          // NOLINT(misc-non-private-member-variables-in-classes)
+    std::shared_ptr<uintl> counter;                       // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> pos;                                      // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> sh1;                                      // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> sh2;                                      // NOLINT(misc-non-private-member-variables-in-classes)
+    uint mask{0};                                         // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> recursion_table;                          // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> temper_table;                             // NOLINT(misc-non-private-member-variables-in-classes)
+    Array<uint> state;                                    // NOLINT(misc-non-private-member-variables-in-classes)
+    // clang-format on
+
+    RandomEngine()
+        : seed(new uintl())
+        , counter(new uintl())
+        , pos(emptyArray())
+        , sh1(emptyArray())
+        , sh2(emptyArray())
+        , recursion_table(emptyArray())
+        , temper_table(emptyArray())
+        , state(emptyArray()) {}
+};
+
+af_random_engine getRandomEngineHandle(const RandomEngine &engine) {
+    auto *engineHandle = new RandomEngine;
+    *engineHandle      = engine;
+    return static_cast<af_random_engine>(engineHandle);
+}
+
+RandomEngine *getRandomEngine(const af_random_engine engineHandle) {
+    if (engineHandle == 0) {
+        AF_ERROR("Uninitialized random engine", AF_ERR_ARG);
+    }
+    return static_cast<RandomEngine *>(engineHandle);
+}
+
+namespace {
+template<typename T>
+inline af_array uniformDistribution_(const dim4 &dims, RandomEngine *e) {
+    if (e->type == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+        return getHandle(uniformDistribution<T>(dims, e->pos, e->sh1, e->sh2,
+                                                e->mask, e->recursion_table,
+                                                e->temper_table, e->state));
+    } else {
+        return getHandle(
+            uniformDistribution<T>(dims, e->type, *(e->seed), *(e->counter)));
+    }
+}
+
+template<typename T>
+inline af_array normalDistribution_(const dim4 &dims, RandomEngine *e) {
+    if (e->type == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+        return getHandle(normalDistribution<T>(dims, e->pos, e->sh1, e->sh2,
+                                               e->mask, e->recursion_table,
+                                               e->temper_table, e->state));
+    } else {
+        return getHandle(
+            normalDistribution<T>(dims, e->type, *(e->seed), *(e->counter)));
+    }
+}
+
+void validateRandomType(const af_random_engine_type type) {
+    if ((type != AF_RANDOM_ENGINE_PHILOX_4X32_10) &&
+        (type != AF_RANDOM_ENGINE_THREEFRY_2X32_16) &&
+        (type != AF_RANDOM_ENGINE_MERSENNE_GP11213) &&
+        (type != AF_RANDOM_ENGINE_PHILOX) &&
+        (type != AF_RANDOM_ENGINE_THREEFRY) &&
+        (type != AF_RANDOM_ENGINE_MERSENNE) &&
+        (type != AF_RANDOM_ENGINE_DEFAULT)) {
+        AF_ERROR("Invalid random type", AF_ERR_ARG);
+    }
+}
+}  // namespace
+
+af_err af_get_default_random_engine(af_random_engine *r) {
+    try {
+        AF_CHECK(af_init());
+
+        // RandomEngine contains device buffers which are dependent on
+        // context|stream/device. Since neither context nor stream is available
+        // at this level, we will only use the deviceId.
+        thread_local std::map<int /*deviceId*/, RandomEngine *>
+            cachedDefaultRandomEngines;
+        const int dependent = af::getDevice();
+        auto it             = cachedDefaultRandomEngines.find(dependent);
+        if (it == cachedDefaultRandomEngines.end()) {
+            RandomEngine *defaultRandomEngine     = new RandomEngine;
+            cachedDefaultRandomEngines[dependent] = defaultRandomEngine;
+            *r = static_cast<af_random_engine>(defaultRandomEngine);
+        } else {
+            *r = static_cast<af_random_engine>(it->second);
+        }
+        return AF_SUCCESS;
+    }
+    CATCHALL;
+}
+
+af_err af_create_random_engine(af_random_engine *engineHandle,
+                               af_random_engine_type rtype, uintl seed) {
+    try {
+        AF_CHECK(af_init());
+        validateRandomType(rtype);
+
+        RandomEngine e;
+        e.type     = rtype;
+        *e.seed    = seed;
+        *e.counter = 0;
+
+        if (rtype == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+            e.pos  = createHostDataArray<uint>(dim4(MaxBlocks), pos);
+            e.sh1  = createHostDataArray<uint>(dim4(MaxBlocks), sh1);
+            e.sh2  = createHostDataArray<uint>(dim4(MaxBlocks), sh2);
+            e.mask = mask;
+
+            e.recursion_table =
+                createHostDataArray<uint>(dim4(TableLength), recursion_tbl);
+            e.temper_table =
+                createHostDataArray<uint>(dim4(TableLength), temper_tbl);
+            e.state = createEmptyArray<uint>(dim4(MtStateLength));
+
+            initMersenneState(e.state, seed, e.recursion_table);
+        }
+
+        *engineHandle = getRandomEngineHandle(e);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_retain_random_engine(af_random_engine *outHandle,
+                               const af_random_engine engineHandle) {
+    try {
+        AF_CHECK(af_init());
+        *outHandle = getRandomEngineHandle(*(getRandomEngine(engineHandle)));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_engine_set_type(af_random_engine *engine,
+                                 const af_random_engine_type rtype) {
+    try {
+        AF_CHECK(af_init());
+        validateRandomType(rtype);
+        RandomEngine *e = getRandomEngine(*engine);
+        if (rtype != e->type) {
+            if (rtype == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+                e->pos  = createHostDataArray<uint>(dim4(MaxBlocks), pos);
+                e->sh1  = createHostDataArray<uint>(dim4(MaxBlocks), sh1);
+                e->sh2  = createHostDataArray<uint>(dim4(MaxBlocks), sh2);
+                e->mask = mask;
+
+                e->recursion_table =
+                    createHostDataArray<uint>(dim4(TableLength), recursion_tbl);
+                e->temper_table =
+                    createHostDataArray<uint>(dim4(TableLength), temper_tbl);
+                e->state = createEmptyArray<uint>(dim4(MtStateLength));
+
+                initMersenneState(e->state, *(e->seed), e->recursion_table);
+            } else if (e->type == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+                e->pos             = emptyArray();
+                e->sh1             = emptyArray();
+                e->sh2             = emptyArray();
+                e->mask            = 0;
+                e->recursion_table = emptyArray();
+                e->temper_table    = emptyArray();
+                e->state           = emptyArray();
+            }
+            e->type = rtype;
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_engine_get_type(af_random_engine_type *rtype,
+                                 const af_random_engine engine) {
+    try {
+        AF_CHECK(af_init());
+        RandomEngine *e = getRandomEngine(engine);
+        *rtype          = e->type;
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_default_random_engine_type(const af_random_engine_type rtype) {
+    try {
+        AF_CHECK(af_init());
+        af_random_engine e;
+        AF_CHECK(af_get_default_random_engine(&e));
+        AF_CHECK(af_random_engine_set_type(&e, rtype));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_engine_set_seed(af_random_engine *engine, const uintl seed) {
+    try {
+        AF_CHECK(af_init());
+        RandomEngine *e = getRandomEngine(*engine);
+        *(e->seed)      = seed;
+        if (e->type == AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+            initMersenneState(e->state, seed, e->recursion_table);
+        } else {
+            *(e->counter) = 0;
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_engine_get_seed(uintl *const seed, af_random_engine engine) {
+    try {
+        AF_CHECK(af_init());
+        RandomEngine *e = getRandomEngine(engine);
+        *seed           = *(e->seed);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_uniform(af_array *out, const unsigned ndims,
+                         const dim_t *const dims, const af_dtype type,
+                         af_random_engine engine) {
+    try {
+        AF_CHECK(af_init());
+        af_array result;
+
+        dim4 d          = verifyDims(ndims, dims);
+        RandomEngine *e = getRandomEngine(engine);
+
+        switch (type) {
+            case f32: result = uniformDistribution_<float>(d, e); break;
+            case c32: result = uniformDistribution_<cfloat>(d, e); break;
+            case f64: result = uniformDistribution_<double>(d, e); break;
+            case c64: result = uniformDistribution_<cdouble>(d, e); break;
+            case s32: result = uniformDistribution_<int>(d, e); break;
+            case u32: result = uniformDistribution_<uint>(d, e); break;
+            case s64: result = uniformDistribution_<intl>(d, e); break;
+            case u64: result = uniformDistribution_<uintl>(d, e); break;
+            case s16: result = uniformDistribution_<short>(d, e); break;
+            case u16: result = uniformDistribution_<ushort>(d, e); break;
+            case s8: result = uniformDistribution_<schar>(d, e); break;
+            case u8: result = uniformDistribution_<uchar>(d, e); break;
+            case b8: result = uniformDistribution_<char>(d, e); break;
+            case f16: result = uniformDistribution_<half>(d, e); break;
+            default: TYPE_ERROR(4, type);
+        }
+        std::swap(*out, result);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_random_normal(af_array *out, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype type,
+                        af_random_engine engine) {
+    try {
+        AF_CHECK(af_init());
+        af_array result;
+
+        dim4 d          = verifyDims(ndims, dims);
+        RandomEngine *e = getRandomEngine(engine);
+
+        switch (type) {
+            case f32: result = normalDistribution_<float>(d, e); break;
+            case c32: result = normalDistribution_<cfloat>(d, e); break;
+            case f64: result = normalDistribution_<double>(d, e); break;
+            case c64: result = normalDistribution_<cdouble>(d, e); break;
+            case f16: result = normalDistribution_<half>(d, e); break;
+            default: TYPE_ERROR(4, type);
+        }
+        std::swap(*out, result);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_release_random_engine(af_random_engine engineHandle) {
+    try {
+        AF_CHECK(af_init());
+        delete getRandomEngine(engineHandle);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_randu(af_array *out, const unsigned ndims, const dim_t *const dims,
+                const af_dtype type) {
+    try {
+        AF_CHECK(af_init());
+        af_array result;
+
+        af_random_engine engine;
+        AF_CHECK(af_get_default_random_engine(&engine));
+        RandomEngine *e = getRandomEngine(engine);
+        dim4 d          = verifyDims(ndims, dims);
+
+        switch (type) {
+            case f32: result = uniformDistribution_<float>(d, e); break;
+            case c32: result = uniformDistribution_<cfloat>(d, e); break;
+            case f64: result = uniformDistribution_<double>(d, e); break;
+            case c64: result = uniformDistribution_<cdouble>(d, e); break;
+            case s32: result = uniformDistribution_<int>(d, e); break;
+            case u32: result = uniformDistribution_<uint>(d, e); break;
+            case s64: result = uniformDistribution_<intl>(d, e); break;
+            case u64: result = uniformDistribution_<uintl>(d, e); break;
+            case s16: result = uniformDistribution_<short>(d, e); break;
+            case u16: result = uniformDistribution_<ushort>(d, e); break;
+            case s8: result = uniformDistribution_<schar>(d, e); break;
+            case u8: result = uniformDistribution_<uchar>(d, e); break;
+            case b8: result = uniformDistribution_<char>(d, e); break;
+            case f16: result = uniformDistribution_<half>(d, e); break;
+            default: TYPE_ERROR(3, type);
+        }
+        std::swap(*out, result);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_randn(af_array *out, const unsigned ndims, const dim_t *const dims,
+                const af_dtype type) {
+    try {
+        AF_CHECK(af_init());
+        af_array result;
+
+        af_random_engine engine;
+        AF_CHECK(af_get_default_random_engine(&engine));
+        RandomEngine *e = getRandomEngine(engine);
+        dim4 d          = verifyDims(ndims, dims);
+
+        switch (type) {
+            case f32: result = normalDistribution_<float>(d, e); break;
+            case c32: result = normalDistribution_<cfloat>(d, e); break;
+            case f64: result = normalDistribution_<double>(d, e); break;
+            case c64: result = normalDistribution_<cdouble>(d, e); break;
+            case f16: result = normalDistribution_<half>(d, e); break;
+            default: TYPE_ERROR(3, type);
+        }
+        std::swap(*out, result);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_seed(const uintl seed) {
+    try {
+        AF_CHECK(af_init());
+        af_random_engine engine;
+        AF_CHECK(af_get_default_random_engine(&engine));
+        AF_CHECK(af_random_engine_set_seed(&engine, seed));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_get_seed(uintl *seed) {
+    try {
+        AF_CHECK(af_init());
+        af_random_engine e;
+        AF_CHECK(af_get_default_random_engine(&e));
+        AF_CHECK(af_random_engine_get_seed(seed, e));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
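Review note on `af_get_default_random_engine` in src/api/c/random.cpp: the function keeps one default engine per thread and per device id, created lazily on first use. The sketch below isolates that caching pattern so it can be read without the rest of the file. It is only an illustration under stated assumptions: `Engine` and `defaultEngineFor` are hypothetical names, the device id is passed in rather than queried with `af::getDevice()`, and the cache owns its entries through `std::unique_ptr`, whereas the patch stores raw `RandomEngine *` pointers.

```cpp
#include <map>
#include <memory>

// Hypothetical stand-in for the RandomEngine struct in random.cpp.
struct Engine {
    unsigned long long seed    = 0;
    unsigned long long counter = 0;
};

// Returns the calling thread's default engine for `deviceId`, creating it on
// first use. Each thread has its own cache, so two threads never share a
// default engine even when they target the same device.
Engine *defaultEngineFor(int deviceId) {
    // Assumption: ownership via unique_ptr; the patch itself keeps raw
    // pointers in the map.
    thread_local std::map<int, std::unique_ptr<Engine>> cache;

    auto it = cache.find(deviceId);
    if (it == cache.end()) {
        it = cache.emplace(deviceId, std::unique_ptr<Engine>(new Engine)).first;
    }
    return it->second.get();
}
```

Repeated calls with the same device id on one thread return the same pointer, which is why `af_set_seed` followed by `af_randu` on that thread sees the updated seed.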
diff --git a/src/api/c/rank.cpp b/src/api/c/rank.cpp
index c1eeb737d4..770c331a7a 100644
--- a/src/api/c/rank.cpp
+++ b/src/api/c/rank.cpp
@@ -7,45 +7,59 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <logic.hpp>
 #include <qr.hpp>
 #include <reduce.hpp>
-#include <logic.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::getScalar;
+using detail::logicOp;
+using detail::reduce;
+using detail::reduce_all;
+using detail::scalar;
+using detail::uint;
 
 template<typename T>
-static inline uint rank(const af_array in, double tol)
-{
-    Array<T> In = getArray<T>(in);
+static inline uint rank(const af_array in, double tol) {
+    using BT          = typename af::dtype_traits<T>::base_type;
+    const Array<T> In = getArray<T>(in);
 
-    Array<T> r = createEmptyArray<T>(dim4());
+    Array<BT> R = createEmptyArray<BT>(dim4());
 
-    // Scoping to get rid of q and t as they are not necessary
+    // Scoping to get rid of q, r and t as they are not necessary
     {
         Array<T> q = createEmptyArray<T>(dim4());
+        Array<T> r = createEmptyArray<T>(dim4());
         Array<T> t = createEmptyArray<T>(dim4());
         qr(q, r, t, In);
+        using detail::abs;
+
+        R = abs<BT, T>(r);
     }
 
-    Array<T> val = createValueArray<T>(r.dims(), scalar<T>(tol));
-    Array<char> gt = logicOp<T, af_gt_t>(r, val, val.dims());
+    Array<BT> val  = createValueArray<BT>(R.dims(), scalar<BT>(tol));
+    Array<char> gt = logicOp<BT, af_gt_t>(R, val, val.dims());
     Array<char> at = reduce<af_or_t, char, char>(gt, 1);
-
-    return reduce_all<af_notzero_t, char, uint>(at);
+    return getScalar<uint>(reduce_all<af_notzero_t, char, uint>(at));
 }
 
-af_err af_rank(uint *out, const af_array in, const double tol)
-{
+af_err af_rank(uint* out, const af_array in, const double tol) {
     try {
-        ArrayInfo i_info = getInfo(in);
+        const ArrayInfo& i_info = getInfo(in);
 
         if (i_info.ndims() > 2) {
             AF_ERROR("solve can not be used in batch mode", AF_ERR_BATCH);
@@ -53,16 +67,18 @@ af_err af_rank(uint *out, const af_array in, const double tol)
 
         af_dtype type = i_info.getType();
 
-        ARG_ASSERT(1, i_info.isFloating());                       // Only floating and complex types
-
-        uint output;
+        ARG_ASSERT(1, i_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(0, out != nullptr);
 
-        switch(type) {
-            case f32: output = rank<float  >(in, tol);  break;
-            case f64: output = rank<double >(in, tol);  break;
-            case c32: output = rank<cfloat >(in, tol);  break;
-            case c64: output = rank<cdouble>(in, tol);  break;
-            default:  TYPE_ERROR(1, type);
+        uint output = 0;
+        if (i_info.ndims() != 0) {
+            switch (type) {
+                case f32: output = rank<float>(in, tol); break;
+                case f64: output = rank<double>(in, tol); break;
+                case c32: output = rank<cfloat>(in, tol); break;
+                case c64: output = rank<cdouble>(in, tol); break;
+                default: TYPE_ERROR(1, type);
+            }
         }
         std::swap(*out, output);
     }
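Review note on the `rank` helper in src/api/c/rank.cpp: after the QR factorisation, the new code takes `abs(r)`, compares each entry against `tol`, ORs the result along dimension 1, and counts the non-zero flags. In other words, it counts the rows of |R| that have at least one entry above the tolerance. The sketch below reproduces only that counting step on a plain host matrix. `rankFromR` is a hypothetical name, the row-of-vectors layout is an assumption made for readability, and the QR step is not reproduced.

```cpp
#include <cmath>
#include <vector>

// Counts the rows of R that contain at least one entry with |value| > tol.
// R is given as a list of rows; the real code works on a device array.
unsigned rankFromR(const std::vector<std::vector<double>> &R, double tol) {
    unsigned rank = 0;
    for (const auto &row : R) {
        bool anyAbove = false;  // plays the role of reduce<af_or_t> along dim 1
        for (double v : row) {
            if (std::fabs(v) > tol) {  // abs() followed by logicOp<af_gt_t>
                anyAbove = true;
                break;
            }
        }
        if (anyAbove) { ++rank; }  // plays the role of reduce_all<af_notzero_t>
    }
    return rank;
}
```

Taking the magnitude first is what makes the comparison meaningful for complex inputs and for negative entries on the diagonal of R, which the previous `r > tol` comparison would have missed.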
diff --git a/src/api/c/reduce.cpp b/src/api/c/reduce.cpp
index 082c36bb86..65d3f85209 100644
--- a/src/api/c/reduce.cpp
+++ b/src/api/c/reduce.cpp
@@ -7,38 +7,87 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/defines.h>
-#include <af/dim4.hpp>
-#include <af/algorithm.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <ops.hpp>
 #include <backend.hpp>
-#include <reduce.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
 #include <ireduce.hpp>
 #include <math.hpp>
+#include <optypes.hpp>
+#include <reduce.hpp>
+#include <af/algorithm.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::getScalar;
+using detail::imag;
+using detail::intl;
+using detail::real;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<af_op_t op, typename Ti, typename To>
-static inline af_array reduce(const af_array in, const int dim)
-{
-    return getHandle(reduce<op,Ti,To>(getArray<Ti>(in), dim));
+static inline af_array reduce(const af_array in, const int dim,
+                              bool change_nan = false, double nanval = 0) {
+    return getHandle(
+        reduce<op, Ti, To>(getArray<Ti>(in), dim, change_nan, nanval));
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+static inline void reduce_by_key(af_array *keys_out, af_array *vals_out,
+                                 const af_array keys, const af_array vals,
+                                 const int dim, bool change_nan,
+                                 double nanval) {
+    Array<Tk> oKeyArray = createEmptyArray<Tk>(dim4());
+    Array<To> oValArray = createEmptyArray<To>(dim4());
+
+    reduce_by_key<op, Ti, Tk, To>(oKeyArray, oValArray, getArray<Tk>(keys),
+                                  getArray<Ti>(vals), dim, change_nan, nanval);
+
+    *keys_out = getHandle(oKeyArray);
+    *vals_out = getHandle(oValArray);
+}
+
+template<af_op_t op, typename Ti, typename To>
+static inline void reduce_key(af_array *keys_out, af_array *vals_out,
+                              const af_array keys, const af_array vals,
+                              const int dim, bool change_nan = false,
+                              double nanval = 0.0) {
+    const ArrayInfo &key_info = getInfo(keys);
+    af_dtype type             = key_info.getType();
+
+    switch (type) {
+        case s32:
+            reduce_by_key<op, Ti, int, To>(keys_out, vals_out, keys, vals, dim,
+                                           change_nan, nanval);
+            break;
+        case u32:
+            reduce_by_key<op, Ti, uint, To>(keys_out, vals_out, keys, vals, dim,
+                                            change_nan, nanval);
+            break;
+        default: TYPE_ERROR(2, type);
+    }
 }
 
 template<af_op_t op, typename To>
-static af_err reduce_type(af_array *out, const af_array in, const int dim)
-{
+static af_err reduce_type(af_array *out, const af_array in, const int dim) {
     try {
-
         ARG_ASSERT(2, dim >= 0);
-        ARG_ASSERT(2, dim <  4);
+        ARG_ASSERT(2, dim < 4);
 
-        const ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
 
-        if (dim >= (int)in_info.ndims()) {
+        if (dim >= static_cast<int>(in_info.ndims())) {
             *out = retain(in);
             return AF_SUCCESS;
         }
@@ -46,16 +95,22 @@ static af_err reduce_type(af_array *out, const af_array in, const int dim)
         af_dtype type = in_info.getType();
         af_array res;
 
-        switch(type) {
-        case f32:  res = reduce<op, float  , To>(in, dim); break;
-        case f64:  res = reduce<op, double , To>(in, dim); break;
-        case c32:  res = reduce<op, cfloat , To>(in, dim); break;
-        case c64:  res = reduce<op, cdouble, To>(in, dim); break;
-        case u32:  res = reduce<op, uint   , To>(in, dim); break;
-        case s32:  res = reduce<op, int    , To>(in, dim); break;
-        case b8:   res = reduce<op, char   , To>(in, dim); break;
-        case u8:   res = reduce<op, uchar  , To>(in, dim); break;
-        default:   TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: res = reduce<op, float, To>(in, dim); break;
+            case f64: res = reduce<op, double, To>(in, dim); break;
+            case c32: res = reduce<op, cfloat, To>(in, dim); break;
+            case c64: res = reduce<op, cdouble, To>(in, dim); break;
+            case u32: res = reduce<op, uint, To>(in, dim); break;
+            case s32: res = reduce<op, int, To>(in, dim); break;
+            case u64: res = reduce<op, uintl, To>(in, dim); break;
+            case s64: res = reduce<op, intl, To>(in, dim); break;
+            case u16: res = reduce<op, ushort, To>(in, dim); break;
+            case s16: res = reduce<op, short, To>(in, dim); break;
+            case b8: res = reduce<op, char, To>(in, dim); break;
+            case u8: res = reduce<op, uchar, To>(in, dim); break;
+            case s8: res = reduce<op, schar, To>(in, dim); break;
+            case f16: res = reduce<op, half, To>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, res);
@@ -65,34 +120,104 @@ static af_err reduce_type(af_array *out, const af_array in, const int dim)
     return AF_SUCCESS;
 }
 
-template<af_op_t op>
-static af_err reduce_common(af_array *out, const af_array in, const int dim)
-{
+template<af_op_t op, typename To>
+static af_err reduce_by_key_type(af_array *keys_out, af_array *vals_out,
+                                 const af_array keys, const af_array vals,
+                                 const int dim) {
     try {
+        ARG_ASSERT(4, dim >= 0);
+        ARG_ASSERT(4, dim < 4);
+
+        const ArrayInfo &kinfo   = getInfo(keys);
+        const ArrayInfo &in_info = getInfo(vals);
+        af_dtype type            = in_info.getType();
+
+        ARG_ASSERT(2, kinfo.isVector());
+        ARG_ASSERT(2, in_info.dims()[dim] == kinfo.elements());
+
+        switch (type) {
+            case f32:
+                reduce_key<op, float, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case f64:
+                reduce_key<op, double, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case c32:
+                reduce_key<op, cfloat, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case c64:
+                reduce_key<op, cdouble, To>(keys_out, vals_out, keys, vals,
+                                            dim);
+                break;
+            case u32:
+                reduce_key<op, uint, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case s32:
+                reduce_key<op, int, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u64:
+                reduce_key<op, uintl, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case s64:
+                reduce_key<op, intl, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u16:
+                reduce_key<op, ushort, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case s16:
+                reduce_key<op, short, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case b8:
+                reduce_key<op, char, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u8:
+                reduce_key<op, uchar, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case s8:
+                reduce_key<op, schar, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case f16:
+                reduce_key<op, half, To>(keys_out, vals_out, keys, vals, dim);
+                break;
+            default: TYPE_ERROR(3, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
 
+template<af_op_t op>
+static af_err reduce_common(af_array *out, const af_array in, const int dim) {
+    try {
         ARG_ASSERT(2, dim >= 0);
-        ARG_ASSERT(2, dim <  4);
+        ARG_ASSERT(2, dim < 4);
 
-        const ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
 
-        if (dim >= (int)in_info.ndims()) {
-            *out = retain(in);
-            return AF_SUCCESS;
+        if (dim >= static_cast<int>(in_info.ndims())) {
+            return af_retain_array(out, in);
         }
 
         af_dtype type = in_info.getType();
         af_array res;
 
-        switch(type) {
-        case f32:  res = reduce<op, float  , float  >(in, dim); break;
-        case f64:  res = reduce<op, double , double >(in, dim); break;
-        case c32:  res = reduce<op, cfloat , cfloat >(in, dim); break;
-        case c64:  res = reduce<op, cdouble, cdouble>(in, dim); break;
-        case u32:  res = reduce<op, uint   , uint   >(in, dim); break;
-        case s32:  res = reduce<op, int    , int    >(in, dim); break;
-        case b8:   res = reduce<op, char   , char   >(in, dim); break;
-        case u8:   res = reduce<op, uchar  , uchar  >(in, dim); break;
-        default:   TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: res = reduce<op, float, float>(in, dim); break;
+            case f64: res = reduce<op, double, double>(in, dim); break;
+            case c32: res = reduce<op, cfloat, cfloat>(in, dim); break;
+            case c64: res = reduce<op, cdouble, cdouble>(in, dim); break;
+            case u32: res = reduce<op, uint, uint>(in, dim); break;
+            case s32: res = reduce<op, int, int>(in, dim); break;
+            case u64: res = reduce<op, uintl, uintl>(in, dim); break;
+            case s64: res = reduce<op, intl, intl>(in, dim); break;
+            case u16: res = reduce<op, ushort, ushort>(in, dim); break;
+            case s16: res = reduce<op, short, short>(in, dim); break;
+            case b8: res = reduce<op, char, char>(in, dim); break;
+            case u8: res = reduce<op, uchar, uchar>(in, dim); break;
+            case s8: res = reduce<op, schar, schar>(in, dim); break;
+            case f16: res = reduce<op, half, half>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, res);
@@ -103,16 +228,90 @@ static af_err reduce_common(af_array *out, const af_array in, const int dim)
 }
 
 template<af_op_t op>
-static af_err reduce_promote(af_array *out, const af_array in, const int dim)
-{
+static af_err reduce_by_key_common(af_array *keys_out, af_array *vals_out,
+                                   const af_array keys, const af_array vals,
+                                   const int dim) {
     try {
+        ARG_ASSERT(4, dim >= 0);
+        ARG_ASSERT(4, dim < 4);
+
+        const ArrayInfo &kinfo   = getInfo(keys);
+        const ArrayInfo &in_info = getInfo(vals);
+        af_dtype type            = in_info.getType();
+
+        ARG_ASSERT(2, kinfo.isVector());
+        ARG_ASSERT(2, in_info.dims()[dim] == kinfo.dims()[0]);
+
+        switch (type) {
+            case f32:
+                reduce_key<op, float, float>(keys_out, vals_out, keys, vals,
+                                             dim);
+                break;
+            case f64:
+                reduce_key<op, double, double>(keys_out, vals_out, keys, vals,
+                                               dim);
+                break;
+            case c32:
+                reduce_key<op, cfloat, cfloat>(keys_out, vals_out, keys, vals,
+                                               dim);
+                break;
+            case c64:
+                reduce_key<op, cdouble, cdouble>(keys_out, vals_out, keys, vals,
+                                                 dim);
+                break;
+            case u32:
+                reduce_key<op, uint, uint>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case s32:
+                reduce_key<op, int, int>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u64:
+                reduce_key<op, uintl, uintl>(keys_out, vals_out, keys, vals,
+                                             dim);
+                break;
+            case s64:
+                reduce_key<op, intl, intl>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u16:
+                reduce_key<op, ushort, ushort>(keys_out, vals_out, keys, vals,
+                                               dim);
+                break;
+            case s16:
+                reduce_key<op, short, short>(keys_out, vals_out, keys, vals,
+                                             dim);
+                break;
+            case b8:
+                reduce_key<op, char, char>(keys_out, vals_out, keys, vals, dim);
+                break;
+            case u8:
+                reduce_key<op, uchar, uchar>(keys_out, vals_out, keys, vals,
+                                             dim);
+                break;
+            case s8:
+                reduce_key<op, schar, schar>(keys_out, vals_out, keys, vals,
+                                             dim);
+                break;
+            case f16:
+                reduce_key<op, half, half>(keys_out, vals_out, keys, vals, dim);
+                break;
+            default: TYPE_ERROR(3, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
 
+template<af_op_t op>
+static af_err reduce_promote(af_array *out, const af_array in, const int dim,
+                             bool change_nan = false, double nanval = 0.0) {
+    try {
         ARG_ASSERT(2, dim >= 0);
-        ARG_ASSERT(2, dim <  4);
+        ARG_ASSERT(2, dim < 4);
 
-        const ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
 
-        if (dim >= (int)in_info.ndims()) {
+        if (dim >= static_cast<int>(in_info.ndims())) {
             *out = retain(in);
             return AF_SUCCESS;
         }
@@ -120,17 +319,56 @@ static af_err reduce_promote(af_array *out, const af_array in, const int dim)
         af_dtype type = in_info.getType();
         af_array res;
 
-        switch(type) {
-        case f32:  res = reduce<op, float  , float  >(in, dim); break;
-        case f64:  res = reduce<op, double , double >(in, dim); break;
-        case c32:  res = reduce<op, cfloat , cfloat >(in, dim); break;
-        case c64:  res = reduce<op, cdouble, cdouble>(in, dim); break;
-        case u32:  res = reduce<op, uint   , uint   >(in, dim); break;
-        case s32:  res = reduce<op, int    , int    >(in, dim); break;
-        case u8:   res = reduce<op, uchar  , uint   >(in, dim); break;
-            // Make sure you are adding only "1" for every non zero value, even if op == af_add_t
-        case b8:   res = reduce<af_notzero_t, char  , uint   >(in, dim); break;
-        default:   TYPE_ERROR(1, type);
+        switch (type) {
+            case f32:
+                res = reduce<op, float, float>(in, dim, change_nan, nanval);
+                break;
+            case f64:
+                res = reduce<op, double, double>(in, dim, change_nan, nanval);
+                break;
+            case c32:
+                res = reduce<op, cfloat, cfloat>(in, dim, change_nan, nanval);
+                break;
+            case c64:
+                res = reduce<op, cdouble, cdouble>(in, dim, change_nan, nanval);
+                break;
+            case u32:
+                res = reduce<op, uint, uint>(in, dim, change_nan, nanval);
+                break;
+            case s32:
+                res = reduce<op, int, int>(in, dim, change_nan, nanval);
+                break;
+            case u64:
+                res = reduce<op, uintl, uintl>(in, dim, change_nan, nanval);
+                break;
+            case s64:
+                res = reduce<op, intl, intl>(in, dim, change_nan, nanval);
+                break;
+            case u16:
+                res = reduce<op, ushort, uint>(in, dim, change_nan, nanval);
+                break;
+            case s16:
+                res = reduce<op, short, int>(in, dim, change_nan, nanval);
+                break;
+            case u8:
+                res = reduce<op, uchar, uint>(in, dim, change_nan, nanval);
+                break;
+            case s8:
+                res = reduce<op, schar, int>(in, dim, change_nan, nanval);
+                break;
+            case b8: {
+                if (op == af_mul_t) {
+                    res = reduce<af_and_t, char, char>(in, dim, change_nan,
+                                                       nanval);
+                } else {
+                    res = reduce<af_notzero_t, char, uint>(in, dim, change_nan,
+                                                           nanval);
+                }
+            } break;
+            case f16:
+                res = reduce<op, half, float>(in, dim, change_nan, nanval);
+                break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, res);
     }
@@ -139,71 +377,259 @@ static af_err reduce_promote(af_array *out, const af_array in, const int dim)
     return AF_SUCCESS;
 }
 
-af_err af_min(af_array *out, const af_array in, const int dim)
-{
+template<af_op_t op>
+static af_err reduce_promote_by_key(af_array *keys_out, af_array *vals_out,
+                                    const af_array keys, const af_array vals,
+                                    const int dim, bool change_nan = false,
+                                    double nanval = 0.0) {
+    try {
+        ARG_ASSERT(4, dim >= 0);
+        ARG_ASSERT(4, dim < 4);
+
+        const ArrayInfo &kinfo   = getInfo(keys);
+        const ArrayInfo &in_info = getInfo(vals);
+        af_dtype type            = in_info.getType();
+
+        ARG_ASSERT(2, kinfo.isVector());
+        ARG_ASSERT(2, in_info.dims()[dim] == kinfo.dims()[0]);
+
+        switch (type) {
+            case f32:
+                reduce_key<op, float, float>(keys_out, vals_out, keys, vals,
+                                             dim, change_nan, nanval);
+                break;
+            case f64:
+                reduce_key<op, double, double>(keys_out, vals_out, keys, vals,
+                                               dim, change_nan, nanval);
+                break;
+            case c32:
+                reduce_key<op, cfloat, cfloat>(keys_out, vals_out, keys, vals,
+                                               dim, change_nan, nanval);
+                break;
+            case c64:
+                reduce_key<op, cdouble, cdouble>(keys_out, vals_out, keys, vals,
+                                                 dim, change_nan, nanval);
+                break;
+            case u32:
+                reduce_key<op, uint, uint>(keys_out, vals_out, keys, vals, dim,
+                                           change_nan, nanval);
+                break;
+            case s32:
+                reduce_key<op, int, int>(keys_out, vals_out, keys, vals, dim,
+                                         change_nan, nanval);
+                break;
+            case u64:
+                reduce_key<op, uintl, uintl>(keys_out, vals_out, keys, vals,
+                                             dim, change_nan, nanval);
+                break;
+            case s64:
+                reduce_key<op, intl, intl>(keys_out, vals_out, keys, vals, dim,
+                                           change_nan, nanval);
+                break;
+            case u16:
+                reduce_key<op, ushort, uint>(keys_out, vals_out, keys, vals,
+                                             dim, change_nan, nanval);
+                break;
+            case s16:
+                reduce_key<op, short, int>(keys_out, vals_out, keys, vals, dim,
+                                           change_nan, nanval);
+                break;
+            case u8:
+                reduce_key<op, uchar, uint>(keys_out, vals_out, keys, vals, dim,
+                                            change_nan, nanval);
+                break;
+            case s8:
+                reduce_key<op, schar, int>(keys_out, vals_out, keys, vals, dim,
+                                           change_nan, nanval);
+                break;
+            case b8:
+                reduce_key<af_notzero_t, char, uint>(
+                    keys_out, vals_out, keys, vals, dim, change_nan, nanval);
+                break;
+            case f16:
+                reduce_key<op, half, float>(keys_out, vals_out, keys, vals, dim,
+                                            change_nan, nanval);
+                break;
+            default: TYPE_ERROR(3, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_min(af_array *out, const af_array in, const int dim) {
     return reduce_common<af_min_t>(out, in, dim);
 }
 
-af_err af_max(af_array *out, const af_array in, const int dim)
-{
+af_err af_max(af_array *out, const af_array in, const int dim) {
     return reduce_common<af_max_t>(out, in, dim);
 }
 
-af_err af_sum(af_array *out, const af_array in, const int dim)
-{
+af_err af_sum(af_array *out, const af_array in, const int dim) {
     return reduce_promote<af_add_t>(out, in, dim);
 }
 
-af_err af_product(af_array *out, const af_array in, const int dim)
-{
+af_err af_product(af_array *out, const af_array in, const int dim) {
     return reduce_promote<af_mul_t>(out, in, dim);
 }
 
-af_err af_count(af_array *out, const af_array in, const int dim)
-{
+af_err af_sum_nan(af_array *out, const af_array in, const int dim,
+                  const double nanval) {
+    return reduce_promote<af_add_t>(out, in, dim, true, nanval);
+}
+
+af_err af_product_nan(af_array *out, const af_array in, const int dim,
+                      const double nanval) {
+    return reduce_promote<af_mul_t>(out, in, dim, true, nanval);
+}
+
+af_err af_count(af_array *out, const af_array in, const int dim) {
     return reduce_type<af_notzero_t, uint>(out, in, dim);
 }
 
-af_err af_all_true(af_array *out, const af_array in, const int dim)
-{
+af_err af_all_true(af_array *out, const af_array in, const int dim) {
     return reduce_type<af_and_t, char>(out, in, dim);
 }
 
-af_err af_any_true(af_array *out, const af_array in, const int dim)
-{
+af_err af_any_true(af_array *out, const af_array in, const int dim) {
     return reduce_type<af_or_t, char>(out, in, dim);
 }
 
+// By-key versions of the reductions above
+af_err af_min_by_key(af_array *keys_out, af_array *vals_out,
+                     const af_array keys, const af_array vals, const int dim) {
+    return reduce_by_key_common<af_min_t>(keys_out, vals_out, keys, vals, dim);
+}
+
+af_err af_max_by_key(af_array *keys_out, af_array *vals_out,
+                     const af_array keys, const af_array vals, const int dim) {
+    return reduce_by_key_common<af_max_t>(keys_out, vals_out, keys, vals, dim);
+}
+
+af_err af_sum_by_key(af_array *keys_out, af_array *vals_out,
+                     const af_array keys, const af_array vals, const int dim) {
+    return reduce_promote_by_key<af_add_t>(keys_out, vals_out, keys, vals, dim);
+}
+
+af_err af_product_by_key(af_array *keys_out, af_array *vals_out,
+                         const af_array keys, const af_array vals,
+                         const int dim) {
+    return reduce_promote_by_key<af_mul_t>(keys_out, vals_out, keys, vals, dim);
+}
+
+af_err af_sum_by_key_nan(af_array *keys_out, af_array *vals_out,
+                         const af_array keys, const af_array vals,
+                         const int dim, const double nanval) {
+    return reduce_promote_by_key<af_add_t>(keys_out, vals_out, keys, vals, dim,
+                                           true, nanval);
+}
+
+af_err af_product_by_key_nan(af_array *keys_out, af_array *vals_out,
+                             const af_array keys, const af_array vals,
+                             const int dim, const double nanval) {
+    return reduce_promote_by_key<af_mul_t>(keys_out, vals_out, keys, vals, dim,
+                                           true, nanval);
+}
+
+af_err af_count_by_key(af_array *keys_out, af_array *vals_out,
+                       const af_array keys, const af_array vals,
+                       const int dim) {
+    return reduce_by_key_type<af_notzero_t, uint>(keys_out, vals_out, keys,
+                                                  vals, dim);
+}
+
+af_err af_all_true_by_key(af_array *keys_out, af_array *vals_out,
+                          const af_array keys, const af_array vals,
+                          const int dim) {
+    return reduce_by_key_type<af_and_t, char>(keys_out, vals_out, keys, vals,
+                                              dim);
+}
+
+af_err af_any_true_by_key(af_array *keys_out, af_array *vals_out,
+                          const af_array keys, const af_array vals,
+                          const int dim) {
+    return reduce_by_key_type<af_or_t, char>(keys_out, vals_out, keys, vals,
+                                             dim);
+}
+
 template<af_op_t op, typename Ti, typename To>
-static inline To reduce_all(const af_array in)
-{
-    return reduce_all<op,Ti,To>(getArray<Ti>(in));
+static inline af_array reduce_all_array(const af_array in,
+                                        bool change_nan = false,
+                                        double nanval   = 0) {
+    return getHandle(
+        detail::reduce_all<op, Ti, To>(getArray<Ti>(in), change_nan, nanval));
+}
+
+template<af_op_t op, typename Ti, typename Tacc, typename Tret = double>
+static inline Tret reduce_all(const af_array in, bool change_nan = false,
+                              double nanval = 0) {
+    return static_cast<Tret>(getScalar<Tacc>(
+        reduce_all<op, Ti, Tacc>(getArray<Ti>(in), change_nan, nanval)));
 }
 
 template<af_op_t op, typename To>
-static af_err reduce_all_type(double *real, double *imag, const af_array in)
-{
+static af_err reduce_all_type(double *real, double *imag, const af_array in) {
     try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
 
-        const ArrayInfo in_info = getInfo(in);
-        af_dtype type = in_info.getType();
-
-        ARG_ASSERT(0, real != NULL);
+        ARG_ASSERT(0, real != nullptr);
         *real = 0;
-        if (!imag) *imag = 0;
-
-        switch(type) {
-        case f32:  *real = (double)reduce_all<op, float  , To>(in); break;
-        case f64:  *real = (double)reduce_all<op, double , To>(in); break;
-        case c32:  *real = (double)reduce_all<op, cfloat , To>(in); break;
-        case c64:  *real = (double)reduce_all<op, cdouble, To>(in); break;
-        case u32:  *real = (double)reduce_all<op, uint   , To>(in); break;
-        case s32:  *real = (double)reduce_all<op, int    , To>(in); break;
-        case b8:   *real = (double)reduce_all<op, char   , To>(in); break;
-        case u8:   *real = (double)reduce_all<op, uchar  , To>(in); break;
-        default:   TYPE_ERROR(1, type);
+        if (imag) { *imag = 0; }
+
+        switch (type) {
+            // clang-format off
+            case f32: *real = reduce_all<op, float,   To>(in); break;
+            case f64: *real = reduce_all<op, double,  To>(in); break;
+            case c32: *real = reduce_all<op, cfloat,  To>(in); break;
+            case c64: *real = reduce_all<op, cdouble, To>(in); break;
+            case u32: *real = reduce_all<op, uint,    To>(in); break;
+            case s32: *real = reduce_all<op, int,     To>(in); break;
+            case u64: *real = reduce_all<op, uintl,   To>(in); break;
+            case s64: *real = reduce_all<op, intl,    To>(in); break;
+            case u16: *real = reduce_all<op, ushort,  To>(in); break;
+            case s16: *real = reduce_all<op, short,   To>(in); break;
+            case b8:  *real = reduce_all<op, char,    To>(in); break;
+            case u8:  *real = reduce_all<op, uchar,   To>(in); break;
+            case s8:  *real = reduce_all<op, schar,   To>(in); break;
+            case f16: *real = reduce_all<op, half,    To>(in); break;
+            // clang-format on
+            default: TYPE_ERROR(1, type);
         }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
 
+template<af_op_t op, typename To>
+static af_err reduce_all_type_array(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
+
+        af_array res;
+        switch (type) {
+            // clang-format off
+            case f32: res = reduce_all_array<op, float,   To>(in); break;
+            case f64: res = reduce_all_array<op, double,  To>(in); break;
+            case c32: res = reduce_all_array<op, cfloat,  To>(in); break;
+            case c64: res = reduce_all_array<op, cdouble, To>(in); break;
+            case u32: res = reduce_all_array<op, uint,    To>(in); break;
+            case s32: res = reduce_all_array<op, int,     To>(in); break;
+            case u64: res = reduce_all_array<op, uintl,   To>(in); break;
+            case s64: res = reduce_all_array<op, intl,    To>(in); break;
+            case u16: res = reduce_all_array<op, ushort,  To>(in); break;
+            case s16: res = reduce_all_array<op, short,   To>(in); break;
+            case b8:  res = reduce_all_array<op, char,    To>(in); break;
+            case u8:  res = reduce_all_array<op, uchar,   To>(in); break;
+            case s8:  res = reduce_all_array<op, schar,   To>(in); break;
+            case f16: res = reduce_all_array<op, half,    To>(in); break;
+            // clang-format on
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, res);
     }
     CATCHALL;
 
@@ -211,45 +637,86 @@ static af_err reduce_all_type(double *real, double *imag, const af_array in)
 }
 
 template<af_op_t op>
-static af_err reduce_all_common(double *real_val, double *imag_val, const af_array in)
-{
+static af_err reduce_all_common(double *real_val, double *imag_val,
+                                const af_array in) {
     try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
 
-        const ArrayInfo in_info = getInfo(in);
-        af_dtype type = in_info.getType();
-
-        ARG_ASSERT(0, real_val != NULL);
+        ARG_ASSERT(2, in_info.ndims() > 0);
+        ARG_ASSERT(0, real_val != nullptr);
         *real_val = 0;
-        if (!imag_val) *imag_val = 0;
+        if (imag_val != nullptr) { *imag_val = 0; }
 
-        cfloat  cfval;
+        cfloat cfval;
         cdouble cdval;
 
-        switch(type) {
-        case f32:  *real_val = (double)reduce_all<op, float  , float  >(in); break;
-        case f64:  *real_val = (double)reduce_all<op, double , double >(in); break;
-        case u32:  *real_val = (double)reduce_all<op, uint   , uint   >(in); break;
-        case s32:  *real_val = (double)reduce_all<op, int    , int    >(in); break;
-        case b8:   *real_val = (double)reduce_all<op, char   , char   >(in); break;
-        case u8:   *real_val = (double)reduce_all<op, uchar  , uchar  >(in); break;
-
-        case c32:
-            cfval = reduce_all<op, cfloat, cfloat>(in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cfval);
-            *imag_val = imag(cfval);
-            break;
+        switch (type) {
+            // clang-format off
+            case f32: *real_val = reduce_all<op, float,  float>(in); break;
+            case f64: *real_val = reduce_all<op, double, double>(in); break;
+            case u32: *real_val = reduce_all<op, uint,   uint>(in); break;
+            case s32: *real_val = reduce_all<op, int,    int>(in); break;
+            case u64: *real_val = reduce_all<op, uintl,  uintl>(in); break;
+            case s64: *real_val = reduce_all<op, intl,   intl>(in); break;
+            case u16: *real_val = reduce_all<op, ushort, ushort>(in); break;
+            case s16: *real_val = reduce_all<op, short,  short>(in); break;
+            case b8:  *real_val = reduce_all<op, char,   char>(in); break;
+            case u8:  *real_val = reduce_all<op, uchar,  uchar>(in); break;
+            case s8:  *real_val = reduce_all<op, schar,  schar>(in); break;
+            case f16: *real_val = reduce_all<op, half,   half>(in); break;
+            // clang-format on
+            case c32:
+                cfval = reduce_all<op, cfloat, cfloat, cfloat>(in);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cfval);
+                *imag_val = imag(cfval);
+                break;
+
+            case c64:
+                cdval = reduce_all<op, cdouble, cdouble, cdouble>(in);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cdval);
+                *imag_val = imag(cdval);
+                break;
+
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
 
-        case c64:
-            cdval = reduce_all<op, cdouble, cdouble>(in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cdval);
-            *imag_val = imag(cdval);
-            break;
+    return AF_SUCCESS;
+}
 
-        default:   TYPE_ERROR(1, type);
-        }
+template<af_op_t op>
+static af_err reduce_all_common_array(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
+
+        ARG_ASSERT(2, in_info.ndims() > 0);
+        af_array res;
 
+        switch (type) {
+            // clang-format off
+            case f32: res = reduce_all_array<op, float,  float>(in); break;
+            case f64: res = reduce_all_array<op, double, double>(in); break;
+            case u32: res = reduce_all_array<op, uint,   uint>(in); break;
+            case s32: res = reduce_all_array<op, int,    int>(in); break;
+            case u64: res = reduce_all_array<op, uintl,  uintl>(in); break;
+            case s64: res = reduce_all_array<op, intl,   intl>(in); break;
+            case u16: res = reduce_all_array<op, ushort, ushort>(in); break;
+            case s16: res = reduce_all_array<op, short,  short>(in); break;
+            case b8:  res = reduce_all_array<op, char,   char>(in); break;
+            case u8:  res = reduce_all_array<op, uchar,  uchar>(in); break;
+            case s8:  res = reduce_all_array<op, schar,  schar>(in); break;
+            case f16: res = reduce_all_array<op, half,   half>(in); break;
+            // clang-format on
+            case c32: res = reduce_all_array<op, cfloat, cfloat>(in); break;
+            case c64: res = reduce_all_array<op, cdouble, cdouble>(in); break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, res);
     }
     CATCHALL;
 
@@ -257,95 +724,205 @@ static af_err reduce_all_common(double *real_val, double *imag_val, const af_arr
 }
 
 template<af_op_t op>
-static af_err reduce_all_promote(double *real_val, double *imag_val, const af_array in)
-{
+static af_err reduce_all_promote(double *real_val, double *imag_val,
+                                 const af_array in, bool change_nan = false,
+                                 double nanval = 0) {
     try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
 
-        const ArrayInfo in_info = getInfo(in);
-        af_dtype type = in_info.getType();
-
-        ARG_ASSERT(0, real_val != NULL);
+        ARG_ASSERT(0, real_val != nullptr);
         *real_val = 0;
-        if (!imag_val) *imag_val = 0;
+        if (imag_val) { *imag_val = 0; }
 
-        cfloat  cfval;
+        cfloat cfval;
         cdouble cdval;
 
-        switch(type) {
-        case f32: *real_val = (double)reduce_all<op, float  , float  >(in); break;
-        case f64: *real_val = (double)reduce_all<op, double , double >(in); break;
-        case u32: *real_val = (double)reduce_all<op, uint   , uint   >(in); break;
-        case s32: *real_val = (double)reduce_all<op, int    , int    >(in); break;
-        case u8:  *real_val = (double)reduce_all<op, uchar  , uint   >(in); break;
-            // Make sure you are adding only "1" for every non zero value, even if op == af_add_t
-        case b8:  *real_val = (double)reduce_all<af_notzero_t, char  , uint   >(in); break;
-
-        case c32:
-            cfval = reduce_all<op, cfloat, cfloat>(in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cfval);
-            *imag_val = imag(cfval);
-            break;
+        switch (type) {
+            // clang-format off
+            case f32: *real_val = reduce_all<op, float,   float>(in, change_nan, nanval); break;
+            case f64: *real_val = reduce_all<op, double, double>(in, change_nan, nanval); break;
+            case u32: *real_val = reduce_all<op, uint,     uint>(in, change_nan, nanval); break;
+            case s32: *real_val = reduce_all<op, int,       int>(in, change_nan, nanval); break;
+            case u64: *real_val = reduce_all<op, uintl,   uintl>(in, change_nan, nanval); break;
+            case s64: *real_val = reduce_all<op, intl,     intl>(in, change_nan, nanval); break;
+            case u16: *real_val = reduce_all<op, ushort,   uint>(in, change_nan, nanval); break;
+            case s16: *real_val = reduce_all<op, short,     int>(in, change_nan, nanval); break;
+            case u8:  *real_val = reduce_all<op, uchar,    uint>(in, change_nan, nanval); break;
+            case s8:  *real_val = reduce_all<op, schar,     int>(in, change_nan, nanval); break;
+            // clang-format on
+            case b8: {
+                if (op == af_mul_t) {
+                    *real_val = reduce_all<af_and_t, char, char>(in, change_nan,
+                                                                 nanval);
+                } else {
+                    *real_val = reduce_all<af_notzero_t, char, uint>(
+                        in, change_nan, nanval);
+                }
+            } break;
+            case c32:
+                cfval = reduce_all<op, cfloat, cfloat, cfloat>(in, change_nan, nanval);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cfval);
+                *imag_val = imag(cfval);
+                break;
+
+            case c64:
+                cdval = reduce_all<op, cdouble, cdouble, cdouble>(in, change_nan, nanval);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cdval);
+                *imag_val = imag(cdval);
+                break;
+            case f16:
+                *real_val = reduce_all<op, half, float>(in, change_nan, nanval);
+                break;
+
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
 
-        case c64:
-            cdval = reduce_all<op, cdouble, cdouble>(in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cdval);
-            *imag_val = imag(cdval);
-            break;
+    return AF_SUCCESS;
+}
+
+template<af_op_t op>
+static af_err reduce_all_promote_array(af_array *out, const af_array in,
+                                       bool change_nan = false,
+                                       double nanval   = 0.0) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
 
-        default:   TYPE_ERROR(1, type);
+        af_dtype type = in_info.getType();
+        af_array res;
+
+        switch (type) {
+            case f32:
+                res =
+                    reduce_all_array<op, float, float>(in, change_nan, nanval);
+                break;
+            case f64:
+                res = reduce_all_array<op, double, double>(in, change_nan,
+                                                           nanval);
+                break;
+            case c32:
+                res = reduce_all_array<op, cfloat, cfloat>(in, change_nan,
+                                                           nanval);
+                break;
+            case c64:
+                res = reduce_all_array<op, cdouble, cdouble>(in, change_nan,
+                                                             nanval);
+                break;
+            case u32:
+                res = reduce_all_array<op, uint, uint>(in, change_nan, nanval);
+                break;
+            case s32:
+                res = reduce_all_array<op, int, int>(in, change_nan, nanval);
+                break;
+            case u64:
+                res =
+                    reduce_all_array<op, uintl, uintl>(in, change_nan, nanval);
+                break;
+            case s64:
+                res = reduce_all_array<op, intl, intl>(in, change_nan, nanval);
+                break;
+            case u16:
+                res =
+                    reduce_all_array<op, ushort, uint>(in, change_nan, nanval);
+                break;
+            case s16:
+                res = reduce_all_array<op, short, int>(in, change_nan, nanval);
+                break;
+            case u8:
+                res = reduce_all_array<op, uchar, uint>(in, change_nan, nanval);
+                break;
+            case s8:
+                res = reduce_all_array<op, schar, int>(in, change_nan, nanval);
+                break;
+            case b8: {
+                if (op == af_mul_t) {
+                    res = reduce_all_array<af_and_t, char, char>(in, change_nan,
+                                                                 nanval);
+                } else {
+                    res = reduce_all_array<af_notzero_t, char, uint>(
+                        in, change_nan, nanval);
+                }
+            } break;
+            case f16:
+                res = reduce_all_array<op, half, float>(in, change_nan, nanval);
+                break;
+            default: TYPE_ERROR(1, type);
         }
+        std::swap(*out, res);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_min_all(double *real, double *imag, const af_array in)
-{
+af_err af_min_all(double *real, double *imag, const af_array in) {
     return reduce_all_common<af_min_t>(real, imag, in);
 }
 
-af_err af_max_all(double *real, double *imag, const af_array in)
-{
+af_err af_min_all_array(af_array *out, const af_array in) {
+    return reduce_all_common_array<af_min_t>(out, in);
+}
+
+af_err af_max_all(double *real, double *imag, const af_array in) {
     return reduce_all_common<af_max_t>(real, imag, in);
 }
 
-af_err af_sum_all(double *real, double *imag, const af_array in)
-{
+af_err af_max_all_array(af_array *out, const af_array in) {
+    return reduce_all_common_array<af_max_t>(out, in);
+}
+
+af_err af_sum_all(double *real, double *imag, const af_array in) {
     return reduce_all_promote<af_add_t>(real, imag, in);
 }
 
-af_err af_product_all(double *real, double *imag, const af_array in)
-{
+af_err af_sum_all_array(af_array *out, const af_array in) {
+    return reduce_all_promote_array<af_add_t>(out, in);
+}
+
+af_err af_product_all(double *real, double *imag, const af_array in) {
     return reduce_all_promote<af_mul_t>(real, imag, in);
 }
 
-af_err af_count_all(double *real, double *imag, const af_array in)
-{
+af_err af_product_all_array(af_array *out, const af_array in) {
+    return reduce_all_promote_array<af_mul_t>(out, in);
+}
+
+af_err af_count_all(double *real, double *imag, const af_array in) {
     return reduce_all_type<af_notzero_t, uint>(real, imag, in);
 }
 
-af_err af_all_true_all(double *real, double *imag, const af_array in)
-{
+af_err af_count_all_array(af_array *out, const af_array in) {
+    return reduce_all_type_array<af_notzero_t, uint>(out, in);
+}
+
+af_err af_all_true_all(double *real, double *imag, const af_array in) {
     return reduce_all_type<af_and_t, char>(real, imag, in);
 }
 
-af_err af_any_true_all(double *real, double *imag, const af_array in)
-{
-    return reduce_all_type<af_or_t , char>(real, imag, in);
+af_err af_all_true_all_array(af_array *out, const af_array in) {
+    return reduce_all_type_array<af_and_t, char>(out, in);
+}
+
+af_err af_any_true_all(double *real, double *imag, const af_array in) {
+    return reduce_all_type<af_or_t, char>(real, imag, in);
+}
+
+af_err af_any_true_all_array(af_array *out, const af_array in) {
+    return reduce_all_type_array<af_or_t, char>(out, in);
 }
 
 template<af_op_t op, typename T>
-static inline void ireduce(af_array *res, af_array *loc,
-                           const af_array in, const int dim)
-{
+static inline void ireduce(af_array *res, af_array *loc, const af_array in,
+                           const int dim) {
     const Array<T> In = getArray<T>(in);
-    dim4 odims = In.dims();
-    odims[dim] = 1;
+    dim4 odims        = In.dims();
+    odims[dim]        = 1;
 
-    Array<T> Res = createEmptyArray<T>(odims);
+    Array<T> Res    = createEmptyArray<T>(odims);
     Array<uint> Loc = createEmptyArray<uint>(odims);
     ireduce<op, T>(Res, Loc, In, dim);
 
@@ -353,34 +930,57 @@ static inline void ireduce(af_array *res, af_array *loc,
     *loc = getHandle(Loc);
 }
 
+template<af_op_t op, typename T>
+static inline void rreduce(af_array *res, af_array *loc, const af_array in,
+                           const int dim, const af_array ragged_len) {
+    const Array<T> In     = getArray<T>(in);
+    const Array<uint> Len = getArray<uint>(ragged_len);
+    dim4 odims            = In.dims();
+    odims[dim]            = 1;
+
+    Array<T> Res    = createEmptyArray<T>(odims);
+    Array<uint> Loc = createEmptyArray<uint>(odims);
+    rreduce<op, T>(Res, Loc, In, dim, Len);
+
+    *res = getHandle(Res);
+    *loc = getHandle(Loc);
+}
+
 template<af_op_t op>
-static af_err ireduce_common(af_array *val, af_array *idx, const af_array in, const int dim)
-{
+static af_err ireduce_common(af_array *val, af_array *idx, const af_array in,
+                             const int dim) {
     try {
+        ARG_ASSERT(3, dim >= 0);
+        ARG_ASSERT(3, dim < 4);
 
-        ARG_ASSERT(2, dim >= 0);
-        ARG_ASSERT(2, dim <  4);
-
-        const ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
+        ARG_ASSERT(2, in_info.ndims() > 0);
 
-        if (dim >= (int)in_info.ndims()) {
+        if (dim >= static_cast<int>(in_info.ndims())) {
             *val = retain(in);
+            *idx = createHandleFromValue<uint>(in_info.dims(), 0);
             return AF_SUCCESS;
         }
 
         af_dtype type = in_info.getType();
         af_array res, loc;
 
-        switch(type) {
-        case f32:  ireduce<op, float  >(&res, &loc, in, dim); break;
-        case f64:  ireduce<op, double >(&res, &loc, in, dim); break;
-        case c32:  ireduce<op, cfloat >(&res, &loc, in, dim); break;
-        case c64:  ireduce<op, cdouble>(&res, &loc, in, dim); break;
-        case u32:  ireduce<op, uint   >(&res, &loc, in, dim); break;
-        case s32:  ireduce<op, int    >(&res, &loc, in, dim); break;
-        case b8:   ireduce<op, char   >(&res, &loc, in, dim); break;
-        case u8:   ireduce<op, uchar  >(&res, &loc, in, dim); break;
-        default:   TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: ireduce<op, float>(&res, &loc, in, dim); break;
+            case f64: ireduce<op, double>(&res, &loc, in, dim); break;
+            case c32: ireduce<op, cfloat>(&res, &loc, in, dim); break;
+            case c64: ireduce<op, cdouble>(&res, &loc, in, dim); break;
+            case u32: ireduce<op, uint>(&res, &loc, in, dim); break;
+            case s32: ireduce<op, int>(&res, &loc, in, dim); break;
+            case u64: ireduce<op, uintl>(&res, &loc, in, dim); break;
+            case s64: ireduce<op, intl>(&res, &loc, in, dim); break;
+            case u16: ireduce<op, ushort>(&res, &loc, in, dim); break;
+            case s16: ireduce<op, short>(&res, &loc, in, dim); break;
+            case b8: ireduce<op, char>(&res, &loc, in, dim); break;
+            case u8: ireduce<op, uchar>(&res, &loc, in, dim); break;
+            case s8: ireduce<op, schar>(&res, &loc, in, dim); break;
+            case f16: ireduce<op, half>(&res, &loc, in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*val, res);
@@ -391,75 +991,178 @@ static af_err ireduce_common(af_array *val, af_array *idx, const af_array in, co
     return AF_SUCCESS;
 }
 
-af_err af_imin(af_array *val, af_array *idx, const af_array in, const int dim)
-{
+af_err af_imin(af_array *val, af_array *idx, const af_array in, const int dim) {
     return ireduce_common<af_min_t>(val, idx, in, dim);
 }
 
-af_err af_imax(af_array *val, af_array *idx, const af_array in, const int dim)
-{
+af_err af_imax(af_array *val, af_array *idx, const af_array in, const int dim) {
     return ireduce_common<af_max_t>(val, idx, in, dim);
 }
 
-template<af_op_t op, typename T>
-static inline T ireduce_all(unsigned *loc, const af_array in)
-{
-    return ireduce_all<op, T>(loc, getArray<T>(in));
+template<af_op_t op>
+static af_err rreduce_common(af_array *val, af_array *idx, const af_array in,
+                             const af_array ragged_len, const int dim) {
+    try {
+        ARG_ASSERT(3, dim >= 0);
+        ARG_ASSERT(3, dim < 4);
+
+        const ArrayInfo &in_info = getInfo(in);
+        ARG_ASSERT(2, in_info.ndims() > 0);
+
+        if (dim >= static_cast<int>(in_info.ndims())) {
+            *val = retain(in);
+            *idx = createHandleFromValue<uint>(in_info.dims(), 0);
+            return AF_SUCCESS;
+        }
+
+        // Ensure ragged_len.dims() == in.dims(), except along the reduced dim
+        const ArrayInfo &ragged_info = getInfo(ragged_len);
+        dim4 test_dim                = in_info.dims();
+        test_dim[dim]                = 1;
+        ARG_ASSERT(4, test_dim == ragged_info.dims());
+
+        af_dtype keytype = ragged_info.getType();
+        if (keytype != u32) { TYPE_ERROR(4, keytype); }
+
+        af_dtype type = in_info.getType();
+        af_array res, loc;
+
+        switch (type) {
+            case f32:
+                rreduce<op, float>(&res, &loc, in, dim, ragged_len);
+                break;
+            case f64:
+                rreduce<op, double>(&res, &loc, in, dim, ragged_len);
+                break;
+            case c32:
+                rreduce<op, cfloat>(&res, &loc, in, dim, ragged_len);
+                break;
+            case c64:
+                rreduce<op, cdouble>(&res, &loc, in, dim, ragged_len);
+                break;
+            case u32: rreduce<op, uint>(&res, &loc, in, dim, ragged_len); break;
+            case s32: rreduce<op, int>(&res, &loc, in, dim, ragged_len); break;
+            case u64:
+                rreduce<op, uintl>(&res, &loc, in, dim, ragged_len);
+                break;
+            case s64: rreduce<op, intl>(&res, &loc, in, dim, ragged_len); break;
+            case u16:
+                rreduce<op, ushort>(&res, &loc, in, dim, ragged_len);
+                break;
+            case s16:
+                rreduce<op, short>(&res, &loc, in, dim, ragged_len);
+                break;
+            case b8: rreduce<op, char>(&res, &loc, in, dim, ragged_len); break;
+            case u8: rreduce<op, uchar>(&res, &loc, in, dim, ragged_len); break;
+            case s8: rreduce<op, schar>(&res, &loc, in, dim, ragged_len); break;
+            case f16: rreduce<op, half>(&res, &loc, in, dim, ragged_len); break;
+            default: TYPE_ERROR(2, type);
+        }
+
+        std::swap(*val, res);
+        std::swap(*idx, loc);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_max_ragged(af_array *val, af_array *idx, const af_array in,
+                     const af_array ragged_len, const int dim) {
+    return rreduce_common<af_max_t>(val, idx, in, ragged_len, dim);
+}
+
+template<af_op_t op, typename T, typename Tret = T>
+static inline Tret ireduce_all(unsigned *loc, const af_array in) {
+    return static_cast<Tret>(ireduce_all<op, T>(loc, getArray<T>(in)));
 }
 
 template<af_op_t op>
 static af_err ireduce_all_common(double *real_val, double *imag_val,
-                                 unsigned *loc, const af_array in)
-{
+                                 unsigned *loc, const af_array in) {
     try {
+        const ArrayInfo &in_info = getInfo(in);
+        af_dtype type            = in_info.getType();
 
-        const ArrayInfo in_info = getInfo(in);
-        af_dtype type = in_info.getType();
-
-        ARG_ASSERT(0, real_val != NULL);
+        ARG_ASSERT(3, in_info.ndims() > 0);
+        ARG_ASSERT(0, real_val != nullptr);
         *real_val = 0;
-        if (!imag_val) *imag_val = 0;
+        if (imag_val) { *imag_val = 0; }
 
-        cfloat  cfval;
+        cfloat cfval;
         cdouble cdval;
 
-        switch(type) {
-        case f32:  *real_val = (double)ireduce_all<op, float >(loc, in); break;
-        case f64:  *real_val = (double)ireduce_all<op, double>(loc, in); break;
-        case u32:  *real_val = (double)ireduce_all<op, uint  >(loc, in); break;
-        case s32:  *real_val = (double)ireduce_all<op, int   >(loc, in); break;
-        case b8:   *real_val = (double)ireduce_all<op, char  >(loc, in); break;
-        case u8:   *real_val = (double)ireduce_all<op, uchar >(loc, in); break;
-
-        case c32:
-            cfval = ireduce_all<op, cfloat>(loc, in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cfval);
-            *imag_val = imag(cfval);
-            break;
-
-        case c64:
-            cdval = ireduce_all<op, cdouble>(loc, in);
-            ARG_ASSERT(1, imag_val != NULL);
-            *real_val = real(cdval);
-            *imag_val = imag(cdval);
-            break;
-
-        default:   TYPE_ERROR(1, type);
+        switch (type) {
+            case f32:
+                *real_val = ireduce_all<op, float, double>(loc, in);
+                break;
+            case f64:
+                *real_val = ireduce_all<op, double, double>(loc, in);
+                break;
+            case u32: *real_val = ireduce_all<op, uint, double>(loc, in); break;
+            case s32: *real_val = ireduce_all<op, int, double>(loc, in); break;
+            case u64:
+                *real_val = ireduce_all<op, uintl, double>(loc, in);
+                break;
+            case s64: *real_val = ireduce_all<op, intl, double>(loc, in); break;
+            case u16:
+                *real_val = ireduce_all<op, ushort, double>(loc, in);
+                break;
+            case s16:
+                *real_val = ireduce_all<op, short, double>(loc, in);
+                break;
+            case b8: *real_val = ireduce_all<op, char, double>(loc, in); break;
+            case u8: *real_val = ireduce_all<op, uchar, double>(loc, in); break;
+            case s8: *real_val = ireduce_all<op, schar, double>(loc, in); break;
+
+            case c32:
+                cfval = ireduce_all<op, cfloat>(loc, in);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cfval);
+                *imag_val = imag(cfval);
+                break;
+
+            case c64:
+                cdval = ireduce_all<op, cdouble>(loc, in);
+                ARG_ASSERT(1, imag_val != nullptr);
+                *real_val = real(cdval);
+                *imag_val = imag(cdval);
+                break;
+
+            default: TYPE_ERROR(1, type);
         }
-
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_imin_all(double *real, double *imag, unsigned *idx, const af_array in)
-{
+af_err af_imin_all(double *real, double *imag, unsigned *idx,
+                   const af_array in) {
     return ireduce_all_common<af_min_t>(real, imag, idx, in);
 }
 
-af_err af_imax_all(double *real, double *imag, unsigned *idx, const af_array in)
-{
+af_err af_imax_all(double *real, double *imag, unsigned *idx,
+                   const af_array in) {
     return ireduce_all_common<af_max_t>(real, imag, idx, in);
 }
+
+af_err af_sum_nan_all(double *real, double *imag, const af_array in,
+                      const double nanval) {
+    return reduce_all_promote<af_add_t>(real, imag, in, true, nanval);
+}
+
+af_err af_sum_nan_all_array(af_array *out, const af_array in,
+                            const double nanval) {
+    return reduce_all_promote_array<af_add_t>(out, in, true, nanval);
+}
+
+af_err af_product_nan_all(double *real, double *imag, const af_array in,
+                          const double nanval) {
+    return reduce_all_promote<af_mul_t>(real, imag, in, true, nanval);
+}
+
+af_err af_product_nan_all_array(af_array *out, const af_array in,
+                                const double nanval) {
+    return reduce_all_promote_array<af_mul_t>(out, in, true, nanval);
+}
diff --git a/src/api/c/regions.cpp b/src/api/c/regions.cpp
index 4245eac8d5..a76391de5a 100644
--- a/src/api/c/regions.cpp
+++ b/src/api/c/regions.cpp
@@ -7,46 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <regions.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 
 using af::dim4;
-using namespace detail;
+using detail::uint;
+using detail::ushort;
 
 template<typename T>
-static af_array regions(af_array const &in, af_connectivity connectivity)
-{
+static af_array regions(af_array const &in, af_connectivity connectivity) {
     return getHandle<T>(regions<T>(getArray<char>(in), connectivity));
 }
 
-af_err af_regions(af_array *out, const af_array in, const af_connectivity connectivity, const af_dtype type)
-{
+af_err af_regions(af_array *out, const af_array in,
+                  const af_connectivity connectivity, const af_dtype type) {
     try {
-        ARG_ASSERT(2, (connectivity==AF_CONNECTIVITY_4 || connectivity==AF_CONNECTIVITY_8));
+        ARG_ASSERT(2, (connectivity == AF_CONNECTIVITY_4 ||
+                       connectivity == AF_CONNECTIVITY_8));
 
-        ArrayInfo info = getInfo(in);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 dims         = info.dims();
 
         dim_t in_ndims = dims.ndims();
-        DIM_ASSERT(1, (in_ndims <= 3 && in_ndims >= 2));
+        DIM_ASSERT(1, (in_ndims == 2));
 
         af_dtype in_type = info.getType();
-        if (in_type != b8) {
-            TYPE_ERROR(1, in_type);
-        }
+        if (in_type != b8) { TYPE_ERROR(1, in_type); }
 
         af_array output;
-        switch(type) {
-            case f32: output = regions<float >(in, connectivity); break;
+        switch (type) {
+            case f32: output = regions<float>(in, connectivity); break;
             case f64: output = regions<double>(in, connectivity); break;
-            case s32: output = regions<int   >(in, connectivity); break;
-            case u32: output = regions<uint  >(in, connectivity); break;
-            default : TYPE_ERROR(0, type);
+            case s32: output = regions<int>(in, connectivity); break;
+            case u32: output = regions<uint>(in, connectivity); break;
+            case s16: output = regions<short>(in, connectivity); break;
+            case u16: output = regions<ushort>(in, connectivity); break;
+            default: TYPE_ERROR(0, type);
         }
         std::swap(*out, output);
     }
diff --git a/src/api/c/reorder.cpp b/src/api/c/reorder.cpp
index 733981cad8..e29fb621c0 100644
--- a/src/api/c/reorder.cpp
+++ b/src/api/c/reorder.cpp
@@ -7,28 +7,76 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/data.h>
-#include <af/blas.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
-#include <ArrayInfo.hpp>
 #include <reorder.hpp>
 
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <transpose.hpp>
+
+#include <af/blas.h>
+#include <af/data.h>
+
 using af::dim4;
-using namespace detail;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::swap;
 
 template<typename T>
-static inline af_array reorder(const af_array in, const af::dim4 &rdims)
-{
-    return getHandle(reorder<T>(getArray<T>(in), rdims));
+static inline af_array reorder(const af_array in, const af::dim4 &rdims0) {
+    Array<T> In = detail::createEmptyArray<T>(af::dim4(0));
+    dim4 rdims  = rdims0;
+
+    if (rdims[0] == 1 && rdims[1] == 0) {
+        In = transpose(getArray<T>(in), false);
+        std::swap(rdims[0], rdims[1]);
+    } else {
+        In = getArray<T>(in);
+    }
+    const dim4 idims    = In.dims();
+    const dim4 istrides = In.strides();
+
+    // Ensure all JIT nodes are evaluated
+    In.eval();
+
+    af_array out;
+    if (rdims[0] == 0 && rdims[1] == 1 && rdims[2] == 2 && rdims[3] == 3) {
+        out = getHandle(In);
+    } else if (rdims[0] == 0) {
+        dim4 odims    = dim4(1, 1, 1, 1);
+        dim4 ostrides = dim4(1, 1, 1, 1);
+        for (int i = 0; i < 4; i++) {
+            odims[i]    = idims[rdims[i]];
+            ostrides[i] = istrides[rdims[i]];
+        }
+        Array<T> Out = In;
+        // Use modDims instead of setDataDims to only modify the ArrayInfo
+        Out.modDims(odims);
+        Out.modStrides(ostrides);
+        out = getHandle(Out);
+    } else {
+        Array<T> Out = reorder<T>(In, rdims);
+        out          = getHandle(Out);
+    }
+    return out;
 }
 
-af_err af_reorder(af_array *out, const af_array in, const af::dim4 &rdims)
-{
+af_err af_reorder(af_array *out, const af_array in, const af::dim4 &rdims) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        if (info.elements() == 0) { return af_retain_array(out, in); }
 
         DIM_ASSERT(1, info.elements() > 0);
 
@@ -43,47 +91,42 @@ af_err af_reorder(af_array *out, const af_array in, const af::dim4 &rdims)
         // i = 2 => 3 found and cond is true so alldims[3] = -1
         // i = 3 => 1 found and cond is true so alldims[1] = -1
         // rdims = {2, 0, 3, 2} // Failure case
-        // i = 3 => 2 found so cond is false (since alldims[2] = -1 when i = 0) so failed.
+        // i = 3 => 2 found, but cond is false because alldims[2] was set to
+        // -1 at i = 0, so the check fails.
         dim_t allDims[] = {0, 1, 2, 3};
-        for(int i = 0; i < 4; i++) {
+        for (int i = 0; i < 4; i++) {
             DIM_ASSERT(i + 2, rdims[i] == allDims[rdims[i]]);
             allDims[rdims[i]] = -1;
         }
 
-        // If reorder is a (batched) transpose, then call transpose
-        if(info.dims()[3] == 1) {
-            if(rdims[0] == 1 && rdims[1] == 0 &&
-               rdims[2] == 2 && rdims[3] == 3) {
-                return af_transpose(out, in, false);
-            }
-        }
-
         af_array output;
 
-        switch(type) {
-            case f32: output = reorder<float  >(in, rdims);  break;
-            case c32: output = reorder<cfloat >(in, rdims);  break;
-            case f64: output = reorder<double >(in, rdims);  break;
-            case c64: output = reorder<cdouble>(in, rdims);  break;
-            case b8:  output = reorder<char   >(in, rdims);  break;
-            case s32: output = reorder<int    >(in, rdims);  break;
-            case u32: output = reorder<uint   >(in, rdims);  break;
-            case u8:  output = reorder<uchar  >(in, rdims);  break;
-            case s64: output = reorder<intl   >(in, rdims);  break;
-            case u64: output = reorder<uintl  >(in, rdims);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = reorder<float>(in, rdims); break;
+            case c32: output = reorder<cfloat>(in, rdims); break;
+            case f64: output = reorder<double>(in, rdims); break;
+            case c64: output = reorder<cdouble>(in, rdims); break;
+            case b8: output = reorder<char>(in, rdims); break;
+            case s32: output = reorder<int>(in, rdims); break;
+            case u32: output = reorder<uint>(in, rdims); break;
+            case s8: output = reorder<schar>(in, rdims); break;
+            case u8: output = reorder<uchar>(in, rdims); break;
+            case s64: output = reorder<intl>(in, rdims); break;
+            case u64: output = reorder<uintl>(in, rdims); break;
+            case s16: output = reorder<short>(in, rdims); break;
+            case u16: output = reorder<ushort>(in, rdims); break;
+            case f16: output = reorder<half>(in, rdims); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_reorder(af_array *out, const af_array in,
-               const unsigned x, const unsigned y,
-               const unsigned z, const unsigned w)
-{
+af_err af_reorder(af_array *out, const af_array in, const unsigned x,
+                  const unsigned y, const unsigned z, const unsigned w) {
     af::dim4 rdims(x, y, z, w);
     return af_reorder(out, in, rdims);
 }
diff --git a/src/api/c/replace.cpp b/src/api/c/replace.cpp
new file mode 100644
index 0000000000..7bf66cc439
--- /dev/null
+++ b/src/api/c/replace.cpp
@@ -0,0 +1,143 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/traits.hpp>
+#include <handle.hpp>
+#include <implicit.hpp>
+#include <optypes.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
+
+#include <select.hpp>
+
+using af::dim4;
+using arrayfire::getCopyOnWriteArray;
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::select_scalar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T>
+void replace(af_array a, const af_array cond, const af_array b) {
+    select(getCopyOnWriteArray<T>(a), getArray<char>(cond), getArray<T>(a),
+           getArray<T>(b));
+}
+
+af_err af_replace(af_array a, const af_array cond, const af_array b) {
+    try {
+        const ArrayInfo& ainfo = getInfo(a);
+        const ArrayInfo& binfo = getInfo(b);
+        const ArrayInfo& cinfo = getInfo(cond);
+
+        if (cinfo.ndims() == 0) { return AF_SUCCESS; }
+
+        ARG_ASSERT(2, ainfo.getType() == binfo.getType());
+        ARG_ASSERT(1, cinfo.getType() == b8);
+
+        DIM_ASSERT(1, ainfo.ndims() >= binfo.ndims());
+        DIM_ASSERT(1, cinfo.ndims() == std::min(ainfo.ndims(), binfo.ndims()));
+
+        dim4 adims = ainfo.dims();
+        dim4 bdims = binfo.dims();
+        dim4 cdims = cinfo.dims();
+
+        for (int i = 0; i < 4; i++) {
+            DIM_ASSERT(1, cdims[i] == std::min(adims[i], bdims[i]));
+            DIM_ASSERT(2, adims[i] == bdims[i] || bdims[i] == 1);
+        }
+
+        switch (ainfo.getType()) {
+            case f16: replace<half>(a, cond, b); break;
+            case f32: replace<float>(a, cond, b); break;
+            case f64: replace<double>(a, cond, b); break;
+            case c32: replace<cfloat>(a, cond, b); break;
+            case c64: replace<cdouble>(a, cond, b); break;
+            case s32: replace<int>(a, cond, b); break;
+            case u32: replace<uint>(a, cond, b); break;
+            case s64: replace<intl>(a, cond, b); break;
+            case u64: replace<uintl>(a, cond, b); break;
+            case s16: replace<short>(a, cond, b); break;
+            case u16: replace<ushort>(a, cond, b); break;
+            case s8: replace<schar>(a, cond, b); break;
+            case u8: replace<uchar>(a, cond, b); break;
+            case b8: replace<char>(a, cond, b); break;
+            default: TYPE_ERROR(2, ainfo.getType());
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<typename ArrayType, typename ScalarType>
+void replace_scalar(af_array a, const af_array cond, const ScalarType& b) {
+    select_scalar<ArrayType, false>(
+        getCopyOnWriteArray<ArrayType>(a), getArray<char>(cond),
+        getArray<ArrayType>(a), detail::scalar<ArrayType>(b));
+}
+
+template<typename ScalarType>
+af_err replaceScalar(af_array a, const af_array cond, const ScalarType b) {
+    try {
+        const ArrayInfo& ainfo = getInfo(a);
+        const ArrayInfo& cinfo = getInfo(cond);
+
+        ARG_ASSERT(1, cinfo.getType() == b8);
+        DIM_ASSERT(1, cinfo.ndims() == ainfo.ndims());
+
+        dim4 adims = ainfo.dims();
+        dim4 cdims = cinfo.dims();
+
+        for (int i = 0; i < 4; i++) { DIM_ASSERT(1, cdims[i] == adims[i]); }
+
+        switch (ainfo.getType()) {
+            case f16: replace_scalar<half>(a, cond, b); break;
+            case f32: replace_scalar<float>(a, cond, b); break;
+            case f64: replace_scalar<double>(a, cond, b); break;
+            case c32: replace_scalar<cfloat>(a, cond, b); break;
+            case c64: replace_scalar<cdouble>(a, cond, b); break;
+            case s32: replace_scalar<int>(a, cond, b); break;
+            case u32: replace_scalar<uint>(a, cond, b); break;
+            case s64: replace_scalar<intl>(a, cond, b); break;
+            case u64: replace_scalar<uintl>(a, cond, b); break;
+            case s16: replace_scalar<short>(a, cond, b); break;
+            case u16: replace_scalar<ushort>(a, cond, b); break;
+            case s8: replace_scalar<schar>(a, cond, b); break;
+            case u8: replace_scalar<uchar>(a, cond, b); break;
+            case b8: replace_scalar<char>(a, cond, b); break;
+            default: TYPE_ERROR(2, ainfo.getType());
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_replace_scalar(af_array a, const af_array cond, const double b) {
+    return replaceScalar(a, cond, b);
+}
+
+af_err af_replace_scalar_long(af_array a, const af_array cond,
+                              const long long b) {
+    return replaceScalar(a, cond, b);
+}
+
+af_err af_replace_scalar_ulong(af_array a, const af_array cond,
+                               const unsigned long long b) {
+    return replaceScalar(a, cond, b);
+}
diff --git a/src/api/c/resize.cpp b/src/api/c/resize.cpp
index 4036f9c224..814d4df0c8 100644
--- a/src/api/c/resize.cpp
+++ b/src/api/c/resize.cpp
@@ -7,51 +7,74 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <resize.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/image.h>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array resize(const af_array in, const dim_t odim0, const dim_t odim1,
-                              const af_interp_type method)
-{
+static inline af_array resize(const af_array in, const dim_t odim0,
+                              const dim_t odim1, const af_interp_type method) {
     return getHandle(resize<T>(getArray<T>(in), odim0, odim1, method));
 }
 
-af_err af_resize(af_array *out, const af_array in, const dim_t odim0, const dim_t odim1,
-                 const af_interp_type method)
-{
+af_err af_resize(af_array* out, const af_array in, const dim_t odim0,
+                 const dim_t odim1, const af_interp_type method) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        ARG_ASSERT(4, method == AF_INTERP_NEAREST ||
+                          method == AF_INTERP_BILINEAR ||
+                          method == AF_INTERP_BILINEAR_COSINE ||
+                          method == AF_INTERP_BICUBIC ||
+                          method == AF_INTERP_BICUBIC_SPLINE ||
+                          method == AF_INTERP_LOWER);
 
-        ARG_ASSERT(4, (method == AF_INTERP_BILINEAR || method == AF_INTERP_NEAREST));
         DIM_ASSERT(2, odim0 > 0);
         DIM_ASSERT(3, odim1 > 0);
 
+        bool is_resize_supported =
+            (method == AF_INTERP_LOWER || method == AF_INTERP_NEAREST ||
+             method == AF_INTERP_BILINEAR);
+
+        if (!is_resize_supported) {
+            // Fall back to scale for additional methods
+            return af_scale(out, in, 0, 0, odim0, odim1, method);
+        }
+
         af_array output;
 
-        switch(type) {
-            case f32: output = resize<float  >(in, odim0, odim1, method);  break;
-            case f64: output = resize<double >(in, odim0, odim1, method);  break;
-            case c32: output = resize<cfloat >(in, odim0, odim1, method);  break;
-            case c64: output = resize<cdouble>(in, odim0, odim1, method);  break;
-            case s32: output = resize<int    >(in, odim0, odim1, method);  break;
-            case u32: output = resize<uint   >(in, odim0, odim1, method);  break;
-            case s64: output = resize<intl   >(in, odim0, odim1, method);  break;
-            case u64: output = resize<uintl  >(in, odim0, odim1, method);  break;
-            case u8:  output = resize<uchar  >(in, odim0, odim1, method);  break;
-            case b8:  output = resize<char   >(in, odim0, odim1, method);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = resize<float>(in, odim0, odim1, method); break;
+            case f64: output = resize<double>(in, odim0, odim1, method); break;
+            case c32: output = resize<cfloat>(in, odim0, odim1, method); break;
+            case c64: output = resize<cdouble>(in, odim0, odim1, method); break;
+            case s32: output = resize<int>(in, odim0, odim1, method); break;
+            case u32: output = resize<uint>(in, odim0, odim1, method); break;
+            case s64: output = resize<intl>(in, odim0, odim1, method); break;
+            case u64: output = resize<uintl>(in, odim0, odim1, method); break;
+            case s16: output = resize<short>(in, odim0, odim1, method); break;
+            case u16: output = resize<ushort>(in, odim0, odim1, method); break;
+            case s8: output = resize<schar>(in, odim0, odim1, method); break;
+            case u8: output = resize<uchar>(in, odim0, odim1, method); break;
+            case b8: output = resize<char>(in, odim0, odim1, method); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
diff --git a/src/api/c/rgb_gray.cpp b/src/api/c/rgb_gray.cpp
index 0ed5eb9583..c7abe042bc 100644
--- a/src/api/c/rgb_gray.cpp
+++ b/src/api/c/rgb_gray.cpp
@@ -7,75 +7,82 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/data.h>
 #include <af/defines.h>
 #include <af/dim4.hpp>
 #include <af/image.h>
 #include <af/index.h>
-#include <af/data.h>
 
-#include <ArrayInfo.hpp>
-#include <handle.hpp>
-#include <backend.hpp>
 #include <arith.hpp>
-#include <math.hpp>
-#include <cast.hpp>
-#include <tile.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/tile.hpp>
+#include <handle.hpp>
 #include <join.hpp>
+#include <math.hpp>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::cast;
+using detail::arithOp;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::join;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
 
 template<typename T, typename cType>
-static af_array rgb2gray(const af_array& in, const float r, const float g, const float b)
-{
+static af_array rgb2gray(const af_array& in, const float r, const float g,
+                         const float b) {
     Array<cType> input = cast<cType>(getArray<T>(in));
-    dim4 inputDims = input.dims();
+    dim4 inputDims     = input.dims();
     dim4 matDims(inputDims[0], inputDims[1], 1, inputDims[3]);
 
     Array<cType> rCnst = createValueArray<cType>(matDims, scalar<cType>(r));
     Array<cType> gCnst = createValueArray<cType>(matDims, scalar<cType>(g));
     Array<cType> bCnst = createValueArray<cType>(matDims, scalar<cType>(b));
 
+    std::vector<af_seq> slice1(4, af_span), slice2(4, af_span),
+        slice3(4, af_span);
     // extract three channels as three slices
-    af_seq slice1[4] = { af_span, af_span, {0, 0, 1}, af_span };
-    af_seq slice2[4] = { af_span, af_span, {1, 1, 1}, af_span };
-    af_seq slice3[4] = { af_span, af_span, {2, 2, 1}, af_span };
+    slice1[2] = {0, 0, 1};
+    slice2[2] = {1, 1, 1};
+    slice3[2] = {2, 2, 1};
 
-    af_array ch1Temp=0, ch2Temp=0, ch3Temp=0;
-    AF_CHECK(af_index(&ch1Temp, in, 4, slice1));
-    AF_CHECK(af_index(&ch2Temp, in, 4, slice2));
-    AF_CHECK(af_index(&ch3Temp, in, 4, slice3));
+    Array<cType> ch1Temp = createSubArray(input, slice1);
+    Array<cType> ch2Temp = createSubArray(input, slice2);
+    Array<cType> ch3Temp = createSubArray(input, slice3);
 
     // r*Slice0
-    Array<cType> expr1 = arithOp<cType, af_mul_t>(getArray<cType>(ch1Temp), rCnst, matDims);
-    //g*Slice1
-    Array<cType> expr2 = arithOp<cType, af_mul_t>(getArray<cType>(ch2Temp), gCnst, matDims);
-    //b*Slice2
-    Array<cType> expr3 = arithOp<cType, af_mul_t>(getArray<cType>(ch3Temp), bCnst, matDims);
-    //r*Slice0 + g*Slice1
+    Array<cType> expr1 = arithOp<cType, af_mul_t>(ch1Temp, rCnst, matDims);
+    // g*Slice1
+    Array<cType> expr2 = arithOp<cType, af_mul_t>(ch2Temp, gCnst, matDims);
+    // b*Slice2
+    Array<cType> expr3 = arithOp<cType, af_mul_t>(ch3Temp, bCnst, matDims);
+    // r*Slice0 + g*Slice1
     Array<cType> expr4 = arithOp<cType, af_add_t>(expr1, expr2, matDims);
-    //r*Slice0 + g*Slice1 + b*Slice2
-    Array<cType> result= arithOp<cType, af_add_t>(expr3, expr4, matDims);
-
-    AF_CHECK(af_release_array(ch1Temp));
-    AF_CHECK(af_release_array(ch2Temp));
-    AF_CHECK(af_release_array(ch3Temp));
+    // r*Slice0 + g*Slice1 + b*Slice2
+    Array<cType> result = arithOp<cType, af_add_t>(expr3, expr4, matDims);
 
     return getHandle<cType>(result);
 }
 
 template<typename T, typename cType>
-static af_array gray2rgb(const af_array& in, const float r, const float g, const float b)
-{
-    if (r==1.0 && g==1.0 && b==1.0) {
+static af_array gray2rgb(const af_array& in, const float r, const float g,
+                         const float b) {
+    if (r == 1.0 && g == 1.0 && b == 1.0) {
         dim4 tileDims(1, 1, 3, 1);
-        return getHandle(tile(getArray<T>(in), tileDims));
+        return getHandle(arrayfire::common::tile(getArray<T>(in), tileDims));
     }
 
     af_array mod_input = 0;
-    dim4 inputDims = getInfo(in).dims();
+    dim4 inputDims     = getInfo(in).dims();
 
-    dim4 matDims(inputDims[0], inputDims[1], 1, inputDims[2]*inputDims[3]);
+    dim4 matDims(inputDims[0], inputDims[1], 1, inputDims[2] * inputDims[3]);
 
     AF_CHECK(af_moddims(&mod_input, in, matDims.ndims(), matDims.get()));
     Array<cType> mod_in = cast<cType>(getArray<cType>(mod_input));
@@ -91,13 +98,15 @@ static af_array gray2rgb(const af_array& in, const float r, const float g, const
     AF_CHECK(af_release_array(mod_input));
 
     // join channels
-    Array<cType> expr4 = join<cType, cType>(2, expr1, expr2);
-    return getHandle(join<cType, cType>(2, expr3, expr4));
+    dim4 odims(expr1.dims()[0], expr1.dims()[1], 3);
+    Array<cType> out = createEmptyArray<cType>(odims);
+    join<cType>(out, 2, {expr3, expr1, expr2});
+    return getHandle(out);
 }
 
 template<typename T, typename cType, bool isRGB2GRAY>
-static af_array convert(const af_array& in, const float r, const float g, const float b)
-{
+static af_array convert(const af_array& in, const float r, const float g,
+                        const float b) {
     if (isRGB2GRAY) {
         return rgb2gray<T, cType>(in, r, g, b);
     } else {
@@ -106,23 +115,52 @@ static af_array convert(const af_array& in, const float r, const float g, const
 }
 
 template<bool isRGB2GRAY>
-af_err convert(af_array* out, const af_array in, const float r, const float g, const float b)
-{
+af_err convert(af_array* out, const af_array in, const float r, const float g,
+               const float b) {
     try {
-        ArrayInfo info     = getInfo(in);
-        af_dtype iType     = info.getType();
-        af::dim4 inputDims = info.dims();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype iType        = info.getType();
+        af::dim4 inputDims    = info.dims();
 
-        ARG_ASSERT(1, (inputDims.ndims()>=2));
-        if (isRGB2GRAY) ARG_ASSERT(1, (inputDims[2]==3));
+        // 2D input is not required; empty input returns an empty array.
+        if (info.elements() == 0) {
+            return af_create_handle(out, 0, nullptr, iType);
+        }
+
+        // RGB input must have exactly 3 channels;
+        // grayscale input must have exactly 1 channel.
+        if (isRGB2GRAY) {
+            ARG_ASSERT(1, (inputDims[2] == 3));
+        } else {
+            ARG_ASSERT(1, (inputDims[2] == 1));
+        }
 
         af_array output = 0;
-        switch(iType) {
-            case f64: output = convert<double, double, isRGB2GRAY>(in, r, g, b); break;
-            case f32: output = convert<float , float , isRGB2GRAY>(in, r, g, b); break;
-            case u32: output = convert<uint  , float , isRGB2GRAY>(in, r, g, b); break;
-            case s32: output = convert<int   , float , isRGB2GRAY>(in, r, g, b); break;
-            case u8:  output = convert<uchar , float , isRGB2GRAY>(in, r, g, b); break;
+        switch (iType) {
+            case f64:
+                output = convert<double, double, isRGB2GRAY>(in, r, g, b);
+                break;
+            case f32:
+                output = convert<float, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case u32:
+                output = convert<uint, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case s32:
+                output = convert<int, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case u16:
+                output = convert<ushort, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case s16:
+                output = convert<short, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case u8:
+                output = convert<uchar, float, isRGB2GRAY>(in, r, g, b);
+                break;
+            case s8:
+                output = convert<schar, float, isRGB2GRAY>(in, r, g, b);
+                break;
             default: TYPE_ERROR(1, iType); break;
         }
         std::swap(*out, output);
@@ -131,12 +169,12 @@ af_err convert(af_array* out, const af_array in, const float r, const float g, c
     return AF_SUCCESS;
 }
 
-af_err af_rgb2gray(af_array* out, const af_array in, const float rPercent, const float gPercent, const float bPercent)
-{
+af_err af_rgb2gray(af_array* out, const af_array in, const float rPercent,
+                   const float gPercent, const float bPercent) {
     return convert<true>(out, in, rPercent, gPercent, bPercent);
 }
 
-af_err af_gray2rgb(af_array* out, const af_array in, const float rFactor, const float gFactor, const float bFactor)
-{
+af_err af_gray2rgb(af_array* out, const af_array in, const float rFactor,
+                   const float gFactor, const float bFactor) {
     return convert<false>(out, in, rFactor, gFactor, bFactor);
 }
diff --git a/src/api/c/rotate.cpp b/src/api/c/rotate.cpp
index 13db85306a..50397a310a 100644
--- a/src/api/c/rotate.cpp
+++ b/src/api/c/rotate.cpp
@@ -7,37 +7,45 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <ArrayInfo.hpp>
 #include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <rotate.hpp>
+#include <af/image.h>
+#include <cmath>
 
 using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::cos;
+using std::fabs;
+using std::sin;
 
 template<typename T>
-static inline af_array rotate(const af_array in, const float theta, const af::dim4 &odims,
-                              const af_interp_type method)
-{
-    return getHandle(rotate<T>(getArray<T>(in), theta, odims, method));
+static inline af_array rotate(const af_array in, const float theta,
+                              const af::dim4 &odims,
+                              const af_interp_type method) {
+    return getHandle(rotate<T>(castArray<T>(in), theta, odims, method));
 }
 
-
 af_err af_rotate(af_array *out, const af_array in, const float theta,
-                 const bool crop,
-                 const af_interp_type method)
-{
+                 const bool crop, const af_interp_type method) {
     try {
-        unsigned odims0 = 0, odims1 = 0;
+        dim_t odims0 = 0, odims1 = 0;
 
-        ArrayInfo info = getInfo(in);
-        af::dim4 idims = info.dims();
+        const ArrayInfo &info = getInfo(in);
+        af::dim4 idims        = info.dims();
 
-        if(!crop) {
-            odims0 = idims[0] * fabs(std::cos(theta)) + idims[1] * fabs(std::sin(theta));
-            odims1 = idims[1] * fabs(std::cos(theta)) + idims[0] * fabs(std::sin(theta));
+        if (!crop) {
+            odims0 = idims[0] * fabs(cos(theta)) + idims[1] * fabs(sin(theta));
+            odims1 = idims[1] * fabs(cos(theta)) + idims[0] * fabs(sin(theta));
         } else {
             odims0 = idims[0];
             odims1 = idims[1];
@@ -45,27 +53,38 @@ af_err af_rotate(af_array *out, const af_array in, const float theta,
 
         af_dtype itype = info.getType();
 
-        ARG_ASSERT(3, method == AF_INTERP_NEAREST || method == AF_INTERP_BILINEAR);
+        ARG_ASSERT(4, method == AF_INTERP_NEAREST ||
+                          method == AF_INTERP_BILINEAR ||
+                          method == AF_INTERP_BILINEAR_COSINE ||
+                          method == AF_INTERP_BICUBIC ||
+                          method == AF_INTERP_BICUBIC_SPLINE ||
+                          method == AF_INTERP_LOWER);
+
+        if (idims.elements() == 0) { return af_retain_array(out, in); }
         DIM_ASSERT(1, idims.elements() > 0);
 
         af::dim4 odims(odims0, odims1, idims[2], idims[3]);
 
         af_array output = 0;
-        switch(itype) {
-            case f32: output = rotate<float  >(in, theta, odims, method);  break;
-            case f64: output = rotate<double >(in, theta, odims, method);  break;
-            case c32: output = rotate<cfloat >(in, theta, odims, method);  break;
-            case c64: output = rotate<cdouble>(in, theta, odims, method);  break;
-            case s32: output = rotate<int    >(in, theta, odims, method);  break;
-            case u32: output = rotate<uint   >(in, theta, odims, method);  break;
-            case s64: output = rotate<intl   >(in, theta, odims, method);  break;
-            case u64: output = rotate<uintl  >(in, theta, odims, method);  break;
-            case u8:  output = rotate<uchar  >(in, theta, odims, method);  break;
-            case b8:  output = rotate<uchar  >(in, theta, odims, method);  break;
-            default:  TYPE_ERROR(1, itype);
+        switch (itype) {
+            case f32: output = rotate<float>(in, theta, odims, method); break;
+            case f64: output = rotate<double>(in, theta, odims, method); break;
+            case c32: output = rotate<cfloat>(in, theta, odims, method); break;
+            case c64: output = rotate<cdouble>(in, theta, odims, method); break;
+            case s32: output = rotate<int>(in, theta, odims, method); break;
+            case u32: output = rotate<uint>(in, theta, odims, method); break;
+            case s64: output = rotate<intl>(in, theta, odims, method); break;
+            case u64: output = rotate<uintl>(in, theta, odims, method); break;
+            case s16: output = rotate<short>(in, theta, odims, method); break;
+            case u16: output = rotate<ushort>(in, theta, odims, method); break;
+            case s8: output = rotate<schar>(in, theta, odims, method); break;
+            case u8:
+            case b8: output = rotate<uchar>(in, theta, odims, method); break;
+            default: TYPE_ERROR(1, itype);
         }
-        std::swap(*out,output);
-    } CATCHALL
+        std::swap(*out, output);
+    }
+    CATCHALL
 
     return AF_SUCCESS;
 }
diff --git a/src/api/c/sat.cpp b/src/api/c/sat.cpp
new file mode 100644
index 0000000000..8715f4865c
--- /dev/null
+++ b/src/api/c/sat.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <imgproc_common.hpp>
+#include <af/defines.h>
+#include <af/image.h>
+
+using af::dim4;
+using arrayfire::common::integralImage;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename To, typename Ti>
+inline af_array sat(const af_array& in) {
+    return getHandle<To>(integralImage<To, Ti>(getArray<Ti>(in)));
+}
+
+af_err af_sat(af_array* out, const af_array in) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        const dim4& dims      = info.dims();
+
+        ARG_ASSERT(1, (dims.ndims() >= 2));
+
+        af_dtype inputType = info.getType();
+
+        af_array output = 0;
+        switch (inputType) {
+            case f64: output = sat<double, double>(in); break;
+            case f32: output = sat<float, float>(in); break;
+            case s32: output = sat<int, int>(in); break;
+            case u32: output = sat<uint, uint>(in); break;
+            case b8: output = sat<int, char>(in); break;
+            case s8: output = sat<int, schar>(in); break;
+            case u8: output = sat<uint, uchar>(in); break;
+            case s64: output = sat<intl, intl>(in); break;
+            case u64: output = sat<uintl, uintl>(in); break;
+            case s16: output = sat<int, short>(in); break;
+            case u16: output = sat<uint, ushort>(in); break;
+            default: TYPE_ERROR(1, inputType);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/scan.cpp b/src/api/c/scan.cpp
index 34f4c897b9..cac89d6c01 100644
--- a/src/api/c/scan.cpp
+++ b/src/api/c/scan.cpp
@@ -7,36 +7,122 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/defines.h>
-#include <af/dim4.hpp>
-#include <af/algorithm.h>
-#include <err_common.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
 #include <handle.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 #include <scan.hpp>
-#include <backend.hpp>
+#include <scan_by_key.hpp>
+#include <af/algorithm.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <complex>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<af_op_t op, typename Ti, typename To>
+static inline af_array scan(const af_array in, const int dim,
+                            bool inclusive_scan = true) {
+    return getHandle(scan<op, Ti, To>(getArray<Ti>(in), dim, inclusive_scan));
+}
 
 template<af_op_t op, typename Ti, typename To>
-static inline af_array scan(const af_array in, const int dim)
-{
-    return getHandle(scan<op,Ti,To>(getArray<Ti>(in), dim));
+static inline af_array scan_key(const af_array key, const af_array in,
+                                const int dim, bool inclusive_scan = true) {
+    const ArrayInfo& key_info = getInfo(key);
+    af_dtype type             = key_info.getType();
+    af_array out;
+
+    switch (type) {
+        case s32:
+            out = getHandle(scan<op, Ti, int, To>(
+                getArray<int>(key), castArray<Ti>(in), dim, inclusive_scan));
+            break;
+        case u32:
+            out = getHandle(scan<op, Ti, uint, To>(
+                getArray<uint>(key), castArray<Ti>(in), dim, inclusive_scan));
+            break;
+        case s64:
+            out = getHandle(scan<op, Ti, intl, To>(
+                getArray<intl>(key), castArray<Ti>(in), dim, inclusive_scan));
+            break;
+        case u64:
+            out = getHandle(scan<op, Ti, uintl, To>(
+                getArray<uintl>(key), castArray<Ti>(in), dim, inclusive_scan));
+            break;
+        default: TYPE_ERROR(1, type);
+    }
+    return out;
+}
+
+template<typename Ti, typename To>
+static inline af_array scan_op(const af_array key, const af_array in,
+                               const int dim, af_binary_op op,
+                               bool inclusive_scan = true) {
+    af_array out;
+
+    switch (op) {
+        case AF_BINARY_ADD:
+            out = scan_key<af_add_t, Ti, To>(key, in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MUL:
+            out = scan_key<af_mul_t, Ti, To>(key, in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MIN:
+            out = scan_key<af_min_t, Ti, To>(key, in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MAX:
+            out = scan_key<af_max_t, Ti, To>(key, in, dim, inclusive_scan);
+            break;
+        default:
+            AF_ERROR("Incorrect binary operation enum for argument number 3",
+                     AF_ERR_ARG);
+            break;
+    }
+    return out;
 }
 
+template<typename Ti, typename To>
+static inline af_array scan_op(const af_array in, const int dim,
+                               af_binary_op op, bool inclusive_scan) {
+    af_array out;
 
-af_err af_accum(af_array *out, const af_array in, const int dim)
-{
-    ARG_ASSERT(2, dim >= 0);
-    ARG_ASSERT(2, dim <  4);
+    switch (op) {
+        case AF_BINARY_ADD:
+            out = scan<af_add_t, Ti, To>(in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MUL:
+            out = scan<af_mul_t, Ti, To>(in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MIN:
+            out = scan<af_min_t, Ti, To>(in, dim, inclusive_scan);
+            break;
+        case AF_BINARY_MAX:
+            out = scan<af_max_t, Ti, To>(in, dim, inclusive_scan);
+            break;
+        default:
+            AF_ERROR("Incorrect binary operation enum for argument number 2",
+                     AF_ERR_ARG);
+            break;
+    }
+    return out;
+}
 
+af_err af_accum(af_array* out, const af_array in, const int dim) {
     try {
+        ARG_ASSERT(2, dim >= 0);
+        ARG_ASSERT(2, dim < 4);
 
         const ArrayInfo& in_info = getInfo(in);
 
-        if (dim >= (int)in_info.ndims()) {
+        if (dim >= static_cast<int>(in_info.ndims())) {
             *out = retain(in);
             return AF_SUCCESS;
         }
@@ -44,18 +130,149 @@ af_err af_accum(af_array *out, const af_array in, const int dim)
         af_dtype type = in_info.getType();
         af_array res;
 
-        switch(type) {
-        case f32:  res = scan<af_add_t, float  , float  >(in, dim); break;
-        case f64:  res = scan<af_add_t, double , double >(in, dim); break;
-        case c32:  res = scan<af_add_t, cfloat , cfloat >(in, dim); break;
-        case c64:  res = scan<af_add_t, cdouble, cdouble>(in, dim); break;
-        case u32:  res = scan<af_add_t, uint   , uint   >(in, dim); break;
-        case s32:  res = scan<af_add_t, int    , int    >(in, dim); break;
-        case u8:   res = scan<af_add_t, uchar  , uint   >(in, dim); break;
-        // Make sure you are adding only "1" for every non zero value, even if op == af_add_t
-        case b8:   res = scan<af_notzero_t, char  , uint   >(in, dim); break;
-        default:
-            TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: res = scan<af_add_t, float, float>(in, dim); break;
+            case f64: res = scan<af_add_t, double, double>(in, dim); break;
+            case c32: res = scan<af_add_t, cfloat, cfloat>(in, dim); break;
+            case c64: res = scan<af_add_t, cdouble, cdouble>(in, dim); break;
+            case u32: res = scan<af_add_t, uint, uint>(in, dim); break;
+            case s32: res = scan<af_add_t, int, int>(in, dim); break;
+            case u64: res = scan<af_add_t, uintl, uintl>(in, dim); break;
+            case s64: res = scan<af_add_t, intl, intl>(in, dim); break;
+            case u16: res = scan<af_add_t, ushort, uint>(in, dim); break;
+            case s16: res = scan<af_add_t, short, int>(in, dim); break;
+            case u8: res = scan<af_add_t, uchar, uint>(in, dim); break;
+            case s8: res = scan<af_add_t, schar, int>(in, dim); break;
+            // b8 input: count each nonzero value as exactly 1 (af_notzero_t),
+            // even though the operation is nominally af_add_t.
+            case b8: res = scan<af_notzero_t, char, uint>(in, dim); break;
+            default: TYPE_ERROR(1, type);
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_scan(af_array* out, const af_array in, const int dim, af_binary_op op,
+               bool inclusive_scan) {
+    try {
+        ARG_ASSERT(2, dim >= 0);
+        ARG_ASSERT(2, dim < 4);
+
+        const ArrayInfo& in_info = getInfo(in);
+
+        if (dim >= static_cast<int>(in_info.ndims())) {
+            *out = retain(in);
+            return AF_SUCCESS;
+        }
+
+        af_dtype type = in_info.getType();
+        af_array res;
+
+        switch (type) {
+            case f32:
+                res = scan_op<float, float>(in, dim, op, inclusive_scan);
+                break;
+            case f64:
+                res = scan_op<double, double>(in, dim, op, inclusive_scan);
+                break;
+            case c32:
+                res = scan_op<cfloat, cfloat>(in, dim, op, inclusive_scan);
+                break;
+            case c64:
+                res = scan_op<cdouble, cdouble>(in, dim, op, inclusive_scan);
+                break;
+            case u32:
+                res = scan_op<uint, uint>(in, dim, op, inclusive_scan);
+                break;
+            case s32:
+                res = scan_op<int, int>(in, dim, op, inclusive_scan);
+                break;
+            case u64:
+                res = scan_op<uintl, uintl>(in, dim, op, inclusive_scan);
+                break;
+            case s64:
+                res = scan_op<intl, intl>(in, dim, op, inclusive_scan);
+                break;
+            case u16:
+                res = scan_op<ushort, uint>(in, dim, op, inclusive_scan);
+                break;
+            case s16:
+                res = scan_op<short, int>(in, dim, op, inclusive_scan);
+                break;
+            case u8:
+                res = scan_op<uchar, uint>(in, dim, op, inclusive_scan);
+                break;
+            case s8:
+                res = scan_op<schar, int>(in, dim, op, inclusive_scan);
+                break;
+            case b8:
+                res = scan_op<char, uint>(in, dim, op, inclusive_scan);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_scan_by_key(af_array* out, const af_array key, const af_array in,
+                      const int dim, af_binary_op op, bool inclusive_scan) {
+    try {
+        ARG_ASSERT(2, dim >= 0);
+        ARG_ASSERT(2, dim < 4);
+
+        const ArrayInfo& in_info  = getInfo(in);
+        const ArrayInfo& key_info = getInfo(key);
+
+        if (dim >= static_cast<int>(in_info.ndims())) {
+            *out = retain(in);
+            return AF_SUCCESS;
+        }
+
+        ARG_ASSERT(2, in_info.dims() == key_info.dims());
+
+        af_dtype type = in_info.getType();
+        af_array res;
+
+        switch (type) {
+            case f32:
+                res = scan_op<float, float>(key, in, dim, op, inclusive_scan);
+                break;
+            case f64:
+                res = scan_op<double, double>(key, in, dim, op, inclusive_scan);
+                break;
+            case c32:
+                res = scan_op<cfloat, cfloat>(key, in, dim, op, inclusive_scan);
+                break;
+            case c64:
+                res =
+                    scan_op<cdouble, cdouble>(key, in, dim, op, inclusive_scan);
+                break;
+            case s16:
+            case s32:
+            case s8:
+                res = scan_op<int, int>(key, in, dim, op, inclusive_scan);
+                break;
+            case u64:
+                res = scan_op<uintl, uintl>(key, in, dim, op, inclusive_scan);
+                break;
+            case s64:
+                res = scan_op<intl, intl>(key, in, dim, op, inclusive_scan);
+                break;
+            case u16:
+            case u32:
+            case u8:
+            case b8:
+                res = scan_op<uint, uint>(key, in, dim, op, inclusive_scan);
+                break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, res);
diff --git a/src/api/c/select.cpp b/src/api/c/select.cpp
new file mode 100644
index 0000000000..c161aa5e9b
--- /dev/null
+++ b/src/api/c/select.cpp
@@ -0,0 +1,217 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <implicit.hpp>
+#include <optypes.hpp>
+#include <select.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
+
+using af::dim4;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createSelectNode;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T>
+af_array select(const af_array cond, const af_array a, const af_array b,
+                const dim4& odims) {
+    Array<T> out = createSelectNode(getArray<char>(cond), getArray<T>(a),
+                                    getArray<T>(b), odims);
+    return getHandle<T>(out);
+}
+
+af_err af_select(af_array* out, const af_array cond, const af_array a,
+                 const af_array b) {
+    try {
+        const ArrayInfo& ainfo     = getInfo(a);
+        const ArrayInfo& binfo     = getInfo(b);
+        const ArrayInfo& cond_info = getInfo(cond);
+
+        if (cond_info.ndims() == 0) { return af_retain_array(out, cond); }
+
+        ARG_ASSERT(2, ainfo.getType() == binfo.getType());
+        ARG_ASSERT(1, cond_info.getType() == b8);
+
+        dim4 adims     = ainfo.dims();
+        dim4 bdims     = binfo.dims();
+        dim4 cond_dims = cond_info.dims();
+        dim4 odims(1, 1, 1, 1);
+
+        for (int i = 0; i < 4; i++) {
+            DIM_ASSERT(2, (adims[i] == bdims[i] && adims[i] == cond_dims[i]) ||
+                              adims[i] == 1 || bdims[i] == 1 ||
+                              cond_dims[i] == 1);
+            odims[i] = std::max(std::max(adims[i], bdims[i]), cond_dims[i]);
+        }
+
+        af_array res;
+
+        switch (ainfo.getType()) {
+            case f32: res = select<float>(cond, a, b, odims); break;
+            case f64: res = select<double>(cond, a, b, odims); break;
+            case c32: res = select<cfloat>(cond, a, b, odims); break;
+            case c64: res = select<cdouble>(cond, a, b, odims); break;
+            case s32: res = select<int>(cond, a, b, odims); break;
+            case u32: res = select<uint>(cond, a, b, odims); break;
+            case s64: res = select<intl>(cond, a, b, odims); break;
+            case u64: res = select<uintl>(cond, a, b, odims); break;
+            case s16: res = select<short>(cond, a, b, odims); break;
+            case u16: res = select<ushort>(cond, a, b, odims); break;
+            case s8: res = select<schar>(cond, a, b, odims); break;
+            case u8: res = select<uchar>(cond, a, b, odims); break;
+            case b8: res = select<char>(cond, a, b, odims); break;
+            case f16: res = select<half>(cond, a, b, odims); break;
+            default: TYPE_ERROR(2, ainfo.getType());
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<typename ArrayType, typename ScalarType, bool flip>
+af_array select_scalar(const af_array cond, const af_array a,
+                       const ScalarType b, const dim4& odims) {
+    auto scalar = detail::scalar<ArrayType>(b);
+    auto out    = createSelectNode<ArrayType, flip>(
+        getArray<char>(cond), getArray<ArrayType>(a), scalar, odims);
+    return getHandle(out);
+}
+
+template<typename ScalarType, bool IsScalarTrueOutput>
+af_err selectScalar(af_array* out, const af_array cond, const af_array e,
+                    const ScalarType c) {
+    try {
+        const ArrayInfo& einfo = getInfo(e);
+        const ArrayInfo& cinfo = getInfo(cond);
+
+        ARG_ASSERT(1, cinfo.getType() == b8);
+
+        dim4 edims     = einfo.dims();
+        dim4 cond_dims = cinfo.dims();
+        dim4 odims(1);
+
+        for (int i = 0; i < 4; i++) {
+            DIM_ASSERT(1, cond_dims[i] == edims[i] || cond_dims[i] == 1 ||
+                              edims[i] == 1);
+            odims[i] = std::max(cond_dims[i], edims[i]);
+        }
+
+        af_array res;
+
+        switch (einfo.getType()) {
+            case f16:
+                res = select_scalar<half, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case f32:
+                res = select_scalar<float, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case f64:
+                res = select_scalar<double, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case c32:
+                res = select_scalar<cfloat, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case c64:
+                res = select_scalar<cdouble, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case s32:
+                res = select_scalar<int, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case u32:
+                res = select_scalar<uint, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case s16:
+                res = select_scalar<short, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case u16:
+                res = select_scalar<ushort, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case s64:
+                res = select_scalar<intl, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case u64:
+                res = select_scalar<uintl, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case s8:
+                res = select_scalar<schar, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case u8:
+                res = select_scalar<uchar, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            case b8:
+                res = select_scalar<char, ScalarType, IsScalarTrueOutput>(
+                    cond, e, c, odims);
+                break;
+            default: TYPE_ERROR((IsScalarTrueOutput ? 3 : 2), einfo.getType());
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_select_scalar_r(af_array* out, const af_array cond, const af_array a,
+                          const double b) {
+    return selectScalar<double, false>(out, cond, a, b);
+}
+
+af_err af_select_scalar_r_long(af_array* out, const af_array cond,
+                               const af_array a, const long long b) {
+    return selectScalar<long long, false>(out, cond, a, b);
+}
+
+af_err af_select_scalar_r_ulong(af_array* out, const af_array cond,
+                                const af_array a, const unsigned long long b) {
+    return selectScalar<unsigned long long, false>(out, cond, a, b);
+}
+
+af_err af_select_scalar_l(af_array* out, const af_array cond, const double a,
+                          const af_array b) {
+    return selectScalar<double, true>(out, cond, b, a);
+}
+
+af_err af_select_scalar_l_long(af_array* out, const af_array cond,
+                               const long long a, const af_array b) {
+    return selectScalar<long long, true>(out, cond, b, a);
+}
+
+af_err af_select_scalar_l_ulong(af_array* out, const af_array cond,
+                                const unsigned long long a, const af_array b) {
+    return selectScalar<unsigned long long, true>(out, cond, b, a);
+}
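Note on the dimension check in af_select above: the loop derives the output shape by broadcasting cond, a and b axis by axis. The following is a minimal standalone sketch of that rule, not the library code. `Dims` and `selectOutputDims` are hypothetical names introduced here for illustration, and a plain `bool` return stands in for DIM_ASSERT.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical stand-in for af::dim4: four extents, one per axis.
using Dims = std::array<long long, 4>;

// Writes the broadcast output shape to `out` and returns true, or returns
// false when an axis is rejected. An axis is accepted when all three extents
// match, or when at least one of them is 1 (same condition as the DIM_ASSERT
// in af_select).
bool selectOutputDims(const Dims& cond, const Dims& a, const Dims& b,
                      Dims& out) {
    out = Dims{{1, 1, 1, 1}};
    for (int i = 0; i < 4; i++) {
        const bool all_equal = a[i] == b[i] && a[i] == cond[i];
        const bool any_one   = a[i] == 1 || b[i] == 1 || cond[i] == 1;
        if (!all_equal && !any_one) { return false; }
        out[i] = std::max({a[i], b[i], cond[i]});
    }
    return true;
}
```

As written, the condition is permissive: an axis where a is 3, b is 4 and cond is 1 passes, because one extent equals 1, and the output extent becomes 4. Whether the backend node handles that combination is not visible in this diff.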
diff --git a/src/api/c/set.cpp b/src/api/c/set.cpp
index 1200eaef32..3353d7c5ee 100644
--- a/src/api/c/set.cpp
+++ b/src/api/c/set.cpp
@@ -7,107 +7,166 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/defines.h>
-#include <af/algorithm.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <set.hpp>
+#include <af/algorithm.h>
+#include <af/defines.h>
+#include <complex>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array setUnique(const af_array in, const bool is_sorted)
-{
+static inline af_array setUnique(const af_array in, const bool is_sorted) {
     return getHandle(setUnique(getArray<T>(in), is_sorted));
 }
 
-af_err af_set_unique(af_array *out, const af_array in, const bool is_sorted)
-{
+af_err af_set_unique(af_array* out, const af_array in, const bool is_sorted) {
     try {
+        const ArrayInfo& in_info = getInfo(in);
+
+        if (in_info.isEmpty() || in_info.isScalar()) {
+            return af_retain_array(out, in);
+        }
 
-        af_dtype type = getInfo(in).getType();
+        ARG_ASSERT(1, in_info.isVector());
+
+        af_dtype type = in_info.getType();
 
         af_array res;
-        switch(type) {
-        case f32: res = setUnique<float  >(in, is_sorted); break;
-        case f64: res = setUnique<double >(in, is_sorted); break;
-        case s32: res = setUnique<int    >(in, is_sorted); break;
-        case u32: res = setUnique<uint   >(in, is_sorted); break;
-        case b8:  res = setUnique<char   >(in, is_sorted); break;
-        case u8:  res = setUnique<uchar  >(in, is_sorted); break;
-        default: TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: res = setUnique<float>(in, is_sorted); break;
+            case f64: res = setUnique<double>(in, is_sorted); break;
+            case s32: res = setUnique<int>(in, is_sorted); break;
+            case u32: res = setUnique<uint>(in, is_sorted); break;
+            case s16: res = setUnique<short>(in, is_sorted); break;
+            case u16: res = setUnique<ushort>(in, is_sorted); break;
+            case s64: res = setUnique<intl>(in, is_sorted); break;
+            case u64: res = setUnique<uintl>(in, is_sorted); break;
+            case b8: res = setUnique<char>(in, is_sorted); break;
+            case s8: res = setUnique<schar>(in, is_sorted); break;
+            case u8: res = setUnique<uchar>(in, is_sorted); break;
+            default: TYPE_ERROR(1, type);
         }
 
         std::swap(*out, res);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-
 template<typename T>
-static inline af_array setUnion(const af_array first, const af_array second, const bool is_unique)
-{
-    return getHandle(setUnion(getArray<T>(first), getArray<T>(second), is_unique));
+static inline af_array setUnion(const af_array first, const af_array second,
+                                const bool is_unique) {
+    return getHandle(
+        setUnion(getArray<T>(first), getArray<T>(second), is_unique));
 }
 
-af_err af_set_union(af_array *out, const af_array first, const af_array second, const bool is_unique)
-{
+af_err af_set_union(af_array* out, const af_array first, const af_array second,
+                    const bool is_unique) {
     try {
+        const ArrayInfo& first_info  = getInfo(first);
+        const ArrayInfo& second_info = getInfo(second);
+
+        af_array res;
+        if (first_info.isEmpty()) { return af_retain_array(out, second); }
+
+        if (second_info.isEmpty()) { return af_retain_array(out, first); }
 
-        af_dtype first_type = getInfo(first).getType();
-        af_dtype second_type = getInfo(second).getType();
+        ARG_ASSERT(1, (first_info.isVector() || first_info.isScalar()));
+        ARG_ASSERT(2, (second_info.isVector() || second_info.isScalar()));
+
+        af_dtype first_type  = first_info.getType();
+        af_dtype second_type = second_info.getType();
 
         ARG_ASSERT(1, first_type == second_type);
 
-        af_array res;
-        switch(first_type) {
-        case f32: res = setUnion<float  >(first, second, is_unique); break;
-        case f64: res = setUnion<double >(first, second, is_unique); break;
-        case s32: res = setUnion<int    >(first, second, is_unique); break;
-        case u32: res = setUnion<uint   >(first, second, is_unique); break;
-        case b8:  res = setUnion<char   >(first, second, is_unique); break;
-        case u8:  res = setUnion<uchar  >(first, second, is_unique); break;
-        default: TYPE_ERROR(1, first_type);
+        switch (first_type) {
+            case f32: res = setUnion<float>(first, second, is_unique); break;
+            case f64: res = setUnion<double>(first, second, is_unique); break;
+            case s32: res = setUnion<int>(first, second, is_unique); break;
+            case u32: res = setUnion<uint>(first, second, is_unique); break;
+            case s16: res = setUnion<short>(first, second, is_unique); break;
+            case u16: res = setUnion<ushort>(first, second, is_unique); break;
+            case s64: res = setUnion<intl>(first, second, is_unique); break;
+            case u64: res = setUnion<uintl>(first, second, is_unique); break;
+            case b8: res = setUnion<char>(first, second, is_unique); break;
+            case s8: res = setUnion<schar>(first, second, is_unique); break;
+            case u8: res = setUnion<uchar>(first, second, is_unique); break;
+            default: TYPE_ERROR(1, first_type);
         }
 
         std::swap(*out, res);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
 template<typename T>
-static inline af_array setIntersect(const af_array first, const af_array second, const bool is_unique)
-{
-    return getHandle(setIntersect(getArray<T>(first), getArray<T>(second), is_unique));
+static inline af_array setIntersect(const af_array first, const af_array second,
+                                    const bool is_unique) {
+    return getHandle(
+        setIntersect(getArray<T>(first), getArray<T>(second), is_unique));
 }
 
-af_err af_set_intersect(af_array *out, const af_array first, const af_array second, const bool is_unique)
-{
+af_err af_set_intersect(af_array* out, const af_array first,
+                        const af_array second, const bool is_unique) {
     try {
+        const ArrayInfo& first_info  = getInfo(first);
+        const ArrayInfo& second_info = getInfo(second);
+
+        // TODO(umar): fix for set intersect from union
+        if (first_info.isEmpty()) { return af_retain_array(out, first); }
+
+        if (second_info.isEmpty()) { return af_retain_array(out, second); }
+
+        ARG_ASSERT(1, (first_info.isVector() || first_info.isScalar()));
+        ARG_ASSERT(2, (second_info.isVector() || second_info.isScalar()));
 
-        af_dtype first_type = getInfo(first).getType();
-        af_dtype second_type = getInfo(second).getType();
+        af_dtype first_type  = first_info.getType();
+        af_dtype second_type = second_info.getType();
 
         ARG_ASSERT(1, first_type == second_type);
 
         af_array res;
-        switch(first_type) {
-        case f32: res = setIntersect<float  >(first, second, is_unique); break;
-        case f64: res = setIntersect<double >(first, second, is_unique); break;
-        case s32: res = setIntersect<int    >(first, second, is_unique); break;
-        case u32: res = setIntersect<uint   >(first, second, is_unique); break;
-        case b8:  res = setIntersect<char   >(first, second, is_unique); break;
-        case u8:  res = setIntersect<uchar  >(first, second, is_unique); break;
-        default: TYPE_ERROR(1, first_type);
+        switch (first_type) {
+            case f32:
+                res = setIntersect<float>(first, second, is_unique);
+                break;
+            case f64:
+                res = setIntersect<double>(first, second, is_unique);
+                break;
+            case s32: res = setIntersect<int>(first, second, is_unique); break;
+            case u32: res = setIntersect<uint>(first, second, is_unique); break;
+            case s16:
+                res = setIntersect<short>(first, second, is_unique);
+                break;
+            case u16:
+                res = setIntersect<ushort>(first, second, is_unique);
+                break;
+            case s64: res = setIntersect<intl>(first, second, is_unique); break;
+            case u64:
+                res = setIntersect<uintl>(first, second, is_unique);
+                break;
+            case b8: res = setIntersect<char>(first, second, is_unique); break;
+            case s8: res = setIntersect<schar>(first, second, is_unique); break;
+            case u8: res = setIntersect<uchar>(first, second, is_unique); break;
+            default: TYPE_ERROR(1, first_type);
         }
 
         std::swap(*out, res);
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
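Note on the empty-input handling added to af_set_union and af_set_intersect above: both return early before any type dispatch. The sketch below models that control flow on `std::vector<int>` with standard algorithms. It is an illustration under assumptions, not the backend implementation: `Vec`, `uniqueSorted`, `unionOf` and `intersectOf` are hypothetical names, and `is_unique` is assumed to mean "inputs are already sorted and free of duplicates".

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

using Vec = std::vector<int>;

// Sort and drop duplicates; assumed to play the role of setUnique.
Vec uniqueSorted(Vec v) {
    std::sort(v.begin(), v.end());
    v.erase(std::unique(v.begin(), v.end()), v.end());
    return v;
}

Vec unionOf(const Vec& first, const Vec& second, bool is_unique) {
    // Mirrors the early returns: an empty input yields the other input as is.
    if (first.empty()) { return second; }
    if (second.empty()) { return first; }
    const Vec a = is_unique ? first : uniqueSorted(first);
    const Vec b = is_unique ? second : uniqueSorted(second);
    Vec out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

Vec intersectOf(const Vec& first, const Vec& second, bool is_unique) {
    // Mirrors the early returns: an empty input yields that empty input.
    if (first.empty()) { return first; }
    if (second.empty()) { return second; }
    const Vec a = is_unique ? first : uniqueSorted(first);
    const Vec b = is_unique ? second : uniqueSorted(second);
    Vec out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}
```

One consequence of the union early return, in the sketch as in the diff: when one input is empty, the other is handed back untouched, so with `is_unique` false any duplicates it contains survive. Whether the backend path would have removed them is not visible here.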
diff --git a/src/api/c/shift.cpp b/src/api/c/shift.cpp
index e5c2b836bb..cf195d2026 100644
--- a/src/api/c/shift.cpp
+++ b/src/api/c/shift.cpp
@@ -7,53 +7,62 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/data.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <shift.hpp>
+#include <af/data.h>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array shift(const af_array in, const int sdims[4])
-{
+static inline af_array shift(const af_array in, const int sdims[4]) {
     return getHandle(shift<T>(getArray<T>(in), sdims));
 }
 
-af_err af_shift(af_array *out, const af_array in, const int sdims[4])
-{
+af_err af_shift(af_array *out, const af_array in, const int sdims[4]) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
         DIM_ASSERT(1, info.elements() > 0);
 
         af_array output;
 
-        switch(type) {
-            case f32: output = shift<float  >(in, sdims);  break;
-            case c32: output = shift<cfloat >(in, sdims);  break;
-            case f64: output = shift<double >(in, sdims);  break;
-            case c64: output = shift<cdouble>(in, sdims);  break;
-            case b8:  output = shift<char   >(in, sdims);  break;
-            case s32: output = shift<int    >(in, sdims);  break;
-            case u32: output = shift<uint   >(in, sdims);  break;
-            case u8:  output = shift<uchar  >(in, sdims);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = shift<float>(in, sdims); break;
+            case c32: output = shift<cfloat>(in, sdims); break;
+            case f64: output = shift<double>(in, sdims); break;
+            case c64: output = shift<cdouble>(in, sdims); break;
+            case b8: output = shift<char>(in, sdims); break;
+            case s32: output = shift<int>(in, sdims); break;
+            case u32: output = shift<uint>(in, sdims); break;
+            case s64: output = shift<intl>(in, sdims); break;
+            case u64: output = shift<uintl>(in, sdims); break;
+            case s16: output = shift<short>(in, sdims); break;
+            case u16: output = shift<ushort>(in, sdims); break;
+            case s8: output = shift<schar>(in, sdims); break;
+            case u8: output = shift<uchar>(in, sdims); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_shift(af_array *out, const af_array in,
-                const int x, const int y, const int z, const int w)
-{
+af_err af_shift(af_array *out, const af_array in, const int x, const int y,
+                const int z, const int w) {
     const int sdims[] = {x, y, z, w};
     return af_shift(out, in, sdims);
 }
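Note on af_shift above: the function performs a circular shift by `sdims[i]` elements along each axis and now returns the input unchanged when it has no dimensions. A minimal one-dimensional sketch of a circular shift follows, assuming a positive offset moves elements toward higher indices. `shift1d` is a hypothetical helper for illustration, not part of the library.

```cpp
#include <cassert>
#include <vector>

// Circularly shifts `in` by `s` positions; negative `s` shifts the other way.
std::vector<int> shift1d(const std::vector<int>& in, int s) {
    const int n = static_cast<int>(in.size());
    if (n == 0) { return in; }  // nothing to shift, like the ndims() == 0 case
    // Normalise the offset into [0, n) so negative shifts wrap correctly.
    const int off = ((s % n) + n) % n;
    std::vector<int> out(in.size());
    for (int i = 0; i < n; i++) { out[(i + off) % n] = in[i]; }
    return out;
}
```

For example, shifting {1, 2, 3, 4, 5} by 1 gives {5, 1, 2, 3, 4}, and shifting by -1 gives {2, 3, 4, 5, 1}.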
diff --git a/src/api/c/sift.cpp b/src/api/c/sift.cpp
new file mode 100644
index 0000000000..b615025f80
--- /dev/null
+++ b/src/api/c/sift.cpp
@@ -0,0 +1,138 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <features.hpp>
+#include <handle.hpp>
+#include <sift.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/features.h>
+#include <af/vision.h>
+
+using af::dim4;
+using detail::Array;
+using detail::createEmptyArray;
+
+template<typename T, typename convAccT>
+static void sift(af_features& feat_, af_array& descriptors, const af_array& in,
+                 const unsigned n_layers, const float contrast_thr,
+                 const float edge_thr, const float init_sigma,
+                 const bool double_input, const float img_scale,
+                 const float feature_ratio, const bool compute_GLOH) {
+    Array<float> x     = createEmptyArray<float>(dim4());
+    Array<float> y     = createEmptyArray<float>(dim4());
+    Array<float> score = createEmptyArray<float>(dim4());
+    Array<float> ori   = createEmptyArray<float>(dim4());
+    Array<float> size  = createEmptyArray<float>(dim4());
+    Array<float> desc  = createEmptyArray<float>(dim4());
+
+    af_features_t feat;
+
+    feat.n =
+        sift<T, convAccT>(x, y, score, ori, size, desc, getArray<T>(in),
+                          n_layers, contrast_thr, edge_thr, init_sigma,
+                          double_input, img_scale, feature_ratio, compute_GLOH);
+
+    feat.x           = getHandle(x);
+    feat.y           = getHandle(y);
+    feat.score       = getHandle(score);
+    feat.orientation = getHandle(ori);
+    feat.size        = getHandle(size);
+
+    feat_       = getFeaturesHandle(feat);
+    descriptors = getHandle<float>(desc);
+}
+
+af_err af_sift(af_features* feat, af_array* desc, const af_array in,
+               const unsigned n_layers, const float contrast_thr,
+               const float edge_thr, const float init_sigma,
+               const bool double_input, const float img_scale,
+               const float feature_ratio) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        ARG_ASSERT(2, (dims[0] >= 15 && dims[1] >= 15 && dims[2] == 1 &&
+                       dims[3] == 1));
+        ARG_ASSERT(3, n_layers > 0);
+        ARG_ASSERT(4, contrast_thr > 0.0f);
+        ARG_ASSERT(5, edge_thr >= 1.0f);
+        ARG_ASSERT(6, init_sigma > 0.5f);
+        ARG_ASSERT(8, img_scale > 0.0f);
+        ARG_ASSERT(9, feature_ratio > 0.0f);
+
+        dim_t in_ndims = dims.ndims();
+        DIM_ASSERT(1, (in_ndims <= 3 && in_ndims >= 2));
+
+        af_array tmp_desc;
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                sift<float, float>(*feat, tmp_desc, in, n_layers, contrast_thr,
+                                   edge_thr, init_sigma, double_input,
+                                   img_scale, feature_ratio, false);
+                break;
+            case f64:
+                sift<double, double>(
+                    *feat, tmp_desc, in, n_layers, contrast_thr, edge_thr,
+                    init_sigma, double_input, img_scale, feature_ratio, false);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*desc, tmp_desc);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_gloh(af_features* feat, af_array* desc, const af_array in,
+               const unsigned n_layers, const float contrast_thr,
+               const float edge_thr, const float init_sigma,
+               const bool double_input, const float img_scale,
+               const float feature_ratio) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        ARG_ASSERT(2, (dims[0] >= 15 && dims[1] >= 15 && dims[2] == 1 &&
+                       dims[3] == 1));
+        ARG_ASSERT(3, n_layers > 0);
+        ARG_ASSERT(4, contrast_thr > 0.0f);
+        ARG_ASSERT(5, edge_thr >= 1.0f);
+        ARG_ASSERT(6, init_sigma > 0.5f);
+        ARG_ASSERT(8, img_scale > 0.0f);
+        ARG_ASSERT(9, feature_ratio > 0.0f);
+
+        dim_t in_ndims = dims.ndims();
+        DIM_ASSERT(1, (in_ndims == 2));
+
+        af_array tmp_desc;
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                sift<float, float>(*feat, tmp_desc, in, n_layers, contrast_thr,
+                                   edge_thr, init_sigma, double_input,
+                                   img_scale, feature_ratio, true);
+                break;
+            case f64:
+                sift<double, double>(
+                    *feat, tmp_desc, in, n_layers, contrast_thr, edge_thr,
+                    init_sigma, double_input, img_scale, feature_ratio, true);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*desc, tmp_desc);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/sobel.cpp b/src/api/c/sobel.cpp
index 594bf65a14..d466db1617 100644
--- a/src/api/c/sobel.cpp
+++ b/src/api/c/sobel.cpp
@@ -7,50 +7,73 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/image.h>
-#include <handle.hpp>
-#include <err_common.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <sobel.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 #include <utility>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
-typedef std::pair<af_array, af_array> ArrayPair;
+using ArrayPair = std::pair<af_array, af_array>;
 template<typename Ti, typename To>
-ArrayPair sobelDerivatives(const af_array &in, const unsigned &ker_size)
-{
-    typedef std::pair< Array<To>, Array<To> > BAPair;
-    BAPair  out = sobelDerivatives<Ti, To>(getArray<Ti>(in), ker_size);
-    return std::make_pair(getHandle<To>(out.first),
-                          getHandle<To>(out.second));
+ArrayPair sobelDerivatives(const af_array &in, const unsigned &ker_size) {
+    using BAPair = std::pair<Array<To>, Array<To>>;
+    BAPair out   = sobelDerivatives<Ti, To>(getArray<Ti>(in), ker_size);
+    return std::make_pair(getHandle<To>(out.first), getHandle<To>(out.second));
 }
 
-af_err af_sobel_operator(af_array *dx, af_array *dy, const af_array img, const unsigned ker_size)
-{
+af_err af_sobel_operator(af_array *dx, af_array *dy, const af_array img,
+                         const unsigned ker_size) {
     try {
-        //FIXME: ADD SUPPORT FOR OTHER KERNEL SIZES
-        //ARG_ASSERT(4, (ker_size==3 || ker_size==5 || ker_size==7));
-        ARG_ASSERT(4, (ker_size==3));
+        // FIXME: ADD SUPPORT FOR OTHER KERNEL SIZES
+        // ARG_ASSERT(4, (ker_size==3 || ker_size==5 || ker_size==7));
+        ARG_ASSERT(4, (ker_size == 3));
 
-        ArrayInfo info = getInfo(img);
-        af::dim4 dims  = info.dims();
+        const ArrayInfo &info = getInfo(img);
+        af::dim4 dims         = info.dims();
 
         DIM_ASSERT(3, (dims.ndims() >= 2));
 
         ArrayPair output;
-        af_dtype type  = info.getType();
-        switch(type) {
-            case f32: output = sobelDerivatives<float , float> (img, ker_size); break;
-            case f64: output = sobelDerivatives<double, double>(img, ker_size); break;
-            case s32: output = sobelDerivatives<int   , int>   (img, ker_size); break;
-            case u32: output = sobelDerivatives<uint  , int>   (img, ker_size); break;
-            case b8 : output = sobelDerivatives<char  , int>   (img, ker_size); break;
-            case u8:  output = sobelDerivatives<uchar , int>   (img, ker_size); break;
-            default : TYPE_ERROR(1, type);
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                output = sobelDerivatives<float, float>(img, ker_size);
+                break;
+            case f64:
+                output = sobelDerivatives<double, double>(img, ker_size);
+                break;
+            case s32: output = sobelDerivatives<int, int>(img, ker_size); break;
+            case u32:
+                output = sobelDerivatives<uint, int>(img, ker_size);
+                break;
+            case s16:
+                output = sobelDerivatives<short, int>(img, ker_size);
+                break;
+            case u16:
+                output = sobelDerivatives<ushort, int>(img, ker_size);
+                break;
+            case b8: output = sobelDerivatives<char, int>(img, ker_size); break;
+            case s8:
+                output = sobelDerivatives<schar, int>(img, ker_size);
+                break;
+            case u8:
+                output = sobelDerivatives<uchar, int>(img, ker_size);
+                break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*dx, output.first);
         std::swap(*dy, output.second);
diff --git a/src/api/c/solve.cpp b/src/api/c/solve.cpp
index 857ee18688..31c1489484 100644
--- a/src/api/c/solve.cpp
+++ b/src/api/c/solve.cpp
@@ -7,43 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/lapack.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <solve.hpp>
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/lapack.h>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::solveLU;
 
 template<typename T>
-static inline af_array solve(const af_array a, const af_array b, const af_mat_prop options)
-{
+static inline af_array solve(const af_array a, const af_array b,
+                             const af_mat_prop options) {
     return getHandle(solve<T>(getArray<T>(a), getArray<T>(b), options));
 }
 
-af_err af_solve(af_array *out, const af_array a, const af_array b, const af_mat_prop options)
-{
+af_err af_solve(af_array* out, const af_array a, const af_array b,
+                const af_mat_prop options) {
     try {
-        ArrayInfo a_info = getInfo(a);
-        ArrayInfo b_info = getInfo(b);
-
-        if (a_info.ndims() > 2 ||
-            b_info.ndims() > 2) {
-            AF_ERROR("solve can not be used in batch mode", AF_ERR_BATCH);
-        }
+        const ArrayInfo& a_info = getInfo(a);
+        const ArrayInfo& b_info = getInfo(b);
 
         af_dtype a_type = a_info.getType();
         af_dtype b_type = b_info.getType();
 
         dim4 adims = a_info.dims();
-        dim4 bdims = a_info.dims();
+        dim4 bdims = b_info.dims();
 
-        ARG_ASSERT(1, a_info.isFloating());                       // Only floating and complex types
-        ARG_ASSERT(2, b_info.isFloating());                       // Only floating and complex types
+        ARG_ASSERT(1, a_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(2, b_info.isFloating());  // Only floating and complex types
 
         TYPE_ASSERT(a_type == b_type);
 
@@ -51,27 +49,34 @@ af_err af_solve(af_array *out, const af_array a, const af_array b, const af_mat_
         DIM_ASSERT(1, bdims[2] == adims[2]);
         DIM_ASSERT(1, bdims[3] == adims[3]);
 
-        bool is_triangle_solve = (options & AF_MAT_LOWER) || (options & AF_MAT_UPPER);
+        if (a_info.ndims() == 0 || b_info.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, a_type);
+        }
+
+        bool is_triangle_solve =
+            (options & AF_MAT_LOWER) || (options & AF_MAT_UPPER);
 
         if (options != AF_MAT_NONE && !is_triangle_solve) {
-            AF_ERROR("Using this property is not yet supported in solve", AF_ERR_NOT_SUPPORTED);
+            AF_ERROR("Using this property is not yet supported in solve",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
         if (is_triangle_solve) {
             DIM_ASSERT(1, adims[0] == adims[1]);
             if ((options & AF_MAT_TRANS || options & AF_MAT_CTRANS)) {
-                AF_ERROR("Using AF_MAT_TRANS is not yet supported in solve", AF_ERR_NOT_SUPPORTED);
+                AF_ERROR("Using AF_MAT_TRANS is not yet supported in solve",
+                         AF_ERR_NOT_SUPPORTED);
             }
         }
 
         af_array output;
 
-        switch(a_type) {
-            case f32: output = solve<float  >(a, b, options);  break;
-            case f64: output = solve<double >(a, b, options);  break;
-            case c32: output = solve<cfloat >(a, b, options);  break;
-            case c64: output = solve<cdouble>(a, b, options);  break;
-            default:  TYPE_ERROR(1, a_type);
+        switch (a_type) {
+            case f32: output = solve<float>(a, b, options); break;
+            case f64: output = solve<double>(a, b, options); break;
+            case c32: output = solve<cfloat>(a, b, options); break;
+            case c64: output = solve<cdouble>(a, b, options); break;
+            default: TYPE_ERROR(1, a_type);
         }
         std::swap(*out, output);
     }
@@ -82,22 +87,19 @@ af_err af_solve(af_array *out, const af_array a, const af_array b, const af_mat_
 
 template<typename T>
 static inline af_array solve_lu(const af_array a, const af_array pivot,
-                                const af_array b, const af_mat_prop options)
-{
+                                const af_array b, const af_mat_prop options) {
     return getHandle(solveLU<T>(getArray<T>(a), getArray<int>(pivot),
                                 getArray<T>(b), options));
 }
 
-af_err af_solve_lu(af_array *out, const af_array a,
-                   const af_array piv, const af_array b,
-                   const af_mat_prop options)
-{
+af_err af_solve_lu(af_array* out, const af_array a, const af_array piv,
+                   const af_array b, const af_mat_prop options) {
     try {
-        ArrayInfo a_info = getInfo(a);
-        ArrayInfo b_info = getInfo(b);
+        const ArrayInfo& a_info   = getInfo(a);
+        const ArrayInfo& b_info   = getInfo(b);
+        const ArrayInfo& piv_info = getInfo(piv);
 
-        if (a_info.ndims() > 2 ||
-            b_info.ndims() > 2) {
+        if (a_info.ndims() > 2 || b_info.ndims() > 2) {
             AF_ERROR("solveLU can not be used in batch mode", AF_ERR_BATCH);
         }
 
@@ -105,30 +107,37 @@ af_err af_solve_lu(af_array *out, const af_array a,
         af_dtype b_type = b_info.getType();
 
         dim4 adims = a_info.dims();
-        dim4 bdims = a_info.dims();
+        dim4 bdims = b_info.dims();
+        if (a_info.ndims() == 0 || b_info.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, a_type);
+        }
 
-        ARG_ASSERT(1, a_info.isFloating());                       // Only floating and complex types
-        ARG_ASSERT(2, b_info.isFloating());                       // Only floating and complex types
+        ARG_ASSERT(1, a_info.isFloating());  // Only floating and complex types
+        ARG_ASSERT(2, b_info.isFloating());  // Only floating and complex types
 
         TYPE_ASSERT(a_type == b_type);
 
+        af_dtype piv_type = piv_info.getType();
+        TYPE_ASSERT(piv_type == s32);  // TODO: add support for 64 bit types
+
         DIM_ASSERT(1, adims[0] == adims[1]);
         DIM_ASSERT(1, bdims[0] == adims[0]);
         DIM_ASSERT(1, bdims[2] == adims[2]);
         DIM_ASSERT(1, bdims[3] == adims[3]);
 
         if (options != AF_MAT_NONE) {
-            AF_ERROR("Using this property is not yet supported in solveLU", AF_ERR_NOT_SUPPORTED);
+            AF_ERROR("Using this property is not yet supported in solveLU",
+                     AF_ERR_NOT_SUPPORTED);
         }
 
         af_array output;
 
-        switch(a_type) {
-        case f32: output = solve_lu<float  >(a, piv, b, options);  break;
-        case f64: output = solve_lu<double >(a, piv, b, options);  break;
-        case c32: output = solve_lu<cfloat >(a, piv, b, options);  break;
-        case c64: output = solve_lu<cdouble>(a, piv, b, options);  break;
-        default:  TYPE_ERROR(1, a_type);
+        switch (a_type) {
+            case f32: output = solve_lu<float>(a, piv, b, options); break;
+            case f64: output = solve_lu<double>(a, piv, b, options); break;
+            case c32: output = solve_lu<cfloat>(a, piv, b, options); break;
+            case c64: output = solve_lu<cdouble>(a, piv, b, options); break;
+            default: TYPE_ERROR(1, a_type);
         }
         std::swap(*out, output);
     }
diff --git a/src/api/c/sort.cpp b/src/api/c/sort.cpp
index 39a7f227b3..b917b8b3c5 100644
--- a/src/api/c/sort.cpp
+++ b/src/api/c/sort.cpp
@@ -7,54 +7,63 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/algorithm.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
 #include <sort.hpp>
-#include <sort_index.hpp>
 #include <sort_by_key.hpp>
-#include <copy.hpp>
+#include <sort_index.hpp>
+#include <af/algorithm.h>
+#include <af/array.h>
+#include <af/defines.h>
 
-#include<cstdio>
+#include <cstdio>
 
 using af::dim4;
-using namespace detail;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array sort(const af_array in, const unsigned dim, const bool isAscending)
-{
+static inline af_array sort(const af_array in, const unsigned dim,
+                            const bool isAscending) {
     const Array<T> &inArray = getArray<T>(in);
-    if(isAscending) {
-        return getHandle(sort<T, 1>(inArray, dim));
-    } else {
-        return getHandle(sort<T, 0>(inArray, dim));
-    }
+    return getHandle(sort<T>(inArray, dim, isAscending));
 }
 
-af_err af_sort(af_array *out, const af_array in, const unsigned dim, const bool isAscending)
-{
+af_err af_sort(af_array *out, const af_array in, const unsigned dim,
+               const bool isAscending) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
+        if (info.elements() == 0) { return af_retain_array(out, in); }
         DIM_ASSERT(1, info.elements() > 0);
-        // Only Dim 0 supported
-        ARG_ASSERT(2, dim == 0);
 
         af_array val;
 
-        switch(type) {
-            case f32: val = sort<float  >(in, dim, isAscending);  break;
-            case f64: val = sort<double >(in, dim, isAscending);  break;
-            case s32: val = sort<int    >(in, dim, isAscending);  break;
-            case u32: val = sort<uint   >(in, dim, isAscending);  break;
-            case u8:  val = sort<uchar  >(in, dim, isAscending);  break;
-            case b8:  val = sort<char   >(in, dim, isAscending);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: val = sort<float>(in, dim, isAscending); break;
+            case f64: val = sort<double>(in, dim, isAscending); break;
+            case s32: val = sort<int>(in, dim, isAscending); break;
+            case u32: val = sort<uint>(in, dim, isAscending); break;
+            case s16: val = sort<short>(in, dim, isAscending); break;
+            case u16: val = sort<ushort>(in, dim, isAscending); break;
+            case s64: val = sort<intl>(in, dim, isAscending); break;
+            case u64: val = sort<uintl>(in, dim, isAscending); break;
+            case s8: val = sort<schar>(in, dim, isAscending); break;
+            case u8: val = sort<uchar>(in, dim, isAscending); break;
+            case b8: val = sort<char>(in, dim, isAscending); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, val);
     }
@@ -65,46 +74,58 @@ af_err af_sort(af_array *out, const af_array in, const unsigned dim, const bool
 
 template<typename T>
 static inline void sort_index(af_array *val, af_array *idx, const af_array in,
-                              const unsigned dim, const bool isAscending)
-{
+                              const unsigned dim, const bool isAscending) {
     const Array<T> &inArray = getArray<T>(in);
 
     // Initialize Dummy Arrays
-    Array<T> valArray = createEmptyArray<T>(af::dim4());
+    Array<T> valArray    = createEmptyArray<T>(af::dim4());
     Array<uint> idxArray = createEmptyArray<uint>(af::dim4());
 
-    if(isAscending) {
-        sort_index<T, 1>(valArray, idxArray, inArray, dim);
-    } else {
-        sort_index<T, 0>(valArray, idxArray, inArray, dim);
-    }
+    sort_index<T>(valArray, idxArray, inArray, dim, isAscending);
     *val = getHandle(valArray);
     *idx = getHandle(idxArray);
 }
 
-af_err af_sort_index(af_array *out, af_array *indices, const af_array in, const unsigned dim, const bool isAscending)
-{
+af_err af_sort_index(af_array *out, af_array *indices, const af_array in,
+                     const unsigned dim, const bool isAscending) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
-        DIM_ASSERT(2, info.elements() > 0);
-        // Only Dim 0 supported
-        ARG_ASSERT(3, dim == 0);
+        if (info.elements() <= 0) {
+            AF_CHECK(af_create_handle(out, 0, nullptr, type));
+            AF_CHECK(af_create_handle(indices, 0, nullptr, u32));
+            return AF_SUCCESS;
+        }
 
         af_array val;
         af_array idx;
 
-        switch(type) {
-            case f32: sort_index<float  >(&val, &idx, in, dim, isAscending);  break;
-            case f64: sort_index<double >(&val, &idx, in, dim, isAscending);  break;
-            case s32: sort_index<int    >(&val, &idx, in, dim, isAscending);  break;
-            case u32: sort_index<uint   >(&val, &idx, in, dim, isAscending);  break;
-            case u8:  sort_index<uchar  >(&val, &idx, in, dim, isAscending);  break;
-            case b8:  sort_index<char   >(&val, &idx, in, dim, isAscending);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32:
+                sort_index<float>(&val, &idx, in, dim, isAscending);
+                break;
+            case f64:
+                sort_index<double>(&val, &idx, in, dim, isAscending);
+                break;
+            case s32: sort_index<int>(&val, &idx, in, dim, isAscending); break;
+            case u32: sort_index<uint>(&val, &idx, in, dim, isAscending); break;
+            case s16:
+                sort_index<short>(&val, &idx, in, dim, isAscending);
+                break;
+            case u16:
+                sort_index<ushort>(&val, &idx, in, dim, isAscending);
+                break;
+            case s64: sort_index<intl>(&val, &idx, in, dim, isAscending); break;
+            case u64:
+                sort_index<uintl>(&val, &idx, in, dim, isAscending);
+                break;
+            case s8: sort_index<schar>(&val, &idx, in, dim, isAscending); break;
+            case u8: sort_index<uchar>(&val, &idx, in, dim, isAscending); break;
+            case b8: sort_index<char>(&val, &idx, in, dim, isAscending); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out , val);
+        std::swap(*out, val);
         std::swap(*indices, idx);
     }
     CATCHALL;
@@ -113,9 +134,9 @@ af_err af_sort_index(af_array *out, af_array *indices, const af_array in, const
 }
 
 template<typename Tk, typename Tv>
-static inline void sort_by_key(af_array *okey, af_array *oval, const af_array ikey, const af_array ival,
-                               const unsigned dim, const bool isAscending)
-{
+static inline void sort_by_key(af_array *okey, af_array *oval,
+                               const af_array ikey, const af_array ival,
+                               const unsigned dim, const bool isAscending) {
     const Array<Tk> &ikeyArray = getArray<Tk>(ikey);
     const Array<Tv> &ivalArray = getArray<Tv>(ival);
 
@@ -123,64 +144,133 @@ static inline void sort_by_key(af_array *okey, af_array *oval, const af_array ik
     Array<Tk> okeyArray = createEmptyArray<Tk>(af::dim4());
     Array<Tv> ovalArray = createEmptyArray<Tv>(af::dim4());
 
-    if(isAscending) {
-        sort_by_key<Tk, Tv, 1>(okeyArray, ovalArray, ikeyArray, ivalArray, dim);
-    } else {
-        sort_by_key<Tk, Tv, 0>(okeyArray, ovalArray, ikeyArray, ivalArray, dim);
-    }
+    sort_by_key<Tk, Tv>(okeyArray, ovalArray, ikeyArray, ivalArray, dim,
+                        isAscending);
     *okey = getHandle(okeyArray);
     *oval = getHandle(ovalArray);
 }
 
 template<typename Tk>
-void sort_by_key_tmplt(af_array *okey, af_array *oval, const af_array ikey, const af_array ival,
-                       const unsigned dim, const bool isAscending)
-{
-    ArrayInfo info = getInfo(ival);
-    af_dtype vtype = info.getType();
-
-    switch(vtype) {
-    case f32: sort_by_key<Tk, float  >(okey, oval, ikey, ival, dim, isAscending);  break;
-    case f64: sort_by_key<Tk, double >(okey, oval, ikey, ival, dim, isAscending);  break;
-    case s32: sort_by_key<Tk, int    >(okey, oval, ikey, ival, dim, isAscending);  break;
-    case u32: sort_by_key<Tk, uint   >(okey, oval, ikey, ival, dim, isAscending);  break;
-    case u8:  sort_by_key<Tk, uchar  >(okey, oval, ikey, ival, dim, isAscending);  break;
-    case b8:  sort_by_key<Tk, char   >(okey, oval, ikey, ival, dim, isAscending);  break;
-    default:  TYPE_ERROR(1, vtype);
-    }
+void sort_by_key_tmplt(af_array *okey, af_array *oval, const af_array ikey,
+                       const af_array ival, const unsigned dim,
+                       const bool isAscending) {
+    const ArrayInfo &info = getInfo(ival);
+    af_dtype vtype        = info.getType();
 
-    return;
+    switch (vtype) {
+        case f32:
+            sort_by_key<Tk, float>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case f64:
+            sort_by_key<Tk, double>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case c32:
+            sort_by_key<Tk, cfloat>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case c64:
+            sort_by_key<Tk, cdouble>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case s32:
+            sort_by_key<Tk, int>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case u32:
+            sort_by_key<Tk, uint>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case s16:
+            sort_by_key<Tk, short>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case u16:
+            sort_by_key<Tk, ushort>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case s64:
+            sort_by_key<Tk, intl>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case u64:
+            sort_by_key<Tk, uintl>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case s8:
+            sort_by_key<Tk, schar>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case u8:
+            sort_by_key<Tk, uchar>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        case b8:
+            sort_by_key<Tk, char>(okey, oval, ikey, ival, dim, isAscending);
+            break;
+        default: TYPE_ERROR(1, vtype);
+    }
 }
 
 af_err af_sort_by_key(af_array *out_keys, af_array *out_values,
                       const af_array keys, const af_array values,
-                      const unsigned dim, const bool isAscending)
-{
+                      const unsigned dim, const bool isAscending) {
     try {
-        ArrayInfo info = getInfo(keys);
-        af_dtype type = info.getType();
+        const ArrayInfo &kinfo = getInfo(keys);
+        af_dtype ktype         = kinfo.getType();
 
-        ArrayInfo vinfo = getInfo(values);
+        const ArrayInfo &vinfo = getInfo(values);
+
+        DIM_ASSERT(4, kinfo.dims() == vinfo.dims());
+        if (kinfo.elements() == 0) {
+            AF_CHECK(af_create_handle(out_keys, 0, nullptr, ktype));
+            AF_CHECK(af_create_handle(out_values, 0, nullptr, vinfo.getType()));
+            return AF_SUCCESS;
+        }
 
-        DIM_ASSERT(3, info.elements() > 0);
-        DIM_ASSERT(4, info.dims() == vinfo.dims());
-        // Only Dim 0 supported
-        ARG_ASSERT(5, dim == 0);
+        TYPE_ASSERT(kinfo.isReal());
 
         af_array oKey;
         af_array oVal;
 
-        switch(type) {
-            case f32: sort_by_key_tmplt<float  >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            case f64: sort_by_key_tmplt<double >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            case s32: sort_by_key_tmplt<int    >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            case u32: sort_by_key_tmplt<uint   >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            case u8:  sort_by_key_tmplt<uchar  >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            case b8:  sort_by_key_tmplt<char   >(&oKey, &oVal, keys, values, dim, isAscending);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (ktype) {
+            case f32:
+                sort_by_key_tmplt<float>(&oKey, &oVal, keys, values, dim,
+                                         isAscending);
+                break;
+            case f64:
+                sort_by_key_tmplt<double>(&oKey, &oVal, keys, values, dim,
+                                          isAscending);
+                break;
+            case s32:
+                sort_by_key_tmplt<int>(&oKey, &oVal, keys, values, dim,
+                                       isAscending);
+                break;
+            case u32:
+                sort_by_key_tmplt<uint>(&oKey, &oVal, keys, values, dim,
+                                        isAscending);
+                break;
+            case s16:
+                sort_by_key_tmplt<short>(&oKey, &oVal, keys, values, dim,
+                                         isAscending);
+                break;
+            case u16:
+                sort_by_key_tmplt<ushort>(&oKey, &oVal, keys, values, dim,
+                                          isAscending);
+                break;
+            case s64:
+                sort_by_key_tmplt<intl>(&oKey, &oVal, keys, values, dim,
+                                        isAscending);
+                break;
+            case u64:
+                sort_by_key_tmplt<uintl>(&oKey, &oVal, keys, values, dim,
+                                         isAscending);
+                break;
+            case s8:
+                sort_by_key_tmplt<schar>(&oKey, &oVal, keys, values, dim,
+                                         isAscending);
+                break;
+            case u8:
+                sort_by_key_tmplt<uchar>(&oKey, &oVal, keys, values, dim,
+                                         isAscending);
+                break;
+            case b8:
+                sort_by_key_tmplt<char>(&oKey, &oVal, keys, values, dim,
+                                        isAscending);
+                break;
+            default: TYPE_ERROR(1, ktype);
         }
-        std::swap(*out_keys , oKey);
-        std::swap(*out_values , oVal);
+        std::swap(*out_keys, oKey);
+        std::swap(*out_values, oVal);
     }
     CATCHALL;
 
diff --git a/src/api/c/sparse.cpp b/src/api/c/sparse.cpp
new file mode 100644
index 0000000000..db57b0077b
--- /dev/null
+++ b/src/api/c/sparse.cpp
@@ -0,0 +1,491 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <lookup.hpp>
+#include <platform.hpp>
+#include <sparse.hpp>
+#include <sparse_handle.hpp>
+#include <af/algorithm.h>
+#include <af/array.h>
+#include <af/sparse.h>
+
+using af::dim4;
+using arrayfire::getSparseArray;
+using arrayfire::retainSparseHandle;
+using arrayfire::common::createArrayDataSparseArray;
+using arrayfire::common::createDeviceDataSparseArray;
+using arrayfire::common::createEmptySparseArray;
+using arrayfire::common::createHostDataSparseArray;
+using arrayfire::common::SparseArray;
+using arrayfire::common::SparseArrayBase;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::sparseConvertDenseToStorage;
+
+namespace arrayfire {
+
+const SparseArrayBase &getSparseArrayBase(const af_array in,
+                                          bool device_check) {
+    const SparseArrayBase *base =
+        static_cast<SparseArrayBase *>(static_cast<void *>(in));
+
+    if (!base->isSparse()) {
+        AF_ERROR(
+            "Input is not a SparseArray and cannot be used in Sparse functions",
+            AF_ERR_ARG);
+    }
+
+    if (device_check &&
+        base->getDevId() != static_cast<int>(detail::getActiveDeviceId())) {
+        AF_ERROR("Input Array not created on current device", AF_ERR_DEVICE);
+    }
+
+    return *base;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Sparse Creation
+////////////////////////////////////////////////////////////////////////////////
+template<typename T>
+af_array createSparseArrayFromData(const dim4 &dims, const af_array values,
+                                   const af_array rowIdx, const af_array colIdx,
+                                   const af::storage stype) {
+    SparseArray<T> sparse = createArrayDataSparseArray(
+        dims, getArray<T>(values), getArray<int>(rowIdx), getArray<int>(colIdx),
+        stype);
+    return getHandle(sparse);
+}
+
+template<typename T>
+af_array createSparseArrayFromPtr(const af::dim4 &dims, const dim_t nNZ,
+                                  const T *const values,
+                                  const int *const rowIdx,
+                                  const int *const colIdx,
+                                  const af::storage stype,
+                                  const af::source source) {
+    if (nNZ) {
+        switch (source) {
+            case afHost:
+                return getHandle(createHostDataSparseArray(
+                    dims, nNZ, values, rowIdx, colIdx, stype));
+                break;
+            case afDevice:
+                return getHandle(createDeviceDataSparseArray(
+                    dims, nNZ, const_cast<T *>(values),
+                    const_cast<int *>(rowIdx), const_cast<int *>(colIdx),
+                    stype));
+                break;
+        }
+    }
+
+    return getHandle(createEmptySparseArray<T>(dims, nNZ, stype));
+}
+
+template<typename T>
+af_array createSparseArrayFromDense(const af_array _in,
+                                    const af_storage stype) {
+    const Array<T> in = getArray<T>(_in);
+
+    switch (stype) {
+        case AF_STORAGE_CSR:
+            return getHandle(
+                sparseConvertDenseToStorage<T, AF_STORAGE_CSR>(in));
+        case AF_STORAGE_COO:
+            return getHandle(
+                sparseConvertDenseToStorage<T, AF_STORAGE_COO>(in));
+        case AF_STORAGE_CSC:
+            // return getHandle(sparseConvertDenseToStorage<T,
+            // AF_STORAGE_CSC>(in));
+        default:
+            AF_ERROR("Storage type is out of range/unsupported", AF_ERR_ARG);
+    }
+}
+
+template<typename T>
+af_array sparseConvertStorage(const af_array in_,
+                              const af_storage destStorage) {
+    const SparseArray<T> in = getSparseArray<T>(in_);
+
+    if (destStorage == AF_STORAGE_DENSE) {
+        // Returns a regular af_array, not sparse
+        switch (in.getStorage()) {
+            case AF_STORAGE_CSR:
+                return getHandle(
+                    detail::sparseConvertStorageToDense<T, AF_STORAGE_CSR>(in));
+            case AF_STORAGE_COO:
+                return getHandle(
+                    detail::sparseConvertStorageToDense<T, AF_STORAGE_COO>(in));
+            default:
+                AF_ERROR("Invalid storage type of input array", AF_ERR_ARG);
+        }
+    } else if (destStorage == AF_STORAGE_CSR) {
+        // Returns a sparse af_array
+        switch (in.getStorage()) {
+            case AF_STORAGE_CSR: return retainSparseHandle<T>(in_);
+            case AF_STORAGE_COO:
+                return getHandle(
+                    detail::sparseConvertStorageToStorage<T, AF_STORAGE_CSR,
+                                                          AF_STORAGE_COO>(in));
+            default:
+                AF_ERROR("Invalid storage type of input array", AF_ERR_ARG);
+        }
+    } else if (destStorage == AF_STORAGE_COO) {
+        // Returns a sparse af_array
+        switch (in.getStorage()) {
+            case AF_STORAGE_CSR:
+                return getHandle(
+                    detail::sparseConvertStorageToStorage<T, AF_STORAGE_COO,
+                                                          AF_STORAGE_CSR>(in));
+            case AF_STORAGE_COO: return retainSparseHandle<T>(in_);
+            default:
+                AF_ERROR("Invalid storage type of input array", AF_ERR_ARG);
+        }
+    }
+
+    // Should never reach here
+    return NULL;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Get Functions
+////////////////////////////////////////////////////////////////////////////////
+template<typename T>
+af_array getSparseValues(const af_array in) {
+    return getHandle(getSparseArray<T>(in).getValues());
+}
+
+}  // namespace arrayfire
+
+using arrayfire::createSparseArrayFromData;
+using arrayfire::createSparseArrayFromDense;
+using arrayfire::createSparseArrayFromPtr;
+using arrayfire::getSparseArrayBase;
+using arrayfire::getSparseValues;
+using arrayfire::sparseConvertStorage;
+
+af_err af_create_sparse_array(af_array *out, const dim_t nRows,
+                              const dim_t nCols, const af_array values,
+                              const af_array rowIdx, const af_array colIdx,
+                              const af_storage stype) {
+    try {
+        // Checks:
+        // rowIdx and colIdx arrays are of s32 type
+        // values is of floating point type
+        // if COO, rowIdx, colIdx and values should have same dims
+        // if CSR, colIdx and values have same dims, rowIdx.dims = nRows + 1
+        // if CSC, rowIdx and values have same dims, colIdx.dims = nCols + 1
+        // stype is within acceptable range
+        // type is floating type
+
+        if (!(stype == AF_STORAGE_CSR || stype == AF_STORAGE_CSC ||
+              stype == AF_STORAGE_COO)) {
+            AF_ERROR("Storage type is out of range/unsupported", AF_ERR_ARG);
+        }
+
+        const ArrayInfo &vInfo = getInfo(values);
+        const ArrayInfo &rInfo = getInfo(rowIdx);
+        const ArrayInfo &cInfo = getInfo(colIdx);
+
+        TYPE_ASSERT(vInfo.isFloating());
+        DIM_ASSERT(3, vInfo.isLinear());
+        ARG_ASSERT(4, rInfo.getType() == s32);
+        DIM_ASSERT(4, rInfo.isLinear());
+        ARG_ASSERT(5, cInfo.getType() == s32);
+        DIM_ASSERT(5, cInfo.isLinear());
+
+        const dim_t nNZ = vInfo.elements();
+        if (stype == AF_STORAGE_COO) {
+            DIM_ASSERT(4, rInfo.elements() == nNZ);
+            DIM_ASSERT(5, cInfo.elements() == nNZ);
+        } else if (stype == AF_STORAGE_CSR) {
+            DIM_ASSERT(4, (dim_t)rInfo.elements() == nRows + 1);
+            DIM_ASSERT(5, cInfo.elements() == nNZ);
+        } else if (stype == AF_STORAGE_CSC) {
+            DIM_ASSERT(4, rInfo.elements() == nNZ);
+            DIM_ASSERT(5, (dim_t)cInfo.elements() == nCols + 1);
+        }
+
+        af_array output = nullptr;
+
+        dim4 dims(nRows, nCols);
+
+        switch (vInfo.getType()) {
+            case f32:
+                output = createSparseArrayFromData<float>(dims, values, rowIdx,
+                                                          colIdx, stype);
+                break;
+            case f64:
+                output = createSparseArrayFromData<double>(dims, values, rowIdx,
+                                                           colIdx, stype);
+                break;
+            case c32:
+                output = createSparseArrayFromData<cfloat>(dims, values, rowIdx,
+                                                           colIdx, stype);
+                break;
+            case c64:
+                output = createSparseArrayFromData<cdouble>(
+                    dims, values, rowIdx, colIdx, stype);
+                break;
+            default: TYPE_ERROR(1, vInfo.getType());
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_create_sparse_array_from_ptr(
+    af_array *out, const dim_t nRows, const dim_t nCols, const dim_t nNZ,
+    const void *const values, const int *const rowIdx, const int *const colIdx,
+    const af_dtype type, const af_storage stype, const af_source source) {
+    try {
+        // Checks:
+        // rowIdx and colIdx arrays are of s32 type
+        // values is of floating point type
+        // if COO, rowIdx, colIdx and values should have same dims
+        // if CSR, colIdx and values have same dims, rowIdx.dims = nRows + 1
+        // if CSC, rowIdx and values have same dims, colIdx.dims = nCols + 1
+        // stype is within acceptable range
+        // type is floating type
+        if (!(stype == AF_STORAGE_CSR || stype == AF_STORAGE_CSC ||
+              stype == AF_STORAGE_COO)) {
+            AF_ERROR("Storage type is out of range/unsupported", AF_ERR_ARG);
+        }
+
+        TYPE_ASSERT(type == f32 || type == f64 || type == c32 || type == c64);
+
+        af_array output = nullptr;
+
+        dim4 dims(nRows, nCols);
+
+        switch (type) {
+            case f32:
+                output = createSparseArrayFromPtr<float>(
+                    dims, nNZ, static_cast<const float *>(values), rowIdx,
+                    colIdx, stype, source);
+                break;
+            case f64:
+                output = createSparseArrayFromPtr<double>(
+                    dims, nNZ, static_cast<const double *>(values), rowIdx,
+                    colIdx, stype, source);
+                break;
+            case c32:
+                output = createSparseArrayFromPtr<cfloat>(
+                    dims, nNZ, static_cast<const cfloat *>(values), rowIdx,
+                    colIdx, stype, source);
+                break;
+            case c64:
+                output = createSparseArrayFromPtr<cdouble>(
+                    dims, nNZ, static_cast<const cdouble *>(values), rowIdx,
+                    colIdx, stype, source);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_create_sparse_array_from_dense(af_array *out, const af_array in,
+                                         const af_storage stype) {
+    try {
+        // Checks:
+        // stype is within acceptable range
+        // in is of floating point type
+
+        const ArrayInfo &info = getInfo(in);
+
+        if (!(stype == AF_STORAGE_CSR || stype == AF_STORAGE_CSC ||
+              stype == AF_STORAGE_COO)) {
+            AF_ERROR("Storage type is out of range/unsupported", AF_ERR_ARG);
+        }
+
+        // Only matrices allowed
+        DIM_ASSERT(1, info.ndims() == 2);
+
+        TYPE_ASSERT(info.isFloating());
+
+        af_array output = 0;
+
+        switch (info.getType()) {
+            case f32:
+                output = createSparseArrayFromDense<float>(in, stype);
+                break;
+            case f64:
+                output = createSparseArrayFromDense<double>(in, stype);
+                break;
+            case c32:
+                output = createSparseArrayFromDense<cfloat>(in, stype);
+                break;
+            case c64:
+                output = createSparseArrayFromDense<cdouble>(in, stype);
+                break;
+            default: TYPE_ERROR(1, info.getType());
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_convert_to(af_array *out, const af_array in,
+                            const af_storage destStorage) {
+    try {
+        // Handle dense case
+        const ArrayInfo &info = getInfo(in, false);
+        if (!info.isSparse()) {  // If input is dense
+            return af_create_sparse_array_from_dense(out, in, destStorage);
+        }
+
+        af_array output = nullptr;
+
+        const SparseArrayBase &base = getSparseArrayBase(in);
+
+        // Dense is not allowed as input (should never happen with a
+        // SparseArrayBase); CSC is currently not supported
+        ARG_ASSERT(1, base.getStorage() != AF_STORAGE_DENSE &&
+                          base.getStorage() != AF_STORAGE_CSC);
+
+        // Conversion to and from CSC is not supported
+        ARG_ASSERT(2, destStorage != AF_STORAGE_CSC);
+
+        if (base.getStorage() == destStorage) {
+            // Return a reference
+            AF_CHECK(af_retain_array(out, in));
+            return AF_SUCCESS;
+        }
+
+        switch (base.getType()) {
+            case f32:
+                output = sparseConvertStorage<float>(in, destStorage);
+                break;
+            case f64:
+                output = sparseConvertStorage<double>(in, destStorage);
+                break;
+            case c32:
+                output = sparseConvertStorage<cfloat>(in, destStorage);
+                break;
+            case c64:
+                output = sparseConvertStorage<cdouble>(in, destStorage);
+                break;
+            default: AF_ERROR("Output storage type is not valid", AF_ERR_ARG);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_to_dense(af_array *out, const af_array in) {
+    try {
+        af_array output = nullptr;
+
+        const SparseArrayBase &base = getSparseArrayBase(in);
+
+        // Dense not allowed as input -> Should never happen
+        // To convert from dense to sparse, use the create* functions
+        ARG_ASSERT(1, base.getStorage() != AF_STORAGE_DENSE);
+
+        switch (base.getType()) {
+            case f32:
+                output = sparseConvertStorage<float>(in, AF_STORAGE_DENSE);
+                break;
+            case f64:
+                output = sparseConvertStorage<double>(in, AF_STORAGE_DENSE);
+                break;
+            case c32:
+                output = sparseConvertStorage<cfloat>(in, AF_STORAGE_DENSE);
+                break;
+            case c64:
+                output = sparseConvertStorage<cdouble>(in, AF_STORAGE_DENSE);
+                break;
+            default: AF_ERROR("Output storage type is not valid", AF_ERR_ARG);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_info(af_array *values, af_array *rows, af_array *cols,
+                          af_storage *stype, const af_array in) {
+    try {
+        if (values != NULL) { AF_CHECK(af_sparse_get_values(values, in)); }
+        if (rows != NULL) { AF_CHECK(af_sparse_get_row_idx(rows, in)); }
+        if (cols != NULL) { AF_CHECK(af_sparse_get_col_idx(cols, in)); }
+        if (stype != NULL) { AF_CHECK(af_sparse_get_storage(stype, in)); }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_values(af_array *out, const af_array in) {
+    try {
+        const SparseArrayBase base = getSparseArrayBase(in);
+
+        af_array output = nullptr;
+
+        switch (base.getType()) {
+            case f32: output = getSparseValues<float>(in); break;
+            case f64: output = getSparseValues<double>(in); break;
+            case c32: output = getSparseValues<cfloat>(in); break;
+            case c64: output = getSparseValues<cdouble>(in); break;
+            default: TYPE_ERROR(1, base.getType());
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_row_idx(af_array *out, const af_array in) {
+    try {
+        const SparseArrayBase base = getSparseArrayBase(in);
+        *out                       = getHandle(base.getRowIdx());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_col_idx(af_array *out, const af_array in) {
+    try {
+        const SparseArrayBase base = getSparseArrayBase(in);
+        *out                       = getHandle(base.getColIdx());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_nnz(dim_t *out, const af_array in) {
+    try {
+        const SparseArrayBase base = getSparseArrayBase(in);
+        *out                       = base.getNNZ();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_sparse_get_storage(af_storage *out, const af_array in) {
+    try {
+        const SparseArrayBase base = getSparseArrayBase(in);
+        *out                       = base.getStorage();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
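Review note on the index-array sizes checked in `af_create_sparse_array_from_ptr`: in CSR the row array holds offsets, so it has `nRows + 1` entries, while `colIdx` and `values` each have `nNZ` entries. The sketch below is not ArrayFire code. `CsrMatrix` and `denseToCsr` are hypothetical names, and the input is assumed to be row-major (ArrayFire itself stores data column-major), purely to keep the illustration short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for illustration only; not an ArrayFire type.
struct CsrMatrix {
    std::vector<float> values;  // nNZ non-zero values
    std::vector<int> rowIdx;    // nRows + 1 offsets into values/colIdx
    std::vector<int> colIdx;    // nNZ column indices
};

// Builds a CSR representation from a row-major dense matrix (assumption:
// row-major input, chosen for readability).
CsrMatrix denseToCsr(const std::vector<float> &dense, int nRows, int nCols) {
    assert(dense.size() == static_cast<std::size_t>(nRows) * nCols);

    CsrMatrix m;
    m.rowIdx.reserve(static_cast<std::size_t>(nRows) + 1);
    m.rowIdx.push_back(0);  // the first row always starts at offset 0

    for (int r = 0; r < nRows; ++r) {
        for (int c = 0; c < nCols; ++c) {
            const float v = dense[static_cast<std::size_t>(r) * nCols + c];
            if (v != 0.0f) {
                m.values.push_back(v);
                m.colIdx.push_back(c);
            }
        }
        // Offset one past the last non-zero of row r.
        m.rowIdx.push_back(static_cast<int>(m.values.size()));
    }
    return m;
}
```

For a 3x3 input with four non-zeros, `rowIdx` ends up with four entries (`nRows + 1`) and its last entry equals `nNZ`.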
diff --git a/src/api/c/sparse_handle.hpp b/src/api/c/sparse_handle.hpp
new file mode 100644
index 0000000000..62c5289ebc
--- /dev/null
+++ b/src/api/c/sparse_handle.hpp
@@ -0,0 +1,96 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <math.hpp>
+#include <af/array.h>
+#include <af/dim4.hpp>
+
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+
+const common::SparseArrayBase &getSparseArrayBase(const af_array in,
+                                                  bool device_check = true);
+
+template<typename T>
+const common::SparseArray<T> &getSparseArray(const af_array &arr) {
+    const common::SparseArray<T> *A =
+        static_cast<const common::SparseArray<T> *>(arr);
+    ARG_ASSERT(0, A->isSparse() == true);
+    checkAndMigrate(*A);
+    return *A;
+}
+
+template<typename T>
+common::SparseArray<T> &getSparseArray(af_array &arr) {
+    common::SparseArray<T> *A = static_cast<common::SparseArray<T> *>(arr);
+    ARG_ASSERT(0, A->isSparse() == true);
+    checkAndMigrate(*A);
+    return *A;
+}
+
+template<typename T>
+static af_array getHandle(const common::SparseArray<T> &A) {
+    common::SparseArray<T> *ret = new common::SparseArray<T>(A);
+    return static_cast<af_array>(ret);
+}
+
+template<typename T>
+static void releaseSparseHandle(const af_array arr) {
+    common::destroySparseArray(static_cast<common::SparseArray<T> *>(arr));
+}
+
+template<typename T>
+af_array retainSparseHandle(const af_array in) {
+    const common::SparseArray<T> *sparse =
+        static_cast<const common::SparseArray<T> *>(in);
+    common::SparseArray<T> *out = new common::SparseArray<T>(*sparse);
+    return static_cast<af_array>(out);
+}
+
+// based on castArray in handle.hpp
+template<typename To>
+common::SparseArray<To> castSparse(const af_array &in) {
+    const ArrayInfo &info = getInfo(in, false);
+    using namespace common;
+
+#define CAST_SPARSE(Ti)                                                          \
+    do {                                                                         \
+        const SparseArray<Ti> sparse = getSparseArray<Ti>(in);                   \
+        detail::Array<To> values     = common::cast<To, Ti>(sparse.getValues()); \
+        return createArrayDataSparseArray(                                       \
+            sparse.dims(), values, sparse.getRowIdx(), sparse.getColIdx(),       \
+            sparse.getStorage());                                                \
+    } while (0)
+
+    switch (info.getType()) {
+        case f32: CAST_SPARSE(float);
+        case f64: CAST_SPARSE(double);
+        case c32: CAST_SPARSE(detail::cfloat);
+        case c64: CAST_SPARSE(detail::cdouble);
+        default: TYPE_ERROR(1, info.getType());
+    }
+}
+
+template<typename T>
+static af_array copySparseArray(const af_array in) {
+    const common::SparseArray<T> &inArray = getSparseArray<T>(in);
+    return getHandle<T>(common::copySparseArray<T>(inArray));
+}
+
+}  // namespace arrayfire
+
+using arrayfire::getHandle;
diff --git a/src/api/c/stats.h b/src/api/c/stats.h
index 0e74942880..cde5b1621b 100644
--- a/src/api/c/stats.h
+++ b/src/api/c/stats.h
@@ -9,86 +9,12 @@
 
 #pragma once
 
-template<typename T, typename Other>
-struct is_same{
-    static const bool value = false;
-};
-
-template<typename T>
-struct is_same<T, T> {
-    static const bool value = true;
-};
-
-template<bool, typename T, typename O>
-struct cond_type;
-
-template<typename T, typename Other>
-struct cond_type<true, T, Other> {
-    typedef T type;
-};
-
-template<typename T, typename Other>
-struct cond_type<false, T, Other> {
-    typedef Other type;
-};
+#include <backend.hpp>
+#include <type_traits>
 
 template<typename T>
 struct baseOutType {
-    typedef typename cond_type< is_same<T, cdouble>::value ||
-                                is_same<T, double>::value,
-                                double,
-                                float>::type type;
+    typedef typename std::conditional<std::is_same<T, detail::cdouble>::value ||
+                                          std::is_same<T, double>::value,
+                                      double, float>::type type;
 };
-
-template<typename T>
-inline T mean(const Array<T>& in)
-{
-    T out = reduce_all<af_add_t, T, T>(in);
-    T result = division(out, in.elements());
-    return result;
-}
-
-template<typename T, typename wType>
-inline T mean(const Array<T>& in, const Array<wType>& weights)
-{
-    Array<T> wts   = cast<T>(weights);
-
-    dim4 iDims = in.dims();
-
-    Array<T> wtdInput = arithOp<T, af_mul_t>(in, wts, iDims);
-
-    T wtdSum = reduce_all<af_add_t, T, T>(wtdInput);
-    wType wtsSum = reduce_all<af_add_t, wType, wType>(weights);
-
-    return division(wtdSum, wtsSum);
-}
-
-template<typename T>
-inline Array<T> mean(const Array<T>& in, dim_t dim)
-{
-    Array<T> redArr = reduce<af_add_t, T, T>(in, dim);
-
-    dim4 iDims = in.dims();
-    dim4 oDims = redArr.dims();
-
-    Array<T> cnstArr = createValueArray<T>(oDims, scalar<T>(iDims[dim]));
-    Array<T> result  = arithOp<T, af_div_t>(redArr, cnstArr, oDims);
-
-    return result;
-}
-
-template<typename T>
-inline Array<T> mean(const Array<T>& in, const Array<T>& wts, dim_t dim)
-{
-    dim4 iDims = in.dims();
-
-    Array<T> wtdInput = arithOp<T, af_mul_t>(in, wts, iDims);
-    Array<T> redArr   = reduce<af_add_t, T, T>(wtdInput, dim);
-    Array<T> wtsSum   = reduce<af_add_t, T, T>(wts, dim);
-
-    dim4 oDims = redArr.dims();
-
-    Array<T> result = arithOp<T, af_div_t>(redArr, wtsSum, oDims);
-
-    return result;
-}
diff --git a/src/api/c/stdev.cpp b/src/api/c/stdev.cpp
index b2f307b628..d5589f4d39 100644
--- a/src/api/c/stdev.cpp
+++ b/src/api/c/stdev.cpp
@@ -7,119 +7,160 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
-#include <af/defines.h>
+#include <arith.hpp>
 #include <backend.hpp>
-#include <reduce.hpp>
+#include <common/cast.hpp>
+#include <copy.hpp>
 #include <handle.hpp>
-#include <arith.hpp>
-#include <unary.hpp>
 #include <math.hpp>
-#include <cast.hpp>
+#include <mean.hpp>
+#include <reduce.hpp>
 #include <tile.hpp>
+#include <unary.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 #include <cmath>
 #include <complex>
 
 #include "stats.h"
 
-using namespace detail;
+using af::dim4;
+using arrayfire::common::cast;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createValueArray;
+using detail::division;
+using detail::getScalar;
+using detail::intl;
+using detail::mean;
+using detail::reduce;
+using detail::reduce_all;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename inType, typename outType>
-static outType stdev(const af_array& in)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-
-    Array<outType> meanCnst= createValueArray<outType>(input.dims(), mean<outType>(input));
-
-    Array<outType> diff    = detail::arithOp<outType, af_sub_t>(input, meanCnst, input.dims());
-
-    Array<outType> diffSq  = detail::arithOp<outType, af_mul_t>(diff, diff, diff.dims());
-
-    outType result = division(reduce_all<af_add_t, outType, outType>(diffSq), input.elements());
-
+static outType stdev(const af_array& in, const af_var_bias bias) {
+    using weightType        = typename baseOutType<outType>::type;
+    const Array<inType> _in = getArray<inType>(in);
+    Array<outType> input    = cast<outType>(_in);
+    Array<outType> meanCnst = createValueArray<outType>(
+        input.dims(), mean<inType, weightType, outType>(_in));
+    Array<outType> diff =
+        detail::arithOp<outType, af_sub_t>(input, meanCnst, input.dims());
+    Array<outType> diffSq =
+        detail::arithOp<outType, af_mul_t>(diff, diff, diff.dims());
+    outType result = division(
+        getScalar<outType>(reduce_all<af_add_t, outType, outType>(diffSq)),
+        (input.elements() - (bias == AF_VARIANCE_SAMPLE)));
     return sqrt(result);
 }
 
 template<typename inType, typename outType>
-static af_array stdev(const af_array& in, int dim)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    dim4 iDims = input.dims();
+static af_array stdev(const af_array& in, int dim, const af_var_bias bias) {
+    using weightType        = typename baseOutType<outType>::type;
+    const Array<inType> _in = getArray<inType>(in);
+    Array<outType> input    = cast<outType>(_in);
+    dim4 iDims              = input.dims();
 
-    Array<outType> meanArr = mean<outType>(input, dim);
+    Array<outType> meanArr = mean<inType, weightType, outType>(_in, dim);
 
     /* now tile meanArr along dim and use it for variance computation */
     dim4 tileDims(1);
-    tileDims[dim] = iDims[dim];
+    tileDims[dim]           = iDims[dim];
     Array<outType> tMeanArr = detail::tile<outType>(meanArr, tileDims);
     /* now mean array is ready */
 
-    Array<outType> diff    = detail::arithOp<outType, af_sub_t>(input, tMeanArr, tMeanArr.dims());
-    Array<outType> diffSq  = detail::arithOp<outType, af_mul_t>(diff, diff, diff.dims());
+    Array<outType> diff =
+        detail::arithOp<outType, af_sub_t>(input, tMeanArr, tMeanArr.dims());
+    Array<outType> diffSq =
+        detail::arithOp<outType, af_mul_t>(diff, diff, diff.dims());
     Array<outType> redDiff = reduce<af_add_t, outType, outType>(diffSq, dim);
-    dim4 oDims = redDiff.dims();
+    const dim4& oDims      = redDiff.dims();
 
-    Array<outType> divArr = createValueArray<outType>(oDims, scalar<outType>(iDims[dim]));
-    Array<outType> varArr = detail::arithOp<outType, af_div_t>(redDiff, divArr, redDiff.dims());
+    Array<outType> divArr = createValueArray<outType>(
+        oDims, scalar<outType>((iDims[dim] - (bias == AF_VARIANCE_SAMPLE))));
+    Array<outType> varArr =
+        detail::arithOp<outType, af_div_t>(redDiff, divArr, redDiff.dims());
     Array<outType> result = detail::unaryOp<outType, af_sqrt_t>(varArr);
 
     return getHandle<outType>(result);
 }
 
-af_err af_stdev_all(double *realVal, double *imagVal, const af_array in)
-{
+// NOLINTNEXTLINE(readability-non-const-parameter)
+af_err af_stdev_all(double* realVal, double* imagVal, const af_array in) {
+    return af_stdev_all_v2(realVal, imagVal, in, AF_VARIANCE_POPULATION);
+}
+
+af_err af_stdev_all_v2(double* realVal, double* imagVal, const af_array in,
+                       const af_var_bias bias) {
+    UNUSED(imagVal);  // TODO implement for complex values
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
-            case f64: *realVal = stdev<double, double>(in); break;
-            case f32: *realVal = stdev<float , float >(in); break;
-            case s32: *realVal = stdev<int   , float >(in); break;
-            case u32: *realVal = stdev<uint  , float >(in); break;
-            case s64: *realVal = stdev<intl  , double>(in); break;
-            case u64: *realVal = stdev<uintl , double>(in); break;
-            case  u8: *realVal = stdev<uchar , float >(in); break;
-            case  b8: *realVal = stdev<char  , float >(in); break;
-            // TODO: FIXME: sqrt(complex) is not present in cuda/opencl backend
-            //case c32: {
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        switch (type) {
+            case f64: *realVal = stdev<double, double>(in, bias); break;
+            case f32: *realVal = stdev<float, float>(in, bias); break;
+            case s32: *realVal = stdev<int, float>(in, bias); break;
+            case u32: *realVal = stdev<uint, float>(in, bias); break;
+            case s16: *realVal = stdev<short, float>(in, bias); break;
+            case u16: *realVal = stdev<ushort, float>(in, bias); break;
+            case s64: *realVal = stdev<intl, double>(in, bias); break;
+            case u64: *realVal = stdev<uintl, double>(in, bias); break;
+            case s8: *realVal = stdev<schar, float>(in, bias); break;
+            case u8: *realVal = stdev<uchar, float>(in, bias); break;
+            case b8: *realVal = stdev<char, float>(in, bias); break;
+            // TODO(umar): FIXME: no sqrt(complex) in the cuda/opencl backends
+            // case c32: {
             //    cfloat tmp = stdev<cfloat,cfloat>(in);
             //    *realVal = real(tmp);
             //    *imagVal = imag(tmp);
             //    } break;
-            //case c64: {
+            // case c64: {
             //    cdouble tmp = stdev<cdouble,cdouble>(in);
             //    *realVal = real(tmp);
             //    *imagVal = imag(tmp);
             //    } break;
-            default : TYPE_ERROR(1, type);
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_stdev(af_array *out, const af_array in, const dim_t dim)
-{
-    try {
-        ARG_ASSERT(2, (dim>=0 && dim<=3));
+af_err af_stdev(af_array* out, const af_array in, const dim_t dim) {
+    return af_stdev_v2(out, in, AF_VARIANCE_POPULATION, dim);
+}
 
-        af_array output = 0;
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
-            case f64: output = stdev<double,  double>(in, dim); break;
-            case f32: output = stdev<float ,  float >(in, dim); break;
-            case s32: output = stdev<int   ,  float >(in, dim); break;
-            case u32: output = stdev<uint  ,  float >(in, dim); break;
-            case s64: output = stdev<intl  ,  double>(in, dim); break;
-            case u64: output = stdev<uintl ,  double>(in, dim); break;
-            case  u8: output = stdev<uchar ,  float >(in, dim); break;
-            case  b8: output = stdev<char  ,  float >(in, dim); break;
-            // TODO: FIXME: sqrt(complex) is not present in cuda/opencl backend
-            //case c32: output = stdev<cfloat,  cfloat>(in, dim); break;
-            //case c64: output = stdev<cdouble,cdouble>(in, dim); break;
-            default : TYPE_ERROR(1, type);
+af_err af_stdev_v2(af_array* out, const af_array in, const af_var_bias bias,
+                   const dim_t dim) {
+    try {
+        ARG_ASSERT(2, (dim >= 0 && dim <= 3));
+
+        af_array output       = 0;
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        switch (type) {
+            case f64: output = stdev<double, double>(in, dim, bias); break;
+            case f32: output = stdev<float, float>(in, dim, bias); break;
+            case s32: output = stdev<int, float>(in, dim, bias); break;
+            case u32: output = stdev<uint, float>(in, dim, bias); break;
+            case s16: output = stdev<short, float>(in, dim, bias); break;
+            case u16: output = stdev<ushort, float>(in, dim, bias); break;
+            case s64: output = stdev<intl, double>(in, dim, bias); break;
+            case u64: output = stdev<uintl, double>(in, dim, bias); break;
+            case s8: output = stdev<schar, float>(in, dim, bias); break;
+            case u8: output = stdev<uchar, float>(in, dim, bias); break;
+            case b8: output = stdev<char, float>(in, dim, bias); break;
+            // TODO(umar): FIXME: no sqrt(complex) in the cuda/opencl backends
+            // case c32: output = stdev<cfloat, cfloat>(in, dim); break;
+            // case c64: output = stdev<cdouble, cdouble>(in, dim); break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
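Review note on the new `bias` parameter: the divisor `input.elements() - (bias == AF_VARIANCE_SAMPLE)` relies on the boolean converting to 0 or 1, giving `N` for the population estimate and `N - 1` for the sample estimate. A minimal host-side sketch of the same arithmetic follows. `Bias` and `stdevAll` are hypothetical stand-ins, not the ArrayFire API, and the explicit precondition is an addition of this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for af_var_bias; illustration only.
enum class Bias { Population, Sample };

// Standard deviation over all elements, dividing by N (population) or
// N - 1 (sample).
double stdevAll(const std::vector<double> &x, Bias bias) {
    const std::size_t n          = x.size();
    const std::size_t correction = (bias == Bias::Sample) ? 1 : 0;
    assert(n > correction);  // avoid dividing by zero or unsigned wrap-around

    double mean = 0.0;
    for (double v : x) { mean += v; }
    mean /= static_cast<double>(n);

    double sumSq = 0.0;
    for (double v : x) { sumSq += (v - mean) * (v - mean); }

    return std::sqrt(sumSq / static_cast<double>(n - correction));
}
```

With the eight values 2, 4, 4, 4, 5, 5, 7, 9 the squared deviations sum to 32, so the population result is sqrt(32 / 8) = 2 and the sample result is sqrt(32 / 7).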
diff --git a/src/api/c/stream.cpp b/src/api/c/stream.cpp
new file mode 100644
index 0000000000..45265e69b5
--- /dev/null
+++ b/src/api/c/stream.cpp
@@ -0,0 +1,378 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <type_util.hpp>
+
+#include <af/array.h>
+#include <af/index.h>
+
+#include <fstream>
+#include <iomanip>
+#include <vector>
+
+using std::string;
+using std::vector;
+
+using af::dim4;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createHostDataArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+#define STREAM_FORMAT_VERSION 0x1
+static const char sfv_char = STREAM_FORMAT_VERSION;
+
+template<typename T>
+static int save(const char *key, const af_array arr, const char *filename,
+                const bool append = false) {
+    // (char     )   Version (Once)
+    // (int      )   No. of Arrays (Once)
+    // (int    )   Length of the key
+    // (cstring)   Key
+    // (intl   )   Offset bytes to next array (type + dims + data)
+    // (char   )   Type
+    // (intl   )   dim4 (x 4)
+    // (T      )   data (x elements)
+    // Setup all the data structures that need to be written to file
+    ///////////////////////////////////////////////////////////////////////////
+    std::string k(key);
+    int klen = k.size();
+
+    const ArrayInfo &info = getInfo(arr);
+    std::vector<T> data(info.elements());
+
+    AF_CHECK(af_get_data_ptr(&data.front(), arr));
+
+    char type = info.getType();
+
+    intl odims[4];
+    for (int i = 0; i < 4; i++) { odims[i] = info.dims()[i]; }
+
+    intl offset = sizeof(char) + 4 * sizeof(intl) + info.elements() * sizeof(T);
+    ///////////////////////////////////////////////////////////////////////////
+
+    std::fstream fs;
+    int n_arrays = 0;
+
+    if (append) {
+        std::ifstream checkIfExists(filename);
+        bool exists = checkIfExists.good();
+        checkIfExists.close();
+        if (exists) {
+            fs.open(filename, std::fstream::in | std::fstream::out |
+                                  std::fstream::binary);
+        } else {
+            fs.open(filename, std::fstream::out | std::fstream::binary);
+        }
+
+        // Throw exception if file is not open
+        if (!fs.is_open()) { AF_ERROR("File failed to open", AF_ERR_ARG); }
+
+        // Assert Version
+        if (fs.peek() == std::fstream::traits_type::eof()) {
+            // File is empty
+            fs.clear();
+        } else {
+            char prev_version = 0;
+            fs.read(&prev_version, sizeof(char));
+
+            AF_ASSERT(
+                prev_version == sfv_char,
+                "ArrayFire data format has changed. Can't append to file");
+
+            fs.read(reinterpret_cast<char *>(&n_arrays), sizeof(int));
+        }
+    } else {
+        fs.open(filename,
+                std::fstream::out | std::fstream::binary | std::fstream::trunc);
+
+        // Throw exception if file is not open
+        if (!fs.is_open()) { AF_ERROR("File failed to open", AF_ERR_ARG); }
+    }
+
+    n_arrays++;
+
+    // Write version and n_arrays to top of file
+    fs.seekp(0);
+    fs.write(&sfv_char, 1);
+    fs.write(reinterpret_cast<char *>(&n_arrays), sizeof(int));
+
+    // Write array to end of file. Irrespective of new or append
+    fs.seekp(0, std::ios_base::end);
+    fs.write(reinterpret_cast<char *>(&klen), sizeof(int));
+    fs.write(k.c_str(), klen);
+    fs.write(reinterpret_cast<char *>(&offset), sizeof(intl));
+    fs.write(&type, sizeof(char));
+    fs.write(reinterpret_cast<char *>(&odims), sizeof(intl) * 4);
+    fs.write(reinterpret_cast<char *>(&data.front()), sizeof(T) * data.size());
+    fs.close();
+
+    return n_arrays - 1;
+}
+
+af_err af_save_array(int *index, const char *key, const af_array arr,
+                     const char *filename, const bool append) {
+    try {
+        ARG_ASSERT(0, key != NULL);
+        ARG_ASSERT(2, filename != NULL);
+
+        const ArrayInfo &info = getInfo(arr);
+        af_dtype type         = info.getType();
+        int id                = -1;
+        switch (type) {
+            case f32: id = save<float>(key, arr, filename, append); break;
+            case c32: id = save<cfloat>(key, arr, filename, append); break;
+            case f64: id = save<double>(key, arr, filename, append); break;
+            case c64: id = save<cdouble>(key, arr, filename, append); break;
+            case b8: id = save<char>(key, arr, filename, append); break;
+            case s32: id = save<int>(key, arr, filename, append); break;
+            case u32: id = save<unsigned>(key, arr, filename, append); break;
+            case s8: id = save<schar>(key, arr, filename, append); break;
+            case u8: id = save<uchar>(key, arr, filename, append); break;
+            case s64: id = save<intl>(key, arr, filename, append); break;
+            case u64: id = save<uintl>(key, arr, filename, append); break;
+            case s16: id = save<short>(key, arr, filename, append); break;
+            case u16: id = save<ushort>(key, arr, filename, append); break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*index, id);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+template<typename T>
+static af_array readDataToArray(std::fstream &fs) {
+    intl dims[4];
+    fs.read(reinterpret_cast<char *>(&dims), 4 * sizeof(intl));
+
+    dim4 d;
+    for (int i = 0; i < 4; i++) { d[i] = dims[i]; }
+
+    intl size = d.elements();
+
+    std::vector<T> data(size);
+    fs.read(reinterpret_cast<char *>(&data.front()), size * sizeof(T));
+
+    return getHandle(createHostDataArray<T>(d, &data.front()));
+}
+
+static af_array readArrayV1(const char *filename, const unsigned index) {
+    char version = 0;
+    int n_arrays = 0;
+
+    std::fstream fs(filename, std::fstream::in | std::fstream::binary);
+
+    // Throw exception if file is not open
+    if (!fs.is_open()) { AF_ERROR("File failed to open", AF_ERR_ARG); }
+
+    if (fs.peek() == std::fstream::traits_type::eof()) {
+        AF_ERROR("File is empty", AF_ERR_ARG);
+    }
+
+    fs.read(&version, sizeof(char));
+    fs.read(reinterpret_cast<char *>(&n_arrays), sizeof(int));
+
+    AF_ASSERT((int)index < n_arrays, "Index out of bounds");
+
+    for (unsigned i = 0; i < index; i++) {
+        // (int    )   Length of the key
+        // (cstring)   Key
+        // (intl   )   Offset bytes to next array (type + dims + data)
+        // (char   )   Type
+        // (intl   )   dim4 (x 4)
+        // (T      )   data (x elements)
+        int klen = -1;
+        fs.read(reinterpret_cast<char *>(&klen), sizeof(int));
+
+        // char* key = new char[klen];
+        // fs.read((char*)&key, klen * sizeof(char));
+
+        // Skip the array name tag
+        fs.seekg(klen, std::ios_base::cur);
+
+        // Read data offset
+        intl offset = -1;
+        fs.read(reinterpret_cast<char *>(&offset), sizeof(intl));
+
+        // Skip data
+        fs.seekg(offset, std::ios_base::cur);
+    }
+
+    int klen = -1;
+    fs.read(reinterpret_cast<char *>(&klen), sizeof(int));
+
+    // char* key = new char[klen];
+    // fs.read((char*)&key, klen * sizeof(char));
+
+    // Skip the array name tag
+    fs.seekg(klen, std::ios_base::cur);
+
+    // Read data offset
+    intl offset = -1;
+    fs.read(reinterpret_cast<char *>(&offset), sizeof(intl));
+
+    // Read type and dims
+    char type_ = -1;
+    fs.read(&type_, sizeof(char));
+
+    auto type = static_cast<af_dtype>(type_);
+
+    af_array out;
+    switch (type) {
+        case f32: out = readDataToArray<float>(fs); break;
+        case c32: out = readDataToArray<cfloat>(fs); break;
+        case f64: out = readDataToArray<double>(fs); break;
+        case c64: out = readDataToArray<cdouble>(fs); break;
+        case b8: out = readDataToArray<char>(fs); break;
+        case s32: out = readDataToArray<int>(fs); break;
+        case u32: out = readDataToArray<uint>(fs); break;
+        case s8: out = readDataToArray<schar>(fs); break;
+        case u8: out = readDataToArray<uchar>(fs); break;
+        case s64: out = readDataToArray<intl>(fs); break;
+        case u64: out = readDataToArray<uintl>(fs); break;
+        case s16: out = readDataToArray<short>(fs); break;
+        case u16: out = readDataToArray<ushort>(fs); break;
+        default: TYPE_ERROR(1, type);
+    }
+    fs.close();
+
+    return out;
+}
+
+static af_array checkVersionAndRead(const char *filename,
+                                    const unsigned index) {
+    char version = 0;
+
+    std::string filenameStr = std::string(filename);
+    std::fstream fs(filenameStr, std::fstream::in | std::fstream::binary);
+    // Throw exception if file is not open
+    if (!fs.is_open()) {
+        std::string errStr = "Failed to open: " + filenameStr;
+        AF_ERROR(errStr.c_str(), AF_ERR_ARG);
+    }
+
+    if (fs.peek() == std::fstream::traits_type::eof()) {
+        std::string errStr = filenameStr + " is empty";
+        AF_ERROR(errStr.c_str(), AF_ERR_ARG);
+    } else {
+        fs.read(&version, sizeof(char));
+    }
+    fs.close();
+
+    switch (version) {  // NOLINT(hicpp-multiway-paths-covered)
+        case 1: return readArrayV1(filename, index);
+        default: AF_ERROR("Invalid version", AF_ERR_ARG);
+    }
+}
+
+int checkVersionAndFindIndex(const char *filename, const char *k) {
+    char version = 0;
+    std::string key(k);
+    std::string filenameStr(filename);
+    std::ifstream fs(filenameStr, std::ifstream::in | std::ifstream::binary);
+
+    // Throw exception if file is not open
+    if (!fs.is_open()) {
+        std::string errStr = "Failed to open: " + filenameStr;
+        AF_ERROR(errStr.c_str(), AF_ERR_ARG);
+    }
+
+    if (fs.peek() == std::ifstream::traits_type::eof()) {
+        std::string errStr = filenameStr + " is empty";
+        AF_ERROR(errStr.c_str(), AF_ERR_ARG);
+    } else {
+        fs.read(&version, sizeof(char));
+    }
+
+    int index = -1;
+    if (version == 1) {
+        int n_arrays = -1;
+        fs.read(reinterpret_cast<char *>(&n_arrays), sizeof(int));
+        for (int i = 0; i < n_arrays; i++) {
+            int klen = -1;
+            fs.read(reinterpret_cast<char *>(&klen), sizeof(int));
+            string readKey;
+            readKey.resize(klen);
+            fs.read(&readKey.front(), klen);
+
+            if (key == readKey) {
+                // Key matches, break
+                index = i;
+                break;
+            }
+            // Key doesn't match. Skip the data
+            intl offset = -1;
+            fs.read(reinterpret_cast<char *>(&offset), sizeof(intl));
+            fs.seekg(offset, std::ios_base::cur);
+        }
+    } else {
+        AF_ERROR("Invalid version", AF_ERR_ARG);
+    }
+    fs.close();
+
+    return index;
+}
+
+af_err af_read_array_index(af_array *out, const char *filename,
+                           const unsigned index) {
+    try {
+        AF_CHECK(af_init());
+
+        ARG_ASSERT(1, filename != NULL);
+
+        af_array output = checkVersionAndRead(filename, index);
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_read_array_key(af_array *out, const char *filename, const char *key) {
+    try {
+        AF_CHECK(af_init());
+        ARG_ASSERT(1, filename != NULL);
+        ARG_ASSERT(2, key != NULL);
+
+        // Find index of key. Then call read by index
+        int index = checkVersionAndFindIndex(filename, key);
+
+        if (index == -1) { AF_ERROR("Key not found", AF_ERR_INVALID_ARRAY); }
+
+        af_array output = checkVersionAndRead(filename, index);
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_read_array_key_check(int *index, const char *filename,
+                               const char *key) {
+    try {
+        ARG_ASSERT(1, filename != NULL);
+        ARG_ASSERT(2, key != NULL);
+
+        AF_CHECK(af_init());
+
+        // Find index of key (-1 if the key is not present in the file)
+        int id = checkVersionAndFindIndex(filename, key);
+        std::swap(*index, id);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/surface.cpp b/src/api/c/surface.cpp
new file mode 100644
index 0000000000..d748677269
--- /dev/null
+++ b/src/api/c/surface.cpp
@@ -0,0 +1,216 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/graphics.h>
+#include <af/image.h>
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <common/moddims.hpp>
+#include <common/tile.hpp>
+#include <copy.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <reorder.hpp>
+#include <surface.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::getGLType;
+using arrayfire::common::makeContextCurrent;
+using arrayfire::common::modDims;
+using arrayfire::common::step_round;
+using detail::Array;
+using detail::copy_surface;
+using detail::createEmptyArray;
+using detail::forgeManager;
+using detail::getScalar;
+using detail::reduce_all;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+template<typename T>
+fg_chart setup_surface(fg_window window, const af_array xVals,
+                       const af_array yVals, const af_array zVals,
+                       const af_cell* const props) {
+    ForgeModule& _ = forgePlugin();
+    Array<T> xIn   = getArray<T>(xVals);
+    Array<T> yIn   = getArray<T>(yVals);
+    Array<T> zIn   = getArray<T>(zVals);
+
+    const ArrayInfo& Xinfo = getInfo(xVals);
+    const ArrayInfo& Yinfo = getInfo(yVals);
+    const ArrayInfo& Zinfo = getInfo(zVals);
+
+    dim4 X_dims = Xinfo.dims();
+    dim4 Y_dims = Yinfo.dims();
+    dim4 Z_dims = Zinfo.dims();
+
+    if (Xinfo.isVector()) {
+        // Convert xIn to a column vector
+        xIn = modDims(xIn, xIn.elements());
+        // Now tile along second dimension
+        dim4 x_tdims(1, Y_dims[0], 1, 1);
+        xIn = arrayfire::common::tile(xIn, x_tdims);
+
+        // Convert yIn to a row vector
+        yIn = modDims(yIn, dim4(1, yIn.elements()));
+        // Now tile along first dimension
+        dim4 y_tdims(X_dims[0], 1, 1, 1);
+        yIn = arrayfire::common::tile(yIn, y_tdims);
+    }
+
+    // Flatten xIn, yIn and zIn into row vectors
+    dim4 rowDims = dim4(1, zIn.elements());
+    xIn          = modDims(xIn, rowDims);
+    yIn          = modDims(yIn, rowDims);
+    zIn          = modDims(zIn, rowDims);
+
+    // Now join along first dimension, skip reorder
+    std::vector<Array<T>> inputs{xIn, yIn, zIn};
+
+    dim4 odims(3, rowDims[1]);
+    Array<T> out = createEmptyArray<T>(odims);
+    join(out, 0, inputs);
+    Array<T> Z = out;
+
+    ForgeManager& fgMngr = forgeManager();
+
+    // Get the chart for the current grid position (if any)
+    fg_chart chart = NULL;
+    if (props->col > -1 && props->row > -1) {
+        chart = fgMngr.getChart(window, props->row, props->col, FG_CHART_3D);
+    } else {
+        chart = fgMngr.getChart(window, 0, 0, FG_CHART_3D);
+    }
+
+    fg_surface surface =
+        fgMngr.getSurface(chart, Z_dims[0], Z_dims[1], getGLType<T>());
+
+    FG_CHECK(_.fg_set_surface_color(surface, 0.0, 1.0, 0.0, 1.0));
+
+    // If chart axes limits do not have a manual override
+    // then compute and set axes limits
+    if (!fgMngr.getChartAxesOverride(chart)) {
+        float cmin[3], cmax[3];
+        T dmin[3], dmax[3];
+        FG_CHECK(_.fg_get_chart_axes_limits(
+            &cmin[0], &cmax[0], &cmin[1], &cmax[1], &cmin[2], &cmax[2], chart));
+        dmin[0] = getScalar<T>(reduce_all<af_min_t, T, T>(xIn));
+        dmax[0] = getScalar<T>(reduce_all<af_max_t, T, T>(xIn));
+        dmin[1] = getScalar<T>(reduce_all<af_min_t, T, T>(yIn));
+        dmax[1] = getScalar<T>(reduce_all<af_max_t, T, T>(yIn));
+        dmin[2] = getScalar<T>(reduce_all<af_min_t, T, T>(zIn));
+        dmax[2] = getScalar<T>(reduce_all<af_max_t, T, T>(zIn));
+
+        if (cmin[0] == 0 && cmax[0] == 0 && cmin[1] == 0 && cmax[1] == 0 &&
+            cmin[2] == 0 && cmax[2] == 0) {
+            // No previous limits. Set without checking
+            cmin[0] = step_round(dmin[0], false);
+            cmax[0] = step_round(dmax[0], true);
+            cmin[1] = step_round(dmin[1], false);
+            cmax[1] = step_round(dmax[1], true);
+            cmin[2] = step_round(dmin[2], false);
+            cmax[2] = step_round(dmax[2], true);
+        } else {
+            if (cmin[0] > dmin[0]) { cmin[0] = step_round(dmin[0], false); }
+            if (cmax[0] < dmax[0]) { cmax[0] = step_round(dmax[0], true); }
+            if (cmin[1] > dmin[1]) { cmin[1] = step_round(dmin[1], false); }
+            if (cmax[1] < dmax[1]) { cmax[1] = step_round(dmax[1], true); }
+            if (cmin[2] > dmin[2]) { cmin[2] = step_round(dmin[2], false); }
+            if (cmax[2] < dmax[2]) { cmax[2] = step_round(dmax[2], true); }
+        }
+
+        FG_CHECK(_.fg_set_chart_axes_limits(chart, cmin[0], cmax[0], cmin[1],
+                                            cmax[1], cmin[2], cmax[2]));
+    }
+    copy_surface<T>(Z, surface);
+
+    return chart;
+}
+
+af_err af_draw_surface(const af_window window, const af_array xVals,
+                       const af_array yVals, const af_array S,
+                       const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& Xinfo = getInfo(xVals);
+        dim4 X_dims            = Xinfo.dims();
+        af_dtype Xtype         = Xinfo.getType();
+
+        const ArrayInfo& Yinfo = getInfo(yVals);
+        dim4 Y_dims            = Yinfo.dims();
+        af_dtype Ytype         = Yinfo.getType();
+
+        const ArrayInfo& Sinfo = getInfo(S);
+        const dim4& S_dims     = Sinfo.dims();
+        af_dtype Stype         = Sinfo.getType();
+
+        TYPE_ASSERT(Xtype == Ytype);
+        TYPE_ASSERT(Ytype == Stype);
+
+        if (!Yinfo.isVector()) {
+            DIM_ASSERT(1, X_dims == Y_dims);
+            DIM_ASSERT(3, Y_dims == S_dims);
+        } else {
+            DIM_ASSERT(3, (X_dims[0] * Y_dims[0] == (dim_t)Sinfo.elements()));
+        }
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        switch (Xtype) {
+            case f32:
+                chart = setup_surface<float>(window, xVals, yVals, S, props);
+                break;
+            case s32:
+                chart = setup_surface<int>(window, xVals, yVals, S, props);
+                break;
+            case u32:
+                chart = setup_surface<uint>(window, xVals, yVals, S, props);
+                break;
+            case s16:
+                chart = setup_surface<short>(window, xVals, yVals, S, props);
+                break;
+            case u16:
+                chart = setup_surface<ushort>(window, xVals, yVals, S, props);
+                break;
+            case s8:
+                chart = setup_surface<schar>(window, xVals, yVals, S, props);
+                break;
+            case u8:
+                chart = setup_surface<uchar>(window, xVals, yVals, S, props);
+                break;
+            default: TYPE_ERROR(1, Xtype);
+        }
+        auto gridDims = forgeManager().getWindowGrid(window);
+
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/susan.cpp b/src/api/c/susan.cpp
new file mode 100644
index 0000000000..8ea7dc8945
--- /dev/null
+++ b/src/api/c/susan.cpp
@@ -0,0 +1,116 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <features.hpp>
+#include <handle.hpp>
+#include <susan.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/features.h>
+#include <af/vision.h>
+
+using af::dim4;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+
+template<typename T>
+static af_features susan(af_array const& in, const unsigned radius,
+                         const float diff_thr, const float geom_thr,
+                         const float feature_ratio, const unsigned edge) {
+    Array<float> x     = createEmptyArray<float>(dim4());
+    Array<float> y     = createEmptyArray<float>(dim4());
+    Array<float> score = createEmptyArray<float>(dim4());
+
+    af_features_t feat;
+    feat.n = susan<T>(x, y, score, getArray<T>(in), radius, diff_thr, geom_thr,
+                      feature_ratio, edge);
+
+    feat.x     = getHandle(x);
+    feat.y     = getHandle(y);
+    feat.score = getHandle(score);
+    feat.orientation =
+        getHandle(feat.n > 0 ? createValueArray<float>(feat.n, 0.0)
+                             : createEmptyArray<float>(dim4()));
+    feat.size = getHandle(feat.n > 0 ? createValueArray<float>(feat.n, 1.0)
+                                     : createEmptyArray<float>(dim4()));
+
+    return getFeaturesHandle(feat);
+}
+
+af_err af_susan(af_features* out, const af_array in, const unsigned radius,
+                const float diff_thr, const float geom_thr,
+                const float feature_ratio, const unsigned edge) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af::dim4 dims         = info.dims();
+
+        ARG_ASSERT(1, dims.ndims() == 2);
+        ARG_ASSERT(2, radius < 10);
+        ARG_ASSERT(2, radius <= edge);
+        ARG_ASSERT(3, diff_thr > 0.0f);
+        ARG_ASSERT(4, geom_thr > 0.0f);
+        ARG_ASSERT(5, (feature_ratio > 0.0f && feature_ratio <= 1.0f));
+        ARG_ASSERT(6, (dims[0] >= (dim_t)(2 * edge + 1) ||
+                       dims[1] >= (dim_t)(2 * edge + 1)));
+
+        af_dtype type = info.getType();
+        switch (type) {
+            case f32:
+                *out = susan<float>(in, radius, diff_thr, geom_thr,
+                                    feature_ratio, edge);
+                break;
+            case f64:
+                *out = susan<double>(in, radius, diff_thr, geom_thr,
+                                     feature_ratio, edge);
+                break;
+            case b8:
+                *out = susan<char>(in, radius, diff_thr, geom_thr,
+                                   feature_ratio, edge);
+                break;
+            case s32:
+                *out = susan<int>(in, radius, diff_thr, geom_thr, feature_ratio,
+                                  edge);
+                break;
+            case u32:
+                *out = susan<uint>(in, radius, diff_thr, geom_thr,
+                                   feature_ratio, edge);
+                break;
+            case s16:
+                *out = susan<short>(in, radius, diff_thr, geom_thr,
+                                    feature_ratio, edge);
+                break;
+            case u16:
+                *out = susan<ushort>(in, radius, diff_thr, geom_thr,
+                                     feature_ratio, edge);
+                break;
+            case s8:
+                *out = susan<schar>(in, radius, diff_thr, geom_thr,
+                                    feature_ratio, edge);
+                break;
+            case u8:
+                *out = susan<uchar>(in, radius, diff_thr, geom_thr,
+                                    feature_ratio, edge);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/svd.cpp b/src/api/c/svd.cpp
new file mode 100644
index 0000000000..661831ffc8
--- /dev/null
+++ b/src/api/c/svd.cpp
@@ -0,0 +1,127 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/lapack.h>
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <svd.hpp>
+#include <af/defines.h>
+
+using af::dim4;
+using af::dtype_traits;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using std::min;
+
+template<typename T>
+static inline void svd(af_array *s, af_array *u, af_array *vt,
+                       const af_array in) {
+    const ArrayInfo &info = getInfo(in);  // ArrayInfo is the base class which
+    dim4 dims             = info.dims();
+    int M                 = dims[0];
+    int N                 = dims[1];
+
+    using Tr = typename dtype_traits<T>::base_type;
+
+    // Allocate output arrays
+    Array<Tr> sA = createEmptyArray<Tr>(dim4(min(M, N)));
+    Array<T> uA  = createEmptyArray<T>(dim4(M, M));
+    Array<T> vtA = createEmptyArray<T>(dim4(N, N));
+
+    svd<T, Tr>(sA, uA, vtA, getArray<T>(in));
+
+    *s  = getHandle(sA);
+    *u  = getHandle(uA);
+    *vt = getHandle(vtA);
+}
+
+template<typename T>
+static inline void svdInPlace(af_array *s, af_array *u, af_array *vt,
+                              af_array in) {
+    const ArrayInfo &info = getInfo(in);  // ArrayInfo is the base class which
+    dim4 dims             = info.dims();
+    int M                 = dims[0];
+    int N                 = dims[1];
+
+    using Tr = typename dtype_traits<T>::base_type;
+
+    // Allocate output arrays
+    Array<Tr> sA = createEmptyArray<Tr>(dim4(min(M, N)));
+    Array<T> uA  = createEmptyArray<T>(dim4(M, M));
+    Array<T> vtA = createEmptyArray<T>(dim4(N, N));
+
+    svdInPlace<T, Tr>(sA, uA, vtA, getArray<T>(in));
+
+    *s  = getHandle(sA);
+    *u  = getHandle(uA);
+    *vt = getHandle(vtA);
+}
+
+af_err af_svd(af_array *u, af_array *s, af_array *vt, const af_array in) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        dim4 dims             = info.dims();
+
+        ARG_ASSERT(3, (dims.ndims() >= 0 && dims.ndims() <= 2));
+        af_dtype type = info.getType();
+
+        if (dims.ndims() == 0) {
+            AF_CHECK(af_create_handle(u, 0, nullptr, type));
+            AF_CHECK(af_create_handle(s, 0, nullptr, type));
+            AF_CHECK(af_create_handle(vt, 0, nullptr, type));
+            return AF_SUCCESS;
+        }
+
+        switch (type) {
+            case f64: svd<double>(s, u, vt, in); break;
+            case f32: svd<float>(s, u, vt, in); break;
+            case c64: svd<cdouble>(s, u, vt, in); break;
+            case c32: svd<cfloat>(s, u, vt, in); break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_svd_inplace(af_array *u, af_array *s, af_array *vt, af_array in) {
+    try {
+        const ArrayInfo &info = getInfo(in);
+        dim4 dims             = info.dims();
+
+        ARG_ASSERT(3, (dims.ndims() >= 0 && dims.ndims() <= 2));
+        af_dtype type = info.getType();
+
+        if (dims.ndims() == 0) {
+            AF_CHECK(af_create_handle(u, 0, nullptr, type));
+            AF_CHECK(af_create_handle(s, 0, nullptr, type));
+            AF_CHECK(af_create_handle(vt, 0, nullptr, type));
+            return AF_SUCCESS;
+        }
+
+        DIM_ASSERT(3, dims[0] >= dims[1]);
+
+        switch (type) {
+            case f64: svdInPlace<double>(s, u, vt, in); break;
+            case f32: svdInPlace<float>(s, u, vt, in); break;
+            case c64: svdInPlace<cdouble>(s, u, vt, in); break;
+            case c32: svdInPlace<cfloat>(s, u, vt, in); break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/tile.cpp b/src/api/c/tile.cpp
index 463dd36700..2a50f12c43 100644
--- a/src/api/c/tile.cpp
+++ b/src/api/c/tile.cpp
@@ -7,55 +7,74 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/data.h>
-#include <err_common.hpp>
-#include <handle.hpp>
+#include <common/tile.hpp>
+
+#include <arith.hpp>
 #include <backend.hpp>
-#include <ArrayInfo.hpp>
-#include <tile.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <unary.hpp>
+#include <af/arith.h>
+#include <af/data.h>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::half;
+using arrayfire::common::tile;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::unaryOp;
+using detail::ushort;
 
 template<typename T>
-static inline af_array tile(const af_array in, const af::dim4 &tileDims)
-{
-    return getHandle(tile<T>(getArray<T>(in), tileDims));
+static inline af_array tile(const af_array in, const af::dim4 &tileDims) {
+    return getHandle(arrayfire::common::tile<T>(getArray<T>(in), tileDims));
 }
 
-af_err af_tile(af_array *out, const af_array in, const af::dim4 &tileDims)
-{
+af_err af_tile(af_array *out, const af_array in, const af::dim4 &tileDims) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
+        const ArrayInfo &info = getInfo(in);
+        af_dtype type         = info.getType();
 
+        if (info.ndims() == 0) { return af_retain_array(out, in); }
         DIM_ASSERT(1, info.dims().elements() > 0);
         DIM_ASSERT(2, tileDims.elements() > 0);
 
         af_array output;
 
-        switch(type) {
-            case f32: output = tile<float  >(in, tileDims);  break;
-            case c32: output = tile<cfloat >(in, tileDims);  break;
-            case f64: output = tile<double >(in, tileDims);  break;
-            case c64: output = tile<cdouble>(in, tileDims);  break;
-            case b8:  output = tile<char   >(in, tileDims);  break;
-            case s32: output = tile<int    >(in, tileDims);  break;
-            case u32: output = tile<uint   >(in, tileDims);  break;
-            case u8:  output = tile<uchar  >(in, tileDims);  break;
-            default:  TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = tile<float>(in, tileDims); break;
+            case c32: output = tile<cfloat>(in, tileDims); break;
+            case f64: output = tile<double>(in, tileDims); break;
+            case c64: output = tile<cdouble>(in, tileDims); break;
+            case b8: output = tile<char>(in, tileDims); break;
+            case s32: output = tile<int>(in, tileDims); break;
+            case u32: output = tile<uint>(in, tileDims); break;
+            case s64: output = tile<intl>(in, tileDims); break;
+            case u64: output = tile<uintl>(in, tileDims); break;
+            case s16: output = tile<short>(in, tileDims); break;
+            case u16: output = tile<ushort>(in, tileDims); break;
+            case s8: output = tile<schar>(in, tileDims); break;
+            case u8: output = tile<uchar>(in, tileDims); break;
+            case f16: output = tile<half>(in, tileDims); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_tile(af_array *out, const af_array in,
-               const unsigned x, const unsigned y,
-               const unsigned z, const unsigned w)
-{
+af_err af_tile(af_array *out, const af_array in, const unsigned x,
+               const unsigned y, const unsigned z, const unsigned w) {
     af::dim4 tileDims(x, y, z, w);
     return af_tile(out, in, tileDims);
 }
diff --git a/src/api/c/topk.cpp b/src/api/c/topk.cpp
new file mode 100644
index 0000000000..c8a303afea
--- /dev/null
+++ b/src/api/c/topk.cpp
@@ -0,0 +1,92 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/data.h>
+#include <af/statistics.h>
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
+#include <topk.hpp>
+
+using arrayfire::common::half;
+using detail::createEmptyArray;
+using detail::uint;
+
+namespace {
+
+template<typename T>
+af_err topk(af_array *v, af_array *i, const af_array in, const int k,
+            const int dim, const af_topk_function order) {
+    auto vals = createEmptyArray<T>(af::dim4());
+    auto idxs = createEmptyArray<unsigned>(af::dim4());
+
+    topk(vals, idxs, getArray<T>(in), k, dim, order);
+
+    *v = getHandle<T>(vals);
+    *i = getHandle<unsigned>(idxs);
+    return AF_SUCCESS;
+}
+}  //  namespace
+
+af_err af_topk(af_array *values, af_array *indices, const af_array in,
+               const int k, const int dim, const af_topk_function order) {
+    try {
+        af::topkFunction ord = (order == AF_TOPK_DEFAULT ? AF_TOPK_MAX : order);
+
+        const ArrayInfo &inInfo = getInfo(in);
+
+        ARG_ASSERT(2, (inInfo.ndims() > 0));
+
+        if (inInfo.elements() == 1) {
+            dim_t dims[1]   = {1};
+            af_err errValue = af_constant(indices, 0, 1, dims, u32);
+            return errValue == AF_SUCCESS ? af_retain_array(values, in)
+                                          : errValue;
+        }
+
+        int rdim           = dim;
+        const auto &inDims = inInfo.dims();
+
+        if (rdim == -1) {
+            for (dim_t d = 0; d < 4; d++) {
+                if (inDims[d] > 1) {
+                    rdim = d;
+                    break;
+                }
+            }
+        }
+
+        ARG_ASSERT(2, (inInfo.dims()[rdim] >= k));
+        ARG_ASSERT(
+            4, (k > 0) && (k <= 256));  // TODO(umar): Remove this limitation
+
+        if (rdim != 0) {
+            AF_ERROR("topk is supported along dimension 0 only.",
+                     AF_ERR_NOT_SUPPORTED);
+        }
+
+        af_dtype type = inInfo.getType();
+
+        switch (type) {
+            // TODO(umar): FIX RETURN VALUES HERE
+            case f32: topk<float>(values, indices, in, k, rdim, ord); break;
+            case f64: topk<double>(values, indices, in, k, rdim, ord); break;
+            case u32: topk<uint>(values, indices, in, k, rdim, ord); break;
+            case s32: topk<int>(values, indices, in, k, rdim, ord); break;
+            case f16: topk<half>(values, indices, in, k, rdim, ord); break;
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/transform.cpp b/src/api/c/transform.cpp
index 41ea0ace12..259d13840e 100644
--- a/src/api/c/transform.cpp
+++ b/src/api/c/transform.cpp
@@ -7,85 +7,203 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
-#include <af/defines.h>
-#include <err_common.hpp>
-#include <handle.hpp>
-#include <ArrayInfo.hpp>
 #include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
 #include <transform.hpp>
+#include <af/defines.h>
+#include <af/image.h>
 
 using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array transform(const af_array in, const af_array tf, const af::dim4 &odims,
-                                 const af_interp_type method, const bool inverse)
-{
-    return getHandle(transform<T>(getArray<T>(in), getArray<float>(tf), odims, method, inverse));
+static inline void transform(af_array *out, const af_array in,
+                             const af_array tf, const af_interp_type method,
+                             const bool inverse, const bool perspective) {
+    transform<T>(getArray<T>(*out), getArray<T>(in), getArray<float>(tf),
+                 method, inverse, perspective);
+}
+
+AF_BATCH_KIND getTransformBatchKind(const dim4 &iDims, const dim4 &tDims) {
+    static const int baseDim = 2;
+
+    dim_t iNd = iDims.ndims();
+    dim_t tNd = tDims.ndims();
+
+    if (iNd == baseDim && tNd == baseDim) { return AF_BATCH_NONE; }
+    if (iNd == baseDim && tNd <= 4) {
+        return AF_BATCH_RHS;
+    } else if (iNd <= 4 && tNd == baseDim) {
+        return AF_BATCH_LHS;
+    } else if (iNd <= 4 && tNd <= 4) {
+        bool dimsMatch     = true;
+        bool isInterleaved = true;
+        for (dim_t i = baseDim; i < 4; i++) {
+            dimsMatch &= (iDims[i] == tDims[i]);
+            isInterleaved &=
+                (iDims[i] == 1 || tDims[i] == 1 || iDims[i] == tDims[i]);
+        }
+        if (dimsMatch) { return AF_BATCH_SAME; }
+        return (isInterleaved ? AF_BATCH_DIFF : AF_BATCH_UNSUPPORTED);
+    } else {
+        return AF_BATCH_UNSUPPORTED;
+    }
+}
+
+void af_transform_common(af_array *out, const af_array in, const af_array tf,
+                         const dim_t odim0, const dim_t odim1,
+                         const af_interp_type method, const bool inverse,
+                         bool allocate_out) {
+    ARG_ASSERT(0, out != 0);  // *out (the af_array) can be null, but not out
+    ARG_ASSERT(1, in != 0);
+    ARG_ASSERT(2, tf != 0);
+
+    const ArrayInfo &t_info = getInfo(tf);
+    const ArrayInfo &i_info = getInfo(in);
+
+    const dim4 &idims    = i_info.dims();
+    const dim4 &tdims    = t_info.dims();
+    const af_dtype itype = i_info.getType();
+
+    // Assert type and interpolation
+    ARG_ASSERT(2, t_info.getType() == f32);
+    ARG_ASSERT(5, method == AF_INTERP_NEAREST || method == AF_INTERP_BILINEAR ||
+                      method == AF_INTERP_BILINEAR_COSINE ||
+                      method == AF_INTERP_BICUBIC ||
+                      method == AF_INTERP_BICUBIC_SPLINE ||
+                      method == AF_INTERP_LOWER);
+
+    // Assert dimensions
+    // Image can be 2D or higher
+    DIM_ASSERT(1, idims.elements() > 0);
+    DIM_ASSERT(1, idims.ndims() >= 2);
+
+    // Transform can be 3x2 for affine transform or 3x3 for perspective
+    // transform
+    DIM_ASSERT(2, (tdims[0] == 3 && (tdims[1] == 2 || tdims[1] == 3)));
+
+    // If transform is batched, the output dimensions must be specified
+    if (tdims[2] * tdims[3] > 1) {
+        ARG_ASSERT(3, odim0 > 0);
+        ARG_ASSERT(4, odim1 > 0);
+    }
+
+    // If idims[2] > 1 and tdims[2] > 1, then both must be equal
+    // else at least one of them must be 1
+    if (tdims[2] != 1 && idims[2] != 1) {
+        DIM_ASSERT(2, idims[2] == tdims[2]);
+    } else {
+        DIM_ASSERT(2, idims[2] == 1 || tdims[2] == 1);
+    }
+
+    // If idims[3] > 1 and tdims[3] > 1, then both must be equal
+    // else at least one of them must be 1
+    if (tdims[3] != 1 && idims[3] != 1) {
+        DIM_ASSERT(2, idims[3] == tdims[3]);
+    } else {
+        DIM_ASSERT(2, idims[3] == 1 || tdims[3] == 1);
+    }
+
+    const bool perspective = (tdims[1] == 3);
+    dim_t o0 = odim0, o1 = odim1, o2 = 0, o3 = 0;
+    if (odim0 * odim1 == 0) {
+        o0 = idims[0];
+        o1 = idims[1];
+    }
+
+    switch (getTransformBatchKind(idims, tdims)) {
+        case AF_BATCH_NONE:  // Both are exactly 2D
+        case AF_BATCH_LHS:   // Image is 3/4D, transform is 2D
+        case AF_BATCH_SAME:  // Both are 3/4D and have the same dims
+            o2 = idims[2];
+            o3 = idims[3];
+            break;
+        case AF_BATCH_RHS:  // Image is 2D, transform is 3/4D
+            o2 = tdims[2];
+            o3 = tdims[3];
+            break;
+        case AF_BATCH_DIFF:  // Both are 3/4D, but have different dims
+            o2 = idims[2] == 1 ? tdims[2] : idims[2];
+            o3 = idims[3] == 1 ? tdims[3] : idims[3];
+            break;
+        case AF_BATCH_UNSUPPORTED:
+        default:
+            AF_ERROR(
+                "Unsupported combination of batching parameters in "
+                "transform",
+                AF_ERR_NOT_SUPPORTED);
+            break;
+    }
+
+    const dim4 odims(o0, o1, o2, o3);
+    if (allocate_out) { *out = createHandle(odims, itype); }
+
+    // clang-format off
+    switch(itype) {
+    case f32: transform<float  >(out, in, tf, method, inverse, perspective);  break;
+    case f64: transform<double >(out, in, tf, method, inverse, perspective);  break;
+    case c32: transform<cfloat >(out, in, tf, method, inverse, perspective);  break;
+    case c64: transform<cdouble>(out, in, tf, method, inverse, perspective);  break;
+    case s32: transform<int    >(out, in, tf, method, inverse, perspective);  break;
+    case u32: transform<uint   >(out, in, tf, method, inverse, perspective);  break;
+    case s64: transform<intl   >(out, in, tf, method, inverse, perspective);  break;
+    case u64: transform<uintl  >(out, in, tf, method, inverse, perspective);  break;
+    case s16: transform<short  >(out, in, tf, method, inverse, perspective);  break;
+    case u16: transform<ushort >(out, in, tf, method, inverse, perspective);  break;
+    case s8:  transform<schar  >(out, in, tf, method, inverse, perspective);  break;
+    case u8:  transform<uchar  >(out, in, tf, method, inverse, perspective);  break;
+    case b8:  transform<char   >(out, in, tf, method, inverse, perspective);  break;
+    default:  TYPE_ERROR(1, itype);
+    }
+    // clang-format on
 }
 
 af_err af_transform(af_array *out, const af_array in, const af_array tf,
                     const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method, const bool inverse)
-{
+                    const af_interp_type method, const bool inverse) {
     try {
-        ArrayInfo t_info = getInfo(tf);
-        ArrayInfo i_info = getInfo(in);
-
-        af::dim4 idims = i_info.dims();
-        af::dim4 tdims = t_info.dims();
-        af_dtype itype = i_info.getType();
-
-        ARG_ASSERT(2, t_info.getType() == f32);
-        ARG_ASSERT(5, method == AF_INTERP_NEAREST || method == AF_INTERP_BILINEAR);
-        DIM_ASSERT(2, (tdims[0] == 3 && tdims[1] == 2));
-        DIM_ASSERT(1, idims.elements() > 0);
-        DIM_ASSERT(1, (idims.ndims() == 2 || idims.ndims() == 3));
-
-        dim_t o0 = odim0, o1 = odim1;
-        dim_t o2 = idims[2] * tdims[2];
-        if (odim0 * odim1 == 0) {
-            o0 = idims[0];
-            o1 = idims[1];
-        }
-        af::dim4 odims(o0, o1, o2, 1);
-
-        af_array output = 0;
-        switch(itype) {
-            case f32: output = transform<float  >(in, tf, odims, method, inverse);  break;
-            case f64: output = transform<double >(in, tf, odims, method, inverse);  break;
-            case c32: output = transform<cfloat >(in, tf, odims, method, inverse);  break;
-            case c64: output = transform<cdouble>(in, tf, odims, method, inverse);  break;
-            case s32: output = transform<int    >(in, tf, odims, method, inverse);  break;
-            case u32: output = transform<uint   >(in, tf, odims, method, inverse);  break;
-            case s64: output = transform<intl   >(in, tf, odims, method, inverse);  break;
-            case u64: output = transform<uintl  >(in, tf, odims, method, inverse);  break;
-            case u8:  output = transform<uchar  >(in, tf, odims, method, inverse);  break;
-            case b8:  output = transform<char   >(in, tf, odims, method, inverse);  break;
-            default:  TYPE_ERROR(1, itype);
-        }
-        std::swap(*out,output);
+        af_transform_common(out, in, tf, odim0, odim1, method, inverse, true);
     }
     CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_translate(af_array *out, const af_array in, const float trans0, const float trans1,
-                    const dim_t odim0, const dim_t odim1, const af_interp_type method)
-{
+af_err af_transform_v2(af_array *out, const af_array in, const af_array tf,
+                       const dim_t odim0, const dim_t odim1,
+                       const af_interp_type method, const bool inverse) {
+    try {
+        ARG_ASSERT(0, out != 0);  // need to dereference out in next call
+        af_transform_common(out, in, tf, odim0, odim1, method, inverse,
+                            *out == 0);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
 
+af_err af_translate(af_array *out, const af_array in, const float trans0,
+                    const float trans1, const dim_t odim0, const dim_t odim1,
+                    const af_interp_type method) {
     try {
-        static float trans_mat[6] = {1, 0, 0,
-                                     0, 1, 0};
-        trans_mat[2] = trans0;
-        trans_mat[5] = trans1;
+        float trans_mat[6] = {1, 0, 0, 0, 1, 0};
+        trans_mat[2]       = trans0;
+        trans_mat[5]       = trans1;
 
-        static af::dim4 tdims(3, 2, 1, 1);
+        const dim4 tdims(3, 2, 1, 1);
         af_array t = 0;
 
-        AF_CHECK(af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
+        AF_CHECK(
+            af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
         AF_CHECK(af_transform(out, in, t, odim0, odim1, method, true));
         AF_CHECK(af_release_array(t));
     }
@@ -94,38 +212,43 @@ af_err af_translate(af_array *out, const af_array in, const float trans0, const
     return AF_SUCCESS;
 }
 
-af_err af_scale(af_array *out, const af_array in, const float scale0, const float scale1,
-                const dim_t odim0, const dim_t odim1, const af_interp_type method)
-{
+af_err af_scale(af_array *out, const af_array in, const float scale0,
+                const float scale1, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
     try {
-        ArrayInfo i_info = getInfo(in);
-        af::dim4 idims = i_info.dims();
+        const ArrayInfo &i_info = getInfo(in);
+        dim4 idims              = i_info.dims();
 
         dim_t _odim0 = odim0, _odim1 = odim1;
         float sx, sy;
 
-        DIM_ASSERT(4, odim0 != 0);
-        DIM_ASSERT(5, odim1 != 0);
+        if (_odim0 == 0 || _odim1 == 0) {
+            DIM_ASSERT(2, scale0 != 0);
+            DIM_ASSERT(3, scale1 != 0);
 
-        if(_odim0 == 0 && _odim1 == 0) {
             sx = 1.f / scale0, sy = 1.f / scale1;
             _odim0 = idims[0] / sx;
             _odim1 = idims[1] / sy;
-        } else if (scale0 == 0 && scale1 == 0) {
-            sx = idims[0] / (float)_odim0;
-            sy = idims[1] / (float)_odim1;
+
+        } else if (scale0 == 0 || scale1 == 0) {
+            DIM_ASSERT(4, odim0 != 0);
+            DIM_ASSERT(5, odim1 != 0);
+
+            sx = idims[0] / static_cast<float>(_odim0);
+            sy = idims[1] / static_cast<float>(_odim1);
+
         } else {
             sx = 1.f / scale0, sy = 1.f / scale1;
         }
 
-        static float trans_mat[6] = {1, 0, 0,
-                                     0, 1, 0};
-        trans_mat[0] = sx;
-        trans_mat[4] = sy;
+        float trans_mat[6] = {1, 0, 0, 0, 1, 0};
+        trans_mat[0]       = sx;
+        trans_mat[4]       = sy;
 
-        static af::dim4 tdims(3, 2, 1, 1);
+        const dim4 tdims(3, 2, 1, 1);
         af_array t = 0;
-        AF_CHECK(af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
+        AF_CHECK(
+            af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
         AF_CHECK(af_transform(out, in, t, _odim0, _odim1, method, true));
         AF_CHECK(af_release_array(t));
     }
@@ -133,36 +256,35 @@ af_err af_scale(af_array *out, const af_array in, const float scale0, const floa
     return AF_SUCCESS;
 }
 
-af_err af_skew(af_array *out, const af_array in, const float skew0, const float skew1,
-               const dim_t odim0, const dim_t odim1, const af_interp_type method,
-               const bool inverse)
-{
+af_err af_skew(af_array *out, const af_array in, const float skew0,
+               const float skew1, const dim_t odim0, const dim_t odim1,
+               const af_interp_type method, const bool inverse) {
     try {
         float tx = std::tan(skew0);
         float ty = std::tan(skew1);
 
-        static float trans_mat[6] = {1, 0, 0,
-                                     0, 1, 0};
-        trans_mat[1] = ty;
-        trans_mat[3] = tx;
+        float trans_mat[6] = {1, 0, 0, 0, 1, 0};
+        trans_mat[1]       = ty;
+        trans_mat[3]       = tx;
 
-        if(inverse) {
-            if(tx == 0 || ty == 0) {
+        if (inverse) {
+            if (tx == 0 || ty == 0) {
                 trans_mat[1] = tx;
                 trans_mat[3] = ty;
             } else {
-                //calc_tranform_inverse(trans_mat);
-                //short cut of calc_transform_inverse
-                float d = 1.0f / (1.0f - tx * ty);
+                // calc_transform_inverse(trans_mat);
+                // shortcut for calc_transform_inverse
+                float d      = 1.0f / (1.0f - tx * ty);
                 trans_mat[0] = d;
                 trans_mat[1] = ty * d;
                 trans_mat[3] = tx * d;
                 trans_mat[4] = d;
             }
         }
-        static af::dim4 tdims(3, 2, 1, 1);
+        const dim4 tdims(3, 2, 1, 1);
         af_array t = 0;
-        AF_CHECK(af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
+        AF_CHECK(
+            af_create_array(&t, trans_mat, tdims.ndims(), tdims.get(), f32));
         AF_CHECK(af_transform(out, in, t, odim0, odim1, method, true));
         AF_CHECK(af_release_array(t));
     }
diff --git a/src/api/c/transform_coordinates.cpp b/src/api/c/transform_coordinates.cpp
new file mode 100644
index 0000000000..8bec381b6c
--- /dev/null
+++ b/src/api/c/transform_coordinates.cpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <blas.hpp>
+#include <common/err_common.hpp>
+#include <convolve.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include <vector>
+
+using af::dim4;
+using detail::arithOp;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::createHostDataArray;
+using detail::createSubArray;
+using detail::scalar;
+
+template<typename T>
+Array<T> multiplyIndexed(const Array<T> &lhs, const Array<T> &rhs,
+                         std::vector<af_seq> idx) {
+    Array<T> rhs_sub = createSubArray(rhs, idx);
+    Array<T> out     = createEmptyArray<T>(
+        dim4(lhs.dims()[0], rhs_sub.dims()[1], lhs.dims()[2], lhs.dims()[3]));
+    T alpha = scalar<T>(1.0);
+    T beta  = scalar<T>(0.0);
+    gemm(out, AF_MAT_NONE, AF_MAT_NONE, &alpha, lhs, rhs_sub, &beta);
+    return out;
+}
+
+template<typename T>
+static af_array transform_coordinates(const af_array &tf_, const float d0_,
+                                      const float d1_) {
+    af::dim4 h_dims(4, 3);
+    T zero = 0;
+    T one  = 1;
+    T d0   = static_cast<T>(d0_);
+    T d1   = static_cast<T>(d1_);
+    // clang-format off
+    T h_in[4 * 3] = {zero, zero,  d1,   d1,
+                     zero,   d0,  d0, zero,
+                      one,  one, one,  one};
+    // clang-format on
+
+    const Array<T> tf = getArray<T>(tf_);
+    Array<T> in       = createHostDataArray<T>(h_dims, h_in);
+
+    std::vector<af_seq> idx(2);
+    idx[0] = af_make_seq(0, 2, 1);
+
+    // w = 1.0 / matmul(in, tf(span, 2));
+    // iw = matmul(in, tf(span, 2));
+    idx[1]      = af_make_seq(2, 2, 1);
+    Array<T> iw = multiplyIndexed(in, tf, idx);
+
+    // xt = w * matmul(in, tf(span, 0));
+    // xt = matmul(in, tf(span, 0)) / iw;
+    idx[1] = af_make_seq(0, 0, 1);
+    Array<T> xt =
+        arithOp<T, af_div_t>(multiplyIndexed(in, tf, idx), iw, iw.dims());
+
+    // yt = w * matmul(in, tf(span, 1));
+    // yt = matmul(in, tf(span, 1)) / iw;
+    idx[1] = af_make_seq(1, 1, 1);
+    Array<T> yt =
+        arithOp<T, af_div_t>(multiplyIndexed(in, tf, idx), iw, iw.dims());
+
+    // return join(1, xt, yt)
+    Array<T> r = join(1, xt, yt);
+    return getHandle(r);
+}
+
+af_err af_transform_coordinates(af_array *out, const af_array tf,
+                                const float d0_, const float d1_) {
+    try {
+        const ArrayInfo &tfInfo = getInfo(tf);
+        dim4 tfDims             = tfInfo.dims();
+        ARG_ASSERT(1,
+                   (tfDims[0] == 3 && tfDims[1] == 3 && tfDims.ndims() == 2));
+
+        af_array output;
+        af_dtype type = tfInfo.getType();
+        switch (type) {
+            case f32:
+                output = transform_coordinates<float>(tf, d0_, d1_);
+                break;
+            case f64:
+                output = transform_coordinates<double>(tf, d0_, d1_);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/transpose.cpp b/src/api/c/transpose.cpp
index c63ab62c7e..9d2fd48cbd 100644
--- a/src/api/c/transpose.cpp
+++ b/src/api/c/transpose.cpp
@@ -7,53 +7,77 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/blas.h>
-#include <af/data.h>
-#include <err_common.hpp>
-#include <handle.hpp>
 #include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <handle.hpp>
 #include <transpose.hpp>
+#include <af/arith.h>
+#include <af/blas.h>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
 using af::dim4;
-using namespace detail;
+using arrayfire::common::half;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T>
-static inline af_array trs(const af_array in, const bool conjugate)
-{
+static inline af_array trs(const af_array in, const bool conjugate) {
     return getHandle<T>(detail::transpose<T>(getArray<T>(in), conjugate));
 }
 
-af_err af_transpose(af_array *out, af_array in, const bool conjugate)
-{
+af_err af_transpose(af_array* out, af_array in, const bool conjugate) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        af::dim4 dims = info.dims();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
+
+        if (dims.elements() == 0) { return af_retain_array(out, in); }
 
-        if (dims[0]==1 || dims[1]==1) {
-            // for a vector OR a batch of vectors
-            // we can use modDims to transpose
-            af::dim4 outDims(dims[1],dims[0],dims[2],dims[3]);
-            return af_moddims(out, in, outDims.ndims(), outDims.get());
+        if (dims[0] == 1 || dims[1] == 1) {
+            af::dim4 outDims(dims[1], dims[0], dims[2], dims[3]);
+            if (conjugate) {
+                af_array temp = 0;
+                AF_CHECK(af_conjg(&temp, in));
+                AF_CHECK(af_moddims(out, temp, outDims.ndims(), outDims.get()));
+                AF_CHECK(af_release_array(temp));
+                return AF_SUCCESS;
+            } else {
+                // for a vector OR a batch of vectors
+                // we can use modDims to transpose
+                AF_CHECK(af_moddims(out, in, outDims.ndims(), outDims.get()));
+                return AF_SUCCESS;
+            }
         }
 
         af_array output;
-        switch(type) {
-            case f32: output = trs<float>  (in, conjugate);    break;
-            case c32: output = trs<cfloat> (in, conjugate);    break;
-            case f64: output = trs<double> (in, conjugate);    break;
-            case c64: output = trs<cdouble>(in, conjugate);    break;
-            case b8 : output = trs<char>   (in, conjugate);    break;
-            case s32: output = trs<int>    (in, conjugate);    break;
-            case u32: output = trs<uint>   (in, conjugate);    break;
-            case u8 : output = trs<uchar>  (in, conjugate);    break;
-            case s64: output = trs<intl>   (in, conjugate);    break;
-            case u64: output = trs<uintl>  (in, conjugate);    break;
-            default : TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: output = trs<float>(in, conjugate); break;
+            case c32: output = trs<cfloat>(in, conjugate); break;
+            case f64: output = trs<double>(in, conjugate); break;
+            case c64: output = trs<cdouble>(in, conjugate); break;
+            case b8: output = trs<char>(in, conjugate); break;
+            case s32: output = trs<int>(in, conjugate); break;
+            case u32: output = trs<uint>(in, conjugate); break;
+            case s8: output = trs<schar>(in, conjugate); break;
+            case u8: output = trs<uchar>(in, conjugate); break;
+            case s64: output = trs<intl>(in, conjugate); break;
+            case u64: output = trs<uintl>(in, conjugate); break;
+            case s16: output = trs<short>(in, conjugate); break;
+            case u16: output = trs<ushort>(in, conjugate); break;
+            case f16: output = trs<half>(in, conjugate); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*out,output);
+        std::swap(*out, output);
     }
     CATCHALL;
 
@@ -61,37 +85,38 @@ af_err af_transpose(af_array *out, af_array in, const bool conjugate)
 }
 
 template<typename T>
-static inline void transpose_inplace(af_array in, const bool conjugate)
-{
-    return detail::transpose_inplace<T>(getWritableArray<T>(in), conjugate);
+static inline void transpose_inplace(af_array in, const bool conjugate) {
+    return detail::transpose_inplace<T>(getArray<T>(in), conjugate);
 }
 
-af_err af_transpose_inplace(af_array in, const bool conjugate)
-{
+af_err af_transpose_inplace(af_array in, const bool conjugate) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        af::dim4 dims = info.dims();
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 dims         = info.dims();
 
         // InPlace only works on square matrices
         DIM_ASSERT(0, dims[0] == dims[1]);
 
         // If singleton element
-        if(dims[0] == 1)
-            return AF_SUCCESS;
+        if (dims[0] == 1) { return AF_SUCCESS; }
 
-        switch(type) {
-            case f32: transpose_inplace<float>  (in, conjugate);    break;
-            case c32: transpose_inplace<cfloat> (in, conjugate);    break;
-            case f64: transpose_inplace<double> (in, conjugate);    break;
-            case c64: transpose_inplace<cdouble>(in, conjugate);    break;
-            case b8 : transpose_inplace<char>   (in, conjugate);    break;
-            case s32: transpose_inplace<int>    (in, conjugate);    break;
-            case u32: transpose_inplace<uint>   (in, conjugate);    break;
-            case u8 : transpose_inplace<uchar>  (in, conjugate);    break;
-            case s64: transpose_inplace<intl>   (in, conjugate);    break;
-            case u64: transpose_inplace<uintl>  (in, conjugate);    break;
-            default : TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: transpose_inplace<float>(in, conjugate); break;
+            case c32: transpose_inplace<cfloat>(in, conjugate); break;
+            case f64: transpose_inplace<double>(in, conjugate); break;
+            case c64: transpose_inplace<cdouble>(in, conjugate); break;
+            case b8: transpose_inplace<char>(in, conjugate); break;
+            case s32: transpose_inplace<int>(in, conjugate); break;
+            case u32: transpose_inplace<uint>(in, conjugate); break;
+            case s8: transpose_inplace<schar>(in, conjugate); break;
+            case u8: transpose_inplace<uchar>(in, conjugate); break;
+            case s64: transpose_inplace<intl>(in, conjugate); break;
+            case u64: transpose_inplace<uintl>(in, conjugate); break;
+            case s16: transpose_inplace<short>(in, conjugate); break;
+            case u16: transpose_inplace<ushort>(in, conjugate); break;
+            case f16: transpose_inplace<half>(in, conjugate); break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
diff --git a/src/api/c/type_util.cpp b/src/api/c/type_util.cpp
index 750932c9cd..d409c0d868 100644
--- a/src/api/c/type_util.cpp
+++ b/src/api/c/type_util.cpp
@@ -9,17 +9,39 @@
 
 #include <type_util.hpp>
 
-const char *getName(af_dtype type)
-{
-    switch(type) {
-    case f32: return "float";
-    case f64: return "double";
-    case c32: return "complex float";
-    case c64: return "complex double";
-    case u32: return "unsigned int";
-    case s32: return "int";
-    case u8: return "unsigned char";
-    case b8: return "bool";
-    default: return "unknown type";
+#include <common/err_common.hpp>
+#include <af/half.h>
+#include <af/util.h>
+
+size_t size_of(af_dtype type) {
+    try {
+        switch (type) {
+            case f32: return sizeof(float);
+            case f64: return sizeof(double);
+            case s32: return sizeof(int);
+            case u32: return sizeof(unsigned);
+            case s8: return sizeof(signed char);
+            case u8: return sizeof(unsigned char);
+            case b8: return sizeof(unsigned char);
+            case c32: return sizeof(float) * 2;
+            case c64: return sizeof(double) * 2;
+            case s16: return sizeof(short);
+            case u16: return sizeof(unsigned short);
+            case s64: return sizeof(long long);
+            case u64: return sizeof(unsigned long long);
+            case f16: return sizeof(af_half);
+            default: TYPE_ERROR(1, type);
+        }
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_get_size_of(size_t *size, af_dtype type) {
+    try {
+        *size = size_of(type);
+        return AF_SUCCESS;
     }
+    CATCHALL;
 }
diff --git a/src/api/c/type_util.hpp b/src/api/c/type_util.hpp
index e4ecf8aebe..8e6a7ff9cf 100644
--- a/src/api/c/type_util.hpp
+++ b/src/api/c/type_util.hpp
@@ -8,26 +8,27 @@
  ********************************************************/
 
 #pragma once
-#include <string>
 #include <af/defines.h>
 
-const char *getName(af_dtype type);
-
-//uchar to number converters
+// uchar to number converters
 template<typename T>
-struct ToNum
-{
+struct ToNum {
     inline T operator()(T val) { return val; }
 };
 
 template<>
-struct ToNum<unsigned char>
-{
+struct ToNum<signed char> {
+    inline int operator()(signed char val) { return static_cast<int>(val); }
+};
+
+template<>
+struct ToNum<unsigned char> {
     inline int operator()(unsigned char val) { return static_cast<int>(val); }
 };
 
 template<>
-struct ToNum<char>
-{
+struct ToNum<char> {
     inline int operator()(char val) { return static_cast<int>(val); }
 };
+
+size_t size_of(af_dtype type);
diff --git a/src/api/c/unary.cpp b/src/api/c/unary.cpp
index f1e86ffaba..505c831e74 100644
--- a/src/api/c/unary.cpp
+++ b/src/api/c/unary.cpp
@@ -7,34 +7,71 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/defines.h>
-#include <af/arith.h>
-#include <af/data.h>
-#include <ArrayInfo.hpp>
-#include <optypes.hpp>
-#include <implicit.hpp>
-#include <err_common.hpp>
-#include <handle.hpp>
+// _USE_MATH_DEFINES must be defined before <cmath> is first included
+#if defined(_WIN32) || defined(_MSC_VER)
+#define _USE_MATH_DEFINES
+#endif
+#include <cmath>
+
+#include <arith.hpp>
 #include <backend.hpp>
-#include <unary.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <complex.hpp>
+#include <handle.hpp>
 #include <implicit.hpp>
+#include <logic.hpp>
+#include <optypes.hpp>
+#include <unary.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
 
-using namespace detail;
+using af::dim4;
+using arrayfire::common::half;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::cplx;
+using detail::createValueArray;
+using detail::imag;
+using detail::intl;
+using detail::logicOp;
+using detail::real;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
 
 template<typename T, af_op_t op>
-static inline af_array unaryOp(const af_array in)
-{
+static inline af_array unaryOp(const af_array in) {
     af_array res = getHandle(unaryOp<T, op>(castArray<T>(in)));
     return res;
 }
 
+template<typename Tc, typename Tr, af_op_t op>
+struct unaryOpCplxFun;
+
+template<typename Tc, typename Tr, af_op_t op>
+static inline Array<Tc> unaryOpCplx(const Array<Tc> &in) {
+    return unaryOpCplxFun<Tc, Tr, op>()(in);
+}
+
+template<typename Tc, typename Tr, af_op_t op>
+static inline af_array unaryOpCplx(const af_array in) {
+    return getHandle(unaryOpCplx<Tc, Tr, op>(castArray<Tc>(in)));
+}
+
 template<af_op_t op>
-static af_err af_unary(af_array *out, const af_array in)
-{
+static af_err af_unary(af_array *out, const af_array in) {
     try {
-
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
         ARG_ASSERT(1, in_info.isReal());
 
         af_dtype in_type = in_info.getType();
@@ -42,12 +79,14 @@ static af_err af_unary(af_array *out, const af_array in)
 
         // Convert all inputs to floats / doubles
         af_dtype type = implicit(in_type, f32);
+        if (in_type == f16) { type = f16; }
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
         switch (type) {
-        case f32 : res = unaryOp<float  , op>(in); break;
-        case f64 : res = unaryOp<double , op>(in); break;
-        default:
-            TYPE_ERROR(1, in_type); break;
+            case f16: res = unaryOp<half, op>(in); break;
+            case f32: res = unaryOp<float, op>(in); break;
+            case f64: res = unaryOp<double, op>(in); break;
+            default: TYPE_ERROR(1, in_type); break;
         }
 
         std::swap(*out, res);
@@ -56,81 +95,534 @@ static af_err af_unary(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-#define UNARY(fn)                                       \
-    af_err af_##fn(af_array *out, const af_array in)    \
-    {                                                   \
-        return af_unary<af_##fn##_t>(out, in);          \
-    }
+template<af_op_t op>
+static af_err af_unary_complex(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
 
+        af_dtype in_type = in_info.getType();
+        af_array res;
+
+        // Convert all inputs to floats / doubles
+        af_dtype type = implicit(in_type, f32);
+        if (in_type == f16) { type = f16; }
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
+
+        switch (type) {
+            case f32: res = unaryOp<float, op>(in); break;
+            case f64: res = unaryOp<double, op>(in); break;
+            case c32: res = unaryOpCplx<cfloat, float, op>(in); break;
+            case c64: res = unaryOpCplx<cdouble, double, op>(in); break;
+            case f16: res = unaryOp<half, op>(in); break;
+            default: TYPE_ERROR(1, in_type); break;
+        }
 
-UNARY(sin)
-UNARY(cos)
-UNARY(tan)
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-UNARY(asin)
-UNARY(acos)
-UNARY(atan)
+#define UNARY_FN(name, opcode)                           \
+    af_err af_##name(af_array *out, const af_array in) { \
+        return af_unary<af_##opcode##_t>(out, in);       \
+    }
 
-UNARY(sinh)
-UNARY(cosh)
-UNARY(tanh)
+#define UNARY(fn) UNARY_FN(fn, fn)
 
-UNARY(asinh)
-UNARY(acosh)
-UNARY(atanh)
+#define UNARY_COMPLEX(fn)                              \
+    af_err af_##fn(af_array *out, const af_array in) { \
+        return af_unary_complex<af_##fn##_t>(out, in); \
+    }
 
 UNARY(trunc)
-UNARY(sign)
+UNARY_FN(sign, signbit)
 UNARY(round)
 UNARY(floor)
 UNARY(ceil)
 
-UNARY(exp)
+UNARY(sigmoid)
 UNARY(expm1)
 UNARY(erf)
 UNARY(erfc)
 
-UNARY(log)
 UNARY(log10)
 UNARY(log1p)
 UNARY(log2)
 
-UNARY(sqrt)
 UNARY(cbrt)
+UNARY(rsqrt)
 
 UNARY(tgamma)
 UNARY(lgamma)
 
+UNARY_COMPLEX(acosh)
+UNARY_COMPLEX(acos)
+UNARY_COMPLEX(asin)
+UNARY_COMPLEX(asinh)
+UNARY_COMPLEX(atan)
+UNARY_COMPLEX(atanh)
+UNARY_COMPLEX(cos)
+UNARY_COMPLEX(cosh)
+UNARY_COMPLEX(exp)
+UNARY_COMPLEX(log)
+UNARY_COMPLEX(sin)
+UNARY_COMPLEX(sinh)
+UNARY_COMPLEX(sqrt)
+UNARY_COMPLEX(tan)
+UNARY_COMPLEX(tanh)
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_exp_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // exp(a + ib)
+        // --> exp(a) * exp(ib)
+        // --> exp(a) * (cos(b) + i * sin(b))
+        // --> exp(a) * cos(b) + i * exp(a) * sin(b)
+
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        Array<Tr> exp_a = unaryOp<Tr, af_exp_t>(a);
+        Array<Tr> cos_b = unaryOp<Tr, af_cos_t>(b);
+        Array<Tr> sin_b = unaryOp<Tr, af_sin_t>(b);
+
+        // exp(a) * cos(b)
+        Array<Tr> a_out = arithOp<Tr, af_mul_t>(exp_a, cos_b, exp_a.dims());
+        // exp(a) * sin(b)
+        Array<Tr> b_out = arithOp<Tr, af_mul_t>(exp_a, sin_b, exp_a.dims());
+
+        // exp(a) * cos(b) + i * exp(a) * sin(b)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_log_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // log(a + ib)
+        // using r = abs(a + ib), phi = arg(a + ib)
+        // --> log(r * exp(i * phi))
+        // --> log(r) + i * phi
+
+        // convert cartesian to polar
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // phi = arg(a + ib)
+        // --> phi = atan2(b, a)
+        Array<Tr> phi = arithOp<Tr, af_atan2_t>(b, a, b.dims());
+
+        Array<Tr> r = detail::abs<Tr>(z);
+
+        // compute log
+        // log(r)
+        Array<Tr> a_out = unaryOp<Tr, af_log_t>(r);
+        // phi
+        const Array<Tr> &b_out = phi;
+
+        // log(r) + i * phi
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_sin_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // sin(a + ib)
+        // --> sin(a) * cos(ib) + cos(a) * sin(ib)
+        // --> sin(a) * cosh(b) + i * cos(a) * sinh(b)
+
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // compute sin
+        Array<Tr> sin_a  = unaryOp<Tr, af_sin_t>(a);
+        Array<Tr> cos_a  = unaryOp<Tr, af_cos_t>(a);
+        Array<Tr> sinh_b = unaryOp<Tr, af_sinh_t>(b);
+        Array<Tr> cosh_b = unaryOp<Tr, af_cosh_t>(b);
+
+        // sin(a) * cosh(b)
+        Array<Tr> a_out = arithOp<Tr, af_mul_t>(sin_a, cosh_b, sin_a.dims());
+        // cos(a) * sinh(b)
+        Array<Tr> b_out = arithOp<Tr, af_mul_t>(cos_a, sinh_b, cos_a.dims());
+
+        // sin(a) * cosh(b) + i * cos(a) * sinh(b)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_cos_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // cos(a + ib)
+        // --> cos(a) * cos(ib) - sin(a) * sin(ib)
+        // --> cos(a) * cosh(b) - i * sin(a) * sinh(b)
+
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // compute cos
+        Array<Tr> sin_a  = unaryOp<Tr, af_sin_t>(a);
+        Array<Tr> cos_a  = unaryOp<Tr, af_cos_t>(a);
+        Array<Tr> sinh_b = unaryOp<Tr, af_sinh_t>(b);
+        Array<Tr> cosh_b = unaryOp<Tr, af_cosh_t>(b);
+
+        // cos(a) * cosh(b)
+        Array<Tr> a_out = arithOp<Tr, af_mul_t>(cos_a, cosh_b, cos_a.dims());
+        // -1
+        Array<Tr> neg_one = createValueArray<Tr>(a_out.dims(), -1);
+        // sin(a) * sinh(b)
+        Array<Tr> b_out_neg =
+            arithOp<Tr, af_mul_t>(sin_a, sinh_b, cos_a.dims());
+        // -1 * sin(a) * sinh(b)
+        Array<Tr> b_out =
+            arithOp<Tr, af_mul_t>(neg_one, b_out_neg, b_out_neg.dims());
+        // cos(a) * cosh(b) - i * sin(a) * sinh(b)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_tan_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // tan(a + ib) = sin(a + ib) / cos(a + ib)
+        Array<Tc> sin_z = unaryOpCplx<Tc, Tr, af_sin_t>(z);
+        Array<Tc> cos_z = unaryOpCplx<Tc, Tr, af_cos_t>(z);
+        return arithOp<Tc, af_div_t>(sin_z, cos_z, sin_z.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_sinh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // sinh(a + ib)
+        // --> sinh(a) * cosh(ib) + cosh(a) * sinh(ib)
+        // --> sinh(a) * cos(b) + i * cosh(a) * sin(b)
+
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // compute sinh
+        Array<Tr> sinh_a = unaryOp<Tr, af_sinh_t>(a);
+        Array<Tr> cosh_a = unaryOp<Tr, af_cosh_t>(a);
+        Array<Tr> sin_b  = unaryOp<Tr, af_sin_t>(b);
+        Array<Tr> cos_b  = unaryOp<Tr, af_cos_t>(b);
+
+        // sinh(a) * cos(b)
+        Array<Tr> a_out = arithOp<Tr, af_mul_t>(sinh_a, cos_b, sinh_a.dims());
+        // cosh(a) * sin(b)
+        Array<Tr> b_out = arithOp<Tr, af_mul_t>(cosh_a, sin_b, cosh_a.dims());
+
+        // sinh(a) * cos(b) + i * cosh(a) * sin(b)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_cosh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // cosh(a + ib)
+        // --> cosh(a) * cosh(ib) + sinh(a) * sinh(ib)
+        // --> cosh(a) * cos(b) + i * sinh(a) * sin(b)
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // compute cosh
+        Array<Tr> sinh_a = unaryOp<Tr, af_sinh_t>(a);
+        Array<Tr> cosh_a = unaryOp<Tr, af_cosh_t>(a);
+        Array<Tr> sin_b  = unaryOp<Tr, af_sin_t>(b);
+        Array<Tr> cos_b  = unaryOp<Tr, af_cos_t>(b);
+
+        // cosh(a) * cos(b)
+        Array<Tr> a_out = arithOp<Tr, af_mul_t>(cosh_a, cos_b, cosh_a.dims());
+        // sinh(a) * sin(b)
+        Array<Tr> b_out = arithOp<Tr, af_mul_t>(sinh_a, sin_b, sinh_a.dims());
+
+        // cosh(a) * cos(b) + i * sinh(a) * sin(b)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_tanh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // tanh(a + ib) = sinh(a + ib) / cosh(a + ib)
+        Array<Tc> sinh_z = unaryOpCplx<Tc, Tr, af_sinh_t>(z);
+        Array<Tc> cosh_z = unaryOpCplx<Tc, Tr, af_cosh_t>(z);
+        return arithOp<Tc, af_div_t>(sinh_z, cosh_z, sinh_z.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_acosh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // keep sqrt(z+1)*sqrt(z-1); sqrt(z^2-1) differs across branch cuts
+        // acosh(z) = log(z+sqrt(z+1)*sqrt(z-1))
+
+        Array<Tc> one = createValueArray<Tc>(z.dims(), scalar<Tc>(1.0));
+
+        // (z + 1)
+        Array<Tc> z_plus_one = arithOp<Tc, af_add_t>(z, one, z.dims());
+        // (z - 1)
+        Array<Tc> z_minus_one = arithOp<Tc, af_sub_t>(z, one, z.dims());
+        // sqrt(z + 1)
+        Array<Tc> sqrt_z_plus_one = unaryOpCplx<Tc, Tr, af_sqrt_t>(z_plus_one);
+        // sqrt(z - 1)
+        Array<Tc> sqrt_z_minus_one =
+            unaryOpCplx<Tc, Tr, af_sqrt_t>(z_minus_one);
+        // sqrt(z + 1) * sqrt(z - 1)
+        Array<Tc> sqrt_prod = arithOp<Tc, af_mul_t>(
+            sqrt_z_plus_one, sqrt_z_minus_one, sqrt_z_plus_one.dims());
+        // z + sqrt(z + 1) * sqrt(z - 1)
+        Array<Tc> w = arithOp<Tc, af_add_t>(z, sqrt_prod, z.dims());
+        // log(z + sqrt(z + 1) * sqrt(z - 1))
+        return unaryOpCplx<Tc, Tr, af_log_t>(w);
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_asinh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // asinh(z) = log(z+sqrt(z^2+1))
+        Array<Tc> one = createValueArray<Tc>(z.dims(), scalar<Tc>(1.0));
+
+        // z^2
+        Array<Tc> z2 = arithOp<Tc, af_mul_t>(z, z, z.dims());
+        // z^2 + 1
+        Array<Tc> z2_plus_one = arithOp<Tc, af_add_t>(z2, one, z.dims());
+        // sqrt(z^2 + 1)
+        Array<Tc> sqrt_z2_plus_one =
+            unaryOpCplx<Tc, Tr, af_sqrt_t>(z2_plus_one);
+        // z + sqrt(z^2 + 1)
+        Array<Tc> w = arithOp<Tc, af_add_t>(z, sqrt_z2_plus_one, z.dims());
+        // log(z + sqrt(z^2 + 1))
+        return unaryOpCplx<Tc, Tr, af_log_t>(w);
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_atanh_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // atanh(z) = 0.5*(log(1+z)-log(1-z))
+        Array<Tc> one =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(1.0, 0.0));
+        Array<Tc> half =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.5, 0.0));
+
+        // (1 + z)
+        Array<Tc> one_plus_z = arithOp<Tc, af_add_t>(one, z, one.dims());
+        // (1 - z)
+        Array<Tc> one_minus_z = arithOp<Tc, af_sub_t>(one, z, one.dims());
+        // log(1 + z)
+        Array<Tc> log_one_plus_z = unaryOpCplx<Tc, Tr, af_log_t>(one_plus_z);
+        // log(1 - z)
+        Array<Tc> log_one_minus_z = unaryOpCplx<Tc, Tr, af_log_t>(one_minus_z);
+        // (log(1 + z) - log(1 - z))
+        Array<Tc> w = arithOp<Tc, af_sub_t>(log_one_plus_z, log_one_minus_z,
+                                            log_one_plus_z.dims());
+        // 0.5 * (log(1 + z) - log(1 - z))
+        return arithOp<Tc, af_mul_t>(w, half, w.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_acos_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // acos(z) = pi/2 + i*log(i*z+sqrt(1-z^2))
+        // --> pi/2 - asin(z)
+
+        Array<Tc> one =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(1.0, 0.0));
+
+        Array<Tc> i = createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.0, 1.0));
+        Array<Tc> pi_half =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(M_PI_2, 0.0));
+
+        // z^2
+        Array<Tc> z2 = arithOp<Tc, af_mul_t>(z, z, z.dims());
+        // 1 - z^2
+        Array<Tc> one_minus_z2 = arithOp<Tc, af_sub_t>(one, z2, one.dims());
+        // sqrt(1 - z^2)
+        Array<Tc> sqrt_one_minus_z2 =
+            unaryOpCplx<Tc, Tr, af_sqrt_t>(one_minus_z2);
+        // i*z
+        Array<Tc> iz = arithOp<Tc, af_mul_t>(i, z, z.dims());
+        // (i*z + sqrt(1 - z^2))
+        Array<Tc> w = arithOp<Tc, af_add_t>(iz, sqrt_one_minus_z2, iz.dims());
+        // log(i*z + sqrt(1 - z^2))
+        Array<Tc> log_w = unaryOpCplx<Tc, Tr, af_log_t>(w);
+        // i*log(i*z + sqrt(1 - z^2))
+        Array<Tc> i_log_w = arithOp<Tc, af_mul_t>(i, log_w, i.dims());
+        // pi/2 + i*log(i*z + sqrt(1 - z^2))
+        return arithOp<Tc, af_add_t>(pi_half, i_log_w, pi_half.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_asin_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // asin(z) = -i*log(i*z+sqrt(1-z^2))
+
+        Array<Tc> one =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(1.0, 0.0));
+        Array<Tc> i = createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.0, 1.0));
+        Array<Tc> minus_i =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.0, -1.0));
+
+        // z^2
+        Array<Tc> z2 = arithOp<Tc, af_mul_t>(z, z, z.dims());
+        // 1 - z^2
+        Array<Tc> one_minus_z2 = arithOp<Tc, af_sub_t>(one, z2, one.dims());
+        // sqrt(1 - z^2)
+        Array<Tc> sqrt_one_minus_z2 =
+            unaryOpCplx<Tc, Tr, af_sqrt_t>(one_minus_z2);
+        // i*z
+        Array<Tc> iz = arithOp<Tc, af_mul_t>(i, z, z.dims());
+        // (i*z + sqrt(1 - z^2))
+        Array<Tc> w = arithOp<Tc, af_add_t>(iz, sqrt_one_minus_z2, iz.dims());
+        // log(i*z + sqrt(1 - z^2))
+        Array<Tc> log_w = unaryOpCplx<Tc, Tr, af_log_t>(w);
+        // -i*log(i*z + sqrt(1 - z^2))
+        return arithOp<Tc, af_mul_t>(minus_i, log_w, minus_i.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_atan_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // atan(z) = 0.5 * i * (log(1-i*z)-log(1+i*z))
+        Array<Tc> one =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(1.0, 0.0));
+        Array<Tc> i = createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.0, 1.0));
+
+        // 0.5 * i
+        Array<Tc> i_half =
+            createValueArray<Tc>(z.dims(), scalar<Tc, Tr>(0.0, 0.5));
+        // i*z
+        Array<Tc> iz = arithOp<Tc, af_mul_t>(i, z, z.dims());
+        // 1 - i*z
+        Array<Tc> one_minus_iz = arithOp<Tc, af_sub_t>(one, iz, z.dims());
+        // 1 + i*z
+        Array<Tc> one_plus_iz = arithOp<Tc, af_add_t>(one, iz, z.dims());
+        // log(1 - i*z)
+        Array<Tc> log_minus = unaryOpCplx<Tc, Tr, af_log_t>(one_minus_iz);
+        // log(1 + i*z)
+        Array<Tc> log_plus = unaryOpCplx<Tc, Tr, af_log_t>(one_plus_iz);
+        // log(1 - i*z) - log(1 + i*z)
+        Array<Tc> log_diff =
+            arithOp<Tc, af_sub_t>(log_minus, log_plus, z.dims());
+        // 0.5 * i * (log(1 - i*z) - log(1 + i*z))
+        return arithOp<Tc, af_mul_t>(i_half, log_diff, z.dims());
+    }
+};
+
+template<typename Tc, typename Tr>
+struct unaryOpCplxFun<Tc, Tr, af_sqrt_t> {
+    Array<Tc> operator()(const Array<Tc> &z) {
+        // sqrt(a + ib)
+        // using r = abs(a + ib), phi = arg(a + ib)
+        // --> sqrt(r * exp(i * phi))
+        // --> sqrt(r) * exp(i * phi / 2)
+        // --> sqrt(r) * cos(phi/2) + i * sqrt(r) * sin(phi/2)
+
+        // convert cartesian to polar
+        Array<Tr> a = real<Tr, Tc>(z);
+        Array<Tr> b = imag<Tr, Tc>(z);
+
+        // phi = arg(a + ib)
+        // --> phi = atan2(b, a)
+        Array<Tr> phi = arithOp<Tr, af_atan2_t>(b, a, b.dims());
+        Array<Tr> r   = detail::abs<Tr>(z);
+
+        // compute sqrt
+        Array<Tr> two = createValueArray<Tr>(phi.dims(), 2.0);
+
+        // sqrt(r)
+        Array<Tr> r_out = unaryOp<Tr, af_sqrt_t>(r);
+
+        // phi/2
+        Array<Tr> phi_out = arithOp<Tr, af_div_t>(phi, two, phi.dims());
+
+        // convert polar to cartesian
+        // cos(phi/2)
+        Array<Tr> a_out_unit = unaryOp<Tr, af_cos_t>(phi_out);
+        // sin(phi/2)
+        Array<Tr> b_out_unit = unaryOp<Tr, af_sin_t>(phi_out);
+        // sqrt(r) * cos(phi/2)
+        Array<Tr> a_out =
+            arithOp<Tr, af_mul_t>(r_out, a_out_unit, r_out.dims());
+        // sqrt(r) * sin(phi/2)
+        Array<Tr> b_out =
+            arithOp<Tr, af_mul_t>(r_out, b_out_unit, r_out.dims());
+
+        // sqrt(r) * cos(phi/2) + i * sqrt(r) * sin(phi/2)
+        return cplx<Tc, Tr>(a_out, b_out, a_out.dims());
+    }
+};
 
-af_err af_not(af_array *out, const af_array in)
-{
+af_err af_not(af_array *out, const af_array in) {
     try {
-
         af_array tmp;
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
-        AF_CHECK(af_constant(&tmp, 0,
-                             in_info.ndims(),
-                             in_info.dims().get(), in_info.getType()));
+        AF_CHECK(af_constant(&tmp, 0, in_info.ndims(), in_info.dims().get(),
+                             in_info.getType()));
 
-        AF_CHECK(af_neq(out, in, tmp, false));
+        AF_CHECK(af_eq(out, in, tmp, false));
 
         AF_CHECK(af_release_array(tmp));
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_arg(af_array *out, const af_array in)
-{
+template<typename T>
+static inline af_array bitOpNot(const af_array in) {
+    return unaryOp<T, af_bitnot_t>(in);
+}
+
+af_err af_bitnot(af_array *out, const af_array in) {
     try {
+        const ArrayInfo &iinfo = getInfo(in);
+        const af_dtype type    = iinfo.getType();
+
+        dim4 odims = iinfo.dims();
+
+        if (odims.ndims() == 0) {
+            return af_create_handle(out, 0, nullptr, type);
+        }
+
+        af_array res;
+        switch (type) {
+            case s32: res = bitOpNot<int>(in); break;
+            case u32: res = bitOpNot<uint>(in); break;
+            case s8: res = bitOpNot<schar>(in); break;
+            case u8: res = bitOpNot<uchar>(in); break;
+            case b8: res = bitOpNot<char>(in); break;
+            case s64: res = bitOpNot<intl>(in); break;
+            case u64: res = bitOpNot<uintl>(in); break;
+            case s16: res = bitOpNot<short>(in); break;
+            case u16: res = bitOpNot<ushort>(in); break;
+            default: TYPE_ERROR(0, type);
+        }
+
+        std::swap(*out, res);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-        ArrayInfo in_info = getInfo(in);
+af_err af_arg(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
         if (!in_info.isComplex()) {
-            return af_constant(out, 0,
-                               in_info.ndims(),
-                               in_info.dims().get(), in_info.getType());
+            return af_constant(out, 0, in_info.ndims(), in_info.dims().get(),
+                               in_info.getType());
         }
 
         af_array real;
@@ -143,40 +635,38 @@ af_err af_arg(af_array *out, const af_array in)
 
         AF_CHECK(af_release_array(real));
         AF_CHECK(af_release_array(imag));
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_pow2(af_array *out, const af_array in)
-{
+af_err af_pow2(af_array *out, const af_array in) {
     try {
-
         af_array two;
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
-        AF_CHECK(af_constant(&two, 2,
-                             in_info.ndims(),
-                             in_info.dims().get(), in_info.getType()));
+        AF_CHECK(af_constant(&two, 2, in_info.ndims(), in_info.dims().get(),
+                             in_info.getType()));
 
         AF_CHECK(af_pow(out, two, in, false));
 
         AF_CHECK(af_release_array(two));
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
-af_err af_factorial(af_array *out, const af_array in)
-{
+af_err af_factorial(af_array *out, const af_array in) {
     try {
-
         af_array one;
-        ArrayInfo in_info = getInfo(in);
+        const ArrayInfo &in_info = getInfo(in);
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
-        AF_CHECK(af_constant(&one, 1,
-                             in_info.ndims(),
-                             in_info.dims().get(), in_info.getType()));
+        AF_CHECK(af_constant(&one, 1, in_info.ndims(), in_info.dims().get(),
+                             in_info.getType()));
 
         af_array inp1;
         AF_CHECK(af_add(&inp1, one, in, false));
@@ -185,37 +675,70 @@ af_err af_factorial(af_array *out, const af_array in)
 
         AF_CHECK(af_release_array(one));
         AF_CHECK(af_release_array(inp1));
-    } CATCHALL;
+    }
+    CATCHALL;
 
     return AF_SUCCESS;
 }
 
 template<typename T, af_op_t op>
-static inline af_array checkOp(const af_array in)
-{
+static inline af_array checkOp(const af_array in) {
     af_array res = getHandle(checkOp<T, op>(castArray<T>(in)));
     return res;
 }
 
 template<af_op_t op>
-static af_err af_check(af_array *out, const af_array in)
-{
-    try {
+struct cplxLogicOp {
+    af_array operator()(const Array<char> &resR, const Array<char> &resI,
+                        const dim4 &dims) {
+        return getHandle(logicOp<char, af_or_t>(resR, resI, dims));
+    }
+};
 
-        ArrayInfo in_info = getInfo(in);
-        ARG_ASSERT(1, in_info.isReal());
+template<>
+struct cplxLogicOp<af_iszero_t> {
+    af_array operator()(const Array<char> &resR, const Array<char> &resI,
+                        const dim4 &dims) {
+        return getHandle(logicOp<char, af_and_t>(resR, resI, dims));
+    }
+};
+
+template<typename T, typename BT, af_op_t op>
+static inline af_array checkOpCplx(const af_array in) {
+    Array<BT> R = real<BT, T>(getArray<T>(in));
+    Array<BT> I = imag<BT, T>(getArray<T>(in));
+
+    Array<char> resR = checkOp<BT, op>(R);
+    Array<char> resI = checkOp<BT, op>(I);
+
+    const ArrayInfo &in_info = getInfo(in);
+    const dim4 &dims         = in_info.dims();
+    cplxLogicOp<op> cplxLogic;
+    af_array res = cplxLogic(resR, resI, dims);
+
+    return res;
+}
+
+template<af_op_t op>
+static af_err af_check(af_array *out, const af_array in) {
+    try {
+        const ArrayInfo &in_info = getInfo(in);
 
         af_dtype in_type = in_info.getType();
         af_array res;
 
-        // Convert all inputs to floats / doubles
+        // Convert all inputs to floats / doubles / complex
         af_dtype type = implicit(in_type, f32);
+        if (in_type == f16) { type = f16; }
+        if (in_info.ndims() == 0) { return af_retain_array(out, in); }
 
         switch (type) {
-        case f32 : res = checkOp<float  , op>(in); break;
-        case f64 : res = checkOp<double , op>(in); break;
-        default:
-            TYPE_ERROR(1, in_type); break;
+            case f32: res = checkOp<float, op>(in); break;
+            case f64: res = checkOp<double, op>(in); break;
+            case f16: res = checkOp<half, op>(in); break;
+            case c32: res = checkOpCplx<cfloat, float, op>(in); break;
+            case c64: res = checkOpCplx<cdouble, double, op>(in); break;
+            default: TYPE_ERROR(1, in_type); break;
         }
 
         std::swap(*out, res);
@@ -224,13 +747,11 @@ static af_err af_check(af_array *out, const af_array in)
     return AF_SUCCESS;
 }
 
-#define CHECK(fn)                                       \
-    af_err af_##fn(af_array *out, const af_array in)    \
-    {                                                   \
-        return af_check<af_##fn##_t>(out, in);          \
+#define CHECK(fn)                                      \
+    af_err af_##fn(af_array *out, const af_array in) { \
+        return af_check<af_##fn##_t>(out, in);         \
     }
 
-
 CHECK(isinf)
 CHECK(isnan)
 CHECK(iszero)
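Review note: the `unaryOpCplxFun` specializations above build each complex function from real-valued pieces (real part, imaginary part, `abs`, `atan2`). The scalar sketch below restates a few of those decompositions with `std::complex<double>` so they can be compared against the standard library. It is only an illustration under stated assumptions: the names `exp_parts`, `log_parts`, `sqrt_parts`, `cos_parts`, `acosh_parts`, `acosh_naive`, and `nearly_equal` are hypothetical and are not part of ArrayFire.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical scalar helpers (not ArrayFire API). Each one mirrors the
// real/imaginary decomposition used by the array-level code in the patch.
using cd = std::complex<double>;

// exp(a + ib) = exp(a) * cos(b) + i * exp(a) * sin(b)
inline cd exp_parts(cd z) {
    const double e = std::exp(z.real());
    return cd(e * std::cos(z.imag()), e * std::sin(z.imag()));
}

// log(z) = log(r) + i * phi, with r = |z| and phi = atan2(b, a)
inline cd log_parts(cd z) {
    return cd(std::log(std::abs(z)), std::atan2(z.imag(), z.real()));
}

// sqrt(z) = sqrt(r) * cos(phi / 2) + i * sqrt(r) * sin(phi / 2)
inline cd sqrt_parts(cd z) {
    const double root_r   = std::sqrt(std::abs(z));
    const double half_phi = std::atan2(z.imag(), z.real()) / 2.0;
    return cd(root_r * std::cos(half_phi), root_r * std::sin(half_phi));
}

// cos(a + ib) = cos(a) * cosh(b) - i * sin(a) * sinh(b)
inline cd cos_parts(cd z) {
    return cd(std::cos(z.real()) * std::cosh(z.imag()),
              -std::sin(z.real()) * std::sinh(z.imag()));
}

// acosh(z) = log(z + sqrt(z + 1) * sqrt(z - 1)), the form used in the patch
inline cd acosh_parts(cd z) {
    return log_parts(z + sqrt_parts(z + 1.0) * sqrt_parts(z - 1.0));
}

// The "simplified" form log(z + sqrt(z^2 - 1)); shown only for contrast
inline cd acosh_naive(cd z) {
    return log_parts(z + sqrt_parts(z * z - 1.0));
}

// Absolute-difference comparison with a fixed tolerance
inline bool nearly_equal(cd x, cd y, double tol = 1e-12) {
    return std::abs(x - y) <= tol;
}
```

For an input such as `cd(-2.0, 0.5)`, `acosh_parts` agrees with `std::acosh`, while `acosh_naive` lands on a different branch. That appears to be the reason the acosh specialization keeps the two square roots separate.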
diff --git a/src/api/c/unwrap.cpp b/src/api/c/unwrap.cpp
new file mode 100644
index 0000000000..6f09a6b7eb
--- /dev/null
+++ b/src/api/c/unwrap.cpp
@@ -0,0 +1,101 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <unwrap.hpp>
+#include <af/defines.h>
+#include <af/image.h>
+
+using af::dim4;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T>
+static inline af_array unwrap(const af_array in, const dim_t wx, const dim_t wy,
+                              const dim_t sx, const dim_t sy, const dim_t px,
+                              const dim_t py, const bool is_column) {
+    return getHandle(
+        unwrap<T>(getArray<T>(in), wx, wy, sx, sy, px, py, 1, 1, is_column));
+}
+
+af_err af_unwrap(af_array* out, const af_array in, const dim_t wx,
+                 const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,
+                 const dim_t py, const bool is_column) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        af::dim4 idims        = info.dims();
+
+        ARG_ASSERT(2, wx > 0 && wx <= idims[0] + 2 * px);
+        ARG_ASSERT(3, wy > 0 && wy <= idims[1] + 2 * py);
+        ARG_ASSERT(4, sx > 0);
+        ARG_ASSERT(5, sy > 0);
+        ARG_ASSERT(6, px >= 0 && px < wx);
+        ARG_ASSERT(7, py >= 0 && py < wy);
+
+        af_array output;
+
+        switch (type) {
+            case f32:
+                output = unwrap<float>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case f64:
+                output = unwrap<double>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case c32:
+                output = unwrap<cfloat>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case c64:
+                output = unwrap<cdouble>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case s32:
+                output = unwrap<int>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case u32:
+                output = unwrap<uint>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case s64:
+                output = unwrap<intl>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case u64:
+                output = unwrap<uintl>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case s16:
+                output = unwrap<short>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case u16:
+                output = unwrap<ushort>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case s8:
+                output = unwrap<schar>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case u8:
+                output = unwrap<uchar>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            case b8:
+                output = unwrap<char>(in, wx, wy, sx, sy, px, py, is_column);
+                break;
+            default: TYPE_ERROR(1, type);
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
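Review note: the six `ARG_ASSERT` checks in `af_unwrap` can be read as a single predicate on the window, stride, and padding arguments. The sketch below restates them for a 2-D input. The names `unwrapArgsValid` and `windowCount` are hypothetical, `dim_t` is replaced by a local alias, and the window-count formula is an assumption about the backend (the usual sliding-window count with unit dilation); it does not appear in this file.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in for ArrayFire's dim_t (assumed here to be a signed 64-bit
// integer); this alias is for illustration only.
using dim_t = std::int64_t;

// Mirrors the ARG_ASSERT conditions of af_unwrap for an input whose first two
// dimensions are d0 and d1. Hypothetical helper, not part of ArrayFire.
inline bool unwrapArgsValid(dim_t d0, dim_t d1, dim_t wx, dim_t wy, dim_t sx,
                            dim_t sy, dim_t px, dim_t py) {
    return wx > 0 && wx <= d0 + 2 * px &&  // window fits the padded width
           wy > 0 && wy <= d1 + 2 * py &&  // window fits the padded height
           sx > 0 && sy > 0 &&             // strides are positive
           px >= 0 && px < wx &&           // padding is smaller than the window
           py >= 0 && py < wy;
}

// Assumption: number of window positions along one axis of length d, for
// window w, stride s, and padding p on each side (dilation fixed at 1).
inline dim_t windowCount(dim_t d, dim_t w, dim_t s, dim_t p) {
    assert(w > 0 && s > 0 && w <= d + 2 * p);
    return (d + 2 * p - w) / s + 1;
}
```

Under these assumptions, a 5x5 input with a 3x3 window, stride 1, and no padding yields 3 positions per axis, so the unwrapped result would hold 9 patches of 9 elements each.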
diff --git a/src/api/c/var.cpp b/src/api/c/var.cpp
index 7feb1c4692..64a5d8f693 100644
--- a/src/api/c/var.cpp
+++ b/src/api/c/var.cpp
@@ -7,133 +7,244 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
-#include <af/defines.h>
-#include <err_common.hpp>
+#include <arith.hpp>
 #include <backend.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
 #include <handle.hpp>
-#include <reduce.hpp>
-#include <arith.hpp>
 #include <math.hpp>
-#include <cast.hpp>
-#include <tile.hpp>
+#include <mean.hpp>
+#include <reduce.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 
 #include "stats.h"
 
-using namespace detail;
+#include <tuple>
+
+using af::dim4;
+using arrayfire::common::cast;
+using arrayfire::common::half;
+using detail::arithOp;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::division;
+using detail::getScalar;
+using detail::imag;
+using detail::intl;
+using detail::mean;
+using detail::real;
+using detail::reduce;
+using detail::reduce_all;
+using detail::scalar;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::ignore;
+using std::make_tuple;
+using std::tie;
+using std::tuple;
 
 template<typename inType, typename outType>
-static outType varAll(const af_array& in, const bool isbiased)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
+static outType varAll(const af_array& in, const af_var_bias bias) {
+    using weightType          = typename baseOutType<outType>::type;
+    const Array<inType> inArr = getArray<inType>(in);
+    Array<outType> input      = cast<outType>(inArr);
 
-    Array<outType> meanCnst= createValueArray<outType>(input.dims(), mean<outType>(input));
+    Array<outType> meanCnst = createValueArray<outType>(
+        input.dims(), mean<inType, weightType, outType>(inArr));
 
-    Array<outType> diff    = arithOp<outType, af_sub_t>(input, meanCnst, input.dims());
+    Array<outType> diff =
+        arithOp<outType, af_sub_t>(input, meanCnst, input.dims());
 
-    Array<outType> diffSq  = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
+    Array<outType> diffSq = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
 
-    outType result = division(reduce_all<af_add_t, outType, outType>(diffSq),
-        isbiased ? input.elements() : input.elements() - 1);
+    outType result = division(
+        getScalar<outType>(reduce_all<af_add_t, outType, outType>(diffSq)),
+        (input.elements() - (bias == AF_VARIANCE_SAMPLE)));
 
     return result;
 }
 
 template<typename inType, typename outType>
-static outType varAll(const af_array& in, const af_array weights)
-{
-    typedef typename baseOutType<outType>::type bType;
+static outType varAll(const af_array& in, const af_array weights) {
+    using bType = typename baseOutType<outType>::type;
 
     Array<outType> input = cast<outType>(getArray<inType>(in));
     Array<outType> wts   = cast<outType>(getArray<bType>(weights));
 
-    bType wtsSum    = reduce_all<af_add_t, bType, bType>(getArray<bType>(weights));
-    outType wtdMean = mean<outType, bType>(input, getArray<bType>(weights));
+    bType wtsSum = getScalar<bType>(
+        reduce_all<af_add_t, bType, bType>(getArray<bType>(weights)));
+    auto wtdMean = mean<outType, bType>(input, getArray<bType>(weights));
 
     Array<outType> meanArr = createValueArray<outType>(input.dims(), wtdMean);
-    Array<outType> diff    = arithOp<outType, af_sub_t>(input, meanArr, input.dims());
-    Array<outType> diffSq  = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
+    Array<outType> diff =
+        arithOp<outType, af_sub_t>(input, meanArr, input.dims());
+    Array<outType> diffSq = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
 
-    Array<outType> accDiffSq = arithOp<outType, af_mul_t>(diffSq, wts, diffSq.dims());
+    Array<outType> accDiffSq =
+        arithOp<outType, af_mul_t>(diffSq, wts, diffSq.dims());
 
-    outType result = division(reduce_all<af_add_t, outType, outType>(accDiffSq), wtsSum);
+    outType result = division(
+        getScalar<outType>(reduce_all<af_add_t, outType, outType>(accDiffSq)),
+        wtsSum);
 
     return result;
 }
 
 template<typename inType, typename outType>
-static af_array var(const af_array& in, const bool isbiased, int dim)
-{
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    dim4 iDims = input.dims();
-
-    Array<outType> meanArr = mean<outType>(input, dim);
-
-    /* now tile meanArr along dim and use it for variance computation */
-    dim4 tileDims(1);
-    tileDims[dim] = iDims[dim];
-    Array<outType> tMeanArr = tile<outType>(meanArr, tileDims);
-    /* now mean array is ready */
+static tuple<Array<outType>, Array<outType>> meanvar(
+    const Array<inType>& in,
+    const Array<typename baseOutType<outType>::type>& weights,
+    const af_var_bias bias, const dim_t dim) {
+    using weightType     = typename baseOutType<outType>::type;
+    Array<outType> input = cast<outType>(in);
+    dim4 iDims           = input.dims();
+
+    Array<outType> meanArr = createEmptyArray<outType>({0});
+    Array<outType> normArr = createEmptyArray<outType>({0});
+    if (weights.isEmpty()) {
+        meanArr  = mean<outType, weightType, outType>(input, dim);
+        auto val = 1.0 / static_cast<double>(bias == AF_VARIANCE_POPULATION
+                                                 ? iDims[dim]
+                                                 : iDims[dim] - 1);
+        normArr =
+            createValueArray<outType>(meanArr.dims(), scalar<outType>(val));
+    } else {
+        meanArr               = mean<outType, weightType>(input, weights, dim);
+        Array<outType> wtsSum = cast<outType>(
+            reduce<af_add_t, weightType, weightType>(weights, dim));
+        Array<outType> ones =
+            createValueArray<outType>(wtsSum.dims(), scalar<outType>(1));
+        if (bias == AF_VARIANCE_SAMPLE) {
+            wtsSum = arithOp<outType, af_sub_t>(wtsSum, ones, ones.dims());
+        }
+        normArr = arithOp<outType, af_div_t>(ones, wtsSum, meanArr.dims());
+    }
 
-    Array<outType> diff    = arithOp<outType, af_sub_t>(input, tMeanArr, tMeanArr.dims());
-    Array<outType> diffSq  = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
+    Array<outType> diff =
+        arithOp<outType, af_sub_t>(input, meanArr, input.dims());
+    Array<outType> diffSq = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
     Array<outType> redDiff = reduce<af_add_t, outType, outType>(diffSq, dim);
-    dim4 oDims = redDiff.dims();
 
-    Array<outType> divArr = createValueArray<outType>(oDims, scalar<outType>(isbiased ? iDims[dim] : iDims[dim]-1));
-    Array<outType> result = arithOp<outType, af_div_t>(redDiff, divArr, redDiff.dims());
+    Array<outType> variance =
+        arithOp<outType, af_mul_t>(normArr, redDiff, redDiff.dims());
 
-    return getHandle<outType>(result);
+    return make_tuple(meanArr, variance);
 }
 
 template<typename inType, typename outType>
-static af_array var(const af_array& in, const af_array& weights, int dim)
-{
-    typedef typename baseOutType<outType>::type bType;
-
-    Array<outType> input = cast<outType>(getArray<inType>(in));
-    Array<outType> wts   = cast<outType>(getArray<bType>(weights));
-    dim4 iDims = input.dims();
-
-    Array<outType> meanArr = mean<outType>(input, wts, dim);
+static tuple<af_array, af_array> meanvar(const af_array& in,
+                                         const af_array& weights,
+                                         const af_var_bias bias,
+                                         const dim_t dim) {
+    using weightType    = typename baseOutType<outType>::type;
+    Array<outType> mean = createEmptyArray<outType>({0}),
+                   var  = createEmptyArray<outType>({0});
+
+    Array<weightType> w = createEmptyArray<weightType>({0});
+    if (weights != 0) { w = getArray<weightType>(weights); }
+    tie(mean, var) =
+        meanvar<inType, outType>(getArray<inType>(in), w, bias, dim);
+    return make_tuple(getHandle(mean), getHandle(var));
+}
 
-    /* now tile meanArr along dim and use it for variance computation */
-    dim4 tileDims(1);
-    tileDims[dim] = iDims[dim];
-    Array<outType> tMeanArr = tile<outType>(meanArr, tileDims);
-    /* now mean array is ready */
+/// Calculates the variance
+///
+/// \note Only calculates the weighted variance if the weights array is
+/// non-empty
+template<typename inType, typename outType>
+static Array<outType> var(
+    const Array<inType>& in,
+    const Array<typename baseOutType<outType>::type>& weights,
+    const af_var_bias bias, int dim) {
+    Array<outType> variance = createEmptyArray<outType>({0});
+    tie(ignore, variance)   = meanvar<inType, outType>(in, weights, bias, dim);
+    return variance;
+}
 
-    Array<outType> diff    = arithOp<outType, af_sub_t>(input, tMeanArr, tMeanArr.dims());
-    Array<outType> diffSq  = arithOp<outType, af_mul_t>(diff, diff, diff.dims());
-    Array<outType> wDiffSq = arithOp<outType, af_mul_t>(diffSq, wts, diffSq.dims());
-    Array<outType> accWDS  = reduce<af_add_t, outType, outType>(wDiffSq, dim);
-    Array<outType> divArr  = reduce<af_add_t, outType, outType>(wts, dim);
-    Array<outType> result  = arithOp<outType, af_div_t>(accWDS, divArr, accWDS.dims());
+template<typename inType, typename outType>
+static af_array var_(const af_array& in, const af_array& weights,
+                     const af_var_bias bias, int dim) {
+    using bType = typename baseOutType<outType>::type;
+    if (weights == 0) {
+        Array<bType> empty = createEmptyArray<bType>({0});
+        return getHandle(
+            var<inType, outType>(getArray<inType>(in), empty, bias, dim));
+    }
+    return getHandle(var<inType, outType>(getArray<inType>(in),
+                                          getArray<bType>(weights), bias, dim));
+}
 
-    return getHandle<outType>(result);
+af_err af_var(af_array* out, const af_array in, const bool isbiased,
+              const dim_t dim) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return af_var_v2(out, in, bias, dim);
 }
 
-af_err af_var(af_array *out, const af_array in, const bool isbiased, const dim_t dim)
-{
+af_err af_var_v2(af_array* out, const af_array in, const af_var_bias bias,
+                 const dim_t dim) {
     try {
-        ARG_ASSERT(2, (dim>=0 && dim<=3));
-
-        af_array output = 0;
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
-            case f64: output = var<double,  double>(in, isbiased, dim); break;
-            case f32: output = var<float ,  float >(in, isbiased, dim); break;
-            case s32: output = var<int   ,  float >(in, isbiased, dim); break;
-            case u32: output = var<uint  ,  float >(in, isbiased, dim); break;
-            case s64: output = var<intl  ,  double>(in, isbiased, dim); break;
-            case u64: output = var<uintl ,  double>(in, isbiased, dim); break;
-            case  u8: output = var<uchar ,  float >(in, isbiased, dim); break;
-            case  b8: output = var<char  ,  float >(in, isbiased, dim); break;
-            case c32: output = var<cfloat,  cfloat>(in, isbiased, dim); break;
-            case c64: output = var<cdouble,cdouble>(in, isbiased, dim); break;
-            default : TYPE_ERROR(1, type);
+        ARG_ASSERT(3, (dim >= 0 && dim <= 3));
+
+        af_array output       = 0;
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+
+        af_array no_weights = 0;
+        switch (type) {
+            case f32:
+                output = var_<float, float>(in, no_weights, bias, dim);
+                break;
+            case f64:
+                output = var_<double, double>(in, no_weights, bias, dim);
+                break;
+            case s32:
+                output = var_<int, float>(in, no_weights, bias, dim);
+                break;
+            case u32:
+                output = var_<uint, float>(in, no_weights, bias, dim);
+                break;
+            case s16:
+                output = var_<short, float>(in, no_weights, bias, dim);
+                break;
+            case u16:
+                output = var_<ushort, float>(in, no_weights, bias, dim);
+                break;
+            case s64:
+                output = var_<intl, double>(in, no_weights, bias, dim);
+                break;
+            case u64:
+                output = var_<uintl, double>(in, no_weights, bias, dim);
+                break;
+            case s8:
+                output = var_<schar, float>(in, no_weights, bias, dim);
+                break;
+            case u8:
+                output = var_<uchar, float>(in, no_weights, bias, dim);
+                break;
+            case b8:
+                output = var_<char, float>(in, no_weights, bias, dim);
+                break;
+            case c32:
+                output = var_<cfloat, cfloat>(in, no_weights, bias, dim);
+                break;
+            case c64:
+                output = var_<cdouble, cdouble>(in, no_weights, bias, dim);
+                break;
+            case f16:
+                output = var_<half, half>(in, no_weights, bias, dim);
+                break;
+            default: TYPE_ERROR(1, type);
         }
         std::swap(*out, output);
     }
@@ -141,31 +252,81 @@ af_err af_var(af_array *out, const af_array in, const bool isbiased, const dim_t
     return AF_SUCCESS;
 }
 
-af_err af_var_weighted(af_array *out, const af_array in, const af_array weights, const dim_t dim)
-{
+af_err af_var_weighted(af_array* out, const af_array in, const af_array weights,
+                       const dim_t dim) {
     try {
-        ARG_ASSERT(2, (dim>=0 && dim<=3));
-
-        af_array output = 0;
-        ArrayInfo iInfo = getInfo(in);
-        ArrayInfo wInfo = getInfo(weights);
-        af_dtype iType  = iInfo.getType();
-        af_dtype wType  = wInfo.getType();
-
-        ARG_ASSERT(3, (wType==f32 || wType==f64)); /* verify that weights are non-complex real numbers */
-
-        switch(iType) {
-            case f64: output = var<double,  double>(in, weights, dim); break;
-            case f32: output = var<float ,  float >(in, weights, dim); break;
-            case s32: output = var<int   ,  float >(in, weights, dim); break;
-            case u32: output = var<uint  ,  float >(in, weights, dim); break;
-            case s64: output = var<intl  ,  double>(in, weights, dim); break;
-            case u64: output = var<uintl ,  double>(in, weights, dim); break;
-            case  u8: output = var<uchar ,  float >(in, weights, dim); break;
-            case  b8: output = var<char  ,  float >(in, weights, dim); break;
-            case c32: output = var<cfloat,  cfloat>(in, weights, dim); break;
-            case c64: output = var<cdouble,cdouble>(in, weights, dim); break;
-            default : TYPE_ERROR(1, iType);
+        ARG_ASSERT(3, (dim >= 0 && dim <= 3));
+
+        af_array output        = 0;
+        const ArrayInfo& iInfo = getInfo(in);
+        const ArrayInfo& wInfo = getInfo(weights);
+        af_dtype iType         = iInfo.getType();
+        af_dtype wType         = wInfo.getType();
+
+        ARG_ASSERT(
+            2,
+            (wType == f32 ||
+             wType ==
+                 f64)); /* verify that weights are non-complex real numbers */
+
+        switch (iType) {
+            case f64:
+                output = var_<double, double>(in, weights,
+                                              AF_VARIANCE_POPULATION, dim);
+                break;
+            case f32:
+                output = var_<float, float>(in, weights, AF_VARIANCE_POPULATION,
+                                            dim);
+                break;
+            case s32:
+                output =
+                    var_<int, float>(in, weights, AF_VARIANCE_POPULATION, dim);
+                break;
+            case u32:
+                output =
+                    var_<uint, float>(in, weights, AF_VARIANCE_POPULATION, dim);
+                break;
+            case s16:
+                output = var_<short, float>(in, weights, AF_VARIANCE_POPULATION,
+                                            dim);
+                break;
+            case u16:
+                output = var_<ushort, float>(in, weights,
+                                             AF_VARIANCE_POPULATION, dim);
+                break;
+            case s64:
+                output = var_<intl, double>(in, weights, AF_VARIANCE_POPULATION,
+                                            dim);
+                break;
+            case u64:
+                output = var_<uintl, double>(in, weights,
+                                             AF_VARIANCE_POPULATION, dim);
+                break;
+            case s8:
+                output = var_<schar, float>(in, weights, AF_VARIANCE_POPULATION,
+                                            dim);
+                break;
+            case u8:
+                output = var_<uchar, float>(in, weights, AF_VARIANCE_POPULATION,
+                                            dim);
+                break;
+            case b8:
+                output =
+                    var_<char, float>(in, weights, AF_VARIANCE_POPULATION, dim);
+                break;
+            case f16:
+                output =
+                    var_<half, float>(in, weights, AF_VARIANCE_POPULATION, dim);
+                break;
+            case c32:
+                output = var_<cfloat, cfloat>(in, weights,
+                                              AF_VARIANCE_POPULATION, dim);
+                break;
+            case c64:
+                output = var_<cdouble, cdouble>(in, weights,
+                                                AF_VARIANCE_POPULATION, dim);
+                break;
+            default: TYPE_ERROR(1, iType);
         }
         std::swap(*out, output);
     }
@@ -173,67 +334,158 @@ af_err af_var_weighted(af_array *out, const af_array in, const af_array weights,
     return AF_SUCCESS;
 }
 
-af_err af_var_all(double *realVal, double *imagVal, const af_array in, const bool isbiased)
-{
+af_err af_var_all(double* realVal, double* imagVal, const af_array in,
+                  const bool isbiased) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return af_var_all_v2(realVal, imagVal, in, bias);
+}
+
+af_err af_var_all_v2(double* realVal, double* imagVal, const af_array in,
+                     const af_var_bias bias) {
     try {
-        ArrayInfo info = getInfo(in);
-        af_dtype type = info.getType();
-        switch(type) {
-            case f64: *realVal = varAll<double, double>(in, isbiased); break;
-            case f32: *realVal = varAll<float , float >(in, isbiased); break;
-            case s32: *realVal = varAll<int   , float >(in, isbiased); break;
-            case u32: *realVal = varAll<uint  , float >(in, isbiased); break;
-            case s64: *realVal = varAll<intl  , double>(in, isbiased); break;
-            case u64: *realVal = varAll<uintl , double>(in, isbiased); break;
-            case  u8: *realVal = varAll<uchar , float >(in, isbiased); break;
-            case  b8: *realVal = varAll<char  , float >(in, isbiased); break;
+        const ArrayInfo& info = getInfo(in);
+        af_dtype type         = info.getType();
+        switch (type) {
+            case f64: *realVal = varAll<double, double>(in, bias); break;
+            case f32: *realVal = varAll<float, float>(in, bias); break;
+            case s32: *realVal = varAll<int, float>(in, bias); break;
+            case u32: *realVal = varAll<uint, float>(in, bias); break;
+            case s16: *realVal = varAll<short, float>(in, bias); break;
+            case u16: *realVal = varAll<ushort, float>(in, bias); break;
+            case s64: *realVal = varAll<intl, double>(in, bias); break;
+            case u64: *realVal = varAll<uintl, double>(in, bias); break;
+            case s8: *realVal = varAll<schar, float>(in, bias); break;
+            case u8: *realVal = varAll<uchar, float>(in, bias); break;
+            case b8: *realVal = varAll<char, float>(in, bias); break;
+            case f16: *realVal = varAll<half, float>(in, bias); break;
             case c32: {
-                cfloat tmp = varAll<cfloat,cfloat>(in, isbiased);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
+                cfloat tmp = varAll<cfloat, cfloat>(in, bias);
+                *realVal   = real(tmp);
+                *imagVal   = imag(tmp);
+            } break;
             case c64: {
-                cdouble tmp = varAll<cdouble,cdouble>(in, isbiased);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
-            default : TYPE_ERROR(1, type);
+                cdouble tmp = varAll<cdouble, cdouble>(in, bias);
+                *realVal    = real(tmp);
+                *imagVal    = imag(tmp);
+            } break;
+            default: TYPE_ERROR(1, type);
         }
     }
     CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err af_var_all_weighted(double *realVal, double *imagVal, const af_array in, const af_array weights)
-{
+af_err af_var_all_weighted(double* realVal, double* imagVal, const af_array in,
+                           const af_array weights) {
     try {
-        ArrayInfo iInfo = getInfo(in);
-        ArrayInfo wInfo = getInfo(weights);
-        af_dtype iType  = iInfo.getType();
-        af_dtype wType  = wInfo.getType();
-
-        ARG_ASSERT(3, (wType==f32 || wType==f64)); /* verify that weights are non-complex real numbers */
-
-        switch(iType) {
+        const ArrayInfo& iInfo = getInfo(in);
+        const ArrayInfo& wInfo = getInfo(weights);
+        af_dtype iType         = iInfo.getType();
+        af_dtype wType         = wInfo.getType();
+
+        ARG_ASSERT(
+            3,
+            (wType == f32 ||
+             wType ==
+                 f64)); /* verify that weights are non-complex real numbers */
+
+        switch (iType) {
             case f64: *realVal = varAll<double, double>(in, weights); break;
-            case f32: *realVal = varAll<float , float >(in, weights); break;
-            case s32: *realVal = varAll<int   , float >(in, weights); break;
-            case u32: *realVal = varAll<uint  , float >(in, weights); break;
-            case s64: *realVal = varAll<intl  , double >(in, weights); break;
-            case u64: *realVal = varAll<uintl , double >(in, weights); break;
-            case  u8: *realVal = varAll<uchar , float >(in, weights); break;
-            case  b8: *realVal = varAll<char  , float >(in, weights); break;
+            case f32: *realVal = varAll<float, float>(in, weights); break;
+            case s32: *realVal = varAll<int, float>(in, weights); break;
+            case u32: *realVal = varAll<uint, float>(in, weights); break;
+            case s16: *realVal = varAll<short, float>(in, weights); break;
+            case u16: *realVal = varAll<ushort, float>(in, weights); break;
+            case s64: *realVal = varAll<intl, double>(in, weights); break;
+            case u64: *realVal = varAll<uintl, double>(in, weights); break;
+            case s8: *realVal = varAll<schar, float>(in, weights); break;
+            case u8: *realVal = varAll<uchar, float>(in, weights); break;
+            case b8: *realVal = varAll<char, float>(in, weights); break;
+            case f16: *realVal = varAll<half, float>(in, weights); break;
             case c32: {
-                cfloat tmp = varAll<cfloat,cfloat>(in, weights);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
+                cfloat tmp = varAll<cfloat, cfloat>(in, weights);
+                *realVal   = real(tmp);
+                *imagVal   = imag(tmp);
+            } break;
             case c64: {
-                cdouble tmp = varAll<cdouble,cdouble>(in, weights);
-                *realVal = real(tmp);
-                *imagVal = imag(tmp);
-                } break;
-            default : TYPE_ERROR(1, iType);
+                cdouble tmp = varAll<cdouble, cdouble>(in, weights);
+                *realVal    = real(tmp);
+                *imagVal    = imag(tmp);
+            } break;
+            default: TYPE_ERROR(1, iType);
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_meanvar(af_array* mean, af_array* var, const af_array in,
+                  const af_array weights, const af_var_bias bias,
+                  const dim_t dim) {
+    try {
+        const ArrayInfo& iInfo = getInfo(in);
+        if (weights != 0) {
+            const ArrayInfo& wInfo = getInfo(weights);
+            af_dtype wType         = wInfo.getType();
+            ARG_ASSERT(3, (wType == f32 || wType == f64));
+        }
+        af_dtype iType = iInfo.getType();
+
+        switch (iType) {
+            case f32:
+                tie(*mean, *var) =
+                    meanvar<float, float>(in, weights, bias, dim);
+                break;
+            case f64:
+                tie(*mean, *var) =
+                    meanvar<double, double>(in, weights, bias, dim);
+                break;
+            case s32:
+                tie(*mean, *var) = meanvar<int, float>(in, weights, bias, dim);
+                break;
+            case u32:
+                tie(*mean, *var) = meanvar<uint, float>(in, weights, bias, dim);
+                break;
+            case s16:
+                tie(*mean, *var) =
+                    meanvar<short, float>(in, weights, bias, dim);
+                break;
+            case u16:
+                tie(*mean, *var) =
+                    meanvar<ushort, float>(in, weights, bias, dim);
+                break;
+            case s64:
+                tie(*mean, *var) =
+                    meanvar<intl, double>(in, weights, bias, dim);
+                break;
+            case u64:
+                tie(*mean, *var) =
+                    meanvar<uintl, double>(in, weights, bias, dim);
+                break;
+            case s8:
+                tie(*mean, *var) =
+                    meanvar<schar, float>(in, weights, bias, dim);
+                break;
+            case u8:
+                tie(*mean, *var) =
+                    meanvar<uchar, float>(in, weights, bias, dim);
+                break;
+            case b8:
+                tie(*mean, *var) = meanvar<char, float>(in, weights, bias, dim);
+                break;
+            case c32:
+                tie(*mean, *var) =
+                    meanvar<cfloat, cfloat>(in, weights, bias, dim);
+                break;
+            case c64:
+                tie(*mean, *var) =
+                    meanvar<cdouble, cdouble>(in, weights, bias, dim);
+                break;
+            case f16:
+                tie(*mean, *var) = meanvar<half, half>(in, weights, bias, dim);
+                break;
+            default: TYPE_ERROR(1, iType);
         }
     }
     CATCHALL;
diff --git a/src/api/c/vector_field.cpp b/src/api/c/vector_field.cpp
new file mode 100644
index 0000000000..9eba21811c
--- /dev/null
+++ b/src/api/c/vector_field.cpp
@@ -0,0 +1,441 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/data.h>
+#include <af/graphics.h>
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <transpose.hpp>
+#include <vector_field.hpp>
+
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::getGLType;
+using arrayfire::common::makeContextCurrent;
+using arrayfire::common::step_round;
+using detail::Array;
+using detail::copy_vector_field;
+using detail::createEmptyArray;
+using detail::forgeManager;
+using detail::reduce;
+using detail::schar;
+using detail::transpose;
+using detail::uchar;
+using detail::uint;
+using detail::ushort;
+using std::vector;
+
+template<typename T>
+fg_chart setup_vector_field(fg_window window, const vector<af_array>& points,
+                            const vector<af_array>& directions,
+                            const af_cell* const props,
+                            const bool transpose_ = true) {
+    ForgeModule& _ = forgePlugin();
+    vector<Array<T>> pnts;
+    vector<Array<T>> dirs;
+
+    Array<T> pIn = getArray<T>(points[0]);
+    Array<T> dIn = getArray<T>(directions[0]);
+    if (points.size() > 1) {
+        for (unsigned i = 0; i < points.size(); ++i) {
+            pnts.push_back(getArray<T>(points[i]));
+            dirs.push_back(getArray<T>(directions[i]));
+        }
+
+        // Join the per-axis vectors along dimension 1 into a single array
+        const dim4 odims(pIn.dims()[0], points.size());
+        pIn = createEmptyArray<T>(odims);
+        dIn = createEmptyArray<T>(odims);
+        detail::join<T>(pIn, 1, pnts);
+        detail::join<T>(dIn, 1, dirs);
+    }
+    // do transpose if required
+    if (transpose_) {
+        pIn = transpose<T>(pIn, false);
+        dIn = transpose<T>(dIn, false);
+    }
+
+    ForgeManager& fgMngr = forgeManager();
+
+    // Get the chart for the current grid position (if any)
+    fg_chart chart = NULL;
+
+    if (pIn.dims()[0] == 2) {
+        if (props->col > -1 && props->row > -1) {
+            chart =
+                fgMngr.getChart(window, props->row, props->col, FG_CHART_2D);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, FG_CHART_2D);
+        }
+    } else {
+        if (props->col > -1 && props->row > -1) {
+            chart =
+                fgMngr.getChart(window, props->row, props->col, FG_CHART_3D);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, FG_CHART_3D);
+        }
+    }
+
+    fg_vector_field vfield =
+        fgMngr.getVectorField(chart, pIn.dims()[1], getGLType<T>());
+
+    // ArrayFire LOGO dark blue shade
+    FG_CHECK(_.fg_set_vector_field_color(vfield, 0.130f, 0.173f, 0.263f, 1.0));
+
+    // If chart axes limits do not have a manual override
+    // then compute and set axes limits
+    if (!fgMngr.getChartAxesOverride(chart)) {
+        float cmin[3], cmax[3];
+        T dmin[3], dmax[3];
+        FG_CHECK(_.fg_get_chart_axes_limits(
+            &cmin[0], &cmax[0], &cmin[1], &cmax[1], &cmin[2], &cmax[2], chart));
+        copyData(dmin, reduce<af_min_t, T, T>(pIn, 1));
+        copyData(dmax, reduce<af_max_t, T, T>(pIn, 1));
+
+        if (cmin[0] == 0 && cmax[0] == 0 && cmin[1] == 0 && cmax[1] == 0 &&
+            cmin[2] == 0 && cmax[2] == 0) {
+            // No previous limits. Set without checking
+            cmin[0] = step_round(dmin[0], false);
+            cmax[0] = step_round(dmax[0], true);
+            cmin[1] = step_round(dmin[1], false);
+            cmax[1] = step_round(dmax[1], true);
+            if (pIn.dims()[0] == 3) { cmin[2] = step_round(dmin[2], false); }
+            if (pIn.dims()[0] == 3) { cmax[2] = step_round(dmax[2], true); }
+        } else {
+            if (cmin[0] > dmin[0]) { cmin[0] = step_round(dmin[0], false); }
+            if (cmax[0] < dmax[0]) { cmax[0] = step_round(dmax[0], true); }
+            if (cmin[1] > dmin[1]) { cmin[1] = step_round(dmin[1], false); }
+            if (cmax[1] < dmax[1]) { cmax[1] = step_round(dmax[1], true); }
+            if (pIn.dims()[0] == 3) {
+                if (cmin[2] > dmin[2]) { cmin[2] = step_round(dmin[2], false); }
+                if (cmax[2] < dmax[2]) { cmax[2] = step_round(dmax[2], true); }
+            }
+        }
+        FG_CHECK(_.fg_set_chart_axes_limits(chart, cmin[0], cmax[0], cmin[1],
+                                            cmax[1], cmin[2], cmax[2]));
+    }
+    copy_vector_field<T>(pIn, dIn, vfield);
+
+    return chart;
+}
+
+af_err vectorFieldWrapper(const af_window window, const af_array points,
+                          const af_array directions,
+                          const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& pInfo = getInfo(points);
+        af::dim4 pDims         = pInfo.dims();
+        af_dtype pType         = pInfo.getType();
+
+        const ArrayInfo& dInfo = getInfo(directions);
+        const af::dim4& dDims  = dInfo.dims();
+        af_dtype dType         = dInfo.getType();
+
+        DIM_ASSERT(0, pDims == dDims);
+        DIM_ASSERT(0, pDims.ndims() == 2);
+        DIM_ASSERT(0,
+                   pDims[1] == 2 ||
+                       pDims[1] == 3);  // Columns: 2 means 2D and 3 means 3D
+
+        TYPE_ASSERT(pType == dType);
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        vector<af_array> pnts;
+        pnts.push_back(points);
+
+        vector<af_array> dirs;
+        dirs.push_back(directions);
+
+        switch (pType) {
+            case f32:
+                chart = setup_vector_field<float>(window, pnts, dirs, props);
+                break;
+            case s32:
+                chart = setup_vector_field<int>(window, pnts, dirs, props);
+                break;
+            case u32:
+                chart = setup_vector_field<uint>(window, pnts, dirs, props);
+                break;
+            case s16:
+                chart = setup_vector_field<short>(window, pnts, dirs, props);
+                break;
+            case u16:
+                chart = setup_vector_field<ushort>(window, pnts, dirs, props);
+                break;
+            case s8:
+                chart = setup_vector_field<schar>(window, pnts, dirs, props);
+                break;
+            case u8:
+                chart = setup_vector_field<uchar>(window, pnts, dirs, props);
+                break;
+            default: TYPE_ERROR(1, pType);
+        }
+        auto gridDims = forgeManager().getWindowGrid(window);
+
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err vectorFieldWrapper(const af_window window, const af_array xPoints,
+                          const af_array yPoints, const af_array zPoints,
+                          const af_array xDirs, const af_array yDirs,
+                          const af_array zDirs, const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& xpInfo = getInfo(xPoints);
+        const ArrayInfo& ypInfo = getInfo(yPoints);
+        const ArrayInfo& zpInfo = getInfo(zPoints);
+
+        af::dim4 xpDims        = xpInfo.dims();
+        const af::dim4& ypDims = ypInfo.dims();
+        const af::dim4& zpDims = zpInfo.dims();
+
+        af_dtype xpType = xpInfo.getType();
+        af_dtype ypType = ypInfo.getType();
+        af_dtype zpType = zpInfo.getType();
+
+        const ArrayInfo& xdInfo = getInfo(xDirs);
+        const ArrayInfo& ydInfo = getInfo(yDirs);
+        const ArrayInfo& zdInfo = getInfo(zDirs);
+
+        const af::dim4& xdDims = xdInfo.dims();
+        const af::dim4& ydDims = ydInfo.dims();
+        const af::dim4& zdDims = zdInfo.dims();
+
+        af_dtype xdType = xdInfo.getType();
+        af_dtype ydType = ydInfo.getType();
+        af_dtype zdType = zdInfo.getType();
+
+        // Assert all arrays have equal dimensions
+        DIM_ASSERT(1, xpDims == xdDims);
+        DIM_ASSERT(2, ypDims == ydDims);
+        DIM_ASSERT(3, zpDims == zdDims);
+
+        DIM_ASSERT(1, xpDims == ypDims);
+        DIM_ASSERT(1, xpDims == zpDims);
+
+        // Verify vector
+        DIM_ASSERT(1, xpDims.ndims() == 1);
+
+        // Assert all arrays have the same type
+        TYPE_ASSERT(xpType == xdType);
+        TYPE_ASSERT(ypType == ydType);
+        TYPE_ASSERT(zpType == zdType);
+
+        TYPE_ASSERT(xpType == ypType);
+        TYPE_ASSERT(xpType == zpType);
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        vector<af_array> points;
+        points.push_back(xPoints);
+        points.push_back(yPoints);
+        points.push_back(zPoints);
+
+        vector<af_array> directions;
+        directions.push_back(xDirs);
+        directions.push_back(yDirs);
+        directions.push_back(zDirs);
+
+        switch (xpType) {
+            case f32:
+                chart = setup_vector_field<float>(window, points, directions,
+                                                  props);
+                break;
+            case s32:
+                chart =
+                    setup_vector_field<int>(window, points, directions, props);
+                break;
+            case u32:
+                chart =
+                    setup_vector_field<uint>(window, points, directions, props);
+                break;
+            case s16:
+                chart = setup_vector_field<short>(window, points, directions,
+                                                  props);
+                break;
+            case u16:
+                chart = setup_vector_field<ushort>(window, points, directions,
+                                                   props);
+                break;
+            case s8:
+                chart = setup_vector_field<schar>(window, points, directions,
+                                                  props);
+                break;
+            case u8:
+                chart = setup_vector_field<uchar>(window, points, directions,
+                                                  props);
+                break;
+            default: TYPE_ERROR(1, xpType);
+        }
+        auto gridDims = forgeManager().getWindowGrid(window);
+
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err vectorFieldWrapper(const af_window window, const af_array xPoints,
+                          const af_array yPoints, const af_array xDirs,
+                          const af_array yDirs, const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        const ArrayInfo& xpInfo = getInfo(xPoints);
+        const ArrayInfo& ypInfo = getInfo(yPoints);
+
+        af::dim4 xpDims        = xpInfo.dims();
+        const af::dim4& ypDims = ypInfo.dims();
+
+        af_dtype xpType = xpInfo.getType();
+        af_dtype ypType = ypInfo.getType();
+
+        const ArrayInfo& xdInfo = getInfo(xDirs);
+        const ArrayInfo& ydInfo = getInfo(yDirs);
+
+        const af::dim4& xdDims = xdInfo.dims();
+        const af::dim4& ydDims = ydInfo.dims();
+
+        af_dtype xdType = xdInfo.getType();
+        af_dtype ydType = ydInfo.getType();
+
+        // Assert all arrays have equal dimensions
+        DIM_ASSERT(1, xpDims == xdDims);
+        DIM_ASSERT(2, ypDims == ydDims);
+
+        DIM_ASSERT(1, xpDims == ypDims);
+
+        // Verify vector
+        DIM_ASSERT(1, xpDims.ndims() == 1);
+
+        // Assert all arrays have the same type
+        DIM_ASSERT(1, xpType == xdType);
+        DIM_ASSERT(2, ypType == ydType);
+
+        DIM_ASSERT(1, xpType == ypType);
+
+        makeContextCurrent(window);
+
+        fg_chart chart = NULL;
+
+        vector<af_array> points;
+        points.push_back(xPoints);
+        points.push_back(yPoints);
+
+        vector<af_array> directions;
+        directions.push_back(xDirs);
+        directions.push_back(yDirs);
+
+        switch (xpType) {
+            case f32:
+                chart = setup_vector_field<float>(window, points, directions,
+                                                  props);
+                break;
+            case s32:
+                chart =
+                    setup_vector_field<int>(window, points, directions, props);
+                break;
+            case u32:
+                chart =
+                    setup_vector_field<uint>(window, points, directions, props);
+                break;
+            case s16:
+                chart = setup_vector_field<short>(window, points, directions,
+                                                  props);
+                break;
+            case u16:
+                chart = setup_vector_field<ushort>(window, points, directions,
+                                                   props);
+                break;
+            case s8:
+                chart = setup_vector_field<schar>(window, points, directions,
+                                                  props);
+                break;
+            case u8:
+                chart = setup_vector_field<uchar>(window, points, directions,
+                                                  props);
+                break;
+            default: TYPE_ERROR(1, xpType);
+        }
+
+        auto gridDims = forgeManager().getWindowGrid(window);
+
+        ForgeModule& _ = forgePlugin();
+        if (props->col > -1 && props->row > -1) {
+            FG_CHECK(_.fg_draw_chart_to_cell(
+                window, gridDims.first, gridDims.second,
+                props->row * gridDims.second + props->col, chart,
+                props->title));
+        } else {
+            FG_CHECK(_.fg_draw_chart(window, chart));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_draw_vector_field_nd(const af_window wind, const af_array points,
+                               const af_array directions,
+                               const af_cell* const props) {
+    return vectorFieldWrapper(wind, points, directions, props);
+}
+
+af_err af_draw_vector_field_3d(const af_window wind, const af_array xPoints,
+                               const af_array yPoints, const af_array zPoints,
+                               const af_array xDirs, const af_array yDirs,
+                               const af_array zDirs,
+                               const af_cell* const props) {
+    return vectorFieldWrapper(wind, xPoints, yPoints, zPoints, xDirs, yDirs,
+                              zDirs, props);
+}
+
+af_err af_draw_vector_field_2d(const af_window wind, const af_array xPoints,
+                               const af_array yPoints, const af_array xDirs,
+                               const af_array yDirs,
+                               const af_cell* const props) {
+    return vectorFieldWrapper(wind, xPoints, yPoints, xDirs, yDirs, props);
+}
diff --git a/src/api/c/version.cpp b/src/api/c/version.cpp
new file mode 100644
index 0000000000..47b6952427
--- /dev/null
+++ b/src/api/c/version.cpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <build_version.hpp>
+#include <af/util.h>
+
+af_err af_get_version(int *major, int *minor, int *patch) {
+    *major = AF_VERSION_MAJOR;
+    *minor = AF_VERSION_MINOR;
+    *patch = AF_VERSION_PATCH;
+
+    return AF_SUCCESS;
+}
+
+const char *af_get_revision() { return AF_REVISION; }
diff --git a/src/api/c/where.cpp b/src/api/c/where.cpp
index 0853e6df46..6f83aed17d 100644
--- a/src/api/c/where.cpp
+++ b/src/api/c/where.cpp
@@ -7,45 +7,57 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/algorithm.h>
-#include <err_common.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
 #include <handle.hpp>
-#include <ops.hpp>
 #include <where.hpp>
-#include <backend.hpp>
+#include <af/algorithm.h>
+#include <af/dim4.hpp>
+#include <complex>
 
-using af::dim4;
-using namespace detail;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+using std::swap;
 
 template<typename T>
-static inline af_array where(const af_array in)
-{
+static inline af_array where(const af_array in) {
     // Making it more explicit that the output is uint
     return getHandle<uint>(where<T>(getArray<T>(in)));
 }
 
-af_err af_where(af_array *idx, const af_array in)
-{
+af_err af_where(af_array* idx, const af_array in) {
     try {
-        af_dtype type = getInfo(in).getType();
+        const ArrayInfo& i_info = getInfo(in);
+        af_dtype type           = i_info.getType();
+
+        if (i_info.ndims() == 0) {
+            return af_create_handle(idx, 0, nullptr, u32);
+        }
+
         af_array res;
-        switch(type) {
-        case f32: res = where<float  >(in); break;
-        case f64: res = where<double >(in); break;
-        case c32: res = where<cfloat >(in); break;
-        case c64: res = where<cdouble>(in); break;
-        case s32: res = where<int    >(in); break;
-        case u32: res = where<uint   >(in); break;
-        case s64: res = where<intl   >(in); break;
-        case u64: res = where<uintl  >(in); break;
-        case u8 : res = where<uchar  >(in); break;
-        case b8 : res = where<char   >(in); break;
-        default:
-            TYPE_ERROR(1, type);
+        switch (type) {
+            case f32: res = where<float>(in); break;
+            case f64: res = where<double>(in); break;
+            case c32: res = where<cfloat>(in); break;
+            case c64: res = where<cdouble>(in); break;
+            case s32: res = where<int>(in); break;
+            case u32: res = where<uint>(in); break;
+            case s64: res = where<intl>(in); break;
+            case u64: res = where<uintl>(in); break;
+            case s16: res = where<short>(in); break;
+            case u16: res = where<ushort>(in); break;
+            case s8: res = where<schar>(in); break;
+            case u8: res = where<uchar>(in); break;
+            case b8: res = where<char>(in); break;
+            default: TYPE_ERROR(1, type);
         }
-        std::swap(*idx, res);
+        swap(*idx, res);
     }
     CATCHALL
 
diff --git a/src/api/c/window.cpp b/src/api/c/window.cpp
new file mode 100644
index 0000000000..fe9fea5ba0
--- /dev/null
+++ b/src/api/c/window.cpp
@@ -0,0 +1,304 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/algorithm.h>
+#include <af/graphics.h>
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <platform.hpp>
+
+using arrayfire::common::ForgeManager;
+using arrayfire::common::forgePlugin;
+using arrayfire::common::step_round;
+using detail::forgeManager;
+
+af_err af_create_window(af_window* out, const int width, const int height,
+                        const char* const title) {
+    try {
+        fg_window temp = forgeManager().getWindow(width, height, title, false);
+        std::swap(*out, temp);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_position(const af_window wind, const unsigned x,
+                       const unsigned y) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        FG_CHECK(forgePlugin().fg_set_window_position(wind, x, y));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_title(const af_window wind, const char* const title) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        FG_CHECK(forgePlugin().fg_set_window_title(wind, title));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_size(const af_window wind, const unsigned w, const unsigned h) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        FG_CHECK(forgePlugin().fg_set_window_size(wind, w, h));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_grid(const af_window wind, const int rows, const int cols) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        forgeManager().setWindowChartGrid(wind, rows, cols);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_axes_limits_compute(const af_window window, const af_array x,
+                                  const af_array y, const af_array z,
+                                  const bool exact,
+                                  const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        ForgeManager& fgMngr = forgeManager();
+
+        fg_chart chart = nullptr;
+
+        fg_chart_type ctype = (z ? FG_CHART_3D : FG_CHART_2D);
+
+        if (props->col > -1 && props->row > -1) {
+            chart = fgMngr.getChart(window, props->row, props->col, ctype);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, ctype);
+        }
+
+        double xmin = -1., xmax = 1.;
+        double ymin = -1., ymax = 1.;
+        double zmin = -1., zmax = 1.;
+        AF_CHECK(af_min_all(&xmin, nullptr, x));
+        AF_CHECK(af_max_all(&xmax, nullptr, x));
+        AF_CHECK(af_min_all(&ymin, nullptr, y));
+        AF_CHECK(af_max_all(&ymax, nullptr, y));
+
+        if (ctype == FG_CHART_3D) {
+            AF_CHECK(af_min_all(&zmin, nullptr, z));
+            AF_CHECK(af_max_all(&zmax, nullptr, z));
+        }
+
+        if (!exact) {
+            xmin = step_round(xmin, false);
+            xmax = step_round(xmax, true);
+            ymin = step_round(ymin, false);
+            ymax = step_round(ymax, true);
+            zmin = step_round(zmin, false);
+            zmax = step_round(zmax, true);
+        }
+
+        fgMngr.setChartAxesOverride(chart);
+        FG_CHECK(forgePlugin().fg_set_chart_axes_limits(chart, xmin, xmax, ymin,
+                                                        ymax, zmin, zmax));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_axes_limits_2d(const af_window window, const float xmin,
+                             const float xmax, const float ymin,
+                             const float ymax, const bool exact,
+                             const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        ForgeManager& fgMngr = forgeManager();
+
+        fg_chart chart = nullptr;
+        // The ctype below should not matter: this call is only meant to
+        // fetch the existing chart, not to set its type.
+        // If the type is actually being set here, that is a serious bug.
+        fg_chart_type ctype = FG_CHART_2D;
+
+        if (props->col > -1 && props->row > -1) {
+            chart = fgMngr.getChart(window, props->row, props->col, ctype);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, ctype);
+        }
+
+        double _xmin = xmin;
+        double _xmax = xmax;
+        double _ymin = ymin;
+        double _ymax = ymax;
+        if (!exact) {
+            _xmin = step_round(_xmin, false);
+            _xmax = step_round(_xmax, true);
+            _ymin = step_round(_ymin, false);
+            _ymax = step_round(_ymax, true);
+        }
+
+        fgMngr.setChartAxesOverride(chart);
+        FG_CHECK(forgePlugin().fg_set_chart_axes_limits(
+            chart, _xmin, _xmax, _ymin, _ymax, 0.0f, 0.0f));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_axes_limits_3d(const af_window window, const float xmin,
+                             const float xmax, const float ymin,
+                             const float ymax, const float zmin,
+                             const float zmax, const bool exact,
+                             const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        ForgeManager& fgMngr = forgeManager();
+
+        fg_chart chart = nullptr;
+        // The ctype below should not matter: this call is only meant to
+        // fetch the existing chart, not to set its type.
+        // If the type is actually being set here, that is a serious bug.
+        fg_chart_type ctype = FG_CHART_3D;
+
+        if (props->col > -1 && props->row > -1) {
+            chart = fgMngr.getChart(window, props->row, props->col, ctype);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, ctype);
+        }
+
+        double _xmin = xmin;
+        double _xmax = xmax;
+        double _ymin = ymin;
+        double _ymax = ymax;
+        double _zmin = zmin;
+        double _zmax = zmax;
+        if (!exact) {
+            _xmin = step_round(_xmin, false);
+            _xmax = step_round(_xmax, true);
+            _ymin = step_round(_ymin, false);
+            _ymax = step_round(_ymax, true);
+            _zmin = step_round(_zmin, false);
+            _zmax = step_round(_zmax, true);
+        }
+
+        fgMngr.setChartAxesOverride(chart);
+        FG_CHECK(forgePlugin().fg_set_chart_axes_limits(
+            chart, _xmin, _xmax, _ymin, _ymax, _zmin, _zmax));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_axes_titles(const af_window window, const char* const xtitle,
+                          const char* const ytitle, const char* const ztitle,
+                          const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        ForgeManager& fgMngr = forgeManager();
+
+        fg_chart chart = nullptr;
+
+        fg_chart_type ctype = (ztitle ? FG_CHART_3D : FG_CHART_2D);
+
+        if (props->col > -1 && props->row > -1) {
+            chart = fgMngr.getChart(window, props->row, props->col, ctype);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, ctype);
+        }
+
+        FG_CHECK(forgePlugin().fg_set_chart_axes_titles(chart, xtitle, ytitle,
+                                                        ztitle));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_axes_label_format(const af_window window,
+                                const char* const xformat,
+                                const char* const yformat,
+                                const char* const zformat,
+                                const af_cell* const props) {
+    try {
+        if (window == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+
+        ARG_ASSERT(2, xformat != nullptr);
+        ARG_ASSERT(3, yformat != nullptr);
+
+        ForgeManager& fgMngr = forgeManager();
+
+        fg_chart chart = nullptr;
+
+        fg_chart_type ctype = (zformat ? FG_CHART_3D : FG_CHART_2D);
+
+        if (props->col > -1 && props->row > -1) {
+            chart = fgMngr.getChart(window, props->row, props->col, ctype);
+        } else {
+            chart = fgMngr.getChart(window, 0, 0, ctype);
+        }
+
+        if (ctype == FG_CHART_2D) {
+            FG_CHECK(forgePlugin().fg_set_chart_label_format(chart, xformat,
+                                                             yformat, "%3.2f"));
+        } else {
+            ARG_ASSERT(4, zformat != nullptr);
+            FG_CHECK(forgePlugin().fg_set_chart_label_format(chart, xformat,
+                                                             yformat, zformat));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_show(const af_window wind) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        FG_CHECK(forgePlugin().fg_swap_window_buffers(wind));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_is_window_closed(bool* out, const af_window wind) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        FG_CHECK(forgePlugin().fg_close_window(out, wind));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_set_visibility(const af_window wind, const bool is_visible) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        if (is_visible) {
+            FG_CHECK(forgePlugin().fg_show_window(wind));
+        } else {
+            FG_CHECK(forgePlugin().fg_hide_window(wind));
+        }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_destroy_window(const af_window wind) {
+    try {
+        if (wind == 0) { AF_ERROR("Not a valid window", AF_ERR_INTERNAL); }
+        forgeManager().setWindowChartGrid(wind, 0, 0);
+        FG_CHECK(forgePlugin().fg_release_window(wind));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/wrap.cpp b/src/api/c/wrap.cpp
new file mode 100644
index 0000000000..e3c06a4642
--- /dev/null
+++ b/src/api/c/wrap.cpp
@@ -0,0 +1,111 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <wrap.hpp>
+#include <af/defines.h>
+#include <af/image.h>
+
+using af::dim4;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+template<typename T>
+static inline void wrap(af_array* out, const af_array in, const dim_t wx,
+                        const dim_t wy, const dim_t sx, const dim_t sy,
+                        const dim_t px, const dim_t py, const bool is_column) {
+    wrap<T>(getArray<T>(*out), getArray<T>(in), wx, wy, sx, sy, px, py,
+            is_column);
+}
+
+void af_wrap_common(af_array* out, const af_array in, const dim_t ox,
+                    const dim_t oy, const dim_t wx, const dim_t wy,
+                    const dim_t sx, const dim_t sy, const dim_t px,
+                    const dim_t py, const bool is_column, bool allocate_out) {
+    ARG_ASSERT(0, out != 0);  // *out (the af_array) can be null, but not out
+    ARG_ASSERT(1, in != 0);
+
+    const ArrayInfo& info  = getInfo(in);
+    const af_dtype in_type = info.getType();
+    const dim4& in_dims    = info.dims();
+    const dim4 out_dims(ox, oy, in_dims[2], in_dims[3]);
+
+    ARG_ASSERT(4, wx > 0);
+    ARG_ASSERT(5, wy > 0);
+    ARG_ASSERT(6, sx > 0);
+    ARG_ASSERT(7, sy > 0);
+
+    const dim_t nx = (ox + 2 * px - wx) / sx + 1;
+    const dim_t ny = (oy + 2 * py - wy) / sy + 1;
+
+    const dim_t patch_size  = is_column ? in_dims[0] : in_dims[1];
+    const dim_t num_patches = is_column ? in_dims[1] : in_dims[0];
+
+    DIM_ASSERT(1, patch_size == wx * wy);
+    DIM_ASSERT(1, num_patches == nx * ny);
+
+    if (allocate_out) { *out = createHandleFromValue(out_dims, 0.0, in_type); }
+
+    // The out pointer can be passed in to the function by the user
+    DIM_ASSERT(0, getInfo(*out).dims() == out_dims);
+
+    // clang-format off
+    switch(in_type) {
+        case f32: wrap<float  >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case f64: wrap<double >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case c32: wrap<cfloat >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case c64: wrap<cdouble>(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case s32: wrap<int    >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case u32: wrap<uint   >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case s64: wrap<intl   >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case u64: wrap<uintl  >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case s16: wrap<short  >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case u16: wrap<ushort >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case s8:  wrap<schar  >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case u8:  wrap<uchar  >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        case b8:  wrap<char   >(out, in, wx, wy, sx, sy, px, py, is_column);  break;
+        default:  TYPE_ERROR(1, in_type);
+    }
+    // clang-format on
+}
+
+af_err af_wrap(af_array* out, const af_array in, const dim_t ox, const dim_t oy,
+               const dim_t wx, const dim_t wy, const dim_t sx, const dim_t sy,
+               const dim_t px, const dim_t py, const bool is_column) {
+    try {
+        af_wrap_common(out, in, ox, oy, wx, wy, sx, sy, px, py, is_column,
+                       true);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
+
+af_err af_wrap_v2(af_array* out, const af_array in, const dim_t ox,
+                  const dim_t oy, const dim_t wx, const dim_t wy,
+                  const dim_t sx, const dim_t sy, const dim_t px,
+                  const dim_t py, const bool is_column) {
+    try {
+        ARG_ASSERT(0, out != 0);  // need to dereference out in next call
+        af_wrap_common(out, in, ox, oy, wx, wy, sx, sy, px, py, is_column,
+                       *out == 0);
+    }
+    CATCHALL;
+
+    return AF_SUCCESS;
+}
diff --git a/src/api/c/ycbcr_rgb.cpp b/src/api/c/ycbcr_rgb.cpp
new file mode 100644
index 0000000000..a871618d28
--- /dev/null
+++ b/src/api/c/ycbcr_rgb.cpp
@@ -0,0 +1,163 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arith.hpp>
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <handle.hpp>
+#include <join.hpp>
+#include <math.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+
+using af::dim4;
+using detail::arithOp;
+using detail::Array;
+using detail::createEmptyArray;
+using detail::createValueArray;
+using detail::join;
+using detail::scalar;
+
+template<typename T>
+static Array<T> mix(const Array<T>& X, const Array<T>& Y, double xf,
+                    double yf) {
+    const dim4& dims = X.dims();
+    Array<T> xf_cnst = createValueArray<T>(dims, xf);
+    Array<T> yf_cnst = createValueArray<T>(dims, yf);
+
+    Array<T> fX = arithOp<T, af_mul_t>(xf_cnst, X, dims);
+    Array<T> fY = arithOp<T, af_mul_t>(yf_cnst, Y, dims);
+
+    return arithOp<T, af_add_t>(fX, fY, dims);
+}
+
+template<typename T>
+static Array<T> mix(const Array<T>& X, const Array<T>& Y, const Array<T>& Z,
+                    double xf, double yf, double zf) {
+    const dim4& dims = X.dims();
+    Array<T> xf_cnst = createValueArray<T>(dims, xf);
+    Array<T> yf_cnst = createValueArray<T>(dims, yf);
+    Array<T> zf_cnst = createValueArray<T>(dims, zf);
+
+    Array<T> fX = arithOp<T, af_mul_t>(xf_cnst, X, dims);
+    Array<T> fY = arithOp<T, af_mul_t>(yf_cnst, Y, dims);
+    Array<T> fZ = arithOp<T, af_mul_t>(zf_cnst, Z, dims);
+
+    Array<T> fx_fy = arithOp<T, af_add_t>(fX, fY, dims);
+    return arithOp<T, af_add_t>(fx_fy, fZ, dims);
+}
+
+template<typename T>
+static Array<T> digitize(const Array<T> ch, const double scale,
+                         const double offset) {
+    const dim4& dims = ch.dims();
+    Array<T> base    = createValueArray<T>(dims, scalar<T>(offset));
+    Array<T> cnst    = createValueArray<T>(dims, scalar<T>(scale));
+    Array<T> scl     = arithOp<T, af_mul_t>(ch, cnst, dims);
+    return arithOp<T, af_add_t>(scl, base, dims);
+}
+
+template<typename T, bool isYCbCr2RGB>
+static af_array convert(const af_array& in, const af_ycc_std standard) {
+    static const float INV_219 = 0.004566210;
+    static const float INV_112 = 0.008928571;
+    static const float k[6]    = {0.1140f, 0.2990f, 0.0722f,
+                                  0.2126f, 0.0593f, 0.2627f};
+    unsigned stdIdx            = 0;  // Default standard is AF_YCC_601
+    switch (standard) {
+        case AF_YCC_709: stdIdx = 2; break;
+        case AF_YCC_2020: stdIdx = 4; break;
+        default: stdIdx = 0; break;
+    }
+    float kb    = k[stdIdx];
+    float kr    = k[stdIdx + 1];
+    float kl    = 1.0f - kb - kr;
+    float invKl = 1 / kl;
+
+    // extract three channels as three slices
+    // prepare sequence objects
+    // get Array objects for corresponding channel views
+    const Array<T> input = getArray<T>(in);
+    std::vector<af_seq> indices(4, af_span);
+
+    indices[2] = {0, 0, 1};
+    Array<T> X = createSubArray(input, indices, false);
+
+    indices[2] = {1, 1, 1};
+    Array<T> Y = createSubArray(input, indices, false);
+
+    indices[2] = {2, 2, 1};
+    Array<T> Z = createSubArray(input, indices, false);
+
+    if (isYCbCr2RGB) {
+        const dim4& dims = X.dims();
+        Array<T> yc      = createValueArray<T>(dims, 16);
+        Array<T> cc      = createValueArray<T>(dims, 128);
+        Array<T> Y_      = arithOp<T, af_sub_t>(X, yc, dims);
+        Array<T> Cb_     = arithOp<T, af_sub_t>(Y, cc, dims);
+        Array<T> Cr_     = arithOp<T, af_sub_t>(Z, cc, dims);
+        Array<T> R       = mix<T>(Y_, Cr_, INV_219, INV_112 * (1 - kr));
+        Array<T> G =
+            mix<T>(Y_, Cr_, Cb_, INV_219, INV_112 * (kr - 1) * kr * invKl,
+                   INV_112 * (kb - 1) * kb * invKl);
+        Array<T> B = mix<T>(Y_, Cb_, INV_219, INV_112 * (1 - kb));
+        // join channels
+        dim4 odims(R.dims()[0], R.dims()[1], 3);
+        Array<T> rgbout = createEmptyArray<T>(odims);
+        join<T>(rgbout, 2, {R, G, B});
+        return getHandle(rgbout);
+    }
+    Array<T> Ey = mix<T>(X, Y, Z, kr, kl, kb);
+    Array<T> Ecr =
+        mix<T>(X, Y, Z, 0.5, 0.5 * kl / (kr - 1), 0.5 * kb / (kr - 1));
+    Array<T> Ecb =
+        mix<T>(X, Y, Z, 0.5 * kr / (kb - 1), 0.5 * kl / (kb - 1), 0.5);
+    Array<T> Y_ = digitize<T>(Ey, 219.0, 16.0);
+    Array<T> Cr = digitize<T>(Ecr, 224.0, 128.0);
+    Array<T> Cb = digitize<T>(Ecb, 224.0, 128.0);
+    // join channels
+    dim4 odims(Y_.dims()[0], Y_.dims()[1], 3);
+    Array<T> ycbcrout = createEmptyArray<T>(odims);
+    join<T>(ycbcrout, 2, {Y_, Cb, Cr});
+    return getHandle(ycbcrout);
+}
+
+template<bool isYCbCr2RGB>
+af_err convert(af_array* out, const af_array& in, const af_ycc_std standard) {
+    try {
+        const ArrayInfo& info = getInfo(in);
+        af_dtype iType        = info.getType();
+        af::dim4 inputDims    = info.dims();
+
+        ARG_ASSERT(1, (inputDims.ndims() >= 3));
+
+        af_array output = 0;
+        switch (iType) {
+            case f64:
+                output = convert<double, isYCbCr2RGB>(in, standard);
+                break;
+            case f32: output = convert<float, isYCbCr2RGB>(in, standard); break;
+            default: TYPE_ERROR(1, iType); break;
+        }
+        std::swap(*out, output);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err af_ycbcr2rgb(af_array* out, const af_array in,
+                    const af_ycc_std standard) {
+    return convert<true>(out, in, standard);
+}
+
+af_err af_rgb2ycbcr(af_array* out, const af_array in,
+                    const af_ycc_std standard) {
+    return convert<false>(out, in, standard);
+}
diff --git a/src/api/cpp/CMakeLists.txt b/src/api/cpp/CMakeLists.txt
new file mode 100644
index 0000000000..e33a8b320d
--- /dev/null
+++ b/src/api/cpp/CMakeLists.txt
@@ -0,0 +1,98 @@
+
+add_library(cpp_api_interface INTERFACE)
+
+target_sources(cpp_api_interface
+  INTERFACE
+    ${CMAKE_CURRENT_SOURCE_DIR}/common.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/error.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/anisotropic_diffusion.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/approx.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/array.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/bilateral.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/binary.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/blas.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/canny.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/clamp.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/colorspace.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/complex.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/confidence_connected.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/convolve.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/corrcoef.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/covariance.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/data.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/deconvolution.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/device.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/diff.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/dog.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/event.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/exampleFunction.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/exception.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fast.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/features.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fft.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/fftconvolve.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/filters.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/gaussian_kernel.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/gfor.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/gradient.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/graphics.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/hamming.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/harris.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/histogram.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/homography.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/hsv_rgb.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/iir.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/imageio.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/index.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/internal.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit_test_api.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/lapack.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/matchTemplate.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/mean.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/meanvar.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/meanshift.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/median.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/moments.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/morph.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/nearest_neighbour.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/orb.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/random.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/reduce.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/regions.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/resize.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/rgb_gray.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/rotate.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sat.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/scale.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/scan.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/seq.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/set.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sift.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/skew.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sobel.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sort.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sparse.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/stdev.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/susan.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/timing.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/topk.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transform.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transform_coordinates.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/translate.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/transpose.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/unary.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/unwrap.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/util.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/var.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/where.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/wrap.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ycbcr_rgb.cpp
+)
+
+target_include_directories(cpp_api_interface
+  SYSTEM INTERFACE
+    ${ArrayFire_SOURCE_DIR}/extern/half/include)
+
+target_include_directories(cpp_api_interface
+  INTERFACE
+    ${CMAKE_SOURCE_DIR}/src/api/c)
diff --git a/src/api/cpp/anisotropic_diffusion.cpp b/src/api/cpp/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..cd0800c1fa
--- /dev/null
+++ b/src/api/cpp/anisotropic_diffusion.cpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+array anisotropicDiffusion(const array& in, const float timestep,
+                           const float conductance, const unsigned iterations,
+                           const fluxFunction fftype, const diffusionEq eq) {
+    af_array out = 0;
+    AF_THROW(af_anisotropic_diffusion(&out, in.get(), timestep, conductance,
+                                      iterations, fftype, eq));
+    return array(out);
+}
+}  // namespace af
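
Note on the wrapper above: it follows the pattern used by the other C++ API files in this diff. It calls the C function with an out-parameter, and a non-success status code becomes a C++ exception through `AF_THROW`. The sketch below shows that pattern in isolation. All names in it (`demo_err`, `demo_scale`, `DEMO_THROW`, `scale`) are hypothetical stand-ins chosen for illustration. They are not ArrayFire identifiers, and the real `AF_THROW` macro in `error.hpp` may differ in detail.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical C-style status codes (a stand-in for af_err).
enum demo_err { DEMO_SUCCESS = 0, DEMO_ERR_ARG = 1 };

// Hypothetical C-style function: the result goes through an out-pointer and
// the return value only reports success or failure.
demo_err demo_scale(double *out, double in, double factor) {
    if (out == nullptr || factor == 0.0) { return DEMO_ERR_ARG; }
    *out = in * factor;
    return DEMO_SUCCESS;
}

// Convert a non-success status into a C++ exception (a stand-in for AF_THROW).
#define DEMO_THROW(call)                                                   \
    do {                                                                   \
        demo_err err_ = (call);                                            \
        if (err_ != DEMO_SUCCESS) {                                        \
            throw std::runtime_error("demo call failed with code " +       \
                                     std::to_string(static_cast<int>(err_))); \
        }                                                                  \
    } while (0)

// C++ wrapper: same shape as the wrappers in this diff. Initialize the
// output, call the C function inside the throwing macro, return the result.
double scale(double in, double factor) {
    double out = 0.0;
    DEMO_THROW(demo_scale(&out, in, factor));
    return out;
}
```

With these stand-ins, `scale(2.0, 3.0)` returns the product, while `scale(1.0, 0.0)` throws `std::runtime_error` instead of handing a status code back to the caller.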
diff --git a/src/api/cpp/approx.cpp b/src/api/cpp/approx.cpp
index 3d02a84b05..4e560572b4 100644
--- a/src/api/cpp/approx.cpp
+++ b/src/api/cpp/approx.cpp
@@ -11,20 +11,38 @@
 #include <af/signal.h>
 #include "error.hpp"
 
-namespace af
-{
-    array approx1(const array& in, const array &pos, const interpType method, const float offGrid)
-    {
-        af_array out = 0;
-        AF_THROW(af_approx1(&out, in.get(), pos.get(), method, offGrid));
-        return array(out);
-    }
+namespace af {
+array approx1(const array &yi, const array &xo, const interpType method,
+              const float offGrid) {
+    af_array yo = 0;
+    AF_THROW(af_approx1(&yo, yi.get(), xo.get(), method, offGrid));
+    return array(yo);
+}
+
+array approx2(const array &zi, const array &xo, const array &yo,
+              const interpType method, const float offGrid) {
+    af_array zo = 0;
+    AF_THROW(af_approx2(&zo, zi.get(), xo.get(), yo.get(), method, offGrid));
+    return array(zo);
+}
+
+array approx1(const array &yi, const array &xo, const int xdim,
+              const double xi_beg, const double xi_step,
+              const interpType method, const float offGrid) {
+    af_array yo = 0;
+    AF_THROW(af_approx1_uniform(&yo, yi.get(), xo.get(), xdim, xi_beg, xi_step,
+                                method, offGrid));
+    return array(yo);
+}
 
-    array approx2(const array& in, const array &pos0, const array &pos1,
-                  const interpType method, const float offGrid)
-    {
-        af_array out = 0;
-        AF_THROW(af_approx2(&out, in.get(), pos0.get(), pos1.get(), method, offGrid));
-        return array(out);
-    }
+array approx2(const array &zi, const array &xo, const int xdim,
+              const double xi_beg, const double xi_step, const array &yo,
+              const int ydim, const double yi_beg, const double yi_step,
+              const interpType method, const float offGrid) {
+    af_array zo = 0;
+    AF_THROW(af_approx2_uniform(&zo, zi.get(), xo.get(), xdim, xi_beg, xi_step,
+                                yo.get(), ydim, yi_beg, yi_step, method,
+                                offGrid));
+    return array(zo);
 }
+}  // namespace af
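
Note on the new uniform-grid overloads above: `approx1` with `xi_beg` and `xi_step` forwards to `af_approx1_uniform`, which lets the caller describe the input sample positions as a start value plus a step instead of assuming indices 0, 1, 2, and so on. The sketch below is one reading of those parameters for the linear case, written as plain host C++ on `std::vector`. The function name `approx1_uniform_sketch` is hypothetical, the positive-step requirement is an assumption of this sketch, and none of this is ArrayFire's actual kernel.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Linear interpolation on a uniform input grid. Sample i of yi is assumed to
// sit at position xi_beg + i * xi_step. Queries that fall outside the sampled
// range yield off_grid. Illustrative only.
std::vector<float> approx1_uniform_sketch(const std::vector<float> &yi,
                                          const std::vector<float> &xo,
                                          double xi_beg, double xi_step,
                                          float off_grid) {
    assert(xi_step > 0.0 && "this sketch assumes a positive grid step");
    std::vector<float> yo;
    yo.reserve(xo.size());
    for (float x : xo) {
        // Map the query position to a fractional index into yi.
        const double idx = (x - xi_beg) / xi_step;
        if (yi.empty() || idx < 0.0 ||
            idx > static_cast<double>(yi.size() - 1)) {
            yo.push_back(off_grid);
            continue;
        }
        const std::size_t lo = static_cast<std::size_t>(std::floor(idx));
        const std::size_t hi = (lo + 1 < yi.size()) ? lo + 1 : lo;
        const double frac    = idx - static_cast<double>(lo);
        // Blend the two neighbouring samples by the fractional distance.
        yo.push_back(
            static_cast<float>((1.0 - frac) * yi[lo] + frac * yi[hi]));
    }
    return yo;
}
```

For example, with `yi = {10, 20, 30}`, `xi_beg = 0` and `xi_step = 2`, the samples sit at positions 0, 2 and 4. A query at 1 lands halfway between the first two samples, and a query at 5 is off the grid.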
diff --git a/src/api/cpp/array.cpp b/src/api/cpp/array.cpp
index 46cc31bbac..418d94c52b 100644
--- a/src/api/cpp/array.cpp
+++ b/src/api/cpp/array.cpp
@@ -6,1003 +6,1129 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <stdexcept>
+
 #include <af/array.h>
+
+#include <af/algorithm.h>
 #include <af/arith.h>
 #include <af/blas.h>
 #include <af/data.h>
-#include <af/traits.hpp>
-#include <af/util.h>
-#include <af/index.h>
 #include <af/device.h>
 #include <af/gfor.h>
-#include <af/algorithm.h>
+#include <af/half.h>
+#include <af/index.h>
+#include <af/internal.h>
+#include <af/traits.hpp>
+#include <af/util.h>
 #include "error.hpp"
 
-namespace af
-{
-    static int gforDim(af_index_t *indices)
-    {
-        for (int i = 0; i < AF_MAX_DIMS; i++) {
-            if (indices[i].isBatch) return i;
-        }
-        return -1;
-    }
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wparentheses"
+#include "half.hpp"  //note: NOT common. From extern/half/include/half.hpp
+#pragma GCC diagnostic pop
 
-    static af_array gforReorder(const af_array in, unsigned dim)
-    {
-        // This is here to stop gcc from complaining
-        if (dim > 3) THROW(AF_ERR_SIZE);
-        unsigned order[AF_MAX_DIMS] = {0, 1, 2, dim};
-        order[dim] = 3;
-        af_array out;
-        AF_THROW(af_reorder(&out, in, order[0], order[1], order[2], order[3]));
-        return out;
+#ifdef AF_CUDA
+// NOTE: Adding ifdef here to avoid copying code constructor in the cuda backend
+#include <cuda_fp16.h>
+#include <traits.hpp>
+#endif
+
+#ifdef AF_UNIFIED
+#include <symbol_manager.hpp>
+#include <af/backend.h>
+using arrayfire::common::getFunctionPointer;
+#endif
+
+#include <memory>
+#include <stdexcept>
+#include <vector>
+
+using af::calcDim;
+using af::dim4;
+using std::copy;
+using std::logic_error;
+using std::vector;
+
+namespace {
+int gforDim(af_index_t *indices) {
+    for (int i = 0; i < AF_MAX_DIMS; i++) {
+        if (indices[i].isBatch) { return i; }
     }
+    return -1;
+}
 
-    static af::dim4 seqToDims(af_index_t *indices, af::dim4 parentDims, bool reorder = true)
-    {
-        try {
-            af::dim4 odims(1);
-            for (int i = 0; i < AF_MAX_DIMS; i++) {
-                if (indices[i].isSeq) {
-                    odims[i] = calcDim(indices[i].idx.seq, parentDims[i]);
-                } else {
-                    dim_t elems = 0;
-                    AF_THROW(af_get_elements(&elems, indices[i].idx.arr));
-                    odims[i] = elems;
-                }
+af_array gforReorder(const af_array in, unsigned dim) {
+    // This is here to stop gcc from complaining
+    if (dim > 3) { AF_THROW_ERR("GFor: Dimension is invalid", AF_ERR_SIZE); }
+    unsigned order[AF_MAX_DIMS] = {0, 1, 2, dim};
+
+    order[dim] = 3;
+    af_array out;
+    AF_THROW(af_reorder(&out, in, order[0], order[1], order[2], order[3]));
+    return out;
+}
+
+af::dim4 seqToDims(af_index_t *indices, af::dim4 parentDims,
+                   bool reorder = true) {
+    try {
+        af::dim4 odims(1);
+        for (int i = 0; i < AF_MAX_DIMS; i++) {
+            if (indices[i].isSeq) {
+                odims[i] = calcDim(indices[i].idx.seq, parentDims[i]);
+            } else {
+                dim_t elems = 0;
+                AF_THROW(af_get_elements(&elems, indices[i].idx.arr));
+                odims[i] = elems;
             }
+        }
 
-            // Change the dimensions if inside GFOR
-            if (reorder) {
-                for (int i = 0; i < AF_MAX_DIMS; i++) {
-                    if (indices[i].isBatch) {
-                        int tmp = odims[i];
-                        odims[i] = odims[3];
-                        odims[3] = tmp;
-                        break;
-                    }
+        // Change the dimensions if inside GFOR
+        if (reorder) {
+            for (int i = 0; i < AF_MAX_DIMS; i++) {
+                if (indices[i].isBatch) {
+                    int tmp  = odims[i];
+                    odims[i] = odims[3];
+                    odims[3] = tmp;
+                    break;
                 }
             }
-            return odims;
-        } catch(std::logic_error &err) {
-            AF_THROW_MSG(err.what(), AF_ERR_SIZE);
         }
-    }
+        return odims;
+    } catch (const logic_error &err) { AF_THROW_ERR(err.what(), AF_ERR_SIZE); }
+}
 
-    static unsigned size_of(af::dtype type)
-    {
-        switch(type) {
-        case f32: return sizeof(float);
-        case f64: return sizeof(double);
-        case s32: return sizeof(int);
-        case u32: return sizeof(unsigned);
-        case s64: return sizeof(intl);
-        case u64: return sizeof(uintl);
-        case u8 : return sizeof(unsigned char);
-        case b8 : return sizeof(unsigned char);
-        case c32: return sizeof(float) * 2;
-        case c64: return sizeof(double) * 2;
-        default: return sizeof(float);
-        }
-    }
+unsigned numDims(const af_array arr) {
+    unsigned nd;
+    AF_THROW(af_get_numdims(&nd, arr));
+    return nd;
+}
 
-    static unsigned numDims(const af_array arr)
-    {
-        unsigned nd;
-        AF_THROW(af_get_numdims(&nd, arr));
-        return nd;
-    }
+dim4 getDims(const af_array arr) {
+    dim_t d0, d1, d2, d3;
+    AF_THROW(af_get_dims(&d0, &d1, &d2, &d3, arr));
+    return dim4(d0, d1, d2, d3);
+}
 
-    static dim4 getDims(const af_array arr)
-    {
-        dim_t d0, d1, d2, d3;
-        AF_THROW(af_get_dims(&d0, &d1, &d2, &d3, arr));
-        return dim4(d0, d1, d2, d3);
-    }
+af_array initEmptyArray(af::dtype ty, dim_t d0, dim_t d1 = 1, dim_t d2 = 1,
+                        dim_t d3 = 1) {
+    af_array arr;
+    dim_t my_dims[] = {d0, d1, d2, d3};
+    AF_THROW(af_create_handle(&arr, AF_MAX_DIMS, my_dims, ty));
+    return arr;
+}
 
-    struct array::array_proxy::array_proxy_impl
-    {
-        array       *parent;        // The original array
-        af_index_t  indices[4];     // Indexing array or seq objects
-        bool        lin;
-        array_proxy_impl(array &parent, af_index_t *idx, bool linear)
-            : parent(&parent)
-            , indices()
-            , lin(linear)
-        {
-            std::copy(idx, idx + AF_MAX_DIMS, indices);
-        }
-    };
+af_array initDataArray(const void *ptr, int ty, af::source src, dim_t d0,
+                       dim_t d1 = 1, dim_t d2 = 1, dim_t d3 = 1) {
+    dim_t my_dims[] = {d0, d1, d2, d3};
+    af_array arr;
+    switch (src) {
+        case afHost:
+            AF_THROW(af_create_array(&arr, ptr, AF_MAX_DIMS, my_dims,
+                                     static_cast<af_dtype>(ty)));
+            break;
+        case afDevice:
+            AF_THROW(af_device_array(&arr, const_cast<void *>(ptr), AF_MAX_DIMS,
+                                     my_dims, static_cast<af_dtype>(ty)));
+            break;
+        default:
+            AF_THROW_ERR(
+                "Can not create array from the requested source pointer",
+                AF_ERR_ARG);
+    }
+    return arr;
+}
+}  // namespace
 
-    array::array(const af_array handle): arr(handle)
-    {
-    }
+namespace af {
 
-    static void initEmptyArray(af_array *arr, af::dtype ty,
-                               dim_t d0, dim_t d1=1, dim_t d2=1, dim_t d3=1)
-    {
-        dim_t my_dims[] = {d0, d1, d2, d3};
-        AF_THROW(af_create_handle(arr, AF_MAX_DIMS, my_dims, ty));
-    }
+struct array::array_proxy::array_proxy_impl {
+    // NOLINTNEXTLINE(misc-non-private-member-variables-in-classes)
+    array *parent_;  //< The original array
+    // NOLINTNEXTLINE(misc-non-private-member-variables-in-classes)
+    af_index_t indices_[4];  //< Indexing array or seq objects
+    // NOLINTNEXTLINE(misc-non-private-member-variables-in-classes)
+    bool is_linear_;
 
-    template<typename T>
-    static void initDataArray(af_array *arr, const T *ptr, af::source src,
-                              dim_t d0, dim_t d1=1, dim_t d2=1, dim_t d3=1)
-    {
-        af::dtype ty = (af::dtype)dtype_traits<T>::af_type;
-        dim_t my_dims[] = {d0, d1, d2, d3};
-        switch (src) {
-        case afHost:   AF_THROW(af_create_array(arr, (const void * const)ptr, AF_MAX_DIMS, my_dims, ty)); break;
-        case afDevice: AF_THROW(af_device_array(arr, (const void *      )ptr, AF_MAX_DIMS, my_dims, ty)); break;
-        default: AF_THROW_MSG("Can not create array from the requested source pointer",
-                              AF_ERR_ARG);
-        }
+    // If true, the parent_ object will be deleted on destruction. This is
+    // necessary only when calling indexing functions in array_proxy objects.
+    // NOLINTNEXTLINE(misc-non-private-member-variables-in-classes)
+    bool delete_on_destruction_;
+    array_proxy_impl(array &parent, af_index_t *idx, bool linear)
+        : parent_(&parent)
+        , indices_()
+        , is_linear_(linear)
+        , delete_on_destruction_(false) {
+        std::copy(idx, idx + AF_MAX_DIMS, indices_);
     }
 
-    array::array() : arr(0)
-    {
-        initEmptyArray(&arr, f32, 0, 0, 0, 0);
-    }
+    void delete_on_destruction(bool val) { delete_on_destruction_ = val; }
 
-    array::array(const dim4 &dims, af::dtype ty) : arr(0)
-    {
-        initEmptyArray(&arr, ty, dims[0], dims[1], dims[2], dims[3]);
+    ~array_proxy_impl() {
+        if (delete_on_destruction_) { delete parent_; }
     }
 
-    array::array(dim_t d0, af::dtype ty) : arr(0)
-    {
-        initEmptyArray(&arr, ty, d0);
-    }
+    array_proxy_impl(const array_proxy_impl &)            = delete;
+    array_proxy_impl(const array_proxy_impl &&)           = delete;
+    array_proxy_impl operator=(const array_proxy_impl &)  = delete;
+    array_proxy_impl operator=(const array_proxy_impl &&) = delete;
+};
 
-    array::array(dim_t d0, dim_t d1, af::dtype ty) : arr(0)
-    {
-        initEmptyArray(&arr, ty, d0, d1);
-    }
+array::array(const af_array handle) : arr(handle) {}
 
-    array::array(dim_t d0, dim_t d1, dim_t d2, af::dtype ty) :
-        arr(0)
-    {
-        initEmptyArray(&arr, ty, d0, d1, d2);
-    }
+array::array() : arr(initEmptyArray(f32, 0, 1, 1, 1)) {}
 
-    array::array(dim_t d0, dim_t d1, dim_t d2, dim_t d3, af::dtype ty) :
-        arr(0)
-    {
-        initEmptyArray(&arr, ty, d0, d1, d2, d3);
-    }
+array::array(array &&other) noexcept : arr(other.arr) { other.arr = 0; }
 
-#define INSTANTIATE(T)                                                  \
-    template<> AFAPI                                                    \
-    array::array(const dim4 &dims, const T *ptr, af::source src)       \
-        : arr(0)                                                        \
-    {                                                                   \
-        initDataArray<T>(&arr, ptr, src, dims[0], dims[1], dims[2],     \
-                dims[3]);                                               \
-    }                                                                   \
-    template<> AFAPI                                                    \
-    array::array(dim_t d0, const T *ptr, af::source src) \
-        : arr(0)                                                        \
-    {                                                                   \
-        initDataArray<T>(&arr, ptr, src, d0);                           \
-    }                                                                   \
-    template<> AFAPI                                                    \
-    array::array(dim_t d0, dim_t d1, const T *ptr, af::source src)     \
-        : arr(0)                                                        \
-    {                                                                   \
-        initDataArray<T>(&arr, ptr, src, d0, d1);                       \
-    }                                                                   \
-    template<> AFAPI                                                    \
-    array::array(dim_t d0, dim_t d1, dim_t d2, const T *ptr,            \
-                 af::source src)                                       \
-        : arr(0)                                                        \
-    {                                                                   \
-        initDataArray<T>(&arr, ptr, src, d0, d1, d2);                   \
-    }                                                                   \
-    template<> AFAPI                                                    \
-    array::array(dim_t d0, dim_t d1, dim_t d2, dim_t d3, const T *ptr,  \
-                 af::source src) :                                     \
-        arr(0)                                                          \
-                                                                        \
-    {                                                                   \
-        initDataArray<T>(&arr, ptr, src, d0, d1, d2, d3);               \
-    }                                                                   \
+array &array::operator=(array &&other) noexcept {
+    af_release_array(arr);
+    arr       = other.arr;
+    other.arr = 0;
+    return *this;
+}
 
-    INSTANTIATE(cdouble)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(float)
-    INSTANTIATE(unsigned)
-    INSTANTIATE(int)
-    INSTANTIATE(unsigned char)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+array::array(const dim4 &dims, af::dtype ty)
+    : arr(initEmptyArray(ty, dims[0], dims[1], dims[2], dims[3])) {}
+
+array::array(dim_t dim0, af::dtype ty) : arr(initEmptyArray(ty, dim0)) {}
+
+array::array(dim_t dim0, dim_t dim1, af::dtype ty)
+    : arr(initEmptyArray(ty, dim0, dim1)) {}
+
+array::array(dim_t dim0, dim_t dim1, dim_t dim2, af::dtype ty)
+    : arr(initEmptyArray(ty, dim0, dim1, dim2)) {}
+
+array::array(dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3, af::dtype ty)
+    : arr(initEmptyArray(ty, dim0, dim1, dim2, dim3)) {}
+
+template<>
+struct dtype_traits<half_float::half> {
+    enum { af_type = f16, ctype = f16 };
+    using base_type = half;
+    static const char *getName() { return "half"; }
+};
+
+#define INSTANTIATE(T)                                                         \
+    template<>                                                                 \
+    AFAPI array::array(const dim4 &dims, const T *ptr, af::source src)         \
+        : arr(initDataArray(ptr, dtype_traits<T>::af_type, src, dims[0],       \
+                            dims[1], dims[2], dims[3])) {}                     \
+    template<>                                                                 \
+    AFAPI array::array(dim_t dim0, const T *ptr, af::source src)               \
+        : arr(initDataArray(ptr, dtype_traits<T>::af_type, src, dim0)) {}      \
+    template<>                                                                 \
+    AFAPI array::array(dim_t dim0, dim_t dim1, const T *ptr, af::source src)   \
+        : arr(initDataArray(ptr, dtype_traits<T>::af_type, src, dim0, dim1)) { \
+    }                                                                          \
+    template<>                                                                 \
+    AFAPI array::array(dim_t dim0, dim_t dim1, dim_t dim2, const T *ptr,       \
+                       af::source src)                                         \
+        : arr(initDataArray(ptr, dtype_traits<T>::af_type, src, dim0, dim1,    \
+                            dim2)) {}                                          \
+    template<>                                                                 \
+    AFAPI array::array(dim_t dim0, dim_t dim1, dim_t dim2, dim_t dim3,         \
+                       const T *ptr, af::source src)                           \
+        : arr(initDataArray(ptr, dtype_traits<T>::af_type, src, dim0, dim1,    \
+                            dim2, dim3)) {}
+
+INSTANTIATE(cdouble)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(unsigned)
+INSTANTIATE(int)
+INSTANTIATE(signed char)
+INSTANTIATE(unsigned char)
+INSTANTIATE(char)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(short)
+INSTANTIATE(unsigned short)
+INSTANTIATE(af_half)
+INSTANTIATE(half_float::half)
+#ifdef AF_CUDA
+INSTANTIATE(__half);
+#endif
 
 #undef INSTANTIATE
 
-    array::~array()
-    {
-        af_array tmp = get();
-        // THOU SHALL NOT THROW IN DESTRUCTORS
-        af_release_array(tmp);
-    }
-
-    af::dtype array::type() const
-    {
-        af::dtype my_type;
-        AF_THROW(af_get_type(&my_type, arr));
-        return my_type;
+array::~array() {
+#ifdef AF_UNIFIED
+    using af_release_array_ptr =
+        std::add_pointer<decltype(af_release_array)>::type;
+
+    if (get()) {
+        af_backend backend = arrayfire::unified::getActiveBackend();
+        af_err err         = af_get_backend_id(&backend, get());
+        if (!err) {
+            switch (backend) {
+                case AF_BACKEND_CPU: {
+                    static auto *cpu_handle =
+                        arrayfire::unified::getActiveHandle();
+                    static auto release_func =
+                        reinterpret_cast<af_release_array_ptr>(
+                            getFunctionPointer(cpu_handle, "af_release_array"));
+                    release_func(get());
+                    break;
+                }
+                case AF_BACKEND_OPENCL: {
+                    static auto *opencl_handle =
+                        arrayfire::unified::getActiveHandle();
+                    static auto release_func =
+                        reinterpret_cast<af_release_array_ptr>(
+                            getFunctionPointer(opencl_handle,
+                                               "af_release_array"));
+                    release_func(get());
+                    break;
+                }
+                case AF_BACKEND_CUDA: {
+                    static auto *cuda_handle =
+                        arrayfire::unified::getActiveHandle();
+                    static auto release_func =
+                        reinterpret_cast<af_release_array_ptr>(
+                            getFunctionPointer(cuda_handle,
+                                               "af_release_array"));
+                    release_func(get());
+                    break;
+                }
+                case AF_BACKEND_ONEAPI: {
+                    static auto *oneapi_handle =
+                        arrayfire::unified::getActiveHandle();
+                    static auto release_func =
+                        reinterpret_cast<af_release_array_ptr>(
+                            getFunctionPointer(oneapi_handle,
+                                               "af_release_array"));
+                    release_func(get());
+                    break;
+                }
+                case AF_BACKEND_DEFAULT:
+                    assert(1 != 1 &&
+                           "AF_BACKEND_DEFAULT cannot be set as a backend for "
+                           "an array");
+            }
+        }
     }
+#else
+    // THOU SHALL NOT THROW IN DESTRUCTORS
+    if (af_array arr = get()) { af_release_array(arr); }
+#endif
+}
 
-    dim_t array::elements() const
-    {
-        dim_t elems;
-        AF_THROW(af_get_elements(&elems, get()));
-        return elems;
-    }
+af::dtype array::type() const {
+    af::dtype my_type;
+    AF_THROW(af_get_type(&my_type, arr));
+    return my_type;
+}
 
-    void array::host(void *data) const
-    {
-        AF_THROW(af_get_data_ptr(data, get()));
-    }
+dim_t array::elements() const {
+    dim_t elems;
+    AF_THROW(af_get_elements(&elems, get()));
+    return elems;
+}
 
-    af_array array::get()
-    {
-        return arr;
-    }
+void array::host(void *ptr) const { AF_THROW(af_get_data_ptr(ptr, get())); }
 
-    af_array array::get() const
-    {
-        return ((array *)(this))->get();
-    }
+af_array array::get() { return arr; }
 
-    // Helper functions
-    dim4 array::dims() const
-    {
-        return getDims(get());
-    }
+af_array array::get() const { return const_cast<array *>(this)->get(); }
 
-    dim_t array::dims(unsigned dim) const
-    {
-        return dims()[dim];
-    }
+// Helper functions
+dim4 array::dims() const { return getDims(get()); }
 
-    unsigned array::numdims() const
-    {
-        return numDims(get());
-    }
+dim_t array::dims(unsigned dim) const { return dims()[dim]; }
 
-    size_t array::bytes() const
-    {
-        dim_t nElements;
-        AF_THROW(af_get_elements(&nElements, get()));
-        return nElements * size_of(type());
-    }
+unsigned array::numdims() const { return numDims(get()); }
 
-    array array::copy() const
-    {
-        af_array other = 0;
-        AF_THROW(af_copy_array(&other, get()));
-        return array(other);
-    }
+size_t array::bytes() const {
+    dim_t nElements;
+    AF_THROW(af_get_elements(&nElements, get()));
+    return nElements * getSizeOf(type());
+}
 
-#undef INSTANTIATE
-#define INSTANTIATE(fn)                         \
-    bool array::is##fn() const                  \
-    {                                           \
-        bool ret = false;                       \
-        AF_THROW(af_is_##fn(&ret, get()));      \
-        return ret;                             \
-    }
+size_t array::allocated() const {
+    size_t result = 0;
+    AF_THROW(af_get_allocated_bytes(&result, get()));
+    return result;
+}
 
-    INSTANTIATE(empty)
-    INSTANTIATE(scalar)
-    INSTANTIATE(vector)
-    INSTANTIATE(row)
-    INSTANTIATE(column)
-    INSTANTIATE(complex)
-    INSTANTIATE(double)
-    INSTANTIATE(single)
-    INSTANTIATE(realfloating)
-    INSTANTIATE(floating)
-    INSTANTIATE(integer)
-    INSTANTIATE(bool)
+array array::copy() const {
+    af_array other = nullptr;
+    AF_THROW(af_copy_array(&other, get()));
+    return array(other);
+}
 
 #undef INSTANTIATE
+#define INSTANTIATE(fn)                    \
+    bool array::is##fn() const {           \
+        bool ret = false;                  \
+        AF_THROW(af_is_##fn(&ret, get())); \
+        return ret;                        \
+    }
+
+INSTANTIATE(empty)
+INSTANTIATE(scalar)
+INSTANTIATE(vector)
+INSTANTIATE(row)
+INSTANTIATE(column)
+INSTANTIATE(complex)
+INSTANTIATE(double)
+INSTANTIATE(single)
+INSTANTIATE(half)
+INSTANTIATE(realfloating)
+INSTANTIATE(floating)
+INSTANTIATE(integer)
+INSTANTIATE(bool)
+INSTANTIATE(sparse)
 
-    static array::array_proxy gen_indexing(const array &ref, const index &s0, const index &s1, const index &s2, const index &s3, bool linear = false)
-    {
-        ref.eval();
-        af_index_t inds[AF_MAX_DIMS];
-        inds[0] = s0.get();
-        inds[1] = s1.get();
-        inds[2] = s2.get();
-        inds[3] = s3.get();
+#undef INSTANTIATE
 
-        return array::array_proxy(const_cast<array&>(ref), inds, linear);
-    }
+static array::array_proxy gen_indexing(const array &ref, const index &s0,
+                                       const index &s1, const index &s2,
+                                       const index &s3, bool linear = false) {
+    ref.eval();
+    af_index_t inds[AF_MAX_DIMS];
+    inds[0] = s0.get();
+    inds[1] = s1.get();
+    inds[2] = s2.get();
+    inds[3] = s3.get();
+
+    return array::array_proxy(const_cast<array &>(ref), inds, linear);
+}
 
-    array::array_proxy array::operator()(const index &s0)
-    {
-        return const_cast<const array*>(this)->operator()(s0);
-    }
+array::array_proxy array::operator()(const index &s0) {
+    return const_cast<const array *>(this)->operator()(s0);
+}
 
-    array::array_proxy array::operator()(const index &s0, const index &s1, const index &s2, const index &s3)
-    {
-        return const_cast<const array*>(this)->operator()(s0, s1, s2, s3);
-    }
+array::array_proxy array::operator()(const index &s0, const index &s1,
+                                     const index &s2, const index &s3) {
+    return const_cast<const array *>(this)->operator()(s0, s1, s2, s3);
+}
 
-    const array::array_proxy array::operator()(const index &s0) const
-    {
-        index z = index(0);
-        if(isvector()){
-            switch(numDims(this->arr)) {
-                case 1: return gen_indexing(*this, s0, z, z, z);
-                case 2: return gen_indexing(*this, z, s0, z, z);
-                case 3: return gen_indexing(*this, z, z, s0, z);
-                case 4: return gen_indexing(*this, z, z, z, s0);
-                default: THROW(AF_ERR_SIZE);
-            }
-        }
-        else {
-            return gen_indexing(*this, s0, z, z, z, true);
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::operator()(const index &s0) const {
+    index z = index(0);
+    if (isvector()) {
+        switch (numDims(this->arr)) {
+            case 1: return gen_indexing(*this, s0, z, z, z);
+            case 2: return gen_indexing(*this, z, s0, z, z);
+            case 3: return gen_indexing(*this, z, z, s0, z);
+            case 4: return gen_indexing(*this, z, z, z, s0);
+            default: AF_THROW_ERR("ndims for Array is invalid", AF_ERR_SIZE);
         }
+    } else {
+        return gen_indexing(*this, s0, z, z, z, true);
     }
+}
 
-    const array::array_proxy array::operator()(const index &s0, const index &s1, const index &s2, const index &s3) const
-    {
-        if(isvector()   && s1.isspan()
-                        && s2.isspan()
-                        && s3.isspan()) {
-            int num_dims = numDims(this->arr);
-
-            switch(num_dims) {
-            case 1: return gen_indexing(*this, s0, s1, s2, s3);
-            case 2: return gen_indexing(*this, s1, s0, s2, s3);
-            case 3: return gen_indexing(*this, s1, s2, s0, s3);
-            case 4: return gen_indexing(*this, s1, s2, s3, s0);
-            default: THROW(AF_ERR_SIZE);
-            }
-        }
-        else {
-            return gen_indexing(*this, s0, s1, s2, s3);
-        }
-
-    }
-
-    const array::array_proxy array::row(int index) const
-    {
-        seq idx(index, index, 1);
-        return this->operator()(idx, span, span, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::operator()(const index &s0, const index &s1,
+                                           const index &s2,
+                                           const index &s3) const {
+    return gen_indexing(*this, s0, s1, s2, s3);
+}
 
-    array::array_proxy array::row(int index)
-    {
-        seq idx(index, index, 1);
-        return this->operator()(idx, span, span, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::row(int index) const {
+    return this->operator()(index, span, span, span);
+}
 
-    const array::array_proxy array::col(int index) const
-    {
-        seq idx(index, index, 1);
-        return this->operator()(span, idx, span, span);
-    }
+array::array_proxy array::row(int index) {
+    return const_cast<const array *>(this)->row(index);
+}
 
-    array::array_proxy array::col(int index)
-    {
-        seq idx(index, index, 1);
-        return this->operator()(span, idx, span, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::col(int index) const {
+    return this->operator()(span, index, span, span);
+}
 
-    const array::array_proxy array::slice(int index) const
-    {
-        seq idx(index, index, 1);
-        return this->operator()(span, span, idx, span);
-    }
+array::array_proxy array::col(int index) {
+    return const_cast<const array *>(this)->col(index);
+}
 
-    array::array_proxy array::slice(int index)
-    {
-        seq idx(index, index, 1);
-        return this->operator()(span, span, idx, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::slice(int index) const {
+    return this->operator()(span, span, index, span);
+}
 
-    const array::array_proxy array::rows(int first, int last) const
-    {
-        seq idx(first, last, 1);
-        return this->operator()(idx, span, span, span);
-    }
+array::array_proxy array::slice(int index) {
+    return const_cast<const array *>(this)->slice(index);
+}
 
-    array::array_proxy array::rows(int first, int last)
-    {
-        seq idx(first, last, 1);
-        return this->operator()(idx, span, span, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::rows(int first, int last) const {
+    seq idx(first, last, 1);
+    return this->operator()(idx, span, span, span);
+}
 
-    const array::array_proxy array::cols(int first, int last) const
-    {
-        seq idx(first, last, 1);
-        return this->operator()(span, idx, span, span);
-    }
+array::array_proxy array::rows(int first, int last) {
+    return const_cast<const array *>(this)->rows(first, last);
+}
 
-    array::array_proxy array::cols(int first, int last)
-    {
-        seq idx(first, last, 1);
-        return this->operator()(span, idx, span, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::cols(int first, int last) const {
+    seq idx(first, last, 1);
+    return this->operator()(span, idx, span, span);
+}
 
-    const array::array_proxy array::slices(int first, int last) const
-    {
-        seq idx(first, last, 1);
-        return this->operator()(span, span, idx, span);
-    }
+array::array_proxy array::cols(int first, int last) {
+    return const_cast<const array *>(this)->cols(first, last);
+}
 
-    array::array_proxy array::slices(int first, int last)
-    {
-        seq idx(first, last, 1);
-        return this->operator()(span, span, idx, span);
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array::array_proxy array::slices(int first, int last) const {
+    seq idx(first, last, 1);
+    return this->operator()(span, span, idx, span);
+}
 
-    const array array::as(af::dtype type) const
-    {
-        af_array out;
-        AF_THROW(af_cast(&out, this->get(), type));
-        return array(out);
-    }
+array::array_proxy array::slices(int first, int last) {
+    return const_cast<const array *>(this)->slices(first, last);
+}
 
-    array::array(const array& in) : arr(0)
-    {
-        AF_THROW(af_retain_array(&arr, in.get()));
-    }
+// NOLINTNEXTLINE(readability-const-return-type)
+const array array::as(af::dtype type) const {
+    af_array out;
+    AF_THROW(af_cast(&out, this->get(), type));
+    return array(out);
+}
 
-    array::array(const array& input, const dim4& dims) : arr(0)
-    {
-        AF_THROW(af_moddims(&arr, input.get(), AF_MAX_DIMS, dims.get()));
-    }
+array::array(const array &in) : arr(nullptr) {
+    AF_THROW(af_retain_array(&arr, in.get()));
+}
 
-    array::array(const array& input, const dim_t dim0, const dim_t dim1, const dim_t dim2, const dim_t dim3)
-        : arr(0)
-    {
-        dim_t dims[] = {dim0, dim1, dim2, dim3};
-        AF_THROW(af_moddims(&arr, input.get(), AF_MAX_DIMS, dims));
-    }
+array::array(const array &input, const dim4 &dims) : arr(nullptr) {
+    AF_THROW(af_moddims(&arr, input.get(), AF_MAX_DIMS, dims.get()));
+}
 
-    // Transpose and Conjugate Transpose
-    array array::T() const
-    {
-        return transpose(*this);
-    }
+array::array(const array &input, const dim_t dim0, const dim_t dim1,
+             const dim_t dim2, const dim_t dim3)
+    : arr(nullptr) {
+    dim_t dims[] = {dim0, dim1, dim2, dim3};
+    AF_THROW(af_moddims(&arr, input.get(), AF_MAX_DIMS, dims));
+}
 
-    array array::H() const
-    {
-        return transpose(*this, true);
-    }
+// Transpose and Conjugate Transpose
+array array::T() const { return transpose(*this); }
 
-    void array::set(af_array tmp)
-    {
-        if (arr) AF_THROW(af_release_array(arr));
-        arr = tmp;
-    }
+array array::H() const { return transpose(*this, true); }
 
-    // Assign values to an array
-    array::array_proxy&
-    af::array::array_proxy::operator=(const array &other)
-    {
-        unsigned nd = numDims(impl->parent->get());
-        const dim4 this_dims = getDims(impl->parent->get());
-        const dim4 other_dims = other.dims();
-        int dim = gforDim(impl->indices);
-        af_array other_arr = other.get();
-
-        bool batch_assign = false;
-        bool is_reordered = false;
-        if (dim >= 0) {
-            batch_assign = true;
-            for (int i = 0; i < AF_MAX_DIMS; i++) {
-                if (this->impl->indices[i].isBatch) batch_assign &= (other_dims[i] == 1);
-                else                          batch_assign &= (other_dims[i] == this_dims[i]);
-            }
+void array::set(af_array tmp) {
+    if (arr) { AF_THROW(af_release_array(arr)); }
+    arr = tmp;
+}
 
-            if (batch_assign) {
-                //FIXME: Figure out a faster, cleaner way to do this
-                dim4 out_dims = seqToDims(impl->indices, this_dims, false);
-                af_array out;
-                AF_THROW(af_tile(&out, other_arr,
-                                 out_dims[0] / other_dims[0],
-                                 out_dims[1] / other_dims[1],
-                                 out_dims[2] / other_dims[2],
-                                 out_dims[3] / other_dims[3]));
-                other_arr = out;
-
-            } else if (this_dims != other_dims) {
-                // HACK: This is a quick check to see if other has been reordered inside gfor
-                // TODO: Figure out if this breaks and implement a cleaner method
-                other_arr = gforReorder(other_arr, dim);
-                is_reordered = true;
+// Assign values to an array
+array::array_proxy &af::array::array_proxy::operator=(const array &other) {
+    unsigned nd           = numDims(impl->parent_->get());
+    const dim4 this_dims  = getDims(impl->parent_->get());
+    const dim4 other_dims = other.dims();
+    int dim               = gforDim(impl->indices_);
+    af_array other_arr    = other.get();
+
+    bool batch_assign = false;
+    bool is_reordered = false;
+    if (dim >= 0) {
+        // FIXME: Figure out a faster, cleaner way to do this
+        dim4 out_dims = seqToDims(impl->indices_, this_dims, false);
+
+        batch_assign = true;
+        for (int i = 0; i < AF_MAX_DIMS; i++) {
+            if (this->impl->indices_[i].isBatch) {
+                batch_assign &= (other_dims[i] == 1);
+            } else {
+                batch_assign &= (other_dims[i] == out_dims[i]);
             }
         }
 
-        af_array par_arr = 0;
-
-        if (impl->lin) {
-            AF_THROW(af_flat(&par_arr, impl->parent->get()));
-            nd = 1;
-        } else {
-            par_arr = impl->parent->get();
-        }
-
-        af_array tmp = 0;
-        AF_THROW(af_assign_gen(&tmp, par_arr, nd, impl->indices, other_arr));
-
-        af_array res = 0;
-        if (impl->lin) {
-            AF_THROW(af_moddims(&res, tmp, this_dims.ndims(), this_dims.get()));
-            AF_THROW(af_release_array(par_arr));
-            AF_THROW(af_release_array(tmp));
-        } else {
-            res = tmp;
-        }
-
-        impl->parent->set(res);
-
-        if (dim >= 0 && (is_reordered || batch_assign)) {
-            if (other_arr) AF_THROW(af_release_array(other_arr));
+        if (batch_assign) {
+            af_array out;
+            AF_THROW(af_tile(&out, other_arr, out_dims[0] / other_dims[0],
+                             out_dims[1] / other_dims[1],
+                             out_dims[2] / other_dims[2],
+                             out_dims[3] / other_dims[3]));
+            other_arr = out;
+
+        } else if (out_dims != other_dims) {
+            // HACK: This is a quick check to see if other has been reordered
+            // inside gfor
+            // TODO(umar): Figure out if this breaks and implement a cleaner
+            // method
+            other_arr    = gforReorder(other_arr, dim);
+            is_reordered = true;
         }
-        return *this;
-    }
-
-    array::array_proxy&
-    af::array::array_proxy::operator=(const array::array_proxy &other)
-    {
-        array out = other;
-        return *this = out;
     }
 
-    af::array::array_proxy::array_proxy(array& par, af_index_t *ssss, bool linear)
-        : impl(new array_proxy_impl(par, ssss, linear))
-    {
-    }
+    af_array par_arr = 0;
+
+    dim4 parent_dims = impl->parent_->dims();
+    if (impl->is_linear_) {
+        AF_THROW(af_flat(&par_arr, impl->parent_->get()));
+        // The set call below releases the impl->parent_ array. We do this
+        // because the af_flat call above increases the reference count of
+        // the parent array, which would otherwise trigger a copy operation
+        // inside the af_assign_gen function below. The parent array will be
+        // reverted to the original array and shape later in this function
+        // (see the af_moddims call that produces `unflattened`).
+        af_array empty = 0;
+        impl->parent_->set(empty);
+        nd = 1;
+    } else {
+        par_arr = impl->parent_->get();
+    }
+
+    af_array flat_res = 0;
+    AF_THROW(af_assign_gen(&flat_res, par_arr, nd, impl->indices_, other_arr));
+
+    af_array res         = 0;
+    af_array unflattened = 0;
+    if (impl->is_linear_) {
+        AF_THROW(
+            af_moddims(&res, flat_res, this_dims.ndims(), this_dims.get()));
+        // Unflatten the af_array and reset the original reference
+        AF_THROW(af_moddims(&unflattened, par_arr, parent_dims.ndims(),
+                            parent_dims.get()));
+        impl->parent_->set(unflattened);
+        AF_THROW(af_release_array(par_arr));
+        AF_THROW(af_release_array(flat_res));
+    } else {
+        res = flat_res;
+    }
+
+    impl->parent_->set(res);
+
+    if (dim >= 0 && (is_reordered || batch_assign)) {
+        if (other_arr) { AF_THROW(af_release_array(other_arr)); }
+    }
+    return *this;
+}
 
-    af::array::array_proxy::array_proxy(const array_proxy &other) {
-        *impl = *(other.impl);
-    }
+array::array_proxy &af::array::array_proxy::operator=(
+    const array::array_proxy &other) {
+    if (this == &other) { return *this; }
+    array out = other;
+    *this     = out;
+    return *this;
+}
 
-#if __cplusplus > 199711L
-    af::array::array_proxy::array_proxy(array_proxy &&other) {
-        impl = other.impl;
-        other.impl = nullptr;
-    }
+af::array::array_proxy::array_proxy(array &par, af_index_t *ssss, bool linear)
+    : impl(new array_proxy_impl(par, ssss, linear)) {}
 
-    array::array_proxy&
-    af::array::array_proxy::operator=(array_proxy &&other) {
-        array out = other;
-        return *this = out;
-    }
-#endif
+af::array::array_proxy::array_proxy(const array_proxy &other)
+    : impl(new array_proxy_impl(*other.impl->parent_, other.impl->indices_,
+                                other.impl->is_linear_)) {}
 
-    af::array::array_proxy::~array_proxy() {
-        if(impl) delete impl;
-    }
+// NOLINTNEXTLINE(performance-noexcept-move-constructor,hicpp-noexcept-move)
+af::array::array_proxy::array_proxy(array_proxy &&other) {
+    impl       = other.impl;
+    other.impl = nullptr;
+}
 
-    array array::array_proxy::as(dtype type) const
-    {
-        array out = *this;
-        return out.as(type);
-    }
+// NOLINTNEXTLINE(performance-noexcept-move-constructor,hicpp-noexcept-move)
+array::array_proxy &af::array::array_proxy::operator=(array_proxy &&other) {
+    array out = other;
+    *this     = out;
+    return *this;
+}
 
-    dim_t array::array_proxy::dims(unsigned dim) const
-    {
-        array out = *this;
-        return out.dims(dim);
-    }
+af::array::array_proxy::~array_proxy() { delete impl; }
 
-    void array::array_proxy::host(void *ptr) const
-    {
-        array out = *this;
-        return out.host(ptr);
-    }
+array array::array_proxy::as(dtype type) const {
+    array out = *this;
+    return out.as(type);
+}
 
-    af_array array::array_proxy::get()
-    {
-        array tmp = *this;
-        af_array out = 0;
-        AF_THROW(af_retain_array(&out, tmp.get()));
-        return out;
-    }
+dim_t array::array_proxy::dims(unsigned dim) const {
+    array out = *this;
+    return out.dims(dim);
+}
 
-#define MEM_FUNC(PREFIX, FUNC)                  \
-    PREFIX array::array_proxy::FUNC() const     \
-    {                                           \
-        array out = *this;                      \
-        return out.FUNC();                      \
-    }
+void array::array_proxy::host(void *ptr) const {
+    array out = *this;
+    return out.host(ptr);
+}
 
-    MEM_FUNC(af_array               , get)
-    MEM_FUNC(dim_t                  , elements)
-    MEM_FUNC(array                  , T)
-    MEM_FUNC(array                  , H)
-    MEM_FUNC(dtype                  , type)
-    MEM_FUNC(dim4                   , dims)
-    MEM_FUNC(unsigned               , numdims)
-    MEM_FUNC(size_t                 , bytes)
-    MEM_FUNC(array                  , copy)
-    MEM_FUNC(bool                   , isempty)
-    MEM_FUNC(bool                   , isscalar)
-    MEM_FUNC(bool                   , isvector)
-    MEM_FUNC(bool                   , isrow)
-    MEM_FUNC(bool                   , iscolumn)
-    MEM_FUNC(bool                   , iscomplex)
-    MEM_FUNC(bool                   , isdouble)
-    MEM_FUNC(bool                   , issingle)
-    MEM_FUNC(bool                   , isrealfloating)
-    MEM_FUNC(bool                   , isfloating)
-    MEM_FUNC(bool                   , isinteger)
-    MEM_FUNC(bool                   , isbool)
-    MEM_FUNC(void                   , eval)
-    //MEM_FUNC(void                   , unlock)
+#define MEM_FUNC(PREFIX, FUNC)                \
+    PREFIX array::array_proxy::FUNC() const { \
+        array out = *this;                    \
+        return out.FUNC();                    \
+    }
+
+MEM_FUNC(dim_t, elements)
+MEM_FUNC(array, T)
+MEM_FUNC(array, H)
+MEM_FUNC(dtype, type)
+MEM_FUNC(dim4, dims)
+MEM_FUNC(unsigned, numdims)
+MEM_FUNC(size_t, bytes)
+MEM_FUNC(size_t, allocated)
+MEM_FUNC(array, copy)
+MEM_FUNC(bool, isempty)
+MEM_FUNC(bool, isscalar)
+MEM_FUNC(bool, isvector)
+MEM_FUNC(bool, isrow)
+MEM_FUNC(bool, iscolumn)
+MEM_FUNC(bool, iscomplex)
+MEM_FUNC(bool, isdouble)
+MEM_FUNC(bool, issingle)
+MEM_FUNC(bool, ishalf)
+MEM_FUNC(bool, isrealfloating)
+MEM_FUNC(bool, isfloating)
+MEM_FUNC(bool, isinteger)
+MEM_FUNC(bool, isbool)
+MEM_FUNC(bool, issparse)
+MEM_FUNC(void, eval)
+MEM_FUNC(af_array, get)
+// MEM_FUNC(void                   , unlock)
 #undef MEM_FUNC
 
+#define ASSIGN_TYPE(TY, OP)                                                \
+    array::array_proxy &array::array_proxy::operator OP(const TY &value) { \
+        dim4 pdims = getDims(impl->parent_->get());                        \
+        if (impl->is_linear_) pdims = dim4(pdims.elements());              \
+        dim4 dims    = seqToDims(impl->indices_, pdims);                   \
+        af::dtype ty = impl->parent_->type();                              \
+        array cst    = constant(value, dims, ty);                          \
+        this->operator OP(cst);                                            \
+        return *this;                                                      \
+    }
+
+#define ASSIGN_OP(OP, op1)              \
+    ASSIGN_TYPE(double, OP)             \
+    ASSIGN_TYPE(float, OP)              \
+    ASSIGN_TYPE(cdouble, OP)            \
+    ASSIGN_TYPE(cfloat, OP)             \
+    ASSIGN_TYPE(int, OP)                \
+    ASSIGN_TYPE(unsigned, OP)           \
+    ASSIGN_TYPE(long, OP)               \
+    ASSIGN_TYPE(unsigned long, OP)      \
+    ASSIGN_TYPE(long long, OP)          \
+    ASSIGN_TYPE(unsigned long long, OP) \
+    ASSIGN_TYPE(char, OP)               \
+    ASSIGN_TYPE(signed char, OP)        \
+    ASSIGN_TYPE(unsigned char, OP)      \
+    ASSIGN_TYPE(bool, OP)               \
+    ASSIGN_TYPE(short, OP)              \
+    ASSIGN_TYPE(unsigned short, OP)
+
+ASSIGN_OP(=, =)
+ASSIGN_OP(+=, +)
+ASSIGN_OP(-=, -)
+ASSIGN_OP(*=, *)
+ASSIGN_OP(/=, /)
+#undef ASSIGN_OP
 
-#define ASSIGN_TYPE(TY, OP)                             \
-    array::array_proxy&                                 \
-    array::array_proxy::operator OP(const TY &value)    \
-    {                                                   \
-        dim4 pdims = getDims(impl->parent->get());      \
-        if (impl->lin) pdims = dim4(pdims.elements());  \
-        dim4 dims = seqToDims(impl->indices, pdims );   \
-        af::dtype ty = impl->parent->type();            \
-        array cst = constant(value, dims, ty);          \
-        return this->operator OP(cst);                  \
-    }                                                   \
-
-#define ASSIGN_OP(OP, op1)                      \
-    ASSIGN_TYPE(double             , OP)        \
-    ASSIGN_TYPE(float              , OP)        \
-    ASSIGN_TYPE(cdouble            , OP)        \
-    ASSIGN_TYPE(cfloat             , OP)        \
-    ASSIGN_TYPE(int                , OP)        \
-    ASSIGN_TYPE(unsigned           , OP)        \
-    ASSIGN_TYPE(long               , OP)        \
-    ASSIGN_TYPE(unsigned long      , OP)        \
-    ASSIGN_TYPE(long long          , OP)        \
-    ASSIGN_TYPE(unsigned long long , OP)        \
-    ASSIGN_TYPE(char               , OP)        \
-    ASSIGN_TYPE(unsigned char      , OP)        \
-    ASSIGN_TYPE(bool               , OP)        \
-
-    ASSIGN_OP(= , =)
-    ASSIGN_OP(+=, +)
-    ASSIGN_OP(-=, -)
-    ASSIGN_OP(*=, *)
-    ASSIGN_OP(/=, /)
 #undef ASSIGN_TYPE
-#undef ASSIGN_OP
 
-#define SELF_OP(OP, op1)                                                          \
-    array::array_proxy& array::array_proxy::operator OP(const array_proxy &other) \
-    {                                                                             \
-        *this = *this op1 other;                                                  \
-        return *this;                                                             \
-    }                                                                             \
-    array::array_proxy& array::array_proxy::operator OP(const array &other)       \
-    {                                                                             \
-        *this = *this op1 other;                                                  \
-        return *this;                                                             \
-    }                                                                             \
-
-    SELF_OP(+=, +)
-    SELF_OP(-=, -)
-    SELF_OP(*=, *)
-    SELF_OP(/=, /)
+#define SELF_OP(OP, op1)                                                      \
+    array::array_proxy &array::array_proxy::operator OP(                      \
+        const array_proxy &other) {                                           \
+        *this = *this op1 other;                                              \
+        return *this;                                                         \
+    }                                                                         \
+    array::array_proxy &array::array_proxy::operator OP(const array &other) { \
+        *this = *this op1 other;                                              \
+        return *this;                                                         \
+    }
+
+SELF_OP(+=, +)
+SELF_OP(-=, -)
+SELF_OP(*=, *)
+SELF_OP(/=, /)
 #undef SELF_OP
 
-    array::array_proxy::operator array() const
-    {
-        af_array tmp = 0;
-        af_array arr = 0;
-
-        if(impl->lin)  {
-            AF_THROW(af_flat(&arr, impl->parent->get()));
-        } else {
-            arr = impl->parent->get();
-        }
-
-        AF_THROW(af_index_gen(&tmp, arr, AF_MAX_DIMS, impl->indices));
-        if(impl->lin)  {
-            AF_THROW(af_release_array(arr));
-        }
+array::array_proxy::operator array() const {
+    af_array tmp = nullptr;
+    af_array arr = nullptr;
 
-        return array(tmp);
+    if (impl->is_linear_) {
+        AF_THROW(af_flat(&arr, impl->parent_->get()));
+    } else {
+        arr = impl->parent_->get();
     }
 
-    array::array_proxy::operator array()
-    {
-        af_array tmp = 0;
-        af_array arr = 0;
-
-        if(impl->lin)  {
-            AF_THROW(af_flat(&arr, impl->parent->get()));
-        } else {
-            arr = impl->parent->get();
-        }
-
-        AF_THROW(af_index_gen(&tmp, arr, AF_MAX_DIMS, impl->indices));
-        if(impl->lin)  {
-            AF_THROW(af_release_array(arr));
-        }
-
-        int dim = gforDim(impl->indices);
-        if (tmp && dim >= 0) {
-            arr = gforReorder(tmp, dim);
-            if (tmp) AF_THROW(af_release_array(tmp));
-        } else {
-            arr = tmp;
-        }
+    AF_THROW(af_index_gen(&tmp, arr, AF_MAX_DIMS, impl->indices_));
+    if (impl->is_linear_) { AF_THROW(af_release_array(arr)); }
 
-        return array(arr);
+    int dim = gforDim(impl->indices_);
+    if (tmp && dim >= 0) {
+        arr = gforReorder(tmp, dim);
+        if (tmp) { AF_THROW(af_release_array(tmp)); }
+    } else {
+        arr = tmp;
     }
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Operator =
-    ///////////////////////////////////////////////////////////////////////////
-    array& array::operator=(const array &other)
-    {
-        if (this->get() == other.get()) {
-            return *this;
-        }
-        //TODO: Unsafe. loses data if af_weak_copy fails
-        if(this->arr != 0) {
-            AF_THROW(af_release_array(this->arr));
-        }
+    return array(arr);
+}
 
-        af_array temp = 0;
-        AF_THROW(af_retain_array(&temp, other.get()));
-        this->arr = temp;
-        return *this;
-    }
-#define ASSIGN_TYPE(TY, OP)                                         \
-    array& array::operator OP(const TY &value)                      \
-    {                                                               \
-        af::dim4 dims = this->dims();                               \
-        af::dtype ty = this->type();                                \
-        array cst = constant(value, dims, ty);                      \
-        return operator OP(cst);                                    \
-    }                                                               \
-
-#define ASSIGN_OP(OP, op1)                                          \
-    array& array::operator OP(const array &other)                   \
-    {                                                               \
-        af_array out = 0;                                           \
-        AF_THROW(op1(&out, this->get(), other.get(), gforGet()));   \
-        this->set(out);                                             \
-        return *this;                                               \
-    }                                                               \
-    ASSIGN_TYPE(double             , OP)                            \
-    ASSIGN_TYPE(float              , OP)                            \
-    ASSIGN_TYPE(cdouble            , OP)                            \
-    ASSIGN_TYPE(cfloat             , OP)                            \
-    ASSIGN_TYPE(int                , OP)                            \
-    ASSIGN_TYPE(unsigned           , OP)                            \
-    ASSIGN_TYPE(long               , OP)                            \
-    ASSIGN_TYPE(unsigned long      , OP)                            \
-    ASSIGN_TYPE(long long          , OP)                            \
-    ASSIGN_TYPE(unsigned long long , OP)                            \
-    ASSIGN_TYPE(char               , OP)                            \
-    ASSIGN_TYPE(unsigned char      , OP)                            \
-    ASSIGN_TYPE(bool               , OP)                            \
-
-    ASSIGN_OP(+=, af_add)
-    ASSIGN_OP(-=, af_sub)
-    ASSIGN_OP(*=, af_mul)
-    ASSIGN_OP(/=, af_div)
+array::array_proxy::operator array() {
+    return const_cast<const array::array_proxy *>(this)->operator array();
+}
+
+#define MEM_INDEX(FUNC_SIG, USAGE)                                \
+    array::array_proxy array::array_proxy::FUNC_SIG {             \
+        array *out               = new array(*this);              \
+        array::array_proxy proxy = out->USAGE;                    \
+        proxy.impl->delete_on_destruction(true);                  \
+        return proxy;                                             \
+    }                                                             \
+                                                                  \
+    const array::array_proxy array::array_proxy::FUNC_SIG const { \
+        const array *out         = new array(*this);              \
+        array::array_proxy proxy = out->USAGE;                    \
+        proxy.impl->delete_on_destruction(true);                  \
+        return proxy;                                             \
+    }
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(row(int index), row(index));
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(rows(int first, int last), rows(first, last));
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(col(int index), col(index));
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(cols(int first, int last), cols(first, last));
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(slice(int index), slice(index));
+// NOLINTNEXTLINE(readability-const-return-type)
+MEM_INDEX(slices(int first, int last), slices(first, last));
+
+#undef MEM_INDEX
+
+///////////////////////////////////////////////////////////////////////////
+// Operator =
+///////////////////////////////////////////////////////////////////////////
+array &array::operator=(const array &other) {
+    if (this == &other || this->get() == other.get()) { return *this; }
+    // TODO(umar): Unsafe. Loses data if af_retain_array fails
+    if (this->arr != nullptr) { AF_THROW(af_release_array(this->arr)); }
+
+    af_array temp = nullptr;
+    AF_THROW(af_retain_array(&temp, other.get()));
+    this->arr = temp;
+    return *this;
+}
+#define ASSIGN_TYPE(TY, OP)                        \
+    array &array::operator OP(const TY &value) {   \
+        af::dim4 dims = this->dims();              \
+        af::dtype ty  = this->type();              \
+        array cst     = constant(value, dims, ty); \
+        return operator OP(cst);                   \
+    }
+
+#define ASSIGN_OP(OP, op1)                                        \
+    array &array::operator OP(const array &other) {               \
+        af_array out = 0;                                         \
+        AF_THROW(op1(&out, this->get(), other.get(), gforGet())); \
+        this->set(out);                                           \
+        return *this;                                             \
+    }                                                             \
+    ASSIGN_TYPE(double, OP)                                       \
+    ASSIGN_TYPE(float, OP)                                        \
+    ASSIGN_TYPE(cdouble, OP)                                      \
+    ASSIGN_TYPE(cfloat, OP)                                       \
+    ASSIGN_TYPE(int, OP)                                          \
+    ASSIGN_TYPE(unsigned, OP)                                     \
+    ASSIGN_TYPE(long, OP)                                         \
+    ASSIGN_TYPE(unsigned long, OP)                                \
+    ASSIGN_TYPE(long long, OP)                                    \
+    ASSIGN_TYPE(unsigned long long, OP)                           \
+    ASSIGN_TYPE(char, OP)                                         \
+    ASSIGN_TYPE(signed char, OP)                                  \
+    ASSIGN_TYPE(unsigned char, OP)                                \
+    ASSIGN_TYPE(bool, OP)                                         \
+    ASSIGN_TYPE(short, OP)                                        \
+    ASSIGN_TYPE(unsigned short, OP)
+
+ASSIGN_OP(+=, af_add)
+ASSIGN_OP(-=, af_sub)
+ASSIGN_OP(*=, af_mul)
+ASSIGN_OP(/=, af_div)
 
 #undef ASSIGN_OP
+
 #undef ASSIGN_TYPE
 
-#define ASSIGN_TYPE(TY, OP)                                     \
-    array& array::operator OP(const TY &value)                  \
-    {                                                           \
-        af::dim4 dims = this->dims();                           \
-        af::dtype ty = this->type();                            \
-        array cst = constant(value, dims, ty);                  \
-        return operator OP(cst);                                \
-    }                                                           \
-
-#define ASSIGN_OP(OP)                           \
-    ASSIGN_TYPE(double             , OP)        \
-    ASSIGN_TYPE(float              , OP)        \
-    ASSIGN_TYPE(cdouble            , OP)        \
-    ASSIGN_TYPE(cfloat             , OP)        \
-    ASSIGN_TYPE(int                , OP)        \
-    ASSIGN_TYPE(unsigned           , OP)        \
-    ASSIGN_TYPE(long               , OP)        \
-    ASSIGN_TYPE(unsigned long      , OP)        \
-    ASSIGN_TYPE(long long          , OP)        \
-    ASSIGN_TYPE(unsigned long long , OP)        \
-    ASSIGN_TYPE(char               , OP)        \
-    ASSIGN_TYPE(unsigned char      , OP)        \
-    ASSIGN_TYPE(bool               , OP)        \
-
-    ASSIGN_OP(= )
+#define ASSIGN_TYPE(TY, OP)                        \
+    array &array::operator OP(const TY &value) {   \
+        af::dim4 dims = this->dims();              \
+        af::dtype ty  = this->type();              \
+        array cst     = constant(value, dims, ty); \
+        operator OP(cst);                          \
+        return *this;                              \
+    }
+
+#define ASSIGN_OP(OP)                   \
+    ASSIGN_TYPE(double, OP)             \
+    ASSIGN_TYPE(float, OP)              \
+    ASSIGN_TYPE(cdouble, OP)            \
+    ASSIGN_TYPE(cfloat, OP)             \
+    ASSIGN_TYPE(int, OP)                \
+    ASSIGN_TYPE(unsigned, OP)           \
+    ASSIGN_TYPE(long, OP)               \
+    ASSIGN_TYPE(unsigned long, OP)      \
+    ASSIGN_TYPE(long long, OP)          \
+    ASSIGN_TYPE(unsigned long long, OP) \
+    ASSIGN_TYPE(char, OP)               \
+    ASSIGN_TYPE(signed char, OP)        \
+    ASSIGN_TYPE(unsigned char, OP)      \
+    ASSIGN_TYPE(bool, OP)               \
+    ASSIGN_TYPE(short, OP)              \
+    ASSIGN_TYPE(unsigned short, OP)
+
+ASSIGN_OP(=)
 
 #undef ASSIGN_OP
+
 #undef ASSIGN_TYPE
 
-#define BINARY_TYPE(TY, OP, func, dty)                          \
-    array operator OP(const array& plhs, const TY &value)       \
-    {                                                           \
-        af_array out;                                           \
-        af::dtype ty = plhs.type();                             \
-        af::dtype cty = plhs.isrealfloating() ? ty : dty;       \
-        array cst = constant(value, plhs.dims(), cty);          \
-        AF_THROW(func(&out, plhs.get(), cst.get(), gforGet())); \
-        return array(out);                                      \
-    }                                                           \
-    array operator OP(const TY &value, const array &other)      \
-    {                                                           \
-        const af_array rhs = other.get();                         \
-        af_array out;                                           \
-        af::dtype ty = other.type();                            \
-        af::dtype cty = other.isrealfloating() ? ty : dty;      \
-        array cst = constant(value, other.dims(), cty);         \
-        AF_THROW(func(&out, cst.get(), rhs, gforGet()));        \
-        return array(out);                                      \
-    }                                                           \
-
-#define BINARY_OP(OP, func)                                     \
-    array operator OP(const array &lhs, const array &rhs)       \
-    {                                                           \
-        af_array out;                                           \
-        AF_THROW(func(&out, lhs.get(), rhs.get(), gforGet()));  \
-        return array(out);                                      \
-    }                                                           \
-    BINARY_TYPE(double             , OP, func, f64)             \
-    BINARY_TYPE(float              , OP, func, f32)             \
-    BINARY_TYPE(cdouble            , OP, func, c64)             \
-    BINARY_TYPE(cfloat             , OP, func, c32)             \
-    BINARY_TYPE(int                , OP, func, s32)             \
-    BINARY_TYPE(unsigned           , OP, func, u32)             \
-    BINARY_TYPE(long               , OP, func, s64)             \
-    BINARY_TYPE(unsigned long      , OP, func, u64)             \
-    BINARY_TYPE(long long          , OP, func, s64)             \
-    BINARY_TYPE(unsigned long long , OP, func, u64)             \
-    BINARY_TYPE(char               , OP, func, b8)              \
-    BINARY_TYPE(unsigned char      , OP, func, u8)              \
-    BINARY_TYPE(bool               , OP, func, b8)              \
-
-    BINARY_OP(+, af_add)
-    BINARY_OP(-, af_sub)
-    BINARY_OP(*, af_mul)
-    BINARY_OP(/, af_div)
-    BINARY_OP(==, af_eq)
-    BINARY_OP(!=, af_neq)
-    BINARY_OP(< , af_lt)
-    BINARY_OP(<=, af_le)
-    BINARY_OP(> , af_gt)
-    BINARY_OP(>=, af_ge)
-    BINARY_OP(&&, af_and)
-    BINARY_OP(||, af_or)
-    BINARY_OP(%, af_rem)
-    BINARY_OP(&, af_bitand)
-    BINARY_OP(|, af_bitor)
-    BINARY_OP(^, af_bitxor)
-    BINARY_OP(<<, af_bitshiftl)
-    BINARY_OP(>>, af_bitshiftr)
+af::dtype implicit_dtype(af::dtype scalar_type, af::dtype array_type) {
+    // If the types match, no conversion is needed
+    if (scalar_type == array_type) { return scalar_type; }
 
-#undef BINARY_TYPE
-#undef BINARY_OP
+    // If complex, return appropriate complex type
+    if (scalar_type == c32 || scalar_type == c64) {
+        if (array_type == f64 || array_type == c64) { return c64; }
+        return c32;
+    }
 
-    array array::operator-() const
-    {
-        af_array lhs = this->get();
-        af_array out;
-        array cst = constant(0, this->dims(), this->type());
-        AF_THROW(af_sub(&out, cst.get(), lhs, gforGet()));
-        return array(out);
+    // If the array is real or complex floating point, keep the array's type
+    if (array_type == f64 || array_type == c64 || array_type == f32 ||
+        array_type == c32) {
+        return array_type;
     }
 
-    array array::operator!() const
-    {
-        af_array lhs = this->get();
-        af_array out;
-        array cst = constant(0, this->dims(), this->type());
-        AF_THROW(af_eq(&out, cst.get(), lhs, gforGet()));
-        return array(out);
+    // If the array is f16 then avoid upcasting to float or double
+    if ((scalar_type == f64 || scalar_type == f32) && (array_type == f16)) {
+        return f16;
     }
 
-    void array::eval() const
-    {
-        AF_THROW(af_eval(get()));
+    // Default to single precision for double scalars with non-double arrays
+    if ((scalar_type == f64 || scalar_type == c64) &&
+        (array_type != f64 && array_type != c64)) {
+        return f32;
     }
 
+    // Defer to the C API for everything else
+    return scalar_type;
+}
+
+#define BINARY_TYPE(TY, OP, release_func, dty)                          \
+    array operator OP(const array &plhs, const TY &value) {             \
+        af_array out;                                                   \
+        af::dtype cty = implicit_dtype(dty, plhs.type());               \
+        array cst     = constant(value, plhs.dims(), cty);              \
+        AF_THROW(release_func(&out, plhs.get(), cst.get(), gforGet())); \
+        return array(out);                                              \
+    }                                                                   \
+    array operator OP(const TY &value, const array &other) {            \
+        const af_array rhs = other.get();                               \
+        af_array out;                                                   \
+        af::dtype cty = implicit_dtype(dty, other.type());              \
+        array cst     = constant(value, other.dims(), cty);             \
+        AF_THROW(release_func(&out, cst.get(), rhs, gforGet()));        \
+        return array(out);                                              \
+    }
+
+#define BINARY_OP(OP, release_func)                                    \
+    array operator OP(const array &lhs, const array &rhs) {            \
+        af_array out;                                                  \
+        AF_THROW(release_func(&out, lhs.get(), rhs.get(), gforGet())); \
+        return array(out);                                             \
+    }                                                                  \
+    BINARY_TYPE(double, OP, release_func, f64)                         \
+    BINARY_TYPE(float, OP, release_func, f32)                          \
+    BINARY_TYPE(cdouble, OP, release_func, c64)                        \
+    BINARY_TYPE(cfloat, OP, release_func, c32)                         \
+    BINARY_TYPE(int, OP, release_func, s32)                            \
+    BINARY_TYPE(unsigned, OP, release_func, u32)                       \
+    BINARY_TYPE(long, OP, release_func, s64)                           \
+    BINARY_TYPE(unsigned long, OP, release_func, u64)                  \
+    BINARY_TYPE(long long, OP, release_func, s64)                      \
+    BINARY_TYPE(unsigned long long, OP, release_func, u64)             \
+    BINARY_TYPE(char, OP, release_func, b8)                            \
+    BINARY_TYPE(signed char, OP, release_func, s8)                     \
+    BINARY_TYPE(unsigned char, OP, release_func, u8)                   \
+    BINARY_TYPE(bool, OP, release_func, b8)                            \
+    BINARY_TYPE(short, OP, release_func, s16)                          \
+    BINARY_TYPE(unsigned short, OP, release_func, u16)
+
+BINARY_OP(+, af_add)
+BINARY_OP(-, af_sub)
+BINARY_OP(*, af_mul)
+BINARY_OP(/, af_div)
+BINARY_OP(==, af_eq)
+BINARY_OP(!=, af_neq)
+BINARY_OP(<, af_lt)
+BINARY_OP(<=, af_le)
+BINARY_OP(>, af_gt)
+BINARY_OP(>=, af_ge)
+BINARY_OP(&&, af_and)
+BINARY_OP(||, af_or)
+BINARY_OP(%, af_mod)
+BINARY_OP(&, af_bitand)
+BINARY_OP(|, af_bitor)
+BINARY_OP(^, af_bitxor)
+BINARY_OP(<<, af_bitshiftl)
+BINARY_OP(>>, af_bitshiftr)
+
+#undef BINARY_OP
+
+#undef BINARY_TYPE
+
+array array::operator-() const {
+    af_array lhs = this->get();
+    af_array out;
+    array cst = constant(0, this->dims(), this->type());
+    AF_THROW(af_sub(&out, cst.get(), lhs, gforGet()));
+    return array(out);
+}
+
+array array::operator!() const {
+    af_array lhs = this->get();
+    af_array out;
+    AF_THROW(af_not(&out, lhs));
+    return array(out);
+}
+
+array array::operator~() const {
+    af_array lhs = this->get();
+    af_array out = nullptr;
+    AF_THROW(af_bitnot(&out, lhs));
+    return array(out);
+}
+
+void array::eval() const { AF_THROW(af_eval(get())); }
+
 // array instanciations
-#define INSTANTIATE(T)                                              \
-    template<> AFAPI T *array::host() const                         \
-    {                                                               \
-        if (type() != (af::dtype)dtype_traits<T>::af_type) {        \
-            AF_THROW_MSG("Requested type doesn't match with array", \
-                         AF_ERR_TYPE);                              \
-        }                                                           \
-                                                                    \
-        T *res = new T[elements()];                                 \
-        AF_THROW(af_get_data_ptr((void *)res, get()));              \
-                                                                    \
-        return res;                                                 \
-    }                                                               \
-    template<> AFAPI T array::scalar() const                        \
-    {                                                               \
-        T *h_ptr = host<T>();                                       \
-        T scalar = h_ptr[0];                                        \
-        delete[] h_ptr;                                             \
-        return scalar;                                              \
-    }                                                               \
-    template<> AFAPI T* array::device() const                       \
-    {                                                               \
-        void *ptr = NULL;                                           \
-        AF_THROW(af_get_device_ptr(&ptr, get()));                   \
-        return (T *)ptr;                                            \
-    }                                                               \
-    template<> AFAPI void array::write(const T *ptr,                \
-                                       const size_t bytes,          \
-                                       af::source src)              \
-    {                                                               \
-        if(src == afHost)   {                                       \
-            AF_THROW(af_write_array(get(), ptr, bytes,              \
-                                    (af::source)afHost));           \
-        }                                                           \
-        if(src == afDevice) {                                       \
-            AF_THROW(af_write_array(get(), ptr, bytes,              \
-                                    (af::source)afDevice));         \
-        }                                                           \
-    }                                                               \
-
-    INSTANTIATE(cdouble)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(float)
-    INSTANTIATE(unsigned)
-    INSTANTIATE(int)
-    INSTANTIATE(unsigned char)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+#define INSTANTIATE(T)                                                         \
+    template<>                                                                 \
+    AFAPI T *array::host() const {                                             \
+        if (type() != (af::dtype)dtype_traits<T>::af_type) {                   \
+            AF_THROW_ERR("Requested type doesn't match with array",            \
+                         AF_ERR_TYPE);                                         \
+        }                                                                      \
+        void *res;                                                             \
+        AF_THROW(af_alloc_host(&res, bytes()));                                \
+        AF_THROW(af_get_data_ptr(res, get()));                                 \
+                                                                               \
+        return (T *)res;                                                       \
+    }                                                                          \
+    template<>                                                                 \
+    AFAPI T array::scalar() const {                                            \
+        af_dtype type = (af_dtype)af::dtype_traits<T>::af_type;                \
+        if (type != this->type())                                              \
+            AF_THROW_ERR("Requested type doesn't match array type",            \
+                         AF_ERR_TYPE);                                         \
+        T val;                                                                 \
+        AF_THROW(af_get_scalar(&val, get()));                                  \
+        return val;                                                            \
+    }                                                                          \
+    template<>                                                                 \
+    AFAPI T *array::device() const {                                           \
+        void *ptr = NULL;                                                      \
+        AF_THROW(af_get_device_ptr(&ptr, get()));                              \
+        return (T *)ptr;                                                       \
+    }                                                                          \
+    template<>                                                                 \
+    AFAPI void array::write(const T *ptr, const size_t bytes,                  \
+                            af::source src) {                                  \
+        if (src == afHost) {                                                   \
+            AF_THROW(af_write_array(get(), ptr, bytes, (af::source)afHost));   \
+        }                                                                      \
+        if (src == afDevice) {                                                 \
+            AF_THROW(af_write_array(get(), ptr, bytes, (af::source)afDevice)); \
+        }                                                                      \
+    }
+
+INSTANTIATE(cdouble)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(unsigned)
+INSTANTIATE(int)
+INSTANTIATE(signed char)
+INSTANTIATE(unsigned char)
+INSTANTIATE(char)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(short)
+INSTANTIATE(unsigned short)
+INSTANTIATE(af_half)
+INSTANTIATE(half_float::half)
+
+template<>
+AFAPI void array::write(const void *ptr, const size_t bytes, af::source src) {
+    AF_THROW(af_write_array(get(), ptr, bytes, src));
+}
 
 #undef INSTANTIATE
 
+template<>
+AFAPI void *array::device() const {
+    void *ptr = nullptr;
+    AF_THROW(af_get_device_ptr(&ptr, get()));
+    return ptr;
+}
 
 // array_proxy instanciations
-#define TEMPLATE_MEM_FUNC(TYPE, RETURN_TYPE, FUNC)      \
-    template <>                                         \
-    RETURN_TYPE array::array_proxy::FUNC() const        \
-    {                                                   \
-        array out = *this;                              \
-        return out.FUNC<TYPE>();                        \
-    }
-
-#define INSTANTIATE(T)                                  \
-    TEMPLATE_MEM_FUNC(T, T*, host)                      \
-    TEMPLATE_MEM_FUNC(T, T , scalar)                    \
-    TEMPLATE_MEM_FUNC(T, T*, device)                    \
-
-    INSTANTIATE(cdouble)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(float)
-    INSTANTIATE(unsigned)
-    INSTANTIATE(int)
-    INSTANTIATE(unsigned char)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+#define TEMPLATE_MEM_FUNC(TYPE, RETURN_TYPE, FUNC)       \
+    template<>                                           \
+    AFAPI RETURN_TYPE array::array_proxy::FUNC() const { \
+        array out = *this;                               \
+        return out.FUNC<TYPE>();                         \
+    }
+
+#define INSTANTIATE(T)              \
+    TEMPLATE_MEM_FUNC(T, T *, host) \
+    TEMPLATE_MEM_FUNC(T, T, scalar) \
+    TEMPLATE_MEM_FUNC(T, T *, device)
+
+INSTANTIATE(cdouble)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(unsigned)
+INSTANTIATE(int)
+INSTANTIATE(signed char)
+INSTANTIATE(unsigned char)
+INSTANTIATE(char)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(short)
+INSTANTIATE(unsigned short)
+INSTANTIATE(af_half)
+INSTANTIATE(half_float::half)
 
 #undef INSTANTIATE
 #undef TEMPLATE_MEM_FUNC
 
-    //FIXME: This needs to be implemented at a later point
-    void array::unlock() const {}
-    void array::array_proxy::unlock() const {}
+// FIXME: These functions need to be implemented properly at a later point
+void array::array_proxy::unlock() const {}
+void array::array_proxy::lock() const {}
+
+// NOLINTNEXTLINE(readability-convert-member-functions-to-static)
+bool array::array_proxy::isLocked() const { return false; }
+
+int array::nonzeros() const { return count<int>(*this); }
+
+void array::lock() const { AF_THROW(af_lock_array(get())); }
+
+bool array::isLocked() const {
+    bool res;
+    AF_THROW(af_is_locked_array(&res, get()));
+    return res;
+}
+
+void array::unlock() const { AF_THROW(af_unlock_array(get())); }
+
+void eval(int num, array **arrays) {
+    vector<af_array> outputs(num);
+    for (int i = 0; i < num; i++) { outputs[i] = arrays[i]->get(); }
+    AF_THROW(af_eval_multiple(num, &outputs[0]));
+}
 
-    int array::nonzeros() const { return count<int>(*this); }
+void setManualEvalFlag(bool flag) { AF_THROW(af_set_manual_eval_flag(flag)); }
 
+bool getManualEvalFlag() {
+    bool flag;
+    AF_THROW(af_get_manual_eval_flag(&flag));
+    return flag;
 }
+}  // namespace af
diff --git a/src/api/cpp/bilateral.cpp b/src/api/cpp/bilateral.cpp
index 047830c22b..d0626a9299 100644
--- a/src/api/cpp/bilateral.cpp
+++ b/src/api/cpp/bilateral.cpp
@@ -7,18 +7,18 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array bilateral(const array &in, const float spatial_sigma, const float chromatic_sigma, const bool is_color)
-{
+array bilateral(const array &in, const float spatial_sigma,
+                const float chromatic_sigma, const bool is_color) {
     af_array out = 0;
-    AF_THROW(af_bilateral(&out, in.get(), spatial_sigma, chromatic_sigma, is_color));
+    AF_THROW(
+        af_bilateral(&out, in.get(), spatial_sigma, chromatic_sigma, is_color));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/binary.cpp b/src/api/cpp/binary.cpp
index 2a9161615b..96a04165d7 100644
--- a/src/api/cpp/binary.cpp
+++ b/src/api/cpp/binary.cpp
@@ -7,51 +7,51 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
 #include <af/gfor.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-#define INSTANTIATE(cppfunc, cfunc)                     \
-    array cppfunc(const array &lhs, const array &rhs)   \
-    {                                                   \
-        af_array out = 0;                               \
-        cfunc(&out, lhs.get(), rhs.get(), gforGet());   \
-        return array(out);                              \
+#define INSTANTIATE(cppfunc, cfunc)                             \
+    array cppfunc(const array &lhs, const array &rhs) {         \
+        af_array out = 0;                                       \
+        AF_THROW(cfunc(&out, lhs.get(), rhs.get(), gforGet())); \
+        return array(out);                                      \
     }
 
-    INSTANTIATE(min , af_minof)
-    INSTANTIATE(max , af_maxof)
-    INSTANTIATE(pow , af_pow  )
-    INSTANTIATE(root, af_root )
-    INSTANTIATE(rem , af_rem  )
-    INSTANTIATE(mod , af_mod  )
+INSTANTIATE(min, af_minof)
+INSTANTIATE(max, af_maxof)
+INSTANTIATE(pow, af_pow)
+INSTANTIATE(root, af_root)
+INSTANTIATE(rem, af_rem)
+INSTANTIATE(mod, af_mod)
 
-    INSTANTIATE(complex, af_cplx2)
-    INSTANTIATE(atan2, af_atan2)
-    INSTANTIATE(hypot, af_hypot)
+INSTANTIATE(complex, af_cplx2)
+INSTANTIATE(atan2, af_atan2)
+INSTANTIATE(hypot, af_hypot)
 
-#define WRAPPER(func)                                                   \
-    array func(const array &lhs, const double rhs)                      \
-    {                                                                   \
-        return func(lhs, constant(rhs, lhs.dims(), lhs.type()));        \
-    }                                                                   \
-    array func(const double lhs, const array &rhs)                      \
-    {                                                                   \
-        return func(constant(lhs, rhs.dims(), rhs.type()), rhs);        \
+#define WRAPPER(func)                                             \
+    array func(const array &lhs, const double rhs) {              \
+        af::dtype ty = lhs.type();                                \
+        if (lhs.iscomplex()) { ty = lhs.issingle() ? f32 : f64; } \
+        return func(lhs, constant(rhs, lhs.dims(), ty));          \
+    }                                                             \
+    array func(const double lhs, const array &rhs) {              \
+        af::dtype ty = rhs.type();                                \
+        if (rhs.iscomplex()) { ty = rhs.issingle() ? f32 : f64; } \
+        return func(constant(lhs, rhs.dims(), ty), rhs);          \
     }
 
-    WRAPPER(min)
-    WRAPPER(max)
-    WRAPPER(pow)
-    WRAPPER(root)
-    WRAPPER(rem)
-    WRAPPER(mod)
-    WRAPPER(complex)
-    WRAPPER(atan2)
-    WRAPPER(hypot)
-}
+WRAPPER(min)
+WRAPPER(max)
+WRAPPER(pow)
+WRAPPER(root)
+WRAPPER(rem)
+WRAPPER(mod)
+WRAPPER(complex)
+WRAPPER(atan2)
+WRAPPER(hypot)
+}  // namespace af
diff --git a/src/api/cpp/blas.cpp b/src/api/cpp/blas.cpp
index aac0cbabd7..fbff177818 100644
--- a/src/api/cpp/blas.cpp
+++ b/src/api/cpp/blas.cpp
@@ -11,69 +11,85 @@
 #include <af/blas.h>
 #include "error.hpp"
 
-namespace af
-{
-    array matmul(const array &lhs, const array &rhs,
-                 const matProp optLhs, const matProp optRhs)
-    {
-        af_array out = 0;
-        AF_THROW(af_matmul(&out, lhs.get(), rhs.get(), optLhs, optRhs));
-        return array(out);
-    }
+namespace af {
+array matmul(const array &lhs, const array &rhs, const matProp optLhs,
+             const matProp optRhs) {
+    af_array out = 0;
+    AF_THROW(af_matmul(&out, lhs.get(), rhs.get(), optLhs, optRhs));
+    return array(out);
+}
 
-    array matmulNT(const array &lhs, const array &rhs)
-    {
-        af_array out = 0;
-        AF_THROW(af_matmul(&out, lhs.get(), rhs.get(),
-                           AF_MAT_NONE, AF_MAT_TRANS));
-        return array(out);
-    }
+array matmulNT(const array &lhs, const array &rhs) {
+    af_array out = 0;
+    AF_THROW(af_matmul(&out, lhs.get(), rhs.get(), AF_MAT_NONE, AF_MAT_TRANS));
+    return array(out);
+}
 
-    array matmulTN(const array &lhs, const array &rhs)
-    {
-        af_array out = 0;
-        AF_THROW(af_matmul(&out, lhs.get(), rhs.get(),
-                           AF_MAT_TRANS, AF_MAT_NONE));
-        return array(out);
-    }
+array matmulTN(const array &lhs, const array &rhs) {
+    af_array out = 0;
+    AF_THROW(af_matmul(&out, lhs.get(), rhs.get(), AF_MAT_TRANS, AF_MAT_NONE));
+    return array(out);
+}
+
+array matmulTT(const array &lhs, const array &rhs) {
+    af_array out = 0;
+    AF_THROW(af_matmul(&out, lhs.get(), rhs.get(), AF_MAT_TRANS, AF_MAT_TRANS));
+    return array(out);
+}
 
-    array matmulTT(const array &lhs, const array &rhs)
-    {
-        af_array out = 0;
-        AF_THROW(af_matmul(&out, lhs.get(), rhs.get(),
-                           AF_MAT_TRANS, AF_MAT_TRANS));
-        return array(out);
+array matmul(const array &a, const array &b, const array &c) {
+    dim_t tmp1 = a.dims(0) * b.dims(1);
+    dim_t tmp2 = b.dims(0) * c.dims(1);
+
+    if (tmp1 < tmp2) {
+        return matmul(matmul(a, b), c);
+    } else {
+        return matmul(a, matmul(b, c));
     }
+}
 
-    array matmul(const array &a, const array &b, const array &c)
-    {
-        int tmp1 = a.dims(0) * b.dims(1);
-        int tmp2 = b.dims(0) * c.dims(1);
+array matmul(const array &a, const array &b, const array &c, const array &d) {
+    dim_t tmp1 = a.dims(0) * c.dims(1);
+    dim_t tmp2 = b.dims(0) * d.dims(1);
 
-        if (tmp1 < tmp2) {
-            return matmul(matmul(a, b), c);
-        } else {
-            return matmul(a, matmul(b, c));
-        }
+    if (tmp1 < tmp2) {
+        return matmul(matmul(a, b, c), d);
+    } else {
+        return matmul(a, matmul(b, c, d));
     }
+}
 
-    array matmul(const array &a, const array &b, const array &c, const array &d)
-    {
-        int tmp1 = a.dims(0) * c.dims(1);
-        int tmp2 = b.dims(0) * d.dims(1);
+array dot(const array &lhs, const array &rhs, const matProp optLhs,
+          const matProp optRhs) {
+    af_array out = 0;
+    AF_THROW(af_dot(&out, lhs.get(), rhs.get(), optLhs, optRhs));
+    return array(out);
+}
 
-        if (tmp1 < tmp2) {
-            return matmul(matmul(a, b, c), d);
-        } else {
-            return matmul(a, matmul(b, c, d));
-        }
+#define INSTANTIATE_REAL(TYPE)                                               \
+    template<>                                                               \
+    AFAPI TYPE dot(const array &lhs, const array &rhs, const matProp optLhs, \
+                   const matProp optRhs) {                                   \
+        double rval = 0, ival = 0;                                           \
+        AF_THROW(                                                            \
+            af_dot_all(&rval, &ival, lhs.get(), rhs.get(), optLhs, optRhs)); \
+        return (TYPE)(rval);                                                 \
     }
 
-    array dot   (const array &lhs, const array &rhs,
-                 const matProp optLhs, const matProp optRhs)
-    {
-        af_array out = 0;
-        AF_THROW(af_dot(&out, lhs.get(), rhs.get(), optLhs, optRhs));
-        return array(out);
+#define INSTANTIATE_CPLX(TYPE, REAL)                                         \
+    template<>                                                               \
+    AFAPI TYPE dot(const array &lhs, const array &rhs, const matProp optLhs, \
+                   const matProp optRhs) {                                   \
+        double rval = 0, ival = 0;                                           \
+        AF_THROW(                                                            \
+            af_dot_all(&rval, &ival, lhs.get(), rhs.get(), optLhs, optRhs)); \
+        TYPE out((REAL)rval, (REAL)ival);                                    \
+        return out;                                                          \
     }
-}
+
+INSTANTIATE_REAL(float)
+INSTANTIATE_REAL(double)
+INSTANTIATE_CPLX(cfloat, float)
+INSTANTIATE_CPLX(cdouble, double)
+
+}  // namespace af
diff --git a/src/api/cpp/canny.cpp b/src/api/cpp/canny.cpp
new file mode 100644
index 0000000000..bdf7a382cc
--- /dev/null
+++ b/src/api/cpp/canny.cpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+array canny(const array& in, const cannyThreshold ctType, const float ltr,
+            const float htr, const unsigned sW, const bool isFast) {
+    af_array temp = 0;
+    AF_THROW(af_canny(&temp, in.get(), ctType, ltr, htr, sW, isFast));
+    return array(temp);
+}
+}  // namespace af
diff --git a/src/api/cpp/clamp.cpp b/src/api/cpp/clamp.cpp
new file mode 100644
index 0000000000..cb3616d764
--- /dev/null
+++ b/src/api/cpp/clamp.cpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/gfor.h>
+#include "error.hpp"
+
+namespace af {
+array clamp(const array &in, const array &lo, const array &hi) {
+    af_array out;
+    AF_THROW(af_clamp(&out, in.get(), lo.get(), hi.get(), gforGet()));
+    return array(out);
+}
+
+array clamp(const array &in, const array &lo, const double hi) {
+    return clamp(in, lo, constant(hi, lo.dims(), lo.type()));
+}
+
+array clamp(const array &in, const double lo, const array &hi) {
+    return clamp(in, constant(lo, hi.dims(), hi.type()), hi);
+}
+
+array clamp(const array &in, const double lo, const double hi) {
+    return clamp(in, constant(lo, in.dims(), in.type()),
+                 constant(hi, in.dims(), in.type()));
+}
+}  // namespace af
diff --git a/src/api/cpp/colorspace.cpp b/src/api/cpp/colorspace.cpp
index 6241e54ddb..eda57cccb7 100644
--- a/src/api/cpp/colorspace.cpp
+++ b/src/api/cpp/colorspace.cpp
@@ -7,25 +7,22 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/array.h>
+#include <af/compatible.h>
 #include <af/defines.h>
 #include <af/image.h>
-#include <af/compatible.h>
-#include <af/array.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array colorspace(const array& image, const CSpace to, const CSpace from)
-{
+array colorspace(const array& image, const CSpace to, const CSpace from) {
     return colorSpace(image, to, from);
 }
 
-array colorSpace(const array& image, const CSpace to, const CSpace from)
-{
+array colorSpace(const array& image, const CSpace to, const CSpace from) {
     af_array temp = 0;
-    AF_THROW(af_color_space(&temp, image.get(), to ,from));
+    AF_THROW(af_color_space(&temp, image.get(), to, from));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/common.hpp b/src/api/cpp/common.hpp
index ed191670ec..e1f161bdde 100644
--- a/src/api/cpp/common.hpp
+++ b/src/api/cpp/common.hpp
@@ -8,19 +8,28 @@
  ********************************************************/
 
 #include <af/dim4.hpp>
+#include <af/half.h>
 
-namespace af
-{
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wparentheses"
+#include "half.hpp"
+#pragma GCC diagnostic pop
+
+#ifdef AF_CUDA
+#include <cuda_fp16.h>
+#endif
+
+#include <cstring>
+
+namespace af {
 
 /// Get the first non-zero dimension
-static inline dim_t getFNSD(const int dim, af::dim4 dims)
-{
-    if(dim >= 0)
-        return dim;
+static inline dim_t getFNSD(const int dim, af::dim4 dims) {
+    if (dim >= 0) return dim;
 
     dim_t fNSD = 0;
-    for (dim_t i=0; i<4; ++i) {
-        if (dims[i]>1) {
+    for (dim_t i = 0; i < 4; ++i) {
+        if (dims[i] > 1) {
             fNSD = i;
             break;
         }
@@ -28,4 +37,30 @@ static inline dim_t getFNSD(const int dim, af::dim4 dims)
     return fNSD;
 }
 
+namespace {
+// Casts from one type to another. Needed for af_half conversion specializations
+template<typename To, typename T>
+inline To cast(T in) {
+    return static_cast<To>(in);
+}
+
+#if defined(AF_CUDA) && CUDA_VERSION < 10000
+template<>
+inline __half cast<__half, double>(double in) {
+    __half_raw out;
+    half_float::half h(in);
+    memcpy(&out, &h, sizeof(__half_raw));
+    return out;
+}
+#endif
+
+template<>
+[[gnu::unused]] af_half cast<af_half, double>(double in) {
+    half_float::half tmp = static_cast<half_float::half>(in);
+    af_half out;
+    memcpy(&out, &tmp, sizeof(af_half));
+    return out;
 }
+
+}  // namespace
+}  // namespace af
diff --git a/src/api/cpp/complex.cpp b/src/api/cpp/complex.cpp
new file mode 100644
index 0000000000..e058536b36
--- /dev/null
+++ b/src/api/cpp/complex.cpp
@@ -0,0 +1,157 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/complex.h>
+#include <cmath>
+#include <complex>
+#include <istream>
+
+namespace af {
+using std::complex;
+
+float real(af_cfloat val) { return val.real; }
+double real(af_cdouble val) { return val.real; }
+
+float imag(af_cfloat val) { return val.imag; }
+double imag(af_cdouble val) { return val.imag; }
+
+cfloat operator+(const cfloat &lhs, const cfloat &rhs) {
+    cfloat out(lhs.real + rhs.real, lhs.imag + rhs.imag);
+    return out;
+}
+
+cdouble operator+(const cdouble &lhs, const cdouble &rhs) {
+    cdouble out(lhs.real + rhs.real, lhs.imag + rhs.imag);
+    return out;
+}
+
+cfloat operator*(const cfloat &lhs, const cfloat &rhs) {
+    complex<float> clhs(lhs.real, lhs.imag);
+    complex<float> crhs(rhs.real, rhs.imag);
+    complex<float> out = clhs * crhs;
+    return {out.real(), out.imag()};
+}
+
+cdouble operator*(const cdouble &lhs, const cdouble &rhs) {
+    complex<double> clhs(lhs.real, lhs.imag);
+    complex<double> crhs(rhs.real, rhs.imag);
+    complex<double> out = clhs * crhs;
+    return {out.real(), out.imag()};
+}
+
+cfloat operator-(const cfloat &lhs, const cfloat &rhs) {
+    cfloat out(lhs.real - rhs.real, lhs.imag - rhs.imag);
+    return out;
+}
+
+cdouble operator-(const cdouble &lhs, const cdouble &rhs) {
+    cdouble out(lhs.real - rhs.real, lhs.imag - rhs.imag);
+    return out;
+}
+
+cfloat operator/(const cfloat &lhs, const cfloat &rhs) {
+    complex<float> clhs(lhs.real, lhs.imag);
+    complex<float> crhs(rhs.real, rhs.imag);
+    complex<float> out = clhs / crhs;
+    return {out.real(), out.imag()};
+}
+
+cdouble operator/(const cdouble &lhs, const cdouble &rhs) {
+    complex<double> clhs(lhs.real, lhs.imag);
+    complex<double> crhs(rhs.real, rhs.imag);
+    complex<double> out = clhs / crhs;
+    return {out.real(), out.imag()};
+}
+
+#define IMPL_OP(OP)                                              \
+    cfloat operator OP(const cfloat &lhs, const double &rhs) {   \
+        return lhs OP cfloat(rhs);                               \
+    }                                                            \
+    cdouble operator OP(const cdouble &lhs, const double &rhs) { \
+        return lhs OP cdouble(rhs);                              \
+    }                                                            \
+    cfloat operator OP(const double &lhs, const cfloat &rhs) {   \
+        return cfloat(lhs) OP rhs;                               \
+    }                                                            \
+    cdouble operator OP(const double &lhs, const cdouble &rhs) { \
+        return cdouble(lhs) OP rhs;                              \
+    }                                                            \
+    cdouble operator OP(const cfloat &lhs, const cdouble &rhs) { \
+        return cdouble(real(lhs), imag(lhs)) OP rhs;             \
+    }                                                            \
+    cdouble operator OP(const cdouble &lhs, const cfloat &rhs) { \
+        return lhs OP cdouble(real(rhs), imag(rhs));             \
+    }
+
+IMPL_OP(+)
+IMPL_OP(-)
+IMPL_OP(*)
+IMPL_OP(/)
+
+#undef IMPL_OP
+
+bool operator!=(const cfloat &lhs, const cfloat &rhs) { return !(lhs == rhs); }
+
+bool operator!=(const cdouble &lhs, const cdouble &rhs) {
+    return !(lhs == rhs);
+}
+
+bool operator==(const cfloat &lhs, const cfloat &rhs) {
+    return lhs.real == rhs.real && lhs.imag == rhs.imag;
+}
+
+bool operator==(const cdouble &lhs, const cdouble &rhs) {
+    return lhs.real == rhs.real && lhs.imag == rhs.imag;
+}
+
+float abs(const cfloat &val) {
+    std::complex<float> out(val.real, val.imag);
+    return std::abs(out);
+}
+
+double abs(const cdouble &val) {
+    std::complex<double> out(val.real, val.imag);
+    return std::abs(out);
+}
+
+cfloat conj(const cfloat &val) { return {val.real, -val.imag}; }
+
+cdouble conj(const cdouble &val) { return {val.real, -val.imag}; }
+
+std::ostream &operator<<(std::ostream &os, const cfloat &in) {
+    os << "(" << in.real << ", " << in.imag << ")";
+    return os;
+}
+
+std::ostream &operator<<(std::ostream &os, const cdouble &in) {
+    os << "(" << in.real << ", " << in.imag << ")";
+    return os;
+}
+
+std::istream &operator>>(std::istream &is, cfloat &in) {
+    char trash;
+    is >> trash;
+    is >> in.real;
+    is >> trash;
+    is >> in.imag;
+    is >> trash;
+    return is;
+}
+
+std::istream &operator>>(std::istream &is, cdouble &in) {
+    char trash;
+    is >> trash;
+    is >> in.real;
+    is >> trash;
+    is >> in.imag;
+    is >> trash;
+    return is;
+}
+
+}  // namespace af
diff --git a/src/api/cpp/confidence_connected.cpp b/src/api/cpp/confidence_connected.cpp
new file mode 100644
index 0000000000..97e5209f8c
--- /dev/null
+++ b/src/api/cpp/confidence_connected.cpp
@@ -0,0 +1,49 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+
+array confidenceCC(const array &in, const size_t num_seeds,
+                   const unsigned *seedx, const unsigned *seedy,
+                   const unsigned radius, const unsigned multiplier,
+                   const int iter, const double segmentedValue) {
+    af::array xs(dim4(num_seeds), seedx);
+    af::array ys(dim4(num_seeds), seedy);
+    af_array temp = 0;
+    AF_THROW(af_confidence_cc(&temp, in.get(), xs.get(), ys.get(), radius,
+                              multiplier, iter, segmentedValue));
+    return array(temp);
+}
+
+array confidenceCC(const array &in, const array &seeds, const unsigned radius,
+                   const unsigned multiplier, const int iter,
+                   const double segmentedValue) {
+    af::array xcoords = seeds.col(0);
+    af::array ycoords = seeds.col(1);
+    af_array temp     = 0;
+    AF_THROW(af_confidence_cc(&temp, in.get(), xcoords.get(), ycoords.get(),
+                              radius, multiplier, iter, segmentedValue));
+    return array(temp);
+}
+
+array confidenceCC(const array &in, const array &seedx, const array &seedy,
+                   const unsigned radius, const unsigned multiplier,
+                   const int iter, const double segmentedValue) {
+    af_array temp = 0;
+    AF_THROW(af_confidence_cc(&temp, in.get(), seedx.get(), seedy.get(), radius,
+                              multiplier, iter, segmentedValue));
+    return array(temp);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/constants.cpp b/src/api/cpp/constants.cpp
deleted file mode 100644
index 62b7967beb..0000000000
--- a/src/api/cpp/constants.cpp
+++ /dev/null
@@ -1,9 +0,0 @@
-#include <limits>
-#include <af/constants.h>
-
-namespace af
-{
-    const double NaN = std::numeric_limits<double>::quiet_NaN();
-    const double Inf = std::numeric_limits<double>::infinity();
-    const double Pi  = 3.1415926535897932384626433832795028841971693993751;
-}
diff --git a/src/api/cpp/convolve.cpp b/src/api/cpp/convolve.cpp
index f1eccdd7dc..a69d26b9b4 100644
--- a/src/api/cpp/convolve.cpp
+++ b/src/api/cpp/convolve.cpp
@@ -7,53 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/signal.h>
 #include <af/array.h>
-#include "error.hpp"
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/ml.h>
+#include <af/signal.h>
 #include <algorithm>
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array convolve(const array& signal, const array& filter, const convMode mode, convDomain domain)
-{
+array convolve(const array &signal, const array &filter, const convMode mode,
+               convDomain domain) {
     unsigned sN = signal.numdims();
     unsigned fN = filter.numdims();
 
-    switch(std::min(sN,fN)) {
+    switch (std::min(sN, fN)) {
         case 1: return convolve1(signal, filter, mode, domain);
         case 2: return convolve2(signal, filter, mode, domain);
+        default:
         case 3: return convolve3(signal, filter, mode, domain);
-        default: return convolve3(signal, filter, mode, domain);
     }
 }
 
-array convolve(const array& col_filter, const array& row_filter, const array& signal, const convMode mode)
-{
+array convolve(const array &col_filter, const array &row_filter,
+               const array &signal, const convMode mode) {
     af_array out = 0;
-    AF_THROW(af_convolve2_sep(&out, col_filter.get(), row_filter.get(), signal.get(), mode));
+    AF_THROW(af_convolve2_sep(&out, col_filter.get(), row_filter.get(),
+                              signal.get(), mode));
     return array(out);
 }
 
-array convolve1(const array& signal, const array& filter, const convMode mode, convDomain domain)
-{
+array convolve1(const array &signal, const array &filter, const convMode mode,
+                convDomain domain) {
     af_array out = 0;
     AF_THROW(af_convolve1(&out, signal.get(), filter.get(), mode, domain));
     return array(out);
 }
 
-array convolve2(const array& signal, const array& filter, const convMode mode, convDomain domain)
-{
+array convolve2(const array &signal, const array &filter, const convMode mode,
+                convDomain domain) {
     af_array out = 0;
     AF_THROW(af_convolve2(&out, signal.get(), filter.get(), mode, domain));
     return array(out);
 }
 
-array convolve3(const array& signal, const array& filter, const convMode mode, convDomain domain)
-{
+array convolve2NN(
+    const array &signal, const array &filter,
+    const dim4 stride,      // NOLINT(performance-unnecessary-value-param)
+    const dim4 padding,     // NOLINT(performance-unnecessary-value-param)
+    const dim4 dilation) {  // NOLINT(performance-unnecessary-value-param)
+    af_array out = 0;
+    AF_THROW(af_convolve2_nn(&out, signal.get(), filter.get(), 2, stride.get(),
+                             2, padding.get(), 2, dilation.get()));
+    return array(out);
+}
+
+array convolve2GradientNN(
+    const array &incoming_gradient, const array &original_signal,
+    const array &original_filter, const array &convolved_output,
+    const dim4 stride,    // NOLINT(performance-unnecessary-value-param)
+    const dim4 padding,   // NOLINT(performance-unnecessary-value-param)
+    const dim4 dilation,  // NOLINT(performance-unnecessary-value-param)
+    af_conv_gradient_type gradType) {
+    af_array out = 0;
+    AF_THROW(af_convolve2_gradient_nn(
+        &out, incoming_gradient.get(), original_signal.get(),
+        original_filter.get(), convolved_output.get(), 2, stride.get(), 2,
+        padding.get(), 2, dilation.get(), gradType));
+    return array(out);
+}
+
+array convolve3(const array &signal, const array &filter, const convMode mode,
+                convDomain domain) {
     af_array out = 0;
     AF_THROW(af_convolve3(&out, signal.get(), filter.get(), mode, domain));
     return array(out);
 }
 
+array filter(const array &image, const array &kernel) {
+    return convolve(image, kernel, AF_CONV_DEFAULT, AF_CONV_AUTO);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/corrcoef.cpp b/src/api/cpp/corrcoef.cpp
index 3b8f5cfcdb..dbedad5aee 100644
--- a/src/api/cpp/corrcoef.cpp
+++ b/src/api/cpp/corrcoef.cpp
@@ -7,28 +7,32 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/statistics.h>
 #include <af/array.h>
+#include <af/statistics.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-#define INSTANTIATE_CORRCOEF(T)                                    \
-    template<> AFAPI T corrcoef(const array& X, const array& Y)    \
-    {                                                              \
-        double real;                                               \
-        AF_THROW(af_corrcoef(&real, NULL, X.get(), Y.get()));      \
-        return (T)real;                                            \
-    }                                                              \
+#define INSTANTIATE_CORRCOEF(T)                               \
+    template<>                                                \
+    AFAPI T corrcoef(const array& X, const array& Y) {        \
+        double real;                                          \
+        AF_THROW(af_corrcoef(&real, NULL, X.get(), Y.get())); \
+        return (T)real;                                       \
+    }
 
 INSTANTIATE_CORRCOEF(float);
 INSTANTIATE_CORRCOEF(double);
 INSTANTIATE_CORRCOEF(int);
 INSTANTIATE_CORRCOEF(unsigned int);
 INSTANTIATE_CORRCOEF(char);
+INSTANTIATE_CORRCOEF(signed char);
 INSTANTIATE_CORRCOEF(unsigned char);
+INSTANTIATE_CORRCOEF(long long);
+INSTANTIATE_CORRCOEF(unsigned long long);
+INSTANTIATE_CORRCOEF(short);
+INSTANTIATE_CORRCOEF(unsigned short);
 
 #undef INSTANTIATE_CORRCOEF
 
-}
+}  // namespace af
diff --git a/src/api/cpp/covariance.cpp b/src/api/cpp/covariance.cpp
index a38fe0410f..8261ea0cd7 100644
--- a/src/api/cpp/covariance.cpp
+++ b/src/api/cpp/covariance.cpp
@@ -7,18 +7,22 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/statistics.h>
 #include <af/array.h>
+#include <af/statistics.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array cov(const array& X, const array& Y, const bool isbiased)
-{
+array cov(const array& X, const array& Y, const bool isbiased) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return cov(X, Y, bias);
+}
+
+array cov(const array& X, const array& Y, const af_var_bias bias) {
     af_array temp = 0;
-    AF_THROW(af_cov(&temp, X.get(), Y.get(), isbiased));
+    AF_THROW(af_cov_v2(&temp, X.get(), Y.get(), bias));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/data.cpp b/src/api/cpp/data.cpp
index 9663d96251..f5eb8c2544 100644
--- a/src/api/cpp/data.cpp
+++ b/src/api/cpp/data.cpp
@@ -6,353 +6,355 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <af/data.h>
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wparentheses"
+#include <half.hpp>
+#pragma GCC diagnostic pop
 
-#include <af/array.h>
 #include <af/arith.h>
-#include <af/data.h>
+#include <af/array.h>
+#include <af/complex.h>
+#include <af/defines.h>
+#include <af/gfor.h>
+#include <af/half.h>
 #include <af/traits.hpp>
 #include "error.hpp"
 
-namespace af
-{
-
-#define CONSTANT(TYPE)                                                  \
-    array constant(TYPE val, const dim_t d0, const af::dtype ty)        \
-    {                                                                   \
-        return constant(val, dim4(d0), ty);                             \
-    }                                                                   \
-                                                                        \
-    array constant(TYPE val, const dim_t d0,                            \
-                   const dim_t d1, const af::dtype ty)                  \
-    {                                                                   \
-        return constant(val, dim4(d0, d1), ty);                         \
-    }                                                                   \
-                                                                        \
-    array constant(TYPE val, const dim_t d0,                            \
-                   const dim_t d1, const dim_t d2, const af::dtype ty)  \
-    {                                                                   \
-        return constant(val, dim4(d0, d1, d2), ty);                     \
-    }                                                                   \
-                                                                        \
-    array constant(TYPE val, const dim_t d0,                            \
-                   const dim_t d1, const dim_t d2,                      \
-                   const dim_t d3, const af::dtype ty)                  \
-    {                                                                   \
-        return constant(val, dim4(d0, d1, d2, d3), ty);                 \
-    }                                                                   \
-
-    CONSTANT(double);
-    CONSTANT(float);
-    CONSTANT(int);
-    CONSTANT(unsigned);
-    CONSTANT(char);
-    CONSTANT(unsigned char);
-    CONSTANT(cfloat);
-    CONSTANT(cdouble);
-    CONSTANT(long);
-    CONSTANT(unsigned long);
-    CONSTANT(long long);
-    CONSTANT(unsigned long long);
-    CONSTANT(bool);
+#include <type_traits>
+
+using af::array;
+using af::dim4;
+using af::dtype;
+using std::enable_if;
+
+namespace {
+// NOTE: we are repeating this here so that we don't need to access the
+// is_complex types in backend/common. This is done to isolate the C++ API from
+// the internal API
+template<typename T>
+struct is_complex {
+    static const bool value = false;
+};
+template<>
+struct is_complex<af::cfloat> {
+    static const bool value = true;
+};
+template<>
+struct is_complex<af::cdouble> {
+    static const bool value = true;
+};
+
+array constant(af_half val, const dim4 &dims, const dtype type) {
+    af_array res;
+    UNUSED(val);
+    AF_THROW(af_constant(&res, 0,  //(double)val,
+                         dims.ndims(), dims.get(), type));
+    return array(res);
+}
 
-#undef CONSTANT
+template<typename T, typename = typename enable_if<
+                         !static_cast<bool>(is_complex<T>::value), T>::type>
+array constant(T val, const dim4 &dims, dtype type) {
+    af_array res;
+    if (type != s64 && type != u64) {
+        AF_THROW(
+            af_constant(&res, (double)val, dims.ndims(), dims.get(), type));
+    } else if (type == s64) {
+        AF_THROW(
+            af_constant_long(&res, (long long)val, dims.ndims(), dims.get()));
+    } else {
+        AF_THROW(af_constant_ulong(&res, (unsigned long long)val, dims.ndims(),
+                                   dims.get()));
+    }
+    return array(res);
+}
 
-#define CONSTANT_DOUBLE(TYPE)                                   \
-    array constant(TYPE val, const dim4 &dims, const af::dtype type)  \
-    {                                                           \
-        af_array res;                                           \
-        AF_THROW(af_constant(&res, val,                         \
-                             dims.ndims(), dims.get(), type));  \
-        return array(res);                                      \
-    }                                                           \
-
-    CONSTANT_DOUBLE(double)
-    CONSTANT_DOUBLE(float)
-    CONSTANT_DOUBLE(bool)
-    CONSTANT_DOUBLE(int)
-    CONSTANT_DOUBLE(unsigned)
-    CONSTANT_DOUBLE(char)
-    CONSTANT_DOUBLE(unsigned char)
-
-#undef CONSTANT_DOUBLE
-
-#define CONSTANT_LONG(TYPE, DTYPE)                              \
-    array constant(TYPE val, const dim4 &dims, const af::dtype type)  \
-    {                                                           \
-        if (type != s64 && type != u64) {                       \
-            return constant((double)val, dims, type);           \
-        }                                                       \
-        af_array res;                                           \
-        if (DTYPE == s64) {                                     \
-            AF_THROW(af_constant_long (&res, ( intl)val,        \
-                                       dims.ndims(),            \
-                                       dims.get()));            \
-        } else {                                                \
-            AF_THROW(af_constant_ulong(&res, (uintl)val,        \
-                                       dims.ndims(),            \
-                                       dims.get()));            \
-        }                                                       \
-        return array(res);                                      \
-    }                                                           \
-
-    CONSTANT_LONG(long, s64)
-    CONSTANT_LONG(long long, s64)
-    CONSTANT_LONG(unsigned long, u64)
-    CONSTANT_LONG(unsigned long long, u64)
-
-#undef CONSTANT_LONG
-
-#define CONSTANT_COMPLEX(TYPE)                                  \
-    array constant(TYPE val, const dim4 &dims, const af::dtype type)  \
-    {                                                           \
-        if (type != c32 && type != c64) {                       \
-            return constant(real(val), dims, type);             \
-        }                                                       \
-        af_array res;                                           \
-        AF_THROW(af_constant_complex(&res,                      \
-                                     real(val),                 \
-                                     imag(val),                 \
-                                     dims.ndims(),              \
-                                     dims.get(), type));        \
-        return array(res);                                      \
-    }                                                           \
-
-    CONSTANT_COMPLEX(cdouble)
-    CONSTANT_COMPLEX(cfloat)
-
-#undef CONSTANT_COMPLEX
-
-    array randu(const dim4 &dims, const af::dtype type)
-    {
-        af_array res;
-        AF_THROW(af_randu(&res, dims.ndims(), dims.get(), type));
-        return array(res);
+template<typename T>
+typename enable_if<static_cast<bool>(is_complex<T>::value), array>::type
+constant(T val, const dim4 &dims, const dtype type) {
+    if (type != c32 && type != c64) {
+        return ::constant(real(val), dims, type);
     }
+    af_array res;
+    AF_THROW(af_constant_complex(&res, real(val), imag(val), dims.ndims(),
+                                 dims.get(), type));
+    return array(res);
+}
+}  // namespace
 
-    array randu(const dim_t d0, const af::dtype ty)
-    {
-        return randu(dim4(d0), ty);
-    }
+namespace af {
+template<typename T>
+array constant(T val, const dim4 &dims, const af::dtype type) {
+    return ::constant(val, dims, type);
+}
 
-    array randu(const dim_t d0,
-                const dim_t d1, const af::dtype ty)
-    {
-        return randu(dim4(d0, d1), ty);
-    }
+template<typename T>
+array constant(T val, const dim_t d0, const af::dtype ty) {
+    return ::constant(val, dim4(d0), ty);
+}
 
-    array randu(const dim_t d0,
-                const dim_t d1, const dim_t d2, const af::dtype ty)
-    {
-        return randu(dim4(d0, d1, d2), ty);
-    }
+template<typename T>
+array constant(T val, const dim_t d0, const dim_t d1, const af::dtype ty) {
+    return ::constant(val, dim4(d0, d1), ty);
+}
 
-    array randu(const dim_t d0,
-                const dim_t d1, const dim_t d2,
-                const dim_t d3, const af::dtype ty)
-    {
-        return randu(dim4(d0, d1, d2, d3), ty);
-    }
+template<typename T>
+array constant(T val, const dim_t d0, const dim_t d1, const dim_t d2,
+               const af::dtype ty) {
+    return ::constant(val, dim4(d0, d1, d2), ty);
+}
 
-    array randn(const dim4 &dims, const af::dtype type)
-    {
-        af_array res;
-        AF_THROW(af_randn(&res, dims.ndims(), dims.get(), type));
-        return array(res);
-    }
+template<typename T>
+array constant(T val, const dim_t d0, const dim_t d1, const dim_t d2,
+               const dim_t d3, const af::dtype ty) {
+    return ::constant(val, dim4(d0, d1, d2, d3), ty);
+}
 
-    array randn(const dim_t d0, const af::dtype ty)
-    {
-        return randn(dim4(d0), ty);
-    }
+#define CONSTANT(TYPE)                                                       \
+    template AFAPI array constant<TYPE>(TYPE val, const dim4 &dims,          \
+                                        const af::dtype ty);                 \
+    template AFAPI array constant<TYPE>(TYPE val, const dim_t d0,            \
+                                        const af::dtype ty);                 \
+    template AFAPI array constant<TYPE>(TYPE val, const dim_t d0,            \
+                                        const dim_t d1, const af::dtype ty); \
+    template AFAPI array constant<TYPE>(TYPE val, const dim_t d0,            \
+                                        const dim_t d1, const dim_t d2,      \
+                                        const af::dtype ty);                 \
+    template AFAPI array constant<TYPE>(TYPE val, const dim_t d0,            \
+                                        const dim_t d1, const dim_t d2,      \
+                                        const dim_t d3, const af::dtype ty);
+CONSTANT(double);
+CONSTANT(float);
+CONSTANT(int);
+CONSTANT(unsigned);
+CONSTANT(char);
+CONSTANT(signed char);
+CONSTANT(unsigned char);
+CONSTANT(cfloat);
+CONSTANT(cdouble);
+CONSTANT(long);
+CONSTANT(unsigned long);
+CONSTANT(long long);
+CONSTANT(unsigned long long);
+CONSTANT(bool);
+CONSTANT(short);
+CONSTANT(unsigned short);
+CONSTANT(half);
+CONSTANT(half_float::half);
 
-    array randn(const dim_t d0,
-                const dim_t d1, const af::dtype ty)
-    {
-        return randn(dim4(d0, d1), ty);
-    }
+#undef CONSTANT
 
-    array randn(const dim_t d0,
-                const dim_t d1, const dim_t d2, const af::dtype ty)
-    {
-        return randn(dim4(d0, d1, d2), ty);
-    }
+array range(const dim4 &dims, const int seq_dim, const af::dtype ty) {
+    af_array out;
+    AF_THROW(af_range(&out, dims.ndims(), dims.get(), seq_dim, ty));
+    return array(out);
+}
 
-    array randn(const dim_t d0,
-                const dim_t d1, const dim_t d2,
-                const dim_t d3, const af::dtype ty)
-    {
-        return randn(dim4(d0, d1, d2, d3), ty);
-    }
+array range(const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3,
+            const int seq_dim, const af::dtype ty) {
+    return range(dim4(d0, d1, d2, d3), seq_dim, ty);
+}
 
-    void setSeed(const uintl seed)
-    {
-        AF_THROW(af_set_seed(seed));
-    }
+array iota(const dim4 &dims, const dim4 &tile_dims, const af::dtype ty) {
+    af_array out;
+    AF_THROW(af_iota(&out, dims.ndims(), dims.get(), tile_dims.ndims(),
+                     tile_dims.get(), ty));
+    return array(out);
+}
 
-    uintl getSeed()
-    {
-        uintl seed = 0;
-        AF_THROW(af_get_seed(&seed));
-        return seed;
-    }
+array identity(const dim4 &dims, const af::dtype type) {
+    af_array res;
+    AF_THROW(af_identity(&res, dims.ndims(), dims.get(), type));
+    return array(res);
+}
 
-    array range(const dim4 &dims, const int seq_dim, const af::dtype ty)
-    {
-        af_array out;
-        AF_THROW(af_range(&out, dims.ndims(), dims.get(), seq_dim, ty));
-        return array(out);
-    }
+array identity(const dim_t d0, const af::dtype ty) {
+    return identity(dim4(d0), ty);
+}
 
-    array range(const dim_t d0, const dim_t d1, const dim_t d2,
-               const dim_t d3, const int seq_dim, const af::dtype ty)
-    {
-        return range(dim4(d0, d1, d2, d3), seq_dim, ty);
-    }
+array identity(const dim_t d0, const dim_t d1, const af::dtype ty) {
+    return identity(dim4(d0, d1), ty);
+}
 
-    array iota(const dim4 &dims, const dim4 &tile_dims, const af::dtype ty)
-    {
-        af_array out;
-        AF_THROW(af_iota(&out, dims.ndims(), dims.get(), tile_dims.ndims(), tile_dims.get(), ty));
-        return array(out);
-    }
+array identity(const dim_t d0, const dim_t d1, const dim_t d2,
+               const af::dtype ty) {
+    return identity(dim4(d0, d1, d2), ty);
+}
 
-    array identity(const dim4 &dims, const af::dtype type)
-    {
-        af_array res;
-        AF_THROW(af_identity(&res, dims.ndims(), dims.get(), type));
-        return array(res);
-    }
+array identity(const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3,
+               const af::dtype ty) {
+    return identity(dim4(d0, d1, d2, d3), ty);
+}
 
-    array identity(const dim_t d0, const af::dtype ty)
-    {
-        return identity(dim4(d0), ty);
+array diag(const array &in, const int num, const bool extract) {
+    af_array res;
+    if (extract) {
+        AF_THROW(af_diag_extract(&res, in.get(), num));
+    } else {
+        AF_THROW(af_diag_create(&res, in.get(), num));
     }
 
-    array identity(const dim_t d0,
-                const dim_t d1, const af::dtype ty)
-    {
-        return identity(dim4(d0, d1), ty);
-    }
+    return array(res);
+}
 
-    array identity(const dim_t d0,
-                const dim_t d1, const dim_t d2, const af::dtype ty)
-    {
-        return identity(dim4(d0, d1, d2), ty);
-    }
+array moddims(const array &in, const unsigned ndims, const dim_t *const dims) {
+    af_array out = 0;
+    AF_THROW(af_moddims(&out, in.get(), ndims, dims));
+    return array(out);
+}
 
-    array identity(const dim_t d0,
-                const dim_t d1, const dim_t d2,
-                const dim_t d3, const af::dtype ty)
-    {
-        return identity(dim4(d0, d1, d2, d3), ty);
-    }
+array moddims(const array &in, const dim4 &dims) {
+    return af::moddims(in, dims.ndims(), dims.get());
+}
 
-    array diag(const array &in, const int num, const bool extract)
-    {
-        af_array res;
-        if (extract) {
-            AF_THROW(af_diag_extract(&res, in.get(), num));
-        } else {
-            AF_THROW(af_diag_create(&res, in.get(), num));
-        }
+array moddims(const array &in, const dim_t d0, const dim_t d1, const dim_t d2,
+              const dim_t d3) {
+    dim_t dims[4] = {d0, d1, d2, d3};
+    return af::moddims(in, 4, dims);
+}
 
-        return array(res);
-    }
+array flat(const array &in) {
+    af_array out = 0;
+    AF_THROW(af_flat(&out, in.get()));
+    return array(out);
+}
 
-    array moddims(const array& in, const unsigned ndims, const dim_t * const dims)
-    {
-        af_array out = 0;
-        AF_THROW(af_moddims(&out, in.get(), ndims, dims));
-        return array(out);
-    }
+array join(const int dim, const array &first, const array &second) {
+    af_array out = 0;
+    AF_THROW(af_join(&out, dim, first.get(), second.get()));
+    return array(out);
+}
 
-    array moddims(const array& in, const dim4& dims)
-    {
-        return af::moddims(in, dims.ndims(), dims.get());
-    }
+array join(const int dim, const array &first, const array &second,
+           const array &third) {
+    af_array out       = 0;
+    af_array inputs[3] = {first.get(), second.get(), third.get()};
+    AF_THROW(af_join_many(&out, dim, 3, inputs));
+    return array(out);
+}
 
-    array moddims(const array& in, const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3)
-    {
-        dim_t dims[4] = {d0, d1, d2, d3};
-        return af::moddims(in, 4, dims);
-    }
+array join(const int dim, const array &first, const array &second,
+           const array &third, const array &fourth) {
+    af_array out       = 0;
+    af_array inputs[4] = {first.get(), second.get(), third.get(), fourth.get()};
+    AF_THROW(af_join_many(&out, dim, 4, inputs));
+    return array(out);
+}
 
-    array flat(const array& in)
-    {
-        af_array out = 0;
-        AF_THROW(af_flat(&out, in.get()));
-        return array(out);
-    }
+array tile(const array &in, const unsigned x, const unsigned y,
+           const unsigned z, const unsigned w) {
+    af_array out = 0;
+    AF_THROW(af_tile(&out, in.get(), x, y, z, w));
+    return array(out);
+}
 
-    array join(const int dim, const array& first, const array& second)
-    {
-        af_array out = 0;
-        AF_THROW(af_join(&out, dim, first.get(), second.get()));
-        return array(out);
-    }
+array tile(const array &in, const af::dim4 &dims) {
+    af_array out = 0;
+    AF_THROW(af_tile(&out, in.get(), dims[0], dims[1], dims[2], dims[3]));
+    return array(out);
+}
 
-    array join(const int dim, const array& first, const array& second, const array &third)
-    {
-        af_array out = 0;
-        af_array inputs[3] = {first.get(), second.get(), third.get()};
-        AF_THROW(af_join_many(&out, dim, 3, inputs));
-        return array(out);
-    }
+array reorder(const array &in, const unsigned x, const unsigned y,
+              const unsigned z, const unsigned w) {
+    af_array out = 0;
+    AF_THROW(af_reorder(&out, in.get(), x, y, z, w));
+    return array(out);
+}
 
-    array join(const int dim, const array& first, const array& second, const array &third, const array &fourth)
-    {
-        af_array out = 0;
-        af_array inputs[4] = {first.get(), second.get(), third.get(), fourth.get()};
-        AF_THROW(af_join_many(&out, dim, 4, inputs));
-        return array(out);
-    }
+array shift(const array &in, const int x, const int y, const int z,
+            const int w) {
+    af_array out = 0;
+    AF_THROW(af_shift(&out, in.get(), x, y, z, w));
+    return array(out);
+}
 
-    array tile(const array& in, const unsigned x, const unsigned y, const unsigned z, const unsigned w)
-    {
-        af_array out = 0;
-        AF_THROW(af_tile(&out, in.get(), x, y, z, w));
-        return array(out);
-    }
+array flip(const array &in, const unsigned dim) {
+    af_array out = 0;
+    AF_THROW(af_flip(&out, in.get(), dim));
+    return array(out);
+}
 
-    array tile(const array& in, const af::dim4 &dims)
-    {
-        af_array out = 0;
-        AF_THROW(af_tile(&out, in.get(), dims[0], dims[1], dims[2], dims[3]));
-        return array(out);
-    }
+array lower(const array &in, bool is_unit_diag) {
+    af_array res;
+    AF_THROW(af_lower(&res, in.get(), is_unit_diag));
+    return array(res);
+}
 
-    array reorder(const array& in, const unsigned x, const unsigned y, const unsigned z, const unsigned w)
-    {
-        af_array out = 0;
-        AF_THROW(af_reorder(&out, in.get(), x, y, z, w));
-        return array(out);
-    }
+array upper(const array &in, bool is_unit_diag) {
+    af_array res;
+    AF_THROW(af_upper(&res, in.get(), is_unit_diag));
+    return array(res);
+}
 
-    array shift(const array& in, const int x, const int y, const int z, const int w)
-    {
-        af_array out = 0;
-        AF_THROW(af_shift(&out, in.get(), x, y, z, w));
-        return array(out);
-    }
+array select(const array &cond, const array &a, const array &b) {
+    af_array res;
+    AF_THROW(af_select(&res, cond.get(), a.get(), b.get()));
+    return array(res);
+}
 
-    array flip(const array &in, const unsigned dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_flip(&out, in.get(), dim));
-        return array(out);
-    }
+array select(const array &cond, const array &a, const double &b) {
+    af_array res;
+    AF_THROW(af_select_scalar_r(&res, cond.get(), a.get(), b));
+    return array(res);
+}
 
-    array lower(const array &in, bool is_unit_diag)
-    {
-        af_array res;
-        AF_THROW(af_lower(&res, in.get(), is_unit_diag));
-        return array(res);
-    }
+array select(const array &cond, const double &a, const array &b) {
+    af_array res;
+    AF_THROW(af_select_scalar_l(&res, cond.get(), a, b.get()));
+    return array(res);
+}
 
-    array upper(const array &in, bool is_unit_diag)
-    {
-        af_array res;
-        AF_THROW(af_upper(&res, in.get(), is_unit_diag));
-        return array(res);
-    }
+void replace(array &a, const array &cond, const array &b) {
+    AF_THROW(af_replace(a.get(), cond.get(), b.get()));
+}
+
+void replace(array &a, const array &cond, const double &b) {
+    AF_THROW(af_replace_scalar(a.get(), cond.get(), b));
+}
+
+void replace(array &a, const array &cond, const long long b) {
+    AF_THROW(af_replace_scalar_long(a.get(), cond.get(), b));
+}
+
+void replace(array &a, const array &cond, const unsigned long long b) {
+    AF_THROW(af_replace_scalar_ulong(a.get(), cond.get(), b));
 }
+
+array select(const array &cond, const array &a, const long long b) {
+    af_array res;
+    AF_THROW(af_select_scalar_r_long(&res, cond.get(), a.get(), b));
+    return array(res);
+}
+
+array select(const array &cond, const array &a, const unsigned long long b) {
+    af_array res;
+    AF_THROW(af_select_scalar_r_ulong(&res, cond.get(), a.get(), b));
+    return array(res);
+}
+
+array select(const array &cond, const long long a, const array &b) {
+    af_array res;
+    AF_THROW(af_select_scalar_l_long(&res, cond.get(), a, b.get()));
+    return array(res);
+}
+
+array select(const array &cond, const unsigned long long a, const array &b) {
+    af_array res;
+    AF_THROW(af_select_scalar_l_ulong(&res, cond.get(), a, b.get()));
+    return array(res);
+}
+
+array pad(const array &in, const dim4 &beginPadding, const dim4 &endPadding,
+          const borderType padFillType) {
+    af_array out = 0;
+    // FIXME(pradeep) Cannot use dim4::ndims() here, since it
+    //               always returns 0 if any one of the dimensions
+    //               has no padding at all
+    AF_THROW(af_pad(&out, in.get(), 4, beginPadding.get(), 4, endPadding.get(),
+                    padFillType));
+    return array(out);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/deconvolution.cpp b/src/api/cpp/deconvolution.cpp
new file mode 100644
index 0000000000..4b466a0b7e
--- /dev/null
+++ b/src/api/cpp/deconvolution.cpp
@@ -0,0 +1,30 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+array iterativeDeconv(const array& in, const array& ker,
+                      const unsigned iterations, const float relaxFactor,
+                      const iterativeDeconvAlgo algo) {
+    af_array temp = 0;
+    AF_THROW(af_iterative_deconv(&temp, in.get(), ker.get(), iterations,
+                                 relaxFactor, algo));
+    return array(temp);
+}
+
+array inverseDeconv(const array& in, const array& psf, const float gamma,
+                    const inverseDeconvAlgo algo) {
+    af_array temp = 0;
+    AF_THROW(af_inverse_deconv(&temp, in.get(), psf.get(), gamma, algo));
+    return array(temp);
+}
+}  // namespace af
diff --git a/src/api/cpp/device.cpp b/src/api/cpp/device.cpp
index 0a39ed2bae..b62589097e 100644
--- a/src/api/cpp/device.cpp
+++ b/src/api/cpp/device.cpp
@@ -7,152 +7,198 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/device.h>
+#include <common/deprecated.hpp>
+#include <af/array.h>
+#include <af/backend.h>
 #include <af/compatible.h>
+#include <af/device.h>
 #include <af/traits.hpp>
 #include "error.hpp"
+#include "type_util.hpp"
 
-namespace af
-{
-    void info()
-    {
-        AF_THROW(af_info());
-    }
+namespace af {
+void setBackend(const Backend bknd) { AF_THROW(af_set_backend(bknd)); }
 
-    void deviceprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-    {
-        deviceInfo(d_name, d_platform, d_toolkit, d_compute);
-    }
-    void deviceInfo(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-    {
-        AF_THROW(af_device_info(d_name, d_platform, d_toolkit, d_compute));
-    }
+unsigned getBackendCount() {
+    unsigned temp = 1;
+    AF_THROW(af_get_backend_count(&temp));
+    return temp;
+}
 
-    int getDeviceCount()
-    {
-        int devices = -1;
-        AF_THROW(af_get_device_count(&devices));
-        return devices;
-    }
+int getAvailableBackends() {
+    int result = 0;
+    AF_THROW(af_get_available_backends(&result));
+    return result;
+}
 
-    int devicecount() { return getDeviceCount(); }
+af::Backend getBackendId(const array &in) {
+    auto result = static_cast<af::Backend>(0);
+    AF_THROW(af_get_backend_id(&result, in.get()));
+    return result;
+}
 
-    void setDevice(const int device)
-    {
-        AF_THROW(af_set_device(device));
-    }
+int getDeviceId(const array &in) {
+    int device = getDevice();
+    AF_THROW(af_get_device_id(&device, in.get()));
+    return device;
+}
 
-    void deviceset(const int device) { setDevice(device); }
+af::Backend getActiveBackend() {
+    auto result = static_cast<af::Backend>(0);
+    AF_THROW(af_get_active_backend(&result));
+    return result;
+}
 
-    int getDevice()
-    {
-        int device = 0;
-        AF_THROW(af_get_device(&device));
-        return device;
-    }
+void info() { AF_THROW(af_info()); }
 
-    bool isDoubleAvailable(const int device)
-    {
-        bool temp;
-        AF_THROW(af_get_dbl_support(&temp, device));
-        return temp;
-    }
+const char *infoString(const bool verbose) {
+    char *str = NULL;
+    AF_THROW(af_info_string(&str, verbose));
+    return str;
+}
 
-    int deviceget() { return getDevice(); }
+void deviceprop(char *d_name, char *d_platform, char *d_toolkit,
+                char *d_compute) {
+    deviceInfo(d_name, d_platform, d_toolkit, d_compute);
+}
+void deviceInfo(char *d_name, char *d_platform, char *d_toolkit,
+                char *d_compute) {
+    AF_THROW(af_device_info(d_name, d_platform, d_toolkit, d_compute));
+}
 
-    void sync(int device)
-    {
-        AF_THROW(af_sync(device));
-    }
+int getDeviceCount() {
+    int devices = -1;
+    AF_THROW(af_get_device_count(&devices));
+    return devices;
+}
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Alloc and free host, pinned, zero copy
-    static unsigned size_of(af::dtype type)
-    {
-        switch(type) {
-        case f32: return sizeof(float);
-        case f64: return sizeof(double);
-        case s32: return sizeof(int);
-        case u32: return sizeof(unsigned);
-        case u8 : return sizeof(unsigned char);
-        case b8 : return sizeof(unsigned char);
-        case c32: return sizeof(float) * 2;
-        case c64: return sizeof(double) * 2;
-        default: return sizeof(float);
-        }
-    }
+int devicecount() { return getDeviceCount(); }
 
-    void *alloc(const size_t elements, const af::dtype type)
-    {
-        void *ptr;
-        AF_THROW(af_alloc_device(&ptr, elements * size_of(type)));
-        // FIXME: Add to map
-        return ptr;
-    }
+void setDevice(const int device) { AF_THROW(af_set_device(device)); }
 
-    void *pinned(const size_t elements, const af::dtype type)
-    {
-        void *ptr;
-        AF_THROW(af_alloc_pinned(&ptr, elements * size_of(type)));
-        // FIXME: Add to map
-        return ptr;
-    }
+void deviceset(const int device) { setDevice(device); }
 
-    void free(const void *ptr)
-    {
-        //FIXME: look up map and call the right free
-        AF_THROW(af_free_device((void *)ptr));
-    }
+int getDevice() {
+    int device = 0;
+    AF_THROW(af_get_device(&device));
+    return device;
+}
 
-    void freePinned(const void *ptr)
-    {
-        //FIXME: look up map and call the right free
-        AF_THROW(af_free_pinned((void *)ptr));
-    }
+bool isDoubleAvailable(const int device) {
+    bool temp;
+    AF_THROW(af_get_dbl_support(&temp, device));
+    return temp;
+}
 
-    void deviceGC()
-    {
-        AF_THROW(af_device_gc());
-    }
+bool isHalfAvailable(const int device) {
+    bool temp;
+    AF_THROW(af_get_half_support(&temp, device));
+    return temp;
+}
 
-    void deviceMemInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                       size_t *lock_bytes,  size_t *lock_buffers)
-    {
-        AF_THROW(af_device_mem_info(alloc_bytes, alloc_buffers,
-                                    lock_bytes,  lock_buffers));
-    }
+int deviceget() { return getDevice(); }
 
-    void setMemStepSize(const size_t step_bytes)
-    {
-        AF_THROW(af_set_mem_step_size(step_bytes));
-    }
+void sync(int device) { AF_THROW(af_sync(device)); }
 
-    size_t getMemStepSize()
-    {
-        size_t size_bytes = 0;
-        AF_THROW(af_get_mem_step_size(&size_bytes));
-        return size_bytes;
-    }
+// Alloc device memory
+void *alloc(const size_t elements, const af::dtype type) {
+    void *ptr;
+    AF_DEPRECATED_WARNINGS_OFF
+    AF_THROW(af_alloc_device(&ptr, elements * size_of(type)));
+    AF_DEPRECATED_WARNINGS_ON
+    // FIXME: Add to map
+    return ptr;
+}
 
-#define INSTANTIATE(T)                                                  \
-    template<> AFAPI                                                    \
-    T* alloc(const size_t elements)                                     \
-    {                                                                   \
-        return (T*)alloc(elements, (af::dtype)dtype_traits<T>::af_type); \
-    }                                                                   \
-    template<> AFAPI                                                    \
-    T* pinned(const size_t elements)                                    \
-    {                                                                   \
-        return (T*)pinned(elements, (af::dtype)dtype_traits<T>::af_type); \
-    }
+// Alloc device memory; the size is given in bytes
+void *allocV2(const size_t bytes) {
+    void *ptr;
+    AF_THROW(af_alloc_device_v2(&ptr, bytes));
+    return ptr;
+}
+
+// Alloc pinned memory
+void *pinned(const size_t elements, const af::dtype type) {
+    void *ptr;
+    AF_THROW(af_alloc_pinned(&ptr, elements * size_of(type)));
+    // FIXME: Add to map
+    return ptr;
+}
+
+void free(const void *ptr) {
+    // FIXME: look up map and call the right free
+    AF_DEPRECATED_WARNINGS_OFF
+    AF_THROW(af_free_device(const_cast<void *>(ptr)));
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+void freeV2(const void *ptr) {
+    AF_THROW(af_free_device_v2(const_cast<void *>(ptr)));
+}
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(unsigned)
-    INSTANTIATE(unsigned char)
-    INSTANTIATE(char)
+void freePinned(const void *ptr) {
+    // FIXME: look up map and call the right free
+    AF_THROW(af_free_pinned((void *)ptr));
+}
 
+void *allocHost(const size_t elements, const af::dtype type) {
+    void *ptr;
+    AF_THROW(af_alloc_host(&ptr, elements * size_of(type)));
+    return ptr;
 }
+
+void freeHost(const void *ptr) { AF_THROW(af_free_host((void *)ptr)); }
+
+void printMemInfo(const char *msg, const int device_id) {
+    AF_THROW(af_print_mem_info(msg, device_id));
+}
+
+void deviceGC() { AF_THROW(af_device_gc()); }
+
+void deviceMemInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                   size_t *lock_bytes, size_t *lock_buffers) {
+    AF_THROW(af_device_mem_info(alloc_bytes, alloc_buffers, lock_bytes,
+                                lock_buffers));
+}
+
+void setMemStepSize(const size_t step_bytes) {
+    AF_THROW(af_set_mem_step_size(step_bytes));
+}
+
+size_t getMemStepSize() {
+    size_t size_bytes = 0;
+    AF_THROW(af_get_mem_step_size(&size_bytes));
+    return size_bytes;
+}
+
+AF_DEPRECATED_WARNINGS_OFF
+#define INSTANTIATE(T)                                                        \
+    template<>                                                                \
+    AFAPI T *alloc(const size_t elements) {                                   \
+        return (T *)alloc(elements, (af::dtype)dtype_traits<T>::af_type);     \
+    }                                                                         \
+    template<>                                                                \
+    AFAPI T *pinned(const size_t elements) {                                  \
+        return (T *)pinned(elements, (af::dtype)dtype_traits<T>::af_type);    \
+    }                                                                         \
+    template<>                                                                \
+    AFAPI T *allocHost(const size_t elements) {                               \
+        return (T *)allocHost(elements, (af::dtype)dtype_traits<T>::af_type); \
+    }
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(unsigned)
+INSTANTIATE(signed char)
+INSTANTIATE(unsigned char)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(unsigned short)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+AF_DEPRECATED_WARNINGS_ON
+
+}  // namespace af
diff --git a/src/api/cpp/diff.cpp b/src/api/cpp/diff.cpp
index 4e59f7692f..aca205bd9f 100644
--- a/src/api/cpp/diff.cpp
+++ b/src/api/cpp/diff.cpp
@@ -7,23 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include "error.hpp"
 
-namespace af
-{
-    array diff1(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_diff1(&out, in.get(), dim));
-        return array(out);
-    }
+namespace af {
+array diff1(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_diff1(&out, in.get(), dim));
+    return array(out);
+}
 
-    array diff2(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_diff2(&out, in.get(), dim));
-        return array(out);
-    }
+array diff2(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_diff2(&out, in.get(), dim));
+    return array(out);
 }
+}  // namespace af
diff --git a/src/api/cpp/dog.cpp b/src/api/cpp/dog.cpp
new file mode 100644
index 0000000000..67e2d50e4f
--- /dev/null
+++ b/src/api/cpp/dog.cpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+array dog(const array& in, const int radius1, const int radius2) {
+    af_array temp = 0;
+    AF_THROW(af_dog(&temp, in.get(), radius1, radius2));
+    return array(temp);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/error.hpp b/src/api/cpp/error.hpp
index 7e4854cc0a..188f25b40b 100644
--- a/src/api/cpp/error.hpp
+++ b/src/api/cpp/error.hpp
@@ -7,17 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/defines.hpp>
+#include <af/device.h>
 #include <af/exception.h>
 
-#define AF_THROW(fn) do {                               \
-        af_err __err = fn;                              \
-        if (__err == AF_SUCCESS) break;                 \
-        throw af::exception(__FILE__, __LINE__, __err); \
-    } while(0)
+#define AF_THROW(fn)                                                          \
+    do {                                                                      \
+        af_err __err = fn;                                                    \
+        if (__err == AF_SUCCESS) break;                                       \
+        char *msg = NULL;                                                     \
+        af_get_last_error(&msg, NULL);                                        \
+        af::exception ex(msg, __AF_FUNC__, __AF_FILENAME__, __LINE__, __err); \
+        af_free_host(msg);                                                    \
+        throw std::move(ex);                                                  \
+    } while (0)
 
-#define AF_THROW_MSG(__msg, __err) do {                         \
-        if (__err == AF_SUCCESS) break;                         \
-        throw af::exception(__msg, __FILE__, __LINE__, __err);  \
-    } while(0);
-
-#define THROW(__err) throw af::exception(__FILE__, __LINE__, __err)
+#define AF_THROW_ERR(__msg, __err)                                         \
+    do {                                                                   \
+        throw af::exception(__msg, __AF_FUNC__, __AF_FILENAME__, __LINE__, \
+                            __err);                                        \
+    } while (0)
diff --git a/src/api/cpp/event.cpp b/src/api/cpp/event.cpp
new file mode 100644
index 0000000000..02d1e8fd73
--- /dev/null
+++ b/src/api/cpp/event.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/event.h>
+#include "error.hpp"
+
+namespace af {
+
+event::event() : e_{} { AF_THROW(af_create_event(&e_)); }
+
+event::event(af_event e) : e_(e) {}
+
+event::~event() {
+    // Destructors must not throw, so the error code is ignored
+    if (e_) { af_delete_event(e_); }
+}
+
+// NOLINTNEXTLINE(performance-noexcept-move-constructor) we can't change the API
+event::event(event&& other) : e_(other.e_) { other.e_ = 0; }
+
+// NOLINTNEXTLINE(performance-noexcept-move-constructor) we can't change the API
+event& event::operator=(event&& other) {
+    af_delete_event(this->e_);
+    this->e_ = other.e_;
+    other.e_ = 0;
+    return *this;
+}
+
+af_event event::get() const { return e_; }
+
+void event::mark() { AF_THROW(af_mark_event(e_)); }
+
+void event::enqueue() { AF_THROW(af_enqueue_wait_event(e_)); }
+
+void event::block() const { AF_THROW(af_block_event(e_)); }
+
+}  // namespace af
diff --git a/src/api/cpp/exampleFunction.cpp b/src/api/cpp/exampleFunction.cpp
index 5236e60fee..017950d0a8 100644
--- a/src/api/cpp/exampleFunction.cpp
+++ b/src/api/cpp/exampleFunction.cpp
@@ -7,27 +7,25 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>       // af::array class is declared here
+#include <af/array.h>  // af::array class is declared here
 
-#include <af/util.h>        // Include the header related to the function
+#include <af/util.h>  // Include the header related to the function
 
-#include "error.hpp"        // AF_THROW macro to use error code C-API
-                            // is going to return and throw corresponding
-                            // exceptions if call isn't a success
+#include "error.hpp"  // AF_THROW macro: checks the error code the C-API
+                      // call returns and throws the corresponding
+                      // exception if the call did not succeed
 
-namespace af
-{
+namespace af {
 
-array exampleFunction(const array& in, const af_someenum_t p)
-{
+array exampleFunction(const array& a, const af_someenum_t p) {
     // create a temporary af_array handle
     af_array temp = 0;
 
     // call C-API function
-    AF_THROW( af_example_function(&temp, in.get(), p) );
+    AF_THROW(af_example_function(&temp, a.get(), p));
 
     // array::get() returns af_array handle for the corresponding cpp af::array
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/exception.cpp b/src/api/cpp/exception.cpp
index 373ae29c55..45efcf6b6a 100644
--- a/src/api/cpp/exception.cpp
+++ b/src/api/cpp/exception.cpp
@@ -7,10 +7,10 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <string.h> // strncpy
-#include <stdio.h>
 #include <af/exception.h>
 #include <algorithm>
+#include <cstdio>
+#include <cstring>  // strncpy
 
 #ifdef OS_WIN
 #define snprintf _snprintf
@@ -18,34 +18,42 @@
 
 namespace af {
 
-exception::exception(): m_err(AF_ERR_UNKNOWN)
-{
+exception::exception() : m_msg{}, m_err(AF_ERR_UNKNOWN) {
     strncpy(m_msg, "unknown exception", sizeof(m_msg));
 }
 
-exception::exception(const char *msg): m_err(AF_ERR_UNKNOWN)
-{
-    strncpy(m_msg, msg, sizeof(m_msg));
-    m_msg[sizeof(m_msg)-1] = '\0';
+exception::exception(const char *msg) : m_msg{}, m_err(AF_ERR_UNKNOWN) {
+    strncpy(m_msg, msg, sizeof(m_msg) - 1);
+    m_msg[sizeof(m_msg) - 1] = '\0';
 }
 
-exception::exception(const char *file, unsigned line, af_err err): m_err(err)
-{
-    snprintf(m_msg, sizeof(m_msg) - 1,
-             "ArrayFire Exception(%d): %s\nIn %s:%u",
-             (int)err, af_err_to_string(err), file, line);
+exception::exception(const char *file, unsigned line, af_err err)
+    : m_msg{}, m_err(err) {
+    snprintf(m_msg, sizeof(m_msg) - 1, "ArrayFire Exception (%s:%d):\nIn %s:%u",
+             af_err_to_string(err), static_cast<int>(err), file, line);
 
-    m_msg[sizeof(m_msg)-1] = '\0';
+    m_msg[sizeof(m_msg) - 1] = '\0';
 }
 
-exception::exception(const char *msg, const char *file, unsigned line, af_err err): m_err(err)
-{
+exception::exception(const char *msg, const char *file, unsigned line,
+                     af_err err)
+    : m_msg{}, m_err(err) {
     snprintf(m_msg, sizeof(m_msg) - 1,
-             "ArrayFire Exception(%d): %s\nIn %s:%u",
-             (int)(err), msg, file, line);
+             "ArrayFire Exception (%s:%d):\n%s\nIn %s:%u",
+             af_err_to_string(err), static_cast<int>(err), msg, file, line);
 
-    m_msg[sizeof(m_msg)-1] = '\0';
+    m_msg[sizeof(m_msg) - 1] = '\0';
 }
 
+exception::exception(const char *msg, const char *func, const char *file,
+                     unsigned line, af_err err)
+    : m_msg{}, m_err(err) {
+    snprintf(m_msg, sizeof(m_msg) - 1,
+             "ArrayFire Exception (%s:%d):\n%s\nIn function %s\nIn file %s:%u",
+             af_err_to_string(err), static_cast<int>(err), msg, func, file,
+             line);
 
+    m_msg[sizeof(m_msg) - 1] = '\0';
 }
+
+}  // namespace af
diff --git a/src/api/cpp/fast.cpp b/src/api/cpp/fast.cpp
index 3ba553bc62..308635b4f4 100644
--- a/src/api/cpp/fast.cpp
+++ b/src/api/cpp/fast.cpp
@@ -7,21 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/vision.h>
 #include <af/array.h>
+#include <af/vision.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
 features fast(const array& in, const float thr, const unsigned arc_length,
-                const bool non_max, const float feature_ratio,
-                const unsigned edge)
-{
+              const bool non_max, const float feature_ratio,
+              const unsigned edge) {
     af_features temp;
-    AF_THROW(af_fast(&temp, in.get(), thr, arc_length,
-                     non_max, feature_ratio, edge));
+    AF_THROW(af_fast(&temp, in.get(), thr, arc_length, non_max, feature_ratio,
+                     edge));
     return features(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/features.cpp b/src/api/cpp/features.cpp
index be9e160028..9422c487e4 100644
--- a/src/api/cpp/features.cpp
+++ b/src/api/cpp/features.cpp
@@ -7,99 +7,93 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <af/array.h>
-#include <handle.hpp>
+#include <af/features.h>
 #include "error.hpp"
 
-namespace af
-{
+#include <utility>
 
-    features::features()
-    {
-        AF_THROW(af_create_features(&feat, 0));
-    }
+namespace af {
 
-    features::features(const size_t n)
-    {
-        AF_THROW(af_create_features(&feat, (int)n));
-    }
+features::features() : feat{} { AF_THROW(af_create_features(&feat, 0)); }
 
-    features::features(af_features f) : feat(f)
-    {
-    }
+features::features(const size_t n) : feat{} {
+    AF_THROW(af_create_features(&feat, (int)n));
+}
 
-    features& features::operator= (const features& other)
-    {
-        if (this != &other) {
-            AF_THROW(af_release_features(feat));
-            AF_THROW(af_retain_features(&feat, other.get()));
-        }
-        return *this;
-    }
+features::features(af_features f) : feat(f) {}
 
-    features::~features()
-    {
-        if(AF_SUCCESS != af_release_features(feat)) {
-            fprintf(stderr, "Error: Couldn't release af::features: %p\n", this);
-        }
-    }
+features::features(const features& other) {
+    if (this != &other) { AF_THROW(af_retain_features(&feat, other.get())); }
+}
 
-    size_t features::getNumFeatures() const
-    {
-        dim_t n = 0;
-        AF_THROW(af_get_features_num(&n, feat));
-        return n;
+features& features::operator=(const features& other) {
+    if (this != &other) {
+        AF_THROW(af_release_features(feat));
+        AF_THROW(af_retain_features(&feat, other.get()));
     }
+    return *this;
+}
 
-    array features::getX() const
-    {
-        af_array x = 0;
-        AF_THROW(af_get_features_xpos(&x, feat));
-        af_array tmp = 0;
-        AF_THROW(af_retain_array(&tmp, x));
-        return array(tmp);
-    }
+features::features(features&& other)
+    : feat(std::exchange(other.feat, nullptr)) {}
 
-    array features::getY() const
-    {
-        af_array y = 0;
-        AF_THROW(af_get_features_ypos(&y, feat));
-        af_array tmp = 0;
-        AF_THROW(af_retain_array(&tmp, y));
-        return array(tmp);
-    }
+features& features::operator=(features&& other) {
+    std::swap(feat, other.feat);
+    return *this;
+}
 
-    array features::getScore() const
-    {
-        af_array s = 0;
-        AF_THROW(af_get_features_score(&s, feat));
-        af_array tmp = 0;
-        AF_THROW(af_retain_array(&tmp, s));
-        return array(tmp);
-    }
+features::~features() {
+    // THOU SHALT NOT THROW IN DESTRUCTORS
+    if (feat) { af_release_features(feat); }
+}
 
-    array features::getOrientation() const
-    {
-        af_array ori = 0;
-        AF_THROW(af_get_features_orientation(&ori, feat));
-        af_array tmp = 0;
-        AF_THROW(af_retain_array(&tmp, ori));
-        return array(tmp);
-    }
+size_t features::getNumFeatures() const {
+    dim_t n = 0;
+    AF_THROW(af_get_features_num(&n, feat));
+    return n;
+}
 
-    array features::getSize() const
-    {
-        af_array s = 0;
-        AF_THROW(af_get_features_size(&s, feat));
-        af_array tmp = 0;
-        AF_THROW(af_retain_array(&tmp, s));
-        return array(tmp);
-    }
+array features::getX() const {
+    af_array x = 0;
+    AF_THROW(af_get_features_xpos(&x, feat));
+    af_array tmp = 0;
+    AF_THROW(af_retain_array(&tmp, x));
+    return array(tmp);
+}
 
-    af_features features::get() const
-    {
-        return feat;
-    }
+array features::getY() const {
+    af_array y = 0;
+    AF_THROW(af_get_features_ypos(&y, feat));
+    af_array tmp = 0;
+    AF_THROW(af_retain_array(&tmp, y));
+    return array(tmp);
+}
+
+array features::getScore() const {
+    af_array s = 0;
+    AF_THROW(af_get_features_score(&s, feat));
+    af_array tmp = 0;
+    AF_THROW(af_retain_array(&tmp, s));
+    return array(tmp);
+}
+
+array features::getOrientation() const {
+    af_array ori = 0;
+    AF_THROW(af_get_features_orientation(&ori, feat));
+    af_array tmp = 0;
+    AF_THROW(af_retain_array(&tmp, ori));
+    return array(tmp);
+}
+
+array features::getSize() const {
+    af_array s = 0;
+    AF_THROW(af_get_features_size(&s, feat));
+    af_array tmp = 0;
+    AF_THROW(af_retain_array(&tmp, s));
+    return array(tmp);
+}
+
+af_features features::get() const { return feat; }
 
-};
+}  // namespace af
diff --git a/src/api/cpp/fft.cpp b/src/api/cpp/fft.cpp
index dbc44063e9..dbce09f488 100644
--- a/src/api/cpp/fft.cpp
+++ b/src/api/cpp/fft.cpp
@@ -7,141 +7,261 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/signal.h>
 #include <af/array.h>
 #include <af/dim4.hpp>
+#include <af/signal.h>
 #include "error.hpp"
 
-namespace af
-{
-
+using af::array;
+using af::dim4;
 
-array fftNorm(const array& in, const double norm_factor, const dim_t odim0)
-{
+namespace af {
+array fftNorm(const array& in, const double norm_factor, const dim_t odim0) {
     af_array out = 0;
     AF_THROW(af_fft(&out, in.get(), norm_factor, odim0));
     return array(out);
 }
 
-array fft2Norm(const array& in, const double norm_factor, const dim_t odim0, const dim_t odim1)
-{
+array fft2Norm(const array& in, const double norm_factor, const dim_t odim0,
+               const dim_t odim1) {
     af_array out = 0;
     AF_THROW(af_fft2(&out, in.get(), norm_factor, odim0, odim1));
     return array(out);
 }
 
-array fft3Norm(const array& in, const double norm_factor, const dim_t odim0, const dim_t odim1, const dim_t odim2)
-{
+array fft3Norm(const array& in, const double norm_factor, const dim_t odim0,
+               const dim_t odim1, const dim_t odim2) {
     af_array out = 0;
     AF_THROW(af_fft3(&out, in.get(), norm_factor, odim0, odim1, odim2));
     return array(out);
 }
 
-array fft(const array& in, const dim_t odim0)
-{
+array fft(const array& in, const dim_t odim0) {
     return fftNorm(in, 1.0, odim0);
 }
 
-array fft2(const array& in, const dim_t odim0, const dim_t odim1)
-{
+array fft2(const array& in, const dim_t odim0, const dim_t odim1) {
     return fft2Norm(in, 1.0, odim0, odim1);
 }
 
-array fft3(const array& in, const dim_t odim0, const dim_t odim1, const dim_t odim2)
-{
+array fft3(const array& in, const dim_t odim0, const dim_t odim1,
+           const dim_t odim2) {
     return fft3Norm(in, 1.0, odim0, odim1, odim2);
 }
 
-array dft(const array& in, const double norm_factor, const dim4 outDims)
-{
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array dft(const array& in, const double norm_factor, const dim4 outDims) {
     array temp;
-    switch(in.dims().ndims()) {
+    switch (in.dims().ndims()) {
         case 1: temp = fftNorm(in, norm_factor, outDims[0]); break;
         case 2: temp = fft2Norm(in, norm_factor, outDims[0], outDims[1]); break;
-        case 3: temp = fft3Norm(in, norm_factor, outDims[0], outDims[1], outDims[2]); break;
+        case 3:
+            temp =
+                fft3Norm(in, norm_factor, outDims[0], outDims[1], outDims[2]);
+            break;
         default: AF_THROW(AF_ERR_NOT_SUPPORTED);
     }
     return temp;
 }
 
-array dft(const array& in, const dim4 outDims)
-{
-    return dft(in, 1.0, outDims);
-}
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array dft(const array& in, const dim4 outDims) { return dft(in, 1.0, outDims); }
 
-array dft(const array& in)
-{
-    return dft(in, 1.0, dim4(0,0,0,0));
-}
+array dft(const array& in) { return dft(in, 1.0, dim4(0, 0, 0, 0)); }
 
-array ifftNorm(const array& in, const double norm_factor, const dim_t odim0)
-{
+array ifftNorm(const array& in, const double norm_factor, const dim_t odim0) {
     af_array out = 0;
     AF_THROW(af_ifft(&out, in.get(), norm_factor, odim0));
     return array(out);
 }
 
-array ifft2Norm(const array& in, const double norm_factor, const dim_t odim0, const dim_t odim1)
-{
+array ifft2Norm(const array& in, const double norm_factor, const dim_t odim0,
+                const dim_t odim1) {
     af_array out = 0;
     AF_THROW(af_ifft2(&out, in.get(), norm_factor, odim0, odim1));
     return array(out);
 }
 
-array ifft3Norm(const array& in, const double norm_factor, const dim_t odim0, const dim_t odim1, const dim_t odim2)
-{
+array ifft3Norm(const array& in, const double norm_factor, const dim_t odim0,
+                const dim_t odim1, const dim_t odim2) {
     af_array out = 0;
     AF_THROW(af_ifft3(&out, in.get(), norm_factor, odim0, odim1, odim2));
     return array(out);
 }
 
-array ifft(const array& in, const dim_t odim0)
-{
-    const dim4 dims = in.dims();
-    dim_t dim0 = odim0==0 ? dims[0] : odim0;
-    double norm_factor = 1.0/dim0;
+array ifft(const array& in, const dim_t odim0) {
+    const dim4 dims    = in.dims();
+    dim_t dim0         = odim0 == 0 ? dims[0] : odim0;
+    double norm_factor = 1.0 / static_cast<double>(dim0);
     return ifftNorm(in, norm_factor, odim0);
 }
 
-array ifft2(const array& in, const dim_t odim0, const dim_t odim1)
-{
-    const dim4 dims = in.dims();
-    dim_t dim0 = odim0==0 ? dims[0] : odim0;
-    dim_t dim1 = odim1==0 ? dims[1] : odim1;
-    double norm_factor = 1.0/(dim0*dim1);
+array ifft2(const array& in, const dim_t odim0, const dim_t odim1) {
+    const dim4 dims    = in.dims();
+    dim_t dim0         = odim0 == 0 ? dims[0] : odim0;
+    dim_t dim1         = odim1 == 0 ? dims[1] : odim1;
+    double norm_factor = 1.0 / static_cast<double>(dim0 * dim1);
     return ifft2Norm(in, norm_factor, odim0, odim1);
 }
 
-array ifft3(const array& in, const dim_t odim0, const dim_t odim1, const dim_t odim2)
-{
-    const dim4 dims = in.dims();
-    dim_t dim0 = odim0==0 ? dims[0] : odim0;
-    dim_t dim1 = odim1==0 ? dims[1] : odim1;
-    dim_t dim2 = odim2==0 ? dims[2] : odim2;
-    double norm_factor = 1.0/(dim0*dim1*dim2);
+array ifft3(const array& in, const dim_t odim0, const dim_t odim1,
+            const dim_t odim2) {
+    const dim4 dims    = in.dims();
+    dim_t dim0         = odim0 == 0 ? dims[0] : odim0;
+    dim_t dim1         = odim1 == 0 ? dims[1] : odim1;
+    dim_t dim2         = odim2 == 0 ? dims[2] : odim2;
+    double norm_factor = 1.0 / static_cast<double>(dim0 * dim1 * dim2);
     return ifft3Norm(in, norm_factor, odim0, odim1, odim2);
 }
 
-array idft(const array& in, const double norm_factor, const dim4 outDims)
-{
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array idft(const array& in, const double norm_factor, const dim4 outDims) {
     array temp;
-    switch(in.dims().ndims()) {
-        case 1: temp =  ifftNorm(in, norm_factor, outDims[0]); break;
-        case 2: temp = ifft2Norm(in, norm_factor, outDims[0], outDims[1]); break;
-        case 3: temp = ifft3Norm(in, norm_factor, outDims[0], outDims[1], outDims[2]); break;
+    switch (in.dims().ndims()) {
+        case 1: temp = ifftNorm(in, norm_factor, outDims[0]); break;
+        case 2:
+            temp = ifft2Norm(in, norm_factor, outDims[0], outDims[1]);
+            break;
+        case 3:
+            temp =
+                ifft3Norm(in, norm_factor, outDims[0], outDims[1], outDims[2]);
+            break;
         default: AF_THROW(AF_ERR_NOT_SUPPORTED);
     }
     return temp;
 }
 
-array idft(const array& in, const dim4 outDims)
-{
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array idft(const array& in, const dim4 outDims) {
     return idft(in, 1.0, outDims);
 }
 
-array idft(const array& in)
-{
-    return idft(in, 1.0, dim4(0,0,0,0));
+array idft(const array& in) { return idft(in, 1.0, dim4(0, 0, 0, 0)); }
+
+void fftInPlace(array& in, const double norm_factor) {
+    AF_THROW(af_fft_inplace(in.get(), norm_factor));
+}
+
+void fft2InPlace(array& in, const double norm_factor) {
+    AF_THROW(af_fft2_inplace(in.get(), norm_factor));
+}
+
+void fft3InPlace(array& in, const double norm_factor) {
+    AF_THROW(af_fft3_inplace(in.get(), norm_factor));
+}
+
+void ifftInPlace(array& in, const double norm_factor) {
+    const dim4 dims = in.dims();
+    double norm     = norm_factor * (1.0 / static_cast<double>(dims[0]));
+    AF_THROW(af_ifft_inplace(in.get(), norm));
+}
+
+void ifft2InPlace(array& in, const double norm_factor) {
+    const dim4 dims = in.dims();
+    double norm = norm_factor * (1.0 / static_cast<double>(dims[0] * dims[1]));
+    AF_THROW(af_ifft2_inplace(in.get(), norm));
+}
+
+void ifft3InPlace(array& in, const double norm_factor) {
+    const dim4 dims = in.dims();
+    double norm =
+        norm_factor * (1.0 / static_cast<double>(dims[0] * dims[1] * dims[2]));
+    AF_THROW(af_ifft3_inplace(in.get(), norm));
+}
+
+template<>
+AFAPI array fftR2C<1>(const array& in, const dim4& dims,
+                      const double norm_factor) {
+    af_array res;
+    AF_THROW(af_fft_r2c(&res, in.get(), norm_factor == 0 ? 1.0 : norm_factor,
+                        dims[0]));
+    return array(res);
 }
 
+template<>
+AFAPI array fftR2C<2>(const array& in, const dim4& dims,
+                      const double norm_factor) {
+    af_array res;
+    AF_THROW(af_fft2_r2c(&res, in.get(), norm_factor == 0 ? 1.0 : norm_factor,
+                         dims[0], dims[1]));
+    return array(res);
+}
+
+template<>
+AFAPI array fftR2C<3>(const array& in, const dim4& dims,
+                      const double norm_factor) {
+    af_array res;
+    AF_THROW(af_fft3_r2c(&res, in.get(), norm_factor == 0 ? 1.0 : norm_factor,
+                         dims[0], dims[1], dims[2]));
+    return array(res);
+}
+
+inline dim_t getOrigDim(dim_t d, bool is_odd) {
+    return 2 * (d - 1) + (is_odd ? 1 : 0);
+}
+
+template<>
+AFAPI array fftC2R<1>(const array& in, const bool is_odd,
+                      const double norm_factor) {
+    double norm = norm_factor;
+
+    if (norm == 0) {
+        dim4 idims = in.dims();
+        dim_t dim0 = getOrigDim(idims[0], is_odd);
+        norm       = 1.0 / static_cast<double>(dim0);
+    }
+
+    af_array res;
+    AF_THROW(af_fft_c2r(&res, in.get(), norm, is_odd));
+    return array(res);
+}
+
+template<>
+AFAPI array fftC2R<2>(const array& in, const bool is_odd,
+                      const double norm_factor) {
+    double norm = norm_factor;
+
+    if (norm == 0) {
+        dim4 idims = in.dims();
+        dim_t dim0 = getOrigDim(idims[0], is_odd);
+        dim_t dim1 = idims[1];
+        norm       = 1.0 / static_cast<double>(dim0 * dim1);
+    }
+
+    af_array res;
+    AF_THROW(af_fft2_c2r(&res, in.get(), norm, is_odd));
+    return array(res);
+}
+
+template<>
+AFAPI array fftC2R<3>(const array& in, const bool is_odd,
+                      const double norm_factor) {
+    double norm = norm_factor;
+
+    if (norm == 0) {
+        dim4 idims = in.dims();
+        dim_t dim0 = getOrigDim(idims[0], is_odd);
+        dim_t dim1 = idims[1];
+        dim_t dim2 = idims[2];
+        norm       = 1.0 / static_cast<double>(dim0 * dim1 * dim2);
+    }
+
+    af_array res;
+    AF_THROW(af_fft3_c2r(&res, in.get(), norm, is_odd));
+    return array(res);
+}
+
+#define FFT_REAL(rank)                                                    \
+    template<>                                                            \
+    AFAPI array fftR2C<rank>(const array& in, const double norm_factor) { \
+        return fftR2C<rank>(in, in.dims(), norm_factor);                  \
+    }
+
+FFT_REAL(1)
+FFT_REAL(2)
+FFT_REAL(3)
+
+void setFFTPlanCacheSize(size_t cacheSize) {
+    AF_THROW(af_set_fft_plan_cache_size(cacheSize));
 }
+}  // namespace af
diff --git a/src/api/cpp/fftconvolve.cpp b/src/api/cpp/fftconvolve.cpp
index 3156922108..24f68b103b 100644
--- a/src/api/cpp/fftconvolve.cpp
+++ b/src/api/cpp/fftconvolve.cpp
@@ -7,46 +7,45 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/signal.h>
 #include <af/array.h>
-#include "error.hpp"
+#include <af/signal.h>
 #include <algorithm>
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array fftConvolve(const array& signal, const array& filter, const convMode mode)
-{
+array fftConvolve(const array& signal, const array& filter,
+                  const convMode mode) {
     unsigned sN = signal.numdims();
     unsigned fN = filter.numdims();
 
-    switch(std::min(sN,fN)) {
-        case 1:  return fftConvolve1(signal, filter, mode);
-        case 2:  return fftConvolve2(signal, filter, mode);
-        case 3:  return fftConvolve3(signal, filter, mode);
-        default: return fftConvolve3(signal, filter, mode);
+    switch (std::min(sN, fN)) {
+        case 1: return fftConvolve1(signal, filter, mode);
+        case 2: return fftConvolve2(signal, filter, mode);
+        default:
+        case 3: return fftConvolve3(signal, filter, mode);
     }
 }
 
-array fftConvolve1(const array& signal, const array& filter, const convMode mode)
-{
+array fftConvolve1(const array& signal, const array& filter,
+                   const convMode mode) {
     af_array out = 0;
     AF_THROW(af_fft_convolve1(&out, signal.get(), filter.get(), mode));
     return array(out);
 }
 
-array fftConvolve2(const array& signal, const array& filter, const convMode mode)
-{
+array fftConvolve2(const array& signal, const array& filter,
+                   const convMode mode) {
     af_array out = 0;
     AF_THROW(af_fft_convolve2(&out, signal.get(), filter.get(), mode));
     return array(out);
 }
 
-array fftConvolve3(const array& signal, const array& filter, const convMode mode)
-{
+array fftConvolve3(const array& signal, const array& filter,
+                   const convMode mode) {
     af_array out = 0;
     AF_THROW(af_fft_convolve3(&out, signal.get(), filter.get(), mode));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/filters.cpp b/src/api/cpp/filters.cpp
index 900bd4e19d..85246cad38 100644
--- a/src/api/cpp/filters.cpp
+++ b/src/api/cpp/filters.cpp
@@ -7,32 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
+#include <af/signal.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array medfilt(const array& in, const dim_t wind_length, const dim_t wind_width, const borderType edge_pad)
-{
+array medfilt(const array& in, const dim_t wind_length, const dim_t wind_width,
+              const borderType edge_pad) {
     af_array out = 0;
     AF_THROW(af_medfilt(&out, in.get(), wind_length, wind_width, edge_pad));
     return array(out);
 }
 
-array minfilt(const array& in, const dim_t wind_length, const dim_t wind_width, const borderType edge_pad)
-{
+array medfilt1(const array& in, const dim_t wind_width,
+               const borderType edge_pad) {
+    af_array out = 0;
+    AF_THROW(af_medfilt1(&out, in.get(), wind_width, edge_pad));
+    return array(out);
+}
+
+array medfilt2(const array& in, const dim_t wind_length, const dim_t wind_width,
+               const borderType edge_pad) {
+    af_array out = 0;
+    AF_THROW(af_medfilt2(&out, in.get(), wind_length, wind_width, edge_pad));
+    return array(out);
+}
+
+array minfilt(const array& in, const dim_t wind_length, const dim_t wind_width,
+              const borderType edge_pad) {
     af_array out = 0;
     AF_THROW(af_minfilt(&out, in.get(), wind_length, wind_width, edge_pad));
     return array(out);
 }
 
-array maxfilt(const array& in, const dim_t wind_length, const dim_t wind_width, const borderType edge_pad)
-{
+array maxfilt(const array& in, const dim_t wind_length, const dim_t wind_width,
+              const borderType edge_pad) {
     af_array out = 0;
     AF_THROW(af_maxfilt(&out, in.get(), wind_length, wind_width, edge_pad));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/gaussian_kernel.cpp b/src/api/cpp/gaussian_kernel.cpp
index 68fd96819e..1b7e837192 100644
--- a/src/api/cpp/gaussian_kernel.cpp
+++ b/src/api/cpp/gaussian_kernel.cpp
@@ -6,27 +6,25 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <af/dim4.hpp>
-#include <af/image.h>
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af{
-    array gaussianKernel(const int rows, const int cols,
-                         const double sig_r, const double sig_c)
-    {
-        af_array res;
-        AF_THROW(af_gaussian_kernel(&res, rows, cols, sig_r, sig_c));
-        return array(res);
-    }
-
-    // Compatible function
-    array gaussiankernel(const int rows, const int cols,
-                         const double sig_r, const double sig_c)
-    {
-        return gaussianKernel(rows, cols, sig_r, sig_c);
-    }
+namespace af {
+array gaussianKernel(const int rows, const int cols, const double sig_r,
+                     const double sig_c) {
+    af_array res;
+    AF_THROW(af_gaussian_kernel(&res, rows, cols, sig_r, sig_c));
+    return array(res);
+}
 
+// Lowercase alias of gaussianKernel, kept for backward compatibility
+array gaussiankernel(const int rows, const int cols, const double sig_r,
+                     const double sig_c) {
+    return gaussianKernel(rows, cols, sig_r, sig_c);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/gfor.cpp b/src/api/cpp/gfor.cpp
index 3918b39b33..51d36b3e12 100644
--- a/src/api/cpp/gfor.cpp
+++ b/src/api/cpp/gfor.cpp
@@ -7,37 +7,35 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/array.h>
 #include <af/defines.h>
 #include <af/dim4.hpp>
-#include <af/seq.h>
-#include <af/array.h>
 #include <af/gfor.h>
+#include <af/seq.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-    static bool gforStatus;
+thread_local bool gforStatus;
 
-    bool gforGet() { return gforStatus; }
-    void gforSet(bool val) { gforStatus = val; }
+bool gforGet() { return gforStatus; }
+void gforSet(bool val) { gforStatus = val; }
 
-    bool gforToggle()
-    {
-        bool status = gforGet();
-        status ^= 1;
-        gforSet(status);
-        return status;
-    }
+bool gforToggle() {
+    bool status = gforGet();
+    status ^= 1U;
+    gforSet(status);
+    return status;
+}
 
-    array batchFunc(const array &lhs, const array &rhs, batchFunc_t func)
-    {
-        if (gforGet()) AF_THROW_MSG("batchFunc can not be used inside GFOR",
-                                    AF_ERR_ARG);
-        gforSet(true);
-        array res = func(lhs, rhs);
-        gforSet(false);
-        return res;
+array batchFunc(const array &lhs, const array &rhs, batchFunc_t func) {
+    if (gforGet()) {
+        AF_THROW_ERR("batchFunc cannot be used inside GFOR", AF_ERR_ARG);
     }
-
+    gforSet(true);
+    // Clear the GFOR flag on every exit path, including when func throws.
+    struct Reset { ~Reset() { gforSet(false); } } reset;
+    return func(lhs, rhs);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/gradient.cpp b/src/api/cpp/gradient.cpp
index d4b21ec445..f38906dc8e 100644
--- a/src/api/cpp/gradient.cpp
+++ b/src/api/cpp/gradient.cpp
@@ -7,16 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include <utility>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-void grad(array &rows, array &cols, const array& in)
-{
+void grad(array &rows, array &cols, const array &in) {
     af_array rows_handle = 0;
     af_array cols_handle = 0;
     AF_THROW(af_gradient(&rows_handle, &cols_handle, in.get()));
@@ -24,4 +22,4 @@ void grad(array &rows, array &cols, const array& in)
     cols = array(cols_handle);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/graphics.cpp b/src/api/cpp/graphics.cpp
index 6e4f3aa559..c5f0ae2e20 100644
--- a/src/api/cpp/graphics.cpp
+++ b/src/api/cpp/graphics.cpp
@@ -8,95 +8,214 @@
  ********************************************************/
 
 #include <af/array.h>
+#include <af/data.h>
 #include <af/graphics.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-void Window::initWindow(const int width, const int height, const char* const title)
-{
+void Window::initWindow(const int width, const int height,
+                        const char* const title) {
     AF_THROW(af_create_window(&wnd, width, height, title));
 }
 
-Window::Window()
-    : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT)
-{
+Window::Window() : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT) {
     initWindow(1280, 720, "ArrayFire");
 }
 
 Window::Window(const char* const title)
-    : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT)
-{
+    : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT) {
     initWindow(1280, 720, title);
 }
 
 Window::Window(const int width, const int height, const char* const title)
-    : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT)
-{
+    : wnd(0), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT) {
     initWindow(width, height, title);
 }
 
 Window::Window(const af_window window)
-    : wnd(window), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT)
-{
-}
+    : wnd(window), _r(-1), _c(-1), _cmap(AF_COLORMAP_DEFAULT) {}
 
-Window::~Window()
-{
-    AF_THROW(af_destroy_window(wnd));
+Window::~Window() {
+    // Destructors must not throw, so the C API error code is ignored here.
+    if (wnd) { af_destroy_window(wnd); }
 }
 
-void Window::setPos(const unsigned x, const unsigned y)
-{
+// NOLINTNEXTLINE(readability-make-member-function-const)
+void Window::setPos(const unsigned x, const unsigned y) {
     AF_THROW(af_set_position(get(), x, y));
 }
 
-void Window::setTitle(const char* const title)
-{
+// NOLINTNEXTLINE(readability-make-member-function-const)
+void Window::setTitle(const char* const title) {
     AF_THROW(af_set_title(get(), title));
 }
 
-void Window::setColorMap(const ColorMap cmap)
-{
-    _cmap = cmap;
+// NOLINTNEXTLINE(readability-make-member-function-const)
+void Window::setSize(const unsigned w, const unsigned h) {
+    AF_THROW(af_set_size(get(), w, h));
 }
 
-void Window::image(const array& in, const char* const title)
-{
+void Window::setColorMap(const ColorMap cmap) { _cmap = cmap; }
+
+void Window::image(const array& in, const char* const title) {
     af_cell temp{_r, _c, title, _cmap};
     AF_THROW(af_draw_image(get(), in.get(), &temp));
 }
 
-void Window::plot(const array& X, const array& Y, const char* const title)
-{
+void Window::plot(const array& in, const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_plot_nd(get(), in.get(), &temp));
+}
+
+void Window::plot(const array& X, const array& Y, const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_plot_2d(get(), X.get(), Y.get(), &temp));
+}
+
+void Window::plot(const array& X, const array& Y, const array& Z,
+                  const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_plot_3d(get(), X.get(), Y.get(), Z.get(), &temp));
+}
+
+void Window::plot3(const array& P, const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    P.eval();
+    AF_THROW(af_draw_plot_nd(get(), P.get(), &temp));
+}
+
+void Window::scatter(const array& in, af::markerType marker,
+                     const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_scatter_nd(get(), in.get(), marker, &temp));
+}
+
+void Window::scatter(const array& X, const array& Y, af::markerType marker,
+                     const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_scatter_2d(get(), X.get(), Y.get(), marker, &temp));
+}
+
+void Window::scatter(const array& X, const array& Y, const array& Z,
+                     af::markerType marker, const char* const title) {
     af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
-    AF_THROW(af_draw_plot(get(), X.get(), Y.get(), &temp));
+    AF_THROW(
+        af_draw_scatter_3d(get(), X.get(), Y.get(), Z.get(), marker, &temp));
 }
 
-void Window::hist(const array& X, const double minval, const double maxval, const char* const title)
-{
+void Window::scatter3(const array& P, af::markerType marker,
+                      const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_scatter_nd(get(), P.get(), marker, &temp));
+}
+
+void Window::hist(const array& X, const double minval, const double maxval,
+                  const char* const title) {
     af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
     AF_THROW(af_draw_hist(get(), X.get(), minval, maxval, &temp));
 }
 
-void Window::grid(const int rows, const int cols)
-{
+void Window::surface(const array& S, const char* const title) {
+    af::array xVals = range(S.dims(0));
+    af::array yVals = range(S.dims(1));
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_surface(get(), xVals.get(), yVals.get(), S.get(), &temp));
+}
+
+void Window::surface(const array& xVals, const array& yVals, const array& S,
+                     const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_surface(get(), xVals.get(), yVals.get(), S.get(), &temp));
+}
+
+void Window::vectorField(const array& points, const array& directions,
+                         const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(
+        af_draw_vector_field_nd(get(), points.get(), directions.get(), &temp));
+}
+
+void Window::vectorField(const array& xPoints, const array& yPoints,
+                         const array& zPoints, const array& xDirs,
+                         const array& yDirs, const array& zDirs,
+                         const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_vector_field_3d(get(), xPoints.get(), yPoints.get(),
+                                     zPoints.get(), xDirs.get(), yDirs.get(),
+                                     zDirs.get(), &temp));
+}
+
+void Window::vectorField(const array& xPoints, const array& yPoints,
+                         const array& xDirs, const array& yDirs,
+                         const char* const title) {
+    af_cell temp{_r, _c, title, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_draw_vector_field_2d(get(), xPoints.get(), yPoints.get(),
+                                     xDirs.get(), yDirs.get(), &temp));
+}
+
+// NOLINTNEXTLINE(readability-make-member-function-const)
+void Window::grid(const int rows, const int cols) {
     AF_THROW(af_grid(get(), rows, cols));
 }
 
-void Window::show()
-{
+void Window::setAxesLimits(const array& x, const array& y, const bool exact) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_set_axes_limits_compute(get(), x.get(), y.get(), NULL, exact,
+                                        &temp));
+}
+
+void Window::setAxesLimits(const array& x, const array& y, const array& z,
+                           const bool exact) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_set_axes_limits_compute(get(), x.get(), y.get(), z.get(), exact,
+                                        &temp));
+}
+
+void Window::setAxesLimits(const float xmin, const float xmax, const float ymin,
+                           const float ymax, const bool exact) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(
+        af_set_axes_limits_2d(get(), xmin, xmax, ymin, ymax, exact, &temp));
+}
+
+void Window::setAxesLimits(const float xmin, const float xmax, const float ymin,
+                           const float ymax, const float zmin, const float zmax,
+                           const bool exact) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_set_axes_limits_3d(get(), xmin, xmax, ymin, ymax, zmin, zmax,
+                                   exact, &temp));
+}
+
+void Window::setAxesTitles(const char* const xtitle, const char* const ytitle,
+                           const char* const ztitle) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_set_axes_titles(get(), xtitle, ytitle, ztitle, &temp));
+}
+
+void Window::setAxesLabelFormat(const char* const xformat,
+                                const char* const yformat,
+                                const char* const zformat) {
+    af_cell temp{_r, _c, NULL, AF_COLORMAP_DEFAULT};
+    AF_THROW(af_set_axes_label_format(get(), xformat, yformat, zformat, &temp));
+}
+
+void Window::show() {
     AF_THROW(af_show(get()));
     _r = -1;
     _c = -1;
 }
 
-bool Window::close()
-{
+// NOLINTNEXTLINE(readability-make-member-function-const)
+bool Window::close() {
     bool temp = true;
     AF_THROW(af_is_window_closed(&temp, get()));
     return temp;
 }
 
+// NOLINTNEXTLINE(readability-make-member-function-const)
+void Window::setVisibility(const bool isVisible) {
+    AF_THROW(af_set_visibility(get(), isVisible));
 }
+
+}  // namespace af
diff --git a/src/api/cpp/hamming.cpp b/src/api/cpp/hamming.cpp
index f97c34c265..3f1e84dcea 100644
--- a/src/api/cpp/hamming.cpp
+++ b/src/api/cpp/hamming.cpp
@@ -7,22 +7,21 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/vision.h>
 #include <af/array.h>
+#include <af/vision.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-void hammingMatcher(array& idx, array& dist,
-                     const array& query, const array& train,
-                     const dim_t dist_dim, const unsigned n_dist)
-{
+void hammingMatcher(array& idx, array& dist, const array& query,
+                    const array& train, const dim_t dist_dim,
+                    const unsigned n_dist) {
     af_array temp_idx  = 0;
     af_array temp_dist = 0;
-    AF_THROW(af_hamming_matcher(&temp_idx, &temp_dist, query.get(), train.get(), dist_dim, n_dist));
+    AF_THROW(af_nearest_neighbour(&temp_idx, &temp_dist, query.get(),
+                                  train.get(), dist_dim, n_dist, AF_SHD));
     idx  = array(temp_idx);
     dist = array(temp_dist);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/harris.cpp b/src/api/cpp/harris.cpp
new file mode 100644
index 0000000000..a84cedfa44
--- /dev/null
+++ b/src/api/cpp/harris.cpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+features harris(const array& in, const unsigned max_corners,
+                const float min_response, const float sigma,
+                const unsigned block_size, const float k_thr) {
+    af_features temp;
+    AF_THROW(af_harris(&temp, in.get(), max_corners, min_response, sigma,
+                       block_size, k_thr));
+    return features(temp);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/histogram.cpp b/src/api/cpp/histogram.cpp
index 0bd7d90bf7..6f7db329bc 100644
--- a/src/api/cpp/histogram.cpp
+++ b/src/api/cpp/histogram.cpp
@@ -7,35 +7,36 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/algorithm.h>
-#include <af/compatible.h>
 #include <af/array.h>
+#include <af/compatible.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array histogram(const array &in, const unsigned nbins, const double minval, const double maxval)
-{
+array histogram(const array& in, const unsigned nbins, const double minval,
+                const double maxval) {
     af_array out = 0;
     AF_THROW(af_histogram(&out, in.get(), nbins, minval, maxval));
     return array(out);
 }
 
-array histogram(const array &in, const unsigned nbins)
-{
+array histogram(const array& in, const unsigned nbins) {
     af_array out = 0;
-    AF_THROW(af_histogram(&out, in.get(), nbins, min<double>(in), max<double>(in)));
+    if (in.numdims() == 0) { return in; }
+    AF_THROW(
+        af_histogram(&out, in.get(), nbins, min<double>(in), max<double>(in)));
     return array(out);
 }
 
-array histequal(const array& in, const array& hist) { return histEqual(in, hist); }
-array histEqual(const array& in, const array& hist)
-{
+array histequal(const array& in, const array& hist) {
+    return histEqual(in, hist);
+}
+array histEqual(const array& in, const array& hist) {
     af_array temp = 0;
     AF_THROW(af_hist_equal(&temp, in.get(), hist.get()));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/homography.cpp b/src/api/cpp/homography.cpp
new file mode 100644
index 0000000000..4df2f8da87
--- /dev/null
+++ b/src/api/cpp/homography.cpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+void homography(array &H, int &inliers, const array &x_src, const array &y_src,
+                const array &x_dst, const array &y_dst,
+                const af_homography_type htype, const float inlier_thr,
+                const unsigned iterations, const af::dtype otype) {
+    af_array outH;
+    AF_THROW(af_homography(&outH, &inliers, x_src.get(), y_src.get(),
+                           x_dst.get(), y_dst.get(), htype, inlier_thr,
+                           iterations, otype));
+
+    H = array(outH);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/hsv_rgb.cpp b/src/api/cpp/hsv_rgb.cpp
index e55442c832..b023e12013 100644
--- a/src/api/cpp/hsv_rgb.cpp
+++ b/src/api/cpp/hsv_rgb.cpp
@@ -1,31 +1,28 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
-
-    array hsv2rgb(const array& in)
-    {
-        af_array temp = 0;
-        AF_THROW(af_hsv2rgb(&temp, in.get()));
-        return array(temp);
-    }
+namespace af {
 
-    array rgb2hsv(const array& in)
-    {
-        af_array temp = 0;
-        AF_THROW(af_rgb2hsv(&temp, in.get()));
-        return array(temp);
-    }
+array hsv2rgb(const array& in) {
+    af_array temp = 0;
+    AF_THROW(af_hsv2rgb(&temp, in.get()));
+    return array(temp);
+}
 
+array rgb2hsv(const array& in) {
+    af_array temp = 0;
+    AF_THROW(af_rgb2hsv(&temp, in.get()));
+    return array(temp);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/iir.cpp b/src/api/cpp/iir.cpp
index 8ef1ab7c70..7071978c3b 100644
--- a/src/api/cpp/iir.cpp
+++ b/src/api/cpp/iir.cpp
@@ -7,26 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/signal.h>
 #include <af/array.h>
-#include "error.hpp"
+#include <af/signal.h>
 #include <algorithm>
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array fir(const array& b, const array& x)
-{
+array fir(const array& b, const array& x) {
     af_array out = 0;
     AF_THROW(af_fir(&out, b.get(), x.get()));
     return array(out);
 }
 
-array iir(const array &b, const array& a, const array& x)
-{
+array iir(const array& b, const array& a, const array& x) {
     af_array out = 0;
     AF_THROW(af_iir(&out, b.get(), a.get(), x.get()));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/imageio.cpp b/src/api/cpp/imageio.cpp
index a0647a00a5..21035ab3e6 100644
--- a/src/api/cpp/imageio.cpp
+++ b/src/api/cpp/imageio.cpp
@@ -7,34 +7,59 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
-#include <af/compatible.h>
 #include <af/array.h>
+#include <af/compatible.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array loadImage(const char* filename, const bool is_color)
-{
+array loadImage(const char* filename, const bool is_color) {
     af_array out = 0;
     AF_THROW(af_load_image(&out, filename, is_color));
     return array(out);
 }
 
-array loadimage(const char* filename, const bool is_color)
-{
+array loadImageMem(const void* ptr) {
+    af_array out = 0;
+    AF_THROW(af_load_image_memory(&out, ptr));
+    return array(out);
+}
+
+array loadimage(const char* filename, const bool is_color) {
     return loadImage(filename, is_color);
 }
 
-void saveImage(const char* filename, const array& in)
-{
+void saveImage(const char* filename, const array& in) {
     AF_THROW(af_save_image(filename, in.get()));
 }
 
-void saveimage(const char* filename, const array& in)
-{
+void* saveImageMem(const array& in, const imageFormat format) {
+    void* ptr = NULL;
+    AF_THROW(af_save_image_memory(&ptr, in.get(), format));
+    return ptr;
+}
+
+void saveimage(const char* filename, const array& in) {
     return saveImage(filename, in);
 }
 
+void deleteImageMem(void* ptr) { AF_THROW(af_delete_image_memory(ptr)); }
+
+array loadImageNative(const char* filename) {
+    af_array out = 0;
+    AF_THROW(af_load_image_native(&out, filename));
+    return array(out);
 }
+
+void saveImageNative(const char* filename, const array& in) {
+    AF_THROW(af_save_image_native(filename, in.get()));
+}
+
+bool isImageIOAvailable() {
+    bool out = false;
+    AF_THROW(af_is_image_io_available(&out));
+    return out;
+}
+
+}  // namespace af
diff --git a/src/api/cpp/index.cpp b/src/api/cpp/index.cpp
index 83ac93df1c..c2664432ef 100644
--- a/src/api/cpp/index.cpp
+++ b/src/api/cpp/index.cpp
@@ -7,74 +7,105 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/index.h>
-#include <af/array.h>
 #include <af/algorithm.h>
-#include "error.hpp"
+#include <af/array.h>
+#include <af/index.h>
 #include "common.hpp"
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array lookup(const array &in, const array &idx, const int dim)
-{
+array lookup(const array &in, const array &idx, const int dim) {
     af_array out = 0;
     AF_THROW(af_lookup(&out, in.get(), idx.get(), getFNSD(dim, in.dims())));
     return array(out);
 }
 
-index::index() {
+void copy(array &dst, const array &src, const index &idx0, const index &idx1,
+          const index &idx2, const index &idx3) {
+    unsigned nd = dst.numdims();
+
+    af_index_t indices[] = {idx0.get(), idx1.get(), idx2.get(), idx3.get()};
+
+    af_array lhs       = dst.get();
+    const af_array rhs = src.get();
+    AF_THROW(af_assign_gen(&lhs, lhs, nd, indices, rhs));
+}
+
+index::index() : impl{} {
     impl.idx.seq = af_span;
-    impl.isSeq = true;
+    impl.isSeq   = true;
     impl.isBatch = false;
 }
 
-index::index(const int idx) {
+index::index(const int idx) : impl{} {
     impl.idx.seq = af_make_seq(idx, idx, 1);
-    impl.isSeq = true;
+    impl.isSeq   = true;
     impl.isBatch = false;
 }
 
-index::index(const af::seq& s0) {
+index::index(const af::seq &s0) : impl{} {
     impl.idx.seq = s0.s;
-    impl.isSeq = true;
+    impl.isSeq   = true;
     impl.isBatch = s0.m_gfor;
 }
 
-index::index(const af_seq& s0) {
+index::index(const af_seq &s0) : impl{} {
     impl.idx.seq = s0;
-    impl.isSeq = true;
+    impl.isSeq   = true;
     impl.isBatch = false;
 }
 
-index::index(const af::array& idx0) {
-    array idx = idx0.isbool() ? where(idx0) : idx0;
+index::index(const af::array &idx0) : impl{} {
+    array idx    = idx0.isbool() ? where(idx0) : idx0;
     af_array arr = 0;
     AF_THROW(af_retain_array(&arr, idx.get()));
     impl.idx.arr = arr;
 
-    impl.isSeq = false;
+    impl.isSeq   = false;
     impl.isBatch = false;
 }
 
-index::~index() {
-    if (!impl.isSeq)
-        af_release_array(impl.idx.arr);
+index::index(const af::index &idx0) : impl{idx0.impl} {
+    if (!impl.isSeq && impl.idx.arr) {
+        // increment reference count to avoid double free
+        // when/if idx0 is destroyed
+        AF_THROW(af_retain_array(&impl.idx.arr, impl.idx.arr));
+    }
 }
 
+// NOLINTNEXTLINE(hicpp-noexcept-move, performance-noexcept-move-constructor)
+index::index(index &&idx0) : impl{idx0.impl} { idx0.impl.idx.arr = nullptr; }
 
-static bool operator==(const af_seq& lhs, const af_seq& rhs) {
-    return lhs.begin == rhs.begin && lhs.end == rhs.end && lhs.step == rhs.step;
+index::~index() {
+    if (!impl.isSeq && impl.idx.arr) { af_release_array(impl.idx.arr); }
 }
 
-bool index::isspan() const
-{
-    return impl.isSeq == true && impl.idx.seq == af_span;
+index &index::operator=(const index &idx0) {
+    if (this == &idx0) { return *this; }
+
+    impl = idx0.get();
+    if (!impl.isSeq && impl.idx.arr) {
+        // increment reference count to avoid double free
+        // when/if idx0 is destroyed
+        AF_THROW(af_retain_array(&impl.idx.arr, impl.idx.arr));
+    }
+    return *this;
 }
 
-const af_index_t& index::get() const
-{
-    return impl;
+// NOLINTNEXTLINE(hicpp-noexcept-move, performance-noexcept-move-constructor)
+index &index::operator=(index &&idx0) {
+    impl              = idx0.impl;
+    idx0.impl.idx.arr = nullptr;
+    return *this;
 }
 
+static bool operator==(const af_seq &lhs, const af_seq &rhs) {
+    return lhs.begin == rhs.begin && lhs.end == rhs.end && lhs.step == rhs.step;
 }
+
+bool index::isspan() const { return impl.isSeq && impl.idx.seq == af_span; }
+
+const af_index_t &index::get() const { return impl; }
+
+}  // namespace af
diff --git a/src/api/cpp/internal.cpp b/src/api/cpp/internal.cpp
new file mode 100644
index 0000000000..e6760b7fe7
--- /dev/null
+++ b/src/api/cpp/internal.cpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/internal.h>
+#include "error.hpp"
+
+namespace af {
+array createStridedArray(
+    const void *data, const dim_t offset,
+    const dim4 dims,     // NOLINT(performance-unnecessary-value-param)
+    const dim4 strides,  // NOLINT(performance-unnecessary-value-param)
+    const af::dtype ty, const af::source location) {
+    af_array res;
+    AF_THROW(af_create_strided_array(&res, data, offset, dims.ndims(),
+                                     dims.get(), strides.get(), ty, location));
+    return array(res);
+}
+
+dim4 getStrides(const array &in) {
+    dim_t s0, s1, s2, s3;
+    AF_THROW(af_get_strides(&s0, &s1, &s2, &s3, in.get()));
+    return dim4(s0, s1, s2, s3);
+}
+
+dim_t getOffset(const array &in) {
+    dim_t offset;
+    AF_THROW(af_get_offset(&offset, in.get()));
+    return offset;
+}
+
+void *getRawPtr(const array &in) {
+    void *ptr = NULL;
+    AF_THROW(af_get_raw_ptr(&ptr, in.get()));
+    return ptr;
+}
+
+bool isLinear(const array &in) {
+    bool is_linear = false;
+    AF_THROW(af_is_linear(&is_linear, in.get()));
+    return is_linear;
+}
+
+bool isOwner(const array &in) {
+    bool is_owner = false;
+    AF_THROW(af_is_owner(&is_owner, in.get()));
+    return is_owner;
+}
+
+}  // namespace af
diff --git a/src/api/cpp/jit_test_api.cpp b/src/api/cpp/jit_test_api.cpp
new file mode 100644
index 0000000000..bc6930dc04
--- /dev/null
+++ b/src/api/cpp/jit_test_api.cpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <jit_test_api.h>
+#include "error.hpp"
+
+namespace af {
+int getMaxJitLen(void) {
+    int retVal = 0;
+    AF_THROW(af_get_max_jit_len(&retVal));
+    return retVal;
+}
+
+void setMaxJitLen(const int jitLen) { AF_THROW(af_set_max_jit_len(jitLen)); }
+}  // namespace af
diff --git a/src/api/cpp/lapack.cpp b/src/api/cpp/lapack.cpp
index 139ae0bef3..3556eaf46f 100644
--- a/src/api/cpp/lapack.cpp
+++ b/src/api/cpp/lapack.cpp
@@ -11,128 +11,140 @@
 #include <af/lapack.h>
 #include "error.hpp"
 
-namespace af
-{
-    void lu(array &out, array &pivot, const array &in, const bool is_lapack_piv)
-    {
-        out = in.copy();
-        af_array p = 0;
-        AF_THROW(af_lu_inplace(&p, out.get(), is_lapack_piv));
-        pivot = array(p);
-    }
+namespace af {
+void svd(array &u, array &s, array &vt, const array &in) {
+    af_array sl = 0, ul = 0, vtl = 0;
+    AF_THROW(af_svd(&ul, &sl, &vtl, in.get()));
+    s  = array(sl);
+    u  = array(ul);
+    vt = array(vtl);
+}
 
-    void lu(array &lower, array &upper, array &pivot, const array &in)
-    {
-        af_array l = 0, u = 0, p = 0;
-        AF_THROW(af_lu(&l, &u, &p, in.get()));
-        lower = array(l);
-        upper = array(u);
-        pivot = array(p);
-    }
+void svdInPlace(array &u, array &s, array &vt, array &in) {
+    af_array sl = 0, ul = 0, vtl = 0;
+    AF_THROW(af_svd_inplace(&ul, &sl, &vtl, in.get()));
+    s  = array(sl);
+    u  = array(ul);
+    vt = array(vtl);
+}
 
-    void luInPlace(array &pivot, array &in, const bool is_lapack_piv)
-    {
-        af_array p = 0;
-        AF_THROW(af_lu_inplace(&p, in.get(), is_lapack_piv));
-        pivot = array(p);
-    }
+void lu(array &out, array &pivot, const array &in, const bool is_lapack_piv) {
+    out        = in.copy();
+    af_array p = 0;
+    AF_THROW(af_lu_inplace(&p, out.get(), is_lapack_piv));
+    pivot = array(p);
+}
 
-    void qr(array &out, array &tau, const array &in)
-    {
-        out = in.copy();
-        af_array t = 0;
-        AF_THROW(af_qr_inplace(&t, out.get()));
-        tau = array(t);
-    }
+void lu(array &lower, array &upper, array &pivot, const array &in) {
+    af_array l = 0, u = 0, p = 0;
+    AF_THROW(af_lu(&l, &u, &p, in.get()));
+    lower = array(l);
+    upper = array(u);
+    pivot = array(p);
+}
 
-    void qr(array &q, array &r, array &tau, const array &in)
-    {
-        af_array q_ = 0, r_ = 0, t_ = 0;
-        AF_THROW(af_qr(&q_, &r_, &t_, in.get()));
-        q = array(q_);
-        r = array(r_);
-        tau = array(t_);
-    }
+void luInPlace(array &pivot, array &in, const bool is_lapack_piv) {
+    af_array p = 0;
+    AF_THROW(af_lu_inplace(&p, in.get(), is_lapack_piv));
+    pivot = array(p);
+}
 
-    void qrInPlace(array &tau, array &in)
-    {
-        af_array t = 0;
-        AF_THROW(af_qr_inplace(&t, in.get()));
-        tau = array(t);
-    }
+void qr(array &out, array &tau, const array &in) {
+    out        = in.copy();
+    af_array t = 0;
+    AF_THROW(af_qr_inplace(&t, out.get()));
+    tau = array(t);
+}
 
-    int cholesky(array &out, const array &in, const bool is_upper)
-    {
-        int info = 0;
-        af_array res;
-        AF_THROW(af_cholesky(&res, &info, in.get(), is_upper));
-        out = array(res);
-        return info;
-    }
+void qr(array &q, array &r, array &tau, const array &in) {
+    af_array q_ = 0, r_ = 0, t_ = 0;
+    AF_THROW(af_qr(&q_, &r_, &t_, in.get()));
+    q   = array(q_);
+    r   = array(r_);
+    tau = array(t_);
+}
 
-    int choleskyInPlace(array &in, const bool is_upper)
-    {
-        int info = 0;
-        AF_THROW(af_cholesky_inplace(&info, in.get(), is_upper));
-        return info;
-    }
+void qrInPlace(array &tau, array &in) {
+    af_array t = 0;
+    AF_THROW(af_qr_inplace(&t, in.get()));
+    tau = array(t);
+}
 
-    array solve(const array &a, const array &b, const matProp options)
-    {
-        af_array out;
-        AF_THROW(af_solve(&out, a.get(), b.get(), options));
-        return array(out);
-    }
+int cholesky(array &out, const array &in, const bool is_upper) {
+    int info = 0;
+    af_array res;
+    AF_THROW(af_cholesky(&res, &info, in.get(), is_upper));
+    out = array(res);
+    return info;
+}
 
-    array solveLU(const array &a, const array &piv,
-                  const array &b, const matProp options)
-    {
-        af_array out;
-        AF_THROW(af_solve_lu(&out, a.get(), piv.get(), b.get(), options));
-        return array(out);
-    }
+int choleskyInPlace(array &in, const bool is_upper) {
+    int info = 0;
+    AF_THROW(af_cholesky_inplace(&info, in.get(), is_upper));
+    return info;
+}
 
-    array inverse(const array &in, const matProp options)
-    {
-        af_array out;
-        AF_THROW(af_inverse(&out, in.get(), options));
-        return array(out);
-    }
+array solve(const array &a, const array &b, const matProp options) {
+    af_array out;
+    AF_THROW(af_solve(&out, a.get(), b.get(), options));
+    return array(out);
+}
 
-    unsigned rank(const array &in, const double tol)
-    {
-        unsigned r = 0;
-        AF_THROW(af_rank(&r, in.get(), tol));
-        return r;
-    }
+array solveLU(const array &a, const array &piv, const array &b,
+              const matProp options) {
+    af_array out;
+    AF_THROW(af_solve_lu(&out, a.get(), piv.get(), b.get(), options));
+    return array(out);
+}
+
+array inverse(const array &in, const matProp options) {
+    af_array out;
+    AF_THROW(af_inverse(&out, in.get(), options));
+    return array(out);
+}
 
-#define INSTANTIATE_DET(TR, TC)                     \
-    template<> AFAPI                                \
-    TR det(const array &in)                         \
-    {                                               \
-        double real;                                \
-        double imag;                                \
-        AF_THROW(af_det(&real, &imag, in.get()));   \
-        return real;                                \
-    }                                               \
-    template<> AFAPI                                \
-    TC det(const array &in)                         \
-    {                                               \
-        double real;                                \
-        double imag;                                \
-        AF_THROW(af_det(&real, &imag, in.get()));   \
-        TC out((TR)real, (TR)imag);                 \
-        return out;                                 \
-    }                                               \
-
-    INSTANTIATE_DET(float, af_cfloat)
-    INSTANTIATE_DET(double, af_cdouble)
-
-    double norm(const array &in, const normType type,
-                const double p, const double q)
-    {
-        double out;
-        AF_THROW(af_norm(&out, in.get(), type, p, q));
-        return out;
+array pinverse(const array &in, const double tol, const matProp options) {
+    af_array out;
+    AF_THROW(af_pinverse(&out, in.get(), tol, options));
+    return array(out);
+}
+
+unsigned rank(const array &in, const double tol) {
+    unsigned r = 0;
+    AF_THROW(af_rank(&r, in.get(), tol));
+    return r;
+}
+
+#define INSTANTIATE_DET(TR, TC)                   \
+    template<>                                    \
+    AFAPI TR det(const array &in) {               \
+        double real;                              \
+        double imag;                              \
+        AF_THROW(af_det(&real, &imag, in.get())); \
+        return real;                              \
+    }                                             \
+    template<>                                    \
+    AFAPI TC det(const array &in) {               \
+        double real;                              \
+        double imag;                              \
+        AF_THROW(af_det(&real, &imag, in.get())); \
+        TC out((TR)real, (TR)imag);               \
+        return out;                               \
     }
+
+INSTANTIATE_DET(float, af_cfloat)
+INSTANTIATE_DET(double, af_cdouble)
+
+double norm(const array &in, const normType type, const double p,
+            const double q) {
+    double out;
+    AF_THROW(af_norm(&out, in.get(), type, p, q));
+    return out;
+}
+
+bool isLAPACKAvailable() {
+    bool out = false;
+    AF_THROW(af_is_lapack_available(&out));
+    return out;
 }
+}  // namespace af
diff --git a/src/api/cpp/matchTemplate.cpp b/src/api/cpp/matchTemplate.cpp
index c16068736a..e53cd89113 100644
--- a/src/api/cpp/matchTemplate.cpp
+++ b/src/api/cpp/matchTemplate.cpp
@@ -7,18 +7,18 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/vision.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array matchTemplate(const array &searchImg, const array &templateImg, const matchType mType)
-{
+array matchTemplate(const array &searchImg, const array &templateImg,
+                    const matchType mType) {
     af_array out = 0;
-    AF_THROW(af_match_template(&out, searchImg.get(), templateImg.get(), mType));
+    AF_THROW(
+        af_match_template(&out, searchImg.get(), templateImg.get(), mType));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/mean.cpp b/src/api/cpp/mean.cpp
index e50198ca83..61693ca40d 100644
--- a/src/api/cpp/mean.cpp
+++ b/src/api/cpp/mean.cpp
@@ -7,71 +7,73 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/statistics.h>
 #include <af/algorithm.h>
 #include <af/array.h>
-#include "error.hpp"
+#include <af/dim4.hpp>
+#include <af/statistics.h>
 #include "common.hpp"
+#include "error.hpp"
+#include "half.hpp"
+#ifdef AF_CUDA
+#include <cuda_fp16.h>
+#include <traits.hpp>
+#endif
 
-namespace af
-{
+namespace af {
 
-array mean(const array &in, const dim_t dim)
-{
+array mean(const array& in, const dim_t dim) {
     af_array temp = 0;
     AF_THROW(af_mean(&temp, in.get(), getFNSD(dim, in.dims())));
     return array(temp);
 }
 
-array mean(const array &in, const array &weights, const dim_t dim)
-{
+array mean(const array& in, const array& weights, const dim_t dim) {
     af_array temp = 0;
-    AF_THROW(af_mean_weighted(&temp, in.get(), weights.get(), getFNSD(dim, in.dims())));
+    AF_THROW(af_mean_weighted(&temp, in.get(), weights.get(),
+                              getFNSD(dim, in.dims())));
     return array(temp);
 }
 
-#define INSTANTIATE_MEAN(T)                                     \
-    template<> AFAPI T mean(const array& in)                    \
-    {                                                           \
-        double ret_val;                                         \
-        AF_THROW(af_mean_all(&ret_val, NULL, in.get()));        \
-        return (T)ret_val;                                      \
-    }                                                           \
-    template<> AFAPI T mean(const array& in, const array& wts)  \
-    {                                                           \
-        double ret_val;                                         \
-        AF_THROW(af_mean_all_weighted(&ret_val, NULL,           \
-                    in.get(), wts.get()));                      \
-        return (T)ret_val;                                      \
-    }                                                           \
+#define INSTANTIATE_MEAN(T)                                                  \
+    template<>                                                               \
+    AFAPI T mean(const array& in) {                                          \
+        double ret_val;                                                      \
+        AF_THROW(af_mean_all(&ret_val, NULL, in.get()));                     \
+        return cast<T>(ret_val);                                             \
+    }                                                                        \
+    template<>                                                               \
+    AFAPI T mean(const array& in, const array& wts) {                        \
+        double ret_val;                                                      \
+        AF_THROW(af_mean_all_weighted(&ret_val, NULL, in.get(), wts.get())); \
+        return cast<T>(ret_val);                                             \
+    }
 
-template<> AFAPI af_cfloat mean(const array& in)
-{
+template<>
+AFAPI af_cfloat mean(const array& in) {
     double real, imag;
     AF_THROW(af_mean_all(&real, &imag, in.get()));
-    return std::complex<float>((float)real, (float)imag);
+    return {static_cast<float>(real), static_cast<float>(imag)};
 }
 
-template<> AFAPI af_cdouble mean(const array& in)
-{
+template<>
+AFAPI af_cdouble mean(const array& in) {
     double real, imag;
     AF_THROW(af_mean_all(&real, &imag, in.get()));
-    return std::complex<double>(real, imag);
+    return {real, imag};
 }
 
-template<> AFAPI af_cfloat mean(const array& in, const array& weights)
-{
+template<>
+AFAPI af_cfloat mean(const array& in, const array& weights) {
     double real, imag;
     AF_THROW(af_mean_all_weighted(&real, &imag, in.get(), weights.get()));
-    return std::complex<float>((float)real, (float)imag);
+    return {static_cast<float>(real), static_cast<float>(imag)};
 }
 
-template<> AFAPI af_cdouble mean(const array& in, const array& weights)
-{
+template<>
+AFAPI af_cdouble mean(const array& in, const array& weights) {
     double real, imag;
     AF_THROW(af_mean_all_weighted(&real, &imag, in.get(), weights.get()));
-    return std::complex<double>(real, imag);
+    return {real, imag};
 }
 
 INSTANTIATE_MEAN(float);
@@ -79,8 +81,18 @@ INSTANTIATE_MEAN(double);
 INSTANTIATE_MEAN(int);
 INSTANTIATE_MEAN(unsigned int);
 INSTANTIATE_MEAN(char);
+INSTANTIATE_MEAN(signed char);
 INSTANTIATE_MEAN(unsigned char);
+INSTANTIATE_MEAN(long long);
+INSTANTIATE_MEAN(unsigned long long);
+INSTANTIATE_MEAN(short);
+INSTANTIATE_MEAN(unsigned short);
+INSTANTIATE_MEAN(af_half);
+INSTANTIATE_MEAN(half_float::half);  // Add support for public API
+#ifdef AF_CUDA
+INSTANTIATE_MEAN(__half);
+#endif
 
 #undef INSTANTIATE_MEAN
 
-}
+}  // namespace af
diff --git a/src/api/cpp/meanshift.cpp b/src/api/cpp/meanshift.cpp
index c83e011958..03a786bbcf 100644
--- a/src/api/cpp/meanshift.cpp
+++ b/src/api/cpp/meanshift.cpp
@@ -7,18 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array meanShift(const array& in, const float spatial_sigma, const float chromatic_sigma, const unsigned iter, const bool is_color)
-{
+array meanShift(const array& in, const float spatial_sigma,
+                const float chromatic_sigma, const unsigned iter,
+                const bool is_color) {
     af_array out = 0;
-    AF_THROW(af_mean_shift(&out, in.get(), spatial_sigma, chromatic_sigma, iter, is_color));
+    AF_THROW(af_mean_shift(&out, in.get(), spatial_sigma, chromatic_sigma, iter,
+                           is_color));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/meanvar.cpp b/src/api/cpp/meanvar.cpp
new file mode 100644
index 0000000000..d62499bd32
--- /dev/null
+++ b/src/api/cpp/meanvar.cpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/statistics.h>
+#include "error.hpp"
+
+using af::array;
+
+namespace af {
+void meanvar(array& mean, array& var, const array& in, const array& weights,
+             const af_var_bias bias, const dim_t dim) {
+    af_array mean_ = mean.get();
+    af_array var_  = var.get();
+    AF_THROW(af_meanvar(&mean_, &var_, in.get(), weights.get(), bias, dim));
+    mean.set(mean_);
+    var.set(var_);
+}
+}  // namespace af
diff --git a/src/api/cpp/median.cpp b/src/api/cpp/median.cpp
index 0528b5ba6d..b288df74a9 100644
--- a/src/api/cpp/median.cpp
+++ b/src/api/cpp/median.cpp
@@ -7,36 +7,39 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/statistics.h>
 #include <af/array.h>
-#include "error.hpp"
+#include <af/statistics.h>
 #include "common.hpp"
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-#define INSTANTIATE_MEDIAN(T)                               \
-    template<> AFAPI T median(const array& in)              \
-    {                                                       \
-        double ret_val;                                     \
-        AF_THROW(af_median_all(&ret_val, NULL, in.get()));  \
-        return (T)ret_val;                                  \
-    }                                                       \
+#define INSTANTIATE_MEDIAN(T)                              \
+    template<>                                             \
+    AFAPI T median(const array& in) {                      \
+        double ret_val;                                    \
+        AF_THROW(af_median_all(&ret_val, NULL, in.get())); \
+        return (T)ret_val;                                 \
+    }
 
 INSTANTIATE_MEDIAN(float);
 INSTANTIATE_MEDIAN(double);
 INSTANTIATE_MEDIAN(int);
 INSTANTIATE_MEDIAN(unsigned int);
 INSTANTIATE_MEDIAN(char);
+INSTANTIATE_MEDIAN(signed char);
 INSTANTIATE_MEDIAN(unsigned char);
+INSTANTIATE_MEDIAN(long long);
+INSTANTIATE_MEDIAN(unsigned long long);
+INSTANTIATE_MEDIAN(short);
+INSTANTIATE_MEDIAN(unsigned short);
 
 #undef INSTANTIATE_MEDIAN
 
-AFAPI array median(const array& in, const dim_t dim)
-{
+array median(const array& in, const dim_t dim) {
     af_array temp = 0;
     AF_THROW(af_median(&temp, in.get(), getFNSD(dim, in.dims())));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/moments.cpp b/src/api/cpp/moments.cpp
new file mode 100644
index 0000000000..3e28baf964
--- /dev/null
+++ b/src/api/cpp/moments.cpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+
+array moments(const array& in, const af_moment_type moment) {
+    af_array out = 0;
+    AF_THROW(af_moments(&out, in.get(), moment));
+    return array(out);
+}
+
+void moments(double* out, const array& in, const af_moment_type moment) {
+    AF_THROW(af_moments_all(out, in.get(), moment));
+}
+
+}  // namespace af
diff --git a/src/api/cpp/morph.cpp b/src/api/cpp/morph.cpp
index cda589e47d..8b6033ac80 100644
--- a/src/api/cpp/morph.cpp
+++ b/src/api/cpp/morph.cpp
@@ -7,39 +7,34 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array dilate(const array& in, const array& mask)
-{
+array dilate(const array& in, const array& mask) {
     af_array out = 0;
     AF_THROW(af_dilate(&out, in.get(), mask.get()));
     return array(out);
 }
 
-array dilate3(const array& in, const array& mask)
-{
+array dilate3(const array& in, const array& mask) {
     af_array out = 0;
     AF_THROW(af_dilate3(&out, in.get(), mask.get()));
     return array(out);
 }
 
-array erode(const array& in, const array& mask)
-{
+array erode(const array& in, const array& mask) {
     af_array out = 0;
     AF_THROW(af_erode(&out, in.get(), mask.get()));
     return array(out);
 }
 
-array erode3(const array& in, const array& mask)
-{
+array erode3(const array& in, const array& mask) {
     af_array out = 0;
     AF_THROW(af_erode3(&out, in.get(), mask.get()));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/nearest_neighbour.cpp b/src/api/cpp/nearest_neighbour.cpp
new file mode 100644
index 0000000000..2c17f2c62c
--- /dev/null
+++ b/src/api/cpp/nearest_neighbour.cpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+void nearestNeighbour(array& idx, array& dist, const array& query,
+                      const array& train, const dim_t dist_dim,
+                      const unsigned n_dist, const af_match_type dist_type) {
+    af_array temp_idx  = 0;
+    af_array temp_dist = 0;
+    AF_THROW(af_nearest_neighbour(&temp_idx, &temp_dist, query.get(),
+                                  train.get(), dist_dim, n_dist, dist_type));
+    idx  = array(temp_idx);
+    dist = array(temp_dist);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/orb.cpp b/src/api/cpp/orb.cpp
index 8a8e230ba4..3c9447d9b2 100644
--- a/src/api/cpp/orb.cpp
+++ b/src/api/cpp/orb.cpp
@@ -7,27 +7,24 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/vision.h>
 #include <af/array.h>
+#include <af/vision.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-void orb(features& feat, array& desc, const array& in,
-         const float fast_thr, const unsigned max_feat,
-         const float scl_fctr, const unsigned levels,
-         const bool blur_img)
-{
+void orb(features& feat, array& desc, const array& in, const float fast_thr,
+         const unsigned max_feat, const float scl_fctr, const unsigned levels,
+         const bool blur_img) {
     af_features temp_feat;
     af_array temp_desc = 0;
-    AF_THROW(af_orb(&temp_feat, &temp_desc, in.get(), fast_thr,
-                    max_feat, scl_fctr, levels, blur_img));
+    AF_THROW(af_orb(&temp_feat, &temp_desc, in.get(), fast_thr, max_feat,
+                    scl_fctr, levels, blur_img));
 
     dim_t num = 0;
-    AF_THROW(af_get_features_num(&num,  temp_feat));
+    AF_THROW(af_get_features_num(&num, temp_feat));
     feat = features(temp_feat);
     desc = array(temp_desc);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/random.cpp b/src/api/cpp/random.cpp
new file mode 100644
index 0000000000..821f5c70fe
--- /dev/null
+++ b/src/api/cpp/random.cpp
@@ -0,0 +1,140 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/random.h>
+#include "error.hpp"
+
+namespace af {
+randomEngine::randomEngine(randomEngineType type, unsigned long long seed)
+    : engine(0) {
+    AF_THROW(af_create_random_engine(&engine, type, seed));
+}
+
+randomEngine::randomEngine(const randomEngine &other) : engine(0) {
+    if (this != &other) {
+        AF_THROW(af_retain_random_engine(&engine, other.get()));
+    }
+}
+
+randomEngine::randomEngine(af_random_engine engine) : engine(engine) {}
+
+randomEngine::~randomEngine() {
+    if (engine) { af_release_random_engine(engine); }
+}
+
+randomEngine &randomEngine::operator=(const randomEngine &other) {
+    if (this != &other) {
+        AF_THROW(af_release_random_engine(engine));
+        AF_THROW(af_retain_random_engine(&engine, other.get()));
+    }
+    return *this;
+}
+
+randomEngineType randomEngine::getType() {
+    af_random_engine_type type;
+    AF_THROW(af_random_engine_get_type(&type, engine));
+    return type;
+}
+
+void randomEngine::setType(const randomEngineType type) {
+    AF_THROW(af_random_engine_set_type(&engine, type));
+}
+
+void randomEngine::setSeed(const unsigned long long seed) {
+    AF_THROW(af_random_engine_set_seed(&engine, seed));
+}
+
+unsigned long long randomEngine::getSeed() const {
+    unsigned long long seed;
+    AF_THROW(af_random_engine_get_seed(&seed, engine));
+    return seed;
+}
+
+af_random_engine randomEngine::get() const { return engine; }
+
+array randu(const dim4 &dims, const dtype ty, randomEngine &r) {
+    af_array out;
+    AF_THROW(af_random_uniform(&out, dims.ndims(), dims.get(), ty, r.get()));
+    return array(out);
+}
+
+array randn(const dim4 &dims, const dtype ty, randomEngine &r) {
+    af_array out;
+    AF_THROW(af_random_normal(&out, dims.ndims(), dims.get(), ty, r.get()));
+    return array(out);
+}
+
+array randu(const dim4 &dims, const af::dtype type) {
+    af_array res;
+    AF_THROW(af_randu(&res, dims.ndims(), dims.get(), type));
+    return array(res);
+}
+
+array randu(const dim_t d0, const af::dtype ty) { return randu(dim4(d0), ty); }
+
+array randu(const dim_t d0, const dim_t d1, const af::dtype ty) {
+    return randu(dim4(d0, d1), ty);
+}
+
+array randu(const dim_t d0, const dim_t d1, const dim_t d2,
+            const af::dtype ty) {
+    return randu(dim4(d0, d1, d2), ty);
+}
+
+array randu(const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3,
+            const af::dtype ty) {
+    return randu(dim4(d0, d1, d2, d3), ty);
+}
+
+array randn(const dim4 &dims, const af::dtype type) {
+    af_array res;
+    AF_THROW(af_randn(&res, dims.ndims(), dims.get(), type));
+    return array(res);
+}
+
+array randn(const dim_t d0, const af::dtype ty) { return randn(dim4(d0), ty); }
+
+array randn(const dim_t d0, const dim_t d1, const af::dtype ty) {
+    return randn(dim4(d0, d1), ty);
+}
+
+array randn(const dim_t d0, const dim_t d1, const dim_t d2,
+            const af::dtype ty) {
+    return randn(dim4(d0, d1, d2), ty);
+}
+
+array randn(const dim_t d0, const dim_t d1, const dim_t d2, const dim_t d3,
+            const af::dtype ty) {
+    return randn(dim4(d0, d1, d2, d3), ty);
+}
+
+void setDefaultRandomEngineType(randomEngineType rtype) {
+    AF_THROW(af_set_default_random_engine_type(rtype));
+}
+
+randomEngine getDefaultRandomEngine() {
+    af_random_engine internal_handle = 0;
+    af_random_engine handle          = 0;
+    AF_THROW(af_get_default_random_engine(&internal_handle));
+    AF_THROW(af_retain_random_engine(&handle, internal_handle));
+    return randomEngine(handle);
+}
+
+void setSeed(const unsigned long long seed) { AF_THROW(af_set_seed(seed)); }
+
+unsigned long long getSeed() {
+    unsigned long long seed = 0;
+    AF_THROW(af_get_seed(&seed));
+    return seed;
+}
+
+}  // namespace af
diff --git a/src/api/cpp/reduce.cpp b/src/api/cpp/reduce.cpp
index e8f870e97b..8dc47fcab9 100644
--- a/src/api/cpp/reduce.cpp
+++ b/src/api/cpp/reduce.cpp
@@ -7,190 +7,340 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include <af/compatible.h>
-#include "error.hpp"
 #include "common.hpp"
+#include "error.hpp"
 
-namespace af
-{
-    array sum(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_sum(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
-    }
+namespace af {
+array sum(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_sum(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
 
-    array product(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_product(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
-    }
+array sum(const array &in, const int dim, const double nanval) {
+    af_array out = 0;
+    AF_THROW(af_sum_nan(&out, in.get(), dim, nanval));
+    return array(out);
+}
 
-    array min(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_min(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
-    }
+void sumByKey(array &keys_out, array &vals_out, const array &keys,
+              const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_sum_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                           getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+void sumByKey(array &keys_out, array &vals_out, const array &keys,
+              const array &vals, const int dim, const double nanval) {
+    af_array okeys, ovals;
+    AF_THROW(
+        af_sum_by_key_nan(&okeys, &ovals, keys.get(), vals.get(), dim, nanval));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+array product(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_product(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+array product(const array &in, const int dim, const double nanval) {
+    af_array out = 0;
+    AF_THROW(af_product_nan(&out, in.get(), dim, nanval));
+    return array(out);
+}
+
+void productByKey(array &keys_out, array &vals_out, const array &keys,
+                  const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_product_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                               getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+void productByKey(array &keys_out, array &vals_out, const array &keys,
+                  const array &vals, const int dim, const double nanval) {
+    af_array okeys, ovals;
+    AF_THROW(af_product_by_key_nan(&okeys, &ovals, keys.get(), vals.get(), dim,
+                                   nanval));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+array mul(const array &in, const int dim) { return product(in, dim); }
+
+array min(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_min(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+void minByKey(array &keys_out, array &vals_out, const array &keys,
+              const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_min_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                           getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+array max(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_max(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+void maxByKey(array &keys_out, array &vals_out, const array &keys,
+              const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_max_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                           getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+void max(array &val, array &idx, const array &in, const array &ragged_len,
+         const int dim) {
+    af_array oval, oidx;
+    AF_THROW(af_max_ragged(&oval, &oidx, in.get(), ragged_len.get(), dim));
+    val = array(oval);
+    idx = array(oidx);
+}
+
+// 2.1 compatibility
+array alltrue(const array &in, const int dim) { return allTrue(in, dim); }
+array allTrue(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_all_true(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+void allTrueByKey(array &keys_out, array &vals_out, const array &keys,
+                  const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_all_true_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                                getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+// 2.1 compatibility
+array anytrue(const array &in, const int dim) { return anyTrue(in, dim); }
+array anyTrue(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_any_true(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+void anyTrueByKey(array &keys_out, array &vals_out, const array &keys,
+                  const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_any_true_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                                getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+array count(const array &in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_count(&out, in.get(), getFNSD(dim, in.dims())));
+    return array(out);
+}
+
+void countByKey(array &keys_out, array &vals_out, const array &keys,
+                const array &vals, const int dim) {
+    af_array okeys, ovals;
+    AF_THROW(af_count_by_key(&okeys, &ovals, keys.get(), vals.get(),
+                             getFNSD(dim, vals.dims())));
+    keys_out = array(okeys);
+    vals_out = array(ovals);
+}
+
+void min(array &val, array &idx, const array &in, const int dim) {
+    af_array out = 0;
+    af_array loc = 0;
+    AF_THROW(af_imin(&out, &loc, in.get(), getFNSD(dim, in.dims())));
+    val = array(out);
+    idx = array(loc);
+}
+
+void max(array &val, array &idx, const array &in, const int dim) {
+    af_array out = 0;
+    af_array loc = 0;
+    AF_THROW(af_imax(&out, &loc, in.get(), getFNSD(dim, in.dims())));
+    val = array(out);
+    idx = array(loc);
+}
+
+#define INSTANTIATE(fnC, fnCPP)                      \
+    INSTANTIATE_REAL(fnC, fnCPP, float)              \
+    INSTANTIATE_REAL(fnC, fnCPP, double)             \
+    INSTANTIATE_REAL(fnC, fnCPP, int)                \
+    INSTANTIATE_REAL(fnC, fnCPP, unsigned)           \
+    INSTANTIATE_REAL(fnC, fnCPP, long)               \
+    INSTANTIATE_REAL(fnC, fnCPP, unsigned long)      \
+    INSTANTIATE_REAL(fnC, fnCPP, long long)          \
+    INSTANTIATE_REAL(fnC, fnCPP, unsigned long long) \
+    INSTANTIATE_REAL(fnC, fnCPP, short)              \
+    INSTANTIATE_REAL(fnC, fnCPP, unsigned short)     \
+    INSTANTIATE_REAL(fnC, fnCPP, char)               \
+    INSTANTIATE_REAL(fnC, fnCPP, signed char)        \
+    INSTANTIATE_REAL(fnC, fnCPP, unsigned char)      \
+    INSTANTIATE_CPLX(fnC, fnCPP, af_cfloat, float)   \
+    INSTANTIATE_CPLX(fnC, fnCPP, af_cdouble, double)
 
-    array max(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_max(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
+#define INSTANTIATE_REAL(fnC, fnCPP, T)                   \
+    template<>                                            \
+    AFAPI T fnCPP(const array &in) {                      \
+        double rval, ival;                                \
+        AF_THROW(af_##fnC##_all(&rval, &ival, in.get())); \
+        return (T)(rval);                                 \
     }
 
-    // 2.1 compatibility
-    array alltrue(const array &in, const int dim) { return allTrue(in, dim); }
-    array allTrue(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_all_true(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
+#define INSTANTIATE_CPLX(fnC, fnCPP, T, Tr)               \
+    template<>                                            \
+    AFAPI T fnCPP(const array &in) {                      \
+        double rval, ival;                                \
+        AF_THROW(af_##fnC##_all(&rval, &ival, in.get())); \
+        T out((Tr)rval, (Tr)ival);                        \
+        return out;                                       \
     }
 
-    // 2.1 compatibility
-    array anytrue(const array &in, const int dim) { return anyTrue(in, dim); }
-    array anyTrue(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_any_true(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
+#define INSTANTIATE_ARRAY(fnC, fnCPP)                   \
+    template<>                                          \
+    AFAPI af::array fnCPP(const array &in) {            \
+        af_array out = 0;                               \
+        AF_THROW(af_##fnC##_all_array(&out, in.get())); \
+        return array(out);                              \
     }
 
-    array count(const array &in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_count(&out, in.get(), getFNSD(dim, in.dims())));
-        return array(out);
+INSTANTIATE(sum, sum)
+INSTANTIATE(product, product)
+INSTANTIATE(min, min)
+INSTANTIATE(max, max)
+INSTANTIATE(all_true, allTrue)
+INSTANTIATE(any_true, anyTrue)
+INSTANTIATE(count, count)
+
+INSTANTIATE_REAL(all_true, allTrue, bool);
+INSTANTIATE_REAL(any_true, anyTrue, bool);
+
+INSTANTIATE_ARRAY(sum, sum)
+INSTANTIATE_ARRAY(product, product)
+INSTANTIATE_ARRAY(min, min)
+INSTANTIATE_ARRAY(max, max)
+INSTANTIATE_ARRAY(all_true, allTrue)
+INSTANTIATE_ARRAY(any_true, anyTrue)
+INSTANTIATE_ARRAY(count, count)
+
+#undef INSTANTIATE_REAL
+#undef INSTANTIATE_CPLX
+#undef INSTANTIATE_ARRAY
+
+#define INSTANTIATE_REAL(fnC, fnCPP, T)                           \
+    template<>                                                    \
+    AFAPI T fnCPP(const array &in, const double nanval) {         \
+        double rval, ival;                                        \
+        AF_THROW(af_##fnC##_all(&rval, &ival, in.get(), nanval)); \
+        return (T)(rval);                                         \
     }
 
-    void min(array &val, array &idx, const array &in, const int dim)
-    {
-        af_array out = 0;
-        af_array loc = 0;
-        AF_THROW(af_imin(&out, &loc, in.get(), getFNSD(dim, in.dims())));
-        val = array(out);
-        idx = array(loc);
+#define INSTANTIATE_CPLX(fnC, fnCPP, T, Tr)                       \
+    template<>                                                    \
+    AFAPI T fnCPP(const array &in, const double nanval) {         \
+        double rval, ival;                                        \
+        AF_THROW(af_##fnC##_all(&rval, &ival, in.get(), nanval)); \
+        T out((Tr)rval, (Tr)ival);                                \
+        return out;                                               \
     }
 
-    void max(array &val, array &idx, const array &in, const int dim)
-    {
-        af_array out = 0;
-        af_array loc = 0;
-        AF_THROW(af_imax(&out, &loc, in.get(), getFNSD(dim, in.dims())));
-        val = array(out);
-        idx = array(loc);
+#define INSTANTIATE_ARRAY(fnC, fnCPP)                             \
+    template<>                                                    \
+    AFAPI af::array fnCPP(const array &in, const double nanval) { \
+        af_array out = 0;                                         \
+        AF_THROW(af_##fnC##_all_array(&out, in.get(), nanval));   \
+        return array(out);                                        \
     }
+INSTANTIATE_ARRAY(sum_nan, sum)
+INSTANTIATE_ARRAY(product_nan, product)
 
-#define INSTANTIATE_REAL(fnC, fnCPP, T)                     \
-    template<> AFAPI                                        \
-    T fnCPP(const array &in)                                \
-    {                                                       \
-        double rval, ival;                                  \
-        AF_THROW(af_##fnC##_all(&rval, &ival, in.get()));   \
-        return (T)(rval);                                   \
-    }                                                       \
-
-
-#define INSTANTIATE_CPLX(fnC, fnCPP, T, Tr)                 \
-    template<> AFAPI                                        \
-    T fnCPP(const array &in)                                \
-    {                                                       \
-        double rval, ival;                                  \
-        AF_THROW(af_##fnC##_all(&rval, &ival, in.get()));   \
-        T out((Tr)rval, (Tr)ival);                          \
-        return out;                                         \
-    }                                                       \
-
-#define INSTANTIATE(fnC, fnCPP)                         \
-    INSTANTIATE_REAL(fnC, fnCPP, float)                 \
-    INSTANTIATE_REAL(fnC, fnCPP, double)                \
-    INSTANTIATE_REAL(fnC, fnCPP, int)                   \
-    INSTANTIATE_REAL(fnC, fnCPP, unsigned)              \
-    INSTANTIATE_REAL(fnC, fnCPP, long)                  \
-    INSTANTIATE_REAL(fnC, fnCPP, unsigned long)         \
-    INSTANTIATE_REAL(fnC, fnCPP, long long)             \
-    INSTANTIATE_REAL(fnC, fnCPP, unsigned long long)    \
-    INSTANTIATE_REAL(fnC, fnCPP, char)                  \
-    INSTANTIATE_REAL(fnC, fnCPP, unsigned char)         \
-    INSTANTIATE_CPLX(fnC, fnCPP, af_cfloat, float)      \
-    INSTANTIATE_CPLX(fnC, fnCPP, af_cdouble, double)    \
-
-    INSTANTIATE(sum, sum)
-    INSTANTIATE(product, product)
-    INSTANTIATE(min, min)
-    INSTANTIATE(max, max)
-    INSTANTIATE(all_true, allTrue)
-    INSTANTIATE(any_true, anyTrue)
-    INSTANTIATE(count, count)
-
-    INSTANTIATE_REAL(all_true, allTrue, bool);
-    INSTANTIATE_REAL(any_true, anyTrue, bool);
+INSTANTIATE(sum_nan, sum)
+INSTANTIATE(product_nan, product)
 
-#undef INSTANTIATE
 #undef INSTANTIATE_REAL
 #undef INSTANTIATE_CPLX
+#undef INSTANTIATE
+#undef INSTANTIATE_ARRAY
 
-#define INSTANTIATE_COMPAT(fnCPP, fnCompat, T)              \
-    template<> AFAPI                                        \
-    T fnCompat(const array &in)                             \
-    {                                                       \
-        return fnCPP<T>(in);                                      \
+#define INSTANTIATE_COMPAT(fnCPP, fnCompat, T) \
+    template<>                                 \
+    AFAPI T fnCompat(const array &in) {        \
+        return fnCPP<T>(in);                   \
     }
 
-#define INSTANTIATE(fnCPP, fnCompat)                            \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, float)                  \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, double)                 \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, int)                    \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned)               \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, long)                   \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned long)          \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, long long)              \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned long long)     \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, char)                   \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned char)          \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, af_cfloat)              \
-    INSTANTIATE_COMPAT(fnCPP, fnCompat, af_cdouble)             \
-
-    INSTANTIATE(allTrue, alltrue)
-    INSTANTIATE(anyTrue, anytrue)
+#define INSTANTIATE(fnCPP, fnCompat)                        \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, float)              \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, double)             \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, int)                \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned)           \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, long)               \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned long)      \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, long long)          \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned long long) \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, char)               \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, signed char)        \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned char)      \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, af_cfloat)          \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, af_cdouble)         \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, short)              \
+    INSTANTIATE_COMPAT(fnCPP, fnCompat, unsigned short)
+
+INSTANTIATE(product, mul)
+INSTANTIATE(allTrue, alltrue)
+INSTANTIATE(anyTrue, anytrue)
+
+INSTANTIATE_COMPAT(allTrue, alltrue, bool)
+INSTANTIATE_COMPAT(anyTrue, anytrue, bool)
 
 #undef INSTANTIATE
 #undef INSTANTIATE_COMPAT
 
-#define INSTANTIATE_REAL(fn, T)                                 \
-    template<> AFAPI                                            \
-    void fn(T *val, unsigned *idx, const array &in)             \
-    {                                                           \
-        double rval, ival;                                      \
-        AF_THROW(af_i##fn##_all(&rval, &ival, idx, in.get()));  \
-        *val = (T)(rval);                                       \
-    }                                                           \
-
-
-#define INSTANTIATE_CPLX(fn, T, Tr)                             \
-    template<> AFAPI                                            \
-    void fn(T *val, unsigned *idx, const array &in)             \
-    {                                                           \
-        double rval, ival;                                      \
-        AF_THROW(af_i##fn##_all(&rval, &ival, idx, in.get()));  \
-        *val = T((Tr)rval, (Tr)ival);                           \
-    }                                                           \
-
-#define INSTANTIATE(fn)                         \
-    INSTANTIATE_REAL(fn, float)                 \
-    INSTANTIATE_REAL(fn, double)                \
-    INSTANTIATE_REAL(fn, int)                   \
-    INSTANTIATE_REAL(fn, unsigned)              \
-    INSTANTIATE_REAL(fn, char)                  \
-    INSTANTIATE_REAL(fn, unsigned char)         \
-    INSTANTIATE_CPLX(fn, af_cfloat, float)      \
-    INSTANTIATE_CPLX(fn, af_cdouble, double)    \
-
-    INSTANTIATE(min)
-    INSTANTIATE(max)
-}
+#define INSTANTIATE_REAL(fn, T)                                \
+    template<>                                                 \
+    AFAPI void fn(T *val, unsigned *idx, const array &in) {    \
+        double rval, ival;                                     \
+        AF_THROW(af_i##fn##_all(&rval, &ival, idx, in.get())); \
+        *val = (T)(rval);                                      \
+    }
+
+#define INSTANTIATE_CPLX(fn, T, Tr)                            \
+    template<>                                                 \
+    AFAPI void fn(T *val, unsigned *idx, const array &in) {    \
+        double rval, ival;                                     \
+        AF_THROW(af_i##fn##_all(&rval, &ival, idx, in.get())); \
+        *val = T((Tr)rval, (Tr)ival);                          \
+    }
+
+#define INSTANTIATE(fn)                    \
+    INSTANTIATE_REAL(fn, float)            \
+    INSTANTIATE_REAL(fn, double)           \
+    INSTANTIATE_REAL(fn, int)              \
+    INSTANTIATE_REAL(fn, unsigned)         \
+    INSTANTIATE_REAL(fn, char)             \
+    INSTANTIATE_REAL(fn, signed char)      \
+    INSTANTIATE_REAL(fn, unsigned char)    \
+    INSTANTIATE_REAL(fn, short)            \
+    INSTANTIATE_REAL(fn, unsigned short)   \
+    INSTANTIATE_CPLX(fn, af_cfloat, float) \
+    INSTANTIATE_CPLX(fn, af_cdouble, double)
+
+INSTANTIATE(min)
+INSTANTIATE(max)
+}  // namespace af
diff --git a/src/api/cpp/regions.cpp b/src/api/cpp/regions.cpp
index 73d40557e7..dce5319297 100644
--- a/src/api/cpp/regions.cpp
+++ b/src/api/cpp/regions.cpp
@@ -7,18 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array regions(const array& in, const af::connectivity connectivity, const af::dtype type)
-{
+array regions(const array& in, const af::connectivity connectivity,
+              const af::dtype type) {
     af_array temp = 0;
     AF_THROW(af_regions(&temp, in.get(), connectivity, type));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/resize.cpp b/src/api/cpp/resize.cpp
index e0e4f680e6..4bb1723437 100644
--- a/src/api/cpp/resize.cpp
+++ b/src/api/cpp/resize.cpp
@@ -7,32 +7,32 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array resize(const array &in, const dim_t odim0, const dim_t odim1, const interpType method)
-{
+array resize(const array &in, const dim_t odim0, const dim_t odim1,
+             const interpType method) {
     af_array out = 0;
     AF_THROW(af_resize(&out, in.get(), odim0, odim1, method));
     return array(out);
 }
 
-array resize(const float scale0, const float scale1, const array &in, const interpType method)
-{
+array resize(const float scale0, const float scale1, const array &in,
+             const interpType method) {
     af_array out = 0;
-    AF_THROW(af_resize(&out, in.get(), in.dims(0) * scale0, in.dims(1) * scale1, method));
+    AF_THROW(af_resize(&out, in.get(), in.dims(0) * scale0, in.dims(1) * scale1,
+                       method));
     return array(out);
 }
 
-array resize(const float scale, const array &in, const interpType method)
-{
+array resize(const float scale, const array &in, const interpType method) {
     af_array out = 0;
-    AF_THROW(af_resize(&out, in.get(), in.dims(0) * scale, in.dims(1) * scale, method));
+    AF_THROW(af_resize(&out, in.get(), in.dims(0) * scale, in.dims(1) * scale,
+                       method));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/rgb_gray.cpp b/src/api/cpp/rgb_gray.cpp
index 0395228d20..995db9b225 100644
--- a/src/api/cpp/rgb_gray.cpp
+++ b/src/api/cpp/rgb_gray.cpp
@@ -7,25 +7,24 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array rgb2gray(const array& in, const float rPercent, const float gPercent, const float bPercent)
-{
+array rgb2gray(const array& in, const float rPercent, const float gPercent,
+               const float bPercent) {
     af_array temp = 0;
     AF_THROW(af_rgb2gray(&temp, in.get(), rPercent, gPercent, bPercent));
     return array(temp);
 }
 
-array gray2rgb(const array& in, const float rFactor, const float gFactor, const float bFactor)
-{
+array gray2rgb(const array& in, const float rFactor, const float gFactor,
+               const float bFactor) {
     af_array temp = 0;
     AF_THROW(af_gray2rgb(&temp, in.get(), rFactor, gFactor, bFactor));
     return array(temp);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/rotate.cpp b/src/api/cpp/rotate.cpp
index ed42f69a2a..fb4b96f4b8 100644
--- a/src/api/cpp/rotate.cpp
+++ b/src/api/cpp/rotate.cpp
@@ -7,18 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array rotate(const array& in, const float theta, const bool crop, const interpType method)
-{
+array rotate(const array& in, const float theta, const bool crop,
+             const interpType method) {
     af_array out = 0;
     AF_THROW(af_rotate(&out, in.get(), theta, crop, method));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/sat.cpp b/src/api/cpp/sat.cpp
new file mode 100644
index 0000000000..f0dfe641f7
--- /dev/null
+++ b/src/api/cpp/sat.cpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+
+array sat(const array& in) {
+    af_array out = 0;
+    AF_THROW(af_sat(&out, in.get()));
+    return array(out);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/scale.cpp b/src/api/cpp/scale.cpp
index a41aa0509f..fb59e1b95b 100644
--- a/src/api/cpp/scale.cpp
+++ b/src/api/cpp/scale.cpp
@@ -7,18 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array scale(const array& in, const float scale0, const float scale1, const dim_t odim0, const dim_t odim1, const interpType method)
-{
+array scale(const array& in, const float scale0, const float scale1,
+            const dim_t odim0, const dim_t odim1, const interpType method) {
     af_array out = 0;
     AF_THROW(af_scale(&out, in.get(), scale0, scale1, odim0, odim1, method));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/scan.cpp b/src/api/cpp/scan.cpp
index 9cc313971b..840f23942b 100644
--- a/src/api/cpp/scan.cpp
+++ b/src/api/cpp/scan.cpp
@@ -7,16 +7,28 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include "error.hpp"
 
-namespace af
-{
-    array accum(const array& in, const int dim)
-    {
-        af_array out = 0;
-        AF_THROW(af_accum(&out, in.get(), dim));
-        return array(out);
-    }
+namespace af {
+array accum(const array& in, const int dim) {
+    af_array out = 0;
+    AF_THROW(af_accum(&out, in.get(), dim));
+    return array(out);
+}
+
+array scan(const array& in, const int dim, binaryOp op, bool inclusive_scan) {
+    af_array out = 0;
+    AF_THROW(af_scan(&out, in.get(), dim, op, inclusive_scan));
+    return array(out);
+}
+
+array scanByKey(const array& key, const array& in, const int dim, binaryOp op,
+                bool inclusive_scan) {
+    af_array out = 0;
+    AF_THROW(
+        af_scan_by_key(&out, key.get(), in.get(), dim, op, inclusive_scan));
+    return array(out);
 }
+}  // namespace af
diff --git a/src/api/cpp/seq.cpp b/src/api/cpp/seq.cpp
index d1433563ba..5d56a70f95 100644
--- a/src/api/cpp/seq.cpp
+++ b/src/api/cpp/seq.cpp
@@ -7,24 +7,24 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/seq.h>
 #include <af/array.h>
 #include <af/data.h>
+#include <af/seq.h>
+
 #include "error.hpp"
 
-namespace af
-{
+#include <cmath>
 
-AFAPI int end = -1;
-AFAPI seq span(af_span);
+namespace af {
+int end = -1;
+seq span(af_span);
 
-void seq::init(double begin, double end, double step)
-{
+void seq::init(double begin, double end, double step) {
     this->s.begin = begin;
     this->s.end   = end;
     this->s.step  = step;
-    if(step != 0) {       // Not Span
-        size = fabs((end - begin) / step) + 1;
+    if (step != 0) {  // Not Span
+        size = std::fabs((end - begin) / step) + 1;
     } else {
         size = 0;
     }
@@ -32,68 +32,56 @@ void seq::init(double begin, double end, double step)
 
 #ifndef signbit
 // wtf windows?!
-inline int signbit(double x)
-{
-    if (x < 0) return -1;
-    return  0;
+inline int signbit(double x) {
+    if (x < 0) { return -1; }
+    return 0;
 }
 #endif
 
-seq::~seq()
-{
-}
+seq::~seq() = default;
 
-seq::seq(double n): m_gfor(false)
-{
-    if (n < 0) {
-        init(n + 1, 0, 1);  // seq(-4) = -3, -2, -1, 0
+seq::seq(double length) : s{}, size{}, m_gfor(false) {
+    if (length < 0) {
+        init(0, length, 1);
     } else {
-        init(0, n - 1, 1);
+        init(0, length - 1, 1);
     }
 }
 
-seq::seq(const af_seq& s_): m_gfor(false)
-{
+seq::seq(const af_seq& s_) : s{}, size{}, m_gfor(false) {
     init(s_.begin, s_.end, s_.step);
 }
 
-seq& seq::operator=(const af_seq& s_)
-{
+seq& seq::operator=(const af_seq& s_) {
     init(s_.begin, s_.end, s_.step);
     return *this;
 }
 
-seq::seq(double begin, double end, double step): m_gfor(false)
-{
-    if(begin == -1 && end <= -1) {
-        step = 0;           // end
-    }
-
+seq::seq(double begin, double end, double step) : s{}, size{}, m_gfor(false) {
     if (step == 0) {
-        if (begin != end)   // Span
-            AF_THROW_MSG("Invalid step size", AF_ERR_ARG);
+        if (begin != end) {  // Span
+            AF_THROW_ERR("Invalid step size", AF_ERR_ARG);
+        }
+    }
+    if ((signbit(end) == signbit(begin)) &&
+        (signbit(end - begin) != signbit(step))) {
+        AF_THROW_ERR("Sequence is invalid", AF_ERR_ARG);
     }
-    if (end >= 0 && begin >= 0 && signbit(end-begin) != signbit(step))
-        AF_THROW_MSG("Sequence is invalid", AF_ERR_ARG);
-        //AF_THROW("step must match direction of sequence");
     init(begin, end, step);
 }
 
-seq::seq(seq other, bool is_gfor)
-    : s(other.s),
-      size(other.size),
-      m_gfor(is_gfor)
-{ }
+seq::seq(seq other,  // NOLINT(performance-unnecessary-value-param)
+         bool is_gfor)
+    : s(other.s), size(other.size), m_gfor(is_gfor) {}
 
-seq::operator array() const
-{
-    dim_t diff = s.end - s.begin;
-    dim_t len = (int)((diff + fabs(s.step) * (signbit(diff) == 0 ? 1 : -1)) / s.step);
+seq::operator array() const {
+    double diff = s.end - s.begin;
+    dim_t len   = static_cast<int>(
+        (diff + std::fabs(s.step) * (signbit(diff) == 0 ? 1 : -1)) / s.step);
 
     array tmp = (m_gfor) ? range(1, 1, 1, len, 3) : range(len);
 
     array res = s.begin + s.step * tmp;
     return res;
 }
-
-}
+}  // namespace af
diff --git a/src/api/cpp/set.cpp b/src/api/cpp/set.cpp
index c5be4d61fb..b1a23fa1e4 100644
--- a/src/api/cpp/set.cpp
+++ b/src/api/cpp/set.cpp
@@ -9,45 +9,41 @@
 
 #include <af/algorithm.h>
 #include <af/array.h>
+#include <af/compatible.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array setunique(const array &in, const bool is_sorted)
-{
+array setunique(const array &in, const bool is_sorted) {
     return setUnique(in, is_sorted);
 }
 
-array setUnique(const array &in, const bool is_sorted)
-{
+array setUnique(const array &in, const bool is_sorted) {
     af_array out = 0;
     AF_THROW(af_set_unique(&out, in.get(), is_sorted));
     return array(out);
 }
 
-array setunion(const array &first, const array &second, const bool is_unique)
-{
+array setunion(const array &first, const array &second, const bool is_unique) {
     return setUnion(first, second, is_unique);
 }
 
-array setUnion(const array &first, const array &second, const bool is_unique)
-{
+array setUnion(const array &first, const array &second, const bool is_unique) {
     af_array out = 0;
     AF_THROW(af_set_union(&out, first.get(), second.get(), is_unique));
     return array(out);
 }
 
-array setintersect(const array &first, const array &second, const bool is_unique)
-{
+array setintersect(const array &first, const array &second,
+                   const bool is_unique) {
     return setIntersect(first, second, is_unique);
 }
 
-array setIntersect(const array &first, const array &second, const bool is_unique)
-{
+array setIntersect(const array &first, const array &second,
+                   const bool is_unique) {
     af_array out = 0;
     AF_THROW(af_set_intersect(&out, first.get(), second.get(), is_unique));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/sift.cpp b/src/api/cpp/sift.cpp
new file mode 100644
index 0000000000..decc6851ff
--- /dev/null
+++ b/src/api/cpp/sift.cpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+void sift(features& feat, array& desc, const array& in, const unsigned n_layers,
+          const float contrast_thr, const float edge_thr,
+          const float init_sigma, const bool double_input,
+          const float img_scale, const float feature_ratio) {
+    af_features temp_feat;
+    af_array temp_desc = 0;
+    AF_THROW(af_sift(&temp_feat, &temp_desc, in.get(), n_layers, contrast_thr,
+                     edge_thr, init_sigma, double_input, img_scale,
+                     feature_ratio));
+
+    dim_t num = 0;
+    AF_THROW(af_get_features_num(&num, temp_feat));
+    feat = features(temp_feat);
+    desc = array(temp_desc);
+}
+
+void gloh(features& feat, array& desc, const array& in, const unsigned n_layers,
+          const float contrast_thr, const float edge_thr,
+          const float init_sigma, const bool double_input,
+          const float img_scale, const float feature_ratio) {
+    af_features temp_feat;
+    af_array temp_desc = 0;
+    AF_THROW(af_gloh(&temp_feat, &temp_desc, in.get(), n_layers, contrast_thr,
+                     edge_thr, init_sigma, double_input, img_scale,
+                     feature_ratio));
+
+    dim_t num = 0;
+    AF_THROW(af_get_features_num(&num, temp_feat));
+    feat = features(temp_feat);
+    desc = array(temp_desc);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/skew.cpp b/src/api/cpp/skew.cpp
index 3913fbd6da..dd2e67edcc 100644
--- a/src/api/cpp/skew.cpp
+++ b/src/api/cpp/skew.cpp
@@ -7,18 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array skew(const array& in, const float skew0, const float skew1, const dim_t odim0, const dim_t odim1, const bool inverse, const interpType method)
-{
+array skew(const array& in, const float skew0, const float skew1,
+           const dim_t odim0, const dim_t odim1, const bool inverse,
+           const interpType method) {
     af_array out = 0;
-    AF_THROW(af_skew(&out, in.get(), skew0, skew1, odim0, odim1, method, inverse));
+    AF_THROW(
+        af_skew(&out, in.get(), skew0, skew1, odim0, odim1, method, inverse));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/sobel.cpp b/src/api/cpp/sobel.cpp
index ddeee47518..5ec491af7d 100644
--- a/src/api/cpp/sobel.cpp
+++ b/src/api/cpp/sobel.cpp
@@ -7,16 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/arith.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-void sobel(array &dx, array &dy, const array &img, const unsigned ker_size)
-{
+void sobel(array &dx, array &dy, const array &img, const unsigned ker_size) {
     af_array af_dx = 0;
     af_array af_dy = 0;
     AF_THROW(af_sobel_operator(&af_dx, &af_dy, img.get(), ker_size));
@@ -24,16 +22,15 @@ void sobel(array &dx, array &dy, const array &img, const unsigned ker_size)
     dy = array(af_dy);
 }
 
-array sobel(const array &img, const unsigned ker_size, const bool isFast)
-{
+array sobel(const array &img, const unsigned ker_size, const bool isFast) {
     array dx;
     array dy;
     sobel(dx, dy, img, ker_size);
     if (isFast) {
-        return abs(dx)+abs(dy);
+        return abs(dx) + abs(dy);
     } else {
-        return sqrt(dx*dx+dy*dy);
+        return sqrt(dx * dx + dy * dy);
     }
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/sort.cpp b/src/api/cpp/sort.cpp
index fcd4cfb612..64a851fe5c 100644
--- a/src/api/cpp/sort.cpp
+++ b/src/api/cpp/sort.cpp
@@ -7,33 +7,31 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include "error.hpp"
 
-namespace af
-{
-    array sort(const array& in, const unsigned dim, const bool isAscending)
-    {
-        af_array out = 0;
-        AF_THROW(af_sort(&out, in.get(), dim, isAscending));
-        return array(out);
-    }
+namespace af {
+array sort(const array &in, const unsigned dim, const bool isAscending) {
+    af_array out = 0;
+    AF_THROW(af_sort(&out, in.get(), dim, isAscending));
+    return array(out);
+}
 
-    void sort(array &out, array &indices, const array& in, const unsigned dim, const bool isAscending)
-    {
-        af_array out_, indices_;
-        AF_THROW(af_sort_index(&out_, &indices_, in.get(), dim, isAscending));
-        out = array(out_);
-        indices = array(indices_);
-    }
+void sort(array &out, array &indices, const array &in, const unsigned dim,
+          const bool isAscending) {
+    af_array out_, indices_;
+    AF_THROW(af_sort_index(&out_, &indices_, in.get(), dim, isAscending));
+    out     = array(out_);
+    indices = array(indices_);
+}
 
-    void sort(array &out_keys, array &out_values, const array &keys, const array &values,
-              const unsigned dim, const bool isAscending)
-    {
-        af_array okeys, ovalues;
-        AF_THROW(af_sort_by_key(&okeys, &ovalues, keys.get(), values.get(), dim, isAscending));
-        out_keys = array(okeys);
-        out_values = array(ovalues);
-    }
+void sort(array &out_keys, array &out_values, const array &keys,
+          const array &values, const unsigned dim, const bool isAscending) {
+    af_array okeys, ovalues;
+    AF_THROW(af_sort_by_key(&okeys, &ovalues, keys.get(), values.get(), dim,
+                            isAscending));
+    out_keys   = array(okeys);
+    out_values = array(ovalues);
 }
+}  // namespace af
diff --git a/src/api/cpp/sparse.cpp b/src/api/cpp/sparse.cpp
new file mode 100644
index 0000000000..92486f873a
--- /dev/null
+++ b/src/api/cpp/sparse.cpp
@@ -0,0 +1,104 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/sparse.h>
+#include "error.hpp"
+
+namespace af {
+array sparse(const dim_t nRows, const dim_t nCols,
+             const array values,  // NOLINT(performance-unnecessary-value-param)
+             const array rowIdx,  // NOLINT(performance-unnecessary-value-param)
+             const array colIdx,  // NOLINT(performance-unnecessary-value-param)
+             const af::storage stype) {
+    af_array out = 0;
+    AF_THROW(af_create_sparse_array(&out, nRows, nCols, values.get(),
+                                    rowIdx.get(), colIdx.get(), stype));
+    return array(out);
+}
+
+array sparse(const dim_t nRows, const dim_t nCols, const dim_t nNZ,
+             const void* const values, const int* const rowIdx,
+             const int* const colIdx, const dtype type, const af::storage stype,
+             const af::source src) {
+    af_array out = 0;
+    AF_THROW(af_create_sparse_array_from_ptr(&out, nRows, nCols, nNZ, values,
+                                             rowIdx, colIdx, type, stype, src));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array sparse(const array dense, const af::storage stype) {
+    af_array out = 0;
+    AF_THROW(af_create_sparse_array_from_dense(&out, dense.get(), stype));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array sparseConvertTo(const array in, const af::storage stype) {
+    af_array out = 0;
+    AF_THROW(af_sparse_convert_to(&out, in.get(), stype));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array dense(const array sparse) {
+    af_array out = 0;
+    AF_THROW(af_sparse_to_dense(&out, sparse.get()));
+    return array(out);
+}
+
+void sparseGetInfo(
+    array& values, array& rowIdx, array& colIdx, storage& stype,
+    const array in) {  // NOLINT(performance-unnecessary-value-param)
+    af_array values_ = 0, rowIdx_ = 0, colIdx_ = 0;
+    af_storage stype_ = AF_STORAGE_DENSE;
+    AF_THROW(
+        af_sparse_get_info(&values_, &rowIdx_, &colIdx_, &stype_, in.get()));
+    values = array(values_);
+    rowIdx = array(rowIdx_);
+    colIdx = array(colIdx_);
+    stype  = stype_;
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array sparseGetValues(const array in) {
+    af_array out = 0;
+    AF_THROW(af_sparse_get_values(&out, in.get()));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array sparseGetRowIdx(const array in) {
+    af_array out = 0;
+    AF_THROW(af_sparse_get_row_idx(&out, in.get()));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+array sparseGetColIdx(const array in) {
+    af_array out = 0;
+    AF_THROW(af_sparse_get_col_idx(&out, in.get()));
+    return array(out);
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+dim_t sparseGetNNZ(const array in) {
+    dim_t out = 0;
+    AF_THROW(af_sparse_get_nnz(&out, in.get()));
+    return out;
+}
+
+// NOLINTNEXTLINE(performance-unnecessary-value-param)
+af::storage sparseGetStorage(const array in) {
+    af::storage out;
+    AF_THROW(af_sparse_get_storage(&out, in.get()));
+    return out;
+}
+}  // namespace af
diff --git a/src/api/cpp/stdev.cpp b/src/api/cpp/stdev.cpp
index 36430b824f..66edaf816a 100644
--- a/src/api/cpp/stdev.cpp
+++ b/src/api/cpp/stdev.cpp
@@ -7,51 +7,72 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/array.h>
 #include <af/dim4.hpp>
 #include <af/statistics.h>
-#include <af/array.h>
-#include "error.hpp"
 #include "common.hpp"
+#include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-#define INSTANTIATE_STDEV(T)                              \
-    template<> AFAPI T stdev(const array& in)             \
-    {                                                     \
-        double ret_val;                                   \
-        AF_THROW(af_stdev_all(&ret_val, NULL, in.get())); \
-        return (T) ret_val;                               \
-    }                                                     \
+#define INSTANTIATE_STDEV(T)                                       \
+    template<>                                                     \
+    AFAPI T stdev(const array& in, const af_var_bias bias) {       \
+        double ret_val;                                            \
+        AF_THROW(af_stdev_all_v2(&ret_val, NULL, in.get(), bias)); \
+        return (T)ret_val;                                         \
+    }                                                              \
+    template<>                                                     \
+    AFAPI T stdev(const array& in) {                               \
+        return stdev<T>(in, AF_VARIANCE_POPULATION);               \
+    }
 
-template<> AFAPI af_cfloat stdev(const array& in)
-{
+template<>
+AFAPI af_cfloat stdev(const array& in, const af_var_bias bias) {
     double real, imag;
-    AF_THROW(af_stdev_all(&real, &imag, in.get()));
-    return std::complex<float>((float)real, (float)imag);
+    AF_THROW(af_stdev_all_v2(&real, &imag, in.get(), bias));
+    return {static_cast<float>(real), static_cast<float>(imag)};
 }
 
-template<> AFAPI af_cdouble stdev(const array& in)
-{
+template<>
+AFAPI af_cdouble stdev(const array& in, const af_var_bias bias) {
     double real, imag;
-    AF_THROW(af_stdev_all(&real, &imag, in.get()));
-    return std::complex<double>(real, imag);
+    AF_THROW(af_stdev_all_v2(&real, &imag, in.get(), bias));
+    return {real, imag};
+}
+
+template<>
+AFAPI af_cfloat stdev(const array& in) {
+    return stdev<af_cfloat>(in, AF_VARIANCE_POPULATION);
+}
+
+template<>
+AFAPI af_cdouble stdev(const array& in) {
+    return stdev<af_cdouble>(in, AF_VARIANCE_POPULATION);
 }
 
 INSTANTIATE_STDEV(float);
 INSTANTIATE_STDEV(double);
 INSTANTIATE_STDEV(int);
 INSTANTIATE_STDEV(unsigned int);
+INSTANTIATE_STDEV(long long);
+INSTANTIATE_STDEV(unsigned long long);
+INSTANTIATE_STDEV(short);
+INSTANTIATE_STDEV(unsigned short);
 INSTANTIATE_STDEV(char);
+INSTANTIATE_STDEV(signed char);
 INSTANTIATE_STDEV(unsigned char);
 
 #undef INSTANTIATE_STDEV
 
-array stdev(const array& in, const dim_t dim)
-{
+array stdev(const array& in, const af_var_bias bias, const dim_t dim) {
     af_array temp = 0;
-    AF_THROW(af_stdev(&temp, in.get(), getFNSD(dim, in.dims())));
+    AF_THROW(af_stdev_v2(&temp, in.get(), bias, getFNSD(dim, in.dims())));
     return array(temp);
 }
 
+array stdev(const array& in, const dim_t dim) {
+    return stdev(in, AF_VARIANCE_POPULATION, dim);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/susan.cpp b/src/api/cpp/susan.cpp
new file mode 100644
index 0000000000..2d2df19884
--- /dev/null
+++ b/src/api/cpp/susan.cpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "error.hpp"
+
+namespace af {
+
+features susan(const array& in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge) {
+    af_features temp;
+    AF_THROW(af_susan(&temp, in.get(), radius, diff_thr, geom_thr,
+                      feature_ratio, edge));
+    return features(temp);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/timing.cpp b/src/api/cpp/timing.cpp
index f530ba7ef3..285cb0cdb9 100644
--- a/src/api/cpp/timing.cpp
+++ b/src/api/cpp/timing.cpp
@@ -7,16 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/timing.h>
 #include <af/device.h>
-#include <math.h>
+#include <af/timing.h>
 #include <algorithm>
+#include <array>
+#include <cmath>
+#include <vector>
 
 using namespace af;
 
 // get current time
-static inline timer time_now(void)
-{
+static inline timer time_now() {
 #if defined(OS_WIN)
     timer time;
     QueryPerformanceCounter(&time.val);
@@ -30,85 +31,79 @@ static inline timer time_now(void)
 }
 
 // absolute difference between two times (in seconds)
-static inline double time_seconds(timer start, timer end)
-{
+static inline double time_seconds(timer start, timer end) {
 #if defined(OS_WIN)
     if (start.val.QuadPart > end.val.QuadPart) {
         timer temp = end;
-        end = start;
-        start = temp;
+        end        = start;
+        start      = temp;
     }
     timer system_freq;
     QueryPerformanceFrequency(&system_freq.val);
-    return (double)(end.val.QuadPart - start.val.QuadPart) / system_freq.val.QuadPart;
+    return (double)(end.val.QuadPart - start.val.QuadPart) /
+           system_freq.val.QuadPart;
 #elif defined(OS_MAC)
     if (start.val > end.val) {
         timer temp = start;
-        start = end;
-        end   = temp;
+        start      = end;
+        end        = temp;
     }
     // calculate platform timing epoch
-    static mach_timebase_info_data_t info;
+    thread_local mach_timebase_info_data_t info;
     mach_timebase_info(&info);
     double nano = (double)info.numer / (double)info.denom;
     return (end.val - start.val) * nano * 1e-9;
 #elif defined(OS_LNX)
-    struct timeval elapsed;
+    struct timeval elapsed {};
     timersub(&start.val, &end.val, &elapsed);
-    long sec = elapsed.tv_sec;
+    long sec  = elapsed.tv_sec;
     long usec = elapsed.tv_usec;
-    double t = sec + usec * 1e-6;
+    double t  = sec + usec * 1e-6;
     return t >= 0 ? t : -t;
 #endif
 }
 
-
 namespace af {
 
-static timer _timer_;
-
-AFAPI timer timer::start()
-{
-    return _timer_ = time_now();
-}
-AFAPI double timer::stop(timer start)
-{
-    return time_seconds(start, time_now());
-}
-AFAPI double timer::stop()
-{
-    return time_seconds(_timer_, time_now());
-}
+thread_local timer _timer_;
 
-AFAPI double timeit(void(*fn)())
-{
-    // parameters
-    int sample_trials = 3;
-    double min_time = 1;
+timer timer::start() { return _timer_ = time_now(); }
+double timer::stop(timer start) { return time_seconds(start, time_now()); }
+double timer::stop() { return time_seconds(_timer_, time_now()); }
 
-    // estimate time for a few samples
-    double sample_time = 1e99; // INF
-    for (int i = 0; i < sample_trials; ++i) {
-        sync();
-        timer start = timer::start();
-        fn();
-        sync();
-        sample_time = std::min(sample_time, timer::stop(start));
-    }
+double timeit(void (*fn)()) {
+    // Minimum target duration to limit impact of clock precision
+    constexpr double targetDurationPerTest = 0.050;
+    // warm-up samples used to determine the number of cycles needed to reach
+    // the target duration
+    constexpr int testSamples = 2;
+    // cycles needed to include CPU-GPU overlapping (if present)
+    constexpr int minCycles = 3;
+    // initial cycles used for the test samples
+    int cycles = minCycles;
+    // total number of real samples taken, of which the median is returned
+    constexpr int nrSamples = 10;
 
-    double seconds = std::max(sample_time, min_time); // at least minimum time
-    double elapsed = 0;
-    while (elapsed + sample_time < seconds) {
-        int r = ceilf((seconds - elapsed) / sample_time);
-        timer start = timer::start();
-        for (int i = 0; i < r; ++i)
-            fn();
-        sync();
-        double t = timer::stop(start);
-        elapsed += t;
-        sample_time = std::min(sample_time, t / r);
+    std::array<double, nrSamples> X;
+    for (int s = -testSamples; s < nrSamples; ++s) {
+        af::sync();
+        af::timer start = af::timer::start();
+        for (int i = cycles; i > 0; --i) { fn(); }
+        af::sync();
+        const double time = af::timer::stop(start);
+        if (s >= 0) {
+            // real sample, so store it for later processing
+            X[s] = time;
+        } else {
+            // test sample, so refine the number of cycles
+            cycles = std::max(
+                minCycles,
+                static_cast<int>(trunc(targetDurationPerTest / time * cycles)));
+        }
     }
-    return sample_time;
+    std::sort(X.begin(), X.end());
+    // returns the median (instead of the mean) to limit impact of outliers
+    return X[nrSamples / 2] / cycles;
 }
 
-} // namespace af
+}  // namespace af
diff --git a/src/api/cpp/topk.cpp b/src/api/cpp/topk.cpp
new file mode 100644
index 0000000000..55ebbe42cd
--- /dev/null
+++ b/src/api/cpp/topk.cpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/statistics.h>
+#include "common.hpp"
+#include "error.hpp"
+
+namespace af {
+void topk(array &values, array &indices, const array &in, const int k,
+          const int dim, const topkFunction order) {
+    af_array af_vals = 0;
+    af_array af_idxs = 0;
+
+    AF_THROW(af_topk(&af_vals, &af_idxs, in.get(), k, dim, order));
+
+    values  = array(af_vals);
+    indices = array(af_idxs);
+}
+}  // namespace af
diff --git a/src/api/cpp/transform.cpp b/src/api/cpp/transform.cpp
index d2f369fffa..550841fa52 100644
--- a/src/api/cpp/transform.cpp
+++ b/src/api/cpp/transform.cpp
@@ -7,18 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array transform(const array& in, const array& transform, const dim_t odim0, const dim_t odim1, const interpType method, const bool inverse)
-{
+array transform(const array& in, const array& transform, const dim_t odim0,
+                const dim_t odim1, const interpType method,
+                const bool inverse) {
     af_array out = 0;
-    AF_THROW(af_transform(&out, in.get(), transform.get(), odim0, odim1, method, inverse));
+    AF_THROW(af_transform(&out, in.get(), transform.get(), odim0, odim1, method,
+                          inverse));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/transform_coordinates.cpp b/src/api/cpp/transform_coordinates.cpp
new file mode 100644
index 0000000000..3e4bb5500c
--- /dev/null
+++ b/src/api/cpp/transform_coordinates.cpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+
+array transformCoordinates(const array& tf, const float d0, const float d1) {
+    af_array out = 0;
+    AF_THROW(af_transform_coordinates(&out, tf.get(), d0, d1));
+    return array(out);
+}
+
+}  // namespace af
diff --git a/src/api/cpp/translate.cpp b/src/api/cpp/translate.cpp
index cfb79ef43b..de6908b735 100644
--- a/src/api/cpp/translate.cpp
+++ b/src/api/cpp/translate.cpp
@@ -7,18 +7,18 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <af/array.h>
+#include <af/image.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array translate(const array& in, const float trans0, const float trans1, const dim_t odim0, const dim_t odim1, const interpType method)
-{
+array translate(const array& in, const float trans0, const float trans1,
+                const dim_t odim0, const dim_t odim1, const interpType method) {
     af_array out = 0;
-    AF_THROW(af_translate(&out, in.get(), trans0, trans1, odim0, odim1, method));
+    AF_THROW(
+        af_translate(&out, in.get(), trans0, trans1, odim0, odim1, method));
     return array(out);
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/transpose.cpp b/src/api/cpp/transpose.cpp
index 3c4c87af1d..dc5905fa75 100644
--- a/src/api/cpp/transpose.cpp
+++ b/src/api/cpp/transpose.cpp
@@ -7,23 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/blas.h>
 #include <af/array.h>
+#include <af/blas.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
-array transpose(const array& in, const bool conjugate)
-{
+array transpose(const array& in, const bool conjugate) {
     af_array out = 0;
     AF_THROW(af_transpose(&out, in.get(), conjugate));
     return array(out);
 }
 
-void transposeInPlace(array& in, const bool conjugate)
-{
+void transposeInPlace(array& in, const bool conjugate) {
     AF_THROW(af_transpose_inplace(in.get(), conjugate));
 }
 
-}
+}  // namespace af
diff --git a/src/api/cpp/unary.cpp b/src/api/cpp/unary.cpp
index 5b3a2367aa..65e3cfd4c3 100644
--- a/src/api/cpp/unary.cpp
+++ b/src/api/cpp/unary.cpp
@@ -7,87 +7,85 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
 #include "error.hpp"
 
-namespace af
-{
+namespace af {
 
 #define af_complex(...) af_cplx(__VA_ARGS__)
 
-#define INSTANTIATE(func)                                           \
-    array func(const array &in)                                     \
-    {                                                               \
-        af_array out = 0;                                           \
-        AF_THROW(af_##func(&out, in.get()));                        \
-        return array(out);                                          \
-    }
-
-    INSTANTIATE(complex)
-    INSTANTIATE(real  )
-    INSTANTIATE(imag  )
-    INSTANTIATE(arg   )
-    INSTANTIATE(abs   )
-    INSTANTIATE(conjg )
-
-    INSTANTIATE(sign  )
-    INSTANTIATE(round )
-    INSTANTIATE(trunc )
-    INSTANTIATE(floor )
-    INSTANTIATE(ceil  )
-
-    INSTANTIATE(sin   )
-    INSTANTIATE(cos   )
-    INSTANTIATE(tan   )
-
-    INSTANTIATE(asin  )
-    INSTANTIATE(acos  )
-    INSTANTIATE(atan  )
-
-    INSTANTIATE(sinh  )
-    INSTANTIATE(cosh  )
-    INSTANTIATE(tanh  )
-
-    INSTANTIATE(asinh )
-    INSTANTIATE(acosh )
-    INSTANTIATE(atanh )
-
-    INSTANTIATE(pow2  )
-    INSTANTIATE(exp   )
-    INSTANTIATE(expm1 )
-    INSTANTIATE(erf   )
-    INSTANTIATE(erfc  )
-
-    INSTANTIATE(log   )
-    INSTANTIATE(log1p )
-    INSTANTIATE(log10 )
-    INSTANTIATE(log2  )
-
-    INSTANTIATE(sqrt  )
-    INSTANTIATE(cbrt  )
-
-    INSTANTIATE(iszero)
-
-    INSTANTIATE(factorial)
-    INSTANTIATE(tgamma)
-    INSTANTIATE(lgamma)
-
-    // isinf and isnan are defined by C++.
-    // Thus we need a difference nomenclature.
-    array isInf(const array &in)
-    {
-        af_array out = 0;
-        AF_THROW(af_isinf(&out, in.get()));
-        return array(out);
+#define INSTANTIATE(func)                    \
+    array func(const array &in) {            \
+        af_array out = 0;                    \
+        AF_THROW(af_##func(&out, in.get())); \
+        return array(out);                   \
     }
 
-    array isNaN(const array &in)
-    {
-        af_array out = 0;
-        AF_THROW(af_isnan(&out, in.get()));
-        return array(out);
-    }
+INSTANTIATE(complex)
+INSTANTIATE(real)
+INSTANTIATE(imag)
+INSTANTIATE(arg)
+INSTANTIATE(abs)
+INSTANTIATE(conjg)
+
+INSTANTIATE(sign)
+INSTANTIATE(round)
+INSTANTIATE(trunc)
+INSTANTIATE(floor)
+INSTANTIATE(ceil)
+
+INSTANTIATE(sin)
+INSTANTIATE(cos)
+INSTANTIATE(tan)
+
+INSTANTIATE(asin)
+INSTANTIATE(acos)
+INSTANTIATE(atan)
+
+INSTANTIATE(sinh)
+INSTANTIATE(cosh)
+INSTANTIATE(tanh)
+
+INSTANTIATE(asinh)
+INSTANTIATE(acosh)
+INSTANTIATE(atanh)
+
+INSTANTIATE(pow2)
+INSTANTIATE(exp)
+INSTANTIATE(expm1)
+INSTANTIATE(erf)
+INSTANTIATE(erfc)
+INSTANTIATE(sigmoid)
+
+INSTANTIATE(log)
+INSTANTIATE(log1p)
+INSTANTIATE(log10)
+INSTANTIATE(log2)
+
+INSTANTIATE(sqrt)
+INSTANTIATE(rsqrt)
+INSTANTIATE(cbrt)
+
+INSTANTIATE(iszero)
+
+INSTANTIATE(factorial)
+INSTANTIATE(tgamma)
+INSTANTIATE(lgamma)
+
+// isinf and isnan are defined by C++.
+// Thus we need a different nomenclature.
+array isInf(const array &in) {
+    af_array out = 0;
+    AF_THROW(af_isinf(&out, in.get()));
+    return array(out);
+}
 
+array isNaN(const array &in) {
+    af_array out = 0;
+    AF_THROW(af_isnan(&out, in.get()));
+    return array(out);
 }
+
+}  // namespace af
diff --git a/src/api/cpp/unwrap.cpp b/src/api/cpp/unwrap.cpp
new file mode 100644
index 0000000000..f0dc7d9803
--- /dev/null
+++ b/src/api/cpp/unwrap.cpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+array unwrap(const array& in, const dim_t wx, const dim_t wy, const dim_t sx,
+             const dim_t sy, const dim_t px, const dim_t py,
+             const bool is_column) {
+    af_array out = 0;
+    AF_THROW(af_unwrap(&out, in.get(), wx, wy, sx, sy, px, py, is_column));
+    return array(out);
+}
+}  // namespace af
diff --git a/src/api/cpp/util.cpp b/src/api/cpp/util.cpp
index 3705f50760..c2bf0c05bf 100644
--- a/src/api/cpp/util.cpp
+++ b/src/api/cpp/util.cpp
@@ -9,17 +9,60 @@
 
 #include <af/array.h>
 #include <af/util.h>
-#include "error.hpp"
 #include <cstdio>
+#include "error.hpp"
 
 using namespace std;
 
-namespace af
-{
-    void print(const char *exp, const array &arr)
-    {
-        printf("%s ", exp);
-        AF_THROW(af_print_array(arr.get()));
-        return;
-    }
+namespace af {
+void print(const char *exp, const array &arr) {
+    AF_THROW(af_print_array_gen(exp, arr.get(), 4));
+}
+
+void print(const char *exp, const array &arr, const int precision) {
+    AF_THROW(af_print_array_gen(exp, arr.get(), precision));
+}
+
+int saveArray(const char *key, const array &arr, const char *filename,
+              const bool append) {
+    int index = -1;
+    AF_THROW(af_save_array(&index, key, arr.get(), filename, append));
+    return index;
+}
+
+array readArray(const char *filename, const unsigned index) {
+    af_array out = 0;
+    AF_THROW(af_read_array_index(&out, filename, index));
+    return array(out);
+}
+
+array readArray(const char *filename, const char *key) {
+    af_array out = 0;
+    AF_THROW(af_read_array_key(&out, filename, key));
+    return array(out);
+}
+
+int readArrayCheck(const char *filename, const char *key) {
+    int out = -1;
+    AF_THROW(af_read_array_key_check(&out, filename, key));
+    return out;
+}
+
+void toString(char **output, const char *exp, const array &arr,
+              const int precision, const bool transpose) {
+    AF_THROW(af_array_to_string(output, exp, arr.get(), precision, transpose));
+}
+
+const char *toString(const char *exp, const array &arr, const int precision,
+                     const bool transpose) {
+    char *output = NULL;
+    AF_THROW(af_array_to_string(&output, exp, arr.get(), precision, transpose));
+    return output;
+}
+
+size_t getSizeOf(af::dtype type) {
+    size_t size = 0;
+    AF_THROW(af_get_size_of(&size, type));
+    return size;
 }
+}  // namespace af
diff --git a/src/api/cpp/var.cpp b/src/api/cpp/var.cpp
index 91c332d6a6..66f2d76252 100644
--- a/src/api/cpp/var.cpp
+++ b/src/api/cpp/var.cpp
@@ -7,82 +7,119 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <af/array.h>
 #include <af/dim4.hpp>
 #include <af/statistics.h>
-#include <af/array.h>
-#include "error.hpp"
 #include "common.hpp"
+#include "error.hpp"
+#include "half.hpp"
+#ifdef AF_CUDA
+#include <cuda_fp16.h>
+#include <traits.hpp>
+#endif
+
+namespace af {
 
-namespace af
-{
+array var(const array& in, const bool isbiased, const dim_t dim) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return var(in, bias, dim);
+}
 
-array var(const array& in, const bool isbiased, const dim_t dim)
-{
+array var(const array& in, const af_var_bias bias, const dim_t dim) {
     af_array temp = 0;
-    AF_THROW(af_var(&temp, in.get(), isbiased, getFNSD(dim, in.dims())));
+    AF_THROW(af_var_v2(&temp, in.get(), bias, getFNSD(dim, in.dims())));
     return array(temp);
 }
 
-array var(const array& in, const array &weights, const dim_t dim)
-{
+array var(const array& in, const array& weights, const dim_t dim) {
     af_array temp = 0;
-    AF_THROW(af_var_weighted(&temp, in.get(), weights.get(), getFNSD(dim, in.dims())));
+    AF_THROW(af_var_weighted(&temp, in.get(), weights.get(),
+                             getFNSD(dim, in.dims())));
     return array(temp);
 }
 
-#define INSTANTIATE_VAR(T)                                          \
-    template<> AFAPI T var(const array& in, const bool isbiased)    \
-    {                                                               \
-        double ret_val;                                             \
-        AF_THROW(af_var_all(&ret_val, NULL, in.get(), isbiased));   \
-        return (T) ret_val;                                         \
-    }                                                               \
-                                                                    \
-    template<> AFAPI T var(const array& in, const array &weights)   \
-    {                                                               \
-        double ret_val;                                             \
-        AF_THROW(af_var_all_weighted(&ret_val, NULL,                \
-                                     in.get(), weights.get()));     \
-        return (T) ret_val;                                         \
-    }                                                               \
+#define INSTANTIATE_VAR(T)                                                 \
+    template<>                                                             \
+    AFAPI T var(const array& in, const af_var_bias bias) {                 \
+        double ret_val;                                                    \
+        AF_THROW(af_var_all_v2(&ret_val, NULL, in.get(), bias));           \
+        return cast<T>(ret_val);                                           \
+    }                                                                      \
+    template<>                                                             \
+    AFAPI T var(const array& in, const bool isbiased) {                    \
+        const af_var_bias bias =                                           \
+            (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);      \
+        return var<T>(in, bias);                                           \
+    }                                                                      \
+                                                                           \
+    template<>                                                             \
+    AFAPI T var(const array& in, const array& weights) {                   \
+        double ret_val;                                                    \
+        AF_THROW(                                                          \
+            af_var_all_weighted(&ret_val, NULL, in.get(), weights.get())); \
+        return cast<T>(ret_val);                                           \
+    }
 
-template<> AFAPI af_cfloat var(const array& in, const bool isbiased)
-{
+template<>
+AFAPI af_cfloat var(const array& in, const af_var_bias bias) {
     double real, imag;
-    AF_THROW(af_var_all(&real, &imag, in.get(), isbiased));
-    return std::complex<float>((float)real, (float)imag);
+    AF_THROW(af_var_all_v2(&real, &imag, in.get(), bias));
+    return {static_cast<float>(real), static_cast<float>(imag)};
 }
 
-template<> AFAPI af_cdouble var(const array& in, const bool isbiased)
-{
+template<>
+AFAPI af_cdouble var(const array& in, const af_var_bias bias) {
     double real, imag;
-    AF_THROW(af_var_all(&real, &imag, in.get(), isbiased));
-    return std::complex<double>(real, imag);
+    AF_THROW(af_var_all_v2(&real, &imag, in.get(), bias));
+    return {real, imag};
 }
 
-template<> AFAPI af_cfloat var(const array& in, const array &weights)
-{
+template<>
+AFAPI af_cfloat var(const array& in, const bool isbiased) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return var<af_cfloat>(in, bias);
+}
+
+template<>
+AFAPI af_cdouble var(const array& in, const bool isbiased) {
+    const af_var_bias bias =
+        (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION);
+    return var<af_cdouble>(in, bias);
+}
+
+template<>
+AFAPI af_cfloat var(const array& in, const array& weights) {
     double real, imag;
     AF_THROW(af_var_all_weighted(&real, &imag, in.get(), weights.get()));
-    return std::complex<float>((float)real, (float)imag);
+    return {static_cast<float>(real), static_cast<float>(imag)};
 }
 
-template<> AFAPI af_cdouble var(const array& in, const array &weights)
-{
+template<>
+AFAPI af_cdouble var(const array& in, const array& weights) {
     double real, imag;
     AF_THROW(af_var_all_weighted(&real, &imag, in.get(), weights.get()));
-    return std::complex<double>(real, imag);
+    return {real, imag};
 }
 
 INSTANTIATE_VAR(float);
 INSTANTIATE_VAR(double);
 INSTANTIATE_VAR(int);
 INSTANTIATE_VAR(unsigned int);
-INSTANTIATE_VAR(intl);
-INSTANTIATE_VAR(uintl);
+INSTANTIATE_VAR(long long);
+INSTANTIATE_VAR(unsigned long long);
+INSTANTIATE_VAR(short);
+INSTANTIATE_VAR(unsigned short);
 INSTANTIATE_VAR(char);
+INSTANTIATE_VAR(signed char);
 INSTANTIATE_VAR(unsigned char);
+INSTANTIATE_VAR(af_half);
+INSTANTIATE_VAR(half_float::half);
+#ifdef AF_CUDA
+INSTANTIATE_VAR(__half);
+#endif
 
 #undef INSTANTIATE_VAR
 
-}
+}  // namespace af
diff --git a/src/api/cpp/where.cpp b/src/api/cpp/where.cpp
index aa1681842b..b68f0616c4 100644
--- a/src/api/cpp/where.cpp
+++ b/src/api/cpp/where.cpp
@@ -7,21 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <af/algorithm.h>
+#include <af/array.h>
 #include <af/gfor.h>
 #include "error.hpp"
 
-namespace af
-{
-    array where(const array& in)
-    {
-        if (gforGet()) {
-            AF_THROW_MSG("WHERE can not be used inside GFOR", AF_ERR_RUNTIME);
-        }
-
-        af_array out = 0;
-        AF_THROW(af_where(&out, in.get()));
-        return array(out);
+namespace af {
+array where(const array& in) {
+    if (gforGet()) {
+        AF_THROW_ERR("WHERE can not be used inside GFOR", AF_ERR_RUNTIME);
     }
+
+    af_array out = 0;
+    AF_THROW(af_where(&out, in.get()));
+    return array(out);
 }
+}  // namespace af
diff --git a/src/api/cpp/wrap.cpp b/src/api/cpp/wrap.cpp
new file mode 100644
index 0000000000..62194ff186
--- /dev/null
+++ b/src/api/cpp/wrap.cpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+array wrap(const array& in, const dim_t ox, const dim_t oy, const dim_t wx,
+           const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,
+           const dim_t py, const bool is_column) {
+    af_array out = 0;
+    AF_THROW(
+        af_wrap(&out, in.get(), ox, oy, wx, wy, sx, sy, px, py, is_column));
+    return array(out);
+}
+}  // namespace af
diff --git a/src/api/cpp/ycbcr_rgb.cpp b/src/api/cpp/ycbcr_rgb.cpp
new file mode 100644
index 0000000000..59c6ff7879
--- /dev/null
+++ b/src/api/cpp/ycbcr_rgb.cpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "error.hpp"
+
+namespace af {
+
+array ycbcr2rgb(const array& in, const YCCStd standard) {
+    af_array temp = 0;
+    AF_THROW(af_ycbcr2rgb(&temp, in.get(), standard));
+    return array(temp);
+}
+
+array rgb2ycbcr(const array& in, const YCCStd standard) {
+    af_array temp = 0;
+    AF_THROW(af_rgb2ycbcr(&temp, in.get(), standard));
+    return array(temp);
+}
+
+}  // namespace af
diff --git a/src/api/unified/CMakeLists.txt b/src/api/unified/CMakeLists.txt
new file mode 100644
index 0000000000..bd373acab8
--- /dev/null
+++ b/src/api/unified/CMakeLists.txt
@@ -0,0 +1,134 @@
+# Copyright (c) 2022, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+generate_product_version(af_unified_ver_res_file
+  FILE_NAME "af"
+  FILE_DESCRIPTION "Unified Backend Dynamic-link library"
+)
+
+add_library(af "")
+add_library(ArrayFire::af ALIAS af)
+
+target_sources(af
+  PRIVATE
+    ${af_unified_ver_res_file}
+    algorithm.cpp
+    arith.cpp
+    array.cpp
+    blas.cpp
+    data.cpp
+    device.cpp
+    error.cpp
+    event.cpp
+    features.cpp
+    graphics.cpp
+    image.cpp
+    index.cpp
+    internal.cpp
+    jit_test_api.cpp
+    lapack.cpp
+    memory.cpp
+    ml.cpp
+    moments.cpp
+    random.cpp
+    signal.cpp
+    sparse.cpp
+    statistics.cpp
+    symbol_manager.cpp
+    symbol_manager.hpp
+    util.cpp
+    vision.cpp
+
+    $<$<BOOL:${OpenCL_FOUND}>: ${CMAKE_CURRENT_SOURCE_DIR}/opencl.cpp>
+    $<$<BOOL:${CUDA_FOUND}>: ${CMAKE_CURRENT_SOURCE_DIR}/cuda.cpp>
+
+    ${ArrayFire_SOURCE_DIR}/src/api/c/type_util.cpp
+    ${ArrayFire_SOURCE_DIR}/src/api/c/version.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/Logger.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/Logger.hpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/constants.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/dim4.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/err_common.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/util.cpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/util.hpp
+    ${ArrayFire_SOURCE_DIR}/src/backend/common/deprecated.hpp
+  )
+
+if(WIN32)
+  target_sources(af
+    PRIVATE
+      ${ArrayFire_SOURCE_DIR}/src/backend/common/module_loading_windows.cpp)
+else()
+  target_sources(af
+    PRIVATE
+      ${ArrayFire_SOURCE_DIR}/src/backend/common/module_loading_unix.cpp)
+endif()
+
+target_compile_definitions(af PRIVATE AF_UNIFIED)
+
+target_include_directories(af
+  PUBLIC
+    $<BUILD_INTERFACE:${ArrayFire_SOURCE_DIR}/include>
+    $<BUILD_INTERFACE:${ArrayFire_BINARY_DIR}/include>
+    $<INSTALL_INTERFACE:${AF_INSTALL_INC_DIR}>
+  PRIVATE
+    ${ArrayFire_SOURCE_DIR}/src/api/c
+    ${ArrayFire_SOURCE_DIR}/src/api/unified)
+
+target_include_directories(af
+  SYSTEM PRIVATE
+    $<TARGET_PROPERTY:afcommon_interface,INTERFACE_INCLUDE_DIRECTORIES>
+    $<$<BOOL:${OpenCL_FOUND}>:$<TARGET_PROPERTY:OpenCL::OpenCL,INTERFACE_INCLUDE_DIRECTORIES>>
+    $<$<BOOL:${CUDA_FOUND}>:${CUDA_INCLUDE_DIRS}>
+  )
+
+target_link_libraries(af
+  PRIVATE
+    af_spdlog
+    cpp_api_interface
+    Threads::Threads
+    Boost::boost
+    ${CMAKE_DL_LIBS}
+  )
+
+if(TARGET fmt::fmt)
+  target_link_libraries(af
+    PRIVATE
+      fmt::fmt
+  )
+endif()
+
+install(TARGETS af
+  EXPORT ArrayFireUnifiedTargets
+  COMPONENT unified
+  PUBLIC_HEADER DESTINATION af
+  RUNTIME DESTINATION ${AF_INSTALL_BIN_DIR}
+  LIBRARY DESTINATION ${AF_INSTALL_LIB_DIR}
+  ARCHIVE DESTINATION ${AF_INSTALL_LIB_DIR}
+  FRAMEWORK DESTINATION framework
+  INCLUDES DESTINATION ${AF_INSTALL_INC_DIR}
+  )
+
+af_split_debug_info(af ${AF_INSTALL_LIB_DIR})
+
+# install(TARGETS af EXPORT AF DESTINATION "${AF_INSTALL_LIB_DIR}"
+#   COMPONENT libraries)
+#
+# if(APPLE)
+#   INSTALL(SCRIPT "${PROJECT_SOURCE_DIR}/CMakeModules/osx_install/InstallTool.cmake")
+# endif(APPLE)
+#
+# export(TARGETS af FILE ArrayFireUnified.cmake)
+# install(EXPORT AF DESTINATION "${AF_INSTALL_CMAKE_DIR}"
+#   COMPONENT cmake
+#   FILE ArrayFireUnified.cmake)
+
+source_group(include REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/include/*)
+source_group(source REGULAR_EXPRESSION ${CMAKE_CURRENT_SOURCE_DIR}/*|${ArrayFire_SOURCE_DIR}/src/backend/common/*)
+source_group(api\\cpp REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/cpp/*)
+source_group(api\\c REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/c/*)
+source_group("" FILES CMakeLists.txt)
diff --git a/src/api/unified/algorithm.cpp b/src/api/unified/algorithm.cpp
new file mode 100644
index 0000000000..8f990fb535
--- /dev/null
+++ b/src/api/unified/algorithm.cpp
@@ -0,0 +1,211 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/algorithm.h>
+#include <af/array.h>
+#include "symbol_manager.hpp"
+
+#define ALGO_HAPI_DEF(af_func)                                        \
+    af_err af_func(af_array *out, const af_array in, const int dim) { \
+        CHECK_ARRAYS(in);                                             \
+        CALL(af_func, out, in, dim);                                  \
+    }
+
+ALGO_HAPI_DEF(af_sum)
+ALGO_HAPI_DEF(af_product)
+ALGO_HAPI_DEF(af_min)
+ALGO_HAPI_DEF(af_max)
+ALGO_HAPI_DEF(af_all_true)
+ALGO_HAPI_DEF(af_any_true)
+ALGO_HAPI_DEF(af_count)
+ALGO_HAPI_DEF(af_accum)
+ALGO_HAPI_DEF(af_diff1)
+ALGO_HAPI_DEF(af_diff2)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF_BYKEY(af_func)                                          \
+    af_err af_func(af_array *keys_out, af_array *vals_out,                    \
+                   const af_array keys, const af_array vals, const int dim) { \
+        CHECK_ARRAYS(keys, vals);                                             \
+        CALL(af_func, keys_out, vals_out, keys, vals, dim);                   \
+    }
+
+ALGO_HAPI_DEF_BYKEY(af_sum_by_key)
+ALGO_HAPI_DEF_BYKEY(af_product_by_key)
+ALGO_HAPI_DEF_BYKEY(af_min_by_key)
+ALGO_HAPI_DEF_BYKEY(af_max_by_key)
+ALGO_HAPI_DEF_BYKEY(af_all_true_by_key)
+ALGO_HAPI_DEF_BYKEY(af_any_true_by_key)
+ALGO_HAPI_DEF_BYKEY(af_count_by_key)
+
+#undef ALGO_HAPI_DEF_BYKEY
+
+#define ALGO_HAPI_DEF(af_func_nan)                                      \
+    af_err af_func_nan(af_array *out, const af_array in, const int dim, \
+                       const double nanval) {                           \
+        CHECK_ARRAYS(in);                                               \
+        CALL(af_func_nan, out, in, dim, nanval);                        \
+    }
+
+ALGO_HAPI_DEF(af_sum_nan)
+ALGO_HAPI_DEF(af_product_nan)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF_BYKEY(af_func_nan)                                \
+    af_err af_func_nan(af_array *keys_out, af_array *vals_out,          \
+                       const af_array keys, const af_array vals,        \
+                       const int dim, const double nanval) {            \
+        CHECK_ARRAYS(keys, vals);                                       \
+        CALL(af_func_nan, keys_out, vals_out, keys, vals, dim, nanval); \
+    }
+
+ALGO_HAPI_DEF_BYKEY(af_sum_by_key_nan)
+ALGO_HAPI_DEF_BYKEY(af_product_by_key_nan)
+
+#undef ALGO_HAPI_DEF_BYKEY
+
+#define ALGO_HAPI_DEF(af_func_all)                                      \
+    af_err af_func_all(double *real, double *imag, const af_array in) { \
+        CHECK_ARRAYS(in);                                               \
+        CALL(af_func_all, real, imag, in);                              \
+    }
+
+ALGO_HAPI_DEF(af_sum_all)
+ALGO_HAPI_DEF(af_product_all)
+ALGO_HAPI_DEF(af_min_all)
+ALGO_HAPI_DEF(af_max_all)
+ALGO_HAPI_DEF(af_all_true_all)
+ALGO_HAPI_DEF(af_any_true_all)
+ALGO_HAPI_DEF(af_count_all)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF(af_func_nan_all)                                    \
+    af_err af_func_nan_all(double *real, double *imag, const af_array in, \
+                           const double nanval) {                         \
+        CHECK_ARRAYS(in);                                                 \
+        CALL(af_func_nan_all, real, imag, in, nanval);                    \
+    }
+
+ALGO_HAPI_DEF(af_sum_nan_all)
+ALGO_HAPI_DEF(af_product_nan_all)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF(af_ifunc)                                      \
+    af_err af_ifunc(af_array *out, af_array *idx, const af_array in, \
+                    const int dim) {                                 \
+        CHECK_ARRAYS(in);                                            \
+        CALL(af_ifunc, out, idx, in, dim);                           \
+    }
+
+ALGO_HAPI_DEF(af_imin)
+ALGO_HAPI_DEF(af_imax)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF(af_ifunc_all)                                \
+    af_err af_ifunc_all(double *real, double *imag, unsigned *idx, \
+                        const af_array in) {                       \
+        CHECK_ARRAYS(in);                                          \
+        CALL(af_ifunc_all, real, imag, idx, in);                   \
+    }
+
+ALGO_HAPI_DEF(af_imin_all)
+ALGO_HAPI_DEF(af_imax_all)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF(af_func)                         \
+    af_err af_func(af_array *out, const af_array in) { \
+        CHECK_ARRAYS(in);                              \
+        CALL(af_func, out, in);                        \
+    }
+
+ALGO_HAPI_DEF(af_sum_all_array)
+ALGO_HAPI_DEF(af_product_all_array)
+ALGO_HAPI_DEF(af_min_all_array)
+ALGO_HAPI_DEF(af_max_all_array)
+ALGO_HAPI_DEF(af_count_all_array)
+ALGO_HAPI_DEF(af_any_true_all_array)
+ALGO_HAPI_DEF(af_all_true_all_array)
+
+#undef ALGO_HAPI_DEF
+
+#define ALGO_HAPI_DEF(af_func)                                              \
+    af_err af_func(af_array *out, const af_array in, const double nanval) { \
+        CHECK_ARRAYS(in);                                                   \
+        CALL(af_func, out, in, nanval);                                     \
+    }
+
+ALGO_HAPI_DEF(af_sum_nan_all_array)
+ALGO_HAPI_DEF(af_product_nan_all_array)
+
+#undef ALGO_HAPI_DEF
+
+af_err af_where(af_array *idx, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_where, idx, in);
+}
+
+af_err af_scan(af_array *out, const af_array in, const int dim, af_binary_op op,
+               bool inclusive_scan) {
+    CHECK_ARRAYS(in);
+    CALL(af_scan, out, in, dim, op, inclusive_scan);
+}
+
+af_err af_scan_by_key(af_array *out, const af_array key, const af_array in,
+                      const int dim, af_binary_op op, bool inclusive_scan) {
+    CHECK_ARRAYS(in, key);
+    CALL(af_scan_by_key, out, key, in, dim, op, inclusive_scan);
+}
+
+af_err af_sort(af_array *out, const af_array in, const unsigned dim,
+               const bool isAscending) {
+    CHECK_ARRAYS(in);
+    CALL(af_sort, out, in, dim, isAscending);
+}
+
+af_err af_sort_index(af_array *out, af_array *indices, const af_array in,
+                     const unsigned dim, const bool isAscending) {
+    CHECK_ARRAYS(in);
+    CALL(af_sort_index, out, indices, in, dim, isAscending);
+}
+
+af_err af_sort_by_key(af_array *out_keys, af_array *out_values,
+                      const af_array keys, const af_array values,
+                      const unsigned dim, const bool isAscending) {
+    CHECK_ARRAYS(keys, values);
+    CALL(af_sort_by_key, out_keys, out_values, keys, values, dim, isAscending);
+}
+
+af_err af_set_unique(af_array *out, const af_array in, const bool is_sorted) {
+    CHECK_ARRAYS(in);
+    CALL(af_set_unique, out, in, is_sorted);
+}
+
+af_err af_set_union(af_array *out, const af_array first, const af_array second,
+                    const bool is_unique) {
+    CHECK_ARRAYS(first, second);
+    CALL(af_set_union, out, first, second, is_unique);
+}
+
+af_err af_set_intersect(af_array *out, const af_array first,
+                        const af_array second, const bool is_unique) {
+    CHECK_ARRAYS(first, second);
+    CALL(af_set_intersect, out, first, second, is_unique);
+}
+
+af_err af_max_ragged(af_array *vals, af_array *idx, const af_array in,
+                     const af_array ragged_len, const int dim) {
+    CHECK_ARRAYS(in, ragged_len);
+    CALL(af_max_ragged, vals, idx, in, ragged_len, dim);
+}
diff --git a/src/api/unified/arith.cpp b/src/api/unified/arith.cpp
new file mode 100644
index 0000000000..03638fdde3
--- /dev/null
+++ b/src/api/unified/arith.cpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/arith.h>
+#include <af/array.h>
+#include "symbol_manager.hpp"
+
+#define BINARY_HAPI_DEF(af_func)                                          \
+    af_err af_func(af_array* out, const af_array lhs, const af_array rhs, \
+                   const bool batchMode) {                                \
+        CHECK_ARRAYS(lhs, rhs);                                           \
+        CALL(af_func, out, lhs, rhs, batchMode);                          \
+    }
+
+BINARY_HAPI_DEF(af_add)
+BINARY_HAPI_DEF(af_mul)
+BINARY_HAPI_DEF(af_sub)
+BINARY_HAPI_DEF(af_div)
+BINARY_HAPI_DEF(af_maxof)
+BINARY_HAPI_DEF(af_minof)
+BINARY_HAPI_DEF(af_rem)
+BINARY_HAPI_DEF(af_mod)
+BINARY_HAPI_DEF(af_pow)
+BINARY_HAPI_DEF(af_root)
+BINARY_HAPI_DEF(af_atan2)
+BINARY_HAPI_DEF(af_cplx2)
+BINARY_HAPI_DEF(af_eq)
+BINARY_HAPI_DEF(af_neq)
+BINARY_HAPI_DEF(af_gt)
+BINARY_HAPI_DEF(af_ge)
+BINARY_HAPI_DEF(af_lt)
+BINARY_HAPI_DEF(af_le)
+BINARY_HAPI_DEF(af_and)
+BINARY_HAPI_DEF(af_or)
+BINARY_HAPI_DEF(af_bitand)
+BINARY_HAPI_DEF(af_bitor)
+BINARY_HAPI_DEF(af_bitxor)
+BINARY_HAPI_DEF(af_bitshiftl)
+BINARY_HAPI_DEF(af_bitshiftr)
+BINARY_HAPI_DEF(af_hypot)
+
+af_err af_cast(af_array* out, const af_array in, const af_dtype type) {
+    CHECK_ARRAYS(in);
+    CALL(af_cast, out, in, type);
+}
+
+#define UNARY_HAPI_DEF(af_func)                        \
+    af_err af_func(af_array* out, const af_array in) { \
+        CHECK_ARRAYS(in);                              \
+        CALL(af_func, out, in);                        \
+    }
+
+UNARY_HAPI_DEF(af_abs)
+UNARY_HAPI_DEF(af_arg)
+UNARY_HAPI_DEF(af_sign)
+UNARY_HAPI_DEF(af_round)
+UNARY_HAPI_DEF(af_trunc)
+UNARY_HAPI_DEF(af_floor)
+UNARY_HAPI_DEF(af_ceil)
+UNARY_HAPI_DEF(af_sin)
+UNARY_HAPI_DEF(af_cos)
+UNARY_HAPI_DEF(af_tan)
+UNARY_HAPI_DEF(af_asin)
+UNARY_HAPI_DEF(af_acos)
+UNARY_HAPI_DEF(af_atan)
+UNARY_HAPI_DEF(af_cplx)
+UNARY_HAPI_DEF(af_real)
+UNARY_HAPI_DEF(af_imag)
+UNARY_HAPI_DEF(af_conjg)
+UNARY_HAPI_DEF(af_sinh)
+UNARY_HAPI_DEF(af_cosh)
+UNARY_HAPI_DEF(af_tanh)
+UNARY_HAPI_DEF(af_asinh)
+UNARY_HAPI_DEF(af_acosh)
+UNARY_HAPI_DEF(af_atanh)
+UNARY_HAPI_DEF(af_pow2)
+UNARY_HAPI_DEF(af_exp)
+UNARY_HAPI_DEF(af_sigmoid)
+UNARY_HAPI_DEF(af_expm1)
+UNARY_HAPI_DEF(af_erf)
+UNARY_HAPI_DEF(af_erfc)
+UNARY_HAPI_DEF(af_log)
+UNARY_HAPI_DEF(af_log1p)
+UNARY_HAPI_DEF(af_log10)
+UNARY_HAPI_DEF(af_log2)
+UNARY_HAPI_DEF(af_sqrt)
+UNARY_HAPI_DEF(af_rsqrt)
+UNARY_HAPI_DEF(af_cbrt)
+UNARY_HAPI_DEF(af_factorial)
+UNARY_HAPI_DEF(af_tgamma)
+UNARY_HAPI_DEF(af_lgamma)
+UNARY_HAPI_DEF(af_iszero)
+UNARY_HAPI_DEF(af_isinf)
+UNARY_HAPI_DEF(af_isnan)
+UNARY_HAPI_DEF(af_not)
+UNARY_HAPI_DEF(af_bitnot)
+
+af_err af_clamp(af_array* out, const af_array in, const af_array lo,
+                const af_array hi, const bool batch) {
+    CHECK_ARRAYS(in, lo, hi);
+    CALL(af_clamp, out, in, lo, hi, batch);
+}
diff --git a/src/api/unified/array.cpp b/src/api/unified/array.cpp
new file mode 100644
index 0000000000..d68f54f84c
--- /dev/null
+++ b/src/api/unified/array.cpp
@@ -0,0 +1,110 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/backend.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_array(af_array *arr, const void *const data,
+                       const unsigned ndims, const dim_t *const dims,
+                       const af_dtype type) {
+    CALL(af_create_array, arr, data, ndims, dims, type);
+}
+
+af_err af_create_handle(af_array *arr, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype type) {
+    CALL(af_create_handle, arr, ndims, dims, type);
+}
+
+af_err af_copy_array(af_array *arr, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_copy_array, arr, in);
+}
+
+af_err af_write_array(af_array arr, const void *data, const size_t bytes,
+                      af_source src) {
+    CHECK_ARRAYS(arr);
+    CALL(af_write_array, arr, data, bytes, src);
+}
+
+af_err af_get_data_ptr(void *data, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_data_ptr, data, arr);
+}
+
+af_err af_release_array(af_array arr) {
+    if (arr) {
+        CALL(af_release_array, arr);
+    } else {
+        return AF_SUCCESS;
+    }
+}
+
+af_err af_retain_array(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_retain_array, out, in);
+}
+
+af_err af_get_data_ref_count(int *use_count, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_get_data_ref_count, use_count, in);
+}
+
+af_err af_eval(af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_eval, in);
+}
+
+af_err af_get_elements(dim_t *elems, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_elements, elems, arr);
+}
+
+af_err af_get_type(af_dtype *type, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_type, type, arr);
+}
+
+af_err af_get_dims(dim_t *d0, dim_t *d1, dim_t *d2, dim_t *d3,
+                   const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_dims, d0, d1, d2, d3, arr);
+}
+
+af_err af_get_numdims(unsigned *result, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_numdims, result, arr);
+}
+
+#define ARRAY_HAPI_DEF(af_func)                        \
+    af_err af_func(bool *result, const af_array arr) { \
+        CHECK_ARRAYS(arr);                             \
+        CALL(af_func, result, arr);                    \
+    }
+
+ARRAY_HAPI_DEF(af_is_empty)
+ARRAY_HAPI_DEF(af_is_scalar)
+ARRAY_HAPI_DEF(af_is_row)
+ARRAY_HAPI_DEF(af_is_column)
+ARRAY_HAPI_DEF(af_is_vector)
+ARRAY_HAPI_DEF(af_is_complex)
+ARRAY_HAPI_DEF(af_is_real)
+ARRAY_HAPI_DEF(af_is_double)
+ARRAY_HAPI_DEF(af_is_single)
+ARRAY_HAPI_DEF(af_is_half)
+ARRAY_HAPI_DEF(af_is_realfloating)
+ARRAY_HAPI_DEF(af_is_floating)
+ARRAY_HAPI_DEF(af_is_integer)
+ARRAY_HAPI_DEF(af_is_bool)
+ARRAY_HAPI_DEF(af_is_sparse)
+
+af_err af_get_scalar(void *output_value, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_scalar, output_value, arr);
+}
diff --git a/src/api/unified/blas.cpp b/src/api/unified/blas.cpp
new file mode 100644
index 0000000000..843fa8da35
--- /dev/null
+++ b/src/api/unified/blas.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/blas.h>
+#include "symbol_manager.hpp"
+
+AFAPI af_err af_gemm(af_array *out, const af_mat_prop optLhs,
+                     const af_mat_prop optRhs, const void *alpha,
+                     const af_array lhs, const af_array rhs, const void *beta) {
+    CHECK_ARRAYS(out, lhs, rhs);
+    CALL(af_gemm, out, optLhs, optRhs, alpha, lhs, rhs, beta);
+}
+
+af_err af_matmul(af_array *out, const af_array lhs, const af_array rhs,
+                 const af_mat_prop optLhs, const af_mat_prop optRhs) {
+    CHECK_ARRAYS(lhs, rhs);
+    CALL(af_matmul, out, lhs, rhs, optLhs, optRhs);
+}
+
+af_err af_dot(af_array *out, const af_array lhs, const af_array rhs,
+              const af_mat_prop optLhs, const af_mat_prop optRhs) {
+    CHECK_ARRAYS(lhs, rhs);
+    CALL(af_dot, out, lhs, rhs, optLhs, optRhs);
+}
+
+af_err af_dot_all(double *rval, double *ival, const af_array lhs,
+                  const af_array rhs, const af_mat_prop optLhs,
+                  const af_mat_prop optRhs) {
+    CHECK_ARRAYS(lhs, rhs);
+    CALL(af_dot_all, rval, ival, lhs, rhs, optLhs, optRhs);
+}
+
+af_err af_transpose(af_array *out, af_array in, const bool conjugate) {
+    CHECK_ARRAYS(in);
+    CALL(af_transpose, out, in, conjugate);
+}
+
+af_err af_transpose_inplace(af_array in, const bool conjugate) {
+    CHECK_ARRAYS(in);
+    CALL(af_transpose_inplace, in, conjugate);
+}
diff --git a/src/api/unified/cuda.cpp b/src/api/unified/cuda.cpp
new file mode 100644
index 0000000000..47d087e301
--- /dev/null
+++ b/src/api/unified/cuda.cpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/backend.h>
+#include "symbol_manager.hpp"
+
+#define AF_DEFINE_CUDA_TYPES
+#include <af/cuda.h>
+
+af_err afcu_get_stream(cudaStream_t* stream, int id) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_CUDA) { CALL(afcu_get_stream, stream, id); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcu_get_native_id(int* nativeid, int id) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_CUDA) { CALL(afcu_get_native_id, nativeid, id); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcu_set_native_id(int nativeid) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_CUDA) { CALL(afcu_set_native_id, nativeid); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcu_cublasSetMathMode(cublasMath_t mode) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_CUDA) { CALL(afcu_cublasSetMathMode, mode); }
+    return AF_ERR_NOT_SUPPORTED;
+}
diff --git a/src/api/unified/data.cpp b/src/api/unified/data.cpp
new file mode 100644
index 0000000000..3fb7312fdd
--- /dev/null
+++ b/src/api/unified/data.cpp
@@ -0,0 +1,186 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/data.h>
+#include "symbol_manager.hpp"
+
+af_err af_constant(af_array *result, const double value, const unsigned ndims,
+                   const dim_t *const dims, const af_dtype type) {
+    CALL(af_constant, result, value, ndims, dims, type);
+}
+
+af_err af_constant_complex(af_array *arr, const double real, const double imag,
+                           const unsigned ndims, const dim_t *const dims,
+                           const af_dtype type) {
+    CALL(af_constant_complex, arr, real, imag, ndims, dims, type);
+}
+
+af_err af_constant_long(af_array *arr, const long long val,
+                        const unsigned ndims, const dim_t *const dims) {
+    CALL(af_constant_long, arr, val, ndims, dims);
+}
+
+af_err af_constant_ulong(af_array *arr, const unsigned long long val,
+                         const unsigned ndims, const dim_t *const dims) {
+    CALL(af_constant_ulong, arr, val, ndims, dims);
+}
+
+af_err af_range(af_array *out, const unsigned ndims, const dim_t *const dims,
+                const int seq_dim, const af_dtype type) {
+    CALL(af_range, out, ndims, dims, seq_dim, type);
+}
+
+af_err af_iota(af_array *out, const unsigned ndims, const dim_t *const dims,
+               const unsigned t_ndims, const dim_t *const tdims,
+               const af_dtype type) {
+    CALL(af_iota, out, ndims, dims, t_ndims, tdims, type);
+}
+
+af_err af_identity(af_array *out, const unsigned ndims, const dim_t *const dims,
+                   const af_dtype type) {
+    CALL(af_identity, out, ndims, dims, type);
+}
+
+af_err af_diag_create(af_array *out, const af_array in, const int num) {
+    CHECK_ARRAYS(in);
+    CALL(af_diag_create, out, in, num);
+}
+
+af_err af_diag_extract(af_array *out, const af_array in, const int num) {
+    CHECK_ARRAYS(in);
+    CALL(af_diag_extract, out, in, num);
+}
+
+af_err af_join(af_array *out, const int dim, const af_array first,
+               const af_array second) {
+    CHECK_ARRAYS(first, second);
+    CALL(af_join, out, dim, first, second);
+}
+
+af_err af_join_many(af_array *out, const int dim, const unsigned n_arrays,
+                    const af_array *inputs) {
+    for (unsigned i = 0; i < n_arrays; i++) { CHECK_ARRAYS(inputs[i]); }
+    CALL(af_join_many, out, dim, n_arrays, inputs);
+}
+
+af_err af_tile(af_array *out, const af_array in, const unsigned x,
+               const unsigned y, const unsigned z, const unsigned w) {
+    CHECK_ARRAYS(in);
+    CALL(af_tile, out, in, x, y, z, w);
+}
+
+af_err af_reorder(af_array *out, const af_array in, const unsigned x,
+                  const unsigned y, const unsigned z, const unsigned w) {
+    CHECK_ARRAYS(in);
+    CALL(af_reorder, out, in, x, y, z, w);
+}
+
+af_err af_shift(af_array *out, const af_array in, const int x, const int y,
+                const int z, const int w) {
+    CHECK_ARRAYS(in);
+    CALL(af_shift, out, in, x, y, z, w);
+}
+
+af_err af_moddims(af_array *out, const af_array in, const unsigned ndims,
+                  const dim_t *const dims) {
+    CHECK_ARRAYS(in);
+    CALL(af_moddims, out, in, ndims, dims);
+}
+
+af_err af_flat(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_flat, out, in);
+}
+
+af_err af_flip(af_array *out, const af_array in, const unsigned dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_flip, out, in, dim);
+}
+
+af_err af_lower(af_array *out, const af_array in, bool is_unit_diag) {
+    CHECK_ARRAYS(in);
+    CALL(af_lower, out, in, is_unit_diag);
+}
+
+af_err af_upper(af_array *out, const af_array in, bool is_unit_diag) {
+    CHECK_ARRAYS(in);
+    CALL(af_upper, out, in, is_unit_diag);
+}
+
+af_err af_select(af_array *out, const af_array cond, const af_array a,
+                 const af_array b) {
+    CHECK_ARRAYS(cond, a, b);
+    CALL(af_select, out, cond, a, b);
+}
+
+af_err af_select_scalar_r(af_array *out, const af_array cond, const af_array a,
+                          const double b) {
+    CHECK_ARRAYS(cond, a);
+    CALL(af_select_scalar_r, out, cond, a, b);
+}
+
+af_err af_select_scalar_l(af_array *out, const af_array cond, const double a,
+                          const af_array b) {
+    CHECK_ARRAYS(cond, b);
+    CALL(af_select_scalar_l, out, cond, a, b);
+}
+
+af_err af_replace(af_array a, const af_array cond, const af_array b) {
+    CHECK_ARRAYS(a, cond, b);
+    CALL(af_replace, a, cond, b);
+}
+
+af_err af_replace_scalar(af_array a, const af_array cond, const double b) {
+    CHECK_ARRAYS(a, cond);
+    CALL(af_replace_scalar, a, cond, b);
+}
+
+af_err af_pad(af_array *out, const af_array in, const unsigned b_ndims,
+              const dim_t *const b_dims, const unsigned e_ndims,
+              const dim_t *const e_dims, const af_border_type ptype) {
+    CHECK_ARRAYS(in);
+    CALL(af_pad, out, in, b_ndims, b_dims, e_ndims, e_dims, ptype);
+}
+
+af_err af_replace_scalar_long(af_array a, const af_array cond,
+                              const long long b) {
+    CHECK_ARRAYS(a, cond);
+    CALL(af_replace_scalar_long, a, cond, b);
+}
+
+af_err af_replace_scalar_ulong(af_array a, const af_array cond,
+                               const unsigned long long b) {
+    CHECK_ARRAYS(a, cond);
+    CALL(af_replace_scalar_ulong, a, cond, b);
+}
+
+af_err af_select_scalar_r_long(af_array *out, const af_array cond,
+                               const af_array a, const long long b) {
+    CHECK_ARRAYS(cond, a);
+    CALL(af_select_scalar_r_long, out, cond, a, b);
+}
+
+af_err af_select_scalar_r_ulong(af_array *out, const af_array cond,
+                                const af_array a, const unsigned long long b) {
+    CHECK_ARRAYS(cond, a);
+    CALL(af_select_scalar_r_ulong, out, cond, a, b);
+}
+
+af_err af_select_scalar_l_long(af_array *out, const af_array cond,
+                               const long long a, const af_array b) {
+    CHECK_ARRAYS(cond, b);
+    CALL(af_select_scalar_l_long, out, cond, a, b);
+}
+
+af_err af_select_scalar_l_ulong(af_array *out, const af_array cond,
+                                const unsigned long long a, const af_array b) {
+    CHECK_ARRAYS(cond, b);
+    CALL(af_select_scalar_l_ulong, out, cond, a, b);
+}
diff --git a/src/api/unified/device.cpp b/src/api/unified/device.cpp
new file mode 100644
index 0000000000..96b14d621e
--- /dev/null
+++ b/src/api/unified/device.cpp
@@ -0,0 +1,191 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/deprecated.hpp>
+#include <af/array.h>
+#include <af/backend.h>
+#include <af/device.h>
+#include "symbol_manager.hpp"
+
+af_err af_set_backend(const af_backend bknd) {
+    return arrayfire::unified::setBackend(bknd);
+}
+
+af_err af_get_backend_count(unsigned *num_backends) {
+    *num_backends =
+        arrayfire::unified::AFSymbolManager::getInstance().getBackendCount();
+    return AF_SUCCESS;
+}
+
+af_err af_get_available_backends(int *result) {
+    *result = arrayfire::unified::AFSymbolManager::getInstance()
+                  .getAvailableBackends();
+    return AF_SUCCESS;
+}
+
+af_err af_get_backend_id(af_backend *result, const af_array in) {
+    // DO NOT CALL CHECK_ARRAYS HERE.
+    // IT WILL RESULT IN AN INFINITE RECURSION
+    CALL(af_get_backend_id, result, in);
+}
+
+af_err af_get_device_id(int *device, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_get_device_id, device, in);
+}
+
+af_err af_get_active_backend(af_backend *result) {
+    *result = arrayfire::unified::getActiveBackend();
+    return AF_SUCCESS;
+}
+
+af_err af_info() { CALL_NO_PARAMS(af_info); }
+
+af_err af_init() { CALL_NO_PARAMS(af_init); }
+
+af_err af_info_string(char **str, const bool verbose) {
+    CALL(af_info_string, str, verbose);
+}
+
+af_err af_device_info(char *d_name, char *d_platform, char *d_toolkit,
+                      char *d_compute) {
+    CALL(af_device_info, d_name, d_platform, d_toolkit, d_compute);
+}
+
+af_err af_get_device_count(int *num_of_devices) {
+    CALL(af_get_device_count, num_of_devices);
+}
+
+af_err af_get_dbl_support(bool *available, const int device) {
+    CALL(af_get_dbl_support, available, device);
+}
+
+af_err af_get_half_support(bool *available, const int device) {
+    CALL(af_get_half_support, available, device);
+}
+
+af_err af_set_device(const int device) { CALL(af_set_device, device); }
+
+af_err af_get_device(int *device) { CALL(af_get_device, device); }
+
+af_err af_sync(const int device) { CALL(af_sync, device); }
+
+af_err af_alloc_device(void **ptr, const dim_t bytes) {
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_alloc_device, ptr, bytes);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_alloc_device_v2(void **ptr, const dim_t bytes) {
+    CALL(af_alloc_device_v2, ptr, bytes);
+}
+
+af_err af_alloc_pinned(void **ptr, const dim_t bytes) {
+    CALL(af_alloc_pinned, ptr, bytes);
+}
+
+af_err af_free_device(void *ptr) {
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_free_device, ptr);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_free_device_v2(void *ptr) { CALL(af_free_device_v2, ptr); }
+
+af_err af_free_pinned(void *ptr) { CALL(af_free_pinned, ptr); }
+
+af_err af_alloc_host(void **ptr, const dim_t bytes) {
+    *ptr = malloc(bytes);  // NOLINT(hicpp-no-malloc)
+    return (*ptr == NULL) ? AF_ERR_NO_MEM : AF_SUCCESS;
+}
+
+af_err af_free_host(void *ptr) {
+    free(ptr);  // NOLINT(hicpp-no-malloc)
+    return AF_SUCCESS;
+}
+
+af_err af_device_array(af_array *arr, void *data, const unsigned ndims,
+                       const dim_t *const dims, const af_dtype type) {
+    CALL(af_device_array, arr, data, ndims, dims, type);
+}
+
+af_err af_device_mem_info(size_t *alloc_bytes, size_t *alloc_buffers,
+                          size_t *lock_bytes, size_t *lock_buffers) {
+    CALL(af_device_mem_info, alloc_bytes, alloc_buffers, lock_bytes,
+         lock_buffers);
+}
+
+af_err af_print_mem_info(const char *msg, const int device_id) {
+    CALL(af_print_mem_info, msg, device_id);
+}
+
+af_err af_device_gc() { CALL_NO_PARAMS(af_device_gc); }
+
+af_err af_set_mem_step_size(const size_t step_bytes) {
+    CALL(af_set_mem_step_size, step_bytes);
+}
+
+af_err af_get_mem_step_size(size_t *step_bytes) {
+    CALL(af_get_mem_step_size, step_bytes);
+}
+
+af_err af_lock_device_ptr(const af_array arr) {
+    CHECK_ARRAYS(arr);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_lock_device_ptr, arr);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_unlock_device_ptr(const af_array arr) {
+    CHECK_ARRAYS(arr);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_unlock_device_ptr, arr);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_lock_array(const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_lock_array, arr);
+}
+
+af_err af_unlock_array(const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_unlock_array, arr);
+}
+
+af_err af_is_locked_array(bool *res, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_is_locked_array, res, arr);
+}
+
+af_err af_get_device_ptr(void **ptr, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_device_ptr, ptr, arr);
+}
+
+af_err af_eval_multiple(const int num, af_array *arrays) {
+    for (int i = 0; i < num; i++) { CHECK_ARRAYS(arrays[i]); }
+    CALL(af_eval_multiple, num, arrays);
+}
+
+af_err af_set_manual_eval_flag(bool flag) {
+    CALL(af_set_manual_eval_flag, flag);
+}
+
+af_err af_get_manual_eval_flag(bool *flag) {
+    CALL(af_get_manual_eval_flag, flag);
+}
+
+af_err af_set_kernel_cache_directory(const char *path, int override_eval) {
+    CALL(af_set_kernel_cache_directory, path, override_eval);
+}
+
+af_err af_get_kernel_cache_directory(size_t *length, char *path) {
+    CALL(af_get_kernel_cache_directory, length, path);
+}
diff --git a/src/api/unified/error.cpp b/src/api/unified/error.cpp
new file mode 100644
index 0000000000..24a2dbfac9
--- /dev/null
+++ b/src/api/unified/error.cpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/device.h>
+#include <af/exception.h>
+#include <af/util.h>
+#include <algorithm>
+#include "symbol_manager.hpp"
+
+void af_get_last_error(char **str, dim_t *len) {
+    // Set error message from unified backend
+    std::string &global_error_string = get_global_error_string();
+    dim_t slen =
+        std::min(MAX_ERR_SIZE, static_cast<int>(global_error_string.size()));
+
+    // If this is true, the error is coming from the unified backend.
+    if (slen != 0) {
+        if (len && slen == 0) {
+            *len = 0;
+            *str = NULL;
+            return;
+        }
+
+        void *in = nullptr;
+        af_alloc_host(&in, sizeof(char) * (slen + 1));
+        memcpy(str, &in, sizeof(void *));
+        global_error_string.copy(*str, slen);
+
+        (*str)[slen]        = '\0';
+        global_error_string = std::string("");
+
+        if (len) { *len = slen; }
+    } else {
+        // If false, the error is coming from the active backend.
+        typedef void (*af_func)(char **, dim_t *);
+        void *vfn    = LOAD_SYMBOL();
+        af_func func = nullptr;
+        memcpy(&func, &vfn, sizeof(void *));
+        func(str, len);
+    }
+}
+
+af_err af_set_enable_stacktrace(int is_enabled) {
+    CALL(af_set_enable_stacktrace, is_enabled);
+}
diff --git a/src/api/unified/event.cpp b/src/api/unified/event.cpp
new file mode 100644
index 0000000000..8e3c45f6c0
--- /dev/null
+++ b/src/api/unified/event.cpp
@@ -0,0 +1,31 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/event.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_event(af_event* eventHandle) {
+    CALL(af_create_event, eventHandle);
+}
+
+af_err af_delete_event(af_event eventHandle) {
+    CALL(af_delete_event, eventHandle);
+}
+
+af_err af_mark_event(const af_event eventHandle) {
+    CALL(af_mark_event, eventHandle);
+}
+
+af_err af_enqueue_wait_event(const af_event eventHandle) {
+    CALL(af_enqueue_wait_event, eventHandle);
+}
+
+af_err af_block_event(const af_event eventHandle) {
+    CALL(af_block_event, eventHandle);
+}
diff --git a/src/api/unified/features.cpp b/src/api/unified/features.cpp
new file mode 100644
index 0000000000..57c8d01982
--- /dev/null
+++ b/src/api/unified/features.cpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/features.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_features(af_features *feat, dim_t num) {
+    CALL(af_create_features, feat, num);
+}
+
+af_err af_retain_features(af_features *out, const af_features feat) {
+    CALL(af_retain_features, out, feat);
+}
+
+af_err af_get_features_num(dim_t *num, const af_features feat) {
+    CALL(af_get_features_num, num, feat);
+}
+
+#define FEAT_HAPI_DEF(af_func)                              \
+    af_err af_func(af_array *out, const af_features feat) { \
+        CALL(af_func, out, feat);                           \
+    }
+
+FEAT_HAPI_DEF(af_get_features_xpos)
+FEAT_HAPI_DEF(af_get_features_ypos)
+FEAT_HAPI_DEF(af_get_features_score)
+FEAT_HAPI_DEF(af_get_features_orientation)
+FEAT_HAPI_DEF(af_get_features_size)
+
+af_err af_release_features(af_features feat) {
+    CALL(af_release_features, feat);
+}
diff --git a/src/api/unified/graphics.cpp b/src/api/unified/graphics.cpp
new file mode 100644
index 0000000000..49fb036457
--- /dev/null
+++ b/src/api/unified/graphics.cpp
@@ -0,0 +1,205 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/deprecated.hpp>
+#include <af/array.h>
+#include <af/graphics.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_window(af_window* out, const int width, const int height,
+                        const char* const title) {
+    CALL(af_create_window, out, width, height, title);
+}
+
+af_err af_set_position(const af_window wind, const unsigned x,
+                       const unsigned y) {
+    CALL(af_set_position, wind, x, y);
+}
+
+af_err af_set_title(const af_window wind, const char* const title) {
+    CALL(af_set_title, wind, title);
+}
+
+af_err af_set_size(const af_window wind, const unsigned w, const unsigned h) {
+    CALL(af_set_size, wind, w, h);
+}
+
+af_err af_draw_image(const af_window wind, const af_array in,
+                     const af_cell* const props) {
+    CHECK_ARRAYS(in);
+    CALL(af_draw_image, wind, in, props);
+}
+
+af_err af_draw_plot(const af_window wind, const af_array X, const af_array Y,
+                    const af_cell* const props) {
+    CHECK_ARRAYS(X, Y);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_draw_plot, wind, X, Y, props);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_draw_plot3(const af_window wind, const af_array P,
+                     const af_cell* const props) {
+    CHECK_ARRAYS(P);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_draw_plot3, wind, P, props);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_draw_plot_nd(const af_window wind, const af_array in,
+                       const af_cell* const props) {
+    CHECK_ARRAYS(in);
+    CALL(af_draw_plot_nd, wind, in, props);
+}
+
+af_err af_draw_plot_2d(const af_window wind, const af_array X, const af_array Y,
+                       const af_cell* const props) {
+    CHECK_ARRAYS(X, Y);
+    CALL(af_draw_plot_2d, wind, X, Y, props);
+}
+
+af_err af_draw_plot_3d(const af_window wind, const af_array X, const af_array Y,
+                       const af_array Z, const af_cell* const props) {
+    CHECK_ARRAYS(X, Y, Z);
+    CALL(af_draw_plot_3d, wind, X, Y, Z, props);
+}
+
+af_err af_draw_scatter(const af_window wind, const af_array X, const af_array Y,
+                       const af_marker_type marker,
+                       const af_cell* const props) {
+    CHECK_ARRAYS(X, Y);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_draw_scatter, wind, X, Y, marker, props);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_draw_scatter3(const af_window wind, const af_array P,
+                        const af_marker_type marker,
+                        const af_cell* const props) {
+    CHECK_ARRAYS(P);
+    AF_DEPRECATED_WARNINGS_OFF
+    CALL(af_draw_scatter3, wind, P, marker, props);
+    AF_DEPRECATED_WARNINGS_ON
+}
+
+af_err af_draw_scatter_nd(const af_window wind, const af_array in,
+                          const af_marker_type marker,
+                          const af_cell* const props) {
+    CHECK_ARRAYS(in);
+    CALL(af_draw_scatter_nd, wind, in, marker, props);
+}
+
+af_err af_draw_scatter_2d(const af_window wind, const af_array X,
+                          const af_array Y, const af_marker_type marker,
+                          const af_cell* const props) {
+    CHECK_ARRAYS(X, Y);
+    CALL(af_draw_scatter_2d, wind, X, Y, marker, props);
+}
+
+af_err af_draw_scatter_3d(const af_window wind, const af_array X,
+                          const af_array Y, const af_array Z,
+                          const af_marker_type marker,
+                          const af_cell* const props) {
+    CHECK_ARRAYS(X, Y, Z);
+    CALL(af_draw_scatter_3d, wind, X, Y, Z, marker, props);
+}
+
+af_err af_draw_hist(const af_window wind, const af_array X, const double minval,
+                    const double maxval, const af_cell* const props) {
+    CHECK_ARRAYS(X);
+    CALL(af_draw_hist, wind, X, minval, maxval, props);
+}
+
+af_err af_draw_surface(const af_window wind, const af_array xVals,
+                       const af_array yVals, const af_array S,
+                       const af_cell* const props) {
+    CHECK_ARRAYS(xVals, yVals, S);
+    CALL(af_draw_surface, wind, xVals, yVals, S, props);
+}
+
+af_err af_draw_vector_field_nd(const af_window wind, const af_array points,
+                               const af_array directions,
+                               const af_cell* const props) {
+    CHECK_ARRAYS(points, directions);
+    CALL(af_draw_vector_field_nd, wind, points, directions, props);
+}
+
+af_err af_draw_vector_field_3d(const af_window wind, const af_array xPoints,
+                               const af_array yPoints, const af_array zPoints,
+                               const af_array xDirs, const af_array yDirs,
+                               const af_array zDirs,
+                               const af_cell* const props) {
+    CHECK_ARRAYS(xPoints, yPoints, zPoints, xDirs, yDirs, zDirs);
+    CALL(af_draw_vector_field_3d, wind, xPoints, yPoints, zPoints, xDirs, yDirs,
+         zDirs, props);
+}
+
+af_err af_draw_vector_field_2d(const af_window wind, const af_array xPoints,
+                               const af_array yPoints, const af_array xDirs,
+                               const af_array yDirs,
+                               const af_cell* const props) {
+    CHECK_ARRAYS(xPoints, yPoints, xDirs, yDirs);
+    CALL(af_draw_vector_field_2d, wind, xPoints, yPoints, xDirs, yDirs, props);
+}
+
+af_err af_grid(const af_window wind, const int rows, const int cols) {
+    CALL(af_grid, wind, rows, cols);
+}
+
+af_err af_set_axes_limits_compute(const af_window wind, const af_array x,
+                                  const af_array y, const af_array z,
+                                  const bool exact,
+                                  const af_cell* const props) {
+    CHECK_ARRAYS(x, y);
+    if (z) { CHECK_ARRAYS(z); }
+    CALL(af_set_axes_limits_compute, wind, x, y, z, exact, props);
+}
+
+af_err af_set_axes_limits_2d(const af_window wind, const float xmin,
+                             const float xmax, const float ymin,
+                             const float ymax, const bool exact,
+                             const af_cell* const props) {
+    CALL(af_set_axes_limits_2d, wind, xmin, xmax, ymin, ymax, exact, props);
+}
+
+af_err af_set_axes_limits_3d(const af_window wind, const float xmin,
+                             const float xmax, const float ymin,
+                             const float ymax, const float zmin,
+                             const float zmax, const bool exact,
+                             const af_cell* const props) {
+    CALL(af_set_axes_limits_3d, wind, xmin, xmax, ymin, ymax, zmin, zmax, exact,
+         props);
+}
+
+af_err af_set_axes_titles(const af_window wind, const char* const xtitle,
+                          const char* const ytitle, const char* const ztitle,
+                          const af_cell* const props) {
+    CALL(af_set_axes_titles, wind, xtitle, ytitle, ztitle, props);
+}
+
+af_err af_set_axes_label_format(const af_window wind, const char* const xformat,
+                                const char* const yformat,
+                                const char* const zformat,
+                                const af_cell* const props) {
+    CALL(af_set_axes_label_format, wind, xformat, yformat, zformat, props);
+}
+
+af_err af_show(const af_window wind) { CALL(af_show, wind); }
+
+af_err af_is_window_closed(bool* out, const af_window wind) {
+    CALL(af_is_window_closed, out, wind);
+}
+
+af_err af_set_visibility(const af_window wind, const bool is_visible) {
+    CALL(af_set_visibility, wind, is_visible);
+}
+
+af_err af_destroy_window(const af_window wind) {
+    CALL(af_destroy_window, wind);
+}
diff --git a/src/api/unified/image.cpp b/src/api/unified/image.cpp
new file mode 100644
index 0000000000..0459301f1a
--- /dev/null
+++ b/src/api/unified/image.cpp
@@ -0,0 +1,285 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/image.h>
+#include "symbol_manager.hpp"
+
+af_err af_gradient(af_array *dx, af_array *dy, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_gradient, dx, dy, in);
+}
+
+af_err af_load_image(af_array *out, const char *filename, const bool isColor) {
+    CALL(af_load_image, out, filename, isColor);
+}
+
+af_err af_save_image(const char *filename, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_save_image, filename, in);
+}
+
+af_err af_load_image_memory(af_array *out, const void *ptr) {
+    CALL(af_load_image_memory, out, ptr);
+}
+
+af_err af_save_image_memory(void **ptr, const af_array in,
+                            const af_image_format format) {
+    CHECK_ARRAYS(in);
+    CALL(af_save_image_memory, ptr, in, format);
+}
+
+af_err af_delete_image_memory(void *ptr) { CALL(af_delete_image_memory, ptr); }
+
+af_err af_load_image_native(af_array *out, const char *filename) {
+    CALL(af_load_image_native, out, filename);
+}
+
+af_err af_save_image_native(const char *filename, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_save_image_native, filename, in);
+}
+
+af_err af_is_image_io_available(bool *out) {
+    CALL(af_is_image_io_available, out);
+}
+
+af_err af_resize(af_array *out, const af_array in, const dim_t odim0,
+                 const dim_t odim1, const af_interp_type method) {
+    CHECK_ARRAYS(in);
+    CALL(af_resize, out, in, odim0, odim1, method);
+}
+
+af_err af_transform(af_array *out, const af_array in, const af_array transform,
+                    const dim_t odim0, const dim_t odim1,
+                    const af_interp_type method, const bool inverse) {
+    CHECK_ARRAYS(in, transform);
+    CALL(af_transform, out, in, transform, odim0, odim1, method, inverse);
+}
+
+af_err af_transform_v2(af_array *out, const af_array in,
+                       const af_array transform, const dim_t odim0,
+                       const dim_t odim1, const af_interp_type method,
+                       const bool inverse) {
+    CHECK_ARRAYS(out, in, transform);
+    CALL(af_transform_v2, out, in, transform, odim0, odim1, method, inverse);
+}
+
+af_err af_transform_coordinates(af_array *out, const af_array tf,
+                                const float d0, const float d1) {
+    CHECK_ARRAYS(tf);
+    CALL(af_transform_coordinates, out, tf, d0, d1);
+}
+
+af_err af_rotate(af_array *out, const af_array in, const float theta,
+                 const bool crop, const af_interp_type method) {
+    CHECK_ARRAYS(in);
+    CALL(af_rotate, out, in, theta, crop, method);
+}
+
+af_err af_translate(af_array *out, const af_array in, const float trans0,
+                    const float trans1, const dim_t odim0, const dim_t odim1,
+                    const af_interp_type method) {
+    CHECK_ARRAYS(in);
+    CALL(af_translate, out, in, trans0, trans1, odim0, odim1, method);
+}
+
+af_err af_scale(af_array *out, const af_array in, const float scale0,
+                const float scale1, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
+    CHECK_ARRAYS(in);
+    CALL(af_scale, out, in, scale0, scale1, odim0, odim1, method);
+}
+
+af_err af_skew(af_array *out, const af_array in, const float skew0,
+               const float skew1, const dim_t odim0, const dim_t odim1,
+               const af_interp_type method, const bool inverse) {
+    CHECK_ARRAYS(in);
+    CALL(af_skew, out, in, skew0, skew1, odim0, odim1, method, inverse);
+}
+
+af_err af_histogram(af_array *out, const af_array in, const unsigned nbins,
+                    const double minval, const double maxval) {
+    CHECK_ARRAYS(in);
+    CALL(af_histogram, out, in, nbins, minval, maxval);
+}
+
+af_err af_dilate(af_array *out, const af_array in, const af_array mask) {
+    CHECK_ARRAYS(in, mask);
+    CALL(af_dilate, out, in, mask);
+}
+
+af_err af_dilate3(af_array *out, const af_array in, const af_array mask) {
+    CHECK_ARRAYS(in, mask);
+    CALL(af_dilate3, out, in, mask);
+}
+
+af_err af_erode(af_array *out, const af_array in, const af_array mask) {
+    CHECK_ARRAYS(in, mask);
+    CALL(af_erode, out, in, mask);
+}
+
+af_err af_erode3(af_array *out, const af_array in, const af_array mask) {
+    CHECK_ARRAYS(in, mask);
+    CALL(af_erode3, out, in, mask);
+}
+
+af_err af_bilateral(af_array *out, const af_array in, const float spatial_sigma,
+                    const float chromatic_sigma, const bool isColor) {
+    CHECK_ARRAYS(in);
+    CALL(af_bilateral, out, in, spatial_sigma, chromatic_sigma, isColor);
+}
+
+af_err af_mean_shift(af_array *out, const af_array in,
+                     const float spatial_sigma, const float chromatic_sigma,
+                     const unsigned iter, const bool is_color) {
+    CHECK_ARRAYS(in);
+    CALL(af_mean_shift, out, in, spatial_sigma, chromatic_sigma, iter,
+         is_color);
+}
+
+af_err af_minfilt(af_array *out, const af_array in, const dim_t wind_length,
+                  const dim_t wind_width, const af_border_type edge_pad) {
+    CHECK_ARRAYS(in);
+    CALL(af_minfilt, out, in, wind_length, wind_width, edge_pad);
+}
+
+af_err af_maxfilt(af_array *out, const af_array in, const dim_t wind_length,
+                  const dim_t wind_width, const af_border_type edge_pad) {
+    CHECK_ARRAYS(in);
+    CALL(af_maxfilt, out, in, wind_length, wind_width, edge_pad);
+}
+
+af_err af_regions(af_array *out, const af_array in,
+                  const af_connectivity connectivity, const af_dtype ty) {
+    CHECK_ARRAYS(in);
+    CALL(af_regions, out, in, connectivity, ty);
+}
+
+af_err af_sobel_operator(af_array *dx, af_array *dy, const af_array img,
+                         const unsigned ker_size) {
+    CHECK_ARRAYS(img);
+    CALL(af_sobel_operator, dx, dy, img, ker_size);
+}
+
+af_err af_rgb2gray(af_array *out, const af_array in, const float rPercent,
+                   const float gPercent, const float bPercent) {
+    CHECK_ARRAYS(in);
+    CALL(af_rgb2gray, out, in, rPercent, gPercent, bPercent);
+}
+
+af_err af_gray2rgb(af_array *out, const af_array in, const float rFactor,
+                   const float gFactor, const float bFactor) {
+    CHECK_ARRAYS(in);
+    CALL(af_gray2rgb, out, in, rFactor, gFactor, bFactor);
+}
+
+af_err af_hist_equal(af_array *out, const af_array in, const af_array hist) {
+    CHECK_ARRAYS(in, hist);
+    CALL(af_hist_equal, out, in, hist);
+}
+
+af_err af_gaussian_kernel(af_array *out, const int rows, const int cols,
+                          const double sigma_r, const double sigma_c) {
+    CALL(af_gaussian_kernel, out, rows, cols, sigma_r, sigma_c);
+}
+
+af_err af_hsv2rgb(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_hsv2rgb, out, in);
+}
+
+af_err af_rgb2hsv(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_rgb2hsv, out, in);
+}
+
+af_err af_color_space(af_array *out, const af_array image, const af_cspace_t to,
+                      const af_cspace_t from) {
+    CHECK_ARRAYS(image);
+    CALL(af_color_space, out, image, to, from);
+}
+
+af_err af_unwrap(af_array *out, const af_array in, const dim_t wx,
+                 const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,
+                 const dim_t py, const bool is_column) {
+    CHECK_ARRAYS(in);
+    CALL(af_unwrap, out, in, wx, wy, sx, sy, px, py, is_column);
+}
+
+af_err af_wrap(af_array *out, const af_array in, const dim_t ox, const dim_t oy,
+               const dim_t wx, const dim_t wy, const dim_t sx, const dim_t sy,
+               const dim_t px, const dim_t py, const bool is_column) {
+    CHECK_ARRAYS(in);
+    CALL(af_wrap, out, in, ox, oy, wx, wy, sx, sy, px, py, is_column);
+}
+
+af_err af_wrap_v2(af_array *out, const af_array in, const dim_t ox,
+                  const dim_t oy, const dim_t wx, const dim_t wy,
+                  const dim_t sx, const dim_t sy, const dim_t px,
+                  const dim_t py, const bool is_column) {
+    CHECK_ARRAYS(out, in);
+    CALL(af_wrap_v2, out, in, ox, oy, wx, wy, sx, sy, px, py, is_column);
+}
+
+af_err af_sat(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sat, out, in);
+}
+
+af_err af_ycbcr2rgb(af_array *out, const af_array in,
+                    const af_ycc_std standard) {
+    CHECK_ARRAYS(in);
+    CALL(af_ycbcr2rgb, out, in, standard);
+}
+
+af_err af_rgb2ycbcr(af_array *out, const af_array in,
+                    const af_ycc_std standard) {
+    CHECK_ARRAYS(in);
+    CALL(af_rgb2ycbcr, out, in, standard);
+}
+
+af_err af_canny(af_array *out, const af_array in, const af_canny_threshold ct,
+                const float t1, const float t2, const unsigned sw,
+                const bool isf) {
+    CHECK_ARRAYS(in);
+    CALL(af_canny, out, in, ct, t1, t2, sw, isf);
+}
+
+af_err af_anisotropic_diffusion(af_array *out, const af_array in,
+                                const float dt, const float K,
+                                const unsigned iterations,
+                                const af_flux_function fftype,
+                                const af_diffusion_eq eq) {
+    CHECK_ARRAYS(in);
+    CALL(af_anisotropic_diffusion, out, in, dt, K, iterations, fftype, eq);
+}
+
+af_err af_iterative_deconv(af_array *out, const af_array in, const af_array ker,
+                           const unsigned iterations, const float relax_factor,
+                           const af_iterative_deconv_algo algo) {
+    CHECK_ARRAYS(in, ker);
+    CALL(af_iterative_deconv, out, in, ker, iterations, relax_factor, algo);
+}
+
+af_err af_inverse_deconv(af_array *out, const af_array in, const af_array psf,
+                         const float gamma, const af_inverse_deconv_algo algo) {
+    CHECK_ARRAYS(in, psf);
+    CALL(af_inverse_deconv, out, in, psf, gamma, algo);
+}
+
+af_err af_confidence_cc(af_array *out, const af_array in, const af_array seedx,
+                        const af_array seedy, const unsigned radius,
+                        const unsigned multiplier, const int iter,
+                        const double segmented_value) {
+    CHECK_ARRAYS(in, seedx, seedy);
+    CALL(af_confidence_cc, out, in, seedx, seedy, radius, multiplier, iter,
+         segmented_value);
+}
diff --git a/src/api/unified/index.cpp b/src/api/unified/index.cpp
new file mode 100644
index 0000000000..90ea9d4694
--- /dev/null
+++ b/src/api/unified/index.cpp
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/index.h>
+#include "symbol_manager.hpp"
+
+af_err af_index(af_array* out, const af_array in, const unsigned ndims,
+                const af_seq* const index) {
+    CHECK_ARRAYS(in);
+    CALL(af_index, out, in, ndims, index);
+}
+
+af_err af_lookup(af_array* out, const af_array in, const af_array indices,
+                 const unsigned dim) {
+    CHECK_ARRAYS(in, indices);
+    CALL(af_lookup, out, in, indices, dim);
+}
+
+af_err af_assign_seq(af_array* out, const af_array lhs, const unsigned ndims,
+                     const af_seq* const indices, const af_array rhs) {
+    CHECK_ARRAYS(lhs, rhs);
+    CALL(af_assign_seq, out, lhs, ndims, indices, rhs);
+}
+
+af_err af_index_gen(af_array* out, const af_array in, const dim_t ndims,
+                    const af_index_t* indices) {
+    CHECK_ARRAYS(in);
+    CALL(af_index_gen, out, in, ndims, indices);
+}
+
+af_err af_assign_gen(af_array* out, const af_array lhs, const dim_t ndims,
+                     const af_index_t* indices, const af_array rhs) {
+    CHECK_ARRAYS(lhs, rhs);
+    CALL(af_assign_gen, out, lhs, ndims, indices, rhs);
+}
+
+af_seq af_make_seq(double begin, double end, double step) {
+    af_seq seq = {begin, end, step};
+    return seq;
+}
+
+af_err af_create_indexers(af_index_t** indexers) {
+    CALL(af_create_indexers, indexers);
+}
+
+af_err af_set_array_indexer(af_index_t* indexer, const af_array idx,
+                            const dim_t dim) {
+    CHECK_ARRAYS(idx);
+    CALL(af_set_array_indexer, indexer, idx, dim);
+}
+
+af_err af_set_seq_indexer(af_index_t* indexer, const af_seq* idx,
+                          const dim_t dim, const bool is_batch) {
+    CALL(af_set_seq_indexer, indexer, idx, dim, is_batch);
+}
+
+af_err af_set_seq_param_indexer(af_index_t* indexer, const double begin,
+                                const double end, const double step,
+                                const dim_t dim, const bool is_batch) {
+    CALL(af_set_seq_param_indexer, indexer, begin, end, step, dim, is_batch);
+}
+
+af_err af_release_indexers(af_index_t* indexers) {
+    CALL(af_release_indexers, indexers);
+}
diff --git a/src/api/unified/internal.cpp b/src/api/unified/internal.cpp
new file mode 100644
index 0000000000..ab1d3be7ca
--- /dev/null
+++ b/src/api/unified/internal.cpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/internal.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_strided_array(af_array *arr, const void *data,
+                               const dim_t offset, const unsigned ndims,
+                               const dim_t *const dims_,
+                               const dim_t *const strides_, const af_dtype ty,
+                               const af_source location) {
+    CALL(af_create_strided_array, arr, data, offset, ndims, dims_, strides_, ty,
+         location);
+}
+
+af_err af_get_strides(dim_t *s0, dim_t *s1, dim_t *s2, dim_t *s3,
+                      const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_get_strides, s0, s1, s2, s3, in);
+}
+
+af_err af_get_offset(dim_t *offset, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_offset, offset, arr);
+}
+
+af_err af_get_raw_ptr(void **ptr, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_raw_ptr, ptr, arr);
+}
+
+af_err af_is_linear(bool *result, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_is_linear, result, arr);
+}
+
+af_err af_is_owner(bool *result, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_is_owner, result, arr);
+}
+
+af_err af_get_allocated_bytes(size_t *bytes, const af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_get_allocated_bytes, bytes, arr);
+}
diff --git a/src/api/unified/jit_test_api.cpp b/src/api/unified/jit_test_api.cpp
new file mode 100644
index 0000000000..de60ac1eb1
--- /dev/null
+++ b/src/api/unified/jit_test_api.cpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <jit_test_api.h>
+
+#include "symbol_manager.hpp"
+
+af_err af_get_max_jit_len(int *jitLen) { CALL(af_get_max_jit_len, jitLen); }
+
+af_err af_set_max_jit_len(const int jitLen) {
+    CALL(af_set_max_jit_len, jitLen);
+}
diff --git a/src/api/unified/lapack.cpp b/src/api/unified/lapack.cpp
new file mode 100644
index 0000000000..491e4e2763
--- /dev/null
+++ b/src/api/unified/lapack.cpp
@@ -0,0 +1,95 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/lapack.h>
+#include "symbol_manager.hpp"
+
+af_err af_svd(af_array *u, af_array *s, af_array *vt, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_svd, u, s, vt, in);
+}
+
+af_err af_svd_inplace(af_array *u, af_array *s, af_array *vt, af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_svd_inplace, u, s, vt, in);
+}
+
+af_err af_lu(af_array *lower, af_array *upper, af_array *pivot,
+             const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_lu, lower, upper, pivot, in);
+}
+
+af_err af_lu_inplace(af_array *pivot, af_array in, const bool is_lapack_piv) {
+    CHECK_ARRAYS(in);
+    CALL(af_lu_inplace, pivot, in, is_lapack_piv);
+}
+
+af_err af_qr(af_array *q, af_array *r, af_array *tau, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_qr, q, r, tau, in);
+}
+
+af_err af_qr_inplace(af_array *tau, af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_qr_inplace, tau, in);
+}
+
+af_err af_cholesky(af_array *out, int *info, const af_array in,
+                   const bool is_upper) {
+    CHECK_ARRAYS(in);
+    CALL(af_cholesky, out, info, in, is_upper);
+}
+
+af_err af_cholesky_inplace(int *info, af_array in, const bool is_upper) {
+    CHECK_ARRAYS(in);
+    CALL(af_cholesky_inplace, info, in, is_upper);
+}
+
+af_err af_solve(af_array *x, const af_array a, const af_array b,
+                const af_mat_prop options) {
+    CHECK_ARRAYS(a, b);
+    CALL(af_solve, x, a, b, options);
+}
+
+af_err af_solve_lu(af_array *x, const af_array a, const af_array piv,
+                   const af_array b, const af_mat_prop options) {
+    CHECK_ARRAYS(a, piv, b);
+    CALL(af_solve_lu, x, a, piv, b, options);
+}
+
+af_err af_inverse(af_array *out, const af_array in, const af_mat_prop options) {
+    CHECK_ARRAYS(in);
+    CALL(af_inverse, out, in, options);
+}
+
+af_err af_pinverse(af_array *out, const af_array in, const double tol,
+                   const af_mat_prop options) {
+    CHECK_ARRAYS(in);
+    CALL(af_pinverse, out, in, tol, options);
+}
+
+af_err af_rank(unsigned *rank, const af_array in, const double tol) {
+    CHECK_ARRAYS(in);
+    CALL(af_rank, rank, in, tol);
+}
+
+af_err af_det(double *det_real, double *det_imag, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_det, det_real, det_imag, in);
+}
+
+af_err af_norm(double *out, const af_array in, const af_norm_type type,
+               const double p, const double q) {
+    CHECK_ARRAYS(in);
+    CALL(af_norm, out, in, type, p, q);
+}
+
+af_err af_is_lapack_available(bool *out) { CALL(af_is_lapack_available, out); }
diff --git a/src/api/unified/memory.cpp b/src/api/unified/memory.cpp
new file mode 100644
index 0000000000..45ab9bc623
--- /dev/null
+++ b/src/api/unified/memory.cpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/memory.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_memory_manager(af_memory_manager* out) {
+    CALL(af_create_memory_manager, out);
+}
+
+af_err af_release_memory_manager(af_memory_manager handle) {
+    CALL(af_release_memory_manager, handle);
+}
+
+af_err af_set_memory_manager(af_memory_manager handle) {
+    CALL(af_set_memory_manager, handle);
+}
+
+af_err af_set_memory_manager_pinned(af_memory_manager handle) {
+    CALL(af_set_memory_manager_pinned, handle);
+}
+
+af_err af_unset_memory_manager() { CALL_NO_PARAMS(af_unset_memory_manager); }
+
+af_err af_unset_memory_manager_pinned() {
+    CALL_NO_PARAMS(af_unset_memory_manager_pinned);
+}
+
+af_err af_memory_manager_get_payload(af_memory_manager handle, void** payload) {
+    CALL(af_memory_manager_get_payload, handle, payload);
+}
+
+af_err af_memory_manager_set_payload(af_memory_manager handle, void* payload) {
+    CALL(af_memory_manager_set_payload, handle, payload);
+}
+
+af_err af_memory_manager_set_initialize_fn(af_memory_manager handle,
+                                           af_memory_manager_initialize_fn fn) {
+    CALL(af_memory_manager_set_initialize_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_shutdown_fn(af_memory_manager handle,
+                                         af_memory_manager_shutdown_fn fn) {
+    CALL(af_memory_manager_set_shutdown_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_alloc_fn(af_memory_manager handle,
+                                      af_memory_manager_alloc_fn fn) {
+    CALL(af_memory_manager_set_alloc_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_allocated_fn(af_memory_manager handle,
+                                          af_memory_manager_allocated_fn fn) {
+    CALL(af_memory_manager_set_allocated_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_unlock_fn(af_memory_manager handle,
+                                       af_memory_manager_unlock_fn fn) {
+    CALL(af_memory_manager_set_unlock_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_signal_memory_cleanup_fn(
+    af_memory_manager handle, af_memory_manager_signal_memory_cleanup_fn fn) {
+    CALL(af_memory_manager_set_signal_memory_cleanup_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_print_info_fn(af_memory_manager handle,
+                                           af_memory_manager_print_info_fn fn) {
+    CALL(af_memory_manager_set_print_info_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_user_lock_fn(af_memory_manager handle,
+                                          af_memory_manager_user_lock_fn fn) {
+    CALL(af_memory_manager_set_user_lock_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_user_unlock_fn(
+    af_memory_manager handle, af_memory_manager_user_unlock_fn fn) {
+    CALL(af_memory_manager_set_user_unlock_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_is_user_locked_fn(
+    af_memory_manager handle, af_memory_manager_is_user_locked_fn fn) {
+    CALL(af_memory_manager_set_is_user_locked_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_get_memory_pressure_fn(
+    af_memory_manager handle, af_memory_manager_get_memory_pressure_fn fn) {
+    CALL(af_memory_manager_set_get_memory_pressure_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn(
+    af_memory_manager handle,
+    af_memory_manager_jit_tree_exceeds_memory_pressure_fn fn) {
+    CALL(af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_add_memory_management_fn(
+    af_memory_manager handle, af_memory_manager_add_memory_management_fn fn) {
+    CALL(af_memory_manager_set_add_memory_management_fn, handle, fn);
+}
+
+af_err af_memory_manager_set_remove_memory_management_fn(
+    af_memory_manager handle,
+    af_memory_manager_remove_memory_management_fn fn) {
+    CALL(af_memory_manager_set_remove_memory_management_fn, handle, fn);
+}
+
+af_err af_memory_manager_get_active_device_id(af_memory_manager handle,
+                                              int* id) {
+    CALL(af_memory_manager_get_active_device_id, handle, id);
+}
+
+af_err af_memory_manager_native_alloc(af_memory_manager handle, void** ptr,
+                                      size_t size) {
+    CALL(af_memory_manager_native_alloc, handle, ptr, size);
+}
+
+af_err af_memory_manager_native_free(af_memory_manager handle, void* ptr) {
+    CALL(af_memory_manager_native_free, handle, ptr);
+}
+
+af_err af_memory_manager_get_max_memory_size(af_memory_manager handle,
+                                             size_t* size, int id) {
+    CALL(af_memory_manager_get_max_memory_size, handle, size, id);
+}
+
+af_err af_memory_manager_get_memory_pressure_threshold(af_memory_manager handle,
+                                                       float* value) {
+    CALL(af_memory_manager_get_memory_pressure_threshold, handle, value);
+}
+
+af_err af_memory_manager_set_memory_pressure_threshold(af_memory_manager handle,
+                                                       float value) {
+    CALL(af_memory_manager_set_memory_pressure_threshold, handle, value);
+}
diff --git a/src/api/unified/ml.cpp b/src/api/unified/ml.cpp
new file mode 100644
index 0000000000..b91cc7a49d
--- /dev/null
+++ b/src/api/unified/ml.cpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <af/array.h>
+#include <af/ml.h>
+#include "symbol_manager.hpp"
+
+af_err af_convolve2_gradient_nn(
+    af_array *out, const af_array incoming_gradient,
+    const af_array original_signal, const af_array original_filter,
+    const af_array convolved_output, const unsigned stride_dims,
+    const dim_t *strides, const unsigned padding_dims, const dim_t *paddings,
+    const unsigned dilation_dims, const dim_t *dilations,
+    af_conv_gradient_type gradType) {
+    CHECK_ARRAYS(incoming_gradient, original_signal, original_filter,
+                 convolved_output);
+    CALL(af_convolve2_gradient_nn, out, incoming_gradient, original_signal,
+         original_filter, convolved_output, stride_dims, strides, padding_dims,
+         paddings, dilation_dims, dilations, gradType);
+}
diff --git a/src/api/unified/moments.cpp b/src/api/unified/moments.cpp
new file mode 100644
index 0000000000..5d709160e7
--- /dev/null
+++ b/src/api/unified/moments.cpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/image.h>
+#include "symbol_manager.hpp"
+
+af_err af_moments(af_array* out, const af_array in,
+                  const af_moment_type moment) {
+    CHECK_ARRAYS(in);
+    CALL(af_moments, out, in, moment);
+}
+
+af_err af_moments_all(double* out, const af_array in,
+                      const af_moment_type moment) {
+    CHECK_ARRAYS(in);
+    CALL(af_moments_all, out, in, moment);
+}
diff --git a/src/api/unified/opencl.cpp b/src/api/unified/opencl.cpp
new file mode 100644
index 0000000000..6ad93ae9ce
--- /dev/null
+++ b/src/api/unified/opencl.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/backend.h>
+#include "symbol_manager.hpp"
+
+#include <af/opencl.h>
+
+af_err afcl_get_device_type(afcl_device_type* res) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_get_device_type, res); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_get_platform(afcl_platform* res) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_get_platform, res); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_get_context(cl_context* ctx, const bool retain) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_get_context, ctx, retain); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_get_queue(cl_command_queue* queue, const bool retain) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_get_queue, queue, retain); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_get_device_id(cl_device_id* id) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_get_device_id, id); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_set_device_id(cl_device_id id) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) { CALL(afcl_set_device_id, id); }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_add_device_context(cl_device_id dev, cl_context ctx,
+                               cl_command_queue que) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) {
+        CALL(afcl_add_device_context, dev, ctx, que);
+    }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_set_device_context(cl_device_id dev, cl_context ctx) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) {
+        CALL(afcl_set_device_context, dev, ctx);
+    }
+    return AF_ERR_NOT_SUPPORTED;
+}
+
+af_err afcl_delete_device_context(cl_device_id dev, cl_context ctx) {
+    af_backend backend;
+    af_get_active_backend(&backend);
+    if (backend == AF_BACKEND_OPENCL) {
+        CALL(afcl_delete_device_context, dev, ctx);
+    }
+    return AF_ERR_NOT_SUPPORTED;
+}
diff --git a/src/api/unified/random.cpp b/src/api/unified/random.cpp
new file mode 100644
index 0000000000..771839e9fc
--- /dev/null
+++ b/src/api/unified/random.cpp
@@ -0,0 +1,81 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/random.h>
+#include "symbol_manager.hpp"
+
+af_err af_get_default_random_engine(af_random_engine *r) {
+    CALL(af_get_default_random_engine, r);
+}
+
+af_err af_create_random_engine(af_random_engine *engineHandle,
+                               af_random_engine_type rtype,
+                               unsigned long long seed) {
+    CALL(af_create_random_engine, engineHandle, rtype, seed);
+}
+
+af_err af_retain_random_engine(af_random_engine *outHandle,
+                               const af_random_engine engineHandle) {
+    CALL(af_retain_random_engine, outHandle, engineHandle);
+}
+
+af_err af_random_engine_get_type(af_random_engine_type *rtype,
+                                 const af_random_engine engine) {
+    CALL(af_random_engine_get_type, rtype, engine);
+}
+
+af_err af_random_engine_set_type(af_random_engine *engine,
+                                 const af_random_engine_type rtype) {
+    CALL(af_random_engine_set_type, engine, rtype);
+}
+
+af_err af_set_default_random_engine_type(const af_random_engine_type rtype) {
+    CALL(af_set_default_random_engine_type, rtype);
+}
+
+af_err af_random_uniform(af_array *arr, const unsigned ndims,
+                         const dim_t *const dims, const af_dtype type,
+                         af_random_engine engine) {
+    CALL(af_random_uniform, arr, ndims, dims, type, engine);
+}
+
+af_err af_random_normal(af_array *arr, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype type,
+                        af_random_engine engine) {
+    CALL(af_random_normal, arr, ndims, dims, type, engine);
+}
+
+af_err af_release_random_engine(af_random_engine engineHandle) {
+    CALL(af_release_random_engine, engineHandle);
+}
+
+af_err af_random_engine_set_seed(af_random_engine *engine,
+                                 const unsigned long long seed) {
+    CALL(af_random_engine_set_seed, engine, seed);
+}
+
+af_err af_random_engine_get_seed(unsigned long long *const seed,
+                                 af_random_engine engine) {
+    CALL(af_random_engine_get_seed, seed, engine);
+}
+
+af_err af_randu(af_array *out, const unsigned ndims, const dim_t *const dims,
+                const af_dtype type) {
+    CALL(af_randu, out, ndims, dims, type);
+}
+
+af_err af_randn(af_array *out, const unsigned ndims, const dim_t *const dims,
+                const af_dtype type) {
+    CALL(af_randn, out, ndims, dims, type);
+}
+
+af_err af_set_seed(const unsigned long long seed) { CALL(af_set_seed, seed); }
+
+af_err af_get_seed(unsigned long long *seed) { CALL(af_get_seed, seed); }
diff --git a/src/api/unified/signal.cpp b/src/api/unified/signal.cpp
new file mode 100644
index 0000000000..e491965acd
--- /dev/null
+++ b/src/api/unified/signal.cpp
@@ -0,0 +1,244 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/signal.h>
+#include "symbol_manager.hpp"
+
+af_err af_approx1(af_array *yo, const af_array yi, const af_array xo,
+                  const af_interp_type method, const float offGrid) {
+    CHECK_ARRAYS(yo, yi, xo);
+    CALL(af_approx1, yo, yi, xo, method, offGrid);
+}
+
+af_err af_approx1_v2(af_array *yo, const af_array yi, const af_array xo,
+                     const af_interp_type method, const float offGrid) {
+    CHECK_ARRAYS(yo, yi, xo);
+    CALL(af_approx1_v2, yo, yi, xo, method, offGrid);
+}
+
+af_err af_approx2(af_array *zo, const af_array zi, const af_array xo,
+                  const af_array yo, const af_interp_type method,
+                  const float offGrid) {
+    CHECK_ARRAYS(zo, zi, xo, yo);
+    CALL(af_approx2, zo, zi, xo, yo, method, offGrid);
+}
+
+af_err af_approx2_v2(af_array *zo, const af_array zi, const af_array xo,
+                     const af_array yo, const af_interp_type method,
+                     const float offGrid) {
+    CHECK_ARRAYS(zo, zi, xo, yo);
+    CALL(af_approx2_v2, zo, zi, xo, yo, method, offGrid);
+}
+
+af_err af_approx1_uniform(af_array *yo, const af_array yi, const af_array xo,
+                          const int xdim, const double xi_beg,
+                          const double xi_step, const af_interp_type method,
+                          const float offGrid) {
+    CHECK_ARRAYS(yo, yi, xo);
+    CALL(af_approx1_uniform, yo, yi, xo, xdim, xi_beg, xi_step, method,
+         offGrid);
+}
+
+af_err af_approx1_uniform_v2(af_array *yo, const af_array yi, const af_array xo,
+                             const int xdim, const double xi_beg,
+                             const double xi_step, const af_interp_type method,
+                             const float offGrid) {
+    CHECK_ARRAYS(yo, yi, xo);
+    CALL(af_approx1_uniform_v2, yo, yi, xo, xdim, xi_beg, xi_step, method,
+         offGrid);
+}
+
+af_err af_approx2_uniform(af_array *zo, const af_array zi, const af_array xo,
+                          const int xdim, const double xi_beg,
+                          const double xi_step, const af_array yo,
+                          const int ydim, const double yi_beg,
+                          const double yi_step, const af_interp_type method,
+                          const float offGrid) {
+    CHECK_ARRAYS(zo, zi, xo, yo);
+    CALL(af_approx2_uniform, zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+         yi_beg, yi_step, method, offGrid);
+}
+
+af_err af_approx2_uniform_v2(af_array *zo, const af_array zi, const af_array xo,
+                             const int xdim, const double xi_beg,
+                             const double xi_step, const af_array yo,
+                             const int ydim, const double yi_beg,
+                             const double yi_step, const af_interp_type method,
+                             const float offGrid) {
+    CHECK_ARRAYS(zo, zi, xo, yo);
+    CALL(af_approx2_uniform_v2, zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+         yi_beg, yi_step, method, offGrid);
+}
+
+af_err af_set_fft_plan_cache_size(size_t cache_size) {
+    CALL(af_set_fft_plan_cache_size, cache_size);
+}
+
+#define FFT_HAPI_DEF(af_func)                               \
+    af_err af_func(af_array in, const double norm_factor) { \
+        CHECK_ARRAYS(in);                                   \
+        CALL(af_func, in, norm_factor);                     \
+    }
+
+FFT_HAPI_DEF(af_fft_inplace)
+FFT_HAPI_DEF(af_fft2_inplace)
+FFT_HAPI_DEF(af_fft3_inplace)
+FFT_HAPI_DEF(af_ifft_inplace)
+FFT_HAPI_DEF(af_ifft2_inplace)
+FFT_HAPI_DEF(af_ifft3_inplace)
+
+af_err af_fft(af_array *out, const af_array in, const double norm_factor,
+              const dim_t odim0) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft, out, in, norm_factor, odim0);
+}
+
+af_err af_fft2(af_array *out, const af_array in, const double norm_factor,
+               const dim_t odim0, const dim_t odim1) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft2, out, in, norm_factor, odim0, odim1);
+}
+
+af_err af_fft3(af_array *out, const af_array in, const double norm_factor,
+               const dim_t odim0, const dim_t odim1, const dim_t odim2) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft3, out, in, norm_factor, odim0, odim1, odim2);
+}
+
+af_err af_ifft(af_array *out, const af_array in, const double norm_factor,
+               const dim_t odim0) {
+    CHECK_ARRAYS(in);
+    CALL(af_ifft, out, in, norm_factor, odim0);
+}
+
+af_err af_ifft2(af_array *out, const af_array in, const double norm_factor,
+                const dim_t odim0, const dim_t odim1) {
+    CHECK_ARRAYS(in);
+    CALL(af_ifft2, out, in, norm_factor, odim0, odim1);
+}
+
+af_err af_ifft3(af_array *out, const af_array in, const double norm_factor,
+                const dim_t odim0, const dim_t odim1, const dim_t odim2) {
+    CHECK_ARRAYS(in);
+    CALL(af_ifft3, out, in, norm_factor, odim0, odim1, odim2);
+}
+
+af_err af_fft_r2c(af_array *out, const af_array in, const double norm_factor,
+                  const dim_t pad0) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft_r2c, out, in, norm_factor, pad0);
+}
+
+af_err af_fft2_r2c(af_array *out, const af_array in, const double norm_factor,
+                   const dim_t pad0, const dim_t pad1) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft2_r2c, out, in, norm_factor, pad0, pad1);
+}
+
+af_err af_fft3_r2c(af_array *out, const af_array in, const double norm_factor,
+                   const dim_t pad0, const dim_t pad1, const dim_t pad2) {
+    CHECK_ARRAYS(in);
+    CALL(af_fft3_r2c, out, in, norm_factor, pad0, pad1, pad2);
+}
+
+#define FFTC2R_HAPI_DEF(af_func)                                               \
+    af_err af_func(af_array *out, const af_array in, const double norm_factor, \
+                   const bool is_odd) {                                        \
+        CHECK_ARRAYS(in);                                                      \
+        CALL(af_func, out, in, norm_factor, is_odd);                           \
+    }
+
+FFTC2R_HAPI_DEF(af_fft_c2r)
+FFTC2R_HAPI_DEF(af_fft2_c2r)
+FFTC2R_HAPI_DEF(af_fft3_c2r)
+
+#define CONV_HAPI_DEF(af_func)                                     \
+    af_err af_func(af_array *out, const af_array signal,           \
+                   const af_array filter, const af_conv_mode mode, \
+                   af_conv_domain domain) {                        \
+        CHECK_ARRAYS(signal, filter);                              \
+        CALL(af_func, out, signal, filter, mode, domain);          \
+    }
+
+CONV_HAPI_DEF(af_convolve1)
+CONV_HAPI_DEF(af_convolve2)
+CONV_HAPI_DEF(af_convolve3)
+
+af_err af_convolve2_nn(af_array *out, const af_array signal,
+                       const af_array filter, const unsigned stride_dims,
+                       const dim_t *strides, const unsigned padding_dims,
+                       const dim_t *paddings, const unsigned dilation_dims,
+                       const dim_t *dilations) {
+    CHECK_ARRAYS(signal, filter);
+    CALL(af_convolve2_nn, out, signal, filter, stride_dims, strides,
+         padding_dims, paddings, dilation_dims, dilations);
+}
+
+af_err af_convolve2_gradient_nn(
+    af_array *out, const af_array incoming_gradient,
+    const af_array original_signal, const af_array original_filter,
+    const af_array convolved_output, const unsigned stride_dims,
+    const dim_t *strides, const unsigned padding_dims, const dim_t *paddings,
+    const unsigned dilation_dims, const dim_t *dilations,
+    af_conv_gradient_type grad_type) {
+    CHECK_ARRAYS(incoming_gradient, original_signal, original_filter,
+                 convolved_output);
+    CALL(af_convolve2_gradient_nn, out, incoming_gradient, original_signal,
+         original_filter, convolved_output, stride_dims, strides, padding_dims,
+         paddings, dilation_dims, dilations, grad_type);
+}
+
+#define FFT_CONV_HAPI_DEF(af_func)                                   \
+    af_err af_func(af_array *out, const af_array signal,             \
+                   const af_array filter, const af_conv_mode mode) { \
+        CHECK_ARRAYS(signal, filter);                                \
+        CALL(af_func, out, signal, filter, mode);                    \
+    }
+
+FFT_CONV_HAPI_DEF(af_fft_convolve1)
+FFT_CONV_HAPI_DEF(af_fft_convolve2)
+FFT_CONV_HAPI_DEF(af_fft_convolve3)
+
+af_err af_convolve2_sep(af_array *out, const af_array col_filter,
+                        const af_array row_filter, const af_array signal,
+                        const af_conv_mode mode) {
+    CHECK_ARRAYS(col_filter, row_filter, signal);
+    CALL(af_convolve2_sep, out, col_filter, row_filter, signal, mode);
+}
+
+af_err af_fir(af_array *y, const af_array b, const af_array x) {
+    CHECK_ARRAYS(b, x);
+    CALL(af_fir, y, b, x);
+}
+
+af_err af_iir(af_array *y, const af_array b, const af_array a,
+              const af_array x) {
+    CHECK_ARRAYS(b, a, x);
+    CALL(af_iir, y, b, a, x);
+}
+
+af_err af_medfilt(af_array *out, const af_array in, const dim_t wind_length,
+                  const dim_t wind_width, const af_border_type edge_pad) {
+    CHECK_ARRAYS(in);
+    CALL(af_medfilt, out, in, wind_length, wind_width, edge_pad);
+}
+
+af_err af_medfilt1(af_array *out, const af_array in, const dim_t wind_width,
+                   const af_border_type edge_pad) {
+    CHECK_ARRAYS(in);
+    CALL(af_medfilt1, out, in, wind_width, edge_pad);
+}
+
+af_err af_medfilt2(af_array *out, const af_array in, const dim_t wind_length,
+                   const dim_t wind_width, const af_border_type edge_pad) {
+    CHECK_ARRAYS(in);
+    CALL(af_medfilt2, out, in, wind_length, wind_width, edge_pad);
+}
diff --git a/src/api/unified/sparse.cpp b/src/api/unified/sparse.cpp
new file mode 100644
index 0000000000..56ec71858a
--- /dev/null
+++ b/src/api/unified/sparse.cpp
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/sparse.h>
+#include "symbol_manager.hpp"
+
+af_err af_create_sparse_array(af_array *out, const dim_t nRows,
+                              const dim_t nCols, const af_array values,
+                              const af_array rowIdx, const af_array colIdx,
+                              const af_storage stype) {
+    CHECK_ARRAYS(values, rowIdx, colIdx);
+    CALL(af_create_sparse_array, out, nRows, nCols, values, rowIdx, colIdx,
+         stype);
+}
+
+af_err af_create_sparse_array_from_ptr(
+    af_array *out, const dim_t nRows, const dim_t nCols, const dim_t nNZ,
+    const void *const values, const int *const rowIdx, const int *const colIdx,
+    const af_dtype type, const af_storage stype, const af_source source) {
+    CALL(af_create_sparse_array_from_ptr, out, nRows, nCols, nNZ, values,
+         rowIdx, colIdx, type, stype, source);
+}
+
+af_err af_create_sparse_array_from_dense(af_array *out, const af_array in,
+                                         const af_storage stype) {
+    CHECK_ARRAYS(in);
+    CALL(af_create_sparse_array_from_dense, out, in, stype);
+}
+
+af_err af_sparse_convert_to(af_array *out, const af_array in,
+                            const af_storage destStorage) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_convert_to, out, in, destStorage);
+}
+
+af_err af_sparse_to_dense(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_to_dense, out, in);
+}
+
+af_err af_sparse_get_info(af_array *values, af_array *rowIdx, af_array *colIdx,
+                          af_storage *stype, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_info, values, rowIdx, colIdx, stype, in);
+}
+
+af_err af_sparse_get_values(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_values, out, in);
+}
+
+af_err af_sparse_get_row_idx(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_row_idx, out, in);
+}
+
+af_err af_sparse_get_col_idx(af_array *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_col_idx, out, in);
+}
+
+af_err af_sparse_get_nnz(dim_t *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_nnz, out, in);
+}
+
+af_err af_sparse_get_storage(af_storage *out, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_sparse_get_storage, out, in);
+}
diff --git a/src/api/unified/statistics.cpp b/src/api/unified/statistics.cpp
new file mode 100644
index 0000000000..d97bd33237
--- /dev/null
+++ b/src/api/unified/statistics.cpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/deprecated.hpp>
+#include <af/array.h>
+#include <af/statistics.h>
+#include "symbol_manager.hpp"
+
+af_err af_mean(af_array *out, const af_array in, const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_mean, out, in, dim);
+}
+
+af_err af_mean_weighted(af_array *out, const af_array in,
+                        const af_array weights, const dim_t dim) {
+    CHECK_ARRAYS(in, weights);
+    CALL(af_mean_weighted, out, in, weights, dim);
+}
+
+AF_DEPRECATED_WARNINGS_OFF
+af_err af_var(af_array *out, const af_array in, const bool isbiased,
+              const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_var, out, in, isbiased, dim);
+}
+AF_DEPRECATED_WARNINGS_ON
+
+af_err af_var_weighted(af_array *out, const af_array in, const af_array weights,
+                       const dim_t dim) {
+    CHECK_ARRAYS(in, weights);
+    CALL(af_var_weighted, out, in, weights, dim);
+}
+
+af_err af_meanvar(af_array *mean, af_array *var, const af_array in,
+                  const af_array weights, const af_var_bias bias,
+                  const dim_t dim) {
+    CHECK_ARRAYS(in, weights);
+    CALL(af_meanvar, mean, var, in, weights, bias, dim);
+}
+
+AF_DEPRECATED_WARNINGS_OFF
+af_err af_stdev(af_array *out, const af_array in, const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_stdev, out, in, dim);
+}
+
+af_err af_cov(af_array *out, const af_array X, const af_array Y,
+              const bool isbiased) {
+    CHECK_ARRAYS(X, Y);
+    CALL(af_cov, out, X, Y, isbiased);
+}
+AF_DEPRECATED_WARNINGS_ON
+
+af_err af_median(af_array *out, const af_array in, const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_median, out, in, dim);
+}
+
+af_err af_mean_all(double *real, double *imag, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_mean_all, real, imag, in);
+}
+
+af_err af_mean_all_weighted(double *real, double *imag, const af_array in,
+                            const af_array weights) {
+    CHECK_ARRAYS(in, weights);
+    CALL(af_mean_all_weighted, real, imag, in, weights);
+}
+
+AF_DEPRECATED_WARNINGS_OFF
+af_err af_var_all(double *realVal, double *imagVal, const af_array in,
+                  const bool isbiased) {
+    CHECK_ARRAYS(in);
+    CALL(af_var_all, realVal, imagVal, in, isbiased);
+}
+AF_DEPRECATED_WARNINGS_ON
+
+af_err af_var_all_weighted(double *realVal, double *imagVal, const af_array in,
+                           const af_array weights) {
+    CHECK_ARRAYS(in, weights);
+    CALL(af_var_all_weighted, realVal, imagVal, in, weights);
+}
+
+AF_DEPRECATED_WARNINGS_OFF
+af_err af_stdev_all(double *real, double *imag, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_stdev_all, real, imag, in);
+}
+AF_DEPRECATED_WARNINGS_ON
+
+af_err af_median_all(double *realVal, double *imagVal, const af_array in) {
+    CHECK_ARRAYS(in);
+    CALL(af_median_all, realVal, imagVal, in);
+}
+
+af_err af_corrcoef(double *realVal, double *imagVal, const af_array X,
+                   const af_array Y) {
+    CHECK_ARRAYS(X, Y);
+    CALL(af_corrcoef, realVal, imagVal, X, Y);
+}
+
+af_err af_topk(af_array *values, af_array *indices, const af_array in,
+               const int k, const int dim, const af_topk_function order) {
+    CHECK_ARRAYS(in);
+    CALL(af_topk, values, indices, in, k, dim, order);
+}
+
+af_err af_var_v2(af_array *out, const af_array in, const af_var_bias bias,
+                 const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_var_v2, out, in, bias, dim);
+}
+
+af_err af_var_all_v2(double *realVal, double *imagVal, const af_array in,
+                     const af_var_bias bias) {
+    CHECK_ARRAYS(in);
+    CALL(af_var_all_v2, realVal, imagVal, in, bias);
+}
+
+af_err af_cov_v2(af_array *out, const af_array X, const af_array Y,
+                 const af_var_bias bias) {
+    CHECK_ARRAYS(X, Y);
+    CALL(af_cov_v2, out, X, Y, bias);
+}
+
+af_err af_stdev_v2(af_array *out, const af_array in, const af_var_bias bias,
+                   const dim_t dim) {
+    CHECK_ARRAYS(in);
+    CALL(af_stdev_v2, out, in, bias, dim);
+}
+
+af_err af_stdev_all_v2(double *real, double *imag, const af_array in,
+                       const af_var_bias bias) {
+    CHECK_ARRAYS(in);
+    CALL(af_stdev_all_v2, real, imag, in, bias);
+}
diff --git a/src/api/unified/symbol_manager.cpp b/src/api/unified/symbol_manager.cpp
new file mode 100644
index 0000000000..93ca06938f
--- /dev/null
+++ b/src/api/unified/symbol_manager.cpp
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include "symbol_manager.hpp"
+
+#include <af/version.h>
+
+#include <common/Logger.hpp>
+#include <common/module_loading.hpp>
+#include <spdlog/spdlog.h>
+
+#include <cmath>
+#include <functional>
+#include <string>
+#include <type_traits>
+
+#ifdef OS_WIN
+#include <Windows.h>
+#else
+#include <dlfcn.h>
+#endif
+
+using arrayfire::common::getEnvVar;
+using arrayfire::common::getErrorMessage;
+using arrayfire::common::getFunctionPointer;
+using arrayfire::common::loadLibrary;
+using arrayfire::common::loggerFactory;
+using arrayfire::common::unloadLibrary;
+using std::extent;
+using std::function;
+using std::string;
+
+namespace arrayfire {
+namespace unified {
+
+#if defined(OS_WIN)
+static const char* LIB_AF_BKND_PREFIX = "";
+static const char* LIB_AF_BKND_SUFFIX = ".dll";
+#define PATH_SEPARATOR "\\"
+#define RTLD_LAZY 0
+#else
+
+#if defined(__APPLE__)
+#define SO_SUFFIX_HELPER(VER) "." #VER ".dylib"
+#else
+#define SO_SUFFIX_HELPER(VER) ".so." #VER
+#endif
+static const char* LIB_AF_BKND_PREFIX = "lib";
+#define PATH_SEPARATOR "/"
+
+#define GET_SO_SUFFIX(VER) SO_SUFFIX_HELPER(VER)
+static const char* LIB_AF_BKND_SUFFIX = GET_SO_SUFFIX(AF_VERSION_MAJOR);
+#endif
+
+string getBkndLibName(const af_backend backend) {
+    string ret;
+    switch (backend) {
+        case AF_BACKEND_CUDA:
+            ret = string(LIB_AF_BKND_PREFIX) + "afcuda" + LIB_AF_BKND_SUFFIX;
+            break;
+        case AF_BACKEND_OPENCL:
+            ret = string(LIB_AF_BKND_PREFIX) + "afopencl" + LIB_AF_BKND_SUFFIX;
+            break;
+        case AF_BACKEND_CPU:
+            ret = string(LIB_AF_BKND_PREFIX) + "afcpu" + LIB_AF_BKND_SUFFIX;
+            break;
+        case AF_BACKEND_ONEAPI:
+            ret = string(LIB_AF_BKND_PREFIX) + "afoneapi" + LIB_AF_BKND_SUFFIX;
+            break;
+        default: assert(1 != 1 && "Invalid backend");
+    }
+    return ret;
+}
+string getBackendDirectoryName(const af_backend backend) {
+    string ret;
+    switch (backend) {
+        case AF_BACKEND_CUDA: ret = "cuda"; break;
+        case AF_BACKEND_OPENCL: ret = "opencl"; break;
+        case AF_BACKEND_CPU: ret = "cpu"; break;
+        case AF_BACKEND_ONEAPI: ret = "oneapi"; break;
+        default: assert(1 != 1 && "Invalid backend");
+    }
+    return ret;
+}
+
+string join_path(string first) { return first; }
+
+template<typename... ARGS>
+string join_path(const string& first, ARGS... args) {
+    if (first.empty()) {
+        return join_path(args...);
+    } else {
+        return first + PATH_SEPARATOR + join_path(args...);
+    }
+}
+
+/// Searches the paths below for the backend library; returns nullptr on failure.
+LibHandle openDynLibrary(const af_backend bknd_idx) {
+    // Candidate directories are tried in the order listed in pathPrefixes;
+    // the empty prefix defers to the system loader's default search path.
+    string bkndLibName  = getBkndLibName(bknd_idx);
+    string show_flag    = getEnvVar("AF_SHOW_LOAD_PATH");
+    bool show_load_path = show_flag == "1";
+
+    // FIXME(umar): avoid this if at all possible
+    auto getLogger = [&] { return spdlog::get("unified"); };
+
+    string pathPrefixes[] = {
+        "",   // empty prefix i.e. just the library name will enable search in
+              // system default paths such as LD_LIBRARY_PATH, Program
+              // Files(Windows) etc.
+        ".",  // Shared libraries in current directory
+        // Running from the CMake Build directory
+        join_path(".", "src", "backend", getBackendDirectoryName(bknd_idx)),
+        // Running from the test directory
+        join_path("..", "src", "backend", getBackendDirectoryName(bknd_idx)),
+        // Environment variable PATHS
+        join_path(getEnvVar("AF_BUILD_PATH"), "src", "backend",
+                  getBackendDirectoryName(bknd_idx)),
+        join_path(getEnvVar("AF_PATH"), "lib"),
+        join_path(getEnvVar("AF_PATH"), "lib64"),
+        getEnvVar("AF_BUILD_LIB_CUSTOM_PATH"),
+
+    // Common install paths
+#if !defined(OS_WIN)
+        "/opt/arrayfire-3/lib/",
+        "/opt/arrayfire/lib/",
+        "/usr/local/lib/",
+        "/usr/local/arrayfire/lib/"
+#else
+        join_path(getEnvVar("ProgramFiles"), "ArrayFire", "lib"),
+        join_path(getEnvVar("ProgramFiles"), "ArrayFire", "v3", "lib")
+#endif
+    };
+    typedef af_err (*func)(int*);
+
+    LibHandle retVal = nullptr;
+
+    for (auto& pathPrefix : pathPrefixes) {
+        AF_TRACE("Attempting: {}",
+                 (pathPrefix.empty() ? "Default System Paths" : pathPrefix));
+        if ((retVal =
+                 loadLibrary(join_path(pathPrefix, bkndLibName).c_str()))) {
+            AF_TRACE("Found: {}", join_path(pathPrefix, bkndLibName));
+
+            func count_func = reinterpret_cast<func>(
+                getFunctionPointer(retVal, "af_get_device_count"));
+            if (count_func) {
+                int count = 0;
+                count_func(&count);
+                AF_TRACE("Device Count: {}.", count);
+                if (count == 0) {
+                    AF_TRACE("Skipping: No devices found for {}", bkndLibName);
+                    retVal = nullptr;
+                    continue;
+                }
+            }
+
+            if (show_load_path) { printf("Using %s\n", bkndLibName.c_str()); }
+            break;
+        } else {
+            AF_TRACE("Failed to load {}", getErrorMessage());
+        }
+    }
+    return retVal;
+}
+
+spdlog::logger* AFSymbolManager::getLogger() { return logger.get(); }
+
+af::Backend& getActiveBackend() {
+    thread_local af_backend activeBackend =
+        AFSymbolManager::getInstance().getDefaultBackend();
+    return activeBackend;
+}
+
+LibHandle& getActiveHandle() {
+    thread_local LibHandle activeHandle =
+        AFSymbolManager::getInstance().getDefaultHandle();
+    return activeHandle;
+}
+
+AFSymbolManager::AFSymbolManager()
+    : defaultHandle(nullptr)
+    , numBackends(0)
+    , backendsAvailable(0)
+    , logger(loggerFactory("unified")) {
+    // In order of priority.
+    static const af_backend order[] = {AF_BACKEND_CUDA, AF_BACKEND_ONEAPI,
+                                       AF_BACKEND_OPENCL, AF_BACKEND_CPU};
+    LibHandle handle                = nullptr;
+    af::Backend backend             = AF_BACKEND_DEFAULT;
+    // Decrementing loop. The last successfully loaded backend will be the most
+    // preferred one.
+    for (int i = NUM_BACKENDS - 1; i >= 0; i--) {
+        int bknd_idx          = backend_index(order[i]);
+        bkndHandles[bknd_idx] = openDynLibrary(order[i]);
+        if (bkndHandles[bknd_idx]) {
+            handle  = bkndHandles[bknd_idx];
+            backend = order[i];
+            numBackends++;
+            backendsAvailable += order[i];
+        }
+    }
+    if (backend) {
+        AF_TRACE("AF_DEFAULT_BACKEND: {}", getBackendDirectoryName(backend));
+        defaultBackend = backend;
+    } else {
+        logger->error("Backend was not found");
+        defaultBackend = AF_BACKEND_DEFAULT;
+    }
+
+    // Keep a copy of the default handle in order to use it in ::setBackend
+    // when the user passes AF_BACKEND_DEFAULT
+    defaultHandle = handle;
+}
+
+AFSymbolManager::~AFSymbolManager() {
+    for (auto& bkndHandle : bkndHandles) {
+        if (bkndHandle) { unloadLibrary(bkndHandle); }
+    }
+}
+
+unsigned AFSymbolManager::getBackendCount() const { return numBackends; }
+
+int AFSymbolManager::getAvailableBackends() const { return backendsAvailable; }
+
+af_err setBackend(af::Backend bknd) {
+    auto& instance = AFSymbolManager::getInstance();
+    if (bknd == AF_BACKEND_DEFAULT) {
+        if (instance.getDefaultHandle()) {
+            getActiveHandle()  = instance.getDefaultHandle();
+            getActiveBackend() = instance.getDefaultBackend();
+            return AF_SUCCESS;
+        } else {
+            UNIFIED_ERROR_LOAD_LIB();
+        }
+    }
+    int idx = backend_index(bknd);
+    if (instance.getHandle(idx)) {
+        getActiveHandle()  = instance.getHandle(idx);
+        getActiveBackend() = bknd;
+        return AF_SUCCESS;
+    } else {
+        UNIFIED_ERROR_LOAD_LIB();
+    }
+}
+
+}  // namespace unified
+}  // namespace arrayfire
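The backend selection in the `AFSymbolManager` constructor above can be illustrated in isolation. The following is a minimal sketch, not the library's code: the `Backend` enumerators, their numeric values, the `Selection` struct, and the `canLoad` callback (standing in for `openDynLibrary`) are all assumptions made for the example. It walks the priority list from lowest to highest priority, so the last backend that loads becomes the default, and it accumulates the loaded backends into a bitmask.

```cpp
#include <cassert>
#include <functional>

// Illustrative backend identifiers; the values are assumed to be distinct
// bit flags so that they can be summed into an availability mask.
enum Backend { kDefault = 0, kCpu = 1, kCuda = 2, kOpenCL = 4, kOneApi = 8 };

// Hypothetical result type, used only in this sketch.
struct Selection {
    Backend preferred;  // highest-priority backend that loaded
    int availableMask;  // sum of the flags of every backend that loaded
    unsigned count;     // number of backends that loaded
};

// canLoad stands in for openDynLibrary(): it reports whether a backend
// library could be opened.
inline Selection selectBackends(const std::function<bool(Backend)>& canLoad) {
    // Highest priority first, as in the constructor.
    static const Backend order[] = {kCuda, kOneApi, kOpenCL, kCpu};
    Selection result{kDefault, 0, 0};
    // Iterate from lowest to highest priority; each success overwrites the
    // previous choice, so the final value is the most preferred backend.
    for (int i = 3; i >= 0; --i) {
        if (canLoad(order[i])) {
            result.preferred = order[i];
            result.availableMask += order[i];
            ++result.count;
        }
    }
    return result;
}
```

With only the CPU and OpenCL libraries loadable, the sketch selects OpenCL as the default and reports a mask of 5. If nothing loads, the preferred backend stays at the default value, which corresponds to the "Backend was not found" branch.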
diff --git a/src/api/unified/symbol_manager.hpp b/src/api/unified/symbol_manager.hpp
new file mode 100644
index 0000000000..7f96f586e2
--- /dev/null
+++ b/src/api/unified/symbol_manager.hpp
@@ -0,0 +1,171 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/Logger.hpp>
+#include <common/err_common.hpp>
+#include <common/module_loading.hpp>
+#include <common/util.hpp>
+#include <af/backend.h>
+#include <af/defines.h>
+
+#include <spdlog/spdlog.h>
+#include <array>
+#include <cstdlib>
+#include <string>
+#include <unordered_map>
+
+namespace arrayfire {
+namespace unified {
+
+const int NUM_BACKENDS = 4;
+
+#define UNIFIED_ERROR_LOAD_LIB()                                       \
+    AF_RETURN_ERROR(                                                   \
+        "Failed to load dynamic library. "                             \
+        "See http://www.arrayfire.com/docs/unifiedbackend.htm "        \
+        "for instructions to set up environment for Unified backend.", \
+        AF_ERR_LOAD_LIB)
+
+static inline int backend_index(af::Backend be) {
+    switch (be) {
+        case AF_BACKEND_CPU: return 0;
+        case AF_BACKEND_CUDA: return 1;
+        case AF_BACKEND_OPENCL: return 2;
+        case AF_BACKEND_ONEAPI: return 3;
+        default: return -1;
+    }
+}
+
+class AFSymbolManager {
+   public:
+    static AFSymbolManager& getInstance() {
+        static AFSymbolManager* symbolManager = new AFSymbolManager();
+        return *symbolManager;
+    }
+
+    ~AFSymbolManager();
+
+    unsigned getBackendCount() const;
+    int getAvailableBackends() const;
+    af::Backend getDefaultBackend() { return defaultBackend; }
+    LibHandle getDefaultHandle() { return defaultHandle; }
+
+    spdlog::logger* getLogger();
+    LibHandle getHandle(int idx) { return bkndHandles[idx]; }
+
+   protected:
+    AFSymbolManager();
+
+    // The following two declarations are required to
+    // prevent accidental copy/assignment of the
+    // instance returned by getInstance to other
+    // variables
+    AFSymbolManager(AFSymbolManager const&);
+    void operator=(AFSymbolManager const&);
+
+   private:
+    LibHandle bkndHandles[NUM_BACKENDS]{};
+
+    LibHandle defaultHandle;
+    unsigned numBackends;
+    int backendsAvailable;
+    af_backend defaultBackend;
+    std::shared_ptr<spdlog::logger> logger;
+};
+
+af_err setBackend(af::Backend bknd);
+
+af::Backend& getActiveBackend();
+
+LibHandle& getActiveHandle();
+
+namespace {
+bool checkArray(af_backend activeBackend, const af_array a) {
+    // Retrieve the backend that the af_array was created on.
+    // See ArrayInfo.hpp for more
+    af_backend backend = (af_backend)0;
+
+    // This condition is required so that the invalid-argument tests for the
+    // unified backend return the expected error. Since a == 0 has no backend
+    // specified, the call should fail with AF_ERR_ARG rather than
+    // AF_ERR_ARR_BKND_MISMATCH.
+    if (a == 0) return true;
+
+    af_get_backend_id(&backend, a);
+    return backend == activeBackend;
+}
+
+[[gnu::unused]] bool checkArray(af_backend activeBackend, const af_array* a) {
+    if (a) {
+        return checkArray(activeBackend, *a);
+    } else {
+        return true;
+    }
+}
+
+[[gnu::unused]] bool checkArrays(af_backend activeBackend) {
+    UNUSED(activeBackend);
+    // Dummy
+    return true;
+}
+
+}  // namespace
+
+template<typename T, typename... Args>
+bool checkArrays(af_backend activeBackend, T a, Args... arg) {
+    return checkArray(activeBackend, a) && checkArrays(activeBackend, arg...);
+}
+
+}  // namespace unified
+}  // namespace arrayfire
+
+/// Checks that the given af_arrays belong to the active backend.
+///
+/// Compares the active backend with the backend of each af_array. If they do
+/// not match, an error is returned. This macro accepts af_arrays and pointers
+/// to af_arrays. Null pointers to af_arrays are considered acceptable.
+///
+/// \param[in] Any number of af_arrays or pointers to af_arrays
+#define CHECK_ARRAYS(...)                                                     \
+    do {                                                                      \
+        af_backend backendId = arrayfire::unified::getActiveBackend();        \
+        if (!arrayfire::unified::checkArrays(backendId, __VA_ARGS__))         \
+            AF_RETURN_ERROR("Input array does not belong to current backend", \
+                            AF_ERR_ARR_BKND_MISMATCH);                        \
+    } while (0)
+
+#define CALL(FUNCTION, ...)                                                      \
+    using af_func                  = std::add_pointer<decltype(FUNCTION)>::type; \
+    thread_local af_backend index_ = arrayfire::unified::getActiveBackend();     \
+    if (arrayfire::unified::getActiveHandle()) {                                 \
+        thread_local af_func func =                                              \
+            (af_func)arrayfire::common::getFunctionPointer(                      \
+                arrayfire::unified::getActiveHandle(), __func__);                \
+        if (!func) {                                                             \
+            AF_RETURN_ERROR(                                                     \
+                "requested symbol name could not be found in loaded library.",   \
+                AF_ERR_LOAD_LIB);                                                \
+        }                                                                        \
+        if (index_ != arrayfire::unified::getActiveBackend()) {                  \
+            index_ = arrayfire::unified::getActiveBackend();                     \
+            func   = (af_func)arrayfire::common::getFunctionPointer(             \
+                arrayfire::unified::getActiveHandle(), __func__);              \
+        }                                                                        \
+        return func(__VA_ARGS__);                                                \
+    } else {                                                                     \
+        AF_RETURN_ERROR("ArrayFire couldn't locate any backends.",               \
+                        AF_ERR_LOAD_LIB);                                        \
+    }
+
+#define CALL_NO_PARAMS(FUNCTION) CALL(FUNCTION)
+
+#define LOAD_SYMBOL()                      \
+    arrayfire::common::getFunctionPointer( \
+        arrayfire::unified::getActiveHandle(), __FUNCTION__)
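Note on the checkArrays pattern above: the variadic template peels off one argument per step and ends at the zero-argument overload. The following is a minimal standalone sketch of that recursion, not the ArrayFire code itself. `Handle`, `checkOne`, and `checkAll` are hypothetical names, and a `backend` value of 0 is assumed to play the role of the null af_array that checkArray accepts.

```cpp
#include <cassert>

// Hypothetical stand-in for af_array: it only carries a backend tag.
// A tag of 0 is assumed to mean "no backend", like a null af_array.
struct Handle {
    int backend;
};

// Accept a handle that has no backend or that matches the active backend.
inline bool checkOne(int active, const Handle& h) {
    return h.backend == 0 || h.backend == active;
}

// Pointer overload: a null pointer is considered acceptable.
inline bool checkOne(int active, const Handle* h) {
    return h == nullptr || checkOne(active, *h);
}

// Base case: no handles left to check.
inline bool checkAll(int) { return true; }

// Recursive case: check the first argument, then recurse on the rest.
template<typename T, typename... Rest>
bool checkAll(int active, T first, Rest... rest) {
    return checkOne(active, first) && checkAll(active, rest...);
}
```

Because `&&` short-circuits, the recursion stops at the first mismatching argument, which is presumably why CHECK_ARRAYS can report a single AF_ERR_ARR_BKND_MISMATCH for any number of inputs.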
diff --git a/src/api/unified/util.cpp b/src/api/unified/util.cpp
new file mode 100644
index 0000000000..5833046a3f
--- /dev/null
+++ b/src/api/unified/util.cpp
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/util.h>
+#include "symbol_manager.hpp"
+
+af_err af_print_array(af_array arr) {
+    CHECK_ARRAYS(arr);
+    CALL(af_print_array, arr);
+}
+
+af_err af_print_array_gen(const char *exp, const af_array arr,
+                          const int precision) {
+    CHECK_ARRAYS(arr);
+    CALL(af_print_array_gen, exp, arr, precision);
+}
+
+af_err af_save_array(int *index, const char *key, const af_array arr,
+                     const char *filename, const bool append) {
+    CHECK_ARRAYS(arr);
+    CALL(af_save_array, index, key, arr, filename, append);
+}
+
+af_err af_read_array_index(af_array *out, const char *filename,
+                           const unsigned index) {
+    CALL(af_read_array_index, out, filename, index);
+}
+
+af_err af_read_array_key(af_array *out, const char *filename, const char *key) {
+    CALL(af_read_array_key, out, filename, key);
+}
+
+af_err af_read_array_key_check(int *index, const char *filename,
+                               const char *key) {
+    CALL(af_read_array_key_check, index, filename, key);
+}
+
+af_err af_array_to_string(char **output, const char *exp, const af_array arr,
+                          const int precision, const bool transpose) {
+    CHECK_ARRAYS(arr);
+    CALL(af_array_to_string, output, exp, arr, precision, transpose);
+}
+
+af_err af_example_function(af_array *out, const af_array a,
+                           const af_someenum_t param) {
+    CHECK_ARRAYS(a);
+    CALL(af_example_function, out, a, param);
+}
diff --git a/src/api/unified/vision.cpp b/src/api/unified/vision.cpp
new file mode 100644
index 0000000000..b600c4355f
--- /dev/null
+++ b/src/api/unified/vision.cpp
@@ -0,0 +1,103 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/vision.h>
+#include "symbol_manager.hpp"
+
+af_err af_fast(af_features *out, const af_array in, const float thr,
+               const unsigned arc_length, const bool non_max,
+               const float feature_ratio, const unsigned edge) {
+    CHECK_ARRAYS(in);
+    CALL(af_fast, out, in, thr, arc_length, non_max, feature_ratio, edge);
+}
+
+af_err af_harris(af_features *out, const af_array in,
+                 const unsigned max_corners, const float min_response,
+                 const float sigma, const unsigned block_size,
+                 const float k_thr) {
+    CHECK_ARRAYS(in);
+    CALL(af_harris, out, in, max_corners, min_response, sigma, block_size,
+         k_thr);
+}
+
+af_err af_orb(af_features *feat, af_array *desc, const af_array in,
+              const float fast_thr, const unsigned max_feat,
+              const float scl_fctr, const unsigned levels,
+              const bool blur_img) {
+    CHECK_ARRAYS(in);
+    CALL(af_orb, feat, desc, in, fast_thr, max_feat, scl_fctr, levels,
+         blur_img);
+}
+
+af_err af_sift(af_features *feat, af_array *desc, const af_array in,
+               const unsigned n_layers, const float contrast_thr,
+               const float edge_thr, const float init_sigma,
+               const bool double_input, const float intensity_scale,
+               const float feature_ratio) {
+    CHECK_ARRAYS(in);
+    CALL(af_sift, feat, desc, in, n_layers, contrast_thr, edge_thr, init_sigma,
+         double_input, intensity_scale, feature_ratio);
+}
+
+af_err af_gloh(af_features *feat, af_array *desc, const af_array in,
+               const unsigned n_layers, const float contrast_thr,
+               const float edge_thr, const float init_sigma,
+               const bool double_input, const float intensity_scale,
+               const float feature_ratio) {
+    CHECK_ARRAYS(in);
+    CALL(af_gloh, feat, desc, in, n_layers, contrast_thr, edge_thr, init_sigma,
+         double_input, intensity_scale, feature_ratio);
+}
+
+af_err af_hamming_matcher(af_array *idx, af_array *dist, const af_array query,
+                          const af_array train, const dim_t dist_dim,
+                          const unsigned n_dist) {
+    CHECK_ARRAYS(query, train);
+    CALL(af_hamming_matcher, idx, dist, query, train, dist_dim, n_dist);
+}
+
+af_err af_nearest_neighbour(af_array *idx, af_array *dist, const af_array query,
+                            const af_array train, const dim_t dist_dim,
+                            const unsigned n_dist,
+                            const af_match_type dist_type) {
+    CHECK_ARRAYS(query, train);
+    CALL(af_nearest_neighbour, idx, dist, query, train, dist_dim, n_dist,
+         dist_type);
+}
+
+af_err af_match_template(af_array *out, const af_array search_img,
+                         const af_array template_img,
+                         const af_match_type m_type) {
+    CHECK_ARRAYS(search_img, template_img);
+    CALL(af_match_template, out, search_img, template_img, m_type);
+}
+
+af_err af_susan(af_features *out, const af_array in, const unsigned radius,
+                const float diff_thr, const float geom_thr,
+                const float feature_ratio, const unsigned edge) {
+    CHECK_ARRAYS(in);
+    CALL(af_susan, out, in, radius, diff_thr, geom_thr, feature_ratio, edge);
+}
+
+af_err af_dog(af_array *out, const af_array in, const int radius1,
+              const int radius2) {
+    CHECK_ARRAYS(in);
+    CALL(af_dog, out, in, radius1, radius2);
+}
+
+af_err af_homography(af_array *H, int *inliers, const af_array x_src,
+                     const af_array y_src, const af_array x_dst,
+                     const af_array y_dst, const af_homography_type htype,
+                     const float inlier_thr, const unsigned iterations,
+                     const af_dtype type) {
+    CHECK_ARRAYS(x_src, y_src, x_dst, y_dst);
+    CALL(af_homography, H, inliers, x_src, y_src, x_dst, y_dst, htype,
+         inlier_thr, iterations, type);
+}
diff --git a/src/backend/ArrayInfo.cpp b/src/backend/ArrayInfo.cpp
deleted file mode 100644
index 20c5bd88e2..0000000000
--- a/src/backend/ArrayInfo.cpp
+++ /dev/null
@@ -1,174 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <ArrayInfo.hpp>
-#include <numeric>
-#include <algorithm>
-#include <functional>
-#include <err_common.hpp>
-
-using af::dim4;
-
-dim_t
-calcOffset(const af::dim4 &strides, const af::dim4 &offsets)
-{
-    dim_t offset = 0;
-    for (int i = 0; i < 4; i++) offset += offsets[i] * strides[i];
-    return offset;
-}
-
-
-const ArrayInfo&
-getInfo(af_array arr)
-{
-    const ArrayInfo *info = static_cast<ArrayInfo*>(reinterpret_cast<void *>(arr));
-    return *info;
-}
-
-af_err
-af_get_elements(dim_t *elems, const af_array arr)
-{
-    *elems =  getInfo(arr).elements();
-    return AF_SUCCESS; //FIXME: Catch exceptions correctly
-}
-
-af_err af_get_type(af_dtype *type, const af_array arr)
-{
-    *type = getInfo(arr).getType();
-    return AF_SUCCESS; //FIXME: Catch exceptions correctly
-}
-
-dim4 calcStrides(const dim4 &parentDim)
-{
-    dim4 out(1, 1, 1, 1);
-    dim_t *out_dims = out.get();
-    const dim_t *parent_dims =  parentDim.get();
-
-    for (dim_t i=1; i < 4; i++) {
-        out_dims[i] = out_dims[i - 1] * parent_dims[i-1];
-    }
-
-    return out;
-}
-
-void ArrayInfo::modStrides(const dim4 &newStrides)
-{
-    dim_strides = newStrides;
-}
-
-void ArrayInfo::modDims(const dim4 &newDims)
-{
-    dim_size   = newDims;
-    modStrides(calcStrides(newDims));
-}
-
-bool ArrayInfo::isEmpty() const
-{
-    return (elements() == 0);
-}
-
-bool ArrayInfo::isScalar() const
-{
-    return (elements() == 1);
-}
-
-bool ArrayInfo::isRow() const
-{
-    return (dims()[0] == 1 && dims()[1] > 1 && dims()[2] == 1 && dims()[3] == 1);
-}
-
-bool ArrayInfo::isColumn() const
-{
-    return (dims()[0] > 1 && dims()[1] == 1 && dims()[2] == 1 && dims()[3] == 1);
-}
-
-bool ArrayInfo::isVector() const
-{
-    int singular_dims = 0;
-    for(int i = 0; i < AF_MAX_DIMS; i++) {
-        singular_dims += (dims()[i] == 1);
-    }
-    return singular_dims == AF_MAX_DIMS - 1;
-}
-
-bool ArrayInfo::isComplex() const
-{
-    return ((type == c32) || (type == c64));
-}
-
-bool ArrayInfo::isReal() const
-{
-    return !isComplex();
-}
-
-bool ArrayInfo::isDouble() const
-{
-    return (type == f64 || type == c64);
-}
-
-bool ArrayInfo::isSingle() const
-{
-    return (type == f32 || type == c32);
-}
-
-bool ArrayInfo::isRealFloating() const
-{
-    return (type == f64 || type == f32);
-}
-
-bool ArrayInfo::isFloating() const
-{
-    return (!isInteger() && !isBool());
-}
-
-bool ArrayInfo::isInteger() const
-{
-    return (type == s32
-         || type == u32
-         || type == s64
-         || type == u64
-         || type == u8);
-}
-
-bool ArrayInfo::isBool() const
-{
-    return (type == b8);
-}
-
-bool ArrayInfo::isLinear() const
-{
-    if (ndims() == 1) {
-        return dim_strides[0] == 1;
-    }
-
-    dim_t count = 1;
-    for (int i = 0; i < (int)ndims(); i++) {
-        if (count != dim_strides[i]) {
-            return false;
-        }
-        count *= dim_size[i];
-    }
-    return true;
-}
-
-dim4 getOutDims(const dim4 &ldims, const dim4 &rdims, bool batchMode)
-{
-    if (!batchMode) {
-        DIM_ASSERT(1, ldims == rdims);
-        return ldims;
-    }
-
-    dim_t odims[] = {1, 1, 1, 1};
-    for (int i = 0; i < 4; i++) {
-        DIM_ASSERT(1, ldims[i] == rdims[i] || ldims[i] == 1 || rdims[i] == 1);
-        odims[i] = std::max(ldims[i], rdims[i]);
-    }
-
-    return dim4(4, odims);
-}
diff --git a/src/backend/ArrayInfo.hpp b/src/backend/ArrayInfo.hpp
deleted file mode 100644
index b798051d9b..0000000000
--- a/src/backend/ArrayInfo.hpp
+++ /dev/null
@@ -1,123 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <af/util.h>
-#include <af/dim4.hpp>
-#include <af/device.h>
-#include <vector>
-
-dim_t
-calcOffset(const af::dim4 &strides, const af::dim4 &offsets);
-
-af::dim4
-calcStrides(const af::dim4 &parentDim);
-
-af::dim4 getOutDims(const af::dim4 &ldims, const af::dim4 &rdims, bool batchMode);
-
-/// Array Arrayementation Info class
-// This class is the base class to all Array objects. The purpose of this class
-// was to have a way to retrieve basic information of an Array object without
-// specifying what type the object is at compile time.
-class ArrayInfo
-{
-private:
-    int             devId;
-    af_dtype        type;
-    af::dim4        dim_size;
-    af::dim4        dim_offsets, dim_strides;
-
-public:
-    ArrayInfo(int id, af::dim4 size, af::dim4 offset, af::dim4 stride, af_dtype af_type):
-        devId(id),
-        type(af_type),
-        dim_size(size),
-        dim_offsets(offset),
-        dim_strides(stride)
-    { af_init(); }
-
-#if __cplusplus > 199711L
-    //Copy constructors are deprecated if there is a
-    //user-defined destructor in c++11
-    ArrayInfo(const ArrayInfo& other) = default;
-#endif
-    ~ArrayInfo() {}
-
-    const af_dtype& getType() const     { return type;                  }
-
-    const af::dim4& offsets() const     { return dim_offsets;           }
-
-    const af::dim4& strides()    const  { return dim_strides;           }
-
-    size_t elements() const             { return dim_size.elements();   }
-    size_t ndims() const                { return dim_size.ndims();      }
-    const af::dim4& dims() const        { return dim_size;              }
-
-    int getDevId() const { return devId; }
-
-    void setId(int id) const { const_cast<ArrayInfo *>(this)->setId(id); }
-    void setId(int id) { devId = id; }
-
-    void resetInfo(const af::dim4& dims)
-    {
-        dim_size = dims;
-        dim_strides = calcStrides(dims);
-        dim_offsets = af::dim4(0,0,0,0);
-    }
-
-    void resetDims(const af::dim4& dims)
-    {
-        dim_size = dims;
-    }
-
-    void modDims(const af::dim4 &newDims);
-
-    void modStrides(const af::dim4 &newStrides);
-
-    bool isEmpty() const;
-
-    bool isScalar() const;
-
-    bool isRow() const;
-
-    bool isColumn() const;
-
-    bool isVector() const;
-
-    bool isComplex() const;
-
-    bool isReal() const;
-
-    bool isDouble() const;
-
-    bool isSingle() const;
-
-    bool isRealFloating() const;
-
-    bool isFloating() const;
-
-    bool isInteger() const;
-
-    bool isBool() const;
-
-    bool isLinear() const;
-};
-
-// Returns size and time info for an array object.
-// Note this doesn't require template parameters.
-const  ArrayInfo&
-getInfo(const af_array arr);
-
-
-af::dim4 toDims(const std::vector<af_seq>& seqs, const af::dim4 &parentDims);
-
-af::dim4 toOffset(const std::vector<af_seq>& seqs, const af::dim4 &parentDims);
-
-af::dim4 toStride(const std::vector<af_seq>& seqs, const af::dim4 &parentDims);
diff --git a/src/backend/cblas.cpp b/src/backend/cblas.cpp
deleted file mode 100644
index 1b582c516a..0000000000
--- a/src/backend/cblas.cpp
+++ /dev/null
@@ -1,65 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#include <blas.hpp>
-
-#ifdef USE_F77_BLAS
-#define ADD_
-#include <cblas_f77.h>
-
-static char transChar(CBLAS_TRANSPOSE Trans)
-{
-    switch(Trans) {
-        case CblasNoTrans:      return 'N';
-        case CblasTrans:        return 'T';
-        case CblasConjTrans:    return 'C';
-        default:                return '\0';
-    }
-}
-
-#define GEMM_F77(X, TS, TV, TY)                                                     \
-void cblas_##X##gemm(                                                               \
-       const CBLAS_ORDER Order, const CBLAS_TRANSPOSE TransA,                       \
-       const CBLAS_TRANSPOSE TransB, const int M, const int N,                      \
-       const int K, const TS alpha, const TV *A,                                    \
-       const int lda, const TV *B, const int ldb,                                   \
-       const TS beta, TV *C, const int ldc)                                         \
-{                                                                                   \
-    char aT = transChar(TransA);                                                    \
-    char bT = transChar(TransB);                                                    \
-    X##gemm_(&aT, &bT, &M, &N, &K,                                                  \
-            (const TY *)ADDR(alpha), (const TY *)A, &lda,                           \
-            (const TY *)B, &ldb,                                                    \
-            (const TY *)ADDR(beta), (TY *)C, &ldc);                                 \
-}                                                                                   \
-void cblas_##X##gemv(                                                               \
-        const CBLAS_ORDER order, const CBLAS_TRANSPOSE TransA,                      \
-        const int M, const int N,                                                   \
-        const TS alpha, const TV *A, const int lda,                                 \
-        const TV *X, const int incX, const TS beta,                                 \
-        TV *Y, const int incY)                                                      \
-{                                                                                   \
-    char aT = transChar(TransA);                                                    \
-    X##gemv_(&aT, &M, &N,                                                           \
-            (const TY *)ADDR(alpha), (const TY *)A, &lda,                           \
-            (const TY *)X, &incX,                                                   \
-            (const TY *)ADDR(beta), (TY *)Y, &incY);                                \
-}                                                                                   \
-
-#define ADDR(val) &val
-GEMM_F77(s, float, float, float)
-GEMM_F77(d, double, double, double)
-#undef ADDR
-
-#define ADDR(val) val
-GEMM_F77(c, void *, void, float)
-GEMM_F77(z, void *, void, double)
-#undef ADDR
-
-#endif
diff --git a/src/backend/common/AllocatorInterface.hpp b/src/backend/common/AllocatorInterface.hpp
new file mode 100644
index 0000000000..0df799efdb
--- /dev/null
+++ b/src/backend/common/AllocatorInterface.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <cstddef>
+#include <memory>
+
+namespace spdlog {
+class logger;
+}
+namespace arrayfire {
+namespace common {
+
+/**
+ * An interface that provides backend-specific memory management functions,
+ * typically calling a dedicated backend-specific native API. Stored, wrapped,
+ * and called by a MemoryManagerBase, from which calls to its interface are
+ * delegated.
+ */
+class AllocatorInterface {
+   public:
+    AllocatorInterface() = default;
+    virtual ~AllocatorInterface() {}
+    virtual void shutdown()                       = 0;
+    virtual int getActiveDeviceId()               = 0;
+    virtual size_t getMaxMemorySize(int id)       = 0;
+    virtual void *nativeAlloc(const size_t bytes) = 0;
+    virtual void nativeFree(void *ptr)            = 0;
+    virtual spdlog::logger *getLogger() final { return this->logger.get(); }
+
+   protected:
+    std::shared_ptr<spdlog::logger> logger;
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/ArrayFireTypesIO.hpp b/src/backend/common/ArrayFireTypesIO.hpp
new file mode 100644
index 0000000000..e7a2e085ee
--- /dev/null
+++ b/src/backend/common/ArrayFireTypesIO.hpp
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/Version.hpp>
+#include <spdlog/fmt/ostr.h>
+#include <af/dim4.hpp>
+#include <af/seq.h>
+#include <complex>
+
+template<>
+struct fmt::formatter<af_seq> {
+    constexpr auto parse(format_parse_context& ctx) -> decltype(ctx.begin()) {
+        return ctx.begin();
+    }
+
+    template<typename FormatContext>
+    auto format(const af_seq& p, FormatContext& ctx) -> decltype(ctx.out()) {
+        // ctx.out() is an output iterator to write to.
+        if (p.begin == af_span.begin && p.end == af_span.end &&
+            p.step == af_span.step) {
+            return format_to(ctx.out(), "span");
+        }
+        if (p.begin == p.end) { return format_to(ctx.out(), "{}", p.begin); }
+        if (p.step == 1) {
+            return format_to(ctx.out(), "({} -> {})", p.begin, p.end);
+        }
+        return format_to(ctx.out(), "({} -({})-> {})", p.begin, p.step, p.end);
+    }
+};
+
+#if FMT_VERSION >= 90000
+template<>
+struct fmt::formatter<af::dim4> : ostream_formatter {};
+template<>
+struct fmt::formatter<std::complex<float>> : ostream_formatter {};
+template<>
+struct fmt::formatter<std::complex<double>> : ostream_formatter {};
+#endif
+
+template<>
+struct fmt::formatter<arrayfire::common::Version> {
+    // show major version
+    bool show_major = false;
+    // show minor version
+    bool show_minor = false;
+    // show patch version
+    bool show_patch = false;
+
+    // Parses format specifications of the form ['M' | 'm' | 'p'].
+    constexpr auto parse(format_parse_context& ctx) -> decltype(ctx.begin()) {
+        auto it = ctx.begin(), end = ctx.end();
+        if (it == end || *it == '}') {
+            show_major = show_minor = show_patch = true;
+            return it;
+        }
+        do {
+            switch (*it) {
+                case 'M': show_major = true; break;
+                case 'm': show_minor = true; break;
+                case 'p': show_patch = true; break;
+                default: throw format_error("invalid format");
+            }
+            ++it;
+        } while (it != end && *it != '}');
+        return it;
+    }
+
+    template<typename FormatContext>
+    auto format(const arrayfire::common::Version& ver, FormatContext& ctx)
+        -> decltype(ctx.out()) {
+        if (ver.major() == -1) return format_to(ctx.out(), "N/A");
+        if (ver.minor() == -1) show_minor = false;
+        if (ver.patch() == -1) show_patch = false;
+        if (show_major && !show_minor && !show_patch) {
+            return format_to(ctx.out(), "{}", ver.major());
+        }
+        if (show_major && show_minor && !show_patch) {
+            return format_to(ctx.out(), "{}.{}", ver.major(), ver.minor());
+        }
+        if (show_major && show_minor && show_patch) {
+            return format_to(ctx.out(), "{}.{}.{}", ver.major(), ver.minor(),
+                             ver.patch());
+        }
+        return ctx.out();
+    }
+};
diff --git a/src/backend/common/ArrayInfo.cpp b/src/backend/common/ArrayInfo.cpp
new file mode 100644
index 0000000000..60c55c3e52
--- /dev/null
+++ b/src/backend/common/ArrayInfo.cpp
@@ -0,0 +1,236 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/traits.hpp>
+#include <algorithm>
+#include <cstring>
+#include <functional>
+#include <numeric>
+
+#include <backend.hpp>
+#include <platform.hpp>
+
+using af::dim4;
+
+dim4 calcStrides(const dim4 &parentDim) {
+    dim4 out(1, 1, 1, 1);
+    dim_t *out_dims          = out.get();
+    const dim_t *parent_dims = parentDim.get();
+
+    for (dim_t i = 1; i < 4; i++) {
+        out_dims[i] = out_dims[i - 1] * parent_dims[i - 1];
+    }
+
+    return out;
+}
+
+ArrayInfo::ArrayInfo(unsigned id, af::dim4 size, dim_t offset_, af::dim4 stride,
+                     af_dtype af_type)
+    : devId(id)
+    , type(af_type)
+    , dim_size(size)
+    , offset(offset_)
+    , dim_strides(stride)
+    , is_sparse(false) {
+    setId(id);
+    static_assert(std::is_move_assignable<ArrayInfo>::value,
+                  "ArrayInfo is not move assignable");
+    static_assert(std::is_move_constructible<ArrayInfo>::value,
+                  "ArrayInfo is not move constructible");
+    static_assert(
+        offsetof(ArrayInfo, devId) == 0,
+        "ArrayInfo::devId must be the first member variable of ArrayInfo. \
+                   devId is used to encode the backend into the integer. \
+                   This is then used in the unified backend to check mismatched arrays.");
+    static_assert(std::is_standard_layout<ArrayInfo>::value,
+                  "ArrayInfo must be a standard layout type");
+}
+
+ArrayInfo::ArrayInfo(unsigned id, af::dim4 size, dim_t offset_, af::dim4 stride,
+                     af_dtype af_type, bool sparse)
+    : devId(id)
+    , type(af_type)
+    , dim_size(size)
+    , offset(offset_)
+    , dim_strides(stride)
+    , is_sparse(sparse) {
+    setId(id);
+    static_assert(
+        offsetof(ArrayInfo, devId) == 0,
+        "ArrayInfo::devId must be the first member variable of ArrayInfo. \
+                   devId is used to encode the backend into the integer. \
+                   This is then used in the unified backend to check mismatched arrays.");
+    static_assert(std::is_nothrow_move_assignable<ArrayInfo>::value,
+                  "ArrayInfo is not nothrow move assignable");
+    static_assert(std::is_nothrow_move_constructible<ArrayInfo>::value,
+                  "ArrayInfo is not nothrow move constructible");
+}
+
+unsigned ArrayInfo::getDevId() const {
+    // The actual device ID is only stored in the first 8 bits of devId
+    // See ArrayInfo.hpp for more
+    return devId & 0xffU;
+}
+
+void ArrayInfo::setId(int id) const {
+    const_cast<ArrayInfo *>(this)->setId(id);
+}
+
+void ArrayInfo::setId(int id) {
+    /// Shift the backend flag to the end of the devId integer
+    unsigned backendId = detail::getBackend();
+    devId              = id | backendId << 8U;
+}
+
+af_backend ArrayInfo::getBackendId() const {
+    // devId >> 8 converts the backend info to 1, 2, 4, or 8, which are the
+    // enum values for CPU, CUDA, OpenCL, and oneAPI respectively
+    // See ArrayInfo.hpp for more
+    unsigned backendId = devId >> 8U;
+    return static_cast<af_backend>(backendId);
+}
+
+void ArrayInfo::modStrides(const dim4 &newStrides) { dim_strides = newStrides; }
+
+void ArrayInfo::modDims(const dim4 &newDims) {
+    dim_size = newDims;
+    modStrides(calcStrides(newDims));
+}
+
+bool ArrayInfo::isEmpty() const { return (elements() == 0); }
+
+bool ArrayInfo::isScalar() const { return (elements() == 1); }
+
+bool ArrayInfo::isRow() const {
+    return (dims()[0] == 1 && dims()[1] > 1 && dims()[2] == 1 &&
+            dims()[3] == 1);
+}
+
+bool ArrayInfo::isColumn() const {
+    return (dims()[0] > 1 && dims()[1] == 1 && dims()[2] == 1 &&
+            dims()[3] == 1);
+}
+
+bool ArrayInfo::isVector() const {
+    int singular_dims     = 0;
+    int non_singular_dims = 0;
+    for (int i = 0; i < AF_MAX_DIMS; i++) {
+        non_singular_dims += (dims()[i] != 0 && dims()[i] != 1);
+        singular_dims += (dims()[i] == 1);
+    }
+    return singular_dims == AF_MAX_DIMS - 1 && non_singular_dims == 1;
+}
+
+bool ArrayInfo::isComplex() const { return arrayfire::common::isComplex(type); }
+
+bool ArrayInfo::isReal() const { return arrayfire::common::isReal(type); }
+
+bool ArrayInfo::isDouble() const { return arrayfire::common::isDouble(type); }
+
+bool ArrayInfo::isSingle() const { return arrayfire::common::isSingle(type); }
+
+bool ArrayInfo::isHalf() const { return arrayfire::common::isHalf(type); }
+
+bool ArrayInfo::isRealFloating() const {
+    return arrayfire::common::isRealFloating(type);
+}
+
+bool ArrayInfo::isFloating() const {
+    return arrayfire::common::isFloating(type);
+}
+
+bool ArrayInfo::isInteger() const { return arrayfire::common::isInteger(type); }
+
+bool ArrayInfo::isBool() const { return arrayfire::common::isBool(type); }
+
+bool ArrayInfo::isLinear() const {
+    if (ndims() == 1) { return dim_strides[0] == 1; }
+
+    dim_t count = 1;
+    for (dim_t i = 0; i < ndims(); i++) {
+        if (count != dim_strides[i]) { return false; }
+        count *= dim_size[i];
+    }
+    return true;
+}
+
+bool ArrayInfo::isSparse() const { return is_sparse; }
+
+dim4 getOutDims(const dim4 &ldims, const dim4 &rdims, bool batchMode) {
+    if (!batchMode) {
+        DIM_ASSERT(1, ldims == rdims);
+        return ldims;
+    }
+
+    dim_t odims[] = {1, 1, 1, 1};
+    for (int i = 0; i < 4; i++) {
+        DIM_ASSERT(1, ldims[i] == rdims[i] || ldims[i] == 1 || rdims[i] == 1);
+        odims[i] = std::max(ldims[i], rdims[i]);
+    }
+
+    return dim4(4, odims);
+}
+
+using std::vector;
+
+dim4 toDims(const vector<af_seq> &seqs, const dim4 &parentDims) {
+    dim4 outDims(1, 1, 1, 1);
+    for (unsigned i = 0; i < seqs.size(); i++) {
+        outDims[i] = af::calcDim(seqs[i], parentDims[i]);
+        if (outDims[i] > parentDims[i]) {
+            AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
+        }
+    }
+    return outDims;
+}
+
+dim4 toOffset(const vector<af_seq> &seqs, const dim4 &parentDims) {
+    dim4 outOffsets(0, 0, 0, 0);
+    for (unsigned i = 0; i < seqs.size(); i++) {
+        if (seqs[i].step != 0 && seqs[i].begin >= 0) {
+            outOffsets[i] = seqs[i].begin;
+        } else if (seqs[i].begin <= -1) {
+            outOffsets[i] = parentDims[i] + seqs[i].begin;
+        } else {
+            outOffsets[i] = 0;
+        }
+
+        if (outOffsets[i] >= parentDims[i]) {
+            AF_ERROR("Index out of range", AF_ERR_SIZE);
+        }
+    }
+    return outOffsets;
+}
+
+dim4 toStride(const vector<af_seq> &seqs, const af::dim4 &parentDims) {
+    dim4 out(calcStrides(parentDims));
+    for (unsigned i = 0; i < seqs.size(); i++) {
+        if (seqs[i].step != 0) { out[i] *= seqs[i].step; }
+    }
+    return out;
+}
+
+namespace arrayfire {
+namespace common {
+
+const ArrayInfo &getInfo(const af_array arr, bool sparse_check) {
+    const ArrayInfo *info = nullptr;
+    memcpy(&info, &arr, sizeof(af_array));
+
+    // Check Sparse -> If false, then both standard Array<T> and SparseArray<T>
+    // are accepted. Otherwise, only regular Array<T> is accepted.
+    if (sparse_check) { ARG_ASSERT(0, info->isSparse() == false); }
+
+    return *info;
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/ArrayInfo.hpp b/src/backend/common/ArrayInfo.hpp
new file mode 100644
index 0000000000..aae9e7b6a7
--- /dev/null
+++ b/src/backend/common/ArrayInfo.hpp
@@ -0,0 +1,144 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/defines.hpp>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <cstddef>
+#include <vector>
+
+af::dim4 calcStrides(const af::dim4& parentDim);
+
+af::dim4 getOutDims(const af::dim4& ldims, const af::dim4& rdims,
+                    bool batchMode);
+
+/// Array Implementation Info class
+// This class is the base class to all Array objects. The purpose of this class
+// was to have a way to retrieve basic information of an Array object without
+// specifying what type the object is at compile time.
+class ArrayInfo {
+   private:
+    // The devId variable stores the device ID as well as the backend.
+    // The 8 LSBs (bits 0-7) store the device ID.
+    // The 9th LSB is set to 1 if the backend is CPU.
+    // The 10th LSB is set to 1 if the backend is CUDA.
+    // The 11th LSB is set to 1 if the backend is OpenCL.
+    // The 12th LSB is set to 1 if the backend is oneAPI.
+    // This information can be retrieved directly from an af_array by doing
+    //     int* devId = reinterpret_cast<int*>(a);  // a is an af_array
+    //     af_backend backendID = *devId >> 8;  // 1, 2, 4 for CPU, CUDA, OpenCL
+    //     int deviceID = *devId & 0xff;        // device ID between 0-255
+    // This is possible by doing a static_assert on devId
+    //
+    // This can be changed in the future if the need arises for more devices as
+    // this implementation is internal. Make sure to change the bit shift ops
+    // when such a change is being made
+    unsigned devId;
+    af_dtype type;
+    af::dim4 dim_size;
+    dim_t offset;
+    af::dim4 dim_strides;
+    bool is_sparse;
+
+   public:
+    ArrayInfo(unsigned id, af::dim4 size, dim_t offset_, af::dim4 stride,
+              af_dtype af_type);
+
+    ArrayInfo(unsigned id, af::dim4 size, dim_t offset_, af::dim4 stride,
+              af_dtype af_type, bool sparse);
+
+    ArrayInfo()                       = default;
+    ArrayInfo(const ArrayInfo& other) = default;
+    ArrayInfo(ArrayInfo&& other)      = default;
+
+    ArrayInfo& operator=(ArrayInfo other) noexcept {
+        swap(other);
+        return *this;
+    }
+
+    void swap(ArrayInfo& other) noexcept {
+        using std::swap;
+        swap(devId, other.devId);
+        swap(type, other.type);
+        swap(dim_size, other.dim_size);
+        swap(offset, other.offset);
+        swap(dim_strides, other.dim_strides);
+        swap(is_sparse, other.is_sparse);
+    }
+
+    const af_dtype& getType() const { return type; }
+
+    dim_t getOffset() const { return offset; }
+
+    const af::dim4& strides() const { return dim_strides; }
+
+    dim_t elements() const { return dim_size.elements(); }
+    dim_t ndims() const { return dim_size.ndims(); }
+    const af::dim4& dims() const { return dim_size; }
+    size_t total() const { return offset + dim_strides[3] * dim_size[3]; }
+
+    unsigned getDevId() const;
+
+    void setId(int id) const;
+
+    void setId(int id);
+
+    af_backend getBackendId() const;
+
+    void resetInfo(const af::dim4& dims) {
+        dim_size    = dims;
+        dim_strides = calcStrides(dims);
+        offset      = 0;
+    }
+
+    void resetDims(const af::dim4& dims) { dim_size = dims; }
+
+    void modDims(const af::dim4& newDims);
+
+    void modStrides(const af::dim4& newStrides);
+
+    bool isEmpty() const;
+
+    bool isScalar() const;
+
+    bool isRow() const;
+
+    bool isColumn() const;
+
+    bool isVector() const;
+
+    bool isComplex() const;
+
+    bool isReal() const;
+
+    bool isDouble() const;
+
+    bool isSingle() const;
+
+    bool isHalf() const;
+
+    bool isRealFloating() const;
+
+    bool isFloating() const;
+
+    bool isInteger() const;
+
+    bool isBool() const;
+
+    bool isLinear() const;
+
+    bool isSparse() const;
+};
+
+af::dim4 toDims(const std::vector<af_seq>& seqs, const af::dim4& parentDims);
+
+af::dim4 toOffset(const std::vector<af_seq>& seqs, const af::dim4& parentDims);
+
+af::dim4 toStride(const std::vector<af_seq>& seqs, const af::dim4& parentDims);
diff --git a/src/backend/common/Binary.hpp b/src/backend/common/Binary.hpp
new file mode 100644
index 0000000000..128cf18988
--- /dev/null
+++ b/src/backend/common/Binary.hpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <math.hpp>
+#include <types.hpp>
+
+#ifndef __DH__
+#define __DH__
+#endif
+
+#include "optypes.hpp"
+
+namespace arrayfire {
+namespace common {
+
+using namespace detail;  // NOLINT
+
+// Because isnan(cfloat) and isnan(cdouble) are not defined
+#define IS_NAN(val) (!((val) == (val)))
+
+template<typename T, af_op_t op>
+struct Binary {
+    static __DH__ T init();
+
+    __DH__ T operator()(T lhs, T rhs);
+};
+
+template<typename T>
+struct Binary<T, af_add_t> {
+    static __DH__ T init() { return scalar<T>(0); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs + rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_sub_t> {
+    static __DH__ T init() { return scalar<T>(0); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs - rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_mul_t> {
+    static __DH__ T init() { return scalar<T>(1); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs * rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_div_t> {
+    static __DH__ T init() { return scalar<T>(1); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs / rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_or_t> {
+    static __DH__ T init() { return scalar<T>(0); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs || rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_and_t> {
+    static __DH__ T init() { return scalar<T>(1); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs && rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_notzero_t> {
+    static __DH__ T init() { return scalar<T>(0); }
+
+    __DH__ T operator()(T lhs, T rhs) { return lhs + rhs; }
+};
+
+template<typename T>
+struct Binary<T, af_min_t> {
+    static __DH__ T init() { return maxval<T>(); }
+
+    __DH__ T operator()(T lhs, T rhs) { return detail::min(lhs, rhs); }
+};
+
+template<>
+struct Binary<char, af_min_t> {
+    static __DH__ char init() { return 1; }
+
+    __DH__ char operator()(char lhs, char rhs) {
+        return detail::min(lhs > 0, rhs > 0);
+    }
+};
+
+#define SPECIALIZE_COMPLEX_MIN(T, Tr)                                       \
+    template<>                                                              \
+    struct Binary<T, af_min_t> {                                            \
+        static __DH__ T init() { return scalar<T>(maxval<Tr>()); }          \
+                                                                            \
+        __DH__ T operator()(T lhs, T rhs) { return detail::min(lhs, rhs); } \
+    };
+
+SPECIALIZE_COMPLEX_MIN(cfloat, float)
+SPECIALIZE_COMPLEX_MIN(cdouble, double)
+
+#undef SPECIALIZE_COMPLEX_MIN
+
+template<typename T>
+struct Binary<T, af_max_t> {
+    static __DH__ T init() { return minval<T>(); }
+
+    __DH__ T operator()(T lhs, T rhs) { return detail::max(lhs, rhs); }
+};
+
+template<>
+struct Binary<char, af_max_t> {
+    static __DH__ char init() { return 0; }
+
+    __DH__ char operator()(char lhs, char rhs) { return max(lhs > 0, rhs > 0); }
+};
+
+#define SPECIALIZE_COMPLEX_MAX(T, Tr)                                       \
+    template<>                                                              \
+    struct Binary<T, af_max_t> {                                            \
+        static __DH__ T init() { return scalar<T>(detail::scalar<Tr>(0)); } \
+                                                                            \
+        __DH__ T operator()(T lhs, T rhs) { return detail::max(lhs, rhs); } \
+    };
+
+SPECIALIZE_COMPLEX_MAX(cfloat, float)
+SPECIALIZE_COMPLEX_MAX(cdouble, double)
+
+#undef SPECIALIZE_COMPLEX_MAX
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/CMakeLists.txt b/src/backend/common/CMakeLists.txt
new file mode 100644
index 0000000000..b33ea2598e
--- /dev/null
+++ b/src/backend/common/CMakeLists.txt
@@ -0,0 +1,142 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+add_library(afcommon_interface INTERFACE)
+
+target_sources(afcommon_interface
+  INTERFACE
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/BinaryNode.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/BinaryNode.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/ModdimNode.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/NaryNode.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/Node.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/Node.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/NodeIO.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/NodeIterator.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/ScalarNode.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit/UnaryNode.hpp
+    )
+
+target_sources(afcommon_interface
+  INTERFACE
+    ${CMAKE_CURRENT_SOURCE_DIR}/AllocatorInterface.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayInfo.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayInfo.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ArrayFireTypesIO.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/DefaultMemoryManager.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/DefaultMemoryManager.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/DependencyModule.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/DependencyModule.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/FFTPlanCache.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/HandleBase.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/InteropManager.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/KernelInterface.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/Logger.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/Logger.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/MemoryManagerBase.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/MersenneTwister.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/ModuleInterface.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/Source.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/SparseArray.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/SparseArray.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/TemplateArg.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/TemplateTypename.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/Version.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/blas_headers.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/cast.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/cast.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/cblas.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/compile_module.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/complex.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/constants.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/defines.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/deterministicHash.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/deterministicHash.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/dim4.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/dispatch.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/dispatch.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/err_common.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/err_common.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/graphics_common.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/graphics_common.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/half.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/half.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/host_memory.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/host_memory.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/internal_enums.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/kernel_cache.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/kernel_cache.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/kernel_type.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/moddims.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/moddims.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/module_loading.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/sparse_helpers.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/traits.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/unique_handle.hpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/util.cpp
+    ${CMAKE_CURRENT_SOURCE_DIR}/util.hpp
+  )
+
+if(WIN32)
+  target_sources(afcommon_interface INTERFACE ${CMAKE_CURRENT_SOURCE_DIR}/module_loading_windows.cpp)
+else()
+  target_sources(afcommon_interface INTERFACE ${CMAKE_CURRENT_SOURCE_DIR}/module_loading_unix.cpp)
+endif()
+
+target_link_libraries(afcommon_interface
+  INTERFACE
+    af_spdlog
+    Boost::boost
+    nonstd::span-lite
+    ${CMAKE_DL_LIBS}
+)
+
+if(TARGET fmt::fmt)
+  target_link_libraries(afcommon_interface
+    INTERFACE
+      fmt::fmt
+  )
+endif()
+
+if(TARGET glad::glad)
+  target_link_libraries(afcommon_interface INTERFACE glad::glad)
+else()
+  target_link_libraries(afcommon_interface INTERFACE af_glad)
+endif()
+
+if(AF_BUILD_FORGE AND NOT Forge_FOUND)
+  add_dependencies(afcommon_interface forge)
+endif()
+
+target_include_directories(afcommon_interface
+  SYSTEM INTERFACE
+    $<$<PLATFORM_ID:Darwin>:${OPENGL_INCLUDE_DIR}>)
+
+target_include_directories(afcommon_interface
+  INTERFACE
+    ${ArrayFire_SOURCE_DIR}/src/backend
+    ${ArrayFire_BINARY_DIR}/src/backend)
+
+if(TARGET Forge::forge)
+  target_include_directories(afcommon_interface
+    SYSTEM INTERFACE
+      $<TARGET_PROPERTY:Forge::forge,INCLUDE_DIRECTORIES>
+  )
+else()
+  target_include_directories(afcommon_interface
+    SYSTEM INTERFACE
+      ${${forge_prefix}_SOURCE_DIR}/include
+      ${${forge_prefix}_BINARY_DIR}/include
+  )
+endif()
+
+if(APPLE AND NOT USE_MKL)
+  target_sources(afcommon_interface
+    INTERFACE
+      ${CMAKE_CURRENT_SOURCE_DIR}/lapacke.cpp
+      ${CMAKE_CURRENT_SOURCE_DIR}/lapacke.hpp)
+endif()
diff --git a/src/backend/common/DefaultMemoryManager.cpp b/src/backend/common/DefaultMemoryManager.cpp
new file mode 100644
index 0000000000..0e0694631d
--- /dev/null
+++ b/src/backend/common/DefaultMemoryManager.cpp
@@ -0,0 +1,387 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/dispatch.hpp>
+#include <common/err_common.hpp>
+#include <common/util.hpp>
+#include <memoryapi.hpp>
+#include <af/event.h>
+#include <af/memory.h>
+
+#include <algorithm>
+#include <cstdio>
+#include <memory>
+#include <string>
+#include <vector>
+
+using std::max;
+using std::move;
+using std::stoi;
+using std::string;
+using std::vector;
+
+namespace arrayfire {
+namespace common {
+
+DefaultMemoryManager::memory_info &
+DefaultMemoryManager::getCurrentMemoryInfo() {
+    return memory[this->getActiveDeviceId()];
+}
+
+void DefaultMemoryManager::cleanDeviceMemoryManager(int device) {
+    if (this->debug_mode) { return; }
+
+    // This vector is used to store the pointers which will be deleted by
+    // the memory manager. We are using this to avoid calling free while
+    // the lock is being held because the CPU backend calls sync.
+    vector<void *> free_ptrs;
+    size_t bytes_freed                         = 0;
+    DefaultMemoryManager::memory_info &current = memory[device];
+    {
+        lock_guard_t lock(this->memory_mutex);
+        // Return if all buffers are locked
+        if (current.total_buffers == current.lock_buffers) { return; }
+        free_ptrs.reserve(current.free_map.size());
+
+        for (auto &kv : current.free_map) {
+            size_t num_ptrs = kv.second.size();
+            // Move every free pointer into the free_ptrs vector; they are
+            // freed once outside of the lock. Equivalent to:
+            // for (auto ptr : kv.second) { free_ptrs.emplace_back(ptr); }
+            std::move(begin(kv.second), end(kv.second),
+                      back_inserter(free_ptrs));
+            current.total_bytes -= num_ptrs * kv.first;
+            bytes_freed += num_ptrs * kv.first;
+            current.total_buffers -= num_ptrs;
+        }
+        current.free_map.clear();
+    }
+
+    AF_TRACE("GC: Clearing {} buffers {}", free_ptrs.size(),
+             bytesToString(bytes_freed));
+    // Free memory outside of the lock
+    for (auto *ptr : free_ptrs) { this->nativeFree(ptr); }
+}
+
+DefaultMemoryManager::DefaultMemoryManager(int num_devices,
+                                           unsigned max_buffers, bool debug)
+    : mem_step_size(1024)
+    , max_buffers(max_buffers)
+    , debug_mode(debug)
+    , memory(num_devices) {
+    // Check for environment variables
+
+    // Debug mode
+    string env_var = getEnvVar("AF_MEM_DEBUG");
+    if (!env_var.empty()) { this->debug_mode = env_var[0] != '0'; }
+    if (this->debug_mode) { mem_step_size = 1; }
+
+    // Max Buffer count
+    env_var = getEnvVar("AF_MAX_BUFFERS");
+    if (!env_var.empty()) { this->max_buffers = max(1, stoi(env_var)); }
+}
+
+void DefaultMemoryManager::initialize() { this->setMaxMemorySize(); }
+
+void DefaultMemoryManager::shutdown() { signalMemoryCleanup(); }
+
+void DefaultMemoryManager::addMemoryManagement(int device) {
+    // If there is a memory manager allocated for this device id, we might
+    // as well use it and the buffers allocated for it
+    if (static_cast<size_t>(device) < memory.size()) { return; }
+
+    // The device need not always be the next device, so resize to
+    // current_size + device + 1. The +1 accounts for device being a
+    // 0-based index.
+    memory.resize(memory.size() + device + 1);
+}
+
+void DefaultMemoryManager::removeMemoryManagement(int device) {
+    if (static_cast<size_t>(device) >= memory.size()) {
+        AF_ERROR("No matching device found", AF_ERR_ARG);
+    }
+
+    // Do garbage collection for the device and leave the memory::memory_info
+    // struct from the memory vector intact
+    cleanDeviceMemoryManager(device);
+}
+
+void DefaultMemoryManager::setMaxMemorySize() {
+    for (unsigned n = 0; n < memory.size(); n++) {
+        // Calls garbage collection when total_bytes > memsize * 0.75 (for
+        // memsize < 4GB) or total_bytes > memsize - 1GB (for memsize >= 4GB).
+        // If memsize returned 0, then use 1GB.
+        size_t memsize = this->getMaxMemorySize(static_cast<int>(n));
+        memory[n].max_bytes =
+            memsize == 0
+                ? ONE_GB
+                : max(memsize * 0.75, static_cast<double>(memsize) - ONE_GB);
+        AF_TRACE("memory[{}].max_bytes: {}", n,
+                 bytesToString(memory[n].max_bytes));
+    }
+}
+
+float DefaultMemoryManager::getMemoryPressure() {
+    lock_guard_t lock(this->memory_mutex);
+    memory_info &current = this->getCurrentMemoryInfo();
+    if (current.lock_bytes > current.max_bytes ||
+        current.lock_buffers > max_buffers) {
+        return 1.0;
+    } else {
+        return 0.0;
+    }
+}
+
+bool DefaultMemoryManager::jitTreeExceedsMemoryPressure(
+    size_t jit_tree_buffer_bytes) {
+    lock_guard_t lock(this->memory_mutex);
+    memory_info &current = this->getCurrentMemoryInfo();
+    if (current.lock_bytes > 0.25f * current.max_bytes) {
+        /// Evaluate JIT if half of all locked buffers are locked by this JIT
+        /// tree
+        return jit_tree_buffer_bytes > current.lock_bytes * 0.5f;
+    } else {
+        /// Evaluate if this JIT Tree accounts for 10% of total memory on the
+        /// device
+        return jit_tree_buffer_bytes > 0.10f * current.max_bytes;
+    }
+}
+
+void *DefaultMemoryManager::alloc(bool user_lock, const unsigned ndims,
+                                  dim_t *dims, const unsigned element_size) {
+    size_t bytes = element_size;
+    for (unsigned i = 0; i < ndims; ++i) { bytes *= dims[i]; }
+
+    void *ptr          = nullptr;
+    size_t alloc_bytes = this->debug_mode
+                             ? bytes
+                             : (divup(bytes, mem_step_size) * mem_step_size);
+
+    if (bytes > 0) {
+        memory_info &current = this->getCurrentMemoryInfo();
+        locked_info info     = {!user_lock, user_lock, alloc_bytes};
+
+        // There is no memory cache in debug mode
+        if (!this->debug_mode) {
+            // FIXME: Add better checks for garbage collection
+            // Perhaps look at total memory available as a metric
+            if (current.lock_bytes >= current.max_bytes ||
+                current.total_buffers >= this->max_buffers) {
+                AF_TRACE(
+                    "Running GC: current.lock_bytes({}) >= "
+                    "current.max_bytes({}) || current.total_buffers({}) >= "
+                    "this->max_buffers({})\n",
+                    current.lock_bytes, current.max_bytes,
+                    current.total_buffers, this->max_buffers);
+
+                this->signalMemoryCleanup();
+            }
+
+            lock_guard_t lock(this->memory_mutex);
+            auto free_buffer_iter = current.free_map.find(alloc_bytes);
+            if (free_buffer_iter != current.free_map.end() &&
+                !free_buffer_iter->second.empty()) {
+                // Reuse an existing buffer of the same size from the free
+                // map and move it to the locked map
+                vector<void *> &free_buffer_vector = free_buffer_iter->second;
+                ptr                                = free_buffer_vector.back();
+                free_buffer_vector.pop_back();
+                current.locked_map[ptr] = info;
+                current.lock_bytes += alloc_bytes;
+                current.lock_buffers++;
+            }
+        }
+
+        // Only comes here if buffer size not found or in debug mode
+        if (ptr == nullptr) {
+            // Perform garbage collection if memory can not be allocated
+            try {
+                ptr = this->nativeAlloc(alloc_bytes);
+            } catch (const AfError &ex) {
+                // If out of memory, run garbage collect and try again
+                if (ex.getError() != AF_ERR_NO_MEM) { throw; }
+                this->signalMemoryCleanup();
+                ptr = this->nativeAlloc(alloc_bytes);
+            }
+            lock_guard_t lock(this->memory_mutex);
+            // Increment these two only when it succeeds to come here.
+            current.total_bytes += alloc_bytes;
+            current.total_buffers += 1;
+            current.locked_map[ptr] = info;
+            current.lock_bytes += alloc_bytes;
+            current.lock_buffers++;
+        }
+    }
+
+    return ptr;
+}
+
+size_t DefaultMemoryManager::allocated(void *ptr) {
+    if (!ptr) { return 0; }
+    memory_info &current = this->getCurrentMemoryInfo();
+    auto locked_iter     = current.locked_map.find(ptr);
+    if (locked_iter == current.locked_map.end()) { return 0; }
+    return (locked_iter->second).bytes;
+}
+
+void DefaultMemoryManager::unlock(void *ptr, bool user_unlock) {
+    // Shortcut for empty arrays
+    if (!ptr) { return; }
+
+    // Frees the pointer outside the lock.
+    uptr_t freed_ptr(nullptr, [this](void *p) { this->nativeFree(p); });
+    {
+        lock_guard_t lock(this->memory_mutex);
+        memory_info &current = this->getCurrentMemoryInfo();
+
+        auto locked_buffer_iter = current.locked_map.find(ptr);
+        if (locked_buffer_iter == current.locked_map.end()) {
+            // Pointer not found in locked map
+            // Probably came from user, just free it
+            freed_ptr.reset(ptr);
+            return;
+        }
+        locked_info &locked_buffer_info = locked_buffer_iter->second;
+        void *locked_buffer_ptr         = locked_buffer_iter->first;
+
+        if (user_unlock) {
+            locked_buffer_info.user_lock = false;
+        } else {
+            locked_buffer_info.manager_lock = false;
+        }
+
+        // Return early if either one is locked
+        if (locked_buffer_info.user_lock || locked_buffer_info.manager_lock) {
+            return;
+        }
+
+        size_t bytes = locked_buffer_info.bytes;
+        current.lock_bytes -= locked_buffer_info.bytes;
+        current.lock_buffers--;
+
+        if (this->debug_mode) {
+            // Just free memory in debug mode
+            if (locked_buffer_info.bytes > 0) {
+                freed_ptr.reset(locked_buffer_ptr);
+                current.total_buffers--;
+                current.total_bytes -= locked_buffer_info.bytes;
+            }
+        } else {
+            current.free_map[bytes].emplace_back(ptr);
+        }
+        current.locked_map.erase(locked_buffer_iter);
+    }
+}
+
+void DefaultMemoryManager::signalMemoryCleanup() {
+    cleanDeviceMemoryManager(this->getActiveDeviceId());
+}
+
+void DefaultMemoryManager::printInfo(const char *msg, const int device) {
+    UNUSED(device);
+    const memory_info &current = this->getCurrentMemoryInfo();
+
+    printf("%s\n", msg);
+    printf(
+        "---------------------------------------------------------\n"
+        "|     POINTER      |    SIZE    |  AF LOCK  | USER LOCK |\n"
+        "---------------------------------------------------------\n");
+
+    lock_guard_t lock(this->memory_mutex);
+    for (const auto &kv : current.locked_map) {
+        const char *status_mngr = "Yes";
+        const char *status_user = "Unknown";
+        if (kv.second.user_lock) {
+            status_user = "Yes";
+        } else {
+            status_user = " No";
+        }
+
+        const char *unit = "KB";
+        double size      = static_cast<double>(kv.second.bytes) / 1024;
+        if (size >= 1024) {
+            size = size / 1024;
+            unit = "MB";
+        }
+
+        printf("|  %14p  |  %6.f %s | %9s | %9s |\n", kv.first, size, unit,
+               status_mngr, status_user);
+    }
+
+    for (const auto &kv : current.free_map) {
+        const char *status_mngr = "No";
+        const char *status_user = "No";
+
+        const char *unit = "KB";
+        double size      = static_cast<double>(kv.first) / 1024;
+        if (size >= 1024) {
+            size = size / 1024;
+            unit = "MB";
+        }
+
+        for (const auto &ptr : kv.second) {
+            printf("|  %14p  |  %6.f %s | %9s | %9s |\n", ptr, size, unit,
+                   status_mngr, status_user);
+        }
+    }
+
+    printf("---------------------------------------------------------\n");
+}
+
+void DefaultMemoryManager::usageInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                                     size_t *lock_bytes, size_t *lock_buffers) {
+    const memory_info &current = this->getCurrentMemoryInfo();
+    lock_guard_t lock(this->memory_mutex);
+    if (alloc_bytes) { *alloc_bytes = current.total_bytes; }
+    if (alloc_buffers) { *alloc_buffers = current.total_buffers; }
+    if (lock_bytes) { *lock_bytes = current.lock_bytes; }
+    if (lock_buffers) { *lock_buffers = current.lock_buffers; }
+}
+
+void DefaultMemoryManager::userLock(const void *ptr) {
+    memory_info &current = this->getCurrentMemoryInfo();
+
+    lock_guard_t lock(this->memory_mutex);
+
+    auto locked_iter = current.locked_map.find(const_cast<void *>(ptr));
+    if (locked_iter != current.locked_map.end()) {
+        locked_iter->second.user_lock = true;
+    } else {
+        locked_info info = {false, true, 100};  // This number is not relevant
+
+        current.locked_map[const_cast<void *>(ptr)] = info;
+    }
+}
+
+void DefaultMemoryManager::userUnlock(const void *ptr) {
+    this->unlock(const_cast<void *>(ptr), true);
+}
+
+bool DefaultMemoryManager::isUserLocked(const void *ptr) {
+    memory_info &current = this->getCurrentMemoryInfo();
+    lock_guard_t lock(this->memory_mutex);
+    auto locked_iter = current.locked_map.find(const_cast<void *>(ptr));
+    if (locked_iter == current.locked_map.end()) { return false; }
+    return locked_iter->second.user_lock;
+}
+
+size_t DefaultMemoryManager::getMemStepSize() {
+    lock_guard_t lock(this->memory_mutex);
+    return this->mem_step_size;
+}
+
+void DefaultMemoryManager::setMemStepSize(size_t new_step_size) {
+    lock_guard_t lock(this->memory_mutex);
+    this->mem_step_size = new_step_size;
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/DefaultMemoryManager.hpp b/src/backend/common/DefaultMemoryManager.hpp
new file mode 100644
index 0000000000..60fa10a8c9
--- /dev/null
+++ b/src/backend/common/DefaultMemoryManager.hpp
@@ -0,0 +1,138 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/MemoryManagerBase.hpp>
+#include <common/defines.hpp>
+
+#include <functional>
+#include <unordered_map>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+constexpr unsigned MAX_BUFFERS = 1000;
+constexpr size_t ONE_GB        = 1 << 30;
+
+using uptr_t = std::unique_ptr<void, std::function<void(void *)>>;
+
+class DefaultMemoryManager final : public common::MemoryManagerBase {
+    size_t mem_step_size;
+    unsigned max_buffers;
+
+    bool debug_mode;
+
+    struct locked_info {
+        bool manager_lock;
+        bool user_lock;
+        size_t bytes;
+    };
+
+    using locked_t = typename std::unordered_map<void *, locked_info>;
+    using free_t   = std::unordered_map<size_t, std::vector<void *>>;
+
+    struct memory_info {
+        locked_t locked_map;
+        free_t free_map;
+
+        size_t max_bytes;
+        size_t total_bytes;
+        size_t total_buffers;
+        size_t lock_bytes;
+        size_t lock_buffers;
+
+        memory_info()
+            // Calling getMaxMemorySize() here calls the virtual function
+            // that returns 0. Call it from outside the constructor.
+            : max_bytes(ONE_GB)
+            , total_bytes(0)
+            , total_buffers(0)
+            , lock_bytes(0)
+            , lock_buffers(0) {}
+
+        memory_info(memory_info &other)             = delete;
+        memory_info(memory_info &&other)            = default;
+        memory_info &operator=(memory_info &other)  = delete;
+        memory_info &operator=(memory_info &&other) = default;
+    };
+
+    memory_info &getCurrentMemoryInfo();
+
+   public:
+    DefaultMemoryManager(int num_devices, unsigned max_buffers, bool debug);
+
+    // Initializes the memory manager
+    virtual void initialize() override;
+
+    // Shuts down the memory manager
+    virtual void shutdown() override;
+
+    // Intended to be used with OpenCL backend, where
+    // users are allowed to add external devices(context, device pair)
+    // to the list of devices automatically detected by the library
+    void addMemoryManagement(int device) override;
+
+    // Intended to be used with OpenCL backend, where
+    // users are allowed to remove external devices (context, device pair)
+    // that were previously added to the list of devices
+    void removeMemoryManagement(int device) override;
+
+    void setMaxMemorySize();
+
+    /// Returns a pointer to a buffer of at least the requested size
+    ///
+    /// This function will return a memory location of at least the requested
+    /// size in bytes. If there is already a free buffer available, it will use
+    /// that buffer. Otherwise, it will allocate a new buffer using the
+    /// nativeAlloc function.
+    void *alloc(bool user_lock, const unsigned ndims, dim_t *dims,
+                const unsigned element_size) override;
+
+    /// returns the size of the buffer at the pointer allocated by the memory
+    /// manager.
+    size_t allocated(void *ptr) override;
+
+    /// Frees or marks the pointer for deletion during the next garbage
+    /// collection event
+    void unlock(void *ptr, bool user_unlock) override;
+
+    /// Frees all buffers which are not locked by the user or not being
+    /// used.
+    void signalMemoryCleanup() override;
+
+    void printInfo(const char *msg, const int device) override;
+    void usageInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                   size_t *lock_bytes, size_t *lock_buffers) override;
+    void userLock(const void *ptr) override;
+    void userUnlock(const void *ptr) override;
+    bool isUserLocked(const void *ptr) override;
+    size_t getMemStepSize() override;
+    void setMemStepSize(size_t new_step_size) override;
+    float getMemoryPressure() override;
+    bool jitTreeExceedsMemoryPressure(size_t bytes) override;
+
+    ~DefaultMemoryManager() = default;
+
+   protected:
+    DefaultMemoryManager()                                             = delete;
+    DefaultMemoryManager(const DefaultMemoryManager &other)            = delete;
+    DefaultMemoryManager(DefaultMemoryManager &&other)                 = delete;
+    DefaultMemoryManager &operator=(const DefaultMemoryManager &other) = delete;
+    DefaultMemoryManager &operator=(DefaultMemoryManager &&other)      = delete;
+    common::mutex_t memory_mutex;
+    // backend-specific
+    std::vector<memory_info> memory;
+    // backend-agnostic
+    void cleanDeviceMemoryManager(int device);
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/DependencyModule.cpp b/src/backend/common/DependencyModule.cpp
new file mode 100644
index 0000000000..4ccb64bc9a
--- /dev/null
+++ b/src/backend/common/DependencyModule.cpp
@@ -0,0 +1,186 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/DependencyModule.hpp>
+#include <common/Logger.hpp>
+#include <common/Version.hpp>
+#include <common/module_loading.hpp>
+
+#include <algorithm>
+#include <string>
+
+#ifdef OS_WIN
+#include <Windows.h>
+#else
+#include <dlfcn.h>
+#endif
+
+using arrayfire::common::Version;
+using std::make_tuple;
+using std::string;
+using std::to_string;
+using std::vector;
+
+#ifdef OS_WIN
+#include <Windows.h>
+
+static const char* librarySuffix = ".dll";
+
+namespace {
+vector<string> libNames(const std::string& name, const string& suffix,
+                        const Version& ver = arrayfire::common::NullVersion) {
+    UNUSED(ver);  // Windows DLL files are not version suffixed
+    return {name + suffix + librarySuffix};
+}
+}  // namespace
+
+#elif defined(OS_MAC)
+
+static const char* librarySuffix = ".dylib";
+static const char* libraryPrefix = "lib";
+
+namespace {
+vector<string> libNames(const std::string& name, const string& suffix,
+                        const Version& ver = arrayfire::common::NullVersion) {
+    UNUSED(suffix);
+    const string noVerName = libraryPrefix + name + librarySuffix;
+    if (ver != arrayfire::common::NullVersion) {
+        const string infix = "." + to_string(ver.major());
+        return {libraryPrefix + name + infix + librarySuffix, noVerName};
+    } else {
+        return {noVerName};
+    }
+}
+}  // namespace
+
+#elif defined(OS_LNX)
+
+static const char* librarySuffix = ".so";
+static const char* libraryPrefix = "lib";
+
+namespace {
+vector<string> libNames(const std::string& name, const string& suffix,
+                        const Version& ver = arrayfire::common::NullVersion) {
+    UNUSED(suffix);
+    const string noVerName = libraryPrefix + name + librarySuffix;
+    if (ver != arrayfire::common::NullVersion) {
+        const string soname("." + to_string(ver.major()));
+
+        const string vsfx = "." + to_string(ver.major()) + "." +
+                            to_string(ver.minor()) + "." +
+                            to_string(ver.patch());
+        return {noVerName + vsfx, noVerName + soname, noVerName};
+    } else {
+        return {noVerName};
+    }
+}
+}  // namespace
+
+#else
+#error "Unsupported platform"
+#endif
+
+namespace arrayfire {
+namespace common {
+
+DependencyModule::DependencyModule(const char* plugin_file_name,
+                                   const char** paths)
+    : handle(nullptr)
+    , logger(common::loggerFactory("platform"))
+    , version(-1, -1) {
+    // TODO(umar): Implement handling of non-standard paths
+    UNUSED(paths);
+    if (plugin_file_name) {
+        auto fileNames = libNames(plugin_file_name, "");
+        AF_TRACE("Attempting to load: {}", fileNames[0]);
+        handle = loadLibrary(fileNames[0].c_str());
+        if (handle) {
+            AF_TRACE("Found: {}", fileNames[0]);
+        } else {
+            AF_TRACE("Unable to open {}", plugin_file_name);
+        }
+    }
+}
+
+DependencyModule::DependencyModule(
+    const vector<string>& plugin_base_file_name, const vector<string>& suffixes,
+    const vector<string>& paths, const size_t verListSize,
+    const Version* versions,
+    std::function<Version(const LibHandle&)> versionFunction)
+    : handle(nullptr)
+    , logger(common::loggerFactory("platform"))
+    , version(-1, -1) {
+    for (const string& base_name : plugin_base_file_name) {
+        for (const string& path : paths) {
+            UNUSED(path);
+            for (const string& suffix : suffixes) {
+#if !defined(OS_WIN)
+                // For a non-Windows OS, i.e. most likely Unix, shared library
+                // names carry a suffix based on the version. Look up the
+                // libraries for the given versions first and fall back to a
+                // simple name lookup if no versioned library is found.
+                for (size_t v = 0; v < verListSize; v++) {
+                    auto fileNames = libNames(base_name, suffix, versions[v]);
+                    for (auto& fileName : fileNames) {
+                        AF_TRACE("Attempting to load: {}", fileName);
+                        handle = loadLibrary(fileName.c_str());
+                        if (handle) {
+                            if (versionFunction) {
+                                version = versionFunction(handle);
+                                AF_TRACE("Found: {}({})", fileName, version);
+                            } else {
+                                AF_TRACE("Found: {}", fileName);
+                            }
+                            return;
+                        }
+                    }
+                }
+#endif
+                auto fileNames = libNames(base_name, suffix);
+                AF_TRACE("Attempting to load: {}", fileNames[0]);
+                handle = loadLibrary(fileNames[0].c_str());
+                if (handle) {
+                    if (versionFunction) {
+                        version = versionFunction(handle);
+                        AF_TRACE("Found: {}({})", fileNames[0], version);
+                    } else {
+                        AF_TRACE("Found: {}", fileNames[0]);
+                    }
+                    return;
+                }
+            }
+        }
+    }
+    AF_TRACE("Unable to open {}", plugin_base_file_name[0]);
+}
+
+DependencyModule::~DependencyModule() noexcept {
+    if (handle) { unloadLibrary(handle); }
+}
+
+bool DependencyModule::isLoaded() const noexcept {
+    return static_cast<bool>(handle);
+}
+
+bool DependencyModule::symbolsLoaded() const noexcept {
+    return all_of(begin(functions), end(functions),
+                  [](void* ptr) { return ptr != nullptr; });
+}
+
+string DependencyModule::getErrorMessage() noexcept {
+    return common::getErrorMessage();
+}
+
+spdlog::logger* DependencyModule::getLogger() const noexcept {
+    return logger.get();
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/DependencyModule.hpp b/src/backend/common/DependencyModule.hpp
new file mode 100644
index 0000000000..6473a4d3bd
--- /dev/null
+++ b/src/backend/common/DependencyModule.hpp
@@ -0,0 +1,90 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/Logger.hpp>
+#include <common/Version.hpp>
+#include <common/defines.hpp>
+#include <common/module_loading.hpp>
+
+#include <memory>
+#include <string>
+#include <tuple>
+#include <utility>
+#include <vector>
+
+namespace spdlog {
+class logger;
+}
+namespace arrayfire {
+namespace common {
+
+/// Allows you to create classes which dynamically load dependencies at runtime
+///
+/// Creates a dependency module which will dynamically load a library
+/// at runtime instead of at link time. This class will be a component of a
+/// module class which will have member functions for each of the functions
+/// we use in ArrayFire
+class DependencyModule {
+    LibHandle handle;
+    std::shared_ptr<spdlog::logger> logger;
+    std::vector<void*> functions;
+    Version version;
+
+   public:
+    /// Loads the library \p plugin_file_name from the \p paths locations
+    /// \param plugin_file_name  The name of the library without any prefix or
+    ///                          extensions
+    /// \param paths             The locations to search for the libraries if
+    ///                          not found in standard locations
+    DependencyModule(const char* plugin_file_name,
+                     const char** paths = nullptr);
+
+    DependencyModule(
+        const std::vector<std::string>& plugin_base_file_name,
+        const std::vector<std::string>& suffixes,
+        const std::vector<std::string>& paths, const size_t verListSize = 0,
+        const Version* versions                                  = nullptr,
+        std::function<Version(const LibHandle&)> versionFunction = {});
+
+    ~DependencyModule() noexcept;
+
+    /// Returns a function pointer to the function with the name symbol_name
+    template<typename T>
+    T getSymbol(const char* symbol_name) {
+        functions.push_back(getFunctionPointer(handle, symbol_name));
+        return reinterpret_cast<T>(functions.back());
+    }
+
+    /// Returns true if the module was successfully loaded
+    bool isLoaded() const noexcept;
+
+    /// Returns true if all of the symbols for the module were loaded
+    bool symbolsLoaded() const noexcept;
+
+    /// Returns the version of the module
+    Version getVersion() const noexcept { return version; }
+
+    /// Returns the last error message that occurred because of loading the
+    /// library
+    static std::string getErrorMessage() noexcept;
+
+    spdlog::logger* getLogger() const noexcept;
+};
+
+}  // namespace common
+}  // namespace arrayfire
+
+/// Creates a function pointer
+#define MODULE_MEMBER(NAME) decltype(&::NAME) NAME
+
+/// Dynamically loads the function pointer at runtime
+#define MODULE_FUNCTION_INIT(NAME) \
+    NAME = module.getSymbol<decltype(&::NAME)>(#NAME);
diff --git a/src/backend/common/EventBase.hpp b/src/backend/common/EventBase.hpp
new file mode 100644
index 0000000000..6356e4e1af
--- /dev/null
+++ b/src/backend/common/EventBase.hpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <utility>
+
+namespace arrayfire {
+namespace common {
+
+template<typename NativeEventPolicy>
+class EventBase {
+    using QueueType = typename NativeEventPolicy::QueueType;
+    using EventType = typename NativeEventPolicy::EventType;
+    using ErrorType = typename NativeEventPolicy::ErrorType;
+    EventType e_;
+
+   public:
+    /// Default constructor of the Event object. Does not create the event.
+    constexpr EventBase() noexcept : e_() {}
+
+    /// Deleted copy constructor
+    ///
+    /// The event object can only be moved.
+    EventBase(EventBase &other) = delete;
+
+    /// \brief Move constructor of the Event object. Resets the moved object to
+    ///        an invalid event.
+    EventBase(EventBase &&other) noexcept
+        : e_(std::move(other.e_)) {
+        other.e_ = 0;
+    }
+
+    /// \brief Event destructor. Calls the destroy event call on the native API
+    ~EventBase() noexcept {
+        // if (e_)
+        NativeEventPolicy::destroyEvent(&e_);
+    }
+
+    /// \brief Creates the event object by calling the native create API
+    ErrorType create() noexcept {
+        return NativeEventPolicy::createAndMarkEvent(&e_);
+    }
+
+    /// \brief Adds the event to the queue. Once this point in the program
+    ///        is executed, the event is marked complete.
+    ///
+    /// \returns the error code for the mark call
+    ErrorType mark(QueueType queue) noexcept {
+        return NativeEventPolicy::markEvent(&e_, queue);
+    }
+
+    /// \brief Asynchronously makes the queue/stream wait for this event to
+    ///        complete before it continues. This call does not block the
+    ///        calling thread.
+    ///
+    /// \param queue The queue that will wait for the previous tasks to complete
+    ///
+    /// \returns the error code for the wait call
+    ErrorType enqueueWait(QueueType queue) noexcept {
+        return NativeEventPolicy::waitForEvent(&e_, queue);
+    }
+
+    /// \brief This function will block the calling thread until the event has
+    ///        completed
+    ErrorType block() noexcept { return NativeEventPolicy::syncForEvent(&e_); }
+
+    /// \brief Returns true if the event is a valid event.
+    constexpr operator bool() const { return e_; }
+
+    EventBase &operator=(EventBase &other) = delete;
+
+    EventBase &operator=(EventBase &&other) noexcept {
+        e_       = std::move(other.e_);
+        other.e_ = 0;
+        return *this;
+    }
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/FFTPlanCache.hpp b/src/backend/common/FFTPlanCache.hpp
new file mode 100644
index 0000000000..8ae853480d
--- /dev/null
+++ b/src/backend/common/FFTPlanCache.hpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <deque>
+#include <memory>
+#include <string>
+#include <utility>
+
+namespace arrayfire {
+namespace common {
+// FFTPlanCache caches backend specific fft plans in FIFO order
+//
+// new plan |--> IF number of plans cached is at limit, pop the oldest entry and
+// push new plan.
+//          |
+//          |--> ELSE just push the plan
+// existing plan -> reuse a plan
+template<typename T, typename P>
+class FFTPlanCache {
+    using plan_t       = typename std::shared_ptr<P>;
+    using plan_pair_t  = typename std::pair<std::string, plan_t>;
+    using plan_cache_t = typename std::deque<plan_pair_t>;
+
+   public:
+    FFTPlanCache() : mMaxCacheSize(5) {}
+
+    void setMaxCacheSize(size_t size) {
+        mMaxCacheSize = size;
+        while (mCache.size() > mMaxCacheSize) mCache.pop_back();
+    }
+
+    size_t getMaxCacheSize() const { return mMaxCacheSize; }
+
+    // Iterates through the plan cache from front to back
+    // of the cache (queue).
+    // A valid shared_ptr of the plan in the cache is returned
+    // if found, and an empty shared_ptr otherwise.
+    plan_t find(const std::string& key) const {
+        std::shared_ptr<P> res;
+
+        for (unsigned i = 0; i < mCache.size(); ++i) {
+            if (key == mCache[i].first) {
+                res = mCache[i].second;
+                break;
+            }
+        }
+
+        return res;
+    }
+
+    // pushes plan to the front of cache(queue)
+    void push(const std::string& key, plan_t plan) {
+        if (mCache.size() >= mMaxCacheSize) mCache.pop_back();
+
+        mCache.push_front(plan_pair_t(key, plan));
+    }
+
+   protected:
+    FFTPlanCache(FFTPlanCache const&);
+    void operator=(FFTPlanCache const&);
+
+    size_t mMaxCacheSize;
+
+    plan_cache_t mCache;
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/HandleBase.hpp b/src/backend/common/HandleBase.hpp
new file mode 100644
index 0000000000..713ae6f71f
--- /dev/null
+++ b/src/backend/common/HandleBase.hpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace common {
+template<typename T, typename H>
+class HandleBase {
+    H handle_;
+
+   public:
+    HandleBase() : handle_(0) { static_cast<T*>(this)->createHandle(&handle_); }
+    ~HandleBase() { static_cast<T*>(this)->destroyHandle(handle_); }
+
+    operator H() { return handle_; }
+    H* get() { return &handle_; }
+
+    HandleBase(HandleBase const&)     = delete;
+    void operator=(HandleBase const&) = delete;
+
+    HandleBase(HandleBase&& h)            = default;
+    HandleBase& operator=(HandleBase&& h) = default;
+};
+}  // namespace common
+}  // namespace arrayfire
+
+#define CREATE_HANDLE(NAME, TYPE, CREATE_FUNCTION, DESTROY_FUNCTION,  \
+                      CHECK_FUNCTION)                                 \
+    class NAME : public common::HandleBase<NAME, TYPE> {              \
+       public:                                                        \
+        void createHandle(TYPE* handle) {                             \
+            CHECK_FUNCTION(CREATE_FUNCTION(handle));                  \
+        }                                                             \
+        void destroyHandle(TYPE handle) { DESTROY_FUNCTION(handle); } \
+    };
diff --git a/src/backend/common/InteropManager.hpp b/src/backend/common/InteropManager.hpp
new file mode 100644
index 0000000000..efdc76adb6
--- /dev/null
+++ b/src/backend/common/InteropManager.hpp
@@ -0,0 +1,110 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/err_common.hpp>
+#include <common/forge_loader.hpp>
+#include <common/util.hpp>
+
+#include <cstdio>
+#include <map>
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+template<class T, typename R>
+class InteropManager {
+    using resource_t = typename std::shared_ptr<R>;
+    using res_vec_t  = typename std::vector<resource_t>;
+    using res_map_t  = typename std::map<void *, res_vec_t>;
+
+   public:
+    InteropManager() {}
+
+    ~InteropManager() {
+        try {
+            destroyResources();
+        } catch (const AfError &ex) {
+            std::string perr = getEnvVar("AF_PRINT_ERRORS");
+            if (!perr.empty()) {
+                if (perr != "0") fprintf(stderr, "%s\n", ex.what());
+            }
+        }
+    }
+
+    res_vec_t getImageResources(const fg_window image) {
+        if (mInteropMap.find(image) == mInteropMap.end()) {
+            uint32_t buffer;
+            FG_CHECK(common::forgePlugin().fg_get_pixel_buffer(&buffer, image));
+            mInteropMap[image] =
+                static_cast<T *>(this)->registerResources({buffer});
+        }
+        return mInteropMap[image];
+    }
+
+    res_vec_t getPlotResources(const fg_plot plot) {
+        if (mInteropMap.find(plot) == mInteropMap.end()) {
+            uint32_t buffer;
+            FG_CHECK(
+                common::forgePlugin().fg_get_plot_vertex_buffer(&buffer, plot));
+            mInteropMap[plot] =
+                static_cast<T *>(this)->registerResources({buffer});
+        }
+        return mInteropMap[plot];
+    }
+
+    res_vec_t getHistogramResources(const fg_histogram histogram) {
+        if (mInteropMap.find(histogram) == mInteropMap.end()) {
+            uint32_t buffer;
+            FG_CHECK(common::forgePlugin().fg_get_histogram_vertex_buffer(
+                &buffer, histogram));
+            mInteropMap[histogram] =
+                static_cast<T *>(this)->registerResources({buffer});
+        }
+        return mInteropMap[histogram];
+    }
+
+    res_vec_t getSurfaceResources(const fg_surface surface) {
+        if (mInteropMap.find(surface) == mInteropMap.end()) {
+            uint32_t buffer;
+            FG_CHECK(common::forgePlugin().fg_get_surface_vertex_buffer(
+                &buffer, surface));
+            mInteropMap[surface] =
+                static_cast<T *>(this)->registerResources({buffer});
+        }
+        return mInteropMap[surface];
+    }
+
+    res_vec_t getVectorFieldResources(const fg_vector_field field) {
+        if (mInteropMap.find(field) == mInteropMap.end()) {
+            uint32_t verts, dirs;
+            FG_CHECK(common::forgePlugin().fg_get_vector_field_vertex_buffer(
+                &verts, field));
+            FG_CHECK(common::forgePlugin().fg_get_vector_field_direction_buffer(
+                &dirs, field));
+            mInteropMap[field] =
+                static_cast<T *>(this)->registerResources({verts, dirs});
+        }
+        return mInteropMap[field];
+    }
+
+   protected:
+    InteropManager(InteropManager const &);
+    void operator=(InteropManager const &);
+
+    void destroyResources() {
+        for (auto &iter : mInteropMap) iter.second.clear();
+    }
+
+    res_map_t mInteropMap;
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/KernelInterface.hpp b/src/backend/common/KernelInterface.hpp
new file mode 100644
index 0000000000..0ead60a8cd
--- /dev/null
+++ b/src/backend/common/KernelInterface.hpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <cstddef>
+#include <string>
+
+namespace arrayfire {
+namespace common {
+
+/// Kernel Interface that should be implemented by each backend
+template<typename TModuleType, typename TKernelType, typename TEnqueuerType,
+         typename TDevPtrType>
+class KernelInterface {
+    TModuleType mModuleHandle;
+    TKernelType mKernelHandle;
+    std::string mName;
+
+   public:
+    using ModuleType   = TModuleType;
+    using KernelType   = TKernelType;
+    using EnqueuerType = TEnqueuerType;
+    using DevPtrType   = TDevPtrType;
+    KernelInterface(std::string name, ModuleType mod, KernelType ker)
+        : mModuleHandle(mod), mKernelHandle(ker), mName(name) {}
+
+    /// \brief Set kernel
+    ///
+    /// \param[in] ker is backend specific kernel handle
+    inline void set(KernelType ker) { mKernelHandle = ker; }
+
+    /// \brief Get kernel
+    ///
+    /// \returns handle to backend specific kernel
+    inline KernelType get() const { return mKernelHandle; }
+
+    /// \brief Get module
+    ///
+    /// \returns handle to backend specific module
+    inline ModuleType getModuleHandle() { return mModuleHandle; }
+
+    /// \brief Get device pointer associated with name(label)
+    ///
+    /// This function is only useful with CUDA NVRTC based compilation
+    /// at the moment. Calling this function for an OpenCL backend build
+    /// will return a null pointer.
+    virtual DevPtrType getDevPtr(const char* name) = 0;
+
+    /// \brief Copy data from device memory to read-only memory
+    ///
+    /// This function copies data of `bytes` size from the device pointer to a
+    /// read-only memory.
+    ///
+    /// \param[in] dst is the device pointer to which data will be copied
+    /// \param[in] src is the device pointer from which data will be copied
+    /// \param[in] bytes is the number of bytes of data to be copied
+    virtual void copyToReadOnly(DevPtrType dst, DevPtrType src,
+                                size_t bytes) = 0;
+
+    /// \brief Copy a single scalar to device memory
+    ///
+    /// This function copies a single integer value from a host variable
+    /// to the device memory pointed to by `dst`
+    ///
+    /// \param[in] dst is the device pointer to which data will be copied
+    /// \param[in] scalarValPtr is a pointer to the scalar value that is set
+    ///            at the device pointer
+    /// \param[in] syncCopy will indicate if the backend call to upload the
+    ///            scalar value to GPU memory has to wait for copy to finish
+    ///            or proceed ahead without wait
+    virtual void setFlag(DevPtrType dst, int* scalarValPtr,
+                         const bool syncCopy = false) = 0;
+
+    /// \brief Fetch a scalar from device memory
+    ///
+    /// This function copies a single integer value from device memory
+    ///
+    /// \param[in] src is the device pointer from which data will be copied
+    ///
+    /// \returns the integer scalar
+    virtual int getFlag(DevPtrType src) = 0;
+
+    /// \brief Enqueue Kernel per queueing criteria forwarding other parameters
+    ///
+    /// This operator overload enables Kernel object to work as functor that
+    /// internally executes the kernel stored in the Kernel object.
+    /// All parameters that are passed in after the EnqueueArgs object are
+    /// essentially forwarded to the kernel launch API
+    ///
+    /// \param[in] qArgs is an object of type EnqueueArgsType like
+    ///            cl::EnqueueArgs in OpenCL backend
+    /// \param[in] args is the placeholder for variadic arguments
+    template<typename EnqueueArgsType, typename... Args>
+    void operator()(const EnqueueArgsType& qArgs, Args... args) {
+        EnqueuerType launch;
+        launch(mName, mKernelHandle, qArgs, std::forward<Args>(args)...);
+    }
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/Logger.cpp b/src/backend/common/Logger.cpp
new file mode 100644
index 0000000000..3081eab672
--- /dev/null
+++ b/src/backend/common/Logger.cpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifdef _WIN32
+#include <windows.h>  // spdlog needs this
+#endif
+
+#include <common/Logger.hpp>
+#include <common/util.hpp>
+
+#include <spdlog/sinks/stdout_sinks.h>
+#include <array>
+#include <cstdlib>
+#include <memory>
+#include <mutex>
+#include <string>
+
+using std::array;
+using std::shared_ptr;
+using std::string;
+
+using spdlog::get;
+using spdlog::logger;
+using spdlog::stdout_logger_mt;
+
+namespace arrayfire {
+namespace common {
+
+shared_ptr<logger> loggerFactory(const string& name) {
+    shared_ptr<logger> logger;
+    if (!(logger = get(name))) {
+        logger = stdout_logger_mt(name);
+        logger->set_pattern("[%n][%E][%t] %v");
+
+        // Log mode
+        string env_var = getEnvVar("AF_TRACE");
+        if (env_var.find("all") != string::npos ||
+            env_var.find(name) != string::npos) {
+            logger->set_level(spdlog::level::trace);
+        } else {
+            logger->set_level(spdlog::level::off);
+        }
+    }
+    return logger;
+}
+
+string bytesToString(size_t bytes) {
+    constexpr array<const char*, 7> units{
+        {"B", "KB", "MB", "GB", "TB", "PB", "EB"}};
+    size_t count     = 0;
+    auto fbytes      = static_cast<double>(bytes);
+    size_t num_units = units.size();
+    for (count = 0; count < num_units && fbytes > 1000.0f; count++) {
+        fbytes *= (1.0f / 1024.0f);
+    }
+    if (count == units.size()) { count--; }
+    return fmt::format("{:.3g} {}", fbytes, units[count]);
+}
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/Logger.hpp b/src/backend/common/Logger.hpp
new file mode 100644
index 0000000000..a9a8feaa0b
--- /dev/null
+++ b/src/backend/common/Logger.hpp
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <memory>
+#include <string>
+#include <type_traits>
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wignored-attributes"
+#pragma clang diagnostic ignored "-Wtautological-constant-compare"
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(__GNUC__) || defined(__GNUG__)
+#pragma GCC diagnostic push
+/* GNU GCC/G++ */
+#elif defined(_MSC_VER)
+#pragma warning(push) /* Microsoft Visual Studio */
+#else
+/* Other */
+#endif
+
+#include <spdlog/spdlog.h>
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(__GNUC__) || defined(__GNUG__)
+/* GNU GCC/G++ */
+#pragma GCC diagnostic pop
+#elif defined(_MSC_VER)
+/* Microsoft Visual Studio */
+#pragma warning(pop)
+#else
+/* Other */
+#endif
+
+namespace arrayfire {
+namespace common {
+std::shared_ptr<spdlog::logger> loggerFactory(const std::string& name);
+std::string bytesToString(size_t bytes);
+}  // namespace common
+}  // namespace arrayfire
+
+#ifdef AF_WITH_LOGGING
+#define AF_STR_H(x) #x
+#define AF_STR_HELPER(x) AF_STR_H(x)
+#ifdef _MSC_VER
+#define AF_TRACE(...)                \
+    getLogger()->trace("[ " __FILE__ \
+                       "(" AF_STR_HELPER(__LINE__) ") ] " __VA_ARGS__)
+#else
+#define AF_TRACE(...)                \
+    getLogger()->trace("[ " __FILE__ \
+                       ":" AF_STR_HELPER(__LINE__) " ] " __VA_ARGS__)
+#endif
+#else
+#define AF_TRACE(...) (void)0
+#endif
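The AF_TRACE macro in Logger.hpp above builds its "[ file:line ] " prefix at compile time. It relies on two things: adjacent string literals are concatenated, and `__LINE__` must pass through a second macro before `#` is applied, otherwise the token itself is stringified. The following is a minimal standalone sketch of that technique, not ArrayFire code. The names `DEMO_STR_H`, `DEMO_STR`, `DEMO_PREFIX`, `oneLevel`, and `demoPrefix` are hypothetical, and `#line` is used only to make the result predictable.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for AF_STR_H / AF_STR_HELPER from Logger.hpp.
#define DEMO_STR_H(x) #x
#define DEMO_STR(x) DEMO_STR_H(x)

// Glues "[ file:line ] " onto a format string. As with AF_TRACE, the
// argument must be a string literal for the concatenation to compile.
#define DEMO_PREFIX(fmt) "[ " __FILE__ ":" DEMO_STR(__LINE__) " ] " fmt

// One level of stringification: '#' is applied before __LINE__ expands,
// so the result is the macro name rather than a line number.
inline std::string oneLevel() { return DEMO_STR_H(__LINE__); }

// Two levels: __LINE__ expands first and is stringified afterwards.
// The #line directive pins the file name and line number of the next line.
inline std::string demoPrefix() {
#line 42 "demo.cpp"
    return DEMO_PREFIX("allocating {} bytes");
}
```

Because the prefix is part of the format string literal, the logger receives a single format argument and no runtime string building is needed for the location information.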
diff --git a/src/backend/common/MemoryManagerBase.hpp b/src/backend/common/MemoryManagerBase.hpp
new file mode 100644
index 0000000000..ceeb26c605
--- /dev/null
+++ b/src/backend/common/MemoryManagerBase.hpp
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/AllocatorInterface.hpp>
+#include <af/defines.h>
+
+#include <cstddef>
+#include <memory>
+
+namespace spdlog {
+class logger;
+}
+
+namespace arrayfire {
+namespace common {
+/**
+ * An internal base interface for a memory manager which is exposed to AF
+ * internals. Externally, both the default AF memory manager implementation and
+ * custom memory manager implementations are wrapped in a derived implementation
+ * of this interface.
+ */
+class MemoryManagerBase {
+   public:
+    MemoryManagerBase()                                     = default;
+    MemoryManagerBase &operator=(const MemoryManagerBase &) = delete;
+    MemoryManagerBase(const MemoryManagerBase &)            = delete;
+    virtual ~MemoryManagerBase() {}
+    // Shuts down the allocator interface which calls shutdown on the subclassed
+    // memory manager with device-specific context
+    virtual void shutdownAllocator() {
+        if (nmi_) nmi_->shutdown();
+    }
+    virtual void initialize()                                        = 0;
+    virtual void shutdown()                                          = 0;
+    virtual void *alloc(bool user_lock, const unsigned ndims, dim_t *dims,
+                        const unsigned element_size)                 = 0;
+    virtual size_t allocated(void *ptr)                              = 0;
+    virtual void unlock(void *ptr, bool user_unlock)                 = 0;
+    virtual void signalMemoryCleanup()                               = 0;
+    virtual void printInfo(const char *msg, const int device)        = 0;
+    virtual void usageInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                           size_t *lock_bytes, size_t *lock_buffers) = 0;
+    virtual void userLock(const void *ptr)                           = 0;
+    virtual void userUnlock(const void *ptr)                         = 0;
+    virtual bool isUserLocked(const void *ptr)                       = 0;
+    virtual size_t getMemStepSize()                                  = 0;
+    virtual void setMemStepSize(size_t new_step_size)                = 0;
+
+    /// Backend-specific functions
+    // OpenCL
+    virtual void addMemoryManagement(int device)    = 0;
+    virtual void removeMemoryManagement(int device) = 0;
+
+    int getActiveDeviceId() { return nmi_->getActiveDeviceId(); }
+    size_t getMaxMemorySize(int id) { return nmi_->getMaxMemorySize(id); }
+    void *nativeAlloc(const size_t bytes) { return nmi_->nativeAlloc(bytes); }
+    void nativeFree(void *ptr) { nmi_->nativeFree(ptr); }
+    virtual spdlog::logger *getLogger() final { return nmi_->getLogger(); }
+    virtual void setAllocator(std::unique_ptr<AllocatorInterface> nmi) {
+        nmi_ = std::move(nmi);
+    }
+
+    // Memory pressure functions
+    void setMemoryPressureThreshold(float pressure) {
+        memoryPressureThreshold_ = pressure;
+    }
+    float getMemoryPressureThreshold() const {
+        return memoryPressureThreshold_;
+    }
+    virtual float getMemoryPressure()                       = 0;
+    virtual bool jitTreeExceedsMemoryPressure(size_t bytes) = 0;
+
+   private:
+    // A threshold at or above which JIT evaluations will be triggered due to
+    // memory pressure. Settable via a call to setMemoryPressureThreshold
+    float memoryPressureThreshold_{1.0};
+    // A backend-specific memory manager, containing backend-specific
+    // methods that call native memory manipulation functions in a device
+    // API. We need to wrap these since they are opaquely called by the
+    // memory manager.
+    std::unique_ptr<AllocatorInterface> nmi_;
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/MersenneTwister.hpp b/src/backend/common/MersenneTwister.hpp
new file mode 100644
index 0000000000..a96e271a01
--- /dev/null
+++ b/src/backend/common/MersenneTwister.hpp
@@ -0,0 +1,265 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+
+/*
+ * These numbers have been obtained from the following file:
+ * https://github.com/MersenneTwister-Lab/MTGP/blob/master/cuda-sample/mtgp32dc-param-11213.c
+ */
+
+#pragma once
+
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace common {
+const dim_t MaxBlocks     = 32;
+const dim_t TableLength   = 16 * MaxBlocks;
+const dim_t MersenneN     = 351;
+const dim_t MtStateLength = MaxBlocks * MersenneN;
+
+static unsigned pos[] = {
+    88, 84, 25, 42, 22, 11, 76, 11, 42, 60, 45, 80, 81, 16, 63, 38,
+    3,  55, 9,  75, 70, 63, 32, 70, 58, 33, 18, 9,  14, 91, 90, 86,
+};
+
+static unsigned sh1[] = {
+    19, 15, 4, 20, 1,  16, 16, 15, 6,  6, 12, 6,  8,  1, 14, 28,
+    30, 1,  9, 17, 15, 15, 7,  12, 21, 7, 7,  12, 16, 4, 10, 6,
+};
+
+static unsigned sh2[] = {
+    5, 12, 18, 9, 5, 1,  6,  16, 11, 11, 13, 9,  18, 19, 18, 1,
+    2, 16, 15, 6, 6, 17, 15, 10, 2,  10, 13, 13, 3,  2,  14, 7,
+};
+
+static unsigned mask = 4294443008;
+// static const unsigned mask[] = {
+// 4294443008, 4294443008, 4294443008, 4294443008, 4294443008, 4294443008,
+// 4294443008, 4294443008, 4294443008, 4294443008, 4294443008, 4294443008,
+// 4294443008, 4294443008, 4294443008, 4294443008, 4294443008, 4294443008,
+// 4294443008, 4294443008, 4294443008, 4294443008, 4294443008, 4294443008,
+// 4294443008, 4294443008, 4294443008, 4294443008, 4294443008, 4294443008,
+// 4294443008, 4294443008,
+//};
+
+static unsigned recursion_tbl[] = {
+    0,          2879706668, 3137826695, 279165355,  570425344,  2309281324,
+    2567401351, 849590699,  38330,      2879669142, 3137862205, 279129105,
+    570463674,  2309243798, 2567436861, 849554449,  0,          2593609479,
+    1975655185, 4015344662, 357564441,  2412205854, 1620187912, 4194651151,
+    53865,      2593621358, 1975699832, 4015365759, 357618288,  2412217719,
+    1620232545, 4194672230, 0,          1350351572, 3879866427, 3074337519,
+    1908408359, 566016755,  2525106204, 3338578632, 56684,      1350330296,
+    3879914839, 3074324355, 1908464971, 565995423,  2525154672, 3338565540,
+    0,          1055977248, 2682116586, 2704094922, 271581238,  784396054,
+    2414729692, 2971481852, 23789,      1055962061, 2682094855, 2704108071,
+    271604955,  784380923,  2414708017, 2971494929, 0,          2953117369,
+    62376084,   3014866477, 122683469,  3075800820, 82299097,   3034789472,
+    18916,      2953099101, 62357872,   3014885321, 122702249,  3075782416,
+    82280765,   3034808196, 0,          1684682268, 3815655459, 2265218623,
+    723517535,  1330263619, 3360573564, 2888072800, 51884,      1684733104,
+    3815670415, 2265232531, 723569395,  1330314479, 3360588496, 2888086732,
+    0,          465136444,  2086871972, 1742358680, 574619745,  972647261,
+    1579361221, 1167739129, 19553,      465119069,  2086891461, 1742341369,
+    574639104,  972629820,  1579380644, 1167721624, 0,          2761653723,
+    3161842783, 418290052,  1969225846, 3522919853, 3373655081, 1838062066,
+    50425,      2761668898, 3161792678, 418274685,  1969276047, 3522935124,
+    3373605072, 1838046475, 0,          823868564,  2715863359, 2432431531,
+    1012924546, 226180118,  2642463165, 2895901993, 46626,      823888566,
+    2715844381, 2432385929, 1012971168, 226200116,  2642444191, 2895856395,
+    0,          3163477696, 1302313789, 4044451325, 2389704861, 855561821,
+    3287268256, 2137091424, 40970,      3163453130, 1302272823, 4044475895,
+    2389745815, 855537239,  3287227306, 2137116010, 0,          112997176,
+    3723016697, 3679750849, 4213178533, 4254872477, 650688860,  544508516,
+    42220,      113022932,  3722976533, 3679727149, 4213220425, 4254898033,
+    650649008,  544485000,  0,          612320333,  3070909104, 2473926397,
+    3292528821, 3762242808, 1934252549, 1463098952, 21839,      612307202,
+    3070889983, 2473937842, 3292550650, 3762229687, 1934233418, 1463110407,
+    0,          3452614784, 3739161126, 320187046,  3513778374, 481998918,
+    263131872,  3261442656, 46663,      3452571335, 3739198561, 320150753,
+    3513824897, 481955329,  263169191,  3261406247, 0,          588293362,
+    2445856652, 3000527742, 2311061713, 2865800227, 403230557,  991456175,
+    62105,      588273259,  2445819157, 3000539623, 2311123528, 2865780410,
+    403193284,  991467830,  0,          2177339548, 3971246860, 1836317584,
+    4080009455, 1928826995, 528772067,  2655255423, 59650,      2177333662,
+    3971252750, 1836257938, 4080069101, 1928821105, 528777953,  2655195773,
+    0,          3849091899, 1973261595, 2431774240, 442499322,  4279008193,
+    1878889953, 2324819674, 17524,      3849076559, 1973277039, 2431756884,
+    442516622,  4278992821, 1878905237, 2324802222, 0,          2208474059,
+    1390454348, 3510764935, 2555379979, 468886208,  3400574791, 1225917580,
+    43685,      2208434542, 1390415081, 3510808354, 2555423662, 468846693,
+    3400535522, 1225961001, 0,          753114312,  1662184071, 1341224527,
+    2666529043, 2987630043, 4259507092, 3506534236, 32220,      753131796,
+    1662162779, 1341197203, 2666560719, 2987646983, 4259485256, 3506506368,
+    0,          578983141,  3246792315, 3808725662, 1649410346, 1087542735,
+    2748718929, 2169801652, 22528,      578997477,  3246802555, 3808744094,
+    1649432874, 1087557071, 2748729169, 2169820084, 0,          2174674396,
+    1341825145, 3462677925, 3861905719, 1739515115, 2848629070, 676611218,
+    35444,      2174644136, 1341794829, 3462713297, 3861941059, 1739484831,
+    2848598842, 676646630,  0,          3312287941, 1270375145, 2396381740,
+    2464153923, 1468891526, 3646448554, 473293679,  38325,      3312260464,
+    1270413148, 2396354457, 2464191734, 1468863539, 3646486047, 473265882,
+    0,          1011438676, 4087471600, 3488123300, 455082329,  661214477,
+    3900824745, 3569912061, 52539,      1011456367, 4087419083, 3488105631,
+    455134306,  661231670,  3900772754, 3569894854, 0,          3529350863,
+    2972224887, 1668617144, 3278897504, 288202671,  1918405655, 2684687064,
+    37845,      3529313562, 2972196514, 1668644973, 3278934709, 288164986,
+    1918377922, 2684715277, 0,          1851876274, 2748239338, 3450842712,
+    3060793725, 3625018063, 364825751,  2078256933, 54612,      1851897574,
+    2748192958, 3450829580, 3060847657, 3625039771, 364779971,  2078243441,
+    0,          2798128189, 889378840,  2479543333, 3664773513, 2092436916,
+    4017281425, 1236981164, 52108,      2798176177, 889328532,  2479497129,
+    3664824837, 2092484152, 4017230365, 1236934176, 0,          1067241239,
+    3453461153, 4065029558, 4190110101, 3327970946, 873964340,  193686563,
+    27698,      1067229989, 3453472403, 4065001860, 4190137767, 3327959728,
+    873975558,  193658897,  0,          2213869621, 887682042,  3072068559,
+    4061135267, 1910831510, 3338203737, 1158417004, 40265,      2213832060,
+    887647923,  3072104070, 4061175018, 1910793439, 3338170128, 1158453029,
+    0,          3310321543, 1761913168, 2890651351, 2243953084, 1083145787,
+    3972311276, 697030507,  31607,      3310290160, 1761923623, 2890640800,
+    2243984075, 1083114828, 3972322203, 697019420,  0,          2168967706,
+    1652210191, 3812452373, 663749057,  2799162331, 1173011406, 3299699156,
+    62673,      2168923851, 1652182750, 3812465860, 663811344,  2799118090,
+    1172983583, 3299712261, 0,          2787116654, 1233955067, 4021071509,
+    3287286227, 1708132285, 2323425576, 744271686,  37500,      2787152914,
+    1233926791, 4021042409, 3287323567, 1708168641, 2323397460, 744242490,
+    0,          3709953686, 1938447361, 2930457239, 1075839468, 2634114938,
+    866803181,  4002102139, 50633,      3709969247, 1938463176, 2930507614,
+    1075889189, 2634130099, 866818084,  4002152114, 0,          1104625460,
+    154195956,  1223157952, 1706033657, 610746061,  1820382733, 760736057,
+    35538,      1104655846, 154164518,  1223123474, 1706068779, 610776095,
+    1820351711, 760701931,
+};
+
+static unsigned temper_tbl[] = {
+    0,          101711872,  634912768,  600309760,  673972224,  775684096,
+    234094592,  199491584,  855825920,  890428928,  383442432,  281730560,
+    456056320,  490659328,  1056366080, 954654208,  0,          6581248,
+    135327744,  141859840,  922910720,  929491968,  1058172928, 1064705024,
+    5266944,    3420672,    138456576,  136626688,  928177664,  926331392,
+    1061301760, 1059471872, 0,          69272064,   134479872,  203751936,
+    289406976,  358679040,  423886848,  493158912,  544153088,  609098752,
+    678108672,  743054336,  825171456,  890117120,  959127040,  1024072704,
+    0,          608635904,  2523648,    610373120,  5439488,    605293568,
+    7700992,    607292928,  274488832,  874205696,  276487168,  876466176,
+    269442560,  877154816,  271178752,  879677440,  0,          1214875648,
+    67174400,   1281918976, 1614020608, 677218304,  1681195008, 744261632,
+    537288192,  1752159744, 604462592,  1819203072, 1077042688, 140236288,
+    1144217088, 207279616,  0,          3567010304, 194572288,  3741626880,
+    604008960,  4036767744, 798523904,  4211392512, 1563500032, 2309839872,
+    1453977088, 2184555520, 2033282048, 2913807872, 1923718144, 2788548096,
+    0,          136677376,  272809984,  409421824,  2109440,    134592512,
+    274919424,  407336960,  1891638784, 2028312064, 1619189248, 1755796992,
+    1893740032, 2026219008, 1621290496, 1753703936, 0,          4026617856,
+    172764160,  4199382016, 75673600,   4102283264, 248421376,  4275031040,
+    1158700544, 3037793792, 1331458560, 3210551808, 1100148224, 2979249664,
+    1272889856, 3151991296, 0,          2147483648, 0,          2147483648,
+    3221225472, 1073741824, 3221225472, 1073741824, 536878592,  2684362240,
+    536878592,  2684362240, 3758104064, 1610620416, 3758104064, 1610620416,
+    0,          3225420800, 2228224,    3227649024, 809369600,  4034790400,
+    807141376,  4032562176, 1073880576, 2151815680, 1075846656, 2153781760,
+    1882988032, 2960923136, 1881021952, 2958957056, 0,          206130176,
+    1107561472, 1313685504, 537462784,  742409216,  1645020160, 1849968640,
+    272670208,  470405632,  1380225536, 1577967104, 810128896,  1006688768,
+    1917688320, 2114246144, 0,          807406080,  275513344,  541854208,
+    1573888,    808979968,  276038656,  542379520,  269098496,  539628544,
+    6692352,    809899008,  269621760,  540151808,  8264192,    811470848,
+    0,          1612447744, 142082048,  1751384064, 7602176,    1617428480,
+    135004160,  1745879040, 10493440,   1622941184, 148381184,  1757683200,
+    13901312,   1623727616, 145497600,  1756372480, 0,          1275199488,
+    2793406464, 3934388224, 352321536,  1493303296, 3011510272, 4286709760,
+    689970688,  1696734720, 2409635328, 3282181632, 1008737792, 1881284096,
+    2594184704, 3600948736, 0,          150999040,  33619968,   184619008,
+    2103296,    153094144,  35723264,   186714112,  805543424,  956534272,
+    839032320,  990023168,  807634432,  958633472,  841123328,  992122368,
+    0,          3225487872, 3876324352, 659361280,  4198400000, 981436928,
+    489849856,  3715337728, 545267200,  3770749952, 3347847680, 130879488,
+    3669925376, 452957184,  1035115008, 4260597760, 0,          3910139904,
+    2496659456, 2109734912, 3687579648, 853278720,  1327235072, 2785804288,
+    164634112,  3770686976, 2634030592, 1947213312, 3525058048, 990649856,
+    1187782144, 2950438400, 0,          201461760,  3812622336, 4014084096,
+    33554432,   235016192,  3779067904, 3980529664, 1352670720, 1554124288,
+    3017809408, 3219262976, 1386225152, 1587678720, 2984254976, 3185708544,
+    0,          1469123584, 3497837568, 2280507392, 2818795520, 4287783936,
+    2021633024, 804167680,  5381632,    1472401920, 3492731392, 2277496320,
+    2823910912, 4290804224, 2016260608, 800898560,  0,          2550235136,
+    537106432,  3087144960, 1074806784, 3625041920, 1611913216, 4161951744,
+    212480,     2550316544, 536913408,  3087083008, 1075019264, 3625123328,
+    1611720192, 4161889792, 0,          543432704,  623958016,  89454592,
+    27283968,   566522368,  613452288,  83143168,   5381632,    540425728,
+    627230208,  84338176,   32656384,   563506176,  616731648,  78033920,
+    0,          134377472,  2521945088, 2656281600, 4236288,    138597376,
+    2517725184, 2652045312, 7249408,    141356544,  2520730112, 2654812672,
+    3029504,    137120256,  2524966400, 2659032576, 0,          269697536,
+    42663936,   311968256,  1346895872, 1079722496, 1388511232, 1120944640,
+    284057088,  16587776,   308633088,  41294848,   1084644864, 1354046464,
+    1110269440, 1379802112, 0,          54056960,   5527552,    57442304,
+    22155264,   40552448,   17188864,   37654528,   2153984,    51906048,
+    7636480,    55336448,   24301056,   38409728,   19305984,   35540480,
+    0,          3597599744, 21800448,   3609436672, 4467200,    3593154048,
+    17337344,   3613886464, 209935872,  3672922624, 231733248,  3684760576,
+    214397952,  3668471808, 227267072,  3689207296, 0,          70910976,
+    7421952,    72041472,   272663552,  343572480,  271696896,  336314368,
+    1879350784, 1950259712, 1886772736, 1951390208, 1615075840, 1685986816,
+    1614109184, 1678728704, 0,          1761935360, 185210880,  1645156352,
+    1101029376, 681926656,  1252685824, 598702080,  74128896,   1835933184,
+    258016768,  1717831168, 1170963968, 751730176,  1321297408, 667182592,
+    0,          3338677248, 1516357632, 2640438272, 302069760,  3573617664,
+    1214312448, 2405489664, 3368033792, 264253952,  2460079616, 1436678656,
+    3670091264, 499190272,  2158030336, 1201717760, 0,          554461696,
+    7452672,    561893888,  3422689792, 3977146368, 3430130176, 3984574464,
+    1102241280, 1623110656, 1103324672, 1624181760, 2377171968, 2898046464,
+    2378267648, 2899121664, 0,          697340416,  7652864,    702829568,
+    1074688000, 1772020224, 1081783808, 1776952320, 1141022208, 1838287872,
+    1148606464, 1843841536, 67956224,   765230080,  74983424,   770226688,
+    0,          423231488,  128225280,  513708032,  6653952,    425691136,
+    130095104,  519772160,  1146551808, 1567424000, 1139961344, 1523084800,
+    1144223232, 1560901120, 1134028288, 1521346048, 0,          33619968,
+    271785984,  305274880,  268632064,  302120960,  3153920,    36773888,
+    1077583360, 1111203328, 1342815744, 1376304640, 1345953280, 1379442176,
+    1074445824, 1108065792,
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/ModuleInterface.hpp b/src/backend/common/ModuleInterface.hpp
new file mode 100644
index 0000000000..2c3127abb2
--- /dev/null
+++ b/src/backend/common/ModuleInterface.hpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace common {
+
+/// Instances of this class are stored in the JIT kernel cache
+template<typename ModuleType>
+class ModuleInterface {
+   private:
+    ModuleType mModuleHandle;
+
+   public:
+    /// \brief Creates an uninitialized Module
+    ModuleInterface() = default;
+
+    /// \brief Creates a module given a backend-specific ModuleType
+    ///
+    /// \param[in] mod The backend-specific module handle
+    ModuleInterface(ModuleType mod) : mModuleHandle(mod) {}
+
+    /// \brief Set module
+    ///
+    /// \param[in] mod The backend-specific module handle
+    inline void set(ModuleType mod) { mModuleHandle = mod; }
+
+    /// \brief Get module
+    ///
+    /// \returns handle to the backend-specific module
+    inline const ModuleType& get() const { return mModuleHandle; }
+
+    /// \brief Unload module
+    virtual void unload() = 0;
+
+    /// \brief Returns true if the module mModuleHandle is initialized
+    virtual operator bool() const = 0;
+};
+
+}  // namespace common
+}  // namespace arrayfire
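ModuleInterface in the file above leaves `unload()` and `operator bool()` to each backend. The sketch below shows one way a backend might fill them in. It is illustrative only: the template is a trimmed copy so the example compiles on its own, and `FakeHandle` and `FakeModule` are hypothetical names for a backend whose handle is an integer id, with 0 meaning "not loaded".

```cpp
#include <cassert>

// Trimmed copy of the ModuleInterface template from the diff, repeated
// here only so that the sketch is self-contained.
template<typename ModuleType>
class ModuleInterface {
    ModuleType mModuleHandle;

   public:
    ModuleInterface() = default;
    ModuleInterface(ModuleType mod) : mModuleHandle(mod) {}
    void set(ModuleType mod) { mModuleHandle = mod; }
    const ModuleType& get() const { return mModuleHandle; }
    virtual void unload()         = 0;
    virtual operator bool() const = 0;
};

// Hypothetical handle type: an integer id where 0 means "not loaded".
using FakeHandle = int;

class FakeModule : public ModuleInterface<FakeHandle> {
   public:
    // The base default constructor leaves an int handle uninitialized,
    // so this sketch always passes an explicit value.
    FakeModule() : ModuleInterface<FakeHandle>(0) {}
    explicit FakeModule(FakeHandle h) : ModuleInterface<FakeHandle>(h) {}

    // A real backend would release the device module here.
    void unload() override { set(0); }

    // The module counts as initialized while the handle is non-zero.
    operator bool() const override { return get() != 0; }
};
```

A cache holding such objects can test an entry with `if (module)` before using its handle, and call `unload()` when the entry is evicted.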
diff --git a/src/backend/common/Source.hpp b/src/backend/common/Source.hpp
new file mode 100644
index 0000000000..2199b389da
--- /dev/null
+++ b/src/backend/common/Source.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <cstddef>
+
+namespace arrayfire {
+namespace common {
+struct Source {
+    const char* ptr;           // Pointer to the kernel source
+    const std::size_t length;  // Length of the kernel source
+    const std::size_t hash;    // Hash value of the kernel source
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/SparseArray.cpp b/src/backend/common/SparseArray.cpp
new file mode 100644
index 0000000000..052dc97e86
--- /dev/null
+++ b/src/backend/common/SparseArray.cpp
@@ -0,0 +1,272 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/SparseArray.hpp>
+#include <copy.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <af/traits.hpp>
+
+using af::dim4;
+using af::dtype_traits;
+using detail::Array;
+using detail::cdouble;
+using detail::cfloat;
+using detail::copyArray;
+using detail::createDeviceDataArray;
+using detail::createHostDataArray;
+using detail::createValueArray;
+using detail::getActiveDeviceId;
+using detail::scalar;
+using detail::writeDeviceDataArray;
+
+namespace arrayfire {
+namespace common {
+////////////////////////////////////////////////////////////////////////////
+// Sparse Array Base Implementations
+////////////////////////////////////////////////////////////////////////////
+
+// ROW_LENGTH and COL_LENGTH expect the following names to be in scope:
+// stype -> SparseArrayBase::stype
+// _nNZ  -> Constructor Argument
+// _dims -> Constructor Argument
+#define ROW_LENGTH                                               \
+    ((stype == AF_STORAGE_COO || stype == AF_STORAGE_CSC) ? _nNZ \
+                                                          : (_dims[0] + 1))
+#define COL_LENGTH                                               \
+    ((stype == AF_STORAGE_COO || stype == AF_STORAGE_CSR) ? _nNZ \
+                                                          : (_dims[1] + 1))
+
+SparseArrayBase::SparseArrayBase(const af::dim4 &_dims, dim_t _nNZ,
+                                 af::storage _storage, af_dtype _type)
+    : info(getActiveDeviceId(), _dims, 0, calcStrides(_dims), _type, true)
+    , stype(_storage)
+    , rowIdx(createValueArray<int>(dim4(ROW_LENGTH), 0))
+    , colIdx(createValueArray<int>(dim4(COL_LENGTH), 0)) {
+    static_assert(offsetof(SparseArrayBase, info) == 0,
+                  "SparseArrayBase::info must be the first member variable of "
+                  "SparseArrayBase.");
+    static_assert(std::is_nothrow_move_assignable<SparseArrayBase>::value,
+                  "SparseArrayBase is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<SparseArrayBase>::value,
+                  "SparseArrayBase is not move constructible");
+}
+
+SparseArrayBase::SparseArrayBase(const af::dim4 &_dims, dim_t _nNZ,
+                                 int *const _rowIdx, int *const _colIdx,
+                                 const af::storage _storage, af_dtype _type,
+                                 bool _is_device, bool _copy_device)
+    : info(getActiveDeviceId(), _dims, 0, calcStrides(_dims), _type, true)
+    , stype(_storage)
+    , rowIdx(_is_device
+                 ? (!_copy_device
+                        ? createDeviceDataArray<int>(dim4(ROW_LENGTH), _rowIdx)
+                        : createValueArray<int>(dim4(ROW_LENGTH), 0))
+                 : createHostDataArray<int>(dim4(ROW_LENGTH), _rowIdx))
+    , colIdx(_is_device
+                 ? (!_copy_device
+                        ? createDeviceDataArray<int>(dim4(COL_LENGTH), _colIdx)
+                        : createValueArray<int>(dim4(COL_LENGTH), 0))
+                 : createHostDataArray<int>(dim4(COL_LENGTH), _colIdx)) {
+    static_assert(offsetof(SparseArrayBase, info) == 0,
+                  "SparseArrayBase::info must be the first member variable of "
+                  "SparseArrayBase.");
+    if (_is_device && _copy_device) {
+        writeDeviceDataArray<int>(rowIdx, _rowIdx, ROW_LENGTH * sizeof(int));
+        writeDeviceDataArray<int>(colIdx, _colIdx, COL_LENGTH * sizeof(int));
+    }
+}
+
+SparseArrayBase::SparseArrayBase(const af::dim4 &_dims,
+                                 const Array<int> &_rowIdx,
+                                 const Array<int> &_colIdx,
+                                 const af::storage _storage, af_dtype _type,
+                                 bool _copy)
+    : info(getActiveDeviceId(), _dims, 0, calcStrides(_dims), _type, true)
+    , stype(_storage)
+    , rowIdx(_copy ? copyArray<int>(_rowIdx) : _rowIdx)
+    , colIdx(_copy ? copyArray<int>(_colIdx) : _colIdx) {
+    static_assert(offsetof(SparseArrayBase, info) == 0,
+                  "SparseArrayBase::info must be the first member variable of "
+                  "SparseArrayBase.");
+}
+
+SparseArrayBase::SparseArrayBase(const SparseArrayBase &base, bool copy)
+    : info(base.info)
+    , stype(base.stype)
+    , rowIdx(copy ? copyArray<int>(base.rowIdx) : base.rowIdx)
+    , colIdx(copy ? copyArray<int>(base.colIdx) : base.colIdx) {}
+
+SparseArrayBase::~SparseArrayBase() = default;
+
+dim_t SparseArrayBase::getNNZ() const {
+    if (stype == AF_STORAGE_COO || stype == AF_STORAGE_CSC) {
+        return rowIdx.elements();
+    }
+    if (stype == AF_STORAGE_CSR) { return colIdx.elements(); }
+
+    // Storage types not handled above (e.g. dense) report zero non-zeros
+    return 0;
+}
+
+#undef ROW_LENGTH
+#undef COL_LENGTH
+
+////////////////////////////////////////////////////////////////////////////
+// Friend functions for Sparse Array Creation Implementations
+////////////////////////////////////////////////////////////////////////////
+template<typename T>
+SparseArray<T> createEmptySparseArray(const af::dim4 &_dims, dim_t _nNZ,
+                                      const af::storage _storage) {
+    return SparseArray<T>(_dims, _nNZ, _storage);
+}
+
+template<typename T>
+SparseArray<T> createHostDataSparseArray(const af::dim4 &_dims, const dim_t nNZ,
+                                         const T *const _values,
+                                         const int *const _rowIdx,
+                                         const int *const _colIdx,
+                                         const af::storage _storage) {
+    return SparseArray<T>(_dims, nNZ, const_cast<T *>(_values),
+                          const_cast<int *>(_rowIdx),
+                          const_cast<int *>(_colIdx), _storage, false);
+}
+
+template<typename T>
+SparseArray<T> createDeviceDataSparseArray(
+    const af::dim4 &_dims, const dim_t nNZ, T *const _values,
+    int *const _rowIdx,  // NOLINT(readability-non-const-parameter)
+    int *const _colIdx,  // NOLINT(readability-non-const-parameter)
+    const af::storage _storage, const bool _copy) {
+    return SparseArray<T>(_dims, nNZ, _values, _rowIdx, _colIdx, _storage, true,
+                          _copy);
+}
+
+template<typename T>
+SparseArray<T> createArrayDataSparseArray(
+    const af::dim4 &_dims, const Array<T> &_values, const Array<int> &_rowIdx,
+    const Array<int> &_colIdx, const af::storage _storage, const bool _copy) {
+    return SparseArray<T>(_dims, _values, _rowIdx, _colIdx, _storage, _copy);
+}
+
+template<typename T>
+SparseArray<T> copySparseArray(const SparseArray<T> &other) {
+    return SparseArray<T>(other, true);
+}
+
+template<typename T>
+SparseArray<T> *initSparseArray() {
+    return new SparseArray<T>(dim4(), 0, static_cast<af::storage>(0));
+}
+
+template<typename T>
+void destroySparseArray(SparseArray<T> *sparse) {
+    delete sparse;
+}
+
+template<typename T>
+void checkAndMigrate(const SparseArray<T> &arr) {
+    checkAndMigrate(const_cast<Array<int> &>(arr.getColIdx()));
+    checkAndMigrate(const_cast<Array<int> &>(arr.getRowIdx()));
+    checkAndMigrate(const_cast<Array<T> &>(arr.getValues()));
+}
+
+////////////////////////////////////////////////////////////////////////////
+// Sparse Array Class Implementations
+////////////////////////////////////////////////////////////////////////////
+template<typename T>
+SparseArray<T>::SparseArray(const dim4 &_dims, dim_t _nNZ, af::storage _storage)
+    : base(_dims, _nNZ, _storage,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , values(createValueArray<T>(dim4(_nNZ), scalar<T>(0))) {
+    static_assert(std::is_standard_layout<SparseArray<T>>::value,
+                  "SparseArray<T> must be a standard layout type");
+    static_assert(std::is_nothrow_move_assignable<SparseArray<T>>::value,
+                  "SparseArray<T> is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<SparseArray<T>>::value,
+                  "SparseArray<T> is not move constructible");
+    static_assert(offsetof(SparseArray<T>, base) == 0,
+                  "SparseArray<T>::base must be the first member variable of "
+                  "SparseArray<T>");
+}
+
+template<typename T>
+SparseArray<T>::SparseArray(const af::dim4 &_dims, dim_t _nNZ, T *const _values,
+                            int *const _rowIdx, int *const _colIdx,
+                            const af::storage _storage, bool _is_device,
+                            bool _copy_device)
+    : base(_dims, _nNZ, _rowIdx, _colIdx, _storage,
+           static_cast<af_dtype>(dtype_traits<T>::af_type), _is_device,
+           _copy_device)
+    , values(_is_device ? (!_copy_device
+                               ? createDeviceDataArray<T>(dim4(_nNZ), _values)
+                               : createValueArray<T>(dim4(_nNZ), scalar<T>(0)))
+                        : createHostDataArray<T>(dim4(_nNZ), _values)) {
+    if (_is_device && _copy_device) {
+        writeDeviceDataArray<T>(values, _values, _nNZ * sizeof(T));
+    }
+}
+
+template<typename T>
+SparseArray<T>::SparseArray(const af::dim4 &_dims, const Array<T> &_values,
+                            const Array<int> &_rowIdx,
+                            const Array<int> &_colIdx,
+                            const af::storage _storage, bool _copy)
+    : base(_dims, _rowIdx, _colIdx, _storage,
+           static_cast<af_dtype>(dtype_traits<T>::af_type), _copy)
+    , values(_copy ? copyArray<T>(_values) : _values) {}
+
+template<typename T>
+SparseArray<T>::SparseArray(const SparseArray<T> &other, bool copy)
+    : base(other.base, copy)
+    , values(copy ? copyArray<T>(other.values) : other.values) {}
+
+#define INSTANTIATE(T)                                                       \
+    template SparseArray<T> createEmptySparseArray<T>(                       \
+        const af::dim4 &_dims, dim_t _nNZ, const af::storage _storage);      \
+    template SparseArray<T> createHostDataSparseArray<T>(                    \
+        const af::dim4 &_dims, const dim_t _nNZ, const T *const _values,     \
+        const int *const _rowIdx, const int *const _colIdx,                  \
+        const af::storage _storage);                                         \
+    template SparseArray<T> createDeviceDataSparseArray<T>(                  \
+        const af::dim4 &_dims, const dim_t _nNZ,                             \
+        T *const _values, /*  NOLINT */                                      \
+        int *const _rowIdx, int *const _colIdx, const af::storage _storage,  \
+        const bool _copy);                                                   \
+    template SparseArray<T> createArrayDataSparseArray<T>(                   \
+        const af::dim4 &_dims, const Array<T> &_values,                      \
+        const Array<int> &_rowIdx, const Array<int> &_colIdx,                \
+        const af::storage _storage, const bool _copy);                       \
+    template SparseArray<T> *initSparseArray<T>();                           \
+    template SparseArray<T> copySparseArray<T>(const SparseArray<T> &other); \
+    template void destroySparseArray<T>(SparseArray<T> * sparse);            \
+                                                                             \
+    template SparseArray<T>::SparseArray(const af::dim4 &_dims, dim_t _nNZ,  \
+                                         af::storage _storage);              \
+    template SparseArray<T>::SparseArray(                                    \
+        const af::dim4 &_dims, dim_t _nNZ, T *const _values, /* NOLINT */    \
+        int *const _rowIdx, int *const _colIdx, const af::storage _storage,  \
+        bool _is_device, bool _copy_device);                                 \
+    template SparseArray<T>::SparseArray(                                    \
+        const af::dim4 &_dims, const Array<T> &_values,                      \
+        const Array<int> &_rowIdx, const Array<int> &_colIdx,                \
+        const af::storage _storage, bool _copy);                             \
+    template void checkAndMigrate(const SparseArray<T> &arr)
+
+// Instantiate only floating types
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+
+#undef INSTANTIATE
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/SparseArray.hpp b/src/backend/common/SparseArray.hpp
new file mode 100644
index 0000000000..046a92fbe7
--- /dev/null
+++ b/src/backend/common/SparseArray.hpp
@@ -0,0 +1,259 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/sparse_helpers.hpp>
+
+#include <cstddef>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+template<typename T>
+class SparseArray;
+
+/// SparseArray information class
+///
+/// This class holds the type-independent part of every SparseArray object. Its
+/// purpose is to provide a way to retrieve basic information about a sparse
+/// array without specifying the element type of the object at compile time.
+///
+/// NOTE: This is not a template class to allow the frontend to determine the
+/// af_array type at runtime
+class SparseArrayBase {
+   private:
+    ArrayInfo
+        info;  ///< NOTE: This must be the first element of SparseArray<T>.
+    af::storage stype;          ///< Storage format: CSR, CSC, COO
+    detail::Array<int> rowIdx;  ///< Linear array containing row indices
+    detail::Array<int> colIdx;  ///< Linear array containing col indices
+
+   public:
+    SparseArrayBase(SparseArrayBase &&other) noexcept = default;
+    SparseArrayBase(const af::dim4 &_dims, dim_t _nNZ, af::storage _storage,
+                    af_dtype _type);
+
+    SparseArrayBase(const af::dim4 &_dims, dim_t _nNZ, int *const _rowIdx,
+                    int *const _colIdx, const af::storage _storage,
+                    af_dtype _type, bool _is_device = false,
+                    bool _copy_device = false);
+
+    SparseArrayBase(const af::dim4 &_dims, const detail::Array<int> &_rowIdx,
+                    const detail::Array<int> &_colIdx,
+                    const af::storage _storage, af_dtype _type,
+                    bool _copy = false);
+
+    SparseArrayBase &operator=(SparseArrayBase other) noexcept {
+        std::swap(*this, other);
+        return *this;
+    }
+
+    /// A copy constructor for SparseArrayBase
+    ///
+    /// This constructor copies \p base and creates a new object from it.
+    /// It can also perform a deep copy if the second argument is true.
+    ///
+    /// \param[in] base       The array that will be copied
+    /// \param[in] deep_copy  If true a deep copy is performed
+    SparseArrayBase(const SparseArrayBase &base, bool deep_copy = false);
+
+    ~SparseArrayBase();
+
+    ////////////////////////////////////////////////////////////////////////////
+    // Functions that call ArrayInfo object's functions
+    ////////////////////////////////////////////////////////////////////////////
+#define INSTANTIATE_INFO(return_type, func) \
+    return_type func() const { return info.func(); }
+
+    INSTANTIATE_INFO(const af_dtype &, getType)
+    INSTANTIATE_INFO(size_t, elements)
+    INSTANTIATE_INFO(size_t, ndims)
+    INSTANTIATE_INFO(const af::dim4 &, dims)
+    INSTANTIATE_INFO(size_t, total)
+    INSTANTIATE_INFO(int, getDevId)
+    INSTANTIATE_INFO(af_backend, getBackendId)
+    INSTANTIATE_INFO(bool, isEmpty)
+    INSTANTIATE_INFO(bool, isScalar)
+    INSTANTIATE_INFO(bool, isRow)
+    INSTANTIATE_INFO(bool, isColumn)
+    INSTANTIATE_INFO(bool, isVector)
+    INSTANTIATE_INFO(bool, isComplex)
+    INSTANTIATE_INFO(bool, isReal)
+    INSTANTIATE_INFO(bool, isDouble)
+    INSTANTIATE_INFO(bool, isSingle)
+    INSTANTIATE_INFO(bool, isHalf)
+    INSTANTIATE_INFO(bool, isRealFloating)
+    INSTANTIATE_INFO(bool, isFloating)
+    INSTANTIATE_INFO(bool, isInteger)
+    INSTANTIATE_INFO(bool, isBool)
+    INSTANTIATE_INFO(bool, isLinear)
+    INSTANTIATE_INFO(bool, isSparse)
+
+#undef INSTANTIATE_INFO
+
+    // setId of info, rowIdx, colIdx
+    void setId(int id) {
+        info.setId(id);
+        rowIdx.setId(id);
+        colIdx.setId(id);
+    }
+
+    /// Returns the row indices for the corresponding values in the SparseArray
+    detail::Array<int> &getRowIdx() { return rowIdx; }
+    const detail::Array<int> &getRowIdx() const { return rowIdx; }
+
+    /// Returns the column indices for the corresponding values in the
+    /// SparseArray
+    detail::Array<int> &getColIdx() { return colIdx; }
+    const detail::Array<int> &getColIdx() const { return colIdx; }
+
+    /// Returns the number of non-zero elements in the array.
+    dim_t getNNZ() const;
+
+    /// Returns the storage format of the SparseArray
+    af::storage getStorage() const { return stype; }
+};
+static_assert(std::is_standard_layout<SparseArrayBase>::value,
+              "SparseArrayBase must be a standard layout type");
+
+////////////////////////////////////////////////////////////////////////////
+// Sparse Array Class
+////////////////////////////////////////////////////////////////////////////
+template<typename T>
+class SparseArray {
+   private:
+    SparseArrayBase
+        base;  ///< This must be the first element of SparseArray<T>.
+    detail::Array<T> values;  ///< Linear array containing actual values
+
+    SparseArray(const af::dim4 &_dims, dim_t _nNZ, af::storage _storage);
+
+    explicit SparseArray(const af::dim4 &_dims, dim_t _nNZ, T *const _values,
+                         int *const _rowIdx, int *const _colIdx,
+                         const af::storage _storage, bool _is_device = false,
+                         bool _copy_device = false);
+
+    SparseArray(const af::dim4 &_dims, const detail::Array<T> &_values,
+                const detail::Array<int> &_rowIdx,
+                const detail::Array<int> &_colIdx, const af::storage _storage,
+                bool _copy = false);
+
+    /// A copy constructor for SparseArray
+    ///
+    /// This constructor copies \p other and creates a new SparseArray object
+    /// from it. It can also perform a deep copy if the second argument is true.
+    ///
+    /// \param[in] other      The array that will be copied
+    /// \param[in] deep_copy  If true a deep copy is performed
+    SparseArray(const SparseArray<T> &other, bool deep_copy);
+
+   public:
+    SparseArray(const SparseArray<T> &other)     = default;
+    SparseArray(SparseArray<T> &&other) noexcept = default;
+
+    ~SparseArray() noexcept = default;
+
+    SparseArray<T> &operator=(SparseArray<T> other) noexcept {
+        std::swap(*this, other);
+        return *this;
+    }
+// Functions that call ArrayInfo object's functions
+#define INSTANTIATE_INFO(return_type, func) \
+    return_type func() const { return base.func(); }
+
+    INSTANTIATE_INFO(const af_dtype &, getType)
+    INSTANTIATE_INFO(size_t, elements)
+    INSTANTIATE_INFO(size_t, ndims)
+    INSTANTIATE_INFO(const af::dim4 &, dims)
+    INSTANTIATE_INFO(size_t, total)
+    INSTANTIATE_INFO(int, getDevId)
+    INSTANTIATE_INFO(af_backend, getBackendId)
+    INSTANTIATE_INFO(bool, isEmpty)
+    INSTANTIATE_INFO(bool, isScalar)
+    INSTANTIATE_INFO(bool, isRow)
+    INSTANTIATE_INFO(bool, isColumn)
+    INSTANTIATE_INFO(bool, isVector)
+    INSTANTIATE_INFO(bool, isComplex)
+    INSTANTIATE_INFO(bool, isReal)
+    INSTANTIATE_INFO(bool, isDouble)
+    INSTANTIATE_INFO(bool, isSingle)
+    INSTANTIATE_INFO(bool, isHalf)
+    INSTANTIATE_INFO(bool, isRealFloating)
+    INSTANTIATE_INFO(bool, isFloating)
+    INSTANTIATE_INFO(bool, isInteger)
+    INSTANTIATE_INFO(bool, isBool)
+    INSTANTIATE_INFO(bool, isLinear)
+    INSTANTIATE_INFO(bool, isSparse)
+
+    // Functions from SparseArrayBase that are not in ArrayInfo
+    INSTANTIATE_INFO(dim_t, getNNZ)
+    INSTANTIATE_INFO(af::storage, getStorage)
+
+    detail::Array<int> &getRowIdx() { return base.getRowIdx(); }
+    detail::Array<int> &getColIdx() { return base.getColIdx(); }
+    const detail::Array<int> &getRowIdx() const { return base.getRowIdx(); }
+    const detail::Array<int> &getColIdx() const { return base.getColIdx(); }
+
+#undef INSTANTIATE_INFO
+
+    void setId(int id) {
+        base.setId(id);
+        values.setId(id);
+    }
+
+    // Return the values array
+    detail::Array<T> &getValues() { return values; }
+    const detail::Array<T> &getValues() const { return values; }
+
+    void eval() const {
+        getValues().eval();
+        getRowIdx().eval();
+        getColIdx().eval();
+    }
+
+    // Friend functions for Sparse Array Creation
+    friend SparseArray<T> createEmptySparseArray<T>(const af::dim4 &_dims,
+                                                    dim_t _nNZ,
+                                                    const af::storage _storage);
+
+    friend SparseArray<T> createHostDataSparseArray<T>(
+        const af::dim4 &_dims, const dim_t nNZ, const T *const _values,
+        const int *const _rowIdx, const int *const _colIdx,
+        const af::storage _storage);
+
+    friend SparseArray<T> createDeviceDataSparseArray<T>(
+        const af::dim4 &_dims, const dim_t nNZ, T *const _values,
+        int *const _rowIdx, int *const _colIdx, const af::storage _storage,
+        const bool _copy);
+
+    friend SparseArray<T> createArrayDataSparseArray<T>(
+        const af::dim4 &_dims, const detail::Array<T> &_values,
+        const detail::Array<int> &_rowIdx, const detail::Array<int> &_colIdx,
+        const af::storage _storage, const bool _copy);
+
+    friend SparseArray<T> *initSparseArray<T>();
+
+    friend SparseArray<T> copySparseArray<T>(const SparseArray<T> &input);
+
+    friend void destroySparseArray<T>(SparseArray<T> *sparse);
+};
+
+/// Checks if the SparseArray object can be migrated to the current device and
+/// if not, an error is thrown
+///
+/// \param[in] arr The SparseArray that will be checked.
+template<typename T>
+void checkAndMigrate(const SparseArray<T> &arr);
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/TemplateArg.hpp b/src/backend/common/TemplateArg.hpp
new file mode 100644
index 0000000000..238c912de2
--- /dev/null
+++ b/src/backend/common/TemplateArg.hpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/util.hpp>
+
+#include <array>
+#include <string>
+#include <utility>
+
+template<typename T>
+struct TemplateTypename;
+
+struct TemplateArg {
+    std::string _tparam;
+
+    TemplateArg(std::string str) : _tparam(std::move(str)) {}
+
+    template<typename T>
+    constexpr TemplateArg(TemplateTypename<T> arg) noexcept : _tparam(arg) {}
+
+    template<typename T>
+    constexpr TemplateArg(T value) noexcept
+        : _tparam(arrayfire::common::toString(value)) {}
+};
+
+template<typename... Targs>
+std::array<TemplateArg, sizeof...(Targs)> TemplateArgs(Targs &&...args) {
+    return std::array<TemplateArg, sizeof...(Targs)>{
+        std::forward<Targs>(args)...};
+}
+
+#define DefineKey(arg) " -D " #arg
+#define DefineValue(arg) " -D " #arg "=" + arrayfire::common::toString(arg)
+#define DefineKeyValue(key, arg) \
+    " -D " #key "=" + arrayfire::common::toString(arg)
+#define DefineKeyFromStr(arg) " -D " + std::string(arg)
diff --git a/src/backend/common/TemplateTypename.hpp b/src/backend/common/TemplateTypename.hpp
new file mode 100644
index 0000000000..96dfb3c6fe
--- /dev/null
+++ b/src/backend/common/TemplateTypename.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/TemplateArg.hpp>
+#include <traits.hpp>
+
+#include <string>
+
+template<typename T>
+struct TemplateTypename {
+    operator TemplateArg() const noexcept {
+        return {std::string(af::dtype_traits<T>::getName())};
+    }
+    operator std::string() const noexcept {
+        return {std::string(af::dtype_traits<T>::getName())};
+    }
+};
+
+#define SPECIALIZE(TYPE, NAME)                                  \
+    template<>                                                  \
+    struct TemplateTypename<TYPE> {                             \
+        operator TemplateArg() const noexcept {                 \
+            return TemplateArg(std::string(#NAME));             \
+        }                                                       \
+        operator std::string() const noexcept { return #NAME; } \
+    }
+
+SPECIALIZE(signed char, detail::schar);
+SPECIALIZE(unsigned char, detail::uchar);
+SPECIALIZE(unsigned int, detail::uint);
+SPECIALIZE(unsigned short, detail::ushort);
+SPECIALIZE(long long, long long);
+SPECIALIZE(unsigned long long, unsigned long long);
+
+#undef SPECIALIZE
diff --git a/src/backend/common/Transform.hpp b/src/backend/common/Transform.hpp
new file mode 100644
index 0000000000..3d56cf0209
--- /dev/null
+++ b/src/backend/common/Transform.hpp
@@ -0,0 +1,65 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <math.hpp>
+#include <types.hpp>
+
+#ifndef __DH__
+#define __DH__
+#endif
+
+#include "optypes.hpp"
+
+namespace arrayfire {
+namespace common {
+
+using namespace detail;  // NOLINT
+
+// Because isnan(cfloat) and isnan(cdouble) are not defined
+#define IS_NAN(val) (!((val) == (val)))
+
+template<typename Ti, typename To, af_op_t op>
+struct Transform {
+    __DH__ To operator()(Ti in) { return static_cast<To>(in); }
+};
+
+template<typename Ti, typename To>
+struct Transform<Ti, To, af_min_t> {
+    __DH__ To operator()(Ti in) {
+        return IS_NAN(in) ? Binary<To, af_min_t>::init() : To(in);
+    }
+};
+
+template<typename Ti, typename To>
+struct Transform<Ti, To, af_max_t> {
+    __DH__ To operator()(Ti in) {
+        return IS_NAN(in) ? Binary<To, af_max_t>::init() : To(in);
+    }
+};
+
+template<typename Ti, typename To>
+struct Transform<Ti, To, af_or_t> {
+    __DH__ To operator()(Ti in) { return (in != scalar<Ti>(0.)); }
+};
+
+template<typename Ti, typename To>
+struct Transform<Ti, To, af_and_t> {
+    __DH__ To operator()(Ti in) { return (in != scalar<Ti>(0.)); }
+};
+
+template<typename Ti, typename To>
+struct Transform<Ti, To, af_notzero_t> {
+    __DH__ To operator()(Ti in) { return (in != scalar<Ti>(0.)); }
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/Version.hpp b/src/backend/common/Version.hpp
new file mode 100644
index 0000000000..55a6e79efb
--- /dev/null
+++ b/src/backend/common/Version.hpp
@@ -0,0 +1,81 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <string>
+
+// Some compilers define major and minor as macros in their headers, which
+// causes errors in the Version class below
+#ifdef major
+#undef major
+#endif
+#ifdef minor
+#undef minor
+#endif
+
+namespace arrayfire {
+namespace common {
+class Version {
+    int major_ = -1;
+    int minor_ = -1;
+    int patch_ = -1;
+
+   public:
+    /// Checks if the major version is defined before minor and minor is defined
+    /// before patch
+    constexpr static bool validate(int major_, int minor_,
+                                   int patch_) noexcept {
+        return !(major_ < 0 && (minor_ >= 0 || patch_ >= 0)) &&
+               !(minor_ < 0 && patch_ >= 0);
+    }
+
+    constexpr int major() const { return major_; }
+    constexpr int minor() const { return minor_; }
+    constexpr int patch() const { return patch_; }
+
+    constexpr Version(const int ver_major, const int ver_minor = -1,
+                      const int ver_patch = -1) noexcept
+        : major_(ver_major), minor_(ver_minor), patch_(ver_patch) {}
+};
+
+constexpr bool operator==(const Version& lhs, const Version& rhs) {
+    return lhs.major() == rhs.major() && lhs.minor() == rhs.minor() &&
+           lhs.patch() == rhs.patch();
+}
+
+constexpr bool operator!=(const Version& lhs, const Version& rhs) {
+    return !(lhs == rhs);
+}
+
+constexpr static Version NullVersion{-1, -1, -1};
+
+constexpr bool operator<(const Version& lhs, const Version& rhs) {
+    if (lhs == NullVersion || rhs == NullVersion) return false;
+    if (lhs.major() != -1 && rhs.major() != -1 && lhs.major() != rhs.major())
+        return lhs.major() < rhs.major();
+    if (lhs.minor() != -1 && rhs.minor() != -1 && lhs.minor() != rhs.minor())
+        return lhs.minor() < rhs.minor();
+    if (lhs.patch() != -1 && rhs.patch() != -1 && lhs.patch() < rhs.patch())
+        return true;
+    return false;
+}
+
+inline Version fromCudaVersion(size_t version_int) {
+    return {static_cast<int>(version_int / 1000),
+            static_cast<int>(version_int % 1000) / 10,
+            static_cast<int>(version_int % 10)};
+}
+
+inline std::string int_version_to_string(int version) {
+    return std::to_string(version / 1000) + "." +
+           std::to_string(static_cast<int>((version % 1000) / 10.));
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/blas_headers.hpp b/src/backend/common/blas_headers.hpp
new file mode 100644
index 0000000000..f76dfa0240
--- /dev/null
+++ b/src/backend/common/blas_headers.hpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#ifdef USE_MKL
+#include <mkl_cblas.h>
+#else
+#ifdef __APPLE__
+#include <Accelerate/Accelerate.h>
+#else
+extern "C" {
+#include <cblas.h>
+}
+#endif
+#endif
+
+// TODO: Ask upstream for a more official way to detect it
+#ifdef OPENBLAS_CONST
+#define IS_OPENBLAS
+#endif
+
+// Make sure we get the correct type signature for OpenBLAS
+// OpenBLAS defines blasint as its index type. Emulate this
+// if we're not dealing with openblas and use it where applicable
+#ifdef IS_OPENBLAS
+// blasint already defined
+static const bool cplx_void_ptr = false;
+#else
+using blasint                   = int;
+static const bool cplx_void_ptr = true;
+#endif
diff --git a/src/backend/common/cast.cpp b/src/backend/common/cast.cpp
new file mode 100644
index 0000000000..bcb2dfb519
--- /dev/null
+++ b/src/backend/common/cast.cpp
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/cast.hpp>
+#include <handle.hpp>
+
+using arrayfire::common::half;
+using detail::cdouble;
+using detail::cfloat;
+using detail::intl;
+using detail::schar;
+using detail::uchar;
+using detail::uint;
+using detail::uintl;
+using detail::ushort;
+
+namespace arrayfire {
+namespace common {
+
+template<typename To>
+detail::Array<To> castArray(const af_array &in) {
+    const ArrayInfo &info = getInfo(in);
+
+    if (static_cast<af::dtype>(af::dtype_traits<To>::af_type) ==
+        info.getType()) {
+        return getArray<To>(in);
+    }
+
+    switch (info.getType()) {
+        case f32: return common::cast<To, float>(getArray<float>(in));
+        case f64: return common::cast<To, double>(getArray<double>(in));
+        case c32: return common::cast<To, cfloat>(getArray<cfloat>(in));
+        case c64: return common::cast<To, cdouble>(getArray<cdouble>(in));
+        case s32: return common::cast<To, int>(getArray<int>(in));
+        case u32: return common::cast<To, uint>(getArray<uint>(in));
+        case s8: return common::cast<To, schar>(getArray<schar>(in));
+        case u8: return common::cast<To, uchar>(getArray<uchar>(in));
+        case b8: return common::cast<To, char>(getArray<char>(in));
+        case s64: return common::cast<To, intl>(getArray<intl>(in));
+        case u64: return common::cast<To, uintl>(getArray<uintl>(in));
+        case s16: return common::cast<To, short>(getArray<short>(in));
+        case u16: return common::cast<To, ushort>(getArray<ushort>(in));
+        case f16:
+            return common::cast<To, common::half>(getArray<common::half>(in));
+        default: TYPE_ERROR(1, info.getType());
+    }
+}
+
+template detail::Array<float> castArray(const af_array &in);
+template detail::Array<double> castArray(const af_array &in);
+template detail::Array<cfloat> castArray(const af_array &in);
+template detail::Array<cdouble> castArray(const af_array &in);
+template detail::Array<int> castArray(const af_array &in);
+template detail::Array<uint> castArray(const af_array &in);
+template detail::Array<schar> castArray(const af_array &in);
+template detail::Array<uchar> castArray(const af_array &in);
+template detail::Array<char> castArray(const af_array &in);
+template detail::Array<intl> castArray(const af_array &in);
+template detail::Array<uintl> castArray(const af_array &in);
+template detail::Array<short> castArray(const af_array &in);
+template detail::Array<ushort> castArray(const af_array &in);
+template detail::Array<half> castArray(const af_array &in);
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/cast.hpp b/src/backend/common/cast.hpp
new file mode 100644
index 0000000000..c60614a8a9
--- /dev/null
+++ b/src/backend/common/cast.hpp
@@ -0,0 +1,187 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <cast.hpp>
+#include <common/Logger.hpp>
+#include <memory>
+
+#ifdef AF_CPU
+#include <jit/UnaryNode.hpp>
+#endif
+
+namespace arrayfire {
+namespace common {
+/// This function determines if consecutive cast operations should be
+/// removed from a JIT AST.
+///
+/// This function returns true if consecutive cast operations in the JIT AST
+/// should be removed. Multiple cast operations are removed when going from
+/// a smaller type to a larger type and back again OR if the conversion is
+/// between two floating point types including complex types.
+///
+///                  Cast operations that will be removed
+///                        outer -> inner -> outer
+///
+///                                inner cast
+///           f32  f64  c32  c64  s32  u32   s8   u8   b8  s64  u64  s16  u16  f16
+///     f32    x    x    x    x                                                 x
+///     f64    x    x    x    x                                                 x
+///  o  c32    x    x    x    x                                                 x
+///  u  c64    x    x    x    x                                                 x
+///  t  s32    x    x    x    x    x    x                   x    x              x
+///  e  u32    x    x    x    x    x    x                   x    x              x
+///  r   s8    x    x    x    x    x    x    x    x    x    x    x    x    x    x
+///      u8    x    x    x    x    x    x    x    x    x    x    x    x    x    x
+///  c   b8    x    x    x    x    x    x    x    x    x    x    x    x    x    x
+///  a  s64    x    x    x    x                             x    x              x
+///  s  u64    x    x    x    x                             x    x              x
+///  t  s16    x    x    x    x    x    x                   x    x    x    x    x
+///     u16    x    x    x    x    x    x                   x    x    x    x    x
+///     f16    x    x    x    x                                                 x
+///
+/// \param[in] outer The target type of the second cast, which is also the
+///            type of the child (input) of the first cast
+/// \param[in] inner The target type of the first cast
+///
+/// \returns True if the inner cast operation should be removed
+constexpr bool canOptimizeCast(af::dtype outer, af::dtype inner) {
+    if (isFloating(outer)) {
+        if (isFloating(inner)) { return true; }
+    } else {
+        if (isFloating(inner)) { return true; }
+        if (dtypeSize(inner) >= dtypeSize(outer)) { return true; }
+    }
+
+    return false;
+}
+
+#ifdef AF_CPU
+template<typename To, typename Ti>
+struct CastWrapper {
+    static spdlog::logger *getLogger() noexcept {
+        static std::shared_ptr<spdlog::logger> logger =
+            common::loggerFactory("ast");
+        return logger.get();
+    }
+
+    detail::Array<To> operator()(const detail::Array<Ti> &in) {
+        using detail::jit::UnaryNode;
+
+        common::Node_ptr in_node = in.getNode();
+        constexpr af::dtype to_dtype =
+            static_cast<af::dtype>(af::dtype_traits<To>::af_type);
+        constexpr af::dtype in_dtype =
+            static_cast<af::dtype>(af::dtype_traits<Ti>::af_type);
+
+        if (canOptimizeCast(to_dtype, in_dtype)) {
+            // JIT optimization in the case of multiple sequential casts that
+            // become idempotent - check to see if the previous operation was
+            // also a cast
+            // TODO: handle arbitrarily long chains of casts
+            auto in_node_unary =
+                std::dynamic_pointer_cast<UnaryNode<To, Ti, af_cast_t>>(
+                    in_node);
+
+            if (in_node_unary && in_node_unary->getOp() == af_cast_t) {
+                // the child's child's output type is the child's input type
+                AF_TRACE("Cast optimization performed by removing cast to {}",
+                         af::dtype_traits<Ti>::getName());
+                auto in_child_node = in_node_unary->getChildren()[0];
+                if (in_child_node->getType() == to_dtype) {
+                    // ignore the input node and simply connect a noop node from
+                    // the child's child to produce this op's output
+                    return detail::createNodeArray<To>(in.dims(),
+                                                       in_child_node);
+                }
+            }
+        }
+
+        auto node = std::make_shared<UnaryNode<To, Ti, af_cast_t>>(in_node);
+
+        return detail::createNodeArray<To>(in.dims(), std::move(node));
+    }
+};
+#else
+
+template<typename To, typename Ti>
+struct CastWrapper {
+    static spdlog::logger *getLogger() noexcept {
+        static std::shared_ptr<spdlog::logger> logger =
+            common::loggerFactory("ast");
+        return logger.get();
+    }
+
+    detail::Array<To> operator()(const detail::Array<Ti> &in) {
+        using arrayfire::common::UnaryNode;
+        detail::CastOp<To, Ti> cop;
+        common::Node_ptr in_node = in.getNode();
+        constexpr af::dtype to_dtype =
+            static_cast<af::dtype>(af::dtype_traits<To>::af_type);
+        constexpr af::dtype in_dtype =
+            static_cast<af::dtype>(af::dtype_traits<Ti>::af_type);
+
+        if (canOptimizeCast(to_dtype, in_dtype)) {
+            // JIT optimization in the case of multiple sequential casts that
+            // become idempotent - check to see if the previous operation was
+            // also a cast
+            // TODO: handle arbitrarily long chains of casts
+            auto in_node_unary =
+                std::dynamic_pointer_cast<common::UnaryNode>(in_node);
+
+            if (in_node_unary && in_node_unary->getOp() == af_cast_t) {
+                // child's child's output type is the input type of the child
+                AF_TRACE("Cast optimization performed by removing cast to {}",
+                         af::dtype_traits<Ti>::getName());
+                auto in_child_node = in_node_unary->getChildren()[0];
+                if (in_child_node->getType() == to_dtype) {
+                    // ignore the input node and simply connect a noop node from
+                    // the child's child to produce this op's output
+                    return detail::createNodeArray<To>(in.dims(),
+                                                       in_child_node);
+                }
+            }
+        }
+
+        common::UnaryNode *node =
+            new common::UnaryNode(to_dtype, cop.name(), in_node, af_cast_t);
+        return detail::createNodeArray<To>(in.dims(), common::Node_ptr(node));
+    }
+};
+
+#endif
+
+template<typename T>
+struct CastWrapper<T, T> {
+    detail::Array<T> operator()(const detail::Array<T> &in);
+};
+
+template<typename To, typename Ti>
+auto cast(detail::Array<Ti> &&in)
+    -> std::enable_if_t<std::is_same<Ti, To>::value, detail::Array<To>> {
+    return std::move(in);
+}
+
+template<typename To, typename Ti>
+auto cast(const detail::Array<Ti> &in)
+    -> std::enable_if_t<std::is_same<Ti, To>::value, detail::Array<To>> {
+    return in;
+}
+
+template<typename To, typename Ti>
+auto cast(const detail::Array<Ti> &in)
+    -> std::enable_if_t<std::is_same<Ti, To>::value == false,
+                        detail::Array<To>> {
+    CastWrapper<To, Ti> cast_op;
+    return cast_op(in);
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/cblas.cpp b/src/backend/common/cblas.cpp
new file mode 100644
index 0000000000..09f255cb94
--- /dev/null
+++ b/src/backend/common/cblas.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifdef USE_F77_BLAS
+#include <common/blas_headers.hpp>
+
+#define ADD_
+#include <cblas.h>
+#include <cblas_f77.h>
+
+static char transChar(CBLAS_TRANSPOSE Trans) {
+    switch (Trans) {
+        case CblasNoTrans: return 'N';
+        case CblasTrans: return 'T';
+        case CblasConjTrans: return 'C';
+        default: return '\0';
+    }
+}
+
+#define GEMM_F77(X, TS, TV, TY)                                                \
+    void cblas_##X##gemm(                                                      \
+        const CBLAS_ORDER Order, const CBLAS_TRANSPOSE TransA,                 \
+        const CBLAS_TRANSPOSE TransB, const int M, const int N, const int K,   \
+        const TS alpha, const TV *A, const int lda, const TV *B,               \
+        const int ldb, const TS beta, TV *C, const int ldc) {                  \
+        char aT = transChar(TransA);                                           \
+        char bT = transChar(TransB);                                           \
+        X##gemm_(&aT, &bT, &M, &N, &K, (const TY *)ADDR(alpha), (const TY *)A, \
+                 &lda, (const TY *)B, &ldb, (const TY *)ADDR(beta), (TY *)C,   \
+                 &ldc);                                                        \
+    }                                                                          \
+    void cblas_##X##gemv(                                                      \
+        const CBLAS_ORDER order, const CBLAS_TRANSPOSE TransA, const int M,    \
+        const int N, const TS alpha, const TV *A, const int lda, const TV *X,  \
+        const int incX, const TS beta, TV *Y, const int incY) {                \
+        char aT = transChar(TransA);                                           \
+        X##gemv_(&aT, &M, &N, (const TY *)ADDR(alpha), (const TY *)A, &lda,    \
+                 (const TY *)X, &incX, (const TY *)ADDR(beta), (TY *)Y,        \
+                 &incY);                                                       \
+    }                                                                          \
+    void cblas_##X##axpy(const int N, const TS alpha, const TV *X,             \
+                         const int incX, TV *Y, const int incY) {              \
+        X##axpy_(&N, (const TY *)ADDR(alpha), (const TY *)X, &incX, (TY *)Y,   \
+                 &incY);                                                       \
+    }                                                                          \
+    void cblas_##X##scal(const int N, const TS alpha, TV *X, const int incX) { \
+        X##scal_(&N, (const TY *)ADDR(alpha), (TY *)X, &incX);                 \
+    }
+
+#define ADDR(val) &val
+GEMM_F77(s, float, float, float)
+GEMM_F77(d, double, double, double)
+#undef ADDR
+
+#define ADDR(val) val
+GEMM_F77(c, void *, void, float)
+GEMM_F77(z, void *, void, double)
+#undef ADDR
+
+#else
+#include <blas.hpp>
+#endif
diff --git a/src/backend/common/compile_module.hpp b/src/backend/common/compile_module.hpp
new file mode 100644
index 0000000000..2f12f6386b
--- /dev/null
+++ b/src/backend/common/compile_module.hpp
@@ -0,0 +1,69 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#if !defined(AF_CPU)
+
+#include <Module.hpp>
+#include <backend.hpp>
+
+#include <nonstd/span.hpp>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+/// \brief Backend specific source compilation implementation
+///
+/// This function has to be implemented separately in each backend
+///
+/// \p kInstances can take one of the following two forms depending on backend.
+/// - CUDA
+///     - A template instantiation style string like transpose<float, true, 1>
+///     - The \p kInstances is of size one in almost all cases. These strings
+///       are used to generate template instantiations of CUDA kernels while
+///       compiling the \p sources.
+/// - OpenCL
+///     - The \p kInstances parameter is not used.
+///
+/// \param[in] moduleKey is the hash of code+options+instantiations. This is
+///            provided by the caller to avoid recomputation.
+/// \param[in] sources is the list of source code to compile
+/// \param[in] options is the list of preprocessor definitions to be passed
+///            to the backend compilation function
+/// \param[in] kInstances is the list of kernel names in the \p sources
+/// \param[in] isJIT identifies whether the module being compiled is not a
+///            hand-written kernel
+///
+/// \returns Backend specific binary module that contains associated kernel
+detail::Module compileModule(const std::string& moduleKey,
+                             nonstd::span<const std::string> sources,
+                             nonstd::span<const std::string> options,
+                             nonstd::span<const std::string> kInstances,
+                             const bool isJIT);
+
+/// \brief Load module binary from disk cache
+///
+/// Note that this is for internal use by functions that get called from
+/// compileModule. The reason it is exposed here is that its implementation
+/// is partly dependent on backend specifics like program binary loading etc.
+/// Exposing this enables each backend to implement its specifics.
+///
+/// \param[in] device is the device index
+/// \param[in] moduleKey is hash of code+options+instantiations
+detail::Module loadModuleFromDisk(const int device,
+                                  const std::string& moduleKey,
+                                  const bool isJIT);
+
+}  // namespace common
+}  // namespace arrayfire
+
+#endif
diff --git a/src/backend/common/complex.hpp b/src/backend/common/complex.hpp
new file mode 100644
index 0000000000..e6c5bb79ce
--- /dev/null
+++ b/src/backend/common/complex.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <backend.hpp>
+#include <types.hpp>
+
+#include <type_traits>
+
+namespace arrayfire {
+namespace common {
+
+// The value is true if the type is a complex type, false otherwise
+template<typename T>
+struct is_complex {
+    static const bool value = false;
+};
+template<>
+struct is_complex<detail::cfloat> {
+    static const bool value = true;
+};
+template<>
+struct is_complex<detail::cdouble> {
+    static const bool value = true;
+};
+
+/// This is an enable_if for complex types.
+template<typename T, typename TYPE = void>
+using if_complex = typename std::enable_if<is_complex<T>::value, TYPE>::type;
+
+/// This is an enable_if for real types.
+template<typename T, typename TYPE = void>
+using if_real =
+    typename std::enable_if<is_complex<T>::value == false, TYPE>::type;
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/constants.cpp b/src/backend/common/constants.cpp
new file mode 100644
index 0000000000..086cf14dd2
--- /dev/null
+++ b/src/backend/common/constants.cpp
@@ -0,0 +1,16 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <af/constants.h>
+#include <limits>
+
+namespace af {
+const double NaN = std::numeric_limits<double>::quiet_NaN();
+const double Inf = std::numeric_limits<double>::infinity();
+const double Pi  = 3.1415926535897932384626433832795028841971693993751;
+}  // namespace af
diff --git a/src/backend/common/debug.hpp b/src/backend/common/debug.hpp
new file mode 100644
index 0000000000..54e74a2953
--- /dev/null
+++ b/src/backend/common/debug.hpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <boost/stacktrace.hpp>
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/jit/NodeIO.hpp>
+#include <spdlog/fmt/bundled/format.h>
+#include <iostream>
+
+#define DBGTRACE(msg)                                              \
+    fmt::print(std::cout, __FILE__ ":{}:{}\n{}\n", __LINE__, #msg, \
+               boost::stacktrace::stacktrace())
+
+namespace debugging {
+
+template<typename first>
+void print(const char *F, const first &FF) {
+    fmt::print(std::cout, "{} = {}", F, FF);
+}
+
+template<typename first, typename... ARGS>
+void print(const char *F, const first &FF, ARGS... args) {
+    fmt::print(std::cout, "{} = {} | ", F, FF);
+    print(args...);
+}
+}  // namespace debugging
+
+#define SHOW1(val1) debugging::print(#val1, val1)
+#define SHOW2(val1, val2) debugging::print(#val1, val1, #val2, val2)
+#define SHOW3(val1, val2, val3) \
+    debugging::print(#val1, val1, #val2, val2, #val3, val3)
+
+#define SHOW4(val1, val2, val3, val4) \
+    debugging::print(#val1, val1, #val2, val2, #val3, val3, #val4, val4)
+#define SHOW5(val1, val2, val3, val4, val5)                              \
+    debugging::print(#val1, val1, #val2, val2, #val3, val3, #val4, val4, \
+                     #val5, val5)
+#define SHOW6(val1, val2, val3, val4, val5, val6)                        \
+    debugging::print(#val1, val1, #val2, val2, #val3, val3, #val4, val4, \
+                     #val5, val5, #val6, val6)
+
+#define GET_MACRO(_1, _2, _3, _4, _5, _6, NAME, ...) NAME
+
+#define SHOW(...)                                                        \
+    do {                                                                 \
+        fmt::print(std::cout, "{}:({}): ", __FILE__, __LINE__);          \
+        GET_MACRO(__VA_ARGS__, SHOW6, SHOW5, SHOW4, SHOW3, SHOW2, SHOW1) \
+        (__VA_ARGS__);                                                   \
+        fmt::print(std::cout, "\n");                                     \
+    } while (0)
+
+#define PRINTVEC(val)                                                        \
+    do {                                                                     \
+        fmt::print(std::cout, "{}:({}):{} [{}]\n", __FILE__, __LINE__, #val, \
+                   fmt::join(val, ", "));                                    \
+    } while (0)
diff --git a/src/backend/common/defines.hpp b/src/backend/common/defines.hpp
new file mode 100644
index 0000000000..5c7eadc6ce
--- /dev/null
+++ b/src/backend/common/defines.hpp
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/internal_enums.hpp>
+
+#include <mutex>
+#include <string>
+
+inline std::string clipFilePath(std::string path, std::string str) {
+    try {
+        std::string::size_type pos = path.rfind(str);
+        if (pos == std::string::npos) {
+            return path;
+        } else {
+            return path.substr(pos);
+        }
+    } catch (...) { return path; }
+}
+
+#define UNUSED(expr) \
+    do { (void)(expr); } while (0)
+
+#if defined(_WIN32) || defined(_MSC_VER)
+#define __PRETTY_FUNCTION__ __FUNCSIG__
+#if _MSC_VER < 1900
+#define snprintf sprintf_s
+#endif
+#define __AF_FILENAME__ (clipFilePath(__FILE__, "src\\").c_str())
+#else
+#define __AF_FILENAME__ (clipFilePath(__FILE__, "src/").c_str())
+#endif
+
+#if defined(NDEBUG)
+#define __AF_FUNC__ __FUNCTION__
+#else
+// Debug
+#define __AF_FUNC__ __PRETTY_FUNCTION__
+#endif
+
+#ifdef OS_WIN
+#include <Windows.h>
+using LibHandle = HMODULE;
+#define AF_PATH_SEPARATOR "\\"
+#elif defined(OS_MAC)
+using LibHandle = void*;
+#define AF_PATH_SEPARATOR "/"
+#elif defined(OS_LNX)
+using LibHandle = void*;
+#define AF_PATH_SEPARATOR "/"
+#else
+#error "Unsupported platform"
+#endif
+
+#ifndef AF_MEM_DEBUG
+#define AF_MEM_DEBUG 0
+#endif
+
+namespace arrayfire {
+namespace common {
+using mutex_t      = std::mutex;
+using lock_guard_t = std::lock_guard<mutex_t>;
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/deprecated.hpp b/src/backend/common/deprecated.hpp
new file mode 100644
index 0000000000..4a7aca99a5
--- /dev/null
+++ b/src/backend/common/deprecated.hpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <af/compilers.h>
+
+// clang-format off
+#if AF_COMPILER_IS_MSVC
+#define AF_DEPRECATED_WARNINGS_OFF  \
+    __pragma(warning(push))         \
+    __pragma(warning(disable:4996))
+
+#define AF_DEPRECATED_WARNINGS_ON \
+    __pragma(warning(pop))
+#else
+#define AF_DEPRECATED_WARNINGS_OFF                                  \
+  _Pragma("GCC diagnostic push")                                 \
+  _Pragma("GCC diagnostic ignored \"-Wdeprecated-declarations\"")
+
+#define AF_DEPRECATED_WARNINGS_ON                                   \
+  _Pragma("GCC diagnostic pop")
+#endif
+// clang-format on
diff --git a/src/backend/common/deterministicHash.cpp b/src/backend/common/deterministicHash.cpp
new file mode 100644
index 0000000000..2280d4cbbb
--- /dev/null
+++ b/src/backend/common/deterministicHash.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/deterministicHash.hpp>
+
+#include <nonstd/span.hpp>
+#include <numeric>
+#include <string>
+
+using nonstd::span;
+using std::accumulate;
+using std::string;
+
+size_t deterministicHash(const void* data, size_t byteSize, size_t prevHash) {
+    // Fowler-Noll-Vo "1a" 32 bit hash
+    // https://en.wikipedia.org/wiki/Fowler-Noll-Vo_hash_function
+    const auto* byteData = static_cast<const uint8_t*>(data);
+    return accumulate(
+        byteData, byteData + byteSize, prevHash,
+        [&](size_t hash, uint8_t data) { return (hash ^ data) * FNV1A_PRIME; });
+}
+
+size_t deterministicHash(const string& data, const size_t prevHash) {
+    return deterministicHash(data.data(), data.size(), prevHash);
+}
+
+size_t deterministicHash(span<const string> list, const size_t prevHash) {
+    size_t hash = prevHash;
+    for (auto s : list) { hash = deterministicHash(s.data(), s.size(), hash); }
+    return hash;
+}
+
+size_t deterministicHash(span<const arrayfire::common::Source> list) {
+    // Combine the different source codes, via their hashes
+    size_t hash = FNV1A_BASE_OFFSET;
+    for (auto s : list) {
+        size_t h = s.hash ? s.hash : deterministicHash(s.ptr, s.length);
+        hash     = deterministicHash(&h, sizeof(size_t), hash);
+    }
+    return hash;
+}
diff --git a/src/backend/common/deterministicHash.hpp b/src/backend/common/deterministicHash.hpp
new file mode 100644
index 0000000000..fa950bc2a5
--- /dev/null
+++ b/src/backend/common/deterministicHash.hpp
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <nonstd/span.hpp>
+#include <string>
+
+#include <common/Source.hpp>
+
+/// Return the FNV-1a hash of the provided data.
+///
+/// \param[in] data Binary data to hash
+/// \param[in] byteSize Size of the data in bytes
+/// \param[in] prevHash Optional hash of previous parts when string is split
+///
+/// \returns An unsigned integer representing the hash of the data
+constexpr std::size_t FNV1A_BASE_OFFSET = 0x811C9DC5;
+constexpr std::size_t FNV1A_PRIME       = 0x01000193;
+std::size_t deterministicHash(const void* data, std::size_t byteSize,
+                              const std::size_t prevHash = FNV1A_BASE_OFFSET);
+
+// This is just a wrapper around the above function.
+std::size_t deterministicHash(const std::string& data,
+                              const std::size_t prevHash = FNV1A_BASE_OFFSET);
+
+// This concatenates strings in the vector and computes hash
+std::size_t deterministicHash(nonstd::span<const std::string> list,
+                              const std::size_t prevHash = FNV1A_BASE_OFFSET);
+
+// This concatenates hashes of multiple sources
+std::size_t deterministicHash(
+    nonstd::span<const arrayfire::common::Source> list);
diff --git a/src/backend/common/dim4.cpp b/src/backend/common/dim4.cpp
new file mode 100644
index 0000000000..96d8bc8447
--- /dev/null
+++ b/src/backend/common/dim4.cpp
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <af/dim4.hpp>
+#include <cfloat>
+#include <cmath>
+#include <limits>
+#include <numeric>
+
+namespace af {
+
+#if AF_COMPILER_CXX_STATIC_ASSERT
+static_assert(std::is_standard_layout<dim4>::value,
+              "af::dim4 must be a standard layout type");
+#endif
+
+using std::abs;
+using std::numeric_limits;
+
+dim4::dim4() : dims{0, 0, 0, 0} {}
+
+dim4::dim4(dim_t first, dim_t second, dim_t third, dim_t fourth)
+    : dims{first, second, third, fourth} {}
+
+dim4::dim4(const dim4& other)
+    : dims{other.dims[0], other.dims[1], other.dims[2], other.dims[3]} {}
+
+dim4::dim4(const unsigned ndims_, const dim_t* const dims_) : dims{} {
+    for (unsigned i = 0; i < 4; i++) { dims[i] = ndims_ > i ? dims_[i] : 1; }
+}
+
+dim4& dim4::operator=(dim4 other) noexcept {
+    std::swap(dims, other.dims);
+    return *this;
+}
+
+dim_t dim4::elements() const { return dims[0] * dims[1] * dims[2] * dims[3]; }
+
+dim_t dim4::elements() { return static_cast<const dim4&>(*this).elements(); }
+
+dim_t dim4::ndims() const {
+    dim_t num = elements();
+    if (num == 0) { return 0; }
+    if (num == 1) { return 1; }
+
+    if (dims[3] != 1) { return 4; }
+    if (dims[2] != 1) { return 3; }
+    if (dims[1] != 1) { return 2; }
+
+    return 1;
+}
+
+dim_t dim4::ndims() { return static_cast<const dim4&>(*this).ndims(); }
+
+const dim_t& dim4::operator[](const unsigned dim) const { return dims[dim]; }
+
+dim_t& dim4::operator[](const unsigned dim) {
+    return const_cast<dim_t&>(static_cast<const dim4&>((*this))[dim]);
+}
+
+bool dim4::operator==(const dim4& other) const {
+    bool ret = true;
+    for (unsigned i = 0; i < 4 && ret; i++) { ret = (*this)[i] == other[i]; }
+    return ret;
+}
+
+bool dim4::operator!=(const dim4& other) const { return !((*this) == other); }
+
+dim4& dim4::operator*=(const dim4& other) {
+    for (unsigned i = 0; i < 4; i++) { (*this)[i] *= other[i]; }
+    return *this;
+}
+
+dim4& dim4::operator+=(const dim4& other) {
+    for (unsigned i = 0; i < 4; i++) { (*this)[i] = (*this)[i] + other[i]; }
+    return *this;
+}
+
+dim4& dim4::operator-=(const dim4& other) {
+    for (unsigned i = 0; i < 4; i++) { (*this)[i] = (*this)[i] - other[i]; }
+    return *this;
+}
+
+dim4 operator+(const dim4& first, const dim4& second) {
+    dim4 dims;
+    for (unsigned i = 0; i < 4; i++) { dims[i] = first[i] + second[i]; }
+    return dims;
+}
+
+dim4 operator-(const dim4& first, const dim4& second) {
+    dim4 dims;
+    for (unsigned i = 0; i < 4; i++) { dims[i] = first[i] - second[i]; }
+    return dims;
+}
+
+dim4 operator*(const dim4& first, const dim4& second) {
+    dim4 dims;
+    for (unsigned i = 0; i < 4; i++) { dims[i] = first[i] * second[i]; }
+    return dims;
+}
+
+bool hasEnd(const af_seq& seq) { return (seq.begin <= -1 || seq.end <= -1); }
+
+bool isSpan(const af_seq& seq) {
+    return (seq.step == 0 && seq.begin == 1 && seq.end == 1);
+}
+
+size_t seqElements(const af_seq& seq) {
+    size_t out = 0;
+    if (seq.step > DBL_MIN) {
+        out = ((seq.end - seq.begin) / abs(seq.step)) + 1;
+    } else if (seq.step < -DBL_MIN) {
+        out = ((seq.begin - seq.end) / abs(seq.step)) + 1;
+    } else {
+        out = numeric_limits<size_t>::max();
+    }
+
+    return out;
+}
+
+dim_t calcDim(const af_seq& seq, const dim_t& parentDim) {
+    dim_t outDim = 1;
+    if (isSpan(seq)) {
+        outDim = parentDim;
+    } else if (hasEnd(seq)) {
+        af_seq temp = {seq.begin, seq.end, seq.step};
+        if (seq.begin < 0) { temp.begin += parentDim; }
+        if (seq.end < 0) { temp.end += parentDim; }
+        outDim = seqElements(temp);
+    } else {
+        DIM_ASSERT(1, seq.begin >= -DBL_MIN && seq.begin < parentDim);
+        DIM_ASSERT(1, seq.end < parentDim);
+        outDim = seqElements(seq);
+    }
+
+    return outDim;
+}
+
+}  // namespace af
diff --git a/src/backend/common/dispatch.cpp b/src/backend/common/dispatch.cpp
new file mode 100644
index 0000000000..4cf5cbe6b7
--- /dev/null
+++ b/src/backend/common/dispatch.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include "dispatch.hpp"
+
+unsigned nextpow2(unsigned x) {
+    x = x - 1U;
+    x = x | (x >> 1U);
+    x = x | (x >> 2U);
+    x = x | (x >> 4U);
+    x = x | (x >> 8U);
+    x = x | (x >> 16U);
+    return x + 1U;
+}
diff --git a/src/backend/common/dispatch.hpp b/src/backend/common/dispatch.hpp
new file mode 100644
index 0000000000..e248a22a97
--- /dev/null
+++ b/src/backend/common/dispatch.hpp
@@ -0,0 +1,189 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <assert.h>
+#include <platform.hpp>
+#include <af/defines.h>
+#include <algorithm>
+#include <cmath>
+
+#define divup(a, b) (((a) + (b)-1) / (b))
+
+unsigned nextpow2(unsigned x);
+
+// isPrime & greatestPrimeFactor are tailored after
+// itk::Math::{IsPrime, GreatestPrimeFactor}
+template<typename T>
+inline bool isPrime(T n) {
+    if (n <= 1) return false;
+
+    const T last{(T)std::sqrt((double)n)};
+    for (T x{2}; x <= last; ++x) {
+        if (n % x == 0) return false;
+    }
+
+    return true;
+}
+
+template<typename T>
+inline T greatestPrimeFactor(T n) {
+    T v{2};
+
+    while (v <= n) {
+        if (n % v == 0 && isPrime(v))
+            n /= v;
+        else
+            v += 1;
+    }
+
+    return v;
+}
+// Empty columns (dim==1) in refDims are removed from dims & strides.
+// INPUT: refDims, refNdims
+// UPDATE: dims, strides
+// RETURN: ndims
+template<typename T>
+T removeEmptyColumns(const T refDims[AF_MAX_DIMS], const T refNdims,
+                     T dims[AF_MAX_DIMS], T strides[AF_MAX_DIMS]) {
+    T ndims{0};
+    const T* refPtr{refDims};
+    const T* refPtr_end{refDims + refNdims};
+    // Search for first dimension == 1
+    while (refPtr != refPtr_end && *refPtr != 1) {
+        ++refPtr;
+        ++ndims;
+    }
+    if (ndims != refNdims) {
+        T* dPtr_out{dims + ndims};
+        const T* dPtr_in{dPtr_out};
+        T* sPtr_out{strides + ndims};
+        const T* sPtr_in{sPtr_out};
+        // Compress all remaining dimensions
+        while (refPtr != refPtr_end) {
+            if (*refPtr != 1) {
+                *(dPtr_out++) = *dPtr_in;
+                *(sPtr_out++) = *sPtr_in;
+                ++ndims;
+            }
+            ++refPtr;
+            ++dPtr_in;
+            ++sPtr_in;
+        }
+        // Fill remaining dimensions with 1 and calculate corresponding strides
+        // lastStride = last written dim * last written stride
+        const T lastStride{*(dPtr_out - 1) * *(sPtr_out - 1)};
+        const T lastDim{1};
+        for (const T* dPtr_end{dims + AF_MAX_DIMS}; dPtr_out != dPtr_end;
+             ++dPtr_out, ++sPtr_out) {
+            *dPtr_out = lastDim;
+            *sPtr_out = lastStride;
+        }
+    }
+    return ndims;
+}
+
+// Empty columns (dim==1) in refDims are removed from strides
+// ASSUMPTION: dims are equal to refDims, so are not provided
+// INPUT: refDims, refNdims
+// UPDATE: strides
+// RETURN: ndims
+template<typename T>
+T removeEmptyColumns(const T refDims[AF_MAX_DIMS], const T refNdims,
+                     T strides[AF_MAX_DIMS]) {
+    T ndims{0};
+    const T* refPtr{refDims};
+    const T* refPtr_end{refDims + refNdims};
+    // Search for first dimension == 1
+    while (refPtr != refPtr_end && *refPtr != 1) {
+        ++refPtr;
+        ++ndims;
+    }
+    if (ndims != refNdims) {
+        T* sPtr_out{strides + ndims};
+        const T* sPtr_in{sPtr_out};
+        // Compress all remaining dimensions
+        while (refPtr != refPtr_end) {
+            if (*refPtr != 1) {
+                *(sPtr_out++) = *sPtr_in;
+                ++ndims;
+            }
+            ++refPtr;
+            ++sPtr_in;
+        }
+        // Calculate remaining strides
+        // lastStride = last written dim * last written stride
+        const T lastStride{*(refPtr - 1) * *(sPtr_out - 1)};
+        for (const T* sPtr_end{strides + AF_MAX_DIMS}; sPtr_out != sPtr_end;
+             ++sPtr_out) {
+            *sPtr_out = lastStride;
+        }
+    }
+    return ndims;
+}
+
+// Columns with the same stride in both arrays are combined.  Both arrays will
+// remain in sync and will return the same ndims.
+// ASSUMPTION: both arrays have the same ndims
+// UPDATE: dims1, strides1, dims2, strides2, ndims
+// RETURN: ndims
+template<typename T>
+T combineColumns(T dims1[AF_MAX_DIMS], T strides1[AF_MAX_DIMS], T& ndims,
+                 T dims2[AF_MAX_DIMS], T strides2[AF_MAX_DIMS]) {
+    for (T c{0}; c < ndims - 1; ++c) {
+        if (dims1[c] == dims2[c] && dims1[c] * strides1[c] == strides1[c + 1] &&
+            dims1[c] * strides2[c] == strides2[c + 1]) {
+            // Combine columns, since they are linear
+            // This will increase the dimension of the resulting column,
+            // giving more opportunities for kernel optimization
+            dims1[c] *= dims1[c + 1];
+            dims2[c] *= dims2[c + 1];
+            --ndims;
+            for (T i{c + 1}; i < ndims; ++i) {
+                dims1[i]    = dims1[i + 1];
+                dims2[i]    = dims2[i + 1];
+                strides1[i] = strides1[i + 1];
+                strides2[i] = strides2[i + 1];
+            }
+            dims1[ndims] = 1;
+            dims2[ndims] = 1;
+            --c;  // Redo this column, since it now includes the next one
+        }
+    }
+    return ndims;
+}
+// Adjacent columns that are linear in both arrays are combined.  Both arrays
+// remain in sync and end up with the same ndims.
+// ASSUMPTION: both arrays have the same dims
+// UPDATE: dims1, strides1,
+// UPDATE: strides2, ndims
+// RETURN: ndims
+template<typename T>
+T combineColumns(T dims1[AF_MAX_DIMS], T strides1[AF_MAX_DIMS], T& ndims,
+                 T strides2[AF_MAX_DIMS]) {
+    for (T c{0}; c < ndims - 1; ++c) {
+        if (dims1[c] * strides1[c] == strides1[c + 1] &&
+            dims1[c] * strides2[c] == strides2[c + 1]) {
+            // Combine columns, since they are linear
+            // This increases the size of the resulting column,
+            // giving more opportunities for kernel optimization
+            dims1[c] *= dims1[c + 1];
+            --ndims;
+            for (T i{c + 1}; i < ndims; ++i) {
+                dims1[i]    = dims1[i + 1];
+                strides1[i] = strides1[i + 1];
+                strides2[i] = strides2[i + 1];
+            }
+            dims1[ndims] = 1;
+            --c;  // Redo this column, since it now includes the next one
+        }
+    }
+    return ndims;
+}
\ No newline at end of file
diff --git a/src/backend/common/err_common.cpp b/src/backend/common/err_common.cpp
new file mode 100644
index 0000000000..672afe6da0
--- /dev/null
+++ b/src/backend/common/err_common.cpp
@@ -0,0 +1,257 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <common/util.hpp>
+#include <type_util.hpp>
+#include <af/device.h>
+#include <af/exception.h>
+
+#include <algorithm>
+#include <cstdio>
+#include <cstring>
+#include <sstream>
+#include <string>
+#include <utility>
+
+#ifdef AF_OPENCL
+#include <errorcodes.hpp>
+#include <platform.hpp>
+#elif defined(AF_ONEAPI)
+#include <oneapi/mkl/exceptions.hpp>
+#include <sycl/sycl.hpp>
+#endif
+
+using boost::stacktrace::stacktrace;
+using std::move;
+using std::string;
+using std::stringstream;
+
+using arrayfire::common::getEnvVar;
+using arrayfire::common::getName;
+using arrayfire::common::is_stacktrace_enabled;
+
+AfError::AfError(const char *const func, const char *const file, const int line,
+                 const char *const message, af_err err, stacktrace st)
+    : logic_error(message)
+    , functionName(func)
+    , fileName(file)
+    , st_(std::move(st))
+    , lineNumber(line)
+    , error(err) {}
+
+AfError::AfError(string func, string file, const int line,
+                 const string &message, af_err err, stacktrace st)
+    : logic_error(message)
+    , functionName(std::move(func))
+    , fileName(std::move(file))
+    , st_(std::move(st))
+    , lineNumber(line)
+    , error(err) {}
+
+const string &AfError::getFunctionName() const noexcept { return functionName; }
+
+const string &AfError::getFileName() const noexcept { return fileName; }
+
+int AfError::getLine() const noexcept { return lineNumber; }
+
+af_err AfError::getError() const noexcept { return error; }
+
+AfError::~AfError() noexcept = default;
+
+TypeError::TypeError(const char *const func, const char *const file,
+                     const int line, const int index, const af_dtype type,
+                     stacktrace st)
+    : AfError(func, file, line, "Invalid data type", AF_ERR_TYPE, std::move(st))
+    , errTypeName(getName(type))
+    , argIndex(index) {}
+
+const string &TypeError::getTypeName() const noexcept { return errTypeName; }
+
+int TypeError::getArgIndex() const noexcept { return argIndex; }
+
+ArgumentError::ArgumentError(const char *const func, const char *const file,
+                             const int line, const int index,
+                             const char *const expectString, stacktrace st)
+    : AfError(func, file, line, "Invalid argument", AF_ERR_ARG, std::move(st))
+    , expected(expectString)
+    , argIndex(index) {}
+
+const string &ArgumentError::getExpectedCondition() const noexcept {
+    return expected;
+}
+
+int ArgumentError::getArgIndex() const noexcept { return argIndex; }
+
+SupportError::SupportError(const char *const func, const char *const file,
+                           const int line, const char *const back,
+                           const char *const message, stacktrace st)
+    : AfError(func, file, line, message, AF_ERR_NOT_SUPPORTED,
+              std::move(st))
+    , backend(back) {}
+
+const string &SupportError::getBackendName() const noexcept { return backend; }
+
+DimensionError::DimensionError(const char *const func, const char *const file,
+                               const int line, const int index,
+                               const char *const expectString,
+                               const stacktrace &st)
+    : AfError(func, file, line, "Invalid size", AF_ERR_SIZE, st)
+    , expected(expectString)
+    , argIndex(index) {}
+
+const string &DimensionError::getExpectedCondition() const noexcept {
+    return expected;
+}
+
+int DimensionError::getArgIndex() const noexcept { return argIndex; }
+
+af_err set_global_error_string(const string &msg, af_err err) {
+    string perr = getEnvVar("AF_PRINT_ERRORS");
+    if (!perr.empty()) {
+        if (perr != "0") { fprintf(stderr, "%s\n", msg.c_str()); }
+    }
+    get_global_error_string() = msg;
+    return err;
+}
+
+af_err processException() {
+    stringstream ss;
+    af_err err = AF_ERR_INTERNAL;
+
+    try {
+        throw;
+    } catch (const DimensionError &ex) {
+        ss << "In function " << ex.getFunctionName() << "\n"
+           << "In file " << ex.getFileName() << ":" << ex.getLine() << "\n"
+           << "Invalid dimension for argument " << ex.getArgIndex() << "\n"
+           << "Expected: " << ex.getExpectedCondition() << "\n";
+        if (is_stacktrace_enabled()) { ss << ex.getStacktrace(); }
+
+        err = set_global_error_string(ss.str(), AF_ERR_SIZE);
+    } catch (const ArgumentError &ex) {
+        ss << "In function " << ex.getFunctionName() << "\n"
+           << "In file " << ex.getFileName() << ":" << ex.getLine() << "\n"
+           << "Invalid argument at index " << ex.getArgIndex() << "\n"
+           << "Expected: " << ex.getExpectedCondition() << "\n";
+
+        if (is_stacktrace_enabled()) { ss << ex.getStacktrace(); }
+        err = set_global_error_string(ss.str(), AF_ERR_ARG);
+    } catch (const SupportError &ex) {
+        ss << ex.getFunctionName() << " not supported for "
+           << ex.getBackendName() << " backend\n";
+
+        if (is_stacktrace_enabled()) { ss << ex.getStacktrace(); }
+        err = set_global_error_string(ss.str(), AF_ERR_NOT_SUPPORTED);
+    } catch (const TypeError &ex) {
+        ss << "In function " << ex.getFunctionName() << "\n"
+           << "In file " << ex.getFileName() << ":" << ex.getLine() << "\n"
+           << "Invalid type for argument " << ex.getArgIndex() << "\n";
+
+        if (is_stacktrace_enabled()) { ss << ex.getStacktrace(); }
+        err = set_global_error_string(ss.str(), AF_ERR_TYPE);
+    } catch (const AfError &ex) {
+        ss << "In function " << ex.getFunctionName() << "\n"
+           << "In file " << ex.getFileName() << ":" << ex.getLine() << "\n"
+           << ex.what() << "\n";
+        if (is_stacktrace_enabled()) { ss << ex.getStacktrace(); }
+
+        err = set_global_error_string(ss.str(), ex.getError());
+#ifdef AF_ONEAPI
+    } catch (const sycl::exception &ex) {
+        char oneapi_err_msg[1024];
+        snprintf(oneapi_err_msg, sizeof(oneapi_err_msg),
+                 "oneAPI Error (%d): %s", ex.code().value(), ex.what());
+
+        if (ex.code() == sycl::errc::memory_allocation) {
+            err = set_global_error_string(oneapi_err_msg, AF_ERR_NO_MEM);
+        } else {
+            err = set_global_error_string(oneapi_err_msg, AF_ERR_INTERNAL);
+        }
+    } catch (const oneapi::mkl::exception &ex) {
+        char oneapi_err_msg[1024];
+        snprintf(oneapi_err_msg, sizeof(oneapi_err_msg), "MKL Error: %s",
+                 ex.what());
+
+        err = set_global_error_string(oneapi_err_msg, AF_ERR_INTERNAL);
+#endif
+#ifdef AF_OPENCL
+    } catch (const cl::Error &ex) {
+        char opencl_err_msg[1024];
+        snprintf(opencl_err_msg, sizeof(opencl_err_msg),
+                 "OpenCL Error (%d): %s when calling %s", ex.err(),
+                 getErrorMessage(ex.err()).c_str(), ex.what());
+
+        if (ex.err() == CL_MEM_OBJECT_ALLOCATION_FAILURE) {
+            err = set_global_error_string(opencl_err_msg, AF_ERR_NO_MEM);
+        } else {
+            err = set_global_error_string(opencl_err_msg, AF_ERR_INTERNAL);
+        }
+#endif
+    } catch (const std::exception &ex) {
+        err = set_global_error_string(ex.what(), AF_ERR_UNKNOWN);
+    } catch (...) { err = set_global_error_string(ss.str(), AF_ERR_UNKNOWN); }
+
+    return err;
+}
+
+std::string &get_global_error_string() noexcept {
+    thread_local auto *global_error_string = new std::string("");
+    return *global_error_string;
+}
+
+const char *af_err_to_string(const af_err err) {
+    switch (err) {
+        case AF_SUCCESS: return "Success";
+        case AF_ERR_NO_MEM: return "Device out of memory";
+        case AF_ERR_DRIVER: return "Driver not available or incompatible";
+        case AF_ERR_RUNTIME: return "Runtime error ";
+        case AF_ERR_INVALID_ARRAY: return "Invalid array";
+        case AF_ERR_ARG: return "Invalid input argument";
+        case AF_ERR_SIZE: return "Invalid input size";
+        case AF_ERR_TYPE: return "Function does not support this data type";
+        case AF_ERR_DIFF_TYPE: return "Input types are not the same";
+        case AF_ERR_BATCH: return "Invalid batch configuration";
+        case AF_ERR_DEVICE:
+            return "Input does not belong to the current device.";
+        case AF_ERR_NOT_SUPPORTED: return "Function not supported";
+        case AF_ERR_NOT_CONFIGURED: return "Function not configured to build";
+        case AF_ERR_NONFREE:
+            return "Function unavailable. "
+                   "ArrayFire compiled without Non-Free algorithms support";
+        case AF_ERR_NO_DBL:
+            return "Double precision not supported for this device";
+        case AF_ERR_NO_GFX:
+            return "Graphics functionality unavailable. "
+                   "ArrayFire compiled without Graphics support";
+        case AF_ERR_NO_HALF:
+            return "Half precision floats not supported for this device";
+        case AF_ERR_LOAD_LIB: return "Failed to load dynamic library. ";
+        case AF_ERR_LOAD_SYM: return "Failed to load symbol";
+        case AF_ERR_ARR_BKND_MISMATCH:
+            return "There was a mismatch between an array and the current "
+                   "backend";
+        case AF_ERR_INTERNAL: return "Internal error";
+        case AF_ERR_UNKNOWN: return "Unknown error";
+    }
+    return "Unknown error. Please open an issue and add this error code to the "
+           "case in af_err_to_string.";
+}
+
+namespace arrayfire {
+namespace common {
+
+bool &is_stacktrace_enabled() noexcept {
+    static bool stacktrace_enabled = true;
+    return stacktrace_enabled;
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/err_common.hpp b/src/backend/common/err_common.hpp
new file mode 100644
index 0000000000..846f4b516f
--- /dev/null
+++ b/src/backend/common/err_common.hpp
@@ -0,0 +1,219 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wattributes"
+#pragma GCC diagnostic ignored "-Wparentheses"
+#include <boost/stacktrace.hpp>
+#pragma GCC diagnostic pop
+#include <common/defines.hpp>
+#include <af/defines.h>
+
+#include <cassert>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <utility>
+
+class AfError : public std::logic_error {
+    std::string functionName;
+    std::string fileName;
+    boost::stacktrace::stacktrace st_;
+    int lineNumber;
+    af_err error;
+    AfError();
+
+   public:
+    AfError(const char* const func, const char* const file, const int line,
+            const char* const message, af_err err,
+            boost::stacktrace::stacktrace st);
+
+    AfError(std::string func, std::string file, const int line,
+            const std::string& message, af_err err,
+            boost::stacktrace::stacktrace st);
+
+    AfError(const AfError& other) noexcept = delete;
+
+    /// This is the same as default but gcc 6.1 fails when noexcept is used
+    /// along with the default specifier. Expanded the default definition
+    /// to avoid this error
+    AfError(AfError&& other) noexcept
+        : std::logic_error(std::forward<std::logic_error>(other))
+        , functionName(std::forward<std::string>(other.functionName))
+        , fileName(std::forward<std::string>(other.fileName))
+        , st_(std::forward<boost::stacktrace::stacktrace>(other.st_))
+        , lineNumber(std::forward<int>(other.lineNumber))
+        , error(std::forward<af_err>(other.error)) {}
+
+    const std::string& getFunctionName() const noexcept;
+
+    const std::string& getFileName() const noexcept;
+
+    const boost::stacktrace::stacktrace& getStacktrace() const noexcept {
+        return st_;
+    }
+
+    int getLine() const noexcept;
+
+    af_err getError() const noexcept;
+
+    virtual ~AfError() noexcept;
+};
+
+// TODO: Perhaps add a way to return supported types
+class TypeError : public AfError {
+    std::string errTypeName;
+    int argIndex;
+    TypeError();
+
+   public:
+    TypeError(const char* const func, const char* const file, const int line,
+              const int index, const af_dtype type,
+              const boost::stacktrace::stacktrace st);
+
+    TypeError(TypeError&& other) noexcept = default;
+
+    const std::string& getTypeName() const noexcept;
+
+    int getArgIndex() const noexcept;
+
+    ~TypeError() noexcept {}
+};
+
+class ArgumentError : public AfError {
+    std::string expected;
+    int argIndex;
+    ArgumentError();
+
+   public:
+    ArgumentError(const char* const func, const char* const file,
+                  const int line, const int index,
+                  const char* const expectString,
+                  const boost::stacktrace::stacktrace st);
+    ArgumentError(ArgumentError&& other) noexcept = default;
+
+    const std::string& getExpectedCondition() const noexcept;
+
+    int getArgIndex() const noexcept;
+
+    ~ArgumentError() noexcept {}
+};
+
+class SupportError : public AfError {
+    std::string backend;
+    SupportError();
+
+   public:
+    SupportError(const char* const func, const char* const file, const int line,
+                 const char* const back, const char* const message,
+                 const boost::stacktrace::stacktrace st);
+    SupportError(SupportError&& other) noexcept = default;
+
+    ~SupportError() noexcept {}
+
+    const std::string& getBackendName() const noexcept;
+};
+
+class DimensionError : public AfError {
+    std::string expected;
+    int argIndex;
+    DimensionError();
+
+   public:
+    DimensionError(const char* const func, const char* const file,
+                   const int line, const int index,
+                   const char* const expectString,
+                   const boost::stacktrace::stacktrace& st);
+    DimensionError(DimensionError&& other) noexcept = default;
+
+    const std::string& getExpectedCondition() const noexcept;
+
+    int getArgIndex() const noexcept;
+
+    ~DimensionError() noexcept {}
+};
+
+af_err processException();
+
+af_err set_global_error_string(const std::string& msg,
+                               af_err err = AF_ERR_UNKNOWN);
+
+#define DIM_ASSERT(INDEX, COND)                                          \
+    do {                                                                 \
+        if ((COND) == false) {                                           \
+            throw DimensionError(__AF_FUNC__, __AF_FILENAME__, __LINE__, \
+                                 INDEX, #COND,                           \
+                                 boost::stacktrace::stacktrace());       \
+        }                                                                \
+    } while (0)
+
+#define ARG_ASSERT(INDEX, COND)                                                \
+    do {                                                                       \
+        if ((COND) == false) {                                                 \
+            throw ArgumentError(__AF_FUNC__, __AF_FILENAME__, __LINE__, INDEX, \
+                                #COND, boost::stacktrace::stacktrace());       \
+        }                                                                      \
+    } while (0)
+
+#define TYPE_ERROR(INDEX, type)                                              \
+    do {                                                                     \
+        throw TypeError(__AF_FUNC__, __AF_FILENAME__, __LINE__, INDEX, type, \
+                        boost::stacktrace::stacktrace());                    \
+    } while (0)
+
+#define AF_ERROR(MSG, ERR_TYPE)                                              \
+    do {                                                                     \
+        throw AfError(__AF_FUNC__, __AF_FILENAME__, __LINE__, MSG, ERR_TYPE, \
+                      boost::stacktrace::stacktrace());                      \
+    } while (0)
+
+#define AF_RETURN_ERROR(MSG, ERR_TYPE)                                       \
+    do {                                                                     \
+        std::stringstream s;                                                 \
+        s << "Error in " << __AF_FUNC__ << "\n"                              \
+          << "In file " << __AF_FILENAME__ << ":" << __LINE__ << ": " << MSG \
+          << "\n"                                                            \
+          << boost::stacktrace::stacktrace();                                \
+        return set_global_error_string(s.str(), ERR_TYPE);                   \
+    } while (0)
+
+#define TYPE_ASSERT(COND)                                       \
+    do {                                                        \
+        if ((COND) == false) {                                  \
+            AF_ERROR("Type mismatch inputs", AF_ERR_DIFF_TYPE); \
+        }                                                       \
+    } while (0)
+
+#define AF_ASSERT(COND, MESSAGE) assert((MESSAGE) && (COND))
+
+#define CATCHALL                   \
+    catch (...) {                  \
+        return processException(); \
+    }
+
+#define AF_CHECK(fn)                                                       \
+    do {                                                                   \
+        af_err __err = fn;                                                 \
+        if (__err == AF_SUCCESS) break;                                    \
+        throw AfError(__AF_FUNC__, __AF_FILENAME__, __LINE__, "\n", __err, \
+                      boost::stacktrace::stacktrace());                    \
+    } while (0)
+
+static const int MAX_ERR_SIZE = 1024;
+std::string& get_global_error_string() noexcept;
+
+namespace arrayfire {
+namespace common {
+
+bool& is_stacktrace_enabled() noexcept;
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/forge_loader.hpp b/src/backend/common/forge_loader.hpp
new file mode 100644
index 0000000000..6fcdd625ef
--- /dev/null
+++ b/src/backend/common/forge_loader.hpp
@@ -0,0 +1,134 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/DependencyModule.hpp>
+#include <forge.h>
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wignored-attributes"
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(_MSC_VER)
+/* Microsoft Visual Studio */
+#else
+/* Other */
+#endif
+
+#include <glad/glad.h>
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(__GNUC__) || defined(__GNUG__)
+/* GNU GCC/G++ */
+// Nothing to pop: no diagnostics were pushed for GCC above
+#elif defined(_MSC_VER)
+/* Microsoft Visual Studio */
+// Nothing to pop: no warnings were pushed for MSVC above
+#else
+/* Other */
+#endif
+
+namespace arrayfire {
+namespace common {
+
+class ForgeModule : public DependencyModule {
+   public:
+    ForgeModule();
+
+    MODULE_MEMBER(fg_create_window);
+    MODULE_MEMBER(fg_get_window_context_handle);
+    MODULE_MEMBER(fg_get_window_display_handle);
+    MODULE_MEMBER(fg_make_window_current);
+    MODULE_MEMBER(fg_set_window_font);
+    MODULE_MEMBER(fg_set_window_position);
+    MODULE_MEMBER(fg_set_window_title);
+    MODULE_MEMBER(fg_set_window_size);
+    MODULE_MEMBER(fg_set_window_colormap);
+    MODULE_MEMBER(fg_draw_chart_to_cell);
+    MODULE_MEMBER(fg_draw_chart);
+    MODULE_MEMBER(fg_draw_image_to_cell);
+    MODULE_MEMBER(fg_draw_image);
+    MODULE_MEMBER(fg_swap_window_buffers);
+    MODULE_MEMBER(fg_close_window);
+    MODULE_MEMBER(fg_show_window);
+    MODULE_MEMBER(fg_hide_window);
+    MODULE_MEMBER(fg_release_window);
+
+    MODULE_MEMBER(fg_create_font);
+    MODULE_MEMBER(fg_load_system_font);
+    MODULE_MEMBER(fg_release_font);
+
+    MODULE_MEMBER(fg_create_image);
+    MODULE_MEMBER(fg_get_pixel_buffer);
+    MODULE_MEMBER(fg_get_image_size);
+    MODULE_MEMBER(fg_release_image);
+
+    MODULE_MEMBER(fg_create_plot);
+    MODULE_MEMBER(fg_set_plot_color);
+    MODULE_MEMBER(fg_get_plot_vertex_buffer);
+    MODULE_MEMBER(fg_get_plot_vertex_buffer_size);
+    MODULE_MEMBER(fg_release_plot);
+
+    MODULE_MEMBER(fg_create_histogram);
+    MODULE_MEMBER(fg_set_histogram_color);
+    MODULE_MEMBER(fg_get_histogram_vertex_buffer);
+    MODULE_MEMBER(fg_get_histogram_vertex_buffer_size);
+    MODULE_MEMBER(fg_release_histogram);
+
+    MODULE_MEMBER(fg_create_surface);
+    MODULE_MEMBER(fg_set_surface_color);
+    MODULE_MEMBER(fg_get_surface_vertex_buffer);
+    MODULE_MEMBER(fg_get_surface_vertex_buffer_size);
+    MODULE_MEMBER(fg_release_surface);
+
+    MODULE_MEMBER(fg_create_vector_field);
+    MODULE_MEMBER(fg_set_vector_field_color);
+    MODULE_MEMBER(fg_get_vector_field_vertex_buffer_size);
+    MODULE_MEMBER(fg_get_vector_field_direction_buffer_size);
+    MODULE_MEMBER(fg_get_vector_field_vertex_buffer);
+    MODULE_MEMBER(fg_get_vector_field_direction_buffer);
+    MODULE_MEMBER(fg_release_vector_field);
+
+    MODULE_MEMBER(fg_create_chart);
+    MODULE_MEMBER(fg_get_chart_type);
+    MODULE_MEMBER(fg_get_chart_axes_limits);
+    MODULE_MEMBER(fg_set_chart_axes_limits);
+    MODULE_MEMBER(fg_set_chart_axes_titles);
+    MODULE_MEMBER(fg_set_chart_label_format);
+    MODULE_MEMBER(fg_append_image_to_chart);
+    MODULE_MEMBER(fg_append_plot_to_chart);
+    MODULE_MEMBER(fg_append_histogram_to_chart);
+    MODULE_MEMBER(fg_append_surface_to_chart);
+    MODULE_MEMBER(fg_append_vector_field_to_chart);
+    MODULE_MEMBER(fg_release_chart);
+
+    MODULE_MEMBER(fg_err_to_string);
+};
+
+ForgeModule& forgePlugin();
+
+#define FG_CHECK(fn)                                        \
+    do {                                                    \
+        fg_err e = (fn);                                    \
+        if (e != FG_ERR_NONE) {                             \
+            AF_ERROR("forge call failed", AF_ERR_INTERNAL); \
+        }                                                   \
+    } while (0)
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/graphics_common.cpp b/src/backend/common/graphics_common.cpp
new file mode 100644
index 0000000000..01f94078d4
--- /dev/null
+++ b/src/backend/common/graphics_common.cpp
@@ -0,0 +1,526 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <common/util.hpp>
+#include <platform.hpp>
+#include <mutex>
+#include <utility>
+
+using arrayfire::common::getEnvVar;
+using std::make_pair;
+using std::string;
+
+namespace arrayfire {
+namespace common {
+
+/// Dynamically loads forge function pointer at runtime
+#define FG_MODULE_FUNCTION_INIT(NAME) \
+    NAME = DependencyModule::getSymbol<decltype(&::NAME)>(#NAME)
+
+ForgeModule::ForgeModule() : DependencyModule("forge", nullptr) {
+    if (DependencyModule::isLoaded()) {
+        FG_MODULE_FUNCTION_INIT(fg_create_window);
+        FG_MODULE_FUNCTION_INIT(fg_get_window_context_handle);
+        FG_MODULE_FUNCTION_INIT(fg_get_window_display_handle);
+        FG_MODULE_FUNCTION_INIT(fg_make_window_current);
+        FG_MODULE_FUNCTION_INIT(fg_set_window_font);
+        FG_MODULE_FUNCTION_INIT(fg_set_window_position);
+        FG_MODULE_FUNCTION_INIT(fg_set_window_title);
+        FG_MODULE_FUNCTION_INIT(fg_set_window_size);
+        FG_MODULE_FUNCTION_INIT(fg_set_window_colormap);
+        FG_MODULE_FUNCTION_INIT(fg_draw_chart_to_cell);
+        FG_MODULE_FUNCTION_INIT(fg_draw_chart);
+        FG_MODULE_FUNCTION_INIT(fg_draw_image_to_cell);
+        FG_MODULE_FUNCTION_INIT(fg_draw_image);
+        FG_MODULE_FUNCTION_INIT(fg_swap_window_buffers);
+        FG_MODULE_FUNCTION_INIT(fg_close_window);
+        FG_MODULE_FUNCTION_INIT(fg_show_window);
+        FG_MODULE_FUNCTION_INIT(fg_hide_window);
+        FG_MODULE_FUNCTION_INIT(fg_release_window);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_font);
+        FG_MODULE_FUNCTION_INIT(fg_load_system_font);
+        FG_MODULE_FUNCTION_INIT(fg_release_font);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_image);
+        FG_MODULE_FUNCTION_INIT(fg_get_pixel_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_get_image_size);
+        FG_MODULE_FUNCTION_INIT(fg_release_image);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_plot);
+        FG_MODULE_FUNCTION_INIT(fg_set_plot_color);
+        FG_MODULE_FUNCTION_INIT(fg_get_plot_vertex_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_get_plot_vertex_buffer_size);
+        FG_MODULE_FUNCTION_INIT(fg_release_plot);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_histogram);
+        FG_MODULE_FUNCTION_INIT(fg_set_histogram_color);
+        FG_MODULE_FUNCTION_INIT(fg_get_histogram_vertex_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_get_histogram_vertex_buffer_size);
+        FG_MODULE_FUNCTION_INIT(fg_release_histogram);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_surface);
+        FG_MODULE_FUNCTION_INIT(fg_set_surface_color);
+        FG_MODULE_FUNCTION_INIT(fg_get_surface_vertex_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_get_surface_vertex_buffer_size);
+        FG_MODULE_FUNCTION_INIT(fg_release_surface);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_vector_field);
+        FG_MODULE_FUNCTION_INIT(fg_set_vector_field_color);
+        FG_MODULE_FUNCTION_INIT(fg_get_vector_field_vertex_buffer_size);
+        FG_MODULE_FUNCTION_INIT(fg_get_vector_field_direction_buffer_size);
+        FG_MODULE_FUNCTION_INIT(fg_get_vector_field_vertex_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_get_vector_field_direction_buffer);
+        FG_MODULE_FUNCTION_INIT(fg_release_vector_field);
+
+        FG_MODULE_FUNCTION_INIT(fg_create_chart);
+        FG_MODULE_FUNCTION_INIT(fg_get_chart_type);
+        FG_MODULE_FUNCTION_INIT(fg_get_chart_axes_limits);
+        FG_MODULE_FUNCTION_INIT(fg_set_chart_axes_limits);
+        FG_MODULE_FUNCTION_INIT(fg_set_chart_axes_titles);
+        FG_MODULE_FUNCTION_INIT(fg_set_chart_label_format);
+        FG_MODULE_FUNCTION_INIT(fg_append_image_to_chart);
+        FG_MODULE_FUNCTION_INIT(fg_append_plot_to_chart);
+        FG_MODULE_FUNCTION_INIT(fg_append_histogram_to_chart);
+        FG_MODULE_FUNCTION_INIT(fg_append_surface_to_chart);
+        FG_MODULE_FUNCTION_INIT(fg_append_vector_field_to_chart);
+        FG_MODULE_FUNCTION_INIT(fg_release_chart);
+
+        FG_MODULE_FUNCTION_INIT(fg_err_to_string);
+
+        if (!DependencyModule::symbolsLoaded()) {
+            string error_message =
+                "Error loading Forge: " + DependencyModule::getErrorMessage() +
+                "\nForge or one of its dependencies failed to "
+                "load. Try installing Forge or check if Forge is in the "
+                "search path.";
+            AF_ERROR(error_message.c_str(), AF_ERR_LOAD_LIB);
+        }
+    }
+}
+
+template<typename T>
+fg_dtype getGLType() {
+    return FG_FLOAT32;
+}
+
+fg_marker_type getFGMarker(const af_marker_type af_marker) {
+    fg_marker_type fg_marker;
+    switch (af_marker) {
+        case AF_MARKER_NONE: fg_marker = FG_MARKER_NONE; break;
+        case AF_MARKER_POINT: fg_marker = FG_MARKER_POINT; break;
+        case AF_MARKER_CIRCLE: fg_marker = FG_MARKER_CIRCLE; break;
+        case AF_MARKER_SQUARE: fg_marker = FG_MARKER_SQUARE; break;
+        case AF_MARKER_TRIANGLE: fg_marker = FG_MARKER_TRIANGLE; break;
+        case AF_MARKER_CROSS: fg_marker = FG_MARKER_CROSS; break;
+        case AF_MARKER_PLUS: fg_marker = FG_MARKER_PLUS; break;
+        case AF_MARKER_STAR: fg_marker = FG_MARKER_STAR; break;
+        default: fg_marker = FG_MARKER_NONE; break;
+    }
+    return fg_marker;
+}
+
+#define INSTANTIATE_GET_FG_TYPE(T, ForgeEnum) \
+    template<>                                \
+    fg_dtype getGLType<T>() {                 \
+        return ForgeEnum;                     \
+    }
+
+INSTANTIATE_GET_FG_TYPE(float, FG_FLOAT32);
+INSTANTIATE_GET_FG_TYPE(int, FG_INT32);
+INSTANTIATE_GET_FG_TYPE(unsigned, FG_UINT32);
+INSTANTIATE_GET_FG_TYPE(char, FG_INT8);
+INSTANTIATE_GET_FG_TYPE(signed char, FG_INT8);
+INSTANTIATE_GET_FG_TYPE(unsigned char, FG_UINT8);
+INSTANTIATE_GET_FG_TYPE(unsigned short, FG_UINT16);
+INSTANTIATE_GET_FG_TYPE(short, FG_INT16);
+
+// NOLINTNEXTLINE(misc-unused-parameters)
+GLenum glErrorCheck(const char* msg, const char* file, int line) {
+// Skipped in release mode
+#ifndef NDEBUG
+    GLenum x = glGetError();
+
+    if (x != GL_NO_ERROR) {
+        char buf[1024];
+        snprintf(buf, sizeof(buf),
+                 "GL Error at: %s:%d Message: %s Error Code: %d \"%s\"\n",
+                 file, line, msg, static_cast<int>(x), glGetString(x));
+        AF_ERROR(buf, AF_ERR_INTERNAL);
+    }
+    return x;
+#else
+    UNUSED(msg);
+    UNUSED(file);
+    UNUSED(line);
+    return static_cast<GLenum>(0);
+#endif
+}
+
+size_t getTypeSize(GLenum type) {
+    switch (type) {
+        case GL_FLOAT: return sizeof(float);
+        case GL_INT: return sizeof(int);
+        case GL_UNSIGNED_INT: return sizeof(unsigned);
+        case GL_SHORT: return sizeof(short);
+        case GL_UNSIGNED_SHORT: return sizeof(unsigned short);
+        case GL_BYTE: return sizeof(char);
+        case GL_UNSIGNED_BYTE: return sizeof(unsigned char);
+        default: return sizeof(float);
+    }
+}
+
+void makeContextCurrent(fg_window window) {
+    FG_CHECK(common::forgePlugin().fg_make_window_current(window));
+    CheckGL("End makeContextCurrent");
+}
+
+// dir -> true = round up, false = round down
+double step_round(const double in, const bool dir) {
+    if (in == 0) { return 0; }
+
+    static const double LOG2 = log10(2);
+    static const double LOG4 = log10(4);
+    static const double LOG6 = log10(6);
+    static const double LOG8 = log10(8);
+
+    // log_in is of the form "s abc.xyz", where
+    // s is either + or -; + indicates abs(in) >= 1 and - indicates 0 < abs(in)
+    // < 1 (log10(1) is +0)
+    const double sign   = in < 0 ? -1 : 1;
+    const double log_in = std::log10(std::fabs(in));
+    const double mag    = std::pow(10, std::floor(log_in)) *
+                       sign;  // Number of digits either left or right of 0
+    const double dec = std::log10(in / mag);  // log of the fraction
+
+    // This means in is of the form 10^n
+    if (dec == 0) { return in; }
+
+    // For negative numbers, -ve round down = +ve round up and vice versa
+    bool op_dir = in > 0 ? dir : !dir;
+
+    double mult = 1;
+
+    // Round up
+    if (op_dir) {
+        if (dec <= LOG2) {
+            mult = 2;
+        } else if (dec <= LOG4) {
+            mult = 4;
+        } else if (dec <= LOG6) {
+            mult = 6;
+        } else if (dec <= LOG8) {
+            mult = 8;
+        } else {
+            mult = 10;
+        }
+    } else {  // Round down
+        if (dec < LOG2) {
+            mult = 1;
+        } else if (dec < LOG4) {
+            mult = 2;
+        } else if (dec < LOG6) {
+            mult = 4;
+        } else if (dec < LOG8) {
+            mult = 6;
+        } else {
+            mult = 8;
+        }
+    }
+
+    return mag * mult;
+}
+
+ForgeModule& forgePlugin() { return detail::forgeManager().plugin(); }
+
+ForgeManager::ForgeManager() : mPlugin(new ForgeModule()) {}
+
+ForgeModule& ForgeManager::plugin() { return *mPlugin; }
+
+fg_window ForgeManager::getMainWindow() {
+    static std::once_flag flag;
+
+    // Define AF_DISABLE_GRAPHICS with any value to disable initialization
+    std::string noGraphicsENV = getEnvVar("AF_DISABLE_GRAPHICS");
+
+    af_err error      = AF_SUCCESS;
+    fg_err forgeError = FG_ERR_NONE;
+    if (noGraphicsENV.empty()) {  // If AF_DISABLE_GRAPHICS is not defined
+        std::call_once(flag, [this, &error, &forgeError] {
+            if (!this->mPlugin->isLoaded()) {
+                error = AF_ERR_LOAD_LIB;
+                return;
+            }
+            fg_window w = nullptr;
+            forgeError  = this->mPlugin->fg_create_window(
+                &w, WIDTH, HEIGHT, "ArrayFire", NULL, true);
+            if (forgeError != FG_ERR_NONE) { return; }
+            this->setWindowChartGrid(w, 1, 1);
+            this->mPlugin->fg_make_window_current(w);
+            this->mMainWindow.reset(new Window({w}));
+            if (!gladLoadGL()) { error = AF_ERR_LOAD_LIB; }
+        });
+        if (error == AF_ERR_LOAD_LIB) {
+            string error_message =
+                "Error loading Forge: " + this->mPlugin->getErrorMessage() +
+                "\nForge or one of its dependencies failed to "
+                "load. Try installing Forge or check if Forge is in the "
+                "search path.";
+            AF_ERROR(error_message.c_str(), AF_ERR_LOAD_LIB);
+        }
+        if (forgeError != FG_ERR_NONE) {
+            AF_ERROR(this->mPlugin->fg_err_to_string(forgeError),
+                     AF_ERR_RUNTIME);
+        }
+    }
+
+    return mMainWindow->handle;
+}
+
+fg_window ForgeManager::getWindow(const int w, const int h,
+                                  const char* const title,
+                                  const bool invisible) {
+    fg_window retVal = 0;
+    FG_CHECK(mPlugin->fg_create_window(&retVal, w, h, title, getMainWindow(),
+                                       invisible));
+    if (retVal == 0) { AF_ERROR("Window creation failed", AF_ERR_INTERNAL); }
+    setWindowChartGrid(retVal, 1, 1);
+    return retVal;
+}
+
+void ForgeManager::setWindowChartGrid(const fg_window window, const int r,
+                                      const int c) {
+    auto chart_iter = mChartMap.find(window);
+
+    if (chart_iter != mChartMap.end()) {
+        // ChartVec found. Clear it.
+        // This has to be cleared as there is no guarantee that existing
+        // chart types(2D/3D) match the future grid requirements
+        for (const ChartPtr& chart : chart_iter->second) {
+            if (chart) { mChartAxesOverrideMap.erase(chart->handle); }
+        }
+        (chart_iter->second).clear();  // Clear ChartList
+        auto gIter    = mWndGridMap.find(window);
+        gIter->second = make_pair(1, 1);
+    }
+
+    if (r == 0 || c == 0) {
+        mChartMap.erase(window);
+        mWndGridMap.erase(window);
+    } else {
+        mChartMap[window]   = ChartList(r * c);
+        mWndGridMap[window] = std::make_pair(r, c);
+    }
+}
+
+ForgeManager::WindowGridDims ForgeManager::getWindowGrid(
+    const fg_window window) {
+    auto gIter = mWndGridMap.find(window);
+    if (gIter == mWndGridMap.end()) { mWndGridMap[window] = make_pair(1, 1); }
+    return mWndGridMap[window];
+}
+
+fg_chart ForgeManager::getChart(const fg_window window, const int r,
+                                const int c, const fg_chart_type ctype) {
+    auto gIter = mWndGridMap.find(window);
+
+    int rows = std::get<0>(gIter->second);
+    int cols = std::get<1>(gIter->second);
+
+    if (c >= cols || r >= rows) {
+        AF_ERROR("Window Grid points are out of bounds", AF_ERR_TYPE);
+    }
+
+    // Charts are stored in column-major order within the window grid
+    auto chart_iter = mChartMap.find(window);
+    ChartPtr& chart = (chart_iter->second)[c * rows + r];
+
+    if (!chart) {
+        fg_chart temp = NULL;
+        FG_CHECK(mPlugin->fg_create_chart(&temp, ctype));
+        chart.reset(new Chart({temp}));
+        mChartAxesOverrideMap[chart->handle] = false;
+    } else {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart->handle));
+        if (chart_type != ctype) {
+            // Existing chart is of incompatible type
+            mChartAxesOverrideMap.erase(chart->handle);
+            fg_chart temp = 0;
+            FG_CHECK(mPlugin->fg_create_chart(&temp, ctype));
+            chart.reset(new Chart({temp}));
+            mChartAxesOverrideMap[chart->handle] = false;
+        }
+    }
+    return chart->handle;
+}
+
+unsigned long long ForgeManager::genImageKey(unsigned w, unsigned h,
+                                             fg_channel_format mode,
+                                             fg_dtype type) {
+    assert(w <= (1U << 16U));
+    assert(h <= (1U << 16U));
+    unsigned long long key = ((w & _16BIT) << 16U) | (h & _16BIT);
+    key = ((((key << 16U) | (mode & _16BIT)) << 16U) | (type & _16BIT));
+    return key;
+}
+
+fg_image ForgeManager::getImage(int w, int h, fg_channel_format mode,
+                                fg_dtype type) {
+    auto key = genImageKey(w, h, mode, type);
+
+    ChartKey keypair = std::make_pair(key, nullptr);
+    auto iter        = mImgMap.find(keypair);
+
+    if (iter == mImgMap.end()) {
+        fg_image img = nullptr;
+        FG_CHECK(mPlugin->fg_create_image(&img, w, h, mode, type));
+        mImgMap[keypair] = ImagePtr(new Image({img}));
+    }
+    return mImgMap[keypair]->handle;
+}
+
+fg_image ForgeManager::getImage(fg_chart chart, int w, int h,
+                                fg_channel_format mode, fg_dtype type) {
+    auto key = genImageKey(w, h, mode, type);
+
+    ChartKey keypair = make_pair(key, chart);
+    auto iter        = mImgMap.find(keypair);
+
+    if (iter == mImgMap.end()) {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart));
+        if (chart_type != FG_CHART_2D) {
+            AF_ERROR("Image can only be added to chart of type FG_CHART_2D",
+                     AF_ERR_TYPE);
+        }
+        fg_image img = nullptr;
+        FG_CHECK(mPlugin->fg_create_image(&img, w, h, mode, type));
+        FG_CHECK(mPlugin->fg_append_image_to_chart(chart, img));
+
+        mImgMap[keypair] = ImagePtr(new Image({img}));
+    }
+    return mImgMap[keypair]->handle;
+}
+
+fg_plot ForgeManager::getPlot(fg_chart chart, int nPoints, fg_dtype dtype,
+                              fg_plot_type ptype, fg_marker_type mtype) {
+    unsigned long long key =
+        ((static_cast<unsigned long long>(nPoints) & _48BIT) << 16U);
+    key |=
+        (((dtype & _4BIT) << 12U) | ((ptype & _4BIT) << 8U) | (mtype & _8BIT));
+
+    ChartKey keypair = std::make_pair(key, chart);
+    auto iter        = mPltMap.find(keypair);
+
+    if (iter == mPltMap.end()) {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart));
+
+        fg_plot plt = nullptr;
+        FG_CHECK(mPlugin->fg_create_plot(&plt, nPoints, dtype, chart_type,
+                                         ptype, mtype));
+        FG_CHECK(mPlugin->fg_append_plot_to_chart(chart, plt));
+
+        mPltMap[keypair] = PlotPtr(new Plot({plt}));
+    }
+    return mPltMap[keypair]->handle;
+}
+
+fg_histogram ForgeManager::getHistogram(fg_chart chart, int nBins,
+                                        fg_dtype type) {
+    unsigned long long key =
+        ((static_cast<unsigned long long>(nBins) & _48BIT) << 16U) |
+        (type & _16BIT);
+
+    ChartKey keypair = make_pair(key, chart);
+    auto iter        = mHstMap.find(keypair);
+
+    if (iter == mHstMap.end()) {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart));
+        if (chart_type != FG_CHART_2D) {
+            AF_ERROR("Histogram can only be added to chart of type FG_CHART_2D",
+                     AF_ERR_TYPE);
+        }
+        fg_histogram hst = nullptr;
+        FG_CHECK(mPlugin->fg_create_histogram(&hst, nBins, type));
+        FG_CHECK(mPlugin->fg_append_histogram_to_chart(chart, hst));
+        mHstMap[keypair] = HistogramPtr(new Histogram({hst}));
+    }
+    return mHstMap[keypair]->handle;
+}
+
+fg_surface ForgeManager::getSurface(fg_chart chart, int nX, int nY,
+                                    fg_dtype type) {
+    unsigned long long surfaceSize = nX * static_cast<unsigned long long>(nY);
+    assert(surfaceSize <= (1ULL << 48ULL));
+    unsigned long long key = ((surfaceSize & _48BIT) << 16U) | (type & _16BIT);
+
+    ChartKey keypair = make_pair(key, chart);
+    auto iter        = mSfcMap.find(keypair);
+
+    if (iter == mSfcMap.end()) {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart));
+        if (chart_type != FG_CHART_3D) {
+            AF_ERROR("Surface can only be added to chart of type FG_CHART_3D",
+                     AF_ERR_TYPE);
+        }
+        fg_surface surf = nullptr;
+        FG_CHECK(mPlugin->fg_create_surface(&surf, nX, nY, type,
+                                            FG_PLOT_SURFACE, FG_MARKER_NONE));
+        FG_CHECK(mPlugin->fg_append_surface_to_chart(chart, surf));
+        mSfcMap[keypair] = SurfacePtr(new Surface({surf}));
+    }
+    return mSfcMap[keypair]->handle;
+}
+
+fg_vector_field ForgeManager::getVectorField(fg_chart chart, int nPoints,
+                                             fg_dtype type) {
+    unsigned long long key =
+        ((static_cast<unsigned long long>(nPoints) & _48BIT) << 16U) |
+        (type & _16BIT);
+
+    ChartKey keypair = make_pair(key, chart);
+    auto iter        = mVcfMap.find(keypair);
+
+    if (iter == mVcfMap.end()) {
+        fg_chart_type chart_type;
+        FG_CHECK(mPlugin->fg_get_chart_type(&chart_type, chart));
+
+        fg_vector_field vfield = nullptr;
+        FG_CHECK(mPlugin->fg_create_vector_field(&vfield, nPoints, type,
+                                                 chart_type));
+        FG_CHECK(mPlugin->fg_append_vector_field_to_chart(chart, vfield));
+        mVcfMap[keypair] = VectorFieldPtr(new VectorField({vfield}));
+    }
+    return mVcfMap[keypair]->handle;
+}
+
+bool ForgeManager::getChartAxesOverride(const fg_chart chart) {
+    auto iter = mChartAxesOverrideMap.find(chart);
+    if (iter == mChartAxesOverrideMap.end()) {
+        AF_ERROR("Chart Not Found!", AF_ERR_INTERNAL);
+    }
+    return mChartAxesOverrideMap[chart];
+}
+
+void ForgeManager::setChartAxesOverride(const fg_chart chart, bool flag) {
+    auto iter = mChartAxesOverrideMap.find(chart);
+    if (iter == mChartAxesOverrideMap.end()) {
+        AF_ERROR("Chart Not Found!", AF_ERR_INTERNAL);
+    }
+    mChartAxesOverrideMap[chart] = flag;
+}
+
+}  // namespace common
+}  // namespace arrayfire
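
Review note: the resource caches in `ForgeManager` (`genImageKey`, `getPlot`, `getHistogram`, `getSurface`, `getVectorField`) all build a 64-bit lookup key by packing several small fields into one integer. The following is a minimal standalone sketch of that idea, not the library's code: `MASK_16BIT`, `packImageKey` and `unpackField` are hypothetical names used only for illustration, and the sketch assumes C++14 for the multi-statement `constexpr` function. Each field is masked with bitwise AND before being shifted into its slot, so that distinct field values yield distinct keys.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 16-bit field mask.
constexpr std::uint64_t MASK_16BIT = 0xFFFFULL;

// Pack four 16-bit fields into one 64-bit key laid out as
// [ w | h | mode | type ], with w in the highest 16 bits.
constexpr std::uint64_t packImageKey(unsigned w, unsigned h, unsigned mode,
                                     unsigned type) {
    std::uint64_t key = ((w & MASK_16BIT) << 16U) | (h & MASK_16BIT);
    key = (((key << 16U) | (mode & MASK_16BIT)) << 16U) | (type & MASK_16BIT);
    return key;
}

// Recover one field; slot 0 is the lowest 16 bits (type), slot 3 is w.
constexpr unsigned unpackField(std::uint64_t key, unsigned slot) {
    return static_cast<unsigned>((key >> (16U * slot)) & MASK_16BIT);
}

// Every field survives a round trip through the key.
static_assert(unpackField(packImageKey(640, 480, 3, 1), 3) == 640, "w");
static_assert(unpackField(packImageKey(640, 480, 3, 1), 2) == 480, "h");
static_assert(unpackField(packImageKey(640, 480, 3, 1), 1) == 3, "mode");
static_assert(unpackField(packImageKey(640, 480, 3, 1), 0) == 1, "type");
```

Because each field is masked to 16 bits, a value of 2^16 or larger wraps around and can collide with a smaller one, which is why the header documents size limits for the cached renderables.
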
diff --git a/src/backend/common/graphics_common.hpp b/src/backend/common/graphics_common.hpp
new file mode 100644
index 0000000000..ec59033fcb
--- /dev/null
+++ b/src/backend/common/graphics_common.hpp
@@ -0,0 +1,313 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/forge_loader.hpp>
+#include <af/graphics.h>
+
+#include <map>
+#include <memory>
+#include <utility>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+// default to f32(float) type
+template<typename T>
+fg_dtype getGLType();
+
+// Check for OpenGL errors (debug builds only) and report them via AF_ERROR.
+// Returns the glGetError() code: GL_NO_ERROR (0) when no error is pending.
+GLenum glErrorCheck(const char* msg, const char* file, int line);
+
+#define CheckGL(msg) \
+    arrayfire::common::glErrorCheck(msg, __AF_FILENAME__, __LINE__)
+
+fg_marker_type getFGMarker(const af_marker_type af_marker);
+
+void makeContextCurrent(fg_window window);
+
+double step_round(const double in, const bool dir);
+
+/// \brief The singleton manager class for Forge resources
+///
+/// Only the device manager class can create objects of this class.
+/// You have to call forgeManager() defined in platform.hpp to
+/// access the object. It manages the windows and the other
+/// renderables (given below) that are drawn onto the chosen window.
+/// Renderables:
+///      fg_image
+///      fg_plot
+///      fg_histogram
+///      fg_surface
+///      fg_vector_field
+///
+class ForgeManager {
+   public:
+    using WindowGridDims = std::pair<int, int>;
+
+    ForgeManager();
+    ForgeManager(ForgeManager const&)            = delete;
+    ForgeManager& operator=(ForgeManager const&) = delete;
+    ForgeManager(ForgeManager&&)                 = delete;
+    ForgeManager& operator=(ForgeManager&&)      = delete;
+
+    /// \brief Module used to invoke forge API calls
+    common::ForgeModule& plugin();
+
+    /// \brief The main window with which all other windows share GL context
+    fg_window getMainWindow();
+
+    /// \brief Create a window
+    ///
+    /// \param[in] width of the window
+    /// \param[in] height of the window
+    /// \param[in] title is the window title
+    /// \param[in] invisible indicates whether an invisible window
+    ///            has to be created
+    ///
+    /// \note Any window created will always share its OpenGL context
+    ///       with the primary (getMainWindow()) window.
+    fg_window getWindow(const int width, const int height,
+                        const char* const title, const bool invisible = false);
+
+    /// \brief Set grid layout for a given Window
+    ///
+    /// Grid layout dictates how many renderables can be shown in a
+    /// single window. For example, if r = 2, c = 2, the entire rendering
+    /// area of the window will be split into four sections into which
+    /// different renderables can be drawn.
+    ///
+    /// \param[in] window is the target rendering context
+    /// \param[in] r is the number of rows in the grid
+    /// \param[in] c is the number of cols in the grid
+    void setWindowChartGrid(const fg_window window, const int r, const int c);
+
+    /// \brief Get grid layout of a window
+    ///
+    /// This function fetches the grid layout set for the given window,
+    /// typically by the function \ref ForgeManager::setWindowChartGrid
+    ///
+    /// \param[in] window is the target rendering context
+    WindowGridDims getWindowGrid(const fg_window window);
+
+    /// \brief Find/Create a Chart
+    ///
+    /// This function tries to find a chart fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching chart
+    /// resource handle is returned. If no match is found, a new chart
+    /// with given parameters is created, cached and returned.
+    ///
+    /// \param[in] window is the target rendering context
+    /// \param[in] r indicates the row index in the grid layout
+    ///            of the given \p window. This is usually 0 for grids having
+    ///            a single cell, i.e. capable of drawing one renderable.
+    /// \param[in] c indicates the col index in the grid layout
+    ///            of the given \p window. This is usually 0 for grids having
+    ///            a single cell, i.e. capable of drawing one renderable.
+    /// \param[in] ctype is the type of renderables drawn on the chart, 2D or 3D
+    fg_chart getChart(const fg_window window, const int r, const int c,
+                      const fg_chart_type ctype);
+
+    /// \brief Find/Create an Image
+    ///
+    /// This function tries to find an image fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching image
+    /// resource handle is returned. If no match is found, a new image
+    /// with given parameters is created, cached and returned.
+    ///
+    /// Also do keep in mind this function has to be used only when you
+    /// are rendering just an image to the window. If you want to render
+    /// an image embedded into set of plots or anything else, use the getImage
+    /// member function that takes in \ref fg_chart as first parameter.
+    ///
+    /// \param[in] w is width of the image
+    /// \param[in] h is height of the image
+    /// \param[in] mode is the pixel packing format in the image
+    /// \param[in] type is type of data to be stored in image buffer
+    ///
+    /// \note The width and height of the image need to fall in the range of
+    /// [0, 2^16] for the ForgeManager to correctly retrieve the necessary
+    /// Forge Image object. This is an implementation limitation on how big
+    /// of an image can be rendered using arrayfire graphics functionality
+    fg_image getImage(int w, int h, fg_channel_format mode, fg_dtype type);
+
+    /// \brief Find/Create an Image to render in a Chart
+    ///
+    /// This function tries to find an image fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching image
+    /// resource handle is returned. If no match is found, a new image
+    /// with given parameters is created, cached and returned.
+    ///
+    /// \param[in] chart is the chart to which image will be rendered
+    /// \param[in] w is width of the image
+    /// \param[in] h is height of the image
+    /// \param[in] mode is the pixel packing format in the image
+    /// \param[in] type is type of data to be stored in image buffer
+    ///
+    /// \note The width and height of the image need to fall in the range of
+    /// [0, 2^16] for the ForgeManager to correctly retrieve the necessary
+    /// Forge Image object. This is an implementation limitation on how big
+    /// of an image can be rendered using arrayfire graphics functionality
+    fg_image getImage(fg_chart chart, int w, int h, fg_channel_format mode,
+                      fg_dtype type);
+
+    /// \brief Find/Create a Plot to render in a Chart
+    ///
+    /// This function tries to find a plot fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching plot
+    /// resource handle is returned. If no match is found, a new plot
+    /// with given parameters is created, cached and returned.
+    ///
+    /// \param[in] chart is the chart to which plot will be rendered
+    /// \param[in] nPoints is number of points in the plot
+    /// \param[in] dtype is type of data to be stored in plot buffer
+    /// \param[in] ptype indicates the type of plot \ref fg_plot_type
+    /// \param[in] mtype indicates the type of marker/sprite to render original
+    ///            points passed in the data buffer, \ref fg_marker_type
+    ///
+    /// \note \p nPoints needs to fall in the range of [0, 2^48]
+    /// for the ForgeManager to correctly retrieve the necessary Forge
+    /// plot object. This is an implementation limitation on how big of a
+    /// plot can be rendered using arrayfire graphics functionality
+    fg_plot getPlot(fg_chart chart, int nPoints, fg_dtype dtype,
+                    fg_plot_type ptype, fg_marker_type mtype);
+
+    /// \brief Find/Create a Histogram to render in a Chart
+    ///
+    /// This function tries to find a histogram fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching histogram
+    /// resource handle is returned. If no match is found, a new histogram
+    /// with given parameters is created, cached and returned.
+    ///
+    /// \param[in] chart is the chart to which histogram will be rendered
+    /// \param[in] nBins is the total number of bins in the histogram
+    /// \param[in] type is type of data to be stored in histogram buffer
+    ///
+    /// \note \p nBins needs to fall in the range of [0, 2^48]
+    /// for the ForgeManager to correctly retrieve the necessary Forge
+    /// histogram object. This is an implementation limitation on how big
+    /// of a histogram can be rendered using arrayfire graphics functionality
+    fg_histogram getHistogram(fg_chart chart, int nBins, fg_dtype type);
+
+    /// \brief Find/Create a Surface to render in a Chart
+    ///
+    /// This function tries to find a surface fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching surface
+    /// resource handle is returned. If no match is found, a new surface
+    /// with given parameters is created, cached and returned.
+    ///
+    /// \param[in] chart is the chart to which surface will be rendered
+    /// \param[in] nX is length of the surface grid
+    /// \param[in] nY is width of the surface grid
+    /// \param[in] type is type of data to be stored in surface buffer
+    ///
+    /// \note \p nX * \p nY needs to fall in the range of [0, 2^48]
+    /// for the ForgeManager to correctly retrieve the necessary Forge Surface
+    /// object. This is an implementation limitation on how big of a surface
+    /// can be rendered using arrayfire graphics functionality
+    fg_surface getSurface(fg_chart chart, int nX, int nY, fg_dtype type);
+
+    /// \brief Find/Create a Vector Field to render in a Chart
+    ///
+    /// This function tries to find a vector field fitting the given attributes
+    /// from forge resource cache. If a match is found, the matching vector
+    /// field resource handle is returned. If no match is found, a new vector
+    /// field with given parameters is created, cached and returned.
+    ///
+    /// \param[in] chart is the chart to which vector field will be rendered
+    /// \param[in] nPoints is number of points in the 2D vector field
+    /// \param[in] type is type of data to be stored in vector field buffer
+    ///
+    /// \note \p nPoints needs to fall in the range of [0, 2^48]
+    /// for the ForgeManager to correctly retrieve the necessary Forge vector
+    /// field object. This is an implementation limitation on how big of a
+    /// vector field can be rendered using arrayfire graphics functionality
+    fg_vector_field getVectorField(fg_chart chart, int nPoints, fg_dtype type);
+
+    /// \brief Get chart axes limits override flag
+    ///
+    /// \param[in] chart is the target chart for which axes limits will be
+    /// overridden
+    bool getChartAxesOverride(const fg_chart chart);
+
+    /// \brief Set chart axes limits override flag
+    ///
+    /// \param[in] chart is the target chart whose axes limits are overridden
+    /// \param[in] flag indicates if axes limits are overridden or not
+    void setChartAxesOverride(const fg_chart chart, bool flag = true);
+
+   private:
+    constexpr static unsigned int WIDTH        = 1280;
+    constexpr static unsigned int HEIGHT       = 720;
+    constexpr static unsigned long long _4BIT  = 0x000000000000000F;
+    constexpr static unsigned long long _8BIT  = 0x00000000000000FF;
+    constexpr static unsigned long long _16BIT = 0x000000000000FFFF;
+    constexpr static unsigned long long _32BIT = 0x00000000FFFFFFFF;
+    constexpr static unsigned long long _48BIT = 0x0000FFFFFFFFFFFF;
+
+    static unsigned long long genImageKey(unsigned w, unsigned h,
+                                          fg_channel_format mode,
+                                          fg_dtype type);
+
+#define DEFINE_WRAPPER_OBJECT(OBJECT, RELEASE)                           \
+    struct OBJECT {                                                      \
+        void* handle;                                                    \
+        struct Deleter {                                                 \
+            void operator()(OBJECT* pHandle) const {                     \
+                if (pHandle) { forgePlugin().RELEASE(pHandle->handle); } \
+            }                                                            \
+        };                                                               \
+    }
+
+    DEFINE_WRAPPER_OBJECT(Window, fg_release_window);
+    DEFINE_WRAPPER_OBJECT(Image, fg_release_image);
+    DEFINE_WRAPPER_OBJECT(Chart, fg_release_chart);
+    DEFINE_WRAPPER_OBJECT(Plot, fg_release_plot);
+    DEFINE_WRAPPER_OBJECT(Histogram, fg_release_histogram);
+    DEFINE_WRAPPER_OBJECT(Surface, fg_release_surface);
+    DEFINE_WRAPPER_OBJECT(VectorField, fg_release_vector_field);
+
+#undef DEFINE_WRAPPER_OBJECT
+
+    using ImagePtr       = std::unique_ptr<Image, Image::Deleter>;
+    using ChartPtr       = std::unique_ptr<Chart, Chart::Deleter>;
+    using PlotPtr        = std::unique_ptr<Plot, Plot::Deleter>;
+    using SurfacePtr     = std::unique_ptr<Surface, Surface::Deleter>;
+    using HistogramPtr   = std::unique_ptr<Histogram, Histogram::Deleter>;
+    using VectorFieldPtr = std::unique_ptr<VectorField, VectorField::Deleter>;
+    using ChartList      = std::vector<ChartPtr>;
+    using ChartKey       = std::pair<unsigned long long, fg_chart>;
+
+    using ChartMapIterator     = std::map<fg_window, ChartList>::iterator;
+    using WindGridMapIterator  = std::map<fg_window, WindowGridDims>::iterator;
+    using AxesOverrideIterator = std::map<fg_chart, bool>::iterator;
+    using ImageMapIterator     = std::map<ChartKey, ImagePtr>::iterator;
+    using PlotMapIterator      = std::map<ChartKey, PlotPtr>::iterator;
+    using HistogramMapIterator = std::map<ChartKey, HistogramPtr>::iterator;
+    using SurfaceMapIterator   = std::map<ChartKey, SurfacePtr>::iterator;
+    using VecFieldMapIterator  = std::map<ChartKey, VectorFieldPtr>::iterator;
+
+    std::unique_ptr<common::ForgeModule> mPlugin;
+    std::unique_ptr<Window, Window::Deleter> mMainWindow;
+
+    std::map<fg_window, ChartList> mChartMap;
+    std::map<ChartKey, ImagePtr> mImgMap;
+    std::map<ChartKey, PlotPtr> mPltMap;
+    std::map<ChartKey, HistogramPtr> mHstMap;
+    std::map<ChartKey, SurfacePtr> mSfcMap;
+    std::map<ChartKey, VectorFieldPtr> mVcfMap;
+    std::map<fg_window, WindowGridDims> mWndGridMap;
+    std::map<fg_chart, bool> mChartAxesOverrideMap;
+};
+
+}  // namespace common
+}  // namespace arrayfire
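
Review note: `DEFINE_WRAPPER_OBJECT` pairs an opaque `void*` handle with a nested `Deleter`, so that `std::unique_ptr` can release the underlying resource automatically. Below is a minimal standalone sketch of that pattern under stated assumptions: `fake_release`, `g_released`, `Handle`, `HandlePtr` and `releaseCountAfterScope` are hypothetical names, and `fake_release` merely stands in for a C-style release call such as the `fg_release_*` functions. In this sketch the deleter also frees the heap-allocated wrapper itself, which is a design choice of the example and not a statement about the header above.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a C-style release function; it only counts calls.
static int g_released = 0;
inline void fake_release(void* handle) {
    if (handle) { ++g_released; }
}

// Wrapper that owns an opaque handle and knows how to release it.
struct Handle {
    void* handle;
    struct Deleter {
        void operator()(Handle* p) const {
            if (p) { fake_release(p->handle); }  // release the resource
            delete p;                            // then free the wrapper
        }
    };
};

using HandlePtr = std::unique_ptr<Handle, Handle::Deleter>;

// Creates a wrapper in an inner scope and reports how many releases
// happened once that scope has ended.
inline int releaseCountAfterScope() {
    g_released = 0;
    int resource = 0;  // any object whose address can serve as a handle
    {
        HandlePtr p(new Handle{&resource});
    }  // p goes out of scope here and Deleter runs exactly once
    return g_released;
}
```

Storing the deleter type in the `unique_ptr` alias keeps ownership explicit at each call site and makes the containers of such pointers (for example a `std::map` of cached renderables) release everything when they are cleared or destroyed.
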
diff --git a/src/backend/common/half.cpp b/src/backend/common/half.cpp
new file mode 100644
index 0000000000..249346b038
--- /dev/null
+++ b/src/backend/common/half.cpp
@@ -0,0 +1,17 @@
+
+#include <common/half.hpp>
+#include <common/util.hpp>
+
+namespace arrayfire {
+namespace common {
+std::ostream &operator<<(std::ostream &os, const half &val) {
+    os << float(val);
+    return os;
+}
+
+template<>
+std::string toString(const half val) {
+    return common::toString(static_cast<float>(val));
+}
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/half.hpp b/src/backend/common/half.hpp
new file mode 100644
index 0000000000..42d18be47b
--- /dev/null
+++ b/src/backend/common/half.hpp
@@ -0,0 +1,1341 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#if defined(__NVCC__) || defined(__CUDACC_RTC__)
+
+// MSVC sets __cplusplus to 199711L for all versions unless you specify
+// the new /Zc:__cplusplus flag in Visual Studio 2017. This is not possible
+// in older versions of MSVC so we update it here for the cuda_fp16 header
+// because otherwise it does not define the default constructor for __half
+// as default and that prevents the __half struct from being used in a
+// constexpr expression
+#if defined(_MSC_VER) && __cplusplus == 199711L
+#undef __cplusplus
+#define __cplusplus 201402L
+#define AF_CPLUSPLUS_CHANGED
+#endif
+
+#include <cuda_fp16.h>
+
+#ifdef AF_CPLUSPLUS_CHANGED
+#undef __cplusplus
+#undef AF_CPLUSPLUS_CHANGED
+#define __cplusplus 199711L
+#endif
+#endif
+
+#ifdef AF_ONEAPI
+#include <sycl/sycl.hpp>
+#endif
+
+#include <backend.hpp>
+
+#ifdef __CUDACC_RTC__
+
+#if defined(__cpp_if_constexpr) || __cplusplus >= 201606L
+#define AF_IF_CONSTEXPR if constexpr
+#else
+#define AF_IF_CONSTEXPR if
+#endif
+
+namespace std {
+enum float_round_style {
+    round_indeterminate       = -1,
+    round_toward_zero         = 0,
+    round_to_nearest          = 1,
+    round_toward_infinity     = 2,
+    round_toward_neg_infinity = 3
+};
+
+template<bool B, class T = void>
+struct enable_if {};
+
+template<class T>
+struct enable_if<true, T> {
+    typedef T type;
+};
+
+template<bool B, class T = void>
+using enable_if_t = typename enable_if<B, T>::type;
+
+template<class T, class U>
+struct is_same {
+    static constexpr bool value = false;
+};
+
+template<class T>
+struct is_same<T, T> {
+    static constexpr bool value = true;
+};
+
+template<class T, class U>
+constexpr bool is_same_v = is_same<T, U>::value;
+
+}  //  namespace std
+
+using uint16_t = unsigned short;
+// we do not include the af/compilers header in nvrtc compilations so
+// we are defining the AF_CONSTEXPR expression here
+#define AF_CONSTEXPR constexpr
+#else
+#include <af/compilers.h>
+#include <algorithm>
+#include <cmath>
+#include <cstdint>
+#include <cstring>
+#include <ostream>
+#include <string>
+#include <type_traits>
+
+#include <limits>
+
+#endif
+
+namespace arrayfire {
+namespace common {
+
+#if defined(__CUDA_ARCH__)
+using native_half_t = __half;
+#elif defined(AF_ONEAPI)
+using native_half_t = sycl::half;
+#else
+using native_half_t = uint16_t;
+#endif
+
+#ifdef __CUDACC_RTC__
+template<std::float_round_style R = std::round_to_nearest>
+AF_CONSTEXPR __DH__ native_half_t float2half_impl(float value) {
+    return __float2half_rn(value);
+}
+
+template<std::float_round_style R = std::round_to_nearest>
+AF_CONSTEXPR __DH__ native_half_t float2half_impl(double value) {
+    return __float2half_rn(value);
+}
+
+AF_CONSTEXPR
+__DH__ inline float half2float_impl(native_half_t value) noexcept {
+    return __half2float(value);
+}
+
+template<typename T>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(T value) noexcept;
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(int value) noexcept {
+    return __int2half_rn(value);
+}
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(unsigned value) noexcept {
+    return __uint2half_rn(value);
+}
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(long long value) noexcept {
+    return __ll2half_rn(value);
+}
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t
+int2half_impl(unsigned long long value) noexcept {
+    return __ull2half_rn(value);
+}
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(short value) noexcept {
+    return __short2half_rn(value);
+}
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(unsigned short value) noexcept {
+    return __ushort2half_rn(value);
+}
+
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(char value) noexcept {
+    return __int2half_rn(value);  // signed path: negative values must not wrap
+}
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(signed char value) noexcept {
+    return __int2half_rn(value);  // signed path: negative values must not wrap
+}
+template<>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(unsigned char value) noexcept {
+    return __ull2half_rn(value);
+}
+
+#elif defined(AF_ONEAPI)
+
+template<std::float_round_style R = std::round_to_nearest>
+AF_CONSTEXPR native_half_t float2half_impl(float value) {
+    return static_cast<native_half_t>(value);
+}
+
+template<std::float_round_style R = std::round_to_nearest>
+AF_CONSTEXPR native_half_t float2half_impl(double value) {
+    return static_cast<native_half_t>(value);
+}
+
+inline float half2float_impl(native_half_t value) noexcept {
+    return static_cast<float>(value);
+}
+
+template<std::float_round_style R, bool S, typename T>
+AF_CONSTEXPR native_half_t int2half_impl(T value) noexcept {
+    return static_cast<native_half_t>(value);
+}
+
+#else
+
+/// Convert integer to half-precision floating point.
+///
+/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest
+///           rounding
+/// \tparam S `true` if the value is negative, `false` otherwise
+/// \tparam T type to convert (builtin integer type)
+///
+/// \param value integral value; negated internally when `S` is `true`
+///
+/// \return binary representation of half-precision value
+template<std::float_round_style R, bool S, typename T>
+AF_CONSTEXPR __DH__ native_half_t int2half_impl(T value) noexcept {
+    static_assert(std::is_integral<T>::value,
+                  "int to half conversion only supports builtin integer types");
+    if (S) value = -value;
+    uint16_t bits = S << 15;
+    if (value > 0xFFFF) {
+        AF_IF_CONSTEXPR(R == std::round_toward_infinity)
+        bits |= (0x7C00 - S);
+        else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) bits |=
+            (0x7BFF + S);
+        else bits |= (0x7BFF + (R != std::round_toward_zero));
+    } else if (value) {
+        uint32_t m = value, exp = 24;
+        for (; m < 0x400; m <<= 1, --exp)
+            ;
+        for (; m > 0x7FF; m >>= 1, ++exp)
+            ;
+        bits |= (exp << 10) + m;
+        if (exp > 24) {
+            AF_IF_CONSTEXPR(R == std::round_to_nearest)
+            bits += (value >> (exp - 25)) & 1
+#if HALF_ROUND_TIES_TO_EVEN
+                    & (((((1 << (exp - 25)) - 1) & value) != 0) | bits)
+#endif
+                ;
+            else AF_IF_CONSTEXPR(R == std::round_toward_infinity) bits +=
+                ((value & ((1 << (exp - 24)) - 1)) != 0) & !S;
+            else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) bits +=
+                ((value & ((1 << (exp - 24)) - 1)) != 0) & S;
+        }
+    }
+    return bits;
+}
+
+/// Convert IEEE single-precision to half-precision.
+/// Credit for this goes to [Jeroen van der
+/// Zijp](ftp://ftp.fox-toolkit.org/pub/fasthalffloatconversion.pdf).
+/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest
+/// rounding
+///
+/// \param value single-precision value
+/// \return binary representation of half-precision value
+template<std::float_round_style R = std::round_to_nearest>
+__DH__ native_half_t float2half_impl(float value) noexcept {
+    alignas(std::max(alignof(uint32_t), alignof(float))) float _value = value;
+    uint32_t bits = *reinterpret_cast<uint32_t*>(&_value);
+
+    constexpr uint16_t base_table[512] = {
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000, 0x0000,
+        0x0000, 0x0000, 0x0000, 0x0000, 0x0001, 0x0002, 0x0004, 0x0008, 0x0010,
+        0x0020, 0x0040, 0x0080, 0x0100, 0x0200, 0x0400, 0x0800, 0x0C00, 0x1000,
+        0x1400, 0x1800, 0x1C00, 0x2000, 0x2400, 0x2800, 0x2C00, 0x3000, 0x3400,
+        0x3800, 0x3C00, 0x4000, 0x4400, 0x4800, 0x4C00, 0x5000, 0x5400, 0x5800,
+        0x5C00, 0x6000, 0x6400, 0x6800, 0x6C00, 0x7000, 0x7400, 0x7800, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x7C00,
+        0x7C00, 0x7C00, 0x7C00, 0x7C00, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000,
+        0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8000, 0x8001,
+        0x8002, 0x8004, 0x8008, 0x8010, 0x8020, 0x8040, 0x8080, 0x8100, 0x8200,
+        0x8400, 0x8800, 0x8C00, 0x9000, 0x9400, 0x9800, 0x9C00, 0xA000, 0xA400,
+        0xA800, 0xAC00, 0xB000, 0xB400, 0xB800, 0xBC00, 0xC000, 0xC400, 0xC800,
+        0xCC00, 0xD000, 0xD400, 0xD800, 0xDC00, 0xE000, 0xE400, 0xE800, 0xEC00,
+        0xF000, 0xF400, 0xF800, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00,
+        0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00, 0xFC00};
+
+    constexpr uint8_t shift_table[512] = {
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 23, 22, 21, 20, 19,
+        18, 17, 16, 15, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
+        13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 13, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 23,
+        22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 13, 13, 13, 13, 13, 13, 13, 13,
+        13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13, 13,
+        13, 13, 13, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24, 24,
+        24, 24, 24, 24, 24, 24, 24, 13};
+    alignas(std::max(alignof(uint16_t), alignof(native_half_t)))
+        uint16_t hbits =
+            base_table[bits >> 23] +
+            static_cast<uint16_t>((bits & 0x7FFFFF) >> shift_table[bits >> 23]);
+    AF_IF_CONSTEXPR(R == std::round_to_nearest)
+    hbits +=
+        (((bits & 0x7FFFFF) >> (shift_table[bits >> 23] - 1)) |
+         (((bits >> 23) & 0xFF) == 102)) &
+        ((hbits & 0x7C00) != 0x7C00)
+#if HALF_ROUND_TIES_TO_EVEN
+        & (((((static_cast<uint32_t>(1) << (shift_table[bits >> 23] - 1)) - 1) &
+             bits) != 0) |
+           hbits)
+#endif
+        ;
+    else AF_IF_CONSTEXPR(R == std::round_toward_zero) hbits -=
+        ((hbits & 0x7FFF) == 0x7C00) & ~shift_table[bits >> 23];
+    else AF_IF_CONSTEXPR(R == std::round_toward_infinity) hbits +=
+        ((((bits & 0x7FFFFF &
+            ((static_cast<uint32_t>(1) << (shift_table[bits >> 23])) - 1)) !=
+           0) |
+          (((bits >> 23) <= 102) & ((bits >> 23) != 0))) &
+         (hbits < 0x7C00)) -
+        ((hbits == 0xFC00) & ((bits >> 23) != 511));
+    else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) hbits +=
+        ((((bits & 0x7FFFFF &
+            ((static_cast<uint32_t>(1) << (shift_table[bits >> 23])) - 1)) !=
+           0) |
+          (((bits >> 23) <= 358) & ((bits >> 23) != 256))) &
+         (hbits < 0xFC00) & (hbits >> 15)) -
+        ((hbits == 0x7C00) & ((bits >> 23) != 255));
+
+    return *reinterpret_cast<native_half_t*>(&hbits);
+}
+
+/// Convert IEEE double-precision to half-precision.
+///
+/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest
+///           rounding
+/// \param value double-precision value
+///
+/// \return binary representation of half-precision value
+template<std::float_round_style R = std::round_to_nearest>
+__DH__ native_half_t float2half_impl(double value) {
+    alignas(std::max(alignof(uint64_t), alignof(double))) double _value = value;
+    uint64_t bits = *reinterpret_cast<uint64_t*>(&_value);
+    uint32_t hi = bits >> 32, lo = bits & 0xFFFFFFFF;
+    alignas(std::max(alignof(uint16_t), alignof(native_half_t)))
+        uint16_t hbits = (hi >> 16) & 0x8000;
+    hi &= 0x7FFFFFFF;
+    int exp = hi >> 20;
+    if (exp == 2047)
+        return hbits | 0x7C00 |
+               (0x3FF & -static_cast<unsigned>((bits & 0xFFFFFFFFFFFFF) != 0));
+    if (exp > 1038) {
+        AF_IF_CONSTEXPR(R == std::round_toward_infinity)
+        return hbits | (0x7C00 - (hbits >> 15));
+        AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity)
+        return hbits | (0x7BFF + (hbits >> 15));
+        return hbits | (0x7BFF + (R != std::round_toward_zero));
+    }
+    int g = 0, s = lo != 0;
+    if (exp > 1008) {
+        g = (hi >> 9) & 1;
+        s |= (hi & 0x1FF) != 0;
+        hbits |= ((exp - 1008) << 10) | ((hi >> 10) & 0x3FF);
+    } else if (exp > 997) {
+        int i = 1018 - exp;
+        hi    = (hi & 0xFFFFF) | 0x100000;
+        g     = (hi >> i) & 1;
+        s |= (hi & ((1L << i) - 1)) != 0;
+        hbits |= hi >> (i + 1);
+    } else {
+        s |= hi != 0;
+    }
+    AF_IF_CONSTEXPR(R == std::round_to_nearest)
+#if HALF_ROUND_TIES_TO_EVEN
+    hbits += g & (s | hbits);
+#else
+    hbits += g;
+#endif
+    else AF_IF_CONSTEXPR(R == std::round_toward_infinity) hbits +=
+        ~(hbits >> 15) & (s | g);
+    else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) hbits +=
+        (hbits >> 15) & (g | s);
+
+    return *reinterpret_cast<native_half_t*>(&hbits);
+}
+
+__DH__ inline float half2float_impl(native_half_t value) noexcept {
+    // Table-based conversion; on x86 with F16C, _cvtsh_ss could be used instead.
+    constexpr uint32_t mantissa_table[2048] = {
+        0x00000000, 0x33800000, 0x34000000, 0x34400000, 0x34800000, 0x34A00000,
+        0x34C00000, 0x34E00000, 0x35000000, 0x35100000, 0x35200000, 0x35300000,
+        0x35400000, 0x35500000, 0x35600000, 0x35700000, 0x35800000, 0x35880000,
+        0x35900000, 0x35980000, 0x35A00000, 0x35A80000, 0x35B00000, 0x35B80000,
+        0x35C00000, 0x35C80000, 0x35D00000, 0x35D80000, 0x35E00000, 0x35E80000,
+        0x35F00000, 0x35F80000, 0x36000000, 0x36040000, 0x36080000, 0x360C0000,
+        0x36100000, 0x36140000, 0x36180000, 0x361C0000, 0x36200000, 0x36240000,
+        0x36280000, 0x362C0000, 0x36300000, 0x36340000, 0x36380000, 0x363C0000,
+        0x36400000, 0x36440000, 0x36480000, 0x364C0000, 0x36500000, 0x36540000,
+        0x36580000, 0x365C0000, 0x36600000, 0x36640000, 0x36680000, 0x366C0000,
+        0x36700000, 0x36740000, 0x36780000, 0x367C0000, 0x36800000, 0x36820000,
+        0x36840000, 0x36860000, 0x36880000, 0x368A0000, 0x368C0000, 0x368E0000,
+        0x36900000, 0x36920000, 0x36940000, 0x36960000, 0x36980000, 0x369A0000,
+        0x369C0000, 0x369E0000, 0x36A00000, 0x36A20000, 0x36A40000, 0x36A60000,
+        0x36A80000, 0x36AA0000, 0x36AC0000, 0x36AE0000, 0x36B00000, 0x36B20000,
+        0x36B40000, 0x36B60000, 0x36B80000, 0x36BA0000, 0x36BC0000, 0x36BE0000,
+        0x36C00000, 0x36C20000, 0x36C40000, 0x36C60000, 0x36C80000, 0x36CA0000,
+        0x36CC0000, 0x36CE0000, 0x36D00000, 0x36D20000, 0x36D40000, 0x36D60000,
+        0x36D80000, 0x36DA0000, 0x36DC0000, 0x36DE0000, 0x36E00000, 0x36E20000,
+        0x36E40000, 0x36E60000, 0x36E80000, 0x36EA0000, 0x36EC0000, 0x36EE0000,
+        0x36F00000, 0x36F20000, 0x36F40000, 0x36F60000, 0x36F80000, 0x36FA0000,
+        0x36FC0000, 0x36FE0000, 0x37000000, 0x37010000, 0x37020000, 0x37030000,
+        0x37040000, 0x37050000, 0x37060000, 0x37070000, 0x37080000, 0x37090000,
+        0x370A0000, 0x370B0000, 0x370C0000, 0x370D0000, 0x370E0000, 0x370F0000,
+        0x37100000, 0x37110000, 0x37120000, 0x37130000, 0x37140000, 0x37150000,
+        0x37160000, 0x37170000, 0x37180000, 0x37190000, 0x371A0000, 0x371B0000,
+        0x371C0000, 0x371D0000, 0x371E0000, 0x371F0000, 0x37200000, 0x37210000,
+        0x37220000, 0x37230000, 0x37240000, 0x37250000, 0x37260000, 0x37270000,
+        0x37280000, 0x37290000, 0x372A0000, 0x372B0000, 0x372C0000, 0x372D0000,
+        0x372E0000, 0x372F0000, 0x37300000, 0x37310000, 0x37320000, 0x37330000,
+        0x37340000, 0x37350000, 0x37360000, 0x37370000, 0x37380000, 0x37390000,
+        0x373A0000, 0x373B0000, 0x373C0000, 0x373D0000, 0x373E0000, 0x373F0000,
+        0x37400000, 0x37410000, 0x37420000, 0x37430000, 0x37440000, 0x37450000,
+        0x37460000, 0x37470000, 0x37480000, 0x37490000, 0x374A0000, 0x374B0000,
+        0x374C0000, 0x374D0000, 0x374E0000, 0x374F0000, 0x37500000, 0x37510000,
+        0x37520000, 0x37530000, 0x37540000, 0x37550000, 0x37560000, 0x37570000,
+        0x37580000, 0x37590000, 0x375A0000, 0x375B0000, 0x375C0000, 0x375D0000,
+        0x375E0000, 0x375F0000, 0x37600000, 0x37610000, 0x37620000, 0x37630000,
+        0x37640000, 0x37650000, 0x37660000, 0x37670000, 0x37680000, 0x37690000,
+        0x376A0000, 0x376B0000, 0x376C0000, 0x376D0000, 0x376E0000, 0x376F0000,
+        0x37700000, 0x37710000, 0x37720000, 0x37730000, 0x37740000, 0x37750000,
+        0x37760000, 0x37770000, 0x37780000, 0x37790000, 0x377A0000, 0x377B0000,
+        0x377C0000, 0x377D0000, 0x377E0000, 0x377F0000, 0x37800000, 0x37808000,
+        0x37810000, 0x37818000, 0x37820000, 0x37828000, 0x37830000, 0x37838000,
+        0x37840000, 0x37848000, 0x37850000, 0x37858000, 0x37860000, 0x37868000,
+        0x37870000, 0x37878000, 0x37880000, 0x37888000, 0x37890000, 0x37898000,
+        0x378A0000, 0x378A8000, 0x378B0000, 0x378B8000, 0x378C0000, 0x378C8000,
+        0x378D0000, 0x378D8000, 0x378E0000, 0x378E8000, 0x378F0000, 0x378F8000,
+        0x37900000, 0x37908000, 0x37910000, 0x37918000, 0x37920000, 0x37928000,
+        0x37930000, 0x37938000, 0x37940000, 0x37948000, 0x37950000, 0x37958000,
+        0x37960000, 0x37968000, 0x37970000, 0x37978000, 0x37980000, 0x37988000,
+        0x37990000, 0x37998000, 0x379A0000, 0x379A8000, 0x379B0000, 0x379B8000,
+        0x379C0000, 0x379C8000, 0x379D0000, 0x379D8000, 0x379E0000, 0x379E8000,
+        0x379F0000, 0x379F8000, 0x37A00000, 0x37A08000, 0x37A10000, 0x37A18000,
+        0x37A20000, 0x37A28000, 0x37A30000, 0x37A38000, 0x37A40000, 0x37A48000,
+        0x37A50000, 0x37A58000, 0x37A60000, 0x37A68000, 0x37A70000, 0x37A78000,
+        0x37A80000, 0x37A88000, 0x37A90000, 0x37A98000, 0x37AA0000, 0x37AA8000,
+        0x37AB0000, 0x37AB8000, 0x37AC0000, 0x37AC8000, 0x37AD0000, 0x37AD8000,
+        0x37AE0000, 0x37AE8000, 0x37AF0000, 0x37AF8000, 0x37B00000, 0x37B08000,
+        0x37B10000, 0x37B18000, 0x37B20000, 0x37B28000, 0x37B30000, 0x37B38000,
+        0x37B40000, 0x37B48000, 0x37B50000, 0x37B58000, 0x37B60000, 0x37B68000,
+        0x37B70000, 0x37B78000, 0x37B80000, 0x37B88000, 0x37B90000, 0x37B98000,
+        0x37BA0000, 0x37BA8000, 0x37BB0000, 0x37BB8000, 0x37BC0000, 0x37BC8000,
+        0x37BD0000, 0x37BD8000, 0x37BE0000, 0x37BE8000, 0x37BF0000, 0x37BF8000,
+        0x37C00000, 0x37C08000, 0x37C10000, 0x37C18000, 0x37C20000, 0x37C28000,
+        0x37C30000, 0x37C38000, 0x37C40000, 0x37C48000, 0x37C50000, 0x37C58000,
+        0x37C60000, 0x37C68000, 0x37C70000, 0x37C78000, 0x37C80000, 0x37C88000,
+        0x37C90000, 0x37C98000, 0x37CA0000, 0x37CA8000, 0x37CB0000, 0x37CB8000,
+        0x37CC0000, 0x37CC8000, 0x37CD0000, 0x37CD8000, 0x37CE0000, 0x37CE8000,
+        0x37CF0000, 0x37CF8000, 0x37D00000, 0x37D08000, 0x37D10000, 0x37D18000,
+        0x37D20000, 0x37D28000, 0x37D30000, 0x37D38000, 0x37D40000, 0x37D48000,
+        0x37D50000, 0x37D58000, 0x37D60000, 0x37D68000, 0x37D70000, 0x37D78000,
+        0x37D80000, 0x37D88000, 0x37D90000, 0x37D98000, 0x37DA0000, 0x37DA8000,
+        0x37DB0000, 0x37DB8000, 0x37DC0000, 0x37DC8000, 0x37DD0000, 0x37DD8000,
+        0x37DE0000, 0x37DE8000, 0x37DF0000, 0x37DF8000, 0x37E00000, 0x37E08000,
+        0x37E10000, 0x37E18000, 0x37E20000, 0x37E28000, 0x37E30000, 0x37E38000,
+        0x37E40000, 0x37E48000, 0x37E50000, 0x37E58000, 0x37E60000, 0x37E68000,
+        0x37E70000, 0x37E78000, 0x37E80000, 0x37E88000, 0x37E90000, 0x37E98000,
+        0x37EA0000, 0x37EA8000, 0x37EB0000, 0x37EB8000, 0x37EC0000, 0x37EC8000,
+        0x37ED0000, 0x37ED8000, 0x37EE0000, 0x37EE8000, 0x37EF0000, 0x37EF8000,
+        0x37F00000, 0x37F08000, 0x37F10000, 0x37F18000, 0x37F20000, 0x37F28000,
+        0x37F30000, 0x37F38000, 0x37F40000, 0x37F48000, 0x37F50000, 0x37F58000,
+        0x37F60000, 0x37F68000, 0x37F70000, 0x37F78000, 0x37F80000, 0x37F88000,
+        0x37F90000, 0x37F98000, 0x37FA0000, 0x37FA8000, 0x37FB0000, 0x37FB8000,
+        0x37FC0000, 0x37FC8000, 0x37FD0000, 0x37FD8000, 0x37FE0000, 0x37FE8000,
+        0x37FF0000, 0x37FF8000, 0x38000000, 0x38004000, 0x38008000, 0x3800C000,
+        0x38010000, 0x38014000, 0x38018000, 0x3801C000, 0x38020000, 0x38024000,
+        0x38028000, 0x3802C000, 0x38030000, 0x38034000, 0x38038000, 0x3803C000,
+        0x38040000, 0x38044000, 0x38048000, 0x3804C000, 0x38050000, 0x38054000,
+        0x38058000, 0x3805C000, 0x38060000, 0x38064000, 0x38068000, 0x3806C000,
+        0x38070000, 0x38074000, 0x38078000, 0x3807C000, 0x38080000, 0x38084000,
+        0x38088000, 0x3808C000, 0x38090000, 0x38094000, 0x38098000, 0x3809C000,
+        0x380A0000, 0x380A4000, 0x380A8000, 0x380AC000, 0x380B0000, 0x380B4000,
+        0x380B8000, 0x380BC000, 0x380C0000, 0x380C4000, 0x380C8000, 0x380CC000,
+        0x380D0000, 0x380D4000, 0x380D8000, 0x380DC000, 0x380E0000, 0x380E4000,
+        0x380E8000, 0x380EC000, 0x380F0000, 0x380F4000, 0x380F8000, 0x380FC000,
+        0x38100000, 0x38104000, 0x38108000, 0x3810C000, 0x38110000, 0x38114000,
+        0x38118000, 0x3811C000, 0x38120000, 0x38124000, 0x38128000, 0x3812C000,
+        0x38130000, 0x38134000, 0x38138000, 0x3813C000, 0x38140000, 0x38144000,
+        0x38148000, 0x3814C000, 0x38150000, 0x38154000, 0x38158000, 0x3815C000,
+        0x38160000, 0x38164000, 0x38168000, 0x3816C000, 0x38170000, 0x38174000,
+        0x38178000, 0x3817C000, 0x38180000, 0x38184000, 0x38188000, 0x3818C000,
+        0x38190000, 0x38194000, 0x38198000, 0x3819C000, 0x381A0000, 0x381A4000,
+        0x381A8000, 0x381AC000, 0x381B0000, 0x381B4000, 0x381B8000, 0x381BC000,
+        0x381C0000, 0x381C4000, 0x381C8000, 0x381CC000, 0x381D0000, 0x381D4000,
+        0x381D8000, 0x381DC000, 0x381E0000, 0x381E4000, 0x381E8000, 0x381EC000,
+        0x381F0000, 0x381F4000, 0x381F8000, 0x381FC000, 0x38200000, 0x38204000,
+        0x38208000, 0x3820C000, 0x38210000, 0x38214000, 0x38218000, 0x3821C000,
+        0x38220000, 0x38224000, 0x38228000, 0x3822C000, 0x38230000, 0x38234000,
+        0x38238000, 0x3823C000, 0x38240000, 0x38244000, 0x38248000, 0x3824C000,
+        0x38250000, 0x38254000, 0x38258000, 0x3825C000, 0x38260000, 0x38264000,
+        0x38268000, 0x3826C000, 0x38270000, 0x38274000, 0x38278000, 0x3827C000,
+        0x38280000, 0x38284000, 0x38288000, 0x3828C000, 0x38290000, 0x38294000,
+        0x38298000, 0x3829C000, 0x382A0000, 0x382A4000, 0x382A8000, 0x382AC000,
+        0x382B0000, 0x382B4000, 0x382B8000, 0x382BC000, 0x382C0000, 0x382C4000,
+        0x382C8000, 0x382CC000, 0x382D0000, 0x382D4000, 0x382D8000, 0x382DC000,
+        0x382E0000, 0x382E4000, 0x382E8000, 0x382EC000, 0x382F0000, 0x382F4000,
+        0x382F8000, 0x382FC000, 0x38300000, 0x38304000, 0x38308000, 0x3830C000,
+        0x38310000, 0x38314000, 0x38318000, 0x3831C000, 0x38320000, 0x38324000,
+        0x38328000, 0x3832C000, 0x38330000, 0x38334000, 0x38338000, 0x3833C000,
+        0x38340000, 0x38344000, 0x38348000, 0x3834C000, 0x38350000, 0x38354000,
+        0x38358000, 0x3835C000, 0x38360000, 0x38364000, 0x38368000, 0x3836C000,
+        0x38370000, 0x38374000, 0x38378000, 0x3837C000, 0x38380000, 0x38384000,
+        0x38388000, 0x3838C000, 0x38390000, 0x38394000, 0x38398000, 0x3839C000,
+        0x383A0000, 0x383A4000, 0x383A8000, 0x383AC000, 0x383B0000, 0x383B4000,
+        0x383B8000, 0x383BC000, 0x383C0000, 0x383C4000, 0x383C8000, 0x383CC000,
+        0x383D0000, 0x383D4000, 0x383D8000, 0x383DC000, 0x383E0000, 0x383E4000,
+        0x383E8000, 0x383EC000, 0x383F0000, 0x383F4000, 0x383F8000, 0x383FC000,
+        0x38400000, 0x38404000, 0x38408000, 0x3840C000, 0x38410000, 0x38414000,
+        0x38418000, 0x3841C000, 0x38420000, 0x38424000, 0x38428000, 0x3842C000,
+        0x38430000, 0x38434000, 0x38438000, 0x3843C000, 0x38440000, 0x38444000,
+        0x38448000, 0x3844C000, 0x38450000, 0x38454000, 0x38458000, 0x3845C000,
+        0x38460000, 0x38464000, 0x38468000, 0x3846C000, 0x38470000, 0x38474000,
+        0x38478000, 0x3847C000, 0x38480000, 0x38484000, 0x38488000, 0x3848C000,
+        0x38490000, 0x38494000, 0x38498000, 0x3849C000, 0x384A0000, 0x384A4000,
+        0x384A8000, 0x384AC000, 0x384B0000, 0x384B4000, 0x384B8000, 0x384BC000,
+        0x384C0000, 0x384C4000, 0x384C8000, 0x384CC000, 0x384D0000, 0x384D4000,
+        0x384D8000, 0x384DC000, 0x384E0000, 0x384E4000, 0x384E8000, 0x384EC000,
+        0x384F0000, 0x384F4000, 0x384F8000, 0x384FC000, 0x38500000, 0x38504000,
+        0x38508000, 0x3850C000, 0x38510000, 0x38514000, 0x38518000, 0x3851C000,
+        0x38520000, 0x38524000, 0x38528000, 0x3852C000, 0x38530000, 0x38534000,
+        0x38538000, 0x3853C000, 0x38540000, 0x38544000, 0x38548000, 0x3854C000,
+        0x38550000, 0x38554000, 0x38558000, 0x3855C000, 0x38560000, 0x38564000,
+        0x38568000, 0x3856C000, 0x38570000, 0x38574000, 0x38578000, 0x3857C000,
+        0x38580000, 0x38584000, 0x38588000, 0x3858C000, 0x38590000, 0x38594000,
+        0x38598000, 0x3859C000, 0x385A0000, 0x385A4000, 0x385A8000, 0x385AC000,
+        0x385B0000, 0x385B4000, 0x385B8000, 0x385BC000, 0x385C0000, 0x385C4000,
+        0x385C8000, 0x385CC000, 0x385D0000, 0x385D4000, 0x385D8000, 0x385DC000,
+        0x385E0000, 0x385E4000, 0x385E8000, 0x385EC000, 0x385F0000, 0x385F4000,
+        0x385F8000, 0x385FC000, 0x38600000, 0x38604000, 0x38608000, 0x3860C000,
+        0x38610000, 0x38614000, 0x38618000, 0x3861C000, 0x38620000, 0x38624000,
+        0x38628000, 0x3862C000, 0x38630000, 0x38634000, 0x38638000, 0x3863C000,
+        0x38640000, 0x38644000, 0x38648000, 0x3864C000, 0x38650000, 0x38654000,
+        0x38658000, 0x3865C000, 0x38660000, 0x38664000, 0x38668000, 0x3866C000,
+        0x38670000, 0x38674000, 0x38678000, 0x3867C000, 0x38680000, 0x38684000,
+        0x38688000, 0x3868C000, 0x38690000, 0x38694000, 0x38698000, 0x3869C000,
+        0x386A0000, 0x386A4000, 0x386A8000, 0x386AC000, 0x386B0000, 0x386B4000,
+        0x386B8000, 0x386BC000, 0x386C0000, 0x386C4000, 0x386C8000, 0x386CC000,
+        0x386D0000, 0x386D4000, 0x386D8000, 0x386DC000, 0x386E0000, 0x386E4000,
+        0x386E8000, 0x386EC000, 0x386F0000, 0x386F4000, 0x386F8000, 0x386FC000,
+        0x38700000, 0x38704000, 0x38708000, 0x3870C000, 0x38710000, 0x38714000,
+        0x38718000, 0x3871C000, 0x38720000, 0x38724000, 0x38728000, 0x3872C000,
+        0x38730000, 0x38734000, 0x38738000, 0x3873C000, 0x38740000, 0x38744000,
+        0x38748000, 0x3874C000, 0x38750000, 0x38754000, 0x38758000, 0x3875C000,
+        0x38760000, 0x38764000, 0x38768000, 0x3876C000, 0x38770000, 0x38774000,
+        0x38778000, 0x3877C000, 0x38780000, 0x38784000, 0x38788000, 0x3878C000,
+        0x38790000, 0x38794000, 0x38798000, 0x3879C000, 0x387A0000, 0x387A4000,
+        0x387A8000, 0x387AC000, 0x387B0000, 0x387B4000, 0x387B8000, 0x387BC000,
+        0x387C0000, 0x387C4000, 0x387C8000, 0x387CC000, 0x387D0000, 0x387D4000,
+        0x387D8000, 0x387DC000, 0x387E0000, 0x387E4000, 0x387E8000, 0x387EC000,
+        0x387F0000, 0x387F4000, 0x387F8000, 0x387FC000, 0x38000000, 0x38002000,
+        0x38004000, 0x38006000, 0x38008000, 0x3800A000, 0x3800C000, 0x3800E000,
+        0x38010000, 0x38012000, 0x38014000, 0x38016000, 0x38018000, 0x3801A000,
+        0x3801C000, 0x3801E000, 0x38020000, 0x38022000, 0x38024000, 0x38026000,
+        0x38028000, 0x3802A000, 0x3802C000, 0x3802E000, 0x38030000, 0x38032000,
+        0x38034000, 0x38036000, 0x38038000, 0x3803A000, 0x3803C000, 0x3803E000,
+        0x38040000, 0x38042000, 0x38044000, 0x38046000, 0x38048000, 0x3804A000,
+        0x3804C000, 0x3804E000, 0x38050000, 0x38052000, 0x38054000, 0x38056000,
+        0x38058000, 0x3805A000, 0x3805C000, 0x3805E000, 0x38060000, 0x38062000,
+        0x38064000, 0x38066000, 0x38068000, 0x3806A000, 0x3806C000, 0x3806E000,
+        0x38070000, 0x38072000, 0x38074000, 0x38076000, 0x38078000, 0x3807A000,
+        0x3807C000, 0x3807E000, 0x38080000, 0x38082000, 0x38084000, 0x38086000,
+        0x38088000, 0x3808A000, 0x3808C000, 0x3808E000, 0x38090000, 0x38092000,
+        0x38094000, 0x38096000, 0x38098000, 0x3809A000, 0x3809C000, 0x3809E000,
+        0x380A0000, 0x380A2000, 0x380A4000, 0x380A6000, 0x380A8000, 0x380AA000,
+        0x380AC000, 0x380AE000, 0x380B0000, 0x380B2000, 0x380B4000, 0x380B6000,
+        0x380B8000, 0x380BA000, 0x380BC000, 0x380BE000, 0x380C0000, 0x380C2000,
+        0x380C4000, 0x380C6000, 0x380C8000, 0x380CA000, 0x380CC000, 0x380CE000,
+        0x380D0000, 0x380D2000, 0x380D4000, 0x380D6000, 0x380D8000, 0x380DA000,
+        0x380DC000, 0x380DE000, 0x380E0000, 0x380E2000, 0x380E4000, 0x380E6000,
+        0x380E8000, 0x380EA000, 0x380EC000, 0x380EE000, 0x380F0000, 0x380F2000,
+        0x380F4000, 0x380F6000, 0x380F8000, 0x380FA000, 0x380FC000, 0x380FE000,
+        0x38100000, 0x38102000, 0x38104000, 0x38106000, 0x38108000, 0x3810A000,
+        0x3810C000, 0x3810E000, 0x38110000, 0x38112000, 0x38114000, 0x38116000,
+        0x38118000, 0x3811A000, 0x3811C000, 0x3811E000, 0x38120000, 0x38122000,
+        0x38124000, 0x38126000, 0x38128000, 0x3812A000, 0x3812C000, 0x3812E000,
+        0x38130000, 0x38132000, 0x38134000, 0x38136000, 0x38138000, 0x3813A000,
+        0x3813C000, 0x3813E000, 0x38140000, 0x38142000, 0x38144000, 0x38146000,
+        0x38148000, 0x3814A000, 0x3814C000, 0x3814E000, 0x38150000, 0x38152000,
+        0x38154000, 0x38156000, 0x38158000, 0x3815A000, 0x3815C000, 0x3815E000,
+        0x38160000, 0x38162000, 0x38164000, 0x38166000, 0x38168000, 0x3816A000,
+        0x3816C000, 0x3816E000, 0x38170000, 0x38172000, 0x38174000, 0x38176000,
+        0x38178000, 0x3817A000, 0x3817C000, 0x3817E000, 0x38180000, 0x38182000,
+        0x38184000, 0x38186000, 0x38188000, 0x3818A000, 0x3818C000, 0x3818E000,
+        0x38190000, 0x38192000, 0x38194000, 0x38196000, 0x38198000, 0x3819A000,
+        0x3819C000, 0x3819E000, 0x381A0000, 0x381A2000, 0x381A4000, 0x381A6000,
+        0x381A8000, 0x381AA000, 0x381AC000, 0x381AE000, 0x381B0000, 0x381B2000,
+        0x381B4000, 0x381B6000, 0x381B8000, 0x381BA000, 0x381BC000, 0x381BE000,
+        0x381C0000, 0x381C2000, 0x381C4000, 0x381C6000, 0x381C8000, 0x381CA000,
+        0x381CC000, 0x381CE000, 0x381D0000, 0x381D2000, 0x381D4000, 0x381D6000,
+        0x381D8000, 0x381DA000, 0x381DC000, 0x381DE000, 0x381E0000, 0x381E2000,
+        0x381E4000, 0x381E6000, 0x381E8000, 0x381EA000, 0x381EC000, 0x381EE000,
+        0x381F0000, 0x381F2000, 0x381F4000, 0x381F6000, 0x381F8000, 0x381FA000,
+        0x381FC000, 0x381FE000, 0x38200000, 0x38202000, 0x38204000, 0x38206000,
+        0x38208000, 0x3820A000, 0x3820C000, 0x3820E000, 0x38210000, 0x38212000,
+        0x38214000, 0x38216000, 0x38218000, 0x3821A000, 0x3821C000, 0x3821E000,
+        0x38220000, 0x38222000, 0x38224000, 0x38226000, 0x38228000, 0x3822A000,
+        0x3822C000, 0x3822E000, 0x38230000, 0x38232000, 0x38234000, 0x38236000,
+        0x38238000, 0x3823A000, 0x3823C000, 0x3823E000, 0x38240000, 0x38242000,
+        0x38244000, 0x38246000, 0x38248000, 0x3824A000, 0x3824C000, 0x3824E000,
+        0x38250000, 0x38252000, 0x38254000, 0x38256000, 0x38258000, 0x3825A000,
+        0x3825C000, 0x3825E000, 0x38260000, 0x38262000, 0x38264000, 0x38266000,
+        0x38268000, 0x3826A000, 0x3826C000, 0x3826E000, 0x38270000, 0x38272000,
+        0x38274000, 0x38276000, 0x38278000, 0x3827A000, 0x3827C000, 0x3827E000,
+        0x38280000, 0x38282000, 0x38284000, 0x38286000, 0x38288000, 0x3828A000,
+        0x3828C000, 0x3828E000, 0x38290000, 0x38292000, 0x38294000, 0x38296000,
+        0x38298000, 0x3829A000, 0x3829C000, 0x3829E000, 0x382A0000, 0x382A2000,
+        0x382A4000, 0x382A6000, 0x382A8000, 0x382AA000, 0x382AC000, 0x382AE000,
+        0x382B0000, 0x382B2000, 0x382B4000, 0x382B6000, 0x382B8000, 0x382BA000,
+        0x382BC000, 0x382BE000, 0x382C0000, 0x382C2000, 0x382C4000, 0x382C6000,
+        0x382C8000, 0x382CA000, 0x382CC000, 0x382CE000, 0x382D0000, 0x382D2000,
+        0x382D4000, 0x382D6000, 0x382D8000, 0x382DA000, 0x382DC000, 0x382DE000,
+        0x382E0000, 0x382E2000, 0x382E4000, 0x382E6000, 0x382E8000, 0x382EA000,
+        0x382EC000, 0x382EE000, 0x382F0000, 0x382F2000, 0x382F4000, 0x382F6000,
+        0x382F8000, 0x382FA000, 0x382FC000, 0x382FE000, 0x38300000, 0x38302000,
+        0x38304000, 0x38306000, 0x38308000, 0x3830A000, 0x3830C000, 0x3830E000,
+        0x38310000, 0x38312000, 0x38314000, 0x38316000, 0x38318000, 0x3831A000,
+        0x3831C000, 0x3831E000, 0x38320000, 0x38322000, 0x38324000, 0x38326000,
+        0x38328000, 0x3832A000, 0x3832C000, 0x3832E000, 0x38330000, 0x38332000,
+        0x38334000, 0x38336000, 0x38338000, 0x3833A000, 0x3833C000, 0x3833E000,
+        0x38340000, 0x38342000, 0x38344000, 0x38346000, 0x38348000, 0x3834A000,
+        0x3834C000, 0x3834E000, 0x38350000, 0x38352000, 0x38354000, 0x38356000,
+        0x38358000, 0x3835A000, 0x3835C000, 0x3835E000, 0x38360000, 0x38362000,
+        0x38364000, 0x38366000, 0x38368000, 0x3836A000, 0x3836C000, 0x3836E000,
+        0x38370000, 0x38372000, 0x38374000, 0x38376000, 0x38378000, 0x3837A000,
+        0x3837C000, 0x3837E000, 0x38380000, 0x38382000, 0x38384000, 0x38386000,
+        0x38388000, 0x3838A000, 0x3838C000, 0x3838E000, 0x38390000, 0x38392000,
+        0x38394000, 0x38396000, 0x38398000, 0x3839A000, 0x3839C000, 0x3839E000,
+        0x383A0000, 0x383A2000, 0x383A4000, 0x383A6000, 0x383A8000, 0x383AA000,
+        0x383AC000, 0x383AE000, 0x383B0000, 0x383B2000, 0x383B4000, 0x383B6000,
+        0x383B8000, 0x383BA000, 0x383BC000, 0x383BE000, 0x383C0000, 0x383C2000,
+        0x383C4000, 0x383C6000, 0x383C8000, 0x383CA000, 0x383CC000, 0x383CE000,
+        0x383D0000, 0x383D2000, 0x383D4000, 0x383D6000, 0x383D8000, 0x383DA000,
+        0x383DC000, 0x383DE000, 0x383E0000, 0x383E2000, 0x383E4000, 0x383E6000,
+        0x383E8000, 0x383EA000, 0x383EC000, 0x383EE000, 0x383F0000, 0x383F2000,
+        0x383F4000, 0x383F6000, 0x383F8000, 0x383FA000, 0x383FC000, 0x383FE000,
+        0x38400000, 0x38402000, 0x38404000, 0x38406000, 0x38408000, 0x3840A000,
+        0x3840C000, 0x3840E000, 0x38410000, 0x38412000, 0x38414000, 0x38416000,
+        0x38418000, 0x3841A000, 0x3841C000, 0x3841E000, 0x38420000, 0x38422000,
+        0x38424000, 0x38426000, 0x38428000, 0x3842A000, 0x3842C000, 0x3842E000,
+        0x38430000, 0x38432000, 0x38434000, 0x38436000, 0x38438000, 0x3843A000,
+        0x3843C000, 0x3843E000, 0x38440000, 0x38442000, 0x38444000, 0x38446000,
+        0x38448000, 0x3844A000, 0x3844C000, 0x3844E000, 0x38450000, 0x38452000,
+        0x38454000, 0x38456000, 0x38458000, 0x3845A000, 0x3845C000, 0x3845E000,
+        0x38460000, 0x38462000, 0x38464000, 0x38466000, 0x38468000, 0x3846A000,
+        0x3846C000, 0x3846E000, 0x38470000, 0x38472000, 0x38474000, 0x38476000,
+        0x38478000, 0x3847A000, 0x3847C000, 0x3847E000, 0x38480000, 0x38482000,
+        0x38484000, 0x38486000, 0x38488000, 0x3848A000, 0x3848C000, 0x3848E000,
+        0x38490000, 0x38492000, 0x38494000, 0x38496000, 0x38498000, 0x3849A000,
+        0x3849C000, 0x3849E000, 0x384A0000, 0x384A2000, 0x384A4000, 0x384A6000,
+        0x384A8000, 0x384AA000, 0x384AC000, 0x384AE000, 0x384B0000, 0x384B2000,
+        0x384B4000, 0x384B6000, 0x384B8000, 0x384BA000, 0x384BC000, 0x384BE000,
+        0x384C0000, 0x384C2000, 0x384C4000, 0x384C6000, 0x384C8000, 0x384CA000,
+        0x384CC000, 0x384CE000, 0x384D0000, 0x384D2000, 0x384D4000, 0x384D6000,
+        0x384D8000, 0x384DA000, 0x384DC000, 0x384DE000, 0x384E0000, 0x384E2000,
+        0x384E4000, 0x384E6000, 0x384E8000, 0x384EA000, 0x384EC000, 0x384EE000,
+        0x384F0000, 0x384F2000, 0x384F4000, 0x384F6000, 0x384F8000, 0x384FA000,
+        0x384FC000, 0x384FE000, 0x38500000, 0x38502000, 0x38504000, 0x38506000,
+        0x38508000, 0x3850A000, 0x3850C000, 0x3850E000, 0x38510000, 0x38512000,
+        0x38514000, 0x38516000, 0x38518000, 0x3851A000, 0x3851C000, 0x3851E000,
+        0x38520000, 0x38522000, 0x38524000, 0x38526000, 0x38528000, 0x3852A000,
+        0x3852C000, 0x3852E000, 0x38530000, 0x38532000, 0x38534000, 0x38536000,
+        0x38538000, 0x3853A000, 0x3853C000, 0x3853E000, 0x38540000, 0x38542000,
+        0x38544000, 0x38546000, 0x38548000, 0x3854A000, 0x3854C000, 0x3854E000,
+        0x38550000, 0x38552000, 0x38554000, 0x38556000, 0x38558000, 0x3855A000,
+        0x3855C000, 0x3855E000, 0x38560000, 0x38562000, 0x38564000, 0x38566000,
+        0x38568000, 0x3856A000, 0x3856C000, 0x3856E000, 0x38570000, 0x38572000,
+        0x38574000, 0x38576000, 0x38578000, 0x3857A000, 0x3857C000, 0x3857E000,
+        0x38580000, 0x38582000, 0x38584000, 0x38586000, 0x38588000, 0x3858A000,
+        0x3858C000, 0x3858E000, 0x38590000, 0x38592000, 0x38594000, 0x38596000,
+        0x38598000, 0x3859A000, 0x3859C000, 0x3859E000, 0x385A0000, 0x385A2000,
+        0x385A4000, 0x385A6000, 0x385A8000, 0x385AA000, 0x385AC000, 0x385AE000,
+        0x385B0000, 0x385B2000, 0x385B4000, 0x385B6000, 0x385B8000, 0x385BA000,
+        0x385BC000, 0x385BE000, 0x385C0000, 0x385C2000, 0x385C4000, 0x385C6000,
+        0x385C8000, 0x385CA000, 0x385CC000, 0x385CE000, 0x385D0000, 0x385D2000,
+        0x385D4000, 0x385D6000, 0x385D8000, 0x385DA000, 0x385DC000, 0x385DE000,
+        0x385E0000, 0x385E2000, 0x385E4000, 0x385E6000, 0x385E8000, 0x385EA000,
+        0x385EC000, 0x385EE000, 0x385F0000, 0x385F2000, 0x385F4000, 0x385F6000,
+        0x385F8000, 0x385FA000, 0x385FC000, 0x385FE000, 0x38600000, 0x38602000,
+        0x38604000, 0x38606000, 0x38608000, 0x3860A000, 0x3860C000, 0x3860E000,
+        0x38610000, 0x38612000, 0x38614000, 0x38616000, 0x38618000, 0x3861A000,
+        0x3861C000, 0x3861E000, 0x38620000, 0x38622000, 0x38624000, 0x38626000,
+        0x38628000, 0x3862A000, 0x3862C000, 0x3862E000, 0x38630000, 0x38632000,
+        0x38634000, 0x38636000, 0x38638000, 0x3863A000, 0x3863C000, 0x3863E000,
+        0x38640000, 0x38642000, 0x38644000, 0x38646000, 0x38648000, 0x3864A000,
+        0x3864C000, 0x3864E000, 0x38650000, 0x38652000, 0x38654000, 0x38656000,
+        0x38658000, 0x3865A000, 0x3865C000, 0x3865E000, 0x38660000, 0x38662000,
+        0x38664000, 0x38666000, 0x38668000, 0x3866A000, 0x3866C000, 0x3866E000,
+        0x38670000, 0x38672000, 0x38674000, 0x38676000, 0x38678000, 0x3867A000,
+        0x3867C000, 0x3867E000, 0x38680000, 0x38682000, 0x38684000, 0x38686000,
+        0x38688000, 0x3868A000, 0x3868C000, 0x3868E000, 0x38690000, 0x38692000,
+        0x38694000, 0x38696000, 0x38698000, 0x3869A000, 0x3869C000, 0x3869E000,
+        0x386A0000, 0x386A2000, 0x386A4000, 0x386A6000, 0x386A8000, 0x386AA000,
+        0x386AC000, 0x386AE000, 0x386B0000, 0x386B2000, 0x386B4000, 0x386B6000,
+        0x386B8000, 0x386BA000, 0x386BC000, 0x386BE000, 0x386C0000, 0x386C2000,
+        0x386C4000, 0x386C6000, 0x386C8000, 0x386CA000, 0x386CC000, 0x386CE000,
+        0x386D0000, 0x386D2000, 0x386D4000, 0x386D6000, 0x386D8000, 0x386DA000,
+        0x386DC000, 0x386DE000, 0x386E0000, 0x386E2000, 0x386E4000, 0x386E6000,
+        0x386E8000, 0x386EA000, 0x386EC000, 0x386EE000, 0x386F0000, 0x386F2000,
+        0x386F4000, 0x386F6000, 0x386F8000, 0x386FA000, 0x386FC000, 0x386FE000,
+        0x38700000, 0x38702000, 0x38704000, 0x38706000, 0x38708000, 0x3870A000,
+        0x3870C000, 0x3870E000, 0x38710000, 0x38712000, 0x38714000, 0x38716000,
+        0x38718000, 0x3871A000, 0x3871C000, 0x3871E000, 0x38720000, 0x38722000,
+        0x38724000, 0x38726000, 0x38728000, 0x3872A000, 0x3872C000, 0x3872E000,
+        0x38730000, 0x38732000, 0x38734000, 0x38736000, 0x38738000, 0x3873A000,
+        0x3873C000, 0x3873E000, 0x38740000, 0x38742000, 0x38744000, 0x38746000,
+        0x38748000, 0x3874A000, 0x3874C000, 0x3874E000, 0x38750000, 0x38752000,
+        0x38754000, 0x38756000, 0x38758000, 0x3875A000, 0x3875C000, 0x3875E000,
+        0x38760000, 0x38762000, 0x38764000, 0x38766000, 0x38768000, 0x3876A000,
+        0x3876C000, 0x3876E000, 0x38770000, 0x38772000, 0x38774000, 0x38776000,
+        0x38778000, 0x3877A000, 0x3877C000, 0x3877E000, 0x38780000, 0x38782000,
+        0x38784000, 0x38786000, 0x38788000, 0x3878A000, 0x3878C000, 0x3878E000,
+        0x38790000, 0x38792000, 0x38794000, 0x38796000, 0x38798000, 0x3879A000,
+        0x3879C000, 0x3879E000, 0x387A0000, 0x387A2000, 0x387A4000, 0x387A6000,
+        0x387A8000, 0x387AA000, 0x387AC000, 0x387AE000, 0x387B0000, 0x387B2000,
+        0x387B4000, 0x387B6000, 0x387B8000, 0x387BA000, 0x387BC000, 0x387BE000,
+        0x387C0000, 0x387C2000, 0x387C4000, 0x387C6000, 0x387C8000, 0x387CA000,
+        0x387CC000, 0x387CE000, 0x387D0000, 0x387D2000, 0x387D4000, 0x387D6000,
+        0x387D8000, 0x387DA000, 0x387DC000, 0x387DE000, 0x387E0000, 0x387E2000,
+        0x387E4000, 0x387E6000, 0x387E8000, 0x387EA000, 0x387EC000, 0x387EE000,
+        0x387F0000, 0x387F2000, 0x387F4000, 0x387F6000, 0x387F8000, 0x387FA000,
+        0x387FC000, 0x387FE000};
+
+    constexpr uint32_t exponent_table[64] = {
+        0x00000000, 0x00800000, 0x01000000, 0x01800000, 0x02000000, 0x02800000,
+        0x03000000, 0x03800000, 0x04000000, 0x04800000, 0x05000000, 0x05800000,
+        0x06000000, 0x06800000, 0x07000000, 0x07800000, 0x08000000, 0x08800000,
+        0x09000000, 0x09800000, 0x0A000000, 0x0A800000, 0x0B000000, 0x0B800000,
+        0x0C000000, 0x0C800000, 0x0D000000, 0x0D800000, 0x0E000000, 0x0E800000,
+        0x0F000000, 0x47800000, 0x80000000, 0x80800000, 0x81000000, 0x81800000,
+        0x82000000, 0x82800000, 0x83000000, 0x83800000, 0x84000000, 0x84800000,
+        0x85000000, 0x85800000, 0x86000000, 0x86800000, 0x87000000, 0x87800000,
+        0x88000000, 0x88800000, 0x89000000, 0x89800000, 0x8A000000, 0x8A800000,
+        0x8B000000, 0x8B800000, 0x8C000000, 0x8C800000, 0x8D000000, 0x8D800000,
+        0x8E000000, 0x8E800000, 0x8F000000, 0xC7800000};
+
+    constexpr uint16_t offset_table[64] = {
+        0,    1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
+        1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
+        1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0,
+        1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
+        1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024,
+        1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024};
+
+    alignas(std::max(alignof(uint16_t), alignof(native_half_t)))
+        native_half_t _value = value;
+    uint16_t value_bits      = *reinterpret_cast<uint16_t*>(&_value);
+
+    alignas(std::max(alignof(uint32_t), alignof(float))) uint32_t bits =
+        mantissa_table[offset_table[value_bits >> 10] + (value_bits & 0x3FF)] +
+        exponent_table[value_bits >> 10];
+    return *reinterpret_cast<float*>(&bits);
+}
+
+#endif  // __CUDACC_RTC__
+
+template<typename T, std::float_round_style R = std::round_to_nearest>
+#ifdef __CUDA_ARCH__
+AF_CONSTEXPR
+#endif
+    __DH__ native_half_t
+    float2half(T val) {
+    return float2half_impl<R>(val);
+}
+
+__DH__ inline float half2float(native_half_t value) noexcept {
+    return half2float_impl(value);
+}
+
+#ifndef __CUDACC_RTC__
+template<typename T, std::float_round_style R = std::round_to_nearest,
+         typename std::enable_if_t<std::is_integral<T>::value &&
+                                   std::is_signed<T>::value>* = nullptr>
+AF_CONSTEXPR __DH__ native_half_t int2half(T value) noexcept {
+    native_half_t out = (value < 0) ? int2half_impl<R, true, T>(value)
+                                    : int2half_impl<R, false, T>(value);
+    return out;
+}
+#endif
+
+template<typename T, std::float_round_style R = std::round_to_nearest
+#ifndef __CUDACC_RTC__
+         ,
+         typename std::enable_if_t<std::is_integral<T>::value &&
+                                   std::is_unsigned<T>::value>* = nullptr
+#endif
+         >
+AF_CONSTEXPR __DH__ native_half_t int2half(T value) noexcept {
+#if defined(__CUDACC_RTC__)
+    return int2half_impl(value);
+#else
+    return int2half_impl<R, false, T>(value);
+#endif
+}
+
+/// Convert half-precision floating point to integer.
+///
+/// \tparam R rounding mode to use, `std::round_indeterminate` for fastest
+///           rounding
+/// \tparam E `true` for round to even, `false` for round away from
+///           zero
+/// \tparam T type to convert to (builtin integer type with at least 16
+///           bits precision, excluding any implicit sign bits)
+/// \param value binary representation of the half-precision value to
+///           convert
+/// \return integral value
+template<std::float_round_style R, bool E, typename T>
+AF_CONSTEXPR T half2int(native_half_t value) {
+#ifdef __CUDA_ARCH__
+    AF_IF_CONSTEXPR(std::is_same<T, short>::value ||
+                    std::is_same<T, char>::value ||
+                    std::is_same<T, signed char>::value ||
+                    std::is_same<T, unsigned char>::value) {
+        return __half2short_rn(value);
+    }
+    else AF_IF_CONSTEXPR(std::is_same<T, unsigned short>::value) {
+        return __half2ushort_rn(value);
+    }
+    else AF_IF_CONSTEXPR(std::is_same<T, long long>::value) {
+        return __half2ll_rn(value);
+    }
+    else AF_IF_CONSTEXPR(std::is_same<T, unsigned long long>::value) {
+        return __half2ull_rn(value);
+    }
+    else AF_IF_CONSTEXPR(std::is_same<T, int>::value) {
+        return __half2int_rn(value);
+    }
+    else {
+        return __half2uint_rn(value);
+    }
+#elif defined(AF_ONEAPI)
+    return static_cast<T>(value);
+#else
+    static_assert(std::is_integral<T>::value,
+                  "half to int conversion only supports builtin integer types");
+    unsigned int e = value & 0x7FFF;
+    if (e >= 0x7C00)
+        return (value & 0x8000) ? std::numeric_limits<T>::min()
+                                : std::numeric_limits<T>::max();
+    if (e < 0x3800) {
+        AF_IF_CONSTEXPR(R == std::round_toward_infinity)
+        return T(~(value >> 15) & (e != 0));
+        else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) return -T(
+            value > 0x8000);
+        return T();
+    }
+    unsigned int m = (value & 0x3FF) | 0x400;
+    e >>= 10;
+    if (e < 25) {
+        AF_IF_CONSTEXPR(R == std::round_to_nearest)
+        m += (1 << (24 - e)) - (~(m >> (25 - e)) & E);
+        else AF_IF_CONSTEXPR(R == std::round_toward_infinity) m +=
+            ((value >> 15) - 1) & ((1 << (25 - e)) - 1U);
+        else AF_IF_CONSTEXPR(R == std::round_toward_neg_infinity) m +=
+            -(value >> 15) & ((1 << (25 - e)) - 1U);
+        m >>= 25 - e;
+    } else
+        m <<= e - 25;
+    return (value & 0x8000) ? -static_cast<T>(m) : static_cast<T>(m);
+#endif
+}
+
+namespace internal {
+/// Tag type for binary construction.
+struct binary_t {};
+
+/// Tag for binary construction.
+static constexpr binary_t binary = binary_t{};
+}  // namespace internal
+
+class half;
+
+AF_CONSTEXPR __DH__ static inline bool operator==(
+    arrayfire::common::half lhs, arrayfire::common::half rhs) noexcept;
+AF_CONSTEXPR __DH__ static inline bool operator!=(
+    arrayfire::common::half lhs, arrayfire::common::half rhs) noexcept;
+
+__DH__ static inline bool operator<(arrayfire::common::half lhs,
+                                    arrayfire::common::half rhs) noexcept;
+__DH__ static inline bool operator<(arrayfire::common::half lhs,
+                                    float rhs) noexcept;
+
+AF_CONSTEXPR __DH__ static inline bool isinf(half val) noexcept;
+
+/// Classification implementation.
+/// \param val value to classify
+/// \retval true if not a number
+/// \retval false else
+AF_CONSTEXPR __DH__ static inline bool isnan(
+    arrayfire::common::half val) noexcept;
+
+class alignas(2) half {
+    native_half_t data_ = native_half_t();
+
+#if !defined(__NVCC__) && !defined(__CUDACC_RTC__)
+    // NVCC on OSX performs a weird transformation where it removes the std::
+    // namespace and complains that the std:: namespace is not there
+    friend class std::numeric_limits<half>;
+    friend struct std::hash<half>;
+#endif
+
+   public:
+    AF_CONSTEXPR
+    half() = default;
+
+    /// Constructor.
+    /// \param bits binary representation to set half to
+    AF_CONSTEXPR __DH__ half(internal::binary_t, uint16_t bits) noexcept
+        :
+#if defined(__CUDA_ARCH__)
+        data_(__ushort_as_half(bits))
+#else
+        data_(bits)
+#endif
+    {
+#ifndef __CUDACC_RTC__
+        static_assert(std::is_standard_layout<half>::value,
+                      "half must be a standard layout type");
+        static_assert(std::is_nothrow_move_assignable<half>::value,
+                      "half is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<half>::value,
+                      "half is not move constructible");
+#endif
+    }
+
+    __DH__ explicit half(double value) noexcept
+        : data_(float2half<double>(value)) {}
+
+#if defined(__CUDA_ARCH__)
+    AF_CONSTEXPR
+#endif
+    __DH__ explicit half(float value) noexcept
+        : data_(float2half<float>(value)) {}
+
+    template<typename T>
+    AF_CONSTEXPR __DH__ explicit half(T value) noexcept
+        : data_(int2half(value)) {}
+
+#if defined(__CUDA_ARCH__)
+    AF_CONSTEXPR
+#endif
+    __DH__ half& operator=(const double& value) noexcept {
+        data_ = float2half(value);
+        return *this;
+    }
+
+#if defined(__CUDA_ARCH__) || defined(AF_ONEAPI)
+    AF_CONSTEXPR __DH__ explicit half(native_half_t value) noexcept
+        : data_(value) {}
+
+    AF_CONSTEXPR __DH__ half& operator=(native_half_t value) noexcept {
+        // NOTE: Direct assignment from native_half_t only works in device
+        // code, so this overload is compiled only under the guard above.
+        data_ = value;
+        return *this;
+    }
+#endif
+
+    __DH__ explicit operator float() const noexcept {
+        return half2float(data_);
+    }
+
+    __DH__ explicit operator double() const noexcept {
+        // TODO(umar): convert directly to double
+        return half2float(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator short() const noexcept {
+        return half2int<std::round_indeterminate, true, short>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator long long() const noexcept {
+        return half2int<std::round_indeterminate, true, long long int>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator int() const noexcept {
+        return half2int<std::round_indeterminate, true, int>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator unsigned() const noexcept {
+        return half2int<std::round_indeterminate, true, unsigned>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator unsigned short() const noexcept {
+        return half2int<std::round_indeterminate, true, unsigned>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator unsigned long long() const noexcept {
+        return half2int<std::round_indeterminate, true, unsigned>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator char() const noexcept {
+        return half2int<std::round_indeterminate, true, char>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator signed char() const noexcept {
+        return half2int<std::round_indeterminate, true, signed char>(data_);
+    }
+
+    AF_CONSTEXPR __DH__ explicit operator unsigned char() const noexcept {
+        return half2int<std::round_indeterminate, true, unsigned char>(data_);
+    }
+
+#if defined(__CUDA_ARCH__) || defined(AF_ONEAPI)
+    AF_CONSTEXPR __DH__ operator native_half_t() const noexcept {
+        return data_;
+    };
+#endif
+
+    friend AF_CONSTEXPR __DH__ bool operator==(half lhs, half rhs) noexcept;
+    friend AF_CONSTEXPR __DH__ bool operator!=(half lhs, half rhs) noexcept;
+    friend __DH__ bool operator<(arrayfire::common::half lhs,
+                                 arrayfire::common::half rhs) noexcept;
+    friend __DH__ bool operator<(arrayfire::common::half lhs,
+                                 float rhs) noexcept;
+
+    friend AF_CONSTEXPR __DH__ bool isinf(half val) noexcept;
+    friend AF_CONSTEXPR __DH__ inline bool isnan(half val) noexcept;
+
+    AF_CONSTEXPR __DH__ arrayfire::common::half operator-() const {
+#if __CUDA_ARCH__ >= 530
+        return arrayfire::common::half(__hneg(data_));
+#elif defined(__CUDA_ARCH__)
+        return arrayfire::common::half(-(__half2float(data_)));
+#elif defined(AF_ONEAPI)
+        return arrayfire::common::half(-data_);
+#else
+        return arrayfire::common::half(internal::binary, data_ ^ 0x8000);
+#endif
+    }
+
+    AF_CONSTEXPR __DH__ arrayfire::common::half operator+() const {
+        return *this;
+    }
+
+    AF_CONSTEXPR static half infinity() {
+        half out;
+#ifdef __CUDA_ARCH__
+        out.data_ = __half_raw{0x7C00};
+#elif defined(AF_ONEAPI)
+        out.data_ = std::numeric_limits<sycl::half>::infinity();
+#else
+        out.data_ = 0x7C00;
+#endif
+        return out;
+    }
+};
+
+AF_CONSTEXPR __DH__ static inline bool operator==(
+    arrayfire::common::half lhs, arrayfire::common::half rhs) noexcept {
+#if __CUDA_ARCH__ >= 530
+    return __heq(lhs.data_, rhs.data_);
+#elif defined(__CUDA_ARCH__)
+    return __half2float(lhs.data_) == __half2float(rhs.data_);
+#elif defined(AF_ONEAPI)
+    return lhs.data_ == rhs.data_;
+#else
+    return (lhs.data_ == rhs.data_ || !((lhs.data_ | rhs.data_) & 0x7FFF)) &&
+           !isnan(lhs);
+#endif
+}
+
+AF_CONSTEXPR __DH__ static inline bool operator!=(
+    arrayfire::common::half lhs, arrayfire::common::half rhs) noexcept {
+#if __CUDA_ARCH__ >= 530
+    return __hne(lhs.data_, rhs.data_);
+#else
+    return !(lhs == rhs);
+#endif
+}
+
+__DH__ static inline bool operator<(arrayfire::common::half lhs,
+                                    arrayfire::common::half rhs) noexcept {
+#if __CUDA_ARCH__ >= 530
+    return __hlt(lhs.data_, rhs.data_);
+#elif defined(__CUDA_ARCH__)
+    return __half2float(lhs.data_) < __half2float(rhs.data_);
+#elif defined(AF_ONEAPI)
+    return lhs.data_ < rhs.data_;
+#else
+    int xabs = lhs.data_ & 0x7FFF, yabs = rhs.data_ & 0x7FFF;
+    return xabs <= 0x7C00 && yabs <= 0x7C00 &&
+           (((xabs == lhs.data_) ? xabs : -xabs) <
+            ((yabs == rhs.data_) ? yabs : -yabs));
+#endif
+}
+
+__DH__ static inline bool operator<(arrayfire::common::half lhs,
+                                    float rhs) noexcept {
+#if defined(__CUDA_ARCH__)
+    return __half2float(lhs.data_) < rhs;
+#elif defined(AF_ONEAPI)
+    return lhs.data_ < rhs;
+#else
+    return static_cast<float>(lhs) < rhs;
+#endif
+}
+
+#ifndef __CUDA_ARCH__
+std::ostream& operator<<(std::ostream& os, const half& val);
+
+static inline std::string to_string(const half& val) {
+    return std::to_string(static_cast<float>(val));
+}
+
+static inline std::string to_string(const half&& val) {
+    return std::to_string(static_cast<float>(val));
+}
+#endif
+
+}  // namespace common
+}  // namespace arrayfire
+
+#if !defined(__NVCC__) && !defined(__CUDACC_RTC__)
+// #endif
+/// Extensions to the C++ standard library.
+namespace std {
+/// Numeric limits for half-precision floats.
+/// Because of the underlying single-precision implementation of many
+/// operations, it inherits some properties from `std::numeric_limits<float>`.
+template<>
+class numeric_limits<arrayfire::common::half> : public numeric_limits<float> {
+   public:
+    /// Supports signed values.
+    static constexpr bool is_signed = true;
+
+    /// Is not exact.
+    static constexpr bool is_exact = false;
+
+    /// Doesn't provide modulo arithmetic.
+    static constexpr bool is_modulo = false;
+
+    /// IEEE conformant.
+    static constexpr bool is_iec559 = true;
+
+    /// Supports infinity.
+    static constexpr bool has_infinity = true;
+
+    /// Supports quiet NaNs.
+    static constexpr bool has_quiet_NaN = true;
+
+    /// Supports subnormal values.
+    static constexpr float_denorm_style has_denorm = denorm_present;
+
+    /// Rounding mode.
+    /// Due to the mix of internal single-precision computations (using the
+    /// rounding mode of the underlying single-precision implementation) with
+    /// the rounding mode of the single-to-half conversions, the actual rounding
+    /// mode might be `std::round_indeterminate` if the default half-precision
+    /// rounding mode doesn't match the single-precision rounding mode.
+    static constexpr float_round_style round_style =
+        std::numeric_limits<float>::round_style;
+
+    /// Significant digits.
+    static constexpr int digits = 11;
+
+    /// Significant decimal digits.
+    static constexpr int digits10 = 3;
+
+    /// Required decimal digits to represent all possible values.
+    static constexpr int max_digits10 = 5;
+
+    /// Number base.
+    static constexpr int radix = 2;
+
+    /// One more than smallest exponent.
+    static constexpr int min_exponent = -13;
+
+    /// Smallest normalized representable power of 10.
+    static constexpr int min_exponent10 = -4;
+
+    /// One more than largest exponent
+    static constexpr int max_exponent = 16;
+
+    /// Largest finitely representable power of 10.
+    static constexpr int max_exponent10 = 4;
+
+    /// Smallest positive normal value.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half min() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x0400);
+    }
+
+    /// Lowest (most negative) finite value.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half lowest() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0xFBFF);
+    }
+
+    /// Largest finite value.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half max() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x7BFF);
+    }
+
+    /// Difference between one and next representable value.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half epsilon() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x1400);
+    }
+
+    /// Maximum rounding error.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half round_error() noexcept {
+        return arrayfire::common::half(
+            arrayfire::common::internal::binary,
+            (round_style == std::round_to_nearest) ? 0x3800 : 0x3C00);
+    }
+
+    /// Positive infinity.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half infinity() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x7C00);
+    }
+
+    /// Quiet NaN.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half quiet_NaN() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x7FFF);
+    }
+
+    /// Signalling NaN.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half
+    signaling_NaN() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x7DFF);
+    }
+
+    /// Smallest positive subnormal value.
+    static AF_CONSTEXPR __DH__ arrayfire::common::half denorm_min() noexcept {
+        return arrayfire::common::half(arrayfire::common::internal::binary,
+                                       0x0001);
+    }
+};
+
+/// Hash function for half-precision floats.
+/// This is only defined if C++11 `std::hash` is supported and enabled.
+template<>
+struct hash<
+    arrayfire::common::half>  //: unary_function<arrayfire::common::half,size_t>
+{
+    /// Type of function argument.
+    typedef arrayfire::common::half argument_type;
+
+    /// Function return type.
+    typedef size_t result_type;
+
+    /// Compute hash function.
+    /// \param arg half to hash
+    /// \return hash value
+    result_type operator()(argument_type arg) const {
+        return std::hash<uint16_t>()(
+            static_cast<unsigned>(arg.data_) &
+            -(*reinterpret_cast<uint16_t*>(&arg.data_) != 0x8000));
+    }
+};
+
+}  // namespace std
+#endif
+
+namespace arrayfire {
+namespace common {
+AF_CONSTEXPR __DH__ static bool isinf(half val) noexcept {
+#if __CUDA_ARCH__ >= 530
+    return __hisinf(val.data_);
+#elif defined(__CUDA_ARCH__)
+    return ::isinf(__half2float(val));
+#else
+    return val == half::infinity() || val == -half::infinity();
+#endif
+}
+
+AF_CONSTEXPR __DH__ static inline bool isnan(half val) noexcept {
+#if __CUDA_ARCH__ >= 530
+    return __hisnan(val.data_);
+#elif defined(__CUDA_ARCH__)
+    return ::isnan(__half2float(val));
+#elif defined(AF_ONEAPI)
+    return std::isnan(val.data_);
+#else
+    return (val.data_ & 0x7FFF) > 0x7C00;
+#endif
+}
+
+}  // namespace common
+}  // namespace arrayfire
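
Reviewer note on the half.hpp host fallback above: the table lookup in `half2float_impl` (mantissa, exponent and offset tables) is equivalent to decoding the binary16 fields arithmetically. The sketch below illustrates that decoding under stated assumptions; `half_bits_to_float` and `half_bits_isnan` are hypothetical names used only for this note, they are not part of the patch, and the code works on raw `uint16_t` bit patterns rather than on `arrayfire::common::half`.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not in the patch): decode IEEE 754 binary16 bits to
// float by re-biasing the exponent instead of using lookup tables.
inline float half_bits_to_float(std::uint16_t h) {
    const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
    const std::uint32_t exp  = (h >> 10) & 0x1Fu;  // 5-bit biased exponent
    const std::uint32_t man  = h & 0x3FFu;         // 10-bit mantissa
    std::uint32_t bits       = 0;
    if (exp == 0x1Fu) {
        // Infinity or NaN: all-ones float exponent, mantissa carried over.
        bits = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {
        // Normal number: move the bias from 15 to 127 (add 112).
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {
        bits = sign;  // signed zero
    } else {
        // Subnormal half: value is man * 2^-24, exactly representable.
        const float f = std::ldexp(static_cast<float>(man), -24);
        std::memcpy(&bits, &f, sizeof bits);
        bits |= sign;
    }
    float out;
    std::memcpy(&out, &bits, sizeof out);  // bit copy without aliasing issues
    return out;
}

// Same bit test as the host branch of isnan() in the patch: exponent all
// ones and a non-zero mantissa.
inline bool half_bits_isnan(std::uint16_t h) { return (h & 0x7FFFu) > 0x7C00u; }
```

For example, the bit patterns returned by `numeric_limits<half>` in the patch decode as expected under this sketch: `0x7BFF` (max) gives 65504, `0x1400` (epsilon) gives 2^-10, and `0x0001` (denorm_min) gives 2^-24. Using `std::memcpy` for the final bit copy is a design choice of the sketch that sidesteps the `reinterpret_cast` used in the patch.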
diff --git a/src/backend/common/host_memory.cpp b/src/backend/common/host_memory.cpp
new file mode 100644
index 0000000000..0e213cb7e5
--- /dev/null
+++ b/src/backend/common/host_memory.cpp
@@ -0,0 +1,113 @@
+/*
+ * Author:  David Robert Nadeau
+ * Site:    http://NadeauSoftware.com/
+ * License: Creative Commons Attribution 3.0 Unported License
+ *          http://creativecommons.org/licenses/by/3.0/deed.en_US
+ * Source:
+ * http://nadeausoftware.com/sites/NadeauSoftware.com/files/getMemorySize.c
+ */
+
+#include "host_memory.hpp"
+
+#if defined(_WIN32)
+#include <Windows.h>
+
+#elif defined(__unix__) || defined(__unix) || defined(unix) || \
+    (defined(__APPLE__) && defined(__MACH__))
+#include <sys/param.h>
+#include <sys/types.h>
+#include <unistd.h>
+
+#if defined(BSD) && !defined(__gnu_hurd__)
+#include <sys/sysctl.h>
+#endif
+
+#else
+#define NOMEMORYSIZE
+#endif
+
+namespace arrayfire {
+namespace common {
+
+#ifdef NOMEMORYSIZE
+size_t getHostMemorySize() {
+    return 0L;  // Can't detect
+}
+
+#else
+
+/**
+ * Returns the size of physical memory (RAM) in bytes.
+ */
+size_t getHostMemorySize() {
+#if defined(_WIN32) && (defined(__CYGWIN__) || defined(__CYGWIN32__))
+    /* Cygwin under Windows. ------------------------------------ */
+    /* New 64-bit MEMORYSTATUSEX isn't available.  Use old 32-bit */
+    MEMORYSTATUS status;
+    status.dwLength = sizeof(status);
+    GlobalMemoryStatus(&status);
+    return (size_t)status.dwTotalPhys;
+
+#elif defined(_WIN32)
+    /* Windows. ------------------------------------------------- */
+    /* Use new 64-bit MEMORYSTATUSEX, not old 32-bit MEMORYSTATUS */
+    MEMORYSTATUSEX status;
+    status.dwLength = sizeof(status);
+    GlobalMemoryStatusEx(&status);
+    return (size_t)status.ullTotalPhys;
+
+#elif defined(__unix__) || defined(__unix) || defined(unix) || \
+    (defined(__APPLE__) && defined(__MACH__))
+    /* UNIX variants. ------------------------------------------- */
+    /* Prefer sysctl() over sysconf(), except for the 32-bit HW_REALMEM and
+     * HW_PHYSMEM queries, which are used only as a last resort. */
+
+#if defined(CTL_HW) && (defined(HW_MEMSIZE) || defined(HW_PHYSMEM64))
+    int mib[2];
+    mib[0]       = CTL_HW;
+#if defined(HW_MEMSIZE)
+    mib[1]       = HW_MEMSIZE; /* OSX. --------------------- */
+#elif defined(HW_PHYSMEM64)
+    mib[1] = HW_PHYSMEM64; /* NetBSD, OpenBSD. --------- */
+#endif
+    int64_t size = 0;          /* 64-bit */
+    size_t len   = sizeof(size);
+    if (sysctl(mib, 2, &size, &len, NULL, 0) == 0) return (size_t)size;
+    return 0L; /* Failed? */
+
+#elif defined(_SC_AIX_REALMEM)
+    /* AIX. ----------------------------------------------------- */
+    return (size_t)sysconf(_SC_AIX_REALMEM) * (size_t)1024L;
+
+#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
+    /* FreeBSD, Linux, OpenBSD, and Solaris. -------------------- */
+    return static_cast<size_t>(sysconf(_SC_PHYS_PAGES)) *
+           static_cast<size_t>(sysconf(_SC_PAGESIZE));
+
+#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGE_SIZE)
+    /* Legacy. -------------------------------------------------- */
+    return (size_t)sysconf(_SC_PHYS_PAGES) * (size_t)sysconf(_SC_PAGE_SIZE);
+
+#elif defined(CTL_HW) && (defined(HW_PHYSMEM) || defined(HW_REALMEM))
+    /* DragonFly BSD, FreeBSD, NetBSD, OpenBSD, and OSX. -------- */
+    int mib[2];
+    mib[0]            = CTL_HW;
+#if defined(HW_REALMEM)
+    mib[1]            = HW_REALMEM; /* FreeBSD. ----------------- */
+#elif defined(HW_PHYSMEM)
+    mib[1] = HW_PHYSMEM; /* Others. ------------------ */
+#endif
+    unsigned int size = 0;          /* 32-bit */
+    size_t len        = sizeof(size);
+    if (sysctl(mib, 2, &size, &len, NULL, 0) == 0) return (size_t)size;
+    return 0L; /* Failed? */
+#endif /* sysctl and sysconf variants */
+
+#else
+    return 0L; /* Unknown OS. */
+#endif
+}
+
+#endif  // NOMEMORYSIZE
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/host_memory.hpp b/src/backend/common/host_memory.hpp
new file mode 100644
index 0000000000..ead8a8c54e
--- /dev/null
+++ b/src/backend/common/host_memory.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <cstddef>
+
+namespace arrayfire {
+namespace common {
+
+size_t getHostMemorySize();
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/indexing_helpers.hpp b/src/backend/common/indexing_helpers.hpp
new file mode 100644
index 0000000000..9482fa639c
--- /dev/null
+++ b/src/backend/common/indexing_helpers.hpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+#include <array>
+
+namespace arrayfire {
+namespace common {
+
+// Builds a reversed index sequence for every axis marked true in `flip`
+// and returns the corresponding sub-array of the input array
+template<typename T>
+static detail::Array<T> flip(const detail::Array<T>& in,
+                             const std::array<bool, AF_MAX_DIMS> flip) {
+    std::vector<af_seq> index(AF_MAX_DIMS, af_span);
+    const af::dim4& dims = in.dims();
+
+    for (int i = 0; i < AF_MAX_DIMS; ++i) {
+        if (flip[i]) {
+            index[i] = {static_cast<double>(dims[i] - 1), 0.0, -1.0};
+        }
+    }
+    return createSubArray(in, index);
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/internal_enums.hpp b/src/backend/common/internal_enums.hpp
new file mode 100644
index 0000000000..c4e76f7b7c
--- /dev/null
+++ b/src/backend/common/internal_enums.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+// TODO AF_BATCH_UNSUPPORTED is not required and shouldn't happen
+//      Code changes are required to handle all cases properly
+//      and this enum value should be removed.
+typedef enum {
+    AF_BATCH_UNSUPPORTED = -1, /* invalid inputs */
+    AF_BATCH_NONE,             /* one signal, one filter   */
+    AF_BATCH_LHS,              /* many signals, one filter */
+    AF_BATCH_RHS,              /* one signal, many filters */
+    AF_BATCH_SAME,             /* signal and filter have same batch size */
+    AF_BATCH_DIFF,             /* signal and filter have different batch size */
+} AF_BATCH_KIND;
diff --git a/src/backend/common/jit/BinaryNode.cpp b/src/backend/common/jit/BinaryNode.cpp
new file mode 100644
index 0000000000..b017394876
--- /dev/null
+++ b/src/backend/common/jit/BinaryNode.cpp
@@ -0,0 +1,160 @@
+
+#include <Array.hpp>
+#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <complex.hpp>
+#include <types.hpp>
+#include <af/traits.hpp>
+
+#include <memory>
+
+using af::dim4;
+using af::dtype_traits;
+using detail::Array;
+using detail::BinOp;
+using detail::cdouble;
+using detail::cfloat;
+using detail::createNodeArray;
+
+using std::make_shared;
+
+namespace arrayfire {
+namespace common {
+#ifdef AF_CPU
+template<typename To, typename Ti, af_op_t op>
+Array<To> createBinaryNode(const Array<Ti> &lhs, const Array<Ti> &rhs,
+                           const af::dim4 &odims) {
+    common::Node_ptr lhs_node = lhs.getNode();
+    common::Node_ptr rhs_node = rhs.getNode();
+
+    auto node =
+        make_shared<detail::jit::BinaryNode<To, Ti, op>>(lhs_node, rhs_node);
+
+    return createNodeArray<To>(odims, std::move(node));
+}
+
+#else
+
+template<typename To, typename Ti, af_op_t op>
+Array<To> createBinaryNode(const Array<Ti> &lhs, const Array<Ti> &rhs,
+                           const af::dim4 &odims) {
+    auto createBinary = [](std::array<Node_ptr, 2> &operands) -> Node_ptr {
+        BinOp<To, Ti, op> bop;
+        return std::make_shared<BinaryNode>(
+            static_cast<af::dtype>(dtype_traits<To>::af_type), bop.name(),
+            operands[0], operands[1], op);
+    };
+
+    Node_ptr out =
+        common::createNaryNode<Ti, 2>(odims, createBinary, {&lhs, &rhs});
+    return createNodeArray<To>(odims, out);
+}
+
+#endif
+
+#define INSTANTIATE(To, Ti, op)                      \
+    template Array<To> createBinaryNode<To, Ti, op>( \
+        const Array<Ti> &lhs, const Array<Ti> &rhs, const dim4 &odims)
+
+INSTANTIATE(cfloat, float, af_cplx2_t);
+INSTANTIATE(cdouble, double, af_cplx2_t);
+
+#define INSTANTIATE_ARITH(op)                                \
+    INSTANTIATE(float, float, op);                           \
+    INSTANTIATE(cfloat, cfloat, op);                         \
+    INSTANTIATE(double, double, op);                         \
+    INSTANTIATE(cdouble, cdouble, op);                       \
+    INSTANTIATE(unsigned, unsigned, op);                     \
+    INSTANTIATE(short, short, op);                           \
+    INSTANTIATE(unsigned short, unsigned short, op);         \
+    INSTANTIATE(unsigned long long, unsigned long long, op); \
+    INSTANTIATE(long long, long long, op);                   \
+    INSTANTIATE(signed char, signed char, op);               \
+    INSTANTIATE(unsigned char, unsigned char, op);           \
+    INSTANTIATE(char, char, op);                             \
+    INSTANTIATE(common::half, common::half, op);             \
+    INSTANTIATE(int, int, op)
+
+INSTANTIATE_ARITH(af_add_t);
+INSTANTIATE_ARITH(af_sub_t);
+INSTANTIATE_ARITH(af_mul_t);
+INSTANTIATE_ARITH(af_div_t);
+INSTANTIATE_ARITH(af_min_t);
+INSTANTIATE_ARITH(af_max_t);
+
+#undef INSTANTIATE_ARITH
+
+#define INSTANTIATE_ARITH_REAL(op)                           \
+    INSTANTIATE(float, float, op);                           \
+    INSTANTIATE(double, double, op);                         \
+    INSTANTIATE(unsigned, unsigned, op);                     \
+    INSTANTIATE(short, short, op);                           \
+    INSTANTIATE(unsigned short, unsigned short, op);         \
+    INSTANTIATE(unsigned long long, unsigned long long, op); \
+    INSTANTIATE(long long, long long, op);                   \
+    INSTANTIATE(signed char, signed char, op);               \
+    INSTANTIATE(unsigned char, unsigned char, op);           \
+    INSTANTIATE(char, char, op);                             \
+    INSTANTIATE(common::half, common::half, op);             \
+    INSTANTIATE(int, int, op)
+
+INSTANTIATE_ARITH_REAL(af_rem_t);
+INSTANTIATE_ARITH_REAL(af_pow_t);
+INSTANTIATE_ARITH_REAL(af_mod_t);
+
+#define INSTANTIATE_FLOATOPS(op)     \
+    INSTANTIATE(float, float, op);   \
+    INSTANTIATE(double, double, op); \
+    INSTANTIATE(common::half, common::half, op)
+
+INSTANTIATE_FLOATOPS(af_hypot_t);
+INSTANTIATE_FLOATOPS(af_atan2_t);
+
+#define INSTANTIATE_BITOP(op)                                \
+    INSTANTIATE(unsigned, unsigned, op);                     \
+    INSTANTIATE(short, short, op);                           \
+    INSTANTIATE(unsigned short, unsigned short, op);         \
+    INSTANTIATE(unsigned long long, unsigned long long, op); \
+    INSTANTIATE(long long, long long, op);                   \
+    INSTANTIATE(signed char, signed char, op);               \
+    INSTANTIATE(unsigned char, unsigned char, op);           \
+    INSTANTIATE(char, char, op);                             \
+    INSTANTIATE(int, int, op)
+
+INSTANTIATE_BITOP(af_bitshiftl_t);
+INSTANTIATE_BITOP(af_bitshiftr_t);
+INSTANTIATE_BITOP(af_bitor_t);
+INSTANTIATE_BITOP(af_bitand_t);
+INSTANTIATE_BITOP(af_bitxor_t);
+#undef INSTANTIATE_BITOP
+
+#define INSTANTIATE_LOGIC(op)                  \
+    INSTANTIATE(char, float, op);              \
+    INSTANTIATE(char, double, op);             \
+    INSTANTIATE(char, cfloat, op);             \
+    INSTANTIATE(char, cdouble, op);            \
+    INSTANTIATE(char, common::half, op);       \
+    INSTANTIATE(char, unsigned, op);           \
+    INSTANTIATE(char, short, op);              \
+    INSTANTIATE(char, unsigned short, op);     \
+    INSTANTIATE(char, unsigned long long, op); \
+    INSTANTIATE(char, long long, op);          \
+    INSTANTIATE(char, signed char, op);        \
+    INSTANTIATE(char, unsigned char, op);      \
+    INSTANTIATE(char, char, op);               \
+    INSTANTIATE(char, int, op)
+
+INSTANTIATE_LOGIC(af_and_t);
+INSTANTIATE_LOGIC(af_or_t);
+INSTANTIATE_LOGIC(af_eq_t);
+INSTANTIATE_LOGIC(af_neq_t);
+INSTANTIATE_LOGIC(af_lt_t);
+INSTANTIATE_LOGIC(af_le_t);
+INSTANTIATE_LOGIC(af_gt_t);
+INSTANTIATE_LOGIC(af_ge_t);
+
+#undef INSTANTIATE_LOGIC
+#undef INSTANTIATE
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/BinaryNode.hpp b/src/backend/common/jit/BinaryNode.hpp
new file mode 100644
index 0000000000..e250382745
--- /dev/null
+++ b/src/backend/common/jit/BinaryNode.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/jit/NaryNode.hpp>
+
+#include <cmath>
+
+namespace arrayfire {
+namespace common {
+class BinaryNode : public NaryNode {
+   public:
+    BinaryNode(const af::dtype type, const char *op_str, common::Node_ptr lhs,
+               common::Node_ptr rhs, af_op_t op)
+        : NaryNode(type, op_str, 2, {{lhs, rhs}}, op,
+                   std::max(lhs->getHeight(), rhs->getHeight()) + 1) {}
+};
+
+template<typename To, typename Ti, af_op_t op>
+detail::Array<To> createBinaryNode(const detail::Array<Ti> &lhs,
+                                   const detail::Array<Ti> &rhs,
+                                   const af::dim4 &odims);
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/BufferNodeBase.hpp b/src/backend/common/jit/BufferNodeBase.hpp
new file mode 100644
index 0000000000..85576304ad
--- /dev/null
+++ b/src/backend/common/jit/BufferNodeBase.hpp
@@ -0,0 +1,137 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/Node.hpp>
+#include <jit/kernel_generators.hpp>
+
+#include <backend.hpp>
+
+#include <cstring>
+#include <memory>
+#include <sstream>
+
+namespace arrayfire {
+namespace common {
+
+template<typename DataType, typename ParamType>
+class BufferNodeBase : public common::Node {
+   private:
+    DataType m_data;
+    unsigned m_bytes;
+    bool m_linear_buffer;
+
+   public:
+    ParamType m_param;
+    BufferNodeBase(af::dtype type)
+        : Node(type, 0, {}, kNodeType::Buffer)
+        , m_bytes(0)
+        , m_linear_buffer(true) {}
+
+    std::unique_ptr<Node> clone() final {
+        return std::make_unique<BufferNodeBase>(*this);
+    }
+
+    DataType getDataPointer() const { return m_data; }
+
+    void setData(ParamType param, DataType data, const unsigned bytes,
+                 bool is_linear) {
+        m_param         = param;
+        m_data          = data;
+        m_bytes         = bytes;
+        m_linear_buffer = is_linear;
+    }
+
+    bool isLinear(const dim_t dims[4]) const final {
+        bool same_dims = true;
+        for (int i = 0; same_dims && i < 4; i++) {
+            same_dims &= (dims[i] == m_param.dims[i]);
+        }
+        return m_linear_buffer && same_dims;
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        kerString += '_';
+        kerString += getNameStr();
+        kerString += ',';
+        kerString += std::to_string(ids.id);
+    }
+
+    void genParams(std::stringstream &kerStream, int id,
+                   bool is_linear) const final {
+        detail::generateParamDeclaration(kerStream, id, is_linear,
+                                         getTypeStr());
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void *ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const override {
+        return detail::setBufferKernelArguments(start_id, is_linear, setArg,
+                                                m_data, m_param);
+    }
+
+    void genOffsets(std::stringstream &kerStream, int id,
+                    bool is_linear) const final {
+        detail::generateBufferOffsets(kerStream, id, is_linear, getTypeStr());
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        detail::generateBufferRead(kerStream, ids.id, getTypeStr());
+    }
+
+    void getInfo(unsigned &len, unsigned &buf_count,
+                 unsigned &bytes) const final {
+        len++;
+        buf_count++;
+        bytes += m_bytes;
+    }
+
+    size_t getBytes() const final { return m_bytes; }
+
+    size_t getHash() const noexcept override {
+        size_t out = 0;
+        auto ptr   = m_data.get();
+        std::memcpy(&out, &ptr, std::min(sizeof(Node *), sizeof(size_t)));
+        return out;
+    }
+
+    /// Compares two BufferNodeBase objects for equality
+    bool operator==(
+        const BufferNodeBase<DataType, ParamType> &other) const noexcept;
+
+    /// Overloads the equality operator to call comparisons between Buffer
+    /// objects. Calls the BufferNodeBase equality operator if the other
+    /// object is also a Buffer Node
+    bool operator==(const common::Node &other) const noexcept final {
+        if (other.isBuffer()) {
+            return *this ==
+                   static_cast<const BufferNodeBase<DataType, ParamType> &>(
+                       other);
+        }
+        return false;
+    }
+
+    virtual void modDims(const af::dim4 &newDim) override {
+        af::dim4 strides(1, 1, 1, 1);
+        for (dim_t i = 1; i < 4; ++i) {
+            strides[i] = strides[i - 1] * newDim[i - 1];
+        }
+
+        for (dim_t i = 0; i < 4; ++i) {
+            m_param.dims[i]    = newDim[i];
+            m_param.strides[i] = strides[i];
+        }
+    }
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/ModdimNode.hpp b/src/backend/common/jit/ModdimNode.hpp
new file mode 100644
index 0000000000..b0f7d927a6
--- /dev/null
+++ b/src/backend/common/jit/ModdimNode.hpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/NaryNode.hpp>
+
+namespace arrayfire {
+namespace common {
+
+class ModdimNode : public NaryNode {
+   public:
+    af::dim4 m_new_shape;
+    ModdimNode(const af::dim4& new_shape, const af::dtype type, Node_ptr child)
+        : NaryNode(type, "__noop", 1, {{child}}, af_moddims_t,
+                   child->getHeight() + 1)
+        , m_new_shape(new_shape) {
+        static_assert(std::is_nothrow_move_assignable<ModdimNode>::value,
+                      "ModdimNode is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<ModdimNode>::value,
+                      "ModdimNode is not move constructible");
+    }
+
+    virtual std::unique_ptr<Node> clone() noexcept final {
+        return std::make_unique<ModdimNode>(*this);
+    }
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/NaryNode.hpp b/src/backend/common/jit/NaryNode.hpp
new file mode 100644
index 0000000000..5f1e91a570
--- /dev/null
+++ b/src/backend/common/jit/NaryNode.hpp
@@ -0,0 +1,141 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <common/defines.hpp>
+#include <common/jit/Node.hpp>
+
+#include <nonstd/span.hpp>
+#include <array>
+#include <iomanip>
+#include <sstream>
+#include <string>
+#include <utility>
+
+namespace arrayfire {
+namespace common {
+
+class NaryNode : public Node {
+   private:
+    int m_num_children;
+    const char *m_op_str;
+
+   protected:
+    af_op_t m_op;
+
+   public:
+    NaryNode(const af::dtype type, const char *op_str, const int num_children,
+             const std::array<common::Node_ptr, Node::kMaxChildren> &&children,
+             const af_op_t op, const int height)
+        : common::Node(
+              type, height,
+              std::forward<
+                  const std::array<common::Node_ptr, Node::kMaxChildren>>(
+                  children),
+              kNodeType::Nary)
+        , m_num_children(num_children)
+        , m_op_str(op_str)
+        , m_op(op) {
+        static_assert(std::is_nothrow_move_assignable<NaryNode>::value,
+                      "NaryNode is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<NaryNode>::value,
+                      "NaryNode is not move constructible");
+    }
+
+    NaryNode(NaryNode &&other) noexcept = default;
+
+    NaryNode(const NaryNode &other) = default;
+
+    /// Default copy assignment operator
+    NaryNode &operator=(const NaryNode &node) = default;
+
+    /// Default move assignment operator
+    NaryNode &operator=(NaryNode &&node) noexcept = default;
+
+    void swap(NaryNode &other) noexcept {
+        using std::swap;
+        Node::swap(other);
+        swap(m_num_children, other.m_num_children);
+        swap(m_op_str, other.m_op_str);
+        swap(m_op, other.m_op);
+    }
+
+    af_op_t getOp() const noexcept final { return m_op; }
+
+    virtual std::unique_ptr<Node> clone() override {
+        return std::make_unique<NaryNode>(*this);
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        // Make the decimal value of the op enum part of the kernel name
+        kerString += '_';
+        kerString += std::to_string(m_op);
+        kerString += ',';
+        for (int i = 0; i < m_num_children; i++) {
+            kerString += std::to_string(ids.child_ids[i]);
+            kerString += ',';
+        }
+        kerString += std::to_string(ids.id);
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        kerStream << getTypeStr() << " val" << ids.id << " = " << m_op_str
+                  << "(";
+        for (int i = 0; i < m_num_children; i++) {
+            if (i > 0) kerStream << ", ";
+            kerStream << "val" << ids.child_ids[i];
+        }
+        kerStream << ");\n";
+    }
+};
+
+template<typename Ti, int N, typename FUNC>
+common::Node_ptr createNaryNode(
+    const af::dim4 &odims, FUNC createNode,
+    std::array<const detail::Array<Ti> *, N> &&children) {
+    std::array<common::Node_ptr, N> childNodes;
+    std::array<common::Node *, N> nodes;
+    for (int i = 0; i < N; i++) {
+        childNodes[i] = std::move(children[i]->getNode());
+        nodes[i]      = childNodes[i].get();
+    }
+
+    common::Node_ptr ptr = createNode(childNodes);
+
+    switch (detail::passesJitHeuristics<Ti>(nodes)) {
+        case kJITHeuristics::Pass: {
+            return ptr;
+        }
+        case kJITHeuristics::TreeHeight:
+        case kJITHeuristics::KernelParameterSize: {
+            int max_height_index = 0;
+            int max_height       = 0;
+            for (int i = 0; i < N; i++) {
+                if (max_height < childNodes[i]->getHeight()) {
+                    max_height_index = i;
+                    max_height       = childNodes[i]->getHeight();
+                }
+            }
+            children[max_height_index]->eval();
+            return createNaryNode<Ti, N>(odims, createNode, move(children));
+        }
+        case kJITHeuristics::MemoryPressure: {
+            for (auto &c : children) { c->eval(); }  // TODO: use evalMultiple()
+            return ptr;
+        }
+    }
+    return ptr;
+}
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/Node.cpp b/src/backend/common/jit/Node.cpp
new file mode 100644
index 0000000000..09c001a724
--- /dev/null
+++ b/src/backend/common/jit/Node.cpp
@@ -0,0 +1,99 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <build_version.hpp>
+#include <common/defines.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/jit/Node.hpp>
+#include <common/util.hpp>
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+using std::vector;
+
+namespace arrayfire {
+namespace common {
+
+int Node::getNodesMap(Node_map_t &node_map, vector<Node *> &full_nodes,
+                      vector<Node_ids> &full_ids) {
+    auto iter = node_map.find(this);
+    if (iter == node_map.end()) {
+        Node_ids ids{};
+
+        for (int i = 0; i < kMaxChildren && m_children[i] != nullptr; i++) {
+            ids.child_ids[i] =
+                m_children[i]->getNodesMap(node_map, full_nodes, full_ids);
+        }
+        ids.id         = static_cast<int>(node_map.size());
+        node_map[this] = ids.id;
+        full_nodes.push_back(this);
+        full_ids.push_back(ids);
+        return ids.id;
+    }
+    return iter->second;
+}
+
+std::string getFuncName(const vector<Node *> &output_nodes,
+                        const vector<int> &output_ids,
+                        const vector<Node *> &full_nodes,
+                        const vector<Node_ids> &full_ids, const bool is_linear,
+                        const bool loop0, const bool loop1, const bool loop2,
+                        const bool loop3) {
+    std::string funcName;
+    funcName.reserve(512);
+    funcName = (is_linear ? 'L' : 'G');
+    funcName += (loop0 ? '0' : 'X');
+    funcName += (loop1 ? '1' : 'X');
+    funcName += (loop2 ? '2' : 'X');
+    funcName += (loop3 ? '3' : 'X');
+
+    for (const auto &node : output_nodes) {
+        funcName += '_';
+        funcName += node->getNameStr();
+    }
+
+    for (const int id : output_ids) {
+        funcName += '-';
+        funcName += std::to_string(id);
+    }
+
+    for (int i = 0; i < static_cast<int>(full_nodes.size()); i++) {
+        full_nodes[i]->genKerName(funcName, full_ids[i]);
+    }
+
+    return "KER" + std::to_string(deterministicHash(funcName));
+}
+
+bool NodePtr_equalto::operator()(const Node *l, const Node *r) const noexcept {
+    return *l == *r;
+}
+
+auto isBuffer(const Node &ptr) -> bool { return ptr.isBuffer(); }
+
+auto isScalar(const Node &ptr) -> bool { return ptr.isScalar(); }
+
+bool Node::isLinear(const dim_t dims[4]) const { return true; }
+
+/// This function returns true if the \p node is a Shift node or a Buffer node
+auto isBufferOrShift(const Node_ptr &node) -> bool {
+    return node->getNodeType() == kNodeType::Buffer ||
+           node->getNodeType() == kNodeType::Shift;
+}
+
+}  // namespace common
+}  // namespace arrayfire
+
+size_t std::hash<arrayfire::common::Node *>::operator()(
+    arrayfire::common::Node *const node) const noexcept {
+    arrayfire::common::Node *const node_ptr =
+        static_cast<arrayfire::common::Node *const>(node);
+    return node_ptr->getHash();
+}
diff --git a/src/backend/common/jit/Node.hpp b/src/backend/common/jit/Node.hpp
new file mode 100644
index 0000000000..794c10c14c
--- /dev/null
+++ b/src/backend/common/jit/Node.hpp
@@ -0,0 +1,417 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <common/defines.hpp>
+#include <optypes.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+#include <nonstd/span.hpp>
+#include <algorithm>
+#include <array>
+#include <functional>
+#include <memory>
+#include <sstream>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+enum class kJITHeuristics {
+    Pass                = 0, /* no eval necessary */
+    TreeHeight          = 1, /* eval due to jit tree height */
+    KernelParameterSize = 2, /* eval due to many kernel parameters */
+    MemoryPressure      = 3  /* eval due to memory pressure */
+};
+
+namespace arrayfire {
+namespace common {
+
+enum class kNodeType {
+    Generic = 0,
+    Scalar  = 1,
+    Buffer  = 2,
+    Nary    = 3,
+    Shift   = 4,
+};
+
+class Node;
+}  // namespace common
+}  // namespace arrayfire
+
+#ifdef AF_CPU
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void evalMultiple(std::vector<Param<T>> arrays,
+                  std::vector<std::shared_ptr<common::Node>> output_nodes_);
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
+#endif
+
+namespace std {
+template<>
+struct hash<arrayfire::common::Node *> {
+    /// Calls the getHash function of the Node pointer
+    size_t operator()(arrayfire::common::Node *const n) const noexcept;
+};
+}  // namespace std
+
+namespace arrayfire {
+namespace common {
+class Node;
+struct Node_ids;
+
+/// An equal_to class that calls the dereferenced nodes' equality operator
+struct NodePtr_equalto {
+    bool operator()(const Node *l, const Node *r) const noexcept;
+};
+
+using Node_map_t =
+    std::unordered_map<Node *, int, std::hash<Node *>, NodePtr_equalto>;
+using Node_map_iter = Node_map_t::iterator;
+
+using Node_ptr = std::shared_ptr<Node>;
+
+static const char *getFullName(af::dtype type) {
+    switch (type) {
+        case f32: return detail::getFullName<float>();
+        case f64: return detail::getFullName<double>();
+        case c32: return detail::getFullName<detail::cfloat>();
+        case c64: return detail::getFullName<detail::cdouble>();
+        case u32: return detail::getFullName<unsigned>();
+        case s32: return detail::getFullName<int>();
+        case u64: return detail::getFullName<unsigned long long>();
+        case s64: return detail::getFullName<long long>();
+        case u16: return detail::getFullName<unsigned short>();
+        case s16: return detail::getFullName<short>();
+        case b8: return detail::getFullName<char>();
+        case s8: return detail::getFullName<signed char>();
+        case u8: return detail::getFullName<unsigned char>();
+        case f16: return "half";
+    }
+    return "";
+}
+
+static const char *getShortName(af::dtype type) {
+    switch (type) {
+        case f32: return detail::shortname<float>();
+        case f64: return detail::shortname<double>();
+        case c32: return detail::shortname<detail::cfloat>();
+        case c64: return detail::shortname<detail::cdouble>();
+        case u32: return detail::shortname<unsigned>();
+        case s32: return detail::shortname<int>();
+        case u64: return detail::shortname<unsigned long long>();
+        case s64: return detail::shortname<long long>();
+        case u16: return detail::shortname<unsigned short>();
+        case s16: return detail::shortname<short>();
+        case b8: return detail::shortname<char>();
+        case s8: return detail::shortname<signed char>();
+        case u8: return detail::shortname<unsigned char>();
+        case f16: return "h";
+    }
+    return "";
+}
+
+class Node {
+   public:
+    static const int kMaxChildren = 3;
+
+   protected:
+   public:
+    std::array<Node_ptr, kMaxChildren> m_children;
+    af::dtype m_type;
+    int m_height;
+    kNodeType m_node_type = kNodeType::Generic;
+
+    template<typename T>
+    friend class NodeIterator;
+    Node() = default;
+    Node(const af::dtype type, const int height,
+         const std::array<Node_ptr, kMaxChildren> children, kNodeType node_type)
+        : m_children(children)
+        , m_type(type)
+        , m_height(height)
+        , m_node_type(node_type) {
+        static_assert(std::is_nothrow_move_assignable<Node>::value,
+                      "Node is not move assignable");
+    }
+
+    void swap(Node &other) noexcept {
+        using std::swap;
+        for (int i = 0; i < kMaxChildren; i++) {
+            swap(m_children[i], other.m_children[i]);
+        }
+        swap(m_type, other.m_type);
+        swap(m_height, other.m_height);
+    }
+
+    /// Default move constructor
+    Node(Node &&node) noexcept = default;
+
+    /// Default copy constructor
+    Node(const Node &node) = default;
+
+    /// Default copy assignment operator
+    Node &operator=(const Node &node) = default;
+
+    /// Default move assignment operator
+    Node &operator=(Node &&node) noexcept = default;
+
+    virtual af_op_t getOp() const noexcept { return af_none_t; }
+
+    int getNodesMap(Node_map_t &node_map, std::vector<Node *> &full_nodes,
+                    std::vector<Node_ids> &full_ids);
+
+    /// Generates the string that will be used to hash the kernel
+    virtual void genKerName(std::string &kerString,
+                            const Node_ids &ids) const = 0;
+
+    /// Generates the function parameters for the node.
+    ///
+    /// \param[in,out] kerStream  The stream the parameters are written to
+    /// \param[in]     id         The integer id of the node
+    /// \param[in]     is_linear  True if the kernel is a linear kernel
+    virtual void genParams(std::stringstream &kerStream, int id,
+                           bool is_linear) const {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    virtual void calc(int x, int y, int z, int w, int lim) {
+        UNUSED(x);
+        UNUSED(y);
+        UNUSED(z);
+        UNUSED(w);
+    }
+
+    virtual void calc(int idx, int lim) {
+        UNUSED(idx);
+        UNUSED(lim);
+    }
+
+    const std::array<Node_ptr, kMaxChildren> &getChildren() const {
+        return m_children;
+    }
+
+    /// Generates the variable that stores the thread's/work-item's offset into
+    /// the memory.
+    ///
+    /// \param[in,out] kerStream  The stream the offsets are written to
+    /// \param[in]     id         The integer id of the node
+    /// \param[in]     is_linear  True if the kernel is a linear kernel
+    virtual void genOffsets(std::stringstream &kerStream, int id,
+                            bool is_linear) const {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    /// Generates the code for the operation of the node.
+    ///
+    /// Generates the source code of the operation that the node needs to
+    /// perform. For example, this function will create the string
+    /// "val2 = __add(val1, val2);" for the addition node.
+    ///
+    /// \param[in,out] kerStream  The stream the code is written to
+    /// \param[in]     ids        The integer ids of the node and its
+    ///                           children
+    virtual void genFuncs(std::stringstream &kerStream,
+                          const Node_ids &ids) const = 0;
+
+    /// Calls the setArg function on each of the arguments passed into the
+    /// kernel
+    ///
+    /// \param[in] start_id The index of the starting argument
+    /// \param[in] is_linear determines if the kernel should be linear or not
+    /// \param[in] setArg the function that will be called for each argument
+    ///
+    /// \returns the next index that will need to be set in the kernel. This
+    ///          is usually start_id + the number of times setArg is called
+    virtual int setArgs(int start_id, bool is_linear,
+                        std::function<void(int id, const void *ptr,
+                                           size_t arg_size, bool is_buffer)>
+                            setArg) const {
+        UNUSED(is_linear);
+        UNUSED(setArg);
+        return start_id;
+    }
+
+    virtual void getInfo(unsigned &len, unsigned &buf_count,
+                         unsigned &bytes) const {
+        UNUSED(buf_count);
+        UNUSED(bytes);
+        len++;
+    }
+
+    // Return the size of the parameter in bytes that will be passed to the
+    // kernel
+    virtual size_t getParamBytes() const { return 0; }
+
+    // Return the size of the buffer node in bytes. Zero otherwise
+    virtual size_t getBytes() const { return 0; }
+
+    // Returns true if this node is a Buffer
+    bool isBuffer() const { return m_node_type == kNodeType::Buffer; }
+
+    // Returns true if this node is a Scalar
+    bool isScalar() const { return m_node_type == kNodeType::Scalar; }
+
+    /// Returns true if the buffer is linear
+    virtual bool isLinear(const dim_t dims[4]) const;
+
+    /// Returns the node type
+    kNodeType getNodeType() const { return m_node_type; }
+
+    /// Returns the type
+    af::dtype getType() const { return m_type; }
+
+    /// Returns the string representation of the type
+    std::string getTypeStr() const { return getFullName(m_type); }
+
+    /// Returns the height of the JIT tree from this node
+    int getHeight() const { return m_height; }
+
+    /// Returns the short name for this type
+    /// \note For the shift node this is "Sh" followed by the short name of the
+    ///       type
+    virtual std::string getNameStr() const { return getShortName(m_type); }
+
+    /// Default destructor
+    virtual ~Node() noexcept = default;
+
+    /// Returns the hash of the node. For all Nodes other than the Buffer node,
+    /// this combines the address of the object with its type and height
+    virtual size_t getHash() const noexcept {
+        std::hash<const void *> ptr_hash;
+        std::hash<af::dtype> aftype_hash;
+        std::hash<int> int_hash;
+        const void *ptr = this;
+        size_t h =
+            ptr_hash(ptr) ^ (aftype_hash(m_type) << 1) ^ (int_hash(m_height));
+        return h;
+    }
+
+    /// A very bad equality operator used only for the hash function.
+    virtual bool operator==(const Node &other) const noexcept {
+        return this == &other;
+    }
+    virtual std::unique_ptr<Node> clone() = 0;
+
+    virtual void modDims(const af::dim4 &newDim) {
+        UNUSED(newDim);
+    }
+
+#ifdef AF_CPU
+    template<typename U>
+    friend void arrayfire::cpu::kernel::evalMultiple(
+        std::vector<arrayfire::cpu::Param<U>> arrays,
+        std::vector<common::Node_ptr> output_nodes_);
+
+    virtual void setShape(af::dim4 new_shape) { UNUSED(new_shape); }
+
+#endif
+};
+
+struct Node_ids {
+    std::array<int, Node::kMaxChildren> child_ids;
+    int id;
+};
+
+std::string getFuncName(const std::vector<Node *> &output_nodes,
+                        const std::vector<int> &output_ids,
+                        const std::vector<Node *> &full_nodes,
+                        const std::vector<Node_ids> &full_ids,
+                        const bool is_linear, const bool loop0,
+                        const bool loop1, const bool loop2, const bool loop3);
+
+/// Returns true if the \p ptr is a Buffer Node
+auto isBuffer(const Node &ptr) -> bool;
+
+/// Returns true if the \p ptr is a Scalar Node
+auto isScalar(const Node &ptr) -> bool;
+
+/// Returns true if \p node is a Buffer or a Shift node
+auto isBufferOrShift(const Node_ptr &node) -> bool;
+
+template<typename T>
+inline void applyShifts(std::array<int, 4> &shifts, nonstd::span<T> dims) {
+    std::array<T, 4> out;
+    for (size_t i = 0; i < shifts.size(); i++) { out[i] = dims[shifts[i]]; }
+    std::copy(std::begin(out), std::end(out), std::begin(dims));
+}
+
+template<typename ArrayT>
+inline std::array<int, 4> compressArray(ArrayT dims) {
+    std::array<int, 4> shifts{0, 1, 2, 3};
+    bool changed;
+    do {
+        changed = false;
+        for (int i = 0; i < AF_MAX_DIMS - 1; i++) {
+            if (dims[i] == 1 && dims[i + 1] != 1) {
+                std::swap(dims[i], dims[i + 1]);
+                std::swap(shifts[i], shifts[i + 1]);
+                changed = true;
+            }
+        }
+    } while (changed);
+    return shifts;
+}
+
+/// Moves size-1 dimensions to the end for the outputs and the nodes in \p nodes
+template<typename ParamT, typename BufferNodeT, typename ShiftNodeT>
+void removeEmptyDimensions(nonstd::span<ParamT> outputs,
+                           nonstd::span<Node_ptr> nodes) {
+    dim_t *outDims{outputs[0].dims_ptr()};
+    dim_t *outStrides{outputs[0].strides_ptr()};
+    auto shifts = compressArray(outDims);
+    applyShifts<dim_t>(shifts, {outStrides, AF_MAX_DIMS});
+    for (auto nodeIt{begin(nodes)}, endIt{end(nodes)};
+         (nodeIt = find_if(nodeIt, endIt, isBufferOrShift)) != endIt;
+         ++nodeIt) {
+        switch ((*nodeIt)->getNodeType()) {
+            case kNodeType::Buffer: {
+                BufferNodeT *buf{static_cast<BufferNodeT *>(nodeIt->get())};
+                applyShifts<dim_t>(shifts,
+                                   {buf->m_param.dims_ptr(), AF_MAX_DIMS});
+                applyShifts<dim_t>(shifts,
+                                   {buf->m_param.strides_ptr(), AF_MAX_DIMS});
+            } break;
+            case kNodeType::Shift: {
+                ShiftNodeT &shiftNode{
+                    *static_cast<ShiftNodeT *>(nodeIt->get())};
+                BufferNodeT &buf{shiftNode.getBufferNode()};
+                applyShifts<dim_t>(shifts,
+                                   {buf.m_param.dims_ptr(), AF_MAX_DIMS});
+                applyShifts<dim_t>(shifts,
+                                   {buf.m_param.strides_ptr(), AF_MAX_DIMS});
+
+                auto &node_shifts = shiftNode.getShifts();
+                applyShifts<int>(shifts, node_shifts);
+            } break;
+            default: break;
+        }
+    }
+    std::for_each(
+        std::begin(outputs) + 1, std::end(outputs), [&shifts](ParamT &output) {
+            applyShifts<dim_t>(shifts, {output.dims_ptr(), AF_MAX_DIMS});
+            applyShifts<dim_t>(shifts, {output.strides_ptr(), AF_MAX_DIMS});
+        });
+}
+
+}  // namespace common
+}  // namespace arrayfire
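
Review note on `compressArray` and `applyShifts` in Node.hpp above: the pair first bubbles size-1 dimensions toward the end while recording the permutation, then replays that permutation on companion arrays such as strides. The following is a minimal standalone sketch of that idea, not the ArrayFire code itself. The names `kMaxDims`, `Dims`, `compressDims` and `permute` are hypothetical, and the sketch takes the dimensions by reference instead of through a pointer or span.

```cpp
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical stand-ins for illustration; not the ArrayFire signatures.
constexpr int kMaxDims = 4;
using Dims             = std::array<long long, kMaxDims>;

// Bubbles size-1 dimensions toward the end of `dims` (in place) and returns
// the permutation that was applied: result[i] is the original index of the
// dimension that now sits at position i.
inline std::array<int, kMaxDims> compressDims(Dims &dims) {
    std::array<int, kMaxDims> perm{0, 1, 2, 3};
    bool changed;
    do {
        changed = false;
        for (int i = 0; i < kMaxDims - 1; i++) {
            if (dims[i] == 1 && dims[i + 1] != 1) {
                std::swap(dims[i], dims[i + 1]);
                std::swap(perm[i], perm[i + 1]);
                changed = true;
            }
        }
    } while (changed);
    return perm;
}

// Applies the same permutation to a companion array such as the strides.
inline Dims permute(const std::array<int, kMaxDims> &perm, const Dims &in) {
    Dims out{};
    for (std::size_t i = 0; i < perm.size(); i++) { out[i] = in[perm[i]]; }
    return out;
}
```

With dimensions `{1, 5, 1, 3}` the sketch leaves `{5, 3, 1, 1}` in place and returns the permutation `{1, 3, 0, 2}`; replaying it on strides `{1, 1, 5, 5}` gives `{1, 5, 1, 5}`.
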
diff --git a/src/backend/common/jit/NodeIO.hpp b/src/backend/common/jit/NodeIO.hpp
new file mode 100644
index 0000000000..ac149d98d9
--- /dev/null
+++ b/src/backend/common/jit/NodeIO.hpp
@@ -0,0 +1,96 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/Node.hpp>
+#include <spdlog/fmt/bundled/format.h>
+
+#include <common/util.hpp>
+
+template<>
+struct fmt::formatter<af::dtype> : fmt::formatter<char> {
+    template<typename FormatContext>
+    auto format(const af::dtype& p, FormatContext& ctx) -> decltype(ctx.out()) {
+        format_to(ctx.out(), "{}", arrayfire::common::getName(p));
+        return ctx.out();
+    }
+};
+
+template<>
+struct fmt::formatter<arrayfire::common::Node> {
+    // Presentation flags: 'p' - pointer, 't' - type, 'c' - children,
+    // 'o' - operation. All are false until parse() sets them.
+    bool pointer  = false;
+    bool type     = false;
+    bool children = false;
+    bool op       = false;
+
+    // Parses format specifications of the form ['p' | 't' | 'c' | 'o']*.
+    constexpr auto parse(format_parse_context& ctx) -> decltype(ctx.begin()) {
+        auto it = ctx.begin(), end = ctx.end();
+
+        if (it == end || *it == '}') {
+            pointer = type = children = op = true;
+            return it;
+        }
+
+        while (it != end && *it != '}') {
+            switch (*it) {
+                case 'p': pointer = true; break;
+                case 't': type = true; break;
+                case 'c': children = true; break;
+                case 'o': op = true; break;
+                default: throw format_error("invalid format");
+            }
+            ++it;
+        }
+
+        // Return an iterator past the end of the parsed range:
+        return it;
+    }
+
+    // Formats the node using the parsed format specification (the flags)
+    // stored in this formatter.
+    template<typename FormatContext>
+    auto format(const arrayfire::common::Node& node, FormatContext& ctx)
+        -> decltype(ctx.out()) {
+        // ctx.out() is an output iterator to write to.
+
+        format_to(ctx.out(), "{{");
+        if (pointer) format_to(ctx.out(), "{} ", (void*)&node);
+        if (op) {
+            if (isBuffer(node)) {
+                format_to(ctx.out(), "buffer ");
+            } else if (isScalar(node)) {
+                // Scalars have no operation to print
+                format_to(ctx.out(), "scalar ");
+            } else {
+                format_to(ctx.out(), "{} ",
+                          arrayfire::common::toString(node.getOp()));
+            }
+        }
+        if (type) format_to(ctx.out(), "{} ", node.getType());
+        if (children) {
+            int count;
+            for (count = 0; count < arrayfire::common::Node::kMaxChildren &&
+                            node.m_children[count].get() != nullptr;
+                 count++) {}
+            if (count > 0) {
+                format_to(ctx.out(), "children: {{ ");
+                for (int i = 0; i < count; i++) {
+                    format_to(ctx.out(), "{} ", *(node.m_children[i].get()));
+                }
+                format_to(ctx.out(), "\b}} ");
+            }
+        }
+        format_to(ctx.out(), "\b}}");
+
+        return ctx.out();
+    }
+};
diff --git a/src/backend/common/jit/NodeIterator.hpp b/src/backend/common/jit/NodeIterator.hpp
new file mode 100644
index 0000000000..7359316c65
--- /dev/null
+++ b/src/backend/common/jit/NodeIterator.hpp
@@ -0,0 +1,105 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+
+#include <cstddef>
+#include <iterator>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+/// A node iterator that performs a breadth first traversal of the node tree
+template<typename Node = common::Node>
+class NodeIterator {
+   public:
+    using iterator_category = std::input_iterator_tag;
+    using value_type        = Node;
+    using difference_type   = std::ptrdiff_t;
+    using pointer           = Node*;
+    using reference         = Node&;
+
+   private:
+    std::vector<pointer> tree;
+    size_t index = 0;
+
+    /// Copies the children of the \p n Node to the end of the tree vector
+    void copy_children_to_end(Node* n) {
+        for (int i = 0; i < Node::kMaxChildren && n->m_children[i] != nullptr;
+             i++) {
+            auto ptr = n->m_children[i].get();
+            if (find(begin(tree), end(tree), ptr) == end(tree)) {
+                tree.push_back(ptr);
+            }
+        }
+    }
+
+   public:
+    /// NodeIterator Constructor
+    ///
+    /// \param[in] root The root node of the tree
+    NodeIterator(pointer root) : tree{root} {
+        tree.reserve(root->getHeight() * 8);
+    }
+
+    /// The equality operator
+    ///
+    /// \param[in] other the iterator on the right-hand side of the comparison
+    bool operator==(const NodeIterator& other) const noexcept {
+        // If the tree vector is empty in the other iterator then this means
+        // that the other iterator is a sentinel(end) node.
+        if (other.tree.empty()) {
+            // If the index is the same as the tree size then the index is past
+            // the end of the tree
+            return index == tree.size();
+        }
+        return index == other.index && tree == other.tree;
+    }
+
+    bool operator!=(const NodeIterator& other) const noexcept {
+        return !operator==(other);
+    }
+
+    /// Advances the iterator by one node in the tree
+    NodeIterator& operator++() noexcept {
+        if (index < tree.size()) { copy_children_to_end(tree[index]); }
+        index++;
+        return *this;
+    }
+
+    /// @copydoc operator++()
+    NodeIterator operator++(int) noexcept {
+        NodeIterator before(*this);
+        operator++();
+        return before;
+    }
+
+    /// Advances the iterator by count nodes
+    NodeIterator& operator+=(std::size_t count) noexcept {
+        while (count-- > 0) { operator++(); }
+        return *this;
+    }
+
+    reference operator*() const noexcept { return *tree[index]; }
+
+    pointer operator->() const noexcept { return tree[index]; }
+
+    /// Creates a sentinel iterator. This is equivalent to the end iterator
+    NodeIterator()                                         = default;
+    NodeIterator(const NodeIterator& other)                = default;
+    NodeIterator(NodeIterator&& other) noexcept            = default;
+    ~NodeIterator() noexcept                               = default;
+    NodeIterator& operator=(const NodeIterator& other)     = default;
+    NodeIterator& operator=(NodeIterator&& other) noexcept = default;
+};
+
+}  // namespace common
+}  // namespace arrayfire
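
Review note on `NodeIterator` above: the traversal keeps one vector that serves both as the breadth-first worklist and as the visited set, so a node shared by several parents is reached only once. The following is a minimal standalone sketch of that technique under simplifying assumptions. `DemoNode` and `breadthFirstValues` are hypothetical names, and the walk is written as a plain loop and not as an iterator.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical minimal node type; the real Node carries much more state.
struct DemoNode {
    int value = 0;
    std::array<std::shared_ptr<DemoNode>, 3> children{};
};

// Visits every node reachable from `root` once, in breadth-first order. The
// worklist doubles as the visited set, as in NodeIterator.
inline std::vector<int> breadthFirstValues(DemoNode *root) {
    std::vector<DemoNode *> tree{root};
    std::vector<int> order;
    for (std::size_t index = 0; index < tree.size(); index++) {
        DemoNode *n = tree[index];
        order.push_back(n->value);
        for (const auto &child : n->children) {
            if (!child) { break; }  // children are packed from the front
            DemoNode *ptr = child.get();
            if (std::find(tree.begin(), tree.end(), ptr) == tree.end()) {
                tree.push_back(ptr);
            }
        }
    }
    return order;
}
```

The linear `std::find` keeps the sketch short and matches the patch; for very large trees a separate hash set of visited pointers would be the usual alternative.
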
diff --git a/src/backend/common/jit/ScalarNode.hpp b/src/backend/common/jit/ScalarNode.hpp
new file mode 100644
index 0000000000..4236ec4725
--- /dev/null
+++ b/src/backend/common/jit/ScalarNode.hpp
@@ -0,0 +1,97 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <common/jit/Node.hpp>
+#include <af/traits.hpp>
+
+#include <math.hpp>
+#include <types.hpp>
+#include <iomanip>
+
+namespace arrayfire {
+namespace common {
+
+template<typename T>
+class ScalarNode : public common::Node {
+   private:
+    const T m_val;
+
+   public:
+    ScalarNode(T val)
+        : Node(static_cast<af::dtype>(af::dtype_traits<T>::af_type), 0, {},
+               kNodeType::Scalar)
+        , m_val(val) {
+        static_assert(std::is_nothrow_move_assignable<ScalarNode>::value,
+                      "ScalarNode is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<ScalarNode>::value,
+                      "ScalarNode is not move constructible");
+    }
+
+    /// Default copy constructor
+    ScalarNode(const ScalarNode& other) = default;
+
+    /// Default move constructor
+    ScalarNode(ScalarNode&& other) = default;
+
+    /// Default move/copy assignment operator(Rule of 4)
+    ScalarNode& operator=(ScalarNode node) noexcept {
+        swap(node);
+        return *this;
+    }
+
+    std::unique_ptr<Node> clone() final {
+        return std::make_unique<ScalarNode>(*this);
+    }
+
+    // Swap specialization
+    void swap(ScalarNode& other) noexcept {
+        using std::swap;
+        Node::swap(other);
+        swap(m_val, other.m_val);
+    }
+
+    void genKerName(std::string& kerString,
+                    const common::Node_ids& ids) const final {
+        kerString += '_';
+        kerString += getTypeStr();
+        kerString += ',';
+        kerString += std::to_string(ids.id);
+    }
+
+    void genParams(std::stringstream& kerStream, int id,
+                   bool is_linear) const final {
+        UNUSED(is_linear);
+        kerStream << getTypeStr() << " scalar" << id << ", \n";
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void* ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const final {
+        UNUSED(is_linear);
+        setArg(start_id, static_cast<const void*>(&m_val), sizeof(T), false);
+        return start_id + 1;
+    }
+
+    void genFuncs(std::stringstream& kerStream,
+                  const common::Node_ids& ids) const final {
+        kerStream << getTypeStr() << " val" << ids.id << " = scalar" << ids.id
+                  << ";\n";
+    }
+
+    std::string getNameStr() const final { return detail::shortname<T>(false); }
+
+    // Return the info for the params and the size of the buffers
+    virtual size_t getParamBytes() const final { return sizeof(T); }
+};
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/ShiftNodeBase.hpp b/src/backend/common/jit/ShiftNodeBase.hpp
new file mode 100644
index 0000000000..553f4a16a1
--- /dev/null
+++ b/src/backend/common/jit/ShiftNodeBase.hpp
@@ -0,0 +1,128 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <jit/BufferNode.hpp>
+#include <jit/kernel_generators.hpp>
+
+#include <backend.hpp>
+#include <iomanip>
+
+#include <array>
+#include <memory>
+#include <sstream>
+#include <string>
+
+namespace arrayfire {
+namespace common {
+
+template<typename BufferNode>
+class ShiftNodeBase : public Node {
+   private:
+    std::shared_ptr<BufferNode> m_buffer_node;
+    std::array<int, 4> m_shifts;
+
+   public:
+    ShiftNodeBase(const af::dtype type, std::shared_ptr<BufferNode> buffer_node,
+                  const std::array<int, 4> shifts)
+        : Node(type, 0, {}, kNodeType::Shift)
+        , m_buffer_node(buffer_node)
+        , m_shifts(shifts) {
+        static_assert(std::is_nothrow_move_assignable<ShiftNodeBase>::value,
+                      "ShiftNode is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<ShiftNodeBase>::value,
+                      "ShiftNode is not move constructible");
+    }
+
+    /// Default copy constructor
+    ShiftNodeBase(const ShiftNodeBase &other) = default;
+
+    /// Default move constructor
+    ShiftNodeBase(ShiftNodeBase &&other) = default;
+
+    /// Default move/copy assignment operator(Rule of 4)
+    ShiftNodeBase &operator=(ShiftNodeBase node) noexcept {
+        swap(node);
+        return *this;
+    }
+
+    std::array<int, 4> &getShifts() { return m_shifts; }
+
+    std::unique_ptr<Node> clone() final {
+        return std::make_unique<ShiftNodeBase>(*this);
+    }
+
+    // Swap specialization
+    void swap(ShiftNodeBase &other) noexcept {
+        using std::swap;
+        Node::swap(other);
+        swap(m_buffer_node, other.m_buffer_node);
+        swap(m_shifts, other.m_shifts);
+    }
+
+    BufferNode &getBufferNode() { return *m_buffer_node; }
+    const BufferNode &getBufferNode() const { return *m_buffer_node; }
+
+    bool isLinear(const dim_t dims[4]) const final {
+        UNUSED(dims);
+        return false;
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        kerString += '_';
+        kerString += getNameStr();
+        kerString += ',';
+        kerString += std::to_string(ids.id);
+    }
+
+    void genParams(std::stringstream &kerStream, int id,
+                   bool is_linear) const final {
+        m_buffer_node->genParams(kerStream, id, is_linear);
+        for (int i = 0; i < 4; i++) {
+            kerStream << "int shift" << id << "_" << i << ",\n";
+        }
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void *ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const {
+        int curr_id = m_buffer_node->setArgs(start_id, is_linear, setArg);
+        for (int i = 0; i < 4; i++) {
+            const int &d = m_shifts[i];
+            setArg(curr_id + i, static_cast<const void *>(&d), sizeof(int),
+                   false);
+        }
+        return curr_id + 4;
+    }
+
+    void genOffsets(std::stringstream &kerStream, int id,
+                    bool is_linear) const final {
+        detail::generateShiftNodeOffsets(kerStream, id, is_linear,
+                                         getTypeStr());
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        detail::generateShiftNodeRead(kerStream, ids.id, getTypeStr());
+    }
+
+    void getInfo(unsigned &len, unsigned &buf_count,
+                 unsigned &bytes) const final {
+        m_buffer_node->getInfo(len, buf_count, bytes);
+    }
+
+    std::string getNameStr() const final {
+        return std::string("Sh") + getShortName(m_type);
+    }
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/jit/UnaryNode.hpp b/src/backend/common/jit/UnaryNode.hpp
new file mode 100644
index 0000000000..c847bd9f91
--- /dev/null
+++ b/src/backend/common/jit/UnaryNode.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/NaryNode.hpp>
+
+namespace arrayfire {
+namespace common {
+
+class UnaryNode : public NaryNode {
+   public:
+    UnaryNode(const af::dtype type, const char *op_str, Node_ptr child,
+              af_op_t op)
+        : NaryNode(type, op_str, 1, {{child}}, op, child->getHeight() + 1) {
+        static_assert(std::is_nothrow_move_assignable<UnaryNode>::value,
+                      "UnaryNode is not move assignable");
+        static_assert(std::is_nothrow_move_constructible<UnaryNode>::value,
+                      "UnaryNode is not move constructible");
+    }
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/kernel_cache.cpp b/src/backend/common/kernel_cache.cpp
new file mode 100644
index 0000000000..423204ba6b
--- /dev/null
+++ b/src/backend/common/kernel_cache.cpp
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if !defined(AF_CPU) && !defined(AF_ONEAPI)
+
+#include <common/compile_module.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/kernel_cache.hpp>
+#include <device_manager.hpp>
+#include <platform.hpp>
+
+#include <nonstd/span.hpp>
+#include <shared_mutex>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+using detail::Kernel;
+using detail::Module;
+
+using nonstd::span;
+using std::array;
+using std::back_inserter;
+using std::shared_lock;
+using std::shared_timed_mutex;
+using std::string;
+using std::to_string;
+using std::transform;
+using std::unique_lock;
+using std::unordered_map;
+using std::vector;
+
+namespace arrayfire {
+namespace common {
+
+using ModuleMap = unordered_map<size_t, Module>;
+
+shared_timed_mutex& getCacheMutex(const int device) {
+    static shared_timed_mutex mutexes[detail::DeviceManager::MAX_DEVICES];
+    return mutexes[device];
+}
+
+ModuleMap& getCache(const int device) {
+    static ModuleMap* caches =
+        new ModuleMap[detail::DeviceManager::MAX_DEVICES];
+    return caches[device];
+}
+
+Module findModule(const int device, const size_t& key) {
+    shared_lock<shared_timed_mutex> readLock(getCacheMutex(device));
+    auto& cache = getCache(device);
+    auto iter   = cache.find(key);
+    if (iter != cache.end()) { return iter->second; }
+    return Module{};
+}
+
+Kernel getKernel(const string& kernelName, span<const common::Source> sources,
+                 span<const TemplateArg> targs, span<const string> options,
+                 const bool sourceIsJIT) {
+    string tInstance = kernelName;
+
+#if defined(AF_CUDA)
+    auto targsIt  = targs.begin();
+    auto targsEnd = targs.end();
+    if (targsIt != targsEnd) {
+        tInstance += '<' + targsIt->_tparam;
+        while (++targsIt != targsEnd) { tInstance += ',' + targsIt->_tparam; }
+        tInstance += '>';
+    }
+#else
+    UNUSED(targs);
+#endif
+
+    // The JIT kernel uses the hash of the kernelName (tInstance) only to
+    // speed up the search for its cached kernel. All the other kernels have
+    // the full source code linked in, and will hash the full code + options
+    // instead.
+    size_t moduleKeyCache = 0;
+    if (sourceIsJIT) {
+        moduleKeyCache = deterministicHash(tInstance);
+    } else {
+        moduleKeyCache = (sources.size() == 1 && sources[0].hash)
+                             ? sources[0].hash
+                             : deterministicHash(sources);
+        moduleKeyCache = deterministicHash(options, moduleKeyCache);
+#if defined(AF_CUDA)
+        moduleKeyCache = deterministicHash(tInstance, moduleKeyCache);
+#endif
+    }
+    const int device  = detail::getActiveDeviceId();
+    Module currModule = findModule(device, moduleKeyCache);
+
+    if (!currModule) {
+        // When saving on disk, the moduleKeyDisk has to correspond with the
+        // full code + options (in all circumstances). A recalculation for JIT
+        // is necessary, while for the others we can reuse the moduleKeyCache.
+        size_t moduleKeyDisk = 0;
+        if (sourceIsJIT) {
+            moduleKeyDisk = (sources.size() == 1 && sources[0].hash)
+                                ? sources[0].hash
+                                : deterministicHash(sources);
+            moduleKeyDisk = deterministicHash(options, moduleKeyDisk);
+#if defined(AF_CUDA)
+            moduleKeyDisk = deterministicHash(tInstance, moduleKeyDisk);
+#endif
+        } else {
+            moduleKeyDisk = moduleKeyCache;
+        }
+        currModule =
+            loadModuleFromDisk(device, to_string(moduleKeyDisk), sourceIsJIT);
+        if (!currModule) {
+            vector<string> sources_str;
+            for (const auto& s : sources) {
+                sources_str.push_back({s.ptr, s.length});
+            }
+            currModule = compileModule(to_string(moduleKeyDisk), sources_str,
+                                       options, array{tInstance}, sourceIsJIT);
+        }
+
+        unique_lock<shared_timed_mutex> writeLock(getCacheMutex(device));
+        auto& cache = getCache(device);
+        auto iter   = cache.find(moduleKeyCache);
+        if (iter == cache.end()) {
+            // If not found, this thread is the first one to compile
+            // this kernel. Keep the generated module.
+            Module mod = currModule;
+            getCache(device).emplace(moduleKeyCache, mod);
+        } else {
+            currModule.unload();  // dump the current thread's extra
+                                  // compilation
+            currModule = iter->second;
+        }
+    }
+    return getKernel(currModule, tInstance, sourceIsJIT);
+}
+
+}  // namespace common
+}  // namespace arrayfire
+
+#endif
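
Review note on `getKernel` in kernel_cache.cpp above: the lookup takes a shared lock for the read, compiles without holding any lock, and then takes a unique lock to insert only if no other thread got there first. The following is a minimal standalone sketch of that pattern, assuming a string stands in for a compiled module. `DemoCache` and `getOrCompile` are hypothetical names and not part of ArrayFire.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical cache of "compiled" strings keyed by hash.
class DemoCache {
    std::shared_timed_mutex mutex_;
    std::unordered_map<std::size_t, std::string> entries_;

   public:
    // Returns the cached value for `key`, building it with `compile` on a
    // miss. `compile` runs without any lock held, so two threads may both
    // build; the first to insert wins and the other result is dropped.
    template<typename CompileFn>
    std::string getOrCompile(std::size_t key, CompileFn compile) {
        {
            std::shared_lock<std::shared_timed_mutex> readLock(mutex_);
            auto iter = entries_.find(key);
            if (iter != entries_.end()) { return iter->second; }
        }
        std::string built = compile();
        std::unique_lock<std::shared_timed_mutex> writeLock(mutex_);
        // emplace leaves an existing entry untouched, so this returns the
        // winner's value either way.
        return entries_.emplace(key, std::move(built)).first->second;
    }
};
```

Compiling outside the lock trades an occasional duplicate build for never blocking readers during a slow compilation, which appears to be the same trade-off the patch makes when it unloads the extra module.
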
diff --git a/src/backend/common/kernel_cache.hpp b/src/backend/common/kernel_cache.hpp
new file mode 100644
index 0000000000..50602963b1
--- /dev/null
+++ b/src/backend/common/kernel_cache.hpp
@@ -0,0 +1,110 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#if !defined(AF_CPU)
+
+#include <Kernel.hpp>
+#include <Module.hpp>
+#include <backend.hpp>
+#include <common/Source.hpp>
+#include <common/TemplateTypename.hpp>
+#include <common/util.hpp>
+
+#include <nonstd/span.hpp>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace common {
+
+/// \brief Find/Create-Cache a Kernel that fits the given criteria
+///
+/// Besides the kernel name, this function uses the template arguments and
+/// compile options as match criteria to find a kernel in the Kernel cache.
+/// It builds and caches a new Kernel object if one isn't found in the cache.
+///
+/// The parameter \p kernelName has to be the unique name for a given kernel.
+/// The name has to be present in one of the entries of KernelMap defined in
+/// the header EnqueueArgs.hpp.
+///
+/// The parameter \p templateArgs is a list of stringified template arguments of
+/// the kernel. These strings are used to generate the kernel's template
+/// instantiation expression during compilation. That expression is also used
+/// as the key into the kernel cache map. A future goal is to use these strings
+/// to generate template instantiations in the online compiler.
+///
+/// The parameter \p options is a list of strings that lets you add
+/// definitions such as `-D<NAME>` or `-D<NAME>=<VALUE>` to the compiler. To
+/// enable easy stringification of variables into their definition equation,
+/// three helper macros are provided: TemplateArg, DefineKey and DefineValue.
+///
+/// Example Usage: transpose
+///
+/// \code
+/// auto transpose = getKernel("arrayfire::cuda::transpose",
+///         {{transpose_cuh_src}},
+///         {
+///           TemplateTypename<T>(),
+///           TemplateArg(conjugate),
+///           TemplateArg(is32multiple)
+///         },
+///         {
+///           DefineValue(THREADS_Y), // Results in a definition
+///                                   // "-D THREADS_Y=<Value of THREADS_Y>"
+///           DefineKeyValue(DIMY, threads_y)  // Results in a definition
+///                                            // "-D DIMY=<Value of threads_y>"
+///         }
+///         );
+/// \endcode
+///
+/// \param[in] kernelName is the fully qualified name of the kernel in code
+/// \param[in] sources is the list of common::Source to be compiled if required
+/// \param[in] templateArgs is a vector of strings containing stringified names
+///            of the template arguments of kernel to be compiled.
+/// \param[in] options is a vector of strings that enables the user to
+///            add definitions such as `-D<NAME>` or `-D<NAME>=<VALUE>` for
+///            the kernel compilation.
+///
+detail::Kernel getKernel(const std::string& kernelName,
+                         nonstd::span<const common::Source> sources,
+                         nonstd::span<const TemplateArg> templateArgs,
+                         nonstd::span<const std::string> options = {},
+                         const bool sourceIsJIT                  = false);
+
+/// \brief Lookup a Module that matches the given key
+///
+/// This function is intended to be used by JIT only. Usage in other
+/// places will most likely result in Module{nullptr}. If by
+/// chance you do get a match for non-jit usage, it is accidental and
+/// such Module will not work as expected.
+///
+/// \param[in] device is index of device in given backend for which
+///            the module look up has to be done
+/// \param[in] key is hash generated from code + options + kernel_name
+///            at caller scope
+detail::Module findModule(const int device, const std::size_t& key);
+
+/// \brief Get Kernel object for given name from given Module
+///
+/// This function is intended to be used by JIT and compileKernel only.
+/// Usage in other places may have undefined behaviour.
+///
+/// \param[in] mod is cache entry from module map.
+/// \param[in] name is actual kernel name or its template instantiation
+/// \param[in] sourceWasJIT is used to fetch mangled name for given module
+///            associated with \p name
+detail::Kernel getKernel(const detail::Module& mod, const std::string& name,
+                         const bool sourceWasJIT);
+
+}  // namespace common
+}  // namespace arrayfire
+
+#endif
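Review note on the \p options documentation above: the helper macros turn a variable into a "-D NAME=VALUE" string. The sketch below shows one way such stringification can work. SKETCH_DEFINE_VALUE, SKETCH_DEFINE_KEY_VALUE and buildOptions are hypothetical names made up for this note; ArrayFire's own macros may be defined differently.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helpers for illustration only.
// SKETCH_DEFINE_VALUE(NAME)         -> "-D NAME=<value of NAME>"
// SKETCH_DEFINE_KEY_VALUE(KEY, VAL) -> "-D KEY=<value of VAL>"
#define SKETCH_DEFINE_VALUE(NAME) \
    (std::string("-D " #NAME "=") + std::to_string(NAME))
#define SKETCH_DEFINE_KEY_VALUE(KEY, VALUE) \
    (std::string("-D " #KEY "=") + std::to_string(VALUE))

// Builds the compiler option list for an example launch configuration.
std::vector<std::string> buildOptions(int threads_y) {
    assert(threads_y > 0);
    constexpr int THREADS_Y = 8;  // example compile-time constant
    return {SKETCH_DEFINE_VALUE(THREADS_Y),
            SKETCH_DEFINE_KEY_VALUE(DIMY, threads_y)};
}
```

The `#NAME` operator stringifies the identifier as written, while `std::to_string` supplies the runtime value, so the name and the value of a variable end up in one definition.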
diff --git a/src/backend/common/kernel_type.hpp b/src/backend/common/kernel_type.hpp
new file mode 100644
index 0000000000..9d833b7e4b
--- /dev/null
+++ b/src/backend/common/kernel_type.hpp
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace common {
+
+/// \brief Maps a type between its data representation and the type used
+///        during compute operations
+///
+/// This struct defines two types. The data type is used to reference the
+/// data of an array. The compute type will be used during the computation.
+/// The kernel is responsible for converting from the data type to the
+/// computation type.
+/// For most types these types will be the same. For fp16 type the compute
+/// type will be float on platforms that don't support 16 bit floating point
+/// operations.
+template<typename T>
+struct kernel_type {
+    /// The type used to represent the data values
+    using data = T;
+
+    /// The type used when performing a computation
+    using compute = T;
+
+    /// The type defined by the compute framework for this type
+    using native = compute;
+};
+}  // namespace common
+}  // namespace arrayfire
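Review note on kernel_type: the comment says the compute type differs from the data type for fp16 on platforms without 16-bit arithmetic. A specialization along the following lines would express that. This is a sketch under assumptions: half_bits, kernel_type_sketch and needsConversion are hypothetical names, and the backend-specific specializations are not part of this hunk.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical 16-bit storage type standing in for a real fp16 type.
struct half_bits {
    std::uint16_t bits;
};

// Primary template, mirroring the struct in the header: all three aliases
// name the same type.
template<typename T>
struct kernel_type_sketch {
    using data    = T;
    using compute = T;
    using native  = compute;
};

// Assumed specialization: store 16 bits, compute in float.
template<>
struct kernel_type_sketch<half_bits> {
    using data    = half_bits;
    using compute = float;
    using native  = compute;
};

// True when a kernel must convert between the storage and compute types.
template<typename T>
constexpr bool needsConversion() {
    return !std::is_same<typename kernel_type_sketch<T>::data,
                         typename kernel_type_sketch<T>::compute>::value;
}

static_assert(!needsConversion<float>(), "float computes in float");
static_assert(needsConversion<half_bits>(), "fp16 storage computes in float");
```

Kernels written against `data` and `compute` stay unchanged when a backend adds or removes native fp16 support; only the specialization changes.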
diff --git a/src/backend/common/lapacke.cpp b/src/backend/common/lapacke.cpp
new file mode 100644
index 0000000000..3bba5b5a5a
--- /dev/null
+++ b/src/backend/common/lapacke.cpp
@@ -0,0 +1,309 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(__APPLE__) && !defined(AF_CUDA)
+#include <Accelerate/Accelerate.h>
+#include <common/defines.hpp>
+#include <common/lapacke.hpp>
+#include <traits.hpp>
+
+#include <algorithm>
+#include <cstdint>
+#include <vector>
+
+#if INTPTR_MAX == INT16_MAX
+#define BS 16
+#elif INTPTR_MAX == INT32_MAX
+#define BS 32
+#elif INTPTR_MAX == INT64_MAX
+#define BS 64
+#else
+#define BS 32
+#endif
+
+#define LAPACK_FUNC(X, T, TO)                                                  \
+    int LAPACKE_##X##geqrf(int layout, int M, int N, T *A, int lda, T *tau) {  \
+        UNUSED(layout);                                                        \
+        int lwork = N * BS;                                                    \
+        T *work   = new T[lwork];                                              \
+        int info  = 0;                                                         \
+        X##geqrf_(&M, &N, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);      \
+        delete[] work;                                                         \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##geqrf_work(int layout, int M, int N, T *A, int lda,       \
+                                T *tau, T *work, int lwork) {                  \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##geqrf_(&M, &N, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);      \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##getrf(int layout, int M, int N, T *A, int lda,            \
+                           int *pivot) {                                       \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##getrf_(&M, &N, (TO)A, &lda, pivot, &info);                          \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##getrs(int layout, char trans, int M, int N, const T *A,   \
+                           int lda, const int *pivot, T *B, int ldb) {         \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##getrs_(&trans, &M, &N, (TO)A, &lda, (int *)pivot, (TO)B, &ldb,      \
+                  &info);                                                      \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##potrf(int layout, char uplo, int N, T *A, int lda) {      \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##potrf_(&uplo, &N, (TO)A, &lda, &info);                              \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##gesv(int layout, int N, int nrhs, T *A, int lda,          \
+                          int *pivot, T *B, int ldb) {                         \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##gesv_(&N, &nrhs, (TO)A, &lda, pivot, (TO)B, &ldb, &info);           \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##gels(int layout, char trans, int M, int N, int nrhs,      \
+                          T *A, int lda, T *B, int ldb) {                      \
+        UNUSED(layout);                                                        \
+        int lwork = std::min(M, N) + std::max(M, std::max(N, nrhs)) * BS;      \
+        T *work   = new T[lwork];                                              \
+        int info  = 0;                                                         \
+        X##gels_(&trans, &M, &N, &nrhs, (TO)A, &lda, (TO)B, &ldb, (TO)work,    \
+                 &lwork, &info);                                               \
+        delete[] work;                                                         \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##getri(int layout, int N, T *A, int lda,                   \
+                           const int *pivot) {                                 \
+        UNUSED(layout);                                                        \
+        int lwork = N * BS;                                                    \
+        T *work   = new T[lwork];                                              \
+        int info  = 0;                                                         \
+        X##getri_(&N, (TO)A, &lda, const_cast<int *>(pivot), (TO)work, &lwork, \
+                  &info);                                                      \
+        delete[] work;                                                         \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##trtri(int layout, char uplo, char diag, int N, T *A,      \
+                           int lda) {                                          \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##trtri_(&uplo, &diag, &N, (TO)A, &lda, &info);                       \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##trtrs(int layout, char uplo, char trans, char diag,       \
+                           int N, int NRHS, const T *A, int lda, T *B,         \
+                           int ldb) {                                          \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##trtrs_(&uplo, &trans, &diag, &N, &NRHS, (TO)A, &lda, (TO)B, &ldb,   \
+                  &info);                                                      \
+        return info;                                                           \
+    }                                                                          \
+    int LAPACKE_##X##larft(int layout, char direct, char storev, int N, int K, \
+                           const T *v, int ldv, const T *tau, T *t, int ldt) { \
+        UNUSED(layout);                                                        \
+        X##larft_(&direct, &storev, &N, &K, (TO)v, &ldv,                       \
+                  (TO) const_cast<T *>(tau), (TO)t, &ldt);                     \
+        return 0;                                                              \
+    }                                                                          \
+    int LAPACKE_##X##laswp(int layout, int N, T *A, int lda, int k1, int k2,   \
+                           const int *pivot, int incx) {                       \
+        UNUSED(layout);                                                        \
+        X##laswp_(&N, (TO)A, &lda, &k1, &k2, const_cast<int *>(pivot), &incx); \
+        return 0;                                                              \
+    }
+
+LAPACK_FUNC(s, float, float *)
+LAPACK_FUNC(d, double, double *)
+LAPACK_FUNC(c, cfloat, __CLPK_complex *)
+LAPACK_FUNC(z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_GQR(P, X, T, TO)                                             \
+    int LAPACKE_##X##P(int layout, int M, int N, int K, T *A, int lda,      \
+                       const T *tau) {                                      \
+        UNUSED(layout);                                                     \
+        int lwork = N * 32;                                                 \
+        T *work   = new T[lwork];                                           \
+        int info  = 0;                                                      \
+        X##P##_(&M, &N, &K, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info); \
+        delete[] work;                                                      \
+        return info;                                                        \
+    }
+
+LAPACK_GQR(orgqr, s, float, float *)
+LAPACK_GQR(orgqr, d, double, double *)
+LAPACK_GQR(ungqr, c, cfloat, __CLPK_complex *)
+LAPACK_GQR(ungqr, z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_GQR_WORK(P, X, T, TO)                                          \
+    int LAPACKE_##X##P##_work(int layout, int M, int N, int K, T *A, int lda, \
+                              const T *tau, T *work, int lwork) {             \
+        UNUSED(layout);                                                       \
+        int info = 0;                                                         \
+        X##P##_(&M, &N, &K, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);   \
+        return info;                                                          \
+    }
+
+LAPACK_GQR_WORK(orgqr, s, float, float *)
+LAPACK_GQR_WORK(orgqr, d, double, double *)
+LAPACK_GQR_WORK(ungqr, c, cfloat, __CLPK_complex *)
+LAPACK_GQR_WORK(ungqr, z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_MQR_WORK(P, X, T, TO)                                           \
+    int LAPACKE_##X##P##_work(int layout, char side, char trans, int M, int N, \
+                              int K, const T *A, int lda, const T *tau, T *c,  \
+                              int ldc, T *work, int lwork) {                   \
+        UNUSED(layout);                                                        \
+        int info = 0;                                                          \
+        X##P##_(&side, &trans, &M, &N, &K, (TO)A, &lda, (TO)tau, (TO)c, &ldc,  \
+                (TO)work, &lwork, &info);                                      \
+        return info;                                                           \
+    }
+
+LAPACK_MQR_WORK(ormqr, s, float, float *)
+LAPACK_MQR_WORK(ormqr, d, double, double *)
+LAPACK_MQR_WORK(unmqr, c, cfloat, __CLPK_complex *)
+LAPACK_MQR_WORK(unmqr, z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_GESDD_REAL(P, X, T, Tr, TO)                                   \
+    int LAPACKE_##X##P(int layout, char jobz, int m, int n, T *in, int ldin, \
+                       Tr *s, T *u, int ldu, T *vt, int ldvt) {              \
+        UNUSED(layout);                                                      \
+        int info     = 0;                                                    \
+        int lwork    = -1;                                                   \
+        T work_param = 0;                                                    \
+        X##P##_(&jobz, &m, &n, (TO)in, &ldin, s, (TO)u, &ldu, (TO)vt, &ldvt, \
+                &work_param, &lwork, NULL, &info);                           \
+        lwork = work_param;                                                  \
+        std::vector<T> work(lwork);                                          \
+        std::vector<int> iwork(8 * std::min(m, n));                          \
+        X##P##_(&jobz, &m, &n, (TO)in, &ldin, s, (TO)u, &ldu, (TO)vt, &ldvt, \
+                (TO)&work[0], &lwork, &iwork[0], &info);                     \
+        return info;                                                         \
+    }
+
+#define LAPACK_GESDD_CPLX(P, X, T, Tr, TO)                                   \
+    int LAPACKE_##X##P(int layout, char jobz, int m, int n, T *in, int ldin, \
+                       Tr *s, T *u, int ldu, T *vt, int ldvt) {              \
+        UNUSED(layout);                                                      \
+        int info   = 0;                                                      \
+        int max_mn = std::max(m, n);                                         \
+        int min_mn = std::min(m, n);                                         \
+        int lwork  = 5 * max_mn;                                             \
+        std::vector<T> work(lwork);                                          \
+        std::vector<int> iwork(8 * std::min(m, n));                          \
+        int irwork = std::max(                                               \
+            1,                                                               \
+            min_mn * std::max(5 * min_mn + 7, 2 * max_mn + 2 * min_mn + 1)); \
+        std::vector<Tr> rwork(irwork);                                       \
+        X##P##_(&jobz, &m, &n, (TO)in, &ldin, s, (TO)u, &ldu, (TO)vt, &ldvt, \
+                (TO)&work[0], &lwork, &rwork[0], &iwork[0], &info);          \
+        return info;                                                         \
+    }
+
+LAPACK_GESDD_REAL(gesdd, s, float, float, float *)
+LAPACK_GESDD_REAL(gesdd, d, double, double, double *)
+LAPACK_GESDD_CPLX(gesdd, c, cfloat, float, __CLPK_complex *)
+LAPACK_GESDD_CPLX(gesdd, z, cdouble, double, __CLPK_doublecomplex *)
+
+#define LAPACK_LAMCH(X, T) \
+    T LAPACKE_##X##lamch(char cmach) { return X##lamch_(&cmach); }
+
+LAPACK_LAMCH(s, float)
+LAPACK_LAMCH(d, double)
+
+#define LAPACK_LACPY(X, T, TO)                                        \
+    int LAPACKE_##X##lacpy(int matrix_order, char uplo, int m, int n, \
+                           const T *a, int lda, T *b, int ldb) {      \
+        UNUSED(matrix_order);                                         \
+        int info = 0;                                                 \
+        X##lacpy_(&uplo, &m, &n, (TO)a, &lda, (TO)b, &ldb);           \
+        return info;                                                  \
+    }
+
+LAPACK_LACPY(s, float, float *)
+LAPACK_LACPY(d, double, double *)
+LAPACK_LACPY(c, cfloat, __CLPK_complex *)
+LAPACK_LACPY(z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_GBR_WORK(P, X, T, TO)                                       \
+    int LAPACKE_##X##P##_work(int matrix_order, char vect, int m, int n,   \
+                              int k, T *a, int lda, const T *tau, T *work, \
+                              int lwork) {                                 \
+        UNUSED(matrix_order);                                              \
+        int info = 0;                                                      \
+        X##P##_(&vect, &m, &n, &k, (TO)a, &lda, (TO)tau, (TO)work, &lwork, \
+                &info);                                                    \
+        return info;                                                       \
+    }
+
+LAPACK_GBR_WORK(orgbr, s, float, float *)
+LAPACK_GBR_WORK(orgbr, d, double, double *)
+LAPACK_GBR_WORK(ungbr, c, cfloat, __CLPK_complex *)
+LAPACK_GBR_WORK(ungbr, z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_BDSQR_WORK(X, T, Tr, TO)                                        \
+    int LAPACKE_##X##bdsqr_work(                                               \
+        int matrix_order, char uplo, int n, int ncvt, int nru, int ncc, Tr *d, \
+        Tr *e, T *vt, int ldvt, T *u, int ldu, T *c, int ldc, Tr *work) {      \
+        UNUSED(matrix_order);                                                  \
+        int info = 0;                                                          \
+        X##bdsqr_(&uplo, &n, &ncvt, &nru, &ncc, d, e, (TO)vt, &ldvt, (TO)u,    \
+                  &ldu, (TO)c, &ldc, work, &info);                             \
+        return info;                                                           \
+    }
+
+LAPACK_BDSQR_WORK(s, float, float, float *)
+LAPACK_BDSQR_WORK(d, double, double, double *)
+LAPACK_BDSQR_WORK(c, cfloat, float, __CLPK_complex *)
+LAPACK_BDSQR_WORK(z, cdouble, double, __CLPK_doublecomplex *)
+
+#define LAPACK_GEBRD_WORK(X, T, Tr, TO)                                        \
+    int LAPACKE_##X##gebrd_work(int matrix_order, int m, int n, T *a, int lda, \
+                                Tr *d, Tr *e, T *tauq, T *taup, T *work,       \
+                                int lwork) {                                   \
+        UNUSED(matrix_order);                                                  \
+        int info = 0;                                                          \
+        X##gebrd_(&m, &n, (TO)a, &lda, d, e, (TO)tauq, (TO)taup, (TO)work,     \
+                  &lwork, &info);                                              \
+        return info;                                                           \
+    }
+
+LAPACK_GEBRD_WORK(s, float, float, float *)
+LAPACK_GEBRD_WORK(d, double, double, double *)
+LAPACK_GEBRD_WORK(c, cfloat, float, __CLPK_complex *)
+LAPACK_GEBRD_WORK(z, cdouble, double, __CLPK_doublecomplex *)
+
+#define LAPACK_LARFG_WORK(X, T, TO)                                        \
+    int LAPACKE_##X##larfg_work(int n, T *alpha, T *x, int incx, T *tau) { \
+        int info = 0;                                                      \
+        X##larfg_(&n, (TO)alpha, (TO)x, &incx, (TO)tau);                   \
+        return info;                                                       \
+    }
+
+LAPACK_LARFG_WORK(s, float, float *)
+LAPACK_LARFG_WORK(d, double, double *)
+LAPACK_LARFG_WORK(c, cfloat, __CLPK_complex *)
+LAPACK_LARFG_WORK(z, cdouble, __CLPK_doublecomplex *)
+
+#define LAPACK_LACGV_WORK(X, T, TO)                      \
+    int LAPACKE_##X##lacgv_work(int n, T *x, int incx) { \
+        X##lacgv_(&n, (TO)x, &incx);                     \
+        return 0;                                        \
+    }
+
+LAPACK_LACGV_WORK(c, cfloat, __CLPK_complex *)
+LAPACK_LACGV_WORK(z, cdouble, __CLPK_doublecomplex *)
+
+#endif
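Review note on LAPACK_GESDD_REAL: the wrapper calls the routine once with lwork = -1 to ask for the optimal workspace size, then allocates and calls it again. The sketch below isolates that two-call convention. mockRoutine is a hypothetical function that only imitates the convention; it is not a LAPACK or Accelerate routine, and its size formula (3 * n) is invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical routine mimicking the LAPACK workspace-query convention:
// when *lwork == -1 it only reports the required size in work[0].
void mockRoutine(int* n, float* work, int* lwork, int* info) {
    const int needed = std::max(1, 3 * (*n));
    if (*lwork == -1) {
        work[0] = static_cast<float>(needed);
        *info   = 0;
        return;
    }
    if (*lwork < needed) {
        *info = -3;  // third argument (lwork) is too small
        return;
    }
    std::fill(work, work + needed, 1.0f);  // pretend to use the workspace
    *info = 0;
}

// Two-phase call: query the size, allocate, then run the routine.
int callWithQueriedWorkspace(int n, int* usedLwork) {
    int info        = 0;
    int lwork       = -1;
    float workParam = 0;
    mockRoutine(&n, &workParam, &lwork, &info);
    if (info != 0) { return info; }  // stop if the query itself failed

    lwork = static_cast<int>(workParam);
    assert(lwork >= 1);
    std::vector<float> work(lwork);
    mockRoutine(&n, work.data(), &lwork, &info);
    if (usedLwork != nullptr) { *usedLwork = lwork; }
    return info;
}
```

Unlike the macro above, this sketch checks info after the query before using the reported size; whether that check is wanted in the wrapper is left to the authors.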
diff --git a/src/backend/common/lapacke.hpp b/src/backend/common/lapacke.hpp
new file mode 100644
index 0000000000..88b485d7be
--- /dev/null
+++ b/src/backend/common/lapacke.hpp
@@ -0,0 +1,154 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(__APPLE__) && !defined(AF_CUDA)
+#include <backend.hpp>
+#include <types.hpp>
+
+using detail::cdouble;
+using detail::cfloat;
+
+#define LAPACK_FUNC(X, T)                                                      \
+    int LAPACKE_##X##geqrf(int layout, int M, int N, T *A, int lda, T *tau);   \
+    int LAPACKE_##X##geqrf_work(int layout, int M, int N, T *A, int lda,       \
+                                T *tau, T *work, int lwork);                   \
+    int LAPACKE_##X##getrf(int layout, int M, int N, T *A, int lda,            \
+                           int *pivot);                                        \
+    int LAPACKE_##X##potrf(int layout, char uplo, int N, T *A, int lda);       \
+    int LAPACKE_##X##gesv(int layout, int N, int nrhs, T *A, int lda,          \
+                          int *pivot, T *B, int ldb);                          \
+    int LAPACKE_##X##gels(int layout, char trans, int M, int N, int nrhs,      \
+                          T *A, int lda, T *B, int ldb);                       \
+    int LAPACKE_##X##getri(int layout, int N, T *A, int lda,                   \
+                           const int *pivot);                                  \
+    int LAPACKE_##X##trtri(int layout, char uplo, char diag, int N, T *A,      \
+                           int lda);                                           \
+    int LAPACKE_##X##larft(int layout, char direct, char storev, int N, int K, \
+                           const T *v, int ldv, const T *tau, T *t, int ldt);  \
+    int LAPACKE_##X##laswp(int layout, int N, T *A, int lda, int k1, int k2,   \
+                           const int *pivot, int incx);                        \
+    int LAPACKE_##X##getrs(int layout, char trans, int M, int N, const T *A,   \
+                           int lda, const int *pivot, T *B, int ldb);          \
+    int LAPACKE_##X##trtrs(int layout, char uplo, char trans, char diag,       \
+                           int N, int NRHS, const T *A, int lda, T *B,         \
+                           int ldb);
+
+LAPACK_FUNC(s, float)
+LAPACK_FUNC(d, double)
+LAPACK_FUNC(c, cfloat)
+LAPACK_FUNC(z, cdouble)
+
+#define LAPACK_GQR(P, X, T)                                            \
+    int LAPACKE_##X##P(int layout, int M, int N, int K, T *A, int lda, \
+                       const T *tau);
+
+LAPACK_GQR(orgqr, s, float)
+LAPACK_GQR(orgqr, d, double)
+LAPACK_GQR(ungqr, c, cfloat)
+LAPACK_GQR(ungqr, z, cdouble)
+
+#define LAPACK_GQR_WORK(P, X, T)                                              \
+    int LAPACKE_##X##P##_work(int layout, int M, int N, int K, T *A, int lda, \
+                              const T *tau, T *work, int lwork);
+
+LAPACK_GQR_WORK(orgqr, s, float)
+LAPACK_GQR_WORK(orgqr, d, double)
+LAPACK_GQR_WORK(ungqr, c, cfloat)
+LAPACK_GQR_WORK(ungqr, z, cdouble)
+
+#define LAPACK_MQR_WORK(P, X, T)                                               \
+    int LAPACKE_##X##P##_work(int layout, char side, char trans, int M, int N, \
+                              int K, const T *A, int lda, const T *tau, T *c,  \
+                              int ldc, T *work, int lwork);
+
+LAPACK_MQR_WORK(ormqr, s, float)
+LAPACK_MQR_WORK(ormqr, d, double)
+LAPACK_MQR_WORK(unmqr, c, cfloat)
+LAPACK_MQR_WORK(unmqr, z, cdouble)
+
+#define LAPACK_GESDD(P, X, T, Tr)                                            \
+    int LAPACKE_##X##P(int layout, char jobz, int m, int n, T *in, int ldin, \
+                       Tr *s, T *u, int ldu, T *vt, int ldvt);
+
+LAPACK_GESDD(gesdd, s, float, float)
+LAPACK_GESDD(gesdd, d, double, double)
+LAPACK_GESDD(gesdd, c, cfloat, float)
+LAPACK_GESDD(gesdd, z, cdouble, double)
+
+#define LAPACK_LAMCH(X, T) T LAPACKE_##X##lamch(char cmach);
+
+LAPACK_LAMCH(s, float)
+LAPACK_LAMCH(d, double)
+
+#define LAPACK_LACPY(X, T)                                            \
+    int LAPACKE_##X##lacpy(int matrix_order, char uplo, int m, int n, \
+                           const T *a, int lda, T *b, int ldb);
+
+LAPACK_LACPY(s, float)
+LAPACK_LACPY(d, double)
+LAPACK_LACPY(c, cfloat)
+LAPACK_LACPY(z, cdouble)
+
+#define LAPACK_GBR_WORK(P, X, T)                                           \
+    int LAPACKE_##X##P##_work(int matrix_order, char vect, int m, int n,   \
+                              int k, T *a, int lda, const T *tau, T *work, \
+                              int lwork);
+
+LAPACK_GBR_WORK(orgbr, s, float)
+LAPACK_GBR_WORK(orgbr, d, double)
+LAPACK_GBR_WORK(ungbr, c, cfloat)
+LAPACK_GBR_WORK(ungbr, z, cdouble)
+
+#define LAPACK_BDSQR_WORK(X, T, Tr)                                            \
+    int LAPACKE_##X##bdsqr_work(                                               \
+        int matrix_order, char uplo, int n, int ncvt, int nru, int ncc, Tr *d, \
+        Tr *e, T *vt, int ldvt, T *u, int ldu, T *c, int ldc, Tr *work);
+
+LAPACK_BDSQR_WORK(s, float, float)
+LAPACK_BDSQR_WORK(d, double, double)
+LAPACK_BDSQR_WORK(c, cfloat, float)
+LAPACK_BDSQR_WORK(z, cdouble, double)
+
+#define LAPACK_GEBRD_WORK(X, T, Tr)                                            \
+    int LAPACKE_##X##gebrd_work(int matrix_order, int m, int n, T *a, int lda, \
+                                Tr *d, Tr *e, T *tauq, T *taup, T *work,       \
+                                int lwork);
+
+LAPACK_GEBRD_WORK(s, float, float)
+LAPACK_GEBRD_WORK(d, double, double)
+LAPACK_GEBRD_WORK(c, cfloat, float)
+LAPACK_GEBRD_WORK(z, cdouble, double)
+
+#define LAPACK_LARFG_WORK(X, T) \
+    int LAPACKE_##X##larfg_work(int n, T *alpha, T *x, int incx, T *tau);
+
+LAPACK_LARFG_WORK(s, float)
+LAPACK_LARFG_WORK(d, double)
+LAPACK_LARFG_WORK(c, cfloat)
+LAPACK_LARFG_WORK(z, cdouble)
+
+#define LAPACK_LACGV_WORK(X, T) \
+    int LAPACKE_##X##lacgv_work(int n, T *x, int incx);
+
+LAPACK_LACGV_WORK(c, cfloat)
+LAPACK_LACGV_WORK(z, cdouble)
+
+#undef LAPACK_FUNC
+#undef LAPACK_GQR
+#undef LAPACK_GQR_WORK
+#undef LAPACK_MQR_WORK
+#undef LAPACK_GESDD
+#undef LAPACK_LAMCH
+#undef LAPACK_LACPY
+#undef LAPACK_GBR_WORK
+#undef LAPACK_BDSQR_WORK
+#undef LAPACK_GEBRD_WORK
+#undef LAPACK_LARFG_WORK
+#undef LAPACK_LACGV_WORK
+#endif
diff --git a/src/backend/common/moddims.cpp b/src/backend/common/moddims.cpp
new file mode 100644
index 0000000000..25edfa5b0a
--- /dev/null
+++ b/src/backend/common/moddims.cpp
@@ -0,0 +1,112 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/moddims.hpp>
+
+#include <common/jit/ModdimNode.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <copy.hpp>
+
+using af::dim4;
+using detail::Array;
+using detail::copyArray;
+using detail::createNodeArray;
+
+using std::make_shared;
+using std::shared_ptr;
+using std::array;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace common {
+
+Node_ptr copyModdims(const Node_ptr &in, const af::dim4 &newDim) {
+
+    Node_ptr out = in->clone();
+    for(int i = 0; i < in->kMaxChildren && in->m_children[i] != nullptr; ++i) {
+        out->m_children[i] = copyModdims(in->m_children[i], newDim);
+    }
+    if(out->isBuffer()) out->modDims(newDim);
+
+    return out;
+}
+
+template<typename T>
+Array<T> moddimOp(const Array<T> &in, af::dim4 outDim) {
+
+    const auto &node = in.getNode();
+
+    NodeIterator<> it(node.get());
+
+    dim4 olddims_t = in.dims();
+
+    bool all_linear = true;
+    while (all_linear && it != NodeIterator<>()) {
+        all_linear &= it->isLinear(olddims_t.get());
+        ++it;
+    }
+    if (all_linear == false) in.eval();
+
+    Array<T> out = createNodeArray<T>(outDim, copyModdims(in.getNode(), outDim));
+
+    return out;
+}
+
+template<typename T>
+Array<T> modDims(const Array<T> &in, const af::dim4 &newDims) {
+    if (in.isLinear() == false) {
+        // Nonlinear array's shape cannot be modified. Copy the data and modify
+        // the shape of the array
+        Array<T> out = copyArray<T>(in);
+        out.setDataDims(newDims);
+        return out;
+    } else if (in.isReady()) {
+        /// If the array is a buffer, modify the dimension and return
+        auto out = in;
+        out.setDataDims(newDims);
+        return out;
+    } else {
+        /// If the array is linear but is an unevaluated JIT node, then create
+        /// a moddims node
+        auto out = moddimOp<T>(in, newDims);
+        return out;
+    }
+}
+
+template<typename T>
+detail::Array<T> flat(const detail::Array<T> &in) {
+    const af::dim4 newDims(in.elements());
+    return common::modDims<T>(in, newDims);
+}
+
+}  // namespace common
+}  // namespace arrayfire
+
+#define INSTANTIATE(TYPE)                                          \
+    template detail::Array<TYPE> arrayfire::common::modDims<TYPE>( \
+        const detail::Array<TYPE> &in, const af::dim4 &newDims);   \
+    template detail::Array<TYPE> arrayfire::common::flat<TYPE>(    \
+        const detail::Array<TYPE> &in)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(detail::cfloat);
+INSTANTIATE(detail::cdouble);
+INSTANTIATE(arrayfire::common::half);
+INSTANTIATE(signed char);
+INSTANTIATE(unsigned char);
+INSTANTIATE(char);
+INSTANTIATE(unsigned short);
+INSTANTIATE(short);
+INSTANTIATE(unsigned);
+INSTANTIATE(int);
+INSTANTIATE(long long);
+INSTANTIATE(unsigned long long);
diff --git a/src/backend/common/moddims.hpp b/src/backend/common/moddims.hpp
new file mode 100644
index 0000000000..c127407753
--- /dev/null
+++ b/src/backend/common/moddims.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace common {
+
+/// Modifies the shape of the Array<T> object to \p newDims
+///
+/// Modifies the shape of the Array<T> object to \p newDims. Depending on the
+/// input Array, different operations will be performed.
+///
+/// * If the object is a linear array and it is an unevaluated JIT node, this
+///   function will create a JIT node.
+/// * If the object is not a JIT node but it is still linear, it will create a
+///   reference to the internal array with the new shape.
+/// * If the array is non-linear, the data is copied and the copy is reshaped.
+///
+/// \param in       The input array whose shape will be modified
+/// \param newDims  The new shape of the input Array<T>
+///
+/// \returns        a new Array<T> with the specified shape.
+template<typename T>
+detail::Array<T> modDims(const detail::Array<T> &in, const af::dim4 &newDims);
+
+/// Calls moddims where all elements are in the first dimension of the array
+///
+/// \param in  The input Array to be flattened
+///
+/// \returns A new array where all elements are in the first dimension.
+template<typename T>
+detail::Array<T> flat(const detail::Array<T> &in);
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/module_loading.hpp b/src/backend/common/module_loading.hpp
new file mode 100644
index 0000000000..c64231a49a
--- /dev/null
+++ b/src/backend/common/module_loading.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/defines.hpp>
+
+namespace arrayfire {
+namespace common {
+
+void* getFunctionPointer(LibHandle handle, const char* symbolName);
+
+LibHandle loadLibrary(const char* library_name);
+
+void unloadLibrary(LibHandle handle);
+
+std::string getErrorMessage();
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/module_loading_unix.cpp b/src/backend/common/module_loading_unix.cpp
new file mode 100644
index 0000000000..8380cdf3b1
--- /dev/null
+++ b/src/backend/common/module_loading_unix.cpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/defines.hpp>
+#include <common/module_loading.hpp>
+
+#include <dlfcn.h>
+
+#include <string>
+using std::string;
+
+namespace arrayfire {
+namespace common {
+
+void* getFunctionPointer(LibHandle handle, const char* symbolName) {
+    return dlsym(handle, symbolName);
+}
+
+LibHandle loadLibrary(const char* library_name) {
+    return dlopen(library_name, RTLD_LAZY);
+}
+void unloadLibrary(LibHandle handle) { dlclose(handle); }
+
+string getErrorMessage() {
+    char* errMsg = dlerror();
+    if (errMsg) { return string(errMsg); }
+    // constructing std::basic_string from NULL/0 address is
+    // invalid and has undefined behavior
+    return string("No Error");
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/module_loading_windows.cpp b/src/backend/common/module_loading_windows.cpp
new file mode 100644
index 0000000000..bccf1e9bbc
--- /dev/null
+++ b/src/backend/common/module_loading_windows.cpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/defines.hpp>
+#include <common/module_loading.hpp>
+
+#include <Windows.h>
+#include <string>
+
+using std::string;
+
+namespace arrayfire {
+namespace common {
+
+void* getFunctionPointer(LibHandle handle, const char* symbolName) {
+    return GetProcAddress(handle, symbolName);
+}
+
+LibHandle loadLibrary(const char* library_name) {
+    return LoadLibrary(library_name);
+}
+
+void unloadLibrary(LibHandle handle) { FreeLibrary(handle); }
+
+string getErrorMessage() {
+    char* lpMsgBuf = nullptr;
+    DWORD dw       = GetLastError();
+    FormatMessage(FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM |
+                      FORMAT_MESSAGE_IGNORE_INSERTS,
+                  NULL, dw, MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
+                  (LPTSTR)&lpMsgBuf, 0, NULL);
+    string error_message(lpMsgBuf ? lpMsgBuf : "Unknown error");
+    LocalFree(lpMsgBuf);
+    return error_message;
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/sparse_helpers.hpp b/src/backend/common/sparse_helpers.hpp
new file mode 100644
index 0000000000..daec047eb3
--- /dev/null
+++ b/src/backend/common/sparse_helpers.hpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace common {
+
+class SparseArrayBase;
+template<typename T>
+class SparseArray;
+
+////////////////////////////////////////////////////////////////////////////
+// Friend functions for Sparse Array Creation
+////////////////////////////////////////////////////////////////////////////
+template<typename T>
+SparseArray<T> createEmptySparseArray(const af::dim4 &_dims, dim_t _nNZ,
+                                      const af::storage _storage);
+
+template<typename T>
+SparseArray<T> createHostDataSparseArray(const af::dim4 &_dims, const dim_t nNZ,
+                                         const T *const _values,
+                                         const int *const _rowIdx,
+                                         const int *const _colIdx,
+                                         const af::storage _storage);
+
+template<typename T>
+SparseArray<T> createDeviceDataSparseArray(const af::dim4 &_dims,
+                                           const dim_t nNZ, T *const _values,
+                                           int *const _rowIdx,
+                                           int *const _colIdx,
+                                           const af::storage _storage,
+                                           const bool _copy = false);
+
+template<typename T>
+SparseArray<T> createArrayDataSparseArray(const af::dim4 &_dims,
+                                          const detail::Array<T> &_values,
+                                          const detail::Array<int> &_rowIdx,
+                                          const detail::Array<int> &_colIdx,
+                                          const af::storage _storage,
+                                          const bool _copy = false);
+
+template<typename T>
+SparseArray<T> *initSparseArray();
+
+template<typename T>
+void destroySparseArray(SparseArray<T> *sparse);
+
+/// Performs a deep copy of the \p other array.
+///
+/// \param[in] other    The sparse array that is to be copied
+/// \returns A deep copy of the input sparse array
+template<typename T>
+SparseArray<T> copySparseArray(const SparseArray<T> &other);
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/tile.hpp b/src/backend/common/tile.hpp
new file mode 100644
index 0000000000..b6ccdd2f60
--- /dev/null
+++ b/src/backend/common/tile.hpp
@@ -0,0 +1,50 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <tile.hpp>
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <backend.hpp>
+#include <optypes.hpp>
+#include <unary.hpp>
+
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace common {
+
+/// Repeats (tiles) an Array<T> along each dimension as given by \p tileDims.
+template<typename T>
+detail::Array<T> tile(const detail::Array<T> &in, const af::dim4 tileDims) {
+    const af::dim4 &inDims = in.dims();
+
+    // FIXME: Always use JIT instead of checking for the condition.
+    // The current limitation exists for performance reasons. It should change
+    // in the future.
+
+    bool take_jit_path = true;
+    af::dim4 outDims(1, 1, 1, 1);
+
+    // Check if JIT path can be taken. JIT path can only be taken if tiling a
+    // singleton dimension.
+    for (int i = 0; i < 4; i++) {
+        take_jit_path &= (inDims[i] == 1 || tileDims[i] == 1);
+        outDims[i] = inDims[i] * tileDims[i];
+    }
+
+    if (take_jit_path) {
+        return detail::unaryOp<T, af_noop_t>(in, outDims);
+    } else {
+        return detail::tile<T>(in, tileDims);
+    }
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/common/traits.hpp b/src/backend/common/traits.hpp
new file mode 100644
index 0000000000..51a4b53899
--- /dev/null
+++ b/src/backend/common/traits.hpp
@@ -0,0 +1,89 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/err_common.hpp>
+#include <af/defines.h>
+
+namespace af {
+template<typename T>
+struct dtype_traits;
+}
+
+namespace arrayfire {
+namespace common {
+class half;
+
+namespace {
+
+inline size_t dtypeSize(af::dtype type) {
+    switch (type) {
+        case s8:
+        case u8:
+        case b8: return 1;
+        case s16:
+        case u16:
+        case f16: return 2;
+        case s32:
+        case u32:
+        case f32: return 4;
+        case u64:
+        case s64:
+        case c32:
+        case f64: return 8;
+        case c64: return 16;
+        default: AF_RETURN_ERROR("Unsupported type", AF_ERR_INTERNAL);
+    }
+}
+
+constexpr bool isComplex(af::dtype type) {
+    return ((type == c32) || (type == c64));
+}
+
+constexpr bool isReal(af::dtype type) { return !isComplex(type); }
+
+constexpr bool isDouble(af::dtype type) { return (type == f64 || type == c64); }
+
+constexpr bool isSingle(af::dtype type) { return (type == f32 || type == c32); }
+
+constexpr bool isHalf(af::dtype type) { return (type == f16); }
+
+constexpr bool isRealFloating(af::dtype type) {
+    return (type == f64 || type == f32 || type == f16);
+}
+
+constexpr bool isInteger(af::dtype type) {
+    return (type == s32 || type == u32 || type == s64 || type == u64 ||
+            type == s16 || type == u16 || type == s8 || type == u8);
+}
+
+constexpr bool isBool(af::dtype type) { return (type == b8); }
+
+constexpr bool isFloating(af::dtype type) {
+    return (!isInteger(type) && !isBool(type));
+}
+
+template<typename T, typename U, typename... Args>
+constexpr bool is_any_of() {
+    AF_IF_CONSTEXPR(!sizeof...(Args)) { return std::is_same<T, U>::value; }
+    else { return std::is_same<T, U>::value || is_any_of<T, Args...>(); }
+}
+
+}  // namespace
+}  // namespace common
+}  // namespace arrayfire
+
+namespace af {
+template<>
+struct dtype_traits<arrayfire::common::half> {
+    enum { af_type = f16, ctype = f16 };
+    typedef arrayfire::common::half base_type;
+    static const char *getName() { return "half"; }
+};
+}  // namespace af
diff --git a/src/backend/common/unique_handle.hpp b/src/backend/common/unique_handle.hpp
new file mode 100644
index 0000000000..c55e2ddf81
--- /dev/null
+++ b/src/backend/common/unique_handle.hpp
@@ -0,0 +1,138 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <af/compilers.h>
+
+#include <utility>
+
+namespace arrayfire {
+namespace common {
+
+template<typename T>
+class ResourceHandler {
+   public:
+    template<typename... Args>
+    static int createHandle(T *handle, Args... args);
+    static int destroyHandle(T handle);
+};
+
+/// \brief A generic class to manage basic RAII lifetimes for C handles
+///
+/// This class manages the lifetimes of C handles found in many types of
+/// libraries. This class is non-copyable but can be moved.
+///
+/// You can use this class with a new handle by using the DEFINE_HANDLER
+/// macro to define the createHandle/destroyHandle policy implementation for a
+/// given resource handle type.
+///
+/// \code{.cpp}
+/// DEFINE_HANDLER(HandleType, HandleCreator, HandleDestroyer);
+/// \endcode
+template<typename T>
+class unique_handle {
+   private:
+    T handle_;
+
+   public:
+    /// Default constructor. Initializes the handle to zero. Does not call the
+    /// create function
+    constexpr unique_handle() noexcept : handle_(0) {}
+
+    /// \brief Takes ownership of a previously created handle
+    ///
+    /// \param[in] handle The handle to manage by this object
+    explicit constexpr unique_handle(T handle) noexcept : handle_(handle) {}
+
+    /// \brief Deletes the handle if created.
+    ~unique_handle() noexcept { reset(); }
+
+    /// \brief Deletes the handle if created.
+    void reset() noexcept {
+        if (handle_) {
+            ResourceHandler<T>::destroyHandle(handle_);
+            handle_ = 0;
+        }
+    }
+
+    unique_handle(const unique_handle &other) noexcept            = delete;
+    unique_handle &operator=(const unique_handle &other) noexcept = delete;
+
+    AF_CONSTEXPR unique_handle(unique_handle &&other) noexcept
+        : handle_(other.handle_) {
+        other.handle_ = 0;
+    }
+
+    unique_handle &operator=(unique_handle &&other) noexcept {
+        if (this != &other) { reset(); std::swap(handle_, other.handle_); }
+        return *this;
+    }
+
+    /// \brief Implicit converter for the handle
+    constexpr operator const T &() const noexcept { return handle_; }
+
+    template<typename... Args>
+    int create(Args... args) {
+        if (!handle_) {
+            int error = ResourceHandler<T>::createHandle(
+                &handle_, std::forward<Args>(args)...);
+            if (error) { handle_ = 0; }
+            return error;
+        }
+        return 0;
+    }
+
+    // Returns true if the \p other unique_handle is the same as this handle
+    constexpr bool operator==(unique_handle &other) const noexcept {
+        return handle_ == other.handle_;
+    }
+
+    // Returns true if the \p other handle is the same as this handle
+    constexpr bool operator==(T &other) const noexcept {
+        return handle_ == other;
+    }
+
+    // Returns true if the \p other handle is the same as this handle
+    constexpr bool operator==(T other) const noexcept {
+        return handle_ == other;
+    }
+
+    // Returns true if the handle was initialized correctly
+    constexpr operator bool() { return handle_ != 0; }
+};
+
+/// \brief Returns an initialized handle object. The create function on this
+///        object is already called with the parameter pack provided as
+///        function arguments.
+template<typename T, typename... Args>
+unique_handle<T> make_handle(Args... args) {
+    unique_handle<T> h;
+    h.create(std::forward<Args>(args)...);
+    return h;
+}
+
+}  // namespace common
+}  // namespace arrayfire
+
+#define DEFINE_HANDLER(HANDLE_TYPE, HCREATOR, HDESTROYER)            \
+    namespace arrayfire {                                            \
+    namespace common {                                               \
+    template<>                                                       \
+    class ResourceHandler<HANDLE_TYPE> {                             \
+       public:                                                       \
+        template<typename... Args>                                   \
+        static int createHandle(HANDLE_TYPE *handle, Args... args) { \
+            return HCREATOR(handle, std::forward<Args>(args)...);    \
+        }                                                            \
+        static int destroyHandle(HANDLE_TYPE handle) {               \
+            return HDESTROYER(handle);                               \
+        }                                                            \
+    };                                                               \
+    }                                                                \
+    }
diff --git a/src/backend/common/util.cpp b/src/backend/common/util.cpp
new file mode 100644
index 0000000000..87be74fa83
--- /dev/null
+++ b/src/backend/common/util.cpp
@@ -0,0 +1,528 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/// This file contains platform independent utility functions
+#if defined(OS_WIN)
+#include <Windows.h>
+#else
+#include <pwd.h>
+#include <unistd.h>
+#endif
+
+#include <common/Logger.hpp>
+#include <common/TemplateArg.hpp>
+#include <common/defines.hpp>
+#include <common/util.hpp>
+#include <optypes.hpp>
+#include <af/defines.h>
+
+#include <nonstd/span.hpp>
+#include <sys/stat.h>
+
+#include <algorithm>
+#include <array>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <fstream>
+#include <numeric>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <vector>
+
+#ifdef __has_include
+#if __has_include(<charconv>)
+#include <charconv>
+#endif
+#if __has_include(<version>)
+#include <version>
+#endif
+#endif
+
+using nonstd::span;
+using std::accumulate;
+using std::array;
+using std::hash;
+using std::ofstream;
+using std::once_flag;
+using std::rename;
+using std::size_t;
+using std::string;
+using std::stringstream;
+using std::thread;
+using std::to_string;
+using std::uint8_t;
+using std::vector;
+
+namespace arrayfire {
+namespace common {
+// http://stackoverflow.com/questions/216823/whats-the-best-way-to-trim-stdstring/217605#217605
+// trim from start
+string& ltrim(string& s) {
+    s.erase(s.begin(),
+            find_if(s.begin(), s.end(), [](char c) { return !isspace(c); }));
+    return s;
+}
+
+string getEnvVar(const string& key) {
+#if defined(OS_WIN)
+    DWORD bufSize =
+        32767;  // limit according to GetEnvironmentVariable documentation
+    string retVal;
+    retVal.resize(bufSize);
+    bufSize = GetEnvironmentVariable(key.c_str(), &retVal[0], bufSize);
+    if (!bufSize) {
+        return string("");
+    } else {
+        retVal.resize(bufSize);
+        return retVal;
+    }
+#else
+    char* str = getenv(key.c_str());
+    return str == NULL ? string("") : string(str);
+#endif
+}
+
+const char* getName(af_dtype type) {
+    switch (type) {
+        case f32: return "float";
+        case f64: return "double";
+        case c32: return "complex float";
+        case c64: return "complex double";
+        case u32: return "unsigned int";
+        case s32: return "int";
+        case u16: return "unsigned short";
+        case s16: return "short";
+        case u64: return "unsigned long long";
+        case s64: return "long long";
+        case u8: return "unsigned char";
+        case s8: return "signed char";
+        case b8: return "bool";
+        default: return "unknown type";
+    }
+}
+
+void saveKernel(const string& funcName, const string& jit_ker,
+                const string& ext) {
+    static constexpr const char* saveJitKernelsEnvVarName =
+        "AF_JIT_KERNEL_TRACE";
+    static const char* jitKernelsOutput = getenv(saveJitKernelsEnvVarName);
+    if (!jitKernelsOutput) { return; }
+    if (strcmp(jitKernelsOutput, "stdout") == 0) {
+        fputs(jit_ker.c_str(), stdout);
+        return;
+    }
+    if (strcmp(jitKernelsOutput, "stderr") == 0) {
+        fputs(jit_ker.c_str(), stderr);
+        return;
+    }
+    // Path to a folder
+    const string ffp =
+        string(jitKernelsOutput) + AF_PATH_SEPARATOR + funcName + ext;
+
+#if defined(OS_WIN)
+    FILE* f = fopen(ffp.c_str(), "w");
+#else
+    FILE* f = fopen(ffp.c_str(), "we");
+#endif
+
+    if (!f) {
+        fprintf(stderr, "Cannot open file %s\n", ffp.c_str());
+        return;
+    }
+    if (fputs(jit_ker.c_str(), f) == EOF) {
+        fprintf(stderr, "Failed to write kernel to file %s\n", ffp.c_str());
+    }
+    fclose(f);
+}
+
+#if defined(OS_WIN)
+string getTemporaryDirectory() {
+    DWORD bufSize = 261;  // limit according to GetTempPath documentation
+    string retVal;
+    retVal.resize(bufSize);
+    bufSize = GetTempPathA(bufSize, &retVal[0]);
+    retVal.resize(bufSize);
+    return retVal;
+}
+#else
+string getHomeDirectory() {
+    string home = getEnvVar("XDG_CACHE_HOME");
+    if (!home.empty()) { return home; }
+
+    home = getEnvVar("HOME");
+    if (!home.empty()) { return home; }
+
+    return getpwuid(getuid())->pw_dir;
+}
+#endif
+
+bool directoryExists(const string& path) {
+#if defined(OS_WIN)
+    struct _stat status;
+    return _stat(path.c_str(), &status) == 0 && (status.st_mode & S_IFDIR) != 0;
+#else
+    struct stat status {};
+    // NOLINTNEXTLINE(hicpp-signed-bitwise)
+    return stat(path.c_str(), &status) == 0 && (status.st_mode & S_IFDIR) != 0;
+#endif
+}
+
+bool createDirectory(const string& path) {
+#if defined(OS_WIN)
+    return CreateDirectoryA(path.c_str(), NULL) != 0;
+#else
+    return mkdir(path.c_str(), 0777) == 0;
+#endif
+}
+
+bool removeFile(const string& path) {
+#if defined(OS_WIN)
+    return DeleteFileA(path.c_str()) != 0;
+#else
+    return unlink(path.c_str()) == 0;
+#endif
+}
+
+bool renameFile(const string& sourcePath, const string& destPath) {
+    return rename(sourcePath.c_str(), destPath.c_str()) == 0;
+}
+
+bool isDirectoryWritable(const string& path) {
+    if (!directoryExists(path) && !createDirectory(path)) { return false; }
+
+    const string testPath = path + AF_PATH_SEPARATOR + "test";
+    if (!ofstream(testPath).is_open()) { return false; }
+    removeFile(testPath);
+
+    return true;
+}
+
+#ifndef NOSPDLOG
+string& getCacheDirectory() {
+    static once_flag flag;
+    static string cacheDirectory;
+
+    call_once(flag, []() {
+        string pathList[] = {
+#if defined(OS_WIN)
+            getTemporaryDirectory() + "\\ArrayFire"
+#else
+            getHomeDirectory() + "/.arrayfire",
+            "/tmp/arrayfire"
+#endif
+        };
+
+        auto env_path = getEnvVar(JIT_KERNEL_CACHE_DIRECTORY_ENV_NAME);
+        if (!env_path.empty() && !isDirectoryWritable(env_path)) {
+            spdlog::get("platform")
+                ->warn(
+                    "The environment variable {}({}) is "
+                    "not writeable. Falling back to default.",
+                    JIT_KERNEL_CACHE_DIRECTORY_ENV_NAME, env_path);
+            env_path.clear();
+        }
+
+        if (env_path.empty()) {
+            auto iterDir =
+                find_if(begin(pathList), end(pathList), isDirectoryWritable);
+
+            cacheDirectory = iterDir != end(pathList) ? *iterDir : "";
+        } else {
+            cacheDirectory = env_path;
+        }
+    });
+
+    return cacheDirectory;
+}
+#endif
+
+string makeTempFilename() {
+    thread_local size_t fileCount = 0u;
+
+    ++fileCount;
+    const size_t threadID = hash<thread::id>{}(std::this_thread::get_id());
+
+    return to_string(
+        hash<string>{}(to_string(threadID) + "_" + to_string(fileCount)));
+}
+
+template<typename T>
+string toString(T value) {
+#ifdef __cpp_lib_to_chars
+    array<char, 128> out;
+    if (auto [ptr, ec] = std::to_chars(out.data(), out.data() + 128, value);
+        ec == std::errc()) {
+        return string(out.data(), ptr);
+    } else {
+        return string("#error invalid conversion");
+    }
+#else
+    stringstream ss;
+    ss.imbue(std::locale::classic());
+    ss << value;
+    return ss.str();
+#endif
+}
+
+template string toString<int>(int);
+template string toString<unsigned short>(unsigned short);
+template string toString<short>(short);
+template string toString<unsigned char>(unsigned char);
+template string toString<signed char>(signed char);
+template string toString<char>(char);
+template string toString<long>(long);
+template string toString<long long>(long long);
+template string toString<unsigned>(unsigned);
+template string toString<unsigned long>(unsigned long);
+template string toString<unsigned long long>(unsigned long long);
+template string toString<float>(float);
+template string toString<double>(double);
+template string toString<long double>(long double);
+
+template<>
+string toString(TemplateArg arg) {
+    return arg._tparam;
+}
+
+template<>
+string toString(bool val) {
+    return string(val ? "true" : "false");
+}
+
+template<>
+string toString(const char* str) {
+    return string(str);
+}
+
+template<>
+string toString(const string str) {
+    return str;
+}
+
+template<>
+string toString(af_op_t val) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (val) {
+        CASE_STMT(af_add_t);
+        CASE_STMT(af_sub_t);
+        CASE_STMT(af_mul_t);
+        CASE_STMT(af_div_t);
+
+        CASE_STMT(af_and_t);
+        CASE_STMT(af_or_t);
+        CASE_STMT(af_eq_t);
+        CASE_STMT(af_neq_t);
+        CASE_STMT(af_lt_t);
+        CASE_STMT(af_le_t);
+        CASE_STMT(af_gt_t);
+        CASE_STMT(af_ge_t);
+
+        CASE_STMT(af_bitnot_t);
+        CASE_STMT(af_bitor_t);
+        CASE_STMT(af_bitand_t);
+        CASE_STMT(af_bitxor_t);
+        CASE_STMT(af_bitshiftl_t);
+        CASE_STMT(af_bitshiftr_t);
+
+        CASE_STMT(af_min_t);
+        CASE_STMT(af_max_t);
+        CASE_STMT(af_cplx2_t);
+        CASE_STMT(af_atan2_t);
+        CASE_STMT(af_pow_t);
+        CASE_STMT(af_hypot_t);
+
+        CASE_STMT(af_sin_t);
+        CASE_STMT(af_cos_t);
+        CASE_STMT(af_tan_t);
+        CASE_STMT(af_asin_t);
+        CASE_STMT(af_acos_t);
+        CASE_STMT(af_atan_t);
+
+        CASE_STMT(af_sinh_t);
+        CASE_STMT(af_cosh_t);
+        CASE_STMT(af_tanh_t);
+        CASE_STMT(af_asinh_t);
+        CASE_STMT(af_acosh_t);
+        CASE_STMT(af_atanh_t);
+
+        CASE_STMT(af_exp_t);
+        CASE_STMT(af_expm1_t);
+        CASE_STMT(af_erf_t);
+        CASE_STMT(af_erfc_t);
+
+        CASE_STMT(af_log_t);
+        CASE_STMT(af_log10_t);
+        CASE_STMT(af_log1p_t);
+        CASE_STMT(af_log2_t);
+
+        CASE_STMT(af_sqrt_t);
+        CASE_STMT(af_cbrt_t);
+
+        CASE_STMT(af_abs_t);
+        CASE_STMT(af_cast_t);
+        CASE_STMT(af_cplx_t);
+        CASE_STMT(af_real_t);
+        CASE_STMT(af_imag_t);
+        CASE_STMT(af_conj_t);
+
+        CASE_STMT(af_floor_t);
+        CASE_STMT(af_ceil_t);
+        CASE_STMT(af_round_t);
+        CASE_STMT(af_trunc_t);
+        CASE_STMT(af_signbit_t);
+
+        CASE_STMT(af_rem_t);
+        CASE_STMT(af_mod_t);
+
+        CASE_STMT(af_tgamma_t);
+        CASE_STMT(af_lgamma_t);
+
+        CASE_STMT(af_notzero_t);
+
+        CASE_STMT(af_iszero_t);
+        CASE_STMT(af_isinf_t);
+        CASE_STMT(af_isnan_t);
+
+        CASE_STMT(af_sigmoid_t);
+
+        CASE_STMT(af_noop_t);
+
+        CASE_STMT(af_select_t);
+        CASE_STMT(af_not_select_t);
+        CASE_STMT(af_rsqrt_t);
+        CASE_STMT(af_moddims_t);
+
+        CASE_STMT(af_none_t);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_interp_type p) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (p) {
+        CASE_STMT(AF_INTERP_NEAREST);
+        CASE_STMT(AF_INTERP_LINEAR);
+        CASE_STMT(AF_INTERP_BILINEAR);
+        CASE_STMT(AF_INTERP_CUBIC);
+        CASE_STMT(AF_INTERP_LOWER);
+        CASE_STMT(AF_INTERP_LINEAR_COSINE);
+        CASE_STMT(AF_INTERP_BILINEAR_COSINE);
+        CASE_STMT(AF_INTERP_BICUBIC);
+        CASE_STMT(AF_INTERP_CUBIC_SPLINE);
+        CASE_STMT(AF_INTERP_BICUBIC_SPLINE);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_border_type p) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (p) {
+        CASE_STMT(AF_PAD_ZERO);
+        CASE_STMT(AF_PAD_SYM);
+        CASE_STMT(AF_PAD_CLAMP_TO_EDGE);
+        CASE_STMT(AF_PAD_PERIODIC);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_moment_type p) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (p) {
+        CASE_STMT(AF_MOMENT_M00);
+        CASE_STMT(AF_MOMENT_M01);
+        CASE_STMT(AF_MOMENT_M10);
+        CASE_STMT(AF_MOMENT_M11);
+        CASE_STMT(AF_MOMENT_FIRST_ORDER);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_match_type p) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (p) {
+        CASE_STMT(AF_SAD);
+        CASE_STMT(AF_ZSAD);
+        CASE_STMT(AF_LSAD);
+        CASE_STMT(AF_SSD);
+        CASE_STMT(AF_ZSSD);
+        CASE_STMT(AF_LSSD);
+        CASE_STMT(AF_NCC);
+        CASE_STMT(AF_ZNCC);
+        CASE_STMT(AF_SHD);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_flux_function p) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (p) {
+        CASE_STMT(AF_FLUX_QUADRATIC);
+        CASE_STMT(AF_FLUX_EXPONENTIAL);
+        CASE_STMT(AF_FLUX_DEFAULT);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(AF_BATCH_KIND val) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (val) {
+        CASE_STMT(AF_BATCH_NONE);
+        CASE_STMT(AF_BATCH_LHS);
+        CASE_STMT(AF_BATCH_RHS);
+        CASE_STMT(AF_BATCH_SAME);
+        CASE_STMT(AF_BATCH_DIFF);
+        CASE_STMT(AF_BATCH_UNSUPPORTED);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+template<>
+string toString(af_homography_type val) {
+    const char* retVal = NULL;
+#define CASE_STMT(v) \
+    case v: retVal = #v; break
+    switch (val) {
+        CASE_STMT(AF_HOMOGRAPHY_RANSAC);
+        CASE_STMT(AF_HOMOGRAPHY_LMEDS);
+    }
+#undef CASE_STMT
+    return retVal;
+}
+
+}  // namespace common
+}  // namespace arrayfire
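Reviewer note on the `toString` specializations above: each one uses a local `CASE_STMT` macro to turn an enumerator into its own name with the preprocessor `#` operator. The following is a minimal sketch of that pattern, not code from this change. `demo_border_type`, its enumerators, and `demoToString` are hypothetical names invented for illustration. The sketch also assumes a non-null fallback value is wanted, which differs from the diff, where `retVal` starts as `NULL`.

```cpp
#include <string>

// Hypothetical enum used only for this illustration; it is not part of
// ArrayFire. The fixed underlying type keeps out-of-range casts well defined.
enum demo_border_type : int { DEMO_PAD_ZERO, DEMO_PAD_SYM };

// Returns the enumerator's name as a string. Assumption: unmatched values
// should map to a placeholder, so std::string is never built from a null
// pointer.
std::string demoToString(demo_border_type p) {
    const char* retVal = "UNKNOWN";
#define CASE_STMT(v) \
    case v: retVal = #v; break
    switch (p) {
        CASE_STMT(DEMO_PAD_ZERO);
        CASE_STMT(DEMO_PAD_SYM);
    }
#undef CASE_STMT
    return retVal;
}
```

The macro is defined and undefined around each `switch`, as in the diff, so it does not leak into the rest of the translation unit.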
diff --git a/src/backend/common/util.hpp b/src/backend/common/util.hpp
new file mode 100644
index 0000000000..8a1ad42838
--- /dev/null
+++ b/src/backend/common/util.hpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/// This file contains platform-independent utility functions
+#pragma once
+
+#include <optypes.hpp>
+#include <af/defines.h>
+
+#include <string>
+
+namespace arrayfire {
+namespace common {
+/// The environment variable that determines where the runtime kernels
+/// will be stored on the file system
+constexpr const char* JIT_KERNEL_CACHE_DIRECTORY_ENV_NAME =
+    "AF_JIT_KERNEL_CACHE_DIRECTORY";
+
+std::string getEnvVar(const std::string& key);
+
+std::string& ltrim(std::string& s);
+
+// Dump the kernel sources only if the environment variable is defined
+void saveKernel(const std::string& funcName, const std::string& jit_ker,
+                const std::string& ext);
+
+std::string& getCacheDirectory();
+
+bool directoryExists(const std::string& path);
+
+bool createDirectory(const std::string& path);
+
+bool removeFile(const std::string& path);
+
+bool renameFile(const std::string& sourcePath, const std::string& destPath);
+
+bool isDirectoryWritable(const std::string& path);
+
+/// Return a string suitable for naming a temporary file.
+///
+/// Every call to this function will generate a new string with a very low
+/// probability of colliding with past or future outputs of this function,
+/// including calls from other threads or processes. The string contains
+/// no extension.
+std::string makeTempFilename();
+
+const char* getName(af_dtype type);
+
+std::string getOpEnumStr(af_op_t val);
+
+template<typename T>
+std::string toString(T value);
+
+}  // namespace common
+}  // namespace arrayfire
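Reviewer note on `util.hpp`: the header only declares `getEnvVar` and `ltrim`, so their behaviour is not visible in this hunk. The sketch below shows one plausible reading of the two signatures. It assumes an unset variable yields an empty string and that `ltrim` strips leading whitespace in place. The `Sketch` suffix marks both functions as hypothetical stand-ins rather than the definitions in `util.cpp`.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdlib>
#include <string>

// Assumed behaviour: return the variable's value, or an empty string when the
// variable is not set.
std::string getEnvVarSketch(const std::string& key) {
    const char* value = std::getenv(key.c_str());
    return value ? std::string(value) : std::string();
}

// Assumed behaviour: erase leading whitespace in place and return the same
// string by reference, matching the declared std::string& return type.
std::string& ltrimSketch(std::string& s) {
    auto firstNonSpace =
        std::find_if(s.begin(), s.end(),
                     [](unsigned char c) { return !std::isspace(c); });
    s.erase(s.begin(), firstNonSpace);
    return s;
}
```

Returning the argument by reference lets a caller chain the trim with further in-place operations without copying the string.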
diff --git a/src/backend/cpu/Array.cpp b/src/backend/cpu/Array.cpp
index 899e409d61..276ea952b4 100644
--- a/src/backend/cpu/Array.cpp
+++ b/src/backend/cpu/Array.cpp
@@ -7,273 +7,374 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
+#include <kernel/Array.hpp>
+
+#include <Param.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/traits.hpp>
 #include <copy.hpp>
-#include <TNJ/BufferNode.hpp>
-#include <TNJ/ScalarNode.hpp>
+#include <jit/BufferNode.hpp>
+#include <jit/Node.hpp>
+#include <jit/ScalarNode.hpp>
 #include <memory.hpp>
 #include <platform.hpp>
-#include <cstring>
+#include <queue.hpp>
+#include <traits.hpp>
 
-namespace cpu
-{
-    const int MAX_TNJ_LEN = 20;
-    using TNJ::BufferNode;
-    using TNJ::Node;
-    using TNJ::Node_ptr;
-
-    using af::dim4;
-
-    template<typename T>
-    Array<T>::Array(dim4 dims):
-        ArrayInfo(getActiveDeviceId(), dims, dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(memAlloc<T>(dims.elements()), memFree<T>), data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    { }
-
-    template<typename T>
-    Array<T>::Array(dim4 dims, const T * const in_data):
-        ArrayInfo(getActiveDeviceId(), dims, dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(memAlloc<T>(dims.elements()), memFree<T>), data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {
-        std::copy(in_data, in_data + dims.elements(), data.get());
-    }
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/seq.h>
+#include <af/traits.hpp>
 
+#include <nonstd/span.hpp>
+#include <algorithm>  // IWYU pragma: keep
+#include <cstddef>
+#include <cstring>
+#include <type_traits>
+#include <utility>
+
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_map_t;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::cpu::jit::BufferNode;
+
+using nonstd::span;
+using std::accumulate;
+using std::adjacent_find;
+using std::copy;
+using std::find_if;
+using std::is_standard_layout;
+using std::make_shared;
+using std::move;
+using std::vector;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+shared_ptr<BufferNode<T>> bufferNodePtr() {
+    return std::make_shared<BufferNode<T>>();
+}
 
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, TNJ::Node_ptr n) :
-        ArrayInfo(-1, dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(), data_dims(dims),
-        node(n), ready(false), offset(0), owner(true)
-    {
+template<typename T>
+Array<T>::Array(dim4 dims)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(memAlloc<T>(dims.elements()).release(), memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, T *const in_data, bool is_device,
+                bool copy_device)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data((is_device & !copy_device) ? in_data
+                                      : memAlloc<T>(dims.elements()).release(),
+           memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    static_assert(is_standard_layout<Array<T>>::value,
+                  "Array<T> must be a standard layout type");
+    static_assert(std::is_nothrow_move_assignable<Array<T>>::value,
+                  "Array<T> is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<Array<T>>::value,
+                  "Array<T> is not move constructible");
+    static_assert(
+        offsetof(Array<T>, info) == 0,
+        "Array<T>::info must be the first member variable of Array<T>");
+    if (!is_device || copy_device) {
+        // Ensure the memory being written to isn't used anywhere else.
+        getQueue().sync();
+        copy(in_data, in_data + dims.elements(), data.get());
     }
+}
 
-    template<typename T>
-    Array<T>::Array(const Array<T>& parent, const dim4 &dims, const dim4 &offsets, const dim4 &strides) :
-        ArrayInfo(parent.getDevId(), dims, offsets, strides, (af_dtype)dtype_traits<T>::af_type),
-        data(parent.getData()), data_dims(parent.getDataDims()),
-        node(), ready(true),
-        offset(parent.getOffset() + calcOffset(parent.strides(), offsets)),
-        owner(false)
-    { }
-
-    template<typename T>
-    void Array<T>::eval()
-    {
-        if (isReady()) return;
-
-        this->setId(getActiveDeviceId());
-        data = std::shared_ptr<T>(memAlloc<T>(elements()), memFree<T>);
-        T *ptr = data.get();
-
-        dim4 ostrs = strides();
-        dim4 odims = dims();
-
-        for (int w = 0; w < (int)odims[3]; w++) {
-            dim_t offw = w * ostrs[3];
-
-            for (int z = 0; z < (int)odims[2]; z++) {
-                dim_t offz = z * ostrs[2] + offw;
-
-                for (int y = 0; y < (int)odims[1]; y++) {
-                    dim_t offy = y * ostrs[1] + offz;
-
-                    for (int x = 0; x < (int)odims[0]; x++) {
-                        dim_t id = x + offy;
+template<typename T>
+Array<T>::Array(const af::dim4 &dims, Node_ptr n)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data()
+    , data_dims(dims)
+    , node(move(n))
+    , owner(true) {}
+
+template<typename T>
+Array<T>::Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset_,
+                const dim4 &strides)
+    : info(parent.getDevId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(parent.getData())
+    , data_dims(parent.getDataDims())
+    , node()
+    , owner(false) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, const dim4 &strides, dim_t offset_,
+                T *const in_data, bool is_device)
+    : info(getActiveDeviceId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(is_device ? in_data : memAlloc<T>(info.total()).release(), memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (!is_device) {
+        // Ensure the memory being written to isn't used anywhere else.
+        getQueue().sync();
+        copy(in_data, in_data + info.total(), data.get());
+    }
+}
 
-                        ptr[id] = *(T *)node->calc(x, y, z, w);
-                    }
-                }
-            }
-        }
+template<typename T>
+void checkAndMigrate(const Array<T> &arr) {
+    return;
+}
 
+template<typename T>
+void Array<T>::eval() {
+    evalMultiple<T>({this});
+}
 
-        ready = true;
+template<typename T>
+void Array<T>::eval() const {
+    const_cast<Array<T> *>(this)->eval();
+}
 
-        Node_ptr prev = node;
-        prev->reset();
-        // FIXME: Replace the current node in any JIT possible trees with the new BufferNode
-        node.reset();
+template<typename T>
+T *Array<T>::device() {
+    if (!isOwner() || getOffset() || data.use_count() > 1) {
+        *this = copyArray<T>(*this);
     }
+    getQueue().sync();
+    return this->get();
+}
 
-    template<typename T>
-    void Array<T>::eval() const
-    {
-        if (isReady()) return;
-        const_cast<Array<T> *>(this)->eval();
+template<typename T>
+void evalMultiple(vector<Array<T> *> array_ptrs) {
+    vector<Array<T> *> outputs;
+    vector<common::Node_ptr> nodes;
+    vector<Param<T>> params;
+    if (getQueue().is_worker()) {
+        AF_ERROR("Array not evaluated", AF_ERR_INTERNAL);
     }
 
-    template<typename T>
-    Array<T>::~Array()
-    { }
+    // Check if all the arrays have the same dimensions
+    auto it = adjacent_find(begin(array_ptrs), end(array_ptrs),
+                            [](const Array<T> *l, const Array<T> *r) {
+                                return l->dims() != r->dims();
+                            });
 
-    template<typename T>
-    Node_ptr Array<T>::getNode() const
-    {
-        if (!node) {
-
-            unsigned bytes = this->getDataDims().elements() * sizeof(T);
+    // If they are not the same, evaluate each array individually
+    if (it != end(array_ptrs)) {
+        for (auto ptr : array_ptrs) { ptr->eval(); }
+        return;
+    }
 
-            BufferNode<T> *buf_node = new BufferNode<T>(data,
-                                                        bytes,
-                                                        offset,
-                                                        dims().get(),
-                                                        strides().get());
+    for (Array<T> *array : array_ptrs) {
+        if (array->isReady()) { continue; }
 
-            const_cast<Array<T> *>(this)->node = Node_ptr(reinterpret_cast<Node *>(buf_node));
-        }
+        array->setId(getActiveDeviceId());
+        array->data =
+            shared_ptr<T>(memAlloc<T>(array->elements()).release(), memFree);
 
-        return node;
+        outputs.push_back(array);
+        params.emplace_back(array->getData().get(), array->dims(),
+                            array->strides());
+        nodes.push_back(array->node);
     }
 
-    template<typename T>
-    Array<T>
-    createHostDataArray(const dim4 &size, const T * const data)
-    {
-        return Array<T>(size, data);
-    }
+    if (params.empty()) return;
 
-    template<typename T>
-    Array<T>
-    createDeviceDataArray(const dim4 &size, const void *data)
-    {
-        return Array<T>(size, (const T * const) data);
-    }
+    getQueue().enqueue(cpu::kernel::evalMultiple<T>, params, nodes);
 
-    template<typename T>
-    Array<T>
-    createValueArray(const dim4 &size, const T& value)
-    {
-        TNJ::ScalarNode<T> *node = new TNJ::ScalarNode<T>(value);
-        return createNodeArray<T>(size, TNJ::Node_ptr(
-                                      reinterpret_cast<TNJ::Node *>(node)));
-    }
+    for (Array<T> *array : outputs) { array->node.reset(); }
+}
 
-    template<typename T>
-    Array<T>
-    createEmptyArray(const dim4 &size)
-    {
-        return Array<T>(size);
-    }
+template<typename T>
+Node_ptr Array<T>::getNode() {
+    if (node) { return node; }
 
-    template<typename T>
-    Array<T> *initArray() { return new Array<T>(dim4()); }
+    std::shared_ptr<BufferNode<T>> out = bufferNodePtr<T>();
+    unsigned bytes = this->getDataDims().elements() * sizeof(T);
+    out->setData(data, bytes, getOffset(), dims().get(), strides().get(),
+                 isLinear());
+    return out;
+}
 
+template<typename T>
+Node_ptr Array<T>::getNode() const {
+    return const_cast<Array<T> *>(this)->getNode();
+}
 
-    template<typename T>
-    Array<T>
-    createNodeArray(const dim4 &dims, Node_ptr node)
-    {
-        Array<T> out =  Array<T>(dims, node);
+template<typename T>
+Array<T> createHostDataArray(const dim4 &dims, const T *const data) {
+    return Array<T>(dims, const_cast<T *>(data), false);
+}
 
-        unsigned length =0, buf_count = 0, bytes = 0;
+template<typename T>
+Array<T> createDeviceDataArray(const dim4 &dims, void *data, bool copy) {
+    bool is_device = true;
+    return Array<T>(dims, static_cast<T *>(data), is_device, copy);
+}
 
-        Node *n = node.get();
-        n->getInfo(length, buf_count, bytes);
-        n->reset();
+template<typename T>
+Array<T> createValueArray(const dim4 &dims, const T &value) {
+    return createNodeArray<T>(dims, make_shared<jit::ScalarNode<T>>(value));
+}
 
-        if (length > MAX_TNJ_LEN ||
-            buf_count >= MAX_BUFFERS ||
-            bytes >= MAX_BYTES) {
-            out.eval();
+template<typename T>
+Array<T> createEmptyArray(const dim4 &dims) {
+    return Array<T>(dims);
+}
+
+template<typename T>
+kJITHeuristics passesJitHeuristics(span<Node *> root_nodes) {
+    if (!evalFlag()) { return kJITHeuristics::Pass; }
+    size_t bytes = 0;
+    for (Node *n : root_nodes) {
+        if (n->getHeight() > static_cast<int>(getMaxJitSize())) {
+            return kJITHeuristics::TreeHeight;
+        }
+        // Check if approaching the memory limit
+        if (getMemoryPressure() >= getMemoryPressureThreshold()) {
+            NodeIterator<Node> it(n);
+            NodeIterator<Node> end_node;
+            bytes = accumulate(it, end_node, bytes,
+                               [=](const size_t prev, const Node &n) {
+                                   // getBytes returns the size of the data
+                                   // Array. Sub arrays will be represented
+                                   // by their parent size.
+                                   return prev + n.getBytes();
+                               });
         }
+    }
 
-        return out;
+    if (jitTreeExceedsMemoryPressure(bytes)) {
+        return kJITHeuristics::MemoryPressure;
     }
 
+    return kJITHeuristics::Pass;
+}
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy)
-    {
-        parent.eval();
+template<typename T>
+Array<T> createNodeArray(const dim4 &dims, Node_ptr node) {
+    Array<T> out(dims, node);
+    return out;
+}
 
-        dim4 dDims = parent.getDataDims();
-        dim4 pDims = parent.dims();
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent, const vector<af_seq> &index,
+                        bool copy) {
+    parent.eval();
 
-        dim4 dims   = toDims  (index, pDims);
-        dim4 offset = toOffset(index, dDims);
-        dim4 stride = toStride (index, dDims);
+    dim4 dDims          = parent.getDataDims();
+    dim4 parent_strides = parent.strides();
 
-        Array<T> out = Array<T>(parent, dims, offset, stride);
+    if (parent.isLinear() == false) {
+        const Array<T> parentCopy = copyArray(parent);
+        return createSubArray(parentCopy, index, copy);
+    }
 
-        if (!copy) return out;
+    const dim4 &pDims = parent.dims();
+    dim4 dims         = toDims(index, pDims);
+    dim4 strides      = toStride(index, dDims);
 
-        if (stride[0] != 1 ||
-            stride[1] <  0 ||
-            stride[2] <  0 ||
-            stride[3] <  0) {
+    // Find total offsets after indexing
+    dim4 offsets = toOffset(index, pDims);
+    dim_t offset = parent.getOffset();
+    for (int i = 0; i < 4; i++) { offset += offsets[i] * parent_strides[i]; }
 
-            out = copyArray(out);
-        }
+    Array<T> out = Array<T>(parent, dims, offset, strides);
 
-        return out;
-    }
+    if (!copy) { return out; }
 
-    template<typename T>
-    void
-    destroyArray(Array<T> *A)
-    {
-        delete A;
+    if (strides[0] != 1 || strides[1] < 0 || strides[2] < 0 || strides[3] < 0) {
+        out = copyArray(out);
     }
 
+    return out;
+}
 
-    template<typename T>
-    void evalArray(const Array<T> &A)
-    {
-        A.eval();
-    }
+template<typename T>
+void destroyArray(Array<T> *A) {
+    delete A;
+}
 
-    template<typename T>
-    void
-    writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes)
-    {
-        if(!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
-        memcpy(arr.get() + arr.getOffset(), data, bytes);
-    }
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data,
+                        const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
+    arr.eval();
+    // Ensure the memory being written to isn't used anywhere else.
+    getQueue().sync();
+    memcpy(arr.get(), data, bytes);
+}
 
-    template<typename T>
-    void
-    writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes)
-    {
-        if(!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
-        memcpy(arr.get() + arr.getOffset(), (const T * const)data, bytes);
-    }
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
+    memcpy(arr.get(), static_cast<const T *const>(data), bytes);
+}
 
-#define INSTANTIATE(T)                                                  \
-    template       Array<T>  createHostDataArray<T>   (const dim4 &size, const T * const data); \
-    template       Array<T>  createDeviceDataArray<T> (const dim4 &size, const void *data); \
-    template       Array<T>  createValueArray<T>      (const dim4 &size, const T &value); \
-    template       Array<T>  createEmptyArray<T>      (const dim4 &size); \
-    template       Array<T>  *initArray<T      >      ();               \
-    template       Array<T>  createSubArray<T>        (const Array<T> &parent, \
-                                                       const std::vector<af_seq> &index, \
-                                                       bool copy);      \
-    template       void      destroyArray<T>          (Array<T> *A);    \
-    template       void      evalArray<T>             (const Array<T> &A); \
-    template       Array<T>  createNodeArray<T>       (const dim4 &size, TNJ::Node_ptr node); \
-    template       Array<T>::~Array        ();                          \
-    template       void Array<T>::eval();                               \
-    template       void Array<T>::eval() const;                         \
-    template       TNJ::Node_ptr Array<T>::getNode() const;             \
-    template       void      writeHostDataArray<T>    (Array<T> &arr, const T * const data, const size_t bytes); \
-    template       void      writeDeviceDataArray<T>  (Array<T> &arr, const void * const data, const size_t bytes); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+template<typename T>
+void Array<T>::setDataDims(const dim4 &new_dims) {
+    data_dims = new_dims;
+    modDims(new_dims);
 }
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> createHostDataArray<T>(const dim4 &dims,                \
+                                             const T *const data);            \
+    template Array<T> createDeviceDataArray<T>(const dim4 &dims, void *data,  \
+                                               bool copy);                    \
+    template Array<T> createValueArray<T>(const dim4 &dims, const T &value);  \
+    template Array<T> createEmptyArray<T>(const dim4 &dims);                  \
+    template Array<T> createSubArray<T>(                                      \
+        const Array<T> &parent, const vector<af_seq> &index, bool copy);      \
+    template void destroyArray<T>(Array<T> * A);                              \
+    template Array<T> createNodeArray<T>(const dim4 &dims, Node_ptr node);    \
+    template void Array<T>::eval();                                           \
+    template void Array<T>::eval() const;                                     \
+    template T *Array<T>::device();                                           \
+    template Array<T>::Array(const af::dim4 &dims, T *const in_data,          \
+                             bool is_device, bool copy_device);               \
+    template Array<T>::Array(const af::dim4 &dims, const af::dim4 &strides,   \
+                             dim_t offset, T *const in_data, bool is_device); \
+    template Node_ptr Array<T>::getNode();                                    \
+    template Node_ptr Array<T>::getNode() const;                              \
+    template void writeHostDataArray<T>(Array<T> & arr, const T *const data,  \
+                                        const size_t bytes);                  \
+    template void writeDeviceDataArray<T>(                                    \
+        Array<T> & arr, const void *const data, const size_t bytes);          \
+    template void evalMultiple<T>(vector<Array<T> *> arrays);                 \
+    template kJITHeuristics passesJitHeuristics<T>(span<Node *> n);           \
+    template void Array<T>::setDataDims(const dim4 &new_dims);                \
+    template void checkAndMigrate<T>(const Array<T> &arr);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
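Reviewer note on `createSubArray` above: the new code folds the per-dimension index offsets into a single linear element offset by adding `offsets[i] * parent_strides[i]` to the parent's own offset. The sketch below reproduces that arithmetic with plain `std::array` values. `dim4_sketch`, `calcStridesSketch`, and `subArrayOffsetSketch` are hypothetical names, and the dense column-major stride layout is an assumption about what `calcStrides` produces.

```cpp
#include <array>

using dim_sketch_t = long long;
using dim4_sketch  = std::array<dim_sketch_t, 4>;

// Assumed layout: dense column-major strides, so the first dimension is
// contiguous and each later stride is the product of the earlier extents.
dim4_sketch calcStridesSketch(const dim4_sketch& dims) {
    dim4_sketch strides{1, 1, 1, 1};
    for (int i = 1; i < 4; i++) { strides[i] = strides[i - 1] * dims[i - 1]; }
    return strides;
}

// Mirrors the loop in createSubArray: start from the parent's linear offset
// and add each per-dimension offset scaled by the parent's stride.
dim_sketch_t subArrayOffsetSketch(dim_sketch_t parentOffset,
                                  const dim4_sketch& offsets,
                                  const dim4_sketch& parentStrides) {
    dim_sketch_t offset = parentOffset;
    for (int i = 0; i < 4; i++) { offset += offsets[i] * parentStrides[i]; }
    return offset;
}
```

For a 4x3x2x1 parent the sketch yields strides {1, 4, 12, 24}, so a view starting at index (1, 2, 0, 0) begins 1*1 + 2*4 = 9 elements into the parent buffer.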
diff --git a/src/backend/cpu/Array.hpp b/src/backend/cpu/Array.hpp
index 437265528f..7afed3501e 100644
--- a/src/backend/cpu/Array.hpp
+++ b/src/backend/cpu/Array.hpp
@@ -7,145 +7,306 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-//This is the array implementation class.
+// This is the array implementation class.
 #pragma once
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <ArrayInfo.hpp>
-#include <backend.hpp>
-#include <types.hpp>
-#include <traits.hpp>
-#include <TNJ/Node.hpp>
+
+#include <Param.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/jit/Node.hpp>
+#include <jit/Node.hpp>
 #include <memory.hpp>
-#include <memory>
+#include <platform.hpp>
+
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/seq.h>
+
+#include <nonstd/span.hpp>
 #include <algorithm>
+#include <cstddef>
+#include <memory>
 #include <vector>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
-    using std::shared_ptr;
-    using af::dim4;
+namespace jit {
+template<typename T>
+class BufferNode;
+}
 
-    template<typename T> class Array;
+namespace kernel {
+template<typename T>
+void evalArray(Param<T> in, common::Node_ptr node);
+
+template<typename T>
+void evalMultiple(std::vector<Param<T>> arrays,
+                  std::vector<common::Node_ptr> nodes);
+
+}  // namespace kernel
+
+template<typename T>
+class Array;
+
+using af::dim4;
+using std::shared_ptr;
+
+template<typename T>
+void evalMultiple(std::vector<Array<T> *> array_ptrs);
+
+// Creates a new Array object backed by a JIT node and returns it by value.
+template<typename T>
+Array<T> createNodeArray(const af::dim4 &dims, common::Node_ptr node);
+
+template<typename T>
+Array<T> createValueArray(const af::dim4 &dims, const T &value);
+
+// Creates an array and copies from the \p data pointer located in host memory
+//
+// \param[in] dims The dimension of the array
+// \param[in] data The data that will be copied to the array
+template<typename T>
+Array<T> createHostDataArray(const af::dim4 &dims, const T *const data);
+
+/// Creates an Array<T> object from a device pointer.
+///
+/// \param[in] dims The shape of the resulting Array.
+/// \param[in] data The device pointer to the data
+/// \param[in] copy If true, memory will be allocated and the data will be
+///                 copied to the device. If false the data will be used
+///                 directly
+/// \returns The new Array<T> object based on the device pointer.
+template<typename T>
+Array<T> createDeviceDataArray(const af::dim4 &dims, void *data,
+                               bool copy = false);
+
+template<typename T>
+Array<T> createStridedArray(af::dim4 dims, af::dim4 strides, dim_t offset,
+                            T *const in_data, bool is_device) {
+    return Array<T>(dims, strides, offset, in_data, is_device);
+}
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createNodeArray(const af::dim4 &size, TNJ::Node_ptr node);
+/// Copies data to an existing Array object from a host pointer
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data, const size_t bytes);
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createValueArray(const af::dim4 &size, const T& value);
+/// Copies data to an existing Array object from a device pointer
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes);
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createHostDataArray(const af::dim4 &size, const T * const data);
+/// Creates an empty array of a given size. No data is initialized
+///
+/// \param[in] dims The dimensions of the output array
+template<typename T>
+Array<T> createEmptyArray(const af::dim4 &dims);
 
-    template<typename T>
-    Array<T> createDeviceDataArray(const af::dim4 &size, const void *data);
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent,
+                        const std::vector<af_seq> &index, bool copy = true);
 
-    // Copies data to an existing Array object from a host pointer
-    template<typename T>
-    void writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes);
+// Deletes an Array object that was allocated on the heap.
+template<typename T>
+void destroyArray(Array<T> *A);
 
-    // Copies data to an existing Array object from a device pointer
-    template<typename T>
-    void writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes);
+template<typename T>
+kJITHeuristics passesJitHeuristics(nonstd::span<common::Node *> node);
 
-    // Create an Array object and do not assign any values to it
-    template<typename T> Array<T> *initArray();
+template<typename T>
+void *getDevicePtr(const Array<T> &arr) {
+    T *ptr = arr.device();
+    memLock(ptr);
 
-    template<typename T>
-    Array<T> createEmptyArray(const af::dim4 &size);
+    return (void *)ptr;
+}
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy=true);
+template<typename T>
+void *getRawPtr(const Array<T> &arr) {
+    getQueue().sync();
+    return (void *)(arr.get(false));
+}
 
-    template<typename T>
-    void evalArray(const Array<T> &A);
+/// Checks if the Array object can be migrated to the current device and if not,
+/// an error is thrown
+///
+/// \param[in] arr The Array that will be checked.
+template<typename T>
+void checkAndMigrate(const Array<T> &arr);
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    void destroyArray(Array<T> *A);
+// Array implementation
+template<typename T>
+class Array {
+    ArrayInfo info;  // Must be the first element of Array<T>
 
-    template<typename T>
-    void *getDevicePtr(const Array<T>& arr)
-    {
-        memUnlink((T *)arr.get());
-        return (void *)arr.get();
+    /// Pointer to the data
+    std::shared_ptr<T> data;
+
+    /// The shape of the underlying parent data.
+    af::dim4 data_dims;
+
+    /// Null if this is a buffer node. Otherwise this points to a JIT node
+    common::Node_ptr node;
+
+    /// If true, the Array object is the parent. If false the data object points
+    /// to another array's data
+    bool owner;
+
+    /// Default constructor
+    Array() = default;
+
+    /// Creates an uninitialized array of a specific shape
+    Array(dim4 dims);
+
+    explicit Array(const af::dim4 &dims, T *const in_data, bool is_device,
+                   bool copy_device = false);
+    Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset,
+          const dim4 &stride);
+    explicit Array(const af::dim4 &dims, common::Node_ptr n);
+    Array(const af::dim4 &dims, const af::dim4 &strides, dim_t offset,
+          T *const in_data, bool is_device = false);
+
+   public:
+    Array<T>(const Array<T> &other) = default;
+    Array<T>(Array<T> &&other)      = default;
+
+    Array<T> &operator=(Array<T> other) noexcept {
+        swap(other);
+        return *this;
     }
 
-    // Array Array Implementation
-    template<typename T>
-    class Array : public ArrayInfo
-    {
-        //TODO: Generator based array
+    void swap(Array<T> &other) noexcept {
+        using std::swap;
+        swap(info, other.info);
+        swap(data, other.data);
+        swap(data_dims, other.data_dims);
+        swap(node, other.node);
+        swap(owner, other.owner);
+    }
 
-        //data if parent. empty if child
-        std::shared_ptr<T> data;
-        af::dim4 data_dims;
+    void resetInfo(const af::dim4 &dims) { info.resetInfo(dims); }
 
-        TNJ::Node_ptr node;
-        bool ready;
-        dim_t offset;
-        bool owner;
+    // Modifies the dimensions of the array without modifying the underlying
+    // data
+    void resetDims(const af::dim4 &dims) { info.resetDims(dims); }
+    void modDims(const af::dim4 &newDims) { info.modDims(newDims); }
+    void modStrides(const af::dim4 &newStrides) { info.modStrides(newStrides); }
+    void setId(int id) { info.setId(id); }
 
-        Array(dim4 dims);
-        explicit Array(dim4 dims, const T * const in_data);
-        Array(const Array<T>& parnt, const dim4 &dims, const dim4 &offset, const dim4 &stride);
-        explicit Array(af::dim4 dims, TNJ::Node_ptr n);
+#define INFO_FUNC(RET_TYPE, NAME) \
+    RET_TYPE NAME() const { return info.NAME(); }
 
-    public:
+    INFO_FUNC(const af_dtype &, getType)
+    INFO_FUNC(const af::dim4 &, strides)
+    INFO_FUNC(dim_t, elements)
+    INFO_FUNC(dim_t, ndims)
+    INFO_FUNC(const af::dim4 &, dims)
+    INFO_FUNC(int, getDevId)
 
-        ~Array();
+#undef INFO_FUNC
 
-        bool isReady() const { return ready; }
+#define INFO_IS_FUNC(NAME) \
+    bool NAME() const { return info.NAME(); }
 
-        bool isOwner() const { return owner; }
+    INFO_IS_FUNC(isEmpty)
+    INFO_IS_FUNC(isScalar)
+    INFO_IS_FUNC(isRow)
+    INFO_IS_FUNC(isColumn)
+    INFO_IS_FUNC(isVector)
+    INFO_IS_FUNC(isComplex)
+    INFO_IS_FUNC(isReal)
+    INFO_IS_FUNC(isDouble)
+    INFO_IS_FUNC(isSingle)
+    INFO_IS_FUNC(isHalf)
+    INFO_IS_FUNC(isRealFloating)
+    INFO_IS_FUNC(isFloating)
+    INFO_IS_FUNC(isInteger)
+    INFO_IS_FUNC(isBool)
+    INFO_IS_FUNC(isLinear)
+    INFO_IS_FUNC(isSparse)
 
-        void eval();
-        void eval() const;
+#undef INFO_IS_FUNC
 
-        dim_t getOffset() const { return offset; }
-        shared_ptr<T> getData() const {return data; }
+    ~Array() = default;
 
-        dim4 getDataDims() const
-        {
-            // This is for moddims
-            // dims and data_dims are different when moddims is used
-            return isOwner() ? dims() : data_dims;
-        }
+    bool isReady() const { return static_cast<bool>(node) == false; }
 
-        T* get(bool withOffset = true)
-        {
-            return const_cast<T*>(static_cast<const Array<T>*>(this)->get(withOffset));
-        }
+    bool isOwner() const { return owner; }
+
+    void eval();
+    void eval() const;
+
+    dim_t getOffset() const { return info.getOffset(); }
+    shared_ptr<T> getData() const { return data; }
+
+    dim4 getDataDims() const { return data_dims; }
+
+    void setDataDims(const dim4 &new_dims);
 
-        const T* get(bool withOffset = true) const
-        {
-            if (!isReady()) eval();
-            return data.get() + (withOffset ? offset : 0);
+    size_t getAllocatedBytes() const {
+        if (!isReady()) return 0;
+        size_t bytes = memoryManager().allocated(data.get());
+        // External device pointer
+        if (bytes == 0 && data.get()) {
+            return data_dims.elements() * sizeof(T);
         }
+        return bytes;
+    }
 
-        TNJ::Node_ptr getNode() const;
+    T *device();
 
-        friend Array<T> createValueArray<T>(const af::dim4 &size, const T& value);
-        friend Array<T> createHostDataArray<T>(const af::dim4 &size, const T * const data);
-        friend Array<T> createDeviceDataArray<T>(const af::dim4 &size, const void *data);
+    T *device() const { return const_cast<Array<T> *>(this)->device(); }
 
-        friend Array<T> *initArray<T>();
-        friend Array<T> createEmptyArray<T>(const af::dim4 &size);
-        friend Array<T> createNodeArray<T>(const af::dim4 &dims, TNJ::Node_ptr node);
+    T *get(bool withOffset = true) {
+        return const_cast<T *>(
+            static_cast<const Array<T> *>(this)->get(withOffset));
+    }
 
-        friend Array<T> createSubArray<T>(const Array<T>& parent,
-                                          const std::vector<af_seq> &index,
-                                          bool copy);
+    const T *get(bool withOffset = true) const {
+        if (!data.get()) eval();
+        return data.get() + (withOffset ? getOffset() : 0);
+    }
 
-        friend void destroyArray<T>(Array<T> *arr);
-        friend void evalArray<T>(const Array<T> &arr);
-        friend void *getDevicePtr<T>(const Array<T>& arr);
-    };
+    int useCount() const { return static_cast<int>(data.use_count()); }
 
-}
+    operator Param<T>() {
+        return Param<T>(this->get(), this->dims(), this->strides());
+    }
+
+    operator CParam<T>() const {
+        return CParam<T>(this->get(), this->dims(), this->strides());
+    }
+
+    common::Node_ptr getNode() const;
+    common::Node_ptr getNode();
+
+    friend void evalMultiple<T>(std::vector<Array<T> *> arrays);
+
+    friend Array<T> createValueArray<T>(const af::dim4 &dims, const T &value);
+    friend Array<T> createHostDataArray<T>(const af::dim4 &dims,
+                                           const T *const data);
+    friend Array<T> createDeviceDataArray<T>(const af::dim4 &dims, void *data,
+                                             bool copy);
+    friend Array<T> createStridedArray<T>(af::dim4 dims, af::dim4 strides,
+                                          dim_t offset, T *const in_data,
+                                          bool is_device);
+
+    friend Array<T> createEmptyArray<T>(const af::dim4 &dims);
+    friend Array<T> createNodeArray<T>(const af::dim4 &dims,
+                                       common::Node_ptr node);
+
+    friend Array<T> createSubArray<T>(const Array<T> &parent,
+                                      const std::vector<af_seq> &index,
+                                      bool copy);
+
+    friend void kernel::evalArray<T>(Param<T> in, common::Node_ptr node);
+    friend void kernel::evalMultiple<T>(std::vector<Param<T>> arrays,
+                                        std::vector<common::Node_ptr> nodes);
+
+    friend void destroyArray<T>(Array<T> *arr);
+    friend void *getDevicePtr<T>(const Array<T> &arr);
+    friend void *getRawPtr<T>(const Array<T> &arr);
+};
+
+}  // namespace cpu
+}  // namespace arrayfire
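
The assignment operator added to `Array<T>` above takes its argument by value and then calls `swap`, which is the copy-and-swap idiom. The sketch below is a minimal, self-contained illustration of that idiom and is not ArrayFire code: `Buffer`, `data_`, `size_`, and `ownersAfterCopyAssign` are hypothetical names chosen for the example, and the shared `int` storage only stands in for the `std::shared_ptr<T> data` member.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for Array<T>: a shared buffer plus one piece of
// metadata. None of these names come from ArrayFire.
class Buffer {
    std::shared_ptr<int> data_;
    int size_ = 0;

   public:
    Buffer() = default;
    explicit Buffer(int size)
        : data_(new int[size](), std::default_delete<int[]>()), size_(size) {}

    Buffer(const Buffer &other) = default;
    Buffer(Buffer &&other)      = default;

    // The parameter is taken by value, so the copy or move happens at the
    // call site. The body only swaps and therefore cannot throw.
    Buffer &operator=(Buffer other) noexcept {
        swap(other);
        return *this;
    }

    // Member-wise swap; `using std::swap` lets ADL pick better overloads.
    void swap(Buffer &other) noexcept {
        using std::swap;
        swap(data_, other.data_);
        swap(size_, other.size_);
    }

    int size() const { return size_; }
    long useCount() const { return data_.use_count(); }
};

// Copy-assigns one buffer to another and reports how many owners share the
// source's storage afterwards.
inline long ownersAfterCopyAssign() {
    Buffer a(4);
    Buffer b;
    b = a;  // copy-constructs the parameter, then swaps it into b
    assert(b.size() == 4);
    return a.useCount();
}
```

One operator serves both copy and move assignment here, and the previous contents of the left-hand side are released when the by-value parameter goes out of scope.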
diff --git a/src/backend/cpu/CMakeLists.txt b/src/backend/cpu/CMakeLists.txt
index 5efe74cdb7..8a83a55894 100644
--- a/src/backend/cpu/CMakeLists.txt
+++ b/src/backend/cpu/CMakeLists.txt
@@ -1,145 +1,367 @@
-
-ADD_DEFINITIONS(-DAF_CPU)
-
-FIND_PACKAGE(CBLAS REQUIRED)
-
-IF(USE_CPU_F77_BLAS)
-    MESSAGE("Using F77 BLAS")
-    ADD_DEFINITIONS(-DUSE_F77_BLAS)
-ENDIF()
-
-IF(USE_CPU_MKL)
-    MESSAGE("Using MKL")
-    ADD_DEFINITIONS(-DUSE_MKL)
-ENDIF()
-
-IF (NOT CBLAS_LIBRARIES)
-    MESSAGE(SEND_ERROR "CBLAS Library not set")
-ELSE()
-    MESSAGE(STATUS "Using CBLAS Library: ${CBLAS_LIBRARIES}")
-ENDIF()
-
-IF(${CMAKE_CXX_COMPILER_ID} STREQUAL "GNU" AND "${APPLE}")
-    ADD_DEFINITIONS(-flax-vector-conversions)
-ENDIF()
-
-IF(${MKL_FOUND})
-    ADD_DEFINITIONS(-DUSE_MKL)
-ENDIF()
-
-FIND_PACKAGE(FFTW REQUIRED)
-MESSAGE(STATUS "FFTW Found ? ${FFTW_FOUND}")
-MESSAGE(STATUS "FFTW Library: ${FFTW_LIBRARIES}")
-
-INCLUDE("${CMAKE_MODULE_PATH}/FindGLEWmx.cmake")
-
-IF(APPLE)
-    FIND_PACKAGE(LAPACK)
-ELSE(APPLE) # Linux and Windows
-    FIND_PACKAGE(LAPACKE)
-ENDIF(APPLE)
-
-IF(NOT LAPACK_FOUND)
-    MESSAGE(WARNING "LAPACK not found. Functionality will be disabled")
-ELSE(NOT LAPACK_FOUND)
-    ADD_DEFINITIONS(-DLAPACK_${BLA_VENDOR})
-    ADD_DEFINITIONS(-DWITH_CPU_LINEAR_ALGEBRA)
-ENDIF()
-
-IF(NOT UNIX)
-    ADD_DEFINITIONS(-DAFDLL)
-ENDIF()
-
-INCLUDE_DIRECTORIES(
-    ${CMAKE_INCLUDE_PATH}
-    "${CMAKE_SOURCE_DIR}/src/backend/cpu"
-    ${FFTW_INCLUDES}
-    ${CBLAS_INCLUDE_DIR}
-    ${LAPACK_INCLUDE_DIR}
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+include(InternalUtils)
+
+generate_product_version(af_cpu_ver_res_file
+  FILE_NAME "afcpu"
+  FILE_DESCRIPTION "CPU Backend Dynamic-link library"
+)
+
+add_library(afcpu "")
+add_library(ArrayFire::afcpu ALIAS afcpu)
+
+# CPU back end needs to use MKL LP64 interface
+set(MKL_INTERFACE_INTEGER_SIZE 4)
+set(MKL_INTERFACE "lp64")
+
+# CPU backend source files
+target_sources(afcpu
+  PRIVATE
+    $<$<PLATFORM_ID:Windows>:${af_cpu_ver_res_file}>
+    Array.cpp
+    Array.hpp
+    anisotropic_diffusion.cpp
+    anisotropic_diffusion.hpp
+    approx.cpp
+    approx.hpp
+    arith.hpp
+    assign.cpp
+    assign.hpp
+    backend.hpp
+    binary.hpp
+    bilateral.cpp
+    bilateral.hpp
+    blas.cpp
+    blas.hpp
+    canny.cpp
+    canny.hpp
+    cast.hpp
+    cholesky.cpp
+    cholesky.hpp
+    complex.hpp
+    convolve.cpp
+    convolve.hpp
+    copy.cpp
+    copy.hpp
+    device_manager.cpp
+    device_manager.hpp
+    diagonal.cpp
+    diagonal.hpp
+    diff.cpp
+    diff.hpp
+    err_cpu.hpp
+    Event.cpp
+    Event.hpp
+    exampleFunction.cpp
+    exampleFunction.hpp
+    fast.cpp
+    fast.hpp
+    fft.cpp
+    fft.hpp
+    fftconvolve.cpp
+    fftconvolve.hpp
+    flood_fill.hpp
+    flood_fill.cpp
+    gradient.cpp
+    gradient.hpp
+    harris.cpp
+    harris.hpp
+    hist_graphics.cpp
+    hist_graphics.hpp
+    histogram.cpp
+    histogram.hpp
+    homography.cpp
+    homography.hpp
+    hsv_rgb.cpp
+    hsv_rgb.hpp
+    identity.cpp
+    identity.hpp
+    iir.cpp
+    iir.hpp
+    image.cpp
+    image.hpp
+    index.cpp
+    index.hpp
+    inverse.cpp
+    inverse.hpp
+    iota.cpp
+    iota.hpp
+    ireduce.cpp
+    ireduce.hpp
+    join.cpp
+    join.hpp
+    lapack_helper.hpp
+    logic.hpp
+    lookup.cpp
+    lookup.hpp
+    lu.cpp
+    lu.hpp
+    match_template.cpp
+    match_template.hpp
+    math.cpp
+    math.hpp
+    mean.cpp
+    mean.hpp
+    meanshift.cpp
+    meanshift.hpp
+    medfilt.cpp
+    medfilt.hpp
+    memory.cpp
+    memory.hpp
+    moments.cpp
+    moments.hpp
+    morph.cpp
+    morph.hpp
+    nearest_neighbour.cpp
+    nearest_neighbour.hpp
+    orb.cpp
+    orb.hpp
+    ParamIterator.hpp
+    platform.cpp
+    platform.hpp
+    plot.cpp
+    plot.hpp
+    print.hpp
+    qr.cpp
+    qr.hpp
+    queue.hpp
+    random_engine.cpp
+    random_engine.hpp
+    range.cpp
+    range.hpp
+    reduce.cpp
+    reduce.hpp
+    regions.cpp
+    regions.hpp
+    reorder.cpp
+    reorder.hpp
+    resize.cpp
+    resize.hpp
+    reshape.cpp
+    rotate.cpp
+    rotate.hpp
+    scan.cpp
+    scan.hpp
+    scan_by_key.cpp
+    scan_by_key.hpp
+    select.cpp
+    select.hpp
+    set.cpp
+    set.hpp
+    shift.cpp
+    shift.hpp
+    sift.cpp
+    sift.hpp
+    sobel.cpp
+    sobel.hpp
+    solve.cpp
+    solve.hpp
+    sort.cpp
+    sort.hpp
+    sort_by_key.cpp
+    sort_by_key.hpp
+    sort_index.cpp
+    sort_index.hpp
+    sparse.cpp
+    sparse.hpp
+    sparse_arith.cpp
+    sparse_arith.hpp
+    sparse_blas.cpp
+    sparse_blas.hpp
+    surface.cpp
+    surface.hpp
+    susan.cpp
+    susan.hpp
+    svd.cpp
+    svd.hpp
+    tile.cpp
+    tile.hpp
+    topk.cpp
+    topk.hpp
+    traits.hpp
+    transform.cpp
+    transform.hpp
+    transpose.cpp
+    transpose.hpp
+    triangle.cpp
+    triangle.hpp
+    types.hpp
+    unary.hpp
+    unwrap.cpp
+    unwrap.hpp
+    utility.hpp
+    vector_field.cpp
+    vector_field.hpp
+    where.cpp
+    where.hpp
+    wrap.cpp
+    wrap.hpp
+  )
+
+# CPU backend kernel files
+target_sources(afcpu
+  PRIVATE
+    kernel/Array.hpp
+    kernel/anisotropic_diffusion.hpp
+    kernel/approx.hpp
+    kernel/assign.hpp
+    kernel/bilateral.hpp
+    kernel/canny.hpp
+    kernel/convolve.hpp
+    kernel/copy.hpp
+    kernel/diagonal.hpp
+    kernel/diff.hpp
+    kernel/dot.hpp
+    kernel/exampleFunction.hpp
+    kernel/fast.hpp
+    kernel/fftconvolve.hpp
+    kernel/flood_fill.hpp
+    kernel/gradient.hpp
+    kernel/harris.hpp
+    kernel/histogram.hpp
+    kernel/hsv_rgb.hpp
+    kernel/identity.hpp
+    kernel/iir.hpp
+    kernel/index.hpp
+    kernel/interp.hpp
+    kernel/iota.hpp
+    kernel/ireduce.hpp
+    kernel/join.hpp
+    kernel/lookup.hpp
+    kernel/lu.hpp
+    kernel/match_template.hpp
+    kernel/meanshift.hpp
+    kernel/medfilt.hpp
+    kernel/moments.hpp
+    kernel/morph.hpp
+    kernel/nearest_neighbour.hpp
+    kernel/orb.hpp
+    kernel/pad_array_borders.hpp
+    kernel/random_engine.hpp
+    kernel/random_engine_mersenne.hpp
+    kernel/random_engine_philox.hpp
+    kernel/random_engine_threefry.hpp
+    kernel/range.hpp
+    kernel/reduce.hpp
+    kernel/regions.hpp
+    kernel/reorder.hpp
+    kernel/resize.hpp
+    kernel/rotate.hpp
+    kernel/scan.hpp
+    kernel/scan_by_key.hpp
+    kernel/select.hpp
+    kernel/shift.hpp
+    kernel/sift.hpp
+    kernel/sobel.hpp
+    kernel/sort.hpp
+    kernel/sort_by_key.hpp
+    kernel/sort_helper.hpp
+    kernel/sparse.hpp
+    kernel/sparse_arith.hpp
+    kernel/susan.hpp
+    kernel/tile.hpp
+    kernel/transform.hpp
+    kernel/transpose.hpp
+    kernel/triangle.hpp
+    kernel/unwrap.hpp
+    kernel/wrap.hpp
+  )
+
+if (AF_WITH_CPUID)
+  target_compile_definitions(afcpu PRIVATE -DAF_WITH_CPUID)
+endif(AF_WITH_CPUID)
+
+af_dep_check_and_populate(${threads_prefix}
+  URI https://github.com/arrayfire/threads.git
+  REF 4d4a4f0384d1ac2f25b2c4fc1d57b9e25f4d6818
+)
+
+target_sources(afcpu
+  PRIVATE
+    ${${threads_prefix}_SOURCE_DIR}/include/threads/async_queue.hpp
+    ${${threads_prefix}_SOURCE_DIR}/include/threads/event.hpp
+  )
+
+include("${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/CMakeLists.txt")
+
+target_include_directories(afcpu
+  PUBLIC
+    $<BUILD_INTERFACE:${ArrayFire_SOURCE_DIR}/include>
+    $<BUILD_INTERFACE:${ArrayFire_BINARY_DIR}/include>
+    $<INSTALL_INTERFACE:${AF_INSTALL_INC_DIR}>
+  PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${${threads_prefix}_SOURCE_DIR}/include)
+
+target_include_directories(afcpu
+  SYSTEM PRIVATE
+    ${CBLAS_INCLUDE_DIR})
+
+target_compile_definitions(afcpu
+  PRIVATE
+    AF_CPU
+  )
+
+target_link_libraries(afcpu
+  PRIVATE
+    c_api_interface
+    cpp_api_interface
+    afcommon_interface
+    cpu_sort_by_key
+    Threads::Threads
+  )
+if(BUILD_WITH_MKL)
+  target_compile_definitions(afcpu PRIVATE USE_MKL)
+  target_compile_definitions(afcpu PRIVATE AF_MKL_INTERFACE_SIZE=${MKL_INTERFACE_INTEGER_SIZE})
+
+  if(MKL_BATCH)
+    target_compile_definitions(afcpu PRIVATE AF_USE_MKL_BATCH)
+  endif()
+
+  if(AF_WITH_STATIC_MKL)
+      target_link_libraries(afcpu PRIVATE MKL::Static)
+      target_compile_definitions(afcpu PRIVATE USE_STATIC_MKL)
+  else()
+      target_link_libraries(afcpu PRIVATE MKL::RT)
+  endif()
+else()
+  target_link_libraries(afcpu
+    PRIVATE
+      ${CBLAS_LIBRARIES}
+      FFTW::FFTW
+      FFTW::FFTWF
     )
-
-FILE(GLOB cpu_headers
-    "*.hpp"
-    "*.h")
-
-FILE(GLOB cpu_sources
-    "*.cpp")
-
-source_group(backend\\cpu\\Headers FILES ${cpu_headers})
-source_group(backend\\cpu\\Sources FILES ${cpu_sources})
-
-FILE(GLOB backend_headers
-    "../*.hpp"
-    "../*.h"
-    )
-
-FILE(GLOB backend_sources
-    "../*.cpp"
-    )
-
-source_group(backend\\Headers FILES ${backend_headers})
-source_group(backend\\Sources FILES ${backend_sources})
-
-FILE(GLOB c_headers
-    "../../api/c/*.hpp"
-    "../../api/c/*.h"
-    )
-
-FILE(GLOB c_sources
-    "../../api/c/*.cpp"
-    )
-
-source_group(api\\c\\Headers FILES ${c_headers})
-source_group(api\\c\\Sources FILES ${c_sources})
-
-FILE(GLOB cpp_sources
-    "../../api/cpp/*.cpp"
-    )
-
-source_group(api\\cpp\\Sources FILES ${cpp_sources})
-
-# OS Definitions
-IF(UNIX)
-    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC")
-ELSE(${UNIX}) #Windows
-    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
-ENDIF()
-
-ADD_LIBRARY(afcpu SHARED
-           ${cpu_headers}
-           ${cpu_sources}
-           ${backend_headers}
-           ${backend_sources}
-           ${c_headers}
-           ${c_sources}
-           ${cpp_sources})
-
-TARGET_LINK_LIBRARIES(afcpu
-                      ${FreeImage_LIBS}
-                      ${CBLAS_LIBRARIES}
-                      ${FFTW_LIBRARIES}
-                      )
-
-IF(LAPACK_FOUND)
-   TARGET_LINK_LIBRARIES(afcpu
-   ${LAPACK_LIBRARIES}
-   )
-ENDIF()
-
-IF(FORGE_FOUND)
-    TARGET_LINK_LIBRARIES(afcpu
-                          ${FORGE_LIBRARIES}
-                         )
-ENDIF()
-
-SET_TARGET_PROPERTIES(afcpu PROPERTIES
-                      VERSION "${AF_VERSION}"
-                      SOVERSION "${AF_VERSION_MAJOR}")
-
-INSTALL(TARGETS afcpu EXPORT CPU DESTINATION "${AF_INSTALL_LIB_DIR}"
-        COMPONENT libraries)
-
-export(TARGETS afcpu FILE ArrayFireCPU.cmake)
-INSTALL(EXPORT CPU DESTINATION "${AF_INSTALL_CMAKE_DIR}"
-        COMPONENT cmake
-        FILE ArrayFireCPU.cmake)
+  if(LAPACK_FOUND AND LAPACKE_FOUND)
+    target_link_libraries(afcpu PRIVATE LAPACKE::LAPACKE ${LAPACK_LIBRARIES})
+  endif()
+endif()
+
+if(LAPACK_FOUND OR BUILD_WITH_MKL)
+  target_compile_definitions(afcpu PRIVATE WITH_LINEAR_ALGEBRA)
+endif()
+
+af_split_debug_info(afcpu ${AF_INSTALL_LIB_DIR})
+
+install(TARGETS afcpu
+  EXPORT ArrayFireCPUTargets
+  COMPONENT cpu
+  PUBLIC_HEADER DESTINATION af
+  RUNTIME DESTINATION ${AF_INSTALL_BIN_DIR}
+  LIBRARY DESTINATION ${AF_INSTALL_LIB_DIR}
+  ARCHIVE DESTINATION ${AF_INSTALL_LIB_DIR}
+  FRAMEWORK DESTINATION framework
+  INCLUDES DESTINATION ${AF_INSTALL_INC_DIR}
+  )
+
+source_group(include REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/include/*)
+source_group(api\\cpp REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/cpp/*)
+source_group(api\\c   REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/c/*)
+source_group(backend  REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/backend/common/*|${CMAKE_CURRENT_SOURCE_DIR}/*)
+source_group(backend\\kernel  REGULAR_EXPRESSION ${CMAKE_CURRENT_SOURCE_DIR}/kernel/*)
+source_group("generated files" FILES ${ArrayFire_BINARY_DIR}/src/backend/build_version.hpp ${ArrayFire_BINARY_DIR}/include/af/version.h)
+source_group("" FILES CMakeLists.txt)
diff --git a/src/backend/cpu/Event.cpp b/src/backend/cpu/Event.cpp
new file mode 100644
index 0000000000..8cdf94338c
--- /dev/null
+++ b/src/backend/cpu/Event.cpp
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Event.hpp>
+
+#include <common/err_common.hpp>
+#include <events.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/event.h>
+#include <memory>
+
+using std::make_unique;
+
+namespace arrayfire {
+namespace cpu {
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(cpu::queue& queue) {
+    Event e;
+    if (0 == e.create()) { e.mark(queue); }
+    return e;
+}
+
+af_event createEvent() {
+    auto e = make_unique<Event>();
+    // Ensure that the default queue is initialized
+    getQueue();
+    if (e->create() != 0) {
+        AF_ERROR("Could not create event", AF_ERR_RUNTIME);
+    }
+    Event& ref = *e.release();
+    return getHandle(ref);
+}
+
+void markEventOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active queue
+    if (event.mark(getQueue()) != 0) {
+        AF_ERROR("Could not mark event on active queue", AF_ERR_RUNTIME);
+    }
+}
+
+void enqueueWaitOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active queue
+    if (event.enqueueWait(getQueue()) != 0) {
+        AF_ERROR("Could not enqueue wait on active queue for event",
+                 AF_ERR_RUNTIME);
+    }
+}
+
+void block(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    if (event.block() != 0) {
+        AF_ERROR("Could not block on active queue for event", AF_ERR_RUNTIME);
+    }
+}
+
+af_event createAndMarkEvent() {
+    af_event handle = createEvent();
+    markEventOnActiveQueue(handle);
+    return handle;
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/Event.hpp b/src/backend/cpu/Event.hpp
new file mode 100644
index 0000000000..103bc3e9ee
--- /dev/null
+++ b/src/backend/cpu/Event.hpp
@@ -0,0 +1,62 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/EventBase.hpp>
+#include <queue.hpp>
+#include <af/event.h>
+
+#include <type_traits>
+
+namespace arrayfire {
+namespace cpu {
+
+class CPUEventPolicy {
+   public:
+    using EventType = queue_event;
+    using QueueType = std::add_lvalue_reference<queue>::type;
+    using ErrorType = int;
+
+    static int createAndMarkEvent(queue_event *e) noexcept {
+        return e->create();
+    }
+
+    static int markEvent(queue_event *e, cpu::queue &stream) noexcept {
+        return e->mark(stream);
+    }
+
+    static int waitForEvent(queue_event *e, cpu::queue &stream) noexcept {
+        return e->wait(stream);
+    }
+
+    static int syncForEvent(queue_event *e) noexcept {
+        e->sync();
+        return 0;
+    }
+
+    static int destroyEvent(queue_event *e) noexcept { return 0; }
+};
+
+using Event = common::EventBase<CPUEventPolicy>;
+
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(cpu::queue &queue);
+
+af_event createEvent();
+
+void markEventOnActiveQueue(af_event eventHandle);
+
+void enqueueWaitOnActiveQueue(af_event eventHandle);
+
+void block(af_event eventHandle);
+
+af_event createAndMarkEvent();
+
+}  // namespace cpu
+}  // namespace arrayfire
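
`CPUEventPolicy` supplies the backend-specific pieces (event type, queue type, error type, and static hooks) that `common::EventBase` is parameterized on. `EventBase` itself is not part of this diff, so the sketch below is only a guess at the general shape of such a policy-based wrapper: `EventSketch`, `ToyQueue`, `ToyEvent`, `ToyPolicy`, and `markedPosition` are all hypothetical names, and the real class may differ in every detail.

```cpp
#include <cassert>

// Hypothetical, simplified wrapper in the spirit of a policy-based event
// class. Every member here is an assumption made for illustration.
template<typename Policy>
class EventSketch {
    typename Policy::EventType e_{};

   public:
    typename Policy::ErrorType create() {
        return Policy::createAndMarkEvent(&e_);
    }
    typename Policy::ErrorType mark(typename Policy::QueueType q) {
        return Policy::markEvent(&e_, q);
    }
    typename Policy::ErrorType block() { return Policy::syncForEvent(&e_); }
    const typename Policy::EventType &native() const { return e_; }
};

// Toy backend: the "queue" counts submitted work and the "event" remembers
// the counter value at the moment it was marked.
struct ToyQueue {
    int submitted = 0;
};
struct ToyEvent {
    int markedAt = -1;
};

struct ToyPolicy {
    using EventType = ToyEvent;
    using QueueType = ToyQueue &;
    using ErrorType = int;

    static int createAndMarkEvent(ToyEvent *e) noexcept {
        e->markedAt = -1;
        return 0;
    }
    static int markEvent(ToyEvent *e, ToyQueue &q) noexcept {
        e->markedAt = q.submitted;
        return 0;
    }
    static int syncForEvent(ToyEvent *) noexcept { return 0; }
};

// Same control flow as makeEvent: create the event, and mark it on the
// queue only if creation succeeded.
inline int markedPosition(ToyQueue &q) {
    EventSketch<ToyPolicy> e;
    if (0 == e.create()) { e.mark(q); }
    assert(e.block() == 0);
    return e.native().markedAt;
}
```

Keeping the hooks static and `noexcept` means the wrapper needs no virtual dispatch, and a different backend can be plugged in by swapping only the policy type.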
diff --git a/src/backend/cpu/Param.hpp b/src/backend/cpu/Param.hpp
new file mode 100644
index 0000000000..55b507876a
--- /dev/null
+++ b/src/backend/cpu/Param.hpp
@@ -0,0 +1,157 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+/// \brief Constant parameter object whose memory cannot be modified. Params
+///        represent the view of the memory in the kernel object. They do not
+///        own the memory.
+template<typename T>
+class CParam {
+   private:
+    const T *m_ptr;
+    af::dim4 m_dims;
+    af::dim4 m_strides;
+
+   public:
+    CParam(const T *iptr, const af::dim4 &idims,
+           const af::dim4 &istrides) noexcept
+        : m_ptr(iptr) {
+        for (int i = 0; i < 4; i++) {
+            m_dims[i]    = idims[i];
+            m_strides[i] = istrides[i];
+        }
+    }
+
+    /// \brief returns the pointer to the memory
+    constexpr const T *get() const noexcept { return m_ptr; }
+
+    /// Gets the shape/dimension of the memory
+    af::dim4 dims() const noexcept { return m_dims; }
+
+    /// Gets the stride of the memory
+    af::dim4 strides() const noexcept { return m_strides; }
+
+    /// Returns the size of a particular dimension
+    ///
+    /// \param[in] i The dimension
+    constexpr dim_t dims(int i) const noexcept { return m_dims[i]; }
+
+    /// Returns the stride of a particular dimension
+    ///
+    /// \param[in] i The dimension
+    constexpr dim_t strides(int i) const noexcept { return m_strides[i]; }
+
+    constexpr CParam()                                 = delete;
+    constexpr CParam(const CParam &other)              = default;
+    constexpr CParam(CParam &&other)                   = default;
+    CParam<T> &operator=(CParam &&other) noexcept      = default;
+    CParam<T> &operator=(const CParam &other) noexcept = default;
+    ~CParam()                                          = default;
+};
+
+/// \brief Parameter object usually passed into kernels. Params
+///        represent the view of the memory in the kernel object. They do not
+///        own the memory.
+template<typename T>
+class Param {
+   private:
+    T *m_ptr;
+    af::dim4 m_dims;
+    af::dim4 m_strides;
+
+   public:
+    /// Creates an empty Param object pointing to null
+    Param() noexcept : m_ptr(nullptr) {}
+
+    /// Creates a new Param object given a pointer, dimension and strides
+    Param(T *iptr, const af::dim4 &idims, const af::dim4 &istrides) noexcept
+        : m_ptr(iptr) {
+        for (int i = 0; i < 4; i++) {
+            m_dims[i]    = idims[i];
+            m_strides[i] = istrides[i];
+        }
+    }
+
+    /// returns the pointer to the object
+    T *get() noexcept { return m_ptr; }
+
+    /// Param to CParam implicit conversion operator
+    constexpr operator CParam<T>() const noexcept {
+        return CParam<T>(const_cast<T *>(m_ptr), m_dims, m_strides);
+    }
+
+    /// Gets the shape/dimension of the memory
+    af::dim4 dims() const noexcept { return m_dims; }
+
+    /// Gets the stride of the memory
+    af::dim4 strides() const noexcept { return m_strides; }
+
+    /// Returns the size of a particular dimension
+    ///
+    /// \param[in] i The dimension
+    constexpr dim_t dims(int i) const noexcept { return m_dims[i]; }
+
+    /// Returns the stride of a particular dimension
+    ///
+    /// \param[in] i The dimension
+    constexpr dim_t strides(int i) const noexcept { return m_strides[i]; }
+
+    ~Param()                                         = default;
+    constexpr Param(const Param &other)              = default;
+    constexpr Param(Param &&other)                   = default;
+    Param<T> &operator=(Param &&other) noexcept      = default;
+    Param<T> &operator=(const Param &other) noexcept = default;
+};
+
+template<typename T>
+class Array;
+
+// These functions are needed to convert Array<T> to Param<T> when queueing up
+// functions. This is fine because we only have 1 compute queue. This ensures
+// there are no race conditions.
+
+/// \brief Converts Array<T> to Param<T> or CParam<T> based on the constness
+///        of the Array<T> object. If called on anything else, the object is
+///        returned unchanged.
+///
+/// \param[in] val The value to convert to Param<T>
+template<typename T>
+const T &toParam(const T &val) noexcept {
+    return val;
+}
+
+/// \brief Converts Array<T> to Param<T> or CParam<T> based on the constness
+///        of the Array<T> object. If called on anything else, the object is
+///        returned unchanged.
+///
+/// \param[in] val The value to convert to Param<T>
+template<typename T>
+Param<T> toParam(Array<T> &val) noexcept {
+    return Param<T>(val.get(), val.dims(), val.strides());
+}
+
+/// \brief Converts Array<T> to Param<T> or CParam<T> based on the constness
+///        of the Array<T> object. If called on anything else, the object is
+///        returned unchanged.
+///
+/// \param[in] val The value to convert to Param<T>
+template<typename T>
+CParam<T> toParam(const Array<T> &val) noexcept {
+    return CParam<T>(val.get(), val.dims(), val.strides());
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
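
The three `toParam` overloads above rely on ordinary overload resolution: a non-const array binds best to the `Array<T> &` overload, a const array selects the more specialized `const Array<T> &` overload over the generic one, and everything else falls through to the pass-through template. The sketch below reproduces only that overload structure with hypothetical stand-in types (`ArraySketch`, `ParamSketch`, `CParamSketch`, `toParamSketch`, and `kind` are not ArrayFire names).

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins; only the overload structure mirrors the diff.
template<typename T>
struct ArraySketch {
    T *ptr;
};
template<typename T>
struct ParamSketch {
    T *ptr;
};
template<typename T>
struct CParamSketch {
    const T *ptr;
};

// Fallback: any non-array argument is passed through unchanged.
template<typename T>
const T &toParamSketch(const T &val) noexcept {
    return val;
}

// Mutable array -> mutable view.
template<typename T>
ParamSketch<T> toParamSketch(ArraySketch<T> &val) noexcept {
    return ParamSketch<T>{val.ptr};
}

// Const array -> read-only view.
template<typename T>
CParamSketch<T> toParamSketch(const ArraySketch<T> &val) noexcept {
    return CParamSketch<T>{val.ptr};
}

// Helpers that report which result type a call produced.
inline std::string kind(const ParamSketch<int> &) { return "Param"; }
inline std::string kind(const CParamSketch<int> &) { return "CParam"; }
inline std::string kind(int) { return "value"; }
```

With these definitions, `kind(toParamSketch(a))` yields `"Param"` for a non-const `ArraySketch<int> a`, `"CParam"` for a const one, and `"value"` for a plain `int`, so generic code can forward a mix of arrays and scalars through a single call.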
diff --git a/src/backend/cpu/ParamIterator.hpp b/src/backend/cpu/ParamIterator.hpp
new file mode 100644
index 0000000000..3d6427853e
--- /dev/null
+++ b/src/backend/cpu/ParamIterator.hpp
@@ -0,0 +1,141 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <af/dim4.hpp>
+
+#include <array>
+#include <cstddef>
+#include <iterator>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+
+/// A Param iterator that iterates through a Param object
+template<typename T>
+class ParamIterator {
+   public:
+    using difference_type   = ptrdiff_t;
+    using value_type        = T;
+    using pointer           = T*;
+    using reference         = T&;
+    using const_reference   = const T&;
+    using iterator_category = std::forward_iterator_tag;
+
+    /// Creates a sentinel iterator. This is equivalent to the end iterator
+    ParamIterator() noexcept
+        : ptr(nullptr)
+        , dims(1)
+        , stride(1)
+        , dim_index{dims[0], dims[1], dims[2], dims[3]} {}
+
+    /// ParamIterator Constructor
+    ParamIterator(cpu::Param<T>& in) noexcept
+        : ptr(in.get())
+        , dims(in.dims())
+        , stride(calcIteratorStrides(dims, in.strides()))
+        , dim_index{in.dims()[0], in.dims()[1], in.dims()[2], in.dims()[3]} {}
+
+    ParamIterator(cpu::CParam<typename std::remove_const<T>::type>& in) noexcept
+        : ptr(const_cast<pointer>(in.get()))
+        , dims(in.dims())
+        , stride(calcIteratorStrides(dims, in.strides()))
+        , dim_index{in.dims()[0], in.dims()[1], in.dims()[2], in.dims()[3]} {}
+
+    /// The equality operator
+    bool operator==(const ParamIterator& other) const noexcept {
+        return ptr == other.ptr;
+    }
+
+    /// The inequality operator
+    bool operator!=(const ParamIterator& other) const noexcept {
+        return ptr != other.ptr;
+    }
+
+    /// Advances the iterator, pre increment operator
+    ParamIterator& operator++() noexcept {
+        for (int i = 0; i < AF_MAX_DIMS; i++) {
+            dim_index[i]--;
+            ptr += stride[i];
+            if (dim_index[i]) { return *this; }
+            dim_index[i] = dims[i];
+        }
+        ptr = nullptr;
+        return *this;
+    }
+
+    /// Advances the iterator by count elements
+    ParamIterator& operator+=(std::size_t count) noexcept {
+        while (count-- > 0) { operator++(); }
+        return *this;
+    }
+
+    reference operator*() noexcept { return *ptr; }
+
+    const_reference operator*() const noexcept { return *ptr; }
+
+    const pointer operator->() const noexcept { return ptr; }
+
+    ParamIterator(const ParamIterator<T>& other) = default;
+    ParamIterator(ParamIterator<T>&& other)      = default;
+    ~ParamIterator() noexcept                    = default;
+    ParamIterator<T>& operator=(const ParamIterator<T>& other) noexcept =
+        default;
+    ParamIterator<T>& operator=(ParamIterator<T>&& other) noexcept = default;
+
+   private:
+    T* ptr;
+
+    // The dimension of the array
+    const af::dim4 dims;
+
+    // The iterator's stride
+    const af::dim4 stride;
+
+    // NOTE: This is not the true coordinate of the iteration. Each entry counts
+    // down the elements remaining along that dimension as the iterator advances.
+    std::array<dim_t, AF_MAX_DIMS> dim_index;
+
+    /// Calculates the iterator strides.
+    ///
+    /// These differ from the array's strides because each one is the step from
+    /// one past the last element of the previous dimension to the first
+    /// element of the next dimension.
+    static dim4 calcIteratorStrides(const dim4& dims,
+                                    const dim4& stride) noexcept {
+        return dim4(stride[0], stride[1] - (stride[0] * dims[0]),
+                    stride[2] - (stride[1] * dims[1]),
+                    stride[3] - (stride[2] * dims[2]));
+    }
+};
+
+template<typename T>
+ParamIterator<T> begin(Param<T>& param) {
+    return ParamIterator<T>(param);
+}
+
+template<typename T>
+ParamIterator<T> end(Param<T>& param) {
+    return ParamIterator<T>();
+}
+
+template<typename T>
+ParamIterator<const T> begin(CParam<T>& param) {
+    return ParamIterator<const T>(param);
+}
+
+template<typename T>
+ParamIterator<const T> end(CParam<T>& param) {
+    return ParamIterator<const T>();
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/TNJ/BinaryNode.hpp b/src/backend/cpu/TNJ/BinaryNode.hpp
deleted file mode 100644
index 8abd9a6b15..0000000000
--- a/src/backend/cpu/TNJ/BinaryNode.hpp
+++ /dev/null
@@ -1,77 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <vector>
-#include <math.hpp>
-
-namespace cpu
-{
-
-    template<typename To, typename Ti, af_op_t op>
-    struct BinOp
-    {
-        To eval(Ti lhs, Ti rhs)
-        {
-            return scalar<To>(0);
-        }
-    };
-
-namespace TNJ
-{
-
-    template<typename To, typename Ti, af_op_t op>
-    class BinaryNode  : public Node
-    {
-
-    protected:
-        Node_ptr m_lhs;
-        Node_ptr m_rhs;
-        BinOp<To, Ti, op> m_op;
-        To m_val;
-
-    public:
-        BinaryNode(Node_ptr lhs, Node_ptr rhs) :
-            Node(),
-            m_lhs(lhs),
-            m_rhs(rhs),
-            m_val(0)
-        {
-        }
-
-        void *calc(int x, int y, int z, int w)
-        {
-            m_val = m_op.eval(*(Ti *)m_lhs->calc(x, y, z, w),
-                              *(Ti *)m_rhs->calc(x, y, z, w));
-            return  (void *)&m_val;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_is_eval) return;
-
-            m_lhs->getInfo(len, buf_count, bytes);
-            m_rhs->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_is_eval = true;
-            return;
-        }
-
-        void reset()
-        {
-            m_is_eval = false;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cpu/TNJ/BufferNode.hpp b/src/backend/cpu/TNJ/BufferNode.hpp
deleted file mode 100644
index 4fc2988170..0000000000
--- a/src/backend/cpu/TNJ/BufferNode.hpp
+++ /dev/null
@@ -1,82 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <vector>
-#include "Node.hpp"
-
-namespace cpu
-{
-
-namespace TNJ
-{
-
-    using std::shared_ptr;
-    template<typename T>
-    class BufferNode : public Node
-    {
-
-    protected:
-        shared_ptr<T> ptr;
-        unsigned m_bytes;
-        dim_t off;
-        dim_t strides[4];
-        dim_t dims[4];
-
-    public:
-
-        BufferNode(shared_ptr<T> data,
-                   unsigned bytes,
-                   dim_t data_off,
-                   const dim_t *dms,
-                   const dim_t *strs) :
-            Node(),
-            ptr(data),
-            m_bytes(bytes),
-            off(data_off)
-        {
-            for (int i = 0; i < 4; i++) {
-                strides[i] = strs[i];
-                dims[i] = dms[i];
-            }
-        }
-
-        void *calc(int x, int y, int z, int w)
-        {
-            dim_t l_off = 0;
-            l_off += (w < (int)dims[3]) * w * strides[3];
-            l_off += (z < (int)dims[2]) * z * strides[2];
-            l_off += (y < (int)dims[1]) * y * strides[1];
-            l_off += (x < (int)dims[0]) * x;
-            return (void *)(ptr.get() + off + l_off);
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_is_eval) return;
-
-            len++;
-            buf_count++;
-            bytes += m_bytes;
-            m_is_eval = true;
-            return;
-        }
-
-        void reset()
-        {
-            m_is_eval = false;
-            off = 0;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cpu/TNJ/Node.hpp b/src/backend/cpu/TNJ/Node.hpp
deleted file mode 100644
index 09faefd9bd..0000000000
--- a/src/backend/cpu/TNJ/Node.hpp
+++ /dev/null
@@ -1,53 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <vector>
-#include "Node.hpp"
-#include <memory>
-
-namespace cpu
-{
-
-namespace TNJ
-{
-
-    class Node
-    {
-
-    protected:
-        bool m_is_eval;
-
-    public:
-        Node() : m_is_eval(false) {}
-
-        virtual void *calc(int x, int y, int z, int w)
-        {
-            m_is_eval = true;
-            return NULL;
-        }
-
-        virtual void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            len = 0;
-            buf_count = 0;
-            bytes = 0;
-        }
-
-        virtual void reset() { m_is_eval = false;}
-
-        virtual ~Node() {}
-    };
-
-    typedef std::shared_ptr<Node> Node_ptr;
-}
-
-}
diff --git a/src/backend/cpu/TNJ/ScalarNode.hpp b/src/backend/cpu/TNJ/ScalarNode.hpp
deleted file mode 100644
index ee2bfbcd2a..0000000000
--- a/src/backend/cpu/TNJ/ScalarNode.hpp
+++ /dev/null
@@ -1,49 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <vector>
-#include "Node.hpp"
-
-namespace cpu
-{
-
-namespace TNJ
-{
-
-    template<typename T>
-    class ScalarNode : public Node
-    {
-
-    protected:
-        T m_val;
-
-    public:
-        ScalarNode(T val) : Node(), m_val(val) {}
-
-        void *calc(int x, int y, int z, int w)
-        {
-            return (void *)(&m_val);
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_is_eval) return;
-            len++;
-            m_is_eval = true;
-            return;
-        }
-
-        void reset() { m_is_eval = false; }
-    };
-}
-
-}
diff --git a/src/backend/cpu/TNJ/UnaryNode.hpp b/src/backend/cpu/TNJ/UnaryNode.hpp
deleted file mode 100644
index a94bb0bf34..0000000000
--- a/src/backend/cpu/TNJ/UnaryNode.hpp
+++ /dev/null
@@ -1,74 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <vector>
-#include <math.hpp>
-#include "Node.hpp"
-
-namespace cpu
-{
-
-    template<typename To, typename Ti, af_op_t op>
-    struct UnOp
-    {
-        To eval(Ti in)
-        {
-            return scalar<To>(0);
-        }
-    };
-
-namespace TNJ
-{
-
-    template<typename To, typename Ti, af_op_t op>
-    class UnaryNode  : public Node
-    {
-
-    protected:
-        Node_ptr m_child;
-        UnOp <To, Ti, op> m_op;
-        To m_val;
-
-    public:
-        UnaryNode(Node_ptr in) :
-            Node(),
-            m_child(in),
-            m_val(0)
-        {
-        }
-
-        void *calc(int x, int y, int z, int w)
-        {
-            m_val = m_op.eval(*(Ti *)m_child->calc(x, y, z, w));
-            return (void *)(&m_val);
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_is_eval) return;
-
-            m_child->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_is_eval = true;
-            return;
-        }
-
-        void reset()
-        {
-            m_is_eval = false;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cpu/anisotropic_diffusion.cpp b/src/backend/cpu/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..7d38cbe5ab
--- /dev/null
+++ b/src/backend/cpu/anisotropic_diffusion.cpp
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <anisotropic_diffusion.hpp>
+#include <kernel/anisotropic_diffusion.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq) {
+    if (eq == AF_DIFFUSION_MCDE) {
+        getQueue().enqueue(kernel::anisotropicDiffusion<T, true>, inout, dt,
+                           mct, fftype);
+    } else {
+        getQueue().enqueue(kernel::anisotropicDiffusion<T, false>, inout, dt,
+                           mct, fftype);
+    }
+}
+
+#define INSTANTIATE(T)                                     \
+    template void anisotropicDiffusion<T>(                 \
+        Array<T> & inout, const float dt, const float mct, \
+        const af::fluxFunction fftype, const af::diffusionEq eq);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/anisotropic_diffusion.hpp b/src/backend/cpu/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..76d1f9ddcf
--- /dev/null
+++ b/src/backend/cpu/anisotropic_diffusion.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include "af/defines.h"
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+class Array;
+
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/approx.cpp b/src/backend/cpu/approx.cpp
index 050b3e1603..f65cd18961 100644
--- a/src/backend/cpu/approx.cpp
+++ b/src/backend/cpu/approx.cpp
@@ -7,333 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
 #include <approx.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Approx1
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename Ty, typename Tp, af_interp_type method>
-    struct approx1_op
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides, const af::dim4 &pstrides,
-                  const float offGrid, const dim_t idx)
-        {
-            return;
-        }
-    };
-
-    template<typename Ty, typename Tp>
-    struct approx1_op<Ty, Tp, AF_INTERP_NEAREST>
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides, const af::dim4 &pstrides,
-                  const float offGrid, const dim_t idx)
-        {
-            const dim_t pmId = idx;
-
-            const Tp x = pos[pmId];
-            bool gFlag = false;
-            if (x < 0 || idims[0] < x+1) {
-                gFlag = true;
-            }
-
-            for(dim_t idw = 0; idw < odims[3]; idw++) {
-                for(dim_t idz = 0; idz < odims[2]; idz++) {
-                    for(dim_t idy = 0; idy < odims[1]; idy++) {
-                        const dim_t omId = idw * ostrides[3] + idz * ostrides[2]
-                                            + idy * ostrides[1] + idx;
-                        if(gFlag) {
-                            out[omId] = scalar<Ty>(offGrid);
-                        } else {
-                            dim_t ioff = idw * istrides[3] + idz * istrides[2]
-                                          + idy * istrides[1];
-                            const dim_t iMem = round(x) + ioff;
-
-                            out[omId] = in[iMem];
-                        }
-                    }
-                }
-            }
-        }
-    };
-
-    template<typename Ty, typename Tp>
-    struct approx1_op<Ty, Tp, AF_INTERP_LINEAR>
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides, const af::dim4 &pstrides,
-                  const float offGrid, const dim_t idx)
-        {
-            const dim_t pmId = idx;
-
-            const Tp x = pos[pmId];
-            bool gFlag = false;
-            if (x < 0 || idims[0] < x+1) {
-                gFlag = true;
-            }
-
-            const Tp grid_x = floor(x);  // nearest grid
-            const Tp off_x = x - grid_x; // fractional offset
-
-            for(dim_t idw = 0; idw < odims[3]; idw++) {
-                for(dim_t idz = 0; idz < odims[2]; idz++) {
-                    for(dim_t idy = 0; idy < odims[1]; idy++) {
-                        const dim_t omId = idw * ostrides[3] + idz * ostrides[2]
-                                            + idy * ostrides[1] + idx;
-                        if(gFlag) {
-                            out[omId] = scalar<Ty>(offGrid);
-                        } else {
-                            dim_t ioff = idw * istrides[3] + idz * istrides[2] + idy * istrides[1] + grid_x;
-
-                            // Check if x and x + 1 are both valid indices
-                            bool cond = (x < idims[0] - 1);
-                            // Compute Left and Right Weighted Values
-                            Ty yl = ((Tp)1.0 - off_x) * in[ioff];
-                            Ty yr = cond ? (off_x) * in[ioff + 1] : scalar<Ty>(0);
-                            Ty yo = yl + yr;
-                            // Compute Weight used
-                            Tp wt = cond ? (Tp)1.0 : (Tp)(1.0 - off_x);
-                            // Write final value
-                            out[omId] = (yo / wt);
-                        }
-                    }
-                }
-            }
-        }
-    };
-
-    template<typename Ty, typename Tp, af_interp_type method>
-    void approx1_(Ty *out, const af::dim4 &odims, const dim_t oElems,
-            const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-            const Tp *pos, const af::dim4 &pdims,
-            const af::dim4 &ostrides, const af::dim4 &istrides, const af::dim4 &pstrides,
-            const float offGrid)
-    {
-        approx1_op<Ty, Tp, method> op;
-        for(dim_t x = 0; x < odims[0]; x++) {
-            op(out, odims, oElems, in, idims, iElems, pos, pdims,
-               ostrides, istrides, pstrides, offGrid, x);
-        }
-    }
-
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                       const af_interp_type method, const float offGrid)
-    {
-        af::dim4 odims = in.dims();
-        odims[0] = pos.dims()[0];
-
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                approx1_<Ty, Tp, AF_INTERP_NEAREST>
-                        (out.get(), out.dims(), out.elements(),
-                         in.get(), in.dims(), in.elements(), pos.get(), pos.dims(),
-                         out.strides(), in.strides(), pos.strides(), offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                approx1_<Ty, Tp, AF_INTERP_LINEAR>
-                        (out.get(), out.dims(), out.elements(),
-                         in.get(), in.dims(), in.elements(), pos.get(), pos.dims(),
-                         out.strides(), in.strides(), pos.strides(), offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
-    }
-
-    ///////////////////////////////////////////////////////////////////////////
-    // Approx2
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename Ty, typename Tp, af_interp_type method>
-    struct approx2_op
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims, const Tp *qos, const af::dim4 &qdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides,
-                  const af::dim4 &pstrides, const af::dim4 &qstrides,
-                  const float offGrid, const dim_t idx, const dim_t idy)
-        {
-            return;
-        }
-    };
-
-    template<typename Ty, typename Tp>
-    struct approx2_op<Ty, Tp, AF_INTERP_NEAREST>
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims, const Tp *qos, const af::dim4 &qdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides,
-                  const af::dim4 &pstrides, const af::dim4 &qstrides,
-                  const float offGrid, const dim_t idx, const dim_t idy)
-        {
-            const dim_t pmId = idy * pstrides[1] + idx;
-            const dim_t qmId = idy * qstrides[1] + idx;
-
-            bool gFlag = false;
-            const Tp x = pos[pmId], y = qos[qmId];
-            if (x < 0 || y < 0 || idims[0] < x+1 || idims[1] < y+1) {
-                gFlag = true;
-            }
-
-            for(dim_t idw = 0; idw < odims[3]; idw++) {
-                for(dim_t idz = 0; idz < odims[2]; idz++) {
-                    const dim_t omId = idw * ostrides[3] + idz * ostrides[2]
-                                        + idy * ostrides[1] + idx;
-                    if(gFlag) {
-                        out[omId] = scalar<Ty>(offGrid);
-                    } else {
-                        const dim_t grid_x = round(x), grid_y = round(y); // nearest grid
-                        const dim_t imId = idw * istrides[3] +
-                                              idz * istrides[2] +
-                                              grid_y * istrides[1] + grid_x;
-                        out[omId] = in[imId];
-                    }
-                }
-            }
-        }
-    };
-
-    template<typename Ty, typename Tp>
-    struct approx2_op<Ty, Tp, AF_INTERP_LINEAR>
-    {
-        void operator()(Ty *out, const af::dim4 &odims, const dim_t oElems,
-                  const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-                  const Tp *pos, const af::dim4 &pdims, const Tp *qos, const af::dim4 &qdims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides,
-                  const af::dim4 &pstrides, const af::dim4 &qstrides,
-                  const float offGrid, const dim_t idx, const dim_t idy)
-        {
-            const dim_t pmId = idy * pstrides[1] + idx;
-            const dim_t qmId = idy * qstrides[1] + idx;
-
-            bool gFlag = false;
-            const Tp x = pos[pmId], y = qos[qmId];
-            if (x < 0 || y < 0 || idims[0] < x+1 || idims[1] < y+1) {
-                gFlag = true;
-            }
-
-            const Tp grid_x = floor(x),   grid_y = floor(y);   // nearest grid
-            const Tp off_x  = x - grid_x, off_y  = y - grid_y; // fractional offset
-
-            // Check if pVal and pVal + 1 are both valid indices
-            bool condY = (y < idims[1] - 1);
-            bool condX = (x < idims[0] - 1);
-
-            // Compute wieghts used
-            Tp wt00 = ((Tp)1.0 - off_x) * ((Tp)1.0 - off_y);
-            Tp wt10 = (condY) ? ((Tp)1.0 - off_x) * (off_y) : 0;
-            Tp wt01 = (condX) ? (off_x) * ((Tp)1.0 - off_y) : 0;
-            Tp wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-            Tp wt = wt00 + wt10 + wt01 + wt11;
-            Ty zero = scalar<Ty>(0);
-
-            for(dim_t idw = 0; idw < odims[3]; idw++) {
-                for(dim_t idz = 0; idz < odims[2]; idz++) {
-                    const dim_t omId = idw * ostrides[3] + idz * ostrides[2]
-                                        + idy * ostrides[1] + idx;
-                    if(gFlag) {
-                        out[omId] = scalar<Ty>(offGrid);
-                    } else {
-                        dim_t ioff = idw * istrides[3] + idz * istrides[2]
-                                   + grid_y * istrides[1] + grid_x;
-
-                        // Compute Weighted Values
-                        Ty y00 =                    wt00 * in[ioff];
-                        Ty y10 = (condY) ?          wt10 * in[ioff + istrides[1]]     : zero;
-                        Ty y01 = (condX) ?          wt01 * in[ioff + 1]                   : zero;
-                        Ty y11 = (condX && condY) ? wt11 * in[ioff + istrides[1] + 1] : zero;
-
-                        Ty yo = y00 + y10 + y01 + y11;
-
-                        // Write Final Value
-                        out[omId] = (yo / wt);
-
-                    }
-                }
-            }
-        }
-    };
-
-    template<typename Ty, typename Tp, af_interp_type method>
-    void approx2_(Ty *out, const af::dim4 &odims, const dim_t oElems,
-            const Ty *in,  const af::dim4 &idims, const dim_t iElems,
-            const Tp *pos, const af::dim4 &pdims, const Tp *qos, const af::dim4 &qdims,
-            const af::dim4 &ostrides, const af::dim4 &istrides,
-            const af::dim4 &pstrides, const af::dim4 &qstrides,
-            const float offGrid)
-    {
-        approx2_op<Ty, Tp, method> op;
-        for(dim_t y = 0; y < odims[1]; y++) {
-            for(dim_t x = 0; x < odims[0]; x++) {
-                op(out, odims, oElems, in, idims, iElems, pos, pdims, qos, qdims,
-                    ostrides, istrides, pstrides, qstrides, offGrid, x, y);
-            }
-        }
+#include <kernel/approx.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            getQueue().enqueue(kernel::approx1<Ty, Tp, 1>, yo, yi, xo, xdim,
+                               xi_beg, xi_step, offGrid, method);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+            getQueue().enqueue(kernel::approx1<Ty, Tp, 2>, yo, yi, xo, xdim,
+                               xi_beg, xi_step, offGrid, method);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+            getQueue().enqueue(kernel::approx1<Ty, Tp, 3>, yo, yi, xo, xdim,
+                               xi_beg, xi_step, offGrid, method);
+            break;
+        default: break;
     }
+}
 
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                       const af_interp_type method, const float offGrid)
-    {
-        af::dim4 odims = in.dims();
-        odims[0] = pos0.dims()[0];
-        odims[1] = pos0.dims()[1];
-
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                approx2_<Ty, Tp, AF_INTERP_NEAREST>
-                        (out.get(), out.dims(), out.elements(),
-                         in.get(), in.dims(), in.elements(),
-                         pos0.get(), pos0.dims(), pos1.get(), pos1.dims(),
-                         out.strides(), in.strides(), pos0.strides(), pos1.strides(),
-                         offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                approx2_<Ty, Tp, AF_INTERP_LINEAR>
-                        (out.get(), out.dims(), out.elements(),
-                         in.get(), in.dims(), in.elements(),
-                         pos0.get(), pos0.dims(), pos1.get(), pos1.dims(),
-                         out.strides(), in.strides(), pos0.strides(), pos1.strides(),
-                         offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            getQueue().enqueue(kernel::approx2<Ty, Tp, 1>, zo, zi, xo, xdim,
+                               xi_beg, xi_step, yo, ydim, yi_beg, yi_step,
+                               offGrid, method);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+        case AF_INTERP_BILINEAR_COSINE:
+            getQueue().enqueue(kernel::approx2<Ty, Tp, 2>, zo, zi, xo, xdim,
+                               xi_beg, xi_step, yo, ydim, yi_beg, yi_step,
+                               offGrid, method);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+        case AF_INTERP_BICUBIC_SPLINE:
+            getQueue().enqueue(kernel::approx2<Ty, Tp, 3>, zo, zi, xo, xdim,
+                               xi_beg, xi_step, yo, ydim, yi_beg, yi_step,
+                               offGrid, method);
+            break;
+        default: break;
     }
-
-#define INSTANTIATE(Ty, Tp)                                                                     \
-    template Array<Ty> approx1<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos,              \
-                                        const af_interp_type method, const float offGrid);      \
-    template Array<Ty> approx2<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos0,             \
-                                        const Array<Tp> &pos1, const af_interp_type method,     \
-                                        const float offGrid);                                   \
-
-    INSTANTIATE(float  , float )
-    INSTANTIATE(double , double)
-    INSTANTIATE(cfloat , float )
-    INSTANTIATE(cdouble, double)
 }
+
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx1<Ty, Tp>(                                \
+        Array<Ty> & yo, const Array<Ty> &yi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const af_interp_type method, const float offGrid);        \
+    template void approx2<Ty, Tp>(                                \
+        Array<Ty> & zo, const Array<Ty> &zi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const Array<Tp> &yo, const int ydim, const Tp &yi_beg,    \
+        const Tp &yi_step, const af_interp_type method, const float offGrid);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/approx.hpp b/src/backend/cpu/approx.hpp
index f282f8f43c..893250a824 100644
--- a/src/backend/cpu/approx.hpp
+++ b/src/backend/cpu/approx.hpp
@@ -7,16 +7,21 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
+#include <af/defines.h>
 
-namespace cpu
-{
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                      const af_interp_type method, const float offGrid);
+namespace arrayfire {
+namespace cpu {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid);
 
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                      const af_interp_type method, const float offGrid);
-}
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/arith.hpp b/src/backend/cpu/arith.hpp
index fe19551356..131f9ae64a 100644
--- a/src/backend/cpu/arith.hpp
+++ b/src/backend/cpu/arith.hpp
@@ -7,78 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <optypes.hpp>
-#include <err_cpu.hpp>
-#include <cmath>
-#include <TNJ/BinaryNode.hpp>
-
-namespace cpu
-{
-
-#define ARITH_FN(OP, op)                        \
-    template<typename T>                        \
-    struct BinOp<T, T, OP>                      \
-    {                                           \
-        T eval(T lhs, T rhs)                    \
-        {                                       \
-            return lhs op rhs;                  \
-        }                                       \
-    };                                          \
+#pragma once
 
+#include <Array.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <af/dim4.hpp>
 
-ARITH_FN(af_add_t, +)
-ARITH_FN(af_sub_t, -)
-ARITH_FN(af_mul_t, *)
-ARITH_FN(af_div_t, /)
-
-#undef ARITH_FN
+namespace arrayfire {
+namespace cpu {
 
-template<typename T> static T __mod(T lhs, T rhs)
-{
-    T res = lhs % rhs;
-    return (res < 0) ? abs(rhs - res) : res;
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &&lhs, const Array<T> &&rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
 
-template<typename T> static T __rem(T lhs, T rhs) { return lhs % rhs; }
-
-template<> STATIC_ float __mod<float>(float lhs, float rhs) { return fmod(lhs, rhs); }
-template<> STATIC_ double __mod<double>(double lhs, double rhs) { return fmod(lhs, rhs); }
-template<> STATIC_ float __rem<float>(float lhs, float rhs) { return remainder(lhs, rhs); }
-template<> STATIC_ double __rem<double>(double lhs, double rhs) { return remainder(lhs, rhs); }
-
-
-#define NUMERIC_FN(OP, FN)                      \
-    template<typename T>                        \
-    struct BinOp<T, T, OP>                      \
-    {                                           \
-        T eval(T lhs, T rhs)                    \
-        {                                       \
-            return FN(lhs, rhs);                \
-        }                                       \
-    };                                          \
-
-NUMERIC_FN(af_max_t, max)
-NUMERIC_FN(af_min_t, min)
-NUMERIC_FN(af_mod_t, __mod)
-NUMERIC_FN(af_pow_t, pow)
-NUMERIC_FN(af_rem_t, __rem)
-NUMERIC_FN(af_atan2_t, atan2)
-NUMERIC_FN(af_hypot_t, hypot)
-
 template<typename T, af_op_t op>
-Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-{
-    TNJ::Node_ptr lhs_node = lhs.getNode();
-    TNJ::Node_ptr rhs_node = rhs.getNode();
-
-    TNJ::BinaryNode<T, T, op> *node = new TNJ::BinaryNode<T, T, op>(lhs_node, rhs_node);
-
-    return createNodeArray<T>(odims, TNJ::Node_ptr(
-                                  reinterpret_cast<TNJ::Node *>(node)));
+Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/assign.cpp b/src/backend/cpu/assign.cpp
index a8ac33ece0..32af00e487 100644
--- a/src/backend/cpu/assign.cpp
+++ b/src/backend/cpu/assign.cpp
@@ -7,122 +7,71 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
+#include <assign.hpp>
+#include <kernel/assign.hpp>
+
 #include <Array.hpp>
+#include <Param.hpp>
+#include <common/half.hpp>
 #include <handle.hpp>
-#include <assign.hpp>
-#include <err_cpu.hpp>
+#include <platform.hpp>
+#include <types.hpp>
 
-using af::dim4;
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/index.h>
+#include <af/seq.h>
 
-namespace cpu
-{
+#include <utility>
+#include <vector>
 
-static inline
-dim_t trimIndex(int idx, const dim_t &len)
-{
-    int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=(int)len) {
-        ret_val = len-offset-1;
-    }
-    return ret_val;
-}
+using af::dim4;
+using std::vector;
 
+namespace arrayfire {
+namespace cpu {
 template<typename T>
-void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs)
-{
-    bool isSeq[4];
-    std::vector<af_seq> seqs(4, af_span);
-    // create seq vector to retrieve output
-    // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
-        if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
-        }
+void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs) {
+    vector<bool> isSeq(4);
+    vector<af_seq> seqs(4, af_span);
+    // create seq vector to retrieve output dimensions, offsets & strides
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
         isSeq[x] = idxrs[x].isSeq;
     }
 
-    dim4 dDims = out.getDataDims();
-    dim4 pDims = out.dims();
-    // retrieve dimensions & strides for array
-    // to which rhs is being copied to
-    dim4 dst_offsets    = toOffset(seqs, dDims);
-    dim4 dst_strides    = toStride(seqs, dDims);
-    // retrieve rhs array dimenesions & strides
-    dim4 src_dims       = rhs.dims();
-    dim4 src_strides    = rhs.strides();
-
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
+    vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
     // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
+    for (dim_t x = 0; x < 4; ++x) {
         if (!isSeq[x]) {
             idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
+            idxArrs[x].eval();
         }
     }
 
-    // declare pointers to af_array index data
-    const uint* ptr0 = idxArrs[0].get();
-    const uint* ptr1 = idxArrs[1].get();
-    const uint* ptr2 = idxArrs[2].get();
-    const uint* ptr3 = idxArrs[3].get();
-
-    const T * src= rhs.get();
-    T * dst      = out.get();
-
-    for(dim_t l=0; l<src_dims[3]; ++l) {
-
-        dim_t src_loff = l*src_strides[3];
-
-        dim_t dst_lIdx = trimIndex(isSeq[3] ? l+dst_offsets[3] : ptr3[l], pDims[3]);
-        dim_t dst_loff = dst_lIdx * dst_strides[3];
-
-        for(dim_t k=0; k<src_dims[2]; ++k) {
-
-            dim_t src_koff = k*src_strides[2];
-
-            dim_t dst_kIdx = trimIndex(isSeq[2] ? k+dst_offsets[2] : ptr2[k], pDims[2]);
-            dim_t dst_koff = dst_kIdx * dst_strides[2];
-
-            for(dim_t j=0; j<src_dims[1]; ++j) {
-
-                dim_t src_joff = j*src_strides[1];
-
-                dim_t dst_jIdx = trimIndex(isSeq[1] ? j+dst_offsets[1] : ptr1[j], pDims[1]);
-                dim_t dst_joff = dst_jIdx * dst_strides[1];
-
-                for(dim_t i=0; i<src_dims[0]; ++i) {
-
-                    dim_t src_ioff = i*src_strides[0];
-                    dim_t src_idx  = src_ioff + src_joff + src_koff + src_loff;
-
-                    dim_t dst_iIdx = trimIndex(isSeq[0] ? i+dst_offsets[0] : ptr0[i], pDims[0]);
-                    dim_t dst_ioff = dst_iIdx * dst_strides[0];
-                    dim_t dst_idx  = dst_ioff + dst_joff + dst_koff + dst_loff;
-
-                    dst[dst_idx] = src[src_idx];
-                }
-            }
-        }
-    }
+    vector<CParam<uint>> idxParams(idxArrs.begin(), idxArrs.end());
+    getQueue().enqueue(kernel::assign<T>, out, out.getDataDims(), rhs,
+                       move(isSeq), move(seqs), move(idxParams));
 }
 
-#define INSTANTIATE(T) \
-    template void assign<T>(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
+#define INSTANTIATE(T)                                                \
+    template void assign<T>(Array<T> & out, const af_index_t idxrs[], \
+                            const Array<T>& rhs);
 
 INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
-
-}
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(uintl)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(int)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(arrayfire::common::half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/assign.hpp b/src/backend/cpu/assign.hpp
index 00ad56eb33..ccbdec5ddf 100644
--- a/src/backend/cpu/assign.hpp
+++ b/src/backend/cpu/assign.hpp
@@ -7,12 +7,15 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
+#include <af/index.h>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+class Array;
 
 template<typename T>
 void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/backend.hpp b/src/backend/cpu/backend.hpp
index 744fa8f290..ba9f9677d3 100644
--- a/src/backend/cpu/backend.hpp
+++ b/src/backend/cpu/backend.hpp
@@ -21,4 +21,4 @@
 
 #include "types.hpp"
 
-namespace detail = cpu;
+namespace detail = arrayfire::cpu;
diff --git a/src/backend/cpu/bilateral.cpp b/src/backend/cpu/bilateral.cpp
index d8ef7c61cb..19af80f3cb 100644
--- a/src/backend/cpu/bilateral.cpp
+++ b/src/backend/cpu/bilateral.cpp
@@ -7,105 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <bilateral.hpp>
-#include <cmath>
-#include <algorithm>
-
-using af::dim4;
-
-namespace cpu
-{
-
-static inline dim_t clamp(int a, dim_t mn, dim_t mx)
-{
-    return (a < (int)mn ? mn : (a > (int)mx ? mx : a));
-}
-
-static inline unsigned getIdx(const dim4 &strides,
-        int i, int j = 0, int k = 0, int l = 0)
-{
-    return (l * strides[3] +
-            k * strides[2] +
-            j * strides[1] +
-            i * strides[0]);
-}
-
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma)
-{
-    const dim4 dims     = in.dims();
-    const dim4 istrides = in.strides();
 
-    Array<outType> out = createEmptyArray<outType>(dims);
-    const dim4 ostrides = out.strides();
-
-    outType *outData    = out.get();
-    const inType * inData = in.get();
+#include <Array.hpp>
+#include <kernel/bilateral.hpp>
+#include <platform.hpp>
 
-    // clamp spatical and chromatic sigma's
-    float space_          = std::min(11.5f, std::max(s_sigma, 0.f));
-    float color_          = std::max(c_sigma, 0.f);
-    const dim_t radius = std::max((dim_t)(space_ * 1.5f), (dim_t)1);
-    const float svar      = space_*space_;
-    const float cvar      = color_*color_;
+#include <af/dim4.hpp>
 
-    for(dim_t b3=0; b3<dims[3]; ++b3) {
-        // b3 for loop handles following batch configurations
-        //  - gfor
-        //  - input based batch
-        //      - when input is 4d array for color images
-        for(dim_t b2=0; b2<dims[2]; ++b2) {
-            // b2 for loop handles following batch configurations
-            //  - channels
-            //  - input based batch
-            //      - when input is 3d array for grayscale images
-            for(dim_t j=0; j<dims[1]; ++j) {
-                // j steps along 2nd dimension
-                for(dim_t i=0; i<dims[0]; ++i) {
-                    // i steps along 1st dimension
-                    outType norm = 0.0;
-                    outType res  = 0.0;
-                    const outType center = (outType)inData[getIdx(istrides, i, j)];
-                    for(dim_t wj=-radius; wj<=radius; ++wj) {
-                        // clamps offsets
-                        dim_t tj = clamp(j+wj, 0, dims[1]-1);
-                        for(dim_t wi=-radius; wi<=radius; ++wi) {
-                            // clamps offsets
-                            dim_t ti = clamp(i+wi, 0, dims[0]-1);
-                            // proceed
-                            const outType val= (outType)inData[getIdx(istrides, ti, tj)];
-                            const outType gauss_space = (wi*wi+wj*wj)/(-2.0*svar);
-                            const outType gauss_range = ((center-val)*(center-val))/(-2.0*cvar);
-                            const outType weight = std::exp(gauss_space+gauss_range);
-                            norm += weight;
-                            res += val*weight;
-                        }
-                    } // filter loop ends here
+using af::dim4;
 
-                    outData[getIdx(ostrides, i, j)] = res/norm;
-                } //1st dimension loop ends here
-            } //2nd dimension loop ends here
-            outData += ostrides[2];
-            inData  += istrides[2];
-        }
-    }
+namespace arrayfire {
+namespace cpu {
 
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &sSigma,
+                         const float &cSigma) {
+    Array<outType> out = createEmptyArray<outType>(in.dims());
+    getQueue().enqueue(kernel::bilateral<outType, inType>, out, in, sSigma,
+                       cSigma);
     return out;
 }
 
-#define INSTANTIATE(inT, outT)\
-template Array<outT> bilateral<inT, outT,true >(const Array<inT> &in, const float &s_sigma, const float &c_sigma);\
-template Array<outT> bilateral<inT, outT,false>(const Array<inT> &in, const float &s_sigma, const float &c_sigma);
+#define INSTANTIATE(inT, outT)                                    \
+    template Array<outT> bilateral<inT, outT>(const Array<inT> &, \
+                                              const float &, const float &);
 
 INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
-
-}
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/bilateral.hpp b/src/backend/cpu/bilateral.hpp
index 51542a7c59..1cb6edb1e1 100644
--- a/src/backend/cpu/bilateral.hpp
+++ b/src/backend/cpu/bilateral.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
-
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma);
-
-}
+namespace arrayfire {
+namespace cpu {
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &spatialSigma,
+                         const float &chromaticSigma);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/binary.hpp b/src/backend/cpu/binary.hpp
new file mode 100644
index 0000000000..8d28501053
--- /dev/null
+++ b/src/backend/cpu/binary.hpp
@@ -0,0 +1,154 @@
+/*******************************************************
+ * Copyright (c) 2025, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <jit/Node.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+#include <types.hpp>
+#include <cmath>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename To, typename Ti, af_op_t op>
+struct BinOp;
+
+#define ARITH_FN(OP, op)                                                 \
+    template<typename T>                                                 \
+    struct BinOp<T, T, OP> {                                             \
+        void eval(jit::array<compute_t<T>> &out,                         \
+                  const jit::array<compute_t<T>> &lhs,                   \
+                  const jit::array<compute_t<T>> &rhs, int lim) const {  \
+            for (int i = 0; i < lim; i++) { out[i] = lhs[i] op rhs[i]; } \
+        }                                                                \
+    };
+
+ARITH_FN(af_add_t, +)
+ARITH_FN(af_sub_t, -)
+ARITH_FN(af_mul_t, *)
+ARITH_FN(af_div_t, /)
+
+#undef ARITH_FN
+
+#define LOGIC_FN(OP, op)                                                      \
+    template<typename T>                                                      \
+    struct BinOp<char, T, OP> {                                               \
+        void eval(jit::array<char> &out, const jit::array<compute_t<T>> &lhs, \
+                  const jit::array<compute_t<T>> &rhs, int lim) {             \
+            for (int i = 0; i < lim; i++) { out[i] = lhs[i] op rhs[i]; }      \
+        }                                                                     \
+    };
+
+LOGIC_FN(af_eq_t, ==)
+LOGIC_FN(af_neq_t, !=)
+LOGIC_FN(af_lt_t, <)
+LOGIC_FN(af_gt_t, >)
+LOGIC_FN(af_le_t, <=)
+LOGIC_FN(af_ge_t, >=)
+LOGIC_FN(af_and_t, &&)
+LOGIC_FN(af_or_t, ||)
+
+#undef LOGIC_FN
+
+#define LOGIC_CPLX_FN(T, OP, op)                                               \
+    template<>                                                                 \
+    struct BinOp<char, std::complex<T>, OP> {                                  \
+        typedef std::complex<T> Ti;                                            \
+        void eval(jit::array<char> &out, const jit::array<compute_t<Ti>> &lhs, \
+                  const jit::array<compute_t<Ti>> &rhs, int lim) {             \
+            for (int i = 0; i < lim; i++) {                                    \
+                T lhs_mag = std::abs(lhs[i]);                                  \
+                T rhs_mag = std::abs(rhs[i]);                                  \
+                out[i]    = lhs_mag op rhs_mag;                                \
+            }                                                                  \
+        }                                                                      \
+    };
+
+LOGIC_CPLX_FN(float, af_lt_t, <)
+LOGIC_CPLX_FN(float, af_le_t, <=)
+LOGIC_CPLX_FN(float, af_gt_t, >)
+LOGIC_CPLX_FN(float, af_ge_t, >=)
+LOGIC_CPLX_FN(float, af_and_t, &&)
+LOGIC_CPLX_FN(float, af_or_t, ||)
+
+LOGIC_CPLX_FN(double, af_lt_t, <)
+LOGIC_CPLX_FN(double, af_le_t, <=)
+LOGIC_CPLX_FN(double, af_gt_t, >)
+LOGIC_CPLX_FN(double, af_ge_t, >=)
+LOGIC_CPLX_FN(double, af_and_t, &&)
+LOGIC_CPLX_FN(double, af_or_t, ||)
+
+#undef LOGIC_CPLX_FN
+
+template<typename T>
+static T __mod(T lhs, T rhs) {
+    return lhs % rhs;  // Same as other backends
+}
+
+template<typename T>
+static T __rem(T lhs, T rhs) {
+    return lhs % rhs;
+}
+
+template<>
+inline float __mod<float>(float lhs, float rhs) {
+    return fmod(lhs, rhs);
+}
+template<>
+inline double __mod<double>(double lhs, double rhs) {
+    return fmod(lhs, rhs);
+}
+template<>
+inline float __rem<float>(float lhs, float rhs) {
+    return remainder(lhs, rhs);
+}
+template<>
+inline double __rem<double>(double lhs, double rhs) {
+    return remainder(lhs, rhs);
+}
+
+#define BITWISE_FN(OP, op)                                               \
+    template<typename T>                                                 \
+    struct BinOp<T, T, OP> {                                             \
+        void eval(jit::array<compute_t<T>> &out,                         \
+                  const jit::array<compute_t<T>> &lhs,                   \
+                  const jit::array<compute_t<T>> &rhs, int lim) {        \
+            for (int i = 0; i < lim; i++) { out[i] = lhs[i] op rhs[i]; } \
+        }                                                                \
+    };
+
+BITWISE_FN(af_bitor_t, |)
+BITWISE_FN(af_bitand_t, &)
+BITWISE_FN(af_bitxor_t, ^)
+BITWISE_FN(af_bitshiftl_t, <<)
+BITWISE_FN(af_bitshiftr_t, >>)
+
+#undef BITWISE_FN
+
+#define NUMERIC_FN(OP, FN)                                                 \
+    template<typename T>                                                   \
+    struct BinOp<T, T, OP> {                                               \
+        void eval(jit::array<compute_t<T>> &out,                           \
+                  const jit::array<compute_t<T>> &lhs,                     \
+                  const jit::array<compute_t<T>> &rhs, int lim) {          \
+            for (int i = 0; i < lim; i++) { out[i] = FN(lhs[i], rhs[i]); } \
+        }                                                                  \
+    };
+
+NUMERIC_FN(af_max_t, max)
+NUMERIC_FN(af_min_t, min)
+NUMERIC_FN(af_mod_t, __mod)
+NUMERIC_FN(af_pow_t, pow)
+NUMERIC_FN(af_rem_t, __rem)
+NUMERIC_FN(af_atan2_t, atan2)
+NUMERIC_FN(af_hypot_t, hypot)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/blas.cpp b/src/backend/cpu/blas.cpp
index 5c7fc11d68..60cd9be655 100644
--- a/src/backend/cpu/blas.cpp
+++ b/src/backend/cpu/blas.cpp
@@ -8,216 +8,398 @@
  ********************************************************/
 
 #include <blas.hpp>
-#include <af/dim4.hpp>
-#include <handle.hpp>
-#include <cassert>
-#include <err_cpu.hpp>
-#include <err_common.hpp>
-
-namespace cpu
-{
-
-    using std::add_const;
-    using std::add_pointer;
-    using std::enable_if;
-    using std::is_floating_point;
-    using std::remove_const;
-    using std::conditional;
-
-template<typename T, typename BT>
-using cptr_type     =   typename conditional<   is_complex<T>::value,
-                                                const void *,
-                                                const T*>::type;
-template<typename T, typename BT>
-using ptr_type     =    typename conditional<   is_complex<T>::value,
-                                                void *,
-                                                T*>::type;
-template<typename T, typename BT>
-using scale_type   =    typename conditional<   is_complex<T>::value,
-                                                const void *,
-                                                const T>::type;
-template<typename T, typename BT>
-using gemm_func_def = void (*)( const CBLAS_ORDER, const CBLAS_TRANSPOSE, const CBLAS_TRANSPOSE,
-                                const int, const int, const int,
-                                scale_type<T, BT>, cptr_type<T, BT>, const int,
-                                cptr_type<T, BT>, const int,
-                                scale_type<T, BT>, ptr_type<T, BT>, const int);
-
-template<typename T, typename BT>
-using gemv_func_def = void (*)( const CBLAS_ORDER, const CBLAS_TRANSPOSE,
-                                const int, const int,
-                                scale_type<T, BT>, cptr_type<T, BT>, const int,
-                                cptr_type<T, BT>, const int,
-                                scale_type<T, BT>, ptr_type<T, BT>, const int);
-
-#define BLAS_FUNC_DEF( FUNC )                                                      \
-template<typename T, typename BT> FUNC##_func_def<T, BT> FUNC##_func();
-
-
-#define BLAS_FUNC( FUNC, TYPE, BASE_TYPE, PREFIX )                                 \
-template<> FUNC##_func_def<TYPE, BASE_TYPE>     FUNC##_func<TYPE, BASE_TYPE>()     \
-{ return &cblas_##PREFIX##FUNC; }
-
-BLAS_FUNC_DEF( gemm )
-#ifdef OS_WIN
-BLAS_FUNC(gemm , float   , float  , s)
-BLAS_FUNC(gemm , double  , double , d)
-BLAS_FUNC(gemm , cfloat  , float  , c)
-BLAS_FUNC(gemm , cdouble , double , z)
-#else
-BLAS_FUNC(gemm , float   , float , s)
-BLAS_FUNC(gemm , double  , double, d)
-BLAS_FUNC(gemm , cfloat  ,   void, c)
-BLAS_FUNC(gemm , cdouble ,   void, z)
+
+#ifdef USE_MKL
+#include <mkl_cblas.h>
 #endif
 
-BLAS_FUNC_DEF(gemv)
-#ifdef OS_WIN
-BLAS_FUNC(gemv , float   ,  float , s)
-BLAS_FUNC(gemv , double  ,  double, d)
-BLAS_FUNC(gemv , cfloat  ,  float , c)
-BLAS_FUNC(gemv , cdouble ,  double, z)
-#else
-BLAS_FUNC(gemv , float   ,  float, s)
-BLAS_FUNC(gemv , double  , double, d)
-BLAS_FUNC(gemv , cfloat  ,   void, c)
-BLAS_FUNC(gemv , cdouble ,   void, z)
+#include <Array.hpp>
+#include <Param.hpp>
+#include <common/blas_headers.hpp>
+#include <common/cast.hpp>
+#include <common/complex.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <kernel/dot.hpp>
+#include <platform.hpp>
+#include <types.hpp>
+
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <algorithm>
+#include <type_traits>
+#include <vector>
+
+using af::dtype_traits;
+using arrayfire::common::cast;
+using arrayfire::common::half;
+using arrayfire::common::is_complex;
+using std::conditional;
+using std::vector;
+
+namespace arrayfire {
+namespace cpu {
+
+// clang-format off
+// Some implementations of BLAS require void* for complex pointers while others
+// use float*/double*
+//
+// Sample cgemm API
+// OpenBLAS
+// void cblas_cgemm(OPENBLAS_CONST enum CBLAS_ORDER Order,
+//                  OPENBLAS_CONST enum CBLAS_TRANSPOSE TransA,
+//                  OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB,
+//                  OPENBLAS_CONST blasint M,
+//                  OPENBLAS_CONST blasint N,
+//                  OPENBLAS_CONST blasint K,
+//                  OPENBLAS_CONST float *alpha, OPENBLAS_CONST float *A,
+//                  OPENBLAS_CONST blasint lda,
+//                  OPENBLAS_CONST float *B, OPENBLAS_CONST blasint ldb,
+//                  OPENBLAS_CONST float *beta,
+//                  float *C, OPENBLAS_CONST blasint ldc);
+//
+// MKL
+// void cblas_cgemm(const  CBLAS_LAYOUT Layout,
+//                  const CBLAS_TRANSPOSE TransA, const  CBLAS_TRANSPOSE TransB,
+//                  const MKL_INT M, const MKL_INT N, const MKL_INT K,
+//                  const void *alpha, const void *A, const MKL_INT lda,
+//                  const void *B, const MKL_INT ldb, const void *beta,
+//                  void *C, const MKL_INT ldc);
+// void cblas_cgemm_batch(const  CBLAS_LAYOUT Layout,
+//                        const CBLAS_TRANSPOSE* TransA,
+//                        const CBLAS_TRANSPOSE* TransB,
+//                        const MKL_INT* M, const MKL_INT* N, const MKL_INT* K,
+//                        const void *alpha, const void **A, const MKL_INT* lda,
+//                        const void **B, const MKL_INT* ldb, const void *beta,
+//                        void **C, const MKL_INT* ldc,
+//                        const MKL_INT group_count, const MKL_INT* group_size);
+//
+// ATLAS CBLAS
+// void cblas_cgemm(const enum CBLAS_ORDER Order,
+//                  const enum CBLAS_TRANSPOSE TransA,
+//                  const enum CBLAS_TRANSPOSE TransB,
+//                  const int M, const int N, const int K,
+//                  const void *alpha, const void *A, const int lda,
+//                  const void *B, const int ldb, const void *beta,
+//                  void *C, const int ldc);
+//
+// LAPACKE
+// void cblas_cgemm(const enum CBLAS_ORDER Order,
+//                  const enum CBLAS_TRANSPOSE TransA,
+//                  const enum CBLAS_TRANSPOSE TransB,
+//                  const int M, const int N, const int K,
+//                  const void *alpha, const void *A, const int lda,
+//                  const void *B, const int ldb, const void *beta,
+//                  void *C, const int ldc);
+// clang-format on
+
+template<typename T>
+struct blas_base {
+    using type =
+        typename conditional<is_complex<T>::value && cplx_void_ptr, void,
+                             typename dtype_traits<T>::base_type>::type;
+};
+
+template<typename T>
+using cptr_type =
+    typename conditional<is_complex<T>::value,
+                         const typename blas_base<T>::type *, const T *>::type;
+template<typename T>
+using ptr_type = typename conditional<is_complex<T>::value,
+                                      typename blas_base<T>::type *, T *>::type;
+
+template<typename T, bool batched = false>
+class scale_type {
+    const T val;
+
+   public:
+    explicit scale_type(const T *val_ptr) : val(*val_ptr) {}
+    using api_type = const typename conditional<
+        is_complex<T>::value, const typename blas_base<T>::type *,
+        const typename conditional<batched, const T *, const T>::type>::type;
+
+    api_type getScale() const {  // NOLINT(readability-const-return-type)
+        return val;
+    }
+};
+
+#define INSTANTIATE_BATCHED(TYPE)              \
+    template<>                                 \
+    typename scale_type<TYPE, true>::api_type  \
+    scale_type<TYPE, true>::getScale() const { \
+        return &val;                           \
+    }
+
+INSTANTIATE_BATCHED(float);   // NOLINT(readability-const-return-type)
+INSTANTIATE_BATCHED(double);  // NOLINT(readability-const-return-type)
+#undef INSTANTIATE_BATCHED
+
+#define INSTANTIATE_COMPLEX(TYPE, BATCHED)                                    \
+    template<>                                                                \
+    scale_type<TYPE, BATCHED>::api_type scale_type<TYPE, BATCHED>::getScale() \
+        const {                                                               \
+        return reinterpret_cast<const blas_base<TYPE>::type *const>(&val);    \
+    }
+
+INSTANTIATE_COMPLEX(cfloat, true);    // NOLINT(readability-const-return-type)
+INSTANTIATE_COMPLEX(cfloat, false);   // NOLINT(readability-const-return-type)
+INSTANTIATE_COMPLEX(cdouble, true);   // NOLINT(readability-const-return-type)
+INSTANTIATE_COMPLEX(cdouble, false);  // NOLINT(readability-const-return-type)
+#undef INSTANTIATE_COMPLEX
+
+template<typename T>
+using gemm_func_def = void (*)(const CBLAS_ORDER, const CBLAS_TRANSPOSE,
+                               const CBLAS_TRANSPOSE, const blasint,
+                               const blasint, const blasint,
+                               typename scale_type<T>::api_type, cptr_type<T>,
+                               const blasint, cptr_type<T>, const blasint,
+                               typename scale_type<T>::api_type, ptr_type<T>,
+                               const blasint);
+
+template<typename T>
+using gemv_func_def = void (*)(const CBLAS_ORDER, const CBLAS_TRANSPOSE,
+                               const blasint, const blasint,
+                               typename scale_type<T>::api_type, cptr_type<T>,
+                               const blasint, cptr_type<T>, const blasint,
+                               typename scale_type<T>::api_type, ptr_type<T>,
+                               const blasint);
+
+#ifdef USE_MKL
+template<typename T>
+using gemm_batch_func_def = void (*)(
+    const CBLAS_LAYOUT, const CBLAS_TRANSPOSE *, const CBLAS_TRANSPOSE *,
+    const MKL_INT *, const MKL_INT *, const MKL_INT *,
+    typename scale_type<T, true>::api_type, cptr_type<T> *, const MKL_INT *,
+    cptr_type<T> *, const MKL_INT *, typename scale_type<T, true>::api_type,
+    ptr_type<T> *, const MKL_INT *, const MKL_INT, const MKL_INT *);
 #endif
 
-template<typename T, typename BT, int value>
-typename enable_if<is_floating_point<T>::value, scale_type<T,BT>>::type
-getScale() { return T(value); }
+#define BLAS_FUNC_DEF(FUNC) \
+    template<typename T>    \
+    FUNC##_func_def<T> FUNC##_func();
 
-template<typename T, typename BT, int value>
-typename enable_if<is_complex<T>::value, scale_type<T,BT>>::type
-getScale()
-{
-    static T val(value);
-    return (const BT *)&val;
-}
+#define BLAS_FUNC(FUNC, TYPE, PREFIX)                        \
+    template<>                                               \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() {              \
+        return (FUNC##_func_def<TYPE>)&cblas_##PREFIX##FUNC; \
+    }
+
+BLAS_FUNC_DEF(gemm)
+BLAS_FUNC(gemm, float, s)
+BLAS_FUNC(gemm, double, d)
+BLAS_FUNC(gemm, cfloat, c)
+BLAS_FUNC(gemm, cdouble, z)
+
+BLAS_FUNC_DEF(gemv)
+BLAS_FUNC(gemv, float, s)
+BLAS_FUNC(gemv, double, d)
+BLAS_FUNC(gemv, cfloat, c)
+BLAS_FUNC(gemv, cdouble, z)
+
+#ifdef USE_MKL
+BLAS_FUNC_DEF(gemm_batch)
+BLAS_FUNC(gemm_batch, float, s)
+BLAS_FUNC(gemm_batch, double, d)
+BLAS_FUNC(gemm_batch, cfloat, c)
+BLAS_FUNC(gemm_batch, cdouble, z)
+#endif
 
 CBLAS_TRANSPOSE
-toCblasTranspose(af_mat_prop opt)
-{
+toCblasTranspose(af_mat_prop opt) {
     CBLAS_TRANSPOSE out = CblasNoTrans;
-    switch(opt) {
-        case AF_MAT_NONE        : out = CblasNoTrans;   break;
-        case AF_MAT_TRANS           : out = CblasTrans;     break;
-        case AF_MAT_CTRANS : out = CblasConjTrans; break;
-        default                     : AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    switch (opt) {
+        case AF_MAT_NONE: out = CblasNoTrans; break;
+        case AF_MAT_TRANS: out = CblasTrans; break;
+        case AF_MAT_CTRANS: out = CblasConjTrans; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
     }
     return out;
 }
 
-using namespace std;
-
+template<typename Ti, typename To>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta) {
+    const CBLAS_TRANSPOSE lOpts = toCblasTranspose(optLhs);
+    const CBLAS_TRANSPOSE rOpts = toCblasTranspose(optRhs);
+
+    const int aRowDim = (lOpts == CblasNoTrans) ? 0 : 1;
+    const int aColDim = (lOpts == CblasNoTrans) ? 1 : 0;
+    const int bColDim = (rOpts == CblasNoTrans) ? 1 : 0;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+    const int M       = lDims[aRowDim];
+    const int N       = rDims[bColDim];
+    const int K       = lDims[aColDim];
+    const dim4 oDims  = out.dims();
+
+    using BT  = typename blas_base<Ti>::type;
+    using CBT = const typename blas_base<Ti>::type;
+
+    auto alpha_ = scale_type<Ti, false>(alpha);
+    auto beta_  = scale_type<Ti, false>(beta);
+#ifdef USE_MKL
+    auto alpha_batched = scale_type<Ti, true>(alpha);
+    auto beta_batched  = scale_type<Ti, true>(beta);
+#endif
 
-#ifdef OS_WIN
-#define BT af::dtype_traits<T>::base_type
-#define REINTERPRET_CAST(PTR_TYPE, X) reinterpret_cast<PTR_TYPE>((X))
+    auto func = [=](Param<Ti> output, CParam<Ti> left, CParam<Ti> right) {
+        dim4 lStrides = left.strides();
+        dim4 rStrides = right.strides();
+        dim4 oStrides = output.strides();
+
+        if (output.dims().ndims() <= 2) {
+            if (right.dims()[bColDim] == 1) {
+                dim_t incr =
+                    (optRhs == AF_MAT_NONE) ? rStrides[0] : rStrides[1];
+                gemv_func<Ti>()(
+                    CblasColMajor, lOpts, lDims[0], lDims[1], alpha_.getScale(),
+                    reinterpret_cast<CBT *>(left.get()), lStrides[1],
+                    reinterpret_cast<CBT *>(right.get()), incr,
+                    beta_.getScale(), reinterpret_cast<BT *>(output.get()),
+                    oStrides[0]);
+            } else {
+                gemm_func<Ti>()(
+                    CblasColMajor, lOpts, rOpts, M, N, K, alpha_.getScale(),
+                    reinterpret_cast<CBT *>(left.get()), lStrides[1],
+                    reinterpret_cast<CBT *>(right.get()), rStrides[1],
+                    beta_.getScale(), reinterpret_cast<BT *>(output.get()),
+                    oStrides[1]);
+            }
+        } else {
+            int batchSize = static_cast<int>(oDims[2] * oDims[3]);
+
+            const bool is_l_d2_batched = oDims[2] == lDims[2];
+            const bool is_l_d3_batched = oDims[3] == lDims[3];
+            const bool is_r_d2_batched = oDims[2] == rDims[2];
+            const bool is_r_d3_batched = oDims[3] == rDims[3];
+
+            vector<CBT *> lptrs(batchSize);
+            vector<CBT *> rptrs(batchSize);
+            vector<BT *> optrs(batchSize);
+
+            for (int n = 0; n < batchSize; n++) {
+                ptrdiff_t w = n / oDims[2];
+                ptrdiff_t z = n - w * oDims[2];
+
+                ptrdiff_t loff = z * (is_l_d2_batched * lStrides[2]) +
+                                 w * (is_l_d3_batched * lStrides[3]);
+                ptrdiff_t roff = z * (is_r_d2_batched * rStrides[2]) +
+                                 w * (is_r_d3_batched * rStrides[3]);
+
+                lptrs[n] = reinterpret_cast<CBT *>(left.get() + loff);
+                rptrs[n] = reinterpret_cast<CBT *>(right.get() + roff);
+                optrs[n] = reinterpret_cast<BT *>(
+                    output.get() + z * oStrides[2] + w * oStrides[3]);
+            }
+
+#ifdef USE_MKL
+            // MKL can handle multiple groups of batches, but ArrayFire's
+            // use case needs only one group, so group_count is 1.
+            const MKL_INT lda = lStrides[1];
+            const MKL_INT ldb = rStrides[1];
+            const MKL_INT ldc = oStrides[1];
+
+            gemm_batch_func<Ti>()(CblasColMajor, &lOpts, &rOpts, &M, &N, &K,
+                                  alpha_batched.getScale(), lptrs.data(), &lda,
+                                  rptrs.data(), &ldb, beta_batched.getScale(),
+                                  optrs.data(), &ldc, 1, &batchSize);
 #else
-template<typename T> struct cblas_types;
-
-template<>
-struct cblas_types<float> {
-    typedef float base_type;
-};
-
-template<>
-struct cblas_types<cfloat> {
-    typedef void base_type;
-};
+            for (int n = 0; n < batchSize; n++) {
+                if (rDims[bColDim] == 1) {
+                    dim_t incr =
+                        (optRhs == AF_MAT_NONE) ? rStrides[0] : rStrides[1];
+                    gemv_func<Ti>()(CblasColMajor, lOpts, lDims[0], lDims[1],
+                                    alpha_.getScale(), lptrs[n], lStrides[1],
+                                    rptrs[n], incr, beta_.getScale(), optrs[n],
+                                    oStrides[0]);
+                } else {
+                    gemm_func<Ti>()(CblasColMajor, lOpts, rOpts, M, N, K,
+                                    alpha_.getScale(), lptrs[n], lStrides[1],
+                                    rptrs[n], rStrides[1], beta_.getScale(),
+                                    optrs[n], oStrides[1]);
+                }
+            }
+#endif
+        }
+    };
+    getQueue().enqueue(func, out, lhs, rhs);
+}
 
 template<>
-struct cblas_types<double> {
-    typedef double base_type;
-};
+void gemm<half>(Array<half> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+                const half *alpha, const Array<half> &lhs,
+                const Array<half> &rhs, const half *beta) {
+    Array<float> outArr    = createValueArray<float>(out.dims(), 0);
+    const auto float_alpha = static_cast<float>(*alpha);
+    const auto float_beta  = static_cast<float>(*beta);
+    gemm<float>(outArr, optLhs, optRhs, &float_alpha, cast<float>(lhs),
+                cast<float>(rhs), &float_beta);
+    copyArray(out, outArr);
+}
 
 template<>
-struct cblas_types<cdouble> {
-    typedef void base_type;
-};
-#define BT typename cblas_types<T>::base_type
-#define REINTERPRET_CAST(PTR_TYPE, X) (X)
-#endif
+void gemm<schar, float>(Array<float> &out, af_mat_prop optLhs,
+                        af_mat_prop optRhs, const float *alpha,
+                        const Array<schar> &lhs, const Array<schar> &rhs,
+                        const float *beta) {
+    TYPE_ERROR(3, af_dtype::s8);
+}
 
 template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    CBLAS_TRANSPOSE lOpts = toCblasTranspose(optLhs);
-    CBLAS_TRANSPOSE rOpts = toCblasTranspose(optRhs);
-
-    int aRowDim = (lOpts == CblasNoTrans) ? 0 : 1;
-    int aColDim = (lOpts == CblasNoTrans) ? 1 : 0;
-    int bColDim = (rOpts == CblasNoTrans) ? 1 : 0;
-
-    dim4 lDims = lhs.dims();
-    dim4 rDims = rhs.dims();
-    int M = lDims[aRowDim];
-    int N = rDims[bColDim];
-    int K = lDims[aColDim];
-
-    //FIXME: Leaks on errors.
-    Array<T> out = createEmptyArray<T>(af::dim4(M, N, 1, 1));
-    auto alpha = getScale<T, BT, 1>();
-    auto beta  = getScale<T, BT, 0>();
-
-    dim4 lStrides = lhs.strides();
-    dim4 rStrides = rhs.strides();
-    if(rDims[bColDim] == 1) {
-        N = lDims[aColDim];
-        gemv_func<T, BT>()(
-            CblasColMajor, lOpts,
-            lDims[0], lDims[1],
-            alpha, REINTERPRET_CAST(const BT*, lhs.get()), lStrides[1],
-            REINTERPRET_CAST(const BT*, rhs.get()), rStrides[0],
-            beta, REINTERPRET_CAST(BT*, out.get()), 1);
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs) {
+    Array<T> out = createEmptyArray<T>(af::dim4(1));
+    if (optLhs == AF_MAT_CONJ && optRhs == AF_MAT_CONJ) {
+        getQueue().enqueue(kernel::dot<T, false, true>, out, lhs, rhs, optLhs,
+                           optRhs);
+    } else if (optLhs == AF_MAT_CONJ && optRhs == AF_MAT_NONE) {
+        getQueue().enqueue(kernel::dot<T, true, false>, out, lhs, rhs, optLhs,
+                           optRhs);
+    } else if (optLhs == AF_MAT_NONE && optRhs == AF_MAT_CONJ) {
+        getQueue().enqueue(kernel::dot<T, true, false>, out, rhs, lhs, optRhs,
+                           optLhs);
     } else {
-        gemm_func<T, BT>()(
-            CblasColMajor, lOpts, rOpts,
-            M, N, K,
-            alpha, REINTERPRET_CAST(const BT*, lhs.get()), lStrides[1],
-            REINTERPRET_CAST(const BT*, rhs.get()), rStrides[1],
-            beta, REINTERPRET_CAST(BT*, out.get()), out.dims()[0]);
+        getQueue().enqueue(kernel::dot<T, false, false>, out, lhs, rhs, optLhs,
+                           optRhs);
     }
-
     return out;
 }
 
-template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    int N = lhs.dims()[0];
-
-    T out = 0;
-    const T *pL = lhs.get();
-    const T *pR = rhs.get();
-
-    for(int i = 0; i < N; i++)
-        out += pL[i] * pR[i];
-
-    return createValueArray(af::dim4(1), out);
+template<>
+Array<half> dot<half>(const Array<half> &lhs, const Array<half> &rhs,
+                      af_mat_prop optLhs, af_mat_prop optRhs) {
+    Array<float> out = dot(cast<float>(lhs), cast<float>(rhs), optLhs, optRhs);
+    return cast<half>(out);
 }
 
 #undef BT
 #undef REINTEPRET_CAST
 
-#define INSTANTIATE_BLAS(TYPE)                                                          \
-    template Array<TYPE> matmul<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs,  \
-                                      af_mat_prop optLhs, af_mat_prop optRhs);
-
-INSTANTIATE_BLAS(float)
-INSTANTIATE_BLAS(cfloat)
-INSTANTIATE_BLAS(double)
-INSTANTIATE_BLAS(cdouble)
-
-#define INSTANTIATE_DOT(TYPE)                                                       \
-    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
-                                   af_mat_prop optLhs, af_mat_prop optRhs);
-
-INSTANTIATE_DOT(float)
-INSTANTIATE_DOT(double)
-
-}
+#define INSTANTIATE_GEMM(TYPE)                                               \
+    template void gemm<TYPE>(Array<TYPE> & out, af_mat_prop optLhs,          \
+                             af_mat_prop optRhs, const TYPE *alpha,          \
+                             const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
+                             const TYPE *beta)
+
+INSTANTIATE_GEMM(float);
+INSTANTIATE_GEMM(cfloat);
+INSTANTIATE_GEMM(double);
+INSTANTIATE_GEMM(cdouble);
+
+#define INSTANTIATE_DOT(TYPE)                                                  \
+    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs,                     \
+                                   const Array<TYPE> &rhs, af_mat_prop optLhs, \
+                                   af_mat_prop optRhs)
+
+INSTANTIATE_DOT(float);
+INSTANTIATE_DOT(double);
+INSTANTIATE_DOT(cfloat);
+INSTANTIATE_DOT(cdouble);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/blas.hpp b/src/backend/cpu/blas.hpp
index 4d89807412..c16916dafb 100644
--- a/src/backend/cpu/blas.hpp
+++ b/src/backend/cpu/blas.hpp
@@ -7,30 +7,33 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/blas.h>
 #include <Array.hpp>
+#include <af/defines.h>
 
-#ifdef __APPLE__
-#include <Accelerate/Accelerate.h>
-#else
-#ifdef USE_MKL
-#include <mkl_cblas.h>
-#else
-extern "C" {
-#include <cblas.h>
-}
-#endif
-#endif
+namespace arrayfire {
+namespace cpu {
 
-namespace cpu
-{
+template<typename Ti, typename To = Ti>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta);
 
 template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs);
+Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+                af_mat_prop optRhs) {
+    int Mdim     = optLhs == AF_MAT_NONE ? 0 : 1;
+    int Ndim     = optRhs == AF_MAT_NONE ? 1 : 0;
+    Array<T> res = createEmptyArray<T>(
+        dim4(lhs.dims()[Mdim], rhs.dims()[Ndim], lhs.dims()[2], lhs.dims()[3]));
+    static const T alpha = T(1.0);
+    static const T beta  = T(0.0);
+    gemm(res, optLhs, optRhs, &alpha, lhs, rhs, &beta);
+    return res;
+}
+
 template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs);
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/canny.cpp b/src/backend/cpu/canny.cpp
new file mode 100644
index 0000000000..17f242c0fc
--- /dev/null
+++ b/src/backend/cpu/canny.cpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <canny.hpp>
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <kernel/canny.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+namespace arrayfire {
+namespace cpu {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy) {
+    Array<float> out = createValueArray<float>(mag.dims(), 0);
+
+    getQueue().enqueue(kernel::nonMaxSuppression<float>, out, mag, gx, gy);
+
+    return out;
+}
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak) {
+    Array<char> out = createValueArray<char>(strong.dims(), 0);
+
+    getQueue().enqueue(kernel::edgeTrackingHysteresis<char>, out, strong, weak);
+
+    return out;
+}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/canny.hpp b/src/backend/cpu/canny.hpp
new file mode 100644
index 0000000000..7f21d89fe5
--- /dev/null
+++ b/src/backend/cpu/canny.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy);
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/cast.hpp b/src/backend/cpu/cast.hpp
index bb59d5edd6..d51b7838b8 100644
--- a/src/backend/cpu/cast.hpp
+++ b/src/backend/cpu/cast.hpp
@@ -8,90 +8,151 @@
  ********************************************************/
 
 #pragma once
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <complex>
+#include <Array.hpp>
 #include <err_cpu.hpp>
+#include <jit/UnaryNode.hpp>
 #include <math.hpp>
 #include <optypes.hpp>
 #include <types.hpp>
-#include <TNJ/UnaryNode.hpp>
-#include <Array.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename To, typename Ti>
-struct UnOp<To, Ti, af_cast_t>
-{
-    To eval(Ti in)
-    {
-        return To(in);
+struct UnOp<To, Ti, af_cast_t> {
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(in[i]); }
     }
 };
 
+/// NOTE(umar): The following specializations have multiple eval functions
+/// because the f16 data type needs to be converted to and from the compute
+/// type. There are specializations for both real and complex numbers.
+///
+/// TODO(umar): make a macro to reduce repeated code
+
 template<typename To>
-struct UnOp<To, std::complex<float>, af_cast_t>
-{
-    To eval(std::complex<float> in)
-    {
-        return To(std::abs(in));
+struct UnOp<To, arrayfire::common::half, af_cast_t> {
+    typedef arrayfire::common::half Ti;
+
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) {
+            float val = static_cast<float>(in[i]);
+            out[i]    = To(val);
+        }
+    }
+
+    void eval(jit::array<To> &out, const jit::array<float> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(in[i]); }
     }
 };
 
-template<typename To>
-struct UnOp<To, std::complex<double>, af_cast_t>
-{
-    To eval(std::complex<double> in)
-    {
-        return To(std::abs(in));
+template<typename Ti>
+struct UnOp<arrayfire::common::half, Ti, af_cast_t> {
+    typedef arrayfire::common::half To;
+
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) {
+            float val = static_cast<float>(in[i]);
+            out[i]    = To(val);
+        }
+    }
+
+    void eval(jit::array<float> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = float(in[i]); }
     }
 };
 
+template<>
+struct UnOp<arrayfire::common::half, std::complex<float>, af_cast_t> {
+    typedef arrayfire::common::half To;
+    typedef std::complex<float> Ti;
 
-#define CAST_B8(T)                              \
-    template<>                                  \
-    struct UnOp<char, T, af_cast_t>             \
-    {                                           \
-        char eval(T in)                         \
-        {                                       \
-            return char(in != 0);               \
-        }                                       \
-    };                                          \
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) {
+            float val = std::abs(in[i]);
+            out[i]    = To(val);
+        }
+    }
 
-CAST_B8(float)
-CAST_B8(double)
-CAST_B8(int)
-CAST_B8(uchar)
-CAST_B8(char)
+    void eval(jit::array<float> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = std::abs(in[i]); }
+    }
+};
 
-template<typename To, typename Ti>
-struct CastWrapper
-{
-    Array<To> operator()(const Array<Ti> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<To, Ti, af_cast_t> *node = new TNJ::UnaryNode<To, Ti, af_cast_t>(in_node);
-        return createNodeArray<To>(in.dims(), TNJ::Node_ptr(
-                                       reinterpret_cast<TNJ::Node *>(node)));
+template<>
+struct UnOp<arrayfire::common::half, std::complex<double>, af_cast_t> {
+    typedef arrayfire::common::half To;
+    typedef std::complex<double> Ti;
+
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) {
+            float val = std::abs(in[i]);
+            out[i]    = To(val);
+        }
+    }
+
+    void eval(jit::array<float> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = std::abs(in[i]); }
     }
 };
 
-template<typename T>
-struct CastWrapper<T, T>
-{
-    Array<T> operator()(const Array<T> &in)
-    {
-        return in;
+template<typename To>
+struct UnOp<To, std::complex<float>, af_cast_t> {
+    typedef std::complex<float> Ti;
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(std::abs(in[i])); }
     }
 };
 
-template<typename To, typename Ti>
-Array<To> cast(const Array<Ti> &in)
-{
-    CastWrapper<To, Ti> cast_op;
-    return cast_op(in);
-}
+template<typename To>
+struct UnOp<To, std::complex<double>, af_cast_t> {
+    typedef std::complex<double> Ti;
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(std::abs(in[i])); }
+    }
+};
+
+// DO NOT REMOVE THE TWO SPECIALIZATIONS BELOW
+// These specializations are required because we partially specialize when
+// Ti = std::complex<T>. The partial specializations above expect the output
+// to be real, so they compute To(std::abs(v)) instead of To(v), which
+// results in incorrect values when To is complex.
+
+template<>
+struct UnOp<std::complex<float>, std::complex<double>, af_cast_t> {
+    typedef std::complex<double> Ti;
+    typedef std::complex<float> To;
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(in[i]); }
+    }
+};
+
+template<>
+struct UnOp<std::complex<double>, std::complex<float>, af_cast_t> {
+    typedef std::complex<float> Ti;
+    typedef std::complex<double> To;
+    void eval(jit::array<To> &out, const jit::array<Ti> &in, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(in[i]); }
+    }
+};
+
+#define CAST_B8(T)                                                           \
+    template<>                                                               \
+    struct UnOp<char, T, af_cast_t> {                                        \
+        void eval(jit::array<char> &out, const jit::array<T> &in, int lim) { \
+            for (int i = 0; i < lim; i++) { out[i] = char(in[i] != 0); }     \
+        }                                                                    \
+    };
+
+CAST_B8(float)
+CAST_B8(double)
+CAST_B8(int)
+CAST_B8(schar)
+CAST_B8(uchar)
+CAST_B8(char)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/cholesky.cpp b/src/backend/cpu/cholesky.cpp
index 57beaa4146..cd478ad75e 100644
--- a/src/backend/cpu/cholesky.cpp
+++ b/src/backend/cpu/cholesky.cpp
@@ -8,108 +8,114 @@
  ********************************************************/
 
 #include <cholesky.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_CPU_LINEAR_ALGEBRA)
+#include <common/err_common.hpp>
 
-#include <af/dim4.hpp>
-#include <handle.hpp>
-#include <iostream>
-#include <cassert>
-#include <err_cpu.hpp>
-#include <triangle.hpp>
+#if defined(WITH_LINEAR_ALGEBRA)
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <copy.hpp>
+#include <types.hpp>
 
 #include <lapack_helper.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <triangle.hpp>
+#include <af/dim4.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-using potrf_func_def = int (*)(ORDER_TYPE, char,
-                               int,
-                               T*, int);
-
-#define CH_FUNC_DEF( FUNC )                                     \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
+using potrf_func_def = int (*)(ORDER_TYPE, char, int, T *, int);
 
+#define CH_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
 
-#define CH_FUNC( FUNC, TYPE, PREFIX )                           \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()        \
-{ return & LAPACK_NAME(PREFIX##FUNC); }
+#define CH_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
 
-CH_FUNC_DEF( potrf )
-CH_FUNC(potrf , float  , s)
-CH_FUNC(potrf , double , d)
-CH_FUNC(potrf , cfloat , c)
-CH_FUNC(potrf , cdouble, z)
+CH_FUNC_DEF(potrf)
+CH_FUNC(potrf, float, s)
+CH_FUNC(potrf, double, d)
+CH_FUNC(potrf, cfloat, c)
+CH_FUNC(potrf, cdouble, z)
 
 template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
     Array<T> out = copyArray<T>(in);
-    *info = cholesky_inplace(out, is_upper);
+    *info        = cholesky_inplace(out, is_upper);
 
-    if (is_upper) triangle<T, true , false>(out, out);
-    else          triangle<T, false, false>(out, out);
+    triangle<T>(out, out, is_upper, false);
 
     return out;
 }
 
 template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
     dim4 iDims = in.dims();
-    int N = iDims[0];
+    int N      = iDims[0];
 
     char uplo = 'L';
-    if(is_upper)
-        uplo = 'U';
+    if (is_upper) { uplo = 'U'; }
 
-    int info = potrf_func<T>()(AF_LAPACK_COL_MAJOR, uplo,
-                               N, in.get(), in.strides()[1]);
+    int info  = 0;
+    auto func = [&](int *info, Param<T> in) {
+        *info = potrf_func<T>()(AF_LAPACK_COL_MAJOR, uplo, N, in.get(),
+                                in.strides(1));
+    };
+
+    getQueue().enqueue(func, &info, in);
+    // Wait for the queued task to finish so that info has been written.
+    getQueue().sync();
 
     return info;
 }
 
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);   \
-
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
 
 INSTANTIATE_CH(float)
 INSTANTIATE_CH(cfloat)
 INSTANTIATE_CH(double)
 INSTANTIATE_CH(cdouble)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);   \
-
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
 
 INSTANTIATE_CH(float)
 INSTANTIATE_CH(cfloat)
 INSTANTIATE_CH(double)
 INSTANTIATE_CH(cdouble)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/cpu/cholesky.hpp b/src/backend/cpu/cholesky.hpp
index 322a789666..5b1247be4d 100644
--- a/src/backend/cpu/cholesky.hpp
+++ b/src/backend/cpu/cholesky.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
 
-    template<typename T>
-    int cholesky_inplace(Array<T> &in, const bool is_upper);
-}
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/complex.hpp b/src/backend/cpu/complex.hpp
index 367b0a6c5a..44dc574377 100644
--- a/src/backend/cpu/complex.hpp
+++ b/src/backend/cpu/complex.hpp
@@ -7,93 +7,81 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <optypes.hpp>
 #include <err_cpu.hpp>
-#include <TNJ/BinaryNode.hpp>
-#include <TNJ/UnaryNode.hpp>
-
-namespace cpu
-{
-
-    template<typename To, typename Ti>
-    struct BinOp<To, Ti, af_cplx2_t>
-    {
-        To eval(Ti lhs, Ti rhs)
-        {
-            return To(lhs, rhs);
-        }
-    };
-
-    template<typename To, typename Ti>
-    Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs, const af::dim4 &odims)
-    {
-        TNJ::Node_ptr lhs_node = lhs.getNode();
-        TNJ::Node_ptr rhs_node = rhs.getNode();
+#include <jit/BinaryNode.hpp>
+#include <jit/UnaryNode.hpp>
+#include <optypes.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
-        TNJ::BinaryNode<To, Ti, af_cplx2_t> *node =
-            new TNJ::BinaryNode<To, Ti, af_cplx2_t>(lhs_node, rhs_node);
+namespace arrayfire {
+namespace cpu {
 
-        return createNodeArray<To>(odims, TNJ::Node_ptr(
-                                       reinterpret_cast<TNJ::Node *>(node)));
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_cplx2_t> {
+    void eval(jit::array<To> &out, const jit::array<Ti> &lhs,
+              const jit::array<Ti> &rhs, int lim) {
+        for (int i = 0; i < lim; i++) { out[i] = To(lhs[i], rhs[i]); }
     }
+};
 
-#define CPLX_UNARY_FN(op)                       \
-    template<typename To, typename Ti>          \
-    struct UnOp<To, Ti, af_##op##_t>            \
-    {                                           \
-        To eval(Ti in)                          \
-        {                                       \
-            return std::op(in);                 \
-        }                                       \
-    };                                          \
-
-    CPLX_UNARY_FN(real)
-    CPLX_UNARY_FN(imag)
-    CPLX_UNARY_FN(conj)
-    CPLX_UNARY_FN(abs)
-
-    template<typename To, typename Ti>
-    Array<To> real(const Array<Ti> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<To, Ti, af_real_t> *node = new TNJ::UnaryNode<To, Ti, af_real_t>(in_node);
-
-        return createNodeArray<To>(in.dims(),
-                                   TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+template<typename To, typename Ti>
+Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs,
+               const af::dim4 &odims) {
+    common::Node_ptr lhs_node = lhs.getNode();
+    common::Node_ptr rhs_node = rhs.getNode();
 
-    template<typename To, typename Ti>
-    Array<To> imag(const Array<Ti> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<To, Ti, af_imag_t> *node = new TNJ::UnaryNode<To, Ti, af_imag_t>(in_node);
+    auto node = std::make_shared<jit::BinaryNode<To, Ti, af_cplx2_t>>(
+        lhs_node, rhs_node);
 
-        return createNodeArray<To>(in.dims(),
-                                   TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+    return createNodeArray<To>(odims, common::Node_ptr(node));
+}
 
-    template<typename To, typename Ti>
-    Array<To> abs(const Array<Ti> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<To, Ti, af_abs_t> *node = new TNJ::UnaryNode<To, Ti, af_abs_t>(in_node);
+#define CPLX_UNARY_FN(op)                                              \
+    template<typename To, typename Ti>                                 \
+    struct UnOp<To, Ti, af_##op##_t> {                                 \
+        void eval(jit::array<compute_t<To>> &out,                      \
+                  const jit::array<compute_t<Ti>> &in, int lim) {      \
+            for (int i = 0; i < lim; i++) { out[i] = std::op(in[i]); } \
+        }                                                              \
+    };
 
-        return createNodeArray<To>(in.dims(),
-                                   TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+CPLX_UNARY_FN(real)
+CPLX_UNARY_FN(imag)
+CPLX_UNARY_FN(conj)
+CPLX_UNARY_FN(abs)
 
-    template<typename T>
-    Array<T> conj(const Array<T> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<T, T, af_conj_t> *node = new TNJ::UnaryNode<T, T, af_conj_t>(in_node);
+template<typename To, typename Ti>
+Array<To> real(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    auto node = std::make_shared<jit::UnaryNode<To, Ti, af_real_t>>(in_node);
 
-        return createNodeArray<T>(in.dims(),
-                                  TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), std::move(node));
+}
+
+template<typename To, typename Ti>
+Array<To> imag(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    auto node = std::make_shared<jit::UnaryNode<To, Ti, af_imag_t>>(in_node);
+
+    return createNodeArray<To>(in.dims(), std::move(node));
+}
+
+template<typename To, typename Ti>
+Array<To> abs(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    auto node = std::make_shared<jit::UnaryNode<To, Ti, af_abs_t>>(in_node);
+
+    return createNodeArray<To>(in.dims(), std::move(node));
+}
+
+template<typename T>
+Array<T> conj(const Array<T> &in) {
+    common::Node_ptr in_node = in.getNode();
+    auto node = std::make_shared<jit::UnaryNode<T, T, af_conj_t>>(in_node);
+
+    return createNodeArray<T>(in.dims(), std::move(node));
 }
+}  // namespace cpu
+}  // namespace arrayfire
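Review note on the `complex.hpp` change above: the operator structs no longer map one scalar to one scalar. Each `eval` now fills a whole work buffer and stops at `lim`. The sketch below shows that pattern on its own. It assumes a stand-in buffer type: `Block` and `cplxEval` are hypothetical names for illustration, since the real `jit::array` definition does not appear in this diff.

```cpp
#include <array>
#include <cassert>
#include <complex>

// Hypothetical stand-in for a fixed-size JIT work buffer; the real
// jit::array type is not shown in this diff.
template<typename T>
using Block = std::array<T, 4>;

// Build complex values from separate real and imaginary blocks, writing
// only the first `lim` entries and leaving the rest of `out` untouched.
template<typename To, typename Ti>
void cplxEval(Block<To> &out, const Block<Ti> &lhs, const Block<Ti> &rhs,
              int lim) {
    assert(lim >= 0 && lim <= static_cast<int>(out.size()));
    for (int i = 0; i < lim; i++) { out[i] = To(lhs[i], rhs[i]); }
}
```

Passing `lim` lets the caller handle a final, partially filled block without a separate code path.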
diff --git a/src/backend/cpu/convolve.cpp b/src/backend/cpu/convolve.cpp
index 8ad8a2314f..2fd0e3bce3 100644
--- a/src/backend/cpu/convolve.cpp
+++ b/src/backend/cpu/convolve.cpp
@@ -7,317 +7,255 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <arith.hpp>
+#include <blas.hpp>
+#include <common/defines.hpp>
+#include <common/half.hpp>
+#include <common/indexing_helpers.hpp>
+#include <common/moddims.hpp>
 #include <convolve.hpp>
-#include <err_cpu.hpp>
-#include <math.hpp>
-
-using af::dim4;
-
-namespace cpu
-{
-
-template<typename T, typename accT, bool expand>
-void one2one_1d(T *optr, T const *iptr, accT const *fptr, dim4 const &oDims,
-                dim4 const &sDims, dim4 const &fDims, dim4 const &sStrides)
-{
-    dim_t start = (expand ? 0 : fDims[0]/2);
-    dim_t end   = (expand ? oDims[0] : start + sDims[0]);
-    for(dim_t i=start; i<end; ++i) {
-        accT accum = 0.0;
-        for(dim_t f=0; f<fDims[0]; ++f) {
-            dim_t iIdx = i-f;
-            T s_val = ((iIdx>=0 &&iIdx<sDims[0])? iptr[iIdx*sStrides[0]] : T(0));
-            accum += accT(s_val * fptr[f]);
-        }
-        optr[i-start] = T(accum);
-    }
-}
-
-template<typename T, typename accT, bool expand>
-void one2one_2d(T *optr, T const *iptr, accT const *fptr, dim4 const &oDims,
-                dim4 const &sDims, dim4 const &fDims, dim4 const &oStrides,
-                dim4 const &sStrides, dim4 const &fStrides)
-{
-    dim_t jStart = (expand ? 0 : fDims[1]/2);
-    dim_t jEnd   = (expand ? oDims[1] : jStart + sDims[1]);
-    dim_t iStart = (expand ? 0 : fDims[0]/2);
-    dim_t iEnd   = (expand ? oDims[0] : iStart + sDims[0]);
-
-    for(dim_t j=jStart; j<jEnd; ++j) {
-        dim_t joff = (j-jStart)*oStrides[1];
-
-        for(dim_t i=iStart; i<iEnd; ++i) {
-
-            accT accum = accT(0);
-            for(dim_t wj=0; wj<fDims[1]; ++wj) {
-                dim_t jIdx  = j-wj;
-                dim_t w_joff = wj*fStrides[1];
-                dim_t s_joff = jIdx * sStrides[1];
-                bool isJValid = (jIdx>=0 && jIdx<sDims[1]);
-
-                for(dim_t wi=0; wi<fDims[0]; ++wi) {
-                    dim_t iIdx = i-wi;
-
-                    T s_val = T(0);
-                    if ( isJValid && (iIdx>=0 && iIdx<sDims[0])) {
-                        s_val = iptr[s_joff+iIdx*sStrides[0]];
-                    }
-
-                    accum += accT(s_val * fptr[w_joff+wi*fStrides[0]]);
-                }
-            }
-            optr[joff+i-iStart] = T(accum);
-        }
-    }
-}
-
-template<typename T, typename accT, bool expand>
-void one2one_3d(T *optr, T const *iptr, accT const *fptr, dim4 const &oDims,
-                dim4 const &sDims, dim4 const &fDims, dim4 const &oStrides,
-                dim4 const &sStrides, dim4 const &fStrides)
-{
-    dim_t kStart = (expand ? 0 : fDims[2]/2);
-    dim_t kEnd   = (expand ? oDims[2] : kStart + sDims[2]);
-    dim_t jStart = (expand ? 0 : fDims[1]/2);
-    dim_t jEnd   = (expand ? oDims[1] : jStart + sDims[1]);
-    dim_t iStart = (expand ? 0 : fDims[0]/2);
-    dim_t iEnd   = (expand ? oDims[0] : iStart + sDims[0]);
-
-    for(dim_t k=kStart; k<kEnd; ++k) {
-        dim_t koff = (k-kStart)*oStrides[2];
-
-        for(dim_t j=jStart; j<jEnd; ++j) {
-            dim_t joff = (j-jStart)*oStrides[1];
-
-            for(dim_t i=iStart; i<iEnd; ++i) {
-
-                accT accum = accT(0);
-                for(dim_t wk=0; wk<fDims[2]; ++wk) {
-                    dim_t kIdx  = k-wk;
-                    dim_t w_koff = wk*fStrides[2];
-                    dim_t s_koff = kIdx * sStrides[2];
-                    bool isKValid = (kIdx>=0 && kIdx<sDims[2]);
-
-                    for(dim_t wj=0; wj<fDims[1]; ++wj) {
-                        dim_t jIdx  = j-wj;
-                        dim_t w_joff = wj*fStrides[1];
-                        dim_t s_joff = jIdx * sStrides[1];
-                        bool isJValid = (jIdx>=0 && jIdx<sDims[1]);
-
-                        for(dim_t wi=0; wi<fDims[0]; ++wi) {
-                            dim_t iIdx = i-wi;
-
-                            T s_val = T(0);
-                            if ( isKValid && isJValid && (iIdx>=0 && iIdx<sDims[0])) {
-                                s_val = iptr[s_koff+s_joff+iIdx*sStrides[0]];
-                            }
-
-                            accum += accT(s_val * fptr[w_koff+w_joff+wi*fStrides[0]]);
-                        }
-                    }
-                }
-                optr[koff+joff+i-iStart] = T(accum);
-            } //i loop ends here
-        } // j loop ends here
-    } // k loop ends here
-}
+#include <handle.hpp>
+#include <kernel/convolve.hpp>
+#include <platform.hpp>
+#include <reorder.hpp>
+#include <transpose.hpp>
+#include <unwrap.hpp>
+#include <wrap.hpp>
+#include <vector>
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-void convolve_nd(T *optr, T const *iptr, accT const *fptr,
-                dim4 const &oDims, dim4 const &sDims, dim4 const &fDims,
-                dim4 const &oStrides, dim4 const &sStrides, dim4 const &fStrides,
-                ConvolveBatchKind kind)
-{
-    dim_t out_step[4]  = {0, 0, 0, 0}; /* first value is never used, and declared for code simplicity */
-    dim_t in_step[4]   = {0, 0, 0, 0}; /* first value is never used, and declared for code simplicity */
-    dim_t filt_step[4] = {0, 0, 0, 0}; /* first value is never used, and declared for code simplicity */
-    dim_t batch[4]     = {0, 1, 1, 1}; /* first value is never used, and declared for code simplicity */
-
-    for (dim_t i=1; i<4; ++i) {
-        switch(kind) {
-            case MANY2ONE:
-                out_step[i] = oStrides[i];
-                in_step[i]  = sStrides[i];
-                if (i>=baseDim) batch[i] = sDims[i];
-                break;
-            case MANY2MANY:
-                out_step[i]  = oStrides[i];
-                in_step[i]   = sStrides[i];
-                filt_step[i] = fStrides[i];
-                if (i>=baseDim) batch[i] = sDims[i];
-                break;
-            case ONE2MANY:
-                out_step[i]  = oStrides[i];
-                filt_step[i] = fStrides[i];
-                if (i>=baseDim) batch[i] = fDims[i];
-                break;
-            default:
-                break;
-        }
-    }
-
-    for (dim_t b3=0; b3<batch[3]; ++b3) {
-        for (dim_t b2=0; b2<batch[2]; ++b2) {
-            for (dim_t b1=0; b1<batch[1]; ++b1) {
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
-                T * out          = optr + b1 * out_step[1] + b2 * out_step[2] + b3 * out_step[3];
-                T const *in      = iptr + b1 *  in_step[1] + b2 *  in_step[2] + b3 *  in_step[3];
-                accT const *filt = fptr + b1 *filt_step[1] + b2 *filt_step[2] + b3 *filt_step[3];
+using af::dim4;
+using arrayfire::common::flip;
+using arrayfire::common::half;
+using arrayfire::common::modDims;
 
-                switch(baseDim) {
-                    case 1: one2one_1d<T, accT, expand>(out, in, filt, oDims, sDims, fDims, sStrides);                     break;
-                    case 2: one2one_2d<T, accT, expand>(out, in, filt, oDims, sDims, fDims, oStrides, sStrides, fStrides); break;
-                    case 3: one2one_3d<T, accT, expand>(out, in, filt, oDims, sDims, fDims, oStrides, sStrides, fStrides); break;
-                }
-            }
-        }
-    }
-}
+namespace arrayfire {
+namespace cpu {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind)
-{
-    auto sDims    = signal.dims();
-    auto fDims    = filter.dims();
-    auto sStrides = signal.strides();
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand) {
+    auto sDims = signal.dims();
+    auto fDims = filter.dims();
 
     dim4 oDims(1);
     if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sDims[d]+fDims[d]-1;
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
             } else {
-                oDims[d] = (d<baseDim ? sDims[d]+fDims[d]-1 : sDims[d]);
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
             }
         }
     } else {
         oDims = sDims;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fDims[i];
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
         }
     }
 
     Array<T> out = createEmptyArray<T>(oDims);
 
-    convolve_nd<T, accT, baseDim, expand>(out.get(), signal.get(), filter.get(),
-            oDims, sDims, fDims, out.strides(), sStrides, filter.strides(), kind);
+    getQueue().enqueue(kernel::convolve_nd<T, accT>, out, signal, filter, kind,
+                       rank, expand);
 
     return out;
 }
 
-template<typename T, typename accT, dim_t conv_dim, bool expand>
-void convolve2_separable(T *optr, T const *iptr, accT const *fptr,
-                        dim4 const &oDims, dim4 const &sDims, dim4 const &orgDims, dim_t fDim,
-                        dim4 const &oStrides, dim4 const &sStrides, dim_t fStride)
-{
-    for(dim_t j=0; j<oDims[1]; ++j) {
-
-        dim_t jOff = j*oStrides[1];
-        dim_t cj = j + (conv_dim==1)*(expand ? 0: fDim>>1);
-
-        for(dim_t i=0; i<oDims[0]; ++i) {
-
-            dim_t iOff = i*oStrides[0];
-            dim_t ci = i + (conv_dim==0)*(expand ? 0 : fDim>>1);
-
-            accT accum = scalar<accT>(0);
-
-            for(dim_t f=0; f<fDim; ++f) {
-                T f_val = fptr[f];
-                T s_val;
-
-                if (conv_dim==0) {
-                    dim_t offi = ci - f;
-                    bool isCIValid = offi>=0 && offi<sDims[0];
-                    bool isCJValid = cj>=0 && cj<sDims[1];
-                    s_val = (isCJValid && isCIValid ? iptr[cj*sDims[0]+offi] : scalar<T>(0));
-                } else {
-                    dim_t offj = cj - f;
-                    bool isCIValid = ci>=0 && ci<sDims[0];
-                    bool isCJValid = offj>=0 && offj<sDims[1];
-                    s_val = (isCJValid && isCIValid ? iptr[offj*sDims[0]+ci] : scalar<T>(0));
-                }
-
-                accum += accT(s_val * f_val);
-            }
-            optr[iOff+jOff] = T(accum);
-        }
-    }
-}
-
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter)
-{
-    auto sDims    = signal.dims();
-    auto cfDims   = c_filter.dims();
-    auto rfDims   = r_filter.dims();
-    auto sStrides = signal.strides();
-
-    dim_t cflen = (dim_t)cfDims.elements();
-    dim_t rflen = (dim_t)rfDims.elements();
-
-    dim4 tDims = sDims;
-    dim4 oDims = sDims;
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand) {
+    const auto &sDims = signal.dims();
+    dim4 tDims        = sDims;
+    dim4 oDims        = sDims;
 
     if (expand) {
-        // separable convolve only does ONE2ONE and standard batch(MANY2ONE)
+        auto cfDims = c_filter.dims();
+        auto rfDims = r_filter.dims();
+
+        auto cflen = cfDims.elements();
+        auto rflen = rfDims.elements();
+        // Separable convolve supports only AF_BATCH_NONE and the standard
+        // batch mode (AF_BATCH_LHS).
         tDims[0] += cflen - 1;
         oDims[0] += cflen - 1;
         oDims[1] += rflen - 1;
     }
 
-    Array<T> temp = createEmptyArray<T>(tDims);
     Array<T> out  = createEmptyArray<T>(oDims);
-    auto tStrides = temp.strides();
-    auto oStrides = out.strides();
-
-    for (dim_t b3=0; b3<oDims[3]; ++b3) {
+    Array<T> temp = createEmptyArray<T>(tDims);
 
-        dim_t i_b3Off = b3*sStrides[3];
-        dim_t t_b3Off = b3*tStrides[3];
-        dim_t o_b3Off = b3*oStrides[3];
+    if (expand) {
+        getQueue().enqueue(kernel::convolve2<T, accT, true>, out, signal,
+                           c_filter, r_filter, temp);
+    } else {
+        getQueue().enqueue(kernel::convolve2<T, accT, false>, out, signal,
+                           c_filter, r_filter, temp);
+    }
+    return out;
+}
 
-        for (dim_t b2=0; b2<oDims[2]; ++b2) {
+#define INSTANTIATE(T, accT)                                                   \
+    template Array<T> convolve<T, accT>(Array<T> const &, Array<accT> const &, \
+                                        AF_BATCH_KIND, const int, const bool); \
+    template Array<T> convolve2<T, accT>(Array<T> const &,                     \
+                                         Array<accT> const &,                  \
+                                         Array<accT> const &, const bool);
 
-            T const *iptr = signal.get()+ b2*sStrides[2] + i_b3Off;
-            T *tptr = temp.get() + b2*tStrides[2] + t_b3Off;
-            T *optr = out.get()  + b2*oStrides[2] + o_b3Off;
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> convolve2_unwrap(const Array<T> &signal, const Array<T> &filter,
+                          const dim4 &stride, const dim4 &padding,
+                          const dim4 &dilation) {
+    dim4 sDims = signal.dims();
+    dim4 fDims = filter.dims();
+
+    dim_t outputWidth =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t outputHeight =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(signal, fDims[0], fDims[1], stride[0], stride[1], padding[0],
+               padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsedFilter = flip(filter, {1, 1, 0, 0});
+    collapsedFilter          = modDims(collapsedFilter,
+                                       dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsedFilter, AF_MAT_TRANS, AF_MAT_NONE);
+    res = modDims(res, dim4(outputWidth, outputHeight, signal.dims()[3],
+                            collapsedFilter.dims()[1]));
+    Array<T> out = reorder(res, dim4(0, 1, 3, 2));
 
-            convolve2_separable<T, accT, 0, expand>(tptr, iptr, c_filter.get(),
-                    tDims, sDims, sDims, cflen,
-                    tStrides, sStrides, c_filter.strides()[0]);
+    return out;
+}
 
-            convolve2_separable<T, accT, 1, expand>(optr, tptr, r_filter.get(),
-                    oDims, tDims, sDims, rflen,
-                    oStrides, tStrides, r_filter.strides()[0]);
-        }
-    }
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation) {
+    Array<T> out =
+        convolve2_unwrap<T>(signal, filter, stride, padding, dilation);
 
     return out;
 }
 
-#define INSTANTIATE(T, accT)                                            \
-    template Array<T> convolve <T, accT, 1, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 1, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve2<T, accT, true >(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter); \
-    template Array<T> convolve2<T, accT, false>(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
-
-INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
+#define INSTANTIATE(T)                                                        \
+    template Array<T> convolve2<T>(Array<T> const &signal,                    \
+                                   Array<T> const &filter, const dim4 stride, \
+                                   const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> & /*convolved_output*/,
+                           af::dim4 stride, af::dim4 padding,
+                           af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &sDims = original_signal.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    Array<T> collapsed_filter = flip(original_filter, {1, 1, 0, 0});
+    collapsed_filter          = modDims(collapsed_filter,
+                                        dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(collapsed_gradient, collapsed_filter, AF_MAT_NONE, AF_MAT_TRANS);
+    res = modDims(res, dim4(res.dims()[0] / sDims[3], sDims[3],
+                            fDims[0] * fDims[1], sDims[2]));
+    res = reorder(res, dim4(0, 2, 3, 1));
+
+    const bool retCols = false;
+    res = wrap_dilated(res, sDims[0], sDims[1], fDims[0], fDims[1], stride[0],
+                       stride[1], padding[0], padding[1], dilation[0],
+                       dilation[1], retCols);
+
+    return res;
+}
 
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> & /*convolved_output*/,
+                             af::dim4 stride, af::dim4 padding,
+                             af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(original_signal, fDims[0], fDims[1], stride[0], stride[1],
+               padding[0], padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsed_gradient, AF_MAT_NONE, AF_MAT_NONE);
+    res = modDims(res, dim4(fDims[0], fDims[1], fDims[2], fDims[3]));
+
+    return flip(res, {1, 1, 0, 0});
 }
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> conv2DataGradient<T>(                                 \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);        \
+    template Array<T> conv2FilterGradient<T>(                               \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cpu
+}  // namespace arrayfire
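Review note on `convolve2_unwrap` above: `outputWidth` and `outputHeight` follow the usual output-extent formula for a strided, padded, dilated convolution, `1 + (s + 2 * p - ((f - 1) * d + 1)) / stride`. The sketch below isolates that arithmetic for one axis. `convOutputDim` is a hypothetical helper name for illustration and is not part of ArrayFire.

```cpp
#include <cassert>

// Hypothetical helper (not part of ArrayFire): output extent of a strided,
// padded, dilated convolution along one axis, mirroring the expression used
// for outputWidth/outputHeight in convolve2_unwrap.
long long convOutputDim(long long signal, long long filter, long long stride,
                        long long padding, long long dilation) {
    assert(stride > 0 && dilation > 0);
    // Effective filter span once the dilation gaps are included.
    const long long span = (filter - 1) * dilation + 1;
    assert(signal + 2 * padding >= span);
    // Integer division truncates, so a trailing partial window is dropped.
    return 1 + (signal + 2 * padding - span) / stride;
}
```

For example, a 7-element axis with a 3-tap filter, stride 2, no padding, and dilation 1 gives `1 + (7 - 3) / 2 = 3` outputs.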
diff --git a/src/backend/cpu/convolve.hpp b/src/backend/cpu/convolve.hpp
index 95d7625c90..66963a1d58 100644
--- a/src/backend/cpu/convolve.hpp
+++ b/src/backend/cpu/convolve.hpp
@@ -8,15 +8,35 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
+#include <common/defines.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind);
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand);
 
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand);
 
-}
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation);
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> &convolved_output, af::dim4 stride,
+                           af::dim4 padding, af::dim4 dilation);
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/copy.cpp b/src/backend/cpu/copy.cpp
index eb50e79d97..ea98c0f613 100644
--- a/src/backend/cpu/copy.cpp
+++ b/src/backend/cpu/copy.cpp
@@ -7,205 +7,154 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <type_traits>
-#include <af/array.h>
 #include <Array.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/complex.hpp>
+#include <common/half.hpp>
 #include <copy.hpp>
-#include <cstring>
-#include <algorithm>
-#include <complex>
-#include <vector>
-#include <cassert>
 #include <err_cpu.hpp>
-#include <math.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    static void stridedCopy(T* dst, const dim4& ostrides, const T* src, const dim4 &dims, const dim4 &strides, unsigned dim)
-    {
-        if(dim == 0) {
-            if(strides[dim] == 1) {
-                //FIXME: Check for errors / exceptions
-                memcpy(dst, src, dims[dim] * sizeof(T));
-            } else {
-                for(dim_t i = 0; i < dims[dim]; i++) {
-                    dst[i] = src[strides[dim]*i];
-                }
-            }
-        } else {
-            for(dim_t i = dims[dim]; i > 0; i--) {
-                stridedCopy<T>(dst, ostrides, src, dims, strides, dim - 1);
-                src += strides[dim];
-                dst += ostrides[dim];
-            }
-        }
-    }
-
-    // Assigns to single elements
-    template<typename T>
-    void copyData(T *to, const Array<T> &from)
-    {
-        if(from.isOwner()) {
-            // FIXME: Check for errors / exceptions
-            memcpy(to, from.get(), from.elements()*sizeof(T));
-        } else {
-            dim4 ostrides = calcStrides(from.dims());
-            stridedCopy<T>(to, ostrides, from.get(), from.dims(), from.strides(), from.ndims() - 1);
-        }
-    }
-
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A)
-    {
-        Array<T> out = createEmptyArray<T>(A.dims());
-        copyData(out.get(), A);
-        return out;
-    }
-
-    template<typename inType, typename outType>
-    static void copy(Array<outType> &dst, const Array<inType> &src, outType default_value, double factor)
-    {
-        dim4 src_dims       = src.dims();
-        dim4 dst_dims       = dst.dims();
-        dim4 src_strides    = src.strides();
-        dim4 dst_strides    = dst.strides();
-
-        const inType * src_ptr = src.get();
-        outType * dst_ptr      = dst.get();
-
-        dim_t trgt_l = std::min(dst_dims[3], src_dims[3]);
-        dim_t trgt_k = std::min(dst_dims[2], src_dims[2]);
-        dim_t trgt_j = std::min(dst_dims[1], src_dims[1]);
-        dim_t trgt_i = std::min(dst_dims[0], src_dims[0]);
-
-        for(dim_t l=0; l<dst_dims[3]; ++l) {
-
-            dim_t src_loff = l*src_strides[3];
-            dim_t dst_loff = l*dst_strides[3];
-            bool isLvalid = l<trgt_l;
-
-            for(dim_t k=0; k<dst_dims[2]; ++k) {
-
-                dim_t src_koff = k*src_strides[2];
-                dim_t dst_koff = k*dst_strides[2];
-                bool isKvalid = k<trgt_k;
-
-                for(dim_t j=0; j<dst_dims[1]; ++j) {
-
-                    dim_t src_joff = j*src_strides[1];
-                    dim_t dst_joff = j*dst_strides[1];
-                    bool isJvalid = j<trgt_j;
-
-                    for(dim_t i=0; i<dst_dims[0]; ++i) {
-                        outType temp = default_value;
-                        if (isLvalid && isKvalid && isJvalid && i<trgt_i) {
-                            dim_t src_idx = i*src_strides[0] + src_joff + src_koff + src_loff;
-                            temp = outType(src_ptr[src_idx])*outType(factor);
-                        }
-                        dim_t dst_idx = i*dst_strides[0] + dst_joff + dst_koff + dst_loff;
-                        dst_ptr[dst_idx] = temp;
-                    }
-                }
-            }
-        }
-    }
+#include <kernel/copy.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
-    template<typename inType, typename outType>
-    Array<outType>
-    padArray(Array<inType> const &in, dim4 const &dims,
-             outType default_value, double factor)
-    {
-        Array<outType> ret = createValueArray<outType>(dims, default_value);
-        copy<inType, outType>(ret, in, outType(default_value), factor);
-        return ret;
-    }
+#include <cstdio>
+#include <cstring>
 
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, Array<inType> const &in)
-    {
-        copy<inType, outType>(out, in, scalar<outType>(0), 1.0);
+using arrayfire::common::half;  // NOLINT(misc-unused-using-decls) bug in
+                                // clang-tidy
+using arrayfire::common::is_complex;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copyData(T *to, const Array<T> &from) {
+    if (from.elements() == 0) { return; }
+
+    from.eval();
+    // Ensure all operations on 'from' are complete before copying data to host.
+    getQueue().sync();
+    if (from.isLinear()) {
+        // FIXME: Check for errors / exceptions
+        memcpy(to, from.get(), from.elements() * sizeof(T));
+    } else {
+        dim4 ostrides = calcStrides(from.dims());
+        kernel::stridedCopy<T>(to, ostrides, from.get(), from.dims(),
+                               from.strides(), from.ndims() - 1);
     }
+}
 
+template<typename T>
+Array<T> copyArray(const Array<T> &A) {
+    Array<T> out = createEmptyArray<T>(A.dims());
+    if (A.elements() > 0) { getQueue().enqueue(kernel::copy<T, T>, out, A); }
+    return out;
+}
 
-#define INSTANTIATE(T)                                                  \
-    template void      copyData<T> (T *data, const Array<T> &from);     \
-    template Array<T>  copyArray<T>(const Array<T> &A);                 \
-
-    INSTANTIATE(float  )
-    INSTANTIATE(double )
-    INSTANTIATE(cfloat )
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int    )
-    INSTANTIATE(uint   )
-    INSTANTIATE(uchar  )
-    INSTANTIATE(char   )
-    INSTANTIATE(intl   )
-    INSTANTIATE(uintl  )
-
-
-#define INSTANTIATE_PAD_ARRAY(SRC_T)                                    \
-    template Array<float  > padArray<SRC_T, float  >(Array<SRC_T> const &src, dim4 const &dims, float   default_value, double factor); \
-    template Array<double > padArray<SRC_T, double >(Array<SRC_T> const &src, dim4 const &dims, double  default_value, double factor); \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template Array<int    > padArray<SRC_T, int    >(Array<SRC_T> const &src, dim4 const &dims, int     default_value, double factor); \
-    template Array<uint   > padArray<SRC_T, uint   >(Array<SRC_T> const &src, dim4 const &dims, uint    default_value, double factor); \
-    template Array<intl    > padArray<SRC_T, intl    >(Array<SRC_T> const &src, dim4 const &dims, intl     default_value, double factor); \
-    template Array<uintl   > padArray<SRC_T, uintl   >(Array<SRC_T> const &src, dim4 const &dims, uintl    default_value, double factor); \
-    template Array<uchar  > padArray<SRC_T, uchar  >(Array<SRC_T> const &src, dim4 const &dims, uchar   default_value, double factor); \
-    template Array<char   > padArray<SRC_T, char   >(Array<SRC_T> const &src, dim4 const &dims, char    default_value, double factor); \
-    template void copyArray<SRC_T, float  >(Array<float  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, double >(Array<double > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cfloat >(Array<cfloat > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cdouble>(Array<cdouble> &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, int    >(Array<int    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uint   >(Array<uint   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, intl    >(Array<intl    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uintl   >(Array<uintl   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uchar  >(Array<uchar  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, char   >(Array<char   > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY(float )
-    INSTANTIATE_PAD_ARRAY(double)
-    INSTANTIATE_PAD_ARRAY(int   )
-    INSTANTIATE_PAD_ARRAY(uint  )
-    INSTANTIATE_PAD_ARRAY(intl   )
-    INSTANTIATE_PAD_ARRAY(uintl  )
-    INSTANTIATE_PAD_ARRAY(uchar )
-    INSTANTIATE_PAD_ARRAY(char  )
-
-#define INSTANTIATE_PAD_ARRAY_COMPLEX(SRC_T)                            \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template void copyArray<SRC_T, cfloat  >(Array<cfloat  > &dst, Array<SRC_T> const &src);    \
-    template void copyArray<SRC_T, cdouble   >(Array<cdouble > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cfloat )
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cdouble)
-
-#define SPECILIAZE_UNUSED_COPYARRAY(SRC_T, DST_T) \
-    template<> void copyArray<SRC_T, DST_T>(Array<DST_T> &out, Array<SRC_T> const &in) \
-    {\
-        CPU_NOT_SUPPORTED();\
-    }
-
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uintl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uintl)
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, Array<inType> const &in) {
+    static_assert(
+        !(is_complex<inType>::value && !is_complex<outType>::value),
+        "Cannot copy from complex Array<T> to a non complex Array<T>");
+    getQueue().enqueue(kernel::copy<outType, inType>, out, in);
+}
 
+#define INSTANTIATE(T)                                         \
+    template void copyData<T>(T * data, const Array<T> &from); \
+    template Array<T> copyArray<T>(const Array<T> &A);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COPY_ARRAY(SRC_T)                                 \
+    template void copyArray<SRC_T, float>(Array<float> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, double>(Array<double> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,     \
+                                            Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, int>(Array<int> & dst,             \
+                                        Array<SRC_T> const &src);     \
+    template void copyArray<SRC_T, uint>(Array<uint> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, intl>(Array<intl> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, uintl>(Array<uintl> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, short>(Array<short> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, ushort>(Array<ushort> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, schar>(Array<schar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, uchar>(Array<uchar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, char>(Array<char> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, half>(Array<half> & dst,           \
+                                         Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY(float)
+INSTANTIATE_COPY_ARRAY(double)
+INSTANTIATE_COPY_ARRAY(int)
+INSTANTIATE_COPY_ARRAY(uint)
+INSTANTIATE_COPY_ARRAY(intl)
+INSTANTIATE_COPY_ARRAY(uintl)
+INSTANTIATE_COPY_ARRAY(schar)
+INSTANTIATE_COPY_ARRAY(uchar)
+INSTANTIATE_COPY_ARRAY(char)
+INSTANTIATE_COPY_ARRAY(ushort)
+INSTANTIATE_COPY_ARRAY(short)
+INSTANTIATE_COPY_ARRAY(half)
+
+#define INSTANTIATE_COPY_ARRAY_COMPLEX(SRC_T)                        \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,      \
+                                           Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,    \
+                                            Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY_COMPLEX(cfloat)
+INSTANTIATE_COPY_ARRAY_COMPLEX(cdouble)
+
+template<typename T>
+T getScalar(const Array<T> &in) {
+    in.eval();
+    getQueue().sync();
+    return in.get()[0];
 }
+
+#define INSTANTIATE_GETSCALAR(T) template T getScalar(const Array<T> &in);
+
+INSTANTIATE_GETSCALAR(float)
+INSTANTIATE_GETSCALAR(double)
+INSTANTIATE_GETSCALAR(cfloat)
+INSTANTIATE_GETSCALAR(cdouble)
+INSTANTIATE_GETSCALAR(int)
+INSTANTIATE_GETSCALAR(uint)
+INSTANTIATE_GETSCALAR(schar)
+INSTANTIATE_GETSCALAR(uchar)
+INSTANTIATE_GETSCALAR(char)
+INSTANTIATE_GETSCALAR(intl)
+INSTANTIATE_GETSCALAR(uintl)
+INSTANTIATE_GETSCALAR(short)
+INSTANTIATE_GETSCALAR(ushort)
+INSTANTIATE_GETSCALAR(half)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/copy.hpp b/src/backend/cpu/copy.hpp
index 4178461c62..6e68bff2b7 100644
--- a/src/backend/cpu/copy.hpp
+++ b/src/backend/cpu/copy.hpp
@@ -8,22 +8,70 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
+#include <kernel/pad_array_borders.hpp>
+#include <math.hpp>
+#include <queue.hpp>
 
-namespace cpu
-{
+namespace af {
+class dim4;
+}
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copyData(T *to, const Array<T> &from);
+
+template<typename T>
+Array<T> copyArray(const Array<T> &A);
+
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, const Array<inType> &in);
 
-    template<typename T>
-    void copyData(T *data, const Array<T> &A);
+// Resize Array to target dimensions and convert type
+//
+// Depending on the \p outDims, the output Array can be either truncated
+// or padded (towards end of respective dimensions).
+//
+// During the resizing copy, if output dimensions are larger than input, then
+// elements beyond the input dimensions are set to the \p defaultValue.
+//
+// \param[in] in is input Array
+// \param[in] outDims is the target output dimensions
+// \param[in] defaultValue is the value to which padded locations are set.
+// \param[in] scale is the value by which all output elements are scaled.
+//
+// \returns Array<outType>
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue = outType(0), double scale = 1.0);
 
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A);
+template<typename T>
+Array<T> padArrayBorders(const Array<T> &in, const dim4 &lowerBoundPadding,
+                         const dim4 &upperBoundPadding,
+                         const af::borderType btype) {
+    const dim4 &iDims = in.dims();
 
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, const Array<inType> &in);
+    dim4 oDims(lowerBoundPadding[0] + iDims[0] + upperBoundPadding[0],
+               lowerBoundPadding[1] + iDims[1] + upperBoundPadding[1],
+               lowerBoundPadding[2] + iDims[2] + upperBoundPadding[2],
+               lowerBoundPadding[3] + iDims[3] + upperBoundPadding[3]);
 
-    template<typename inType, typename outType>
-    Array<outType> padArray(Array<inType> const &in, dim4 const &dims,
-                            outType default_value=outType(0), double factor=1.0);
+    if (oDims == iDims) { return in; }
+
+    auto ret = (btype == AF_PAD_ZERO ? createValueArray<T>(oDims, scalar<T>(0))
+                                     : createEmptyArray<T>(oDims));
+
+    getQueue().enqueue(kernel::padBorders<T>, ret, in, lowerBoundPadding,
+                       upperBoundPadding, btype);
+    return ret;
 }
+
+template<typename T>
+void multiply_inplace(Array<T> &in, double val);
+
+template<typename T>
+T getScalar(const Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/device_manager.cpp b/src/backend/cpu/device_manager.cpp
new file mode 100644
index 0000000000..e2d5ed6f68
--- /dev/null
+++ b/src/backend/cpu/device_manager.cpp
@@ -0,0 +1,184 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/DefaultMemoryManager.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <device_manager.hpp>
+#include <memory.hpp>
+#include <af/version.h>
+
+#include <cctype>
+#include <sstream>
+
+using arrayfire::common::MemoryManagerBase;
+using std::string;
+
+#ifdef CPUID_CAPABLE
+
+CPUInfo::CPUInfo()
+    : mVendorId("")
+    , mModelName("")
+    , mNumSMT(0)
+    , mNumCores(0)
+    , mNumLogCpus(0)
+    , mIsHTT(false) {
+    // Read the HTT feature flag (EAX=1), then the vendor name (EAX=0)
+    CPUID cpuID1(1, 0);
+    mIsHTT = cpuID1.EDX() & HTT_POS;
+
+    CPUID cpuID0(0, 0);
+    uint32_t HFS = cpuID0.EAX();
+    mVendorId += string(reinterpret_cast<const char*>(&cpuID0.EBX()), 4);
+    mVendorId += string(reinterpret_cast<const char*>(&cpuID0.EDX()), 4);
+    mVendorId += string(reinterpret_cast<const char*>(&cpuID0.ECX()), 4);
+
+    string upVId = mVendorId;
+
+    for_each(upVId.begin(), upVId.end(),
+             [](char& in) { in = static_cast<char>(::toupper(in)); });
+
+    // Get num of cores
+    if (upVId.find("INTEL") != std::string::npos) {
+        mVendorId = "Intel";
+        if (HFS >= 11) {
+            for (int lvl = 0; lvl < MAX_INTEL_TOP_LVL; ++lvl) {
+                CPUID cpuID4(0x0B, lvl);
+                uint32_t currLevel = (LVL_TYPE & cpuID4.ECX()) >> 8U;
+                switch (currLevel) {
+                    case 0x01: mNumSMT = LVL_CORES & cpuID4.EBX(); break;
+                    case 0x02: mNumLogCpus = LVL_CORES & cpuID4.EBX(); break;
+                    default: break;
+                }
+            }
+            // Guards against a possible divide-by-zero error
+            // TODO: Fix properly
+            mNumCores = mNumLogCpus / (mNumSMT == 0 ? 1 : mNumSMT);
+        } else {
+            if (HFS >= 1) {
+                mNumLogCpus = (cpuID1.EBX() >> 16U) & 0xFFU;
+                if (HFS >= 4) {
+                    mNumCores = 1 + ((CPUID(4, 0).EAX() >> 26U) & 0x3FU);
+                }
+            }
+            if (mIsHTT) {
+                if (!(mNumCores > 1)) {
+                    mNumCores   = 1;
+                    mNumLogCpus = (mNumLogCpus >= 2 ? mNumLogCpus : 2U);
+                }
+            } else {
+                mNumCores = mNumLogCpus = 1;
+            }
+        }
+    } else if (upVId.find("AMD") != std::string::npos) {
+        mVendorId = "AMD";
+        if (HFS >= 1) {
+            mNumLogCpus = (cpuID1.EBX() >> 16U) & 0xFFU;
+            if (CPUID(0x80000000, 0).EAX() >= 8U) {
+                mNumCores = 1 + ((CPUID(0x80000008, 0).ECX() & 0xFFU));
+            }
+        }
+        if (mIsHTT) {
+            if (!(mNumCores > 1)) {
+                mNumCores   = 1;
+                mNumLogCpus = (mNumLogCpus >= 2 ? mNumLogCpus : 2);
+            }
+        } else {
+            mNumCores = mNumLogCpus = 1;
+        }
+    } else {
+        mVendorId = "Unknown";
+    }
+    // Get processor brand string
+    // This seems to be working for both Intel & AMD vendors
+    for (unsigned i = 0x80000002; i < 0x80000005; ++i) {
+        CPUID cpuID(i, 0);
+        mModelName += string(reinterpret_cast<const char*>(&cpuID.EAX()), 4);
+        mModelName += string(reinterpret_cast<const char*>(&cpuID.EBX()), 4);
+        mModelName += string(reinterpret_cast<const char*>(&cpuID.ECX()), 4);
+        mModelName += string(reinterpret_cast<const char*>(&cpuID.EDX()), 4);
+    }
+    mModelName.shrink_to_fit();
+}
+
+#else
+
+CPUInfo::CPUInfo()
+    : mVendorId("Unknown")
+    , mModelName("Unknown")
+    , mNumSMT(1)
+    , mNumCores(1)
+    , mNumLogCpus(1)
+    , mIsHTT(false) {}
+
+#endif
+
+namespace arrayfire {
+namespace cpu {
+
+DeviceManager::DeviceManager()
+    : queues(MAX_QUEUES)
+    , fgMngr(new common::ForgeManager())
+    , memManager(new common::DefaultMemoryManager(
+          getDeviceCount(), common::MAX_BUFFERS,
+          AF_MEM_DEBUG || AF_CPU_MEM_DEBUG)) {
+    // Use the default ArrayFire memory manager
+    std::unique_ptr<cpu::Allocator> deviceMemoryManager(new cpu::Allocator());
+    memManager->setAllocator(std::move(deviceMemoryManager));
+    memManager->initialize();
+}
+
+DeviceManager& DeviceManager::getInstance() {
+    static auto* my_instance = new DeviceManager();
+    return *my_instance;
+}
+
+CPUInfo DeviceManager::getCPUInfo() const { return cinfo; }
+
+void DeviceManager::resetMemoryManager() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_CPU_MEM_DEBUG));
+    setMemoryManager(std::move(mgr));
+}
+
+void DeviceManager::setMemoryManager(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a memory manager and the default memory
+    // manager still hasn't been initialized, so initialize it anyway so we
+    // don't inadvertently reset to it when we first call memoryManager()
+    memoryManager();
+    // Shuts down the allocator of the existing memory manager
+    if (memManager) { memManager->shutdownAllocator(); }
+    memManager = std::move(newMgr);
+    // Set the backend memory manager for this new manager to register native
+    // functions correctly.
+    std::unique_ptr<cpu::Allocator> deviceMemoryManager(new cpu::Allocator());
+    memManager->setAllocator(std::move(deviceMemoryManager));
+    memManager->initialize();
+}
+
+void DeviceManager::setMemoryManagerPinned(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    UNUSED(newMgr);
+    UNUSED(this);
+    AF_ERROR("Using pinned memory with CPU is not supported",
+             AF_ERR_NOT_SUPPORTED);
+}
+
+void DeviceManager::resetMemoryManagerPinned() {
+    // This is a NOOP - we should never set a pinned memory manager in the first
+    // place for the CPU backend, but don't throw in case backend-agnostic
+    // functions that operate on all memory managers need to call this
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/device_manager.hpp b/src/backend/cpu/device_manager.hpp
new file mode 100644
index 0000000000..a67c611d24
--- /dev/null
+++ b/src/backend/cpu/device_manager.hpp
@@ -0,0 +1,147 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <platform.hpp>
+#include <queue.hpp>
+#include <memory>
+#include <mutex>
+#include <string>
+
+using arrayfire::common::MemoryManagerBase;
+
+#ifndef AF_CPU_MEM_DEBUG
+#define AF_CPU_MEM_DEBUG 0
+#endif
+
+#if defined(AF_WITH_CPUID) &&                                       \
+    (defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || \
+     defined(_M_IX86) || defined(_WIN64))
+#define CPUID_CAPABLE
+#endif
+
+#ifdef _WIN32
+#include <intrin.h>
+#include <limits.h>
+typedef unsigned __int32 uint32_t;
+#else
+#include <stdint.h>
+#endif
+
+#ifdef CPUID_CAPABLE
+
+#define MAX_INTEL_TOP_LVL 4
+
+class CPUID {
+    uint32_t regs[4];
+
+   public:
+    explicit CPUID(unsigned funcId, unsigned subFuncId) {
+#ifdef _WIN32
+        __cpuidex((int*)regs, (int)funcId, (int)subFuncId);
+
+#else
+        asm volatile("cpuid"
+                     : "=a"(regs[0]), "=b"(regs[1]), "=c"(regs[2]),
+                       "=d"(regs[3])
+                     : "a"(funcId), "c"(subFuncId));
+#endif
+    }
+
+    inline const uint32_t& EAX() const { return regs[0]; }
+    inline const uint32_t& EBX() const { return regs[1]; }
+    inline const uint32_t& ECX() const { return regs[2]; }
+    inline const uint32_t& EDX() const { return regs[3]; }
+};
+
+#endif
+
+class CPUInfo {
+   public:
+    CPUInfo();
+    std::string vendor() const { return mVendorId; }
+    std::string model() const { return mModelName; }
+    int threads() const { return mNumLogCpus; }
+
+   private:
+    // Bit masks for data extraction
+    static const uint32_t LVL_NUM   = 0x000000FF;
+    static const uint32_t LVL_TYPE  = 0x0000FF00;
+    static const uint32_t LVL_CORES = 0x0000FFFF;
+    static const uint32_t HTT_POS   = 0x10000000;
+
+    // Attributes
+    std::string mVendorId;
+    std::string mModelName;
+    unsigned mNumSMT;
+    unsigned mNumCores;
+    unsigned mNumLogCpus;
+    bool mIsHTT;
+};
+
+namespace arrayfire {
+namespace cpu {
+
+class DeviceManager {
+   public:
+    static const int MAX_QUEUES            = 1;
+    static const int NUM_DEVICES           = 1;
+    static const unsigned ACTIVE_DEVICE_ID = 0;
+    static const bool IS_DOUBLE_SUPPORTED  = true;
+
+    // TODO(umar): Half is not supported for BLAS and FFT on x86_64
+    static const bool IS_HALF_SUPPORTED = true;
+
+    static DeviceManager& getInstance();
+
+    friend queue& getQueue(int device);
+
+    friend MemoryManagerBase& memoryManager();
+
+    friend void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManager();
+
+    // Pinned memory not supported in CPU
+    friend void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManagerPinned();
+
+    void resetMemoryManagerPinned();
+
+    friend arrayfire::common::ForgeManager& forgeManager();
+
+    void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void resetMemoryManager();
+
+    CPUInfo getCPUInfo() const;
+
+   private:
+    DeviceManager();
+    // The following two declarations are required to
+    // avoid accidental copy/assignment
+    // of the instance returned by getInstance to other
+    // variables
+    DeviceManager(DeviceManager const&)  = delete;
+    void operator=(DeviceManager const&) = delete;
+
+    // Attributes
+    std::vector<queue> queues;
+    std::unique_ptr<arrayfire::common::ForgeManager> fgMngr;
+    const CPUInfo cinfo;
+    std::unique_ptr<MemoryManagerBase> memManager;
+    std::mutex mutex;
+};
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/diagonal.cpp b/src/backend/cpu/diagonal.cpp
index 2ae69a6901..1767096ed0 100644
--- a/src/backend/cpu/diagonal.cpp
+++ b/src/backend/cpu/diagonal.cpp
@@ -7,84 +7,66 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <Array.hpp>
 #include <diagonal.hpp>
-#include <math.hpp>
-#include <err_cpu.hpp>
+#include <kernel/diagonal.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num)
-    {
-        int size = in.dims()[0] + std::abs(num);
-        int batch = in.dims()[1];
-        Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
-
-        const T *iptr = in.get();
-        T *optr = out.get();
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
-        for (int k = 0; k < batch; k++) {
-            for (int j = 0; j < size; j++) {
-                for (int i = 0; i < size; i++) {
-                    T val = scalar<T>(0);
-                    if (i == j - num) {
-                        val = (num > 0) ? iptr[i] : iptr[j];
-                    }
-                    optr[i + j * out.strides()[1]] = val;
-                }
-            }
-            optr += out.strides()[2];
-            iptr += in.strides()[1];
-        }
+#include <algorithm>
+#include <cstdlib>
 
-        return out;
-    }
+using arrayfire::common::half;  // NOLINT(misc-unused-using-decls) bug in
+                                // clang-tidy
+using std::abs;  // NOLINT(misc-unused-using-decls) bug in clang-tidy
+using std::min;  // NOLINT(misc-unused-using-decls) bug in clang-tidy
 
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num)
-    {
-        const dim_t *idims = in.dims().get();
-        dim_t size = std::max(idims[0], idims[1]) - std::abs(num);
-        Array<T> out = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
+namespace arrayfire {
+namespace cpu {
 
-        const dim_t *odims = out.dims().get();
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num) {
+    int size     = in.dims()[0] + abs(num);
+    int batch    = in.dims()[1];
+    Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
 
-        const int i_off = (num > 0) ? (num * in.strides()[1]) : (-num);
+    getQueue().enqueue(kernel::diagCreate<T>, out, in, num);
 
-        for (int l = 0; l < (int)odims[3]; l++) {
+    return out;
+}
 
-            for (int k = 0; k < (int)odims[2]; k++) {
-                const T *iptr = in.get() + l * in.strides()[3] + k * in.strides()[2] + i_off;
-                T *optr = out.get() + l * out.strides()[3] + k * out.strides()[2];
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num) {
+    const dim4 &idims = in.dims();
+    dim_t size        = min(idims[0], idims[1]) - abs(num);
+    Array<T> out      = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
 
-                for (int i = 0; i < (int)odims[0]; i++) {
-                    T val = scalar<T>(0);
-                    if (i < idims[0] && i < idims[1]) val =  iptr[i * in.strides()[1] + i];
-                    optr[i] = val;
-                }
-            }
-        }
+    getQueue().enqueue(kernel::diagExtract<T>, out, in, num);
 
-        return out;
-    }
+    return out;
+}
 
 #define INSTANTIATE_DIAGONAL(T)                                          \
-    template Array<T>  diagExtract<T>    (const Array<T> &in, const int num); \
-    template Array<T>  diagCreate <T>    (const Array<T> &in, const int num);
+    template Array<T> diagExtract<T>(const Array<T> &in, const int num); \
+    template Array<T> diagCreate<T>(const Array<T> &in, const int num);
 
-    INSTANTIATE_DIAGONAL(float)
-    INSTANTIATE_DIAGONAL(double)
-    INSTANTIATE_DIAGONAL(cfloat)
-    INSTANTIATE_DIAGONAL(cdouble)
-    INSTANTIATE_DIAGONAL(int)
-    INSTANTIATE_DIAGONAL(uint)
-    INSTANTIATE_DIAGONAL(intl)
-    INSTANTIATE_DIAGONAL(uintl)
-    INSTANTIATE_DIAGONAL(char)
-    INSTANTIATE_DIAGONAL(uchar)
+INSTANTIATE_DIAGONAL(float)
+INSTANTIATE_DIAGONAL(double)
+INSTANTIATE_DIAGONAL(cfloat)
+INSTANTIATE_DIAGONAL(cdouble)
+INSTANTIATE_DIAGONAL(int)
+INSTANTIATE_DIAGONAL(uint)
+INSTANTIATE_DIAGONAL(intl)
+INSTANTIATE_DIAGONAL(uintl)
+INSTANTIATE_DIAGONAL(char)
+INSTANTIATE_DIAGONAL(schar)
+INSTANTIATE_DIAGONAL(uchar)
+INSTANTIATE_DIAGONAL(short)
+INSTANTIATE_DIAGONAL(ushort)
+INSTANTIATE_DIAGONAL(half)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/diagonal.hpp b/src/backend/cpu/diagonal.hpp
index d2a21e932b..8a3807b913 100644
--- a/src/backend/cpu/diagonal.hpp
+++ b/src/backend/cpu/diagonal.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num);
 
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num);
-}
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/diff.cpp b/src/backend/cpu/diff.cpp
index 733de8fcb2..f9ced50f52 100644
--- a/src/backend/cpu/diff.cpp
+++ b/src/backend/cpu/diff.cpp
@@ -7,115 +7,60 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
 #include <diff.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    unsigned getIdx(af::dim4 strides, af::dim4 offs, int i, int j = 0, int k = 0, int l = 0)
-    {
-        return (l * strides[3] +
-                k * strides[2] +
-                j * strides[1] +
-                i);
-    }
-
-    template<typename T>
-    Array<T>  diff1(const Array<T> &in, const int dim)
-    {
-        // Bool for dimension
-        bool is_dim0 = dim == 0;
-        bool is_dim1 = dim == 1;
-        bool is_dim2 = dim == 2;
-        bool is_dim3 = dim == 3;
-
-        // Decrement dimension of select dimension
-        af::dim4 dims = in.dims();
-        dims[dim]--;
-
-        // Create output placeholder
-        Array<T> outArray = createValueArray(dims, (T)0);
 
-        // Get pointers to raw data
-        const T *inPtr = in.get();
-              T *outPtr = outArray.get();
-
-        // TODO: Improve this
-        for(dim_t l = 0; l < dims[3]; l++) {
-            for(dim_t k = 0; k < dims[2]; k++) {
-                for(dim_t j = 0; j < dims[1]; j++) {
-                    for(dim_t i = 0; i < dims[0]; i++) {
-                        // Operation: out[index] = in[index + 1 * dim_size] - in[index]
-                        int idx = getIdx(in.strides(), in.offsets(), i, j, k, l);
-                        int jdx = getIdx(in.strides(), in.offsets(),
-                                         i + is_dim0, j + is_dim1,
-                                         k + is_dim2, l + is_dim3);
-                        int odx = getIdx(outArray.strides(), outArray.offsets(), i, j, k, l);
-                        outPtr[odx] = inPtr[jdx] - inPtr[idx];
-                    }
-                }
-            }
-        }
+#include <Array.hpp>
+#include <kernel/diff.hpp>
+#include <platform.hpp>
 
-        return outArray;
-    }
+#include <af/dim4.hpp>
 
-    template<typename T>
-    Array<T>  diff2(const Array<T> &in, const int dim)
-    {
-        // Bool for dimension
-        bool is_dim0 = dim == 0;
-        bool is_dim1 = dim == 1;
-        bool is_dim2 = dim == 2;
-        bool is_dim3 = dim == 3;
+namespace arrayfire {
+namespace cpu {
 
-        // Decrement dimension of select dimension
-        af::dim4 dims = in.dims();
-        dims[dim] -= 2;
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim) {
+    // The first-order difference is one element shorter along `dim`
+    af::dim4 dims = in.dims();
+    dims[dim]--;
 
-        // Create output placeholder
-        Array<T> outArray = createValueArray(dims, (T)0);
+    Array<T> outArray = createEmptyArray<T>(dims);
 
-        // Get pointers to raw data
-        const T *inPtr = in.get();
-              T *outPtr = outArray.get();
+    getQueue().enqueue(kernel::diff1<T>, outArray, in, dim);
 
-        // TODO: Improve this
-        for(dim_t l = 0; l < dims[3]; l++) {
-            for(dim_t k = 0; k < dims[2]; k++) {
-                for(dim_t j = 0; j < dims[1]; j++) {
-                    for(dim_t i = 0; i < dims[0]; i++) {
-                        // Operation: out[index] = in[index + 1 * dim_size] - in[index]
-                        int idx = getIdx(in.strides(), in.offsets(), i, j, k, l);
-                        int jdx = getIdx(in.strides(), in.offsets(),
-                                         i + is_dim0, j + is_dim1,
-                                         k + is_dim2, l + is_dim3);
-                        int kdx = getIdx(in.strides(), in.offsets(),
-                                         i + 2 * is_dim0, j + 2 * is_dim1,
-                                         k + 2 * is_dim2, l + 2 * is_dim3);
-                        int odx = getIdx(outArray.strides(), outArray.offsets(), i, j, k, l);
-                        outPtr[odx] = inPtr[kdx] + inPtr[idx] - inPtr[jdx] - inPtr[jdx];
-                    }
-                }
-            }
-        }
+    return outArray;
+}
 
-        return outArray;
-    }
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim) {
+    // The second-order difference is two elements shorter along `dim`
+    af::dim4 dims = in.dims();
+    dims[dim] -= 2;
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T>  diff1<T>  (const Array<T> &in, const int dim);   \
-    template Array<T>  diff2<T>  (const Array<T> &in, const int dim);   \
+    Array<T> outArray = createEmptyArray<T>(dims);
 
+    getQueue().enqueue(kernel::diff2<T>, outArray, in, dim);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    return outArray;
 }
+
+#define INSTANTIATE(T)                                             \
+    template Array<T> diff1<T>(const Array<T> &in, const int dim); \
+    template Array<T> diff2<T>(const Array<T> &in, const int dim);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cpu
+}  // namespace arrayfire
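
Note on the diff.cpp change above: the removed loops computed `in[x + 1] - in[x]` for `diff1` and `in[x + 2] + in[x] - in[x + 1] - in[x + 1]` for `diff2`, and that work now moves into `kernel::diff1` / `kernel::diff2`, whose bodies are not part of this hunk. The sketch below only restates the arithmetic of the removed loops. It assumes a dense column-major buffer with no offsets, and `Dim4`, `linearIndex`, and `diffReference` are hypothetical names, not ArrayFire API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for af::dim4: four extents of a column-major array.
using Dim4 = std::array<std::size_t, 4>;

// Linear offset of a 4-D index in a dense column-major buffer.
inline std::size_t linearIndex(const Dim4 &dims, const Dim4 &idx) {
    return idx[0] + dims[0] * (idx[1] + dims[1] * (idx[2] + dims[2] * idx[3]));
}

// Reference first-order (order == 1) or second-order (order == 2) difference
// along `dim`. The output extents are written to `odims`.
template<typename T>
std::vector<T> diffReference(const std::vector<T> &in, const Dim4 &idims,
                             int dim, int order, Dim4 &odims) {
    assert(order == 1 || order == 2);
    assert(idims[dim] >= static_cast<std::size_t>(order));

    odims = idims;
    odims[dim] -= static_cast<std::size_t>(order);
    std::vector<T> out(odims[0] * odims[1] * odims[2] * odims[3]);

    for (std::size_t l = 0; l < odims[3]; l++) {
        for (std::size_t k = 0; k < odims[2]; k++) {
            for (std::size_t j = 0; j < odims[1]; j++) {
                for (std::size_t i = 0; i < odims[0]; i++) {
                    Dim4 p0 = {i, j, k, l};
                    Dim4 p1 = p0;
                    p1[dim] += 1;  // neighbour one step along `dim`
                    const T a = in[linearIndex(idims, p0)];
                    const T b = in[linearIndex(idims, p1)];
                    T result;
                    if (order == 1) {
                        result = b - a;
                    } else {
                        Dim4 p2 = p0;
                        p2[dim] += 2;  // neighbour two steps along `dim`
                        const T c = in[linearIndex(idims, p2)];
                        // Same form as the removed loop: c + a - b - b
                        result = c + a - b - b;
                    }
                    out[linearIndex(odims, p0)] = result;
                }
            }
        }
    }
    return out;
}
```

For the 4-element column `{1, 4, 9, 16}`, the first-order difference along dimension 0 is `{3, 5, 7}` and the second-order difference is `{2, 2}`.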
diff --git a/src/backend/cpu/diff.hpp b/src/backend/cpu/diff.hpp
index 2556d0d619..7a50aec7c2 100644
--- a/src/backend/cpu/diff.hpp
+++ b/src/backend/cpu/diff.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> diff1(const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim);
 
-    template<typename T>
-    Array<T> diff2(const Array<T> &in, const int dim);
-}
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/err_cpu.hpp b/src/backend/cpu/err_cpu.hpp
index e0359a84b4..58c7b59aab 100644
--- a/src/backend/cpu/err_cpu.hpp
+++ b/src/backend/cpu/err_cpu.hpp
@@ -7,8 +7,10 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <err_common.hpp>
+#include <common/err_common.hpp>
 
-#define CPU_NOT_SUPPORTED() do {                       \
-        throw SupportError(__FILE__, __LINE__, "CPU"); \
-    } while(0)
+#define CPU_NOT_SUPPORTED(message)                                          \
+    do {                                                                    \
+        throw SupportError(__AF_FUNC__, __AF_FILENAME__, __LINE__, "CPU",   \
+                           message, boost::stacktrace::stacktrace());       \
+    } while (0)
diff --git a/src/backend/cpu/exampleFunction.cpp b/src/backend/cpu/exampleFunction.cpp
index a9e7bca9eb..3f677bc24b 100644
--- a/src/backend/cpu/exampleFunction.cpp
+++ b/src/backend/cpu/exampleFunction.cpp
@@ -7,55 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>                    // header with cpu backend specific
-                                        // Array class implementation that inherits
-                                        // ArrayInfo base class
+#include <Array.hpp>  // header with cpu backend specific
+                      // Array class implementation that inherits
+                      // ArrayInfo base class
 
-#include <exampleFunction.hpp>          // cpu backend function header
+#include <exampleFunction.hpp>         // cpu backend function header
+#include <kernel/exampleFunction.hpp>  // Function implementation header
 
-#include <err_cpu.hpp>                  // error check functions and Macros
-                                        // specific to cpu backend
+#include <err_cpu.hpp>  // error check functions and Macros
+                        // specific to cpu backend
+#include <platform.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method)
-{
-    dim4 outputDims;                    // this should be '= in.dims();' in most cases
-                                        // but would definitely depend on the type of
-                                        // algorithm you are implementing.
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method) {
+    dim4 outputDims;  // this should be '= a.dims();' in most cases
+                      // but would definitely depend on the type of
+                      // algorithm you are implementing.
 
     Array<T> out = createEmptyArray<T>(outputDims);
-                                        // Please use the create***Array<T> helper
-                                        // functions defined in Array.hpp to create
-                                        // different types of Arrays. Please check the
-                                        // file to know what are the different types you
-                                        // can create.
+    // Please use the create***Array<T> helper
+    // functions defined in Array.hpp to create
+    // different types of Arrays. Please check that
+    // file to see which types of Arrays you
+    // can create.
 
-    //dim4 in_dims    = in.dims();        // you can retrieve dimensions
+    // Enqueue the function call on the worker thread
+    // This code will be present in src/backend/cpu/kernel/exampleFunction.hpp
+    getQueue().enqueue(kernel::exampleFunction<T>, out, a, b, method);
 
-    //dim4 in_offsets = in.offsets();     // you can retrieve offsets - used when given array
-                                        // is an sub-array pointing to some other array and
-                                        // doesn't have memory of its own
-
-    //dim4 in_strides = in.strides();     // you can retrieve strides
-
-    //const T* src = in.get();            // cpu::Array<T>::get returns the pointer to the
-                                        // memory allocated for that Array
-
-    //T* dst = out.get();
-
-    // Implement your algorithm and write results to dst
-
-    return out;                         // return the result
+    return out;  // return the result
 }
 
-
-#define INSTANTIATE(T)  \
-    template Array<T> exampleFunction<T>(const Array<T> &in, const af_someenum_t method);
+#define INSTANTIATE(T)                                                         \
+    template Array<T> exampleFunction<T>(const Array<T> &a, const Array<T> &b, \
+                                         const af_someenum_t method);
 
 // INSTANTIATIONS for all the types which
 // are present in the switch case statement
@@ -64,10 +56,11 @@ INSTANTIATE(float)
 INSTANTIATE(double)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
 INSTANTIATE(char)
 INSTANTIATE(cfloat)
 INSTANTIATE(cdouble)
 
-}
-
+}  // namespace cpu
+}  // namespace arrayfire
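
Note on the `getQueue().enqueue(...)` pattern used above: the call hands the kernel and its arguments to a queue and returns, so results are only safe to read on the host after a sync (fast.cpp in this patch calls `getQueue().sync()` before touching raw pointers). The sketch below illustrates that contract only. It is a single-threaded stand-in that defers tasks until `sync()`, whereas the comment in the patch says the real call runs on a worker thread. `DeferredQueue` and `scaleKernel` are hypothetical names, not ArrayFire API.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the backend queue. Tasks are stored on enqueue()
// and executed in FIFO order when sync() is called.
class DeferredQueue {
   public:
    // Copies the callable and its arguments into a task; nothing runs yet.
    template<typename F, typename... Args>
    void enqueue(F func, Args... args) {
        tasks_.emplace_back([=]() mutable { func(args...); });
    }

    // Runs every pending task, oldest first, and returns once all are done.
    void sync() {
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop_front();
            task();
        }
    }

   private:
    std::deque<std::function<void()>> tasks_;
};

// Hypothetical kernel: writes in[i] * factor into out[i].
inline void scaleKernel(std::vector<float> *out, const std::vector<float> *in,
                        float factor) {
    out->resize(in->size());
    for (std::size_t i = 0; i < in->size(); i++) {
        (*out)[i] = (*in)[i] * factor;
    }
}
```

With this stand-in, `q.enqueue(scaleKernel, &out, &in, 2.0f)` leaves `out` untouched until `q.sync()` runs the task, which mirrors why host code must sync before reading results.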
diff --git a/src/backend/cpu/exampleFunction.hpp b/src/backend/cpu/exampleFunction.hpp
index 5393c2c495..19a3d151ef 100644
--- a/src/backend/cpu/exampleFunction.hpp
+++ b/src/backend/cpu/exampleFunction.hpp
@@ -7,12 +7,13 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/util.h>
 #include <Array.hpp>
+#include <af/defines.h>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method);
-}
-
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/fast.cpp b/src/backend/cpu/fast.cpp
index 929d48fcc2..ac93345797 100644
--- a/src/backend/cpu/fast.cpp
+++ b/src/backend/cpu/fast.cpp
@@ -7,248 +7,33 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cpu.hpp>
-#include <handle.hpp>
 #include <fast.hpp>
+#include <kernel/fast.hpp>
 
-using af::dim4;
-
-namespace cpu
-{
-
-inline int clamp(int f, int a, int b)
-{
-    return std::max(a, std::min(f, b));
-}
-
-inline int idx_y(int i)
-{
-    if (i >= 8)
-        return clamp(-(i-8-4), -3, 3);
-
-    return clamp(i-4, -3, 3);
-}
-
-inline int idx_x(int i)
-{
-    if (i < 12)
-        return idx_y(i+4);
-
-    return idx_y(i-12);
-}
-
-inline int idx(int y, int x, unsigned idim0)
-{
-    return x * idim0 + y;
-}
-
-// test_greater()
-// Tests if a pixel x > p + thr
-inline int test_greater(float x, float p, float thr)
-{
-    return (x >= p + thr);
-}
-
-// test_smaller()
-// Tests if a pixel x < p - thr
-inline int test_smaller(float x, float p, float thr)
-{
-    return (x <= p - thr);
-}
-
-// test_pixel()
-// Returns -1 when x < p - thr
-// Returns  0 when x >= p - thr && x <= p + thr
-// Returns  1 when x > p + thr
-template<typename T>
-inline int test_pixel(const T* image, const float p, float thr, int y, int x, unsigned idim0)
-{
-    return -test_smaller((float)image[idx(y,x,idim0)], p, thr) | test_greater((float)image[idx(y,x,idim0)], p, thr);
-}
-
-// abs_diff()
-// Returns absolute difference of x and y
-inline int abs_diff(int x, int y)
-{
-    return abs(x - y);
-}
-inline unsigned abs_diff(unsigned x, unsigned y)
-{
-    return (unsigned)abs((int)x - (int)y);
-}
-inline float abs_diff(float x, float y)
-{
-    return fabs(x - y);
-}
-inline double abs_diff(double x, double y)
-{
-    return fabs(x - y);
-}
-
-template<typename T>
-void locate_features(
-    const Array<T> &in,
-    Array<float> &score,
-    Array<float> &x_out,
-    Array<float> &y_out,
-    Array<float> &score_out,
-    unsigned* count,
-    const float thr,
-    const unsigned arc_length,
-    const unsigned nonmax,
-    const unsigned max_feat,
-    const unsigned edge)
-{
-    dim4 in_dims = in.dims();
-    const T* in_ptr = in.get();
-
-    for (int y = edge; y < (int)(in_dims[0] - edge); y++) {
-        for (int x = edge; x < (int)(in_dims[1] - edge); x++) {
-            float p = in_ptr[idx(y, x, in_dims[0])];
-
-            // Start by testing opposite pixels of the circle that will result in
-            // a non-kepoint
-            int d;
-            d  = test_pixel<T>(in_ptr, p, thr, y-3,   x, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y+3,   x, in_dims[0]);
-            if (d == 0)
-                continue;
-
-            d &= test_pixel<T>(in_ptr, p, thr, y-2, x+2, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y+2, x-2, in_dims[0]);
-            d &= test_pixel<T>(in_ptr, p, thr, y  , x+3, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y  , x-3, in_dims[0]);
-            d &= test_pixel<T>(in_ptr, p, thr, y+2, x+2, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y-2, x-2, in_dims[0]);
-            if (d == 0)
-                continue;
-
-            d &= test_pixel<T>(in_ptr, p, thr, y-3, x+1, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y+3, x-1, in_dims[0]);
-            d &= test_pixel<T>(in_ptr, p, thr, y-1, x+3, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y+1, x-3, in_dims[0]);
-            d &= test_pixel<T>(in_ptr, p, thr, y+1, x+3, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y-1, x-3, in_dims[0]);
-            d &= test_pixel<T>(in_ptr, p, thr, y+3, x+1, in_dims[0]) | test_pixel<T>(in_ptr, p, thr, y-3, x-1, in_dims[0]);
-            if (d == 0)
-                continue;
-
-            int sum = 0;
-
-            // Sum responses [-1, 0 or 1] of first arc_length pixels
-            for (int i = 0; i < static_cast<int>(arc_length); i++)
-                sum += test_pixel<T>(in_ptr, p, thr, y+idx_y(i), x+idx_x(i), in_dims[0]);
-
-            // Test maximum and mininmum responses of first segment of arc_length
-            // pixels
-            int max_sum = 0, min_sum = 0;
-            max_sum = std::max(max_sum, sum);
-            min_sum = std::min(min_sum, sum);
-
-            // Sum responses and test the remaining 16-arc_length pixels of the circle
-            for (int i = arc_length; i < 16; i++) {
-                sum -= test_pixel<T>(in_ptr, p, thr, y+idx_y(i-arc_length), x+idx_x(i-arc_length), in_dims[0]);
-                sum += test_pixel<T>(in_ptr, p, thr, y+idx_y(i), x+idx_x(i), in_dims[0]);
-                max_sum = std::max(max_sum, sum);
-                min_sum = std::min(min_sum, sum);
-            }
-
-            // To completely test all possible segments, it's necessary to test
-            // segments that include the top junction of the circle
-            for (int i = 0; i < static_cast<int>(arc_length-1); i++) {
-                sum -= test_pixel<T>(in_ptr, p, thr, y+idx_y(16-arc_length+i), x+idx_x(16-arc_length+i), in_dims[0]);
-                sum += test_pixel<T>(in_ptr, p, thr, y+idx_y(i), x+idx_x(i), in_dims[0]);
-                max_sum = std::max(max_sum, sum);
-                min_sum = std::min(min_sum, sum);
-            }
-
-            float s_bright = 0, s_dark = 0;
-            for (int i = 0; i < 16; i++) {
-                float p_x = (float)in_ptr[idx(y+idx_y(i), x+idx_x(i), in_dims[0])];
-
-                s_bright += test_greater(p_x, p, thr) * (abs_diff(p_x, p) - thr);
-                s_dark   += test_smaller(p_x, p, thr) * (abs_diff(p, p_x) - thr);
-            }
-
-            // If sum at some point was equal to (+-)arc_length, there is a segment
-            // that for which all pixels are much brighter or much brighter than
-            // central pixel p.
-            if (max_sum == static_cast<int>(arc_length) || min_sum == -static_cast<int>(arc_length)) {
-                unsigned j = *count;
-                ++*count;
-                if (j < max_feat) {
-                    float *x_out_ptr = x_out.get();
-                    float *y_out_ptr = y_out.get();
-                    float *score_out_ptr = score_out.get();
-                    x_out_ptr[j]     = static_cast<float>(x);
-                    y_out_ptr[j]     = static_cast<float>(y);
-                    score_out_ptr[j] = static_cast<float>(std::max(s_bright, s_dark));
-                    if (nonmax == 1) {
-                        float* score_ptr = score.get();
-                        score_ptr[idx(y, x, in_dims[0])] = std::max(s_bright, s_dark);
-                    }
-                }
-            }
-        }
-    }
-}
-
-void non_maximal(
-    const Array<float> &score,
-    const Array<float> &x_in,
-    const Array<float> &y_in,
-    Array<float> &x_out,
-    Array<float> &y_out,
-    Array<float> &score_out,
-    unsigned* count,
-    const unsigned total_feat,
-    const unsigned edge)
-{
-    const float *score_ptr = score.get();
-    const float *x_in_ptr = x_in.get();
-    const float *y_in_ptr = y_in.get();
-
-    dim4 score_dims = score.dims();
-
-    for (unsigned k = 0; k < total_feat; k++) {
-        unsigned x = static_cast<unsigned>(round(x_in_ptr[k]));
-        unsigned y = static_cast<unsigned>(round(y_in_ptr[k]));
-
-        float v = score_ptr[y + score_dims[0] * x];
-        float max_v;
-        max_v = std::max(score_ptr[y-1 + score_dims[0] * (x-1)], score_ptr[y-1 + score_dims[0] * x]);
-        max_v = std::max(max_v, score_ptr[y-1 + score_dims[0] * (x+1)]);
-        max_v = std::max(max_v, score_ptr[y   + score_dims[0] * (x-1)]);
-        max_v = std::max(max_v, score_ptr[y   + score_dims[0] * (x+1)]);
-        max_v = std::max(max_v, score_ptr[y+1 + score_dims[0] * (x-1)]);
-        max_v = std::max(max_v, score_ptr[y+1 + score_dims[0] * (x)  ]);
-        max_v = std::max(max_v, score_ptr[y+1 + score_dims[0] * (x+1)]);
-
-        if (y >= score_dims[1] - edge - 1 || y <= edge + 1 ||
-            x >= score_dims[0] - edge - 1 || x <= edge + 1)
-            continue;
+#include <Array.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+#include <cmath>
 
-        // Stores keypoint to feat_out if it's response is maximum compared to
-        // its 8-neighborhood
-        if (v > max_v) {
-            unsigned j = *count;
-            ++*count;
+#include <algorithm>
+#include <cmath>
+#include <cstddef>
 
-            float *x_out_ptr = x_out.get();
-            float *y_out_ptr = y_out.get();
-            float *score_out_ptr = score_out.get();
+using af::dim4;
+using std::ceil;
 
-            x_out_ptr[j]     = static_cast<float>(x);
-            y_out_ptr[j]     = static_cast<float>(y);
-            score_out_ptr[j] = static_cast<float>(v);
-        }
-    }
-}
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
               const bool nonmax, const float feature_ratio,
-              const unsigned edge)
-{
-    dim4 in_dims = in.dims();
+              const unsigned edge) {
+    in.eval();
+
+    dim4 in_dims            = in.dims();
     const unsigned max_feat = ceil(in.elements() * feature_ratio);
 
     // Matrix containing scores for detected features, scores are stored in the
@@ -256,20 +41,22 @@ unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
     Array<float> V = createEmptyArray<float>(dim4());
     if (nonmax == 1) {
         dim4 V_dims(in_dims[0], in_dims[1]);
-        V = createValueArray<float>(V_dims, (float)0);
+        V = createValueArray<float>(V_dims, 0.f);
+        V.eval();
     }
+    getQueue().sync();
 
     // Arrays containing all features detected before non-maximal suppression.
     dim4 max_feat_dims(max_feat);
-    Array<float> x = createEmptyArray<float>(max_feat_dims);
-    Array<float> y = createEmptyArray<float>(max_feat_dims);
+    Array<float> x     = createEmptyArray<float>(max_feat_dims);
+    Array<float> y     = createEmptyArray<float>(max_feat_dims);
     Array<float> score = createEmptyArray<float>(max_feat_dims);
 
     // Feature counter
     unsigned count = 0;
 
-    locate_features<T>(in, V, x, y, score, &count, thr, arc_length,
-                       nonmax, max_feat, edge);
+    kernel::locate_features<T>(in, V, x, y, score, &count, thr, arc_length,
+                               nonmax, max_feat, edge);
 
     // If more features than max_feat were detected, feat wasn't populated
     // with them anyway, so the real number of features will be that of
@@ -277,47 +64,44 @@ unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
     unsigned feat_found = std::min(max_feat, count);
     dim4 feat_found_dims(feat_found);
 
-    Array<float> x_total = createEmptyArray<float>(af::dim4());
-    Array<float> y_total = createEmptyArray<float>(af::dim4());
+    Array<float> x_total     = createEmptyArray<float>(af::dim4());
+    Array<float> y_total     = createEmptyArray<float>(af::dim4());
     Array<float> score_total = createEmptyArray<float>(af::dim4());
 
     if (nonmax == 1) {
-
         x_total     = createEmptyArray<float>(feat_found_dims);
         y_total     = createEmptyArray<float>(feat_found_dims);
         score_total = createEmptyArray<float>(feat_found_dims);
 
         count = 0;
-        non_maximal(V, x, y,
-                    x_total, y_total, score_total,
-                    &count, feat_found, edge);
+        kernel::non_maximal(V, x, y, x_total, y_total, score_total, &count,
+                            feat_found, edge);
 
         feat_found = std::min(max_feat, count);
     } else {
-        x_total = x;
-        y_total = y;
+        x_total     = x;
+        y_total     = y;
         score_total = score;
     }
 
     if (feat_found > 0) {
         feat_found_dims = dim4(feat_found);
 
-        x_out = createEmptyArray<float>(feat_found_dims);
-        y_out = createEmptyArray<float>(feat_found_dims);
+        x_out     = createEmptyArray<float>(feat_found_dims);
+        y_out     = createEmptyArray<float>(feat_found_dims);
         score_out = createEmptyArray<float>(feat_found_dims);
 
-        float *x_total_ptr = x_total.get();
-        float *y_total_ptr = y_total.get();
+        float *x_total_ptr     = x_total.get();
+        float *y_total_ptr     = y_total.get();
         float *score_total_ptr = score_total.get();
 
-
-        float *x_out_ptr = x_out.get();
-        float *y_out_ptr = y_out.get();
+        float *x_out_ptr     = x_out.get();
+        float *y_out_ptr     = y_out.get();
         float *score_out_ptr = score_out.get();
 
         for (size_t i = 0; i < feat_found; i++) {
-            x_out_ptr[i] = x_total_ptr[i];
-            y_out_ptr[i] = y_total_ptr[i];
+            x_out_ptr[i]     = x_total_ptr[i];
+            y_out_ptr[i]     = y_total_ptr[i];
             score_out_ptr[i] = score_total_ptr[i];
         }
     }
@@ -325,16 +109,21 @@ unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
     return feat_found;
 }
 
-#define INSTANTIATE(T)                                                                              \
-    template unsigned fast<T>(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,    \
-                              const Array<T> &in, const float thr, const unsigned arc_length,       \
-                              const bool nonmax, const float feature_ratio, const unsigned edge);
+#define INSTANTIATE(T)                                                        \
+    template unsigned fast<T>(                                                \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const float thr, const unsigned arc_length,       \
+        const bool nonmax, const float feature_ratio, const unsigned edge);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/fast.hpp b/src/backend/cpu/fast.hpp
index 61dca27407..7d22621bb4 100644
--- a/src/backend/cpu/fast.hpp
+++ b/src/backend/cpu/fast.hpp
@@ -7,18 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
-#include <Array.hpp>
-
-using af::features;
-
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+class Array;
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
-              const bool non_max, const float feature_ratio,
+              const bool nonmax, const float feature_ratio,
               const unsigned edge);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/fft.cpp b/src/backend/cpu/fft.cpp
index 37023b47bc..31515d0f99 100644
--- a/src/backend/cpu/fft.cpp
+++ b/src/backend/cpu/fft.cpp
@@ -7,147 +7,227 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <fft.hpp>
-#include <err_cpu.hpp>
-#include <fftw3.h>
+
+#include <Array.hpp>
 #include <copy.hpp>
-#include <math.hpp>
+#include <fftw3.h>
+#include <platform.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
 
-using af::dim4;
+#include <array>
+#include <type_traits>
 
-namespace cpu
-{
+using af::dim4;
+using std::array;
 
-template<int rank>
-void computeDims(int rdims[rank], const dim4 &idims)
-{
-    for (int i = 0; i < rank; i++) {
-        rdims[i] = idims[(rank -1) - i];
-    }
-}
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 struct fftw_transform;
 
-#define TRANSFORM(PRE, TY)                                              \
-    template<>                                                          \
-    struct fftw_transform<TY>                                           \
-    {                                                                   \
-        typedef PRE##_plan plan_t;                                      \
-        typedef PRE##_complex ctype_t;                                  \
-                                                                        \
-        template<typename... Args>                                      \
-            plan_t create(Args... args)                                 \
-        { return PRE##_plan_many_dft(args...); }                        \
-        void execute(plan_t plan) { return PRE##_execute(plan); }       \
-        void destroy(plan_t plan) { return PRE##_destroy_plan(plan); }  \
-    };                                                                  \
-
+#define TRANSFORM(PRE, TY)                                             \
+    template<>                                                         \
+    struct fftw_transform<TY> {                                        \
+        typedef PRE##_plan plan_t;                                     \
+        typedef PRE##_complex ctype_t;                                 \
+                                                                       \
+        template<typename... Args>                                     \
+        plan_t create(Args... args) {                                  \
+            return PRE##_plan_many_dft(args...);                       \
+        }                                                              \
+        void execute(plan_t plan) { return PRE##_execute(plan); }      \
+        void destroy(plan_t plan) { return PRE##_destroy_plan(plan); } \
+    };
 
 TRANSFORM(fftwf, cfloat)
 TRANSFORM(fftw, cdouble)
 
-template<typename T, int rank, int direction>
-void fft_common(Array <T> &out, const Array<T> &in)
-{
-    int in_dims[rank];
-    int in_embed[rank];
-    int out_embed[rank];
+template<typename To, typename Ti>
+struct fftw_real_transform;
+
+#define TRANSFORM_REAL(PRE, To, Ti, POST)                              \
+    template<>                                                         \
+    struct fftw_real_transform<To, Ti> {                               \
+        typedef PRE##_plan plan_t;                                     \
+        typedef PRE##_complex ctype_t;                                 \
+                                                                       \
+        template<typename... Args>                                     \
+        plan_t create(Args... args) {                                  \
+            return PRE##_plan_many_dft_##POST(args...);                \
+        }                                                              \
+        void execute(plan_t plan) { return PRE##_execute(plan); }      \
+        void destroy(plan_t plan) { return PRE##_destroy_plan(plan); } \
+    };
+
+TRANSFORM_REAL(fftwf, cfloat, float, r2c)
+TRANSFORM_REAL(fftw, cdouble, double, r2c)
+TRANSFORM_REAL(fftwf, float, cfloat, c2r)
+TRANSFORM_REAL(fftw, double, cdouble, c2r)
+
+inline array<int, AF_MAX_DIMS> computeDims(const int rank, const dim4 &idims) {
+    array<int, AF_MAX_DIMS> retVal = {};
+    for (int i = 0; i < rank; i++) { retVal[i] = idims[(rank - 1) - i]; }
+    return retVal;
+}
 
-    const dim4 idims = in.dims();
+void setFFTPlanCacheSize(size_t numPlans) { UNUSED(numPlans); }
 
-    computeDims<rank>(in_dims  , idims);
-    computeDims<rank>(in_embed , in.getDataDims());
-    computeDims<rank>(out_embed, out.getDataDims());
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction) {
+    auto func = [=](Param<T> in, const af::dim4 iDataDims) {
+        const af::dim4 idims = in.dims();
 
-    const dim4 istrides = in.strides();
-    const dim4 ostrides = out.strides();
+        auto t_dims   = computeDims(rank, idims);
+        auto in_embed = computeDims(rank, iDataDims);
 
-    typedef typename fftw_transform<T>::ctype_t ctype_t;
-    typename fftw_transform<T>::plan_t plan;
+        const af::dim4 istrides = in.strides();
 
-    fftw_transform<T> transform;
+        using ctype_t = typename fftw_transform<T>::ctype_t;
+        typename fftw_transform<T>::plan_t plan;
 
-    int batch = 1;
-    for (int i = rank; i < 4; i++) {
-        batch *= idims[i];
-    }
+        fftw_transform<T> transform;
 
-    plan = transform.create(rank,
-                            in_dims,
-                            (int)batch,
-                            (ctype_t *)in.get(),
-                            in_embed, (int)istrides[0],
-                            (int)istrides[rank],
-                            (ctype_t *)out.get(),
-                            out_embed, (int)ostrides[0],
-                            (int)ostrides[rank],
-                            direction ? FFTW_FORWARD : FFTW_BACKWARD,
-                            FFTW_ESTIMATE);
-
-    transform.execute(plan);
-    transform.destroy(plan);
+        int batch = 1;
+        for (int i = rank; i < 4; i++) { batch *= idims[i]; }
 
-}
+        plan = transform.create(
+            rank, t_dims.data(), batch, reinterpret_cast<ctype_t *>(in.get()),
+            in_embed.data(), static_cast<int>(istrides[0]),
+            static_cast<int>(istrides[rank]),
+            reinterpret_cast<ctype_t *>(in.get()), in_embed.data(),
+            static_cast<int>(istrides[0]), static_cast<int>(istrides[rank]),
+            direction ? FFTW_FORWARD : FFTW_BACKWARD,
+            FFTW_ESTIMATE);  // NOLINT(hicpp-signed-bitwise)
 
-void computePaddedDims(dim4 &pdims,
-                       const dim4 &idims,
-                       const dim_t npad,
-                       dim_t const * const pad)
-{
-    for (int i = 0; i < 4; i++) {
-        pdims[i] = (i < (int)npad) ? pad[i] : idims[i];
-    }
+        transform.execute(plan);
+        transform.destroy(plan);
+    };
+    getQueue().enqueue(func, in, in.getDataDims());
 }
 
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, rank >= 1 && rank <= 3);
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank) {
+    dim4 odims    = in.dims();
+    odims[0]      = odims[0] / 2 + 1;
+    Array<Tc> out = createEmptyArray<Tc>(odims);
 
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
+    auto func = [=](Param<Tc> out, const af::dim4 oDataDims, CParam<Tr> in,
+                    const af::dim4 iDataDims) {
+        af::dim4 idims = in.dims();
 
-    Array<outType> ret = padArray<inType, outType>(in, pdims);
-    fft_common<outType, rank, true>(ret, ret);
-    return ret;
-}
+        auto t_dims    = computeDims(rank, idims);
+        auto in_embed  = computeDims(rank, iDataDims);
+        auto out_embed = computeDims(rank, oDataDims);
 
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, rank >= 1 && rank <= 3);
+        const af::dim4 istrides = in.strides();
+        const af::dim4 ostrides = out.strides();
 
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
+        using ctype_t = typename fftw_real_transform<Tc, Tr>::ctype_t;
+        using plan_t  = typename fftw_real_transform<Tc, Tr>::plan_t;
+        plan_t plan;
 
-    Array<T> ret = padArray<T, T>(in, pdims, scalar<T>(0), norm_factor);
-    fft_common<T, rank, false>(ret, ret);
+        fftw_real_transform<Tc, Tr> transform;
 
-    return ret;
-}
+        int batch = 1;
+        for (int i = rank; i < 4; i++) { batch *= idims[i]; }
 
-#define INSTANTIATE1(T1, T2)\
-    template Array<T2> fft <T1, T2, 1, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 2, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 3, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+        plan = transform.create(
+            rank, t_dims.data(), batch, const_cast<Tr *>(in.get()),
+            in_embed.data(), static_cast<int>(istrides[0]),
+            static_cast<int>(istrides[rank]),
+            reinterpret_cast<ctype_t *>(out.get()), out_embed.data(),
+            static_cast<int>(ostrides[0]), static_cast<int>(ostrides[rank]),
+            FFTW_ESTIMATE);
 
-INSTANTIATE1(float  , cfloat )
-INSTANTIATE1(double , cdouble)
+        transform.execute(plan);
+        transform.destroy(plan);
+    };
 
-#define INSTANTIATE2(T)\
-    template Array<T> fft <T, T, 1, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 2, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 3, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 1>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 2>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 3>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+    getQueue().enqueue(func, out, out.getDataDims(), in, in.getDataDims());
 
-INSTANTIATE2(cfloat )
-INSTANTIATE2(cdouble)
+    return out;
+}
 
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank) {
+    Array<Tr> out = createEmptyArray<Tr>(odims);
+
+    auto func = [=](Param<Tr> out, const af::dim4 oDataDims, CParam<Tc> in,
+                    const af::dim4 iDataDims, const af::dim4 odims) {
+        auto t_dims    = computeDims(rank, odims);
+        auto in_embed  = computeDims(rank, iDataDims);
+        auto out_embed = computeDims(rank, oDataDims);
+
+        const af::dim4 istrides = in.strides();
+        const af::dim4 ostrides = out.strides();
+
+        using ctype_t = typename fftw_real_transform<Tr, Tc>::ctype_t;
+        using plan_t  = typename fftw_real_transform<Tr, Tc>::plan_t;
+        plan_t plan;
+
+        fftw_real_transform<Tr, Tc> transform;
+
+        int batch = 1;
+        for (int i = rank; i < 4; i++) { batch *= odims[i]; }
+
+        // By default, the FFTW_ESTIMATE flag is sufficient for most
+        // transforms. However, complex-to-real transforms may overwrite the
+        // input buffer while computing the result. To avoid that, we also
+        // pass FFTW_PRESERVE_INPUT. That flag is only honored for 1D
+        // transforms, so for higher-rank transforms a copy of the input
+        // data is passed to the upstream FFTW calls instead.
+        unsigned int flags = FFTW_ESTIMATE;  // NOLINT(hicpp-signed-bitwise)
+        if (rank == 1) {
+            flags |= FFTW_PRESERVE_INPUT;  // NOLINT(hicpp-signed-bitwise)
+        }
+
+        plan = transform.create(
+            rank, t_dims.data(), batch,
+            reinterpret_cast<ctype_t *>(const_cast<Tc *>(in.get())),
+            in_embed.data(), static_cast<int>(istrides[0]),
+            static_cast<int>(istrides[rank]), out.get(), out_embed.data(),
+            static_cast<int>(ostrides[0]), static_cast<int>(ostrides[rank]),
+            flags);
+
+        transform.execute(plan);
+        transform.destroy(plan);
+    };
+
+#ifdef USE_MKL
+    getQueue().enqueue(func, out, out.getDataDims(), in, in.getDataDims(),
+                       odims);
+#else
+    if (rank > 1 || odims.ndims() > 1) {
+        // FFTW has no input-preserving algorithm for multidimensional c2r
+        // FFTs, so run the transform on a copy of the input.
+        Array<Tc> in_ = copyArray<Tc>(in);
+        getQueue().enqueue(func, out, out.getDataDims(), in_, in.getDataDims(),
+                           odims);
+    } else {
+        getQueue().enqueue(func, out, out.getDataDims(), in, in.getDataDims(),
+                           odims);
+    }
+#endif
+
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template void fft_inplace<T>(Array<T> &, const int, const bool);
+
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+#define INSTANTIATE_REAL(Tr, Tc)                                             \
+    template Array<Tc> fft_r2c<Tc, Tr>(const Array<Tr> &, const int);        \
+    template Array<Tr> fft_c2r<Tr, Tc>(const Array<Tc> &in, const dim4 &odi, \
+                                       const int);
+
+INSTANTIATE_REAL(float, cfloat)
+INSTANTIATE_REAL(double, cdouble)
+
+}  // namespace cpu
+}  // namespace arrayfire
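Review note on the shape bookkeeping in `fft_r2c` and `fft_c2r` above: the output extent along dimension 0 and the batch count can be checked in isolation. The following is a minimal standalone sketch, not the ArrayFire code. `Dims`, `r2cOutputDims`, and `batchCount` are hypothetical names introduced here for illustration, and a plain `std::array` stands in for `af::dim4`.

```cpp
#include <array>

// Hypothetical stand-in for af::dim4: four extents, fastest-varying first.
using Dims = std::array<long long, 4>;

// Real-to-complex output shape. The spectrum of a real signal is Hermitian
// symmetric, so only n / 2 + 1 complex values are stored along dimension 0.
inline Dims r2cOutputDims(Dims idims) {
    idims[0] = idims[0] / 2 + 1;
    return idims;
}

// Number of independent transforms in one batched plan: the product of all
// extents at index `rank` and beyond (the non-transformed dimensions).
inline long long batchCount(const Dims &dims, int rank) {
    long long batch = 1;
    for (int i = rank; i < 4; i++) { batch *= dims[i]; }
    return batch;
}
```

For example, a rank-1 transform over an 8x4x3x2 array runs 4*3*2 = 24 independent 1D transforms, and a real input of length 8 yields 5 complex outputs per transform.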
diff --git a/src/backend/cpu/fft.hpp b/src/backend/cpu/fft.hpp
index 252690c817..383690ca21 100644
--- a/src/backend/cpu/fft.hpp
+++ b/src/backend/cpu/fft.hpp
@@ -9,13 +9,24 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+#include <cstddef>
 
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+namespace af {
+class dim4;
+}
 
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+namespace arrayfire {
+namespace cpu {
 
-}
+void setFFTPlanCacheSize(size_t numPlans);
+
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction);
+
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank);
+
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/fftconvolve.cpp b/src/backend/cpu/fftconvolve.cpp
index 9e52879d72..ff2e5b68c4 100644
--- a/src/backend/cpu/fftconvolve.cpp
+++ b/src/backend/cpu/fftconvolve.cpp
@@ -7,423 +7,213 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
+#include <fftconvolve.hpp>
+
 #include <Array.hpp>
-#include <dispatch.hpp>
-#include <fft.hpp>
-#include <err_cpu.hpp>
+#include <common/dispatch.hpp>
 #include <fftw3.h>
-#include <copy.hpp>
-#include <convolve_common.hpp>
-
-namespace cpu
-{
-
-template<typename To, typename Ti>
-void packData(To* out_ptr, const af::dim4& od, const af::dim4& os,
-              Array<Ti> const& in)
-{
-    const af::dim4 id = in.dims();
-    const af::dim4 is = in.strides();
-    const Ti* in_ptr = in.get();
-
-    int id0_half = divup(id[0], 2);
-    bool odd_id0 = (id[0] % 2 == 1);
-
-    for (int d3 = 0; d3 < (int)od[3]; d3++) {
-        for (int d2 = 0; d2 < (int)od[2]; d2++) {
-            for (int d1 = 0; d1 < (int)od[1]; d1++) {
-                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
-                    const dim_t oidx = d3*os[3] + d2*os[2] + d1*os[1] + d0*2;
-
-                    if (d0 < (int)id0_half && d1 < (int)id[1] && d2 < (int)id[2] && d3 < (int)id[3]) {
-                        const dim_t iidx = d3*is[3] + d2*is[2] + d1*is[1] + d0;
-                        out_ptr[oidx]   = (To)in_ptr[iidx];
-                        if (d0 == id0_half-1 && odd_id0)
-                            out_ptr[oidx+1] = (To)0;
-                        else
-                            out_ptr[oidx+1] = (To)in_ptr[iidx+id0_half];
-                    }
-                    else {
-                        // Pad remaining elements with 0s
-                        out_ptr[oidx]   = (To)0;
-                        out_ptr[oidx+1] = (To)0;
-                    }
-                }
-            }
-        }
-    }
-}
+#include <kernel/fftconvolve.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
-template<typename To, typename Ti>
-void padArray(To* out_ptr, const af::dim4& od, const af::dim4& os,
-              Array<Ti> const& in)
-{
-    const af::dim4 id = in.dims();
-    const af::dim4 is = in.strides();
-    const Ti* in_ptr = in.get();
-
-    for (int d3 = 0; d3 < (int)od[3]; d3++) {
-        for (int d2 = 0; d2 < (int)od[2]; d2++) {
-            for (int d1 = 0; d1 < (int)od[1]; d1++) {
-                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
-                    const dim_t oidx = d3*os[3] + d2*os[2] + d1*os[1] + d0*2;
-
-                    if (d0 < (int)id[0] && d1 < (int)id[1] && d2 < (int)id[2] && d3 < (int)id[3]) {
-                        // Copy input elements to real elements, set imaginary elements to 0
-                        const dim_t iidx = d3*is[3] + d2*is[2] + d1*is[1] + d0;
-                        out_ptr[oidx]   = (To)in_ptr[iidx];
-                        out_ptr[oidx+1] = (To)0;
-                    }
-                    else {
-                        // Pad remaining of the matrix to 0s
-                        out_ptr[oidx]   = (To)0;
-                        out_ptr[oidx+1] = (To)0;
-                    }
-                }
-            }
-        }
-    }
-}
+#include <array>
+#include <cmath>
+#include <functional>
+#include <type_traits>
 
-template<typename T>
-void complexMultiply(T* out_ptr, const af::dim4& od, const af::dim4& os,
-                     T* in1_ptr, const af::dim4& i1d, const af::dim4& i1s,
-                     T* in2_ptr, const af::dim4& i2d, const af::dim4& i2s,
-                     ConvolveBatchKind kind)
-{
-    for (int d3 = 0; d3 < (int)od[3]; d3++) {
-        for (int d2 = 0; d2 < (int)od[2]; d2++) {
-            for (int d1 = 0; d1 < (int)od[1]; d1++) {
-                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
-                    if (kind == ONE2ONE || kind == MANY2MANY) {
-                        // Complex multiply each signal to equivalent filter
-                        const int ridx = d3*os[3] + d2*os[2] + d1*os[1] + d0*2;
-                        const int iidx = ridx + 1;
-
-                        T a = in1_ptr[ridx];
-                        T b = in1_ptr[iidx];
-                        T c = in2_ptr[ridx];
-                        T d = in2_ptr[iidx];
-
-                        T ac = a*c;
-                        T bd = b*d;
-
-                        out_ptr[ridx] = ac - bd;
-                        out_ptr[iidx] = (a+b) * (c+d) - ac - bd;
-                    }
-                    else if (kind == MANY2ONE) {
-                        // Complex multiply all signals to filter
-                        const int ridx1 = d3*os[3] + d2*os[2] + d1*os[1] + d0*2;
-                        const int iidx1 = ridx1 + 1;
-                        const int ridx2 = ridx1 % (i2s[3] * i2d[3]);
-                        const int iidx2 = iidx1 % (i2s[3] * i2d[3]);
-
-                        T a = in1_ptr[ridx1];
-                        T b = in1_ptr[iidx1];
-                        T c = in2_ptr[ridx2];
-                        T d = in2_ptr[iidx2];
-
-                        T ac = a*c;
-                        T bd = b*d;
-
-                        out_ptr[ridx1] = ac - bd;
-                        out_ptr[iidx1] = (a+b) * (c+d) - ac - bd;
-                    }
-                    else if (kind == ONE2MANY) {
-                        // Complex multiply signal to all filters
-                        const int ridx2 = d3*os[3] + d2*os[2] + d1*os[1] + d0*2;
-                        const int iidx2 = ridx2 + 1;
-                        const int ridx1 = ridx2 % (i1s[3] * i1d[3]);
-                        const int iidx1 = iidx2 % (i1s[3] * i1d[3]);
-
-                        T a = in1_ptr[ridx1];
-                        T b = in1_ptr[iidx1];
-                        T c = in2_ptr[ridx2];
-                        T d = in2_ptr[iidx2];
-
-                        T ac = a*c;
-                        T bd = b*d;
-
-                        out_ptr[ridx2] = ac - bd;
-                        out_ptr[iidx2] = (a+b) * (c+d) - ac - bd;
-                    }
-                }
-            }
-        }
-    }
-}
+using af::dim4;
+using std::array;
+using std::ceil;
 
-template<typename To, typename Ti, bool roundOut>
-void reorderOutput(To* out_ptr, const af::dim4& od, const af::dim4& os,
-                   const Ti* in_ptr, const af::dim4& id, const af::dim4& is,
-                   const af::dim4& fd, const int half_di0, const int baseDim,
-                   const int fftScale, const bool expand)
-{
-    for (int d3 = 0; d3 < (int)od[3]; d3++) {
-        for (int d2 = 0; d2 < (int)od[2]; d2++) {
-            for (int d1 = 0; d1 < (int)od[1]; d1++) {
-                for (int d0 = 0; d0 < (int)od[0]; d0++) {
-                    int id0, id1, id2, id3;
-                    if (expand) {
-                        id0 = d0;
-                        id1 = d1 * is[1];
-                        id2 = d2 * is[2];
-                        id3 = d3 * is[3];
-                    }
-                    else {
-                        id0 = d0 + fd[0]/2;
-                        id1 = (d1 + (baseDim > 1)*(fd[1]/2)) * is[1];
-                        id2 = (d2 + (baseDim > 2)*(fd[2]/2)) * is[2];
-                        id3 = d3 * is[3];
-                    }
-
-                    int oidx = d3*os[3] + d2*os[2] + d1*os[1] + d0;
-
-                    // Divide output elements to cuFFT resulting scale, round result if output
-                    // type is single or double precision floating-point
-                    if (id0 < half_di0) {
-                        // Copy top elements
-                        int iidx = id3 + id2 + id1 + id0 * 2;
-                        if (roundOut)
-                            out_ptr[oidx] = (To)roundf((float)(in_ptr[iidx] / fftScale));
-                        else
-                            out_ptr[oidx] = (To)(in_ptr[iidx] / fftScale);
-                    }
-                    else if (id0 < half_di0 + (int)fd[0] - 1) {
-                        // Add signal and filter elements to central part
-                        int iidx1 = id3 + id2 + id1 + id0 * 2;
-                        int iidx2 = id3 + id2 + id1 + (id0 - half_di0) * 2 + 1;
-                        if (roundOut)
-                            out_ptr[oidx] = (To)roundf((float)((in_ptr[iidx1] + in_ptr[iidx2]) / fftScale));
-                        else
-                            out_ptr[oidx] = (To)((in_ptr[iidx1] + in_ptr[iidx2]) / fftScale);
-                    }
-                    else {
-                        // Copy bottom elements
-                        const int iidx = id3 + id2 + id1 + (id0 - half_di0) * 2 + 1;
-                        if (roundOut)
-                            out_ptr[oidx] = (To)roundf((float)(in_ptr[iidx] / fftScale));
-                        else
-                            out_ptr[oidx] = (To)(in_ptr[iidx] / fftScale);
-                    }
-                }
-            }
-        }
-    }
-}
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename convT>
+using reorderFunc = std::function<void(
+    Param<T> out, Param<convT> packed, CParam<T> filter,
+    const dim_t sig_half_d0, const dim_t fftScale, const dim4 sig_tmp_dims,
+    const dim4 sig_tmp_strides, const dim4 filter_tmp_dims,
+    const dim4 filter_tmp_strides, AF_BATCH_KIND kind)>;
 
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
+template<typename T>
 Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
-                     const bool expand, ConvolveBatchKind kind)
-{
-    const af::dim4 sd = signal.dims();
-    const af::dim4 fd = filter.dims();
+                     const bool expand, AF_BATCH_KIND kind, const int rank) {
+    using convT = typename std::conditional<std::is_integral<T>::value ||
+                                                std::is_same<T, float>::value,
+                                            float, double>::type;
+
+    constexpr bool IsTypeDouble = std::is_same<T, double>::value;
 
+    const dim4& sd = signal.dims();
+    const dim4& fd = filter.dims();
     dim_t fftScale = 1;
 
-    af::dim4 packed_dims;
-    int fft_dims[baseDim];
-    af::dim4 sig_tmp_dims, sig_tmp_strides;
-    af::dim4 filter_tmp_dims, filter_tmp_strides;
+    dim4 packedDims(1, 1, 1, 1);
+    array<int, AF_MAX_DIMS> fftDims{};  // AF_MAX_DIMS(4) > rank
 
     // Pack both signal and filter on same memory array, this will ensure
-    // better use of batched cuFFT capabilities
-    for (dim_t k = 0; k < 4; k++) {
-        if (k < baseDim)
-            packed_dims[k] = nextpow2((unsigned)(sd[k] + fd[k] - 1));
-        else if (k == baseDim)
-            packed_dims[k] = sd[k] + fd[k];
-        else
-            packed_dims[k] = 1;
-
-        if (k < baseDim) {
-            fft_dims[baseDim-k-1] = (k == 0) ? packed_dims[k] / 2 : packed_dims[k];
-            fftScale *= fft_dims[baseDim-k-1];
-        }
+    // better use of batched FFT capabilities
+    fftDims[rank - 1] = nextpow2(
+        static_cast<unsigned>(static_cast<int>(ceil(sd[0] / 2.f)) + fd[0] - 1));
+    packedDims[0] = 2 * fftDims[rank - 1];
+    fftScale *= fftDims[rank - 1];
+
+    for (int k = 1; k < rank; k++) {
+        packedDims[k] = nextpow2(static_cast<unsigned>(sd[k] + fd[k] - 1));
+        fftDims[rank - k - 1] = packedDims[k];
+        fftScale *= fftDims[rank - k - 1];
     }
 
-    Array<convT> packed = createEmptyArray<convT>(packed_dims);
-    convT *packed_ptr = packed.get();
-
-    const af::dim4 packed_strides = packed.strides();
-
-    sig_tmp_dims[0]    = filter_tmp_dims[0] = packed_dims[0];
-    sig_tmp_strides[0] = filter_tmp_strides[0] = 1;
-
-    for (dim_t k = 1; k < 4; k++) {
-        if (k < baseDim) {
-            sig_tmp_dims[k]    = packed_dims[k];
-            filter_tmp_dims[k] = packed_dims[k];
-        }
-        else {
-            sig_tmp_dims[k]    = sd[k];
-            filter_tmp_dims[k] = fd[k];
-        }
-
-        sig_tmp_strides[k]    = sig_tmp_strides[k - 1] * sig_tmp_dims[k - 1];
-        filter_tmp_strides[k] = filter_tmp_strides[k - 1] * filter_tmp_dims[k - 1];
+    dim_t sbatch = 1, fbatch = 1;
+    for (int k = rank; k < AF_MAX_DIMS; k++) {
+        sbatch *= sd[k];
+        fbatch *= fd[k];
     }
+    packedDims[rank] = (sbatch + fbatch);
+
+    Array<convT> packed = createEmptyArray<convT>(packedDims);
 
-    // Calculate memory offsets for packed signal and filter
-    convT *sig_tmp_ptr    = packed_ptr;
-    convT *filter_tmp_ptr = packed_ptr + sig_tmp_strides[3] * sig_tmp_dims[3];
+    dim4 paddedSigDims(packedDims[0], (1 < rank ? packedDims[1] : sd[1]),
+                       (2 < rank ? packedDims[2] : sd[2]),
+                       (3 < rank ? packedDims[3] : sd[3]));
+    dim4 paddedFilDims(packedDims[0], (1 < rank ? packedDims[1] : fd[1]),
+                       (2 < rank ? packedDims[2] : fd[2]),
+                       (3 < rank ? packedDims[3] : fd[3]));
+    dim4 paddedSigStrides = calcStrides(paddedSigDims);
+    dim4 paddedFilStrides = calcStrides(paddedFilDims);
 
     // Number of packed complex elements in dimension 0
     dim_t sig_half_d0 = divup(sd[0], 2);
 
     // Pack signal in a complex matrix where first dimension is half the input
     // (allows faster FFT computation) and pad array to a power of 2 with 0s
-    packData<convT, T>(sig_tmp_ptr, sig_tmp_dims, sig_tmp_strides, signal);
+    getQueue().enqueue(kernel::packData<convT, T>, packed, paddedSigDims,
+                       paddedSigStrides, signal);
 
     // Pad filter array with 0s
-    padArray<convT, T>(filter_tmp_ptr, filter_tmp_dims, filter_tmp_strides, filter);
-
-    // Compute forward FFT
-    if (isDouble) {
-        fftw_plan plan = fftw_plan_many_dft(baseDim,
-                                            fft_dims,
-                                            packed_dims[baseDim],
-                                            (fftw_complex*)packed.get(),
-                                            NULL,
-                                            packed_strides[0],
-                                            packed_strides[baseDim] / 2,
-                                            (fftw_complex*)packed.get(),
-                                            NULL,
-                                            packed_strides[0],
-                                            packed_strides[baseDim] / 2,
-                                            FFTW_FORWARD,
-                                            FFTW_ESTIMATE);
-
-        fftw_execute(plan);
-        fftw_destroy_plan(plan);
-    }
-    else {
-        fftwf_plan plan = fftwf_plan_many_dft(baseDim,
-                                              fft_dims,
-                                              packed_dims[baseDim],
-                                              (fftwf_complex*)packed.get(),
-                                              NULL,
-                                              packed_strides[0],
-                                              packed_strides[baseDim] / 2,
-                                              (fftwf_complex*)packed.get(),
-                                              NULL,
-                                              packed_strides[0],
-                                              packed_strides[baseDim] / 2,
-                                              FFTW_FORWARD,
-                                              FFTW_ESTIMATE);
-
-        fftwf_execute(plan);
-        fftwf_destroy_plan(plan);
-    }
+    const dim_t offset = paddedSigStrides[3] * paddedSigDims[3];
+    getQueue().enqueue(kernel::padArray<convT, T>, packed, paddedFilDims,
+                       paddedFilStrides, filter, offset);
+
+    // NOLINTNEXTLINE(performance-unnecessary-value-param)
+    auto upstream_dft = [=](Param<convT> packed,
+                            const array<int, AF_MAX_DIMS> fftDims) {
+        const dim4 packedDims     = packed.dims();
+        const dim4 packed_strides = packed.strides();
+        // Compute forward FFT
+        if (IsTypeDouble) {
+            fftw_plan plan = fftw_plan_many_dft(
+                rank, fftDims.data(), packedDims[rank],
+                reinterpret_cast<fftw_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2,
+                reinterpret_cast<fftw_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2, FFTW_FORWARD,
+                FFTW_ESTIMATE);  // NOLINT(hicpp-signed-bitwise)
+
+            fftw_execute(plan);
+            fftw_destroy_plan(plan);
+        } else {
+            fftwf_plan plan = fftwf_plan_many_dft(
+                rank, fftDims.data(), packedDims[rank],
+                reinterpret_cast<fftwf_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2,
+                reinterpret_cast<fftwf_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2, FFTW_FORWARD,
+                FFTW_ESTIMATE);  // NOLINT(hicpp-signed-bitwise)
+
+            fftwf_execute(plan);
+            fftwf_destroy_plan(plan);
+        }
+    };
+    getQueue().enqueue(upstream_dft, packed, fftDims);
 
     // Multiply filter and signal FFT arrays
-    if (kind == ONE2MANY)
-        complexMultiply<convT>(filter_tmp_ptr, filter_tmp_dims, filter_tmp_strides,
-                               sig_tmp_ptr, sig_tmp_dims, sig_tmp_strides,
-                               filter_tmp_ptr, filter_tmp_dims, filter_tmp_strides,
-                               kind);
-    else
-        complexMultiply<convT>(sig_tmp_ptr, sig_tmp_dims, sig_tmp_strides,
-                               sig_tmp_ptr, sig_tmp_dims, sig_tmp_strides,
-                               filter_tmp_ptr, filter_tmp_dims, filter_tmp_strides,
-                               kind);
-
-    // Compute inverse FFT
-    if (isDouble) {
-        fftw_plan plan = fftw_plan_many_dft(baseDim,
-                                            fft_dims,
-                                            packed_dims[baseDim],
-                                            (fftw_complex*)packed.get(),
-                                            NULL,
-                                            packed_strides[0],
-                                            packed_strides[baseDim] / 2,
-                                            (fftw_complex*)packed.get(),
-                                            NULL,
-                                            packed_strides[0],
-                                            packed_strides[baseDim] / 2,
-                                            FFTW_BACKWARD,
-                                            FFTW_ESTIMATE);
-
-        fftw_execute(plan);
-        fftw_destroy_plan(plan);
-    }
-    else {
-        fftwf_plan plan = fftwf_plan_many_dft(baseDim,
-                                              fft_dims,
-                                              packed_dims[baseDim],
-                                              (fftwf_complex*)packed.get(),
-                                              NULL,
-                                              packed_strides[0],
-                                              packed_strides[baseDim] / 2,
-                                              (fftwf_complex*)packed.get(),
-                                              NULL,
-                                              packed_strides[0],
-                                              packed_strides[baseDim] / 2,
-                                              FFTW_BACKWARD,
-                                              FFTW_ESTIMATE);
-
-        fftwf_execute(plan);
-        fftwf_destroy_plan(plan);
-    }
+    getQueue().enqueue(kernel::complexMultiply<convT>, packed, paddedSigDims,
+                       paddedSigStrides, paddedFilDims, paddedFilStrides, kind,
+                       offset);
+
+    // NOLINTNEXTLINE(performance-unnecessary-value-param)
+    auto upstream_idft = [=](Param<convT> packed,
+                             const array<int, AF_MAX_DIMS> fftDims) {
+        const dim4 packedDims     = packed.dims();
+        const dim4 packed_strides = packed.strides();
+        // Compute inverse FFT
+        if (IsTypeDouble) {
+            fftw_plan plan = fftw_plan_many_dft(
+                rank, fftDims.data(), packedDims[rank],
+                reinterpret_cast<fftw_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2,
+                reinterpret_cast<fftw_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2, FFTW_BACKWARD,
+                FFTW_ESTIMATE);  // NOLINT(hicpp-signed-bitwise)
+
+            fftw_execute(plan);
+            fftw_destroy_plan(plan);
+        } else {
+            fftwf_plan plan = fftwf_plan_many_dft(
+                rank, fftDims.data(), packedDims[rank],
+                reinterpret_cast<fftwf_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2,
+                reinterpret_cast<fftwf_complex*>(packed.get()), nullptr,
+                packed_strides[0], packed_strides[rank] / 2, FFTW_BACKWARD,
+                FFTW_ESTIMATE);  // NOLINT(hicpp-signed-bitwise)
+
+            fftwf_execute(plan);
+            fftwf_destroy_plan(plan);
+        }
+    };
+    getQueue().enqueue(upstream_idft, packed, fftDims);
 
     // Compute output dimensions
     dim4 oDims(1);
     if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sd[d]+fd[d]-1;
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sd[d] + fd[d] - 1;
             } else {
-                oDims[d] = (d<baseDim ? sd[d]+fd[d]-1 : sd[d]);
+                oDims[d] = (d < rank ? sd[d] + fd[d] - 1 : sd[d]);
             }
         }
     } else {
         oDims = sd;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fd[i];
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fd[i]; }
         }
     }
 
     Array<T> out = createEmptyArray<T>(oDims);
-    T* out_ptr = out.get();
-    const af::dim4 out_dims = out.dims();
-    const af::dim4 out_strides = out.strides();
-
-    const af::dim4 filter_dims = filter.dims();
-
-    // Reorder the output
-    if (kind == ONE2MANY) {
-        reorderOutput<T, convT, roundOut>
-            (out_ptr, out_dims, out_strides,
-             filter_tmp_ptr, filter_tmp_dims, filter_tmp_strides,
-             filter_dims, sig_half_d0, baseDim, fftScale, expand);
-    }
-    else {
-        reorderOutput<T, convT, roundOut>
-            (out_ptr, out_dims, out_strides,
-             sig_tmp_ptr, sig_tmp_dims, sig_tmp_strides,
-             filter_dims, sig_half_d0, baseDim, fftScale, expand);
-    }
+
+    static const reorderFunc<T, convT> funcs[6] = {
+        kernel::reorder<T, convT, 1, false>,
+        kernel::reorder<T, convT, 2, false>,
+        kernel::reorder<T, convT, 3, false>,
+        kernel::reorder<T, convT, 1, true>,
+        kernel::reorder<T, convT, 2, true>,
+        kernel::reorder<T, convT, 3, true>,
+    };
+
+    getQueue().enqueue(funcs[expand * 3 + (rank - 1)], out, packed, filter,
+                       sig_half_d0, fftScale, paddedSigDims, paddedSigStrides,
+                       paddedFilDims, paddedFilStrides, kind);
 
     return out;
 }
 
-#define INSTANTIATE(T, convT, cT, isDouble, roundOut)                                                   \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 1>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 2>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 3>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-INSTANTIATE(double, double, cdouble, true , false)
-INSTANTIATE(float , float,  cfloat,  false, false)
-INSTANTIATE(uint  , float,  cfloat,  false, true)
-INSTANTIATE(int   , float,  cfloat,  false, true)
-INSTANTIATE(uchar , float,  cfloat,  false, true)
-INSTANTIATE(char  , float,  cfloat,  false, true)
-
-} // namespace cpu
+#define INSTANTIATE(T)                                                 \
+    template Array<T> fftconvolve<T>(Array<T> const&, Array<T> const&, \
+                                     const bool, AF_BATCH_KIND, const int);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(int)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(uintl)
+INSTANTIATE(intl)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/fftconvolve.hpp b/src/backend/cpu/fftconvolve.hpp
index db76b69c8a..8a21fbe958 100644
--- a/src/backend/cpu/fftconvolve.hpp
+++ b/src/backend/cpu/fftconvolve.hpp
@@ -8,12 +8,12 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-}
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/flood_fill.cpp b/src/backend/cpu/flood_fill.cpp
new file mode 100644
index 0000000000..2ea32df803
--- /dev/null
+++ b/src/backend/cpu/flood_fill.cpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <flood_fill.hpp>
+
+#include <err_cpu.hpp>
+#include <kernel/flood_fill.hpp>
+
+using af::connectivity;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup) {
+    auto out = createValueArray(image.dims(), T(0));
+    getQueue().enqueue(kernel::floodFill<T>, out, image, seedsX, seedsY,
+                       newValue, lowValue, highValue, nlookup);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> floodFill(const Array<T>&, const Array<uint>&,           \
+                                const Array<uint>&, const T, const T, const T, \
+                                const af::connectivity);
+
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(ushort)
+INSTANTIATE(uchar)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/flood_fill.hpp b/src/backend/cpu/flood_fill.hpp
new file mode 100644
index 0000000000..8ac52fbec1
--- /dev/null
+++ b/src/backend/cpu/flood_fill.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup = AF_CONNECTIVITY_8);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/gradient.cpp b/src/backend/cpu/gradient.cpp
index 8ab2fe46fc..d328e9f7e4 100644
--- a/src/backend/cpu/gradient.cpp
+++ b/src/backend/cpu/gradient.cpp
@@ -8,87 +8,30 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <err_cpu.hpp>
 #include <gradient.hpp>
+#include <kernel/gradient.hpp>
 #include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in)
-    {
-        const af::dim4 dims = in.dims();
-
-        T *d_grad0    = grad0.get();
-        T *d_grad1    = grad1.get();
-        const T *d_in = in.get();
 
-        const af::dim4 inst = in.strides();
-        const af::dim4 g0st = grad0.strides();
-        const af::dim4 g1st = grad1.strides();
+namespace arrayfire {
+namespace cpu {
 
-        T v5 = scalar<T>(0.5);
-        T v1 = scalar<T>(1.0);
-
-        for(dim_t idw = 0; idw < dims[3]; idw++) {
-            const dim_t inW = idw * inst[3];
-            const dim_t g0W = idw * g0st[3];
-            const dim_t g1W = idw * g1st[3];
-            for(dim_t idz = 0; idz < dims[2]; idz++) {
-                const dim_t inZW = inW + idz * inst[2];
-                const dim_t g0ZW = g0W + idz * g0st[2];
-                const dim_t g1ZW = g1W + idz * g1st[2];
-                dim_t xl, xr, yl,yr;
-                T f0, f1;
-                for(dim_t idy = 0; idy < dims[1]; idy++) {
-                    const dim_t inYZW = inZW + idy * inst[1];
-                    const dim_t g0YZW = g0ZW + idy * g0st[1];
-                    const dim_t g1YZW = g1ZW + idy * g1st[1];
-                    if(idy == 0) {
-                        yl = inYZW + inst[1];
-                        yr = inYZW;
-                        f1 = v1;
-                    } else if(idy == dims[1] - 1) {
-                        yl = inYZW;
-                        yr = inYZW - inst[1];
-                        f1 = v1;
-                    } else {
-                        yl = inYZW + inst[1];
-                        yr = inYZW - inst[1];
-                        f1 = v5;
-                    }
-                    for(dim_t idx = 0; idx < dims[0]; idx++) {
-                        const dim_t inMem = inYZW + idx;
-                        const dim_t g0Mem = g0YZW + idx;
-                        const dim_t g1Mem = g1YZW + idx;
-                        if(idx == 0) {
-                            xl = inMem + 1;
-                            xr = inMem;
-                            f0 = v1;
-                        } else if(idx == dims[0] - 1) {
-                            xl = inMem;
-                            xr = inMem - 1;
-                            f0 = v1;
-                        } else {
-                            xl = inMem + 1;
-                            xr = inMem - 1;
-                            f0 = v5;
-                        }
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in) {
+    getQueue().enqueue(kernel::gradient<T>, grad0, grad1, in);
+}
 
-                        d_grad0[g0Mem] = f0 * (d_in[xl] - d_in[xr]);
-                        d_grad1[g1Mem] = f1 * (d_in[yl + idx] - d_in[yr + idx]);
-                    }
-                }
-            }
-        }
-    }
+#define INSTANTIATE(T)                                            \
+    template void gradient<T>(Array<T> & grad0, Array<T> & grad1, \
+                              const Array<T> &in);
 
-#define INSTANTIATE(T)                                                                  \
-    template void gradient<T>(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);    \
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/gradient.hpp b/src/backend/cpu/gradient.hpp
index 0fc8690634..d73ecafccf 100644
--- a/src/backend/cpu/gradient.hpp
+++ b/src/backend/cpu/gradient.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/hamming.cpp b/src/backend/cpu/hamming.cpp
deleted file mode 100644
index 544883bc21..0000000000
--- a/src/backend/cpu/hamming.cpp
+++ /dev/null
@@ -1,103 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cpu.hpp>
-#include <handle.hpp>
-
-using af::dim4;
-
-namespace cpu
-{
-
-#if defined(_WIN32) || defined(_MSC_VER)
-
-#include <intrin.h>
-#define __builtin_popcount __popcnt
-
-#endif
-
-template<typename T>
-inline uint hamming_distance(T v1, T v2)
-{
-    return __builtin_popcount(v1 ^ v2);
-}
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist)
-{
-    uint sample_dim = (dist_dim == 0) ? 1 : 0;
-    const dim4 qDims = query.dims();
-    const dim4 tDims = train.dims();
-
-    if (n_dist > 1) {
-        CPU_NOT_SUPPORTED();
-    }
-
-    const unsigned distLength = qDims[dist_dim];
-    const unsigned nQuery = qDims[sample_dim];
-    const unsigned nTrain = tDims[sample_dim];
-
-    const dim4 outDims(n_dist, nQuery);
-
-    idx  = createEmptyArray<uint>(outDims);
-    dist = createEmptyArray<uint>(outDims);
-
-    const T* qPtr = query.get();
-    const T* tPtr = train.get();
-    uint* iPtr = idx.get();
-    uint* dPtr = dist.get();
-
-    for (unsigned i = 0; i < nQuery; i++) {
-        unsigned best_dist = limit_max<unsigned>();
-        unsigned best_idx  = 0;
-
-        for (unsigned j = 0; j < nTrain; j++) {
-            unsigned local_dist = 0;
-            for (unsigned k = 0; k < distLength; k++) {
-                size_t qIdx, tIdx;
-                if (sample_dim == 0) {
-                    qIdx = k * qDims[0] + i;
-                    tIdx = k * tDims[0] + j;
-                }
-                else {
-                    qIdx = i * qDims[0] + k;
-                    tIdx = j * tDims[0] + k;
-                }
-
-                local_dist += hamming_distance(qPtr[qIdx], tPtr[tIdx]);
-            }
-
-            if (local_dist < best_dist) {
-                best_dist = local_dist;
-                best_idx  = j;
-            }
-        }
-
-        size_t oIdx;
-        oIdx = i;
-        iPtr[oIdx] = best_idx;
-        dPtr[oIdx] = best_dist;
-    }
-}
-
-#define INSTANTIATE(T)\
-    template void hamming_matcher<T>(Array<uint>& idx, Array<uint>& dist,           \
-                                     const Array<T>& query, const Array<T>& train,  \
-                                     const uint dist_dim, const uint n_dist);
-
-INSTANTIATE(uchar)
-INSTANTIATE(uint)
-
-}
diff --git a/src/backend/cpu/hamming.hpp b/src/backend/cpu/hamming.hpp
deleted file mode 100644
index 3787a68283..0000000000
--- a/src/backend/cpu/hamming.hpp
+++ /dev/null
@@ -1,20 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-
-namespace cpu
-{
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist);
-
-}
diff --git a/src/backend/cpu/harris.cpp b/src/backend/cpu/harris.cpp
new file mode 100644
index 0000000000..cf7f41ecbf
--- /dev/null
+++ b/src/backend/cpu/harris.cpp
@@ -0,0 +1,152 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <convolve.hpp>
+#include <gradient.hpp>
+#include <harris.hpp>
+#include <kernel/harris.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <sort_index.hpp>
+#include <af/dim4.hpp>
+#include <cstring>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &resp_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr) {
+    dim4 idims = in.dims();
+
+    // Window filter
+    auto h_filter = memAlloc<convAccT>(filter_len);
+    // Decide between rectangular (box) or Gaussian filter
+    if (sigma < 0.5f) {
+        for (unsigned i = 0; i < filter_len; i++) {
+            h_filter[i] = static_cast<convAccT>(1) / (filter_len);
+        }
+    } else {
+        gaussian1D<convAccT>(h_filter.get(), static_cast<int>(filter_len),
+                             sigma);
+    }
+    Array<convAccT> filter =
+        createDeviceDataArray<convAccT>(dim4(filter_len), h_filter.release());
+    unsigned border_len = filter_len / 2 + 1;
+
+    Array<T> ix = createEmptyArray<T>(idims);
+    Array<T> iy = createEmptyArray<T>(idims);
+
+    // Compute first order derivatives
+    gradient<T>(iy, ix, in);
+
+    Array<T> ixx = createEmptyArray<T>(idims);
+    Array<T> ixy = createEmptyArray<T>(idims);
+    Array<T> iyy = createEmptyArray<T>(idims);
+
+    // Compute second-order derivatives
+    getQueue().enqueue(kernel::second_order_deriv<T>, ixx, ixy, iyy,
+                       in.elements(), ix, iy);
+
+    // Convolve second-order derivatives with proper window filter
+    ixx = convolve2<T, convAccT>(ixx, filter, filter, false);
+    ixy = convolve2<T, convAccT>(ixy, filter, filter, false);
+    iyy = convolve2<T, convAccT>(iyy, filter, filter, false);
+
+    const unsigned corner_lim = in.elements() * 0.2f;
+
+    Array<T> responses = createEmptyArray<T>(dim4(in.elements()));
+
+    getQueue().enqueue(kernel::harris_responses<T>, responses, idims[0],
+                       idims[1], ixx, ixy, iyy, k_thr, border_len);
+
+    Array<float> xCorners    = createEmptyArray<float>(dim4(corner_lim));
+    Array<float> yCorners    = createEmptyArray<float>(dim4(corner_lim));
+    Array<float> respCorners = createEmptyArray<float>(dim4(corner_lim));
+
+    const unsigned min_r =
+        (max_corners > 0) ? 0U : static_cast<unsigned>(min_response);
+
+    // Non-maximal suppression runs on the calling thread; sync the queue first
+    getQueue().sync();
+    unsigned corners_found = 0;
+    kernel::non_maximal<T>(xCorners, yCorners, respCorners, &corners_found,
+                           idims[0], idims[1], responses, min_r, border_len,
+                           corner_lim);
+
+    const unsigned corners_out =
+        min(corners_found, (max_corners > 0) ? max_corners : corner_lim);
+    if (corners_out == 0) { return 0; }
+
+    if (max_corners > 0 && corners_found > corners_out) {
+        respCorners.resetDims(dim4(corners_found));
+        Array<float> harris_sorted =
+            createEmptyArray<float>(dim4(corners_found));
+        Array<unsigned> harris_idx =
+            createEmptyArray<unsigned>(dim4(corners_found));
+
+        // Sort Harris responses
+        sort_index<float>(harris_sorted, harris_idx, respCorners, 0, false);
+
+        x_out    = createEmptyArray<float>(dim4(corners_out));
+        y_out    = createEmptyArray<float>(dim4(corners_out));
+        resp_out = createEmptyArray<float>(dim4(corners_out));
+
+        // Keep only the corners with higher Harris responses
+        getQueue().enqueue(kernel::keep_corners, x_out, y_out, resp_out,
+                           xCorners, yCorners, harris_sorted, harris_idx,
+                           corners_out);
+    } else if (max_corners == 0 && corners_found < corner_lim) {
+        x_out    = createEmptyArray<float>(dim4(corners_out));
+        y_out    = createEmptyArray<float>(dim4(corners_out));
+        resp_out = createEmptyArray<float>(dim4(corners_out));
+
+        auto copyFunc =
+            [=](Param<float> x_out, Param<float> y_out,
+                Param<float> outResponses, const CParam<float> &x_crnrs,
+                const CParam<float> &y_crnrs, const CParam<float> &inResponses,
+                const unsigned corners_out) {
+                memcpy(x_out.get(), x_crnrs.get(), corners_out * sizeof(float));
+                memcpy(y_out.get(), y_crnrs.get(), corners_out * sizeof(float));
+                memcpy(outResponses.get(), inResponses.get(),
+                       corners_out * sizeof(float));
+            };
+        getQueue().enqueue(copyFunc, x_out, y_out, resp_out, xCorners, yCorners,
+                           respCorners, corners_out);
+    } else {
+        x_out    = xCorners;
+        y_out    = yCorners;
+        resp_out = respCorners;
+        x_out.resetDims(dim4(corners_out));
+        y_out.resetDims(dim4(corners_out));
+        resp_out.resetDims(dim4(corners_out));
+    }
+
+    return corners_out;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned harris<T, convAccT>(                                    \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & resp_out,  \
+        const Array<T> &in, const unsigned max_corners,                       \
+        const float min_response, const float sigma,                          \
+        const unsigned filter_len, const float k_thr);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/harris.hpp b/src/backend/cpu/harris.hpp
new file mode 100644
index 0000000000..b42f8cd4f8
--- /dev/null
+++ b/src/backend/cpu/harris.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &resp_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/hist_graphics.cpp b/src/backend/cpu/hist_graphics.cpp
index 4c940fb523..a77e9fe77e 100644
--- a/src/backend/cpu/hist_graphics.cpp
+++ b/src/backend/cpu/hist_graphics.cpp
@@ -7,34 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
-#include <hist_graphics.hpp>
 #include <err_cpu.hpp>
+#include <hist_graphics.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist)
-{
+void copy_histogram(const Array<T> &data, fg_histogram hist) {
+    ForgeModule &_ = forgePlugin();
+    data.eval();
+    getQueue().sync();
+
     CheckGL("Begin copy_histogram");
+    unsigned bytes = 0, buffer = 0;
+    FG_CHECK(_.fg_get_histogram_vertex_buffer(&buffer, hist));
+    FG_CHECK(_.fg_get_histogram_vertex_buffer_size(&bytes, hist));
 
-    glBindBuffer(GL_ARRAY_BUFFER, hist->vbo());
-    glBufferSubData(GL_ARRAY_BUFFER, 0, hist->size(), data.get());
+    glBindBuffer(GL_ARRAY_BUFFER, buffer);
+    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, data.get());
     glBindBuffer(GL_ARRAY_BUFFER, 0);
 
     CheckGL("End copy_histogram");
 }
 
-#define INSTANTIATE(T)  \
-    template void copy_histogram<T>(const Array<T> &data, const fg::Histogram* hist);
+#define INSTANTIATE(T) \
+    template void copy_histogram<T>(const Array<T> &, fg_histogram);
 
 INSTANTIATE(float)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
-
-#endif  // WITH_GRAPHICS
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/hist_graphics.hpp b/src/backend/cpu/hist_graphics.hpp
index 909bd76099..8971645496 100644
--- a/src/backend/cpu/hist_graphics.hpp
+++ b/src/backend/cpu/hist_graphics.hpp
@@ -9,18 +9,14 @@
 
 #pragma once
 
-#if defined (WITH_GRAPHICS)
-
-#include <graphics_common.hpp>
 #include <Array.hpp>
+#include <common/graphics_common.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist);
-
-}
-
-#endif
+void copy_histogram(const Array<T> &data, fg_histogram hist);
 
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/histogram.cpp b/src/backend/cpu/histogram.cpp
index de38f37b03..9d9c6ba8fa 100644
--- a/src/backend/cpu/histogram.cpp
+++ b/src/backend/cpu/histogram.cpp
@@ -7,56 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <common/half.hpp>
 #include <histogram.hpp>
+#include <kernel/histogram.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
-
-namespace cpu
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval)
-{
-    float step = (maxval - minval)/(float)nbins;
-
-    const dim4 inDims  = in.dims();
-    dim4 iStrides      = in.strides();
-    dim4 outDims       = dim4(nbins,1,inDims[2],inDims[3]);
-    Array<outType> out = createValueArray<outType>(outDims, outType(0));
-    dim4 oStrides      = out.strides();
-    dim_t nElems    = inDims[0]*inDims[1];
-
-    outType *outData    = out.get();
-    const inType* inData= in.get();
-
-    for(dim_t b3 = 0; b3 < outDims[3]; b3++) {
-        for(dim_t b2 = 0; b2 < outDims[2]; b2++) {
-            for(dim_t i=0; i<nElems; i++) {
-                int bin = (int)((inData[i] - minval) / step);
-                bin = std::max(bin, 0);
-                bin = std::min(bin, (int)(nbins - 1));
-                outData[bin]++;
-            }
-            inData  += iStrides[2];
-            outData += oStrides[2];
-        }
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear) {
+    const dim4 &inDims = in.dims();
+    dim4 outDims       = dim4(nbins, 1, inDims[2], inDims[3]);
+    Array<uint> out    = createValueArray<uint>(outDims, uint(0));
+    if (isLinear) {
+        getQueue().enqueue(kernel::histogram<T, true>, out, in, nbins, minval,
+                           maxval);
+    } else {
+        getQueue().enqueue(kernel::histogram<T, false>, out, in, nbins, minval,
+                           maxval);
     }
-
     return out;
 }
 
-#define INSTANTIATE(in_t,out_t)\
-template Array<out_t> histogram(const Array<in_t> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-INSTANTIATE(float , uint)
-INSTANTIATE(double, uint)
-INSTANTIATE(char  , uint)
-INSTANTIATE(int   , uint)
-INSTANTIATE(uint  , uint)
-INSTANTIATE(uchar , uint)
-
-}
+#define INSTANTIATE(T)                                                    \
+    template Array<uint> histogram<T>(const Array<T> &, const unsigned &, \
+                                      const double &, const double &,     \
+                                      const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/histogram.hpp b/src/backend/cpu/histogram.hpp
index 458438fc48..086baf50f0 100644
--- a/src/backend/cpu/histogram.hpp
+++ b/src/backend/cpu/histogram.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/homography.cpp b/src/backend/cpu/homography.cpp
new file mode 100644
index 0000000000..9be88a2e02
--- /dev/null
+++ b/src/backend/cpu/homography.cpp
@@ -0,0 +1,424 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <err_cpu.hpp>
+#include <homography.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+
+#include <array>
+#include <cmath>
+#include <cstring>
+#include <limits>
+#include <vector>
+
+using af::dim4;
+using std::abs;
+using std::array;
+using std::log;
+using std::max;
+using std::min;
+using std::numeric_limits;
+using std::pow;
+using std::round;
+using std::sqrt;
+using std::vector;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+T sq(T a) {
+    return a * a;
+}
+
+#define APTR(Y, X) (A_ptr[(Y)*Adims[0] + (X)])
+
+static const float RANSACConfidence  = 0.99f;
+static const float LMEDSConfidence   = 0.99f;
+static const float LMEDSOutlierRatio = 0.4f;
+
+template<typename T>
+struct EPS {
+    T eps() { return numeric_limits<float>::epsilon(); }
+};
+
+template<>
+struct EPS<float> {
+    static float eps() { return numeric_limits<float>::epsilon(); }
+};
+
+template<>
+struct EPS<double> {
+    static double eps() { return numeric_limits<double>::epsilon(); }
+};
+
+template<typename T, int M, int N>
+void JacobiSVD(T* S, T* V) {
+    const int iterations = 30;
+    array<T, N> d{};
+
+    for (int i = 0; i < N; i++) {
+        T sd = 0;
+        for (int j = 0; j < M; j++) {
+            T t = S[i * M + j];
+            sd += t * t;
+        }
+        d[i] = sd;
+
+        V[i * N + i] = 1;
+    }
+
+    for (int it = 0; it < iterations; it++) {
+        bool converged = false;
+
+        for (int i = 0; i < N - 1; i++) {
+            for (int j = i + 1; j < N; j++) {
+                T* Si = S + i * M;
+                T* Sj = S + j * M;
+                T* Vi = V + i * N;
+                T* Vj = V + j * N;
+
+                T p = static_cast<T>(0);
+                for (int k = 0; k < M; k++) { p += Si[k] * Sj[k]; }
+
+                if (abs(p) <= M * EPS<T>::eps() * sqrt(d[i] * d[j])) {
+                    continue;
+                }
+
+                T y  = d[i] - d[j];
+                T r  = hypot(p * 2, y);
+                T r2 = r * 2;
+                T c, s;
+                if (y >= 0) {
+                    c = sqrt((r + y) / r2);
+                    s = p / (r2 * c);
+                } else {
+                    s = sqrt((r - y) / r2);
+                    c = p / (r2 * s);
+                }
+
+                T a = 0, b = 0;
+                for (int k = 0; k < M; k++) {
+                    T t0  = c * Si[k] + s * Sj[k];
+                    T t1  = c * Sj[k] - s * Si[k];
+                    Si[k] = t0;
+                    Sj[k] = t1;
+
+                    a += t0 * t0;
+                    b += t1 * t1;
+                }
+                d[i] = a;
+                d[j] = b;
+
+                for (int l = 0; l < N; l++) {
+                    T t0 = Vi[l] * c + Vj[l] * s;
+                    T t1 = Vj[l] * c - Vi[l] * s;
+
+                    Vi[l] = t0;
+                    Vj[l] = t1;
+                }
+
+                converged = true;
+            }
+            if (!converged) { break; }
+        }
+    }
+}
+
+unsigned updateIterations(float inlier_ratio, unsigned iter) {
+    float w  = min(max(inlier_ratio, 0.0f), 1.0f);
+    float wn = pow(1 - w, 4.f);
+
+    float d = 1.f - wn;
+    if (d < numeric_limits<float>::min()) { return 0; }
+
+    d = log(d);
+
+    float p = min(max(RANSACConfidence, 0.0f), 1.0f);
+    float n = log(1.f - p);
+
+    return n <= d * static_cast<float>(iter)
+               ? iter
+               : static_cast<unsigned>(round(n / d));
+}
+
+template<typename T>
+int computeHomography(T* H_ptr, const float* rnd_ptr, const float* x_src_ptr,
+                      const float* y_src_ptr, const float* x_dst_ptr,
+                      const float* y_dst_ptr) {
+    if (static_cast<unsigned>(rnd_ptr[0]) ==
+            static_cast<unsigned>(rnd_ptr[1]) ||
+        static_cast<unsigned>(rnd_ptr[0]) ==
+            static_cast<unsigned>(rnd_ptr[2]) ||
+        static_cast<unsigned>(rnd_ptr[0]) ==
+            static_cast<unsigned>(rnd_ptr[3]) ||
+        static_cast<unsigned>(rnd_ptr[1]) ==
+            static_cast<unsigned>(rnd_ptr[2]) ||
+        static_cast<unsigned>(rnd_ptr[1]) ==
+            static_cast<unsigned>(rnd_ptr[3]) ||
+        static_cast<unsigned>(rnd_ptr[2]) ==
+            static_cast<unsigned>(rnd_ptr[3])) {
+        return 1;
+    }
+
+    float src_pt_x[4], src_pt_y[4], dst_pt_x[4], dst_pt_y[4];
+    for (unsigned j = 0; j < 4; j++) {
+        src_pt_x[j] = x_src_ptr[static_cast<unsigned>(rnd_ptr[j])];
+        src_pt_y[j] = y_src_ptr[static_cast<unsigned>(rnd_ptr[j])];
+        dst_pt_x[j] = x_dst_ptr[static_cast<unsigned>(rnd_ptr[j])];
+        dst_pt_y[j] = y_dst_ptr[static_cast<unsigned>(rnd_ptr[j])];
+    }
+
+    float x_src_mean =
+        (src_pt_x[0] + src_pt_x[1] + src_pt_x[2] + src_pt_x[3]) / 4.f;
+    float y_src_mean =
+        (src_pt_y[0] + src_pt_y[1] + src_pt_y[2] + src_pt_y[3]) / 4.f;
+    float x_dst_mean =
+        (dst_pt_x[0] + dst_pt_x[1] + dst_pt_x[2] + dst_pt_x[3]) / 4.f;
+    float y_dst_mean =
+        (dst_pt_y[0] + dst_pt_y[1] + dst_pt_y[2] + dst_pt_y[3]) / 4.f;
+
+    float src_var = 0.0f, dst_var = 0.0f;
+    for (unsigned j = 0; j < 4; j++) {
+        src_var += sq(src_pt_x[j] - x_src_mean) + sq(src_pt_y[j] - y_src_mean);
+        dst_var += sq(dst_pt_x[j] - x_dst_mean) + sq(dst_pt_y[j] - y_dst_mean);
+    }
+
+    src_var /= 4.f;
+    dst_var /= 4.f;
+
+    float src_scale = sqrt(2.0f) / sqrt(src_var);
+    float dst_scale = sqrt(2.0f) / sqrt(dst_var);
+
+    Array<T> A     = createValueArray<T>(af::dim4(9, 9), static_cast<T>(0));
+    af::dim4 Adims = A.dims();
+    T* A_ptr       = A.get();
+    getQueue().sync();
+
+    for (unsigned j = 0; j < 4; j++) {
+        float srcx = (src_pt_x[j] - x_src_mean) * src_scale;
+        float srcy = (src_pt_y[j] - y_src_mean) * src_scale;
+        float dstx = (dst_pt_x[j] - x_dst_mean) * dst_scale;
+        float dsty = (dst_pt_y[j] - y_dst_mean) * dst_scale;
+
+        APTR(3, j * 2) = -srcx;
+        APTR(4, j * 2) = -srcy;
+        APTR(5, j * 2) = -1.0f;
+        APTR(6, j * 2) = dsty * srcx;
+        APTR(7, j * 2) = dsty * srcy;
+        APTR(8, j * 2) = dsty;
+
+        APTR(0, j * 2 + 1) = srcx;
+        APTR(1, j * 2 + 1) = srcy;
+        APTR(2, j * 2 + 1) = 1.0f;
+        APTR(6, j * 2 + 1) = -dstx * srcx;
+        APTR(7, j * 2 + 1) = -dstx * srcy;
+        APTR(8, j * 2 + 1) = -dstx;
+    }
+
+    Array<T> V =
+        createValueArray<T>(af::dim4(Adims[1], Adims[1]), static_cast<T>(0));
+    V.eval();
+    getQueue().sync();
+    JacobiSVD<T, 9, 9>(A.get(), V.get());
+
+    dim4 Vdims = V.dims();
+    T* V_ptr   = V.get();
+
+    array<T, 9> vH{};
+    for (unsigned j = 0; j < 9; j++) { vH[j] = V_ptr[8 * Vdims[0] + j]; }
+
+    H_ptr[0] = src_scale * x_dst_mean * vH[6] + src_scale * vH[0] / dst_scale;
+    H_ptr[1] = src_scale * x_dst_mean * vH[7] + src_scale * vH[1] / dst_scale;
+    H_ptr[2] = x_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                             src_scale * x_src_mean * vH[6]) +
+               (vH[2] - src_scale * y_src_mean * vH[1] -
+                src_scale * x_src_mean * vH[0]) /
+                   dst_scale;
+
+    H_ptr[3] = src_scale * y_dst_mean * vH[6] + src_scale * vH[3] / dst_scale;
+    H_ptr[4] = src_scale * y_dst_mean * vH[7] + src_scale * vH[4] / dst_scale;
+    H_ptr[5] = y_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                             src_scale * x_src_mean * vH[6]) +
+               (vH[5] - src_scale * y_src_mean * vH[4] -
+                src_scale * x_src_mean * vH[3]) /
+                   dst_scale;
+
+    H_ptr[6] = src_scale * vH[6];
+    H_ptr[7] = src_scale * vH[7];
+    H_ptr[8] =
+        vH[8] - src_scale * y_src_mean * vH[7] - src_scale * x_src_mean * vH[6];
+
+    return 0;
+}
+
+// LMedS:
+// http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node25.html
+template<typename T>
+int findBestHomography(Array<T>& bestH, const Array<float>& x_src,
+                       const Array<float>& y_src, const Array<float>& x_dst,
+                       const Array<float>& y_dst, const Array<float>& rnd,
+                       const unsigned iterations, const unsigned nsamples,
+                       const float inlier_thr, const af_homography_type htype) {
+    const float* x_src_ptr = x_src.get();
+    const float* y_src_ptr = y_src.get();
+    const float* x_dst_ptr = x_dst.get();
+    const float* y_dst_ptr = y_dst.get();
+
+    Array<T> H =
+        createValueArray<T>(af::dim4(9, iterations), static_cast<T>(0));
+    H.eval();
+    getQueue().sync();
+
+    const af::dim4& rdims = rnd.dims();
+    const af::dim4& Hdims = H.dims();
+
+    unsigned iter    = iterations;
+    unsigned bestIdx = 0;
+    int bestInliers  = 0;
+    float minMedian  = numeric_limits<float>::max();
+
+    for (unsigned i = 0; i < iter; i++) {
+        const unsigned Hidx = Hdims[0] * i;
+        T* H_ptr            = H.get() + Hidx;
+
+        const unsigned ridx  = rdims[0] * i;
+        const float* rnd_ptr = rnd.get() + ridx;
+
+        if (computeHomography<T>(H_ptr, rnd_ptr, x_src_ptr, y_src_ptr,
+                                 x_dst_ptr, y_dst_ptr)) {
+            continue;
+        }
+
+        if (htype == AF_HOMOGRAPHY_RANSAC) {
+            int inliers_count = 0;
+            for (unsigned j = 0; j < nsamples; j++) {
+                float z = H_ptr[6] * x_src_ptr[j] + H_ptr[7] * y_src_ptr[j] +
+                          H_ptr[8];
+                float x = (H_ptr[0] * x_src_ptr[j] + H_ptr[1] * y_src_ptr[j] +
+                           H_ptr[2]) /
+                          z;
+                float y = (H_ptr[3] * x_src_ptr[j] + H_ptr[4] * y_src_ptr[j] +
+                           H_ptr[5]) /
+                          z;
+
+                float dist = sq(x_dst_ptr[j] - x) + sq(y_dst_ptr[j] - y);
+                if (dist < (inlier_thr * inlier_thr)) { inliers_count++; }
+            }
+            iter =
+                updateIterations(static_cast<float>(nsamples - inliers_count) /
+                                     static_cast<float>(nsamples),
+                                 iter);
+            if (inliers_count > bestInliers) {
+                bestIdx     = i;
+                bestInliers = inliers_count;
+            }
+        } else if (htype == AF_HOMOGRAPHY_LMEDS) {
+            vector<float> err(nsamples);
+            for (unsigned j = 0; j < nsamples; j++) {
+                float z = H_ptr[6] * x_src_ptr[j] + H_ptr[7] * y_src_ptr[j] +
+                          H_ptr[8];
+                float x = (H_ptr[0] * x_src_ptr[j] + H_ptr[1] * y_src_ptr[j] +
+                           H_ptr[2]) /
+                          z;
+                float y = (H_ptr[3] * x_src_ptr[j] + H_ptr[4] * y_src_ptr[j] +
+                           H_ptr[5]) /
+                          z;
+
+                float dist = sq(x_dst_ptr[j] - x) + sq(y_dst_ptr[j] - y);
+                err[j]     = sqrt(dist);
+            }
+
+            stable_sort(err.begin(), err.end());
+
+            float median = err[nsamples / 2];
+            if (nsamples % 2 == 0) {
+                median = (median + err[nsamples / 2 - 1]) * 0.5f;
+            }
+
+            if (median < minMedian &&
+                median > numeric_limits<float>::epsilon()) {
+                minMedian = median;
+                bestIdx   = i;
+            }
+        }
+    }
+
+    memcpy(bestH.get(), H.get() + bestIdx * 9, 9 * sizeof(T));
+
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        float sigma =
+            max(1.4826f * (1.f + 5.f / (static_cast<float>(nsamples) - 4.f)) *
+                    static_cast<float>(sqrt(minMedian)),
+                1e-6f);
+        float dist_thr = sq(2.5f * sigma);
+        T* bestH_ptr   = bestH.get();
+
+        for (unsigned j = 0; j < nsamples; j++) {
+            float z = bestH_ptr[6] * x_src_ptr[j] +
+                      bestH_ptr[7] * y_src_ptr[j] + bestH_ptr[8];
+            float x = (bestH_ptr[0] * x_src_ptr[j] +
+                       bestH_ptr[1] * y_src_ptr[j] + bestH_ptr[2]) /
+                      z;
+            float y = (bestH_ptr[3] * x_src_ptr[j] +
+                       bestH_ptr[4] * y_src_ptr[j] + bestH_ptr[5]) /
+                      z;
+
+            float dist = sq(x_dst_ptr[j] - x) + sq(y_dst_ptr[j] - y);
+            if (dist <= dist_thr) { bestInliers++; }
+        }
+    }
+
+    return bestInliers;
+}
+
+template<typename T>
+int homography(Array<T>& bestH, const Array<float>& x_src,
+               const Array<float>& y_src, const Array<float>& x_dst,
+               const Array<float>& y_dst, const Array<float>& initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations) {
+    const dim4& idims       = x_src.dims();
+    const unsigned nsamples = idims[0];
+
+    unsigned iter = iterations;
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        iter = min(iter, static_cast<unsigned>(
+                             log(1.f - LMEDSConfidence) /
+                             log(1.f - pow(1.f - LMEDSOutlierRatio, 4.f))));
+    }
+
+    af::dim4 rdims(4, iter);
+    Array<float> fctr =
+        createValueArray<float>(rdims, static_cast<float>(nsamples));
+    Array<float> rnd = arithOp<float, af_mul_t>(initial, fctr, rdims);
+    rnd.eval();
+    getQueue().sync();
+
+    return findBestHomography<T>(bestH, x_src, y_src, x_dst, y_dst, rnd, iter,
+                                 nsamples, inlier_thr, htype);
+}
+
+#define INSTANTIATE(T)                                          \
+    template int homography<T>(                                 \
+        Array<T> & bestH, const Array<float>& x_src,            \
+        const Array<float>& y_src, const Array<float>& x_dst,   \
+        const Array<float>& y_dst, const Array<float>& initial, \
+        const af_homography_type htype, const float inlier_thr, \
+        const unsigned iterations);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+
+}  // namespace cpu
+}  // namespace arrayfire
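Reviewer note on `updateIterations` in homography.cpp: the helper applies the usual RANSAC stopping rule. Its first argument is treated as the outlier fraction, which is what the caller passes as `(nsamples - inliers_count) / nsamples`. The sketch below restates that rule as a standalone function so the arithmetic can be checked in isolation. The name `ransacIterations` and the constant `kConfidence` are hypothetical; the value of `RANSACConfidence` is defined outside this excerpt, so 0.99 is only an assumption.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical stand-in for RANSACConfidence (its real value is not shown
// in this excerpt).
constexpr float kConfidence = 0.99f;

// Number of iterations needed so that, with probability kConfidence, at
// least one 4-point sample contains no outliers. Capped at max_iter.
unsigned ransacIterations(float outlier_ratio, unsigned max_iter) {
    float w  = std::min(std::max(outlier_ratio, 0.0f), 1.0f);
    float wn = std::pow(1.0f - w, 4.0f);  // P(all four points are inliers)

    float d = 1.0f - wn;  // P(sample contains at least one outlier)
    if (d < std::numeric_limits<float>::min()) { return 0; }
    d = std::log(d);

    float n = std::log(1.0f - kConfidence);
    // Both logarithms are non-positive, so n <= d * max_iter is the
    // division-free way of testing n / d >= max_iter.
    return n <= d * static_cast<float>(max_iter)
               ? max_iter
               : static_cast<unsigned>(std::round(n / d));
}
```

Under these assumptions, half the matches being outliers needs about 71 iterations, and the comparison against `d * max_iter` keeps the all-outlier case (where `d` becomes `log(1) = 0`) from dividing by zero.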
diff --git a/src/backend/cpu/homography.hpp b/src/backend/cpu/homography.hpp
new file mode 100644
index 0000000000..76ac8bbf86
--- /dev/null
+++ b/src/backend/cpu/homography.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+int homography(Array<T> &H, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/hsv_rgb.cpp b/src/backend/cpu/hsv_rgb.cpp
index 82f404fa95..cf278862d0 100644
--- a/src/backend/cpu/hsv_rgb.cpp
+++ b/src/backend/cpu/hsv_rgb.cpp
@@ -1,138 +1,46 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <ArrayInfo.hpp>
 #include <hsv_rgb.hpp>
-#include <err_cpu.hpp>
-#include <cmath>
-
-using af::dim4;
+#include <kernel/hsv_rgb.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> hsv2rgb(const Array<T>& in)
-{
-    const dim4 dims    = in.dims();
-    const dim4 strides = in.strides();
-    Array<T> out       = createEmptyArray<T>(dims);
-    dim_t obStride  = out.strides()[3];
-    dim_t coff      = strides[2];
-    dim_t bCount    = dims[3];
-
-    for(dim_t b=0; b<bCount; ++b) {
-        const T* src = in.get() + b * strides[3];
-        T* dst       = out.get() + b * obStride;
-
-        for(dim_t j=0; j<dims[1]; ++j) {
-            dim_t jOff = j*strides[1];
-            // j steps along 2nd dimension
-            for(dim_t i=0; i<dims[0]; ++i) {
-                // i steps along 1st dimension
-                dim_t hIdx = i*strides[0] + jOff;
-                dim_t sIdx = hIdx + coff;
-                dim_t vIdx = sIdx + coff;
-
-                T H = src[hIdx];
-                T S = src[sIdx];
-                T V = src[vIdx];
-
-                T R, G, B;
-                R = G = B = 0;
-
-                int   m = (int)(H * 6);
-                T f = H * 6 - m;
-                T p = V * (1 - S);
-                T q = V * (1 - f * S);
-                T t = V * (1 - (1 - f) * S);
-
-                switch (m % 6) {
-                    case 0: R = V, G = t, B = p; break;
-                    case 1: R = q, G = V, B = p; break;
-                    case 2: R = p, G = V, B = t; break;
-                    case 3: R = p, G = q, B = V; break;
-                    case 4: R = t, G = p, B = V; break;
-                    case 5: R = V, G = p, B = q; break;
-                }
+Array<T> hsv2rgb(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
 
-                dst[hIdx] = R;
-                dst[sIdx] = G;
-                dst[vIdx] = B;
-            }
-        }
-    }
+    getQueue().enqueue(kernel::hsv2rgb<T>, out, in);
 
     return out;
 }
 
 template<typename T>
-Array<T> rgb2hsv(const Array<T>& in)
-{
-    const dim4 dims    = in.dims();
-    const dim4 strides = in.strides();
-    Array<T> out       = createEmptyArray<T>(dims);
-    dim4 oStrides      = out.strides();
-    dim_t bCount    = dims[3];
+Array<T> rgb2hsv(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
 
-    for(dim_t b=0; b<bCount; ++b) {
-        const T* src = in.get() + b * strides[3];
-        T* dst       = out.get() + b * oStrides[3];
-
-        for(dim_t j=0; j<dims[1]; ++j) {
-            // j steps along 2nd dimension
-            dim_t oj = j * oStrides[1];
-            dim_t ij = j * strides[1];
-
-            for(dim_t i=0; i<dims[0]; ++i) {
-                // i steps along 1st dimension
-                dim_t oIdx0 = i * oStrides[0] + oj;
-                dim_t oIdx1 = oIdx0 + oStrides[2];
-                dim_t oIdx2 = oIdx1 + oStrides[2];
-
-                dim_t iIdx0 = i * strides[0]  + ij;
-                dim_t iIdx1 = iIdx0 + strides[2];
-                dim_t iIdx2 = iIdx1 + strides[2];
-
-                T R = src[iIdx0];
-                T G = src[iIdx1];
-                T B = src[iIdx2];
-                T Cmax = std::max(std::max(R, G), B);
-                T Cmin = std::min(std::min(R, G), B);
-                T delta= Cmax-Cmin;
-
-                T H = 0;
-
-                if (Cmax!=Cmin) {
-                    if (Cmax==R) H = (G-B)/delta + (G<B ? 6 : 0);
-                    if (Cmax==G) H = (B-R)/delta + 2;
-                    if (Cmax==B) H = (R-G)/delta + 4;
-                    H = H / 6.0f;
-                }
-
-                dst[oIdx0] = H;
-                dst[oIdx1] = (Cmax==0.0f ? 0 : delta/Cmax);
-                dst[oIdx2] = Cmax;
-            }
-        }
-    }
+    getQueue().enqueue(kernel::rgb2hsv<T>, out, in);
 
     return out;
 }
 
-#define INSTANTIATE(T)  \
+#define INSTANTIATE(T)                                \
     template Array<T> hsv2rgb<T>(const Array<T>& in); \
-    template Array<T> rgb2hsv<T>(const Array<T>& in); \
+    template Array<T> rgb2hsv<T>(const Array<T>& in);
 
 INSTANTIATE(double)
-INSTANTIATE(float )
+INSTANTIATE(float)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
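Reviewer note on hsv_rgb.cpp: the per-pixel arithmetic removed here now sits behind `kernel::hsv2rgb<T>`, which this excerpt does not show. As a reference for what that kernel is expected to compute, here is a minimal sketch of the removed sextant logic for a single pixel. The function name `hsvToRgb` is hypothetical, and all channels are assumed to lie in [0, 1].

```cpp
#include <array>

// Converts one HSV pixel to RGB with the same sextant arithmetic as the
// removed loop: m selects the sextant, f is the position inside it.
std::array<float, 3> hsvToRgb(float H, float S, float V) {
    int m   = static_cast<int>(H * 6);
    float f = H * 6 - m;
    float p = V * (1 - S);
    float q = V * (1 - f * S);
    float t = V * (1 - (1 - f) * S);

    switch (m % 6) {
        case 0: return {V, t, p};
        case 1: return {q, V, p};
        case 2: return {p, V, t};
        case 3: return {p, q, V};
        case 4: return {t, p, V};
        default: return {V, p, q};  // sextant 5
    }
}
```

A scalar function like this can serve as an oracle when testing the queued kernel against known colours such as pure red (H = 0) or cyan (H = 0.5).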
diff --git a/src/backend/cpu/hsv_rgb.hpp b/src/backend/cpu/hsv_rgb.hpp
index 5c870a7a39..3d0929c22b 100644
--- a/src/backend/cpu/hsv_rgb.hpp
+++ b/src/backend/cpu/hsv_rgb.hpp
@@ -1,16 +1,16 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 Array<T> hsv2rgb(const Array<T>& in);
@@ -18,4 +18,5 @@ Array<T> hsv2rgb(const Array<T>& in);
 template<typename T>
 Array<T> rgb2hsv(const Array<T>& in);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/identity.cpp b/src/backend/cpu/identity.cpp
index 3112991406..ce7f35bdb0 100644
--- a/src/backend/cpu/identity.cpp
+++ b/src/backend/cpu/identity.cpp
@@ -6,46 +6,47 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <identity.hpp>
+#include <kernel/identity.hpp>
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
 #include <Array.hpp>
-#include <identity.hpp>
-#include <math.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    Array<T> identity(const dim4& dims)
-    {
-        Array<T> out = createEmptyArray<T>(dims);
-        T *ptr = out.get();
-        const dim_t *out_dims  = out.dims().get();
-
-        for (dim_t k = 0; k < out_dims[2] * out_dims[3]; k++) {
-            for (dim_t j = 0; j < out_dims[1]; j++) {
-                for (dim_t i = 0; i < out_dims[0]; i++) {
-                    ptr[j * out_dims[0] + i]  = (i == j) ? scalar<T>(1) : scalar<T>(0);
-                }
-            }
-            ptr += out_dims[0] * out_dims[1];
-        }
-        return out;
-    }
-
-#define INSTANTIATE_IDENTITY(T)                              \
-    template Array<T>  identity<T>    (const af::dim4 &dims);
-
-    INSTANTIATE_IDENTITY(float)
-    INSTANTIATE_IDENTITY(double)
-    INSTANTIATE_IDENTITY(cfloat)
-    INSTANTIATE_IDENTITY(cdouble)
-    INSTANTIATE_IDENTITY(int)
-    INSTANTIATE_IDENTITY(intl)
-    INSTANTIATE_IDENTITY(uintl)
-    INSTANTIATE_IDENTITY(uint)
-    INSTANTIATE_IDENTITY(char)
-    INSTANTIATE_IDENTITY(uchar)
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;  // NOLINT(misc-unused-using-decls) bug in
+                                // clang-tidy
+
+namespace arrayfire {
+namespace cpu {
 
+template<typename T>
+Array<T> identity(const dim4& dims) {
+    Array<T> out = createEmptyArray<T>(dims);
+
+    getQueue().enqueue(kernel::identity<T>, out);
+
+    return out;
 }
+
+#define INSTANTIATE_IDENTITY(T) \
+    template Array<T> identity<T>(const af::dim4& dims);
+
+INSTANTIATE_IDENTITY(float)
+INSTANTIATE_IDENTITY(double)
+INSTANTIATE_IDENTITY(cfloat)
+INSTANTIATE_IDENTITY(cdouble)
+INSTANTIATE_IDENTITY(int)
+INSTANTIATE_IDENTITY(uint)
+INSTANTIATE_IDENTITY(intl)
+INSTANTIATE_IDENTITY(uintl)
+INSTANTIATE_IDENTITY(char)
+INSTANTIATE_IDENTITY(schar)
+INSTANTIATE_IDENTITY(uchar)
+INSTANTIATE_IDENTITY(short)
+INSTANTIATE_IDENTITY(ushort)
+INSTANTIATE_IDENTITY(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/identity.hpp b/src/backend/cpu/identity.hpp
index 3506dabe61..5a77fa2d9a 100644
--- a/src/backend/cpu/identity.hpp
+++ b/src/backend/cpu/identity.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> identity(const dim4& dim);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> identity(const dim4& dim);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/iir.cpp b/src/backend/cpu/iir.cpp
index 580078eb42..9d3fcfc966 100644
--- a/src/backend/cpu/iir.cpp
+++ b/src/backend/cpu/iir.cpp
@@ -7,86 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <iir.hpp>
-#include <err_cpu.hpp>
-#include <math.hpp>
-#include <arith.hpp>
 #include <convolve.hpp>
+#include <iir.hpp>
+#include <kernel/iir.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x)
-    {
-        T h_a0 = a.get()[0];
-        Array<T> a0 = createValueArray<T>(b.dims(), h_a0);
-
-        ConvolveBatchKind type = x.ndims() == 1 ? ONE2ONE : MANY2MANY;
-        if (x.ndims() != b.ndims()) {
-            type = (x.ndims() < b.ndims()) ? ONE2MANY : MANY2ONE;
-        }
-
-        // Extract the first N elements
-        Array<T> c = convolve<T, T, 1, true>(x, b, type);
-        dim4 cdims = c.dims();
-        cdims[0] = x.dims()[0];
-        c.resetDims(cdims);
-
-        int num_a = a.dims()[0];
-
-        dim4 ydims = c.dims();
-        Array<T> y = createEmptyArray<T>(ydims);
+namespace arrayfire {
+namespace cpu {
 
-        for (int l = 0; l < (int)ydims[3]; l++) {
-            dim_t yidx3 = l * y.strides()[3];
-            dim_t cidx3 = l * c.strides()[3];
-            dim_t aidx3 = l * a.strides()[3];
-
-            for (int k = 0; k < (int)ydims[2]; k++) {
-
-                dim_t yidx2 = k * y.strides()[2] + yidx3;
-                dim_t cidx2 = k * c.strides()[2] + cidx3;
-                dim_t aidx2 = k * a.strides()[2] + aidx3;
-
-                for (int j = 0; j < (int)ydims[1]; j++) {
-
-                    dim_t yidx1 = j * y.strides()[1] + yidx2;
-                    dim_t cidx1 = j * c.strides()[1] + cidx2;
-                    dim_t aidx1 = j * a.strides()[1] + aidx2;
+template<typename T>
+Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x) {
+    AF_BATCH_KIND type = x.ndims() == 1 ? AF_BATCH_NONE : AF_BATCH_SAME;
+    if (x.ndims() != b.ndims()) {
+        type = (x.ndims() < b.ndims()) ? AF_BATCH_RHS : AF_BATCH_LHS;
+    }
 
-                    std::vector<T> h_z(num_a);
+    // Extract the first N elements
+    Array<T> c = convolve<T, T>(x, b, type, 1, true);
+    dim4 cdims = c.dims();
+    cdims[0]   = x.dims()[0];
+    c.resetDims(cdims);
 
-                    const T *h_a = a.get() + (a.ndims() > 1 ? aidx1 : 0);
-                    T *h_c = c.get() + cidx1;
-                    T *h_y = y.get() + yidx1;
+    Array<T> y = createEmptyArray<T>(c.dims());
 
-                    for (int i = 0; i < (int)ydims[0]; i++) {
+    getQueue().enqueue(kernel::iir<T>, y, c, a);
 
-                        T y = h_y[i] = (h_c[i] + h_z[0]) /  h_a[0];
-                        for (int ii = 1; ii < num_a; ii++) {
-                            h_z[ii - 1] = h_z[ii] - h_a[ii] * y;
-                        }
-                    }
-                }
-            }
-        }
+    return y;
+}
 
-        return y;
-    }
+#define INSTANTIATE(T)                                          \
+    template Array<T> iir(const Array<T> &b, const Array<T> &a, \
+                          const Array<T> &x);
 
-#define INSTANTIATE(T)                          \
-    template Array<T> iir(const Array<T> &b,    \
-                          const Array<T> &a,    \
-                          const Array<T> &x);   \
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-}
+}  // namespace cpu
+}  // namespace arrayfire
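Reviewer note on iir.cpp: after this change the host code only performs the feed-forward convolution and hands the recursion to `kernel::iir<T>`, which is not part of this excerpt. The sketch below restates the removed inner loop for one 1-D signal, as an illustration of the feedback stage that kernel presumably implements. The name `iirFeedback` is hypothetical, `c` is assumed to be `conv(x, b)` truncated to the length of `x`, and `a` is assumed to be non-empty with `a[0] != 0`.

```cpp
#include <cstddef>
#include <vector>

// Feedback stage of a transposed direct-form II IIR filter. Solves
// a[0] * y[i] = c[i] - sum_{k >= 1} a[k] * y[i - k] for each sample.
std::vector<double> iirFeedback(const std::vector<double>& c,
                                const std::vector<double>& a) {
    std::vector<double> y(c.size());
    std::vector<double> z(a.size(), 0.0);  // delay line; last slot stays 0
    for (std::size_t i = 0; i < c.size(); ++i) {
        y[i] = (c[i] + z[0]) / a[0];
        // Shift the delay line, folding in the newest output sample.
        for (std::size_t k = 1; k < a.size(); ++k) {
            z[k - 1] = z[k] - a[k] * y[i];
        }
    }
    return y;
}
```

For an impulse input and `a = {1, -0.5}` this yields the geometric sequence 1, 0.5, 0.25, ..., which makes a convenient sanity check for the batched kernel.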
diff --git a/src/backend/cpu/iir.hpp b/src/backend/cpu/iir.hpp
index 4969dd0b95..4075c48b43 100644
--- a/src/backend/cpu/iir.hpp
+++ b/src/backend/cpu/iir.hpp
@@ -9,9 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x);
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/image.cpp b/src/backend/cpu/image.cpp
index 8b211fe84d..2e24dec9be 100644
--- a/src/backend/cpu/image.cpp
+++ b/src/backend/cpu/image.cpp
@@ -10,42 +10,50 @@
 // Parts of this code sourced from SnopyDogy
 // https://gist.github.com/SnopyDogy/a9a22497a893ec86aa3e
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <image.hpp>
+#include <common/graphics_common.hpp>
 #include <err_cpu.hpp>
-#include <cstdio>
-#include <stdexcept>
-#include <graphics_common.hpp>
-
-using af::dim4;
-
-namespace cpu
-{
-    template<typename T>
-    void copy_image(const Array<T> &in, const fg::Image* image)
-    {
-        CheckGL("Before CopyArrayToPBO");
-        const T *d_X = in.get();
-        size_t data_size = image->size();
-
-        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, image->pbo());
-        glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, data_size, d_X);
-        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
-
-        CheckGL("In CopyArrayToPBO");
-    }
-
-    #define INSTANTIATE(T)  \
-        template void copy_image<T>(const Array<T> &in, const fg::Image* image);
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+#include <image.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image) {
+    ForgeModule &_ = forgePlugin();
+
+    CheckGL("Before CopyArrayToImage");
+    const T *d_X = in.get();
+    getQueue().sync();
+
+    unsigned data_size = 0, buffer = 0;
+    FG_CHECK(_.fg_get_pixel_buffer(&buffer, image));
+    FG_CHECK(_.fg_get_image_size(&data_size, image));
+
+    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer);
+    glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, data_size, d_X);
+    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
+
+    CheckGL("In CopyArrayToImage");
 }
 
-#endif  // WITH_GRAPHICS
+#define INSTANTIATE(T) template void copy_image<T>(const Array<T> &, fg_image);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/image.hpp b/src/backend/cpu/image.hpp
index dc6cc62f09..2dd41e585e 100644
--- a/src/backend/cpu/image.hpp
+++ b/src/backend/cpu/image.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cpu {
 
-namespace cpu
-{
-    template<typename T>
-    void copy_image(const Array<T> &in, const fg::Image* image);
-}
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image);
 
-#endif
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/index.cpp b/src/backend/cpu/index.cpp
index 162e67fb46..84cff747bd 100644
--- a/src/backend/cpu/index.cpp
+++ b/src/backend/cpu/index.cpp
@@ -7,57 +7,53 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <index.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
 #include <handle.hpp>
-#include <err_cpu.hpp>
+#include <kernel/index.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+
+#include <utility>
 #include <vector>
 
 using af::dim4;
+using arrayfire::common::half;  // NOLINT(misc-unused-using-decls) bug in
+                                // clang-tidy
+using std::vector;
 
-namespace cpu
-{
-
-static inline
-dim_t trimIndex(dim_t idx, const dim_t &len)
-{
-    dim_t ret_val = idx;
-    dim_t offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
-    }
-    return ret_val;
-}
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> index(const Array<T>& in, const af_index_t idxrs[])
-{
-    bool isSeq[4];
-    std::vector<af_seq> seqs(4, af_span);
+Array<T> index(const Array<T>& in, const af_index_t idxrs[]) {
+    vector<bool> isSeq(4);
+    vector<af_seq> seqs(4, af_span);
     // create seq vector to retrieve output
     // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
+    for (unsigned x = 0; x < isSeq.size(); ++x) {
         if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
+            af_seq seq = idxrs[x].idx.seq;
+            // Handle af_span as a sequence that covers the complete axis
+            if (seq.begin == af_span.begin && seq.end == af_span.end &&
+                seq.step == af_span.step) {
+                seqs[x] = af_seq{0, (double)(in.dims()[x] - 1), 1};
+            } else {
+                seqs[x] = seq;
+            }
         }
         isSeq[x] = idxrs[x].isSeq;
     }
 
-    // rettrieve
-    dim4 iDims = in.dims();
-    dim4 dDims = in.getDataDims();
-    dim4 oDims = toDims  (seqs, iDims);
-    dim4 iOffs = toOffset(seqs, dDims);
-    dim4 iStrds= toStride(seqs, dDims);
+    // retrieve
+    dim4 oDims = toDims(seqs, in.dims());
 
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
+    vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
     // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
+    for (unsigned x = 0; x < isSeq.size(); ++x) {
         if (!isSeq[x]) {
             idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
             // set output array ith dimension value
@@ -66,45 +62,10 @@ Array<T> index(const Array<T>& in, const af_index_t idxrs[])
     }
 
     Array<T> out = createEmptyArray<T>(oDims);
-    dim4 oStrides= out.strides();
-
-    const T *src = in.get();
-    T *dst = out.get();
-
-    const uint* ptr0 = idxArrs[0].get();
-    const uint* ptr1 = idxArrs[1].get();
-    const uint* ptr2 = idxArrs[2].get();
-    const uint* ptr3 = idxArrs[3].get();
-
-    for (dim_t l=0; l<oDims[3]; ++l) {
+    vector<CParam<uint>> idxParams(idxArrs.begin(), idxArrs.end());
 
-        dim_t lOff   = l*oStrides[3];
-        dim_t inIdx3 = trimIndex(isSeq[3] ? l+iOffs[3] : ptr3[l], iDims[3]);
-        dim_t inOff3 = inIdx3*iStrds[3];
-
-        for (dim_t k=0; k<oDims[2]; ++k) {
-
-            dim_t kOff   = k*oStrides[2];
-            dim_t inIdx2 = trimIndex(isSeq[2] ? k+iOffs[2] : ptr2[k], iDims[2]);
-            dim_t inOff2 = inIdx2*iStrds[2];
-
-            for (dim_t j=0; j<oDims[1]; ++j) {
-
-                dim_t jOff   = j*oStrides[1];
-                dim_t inIdx1 = trimIndex(isSeq[1] ? j+iOffs[1] : ptr1[j], iDims[1]);
-                dim_t inOff1 = inIdx1*iStrds[1];
-
-                for (dim_t i=0; i<oDims[0]; ++i) {
-
-                    dim_t iOff   = i*oStrides[0];
-                    dim_t inIdx0 = trimIndex(isSeq[0] ? i+iOffs[0] : ptr0[i], iDims[0]);
-                    dim_t inOff0 = inIdx0*iStrds[0];
-
-                    dst[lOff+kOff+jOff+iOff] = src[inOff3+inOff2+inOff1+inOff0];
-                }
-            }
-        }
-    }
+    getQueue().enqueue(kernel::index<T>, out, in, in.getDataDims(),
+                       std::move(isSeq), std::move(seqs), std::move(idxParams));
 
     return out;
 }
@@ -113,14 +74,19 @@ Array<T> index(const Array<T>& in, const af_index_t idxrs[])
     template Array<T> index<T>(const Array<T>& in, const af_index_t idxrs[]);
 
 INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
-
-}
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(uintl)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(int)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/index.hpp b/src/backend/cpu/index.hpp
index ed116657b4..14a6692db1 100644
--- a/src/backend/cpu/index.hpp
+++ b/src/backend/cpu/index.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/index.h>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 Array<T> index(const Array<T>& in, const af_index_t idxrs[]);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/inverse.cpp b/src/backend/cpu/inverse.cpp
index 129823b963..20543d027c 100644
--- a/src/backend/cpu/inverse.cpp
+++ b/src/backend/cpu/inverse.cpp
@@ -7,48 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/err_common.hpp>
 #include <inverse.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_CPU_LINEAR_ALGEBRA)
+#if defined(WITH_LINEAR_ALGEBRA)
 
-#include <af/dim4.hpp>
+#include <err_cpu.hpp>
 #include <handle.hpp>
 #include <range.hpp>
-#include <iostream>
+#include <af/dim4.hpp>
 #include <cassert>
-#include <err_cpu.hpp>
 
+#include <identity.hpp>
 #include <lapack_helper.hpp>
 #include <lu.hpp>
-#include <identity.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <solve.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-using getri_func_def = int (*)(ORDER_TYPE, int,
-                               T *, int,
-                               const int *);
+using getri_func_def = int (*)(ORDER_TYPE, int, T *, int, const int *);
 
-#define INV_FUNC_DEF( FUNC )                                        \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
+#define INV_FUNC_DEF(FUNC) \
+    template<typename T>   \
+    FUNC##_func_def<T> FUNC##_func();
 
-#define INV_FUNC( FUNC, TYPE, PREFIX )                              \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()            \
-{ return & LAPACK_NAME(PREFIX##FUNC); }
+#define INV_FUNC(FUNC, TYPE, PREFIX)            \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
 
-INV_FUNC_DEF( getri )
-INV_FUNC(getri , float  , s)
-INV_FUNC(getri , double , d)
-INV_FUNC(getri , cfloat , c)
-INV_FUNC(getri , cdouble, z)
+INV_FUNC_DEF(getri)
+INV_FUNC(getri, float, s)
+INV_FUNC(getri, double, d)
+INV_FUNC(getri, cfloat, c)
+INV_FUNC(getri, cdouble, z)
 
 template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
-
+Array<T> inverse(const Array<T> &in) {
     int M = in.dims()[0];
     int N = in.dims()[1];
 
@@ -57,47 +57,46 @@ Array<T> inverse(const Array<T> &in)
         return solve(in, I);
     }
 
-    Array<T> A = copyArray<T>(in);
-
+    Array<T> A       = copyArray<T>(in);
     Array<int> pivot = lu_inplace<T>(A, false);
 
-    getri_func<T>()(AF_LAPACK_COL_MAJOR, M,
-                    A.get(), A.strides()[1],
-                    pivot.get());
+    auto func = [=](Param<T> A, Param<int> pivot, int M) {
+        getri_func<T>()(AF_LAPACK_COL_MAJOR, M, A.get(), A.strides(1),
+                        pivot.get());
+    };
+    getQueue().enqueue(func, A, pivot, M);
 
     return A;
 }
 
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
 
 INSTANTIATE(float)
 INSTANTIATE(cfloat)
 INSTANTIATE(double)
 INSTANTIATE(cdouble)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
-    AF_ERROR("Linear Algebra is diabled on CPU",
-              AF_ERR_NOT_CONFIGURED);
+Array<T> inverse(const Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
 
 INSTANTIATE(float)
 INSTANTIATE(cfloat)
 INSTANTIATE(double)
 INSTANTIATE(cdouble)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/cpu/inverse.hpp b/src/backend/cpu/inverse.hpp
index 7ab44b109a..476388cb68 100644
--- a/src/backend/cpu/inverse.hpp
+++ b/src/backend/cpu/inverse.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> inverse(const Array<T> &in);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> inverse(const Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/iota.cpp b/src/backend/cpu/iota.cpp
index e6111e8b03..fe50919783 100644
--- a/src/backend/cpu/iota.cpp
+++ b/src/backend/cpu/iota.cpp
@@ -6,63 +6,46 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <iota.hpp>
+#include <kernel/iota.hpp>
 
 #include <Array.hpp>
-#include <iota.hpp>
+#include <common/half.hpp>
 #include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-#include <algorithm>
-#include <numeric>
+#include <platform.hpp>
+#include <queue.hpp>
 
-using namespace std;
+using arrayfire::common::half;  // NOLINT(misc-unused-using-decls) bug in
+                                // clang-tidy
 
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Kernel Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T>
-    void iota(T *out, const dim4 &dims, const dim4 &strides, const dim4 &sdims, const dim4 &tdims)
-    {
-        for(dim_t w = 0; w < dims[3]; w++) {
-            dim_t offW = w * strides[3];
-            T valW = (w % sdims[3]) * sdims[0] * sdims[1] * sdims[2];
-            for(dim_t z = 0; z < dims[2]; z++) {
-                dim_t offWZ = offW + z * strides[2];
-                T valZ = valW + (z % sdims[2]) * sdims[0] * sdims[1];
-                for(dim_t y = 0; y < dims[1]; y++) {
-                    dim_t offWZY = offWZ + y * strides[1];
-                    T valY = valZ + (y % sdims[1]) * sdims[0];
-                    for(dim_t x = 0; x < dims[0]; x++) {
-                        dim_t id = offWZY + x;
-                        out[id] = valY + (x % sdims[0]);
-                    }
-                }
-            }
-        }
-    }
+namespace arrayfire {
+namespace cpu {
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Wrapper Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T>
-    Array<T> iota(const dim4 &dims, const dim4 &tile_dims)
-    {
-        dim4 outdims = dims * tile_dims;
+template<typename T>
+Array<T> iota(const dim4 &dims, const dim4 &tile_dims) {
+    dim4 outdims = dims * tile_dims;
 
-        Array<T> out = createEmptyArray<T>(outdims);
-        iota<T>(out.get(), out.dims(), out.strides(), dims, tile_dims);
+    Array<T> out = createEmptyArray<T>(outdims);
 
-        return out;
-    }
+    getQueue().enqueue(kernel::iota<T>, out, dims);
 
-#define INSTANTIATE(T)                                                          \
-    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/iota.hpp b/src/backend/cpu/iota.hpp
index c437425cb3..9921933cbf 100644
--- a/src/backend/cpu/iota.hpp
+++ b/src/backend/cpu/iota.hpp
@@ -8,12 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
-}
-
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/ireduce.cpp b/src/backend/cpu/ireduce.cpp
index 9819eec8e8..b87c12bc87 100644
--- a/src/backend/cpu/ireduce.cpp
+++ b/src/backend/cpu/ireduce.cpp
@@ -6,191 +6,128 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <ireduce.hpp>
+#include <kernel/ireduce.hpp>
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <ireduce.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+
+#include <complex>
 
 using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<af_op_t op, typename T>
+using ireduce_dim_func =
+    std::function<void(Param<T>, Param<uint>, const dim_t, CParam<T>,
+                       const dim_t, const int, CParam<uint>)>;
+
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim) {
+    dim4 odims       = in.dims();
+    odims[dim]       = 1;
+    Array<uint> rlen = createEmptyArray<uint>(af::dim4(0));
+    static const ireduce_dim_func<op, T> ireduce_funcs[] = {
+        kernel::ireduce_dim<op, T, 1>(), kernel::ireduce_dim<op, T, 2>(),
+        kernel::ireduce_dim<op, T, 3>(), kernel::ireduce_dim<op, T, 4>()};
+
+    getQueue().enqueue(ireduce_funcs[in.ndims() - 1], out, loc, 0, in, 0, dim,
+                       rlen);
+}
 
-namespace cpu
-{
-    template<typename T> double cabs(const T in) { return (double)in; }
-    static double cabs(const char in) { return (double)(in > 0); }
-    static double cabs(const cfloat &in) { return (double)abs(in); }
-    static double cabs(const cdouble &in) { return (double)abs(in); }
-
-    template<af_op_t op, typename T>
-    struct MinMaxOp
-    {
-        T m_val;
-        uint m_idx;
-        MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
-        }
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen) {
+    dim4 odims = in.dims();
+    odims[dim] = 1;
 
-        void operator()(T val, uint idx)
-        {
-            if (cabs(val) < cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx > m_idx)) {
-                m_val = val;
-                m_idx = idx;
-            }
-        }
-    };
-
-    template<typename T>
-    struct MinMaxOp<af_max_t, T>
-    {
-        T m_val;
-        uint m_idx;
-        MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
-        }
+    static const ireduce_dim_func<op, T> ireduce_funcs[] = {
+        kernel::ireduce_dim<op, T, 1>(), kernel::ireduce_dim<op, T, 2>(),
+        kernel::ireduce_dim<op, T, 3>(), kernel::ireduce_dim<op, T, 4>()};
 
-        void operator()(T val, uint idx)
-        {
-            if (cabs(val) > cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx <= m_idx)) {
-                m_val = val;
-                m_idx = idx;
-            }
-        }
-    };
-
-    template<af_op_t op, typename T, int D>
-    struct ireduce_dim
-    {
-        void operator()(T *out, const dim4 ostrides, const dim4 odims,
-                        uint *loc,
-                        const T *in , const dim4 istrides, const dim4 idims,
-                        const int dim)
-        {
-            const int D1 = D - 1;
-            for (dim_t i = 0; i < odims[D1]; i++) {
-                ireduce_dim<op, T, D1>()(out + i * ostrides[D1],
-                                         ostrides, odims,
-                                         loc + i * ostrides[D1],
-                                         in  + i * istrides[D1],
-                                         istrides, idims,
-                                         dim);
-            }
-        }
-    };
-
-    template<af_op_t op, typename T>
-    struct ireduce_dim<op, T, 0>
-    {
-        void operator()(T *out, const dim4 ostrides, const dim4 odims,
-                        uint *loc,
-                        const T *in , const dim4 istrides, const dim4 idims,
-                        const int dim)
-        {
-
-            dim_t stride = istrides[dim];
-            MinMaxOp<op, T> Op(in[0], 0);
-            for (dim_t i = 0; i < idims[dim]; i++) {
-                Op(in[i * stride], i);
-            }
+    getQueue().enqueue(ireduce_funcs[in.ndims() - 1], out, loc, 0, in, 0, dim,
+                       rlen);
+}
 
-            *out = Op.m_val;
-            *loc = Op.m_idx;
-        }
-    };
-
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim)
-    {
-        dim4 odims = in.dims();
-        odims[dim] = 1;
-
-        switch (in.ndims()) {
-        case 1:
-            ireduce_dim<op, T, 1>()(out.get(), out.strides(), out.dims(),
-                                    loc.get(),
-                                    in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 2:
-            ireduce_dim<op, T, 2>()(out.get(), out.strides(), out.dims(),
-                                    loc.get(),
-                                    in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 3:
-            ireduce_dim<op, T, 3>()(out.get(), out.strides(), out.dims(),
-                                    loc.get(),
-                                    in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 4:
-            ireduce_dim<op, T, 4>()(out.get(), out.strides(), out.dims(),
-                                    loc.get(),
-                                    in.get(), in.strides(), in.dims(), dim);
-            break;
-        }
-    }
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in) {
+    in.eval();
+    getQueue().sync();
 
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in)
-    {
-        af::dim4 dims = in.dims();
-        af::dim4 strides = in.strides();
-        const T *inPtr = in.get();
+    af::dim4 dims    = in.dims();
+    af::dim4 strides = in.strides();
+    const T *inPtr   = in.get();
+    dim_t idx = 0;
 
-        MinMaxOp<op, T> Op(inPtr[0], 0);
+    kernel::MinMaxOp<op, T> Op(inPtr[0], 0);
 
-        for(dim_t l = 0; l < dims[3]; l++) {
-            dim_t off3 = l * strides[3];
+    for (dim_t l = 0; l < dims[3]; l++) {
+        dim_t off3 = l * strides[3];
 
-            for(dim_t k = 0; k < dims[2]; k++) {
-                dim_t off2 = k * strides[2];
+        for (dim_t k = 0; k < dims[2]; k++) {
+            dim_t off2 = k * strides[2];
 
-                for(dim_t j = 0; j < dims[1]; j++) {
-                    dim_t off1 = j * strides[1];
+            for (dim_t j = 0; j < dims[1]; j++) {
+                dim_t off1 = j * strides[1];
 
-                    for(dim_t i = 0; i < dims[0]; i++) {
-                        dim_t idx = i + off1 + off2 + off3;
-                        Op(inPtr[idx], idx);
-                    }
+                for (dim_t i = 0; i < dims[0]; i++) {
+                    dim_t d_idx = i + off1 + off2 + off3;
+                    Op(inPtr[d_idx], idx++);
                 }
             }
         }
-
-        *loc = Op.m_idx;
-        return Op.m_val;
     }
 
-#define INSTANTIATE(ROp, T)                                             \
-    template void ireduce<ROp, T>(Array<T> &out, Array<uint> &loc,      \
-                                  const Array<T> &in, const int dim);   \
-    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);  \
-
-    //min
-    INSTANTIATE(af_min_t, float  )
-    INSTANTIATE(af_min_t, double )
-    INSTANTIATE(af_min_t, cfloat )
-    INSTANTIATE(af_min_t, cdouble)
-    INSTANTIATE(af_min_t, int    )
-    INSTANTIATE(af_min_t, uint   )
-    INSTANTIATE(af_min_t, char   )
-    INSTANTIATE(af_min_t, uchar  )
-
-    //max
-    INSTANTIATE(af_max_t, float  )
-    INSTANTIATE(af_max_t, double )
-    INSTANTIATE(af_max_t, cfloat )
-    INSTANTIATE(af_max_t, cdouble)
-    INSTANTIATE(af_max_t, int    )
-    INSTANTIATE(af_max_t, uint   )
-    INSTANTIATE(af_max_t, char   )
-    INSTANTIATE(af_max_t, uchar  )
+    *loc = Op.m_idx;
+    return Op.m_val;
 }
+
+#define INSTANTIATE(ROp, T)                                           \
+    template void ireduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim); \
+    template void rreduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim,  \
+                                  const Array<uint> &rlen);           \
+    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);
+
+// min
+INSTANTIATE(af_min_t, float)
+INSTANTIATE(af_min_t, double)
+INSTANTIATE(af_min_t, cfloat)
+INSTANTIATE(af_min_t, cdouble)
+INSTANTIATE(af_min_t, int)
+INSTANTIATE(af_min_t, uint)
+INSTANTIATE(af_min_t, intl)
+INSTANTIATE(af_min_t, uintl)
+INSTANTIATE(af_min_t, char)
+INSTANTIATE(af_min_t, schar)
+INSTANTIATE(af_min_t, uchar)
+INSTANTIATE(af_min_t, short)
+INSTANTIATE(af_min_t, ushort)
+INSTANTIATE(af_min_t, half)
+
+// max
+INSTANTIATE(af_max_t, float)
+INSTANTIATE(af_max_t, double)
+INSTANTIATE(af_max_t, cfloat)
+INSTANTIATE(af_max_t, cdouble)
+INSTANTIATE(af_max_t, int)
+INSTANTIATE(af_max_t, uint)
+INSTANTIATE(af_max_t, intl)
+INSTANTIATE(af_max_t, uintl)
+INSTANTIATE(af_max_t, char)
+INSTANTIATE(af_max_t, schar)
+INSTANTIATE(af_max_t, uchar)
+INSTANTIATE(af_max_t, short)
+INSTANTIATE(af_max_t, ushort)
+INSTANTIATE(af_max_t, half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/ireduce.hpp b/src/backend/cpu/ireduce.hpp
index 22f43e9a50..301ee65e53 100644
--- a/src/backend/cpu/ireduce.hpp
+++ b/src/backend/cpu/ireduce.hpp
@@ -7,16 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace cpu
-{
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace cpu {
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim);
 
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in);
-}
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen);
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/jit/BinaryNode.hpp b/src/backend/cpu/jit/BinaryNode.hpp
new file mode 100644
index 0000000000..424e37a63f
--- /dev/null
+++ b/src/backend/cpu/jit/BinaryNode.hpp
@@ -0,0 +1,98 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <binary.hpp>
+#include <common/jit/Node.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+
+#include <array>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+
+namespace jit {
+
+template<typename To, typename Ti, af_op_t op>
+class BinaryNode : public TNode<compute_t<To>> {
+   protected:
+    BinOp<compute_t<To>, compute_t<Ti>, op> m_op;
+    using TNode<compute_t<To>>::m_children;
+
+   public:
+    BinaryNode(common::Node_ptr lhs, common::Node_ptr rhs)
+        : TNode<compute_t<To>>(compute_t<To>(0),
+                               std::max(lhs->getHeight(), rhs->getHeight()) + 1,
+                               {{lhs, rhs}}, common::kNodeType::Nary) {}
+
+    std::unique_ptr<common::Node> clone() final {
+        return std::make_unique<BinaryNode>(*this);
+    }
+
+    af_op_t getOp() const noexcept final { return op; }
+
+    void calc(int x, int y, int z, int w, int lim) final {
+        UNUSED(x);
+        UNUSED(y);
+        UNUSED(z);
+        UNUSED(w);
+        auto lhs = static_cast<TNode<compute_t<Ti>> *>(m_children[0].get());
+        auto rhs = static_cast<TNode<compute_t<Ti>> *>(m_children[1].get());
+        m_op.eval(this->m_val, lhs->m_val, rhs->m_val, lim);
+    }
+
+    void calc(int idx, int lim) final {
+        UNUSED(idx);
+        auto lhs = static_cast<TNode<compute_t<Ti>> *>(m_children[0].get());
+        auto rhs = static_cast<TNode<compute_t<Ti>> *>(m_children[1].get());
+        m_op.eval(this->m_val, lhs->m_val, rhs->m_val, lim);
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        UNUSED(kerString);
+        UNUSED(ids);
+    }
+
+    void genParams(std::stringstream &kerStream, int id,
+                   bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void *ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const override {
+        UNUSED(is_linear);
+        UNUSED(setArg);
+        return start_id++;
+    }
+
+    void genOffsets(std::stringstream &kerStream, int id,
+                    bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        UNUSED(kerStream);
+        UNUSED(ids);
+    }
+};
+
+}  // namespace jit
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/jit/BufferNode.hpp b/src/backend/cpu/jit/BufferNode.hpp
new file mode 100644
index 0000000000..ca3cfe7bb5
--- /dev/null
+++ b/src/backend/cpu/jit/BufferNode.hpp
@@ -0,0 +1,195 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <optypes.hpp>
+#include <af/defines.h>
+#include "Node.hpp"
+
+#include <functional>
+#include <memory>
+#include <sstream>
+#include <string>
+
+namespace arrayfire {
+namespace cpu {
+
+namespace jit {
+
+template<typename T>
+class BufferNode : public TNode<T> {
+   protected:
+    std::shared_ptr<T> m_data;
+    T *m_ptr;
+    unsigned m_bytes;
+    dim_t m_strides[4];
+    dim_t m_dims[4];
+    bool m_linear_buffer;
+
+   public:
+    BufferNode()
+        : TNode<T>(T(0), 0, {}, common::kNodeType::Buffer)
+        , m_bytes(0)
+        , m_strides{0, 0, 0, 0}
+        , m_dims{0, 0, 0, 0}
+        , m_linear_buffer(true) {}
+
+    std::unique_ptr<common::Node> clone() final {
+        return std::make_unique<BufferNode>(*this);
+    }
+
+    void setData(std::shared_ptr<T> data, unsigned bytes, dim_t data_off,
+                 const dim_t *dims, const dim_t *strides,
+                 const bool is_linear) {
+        m_data          = data;
+        m_ptr           = data.get() + data_off;
+        m_bytes         = bytes;
+        m_linear_buffer = is_linear;
+        for (int i = 0; i < 4; i++) {
+            m_strides[i] = strides[i];
+            m_dims[i]    = dims[i];
+        }
+    }
+
+    void setShape(af::dim4 new_shape) final {
+        auto new_strides = calcStrides(new_shape);
+        m_dims[0]        = new_shape[0];
+        m_dims[1]        = new_shape[1];
+        m_dims[2]        = new_shape[2];
+        m_dims[3]        = new_shape[3];
+        m_strides[0]     = new_strides[0];
+        m_strides[1]     = new_strides[1];
+        m_strides[2]     = new_strides[2];
+        m_strides[3]     = new_strides[3];
+    }
+
+    void calc(int x, int y, int z, int w, int lim) final {
+        using Tc = compute_t<T>;
+
+        dim_t l_off = 0;
+        l_off += (w < (int)m_dims[3]) * w * m_strides[3];
+        l_off += (z < (int)m_dims[2]) * z * m_strides[2];
+        l_off += (y < (int)m_dims[1]) * y * m_strides[1];
+        T *in_ptr   = m_ptr + l_off;
+        Tc *out_ptr = this->m_val.data();
+        for (int i = 0; i < lim; i++) {
+            out_ptr[i] =
+                static_cast<Tc>(in_ptr[((x + i) < m_dims[0]) ? (x + i) : 0]);
+        }
+    }
+
+    void calc(int idx, int lim) final {
+        using Tc = compute_t<T>;
+
+        T *in_ptr   = m_ptr + idx;
+        Tc *out_ptr = this->m_val.data();
+        for (int i = 0; i < lim; i++) {
+            out_ptr[i] = static_cast<Tc>(in_ptr[i]);
+        }
+    }
+
+    void getInfo(unsigned &len, unsigned &buf_count,
+                 unsigned &bytes) const final {
+        len++;
+        buf_count++;
+        bytes += m_bytes;
+        return;
+    }
+
+    size_t getBytes() const final { return m_bytes; }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        UNUSED(kerString);
+        UNUSED(ids);
+    }
+
+    void genParams(std::stringstream &kerStream, int id,
+                   bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void *ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const override {
+        UNUSED(is_linear);
+        UNUSED(setArg);
+        return start_id++;
+    }
+
+    void genOffsets(std::stringstream &kerStream, int id,
+                    bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        UNUSED(kerStream);
+        UNUSED(ids);
+    }
+
+    bool isLinear(const dim_t *dims) const final {
+        return m_linear_buffer && dims[0] == m_dims[0] &&
+               dims[1] == m_dims[1] && dims[2] == m_dims[2] &&
+               dims[3] == m_dims[3];
+    }
+
+    size_t getHash() const noexcept final {
+        std::hash<const void *> ptr_hash;
+        std::hash<af::dtype> aftype_hash;
+        return ptr_hash(static_cast<const void *>(m_ptr)) ^
+               (aftype_hash(
+                    static_cast<af::dtype>(af::dtype_traits<T>::af_type))
+                << 1);
+    }
+
+    /// Compares two BufferNode objects for equality
+    bool operator==(const BufferNode<T> &other) const noexcept {
+        using std::begin;
+        using std::end;
+        using std::equal;
+        return m_ptr == other.m_ptr && m_bytes == other.m_bytes &&
+               m_linear_buffer == other.m_linear_buffer &&
+               equal(begin(m_dims), end(m_dims), begin(other.m_dims)) &&
+               equal(begin(m_strides), end(m_strides), begin(other.m_strides));
+    };
+
+    /// Overloads the equality operator to compare against a generic Node.
+    /// Calls the BufferNode equality operator if the other object is also
+    /// a Buffer node of the same type
+    bool operator==(const common::Node &other) const noexcept final {
+        if (other.isBuffer() && this->getType() == other.getType()) {
+            return *this == static_cast<const BufferNode<T> &>(other);
+        }
+        return false;
+    }
+
+    virtual void modDims(const af::dim4 &newDim) override {
+        af::dim4 strides(1, 1, 1, 1);
+        for (dim_t i = 1; i < 4; ++i) {
+            strides[i] = strides[i - 1] * newDim[i - 1];
+        }
+
+        for (dim_t i = 0; i < 4; ++i) {
+            m_dims[i] = newDim[i];
+            m_strides[i] = strides[i];
+        }
+    }
+
+};
+
+}  // namespace jit
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/jit/Node.hpp b/src/backend/cpu/jit/Node.hpp
new file mode 100644
index 0000000000..c40b0adf92
--- /dev/null
+++ b/src/backend/cpu/jit/Node.hpp
@@ -0,0 +1,58 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/defines.hpp>
+#include <common/half.hpp>
+#include <common/jit/Node.hpp>
+#include <common/traits.hpp>
+#include <optypes.hpp>
+#include <af/traits.hpp>
+
+#include <array>
+#include <memory>
+#include <unordered_map>
+
+namespace common {
+template<typename T>
+class NodeIterator;
+}
+
+namespace arrayfire {
+namespace cpu {
+
+namespace jit {
+constexpr int VECTOR_LENGTH = 256;
+
+template<typename T>
+using array = std::array<T, VECTOR_LENGTH>;
+
+}  // namespace jit
+
+template<typename T>
+class TNode : public common::Node {
+   public:
+    alignas(16) jit::array<compute_t<T>> m_val;
+    using arrayfire::common::Node::m_children;
+
+   public:
+    TNode(T val, const int height,
+          const std::array<common::Node_ptr, kMaxChildren> &&children,
+          common::kNodeType node_type)
+        : Node(static_cast<af::dtype>(af::dtype_traits<T>::af_type), height,
+               move(children), node_type) {
+        using namespace common;
+        m_val.fill(static_cast<compute_t<T>>(val));
+    }
+
+    virtual ~TNode() = default;
+};
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/jit/ScalarNode.hpp b/src/backend/cpu/jit/ScalarNode.hpp
new file mode 100644
index 0000000000..0b119deb82
--- /dev/null
+++ b/src/backend/cpu/jit/ScalarNode.hpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <optypes.hpp>
+#include <vector>
+#include "Node.hpp"
+
+namespace arrayfire {
+namespace cpu {
+
+namespace jit {
+
+template<typename T>
+class ScalarNode : public TNode<T> {
+   public:
+    ScalarNode(T val) : TNode<T>(val, 0, {}, common::kNodeType::Scalar) {}
+
+    std::unique_ptr<common::Node> clone() final {
+        return std::make_unique<ScalarNode>(*this);
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        UNUSED(kerString);
+        UNUSED(ids);
+    }
+
+    void genParams(std::stringstream &kerStream, int id,
+                   bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    int setArgs(int start_id, bool is_linear,
+                std::function<void(int id, const void *ptr, size_t arg_size,
+                                   bool is_buffer)>
+                    setArg) const override {
+        UNUSED(is_linear);
+        UNUSED(setArg);
+        return start_id;
+    }
+
+    void genOffsets(std::stringstream &kerStream, int id,
+                    bool is_linear) const final {
+        UNUSED(kerStream);
+        UNUSED(id);
+        UNUSED(is_linear);
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        UNUSED(kerStream);
+        UNUSED(ids);
+    }
+};
+}  // namespace jit
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/jit/UnaryNode.hpp b/src/backend/cpu/jit/UnaryNode.hpp
new file mode 100644
index 0000000000..5ca37ca8f4
--- /dev/null
+++ b/src/backend/cpu/jit/UnaryNode.hpp
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <math.hpp>
+#include <optypes.hpp>
+#include <types.hpp>
+#include "Node.hpp"
+
+#include <jit/BufferNode.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+template<typename To, typename Ti, af_op_t op>
+struct UnOp {
+    void eval(jit::array<compute_t<To>> &out,
+              const jit::array<compute_t<Ti>> &in, int lim) const;
+};
+
+namespace jit {
+
+template<typename To, typename Ti, af_op_t op>
+class UnaryNode : public TNode<To> {
+   protected:
+    using arrayfire::common::Node::m_children;
+    UnOp<To, Ti, op> m_op;
+
+   public:
+    UnaryNode(common::Node_ptr child)
+        : TNode<To>(To(0), child->getHeight() + 1, {{child}},
+                    common::kNodeType::Nary) {}
+
+    std::unique_ptr<common::Node> clone() final {
+        return std::make_unique<UnaryNode>(*this);
+    }
+
+    af_op_t getOp() const noexcept final { return op; }
+
+    void calc(int x, int y, int z, int w, int lim) final {
+        UNUSED(x);
+        UNUSED(y);
+        UNUSED(z);
+        UNUSED(w);
+        auto child = static_cast<TNode<Ti> *>(m_children[0].get());
+        m_op.eval(TNode<To>::m_val, child->m_val, lim);
+    }
+
+    void calc(int idx, int lim) final {
+        UNUSED(idx);
+        auto child = static_cast<TNode<Ti> *>(m_children[0].get());
+        m_op.eval(TNode<To>::m_val, child->m_val, lim);
+    }
+
+    void genKerName(std::string &kerString,
+                    const common::Node_ids &ids) const final {
+        UNUSED(kerString);
+        UNUSED(ids);
+    }
+
+    void genFuncs(std::stringstream &kerStream,
+                  const common::Node_ids &ids) const final {
+        UNUSED(kerStream);
+        UNUSED(ids);
+    }
+};
+
+}  // namespace jit
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/join.cpp b/src/backend/cpu/join.cpp
index eeb34a01c7..602f2db7f9 100644
--- a/src/backend/cpu/join.cpp
+++ b/src/backend/cpu/join.cpp
@@ -8,240 +8,96 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <common/half.hpp>
 #include <join.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    template<typename To, typename Tx, int dim>
-    void join_append(To *out, const Tx *X, const af::dim4 &offset,
-               const af::dim4 &odims, const af::dim4 &xdims,
-               const af::dim4 &ost, const af::dim4 &xst)
-    {
-        for(dim_t ow = 0; ow < xdims[3]; ow++) {
-            const dim_t xW = ow * xst[3];
-            const dim_t oW = (ow + offset[3]) * ost[3];
-
-            for(dim_t oz = 0; oz < xdims[2]; oz++) {
-                const dim_t xZW = xW + oz * xst[2];
-                const dim_t oZW = oW + (oz + offset[2]) * ost[2];
-
-                for(dim_t oy = 0; oy < xdims[1]; oy++) {
-                    const dim_t xYZW = xZW + oy * xst[1];
-                    const dim_t oYZW = oZW + (oy + offset[1]) * ost[1];
-
-                    for(dim_t ox = 0; ox < xdims[0]; ox++) {
-                        const dim_t iMem = xYZW + ox;
-                        const dim_t oMem = oYZW + (ox + offset[0]);
-                        out[oMem] = X[iMem];
-                    }
-                }
-            }
+#include <kernel/join.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+#include <algorithm>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<T> join(const int dim, const Array<T> &first, const Array<T> &second) {
+    // All dimensions except join dimension must be equal
+    // Compute output dims
+    af::dim4 odims;
+    af::dim4 fdims = first.dims();
+    af::dim4 sdims = second.dims();
+
+    for (int i = 0; i < 4; i++) {
+        if (i == dim) {
+            odims[i] = fdims[i] + sdims[i];
+        } else {
+            odims[i] = fdims[i];
         }
     }
 
-    template<int dim>
-    af::dim4 calcOffset(const af::dim4 dims)
-    {
-        af::dim4 offset;
-        offset[0] = (dim == 0) ? dims[0] : 0;
-        offset[1] = (dim == 1) ? dims[1] : 0;
-        offset[2] = (dim == 2) ? dims[2] : 0;
-        offset[3] = (dim == 3) ? dims[3] : 0;
-        return offset;
-    }
-
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second)
-    {
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        af::dim4 fdims = first.dims();
-        af::dim4 sdims = second.dims();
-
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = fdims[i] + sdims[i];
-            } else {
-                odims[i] = fdims[i];
-            }
-        }
-
-        Array<Tx> out = createEmptyArray<Tx>(odims);
-
-        Tx* outPtr = out.get();
-        const Tx* fptr = first.get();
-        const Ty* sptr = second.get();
-
-        af::dim4 zero(0,0,0,0);
-
-        switch(dim) {
-            case 0:
-                join_append<Tx, Tx, 0>(outPtr, fptr, zero,
-                                       odims, fdims, out.strides(), first.strides());
-                join_append<Tx, Ty, 0>(outPtr, sptr, calcOffset<0>(fdims),
-                                       odims, sdims, out.strides(), second.strides());
-                break;
-            case 1:
-                join_append<Tx, Tx, 1>(outPtr, fptr, zero,
-                                       odims, fdims, out.strides(), first.strides());
-                join_append<Tx, Ty, 1>(outPtr, sptr, calcOffset<1>(fdims),
-                                       odims, sdims, out.strides(), second.strides());
-                break;
-            case 2:
-                join_append<Tx, Tx, 2>(outPtr, fptr, zero,
-                                       odims, fdims, out.strides(), first.strides());
-                join_append<Tx, Ty, 2>(outPtr, sptr, calcOffset<2>(fdims),
-                                       odims, sdims, out.strides(), second.strides());
-                break;
-            case 3:
-                join_append<Tx, Tx, 3>(outPtr, fptr, zero,
-                                       odims, fdims, out.strides(), first.strides());
-                join_append<Tx, Ty, 3>(outPtr, sptr, calcOffset<3>(fdims),
-                                       odims, sdims, out.strides(), second.strides());
-                break;
-        }
-
-        return out;
-    }
-
-    template<typename T, int n_arrays>
-    void join_wrapper(const int dim, Array<T> &out, const std::vector<Array<T>> &inputs)
-    {
-        af::dim4 zero(0,0,0,0);
-        af::dim4 d = zero;
-        switch(dim) {
-            case 0:
-                join_append<T, T, 0>(out.get(), inputs[0].get(), zero,
-                            out.dims(), inputs[0].dims(), out.strides(), inputs[0].strides());
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    join_append<T, T, 0>(out.get(), inputs[i].get(), calcOffset<0>(d),
-                            out.dims(), inputs[i].dims(), out.strides(), inputs[i].strides());
-                }
-                break;
-            case 1:
-                join_append<T, T, 1>(out.get(), inputs[0].get(), zero,
-                            out.dims(), inputs[0].dims(), out.strides(), inputs[0].strides());
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    join_append<T, T, 1>(out.get(), inputs[i].get(), calcOffset<1>(d),
-                            out.dims(), inputs[i].dims(), out.strides(), inputs[i].strides());
-                }
-                break;
-            case 2:
-                join_append<T, T, 2>(out.get(), inputs[0].get(), zero,
-                            out.dims(), inputs[0].dims(), out.strides(), inputs[0].strides());
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    join_append<T, T, 2>(out.get(), inputs[i].get(), calcOffset<2>(d),
-                            out.dims(), inputs[i].dims(), out.strides(), inputs[i].strides());
-                }
-                break;
-            case 3:
-                join_append<T, T, 3>(out.get(), inputs[0].get(), zero,
-                            out.dims(), inputs[0].dims(), out.strides(), inputs[0].strides());
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    join_append<T, T, 3>(out.get(), inputs[i].get(), calcOffset<3>(d),
-                            out.dims(), inputs[i].dims(), out.strides(), inputs[i].strides());
-                }
-                break;
-        }
-    }
+    Array<T> out = createEmptyArray<T>(odims);
+    std::vector<CParam<T>> v{first, second};
+    getQueue().enqueue(kernel::join<T>, dim, out, v, 2);
 
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T>> &inputs)
-    {
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        const dim_t n_arrays = inputs.size();
-        std::vector<af::dim4> idims(n_arrays);
-
-        dim_t dim_size = 0;
-        for(int i = 0; i < (int)idims.size(); i++) {
-            idims[i] = inputs[i].dims();
-            dim_size += idims[i][dim];
-        }
+    return out;
+}
 
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = dim_size;
-            } else {
-                odims[i] = idims[0][i];
-            }
-        }
+template<typename T>
+void join(Array<T> &out, const int dim, const std::vector<Array<T>> &inputs) {
+    const dim_t n_arrays = inputs.size();
 
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(n_arrays) {
-            case 1:
-                join_wrapper<T, 1>(dim, out, inputs);
-                break;
-            case 2:
-                join_wrapper<T, 2>(dim, out, inputs);
-                break;
-            case 3:
-                join_wrapper<T, 3>(dim, out, inputs);
-                break;
-            case 4:
-                join_wrapper<T, 4>(dim, out, inputs);
-                break;
-            case 5:
-                join_wrapper<T, 5>(dim, out, inputs);
-                break;
-            case 6:
-                join_wrapper<T, 6>(dim, out, inputs);
-                break;
-            case 7:
-                join_wrapper<T, 7>(dim, out, inputs);
-                break;
-            case 8:
-                join_wrapper<T, 8>(dim, out, inputs);
-                break;
-            case 9:
-                join_wrapper<T, 9>(dim, out, inputs);
-                break;
-            case 10:
-                join_wrapper<T,10>(dim, out, inputs);
-                break;
-        }
+    std::vector<Array<T> *> input_ptrs(inputs.size());
+    std::transform(
+        begin(inputs), end(inputs), begin(input_ptrs),
+        [](const Array<T> &input) { return const_cast<Array<T> *>(&input); });
+    evalMultiple(input_ptrs);
+    std::vector<CParam<T>> inputParams(inputs.begin(), inputs.end());
 
-        return out;
-    }
-
-#define INSTANTIATE(Tx, Ty) \
-    template Array<Tx> join<Tx, Ty>(const int dim, const Array<Tx> &first, const Array<Ty> &second);
+    getQueue().enqueue(kernel::join<T>, dim, out, inputParams, n_arrays);
+}
 
-    INSTANTIATE(float,   float)
-    INSTANTIATE(double,  double)
-    INSTANTIATE(cfloat,  cfloat)
-    INSTANTIATE(cdouble, cdouble)
-    INSTANTIATE(int,     int)
-    INSTANTIATE(uint,    uint)
-    INSTANTIATE(intl,    intl)
-    INSTANTIATE(uintl,   uintl)
-    INSTANTIATE(uchar,   uchar)
-    INSTANTIATE(char,    char)
+#define INSTANTIATE(T)                                              \
+    template Array<T> join<T>(const int dim, const Array<T> &first, \
+                              const Array<T> &second);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
 
 #undef INSTANTIATE
 
-#define INSTANTIATE(T)      \
-    template Array<T> join<T>(const int dim, const std::vector<Array<T>> &inputs);
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+#define INSTANTIATE(T)                                   \
+    template void join<T>(Array<T> & out, const int dim, \
+                          const std::vector<Array<T>> &inputs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
 
 #undef INSTANTIATE
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/join.hpp b/src/backend/cpu/join.hpp
index 4848edd27d..f13bea2fed 100644
--- a/src/backend/cpu/join.hpp
+++ b/src/backend/cpu/join.hpp
@@ -7,15 +7,15 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 #include <vector>
 
-namespace cpu
-{
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> join(const int dim, const Array<T> &first, const Array<T> &second);
 
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T>> &inputs);
-}
+template<typename T>
+void join(Array<T> &output, const int dim, const std::vector<Array<T>> &inputs);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/Array.hpp b/src/backend/cpu/kernel/Array.hpp
new file mode 100644
index 0000000000..7af4e35555
--- /dev/null
+++ b/src/backend/cpu/kernel/Array.hpp
@@ -0,0 +1,212 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/jit/ModdimNode.hpp>
+#include <common/jit/Node.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <jit/BufferNode.hpp>
+#include <jit/Node.hpp>
+#include <jit/UnaryNode.hpp>
+#include <platform.hpp>
+#include <algorithm>
+#include <cmath>
+#include <iterator>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+/// Clones the nodes in node_index_map and updates the child pointers
+std::vector<std::shared_ptr<common::Node>> cloneNodes(
+    const std::vector<common::Node *> &node_index_map,
+    const std::vector<common::Node_ids> &ids) {
+    using arrayfire::common::Node;
+    // clone every node in the tree
+    std::vector<std::shared_ptr<Node>> node_clones;
+    node_clones.reserve(node_index_map.size());
+    transform(begin(node_index_map), end(node_index_map),
+              back_inserter(node_clones), [](Node *n) { return n->clone(); });
+
+    for (common::Node_ids id : ids) {
+        auto &children = node_clones[id.id]->m_children;
+        for (int i = 0; i < Node::kMaxChildren && children[i] != nullptr; i++) {
+            children[i] = node_clones[id.child_ids[i]];
+        }
+    }
+    return node_clones;
+}
+
+/// Sets the shape of the buffer nodes under each moddims node to the
+/// new shape
+void propagateModdimsShape(
+    std::vector<std::shared_ptr<common::Node>> &node_clones) {
+    using arrayfire::common::NodeIterator;
+    for (auto &node : node_clones) {
+        if (node->getOp() == af_moddims_t) {
+            common::ModdimNode *mn =
+                static_cast<common::ModdimNode *>(node.get());
+
+            NodeIterator<> it(node.get());
+            while (it != NodeIterator<>()) {
+                it = std::find_if(it, NodeIterator<>(), common::isBuffer);
+                if (it == NodeIterator<>()) { break; }
+
+                it->setShape(mn->m_new_shape);
+
+                ++it;
+            }
+        }
+    }
+}
+
+/// Removes nodes whose operation matches a unary operation \p op.
+void removeNodeOfOperation(
+    std::vector<std::shared_ptr<common::Node>> &node_index_map, af_op_t op) {
+    using arrayfire::common::Node;
+
+    for (size_t nid = 0; nid < node_index_map.size(); nid++) {
+        auto &node = node_index_map[nid];
+
+        for (int i = 0;
+             i < Node::kMaxChildren && node->m_children[i] != nullptr; i++) {
+            if (node->m_children[i]->getOp() == op) {
+                // bypass the matching child by linking to its own child
+                auto moddim_node    = node->m_children[i];
+                node->m_children[i] = moddim_node->m_children[0];
+            }
+        }
+    }
+
+    node_index_map.erase(remove_if(begin(node_index_map), end(node_index_map),
+                                   [op](std::shared_ptr<Node> &node) {
+                                       return node->getOp() == op;
+                                   }),
+                         end(node_index_map));
+}
+
+/// Returns the cloned output_nodes located in the node_clones array
+///
+/// This function returns the new cloned version of the output_nodes_ from
+/// the node_clones array. If the output node is a moddim node, then it will
+/// set the output node to be its first non-moddim node child
+template<typename T>
+std::vector<TNode<T> *> getClonedOutputNodes(
+    common::Node_map_t &node_index_map,
+    const std::vector<std::shared_ptr<common::Node>> &node_clones,
+    const std::vector<common::Node_ptr> &output_nodes_) {
+    std::vector<TNode<T> *> cloned_output_nodes;
+    cloned_output_nodes.reserve(output_nodes_.size());
+    for (auto &n : output_nodes_) {
+        TNode<T> *ptr;
+        if (n->getOp() == af_moddims_t) {
+            // if the output node is a moddims node, then set the output node
+            // to be the child of the moddims node. This is necessary because
+            // we remove the moddims nodes from the tree later
+            int child_index = node_index_map[n->m_children[0].get()];
+            ptr = static_cast<TNode<T> *>(node_clones[child_index].get());
+            while (ptr->getOp() == af_moddims_t) {
+                ptr = static_cast<TNode<T> *>(ptr->m_children[0].get());
+            }
+        } else {
+            int node_index = node_index_map[n.get()];
+            ptr = static_cast<TNode<T> *>(node_clones[node_index].get());
+        }
+        cloned_output_nodes.push_back(ptr);
+    }
+    return cloned_output_nodes;
+}
+
+template<typename T>
+void evalMultiple(std::vector<Param<T>> arrays,
+                  std::vector<common::Node_ptr> output_nodes_) {
+    using arrayfire::common::ModdimNode;
+    using arrayfire::common::Node;
+    using arrayfire::common::Node_map_t;
+    using arrayfire::common::NodeIterator;
+
+    af::dim4 odims = arrays[0].dims();
+    af::dim4 ostrs = arrays[0].strides();
+
+    Node_map_t node_index_map;
+    std::vector<T *> ptrs;
+    std::vector<common::Node *> full_nodes;
+    std::vector<common::Node_ids> ids;
+
+    int narrays = static_cast<int>(arrays.size());
+    ptrs.reserve(narrays);
+    for (int i = 0; i < narrays; i++) {
+        ptrs.push_back(arrays[i].get());
+        output_nodes_[i]->getNodesMap(node_index_map, full_nodes, ids);
+    }
+    auto node_clones = cloneNodes(full_nodes, ids);
+
+    std::vector<TNode<T> *> cloned_output_nodes =
+        getClonedOutputNodes<T>(node_index_map, node_clones, output_nodes_);
+    propagateModdimsShape(node_clones);
+    removeNodeOfOperation(node_clones, af_moddims_t);
+
+    bool is_linear = true;
+    for (auto &node : node_clones) { is_linear &= node->isLinear(odims.get()); }
+
+    int num_nodes        = node_clones.size();
+    int num_output_nodes = cloned_output_nodes.size();
+    if (is_linear) {
+        int num = arrays[0].dims().elements();
+        int cnum =
+            jit::VECTOR_LENGTH * std::ceil(double(num) / jit::VECTOR_LENGTH);
+        for (int i = 0; i < cnum; i += jit::VECTOR_LENGTH) {
+            int lim = std::min(jit::VECTOR_LENGTH, num - i);
+            for (int n = 0; n < num_nodes; n++) {
+                node_clones[n]->calc(i, lim);
+            }
+            for (int n = 0; n < num_output_nodes; n++) {
+                std::copy(cloned_output_nodes[n]->m_val.begin(),
+                          cloned_output_nodes[n]->m_val.begin() + lim,
+                          ptrs[n] + i);
+            }
+        }
+    } else {
+        for (int w = 0; w < (int)odims[3]; w++) {
+            dim_t offw = w * ostrs[3];
+
+            for (int z = 0; z < (int)odims[2]; z++) {
+                dim_t offz = z * ostrs[2] + offw;
+
+                for (int y = 0; y < (int)odims[1]; y++) {
+                    dim_t offy = y * ostrs[1] + offz;
+
+                    int dim0  = odims[0];
+                    int cdim0 = jit::VECTOR_LENGTH *
+                                std::ceil(double(dim0) / jit::VECTOR_LENGTH);
+                    for (int x = 0; x < (int)cdim0; x += jit::VECTOR_LENGTH) {
+                        int lim  = std::min(jit::VECTOR_LENGTH, dim0 - x);
+                        dim_t id = x + offy;
+
+                        for (int n = 0; n < num_nodes; n++) {
+                            node_clones[n]->calc(x, y, z, w, lim);
+                        }
+                        for (int n = 0; n < num_output_nodes; n++) {
+                            std::copy(
+                                cloned_output_nodes[n]->m_val.begin(),
+                                cloned_output_nodes[n]->m_val.begin() + lim,
+                                ptrs[n] + id);
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/anisotropic_diffusion.hpp b/src/backend/cpu/kernel/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..1acad4857c
--- /dev/null
+++ b/src/backend/cpu/kernel/anisotropic_diffusion.hpp
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <math.hpp>
+
+#include <algorithm>
+#include <cassert>
+#include <cmath>
+
+using std::exp;
+using std::pow;
+using std::sqrt;
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+int index(int x, int y, int stride1) { return y * stride1 + x; }
+
+float quad(float value) { return 1.0f / (1.0f + value); }
+
+float computeGradientBasedUpdate(const float mct, const float NW, const float N,
+                                 const float NE, const float W, const float C,
+                                 const float E, const float SW, const float S,
+                                 const float SE,
+                                 const af_flux_function fftype) {
+    float delta = 0.f;
+
+    float dx, dy, df, db, cx, cxd;
+
+    // central-difference derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // forward/backward differences and conductance along first dimension
+    df = E - C;
+    db = C - W;
+
+    float gmsqf = (df * df + 0.25f * pow(dy + 0.5f * (SE - NE), 2.f)) * mct;
+    float gmsqb = (db * db + 0.25f * pow(dy + 0.5f * (SW - NW), 2.f)) * mct;
+    if (fftype == AF_FLUX_EXPONENTIAL) {
+        cx  = exp(gmsqf);
+        cxd = exp(gmsqb);
+    } else {
+        cx  = quad(gmsqf);
+        cxd = quad(gmsqb);
+    }
+    delta = (cx * df - cxd * db);
+
+    // forward/backward differences and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    gmsqf = (df * df + 0.25f * pow(dx + 0.5f * (SE - SW), 2.f)) * mct;
+    gmsqb = (db * db + 0.25f * pow(dx + 0.5f * (NE - NW), 2.f)) * mct;
+    if (fftype == AF_FLUX_EXPONENTIAL) {
+        cx  = exp(gmsqf);
+        cxd = exp(gmsqb);
+    } else {
+        cx  = quad(gmsqf);
+        cxd = quad(gmsqb);
+    }
+    delta += (cx * df - cxd * db);
+
+    return delta;
+}
+
+float computeCurvatureBasedUpdate(const float mct, const float NW,
+                                  const float N, const float NE, const float W,
+                                  const float C, const float E, const float SW,
+                                  const float S, const float SE) {
+    float delta     = 0.f;
+    float prop_grad = 0.f;
+
+    float df0, db0;
+    float dx, dy, df, db, cx, cxd, gmf, gmb, gmsqf, gmsqb;
+
+    // central-difference derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // forward/backward differences and conductance along first dimension
+    df  = E - C;
+    db  = C - W;
+    df0 = df;
+    db0 = db;
+
+    gmsqf = (df * df + 0.25f * pow(dy + 0.5f * (SE - NE), 2.f));
+    gmsqb = (db * db + 0.25f * pow(dy + 0.5f * (SW - NW), 2.f));
+
+    gmf = sqrt(1.0e-10f + gmsqf);
+    gmb = sqrt(1.0e-10f + gmsqb);
+
+    cx  = exp(gmsqf * mct);
+    cxd = exp(gmsqb * mct);
+
+    delta = ((df / gmf) * cx - (db / gmb) * cxd);
+
+    // forward/backward differences and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    gmsqf = (df * df + 0.25f * pow(dx + 0.5f * (SE - SW), 2.f));
+    gmsqb = (db * db + 0.25f * pow(dx + 0.5f * (NE - NW), 2.f));
+    gmf   = sqrt(1.0e-10f + gmsqf);
+    gmb   = sqrt(1.0e-10f + gmsqb);
+
+    cx  = exp(gmsqf * mct);
+    cxd = exp(gmsqb * mct);
+
+    delta += ((df / gmf) * cx - (db / gmb) * cxd);
+
+    if (delta > 0.f) {
+        prop_grad +=
+            (pow(fminf(db0, 0.0f), 2.0f) + pow(fmaxf(df0, 0.0f), 2.0f));
+        prop_grad += (pow(fminf(db, 0.0f), 2.0f) + pow(fmaxf(df, 0.0f), 2.0f));
+    } else {
+        prop_grad +=
+            (pow(fmaxf(db0, 0.0f), 2.0f) + pow(fminf(df0, 0.0f), 2.0f));
+        prop_grad += (pow(fmaxf(db, 0.0f), 2.0f) + pow(fminf(df, 0.0f), 2.0f));
+    }
+
+    return sqrt(prop_grad) * delta;
+}
+
+template<typename T, bool isMCDE>
+void anisotropicDiffusion(Param<T> inout, const float dt, const float mct,
+                          const af_flux_function fftype) {
+    const auto dims     = inout.dims();
+    const auto strides  = inout.strides();
+    const auto d1stride = strides[1];
+    const int d0        = dims[0] - 1;
+    const int d1        = dims[1] - 1;
+    const int d2        = dims[2];
+    const int d3        = dims[3];
+
+    for (int b3 = 0; b3 < d3; ++b3) {
+        for (int b2 = 0; b2 < d2; ++b2) {
+            T* img = inout.get() + b2 * strides[2] + b3 * strides[3];
+            for (int j = 1; j < d1; ++j) {
+                for (int i = 1; i < d0; ++i) {
+                    float C     = 0.f;
+                    float delta = 0.f;
+
+                    const int ip1 = i + 1;
+                    const int im1 = i - 1;
+                    const int jp1 = j + 1;
+                    const int jm1 = j - 1;
+
+                    if (isMCDE) {
+                        delta = computeCurvatureBasedUpdate(
+                            mct, img[index(im1, jm1, d1stride)],
+                            img[index(i, jm1, d1stride)],
+                            img[index(ip1, jm1, d1stride)],
+                            img[index(im1, j, d1stride)],
+                            C = img[index(i, j, d1stride)],
+                            img[index(ip1, j, d1stride)],
+                            img[index(im1, jp1, d1stride)],
+                            img[index(i, jp1, d1stride)],
+                            img[index(ip1, jp1, d1stride)]);
+
+                    } else {
+                        delta = computeGradientBasedUpdate(
+                            mct, img[index(im1, jm1, d1stride)],
+                            img[index(i, jm1, d1stride)],
+                            img[index(ip1, jm1, d1stride)],
+                            img[index(im1, j, d1stride)],
+                            C = img[index(i, j, d1stride)],
+                            img[index(ip1, j, d1stride)],
+                            img[index(im1, jp1, d1stride)],
+                            img[index(i, jp1, d1stride)],
+                            img[index(ip1, jp1, d1stride)], fftype);
+                    }
+
+                    img[i + j * d1stride] = (T)(C + delta * dt);
+                }
+            }
+        }
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/approx.hpp b/src/backend/cpu/kernel/approx.hpp
new file mode 100644
index 0000000000..826b124fdb
--- /dev/null
+++ b/src/backend/cpu/kernel/approx.hpp
@@ -0,0 +1,141 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+#include "interp.hpp"
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename InT, typename LocT, int order>
+void approx1(Param<InT> yo, CParam<InT> yi, CParam<LocT> xo, const int xdim,
+             const LocT &xi_beg, const LocT &xi_step, const float offGrid,
+             af_interp_type method) {
+    InT *yo_ptr        = yo.get();
+    const LocT *xo_ptr = xo.get();
+
+    const af::dim4 yo_dims = yo.dims();
+    const af::dim4 yi_dims = yi.dims();
+    const af::dim4 xo_dims = xo.dims();
+
+    const af::dim4 yo_strides = yo.strides();
+    const af::dim4 yi_strides = yi.strides();
+    const af::dim4 xo_strides = xo.strides();
+
+    Interp1<InT, LocT, order> interp;
+    bool is_xo_off[] = {xo_dims[0] > 1, xo_dims[1] > 1, xo_dims[2] > 1,
+                        xo_dims[3] > 1};
+    bool is_yi_off[] = {true, true, true, true};
+    is_yi_off[xdim]  = false;
+
+    for (dim_t idw = 0; idw < yo_dims[3]; idw++) {
+        for (dim_t idz = 0; idz < yo_dims[2]; idz++) {
+            dim_t yo_off_zw = idw * yo_strides[3] + idz * yo_strides[2];
+            dim_t yi_off_zw = idw * yi_strides[3] * is_yi_off[3] +
+                              idz * yi_strides[2] * is_yi_off[2];
+            dim_t xo_off_zw = idw * xo_strides[3] * is_xo_off[3] +
+                              idz * xo_strides[2] * is_xo_off[2];
+
+            for (dim_t idy = 0; idy < yo_dims[1]; idy++) {
+                dim_t yo_off = yo_off_zw + idy * yo_strides[1];
+                dim_t yi_off = yi_off_zw + idy * yi_strides[1] * is_yi_off[1];
+                dim_t xo_off = xo_off_zw + idy * xo_strides[1] * is_xo_off[1];
+
+                for (dim_t idx = 0; idx < yo_dims[0]; idx++) {
+                    dim_t yi_idx = idx * is_yi_off[0];
+                    const LocT x =
+                        (xo_ptr[xo_off + idx * is_xo_off[0]] - xi_beg) /
+                        xi_step;
+
+                    // FIXME: Only cubic interpolation is doing clamping
+                    // We need to make it consistent across all methods
+                    // Not changing the behavior because tests will fail
+                    bool clamp = order == 3;
+
+                    if (x < 0 || yi_dims[xdim] < x + 1) {
+                        yo_ptr[yo_off + idx] = scalar<InT>(offGrid);
+                    } else {
+                        interp(yo, yo_off + idx, yi, yi_off + yi_idx, x, method,
+                               1, clamp, xdim);
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename InT, typename LocT, int order>
+void approx2(Param<InT> zo, CParam<InT> zi, CParam<LocT> xo, const int xdim,
+             const LocT &xi_beg, const LocT &xi_step, CParam<LocT> yo,
+             const int ydim, const LocT &yi_beg, const LocT &yi_step,
+             float const offGrid, af_interp_type method) {
+    InT *zo_ptr        = zo.get();
+    const LocT *xo_ptr = xo.get();
+    const LocT *yo_ptr = yo.get();
+
+    af::dim4 const zo_dims    = zo.dims();
+    af::dim4 const zi_dims    = zi.dims();
+    af::dim4 const xo_dims    = xo.dims();
+    af::dim4 const zo_strides = zo.strides();
+    af::dim4 const zi_strides = zi.strides();
+    af::dim4 const xo_strides = xo.strides();
+    af::dim4 const yo_strides = yo.strides();
+
+    Interp2<InT, LocT, order> interp;
+    bool is_xo_off[] = {xo_dims[0] > 1, xo_dims[1] > 1, xo_dims[2] > 1,
+                        xo_dims[3] > 1};
+    bool is_zi_off[] = {true, true, true, true};
+    is_zi_off[xdim]  = false;
+    is_zi_off[ydim]  = false;
+
+    for (dim_t idw = 0; idw < zo_dims[3]; idw++) {
+        for (dim_t idz = 0; idz < zo_dims[2]; idz++) {
+            dim_t zo_off_zw = idw * zo_strides[3] + idz * zo_strides[2];
+            dim_t zi_off_zw = idw * zi_strides[3] * is_zi_off[3] +
+                              idz * zi_strides[2] * is_zi_off[2];
+            dim_t xo_off_zw = idw * xo_strides[3] * is_xo_off[3] +
+                              idz * xo_strides[2] * is_xo_off[2];
+            dim_t yo_off_zw = idw * yo_strides[3] * is_xo_off[3] +
+                              idz * yo_strides[2] * is_xo_off[2];
+
+            for (dim_t idy = 0; idy < zo_dims[1]; idy++) {
+                dim_t xo_off = xo_off_zw + idy * xo_strides[1] * is_xo_off[1];
+                dim_t yo_off = yo_off_zw + idy * yo_strides[1] * is_xo_off[1];
+                dim_t zi_off = zi_off_zw + idy * zi_strides[1] * is_zi_off[1];
+                dim_t zo_off = zo_off_zw + idy * zo_strides[1];
+
+                for (dim_t idx = 0; idx < zo_dims[0]; idx++) {
+                    const LocT x = (xo_ptr[xo_off + idx] - xi_beg) / xi_step;
+                    const LocT y = (yo_ptr[yo_off + idx] - yi_beg) / yi_step;
+
+                    dim_t zi_idx = idx * zi_strides[0] * is_zi_off[0];
+
+                    // FIXME: Only cubic interpolation is doing clamping
+                    // We need to make it consistent across all methods
+                    // Not changing the behavior because tests will fail
+                    bool clamp = order == 3;
+
+                    if (x < 0 || zi_dims[xdim] < x + 1 || y < 0 ||
+                        zi_dims[ydim] < y + 1) {
+                        zo_ptr[zo_off + idx] = scalar<InT>(offGrid);
+                    } else {
+                        interp(zo, zo_off + idx, zi, zi_off + zi_idx, x, y,
+                               method, 1, clamp, xdim, ydim);
+                    }
+                }
+            }
+        }
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/assign.hpp b/src/backend/cpu/kernel/assign.hpp
new file mode 100644
index 0000000000..4605f5d000
--- /dev/null
+++ b/src/backend/cpu/kernel/assign.hpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/ArrayInfo.hpp>
+#include <types.hpp>
+#include <utility.hpp>
+
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/seq.h>
+
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void assign(Param<T> out, af::dim4 dDims, CParam<T> rhs,
+            std::vector<bool> const isSeq, std::vector<af_seq> const seqs,
+            std::vector<CParam<uint>> idxArrs) {
+    af::dim4 pDims = out.dims();
+    // retrieve offsets & strides for the array that rhs is copied into
+    af::dim4 dst_offsets = toOffset(seqs, dDims);
+    af::dim4 dst_strides = toStride(seqs, dDims);
+    // retrieve rhs array dimensions & strides
+    af::dim4 src_dims    = rhs.dims();
+    af::dim4 src_strides = rhs.strides();
+    // declare pointers to af_array index data
+    uint const* const ptr0 = idxArrs[0].get();
+    uint const* const ptr1 = idxArrs[1].get();
+    uint const* const ptr2 = idxArrs[2].get();
+    uint const* const ptr3 = idxArrs[3].get();
+
+    const T* src = rhs.get();
+    T* dst       = out.get();
+
+    for (dim_t l = 0; l < src_dims[3]; ++l) {
+        dim_t src_loff = l * src_strides[3];
+
+        dim_t dst_lIdx =
+            trimIndex(isSeq[3] ? l + dst_offsets[3] : ptr3[l], pDims[3]);
+        dim_t dst_loff = dst_lIdx * dst_strides[3];
+
+        for (dim_t k = 0; k < src_dims[2]; ++k) {
+            dim_t src_koff = k * src_strides[2];
+
+            dim_t dst_kIdx =
+                trimIndex(isSeq[2] ? k + dst_offsets[2] : ptr2[k], pDims[2]);
+            dim_t dst_koff = dst_kIdx * dst_strides[2];
+
+            for (dim_t j = 0; j < src_dims[1]; ++j) {
+                dim_t src_joff = j * src_strides[1];
+
+                dim_t dst_jIdx = trimIndex(
+                    isSeq[1] ? j + dst_offsets[1] : ptr1[j], pDims[1]);
+                dim_t dst_joff = dst_jIdx * dst_strides[1];
+
+                for (dim_t i = 0; i < src_dims[0]; ++i) {
+                    dim_t src_ioff = i * src_strides[0];
+                    dim_t src_idx  = src_ioff + src_joff + src_koff + src_loff;
+
+                    dim_t dst_iIdx = trimIndex(
+                        isSeq[0] ? i + dst_offsets[0] : ptr0[i], pDims[0]);
+                    dim_t dst_ioff = dst_iIdx * dst_strides[0];
+                    dim_t dst_idx  = dst_ioff + dst_joff + dst_koff + dst_loff;
+
+                    dst[dst_idx] = src[src_idx];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/bilateral.hpp b/src/backend/cpu/kernel/bilateral.hpp
new file mode 100644
index 0000000000..72d8edd12c
--- /dev/null
+++ b/src/backend/cpu/kernel/bilateral.hpp
@@ -0,0 +1,90 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+#include <utility.hpp>
+#include <cmath>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename OutT, typename InT>
+void bilateral(Param<OutT> out, CParam<InT> in, float const s_sigma,
+               float const c_sigma) {
+    using std::clamp;
+    using std::max;
+    using std::min;
+
+    af::dim4 const dims     = in.dims();
+    af::dim4 const istrides = in.strides();
+    af::dim4 const ostrides = out.strides();
+
+    // clamp spatial and chromatic sigmas
+    float space_       = min(11.5f, max(s_sigma, 0.f));
+    float color_       = max(c_sigma, 0.f);
+    dim_t const radius = max((dim_t)(space_ * 1.5f), (dim_t)1);
+    float const svar   = space_ * space_;
+    float const cvar   = color_ * color_;
+
+    for (dim_t b3 = 0; b3 < dims[3]; ++b3) {
+        OutT *outData     = out.get() + b3 * ostrides[3];
+        InT const *inData = in.get() + b3 * istrides[3];
+
+        // b3 for loop handles following batch configurations
+        //  - gfor
+        //  - input based batch
+        //      - when input is 4d array for color images
+        for (dim_t b2 = 0; b2 < dims[2]; ++b2) {
+            // b2 for loop handles following batch configurations
+            //  - channels
+            //  - input based batch
+            //      - when input is 3d array for grayscale images
+            for (dim_t j = 0; j < dims[1]; ++j) {
+                // j steps along 2nd dimension
+                for (dim_t i = 0; i < dims[0]; ++i) {
+                    // i steps along 1st dimension
+                    OutT norm         = 0.0;
+                    OutT res          = 0.0;
+                    OutT const center = (OutT)inData[getIdx(istrides, i, j)];
+                    for (dim_t wj = -radius; wj <= radius; ++wj) {
+                        // clamps offsets
+                        dim_t tj = clamp(j + wj, dim_t(0), dims[1] - 1);
+                        for (dim_t wi = -radius; wi <= radius; ++wi) {
+                            // clamps offsets
+                            dim_t ti = clamp(i + wi, dim_t(0), dims[0] - 1);
+                            // proceed
+                            OutT const val =
+                                (OutT)inData[getIdx(istrides, ti, tj)];
+                            OutT const gauss_space =
+                                (wi * wi + wj * wj) / (-2.0 * svar);
+                            OutT const gauss_range =
+                                ((center - val) * (center - val)) /
+                                (-2.0 * cvar);
+                            OutT const weight =
+                                std::exp(gauss_space + gauss_range);
+                            norm += weight;
+                            res += val * weight;
+                        }
+                    }  // filter loop ends here
+
+                    outData[getIdx(ostrides, i, j)] = res / norm;
+                }  // 1st dimension loop ends here
+            }      // 2nd dimension loop ends here
+            outData += ostrides[2];
+            inData += istrides[2];
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/canny.hpp b/src/backend/cpu/kernel/canny.hpp
new file mode 100644
index 0000000000..e68b73cfb6
--- /dev/null
+++ b/src/backend/cpu/kernel/canny.hpp
@@ -0,0 +1,186 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <array>
+#include <cassert>
+#include <list>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+template<typename T>
+void nonMaxSuppression(Param<T> output, CParam<T> magnitude, CParam<T> dxParam,
+                       CParam<T> dyParam) {
+    const af::dim4 dims    = magnitude.dims();
+    const af::dim4 strides = magnitude.strides();
+
+    T* out       = output.get();
+    const T* mag = magnitude.get();
+    const T* dX  = dxParam.get();
+    const T* dY  = dyParam.get();
+
+    for (dim_t b3 = 0; b3 < dims[3]; ++b3) {
+        for (dim_t b2 = 0; b2 < dims[2]; ++b2) {
+            dim_t offset;
+
+            offset = dims[0] + 1;
+
+            for (dim_t j = 2; j < dims[1]; ++j, offset += 2) {
+                for (dim_t i = 2; i < dims[0]; ++i, ++offset) {
+                    T curr = mag[offset];
+                    if (curr == 0) {
+                        out[offset] = (T)0;
+                    } else {
+                        const float se = mag[offset + dims[0] + 1];
+                        const float nw = mag[offset - dims[0] - 1];
+                        const float ea = mag[offset + 1];
+                        const float we = mag[offset - 1];
+                        const float ne = mag[offset - dims[0] + 1];
+                        const float sw = mag[offset + dims[0] - 1];
+                        const float no = mag[offset - dims[0]];
+                        const float so = mag[offset + dims[0]];
+                        const float dx = dX[offset];
+                        const float dy = dY[offset];
+
+                        float a1, a2, b1, b2, alpha;
+
+                        if (dx >= 0) {
+                            if (dy >= 0) {
+                                const bool isDxMagGreater = (dx - dy) >= 0;
+
+                                a1    = isDxMagGreater ? ea : so;
+                                a2    = isDxMagGreater ? we : no;
+                                b1    = se;
+                                b2    = nw;
+                                alpha = isDxMagGreater ? dy / dx : dx / dy;
+                            } else {
+                                const bool isDyMagGreater = (dx + dy) >= 0;
+
+                                a1    = isDyMagGreater ? ea : no;
+                                a2    = isDyMagGreater ? we : so;
+                                b1    = ne;
+                                b2    = sw;
+                                alpha = isDyMagGreater ? -dy / dx : dx / -dy;
+                            }
+                        } else {
+                            if (dy >= 0) {
+                                const bool isDyMagGreater = (dx + dy) >= 0;
+
+                                a1    = isDyMagGreater ? so : we;
+                                a2    = isDyMagGreater ? no : ea;
+                                b1    = sw;
+                                b2    = ne;
+                                alpha = isDyMagGreater ? -dx / dy : dy / -dx;
+                            } else {
+                                const bool isDxMagGreater = (-dx + dy) >= 0;
+
+                                a1    = isDxMagGreater ? we : no;
+                                a2    = isDxMagGreater ? ea : so;
+                                b1    = nw;
+                                b2    = se;
+                                alpha = isDxMagGreater ? dy / dx : dx / dy;
+                            }
+                        }
+
+                        float mag1 = (1.0f - alpha) * a1 + alpha * b1;
+                        float mag2 = (1.0f - alpha) * a2 + alpha * b2;
+
+                        if (curr > mag1 && curr > mag2) {
+                            out[offset] = curr;
+                        } else {
+                            out[offset] = (T)0;
+                        }
+                    }
+                }
+            }
+
+            out += strides[2];
+            mag += strides[2];
+            dX += strides[2];
+            dY += strides[2];
+        }
+        out += strides[3];
+        mag += strides[3];
+        dX += strides[3];
+        dY += strides[3];
+    }
+}
+
+template<typename T>
+void traceEdge(T* out, const T* strong, const T* weak, int t, int stride1) {
+    if (!out || !strong || !weak) return;
+
+    const T EDGE = 1;
+
+    std::list<dim_t> edges;  // list of edges to be checked
+    edges.push_back(t);
+
+    do {
+        t = edges.front();
+        edges.pop_front();  // remove the front element after reading it
+
+        // get indices of 8 neighbours
+        std::array<dim_t, 8> potentials;
+
+        potentials[0] = t - stride1 - 1;    // north-west
+        potentials[1] = potentials[0] + 1;  // north
+        potentials[2] = potentials[1] + 1;  // north-east
+        potentials[3] = t - 1;              // west
+        potentials[4] = t + 1;              // east
+        potentials[5] = t + stride1 - 1;    // south-west
+        potentials[6] = potentials[5] + 1;  // south
+        potentials[7] = potentials[6] + 1;  // south-east
+
+        // test 8 neighbours and add them into edge
+        // list only if they are also edges
+        for (auto it : potentials) {
+            if (weak[it] > 0 && out[it] != EDGE) {
+                out[it] = EDGE;
+                edges.emplace_back(it);
+            }
+        }
+    } while (!edges.empty());
+}
+
+template<typename T>
+void edgeTrackingHysteresis(Param<T> out, CParam<T> strong, CParam<T> weak) {
+    const af::dim4 dims    = strong.dims();
+    const dim_t batchCount = dims[2] * dims[3];
+    const dim_t jMax       = dims[1] - 1;
+    const dim_t iMax       = dims[0] - 1;
+
+    const T* sptr = strong.get();
+    const T* wptr = weak.get();
+    T* optr       = out.get();
+
+    for (dim_t batchId = 0; batchId < batchCount; ++batchId) {
+        // Skip processing borders
+        dim_t t = dims[0] + 1;
+
+        for (dim_t j = 1; j <= jMax; ++j) {
+            for (dim_t i = 1; i <= iMax; ++i, ++t) {
+                // if current pixel (sptr) is part of an edge
+                // and output doesn't have it marked already,
+                // mark it and trace the pixels from here.
+                if (sptr[t] > 0 && optr[t] != 1) {
+                    optr[t] = 1;
+                    traceEdge(optr, sptr, wptr, t, dims[0]);
+                }
+            }
+        }
+        optr += out.strides(2);
+        sptr += strong.strides(2);
+        wptr += weak.strides(2);
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/convolve.hpp b/src/backend/cpu/kernel/convolve.hpp
new file mode 100644
index 0000000000..62381dd749
--- /dev/null
+++ b/src/backend/cpu/kernel/convolve.hpp
@@ -0,0 +1,293 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename InT, typename AccT>
+void one2one_1d(InT *optr, InT const *const iptr, AccT const *const fptr,
+                af::dim4 const &oDims, af::dim4 const &sDims,
+                af::dim4 const &fDims, af::dim4 const &sStrides,
+                const bool expand) {
+    dim_t start = (expand ? 0 : fDims[0] / 2);
+    dim_t end   = (expand ? oDims[0] : start + sDims[0]);
+    for (dim_t i = start; i < end; ++i) {
+        AccT accum = 0.0;
+        for (dim_t f = 0; f < fDims[0]; ++f) {
+            dim_t iIdx = i - f;
+            InT s_val =
+                ((iIdx >= 0 && iIdx < sDims[0]) ? iptr[iIdx * sStrides[0]]
+                                                : InT(0));
+            accum += AccT(s_val * fptr[f]);
+        }
+        optr[i - start] = InT(accum);
+    }
+}
+
+template<typename InT, typename AccT>
+void one2one_2d(InT *optr, InT const *const iptr, AccT const *const fptr,
+                af::dim4 const &oDims, af::dim4 const &sDims,
+                af::dim4 const &fDims, af::dim4 const &oStrides,
+                af::dim4 const &sStrides, af::dim4 const &fStrides,
+                const bool expand) {
+    dim_t jStart = (expand ? 0 : fDims[1] / 2);
+    dim_t jEnd   = (expand ? oDims[1] : jStart + sDims[1]);
+    dim_t iStart = (expand ? 0 : fDims[0] / 2);
+    dim_t iEnd   = (expand ? oDims[0] : iStart + sDims[0]);
+
+    for (dim_t j = jStart; j < jEnd; ++j) {
+        dim_t joff = (j - jStart) * oStrides[1];
+
+        for (dim_t i = iStart; i < iEnd; ++i) {
+            AccT accum = AccT(0);
+            for (dim_t wj = 0; wj < fDims[1]; ++wj) {
+                dim_t jIdx    = j - wj;
+                dim_t w_joff  = wj * fStrides[1];
+                dim_t s_joff  = jIdx * sStrides[1];
+                bool isJValid = (jIdx >= 0 && jIdx < sDims[1]);
+
+                for (dim_t wi = 0; wi < fDims[0]; ++wi) {
+                    dim_t iIdx = i - wi;
+
+                    InT s_val = InT(0);
+                    if (isJValid && (iIdx >= 0 && iIdx < sDims[0])) {
+                        s_val = iptr[s_joff + iIdx * sStrides[0]];
+                    }
+
+                    accum += AccT(s_val * fptr[w_joff + wi * fStrides[0]]);
+                }
+            }
+            optr[joff + i - iStart] = InT(accum);
+        }
+    }
+}
+
+template<typename InT, typename AccT>
+void one2one_3d(InT *optr, InT const *const iptr, AccT const *const fptr,
+                af::dim4 const &oDims, af::dim4 const &sDims,
+                af::dim4 const &fDims, af::dim4 const &oStrides,
+                af::dim4 const &sStrides, af::dim4 const &fStrides,
+                const bool expand) {
+    dim_t kStart = (expand ? 0 : fDims[2] / 2);
+    dim_t kEnd   = (expand ? oDims[2] : kStart + sDims[2]);
+    dim_t jStart = (expand ? 0 : fDims[1] / 2);
+    dim_t jEnd   = (expand ? oDims[1] : jStart + sDims[1]);
+    dim_t iStart = (expand ? 0 : fDims[0] / 2);
+    dim_t iEnd   = (expand ? oDims[0] : iStart + sDims[0]);
+
+    for (dim_t k = kStart; k < kEnd; ++k) {
+        dim_t koff = (k - kStart) * oStrides[2];
+
+        for (dim_t j = jStart; j < jEnd; ++j) {
+            dim_t joff = (j - jStart) * oStrides[1];
+
+            for (dim_t i = iStart; i < iEnd; ++i) {
+                AccT accum = AccT(0);
+                for (dim_t wk = 0; wk < fDims[2]; ++wk) {
+                    dim_t kIdx    = k - wk;
+                    dim_t w_koff  = wk * fStrides[2];
+                    dim_t s_koff  = kIdx * sStrides[2];
+                    bool isKValid = (kIdx >= 0 && kIdx < sDims[2]);
+
+                    for (dim_t wj = 0; wj < fDims[1]; ++wj) {
+                        dim_t jIdx    = j - wj;
+                        dim_t w_joff  = wj * fStrides[1];
+                        dim_t s_joff  = jIdx * sStrides[1];
+                        bool isJValid = (jIdx >= 0 && jIdx < sDims[1]);
+
+                        for (dim_t wi = 0; wi < fDims[0]; ++wi) {
+                            dim_t iIdx = i - wi;
+
+                            InT s_val = InT(0);
+                            if (isKValid && isJValid &&
+                                (iIdx >= 0 && iIdx < sDims[0])) {
+                                s_val =
+                                    iptr[s_koff + s_joff + iIdx * sStrides[0]];
+                            }
+
+                            accum +=
+                                AccT(s_val *
+                                     fptr[w_koff + w_joff + wi * fStrides[0]]);
+                        }
+                    }
+                }
+                optr[koff + joff + i - iStart] = InT(accum);
+            }  // i loop ends here
+        }      // j loop ends here
+    }          // k loop ends here
+}
+
+template<typename InT, typename AccT>
+void convolve_nd(Param<InT> out, CParam<InT> signal, CParam<AccT> filter,
+                 AF_BATCH_KIND kind, const int rank, const bool expand) {
+    InT *optr              = out.get();
+    InT const *const iptr  = signal.get();
+    AccT const *const fptr = filter.get();
+
+    af::dim4 const oDims = out.dims();
+    af::dim4 const sDims = signal.dims();
+    af::dim4 const fDims = filter.dims();
+
+    af::dim4 const oStrides = out.strides();
+    af::dim4 const sStrides = signal.strides();
+    af::dim4 const fStrides = filter.strides();
+
+    dim_t out_step[AF_MAX_DIMS] = {
+        0, 0, 0,
+        0}; /* first value is never used, and declared for code simplicity */
+    dim_t in_step[AF_MAX_DIMS] = {
+        0, 0, 0,
+        0}; /* first value is never used, and declared for code simplicity */
+    dim_t filt_step[AF_MAX_DIMS] = {
+        0, 0, 0,
+        0}; /* first value is never used, and declared for code simplicity */
+    dim_t batch[AF_MAX_DIMS] = {
+        0, 1, 1,
+        1}; /* first value is never used, and declared for code simplicity */
+
+    for (dim_t i = 1; i < 4; ++i) {
+        switch (kind) {
+            case AF_BATCH_LHS:
+                out_step[i] = oStrides[i];
+                in_step[i]  = sStrides[i];
+                if (i >= rank) batch[i] = sDims[i];
+                break;
+            case AF_BATCH_SAME:
+                out_step[i]  = oStrides[i];
+                in_step[i]   = sStrides[i];
+                filt_step[i] = fStrides[i];
+                if (i >= rank) batch[i] = sDims[i];
+                break;
+            case AF_BATCH_RHS:
+                out_step[i]  = oStrides[i];
+                filt_step[i] = fStrides[i];
+                if (i >= rank) batch[i] = fDims[i];
+                break;
+            default: break;
+        }
+    }
+
+    for (dim_t b3 = 0; b3 < batch[3]; ++b3) {
+        for (dim_t b2 = 0; b2 < batch[2]; ++b2) {
+            for (dim_t b1 = 0; b1 < batch[1]; ++b1) {
+                InT *out = optr + b1 * out_step[1] + b2 * out_step[2] +
+                           b3 * out_step[3];
+                InT const *in =
+                    iptr + b1 * in_step[1] + b2 * in_step[2] + b3 * in_step[3];
+                AccT const *filt = fptr + b1 * filt_step[1] +
+                                   b2 * filt_step[2] + b3 * filt_step[3];
+
+                switch (rank) {
+                    case 1:
+                        one2one_1d<InT, AccT>(out, in, filt, oDims, sDims,
+                                              fDims, sStrides, expand);
+                        break;
+                    case 2:
+                        one2one_2d<InT, AccT>(out, in, filt, oDims, sDims,
+                                              fDims, oStrides, sStrides,
+                                              fStrides, expand);
+                        break;
+                    case 3:
+                        one2one_3d<InT, AccT>(out, in, filt, oDims, sDims,
+                                              fDims, oStrides, sStrides,
+                                              fStrides, expand);
+                        break;
+                }
+            }
+        }
+    }
+}
+
+template<typename InT, typename AccT, bool Expand, int ConvDim>
+void convolve2_separable(InT *optr, InT const *const iptr,
+                         AccT const *const fptr, af::dim4 const &oDims,
+                         af::dim4 const &sDims, af::dim4 const &orgDims,
+                         dim_t fDim, af::dim4 const &oStrides,
+                         af::dim4 const &sStrides, dim_t fStride) {
+    UNUSED(orgDims);
+    UNUSED(sStrides);
+    UNUSED(fStride);
+    for (dim_t j = 0; j < oDims[1]; ++j) {
+        dim_t jOff = j * oStrides[1];
+        dim_t cj   = j + (ConvDim == 1) * (Expand ? 0 : fDim >> 1);
+
+        for (dim_t i = 0; i < oDims[0]; ++i) {
+            dim_t iOff = i * oStrides[0];
+            dim_t ci   = i + (ConvDim == 0) * (Expand ? 0 : fDim >> 1);
+
+            AccT accum = scalar<AccT>(0);
+
+            for (dim_t f = 0; f < fDim; ++f) {
+                InT f_val = fptr[f];
+                InT s_val;
+
+                if (ConvDim == 0) {
+                    dim_t offi     = ci - f;
+                    bool isCIValid = offi >= 0 && offi < sDims[0];
+                    bool isCJValid = cj >= 0 && cj < sDims[1];
+                    s_val = (isCJValid && isCIValid ? iptr[cj * sDims[0] + offi]
+                                                    : scalar<InT>(0));
+                } else {
+                    dim_t offj     = cj - f;
+                    bool isCIValid = ci >= 0 && ci < sDims[0];
+                    bool isCJValid = offj >= 0 && offj < sDims[1];
+                    s_val = (isCJValid && isCIValid ? iptr[offj * sDims[0] + ci]
+                                                    : scalar<InT>(0));
+                }
+
+                accum += AccT(s_val * f_val);
+            }
+            optr[iOff + jOff] = InT(accum);
+        }
+    }
+}
+
+template<typename InT, typename AccT, bool Expand>
+void convolve2(Param<InT> out, CParam<InT> signal, CParam<AccT> c_filter,
+               CParam<AccT> r_filter, Param<InT> temp) {
+    dim_t cflen = (dim_t)c_filter.dims().elements();
+    dim_t rflen = (dim_t)r_filter.dims().elements();
+
+    auto oDims = out.dims();
+    auto sDims = signal.dims();
+
+    auto oStrides = out.strides();
+    auto sStrides = signal.strides();
+    auto tStrides = temp.strides();
+
+    for (dim_t b3 = 0; b3 < oDims[3]; ++b3) {
+        dim_t i_b3Off = b3 * sStrides[3];
+        dim_t t_b3Off = b3 * tStrides[3];
+        dim_t o_b3Off = b3 * oStrides[3];
+
+        for (dim_t b2 = 0; b2 < oDims[2]; ++b2) {
+            InT const *const iptr = signal.get() + b2 * sStrides[2] + i_b3Off;
+            InT *tptr             = temp.get() + b2 * tStrides[2] + t_b3Off;
+            InT *optr             = out.get() + b2 * oStrides[2] + o_b3Off;
+
+            convolve2_separable<InT, AccT, Expand, 0>(
+                tptr, iptr, c_filter.get(), temp.dims(), sDims, sDims, cflen,
+                tStrides, sStrides, c_filter.strides(0));
+
+            convolve2_separable<InT, AccT, Expand, 1>(
+                optr, tptr, r_filter.get(), oDims, temp.dims(), sDims, rflen,
+                oStrides, tStrides, r_filter.strides(0));
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/copy.hpp b/src/backend/cpu/kernel/copy.hpp
new file mode 100644
index 0000000000..9506ed7d70
--- /dev/null
+++ b/src/backend/cpu/kernel/copy.hpp
@@ -0,0 +1,164 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+#include <cstring>  //memcpy
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void stridedCopy(T* dst, af::dim4 const& ostrides, T const* src,
+                 af::dim4 const& dims, af::dim4 const& strides, unsigned dim) {
+    if (dim == 0) {
+        if (strides[dim] == 1) {
+            // FIXME: Check for errors / exceptions
+            std::memcpy(dst, src, dims[dim] * sizeof(T));
+        } else {
+            for (dim_t i = 0; i < dims[dim]; i++) {
+                dst[i] = src[strides[dim] * i];
+            }
+        }
+    } else {
+        for (dim_t i = dims[dim]; i > 0; i--) {
+            stridedCopy<T>(dst, ostrides, src, dims, strides, dim - 1);
+            src += strides[dim];
+            dst += ostrides[dim];
+        }
+    }
+}
+
+template<typename OutT, typename InT>
+void copyElemwise(Param<OutT> dst, CParam<InT> src, OutT default_value,
+                  double factor) {
+    af::dim4 src_dims    = src.dims();
+    af::dim4 dst_dims    = dst.dims();
+    af::dim4 src_strides = src.strides();
+    af::dim4 dst_strides = dst.strides();
+
+    data_t<InT> const* const src_ptr = src.get();
+    data_t<OutT>* dst_ptr            = dst.get();
+
+    dim_t trgt_l = std::min(dst_dims[3], src_dims[3]);
+    dim_t trgt_k = std::min(dst_dims[2], src_dims[2]);
+    dim_t trgt_j = std::min(dst_dims[1], src_dims[1]);
+    dim_t trgt_i = std::min(dst_dims[0], src_dims[0]);
+
+    for (dim_t l = 0; l < dst_dims[3]; ++l) {
+        dim_t src_loff = l * src_strides[3];
+        dim_t dst_loff = l * dst_strides[3];
+        bool isLvalid  = l < trgt_l;
+
+        for (dim_t k = 0; k < dst_dims[2]; ++k) {
+            dim_t src_koff = k * src_strides[2];
+            dim_t dst_koff = k * dst_strides[2];
+            bool isKvalid  = k < trgt_k;
+
+            for (dim_t j = 0; j < dst_dims[1]; ++j) {
+                dim_t src_joff = j * src_strides[1];
+                dim_t dst_joff = j * dst_strides[1];
+                bool isJvalid  = j < trgt_j;
+
+                for (dim_t i = 0; i < dst_dims[0]; ++i) {
+                    data_t<OutT> temp = default_value;
+                    if (isLvalid && isKvalid && isJvalid && i < trgt_i) {
+                        dim_t src_idx =
+                            i * src_strides[0] + src_joff + src_koff + src_loff;
+                        // The conversions here are necessary because the half
+                        // type does not convert to complex automatically
+                        temp =
+                            compute_t<OutT>(compute_t<InT>(src_ptr[src_idx])) *
+                            compute_t<OutT>(factor);
+                    }
+                    dim_t dst_idx =
+                        i * dst_strides[0] + dst_joff + dst_koff + dst_loff;
+                    dst_ptr[dst_idx] = temp;
+                }
+            }
+        }
+    }
+}
+
+template<typename OutT, typename InT>
+struct CopyImpl {
+    static void copy(Param<OutT> dst, CParam<InT> src) {
+        copyElemwise(dst, src, scalar<OutT>(0), 1.0);
+    }
+};
+
+template<typename T>
+struct CopyImpl<T, T> {
+    static void copy(Param<T> dst, CParam<T> src) {
+        af::dim4 src_dims    = src.dims();
+        af::dim4 dst_dims    = dst.dims();
+        af::dim4 src_strides = src.strides();
+        af::dim4 dst_strides = dst.strides();
+
+        T const* src_ptr = src.get();
+        T* dst_ptr       = dst.get();
+
+        // find the major-most dimension, which is linear in both arrays
+        int linear_end = 0;
+        dim_t count    = 1;
+        while (linear_end < 4 && count == src_strides[linear_end] &&
+               count == dst_strides[linear_end]) {
+            count *= src_dims[linear_end];
+            ++linear_end;
+        }
+
+        // traverse the array by strides only as far as necessary
+        copy_go(dst_ptr, dst_strides, dst_dims, src_ptr, src_strides, src_dims,
+                3, linear_end);
+    }
+
+    static void copy_go(T* dst_ptr, const af::dim4& dst_strides,
+                        const af::dim4& dst_dims, T const* src_ptr,
+                        const af::dim4& src_strides, const af::dim4& src_dims,
+                        int dim, int linear_end) {
+        // if we are in a higher dimension, copy the entire stride if possible
+        if (linear_end == dim + 1) {
+            std::memcpy(dst_ptr, src_ptr,
+                        sizeof(T) * src_strides[dim] * src_dims[dim]);
+            return;
+        }
+
+        // 0th dimension is recursion bottom - copy element by element
+        if (dim == 0) {
+            for (dim_t i = 0; i < dst_dims[0]; ++i) {
+                *dst_ptr = *src_ptr;
+                dst_ptr += dst_strides[0];
+                src_ptr += src_strides[0];
+            }
+            return;
+        }
+
+        // otherwise recurse to a lower dimension
+        for (dim_t i = 0; i < dst_dims[dim]; ++i) {
+            copy_go(dst_ptr, dst_strides, dst_dims, src_ptr, src_strides,
+                    src_dims, dim - 1, linear_end);
+            dst_ptr += dst_strides[dim];
+            src_ptr += src_strides[dim];
+        }
+    }
+};
+
+template<typename OutT, typename InT>
+void copy(Param<OutT> dst, CParam<InT> src) {
+    CopyImpl<OutT, InT>::copy(dst, src);
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/diagonal.hpp b/src/backend/cpu/kernel/diagonal.hpp
new file mode 100644
index 0000000000..388bd4c459
--- /dev/null
+++ b/src/backend/cpu/kernel/diagonal.hpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void diagCreate(Param<T> out, CParam<T> in, int const num) {
+    int batch = in.dims(1);
+    int size  = out.dims(0);
+
+    T const *iptr = in.get();
+    T *optr       = out.get();
+
+    for (int k = 0; k < batch; k++) {
+        for (int j = 0; j < size; j++) {
+            for (int i = 0; i < size; i++) {
+                T val = scalar<T>(0);
+                if (i == j - num) { val = (num > 0) ? iptr[i] : iptr[j]; }
+                optr[i + j * out.strides(1)] = val;
+            }
+        }
+        optr += out.strides(2);
+        iptr += in.strides(1);
+    }
+}
+
+template<typename T>
+void diagExtract(Param<T> out, CParam<T> in, int const num) {
+    af::dim4 const odims = out.dims();
+    af::dim4 const idims = in.dims();
+
+    int const i_off = (num > 0) ? (num * in.strides(1)) : (-num);
+
+    for (int l = 0; l < (int)odims[3]; l++) {
+        for (int k = 0; k < (int)odims[2]; k++) {
+            const T *iptr =
+                in.get() + l * in.strides(3) + k * in.strides(2) + i_off;
+            T *optr = out.get() + l * out.strides(3) + k * out.strides(2);
+
+            for (int i = 0; i < (int)odims[0]; i++) {
+                T val = scalar<T>(0);
+                if (i < idims[0] && i < idims[1])
+                    val = iptr[i * in.strides(1) + i];
+                optr[i] = val;
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/diff.hpp b/src/backend/cpu/kernel/diff.hpp
new file mode 100644
index 0000000000..b1ed5642b6
--- /dev/null
+++ b/src/backend/cpu/kernel/diff.hpp
@@ -0,0 +1,84 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void diff1(Param<T> out, CParam<T> in, int const dim) {
+    af::dim4 dims = out.dims();
+    // Bool for dimension
+    bool is_dim0 = dim == 0;
+    bool is_dim1 = dim == 1;
+    bool is_dim2 = dim == 2;
+    bool is_dim3 = dim == 3;
+
+    T const* const inPtr = in.get();
+    T* outPtr            = out.get();
+
+    // TODO: Improve this
+    for (dim_t l = 0; l < dims[3]; l++) {
+        for (dim_t k = 0; k < dims[2]; k++) {
+            for (dim_t j = 0; j < dims[1]; j++) {
+                for (dim_t i = 0; i < dims[0]; i++) {
+                    // Operation: out[index] = in[index + 1 * dim_size] -
+                    // in[index]
+                    int idx     = getIdx(in.strides(), i, j, k, l);
+                    int jdx     = getIdx(in.strides(), i + is_dim0, j + is_dim1,
+                                         k + is_dim2, l + is_dim3);
+                    int odx     = getIdx(out.strides(), i, j, k, l);
+                    outPtr[odx] = inPtr[jdx] - inPtr[idx];
+                }
+            }
+        }
+    }
+}
+
+template<typename T>
+void diff2(Param<T> out, CParam<T> in, int const dim) {
+    af::dim4 dims = out.dims();
+    // Bool for dimension
+    bool is_dim0 = dim == 0;
+    bool is_dim1 = dim == 1;
+    bool is_dim2 = dim == 2;
+    bool is_dim3 = dim == 3;
+
+    T const* const inPtr = in.get();
+    T* outPtr            = out.get();
+
+    // TODO: Improve this
+    for (dim_t l = 0; l < dims[3]; l++) {
+        for (dim_t k = 0; k < dims[2]; k++) {
+            for (dim_t j = 0; j < dims[1]; j++) {
+                for (dim_t i = 0; i < dims[0]; i++) {
+                    // Operation: out[index] = in[index + 2 * dim_size] -
+                    // 2 * in[index + 1 * dim_size] + in[index]
+                    int idx = getIdx(in.strides(), i, j, k, l);
+                    int jdx = getIdx(in.strides(), i + is_dim0, j + is_dim1,
+                                     k + is_dim2, l + is_dim3);
+                    int kdx =
+                        getIdx(in.strides(), i + 2 * is_dim0, j + 2 * is_dim1,
+                               k + 2 * is_dim2, l + 2 * is_dim3);
+                    int odx = getIdx(out.strides(), i, j, k, l);
+                    outPtr[odx] =
+                        inPtr[kdx] + inPtr[idx] - inPtr[jdx] - inPtr[jdx];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/dot.hpp b/src/backend/cpu/kernel/dot.hpp
new file mode 100644
index 0000000000..74ea9087c3
--- /dev/null
+++ b/src/backend/cpu/kernel/dot.hpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+T conj(T x) {
+    return x;
+}
+
+template<>
+cfloat conj<cfloat>(cfloat c) {
+    return std::conj(c);
+}
+template<>
+cdouble conj<cdouble>(cdouble c) {
+    return std::conj(c);
+}
+
+template<typename T, bool conjugate, bool both_conjugate>
+void dot(Param<T> output, CParam<T> lhs, CParam<T> rhs, af_mat_prop optLhs,
+         af_mat_prop optRhs) {
+    UNUSED(optLhs);
+    UNUSED(optRhs);
+    int N = lhs.dims(0);
+
+    T out       = 0;
+    const T *pL = lhs.get();
+    const T *pR = rhs.get();
+
+    for (int i = 0; i < N; i++)
+        out += (conjugate ? kernel::conj(pL[i]) : pL[i]) * pR[i];
+
+    if (both_conjugate) out = kernel::conj(out);
+
+    *output.get() = out;
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/exampleFunction.hpp b/src/backend/cpu/kernel/exampleFunction.hpp
new file mode 100644
index 0000000000..6b263830ab
--- /dev/null
+++ b/src/backend/cpu/kernel/exampleFunction.hpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void exampleFunction(Param<T> out, CParam<T> a, CParam<T> b,
+                     const af_someenum_t method) {
+    UNUSED(method);
+    dim4 oDims = out.dims();
+
+    dim4 aStrides = a.strides();  // you can retrieve strides
+    dim4 bStrides = b.strides();
+    dim4 oStrides = out.strides();
+
+    const T* src1 =
+        a.get();  // cpu::Param<T>::get returns the pointer to the
+                  // memory allocated for that Param (with proper offsets)
+    const T* src2 =
+        b.get();  // cpu::Param<T>::get returns the pointer to the
+                  // memory allocated for that Param (with proper offsets)
+    T* dst = out.get();
+
+    // Implement your algorithm and write results to dst
+    for (int j = 0; j < oDims[1]; ++j) {
+        for (int i = 0; i < oDims[0]; ++i) {
+            int src1Idx = i + j * aStrides[1];
+            int src2Idx = i + j * bStrides[1];
+            int dstIdx  = i + j * oStrides[1];
+
+            // kernel algorithm goes here
+            dst[dstIdx] = src1[src1Idx] + src2[src2Idx];
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/fast.hpp b/src/backend/cpu/kernel/fast.hpp
new file mode 100644
index 0000000000..341ddbe701
--- /dev/null
+++ b/src/backend/cpu/kernel/fast.hpp
@@ -0,0 +1,219 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+inline int idx_y(int i) {
+    using std::clamp;
+    if (i >= 8) return clamp(-(i - 8 - 4), -3, 3);
+
+    return clamp(i - 4, -3, 3);
+}
+
+inline int idx_x(int i) {
+    if (i < 12) return idx_y(i + 4);
+
+    return idx_y(i - 12);
+}
+
+inline int idx(int y, int x, unsigned idim0) { return x * idim0 + y; }
+
+// test_greater()
+// Tests if a pixel x > p + thr
+inline int test_greater(float x, float p, float thr) { return (x > p + thr); }
+
+// test_smaller()
+// Tests if a pixel x < p - thr
+inline int test_smaller(float x, float p, float thr) { return (x < p - thr); }
+
+// test_pixel()
+// Returns -1 when x < p - thr
+// Returns  0 when x >= p - thr && x <= p + thr
+// Returns  1 when x > p + thr
+template<typename T>
+inline int test_pixel(const T *image, const float p, float thr, int y, int x,
+                      unsigned idim0) {
+    return -test_smaller((float)image[idx(y, x, idim0)], p, thr) +
+           test_greater((float)image[idx(y, x, idim0)], p, thr);
+}
+
+// abs_diff()
+// Returns absolute difference of x and y
+inline int abs_diff(int x, int y) { return abs(x - y); }
+inline unsigned abs_diff(unsigned x, unsigned y) {
+    return (unsigned)abs((int)x - (int)y);
+}
+inline float abs_diff(float x, float y) { return fabs(x - y); }
+inline double abs_diff(double x, double y) { return fabs(x - y); }
+
+template<typename T>
+void locate_features(CParam<T> in, Param<float> score, Param<float> x_out,
+                     Param<float> y_out, Param<float> score_out,
+                     unsigned *count, float const thr,
+                     unsigned const arc_length, unsigned const nonmax,
+                     unsigned const max_feat, unsigned const edge) {
+    af::dim4 in_dims = in.dims();
+    T const *in_ptr  = in.get();
+
+    for (int y = edge; y < (int)(in_dims[0] - edge); y++) {
+        for (int x = edge; x < (int)(in_dims[1] - edge); x++) {
+            float p = in_ptr[idx(y, x, in_dims[0])];
+
+            // Start by testing opposite pixels of the circle that will result
+            // in a non-keypoint
+            int d;
+            d = test_pixel<T>(in_ptr, p, thr, y - 3, x, in_dims[0]) |
+                test_pixel<T>(in_ptr, p, thr, y + 3, x, in_dims[0]);
+            if (d == 0) continue;
+
+            d &= test_pixel<T>(in_ptr, p, thr, y - 2, x + 2, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y + 2, x - 2, in_dims[0]);
+            d &= test_pixel<T>(in_ptr, p, thr, y, x + 3, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y, x - 3, in_dims[0]);
+            d &= test_pixel<T>(in_ptr, p, thr, y + 2, x + 2, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y - 2, x - 2, in_dims[0]);
+            if (d == 0) continue;
+
+            d &= test_pixel<T>(in_ptr, p, thr, y - 3, x + 1, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y + 3, x - 1, in_dims[0]);
+            d &= test_pixel<T>(in_ptr, p, thr, y - 1, x + 3, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y + 1, x - 3, in_dims[0]);
+            d &= test_pixel<T>(in_ptr, p, thr, y + 1, x + 3, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y - 1, x - 3, in_dims[0]);
+            d &= test_pixel<T>(in_ptr, p, thr, y + 3, x + 1, in_dims[0]) |
+                 test_pixel<T>(in_ptr, p, thr, y - 3, x - 1, in_dims[0]);
+            if (d == 0) continue;
+
+            int sum = 0;
+
+            // Sum responses [-1, 0 or 1] of first arc_length pixels
+            for (int i = 0; i < static_cast<int>(arc_length); i++)
+                sum += test_pixel<T>(in_ptr, p, thr, y + idx_y(i), x + idx_x(i),
+                                     in_dims[0]);
+
+            // Test maximum and minimum responses of first segment of
+            // arc_length pixels
+            int max_sum = 0, min_sum = 0;
+            max_sum = std::max(max_sum, sum);
+            min_sum = std::min(min_sum, sum);
+
+            // Sum responses and test the remaining 16-arc_length pixels of the
+            // circle
+            for (int i = arc_length; i < 16; i++) {
+                sum -= test_pixel<T>(in_ptr, p, thr, y + idx_y(i - arc_length),
+                                     x + idx_x(i - arc_length), in_dims[0]);
+                sum += test_pixel<T>(in_ptr, p, thr, y + idx_y(i), x + idx_x(i),
+                                     in_dims[0]);
+                max_sum = std::max(max_sum, sum);
+                min_sum = std::min(min_sum, sum);
+            }
+
+            // To completely test all possible segments, it's necessary to test
+            // segments that include the top junction of the circle
+            for (int i = 0; i < static_cast<int>(arc_length - 1); i++) {
+                sum -= test_pixel<T>(
+                    in_ptr, p, thr, y + idx_y(16 - arc_length + i),
+                    x + idx_x(16 - arc_length + i), in_dims[0]);
+                sum += test_pixel<T>(in_ptr, p, thr, y + idx_y(i), x + idx_x(i),
+                                     in_dims[0]);
+                max_sum = std::max(max_sum, sum);
+                min_sum = std::min(min_sum, sum);
+            }
+
+            float s_bright = 0, s_dark = 0;
+            for (int i = 0; i < 16; i++) {
+                float p_x =
+                    (float)in_ptr[idx(y + idx_y(i), x + idx_x(i), in_dims[0])];
+
+                s_bright +=
+                    test_greater(p_x, p, thr) * (abs_diff(p_x, p) - thr);
+                s_dark += test_smaller(p_x, p, thr) * (abs_diff(p, p_x) - thr);
+            }
+
+            // If sum at some point was equal to (+-)arc_length, there is a
+            // segment for which all pixels are much brighter or much darker
+            // than the central pixel p.
+            if (max_sum == static_cast<int>(arc_length) ||
+                min_sum == -static_cast<int>(arc_length)) {
+                unsigned j = *count;
+                ++*count;
+                if (j < max_feat) {
+                    float *x_out_ptr     = x_out.get();
+                    float *y_out_ptr     = y_out.get();
+                    float *score_out_ptr = score_out.get();
+                    x_out_ptr[j]         = static_cast<float>(x);
+                    y_out_ptr[j]         = static_cast<float>(y);
+                    score_out_ptr[j] =
+                        static_cast<float>(std::max(s_bright, s_dark));
+                    if (nonmax == 1) {
+                        float *score_ptr = score.get();
+                        score_ptr[idx(y, x, in_dims[0])] =
+                            std::max(s_bright, s_dark);
+                    }
+                }
+            }
+        }
+    }
+}
+
+void non_maximal(CParam<float> score, CParam<float> x_in, CParam<float> y_in,
+                 Param<float> x_out, Param<float> y_out, Param<float> score_out,
+                 unsigned *count, unsigned const total_feat,
+                 unsigned const edge) {
+    float const *score_ptr = score.get();
+    float const *x_in_ptr  = x_in.get();
+    float const *y_in_ptr  = y_in.get();
+
+    af::dim4 score_dims = score.dims();
+
+    for (unsigned k = 0; k < total_feat; k++) {
+        unsigned x = static_cast<unsigned>(round(x_in_ptr[k]));
+        unsigned y = static_cast<unsigned>(round(y_in_ptr[k]));
+
+        float v = score_ptr[y + score_dims[0] * x];
+        float max_v;
+        max_v = std::max(score_ptr[y - 1 + score_dims[0] * (x - 1)],
+                         score_ptr[y - 1 + score_dims[0] * x]);
+        max_v = std::max(max_v, score_ptr[y - 1 + score_dims[0] * (x + 1)]);
+        max_v = std::max(max_v, score_ptr[y + score_dims[0] * (x - 1)]);
+        max_v = std::max(max_v, score_ptr[y + score_dims[0] * (x + 1)]);
+        max_v = std::max(max_v, score_ptr[y + 1 + score_dims[0] * (x - 1)]);
+        max_v = std::max(max_v, score_ptr[y + 1 + score_dims[0] * (x)]);
+        max_v = std::max(max_v, score_ptr[y + 1 + score_dims[0] * (x + 1)]);
+
+        if (y >= score_dims[1] - edge - 1 || y <= edge + 1 ||
+            x >= score_dims[0] - edge - 1 || x <= edge + 1)
+            continue;
+
+        // Stores keypoint to feat_out if its response is maximum compared to
+        // its 8-neighborhood
+        if (v > max_v) {
+            unsigned j = *count;
+            ++*count;
+
+            float *x_out_ptr     = x_out.get();
+            float *y_out_ptr     = y_out.get();
+            float *score_out_ptr = score_out.get();
+
+            x_out_ptr[j]     = static_cast<float>(x);
+            y_out_ptr[j]     = static_cast<float>(y);
+            score_out_ptr[j] = static_cast<float>(v);
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/fftconvolve.hpp b/src/backend/cpu/kernel/fftconvolve.hpp
new file mode 100644
index 0000000000..13109502c7
--- /dev/null
+++ b/src/backend/cpu/kernel/fftconvolve.hpp
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename To, typename Ti>
+void packData(Param<To> out, const af::dim4 od, const af::dim4 os,
+              CParam<Ti> in) {
+    To* out_ptr = out.get();
+
+    const af::dim4 id = in.dims();
+    const af::dim4 is = in.strides();
+    const Ti* in_ptr  = in.get();
+
+    int id0_half = divup(id[0], 2);
+    bool odd_id0 = (id[0] % 2 == 1);
+
+    for (int d3 = 0; d3 < (int)od[3]; d3++) {
+        for (int d2 = 0; d2 < (int)od[2]; d2++) {
+            for (int d1 = 0; d1 < (int)od[1]; d1++) {
+                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
+                    const dim_t oidx =
+                        d3 * os[3] + d2 * os[2] + d1 * os[1] + d0 * 2;
+
+                    if (d0 < (int)id0_half && d1 < (int)id[1] &&
+                        d2 < (int)id[2] && d3 < (int)id[3]) {
+                        const dim_t iidx =
+                            d3 * is[3] + d2 * is[2] + d1 * is[1] + d0;
+                        out_ptr[oidx] = (To)in_ptr[iidx];
+                        if (d0 == id0_half - 1 && odd_id0)
+                            out_ptr[oidx + 1] = (To)0;
+                        else
+                            out_ptr[oidx + 1] = (To)in_ptr[iidx + id0_half];
+                    } else {
+                        // Pad remaining elements with 0s
+                        out_ptr[oidx]     = (To)0;
+                        out_ptr[oidx + 1] = (To)0;
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename To, typename Ti>
+void padArray(Param<To> out, const af::dim4 od, const af::dim4 os,
+              CParam<Ti> in, const dim_t offset) {
+    To* out_ptr       = out.get() + offset;
+    const af::dim4 id = in.dims();
+    const af::dim4 is = in.strides();
+    const Ti* in_ptr  = in.get();
+
+    for (int d3 = 0; d3 < (int)od[3]; d3++) {
+        for (int d2 = 0; d2 < (int)od[2]; d2++) {
+            for (int d1 = 0; d1 < (int)od[1]; d1++) {
+                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
+                    const dim_t oidx =
+                        d3 * os[3] + d2 * os[2] + d1 * os[1] + d0 * 2;
+
+                    if (d0 < (int)id[0] && d1 < (int)id[1] && d2 < (int)id[2] &&
+                        d3 < (int)id[3]) {
+                        // Copy input elements to real elements, set imaginary
+                        // elements to 0
+                        const dim_t iidx =
+                            d3 * is[3] + d2 * is[2] + d1 * is[1] + d0;
+                        out_ptr[oidx]     = (To)in_ptr[iidx];
+                        out_ptr[oidx + 1] = (To)0;
+                    } else {
+                        // Pad the rest of the matrix with 0s
+                        out_ptr[oidx]     = (To)0;
+                        out_ptr[oidx + 1] = (To)0;
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename T>
+void complexMultiply(Param<T> packed, const af::dim4 sig_dims,
+                     const af::dim4 sig_strides, const af::dim4 fit_dims,
+                     const af::dim4 fit_strides, AF_BATCH_KIND kind,
+                     const dim_t offset) {
+    T* out_ptr = packed.get() + (kind == AF_BATCH_RHS ? offset : 0);
+    T* in1_ptr = packed.get();
+    T* in2_ptr = packed.get() + offset;
+
+    const af::dim4& od  = (kind == AF_BATCH_RHS ? fit_dims : sig_dims);
+    const af::dim4& os  = (kind == AF_BATCH_RHS ? fit_strides : sig_strides);
+    const af::dim4& i1d = sig_dims;
+    const af::dim4& i2d = fit_dims;
+    const af::dim4& i1s = sig_strides;
+    const af::dim4& i2s = fit_strides;
+
+    for (int d3 = 0; d3 < (int)od[3]; d3++) {
+        for (int d2 = 0; d2 < (int)od[2]; d2++) {
+            for (int d1 = 0; d1 < (int)od[1]; d1++) {
+                for (int d0 = 0; d0 < (int)od[0] / 2; d0++) {
+                    if (kind == AF_BATCH_NONE || kind == AF_BATCH_SAME) {
+                        // Complex multiply each signal to equivalent filter
+                        const int ridx =
+                            d3 * os[3] + d2 * os[2] + d1 * os[1] + d0 * 2;
+                        const int iidx = ridx + 1;
+
+                        T a = in1_ptr[ridx];
+                        T b = in1_ptr[iidx];
+                        T c = in2_ptr[ridx];
+                        T d = in2_ptr[iidx];
+
+                        out_ptr[ridx] = a * c - b * d;
+                        out_ptr[iidx] = a * d + b * c;
+                    } else if (kind == AF_BATCH_LHS) {
+                        // Complex multiply all signals to filter
+                        const int ridx1 =
+                            d3 * os[3] + d2 * os[2] + d1 * os[1] + d0 * 2;
+                        const int iidx1 = ridx1 + 1;
+                        const int ridx2 = ridx1 % (i2s[3] * i2d[3]);
+                        const int iidx2 = iidx1 % (i2s[3] * i2d[3]);
+
+                        T a = in1_ptr[ridx1];
+                        T b = in1_ptr[iidx1];
+                        T c = in2_ptr[ridx2];
+                        T d = in2_ptr[iidx2];
+
+                        out_ptr[ridx1] = a * c - b * d;
+                        out_ptr[iidx1] = a * d + b * c;
+                    } else if (kind == AF_BATCH_RHS) {
+                        // Complex multiply signal to all filters
+                        const int ridx2 =
+                            d3 * os[3] + d2 * os[2] + d1 * os[1] + d0 * 2;
+                        const int iidx2 = ridx2 + 1;
+                        const int ridx1 = ridx2 % (i1s[3] * i1d[3]);
+                        const int iidx1 = iidx2 % (i1s[3] * i1d[3]);
+
+                        T a = in1_ptr[ridx1];
+                        T b = in1_ptr[iidx1];
+                        T c = in2_ptr[ridx2];
+                        T d = in2_ptr[iidx2];
+
+                        out_ptr[ridx2] = a * c - b * d;
+                        out_ptr[iidx2] = a * d + b * c;
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename To, typename Ti, int Rank, bool Expand>
+void reorderHelper(To* out_ptr, const af::dim4& od, const af::dim4& os,
+                   const Ti* in_ptr, const af::dim4& id, const af::dim4& is,
+                   const af::dim4& fd, const int half_di0, const int fftScale) {
+    constexpr bool RoundResult = std::is_integral<To>::value;
+
+    UNUSED(id);
+    for (int d3 = 0; d3 < (int)od[3]; d3++) {
+        for (int d2 = 0; d2 < (int)od[2]; d2++) {
+            for (int d1 = 0; d1 < (int)od[1]; d1++) {
+                for (int d0 = 0; d0 < (int)od[0]; d0++) {
+                    int id0, id1, id2, id3;
+                    if (Expand) {
+                        id0 = d0;
+                        id1 = d1 * is[1];
+                        id2 = d2 * is[2];
+                        id3 = d3 * is[3];
+                    } else {
+                        id0 = d0 + fd[0] / 2;
+                        id1 = (d1 + (Rank > 1) * (fd[1] / 2)) * is[1];
+                        id2 = (d2 + (Rank > 2) * (fd[2] / 2)) * is[2];
+                        id3 = d3 * is[3];
+                    }
+
+                    int oidx = d3 * os[3] + d2 * os[2] + d1 * os[1] + d0;
+
+                    // Divide output elements by the FFT scale (fftScale) and
+                    // round the result if the output type is integral
+                    // (RoundResult)
+                    if (id0 < half_di0) {
+                        // Copy top elements
+                        int iidx = id3 + id2 + id1 + id0 * 2;
+                        if (RoundResult)
+                            out_ptr[oidx] =
+                                (To)roundf((float)(in_ptr[iidx] / fftScale));
+                        else
+                            out_ptr[oidx] = (To)(in_ptr[iidx] / fftScale);
+                    } else if (id0 < half_di0 + (int)fd[0] - 1) {
+                        // Add signal and filter elements to central part
+                        int iidx1 = id3 + id2 + id1 + id0 * 2;
+                        int iidx2 = id3 + id2 + id1 + (id0 - half_di0) * 2 + 1;
+                        if (RoundResult)
+                            out_ptr[oidx] = (To)roundf(
+                                (float)((in_ptr[iidx1] + in_ptr[iidx2]) /
+                                        fftScale));
+                        else
+                            out_ptr[oidx] =
+                                (To)((in_ptr[iidx1] + in_ptr[iidx2]) /
+                                     fftScale);
+                    } else {
+                        // Copy bottom elements
+                        const int iidx =
+                            id3 + id2 + id1 + (id0 - half_di0) * 2 + 1;
+                        if (RoundResult)
+                            out_ptr[oidx] =
+                                (To)roundf((float)(in_ptr[iidx] / fftScale));
+                        else
+                            out_ptr[oidx] = (To)(in_ptr[iidx] / fftScale);
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename T, typename convT, int Rank, bool Expand>
+void reorder(Param<T> out, Param<convT> packed, CParam<T> filter,
+             const dim_t sig_half_d0, const dim_t fftScale,
+             const dim4 sig_tmp_dims, const dim4 sig_tmp_strides,
+             const dim4 filter_tmp_dims, const dim4 filter_tmp_strides,
+             AF_BATCH_KIND kind) {
+    T* out_ptr                 = out.get();
+    const af::dim4 out_dims    = out.dims();
+    const af::dim4 out_strides = out.strides();
+
+    const af::dim4 filter_dims = filter.dims();
+
+    convT* packed_ptr     = packed.get();
+    convT* sig_tmp_ptr    = packed_ptr;
+    convT* filter_tmp_ptr = packed_ptr + sig_tmp_strides[3] * sig_tmp_dims[3];
+
+    // Reorder the output
+    if (kind == AF_BATCH_RHS) {
+        reorderHelper<T, convT, Rank, Expand>(
+            out_ptr, out_dims, out_strides, filter_tmp_ptr, filter_tmp_dims,
+            filter_tmp_strides, filter_dims, sig_half_d0, fftScale);
+    } else {
+        reorderHelper<T, convT, Rank, Expand>(
+            out_ptr, out_dims, out_strides, sig_tmp_ptr, sig_tmp_dims,
+            sig_tmp_strides, filter_dims, sig_half_d0, fftScale);
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/flood_fill.hpp b/src/backend/cpu/kernel/flood_fill.hpp
new file mode 100644
index 0000000000..121adc87e6
--- /dev/null
+++ b/src/backend/cpu/kernel/flood_fill.hpp
@@ -0,0 +1,123 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <ParamIterator.hpp>
+#include <common/defines.hpp>
+
+#include <queue>
+#include <utility>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+// Output array is set to the following values during the progression
+// of the algorithm.
+//
+// 0 - not visited at all (default values in output because it was created
+//                         using createValueArray helper at the level of the
+//                         function's caller)
+// 1 - not valid
+// 2 - valid (candidate for neighborhood walk, pushed onto the queue)
+//
+// Once the algorithm is finished, the output is reset to \p newValue
+// for all valid pixels and to zero everywhere else.
+template<typename T>
+void floodFill(Param<T> out, CParam<T> in, CParam<uint> x, CParam<uint> y,
+               T newValue, T lower, T upper, af::connectivity connectivity) {
+    UNUSED(connectivity);
+
+    using af::dim4;
+    using PtrDist    = typename ParamIterator<T>::difference_type;
+    using Point      = std::pair<uint, uint>;
+    using Candidates = std::queue<Point>;
+
+    const dim4 dims    = in.dims();
+    const dim4 strides = in.strides();
+
+    ParamIterator<T> endOfNeighborhood;
+    const dim4 nhoodRadii(1, 1, 0, 0);
+    const dim4 nhood(2 * nhoodRadii[0] + 1, 2 * nhoodRadii[1] + 1,
+                     2 * nhoodRadii[2] + 1, 2 * nhoodRadii[3] + 1);
+
+    auto isInside = [&dims](uint x, uint y) {
+        return (x >= 0 && x < dims[0] && y >= 0 && y < dims[1]);
+    };
+    auto leftTopPtr = [&strides, &nhoodRadii](T* ptr, const af::dim4& center) {
+        T* ltPtr = ptr;
+        for (dim_t d = 0; d < AF_MAX_DIMS; ++d) {
+            ltPtr += ((center[d] - nhoodRadii[d]) * strides[d]);
+        }
+        return ltPtr;
+    };
+    Candidates queue;
+    {
+        auto oit = begin(out);
+        for (auto xit = begin(x), yit = begin(y);
+             xit != end(x) && yit != end(y); ++xit, ++yit) {
+            if (isInside(*xit, *yit)) {
+                queue.emplace(*xit, *yit);
+                oit.operator->()[(*xit) + (*yit) * dims[0]] = T(2);
+            }
+        }
+    }
+
+    T* inPtr  = const_cast<T*>(in.get());
+    T* outPtr = out.get();
+
+    while (!queue.empty()) {
+        Point& p = queue.front();
+
+        const dim4 center(p.first, p.second, 0, 0);
+
+        CParam<T> inNHood(const_cast<const T*>(leftTopPtr(inPtr, center)),
+                          nhood, strides);
+        Param<T> outNHood(leftTopPtr(outPtr, center), nhood, strides);
+
+        ParamIterator<T> inIter(inNHood);
+        ParamIterator<T> outIter(outNHood);
+
+        while (inIter != endOfNeighborhood) {
+            const T* ptr     = inIter.operator->();
+            PtrDist dist     = ptr - inPtr;
+            const uint currx = static_cast<uint>(dist % dims[0]);
+            const uint curry = static_cast<uint>(dist / dims[0]);
+
+            if (isInside(currx, curry) && (*outIter == 0)) {
+                // Current point is inside image boundaries and hasn't been
+                // visited at all.
+                if (*inIter >= lower && *inIter <= upper) {
+                    // Current pixel is within threshold limits.
+                    // Mark as valid and push on to the queue
+                    *outIter = T(2);
+                    queue.emplace(currx, curry);
+                } else {
+                    // Not valid pixel
+                    *outIter = T(1);
+                }
+            }
+            // Both input and output neighborhood iterators
+            // should increment in lock step for this algorithm
+            // to work correctly
+            ++inIter;
+            ++outIter;
+        }
+        queue.pop();
+    }
+
+    for (auto outIter = begin(out); outIter != end(out); ++outIter) {
+        *outIter = (*outIter == T(2) ? newValue : T(0));
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/gradient.hpp b/src/backend/cpu/kernel/gradient.hpp
new file mode 100644
index 0000000000..407f4fc6da
--- /dev/null
+++ b/src/backend/cpu/kernel/gradient.hpp
@@ -0,0 +1,88 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void gradient(Param<T> grad0, Param<T> grad1, CParam<T> in) {
+    const af::dim4 dims = in.dims();
+
+    T *d_grad0    = grad0.get();
+    T *d_grad1    = grad1.get();
+    const T *d_in = in.get();
+
+    const af::dim4 inst = in.strides();
+    const af::dim4 g0st = grad0.strides();
+    const af::dim4 g1st = grad1.strides();
+
+    T v5 = scalar<T>(0.5);
+    T v1 = scalar<T>(1.0);
+
+    for (dim_t idw = 0; idw < dims[3]; idw++) {
+        const dim_t inW = idw * inst[3];
+        const dim_t g0W = idw * g0st[3];
+        const dim_t g1W = idw * g1st[3];
+        for (dim_t idz = 0; idz < dims[2]; idz++) {
+            const dim_t inZW = inW + idz * inst[2];
+            const dim_t g0ZW = g0W + idz * g0st[2];
+            const dim_t g1ZW = g1W + idz * g1st[2];
+            dim_t xl, xr, yl, yr;
+            T f0, f1;
+            for (dim_t idy = 0; idy < dims[1]; idy++) {
+                const dim_t inYZW = inZW + idy * inst[1];
+                const dim_t g0YZW = g0ZW + idy * g0st[1];
+                const dim_t g1YZW = g1ZW + idy * g1st[1];
+                if (idy == 0) {
+                    yl = inYZW + inst[1];
+                    yr = inYZW;
+                    f1 = v1;
+                } else if (idy == dims[1] - 1) {
+                    yl = inYZW;
+                    yr = inYZW - inst[1];
+                    f1 = v1;
+                } else {
+                    yl = inYZW + inst[1];
+                    yr = inYZW - inst[1];
+                    f1 = v5;
+                }
+                for (dim_t idx = 0; idx < dims[0]; idx++) {
+                    const dim_t inMem = inYZW + idx;
+                    const dim_t g0Mem = g0YZW + idx;
+                    const dim_t g1Mem = g1YZW + idx;
+                    if (idx == 0) {
+                        xl = inMem + 1;
+                        xr = inMem;
+                        f0 = v1;
+                    } else if (idx == dims[0] - 1) {
+                        xl = inMem;
+                        xr = inMem - 1;
+                        f0 = v1;
+                    } else {
+                        xl = inMem + 1;
+                        xr = inMem - 1;
+                        f0 = v5;
+                    }
+
+                    d_grad0[g0Mem] = f0 * (d_in[xl] - d_in[xr]);
+                    d_grad1[g1Mem] = f1 * (d_in[yl + idx] - d_in[yr + idx]);
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/harris.hpp b/src/backend/cpu/kernel/harris.hpp
new file mode 100644
index 0000000000..4b717c6187
--- /dev/null
+++ b/src/backend/cpu/kernel/harris.hpp
@@ -0,0 +1,122 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void second_order_deriv(Param<T> ixx, Param<T> ixy, Param<T> iyy,
+                        const unsigned in_len, CParam<T> ix, CParam<T> iy) {
+    T* ixx_out     = ixx.get();
+    T* ixy_out     = ixy.get();
+    T* iyy_out     = iyy.get();
+    const T* ix_in = ix.get();
+    const T* iy_in = iy.get();
+    for (unsigned x = 0; x < in_len; x++) {
+        ixx_out[x] = ix_in[x] * ix_in[x];
+        ixy_out[x] = ix_in[x] * iy_in[x];
+        iyy_out[x] = iy_in[x] * iy_in[x];
+    }
+}
+
+template<typename T>
+void harris_responses(Param<T> resp, const unsigned idim0, const unsigned idim1,
+                      CParam<T> ixx, CParam<T> ixy, CParam<T> iyy,
+                      const float k_thr, const unsigned border_len) {
+    T* resp_out      = resp.get();
+    const T* ixx_in  = ixx.get();
+    const T* ixy_in  = ixy.get();
+    const T* iyy_in  = iyy.get();
+    const unsigned r = border_len;
+
+    for (unsigned x = r; x < idim1 - r; x++) {
+        for (unsigned y = r; y < idim0 - r; y++) {
+            const unsigned idx = x * idim0 + y;
+
+            // Calculates matrix trace and determinant
+            T tr  = ixx_in[idx] + iyy_in[idx];
+            T det = ixx_in[idx] * iyy_in[idx] - ixy_in[idx] * ixy_in[idx];
+
+            // Calculates local Harris response
+            resp_out[idx] = det - k_thr * (tr * tr);
+        }
+    }
+}
+
+template<typename T>
+void non_maximal(Param<float> xOut, Param<float> yOut, Param<float> respOut,
+                 unsigned* count, const unsigned idim0, const unsigned idim1,
+                 CParam<T> respIn, const float min_resp,
+                 const unsigned border_len, const unsigned max_corners) {
+    float* x_out     = xOut.get();
+    float* y_out     = yOut.get();
+    float* resp_out  = respOut.get();
+    const T* resp_in = respIn.get();
+    // Responses on the border don't have 8-neighbors to compare, discard them
+    const unsigned r = border_len + 1;
+
+    for (unsigned x = r; x < idim1 - r; x++) {
+        for (unsigned y = r; y < idim0 - r; y++) {
+            const T v = resp_in[x * idim0 + y];
+
+            // Find maximum neighborhood response
+            T max_v;
+            max_v = std::max(resp_in[(x - 1) * idim0 + y - 1],
+                             resp_in[x * idim0 + y - 1]);
+            max_v = std::max(max_v, resp_in[(x + 1) * idim0 + y - 1]);
+            max_v = std::max(max_v, resp_in[(x - 1) * idim0 + y]);
+            max_v = std::max(max_v, resp_in[(x + 1) * idim0 + y]);
+            max_v = std::max(max_v, resp_in[(x - 1) * idim0 + y + 1]);
+            max_v = std::max(max_v, resp_in[(x)*idim0 + y + 1]);
+            max_v = std::max(max_v, resp_in[(x + 1) * idim0 + y + 1]);
+
+            // Stores the corner in {x,y,resp}_out if its response is strictly
+            // greater than its 8-neighborhood maximum and greater than or
+            // equal to the minimum response
+            if (v > max_v && v >= (T)min_resp) {
+                const unsigned idx = *count;
+                *count += 1;
+                if (idx < max_corners) {
+                    x_out[idx]    = (float)x;
+                    y_out[idx]    = (float)y;
+                    resp_out[idx] = (float)v;
+                }
+            }
+        }
+    }
+}
+
+static void keep_corners(Param<float> xOut, Param<float> yOut,
+                         Param<float> respOut, CParam<float> xIn,
+                         CParam<float> yIn, CParam<float> respIn,
+                         CParam<unsigned> respIdx, const unsigned n_corners) {
+    float* x_out         = xOut.get();
+    float* y_out         = yOut.get();
+    float* resp_out      = respOut.get();
+    const float* x_in    = xIn.get();
+    const float* y_in    = yIn.get();
+    const float* resp_in = respIn.get();
+    const uint* resp_idx = respIdx.get();
+
+    // Keep only the first n_corners corners
+    for (unsigned f = 0; f < n_corners; f++) {
+        x_out[f]    = x_in[resp_idx[f]];
+        y_out[f]    = y_in[resp_idx[f]];
+        resp_out[f] = resp_in[f];
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/histogram.hpp b/src/backend/cpu/kernel/histogram.hpp
new file mode 100644
index 0000000000..fb90631c52
--- /dev/null
+++ b/src/backend/cpu/kernel/histogram.hpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, bool IsLinear>
+void histogram(Param<uint> out, CParam<T> in, const unsigned nbins,
+               const double minval, const double maxval) {
+    dim4 const outDims  = out.dims();
+    float const step    = (maxval - minval) / (float)nbins;
+    dim4 const inDims   = in.dims();
+    dim4 const iStrides = in.strides();
+    dim4 const oStrides = out.strides();
+    dim_t const nElems  = inDims[0] * inDims[1];
+
+    auto minValT = compute_t<T>(minval);
+    for (dim_t b3 = 0; b3 < outDims[3]; b3++) {
+        uint* outData   = out.get() + b3 * oStrides[3];
+        const T* inData = in.get() + b3 * iStrides[3];
+        for (dim_t b2 = 0; b2 < outDims[2]; b2++) {
+            for (dim_t i = 0; i < nElems; i++) {
+                int idx =
+                    IsLinear
+                        ? i
+                        : ((i % inDims[0]) + (i / inDims[0]) * iStrides[1]);
+                int bin = (int)((compute_t<T>(inData[idx]) - minValT) / step);
+                bin     = std::max(bin, 0);
+                bin     = std::min(bin, (int)(nbins - 1));
+                outData[bin]++;
+            }
+            inData += iStrides[2];
+            outData += oStrides[2];
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/hsv_rgb.hpp b/src/backend/cpu/kernel/hsv_rgb.hpp
new file mode 100644
index 0000000000..1bf4c387bc
--- /dev/null
+++ b/src/backend/cpu/kernel/hsv_rgb.hpp
@@ -0,0 +1,121 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <cmath>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void hsv2rgb(Param<T> out, CParam<T> in) {
+    const af::dim4 dims    = in.dims();
+    const af::dim4 strides = in.strides();
+    dim_t obStride         = out.strides(3);
+    dim_t coff             = strides[2];
+    dim_t bCount           = dims[3];
+
+    for (dim_t b = 0; b < bCount; ++b) {
+        const T* src = in.get() + b * strides[3];
+        T* dst       = out.get() + b * obStride;
+
+        for (dim_t j = 0; j < dims[1]; ++j) {
+            dim_t jOff = j * strides[1];
+            // j steps along 2nd dimension
+            for (dim_t i = 0; i < dims[0]; ++i) {
+                // i steps along 1st dimension
+                dim_t hIdx = i * strides[0] + jOff;
+                dim_t sIdx = hIdx + coff;
+                dim_t vIdx = sIdx + coff;
+
+                T H = src[hIdx];
+                T S = src[sIdx];
+                T V = src[vIdx];
+
+                T R, G, B;
+                R = G = B = 0;
+
+                int m = (int)(H * 6);
+                T f   = H * 6 - m;
+                T p   = V * (1 - S);
+                T q   = V * (1 - f * S);
+                T t   = V * (1 - (1 - f) * S);
+
+                switch (m % 6) {
+                    case 0: R = V, G = t, B = p; break;
+                    case 1: R = q, G = V, B = p; break;
+                    case 2: R = p, G = V, B = t; break;
+                    case 3: R = p, G = q, B = V; break;
+                    case 4: R = t, G = p, B = V; break;
+                    case 5: R = V, G = p, B = q; break;
+                }
+
+                dst[hIdx] = R;
+                dst[sIdx] = G;
+                dst[vIdx] = B;
+            }
+        }
+    }
+}
+
+template<typename T>
+void rgb2hsv(Param<T> out, CParam<T> in) {
+    const af::dim4 dims    = in.dims();
+    const af::dim4 strides = in.strides();
+    af::dim4 oStrides      = out.strides();
+    dim_t bCount           = dims[3];
+
+    for (dim_t b = 0; b < bCount; ++b) {
+        const T* src = in.get() + b * strides[3];
+        T* dst       = out.get() + b * oStrides[3];
+
+        for (dim_t j = 0; j < dims[1]; ++j) {
+            // j steps along 2nd dimension
+            dim_t oj = j * oStrides[1];
+            dim_t ij = j * strides[1];
+
+            for (dim_t i = 0; i < dims[0]; ++i) {
+                // i steps along 1st dimension
+                dim_t oIdx0 = i * oStrides[0] + oj;
+                dim_t oIdx1 = oIdx0 + oStrides[2];
+                dim_t oIdx2 = oIdx1 + oStrides[2];
+
+                dim_t iIdx0 = i * strides[0] + ij;
+                dim_t iIdx1 = iIdx0 + strides[2];
+                dim_t iIdx2 = iIdx1 + strides[2];
+
+                T R     = src[iIdx0];
+                T G     = src[iIdx1];
+                T B     = src[iIdx2];
+                T Cmax  = std::max(std::max(R, G), B);
+                T Cmin  = std::min(std::min(R, G), B);
+                T delta = Cmax - Cmin;
+
+                T H = 0;
+
+                if (Cmax != Cmin) {
+                    if (Cmax == R) H = (G - B) / delta + (G < B ? 6 : 0);
+                    if (Cmax == G) H = (B - R) / delta + 2;
+                    if (Cmax == B) H = (R - G) / delta + 4;
+                    H = H / 6.0f;
+                }
+
+                dst[oIdx0] = H;
+                dst[oIdx1] = (Cmax == 0.0f ? 0 : delta / Cmax);
+                dst[oIdx2] = Cmax;
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/identity.hpp b/src/backend/cpu/kernel/identity.hpp
new file mode 100644
index 0000000000..a00a2cc83c
--- /dev/null
+++ b/src/backend/cpu/kernel/identity.hpp
@@ -0,0 +1,36 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void identity(Param<T> out) {
+    T *ptr                  = out.get();
+    const af::dim4 out_dims = out.dims();
+
+    for (dim_t k = 0; k < out_dims[2] * out_dims[3]; k++) {
+        for (dim_t j = 0; j < out_dims[1]; j++) {
+            for (dim_t i = 0; i < out_dims[0]; i++) {
+                ptr[j * out_dims[0] + i] =
+                    (i == j) ? scalar<T>(1) : scalar<T>(0);
+            }
+        }
+        ptr += out_dims[0] * out_dims[1];
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/iir.hpp b/src/backend/cpu/kernel/iir.hpp
new file mode 100644
index 0000000000..515d778f5d
--- /dev/null
+++ b/src/backend/cpu/kernel/iir.hpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void iir(Param<T> y, Param<T> c, CParam<T> a) {
+    dim4 ydims = c.dims();
+    int num_a  = a.dims(0);
+
+    for (int l = 0; l < (int)ydims[3]; l++) {
+        dim_t yidx3 = l * y.strides(3);
+        dim_t cidx3 = l * c.strides(3);
+        dim_t aidx3 = l * a.strides(3);
+
+        for (int k = 0; k < (int)ydims[2]; k++) {
+            dim_t yidx2 = k * y.strides(2) + yidx3;
+            dim_t cidx2 = k * c.strides(2) + cidx3;
+            dim_t aidx2 = k * a.strides(2) + aidx3;
+
+            for (int j = 0; j < (int)ydims[1]; j++) {
+                dim_t yidx1 = j * y.strides(1) + yidx2;
+                dim_t cidx1 = j * c.strides(1) + cidx2;
+                dim_t aidx1 = j * a.strides(1) + aidx2;
+
+                std::vector<T> h_z(num_a);
+
+                const T *h_a = a.get() + (a.dims().ndims() > 1 ? aidx1 : 0);
+                T *h_c       = c.get() + cidx1;
+                T *h_y       = y.get() + yidx1;
+
+                for (int i = 0; i < (int)ydims[0]; i++) {
+                    T y = h_y[i] = (h_c[i] + h_z[0]) / h_a[0];
+                    for (int ii = 1; ii < num_a; ii++) {
+                        h_z[ii - 1] = h_z[ii] - h_a[ii] * y;
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/index.hpp b/src/backend/cpu/kernel/index.hpp
new file mode 100644
index 0000000000..962b0713dc
--- /dev/null
+++ b/src/backend/cpu/kernel/index.hpp
@@ -0,0 +1,70 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void index(Param<T> out, CParam<T> in, const af::dim4 dDims,
+           std::vector<bool> const isSeq, std::vector<af_seq> const seqs,
+           std::vector<CParam<uint>> idxArrs) {
+    const af::dim4 iDims    = in.dims();
+    const af::dim4 iOffs    = toOffset(seqs, dDims);
+    const af::dim4 iStrds   = in.strides();
+    const af::dim4 oDims    = out.dims();
+    const af::dim4 oStrides = out.strides();
+    const T* src            = in.get();
+    T* dst                  = out.get();
+    const uint* ptr0        = idxArrs[0].get();
+    const uint* ptr1        = idxArrs[1].get();
+    const uint* ptr2        = idxArrs[2].get();
+    const uint* ptr3        = idxArrs[3].get();
+
+    for (dim_t l = 0; l < oDims[3]; ++l) {
+        dim_t lOff   = l * oStrides[3];
+        dim_t inIdx3 = trimIndex(
+            isSeq[3] ? l * seqs[3].step + iOffs[3] : ptr3[l], iDims[3]);
+        dim_t inOff3 = inIdx3 * iStrds[3];
+
+        for (dim_t k = 0; k < oDims[2]; ++k) {
+            dim_t kOff   = k * oStrides[2];
+            dim_t inIdx2 = trimIndex(
+                isSeq[2] ? k * seqs[2].step + iOffs[2] : ptr2[k], iDims[2]);
+            dim_t inOff2 = inIdx2 * iStrds[2];
+
+            for (dim_t j = 0; j < oDims[1]; ++j) {
+                dim_t jOff   = j * oStrides[1];
+                dim_t inIdx1 = trimIndex(
+                    isSeq[1] ? j * seqs[1].step + iOffs[1] : ptr1[j], iDims[1]);
+                dim_t inOff1 = inIdx1 * iStrds[1];
+
+                for (dim_t i = 0; i < oDims[0]; ++i) {
+                    dim_t iOff   = i * oStrides[0];
+                    dim_t inIdx0 = trimIndex(
+                        isSeq[0] ? i * seqs[0].step + iOffs[0] : ptr0[i],
+                        iDims[0]);
+                    dim_t inOff0 = inIdx0 * iStrds[0];
+
+                    dst[lOff + kOff + jOff + iOff] =
+                        src[inOff3 + inOff2 + inOff1 + inOff0];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/interp.hpp b/src/backend/cpu/kernel/interp.hpp
new file mode 100644
index 0000000000..d316b22f19
--- /dev/null
+++ b/src/backend/cpu/kernel/interp.hpp
@@ -0,0 +1,353 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <math.hpp>
+#include <af/constants.h>
+#include <type_traits>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+using std::conditional;
+using std::is_same;
+
+template<typename T>
+using wtype_t =
+    typename conditional<is_same<T, double>::value, double, float>::type;
+
+template<typename T>
+using vtype_t =
+    typename conditional<common::is_complex<T>::value, T, wtype_t<T>>::type;
+
+template<typename InT, typename LocT>
+InT linearInterpFunc(InT val[2], LocT ratio) {
+    return (1 - ratio) * val[0] + ratio * val[1];
+}
+
+template<typename InT, typename LocT>
+InT bilinearInterpFunc(InT val[2][2], LocT xratio, LocT yratio) {
+    InT res[2];
+    res[0] = linearInterpFunc(val[0], xratio);
+    res[1] = linearInterpFunc(val[1], xratio);
+    return linearInterpFunc(res, yratio);
+}
+
+template<typename InT, typename LocT>
+InT cubicInterpFunc(InT val[4], LocT xratio, bool spline) {
+    InT a0, a1, a2, a3;
+    if (spline) {
+        a0 = scalar<InT>(-0.5) * val[0] + scalar<InT>(1.5) * val[1] +
+             scalar<InT>(-1.5) * val[2] + scalar<InT>(0.5) * val[3];
+
+        a1 = scalar<InT>(1.0) * val[0] + scalar<InT>(-2.5) * val[1] +
+             scalar<InT>(2.0) * val[2] + scalar<InT>(-0.5) * val[3];
+
+        a2 = scalar<InT>(-0.5) * val[0] + scalar<InT>(0.5) * val[2];
+
+        a3 = val[1];
+    } else {
+        a0 = val[3] - val[2] - val[0] + val[1];
+        a1 = val[0] - val[1] - a0;
+        a2 = val[2] - val[0];
+        a3 = val[1];
+    }
+
+    LocT xratio2 = xratio * xratio;
+    LocT xratio3 = xratio2 * xratio;
+
+    return a0 * xratio3 + a1 * xratio2 + a2 * xratio + a3;
+}
+
+template<typename InT, typename LocT>
+InT bicubicInterpFunc(InT val[4][4], LocT xratio, LocT yratio, bool spline) {
+    InT res[4];
+    res[0] = cubicInterpFunc(val[0], xratio, spline);
+    res[1] = cubicInterpFunc(val[1], xratio, spline);
+    res[2] = cubicInterpFunc(val[2], xratio, spline);
+    res[3] = cubicInterpFunc(val[3], xratio, spline);
+    return cubicInterpFunc(res, yratio, spline);
+}
+
+template<typename InT, typename LocT, int order>
+struct Interp1 {};
+
+template<typename InT, typename LocT>
+struct Interp1<InT, LocT, 1> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, af_interp_type method, int batch, bool clamp,
+                    int xdim = 0, int batch_dim = 1) {
+        const InT *inptr    = in.get();
+        const dim4 idims    = in.dims();
+        const dim4 istrides = in.strides();
+
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        const int x_lim    = idims[xdim];
+        const int x_stride = istrides[xdim];
+
+        int xid   = (method == AF_INTERP_LOWER ? std::floor(x) : std::round(x));
+        bool cond = xid >= 0 && xid < x_lim;
+        if (clamp) xid = std::max(0, std::min(xid, x_lim));
+
+        const int idx = ioff + xid * x_stride;
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+            outptr[ooff + n * ostrides[batch_dim]] =
+                (cond || clamp) ? inptr[idx_n] : scalar<InT>(0);
+        }
+    }
+};
+
+template<typename InT, typename LocT>
+struct Interp1<InT, LocT, 2> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, af_interp_type method, int batch, bool clamp,
+                    int xdim = 0, int batch_dim = 1) {
+        typedef vtype_t<InT> VT;
+
+        const int grid_x = floor(x);    // lower grid point
+        const LocT off_x = x - grid_x;  // fractional offset
+
+        const InT *inptr    = in.get();
+        const dim4 idims    = in.dims();
+        const dim4 istrides = in.strides();
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        const int x_lim    = idims[xdim];
+        const int x_stride = istrides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[2] = {true, grid_x + 1 < x_lim};
+        int offx[2]  = {0, cond[1] ? 1 : 0};
+
+        LocT ratio = off_x;
+        if (method == AF_INTERP_LINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            ratio = (1 - std::cos(ratio * af::Pi)) / 2;
+        }
+
+        const VT zero = scalar<VT>(0);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+            VT val[2] = {zero, zero};
+            for (int i = 0; i < 2; i++) {
+                if (clamp || cond[i])
+                    val[i] = inptr[idx_n + offx[i] * x_stride];
+            }
+            outptr[ooff + n * ostrides[batch_dim]] =
+                linearInterpFunc(val, ratio);
+        }
+    }
+};
+
+template<typename InT, typename LocT>
+struct Interp1<InT, LocT, 3> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, af_interp_type method, int batch, bool clamp,
+                    int xdim = 0, int batch_dim = 1) {
+        typedef vtype_t<InT> VT;
+
+        const int grid_x = floor(x);    // grid point at or below x
+        const LocT off_x = x - grid_x;  // fractional offset
+
+        const InT *inptr    = in.get();
+        const dim4 idims    = in.dims();
+        const dim4 istrides = in.strides();
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        const int x_lim    = idims[xdim];
+        const int x_stride = istrides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                        grid_x + 2 < x_lim};
+        int off[4]   = {cond[0] ? -1 : 0, 0, cond[2] ? 1 : 0,
+                      cond[3] ? 2 : (cond[2] ? 1 : 0)};
+
+        const VT zero = scalar<VT>(0);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+            VT val[4] = {zero, zero, zero, zero};
+            for (int i = 0; i < 4; i++) {
+                if (clamp || cond[i]) val[i] = inptr[idx_n + off[i] * x_stride];
+            }
+            bool spline = method == AF_INTERP_CUBIC_SPLINE;
+            outptr[ooff + n * ostrides[batch_dim]] =
+                cubicInterpFunc(val, off_x, spline);
+        }
+    }
+};
+
+template<typename InT, typename LocT, int order>
+struct Interp2 {};
+
+template<typename InT, typename LocT>
+struct Interp2<InT, LocT, 1> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, LocT y, af_interp_type method, int nimages,
+                    bool clamp, int xdim = 0, int ydim = 1, int batch_dim = 2) {
+        const InT *inptr    = in.get();
+        const dim4 istrides = in.strides();
+        const dim4 idims    = in.dims();
+
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        int xid = (method == AF_INTERP_LOWER ? std::floor(x) : std::round(x));
+        int yid = (method == AF_INTERP_LOWER ? std::floor(y) : std::round(y));
+
+        const int x_lim    = idims[xdim];
+        const int y_lim    = idims[ydim];
+        const int x_stride = istrides[xdim];
+        const int y_stride = istrides[ydim];
+
+        bool condX = xid >= 0 && xid < x_lim;
+        bool condY = yid >= 0 && yid < y_lim;
+        if (clamp) {
+            xid = std::max(0, std::min(xid, x_lim - 1));
+            yid = std::max(0, std::min(yid, y_lim - 1));
+        }
+        // Index from the (possibly clamped) coordinates
+        const int idx = ioff + yid * y_stride + xid * x_stride;
+
+        bool cond = condX && condY;
+        for (int n = 0; n < nimages; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+            outptr[ooff + n * ostrides[batch_dim]] =
+                (clamp || cond) ? inptr[idx_n] : scalar<InT>(0);
+        }
+    }
+};
+
+template<typename InT, typename LocT>
+struct Interp2<InT, LocT, 2> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, LocT y, af_interp_type method, int nimages,
+                    bool clamp, int xdim = 0, int ydim = 1, int batch_dim = 2) {
+        typedef vtype_t<InT> VT;
+
+        const InT *inptr    = in.get();
+        const dim4 idims    = in.dims();
+        const dim4 istrides = in.strides();
+
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        const int grid_x = floor(x);
+        const LocT off_x = x - grid_x;
+
+        const int grid_y = floor(y);
+        const LocT off_y = y - grid_y;
+
+        const int x_lim    = idims[xdim];
+        const int y_lim    = idims[ydim];
+        const int x_stride = istrides[xdim];
+        const int y_stride = istrides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        bool condX[2] = {true, x + 1 < x_lim};
+        bool condY[2] = {true, y + 1 < y_lim};
+
+        int offX[2] = {0, condX[1] ? 1 : 0};
+        int offY[2] = {0, condY[1] ? 1 : 0};
+
+        VT zero = scalar<VT>(0);
+
+        LocT xratio = off_x, yratio = off_y;
+        if (method == AF_INTERP_LINEAR_COSINE ||
+            method == AF_INTERP_BILINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            xratio = (1 - std::cos(xratio * af::Pi)) / 2;
+            yratio = (1 - std::cos(yratio * af::Pi)) / 2;
+        }
+
+        for (int n = 0; n < nimages; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+            VT val[2][2];
+            for (int j = 0; j < 2; j++) {
+                int yoff = idx_n + offY[j] * y_stride;
+                for (int i = 0; i < 2; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] = cond ? inptr[yoff + offX[i] * x_stride] : zero;
+                }
+            }
+            outptr[ooff + n * ostrides[batch_dim]] =
+                bilinearInterpFunc(val, xratio, yratio);
+        }
+    }
+};
+
+template<typename InT, typename LocT>
+struct Interp2<InT, LocT, 3> {
+    void operator()(Param<InT> &out, int ooff, CParam<InT> &in, int ioff,
+                    LocT x, LocT y, af_interp_type method, int nimages,
+                    bool clamp, int xdim = 0, int ydim = 1, int batch_dim = 2) {
+        typedef vtype_t<InT> VT;
+
+        const InT *inptr    = in.get();
+        const dim4 idims    = in.dims();
+        const dim4 istrides = in.strides();
+
+        InT *outptr         = out.get();
+        const dim4 ostrides = out.strides();
+
+        const int grid_x = floor(x);
+        const LocT off_x = x - grid_x;
+
+        const int grid_y = floor(y);
+        const LocT off_y = y - grid_y;
+
+        const int x_lim    = idims[xdim];
+        const int y_lim    = idims[ydim];
+        const int x_stride = istrides[xdim];
+        const int y_stride = istrides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        // used for setting values at boundaries
+        bool condX[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                         grid_x + 2 < x_lim};
+        bool condY[4] = {grid_y - 1 >= 0, true, grid_y + 1 < y_lim,
+                         grid_y + 2 < y_lim};
+        int offX[4]   = {condX[0] ? -1 : 0, 0, condX[2] ? 1 : 0,
+                       condX[3] ? 2 : (condX[2] ? 1 : 0)};
+        int offY[4]   = {condY[0] ? -1 : 0, 0, condY[2] ? 1 : 0,
+                       condY[3] ? 2 : (condY[2] ? 1 : 0)};
+
+        bool spline = (method == AF_INTERP_CUBIC_SPLINE ||
+                       method == AF_INTERP_BICUBIC_SPLINE);
+        VT zero     = scalar<VT>(0);
+        for (int n = 0; n < nimages; n++) {
+            int idx_n = idx + n * istrides[batch_dim];
+
+            // for bicubic interpolation, work with 4x4 val at a time
+            VT val[4][4];
+            for (int j = 0; j < 4; j++) {
+                int ioff_j = idx_n + offY[j] * y_stride;
+                for (int i = 0; i < 4; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] =
+                        cond ? inptr[ioff_j + offX[i] * x_stride] : zero;
+                }
+            }
+            outptr[ooff + n * ostrides[batch_dim]] =
+                bicubicInterpFunc(val, off_x, off_y, spline);
+        }
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/iota.hpp b/src/backend/cpu/kernel/iota.hpp
new file mode 100644
index 0000000000..ef575a8166
--- /dev/null
+++ b/src/backend/cpu/kernel/iota.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void iota(Param<T> output, const af::dim4& sdims) {
+    const af::dim4 dims    = output.dims();
+    data_t<T>* out         = output.get();
+    const af::dim4 strides = output.strides();
+
+    for (dim_t w = 0; w < dims[3]; w++) {
+        dim_t offW = w * strides[3];
+        dim_t valW = (w % sdims[3]) * sdims[0] * sdims[1] * sdims[2];
+        for (dim_t z = 0; z < dims[2]; z++) {
+            dim_t offWZ = offW + z * strides[2];
+            dim_t valZ  = valW + (z % sdims[2]) * sdims[0] * sdims[1];
+            for (dim_t y = 0; y < dims[1]; y++) {
+                dim_t offWZY = offWZ + y * strides[1];
+                dim_t valY   = valZ + (y % sdims[1]) * sdims[0];
+                for (dim_t x = 0; x < dims[0]; x++) {
+                    dim_t id = offWZY + x;
+                    out[id]  = valY + (x % sdims[0]);
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/ireduce.hpp b/src/backend/cpu/kernel/ireduce.hpp
new file mode 100644
index 0000000000..9d2598af4b
--- /dev/null
+++ b/src/backend/cpu/kernel/ireduce.hpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/half.hpp>
+#include <algorithm>
+#include <cmath>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+double cabs(const T in) {
+    return (double)in;
+}
+static double cabs(const char in) { return (double)(in > 0); }
+static double cabs(const cfloat &in) { return (double)abs(in); }
+static double cabs(const cdouble &in) { return (double)abs(in); }
+
+template<af_op_t op, typename T>
+struct MinMaxOp {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        using arrayfire::cpu::is_nan;
+        if (is_nan(val)) { m_val = common::Binary<T, op>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) < cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx > m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+template<typename T>
+struct MinMaxOp<af_max_t, T> {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        using arrayfire::cpu::is_nan;
+        if (is_nan(val)) { m_val = common::Binary<T, af_max_t>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) > cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx <= m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+template<af_op_t op, typename T, int D>
+struct ireduce_dim {
+    void operator()(Param<T> output, Param<uint> locParam,
+                    const dim_t outOffset, CParam<T> input,
+                    const dim_t inOffset, const int dim, CParam<uint> rlen) {
+        const af::dim4 odims    = output.dims();
+        const af::dim4 ostrides = output.strides();
+        const af::dim4 istrides = input.strides();
+        const int D1            = D - 1;
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            ireduce_dim<op, T, D1>()(output, locParam,
+                                     outOffset + i * ostrides[D1], input,
+                                     inOffset + i * istrides[D1], dim, rlen);
+        }
+    }
+};
+
+template<af_op_t op, typename T>
+struct ireduce_dim<op, T, 0> {
+    void operator()(Param<T> output, Param<uint> locParam,
+                    const dim_t outOffset, CParam<T> input,
+                    const dim_t inOffset, const int dim, CParam<uint> rlen) {
+        const af::dim4 idims    = input.dims();
+        const af::dim4 istrides = input.strides();
+
+        T const *const in   = input.get();
+        T *out              = output.get();
+        uint *loc           = locParam.get();
+        const uint *rlenptr = (rlen.get()) ? rlen.get() + outOffset : nullptr;
+
+        dim_t stride = istrides[dim];
+        MinMaxOp<op, T> Op(in[inOffset], 0);
+        const dim_t lim =
+            (rlenptr) ? std::min(idims[dim], (dim_t)*rlenptr) : idims[dim];
+        for (dim_t i = 0; i < lim; i++) { Op(in[inOffset + i * stride], i); }
+
+        out[outOffset] = Op.m_val;
+        loc[outOffset] = Op.m_idx;
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/join.hpp b/src/backend/cpu/kernel/join.hpp
new file mode 100644
index 0000000000..800ded1270
--- /dev/null
+++ b/src/backend/cpu/kernel/join.hpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <cstring>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+inline af::dim4 calcOffset(const af::dim4 dims, int dim) {
+    af::dim4 offset;
+    offset[0] = (dim == 0) ? dims[0] : 0;
+    offset[1] = (dim == 1) ? dims[1] : 0;
+    offset[2] = (dim == 2) ? dims[2] : 0;
+    offset[3] = (dim == 3) ? dims[3] : 0;
+    return offset;
+}
+
+template<typename T>
+void join_append(T *out, const T *X, const af::dim4 &offset,
+                 const af::dim4 &xdims, const af::dim4 &ost,
+                 const af::dim4 &xst) {
+    for (dim_t ow = 0; ow < xdims[3]; ow++) {
+        const dim_t xW = ow * xst[3];
+        const dim_t oW = (ow + offset[3]) * ost[3];
+
+        for (dim_t oz = 0; oz < xdims[2]; oz++) {
+            const dim_t xZW = xW + oz * xst[2];
+            const dim_t oZW = oW + (oz + offset[2]) * ost[2];
+
+            for (dim_t oy = 0; oy < xdims[1]; oy++) {
+                const dim_t xYZW = xZW + oy * xst[1];
+                const dim_t oYZW = oZW + (oy + offset[1]) * ost[1];
+
+                memcpy(out + oYZW + offset[0], X + xYZW, xdims[0] * sizeof(T));
+            }
+        }
+    }
+}
+
+template<typename T>
+void join(const int dim, Param<T> out, const std::vector<CParam<T>> inputs,
+          int n_arrays) {
+    af::dim4 zero(0, 0, 0, 0);
+    af::dim4 d = zero;
+    join_append<T>(out.get(), inputs[0].get(), zero, inputs[0].dims(),
+                   out.strides(), inputs[0].strides());
+    for (int i = 1; i < n_arrays; i++) {
+        d += inputs[i - 1].dims();
+        join_append<T>(out.get(), inputs[i].get(), calcOffset(d, dim),
+                       inputs[i].dims(), out.strides(), inputs[i].strides());
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/lookup.hpp b/src/backend/cpu/kernel/lookup.hpp
new file mode 100644
index 0000000000..f968e48ff8
--- /dev/null
+++ b/src/backend/cpu/kernel/lookup.hpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename InT, typename IndexT>
+void lookup(Param<InT> out, CParam<InT> input, CParam<IndexT> indices,
+            unsigned const dim) {
+    const af::dim4 iDims    = input.dims();
+    const af::dim4 oDims    = out.dims();
+    const af::dim4 iStrides = input.strides();
+    const af::dim4 oStrides = out.strides();
+    const InT *inPtr        = input.get();
+    const IndexT *idxPtr    = indices.get();
+
+    InT *outPtr = out.get();
+
+    for (dim_t l = 0; l < oDims[3]; ++l) {
+        dim_t iLOff = iStrides[3] *
+                      (dim == 3 ? trimIndex((dim_t)idxPtr[l], iDims[3]) : l);
+        dim_t oLOff = l * oStrides[3];
+
+        for (dim_t k = 0; k < oDims[2]; ++k) {
+            dim_t iKOff =
+                iStrides[2] *
+                (dim == 2 ? trimIndex((dim_t)idxPtr[k], iDims[2]) : k);
+            dim_t oKOff = k * oStrides[2];
+
+            for (dim_t j = 0; j < oDims[1]; ++j) {
+                dim_t iJOff =
+                    iStrides[1] *
+                    (dim == 1 ? trimIndex((dim_t)idxPtr[j], iDims[1]) : j);
+                dim_t oJOff = j * oStrides[1];
+
+                for (dim_t i = 0; i < oDims[0]; ++i) {
+                    dim_t iIOff =
+                        iStrides[0] *
+                        (dim == 0 ? trimIndex((dim_t)idxPtr[i], iDims[0]) : i);
+                    dim_t oIOff = i * oStrides[0];
+
+                    outPtr[oLOff + oKOff + oJOff + oIOff] =
+                        inPtr[iLOff + iKOff + iJOff + iIOff];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/lu.hpp b/src/backend/cpu/kernel/lu.hpp
new file mode 100644
index 0000000000..170289919c
--- /dev/null
+++ b/src/backend/cpu/kernel/lu.hpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void lu_split(Param<T> lower, Param<T> upper, CParam<T> in) {
+    T *l       = lower.get();
+    T *u       = upper.get();
+    const T *i = in.get();
+
+    af::dim4 ldm = lower.dims();
+    af::dim4 udm = upper.dims();
+    af::dim4 idm = in.dims();
+    af::dim4 lst = lower.strides();
+    af::dim4 ust = upper.strides();
+    af::dim4 ist = in.strides();
+
+    for (dim_t ow = 0; ow < idm[3]; ow++) {
+        const dim_t lW = ow * lst[3];
+        const dim_t uW = ow * ust[3];
+        const dim_t iW = ow * ist[3];
+
+        for (dim_t oz = 0; oz < idm[2]; oz++) {
+            const dim_t lZW = lW + oz * lst[2];
+            const dim_t uZW = uW + oz * ust[2];
+            const dim_t iZW = iW + oz * ist[2];
+
+            for (dim_t oy = 0; oy < idm[1]; oy++) {
+                const dim_t lYZW = lZW + oy * lst[1];
+                const dim_t uYZW = uZW + oy * ust[1];
+                const dim_t iYZW = iZW + oy * ist[1];
+
+                for (dim_t ox = 0; ox < idm[0]; ox++) {
+                    const dim_t lMem = lYZW + ox;
+                    const dim_t uMem = uYZW + ox;
+                    const dim_t iMem = iYZW + ox;
+                    if (ox > oy) {
+                        if (oy < ldm[1]) l[lMem] = i[iMem];
+                        if (ox < udm[0]) u[uMem] = scalar<T>(0);
+                    } else if (oy > ox) {
+                        if (oy < ldm[1]) l[lMem] = scalar<T>(0);
+                        if (ox < udm[0]) u[uMem] = i[iMem];
+                    } else if (ox == oy) {
+                        if (oy < ldm[1]) l[lMem] = scalar<T>(1.0);
+                        if (ox < udm[0]) u[uMem] = i[iMem];
+                    }
+                }
+            }
+        }
+    }
+}
+
+inline void convertPivot(Param<int> p, Param<int> pivot) {
+    int *d_pi = pivot.get();
+    int *d_po = p.get();
+    dim_t d0  = pivot.dims(0);
+    for (int j = 0; j < (int)d0; j++) {
+        // 1 indexed in pivot
+        std::swap(d_po[j], d_po[d_pi[j] - 1]);
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/match_template.hpp b/src/backend/cpu/kernel/match_template.hpp
new file mode 100644
index 0000000000..bed6ef5354
--- /dev/null
+++ b/src/backend/cpu/kernel/match_template.hpp
@@ -0,0 +1,144 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename OutT, typename InT, af::matchType MatchType>
+void matchTemplate(Param<OutT> out, CParam<InT> sImg, CParam<InT> tImg) {
+    constexpr bool needMean = MatchType == AF_ZSAD || MatchType == AF_LSAD ||
+                              MatchType == AF_ZSSD || MatchType == AF_LSSD ||
+                              MatchType == AF_ZNCC;
+
+    const af::dim4 sDims    = sImg.dims();
+    const af::dim4 tDims    = tImg.dims();
+    const af::dim4 sStrides = sImg.strides();
+    const af::dim4 tStrides = tImg.strides();
+
+    const dim_t tDim0 = tDims[0];
+    const dim_t tDim1 = tDims[1];
+    const dim_t sDim0 = sDims[0];
+    const dim_t sDim1 = sDims[1];
+
+    const af::dim4 oStrides = out.strides();
+
+    OutT tImgMean        = OutT(0);
+    dim_t winNumElements = tImg.dims().elements();
+    const InT* tpl       = tImg.get();
+
+    if (needMean) {
+        for (dim_t tj = 0; tj < tDim1; tj++) {
+            dim_t tjStride = tj * tStrides[1];
+
+            for (dim_t ti = 0; ti < tDim0; ti++) {
+                tImgMean += (OutT)tpl[tjStride + ti * tStrides[0]];
+            }
+        }
+        tImgMean /= winNumElements;
+    }
+
+    OutT* dst      = out.get();
+    const InT* src = sImg.get();
+
+    for (dim_t b3 = 0; b3 < sDims[3]; ++b3) {
+        for (dim_t b2 = 0; b2 < sDims[2]; ++b2) {
+            // slide through image window after window
+            for (dim_t sj = 0; sj < sDim1; sj++) {
+                dim_t ojStride = sj * oStrides[1];
+
+                for (dim_t si = 0; si < sDim0; si++) {
+                    OutT disparity = OutT(0);
+
+                    // mean for window
+                    // this variable will be used based on MatchType value
+                    OutT wImgMean = OutT(0);
+                    if (needMean) {
+                        for (dim_t tj = 0, j = sj; tj < tDim1; tj++, j++) {
+                            dim_t jStride = j * sStrides[1];
+
+                            for (dim_t ti = 0, i = si; ti < tDim0; ti++, i++) {
+                                InT sVal = ((j < sDim1 && i < sDim0)
+                                                ? src[jStride + i * sStrides[0]]
+                                                : InT(0));
+                                wImgMean += (OutT)sVal;
+                            }
+                        }
+                        wImgMean /= winNumElements;
+                    }
+
+                    // run the window match metric
+                    for (dim_t tj = 0, j = sj; tj < tDim1; tj++, j++) {
+                        dim_t jStride  = j * sStrides[1];
+                        dim_t tjStride = tj * tStrides[1];
+
+                        for (dim_t ti = 0, i = si; ti < tDim0; ti++, i++) {
+                            InT sVal = ((j < sDim1 && i < sDim0)
+                                            ? src[jStride + i * sStrides[0]]
+                                            : InT(0));
+                            InT tVal = tpl[tjStride + ti * tStrides[0]];
+                            OutT temp;
+                            switch (MatchType) {
+                                case AF_SAD:
+                                    disparity += fabs((OutT)sVal - (OutT)tVal);
+                                    break;
+                                case AF_ZSAD:
+                                    disparity += fabs((OutT)sVal - wImgMean -
+                                                      (OutT)tVal + tImgMean);
+                                    break;
+                                case AF_LSAD:
+                                    disparity +=
+                                        fabs((OutT)sVal -
+                                             (wImgMean / tImgMean) * tVal);
+                                    break;
+                                case AF_SSD:
+                                    disparity += ((OutT)sVal - (OutT)tVal) *
+                                                 ((OutT)sVal - (OutT)tVal);
+                                    break;
+                                case AF_ZSSD:
+                                    temp = ((OutT)sVal - wImgMean - (OutT)tVal +
+                                            tImgMean);
+                                    disparity += temp * temp;
+                                    break;
+                                case AF_LSSD:
+                                    temp = ((OutT)sVal -
+                                            (wImgMean / tImgMean) * tVal);
+                                    disparity += temp * temp;
+                                    break;
+                                case AF_NCC:
+                                    // TODO: future implementation
+                                    break;
+                                case AF_ZNCC:
+                                    // TODO: future implementation
+                                    break;
+                                case AF_SHD:
+                                    // TODO: future implementation
+                                    break;
+                            }
+                        }
+                    }
+                    // output was just allocated, so its dim-0 stride is
+                    // assumed to be 1 and the multiplication is skipped
+                    dst[ojStride + si] = disparity;
+                }
+            }
+            src += sStrides[2];
+            dst += oStrides[2];
+        }
+        src += sStrides[3];
+        dst += oStrides[3];
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/mean.hpp b/src/backend/cpu/kernel/mean.hpp
new file mode 100644
index 0000000000..c15773687e
--- /dev/null
+++ b/src/backend/cpu/kernel/mean.hpp
@@ -0,0 +1,127 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <common/Transform.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename Ti, typename To, typename Tw>
+struct MeanOp {
+    common::Transform<Ti, To, af_add_t> transform;
+    To runningMean;
+    Tw runningCount;
+    MeanOp(Ti mean, Tw count)
+        : transform(), runningMean(transform(mean)), runningCount(count) {}
+
+    /// Prevents some compiler flags, specifically -march=native, from
+    /// optimizing the mean calculation.
+    [[gnu::optimize("01")]] void operator()(Ti _newMean, Tw newCount) {
+        To newMean = transform(_newMean);
+        if ((newCount != 0) || (runningCount != 0)) {
+            Tw runningScale = runningCount;
+            Tw newScale     = newCount;
+            runningCount += newCount;
+            runningScale = runningScale / runningCount;
+            newScale     = newScale / runningCount;
+            runningMean  = (runningScale * runningMean) + (newScale * newMean);
+        }
+    }
+};
+
+template<typename T, typename Tw, int D>
+struct mean_weighted_dim {
+    void operator()(Param<T> output, const dim_t outOffset,
+                    const CParam<T> input, const dim_t inOffset,
+                    const CParam<Tw> weight, const dim_t wtOffset,
+                    const int dim) {
+        const af::dim4 odims    = output.dims();
+        const af::dim4 ostrides = output.strides();
+        const af::dim4 istrides = input.strides();
+        const af::dim4 wstrides = weight.strides();
+        const int D1            = D - 1;
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            mean_weighted_dim<T, Tw, D1>()(output, outOffset + i * ostrides[D1],
+                                           input, inOffset + i * istrides[D1],
+                                           weight, wtOffset + i * wstrides[D1],
+                                           dim);
+        }
+    }
+};
+
+template<typename T, typename Tw>
+struct mean_weighted_dim<T, Tw, 0> {
+    void operator()(Param<T> output, const dim_t outOffset,
+                    const CParam<T> input, const dim_t inOffset,
+                    const CParam<Tw> weight, const dim_t wtOffset,
+                    const int dim) {
+        const af::dim4 idims    = input.dims();
+        const af::dim4 istrides = input.strides();
+        const af::dim4 wstrides = weight.strides();
+
+        T const* const in  = input.get();
+        Tw const* const wt = weight.get();
+        T* out             = output.get();
+
+        dim_t istride = istrides[dim];
+        dim_t wstride = wstrides[dim];
+        MeanOp<compute_t<T>, compute_t<T>, compute_t<Tw>> Op(0, 0);
+        for (dim_t i = 0; i < idims[dim]; i++) {
+            Op(compute_t<T>(in[inOffset + i * istride]),
+               compute_t<Tw>(wt[wtOffset + i * wstride]));
+        }
+
+        out[outOffset] = Op.runningMean;
+    }
+};
+
+template<typename Ti, typename Tw, typename To, int D>
+struct mean_dim {
+    void operator()(Param<To> output, const dim_t outOffset,
+                    const CParam<Ti> input, const dim_t inOffset,
+                    const int dim) {
+        const af::dim4 odims    = output.dims();
+        const af::dim4 ostrides = output.strides();
+        const af::dim4 istrides = input.strides();
+        const int D1            = D - 1;
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            mean_dim<Ti, Tw, To, D1>()(output, outOffset + i * ostrides[D1],
+                                       input, inOffset + i * istrides[D1], dim);
+        }
+    }
+};
+
+template<typename Ti, typename Tw, typename To>
+struct mean_dim<Ti, Tw, To, 0> {
+    void operator()(Param<To> output, const dim_t outOffset,
+                    const CParam<Ti> input, const dim_t inOffset,
+                    const int dim) {
+        const af::dim4 idims    = input.dims();
+        const af::dim4 istrides = input.strides();
+
+        Ti const* const in = input.get();
+        To* out            = output.get();
+
+        dim_t istride = istrides[dim];
+        dim_t end     = inOffset + idims[dim] * istride;
+        MeanOp<compute_t<Ti>, compute_t<To>, compute_t<Tw>> Op(0, 0);
+        for (dim_t i = inOffset; i < end; i += istride) {
+            Op(compute_t<Ti>(in[i]), 1);
+        }
+
+        out[outOffset] = Op.runningMean;
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
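Review note: the `mean_dim` and `mean_weighted_dim` functors above feed one value at a time into `MeanOp`, whose definition is not part of this hunk. The sketch below is only an assumption about how such an accumulator can work: it uses a standard incremental weighted mean. `RunningMean` and `stridedMean` are hypothetical names, not ArrayFire code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a running weighted-mean accumulator.
// After each call, `mean` is the weighted mean of all values seen so far.
struct RunningMean {
    double mean   = 0.0;
    double weight = 0.0;

    void operator()(double value, double w) {
        assert(w >= 0.0);
        weight += w;
        // Incremental update; skipped while the total weight is still zero.
        if (weight > 0.0) { mean += (value - mean) * (w / weight); }
    }
};

// Unweighted mean of `count` elements starting at `offset`, `stride` apart,
// mirroring the innermost loop of the D == 0 specialization (weight of 1).
double stridedMean(const std::vector<double>& data, std::size_t offset,
                   std::size_t count, std::size_t stride) {
    RunningMean op;
    for (std::size_t i = 0; i < count; ++i) {
        op(data[offset + i * stride], 1.0);
    }
    return op.mean;
}
```

An incremental update avoids accumulating one large sum before dividing, which is one plausible reason to structure the reduction this way.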
diff --git a/src/backend/cpu/kernel/meanshift.hpp b/src/backend/cpu/kernel/meanshift.hpp
new file mode 100644
index 0000000000..490fb93af6
--- /dev/null
+++ b/src/backend/cpu/kernel/meanshift.hpp
@@ -0,0 +1,147 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+#include <algorithm>
+#include <array>
+#include <cmath>
+#include <cstdlib>
+#include <type_traits>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+template<typename T, bool IsColor>
+void meanShift(Param<T> out, CParam<T> in, const float spatialSigma,
+               const float chromaticSigma, const unsigned numIterations) {
+    typedef typename std::conditional<std::is_same<T, double>::value, double,
+                                      float>::type AccType;
+
+    const af::dim4 dims     = in.dims();
+    const af::dim4 istrides = in.strides();
+    const af::dim4 ostrides = out.strides();
+    const unsigned bCount   = (IsColor ? 1 : dims[2]);
+    const unsigned channels = (IsColor ? dims[2] : 1);
+    const dim_t radius      = std::max((int)(spatialSigma * 1.5f), 1);
+    const AccType cvar      = chromaticSigma * chromaticSigma;
+
+    std::array<AccType, 4> currentCenterColors{{0}};
+    std::array<AccType, 4> currentMeanColors{{0}};
+    std::array<AccType, 4> tempColors{{0}};
+    for (dim_t b3 = 0; b3 < dims[3]; ++b3) {
+        for (unsigned b2 = 0; b2 < bCount; ++b2) {
+            T* outData      = out.get() + b2 * ostrides[2] + b3 * ostrides[3];
+            const T* inData = in.get() + b2 * istrides[2] + b3 * istrides[3];
+
+            for (dim_t j = 0; j < dims[1]; ++j) {
+                dim_t j_in_off  = j * istrides[1];
+                dim_t j_out_off = j * ostrides[1];
+
+                for (dim_t i = 0; i < dims[0]; ++i) {
+                    dim_t i_in_off  = i * istrides[0];
+                    dim_t i_out_off = i * ostrides[0];
+
+                    for (unsigned ch = 0; ch < channels; ++ch)
+                        currentCenterColors[ch] = static_cast<AccType>(
+                            inData[j_in_off + i_in_off + ch * istrides[2]]);
+
+                    int meanPosJ = j;
+                    int meanPosI = i;
+
+                    // scope of meanshift iterations begins
+                    for (unsigned it = 0; it < numIterations; ++it) {
+                        int oldMeanPosJ = meanPosJ;
+                        int oldMeanPosI = meanPosI;
+                        unsigned count  = 0;
+                        int shift_y     = 0;
+                        int shift_x     = 0;
+
+                        currentMeanColors.fill(0);
+                        // Windowing operation
+                        for (dim_t wj = -radius; wj <= radius; ++wj) {
+                            int hit_count = 0;
+                            dim_t tj      = meanPosJ + wj;
+                            if (tj < 0 || tj > dims[1] - 1) continue;
+
+                            dim_t tjstride = tj * istrides[1];
+
+                            for (dim_t wi = -radius; wi <= radius; ++wi) {
+                                dim_t ti = meanPosI + wi;
+                                if (ti < 0 || ti > dims[0] - 1) continue;
+
+                                dim_t tistride = ti * istrides[0];
+
+                                AccType norm = 0;
+                                for (unsigned ch = 0; ch < channels; ++ch) {
+                                    tempColors[ch] = static_cast<AccType>(
+                                        inData[tistride + tjstride +
+                                               ch * istrides[2]]);
+                                    AccType diff = currentCenterColors[ch] -
+                                                   tempColors[ch];
+                                    norm += (diff * diff);
+                                }
+                                if (norm <= cvar) {
+                                    for (unsigned ch = 0; ch < channels; ++ch)
+                                        currentMeanColors[ch] += tempColors[ch];
+
+                                    shift_x += ti;
+                                    ++hit_count;
+                                }
+                            }
+                            count += hit_count;
+                            shift_y += tj * hit_count;
+                        }
+
+                        if (count == 0) break;
+
+                        const AccType fcount = 1 / static_cast<AccType>(count);
+
+                        meanPosJ =
+                            static_cast<int>(std::trunc(shift_y * fcount));
+                        meanPosI =
+                            static_cast<int>(std::trunc(shift_x * fcount));
+
+                        for (unsigned ch = 0; ch < channels; ++ch)
+                            currentMeanColors[ch] =
+                                std::trunc(currentMeanColors[ch] * fcount);
+
+                        AccType norm = 0;
+                        for (unsigned ch = 0; ch < channels; ++ch) {
+                            AccType diff =
+                                currentMeanColors[ch] - currentCenterColors[ch];
+                            norm += (diff * diff);
+                        }
+
+                        // stop the process if mean converged or within given
+                        // tolerance range
+                        bool stop = (meanPosJ == oldMeanPosJ &&
+                                     oldMeanPosI == meanPosI) ||
+                                    ((abs(oldMeanPosJ - meanPosJ) +
+                                      abs(oldMeanPosI - meanPosI) + norm) <= 1);
+
+                        for (unsigned ch = 0; ch < channels; ++ch)
+                            currentCenterColors[ch] = currentMeanColors[ch];
+
+                        if (stop) break;
+                    }  // scope of meanshift iterations ends
+
+                    for (dim_t ch = 0; ch < channels; ++ch)
+                        outData[j_out_off + i_out_off + ch * ostrides[2]] =
+                            static_cast<T>(currentCenterColors[ch]);
+                }
+            }
+        }
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/medfilt.hpp b/src/backend/cpu/kernel/medfilt.hpp
new file mode 100644
index 0000000000..cd998adf05
--- /dev/null
+++ b/src/backend/cpu/kernel/medfilt.hpp
@@ -0,0 +1,206 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+#include <algorithm>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, af::borderType Pad>
+void medfilt1(Param<T> out, CParam<T> in, dim_t w_wid) {
+    constexpr bool IsValidPadType = (Pad == AF_PAD_ZERO || Pad == AF_PAD_SYM);
+
+    const af::dim4 dims     = in.dims();
+    const af::dim4 istrides = in.strides();
+    const af::dim4 ostrides = out.strides();
+
+    std::vector<T> wind_vals;
+    wind_vals.reserve(w_wid);
+
+    for (int b3 = 0; b3 < (int)dims[3]; b3++) {
+        T const* in_ptr = in.get() + b3 * istrides[3];
+        T* out_ptr      = out.get() + b3 * ostrides[3];
+
+        for (int b2 = 0; b2 < (int)dims[2]; b2++) {
+            for (int col = 0; col < (int)dims[1]; col++) {
+                int ocol_off = col * ostrides[1];
+
+                for (int row = 0; row < (int)dims[0]; row++) {
+                    wind_vals.clear();
+                    for (int wi = 0; wi < (int)w_wid; ++wi) {
+                        int im_row = row + wi - w_wid / 2;
+                        int im_roff;
+                        switch (Pad) {
+                            case AF_PAD_ZERO:
+                                im_roff = im_row * istrides[0];
+                                if (im_row < 0 || im_row >= (int)dims[0])
+                                    wind_vals.push_back(0);
+                                else
+                                    wind_vals.push_back(in_ptr[im_roff]);
+                                break;
+                            case AF_PAD_SYM: {
+                                if (im_row < 0) { im_row *= -1; }
+
+                                if (im_row >= (int)dims[0]) {
+                                    im_row = 2 * ((int)dims[0] - 1) - im_row;
+                                }
+
+                                im_roff = im_row * istrides[0];
+                                wind_vals.push_back(in_ptr[im_roff]);
+                            } break;
+                            default:
+                                static_assert(IsValidPadType,
+                                              "Unsupported padding type");
+                                break;
+                        }
+                    }
+
+                    int off = wind_vals.size() / 2;
+                    std::stable_sort(wind_vals.begin(), wind_vals.end());
+                    if (wind_vals.size() % 2 == 0)
+                        out_ptr[ocol_off + row * ostrides[0]] =
+                            (wind_vals[off] + wind_vals[off - 1]) / 2;
+                    else {
+                        out_ptr[ocol_off + row * ostrides[0]] = wind_vals[off];
+                    }
+                }
+            }
+            in_ptr += istrides[2];
+            out_ptr += ostrides[2];
+        }
+    }
+}
+
+template<typename T, af::borderType Pad>
+void medfilt2(Param<T> out, CParam<T> in, dim_t w_len, dim_t w_wid) {
+    constexpr bool IsValidPadType = (Pad == AF_PAD_ZERO || Pad == AF_PAD_SYM);
+
+    const af::dim4 dims     = in.dims();
+    const af::dim4 istrides = in.strides();
+    const af::dim4 ostrides = out.strides();
+
+    std::vector<T> wind_vals;
+    wind_vals.reserve(w_len * w_wid);
+
+    for (int b3 = 0; b3 < (int)dims[3]; b3++) {
+        T const* in_ptr = in.get() + b3 * istrides[3];
+        T* out_ptr      = out.get() + b3 * ostrides[3];
+
+        for (int b2 = 0; b2 < (int)dims[2]; b2++) {
+            for (int col = 0; col < (int)dims[1]; col++) {
+                int ocol_off = col * ostrides[1];
+
+                for (int row = 0; row < (int)dims[0]; row++) {
+                    wind_vals.clear();
+
+                    for (int wj = 0; wj < (int)w_wid; ++wj) {
+                        bool isColOff = false;
+
+                        int im_col  = col + wj - w_wid / 2;
+                        int im_coff = 0;
+                        switch (Pad) {
+                            case AF_PAD_ZERO:
+                                im_coff = im_col * istrides[1];
+                                if (im_col < 0 || im_col >= (int)dims[1])
+                                    isColOff = true;
+                                break;
+                            case AF_PAD_SYM: {
+                                if (im_col < 0) {
+                                    im_col *= -1;
+                                    isColOff = true;
+                                }
+
+                                if (im_col >= (int)dims[1]) {
+                                    im_col   = 2 * ((int)dims[1] - 1) - im_col;
+                                    isColOff = true;
+                                }
+
+                                im_coff = im_col * istrides[1];
+                            } break;
+                            default:
+                                static_assert(IsValidPadType,
+                                              "Unsupported padding type");
+                                break;
+                        }
+
+                        for (int wi = 0; wi < (int)w_len; ++wi) {
+                            bool isRowOff = false;
+
+                            int im_row  = row + wi - w_len / 2;
+                            int im_roff = 0;
+                            switch (Pad) {
+                                case AF_PAD_ZERO:
+                                    im_roff = im_row * istrides[0];
+                                    if (im_row < 0 || im_row >= (int)dims[0])
+                                        isRowOff = true;
+                                    break;
+                                case AF_PAD_SYM: {
+                                    if (im_row < 0) {
+                                        im_row *= -1;
+                                        isRowOff = true;
+                                    }
+
+                                    if (im_row >= (int)dims[0]) {
+                                        im_row =
+                                            2 * ((int)dims[0] - 1) - im_row;
+                                        isRowOff = true;
+                                    }
+
+                                    im_roff = im_row * istrides[0];
+                                } break;
+                                default:
+                                    static_assert(IsValidPadType,
+                                                  "Unsupported padding type");
+                                    break;
+                            }
+
+                            if (isRowOff || isColOff) {
+                                switch (Pad) {
+                                    case AF_PAD_ZERO:
+                                        wind_vals.push_back(0);
+                                        break;
+                                    case AF_PAD_SYM:
+                                        wind_vals.push_back(
+                                            in_ptr[im_coff + im_roff]);
+                                        break;
+                                    default:
+                                        static_assert(
+                                            IsValidPadType,
+                                            "Unsupported padding type");
+                                        break;
+                                }
+                            } else
+                                wind_vals.push_back(in_ptr[im_coff + im_roff]);
+                        }
+                    }
+
+                    std::stable_sort(wind_vals.begin(), wind_vals.end());
+                    int off = wind_vals.size() / 2;
+                    if (wind_vals.size() % 2 == 0)
+                        out_ptr[ocol_off + row * ostrides[0]] =
+                            (wind_vals[off] + wind_vals[off - 1]) / 2;
+                    else
+                        out_ptr[ocol_off + row * ostrides[0]] = wind_vals[off];
+                }
+            }
+            in_ptr += istrides[2];
+            out_ptr += ostrides[2];
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
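Review note: the `AF_PAD_SYM` branches in `medfilt1` and `medfilt2` mirror an out-of-range index about the nearest edge without repeating the edge sample. The sketch below isolates that rule on a contiguous 1D signal. `reflectIndex` and `medianFilter1D` are hypothetical helper names. As in the kernel, a single reflection is assumed to be enough, so the sketch requires the window to be no wider than the signal.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Mirror an out-of-range index back into [0, n) about the edges, using the
// same arithmetic as the AF_PAD_SYM branch. Assumes the index overshoots
// either edge by less than n.
inline int reflectIndex(int idx, int n) {
    assert(n > 0);
    if (idx < 0) { idx = -idx; }
    if (idx >= n) { idx = 2 * (n - 1) - idx; }
    return idx;
}

// 1D median filter with symmetric padding over a contiguous signal.
std::vector<float> medianFilter1D(const std::vector<float>& in, int width) {
    const int n = static_cast<int>(in.size());
    assert(width > 0 && width <= n);

    std::vector<float> out(in.size());
    std::vector<float> window;
    window.reserve(width);

    for (int row = 0; row < n; ++row) {
        window.clear();
        for (int wi = 0; wi < width; ++wi) {
            window.push_back(in[reflectIndex(row + wi - width / 2, n)]);
        }
        std::sort(window.begin(), window.end());
        const int mid = width / 2;
        // Even-sized windows average the two middle samples.
        out[row] = (width % 2 == 0) ? (window[mid] + window[mid - 1]) / 2
                                    : window[mid];
    }
    return out;
}
```

The sketch uses `std::sort` rather than the kernel's `std::stable_sort`; for plain numeric values the resulting median is the same.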
diff --git a/src/backend/cpu/kernel/moments.hpp b/src/backend/cpu/kernel/moments.hpp
new file mode 100644
index 0000000000..0f3e6611eb
--- /dev/null
+++ b/src/backend/cpu/kernel/moments.hpp
@@ -0,0 +1,62 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+#include <utility.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void moments(Param<float> output, CParam<T> input, af_moment_type moment) {
+    T const *const in       = input.get();
+    af::dim4 const idims    = input.dims();
+    af::dim4 const istrides = input.strides();
+    af::dim4 const ostrides = output.strides();
+
+    float *out = output.get();
+
+    for (dim_t w = 0; w < idims[3]; w++) {
+        for (dim_t z = 0; z < idims[2]; z++) {
+            dim_t out_off = w * ostrides[3] + z * ostrides[2];
+            for (dim_t y = 0; y < idims[1]; y++) {
+                dim_t in_off =
+                    y * istrides[1] + z * istrides[2] + w * istrides[3];
+                for (dim_t x = 0; x < idims[0]; x++) {
+                    dim_t m_off = 0;
+                    float val   = in[in_off + x];
+                    if ((moment & AF_MOMENT_M00) > 0) {
+                        out[out_off + m_off] += val;
+                        m_off++;
+                    }
+                    if ((moment & AF_MOMENT_M01) > 0) {
+                        out[out_off + m_off] += x * val;
+                        m_off++;
+                    }
+                    if ((moment & AF_MOMENT_M10) > 0) {
+                        out[out_off + m_off] += y * val;
+                        m_off++;
+                    }
+                    if ((moment & AF_MOMENT_M11) > 0) {
+                        out[out_off + m_off] += x * y * val;
+                        m_off++;
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
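Review note: `moments` accumulates up to four raw sums per 2D slice, in the order M00, M01, M10, M11, where this kernel weights by `x` (first dimension) for M01 and by `y` (second dimension) for M10. The sketch below computes all four for a single contiguous column-major slice. `rawMoments` is a hypothetical name, and the contiguous layout is an assumption of the sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns {sum(v), sum(x*v), sum(y*v), sum(x*y*v)} for a d0 x d1 slice,
// where x runs along the first (fastest-varying) dimension.
std::array<float, 4> rawMoments(const std::vector<float>& img, std::size_t d0,
                                std::size_t d1) {
    assert(img.size() == d0 * d1);
    std::array<float, 4> m{{0.f, 0.f, 0.f, 0.f}};
    for (std::size_t y = 0; y < d1; ++y) {
        for (std::size_t x = 0; x < d0; ++x) {
            const float v = img[y * d0 + x];
            m[0] += v;
            m[1] += x * v;
            m[2] += y * v;
            m[3] += x * y * v;
        }
    }
    return m;
}
```

With these sums, the intensity centroid of the slice is `(m[1] / m[0], m[2] / m[0])` in `(x, y)` order, provided `m[0]` is non-zero.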
diff --git a/src/backend/cpu/kernel/morph.hpp b/src/backend/cpu/kernel/morph.hpp
new file mode 100644
index 0000000000..563420e57f
--- /dev/null
+++ b/src/backend/cpu/kernel/morph.hpp
@@ -0,0 +1,149 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <utility.hpp>
+#include <algorithm>
+#include <limits>
+#include <vector>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+template<typename T>
+void getOffsets(std::vector<dim_t>& offsets, const af::dim4& strides,
+                const CParam<T>& mask) {
+    const af::dim4 fstrides = mask.strides();
+    const T* filter         = mask.get();
+    const dim_t dim0 = mask.dims()[0], dim1 = mask.dims()[1];
+    const dim_t R0 = dim0 / 2;
+    const dim_t R1 = dim1 / 2;
+
+    offsets.reserve(mask.dims().elements());
+    for (dim_t j = 0; j < dim1; ++j) {
+        for (dim_t i = 0; i < dim0; ++i) {
+            if (filter[getIdx(fstrides, i, j)] > (T)0) {
+                dim_t offset = (j - R1) * strides[1] + (i - R0) * strides[0];
+                offsets.push_back(offset);
+            }
+        }
+    }
+}
+
+template<typename T, bool IsDilation>
+struct MorphFilterOp {
+    T operator()(const T& a, const T& b) {
+        return IsDilation ? std::max(a, b) : std::min(a, b);
+    }
+};
+
+template<typename T, bool IsDilation>
+void morph(Param<T> paddedOut, CParam<T> paddedIn, CParam<T> mask) {
+    MorphFilterOp<T, IsDilation> filterOp;
+    T init = IsDilation ? common::Binary<T, af_max_t>::init()
+                        : common::Binary<T, af_min_t>::init();
+
+    const af::dim4 ostrides = paddedOut.strides();
+    T* outData              = paddedOut.get();
+    const af::dim4 istrides = paddedIn.strides();
+    const af::dim4 dims     = paddedIn.dims();
+    const T* inData         = paddedIn.get();
+
+    std::vector<dim_t> offsets;
+    getOffsets(offsets, istrides, mask);
+
+    const dim_t batchSize = dims[0] * dims[1];
+    const int batchCount  = dims[2] * dims[3];
+    for (int b = 0; b < batchCount; ++b) {
+        for (dim_t n = 0; n < batchSize; ++n) {
+            T filterResult = init;
+            for (size_t oi = 0; oi < offsets.size(); ++oi) {
+                dim_t x = n + offsets[oi];
+                if (x >= 0 && x < batchSize)
+                    filterResult = filterOp(filterResult, inData[x]);
+            }
+            outData[n] = filterResult;
+        }
+        outData += ostrides[2];
+        inData += istrides[2];
+    }
+}
+
+template<typename T, bool IsDilation>
+void morph3d(Param<T> out, CParam<T> in, CParam<T> mask) {
+    const af::dim4 dims     = in.dims();
+    const af::dim4 window   = mask.dims();
+    const dim_t R0          = window[0] / 2;
+    const dim_t R1          = window[1] / 2;
+    const dim_t R2          = window[2] / 2;
+    const af::dim4 istrides = in.strides();
+    const af::dim4 fstrides = mask.strides();
+    const dim_t bCount      = dims[3];
+    const af::dim4 ostrides = out.strides();
+    T* outData              = out.get();
+    const T* inData         = in.get();
+    const T* filter         = mask.get();
+
+    T init = IsDilation ? common::Binary<T, af_max_t>::init()
+                        : common::Binary<T, af_min_t>::init();
+
+    for (dim_t batchId = 0; batchId < bCount; ++batchId) {
+        // either channels or batch is handled by the outermost loop
+        for (dim_t k = 0; k < dims[2]; ++k) {
+            // k steps along 3rd dimension
+            for (dim_t j = 0; j < dims[1]; ++j) {
+                // j steps along 2nd dimension
+                for (dim_t i = 0; i < dims[0]; ++i) {
+                    // i steps along 1st dimension
+                    T filterResult = init;
+
+                    // wk, wj, wi step along the 3rd, 2nd & 1st dimensions of
+                    // the filter window respectively
+                    for (dim_t wk = 0; wk < window[2]; wk++) {
+                        for (dim_t wj = 0; wj < window[1]; wj++) {
+                            for (dim_t wi = 0; wi < window[0]; wi++) {
+                                dim_t offk = k + wk - R2;
+                                dim_t offj = j + wj - R1;
+                                dim_t offi = i + wi - R0;
+
+                                T maskValue =
+                                    filter[getIdx(fstrides, wi, wj, wk)];
+
+                                if ((maskValue > (T)0) && offi >= 0 &&
+                                    offj >= 0 && offk >= 0 && offi < dims[0] &&
+                                    offj < dims[1] && offk < dims[2]) {
+                                    T inValue = inData[getIdx(istrides, offi,
+                                                              offj, offk)];
+
+                                    if (IsDilation)
+                                        filterResult =
+                                            std::max(filterResult, inValue);
+                                    else
+                                        filterResult =
+                                            std::min(filterResult, inValue);
+                                }
+
+                            }  // window 1st dimension loop ends here
+                        }      // window 2nd dimension loop ends here
+                    }          // window 3rd dimension loop ends here
+
+                    outData[getIdx(ostrides, i, j, k)] = filterResult;
+                }  // 1st dimension loop ends here
+            }      // 2nd dimension loop ends here
+        }          // 3rd dimension loop ends here
+        // next iteration will be next batch if any
+        outData += ostrides[3];
+        inData += istrides[3];
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/nearest_neighbour.hpp b/src/backend/cpu/kernel/nearest_neighbour.hpp
new file mode 100644
index 0000000000..af94d03ec4
--- /dev/null
+++ b/src/backend/cpu/kernel/nearest_neighbour.hpp
@@ -0,0 +1,102 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+#if defined(_WIN32) || defined(_MSC_VER)
+
+#include <intrin.h>
+#define __builtin_popcount __popcnt
+#define __builtin_popcountll __popcnt64
+
+#endif
+
+template<typename T, typename To, af_match_type dist_type>
+struct dist_op {
+    To operator()(T v1, T v2) {
+        return v1 - v2;  // Garbage distance
+    }
+};
+
+template<typename T, typename To>
+struct dist_op<T, To, AF_SAD> {
+    To operator()(T v1, T v2) { return std::abs((double)v1 - (double)v2); }
+};
+
+template<typename T, typename To>
+struct dist_op<T, To, AF_SSD> {
+    To operator()(T v1, T v2) { return (v1 - v2) * (v1 - v2); }
+};
+
+template<typename To>
+struct dist_op<uint, To, AF_SHD> {
+    To operator()(uint v1, uint v2) { return __builtin_popcount(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<uintl, To, AF_SHD> {
+    To operator()(uintl v1, uintl v2) { return __builtin_popcountll(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<uchar, To, AF_SHD> {
+    To operator()(uchar v1, uchar v2) { return __builtin_popcount(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<ushort, To, AF_SHD> {
+    To operator()(ushort v1, ushort v2) { return __builtin_popcount(v1 ^ v2); }
+};
+
+template<typename T, typename To, af_match_type dist_type>
+void nearest_neighbour(Param<To> dists, CParam<T> query, CParam<T> train,
+                       const uint dist_dim) {
+    uint sample_dim  = (dist_dim == 0) ? 1 : 0;
+    const dim4 qDims = query.dims();
+    const dim4 tDims = train.dims();
+
+    const unsigned distLength = qDims[dist_dim];
+    const unsigned nQuery     = qDims[sample_dim];
+    const unsigned nTrain     = tDims[sample_dim];
+
+    const T* qPtr = query.get();
+    const T* tPtr = train.get();
+    To* dPtr      = dists.get();
+
+    dist_op<T, To, dist_type> op;
+
+    for (unsigned i = 0; i < nQuery; i++) {
+        for (unsigned j = 0; j < nTrain; j++) {
+            To local_dist = 0;
+            for (unsigned k = 0; k < distLength; k++) {
+                size_t qIdx, tIdx;
+                if (sample_dim == 0) {
+                    qIdx = k * qDims[0] + i;
+                    tIdx = k * tDims[0] + j;
+                } else {
+                    qIdx = i * qDims[0] + k;
+                    tIdx = j * tDims[0] + k;
+                }
+
+                local_dist += op(qPtr[qIdx], tPtr[tIdx]);
+            }
+
+            dPtr[i * nTrain + j] = local_dist;
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
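Review note: the `AF_SHD` specializations compute a Hamming distance as the population count of `v1 ^ v2`, using compiler intrinsics. The sketch below shows the same brute-force all-pairs loop with a portable popcount through `std::bitset`. `hamming` and `pairwiseHamming` are hypothetical names. The layout, where each sample is `len` contiguous words, is assumed to correspond to the `dist_dim == 0` case.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of differing bits between two 32-bit words.
inline unsigned hamming(std::uint32_t a, std::uint32_t b) {
    return static_cast<unsigned>(std::bitset<32>(a ^ b).count());
}

// All-pairs Hamming distances between descriptors of `len` words each.
// Result is laid out as dists[i * nTrain + j], as in the kernel.
std::vector<unsigned> pairwiseHamming(const std::vector<std::uint32_t>& query,
                                      const std::vector<std::uint32_t>& train,
                                      std::size_t len) {
    assert(len > 0 && query.size() % len == 0 && train.size() % len == 0);
    const std::size_t nQuery = query.size() / len;
    const std::size_t nTrain = train.size() / len;

    std::vector<unsigned> dists(nQuery * nTrain, 0u);
    for (std::size_t i = 0; i < nQuery; ++i) {
        for (std::size_t j = 0; j < nTrain; ++j) {
            unsigned d = 0;
            for (std::size_t k = 0; k < len; ++k) {
                d += hamming(query[i * len + k], train[j * len + k]);
            }
            dists[i * nTrain + j] = d;
        }
    }
    return dists;
}
```

The `std::bitset` popcount avoids the `_WIN32` macro shim at the top of the header, at the cost of relying on the standard library to lower `count()` to a hardware instruction.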
diff --git a/src/backend/cpu/kernel/orb.hpp b/src/backend/cpu/kernel/orb.hpp
new file mode 100644
index 0000000000..385f71abb6
--- /dev/null
+++ b/src/backend/cpu/kernel/orb.hpp
@@ -0,0 +1,285 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+// Reference pattern, generated for a patch size of 31x31, as suggested by
+// original ORB paper
+#define REF_PAT_SIZE 31
+#define REF_PAT_SAMPLES 256
+#define REF_PAT_COORDS 4
+#define REF_PAT_LENGTH (REF_PAT_SAMPLES * REF_PAT_COORDS)
+
+// Current reference pattern was borrowed from OpenCV, to build a pattern with
+// similar quality, a training process must be applied, as described in
+// sections 4.2 and 4.3 of the original ORB paper.
+const int ref_pat[REF_PAT_LENGTH] = {
+    8,   -3,  9,   5,   4,   2,   7,   -12, -11, 9,   -8,  2,   7,   -12, 12,
+    -13, 2,   -13, 2,   12,  1,   -7,  1,   6,   -2,  -10, -2,  -4,  -13, -13,
+    -11, -8,  -13, -3,  -12, -9,  10,  4,   11,  9,   -13, -8,  -8,  -9,  -11,
+    7,   -9,  12,  7,   7,   12,  6,   -4,  -5,  -3,  0,   -13, 2,   -12, -3,
+    -9,  0,   -7,  5,   12,  -6,  12,  -1,  -3,  6,   -2,  12,  -6,  -13, -4,
+    -8,  11,  -13, 12,  -8,  4,   7,   5,   1,   5,   -3,  10,  -3,  3,   -7,
+    6,   12,  -8,  -7,  -6,  -2,  -2,  11,  -1,  -10, -13, 12,  -8,  10,  -7,
+    3,   -5,  -3,  -4,  2,   -3,  7,   -10, -12, -6,  11,  5,   -12, 6,   -7,
+    5,   -6,  7,   -1,  1,   0,   4,   -5,  9,   11,  11,  -13, 4,   7,   4,
+    12,  2,   -1,  4,   4,   -4,  -12, -2,  7,   -8,  -5,  -7,  -10, 4,   11,
+    9,   12,  0,   -8,  1,   -13, -13, -2,  -8,  2,   -3,  -2,  -2,  3,   -6,
+    9,   -4,  -9,  8,   12,  10,  7,   0,   9,   1,   3,   7,   -5,  11,  -10,
+    -13, -6,  -11, 0,   10,  7,   12,  1,   -6,  -3,  -6,  12,  10,  -9,  12,
+    -4,  -13, 8,   -8,  -12, -13, 0,   -8,  -4,  3,   3,   7,   8,   5,   7,
+    10,  -7,  -1,  7,   1,   -12, 3,   -10, 5,   6,   2,   -4,  3,   -10, -13,
+    0,   -13, 5,   -13, -7,  -12, 12,  -13, 3,   -11, 8,   -7,  12,  -4,  7,
+    6,   -10, 12,  8,   -9,  -1,  -7,  -6,  -2,  -5,  0,   12,  -12, 5,   -7,
+    5,   3,   -10, 8,   -13, -7,  -7,  -4,  5,   -3,  -2,  -1,  -7,  2,   9,
+    5,   -11, -11, -13, -5,  -13, -1,  6,   0,   -1,  5,   -3,  5,   2,   -4,
+    -13, -4,  12,  -9,  -6,  -9,  6,   -12, -10, -8,  -4,  10,  2,   12,  -3,
+    7,   12,  12,  12,  -7,  -13, -6,  5,   -4,  9,   -3,  4,   7,   -1,  12,
+    2,   -7,  6,   -5,  1,   -13, 11,  -12, 5,   -3,  7,   -2,  -6,  7,   -8,
+    12,  -7,  -13, -7,  -11, -12, 1,   -3,  12,  12,  2,   -6,  3,   0,   -4,
+    3,   -2,  -13, -1,  -13, 1,   9,   7,   1,   8,   -6,  1,   -1,  3,   12,
+    9,   1,   12,  6,   -1,  -9,  -1,  3,   -13, -13, -10, 5,   7,   7,   10,
+    12,  12,  -5,  12,  9,   6,   3,   7,   11,  5,   -13, 6,   10,  2,   -12,
+    2,   3,   3,   8,   4,   -6,  2,   6,   12,  -13, 9,   -12, 10,  3,   -8,
+    4,   -7,  9,   -11, 12,  -4,  -6,  1,   12,  2,   -8,  6,   -9,  7,   -4,
+    2,   3,   3,   -2,  6,   3,   11,  0,   3,   -3,  8,   -8,  7,   8,   9,
+    3,   -11, -5,  -6,  -4,  -10, 11,  -5,  10,  -5,  -8,  -3,  12,  -10, 5,
+    -9,  0,   8,   -1,  12,  -6,  4,   -6,  6,   -11, -10, 12,  -8,  7,   4,
+    -2,  6,   7,   -2,  0,   -2,  12,  -5,  -8,  -5,  2,   7,   -6,  10,  12,
+    -9,  -13, -8,  -8,  -5,  -13, -5,  -2,  8,   -8,  9,   -13, -9,  -11, -9,
+    0,   1,   -8,  1,   -2,  7,   -4,  9,   1,   -2,  1,   -1,  -4,  11,  -6,
+    12,  -11, -12, -9,  -6,  4,   3,   7,   7,   12,  5,   5,   10,  8,   0,
+    -4,  2,   8,   -9,  12,  -5,  -13, 0,   7,   2,   12,  -1,  2,   1,   7,
+    5,   11,  7,   -9,  3,   5,   6,   -8,  -13, -4,  -8,  9,   -5,  9,   -3,
+    -3,  -4,  -7,  -3,  -12, 6,   5,   8,   0,   -7,  6,   -6,  12,  -13, 6,
+    -5,  -2,  1,   -10, 3,   10,  4,   1,   8,   -4,  -2,  -2,  2,   -13, 2,
+    -12, 12,  12,  -2,  -13, 0,   -6,  4,   1,   9,   3,   -6,  -10, -3,  -5,
+    -3,  -13, -1,  1,   7,   5,   12,  -11, 4,   -2,  5,   -7,  -13, 9,   -9,
+    -5,  7,   1,   8,   6,   7,   -8,  7,   6,   -7,  -4,  -7,  1,   -8,  11,
+    -7,  -8,  -13, 6,   -12, -8,  2,   4,   3,   9,   10,  -5,  12,  3,   -6,
+    -5,  -6,  7,   8,   -3,  9,   -8,  2,   -12, 2,   8,   -11, -2,  -10, 3,
+    -12, -13, -7,  -9,  -11, 0,   -10, -5,  5,   -3,  11,  8,   -2,  -13, -1,
+    12,  -1,  -8,  0,   9,   -13, -11, -12, -5,  -10, -2,  -10, 11,  -3,  9,
+    -2,  -13, 2,   -3,  3,   2,   -9,  -13, -4,  0,   -4,  6,   -3,  -10, -4,
+    12,  -2,  -7,  -6,  -11, -4,  9,   6,   -3,  6,   11,  -13, 11,  -5,  5,
+    11,  11,  12,  6,   7,   -5,  12,  -2,  -1,  12,  0,   7,   -4,  -8,  -3,
+    -2,  -7,  1,   -6,  7,   -13, -12, -8,  -13, -7,  -2,  -6,  -8,  -8,  5,
+    -6,  -9,  -5,  -1,  -4,  5,   -13, 7,   -8,  10,  1,   5,   5,   -13, 1,
+    0,   10,  -13, 9,   12,  10,  -1,  5,   -8,  10,  -9,  -1,  11,  1,   -13,
+    -9,  -3,  -6,  2,   -1,  -10, 1,   12,  -13, 1,   -8,  -10, 8,   -11, 10,
+    -6,  2,   -13, 3,   -6,  7,   -13, 12,  -9,  -10, -10, -5,  -7,  -10, -8,
+    -8,  -13, 4,   -6,  8,   5,   3,   12,  8,   -13, -4,  2,   -3,  -3,  5,
+    -13, 10,  -12, 4,   -13, 5,   -1,  -9,  9,   -4,  3,   0,   3,   3,   -9,
+    -12, 1,   -6,  1,   3,   2,   4,   -8,  -10, -10, -10, 9,   8,   -13, 12,
+    12,  -8,  -12, -6,  -5,  2,   2,   3,   7,   10,  6,   11,  -8,  6,   8,
+    8,   -12, -7,  10,  -6,  5,   -3,  -9,  -3,  9,   -1,  -13, -1,  5,   -3,
+    -7,  -3,  4,   -8,  -2,  -8,  3,   4,   2,   12,  12,  2,   -5,  3,   11,
+    6,   -9,  11,  -13, 3,   -1,  7,   12,  11,  -1,  12,  4,   -3,  0,   -3,
+    6,   4,   -11, 4,   12,  2,   -4,  2,   1,   -10, -6,  -8,  1,   -13, 7,
+    -11, 1,   -13, 12,  -11, -13, 6,   0,   11,  -13, 0,   -1,  1,   4,   -13,
+    3,   -9,  -2,  -9,  8,   -6,  -3,  -13, -6,  -8,  -2,  5,   -9,  8,   10,
+    2,   7,   3,   -9,  -1,  -6,  -1,  -1,  9,   5,   11,  -2,  11,  -3,  12,
+    -8,  3,   0,   3,   5,   -1,  4,   0,   10,  3,   -6,  4,   5,   -13, 0,
+    -10, 5,   5,   8,   12,  11,  8,   9,   9,   -6,  7,   -4,  8,   -12, -10,
+    4,   -10, 9,   7,   3,   12,  4,   9,   -7,  10,  -2,  7,   0,   12,  -2,
+    -1,  -6,  0,   -11,
+};
+
+template<typename T>
+void keep_features(float* x_out, float* y_out, float* score_out,
+                   float* size_out, const float* x_in, const float* y_in,
+                   const float* score_in, const unsigned* score_idx,
+                   const float* size_in, const unsigned n_feat) {
+    // Keep only the first n_feat features
+    for (unsigned f = 0; f < n_feat; f++) {
+        x_out[f]     = x_in[score_idx[f]];
+        y_out[f]     = y_in[score_idx[f]];
+        score_out[f] = score_in[f];
+        if (size_in != nullptr && size_out != nullptr)
+            size_out[f] = size_in[score_idx[f]];
+    }
+}
+
+template<typename T, bool use_scl>
+void harris_response(float* x_out, float* y_out, float* score_out,
+                     float* size_out, const float* x_in, const float* y_in,
+                     const float* scl_in, const unsigned total_feat,
+                     unsigned* usable_feat, CParam<T> image,
+                     const unsigned block_size, const float k_thr,
+                     const unsigned patch_size) {
+    const af::dim4 idims = image.dims();
+    const T* image_ptr   = image.get();
+    for (unsigned f = 0; f < total_feat; f++) {
+        unsigned x, y;
+        float scl = 1.f;
+        if (use_scl) {
+            // Update x and y coordinates according to scale
+            scl = scl_in[f];
+            x   = (unsigned)round(x_in[f] * scl);
+            y   = (unsigned)round(y_in[f] * scl);
+        } else {
+            x = (unsigned)round(x_in[f]);
+            y = (unsigned)round(y_in[f]);
+        }
+
+        // Round feature size to nearest odd integer
+        float size = 2.f * floor((patch_size * scl) / 2.f) + 1.f;
+
+        // Skip features whose patch might not fit inside the image. The
+        // radius is scaled by sqrt(2.f) to cover the widest case, a patch
+        // rotated by 45 degrees.
+        unsigned patch_r = ceil(size * sqrt(2.f) / 2.f);
+        if (x < patch_r || y < patch_r || x >= idims[1] - patch_r ||
+            y >= idims[0] - patch_r)
+            continue;
+
+        unsigned r = block_size / 2;
+
+        float ixx = 0.f, iyy = 0.f, ixy = 0.f;
+        unsigned block_size_sq = block_size * block_size;
+        for (unsigned k = 0; k < block_size_sq; k++) {
+            int i = k / block_size - r;
+            int j = k % block_size - r;
+
+            // Calculate local x and y derivatives
+            float ix = image_ptr[(x + i + 1) * idims[0] + y + j] -
+                       image_ptr[(x + i - 1) * idims[0] + y + j];
+            float iy = image_ptr[(x + i) * idims[0] + y + j + 1] -
+                       image_ptr[(x + i) * idims[0] + y + j - 1];
+
+            // Accumulate products of the first-order derivatives
+            ixx += ix * ix;
+            iyy += iy * iy;
+            ixy += ix * iy;
+        }
+
+        unsigned idx = *usable_feat;
+        *usable_feat += 1;
+        float tr  = ixx + iyy;
+        float det = ixx * iyy - ixy * ixy;
+
+        // Calculate Harris responses
+        float resp = det - k_thr * (tr * tr);
+
+        // Scale factor
+        // TODO: improve response scaling
+        float rscale = 0.001f;
+        rscale       = rscale * rscale * rscale * rscale;
+
+        x_out[idx]     = x;
+        y_out[idx]     = y;
+        score_out[idx] = resp * rscale;
+        if (use_scl) size_out[idx] = size;
+    }
+}
+
+template<typename T>
+void centroid_angle(const float* x_in, const float* y_in,
+                    float* orientation_out, const unsigned total_feat,
+                    CParam<T> image, const unsigned patch_size) {
+    const af::dim4 idims = image.dims();
+    const T* image_ptr   = image.get();
+    for (unsigned f = 0; f < total_feat; f++) {
+        unsigned x = (unsigned)round(x_in[f]);
+        unsigned y = (unsigned)round(y_in[f]);
+
+        unsigned r = patch_size / 2;
+        if (x < r || y < r || x >= idims[1] - r || y >= idims[0] - r) continue;
+
+        T m01 = (T)0, m10 = (T)0;
+        unsigned patch_size_sq = patch_size * patch_size;
+        for (unsigned k = 0; k < patch_size_sq; k++) {
+            int i = k / patch_size - r;
+            int j = k % patch_size - r;
+
+            // Calculate first order moments
+            T p = image_ptr[(x + i) * idims[0] + y + j];
+            m01 += j * p;
+            m10 += i * p;
+        }
+
+        float angle        = atan2(m01, m10);
+        orientation_out[f] = angle;
+    }
+}
+
+template<typename T>
+inline T get_pixel(unsigned x, unsigned y, const float ori, const unsigned size,
+                   const int dist_x, const int dist_y, CParam<T> image,
+                   const unsigned patch_size) {
+    const af::dim4 idims = image.dims();
+    const T* image_ptr   = image.get();
+    float ori_sin        = sin(ori);
+    float ori_cos        = cos(ori);
+    float patch_scl      = (float)size / (float)patch_size;
+
+    // Calculate point coordinates based on orientation and size
+    x += round(dist_x * patch_scl * ori_cos - dist_y * patch_scl * ori_sin);
+    y += round(dist_x * patch_scl * ori_sin + dist_y * patch_scl * ori_cos);
+
+    return image_ptr[x * idims[0] + y];
+}
+
+template<typename T>
+void extract_orb(unsigned* desc_out, const unsigned n_feat, float* x_in_out,
+                 float* y_in_out, const float* ori_in, float* size_out,
+                 CParam<T> image, const float scl, const unsigned patch_size) {
+    const af::dim4 idims = image.dims();
+    for (unsigned f = 0; f < n_feat; f++) {
+        unsigned x    = (unsigned)round(x_in_out[f]);
+        unsigned y    = (unsigned)round(y_in_out[f]);
+        float ori     = ori_in[f];
+        unsigned size = patch_size;
+
+        unsigned r = ceil(patch_size * sqrt(2.f) / 2.f);
+        if (x < r || y < r || x >= idims[1] - r || y >= idims[0] - r) continue;
+
+        // Descriptor fixed at 256 bits for now
+        // Storing descriptor as a vector of 8 x 32-bit unsigned numbers
+        for (unsigned i = 0; i < 8; i++) {
+            unsigned v = 0;
+
+            // j < 32 for a 256-bit descriptor (8 words x 32 bits)
+            for (unsigned j = 0; j < 32; j++) {
+                // Get position from distribution pattern and values of points
+                // p1 and p2
+                int dist_x = ref_pat[i * 32 * 4 + j * 4];
+                int dist_y = ref_pat[i * 32 * 4 + j * 4 + 1];
+                T p1       = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                       patch_size);
+
+                dist_x = ref_pat[i * 32 * 4 + j * 4 + 2];
+                dist_y = ref_pat[i * 32 * 4 + j * 4 + 3];
+                T p2   = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                   patch_size);
+
+                // Calculate the bit from p1 and p2 and shift it to the
+                // correct position
+                v |= static_cast<unsigned>(p1 < p2) << j;
+            }
+
+            // Store 32 bits of descriptor
+            desc_out[f * 8 + i] += v;
+        }
+
+        x_in_out[f] = round(x * scl);
+        y_in_out[f] = round(y * scl);
+        size_out[f] = patch_size * scl;
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/pad_array_borders.hpp b/src/backend/cpu/kernel/pad_array_borders.hpp
new file mode 100644
index 0000000000..8b44c9d425
--- /dev/null
+++ b/src/backend/cpu/kernel/pad_array_borders.hpp
@@ -0,0 +1,134 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <utility.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+namespace {
+static inline dim_t idxByndEdge(const dim_t i, const dim_t lb, const dim_t len,
+                                const af::borderType btype) {
+    dim_t retVal;
+    switch (btype) {
+        case AF_PAD_SYM: retVal = trimIndex(i - lb, len); break;
+        case AF_PAD_CLAMP_TO_EDGE:
+            retVal = std::max(dim_t(0), std::min(i - lb, len - 1));
+            break;
+        case AF_PAD_PERIODIC: {
+            dim_t rem = (i - lb) % len;
+            bool cond = rem < 0;
+            retVal    = cond * (rem + len) + (1 - cond) * rem;
+        } break;
+        default: retVal = 0; break;
+    }
+    return retVal;
+}
+}  // namespace
+
+template<typename T>
+void padBorders(Param<T> out, CParam<T> in, const dim4 lBoundPadSize,
+                const dim4 uBoundPadSize, const af::borderType btype) {
+    const dim4& oDims = out.dims();
+    const dim4& oStrs = out.strides();
+    const dim4& iDims = in.dims();
+    const dim4& iStrs = in.strides();
+
+    T const* const src = in.get();
+    T* dst             = out.get();
+
+    const dim4 validRegEnds(
+        oDims[0] - uBoundPadSize[0], oDims[1] - uBoundPadSize[1],
+        oDims[2] - uBoundPadSize[2], oDims[3] - uBoundPadSize[3]);
+    const bool isInputLinear = iStrs[0] == 1;
+
+    /*
+     * VALID REGION COPYING DOES
+     * NOT NEED ANY BOUND CHECKS
+     * */
+    for (dim_t l = lBoundPadSize[3]; l < validRegEnds[3]; ++l) {
+        dim_t oLOff = oStrs[3] * l;
+        dim_t iLOff = iStrs[3] * (l - lBoundPadSize[3]);
+
+        for (dim_t k = lBoundPadSize[2]; k < validRegEnds[2]; ++k) {
+            dim_t oKOff = oStrs[2] * k;
+            dim_t iKOff = iStrs[2] * (k - lBoundPadSize[2]);
+
+            for (dim_t j = lBoundPadSize[1]; j < validRegEnds[1]; ++j) {
+                dim_t oJOff = oStrs[1] * j;
+                dim_t iJOff = iStrs[1] * (j - lBoundPadSize[1]);
+
+                if (isInputLinear) {
+                    T const* const sptr = src + iLOff + iKOff + iJOff;
+                    T* dptr = dst + oLOff + oKOff + oJOff + lBoundPadSize[0];
+
+                    std::copy(sptr, sptr + iDims[0], dptr);
+                } else {
+                    for (dim_t i = lBoundPadSize[0]; i < validRegEnds[0]; ++i) {
+                        dim_t oIOff = oStrs[0] * i;
+                        dim_t iIOff = iStrs[0] * (i - lBoundPadSize[0]);
+
+                        dst[oLOff + oKOff + oJOff + oIOff] =
+                            src[iLOff + iKOff + iJOff + iIOff];
+                    }
+                }
+            }  // second dimension loop
+        }      // third dimension loop
+    }          // fourth dimension loop
+
+    // If we have to do zero padding,
+    // just return as the output is filled with
+    // zeros during allocation
+    if (btype == AF_PAD_ZERO) return;
+
+    /*
+     * PADDED REGIONS NEED BOUND
+     * CHECKS; FOLLOWING NESTED
+     * LOOPS SHALL ONLY PROCESS
+     * PADDED REGIONS AND SKIP REST
+     * */
+    for (dim_t l = 0; l < oDims[3]; ++l) {
+        bool skipL  = (l >= lBoundPadSize[3] && l < validRegEnds[3]);
+        dim_t oLOff = oStrs[3] * l;
+        dim_t iLOff =
+            iStrs[3] * idxByndEdge(l, lBoundPadSize[3], iDims[3], btype);
+        for (dim_t k = 0; k < oDims[2]; ++k) {
+            bool skipK  = (k >= lBoundPadSize[2] && k < validRegEnds[2]);
+            dim_t oKOff = oStrs[2] * k;
+            dim_t iKOff =
+                iStrs[2] * idxByndEdge(k, lBoundPadSize[2], iDims[2], btype);
+            for (dim_t j = 0; j < oDims[1]; ++j) {
+                bool skipJ  = (j >= lBoundPadSize[1] && j < validRegEnds[1]);
+                dim_t oJOff = oStrs[1] * j;
+                dim_t iJOff = iStrs[1] *
+                              idxByndEdge(j, lBoundPadSize[1], iDims[1], btype);
+                for (dim_t i = 0; i < oDims[0]; ++i) {
+                    bool skipI = (i >= lBoundPadSize[0] && i < validRegEnds[0]);
+                    if (skipI && skipJ && skipK && skipL) continue;
+
+                    dim_t oIOff = oStrs[0] * i;
+                    dim_t iIOff = iStrs[0] * idxByndEdge(i, lBoundPadSize[0],
+                                                         iDims[0], btype);
+
+                    dst[oLOff + oKOff + oJOff + oIOff] =
+                        src[iLOff + iKOff + iJOff + iIOff];
+
+                }  // first dimension loop
+            }      // second dimension loop
+        }          // third dimension loop
+    }              // fourth dimension loop
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/random_engine.hpp b/src/backend/cpu/kernel/random_engine.hpp
new file mode 100644
index 0000000000..0ab49f8a80
--- /dev/null
+++ b/src/backend/cpu/kernel/random_engine.hpp
@@ -0,0 +1,426 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <err_cpu.hpp>
+#include <kernel/random_engine_mersenne.hpp>
+#include <kernel/random_engine_philox.hpp>
+#include <kernel/random_engine_threefry.hpp>
+#include <types.hpp>
+
+#include <algorithm>
+#include <cmath>
+#include <cstring>
+
+using std::array;
+using std::memcpy;
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+// Utils
+static const double PI_VAL =
+    3.1415926535897932384626433832795028841971693993751058209749445923078164;
+
+// Conversion to half adapted from Random123
+constexpr float unsigned_half_factor =
+    ((1.0f) / (std::numeric_limits<ushort>::max() + (1.0f)));
+constexpr float unsigned_half_half_factor((0.5f) * unsigned_half_factor);
+
+template<typename T>
+T transform(uint *val, uint index);
+
+template<>
+uintl transform<uintl>(uint *val, uint index) {
+    uint index2 = index << 1;
+    uintl v     = ((static_cast<uintl>(val[index2]) << 32) |
+               (static_cast<uintl>(val[index2 + 1])));
+    return v;
+}
+
+// Generates rationals in [0, 1)
+float getFloat01(uint *val, uint index) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0f) /
+         (static_cast<float>(std::numeric_limits<unsigned int>::max()) +
+          (1.0f)));
+    constexpr float half_factor = ((0.5f) * factor);
+    return fmaf(val[index], factor, half_factor);
+}
+
+// Generates rationals in (-1, 1]
+static float getFloatNegative11(uint *val, uint index) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0) /
+         (static_cast<double>(std::numeric_limits<int>::max()) + (1.0)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return fmaf(static_cast<float>(val[index]), factor, half_factor);
+}
+
+// Generates rationals in [0, 1)
+arrayfire::common::half getHalf01(uint *val, uint index) {
+    float v = val[index >> 1U] >> (16U * (index & 1U)) & 0x0000ffff;
+    return static_cast<arrayfire::common::half>(
+        fmaf(v, unsigned_half_factor, unsigned_half_half_factor));
+}
+
+// Generates rationals in (-1, 1]
+static arrayfire::common::half getHalfNegative11(uint *val, uint index) {
+    float v = val[index >> 1U] >> (16U * (index & 1U)) & 0x0000ffff;
+    // Conversion to half adapted from Random123
+    constexpr float factor =
+        ((1.0f) / (std::numeric_limits<short>::max() + (1.0f)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return static_cast<arrayfire::common::half>(fmaf(v, factor, half_factor));
+}
+
+// Generates rationals in [0, 1)
+double getDouble01(uint *val, uint index) {
+    uintl v = transform<uintl>(val, index);
+    constexpr double factor =
+        ((1.0) / (std::numeric_limits<unsigned long long>::max() +
+                  static_cast<long double>(1.0l)));
+    constexpr double half_factor((0.5) * factor);
+    return fma(v, factor, half_factor);
+}
+
+template<>
+char transform<char>(uint *val, uint index) {
+    char v = 0;
+    memcpy(&v, static_cast<char *>(static_cast<void *>(val)) + index,
+           sizeof(char));
+    v &= 0x1;
+    return v;
+}
+
+template<>
+uchar transform<uchar>(uint *val, uint index) {
+    uchar v = 0;
+    memcpy(&v, static_cast<uchar *>(static_cast<void *>(val)) + index,
+           sizeof(uchar));
+    return v;
+}
+
+template<>
+schar transform<schar>(uint *val, uint index) {
+    return transform<uchar>(val, index);
+}
+
+template<>
+ushort transform<ushort>(uint *val, uint index) {
+    ushort v = val[index >> 1U] >> (16U * (index & 1U)) & 0x0000ffff;
+    return v;
+}
+
+template<>
+short transform<short>(uint *val, uint index) {
+    return transform<ushort>(val, index);
+}
+
+template<>
+uint transform<uint>(uint *val, uint index) {
+    return val[index];
+}
+
+template<>
+int transform<int>(uint *val, uint index) {
+    return transform<uint>(val, index);
+}
+
+template<>
+intl transform<intl>(uint *val, uint index) {
+    uintl v = transform<uintl>(val, index);
+    intl out;
+    memcpy(&out, &v, sizeof(intl));
+    return out;
+}
+
+template<>
+float transform<float>(uint *val, uint index) {
+    return 1.f - getFloat01(val, index);
+}
+
+template<>
+double transform<double>(uint *val, uint index) {
+    return 1. - getDouble01(val, index);
+}
+
+template<>
+arrayfire::common::half transform<arrayfire::common::half>(uint *val,
+                                                           uint index) {
+    float v = val[index >> 1U] >> (16U * (index & 1U)) & 0x0000ffff;
+    return static_cast<arrayfire::common::half>(
+        1.f - fmaf(v, unsigned_half_factor, unsigned_half_half_factor));
+}
+
+// Generates rationals in [-1, 1)
+double getDoubleNegative11(uint *val, uint index) {
+    intl v = transform<intl>(val, index);
+    // Conversion to doubles adapted from Random123
+    constexpr double signed_factor =
+        ((1.0l) / (std::numeric_limits<long long>::max() + (1.0l)));
+    constexpr double half_factor = ((0.5) * signed_factor);
+    return fma(v, signed_factor, half_factor);
+}
+
+#define MAX_RESET_CTR_VAL 64
+#define WRITE_STRIDE 256
+
+// This implementation aims to emulate the corresponding method in the CUDA
+// backend, in order to produce the exact same numbers as CUDA.
+// A stride of WRITE_STRIDE (256) is applied between each write
+// (emulating the CUDA thread writing to 4 locations with a stride of
+// blockDim.x, which is 256).
+// ELEMS_PER_ITER corresponds to elementsPerBlock in the CUDA backend, so each
+// "iter" (iteration) here corresponds to a CUDA thread block doing its work.
+// This change was prompted by issue #2429
+template<typename T>
+void philoxUniform(T *out, size_t elements, const uintl seed, uintl counter) {
+    uint hi  = seed >> 32;
+    uint lo  = seed;
+    uint hic = counter >> 32;
+    uint loc = counter;
+
+    constexpr size_t RESET_CTR = MAX_RESET_CTR_VAL / sizeof(T);
+    constexpr size_t ELEMS_PER_ITER =
+        WRITE_STRIDE * 4 * sizeof(uint) / sizeof(T);
+
+    int num_iters = divup(elements, ELEMS_PER_ITER);
+    size_t len    = num_iters * ELEMS_PER_ITER;
+
+    constexpr size_t NUM_WRITES = 16 / sizeof(T);
+    for (size_t iter = 0; iter < len; iter += ELEMS_PER_ITER) {
+        for (size_t i = 0; i < WRITE_STRIDE; i += RESET_CTR) {
+            for (size_t j = 0; j < RESET_CTR; ++j) {
+                // first_write_idx is the first of the 4 locations that will
+                // be written to
+                uintptr_t first_write_idx = iter + i + j;
+                if (first_write_idx >= elements) { break; }
+
+                // Recalculate key and ctr to emulate how the CUDA backend
+                // calculates these per thread
+                uint key[2] = {lo, hi};
+                uint ctr[4] = {loc + (uint)first_write_idx, 0, 0, 0};
+                ctr[1]      = hic + (ctr[0] < loc);
+                ctr[2]      = (ctr[1] < hic);
+                philox(key, ctr);
+
+                // Use the same ctr array for each of the 4 locations,
+                // but each location gets a different ctr value
+                for (uint buf_idx = 0; buf_idx < NUM_WRITES; ++buf_idx) {
+                    size_t out_idx = iter + buf_idx * WRITE_STRIDE + i + j;
+                    if (out_idx < elements) {
+                        out[out_idx] = transform<T>(ctr, buf_idx);
+                    }
+                }
+            }
+        }
+    }
+}
+
+#undef MAX_RESET_CTR_VAL
+#undef WRITE_STRIDE
+
+template<typename T>
+void threefryUniform(T *out, size_t elements, const uintl seed, uintl counter) {
+    uint hi     = seed >> 32;
+    uint lo     = seed;
+    uint hic    = counter >> 32;
+    uint loc    = counter;
+    uint key[2] = {lo, hi};
+    uint ctr[2] = {loc, hic};
+    uint val[2];
+
+    int reset = (2 * sizeof(uint)) / sizeof(T);
+    for (int i = 0; i < (int)elements; i += reset) {
+        threefry(key, ctr, val);
+        ++ctr[0];
+        ctr[1] += (ctr[0] == 0);
+        int lim = (reset < (int)(elements - i)) ? reset : (int)(elements - i);
+        for (int j = 0; j < lim; ++j) { out[i + j] = transform<T>(val, j); }
+    }
+}
+
+template<typename T>
+void boxMullerTransform(data_t<T> *const out1, data_t<T> *const out2,
+                        const T r1, const T r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+    using Tc = compute_t<T>;
+    Tc r     = sqrt((Tc)(-2.0) * log(static_cast<Tc>(r2)));
+    Tc theta = PI_VAL * (static_cast<Tc>(r1));
+
+    *out1 = r * sin(theta);
+    *out2 = r * cos(theta);
+}
+
+void boxMullerTransform(uint val[4], double *temp) {
+    boxMullerTransform<double>(&temp[0], &temp[1], getDoubleNegative11(val, 0),
+                               getDouble01(val, 1));
+}
+
+void boxMullerTransform(uint val[4], float *temp) {
+    boxMullerTransform<float>(&temp[0], &temp[1], getFloatNegative11(val, 0),
+                              getFloat01(val, 1));
+    boxMullerTransform<float>(&temp[2], &temp[3], getFloatNegative11(val, 2),
+                              getFloat01(val, 3));
+}
+
+void boxMullerTransform(uint val[4], arrayfire::common::half *temp) {
+    using arrayfire::common::half;
+    boxMullerTransform<half>(&temp[0], &temp[1], getHalfNegative11(val, 0),
+                             getHalf01(val, 1));
+    boxMullerTransform<half>(&temp[2], &temp[3], getHalfNegative11(val, 2),
+                             getHalf01(val, 3));
+    boxMullerTransform<half>(&temp[4], &temp[5], getHalfNegative11(val, 4),
+                             getHalf01(val, 5));
+    boxMullerTransform<half>(&temp[6], &temp[7], getHalfNegative11(val, 6),
+                             getHalf01(val, 7));
+}
+
+template<typename T>
+void philoxNormal(T *out, size_t elements, const uintl seed, uintl counter) {
+    uint hi     = seed >> 32;
+    uint lo     = seed;
+    uint hic    = counter >> 32;
+    uint loc    = counter;
+    uint key[2] = {lo, hi};
+    uint ctr[4] = {loc, hic, 0, 0};
+    T temp[(4 * sizeof(uint)) / sizeof(T)];
+
+    int reset = (4 * sizeof(uint)) / sizeof(T);
+    for (int i = 0; i < (int)elements; i += reset) {
+        philox(key, ctr);
+        boxMullerTransform(ctr, temp);
+        int lim = (reset < (int)(elements - i)) ? reset : (int)(elements - i);
+        for (int j = 0; j < lim; ++j) { out[i + j] = temp[j]; }
+    }
+}
+
+template<typename T>
+void threefryNormal(T *out, size_t elements, const uintl seed, uintl counter) {
+    uint hi     = seed >> 32;
+    uint lo     = seed;
+    uint hic    = counter >> 32;
+    uint loc    = counter;
+    uint key[2] = {lo, hi};
+    uint ctr[2] = {loc, hic};
+    uint val[4];
+    T temp[(4 * sizeof(uint)) / sizeof(T)];
+
+    int reset = (4 * sizeof(uint)) / sizeof(T);
+    for (int i = 0; i < (int)elements; i += reset) {
+        threefry(key, ctr, val);
+        ++ctr[0];
+        ctr[1] += (ctr[0] == 0);
+        threefry(key, ctr, val + 2);
+        ++ctr[0];
+        ctr[1] += (ctr[0] == 0);
+        boxMullerTransform(val, temp);
+        int lim = (reset < (int)(elements - i)) ? reset : (int)(elements - i);
+        for (int j = 0; j < lim; ++j) { out[i + j] = temp[j]; }
+    }
+}
+
+template<typename T>
+void uniformDistributionMT(T *out, size_t elements, uint *const state,
+                           const uint *const pos, const uint *const sh1,
+                           const uint *const sh2, uint mask,
+                           const uint *const recursion_table,
+                           const uint *const temper_table) {
+    uint l_state[STATE_SIZE];
+    uint o[4];
+    uint lpos = pos[0];
+    uint lsh1 = sh1[0];
+    uint lsh2 = sh2[0];
+
+    state_read(l_state, state);
+
+    int reset = (4 * sizeof(uint)) / sizeof(T);
+    for (int i = 0; i < (int)elements; i += reset) {
+        mersenne(o, l_state, i, lpos, lsh1, lsh2, mask, recursion_table,
+                 temper_table);
+        int lim = (reset < (int)(elements - i)) ? reset : (int)(elements - i);
+        for (int j = 0; j < lim; ++j) { out[i + j] = transform<T>(o, j); }
+    }
+
+    state_write(state, l_state);
+}
+
+template<typename T>
+void normalDistributionMT(T *out, size_t elements, uint *const state,
+                          const uint *const pos, const uint *const sh1,
+                          const uint *const sh2, uint mask,
+                          const uint *const recursion_table,
+                          const uint *const temper_table) {
+    T temp[(4 * sizeof(uint)) / sizeof(T)];
+    uint l_state[STATE_SIZE];
+    uint o[4];
+    uint lpos = pos[0];
+    uint lsh1 = sh1[0];
+    uint lsh2 = sh2[0];
+
+    state_read(l_state, state);
+
+    int reset = (4 * sizeof(uint)) / sizeof(T);
+    for (int i = 0; i < (int)elements; i += reset) {
+        mersenne(o, l_state, i, lpos, lsh1, lsh2, mask, recursion_table,
+                 temper_table);
+        boxMullerTransform(o, temp);
+        int lim = (reset < (int)(elements - i)) ? reset : (int)(elements - i);
+        for (int j = 0; j < lim; ++j) { out[i + j] = temp[j]; }
+    }
+
+    state_write(state, l_state);
+}
+
+template<typename T>
+void uniformDistributionCBRNG(T *out, size_t elements,
+                              af_random_engine_type type, const uintl seed,
+                              uintl counter) {
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            philoxUniform(out, elements, seed, counter);
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            threefryUniform(out, elements, seed, counter);
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+}
+
+template<typename T>
+void normalDistributionCBRNG(T *out, size_t elements,
+                             af_random_engine_type type, const uintl seed,
+                             uintl counter) {
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            philoxNormal(out, elements, seed, counter);
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            threefryNormal(out, elements, seed, counter);
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/random_engine_mersenne.hpp b/src/backend/cpu/kernel/random_engine_mersenne.hpp
new file mode 100644
index 0000000000..5087621b26
--- /dev/null
+++ b/src/backend/cpu/kernel/random_engine_mersenne.hpp
@@ -0,0 +1,121 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+static const int N          = 351;
+static const int STATE_SIZE = 256 * 3;
+
+uint recursion(const uint* const recursion_table, const uint mask,
+               const uint sh1, const uint sh2, const uint x1, const uint x2,
+               uint y) {
+    uint x = (x1 & mask) ^ x2;
+    x ^= x << sh1;
+    y        = x ^ (y >> sh2);
+    uint mat = recursion_table[y & 0x0f];
+    return y ^ mat;
+}
+
+uint temper(const uint* const temper_table, const uint v, uint t) {
+    t ^= t >> 16;
+    t ^= t >> 8;
+    uint mat = temper_table[t & 0x0f];
+    return v ^ mat;
+}
+
+void mersenne(uint* const out, uint* const state, int i, uint pos, uint sh1,
+              uint sh2, uint mask, const uint* const recursion_table,
+              const uint* const temper_table) {
+    int index    = i % STATE_SIZE;
+    int offsetX1 = (STATE_SIZE - N + index) % STATE_SIZE;
+    int offsetX2 = (STATE_SIZE - N + index + 1) % STATE_SIZE;
+    int offsetY  = (STATE_SIZE - N + index + pos) % STATE_SIZE;
+    int offsetT  = (STATE_SIZE - N + index + pos - 1) % STATE_SIZE;
+    for (int j = 0; j < 4; ++j) {
+        state[index] =
+            recursion(recursion_table, mask, sh1, sh2, state[offsetX1],
+                      state[offsetX2], state[offsetY]);
+        out[j]   = temper(temper_table, state[index], state[offsetT]);
+        offsetX1 = (offsetX1 + 1) % STATE_SIZE;
+        offsetX2 = (offsetX2 + 1) % STATE_SIZE;
+        offsetY  = (offsetY + 1) % STATE_SIZE;
+        offsetT  = (offsetT + 1) % STATE_SIZE;
+        index    = (index + 1) % STATE_SIZE;
+    }
+}
+
+void state_read(uint* const l_state, const uint* const state) {
+    for (int i = 0; i < N; ++i) { l_state[STATE_SIZE - N + i] = state[i]; }
+}
+
+void state_write(uint* const state, const uint* const l_state) {
+    for (int i = 0; i < N; ++i) { state[i] = l_state[STATE_SIZE - N + i]; }
+}
+
+void initMersenneState(uint* const state, const uint* const tbl,
+                       const uintl seed) {
+    uint hidden_seed = tbl[4] ^ (tbl[8] << 16);
+    uint tmp         = hidden_seed;
+    tmp += tmp >> 16;
+    tmp += tmp >> 8;
+    tmp &= 0xff;
+    tmp |= tmp << 8;
+    tmp |= tmp << 16;
+    state[0] = seed;
+    state[1] =
+        hidden_seed ^ ((uint)(1812433253) * (state[0] ^ (state[0] >> 30)) + 1);
+    for (int i = 2; i < N; ++i) {
+        state[i] = tmp;
+        state[i] ^=
+            (uint)(1812433253) * (state[i - 1] ^ (state[i - 1] >> 30)) + i;
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/random_engine_philox.hpp b/src/backend/cpu/kernel/random_engine_philox.hpp
new file mode 100644
index 0000000000..f1a82014df
--- /dev/null
+++ b/src/backend/cpu/kernel/random_engine_philox.hpp
@@ -0,0 +1,107 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ *
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+// Utils
+// Source of these constants :
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/philox.hpp
+
+static const uint m4x32_0 = 0xD2511F53;
+static const uint m4x32_1 = 0xCD9E8D57;
+static const uint w32_0   = 0x9E3779B9;
+static const uint w32_1   = 0xBB67AE85;
+
+void mulhilo(const uint a, const uint b, uint* const hi, uint* const lo) {
+    *hi = (((uintl)a) * ((uintl)b)) >> 32;
+    *lo = a * b;
+}
+
+void philoxBump(uint* const k) {
+    k[0] += w32_0;
+    k[1] += w32_1;
+}
+
+void philoxRound(const uint* const k, uint* const c) {
+    uint hi0, lo0, hi1, lo1;
+    mulhilo(m4x32_0, c[0], &hi0, &lo0);
+    mulhilo(m4x32_1, c[2], &hi1, &lo1);
+    c[0] = hi1 ^ c[1] ^ k[0];
+    c[1] = lo1;
+    c[2] = hi0 ^ c[3] ^ k[1];
+    c[3] = lo0;
+}
+
+void philox(uint* const key, uint* const ctr) {
+    // 10 Rounds
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/random_engine_threefry.hpp b/src/backend/cpu/kernel/random_engine_threefry.hpp
new file mode 100644
index 0000000000..df728c9a81
--- /dev/null
+++ b/src/backend/cpu/kernel/random_engine_threefry.hpp
@@ -0,0 +1,160 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+// Utils
+// Source of these constants :
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/threefry.hpp
+
+static const uint SKEIN_KS_PARITY = 0x1BD11BDA;
+
+static const uint R0 = 13;
+static const uint R1 = 15;
+static const uint R2 = 26;
+static const uint R3 = 6;
+static const uint R4 = 17;
+static const uint R5 = 29;
+static const uint R6 = 16;
+static const uint R7 = 24;
+
+static inline uint rotL(uint x, uint N) {
+    return (x << (N & 31)) | (x >> ((32 - N) & 31));
+}
+
+static inline void threefry(uint k[2], uint c[2], uint X[2]) {
+    uint ks[3];
+
+    ks[2] = SKEIN_KS_PARITY;
+    ks[0] = k[0];
+    X[0]  = c[0];
+    ks[2] ^= k[0];
+    ks[1] = k[1];
+    X[1]  = c[1];
+    ks[2] ^= k[1];
+
+    X[0] += ks[0];
+    X[1] += ks[1];
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=1) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 1; /* X[2-1] += r  */
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=2) */
+    X[0] += ks[2];
+    X[1] += ks[0];
+    X[1] += 2;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=3) */
+    X[0] += ks[0];
+    X[1] += ks[1];
+    X[1] += 3;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=4) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 4;
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/range.hpp b/src/backend/cpu/kernel/range.hpp
new file mode 100644
index 0000000000..8d93d384be
--- /dev/null
+++ b/src/backend/cpu/kernel/range.hpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, int dim>
+void range(Param<T> output) {
+    T* out = output.get();
+
+    const dim4 dims    = output.dims();
+    const dim4 strides = output.strides();
+
+    for (dim_t w = 0; w < dims[3]; w++) {
+        dim_t offW = w * strides[3];
+        for (dim_t z = 0; z < dims[2]; z++) {
+            dim_t offWZ = offW + z * strides[2];
+            for (dim_t y = 0; y < dims[1]; y++) {
+                dim_t offWZY = offWZ + y * strides[1];
+                for (dim_t x = 0; x < dims[0]; x++) {
+                    dim_t id = offWZY + x;
+                    if (dim == 0) {
+                        out[id] = x;
+                    } else if (dim == 1) {
+                        out[id] = y;
+                    } else if (dim == 2) {
+                        out[id] = z;
+                    } else if (dim == 3) {
+                        out[id] = w;
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/reduce.hpp b/src/backend/cpu/kernel/reduce.hpp
new file mode 100644
index 0000000000..de685b426a
--- /dev/null
+++ b/src/backend/cpu/kernel/reduce.hpp
@@ -0,0 +1,204 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/half.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<af_op_t op, typename Ti, typename To, int D>
+struct reduce_dim {
+    void operator()(Param<To> out, const dim_t outOffset, CParam<Ti> in,
+                    const dim_t inOffset, const int dim, bool change_nan,
+                    double nanval) {
+        static const int D1 = D - 1;
+        reduce_dim<op, Ti, To, D1> reduce_dim_next;
+
+        const af::dim4 ostrides = out.strides();
+        const af::dim4 istrides = in.strides();
+        const af::dim4 odims    = out.dims();
+
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            reduce_dim_next(out, outOffset + i * ostrides[D1], in,
+                            inOffset + i * istrides[D1], dim, change_nan,
+                            nanval);
+        }
+    }
+};
+
+template<af_op_t op, typename Ti, typename To>
+struct reduce_dim<op, Ti, To, 0> {
+    common::Transform<data_t<Ti>, compute_t<To>, op> transform;
+    common::Binary<compute_t<To>, op> reduce;
+    void operator()(Param<To> out, const dim_t outOffset, CParam<Ti> in,
+                    const dim_t inOffset, const int dim, bool change_nan,
+                    double nanval) {
+        const af::dim4 istrides = in.strides();
+        const af::dim4 idims    = in.dims();
+
+        data_t<To> *const outPtr      = out.get() + outOffset;
+        data_t<Ti> const *const inPtr = in.get() + inOffset;
+        dim_t stride                  = istrides[dim];
+
+        compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+        for (dim_t i = 0; i < idims[dim]; i++) {
+            compute_t<To> in_val = transform(inPtr[i * stride]);
+            if (change_nan) in_val = IS_NAN(in_val) ? nanval : in_val;
+            out_val = reduce(in_val, out_val);
+        }
+
+        *outPtr = data_t<To>(out_val);
+    }
+};
+
+template<typename Tk>
+void n_reduced_keys(Param<Tk> okeys, int *n_reduced, CParam<Tk> keys) {
+    const af::dim4 kdims = keys.dims();
+
+    Tk *const outKeysPtr      = okeys.get();
+    Tk const *const inKeysPtr = keys.get();
+
+    int nkeys      = 0;
+    Tk current_key = inKeysPtr[0];
+    for (dim_t i = 0; i < kdims[0]; i++) {
+        Tk keyval = inKeysPtr[i];
+
+        if (keyval != current_key) {
+            outKeysPtr[nkeys] = current_key;
+            current_key       = keyval;
+            ++nkeys;
+        }
+
+        if (i == (kdims[0] - 1)) { outKeysPtr[nkeys] = current_key; }
+    }
+
+    *n_reduced = nkeys + 1;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To, int D>
+struct reduce_dim_by_key {
+    void operator()(Param<To> ovals, const dim_t ovOffset, CParam<Tk> keys,
+                    CParam<Ti> vals, const dim_t vOffset, int *n_reduced,
+                    const int dim, bool change_nan, double nanval) {
+        static const int D1 = D - 1;
+        reduce_dim_by_key<op, Ti, Tk, To, D1> reduce_by_key_dim_next;
+
+        const af::dim4 ovstrides = ovals.strides();
+        const af::dim4 vstrides  = vals.strides();
+        const af::dim4 vdims     = ovals.dims();
+
+        if (D1 == dim) {
+            reduce_by_key_dim_next(ovals, ovOffset, keys, vals, vOffset,
+                                   n_reduced, dim, change_nan, nanval);
+        } else {
+            for (dim_t i = 0; i < vdims[D1]; i++) {
+                reduce_by_key_dim_next(ovals, ovOffset + (i * ovstrides[D1]),
+                                       keys, vals, vOffset + (i * vstrides[D1]),
+                                       n_reduced, dim, change_nan, nanval);
+            }
+        }
+    }
+};
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+struct reduce_dim_by_key<op, Ti, Tk, To, 0> {
+    common::Transform<data_t<Ti>, compute_t<To>, op> transform;
+    common::Binary<compute_t<To>, op> reduce;
+    void operator()(Param<To> ovals, const dim_t ovOffset, CParam<Tk> keys,
+                    CParam<Ti> vals, const dim_t vOffset, int *n_reduced,
+                    const int dim, bool change_nan, double nanval) {
+        const af::dim4 vstrides = vals.strides();
+        const af::dim4 vdims    = vals.dims();
+
+        const af::dim4 ovstrides = ovals.strides();
+
+        data_t<Tk> const *const inKeysPtr = keys.get();
+        data_t<Ti> const *const inValsPtr = vals.get();
+        data_t<To> *const outValsPtr      = ovals.get();
+
+        int keyidx                = 0;
+        compute_t<Tk> current_key = compute_t<Tk>(inKeysPtr[0]);
+        compute_t<To> out_val     = reduce.init();
+
+        dim_t istride = vstrides[dim];
+        dim_t ostride = ovstrides[dim];
+
+        for (dim_t i = 0; i < vdims[dim]; i++) {
+            compute_t<Tk> keyval = inKeysPtr[i];
+
+            if (keyval == current_key) {
+                compute_t<To> in_val =
+                    transform(inValsPtr[vOffset + (i * istride)]);
+                if (change_nan) in_val = IS_NAN(in_val) ? nanval : in_val;
+                out_val = reduce(in_val, out_val);
+
+            } else {
+                outValsPtr[ovOffset + (keyidx * ostride)] = out_val;
+
+                current_key = keyval;
+                out_val     = transform(inValsPtr[vOffset + (i * istride)]);
+                if (change_nan) out_val = IS_NAN(out_val) ? nanval : out_val;
+                ++keyidx;
+            }
+
+            if (i == (vdims[dim] - 1)) {
+                outValsPtr[ovOffset + (keyidx * ostride)] = out_val;
+            }
+        }
+    }
+};
+
+template<af_op_t op, typename Ti, typename To>
+struct reduce_all {
+    common::Transform<data_t<Ti>, compute_t<To>, op> transform;
+    common::Binary<compute_t<To>, op> reduce;
+    void operator()(Param<To> out, CParam<Ti> in, bool change_nan,
+                    double nanval) {
+        // Reduce every element of the input into a single value
+        af::dim4 dims            = in.dims();
+        af::dim4 strides         = in.strides();
+        const data_t<Ti> *inPtr  = in.get();
+        data_t<To> *const outPtr = out.get();
+
+        compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+
+        for (dim_t l = 0; l < dims[3]; l++) {
+            dim_t off3 = l * strides[3];
+
+            for (dim_t k = 0; k < dims[2]; k++) {
+                dim_t off2 = k * strides[2];
+
+                for (dim_t j = 0; j < dims[1]; j++) {
+                    dim_t off1 = j * strides[1];
+
+                    for (dim_t i = 0; i < dims[0]; i++) {
+                        dim_t idx = i + off1 + off2 + off3;
+
+                        compute_t<To> in_val = transform(inPtr[idx]);
+                        if (change_nan) {
+                            in_val = IS_NAN(in_val) ? nanval : in_val;
+                        }
+                        out_val = reduce(in_val, out_val);
+                    }
+                }
+            }
+        }
+
+        *outPtr = data_t<To>(out_val);
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/regions.hpp b/src/backend/cpu/kernel/regions.hpp
new file mode 100644
index 0000000000..fab7398720
--- /dev/null
+++ b/src/backend/cpu/kernel/regions.hpp
@@ -0,0 +1,171 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <memory.hpp>
+#include <map>
+#include <set>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+class LabelNode {
+   private:
+    T label;
+    T minLabel;
+    unsigned rank;
+    LabelNode* parent;
+
+   public:
+    LabelNode() : label(0), minLabel(0), rank(0), parent(this) {}
+    LabelNode(T label) : label(label), minLabel(label), rank(0), parent(this) {}
+
+    T getLabel() { return label; }
+
+    T getMinLabel() { return minLabel; }
+
+    LabelNode* getParent() { return parent; }
+
+    unsigned getRank() { return rank; }
+
+    void setMinLabel(T l) { minLabel = l; }
+
+    void setParent(LabelNode* p) { parent = p; }
+
+    void setRank(unsigned r) { rank = r; }
+};
+
+template<typename T>
+static LabelNode<T>* find(LabelNode<T>* x) {
+    if (x->getParent() != x) x->setParent(find(x->getParent()));
+    return x->getParent();
+}
+
+template<typename T>
+static void setUnion(LabelNode<T>* x, LabelNode<T>* y) {
+    LabelNode<T>* xRoot = find(x);
+    LabelNode<T>* yRoot = find(y);
+    if (xRoot == yRoot) return;
+
+    T xMinLabel = xRoot->getMinLabel();
+    T yMinLabel = yRoot->getMinLabel();
+    xRoot->setMinLabel(std::min(xMinLabel, yMinLabel));
+    yRoot->setMinLabel(std::min(xMinLabel, yMinLabel));
+
+    if (xRoot->getRank() < yRoot->getRank())
+        xRoot->setParent(yRoot);
+    else if (xRoot->getRank() > yRoot->getRank())
+        yRoot->setParent(xRoot);
+    else {
+        yRoot->setParent(xRoot);
+        xRoot->setRank(xRoot->getRank() + 1);
+    }
+}
+
+template<typename T>
+void regions(Param<T> out, CParam<char> in, af_connectivity connectivity) {
+    const af::dim4 inDims = in.dims();
+    const char* inPtr     = in.get();
+    T* outPtr             = out.get();
+
+    // Map labels
+    typedef typename std::unique_ptr<LabelNode<T>> UnqLabelPtr;
+    typedef typename std::map<T, UnqLabelPtr> LabelMap;
+    typedef typename LabelMap::iterator LabelMapIterator;
+
+    LabelMap lmap;
+
+    // Initial label
+    T label = (T)1;
+
+    for (int j = 0; j < (int)inDims[1]; j++) {
+        for (int i = 0; i < (int)inDims[0]; i++) {
+            int idx = j * inDims[0] + i;
+            if (inPtr[idx] != 0) {
+                std::vector<T> l;
+
+                // Test neighbors
+                if (i > 0 && outPtr[j * (int)inDims[0] + i - 1] > 0)
+                    l.push_back(outPtr[j * inDims[0] + i - 1]);
+                if (j > 0 && outPtr[(j - 1) * (int)inDims[0] + i] > 0)
+                    l.push_back(outPtr[(j - 1) * inDims[0] + i]);
+                if (connectivity == AF_CONNECTIVITY_8 && i > 0 && j > 0 &&
+                    outPtr[(j - 1) * inDims[0] + i - 1] > 0)
+                    l.push_back(outPtr[(j - 1) * inDims[0] + i - 1]);
+                if (connectivity == AF_CONNECTIVITY_8 &&
+                    i < (int)inDims[0] - 1 && j > 0 &&
+                    outPtr[(j - 1) * inDims[0] + i + 1] != 0)
+                    l.push_back(outPtr[(j - 1) * inDims[0] + i + 1]);
+
+                if (!l.empty()) {
+                    T minl = l[0];
+                    for (size_t k = 0; k < l.size(); k++) {
+                        minl                        = std::min(l[k], minl);
+                        LabelMapIterator currentMap = lmap.find(l[k]);
+                        LabelNode<T>* node          = currentMap->second.get();
+                        // Group labels of the same region under a disjoint set
+                        for (size_t m = k + 1; m < l.size(); m++)
+                            setUnion(node, lmap.find(l[m])->second.get());
+                    }
+                    // Set label to smallest neighbor label
+                    outPtr[idx] = minl;
+                } else {
+                    // Insert new label in map
+                    lmap.insert(std::make_pair(
+                        label, UnqLabelPtr(new LabelNode<T>(label))));
+                    outPtr[idx] = label++;
+                }
+            }
+        }
+    }
+
+    std::set<T> removed;
+
+    for (int j = 0; j < (int)inDims[1]; j++) {
+        for (int i = 0; i < (int)inDims[0]; i++) {
+            int idx = j * (int)inDims[0] + i;
+            if (inPtr[idx] != 0) {
+                T l                         = outPtr[idx];
+                LabelMapIterator currentMap = lmap.find(l);
+
+                if (currentMap != lmap.end()) {
+                    LabelNode<T>* node = currentMap->second.get();
+
+                    LabelNode<T>* nodeRoot = find(node);
+                    outPtr[idx]            = nodeRoot->getMinLabel();
+
+                    // Mark removed labels (those that are part of a region
+                    // that contains a smaller label)
+                    if (node->getMinLabel() < l || nodeRoot->getMinLabel() < l)
+                        removed.insert(l);
+                    if (node->getLabel() > node->getMinLabel())
+                        removed.insert(node->getLabel());
+                }
+            }
+        }
+    }
+
+    // Calculate final neighbors (ensure final labels are sequential)
+    for (int j = 0; j < (int)inDims[1]; j++) {
+        for (int i = 0; i < (int)inDims[0]; i++) {
+            int idx = j * (int)inDims[0] + i;
+            if (outPtr[idx] > 0) {
+                outPtr[idx] -=
+                    distance(removed.begin(), removed.lower_bound(outPtr[idx]));
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/reorder.hpp b/src/backend/cpu/kernel/reorder.hpp
new file mode 100644
index 0000000000..ccaf8efc72
--- /dev/null
+++ b/src/backend/cpu/kernel/reorder.hpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void reorder(Param<T> out, CParam<T> in, const af::dim4 oDims,
+             const af::dim4 rdims) {
+    T* outPtr      = out.get();
+    const T* inPtr = in.get();
+
+    const af::dim4 ist = in.strides();
+    const af::dim4 ost = out.strides();
+
+    dim_t ids[4] = {0};
+    for (dim_t ow = 0; ow < oDims[3]; ow++) {
+        const dim_t oW = ow * ost[3];
+        ids[rdims[3]]  = ow;
+        for (dim_t oz = 0; oz < oDims[2]; oz++) {
+            const dim_t oZW = oW + oz * ost[2];
+            ids[rdims[2]]   = oz;
+            for (dim_t oy = 0; oy < oDims[1]; oy++) {
+                const dim_t oYZW = oZW + oy * ost[1];
+                ids[rdims[1]]    = oy;
+                for (dim_t ox = 0; ox < oDims[0]; ox++) {
+                    const dim_t oIdx = oYZW + ox;
+
+                    ids[rdims[0]]    = ox;
+                    const dim_t iIdx = ids[3] * ist[3] + ids[2] * ist[2] +
+                                       ids[1] * ist[1] + ids[0];
+
+                    outPtr[oIdx] = inPtr[iIdx];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/resize.hpp b/src/backend/cpu/kernel/resize.hpp
new file mode 100644
index 0000000000..d5e1a3f6b9
--- /dev/null
+++ b/src/backend/cpu/kernel/resize.hpp
@@ -0,0 +1,177 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <math.hpp>
+#include <af/traits.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+/**
+ * Minimal stand-in for round(), provided to avoid compilation
+ * issues with C90-based compilers that lack it; round() is only
+ * present in C99 and C++11.
+ *
+ * This is not a full-fledged implementation: it is only valid for
+ * non-negative values, and is used here solely for calculating
+ * dimensions of arrays.
+ */
+dim_t round2int(float value) { return (dim_t)(value + 0.5f); }
+
+using std::conditional;
+using std::is_same;
+
+template<typename T>
+using wtype_t =
+    typename conditional<is_same<T, double>::value, double, float>::type;
+
+template<typename T>
+using vtype_t =
+    typename conditional<common::is_complex<T>::value, T, wtype_t<T>>::type;
+
+template<typename T, af_interp_type method>
+struct resize_op {
+    void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims,
+                    const af::dim4 &idims, const af::dim4 &ostrides,
+                    const af::dim4 &istrides, const dim_t x, const dim_t y) {
+        UNUSED(outPtr);
+        UNUSED(inPtr);
+        UNUSED(odims);
+        UNUSED(idims);
+        UNUSED(ostrides);
+        UNUSED(istrides);
+        UNUSED(x);
+        UNUSED(y);
+        return;
+    }
+};
+
+template<typename T>
+struct resize_op<T, AF_INTERP_NEAREST> {
+    void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims,
+                    const af::dim4 &idims, const af::dim4 &ostrides,
+                    const af::dim4 &istrides, const dim_t x, const dim_t y) {
+        // Compute Indices
+        dim_t i_x = round2int((float)x / (odims[0] / (float)idims[0]));
+        dim_t i_y = round2int((float)y / (odims[1] / (float)idims[1]));
+
+        if (i_x >= idims[0]) i_x = idims[0] - 1;
+        if (i_y >= idims[1]) i_y = idims[1] - 1;
+
+        dim_t i_off = i_y * istrides[1] + i_x;
+        dim_t o_off = y * ostrides[1] + x;
+        // Copy values from all channels
+        for (dim_t w = 0; w < odims[3]; w++) {
+            dim_t wost = w * ostrides[3];
+            dim_t wist = w * istrides[3];
+            for (dim_t z = 0; z < odims[2]; z++) {
+                outPtr[o_off + z * ostrides[2] + wost] =
+                    inPtr[i_off + z * istrides[2] + wist];
+            }
+        }
+    }
+};
+
+template<typename T>
+struct resize_op<T, AF_INTERP_BILINEAR> {
+    void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims,
+                    const af::dim4 &idims, const af::dim4 &ostrides,
+                    const af::dim4 &istrides, const dim_t x, const dim_t y) {
+        // Compute Indices
+        float f_x = (float)x / (odims[0] / (float)idims[0]);
+        float f_y = (float)y / (odims[1] / (float)idims[1]);
+
+        dim_t i1_x = floor(f_x);
+        dim_t i1_y = floor(f_y);
+
+        if (i1_x >= idims[0]) i1_x = idims[0] - 1;
+        if (i1_y >= idims[1]) i1_y = idims[1] - 1;
+
+        float b = f_x - i1_x;
+        float a = f_y - i1_y;
+
+        dim_t i2_x = (i1_x + 1 >= idims[0] ? idims[0] - 1 : i1_x + 1);
+        dim_t i2_y = (i1_y + 1 >= idims[1] ? idims[1] - 1 : i1_y + 1);
+
+        typedef typename af::dtype_traits<T>::base_type BT;
+        typedef wtype_t<BT> WT;
+        typedef vtype_t<T> VT;
+
+        dim_t o_off = y * ostrides[1] + x;
+        // Copy values from all channels
+        for (dim_t w = 0; w < odims[3]; w++) {
+            dim_t wst = w * istrides[3];
+            for (dim_t z = 0; z < odims[2]; z++) {
+                dim_t zst         = z * istrides[2];
+                dim_t channel_off = zst + wst;
+                VT p1 = inPtr[i1_y * istrides[1] + i1_x + channel_off];
+                VT p2 = inPtr[i2_y * istrides[1] + i1_x + channel_off];
+                VT p3 = inPtr[i1_y * istrides[1] + i2_x + channel_off];
+                VT p4 = inPtr[i2_y * istrides[1] + i2_x + channel_off];
+
+                outPtr[o_off + z * ostrides[2] + w * ostrides[3]] =
+                    scalar<WT>((1.0f - a) * (1.0f - b)) * p1 +
+                    scalar<WT>((a) * (1.0f - b)) * p2 +
+                    scalar<WT>((1.0f - a) * (b)) * p3 +
+                    scalar<WT>((a) * (b)) * p4;
+            }
+        }
+    }
+};
+
+template<typename T>
+struct resize_op<T, AF_INTERP_LOWER> {
+    void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims,
+                    const af::dim4 &idims, const af::dim4 &ostrides,
+                    const af::dim4 &istrides, const dim_t x, const dim_t y) {
+        // Compute Indices
+        dim_t i_x = floor((float)x / (odims[0] / (float)idims[0]));
+        dim_t i_y = floor((float)y / (odims[1] / (float)idims[1]));
+
+        if (i_x >= idims[0]) i_x = idims[0] - 1;
+        if (i_y >= idims[1]) i_y = idims[1] - 1;
+
+        dim_t i_off = i_y * istrides[1] + i_x;
+        dim_t o_off = y * ostrides[1] + x;
+        // Copy values from all channels
+        for (dim_t w = 0; w < odims[3]; w++) {
+            dim_t wost = w * ostrides[3];
+            dim_t wist = w * istrides[3];
+            for (dim_t z = 0; z < odims[2]; z++) {
+                outPtr[o_off + z * ostrides[2] + wost] =
+                    inPtr[i_off + z * istrides[2] + wist];
+            }
+        }
+    }
+};
+
+template<typename T, af_interp_type method>
+void resize(Param<T> out, CParam<T> in) {
+    af::dim4 idims    = in.dims();
+    af::dim4 odims    = out.dims();
+    const T *inPtr    = in.get();
+    T *outPtr         = out.get();
+    af::dim4 ostrides = out.strides();
+    af::dim4 istrides = in.strides();
+
+    resize_op<T, method> op;
+    for (dim_t y = 0; y < odims[1]; y++) {
+        for (dim_t x = 0; x < odims[0]; x++) {
+            op(outPtr, inPtr, odims, idims, ostrides, istrides, x, y);
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/rotate.hpp b/src/backend/cpu/kernel/rotate.hpp
new file mode 100644
index 0000000000..67a34a9e71
--- /dev/null
+++ b/src/backend/cpu/kernel/rotate.hpp
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <math.hpp>
+#include <af/traits.hpp>
+#include "interp.hpp"
+
+using af::dtype_traits;
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, int order>
+void rotate(Param<T> output, CParam<T> input, const float theta,
+            af_interp_type method) {
+    typedef typename dtype_traits<T>::base_type BT;
+    typedef wtype_t<BT> WT;
+    Interp2<T, WT, order> interp;
+
+    const af::dim4 odims    = output.dims();
+    const af::dim4 idims    = input.dims();
+    const af::dim4 ostrides = output.strides();
+    const af::dim4 istrides = input.strides();
+
+    const float c = cos(-theta), s = sin(-theta);
+    float tx, ty;
+    {
+        const float nx = 0.5 * (idims[0] - 1);
+        const float ny = 0.5 * (idims[1] - 1);
+        const float mx = 0.5 * (odims[0] - 1);
+        const float my = 0.5 * (odims[1] - 1);
+        const float sx = (mx * c + my * -s);
+        const float sy = (mx * s + my * c);
+        tx             = -(sx - nx);
+        ty             = -(sy - ny);
+    }
+
+    const float tmat[6] = {
+        std::round(c * 1000) / 1000.0f,  std::round(-s * 1000) / 1000.0f,
+        std::round(tx * 1000) / 1000.0f, std::round(s * 1000) / 1000.0f,
+        std::round(c * 1000) / 1000.0f,  std::round(ty * 1000) / 1000.0f,
+    };
+
+    int nimages = odims[2];
+    T *out      = output.get();
+
+    for (int idw = 0; idw < (int)odims[3]; idw++) {
+        int out_offw = idw * ostrides[3];
+        int in_offw  = idw * istrides[3];
+
+        // Do transform for image
+        for (int idy = 0; idy < (int)odims[1]; idy++) {
+            for (int idx = 0; idx < (int)odims[0]; idx++) {
+                WT xidi = idx * tmat[0] + idy * tmat[1] + tmat[2];
+                WT yidi = idx * tmat[3] + idy * tmat[4] + tmat[5];
+
+                // Special conditions to deal with boundaries for bilinear and
+                // bicubic.
+                // FIXME: Ideally this condition should be removed or be present
+                // for all methods, but tests expect a different behavior for
+                // bilinear and nearest.
+                bool condX = xidi >= -0.0001 && xidi < idims[0];
+                bool condY = yidi >= -0.0001 && yidi < idims[1];
+                int ooff   = out_offw + idy * ostrides[1] + idx;
+                if (order == 1 || (condX && condY)) {
+                    // FIXME: Nearest and lower do not do clamping, but other
+                    // methods do. Make it consistent.
+                    bool clamp = order != 1;
+                    interp(output, ooff, input, in_offw, xidi, yidi, method,
+                           nimages, clamp);
+                } else {
+                    for (int n = 0; n < nimages; n++) {
+                        out[ooff + n * ostrides[2]] = scalar<T>(0);
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/scan.hpp b/src/backend/cpu/kernel/scan.hpp
new file mode 100644
index 0000000000..3ad4e04688
--- /dev/null
+++ b/src/backend/cpu/kernel/scan.hpp
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<af_op_t op, typename Ti, typename To, int D, bool inclusive_scan>
+struct scan_dim {
+    void operator()(Param<To> out, dim_t outOffset, CParam<Ti> in,
+                    dim_t inOffset, const int dim) const {
+        const af::dim4 odims    = out.dims();
+        const af::dim4 ostrides = out.strides();
+        const af::dim4 istrides = in.strides();
+
+        const int D1 = D - 1;
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            scan_dim<op, Ti, To, D1, inclusive_scan> func;
+            func(out, outOffset + i * ostrides[D1], in,
+                 inOffset + i * istrides[D1], dim);
+            if (D1 == dim) break;
+        }
+    }
+};
+
+template<af_op_t op, typename Ti, typename To, bool inclusive_scan>
+struct scan_dim<op, Ti, To, 0, inclusive_scan> {
+    void operator()(Param<To> output, dim_t outOffset, CParam<Ti> input,
+                    dim_t inOffset, const int dim) const {
+        const Ti* in = input.get() + inOffset;
+        To* out      = output.get() + outOffset;
+
+        const af::dim4 ostrides = output.strides();
+        const af::dim4 istrides = input.strides();
+        const af::dim4 idims    = input.dims();
+
+        dim_t istride = istrides[dim];
+        dim_t ostride = ostrides[dim];
+
+        common::Transform<Ti, To, op> transform;
+        // FIXME: Change the name to something better
+        common::Binary<To, op> scan;
+
+        To out_val = common::Binary<To, op>::init();
+        for (dim_t i = 0; i < idims[dim]; i++) {
+            To in_val = transform(in[i * istride]);
+            out_val   = scan(in_val, out_val);
+            if (!inclusive_scan) {
+                // The loop shifts the output index by 1.
+                // The last index wraps around and writes the first element.
+                if (i == (idims[dim] - 1)) {
+                    out[0] = common::Binary<To, op>::init();
+                } else {
+                    out[(i + 1) * ostride] = out_val;
+                }
+            } else {
+                out[i * ostride] = out_val;
+            }
+        }
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/scan_by_key.hpp b/src/backend/cpu/kernel/scan_by_key.hpp
new file mode 100644
index 0000000000..4639dfcda7
--- /dev/null
+++ b/src/backend/cpu/kernel/scan_by_key.hpp
@@ -0,0 +1,90 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<af_op_t op, typename Ti, typename Tk, typename To, int D>
+struct scan_dim_by_key {
+    bool inclusive_scan;
+    scan_dim_by_key(bool inclusiveScanKey) : inclusive_scan(inclusiveScanKey) {}
+
+    void operator()(Param<To> out, dim_t outOffset, CParam<Tk> key,
+                    dim_t keyOffset, CParam<Ti> in, dim_t inOffset,
+                    const int dim) const {
+        const af::dim4 odims    = out.dims();
+        const af::dim4 ostrides = out.strides();
+        const af::dim4 kstrides = key.strides();
+        const af::dim4 istrides = in.strides();
+
+        const int D1 = D - 1;
+        for (dim_t i = 0; i < odims[D1]; i++) {
+            scan_dim_by_key<op, Ti, Tk, To, D1> func(inclusive_scan);
+            func(out, outOffset + i * ostrides[D1], key,
+                 keyOffset + i * kstrides[D1], in, inOffset + i * istrides[D1],
+                 dim);
+            if (D1 == dim) break;
+        }
+    }
+};
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+struct scan_dim_by_key<op, Ti, Tk, To, 0> {
+    bool inclusive_scan;
+    scan_dim_by_key(bool inclusiveScanKey) : inclusive_scan(inclusiveScanKey) {}
+
+    void operator()(Param<To> output, dim_t outOffset, CParam<Tk> keyinput,
+                    dim_t keyOffset, CParam<Ti> input, dim_t inOffset,
+                    const int dim) const {
+        const Ti* in  = input.get() + inOffset;
+        const Tk* key = keyinput.get() + keyOffset;
+        To* out       = output.get() + outOffset;
+
+        const af::dim4 ostrides = output.strides();
+        const af::dim4 kstrides = keyinput.strides();
+        const af::dim4 istrides = input.strides();
+        const af::dim4 idims    = input.dims();
+
+        dim_t istride = istrides[dim];
+        dim_t kstride = kstrides[dim];
+        dim_t ostride = ostrides[dim];
+
+        common::Transform<Ti, To, op> transform;
+        // FIXME: Change the name to something better
+        common::Binary<To, op> scan;
+
+        To out_val = common::Binary<To, op>::init();
+        Tk key_val = key[0];
+
+        dim_t k = !inclusive_scan;
+        if (!inclusive_scan) { out[0] = common::Binary<To, op>::init(); }
+
+        for (dim_t i = 0; i < idims[dim] - (!inclusive_scan); i++, k++) {
+            To in_val = transform(in[i * istride]);
+            if (key[k * kstride] != key_val) {
+                out_val =
+                    !inclusive_scan ? common::Binary<To, op>::init() : in_val;
+                key_val = key[k * kstride];
+            } else {
+                out_val = scan(in_val, out_val);
+            }
+            out[k * ostride] = out_val;
+        }
+    }
+};
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/select.hpp b/src/backend/cpu/kernel/select.hpp
new file mode 100644
index 0000000000..dcc3c8855c
--- /dev/null
+++ b/src/backend/cpu/kernel/select.hpp
@@ -0,0 +1,124 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void select(Param<T> out, CParam<char> cond, CParam<T> a, CParam<T> b) {
+    af::dim4 adims    = a.dims();
+    af::dim4 astrides = a.strides();
+    af::dim4 bdims    = b.dims();
+    af::dim4 bstrides = b.strides();
+
+    af::dim4 cdims    = cond.dims();
+    af::dim4 cstrides = cond.strides();
+
+    af::dim4 odims    = out.dims();
+    af::dim4 ostrides = out.strides();
+
+    bool is_a_same[] = {adims[0] == odims[0], adims[1] == odims[1],
+                        adims[2] == odims[2], adims[3] == odims[3]};
+
+    bool is_b_same[] = {bdims[0] == odims[0], bdims[1] == odims[1],
+                        bdims[2] == odims[2], bdims[3] == odims[3]};
+
+    bool is_c_same[] = {cdims[0] == odims[0], cdims[1] == odims[1],
+                        cdims[2] == odims[2], cdims[3] == odims[3]};
+
+    const T *aptr    = a.get();
+    const T *bptr    = b.get();
+    T *optr          = out.get();
+    const char *cptr = cond.get();
+
+    for (int l = 0; l < odims[3]; l++) {
+        int o_off3 = ostrides[3] * l;
+        int a_off3 = astrides[3] * is_a_same[3] * l;
+        int b_off3 = bstrides[3] * is_b_same[3] * l;
+        int c_off3 = cstrides[3] * is_c_same[3] * l;
+
+        for (int k = 0; k < odims[2]; k++) {
+            int o_off2 = ostrides[2] * k + o_off3;
+            int a_off2 = astrides[2] * is_a_same[2] * k + a_off3;
+            int b_off2 = bstrides[2] * is_b_same[2] * k + b_off3;
+            int c_off2 = cstrides[2] * is_c_same[2] * k + c_off3;
+
+            for (int j = 0; j < odims[1]; j++) {
+                int o_off1 = ostrides[1] * j + o_off2;
+                int a_off1 = astrides[1] * is_a_same[1] * j + a_off2;
+                int b_off1 = bstrides[1] * is_b_same[1] * j + b_off2;
+                int c_off1 = cstrides[1] * is_c_same[1] * j + c_off2;
+
+                for (int i = 0; i < odims[0]; i++) {
+                    bool cval = is_c_same[0] ? cptr[c_off1 + i] : cptr[c_off1];
+                    T aval    = is_a_same[0] ? aptr[a_off1 + i] : aptr[a_off1];
+                    T bval    = is_b_same[0] ? bptr[b_off1 + i] : bptr[b_off1];
+                    T oval    = cval ? aval : bval;
+                    optr[o_off1 + i] = oval;
+                }
+            }
+        }
+    }
+}
+
+template<typename T, bool flip>
+void select_scalar(Param<T> out, CParam<char> cond, CParam<T> a, const T b) {
+    af::dim4 astrides = a.strides();
+    af::dim4 adims    = a.dims();
+    af::dim4 cstrides = cond.strides();
+    af::dim4 cdims    = cond.dims();
+
+    af::dim4 odims    = out.dims();
+    af::dim4 ostrides = out.strides();
+
+    const data_t<T> *aptr = a.get();
+    data_t<T> *optr       = out.get();
+    const char *cptr      = cond.get();
+
+    const compute_t<T> scalar = static_cast<compute_t<T>>(b);
+
+    bool is_a_same[] = {adims[0] == odims[0], adims[1] == odims[1],
+                        adims[2] == odims[2], adims[3] == odims[3]};
+
+    bool is_c_same[] = {cdims[0] == odims[0], cdims[1] == odims[1],
+                        cdims[2] == odims[2], cdims[3] == odims[3]};
+
+    for (int l = 0; l < odims[3]; l++) {
+        int o_off3 = ostrides[3] * l;
+        int a_off3 = astrides[3] * is_a_same[3] * l;
+        int c_off3 = cstrides[3] * is_c_same[3] * l;
+
+        for (int k = 0; k < odims[2]; k++) {
+            int o_off2 = ostrides[2] * k + o_off3;
+            int a_off2 = astrides[2] * is_a_same[2] * k + a_off3;
+            int c_off2 = cstrides[2] * is_c_same[2] * k + c_off3;
+
+            for (int j = 0; j < odims[1]; j++) {
+                int o_off1 = ostrides[1] * j + o_off2;
+                int a_off1 = astrides[1] * is_a_same[1] * j + a_off2;
+                int c_off1 = cstrides[1] * is_c_same[1] * j + c_off2;
+
+                for (int i = 0; i < odims[0]; i++) {
+                    bool cval = is_c_same[0] ? cptr[c_off1 + i] : cptr[c_off1];
+                    compute_t<T> aval = static_cast<compute_t<T>>(
+                        is_a_same[0] ? aptr[a_off1 + i] : aptr[a_off1]);
+                    optr[o_off1 + i] = (flip ^ cval) ? aval : scalar;
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/shift.hpp b/src/backend/cpu/kernel/shift.hpp
new file mode 100644
index 0000000000..223c3081a0
--- /dev/null
+++ b/src/backend/cpu/kernel/shift.hpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <cassert>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+static inline dim_t simple_mod(const dim_t i, const dim_t dim) {
+    return (i < dim) ? i : (i - dim);
+}
+
+template<typename T>
+void shift(Param<T> out, CParam<T> in, const af::dim4 sdims) {
+    T* outPtr      = out.get();
+    const T* inPtr = in.get();
+
+    const af::dim4 oDims = out.dims();
+    const af::dim4 ist   = in.strides();
+    const af::dim4 ost   = out.strides();
+
+    int sdims_[4];
+    // Need to do this because we are mapping output to input in the kernel
+    for (int i = 0; i < 4; i++) {
+        // sdims_[i] is always non-negative and lies in [0, oDims[i]].
+        // Negative shifts are converted to positive ones by going the other
+        // way round.
+        sdims_[i] = -(sdims[i] % (int)oDims[i]) + oDims[i] * (sdims[i] > 0);
+        assert(sdims_[i] >= 0 && sdims_[i] <= oDims[i]);
+    }
+
+    for (dim_t ow = 0; ow < oDims[3]; ow++) {
+        const int oW = ow * ost[3];
+        const int iw = simple_mod((ow + sdims_[3]), oDims[3]);
+        const int iW = iw * ist[3];
+        for (dim_t oz = 0; oz < oDims[2]; oz++) {
+            const int oZW = oW + oz * ost[2];
+            const int iz  = simple_mod((oz + sdims_[2]), oDims[2]);
+            const int iZW = iW + iz * ist[2];
+            for (dim_t oy = 0; oy < oDims[1]; oy++) {
+                const int oYZW = oZW + oy * ost[1];
+                const int iy   = simple_mod((oy + sdims_[1]), oDims[1]);
+                const int iYZW = iZW + iy * ist[1];
+                for (dim_t ox = 0; ox < oDims[0]; ox++) {
+                    const int oIdx = oYZW + ox;
+                    const int ix   = simple_mod((ox + sdims_[0]), oDims[0]);
+                    const int iIdx = iYZW + ix;
+
+                    outPtr[oIdx] = inPtr[iIdx];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/sift.hpp b/src/backend/cpu/kernel/sift.hpp
new file mode 100644
index 0000000000..ee1eb046a7
--- /dev/null
+++ b/src/backend/cpu/kernel/sift.hpp
@@ -0,0 +1,1057 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// The source code contained in this file is based on the original code by
+// Rob Hess. Please note that SIFT is an algorithm patented and protected
+// by US law. As of 29-Dec-2020, the patent stands expired. It can be looked
+// up here - https://patents.google.com/patent/US6711293B1/en
+
+#pragma once
+
+#include <convolve.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <resize.hpp>
+#include <sort_index.hpp>
+
+#include <cstring>
+#include <limits>
+#include <vector>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+
+static const float PI_VAL = 3.14159265358979323846f;
+
+// default width of descriptor histogram array
+static const int DescrWidth = 4;
+
+// default number of bins per histogram in descriptor array
+static const int DescrHistBins = 8;
+
+// assumed gaussian blur for input image
+static const float InitSigma = 0.5f;
+
+// width of border in which to ignore keypoints
+static const int ImgBorder = 5;
+
+// maximum steps of keypoint interpolation before failure
+static const int MaxInterpSteps = 5;
+
+// default number of bins in histogram for orientation assignment
+static const int OriHistBins = 36;
+
+// determines gaussian sigma for orientation assignment
+static const float OriSigFctr = 1.5f;
+
+// determines the radius of the region used in orientation assignment
+static const float OriRadius = 3.0f * OriSigFctr;
+
+// number of passes of orientation histogram smoothing
+static const int SmoothOriPasses = 2;
+
+// orientation magnitude relative to max that results in new feature
+static const float OriPeakRatio = 0.8f;
+
+// determines the size of a single descriptor orientation histogram
+static const float DescrSclFctr = 3.f;
+
+// threshold on magnitude of elements of descriptor vector
+static const float DescrMagThr = 0.2f;
+
+// factor used to convert floating-point descriptor to unsigned char
+static const float IntDescrFctr = 512.f;
+
+// Number of GLOH bins in radial direction
+static const unsigned GLOHRadialBins = 3;
+
+// Radii of GLOH descriptors
+static const float GLOHRadii[GLOHRadialBins] = {6.f, 11.f, 15.f};
+
+// Number of GLOH angular bins (excluding the inner-most radial section)
+static const unsigned GLOHAngularBins = 8;
+
+// Number of GLOH bins per histogram in descriptor
+static const unsigned GLOHHistBins = 16;
+
+typedef struct {
+    float f[4];
+    unsigned l;
+} feat_t;
+
+bool feat_cmp(feat_t i, feat_t j) {
+    for (int k = 0; k < 4; k++)
+        if (i.f[k] != j.f[k]) return (i.f[k] < j.f[k]);
+    if (i.l != j.l) return (i.l < j.l);
+
+    return false;
+}
+
+void array_to_feat(std::vector<feat_t>& feat, float* x, float* y,
+                   unsigned* layer, float* resp, float* size, unsigned nfeat) {
+    feat.resize(nfeat);
+    for (unsigned i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = resp[i];
+        feat[i].f[3] = size[i];
+        feat[i].l    = layer[i];
+    }
+}
+
+template<typename T>
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * PI_VAL * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+
+template<typename T>
+Array<T> gauss_filter(float sigma) {
+    // Using 6-sigma rule
+    unsigned gauss_len = std::min((unsigned)round(sigma * 6 + 1) | 1, 31u);
+
+    Array<T> filter = createEmptyArray<T>(gauss_len);
+    gaussian1D((T*)getDevicePtr(filter), gauss_len, sigma);
+
+    return filter;
+}
+
+template<int N>
+void gaussianElimination(float* A, float* b, float* x) {
+    // forward elimination
+    for (int i = 0; i < N - 1; i++) {
+        for (int j = i + 1; j < N; j++) {
+            float s = A[j * N + i] / A[i * N + i];
+
+            for (int k = i; k < N; k++) A[j * N + k] -= s * A[i * N + k];
+
+            b[j] -= s * b[i];
+        }
+    }
+
+    for (int i = 0; i < N; i++) x[i] = 0;
+
+    // backward substitution: solve the upper-triangular system from the
+    // last row upwards so that x[j] is known before it is used
+    for (int i = N - 1; i >= 0; i--) {
+        float sum = b[i];
+        for (int j = i + 1; j < N; j++) sum -= A[i * N + j] * x[j];
+        x[i] = sum / A[i * N + i];
+    }
+}
+
+template<typename T>
+void sub(Array<T>& out, const Array<T>& in1, const Array<T>& in2) {
+    size_t nel       = in1.elements();
+    T* out_ptr       = out.get();
+    const T* in1_ptr = in1.get();
+    const T* in2_ptr = in2.get();
+
+    for (size_t i = 0; i < nel; i++) { out_ptr[i] = in1_ptr[i] - in2_ptr[i]; }
+}
+
+#define CPTR(Y, X) (center_ptr[(Y)*idims[0] + (X)])
+#define PPTR(Y, X) (prev_ptr[(Y)*idims[0] + (X)])
+#define NPTR(Y, X) (next_ptr[(Y)*idims[0] + (X)])
+
+// Determines whether a pixel is a scale-space extremum by comparing it to its
+// 3x3x3 pixel neighborhood.
+template<typename T>
+void detectExtrema(float* x_out, float* y_out, unsigned* layer_out,
+                   unsigned* counter, const Array<T>& prev,
+                   const Array<T>& center, const Array<T>& next,
+                   const unsigned layer, const unsigned max_feat,
+                   const float threshold) {
+    const af::dim4 idims = center.dims();
+    const T* prev_ptr    = prev.get();
+    const T* center_ptr  = center.get();
+    const T* next_ptr    = next.get();
+
+    for (int y = ImgBorder; y < idims[1] - ImgBorder; y++) {
+        for (int x = ImgBorder; x < idims[0] - ImgBorder; x++) {
+            float p = center_ptr[y * idims[0] + x];
+
+            // Find extrema
+            if (abs((float)p) > threshold &&
+                ((p > 0 && p > CPTR(y - 1, x - 1) && p > CPTR(y - 1, x) &&
+                  p > CPTR(y - 1, x + 1) && p > CPTR(y, x - 1) &&
+                  p > CPTR(y, x + 1) && p > CPTR(y + 1, x - 1) &&
+                  p > CPTR(y + 1, x) && p > CPTR(y + 1, x + 1) &&
+                  p > PPTR(y - 1, x - 1) && p > PPTR(y - 1, x) &&
+                  p > PPTR(y - 1, x + 1) && p > PPTR(y, x - 1) &&
+                  p > PPTR(y, x) && p > PPTR(y, x + 1) &&
+                  p > PPTR(y + 1, x - 1) && p > PPTR(y + 1, x) &&
+                  p > PPTR(y + 1, x + 1) && p > NPTR(y - 1, x - 1) &&
+                  p > NPTR(y - 1, x) && p > NPTR(y - 1, x + 1) &&
+                  p > NPTR(y, x - 1) && p > NPTR(y, x) && p > NPTR(y, x + 1) &&
+                  p > NPTR(y + 1, x - 1) && p > NPTR(y + 1, x) &&
+                  p > NPTR(y + 1, x + 1)) ||
+                 (p < 0 && p < CPTR(y - 1, x - 1) && p < CPTR(y - 1, x) &&
+                  p < CPTR(y - 1, x + 1) && p < CPTR(y, x - 1) &&
+                  p < CPTR(y, x + 1) && p < CPTR(y + 1, x - 1) &&
+                  p < CPTR(y + 1, x) && p < CPTR(y + 1, x + 1) &&
+                  p < PPTR(y - 1, x - 1) && p < PPTR(y - 1, x) &&
+                  p < PPTR(y - 1, x + 1) && p < PPTR(y, x - 1) &&
+                  p < PPTR(y, x) && p < PPTR(y, x + 1) &&
+                  p < PPTR(y + 1, x - 1) && p < PPTR(y + 1, x) &&
+                  p < PPTR(y + 1, x + 1) && p < NPTR(y - 1, x - 1) &&
+                  p < NPTR(y - 1, x) && p < NPTR(y - 1, x + 1) &&
+                  p < NPTR(y, x - 1) && p < NPTR(y, x) && p < NPTR(y, x + 1) &&
+                  p < NPTR(y + 1, x - 1) && p < NPTR(y + 1, x) &&
+                  p < NPTR(y + 1, x + 1)))) {
+                if (*counter < max_feat) {
+                    x_out[*counter]     = (float)y;
+                    y_out[*counter]     = (float)x;
+                    layer_out[*counter] = layer;
+                    (*counter)++;
+                }
+            }
+        }
+    }
+}
+
+// Interpolates a scale-space extremum's location and scale to subpixel
+// accuracy to form an image feature. Rejects features with low contrast.
+// Based on Section 4 of Lowe's paper.
+template<typename T>
+void interpolateExtrema(float* x_out, float* y_out, unsigned* layer_out,
+                        float* response_out, float* size_out, unsigned* counter,
+                        const float* x_in, const float* y_in,
+                        const unsigned* layer_in, const unsigned extrema_feat,
+                        std::vector<Array<T>>& dog_pyr, const unsigned max_feat,
+                        const unsigned octave, const unsigned n_layers,
+                        const float contrast_thr, const float edge_thr,
+                        const float sigma, const float img_scale) {
+    for (int f = 0; f < (int)extrema_feat; f++) {
+        const float first_deriv_scale  = img_scale * 0.5f;
+        const float second_deriv_scale = img_scale;
+        const float cross_deriv_scale  = img_scale * 0.25f;
+
+        float xl = 0, xy = 0, xx = 0, contr = 0;
+        int i = 0;
+
+        unsigned x     = x_in[f];
+        unsigned y     = y_in[f];
+        unsigned layer = layer_in[f];
+
+        const T* prev_ptr = dog_pyr[octave * (n_layers + 2) + layer - 1].get();
+        const T* center_ptr = dog_pyr[octave * (n_layers + 2) + layer].get();
+        const T* next_ptr = dog_pyr[octave * (n_layers + 2) + layer + 1].get();
+
+        af::dim4 idims = dog_pyr[octave * (n_layers + 2)].dims();
+
+        bool converges = true;
+
+        for (i = 0; i < MaxInterpSteps; i++) {
+            float dD[3] = {
+                (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+                (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+                (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+
+            float d2 = CPTR(x, y) * 2.f;
+            float dxx =
+                (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+            float dyy =
+                (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+            float dss = (NPTR(x, y) + PPTR(x, y) - d2) * second_deriv_scale;
+            float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                         CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                        cross_deriv_scale;
+            float dxs = (NPTR(x + 1, y) - NPTR(x - 1, y) - PPTR(x + 1, y) +
+                         PPTR(x - 1, y)) *
+                        cross_deriv_scale;
+            float dys = (NPTR(x, y + 1) - NPTR(x, y - 1) - PPTR(x, y + 1) +
+                         PPTR(x, y - 1)) *
+                        cross_deriv_scale;
+
+            float H[9] = {dxx, dxy, dxs, dxy, dyy, dys, dxs, dys, dss};
+
+            float X[3];
+            gaussianElimination<3>(H, dD, X);
+
+            xl = -X[2];
+            xy = -X[1];
+            xx = -X[0];
+
+            if (fabs(xl) < 0.5f && fabs(xy) < 0.5f && fabs(xx) < 0.5f) break;
+
+            x += round(xx);
+            y += round(xy);
+            layer += round(xl);
+
+            if (layer < 1 || layer > n_layers || x < ImgBorder ||
+                x >= idims[1] - ImgBorder || y < ImgBorder ||
+                y >= idims[0] - ImgBorder) {
+                converges = false;
+                break;
+            }
+        }
+
+        // ensure convergence of interpolation
+        if (i >= MaxInterpSteps || !converges) continue;
+
+        float dD[3] = {
+            (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+            (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+            (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+        float X[3] = {xx, xy, xl};
+
+        float P = dD[0] * X[0] + dD[1] * X[1] + dD[2] * X[2];
+
+        contr = center_ptr[x * idims[0] + y] * img_scale + P * 0.5f;
+        if (abs(contr) < (contrast_thr / n_layers)) continue;
+
+        // principal curvatures are computed using the trace and det of Hessian
+        float d2  = CPTR(x, y) * 2.f;
+        float dxx = (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+        float dyy = (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+        float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                     CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                    cross_deriv_scale;
+
+        float tr  = dxx + dyy;
+        float det = dxx * dyy - dxy * dxy;
+
+        // add FLT_EPSILON for double-precision compatibility
+        if (det <= 0 ||
+            tr * tr * edge_thr >= (edge_thr + 1) * (edge_thr + 1) * det +
+                                      std::numeric_limits<float>::epsilon())
+            continue;
+
+        if (*counter < max_feat) {
+            x_out[*counter]        = (x + xx) * (1 << octave);
+            y_out[*counter]        = (y + xy) * (1 << octave);
+            layer_out[*counter]    = layer;
+            response_out[*counter] = abs(contr);
+            size_out[*counter] =
+                sigma * pow(2.f, octave + (layer + xl) / n_layers) * 2.f;
+            (*counter)++;
+        }
+    }
+}
+
+#undef CPTR
+#undef PPTR
+#undef NPTR
+
+// Remove duplicate keypoints
+void removeDuplicates(float* x_out, float* y_out, unsigned* layer_out,
+                      float* response_out, float* size_out, unsigned* counter,
+                      const std::vector<feat_t>& sorted_feat) {
+    size_t nfeat = sorted_feat.size();
+
+    for (size_t f = 0; f < nfeat; f++) {
+        float prec_fctr = 1e4f;
+
+        if (f < nfeat - 1) {
+            if (round(sorted_feat[f].f[0] * prec_fctr) ==
+                    round(sorted_feat[f + 1].f[0] * prec_fctr) &&
+                round(sorted_feat[f].f[1] * prec_fctr) ==
+                    round(sorted_feat[f + 1].f[1] * prec_fctr) &&
+                round(sorted_feat[f].f[2] * prec_fctr) ==
+                    round(sorted_feat[f + 1].f[2] * prec_fctr) &&
+                round(sorted_feat[f].f[3] * prec_fctr) ==
+                    round(sorted_feat[f + 1].f[3] * prec_fctr) &&
+                sorted_feat[f].l == sorted_feat[f + 1].l)
+                continue;
+        }
+
+        x_out[*counter]        = sorted_feat[f].f[0];
+        y_out[*counter]        = sorted_feat[f].f[1];
+        response_out[*counter] = sorted_feat[f].f[2];
+        size_out[*counter]     = sorted_feat[f].f[3];
+        layer_out[*counter]    = sorted_feat[f].l;
+        (*counter)++;
+    }
+}
+
+#define IPTR(Y, X) (img_ptr[(Y)*idims[0] + (X)])
+
+// Computes a canonical orientation for each image feature in an array.  Based
+// on Section 5 of Lowe's paper.  This function adds features to the array when
+// there is more than one dominant orientation at a given feature location.
+template<typename T>
+void calcOrientation(float* x_out, float* y_out, unsigned* layer_out,
+                     float* response_out, float* size_out, float* ori_out,
+                     unsigned* counter, const float* x_in, const float* y_in,
+                     const unsigned* layer_in, const float* response_in,
+                     const float* size_in, const unsigned total_feat,
+                     const std::vector<Array<T>>& gauss_pyr,
+                     const unsigned max_feat, const unsigned octave,
+                     const unsigned n_layers, const bool double_input) {
+    const int n = OriHistBins;
+
+    float hist[OriHistBins];
+    float temphist[OriHistBins];
+
+    for (unsigned f = 0; f < total_feat; f++) {
+        // Load keypoint information
+        const float real_x   = x_in[f];
+        const float real_y   = y_in[f];
+        const unsigned layer = layer_in[f];
+        const float response = response_in[f];
+        const float size     = size_in[f];
+
+        const int pt_x = (int)round(real_x / (1 << octave));
+        const int pt_y = (int)round(real_y / (1 << octave));
+
+        // Calculate auxiliary parameters
+        const float scl_octv  = size * 0.5f / (1 << octave);
+        const int radius      = (int)round(OriRadius * scl_octv);
+        const float sigma     = OriSigFctr * scl_octv;
+        const int len         = (radius * 2 + 1);
+        const float exp_denom = 2.f * sigma * sigma;
+
+        // Points img to correct Gaussian pyramid layer
+        const Array<T> img = gauss_pyr[octave * (n_layers + 3) + layer];
+        const T* img_ptr   = img.get();
+
+        for (int i = 0; i < OriHistBins; i++) hist[i] = 0.f;
+
+        af::dim4 idims = img.dims();
+
+        // Calculate orientation histogram
+        for (int l = 0; l < len * len; l++) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = pt_y + i;
+            int x = pt_x + j;
+            if (y < 1 || y >= idims[0] - 1 || x < 1 || x >= idims[1] - 1)
+                continue;
+
+            float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+            float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+            float mag = sqrt(dx * dx + dy * dy);
+            float ori = atan2(dy, dx);
+            float w   = exp(-(i * i + j * j) / exp_denom);
+
+            int bin = round(n * (ori + PI_VAL) / (2.f * PI_VAL));
+            bin     = bin < n ? bin : 0;
+
+            hist[bin] += w * mag;
+        }
+
+        for (int i = 0; i < SmoothOriPasses; i++) {
+            for (int j = 0; j < n; j++) { temphist[j] = hist[j]; }
+            for (int j = 0; j < n; j++) {
+                float prev = (j == 0) ? temphist[n - 1] : temphist[j - 1];
+                float next = (j + 1 == n) ? temphist[0] : temphist[j + 1];
+                hist[j]    = 0.25f * prev + 0.5f * temphist[j] + 0.25f * next;
+            }
+        }
+
+        float omax = hist[0];
+        for (int i = 1; i < n; i++) omax = max(omax, hist[i]);
+
+        float mag_thr = (float)(omax * OriPeakRatio);
+        int l, r;
+        for (int j = 0; j < n; j++) {
+            l = (j == 0) ? n - 1 : j - 1;
+            r = (j + 1) % n;
+            if (hist[j] > hist[l] && hist[j] > hist[r] && hist[j] >= mag_thr) {
+                if (*counter < max_feat) {
+                    float bin = j + 0.5f * (hist[l] - hist[r]) /
+                                        (hist[l] - 2.0f * hist[j] + hist[r]);
+                    bin = (bin < 0.0f) ? bin + n : (bin >= n) ? bin - n : bin;
+                    float ori = 360.f - ((360.f / n) * bin);
+
+                    float new_real_x = real_x;
+                    float new_real_y = real_y;
+                    float new_size   = size;
+
+                    if (double_input) {
+                        float scale = 0.5f;
+                        new_real_x *= scale;
+                        new_real_y *= scale;
+                        new_size *= scale;
+                    }
+
+                    x_out[*counter]        = new_real_x;
+                    y_out[*counter]        = new_real_y;
+                    layer_out[*counter]    = layer;
+                    response_out[*counter] = response;
+                    size_out[*counter]     = new_size;
+                    ori_out[*counter]      = ori;
+                    (*counter)++;
+                }
+            }
+        }
+    }
+}
+
+void normalizeDesc(float* desc, const int histlen) {
+    float len_sq = 0.0f;
+
+    for (int i = 0; i < histlen; i++) len_sq += desc[i] * desc[i];
+
+    float len_inv = (len_sq > 0.0f) ? 1.0f / sqrt(len_sq) : 0.0f;
+
+    for (int i = 0; i < histlen; i++) { desc[i] *= len_inv; }
+}
+
+// Computes feature descriptors for features in an array.  Based on Section 6
+// of Lowe's paper.
+template<typename T>
+void computeDescriptor(float* desc_out, const unsigned desc_len,
+                       const float* x_in, const float* y_in,
+                       const unsigned* layer_in, const float* response_in,
+                       const float* size_in, const float* ori_in,
+                       const unsigned total_feat,
+                       const std::vector<Array<T>>& gauss_pyr, const int d,
+                       const int n, const float scale, const unsigned octave,
+                       const unsigned n_layers) {
+    UNUSED(response_in);
+    float desc[128];
+
+    for (unsigned f = 0; f < total_feat; f++) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        // Points img to correct Gaussian pyramid layer
+        Array<T> img     = gauss_pyr[octave * (n_layers + 3) + layer];
+        const T* img_ptr = img.get();
+        af::dim4 idims   = img.dims();
+
+        float cos_t        = cos(ori);
+        float sin_t        = sin(ori);
+        float bins_per_rad = n / (PI_VAL * 2.f);
+        float exp_denom    = d * d * 0.5f;
+        float hist_width   = DescrSclFctr * size * scale * 0.5f;
+        int radius         = hist_width * sqrt(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        int len = radius * 2 + 1;
+
+        for (int i = 0; i < (int)desc_len; i++) desc[i] = 0.f;
+
+        // Calculate orientation histogram
+        for (int l = 0; l < len * len; l++) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t) / hist_width;
+            float y_rot = (j * sin_t + i * cos_t) / hist_width;
+            float xbin  = x_rot + d / 2 - 0.5f;
+            float ybin  = y_rot + d / 2 - 0.5f;
+
+            if (ybin > -1.0f && ybin < d && xbin > -1.0f && xbin < d && y > 0 &&
+                y < idims[0] - 1 && x > 0 && x < idims[1] - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrt(dx * dx + dy * dy);
+                float grad_ori = atan2(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-(x_rot * x_rot + y_rot * y_rot) / exp_denom);
+                float obin = grad_ori * bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int x0 = floor(xbin);
+                int y0 = floor(ybin);
+                int o0 = floor(obin);
+                xbin -= x0;
+                ybin -= y0;
+                obin -= o0;
+
+                for (int yl = 0; yl <= 1; yl++) {
+                    int yb = y0 + yl;
+                    if (yb >= 0 && yb < d) {
+                        float v_y = mag * ((yl == 0) ? 1.0f - ybin : ybin);
+                        for (int xl = 0; xl <= 1; xl++) {
+                            int xb = x0 + xl;
+                            if (xb >= 0 && xb < d) {
+                                float v_x =
+                                    v_y * ((xl == 0) ? 1.0f - xbin : xbin);
+                                for (int ol = 0; ol <= 1; ol++) {
+                                    int ob = (o0 + ol) % n;
+                                    float v_o =
+                                        v_x * ((ol == 0) ? 1.0f - obin : obin);
+                                    desc[(yb * d + xb) * n + ob] += v_o;
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+        }
+
+        normalizeDesc(desc, desc_len);
+
+        for (int i = 0; i < (int)desc_len; i++)
+            desc[i] = min(desc[i], DescrMagThr);
+
+        normalizeDesc(desc, desc_len);
+
+        // Calculate final descriptor values
+        for (int k = 0; k < (int)desc_len; k++) {
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[k] * IntDescrFctr));
+        }
+    }
+}
+
+// Computes GLOH feature descriptors for features in an array. Based on Section
+// III-B of the Mikolajczyk and Schmid paper.
+template<typename T>
+void computeGLOHDescriptor(float* desc_out, const unsigned desc_len,
+                           const float* x_in, const float* y_in,
+                           const unsigned* layer_in, const float* response_in,
+                           const float* size_in, const float* ori_in,
+                           const unsigned total_feat,
+                           const std::vector<Array<T>>& gauss_pyr, const int d,
+                           const unsigned rb, const unsigned ab,
+                           const unsigned hb, const float scale,
+                           const unsigned octave, const unsigned n_layers) {
+    UNUSED(response_in);
+    float desc[272];
+
+    for (unsigned f = 0; f < total_feat; f++) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        // Points img to correct Gaussian pyramid layer
+        Array<T> img     = gauss_pyr[octave * (n_layers + 3) + layer];
+        const T* img_ptr = img.get();
+        af::dim4 idims   = img.dims();
+
+        float cos_t              = cos(ori);
+        float sin_t              = sin(ori);
+        float hist_bins_per_rad  = hb / (PI_VAL * 2.f);
+        float polar_bins_per_rad = ab / (PI_VAL * 2.f);
+        float exp_denom          = GLOHRadii[rb - 1] * 0.5f;
+
+        float hist_width = DescrSclFctr * size * scale * 0.5f;
+
+        // Keep same descriptor radius used for SIFT
+        int radius = hist_width * sqrt(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        // Alternative radius size calculation: changing the radius weight
+        // (rw) in the range of 0.25f-0.75f gives different results.
+        // Increasing it tends to show a better recall rate, but with a
+        // smaller number of correct matches.
+        // float rw = 0.5f;
+        // int radius = hist_width * GLOHRadii[rb-1] * rw + 0.5f;
+
+        int len = radius * 2 + 1;
+
+        for (int i = 0; i < (int)desc_len; i++) desc[i] = 0.f;
+
+        // Calculate orientation histogram
+        for (int l = 0; l < len * len; l++) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t);
+            float y_rot = (j * sin_t + i * cos_t);
+
+            float r = sqrt(x_rot * x_rot + y_rot * y_rot) / radius *
+                      GLOHRadii[rb - 1];
+            float theta = atan2(y_rot, x_rot);
+            while (theta < 0.0f) theta += PI_VAL * 2;
+            while (theta >= PI_VAL * 2) theta -= PI_VAL * 2;
+
+            float tbin = theta * polar_bins_per_rad;
+            float rbin =
+                (r < GLOHRadii[0])
+                    ? r / GLOHRadii[0]
+                    : ((r < GLOHRadii[1])
+                           ? 1 + (r - GLOHRadii[0]) /
+                                     (float)(GLOHRadii[1] - GLOHRadii[0])
+                           : min(2 + (r - GLOHRadii[1]) /
+                                         (float)(GLOHRadii[2] - GLOHRadii[1]),
+                                 3.f - std::numeric_limits<float>::epsilon()));
+
+            if (r <= GLOHRadii[rb - 1] && y > 0 && y < idims[0] - 1 && x > 0 &&
+                x < idims[1] - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrt(dx * dx + dy * dy);
+                float grad_ori = atan2(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-r / exp_denom);
+                float obin = grad_ori * hist_bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int t0 = floor(tbin);
+                int r0 = floor(rbin);
+                int o0 = floor(obin);
+                tbin -= t0;
+                rbin -= r0;
+                obin -= o0;
+
+                for (int rl = 0; rl <= 1; rl++) {
+                    int r_idx = (rbin > 0.5f) ? (r0 + rl) : (r0 - rl);
+                    float v_r = mag * ((rl == 0) ? 1.0f - rbin : rbin);
+                    if (r_idx >= 0 && r_idx <= 2) {
+                        for (int tl = 0; tl <= 1; tl++) {
+                            int tb    = (t0 + tl) % ab;
+                            float v_t = v_r * ((tl == 0) ? 1.0f - tbin : tbin);
+                            for (int ol = 0; ol <= 1; ol++) {
+                                int ob = (o0 + ol) % hb;
+                                float v_o =
+                                    v_t * ((ol == 0) ? 1.0f - obin : obin);
+                                unsigned idx =
+                                    (r_idx > 0) *
+                                        (hb + ((r_idx - 1) * ab + tb) * hb) +
+                                    ob;
+                                desc[idx] += v_o;
+                            }
+                        }
+                    }
+                }
+            }
+        }
+
+        normalizeDesc(desc, desc_len);
+
+        for (int i = 0; i < (int)desc_len; i++)
+            desc[i] = min(desc[i], DescrMagThr);
+
+        normalizeDesc(desc, desc_len);
+
+        // Calculate final descriptor values
+        for (int k = 0; k < (int)desc_len; k++) {
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[k] * IntDescrFctr));
+        }
+    }
+}
+
+#undef IPTR
+
+template<typename T, typename convAccT>
+Array<T> createInitialImage(const Array<T>& img, const float init_sigma,
+                            const bool double_input) {
+    af::dim4 idims = img.dims();
+
+    Array<T> init_img = createEmptyArray<T>(af::dim4());
+
+    float s = (double_input) ? std::max((float)sqrt(init_sigma * init_sigma -
+                                                    InitSigma * InitSigma * 4),
+                                        0.1f)
+                             : std::max((float)sqrt(init_sigma * init_sigma -
+                                                    InitSigma * InitSigma),
+                                        0.1f);
+
+    Array<T> filter = gauss_filter<T>(s);
+
+    if (double_input) {
+        Array<T> double_img =
+            resize<T>(img, idims[0] * 2, idims[1] * 2, AF_INTERP_BILINEAR);
+        init_img = convolve2<T, convAccT>(double_img, filter, filter, false);
+    } else {
+        init_img = convolve2<T, convAccT>(img, filter, filter, false);
+    }
+
+    return init_img;
+}
+
+template<typename T, typename convAccT>
+std::vector<Array<T>> buildGaussPyr(const Array<T>& init_img,
+                                    const unsigned n_octaves,
+                                    const unsigned n_layers,
+                                    const float init_sigma) {
+    // Precompute Gaussian sigmas using the following formula:
+    // \sigma_{total}^2 = \sigma_{i}^2 + \sigma_{i-1}^2
+    std::vector<float> sig_layers(n_layers + 3);
+    sig_layers[0] = init_sigma;
+    float k       = std::pow(2.0f, 1.0f / n_layers);
+    for (unsigned i = 1; i < n_layers + 3; i++) {
+        float sig_prev  = std::pow(k, i - 1) * init_sigma;
+        float sig_total = sig_prev * k;
+        sig_layers[i] = std::sqrt(sig_total * sig_total - sig_prev * sig_prev);
+    }
+
+    // Gaussian Pyramid
+    std::vector<Array<T>> gauss_pyr(n_octaves * (n_layers + 3),
+                                    createEmptyArray<T>(af::dim4()));
+    for (unsigned o = 0; o < n_octaves; o++) {
+        for (unsigned l = 0; l < n_layers + 3; l++) {
+            unsigned src_idx = (l == 0) ? (o - 1) * (n_layers + 3) + n_layers
+                                        : o * (n_layers + 3) + l - 1;
+            unsigned idx     = o * (n_layers + 3) + l;
+
+            if (o == 0 && l == 0) {
+                gauss_pyr[idx] = init_img;
+            } else if (l == 0) {
+                af::dim4 sdims = gauss_pyr[src_idx].dims();
+                gauss_pyr[idx] = resize<T>(gauss_pyr[src_idx], sdims[0] / 2,
+                                           sdims[1] / 2, AF_INTERP_BILINEAR);
+            } else {
+                Array<T> filter = gauss_filter<T>(sig_layers[l]);
+
+                gauss_pyr[idx] = convolve2<T, convAccT>(gauss_pyr[src_idx],
+                                                        filter, filter, false);
+            }
+        }
+    }
+
+    return gauss_pyr;
+}
+
+template<typename T>
+std::vector<Array<T>> buildDoGPyr(std::vector<Array<T>>& gauss_pyr,
+                                  const unsigned n_octaves,
+                                  const unsigned n_layers) {
+    // DoG Pyramid
+    std::vector<Array<T>> dog_pyr(n_octaves * (n_layers + 2),
+                                  createEmptyArray<T>(af::dim4()));
+    for (unsigned o = 0; o < n_octaves; o++) {
+        for (unsigned l = 0; l < n_layers + 2; l++) {
+            unsigned idx    = o * (n_layers + 2) + l;
+            unsigned bottom = o * (n_layers + 3) + l;
+            unsigned top    = o * (n_layers + 3) + l + 1;
+
+            dog_pyr[idx] = createEmptyArray<T>(gauss_pyr[bottom].dims());
+
+            sub<T>(dog_pyr[idx], gauss_pyr[top], gauss_pyr[bottom]);
+        }
+    }
+
+    return dog_pyr;
+}
+
+template<typename T, typename convAccT>
+unsigned sift_impl(Array<float>& x, Array<float>& y, Array<float>& score,
+                   Array<float>& ori, Array<float>& size, Array<float>& desc,
+                   const Array<T>& in, const unsigned n_layers,
+                   const float contrast_thr, const float edge_thr,
+                   const float init_sigma, const bool double_input,
+                   const float img_scale, const float feature_ratio,
+                   const bool compute_GLOH) {
+    using std::function;
+    using std::unique_ptr;
+    using std::vector;
+    af::dim4 idims = in.dims();
+
+    unsigned min_dim = min(idims[0], idims[1]);
+    if (double_input) min_dim *= 2;
+
+    const unsigned n_octaves = floor(log(min_dim) / log(2)) - 2;
+
+    Array<T> init_img =
+        createInitialImage<T, convAccT>(in, init_sigma, double_input);
+
+    std::vector<Array<T>> gauss_pyr =
+        buildGaussPyr<T, convAccT>(init_img, n_octaves, n_layers, init_sigma);
+
+    std::vector<Array<T>> dog_pyr =
+        buildDoGPyr<T>(gauss_pyr, n_octaves, n_layers);
+
+    vector<uptr<float>> x_pyr(n_octaves);
+    vector<uptr<float>> y_pyr(n_octaves);
+    vector<uptr<float>> response_pyr(n_octaves);
+    vector<uptr<float>> size_pyr(n_octaves);
+    vector<uptr<float>> ori_pyr(n_octaves);
+    vector<uptr<float>> desc_pyr(n_octaves);
+    vector<unsigned> feat_pyr(n_octaves, 0);
+    unsigned total_feat = 0;
+
+    const unsigned d  = DescrWidth;
+    const unsigned n  = DescrHistBins;
+    const unsigned rb = GLOHRadialBins;
+    const unsigned ab = GLOHAngularBins;
+    const unsigned hb = GLOHHistBins;
+    const unsigned desc_len =
+        (compute_GLOH) ? (1 + (rb - 1) * ab) * hb : d * d * n;
+
+    for (unsigned i = 0; i < n_octaves; i++) {
+        af::dim4 ddims = dog_pyr[i * (n_layers + 2)].dims();
+        if (ddims[0] - 2 * ImgBorder < 1 || ddims[1] - 2 * ImgBorder < 1)
+            continue;
+
+        const unsigned imel     = ddims[0] * ddims[1];
+        const unsigned max_feat = ceil(imel * feature_ratio);
+
+        auto extrema_x        = memAlloc<float>(max_feat);
+        auto extrema_y        = memAlloc<float>(max_feat);
+        auto extrema_layer    = memAlloc<unsigned>(max_feat);
+        unsigned extrema_feat = 0;
+
+        for (unsigned j = 1; j <= n_layers; j++) {
+            unsigned prev   = i * (n_layers + 2) + j - 1;
+            unsigned center = i * (n_layers + 2) + j;
+            unsigned next   = i * (n_layers + 2) + j + 1;
+
+            unsigned layer = j;
+
+            float extrema_thr = 0.5f * contrast_thr / n_layers;
+            detectExtrema<T>(extrema_x.get(), extrema_y.get(),
+                             extrema_layer.get(), &extrema_feat, dog_pyr[prev],
+                             dog_pyr[center], dog_pyr[next], layer, max_feat,
+                             extrema_thr);
+        }
+
+        extrema_feat = min(extrema_feat, max_feat);
+
+        if (extrema_feat == 0) { continue; }
+
+        unsigned interp_feat = 0;
+
+        auto interp_x        = memAlloc<float>(extrema_feat);
+        auto interp_y        = memAlloc<float>(extrema_feat);
+        auto interp_layer    = memAlloc<unsigned>(extrema_feat);
+        auto interp_response = memAlloc<float>(extrema_feat);
+        auto interp_size     = memAlloc<float>(extrema_feat);
+
+        interpolateExtrema<T>(interp_x.get(), interp_y.get(),
+                              interp_layer.get(), interp_response.get(),
+                              interp_size.get(), &interp_feat, extrema_x.get(),
+                              extrema_y.get(), extrema_layer.get(),
+                              extrema_feat, dog_pyr, max_feat, i, n_layers,
+                              contrast_thr, edge_thr, init_sigma, img_scale);
+
+        interp_feat = min(interp_feat, max_feat);
+
+        if (interp_feat == 0) { continue; }
+
+        std::vector<feat_t> sorted_feat;
+        array_to_feat(sorted_feat, interp_x.get(), interp_y.get(),
+                      interp_layer.get(), interp_response.get(),
+                      interp_size.get(), interp_feat);
+        std::stable_sort(sorted_feat.begin(), sorted_feat.end(), feat_cmp);
+
+        unsigned nodup_feat = 0;
+
+        auto nodup_x        = memAlloc<float>(interp_feat);
+        auto nodup_y        = memAlloc<float>(interp_feat);
+        auto nodup_layer    = memAlloc<unsigned>(interp_feat);
+        auto nodup_response = memAlloc<float>(interp_feat);
+        auto nodup_size     = memAlloc<float>(interp_feat);
+
+        removeDuplicates(nodup_x.get(), nodup_y.get(), nodup_layer.get(),
+                         nodup_response.get(), nodup_size.get(), &nodup_feat,
+                         sorted_feat);
+
+        const unsigned max_oriented_feat = nodup_feat * 3;
+
+        auto oriented_x        = memAlloc<float>(max_oriented_feat);
+        auto oriented_y        = memAlloc<float>(max_oriented_feat);
+        auto oriented_layer    = memAlloc<unsigned>(max_oriented_feat);
+        auto oriented_response = memAlloc<float>(max_oriented_feat);
+        auto oriented_size     = memAlloc<float>(max_oriented_feat);
+        auto oriented_ori      = memAlloc<float>(max_oriented_feat);
+
+        unsigned oriented_feat = 0;
+
+        calcOrientation<T>(
+            oriented_x.get(), oriented_y.get(), oriented_layer.get(),
+            oriented_response.get(), oriented_size.get(), oriented_ori.get(),
+            &oriented_feat, nodup_x.get(), nodup_y.get(), nodup_layer.get(),
+            nodup_response.get(), nodup_size.get(), nodup_feat, gauss_pyr,
+            max_oriented_feat, i, n_layers, double_input);
+
+        if (oriented_feat == 0) { continue; }
+
+        auto desc = memAlloc<float>(oriented_feat * desc_len);
+
+        float scale = 1.f / (1 << i);
+        if (double_input) scale *= 2.f;
+
+        if (compute_GLOH)
+            computeGLOHDescriptor<T>(
+                desc.get(), desc_len, oriented_x.get(), oriented_y.get(),
+                oriented_layer.get(), oriented_response.get(),
+                oriented_size.get(), oriented_ori.get(), oriented_feat,
+                gauss_pyr, d, rb, ab, hb, scale, i, n_layers);
+        else
+            computeDescriptor<T>(desc.get(), desc_len, oriented_x.get(),
+                                 oriented_y.get(), oriented_layer.get(),
+                                 oriented_response.get(), oriented_size.get(),
+                                 oriented_ori.get(), oriented_feat, gauss_pyr,
+                                 d, n, scale, i, n_layers);
+
+        total_feat += oriented_feat;
+        feat_pyr[i] = oriented_feat;
+
+        if (oriented_feat > 0) {
+            x_pyr[i]        = std::move(oriented_x);
+            y_pyr[i]        = std::move(oriented_y);
+            response_pyr[i] = std::move(oriented_response);
+            ori_pyr[i]      = std::move(oriented_ori);
+            size_pyr[i]     = std::move(oriented_size);
+            desc_pyr[i]     = std::move(desc);
+        }
+    }
+
+    if (total_feat > 0) {
+        const af::dim4 total_feat_dims(total_feat);
+        const af::dim4 desc_dims(desc_len, total_feat);
+
+        // Allocate output memory
+        x     = createEmptyArray<float>(total_feat_dims);
+        y     = createEmptyArray<float>(total_feat_dims);
+        score = createEmptyArray<float>(total_feat_dims);
+        ori   = createEmptyArray<float>(total_feat_dims);
+        size  = createEmptyArray<float>(total_feat_dims);
+        desc  = createEmptyArray<float>(desc_dims);
+
+        float* x_ptr     = x.get();
+        float* y_ptr     = y.get();
+        float* score_ptr = score.get();
+        float* ori_ptr   = ori.get();
+        float* size_ptr  = size.get();
+        float* desc_ptr  = desc.get();
+
+        unsigned offset = 0;
+        for (unsigned i = 0; i < n_octaves; i++) {
+            if (feat_pyr[i] == 0) continue;
+
+            memcpy(x_ptr + offset, x_pyr[i].get(), feat_pyr[i] * sizeof(float));
+            memcpy(y_ptr + offset, y_pyr[i].get(), feat_pyr[i] * sizeof(float));
+            memcpy(score_ptr + offset, response_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+            memcpy(ori_ptr + offset, ori_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+            memcpy(size_ptr + offset, size_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+
+            memcpy(desc_ptr + (offset * desc_len), desc_pyr[i].get(),
+                   feat_pyr[i] * desc_len * sizeof(float));
+            offset += feat_pyr[i];
+        }
+    }
+
+    return total_feat;
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/sobel.hpp b/src/backend/cpu/kernel/sobel.hpp
new file mode 100644
index 0000000000..54315203d4
--- /dev/null
+++ b/src/backend/cpu/kernel/sobel.hpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+#include <cassert>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename Ti, typename To, bool isDX>
+void derivative(Param<To> output, CParam<Ti> input) {
+    const af::dim4 dims     = input.dims();
+    const af::dim4 istrides = input.strides();
+    const af::dim4 ostrides = output.strides();
+
+    auto reflect101 = [](int index, int endIndex) -> int {
+        return std::abs(endIndex - std::abs(endIndex - index));
+    };
+
+    for (dim_t b3 = 0; b3 < dims[3]; ++b3) {
+        To* optr       = output.get() + b3 * ostrides[3];
+        const Ti* iptr = input.get() + b3 * istrides[3];
+        for (dim_t b2 = 0; b2 < dims[2]; ++b2) {
+            for (dim_t j = 0; j < dims[1]; ++j) {
+                int joff    = j;
+                int _joff   = reflect101(j - 1, static_cast<int>(dims[1] - 1));
+                int joff_   = reflect101(j + 1, static_cast<int>(dims[1] - 1));
+                int joffset = j * ostrides[1];
+
+                for (dim_t i = 0; i < dims[0]; ++i) {
+                    To accum = To(0);
+
+                    int ioff = i;
+                    int _ioff =
+                        reflect101(i - 1, static_cast<int>(dims[0] - 1));
+                    int ioff_ =
+                        reflect101(i + 1, static_cast<int>(dims[0] - 1));
+
+                    To NW = iptr[_joff * istrides[1] + _ioff * istrides[0]];
+                    To SW = iptr[_joff * istrides[1] + ioff_ * istrides[0]];
+                    To NE = iptr[joff_ * istrides[1] + _ioff * istrides[0]];
+                    To SE = iptr[joff_ * istrides[1] + ioff_ * istrides[0]];
+
+                    if (isDX) {
+                        To N  = iptr[joff * istrides[1] + _ioff * istrides[0]];
+                        To S  = iptr[joff * istrides[1] + ioff_ * istrides[0]];
+                        accum = SW + SE - (NW + NE) + 2 * (S - N);
+                    } else {
+                        To W  = iptr[_joff * istrides[1] + ioff * istrides[0]];
+                        To E  = iptr[joff_ * istrides[1] + ioff * istrides[0]];
+                        accum = NE + SE - (NW + SW) + 2 * (E - W);
+                    }
+
+                    optr[joffset + i * ostrides[0]] = accum;
+                }
+            }
+
+            optr += ostrides[2];
+            iptr += istrides[2];
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
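
Review note on `sobel.hpp`: the border handling in `derivative()` mirrors out-of-range indices without repeating the edge sample. The following is a minimal standalone sketch of that idea, assuming a contiguous column-major 2D `float` image. `reflect101` copies the formula from the lambda in the diff; `sobelDim0` is a hypothetical helper name used only for illustration and covers just the `isDX == true` branch. The sketch also assumes each dimension has at least two samples, since the reflection formula returns index 1 when `endIndex` is 0.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

// Mirror an out-of-range index back into [0, endIndex] without repeating the
// edge sample. Same formula as the lambda in derivative().
inline int reflect101(int index, int endIndex) {
    return std::abs(endIndex - std::abs(endIndex - index));
}

// Hypothetical helper (not part of ArrayFire): the isDX == true branch of
// derivative() for a contiguous column-major d0 x d1 float image.
inline std::vector<float> sobelDim0(const std::vector<float>& in, int d0,
                                    int d1) {
    // The reflection formula needs at least two samples per dimension.
    assert(d0 >= 2 && d1 >= 2);
    assert(static_cast<int>(in.size()) == d0 * d1);

    std::vector<float> out(in.size());
    auto at = [&](int i, int j) { return in[j * d0 + i]; };

    for (int j = 0; j < d1; ++j) {
        const int jm = reflect101(j - 1, d1 - 1);
        const int jp = reflect101(j + 1, d1 - 1);
        for (int i = 0; i < d0; ++i) {
            const int im = reflect101(i - 1, d0 - 1);
            const int ip = reflect101(i + 1, d0 - 1);

            // Same neighbor naming as the kernel: N/S step along dim 0,
            // W/E step along dim 1.
            const float NW = at(im, jm), SW = at(ip, jm);
            const float NE = at(im, jp), SE = at(ip, jp);
            const float N  = at(im, j),  S  = at(ip, j);

            out[j * d0 + i] = SW + SE - (NW + NE) + 2 * (S - N);
        }
    }
    return out;
}
```

On an image whose value equals its dim-0 index, interior outputs of this sketch are constant and the mirrored border rows evaluate to zero.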
diff --git a/src/backend/cpu/kernel/sort.hpp b/src/backend/cpu/kernel/sort.hpp
new file mode 100644
index 0000000000..0e4c91aa56
--- /dev/null
+++ b/src/backend/cpu/kernel/sort.hpp
@@ -0,0 +1,49 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <math.hpp>
+#include <algorithm>
+#include <functional>
+#include <numeric>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+// Based off of http://stackoverflow.com/a/12399290
+template<typename T>
+void sort0Iterative(Param<T> val, bool isAscending) {
+    // Pointer to the first element; each dim-0 column is sorted in place
+    T *val_ptr = val.get();
+
+    std::function<bool(T, T)> op = std::greater<T>();
+    if (isAscending) { op = std::less<T>(); }
+
+    T *comp_ptr = nullptr;
+    for (dim_t w = 0; w < val.dims(3); w++) {
+        dim_t valW = w * val.strides(3);
+        for (dim_t z = 0; z < val.dims(2); z++) {
+            dim_t valWZ = valW + z * val.strides(2);
+            for (dim_t y = 0; y < val.dims(1); y++) {
+                dim_t valOffset = valWZ + y * val.strides(1);
+
+                comp_ptr = val_ptr + valOffset;
+                std::sort(comp_ptr, comp_ptr + val.dims(0), op);
+            }
+        }
+    }
+    return;
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
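
Review note on `sort.hpp`: `sort0Iterative()` walks the three outer dimensions and calls `std::sort` on each dim-0 column. A minimal sketch of the same pattern on a flat buffer follows, assuming contiguous column-major storage where every column has `d0` elements. `sortColumns` is a hypothetical name, not an ArrayFire function. Unlike the diff, the sketch passes the comparator object directly instead of going through `std::function`, which may give the compiler a better chance to inline the comparison; that is a design alternative, not a measured improvement.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of ArrayFire): sort every length-d0 column of
// a contiguous column-major buffer in place.
template<typename T>
void sortColumns(std::vector<T>& data, std::size_t d0, bool isAscending) {
    assert(d0 > 0 && data.size() % d0 == 0);

    const std::size_t columns = data.size() / d0;
    for (std::size_t c = 0; c < columns; ++c) {
        T* first = data.data() + c * d0;
        if (isAscending) {
            std::sort(first, first + d0, std::less<T>());
        } else {
            std::sort(first, first + d0, std::greater<T>());
        }
    }
}
```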
diff --git a/src/backend/cpu/kernel/sort_by_key.hpp b/src/backend/cpu/kernel/sort_by_key.hpp
new file mode 100644
index 0000000000..785a25b378
--- /dev/null
+++ b/src/backend/cpu/kernel/sort_by_key.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param<Tk> okey, Param<Tv> oval, bool isAscending);
+
+template<typename Tk, typename Tv>
+void sortByKeyBatched(Param<Tk> okey, Param<Tv> oval, const int dim,
+                      bool isAscending);
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param<Tk> okey, Param<Tv> oval, bool isAscending);
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/sort_by_key/CMakeLists.txt b/src/backend/cpu/kernel/sort_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..752501fabc
--- /dev/null
+++ b/src/backend/cpu/kernel/sort_by_key/CMakeLists.txt
@@ -0,0 +1,51 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp" FILESTRINGS)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_TYPES")
+        string(REPLACE "// SBK_TYPES:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_TYPES ${TEMP})
+    endif()
+endforeach()
+
+add_library(cpu_sort_by_key INTERFACE)
+foreach(SBK_TYPE ${SBK_TYPES})
+  add_library(cpu_sort_by_key_${SBK_TYPE} OBJECT
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp"
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key_impl.hpp"
+    )
+  set_target_properties(cpu_sort_by_key_${SBK_TYPE}
+    PROPERTIES
+      COMPILE_DEFINITIONS "TYPE=${SBK_TYPE};AFDLL;$<TARGET_PROPERTY:Boost::boost,INTERFACE_COMPILE_DEFINITIONS>"
+      CXX_STANDARD 17
+      CXX_EXTENSIONS OFF
+      CXX_VISIBILITY_PRESET hidden
+      FOLDER "Generated Targets")
+
+  arrayfire_set_default_cxx_flags(cpu_sort_by_key_${SBK_TYPE})
+
+  target_include_directories(cpu_sort_by_key_${SBK_TYPE}
+    PUBLIC
+      .
+      ../../api/c
+      ${ArrayFire_SOURCE_DIR}/include
+      ${ArrayFire_BINARY_DIR}/include
+    PRIVATE
+      ../common
+      ..
+      threads)
+
+  target_include_directories(cpu_sort_by_key_${SBK_TYPE}
+    SYSTEM PRIVATE
+      $<TARGET_PROPERTY:Boost::boost,INTERFACE_INCLUDE_DIRECTORIES>)
+
+  set_target_properties(cpu_sort_by_key_${SBK_TYPE} PROPERTIES POSITION_INDEPENDENT_CODE ON)
+  target_sources(cpu_sort_by_key
+    INTERFACE $<TARGET_OBJECTS:cpu_sort_by_key_${SBK_TYPE}>)
+endforeach(SBK_TYPE ${SBK_TYPES})
diff --git a/src/backend/cpu/kernel/sort_by_key/sort_by_key_impl.cpp b/src/backend/cpu/kernel/sort_by_key/sort_by_key_impl.cpp
new file mode 100644
index 0000000000..5873e93117
--- /dev/null
+++ b/src/backend/cpu/kernel/sort_by_key/sort_by_key_impl.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sort_by_key_impl.hpp>
+
+// SBK_TYPES:float double int uint intl uintl short ushort char schar uchar
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+INSTANTIATE1(TYPE)
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/sort_by_key_impl.hpp b/src/backend/cpu/kernel/sort_by_key_impl.hpp
new file mode 100644
index 0000000000..e77e868d78
--- /dev/null
+++ b/src/backend/cpu/kernel/sort_by_key_impl.hpp
@@ -0,0 +1,179 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <kernel/sort_helper.hpp>
+#include <math.hpp>
+#include <algorithm>
+#include <functional>
+#include <numeric>
+#include <queue>
+#include <tuple>
+#include <utility>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param<Tk> okey, Param<Tv> oval, bool isAscending) {
+    // Get pointers and initialize original index locations
+    Tk *okey_ptr = okey.get();
+    Tv *oval_ptr = oval.get();
+
+    typedef IndexPair<Tk, Tv> CurrentPair;
+
+    dim_t size = okey.dims(0);
+    std::vector<CurrentPair> pairKeyVal(size);
+
+    for (dim_t w = 0; w < okey.dims(3); w++) {
+        dim_t okeyW = w * okey.strides(3);
+        dim_t ovalW = w * oval.strides(3);
+
+        for (dim_t z = 0; z < okey.dims(2); z++) {
+            dim_t okeyWZ = okeyW + z * okey.strides(2);
+            dim_t ovalWZ = ovalW + z * oval.strides(2);
+
+            for (dim_t y = 0; y < okey.dims(1); y++) {
+                dim_t okeyOffset = okeyWZ + y * okey.strides(1);
+                dim_t ovalOffset = ovalWZ + y * oval.strides(1);
+
+                Tk *okey_col_ptr = okey_ptr + okeyOffset;
+                Tv *oval_col_ptr = oval_ptr + ovalOffset;
+
+                for (dim_t x = 0; x < size; x++) {
+                    pairKeyVal[x] =
+                        std::make_tuple(okey_col_ptr[x], oval_col_ptr[x]);
+                }
+
+                if (isAscending) {
+                    std::stable_sort(pairKeyVal.begin(), pairKeyVal.end(),
+                                     IPCompare<Tk, Tv, true>());
+                } else {
+                    std::stable_sort(pairKeyVal.begin(), pairKeyVal.end(),
+                                     IPCompare<Tk, Tv, false>());
+                }
+
+                for (dim_t x = 0; x < size; x++) {
+                    okey_ptr[okeyOffset + x] = std::get<0>(pairKeyVal[x]);
+                    oval_ptr[ovalOffset + x] = std::get<1>(pairKeyVal[x]);
+                }
+            }
+        }
+    }
+
+    return;
+}
+
+template<typename Tk, typename Tv>
+void sortByKeyBatched(Param<Tk> okey, Param<Tv> oval, const int dim,
+                      bool isAscending) {
+    af::dim4 inDims = okey.dims();
+
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    std::vector<uint> key(inDims.elements());
+    // IOTA
+    {
+        af::dim4 dims = inDims;
+        uint *out     = key.data();
+        af::dim4 strides(1);
+        for (int i = 1; i < 4; i++) strides[i] = strides[i - 1] * dims[i - 1];
+
+        for (dim_t w = 0; w < dims[3]; w++) {
+            dim_t offW = w * strides[3];
+            dim_t okeyW =
+                (w % seqDims[3]) * seqDims[0] * seqDims[1] * seqDims[2];
+            for (dim_t z = 0; z < dims[2]; z++) {
+                dim_t offWZ = offW + z * strides[2];
+                dim_t okeyZ =
+                    okeyW + (z % seqDims[2]) * seqDims[0] * seqDims[1];
+                for (dim_t y = 0; y < dims[1]; y++) {
+                    dim_t offWZY = offWZ + y * strides[1];
+                    dim_t okeyY  = okeyZ + (y % seqDims[1]) * seqDims[0];
+                    for (dim_t x = 0; x < dims[0]; x++) {
+                        dim_t id = offWZY + x;
+                        out[id]  = okeyY + (x % seqDims[0]);
+                    }
+                }
+            }
+        }
+    }
+
+    // initialize original index locations
+    Tk *okey_ptr = okey.get();
+    Tv *oval_ptr = oval.get();
+
+    typedef KeyIndexPair<Tk, Tv> CurrentTuple;
+    size_t size = okey.dims().elements();
+    std::vector<CurrentTuple> tupleKeyValIdx(size);
+
+    for (unsigned i = 0; i < size; i++) {
+        tupleKeyValIdx[i] = std::make_tuple(okey_ptr[i], oval_ptr[i], key[i]);
+    }
+
+    if (isAscending) {
+        std::stable_sort(tupleKeyValIdx.begin(), tupleKeyValIdx.end(),
+                         KIPCompareV<Tk, Tv, true>());
+    } else {
+        std::stable_sort(tupleKeyValIdx.begin(), tupleKeyValIdx.end(),
+                         KIPCompareV<Tk, Tv, false>());
+    }
+
+    std::stable_sort(tupleKeyValIdx.begin(), tupleKeyValIdx.end(),
+                     KIPCompareK<Tk, Tv, true>());
+
+    for (unsigned x = 0; x < okey.dims().elements(); x++) {
+        okey_ptr[x] = std::get<0>(tupleKeyValIdx[x]);
+        oval_ptr[x] = std::get<1>(tupleKeyValIdx[x]);
+    }
+}
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param<Tk> okey, Param<Tv> oval, bool isAscending) {
+    int higherDims = okey.dims(1) * okey.dims(2) * okey.dims(3);
+    // TODO Make a better heuristic
+    if (higherDims > 4)
+        kernel::sortByKeyBatched<Tk, Tv>(okey, oval, 0, isAscending);
+    else
+        kernel::sort0ByKeyIterative<Tk, Tv>(okey, oval, isAscending);
+}
+
+#define INSTANTIATE(Tk, Tv)                                                   \
+    template void sort0ByKey<Tk, Tv>(Param<Tk> okey, Param<Tv> oval,          \
+                                     bool isAscending);                       \
+    template void sort0ByKeyIterative<Tk, Tv>(Param<Tk> okey, Param<Tv> oval, \
+                                              bool isAscending);              \
+    template void sortByKeyBatched<Tk, Tv>(Param<Tk> okey, Param<Tv> oval,    \
+                                           const int dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
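
Review note on `sort_by_key_impl.hpp`: `sortByKeyBatched()` avoids one sort call per column by sorting the whole buffer twice with `std::stable_sort`, first by key and then by a per-element batch id, so that stability keeps the key order inside each batch. The sketch below illustrates that two-pass idea for the dim-0 case only, assuming contiguous column-major storage; the batch id is simply `i / d0`, which is a simplification of the iota block in the diff. `sortColumnsByKeyBatched` and `Entry` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical helper (not part of ArrayFire): sort keys within each length-d0
// column and carry the values along, using two stable sorts over all elements.
template<typename Tk, typename Tv>
void sortColumnsByKeyBatched(std::vector<Tk>& keys, std::vector<Tv>& vals,
                             std::size_t d0, bool isAscending) {
    assert(d0 > 0 && keys.size() == vals.size() && keys.size() % d0 == 0);

    using Entry = std::tuple<Tk, Tv, std::size_t>;  // key, value, column id
    std::vector<Entry> entries(keys.size());
    for (std::size_t i = 0; i < keys.size(); ++i) {
        entries[i] = std::make_tuple(keys[i], vals[i], i / d0);
    }

    // Pass 1: order all entries by key, ignoring which column they came from.
    std::stable_sort(entries.begin(), entries.end(),
                     [isAscending](const Entry& a, const Entry& b) {
                         return isAscending ? std::get<0>(a) < std::get<0>(b)
                                            : std::get<0>(a) > std::get<0>(b);
                     });

    // Pass 2: regroup by column id. Stability preserves the key order that
    // pass 1 established inside each column.
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) {
                         return std::get<2>(a) < std::get<2>(b);
                     });

    for (std::size_t i = 0; i < entries.size(); ++i) {
        keys[i] = std::get<0>(entries[i]);
        vals[i] = std::get<1>(entries[i]);
    }
}
```

Equal keys keep their original relative order within a column in both directions, because both passes are stable.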
diff --git a/src/backend/cpu/kernel/sort_helper.hpp b/src/backend/cpu/kernel/sort_helper.hpp
new file mode 100644
index 0000000000..ff301c0e0a
--- /dev/null
+++ b/src/backend/cpu/kernel/sort_helper.hpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_cpu.hpp>
+#include <tuple>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+template<typename Tk, typename Tv>
+using IndexPair = std::tuple<Tk, Tv>;
+
+template<typename Tk, typename Tv, bool isAscending>
+struct IPCompare {
+    bool operator()(const IndexPair<Tk, Tv> &lhs,
+                    const IndexPair<Tk, Tv> &rhs) {
+        // Check stable sort condition
+        Tk lhsVal = std::get<0>(lhs);
+        Tk rhsVal = std::get<0>(rhs);
+        if (isAscending)
+            return (lhsVal < rhsVal);
+        else
+            return (lhsVal > rhsVal);
+    }
+};
+
+template<typename Tk, typename Tv>
+using KeyIndexPair = std::tuple<Tk, Tv, uint>;
+
+template<typename Tk, typename Tv, bool isAscending>
+struct KIPCompareV {
+    bool operator()(const KeyIndexPair<Tk, Tv> &lhs,
+                    const KeyIndexPair<Tk, Tv> &rhs) {
+        // Check stable sort condition
+        Tk lhsVal = std::get<0>(lhs);
+        Tk rhsVal = std::get<0>(rhs);
+        if (isAscending)
+            return (lhsVal < rhsVal);
+        else
+            return (lhsVal > rhsVal);
+    }
+};
+
+template<typename Tk, typename Tv, bool isAscending>
+struct KIPCompareK {
+    bool operator()(const KeyIndexPair<Tk, Tv> &lhs,
+                    const KeyIndexPair<Tk, Tv> &rhs) {
+        uint lhsVal = std::get<2>(lhs);
+        uint rhsVal = std::get<2>(rhs);
+        if (isAscending)
+            return (lhsVal < rhsVal);
+        else
+            return (lhsVal > rhsVal);
+    }
+};
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/sparse.hpp b/src/backend/cpu/kernel/sparse.hpp
new file mode 100644
index 0000000000..9cf8074d80
--- /dev/null
+++ b/src/backend/cpu/kernel/sparse.hpp
@@ -0,0 +1,177 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <kernel/sort_helper.hpp>
+#include <math.hpp>
+#include <utility.hpp>
+#include <algorithm>
+#include <tuple>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void coo2dense(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+               CParam<int> colIdx) {
+    const T *vPtr   = values.get();
+    const int *rPtr = rowIdx.get();
+    const int *cPtr = colIdx.get();
+
+    T *outPtr = output.get();
+
+    af::dim4 ostrides = output.strides();
+
+    int nNZ = values.dims(0);
+    for (int i = 0; i < nNZ; i++) {
+        T v   = vPtr[i];
+        int r = rPtr[i];
+        int c = cPtr[i];
+
+        int offset = r + c * ostrides[1];
+
+        outPtr[offset] = v;
+    }
+}
+
+template<typename T>
+void dense2csr(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+               CParam<T> in) {
+    const T *iPtr = in.get();
+    T *vPtr       = values.get();
+    int *rPtr     = rowIdx.get();
+    int *cPtr     = colIdx.get();
+
+    int stride    = in.strides(1);
+    af::dim4 dims = in.dims();
+
+    int offset = 0;
+    for (int i = 0; i < dims[0]; ++i) {
+        rPtr[i] = offset;
+        for (int j = 0; j < dims[1]; ++j) {
+            if (iPtr[j * stride + i] != scalar<T>(0)) {
+                vPtr[offset]   = iPtr[j * stride + i];
+                cPtr[offset++] = j;
+            }
+        }
+    }
+    rPtr[dims[0]] = offset;
+}
+
+template<typename T>
+void csr2dense(Param<T> out, CParam<T> values, CParam<int> rowIdx,
+               CParam<int> colIdx) {
+    T *oPtr         = out.get();
+    const T *vPtr   = values.get();
+    const int *rPtr = rowIdx.get();
+    const int *cPtr = colIdx.get();
+
+    int stride = out.strides(1);
+
+    int r = rowIdx.dims(0);
+    for (int i = 0; i < r - 1; i++) {
+        for (int ii = rPtr[i]; ii < rPtr[i + 1]; ++ii) {
+            int j                = cPtr[ii];
+            T v                  = vPtr[ii];
+            oPtr[j * stride + i] = v;
+        }
+    }
+}
+
+// Modified code from sort helper
+template<typename T>
+using SpKeyIndexPair =
+    std::tuple<int, T, int>;  // sorting index, value, other index
+
+template<typename T>
+struct SpKIPCompareK {
+    bool operator()(const SpKeyIndexPair<T> &lhs,
+                    const SpKeyIndexPair<T> &rhs) {
+        int lhsVal = std::get<0>(lhs);
+        int rhsVal = std::get<0>(rhs);
+        // Always returns ascending
+        return (lhsVal < rhsVal);
+    }
+};
+
+template<typename T>
+void csr2coo(Param<T> ovalues, Param<int> orowIdx, Param<int> ocolIdx,
+             CParam<T> ivalues, CParam<int> irowIdx, CParam<int> icolIdx) {
+    // First calculate the linear index
+    T *ovPtr   = ovalues.get();
+    int *orPtr = orowIdx.get();
+    int *ocPtr = ocolIdx.get();
+
+    const T *ivPtr   = ivalues.get();
+    const int *irPtr = irowIdx.get();
+    const int *icPtr = icolIdx.get();
+
+    // Create coordinate form of the row array
+    for (int i = 0; i < (int)irowIdx.dims().elements() - 1; i++) {
+        std::fill_n(orPtr + irPtr[i], irPtr[i + 1] - irPtr[i], i);
+    }
+
+    // Sort the coordinate form using column index
+    // Uses code from sort_by_key kernels
+    typedef SpKeyIndexPair<T> CurrentPair;
+    int size = ovalues.dims(0);
+    std::vector<CurrentPair> pairKeyVal(size);
+
+    for (int x = 0; x < size; x++) {
+        pairKeyVal[x] = std::make_tuple(icPtr[x], ivPtr[x], orPtr[x]);
+    }
+
+    std::stable_sort(pairKeyVal.begin(), pairKeyVal.end(), SpKIPCompareK<T>());
+
+    for (int x = 0; x < (int)ovalues.dims().elements(); x++) {
+        std::tie(ocPtr[x], ovPtr[x], orPtr[x]) = pairKeyVal[x];
+    }
+}
+
+template<typename T>
+void coo2csr(Param<T> ovalues, Param<int> orowIdx, Param<int> ocolIdx,
+             CParam<T> ivalues, CParam<int> irowIdx, CParam<int> icolIdx) {
+    T *ovPtr   = ovalues.get();
+    int *orPtr = orowIdx.get();
+    int *ocPtr = ocolIdx.get();
+
+    const T *ivPtr   = ivalues.get();
+    const int *irPtr = irowIdx.get();
+    const int *icPtr = icolIdx.get();
+
+    // Sort the colidx and values based on rowIdx
+    // Uses code from sort_by_key kernels
+    typedef SpKeyIndexPair<T> CurrentPair;
+    int size = ovalues.dims(0);
+    std::vector<CurrentPair> pairKeyVal(size);
+
+    for (int x = 0; x < size; x++) {
+        pairKeyVal[x] = std::make_tuple(irPtr[x], ivPtr[x], icPtr[x]);
+    }
+
+    std::stable_sort(pairKeyVal.begin(), pairKeyVal.end(), SpKIPCompareK<T>());
+
+    orPtr[0] = 0;
+    for (int x = 0; x < (int)ovalues.dims().elements(); x++) {
+        int row = -2;  // Some value that will make orPtr[row + 1] error out
+        std::tie(row, ovPtr[x], ocPtr[x]) = pairKeyVal[x];
+        orPtr[row + 1]++;
+    }
+
+    // Compress row storage
+    for (int x = 1; x < (int)orowIdx.dims().elements(); x++) {
+        orPtr[x] += orPtr[x - 1];
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
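
Review note on `sparse.hpp`: `coo2csr()` sorts the COO triplets by row, counts the entries of each row into slot `row + 1`, and then turns the counts into row offsets with a running sum. A minimal standalone sketch of those three steps follows, assuming `float` values and a row-pointer array that starts zeroed. `CsrMatrix` and `cooToCsr` are hypothetical names introduced for illustration and are not ArrayFire types or functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical container for the three CSR arrays (not an ArrayFire type).
struct CsrMatrix {
    std::vector<float> values;
    std::vector<int> rowPtr;  // nRows + 1 entries
    std::vector<int> colIdx;
};

// Hypothetical helper: convert COO triplets to CSR.
inline CsrMatrix cooToCsr(const std::vector<float>& vals,
                          const std::vector<int>& rows,
                          const std::vector<int>& cols, int nRows) {
    assert(vals.size() == rows.size() && vals.size() == cols.size());

    // Step 1: stable sort by row so that entries of a row become contiguous.
    using Entry = std::tuple<int, float, int>;  // row, value, column
    std::vector<Entry> entries(vals.size());
    for (std::size_t i = 0; i < vals.size(); ++i) {
        entries[i] = std::make_tuple(rows[i], vals[i], cols[i]);
    }
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) {
                         return std::get<0>(a) < std::get<0>(b);
                     });

    CsrMatrix out;
    out.values.resize(vals.size());
    out.colIdx.resize(vals.size());
    out.rowPtr.assign(static_cast<std::size_t>(nRows) + 1, 0);

    // Step 2: copy values and columns, counting entries per row in row + 1.
    for (std::size_t i = 0; i < entries.size(); ++i) {
        int row = -1;
        std::tie(row, out.values[i], out.colIdx[i]) = entries[i];
        assert(row >= 0 && row < nRows);
        out.rowPtr[row + 1]++;
    }

    // Step 3: running sum turns per-row counts into row start offsets.
    for (int r = 1; r <= nRows; ++r) { out.rowPtr[r] += out.rowPtr[r - 1]; }
    return out;
}
```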
diff --git a/src/backend/cpu/kernel/sparse_arith.hpp b/src/backend/cpu/kernel/sparse_arith.hpp
new file mode 100644
index 0000000000..07eae80aca
--- /dev/null
+++ b/src/backend/cpu/kernel/sparse_arith.hpp
@@ -0,0 +1,227 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <math.hpp>
+
+#include <cmath>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, af_op_t op>
+struct arith_op {
+    T operator()(T v1, T v2) {
+        UNUSED(v1);
+        UNUSED(v2);
+        return scalar<T>(0);
+    }
+};
+
+template<typename T>
+struct arith_op<T, af_add_t> {
+    T operator()(T v1, T v2) { return v1 + v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_sub_t> {
+    T operator()(T v1, T v2) { return v1 - v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_mul_t> {
+    T operator()(T v1, T v2) { return v1 * v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_div_t> {
+    T operator()(T v1, T v2) { return v1 / v2; }
+};
+
+template<typename T, af_op_t op, af_storage type>
+void sparseArithOpD(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+                    CParam<int> colIdx, CParam<T> rhs,
+                    const bool reverse = false) {
+    T *oPtr       = output.get();
+    const T *hPtr = rhs.get();
+
+    const T *vPtr   = values.get();
+    const int *rPtr = rowIdx.get();
+    const int *cPtr = colIdx.get();
+
+    // Dense output geometry
+    dim4 odims    = output.dims();
+    dim4 ostrides = output.strides();
+    // Dense right-hand-side strides
+    dim4 hstrides = rhs.strides();
+
+    std::vector<int> temp;
+    if (type == AF_STORAGE_CSR) {
+        temp.resize(values.dims().elements());
+        for (int i = 0; i < rowIdx.dims(0) - 1; i++) {
+            for (int ii = rPtr[i]; ii < rPtr[i + 1]; ii++) { temp[ii] = i; }
+        }
+        //} else if(type == AF_STORAGE_CSC) {   // For future
+    }
+
+    const int *xx = (type == AF_STORAGE_CSR) ? temp.data() : rPtr;
+    const int *yy = (type == AF_STORAGE_CSC) ? temp.data() : cPtr;
+
+    for (int i = 0; i < (int)values.dims().elements(); i++) {
+        // Bad index data
+        if (xx[i] >= odims[0] || yy[i] >= odims[1]) continue;
+
+        int offset = xx[i] + yy[i] * ostrides[1];
+        int hoff   = xx[i] + yy[i] * hstrides[1];
+
+        if (reverse)
+            oPtr[offset] = arith_op<T, op>()(hPtr[hoff], vPtr[i]);
+        else
+            oPtr[offset] = arith_op<T, op>()(vPtr[i], hPtr[hoff]);
+    }
+}
+
+template<typename T, af_op_t op, af_storage type>
+void sparseArithOpS(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+                    CParam<T> rhs, const bool reverse = false) {
+    T *vPtr         = values.get();
+    const int *rPtr = rowIdx.get();
+    const int *cPtr = colIdx.get();
+
+    const T *hPtr = rhs.get();
+
+    dim4 dims     = rhs.dims();
+    dim4 hstrides = rhs.strides();
+
+    std::vector<int> temp;
+    if (type == AF_STORAGE_CSR) {
+        temp.resize(values.dims().elements());
+        for (int i = 0; i < rowIdx.dims(0) - 1; i++) {
+            for (int ii = rPtr[i]; ii < rPtr[i + 1]; ii++) { temp[ii] = i; }
+        }
+        //} else if(type == AF_STORAGE_CSC) {   // For future
+    }
+
+    const int *xx = (type == AF_STORAGE_CSR) ? temp.data() : rPtr;
+    const int *yy = (type == AF_STORAGE_CSC) ? temp.data() : cPtr;
+
+    for (int i = 0; i < (int)values.dims().elements(); i++) {
+        // Bad index data
+        if (xx[i] >= dims[0] || yy[i] >= dims[1]) continue;
+
+        int hoff = xx[i] + yy[i] * hstrides[1];
+
+        if (reverse)
+            vPtr[i] = arith_op<T, op>()(hPtr[hoff], vPtr[i]);
+        else
+            vPtr[i] = arith_op<T, op>()(vPtr[i], hPtr[hoff]);
+    }
+}
+
+// The following functions can handle CSR
+// storage format only as of now.
+static void calcOutNNZ(Param<int> outRowIdx, const uint M, const uint N,
+                       CParam<int> lRowIdx, CParam<int> lColIdx,
+                       CParam<int> rRowIdx, CParam<int> rColIdx) {
+    UNUSED(N);
+    int *orPtr       = outRowIdx.get();
+    const int *lrPtr = lRowIdx.get();
+    const int *lcPtr = lColIdx.get();
+    const int *rrPtr = rRowIdx.get();
+    const int *rcPtr = rColIdx.get();
+
+    unsigned csrOutCount = 0;
+    for (uint row = 0; row < M; ++row) {
+        const int lEnd = lrPtr[row + 1];
+        const int rEnd = rrPtr[row + 1];
+
+        uint rowNNZ = 0;
+        int l       = lrPtr[row];
+        int r       = rrPtr[row];
+        while (l < lEnd && r < rEnd) {
+            int lci = lcPtr[l];
+            int rci = rcPtr[r];
+
+            l += (lci <= rci);
+            r += (lci >= rci);
+            rowNNZ++;
+        }
+        // Elements from lhs or rhs are exhausted.
+        // Just count left over elements
+        rowNNZ += (lEnd - l);
+        rowNNZ += (rEnd - r);
+
+        orPtr[row] = csrOutCount;
+        csrOutCount += rowNNZ;
+    }
+    // Write out the Rows+1 entry
+    orPtr[M] = csrOutCount;
+}
+
+template<typename T, af_op_t op>
+void sparseArithOp(Param<T> oVals, Param<int> oColIdx, CParam<int> oRowIdx,
+                   const uint Rows, CParam<T> lvals, CParam<int> lRowIdx,
+                   CParam<int> lColIdx, CParam<T> rvals, CParam<int> rRowIdx,
+                   CParam<int> rColIdx) {
+    const int *orPtr = oRowIdx.get();
+    const T *lvPtr   = lvals.get();
+    const int *lrPtr = lRowIdx.get();
+    const int *lcPtr = lColIdx.get();
+    const T *rvPtr   = rvals.get();
+    const int *rrPtr = rRowIdx.get();
+    const int *rcPtr = rColIdx.get();
+
+    arith_op<T, op> binOp;
+
+    auto ZERO = scalar<T>(0);
+
+    for (uint row = 0; row < Rows; ++row) {
+        const int lEnd = lrPtr[row + 1];
+        const int rEnd = rrPtr[row + 1];
+        const int offs = orPtr[row];
+
+        T *ovPtr   = oVals.get() + offs;
+        int *ocPtr = oColIdx.get() + offs;
+
+        uint rowNNZ = 0;
+        int l       = lrPtr[row];
+        int r       = rrPtr[row];
+        while (l < lEnd && r < rEnd) {
+            int lci = lcPtr[l];
+            int rci = rcPtr[r];
+
+            T lhs = (lci <= rci ? lvPtr[l] : ZERO);
+            T rhs = (lci >= rci ? rvPtr[r] : ZERO);
+
+            ovPtr[rowNNZ] = binOp(lhs, rhs);
+            ocPtr[rowNNZ] = (lci <= rci) ? lci : rci;
+
+            l += (lci <= rci);
+            r += (lci >= rci);
+            rowNNZ++;
+        }
+        while (l < lEnd) {
+            ovPtr[rowNNZ] = binOp(lvPtr[l], ZERO);
+            ocPtr[rowNNZ] = lcPtr[l];
+            l++;
+            rowNNZ++;
+        }
+        while (r < rEnd) {
+            ovPtr[rowNNZ] = binOp(ZERO, rvPtr[r]);
+            ocPtr[rowNNZ] = rcPtr[r];
+            r++;
+            rowNNZ++;
+        }
+    }
+}
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
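
Review note on `sparse_arith.hpp`: both `calcOutNNZ()` and `sparseArithOp()` walk the left and right CSR rows with two cursors, emitting one output entry per distinct column and treating a missing operand as zero. The sketch below shows that merge for a single row and for addition only, assuming the column indices of each row are sorted in ascending order (the merge in the diff appears to rely on the same ordering). `SparseRow` and `addSparseRows` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-row view of a CSR matrix (not an ArrayFire type).
struct SparseRow {
    std::vector<int> cols;
    std::vector<float> vals;
};

// Hypothetical helper: add two sparse rows with a two-cursor merge.
inline SparseRow addSparseRows(const SparseRow& lhs, const SparseRow& rhs) {
    assert(lhs.cols.size() == lhs.vals.size());
    assert(rhs.cols.size() == rhs.vals.size());
    assert(std::is_sorted(lhs.cols.begin(), lhs.cols.end()));
    assert(std::is_sorted(rhs.cols.begin(), rhs.cols.end()));

    SparseRow out;
    std::size_t l = 0, r = 0;
    while (l < lhs.cols.size() && r < rhs.cols.size()) {
        const int lci = lhs.cols[l];
        const int rci = rhs.cols[r];

        // The side with the larger column contributes zero for this entry.
        const float lv = (lci <= rci) ? lhs.vals[l] : 0.0f;
        const float rv = (lci >= rci) ? rhs.vals[r] : 0.0f;

        out.cols.push_back(std::min(lci, rci));
        out.vals.push_back(lv + rv);

        // Advance whichever cursor was consumed; both on a column match.
        l += (lci <= rci);
        r += (lci >= rci);
    }
    // One side is exhausted; copy what is left of the other.
    for (; l < lhs.cols.size(); ++l) {
        out.cols.push_back(lhs.cols[l]);
        out.vals.push_back(lhs.vals[l]);
    }
    for (; r < rhs.cols.size(); ++r) {
        out.cols.push_back(rhs.cols[r]);
        out.vals.push_back(rhs.vals[r]);
    }
    return out;
}
```

The output size equals the number of distinct columns across both rows, which is the quantity the counting pass needs before any values are written.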
diff --git a/src/backend/cpu/kernel/susan.hpp b/src/backend/cpu/kernel/susan.hpp
new file mode 100644
index 0000000000..161f185f8b
--- /dev/null
+++ b/src/backend/cpu/kernel/susan.hpp
@@ -0,0 +1,98 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void susan_responses(Param<T> output, CParam<T> input, const dim_t idim0,
+                     const dim_t idim1, const int radius, const float t,
+                     const float g, const unsigned border_len) {
+    T* resp_out = output.get();
+    const T* in = input.get();
+
+    const unsigned r = border_len;
+    const int rSqrd  = radius * radius;
+
+    for (dim_t y = r; y < idim1 - r; ++y) {
+        for (dim_t x = r; x < idim0 - r; ++x) {
+            const dim_t idx = y * idim0 + x;
+            T m_0           = in[idx];
+            float nM        = 0.0f;
+
+            for (int i = -radius; i <= radius; ++i) {
+                for (int j = -radius; j <= radius; ++j) {
+                    if (i * i + j * j < rSqrd) {
+                        int p         = x + i;
+                        int q         = y + j;
+                        T m           = in[p + idim0 * q];
+                        float exp_pow = std::pow((m - m_0) / t, 6.0);
+                        float cM      = std::exp(-exp_pow);
+                        nM += cM;
+                    }
+                }
+            }
+
+            resp_out[idx] = nM < g ? g - nM : T(0);
+        }
+    }
+}
+
+template<typename T>
+void non_maximal(Param<float> xcoords, Param<float> ycoords,
+                 Param<float> response, shared_ptr<unsigned> counter,
+                 const dim_t idim0, const dim_t idim1, CParam<T> input,
+                 const unsigned border_len, const unsigned max_corners) {
+    float* x_out     = xcoords.get();
+    float* y_out     = ycoords.get();
+    float* resp_out  = response.get();
+    unsigned* count  = counter.get();
+    const T* resp_in = input.get();
+
+    // Responses on the border don't have 8-neighbors to compare, discard them
+    const unsigned r = border_len + 1;
+
+    for (dim_t y = r; y < idim1 - r; y++) {
+        for (dim_t x = r; x < idim0 - r; x++) {
+            const T v = resp_in[y * idim0 + x];
+
+            // Find maximum neighborhood response
+            T max_v;
+            max_v = std::max(resp_in[(y - 1) * idim0 + x - 1],
+                             resp_in[y * idim0 + x - 1]);
+            max_v = std::max(max_v, resp_in[(y + 1) * idim0 + x - 1]);
+            max_v = std::max(max_v, resp_in[(y - 1) * idim0 + x]);
+            max_v = std::max(max_v, resp_in[(y + 1) * idim0 + x]);
+            max_v = std::max(max_v, resp_in[(y - 1) * idim0 + x + 1]);
+            max_v = std::max(max_v, resp_in[(y)*idim0 + x + 1]);
+            max_v = std::max(max_v, resp_in[(y + 1) * idim0 + x + 1]);
+
+            // Stores the corner to {x,y,resp}_out if its response is strictly
+            // greater than every response in its 8-neighborhood (local
+            // maximum)
+            if (v > max_v) {
+                const dim_t idx = *count;
+                *count += 1;
+                if (idx < max_corners) {
+                    x_out[idx]    = (float)x;
+                    y_out[idx]    = (float)y;
+                    resp_out[idx] = (float)v;
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/tile.hpp b/src/backend/cpu/kernel/tile.hpp
new file mode 100644
index 0000000000..bb533889ac
--- /dev/null
+++ b/src/backend/cpu/kernel/tile.hpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void tile(Param<T> out, CParam<T> in) {
+    T* outPtr      = out.get();
+    const T* inPtr = in.get();
+
+    const af::dim4 iDims = in.dims();
+    const af::dim4 oDims = out.dims();
+    const af::dim4 ist   = in.strides();
+    const af::dim4 ost   = out.strides();
+
+    for (dim_t ow = 0; ow < oDims[3]; ow++) {
+        const dim_t iw = ow % iDims[3];
+        const dim_t iW = iw * ist[3];
+        const dim_t oW = ow * ost[3];
+        for (dim_t oz = 0; oz < oDims[2]; oz++) {
+            const dim_t iz  = oz % iDims[2];
+            const dim_t iZW = iW + iz * ist[2];
+            const dim_t oZW = oW + oz * ost[2];
+            for (dim_t oy = 0; oy < oDims[1]; oy++) {
+                const dim_t iy   = oy % iDims[1];
+                const dim_t iYZW = iZW + iy * ist[1];
+                const dim_t oYZW = oZW + oy * ost[1];
+                for (dim_t ox = 0; ox < oDims[0]; ox++) {
+                    const dim_t ix   = ox % iDims[0];
+                    const dim_t iMem = iYZW + ix;
+                    const dim_t oMem = oYZW + ox;
+                    outPtr[oMem]     = inPtr[iMem];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/transform.hpp b/src/backend/cpu/kernel/transform.hpp
new file mode 100644
index 0000000000..bfa1485629
--- /dev/null
+++ b/src/backend/cpu/kernel/transform.hpp
@@ -0,0 +1,144 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <af/traits.hpp>
+#include <type_traits>
+#include "interp.hpp"
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void calc_transform_inverse(T *txo, const T *txi, const bool perspective) {
+    if (perspective) {
+        txo[0] = txi[4] * txi[8] - txi[5] * txi[7];
+        txo[1] = -(txi[1] * txi[8] - txi[2] * txi[7]);
+        txo[2] = txi[1] * txi[5] - txi[2] * txi[4];
+
+        txo[3] = -(txi[3] * txi[8] - txi[5] * txi[6]);
+        txo[4] = txi[0] * txi[8] - txi[2] * txi[6];
+        txo[5] = -(txi[0] * txi[5] - txi[2] * txi[3]);
+
+        txo[6] = txi[3] * txi[7] - txi[4] * txi[6];
+        txo[7] = -(txi[0] * txi[7] - txi[1] * txi[6]);
+        txo[8] = txi[0] * txi[4] - txi[1] * txi[3];
+
+        T det = txi[0] * txo[0] + txi[1] * txo[3] + txi[2] * txo[6];
+
+        txo[0] /= det;
+        txo[1] /= det;
+        txo[2] /= det;
+        txo[3] /= det;
+        txo[4] /= det;
+        txo[5] /= det;
+        txo[6] /= det;
+        txo[7] /= det;
+        txo[8] /= det;
+    } else {
+        T det = txi[0] * txi[4] - txi[1] * txi[3];
+
+        txo[0] = txi[4] / det;
+        txo[1] = txi[3] / det;
+        txo[3] = txi[1] / det;
+        txo[4] = txi[0] / det;
+
+        txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
+        txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
+    }
+}
+
+template<typename T>
+void calc_transform_inverse(T *tmat, const T *tmat_ptr, const bool inverse,
+                            const bool perspective, const unsigned transf_len) {
+    // The way the kernel is structured, it expects an inverse
+    // transform matrix by default.
+    // If it is a forward transform, then we need its inverse
+    if (inverse) {
+        for (int i = 0; i < (int)transf_len; i++) tmat[i] = tmat_ptr[i];
+    } else {
+        calc_transform_inverse(tmat, tmat_ptr, perspective);
+    }
+}
+
+template<typename T, int order>
+void transform(Param<T> output, CParam<T> input, CParam<float> transform,
+               const bool inverse, const bool perspective,
+               af_interp_type method) {
+    typedef typename af::dtype_traits<T>::base_type BT;
+    typedef wtype_t<BT> WT;
+
+    const af::dim4 idims    = input.dims();
+    const af::dim4 odims    = output.dims();
+    const af::dim4 tdims    = transform.dims();
+    const af::dim4 tstrides = transform.strides();
+    const af::dim4 istrides = input.strides();
+    const af::dim4 ostrides = output.strides();
+
+    T *out          = output.get();
+    const float *tf = transform.get();
+
+    int batch_size = 1;
+    if (idims[2] != tdims[2]) batch_size = idims[2];
+
+    Interp2<T, WT, order> interp;
+    for (int idw = 0; idw < (int)odims[3]; idw++) {
+        dim_t out_offw = idw * ostrides[3];
+        dim_t in_offw  = (idims[3] > 1) * idw * istrides[3];
+        dim_t tf_offw  = (tdims[3] > 1) * idw * tstrides[3];
+
+        for (int idz = 0; idz < (int)odims[2]; idz += batch_size) {
+            dim_t out_offzw = out_offw + idz * ostrides[2];
+            dim_t in_offzw  = in_offw + (idims[2] > 1) * idz * istrides[2];
+            dim_t tf_offzw  = tf_offw + (tdims[2] > 1) * idz * tstrides[2];
+
+            const float *tptr = tf + tf_offzw;
+
+            float tmat[9];
+            calc_transform_inverse(tmat, tptr, inverse, perspective,
+                                   perspective ? 9 : 6);
+
+            for (int idy = 0; idy < (int)odims[1]; idy++) {
+                for (int idx = 0; idx < (int)odims[0]; idx++) {
+                    WT xidi = idx * tmat[0] + idy * tmat[1] + tmat[2];
+                    WT yidi = idx * tmat[3] + idy * tmat[4] + tmat[5];
+
+                    if (perspective) {
+                        WT W = idx * tmat[6] + idy * tmat[7] + tmat[8];
+                        xidi /= W;
+                        yidi /= W;
+                    }
+
+                    // FIXME: Nearest and lower do not do clamping, but other
+                    // methods do. Make it consistent.
+                    bool clamp = order != 1;
+                    bool condX = xidi >= -0.0001 && xidi < idims[0];
+                    bool condY = yidi >= -0.0001 && yidi < idims[1];
+
+                    int ooff = out_offzw + idy * ostrides[1] + idx;
+                    if (condX && condY) {
+                        interp(output, ooff, input, in_offzw, xidi, yidi,
+                               method, batch_size, clamp);
+                    } else {
+                        for (int n = 0; n < batch_size; n++) {
+                            out[ooff + n * ostrides[2]] = scalar<T>(0);
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/transpose.hpp b/src/backend/cpu/kernel/transpose.hpp
new file mode 100644
index 0000000000..5c9a254401
--- /dev/null
+++ b/src/backend/cpu/kernel/transpose.hpp
@@ -0,0 +1,182 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+T getConjugate(const T &in) {
+    // For non-complex types return same
+    return in;
+}
+
+template<>
+cfloat getConjugate(const cfloat &in) {
+    return std::conj(in);
+}
+
+template<>
+cdouble getConjugate(const cdouble &in) {
+    return std::conj(in);
+}
+
+template<typename T, int M, int N>
+void transpose_kernel(T *output, const T *input, int ostride, int istride) {
+    for (int j = 0; j < N; j++) {
+        for (int i = 0; i < M; i++) { output[i * ostride] = input[i]; }
+        input += istride;
+        output++;
+    }
+}
+
+template<typename T>
+void transpose_real(Param<T> output, CParam<T> input) {
+    const af::dim4 odims    = output.dims();
+    const af::dim4 ostrides = output.strides();
+    const af::dim4 istrides = input.strides();
+
+    T *out            = output.get();
+    T const *const in = input.get();
+
+    constexpr int M = 8;
+    constexpr int N = 8;
+
+    dim_t odims1_down = floor(odims[1] / N) * N;
+    dim_t odims0_down = floor(odims[0] / M) * M;
+
+    for (dim_t l = 0; l < odims[3]; ++l) {
+        for (dim_t k = 0; k < odims[2]; ++k) {
+            // Outermost loop handles batch mode
+            // if input has no data along third dimension
+            // this loop runs only once
+            T *out_      = out + l * ostrides[3] + k * ostrides[2];
+            const T *in_ = in + l * istrides[3] + k * istrides[2];
+
+            if (odims1_down > 0) {
+                for (dim_t j = 0; j <= odims1_down; j += N) {
+                    for (dim_t i = 0; i < odims0_down; i += M) {
+                        transpose_kernel<T, M, N>(out_, in_, ostrides[1],
+                                                  istrides[1]);
+                        out_ += M;
+                        in_ += istrides[1] * N;
+                    }
+
+                    for (dim_t jj = 0; jj < N; jj++) {
+                        for (dim_t i = odims0_down; i < odims[0]; i++) {
+                            *out_ = *in_;
+                            out_++;
+                            in_ += istrides[1];
+                        }
+                        out_ += ostrides[1] - (odims[0] - odims0_down);
+                        in_ -= (odims[0] - odims0_down) * istrides[1] - 1;
+                    }
+                    out_ = out + l * ostrides[3] + k * ostrides[2] +
+                           j * ostrides[1];
+                    in_ = in + l * istrides[3] + k * istrides[2] + j;
+                }
+            }
+            for (dim_t j = odims1_down; j < odims[1]; j++) {
+                out_ =
+                    out + l * ostrides[3] + k * ostrides[2] + j * ostrides[1];
+                in_ = in + l * istrides[3] + k * istrides[2] + j;
+                for (dim_t i = 0; i < odims[0]; i++) {
+                    *out_ = *in_;
+                    out_++;
+                    in_ += istrides[1];
+                }
+            }
+        }
+    }
+}
+
+template<typename T>
+void transpose_conj(Param<T> output, CParam<T> input) {
+    const af::dim4 odims    = output.dims();
+    const af::dim4 ostrides = output.strides();
+    const af::dim4 istrides = input.strides();
+
+    T *out            = output.get();
+    T const *const in = input.get();
+
+    for (dim_t l = 0; l < odims[3]; ++l) {
+        for (dim_t k = 0; k < odims[2]; ++k) {
+            // Outermost loop handles batch mode
+            // if input has no data along third dimension
+            // this loop runs only once
+
+            for (dim_t j = 0; j < odims[1]; ++j) {
+                for (dim_t i = 0; i < odims[0]; ++i) {
+                    // calculate array indices based on offsets and strides
+                    // the helper getIdx takes care of indices
+                    const dim_t inIdx  = getIdx(istrides, j, i, k, l);
+                    const dim_t outIdx = getIdx(ostrides, i, j, k, l);
+                    out[outIdx]        = getConjugate(in[inIdx]);
+                }
+            }
+            // The out and in pointers don't need to be
+            // offset as the getIdx function is taking care
+            // of the batch parameter
+        }
+    }
+}
+
+template<typename T>
+void transpose(Param<T> out, CParam<T> in, const bool conjugate) {
+    return (conjugate ? transpose_conj<T>(out, in)
+                      : transpose_real<T>(out, in));
+}
+
+template<typename T, bool conjugate>
+void transpose_inplace(Param<T> input) {
+    const af::dim4 idims    = input.dims();
+    const af::dim4 istrides = input.strides();
+
+    T *in = input.get();
+
+    for (dim_t l = 0; l < idims[3]; ++l) {
+        for (dim_t k = 0; k < idims[2]; ++k) {
+            // Outermost loop handles batch mode
+            // if input has no data along third dimension
+            // this loop runs only once
+            //
+            // Run only bottom triangle. std::swap swaps with upper triangle
+            for (dim_t j = 0; j < idims[1]; ++j) {
+                for (dim_t i = j + 1; i < idims[0]; ++i) {
+                    // calculate array indices based on offsets and strides
+                    // the helper getIdx takes care of indices
+                    const dim_t iIdx = getIdx(istrides, j, i, k, l);
+                    const dim_t oIdx = getIdx(istrides, i, j, k, l);
+                    if (conjugate) {
+                        in[iIdx] = getConjugate(in[iIdx]);
+                        in[oIdx] = getConjugate(in[oIdx]);
+                        std::swap(in[iIdx], in[oIdx]);
+                    } else {
+                        std::swap(in[iIdx], in[oIdx]);
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename T>
+void transpose_inplace(Param<T> in, const bool conjugate) {
+    return (conjugate ? transpose_inplace<T, true>(in)
+                      : transpose_inplace<T, false>(in));
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/triangle.hpp b/src/backend/cpu/kernel/triangle.hpp
new file mode 100644
index 0000000000..3c6051ce0b
--- /dev/null
+++ b/src/backend/cpu/kernel/triangle.hpp
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, bool IsUpper, bool IsUnitDiag>
+void triangle(Param<T> out, CParam<T> in) {
+    T *o       = out.get();
+    const T *i = in.get();
+
+    af::dim4 odm = out.dims();
+
+    af::dim4 ost = out.strides();
+    af::dim4 ist = in.strides();
+
+    for (dim_t ow = 0; ow < odm[3]; ow++) {
+        const dim_t oW = ow * ost[3];
+        const dim_t iW = ow * ist[3];
+
+        for (dim_t oz = 0; oz < odm[2]; oz++) {
+            const dim_t oZW = oW + oz * ost[2];
+            const dim_t iZW = iW + oz * ist[2];
+
+            for (dim_t oy = 0; oy < odm[1]; oy++) {
+                const dim_t oYZW = oZW + oy * ost[1];
+                const dim_t iYZW = iZW + oy * ist[1];
+
+                for (dim_t ox = 0; ox < odm[0]; ox++) {
+                    const dim_t oMem = oYZW + ox;
+                    const dim_t iMem = iYZW + ox;
+
+                    bool cond         = IsUpper ? (oy >= ox) : (oy <= ox);
+                    bool do_unit_diag = (IsUnitDiag && ox == oy);
+                    if (cond) {
+                        o[oMem] = do_unit_diag ? scalar<T>(1) : i[iMem];
+                    } else {
+                        o[oMem] = scalar<T>(0);
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/unwrap.hpp b/src/backend/cpu/kernel/unwrap.hpp
new file mode 100644
index 0000000000..e9cd6675a3
--- /dev/null
+++ b/src/backend/cpu/kernel/unwrap.hpp
@@ -0,0 +1,84 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T>
+void unwrap_dim(Param<T> out, CParam<T> in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const int d) {
+    const T *inPtr = in.get();
+    T *outPtr      = out.get();
+
+    af::dim4 idims    = in.dims();
+    af::dim4 odims    = out.dims();
+    af::dim4 istrides = in.strides();
+    af::dim4 ostrides = out.strides();
+
+    dim_t nx = 1 + (idims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+
+    for (dim_t w = 0; w < odims[3]; w++) {
+        for (dim_t z = 0; z < odims[2]; z++) {
+            dim_t cOut    = w * ostrides[3] + z * ostrides[2];
+            dim_t cIn     = w * istrides[3] + z * istrides[2];
+            const T *iptr = inPtr + cIn;
+            T *optr_      = outPtr + cOut;
+
+            for (dim_t col = 0; col < odims[d]; col++) {
+                // Offset output ptr
+                T *optr = optr_ + col * ostrides[d];
+
+                // Calculate input window index
+                dim_t winy = (col / nx);
+                dim_t winx = (col % nx);
+
+                dim_t startx = winx * sx;
+                dim_t starty = winy * sy;
+
+                dim_t spx = startx - px;
+                dim_t spy = starty - py;
+
+                // Shortcut condition ensuring all values are within the input
+                // dimensions
+                bool cond = (spx >= 0 && spx + (wx * dx) < idims[0] &&
+                             spy >= 0 && spy + (wy * dy) < idims[1]);
+
+                for (dim_t y = 0; y < wy; y++) {
+                    dim_t ypad = spy + y * dy;
+                    for (dim_t x = 0; x < wx; x++) {
+                        dim_t xpad = spx + x * dx;
+
+                        dim_t oloc = (y * wx + x);
+                        if (d == 0) oloc *= ostrides[1];
+
+                        if (cond || (xpad >= 0 && xpad < idims[0] &&
+                                     ypad >= 0 && ypad < idims[1])) {
+                            dim_t iloc =
+                                (ypad * istrides[1] + xpad * istrides[0]);
+                            optr[oloc] = iptr[iloc];
+                        } else {
+                            optr[oloc] = scalar<T>(0.0);
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/kernel/wrap.hpp b/src/backend/cpu/kernel/wrap.hpp
new file mode 100644
index 0000000000..0a6eb63a5d
--- /dev/null
+++ b/src/backend/cpu/kernel/wrap.hpp
@@ -0,0 +1,148 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_cpu.hpp>
+#include <math.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cpu {
+namespace kernel {
+
+template<typename T, int d>
+void wrap_dim(Param<T> out, CParam<T> in, const dim_t wx, const dim_t wy,
+              const dim_t sx, const dim_t sy, const dim_t px, const dim_t py) {
+    const T *inPtr = in.get();
+    T *outPtr      = out.get();
+
+    af::dim4 idims    = in.dims();
+    af::dim4 odims    = out.dims();
+    af::dim4 istrides = in.strides();
+    af::dim4 ostrides = out.strides();
+
+    dim_t nx = (odims[0] + 2 * px - wx) / sx + 1;
+
+    for (dim_t w = 0; w < idims[3]; w++) {
+        for (dim_t z = 0; z < idims[2]; z++) {
+            dim_t cIn      = w * istrides[3] + z * istrides[2];
+            dim_t cOut     = w * ostrides[3] + z * ostrides[2];
+            const T *iptr_ = inPtr + cIn;
+            T *optr        = outPtr + cOut;
+
+            for (dim_t col = 0; col < idims[d]; col++) {
+                // Offset input ptr
+                const T *iptr = iptr_ + col * istrides[d];
+
+                // Calculate input window index
+                dim_t winy = (col / nx);
+                dim_t winx = (col % nx);
+
+                dim_t startx = winx * sx;
+                dim_t starty = winy * sy;
+
+                dim_t spx = startx - px;
+                dim_t spy = starty - py;
+
+                // Shortcut condition ensuring all values are within the
+                // output dimensions
+                bool cond = (spx >= 0 && spx + wx < odims[0] && spy >= 0 &&
+                             spy + wy < odims[1]);
+
+                for (dim_t y = 0; y < wy; y++) {
+                    for (dim_t x = 0; x < wx; x++) {
+                        dim_t xpad = spx + x;
+                        dim_t ypad = spy + y;
+
+                        dim_t iloc = (y * wx + x);
+                        if (d == 0) iloc *= istrides[1];
+
+                        if (cond || (xpad >= 0 && xpad < odims[0] &&
+                                     ypad >= 0 && ypad < odims[1])) {
+                            dim_t oloc =
+                                (ypad * ostrides[1] + xpad * ostrides[0]);
+                            // FIXME: When using threads, atomize this
+                            optr[oloc] += iptr[iloc];
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+template<typename T>
+void wrap_dim_dilated(Param<T> out, CParam<T> in, const dim_t wx,
+                      const dim_t wy, const dim_t sx, const dim_t sy,
+                      const dim_t px, const dim_t py, const dim_t dx,
+                      const dim_t dy, const int d) {
+    const T *inPtr = in.get();
+    T *outPtr      = out.get();
+
+    af::dim4 idims    = in.dims();
+    af::dim4 odims    = out.dims();
+    af::dim4 istrides = in.strides();
+    af::dim4 ostrides = out.strides();
+
+    dim_t nx = 1 + (odims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+
+    for (dim_t w = 0; w < idims[3]; w++) {
+        for (dim_t z = 0; z < idims[2]; z++) {
+            dim_t cIn              = w * istrides[3] + z * istrides[2];
+            dim_t cOut             = w * ostrides[3] + z * ostrides[2];
+            const data_t<T> *iptr_ = inPtr + cIn;
+            data_t<T> *optr        = outPtr + cOut;
+
+            for (dim_t col = 0; col < idims[d]; col++) {
+                // Offset input ptr
+                const data_t<T> *iptr = iptr_ + col * istrides[d];
+
+                // Calculate input window index
+                dim_t winy = (col / nx);
+                dim_t winx = (col % nx);
+
+                dim_t startx = winx * sx;
+                dim_t starty = winy * sy;
+
+                dim_t spx = startx - px;
+                dim_t spy = starty - py;
+
+                // Shortcut condition ensuring all values are within the
+                // output dimensions
+                bool cond = (spx >= 0 && spx + (wx * dx) < odims[0] &&
+                             spy >= 0 && spy + (wy * dy) < odims[1]);
+
+                for (dim_t y = 0; y < wy; y++) {
+                    dim_t ypad = spy + y * dy;
+                    for (dim_t x = 0; x < wx; x++) {
+                        dim_t xpad = spx + x * dx;
+
+                        dim_t iloc = (y * wx + x);
+                        if (d == 0) iloc *= istrides[1];
+
+                        if (cond || (xpad >= 0 && xpad < odims[0] &&
+                                     ypad >= 0 && ypad < odims[1])) {
+                            dim_t oloc =
+                                (ypad * ostrides[1] + xpad * ostrides[0]);
+                            // FIXME: When using threads, atomize this
+                            optr[oloc] = static_cast<compute_t<T>>(optr[oloc]) +
+                                         static_cast<compute_t<T>>(iptr[iloc]);
+                        }
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/lapack_helper.hpp b/src/backend/cpu/lapack_helper.hpp
index f978ecb92b..e9b509f921 100644
--- a/src/backend/cpu/lapack_helper.hpp
+++ b/src/backend/cpu/lapack_helper.hpp
@@ -17,16 +17,17 @@
 #define AF_LAPACK_COL_MAJOR LAPACK_COL_MAJOR
 #define LAPACK_NAME(fn) LAPACKE_##fn
 
+#ifdef USE_MKL
+#include <mkl_lapack.h>
+#include <mkl_lapacke.h>
+#else
 #ifdef __APPLE__
 #include <Accelerate/Accelerate.h>
-#include <lapacke.hpp>
+#include <common/lapacke.hpp>
 #undef AF_LAPACK_COL_MAJOR
 #define AF_LAPACK_COL_MAJOR 0
-#else
-#ifdef USE_MKL
-#include<mkl_lapacke.h>
-#else // NETLIB LAPACKE
-#include<lapacke.h>
+#else  // NETLIB LAPACKE
+#include <lapacke.h>
 #endif
 #endif
 
diff --git a/src/backend/cpu/logic.hpp b/src/backend/cpu/logic.hpp
index 0b3b5f7f27..40a90e0167 100644
--- a/src/backend/cpu/logic.hpp
+++ b/src/backend/cpu/logic.hpp
@@ -7,109 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <optypes.hpp>
+#include <common/jit/BinaryNode.hpp>
 #include <err_cpu.hpp>
+#include <optypes.hpp>
 #include <types.hpp>
-#include <TNJ/BinaryNode.hpp>
-
-namespace cpu
-{
-
-#define LOGIC_FN(OP, op)                        \
-    template<typename T>                        \
-    struct BinOp<char, T, OP>                   \
-    {                                           \
-        char eval(T lhs, T rhs)                 \
-        {                                       \
-            return lhs op rhs;                  \
-        }                                       \
-    };                                          \
-
-
-    LOGIC_FN(af_eq_t, ==)
-    LOGIC_FN(af_neq_t, !=)
-    LOGIC_FN(af_lt_t, <)
-    LOGIC_FN(af_gt_t, >)
-    LOGIC_FN(af_le_t, <=)
-    LOGIC_FN(af_ge_t, >=)
-    LOGIC_FN(af_and_t, &&)
-    LOGIC_FN(af_or_t, ||)
-
-#undef LOGIC_FN
-
-#define LOGIC_CPLX_FN(T, OP, op)                    \
-    template<>                                      \
-    struct BinOp<char, std::complex<T>, OP>         \
-    {                                               \
-        char eval(std::complex<T> lhs,              \
-                  std::complex<T> rhs)              \
-        {                                           \
-            return std::abs(lhs) op std::abs(rhs);  \
-        }                                           \
-    };                                              \
-
-LOGIC_CPLX_FN(float, af_lt_t, <)
-LOGIC_CPLX_FN(float, af_le_t, <=)
-LOGIC_CPLX_FN(float, af_gt_t, >)
-LOGIC_CPLX_FN(float, af_ge_t, >=)
-LOGIC_CPLX_FN(float, af_and_t, &&)
-LOGIC_CPLX_FN(float, af_or_t, ||)
-
-
-LOGIC_CPLX_FN(double, af_lt_t, <)
-LOGIC_CPLX_FN(double, af_le_t, <=)
-LOGIC_CPLX_FN(double, af_gt_t, >)
-LOGIC_CPLX_FN(double, af_ge_t, >=)
-LOGIC_CPLX_FN(double, af_and_t, &&)
-LOGIC_CPLX_FN(double, af_or_t, ||)
-
-#undef LOGIC_CPLX_FN
-
-    template<typename T, af_op_t op>
-    Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        TNJ::Node_ptr lhs_node = lhs.getNode();
-        TNJ::Node_ptr rhs_node = rhs.getNode();
-
-        TNJ::BinaryNode<char, T, op> *node = new TNJ::BinaryNode<char, T, op>(lhs_node, rhs_node);
-
-        return createNodeArray<char>(odims, TNJ::Node_ptr(
-                                          reinterpret_cast<TNJ::Node *>(node)));
-    }
-
-
-
-#define BITWISE_FN(OP, op)                      \
-    template<typename T>                        \
-    struct BinOp<T, T, OP>                      \
-    {                                           \
-        T eval(T lhs, T rhs)                    \
-        {                                       \
-            return lhs op rhs;                  \
-        }                                       \
-    };                                          \
-
-    BITWISE_FN(af_bitor_t, |)
-    BITWISE_FN(af_bitand_t, &)
-    BITWISE_FN(af_bitxor_t, ^)
-    BITWISE_FN(af_bitshiftl_t, <<)
-    BITWISE_FN(af_bitshiftr_t, >>)
-
-#undef BITWISE_FN
+#include <af/dim4.hpp>
 
-    template<typename T, af_op_t op>
-    Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        TNJ::Node_ptr lhs_node = lhs.getNode();
-        TNJ::Node_ptr rhs_node = rhs.getNode();
+namespace arrayfire {
+namespace cpu {
 
-        TNJ::BinaryNode<T, T, op> *node = new TNJ::BinaryNode<T, T, op>(lhs_node, rhs_node);
+template<typename T, af_op_t op>
+Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs,
+                    const af::dim4 &odims) {
+    return common::createBinaryNode<char, T, op>(lhs, rhs, odims);
+}
 
-        return createNodeArray<T>(odims, TNJ::Node_ptr(
-                                      reinterpret_cast<TNJ::Node *>(node)));
-    }
+template<typename T, af_op_t op>
+Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/lookup.cpp b/src/backend/cpu/lookup.cpp
index f3e18bd4d6..b8c56e297c 100644
--- a/src/backend/cpu/lookup.cpp
+++ b/src/backend/cpu/lookup.cpp
@@ -6,91 +6,71 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
+#include <kernel/lookup.hpp>
 #include <lookup.hpp>
-#include <err_cpu.hpp>
-#include <cstdlib>
 
-namespace cpu
-{
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <cstdlib>
 
-static inline
-dim_t trimIndex(int idx, const dim_t &len)
-{
-    int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
-    }
-    return ret_val;
-}
+using arrayfire::common::half;
 
+namespace arrayfire {
+namespace cpu {
 template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim)
-{
-    const dim4 iDims = input.dims();
-    const dim4 iStrides = input.strides();
-
-    const in_t *inPtr = input.get();
-    const idx_t *idxPtr = indices.get();
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim) {
+    const dim4 &iDims = input.dims();
 
     dim4 oDims(1);
-    for (int d=0; d<4; ++d)
-        oDims[d] = (d==int(dim) ? indices.elements() : iDims[d]);
+    for (int d = 0; d < 4; ++d) {
+        oDims[d] = (d == int(dim) ? indices.elements() : iDims[d]);
+    }
 
     Array<in_t> out = createEmptyArray<in_t>(oDims);
-
-    dim4 oStrides = out.strides();
-
-    in_t *outPtr = out.get();
-
-    for (dim_t l=0; l<oDims[3]; ++l) {
-
-        dim_t iLOff = iStrides[3]*(dim==3 ? trimIndex((dim_t)idxPtr[l], iDims[3]): l);
-        dim_t oLOff = l*oStrides[3];
-
-        for (dim_t k=0; k<oDims[2]; ++k) {
-
-            dim_t iKOff = iStrides[2]*(dim==2 ? trimIndex((dim_t)idxPtr[k], iDims[2]): k);
-            dim_t oKOff = k*oStrides[2];
-
-            for (dim_t j=0; j<oDims[1]; ++j) {
-
-                dim_t iJOff = iStrides[1]*(dim==1 ? trimIndex((dim_t)idxPtr[j], iDims[1]): j);
-                dim_t oJOff = j*oStrides[1];
-
-                for (dim_t i=0; i<oDims[0]; ++i) {
-
-                    dim_t iIOff = iStrides[0]*(dim==0 ? trimIndex((dim_t)idxPtr[i], iDims[0]): i);
-                    dim_t oIOff = i*oStrides[0];
-
-                    outPtr[oLOff+oKOff+oJOff+oIOff] = inPtr[iLOff+iKOff+iJOff+iIOff];
-                }
-            }
-        }
-    }
+    getQueue().enqueue(kernel::lookup<in_t, idx_t>, out, input, indices, dim);
 
     return out;
 }
 
-#define INSTANTIATE(T)  \
-    template Array<T>  lookup<T, float   >(const Array<T> &input, const Array<float   > &indices, const unsigned dim); \
-    template Array<T>  lookup<T, double  >(const Array<T> &input, const Array<double  > &indices, const unsigned dim); \
-    template Array<T>  lookup<T, int     >(const Array<T> &input, const Array<int     > &indices, const unsigned dim); \
-    template Array<T>  lookup<T, unsigned>(const Array<T> &input, const Array<unsigned> &indices, const unsigned dim); \
-    template Array<T>  lookup<T, uchar   >(const Array<T> &input, const Array<uchar   > &indices, const unsigned dim);
-
-INSTANTIATE(float   );
-INSTANTIATE(cfloat  );
-INSTANTIATE(double  );
-INSTANTIATE(cdouble );
-INSTANTIATE(int     );
+#define INSTANTIATE(T)                                                         \
+    template Array<T> lookup<T, float>(const Array<T> &, const Array<float> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, double>(                                       \
+        const Array<T> &, const Array<double> &, const unsigned);              \
+    template Array<T> lookup<T, int>(const Array<T> &, const Array<int> &,     \
+                                     const unsigned);                          \
+    template Array<T> lookup<T, unsigned>(                                     \
+        const Array<T> &, const Array<unsigned> &, const unsigned);            \
+    template Array<T> lookup<T, short>(const Array<T> &, const Array<short> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, ushort>(                                       \
+        const Array<T> &, const Array<ushort> &, const unsigned);              \
+    template Array<T> lookup<T, intl>(const Array<T> &, const Array<intl> &,   \
+                                      const unsigned);                         \
+    template Array<T> lookup<T, uintl>(const Array<T> &, const Array<uintl> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, schar>(const Array<T> &, const Array<schar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, uchar>(const Array<T> &, const Array<uchar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, half>(const Array<T> &, const Array<half> &,   \
+                                      const unsigned);
+
+INSTANTIATE(float);
+INSTANTIATE(cfloat);
+INSTANTIATE(double);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
 INSTANTIATE(unsigned);
-INSTANTIATE(intl    );
-INSTANTIATE(uintl   );
-INSTANTIATE(uchar   );
-INSTANTIATE(char    );
-
-}
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(ushort);
+INSTANTIATE(short);
+INSTANTIATE(half);
+}  // namespace cpu
+}  // namespace arrayfire
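Review note on `lookup.cpp`: the wrapper now only computes the output shape (`oDims[d]` equals `indices.elements()` along `dim`, and `iDims[d]` elsewhere) and enqueues `kernel::lookup`, whose body is not part of this hunk. The gather it delegates can be sketched roughly as below, assuming a 2-D column-major layout. `Matrix2D` and `lookupSketch` are hypothetical names for illustration, and clamping out-of-range indices is an assumption of this sketch, not something the diff shows.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 2-D column-major array; not an ArrayFire type.
struct Matrix2D {
    std::array<std::size_t, 2> dims;  // {rows, cols}
    std::vector<float> data;          // element (r, c) lives at data[r + c * rows]
};

// Gather slices of `in` along `dim` (0 = rows, 1 = columns).
// ASSUMPTION: out-of-range indices are clamped to the valid range.
Matrix2D lookupSketch(const Matrix2D &in, const std::vector<int> &indices,
                      unsigned dim) {
    assert(dim < 2);
    assert(in.dims[0] > 0 && in.dims[1] > 0);
    assert(in.data.size() == in.dims[0] * in.dims[1]);

    // Output shape: the indexed dimension takes the number of indices,
    // every other dimension is copied from the input.
    Matrix2D out;
    out.dims      = in.dims;
    out.dims[dim] = indices.size();
    out.data.resize(out.dims[0] * out.dims[1]);

    const int last = static_cast<int>(in.dims[dim]) - 1;
    for (std::size_t c = 0; c < out.dims[1]; ++c) {
        for (std::size_t r = 0; r < out.dims[0]; ++r) {
            std::size_t srcR = r;
            std::size_t srcC = c;
            if (dim == 0) {
                srcR = static_cast<std::size_t>(
                    std::max(0, std::min(indices[r], last)));
            } else {
                srcC = static_cast<std::size_t>(
                    std::max(0, std::min(indices[c], last)));
            }
            out.data[r + c * out.dims[0]] = in.data[srcR + srcC * in.dims[0]];
        }
    }
    return out;
}
```

For a 2x3 input, looking up column indices `{2, 0, 7}` along `dim = 1` yields a 2x3 result whose columns are input columns 2, 0 and (after clamping) 2.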
diff --git a/src/backend/cpu/lookup.hpp b/src/backend/cpu/lookup.hpp
index a41ea3b086..c21a757d10 100644
--- a/src/backend/cpu/lookup.hpp
+++ b/src/backend/cpu/lookup.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
-
+namespace arrayfire {
+namespace cpu {
 template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim);
-
-}
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/lu.cpp b/src/backend/cpu/lu.cpp
index 0eefb16816..43df22e90c 100644
--- a/src/backend/cpu/lu.cpp
+++ b/src/backend/cpu/lu.cpp
@@ -7,120 +7,53 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/err_common.hpp>
 #include <lu.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_CPU_LINEAR_ALGEBRA)
-
-#include <af/dim4.hpp>
+#if defined(WITH_LINEAR_ALGEBRA)
 #include <handle.hpp>
-#include <iostream>
-#include <cassert>
-#include <err_cpu.hpp>
-
-#include <range.hpp>
+#include <kernel/lu.hpp>
 #include <lapack_helper.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <range.hpp>
+#include <af/dim4.hpp>
 
-namespace cpu
-{
-
-template<typename T>
-using getrf_func_def = int (*)(ORDER_TYPE, int, int,
-                               T*, int,
-                               int*);
-
-#define LU_FUNC_DEF( FUNC )                                     \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
+#include <cassert>
+#include <iostream>
 
+namespace arrayfire {
+namespace cpu {
 
-#define LU_FUNC( FUNC, TYPE, PREFIX )                           \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()        \
-{ return & LAPACK_NAME(PREFIX##FUNC); }
+template<typename T>
+using getrf_func_def = int (*)(ORDER_TYPE, int, int, T *, int, int *);
 
-LU_FUNC_DEF( getrf )
-LU_FUNC(getrf , float  , s)
-LU_FUNC(getrf , double , d)
-LU_FUNC(getrf , cfloat , c)
-LU_FUNC(getrf , cdouble, z)
+#define LU_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
 
-template<typename T>
-void lu_split(Array<T> &lower, Array<T> &upper, const Array<T> &in)
-{
-    T *l = lower.get();
-    T *u = upper.get();
-    const T *i = in.get();
-
-    dim4 ldm = lower.dims();
-    dim4 udm = upper.dims();
-    dim4 idm = in.dims();
-
-    dim4 lst = lower.strides();
-    dim4 ust = upper.strides();
-    dim4 ist = in.strides();
-
-    for(dim_t ow = 0; ow < idm[3]; ow++) {
-        const dim_t lW = ow * lst[3];
-        const dim_t uW = ow * ust[3];
-        const dim_t iW = ow * ist[3];
-
-        for(dim_t oz = 0; oz < idm[2]; oz++) {
-            const dim_t lZW = lW + oz * lst[2];
-            const dim_t uZW = uW + oz * ust[2];
-            const dim_t iZW = iW + oz * ist[2];
-
-            for(dim_t oy = 0; oy < idm[1]; oy++) {
-                const dim_t lYZW = lZW + oy * lst[1];
-                const dim_t uYZW = uZW + oy * ust[1];
-                const dim_t iYZW = iZW + oy * ist[1];
-
-                for(dim_t ox = 0; ox < idm[0]; ox++) {
-                    const dim_t lMem = lYZW + ox;
-                    const dim_t uMem = uYZW + ox;
-                    const dim_t iMem = iYZW + ox;
-                    if(ox > oy) {
-                        if(oy < ldm[1])
-                            l[lMem] = i[iMem];
-                        if(ox < udm[0])
-                            u[uMem] = scalar<T>(0);
-                    } else if (oy > ox) {
-                        if(oy < ldm[1])
-                            l[lMem] = scalar<T>(0);
-                        if(ox < udm[0])
-                            u[uMem] = i[iMem];
-                    } else if(ox == oy) {
-                        if(oy < ldm[1])
-                            l[lMem] = scalar<T>(1.0);
-                        if(ox < udm[0])
-                            u[uMem] = i[iMem];
-                    }
-                }
-            }
-        }
+#define LU_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
     }
-}
 
-void convertPivot(Array<int> &pivot, int out_sz)
-{
-    Array<int> p = range<int>(dim4(out_sz), 0);
-    int *d_pi = pivot.get();
-    int *d_po = p.get();
-    dim_t d0 = pivot.dims()[0];
-    for(int j = 0; j < (int)d0; j++) {
-        // 1 indexed in pivot
-        std::swap(d_po[j], d_po[d_pi[j] - 1]);
-    }
-    pivot = p;
-}
+LU_FUNC_DEF(getrf)
+LU_FUNC(getrf, float, s)
+LU_FUNC(getrf, double, d)
+LU_FUNC(getrf, cfloat, c)
+LU_FUNC(getrf, cdouble, z)
 
 template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
     dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
+    int M      = iDims[0];
+    int N      = iDims[1];
 
     Array<T> in_copy = copyArray<T>(in);
-    pivot = lu_inplace(in_copy);
+    pivot            = lu_inplace(in_copy);
 
     // SPLIT into lower and upper
     dim4 ldims(M, min(M, N));
@@ -128,64 +61,72 @@ void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
     lower = createEmptyArray<T>(ldims);
     upper = createEmptyArray<T>(udims);
 
-    lu_split<T>(lower, upper, in_copy);
+    getQueue().enqueue(kernel::lu_split<T>, lower, upper, in_copy);
 }
 
 template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
     dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
-    Array<int> pivot = createEmptyArray<int>(af::dim4(min(M, N), 1, 1, 1));
-
-    getrf_func<T>()(AF_LAPACK_COL_MAJOR, M, N,
-                    in.get(), in.strides()[1],
-                    pivot.get());
-
-    if(convert_pivot) convertPivot(pivot, M);
-
-    return pivot;
+    Array<int> pivot =
+        createEmptyArray<int>(af::dim4(min(iDims[0], iDims[1]), 1, 1, 1));
+
+    auto func = [=](Param<T> in, Param<int> pivot) {
+        dim4 iDims = in.dims();
+        getrf_func<T>()(AF_LAPACK_COL_MAJOR, iDims[0], iDims[1], in.get(),
+                        in.strides(1), pivot.get());
+    };
+    getQueue().enqueue(func, in, pivot);
+
+    if (convert_pivot) {
+        Array<int> p = range<int>(dim4(iDims[0]), 0);
+        getQueue().enqueue(kernel::convertPivot, p, pivot);
+        return p;
+    } else {
+        return pivot;
+    }
 }
 
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+bool isLAPACKAvailable() { return true; }
 
-INSTANTIATE_LU(float)
-INSTANTIATE_LU(cfloat)
-INSTANTIATE_LU(double)
-INSTANTIATE_LU(cdouble)
+}  // namespace cpu
+}  // namespace arrayfire
 
-}
-
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+bool isLAPACKAvailable() { return false; }
+
+}  // namespace cpu
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> & in,             \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> & lower, Array<T> & upper,      \
+                        Array<int> & pivot, const Array<T> &in);
 
 INSTANTIATE_LU(float)
 INSTANTIATE_LU(cfloat)
 INSTANTIATE_LU(double)
 INSTANTIATE_LU(cdouble)
 
-}
-
-#endif
+}  // namespace cpu
+}  // namespace arrayfire
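Review note on `lu.cpp`: the removed `convertPivot` turned the 1-indexed row-interchange list written by `getrf` into a 0-indexed permutation by starting from `range(...)` and replaying each swap. The new code keeps the same two steps (`range<int>` followed by `kernel::convertPivot`) but runs the second one on the queue. A minimal standalone sketch of that replay, mirroring the removed loop, is below. `pivotToPermutation` is a hypothetical name, and `std::vector` stands in for `Array<int>`.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Convert a LAPACK-style pivot list (1-indexed: row j was swapped with row
// ipiv[j]) into a 0-indexed permutation of `rows` row indices.
std::vector<int> pivotToPermutation(const std::vector<int> &ipiv, int rows) {
    assert(rows >= 0);
    assert(ipiv.size() <= static_cast<std::size_t>(rows));

    // Start from the identity permutation 0, 1, ..., rows - 1.
    std::vector<int> perm(static_cast<std::size_t>(rows));
    std::iota(perm.begin(), perm.end(), 0);

    // Replay the interchanges in order, as the factorization applied them.
    for (std::size_t j = 0; j < ipiv.size(); ++j) {
        assert(ipiv[j] >= 1 && ipiv[j] <= rows);
        std::swap(perm[j], perm[static_cast<std::size_t>(ipiv[j] - 1)]);
    }
    return perm;
}
```

With `ipiv = {3, 3, 3}` and three rows, the swaps produce the permutation `{2, 0, 1}`.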
diff --git a/src/backend/cpu/lu.hpp b/src/backend/cpu/lu.hpp
index c25dcaaa16..d114d4f2b4 100644
--- a/src/backend/cpu/lu.hpp
+++ b/src/backend/cpu/lu.hpp
@@ -7,14 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in);
 
-    template<typename T>
-    Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
-}
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
+
+bool isLAPACKAvailable();
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/match_template.cpp b/src/backend/cpu/match_template.cpp
index b026529dba..6b4d0f1b91 100644
--- a/src/backend/cpu/match_template.cpp
+++ b/src/backend/cpu/match_template.cpp
@@ -7,157 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <match_template.hpp>
-#include <err_cpu.hpp>
-
-using af::dim4;
-
-namespace cpu
-{
-
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg)
-{
-    const dim4 sDims = sImg.dims();
-    const dim4 tDims = tImg.dims();
-    const dim4 sStrides = sImg.strides();
-    const dim4 tStrides = tImg.strides();
-
-    const dim_t tDim0  = tDims[0];
-    const dim_t tDim1  = tDims[1];
-    const dim_t sDim0  = sDims[0];
-    const dim_t sDim1  = sDims[1];
-
-    Array<outType> out = createEmptyArray<outType>(sDims);
-    const dim4 oStrides = out.strides();
-
-    outType tImgMean = outType(0);
-    dim_t winNumElements = tImg.elements();
-    bool needMean = mType==AF_ZSAD || mType==AF_LSAD ||
-                    mType==AF_ZSSD || mType==AF_LSSD ||
-                    mType==AF_ZNCC;
-    const inType * tpl = tImg.get();
-
-    if (needMean) {
-        for(dim_t tj=0; tj<tDim1; tj++) {
-            dim_t tjStride = tj*tStrides[1];
-
-            for(dim_t ti=0; ti<tDim0; ti++) {
-                tImgMean += (outType)tpl[tjStride+ti*tStrides[0]];
-            }
-        }
-        tImgMean /= winNumElements;
-    }
 
-    outType * dst      = out.get();
-    const inType * src = sImg.get();
-
-    for(dim_t b3=0; b3<sDims[3]; ++b3) {
-    for(dim_t b2=0; b2<sDims[2]; ++b2) {
-
-        // slide through image window after window
-        for(dim_t sj=0; sj<sDim1; sj++) {
-
-            dim_t ojStride = sj*oStrides[1];
-
-            for(dim_t si=0; si<sDim0; si++) {
-                outType disparity = outType(0);
-
-                // mean for window
-                // this variable will be used based on mType value
-                outType wImgMean = outType(0);
-                if (needMean) {
-                    for(dim_t tj=0,j=sj; tj<tDim1; tj++, j++) {
-                        dim_t jStride = j*sStrides[1];
-
-                        for(dim_t ti=0, i=si; ti<tDim0; ti++, i++) {
-                            inType sVal = ((j<sDim1 && i<sDim0) ?
-                                    src[jStride + i*sStrides[0]] : inType(0));
-                            wImgMean += (outType)sVal;
-                        }
-                    }
-                    wImgMean /= winNumElements;
-                }
+#include <kernel/match_template.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
-                // run the window match metric
-                for(dim_t tj=0,j=sj; tj<tDim1; tj++, j++) {
-                    dim_t jStride = j*sStrides[1];
-                    dim_t tjStride = tj*tStrides[1];
+#include <functional>
 
-                    for(dim_t ti=0, i=si; ti<tDim0; ti++, i++) {
-                        inType sVal = ((j<sDim1 && i<sDim0) ?
-                                            src[jStride + i*sStrides[0]] : inType(0));
-                        inType tVal = tpl[tjStride+ti*tStrides[0]];
-                        outType temp;
-                        switch(mType) {
-                            case AF_SAD:
-                                disparity += fabs((outType)sVal-(outType)tVal);
-                                break;
-                            case AF_ZSAD:
-                                disparity += fabs((outType)sVal - wImgMean -
-                                                  (outType)tVal + tImgMean);
-                                break;
-                            case AF_LSAD:
-                                disparity += fabs((outType)sVal-(wImgMean/tImgMean)*tVal);
-                                break;
-                            case AF_SSD:
-                                disparity += ((outType)sVal-(outType)tVal)*((outType)sVal-(outType)tVal);
-                                break;
-                            case AF_ZSSD:
-                                temp = ((outType)sVal - wImgMean - (outType)tVal + tImgMean);
-                                disparity += temp*temp;
-                                break;
-                            case AF_LSSD:
-                                temp = ((outType)sVal-(wImgMean/tImgMean)*tVal);
-                                disparity += temp*temp;
-                                break;
-                            case AF_NCC:
-                                //TODO: furture implementation
-                                break;
-                            case AF_ZNCC:
-                                //TODO: furture implementation
-                                break;
-                            case AF_SHD:
-                                //TODO: furture implementation
-                                break;
-                        }
-                    }
-                }
-                // output is just created, hence not doing the
-                // extra multiplication for 0th dim stride
-                dst[ojStride + si] = disparity;
-            }
-        }
-        src += sStrides[2];
-        dst += oStrides[2];
-    }
-        src += sStrides[3];
-        dst += oStrides[3];
-    }
+using af::dim4;
 
+namespace arrayfire {
+namespace cpu {
+
+template<typename To, typename Ti>
+using matchFunc = std::function<void(Param<To>, CParam<Ti>, CParam<Ti>)>;
+
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType) {
+    static const matchFunc<outType, inType> funcs[6] = {
+        kernel::matchTemplate<outType, inType, AF_SAD>,
+        kernel::matchTemplate<outType, inType, AF_ZSAD>,
+        kernel::matchTemplate<outType, inType, AF_LSAD>,
+        kernel::matchTemplate<outType, inType, AF_SSD>,
+        kernel::matchTemplate<outType, inType, AF_ZSSD>,
+        kernel::matchTemplate<outType, inType, AF_LSSD>,
+    };
+
+    Array<outType> out = createEmptyArray<outType>(sImg.dims());
+    getQueue().enqueue(funcs[static_cast<int>(mType)], out, sImg, tImg);
     return out;
 }
 
-#define INSTANTIATE(in_t, out_t)\
-    template Array<out_t> match_template<in_t, out_t, AF_SAD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SSD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_NCC >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZNCC>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SHD >(const Array<in_t> &sImg, const Array<in_t> &tImg);
+#define INSTANTIATE(in_t, out_t)                       \
+    template Array<out_t> match_template<in_t, out_t>( \
+        const Array<in_t> &, const Array<in_t> &, const af::matchType);
 
 INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
-
-}
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace cpu
+}  // namespace arrayfire
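Review note on `match_template.cpp`: the match type moves from a template parameter to a runtime argument, and a static table of `std::function` objects maps it back to the compile-time specializations. The pattern can be sketched in isolation as follows. `Metric`, `disparity` and `dispatchDisparity` are hypothetical names, the 1-D window is a simplification, and the explicit range check is an addition of this sketch that does not appear in the hunk above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical metric enum; not af::matchType.
enum class Metric { SAD = 0, SSD = 1 };

// Compile-time specialized kernel: the branch on M is resolved per
// instantiation, so each table entry runs a single metric.
template<Metric M>
float disparity(const std::vector<float> &window,
                const std::vector<float> &tmpl) {
    assert(window.size() == tmpl.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < window.size(); ++i) {
        const float d = window[i] - tmpl[i];
        acc += (M == Metric::SAD) ? std::fabs(d) : d * d;
    }
    return acc;
}

using DisparityFn = std::function<float(const std::vector<float> &,
                                        const std::vector<float> &)>;

// Runtime dispatch: index a static table with the enum value.
// ASSUMPTION: unsupported values are rejected with an exception here.
float dispatchDisparity(Metric m, const std::vector<float> &window,
                        const std::vector<float> &tmpl) {
    static const DisparityFn table[] = {
        disparity<Metric::SAD>,
        disparity<Metric::SSD>,
    };
    const std::size_t idx = static_cast<std::size_t>(m);
    if (idx >= sizeof(table) / sizeof(table[0])) {
        throw std::out_of_range("unsupported metric");
    }
    return table[idx](window, tmpl);
}
```

The table order has to match the enum's numeric values, which is why the sketch pins them explicitly.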
diff --git a/src/backend/cpu/match_template.hpp b/src/backend/cpu/match_template.hpp
index 777f645ff9..6fbbec0a9e 100644
--- a/src/backend/cpu/match_template.hpp
+++ b/src/backend/cpu/match_template.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
-
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg);
-
-}
+namespace arrayfire {
+namespace cpu {
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/math.cpp b/src/backend/cpu/math.cpp
index 1bff41774c..07b037a30a 100644
--- a/src/backend/cpu/math.cpp
+++ b/src/backend/cpu/math.cpp
@@ -6,46 +6,38 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <common/defines.hpp>
 #include <math.hpp>
+#include <complex>
 
-namespace cpu
-{
-    uint abs(uint val) { return val; }
-    uchar abs(uchar val) { return val; }
-    uintl abs(uintl val) { return val; }
-#if !(defined(OS_WIN) || (defined(ARCH_32) && defined(OS_LNX)))  // Not(Windows or Tegra)
-    size_t abs(size_t val) { return val; }
-#endif
-
-    cfloat  scalar(float val)
-    {
-        cfloat  cval = {(float)val, 0};
-        return cval;
-    }
-
-    cdouble scalar(double val)
-    {
-        cdouble  cval = {val, 0};
-        return cval;
-    }
-
-    cfloat min(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs : rhs;
-    }
-
-    cdouble min(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs : rhs;
-    }
-
-    cfloat max(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
-
-    cdouble max(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
+namespace arrayfire {
+namespace cpu {
+
+uint abs(uint val) { return val; }
+uchar abs(uchar val) { return val; }
+uintl abs(uintl val) { return val; }
+
+cfloat scalar(float val) {
+    cfloat cval = {val, 0};
+    return cval;
+}
+
+cdouble scalar(double val) {
+    cdouble cval = {val, 0};
+    return cval;
 }
+
+cfloat min(cfloat lhs, cfloat rhs) { return abs(lhs) < abs(rhs) ? lhs : rhs; }
+
+cdouble min(cdouble lhs, cdouble rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
+
+cfloat max(cfloat lhs, cfloat rhs) { return abs(lhs) > abs(rhs) ? lhs : rhs; }
+
+cdouble max(cdouble lhs, cdouble rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
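Review note on `math.cpp`: the complex overloads of `min` and `max` are only reformatted here; they still order values by magnitude, since complex numbers have no natural ordering. A standalone sketch of that rule, using `std::complex<float>` directly and the hypothetical names `minByMagnitude` and `maxByMagnitude`:

```cpp
#include <cassert>
#include <complex>

using cfloat = std::complex<float>;

// Pick the operand with the smaller magnitude; on a tie, rhs is returned.
cfloat minByMagnitude(cfloat lhs, cfloat rhs) {
    return std::abs(lhs) < std::abs(rhs) ? lhs : rhs;
}

// Pick the operand with the larger magnitude; on a tie, rhs is returned.
cfloat maxByMagnitude(cfloat lhs, cfloat rhs) {
    return std::abs(lhs) > std::abs(rhs) ? lhs : rhs;
}
```

Because both comparisons are strict, two values of equal magnitude resolve to the right-hand operand in either function.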
diff --git a/src/backend/cpu/math.hpp b/src/backend/cpu/math.hpp
index de4da0a8e5..06c1027edf 100644
--- a/src/backend/cpu/math.hpp
+++ b/src/backend/cpu/math.hpp
@@ -8,64 +8,141 @@
  ********************************************************/
 
 #pragma once
-#include <limits>
+
+#include <common/defines.hpp>
+#include <common/half.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+
 #include <algorithm>
+#include <climits>
+#include <limits>
 #include <numeric>
-#include "types.hpp"
-#include <af/defines.h>
 
-namespace cpu
-{
-    template<typename T> static inline T abs(T val) { return std::abs(val); }
-    uint abs(uint val);
-    uchar abs(uchar val);
-    uintl abs(uintl val);
-#if !(defined(OS_WIN) || (defined(ARCH_32) && defined(OS_LNX)))  // Not(Windows or Tegra)
-    size_t abs(size_t val);
-#endif
-
-    template<typename T> static inline T min(T lhs, T rhs) { return std::min(lhs, rhs); }
-    cfloat min(cfloat lhs, cfloat rhs);
-    cdouble min(cdouble lhs, cdouble rhs);
-
-    template<typename T> static inline T max(T lhs, T rhs) { return std::max(lhs, rhs); }
-    cfloat max(cfloat lhs, cfloat rhs);
-    cdouble max(cdouble lhs, cdouble rhs);
-
-    template<typename T> static inline T division(T lhs, double rhs) { return lhs / rhs; }
-
-    template<> STATIC_ cfloat division<cfloat>(cfloat lhs, double rhs)
-    {
-        cfloat retVal(real(lhs) / rhs, imag(lhs) / rhs);
-        return retVal;
-    }
-
-    template<> STATIC_ cdouble division<cdouble>(cdouble lhs, double rhs)
-    {
-        cdouble retVal(real(lhs) / rhs, imag(lhs) / rhs);
-        return retVal;
-    }
-
-    template <typename T> static inline T limit_max()
-    { return std::numeric_limits<T>::max(); }
-
-    template <typename T> static inline T limit_min()
-    { return std::numeric_limits<T>::min(); }
-
-    template<typename T>
-    static T scalar(double val)
-    {
-        return (T)(val);
-    }
-
-    template<typename To, typename Ti>
-    static To scalar(Ti real, Ti imag)
-    {
-        To  cval = {real, imag};
-        return cval;
-    }
-
-    cfloat  scalar(float val);
-
-    cdouble scalar(double val);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+static inline T abs(T val) {
+    return std::abs(val);
+}
+uint abs(uint val);
+uchar abs(uchar val);
+uintl abs(uintl val);
+
+template<typename T>
+static inline T min(T lhs, T rhs) {
+    return std::min(lhs, rhs);
+}
+cfloat min(cfloat lhs, cfloat rhs);
+cdouble min(cdouble lhs, cdouble rhs);
+
+template<typename T>
+static inline T max(T lhs, T rhs) {
+    return std::max(lhs, rhs);
+}
+cfloat max(cfloat lhs, cfloat rhs);
+cdouble max(cdouble lhs, cdouble rhs);
+
+template<typename T>
+static inline auto is_nan(const T &val) -> bool {
+    return false;
 }
+
+template<>
+inline auto is_nan<float>(const float &val) -> bool {
+    return std::isnan(val);
+}
+
+template<>
+inline auto is_nan<double>(const double &val) -> bool {
+    return std::isnan(val);
+}
+
+template<>
+inline auto is_nan<common::half>(const common::half &val) -> bool {
+    return isnan(val);
+}
+
+template<>
+inline auto is_nan<cfloat>(const cfloat &in) -> bool {
+    return std::isnan(real(in)) || std::isnan(imag(in));
+}
+
+template<>
+inline auto is_nan<cdouble>(const cdouble &in) -> bool {
+    return std::isnan(real(in)) || std::isnan(imag(in));
+}
+
+template<typename T>
+static inline T division(T lhs, double rhs) {
+    return lhs / rhs;
+}
+
+template<>
+inline cfloat division<cfloat>(cfloat lhs, double rhs) {
+    cfloat retVal(real(lhs) / static_cast<float>(rhs),
+                  imag(lhs) / static_cast<float>(rhs));
+    return retVal;
+}
+
+template<>
+inline cdouble division<cdouble>(cdouble lhs, double rhs) {
+    cdouble retVal(real(lhs) / rhs, imag(lhs) / rhs);
+    return retVal;
+}
+
+template<typename T>
+inline T maxval() {
+    return std::numeric_limits<T>::max();
+}
+template<typename T>
+inline T minval() {
+    return std::numeric_limits<T>::lowest();
+}
+template<>
+inline float maxval() {
+    return std::numeric_limits<float>::infinity();
+}
+template<>
+inline double maxval() {
+    return std::numeric_limits<double>::infinity();
+}
+template<>
+inline arrayfire::common::half maxval() {
+    return std::numeric_limits<arrayfire::common::half>::infinity();
+}
+template<>
+inline float minval() {
+    return -std::numeric_limits<float>::infinity();
+}
+template<>
+inline double minval() {
+    return -std::numeric_limits<double>::infinity();
+}
+template<>
+inline arrayfire::common::half minval() {
+    return -std::numeric_limits<arrayfire::common::half>::infinity();
+}
+
+template<typename T>
+static T scalar(double val) {
+    return T(val);
+}
+
+template<typename To, typename Ti>
+static To scalar(Ti real, Ti imag) {
+    To cval = {real, imag};
+    return cval;
+}
+
+cfloat scalar(float val);
+
+cdouble scalar(double val);
+
+inline double real(cdouble in) noexcept { return std::real(in); }
+inline float real(cfloat in) noexcept { return std::real(in); }
+inline double imag(cdouble in) noexcept { return std::imag(in); }
+inline float imag(cfloat in) noexcept { return std::imag(in); }
+
+}  // namespace cpu
+}  // namespace arrayfire
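Review note on `math.hpp`: `limit_min()` returned `std::numeric_limits<T>::min()`, which for floating-point types is the smallest positive normal value rather than the most negative one. The replacement `minval()` uses `lowest()` for the generic case and negative infinity for `float`, `double` and `half`. The sketch below shows why that matters when the value seeds a max-reduction. `minvalSketch` and `maxOf` are hypothetical names, and only the `float` and `double` specializations are reproduced.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Generic identity for a max-reduction: the lowest representable value.
template<typename T>
inline T minvalSketch() {
    return std::numeric_limits<T>::lowest();
}

// Floating-point types use -infinity, as in the diff above.
template<>
inline float minvalSketch<float>() {
    return -std::numeric_limits<float>::infinity();
}
template<>
inline double minvalSketch<double>() {
    return -std::numeric_limits<double>::infinity();
}

// Max-reduction seeded with minvalSketch<T>(). Seeding with
// numeric_limits<float>::min() (a small positive number) would return that
// seed for an all-negative input.
template<typename T>
T maxOf(const std::vector<T> &values) {
    T acc = minvalSketch<T>();
    for (const T &v : values) {
        if (v > acc) { acc = v; }
    }
    return acc;
}
```

For an all-negative `float` input such as `{-3.0f, -1.5f}`, the reduction returns `-1.5f`; an empty input returns the seed itself.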
diff --git a/src/backend/cpu/mean.cpp b/src/backend/cpu/mean.cpp
new file mode 100644
index 0000000000..2323442110
--- /dev/null
+++ b/src/backend/cpu/mean.cpp
@@ -0,0 +1,164 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/mean.hpp>
+#include <mean.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <complex>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename Ti, typename Tw, typename To>
+using mean_dim_func = std::function<void(
+    Param<To>, const dim_t, const CParam<Ti>, const dim_t, const int)>;
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti> &in, const int dim) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    static const mean_dim_func<Ti, Tw, To> mean_funcs[] = {
+        kernel::mean_dim<Ti, Tw, To, 1>(), kernel::mean_dim<Ti, Tw, To, 2>(),
+        kernel::mean_dim<Ti, Tw, To, 3>(), kernel::mean_dim<Ti, Tw, To, 4>()};
+
+    getQueue().enqueue(mean_funcs[in.ndims() - 1], out, 0, in, 0, dim);
+    return out;
+}
+
+template<typename T, typename Tw>
+using mean_weighted_dim_func =
+    std::function<void(Param<T>, const dim_t, const CParam<T>, const dim_t,
+                       const CParam<Tw>, const dim_t, const int)>;
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T> &in, const Array<Tw> &wt, const int dim) {
+    dim4 odims   = in.dims();
+    odims[dim]   = 1;
+    Array<T> out = createEmptyArray<T>(odims);
+    static const mean_weighted_dim_func<T, Tw> mean_funcs[] = {
+        kernel::mean_weighted_dim<T, Tw, 1>(),
+        kernel::mean_weighted_dim<T, Tw, 2>(),
+        kernel::mean_weighted_dim<T, Tw, 3>(),
+        kernel::mean_weighted_dim<T, Tw, 4>()};
+
+    getQueue().enqueue(mean_funcs[in.ndims() - 1], out, 0, in, 0, wt, 0, dim);
+    return out;
+}
+
+template<typename T, typename Tw>
+T mean(const Array<T> &in, const Array<Tw> &wt) {
+    using MeanOpT = kernel::MeanOp<compute_t<T>, compute_t<T>, compute_t<Tw>>;
+    in.eval();
+    wt.eval();
+    getQueue().sync();
+
+    af::dim4 dims    = in.dims();
+    af::dim4 strides = in.strides();
+    const T *inPtr   = in.get();
+    const Tw *wtPtr  = wt.get();
+
+    auto input  = compute_t<T>(inPtr[0]);
+    auto weight = compute_t<Tw>(wtPtr[0]);
+    MeanOpT Op(input, weight);
+
+    for (dim_t l = 0; l < dims[3]; l++) {
+        dim_t off3 = l * strides[3];
+
+        for (dim_t k = 0; k < dims[2]; k++) {
+            dim_t off2 = k * strides[2];
+
+            for (dim_t j = 0; j < dims[1]; j++) {
+                dim_t off1 = j * strides[1];
+
+                for (dim_t i = 0; i < dims[0]; i++) {
+                    dim_t idx = i + off1 + off2 + off3;
+                    Op(compute_t<T>(inPtr[idx]), compute_t<Tw>(wtPtr[idx]));
+                }
+            }
+        }
+    }
+
+    return T(Op.runningMean);
+}
+
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti> &in) {
+    using MeanOpT = kernel::MeanOp<compute_t<Ti>, compute_t<To>, compute_t<Tw>>;
+    in.eval();
+    getQueue().sync();
+
+    af::dim4 dims    = in.dims();
+    af::dim4 strides = in.strides();
+    const Ti *inPtr  = in.get();
+
+    MeanOpT Op(0, 0);
+
+    for (dim_t l = 0; l < dims[3]; l++) {
+        dim_t off3 = l * strides[3];
+
+        for (dim_t k = 0; k < dims[2]; k++) {
+            dim_t off2 = k * strides[2];
+
+            for (dim_t j = 0; j < dims[1]; j++) {
+                dim_t off1 = j * strides[1];
+
+                for (dim_t i = 0; i < dims[0]; i++) {
+                    dim_t idx = i + off1 + off2 + off3;
+                    Op(compute_t<Ti>(inPtr[idx]), 1);
+                }
+            }
+        }
+    }
+
+    return To(Op.runningMean);
+}
+
+#define INSTANTIATE(Ti, Tw, To)                        \
+    template To mean<Ti, Tw, To>(const Array<Ti> &in); \
+    template Array<To> mean<Ti, Tw, To>(const Array<Ti> &in, const int dim);
+
+INSTANTIATE(double, double, double);
+INSTANTIATE(float, float, float);
+INSTANTIATE(int, float, float);
+INSTANTIATE(unsigned, float, float);
+INSTANTIATE(intl, double, double);
+INSTANTIATE(uintl, double, double);
+INSTANTIATE(short, float, float);
+INSTANTIATE(ushort, float, float);
+INSTANTIATE(schar, float, float);
+INSTANTIATE(uchar, float, float);
+INSTANTIATE(char, float, float);
+INSTANTIATE(cfloat, float, cfloat);
+INSTANTIATE(cdouble, double, cdouble);
+INSTANTIATE(half, float, half);
+INSTANTIATE(half, float, float);
+
+#define INSTANTIATE_WGT(T, Tw)                                              \
+    template T mean<T, Tw>(const Array<T> &in, const Array<Tw> &wts);       \
+    template Array<T> mean<T, Tw>(const Array<T> &in, const Array<Tw> &wts, \
+                                  const int dim);
+
+INSTANTIATE_WGT(double, double);
+INSTANTIATE_WGT(float, float);
+INSTANTIATE_WGT(cfloat, float);
+INSTANTIATE_WGT(cdouble, double);
+INSTANTIATE_WGT(half, float);
+
+}  // namespace cpu
+}  // namespace arrayfire
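
Note on the scalar `mean` overloads in `mean.cpp`: both fold every element into a `kernel::MeanOp` accumulator and return its `runningMean`. The accumulator itself is defined in `kernel/mean.hpp`, which is not part of this hunk. The sketch below shows one common way to keep a running weighted mean, assuming an incremental update rule. `RunningMean` and `weightedMean` are hypothetical names for illustration and are not ArrayFire's implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical accumulator (not ArrayFire's kernel::MeanOp): keeps a running
// weighted mean together with the total weight folded in so far.
struct RunningMean {
    double mean   = 0.0;
    double weight = 0.0;

    // Fold one (value, weight) pair into the running mean.
    void operator()(double value, double w) {
        const double total = weight + w;
        if (total == 0.0) { return; }  // no weight yet, nothing to update
        mean += (value - mean) * (w / total);
        weight = total;
    }
};

// Weighted mean of `values`; assumes weights.size() == values.size().
inline double weightedMean(const std::vector<double> &values,
                           const std::vector<double> &weights) {
    RunningMean op;
    for (std::size_t i = 0; i < values.size(); ++i) {
        op(values[i], weights[i]);
    }
    return op.mean;
}
```

Updating the mean incrementally avoids building one large sum first, which is a plausible reason to accumulate this way for narrow types such as `half`.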
diff --git a/src/backend/cpu/mean.hpp b/src/backend/cpu/mean.hpp
new file mode 100644
index 0000000000..7079a91528
--- /dev/null
+++ b/src/backend/cpu/mean.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim);
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wt, const int dim);
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts);
+
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/meanshift.cpp b/src/backend/cpu/meanshift.cpp
index 86e1d6eea2..878aa4cacb 100644
--- a/src/backend/cpu/meanshift.cpp
+++ b/src/backend/cpu/meanshift.cpp
@@ -7,153 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <meanshift.hpp>
-#include <cmath>
-#include <algorithm>
 #include <err_cpu.hpp>
+#include <kernel/meanshift.hpp>
 #include <math.hpp>
+#include <meanshift.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
+#include <algorithm>
+#include <cmath>
 
 using af::dim4;
 using std::vector;
 
-namespace cpu
-{
-
-inline dim_t clamp(dim_t a, dim_t mn, dim_t mx)
-{
-    return (a<mn ? mn : (a>mx ? mx : a));
-}
-
-template<typename T, bool is_color>
-Array<T>  meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter)
-{
-    const dim4 dims     = in.dims();
-    const dim4 istrides = in.strides();
-    Array<T> out        = createEmptyArray<T>(dims);
-    const dim4 ostrides = out.strides();
-
-    const dim_t bCount   = (is_color ? 1 : dims[2]);
-    const dim_t channels = (is_color ? dims[2] : 1);
-
-    // clamp spatical and chromatic sigma's
-    float space_          = std::min(11.5f, s_sigma);
-    const dim_t radius = std::max((int)(space_ * 1.5f), 1);
-    const float cvar      = c_sigma*c_sigma;
-
-    vector<float> means;
-    vector<float> centers;
-    vector<float> tmpclrs;
-    means.reserve(channels);
-    centers.reserve(channels);
-    tmpclrs.reserve(channels);
-
-    T *outData       = out.get();
-    const T * inData = in.get();
-
-    for(dim_t b3=0; b3<dims[3]; ++b3) {
-        for(dim_t b2=0; b2<bCount; ++b2) {
-
-            for(dim_t j=0; j<dims[1]; ++j) {
-
-                dim_t j_in_off  = j*istrides[1];
-                dim_t j_out_off = j*ostrides[1];
-
-                for(dim_t i=0; i<dims[0]; ++i) {
-
-                    dim_t i_in_off  = i*istrides[0];
-                    dim_t i_out_off = i*ostrides[0];
-
-                    // clear means and centers for this pixel
-                    for(dim_t ch=0; ch<channels; ++ch) {
-                        means[ch] = 0.0f;
-                        // the expression ch*istrides[2] will only effect when ch>1
-                        // i.e for color images where batch is along fourth dimension
-                        centers[ch] = inData[j_in_off + i_in_off + ch*istrides[2]];
-                    }
-
-                    // scope of meanshift iterationd begin
-                    for(unsigned it=0; it<iter; ++it) {
-
-                        int count   = 0;
-                        int shift_x = 0;
-                        int shift_y = 0;
-
-                        for(dim_t wj=-radius; wj<=radius; ++wj) {
-
-                            int hit_count = 0;
-
-                            for(dim_t wi=-radius; wi<=radius; ++wi) {
-
-                                dim_t tj = j + wj;
-                                dim_t ti = i + wi;
-
-                                // clamps offsets
-                                tj = clamp(tj, 0ll, dims[1]-1);
-                                ti = clamp(ti, 0ll, dims[0]-1);
-
-                                // proceed
-                                float norm = 0.0f;
-                                for(dim_t ch=0; ch<channels; ++ch) {
-                                    tmpclrs[ch] = inData[ tj*istrides[1] + ti*istrides[0] + ch*istrides[2]];
-                                    norm += (centers[ch]-tmpclrs[ch]) * (centers[ch]-tmpclrs[ch]);
-                                }
-
-                                if (norm<= cvar) {
-                                    for(dim_t ch=0; ch<channels; ++ch)
-                                        means[ch] += tmpclrs[ch];
-                                    shift_x += wi;
-                                    ++hit_count;
-                                }
-
-                            }
-                            count+= hit_count;
-                            shift_y += wj*hit_count;
-                        }
-
-                        if (count==0) { break; }
-
-                        const float fcount = 1.f/count;
-                        const int mean_x = (int)(shift_x*fcount+0.5f);
-                        const int mean_y = (int)(shift_y*fcount+0.5f);
-                        for(dim_t ch=0; ch<channels; ++ch)
-                            means[ch] *= fcount;
-
-                        float norm = 0.f;
-                        for(dim_t ch=0; ch<channels; ++ch)
-                            norm += ((means[ch]-centers[ch])*(means[ch]-centers[ch]));
-                        bool stop = ((abs(shift_y-mean_y)+abs(shift_x-mean_x)) + norm) <= 1;
-                        shift_x = mean_x;
-                        shift_y = mean_y;
-                        for(dim_t ch=0; ch<channels; ++ch)
-                            centers[ch] = means[ch];
-                        if (stop) { break; }
-                    } // scope of meanshift iterations end
-
-                    for(dim_t ch=0; ch<channels; ++ch)
-                        outData[j_out_off + i_out_off + ch*ostrides[2]] = centers[ch];
-
-                }
-            }
-            outData += ostrides[2];
-            inData  += istrides[2];
-        }
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+
+    if (isColor) {
+        getQueue().enqueue(kernel::meanShift<T, true>, out, in, spatialSigma,
+                           chromaticSigma, numIterations);
+    } else {
+        getQueue().enqueue(kernel::meanShift<T, false>, out, in, spatialSigma,
+                           chromaticSigma, numIterations);
     }
+
     return out;
 }
 
-#define INSTANTIATE(T) \
-    template Array<T>  meanshift<T, true >(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter); \
-    template Array<T>  meanshift<T, false>(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
+#define INSTANTIATE(T)                                              \
+    template Array<T> meanshift<T>(const Array<T> &, const float &, \
+                                   const float &, const unsigned &, \
+                                   const bool &);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/meanshift.hpp b/src/backend/cpu/meanshift.hpp
index 1a57807cce..c17d922414 100644
--- a/src/backend/cpu/meanshift.hpp
+++ b/src/backend/cpu/meanshift.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
-
-template<typename T, bool is_color>
-Array<T> meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
-
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/medfilt.cpp b/src/backend/cpu/medfilt.cpp
index 1047a52723..4c952fc762 100644
--- a/src/backend/cpu/medfilt.cpp
+++ b/src/backend/cpu/medfilt.cpp
@@ -7,143 +7,66 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <medfilt.hpp>
-#include <err_cpu.hpp>
-#include <algorithm>
-
-using af::dim4;
-
-namespace cpu
-{
-
-template<typename T, af_border_type pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid)
-{
-    const dim4 dims     = in.dims();
-    const dim4 istrides = in.strides();
-    Array<T> out        = createEmptyArray<T>(dims);
-    const dim4 ostrides = out.strides();
-
-    std::vector<T> wind_vals;
-    wind_vals.reserve(w_len*w_wid);
-
-    T const * in_ptr = in.get();
-    T * out_ptr = out.get();
-
-    for(int b3=0; b3<(int)dims[3]; b3++) {
-
-        for(int b2=0; b2<(int)dims[2]; b2++) {
-
-            for(int col=0; col<(int)dims[1]; col++) {
-
-                int ocol_off = col*ostrides[1];
-
-                for(int row=0; row<(int)dims[0]; row++) {
 
-                    wind_vals.clear();
-
-                    for(int wj=0; wj<(int)w_wid; ++wj) {
-
-                        bool isColOff = false;
-
-                        int im_col = col + wj-w_wid/2;
-                        int im_coff;
-                        switch(pad) {
-                            case AF_PAD_ZERO:
-                                im_coff = im_col * istrides[1];
-                                if (im_col < 0 || im_col>=(int)dims[1])
-                                    isColOff = true;
-                                break;
-                            case AF_PAD_SYM:
-                                {
-                                    if (im_col < 0) {
-                                        im_col *= -1;
-                                        isColOff = true;
-                                    }
-
-                                    if (im_col>=(int)dims[1]) {
-                                        im_col = 2*((int)dims[1]-1) - im_col;
-                                        isColOff = true;
-                                    }
-
-                                    im_coff = im_col * istrides[1];
-                                }
-                                break;
-                        }
-
-                        for(int wi=0; wi<(int)w_len; ++wi) {
+#include <Array.hpp>
+#include <kernel/medfilt.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 
-                            bool isRowOff = false;
+#include <functional>
 
-                            int im_row = row + wi-w_len/2;
-                            int im_roff;
-                            switch(pad) {
-                                case AF_PAD_ZERO:
-                                    im_roff = im_row * istrides[0];
-                                    if (im_row < 0 || im_row>=(int)dims[0])
-                                        isRowOff = true;
-                                    break;
-                                case AF_PAD_SYM:
-                                    {
-                                        if (im_row < 0) {
-                                            im_row *= -1;
-                                            isRowOff = true;
-                                        }
+using af::dim4;
 
-                                        if (im_row>=(int)dims[0]) {
-                                            im_row = 2*((int)dims[0]-1) - im_row;
-                                            isRowOff = true;
-                                        }
+namespace arrayfire {
+namespace cpu {
 
-                                        im_roff = im_row * istrides[0];
-                                    }
-                                    break;
-                            }
+template<typename T>
+using medianFilter1 = std::function<void(Param<T>, CParam<T>, dim_t)>;
 
-                            if(isRowOff || isColOff) {
-                                switch(pad) {
-                                    case AF_PAD_ZERO:
-                                        wind_vals.push_back(0);
-                                        break;
-                                    case AF_PAD_SYM:
-                                        wind_vals.push_back(in_ptr[im_coff+im_roff]);
-                                        break;
-                                }
-                            } else
-                                wind_vals.push_back(in_ptr[im_coff+im_roff]);
-                        }
-                    }
+template<typename T>
+using medianFilter2 = std::function<void(Param<T>, CParam<T>, dim_t, dim_t)>;
 
-                    std::stable_sort(wind_vals.begin(),wind_vals.end());
-                    int off = wind_vals.size()/2;
-                    if (wind_vals.size()%2==0)
-                        out_ptr[ocol_off+row*ostrides[0]] = (wind_vals[off]+wind_vals[off-1])/2;
-                    else {
-                        out_ptr[ocol_off+row*ostrides[0]] = wind_vals[off];
-                    }
-                }
-            }
-            in_ptr  += istrides[2];
-            out_ptr += ostrides[2];
-        }
-    }
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType pad) {
+    static const medianFilter1<T> funcs[2] = {
+        kernel::medfilt1<T, AF_PAD_ZERO>,
+        kernel::medfilt1<T, AF_PAD_SYM>,
+    };
+    Array<T> out = createEmptyArray<T>(in.dims());
+    getQueue().enqueue(funcs[static_cast<int>(pad)], out, in, w_wid);
+    return out;
+}
 
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType pad) {
+    static const medianFilter2<T> funcs[2] = {
+        kernel::medfilt2<T, AF_PAD_ZERO>,
+        kernel::medfilt2<T, AF_PAD_SYM>,
+    };
+    Array<T> out = createEmptyArray<T>(in.dims());
+    getQueue().enqueue(funcs[static_cast<int>(pad)], out, in, w_len, w_wid);
     return out;
 }
 
-#define INSTANTIATE(T)\
-    template Array<T> medfilt<T, AF_PAD_ZERO     >(const Array<T> &in, dim_t w_len, dim_t w_wid); \
-    template Array<T> medfilt<T, AF_PAD_SYM>(const Array<T> &in, dim_t w_len, dim_t w_wid);
+#define INSTANTIATE(T)                                                 \
+    template Array<T> medfilt1<T>(const Array<T> &in, const int w_wid, \
+                                  const af::borderType);               \
+    template Array<T> medfilt2<T>(const Array<T> &in, const int w_len, \
+                                  const int w_wid, const af::borderType);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cpu
+}  // namespace arrayfire
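
Note on the new `medfilt1`/`medfilt2`: the border type is now a run-time argument, and the matching compile-time kernel instantiation is chosen by indexing a static table with the enum value cast to `int`. The sketch below shows that dispatch pattern in isolation. `Pad`, `readPadded` and `readWith` are hypothetical names, and the sketch assumes the enum values match the table order, as the code in the diff also appears to do.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-in for a border-type enum; its values double as indices
// into the dispatch table below.
enum class Pad : int { Zero = 0, Sym = 1 };

// Toy "kernel": read in[idx] with the out-of-range policy fixed at compile
// time. Assumes a non-empty input and an overshoot smaller than its length.
template<Pad pad>
int readPadded(const std::vector<int> &in, int idx) {
    const int n = static_cast<int>(in.size());
    if (idx >= 0 && idx < n) { return in[idx]; }
    if (pad == Pad::Zero) { return 0; }
    // Symmetric padding: reflect the index about the nearest edge sample.
    const int reflected = idx < 0 ? -idx : 2 * (n - 1) - idx;
    return in[reflected];
}

using ReadFn = std::function<int(const std::vector<int> &, int)>;

// Run-time dispatch to the compile-time specialization selected by `pad`.
inline int readWith(Pad pad, const std::vector<int> &in, int idx) {
    static const ReadFn funcs[2] = {
        readPadded<Pad::Zero>,
        readPadded<Pad::Sym>,
    };
    return funcs[static_cast<int>(pad)](in, idx);
}
```

A table like this only stays correct while every enum value passed in has an entry, so a caller would need to reject any other border type before indexing.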
diff --git a/src/backend/cpu/medfilt.hpp b/src/backend/cpu/medfilt.hpp
index 81234ee997..5d9f8e688c 100644
--- a/src/backend/cpu/medfilt.hpp
+++ b/src/backend/cpu/medfilt.hpp
@@ -9,10 +9,16 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
-template<typename T, af_border_type edge_pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid);
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType edge_pad);
 
-}
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType edge_pad);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/memory.cpp b/src/backend/cpu/memory.cpp
index 12e6f6a5cf..0a32186f2e 100644
--- a/src/backend/cpu/memory.cpp
+++ b/src/backend/cpu/memory.cpp
@@ -8,225 +8,154 @@
  ********************************************************/
 
 #include <memory.hpp>
+
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/half.hpp>
 #include <err_cpu.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <spdlog/spdlog.h>
 #include <types.hpp>
-#include <map>
-#include <dispatch.hpp>
-#include <cstdlib>
-#include <mutex>
-
-namespace cpu
-{
-
-    static size_t memory_resolution = 1024; //1KB
-
-    void setMemStepSize(size_t step_bytes)
-    {
-        memory_resolution = step_bytes;
-    }
-
-    size_t getMemStepSize(void)
-    {
-        return memory_resolution;
-    }
-
-    class Manager
-    {
-        public:
-        static bool initialized;
-        Manager()
-        {
-            initialized = true;
-        }
-
-        ~Manager()
-        {
-            garbageCollect();
-        }
-    };
-
-    bool Manager::initialized = false;
-
-    static void managerInit()
-    {
-        if(Manager::initialized == false)
-            static Manager pm = Manager();
-    }
-
-    typedef struct
-    {
-        bool is_free;
-        bool is_unlinked;
-        size_t bytes;
-    } mem_info;
-
-    static size_t used_bytes = 0;
-    static size_t used_buffers = 0;
-    static size_t total_bytes = 0;
-    typedef std::map<void *, mem_info> mem_t;
-    typedef mem_t::iterator mem_iter;
-
-    mem_t memory_map;
-    std::mutex memory_map_mutex;
-
-    template<typename T>
-    void freeWrapper(T *ptr)
-    {
-        free((void *)ptr);
-    }
-
-    void garbageCollect()
-    {
-        for(mem_iter iter = memory_map.begin();
-            iter != memory_map.end(); ++iter) {
-
-            if ((iter->second).is_free) {
-
-                if (!(iter->second).is_unlinked) {
-                    freeWrapper(iter->first);
-                }
-                total_bytes -= iter->second.bytes;
-            }
-        }
-
-        mem_iter memory_curr = memory_map.begin();
-        mem_iter memory_end  = memory_map.end();
-
-        while(memory_curr != memory_end) {
-            if (memory_curr->second.is_free) {
-                memory_map.erase(memory_curr++);
-            } else {
-                ++memory_curr;
-            }
-        }
-    }
-
-    template<typename T>
-    T* memAlloc(const size_t &elements)
-    {
-        managerInit();
-
-        T* ptr = NULL;
-        size_t alloc_bytes = divup(sizeof(T) * elements, memory_resolution) * memory_resolution;
-
-        if (elements > 0) {
-            std::lock_guard<std::mutex> lock(memory_map_mutex);
-
-            // FIXME: Add better checks for garbage collection
-            // Perhaps look at total memory available as a metric
-            if (memory_map.size() > MAX_BUFFERS ||
-                used_bytes >= MAX_BYTES) {
-
-                garbageCollect();
-            }
+#include <af/dim4.hpp>
+
+#include <utility>
+
+using af::dim4;
+using arrayfire::common::bytesToString;
+using arrayfire::common::half;
+using std::function;
+using std::move;
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace cpu {
+float getMemoryPressure() { return memoryManager().getMemoryPressure(); }
+float getMemoryPressureThreshold() {
+    return memoryManager().getMemoryPressureThreshold();
+}
 
-            for(mem_iter iter = memory_map.begin();
-                iter != memory_map.end(); ++iter) {
+bool jitTreeExceedsMemoryPressure(size_t bytes) {
+    return memoryManager().jitTreeExceedsMemoryPressure(bytes);
+}
 
-                mem_info info = iter->second;
+void setMemStepSize(size_t step_bytes) {
+    memoryManager().setMemStepSize(step_bytes);
+}
 
-                if ( info.is_free &&
-                    !info.is_unlinked &&
-                     info.bytes == alloc_bytes) {
+size_t getMemStepSize() { return memoryManager().getMemStepSize(); }
 
-                    iter->second.is_free = false;
-                    used_bytes += alloc_bytes;
-                    used_buffers++;
-                    return (T *)iter->first;
-                }
-            }
+void signalMemoryCleanup() { memoryManager().signalMemoryCleanup(); }
 
-            // Perform garbage collection if memory can not be allocated
-            ptr = (T *)malloc(alloc_bytes);
+void shutdownMemoryManager() { memoryManager().shutdown(); }
 
-            mem_info info = {false, false, alloc_bytes};
-            memory_map[ptr] = info;
+void printMemInfo(const char *msg, const int device) {
+    memoryManager().printInfo(msg, device);
+}
 
-            used_bytes += alloc_bytes;
-            used_buffers++;
-            total_bytes += alloc_bytes;
-        }
-        return ptr;
-    }
+template<typename T>
+unique_ptr<T[], function<void(void *)>> memAlloc(const size_t &elements) {
+    // TODO: make memAlloc aware of array shapes
+    dim4 dims(elements);
+    T *ptr = static_cast<T *>(
+        memoryManager().alloc(false, 1, dims.get(), sizeof(T)));
+    return unique_ptr<T[], function<void(void *)>>(ptr, memFree);
+}
 
-    template<typename T>
-    void memFree(T *ptr)
-    {
-        std::lock_guard<std::mutex> lock(memory_map_mutex);
+void *memAllocUser(const size_t &bytes) {
+    dim4 dims(bytes);
+    void *ptr = memoryManager().alloc(true, 1, dims.get(), 1);
+    return ptr;
+}
 
-        mem_iter iter = memory_map.find((void *)ptr);
+void memFree(void *ptr) { return memoryManager().unlock(ptr, false); }
 
-        if (iter != memory_map.end()) {
-            if ((iter->second).is_unlinked) return;
+void memFreeUser(void *ptr) { memoryManager().unlock(ptr, true); }
 
-            iter->second.is_free = true;
-            used_bytes -= iter->second.bytes;
-            used_buffers--;
+void memLock(const void *ptr) { memoryManager().userLock(ptr); }
 
-        } else {
-            freeWrapper(ptr); // Free it because we are not sure what the size is
-        }
-    }
+bool isLocked(const void *ptr) { return memoryManager().isUserLocked(ptr); }
 
-    template<typename T>
-    void memUnlink(T *ptr)
-    {
-        std::lock_guard<std::mutex> lock(memory_map_mutex);
+void memUnlock(const void *ptr) { memoryManager().userUnlock(ptr); }
 
-        mem_iter iter = memory_map.find((void *)ptr);
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers) {
+    memoryManager().usageInfo(alloc_bytes, alloc_buffers, lock_bytes,
+                              lock_buffers);
+}
 
-        if (iter != memory_map.end()) {
+template<typename T>
+T *pinnedAlloc(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = memoryManager().alloc(false, 1, dims.get(), sizeof(T));
+    return static_cast<T *>(ptr);
+}
 
-            iter->second.is_unlinked = true;
-            iter->second.is_free = true;
-            used_bytes -= iter->second.bytes;
-            used_buffers--;
+void pinnedFree(void *ptr) { memoryManager().unlock(ptr, false); }
+
+#define INSTANTIATE(T)                                                   \
+    template std::unique_ptr<T[], std::function<void(void *)>> memAlloc( \
+        const size_t &elements);                                         \
+    template T *pinnedAlloc(const size_t &elements);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
+
+template<>
+void *pinnedAlloc<void>(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = memoryManager().alloc(false, 1, dims.get(), 1);
+    return ptr;
+}
 
-        } else {
-            mem_info info = { false,
-                              false,
-                              100 }; //This number is not relevant
+Allocator::Allocator() { logger = common::loggerFactory("mem"); }
 
-            memory_map[ptr] = info;
+void Allocator::shutdown() {
+    for (int n = 0; n < cpu::getDeviceCount(); n++) {
+        try {
+            cpu::setDevice(n);
+            shutdownMemoryManager();
+        } catch (const AfError &err) {
+            continue;  // Do not throw any errors while shutting down
         }
     }
+}
 
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers)
-    {
-        if (alloc_bytes   ) *alloc_bytes   = total_bytes;
-        if (alloc_buffers ) *alloc_buffers = memory_map.size();
-        if (lock_bytes    ) *lock_bytes    = used_bytes;
-        if (lock_buffers  ) *lock_buffers  = used_buffers;
-    }
+int Allocator::getActiveDeviceId() {
+    return static_cast<int>(cpu::getActiveDeviceId());
+}
 
-    template<typename T>
-    T* pinnedAlloc(const size_t &elements)
-    {
-        return memAlloc<T>(elements);
-    }
+size_t Allocator::getMaxMemorySize(int id) {
+    return cpu::getDeviceMemorySize(id);
+}
 
-    template<typename T>
-    void pinnedFree(T* ptr)
-    {
-        memFree<T>(ptr);
-    }
+void *Allocator::nativeAlloc(const size_t bytes) {
+    void *ptr = malloc(bytes);  // NOLINT(hicpp-no-malloc)
+    AF_TRACE("nativeAlloc: {:>7} {}", bytesToString(bytes), ptr);
+    if (!ptr) { AF_ERROR("Unable to allocate memory", AF_ERR_NO_MEM); }
+    return ptr;
+}
 
-#define INSTANTIATE(T)                              \
-    template T* memAlloc(const size_t &elements);   \
-    template void memFree(T* ptr);                  \
-    template void memUnlink(T* ptr);                \
-    template T* pinnedAlloc(const size_t &elements);\
-    template void pinnedFree(T* ptr);               \
-
-    INSTANTIATE(float)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+void Allocator::nativeFree(void *ptr) {
+    AF_TRACE("nativeFree: {: >8} {}", " ", ptr);
+    // Make sure this pointer is not being used on the queue before freeing the
+    // memory.
+    getQueue().sync();
+    free(ptr);  // NOLINT(hicpp-no-malloc)
 }
+}  // namespace cpu
+}  // namespace arrayfire
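
Note on the new `memAlloc`: it returns a `std::unique_ptr<T[], std::function<void(void *)>>` whose deleter is the untyped `memFree`, so a buffer goes back to the memory manager when the pointer leaves scope. The sketch below shows the same ownership pattern with plain `malloc`/`free` and a counter in place of the memory manager. `allocBuffer`, `releaseBuffer`, `buffer_ptr` and `liveBuffers` are hypothetical names for illustration.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>

// Hypothetical counter standing in for a memory manager's bookkeeping.
inline int &liveBuffers() {
    static int count = 0;
    return count;
}

// Untyped release function, so one function can serve as the deleter for
// every element type.
inline void releaseBuffer(void *ptr) {
    if (ptr) { --liveBuffers(); }
    std::free(ptr);
}

template<typename T>
using buffer_ptr = std::unique_ptr<T[], std::function<void(void *)>>;

// Allocate `elements` objects of trivial type T; ownership is tied to the
// returned pointer, which calls releaseBuffer on destruction.
template<typename T>
buffer_ptr<T> allocBuffer(std::size_t elements) {
    T *ptr = static_cast<T *>(std::malloc(elements * sizeof(T)));
    if (ptr) { ++liveBuffers(); }
    return buffer_ptr<T>(ptr, releaseBuffer);
}
```

Type-erasing the deleter with `std::function` keeps the return type independent of which release function is used, at the cost of a larger pointer object than one with a stateless deleter.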
diff --git a/src/backend/cpu/memory.hpp b/src/backend/cpu/memory.hpp
index 242e48663f..908136d094 100644
--- a/src/backend/cpu/memory.hpp
+++ b/src/backend/cpu/memory.hpp
@@ -8,24 +8,60 @@
  ********************************************************/
 #pragma once
 
+#include <common/AllocatorInterface.hpp>
 #include <af/defines.h>
-namespace cpu
-{
-    template<typename T> T* memAlloc(const size_t &elements);
-    template<typename T> void memFree(T* ptr);
-    template<typename T> void memUnlink(T *ptr);
-
-    template<typename T> T* pinnedAlloc(const size_t &elements);
-    template<typename T> void pinnedFree(T* ptr);
-
-    static const unsigned MAX_BUFFERS = 100;
-    static const unsigned MAX_BYTES = 100 * (1 << 20);
-
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers);
-    void garbageCollect();
-    void pinnedGarbageCollect();
-
-    void setMemStepSize(size_t step_bytes);
-    size_t getMemStepSize(void);
-}
+
+#include <functional>
+#include <memory>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+using uptr = std::unique_ptr<T[], std::function<void(T[])>>;
+
+template<typename T>
+std::unique_ptr<T[], std::function<void(void *)>> memAlloc(
+    const size_t &elements);
+void *memAllocUser(const size_t &bytes);
+
+// Need these as two separate functions and not one with a default argument.
+// This is because each is used as the deleter in a shared pointer,
+// which cannot support default arguments.
+void memFree(void *ptr);
+void memFreeUser(void *ptr);
+
+void memLock(const void *ptr);
+void memUnlock(const void *ptr);
+bool isLocked(const void *ptr);
+
+template<typename T>
+T *pinnedAlloc(const size_t &elements);
+void pinnedFree(void *ptr);
+
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers);
+void signalMemoryCleanup();
+void shutdownMemoryManager();
+void pinnedGarbageCollect();
+
+void printMemInfo(const char *msg, const int device);
+
+float getMemoryPressure();
+float getMemoryPressureThreshold();
+bool jitTreeExceedsMemoryPressure(size_t bytes);
+void setMemStepSize(size_t step_bytes);
+size_t getMemStepSize(void);
+
+class Allocator final : public common::AllocatorInterface {
+   public:
+    Allocator();
+    ~Allocator() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+};
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/moments.cpp b/src/backend/cpu/moments.cpp
new file mode 100644
index 0000000000..09db606bd4
--- /dev/null
+++ b/src/backend/cpu/moments.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cpu.hpp>
+#include <kernel/moments.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+
+static inline unsigned bitCount(unsigned v) {
+    v = v - ((v >> 1U) & 0x55555555U);
+    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
+    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
+}
+
+using af::dim4;
+
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment) {
+    dim4 odims, idims = in.dims();
+    dim_t moments_dim = bitCount(moment);
+
+    odims[0] = moments_dim;
+    odims[1] = 1;
+    odims[2] = idims[2];
+    odims[3] = idims[3];
+
+    Array<float> out = createValueArray<float>(odims, 0.f);
+    getQueue().enqueue(kernel::moments<T>, out, in, moment);
+    getQueue().sync();
+    return out;
+}
+
+#define INSTANTIATE(T)                                   \
+    template Array<float> moments<T>(const Array<T> &in, \
+                                     const af_moment_type moment);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/moments.hpp b/src/backend/cpu/moments.hpp
new file mode 100644
index 0000000000..43793307da
--- /dev/null
+++ b/src/backend/cpu/moments.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/morph.cpp b/src/backend/cpu/morph.cpp
index ff7b49d0de..e526e7c066 100644
--- a/src/backend/cpu/morph.cpp
+++ b/src/backend/cpu/morph.cpp
@@ -7,166 +7,69 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <copy.hpp>
+#include <kernel/morph.hpp>
 #include <morph.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/dim4.hpp>
 #include <algorithm>
 
 using af::dim4;
 
-namespace cpu
-{
-
-static inline unsigned getIdx(const dim4 &strides,
-        int i, int j = 0, int k = 0, int l = 0)
-{
-    return (l * strides[3] +
-            k * strides[2] +
-            j * strides[1] +
-            i * strides[0]);
-}
-
-template<typename T, bool isDilation>
-Array<T> morph(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 dims       = in.dims();
-    const dim4 window     = mask.dims();
-    const dim_t R0     = window[0]/2;
-    const dim_t R1     = window[1]/2;
-    const dim4 istrides   = in.strides();
-    const dim4 fstrides   = mask.strides();
-
-    Array<T> out         = createEmptyArray<T>(dims);
-    const dim4 ostrides   = out.strides();
-
-    T* outData            = out.get();
-    const T*   inData     = in.get();
-    const T*   filter     = mask.get();
-
-    for(dim_t b3=0; b3<dims[3]; ++b3) {
-        for(dim_t b2=0; b2<dims[2]; ++b2) {
-            // either channels or batch is handled by outer most loop
-            for(dim_t j=0; j<dims[1]; ++j) {
-                // j steps along 2nd dimension
-                for(dim_t i=0; i<dims[0]; ++i) {
-                    // i steps along 1st dimension
-                    T filterResult = inData[ getIdx(istrides, i, j) ];
-
-                    // wj,wi steps along 2nd & 1st dimensions of filter window respectively
-                    for(dim_t wj=0; wj<window[1]; wj++) {
-                        for(dim_t wi=0; wi<window[0]; wi++) {
-
-                            dim_t offj = j+wj-R1;
-                            dim_t offi = i+wi-R0;
-
-                            T maskValue = filter[ getIdx(fstrides, wi, wj) ];
-
-                            if ((maskValue > (T)0) && offi>=0 && offj>=0 && offi<dims[0] && offj<dims[1]) {
-
-                                T inValue   = inData[ getIdx(istrides, offi, offj) ];
-
-                                if (isDilation)
-                                    filterResult = std::max(filterResult, inValue);
-                                else
-                                    filterResult = std::min(filterResult, inValue);
-                            }
-
-                        } // window 1st dimension loop ends here
-                    } // filter window loop ends here
-
-                    outData[ getIdx(ostrides, i, j) ] = filterResult;
-                } //1st dimension loop ends here
-            } // 2nd dimension loop ends here
-
-            // next iteration will be next batch if any
-            outData += ostrides[2];
-            inData  += istrides[2];
-        }
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    af::borderType padType = isDilation ? AF_PAD_ZERO : AF_PAD_CLAMP_TO_EDGE;
+    const af::dim4 &idims  = in.dims();
+    const af::dim4 &mdims  = mask.dims();
+
+    const af::dim4 lpad(mdims[0] / 2, mdims[1] / 2, 0, 0);
+    const af::dim4 &upad(lpad);
+    const af::dim4 odims(lpad[0] + idims[0] + upad[0],
+                         lpad[1] + idims[1] + upad[1], idims[2], idims[3]);
+
+    auto out = createEmptyArray<T>(odims);
+    auto inp = padArrayBorders(in, lpad, upad, padType);
+
+    if (isDilation) {
+        getQueue().enqueue(kernel::morph<T, true>, out, inp, mask);
+    } else {
+        getQueue().enqueue(kernel::morph<T, false>, out, inp, mask);
     }
 
-    return out;
-}
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 dims       = in.dims();
-    const dim4 window     = mask.dims();
-    const dim_t R0     = window[0]/2;
-    const dim_t R1     = window[1]/2;
-    const dim_t R2     = window[2]/2;
-    const dim4 istrides   = in.strides();
-    const dim4 fstrides   = mask.strides();
-    const dim_t bCount = dims[3];
-
-    Array<T> out         = createEmptyArray<T>(dims);
-    const dim4 ostrides   = out.strides();
-
-    T* outData            = out.get();
-    const T*   inData     = in.get();
-    const T*   filter     = mask.get();
-
-    for(dim_t batchId=0; batchId<bCount; ++batchId) {
-        // either channels or batch is handled by outer most loop
-        for(dim_t k=0; k<dims[2]; ++k) {
-            // k steps along 3rd dimension
-            for(dim_t j=0; j<dims[1]; ++j) {
-                // j steps along 2nd dimension
-                for(dim_t i=0; i<dims[0]; ++i) {
-                    // i steps along 1st dimension
-                    T filterResult = inData[ getIdx(istrides, i, j, k) ];
-
-                    // wk, wj,wi steps along 2nd & 1st dimensions of filter window respectively
-                    for(dim_t wk=0; wk<window[2]; wk++) {
-                        for(dim_t wj=0; wj<window[1]; wj++) {
-                            for(dim_t wi=0; wi<window[0]; wi++) {
-
-                                dim_t offk = k+wk-R2;
-                                dim_t offj = j+wj-R1;
-                                dim_t offi = i+wi-R0;
+    std::vector<af_seq> idxs(4, af_span);
+    idxs[0] = af_seq{double(lpad[0]), double(lpad[0] + idims[0] - 1), 1.0};
+    idxs[1] = af_seq{double(lpad[1]), double(lpad[1] + idims[1] - 1), 1.0};
 
-                                T maskValue = filter[ getIdx(fstrides, wi, wj, wk) ];
-
-                                if ((maskValue > (T)0) && offi>=0 && offj>=0 && offk>=0 &&
-                                        offi<dims[0] && offj<dims[1] && offk<dims[2]) {
-
-                                    T inValue   = inData[ getIdx(istrides, offi, offj, offk) ];
-
-                                    if (isDilation)
-                                        filterResult = std::max(filterResult, inValue);
-                                    else
-                                        filterResult = std::min(filterResult, inValue);
-                                }
-
-                            } // window 1st dimension loop ends here
-                        }  // window 1st dimension loop ends here
-                    }// filter window loop ends here
+    return createSubArray(out, idxs);
+}
 
-                    outData[ getIdx(ostrides, i, j, k) ] = filterResult;
-                } //1st dimension loop ends here
-            } // 2nd dimension loop ends here
-        } // 3rd dimension loop ends here
-        // next iteration will be next batch if any
-        outData += ostrides[3];
-        inData  += istrides[3];
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    if (isDilation) {
+        getQueue().enqueue(kernel::morph3d<T, true>, out, in, mask);
+    } else {
+        getQueue().enqueue(kernel::morph3d<T, false>, out, in, mask);
     }
-
     return out;
 }
 
-#define INSTANTIATE(T)\
-    template Array<T> morph  <T, true >(const Array<T> &in, const Array<T> &mask);\
-    template Array<T> morph  <T, false>(const Array<T> &in, const Array<T> &mask);\
-    template Array<T> morph3d<T, true >(const Array<T> &in, const Array<T> &mask);\
-    template Array<T> morph3d<T, false>(const Array<T> &in, const Array<T> &mask);
+#define INSTANTIATE(T)                                                    \
+    template Array<T> morph<T>(const Array<T> &, const Array<T> &, bool); \
+    template Array<T> morph3d<T>(const Array<T> &, const Array<T> &, bool);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/morph.hpp b/src/backend/cpu/morph.hpp
index e537303135..d1fabb47f7 100644
--- a/src/backend/cpu/morph.hpp
+++ b/src/backend/cpu/morph.hpp
@@ -9,13 +9,12 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation);
 
-template<typename T, bool isDilation>
-Array<T> morph(const Array<T> &in, const Array<T> &mask);
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask);
-
-}
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/nearest_neighbour.cpp b/src/backend/cpu/nearest_neighbour.cpp
new file mode 100644
index 0000000000..0581e97ab6
--- /dev/null
+++ b/src/backend/cpu/nearest_neighbour.cpp
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cpu.hpp>
+#include <kernel/nearest_neighbour.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <topk.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist, const af_match_type dist_type) {
+    uint sample_dim   = (dist_dim == 0) ? 1 : 0;
+    const dim4& qDims = query.dims();
+    const dim4& tDims = train.dims();
+    const dim4 outDims(n_dist, qDims[sample_dim]);
+    const dim4 distDims(tDims[sample_dim], qDims[sample_dim]);
+
+    Array<To> tmp_dists = createEmptyArray<To>(distDims);
+
+    idx  = createEmptyArray<uint>(outDims);
+    dist = createEmptyArray<To>(outDims);
+
+    switch (dist_type) {
+        case AF_SAD:
+            getQueue().enqueue(kernel::nearest_neighbour<T, To, AF_SAD>,
+                               tmp_dists, query, train, dist_dim);
+            break;
+        case AF_SSD:
+            getQueue().enqueue(kernel::nearest_neighbour<T, To, AF_SSD>,
+                               tmp_dists, query, train, dist_dim);
+            break;
+        case AF_SHD:
+            getQueue().enqueue(kernel::nearest_neighbour<T, To, AF_SHD>,
+                               tmp_dists, query, train, dist_dim);
+            break;
+        default: AF_ERROR("Unsupported dist_type", AF_ERR_NOT_CONFIGURED);
+    }
+
+    cpu::topk(dist, idx, tmp_dists, n_dist, 0, AF_TOPK_MIN);
+}
+
+#define INSTANTIATE(T, To)                                             \
+    template void nearest_neighbour<T, To>(                            \
+        Array<uint> & idx, Array<To> & dist, const Array<T>& query,    \
+        const Array<T>& train, const uint dist_dim, const uint n_dist, \
+        const af_match_type dist_type);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, uint)
+INSTANTIATE(intl, intl)
+INSTANTIATE(uintl, uintl)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, uint)
+INSTANTIATE(ushort, uint)
+INSTANTIATE(short, int)
+
+INSTANTIATE(uintl, uint)  // For Hamming
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/nearest_neighbour.hpp b/src/backend/cpu/nearest_neighbour.hpp
new file mode 100644
index 0000000000..0c5bd401d9
--- /dev/null
+++ b/src/backend/cpu/nearest_neighbour.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist,
+                       const af_match_type dist_type = AF_SSD);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/orb.cpp b/src/backend/cpu/orb.cpp
index 342487f128..f03eb6427b 100644
--- a/src/backend/cpu/orb.cpp
+++ b/src/backend/cpu/orb.cpp
@@ -7,737 +7,239 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <err_cpu.hpp>
-#include <handle.hpp>
-#include <resize.hpp>
-#include <fast.hpp>
-#include <sort_index.hpp>
 #include <convolve.hpp>
+#include <fast.hpp>
+#include <kernel/orb.hpp>
 #include <memory.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <resize.hpp>
+#include <sort_index.hpp>
+#include <af/dim4.hpp>
+
+#include <cmath>
 #include <cstring>
+#include <functional>
+#include <memory>
+#include <utility>
+#include <vector>
 
 using af::dim4;
-
-namespace cpu
-{
-
-static const float PI_VAL = 3.14159265358979323846f;
-
-// Reference pattern, generated for a patch size of 31x31, as suggested by
-// original ORB paper
-#define REF_PAT_SIZE 31
-#define REF_PAT_SAMPLES 256
-#define REF_PAT_COORDS 4
-#define REF_PAT_LENGTH (REF_PAT_SAMPLES*REF_PAT_COORDS)
-
-// Current reference pattern was borrowed from OpenCV, to build a pattern with
-// similar quality, a training process must be applied, as described in
-// sections 4.2 and 4.3 of the original ORB paper.
-const int ref_pat[REF_PAT_LENGTH] = {
-    8,-3, 9,5,
-    4,2, 7,-12,
-    -11,9, -8,2,
-    7,-12, 12,-13,
-    2,-13, 2,12,
-    1,-7, 1,6,
-    -2,-10, -2,-4,
-    -13,-13, -11,-8,
-    -13,-3, -12,-9,
-    10,4, 11,9,
-    -13,-8, -8,-9,
-    -11,7, -9,12,
-    7,7, 12,6,
-    -4,-5, -3,0,
-    -13,2, -12,-3,
-    -9,0, -7,5,
-    12,-6, 12,-1,
-    -3,6, -2,12,
-    -6,-13, -4,-8,
-    11,-13, 12,-8,
-    4,7, 5,1,
-    5,-3, 10,-3,
-    3,-7, 6,12,
-    -8,-7, -6,-2,
-    -2,11, -1,-10,
-    -13,12, -8,10,
-    -7,3, -5,-3,
-    -4,2, -3,7,
-    -10,-12, -6,11,
-    5,-12, 6,-7,
-    5,-6, 7,-1,
-    1,0, 4,-5,
-    9,11, 11,-13,
-    4,7, 4,12,
-    2,-1, 4,4,
-    -4,-12, -2,7,
-    -8,-5, -7,-10,
-    4,11, 9,12,
-    0,-8, 1,-13,
-    -13,-2, -8,2,
-    -3,-2, -2,3,
-    -6,9, -4,-9,
-    8,12, 10,7,
-    0,9, 1,3,
-    7,-5, 11,-10,
-    -13,-6, -11,0,
-    10,7, 12,1,
-    -6,-3, -6,12,
-    10,-9, 12,-4,
-    -13,8, -8,-12,
-    -13,0, -8,-4,
-    3,3, 7,8,
-    5,7, 10,-7,
-    -1,7, 1,-12,
-    3,-10, 5,6,
-    2,-4, 3,-10,
-    -13,0, -13,5,
-    -13,-7, -12,12,
-    -13,3, -11,8,
-    -7,12, -4,7,
-    6,-10, 12,8,
-    -9,-1, -7,-6,
-    -2,-5, 0,12,
-    -12,5, -7,5,
-    3,-10, 8,-13,
-    -7,-7, -4,5,
-    -3,-2, -1,-7,
-    2,9, 5,-11,
-    -11,-13, -5,-13,
-    -1,6, 0,-1,
-    5,-3, 5,2,
-    -4,-13, -4,12,
-    -9,-6, -9,6,
-    -12,-10, -8,-4,
-    10,2, 12,-3,
-    7,12, 12,12,
-    -7,-13, -6,5,
-    -4,9, -3,4,
-    7,-1, 12,2,
-    -7,6, -5,1,
-    -13,11, -12,5,
-    -3,7, -2,-6,
-    7,-8, 12,-7,
-    -13,-7, -11,-12,
-    1,-3, 12,12,
-    2,-6, 3,0,
-    -4,3, -2,-13,
-    -1,-13, 1,9,
-    7,1, 8,-6,
-    1,-1, 3,12,
-    9,1, 12,6,
-    -1,-9, -1,3,
-    -13,-13, -10,5,
-    7,7, 10,12,
-    12,-5, 12,9,
-    6,3, 7,11,
-    5,-13, 6,10,
-    2,-12, 2,3,
-    3,8, 4,-6,
-    2,6, 12,-13,
-    9,-12, 10,3,
-    -8,4, -7,9,
-    -11,12, -4,-6,
-    1,12, 2,-8,
-    6,-9, 7,-4,
-    2,3, 3,-2,
-    6,3, 11,0,
-    3,-3, 8,-8,
-    7,8, 9,3,
-    -11,-5, -6,-4,
-    -10,11, -5,10,
-    -5,-8, -3,12,
-    -10,5, -9,0,
-    8,-1, 12,-6,
-    4,-6, 6,-11,
-    -10,12, -8,7,
-    4,-2, 6,7,
-    -2,0, -2,12,
-    -5,-8, -5,2,
-    7,-6, 10,12,
-    -9,-13, -8,-8,
-    -5,-13, -5,-2,
-    8,-8, 9,-13,
-    -9,-11, -9,0,
-    1,-8, 1,-2,
-    7,-4, 9,1,
-    -2,1, -1,-4,
-    11,-6, 12,-11,
-    -12,-9, -6,4,
-    3,7, 7,12,
-    5,5, 10,8,
-    0,-4, 2,8,
-    -9,12, -5,-13,
-    0,7, 2,12,
-    -1,2, 1,7,
-    5,11, 7,-9,
-    3,5, 6,-8,
-    -13,-4, -8,9,
-    -5,9, -3,-3,
-    -4,-7, -3,-12,
-    6,5, 8,0,
-    -7,6, -6,12,
-    -13,6, -5,-2,
-    1,-10, 3,10,
-    4,1, 8,-4,
-    -2,-2, 2,-13,
-    2,-12, 12,12,
-    -2,-13, 0,-6,
-    4,1, 9,3,
-    -6,-10, -3,-5,
-    -3,-13, -1,1,
-    7,5, 12,-11,
-    4,-2, 5,-7,
-    -13,9, -9,-5,
-    7,1, 8,6,
-    7,-8, 7,6,
-    -7,-4, -7,1,
-    -8,11, -7,-8,
-    -13,6, -12,-8,
-    2,4, 3,9,
-    10,-5, 12,3,
-    -6,-5, -6,7,
-    8,-3, 9,-8,
-    2,-12, 2,8,
-    -11,-2, -10,3,
-    -12,-13, -7,-9,
-    -11,0, -10,-5,
-    5,-3, 11,8,
-    -2,-13, -1,12,
-    -1,-8, 0,9,
-    -13,-11, -12,-5,
-    -10,-2, -10,11,
-    -3,9, -2,-13,
-    2,-3, 3,2,
-    -9,-13, -4,0,
-    -4,6, -3,-10,
-    -4,12, -2,-7,
-    -6,-11, -4,9,
-    6,-3, 6,11,
-    -13,11, -5,5,
-    11,11, 12,6,
-    7,-5, 12,-2,
-    -1,12, 0,7,
-    -4,-8, -3,-2,
-    -7,1, -6,7,
-    -13,-12, -8,-13,
-    -7,-2, -6,-8,
-    -8,5, -6,-9,
-    -5,-1, -4,5,
-    -13,7, -8,10,
-    1,5, 5,-13,
-    1,0, 10,-13,
-    9,12, 10,-1,
-    5,-8, 10,-9,
-    -1,11, 1,-13,
-    -9,-3, -6,2,
-    -1,-10, 1,12,
-    -13,1, -8,-10,
-    8,-11, 10,-6,
-    2,-13, 3,-6,
-    7,-13, 12,-9,
-    -10,-10, -5,-7,
-    -10,-8, -8,-13,
-    4,-6, 8,5,
-    3,12, 8,-13,
-    -4,2, -3,-3,
-    5,-13, 10,-12,
-    4,-13, 5,-1,
-    -9,9, -4,3,
-    0,3, 3,-9,
-    -12,1, -6,1,
-    3,2, 4,-8,
-    -10,-10, -10,9,
-    8,-13, 12,12,
-    -8,-12, -6,-5,
-    2,2, 3,7,
-    10,6, 11,-8,
-    6,8, 8,-12,
-    -7,10, -6,5,
-    -3,-9, -3,9,
-    -1,-13, -1,5,
-    -3,-7, -3,4,
-    -8,-2, -8,3,
-    4,2, 12,12,
-    2,-5, 3,11,
-    6,-9, 11,-13,
-    3,-1, 7,12,
-    11,-1, 12,4,
-    -3,0, -3,6,
-    4,-11, 4,12,
-    2,-4, 2,1,
-    -10,-6, -8,1,
-    -13,7, -11,1,
-    -13,12, -11,-13,
-    6,0, 11,-13,
-    0,-1, 1,4,
-    -13,3, -9,-2,
-    -9,8, -6,-3,
-    -13,-6, -8,-2,
-    5,-9, 8,10,
-    2,7, 3,-9,
-    -1,-6, -1,-1,
-    9,5, 11,-2,
-    11,-3, 12,-8,
-    3,0, 3,5,
-    -1,4, 0,10,
-    3,-6, 4,5,
-    -13,0, -10,5,
-    5,8, 12,11,
-    8,9, 9,-6,
-    7,-4, 8,-12,
-    -10,4, -10,9,
-    7,3, 12,4,
-    9,-7, 10,-2,
-    7,0, 12,-2,
-    -1,-6, 0,-11,
-};
-
-template<typename T>
-void gaussian1D(T* out, const int dim, double sigma=0.0)
-{
-    if(!(sigma>0)) sigma = 0.25*dim;
-
-    T sum = (T)0;
-    for(int i=0;i<dim;i++)
-    {
-        int x = i-(dim-1)/2;
-        T el = 1. / sqrt(2 * PI_VAL * sigma*sigma) * exp(-((x*x)/(2*(sigma*sigma))));
-        out[i] = el;
-        sum   += el;
-    }
-
-    for(int k=0;k<dim;k++)
-        out[k] /= sum;
-}
-
-template<typename T>
-void keep_features(
-    float* x_out,
-    float* y_out,
-    float* score_out,
-    float* size_out,
-    const float* x_in,
-    const float* y_in,
-    const float* score_in,
-    const unsigned* score_idx,
-    const float* size_in,
-    const unsigned n_feat)
-{
-    // Keep only the first n_feat features
-    for (unsigned f = 0; f < n_feat; f++) {
-        x_out[f] = x_in[score_idx[f]];
-        y_out[f] = y_in[score_idx[f]];
-        score_out[f] = score_in[f];
-        if (size_in != nullptr && size_out != nullptr)
-            size_out[f] = size_in[score_idx[f]];
-    }
-}
-
-template<typename T, bool use_scl>
-void harris_response(
-    float* x_out,
-    float* y_out,
-    float* score_out,
-    float* size_out,
-    const float* x_in,
-    const float* y_in,
-    const float* scl_in,
-    const unsigned total_feat,
-    unsigned* usable_feat,
-    const Array<T>& image,
-    const unsigned block_size,
-    const float k_thr,
-    const unsigned patch_size)
-{
-    const af::dim4 idims = image.dims();
-    const T* image_ptr = image.get();
-    for (unsigned f = 0; f < total_feat; f++) {
-        unsigned x, y;
-        float scl = 1.f;
-        if (use_scl) {
-            // Update x and y coordinates according to scale
-            scl = scl_in[f];
-            x = (unsigned)round(x_in[f] * scl);
-            y = (unsigned)round(y_in[f] * scl);
-        }
-        else {
-            x = (unsigned)round(x_in[f]);
-            y = (unsigned)round(y_in[f]);
-        }
-
-        // Round feature size to nearest odd integer
-        float size = 2.f * floor((patch_size * scl) / 2.f) + 1.f;
-
-        // Avoid keeping features that might be too wide and might not fit on
-        // the image, sqrt(2.f) is the radius when angle is 45 degrees and
-        // represents widest case possible
-        unsigned patch_r = ceil(size * sqrt(2.f) / 2.f);
-        if (x < patch_r || y < patch_r || x >= idims[1] - patch_r || y >= idims[0] - patch_r)
-            continue;
-
-        unsigned r = block_size / 2;
-
-        float ixx = 0.f, iyy = 0.f, ixy = 0.f;
-        unsigned block_size_sq = block_size * block_size;
-        for (unsigned k = 0; k < block_size_sq; k++) {
-            int i = k / block_size - r;
-            int j = k % block_size - r;
-
-            // Calculate local x and y derivatives
-            float ix = image_ptr[(x+i+1) * idims[0] + y+j] - image_ptr[(x+i-1) * idims[0] + y+j];
-            float iy = image_ptr[(x+i) * idims[0] + y+j+1] - image_ptr[(x+i) * idims[0] + y+j-1];
-
-            // Accumulate second order derivatives
-            ixx += ix*ix;
-            iyy += iy*iy;
-            ixy += ix*iy;
-        }
-
-        unsigned idx = *usable_feat;
-        *usable_feat += 1;
-        float tr = ixx + iyy;
-        float det = ixx*iyy - ixy*ixy;
-
-        // Calculate Harris responses
-        float resp = det - k_thr * (tr*tr);
-
-        // Scale factor
-        // TODO: improve response scaling
-        float rscale = 0.001f;
-        rscale = rscale * rscale * rscale * rscale;
-
-        x_out[idx] = x;
-        y_out[idx] = y;
-        score_out[idx] = resp * rscale;
-        if (use_scl)
-            size_out[idx] = size;
-    }
-}
-
-template<typename T>
-void centroid_angle(
-    const float* x_in,
-    const float* y_in,
-    float* orientation_out,
-    const unsigned total_feat,
-    const Array<T>& image,
-    const unsigned patch_size)
-{
-    const af::dim4 idims = image.dims();
-    const T* image_ptr = image.get();
-    for (unsigned f = 0; f < total_feat; f++) {
-        unsigned x = (unsigned)round(x_in[f]);
-        unsigned y = (unsigned)round(y_in[f]);
-
-        unsigned r = patch_size / 2;
-        if (x < r || y < r || x > idims[1] - r || y > idims[0] - r)
-            continue;
-
-        T m01 = (T)0, m10 = (T)0;
-        unsigned patch_size_sq = patch_size * patch_size;
-        for (unsigned k = 0; k < patch_size_sq; k++) {
-            int i = k / patch_size - r;
-            int j = k % patch_size - r;
-
-            // Calculate first order moments
-            T p = image_ptr[(x+i) * idims[0] + y+j];
-            m01 += j * p;
-            m10 += i * p;
-        }
-
-        float angle = atan2(m01, m10);
-        orientation_out[f] = angle;
-    }
-}
-
-template<typename T>
-inline T get_pixel(
-    unsigned x,
-    unsigned y,
-    const float ori,
-    const unsigned size,
-    const int dist_x,
-    const int dist_y,
-    const Array<T>& image,
-    const unsigned patch_size)
-{
-    const af::dim4 idims = image.dims();
-    const T* image_ptr = image.get();
-    float ori_sin = sin(ori);
-    float ori_cos = cos(ori);
-    float patch_scl = (float)size / (float)patch_size;
-
-    // Calculate point coordinates based on orientation and size
-    x += round(dist_x * patch_scl * ori_cos - dist_y * patch_scl * ori_sin);
-    y += round(dist_x * patch_scl * ori_sin + dist_y * patch_scl * ori_cos);
-
-    return image_ptr[x * idims[0] + y];
-}
-
-template<typename T>
-void extract_orb(
-    unsigned* desc_out,
-    const unsigned n_feat,
-    float* x_in_out,
-    float* y_in_out,
-    const float* ori_in,
-    float* size_out,
-    const Array<T>& image,
-    const float scl,
-    const unsigned patch_size)
-{
-    const af::dim4 idims = image.dims();
-    for (unsigned f = 0; f < n_feat; f++) {
-        unsigned x = (unsigned)round(x_in_out[f]);
-        unsigned y = (unsigned)round(y_in_out[f]);
-        float ori = ori_in[f];
-        unsigned size = patch_size;
-
-        unsigned r = ceil(patch_size * sqrt(2.f) / 2.f);
-        if (x < r || y < r || x >= idims[1] - r || y >= idims[0] - r)
-            continue;
-
-        // Descriptor fixed at 256 bits for now
-        // Storing descriptor as a vector of 8 x 32-bit unsigned numbers
-        for (unsigned i = 0; i < 8; i++) {
-            unsigned v = 0;
-
-            // j < 32 for 256 bits descriptor
-            for (unsigned j = 0; j < 32; j++) {
-                // Get position from distribution pattern and values of points p1 and p2
-                int dist_x = ref_pat[i*32*4 + j*4];
-                int dist_y = ref_pat[i*32*4 + j*4+1];
-                T p1 = get_pixel(x, y, ori, size, dist_x, dist_y, image, patch_size);
-
-                dist_x = ref_pat[i*32*4 + j*4+2];
-                dist_y = ref_pat[i*32*4 + j*4+3];
-                T p2 = get_pixel(x, y, ori, size, dist_x, dist_y, image, patch_size);
-
-                // Calculate bit based on p1 and p2 and shifts it to correct position
-                v |= (p1 < p2) << j;
-            }
-
-            // Store 32 bits of descriptor
-            desc_out[f * 8 + i] += v;
-        }
-
-        x_in_out[f] = round(x * scl);
-        y_in_out[f] = round(y * scl);
-        size_out[f] = patch_size * scl;
-    }
-}
-
-
+using std::ceil;
+using std::floor;
+using std::function;
+using std::min;
+using std::move;
+using std::pow;
+using std::round;
+using std::sqrt;
+using std::unique_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace cpu {
 
 template<typename T, typename convAccT>
-unsigned orb(Array<float> &x, Array<float> &y,
-             Array<float> &score, Array<float> &ori,
-             Array<float> &size, Array<uint> &desc,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img)
-{
-
-    unsigned patch_size = REF_PAT_SIZE;
-
-    const af::dim4 idims = image.dims();
-    unsigned min_side = std::min(idims[0], idims[1]);
+unsigned orb(Array<float>& x, Array<float>& y, Array<float>& score,
+             Array<float>& ori, Array<float>& size, Array<uint>& desc,
+             const Array<T>& image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img) {
+    image.eval();
+    getQueue().sync();
+
+    float patch_size = REF_PAT_SIZE;
+
+    const dim4& idims   = image.dims();
+    float min_side      = min(idims[0], idims[1]);
     unsigned max_levels = 0;
-    float scl_sum = 0.f;
+    float scl_sum       = 0.f;
 
     for (unsigned i = 0; i < levels; i++) {
         min_side /= scl_fctr;
 
         // Minimum image side for a descriptor to be computed
-        if (min_side < patch_size || max_levels == levels) break;
+        if (min_side < patch_size || max_levels == levels) { break; }
 
         max_levels++;
-        scl_sum += 1.f / (float)std::pow(scl_fctr,(float)i);
+        scl_sum += 1.f / pow(scl_fctr, static_cast<float>(i));
     }
 
-    std::vector<float*> h_x_pyr(max_levels);
-    std::vector<float*> h_y_pyr(max_levels);
-    std::vector<float*> h_score_pyr(max_levels);
-    std::vector<float*> h_ori_pyr(max_levels);
-    std::vector<float*> h_size_pyr(max_levels);
-    std::vector<unsigned*> h_desc_pyr(max_levels);
+    vector<unique_ptr<float[], function<void(float*)>>> h_x_pyr(max_levels);
+    vector<unique_ptr<float[], function<void(float*)>>> h_y_pyr(max_levels);
+    vector<unique_ptr<float[], function<void(float*)>>> h_score_pyr(max_levels);
+    vector<unique_ptr<float[], function<void(float*)>>> h_ori_pyr(max_levels);
+    vector<unique_ptr<float[], function<void(float*)>>> h_size_pyr(max_levels);
+    vector<unique_ptr<unsigned[], function<void(unsigned*)>>> h_desc_pyr(
+        max_levels);
 
-    std::vector<unsigned> feat_pyr(max_levels);
+    vector<unsigned> feat_pyr(max_levels);
     unsigned total_feat = 0;
 
     // Compute number of features to keep for each level
-    std::vector<unsigned> lvl_best(max_levels);
+    vector<unsigned> lvl_best(max_levels);
     unsigned feat_sum = 0;
-    for (unsigned i = 0; i < max_levels-1; i++) {
-        float lvl_scl = (float)std::pow(scl_fctr,(float)i);
-        lvl_best[i] = ceil((max_feat / scl_sum) / lvl_scl);
+    for (unsigned i = 0; i < max_levels - 1; i++) {
+        auto lvl_scl = pow(scl_fctr, static_cast<float>(i));
+        lvl_best[i]  = ceil((static_cast<float>(max_feat) / scl_sum) / lvl_scl);
         feat_sum += lvl_best[i];
     }
-    lvl_best[max_levels-1] = max_feat - feat_sum;
+    lvl_best[max_levels - 1] = max_feat - feat_sum;
 
     // Maintain a reference to previous level image
-    Array<T> prev_img = createEmptyArray<T>(af::dim4());
-    af::dim4 prev_ldims;
+    Array<T> prev_img = createEmptyArray<T>(dim4());
+    dim4 prev_ldims;
 
-    af::dim4 gauss_dims(9);
-    T* h_gauss = nullptr;
-    Array<T> gauss_filter = createEmptyArray<T>(af::dim4());
+    dim4 gauss_dims(9);
+    unique_ptr<T[], function<void(T*)>> h_gauss;
+    Array<T> gauss_filter = createEmptyArray<T>(dim4());
 
     for (unsigned i = 0; i < max_levels; i++) {
-        af::dim4 ldims;
-        const float lvl_scl = (float)std::pow(scl_fctr,(float)i);
-        Array<T> lvl_img = createEmptyArray<T>(af::dim4());
+        dim4 ldims;
+        const auto lvl_scl = pow(scl_fctr, static_cast<float>(i));
+        Array<T> lvl_img   = createEmptyArray<T>(dim4());
 
         if (i == 0) {
             // First level is used in its original size
             lvl_img = image;
-            ldims = image.dims();
+            ldims   = image.dims();
 
-            prev_img = image;
+            prev_img   = image;
             prev_ldims = image.dims();
-        }
-        else {
+        } else {
             // Resize previous level image to current level dimensions
             ldims[0] = round(idims[0] / lvl_scl);
             ldims[1] = round(idims[1] / lvl_scl);
 
-            lvl_img = resize<T>(prev_img, ldims[0], ldims[1], AF_INTERP_BILINEAR);
+            lvl_img =
+                resize<T>(prev_img, ldims[0], ldims[1], AF_INTERP_BILINEAR);
 
-            prev_img = lvl_img;
+            prev_img   = lvl_img;
             prev_ldims = lvl_img.dims();
         }
+        prev_img.eval();
+        lvl_img.eval();
+        getQueue().sync();
 
-
-        Array<float> x_feat = createEmptyArray<float>(dim4());
-        Array<float> y_feat = createEmptyArray<float>(dim4());
+        Array<float> x_feat     = createEmptyArray<float>(dim4());
+        Array<float> y_feat     = createEmptyArray<float>(dim4());
         Array<float> score_feat = createEmptyArray<float>(dim4());
 
         // Round feature size to nearest odd integer
-        float size = 2.f * floor(patch_size / 2.f) + 1.f;
+        float size = 2.f * floor(static_cast<float>(patch_size) / 2.f) + 1.f;
 
         // Avoid keeping features that might be too wide and might not fit on
         // the image, sqrt(2.f) is the radius when angle is 45 degrees and
         // represents widest case possible
         unsigned edge = ceil(size * sqrt(2.f) / 2.f);
 
-        unsigned lvl_feat = fast(x_feat, y_feat, score_feat,
-                                 lvl_img, fast_thr, 9, 1, 0.15f, edge);
+        unsigned lvl_feat = fast(x_feat, y_feat, score_feat, lvl_img, fast_thr,
+                                 9, 1, 0.15f, edge);
 
-
-        if (lvl_feat == 0) {
-            continue;
-        }
+        if (lvl_feat == 0) { continue; }
 
         float* h_x_feat = x_feat.get();
         float* h_y_feat = y_feat.get();
 
-        float* h_x_harris = memAlloc<float>(lvl_feat);
-        float* h_y_harris = memAlloc<float>(lvl_feat);
-        float* h_score_harris = memAlloc<float>(lvl_feat);
+        auto h_x_harris     = memAlloc<float>(lvl_feat);
+        auto h_y_harris     = memAlloc<float>(lvl_feat);
+        auto h_score_harris = memAlloc<float>(lvl_feat);
 
         // Calculate Harris responses
         // Good block_size >= 7 (must be an odd number)
         unsigned usable_feat = 0;
-        harris_response<T, false>(h_x_harris, h_y_harris, h_score_harris, nullptr,
-                                  h_x_feat, h_y_feat, nullptr,
-                                  lvl_feat, &usable_feat,
-                                  lvl_img,
-                                  7, 0.04f, patch_size);
-
-        if (usable_feat == 0) {
-            memFree(h_x_harris);
-            memFree(h_y_harris);
-            memFree(h_score_harris);
+        kernel::harris_response<T, false>(
+            h_x_harris.get(), h_y_harris.get(), h_score_harris.get(), nullptr,
+            h_x_feat, h_y_feat, nullptr, lvl_feat, &usable_feat, lvl_img, 7,
+            0.04f, patch_size);
 
-            continue;
-        }
+        if (usable_feat == 0) { continue; }
 
         // Sort features according to Harris responses
         af::dim4 usable_feat_dims(usable_feat);
-        Array<float> score_harris = createHostDataArray(usable_feat_dims, h_score_harris);
+        Array<float> score_harris = createDeviceDataArray<float>(
+            usable_feat_dims, h_score_harris.get());
         Array<float> harris_sorted = createEmptyArray<float>(af::dim4());
         Array<unsigned> harris_idx = createEmptyArray<unsigned>(af::dim4());
 
-        sort_index<float, false>(harris_sorted, harris_idx, score_harris, 0);
+        sort_index<float>(harris_sorted, harris_idx, score_harris, 0, false);
+        getQueue().sync();
 
-        memFree(h_score_harris);
-
-        usable_feat = std::min(usable_feat, lvl_best[i]);
+        usable_feat = min(usable_feat, lvl_best[i]);
 
         if (usable_feat == 0) {
-            memFree(h_x_harris);
-            memFree(h_y_harris);
-
+            h_score_harris.release();
             continue;
         }
 
-        float* h_x_lvl = memAlloc<float>(usable_feat);
-        float* h_y_lvl = memAlloc<float>(usable_feat);
-        float* h_score_lvl = memAlloc<float>(usable_feat);
+        auto h_x_lvl     = memAlloc<float>(usable_feat);
+        auto h_y_lvl     = memAlloc<float>(usable_feat);
+        auto h_score_lvl = memAlloc<float>(usable_feat);
 
         // Keep only features with higher Harris responses
-        keep_features<T>(h_x_lvl, h_y_lvl, h_score_lvl, nullptr,
-                         h_x_harris, h_y_harris, harris_sorted.get(), harris_idx.get(),
-                         nullptr, usable_feat);
+        kernel::keep_features<T>(h_x_lvl.get(), h_y_lvl.get(),
+                                 h_score_lvl.get(), nullptr, h_x_harris.get(),
+                                 h_y_harris.get(), harris_sorted.get(),
+                                 harris_idx.get(), nullptr, usable_feat);
 
-        memFree(h_x_harris);
-        memFree(h_y_harris);
-
-        float* h_ori_lvl = memAlloc<float>(usable_feat);
-        float* h_size_lvl = memAlloc<float>(usable_feat);
+        auto h_ori_lvl  = memAlloc<float>(usable_feat);
+        auto h_size_lvl = memAlloc<float>(usable_feat);
 
         // Compute orientation of features
-        centroid_angle<T>(h_x_lvl, h_y_lvl, h_ori_lvl, usable_feat,
-                          lvl_img, patch_size);
+        kernel::centroid_angle<T>(h_x_lvl.get(), h_y_lvl.get(), h_ori_lvl.get(),
+                                  usable_feat, lvl_img, patch_size);
 
         Array<T> lvl_filt = createEmptyArray<T>(dim4());
 
         if (blur_img) {
-            // Calculate a separable Gaussian kernel, if one is not already stored
+            // Calculate a separable Gaussian kernel, if one is not already
+            // stored
             if (!h_gauss) {
                 h_gauss = memAlloc<T>(gauss_dims[0]);
-                gaussian1D(h_gauss, gauss_dims[0], 2.f);
-                gauss_filter = createHostDataArray(gauss_dims, h_gauss);
+                gaussian1D(h_gauss.get(), gauss_dims[0], 2.f);
+                gauss_filter =
+                    createDeviceDataArray<T>(gauss_dims, h_gauss.get());
+                gauss_filter.eval();
             }
 
-            // Filter level image with Gaussian kernel to reduce noise sensitivity
-            lvl_filt = convolve2<T, convAccT, false>(lvl_img, gauss_filter, gauss_filter);
+            // Filter level image with Gaussian kernel to reduce noise
+            // sensitivity
+            lvl_filt = convolve2<T, convAccT>(lvl_img, gauss_filter,
+                                              gauss_filter, false);
         }
+        lvl_filt.eval();
+        getQueue().sync();
 
         // Compute ORB descriptors
-        unsigned* h_desc_lvl = memAlloc<unsigned>(usable_feat * 8);
-        memset(h_desc_lvl, 0, usable_feat * 8 * sizeof(unsigned));
-        if (blur_img)
-            extract_orb<T>(h_desc_lvl, usable_feat,
-                           h_x_lvl, h_y_lvl, h_ori_lvl, h_size_lvl,
-                           lvl_filt, lvl_scl, patch_size);
-        else
-            extract_orb<T>(h_desc_lvl, usable_feat,
-                           h_x_lvl, h_y_lvl, h_ori_lvl, h_size_lvl,
-                           lvl_img, lvl_scl, patch_size);
+        auto h_desc_lvl = memAlloc<unsigned>(usable_feat * 8);
+        memset(h_desc_lvl.get(), 0, usable_feat * 8 * sizeof(unsigned));
+        if (blur_img) {
+            kernel::extract_orb<T>(h_desc_lvl.get(), usable_feat, h_x_lvl.get(),
+                                   h_y_lvl.get(), h_ori_lvl.get(),
+                                   h_size_lvl.get(), lvl_filt, lvl_scl,
+                                   patch_size);
+        } else {
+            kernel::extract_orb<T>(h_desc_lvl.get(), usable_feat, h_x_lvl.get(),
+                                   h_y_lvl.get(), h_ori_lvl.get(),
+                                   h_size_lvl.get(), lvl_img, lvl_scl,
+                                   patch_size);
+        }
 
         // Store results to pyramids
         total_feat += usable_feat;
-        feat_pyr[i] = usable_feat;
-        h_x_pyr[i] = h_x_lvl;
-        h_y_pyr[i] = h_y_lvl;
-        h_score_pyr[i] = h_score_lvl;
-        h_ori_pyr[i] = h_ori_lvl;
-        h_size_pyr[i] = h_size_lvl;
-        h_desc_pyr[i] = h_desc_lvl;
-
+        feat_pyr[i]    = usable_feat;
+        h_x_pyr[i]     = move(h_x_lvl);
+        h_y_pyr[i]     = move(h_y_lvl);
+        h_score_pyr[i] = move(h_score_lvl);
+        h_ori_pyr[i]   = move(h_ori_lvl);
+        h_size_pyr[i]  = move(h_size_lvl);
+        h_desc_pyr[i]  = move(h_desc_lvl);
+        h_score_harris.release();
+        h_gauss.release();
     }
 
-    if (h_gauss != nullptr)
-        memFree(h_gauss);
-
-    if (total_feat > 0 ) {
-
+    if (total_feat > 0) {
         // Allocate feature Arrays
         const af::dim4 total_feat_dims(total_feat);
         const af::dim4 desc_dims(8, total_feat);
@@ -747,54 +249,48 @@ unsigned orb(Array<float> &x, Array<float> &y,
         score = createEmptyArray<float>(total_feat_dims);
         ori   = createEmptyArray<float>(total_feat_dims);
         size  = createEmptyArray<float>(total_feat_dims);
-        desc  = createEmptyArray<uint >(desc_dims);
+        desc  = createEmptyArray<uint>(desc_dims);
 
-        float* h_x = x.get();
-        float* h_y = y.get();
+        float* h_x     = x.get();
+        float* h_y     = y.get();
         float* h_score = score.get();
-        float* h_ori = ori.get();
-        float* h_size = size.get();
+        float* h_ori   = ori.get();
+        float* h_size  = size.get();
 
         unsigned* h_desc = desc.get();
 
         unsigned offset = 0;
         for (unsigned i = 0; i < max_levels; i++) {
-            if (feat_pyr[i] == 0)
-                continue;
-
-            if (i > 0)
-                offset += feat_pyr[i-1];
-
-            memcpy(h_x+offset, h_x_pyr[i], feat_pyr[i] * sizeof(float));
-            memcpy(h_y+offset, h_y_pyr[i], feat_pyr[i] * sizeof(float));
-            memcpy(h_score+offset, h_score_pyr[i], feat_pyr[i] * sizeof(float));
-            memcpy(h_ori+offset, h_ori_pyr[i], feat_pyr[i] * sizeof(float));
-            memcpy(h_size+offset, h_size_pyr[i], feat_pyr[i] * sizeof(float));
-
-            memcpy(h_desc+(offset*8), h_desc_pyr[i], feat_pyr[i] * 8 * sizeof(unsigned));
-
-            memFree(h_x_pyr[i]);
-            memFree(h_y_pyr[i]);
-            memFree(h_score_pyr[i]);
-            memFree(h_ori_pyr[i]);
-            memFree(h_size_pyr[i]);
-            memFree(h_desc_pyr[i]);
+            if (feat_pyr[i] == 0) { continue; }
+
+            if (i > 0) { offset += feat_pyr[i - 1]; }
+
+            memcpy(h_x + offset, h_x_pyr[i].get(), feat_pyr[i] * sizeof(float));
+            memcpy(h_y + offset, h_y_pyr[i].get(), feat_pyr[i] * sizeof(float));
+            memcpy(h_score + offset, h_score_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+            memcpy(h_ori + offset, h_ori_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+            memcpy(h_size + offset, h_size_pyr[i].get(),
+                   feat_pyr[i] * sizeof(float));
+
+            memcpy(h_desc + (offset * 8), h_desc_pyr[i].get(),
+                   feat_pyr[i] * 8 * sizeof(unsigned));
         }
     }
 
     return total_feat;
 }
 
-#define INSTANTIATE(T, convAccT)                                                        \
-    template unsigned orb<T, convAccT>(Array<float> &x, Array<float> &y,                \
-                                       Array<float> &score, Array<float> &ori,          \
-                                       Array<float> &size, Array<uint> &desc,           \
-                                       const Array<T>& image,                           \
-                                       const float fast_thr, const unsigned max_feat,   \
-                                       const float scl_fctr, const unsigned levels,     \
-                                       const bool blur_img);
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned orb<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,             \
+        Array<float> & ori, Array<float> & size, Array<uint> & desc,          \
+        const Array<T>& image, const float fast_thr, const unsigned max_feat, \
+        const float scl_fctr, const unsigned levels, const bool blur_img);
 
-INSTANTIATE(float , float )
+INSTANTIATE(float, float)
 INSTANTIATE(double, double)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/orb.hpp b/src/backend/cpu/orb.hpp
index 0b4ebd8931..8bdd7a92c0 100644
--- a/src/backend/cpu/orb.hpp
+++ b/src/backend/cpu/orb.hpp
@@ -7,21 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <Array.hpp>
+#include <af/features.h>
 
 using af::features;
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T, typename convAccT>
 unsigned orb(Array<float> &x, Array<float> &y, Array<float> &score,
              Array<float> &orientation, Array<float> &size,
-             Array<unsigned> &desc,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img);
+             Array<unsigned> &desc, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/platform.cpp b/src/backend/cpu/platform.cpp
index d6b4724c25..a1dd7cd67b 100644
--- a/src/backend/cpu/platform.cpp
+++ b/src/backend/cpu/platform.cpp
@@ -7,77 +7,178 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/version.h>
+#include <build_version.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/defines.hpp>
+#include <common/host_memory.hpp>
+#include <device_manager.hpp>
 #include <platform.hpp>
+#include <af/version.h>
+
+#include <cctype>
+#include <cstdio>
+#include <memory>
 #include <sstream>
+#include <string>
 
-namespace cpu
-{
+using arrayfire::common::ForgeManager;
+using arrayfire::common::getEnvVar;
+using arrayfire::common::ltrim;
+using arrayfire::common::MemoryManagerBase;
+using std::endl;
+using std::ostringstream;
+using std::stoi;
+using std::string;
+using std::unique_ptr;
 
-static const char *get_system(void)
-{
-    return
-#if defined(ARCH_32)
-    "32-bit "
-#elif defined(ARCH_64)
-    "64-bit "
-#endif
+namespace arrayfire {
+namespace cpu {
+
+static string get_system() {
+    string arch = (sizeof(void*) == 4) ? "32-bit " : "64-bit ";
 
+    return arch +
 #if defined(OS_LNX)
-    "Linux";
+           "Linux";
 #elif defined(OS_WIN)
-    "Windows";
+           "Windows";
 #elif defined(OS_MAC)
-    "Mac OSX";
+           "Mac OSX";
 #endif
 }
 
-std::string getInfo()
-{
-    std::ostringstream info;
-    info << "ArrayFire v" << AF_VERSION
-         << " (CPU, " << get_system() << ", build " << AF_REVISION << ")" << std::endl;
+int getBackend() { return AF_BACKEND_CPU; }
+
+string getDeviceInfo() noexcept {
+    const CPUInfo cinfo = DeviceManager::getInstance().getCPUInfo();
+
+    ostringstream info;
+
+    info << "ArrayFire v" << AF_VERSION << " (CPU, " << get_system()
+         << ", build " << AF_REVISION << ")" << endl;
+
+    string model = cinfo.model();
+
+    size_t memMB =
+        getDeviceMemorySize(static_cast<int>(getActiveDeviceId())) / 1048576;
+
+    info << string("[0] ") << cinfo.vendor() << ": " << ltrim(model);
+
+    if (memMB) {
+        info << ", " << memMB << " MB, ";
+    } else {
+        info << ", Unknown MB, ";
+    }
+
+    info << "Max threads(" << cinfo.threads() << ") ";
+#ifndef NDEBUG
+    info << AF_COMPILER_STR;
+#endif
+    info << endl;
+
     return info.str();
 }
 
-bool isDoubleSupported(int device)
-{
-    return true;
+bool isDoubleSupported(int device) {
+    UNUSED(device);
+    return DeviceManager::IS_DOUBLE_SUPPORTED;
+}
+
+bool isHalfSupported(int device) {
+    UNUSED(device);
+    return DeviceManager::IS_HALF_SUPPORTED;
+}
+
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute) {
+    const CPUInfo cinfo = DeviceManager::getInstance().getCPUInfo();
+
+    snprintf(d_name, 64, "%s", cinfo.vendor().c_str());
+    snprintf(d_platform, 10, "CPU");
+    // Report the compiler as the toolkit
+    snprintf(d_toolkit, 64, "%s", AF_COMPILER_STR);
+    snprintf(d_compute, 10, "%s", "0.0");
 }
 
-void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-{
-    static bool flag;
-    if(!flag) {
-        printf("WARNING: af_devprop not supported for CPU\n");
-        flag = 1;
+int& getMaxJitSize() {
+    constexpr int MAX_JIT_LEN = 100;
+    thread_local int length   = 0;
+    if (length <= 0) {
+        string env_var = getEnvVar("AF_CPU_MAX_JIT_LEN");
+        if (!env_var.empty()) {
+            int input_len = stoi(env_var);
+            length        = input_len > 0 ? input_len : MAX_JIT_LEN;
+        } else {
+            length = MAX_JIT_LEN;
+        }
     }
+    return length;
 }
 
-int getDeviceCount()
-{
-    return 1;
+int getDeviceCount() { return DeviceManager::NUM_DEVICES; }
+
+void init() {
+    thread_local const auto& instance = DeviceManager::getInstance();
+    UNUSED(instance);
 }
 
+// Get the currently active device id
+unsigned getActiveDeviceId() { return DeviceManager::ACTIVE_DEVICE_ID; }
 
-int setDevice(int device)
-{
-    static bool flag;
-    if(!flag) {
-        printf("WARNING: af_set_device not supported for CPU\n");
-        flag = 1;
-    }
-    return 1;
+size_t getDeviceMemorySize(int device) {
+    UNUSED(device);
+    return common::getHostMemorySize();
 }
 
-int getActiveDeviceId()
-{
+size_t getHostMemorySize() { return common::getHostMemorySize(); }
+
+int setDevice(int device) {
+    thread_local bool flag = false;
+    if (!flag && device != 0) {
+#ifndef NDEBUG
+        fprintf(
+            stderr,
+            "WARNING: af_set_device(device): device can only be 0 for CPU\n");
+#endif
+        flag = true;
+    }
     return 0;
 }
 
-void sync(int device)
-{
-    // Nothing here
+queue& getQueue(int device) {
+    return DeviceManager::getInstance().queues[device];
+}
+
+queue* getQueueHandle(int device) { return &getQueue(device); }
+
+void sync(int device) { getQueue(device).sync(); }
+
+bool& evalFlag() {
+    thread_local bool flag = true;
+    return flag;
+}
+
+MemoryManagerBase& memoryManager() {
+    DeviceManager& inst = DeviceManager::getInstance();
+    return *(inst.memManager);
+}
+
+void setMemoryManager(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManager(move(mgr));
+}
+
+void resetMemoryManager() {
+    return DeviceManager::getInstance().resetMemoryManager();
+}
+
+void setMemoryManagerPinned(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManagerPinned(move(mgr));
 }
 
+void resetMemoryManagerPinned() {
+    return DeviceManager::getInstance().resetMemoryManagerPinned();
 }
+
+ForgeManager& forgeManager() { return *(DeviceManager::getInstance().fgMngr); }
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/platform.hpp b/src/backend/cpu/platform.hpp
index e899837b8c..1f86639188 100644
--- a/src/backend/cpu/platform.hpp
+++ b/src/backend/cpu/platform.hpp
@@ -7,20 +7,71 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
+
+#include <queue.hpp>
 #include <string>
 
+namespace arrayfire {
+namespace common {
+class ForgeManager;
+class MemoryManagerBase;
+}  // namespace common
+}  // namespace arrayfire
+
+using arrayfire::common::MemoryManagerBase;
+
+namespace arrayfire {
 namespace cpu {
-    std::string getInfo();
 
-    bool isDoubleSupported(int device);
+int getBackend();
+
+std::string getDeviceInfo() noexcept;
+
+bool isDoubleSupported(int device);
+
+bool isHalfSupported(int device);
+
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute);
+
+int& getMaxJitSize();
+
+int getDeviceCount();
+
+void init();
+
+unsigned getActiveDeviceId();
+
+size_t getDeviceMemorySize(int device);
+
+size_t getHostMemorySize();
+
+int setDevice(int device);
+
+queue& getQueue(int device = 0);
+
+/// Return a handle to the queue for the device.
+///
+/// \param[in] device The device of the returned queue
+/// \returns The handle to the queue
+queue* getQueueHandle(int device);
+
+void sync(int device);
+
+bool& evalFlag();
+
+MemoryManagerBase& memoryManager();
+
+void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
 
-    void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+void resetMemoryManager();
 
-    int getDeviceCount();
+// Pinned memory not supported
+void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
 
-    int setDevice(int device);
+void resetMemoryManagerPinned();
 
-    int getActiveDeviceId();
+arrayfire::common::ForgeManager& forgeManager();
 
-    void sync(int device);
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/plot.cpp b/src/backend/cpu/plot.cpp
index 68c4300210..1ca6ae7882 100644
--- a/src/backend/cpu/plot.cpp
+++ b/src/backend/cpu/plot.cpp
@@ -7,40 +7,49 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined(WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <plot.hpp>
+#include <common/graphics_common.hpp>
 #include <err_cpu.hpp>
-#include <stdexcept>
-#include <graphics_common.hpp>
-#include <reduce.hpp>
-#include <memory.hpp>
+#include <platform.hpp>
+#include <plot.hpp>
+#include <queue.hpp>
 
 using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
 
-namespace cpu
-{
-    template<typename T>
-    void copy_plot(const Array<T> &P, fg::Plot* plot)
-    {
-        CheckGL("Before CopyArrayToVBO");
+namespace arrayfire {
+namespace cpu {
 
-        glBindBuffer(GL_ARRAY_BUFFER, plot->vbo());
-        glBufferSubData(GL_ARRAY_BUFFER, 0, plot->size(), P.get());
-        glBindBuffer(GL_ARRAY_BUFFER, 0);
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot) {
+    ForgeModule &_ = forgePlugin();
+    P.eval();
+    getQueue().sync();
 
-        CheckGL("In CopyArrayToVBO");
-    }
+    CheckGL("Before CopyArrayToVBO");
+    unsigned bytes = 0, buffer = 0;
+    FG_CHECK(_.fg_get_plot_vertex_buffer(&buffer, plot));
+    FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
 
-    #define INSTANTIATE(T)  \
-        template void copy_plot<T>(const Array<T> &P, fg::Plot* plot);
+    glBindBuffer(GL_ARRAY_BUFFER, buffer);
+    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, P.get());
+    glBindBuffer(GL_ARRAY_BUFFER, 0);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
+    CheckGL("In CopyArrayToVBO");
 }
 
-#endif  // WITH_GRAPHICS
+#define INSTANTIATE(T) template void copy_plot<T>(const Array<T> &, fg_plot);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/plot.hpp b/src/backend/cpu/plot.hpp
index 73c72101e2..11063e22f4 100644
--- a/src/backend/cpu/plot.hpp
+++ b/src/backend/cpu/plot.hpp
@@ -7,16 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    void copy_plot(const Array<T> &P, fg::Plot* plot);
-}
+namespace arrayfire {
+namespace cpu {
 
-#endif
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot);
 
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/print.hpp b/src/backend/cpu/print.hpp
index a21b904c45..52e3e62877 100644
--- a/src/backend/cpu/print.hpp
+++ b/src/backend/cpu/print.hpp
@@ -7,7 +7,8 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-namespace cpu
-{
-    // Nothing here
-}
+namespace arrayfire {
+namespace cpu {
+// Nothing here
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/qr.cpp b/src/backend/cpu/qr.cpp
index d1c3e233af..61d6305438 100644
--- a/src/backend/cpu/qr.cpp
+++ b/src/backend/cpu/qr.cpp
@@ -8,67 +8,73 @@
  ********************************************************/
 
 #include <qr.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_CPU_LINEAR_ALGEBRA)
-
-#include <af/dim4.hpp>
-#include <handle.hpp>
-#include <iostream>
-#include <cassert>
 #include <err_cpu.hpp>
-#include <triangle.hpp>
 
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
 #include <lapack_helper.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <triangle.hpp>
+#include <af/dim4.hpp>
 
-namespace cpu
-{
+using af::dim4;
 
-template<typename T>
-using geqrf_func_def = int (*)(ORDER_TYPE, int, int,
-                               T*, int,
-                               T*);
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-using gqr_func_def = int (*)(ORDER_TYPE, int, int, int,
-                             T*, int,
-                             const T*);
-
-#define QR_FUNC_DEF( FUNC )                                         \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
+using geqrf_func_def = int (*)(ORDER_TYPE, int, int, T *, int, T *);
 
-
-#define QR_FUNC( FUNC, TYPE, PREFIX )                               \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()            \
-{ return & LAPACK_NAME(PREFIX##FUNC); }
-
-QR_FUNC_DEF( geqrf )
-QR_FUNC(geqrf , float  , s)
-QR_FUNC(geqrf , double , d)
-QR_FUNC(geqrf , cfloat , c)
-QR_FUNC(geqrf , cdouble, z)
-
-#define GQR_FUNC_DEF( FUNC )                                         \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
-
-#define GQR_FUNC( FUNC, TYPE, PREFIX )                               \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()             \
-{ return & LAPACK_NAME(PREFIX); }
-
-GQR_FUNC_DEF( gqr )
-GQR_FUNC(gqr , float  , sorgqr)
-GQR_FUNC(gqr , double , dorgqr)
-GQR_FUNC(gqr , cfloat , cungqr)
-GQR_FUNC(gqr , cdouble, zungqr)
+template<typename T>
+using gqr_func_def = int (*)(ORDER_TYPE, int, int, int, T *, int, const T *);
+
+#define QR_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define QR_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+QR_FUNC_DEF(geqrf)
+QR_FUNC(geqrf, float, s)
+QR_FUNC(geqrf, double, d)
+QR_FUNC(geqrf, cfloat, c)
+QR_FUNC(geqrf, cdouble, z)
+
+#define GQR_FUNC_DEF(FUNC) \
+    template<typename T>   \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define GQR_FUNC(FUNC, TYPE, PREFIX)            \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX);            \
+    }
+
+GQR_FUNC_DEF(gqr)
+GQR_FUNC(gqr, float, sorgqr)
+GQR_FUNC(gqr, double, dorgqr)
+GQR_FUNC(gqr, cfloat, cungqr)
+GQR_FUNC(gqr, cdouble, zungqr)
 
 template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
-{
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
     dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    const dim4 NullShape(0, 0, 0, 0);
 
-    q = padArray<T, T>(in, dim4(M, max(M, N)));
+    dim4 endPadding(M - iDims[0], max(M, N) - iDims[1], 0, 0);
+    q = (endPadding == NullShape
+             ? copyArray(in)
+             : padArrayBorders(in, NullShape, endPadding, AF_PAD_ZERO));
     q.resetDims(iDims);
     t = qr_inplace(q);
 
@@ -76,69 +82,67 @@ void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
     dim4 rdims(M, N);
     r = createEmptyArray<T>(rdims);
 
-    triangle<T, true, false>(r, q);
-
-    gqr_func<T>()(AF_LAPACK_COL_MAJOR,
-                  M, M, min(M, N),
-                  q.get(), q.strides()[1],
-                  t.get());
+    triangle<T>(r, q, true, false);
 
+    auto func = [=](Param<T> q, Param<T> t, int M, int N) {
+        gqr_func<T>()(AF_LAPACK_COL_MAJOR, M, M, min(M, N), q.get(),
+                      q.strides(1), t.get());
+    };
     q.resetDims(dim4(M, M));
+    getQueue().enqueue(func, q, t, M, N);
 }
 
 template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
+Array<T> qr_inplace(Array<T> &in) {
     dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
+    int M      = iDims[0];
+    int N      = iDims[1];
     Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
 
-    geqrf_func<T>()(AF_LAPACK_COL_MAJOR, M, N,
-                    in.get(), in.strides()[1],
-                    t.get());
+    auto func = [=](Param<T> in, Param<T> t, int M, int N) {
+        geqrf_func<T>()(AF_LAPACK_COL_MAJOR, M, N, in.get(), in.strides(1),
+                        t.get());
+    };
+    getQueue().enqueue(func, in, t, M, N);
 
     return t;
 }
 
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
-
-INSTANTIATE_QR(float)
-INSTANTIATE_QR(cfloat)
-INSTANTIATE_QR(double)
-INSTANTIATE_QR(cdouble)
-
-}
+}  // namespace cpu
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
-{
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
+Array<T> qr_inplace(Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
+
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
 
 INSTANTIATE_QR(float)
 INSTANTIATE_QR(cfloat)
 INSTANTIATE_QR(double)
 INSTANTIATE_QR(cdouble)
 
-}
-
-#endif
+}  // namespace cpu
+}  // namespace arrayfire
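
Note on the QR_FUNC / GQR_FUNC macros in qr.cpp above: they map an element type to a LAPACK entry point by declaring a function template and then emitting one explicit specialization per type, with the routine name built by token pasting. The following is a minimal sketch of that pattern only. The names demo_sgeqrf, demo_dgeqrf and DEMO_NAME are hypothetical stand-ins, not LAPACKE functions or ArrayFire macros, and the signature is shortened for illustration.

```cpp
#include <cassert>

// Hypothetical stand-ins for the single and double precision routines.
// They only return a tag so the dispatch can be observed.
int demo_sgeqrf(int m, int n, float *a) { (void)m; (void)n; (void)a; return 1; }
int demo_dgeqrf(int m, int n, double *a) { (void)m; (void)n; (void)a; return 2; }

// Assumed name-mangling macro, playing the role LAPACK_NAME plays in the diff.
#define DEMO_NAME(fn) demo_##fn

// Function pointer type parameterized on the element type.
template<typename T>
using geqrf_func_def = int (*)(int, int, T *);

// Declares the primary template: geqrf_func<T>() returns the pointer for T.
#define QR_FUNC_DEF(FUNC) \
    template<typename T>  \
    FUNC##_func_def<T> FUNC##_func();

// Emits one explicit specialization; PREFIX##FUNC pastes e.g. s + geqrf.
#define QR_FUNC(FUNC, TYPE, PREFIX)             \
    template<>                                  \
    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
        return &DEMO_NAME(PREFIX##FUNC);        \
    }

QR_FUNC_DEF(geqrf)
QR_FUNC(geqrf, float, s)
QR_FUNC(geqrf, double, d)
```

With these definitions, geqrf_func<float>() yields a pointer to demo_sgeqrf and geqrf_func<double>() a pointer to demo_dgeqrf. Requesting a type that has no specialization fails at link time rather than silently picking a wrong routine.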
diff --git a/src/backend/cpu/qr.hpp b/src/backend/cpu/qr.hpp
index 82d7c1b8a9..4a3290e61c 100644
--- a/src/backend/cpu/qr.hpp
+++ b/src/backend/cpu/qr.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
 
-    template<typename T>
-    Array<T> qr_inplace(Array<T> &in);
-}
+template<typename T>
+Array<T> qr_inplace(Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/queue.hpp b/src/backend/cpu/queue.hpp
new file mode 100644
index 0000000000..cdcfb8092f
--- /dev/null
+++ b/src/backend/cpu/queue.hpp
@@ -0,0 +1,148 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Param.hpp>
+#include <common/util.hpp>
+#include <memory.hpp>
+
+#include <algorithm>
+
+// FIXME: Is there a better way to check whether std::future is unsupported?
+#if defined(AF_DISABLE_CPU_ASYNC) || \
+    (defined(__GNUC__) &&            \
+     (__GCC_ATOMIC_INT_LOCK_FREE < 2 || __GCC_ATOMIC_POINTER_LOCK_FREE < 2))
+
+#include <functional>
+using std::function;
+#include <err_cpu.hpp>
+#define __SYNCHRONOUS_ARCH 1
+class queue_impl {
+   public:
+    template<typename F, typename... Args>
+    void enqueue(const F func, Args... args) const {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+    }
+
+    void sync() const { AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL); }
+
+    bool is_worker() const {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+        return false;
+    }
+};
+
+class event_impl {
+   public:
+    event_impl() noexcept                              = default;
+    ~event_impl() noexcept                             = default;
+    explicit event_impl(const event_impl &other)       = default;
+    event_impl(event_impl &&other) noexcept            = default;
+    event_impl &operator=(event_impl &&other) noexcept = default;
+    event_impl &operator=(event_impl &other) noexcept  = default;
+
+    explicit event_impl(const int val) {}
+
+    event_impl &operator=(int val) noexcept { return *this; }
+
+    int create() {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+        return 0;
+    }
+
+    int mark(queue_impl &queue) {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+        return 0;
+    }
+
+    int wait(queue_impl &queue) const {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+        return 0;
+    }
+
+    int sync() const noexcept {
+        AF_ERROR("Incorrectly configured", AF_ERR_INTERNAL);
+        return 0;
+    }
+
+    operator bool() const noexcept { return false; }
+};
+
+#else
+
+#include <threads/async_queue.hpp>
+#include <threads/event.hpp>
+#define __SYNCHRONOUS_ARCH 0
+using queue_impl = threads::async_queue;
+using event_impl = threads::event;
+
+#endif
+
+namespace arrayfire {
+namespace cpu {
+
+/// Wraps the async_queue class
+class queue {
+   public:
+    queue()
+        : count(0)
+        , sync_calls(__SYNCHRONOUS_ARCH == 1 ||
+                     common::getEnvVar("AF_SYNCHRONOUS_CALLS") == "1") {}
+
+    template<typename F, typename... Args>
+    void enqueue(const F func, Args &&...args) {
+        count++;
+        if (sync_calls) {
+            func(toParam(std::forward<Args>(args))...);
+        } else {
+            aQueue.enqueue(func, toParam(std::forward<Args>(args))...);
+        }
+#ifndef NDEBUG
+        sync();
+#else
+        if (getMemoryPressure() >= getMemoryPressureThreshold() ||
+            count >= 25) {
+            sync();
+        }
+#endif
+    }
+
+    void sync() {
+        count = 0;
+        if (!sync_calls) aQueue.sync();
+    }
+
+    bool is_worker() const {
+        return (!sync_calls) ? aQueue.is_worker() : false;
+    }
+
+    friend class queue_event;
+
+   private:
+    int count;
+    const bool sync_calls;
+    queue_impl aQueue;
+};
+
+class queue_event {
+    event_impl event_;
+
+   public:
+    queue_event() = default;
+    queue_event(int val) : event_(val) {}
+
+    int create() { return event_.create(); }
+
+    int mark(queue &q) { return event_.mark(q.aQueue); }
+    int wait(queue &q) { return event_.wait(q.aQueue); }
+    int sync() noexcept { return event_.sync(); }
+    operator bool() const noexcept { return event_; }
+};
+}  // namespace cpu
+}  // namespace arrayfire
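
Note on queue::enqueue in queue.hpp above: the wrapper either runs the task immediately (synchronous mode) or hands it to the asynchronous queue, and it forces a sync once 25 tasks have been enqueued since the last sync. The sketch below illustrates only that counting policy. demo_queue is a hypothetical class; it defers tasks on the calling thread instead of using a worker thread, and it leaves out the memory pressure check and the debug-build behavior of syncing after every call.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for the queue wrapper in the diff.
class demo_queue {
   public:
    explicit demo_queue(bool sync_calls, int flush_threshold = 25)
        : count_(0), sync_calls_(sync_calls), threshold_(flush_threshold) {}

    template<typename F, typename... Args>
    void enqueue(F func, Args... args) {
        ++count_;
        if (sync_calls_) {
            // Synchronous mode: run the task right away.
            func(args...);
        } else {
            // Deferred mode: store a copy of the callable and its arguments.
            pending_.emplace_back([func, args...]() { func(args...); });
        }
        // Flush once enough tasks have accumulated since the last sync.
        if (count_ >= threshold_) { sync(); }
    }

    void sync() {
        count_ = 0;
        for (auto &task : pending_) { task(); }
        pending_.clear();
    }

    int pending() const { return static_cast<int>(pending_.size()); }

   private:
    int count_;
    const bool sync_calls_;
    const int threshold_;
    std::vector<std::function<void()>> pending_;
};
```

A caller that needs the results before the threshold is reached has to call sync() explicitly, which is also the case for the wrapper shown in the diff.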
diff --git a/src/backend/cpu/random.cpp b/src/backend/cpu/random.cpp
deleted file mode 100644
index a806d790b0..0000000000
--- a/src/backend/cpu/random.cpp
+++ /dev/null
@@ -1,175 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <type_traits>
-#include <random>
-#include <algorithm>
-#include <functional>
-#include <limits>
-#include <type_traits>
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <Array.hpp>
-#include <random.hpp>
-
-namespace cpu
-{
-
-using namespace std;
-
-template<typename T>
-using is_arithmetic_t       = typename enable_if< is_arithmetic<T>::value,      function<T()>>::type;
-template<typename T>
-using is_complex_t          = typename enable_if< is_complex<T>::value,         function<T()>>::type;
-template<typename T>
-using is_floating_point_t   = typename enable_if< is_floating_point<T>::value,  function<T()>>::type;
-
-template<typename T, typename GenType>
-is_arithmetic_t<T>
-urand(GenType &generator)
-{
-    typedef typename conditional<   is_floating_point<T>::value,
-                                    uniform_real_distribution<T>,
-#if OS_WIN
-                                    uniform_int_distribution<unsigned>>::type dist;
-#else
-                                    uniform_int_distribution<T >> ::type dist;
-#endif
-    return bind(dist(), generator);
-}
-
-template<typename T, typename GenType>
-is_complex_t<T>
-urand(GenType &generator)
-{
-    auto func = urand<typename T::value_type>(generator);
-    return [func] () { return T(func(), func());};
-}
-
-template<typename T, typename GenType>
-is_floating_point_t<T>
-nrand(GenType &generator)
-{
-    return bind(normal_distribution<T>(), generator);
-}
-
-template<typename T, typename GenType>
-is_complex_t<T>
-nrand(GenType &generator)
-{
-    auto func = nrand<typename T::value_type>(generator);
-    return [func] () { return T(func(), func());};
-}
-
-static default_random_engine generator;
-static unsigned long long gen_seed = 0;
-static bool is_first = true;
-#define GLOBAL 1
-
-template<typename T>
-Array<T> randn(const af::dim4 &dims)
-{
-    static unsigned long long my_seed = 0;
-    if (is_first) {
-        setSeed(gen_seed);
-        my_seed = gen_seed;
-    }
-
-    static auto gen = nrand<T>(generator);
-
-    if (my_seed != gen_seed) {
-        gen = nrand<T>(generator);
-    }
-
-    Array<T> outArray = createEmptyArray<T>(dims);
-    T *outPtr = outArray.get();
-    generate(outPtr, outPtr + outArray.elements(), gen);
-    return outArray;
-}
-
-template<typename T>
-Array<T> randu(const af::dim4 &dims)
-{
-    static unsigned long long my_seed = 0;
-    if (is_first) {
-        setSeed(gen_seed);
-        my_seed = gen_seed;
-    }
-
-    static auto gen = urand<T>(generator);
-
-    if (my_seed != gen_seed) {
-        gen = urand<T>(generator);
-    }
-
-    Array<T> outArray = createEmptyArray<T>(dims);
-    T *outPtr = outArray.get();
-    for (int i = 0; i < (int)outArray.elements(); i++) {
-        outPtr[i] = gen();
-    }
-    return outArray;
-}
-
-#define INSTANTIATE_UNIFORM(T)                              \
-    template Array<T>  randu<T>    (const af::dim4 &dims);
-
-INSTANTIATE_UNIFORM(float)
-INSTANTIATE_UNIFORM(double)
-INSTANTIATE_UNIFORM(cfloat)
-INSTANTIATE_UNIFORM(cdouble)
-INSTANTIATE_UNIFORM(int)
-INSTANTIATE_UNIFORM(uint)
-INSTANTIATE_UNIFORM(uchar)
-
-#define INSTANTIATE_NORMAL(T)                              \
-    template Array<T>  randn<T>(const af::dim4 &dims);
-
-INSTANTIATE_NORMAL(float)
-INSTANTIATE_NORMAL(double)
-INSTANTIATE_NORMAL(cfloat)
-INSTANTIATE_NORMAL(cdouble)
-
-
-template<>
-Array<char> randu(const af::dim4 &dims)
-{
-    static unsigned long long my_seed = 0;
-    if (is_first) {
-        setSeed(gen_seed);
-        my_seed = gen_seed;
-    }
-
-    static auto gen = urand<float>(generator);
-
-    if (my_seed != gen_seed) {
-        gen = urand<float>(generator);
-    }
-
-    Array<char> outArray = createEmptyArray<char>(dims);
-    char *outPtr = outArray.get();
-    for (int i = 0; i < (int)outArray.elements(); i++) {
-        outPtr[i] = gen() > 0.5;
-    }
-    return outArray;
-}
-
-void setSeed(const uintl seed)
-{
-    generator.seed(seed);
-    is_first = false;
-    gen_seed = seed;
-}
-
-uintl getSeed()
-{
-    return gen_seed;
-}
-
-}
diff --git a/src/backend/cpu/random.hpp b/src/backend/cpu/random.hpp
deleted file mode 100644
index 1707e44ccf..0000000000
--- a/src/backend/cpu/random.hpp
+++ /dev/null
@@ -1,23 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <Array.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    Array<T> randu(const af::dim4 &dims);
-
-    template<typename T>
-    Array<T> randn(const af::dim4 &dims);
-
-    void setSeed(const uintl seed);
-    uintl getSeed();
-}
diff --git a/src/backend/cpu/random_engine.cpp b/src/backend/cpu/random_engine.cpp
new file mode 100644
index 0000000000..d42a7bdae1
--- /dev/null
+++ b/src/backend/cpu/random_engine.cpp
@@ -0,0 +1,169 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/random_engine.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl) {
+    getQueue().enqueue(kernel::initMersenneState, state.get(), tbl.get(), seed);
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type, const uintl seed,
+                             uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    getQueue().enqueue(kernel::uniformDistributionCBRNG<T>, out.get(),
+                       out.elements(), type, seed, counter);
+    counter += out.elements();
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl seed,
+                            uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    getQueue().enqueue(kernel::normalDistributionCBRNG<T>, out.get(),
+                       out.elements(), type, seed, counter);
+    counter += out.elements();
+    return out;
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    getQueue().enqueue(kernel::uniformDistributionMT<T>, out.get(),
+                       out.elements(), state.get(), pos.get(), sh1.get(),
+                       sh2.get(), mask, recursion_table.get(),
+                       temper_table.get());
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    getQueue().enqueue(kernel::normalDistributionMT<T>, out.get(),
+                       out.elements(), state.get(), pos.get(), sh1.get(),
+                       sh2.get(), mask, recursion_table.get(),
+                       temper_table.get());
+    return out;
+}
+
+#define INSTANTIATE_UNIFORM(T)                                   \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl seed, uintl &counter);                       \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define INSTANTIATE_NORMAL(T)                                                  \
+    template Array<T> normalDistribution<T>(const af::dim4 &dims,              \
+                                            const af_random_engine_type type,  \
+                                            const uintl seed, uintl &counter); \
+    template Array<T> normalDistribution<T>(                                   \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,                \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,               \
+        Array<uint> temper_table, Array<uint> state);
+
+#define COMPLEX_UNIFORM_DISTRIBUTION(T, TR)                              \
+    template<>                                                           \
+    Array<T> uniformDistribution<T>(const af::dim4 &dims,                \
+                                    const af_random_engine_type type,    \
+                                    const uintl seed, uintl &counter) {  \
+        Array<T> out    = createEmptyArray<T>(dims);                     \
+        TR *outPtr      = (TR *)out.get();                               \
+        size_t elements = out.elements() * 2;                            \
+        getQueue().enqueue(kernel::uniformDistributionCBRNG<TR>, outPtr, \
+                           elements, type, seed, counter);               \
+        counter += elements;                                             \
+        return out;                                                      \
+    }                                                                    \
+    template<>                                                           \
+    Array<T> uniformDistribution<T>(                                     \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,          \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,         \
+        Array<uint> temper_table, Array<uint> state) {                   \
+        Array<T> out    = createEmptyArray<T>(dims);                     \
+        TR *outPtr      = (TR *)out.get();                               \
+        size_t elements = out.elements() * 2;                            \
+        getQueue().enqueue(kernel::uniformDistributionMT<TR>, outPtr,    \
+                           elements, state.get(), pos.get(), sh1.get(),  \
+                           sh2.get(), mask, recursion_table.get(),       \
+                           temper_table.get());                          \
+        return out;                                                      \
+    }
+
+#define COMPLEX_NORMAL_DISTRIBUTION(T, TR)                                     \
+    template<>                                                                 \
+    Array<T> normalDistribution<T>(const af::dim4 &dims,                       \
+                                   const af_random_engine_type type,           \
+                                   const uintl seed, uintl &counter) {         \
+        Array<T> out    = createEmptyArray<T>(dims);                           \
+        TR *outPtr      = (TR *)out.get();                                     \
+        size_t elements = out.elements() * 2;                                  \
+        getQueue().enqueue(kernel::normalDistributionCBRNG<TR>, outPtr,        \
+                           elements, type, seed, counter);                     \
+        counter += elements;                                                   \
+        return out;                                                            \
+    }                                                                          \
+    template<>                                                                 \
+    Array<T> normalDistribution<T>(                                            \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,                \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,               \
+        Array<uint> temper_table, Array<uint> state) {                         \
+        Array<T> out    = createEmptyArray<T>(dims);                           \
+        TR *outPtr      = (TR *)out.get();                                     \
+        size_t elements = out.elements() * 2;                                  \
+        getQueue().enqueue(kernel::normalDistributionMT<TR>, outPtr, elements, \
+                           state.get(), pos.get(), sh1.get(), sh2.get(), mask, \
+                           recursion_table.get(), temper_table.get());         \
+        return out;                                                            \
+    }
+
+INSTANTIATE_UNIFORM(float)
+INSTANTIATE_UNIFORM(double)
+INSTANTIATE_UNIFORM(int)
+INSTANTIATE_UNIFORM(uint)
+INSTANTIATE_UNIFORM(intl)
+INSTANTIATE_UNIFORM(uintl)
+INSTANTIATE_UNIFORM(char)
+INSTANTIATE_UNIFORM(schar)
+INSTANTIATE_UNIFORM(uchar)
+INSTANTIATE_UNIFORM(short)
+INSTANTIATE_UNIFORM(ushort)
+INSTANTIATE_UNIFORM(half)
+
+INSTANTIATE_NORMAL(float)
+INSTANTIATE_NORMAL(double)
+INSTANTIATE_NORMAL(half)
+
+COMPLEX_UNIFORM_DISTRIBUTION(cdouble, double)  // NOLINT
+COMPLEX_UNIFORM_DISTRIBUTION(cfloat, float)    // NOLINT
+
+COMPLEX_NORMAL_DISTRIBUTION(cdouble, double)  // NOLINT
+COMPLEX_NORMAL_DISTRIBUTION(cfloat, float)    // NOLINT
+
+}  // namespace cpu
+}  // namespace arrayfire
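
Note on COMPLEX_UNIFORM_DISTRIBUTION and COMPLEX_NORMAL_DISTRIBUTION in random_engine.cpp above: the complex specializations reuse the real-valued kernels by viewing an array of N complex values as 2 * N interleaved reals, and the counter-based variants advance the counter by that doubled element count. The sketch below shows the same idea with std::complex. demo_fill is a hypothetical kernel that writes a plain counting sequence instead of random numbers, so the layout is easy to check.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical real-valued kernel: writes counter, counter + 1, ... into out.
template<typename TR>
void demo_fill(TR *out, std::size_t elements, unsigned long long counter) {
    for (std::size_t i = 0; i < elements; ++i) {
        out[i] = static_cast<TR>(counter + i);
    }
}

// Fills n complex values by treating the buffer as 2 * n reals.
template<typename TR>
std::vector<std::complex<TR>> demo_complex_fill(std::size_t n,
                                                unsigned long long &counter) {
    std::vector<std::complex<TR>> out(n);
    // std::complex<TR> is layout-compatible with TR[2], so this view is valid.
    TR *outPtr           = reinterpret_cast<TR *>(out.data());
    std::size_t elements = out.size() * 2;
    demo_fill(outPtr, elements, counter);
    // Two real draws are consumed per complex element.
    counter += elements;
    return out;
}
```

Each complex value receives two consecutive reals, the first as the real part and the second as the imaginary part.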
diff --git a/src/backend/cpu/random_engine.hpp b/src/backend/cpu/random_engine.hpp
new file mode 100644
index 0000000000..adfa7b9fc6
--- /dev/null
+++ b/src/backend/cpu/random_engine.hpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cpu {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const unsigned long long seed,
+                             unsigned long long &counter);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type,
+                            const unsigned long long seed,
+                            unsigned long long &counter);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/range.cpp b/src/backend/cpu/range.cpp
index 1d042899f2..ad100da4d4 100644
--- a/src/backend/cpu/range.cpp
+++ b/src/backend/cpu/range.cpp
@@ -6,78 +6,59 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/range.hpp>
+#include <range.hpp>
 
 #include <Array.hpp>
-#include <range.hpp>
-#include <math.hpp>
-#include <stdexcept>
 #include <err_cpu.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
 #include <algorithm>
 #include <numeric>
+#include <stdexcept>
 
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Kernel Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T, int dim>
-    void range(T *out, const dim4 &dims, const dim4 &strides)
-    {
-        for(dim_t w = 0; w < dims[3]; w++) {
-            dim_t offW = w * strides[3];
-            for(dim_t z = 0; z < dims[2]; z++) {
-                dim_t offWZ = offW + z * strides[2];
-                for(dim_t y = 0; y < dims[1]; y++) {
-                    dim_t offWZY = offWZ + y * strides[1];
-                    for(dim_t x = 0; x < dims[0]; x++) {
-                        dim_t id = offWZY + x;
-                        if(dim == 0) {
-                            out[id] = x;
-                        } else if(dim == 1) {
-                            out[id] = y;
-                        } else if(dim == 2) {
-                            out[id] = z;
-                        } else if(dim == 3) {
-                            out[id] = w;
-                        }
-                    }
-                }
-            }
-        }
-    }
-
-    ///////////////////////////////////////////////////////////////////////////
-    // Wrapper Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T>
-    Array<T> range(const dim4& dims, const int seq_dim)
-    {
-        // Set dimension along which the sequence should be
-        // Other dimensions are simply tiled
-        int _seq_dim = seq_dim;
-        if(seq_dim < 0) {
-            _seq_dim = 0;   // column wise sequence
-        }
-
-        Array<T> out = createEmptyArray<T>(dims);
-        switch(_seq_dim) {
-            case 0: range<T, 0>(out.get(), out.dims(), out.strides()); break;
-            case 1: range<T, 1>(out.get(), out.dims(), out.strides()); break;
-            case 2: range<T, 2>(out.get(), out.dims(), out.strides()); break;
-            case 3: range<T, 3>(out.get(), out.dims(), out.strides()); break;
-            default : AF_ERROR("Invalid rep selection", AF_ERR_ARG);
-        }
+using arrayfire::common::half;
 
+namespace arrayfire {
+namespace cpu {
 
-        return out;
+template<typename T>
+Array<T> range(const dim4& dims, const int seq_dim) {
+    // Set the dimension along which the sequence runs;
+    // the other dimensions are simply tiled.
+    int _seq_dim = seq_dim;
+    if (seq_dim < 0) {
+        _seq_dim = 0;  // column-wise sequence
     }
 
-#define INSTANTIATE(T)                                                      \
-    template Array<T> range<T>(const af::dim4 &dims, const int seq_dims);   \
+    Array<T> out = createEmptyArray<T>(dims);
+    switch (_seq_dim) {
+        case 0: getQueue().enqueue(kernel::range<T, 0>, out); break;
+        case 1: getQueue().enqueue(kernel::range<T, 1>, out); break;
+        case 2: getQueue().enqueue(kernel::range<T, 2>, out); break;
+        case 3: getQueue().enqueue(kernel::range<T, 3>, out); break;
+        default: AF_ERROR("Invalid rep selection", AF_ERR_ARG);
+    }
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> range<T>(const af::dim4& dims, const int seq_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
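
Note on range() in range.cpp above: the runtime seq_dim is turned into a compile-time template argument through a switch, and each element receives its own coordinate along that dimension, so the sequence is tiled along the remaining ones. The sketch below follows the loop structure of the kernel removed in this diff. demo_range and demo_range_dispatch are hypothetical names, plain arrays replace dim4, and a std::vector replaces Array<T>.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical kernel: writes the coordinate along `dim` into each element.
template<typename T, int dim>
void demo_range(T *out, const int (&dims)[4], const int (&strides)[4]) {
    for (int w = 0; w < dims[3]; w++) {
        for (int z = 0; z < dims[2]; z++) {
            for (int y = 0; y < dims[1]; y++) {
                for (int x = 0; x < dims[0]; x++) {
                    const int coord[4] = {x, y, z, w};
                    const int id       = w * strides[3] + z * strides[2] +
                                   y * strides[1] + x;
                    out[id] = static_cast<T>(coord[dim]);
                }
            }
        }
    }
}

// Maps the runtime seq_dim onto the matching template instantiation.
template<typename T>
std::vector<T> demo_range_dispatch(const int (&dims)[4], int seq_dim) {
    if (seq_dim < 0) { seq_dim = 0; }  // default to a column-wise sequence
    // Column-major strides for a densely packed buffer.
    const int strides[4] = {1, dims[0], dims[0] * dims[1],
                            dims[0] * dims[1] * dims[2]};
    std::vector<T> out(static_cast<std::size_t>(dims[0]) * dims[1] * dims[2] *
                       dims[3]);
    switch (seq_dim) {
        case 0: demo_range<T, 0>(out.data(), dims, strides); break;
        case 1: demo_range<T, 1>(out.data(), dims, strides); break;
        case 2: demo_range<T, 2>(out.data(), dims, strides); break;
        case 3: demo_range<T, 3>(out.data(), dims, strides); break;
        default: throw std::invalid_argument("seq_dim must be in [0, 3]");
    }
    return out;
}
```

For a 2 x 3 array stored in column-major order, seq_dim = 0 repeats 0, 1 down every column, while seq_dim = 1 fills each column with its own column index.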
diff --git a/src/backend/cpu/range.hpp b/src/backend/cpu/range.hpp
index a6d10a5cd8..b6d0f58bd9 100644
--- a/src/backend/cpu/range.hpp
+++ b/src/backend/cpu/range.hpp
@@ -8,11 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> range(const dim4& dim, const int seq_dim = -1);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim = -1);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/reduce.cpp b/src/backend/cpu/reduce.cpp
index 79b4860cfe..5b13d6f96f 100644
--- a/src/backend/cpu/reduce.cpp
+++ b/src/backend/cpu/reduce.cpp
@@ -7,189 +7,255 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/half.hpp>
+#include <kernel/reduce.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <reduce.hpp>
-#include <ops.hpp>
-#include <functional>
+#include <af/dim4.hpp>
+
 #include <complex>
+#include <functional>
 
 using af::dim4;
+using arrayfire::common::Binary;
+using arrayfire::common::half;
+using arrayfire::common::Transform;
+using arrayfire::cpu::cdouble;
+
+namespace arrayfire {
+namespace common {
+
+template<>
+struct Binary<cdouble, af_add_t> {
+    static cdouble init() { return cdouble(0, 0); }
 
-namespace cpu
-{
-    template<af_op_t op, typename Ti, typename To, int D>
-    struct reduce_dim
-    {
-        void operator()(To *out, const dim4 &ostrides, const dim4 &odims,
-                        const Ti *in , const dim4 &istrides, const dim4 &idims,
-                        const int dim)
-        {
-            static const int D1 = D - 1;
-            static reduce_dim<op, Ti, To, D1> reduce_dim_next;
-            for (dim_t i = 0; i < odims[D1]; i++) {
-                 reduce_dim_next(out + i * ostrides[D1],
-                                 ostrides, odims,
-                                 in  + i * istrides[D1],
-                                 istrides, idims,
-                                 dim);
-            }
-        }
-    };
-
-    template<af_op_t op, typename Ti, typename To>
-    struct reduce_dim<op, Ti, To, 0>
-    {
-
-        Transform<Ti, To, op> transform;
-        Binary<To, op> reduce;
-        void operator()(To *out, const dim4 &ostrides, const dim4 &odims,
-                        const Ti *in , const dim4 &istrides, const dim4 &idims,
-                        const int dim)
-        {
-            dim_t stride = istrides[dim];
-
-            To out_val = reduce.init();
-            for (dim_t i = 0; i < idims[dim]; i++) {
-                To in_val = transform(in[i * stride]);
-                out_val = reduce(in_val, out_val);
-            }
-
-            *out = out_val;
-        }
-    };
-
-    template<af_op_t op, typename Ti, typename To>
-    using reduce_dim_func = std::function<void(To*,const dim4&, const dim4&,
-                                                const Ti*, const dim4&, const dim4&,
-                                                const int)>;
-
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim)
-    {
-        dim4 odims = in.dims();
-        odims[dim] = 1;
-
-        Array<To> out = createEmptyArray<To>(odims);
-        static reduce_dim_func<op, Ti, To>  reduce_funcs[4] = { reduce_dim<op, Ti, To, 1>()
-                                                              , reduce_dim<op, Ti, To, 2>()
-                                                              , reduce_dim<op, Ti, To, 3>()
-                                                              , reduce_dim<op, Ti, To, 4>()};
-
-        reduce_funcs[in.ndims() - 1](out.get(), out.strides(), out.dims(),
-                                    in.get(), in.strides(), in.dims(), dim);
-
-        return out;
+    cdouble operator()(cdouble lhs, cdouble rhs) {
+        return cdouble(real(lhs) + real(rhs), imag(lhs) + imag(rhs));
     }
+};
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in)
-    {
-        Transform<Ti, To, op> transform;
-        Binary<To, op> reduce;
+}  // namespace common
+namespace cpu {
 
-        To out = reduce.init();
+template<af_op_t op, typename Ti, typename To>
+using reduce_dim_func = std::function<void(
+    Param<To>, const dim_t, CParam<Ti>, const dim_t, const int, bool, double)>;
 
-        // Decrement dimension of select dimension
-        af::dim4 dims = in.dims();
-        af::dim4 strides = in.strides();
-        const Ti *inPtr = in.get();
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan,
+                 double nanval) {
+    dim4 odims = in.dims();
+    odims[dim] = 1;
 
-        for(dim_t l = 0; l < dims[3]; l++) {
-            dim_t off3 = l * strides[3];
+    Array<To> out = createEmptyArray<To>(odims);
+    static const reduce_dim_func<op, Ti, To> reduce_funcs[4] = {
+        kernel::reduce_dim<op, Ti, To, 1>(),
+        kernel::reduce_dim<op, Ti, To, 2>(),
+        kernel::reduce_dim<op, Ti, To, 3>(),
+        kernel::reduce_dim<op, Ti, To, 4>()};
 
-            for(dim_t k = 0; k < dims[2]; k++) {
-                dim_t off2 = k * strides[2];
+    getQueue().enqueue(reduce_funcs[in.ndims() - 1], out, 0, in, 0, dim,
+                       change_nan, nanval);
 
-                for(dim_t j = 0; j < dims[1]; j++) {
-                    dim_t off1 = j * strides[1];
+    return out;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+using reduce_dim_func_by_key =
+    std::function<void(Param<To> ovals, const dim_t ovOffset, CParam<Tk> keys,
+                       CParam<Ti> vals, const dim_t vOffset, int *n_reduced,
+                       const int dim, bool change_nan, double nanval)>;
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan, double nanval) {
+    dim4 okdims = keys.dims();
+    dim4 ovdims = vals.dims();
 
-                    for(dim_t i = 0; i < dims[0]; i++) {
-                        dim_t idx = i + off1 + off2 + off3;
+    int n_reduced;
+    Array<Tk> fullsz_okeys = createEmptyArray<Tk>(okdims);
+    getQueue().enqueue(kernel::n_reduced_keys<Tk>, fullsz_okeys, &n_reduced,
+                       keys);
+    getQueue().sync();
 
-                        To val = transform(inPtr[idx]);
-                        out = reduce(val, out);
-                    }
-                }
-            }
-        }
+    okdims[0]   = n_reduced;
+    ovdims[dim] = n_reduced;
 
-        return out;
+    std::vector<af_seq> index;
+    for (int i = 0; i < keys.ndims(); ++i) {
+        af_seq s = {0.0, static_cast<double>(okdims[i]) - 1, 1.0};
+        index.push_back(s);
     }
+    Array<Tk> okeys = createSubArray<Tk>(fullsz_okeys, index, true);
+    Array<To> ovals = createEmptyArray<To>(ovdims);
 
-#define INSTANTIATE(ROp, Ti, To)                                        \
-    template Array<To> reduce<ROp, Ti, To>(const Array<Ti> &in, const int dim); \
-    template To reduce_all<ROp, Ti, To>(const Array<Ti> &in);
-
-    //min
-    INSTANTIATE(af_min_t, float  , float  )
-    INSTANTIATE(af_min_t, double , double )
-    INSTANTIATE(af_min_t, cfloat , cfloat )
-    INSTANTIATE(af_min_t, cdouble, cdouble)
-    INSTANTIATE(af_min_t, int    , int    )
-    INSTANTIATE(af_min_t, uint   , uint   )
-    INSTANTIATE(af_min_t, char   , char   )
-    INSTANTIATE(af_min_t, uchar  , uchar  )
-
-    //max
-    INSTANTIATE(af_max_t, float  , float  )
-    INSTANTIATE(af_max_t, double , double )
-    INSTANTIATE(af_max_t, cfloat , cfloat )
-    INSTANTIATE(af_max_t, cdouble, cdouble)
-    INSTANTIATE(af_max_t, int    , int    )
-    INSTANTIATE(af_max_t, uint   , uint   )
-    INSTANTIATE(af_max_t, char   , char   )
-    INSTANTIATE(af_max_t, uchar  , uchar  )
-
-    //sum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-
-    //sum
-    INSTANTIATE(af_mul_t, float  , float  )
-    INSTANTIATE(af_mul_t, double , double )
-    INSTANTIATE(af_mul_t, cfloat , cfloat )
-    INSTANTIATE(af_mul_t, cdouble, cdouble)
-    INSTANTIATE(af_mul_t, int    , int    )
-    INSTANTIATE(af_mul_t, uint   , uint   )
-    INSTANTIATE(af_mul_t, char   , int    )
-    INSTANTIATE(af_mul_t, uchar  , uint   )
-
-    // count
-    INSTANTIATE(af_notzero_t, float  , uint)
-    INSTANTIATE(af_notzero_t, double , uint)
-    INSTANTIATE(af_notzero_t, cfloat , uint)
-    INSTANTIATE(af_notzero_t, cdouble, uint)
-    INSTANTIATE(af_notzero_t, int    , uint)
-    INSTANTIATE(af_notzero_t, uint   , uint)
-    INSTANTIATE(af_notzero_t, char   , uint)
-    INSTANTIATE(af_notzero_t, uchar  , uint)
-
-    //anytrue
-    INSTANTIATE(af_or_t, float  , char)
-    INSTANTIATE(af_or_t, double , char)
-    INSTANTIATE(af_or_t, cfloat , char)
-    INSTANTIATE(af_or_t, cdouble, char)
-    INSTANTIATE(af_or_t, int    , char)
-    INSTANTIATE(af_or_t, uint   , char)
-    INSTANTIATE(af_or_t, char   , char)
-    INSTANTIATE(af_or_t, uchar  , char)
-
-    //alltrue
-    INSTANTIATE(af_and_t, float  , char)
-    INSTANTIATE(af_and_t, double , char)
-    INSTANTIATE(af_and_t, cfloat , char)
-    INSTANTIATE(af_and_t, cdouble, char)
-    INSTANTIATE(af_and_t, int    , char)
-    INSTANTIATE(af_and_t, uint   , char)
-    INSTANTIATE(af_and_t, char   , char)
-    INSTANTIATE(af_and_t, uchar  , char)
+    static const reduce_dim_func_by_key<op, Ti, Tk, To> reduce_funcs[4] = {
+        kernel::reduce_dim_by_key<op, Ti, Tk, To, 1>(),
+        kernel::reduce_dim_by_key<op, Ti, Tk, To, 2>(),
+        kernel::reduce_dim_by_key<op, Ti, Tk, To, 3>(),
+        kernel::reduce_dim_by_key<op, Ti, Tk, To, 4>()};
+
+    getQueue().enqueue(reduce_funcs[vals.ndims() - 1], ovals, 0, keys, vals, 0,
+                       &n_reduced, dim, change_nan, nanval);
+
+    keys_out = okeys;
+    vals_out = ovals;
 }
+
+template<af_op_t op, typename Ti, typename To>
+using reduce_all_func =
+    std::function<void(Param<To>, CParam<Ti>, bool, double)>;
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan, double nanval) {
+    in.eval();
+
+    Array<To> out = createEmptyArray<To>(1);
+    static const reduce_all_func<op, Ti, To> reduce_all_kernel =
+        kernel::reduce_all<op, Ti, To>();
+    getQueue().enqueue(reduce_all_kernel, out, in, change_nan, nanval);
+    getQueue().sync();
+    return out;
+}
+
+#define INSTANTIATE(ROp, Ti, To)                                               \
+    template Array<To> reduce<ROp, Ti, To>(const Array<Ti> &in, const int dim, \
+                                           bool change_nan, double nanval);    \
+    template Array<To> reduce_all<ROp, Ti, To>(                                \
+        const Array<Ti> &in, bool change_nan, double nanval);                  \
+    template void reduce_by_key<ROp, Ti, int, To>(                             \
+        Array<int> & keys_out, Array<To> & vals_out, const Array<int> &keys,   \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template void reduce_by_key<ROp, Ti, uint, To>(                            \
+        Array<uint> & keys_out, Array<To> & vals_out, const Array<uint> &keys, \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval);
+
+// min
+INSTANTIATE(af_min_t, float, float)
+INSTANTIATE(af_min_t, double, double)
+INSTANTIATE(af_min_t, cfloat, cfloat)
+INSTANTIATE(af_min_t, cdouble, cdouble)
+INSTANTIATE(af_min_t, int, int)
+INSTANTIATE(af_min_t, uint, uint)
+INSTANTIATE(af_min_t, intl, intl)
+INSTANTIATE(af_min_t, uintl, uintl)
+INSTANTIATE(af_min_t, char, char)
+INSTANTIATE(af_min_t, schar, schar)
+INSTANTIATE(af_min_t, uchar, uchar)
+INSTANTIATE(af_min_t, short, short)
+INSTANTIATE(af_min_t, ushort, ushort)
+INSTANTIATE(af_min_t, half, half)
+
+// max
+INSTANTIATE(af_max_t, float, float)
+INSTANTIATE(af_max_t, double, double)
+INSTANTIATE(af_max_t, cfloat, cfloat)
+INSTANTIATE(af_max_t, cdouble, cdouble)
+INSTANTIATE(af_max_t, int, int)
+INSTANTIATE(af_max_t, uint, uint)
+INSTANTIATE(af_max_t, intl, intl)
+INSTANTIATE(af_max_t, uintl, uintl)
+INSTANTIATE(af_max_t, char, char)
+INSTANTIATE(af_max_t, schar, schar)
+INSTANTIATE(af_max_t, uchar, uchar)
+INSTANTIATE(af_max_t, short, short)
+INSTANTIATE(af_max_t, ushort, ushort)
+INSTANTIATE(af_max_t, half, half)
+
+// sum
+INSTANTIATE(af_add_t, float, float)
+INSTANTIATE(af_add_t, double, double)
+INSTANTIATE(af_add_t, cfloat, cfloat)
+INSTANTIATE(af_add_t, cdouble, cdouble)
+INSTANTIATE(af_add_t, int, int)
+INSTANTIATE(af_add_t, int, float)
+INSTANTIATE(af_add_t, uint, uint)
+INSTANTIATE(af_add_t, uint, float)
+INSTANTIATE(af_add_t, intl, intl)
+INSTANTIATE(af_add_t, intl, double)
+INSTANTIATE(af_add_t, uintl, uintl)
+INSTANTIATE(af_add_t, uintl, double)
+INSTANTIATE(af_add_t, char, int)
+INSTANTIATE(af_add_t, char, float)
+INSTANTIATE(af_add_t, schar, int)
+INSTANTIATE(af_add_t, schar, float)
+INSTANTIATE(af_add_t, uchar, uint)
+INSTANTIATE(af_add_t, uchar, float)
+INSTANTIATE(af_add_t, short, int)
+INSTANTIATE(af_add_t, short, float)
+INSTANTIATE(af_add_t, ushort, uint)
+INSTANTIATE(af_add_t, ushort, float)
+INSTANTIATE(af_add_t, half, float)
+INSTANTIATE(af_add_t, half, half)
+
+// mul
+INSTANTIATE(af_mul_t, float, float)
+INSTANTIATE(af_mul_t, double, double)
+INSTANTIATE(af_mul_t, cfloat, cfloat)
+INSTANTIATE(af_mul_t, cdouble, cdouble)
+INSTANTIATE(af_mul_t, int, int)
+INSTANTIATE(af_mul_t, uint, uint)
+INSTANTIATE(af_mul_t, intl, intl)
+INSTANTIATE(af_mul_t, uintl, uintl)
+INSTANTIATE(af_mul_t, char, int)
+INSTANTIATE(af_mul_t, schar, int)
+INSTANTIATE(af_mul_t, uchar, uint)
+INSTANTIATE(af_mul_t, short, int)
+INSTANTIATE(af_mul_t, ushort, uint)
+INSTANTIATE(af_mul_t, half, float)
+
+// count
+INSTANTIATE(af_notzero_t, float, uint)
+INSTANTIATE(af_notzero_t, double, uint)
+INSTANTIATE(af_notzero_t, cfloat, uint)
+INSTANTIATE(af_notzero_t, cdouble, uint)
+INSTANTIATE(af_notzero_t, int, uint)
+INSTANTIATE(af_notzero_t, uint, uint)
+INSTANTIATE(af_notzero_t, intl, uint)
+INSTANTIATE(af_notzero_t, uintl, uint)
+INSTANTIATE(af_notzero_t, char, uint)
+INSTANTIATE(af_notzero_t, schar, uint)
+INSTANTIATE(af_notzero_t, uchar, uint)
+INSTANTIATE(af_notzero_t, short, uint)
+INSTANTIATE(af_notzero_t, ushort, uint)
+INSTANTIATE(af_notzero_t, half, uint)
+
+// anytrue
+INSTANTIATE(af_or_t, float, char)
+INSTANTIATE(af_or_t, double, char)
+INSTANTIATE(af_or_t, cfloat, char)
+INSTANTIATE(af_or_t, cdouble, char)
+INSTANTIATE(af_or_t, int, char)
+INSTANTIATE(af_or_t, uint, char)
+INSTANTIATE(af_or_t, intl, char)
+INSTANTIATE(af_or_t, uintl, char)
+INSTANTIATE(af_or_t, char, char)
+INSTANTIATE(af_or_t, schar, char)
+INSTANTIATE(af_or_t, uchar, char)
+INSTANTIATE(af_or_t, short, char)
+INSTANTIATE(af_or_t, ushort, char)
+INSTANTIATE(af_or_t, half, char)
+
+// alltrue
+INSTANTIATE(af_and_t, float, char)
+INSTANTIATE(af_and_t, double, char)
+INSTANTIATE(af_and_t, cfloat, char)
+INSTANTIATE(af_and_t, cdouble, char)
+INSTANTIATE(af_and_t, int, char)
+INSTANTIATE(af_and_t, uint, char)
+INSTANTIATE(af_and_t, intl, char)
+INSTANTIATE(af_and_t, uintl, char)
+INSTANTIATE(af_and_t, char, char)
+INSTANTIATE(af_and_t, schar, char)
+INSTANTIATE(af_and_t, uchar, char)
+INSTANTIATE(af_and_t, short, char)
+INSTANTIATE(af_and_t, ushort, char)
+INSTANTIATE(af_and_t, half, char)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/reduce.hpp b/src/backend/cpu/reduce.hpp
index 039a47d8cb..8ff97c51a6 100644
--- a/src/backend/cpu/reduce.hpp
+++ b/src/backend/cpu/reduce.hpp
@@ -6,16 +6,23 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <af/array.h>
+#pragma once
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan = false,
+                 double nanval = 0);
 
-namespace cpu
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim);
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan = false, double nanval = 0);
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in);
-}
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan = false,
+                     double nanval = 0);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/regions.cpp b/src/backend/cpu/regions.cpp
index b1377689a5..821a5285c3 100644
--- a/src/backend/cpu/regions.cpp
+++ b/src/backend/cpu/regions.cpp
@@ -7,206 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <regions.hpp>
 #include <err_cpu.hpp>
+#include <kernel/regions.hpp>
 #include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <regions.hpp>
+#include <af/dim4.hpp>
+#include <algorithm>
 #include <map>
 #include <set>
-#include <algorithm>
 
 using af::dim4;
 
-namespace cpu
-{
-
-template<typename T>
-class LabelNode
-{
-private:
-    T label;
-    T minLabel;
-    unsigned rank;
-    LabelNode* parent;
-
-public:
-    LabelNode() : label(0), minLabel(0), rank(0), parent(this) { }
-    LabelNode(T label) : label(label), minLabel(label), rank(0), parent(this) { }
-
-    T getLabel()
-    {
-        return label;
-    }
-
-    T getMinLabel()
-    {
-        return minLabel;
-    }
-
-    LabelNode* getParent()
-    {
-        return parent;
-    }
-
-    unsigned getRank()
-    {
-        return rank;
-    }
-
-    void setMinLabel(T l)
-    {
-        minLabel = l;
-    }
-
-    void setParent(LabelNode* p)
-    {
-        parent = p;
-    }
-
-    void setRank(unsigned r)
-    {
-        rank = r;
-    }
-};
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-static LabelNode<T>* find(LabelNode<T>* x)
-{
-    if (x->getParent() != x)
-        x->setParent(find(x->getParent()));
-    return x->getParent();
-}
-
-template<typename T>
-static void setUnion(LabelNode<T>* x, LabelNode<T>* y)
-{
-    LabelNode<T>* xRoot = find(x);
-    LabelNode<T>* yRoot = find(y);
-    if (xRoot == yRoot)
-        return;
-
-    T xMinLabel = xRoot->getMinLabel();
-    T yMinLabel = yRoot->getMinLabel();
-    xRoot->setMinLabel(min(xMinLabel, yMinLabel));
-    yRoot->setMinLabel(min(xMinLabel, yMinLabel));
-
-    if (xRoot->getRank() < yRoot->getRank())
-        xRoot->setParent(yRoot);
-    else if (xRoot->getRank() > yRoot->getRank())
-        yRoot->setParent(xRoot);
-    else {
-        yRoot->setParent(xRoot);
-        xRoot->setRank(xRoot->getRank() + 1);
-    }
-}
-
-template<typename T>
-Array<T> regions(const Array<char> &in, af_connectivity connectivity)
-{
-    const dim4 in_dims = in.dims();
-
-    // Create output placeholder
-    Array<T> out = createValueArray(in_dims, (T)0);
-
-    const char *in_ptr  = in.get();
-          T    *out_ptr = out.get();
-
-    // Map labels
-    typedef typename std::map<T, LabelNode<T>* > label_map_t;
-    typedef typename label_map_t::iterator label_map_iterator_t;
-
-    label_map_t lmap;
-
-    // Initial label
-    T label = (T)1;
-
-    for (int j = 0; j < (int)in_dims[1]; j++) {
-        for (int i = 0; i < (int)in_dims[0]; i++) {
-            int idx = j * in_dims[0] + i;
-            if (in_ptr[idx] != 0) {
-                std::vector<T> l;
-
-                // Test neighbors
-                if (i > 0 && out_ptr[j * (int)in_dims[0] + i-1] > 0)
-                    l.push_back(out_ptr[j * in_dims[0] + i-1]);
-                if (j > 0 && out_ptr[(j-1) * (int)in_dims[0] + i] > 0)
-                    l.push_back(out_ptr[(j-1) * in_dims[0] + i]);
-                if (connectivity == AF_CONNECTIVITY_8 && i > 0 && j > 0 && out_ptr[(j-1) * in_dims[0] + i-1] > 0)
-                    l.push_back(out_ptr[(j-1) * in_dims[0] + i-1]);
-                if (connectivity == AF_CONNECTIVITY_8 && i < (int)in_dims[0] - 1 && j > 0 && out_ptr[(j-1) * in_dims[0] + i+1] != 0)
-                    l.push_back(out_ptr[(j-1) * in_dims[0] + i+1]);
-
-                if (!l.empty()) {
-                    T minl = l[0];
-                    for (size_t k = 0; k < l.size(); k++) {
-                        minl = min(l[k], minl);
-                        label_map_iterator_t cur_map = lmap.find(l[k]);
-                        LabelNode<T> *node = cur_map->second;
-                        // Group labels of the same region under a disjoint set
-                        for (size_t m = k+1; m < l.size(); m++)
-                            setUnion(node, lmap.find(l[m])->second);
-                    }
-                    // Set label to smallest neighbor label
-                    out_ptr[idx] = minl;
-                }
-                else {
-                    // Insert new label in map
-                    LabelNode<T> *node = new LabelNode<T>(label);
-                    lmap.insert(std::pair<T, LabelNode<T>* >(label, node));
-                    out_ptr[idx] = label++;
-                }
-            }
-        }
-    }
-
-    std::set<T> removed;
-
-    for (int j = 0; j < (int)in_dims[1]; j++) {
-        for (int i = 0; i < (int)in_dims[0]; i++) {
-            int idx = j * (int)in_dims[0] + i;
-            if (in_ptr[idx] != 0) {
-                T l = out_ptr[idx];
-                label_map_iterator_t cur_map = lmap.find(l);
-
-                if (cur_map != lmap.end()) {
-                    LabelNode<T>* node = cur_map->second;
-
-                    LabelNode<T>* node_root = find(node);
-                    out_ptr[idx] = node_root->getMinLabel();
-
-                    // Mark removed labels (those that are part of a region
-                    // that contains a smaller label)
-                    if (node->getMinLabel() < l || node_root->getMinLabel() < l)
-                        removed.insert(l);
-                    if (node->getLabel() > node->getMinLabel())
-                        removed.insert(node->getLabel());
-                }
-            }
-        }
-    }
-
-    // Calculate final neighbors (ensure final labels are sequential)
-    for (int j = 0; j < (int)in_dims[1]; j++) {
-        for (int i = 0; i < (int)in_dims[0]; i++) {
-            int idx = j * (int)in_dims[0] + i;
-            if (out_ptr[idx] > 0) {
-                out_ptr[idx] -= distance(removed.begin(), removed.lower_bound(out_ptr[idx]));
-            }
-        }
-    }
+Array<T> regions(const Array<char> &in, af_connectivity connectivity) {
+    Array<T> out = createValueArray(in.dims(), static_cast<T>(0));
+    getQueue().enqueue(kernel::regions<T>, out, in, connectivity);
 
     return out;
 }
 
-#define INSTANTIATE(T)\
-    template Array<T> regions<T>(const Array<char> &in, af_connectivity connectivity);
+#define INSTANTIATE(T)                                  \
+    template Array<T> regions<T>(const Array<char> &in, \
+                                 af_connectivity connectivity);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/regions.hpp b/src/backend/cpu/regions.hpp
index 2e94711d28..b1c06b1911 100644
--- a/src/backend/cpu/regions.hpp
+++ b/src/backend/cpu/regions.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
 Array<T> regions(const Array<char> &in, af_connectivity connectivity);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/reorder.cpp b/src/backend/cpu/reorder.cpp
index 42da24e435..dd0a43ccac 100644
--- a/src/backend/cpu/reorder.cpp
+++ b/src/backend/cpu/reorder.cpp
@@ -6,70 +6,47 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <Array.hpp>
+#include <kernel/reorder.hpp>
 #include <reorder.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims(0);
-        for(int i = 0; i < 4; i++)
-            oDims[i] = iDims[rdims[i]];
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        T* outPtr = out.get();
-        const T* inPtr = in.get();
-
-        const af::dim4 ist = in.strides();
-        const af::dim4 ost = out.strides();
 
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 
-        dim_t ids[4]  = {0};
-        for(dim_t ow = 0; ow < oDims[3]; ow++) {
-            const dim_t oW = ow * ost[3];
-            ids[rdims[3]] = ow;
-            for(dim_t oz = 0; oz < oDims[2]; oz++) {
-                const dim_t oZW = oW + oz * ost[2];
-                ids[rdims[2]] = oz;
-                for(dim_t oy = 0; oy < oDims[1]; oy++) {
-                    const dim_t oYZW = oZW + oy * ost[1];
-                    ids[rdims[1]] = oy;
-                    for(dim_t ox = 0; ox < oDims[0]; ox++) {
-                        const dim_t oIdx = oYZW + ox;
-
-                        ids[rdims[0]] = ox;
-                        const dim_t iIdx = ids[3] * ist[3] + ids[2] * ist[2] +
-                                              ids[1] * ist[1] + ids[0];
-
-                        outPtr[oIdx] = inPtr[iIdx];
-                    }
-                }
-            }
-        }
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                         \
-    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);  \
+using arrayfire::common::half;
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+namespace arrayfire {
+namespace cpu {
 
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(0);
+    for (int i = 0; i < 4; i++) { oDims[i] = iDims[rdims[i]]; }
 
+    Array<T> out = createEmptyArray<T>(oDims);
+    getQueue().enqueue(kernel::reorder<T>, out, in, oDims, rdims);
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/reorder.hpp b/src/backend/cpu/reorder.hpp
index 01f8b3c292..5dee87f401 100644
--- a/src/backend/cpu/reorder.hpp
+++ b/src/backend/cpu/reorder.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/reshape.cpp b/src/backend/cpu/reshape.cpp
new file mode 100644
index 0000000000..31a0053684
--- /dev/null
+++ b/src/backend/cpu/reshape.cpp
@@ -0,0 +1,101 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+
+#include <common/half.hpp>
+#include <kernel/copy.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void multiply_inplace(Array<T> &in, double val) {
+    getQueue().enqueue(kernel::copyElemwise<T, T>, in, in, static_cast<T>(0),
+                       val);
+}
+
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue, double scale) {
+    Array<outType> out = createValueArray(outDims, defaultValue);
+    getQueue().enqueue(kernel::copyElemwise<outType, inType>, out, in,
+                       defaultValue, scale);
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template void multiply_inplace<T>(Array<T> & in, double val);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+#define INSTANTIATE_PAD_ARRAY(SRC_T)                                          \
+    template Array<float> reshape<SRC_T, float>(const Array<SRC_T> &,         \
+                                                const dim4 &, float, double); \
+    template Array<double> reshape<SRC_T, double>(                            \
+        const Array<SRC_T> &, const dim4 &, double, double);                  \
+    template Array<cfloat> reshape<SRC_T, cfloat>(                            \
+        const Array<SRC_T> &, const dim4 &, cfloat, double);                  \
+    template Array<cdouble> reshape<SRC_T, cdouble>(                          \
+        const Array<SRC_T> &, const dim4 &, cdouble, double);                 \
+    template Array<int> reshape<SRC_T, int>(const Array<SRC_T> &,             \
+                                            const dim4 &, int, double);       \
+    template Array<uint> reshape<SRC_T, uint>(const Array<SRC_T> &,           \
+                                              const dim4 &, uint, double);    \
+    template Array<intl> reshape<SRC_T, intl>(const Array<SRC_T> &,           \
+                                              const dim4 &, intl, double);    \
+    template Array<uintl> reshape<SRC_T, uintl>(const Array<SRC_T> &,         \
+                                                const dim4 &, uintl, double); \
+    template Array<short> reshape<SRC_T, short>(const Array<SRC_T> &,         \
+                                                const dim4 &, short, double); \
+    template Array<ushort> reshape<SRC_T, ushort>(                            \
+        const Array<SRC_T> &, const dim4 &, ushort, double);                  \
+    template Array<schar> reshape<SRC_T, schar>(const Array<SRC_T> &,         \
+                                                const dim4 &, schar, double); \
+    template Array<uchar> reshape<SRC_T, uchar>(const Array<SRC_T> &,         \
+                                                const dim4 &, uchar, double); \
+    template Array<char> reshape<SRC_T, char>(const Array<SRC_T> &,           \
+                                              const dim4 &, char, double);
+
+INSTANTIATE_PAD_ARRAY(float)
+INSTANTIATE_PAD_ARRAY(double)
+INSTANTIATE_PAD_ARRAY(int)
+INSTANTIATE_PAD_ARRAY(uint)
+INSTANTIATE_PAD_ARRAY(intl)
+INSTANTIATE_PAD_ARRAY(uintl)
+INSTANTIATE_PAD_ARRAY(schar)
+INSTANTIATE_PAD_ARRAY(uchar)
+INSTANTIATE_PAD_ARRAY(char)
+INSTANTIATE_PAD_ARRAY(ushort)
+INSTANTIATE_PAD_ARRAY(short)
+INSTANTIATE_PAD_ARRAY(arrayfire::common::half)
+
+#define INSTANTIATE_PAD_ARRAY_COMPLEX(SRC_T)                 \
+    template Array<cfloat> reshape<SRC_T, cfloat>(           \
+        const Array<SRC_T> &, const dim4 &, cfloat, double); \
+    template Array<cdouble> reshape<SRC_T, cdouble>(         \
+        const Array<SRC_T> &, const dim4 &, cdouble, double);
+
+INSTANTIATE_PAD_ARRAY_COMPLEX(cfloat)
+INSTANTIATE_PAD_ARRAY_COMPLEX(cdouble)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/resize.cpp b/src/backend/cpu/resize.cpp
index c3b18d04e6..ffc473fd4e 100644
--- a/src/backend/cpu/resize.cpp
+++ b/src/backend/cpu/resize.cpp
@@ -8,183 +8,56 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <resize.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
+#include <kernel/resize.hpp>
 #include <math.hpp>
-#include <types.hpp>
-#include <af/traits.hpp>
-
-namespace cpu
-{
-    /**
-     * noop function for round to avoid compilation
-     * issues due to lack of this function in C90 based
-     * compilers, it is only present in C99 and C++11
-     *
-     * This is not a full fledged implementation, this function
-     * is to be used only for positive numbers, i m using it here
-     * for calculating dimensions of arrays
-     */
-    dim_t round2int(float value)
-    {
-        return (dim_t)(value+0.5f);
-    }
-
-    using std::conditional;
-    using std::is_same;
-
-    template<typename T>
-    using wtype_t = typename conditional<is_same<T, double>::value, double, float>::type;
-
-    template<typename T>
-    using vtype_t = typename conditional<is_complex<T>::value,
-                                         T, wtype_t<T>
-                                        >::type;
-
-    template<typename T, af_interp_type method>
-    struct resize_op
-    {
-        void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims, const af::dim4 &idims,
-                  const af::dim4 &ostrides, const af::dim4 &istrides,
-                  const dim_t x, const dim_t y)
-        {
-            return;
-        }
-    };
-
-    template<typename T>
-    struct resize_op<T, AF_INTERP_NEAREST>
-    {
-        void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims, const af::dim4 &idims,
-                const af::dim4 &ostrides, const af::dim4 &istrides,
-                const dim_t x, const dim_t y)
-        {
-            // Compute Indices
-            dim_t i_x = round2int((float)x / (odims[0] / (float)idims[0]));
-            dim_t i_y = round2int((float)y / (odims[1] / (float)idims[1]));
-
-            if (i_x >= idims[0]) i_x = idims[0] - 1;
-            if (i_y >= idims[1]) i_y = idims[1] - 1;
-
-            dim_t i_off = i_y * istrides[1] + i_x;
-            dim_t o_off =   y * ostrides[1] + x;
-            // Copy values from all channels
-            for(dim_t w = 0; w < odims[3]; w++) {
-                dim_t wost = w * ostrides[3];
-                dim_t wist = w * istrides[3];
-                for(dim_t z = 0; z < odims[2]; z++) {
-                    outPtr[o_off + z * ostrides[2] + wost] = inPtr[i_off + z * istrides[2] + wist];
-                }
-            }
-        }
-    };
-
-    template<typename T>
-    struct resize_op<T, AF_INTERP_BILINEAR>
-    {
-        void operator()(T *outPtr, const T *inPtr, const af::dim4 &odims, const af::dim4 &idims,
-                const af::dim4 &ostrides, const af::dim4 &istrides,
-                const dim_t x, const dim_t y)
-        {
-            // Compute Indices
-            float f_x = (float)x / (odims[0] / (float)idims[0]);
-            float f_y = (float)y / (odims[1] / (float)idims[1]);
-
-            dim_t i1_x  = floor(f_x);
-            dim_t i1_y  = floor(f_y);
-
-            if (i1_x >= idims[0]) i1_x = idims[0] - 1;
-            if (i1_y >= idims[1]) i1_y = idims[1] - 1;
-
-            float b   = f_x - i1_x;
-            float a   = f_y - i1_y;
-
-            dim_t i2_x  = (i1_x + 1 >= idims[0] ? idims[0] - 1 : i1_x + 1);
-            dim_t i2_y  = (i1_y + 1 >= idims[1] ? idims[1] - 1 : i1_y + 1);
-
-            typedef typename dtype_traits<T>::base_type BT;
-            typedef wtype_t<BT> WT;
-            typedef vtype_t<T> VT;
-
-            dim_t o_off = y * ostrides[1] + x;
-            // Copy values from all channels
-            for(dim_t w = 0; w < odims[3]; w++) {
-                dim_t wst = w * istrides[3];
-                for(dim_t z = 0; z < odims[2]; z++) {
-                    dim_t zst = z * istrides[2];
-                    dim_t channel_off = zst + wst;
-                    VT p1 = inPtr[i1_y * istrides[1] + i1_x + channel_off];
-                    VT p2 = inPtr[i2_y * istrides[1] + i1_x + channel_off];
-                    VT p3 = inPtr[i1_y * istrides[1] + i2_x + channel_off];
-                    VT p4 = inPtr[i2_y * istrides[1] + i2_x + channel_off];
-
-                    outPtr[o_off + z * ostrides[2] + w * ostrides[3]] =
-                                    scalar<WT>((1.0f - a) * (1.0f - b)) * p1 +
-                                    scalar<WT>((    a   ) * (1.0f - b)) * p2 +
-                                    scalar<WT>((1.0f - a) * (    b   )) * p3 +
-                                    scalar<WT>((    a   ) * (    b   )) * p4;
-                }
-            }
-        }
-    };
-
-    template<typename T, af_interp_type method>
-    void resize_(T *outPtr, const T *inPtr, const af::dim4 &odims, const af::dim4 &idims,
-                 const af::dim4 &ostrides, const af::dim4 &istrides)
-    {
-        resize_op<T, method> op;
-        for(dim_t y = 0; y < odims[1]; y++) {
-            for(dim_t x = 0; x < odims[0]; x++) {
-                op(outPtr, inPtr, odims, idims, ostrides, istrides, x, y);
-            }
-        }
-    }
-
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method)
-    {
-        af::dim4 idims = in.dims();
-        af::dim4 odims(odim0, odim1, idims[2], idims[3]);
-
-        // Create output placeholder
-        Array<T> outArray = createValueArray(odims, (T)0);
-
-        // Get pointers to raw data
-        const T *inPtr = in.get();
-              T *outPtr = outArray.get();
-
-        af::dim4 ostrides = outArray.strides();
-        af::dim4 istrides = in.strides();
+#include <platform.hpp>
+#include <queue.hpp>
+#include <resize.hpp>
 
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                resize_<T, AF_INTERP_NEAREST>(outPtr, inPtr, odims, idims, ostrides, istrides);
-                break;
-            case AF_INTERP_BILINEAR:
-                resize_<T, AF_INTERP_BILINEAR>(outPtr, inPtr, odims, idims, ostrides, istrides);
-                break;
-            default:
-                break;
-        }
-        return outArray;
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
+    af::dim4 idims = in.dims();
+    af::dim4 odims(odim0, odim1, idims[2], idims[3]);
+    // Create output placeholder
+    Array<T> out = createValueArray(odims, static_cast<T>(0));
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+            getQueue().enqueue(kernel::resize<T, AF_INTERP_NEAREST>, out, in);
+            break;
+        case AF_INTERP_BILINEAR:
+            getQueue().enqueue(kernel::resize<T, AF_INTERP_BILINEAR>, out, in);
+            break;
+        case AF_INTERP_LOWER:
+            getQueue().enqueue(kernel::resize<T, AF_INTERP_LOWER>, out, in);
+            break;
+        default: break;
     }
-
-
-#define INSTANTIATE(T)                                                                            \
-    template Array<T> resize<T> (const Array<T> &in, const dim_t odim0, const dim_t odim1, \
-                                 const af_interp_type method);
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    return out;
 }
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> resize<T>(const Array<T> &in, const dim_t odim0, \
+                                const dim_t odim1,                     \
+                                const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/resize.hpp b/src/backend/cpu/resize.hpp
index 8a10d9df3e..d31290daf5 100644
--- a/src/backend/cpu/resize.hpp
+++ b/src/backend/cpu/resize.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/rotate.cpp b/src/backend/cpu/rotate.cpp
index b7c45768be..bed34b7bf3 100644
--- a/src/backend/cpu/rotate.cpp
+++ b/src/backend/cpu/rotate.cpp
@@ -8,104 +8,56 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <kernel/rotate.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <rotate.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-#include "transform_interp.hpp"
 
-namespace cpu
-{
-    template<typename T, af_interp_type method>
-    void rotate_(T *out, const T *in, const float theta,
-                 const af::dim4 &odims, const af::dim4 &idims,
-                 const af::dim4 &ostrides, const af::dim4 &istrides)
-    {
-        dim_t nimages = idims[2];
-
-        void (*t_fn)(T *, const T *, const float *, const af::dim4 &,
-                     const af::dim4 &, const af::dim4 &,
-                     const dim_t, const dim_t, const dim_t, const dim_t);
-
-        const float c = cos(-theta), s = sin(-theta);
-        float tx, ty;
-        {
-            const float nx = 0.5 * (idims[0] - 1);
-            const float ny = 0.5 * (idims[1] - 1);
-            const float mx = 0.5 * (odims[0] - 1);
-            const float my = 0.5 * (odims[1] - 1);
-            const float sx = (mx * c + my *-s);
-            const float sy = (mx * s + my * c);
-            tx = -(sx - nx);
-            ty = -(sy - ny);
-        }
-
-        const float tmat[6] = {std::round( c * 1000) / 1000.0f,
-                               std::round(-s * 1000) / 1000.0f,
-                               std::round(tx * 1000) / 1000.0f,
-                               std::round( s * 1000) / 1000.0f,
-                               std::round( c * 1000) / 1000.0f,
-                               std::round(ty * 1000) / 1000.0f,
-                              };
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                t_fn = &transform_n;
-                break;
-            case AF_INTERP_BILINEAR:
-                t_fn = &transform_b;
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                break;
-        }
-
-
-        // Do transform for image
-        for(int yy = 0; yy < (int)odims[1]; yy++) {
-            for(int xx = 0; xx < (int)odims[0]; xx++) {
-                t_fn(out, in, tmat, idims, ostrides, istrides, nimages, 0, xx, yy);
-            }
-        }
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method) {
+    Array<T> out = createEmptyArray<T>(odims);
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            getQueue().enqueue(kernel::rotate<T, 1>, out, in, theta, method);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            getQueue().enqueue(kernel::rotate<T, 2>, out, in, theta, method);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            getQueue().enqueue(kernel::rotate<T, 3>, out, in, theta, method);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG); break;
     }
 
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                     const af_interp_type method)
-    {
-        Array<T> out = createEmptyArray<T>(odims);
-        const af::dim4 idims = in.dims();
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                rotate_<T, AF_INTERP_NEAREST>
-                       (out.get(), in.get(), theta, odims, idims, out.strides(), in.strides());
-                break;
-            case AF_INTERP_BILINEAR:
-                rotate_<T, AF_INTERP_BILINEAR>
-                       (out.get(), in.get(), theta, odims, idims, out.strides(), in.strides());
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                break;
-        }
-
-        return out;
-    }
-
-
-#define INSTANTIATE(T)                                                              \
-    template Array<T> rotate(const Array<T> &in, const float theta,                 \
-                             const af::dim4 &odims, const af_interp_type method);   \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    return out;
 }
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> rotate(const Array<T> &in, const float theta, \
+                             const af::dim4 &odims,                 \
+                             const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/rotate.hpp b/src/backend/cpu/rotate.hpp
index c49ad8fe55..cf18a7df56 100644
--- a/src/backend/cpu/rotate.hpp
+++ b/src/backend/cpu/rotate.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/scan.cpp b/src/backend/cpu/scan.cpp
index cdd359f822..7f6843f99a 100644
--- a/src/backend/cpu/scan.cpp
+++ b/src/backend/cpu/scan.cpp
@@ -7,105 +7,92 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <kernel/scan.hpp>
+#include <optypes.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <scan.hpp>
-#include <ops.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
 using af::dim4;
 
-namespace cpu
-{
-    template<af_op_t op, typename Ti, typename To, int D>
-    struct scan_dim
-    {
-        void operator()(To *out, const dim4 ostrides, const dim4 odims,
-                        const Ti *in , const dim4 istrides, const dim4 idims,
-                        const int dim)
-        {
-            const int D1 = D - 1;
-            for (dim_t i = 0; i < odims[D1]; i++) {
-                scan_dim<op, Ti, To, D1>()(out + i * ostrides[D1],
-                                           ostrides, odims,
-                                           in  + i * istrides[D1],
-                                           istrides, idims,
-                                           dim);
-                if (D1 == dim) break;
-            }
-        }
-    };
+namespace arrayfire {
+namespace cpu {
 
-    template<af_op_t op, typename Ti, typename To>
-    struct scan_dim<op, Ti, To, 0>
-    {
-        void operator()(To *out, const dim4 ostrides, const dim4 odims,
-                        const Ti *in , const dim4 istrides, const dim4 idims,
-                        const int dim)
-        {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan) {
+    const dim4& dims = in.dims();
+    Array<To> out    = createEmptyArray<To>(dims);
 
-            dim_t istride = istrides[dim];
-            dim_t ostride = ostrides[dim];
-
-            Transform<Ti, To, op> transform;
-            // FIXME: Change the name to something better
-            Binary<To, op> scan;
-
-            To out_val = scan.init();
-            for (dim_t i = 0; i < idims[dim]; i++) {
-                To in_val = transform(in[i * istride]);
-                out_val = scan(in_val, out_val);
-                out[i * ostride] = out_val;
-            }
+    if (inclusive_scan) {
+        switch (in.ndims()) {
+            case 1:
+                kernel::scan_dim<op, Ti, To, 1, true> func1;
+                getQueue().enqueue(func1, out, 0, in, 0, dim);
+                break;
+            case 2:
+                kernel::scan_dim<op, Ti, To, 2, true> func2;
+                getQueue().enqueue(func2, out, 0, in, 0, dim);
+                break;
+            case 3:
+                kernel::scan_dim<op, Ti, To, 3, true> func3;
+                getQueue().enqueue(func3, out, 0, in, 0, dim);
+                break;
+            case 4:
+                kernel::scan_dim<op, Ti, To, 4, true> func4;
+                getQueue().enqueue(func4, out, 0, in, 0, dim);
+                break;
         }
-    };
-
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti>& in, const int dim)
-    {
-        dim4 dims = in.dims();
-
-        Array<To> out = createValueArray<To>(dims, 0);
-
+    } else {
         switch (in.ndims()) {
-        case 1:
-            scan_dim<op, Ti, To, 1>()(out.get(), out.strides(), out.dims(),
-                                      in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 2:
-            scan_dim<op, Ti, To, 2>()(out.get(), out.strides(), out.dims(),
-                                      in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 3:
-            scan_dim<op, Ti, To, 3>()(out.get(), out.strides(), out.dims(),
-                                      in.get(), in.strides(), in.dims(), dim);
-            break;
-
-        case 4:
-            scan_dim<op, Ti, To, 4>()(out.get(), out.strides(), out.dims(),
-                                      in.get(), in.strides(), in.dims(), dim);
-            break;
+            case 1:
+                kernel::scan_dim<op, Ti, To, 1, false> func1;
+                getQueue().enqueue(func1, out, 0, in, 0, dim);
+                break;
+            case 2:
+                kernel::scan_dim<op, Ti, To, 2, false> func2;
+                getQueue().enqueue(func2, out, 0, in, 0, dim);
+                break;
+            case 3:
+                kernel::scan_dim<op, Ti, To, 3, false> func3;
+                getQueue().enqueue(func3, out, 0, in, 0, dim);
+                break;
+            case 4:
+                kernel::scan_dim<op, Ti, To, 4, false> func4;
+                getQueue().enqueue(func4, out, 0, in, 0, dim);
+                break;
         }
-
-        return out;
     }
 
-#define INSTANTIATE(ROp, Ti, To)                                        \
-    template Array<To> scan<ROp, Ti, To>(const Array<Ti> &in, const int dim); \
-
-    //accum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-    INSTANTIATE(af_notzero_t, char  , uint   )
-
+    return out;
 }
+
+#define INSTANTIATE_SCAN(ROp, Ti, To)                                        \
+    template Array<To> scan<ROp, Ti, To>(const Array<Ti>& in, const int dim, \
+                                         bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_ALL(ROp)           \
+    INSTANTIATE_SCAN(ROp, float, float)     \
+    INSTANTIATE_SCAN(ROp, double, double)   \
+    INSTANTIATE_SCAN(ROp, cfloat, cfloat)   \
+    INSTANTIATE_SCAN(ROp, cdouble, cdouble) \
+    INSTANTIATE_SCAN(ROp, int, int)         \
+    INSTANTIATE_SCAN(ROp, uint, uint)       \
+    INSTANTIATE_SCAN(ROp, intl, intl)       \
+    INSTANTIATE_SCAN(ROp, uintl, uintl)     \
+    INSTANTIATE_SCAN(ROp, char, int)        \
+    INSTANTIATE_SCAN(ROp, char, uint)       \
+    INSTANTIATE_SCAN(ROp, schar, int)       \
+    INSTANTIATE_SCAN(ROp, uchar, uint)      \
+    INSTANTIATE_SCAN(ROp, short, int)       \
+    INSTANTIATE_SCAN(ROp, ushort, uint)
+
+INSTANTIATE_SCAN(af_notzero_t, char, uint)
+INSTANTIATE_SCAN_ALL(af_add_t)
+INSTANTIATE_SCAN_ALL(af_mul_t)
+INSTANTIATE_SCAN_ALL(af_min_t)
+INSTANTIATE_SCAN_ALL(af_max_t)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/scan.hpp b/src/backend/cpu/scan.hpp
index 2d5deda00c..45cd171092 100644
--- a/src/backend/cpu/scan.hpp
+++ b/src/backend/cpu/scan.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace cpu
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti>& in, const int dim);
-}
+namespace arrayfire {
+namespace cpu {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan = true);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/scan_by_key.cpp b/src/backend/cpu/scan_by_key.cpp
new file mode 100644
index 0000000000..f869098ffd
--- /dev/null
+++ b/src/backend/cpu/scan_by_key.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <kernel/scan_by_key.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <scan_by_key.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan) {
+    const dim4& dims = in.dims();
+    Array<To> out    = createEmptyArray<To>(dims);
+    kernel::scan_dim_by_key<op, Ti, Tk, To, 1> func1(inclusive_scan);
+    kernel::scan_dim_by_key<op, Ti, Tk, To, 2> func2(inclusive_scan);
+    kernel::scan_dim_by_key<op, Ti, Tk, To, 3> func3(inclusive_scan);
+    kernel::scan_dim_by_key<op, Ti, Tk, To, 4> func4(inclusive_scan);
+
+    switch (in.ndims()) {
+        case 1: getQueue().enqueue(func1, out, 0, key, 0, in, 0, dim); break;
+        case 2: getQueue().enqueue(func2, out, 0, key, 0, in, 0, dim); break;
+        case 3: getQueue().enqueue(func3, out, 0, key, 0, in, 0, dim); break;
+        case 4: getQueue().enqueue(func4, out, 0, key, 0, in, 0, dim); break;
+    }
+
+    return out;
+}
+
+#define INSTANTIATE_SCAN_BY_KEY(ROp, Ti, Tk, To)                  \
+    template Array<To> scan<ROp, Ti, Tk, To>(                     \
+        const Array<Tk>& key, const Array<Ti>& in, const int dim, \
+        bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_BY_KEY_ALL(ROp, Tk)           \
+    INSTANTIATE_SCAN_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_BY_KEY_ALL_OP(ROp) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, int)   \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uint)  \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, intl)  \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uintl)
+
+INSTANTIATE_SCAN_BY_KEY_ALL_OP(af_add_t)
+INSTANTIATE_SCAN_BY_KEY_ALL_OP(af_mul_t)
+INSTANTIATE_SCAN_BY_KEY_ALL_OP(af_min_t)
+INSTANTIATE_SCAN_BY_KEY_ALL_OP(af_max_t)
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/scan_by_key.hpp b/src/backend/cpu/scan_by_key.hpp
new file mode 100644
index 0000000000..414840dc35
--- /dev/null
+++ b/src/backend/cpu/scan_by_key.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan = true);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/select.cpp b/src/backend/cpu/select.cpp
new file mode 100644
index 0000000000..8258cae47a
--- /dev/null
+++ b/src/backend/cpu/select.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <kernel/select.hpp>
+#include <select.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b) {
+    getQueue().enqueue(kernel::select<T>, out, cond, a, b);
+}
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b) {
+    getQueue().enqueue(kernel::select_scalar<T, flip>, out, cond, a, b);
+}
+
+#define INSTANTIATE(T)                                                   \
+    template void select<T>(Array<T> & out, const Array<char> &cond,     \
+                            const Array<T> &a, const Array<T> &b);       \
+    template void select_scalar<T, true>(Array<T> & out,                 \
+                                         const Array<char> &cond,        \
+                                         const Array<T> &a, const T &b); \
+    template void select_scalar<T, false>(Array<T> & out,                \
+                                          const Array<char> &cond,       \
+                                          const Array<T> &a, const T &b);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/select.hpp b/src/backend/cpu/select.hpp
new file mode 100644
index 0000000000..1ed5d3969b
--- /dev/null
+++ b/src/backend/cpu/select.hpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b);
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b);
+
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const af::dim4 &odims) {
+    Array<T> out = createEmptyArray<T>(odims);
+    select(out, cond, a, b);
+    return out;
+}
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b, const af::dim4 &odims) {
+    Array<T> out = createEmptyArray<T>(odims);
+    select_scalar<T, flip>(out, cond, a, b);
+    return out;
+}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/set.cpp b/src/backend/cpu/set.cpp
index 3a8239ed1d..6db13c8760 100644
--- a/src/backend/cpu/set.cpp
+++ b/src/backend/cpu/set.cpp
@@ -7,112 +7,125 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <algorithm>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <set.hpp>
 #include <copy.hpp>
-#include <sort.hpp>
 #include <err_cpu.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <set.hpp>
+#include <sort.hpp>
+#include <af/dim4.hpp>
+#include <algorithm>
+#include <complex>
 #include <vector>
 
-namespace cpu
-{
-    using namespace std;
-    using af::dim4;
-
-    template<typename T>
-    Array<T> setUnique(const Array<T> &in,
-                        const bool is_sorted)
-    {
-        Array<T> out = createEmptyArray<T>(af::dim4());
-        if (is_sorted) out = copyArray<T>(in);
-        else           out = sort<T, 1>(in, 0);
-
-        T *ptr = out.get();
-        T *last = std::unique(ptr, ptr + in.elements());
-        dim_t dist = (dim_t)std::distance(ptr, last);
-
-        dim4 dims(dist, 1, 1, 1);
-        out.resetDims(dims);
-        return out;
+namespace arrayfire {
+namespace cpu {
+
+using af::dim4;
+using std::distance;
+using std::set_intersection;
+using std::set_union;
+using std::unique;
+
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted) {
+    Array<T> out = createEmptyArray<T>(af::dim4());
+    if (is_sorted) {
+        out = copyArray<T>(in);
+    } else {
+        out = sort<T>(in, 0, true);
     }
 
-    template<typename T>
-    Array<T> setUnion(const Array<T> &first,
-                       const Array<T> &second,
-                       const bool is_unique)
-    {
-        Array<T> uFirst = first;
-        Array<T> uSecond = second;
-
-        if (!is_unique) {
-            // FIXME: Perhaps copy + unique would do ?
-            uFirst  = setUnique(first, false);
-            uSecond = setUnique(second, false);
-        }
+    // Need to sync old jobs since we need to
+    // operate on pointers directly in std::unique
+    getQueue().sync();
 
-        dim_t first_elements  = uFirst.elements();
-        dim_t second_elements = uSecond.elements();
-        dim_t elements = first_elements + second_elements;
+    T *ptr    = out.get();
+    T *last   = unique(ptr, ptr + in.elements());
+    auto dist = static_cast<dim_t>(distance(ptr, last));
 
-        Array<T> out = createEmptyArray<T>(af::dim4(elements));
-
-        T *ptr = out.get();
-        T *last = std::set_union(uFirst.get() , uFirst.get()  + first_elements,
-                                 uSecond.get(), uSecond.get() + second_elements,
-                                 ptr);
+    dim4 dims(dist, 1, 1, 1);
+    out.resetDims(dims);
+    return out;
+}
 
-        dim_t dist = (dim_t)std::distance(ptr, last);
-        dim4 dims(dist, 1, 1, 1);
-        out.resetDims(dims);
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique) {
+    Array<T> uFirst  = first;
+    Array<T> uSecond = second;
 
-        return out;
+    if (!is_unique) {
+        // FIXME: Perhaps copy + unique would do ?
+        uFirst  = setUnique(first, false);
+        uSecond = setUnique(second, false);
     }
 
-    template<typename T>
-    Array<T> setIntersect(const Array<T> &first,
-                          const Array<T> &second,
-                          const bool is_unique)
-    {
-        Array<T> uFirst = first;
-        Array<T> uSecond = second;
+    dim_t first_elements  = uFirst.elements();
+    dim_t second_elements = uSecond.elements();
+    dim_t elements        = first_elements + second_elements;
 
-        if (!is_unique) {
-            uFirst  = setUnique(first, false);
-            uSecond = setUnique(second, false);
-        }
+    Array<T> out = createEmptyArray<T>(af::dim4(elements));
 
-        dim_t first_elements  = uFirst.elements();
-        dim_t second_elements = uSecond.elements();
-        dim_t elements = std::max(first_elements, second_elements);
+    T *ptr  = out.get();
+    T *last = set_union(uFirst.get(), uFirst.get() + first_elements,
+                        uSecond.get(), uSecond.get() + second_elements, ptr);
 
-        Array<T> out = createEmptyArray<T>(af::dim4(elements));
+    auto dist = static_cast<dim_t>(distance(ptr, last));
+    dim4 dims(dist, 1, 1, 1);
+    out.resetDims(dims);
 
-        T *ptr = out.get();
-        T *last = std::set_intersection(uFirst.get() , uFirst.get()  + first_elements,
-                                        uSecond.get(), uSecond.get() + second_elements,
-                                        ptr);
+    return out;
+}
 
-        dim_t dist = (dim_t)std::distance(ptr, last);
-        dim4 dims(dist, 1, 1, 1);
-        out.resetDims(dims);
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique) {
+    Array<T> uFirst  = first;
+    Array<T> uSecond = second;
 
-        return out;
+    if (!is_unique) {
+        uFirst  = setUnique(first, false);
+        uSecond = setUnique(second, false);
     }
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
-    template Array<T> setUnion<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-    template Array<T> setIntersect<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+    dim_t first_elements  = uFirst.elements();
+    dim_t second_elements = uSecond.elements();
+    dim_t elements        = std::max(first_elements, second_elements);
+
+    Array<T> out = createEmptyArray<T>(af::dim4(elements));
+
+    T *ptr = out.get();
+    T *last =
+        set_intersection(uFirst.get(), uFirst.get() + first_elements,
+                         uSecond.get(), uSecond.get() + second_elements, ptr);
+
+    auto dist = static_cast<dim_t>(distance(ptr, last));
+    dim4 dims(dist, 1, 1, 1);
+    out.resetDims(dims);
+
+    return out;
 }
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
+    template Array<T> setUnion<T>(                                            \
+        const Array<T> &first, const Array<T> &second, const bool is_unique); \
+    template Array<T> setIntersect<T>(                                        \
+        const Array<T> &first, const Array<T> &second, const bool is_unique);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/set.hpp b/src/backend/cpu/set.hpp
index f007cdf101..086fcc6866 100644
--- a/src/backend/cpu/set.hpp
+++ b/src/backend/cpu/set.hpp
@@ -7,19 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
+#pragma once
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T> Array<T> setUnique(const Array<T> &in,
-                                            const bool is_sorted);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted);
 
-    template<typename T> Array<T> setUnion(const Array<T> &first,
-                                           const Array<T> &second,
-                                           const bool is_unique);
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique);
 
-    template<typename T> Array<T> setIntersect(const Array<T> &first,
-                                               const Array<T> &second,
-                                               const bool is_unique);
-}
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/shift.cpp b/src/backend/cpu/shift.cpp
index a9d12b99c4..d812cbde89 100644
--- a/src/backend/cpu/shift.cpp
+++ b/src/backend/cpu/shift.cpp
@@ -8,77 +8,40 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <kernel/shift.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
 #include <shift.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-#include <cassert>
 
-namespace cpu
-{
-    static inline dim_t simple_mod(const dim_t i, const dim_t dim)
-    {
-        return (i < dim) ? i : (i - dim);
-    }
+namespace arrayfire {
+namespace cpu {
 
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4])
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    const af::dim4 temp(sdims[0], sdims[1], sdims[2], sdims[3]);
 
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        T* outPtr = out.get();
-        const T* inPtr = in.get();
-
-        const af::dim4 ist = in.strides();
-        const af::dim4 ost = out.strides();
-
-        int sdims_[4];
-        // Need to do this because we are mapping output to input in the kernel
-        for(int i = 0; i < 4; i++) {
-            // sdims_[i] will always be positive and always [0, oDims[i]].
-            // Negative shifts are converted to position by going the other way round
-            sdims_[i] = -(sdims[i] % (int)oDims[i]) + oDims[i] * (sdims[i] > 0);
-            assert(sdims_[i] >= 0 && sdims_[i] <= oDims[i]);
-        }
-
-        for(dim_t ow = 0; ow < oDims[3]; ow++) {
-            const int oW = ow * ost[3];
-            const int iw = simple_mod((ow + sdims_[3]), oDims[3]);
-            const int iW = iw * ist[3];
-            for(dim_t oz = 0; oz < oDims[2]; oz++) {
-                const int oZW = oW + oz * ost[2];
-                const int iz = simple_mod((oz + sdims_[2]), oDims[2]);
-                const int iZW = iW + iz * ist[2];
-                for(dim_t oy = 0; oy < oDims[1]; oy++) {
-                    const int oYZW = oZW + oy * ost[1];
-                    const int iy = simple_mod((oy + sdims_[1]), oDims[1]);
-                    const int iYZW = iZW + iy * ist[1];
-                    for(dim_t ox = 0; ox < oDims[0]; ox++) {
-                        const int oIdx = oYZW + ox;
-                        const int ix = simple_mod((ox + sdims_[0]), oDims[0]);
-                        const int iIdx = iYZW + ix;
-
-                        outPtr[oIdx] = inPtr[iIdx];
-                    }
-                }
-            }
-        }
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                  \
-    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    getQueue().enqueue(kernel::shift<T>, out, in, temp);
 
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/shift.hpp b/src/backend/cpu/shift.hpp
index ce76eee3dc..0e298f16ae 100644
--- a/src/backend/cpu/shift.hpp
+++ b/src/backend/cpu/shift.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4]);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sift.cpp b/src/backend/cpu/sift.cpp
new file mode 100644
index 0000000000..246505a206
--- /dev/null
+++ b/src/backend/cpu/sift.cpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sift.hpp>
+
+#include <kernel/sift.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH) {
+    return sift_impl<T, convAccT>(
+        x, y, score, ori, size, desc, in, n_layers, contrast_thr, edge_thr,
+        init_sigma, double_input, img_scale, feature_ratio, compute_GLOH);
+}
+
+#define INSTANTIATE(T, convAccT)                                               \
+    template unsigned sift<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,              \
+        Array<float> & ori, Array<float> & size, Array<float> & desc,          \
+        const Array<T>& in, const unsigned n_layers, const float contrast_thr, \
+        const float edge_thr, const float init_sigma, const bool double_input, \
+        const float img_scale, const float feature_ratio,                      \
+        const bool compute_GLOH);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sift.hpp b/src/backend/cpu/sift.hpp
new file mode 100644
index 0000000000..804e52eb27
--- /dev/null
+++ b/src/backend/cpu/sift.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sobel.cpp b/src/backend/cpu/sobel.cpp
index 41cd8ce11b..5708348295 100644
--- a/src/backend/cpu/sobel.cpp
+++ b/src/backend/cpu/sobel.cpp
@@ -7,102 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <sobel.hpp>
 #include <convolve.hpp>
-#include <err_cpu.hpp>
-#include <utility>
+#include <kernel/sobel.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <sobel.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace cpu
-{
-
-template<typename Ti, typename To, bool isDX>
-void derivative(To *optr, Ti const *iptr, dim4 const &dims, dim4 const &strides)
-{
-    for(dim_t b3=0; b3<dims[3]; ++b3) {
-    for(dim_t b2=0; b2<dims[2]; ++b2) {
-
-        for(dim_t j=0; j<dims[1]; ++j) {
-
-            int joff  = j;
-            int _joff = j-1;
-            int joff_ = j+1;
-            int joffset = j*strides[1];
-
-            for(dim_t i=0; i<dims[0]; ++i) {
-
-                To accum = To(0);
-
-                int  ioff = i;
-                int _ioff = i-1;
-                int ioff_ = i+1;
-
-                To NW = (_ioff>=0 && _joff>=0) ?
-                        iptr[_joff*strides[1]+_ioff*strides[0]] : 0;
-                To SW = (ioff_<(int)dims[0] && _joff>=0) ?
-                        iptr[_joff*strides[1]+ioff_*strides[0]] : 0;
-                To NE = (_ioff>=0 && joff_<(int)dims[1]) ?
-                        iptr[joff_*strides[1]+_ioff*strides[0]] : 0;
-                To SE = (ioff_<(int)dims[0] && joff_<(int)dims[1]) ?
-                        iptr[joff_*strides[1]+ioff_*strides[0]] : 0;
-
-                if (isDX) {
-                    To W  = _joff>=0 ?
-                            iptr[_joff*strides[1]+ioff*strides[0]] : 0;
-
-                    To E  = joff_<(int)dims[1] ?
-                            iptr[joff_*strides[1]+ioff*strides[0]] : 0;
-
-                    accum = NW+SW - (NE+SE) + 2*(W-E);
-                } else {
-                    To N  = _ioff>=0 ?
-                            iptr[joff*strides[1]+_ioff*strides[0]] : 0;
-
-                    To S  = ioff_<(int)dims[0] ?
-                            iptr[joff*strides[1]+ioff_*strides[0]] : 0;
-
-                    accum = NW+NE - (SW+SE) + 2*(N-S);
-                }
-
-                optr[joffset+i*strides[0]] = accum;
-            }
-        }
-
-        optr += strides[2];
-        iptr += strides[2];
-    }
-    optr += strides[3];
-    iptr += strides[3];
-    }
-}
+namespace arrayfire {
+namespace cpu {
 
 template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size)
-{
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size) {
+    UNUSED(ker_size);
+    // ker_size is for future proofing, this argument is not used
+    // currently
     Array<To> dx = createEmptyArray<To>(img.dims());
     Array<To> dy = createEmptyArray<To>(img.dims());
 
-    derivative<Ti, To, true >(dx.get(), img.get(), img.dims(), img.strides());
-    derivative<Ti, To, false>(dy.get(), img.get(), img.dims(), img.strides());
+    getQueue().enqueue(kernel::derivative<Ti, To, true>, dx, img);
+    getQueue().enqueue(kernel::derivative<Ti, To, false>, dy, img);
 
     return std::make_pair(dx, dy);
 }
 
-#define INSTANTIATE(Ti, To)                                                 \
-    template std::pair< Array<To>, Array<To> >                            \
-    sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
+#define INSTANTIATE(Ti, To)                                    \
+    template std::pair<Array<To>, Array<To>> sobelDerivatives( \
+        const Array<Ti> &img, const unsigned &ker_size);
 
-INSTANTIATE(float , float)
+INSTANTIATE(float, float)
 INSTANTIATE(double, double)
-INSTANTIATE(int   , int)
-INSTANTIATE(uint  , int)
-INSTANTIATE(char  , int)
-INSTANTIATE(uchar , int)
-
-}
+INSTANTIATE(int, int)
+INSTANTIATE(uint, int)
+INSTANTIATE(char, int)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, int)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, int)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sobel.hpp b/src/backend/cpu/sobel.hpp
index 23678a9748..ad1082d18e 100644
--- a/src/backend/cpu/sobel.hpp
+++ b/src/backend/cpu/sobel.hpp
@@ -10,11 +10,12 @@
 #include <Array.hpp>
 #include <utility>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/solve.cpp b/src/backend/cpu/solve.cpp
index 1e88e8d915..0e8d863817 100644
--- a/src/backend/cpu/solve.cpp
+++ b/src/backend/cpu/solve.cpp
@@ -8,197 +8,363 @@
  ********************************************************/
 
 #include <solve.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_CPU_LINEAR_ALGEBRA)
-
-#include <af/dim4.hpp>
-#include <handle.hpp>
-#include <range.hpp>
-#include <iostream>
-#include <cassert>
 #include <err_cpu.hpp>
 
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
 #include <lapack_helper.hpp>
+#include <math.hpp>
+#if USE_MKL
+#include <mkl_version.h>
+#endif
+#include <queue.hpp>
+#include <af/dim4.hpp>
+#include <algorithm>
+#include <complex>
+#include <vector>
+
+using af::dim4;
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-using gesv_func_def = int (*)(ORDER_TYPE, int, int,
-                              T *, int,
-                              int *,
-                              T *, int);
+using gesv_func_def = int (*)(ORDER_TYPE, int, int, T *, int, int *, T *, int);
 
 template<typename T>
-using gels_func_def = int (*)(ORDER_TYPE, char,
-                              int, int, int,
-                              T *, int,
-                              T *, int);
+using gels_func_def = int (*)(ORDER_TYPE, char, int, int, int, T *, int, T *,
+                              int);
 
+#ifdef AF_USE_MKL_BATCH
 template<typename T>
-using getrs_func_def = int (*)(ORDER_TYPE, char,
-                               int, int,
-                               const T *, int,
-                               const int *,
-                               T *, int);
+using getrf_batch_strided_func_def =
+    void (*)(const MKL_INT *m, const MKL_INT *n, T *a, const MKL_INT *lda,
+             const MKL_INT *stride_a, MKL_INT *ipiv, const MKL_INT *stride_ipiv,
+             const MKL_INT *batch_size, MKL_INT *info);
 
+#if INTEL_MKL_VERSION >= 20210004
 template<typename T>
-using trtrs_func_def = int (*)(ORDER_TYPE,
-                               char, char, char,
-                               int, int,
-                               const T *, int,
-                               T *, int);
+using getrs_batch_strided_func_def = void (*)(
+    const char *trans, const MKL_INT *n, const MKL_INT *nrhs, const T *a,
+    const MKL_INT *lda, const MKL_INT *stride_a, const MKL_INT *ipiv,
+    const MKL_INT *stride_ipiv, T *b, const MKL_INT *ldb,
+    const MKL_INT *stride_b, const MKL_INT *batch_size, MKL_INT *info);
+#else
+template<typename T>
+using getrs_batch_strided_func_def =
+    void (*)(const char *trans, const MKL_INT *n, const MKL_INT *nrhs, T *a,
+             const MKL_INT *lda, const MKL_INT *stride_a, MKL_INT *ipiv,
+             const MKL_INT *stride_ipiv, T *b, const MKL_INT *ldb,
+             const MKL_INT *stride_b, const MKL_INT *batch_size, MKL_INT *info);
+#endif
+#endif
 
+template<typename T>
+using getrs_func_def = int (*)(ORDER_TYPE, char, int, int, const T *, int,
+                               const int *, T *, int);
+
+template<typename T>
+using trtrs_func_def = int (*)(ORDER_TYPE, char, char, char, int, int,
+                               const T *, int, T *, int);
 
-#define SOLVE_FUNC_DEF( FUNC )                                      \
-template<typename T> FUNC##_func_def<T> FUNC##_func();
+#define SOLVE_FUNC_DEF(FUNC) \
+    template<typename T>     \
+    FUNC##_func_def<T> FUNC##_func();
 
+#define SOLVE_FUNC(FUNC, TYPE, PREFIX)          \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
 
-#define SOLVE_FUNC( FUNC, TYPE, PREFIX )                            \
-template<> FUNC##_func_def<TYPE>     FUNC##_func<TYPE>()            \
-{ return & LAPACK_NAME(PREFIX##FUNC); }
+SOLVE_FUNC_DEF(gesv)
+SOLVE_FUNC(gesv, float, s)
+SOLVE_FUNC(gesv, double, d)
+SOLVE_FUNC(gesv, cfloat, c)
+SOLVE_FUNC(gesv, cdouble, z)
 
-SOLVE_FUNC_DEF( gesv )
-SOLVE_FUNC(gesv , float  , s)
-SOLVE_FUNC(gesv , double , d)
-SOLVE_FUNC(gesv , cfloat , c)
-SOLVE_FUNC(gesv , cdouble, z)
+SOLVE_FUNC_DEF(gels)
+SOLVE_FUNC(gels, float, s)
+SOLVE_FUNC(gels, double, d)
+SOLVE_FUNC(gels, cfloat, c)
+SOLVE_FUNC(gels, cdouble, z)
 
-SOLVE_FUNC_DEF( gels )
-SOLVE_FUNC(gels , float  , s)
-SOLVE_FUNC(gels , double , d)
-SOLVE_FUNC(gels , cfloat , c)
-SOLVE_FUNC(gels , cdouble, z)
+#ifdef AF_USE_MKL_BATCH
 
-SOLVE_FUNC_DEF( getrs )
-SOLVE_FUNC(getrs , float  , s)
-SOLVE_FUNC(getrs , double , d)
-SOLVE_FUNC(getrs , cfloat , c)
-SOLVE_FUNC(getrs , cdouble, z)
+template<typename T>
+struct mkl_type {
+    using type = T;
+};
+template<>
+struct mkl_type<std::complex<float>> {
+    using type = MKL_Complex8;
+};
+template<>
+struct mkl_type<std::complex<double>> {
+    using type = MKL_Complex16;
+};
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wnoexcept-type"
+template<typename T>
+getrf_batch_strided_func_def<T> getrf_batch_strided_func();
 
-SOLVE_FUNC_DEF( trtrs )
-SOLVE_FUNC(trtrs , float  , s)
-SOLVE_FUNC(trtrs , double , d)
-SOLVE_FUNC(trtrs , cfloat , c)
-SOLVE_FUNC(trtrs , cdouble, z)
+template<>
+getrf_batch_strided_func_def<float> getrf_batch_strided_func<float>() {
+    return &sgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<double> getrf_batch_strided_func<double>() {
+    return &dgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<MKL_Complex8>
+getrf_batch_strided_func<MKL_Complex8>() {
+    return &cgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<MKL_Complex16>
+getrf_batch_strided_func<MKL_Complex16>() {
+    return &zgetrf_batch_strided;
+}
 
 template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    int N = A.dims()[0];
-    int NRHS = b.dims()[1];
+getrs_batch_strided_func_def<T> getrs_batch_strided_func();
+
+template<>
+getrs_batch_strided_func_def<float> getrs_batch_strided_func<float>() {
+    return &sgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<double> getrs_batch_strided_func<double>() {
+    return &dgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<MKL_Complex8>
+getrs_batch_strided_func<MKL_Complex8>() {
+    return &cgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<MKL_Complex16>
+getrs_batch_strided_func<MKL_Complex16>() {
+    return &zgetrs_batch_strided;
+}
+
+#pragma GCC diagnostic pop
+#endif
 
-    Array< T > B = copyArray<T>(b);
+SOLVE_FUNC_DEF(getrs)
+SOLVE_FUNC(getrs, float, s)
+SOLVE_FUNC(getrs, double, d)
+SOLVE_FUNC(getrs, cfloat, c)
+SOLVE_FUNC(getrs, cdouble, z)
 
-    getrs_func<T>()(AF_LAPACK_COL_MAJOR, 'N',
-                    N, NRHS,
-                    A.get(), A.strides()[1],
-                    pivot.get(),
-                    B.get(), B.strides()[1]);
+SOLVE_FUNC_DEF(trtrs)
+SOLVE_FUNC(trtrs, float, s)
+SOLVE_FUNC(trtrs, double, d)
+SOLVE_FUNC(trtrs, cfloat, c)
+SOLVE_FUNC(trtrs, cdouble, z)
+
+template<typename T>
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    UNUSED(options);
+    int N      = A.dims()[0];
+    int NRHS   = b.dims()[1];
+    Array<T> B = copyArray<T>(b);
+
+    // NOLINTNEXTLINE
+    auto func = [=](CParam<T> A, Param<T> B, CParam<int> pivot, int N,
+                    int NRHS) {
+        getrs_func<T>()(AF_LAPACK_COL_MAJOR, 'N', N, NRHS, A.get(),
+                        A.strides(1), pivot.get(), B.get(), B.strides(1));
+    };
+    getQueue().enqueue(func, A, B, pivot, N, NRHS);
 
     return B;
 }
 
 template<typename T>
-Array<T> triangleSolve(const Array<T> &A, const Array<T> &b, const af_mat_prop options)
-{
+Array<T> triangleSolve(const Array<T> &A, const Array<T> &b,
+                       const af_mat_prop options) {
     Array<T> B = copyArray<T>(b);
-    int N = B.dims()[0];
-    int NRHS = B.dims()[1];
-
-    trtrs_func<T>()(AF_LAPACK_COL_MAJOR,
-                    options & AF_MAT_UPPER ? 'U' : 'L',
-                    'N', // transpose flag
-                    options & AF_MAT_DIAG_UNIT ? 'U' : 'N',
-                    N, NRHS,
-                    A.get(), A.strides()[1],
-                    B.get(), B.strides()[1]);
+    int N      = B.dims()[0];
+    int NRHS   = B.dims()[1];
+
+    auto func = [=](const CParam<T> A, Param<T> B, int N, int NRHS,
+                    const af_mat_prop options) {
+        trtrs_func<T>()(AF_LAPACK_COL_MAJOR, options & AF_MAT_UPPER ? 'U' : 'L',
+                        'N',  // transpose flag
+                        options & AF_MAT_DIAG_UNIT ? 'U' : 'N', N, NRHS,
+                        A.get(), A.strides(1), B.get(), B.strides(1));
+    };
+    getQueue().enqueue(func, A, B, N, NRHS, options);
+
     return B;
 }
 
+#ifdef AF_USE_MKL_BATCH
 
 template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
+Array<T> generalSolveBatched(const Array<T> &a, const Array<T> &b,
+                             const af_mat_prop options) {
+    using std::vector;
+    int batches = a.dims()[2] * a.dims()[3];
 
-    if (options & AF_MAT_UPPER ||
-        options & AF_MAT_LOWER) {
-        return triangleSolve<T>(a, b, options);
-    }
+    dim4 aDims = a.dims();
+    dim4 bDims = b.dims();
+    int M      = aDims[0];
+    int N      = aDims[1];
+    int K      = bDims[1];
+    int MN     = std::min(M, N);
 
-    int M = a.dims()[0];
-    int N = a.dims()[1];
-    int K = b.dims()[1];
+    int lda     = a.strides()[1];
+    int astride = a.strides()[2];
 
+    vector<int> ipiv(MN * batches);
+    int ipivstride = MN;
+
+    int ldb     = b.strides()[1];
+    int bstride = b.strides()[2];
+
+    vector<int> info(batches, 0);
+
+    char trans = 'N';
 
     Array<T> A = copyArray<T>(a);
-    Array<T> B = padArray<T, T>(b, dim4(max(M, N), K));
-
-    if(M == N) {
-        Array<int> pivot = createEmptyArray<int>(dim4(N, 1, 1));
-        gesv_func<T>()(AF_LAPACK_COL_MAJOR, N, K,
-                       A.get(), A.strides()[1],
-                       pivot.get(),
-                       B.get(), B.strides()[1]);
-    } else {
-        int sM = a.strides()[1];
-        int sN = a.strides()[2] / sM;
-
-        gels_func<T>()(AF_LAPACK_COL_MAJOR, 'N',
-                       M, N, K,
-                       A.get(), A.strides()[1],
-                       B.get(), max(sM, sN));
-        B.resetDims(dim4(N, K));
-    }
+    Array<T> B = copyArray<T>(b);
+
+    auto getrf_rs = [](char TRANS, int M, int N, int K, Param<T> a, int LDA,
+                       int ASTRIDE, vector<int> IPIV, int IPIVSTRIDE,
+                       Param<T> b, int LDB, int BSTRIDE, int BATCH_SIZE,
+                       vector<int> INFO) {
+        getrf_batch_strided_func<typename mkl_type<T>::type>()(
+            &M, &N, reinterpret_cast<typename mkl_type<T>::type *>(a.get()),
+            &LDA, &ASTRIDE, IPIV.data(), &IPIVSTRIDE, &BATCH_SIZE, INFO.data());
+
+        getrs_batch_strided_func<typename mkl_type<T>::type>()(
+            &TRANS, &M, &K,
+            reinterpret_cast<typename mkl_type<T>::type *>(a.get()), &LDA,
+            &ASTRIDE, IPIV.data(), &IPIVSTRIDE,
+            reinterpret_cast<typename mkl_type<T>::type *>(b.get()), &LDB,
+            &BSTRIDE, &BATCH_SIZE, INFO.data());
+    };
+
+    getQueue().enqueue(getrf_rs, trans, M, N, K, A, lda, astride, ipiv,
+                       ipivstride, B, ldb, bstride, batches, info);
 
     return B;
 }
+#endif
 
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
-    template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    if (options & AF_MAT_UPPER || options & AF_MAT_LOWER) {
+        return triangleSolve<T>(a, b, options);
+    }
 
-INSTANTIATE_SOLVE(float)
-INSTANTIATE_SOLVE(cfloat)
-INSTANTIATE_SOLVE(double)
-INSTANTIATE_SOLVE(cdouble)
+#ifdef AF_USE_MKL_BATCH
+    if (a.dims()[2] > 1 || a.dims()[3] > 1) {
+        return generalSolveBatched(a, b, options);
+    }
+#endif
 
+    const dim4 NullShape(0, 0, 0, 0);
+
+    dim4 aDims = a.dims();
+    int batchz = aDims[2];
+    int batchw = aDims[3];
+
+    int M = aDims[0];
+    int N = aDims[1];
+    int K = b.dims()[1];
+
+    Array<T> A = copyArray<T>(a);
+
+    dim4 endPadding(max(M, N) - b.dims()[0], K - b.dims()[1], 0, 0);
+    Array<T> B = (endPadding == NullShape
+                      ? copyArray(b)
+                      : padArrayBorders(b, NullShape, endPadding, AF_PAD_ZERO));
+
+    for (int i = 0; i < batchw; i++) {
+        for (int j = 0; j < batchz; j++) {
+            Param<T> pA(A.get() + A.strides()[2] * j + A.strides()[3] * i,
+                        A.dims(), A.strides());
+            Param<T> pB(B.get() + B.strides()[2] * j + B.strides()[3] * i,
+                        B.dims(), B.strides());
+            if (M == N) {
+                Array<int> pivot = createEmptyArray<int>(dim4(N, 1, 1));
+
+                auto func = [](Param<T> A, Param<T> B, Param<int> pivot, int N,
+                               int K) {
+                    gesv_func<T>()(AF_LAPACK_COL_MAJOR, N, K, A.get(),
+                                   A.strides(1), pivot.get(), B.get(),
+                                   B.strides(1));
+                };
+                getQueue().enqueue(func, pA, pB, pivot, N, K);
+            } else {
+                auto func = [=](Param<T> A, Param<T> B, int M, int N, int K) {
+                    int sM = A.dims(0);
+                    int sN = A.dims(1);
+
+                    gels_func<T>()(AF_LAPACK_COL_MAJOR, 'N', M, N, K, A.get(),
+                                   A.strides(1), B.get(), max(sM, sN));
+                };
+                getQueue().enqueue(func, pA, pB, M, N, K);
+            }
+        }
+    }
+
+    if (M != N) { B.resetDims(dim4(N, K, B.dims()[2], B.dims()[3])); }
+
+    return B;
 }
 
-#else
+}  // namespace cpu
+}  // namespace arrayfire
 
-namespace cpu
-{
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on CPU",
-             AF_ERR_NOT_CONFIGURED);
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    AF_ERROR(
+        "This version of ArrayFire was built without linear algebra routines",
+        AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on CPU",
-              AF_ERR_NOT_CONFIGURED);
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    AF_ERROR(
+        "This version of ArrayFire was built without linear algebra routines",
+        AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
+}  // namespace cpu
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
+
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
     template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
 
 INSTANTIATE_SOLVE(float)
 INSTANTIATE_SOLVE(cfloat)
 INSTANTIATE_SOLVE(double)
 INSTANTIATE_SOLVE(cdouble)
-}
 
-#endif
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/solve.hpp b/src/backend/cpu/solve.hpp
index 84166015e0..c63ec1252b 100644
--- a/src/backend/cpu/solve.hpp
+++ b/src/backend/cpu/solve.hpp
@@ -7,15 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options = AF_MAT_NONE);
 
-    template<typename T>
-    Array<T> solveLU(const Array<T> &a, const Array<int> &pivot,
-                     const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
-}
+template<typename T>
+Array<T> solveLU(const Array<T> &a, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options = AF_MAT_NONE);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort.cpp b/src/backend/cpu/sort.cpp
index 6c1ebb7cdd..41c6b75147 100644
--- a/src/backend/cpu/sort.cpp
+++ b/src/backend/cpu/sort.cpp
@@ -8,77 +8,102 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <sort.hpp>
-#include <math.hpp>
 #include <copy.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
+#include <iota.hpp>
+#include <kernel/sort.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
+#include <sort.hpp>
+#include <sort_by_key.hpp>
 #include <algorithm>
 #include <functional>
 
-using std::greater;
-using std::less;
-using std::sort;
-using std::function;
-
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Kernel Functions
-    ///////////////////////////////////////////////////////////////////////////
-
-    // Based off of http://stackoverflow.com/a/12399290
-    template<typename T, bool isAscending>
-    void sort0(Array<T> &val)
-    {
-        // initialize original index locations
-        T *val_ptr = val.get();
-
-        function<bool(T, T)> op = greater<T>();
-        if(isAscending) { op = less<T>(); }
-
-        T *comp_ptr = nullptr;
-        for(dim_t w = 0; w < val.dims()[3]; w++) {
-            dim_t valW = w * val.strides()[3];
-            for(dim_t z = 0; z < val.dims()[2]; z++) {
-                dim_t valWZ = valW + z * val.strides()[2];
-                for(dim_t y = 0; y < val.dims()[1]; y++) {
-
-                    dim_t valOffset = valWZ + y * val.strides()[1];
-
-                    comp_ptr = val_ptr + valOffset;
-                    std::sort(comp_ptr, comp_ptr + val.dims()[0], op);
-                }
-            }
-        }
-        return;
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, int dim>
+void sortBatched(Array<T>& val, bool isAscending) {
+    af::dim4 inDims = val.dims();
+
+    // Sort dimension
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    Array<uint> key = iota<uint>(seqDims, tileDims);
+
+    Array<uint> resKey = createEmptyArray<uint>(dim4());
+    Array<T> resVal    = createEmptyArray<T>(dim4());
+
+    val.setDataDims(inDims.elements());
+    key.setDataDims(inDims.elements());
+
+    sort_by_key<T, uint>(resVal, resKey, val, key, 0, isAscending);
+
+    // Needs to be ascending (true) in order to maintain the indices properly
+    sort_by_key<uint, T>(key, val, resKey, resVal, 0, true);
+    val.setDataDims(inDims);  // This is correct only for dim0
+}
+
+template<typename T>
+void sort0(Array<T>& val, bool isAscending) {
+    int higherDims = val.elements() / val.dims()[0];
+    // TODO Make a better heuristic
+    if (higherDims > 10) {
+        sortBatched<T, 0>(val, isAscending);
+    } else {
+        getQueue().enqueue(kernel::sort0Iterative<T>, val, isAscending);
     }
+}
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Wrapper Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim)
-    {
-        Array<T> out = copyArray<T>(in);
-        switch(dim) {
-            case 0: sort0<T, isAscending>(out);
-                    break;
-            default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-        }
-        return out;
+template<typename T>
+Array<T> sort(const Array<T>& in, const unsigned dim, bool isAscending) {
+    Array<T> out = copyArray<T>(in);
+    switch (dim) {
+        case 0: sort0<T>(out, isAscending); break;
+        case 1: sortBatched<T, 1>(out, isAscending); break;
+        case 2: sortBatched<T, 2>(out, isAscending); break;
+        case 3: sortBatched<T, 3>(out, isAscending); break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
     }
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> sort<T, true>(const Array<T> &in, const unsigned dim); \
-    template Array<T> sort<T,false>(const Array<T> &in, const unsigned dim); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    //INSTANTIATE(cfloat)
-    //INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+    if (dim != 0) {
+        af::dim4 preorderDims = out.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = out.dims()[dim];
+        for (int i = 1; i <= static_cast<int>(dim); i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = out.dims()[i - 1];
+        }
+
+        out.setDataDims(preorderDims);
+        out = reorder<T>(out, reorderDims);
+    }
+    return out;
 }
+
+#define INSTANTIATE(T)                                                \
+    template Array<T> sort<T>(const Array<T>& in, const unsigned dim, \
+                              bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+// INSTANTIATE(cfloat)
+// INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort.hpp b/src/backend/cpu/sort.hpp
index 79caf3aa2d..c22dab7c7d 100644
--- a/src/backend/cpu/sort.hpp
+++ b/src/backend/cpu/sort.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort_by_key.cpp b/src/backend/cpu/sort_by_key.cpp
index b96c6cc55a..efe8eba2f1 100644
--- a/src/backend/cpu/sort_by_key.cpp
+++ b/src/backend/cpu/sort_by_key.cpp
@@ -8,125 +8,87 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
 #include <sort_by_key.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-#include <algorithm>
-#include <numeric>
-#include <queue>
-#include <future>
 
-using std::greater;
-using std::less;
-using std::sort;
-using std::function;
-using std::queue;
-using std::future;
-using std::async;
-
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Kernel Functions
-    ///////////////////////////////////////////////////////////////////////////
-
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort0_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey, const Array<Tv> &ival)
-    {
-        function<bool(Tk, Tk)> op = greater<Tk>();
-        if(isAscending) { op = less<Tk>(); }
-
-        // Get pointers and initialize original index locations
-        Array<uint> oidx = createValueArray(ikey.dims(), 0u);
-            uint *oidx_ptr = oidx.get();
-              Tk *okey_ptr = okey.get();
-              Tv *oval_ptr = oval.get();
-        const Tk *ikey_ptr = ikey.get();
-        const Tv *ival_ptr = ival.get();
-
-        std::vector<uint> seq_vec(oidx.dims()[0]);
-        std::iota(seq_vec.begin(), seq_vec.end(), 0);
-
-        const Tk *comp_ptr = nullptr;
-        auto comparator = [&comp_ptr, &op](size_t i1, size_t i2) {return op(comp_ptr[i1], comp_ptr[i2]);};
-
-        for(dim_t w = 0; w < ikey.dims()[3]; w++) {
-            dim_t okeyW = w * okey.strides()[3];
-            dim_t ovalW = w * oval.strides()[3];
-            dim_t oidxW = w * oidx.strides()[3];
-            dim_t ikeyW = w * ikey.strides()[3];
-            dim_t ivalW = w * ival.strides()[3];
-
-            for(dim_t z = 0; z < ikey.dims()[2]; z++) {
-                dim_t okeyWZ = okeyW + z * okey.strides()[2];
-                dim_t ovalWZ = ovalW + z * oval.strides()[2];
-                dim_t oidxWZ = oidxW + z * oidx.strides()[2];
-                dim_t ikeyWZ = ikeyW + z * ikey.strides()[2];
-                dim_t ivalWZ = ivalW + z * ival.strides()[2];
-
-                for(dim_t y = 0; y < ikey.dims()[1]; y++) {
-
-                    dim_t okeyOffset = okeyWZ + y * okey.strides()[1];
-                    dim_t ovalOffset = ovalWZ + y * oval.strides()[1];
-                    dim_t oidxOffset = oidxWZ + y * oidx.strides()[1];
-                    dim_t ikeyOffset = ikeyWZ + y * ikey.strides()[1];
-                    dim_t ivalOffset = ivalWZ + y * ival.strides()[1];
-
-                    uint *ptr = oidx_ptr + oidxOffset;
-                    std::copy(seq_vec.begin(), seq_vec.end(), ptr);
-
-                    comp_ptr = ikey_ptr + ikeyOffset;
-                    std::stable_sort(ptr, ptr + ikey.dims()[0], comparator);
-
-                    for (dim_t i = 0; i < oval.dims()[0]; ++i){
-                        uint sortIdx = oidx_ptr[oidxOffset + i];
-                        okey_ptr[okeyOffset + i] = ikey_ptr[ikeyOffset + sortIdx];
-                        oval_ptr[ovalOffset + i] = ival_ptr[ivalOffset + sortIdx];
-                    }
-                }
-            }
-        }
-
-        return;
+namespace arrayfire {
+namespace cpu {
+
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const uint dim, bool isAscending) {
+    okey = copyArray<Tk>(ikey);
+    oval = copyArray<Tv>(ival);
+
+    switch (dim) {
+        case 0:
+            getQueue().enqueue(kernel::sort0ByKey<Tk, Tv>, okey, oval,
+                               isAscending);
+            break;
+        case 1:
+        case 2:
+        case 3:
+            getQueue().enqueue(kernel::sortByKeyBatched<Tk, Tv>, okey, oval,
+                               dim, isAscending);
+            break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
     }
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Wrapper Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const uint dim)
-    {
-        okey = createEmptyArray<Tk>(ikey.dims());
-        oval = createEmptyArray<Tv>(ival.dims());
-        switch(dim) {
-            case 0: sort0_by_key<Tk, Tv, isAscending>(okey, oval, ikey, ival);
-                    break;
-            default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+    if (dim != 0) {
+        af::dim4 preorderDims = okey.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = okey.dims()[dim];
+        for (int i = 1; i <= static_cast<int>(dim); i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = okey.dims()[i - 1];
         }
-    }
-
-#define INSTANTIATE(Tk, Tv)                                             \
-    template void                                                       \
-    sort_by_key<Tk, Tv, true>(Array<Tk> &okey, Array<Tv> &oval,         \
-                              const Array<Tk> &ikey, const Array<Tv> &ival, const uint dim); \
-    template void                                                       \
-    sort_by_key<Tk, Tv,false>(Array<Tk> &okey, Array<Tv> &oval,         \
-                              const Array<Tk> &ikey, const Array<Tv> &ival, const uint dim); \
 
-#define INSTANTIATE1(Tk)       \
-    INSTANTIATE(Tk, float)     \
-    INSTANTIATE(Tk, double)    \
-    INSTANTIATE(Tk, int)       \
-    INSTANTIATE(Tk, uint)      \
-    INSTANTIATE(Tk, char)      \
-    INSTANTIATE(Tk, uchar)     \
+        okey.setDataDims(preorderDims);
+        oval.setDataDims(preorderDims);
 
-    INSTANTIATE1(float)
-    INSTANTIATE1(double)
-    INSTANTIATE1(int)
-    INSTANTIATE1(uint)
-    INSTANTIATE1(char)
-    INSTANTIATE1(uchar)
+        okey = reorder<Tk>(okey, reorderDims);
+        oval = reorder<Tv>(oval, reorderDims);
+    }
 }
+
+#define INSTANTIATE(Tk, Tv)                                        \
+    template void sort_by_key<Tk, Tv>(                             \
+        Array<Tk> & okey, Array<Tv> & oval, const Array<Tk> &ikey, \
+        const Array<Tv> &ival, const uint dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+INSTANTIATE1(float)
+INSTANTIATE1(double)
+INSTANTIATE1(int)
+INSTANTIATE1(uint)
+INSTANTIATE1(char)
+INSTANTIATE1(schar)
+INSTANTIATE1(uchar)
+INSTANTIATE1(short)
+INSTANTIATE1(ushort)
+INSTANTIATE1(intl)
+INSTANTIATE1(uintl)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort_by_key.hpp b/src/backend/cpu/sort_by_key.hpp
index e11ddd6d21..8ed3bb63f4 100644
--- a/src/backend/cpu/sort_by_key.hpp
+++ b/src/backend/cpu/sort_by_key.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const unsigned dim);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort_index.cpp b/src/backend/cpu/sort_index.cpp
index 75690e062e..8b1f4a1319 100644
--- a/src/backend/cpu/sort_index.cpp
+++ b/src/backend/cpu/sort_index.cpp
@@ -8,101 +8,79 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <sort_index.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <kernel/sort_by_key.hpp>
 #include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
+#include <sort_index.hpp>
+
 #include <algorithm>
 #include <numeric>
-#include <queue>
-#include <future>
-
-using std::greater;
-using std::less;
-using std::sort;
-using std::function;
-using std::queue;
-using std::future;
-using std::async;
-
-namespace cpu
-{
-    ///////////////////////////////////////////////////////////////////////////
-    // Kernel Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T, bool isAscending>
-    void sort0_index(Array<T> &val, Array<uint> &idx, const Array<T> &in)
-    {
-        // initialize original index locations
-           uint *idx_ptr = idx.get();
-              T *val_ptr = val.get();
-        const T *in_ptr  = in.get();
-        function<bool(T, T)> op = greater<T>();
-        if(isAscending) { op = less<T>(); }
-
-        std::vector<uint> seq_vec(idx.dims()[0]);
-        std::iota(seq_vec.begin(), seq_vec.end(), 0);
-
-        const T *comp_ptr = nullptr;
-        auto comparator = [&comp_ptr, &op](size_t i1, size_t i2) {return op(comp_ptr[i1], comp_ptr[i2]);};
 
-        for(dim_t w = 0; w < in.dims()[3]; w++) {
-            dim_t valW = w * val.strides()[3];
-            dim_t idxW = w * idx.strides()[3];
-            dim_t  inW = w *  in.strides()[3];
-            for(dim_t z = 0; z < in.dims()[2]; z++) {
-                dim_t valWZ = valW + z * val.strides()[2];
-                dim_t idxWZ = idxW + z * idx.strides()[2];
-                dim_t  inWZ =  inW + z *  in.strides()[2];
-                for(dim_t y = 0; y < in.dims()[1]; y++) {
+namespace arrayfire {
+namespace cpu {
 
-                    dim_t valOffset = valWZ + y * val.strides()[1];
-                    dim_t idxOffset = idxWZ + y * idx.strides()[1];
-                    dim_t inOffset  =  inWZ + y *  in.strides()[1];
+template<typename T>
+void sort_index(Array<T> &okey, Array<uint> &oval, const Array<T> &in,
+                const uint dim, bool isAscending) {
+    // okey is values, oval is indices
+    okey = copyArray<T>(in);
+    oval = range<uint>(in.dims(), dim);
 
-                    uint *ptr = idx_ptr + idxOffset;
-                    std::copy(seq_vec.begin(), seq_vec.end(), ptr);
-
-                    comp_ptr = in_ptr + inOffset;
-                    std::stable_sort(ptr, ptr + in.dims()[0], comparator);
+    switch (dim) {
+        case 0:
+            getQueue().enqueue(kernel::sort0ByKey<T, uint>, okey, oval,
+                               isAscending);
+            break;
+        case 1:
+        case 2:
+        case 3:
+            getQueue().enqueue(kernel::sortByKeyBatched<T, uint>, okey, oval,
+                               dim, isAscending);
+            break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
 
-                    for (dim_t i = 0; i < val.dims()[0]; ++i){
-                        val_ptr[valOffset + i] = in_ptr[inOffset + idx_ptr[idxOffset + i]];
-                    }
-                }
-            }
+    if (dim != 0) {
+        af::dim4 preorderDims = okey.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = okey.dims()[dim];
+        for (int i = 1; i <= static_cast<int>(dim); i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = okey.dims()[i - 1];
         }
 
-        return;
-    }
+        okey.setDataDims(preorderDims);
+        oval.setDataDims(preorderDims);
 
-    ///////////////////////////////////////////////////////////////////////////
-    // Wrapper Functions
-    ///////////////////////////////////////////////////////////////////////////
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<uint> &idx, const Array<T> &in, const uint dim)
-    {
-        val = createEmptyArray<T>(in.dims());
-        idx = createEmptyArray<uint>(in.dims());
-        switch(dim) {
-            case 0: sort0_index<T, isAscending>(val, idx, in);
-                    break;
-            default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-        }
+        okey = reorder<T>(okey, reorderDims);
+        oval = reorder<uint>(oval, reorderDims);
     }
+}
 
-#define INSTANTIATE(T)                                                  \
-    template void sort_index<T, true>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
-    template void sort_index<T,false>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
+#define INSTANTIATE(T)                                              \
+    template void sort_index<T>(Array<T> & val, Array<uint> & idx,  \
+                                const Array<T> &in, const uint dim, \
+                                bool isAscending);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    //INSTANTIATE(cfloat)
-    //INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-}
+INSTANTIATE(float)
+INSTANTIATE(double)
+// INSTANTIATE(cfloat)
+// INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sort_index.hpp b/src/backend/cpu/sort_index.hpp
index b3cccec789..b0b50fbf87 100644
--- a/src/backend/cpu/sort_index.hpp
+++ b/src/backend/cpu/sort_index.hpp
@@ -7,11 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<unsigned> &idx, const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void sort_index(Array<T> &okey, Array<unsigned> &oval, const Array<T> &in,
+                const unsigned dim, bool isAscending);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sparse.cpp b/src/backend/cpu/sparse.cpp
new file mode 100644
index 0000000000..3641c96a90
--- /dev/null
+++ b/src/backend/cpu/sparse.cpp
@@ -0,0 +1,165 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sparse.hpp>
+#include <sparse.hpp>
+
+#include <stdexcept>
+#include <string>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/complex.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <reduce.hpp>
+#include <where.hpp>
+
+#include <functional>
+
+using arrayfire::common::cast;
+using std::function;
+
+namespace arrayfire {
+namespace cpu {
+
+using arrayfire::common::createArrayDataSparseArray;
+using arrayfire::common::createEmptySparseArray;
+using arrayfire::common::SparseArray;
+
+template<typename T, af_storage stype>
+SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in) {
+    if (stype == AF_STORAGE_CSR) {
+        uint nNZ = getScalar<uint>(reduce_all<af_notzero_t, T, uint>(in));
+
+        auto sparse = createEmptySparseArray<T>(in.dims(), nNZ, stype);
+        sparse.eval();
+
+        Array<T> values   = sparse.getValues();
+        Array<int> rowIdx = sparse.getRowIdx();
+        Array<int> colIdx = sparse.getColIdx();
+
+        getQueue().enqueue(kernel::dense2csr<T>, values, rowIdx, colIdx, in);
+
+        return sparse;
+    } else if (stype == AF_STORAGE_COO) {
+        auto nonZeroIdx = cast<int, uint>(where<T>(in));
+
+        dim_t nNZ = nonZeroIdx.elements();
+
+        auto cnst = createValueArray<int>(dim4(nNZ), in.dims()[0]);
+
+        auto rowIdx =
+            arithOp<int, af_mod_t>(nonZeroIdx, cnst, nonZeroIdx.dims());
+        auto colIdx =
+            arithOp<int, af_div_t>(nonZeroIdx, cnst, nonZeroIdx.dims());
+
+        Array<T> values = copyArray<T>(in);
+        values.modDims(dim4(values.elements()));
+        values = lookup<T, int>(values, nonZeroIdx, 0);
+
+        return createArrayDataSparseArray<T>(in.dims(), values, rowIdx, colIdx,
+                                             stype);
+    } else {
+        AF_ERROR("CPU Backend only supports Dense to CSR or COO",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+}
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const SparseArray<T> &in) {
+    Array<T> dense = createValueArray<T>(in.dims(), scalar<T>(0));
+
+    Array<T> values   = in.getValues();
+    Array<int> rowIdx = in.getRowIdx();
+    Array<int> colIdx = in.getColIdx();
+
+    if (stype == AF_STORAGE_CSR) {
+        getQueue().enqueue(kernel::csr2dense<T>, dense, values, rowIdx, colIdx);
+    } else if (stype == AF_STORAGE_COO) {
+        getQueue().enqueue(kernel::coo2dense<T>, dense, values, rowIdx, colIdx);
+    } else {
+        AF_ERROR("CPU Backend only supports CSR or COO to Dense",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return dense;
+}
+
+template<typename T, af_storage dest, af_storage src>
+SparseArray<T> sparseConvertStorageToStorage(const SparseArray<T> &in) {
+    in.eval();
+
+    auto converted = createEmptySparseArray<T>(
+        in.dims(), static_cast<int>(in.getNNZ()), dest);
+    converted.eval();
+
+    function<void(Param<T>, Param<int>, Param<int>, CParam<T>, CParam<int>,
+                  CParam<int>)>
+        converter;
+
+    if (src == AF_STORAGE_CSR && dest == AF_STORAGE_COO) {
+        converter = kernel::csr2coo<T>;
+    } else if (src == AF_STORAGE_COO && dest == AF_STORAGE_CSR) {
+        converter = kernel::coo2csr<T>;
+    } else {
+        // Should never come here
+        AF_ERROR("CPU Backend invalid conversion combination",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+    getQueue().enqueue(converter, converted.getValues(), converted.getRowIdx(),
+                       converted.getColIdx(), in.getValues(), in.getRowIdx(),
+                       in.getColIdx());
+    return converted;
+}
+
+#define INSTANTIATE_TO_STORAGE(T, S)                     \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSR>( \
+        const SparseArray<T> &);                         \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSC>( \
+        const SparseArray<T> &);                         \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_COO>( \
+        const SparseArray<T> &);
+
+#define INSTANTIATE_SPARSE(T)                                               \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSR>( \
+        const Array<T> &in);                                                \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSC>( \
+        const Array<T> &in);                                                \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_COO>( \
+        const Array<T> &in);                                                \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSR>(       \
+        const SparseArray<T> &in);                                          \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSC>(       \
+        const SparseArray<T> &in);                                          \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_COO>(       \
+        const SparseArray<T> &in);                                          \
+                                                                            \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSR)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSC)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_COO)
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+#undef INSTANTIATE_TO_STORAGE
+#undef INSTANTIATE_SPARSE
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sparse.hpp b/src/backend/cpu/sparse.hpp
new file mode 100644
index 0000000000..8709fe199d
--- /dev/null
+++ b/src/backend/cpu/sparse.hpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T, af_storage stype>
+common::SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in);
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const common::SparseArray<T> &in);
+
+template<typename T, af_storage dest, af_storage src>
+common::SparseArray<T> sparseConvertStorageToStorage(
+    const common::SparseArray<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sparse_arith.cpp b/src/backend/cpu/sparse_arith.cpp
new file mode 100644
index 0000000000..d6d7e5391e
--- /dev/null
+++ b/src/backend/cpu/sparse_arith.cpp
@@ -0,0 +1,170 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arith.hpp>
+#include <common/SparseArray.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <sparse.hpp>
+#include <sparse_arith.hpp>
+#include <af/dim4.hpp>
+
+#include <kernel/sparse_arith.hpp>
+
+#include <algorithm>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+using arrayfire::common::createArrayDataSparseArray;
+using arrayfire::common::createEmptySparseArray;
+using arrayfire::common::SparseArray;
+using std::numeric_limits;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+T getInf() {
+    return scalar<T>(numeric_limits<T>::infinity());
+}
+
+template<>
+cfloat getInf() {
+    return scalar<cfloat, float>(numeric_limits<float>::infinity(),
+                                 numeric_limits<float>::infinity());
+}
+
+template<>
+cdouble getInf() {
+    return scalar<cdouble, double>(numeric_limits<double>::infinity(),
+                                   numeric_limits<double>::infinity());
+}
+
+template<typename T, af_op_t op>
+Array<T> arithOpD(const SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse) {
+    Array<T> out  = createEmptyArray<T>(dim4(0));
+    Array<T> zero = createValueArray<T>(rhs.dims(), scalar<T>(0));
+    switch (op) {
+        case af_add_t: out = copyArray<T>(rhs); break;
+        case af_sub_t:
+            out = reverse ? copyArray<T>(rhs)
+                          : arithOp<T, af_sub_t>(zero, rhs, rhs.dims());
+            break;
+        default: out = copyArray<T>(rhs);
+    }
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            getQueue().enqueue(kernel::sparseArithOpD<T, op, AF_STORAGE_CSR>,
+                               out, lhs.getValues(), lhs.getRowIdx(),
+                               lhs.getColIdx(), rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            getQueue().enqueue(kernel::sparseArithOpD<T, op, AF_STORAGE_COO>,
+                               out, lhs.getValues(), lhs.getRowIdx(),
+                               lhs.getColIdx(), rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const Array<T> &rhs,
+                       const bool reverse) {
+    SparseArray<T> out = createArrayDataSparseArray<T>(
+        lhs.dims(), lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+        lhs.getStorage(), true);
+    switch (out.getStorage()) {
+        case AF_STORAGE_CSR:
+            getQueue().enqueue(kernel::sparseArithOpS<T, op, AF_STORAGE_CSR>,
+                               out.getValues(), out.getRowIdx(),
+                               out.getColIdx(), rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            getQueue().enqueue(kernel::sparseArithOpS<T, op, AF_STORAGE_COO>,
+                               out.getValues(), out.getRowIdx(),
+                               out.getColIdx(), rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const SparseArray<T> &rhs) {
+    af::storage sfmt = lhs.getStorage();
+
+    const dim4 &dims = lhs.dims();
+    const uint M     = dims[0];
+    const uint N     = dims[1];
+
+    auto rowArr = createEmptyArray<int>(dim4(M + 1));
+
+    getQueue().enqueue(kernel::calcOutNNZ, rowArr, M, N, lhs.getRowIdx(),
+                       lhs.getColIdx(), rhs.getRowIdx(), rhs.getColIdx());
+    getQueue().sync();
+
+    uint nnz = rowArr.get()[M];
+    auto out = createEmptySparseArray<T>(dims, nnz, sfmt);
+
+    copyArray(out.getRowIdx(), rowArr);
+
+    getQueue().enqueue(kernel::sparseArithOp<T, op>, out.getValues(),
+                       out.getColIdx(), out.getRowIdx(), M, lhs.getValues(),
+                       lhs.getRowIdx(), lhs.getColIdx(), rhs.getValues(),
+                       rhs.getRowIdx(), rhs.getColIdx());
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> arithOpD<T, af_add_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_sub_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_mul_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_div_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace cpu
+}  // namespace arrayfire
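Note on the sparse-sparse `arithOp` in sparse_arith.cpp: it first fills a row-pointer array of `M + 1` entries via `kernel::calcOutNNZ`, reads the total from its last entry, and only then allocates the output. That kernel is not part of this hunk, so the following is only a minimal sketch of one plausible way to build such an array, assuming CSR inputs whose column indices are sorted within each row and an output pattern equal to the union of both inputs. The function name `unionRowPointers` and its parameters are hypothetical and do not come from ArrayFire.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not ArrayFire API): builds the CSR row-pointer array
// for the union of two sparsity patterns. Assumes column indices are sorted
// in ascending order within every row of both inputs.
std::vector<int> unionRowPointers(int rows, const std::vector<int> &lRowPtr,
                                  const std::vector<int> &lColIdx,
                                  const std::vector<int> &rRowPtr,
                                  const std::vector<int> &rColIdx) {
    std::vector<int> out(rows + 1, 0);
    for (int i = 0; i < rows; ++i) {
        int l = lRowPtr[i], lEnd = lRowPtr[i + 1];
        int r = rRowPtr[i], rEnd = rRowPtr[i + 1];
        int count = 0;
        // Merge the two sorted column lists, counting each distinct column.
        while (l < lEnd && r < rEnd) {
            if (lColIdx[l] < rColIdx[r]) {
                ++l;
            } else if (lColIdx[l] > rColIdx[r]) {
                ++r;
            } else {
                ++l;
                ++r;
            }
            ++count;
        }
        // Whatever remains on either side appears in only one input.
        count += (lEnd - l) + (rEnd - r);
        // Running sum, so the last entry holds the total number of non-zeros.
        out[i + 1] = out[i] + count;
    }
    return out;
}
```

With this layout, the last element of the returned vector plays the role that `rowArr.get()[M]` plays in the diff: it is the number of entries to allocate for the output values and column indices.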
diff --git a/src/backend/cpu/sparse_arith.hpp b/src/backend/cpu/sparse_arith.hpp
new file mode 100644
index 0000000000..2563802c4d
--- /dev/null
+++ b/src/backend/cpu/sparse_arith.hpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace cpu {
+// C++ cannot overload on return type alone, so the dense-result variant
+// (arithOpD) and the sparse-result variant (arithOp) need separate names.
+template<typename T, af_op_t op>
+Array<T> arithOpD(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const Array<T> &rhs, const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const common::SparseArray<T> &rhs);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/sparse_blas.cpp b/src/backend/cpu/sparse_blas.cpp
new file mode 100644
index 0000000000..d6bd338575
--- /dev/null
+++ b/src/backend/cpu/sparse_blas.cpp
@@ -0,0 +1,466 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sparse_blas.hpp>
+
+#ifdef USE_MKL
+#include <mkl_spblas.h>
+#endif
+
+#include <common/complex.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <cassert>
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cpu {
+
+#ifdef USE_MKL
+using sp_cfloat  = MKL_Complex8;
+using sp_cdouble = MKL_Complex16;
+#else
+using sp_cfloat  = cfloat;
+using sp_cdouble = cdouble;
+
+// From mkl_spblas.h
+typedef enum {
+    SPARSE_OPERATION_NON_TRANSPOSE       = 10,
+    SPARSE_OPERATION_TRANSPOSE           = 11,
+    SPARSE_OPERATION_CONJUGATE_TRANSPOSE = 12,
+} sparse_operation_t;
+#endif
+
+template<typename T, class Enable = void>
+struct blas_base {
+    using type = T;
+};
+
+template<typename T>
+struct blas_base<T,
+                 typename std::enable_if<common::is_complex<T>::value>::type> {
+    using type = typename std::conditional<std::is_same<T, cdouble>::value,
+                                           sp_cdouble, sp_cfloat>::type;
+};
+
+template<typename T>
+using cptr_type = typename std::conditional<common::is_complex<T>::value,
+                                            const typename blas_base<T>::type *,
+                                            const T *>::type;
+template<typename T>
+using ptr_type =
+    typename std::conditional<common::is_complex<T>::value,
+                              typename blas_base<T>::type *, T *>::type;
+template<typename T>
+using scale_type =
+    typename std::conditional<common::is_complex<T>::value,
+                              const typename blas_base<T>::type, const T>::type;
+
+template<typename To, typename Ti>
+auto getScaleValue(Ti val) -> std::remove_cv_t<To> {
+    return static_cast<std::remove_cv_t<To>>(val);
+}
+
+template<typename T, int value>
+scale_type<T> getScale() {  // NOLINT(readability-const-return-type)
+    static T val(value);
+    return getScaleValue<scale_type<T>, T>(val);
+}
+
+sparse_operation_t toSparseTranspose(af_mat_prop opt) {
+    sparse_operation_t out = SPARSE_OPERATION_NON_TRANSPOSE;
+    switch (opt) {
+        case AF_MAT_NONE: out = SPARSE_OPERATION_NON_TRANSPOSE; break;
+        case AF_MAT_TRANS: out = SPARSE_OPERATION_TRANSPOSE; break;
+        case AF_MAT_CTRANS: out = SPARSE_OPERATION_CONJUGATE_TRANSPOSE; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+    return out;
+}
+
+#ifdef USE_MKL
+
+template<>
+sp_cfloat getScaleValue<const sp_cfloat, cfloat>(cfloat val) {
+    sp_cfloat ret;
+    ret.real = val.real();
+    ret.imag = val.imag();
+    return ret;
+}
+
+template<>
+sp_cdouble getScaleValue<const sp_cdouble, cdouble>(cdouble val) {
+    sp_cdouble ret;
+    ret.real = val.real();
+    ret.imag = val.imag();
+    return ret;
+}
+
+// sparse_status_t mkl_sparse_z_create_csr (
+//                 sparse_matrix_t *A,
+//                 sparse_index_base_t indexing,
+//                 MKL_INT rows, MKL_INT cols,
+//                 MKL_INT *rows_start, MKL_INT *rows_end,
+//                 MKL_INT *col_indx,
+//                 MKL_Complex16 *values);
+
+template<typename T>
+using create_csr_func_def = sparse_status_t (*)(sparse_matrix_t *,
+                                                sparse_index_base_t, int, int,
+                                                int *, int *, int *,
+                                                ptr_type<T>);
+
+#define SPARSE_FUNC_DEF(FUNC) \
+    template<typename T>      \
+    FUNC##_func_def<T> FUNC##_func();
+
+SPARSE_FUNC_DEF(create_csr)
+
+#undef SPARSE_FUNC_DEF
+
+#define SPARSE_FUNC(FUNC, TYPE, PREFIX)         \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &mkl_sparse_##PREFIX##_##FUNC;   \
+    }
+
+SPARSE_FUNC(create_csr, float, s)
+SPARSE_FUNC(create_csr, double, d)
+SPARSE_FUNC(create_csr, cfloat, c)
+SPARSE_FUNC(create_csr, cdouble, z)
+
+#undef SPARSE_FUNC
+
+// sparse_status_t mkl_sparse_z_mv (
+//                 sparse_operation_t operation,
+//                 MKL_Complex16 alpha,
+//                 const sparse_matrix_t A,
+//                 struct matrix_descr descr,
+//                 const MKL_Complex16 *x,
+//                 MKL_Complex16 beta,
+//                 MKL_Complex16 *y);
+//
+// sparse_status_t mkl_sparse_z_mm (
+//                 sparse_operation_t operation,
+//                 MKL_Complex16 alpha,
+//                 const sparse_matrix_t A,
+//                 struct matrix_descr descr,
+//                 sparse_layout_t layout,
+//                 const MKL_Complex16 *x,
+//                 MKL_INT columns, MKL_INT ldx,
+//                 MKL_Complex16 beta,
+//                 MKL_Complex16 *y,
+//                 MKL_INT ldy);
+
+template<typename T>
+using mv_func_def = sparse_status_t (*)(const sparse_operation_t, scale_type<T>,
+                                        const sparse_matrix_t, matrix_descr,
+                                        cptr_type<T>, scale_type<T>,
+                                        ptr_type<T>);
+
+template<typename T>
+using mm_func_def = sparse_status_t (*)(const sparse_operation_t, scale_type<T>,
+                                        const sparse_matrix_t, matrix_descr,
+                                        sparse_layout_t, cptr_type<T>, int, int,
+                                        scale_type<T>, ptr_type<T>, int);
+
+#define SPARSE_FUNC_DEF(FUNC) \
+    template<typename T>      \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define SPARSE_FUNC(FUNC, TYPE, PREFIX)         \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &mkl_sparse_##PREFIX##_##FUNC;   \
+    }
+
+SPARSE_FUNC_DEF(mv)
+SPARSE_FUNC(mv, float, s)
+SPARSE_FUNC(mv, double, d)
+SPARSE_FUNC(mv, cfloat, c)
+SPARSE_FUNC(mv, cdouble, z)
+
+SPARSE_FUNC_DEF(mm)
+SPARSE_FUNC(mm, float, s)
+SPARSE_FUNC(mm, double, d)
+SPARSE_FUNC(mm, cfloat, c)
+SPARSE_FUNC(mm, cdouble, z)
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    // MKL CSRMM has no transpose option for rhs, so optRhs is ignored
+    UNUSED(optRhs);
+
+    // Similar Operations to GEMM
+    sparse_operation_t lOpts = toSparseTranspose(optLhs);
+
+    int lRowDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 0 : 1;
+    // int lColDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 1 : 0;
+
+    // Unsupported: (rOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 1 : 0;
+    static const int rColDim = 1;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+
+    int M = lDims[lRowDim];
+    int N = rDims[rColDim];
+    // int K = lDims[lColDim];
+
+    Array<T> out = createValueArray<T>(af::dim4(M, N, 1, 1), scalar<T>(0));
+
+    auto func = [=](Param<T> output, CParam<T> values, CParam<int> rowIdx,
+                    CParam<int> colIdx, const dim_t sdim0, const dim_t sdim1,
+                    CParam<T> right) {
+        auto alpha = getScale<T, 1>();
+        auto beta  = getScale<T, 0>();
+
+        int ldb = right.strides(1);
+        int ldc = output.strides(1);
+
+        int *pB = const_cast<int *>(rowIdx.get());
+        int *pE = pB + 1;
+        T *vptr = const_cast<T *>(values.get());
+
+        sparse_matrix_t csrLhs;
+        create_csr_func<T>()(&csrLhs, SPARSE_INDEX_BASE_ZERO, sdim0, sdim1, pB,
+                             pE, const_cast<int *>(colIdx.get()),
+                             reinterpret_cast<ptr_type<T>>(vptr));
+
+        struct matrix_descr descrLhs {};
+        descrLhs.type = SPARSE_MATRIX_TYPE_GENERAL;
+
+        mkl_sparse_optimize(csrLhs);
+
+        if (rDims[rColDim] == 1) {
+            mkl_sparse_set_mv_hint(csrLhs, lOpts, descrLhs, 1);
+            mv_func<T>()(lOpts, alpha, csrLhs, descrLhs,
+                         reinterpret_cast<cptr_type<T>>(right.get()), beta,
+                         reinterpret_cast<ptr_type<T>>(output.get()));
+        } else {
+            mkl_sparse_set_mm_hint(csrLhs, lOpts, descrLhs,
+                                   SPARSE_LAYOUT_COLUMN_MAJOR, N, 1);
+            mm_func<T>()(
+                lOpts, alpha, csrLhs, descrLhs, SPARSE_LAYOUT_COLUMN_MAJOR,
+                reinterpret_cast<cptr_type<T>>(right.get()), N, ldb, beta,
+                reinterpret_cast<ptr_type<T>>(output.get()), ldc);
+        }
+        mkl_sparse_destroy(csrLhs);
+    };
+
+    const Array<T> values   = lhs.getValues();
+    const Array<int> rowIdx = lhs.getRowIdx();
+    const Array<int> colIdx = lhs.getColIdx();
+    af::dim4 ldims          = lhs.dims();
+
+    getQueue().enqueue(func, out, values, rowIdx, colIdx, ldims[0], ldims[1],
+                       rhs);
+
+    return out;
+}
+
+#else  // #ifdef USE_MKL
+
+template<typename T>
+T getConjugate(const T &in) {
+    // Non-complex types are their own conjugate; return the input unchanged
+    return in;
+}
+
+template<>
+cfloat getConjugate(const cfloat &in) {
+    return std::conj(in);
+}
+
+template<>
+cdouble getConjugate(const cdouble &in) {
+    return std::conj(in);
+}
+
+template<typename T, bool conjugate>
+void mv(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+        CParam<int> colIdx, CParam<T> right, int M) {
+    const T *valPtr   = values.get();
+    const int *rowPtr = rowIdx.get();
+    const int *colPtr = colIdx.get();
+    const T *rightPtr = right.get();
+
+    T *outPtr = output.get();
+
+    // The output array is created zero-filled by the caller,
+    // so there is no need to initialize it here
+    for (int i = 0; i < M; ++i) {
+        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+            // Assumes unit stride[0] for right; else rightPtr[colPtr[j]*stride]
+            if (conjugate) {
+                outPtr[i] += getConjugate(valPtr[j]) * rightPtr[colPtr[j]];
+            } else {
+                outPtr[i] += valPtr[j] * rightPtr[colPtr[j]];
+            }
+        }
+    }
+}
+
+template<typename T, bool conjugate>
+void mtv(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+         CParam<int> colIdx, CParam<T> right, int M) {
+    UNUSED(M);
+
+    const T *valPtr   = values.get();
+    const int *rowPtr = rowIdx.get();
+    const int *colPtr = colIdx.get();
+    const T *rightPtr = right.get();
+    T *outPtr         = output.get();
+
+    // The output array is created zero-filled by the caller,
+    // so there is no need to initialize it here
+    for (int i = 0; i < rowIdx.dims(0) - 1; ++i) {
+        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+            // Assumes unit stride[0] for right; else rightPtr[i*stride]
+            if (conjugate) {
+                outPtr[colPtr[j]] += getConjugate(valPtr[j]) * rightPtr[i];
+            } else {
+                outPtr[colPtr[j]] += valPtr[j] * rightPtr[i];
+            }
+        }
+    }
+}
+
+template<typename T, bool conjugate>
+void mm(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+        CParam<int> colIdx, CParam<T> right, int M, int N, int ldb, int ldc) {
+    UNUSED(M);
+    const T *valPtr   = values.get();
+    const int *rowPtr = rowIdx.get();
+    const int *colPtr = colIdx.get();
+    const T *rightPtr = right.get();
+    T *outPtr         = output.get();
+
+    for (int o = 0; o < N; ++o) {
+        for (int i = 0; i < rowIdx.dims(0) - 1; ++i) {
+            outPtr[i] = scalar<T>(0);
+            for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+                // Assumes unit stride[0] for right; otherwise index with
+                // rightPtr[colPtr[j]*stride]
+                if (conjugate) {
+                    outPtr[i] += getConjugate(valPtr[j]) * rightPtr[colPtr[j]];
+                } else {
+                    outPtr[i] += valPtr[j] * rightPtr[colPtr[j]];
+                }
+            }
+        }
+        rightPtr += ldb;
+        outPtr += ldc;
+    }
+}
+
+template<typename T, bool conjugate>
+void mtm(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+         CParam<int> colIdx, CParam<T> right, int M, int N, int ldb, int ldc) {
+    const T *valPtr   = values.get();
+    const int *rowPtr = rowIdx.get();
+    const int *colPtr = colIdx.get();
+    const T *rightPtr = right.get();
+    T *outPtr         = output.get();
+
+    for (int o = 0; o < N; ++o) {
+        for (int i = 0; i < M; ++i) { outPtr[i] = scalar<T>(0); }
+
+        for (int i = 0; i < rowIdx.dims(0) - 1; ++i) {
+            for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+                // Assumes unit stride[0] for right; else rightPtr[i*stride]
+                if (conjugate) {
+                    outPtr[colPtr[j]] += getConjugate(valPtr[j]) * rightPtr[i];
+                } else {
+                    outPtr[colPtr[j]] += valPtr[j] * rightPtr[i];
+                }
+            }
+        }
+        rightPtr += ldb;
+        outPtr += ldc;
+    }
+}
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    UNUSED(optRhs);
+
+    // Similar Operations to GEMM
+    sparse_operation_t lOpts = toSparseTranspose(optLhs);
+
+    int lRowDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 0 : 1;
+
+    static const int rColDim = 1;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+    int M             = lDims[lRowDim];
+    int N             = rDims[rColDim];
+
+    Array<T> out = createValueArray<T>(af::dim4(M, N, 1, 1), scalar<T>(0));
+
+    auto func = [=](Param<T> output, CParam<T> values, CParam<int> rowIdx,
+                    CParam<int> colIdx, CParam<T> right) {
+        int ldb = right.strides(1);
+        int ldc = output.strides(1);
+
+        if (rDims[rColDim] == 1) {
+            if (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) {
+                mv<T, false>(output, values, rowIdx, colIdx, right, M);
+            } else if (lOpts == SPARSE_OPERATION_TRANSPOSE) {
+                mtv<T, false>(output, values, rowIdx, colIdx, right, M);
+            } else if (lOpts == SPARSE_OPERATION_CONJUGATE_TRANSPOSE) {
+                mtv<T, true>(output, values, rowIdx, colIdx, right, M);
+            }
+        } else {
+            if (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) {
+                mm<T, false>(output, values, rowIdx, colIdx, right, M, N, ldb,
+                             ldc);
+            } else if (lOpts == SPARSE_OPERATION_TRANSPOSE) {
+                mtm<T, false>(output, values, rowIdx, colIdx, right, M, N, ldb,
+                              ldc);
+            } else if (lOpts == SPARSE_OPERATION_CONJUGATE_TRANSPOSE) {
+                mtm<T, true>(output, values, rowIdx, colIdx, right, M, N, ldb,
+                             ldc);
+            }
+        }
+    };
+
+    const Array<T> values   = lhs.getValues();
+    const Array<int> rowIdx = lhs.getRowIdx();
+    const Array<int> colIdx = lhs.getColIdx();
+
+    getQueue().enqueue(func, out, values, rowIdx, colIdx, rhs);
+
+    return out;
+}
+
+#endif  // #ifdef USE_MKL
+
+#define INSTANTIATE_SPARSE(T)                                            \
+    template Array<T> matmul<T>(const common::SparseArray<T> &lhs,       \
+                                const Array<T> &rhs, af_mat_prop optLhs, \
+                                af_mat_prop optRhs);
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+}  // namespace cpu
+}  // namespace arrayfire
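Note on the non-MKL path in sparse_blas.cpp: `mv` walks each CSR row and gathers from the dense vector, while `mtv` walks the same rows but scatters into the output, which yields the transposed product without building a transposed matrix. The sketch below restates that idea outside ArrayFire, assuming plain `std::vector` storage and real-valued data (so no conjugation). The `CsrMatrix` struct and both function names are hypothetical stand-ins, not ArrayFire types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for the CSR arrays held by a sparse matrix.
template<typename T>
struct CsrMatrix {
    int rows;
    int cols;
    std::vector<int> rowPtr;  // rows + 1 entries; row i spans [rowPtr[i], rowPtr[i + 1])
    std::vector<int> colIdx;  // column index of each stored value
    std::vector<T> values;    // stored non-zero values
};

// y = A * x: each output element gathers from x along one CSR row.
template<typename T>
std::vector<T> csrMultiply(const CsrMatrix<T> &a, const std::vector<T> &x) {
    std::vector<T> y(a.rows, T(0));
    for (int i = 0; i < a.rows; ++i) {
        for (int j = a.rowPtr[i]; j < a.rowPtr[i + 1]; ++j) {
            y[i] += a.values[j] * x[a.colIdx[j]];
        }
    }
    return y;
}

// y = A^T * x: same traversal, but each stored value scatters into the
// output slot named by its column index.
template<typename T>
std::vector<T> csrMultiplyTransposed(const CsrMatrix<T> &a,
                                     const std::vector<T> &x) {
    std::vector<T> y(a.cols, T(0));
    for (int i = 0; i < a.rows; ++i) {
        for (int j = a.rowPtr[i]; j < a.rowPtr[i + 1]; ++j) {
            y[a.colIdx[j]] += a.values[j] * x[i];
        }
    }
    return y;
}
```

Both functions zero the output themselves. The kernels in the diff instead rely on the caller having created the output with `createValueArray` filled with zeros.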
diff --git a/src/backend/cpu/sparse_blas.hpp b/src/backend/cpu/sparse_blas.hpp
new file mode 100644
index 0000000000..f59ef83d60
--- /dev/null
+++ b/src/backend/cpu/sparse_blas.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/surface.cpp b/src/backend/cpu/surface.cpp
new file mode 100644
index 0000000000..d86bd6f469
--- /dev/null
+++ b/src/backend/cpu/surface.cpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+#include <err_cpu.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <surface.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface) {
+    ForgeModule &_ = common::forgePlugin();
+    P.eval();
+    getQueue().sync();
+
+    CheckGL("Before CopyArrayToVBO");
+    unsigned bytes = 0, buffer = 0;
+    FG_CHECK(_.fg_get_surface_vertex_buffer(&buffer, surface));
+    FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+    glBindBuffer(GL_ARRAY_BUFFER, buffer);
+    glBufferSubData(GL_ARRAY_BUFFER, 0, bytes, P.get());
+    glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+    CheckGL("In CopyArrayToVBO");
+}
+
+#define INSTANTIATE(T) \
+    template void copy_surface<T>(const Array<T> &, fg_surface);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/surface.hpp b/src/backend/cpu/surface.hpp
new file mode 100644
index 0000000000..1bcf57fac3
--- /dev/null
+++ b/src/backend/cpu/surface.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/susan.cpp b/src/backend/cpu/susan.cpp
new file mode 100644
index 0000000000..c5321deb16
--- /dev/null
+++ b/src/backend/cpu/susan.cpp
@@ -0,0 +1,82 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://Arrayfire.com/licenses/bsd-3-clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <kernel/susan.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <af/features.h>
+#include <cmath>
+#include <memory>
+
+using af::features;
+using std::shared_ptr;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out, Array<float> &resp_out,
+               const Array<T> &in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge) {
+    dim4 idims                = in.dims();
+    const unsigned corner_lim = in.elements() * feature_ratio;
+
+    auto x_corners    = createEmptyArray<float>(dim4(corner_lim));
+    auto y_corners    = createEmptyArray<float>(dim4(corner_lim));
+    auto resp_corners = createEmptyArray<float>(dim4(corner_lim));
+    auto response     = createEmptyArray<T>(dim4(in.elements()));
+    auto corners_found =
+        std::shared_ptr<unsigned>(memAlloc<unsigned>(1).release(), memFree);
+    corners_found.get()[0] = 0;
+
+    getQueue().enqueue(kernel::susan_responses<T>, response, in, idims[0],
+                       idims[1], radius, diff_thr, geom_thr, edge);
+    getQueue().enqueue(kernel::non_maximal<T>, x_corners, y_corners,
+                       resp_corners, corners_found, idims[0], idims[1],
+                       response, edge, corner_lim);
+    getQueue().sync();
+
+    const unsigned corners_out = min((corners_found.get())[0], corner_lim);
+    if (corners_out == 0) {
+        x_out    = createEmptyArray<float>(dim4());
+        y_out    = createEmptyArray<float>(dim4());
+        resp_out = createEmptyArray<float>(dim4());
+        return 0;
+    } else {
+        x_out    = x_corners;
+        y_out    = y_corners;
+        resp_out = resp_corners;
+        x_out.resetDims(dim4(corners_out));
+        y_out.resetDims(dim4(corners_out));
+        resp_out.resetDims(dim4(corners_out));
+        return corners_out;
+    }
+}
+
+#define INSTANTIATE(T)                                                        \
+    template unsigned susan<T>(                                               \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned radius, const float diff_thr,      \
+        const float geom_thr, const float feature_ratio, const unsigned edge);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/susan.hpp b/src/backend/cpu/susan.hpp
new file mode 100644
index 0000000000..af6640e195
--- /dev/null
+++ b/src/backend/cpu/susan.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://Arrayfire.com/licenses/bsd-3-clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out,
+               Array<float> &score_out, const Array<T> &in,
+               const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge);
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/svd.cpp b/src/backend/cpu/svd.cpp
new file mode 100644
index 0000000000..75804d240b
--- /dev/null
+++ b/src/backend/cpu/svd.cpp
@@ -0,0 +1,126 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/err_common.hpp>
+#include <err_cpu.hpp>
+#include <svd.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <lapack_helper.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+#define SVD_FUNC_DEF(FUNC)            \
+    template<typename T, typename Tr> \
+    svd_func_def<T, Tr> svd_func();
+
+#define SVD_FUNC(FUNC, T, Tr, PREFIX)       \
+    template<>                              \
+    svd_func_def<T, Tr> svd_func<T, Tr>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);  \
+    }
+
+#if defined(USE_MKL) || defined(__APPLE__)
+
+template<typename T, typename Tr>
+using svd_func_def = int (*)(ORDER_TYPE, char jobz, int m, int n, T *in,
+                             int ldin, Tr *s, T *u, int ldu, T *vt, int ldvt);
+
+SVD_FUNC_DEF(gesdd)
+SVD_FUNC(gesdd, float, float, s)
+SVD_FUNC(gesdd, double, double, d)
+SVD_FUNC(gesdd, cfloat, float, c)
+SVD_FUNC(gesdd, cdouble, double, z)
+
+#else  // Atlas causes memory freeing issues with using gesdd
+
+template<typename T, typename Tr>
+using svd_func_def = int (*)(ORDER_TYPE, char jobu, char jobvt, int m, int n,
+                             T *in, int ldin, Tr *s, T *u, int ldu, T *vt,
+                             int ldvt, Tr *superb);
+
+SVD_FUNC_DEF(gesvd)
+SVD_FUNC(gesvd, float, float, s)
+SVD_FUNC(gesvd, double, double, d)
+SVD_FUNC(gesvd, cfloat, float, c)
+SVD_FUNC(gesvd, cdouble, double, z)
+
+#endif
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    auto func = [=](Param<Tr> s, Param<T> u, Param<T> vt, Param<T> in) {
+        dim4 iDims = in.dims();
+        int M      = iDims[0];
+        int N      = iDims[1];
+
+#if defined(USE_MKL) || defined(__APPLE__)
+        svd_func<T, Tr>()(AF_LAPACK_COL_MAJOR, 'A', M, N, in.get(),
+                          in.strides(1), s.get(), u.get(), u.strides(1),
+                          vt.get(), vt.strides(1));
+#else
+        std::vector<Tr> superb(std::min(M, N));
+        svd_func<T, Tr>()(AF_LAPACK_COL_MAJOR, 'A', 'A', M, N, in.get(),
+                          in.strides(1), s.get(), u.get(), u.strides(1),
+                          vt.get(), vt.strides(1), &superb[0]);
+#endif
+    };
+    getQueue().enqueue(func, s, u, vt, in);
+}
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    Array<T> in_copy = copyArray<T>(in);
+    svdInPlace(s, u, vt, in_copy);
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on CPU", AF_ERR_NOT_CONFIGURED);
+}
+
+}  // namespace cpu
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace cpu {
+
+#define INSTANTIATE_SVD(T, Tr)                                           \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE_SVD(float, float)
+INSTANTIATE_SVD(double, double)
+INSTANTIATE_SVD(cfloat, float)
+INSTANTIATE_SVD(cdouble, double)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/svd.hpp b/src/backend/cpu/svd.hpp
new file mode 100644
index 0000000000..ba667d2032
--- /dev/null
+++ b/src/backend/cpu/svd.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/tile.cpp b/src/backend/cpu/tile.cpp
index e1ee43f519..884bfed40d 100644
--- a/src/backend/cpu/tile.cpp
+++ b/src/backend/cpu/tile.cpp
@@ -7,67 +7,52 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
+#include <kernel/tile.hpp>
 #include <tile.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-
-namespace cpu
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-        oDims *= tileDims;
-
-        if(iDims.elements() == 0 || oDims.elements() == 0) {
-            throw std::runtime_error("Elements are 0");
-        }
 
-        Array<T> out = createEmptyArray<T>(oDims);
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
 
-        T* outPtr = out.get();
-        const T* inPtr = in.get();
+using arrayfire::common::half;
 
-        const af::dim4 ist = in.strides();
-        const af::dim4 ost = out.strides();
+namespace arrayfire {
+namespace cpu {
 
-        for(dim_t ow = 0; ow < oDims[3]; ow++) {
-            const dim_t iw = ow % iDims[3];
-            const dim_t iW = iw * ist[3];
-            const dim_t oW = ow * ost[3];
-            for(dim_t oz = 0; oz < oDims[2]; oz++) {
-                const dim_t iz = oz % iDims[2];
-                const dim_t iZW = iW + iz * ist[2];
-                const dim_t oZW = oW + oz * ost[2];
-                for(dim_t oy = 0; oy < oDims[1]; oy++) {
-                    const dim_t iy = oy % iDims[1];
-                    const dim_t iYZW = iZW + iy * ist[1];
-                    const dim_t oYZW = oZW + oy * ost[1];
-                    for(dim_t ox = 0; ox < oDims[0]; ox++) {
-                        const dim_t ix = ox % iDims[0];
-                        const dim_t iMem = iYZW + ix;
-                        const dim_t oMem = oYZW + ox;
-                        outPtr[oMem] = inPtr[iMem];
-                    }
-                }
-            }
-        }
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims *= tileDims;
 
-        return out;
+    if (iDims.elements() == 0 || oDims.elements() == 0) {
+        throw std::runtime_error("Elements are 0");
     }
 
-#define INSTANTIATE(T)                                                         \
-    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);  \
+    Array<T> out = createEmptyArray<T>(oDims);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    getQueue().enqueue(kernel::tile<T>, out, in);
 
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/tile.hpp b/src/backend/cpu/tile.hpp
index e03ba7233a..eee387cb87 100644
--- a/src/backend/cpu/tile.hpp
+++ b/src/backend/cpu/tile.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/topk.cpp b/src/backend/cpu/topk.cpp
new file mode 100644
index 0000000000..0103c3586b
--- /dev/null
+++ b/src/backend/cpu/topk.cpp
@@ -0,0 +1,134 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <index.hpp>
+#include <sort.hpp>
+#include <sort_index.hpp>
+
+#include <algorithm>
+#include <cmath>
+#include <numeric>
+#include <vector>
+
+using arrayfire::common::half;
+using std::iota;
+using std::min;
+using std::partial_sort_copy;
+using std::vector;
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void topk(Array<T>& vals, Array<unsigned>& idxs, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order) {
+    // The out_dims is of size k along the dimension of the topk operation
+    // and the same as the input dimension otherwise.
+    dim4 out_dims(1);
+    int ndims = in.dims().ndims();
+    for (int i = 0; i < ndims; i++) {
+        if (i == dim) {
+            out_dims[i] = min(k, static_cast<int>(in.dims()[i]));
+        } else {
+            out_dims[i] = in.dims()[i];
+        }
+    }
+
+    auto values  = createEmptyArray<T>(out_dims);
+    auto indices = createEmptyArray<unsigned>(out_dims);
+
+    auto func = [=](Param<T> values, Param<unsigned> indices, CParam<T> in) {
+        const T* ptr   = in.get();
+        unsigned* iptr = indices.get();
+        T* vptr        = values.get();
+
+        // Create a linear index
+        vector<uint> idx(in.dims().elements());
+        iota(begin(idx), end(idx), 0);
+
+        int iter = in.dims()[1] * in.dims()[2] * in.dims()[3];
+        for (int i = 0; i < iter; i++) {
+            auto idx_itr = begin(idx) + i * in.strides()[1];
+            auto* kiptr  = iptr + k * i;
+
+            if (order & AF_TOPK_MIN) {
+                if (order & AF_TOPK_STABLE) {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) <
+                                           compute_t<T>(ptr[rhs])
+                                       ? true
+                                   : compute_t<T>(ptr[lhs]) ==
+                                           compute_t<T>(ptr[rhs])
+                                       ? (lhs < rhs)
+                                       : false;
+                        });
+                } else {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) <
+                                   compute_t<T>(ptr[rhs]);
+                        });
+                    // Sort the top k values in each column
+                }
+            } else {
+                if (order & AF_TOPK_STABLE) {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) >
+                                           compute_t<T>(ptr[rhs])
+                                       ? true
+                                   : compute_t<T>(ptr[lhs]) ==
+                                           compute_t<T>(ptr[rhs])
+                                       ? (lhs < rhs)
+                                       : false;
+                        });
+                } else {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) >
+                                   compute_t<T>(ptr[rhs]);
+                        });
+                }
+            }
+
+            auto* kvptr = vptr + k * i;
+            for (int j = 0; j < k; j++) {
+                // Update the value arrays with the original values
+                kvptr[j] = ptr[kiptr[j]];
+                // Convert linear indices back to column indices
+                kiptr[j] -= i * in.strides()[1];
+            }
+        }
+    };
+
+    getQueue().enqueue(func, values, indices, in);
+
+    vals = values;
+    idxs = indices;
+}
+
+#define INSTANTIATE(T)                                                  \
+    template void topk<T>(Array<T>&, Array<unsigned>&, const Array<T>&, \
+                          const int, const int, const af::topkFunction);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(half)
+}  // namespace cpu
+}  // namespace arrayfire
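
Note on the topk.cpp hunk above: the kernel sorts indices rather than values, using std::partial_sort_copy with a comparator that looks up the data behind each index, and breaks ties by the lower index when AF_TOPK_STABLE is set. The following is a minimal standalone sketch of that idea for a single column. It is not ArrayFire code: topk_indices and its parameters are hypothetical names chosen for illustration, it works on std::vector<float> instead of Array<T>, it always applies the tie-break, and it assumes the data contains no NaN values.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of ArrayFire): returns the indices of the
// k smallest (smallest == true) or k largest (smallest == false) elements
// of `data`, ordered by value. Equal values are ordered by ascending index,
// which corresponds to the "stable" branch of the kernel in the diff.
std::vector<unsigned> topk_indices(const std::vector<float>& data, int k,
                                   bool smallest) {
    assert(k >= 0);

    // Linear index 0..n-1, mirroring the iota() call in the kernel.
    std::vector<unsigned> idx(data.size());
    std::iota(idx.begin(), idx.end(), 0u);

    // The output never holds more entries than the input has elements.
    const std::size_t count =
        std::min(static_cast<std::size_t>(k), data.size());
    std::vector<unsigned> out(count);

    // Compare the values behind two indices; fall back to the index on ties.
    auto cmp = [&data, smallest](unsigned lhs, unsigned rhs) {
        if (data[lhs] != data[rhs]) {
            return smallest ? data[lhs] < data[rhs] : data[lhs] > data[rhs];
        }
        return lhs < rhs;
    };

    // Copies the `count` best indices into `out`, sorted according to cmp.
    std::partial_sort_copy(idx.begin(), idx.end(), out.begin(), out.end(),
                           cmp);
    return out;
}
```

Calling topk_indices(v, 2, true) on v = {3, 1, 4, 1, 5} yields the indices {1, 3}; the values can then be gathered with data[out[j]], which is what the kernel's final loop does before converting linear indices back to per-column indices.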
diff --git a/src/backend/cpu/topk.hpp b/src/backend/cpu/topk.hpp
new file mode 100644
index 0000000000..0383e13fcf
--- /dev/null
+++ b/src/backend/cpu/topk.hpp
@@ -0,0 +1,16 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void topk(Array<T>& vals, Array<unsigned>& idxs, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/transform.cpp b/src/backend/cpu/transform.cpp
index ed9f0ad39d..0fbe10ea5c 100644
--- a/src/backend/cpu/transform.cpp
+++ b/src/backend/cpu/transform.cpp
@@ -8,130 +8,66 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <transform.hpp>
+#include <copy.hpp>
+#include <kernel/transform.hpp>
 #include <math.hpp>
-#include <stdexcept>
-#include <err_cpu.hpp>
-#include "transform_interp.hpp"
-
-namespace cpu
-{
-    template <typename T>
-    void calc_affine_inverse(T *txo, const T *txi)
-    {
-        T det = txi[0]*txi[4] - txi[1]*txi[3];
-
-        txo[0] = txi[4] / det;
-        txo[1] = txi[3] / det;
-        txo[3] = txi[1] / det;
-        txo[4] = txi[0] / det;
-
-        txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
-        txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
-    }
-
-    template <typename T>
-    void calc_affine_inverse(T *tmat, const T *tmat_ptr, const bool inverse)
-    {
-        // The way kernel is structured, it expects an inverse
-        // transform matrix by default.
-        // If it is an forward transform, then we need its inverse
-        if(inverse) {
-            for(int i = 0; i < 6; i++)
-                tmat[i] = tmat_ptr[i];
-        } else {
-            calc_affine_inverse(tmat, tmat_ptr);
-        }
-    }
-
-    template<typename T, af_interp_type method>
-    void transform_(T *out, const T *in, const float *tf,
-                    const af::dim4 &odims, const af::dim4 &idims,
-                    const af::dim4 &ostrides, const af::dim4 &istrides,
-                    const af::dim4 &tstrides, const bool inverse)
-    {
-        dim_t nimages     = idims[2];
-        // Multiplied in src/backend/transform.cpp
-        dim_t ntransforms = odims[2] / idims[2];
-
-        void (*t_fn)(T *, const T *, const float *, const af::dim4 &,
-                     const af::dim4 &, const af::dim4 &,
-                     const dim_t, const dim_t, const dim_t, const dim_t);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                t_fn = &transform_n;
-                break;
-            case AF_INTERP_BILINEAR:
-                t_fn = &transform_b;
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                break;
-        }
-
-
-        // For each transform channel
-        for(int t_idx = 0; t_idx < (int)ntransforms; t_idx++) {
-            // Compute inverse if required
-            const float *tmat_ptr = tf + t_idx * 6;
-            float tmat[6];
-            calc_affine_inverse(tmat, tmat_ptr, inverse);
-
-            // Offset for output pointer
-            dim_t o_offset = t_idx * nimages * ostrides[2];
-
-            // Do transform for image
-            for(int yy = 0; yy < (int)odims[1]; yy++) {
-                for(int xx = 0; xx < (int)odims[0]; xx++) {
-                    t_fn(out, in, tmat, idims, ostrides, istrides, nimages, o_offset, xx, yy);
-                }
-            }
-        }
-    }
-
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &transform, const af::dim4 &odims,
-                        const af_interp_type method, const bool inverse)
-    {
-        const af::dim4 idims = in.dims();
-
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                transform_<T, AF_INTERP_NEAREST>
-                          (out.get(), in.get(), transform.get(), odims, idims,
-                           out.strides(), in.strides(), transform.strides(), inverse);
-                break;
-            case AF_INTERP_BILINEAR:
-                transform_<T, AF_INTERP_BILINEAR>
-                          (out.get(), in.get(), transform.get(), odims, idims,
-                           out.strides(), in.strides(), transform.strides(), inverse);
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                break;
-        }
+#include <platform.hpp>
+#include <transform.hpp>
 
-        return out;
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective) {
+    out.eval();
+    in.eval();
+
+    // TODO: Temporary Fix, must fix handling subarrays upstream
+    // tf has to be linear, although offset is allowed
+    const Array<float> tf_Lin = tf.isLinear() ? tf : copyArray(tf);
+    tf.eval();
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            getQueue().enqueue(kernel::transform<T, 1>, out, in, tf_Lin,
+                               inverse, perspective, method);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            getQueue().enqueue(kernel::transform<T, 2>, out, in, tf_Lin,
+                               inverse, perspective, method);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            getQueue().enqueue(kernel::transform<T, 3>, out, in, tf_Lin,
+                               inverse, perspective, method);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG); break;
     }
-
-
-#define INSTANTIATE(T)                                                                          \
-    template Array<T> transform(const Array<T> &in, const Array<float> &transform,             \
-                                const af::dim4 &odims, const af_interp_type method, \
-                                const bool inverse);                    \
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
 }
+
+#define INSTANTIATE(T)                                                       \
+    template void transform(Array<T> &out, const Array<T> &in,               \
+                            const Array<float> &tf,                          \
+                            const af_interp_type method, const bool inverse, \
+                            const bool perspective);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/transform.hpp b/src/backend/cpu/transform.hpp
index f9e730b1d4..1df2b38934 100644
--- a/src/backend/cpu/transform.hpp
+++ b/src/backend/cpu/transform.hpp
@@ -7,12 +7,13 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &tf, const af::dim4 &odims,
-                        const af_interp_type method, const bool inverse);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/transform_interp.hpp b/src/backend/cpu/transform_interp.hpp
deleted file mode 100644
index 8ae8eb9e14..0000000000
--- a/src/backend/cpu/transform_interp.hpp
+++ /dev/null
@@ -1,124 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <types.hpp>
-#include <af/traits.hpp>
-
-namespace cpu
-{
-    using std::conditional;
-    using std::is_same;
-
-    template<typename T>
-    using wtype_t = typename conditional<is_same<T, double>::value, double, float>::type;
-
-    template<typename T>
-    using vtype_t = typename conditional<is_complex<T>::value,
-                                         T, wtype_t<T>
-                                        >::type;
-
-    template<typename T>
-    void transform_n(T *out, const T *in, const float *tmat, const af::dim4 &idims,
-                      const af::dim4 &ostrides, const af::dim4 &istrides,
-                      const dim_t nimages, const dim_t o_offset,
-                      const dim_t xx, const dim_t yy)
-    {
-        // Compute output index
-        const dim_t xi = round(xx * tmat[0]
-                             + yy * tmat[1]
-                                  + tmat[2]);
-        const dim_t yi = round(xx * tmat[3]
-                             + yy * tmat[4]
-                                  + tmat[5]);
-
-        // Compute memory location of indices
-        dim_t loci = (yi * istrides[1] + xi);
-        dim_t loco = (yy * ostrides[1] + xx);
-
-        T val = scalar<T>(0.0f);
-        // Copy to output
-        for(int batch = 0; batch < (int)idims[3]; batch++) {
-            dim_t i__ = batch * istrides[3];
-            dim_t o__ = batch * ostrides[3];
-            for(int i_idx = 0; i_idx < (int)nimages; i_idx++) {
-                dim_t i_off = i_idx * istrides[2] + i__;
-                dim_t o_off = o_offset + i_idx * ostrides[2] + o__;
-
-                if (xi < idims[0] && yi < idims[1] && xi >= 0 && yi >= 0)
-                    val = in[i_off + loci];
-
-                out[o_off + loco] = val;
-            }
-        }
-    }
-
-    template<typename T>
-    void transform_b(T *out, const T *in, const float *tmat, const af::dim4 &idims,
-                      const af::dim4 &ostrides, const af::dim4 &istrides,
-                      const dim_t nimages, const dim_t o_offset,
-                      const dim_t xx, const dim_t yy)
-    {
-        dim_t loco = (yy * ostrides[1] + xx);
-        // Compute input index
-        const float xi = xx * tmat[0]
-                       + yy * tmat[1]
-                            + tmat[2];
-        const float yi = xx * tmat[3]
-                       + yy * tmat[4]
-                            + tmat[5];
-
-        if (xi < -0.0001 || yi < -0.0001 || idims[0] < xi || idims[1] < yi) {
-            for(int i_idx = 0; i_idx < (int)nimages; i_idx++) {
-                const dim_t o_off = o_offset + i_idx * ostrides[2] + loco;
-                out[o_off] = scalar<T>(0.0f);
-            }
-            return;
-        }
-
-        typedef typename dtype_traits<T>::base_type BT;
-        typedef wtype_t<BT> WT;
-        typedef vtype_t<T> VT;
-
-        const WT grd_x = floor(xi),  grd_y = floor(yi);
-        const WT off_x = xi - grd_x, off_y = yi - grd_y;
-
-        dim_t loci = grd_y * istrides[1] + grd_x;
-
-        // Check if pVal and pVal + 1 are both valid indices
-        bool condY = (yi < idims[1] - 1);
-        bool condX = (xi < idims[0] - 1);
-
-        const T zero = scalar<T>(0.0f);
-
-        // Compute weights used
-        const WT wt00 = (1.0 - off_x) * (1.0 - off_y);
-        const WT wt10 = (condY) ? (1.0 - off_x) * (off_y)     : 0;
-        const WT wt01 = (condX) ? (off_x) * (1.0 - off_y)     : 0;
-        const WT wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-        const WT wt = wt00 + wt10 + wt01 + wt11;
-
-        for(int batch = 0; batch < (int)idims[3]; batch++) {
-            dim_t i__ = batch * istrides[3];
-            dim_t o__ = batch * ostrides[3];
-            for(int i_idx = 0; i_idx < (int)nimages; i_idx++) {
-                const dim_t i_off = i_idx * istrides[2] + loci + i__;
-                const dim_t o_off = o_offset + i_idx * ostrides[2] + loco + o__;
-                // Compute Weighted Values
-                VT v00 =                    in[i_off] * wt00;
-                VT v10 = (condY) ?          in[i_off + istrides[1]] * wt10     : zero;
-                VT v01 = (condX) ?          in[i_off + 1] * wt01               : zero;
-                VT v11 = (condX && condY) ? in[i_off + istrides[1] + 1] * wt11 : zero;
-                VT vo = v00 + v10 + v01 + v11;
-
-                out[o_off] = vo / wt;
-            }
-        }
-    }
-}
diff --git a/src/backend/cpu/transpose.cpp b/src/backend/cpu/transpose.cpp
index f820f9ea5d..a9f6f9d3d5 100644
--- a/src/backend/cpu/transpose.cpp
+++ b/src/backend/cpu/transpose.cpp
@@ -6,159 +6,58 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/transpose.hpp>
+#include <transpose.hpp>
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <transpose.hpp>
+#include <common/half.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
 
-#include <utility>
 #include <cassert>
+#include <utility>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace cpu
-{
-
-static inline unsigned getIdx(const dim4 &strides,
-        int i, int j = 0, int k = 0, int l = 0)
-{
-    return (l * strides[3] +
-            k * strides[2] +
-            j * strides[1] +
-            i );
-}
-
-template<typename T>
-T getConjugate(const T &in)
-{
-    // For non-complex types return same
-    return in;
-}
-
-template<>
-cfloat getConjugate(const cfloat &in)
-{
-    return std::conj(in);
-}
-
-template<>
-cdouble getConjugate(const cdouble &in)
-{
-    return std::conj(in);
-}
-
-template<typename T, bool conjugate>
-void transpose_(T *out, const T *in, const af::dim4 &odims, const af::dim4 &idims,
-                const af::dim4 &ostrides, const af::dim4 &istrides)
-{
-    for (dim_t l = 0; l < odims[3]; ++l) {
-        for (dim_t k = 0; k < odims[2]; ++k) {
-            // Outermost loop handles batch mode
-            // if input has no data along third dimension
-            // this loop runs only once
-            for (dim_t j = 0; j < odims[1]; ++j) {
-                for (dim_t i = 0; i < odims[0]; ++i) {
-                    // calculate array indices based on offsets and strides
-                    // the helper getIdx takes care of indices
-                    const dim_t inIdx  = getIdx(istrides,j,i,k,l);
-                    const dim_t outIdx = getIdx(ostrides,i,j,k,l);
-                    if(conjugate)
-                        out[outIdx] = getConjugate(in[inIdx]);
-                    else
-                        out[outIdx] = in[inIdx];
-                }
-            }
-            // outData and inData pointers doesn't need to be
-            // offset as the getIdx function is taking care
-            // of the batch parameter
-        }
-    }
-}
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T> transpose(const Array<T> &in, const bool conjugate)
-{
-    const dim4 inDims = in.dims();
-
-    dim4 outDims   = dim4(inDims[1],inDims[0],inDims[2],inDims[3]);
-
+Array<T> transpose(const Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+    const dim4 outDims = dim4(inDims[1], inDims[0], inDims[2], inDims[3]);
     // create an array with first two dimensions swapped
-    Array<T> out  = createEmptyArray<T>(outDims);
-
-    // get data pointers for input and output Arrays
-    T* outData          = out.get();
-    const T*   inData   = in.get();
+    Array<T> out = createEmptyArray<T>(outDims);
 
-    if(conjugate) {
-        transpose_<T, true>(outData, inData,
-                            out.dims(), in.dims(), out.strides(), in.strides());
-    } else {
-        transpose_<T, false>(outData, inData,
-                             out.dims(), in.dims(), out.strides(), in.strides());
-    }
+    getQueue().enqueue(kernel::transpose<T>, out, in, conjugate);
 
     return out;
 }
 
-template<typename T, bool conjugate>
-void transpose_inplace(T *in, const af::dim4 &idims, const af::dim4 &istrides)
-{
-    for (dim_t l = 0; l < idims[3]; ++l) {
-        for (dim_t k = 0; k < idims[2]; ++k) {
-            // Outermost loop handles batch mode
-            // if input has no data along third dimension
-            // this loop runs only once
-            //
-            // Run only bottom triangle. std::swap swaps with upper triangle
-            for (dim_t j = 0; j < idims[1]; ++j) {
-                for (dim_t i = j + 1; i < idims[0]; ++i) {
-                    // calculate array indices based on offsets and strides
-                    // the helper getIdx takes care of indices
-                    const dim_t iIdx  = getIdx(istrides,j,i,k,l);
-                    const dim_t oIdx = getIdx(istrides,i,j,k,l);
-                    if(conjugate) {
-                        in[iIdx] = getConjugate(in[iIdx]);
-                        in[oIdx] = getConjugate(in[oIdx]);
-                        std::swap(in[iIdx], in[oIdx]);
-                    }
-                    else {
-                        std::swap(in[iIdx], in[oIdx]);
-                    }
-                }
-            }
-        }
-    }
-}
-
 template<typename T>
-void transpose_inplace(Array<T> &in, const bool conjugate)
-{
-    // get data pointers for input and output Arrays
-    T* inData = in.get();
-
-    if(conjugate) {
-        transpose_inplace<T, true >(inData, in.dims(), in.strides());
-    } else {
-        transpose_inplace<T, false>(inData, in.dims(), in.strides());
-    }
+void transpose_inplace(Array<T> &in, const bool conjugate) {
+    getQueue().enqueue(kernel::transpose_inplace<T>, in, conjugate);
 }
 
-#define INSTANTIATE(T)                                                      \
-    template Array<T> transpose(const Array<T> &in, const bool conjugate);  \
+#define INSTANTIATE(T)                                                     \
+    template Array<T> transpose(const Array<T> &in, const bool conjugate); \
     template void transpose_inplace(Array<T> &in, const bool conjugate);
 
-INSTANTIATE(float  )
-INSTANTIATE(cfloat )
-INSTANTIATE(double )
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
 INSTANTIATE(cdouble)
-INSTANTIATE(char   )
-INSTANTIATE(int    )
-INSTANTIATE(uint   )
-INSTANTIATE(uchar  )
-INSTANTIATE(intl   )
-INSTANTIATE(uintl  )
-
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/transpose.hpp b/src/backend/cpu/transpose.hpp
index 2a8f72d96e..565f89cc6c 100644
--- a/src/backend/cpu/transpose.hpp
+++ b/src/backend/cpu/transpose.hpp
@@ -9,13 +9,14 @@
 
 #include <Array.hpp>
 
-namespace cpu
-{
+namespace arrayfire {
+namespace cpu {
 
 template<typename T>
-Array<T>  transpose(const Array<T> &in, const bool conjugate);
+Array<T> transpose(const Array<T> &in, const bool conjugate);
 
 template<typename T>
 void transpose_inplace(Array<T> &in, const bool conjugate);
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/triangle.cpp b/src/backend/cpu/triangle.cpp
index 82c4fd1edc..6c276ca4bd 100644
--- a/src/backend/cpu/triangle.cpp
+++ b/src/backend/cpu/triangle.cpp
@@ -6,84 +6,63 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <Array.hpp>
 #include <triangle.hpp>
-#include <math.hpp>
-
-namespace cpu
-{
-
-template<typename T, bool is_upper, bool is_unit_diag>
-void triangle(Array<T> &out, const Array<T> &in)
-{
-    T *o = out.get();
-    const T *i = in.get();
 
-    dim4 odm = out.dims();
-
-    dim4 ost = out.strides();
-    dim4 ist = in.strides();
-
-    for(dim_t ow = 0; ow < odm[3]; ow++) {
-        const dim_t oW = ow * ost[3];
-        const dim_t iW = ow * ist[3];
+#include <common/half.hpp>
+#include <kernel/triangle.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
 
-        for(dim_t oz = 0; oz < odm[2]; oz++) {
-            const dim_t oZW = oW + oz * ost[2];
-            const dim_t iZW = iW + oz * ist[2];
+#include <functional>
 
-            for(dim_t oy = 0; oy < odm[1]; oy++) {
-                const dim_t oYZW = oZW + oy * ost[1];
-                const dim_t iYZW = iZW + oy * ist[1];
+using arrayfire::common::half;
 
-                for(dim_t ox = 0; ox < odm[0]; ox++) {
-                    const dim_t oMem = oYZW + ox;
-                    const dim_t iMem = iYZW + ox;
+namespace arrayfire {
+namespace cpu {
 
-                    bool cond = is_upper ? (oy >= ox) : (oy <= ox);
-                    bool do_unit_diag = (is_unit_diag && ox == oy);
-                    if(cond) {
-                        o[oMem] = do_unit_diag ? scalar<T>(1) : i[iMem];
-                    } else {
-                        o[oMem] = scalar<T>(0);
-                    }
+template<typename T>
+using triangleFunc = std::function<void(Param<T>, CParam<T>)>;
 
-                }
-            }
-        }
-    }
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag) {
+    static const triangleFunc<T> funcs[4] = {
+        kernel::triangle<T, false, false>,
+        kernel::triangle<T, false, true>,
+        kernel::triangle<T, true, false>,
+        kernel::triangle<T, true, true>,
+    };
+    const int funcIdx = is_upper * 2 + is_unit_diag;
+    getQueue().enqueue(funcs[funcIdx], out, in);
 }
 
-template<typename T, bool is_upper, bool is_unit_diag>
-Array<T> triangle(const Array<T> &in)
-{
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag) {
     Array<T> out = createEmptyArray<T>(in.dims());
-    triangle<T, is_upper, is_unit_diag>(out, in);
+    triangle<T>(out, in, is_upper, is_unit_diag);
     return out;
 }
 
 #define INSTANTIATE(T)                                                  \
-    template void triangle<T, true ,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, true , false>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false, false>(Array<T> &out, const Array<T> &in); \
-    template Array<T> triangle<T, true ,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, false,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, true , false>(const Array<T> &in);    \
-    template Array<T> triangle<T, false, false>(const Array<T> &in);    \
+    template void triangle<T>(Array<T> &, const Array<T> &, const bool, \
+                              const bool);                              \
+    template Array<T> triangle<T>(const Array<T> &, const bool, const bool);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/triangle.hpp b/src/backend/cpu/triangle.hpp
index 6ae0df2e9f..01e55f7c0b 100644
--- a/src/backend/cpu/triangle.hpp
+++ b/src/backend/cpu/triangle.hpp
@@ -7,14 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T, bool is_upper, bool is_unit_diag>
-    void triangle(Array<T> &out, const Array<T> &in);
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag);
 
-    template<typename T, bool is_upper, bool is_unit_diag>
-    Array<T> triangle(const Array<T> &in);
-}
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/types.hpp b/src/backend/cpu/types.hpp
index a281b6b6a4..f1f58e7006 100644
--- a/src/backend/cpu/types.hpp
+++ b/src/backend/cpu/types.hpp
@@ -8,17 +8,57 @@
  ********************************************************/
 
 #pragma once
+#include <common/kernel_type.hpp>
 #include <complex>
 
-namespace cpu
-{
-    typedef std::complex<float>     cfloat;
-    typedef std::complex<double>    cdouble;
-    typedef unsigned int            uint;
-    typedef unsigned char           uchar;
+namespace arrayfire {
+namespace cpu {
 
-    template<typename T> struct is_complex          { static const bool value = false;  };
-    template<> struct           is_complex<cfloat>  { static const bool value = true;   };
-    template<> struct           is_complex<cdouble> { static const bool value = true;   };
+namespace {
+template<typename T>
+const char *shortname(bool caps = false) {
+    return caps ? "?" : "?";
+}
 
+template<typename T>
+const char *getFullName() {
+    return "N/A";
 }
+
+}  // namespace
+
+using cdouble = std::complex<double>;
+using cfloat  = std::complex<float>;
+using intl    = long long;
+using uint    = unsigned int;
+using schar   = signed char;
+using uchar   = unsigned char;
+using uintl   = unsigned long long;
+using ushort  = unsigned short;
+
+template<typename T>
+using compute_t = typename common::kernel_type<T>::compute;
+
+template<typename T>
+using data_t = typename common::kernel_type<T>::data;
+
+}  // namespace cpu
+
+namespace common {
+template<typename T>
+struct kernel_type;
+
+class half;
+
+template<>
+struct kernel_type<arrayfire::common::half> {
+    using data = arrayfire::common::half;
+
+    // These are the types within a kernel
+    using native = float;
+
+    using compute = float;
+};
+}  // namespace common
+
+}  // namespace arrayfire
diff --git a/src/backend/cpu/unary.hpp b/src/backend/cpu/unary.hpp
index 970b1515f6..620ed26e8c 100644
--- a/src/backend/cpu/unary.hpp
+++ b/src/backend/cpu/unary.hpp
@@ -7,101 +7,118 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
 #include <Array.hpp>
-#include <optypes.hpp>
 #include <err_cpu.hpp>
-#include <TNJ/UnaryNode.hpp>
+#include <jit/UnaryNode.hpp>
+#include <optypes.hpp>
 #include <cmath>
 
-namespace cpu
-{
-#define sign(in) std::signbit(in)
-
-#define UNARY_FN(op)                            \
-    template<typename T>                        \
-    struct UnOp<T, T, af_##op##_t>              \
-    {                                           \
-        T eval(T in)                            \
-        {                                       \
-            return op(in);                      \
-        }                                       \
-    };                                          \
-
-UNARY_FN(sin)
-UNARY_FN(cos)
-UNARY_FN(tan)
-
-UNARY_FN(asin)
-UNARY_FN(acos)
-UNARY_FN(atan)
-
-UNARY_FN(sinh)
-UNARY_FN(cosh)
-UNARY_FN(tanh)
-
-UNARY_FN(asinh)
-UNARY_FN(acosh)
-UNARY_FN(atanh)
-
-UNARY_FN(round)
-UNARY_FN(trunc)
-UNARY_FN(sign )
-UNARY_FN(floor)
-UNARY_FN(ceil)
-
-UNARY_FN(exp)
-UNARY_FN(expm1)
-UNARY_FN(erf)
-UNARY_FN(erfc)
-
-UNARY_FN(log)
-UNARY_FN(log10)
-UNARY_FN(log1p)
-UNARY_FN(log2)
-
-UNARY_FN(sqrt)
-UNARY_FN(cbrt)
-
-UNARY_FN(tgamma)
-UNARY_FN(lgamma)
-
-#undef UNARY_FN
-#undef sign
-
-    template<typename T, af_op_t op>
-    Array<T> unaryOp(const Array<T> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<T, T, op> *node = new TNJ::UnaryNode<T, T, op>(in_node);
-
-        return createNodeArray<T>(in.dims(),
-                                  TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+namespace arrayfire {
+namespace cpu {
 
-#define iszero(a) ((a) == 0)
+template<typename T>
+T sigmoid(T in) {
+    return (1.0) / (1 + std::exp(-in));
+}
+
+template<typename T>
+T rsqrt(T in) {
+    return pow(in, -0.5);
+}
+
+#define UNARY_OP_FN(op, fn)                                       \
+    template<typename T>                                          \
+    struct UnOp<T, T, af_##op##_t> {                              \
+        void eval(jit::array<compute_t<T>> &out,                  \
+                  const jit::array<compute_t<T>> &in, int lim) {  \
+            for (int i = 0; i < lim; i++) { out[i] = fn(in[i]); } \
+        }                                                         \
+    };
+
+#define UNARY_OP(op) UNARY_OP_FN(op, std::op)
+
+UNARY_OP(sin)
+UNARY_OP(cos)
+UNARY_OP(tan)
+
+UNARY_OP(asin)
+UNARY_OP(acos)
+UNARY_OP(atan)
+
+UNARY_OP(sinh)
+UNARY_OP(cosh)
+UNARY_OP(tanh)
+
+UNARY_OP(asinh)
+UNARY_OP(acosh)
+UNARY_OP(atanh)
+
+UNARY_OP(round)
+UNARY_OP(trunc)
+UNARY_OP(signbit)
+UNARY_OP(floor)
+UNARY_OP(ceil)
 
-#define CHECK_FN(name ,op)                      \
-    template<typename T>                        \
-    struct UnOp<char, T, af_##name##_t>         \
-    {                                           \
-        char eval(T in)                         \
-        {                                       \
-            return op(in);                      \
-        }                                       \
-    };                                          \
-
-    CHECK_FN(isinf, std::isinf)
-    CHECK_FN(isnan, std::isnan)
-    CHECK_FN(iszero, iszero)
-
-    template<typename T, af_op_t op>
-    Array<char> checkOp(const Array<T> &in)
-    {
-        TNJ::Node_ptr in_node = in.getNode();
-        TNJ::UnaryNode<char, T, op> *node = new TNJ::UnaryNode<char, T, op>(in_node);
-
-        return createNodeArray<char>(in.dims(),
-                                     TNJ::Node_ptr(reinterpret_cast<TNJ::Node *>(node)));
-    }
+UNARY_OP(exp)
+UNARY_OP_FN(sigmoid, sigmoid)
+UNARY_OP(expm1)
+UNARY_OP(erf)
+UNARY_OP(erfc)
 
+UNARY_OP(log)
+UNARY_OP(log10)
+UNARY_OP(log1p)
+UNARY_OP(log2)
+
+UNARY_OP(sqrt)
+UNARY_OP_FN(rsqrt, rsqrt)
+UNARY_OP(cbrt)
+
+UNARY_OP(tgamma)
+UNARY_OP(lgamma)
+UNARY_OP_FN(noop, )  /// Empty second parameter so it does nothing
+
+UNARY_OP_FN(bitnot, ~)
+
+#undef UNARY_OP
+#undef UNARY_OP_FN
+
+template<typename T, af_op_t op>
+Array<T> unaryOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using UnaryNode = jit::UnaryNode<T, T, op>;
+
+    common::Node_ptr in_node = in.getNode();
+    auto node                = std::make_shared<UnaryNode>(in_node);
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    return createNodeArray<T>(outDim, move(node));
+}
+
+#define iszero(a) ((a) == 0)
+
+#define CHECK_FN(name, op)                                                   \
+    template<typename T>                                                     \
+    struct UnOp<char, T, af_##name##_t> {                                    \
+        void eval(jit::array<char> &out, const jit::array<compute_t<T>> &in, \
+                  int lim) {                                                 \
+            for (int i = 0; i < lim; i++) { out[i] = op(in[i]); }            \
+        }                                                                    \
+    };
+
+CHECK_FN(isinf, std::isinf)
+CHECK_FN(isnan, std::isnan)
+CHECK_FN(iszero, iszero)
+#undef iszero
+
+template<typename T, af_op_t op>
+Array<char> checkOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    common::Node_ptr in_node = in.getNode();
+    auto node = std::make_shared<jit::UnaryNode<char, T, op>>(in_node);
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    return createNodeArray<char>(outDim, move(node));
 }
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/unwrap.cpp b/src/backend/cpu/unwrap.cpp
new file mode 100644
index 0000000000..dca2433ff8
--- /dev/null
+++ b/src/backend/cpu/unwrap.cpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <kernel/unwrap.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <unwrap.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+
+    dim_t nx = 1 + (idims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (idims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    af::dim4 odims(wx * wy, nx * ny, idims[2], idims[3]);
+
+    if (!is_column) { std::swap(odims[0], odims[1]); }
+
+    Array<T> outArray = createEmptyArray<T>(odims);
+
+    const int d = (is_column) ? 1 : 0;
+    getQueue().enqueue(kernel::unwrap_dim<T>, outArray, in, wx, wy, sx, sy, px,
+                       py, dx, dy, d);
+
+    return outArray;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> unwrap<T>(                                            \
+        const Array<T> &in, const dim_t wx, const dim_t wy, const dim_t sx, \
+        const dim_t sy, const dim_t px, const dim_t py, const dim_t dx,     \
+        const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/unwrap.hpp b/src/backend/cpu/unwrap.hpp
new file mode 100644
index 0000000000..fcfad88f6f
--- /dev/null
+++ b/src/backend/cpu/unwrap.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/utility.hpp b/src/backend/cpu/utility.hpp
new file mode 100644
index 0000000000..9cd3de96f0
--- /dev/null
+++ b/src/backend/cpu/utility.hpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <af/constants.h>
+#include <algorithm>
+#include <cmath>
+#include "backend.hpp"
+
+namespace arrayfire {
+namespace cpu {
+static inline dim_t trimIndex(int const& idx, dim_t const& len) {
+    int ret_val = idx;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= (int)len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
+    }
+    return ret_val;
+}
+
+static inline unsigned getIdx(af::dim4 const& strides, int i, int j = 0,
+                              int k = 0, int l = 0) {
+    return (l * strides[3] + k * strides[2] + j * strides[1] + i * strides[0]);
+}
+
+template<typename T>
+void gaussian1D(T* out, int const dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / std::sqrt(2 * af::Pi * sigma * sigma) *
+               std::exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/vector_field.cpp b/src/backend/cpu/vector_field.cpp
new file mode 100644
index 0000000000..efe207be09
--- /dev/null
+++ b/src/backend/cpu/vector_field.cpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+#include <err_cpu.hpp>
+#include <platform.hpp>
+#include <queue.hpp>
+#include <vector_field.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield) {
+    ForgeModule &_ = forgePlugin();
+    points.eval();
+    directions.eval();
+    getQueue().sync();
+
+    CheckGL("Before CopyArrayToVBO");
+
+    unsigned size1 = 0, size2 = 0;
+    unsigned buff1 = 0, buff2 = 0;
+    FG_CHECK(_.fg_get_vector_field_vertex_buffer_size(&size1, vfield));
+    FG_CHECK(_.fg_get_vector_field_direction_buffer_size(&size2, vfield));
+    FG_CHECK(_.fg_get_vector_field_vertex_buffer(&buff1, vfield));
+    FG_CHECK(_.fg_get_vector_field_direction_buffer(&buff2, vfield));
+
+    glBindBuffer(GL_ARRAY_BUFFER, buff1);
+    glBufferSubData(GL_ARRAY_BUFFER, 0, size1, points.get());
+    glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+    glBindBuffer(GL_ARRAY_BUFFER, buff2);
+    glBufferSubData(GL_ARRAY_BUFFER, 0, size2, directions.get());
+    glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+    CheckGL("In CopyArrayToVBO");
+}
+
+#define INSTANTIATE(T)                                                     \
+    template void copy_vector_field<T>(const Array<T> &, const Array<T> &, \
+                                       fg_vector_field);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/vector_field.hpp b/src/backend/cpu/vector_field.hpp
new file mode 100644
index 0000000000..a64414e781
--- /dev/null
+++ b/src/backend/cpu/vector_field.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/where.cpp b/src/backend/cpu/where.cpp
index da2987e38c..30f70efcb0 100644
--- a/src/backend/cpu/where.cpp
+++ b/src/backend/cpu/where.cpp
@@ -7,71 +7,76 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <math.hpp>
 #include <memory.hpp>
+#include <platform.hpp>
 #include <where.hpp>
-#include <ops.hpp>
+#include <af/dim4.hpp>
+
+#include <complex>
 #include <vector>
 
 using af::dim4;
 
-namespace cpu
-{
-    template<typename T>
-    Array<uint> where(const Array<T> &in)
-    {
-        const dim_t *dims    = in.dims().get();
-        const dim_t *strides = in.strides().get();
-        static const T zero = scalar<T>(0);
+namespace arrayfire {
+namespace cpu {
 
-        const T *iptr = in.get();
-        uint *out_vec  = memAlloc<uint>(in.elements());
+template<typename T>
+Array<uint> where(const Array<T> &in) {
+    const dim_t *dims    = in.dims().get();
+    const dim_t *strides = in.strides().get();
+    static const T zero  = scalar<T>(0);
 
-        dim_t count = 0;
-        dim_t idx = 0;
-        for (dim_t w = 0; w < dims[3]; w++) {
-            uint offw = w * strides[3];
+    const T *iptr = in.get();
+    auto out_vec  = memAlloc<uint>(in.elements());
+    getQueue().sync();
 
-            for (dim_t z = 0; z < dims[2]; z++) {
-                uint offz = offw + z * strides[2];
+    dim_t count = 0;
+    dim_t idx   = 0;
+    for (dim_t w = 0; w < dims[3]; w++) {
+        uint offw = w * strides[3];
 
-                for (dim_t y = 0; y < dims[1]; y++) {
-                    uint offy = y * strides[1] + offz;
+        for (dim_t z = 0; z < dims[2]; z++) {
+            uint offz = offw + z * strides[2];
 
-                    for (dim_t x = 0; x < dims[0]; x++) {
+            for (dim_t y = 0; y < dims[1]; y++) {
+                uint offy = y * strides[1] + offz;
 
-                        T val = iptr[offy + x];
-                        if (val != zero) {
-                            out_vec[count] = idx;
-                            count++;
-                        }
-                        idx++;
+                for (dim_t x = 0; x < dims[0]; x++) {
+                    T val = iptr[offy + x];
+                    if (val != zero) {
+                        out_vec[count] = idx;
+                        count++;
                     }
+                    idx++;
                 }
             }
         }
-
-        Array<uint> out = createHostDataArray(dim4(count), out_vec);
-        memFree<uint>(out_vec);
-        return out;
     }
 
-#define INSTANTIATE(T)                                  \
-    template Array<uint> where<T>(const Array<T> &in);    \
+    Array<uint> out = createDeviceDataArray<uint>(dim4(count), out_vec.get());
+    out_vec.release();
+    return out;
+}
+
+#define INSTANTIATE(T) template Array<uint> where<T>(const Array<T> &in);
 
-    INSTANTIATE(float  )
-    INSTANTIATE(cfloat )
-    INSTANTIATE(double )
-    INSTANTIATE(cdouble)
-    INSTANTIATE(char   )
-    INSTANTIATE(int    )
-    INSTANTIATE(uint   )
-    INSTANTIATE(intl   )
-    INSTANTIATE(uintl  )
-    INSTANTIATE(uchar  )
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/where.hpp b/src/backend/cpu/where.hpp
index c615def543..35c671c2b0 100644
--- a/src/backend/cpu/where.hpp
+++ b/src/backend/cpu/where.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cpu
-{
-    template<typename T>
-    Array<uint> where(const Array<T>& in);
-}
+namespace arrayfire {
+namespace cpu {
+template<typename T>
+Array<uint> where(const Array<T>& in);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/wrap.cpp b/src/backend/cpu/wrap.cpp
new file mode 100644
index 0000000000..0c0d397e3f
--- /dev/null
+++ b/src/backend/cpu/wrap.cpp
@@ -0,0 +1,89 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <kernel/wrap.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <wrap.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    evalMultiple<T>(std::vector<Array<T> *>{const_cast<Array<T> *>(&in), &out});
+
+    if (is_column) {
+        getQueue().enqueue(kernel::wrap_dim<T, 1>, out, in, wx, wy, sx, sy, px,
+                           py);
+    } else {
+        getQueue().enqueue(kernel::wrap_dim<T, 0>, out, in, wx, wy, sx, sy, px,
+                           py);
+    }
+}
+
+#define INSTANTIATE(T)                                                        \
+    template void wrap<T>(Array<T> & out, const Array<T> &in, const dim_t wx, \
+                          const dim_t wy, const dim_t sx, const dim_t sy,     \
+                          const dim_t px, const dim_t py,                     \
+                          const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+    af::dim4 odims(ox, oy, idims[2], idims[3]);
+
+    Array<T> out = createValueArray<T>(odims, scalar<T>(0));
+    out.eval();
+    in.eval();
+
+    getQueue().enqueue(kernel::wrap_dim_dilated<T>, out, in, wx, wy, sx, sy, px,
+                       py, dx, dy, is_column);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> wrap_dilated<T>(                                      \
+        const Array<T> &in, const dim_t ox, const dim_t oy, const dim_t wx, \
+        const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,     \
+        const dim_t py, const dim_t dx, const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cpu/wrap.hpp b/src/backend/cpu/wrap.hpp
new file mode 100644
index 0000000000..0bec7c8727
--- /dev/null
+++ b/src/backend/cpu/wrap.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cpu {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column);
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace cpu
+}  // namespace arrayfire
diff --git a/src/backend/cuda/Array.cpp b/src/backend/cuda/Array.cpp
index 2d6398587c..e0d5f73f5a 100644
--- a/src/backend/cuda/Array.cpp
+++ b/src/backend/cuda/Array.cpp
@@ -7,284 +7,500 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
+#include <common/Logger.hpp>
+#include <common/half.hpp>
+#include <common/jit/NodeIterator.hpp>
 #include <copy.hpp>
 #include <err_cuda.hpp>
-#include <JIT/BufferNode.hpp>
-#include <scalar.hpp>
+#include <jit/BufferNode.hpp>
 #include <memory.hpp>
 #include <platform.hpp>
+#include <scalar.hpp>
+#include <af/dim4.hpp>
 
-using af::dim4;
+#include <cstddef>
+#include <memory>
+#include <numeric>
+#include <vector>
 
-namespace cuda
-{
-
-    const int MAX_JIT_LEN = 20;
-    using JIT::BufferNode;
-    using JIT::Node;
-    using JIT::Node_ptr;
-
-    template<typename T>
-    Array<T>::Array(af::dim4 dims) :
-        ArrayInfo(getActiveDeviceId(), dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(memAlloc<T>(dims.elements()), memFree<T>), data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {}
-
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, const T * const in_data, bool is_device) :
-        ArrayInfo(getActiveDeviceId(), dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data((is_device ? (T *)in_data : memAlloc<T>(dims.elements())), memFree<T>),
-        data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {
-        if (!is_device) {
-            CUDA_CHECK(cudaMemcpy(data.get(), in_data, dims.elements() * sizeof(T), cudaMemcpyHostToDevice));
-        }
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::cuda::jit::BufferNode;
+
+using nonstd::span;
+using std::accumulate;
+using std::move;
+using std::shared_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void verifyTypeSupport() {
+    if ((std::is_same<T, double>::value || std::is_same<T, cdouble>::value) &&
+        !isDoubleSupported(getActiveDeviceId())) {
+        AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
+    } else if (std::is_same<T, common::half>::value &&
+               !isHalfSupported(getActiveDeviceId())) {
+        AF_ERROR("Half precision not supported", AF_ERR_NO_HALF);
     }
+}
 
-    template<typename T>
-    Array<T>::Array(const Array<T>& parent, const dim4 &dims, const dim4 &offsets, const dim4 &strides) :
-        ArrayInfo(parent.getDevId(), dims, offsets, strides, (af_dtype)dtype_traits<T>::af_type),
-        data(parent.getData()), data_dims(parent.getDataDims()),
-        node(), ready(true),
-        offset(parent.getOffset() + calcOffset(parent.strides(), offsets)),
-        owner(false)
-    { }
-
-    template<typename T>
-    Array<T>::Array(Param<T> &tmp) :
-        ArrayInfo(getActiveDeviceId(), af::dim4(tmp.dims[0], tmp.dims[1], tmp.dims[2], tmp.dims[3]),
-                  af::dim4(0, 0, 0, 0),
-                  af::dim4(tmp.strides[0], tmp.strides[1], tmp.strides[2], tmp.strides[3]),
-                  (af_dtype)dtype_traits<T>::af_type),
-        data(tmp.ptr, memFree<T>),
-        data_dims(af::dim4(tmp.dims[0], tmp.dims[1], tmp.dims[2], tmp.dims[3])),
-        node(), ready(true), offset(0), owner(true)
-    {
+template<typename T>
+std::shared_ptr<BufferNode<T>> bufferNodePtr() {
+    return std::make_shared<BufferNode<T>>(
+        static_cast<af::dtype>(dtype_traits<T>::af_type));
+}
+
+template<typename T>
+void checkAndMigrate(Array<T> &arr) {
+    int arr_id = arr.getDevId();
+    int cur_id = detail::getActiveDeviceId();
+    if (!isDeviceBufferAccessible(arr_id, cur_id)) {
+        static auto getLogger = [&] { return spdlog::get("platform"); };
+        AF_TRACE("Migrating array from {} to {}.", arr_id, cur_id);
+        auto migrated_data = memAlloc<T>(arr.elements());
+        CUDA_CHECK(
+            cudaMemcpyPeerAsync(migrated_data.get(), getDeviceNativeId(cur_id),
+                                arr.get(), getDeviceNativeId(arr_id),
+                                arr.elements() * sizeof(T), getActiveStream()));
+        arr.data.reset(migrated_data.release(), memFree);
     }
+}
 
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, JIT::Node_ptr n) :
-        ArrayInfo(-1, dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(), data_dims(dims),
-        node(n), ready(false), offset(0), owner(true)
-    {
+template<typename T>
+Array<T>::Array(const af::dim4 &dims)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data((dims.elements() ? memAlloc<T>(dims.elements()).release() : nullptr),
+           memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {}
+
+template<typename T>
+Array<T>::Array(const af::dim4 &dims, const T *const in_data, bool is_device,
+                bool copy_device)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(((is_device && !copy_device)
+                ? const_cast<T *>(in_data)
+                : memAlloc<T>(dims.elements()).release()),
+           memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    static_assert(std::is_standard_layout<Array<T>>::value,
+                  "Array<T> must be a standard layout type");
+    static_assert(std::is_nothrow_move_assignable<Array<T>>::value,
+                  "Array<T> is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<Array<T>>::value,
+                  "Array<T> is not move constructible");
+    static_assert(
+        offsetof(Array<T>, info) == 0,
+        "Array<T>::info must be the first member variable of Array<T>");
+    if (!is_device) {
+        CUDA_CHECK(cudaMemcpyAsync(data.get(), in_data,
+                                   dims.elements() * sizeof(T),
+                                   cudaMemcpyHostToDevice, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+    } else if (copy_device) {
+        CUDA_CHECK(
+            cudaMemcpyAsync(data.get(), in_data, dims.elements() * sizeof(T),
+                            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
     }
+}
 
+template<typename T>
+Array<T>::Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset_,
+                const dim4 &strides)
+    : info(parent.getDevId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(parent.getData())
+    , data_dims(parent.getDataDims())
+    , node()
+    , owner(false) {}
+
+template<typename T>
+Array<T>::Array(Param<T> &tmp, bool owner_)
+    : info(getActiveDeviceId(),
+           af::dim4(tmp.dims[0], tmp.dims[1], tmp.dims[2], tmp.dims[3]), 0,
+           af::dim4(tmp.strides[0], tmp.strides[1], tmp.strides[2],
+                    tmp.strides[3]),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(tmp.ptr, owner_ ? std::function<void(T *)>(memFree)
+                           : std::function<void(T *)>([](T * /*unused*/) {}))
+    , data_dims(af::dim4(tmp.dims[0], tmp.dims[1], tmp.dims[2], tmp.dims[3]))
+    , node()
+    , owner(owner_) {}
+
+template<typename T>
+Array<T>::Array(const af::dim4 &dims, common::Node_ptr n)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data()
+    , data_dims(dims)
+    , node(move(n))
+    , owner(true) {
+    if (node->isBuffer()) {
+        data = std::static_pointer_cast<BufferNode<T>>(node)->getDataPointer();
+    }
+}
 
-    template<typename T>
-    void Array<T>::eval()
-    {
-        if (isReady()) return;
+template<typename T>
+Array<T>::Array(const af::dim4 &dims, const af::dim4 &strides, dim_t offset_,
+                const T *const in_data, bool is_device)
+    : info(getActiveDeviceId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(is_device ? const_cast<T *>(in_data)
+                     : memAlloc<T>(info.total()).release(),
+           memFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (!is_device) {
+        cudaStream_t stream = getActiveStream();
+        CUDA_CHECK(cudaMemcpyAsync(data.get(), in_data,
+                                   info.total() * sizeof(T),
+                                   cudaMemcpyHostToDevice, stream));
+        CUDA_CHECK(cudaStreamSynchronize(stream));
+    }
+}
 
-        this->setId(getActiveDeviceId());
-        data = shared_ptr<T>(memAlloc<T>(elements()),
-                             memFree<T>);
+template<typename T>
+void Array<T>::eval() {
+    if (isReady()) { return; }
 
-        Param<T> res;
-        res.ptr = data.get();
+    this->setId(getActiveDeviceId());
+    this->data = shared_ptr<T>(memAlloc<T>(elements()).release(), memFree);
 
-        for (int  i = 0; i < 4; i++) {
-            res.dims[i] = dims()[i];
-            res.strides[i] = strides()[i];
-        }
+    Param<T> p(data.get(), dims().get(), strides().get());
+    evalNodes<T>(p, node.get());
+    node.reset();
+}
 
-        evalNodes(res, this->getNode().get());
-        ready = true;
+template<typename T>
+void Array<T>::eval() const {
+    const_cast<Array<T> *>(this)->eval();
+}
 
-        Node_ptr prev = node;
-        prev->resetFlags();
-        // FIXME: Replace the current node in any JIT possible trees with the new BufferNode
-        node.reset();
+template<typename T>
+T *Array<T>::device() {
+    if (!isOwner() || getOffset() || data.use_count() > 1) {
+        *this = copyArray<T>(*this);
     }
+    return this->get();
+}
 
-    template<typename T>
-    void Array<T>::eval() const
-    {
-        if (isReady()) return;
-        const_cast<Array<T> *>(this)->eval();
+template<typename T>
+void evalMultiple(std::vector<Array<T> *> arrays) {
+    vector<Param<T>> output_params;
+    vector<Array<T> *> output_arrays;
+    vector<Node *> nodes;
+
+    // Check if all the arrays have the same dimension
+    auto it = std::adjacent_find(begin(arrays), end(arrays),
+                                 [](const Array<T> *l, const Array<T> *r) {
+                                     return l->dims() != r->dims();
+                                 });
+
+    // If they are not the same, eval each array individually
+    if (it != end(arrays)) {
+        for (auto ptr : arrays) { ptr->eval(); }
+        return;
     }
 
-    template<typename T>
-    Array<T>::~Array() {}
-
-    template<typename T>
-    Node_ptr Array<T>::getNode() const
-    {
-        if (!node) {
-            bool is_linear = isLinear();
-            unsigned bytes = this->getDataDims().elements() * sizeof(T);
-            BufferNode<T> *buf_node = new BufferNode<T>(irname<T>(),
-                                                        afShortName<T>(), data,
-                                                        *this, bytes, is_linear);
-            const_cast<Array<T> *>(this)->node = Node_ptr(reinterpret_cast<Node *>(buf_node));
-        }
+    for (Array<T> *array : arrays) {
+        if (array->isReady()) { continue; }
 
-        return node;
-    }
+        array->setId(getActiveDeviceId());
+        array->data =
+            shared_ptr<T>(memAlloc<T>(array->elements()).release(), memFree);
 
-    template<typename T>
-    Array<T> createNodeArray(const dim4 &dims, Node_ptr node)
-    {
-        Array<T> out =  Array<T>(dims, node);
+        output_params.emplace_back(array->getData().get(), array->dims().get(),
+                                   array->strides().get());
+        output_arrays.push_back(array);
+        nodes.push_back(array->getNode().get());
+    }
 
-        unsigned length =0, buf_count = 0, bytes = 0;
+    if (output_params.empty()) return;
 
-        Node *n = node.get();
-        n->getInfo(length, buf_count, bytes);
-        n->resetFlags();
+    evalNodes(output_params, nodes);
 
-        if (length > MAX_JIT_LEN ||
-            buf_count >= MAX_BUFFERS ||
-            bytes >= MAX_BYTES) {
-            out.eval();
-        }
+    for (Array<T> *array : output_arrays) { array->node.reset(); }
+}
 
-        return out;
-    }
+template<typename T>
+Node_ptr Array<T>::getNode() {
+    if (node) { return node; }
 
-    template<typename T>
-    Array<T> createHostDataArray(const dim4 &size, const T * const data)
-    {
-        return Array<T>(size, data, false);
-    }
+    Param<T> kinfo = *this;
+    unsigned bytes = this->dims().elements() * sizeof(T);
+    auto nn        = bufferNodePtr<T>();
+    nn->setData(kinfo, data, bytes, isLinear());
 
-    template<typename T>
-    Array<T> createDeviceDataArray(const dim4 &size, const void *data)
-    {
-        return Array<T>(size, (const T * const)data, true);
-    }
+    return nn;
+}
 
-    template<typename T>
-    Array<T> createValueArray(const dim4 &size, const T& value)
-    {
-        return createScalarNode<T>(size, value);
-    }
+template<typename T>
+Node_ptr Array<T>::getNode() const {
+    return const_cast<Array<T> *>(this)->getNode();
+}
 
-    template<typename T>
-    Array<T> createEmptyArray(const dim4 &size)
-    {
-        return Array<T>(size);
+/// This function should be called after a new JIT node is created. It
+/// returns kJITHeuristics::Pass if the newly created node will generate a
+/// valid kernel. Any other value means the node will fail to compile or the
+/// node and its referenced buffers are consuming too many resources; in that
+/// case the node's child nodes should be evaluated before continuing.
+///
+/// We eval in the following cases:
+///
+/// 1. Too many bytes are locked up by JIT causing memory
+///    pressure. Too many bytes is assumed to be half of all bytes
+///    allocated so far.
+///
+/// 2. The size of the parameters we are passing into the kernel exceeds the
+///    limitation on the platform. For NVIDIA this is 4096 bytes.
+template<typename T>
+kJITHeuristics passesJitHeuristics(span<Node *> root_nodes) {
+    if (!evalFlag()) { return kJITHeuristics::Pass; }
+    static auto getLogger = [&] { return spdlog::get("jit"); };
+    for (Node *n : root_nodes) {
+        if (n->getHeight() > static_cast<int>(getMaxJitSize())) {
+            AF_TRACE(
+                "JIT tree evaluated because of tree height exceeds limit: {} > "
+                "{}",
+                n->getHeight(), getMaxJitSize());
+            return kJITHeuristics::TreeHeight;
+        }
     }
 
-    template<typename T>
-    Array<T> *initArray()
-    {
-        return new Array<T>(dim4());
+    // A lightweight check based on the height of the node. This is an
+    // inexpensive operation and does not traverse the JIT tree.
+    int heightCheckLimit = 6;
+    bool atHeightLimit =
+        std::any_of(std::begin(root_nodes), std::end(root_nodes),
+                    [heightCheckLimit](Node *n) {
+                        return (n->getHeight() + 1 >= heightCheckLimit);
+                    });
+    if (atHeightLimit || getMemoryPressure() >= getMemoryPressureThreshold()) {
+        // The size of the parameters without any extra arguments from the JIT
+        // tree. This includes one output Param per root node and 4 integers.
+        size_t base_param_size =
+            sizeof(Param<T>) * root_nodes.size() + (4 * sizeof(uint));
+
+        // extra padding for safety to avoid failure during compilation
+        constexpr size_t jit_padding_size = 256;  //@umar dontfix!
+        // This is the maximum size of the params that can be allowed by the
+        // CUDA platform.
+        size_t max_param_size = 4096 - base_param_size - jit_padding_size;
+
+        struct tree_info {
+            size_t total_buffer_size;
+            size_t num_buffers;
+            size_t param_scalar_size;
+        };
+        NodeIterator<> end_node;
+        tree_info info = tree_info{0, 0, 0};
+
+        for (Node *n : root_nodes) {
+            info = accumulate(
+                NodeIterator<>(n), end_node, info,
+                [](tree_info &prev, const Node &node) {
+                    if (node.isBuffer()) {
+                        const auto &buf_node =
+                            static_cast<const BufferNode<T> &>(node);
+                        // getBytes returns the size of the data Array.
+                        // Sub arrays will be represented by their
+                        // parent size.
+                        prev.total_buffer_size += buf_node.getBytes();
+                        prev.num_buffers++;
+                    } else {
+                        prev.param_scalar_size += node.getParamBytes();
+                    }
+                    return prev;
+                });
+        }
+        size_t param_size =
+            info.num_buffers * sizeof(Param<T>) + info.param_scalar_size;
+
+        // TODO: the buffer_size check here is very conservative. It
+        // will trigger an evaluation of the node in most cases. We
+        // should be checking the amount of memory available to guard
+        // this eval
+        if (param_size >= max_param_size) {
+            AF_TRACE(
+                "JIT tree evaluated because of kernel parameter size: {} >= {}",
+                param_size, max_param_size);
+            return kJITHeuristics::KernelParameterSize;
+        }
+        if (jitTreeExceedsMemoryPressure(info.total_buffer_size)) {
+            AF_TRACE("JIT tree evaluated because of memory pressure: {}",
+                     info.total_buffer_size);
+            return kJITHeuristics::MemoryPressure;
+        }
     }
+    return kJITHeuristics::Pass;
+}
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy)
-    {
-        parent.eval();
+template<typename T>
+Array<T> createNodeArray(const dim4 &dims, Node_ptr node) {
+    verifyTypeSupport<T>();
+    Array<T> out = Array<T>(dims, node);
+    return out;
+}
 
-        dim4 dDims = parent.getDataDims();
-        dim4 pDims = parent.dims();
+template<typename T>
+Array<T> createHostDataArray(const dim4 &dims, const T *const data) {
+    verifyTypeSupport<T>();
+    bool is_device   = false;
+    bool copy_device = false;
+    return Array<T>(dims, data, is_device, copy_device);
+}
 
-        dim4 dims   = toDims  (index, pDims);
-        dim4 offset = toOffset(index, dDims);
-        dim4 stride = toStride (index, dDims);
+template<typename T>
+Array<T> createDeviceDataArray(const dim4 &dims, void *data, bool copy) {
+    verifyTypeSupport<T>();
+    bool is_device = true;
+    return Array<T>(dims, static_cast<T *>(data), is_device, copy);
+}
 
-        Array<T> out = Array<T>(parent, dims, offset, stride);
+template<typename T>
+Array<T> createValueArray(const dim4 &dims, const T &value) {
+    verifyTypeSupport<T>();
+    return createScalarNode<T>(dims, value);
+}
 
-        if (!copy) return out;
+template<typename T>
+Array<T> createEmptyArray(const dim4 &dims) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims);
+}
 
-        if (stride[0] != 1 ||
-            stride[1] <  0 ||
-            stride[2] <  0 ||
-            stride[3] <  0) {
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent,
+                        const std::vector<af_seq> &index, bool copy) {
+    parent.eval();
 
-            out = copyArray(out);
-        }
+    dim4 dDims          = parent.getDataDims();
+    dim4 parent_strides = parent.strides();
 
-        return out;
+    if (parent.isLinear() == false) {
+        const Array<T> parentCopy = copyArray(parent);
+        return createSubArray(parentCopy, index, copy);
     }
 
-    template<typename T>
-    Array<T> createParamArray(Param<T> &tmp)
-    {
-        return Array<T>(tmp);
-    }
+    const dim4 &pDims = parent.dims();
+    dim4 dims         = toDims(index, pDims);
+    dim4 strides      = toStride(index, dDims);
 
-    template<typename T>
-    void destroyArray(Array<T> *A)
-    {
-        delete A;
-    }
+    // Find total offsets after indexing
+    dim4 offsets = toOffset(index, pDims);
+    dim_t offset = parent.getOffset();
+    for (int i = 0; i < 4; i++) { offset += offsets[i] * parent_strides[i]; }
 
-    template<typename T>
-    void evalArray(const Array<T> &A)
-    {
-        A.eval();
+    Array<T> out = Array<T>(parent, dims, offset, strides);
+
+    if (!copy) { return out; }
+
+    if (strides[0] != 1 || strides[1] < 0 || strides[2] < 0 || strides[3] < 0) {
+        out = copyArray(out);
     }
 
-    template<typename T>
-    void
-    writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes)
-    {
-        if (!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
+    return out;
+}
 
-        T *ptr = arr.get();
+template<typename T>
+Array<T> createParamArray(Param<T> &tmp, bool owner) {
+    return Array<T>(tmp, owner);
+}
 
-        CUDA_CHECK(cudaMemcpy(ptr + arr.getOffset(), data,
-                              bytes,
-                              cudaMemcpyHostToDevice));
+template<typename T>
+void destroyArray(Array<T> *A) {
+    delete A;
+}
 
-        return;
-    }
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data,
+                        const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
 
-    template<typename T>
-    void
-    writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes)
-    {
-        if (!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
+    T *ptr = arr.get();
 
-        T *ptr = arr.get();
+    CUDA_CHECK(cudaMemcpyAsync(ptr, data, bytes, cudaMemcpyHostToDevice,
+                               getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+}
 
-        CUDA_CHECK(cudaMemcpy(ptr + arr.getOffset(), data,
-                              bytes,
-                              cudaMemcpyDeviceToDevice));
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
 
-        return;
-    }
+    T *ptr = arr.get();
 
-#define INSTANTIATE(T)                                                  \
-    template       Array<T>  createHostDataArray<T>   (const dim4 &size, const T * const data); \
-    template       Array<T>  createDeviceDataArray<T> (const dim4 &size, const void *data); \
-    template       Array<T>  createValueArray<T>      (const dim4 &size, const T &value); \
-    template       Array<T>  createEmptyArray<T>      (const dim4 &size); \
-    template       Array<T>  *initArray<T      >      ();               \
-    template       Array<T>  createParamArray<T>      (Param<T> &tmp);  \
-    template       Array<T>  createSubArray<T>        (const Array<T> &parent, \
-                                                       const std::vector<af_seq> &index, \
-                                                       bool copy);      \
-    template       void      destroyArray<T>          (Array<T> *A);    \
-    template       void      evalArray<T>             (const Array<T> &A); \
-    template       Array<T>  createNodeArray<T>       (const dim4 &size, JIT::Node_ptr node); \
-    template       Array<T>::~Array        ();                          \
-    template       void Array<T>::eval();                               \
-    template       void Array<T>::eval() const;                         \
-    template       void      writeHostDataArray<T>    (Array<T> &arr, const T * const data, const size_t bytes); \
-    template       void      writeDeviceDataArray<T>  (Array<T> &arr, const void * const data, const size_t bytes); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+    CUDA_CHECK(cudaMemcpyAsync(ptr, data, bytes, cudaMemcpyDeviceToDevice,
+                               getActiveStream()));
+}
 
+template<typename T>
+void Array<T>::setDataDims(const dim4 &new_dims) {
+    data_dims = new_dims;
+    modDims(new_dims);
 }
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> createHostDataArray<T>(const dim4 &size,                \
+                                             const T *const data);            \
+    template Array<T> createDeviceDataArray<T>(const dim4 &size, void *data,  \
+                                               bool copy);                    \
+    template Array<T> createValueArray<T>(const dim4 &size, const T &value);  \
+    template Array<T> createEmptyArray<T>(const dim4 &size);                  \
+    template Array<T> createParamArray<T>(Param<T> & tmp, bool owner);        \
+    template Array<T> createSubArray<T>(                                      \
+        const Array<T> &parent, const std::vector<af_seq> &index, bool copy); \
+    template void destroyArray<T>(Array<T> * A);                              \
+    template Array<T> createNodeArray<T>(const dim4 &size,                    \
+                                         common::Node_ptr node);              \
+    template Array<T>::Array(const af::dim4 &dims, const af::dim4 &strides,   \
+                             dim_t offset, const T *const in_data,            \
+                             bool is_device);                                 \
+    template Array<T>::Array(const af::dim4 &dims, const T *const in_data,    \
+                             bool is_device, bool copy_device);               \
+    template Node_ptr Array<T>::getNode();                                    \
+    template Node_ptr Array<T>::getNode() const;                              \
+    template void Array<T>::eval();                                           \
+    template void Array<T>::eval() const;                                     \
+    template T *Array<T>::device();                                           \
+    template void writeHostDataArray<T>(Array<T> & arr, const T *const data,  \
+                                        const size_t bytes);                  \
+    template void writeDeviceDataArray<T>(                                    \
+        Array<T> & arr, const void *const data, const size_t bytes);          \
+    template void evalMultiple<T>(std::vector<Array<T> *> arrays);            \
+    template kJITHeuristics passesJitHeuristics<T>(span<Node *> n);           \
+    template void Array<T>::setDataDims(const dim4 &new_dims);                \
+    template void checkAndMigrate<T>(Array<T> & arr);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
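Note on the kernel-parameter check in passesJitHeuristics above: the arithmetic can be sketched in isolation as follows. This is a minimal illustration, not the ArrayFire implementation. The names kMaxKernelParamBytes, kJitPaddingBytes, TreeInfo and fitsKernelParamBudget are hypothetical, and the size of Param<T> is passed in as a plain number because the real type is not reproduced here. The 4096-byte limit and the 256-byte padding are the values used in the patch.

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the patch; the constant names are hypothetical.
constexpr std::size_t kMaxKernelParamBytes = 4096;  // CUDA parameter limit
constexpr std::size_t kJitPaddingBytes     = 256;   // safety padding

// Hypothetical summary of a JIT tree: how many buffer nodes it contains and
// how many bytes its non-buffer (scalar) nodes contribute as parameters.
struct TreeInfo {
    std::size_t num_buffers;
    std::size_t param_scalar_bytes;
};

// Returns true when the parameters of the JIT tree fit in what is left of the
// limit after the output Params, 4 integers and the padding are reserved.
// param_struct_bytes stands in for sizeof(Param<T>).
inline bool fitsKernelParamBudget(std::size_t param_struct_bytes,
                                  std::size_t num_outputs,
                                  const TreeInfo &info) {
    const std::size_t base =
        param_struct_bytes * num_outputs + 4 * sizeof(unsigned);
    // Guard against unsigned underflow in the subtraction below.
    assert(base + kJitPaddingBytes < kMaxKernelParamBytes);
    const std::size_t max_param =
        kMaxKernelParamBytes - base - kJitPaddingBytes;
    const std::size_t used =
        info.num_buffers * param_struct_bytes + info.param_scalar_bytes;
    // The patch forces an eval when used >= max_param.
    return used < max_param;
}
```

With an assumed 72-byte Param and one output, the budget left for the tree is 4096 - (72 + 16) - 256 = 3752 bytes, so a tree with 52 buffers and 8 scalar bytes (3752 bytes in total) would already trigger evaluation.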
diff --git a/src/backend/cuda/Array.hpp b/src/backend/cuda/Array.hpp
index 55be6348de..82e8bb9583 100644
--- a/src/backend/cuda/Array.hpp
+++ b/src/backend/cuda/Array.hpp
@@ -8,174 +8,305 @@
  ********************************************************/
 #pragma once
 
-// Workaround for BOOST_NOINLINE not being defined with nvcc / CUDA < 6.5
-#if CUDA_VERSION < 6050
-#ifndef BOOST_NOINLINE
-#define BOOST_NOINLINE __attribute__ ((noinline))
-#endif
-#endif
-
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <ArrayInfo.hpp>
-#include "traits.hpp"
+#include <Param.hpp>
 #include <backend.hpp>
-#include <types.hpp>
-#include <traits.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/jit/Node.hpp>
+#include <cuda.h>
 #include <cuda_runtime_api.h>
-#include <Param.hpp>
-#include <JIT/Node.hpp>
-#include <boost/shared_ptr.hpp>
-#include <vector>
+#include <jit/BufferNode.hpp>
 #include <memory.hpp>
+#include <platform.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+#include "traits.hpp"
 
-namespace cuda
-{
-
-    using af::dim4;
-    using boost::shared_ptr;
-
-    template<typename T> class Array;
+#include <nonstd/span.hpp>
+#include <vector>
 
-    template<typename T>
-    void evalNodes(Param<T> &out, JIT::Node *node);
+namespace arrayfire {
+namespace cuda {
+
+using af::dim4;
+
+template<typename T>
+class Array;
+
+/// Checks if the Array's buffer is accessible from the current device and if
+/// not, migrates the data to the current device
+///
+/// \param[in] arr The Array that will be checked.
+template<typename T>
+void checkAndMigrate(Array<T> &arr);
+
+template<typename T>
+void evalNodes(Param<T> out, common::Node *node);
+
+template<typename T>
+void evalNodes(std::vector<Param<T>> &out,
+               const std::vector<common::Node *> &nodes);
+
+template<typename T>
+void evalMultiple(std::vector<Array<T> *> arrays);
+
+template<typename T>
+Array<T> createNodeArray(const af::dim4 &dims, common::Node_ptr node);
+
+template<typename T>
+Array<T> createValueArray(const af::dim4 &dims, const T &value);
+
+/// Creates an array and copies from the \p data pointer located in host memory
+///
+/// \param[in] dims The dimensions of the array
+/// \param[in] data The data that will be copied to the array
+template<typename T>
+Array<T> createHostDataArray(const af::dim4 &dims, const T *const data);
+
+/// Creates an Array<T> object from a device pointer.
+///
+/// \param[in] dims The shape of the resulting Array.
+/// \param[in] data The device pointer to the data
+/// \param[in] copy If true, memory will be allocated and the data will be
+///                 copied to the device. If false the data will be used
+///                 directly
+/// \returns The new Array<T> object based on the device pointer.
+template<typename T>
+Array<T> createDeviceDataArray(const af::dim4 &dims, void *data,
+                               bool copy = false);
+
+template<typename T>
+Array<T> createStridedArray(const af::dim4 &dims, const af::dim4 &strides,
+                            dim_t offset, const T *const in_data,
+                            bool is_device) {
+    return Array<T>(dims, strides, offset, in_data, is_device);
+}
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createNodeArray(const af::dim4 &size, JIT::Node_ptr node);
+/// Copies data to an existing Array object from a host pointer
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data, const size_t bytes);
+
+/// Copies data to an existing Array object from a device pointer
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes);
+
+/// Creates an empty array of a given size. No data is initialized
+///
+/// \param[in] dims The dimensions of the output array
+template<typename T>
+Array<T> createEmptyArray(const af::dim4 &dims);
+
+/// Create an Array object from Param<T> object.
+///
+/// \param[in] tmp   The Param<T> object from which the Array is created.
+/// \param[in] owner If true, the new Array<T> object is the owner of the data.
+///                  If false, the Array<T> will not delete the object on
+///                  destruction
+template<typename T>
+Array<T> createParamArray(Param<T> &tmp, bool owner);
+
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent,
+                        const std::vector<af_seq> &index, bool copy = true);
+
+// Deletes an Array object that was allocated on the heap.
+template<typename T>
+void destroyArray(Array<T> *A);
+
+/// \brief Checks if the Node can be compiled successfully and the buffers it
+///        references are not consuming most of the allocated memory
+///
+/// \param [in] node The root nodes which need to be checked
+///
+/// \returns kJITHeuristics::Pass if the checks pass; otherwise the heuristic
+///          that failed (kernel will not compile or too much memory is used).
+template<typename T>
+kJITHeuristics passesJitHeuristics(nonstd::span<common::Node *> node);
+
+template<typename T>
+void *getDevicePtr(const Array<T> &arr) {
+    T *ptr = arr.device();
+    memLock(ptr);
+    return (void *)ptr;
+}
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createValueArray(const af::dim4 &size, const T& value);
+template<typename T>
+void *getRawPtr(const Array<T> &arr) {
+    return (void *)(arr.get(false));
+}
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createHostDataArray(const af::dim4 &size, const T * const data);
+template<typename T>
+class Array {
+    ArrayInfo info;  // This must be the first element of Array<T>
 
-    template<typename T>
-    Array<T> createDeviceDataArray(const af::dim4 &size, const void *data);
+    /// Pointer to the data
+    std::shared_ptr<T> data;
 
-    // Copies data to an existing Array object from a host pointer
-    template<typename T>
-    void writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes);
+    /// The shape of the underlying parent data.
+    af::dim4 data_dims;
 
-    // Copies data to an existing Array object from a device pointer
-    template<typename T>
-    void writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes);
+    /// Null if this is a buffer node. Otherwise this points to a JIT node
+    common::Node_ptr node;
 
-    // Create an Array object and do not assign any values to it
-    template<typename T> Array<T> *initArray();
+    /// If true, the Array object is the parent. If false the data object points
+    /// to another array's data
+    bool owner;
 
-    template<typename T>
-    Array<T> createEmptyArray(const af::dim4 &size);
+    Array(const af::dim4 &dims);
 
-    // Create an Array object from Param<T>
-    template<typename T>
-    Array<T> createParamArray(Param<T> &tmp);
+    explicit Array(const af::dim4 &dims, const T *const in_data,
+                   bool is_device = false, bool copy_device = false);
+    Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset,
+          const dim4 &stride);
+    Array(Param<T> &tmp, bool owner);
+    Array(const af::dim4 &dims, common::Node_ptr n);
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy=true);
+    std::shared_ptr<T> getData() const { return data; }
 
-    template<typename T>
-    void evalArray(const Array<T> &A);
+   public:
+    Array(const Array<T> &other) = default;
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    void destroyArray(Array<T> *A);
+    Array(Array<T> &&other) noexcept = default;
 
-    template<typename T>
-    void *getDevicePtr(const Array<T>& arr)
-    {
-        memUnlink((T *)arr.get());
-        return (void *)arr.get();
+    Array<T> &operator=(Array<T> other) noexcept {
+        swap(other);
+        return *this;
     }
 
-    template<typename T>
-    class Array : public ArrayInfo
-    {
-        shared_ptr<T> data;
-        af::dim4 data_dims;
-
-        JIT::Node_ptr node;
-        bool ready;
-        dim_t offset;
-        bool owner;
-
-        Array(af::dim4 dims);
-        explicit Array(af::dim4 dims, const T * const in_data, bool is_device = false);
-        Array(const Array<T>& parnt, const dim4 &dims, const dim4 &offset, const dim4 &stride);
-        Array(Param<T> &tmp);
-        Array(af::dim4 dims, JIT::Node_ptr n);
-    public:
-
-        ~Array();
-
-        bool isReady() const { return ready; }
-        bool isOwner() const { return owner; }
-
-        void eval();
-        void eval() const;
-
-        dim_t getOffset() const { return offset; }
-        shared_ptr<T> getData() const { return data; }
-
-        dim4 getDataDims() const
-        {
-            // This is for moddims
-            // dims and data_dims are different when moddims is used
-            return isOwner() ? dims() : data_dims;
-        }
-
-        T* get(bool withOffset = true)
-        {
-            if (!isReady()) eval();
-            return const_cast<T*>(static_cast<const Array<T>*>(this)->get(withOffset));
-        }
+    void swap(Array<T> &other) noexcept {
+        using std::swap;
+        swap(info, other.info);
+        swap(data, other.data);
+        swap(data_dims, other.data_dims);
+        swap(node, other.node);
+        swap(owner, other.owner);
+    }
 
-        //FIXME: implement withOffset parameter
-        const   T* get(bool withOffset = true) const
-        {
-            if (!isReady()) eval();
-            return data.get() + (withOffset ? offset : 0);
+    Array(const af::dim4 &dims, const af::dim4 &strides, dim_t offset,
+          const T *const in_data, bool is_device = false);
+
+    void resetInfo(const af::dim4 &dims) { info.resetInfo(dims); }
+    void resetDims(const af::dim4 &dims) { info.resetDims(dims); }
+    void modDims(const af::dim4 &newDims) { info.modDims(newDims); }
+    void modStrides(const af::dim4 &newStrides) { info.modStrides(newStrides); }
+    void setId(int id) { info.setId(id); }
+
+#define INFO_FUNC(RET_TYPE, NAME) \
+    RET_TYPE NAME() const { return info.NAME(); }
+
+    INFO_FUNC(const af_dtype &, getType)
+    INFO_FUNC(const af::dim4 &, strides)
+    INFO_FUNC(dim_t, elements)
+    INFO_FUNC(dim_t, ndims)
+    INFO_FUNC(const af::dim4 &, dims)
+    INFO_FUNC(int, getDevId)
+
+#undef INFO_FUNC
+
+#define INFO_IS_FUNC(NAME) \
+    bool NAME() const { return info.NAME(); }
+
+    INFO_IS_FUNC(isEmpty);
+    INFO_IS_FUNC(isScalar);
+    INFO_IS_FUNC(isRow);
+    INFO_IS_FUNC(isColumn);
+    INFO_IS_FUNC(isVector);
+    INFO_IS_FUNC(isComplex);
+    INFO_IS_FUNC(isReal);
+    INFO_IS_FUNC(isDouble);
+    INFO_IS_FUNC(isSingle);
+    INFO_IS_FUNC(isHalf);
+    INFO_IS_FUNC(isRealFloating);
+    INFO_IS_FUNC(isFloating);
+    INFO_IS_FUNC(isInteger);
+    INFO_IS_FUNC(isBool);
+    INFO_IS_FUNC(isLinear);
+    INFO_IS_FUNC(isSparse);
+
+#undef INFO_IS_FUNC
+
+    ~Array() = default;
+
+    bool isReady() const { return static_cast<bool>(node) == false; }
+    bool isOwner() const { return owner; }
+
+    void eval();
+    void eval() const;
+
+    dim_t getOffset() const { return info.getOffset(); }
+
+    dim4 getDataDims() const { return data_dims; }
+
+    void setDataDims(const dim4 &new_dims);
+
+    size_t getAllocatedBytes() const {
+        if (!isReady()) return 0;
+        size_t bytes = memoryManager().allocated(data.get());
+        // External device pointer
+        if (bytes == 0 && data.get()) {
+            return data_dims.elements() * sizeof(T);
         }
+        return bytes;
+    }
 
-        operator Param<T>()
-        {
-            Param<T> out;
-            out.ptr = this->get();
-            for (int  i = 0; i < 4; i++) {
-                out.dims[i] = dims()[i];
-                out.strides[i] = strides()[i];
-            }
-            return out;
-        }
+    T *device();
 
-        operator CParam<T>() const
-        {
-            CParam<T> out(this->get(), this->dims().get(), this->strides().get());
-            return out;
-        }
+    T *device() const { return const_cast<Array<T> *>(this)->device(); }
 
-        JIT::Node_ptr getNode() const;
+    T *get(bool withOffset = true) {
+        if (!isReady()) eval();
+        return const_cast<T *>(
+            static_cast<const Array<T> *>(this)->get(withOffset));
+    }
 
-        friend Array<T> createValueArray<T>(const af::dim4 &size, const T& value);
-        friend Array<T> createHostDataArray<T>(const af::dim4 &size, const T * const data);
-        friend Array<T> createDeviceDataArray<T>(const af::dim4 &size, const void *data);
+    // FIXME: implement withOffset parameter
+    const T *get(bool withOffset = true) const {
+        if (!isReady()) eval();
+        return data.get() + (withOffset ? getOffset() : 0);
+    }
 
-        friend Array<T> *initArray<T>();
-        friend Array<T> createEmptyArray<T>(const af::dim4 &size);
-        friend Array<T> createParamArray<T>(Param<T> &tmp);
-        friend Array<T> createNodeArray<T>(const af::dim4 &dims, JIT::Node_ptr node);
+    int useCount() const { return data.use_count(); }
 
-        friend Array<T> createSubArray<T>(const Array<T>& parent,
-                                          const std::vector<af_seq> &index,
-                                          bool copy);
+    operator Param<data_t<T>>() {
+        return Param<data_t<T>>(this->get(), this->dims().get(),
+                                this->strides().get());
+    }
 
-        friend void destroyArray<T>(Array<T> *arr);
-        friend void evalArray<T>(const Array<T> &arr);
-        friend void *getDevicePtr<T>(const Array<T>& arr);
-    };
+    operator CParam<data_t<T>>() const {
+        return CParam<data_t<T>>(this->get(), this->dims().get(),
+                                 this->strides().get());
+    }
 
-}
+    common::Node_ptr getNode();
+    common::Node_ptr getNode() const;
+
+    friend void evalMultiple<T>(std::vector<Array<T> *> arrays);
+    friend Array<T> createValueArray<T>(const af::dim4 &size, const T &value);
+    friend Array<T> createHostDataArray<T>(const af::dim4 &dims,
+                                           const T *const data);
+    friend Array<T> createDeviceDataArray<T>(const af::dim4 &dims, void *data,
+                                             bool copy);
+    friend Array<T> createStridedArray<T>(const af::dim4 &dims,
+                                          const af::dim4 &strides, dim_t offset,
+                                          const T *const in_data,
+                                          bool is_device);
+
+    friend Array<T> createEmptyArray<T>(const af::dim4 &dims);
+    friend Array<T> createParamArray<T>(Param<T> &tmp, bool owner);
+    friend Array<T> createNodeArray<T>(const af::dim4 &dims,
+                                       common::Node_ptr node);
+
+    friend Array<T> createSubArray<T>(const Array<T> &parent,
+                                      const std::vector<af_seq> &index,
+                                      bool copy);
+
+    friend void destroyArray<T>(Array<T> *arr);
+    friend void *getDevicePtr<T>(const Array<T> &arr);
+    friend void *getRawPtr<T>(const Array<T> &arr);
+    friend void checkAndMigrate<T>(Array<T> &arr);
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
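The `operator=` and `swap` members added to `Array<T>` above follow the copy-and-swap idiom. The sketch below illustrates that idiom in isolation, assuming a handle that shares its buffer through `std::shared_ptr`. `Handle` and `sharedOwnersAfterCopyAssign` are hypothetical names chosen for this example and are not part of ArrayFire.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for Array<T>: a handle that shares its buffer through
// std::shared_ptr. The names here are illustrative, not ArrayFire API.
class Handle {
    std::shared_ptr<int> data_;
    int elements_;

   public:
    Handle(int value, int elements)
        : data_(std::make_shared<int>(value)), elements_(elements) {}

    Handle(const Handle &other)     = default;
    Handle(Handle &&other) noexcept = default;

    // Taking the argument by value lets the call site choose the copy or the
    // move constructor; the body then only has to swap.
    Handle &operator=(Handle other) noexcept {
        swap(other);
        return *this;
    }

    void swap(Handle &other) noexcept {
        using std::swap;
        swap(data_, other.data_);
        swap(elements_, other.elements_);
    }

    int value() const { return *data_; }
    int elements() const { return elements_; }
    long useCount() const { return data_.use_count(); }
};

// Copy-assigns one handle to another and reports how many handles share the
// source buffer afterwards.
inline long sharedOwnersAfterCopyAssign() {
    Handle src(7, 4);
    Handle dst(0, 1);
    dst = src;  // copy-constructs the parameter, then swaps it into dst
    return src.useCount();
}
```

A single by-value `operator=` covers both copy and move assignment, and because `swap` cannot throw, the assignment either completes or leaves the target untouched.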
diff --git a/src/backend/cuda/CMakeLists.txt b/src/backend/cuda/CMakeLists.txt
index 671eb50a7a..5085c57717 100644
--- a/src/backend/cuda/CMakeLists.txt
+++ b/src/backend/cuda/CMakeLists.txt
@@ -1,213 +1,928 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-
-FIND_PACKAGE(CUDA REQUIRED)
-FIND_PACKAGE(Boost REQUIRED)
-
-INCLUDE("${CMAKE_MODULE_PATH}/CLKernelToH.cmake")
-INCLUDE("${CMAKE_MODULE_PATH}/FindNVVM.cmake")
-
-# Disables running cuda_compute_check.c when build windows using remote
-IF(NOT DEFINED CUDA_COMPUTE_CAPABILITY)
-    INCLUDE("${CMAKE_MODULE_PATH}/CUDACheckCompute.cmake")
-ELSE(NOT DEFINED CUDA_COMPUTE_CAPABILITY)
-    IF(NOT DEFINED CUDA_GENERATE_CODE)
-        SET(CUDA_GENERATE_CODE "arch=compute_${CUDA_COMPUTE_CAPABILITY},code=sm_${CUDA_COMPUTE_CAPABILITY}")
-    ENDIF(NOT DEFINED CUDA_GENERATE_CODE)
-
-    SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -arch compute_${CUDA_COMPUTE_CAPABILITY})
-ENDIF()
-
-IF(UNIX)
-    SET(CUDA_NVCC_FLAGS ${CUDA_NVCC_FLAGS} -Xcompiler -fvisibility=hidden)
-    REMOVE_DEFINITIONS(-std=c++0x)
-    IF(${WITH_COVERAGE})
-        SET(CUDA_NVCC_FLAGS "${CUDA_NVCC_FLAGS} -Xcompiler -fprofile-arcs -Xcompiler -ftest-coverage -Xlinker -fprofile-arcs -Xlinker -ftest-coverage")
-    ENDIF(${WITH_COVERAGE})
-ELSE()
-    ADD_DEFINITIONS(-DAFDLL)
-ENDIF()
-
-ADD_DEFINITIONS(-DAF_CUDA)
-
-IF(${CUDA_VERSION_MAJOR} LESS 7)
-    MESSAGE(STATUS "CUDA Version ${CUDA_VERSION_STRING} does not contain cuSolve library. Linear Algebra will not be available." )
-    IF(CMAKE_VERSION VERSION_LESS 3.2)
-        SET(CUDA_cusolver_LIBRARY)
-    ENDIF(CMAKE_VERSION VERSION_LESS 3.2)
-ELSE(${CUDA_VERSION_MAJOR} LESS 7)
-    MESSAGE(STATUS "CUDA cusolver library available in CUDA Version ${CUDA_VERSION_STRING}")
-    ADD_DEFINITIONS(-DWITH_CUDA_LINEAR_ALGEBRA)
-    IF(CMAKE_VERSION VERSION_LESS 3.2)
-        FIND_LIBRARY(
-            CUDA_cusolver_LIBRARY
-            NAMES "cusolver"
-            PATHS ${CUDA_TOOLKIT_ROOT_DIR}
-            PATH_SUFFIXES "lib64" "lib/x64" "lib"
-            DOC "CUDA cusolver Library"
-            NO_DEFAULT_PATH
-            )
-    ENDIF(CMAKE_VERSION VERSION_LESS 3.2)
-ENDIF(${CUDA_VERSION_MAJOR} LESS 7)
-
-INCLUDE_DIRECTORIES(
-    ${CMAKE_INCLUDE_PATH}
-    ${Boost_INCLUDE_DIR}
-    ${CUDA_INCLUDE_DIRS}
-    "${CMAKE_SOURCE_DIR}/src/backend/cuda"
-    "${CMAKE_CURRENT_BINARY_DIR}"
-    ${CUDA_NVVM_INCLUDE_DIR}
-    )
-
-FILE(GLOB cuda_headers
-     "*.hpp"
-     "*.h")
-
-FILE(GLOB cuda_sources
-    "*.cu"
-    "*.cpp"
-    "sort_by_key/*.cu"
-    "kernel/*.cu")
-
-FILE(GLOB jit_sources
-    "JIT/*.hpp")
-
-FILE(GLOB kernel_headers
-    "kernel/*.hpp")
-
-FILE(GLOB ptx_sources
-    "JIT/*.cu")
-
-source_group(backend\\cuda\\Headers FILES ${cuda_headers})
-source_group(backend\\cuda\\Sources FILES ${cuda_sources})
-source_group(backend\\cuda\\JIT FILES ${jit_sources})
-source_group(backend\\cuda\\kernel\\Headers FILES ${kernel_headers})
-
-FILE(GLOB backend_headers
-    "../*.hpp"
-    "../*.h"
-    )
-
-FILE(GLOB backend_sources
-    "../*.cpp"
-    )
-
-source_group(backend\\Headers FILES ${backend_headers})
-source_group(backend\\Sources FILES ${backend_sources})
-
-FILE(GLOB c_headers
-    "../../api/c/*.hpp"
-    "../../api/c/*.h"
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+generate_product_version(af_cuda_ver_res_file
+  FILE_NAME "afcuda"
+  FILE_DESCRIPTION "CUDA Backend Dynamic-link library"
+)
+
+dependency_check(CUDA_FOUND "CUDA not found.")
+if(AF_WITH_CUDNN)
+  dependency_check(cuDNN_FOUND "CUDNN not found.")
+endif()
+
+include(AFcuda_helpers)
+include(FileToString)
+include(InternalUtils)
+include(select_compute_arch)
+
+# Remove cublas_device library which is no longer included with the cuda
+# toolkit. Fixes issues with older CMake versions
+if(DEFINED CUDA_cublas_device_LIBRARY AND NOT CUDA_cublas_device_LIBRARY)
+  list(REMOVE_ITEM CUDA_CUBLAS_LIBRARIES ${CUDA_cublas_device_LIBRARY})
+endif()
+
+if(NOT OPENGL_FOUND)
+  # create a dummy gl.h header to satisfy cuda_gl_interop.h requirement
+  # all opengl functionality is made available via glad third party code
+  # that is built along with arrayfire code base.
+  set(dummy_gl_root "${ArrayFire_BINARY_DIR}/include/GL")
+  if(APPLE)
+    set(dummy_gl_root "${ArrayFire_BINARY_DIR}/include/OpenGL")
+  endif()
+  file(WRITE "${dummy_gl_root}/gl.h" "// Dummy file to satisfy cuda_gl_interop")
+endif()
+
+# Find if CUDA Toolkit is at least 10.0 to use static
+# lapack library. Otherwise, we have to use regular shared library
+if(UNIX AND (CUDA_VERSION_MAJOR VERSION_GREATER 10 OR CUDA_VERSION_MAJOR VERSION_EQUAL 10))
+  set(use_static_cuda_lapack ON)
+else()
+  set(use_static_cuda_lapack OFF)
+endif()
+
+set(CUDA_architecture_build_targets "Auto" CACHE
+  STRING "The compute architectures targeted by this build. (Options: Auto;3.0;Maxwell;All;Common)")
+
+find_cuda_helper_libs(nvrtc)
+find_cuda_helper_libs(nvrtc-builtins)
+list(APPEND nvrtc_libs ${CUDA_nvrtc_LIBRARY})
+if(UNIX)
+  list(APPEND nvrtc_libs ${CUDA_nvrtc-builtins_LIBRARY})
+endif()
+
+if(UNIX AND AF_WITH_STATIC_CUDA_NUMERIC_LIBS)
+  # The libraries that may be statically linked or may be loaded at runtime
+  set(AF_CUDA_optionally_static_libraries)
+
+  af_multiple_option(NAME        AF_cusparse_LINK_LOADING
+    DEFAULT     "Module"
+    DESCRIPTION "The approach to load the cusparse library: static linking (Static) or dynamic runtime loading (Module)"
+    OPTIONS     "Module" "Static")
+
+  if(AF_cusparse_LINK_LOADING STREQUAL "Static")
+    af_find_static_cuda_libs(cusparse_static PRUNE)
+    list(APPEND AF_CUDA_optionally_static_libraries ${AF_CUDA_cusparse_static_LIBRARY})
+  endif()
+
+  af_find_static_cuda_libs(culibos)
+  af_find_static_cuda_libs(cublas_static PRUNE)
+  af_find_static_cuda_libs(cublasLt_static PRUNE)
+  af_find_static_cuda_libs(cufft_static)
+
+  if(CUDA_VERSION VERSION_GREATER 11.4)
+    af_find_static_cuda_libs(nvrtc_static)
+    af_find_static_cuda_libs(nvrtc-builtins_static)
+    af_find_static_cuda_libs(nvptxcompiler_static)
+    set(nvrtc_libs ${AF_CUDA_nvrtc_static_LIBRARY}
+                   ${AF_CUDA_nvrtc-builtins_static_LIBRARY}
+                   ${AF_CUDA_nvptxcompiler_static_LIBRARY})
+  endif()
+
+  # FIXME When NVCC resolves this particular issue.
+  # NVCC doesn't like -l<full_path_static_lib>, hence we cannot
+  # use ${CMAKE_*_LIBRARY} variables in the following flags.
+  set(af_cuda_static_flags "${af_cuda_static_flags};-lculibos")
+  set(af_cuda_static_flags "${af_cuda_static_flags};-lcublas_static")
+
+  if(CUDA_VERSION VERSION_GREATER 10.0)
+    set(af_cuda_static_flags "${af_cuda_static_flags};-lcublasLt_static")
+  endif()
+  set(af_cuda_static_flags "${af_cuda_static_flags};-lcufft_static")
+
+  if(${use_static_cuda_lapack})
+    af_find_static_cuda_libs(cusolver_static PRUNE)
+    set(cusolver_static_lib "${AF_CUDA_cusolver_static_LIBRARY}")
+
+    # NVIDIA LAPACK library liblapack_static.a is a subset of LAPACK and only
+    # contains GPU accelerated stedc and bdsqr. The user has to link
+    # libcusolver_static.a with liblapack_static.a in order to build
+    # successfully.
+    # Cuda Versions >= 12.0 changed lib name to libcusolver_lapack_static.a
+    if (CUDA_VERSION VERSION_GREATER_EQUAL 12.0)
+      af_find_static_cuda_libs(cusolver_lapack_static)
+    else()
+      af_find_static_cuda_libs(lapack_static)
+    endif()
+
+    set(af_cuda_static_flags "${af_cuda_static_flags};-lcusolver_static")
+  else()
+    set(cusolver_lib "${CUDA_cusolver_LIBRARY}" OpenMP::OpenMP_CXX)
+  endif()
+endif()
+
+get_filename_component(CUDA_LIBRARIES_PATH ${CUDA_cudart_static_LIBRARY} DIRECTORY CACHE)
+
+mark_as_advanced(
+    CUDA_LIBRARIES_PATH
+    CUDA_architecture_build_targets)
+
+if(CUDA_VERSION_MAJOR VERSION_LESS 11)
+  find_package(CUB)
+  if(NOT TARGET CUB::CUB)
+    af_dep_check_and_populate(${cub_prefix}
+      URI https://github.com/NVIDIA/cub.git
+      REF 1.10.0
     )
+    find_package(CUB REQUIRED
+      PATHS ${${cub_prefix}_SOURCE_DIR})
+  endif()
+endif()
 
-FILE(GLOB c_sources
-    "../../api/c/*.cpp"
-    )
-
-source_group(api\\c\\Headers FILES ${c_headers})
-source_group(api\\c\\Sources FILES ${c_sources})
+file(GLOB jit_src "kernel/jit.cuh")
 
-FILE(GLOB cpp_sources
-    "../../api/cpp/*.cpp"
+file_to_string(
+    SOURCES ${jit_src}
+    VARNAME jit_files
+    EXTENSION "hpp"
+    OUTPUT_DIR "kernel_headers"
+    TARGETS jit_kernel_targets
+    NAMESPACE "arrayfire cuda"
+    WITH_EXTENSION
     )
 
-source_group(api\\cpp\\Sources FILES ${cpp_sources})
-
-IF(${CUDA_COMPUTE_CAPABILITY} STREQUAL "21")
-    SET(PTX_COMPUTE "20")
-ELSEIF(${CUDA_COMPUTE_CAPABILITY} STREQUAL "32")
-    SET(PTX_COMPUTE "30")
-ELSEIF(${CUDA_COMPUTE_CAPABILITY} STREQUAL "52")
-    SET(PTX_COMPUTE "50")
-ELSE()
-    SET(PTX_COMPUTE ${CUDA_COMPUTE_CAPABILITY})
-ENDIF()
-
-CUDA_COMPILE_PTX(ptx_files ${ptx_sources})
-
-set(cuda_ptx "")
-foreach(ptx_src_file ${ptx_sources})
-
-      get_filename_component(_name "${ptx_src_file}" NAME_WE)
-
-      set(_gen_file_name
-        "${CMAKE_BINARY_DIR}/src/backend/cuda/cuda_compile_ptx_generated_${_name}.cu.ptx")
-      set(_out_file_name
-        "${CMAKE_BINARY_DIR}/src/backend/cuda/${_name}.ptx")
-
-      ADD_CUSTOM_COMMAND(
-        OUTPUT "${_out_file_name}"
-        DEPENDS "${_gen_file_name}"
-        COMMAND ${CMAKE_COMMAND} -E copy "${_gen_file_name}" "${_out_file_name}")
-
-      list(APPEND cuda_ptx "${_out_file_name}")
-endforeach()
-
-SET( ptx_headers
-    "ptx_headers")
-
-CL_KERNEL_TO_H(
-    SOURCES ${cuda_ptx}
-    VARNAME kernel_files
+set(nvrtc_src
+  ${CUDA_INCLUDE_DIRS}/cuda_fp16.h
+  ${CUDA_INCLUDE_DIRS}/cuda_fp16.hpp
+  ${CUDA_TOOLKIT_ROOT_DIR}/include/cuComplex.h
+  ${CUDA_TOOLKIT_ROOT_DIR}/include/math_constants.h
+  ${CUDA_TOOLKIT_ROOT_DIR}/include/vector_types.h
+  ${CUDA_TOOLKIT_ROOT_DIR}/include/vector_functions.h
+
+  ${PROJECT_SOURCE_DIR}/src/api/c/optypes.hpp
+  ${PROJECT_SOURCE_DIR}/include/af/defines.h
+  ${PROJECT_SOURCE_DIR}/include/af/traits.hpp
+  ${PROJECT_BINARY_DIR}/include/af/version.h
+
+  ${CMAKE_CURRENT_SOURCE_DIR}/Param.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/assign_kernel_param.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/backend.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/dims_param.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/interp.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/shared.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/math.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/minmax_op.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/utility.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/types.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../common/Binary.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../common/Transform.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../common/half.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../common/internal_enums.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../common/kernel_type.hpp
+
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/anisotropic_diffusion.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/approx1.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/approx2.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/assign.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/bilateral.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/canny.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve1.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve2.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve3.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve_separable.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/copy.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/diagonal.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/diff.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/exampleFunction.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/fftconvolve.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/flood_fill.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/gradient.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/histogram.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/hsv_rgb.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/identity.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/iir.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/index.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/iota.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/ireduce.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/lookup.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/lu_split.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/match_template.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/meanshift.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/medfilt.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/memcopy.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/moments.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/morph.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/pad_array_borders.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/range.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/resize.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reorder.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/rotate.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/select.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_dim.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_dim_by_key.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_first.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_first_by_key.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sobel.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sparse.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sparse_arith.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/susan.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/tile.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transform.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transpose.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transpose_inplace.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/triangle.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/unwrap.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/where.cuh
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/wrap.cuh
+  )
+
+file_to_string(
+    SOURCES ${nvrtc_src}
+    VARNAME nvrtc_files
     EXTENSION "hpp"
-    OUTPUT_DIR ${ptx_headers}
-    TARGETS ptx_targets
-    NAMESPACE "cuda"
-    EOF "1"
+    OUTPUT_DIR "nvrtc_kernel_headers"
+    TARGETS nvrtc_kernel_targets
+    NAMESPACE "arrayfire cuda"
+    WITH_EXTENSION
+    NULLTERM
     )
 
-IF("${APPLE}")
-    ADD_DEFINITIONS(-D__STRICT_ANSI__)
-    IF(${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang")
-        IF(${CUDA_VERSION_MAJOR} VERSION_LESS 7)
-            SET(STD_LIB_BINDING "-stdlib=libstdc++")
-        ELSE(${CUDA_VERSION_MAJOR} VERSION_LESS 7)
-            SET(STD_LIB_BINDING "-stdlib=libc++")
-        ENDIF()
-
-        ADD_DEFINITIONS("${STD_LIB_BINDING}")
-        SET(CMAKE_SHARED_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} ${STD_LIB_BINDING}")
-        SET(CMAKE_STATIC_LINKER_FLAGS "${CMAKE_SHARED_LINKER_FLAGS} ${STD_LIB_BINDING}")
-        SET(CUDA_HOST_COMPILER "/usr/bin/clang++")
-    ENDIF()
-ENDIF()
-
-CUDA_ADD_LIBRARY(afcuda SHARED
-                ${cuda_headers}
-                ${cuda_sources}
-                ${jit_sources}
-                ${kernel_headers}
-                ${backend_headers}
-                ${backend_sources}
-                ${c_headers}
-                ${c_sources}
-                ${cpp_sources}
-                OPTIONS "-gencode" ${CUDA_GENERATE_CODE})
-
-ADD_DEPENDENCIES(afcuda ${ptx_targets})
-
-TARGET_LINK_LIBRARIES(afcuda
-                      ${CUDA_CUBLAS_LIBRARIES}
-                      ${CUDA_LIBRARIES}
-                      ${CUDA_cusolver_LIBRARY}
-                      ${FreeImage_LIBS}
-                      ${CUDA_CUFFT_LIBRARIES}
-                      ${CUDA_NVVM_LIBRARIES}
-                      ${CUDA_CUDA_LIBRARY})
-
-IF(FORGE_FOUND)
-    TARGET_LINK_LIBRARIES(afcuda
-                          ${FORGE_LIBRARIES}
-                         )
-ENDIF()
-
-SET_TARGET_PROPERTIES(afcuda PROPERTIES
-    VERSION "${AF_VERSION}"
-    SOVERSION "${AF_VERSION_MAJOR}")
-
-INSTALL(TARGETS afcuda EXPORT CUDA DESTINATION "${AF_INSTALL_LIB_DIR}"
-        COMPONENT libraries)
-
-export(TARGETS afcuda FILE ArrayFireCUDA.cmake)
-INSTALL(EXPORT CUDA DESTINATION "${AF_INSTALL_CMAKE_DIR}"
-    COMPONENT cmake
-    FILE ArrayFireCUDA.cmake)
+include(kernel/scan_by_key/CMakeLists.txt)
+include(kernel/thrust_sort_by_key/CMakeLists.txt)
+
+
+add_library(afcuda
+    $<$<PLATFORM_ID:Windows>:${af_cuda_ver_res_file}>
+    ${thrust_sort_sources}
+
+    blas.cu
+    blas.hpp
+    cudaDataType.hpp
+    cufft.cu
+    cufft.hpp
+    cusparse_descriptor_helpers.hpp
+    fft.cu
+    sparse.cu
+    sparse.hpp
+    sparse_arith.cu
+    sparse_arith.hpp
+    sparse_blas.cu
+    sparse_blas.hpp
+    solve.cu
+    solve.hpp
+
+    EnqueueArgs.hpp
+    all.cu
+    anisotropic_diffusion.cpp
+    any.cu
+    approx.cpp
+    bilateral.cpp
+    canny.cpp
+    count.cu
+    Event.cpp
+    Event.hpp
+    exampleFunction.cpp
+    fast.cu
+    harris.cu
+    histogram.cpp
+    homography.cu
+    hsv_rgb.cpp
+    match_template.cpp
+    max.cu
+    mean.cu
+    meanshift.cpp
+    medfilt.cpp
+    min.cu
+    moments.cpp
+    nearest_neighbour.cu
+    orb.cu
+    pad_array_borders.cpp
+    product.cu
+    random_engine.cu
+    regions.cu
+    resize.cpp
+    rotate.cpp
+    set.cu
+    sift.cu
+    sobel.cpp
+    sort.cu
+    sort_by_key.cu
+    sort_index.cu
+    sum.cu
+    topk.cu
+    transform.cpp
+    transpose.cpp
+    transpose_inplace.cpp
+
+    kernel/anisotropic_diffusion.hpp
+    kernel/approx.hpp
+    kernel/assign.hpp
+    kernel/atomics.hpp
+    kernel/bilateral.hpp
+    kernel/canny.hpp
+    kernel/config.hpp
+    kernel/convolve.hpp
+    kernel/convolve_separable.cpp
+    kernel/diagonal.hpp
+    kernel/diff.hpp
+    kernel/exampleFunction.hpp
+    kernel/fast.hpp
+    kernel/fast_lut.hpp
+    kernel/fftconvolve.hpp
+    kernel/flood_fill.hpp
+    kernel/gradient.hpp
+    kernel/harris.hpp
+    kernel/histogram.hpp
+    kernel/homography.hpp
+    kernel/hsv_rgb.hpp
+    kernel/identity.hpp
+    kernel/iir.hpp
+    kernel/index.hpp
+    kernel/interp.hpp
+    kernel/iota.hpp
+    kernel/ireduce.hpp
+    kernel/lookup.hpp
+    kernel/lu_split.hpp
+    kernel/match_template.hpp
+    kernel/mean.hpp
+    kernel/meanshift.hpp
+    kernel/medfilt.hpp
+    kernel/memcopy.hpp
+    kernel/moments.hpp
+    kernel/morph.hpp
+    kernel/nearest_neighbour.hpp
+    kernel/orb.hpp
+    kernel/orb_patch.hpp
+    kernel/pad_array_borders.hpp
+    kernel/random_engine.hpp
+    kernel/random_engine_mersenne.hpp
+    kernel/random_engine_philox.hpp
+    kernel/random_engine_threefry.hpp
+    kernel/range.hpp
+    kernel/reduce.hpp
+    kernel/reduce_by_key.hpp
+    kernel/regions.hpp
+    kernel/reorder.hpp
+    kernel/resize.hpp
+    kernel/rotate.hpp
+    kernel/scan_dim.hpp
+    kernel/scan_dim_by_key.hpp
+    kernel/scan_dim_by_key_impl.hpp
+    kernel/scan_first.hpp
+    kernel/scan_first_by_key.hpp
+    kernel/scan_first_by_key_impl.hpp
+    kernel/select.hpp
+    kernel/shared.hpp
+    kernel/shfl_intrinsics.hpp
+    kernel/sift.hpp
+    kernel/sobel.hpp
+    kernel/sort.hpp
+    kernel/sort_by_key.hpp
+    kernel/sparse.hpp
+    kernel/sparse_arith.hpp
+    kernel/susan.hpp
+    kernel/thrust_sort_by_key.hpp
+    kernel/thrust_sort_by_key_impl.hpp
+    kernel/tile.hpp
+    kernel/topk.hpp
+    kernel/transform.hpp
+    kernel/transpose.hpp
+    kernel/transpose_inplace.hpp
+    kernel/triangle.hpp
+    kernel/unwrap.hpp
+    kernel/where.hpp
+    kernel/wrap.hpp
+
+    Array.cpp
+    Array.hpp
+    Kernel.cpp
+    Kernel.hpp
+    LookupTable1D.hpp
+    Module.hpp
+    Param.hpp
+    ThrustAllocator.cuh
+    ThrustArrayFirePolicy.hpp
+    anisotropic_diffusion.hpp
+    approx.hpp
+    arith.hpp
+    assign.cpp
+    assign.hpp
+    backend.hpp
+    bilateral.hpp
+    binary.hpp
+    blas.hpp
+    canny.hpp
+    cast.hpp
+    cholesky.cpp
+    cholesky.hpp
+    complex.hpp
+    compile_module.cpp
+    convolve.cpp
+    convolve.hpp
+    convolveNN.cpp
+    copy.cpp
+    copy.hpp
+    cublas.cpp
+    cublas.hpp
+
+    $<$<BOOL:${AF_WITH_CUDNN}>: cudnn.cpp
+                             cudnn.hpp
+                             cudnnModule.cpp
+                             cudnnModule.hpp>
+
+    cufft.hpp
+    cusolverDn.cpp
+    cusolverDn.hpp
+    cusparse.cpp
+    cusparse.hpp
+    cusparseModule.cpp
+    cusparseModule.hpp
+    device_manager.cpp
+    device_manager.hpp
+    debug_cuda.hpp
+    thrust_utils.hpp
+    diagonal.cpp
+    diagonal.hpp
+    diff.cpp
+    diff.hpp
+    driver.cpp
+    err_cuda.hpp
+    exampleFunction.hpp
+    fast.hpp
+    fast_pyramid.cpp
+    fast_pyramid.hpp
+    fft.hpp
+    fftconvolve.cpp
+    fftconvolve.hpp
+    flood_fill.cpp
+    flood_fill.hpp
+    GraphicsResourceManager.cpp
+    GraphicsResourceManager.hpp
+    gradient.cpp
+    gradient.hpp
+    harris.hpp
+    hist_graphics.cpp
+    hist_graphics.hpp
+    histogram.hpp
+    homography.hpp
+    hsv_rgb.hpp
+    identity.cpp
+    identity.hpp
+    iir.cpp
+    iir.hpp
+    image.cpp
+    image.hpp
+    index.cpp
+    index.hpp
+    inverse.cpp
+    inverse.hpp
+    iota.cpp
+    iota.hpp
+    ireduce.cpp
+    ireduce.hpp
+    jit.cpp
+    join.cpp
+    join.hpp
+    logic.hpp
+    lookup.cpp
+    lookup.hpp
+    lu.cpp
+    lu.hpp
+    match_template.hpp
+    math.hpp
+    mean.hpp
+    meanshift.hpp
+    medfilt.hpp
+    memory.cpp
+    memory.hpp
+    minmax_op.hpp
+    moments.hpp
+    morph.cpp
+    morph.hpp
+    nearest_neighbour.hpp
+    orb.hpp
+    platform.cpp
+    platform.hpp
+    plot.cpp
+    plot.hpp
+    print.hpp
+    qr.cpp
+    qr.hpp
+    random_engine.hpp
+    range.cpp
+    range.hpp
+    reduce.hpp
+    reduce_impl.hpp
+    regions.hpp
+    reorder.cpp
+    reorder.hpp
+    resize.hpp
+    reshape.cpp
+    rotate.hpp
+    scalar.hpp
+    scan.cpp
+    scan.hpp
+    scan_by_key.cpp
+    scan_by_key.hpp
+    select.cpp
+    select.hpp
+    set.hpp
+    shift.cpp
+    shift.hpp
+    sift.hpp
+    sobel.hpp
+    solve.hpp
+    sort.hpp
+    sort_by_key.hpp
+    sort_index.hpp
+    sparse.hpp
+    sparse_arith.hpp
+    sparse_blas.hpp
+    surface.cpp
+    surface.hpp
+    susan.cpp
+    susan.hpp
+    svd.cpp
+    svd.hpp
+    tile.cpp
+    tile.hpp
+    threadsMgt.hpp
+    topk.hpp
+    traits.hpp
+    transform.hpp
+    transpose.hpp
+    triangle.cpp
+    triangle.hpp
+    types.hpp
+    unary.hpp
+    unwrap.cpp
+    unwrap.hpp
+    utility.cpp
+    utility.hpp
+    vector_field.cpp
+    vector_field.hpp
+    where.cpp
+    where.hpp
+    wrap.cpp
+    wrap.hpp
+
+    jit/BufferNode.hpp
+    jit/ShiftNode.hpp
+    jit/kernel_generators.hpp
+
+    ${scan_by_key_sources}
+  )
+
+
+if(UNIX AND AF_WITH_STATIC_CUDA_NUMERIC_LIBS)
+  check_cxx_compiler_flag("-Wl,--start-group -Werror" group_flags)
+  if(group_flags)
+    set(START_GROUP -Wl,--start-group)
+    set(END_GROUP -Wl,--end-group)
+  endif()
+
+  target_link_libraries(afcuda
+    PRIVATE
+      ${cusolver_lib}
+      ${START_GROUP}
+      ${CUDA_culibos_LIBRARY} # also a static library
+      ${AF_CUDA_cublas_static_LIBRARY}
+      ${AF_CUDA_cublasLt_static_LIBRARY}
+      ${AF_CUDA_cufft_static_LIBRARY}
+      ${AF_CUDA_optionally_static_libraries}
+      ${nvrtc_libs}
+      ${cusolver_static_lib}
+      ${END_GROUP})
+
+  if(CUDA_VERSION VERSION_GREATER 10.0)
+    target_link_libraries(afcuda
+      PRIVATE
+        ${AF_CUDA_cublasLt_static_LIBRARY})
+  endif()
+
+  if(CUDA_VERSION VERSION_GREATER 9.5)
+    target_link_libraries(afcuda
+      PRIVATE
+        ${CUDA_lapack_static_LIBRARY})
+  endif()
+
+else()
+  target_link_libraries(afcuda
+    PRIVATE
+      ${CUDA_CUBLAS_LIBRARIES}
+      ${CUDA_CUFFT_LIBRARIES}
+      ${CUDA_cusolver_LIBRARY}
+      ${nvrtc_libs}
+  )
+endif()
+
+if(CUDA_VERSION_MAJOR VERSION_LESS 11)
+  target_link_libraries(afcuda
+    PRIVATE
+      CUB::CUB
+  )
+endif()
+
+af_detect_and_set_cuda_architectures(afcuda)
+
+if(CUDA_VERSION VERSION_LESS 11.0)
+  if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
+    set_target_properties(afcuda
+      PROPERTIES
+        CUDA_STANDARD 14
+        CUDA_STANDARD_REQUIRED ON)
+  else()
+    target_compile_options(afcuda
+      PRIVATE
+        $<$<COMPILE_LANGUAGE:CUDA>:--std=c++14>)
+  endif()
+else()
+  if(CMAKE_VERSION VERSION_GREATER_EQUAL "3.18")
+    set_target_properties(afcuda
+      PROPERTIES
+        CUDA_STANDARD 17
+        CUDA_STANDARD_REQUIRED ON)
+  else()
+    target_compile_options(afcuda
+      PRIVATE
+        $<$<COMPILE_LANGUAGE:CUDA>:--std=c++17>)
+  endif()
+endif()
+
+target_compile_definitions(afcuda
+  PRIVATE
+    AF_CUDA
+
+    # CUDA_NO_HALF prevents the inclusion of the half class in the global namespace,
+    # which conflicts with the half class in ArrayFire's common namespace. Prefer
+    # using the __half class instead for CUDA.
+    CUDA_NO_HALF
+
+    $<$<BOOL:${AF_WITH_CUDNN}>:WITH_CUDNN>
+)
+
+# The new cuSparse API was introduced in 10.1.168 for Linux; the older
+# 10.1.105 fix version doesn't have it. Unconventionally, the new API was
+# introduced in a fix release of CUDA. As CMake's FindCUDA module
+# doesn't provide the patch/fix version number, we use 10.2 as the minimum
+# CUDA version to enable this new cuSparse API.
+if(CUDA_VERSION_MAJOR VERSION_GREATER 10 OR
+    (UNIX AND
+      CUDA_VERSION_MAJOR VERSION_EQUAL 10 AND CUDA_VERSION_MINOR VERSION_GREATER 1))
+  target_compile_definitions(afcuda
+    PRIVATE
+      AF_USE_NEW_CUSPARSE_API)
+endif()
+
+target_compile_options(afcuda
+  PRIVATE
+    $<$<BOOL:${AF_WITH_FAST_MATH}>:$<$<COMPILE_LANGUAGE:CUDA>:-use_fast_math>>
+    $<$<COMPILE_LANGUAGE:CUDA>:--expt-relaxed-constexpr>
+    $<$<COMPILE_LANGUAGE:CUDA>:-Xcudafe --diag_suppress=unrecognized_gcc_pragma>
+    $<$<COMPILE_LANGUAGE:CUDA>: $<$<CXX_COMPILER_ID:MSVC>:  -Xcompiler=/wd4251
+                                                            -Xcompiler=/wd4068
+                                                            -Xcompiler=/wd4275
+                                                            -Xcompiler=/wd4668
+                                                            -Xcompiler=/wd4710
+                                                            -Xcompiler=/wd4505
+                                                            -Xcompiler=/bigobj>>
+)
+
+
+if(UNIX AND AF_WITH_STATIC_CUDA_NUMERIC_LIBS AND AF_cusparse_LINK_LOADING STREQUAL "Static")
+  target_compile_definitions(afcuda
+    PRIVATE
+      AF_cusparse_STATIC_LINKING)
+endif()
+
+add_library(ArrayFire::afcuda ALIAS afcuda)
+
+add_dependencies(afcuda ${jit_kernel_targets} ${nvrtc_kernel_targets})
+
+if(UNIX AND AF_WITH_PRUNE_STATIC_CUDA_NUMERIC_LIBS)
+  add_dependencies(afcuda ${cuda_pruned_library_targets})
+endif()
+
+target_include_directories (afcuda
+  PUBLIC
+    $<BUILD_INTERFACE:${ArrayFire_SOURCE_DIR}/include>
+    $<BUILD_INTERFACE:${ArrayFire_BINARY_DIR}/include>
+    $<INSTALL_INTERFACE:${AF_INSTALL_INC_DIR}>
+  PRIVATE
+    ${ArrayFire_SOURCE_DIR}/src/api/c
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${CMAKE_CURRENT_SOURCE_DIR}/kernel
+    ${CMAKE_CURRENT_SOURCE_DIR}/jit
+    ${CMAKE_CURRENT_BINARY_DIR})
+
+target_include_directories (afcuda
+  SYSTEM PRIVATE
+    $<$<BOOL:${AF_WITH_CUDNN}>:${cuDNN_INCLUDE_DIRS}>
+    ${CUDA_INCLUDE_DIRS}
+)
+
+target_link_libraries(afcuda
+  PRIVATE
+    c_api_interface
+    cpp_api_interface
+    afcommon_interface
+    ${CMAKE_DL_LIBS}
+  )
+
+# If the driver is not found, the CUDA driver API needs to be linked against the
+# libcuda.so stub located in the lib[64]/stubs directory.
+if(CUDA_CUDA_LIBRARY)
+  target_link_libraries(afcuda PRIVATE ${CUDA_CUDA_LIBRARY})
+else()
+  message(STATUS "CUDA driver library missing. Looking for libcuda stub.")
+  find_library(CUDA_CUDA_STUB
+             NAMES cuda
+             PATHS ${CUDA_LIBRARIES_PATH}/stubs
+             NO_DEFAULT_PATH
+         )
+  if(CUDA_CUDA_STUB)
+    message(STATUS "CUDA driver stub FOUND: ${CUDA_CUDA_STUB}")
+  endif()
+
+  #NOTE: Only link against the stub library when building
+  target_link_libraries(afcuda
+    PUBLIC
+      $<BUILD_INTERFACE:${CUDA_CUDA_STUB}>)
+endif()
+
+# TODO(umar): This is required for NVRTC to work correctly on OSX. It may not
+#             be necessary on other platforms.
+if(APPLE)
+  target_link_libraries(afcuda PUBLIC -Wl,-rpath,${CUDA_LIBRARIES_PATH})
+endif()
+
+af_split_debug_info(afcuda ${AF_INSTALL_LIB_DIR})
+
+install(TARGETS afcuda
+  EXPORT ArrayFireCUDATargets
+  COMPONENT cuda
+  PUBLIC_HEADER DESTINATION af
+  RUNTIME DESTINATION ${AF_INSTALL_BIN_DIR}
+  LIBRARY DESTINATION ${AF_INSTALL_LIB_DIR}
+  ARCHIVE DESTINATION ${AF_INSTALL_LIB_DIR}
+  FRAMEWORK DESTINATION framework
+  INCLUDES DESTINATION ${AF_INSTALL_INC_DIR}
+  )
+
+set(cuda_deps "")
+set (PX ${CMAKE_SHARED_LIBRARY_PREFIX})
+set (SX ${CMAKE_SHARED_LIBRARY_SUFFIX})
+set (dlib_path_prefix ${CUDA_LIBRARIES_PATH})
+if (WIN32)
+  set(dlib_path_prefix "${CUDA_TOOLKIT_ROOT_DIR}/bin")
+endif ()
+
+function(afcu_collect_libs libname)
+  set(options "FULL_VERSION")
+  set(single_args "LIB_MAJOR;LIB_MINOR")
+  set(multi_args "")
+
+  cmake_parse_arguments(cuda_args "${options}" "${single_args}" "${multi_args}" ${ARGN})
+
+  if(cuda_args_LIB_MAJOR AND cuda_args_LIB_MINOR)
+    set(lib_major ${cuda_args_LIB_MAJOR})
+    set(lib_minor ${cuda_args_LIB_MINOR})
+  else()
+    set(lib_major ${CUDA_VERSION_MAJOR})
+    set(lib_minor ${CUDA_VERSION_MINOR})
+  endif()
+  set(lib_version "${lib_major}.${lib_minor}")
+
+  if (WIN32)
+    find_file(CUDA_${libname}_LIBRARY_DLL
+      NAMES
+        "${PX}${libname}64_${lib_major}${SX}"
+        "${PX}${libname}64_${lib_major}${lib_minor}${SX}"
+        "${PX}${libname}64_${lib_major}0_0${SX}"
+        "${PX}${libname}64_${lib_major}${lib_minor}_0${SX}"
+        "${PX}${libname}_${lib_major}0_0${SX}"
+      PATHS ${dlib_path_prefix}
+    )
+    mark_as_advanced(CUDA_${libname}_LIBRARY_DLL)
+    install(FILES "${CUDA_${libname}_LIBRARY_DLL}"
+      DESTINATION ${AF_INSTALL_BIN_DIR}
+      COMPONENT cuda_dependencies)
+  elseif (APPLE)
+    get_filename_component(outpath "${dlib_path_prefix}/${PX}${libname}.${lib_major}.${lib_minor}${SX}" REALPATH)
+    install(FILES       "${outpath}"
+            DESTINATION ${AF_INSTALL_BIN_DIR}
+            RENAME      "${PX}${libname}.${lib_version}${SX}"
+            COMPONENT   cuda_dependencies)
+  else () #UNIX
+    find_library(CUDA_${libname}_LIBRARY
+      NAMES ${libname}
+      PATHS
+        ${dlib_path_prefix})
+
+    get_filename_component(outpath "${CUDA_${libname}_LIBRARY}" REALPATH)
+    if(cuda_args_FULL_VERSION)
+      set(library_install_name "${PX}${libname}${SX}.${lib_version}")
+    else()
+      set(library_install_name "${PX}${libname}${SX}.${lib_major}")
+    endif()
+    install(FILES       ${outpath}
+            DESTINATION ${AF_INSTALL_LIB_DIR}
+            RENAME      ${library_install_name}
+            COMPONENT   cuda_dependencies)
+  endif ()
+endfunction()
+
+function(afcu_collect_cudnn_libs cudnn_infix)
+  set(internal_infix "_")
+  if(NOT "${cudnn_infix}" STREQUAL "")
+    set(internal_infix "_${cudnn_infix}_")
+    string(TOUPPER ${internal_infix} internal_infix)
+  endif()
+  if(WIN32)
+    set(cudnn_lib "${cuDNN${internal_infix}DLL_LIBRARY}")
+  else()
+    get_filename_component(cudnn_lib "${cuDNN${internal_infix}LINK_LIBRARY}" REALPATH)
+  endif()
+  install(FILES ${cudnn_lib} DESTINATION ${AF_INSTALL_LIB_DIR} COMPONENT cuda_dependencies)
+endfunction()
+
+if(AF_INSTALL_STANDALONE)
+  if(AF_WITH_CUDNN)
+    afcu_collect_cudnn_libs("")
+    if(cuDNN_VERSION_MAJOR VERSION_EQUAL 8)
+      # cuDNN changed how DLLs are shipped starting with major version 8:
+      # apart from the main DLL, many of the other DLLs are loaded on demand
+      afcu_collect_cudnn_libs(cnn_infer)
+      afcu_collect_cudnn_libs(cnn_train)
+      afcu_collect_cudnn_libs(ops_infer)
+      afcu_collect_cudnn_libs(ops_train)
+    elseif(cuDNN_VERSION_MAJOR VERSION_GREATER_EQUAL 9)
+      # infer and train libraries are now combined in version 9
+      afcu_collect_cudnn_libs(cnn)
+      afcu_collect_cudnn_libs(ops)
+    endif()
+  endif()
+
+  if(WIN32 OR NOT AF_WITH_STATIC_CUDA_NUMERIC_LIBS)
+    if(CUDA_VERSION_MAJOR VERSION_EQUAL 12)
+        afcu_collect_libs(cufft LIB_MAJOR 11 LIB_MINOR 3)
+    elseif(CUDA_VERSION_MAJOR VERSION_EQUAL 11)
+        afcu_collect_libs(cufft LIB_MAJOR 10 LIB_MINOR 4)
+    else()
+        afcu_collect_libs(cufft)
+    endif()
+    afcu_collect_libs(cublas)
+    if(CUDA_VERSION VERSION_GREATER 10.0)
+      afcu_collect_libs(cublasLt)
+    endif()
+    if(CUDA_VERSION_MAJOR VERSION_EQUAL 12)
+        afcu_collect_libs(cusolver LIB_MAJOR 11 LIB_MINOR 7)
+    else()
+        afcu_collect_libs(cusolver)
+    endif()
+    afcu_collect_libs(cusparse)
+    if(CUDA_VERSION VERSION_GREATER 12.0)
+      afcu_collect_libs(nvJitLink)
+    endif()
+  elseif(NOT ${use_static_cuda_lapack})
+    if(CUDA_VERSION_MAJOR VERSION_EQUAL 12)
+        afcu_collect_libs(cusolver LIB_MAJOR 11 LIB_MINOR 7)
+    else()
+        afcu_collect_libs(cusolver)
+    endif()
+  endif()
+
+  if(WIN32 OR CUDA_VERSION VERSION_LESS 11.5 OR NOT AF_WITH_STATIC_CUDA_NUMERIC_LIBS)
+    afcu_collect_libs(nvrtc)
+    if(CUDA_VERSION VERSION_GREATER 10.0)
+      afcu_collect_libs(nvrtc-builtins FULL_VERSION)
+    else()
+      if(APPLE)
+        afcu_collect_libs(cudart)
+
+        get_filename_component(nvrtc_outpath "${dlib_path_prefix}/${PX}nvrtc-builtins.${CUDA_VERSION_MAJOR}.${CUDA_VERSION_MINOR}${SX}" REALPATH)
+        install(FILES       ${nvrtc_outpath}
+                DESTINATION ${AF_INSTALL_BIN_DIR}
+                RENAME      "${PX}nvrtc-builtins${SX}"
+                COMPONENT   cuda_dependencies)
+      elseif(UNIX)
+        get_filename_component(nvrtc_outpath "${dlib_path_prefix}/${PX}nvrtc-builtins${SX}" REALPATH)
+        install(FILES       ${nvrtc_outpath}
+                DESTINATION ${AF_INSTALL_LIB_DIR}
+                RENAME      "${PX}nvrtc-builtins${SX}"
+                COMPONENT   cuda_dependencies)
+      else()
+        afcu_collect_libs(nvrtc-builtins)
+      endif()
+    endif()
+  endif()
+endif()
+
+
+source_group(include REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/include/*)
+source_group(api\\cpp REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/cpp/*)
+source_group(api\\c   REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/c/*)
+source_group(backend  REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/backend/common/*|${CMAKE_CURRENT_SOURCE_DIR}/*)
+source_group(backend\\kernel  REGULAR_EXPRESSION ${CMAKE_CURRENT_SOURCE_DIR}/kernel/*|${CMAKE_CURRENT_SOURCE_DIR}/kernel/thrust_sort_by_key/*|${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/*)
+source_group("generated files"  FILES ${ArrayFire_BINARY_DIR}/src/backend/build_version.hpp ${ArrayFire_BINARY_DIR}/include/af/version.h
+                                REGULAR_EXPRESSION ${CMAKE_CURRENT_BINARY_DIR}/${kernel_headers_dir}/*)
+source_group("" FILES CMakeLists.txt)
+
+mark_as_advanced(
+  FETCHCONTENT_SOURCE_DIR_NV_CUB
+  FETCHCONTENT_UPDATES_DISCONNECTED_NV_CUB
+)
diff --git a/src/backend/cuda/EnqueueArgs.hpp b/src/backend/cuda/EnqueueArgs.hpp
new file mode 100644
index 0000000000..f3fb608b4c
--- /dev/null
+++ b/src/backend/cuda/EnqueueArgs.hpp
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <cuda.h>
+#include <cuda_runtime.h>
+
+#include <vector>
+
+namespace arrayfire {
+namespace cuda {
+
+///
+/// EnqueueArgs is a kernel launch configuration composition object
+///
+/// This structure is a composition of various parameters that are
+/// required to successfully launch a CUDA kernel.
+///
+struct EnqueueArgs {
+    // TODO(pradeep): this can be easily templated
+    // template<typename Queue, typename Event>
+    dim3 mBlocks;                  ///< Number of blocks per grid/kernel-launch
+    dim3 mThreads;                 ///< Number of threads per block
+    CUstream mStream;              ///< CUDA stream to enqueue the kernel on
+    unsigned int mSharedMemSize;   ///< Size (in bytes) of shared memory used
+    std::vector<CUevent> mEvents;  ///< Events to wait for kernel execution
+
+    ///
+    /// \brief EnqueueArgs constructor
+    ///
+    /// \param[in] blks is number of blocks per grid
+    /// \param[in] thrds is number of threads per block
+    /// \param[in] stream is the CUDA stream on which to enqueue the kernel
+    /// \param[in] sharedMemSize is number of bytes of shared memory allocation
+    /// \param[in] events is list of events to wait for kernel execution
+    ///
+    EnqueueArgs(dim3 blks, dim3 thrds, CUstream stream = 0,
+                const unsigned int sharedMemSize   = 0,
+                const std::vector<CUevent> &events = {})
+        : mBlocks(blks)
+        , mThreads(thrds)
+        , mStream(stream)
+        , mSharedMemSize(sharedMemSize)
+        , mEvents(events) {}
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/Event.cpp b/src/backend/cuda/Event.cpp
new file mode 100644
index 0000000000..fb5fbff170
--- /dev/null
+++ b/src/backend/cuda/Event.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Event.hpp>
+
+#include <common/err_common.hpp>
+#include <cuda_runtime_api.h>
+#include <events.hpp>
+#include <platform.hpp>
+#include <af/event.h>
+
+#include <memory>
+
+namespace arrayfire {
+namespace cuda {
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(cudaStream_t queue) {
+    Event e;
+    if (e.create() == CUDA_SUCCESS) { e.mark(queue); }
+    return e;
+}
+
+af_event createEvent() {
+    // The default CUDA stream needs to be initialized before the CUDA driver
+    // context can be used
+    getActiveStream();
+    auto e = std::make_unique<Event>();
+    if (e->create() != CUDA_SUCCESS) {
+        AF_ERROR("Could not create event", AF_ERR_RUNTIME);
+    }
+    return getHandle(*(e.release()));
+}
+
+void markEventOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active stream
+    cudaStream_t stream = getActiveStream();
+    if (event.mark(stream) != CUDA_SUCCESS) {
+        AF_ERROR("Could not mark event on active stream", AF_ERR_RUNTIME);
+    }
+}
+
+void enqueueWaitOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active stream
+    cudaStream_t stream = getActiveStream();
+    if (event.enqueueWait(stream) != CUDA_SUCCESS) {
+        AF_ERROR("Could not enqueue wait on active stream for event",
+                 AF_ERR_RUNTIME);
+    }
+}
+
+void block(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    if (event.block() != CUDA_SUCCESS) {
+        AF_ERROR("Could not block on active stream for event", AF_ERR_RUNTIME);
+    }
+}
+
+af_event createAndMarkEvent() {
+    af_event handle = createEvent();
+    markEventOnActiveQueue(handle);
+    return handle;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/Event.hpp b/src/backend/cuda/Event.hpp
new file mode 100644
index 0000000000..2db9679aca
--- /dev/null
+++ b/src/backend/cuda/Event.hpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/EventBase.hpp>
+#include <cuda.h>
+#include <cuda_runtime_api.h>
+#include <af/event.h>
+
+namespace arrayfire {
+namespace cuda {
+
+class CUDARuntimeEventPolicy {
+   public:
+    using EventType = CUevent;
+    using QueueType = CUstream;
+    using ErrorType = CUresult;
+
+    static ErrorType createAndMarkEvent(CUevent *e) noexcept {
+        // Creating events with the CU_EVENT_BLOCKING_SYNC flag
+        // severely impacts the speed if/when creating many arrays
+        auto err = cuEventCreate(e, CU_EVENT_DISABLE_TIMING);
+        return err;
+    }
+
+    static ErrorType markEvent(CUevent *e, QueueType &stream) noexcept {
+        auto err = cuEventRecord(*e, stream);
+        return err;
+    }
+
+    static ErrorType waitForEvent(CUevent *e, QueueType &stream) noexcept {
+        auto err = cuStreamWaitEvent(stream, *e, 0);
+        return err;
+    }
+
+    static ErrorType syncForEvent(CUevent *e) noexcept {
+        return cuEventSynchronize(*e);
+    }
+
+    static ErrorType destroyEvent(CUevent *e) noexcept {
+        auto err = cuEventDestroy(*e);
+        return err;
+    }
+};
+
+using Event = common::EventBase<CUDARuntimeEventPolicy>;
+
+/// \brief Creates a new event and marks it in the stream
+Event makeEvent(cudaStream_t queue);
+
+af_event createEvent();
+
+void markEventOnActiveQueue(af_event eventHandle);
+
+void enqueueWaitOnActiveQueue(af_event eventHandle);
+
+void block(af_event eventHandle);
+
+af_event createAndMarkEvent();
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/GraphicsResourceManager.cpp b/src/backend/cuda/GraphicsResourceManager.cpp
new file mode 100644
index 0000000000..cca78f286f
--- /dev/null
+++ b/src/backend/cuda/GraphicsResourceManager.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <GraphicsResourceManager.hpp>
+
+#include <common/graphics_common.hpp>
+// cuda_gl_interop.h does not include OpenGL headers for ARM
+// __gl_h_ should be defined by glad.h inclusion
+#include <cuda_gl_interop.h>
+#include <err_cuda.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+GraphicsResourceManager::ShrdResVector
+GraphicsResourceManager::registerResources(
+    const std::vector<uint32_t>& resources) {
+    ShrdResVector output;
+
+    auto deleter = [](cudaGraphicsResource_t* handle) {
+        // FIXME Having a CUDA_CHECK around unregister
+        // call is causing invalid GL context.
+        // Moving ForgeManager class singleton as data
+        // member of DeviceManager with proper ordering
+        // of member destruction doesn't help either.
+        // Calling makeContextCurrent also doesn't help.
+        cudaGraphicsUnregisterResource(*handle);
+        delete handle;
+    };
+
+    for (auto id : resources) {
+        cudaGraphicsResource_t r;
+        CUDA_CHECK(cudaGraphicsGLRegisterBuffer(
+            &r, id, cudaGraphicsMapFlagsWriteDiscard));
+        output.emplace_back(new cudaGraphicsResource_t(r), deleter);
+    }
+
+    return output;
+}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/GraphicsResourceManager.hpp b/src/backend/cuda/GraphicsResourceManager.hpp
new file mode 100644
index 0000000000..dde6a30ab5
--- /dev/null
+++ b/src/backend/cuda/GraphicsResourceManager.hpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/InteropManager.hpp>
+#include <driver_types.h>
+
+#include <map>
+#include <vector>
+
+namespace arrayfire {
+namespace cuda {
+class GraphicsResourceManager
+    : public common::InteropManager<GraphicsResourceManager,
+                                    cudaGraphicsResource_t> {
+   public:
+    using ShrdResVector = std::vector<std::shared_ptr<cudaGraphicsResource_t>>;
+
+    GraphicsResourceManager() {}
+    static ShrdResVector registerResources(
+        const std::vector<uint32_t> &resources);
+
+   protected:
+    GraphicsResourceManager(GraphicsResourceManager const &);
+    void operator=(GraphicsResourceManager const &);
+};
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/JIT/BinaryNode.hpp b/src/backend/cuda/JIT/BinaryNode.hpp
deleted file mode 100644
index 2a2abb0610..0000000000
--- a/src/backend/cuda/JIT/BinaryNode.hpp
+++ /dev/null
@@ -1,151 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace cuda
-{
-
-namespace JIT
-{
-
-    class BinaryNode : public Node
-    {
-    private:
-        std::string m_op_str;
-        Node_ptr m_lhs, m_rhs;
-        int m_op;
-
-    public:
-        BinaryNode(const char *out_type_str, const char *name_str,
-                   const std::string &op_str,
-                   Node_ptr lhs, Node_ptr rhs, int op)
-            : Node(out_type_str, name_str),
-              m_op_str(op_str),
-              m_lhs(lhs),
-              m_rhs(rhs),
-              m_op(op)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return m_lhs->isLinear(dims) && m_rhs->isLinear(dims);
-        }
-
-        void genParams(std::stringstream &kerStream,
-                       std::stringstream &annStream, bool is_linear)
-        {
-            if (m_gen_param) return;
-            if (!(m_lhs->isGenParam())) m_lhs->genParams(kerStream, annStream, is_linear);
-            if (!(m_rhs->isGenParam())) m_rhs->genParams(kerStream, annStream, is_linear);
-            m_gen_param = true;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            if (!(m_lhs->isGenOffset())) m_lhs->genOffsets(kerStream, is_linear);
-            if (!(m_rhs->isGenOffset())) m_rhs->genOffsets(kerStream, is_linear);
-            m_gen_offset = true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            m_lhs->genKerName(kerStream);
-            m_rhs->genKerName(kerStream);
-
-            if (m_gen_name) return;
-            // Make the hex representation of enum part of the Kernel name
-            kerStream << "_" << std::setw(2) << std::setfill('0') << std::hex << m_op;
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_lhs->getId();
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_rhs->getId();
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream, str_map_t &declStrs, bool is_linear)
-        {
-            if (m_gen_func) return;
-
-            if (!(m_lhs->isGenFunc())) m_lhs->genFuncs(kerStream, declStrs, is_linear);
-            if (!(m_rhs->isGenFunc())) m_rhs->genFuncs(kerStream, declStrs, is_linear);
-
-            std::stringstream declStream;
-            declStream << "declare " << m_type_str << " " << m_op_str
-                       << "(" << m_lhs->getTypeStr() << " , " << m_rhs->getTypeStr() << ")\n";
-
-            str_map_iter loc = declStrs.find(declStream.str());
-            if (loc == declStrs.end()) {
-                declStrs[declStream.str()] = true;
-            }
-
-            kerStream << "%val" << m_id << " = call "
-                      << m_type_str << " "
-                      << m_op_str << "("
-                      << m_lhs->getTypeStr() << " "
-                      << "%val" << m_lhs->getId() << ", "
-                      << m_rhs->getTypeStr() << " "
-                      << "%val" << m_rhs->getId() << ")\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            id = m_lhs->setId(id);
-            id = m_rhs->setId(id);
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            m_lhs->getInfo(len, buf_count, bytes);
-            m_rhs->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_set_arg = false;
-            m_lhs->resetFlags();
-            m_rhs->resetFlags();
-        }
-
-        void setArgs(std::vector<void *> &args, bool is_linear)
-        {
-            if (m_set_arg) return;
-
-            m_lhs->setArgs(args, is_linear);
-            m_rhs->setArgs(args, is_linear);
-
-            m_set_arg = true;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cuda/JIT/BufferNode.hpp b/src/backend/cuda/JIT/BufferNode.hpp
deleted file mode 100644
index efe32f8b72..0000000000
--- a/src/backend/cuda/JIT/BufferNode.hpp
+++ /dev/null
@@ -1,209 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace cuda
-{
-
-namespace JIT
-{
-
-    template <typename T>
-    static inline std::string toString(T val)
-    {
-        std::stringstream s;
-        s << val;
-        return s.str();
-    }
-
-    template<typename T>
-    class BufferNode : public Node
-    {
-    private:
-        // Keep the shared pointer for reference counting
-        shared_ptr<T> sptr;
-        CParam<T> m_param;
-        unsigned m_bytes;
-
-        bool m_linear;
-    public:
-
-        BufferNode(const char *type_str,
-                   const char *name_str,
-                   shared_ptr<T> data,
-                   CParam<T> param,
-                   unsigned bytes,
-                   bool is_linear)
-            : Node(type_str, name_str),
-              sptr(data),
-              m_param(param),
-              m_bytes(bytes),
-              m_linear(is_linear)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            bool same_dims = true;
-            for (int i = 0; same_dims && i < 4; i++) {
-                same_dims &= (dims[i] == m_param.dims[i]);
-            }
-            return m_linear && same_dims;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            if (m_gen_name) return;
-
-            kerStream << "_" << m_name_str;
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genParams(std::stringstream &kerStream,
-                       std::stringstream &annStream, bool is_linear)
-        {
-            if (m_gen_param) return;
-            kerStream << m_type_str << "* %in" << m_id << ",\n";
-            annStream << m_type_str << "*,\n";
-
-            if (!is_linear) {
-                kerStream << "i32 %dim0"     << m_id << ","
-                          << "i32 %dim1"     << m_id << ","
-                          << "i32 %dim2"     << m_id << ","
-                          << "i32 %dim3"     << m_id << ","
-                          << "\n"
-                          << "i32 %str1"     << m_id << ","
-                          << "i32 %str2"     << m_id << ","
-                          << "i32 %str3"     << m_id << ","
-                          << "\n";
-
-                annStream << "i32, i32, i32, i32,\n"
-                          << "i32, i32, i32,\n";
-            }
-
-            m_gen_param = true;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-
-            if (!is_linear) {
-                kerStream << "%b3" << m_id << " = icmp slt i32 %id3, %dim3" << m_id << "\n";
-                kerStream << "%b2" << m_id << " = icmp slt i32 %id2, %dim2" << m_id << "\n";
-                kerStream << "%b1" << m_id << " = icmp slt i32 %id1, %dim1" << m_id << "\n";
-                kerStream << "%b0" << m_id << " = icmp slt i32 %id0, %dim0" << m_id << "\n";
-
-                kerStream << "%c3" << m_id << " = zext i1 %b3" << m_id << " to i32\n";
-                kerStream << "%c2" << m_id << " = zext i1 %b2" << m_id << " to i32\n";
-                kerStream << "%c1" << m_id << " = zext i1 %b1" << m_id << " to i32\n";
-                kerStream << "%c0" << m_id << " = zext i1 %b0" << m_id << " to i32\n";
-
-                kerStream << "%d3" << m_id << " = mul i32 %c3" << m_id << ", %id3\n";
-                kerStream << "%d2" << m_id << " = mul i32 %c2" << m_id << ", %id2\n";
-                kerStream << "%d1" << m_id << " = mul i32 %c1" << m_id << ", %id1\n";
-                kerStream << "%d0" << m_id << " = mul i32 %c0" << m_id << ", %id0\n";
-
-                kerStream << "%off3i" << m_id << " = mul i32 %d3" << m_id
-                          << ", %str3" << m_id << "\n";
-
-                kerStream << "%off2i" << m_id << " = mul i32 %d2" << m_id
-                          << ", %str2" << m_id << "\n";
-
-                kerStream << "%off1i" << m_id << " = mul i32 %d1" << m_id
-                          << ", %str1" << m_id << "\n";
-
-                kerStream << "%off23i" << m_id << " = add i32 %off2i"
-                          << m_id << ", %off3i" << m_id << "\n";
-
-                kerStream << "%off123i" << m_id << " = add i32 %off23i"
-                          << m_id << ", %off1i" << m_id << "\n";
-
-                kerStream << "%idxa" << m_id << " = add i32 %off123i"
-                          << m_id << ", %d0" << m_id << "\n";
-
-                kerStream << "%idx" << m_id << " = sext i32 %idxa" << m_id <<" to i64\n\n";
-            }
-
-            m_gen_offset = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream, str_map_t &declStrs, bool is_linear)
-        {
-            if (m_gen_func) return;
-
-            kerStream << "%inIdx" << m_id << " = "
-                      << "getelementptr inbounds " << m_type_str << "* %in" << m_id
-                      << ", i64 %idx";
-
-            if (!is_linear) kerStream << m_id;
-            kerStream << "\n";
-
-            kerStream << "%val" << m_id << " = " << "load "
-                      << m_type_str << "* %inIdx" << m_id << "\n\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            len++;
-            buf_count++;
-            bytes += m_bytes;
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_gen_name = false;
-            m_set_arg = false;
-        }
-
-        void setArgs(std::vector<void *> &args, bool is_linear)
-        {
-            if (m_set_arg) return;
-            args.push_back((void *)&(m_param.ptr));
-
-            if (!is_linear) {
-                args.push_back((void *)&m_param.dims[0]);
-                args.push_back((void *)&m_param.dims[1]);
-                args.push_back((void *)&m_param.dims[2]);
-                args.push_back((void *)&m_param.dims[3]);
-                args.push_back((void *)&m_param.strides[1]);
-                args.push_back((void *)&m_param.strides[2]);
-                args.push_back((void *)&m_param.strides[3]);
-            }
-            m_set_arg = true;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cuda/JIT/Node.hpp b/src/backend/cuda/JIT/Node.hpp
deleted file mode 100644
index e30a1cf63b..0000000000
--- a/src/backend/cuda/JIT/Node.hpp
+++ /dev/null
@@ -1,89 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <af/array.h>
-#include <optypes.hpp>
-#include <string>
-#include <vector>
-#include <map>
-#include <boost/shared_ptr.hpp>
-
-namespace cuda
-{
-
-namespace JIT
-{
-    typedef std::map<std::string, bool> str_map_t;
-    typedef str_map_t::iterator str_map_iter;
-    using boost::shared_ptr;
-
-    class Node
-    {
-    protected:
-        std::string m_type_str;
-        std::string m_name_str;
-        int m_id;
-        bool m_set_id;
-        bool m_gen_func;
-        bool m_gen_param;
-        bool m_gen_offset;
-        bool m_set_arg;
-        bool m_gen_name;
-
-    public:
-
-        Node(const char *type_str, const char *name_str)
-            : m_type_str(type_str),
-              m_name_str(name_str),
-              m_id(-1),
-              m_set_id(false),
-              m_gen_func(false),
-              m_gen_param(false),
-              m_gen_offset(false),
-              m_set_arg(false),
-              m_gen_name(false)
-        {}
-
-        virtual void genKerName(std::stringstream &kerStream) {}
-        virtual void genParams  (std::stringstream &kerStream,
-                                 std::stringstream &annStream, bool is_linear) {}
-        virtual void genOffsets (std::stringstream &kerStream, bool is_linear) {}
-        virtual void genFuncs   (std::stringstream &kerStream, str_map_t &declStrs, bool is_linear)
-        { m_gen_func = true;}
-
-        virtual int setId(int id) { m_set_id = true; return id; }
-        virtual void setArgs(std::vector<void *> &args, bool is_linear) { m_set_arg = true; }
-        virtual bool isLinear(dim_t dims[4]) { return true; }
-
-        virtual void resetFlags() {}
-        virtual void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            len = 0;
-            buf_count = 0;
-            bytes = 0;
-        }
-
-        std::string getTypeStr() { return m_type_str; }
-
-        bool isGenFunc() { return m_gen_func; }
-        bool isGenParam() { return m_gen_param; }
-        bool isGenOffset() { return m_gen_offset; }
-
-        int getId()  { return m_id; }
-        std::string getNameStr() { return m_name_str; }
-
-        virtual ~Node() {}
-    };
-
-    typedef shared_ptr<Node> Node_ptr;
-
-}
-
-}
diff --git a/src/backend/cuda/JIT/ScalarNode.hpp b/src/backend/cuda/JIT/ScalarNode.hpp
deleted file mode 100644
index 288af4dcdb..0000000000
--- a/src/backend/cuda/JIT/ScalarNode.hpp
+++ /dev/null
@@ -1,107 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <types.hpp>
-#include "Node.hpp"
-#include <math.hpp>
-#include <iomanip>
-
-namespace cuda
-{
-
-namespace JIT
-{
-    template<typename T>
-    class ScalarNode : public Node
-    {
-    private:
-        T m_val;
-    public:
-
-        ScalarNode(T val)
-            : Node(irname<T>(), afShortName<T>(false)),
-              m_val(val)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            if (m_gen_name) return;
-
-            kerStream << "_" << m_name_str;
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genParams(std::stringstream &kerStream,
-                       std::stringstream &annStream,
-                       bool is_linear)
-        {
-            if (m_gen_param) return;
-            kerStream << m_type_str << " %val" << m_id << ", " << std::endl;
-            annStream << m_type_str << ",\n";
-            m_gen_param = true;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            m_gen_offset = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream, str_map_t &declStrs, bool is_linear)
-        {
-            if (m_gen_func) return;
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-            len++;
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_gen_name = false;
-            m_set_arg = false;
-        }
-
-        void setArgs(std::vector<void *> &args, bool is_linear)
-        {
-            if (m_set_arg) return;
-            args.push_back((void *)&m_val);
-            m_set_arg = true;
-        }
-    };
-}
-
-}
diff --git a/src/backend/cuda/JIT/UnaryNode.hpp b/src/backend/cuda/JIT/UnaryNode.hpp
deleted file mode 100644
index caa573104b..0000000000
--- a/src/backend/cuda/JIT/UnaryNode.hpp
+++ /dev/null
@@ -1,139 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace cuda
-{
-
-namespace JIT
-{
-
-    class UnaryNode : public Node
-    {
-    private:
-        std::string m_op_str;
-        Node_ptr m_child;
-        int m_op;
-
-    public:
-        UnaryNode(const char *out_type_str, const char *name_str,
-                  const std::string &op_str,
-                  Node_ptr child, int op)
-            : Node(out_type_str, name_str),
-              m_op_str(op_str),
-              m_child(child),
-              m_op(op)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return m_child->isLinear(dims);
-        }
-
-        void genParams(std::stringstream &kerStream,
-                       std::stringstream &annStream, bool is_linear)
-        {
-            if (m_gen_param) return;
-            if (!(m_child->isGenParam())) m_child->genParams(kerStream, annStream, is_linear);
-            m_gen_param = true;
-        }
-
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            if (!(m_child->isGenOffset())) m_child->genOffsets(kerStream, is_linear);
-            m_gen_offset = true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            m_child->genKerName(kerStream);
-
-            if (m_gen_name) return;
-
-            // Make the hex representation of enum part of the Kernel name
-            kerStream << "_" << std::setw(2) << std::setfill('0') << std::hex << m_op;
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_child->getId();
-            kerStream << std::setw(2) << std::setfill('0') << std::hex << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream, str_map_t &declStrs, bool is_linear)
-        {
-            if (m_gen_func) return;
-
-            if (!(m_child->isGenFunc())) m_child->genFuncs(kerStream, declStrs, is_linear);
-
-            std::stringstream declStream;
-            declStream << "declare " << m_type_str << " " << m_op_str
-                       << "(" << m_child->getTypeStr() << ")\n";
-
-            str_map_iter loc = declStrs.find(declStream.str());
-            if (loc == declStrs.end()) {
-                declStrs[declStream.str()] = true;
-            }
-
-            kerStream << "%val" << m_id << " = call "
-                      << m_type_str << " "
-                      << m_op_str << "("
-                      << m_child->getTypeStr() << " "
-                      << "%val" << m_child->getId() << ")\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            id = m_child->setId(id);
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            m_child->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_set_arg = false;
-            m_child->resetFlags();
-        }
-
-        void setArgs(std::vector<void *> &args, bool is_linear)
-        {
-            if (m_set_arg) return;
-            m_child->setArgs(args, is_linear);
-            m_set_arg = true;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/cuda/JIT/arith.cu b/src/backend/cuda/JIT/arith.cu
deleted file mode 100644
index 01e5f41b9f..0000000000
--- a/src/backend/cuda/JIT/arith.cu
+++ /dev/null
@@ -1,42 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define ARITH_BASIC(fn, op, T)                  \
-    __device__ T ___##fn(T a, T b)              \
-    {                                           \
-        return a op b;                          \
-    }                                           \
-
-
-#define ARITH(fn, op)                                   \
-    ARITH_BASIC(fn, op, float)                          \
-    ARITH_BASIC(fn, op, double)                         \
-    ARITH_BASIC(fn, op, int)                            \
-    ARITH_BASIC(fn, op, uint)                           \
-    ARITH_BASIC(fn, op, char)                           \
-    ARITH_BASIC(fn, op, uchar)                          \
-    ARITH_BASIC(fn, op, intl)                           \
-    ARITH_BASIC(fn, op, uintl)                          \
-                                                        \
-    __device__ cfloat ___##fn(cfloat a, cfloat b)       \
-    {                                                   \
-        return cuC##fn##f(a, b);                        \
-    }                                                   \
-                                                        \
-    __device__ cdouble ___##fn(cdouble a, cdouble b)    \
-    {                                                   \
-        return cuC##fn(a, b);                           \
-    }                                                   \
-
-ARITH(add, +)
-ARITH(sub, -)
-ARITH(mul, *)
-ARITH(div, /)
diff --git a/src/backend/cuda/JIT/cast.cu b/src/backend/cuda/JIT/cast.cu
deleted file mode 100644
index 6b1c9b0d33..0000000000
--- a/src/backend/cuda/JIT/cast.cu
+++ /dev/null
@@ -1,94 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define CAST_BASIC(FN, To, Ti) __device__ To FN(Ti in) { return (To) in; }
-
-#define CAST_BASIC_BOOL(FN, To, Ti) __device__ To FN(Ti in) { return (To)(in != 0); }
-
-#define CAST(T, X)                              \
-    CAST_BASIC(___mk##X, T, float)              \
-    CAST_BASIC(___mk##X, T, double)             \
-    CAST_BASIC(___mk##X, T, int)                \
-    CAST_BASIC(___mk##X, T, uint)               \
-    CAST_BASIC(___mk##X, T, char)               \
-    CAST_BASIC(___mk##X, T, uchar)              \
-    CAST_BASIC(___mk##X, T, intl)               \
-    CAST_BASIC(___mk##X, T, uintl)              \
-
-CAST(float, S)
-CAST(double, D)
-CAST(int, I)
-CAST(intl, X)
-CAST(uint, U)
-CAST(uchar, V)
-CAST(uintl, Y)
-
-CAST_BASIC_BOOL(___mkJ, char, float)
-CAST_BASIC_BOOL(___mkJ, char, double)
-CAST_BASIC_BOOL(___mkJ, char, int)
-CAST_BASIC_BOOL(___mkJ, char, uint)
-CAST_BASIC_BOOL(___mkJ, char, char)
-CAST_BASIC_BOOL(___mkJ, char, uchar)
-CAST_BASIC_BOOL(___mkJ, char, intl)
-CAST_BASIC_BOOL(___mkJ, char, uintl)
-
-#define CPLX_BASIC(FN, To, Tr, Ti)              \
-    __device__ To FN(Ti in)                     \
-    {                                           \
-        To out = {(Tr)in, 0};                   \
-        return out;                             \
-    }                                           \
-
-#define CPLX_CAST(T, Tr, X)                     \
-    CPLX_BASIC(___mk##X, T, Tr, float)          \
-    CPLX_BASIC(___mk##X, T, Tr, double)         \
-    CPLX_BASIC(___mk##X, T, Tr, int)            \
-    CPLX_BASIC(___mk##X, T, Tr, uint)           \
-    CPLX_BASIC(___mk##X, T, Tr, char)           \
-    CPLX_BASIC(___mk##X, T, Tr, uchar)          \
-
-CPLX_CAST(cfloat, float, C)
-CPLX_CAST(cdouble, double, Z)
-
-__device__ cfloat ___mkC(cfloat C)
-{
-    return C;
-}
-
-__device__ cfloat ___mkC(cdouble C)
-{
-    cfloat res = {C.x, C.y};
-    return res;
-}
-
-__device__ cdouble ___mkZ(cdouble C)
-{
-    return C;
-}
-
-__device__ cdouble ___mkZ(cfloat C)
-{
-    cdouble res = {C.x, C.y};
-    return res;
-}
-
-__device__ float ___real(cfloat in) { return in.x; }
-__device__ double ___real(cdouble in) { return in.x; }
-
-
-__device__ float ___imag(cfloat in) { return in.y; }
-__device__ double ___imag(cdouble in) { return in.y; }
-
-__device__ cfloat ___cplx(float l, float r) { cfloat out = {l, r}; return out; }
-__device__ cdouble ___cplx(double l, double r) { cdouble out = {l, r}; return out; }
-
-__device__ cfloat  ___conj(cfloat  in) { return cuConjf(in); }
-__device__ cdouble ___conj(cdouble in) { return cuConj (in); }
diff --git a/src/backend/cuda/JIT/exp.cu b/src/backend/cuda/JIT/exp.cu
deleted file mode 100644
index 9568342823..0000000000
--- a/src/backend/cuda/JIT/exp.cu
+++ /dev/null
@@ -1,81 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define MATH_BASIC(fn, T)                       \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return fn##f((float)a);                 \
-    }                                           \
-
-
-#define MATH(fn)                                \
-    MATH_BASIC(fn, float)                       \
-    MATH_BASIC(fn, int)                         \
-    MATH_BASIC(fn, uint)                        \
-    MATH_BASIC(fn, char)                        \
-    MATH_BASIC(fn, uchar)                       \
-    __device__ double ___##fn(double a)         \
-    {                                           \
-        return fn(a);                           \
-    }                                           \
-
-
-MATH(exp)
-MATH(expm1)
-MATH(erf)
-MATH(erfc)
-
-MATH(log)
-MATH(log10)
-MATH(log1p)
-MATH(log2)
-
-MATH(sqrt)
-MATH(cbrt)
-
-#define MATH2_BASIC(fn, T)                      \
-    __device__ T ___##fn(T a, T b)              \
-    {                                           \
-        return fn##f((float)a, (float)b);       \
-    }                                           \
-
-#define MATH2(fn)                                   \
-    MATH2_BASIC(fn, float)                          \
-    MATH2_BASIC(fn, int)                            \
-    MATH2_BASIC(fn, uint)                           \
-    MATH2_BASIC(fn, char)                           \
-    MATH2_BASIC(fn, uchar)                          \
-    __device__ double ___##fn(double a, double b)   \
-    {                                               \
-        return fn(a, b);                            \
-    }                                               \
-
-MATH2(pow)
-
-__device__ cfloat ___pow(cfloat a, float b)
-{
-    float R = cuCabsf(a);
-    float Theta = atan2(a.y, a.x);
-    float R_b = powf(R, b);
-    float Theta_b = Theta * b;
-    cfloat res = {R_b * cosf(Theta_b), R_b * sinf(Theta_b)};
-    return res;
-}
-
-__device__ cdouble ___pow(cdouble a, float b)
-{
-    float R = cuCabs(a);
-    float Theta = atan2(a.y, a.x);
-    float R_b = pow(R, b);
-    float Theta_b = Theta * b;
-    cdouble res = {R_b * cos(Theta_b), R_b * sin(Theta_b)};
-    return res;
-}
diff --git a/src/backend/cuda/JIT/hyper.cu b/src/backend/cuda/JIT/hyper.cu
deleted file mode 100644
index 23cdb84b7c..0000000000
--- a/src/backend/cuda/JIT/hyper.cu
+++ /dev/null
@@ -1,37 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define MATH_BASIC(fn, T)                       \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return fn##f((float)a);                 \
-    }                                           \
-
-
-#define MATH(fn)                                \
-    MATH_BASIC(fn, float)                       \
-    MATH_BASIC(fn, int)                         \
-    MATH_BASIC(fn, uint)                        \
-    MATH_BASIC(fn, char)                        \
-    MATH_BASIC(fn, uchar)                       \
-    __device__ double ___##fn(double a)         \
-    {                                           \
-        return fn(a);                           \
-    }                                           \
-
-
-MATH(sinh)
-MATH(cosh)
-MATH(tanh)
-
-MATH(asinh)
-MATH(acosh)
-MATH(atanh)
diff --git a/src/backend/cuda/JIT/logic.cu b/src/backend/cuda/JIT/logic.cu
deleted file mode 100644
index a9c8f71655..0000000000
--- a/src/backend/cuda/JIT/logic.cu
+++ /dev/null
@@ -1,99 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define LOGIC_BASIC(fn, op, T)                  \
-    __device__ bool ___##fn(T a, T b)           \
-    {                                           \
-        return a op b;                          \
-    }                                           \
-
-
-#define LOGIC(fn, op)                               \
-    LOGIC_BASIC(fn, op, float)                      \
-    LOGIC_BASIC(fn, op, double)                     \
-    LOGIC_BASIC(fn, op, int)                        \
-    LOGIC_BASIC(fn, op, uint)                       \
-    LOGIC_BASIC(fn, op, char)                       \
-    LOGIC_BASIC(fn, op, uchar)                      \
-    LOGIC_BASIC(fn, op, intl)                       \
-    LOGIC_BASIC(fn, op, uintl)                      \
-                                                    \
-    __device__ bool ___##fn(cfloat a, cfloat b)     \
-    {                                               \
-        return cabs2(a) op cabs2(b);                \
-    }                                               \
-                                                    \
-    __device__ bool ___##fn(cdouble a, cdouble b)   \
-    {                                               \
-        return cabs2(a) op cabs2(b);                \
-    }                                               \
-
-LOGIC(lt, <)
-LOGIC(gt, >)
-LOGIC(le, <=)
-LOGIC(ge, >=)
-LOGIC(and, &&)
-LOGIC(or, ||)
-
-#define LOGIC_EQ(fn, op, op2)                       \
-    LOGIC_BASIC(fn, op, float)                      \
-    LOGIC_BASIC(fn, op, double)                     \
-    LOGIC_BASIC(fn, op, int)                        \
-    LOGIC_BASIC(fn, op, uint)                       \
-    LOGIC_BASIC(fn, op, char)                       \
-    LOGIC_BASIC(fn, op, uchar)                      \
-                                                    \
-    __device__ bool ___##fn(cfloat a, cfloat b)     \
-    {                                               \
-        return (a.x op b.x) op2 (a.y op b.y);       \
-    }                                               \
-                                                    \
-    __device__ bool ___##fn(cdouble a, cdouble b)   \
-    {                                               \
-        return (a.x op b.x) op2 (a.y op b.y);       \
-    }                                               \
-
-LOGIC_EQ(eq, ==, &&)
-LOGIC_EQ(neq, !=, ||)
-
-#define NOT_FN(T)                                   \
-    __device__ bool ___not(T in) { return !in; }    \
-
-NOT_FN(float)
-NOT_FN(double)
-NOT_FN(int)
-NOT_FN(uint)
-NOT_FN(char)
-NOT_FN(uchar)
-NOT_FN(intl)
-NOT_FN(uintl)
-
-#define BIT_FN(T)                                                   \
-    __device__ T ___bitand   (T lhs, T rhs) { return lhs &  rhs; }  \
-    __device__ T ___bitor    (T lhs, T rhs) { return lhs |  rhs; }  \
-    __device__ T ___bitxor   (T lhs, T rhs) { return lhs ^  rhs; }  \
-    __device__ T ___bitshiftl(T lhs, T rhs) { return lhs << rhs; }  \
-    __device__ T ___bitshiftr(T lhs, T rhs) { return lhs >> rhs; }  \
-
-BIT_FN(int)
-BIT_FN(char)
-BIT_FN(intl)
-BIT_FN(uchar)
-BIT_FN(uint)
-BIT_FN(uintl)
-
-__device__ char ___isNaN(float in) { return isnan(in); }
-__device__ char ___isINF(float in) { return isinf(in); }
-__device__ char ___iszero(float in) { return (in == 0); }
-
-__device__ char ___isNaN(double in) { return isnan(in); }
-__device__ char ___isINF(double in) { return isinf(in); }
-__device__ char ___iszero(double in) { return (in == 0); }
diff --git a/src/backend/cuda/JIT/numeric.cu b/src/backend/cuda/JIT/numeric.cu
deleted file mode 100644
index 5471710a5d..0000000000
--- a/src/backend/cuda/JIT/numeric.cu
+++ /dev/null
@@ -1,144 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-template<typename T> __device__ T sign(T a) { return signbit(a); }
-
-#define MATH_BASIC(fn, T)                       \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return fn(a);                           \
-    }                                           \
-
-
-#define MATH_NOOP(fn, T)                        \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return a;                               \
-    }                                           \
-
-
-#define MATH_CAST(fn, T, Tc)                    \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return (T)fn((Tc)a);                    \
-    }                                           \
-
-MATH_BASIC(floor, float)
-MATH_BASIC(floor, double)
-MATH_NOOP(floor, int)
-MATH_NOOP(floor, uint)
-MATH_NOOP(floor, char)
-MATH_NOOP(floor, uchar)
-
-MATH_BASIC(ceil, float)
-MATH_BASIC(ceil, double)
-MATH_NOOP(ceil, int)
-MATH_NOOP(ceil, uint)
-MATH_NOOP(ceil, char)
-MATH_NOOP(ceil, uchar)
-
-MATH_BASIC(round, float)
-MATH_BASIC(round, double)
-MATH_NOOP(round, int)
-MATH_NOOP(round, uint)
-MATH_NOOP(round, char)
-MATH_NOOP(round, uchar)
-
-MATH_BASIC(trunc, float)
-MATH_BASIC(trunc, double)
-MATH_NOOP(trunc, int)
-MATH_NOOP(trunc, uint)
-MATH_NOOP(trunc, char)
-MATH_NOOP(trunc, uchar)
-
-MATH_BASIC(sign, float)
-MATH_BASIC(sign, double)
-MATH_NOOP(sign, int)
-MATH_NOOP(sign, uint)
-MATH_NOOP(sign, char)
-MATH_NOOP(sign, uchar)
-
-MATH_BASIC(abs, float)
-MATH_BASIC(abs, double)
-MATH_BASIC(abs, int)
-MATH_CAST(abs, char, int)
-MATH_NOOP(abs, uint)
-MATH_NOOP(abs, uchar)
-
-MATH_BASIC(tgamma, float)
-MATH_BASIC(tgamma, double)
-MATH_CAST(tgamma, int, float)
-MATH_CAST(tgamma, uint, float)
-MATH_CAST(tgamma, char, float)
-MATH_CAST(tgamma, uchar, float)
-
-MATH_BASIC(lgamma, float)
-MATH_BASIC(lgamma, double)
-MATH_CAST(lgamma, int, float)
-MATH_CAST(lgamma, uint, float)
-MATH_CAST(lgamma, char, float)
-MATH_CAST(lgamma, uchar, float)
-
-__device__ float ___abs(cfloat a) { return cuCabsf(a); }
-__device__ double ___abs(cdouble a) { return cuCabs(a); }
-
-template<typename T> __device__ T rem(T a, T b) { return a % b; }
-__device__ float rem(float a, float b) { return remainderf(a, b); }
-__device__ double rem(double a, double b) { return remainder(a, b); }
-
-template<typename T> __device__ T mod(T a, T b) { return a % b; }
-__device__ float mod(float a, float b) { return fmodf(a, b); }
-__device__ double mod(double a, double b) { return fmod(a, b); }
-
-#define MATH2_BASIC(fn, T)                      \
-    __device__ T ___##fn(T a, T b)              \
-    {                                           \
-        return fn(a, b);                        \
-    }                                           \
-
-#define MATH2(fn)                                   \
-    MATH2_BASIC(fn, float)                          \
-    MATH2_BASIC(fn, int)                            \
-    MATH2_BASIC(fn, uint)                           \
-    MATH2_BASIC(fn, intl)                           \
-    MATH2_BASIC(fn, uintl)                          \
-    MATH2_BASIC(fn, char)                           \
-    MATH2_BASIC(fn, uchar)                          \
-    __device__ double ___##fn(double a, double b)   \
-    {                                               \
-        return fn(a, b);                            \
-    }                                               \
-
-MATH2(min)
-MATH2(max)
-MATH2(mod)
-MATH2(rem)
-
-__device__ float ___hypot(float a, float b)
-{
-    return hypot(a, b);
-}
-
-__device__ double ___hypot(double a, double b)
-{
-    return hypot(a, b);
-}
-
-#define COMPARE_CPLX(fn, op, T)                 \
-    __device__ T ___##fn(T a, T b)              \
-    {                                           \
-        return cabs2(a) op cabs2(b) ? a : b;    \
-    }                                           \
-
-COMPARE_CPLX(min, <, cfloat)
-COMPARE_CPLX(min, <, cdouble)
-COMPARE_CPLX(max, >, cfloat)
-COMPARE_CPLX(max, >, cdouble)
diff --git a/src/backend/cuda/JIT/trig.cu b/src/backend/cuda/JIT/trig.cu
deleted file mode 100644
index b224e379c4..0000000000
--- a/src/backend/cuda/JIT/trig.cu
+++ /dev/null
@@ -1,54 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "types.h"
-
-#define MATH_BASIC(fn, T)                       \
-    __device__ T ___##fn(T a)                   \
-    {                                           \
-        return fn##f((float)a);                 \
-    }                                           \
-
-
-#define MATH(fn)                                \
-    MATH_BASIC(fn, float)                       \
-    MATH_BASIC(fn, int)                         \
-    MATH_BASIC(fn, uint)                        \
-    MATH_BASIC(fn, char)                        \
-    MATH_BASIC(fn, uchar)                       \
-    __device__ double ___##fn(double a)         \
-    {                                           \
-        return fn(a);                           \
-    }                                           \
-
-
-MATH(sin)
-MATH(cos)
-MATH(tan)
-
-MATH(asin)
-MATH(acos)
-MATH(atan)
-
-#define ATAN2(T)                                \
-    __device__ T ___atan2(T x, T y)             \
-    {                                           \
-        return atan2((float)x, (float)y);       \
-    }                                           \
-
-ATAN2(float)
-ATAN2(int)
-ATAN2(uint)
-ATAN2(char)
-ATAN2(uchar)
-
-__device__ double ___atan2(double x, double y)
-{
-    return atan2(x, y);
-}
diff --git a/src/backend/cuda/JIT/types.h b/src/backend/cuda/JIT/types.h
deleted file mode 100644
index 80314bc34d..0000000000
--- a/src/backend/cuda/JIT/types.h
+++ /dev/null
@@ -1,20 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <cuComplex.h>
-#include <math_functions.h>
-typedef unsigned char uchar;
-typedef unsigned int uint;
-typedef cuFloatComplex cfloat;
-typedef cuDoubleComplex cdouble;
-typedef long long intl;
-typedef unsigned long long uintl;
-
-__device__ __inline__ float cabs2(cfloat in) { return in.x * in.x + in.y * in.y;}
-__device__ __inline__ double cabs2(cdouble in) { return in.x * in.x + in.y * in.y; }
diff --git a/src/backend/cuda/Kernel.cpp b/src/backend/cuda/Kernel.cpp
new file mode 100644
index 0000000000..d72672a1fc
--- /dev/null
+++ b/src/backend/cuda/Kernel.cpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Kernel.hpp>
+
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+Kernel::DevPtrType Kernel::getDevPtr(const char* name) {
+    Kernel::DevPtrType out = 0;
+    size_t size            = 0;
+    CU_CHECK(cuModuleGetGlobal(&out, &size, this->getModuleHandle(), name));
+    return out;
+}
+
+void Kernel::copyToReadOnly(Kernel::DevPtrType dst, Kernel::DevPtrType src,
+                            size_t bytes) {
+    CU_CHECK(cuMemcpyDtoDAsync(dst, src, bytes, getActiveStream()));
+}
+
+void Kernel::setFlag(Kernel::DevPtrType dst, int* scalarValPtr,
+                     const bool syncCopy) {
+    CU_CHECK(
+        cuMemcpyHtoDAsync(dst, scalarValPtr, sizeof(int), getActiveStream()));
+    if (syncCopy) { CU_CHECK(cuStreamSynchronize(getActiveStream())); }
+}
+
+int Kernel::getFlag(Kernel::DevPtrType src) {
+    int retVal = 0;
+    CU_CHECK(cuMemcpyDtoHAsync(&retVal, src, sizeof(int), getActiveStream()));
+    CU_CHECK(cuStreamSynchronize(getActiveStream()));
+    return retVal;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/Kernel.hpp b/src/backend/cuda/Kernel.hpp
new file mode 100644
index 0000000000..2199292080
--- /dev/null
+++ b/src/backend/cuda/Kernel.hpp
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/KernelInterface.hpp>
+#include <common/Logger.hpp>
+
+#include <EnqueueArgs.hpp>
+#include <backend.hpp>
+#include <err_cuda.hpp>
+#include <cstdlib>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+
+struct Enqueuer {
+    static auto getLogger() {
+        static auto logger = common::loggerFactory("kernel");
+        return logger.get();
+    };
+
+    template<typename... Args>
+    void operator()(std::string name, void* ker, const EnqueueArgs& qArgs,
+                    Args... args) {
+        void* params[] = {static_cast<void*>(&args)...};
+        for (auto& event : qArgs.mEvents) {
+            CU_CHECK(cuStreamWaitEvent(qArgs.mStream, event, 0));
+        }
+        AF_TRACE(
+            "Launching {}: Blocks: [{}, {}, {}] Threads: [{}, {}, {}] Shared "
+            "Memory: {}",
+            name, qArgs.mBlocks.x, qArgs.mBlocks.y, qArgs.mBlocks.z,
+            qArgs.mThreads.x, qArgs.mThreads.y, qArgs.mThreads.z,
+            qArgs.mSharedMemSize);
+        CU_CHECK(cuLaunchKernel(static_cast<CUfunction>(ker), qArgs.mBlocks.x,
+                                qArgs.mBlocks.y, qArgs.mBlocks.z,
+                                qArgs.mThreads.x, qArgs.mThreads.y,
+                                qArgs.mThreads.z, qArgs.mSharedMemSize,
+                                qArgs.mStream, params, NULL));
+    }
+};
+
+class Kernel
+    : public common::KernelInterface<CUmodule, CUfunction, Enqueuer,
+                                     CUdeviceptr> {
+   public:
+    using ModuleType = CUmodule;
+    using KernelType = CUfunction;
+    using DevPtrType = CUdeviceptr;
+    using BaseClass =
+        common::KernelInterface<ModuleType, KernelType, Enqueuer, DevPtrType>;
+
+    Kernel() : BaseClass("", nullptr, nullptr) {}
+    Kernel(std::string name, ModuleType mod, KernelType ker)
+        : BaseClass(name, mod, ker) {}
+
+    DevPtrType getDevPtr(const char* name) final;
+
+    void copyToReadOnly(DevPtrType dst, DevPtrType src, size_t bytes) final;
+
+    void setFlag(DevPtrType dst, int* scalarValPtr,
+                 const bool syncCopy = false) final;
+
+    int getFlag(DevPtrType src) final;
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/LookupTable1D.hpp b/src/backend/cuda/LookupTable1D.hpp
new file mode 100644
index 0000000000..f688ac4b7e
--- /dev/null
+++ b/src/backend/cuda/LookupTable1D.hpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+
+#include <type_traits>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+class LookupTable1D {
+   public:
+    LookupTable1D()                                     = delete;
+    LookupTable1D(const LookupTable1D& arg)             = delete;
+    LookupTable1D(const LookupTable1D&& arg)            = delete;
+    LookupTable1D& operator=(const LookupTable1D& arg)  = delete;
+    LookupTable1D& operator=(const LookupTable1D&& arg) = delete;
+
+    LookupTable1D(const Array<T>& lutArray) : mData(lutArray), mTexture(0) {
+        cudaResourceDesc resDesc;
+        memset(&resDesc, 0, sizeof(resDesc));
+
+        cudaTextureDesc texDesc;
+        memset(&texDesc, 0, sizeof(texDesc));
+
+        resDesc.resType                = cudaResourceTypeLinear;
+        resDesc.res.linear.devPtr      = mData.get();
+        resDesc.res.linear.desc.x      = sizeof(T) * 8;
+        resDesc.res.linear.sizeInBytes = mData.elements() * sizeof(T);
+
+        if (std::is_floating_point<T>::value)
+            resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
+        else if (std::is_signed<T>::value)
+            resDesc.res.linear.desc.f = cudaChannelFormatKindSigned;
+        else
+            resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
+
+        texDesc.readMode = cudaReadModeElementType;
+
+        CUDA_CHECK(
+            cudaCreateTextureObject(&mTexture, &resDesc, &texDesc, NULL));
+    }
+
+    ~LookupTable1D() {
+        if (mTexture) { cudaDestroyTextureObject(mTexture); }
+    }
+
+    cudaTextureObject_t get() const noexcept { return mTexture; }
+
+   private:
+    // Keep a copy so that ref count doesn't go down to zero when
+    // original Array<T> goes out of scope before LookupTable1D object does.
+    Array<T> mData;
+    cudaTextureObject_t mTexture;
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
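The constructor in LookupTable1D.hpp derives a channel format from `T` using type traits. One thing to watch with that approach is that `std::is_signed<float>::value` is `true`, so a classifier has to test for floating point before testing for signedness. A minimal host-only sketch of that ordering follows; `ChannelKind`, `channelKindOf` and `channelBits` are hypothetical names standing in for `cudaChannelFormatKind` and the constructor logic, so that the snippet builds without the CUDA toolkit.

```cpp
#include <type_traits>

// Hypothetical stand-in for cudaChannelFormatKind (assumption, not CUDA API).
enum class ChannelKind { Signed, Unsigned, Float };

// Classify T the way a texture channel descriptor needs it.
// Floating point is checked first because std::is_signed is also
// true for float and double.
template<typename T>
constexpr ChannelKind channelKindOf() {
    return std::is_floating_point<T>::value ? ChannelKind::Float
           : std::is_signed<T>::value       ? ChannelKind::Signed
                                            : ChannelKind::Unsigned;
}

// Bits per channel, mirroring `sizeof(T) * 8` in the constructor.
template<typename T>
constexpr int channelBits() {
    return static_cast<int>(sizeof(T) * 8);
}
```

With this ordering, `channelKindOf<float>()` yields `ChannelKind::Float`, while `int` and `unsigned char` map to `Signed` and `Unsigned` respectively.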
diff --git a/src/backend/cuda/Module.hpp b/src/backend/cuda/Module.hpp
new file mode 100644
index 0000000000..88881611fc
--- /dev/null
+++ b/src/backend/cuda/Module.hpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/ModuleInterface.hpp>
+#include <err_cuda.hpp>
+
+#include <cuda.h>
+
+#include <string>
+#include <unordered_map>
+
+namespace arrayfire {
+namespace cuda {
+
+/// CUDA backend wrapper for CUmodule
+class Module : public common::ModuleInterface<CUmodule> {
+   private:
+    std::unordered_map<std::string, std::string> mInstanceMangledNames;
+
+   public:
+    using ModuleType = CUmodule;
+    using BaseClass  = common::ModuleInterface<ModuleType>;
+
+    Module() = default;
+    Module(ModuleType mod) : BaseClass(mod) {
+        mInstanceMangledNames.reserve(1);
+    }
+
+    operator bool() const final { return get(); }
+
+    void unload() final {
+        CU_CHECK(cuModuleUnload(get()));
+        set(nullptr);
+    }
+
+    const std::string mangledName(const std::string& instantiation) const {
+        auto iter = mInstanceMangledNames.find(instantiation);
+        if (iter != mInstanceMangledNames.end()) {
+            return iter->second;
+        } else {
+            return std::string("");
+        }
+    }
+
+    void add(const std::string& instantiation, const std::string& mangledName) {
+        mInstanceMangledNames.emplace(instantiation, mangledName);
+    }
+
+    const auto& map() const { return mInstanceMangledNames; }
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
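The `Module` class above keeps a map from a kernel instantiation string to its mangled name, returning an empty string when the instantiation is unknown. The lookup semantics can be sketched in isolation as below; `NameTable` is a hypothetical name, and the CUDA module handle is left out on purpose so that the snippet is plain C++.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, CUDA-free stand-in for the name map held by Module.
class NameTable {
    std::unordered_map<std::string, std::string> mNames;

   public:
    // emplace() does not overwrite: the first mangled name registered
    // for an instantiation is the one that is kept.
    void add(const std::string& instantiation, const std::string& mangled) {
        mNames.emplace(instantiation, mangled);
    }

    // Returns the mangled name, or an empty string if nothing was registered.
    std::string mangledName(const std::string& instantiation) const {
        auto iter = mNames.find(instantiation);
        return iter != mNames.end() ? iter->second : std::string();
    }
};
```

Callers can therefore treat an empty result as "not compiled yet", which appears to be how the empty-string return in `Module::mangledName` is meant to be used.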
diff --git a/src/backend/cuda/Param.hpp b/src/backend/cuda/Param.hpp
index c07aaa72b5..496d4eea68 100644
--- a/src/backend/cuda/Param.hpp
+++ b/src/backend/cuda/Param.hpp
@@ -8,46 +8,77 @@
  ********************************************************/
 
 #pragma once
-#include <af/defines.h>
+
 #include <backend.hpp>
+#include <types.hpp>
+#include <af/defines.h>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
-struct Param
-{
-    T *ptr;
+class Param {
+   public:
     dim_t dims[4];
     dim_t strides[4];
+    T *ptr;
+
+    __DH__ Param() noexcept : dims(), strides(), ptr(nullptr) {}
+
+    __DH__
+    Param(T *iptr, const dim_t *idims, const dim_t *istrides) noexcept
+        : dims{idims[0], idims[1], idims[2], idims[3]}
+        , strides{istrides[0], istrides[1], istrides[2], istrides[3]}
+        , ptr(iptr) {}
+
+    __DH__ size_t elements() const noexcept {
+        return dims[0] * dims[1] * dims[2] * dims[3];
+    }
+
+    dim_t *dims_ptr() { return dims; }
+    dim_t *strides_ptr() { return strides; }
+
+    Param(const Param<T> &other) noexcept               = default;
+    Param(Param<T> &&other) noexcept                    = default;
+    Param<T> &operator=(const Param<T> &other) noexcept = default;
+    Param<T> &operator=(Param<T> &&other) noexcept      = default;
 };
 
 template<typename T>
-class CParam
-{
-public:
-    const T *ptr;
+Param<T> flat(Param<T> in) {
+    in.dims[0] = in.elements();
+    in.dims[1] = 1;
+    in.dims[2] = 1;
+    in.dims[3] = 1;
+    return in;
+}
+
+template<typename T>
+class CParam {
+   public:
     dim_t dims[4];
     dim_t strides[4];
+    const T *ptr;
 
-    __DH__ CParam(const T *iptr, const dim_t *idims, const dim_t *istrides) :
-        ptr(iptr)
-    {
-        for (int i = 0; i < 4; i++) {
-            dims[i] = idims[i];
-            strides[i] = istrides[i];
-        }
-    }
+    __DH__ CParam(const T *iptr, const dim_t *idims, const dim_t *istrides)
+        : dims{idims[0], idims[1], idims[2], idims[3]}
+        , strides{istrides[0], istrides[1], istrides[2], istrides[3]}
+        , ptr(iptr) {}
 
-    __DH__ CParam(Param<T> &in) : ptr(in.ptr)
-    {
-        for (int i = 0; i < 4; i++) {
-            dims[i] = in.dims[i];
-            strides[i] = in.strides[i];
-        }
+    __DH__ CParam(Param<T> &in)
+        : dims{in.dims[0], in.dims[1], in.dims[2], in.dims[3]}
+        , strides{in.strides[0], in.strides[1], in.strides[2], in.strides[3]}
+        , ptr(in.ptr) {}
+
+    __DH__ size_t elements() const noexcept {
+        return dims[0] * dims[1] * dims[2] * dims[3];
     }
 
-    __DH__ ~CParam() {}
+    CParam(const CParam<T> &other) noexcept               = default;
+    CParam(CParam<T> &&other) noexcept                    = default;
+    CParam<T> &operator=(const CParam<T> &other) noexcept = default;
+    CParam<T> &operator=(CParam<T> &&other) noexcept      = default;
 };
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
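The new `flat()` helper in Param.hpp collapses the four dimensions into one while leaving `strides` and `ptr` untouched, which presumably restricts it to linear (contiguous) buffers. A host-only sketch of that behaviour is shown below; `HostParam` is a hypothetical simplification without the `__DH__` qualifiers, and `dim_t` is assumed to be `long long` here.

```cpp
#include <cstddef>

using dim_t = long long;  // assumption: matches ArrayFire's dim_t on 64-bit

// Hypothetical host-only counterpart of cuda::Param<T>.
template<typename T>
struct HostParam {
    dim_t dims[4];
    dim_t strides[4];
    T* ptr;

    // Total number of elements described by dims.
    std::size_t elements() const {
        return static_cast<std::size_t>(dims[0] * dims[1] * dims[2] * dims[3]);
    }
};

// Collapse all dimensions into dims[0]; strides and ptr are left untouched,
// so the result is only meaningful for contiguous data.
template<typename T>
HostParam<T> flat(HostParam<T> in) {
    in.dims[0] = static_cast<dim_t>(in.elements());
    in.dims[1] = 1;
    in.dims[2] = 1;
    in.dims[3] = 1;
    return in;
}
```

Because the argument is taken by value, the caller's object keeps its original shape and only the returned copy is flattened.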
diff --git a/src/backend/cuda/ThrustAllocator.cuh b/src/backend/cuda/ThrustAllocator.cuh
new file mode 100644
index 0000000000..93a4a8fc6d
--- /dev/null
+++ b/src/backend/cuda/ThrustAllocator.cuh
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <memory.hpp>
+#include <thrust/device_malloc_allocator.h>
+#include <thrust/device_vector.h>
+
+// The class definition below comes from the following URL
+// http://stackoverflow.com/questions/9007343/mix-custom-memory-managment-and-thrust-in-cuda
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+struct ThrustAllocator : thrust::device_malloc_allocator<T> {
+    // shorthand for the name of the base class
+    typedef thrust::device_malloc_allocator<T> super_t;
+
+    // get access to some of the base class's typedefs
+    // note that because we inherited from device_malloc_allocator,
+    // pointer is actually thrust::device_ptr<T>
+    typedef typename super_t::pointer pointer;
+
+    typedef typename super_t::size_type size_type;
+
+    pointer allocate(size_type elements) {
+        return thrust::device_ptr<T>(
+            memAlloc<T>(elements)
+                .release());  // delegate to ArrayFire allocator
+    }
+
+    void deallocate(pointer p, size_type n) {
+        UNUSED(n);
+        memFree(p.get());  // delegate to ArrayFire allocator
+    }
+};
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/ThrustArrayFirePolicy.hpp b/src/backend/cuda/ThrustArrayFirePolicy.hpp
new file mode 100644
index 0000000000..339d3ea088
--- /dev/null
+++ b/src/backend/cuda/ThrustArrayFirePolicy.hpp
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <backend.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <thrust/memory.h>
+#include <thrust/system/cuda/execution_policy.h>
+
+namespace arrayfire {
+namespace cuda {
+struct ThrustArrayFirePolicy
+    : thrust::cuda::execution_policy<ThrustArrayFirePolicy> {};
+
+template<typename T>
+thrust::pair<thrust::pointer<T, ThrustArrayFirePolicy>, std::ptrdiff_t>
+get_temporary_buffer(ThrustArrayFirePolicy, std::ptrdiff_t n) {
+    thrust::pointer<T, ThrustArrayFirePolicy> result(
+        arrayfire::cuda::memAlloc<T>(n / sizeof(T)).release());
+
+    return thrust::make_pair(result, n);
+}
+
+template<typename Pointer>
+inline void return_temporary_buffer(ThrustArrayFirePolicy, Pointer p) {
+    memFree(thrust::raw_pointer_cast(p));
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
+
+#if defined(_WIN32)
+THRUST_NAMESPACE_BEGIN
+#else
+namespace thrust {
+#endif
+namespace cuda_cub {
+template<>
+__DH__ inline cudaStream_t get_stream<arrayfire::cuda::ThrustArrayFirePolicy>(
+    execution_policy<arrayfire::cuda::ThrustArrayFirePolicy> &) {
+#if defined(__CUDA_ARCH__)
+    return 0;
+#else
+    return arrayfire::cuda::getActiveStream();
+#endif
+}
+
+__DH__
+inline cudaError_t synchronize_stream(
+    const arrayfire::cuda::ThrustArrayFirePolicy &) {
+#if defined(__CUDA_ARCH__)
+    return cudaSuccess;
+#else
+    return cudaStreamSynchronize(arrayfire::cuda::getActiveStream());
+#endif
+}
+
+}  // namespace cuda_cub
+#if defined(_WIN32)
+THRUST_NAMESPACE_END
+#else
+}  // namespace thrust
+#endif
diff --git a/src/backend/cuda/all.cu b/src/backend/cuda/all.cu
index 5d79de4e0d..fa0681dbaf 100644
--- a/src/backend/cuda/all.cu
+++ b/src/backend/cuda/all.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //alltrue
-    INSTANTIATE(af_and_t, float  , char)
-    INSTANTIATE(af_and_t, double , char)
-    INSTANTIATE(af_and_t, cfloat , char)
-    INSTANTIATE(af_and_t, cdouble, char)
-    INSTANTIATE(af_and_t, int    , char)
-    INSTANTIATE(af_and_t, uint   , char)
-    INSTANTIATE(af_and_t, char   , char)
-    INSTANTIATE(af_and_t, uchar  , char)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// alltrue
+INSTANTIATE(af_and_t, float, char)
+INSTANTIATE(af_and_t, double, char)
+INSTANTIATE(af_and_t, cfloat, char)
+INSTANTIATE(af_and_t, cdouble, char)
+INSTANTIATE(af_and_t, int, char)
+INSTANTIATE(af_and_t, uint, char)
+INSTANTIATE(af_and_t, intl, char)
+INSTANTIATE(af_and_t, uintl, char)
+INSTANTIATE(af_and_t, char, char)
+INSTANTIATE(af_and_t, schar, char)
+INSTANTIATE(af_and_t, uchar, char)
+INSTANTIATE(af_and_t, short, char)
+INSTANTIATE(af_and_t, ushort, char)
+INSTANTIATE(af_and_t, half, char)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/anisotropic_diffusion.cpp b/src/backend/cuda/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..45b84b8b6f
--- /dev/null
+++ b/src/backend/cuda/anisotropic_diffusion.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <anisotropic_diffusion.hpp>
+#include <kernel/anisotropic_diffusion.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq) {
+    kernel::anisotropicDiffusion<T>(inout, dt, mct, fftype,
+                                    eq == AF_DIFFUSION_MCDE);
+}
+
+#define INSTANTIATE(T)                                     \
+    template void anisotropicDiffusion<T>(                 \
+        Array<T> & inout, const float dt, const float mct, \
+        const af::fluxFunction fftype, const af::diffusionEq eq);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/anisotropic_diffusion.hpp b/src/backend/cuda/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..6e9c2e4c1c
--- /dev/null
+++ b/src/backend/cuda/anisotropic_diffusion.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/any.cu b/src/backend/cuda/any.cu
index 994db2f835..801dcb6c10 100644
--- a/src/backend/cuda/any.cu
+++ b/src/backend/cuda/any.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //anytrue
-    INSTANTIATE(af_or_t, float  , char)
-    INSTANTIATE(af_or_t, double , char)
-    INSTANTIATE(af_or_t, cfloat , char)
-    INSTANTIATE(af_or_t, cdouble, char)
-    INSTANTIATE(af_or_t, int    , char)
-    INSTANTIATE(af_or_t, uint   , char)
-    INSTANTIATE(af_or_t, char   , char)
-    INSTANTIATE(af_or_t, uchar  , char)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// anytrue
+INSTANTIATE(af_or_t, float, char)
+INSTANTIATE(af_or_t, double, char)
+INSTANTIATE(af_or_t, cfloat, char)
+INSTANTIATE(af_or_t, cdouble, char)
+INSTANTIATE(af_or_t, int, char)
+INSTANTIATE(af_or_t, uint, char)
+INSTANTIATE(af_or_t, intl, char)
+INSTANTIATE(af_or_t, uintl, char)
+INSTANTIATE(af_or_t, char, char)
+INSTANTIATE(af_or_t, schar, char)
+INSTANTIATE(af_or_t, uchar, char)
+INSTANTIATE(af_or_t, short, char)
+INSTANTIATE(af_or_t, ushort, char)
+INSTANTIATE(af_or_t, half, char)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/approx.cpp b/src/backend/cuda/approx.cpp
new file mode 100644
index 0000000000..b9bd55e78d
--- /dev/null
+++ b/src/backend/cuda/approx.cpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <approx.hpp>
+#include <err_cuda.hpp>
+#include <kernel/approx.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid) {
+    kernel::approx1<Ty, Tp>(yo, yi, xo, xdim, xi_beg, xi_step, offGrid, method,
+                            interpOrder(method));
+}
+
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid) {
+    kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim, yi_beg,
+                            yi_step, offGrid, method, interpOrder(method));
+}
+
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx1<Ty, Tp>(                                \
+        Array<Ty> & yo, const Array<Ty> &yi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const af_interp_type method, const float offGrid);        \
+    template void approx2<Ty, Tp>(                                \
+        Array<Ty> & zo, const Array<Ty> &zi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const Array<Tp> &yo, const int ydim, const Tp &yi_beg,    \
+        const Tp &yi_step, const af_interp_type method, const float offGrid);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/approx.cu b/src/backend/cuda/approx.cu
deleted file mode 100644
index 34e9d43139..0000000000
--- a/src/backend/cuda/approx.cu
+++ /dev/null
@@ -1,78 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <approx.hpp>
-#include <kernel/approx.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                      const af_interp_type method, const float offGrid)
-    {
-        af::dim4 idims = in.dims();
-        af::dim4 odims = in.dims();
-        odims[0] = pos.dims()[0];
-
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::approx1<Ty, Tp, AF_INTERP_NEAREST> (out, in, pos, offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                kernel::approx1<Ty, Tp, AF_INTERP_LINEAR> (out, in, pos, offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
-    }
-
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                      const af_interp_type method, const float offGrid)
-    {
-        af::dim4 idims = in.dims();
-        af::dim4 odims = pos0.dims();
-        odims[2] = in.dims()[2];
-        odims[3] = in.dims()[3];
-
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::approx2<Ty, Tp, AF_INTERP_NEAREST> (out, in, pos0, pos1, offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                kernel::approx2<Ty, Tp, AF_INTERP_LINEAR> (out, in, pos0, pos1, offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
-    }
-
-#define INSTANTIATE(Ty, Tp)                                             \
-    template Array<Ty> approx1<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos, \
-                                       const af_interp_type method, const float offGrid); \
-    template Array<Ty> approx2<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos0, \
-                                       const Array<Tp> &pos1, const af_interp_type method, \
-                                       const float offGrid);            \
-
-    INSTANTIATE(float  , float )
-    INSTANTIATE(double , double)
-    INSTANTIATE(cfloat , float )
-    INSTANTIATE(cdouble, double)
-}
diff --git a/src/backend/cuda/approx.hpp b/src/backend/cuda/approx.hpp
index 34bc954dce..c72d2cbe9b 100644
--- a/src/backend/cuda/approx.hpp
+++ b/src/backend/cuda/approx.hpp
@@ -7,16 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                      const af_interp_type method, const float offGrid);
+namespace arrayfire {
+namespace cuda {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid);
 
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                      const af_interp_type method, const float offGrid);
-}
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/arith.hpp b/src/backend/cuda/arith.hpp
index cc3e6dc5cc..67e39f54f4 100644
--- a/src/backend/cuda/arith.hpp
+++ b/src/backend/cuda/arith.hpp
@@ -7,19 +7,25 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
+#pragma once
+
 #include <Array.hpp>
-#include <optypes.hpp>
-#include <err_cuda.hpp>
-#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &&lhs, const Array<T> &&rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
+}
 
-namespace cuda
-{
-    template<typename T, af_op_t op>
-    Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<T, T, op>(lhs, rhs, odims);
-    }
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/assign.cpp b/src/backend/cuda/assign.cpp
new file mode 100644
index 0000000000..b65265dc8b
--- /dev/null
+++ b/src/backend/cuda/assign.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <assign.hpp>
+#include <kernel/assign.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <handle.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs) {
+    AssignKernelParam p;
+    std::vector<af_seq> seqs(4, af_span);
+    // create seq vector to retrieve output
+    // dimensions, offsets & strides
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
+    }
+
+    // retrieve dimensions, strides and offsets
+    dim4 dDims = out.dims();
+    // retrieve offsets & strides for the array
+    // to which rhs is being copied
+    dim4 dstOffs  = toOffset(seqs, dDims);
+    dim4 dstStrds = toStride(seqs, dDims);
+
+    for (dim_t i = 0; i < 4; ++i) {
+        p.isSeq[i] = idxrs[i].isSeq;
+        p.offs[i]  = dstOffs[i];
+        p.strds[i] = dstStrds[i];
+    }
+
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
+    // look through indices to read af_array indices
+    for (dim_t x = 0; x < 4; ++x) {
+        // set idxPtrs to null
+        p.ptr[x] = 0;
+        // set index pointers where applicable
+        if (!p.isSeq[x]) {
+            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
+            p.ptr[x]   = idxArrs[x].get();
+        }
+    }
+
+    kernel::assign<T>(out, rhs, p);
+}
+
+#define INSTANTIATE(T)                                                \
+    template void assign<T>(Array<T> & out, const af_index_t idxrs[], \
+                            const Array<T>& rhs);
+
+INSTANTIATE(cdouble)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/assign.cu b/src/backend/cuda/assign.cu
deleted file mode 100644
index 7bea851fdd..0000000000
--- a/src/backend/cuda/assign.cu
+++ /dev/null
@@ -1,79 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <handle.hpp>
-#include <assign.hpp>
-#include <kernel/assign.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs)
-{
-    kernel::AssignKernelParam_t p;
-    std::vector<af_seq> seqs(4, af_span);
-    // create seq vector to retrieve output
-    // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
-        if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
-        }
-    }
-
-    // retrieve dimensions, strides and offsets
-    dim4 dDims = out.dims();
-    // retrieve dimensions & strides for array
-    // to which rhs is being copied to
-    dim4 dstOffs = toOffset(seqs, dDims);
-    dim4 dstStrds= toStride(seqs, dDims);
-
-    for (dim_t i=0; i<4; ++i) {
-        p.isSeq[i] = idxrs[i].isSeq;
-        p.offs[i]  = dstOffs[i];
-        p.strds[i] = dstStrds[i];
-    }
-
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
-    // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
-        // set idxPtrs to null
-        p.ptr[x] = 0;
-        // set index pointers were applicable
-        if (!p.isSeq[x]) {
-            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
-            p.ptr[x] = idxArrs[x].get();
-        }
-    }
-
-    kernel::assign<T>(out, rhs, p);
-}
-
-#define INSTANTIATE(T) \
-    template void assign<T>(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
-
-INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
-
-}
diff --git a/src/backend/cuda/assign.hpp b/src/backend/cuda/assign.hpp
index 730df7c473..be2f725e90 100644
--- a/src/backend/cuda/assign.hpp
+++ b/src/backend/cuda/assign.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/index.h>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/assign_kernel_param.hpp b/src/backend/cuda/assign_kernel_param.hpp
new file mode 100644
index 0000000000..350893f911
--- /dev/null
+++ b/src/backend/cuda/assign_kernel_param.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cuda {
+
+typedef struct {
+    int offs[4];
+    int strds[4];
+    int steps[4];
+    bool isSeq[4];
+    unsigned int* ptr[4];
+} AssignKernelParam;
+
+using IndexKernelParam = AssignKernelParam;
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/backend.hpp b/src/backend/cuda/backend.hpp
index 475a343a0a..149353ca21 100644
--- a/src/backend/cuda/backend.hpp
+++ b/src/backend/cuda/backend.hpp
@@ -8,17 +8,24 @@
  ********************************************************/
 
 #pragma once
+
 #ifdef __DH__
 #undef __DH__
 #endif
 
+#ifdef __CUDACC_RTC__
+#define __DH__ __device__
+#else
 #ifdef __CUDACC__
 #include <cuda_runtime.h>
 #define __DH__ __device__ __host__
 #else
 #define __DH__
 #endif
+#endif
 
-#include "types.hpp"
+namespace arrayfire {
+namespace cuda {}  // namespace cuda
+}  // namespace arrayfire
 
-namespace detail = cuda;
+namespace detail = arrayfire::cuda;
diff --git a/src/backend/cuda/bilateral.cpp b/src/backend/cuda/bilateral.cpp
new file mode 100644
index 0000000000..6d56640fa8
--- /dev/null
+++ b/src/backend/cuda/bilateral.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <bilateral.hpp>
+#include <kernel/bilateral.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &sSigma,
+                         const float &cSigma) {
+    Array<outType> out = createEmptyArray<outType>(in.dims());
+    kernel::bilateral<inType, outType>(out, in, sSigma, cSigma);
+    return out;
+}
+
+#define INSTANTIATE(inT, outT)                                    \
+    template Array<outT> bilateral<inT, outT>(const Array<inT> &, \
+                                              const float &, const float &);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/bilateral.cu b/src/backend/cuda/bilateral.cu
deleted file mode 100644
index 4c1d7fc6f9..0000000000
--- a/src/backend/cuda/bilateral.cu
+++ /dev/null
@@ -1,41 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <bilateral.hpp>
-#include <kernel/bilateral.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma)
-{
-    Array<outType>out = createEmptyArray<outType>(in.dims());
-    kernel::bilateral<inType, outType, isColor>(out, in, s_sigma, c_sigma);
-    return out;
-}
-
-#define INSTANTIATE(inT, outT)\
-template Array<outT> bilateral<inT, outT,true >(const Array<inT> &in, const float &s_sigma, const float &c_sigma);\
-template Array<outT> bilateral<inT, outT,false>(const Array<inT> &in, const float &s_sigma, const float &c_sigma);
-
-INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
-
-}
diff --git a/src/backend/cuda/bilateral.hpp b/src/backend/cuda/bilateral.hpp
index 23000086bd..63cdaee7af 100644
--- a/src/backend/cuda/bilateral.hpp
+++ b/src/backend/cuda/bilateral.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma);
-
-}
+namespace arrayfire {
+namespace cuda {
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &spatialSigma,
+                         const float &chromaticSigma);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/binary.hpp b/src/backend/cuda/binary.hpp
index d5f742eec3..ca707f30be 100644
--- a/src/backend/cuda/binary.hpp
+++ b/src/backend/cuda/binary.hpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -8,83 +8,127 @@
  ********************************************************/
 
 #pragma once
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <optypes.hpp>
 #include <math.hpp>
-#include <JIT/BinaryNode.hpp>
-
-namespace cuda
-{
+#include <optypes.hpp>
 
+namespace arrayfire {
+namespace cuda {
 
 template<typename To, typename Ti, af_op_t op>
-struct BinOp
-{
-    const char *name()
-    {
-        return "noop";
-    }
+struct BinOp;
+
+#define BINARY_TYPE_1(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
+
+BINARY_TYPE_1(eq)
+BINARY_TYPE_1(neq)
+BINARY_TYPE_1(lt)
+BINARY_TYPE_1(le)
+BINARY_TYPE_1(gt)
+BINARY_TYPE_1(ge)
+BINARY_TYPE_1(add)
+BINARY_TYPE_1(sub)
+BINARY_TYPE_1(mul)
+BINARY_TYPE_1(div)
+BINARY_TYPE_1(and)
+BINARY_TYPE_1(or)
+BINARY_TYPE_1(bitand)
+BINARY_TYPE_1(bitor)
+BINARY_TYPE_1(bitxor)
+BINARY_TYPE_1(bitshiftl)
+BINARY_TYPE_1(bitshiftr)
+
+#undef BINARY_TYPE_1
+
+#define BINARY_TYPE_2(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, float, af_##fn##_t> {           \
+        const char *name() { return "f" #fn "f"; }   \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, double, af_##fn##_t> {          \
+        const char *name() { return "f" #fn; }       \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
+
+BINARY_TYPE_2(min)
+BINARY_TYPE_2(max)
+BINARY_TYPE_2(rem)
+BINARY_TYPE_2(mod)
+
+template<>
+struct BinOp<common::half, common::half, af_mod_t> {
+    const char *name() { return "hmod"; }
 };
 
-#define BINARY(fn)                                  \
-    template<typename To, typename Ti>              \
-    struct BinOp<To, Ti, af_##fn##_t>               \
-    {                                               \
-        std::string res;                            \
-        BinOp() :                                   \
-            res(cuMangledName<Ti, true>("___"#fn))  \
-        {}                                          \
-        const std::string name()                    \
-        {                                           \
-            return res;                             \
-        }                                           \
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_pow_t> {
+    const char *name() { return "__pow"; }
+};
+
+#define POW_BINARY_OP(INTYPE, OPNAME)         \
+    template<typename To>                     \
+    struct BinOp<To, INTYPE, af_pow_t> {      \
+        const char *name() { return OPNAME; } \
     };
 
-BINARY(add)
-BINARY(sub)
-BINARY(mul)
-BINARY(div)
-BINARY(and)
-BINARY(or)
-BINARY(bitand)
-BINARY(bitor)
-BINARY(bitxor)
-BINARY(bitshiftl)
-BINARY(bitshiftr)
-
-BINARY(lt)
-BINARY(gt)
-BINARY(le)
-BINARY(ge)
-BINARY(eq)
-BINARY(neq)
-
-BINARY(max)
-BINARY(min)
-BINARY(pow)
-BINARY(mod)
-BINARY(rem)
-BINARY(atan2)
-BINARY(hypot)
-
-#undef BINARY
+POW_BINARY_OP(double, "pow")
+POW_BINARY_OP(float, "powf")
+POW_BINARY_OP(intl, "__powll")
+POW_BINARY_OP(uintl, "__powul")
+POW_BINARY_OP(uint, "__powui")
+POW_BINARY_OP(int, "__powsi")
 
-template<typename To, typename Ti, af_op_t op>
-Array<To> createBinaryNode(const Array<Ti> &lhs, const Array<Ti> &rhs, const af::dim4 &odims)
-{
-    BinOp<To, Ti, op> bop;
+#undef POW_BINARY_OP
+
+template<typename Ti>
+struct BinOp<cfloat, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2f"; }
+};
 
-    JIT::Node_ptr lhs_node = lhs.getNode();
-    JIT::Node_ptr rhs_node = rhs.getNode();
+template<typename Ti>
+struct BinOp<cdouble, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2"; }
+};
 
-    JIT::BinaryNode *node = new JIT::BinaryNode(irname<To>(),
-                                                afShortName<To>(),
-                                                bop.name(),
-                                                lhs_node,
-                                                rhs_node, (int)(op));
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_cplx2_t> {
+    const char *name() { return "noop"; }
+};
 
-    return createNodeArray<To>(odims, JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-}
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_atan2_t> {
+    const char *name() { return "atan2"; }
+};
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_hypot_t> {
+    const char *name() { return "hypot"; }
+};
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/blas.cpp b/src/backend/cuda/blas.cpp
deleted file mode 100644
index 9ed9cdf48d..0000000000
--- a/src/backend/cuda/blas.cpp
+++ /dev/null
@@ -1,240 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <blas.hpp>
-#include <cuda_runtime.h>
-#include <cublas_v2.h>
-#include <platform.hpp>
-
-#include <stdexcept>
-#include <string>
-#include <cassert>
-#include <math.hpp>
-#include <err_common.hpp>
-#include <cublasManager.hpp>
-
-namespace cuda
-{
-
-using cublas::getHandle;
-
-cublasOperation_t
-toCblasTranspose(af_mat_prop opt)
-{
-    cublasOperation_t out = CUBLAS_OP_N;
-    switch(opt) {
-        case AF_MAT_NONE        : out = CUBLAS_OP_N;    break;
-        case AF_MAT_TRANS           : out = CUBLAS_OP_T;    break;
-        case AF_MAT_CTRANS : out = CUBLAS_OP_C;    break;
-        default                     : AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
-    }
-    return out;
-}
-
-template<typename T>
-struct gemm_func_def_t
-{
-    typedef cublasStatus_t (*gemm_func_def)(    cublasHandle_t,
-                                                cublasOperation_t, cublasOperation_t,
-                                                int, int, int,
-                                                const T *,  const T *, int,
-                                                            const T *, int,
-                                                const T *,        T *, int);
-};
-
-template<typename T>
-struct gemv_func_def_t
-{
-    typedef cublasStatus_t (*gemv_func_def)(    cublasHandle_t,
-                                                cublasOperation_t,
-                                                int, int,
-                                                const T *,  const T *, int,
-                                                            const T *, int,
-                                                const T *,        T *, int);
-};
-
-template<typename T>
-struct dot_func_def_t
-{
-    typedef cublasStatus_t (*dot_func_def)(    cublasHandle_t,
-                                                int,
-                                                const T *,  int,
-                                                const T *,  int,
-                                                T *);
-};
-
-template<typename T>
-struct trsm_func_def_t
-{
-    typedef cublasStatus_t (*trsm_func_def)(    cublasHandle_t,
-                                                cublasSideMode_t,
-                                                cublasFillMode_t,
-                                                cublasOperation_t,
-                                                cublasDiagType_t,
-                                                int, int,
-                                                const T *,
-                                                const T *, int,
-                                                T *, int);
-};
-
-#define BLAS_FUNC_DEF( FUNC )                       \
-template<typename T>                                \
-typename FUNC##_func_def_t<T>::FUNC##_func_def      \
-FUNC##_func();
-
-#define BLAS_FUNC( FUNC, TYPE, PREFIX )         \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def       FUNC##_func<TYPE>()  { return &cublas##PREFIX##FUNC; }
-
-BLAS_FUNC_DEF(gemm)
-BLAS_FUNC(gemm, float,  S)
-BLAS_FUNC(gemm, cfloat, C)
-BLAS_FUNC(gemm, double, D)
-BLAS_FUNC(gemm, cdouble,Z)
-
-BLAS_FUNC_DEF(gemv)
-BLAS_FUNC(gemv, float,  S)
-BLAS_FUNC(gemv, cfloat, C)
-BLAS_FUNC(gemv, double, D)
-BLAS_FUNC(gemv, cdouble,Z)
-
-BLAS_FUNC_DEF(dot)
-BLAS_FUNC(dot, float,  S)
-BLAS_FUNC(dot, double, D)
-
-BLAS_FUNC_DEF(trsm)
-BLAS_FUNC(trsm, float,  S)
-BLAS_FUNC(trsm, cfloat, C)
-BLAS_FUNC(trsm, double, D)
-BLAS_FUNC(trsm, cdouble,Z)
-
-using namespace std;
-
-template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    cublasOperation_t lOpts = toCblasTranspose(optLhs);
-    cublasOperation_t rOpts = toCblasTranspose(optRhs);
-
-    int aRowDim = (lOpts == CUBLAS_OP_N) ? 0 : 1;
-    int aColDim = (lOpts == CUBLAS_OP_N) ? 1 : 0;
-    int bColDim = (rOpts == CUBLAS_OP_N) ? 1 : 0;
-
-    dim4 lDims = lhs.dims();
-    dim4 rDims = rhs.dims();
-    int M = lDims[aRowDim];
-    int N = rDims[bColDim];
-    int K = lDims[aColDim];
-
-    Array<T> out = createEmptyArray<T>(af::dim4(M, N, 1, 1));
-    T alpha = scalar<T>(1);
-    T beta  = scalar<T>(0);
-
-    dim4 lStrides = lhs.strides();
-    dim4 rStrides = rhs.strides();
-    if(rDims[bColDim] == 1) {
-        N = lDims[aColDim];
-        CUBLAS_CHECK(gemv_func<T>()(
-                         getHandle(),
-                         lOpts,
-                         lDims[0],
-                         lDims[1],
-                         &alpha,
-                         lhs.get(), lStrides[1],
-                         rhs.get(), rStrides[0],
-                         &beta,
-                         out.get(), 1));
-    } else {
-        CUBLAS_CHECK(gemm_func<T>()(
-                         getHandle(),
-                         lOpts,
-                         rOpts,
-                         M, N, K,
-                         &alpha,
-                         lhs.get(), lStrides[1],
-                         rhs.get(), rStrides[1],
-                         &beta,
-                         out.get(),
-                         out.dims()[0]));
-    }
-
-    return out;
-
-}
-
-template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    int N = lhs.dims()[0];
-
-    T out;
-
-    CUBLAS_CHECK(dot_func<T>()(getHandle(),
-                               N,
-                               lhs.get(), lhs.strides()[0],
-                               rhs.get(), rhs.strides()[0],
-                               &out));
-
-    return createValueArray(af::dim4(1), out);
-}
-
-template<typename T>
-void trsm(const Array<T> &lhs, Array<T> &rhs, af_mat_prop trans,
-          bool is_upper, bool is_left, bool is_unit)
-{
-    //dim4 lDims = lhs.dims();
-    dim4 rDims = rhs.dims();
-    int M = rDims[0];
-    int N = rDims[1];
-
-    T alpha = scalar<T>(1);
-
-    dim4 lStrides = lhs.strides();
-    dim4 rStrides = rhs.strides();
-
-    CUBLAS_CHECK(trsm_func<T>()(
-                     getHandle(),
-                     is_left  ? CUBLAS_SIDE_LEFT : CUBLAS_SIDE_RIGHT,
-                     is_upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
-                     toCblasTranspose(trans),
-                     is_unit  ? CUBLAS_DIAG_UNIT : CUBLAS_DIAG_NON_UNIT,
-                     M, N,
-                     &alpha,
-                     lhs.get(), lStrides[1],
-                     rhs.get(), rStrides[1]));
-}
-
-
-#define INSTANTIATE_BLAS(TYPE)                                                          \
-    template Array<TYPE> matmul<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs,  \
-                                      af_mat_prop optLhs, af_mat_prop optRhs);
-
-INSTANTIATE_BLAS(float)
-INSTANTIATE_BLAS(cfloat)
-INSTANTIATE_BLAS(double)
-INSTANTIATE_BLAS(cdouble)
-
-#define INSTANTIATE_DOT(TYPE)                                                       \
-    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
-                                   af_mat_prop optLhs, af_mat_prop optRhs);
-
-INSTANTIATE_DOT(float)
-INSTANTIATE_DOT(double)
-
-#define INSTANTIATE_TRSM(TYPE)                                                          \
-    template void trsm<TYPE>(const Array<TYPE> &lhs, Array<TYPE> &rhs,                  \
-                             af_mat_prop trans, bool is_upper, bool is_left, bool is_unit);
-
-INSTANTIATE_TRSM(float)
-INSTANTIATE_TRSM(cfloat)
-INSTANTIATE_TRSM(double)
-INSTANTIATE_TRSM(cdouble)
-
-}
diff --git a/src/backend/cuda/blas.cu b/src/backend/cuda/blas.cu
new file mode 100644
index 0000000000..08df398a8d
--- /dev/null
+++ b/src/backend/cuda/blas.cu
@@ -0,0 +1,389 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <blas.hpp>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/half.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <cublas.hpp>
+#include <cublas_v2.h>
+#include <cudaDataType.hpp>
+#include <cuda_runtime.h>
+#include <err_cuda.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <tile.hpp>
+#include <transpose.hpp>
+#include <types.hpp>
+
+#include <cassert>
+#include <functional>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+using arrayfire::common::half;
+using arrayfire::common::kernel_type;
+using std::is_same;
+using std::vector;
+
+namespace arrayfire {
+namespace cuda {
+
+cublasOperation_t toCblasTranspose(af_mat_prop opt) {
+    cublasOperation_t out = CUBLAS_OP_N;
+    switch (opt) {
+        case AF_MAT_NONE: out = CUBLAS_OP_N; break;
+        case AF_MAT_TRANS: out = CUBLAS_OP_T; break;
+        case AF_MAT_CTRANS: out = CUBLAS_OP_C; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+    return out;
+}
+
+template<typename T>
+using gemm_func_def = std::function<cublasStatus_t(
+    cublasHandle_t, cublasOperation_t, cublasOperation_t, int, int, int,
+    const T *, const T *, int, const T *, int, const T *, T *, int)>;
+
+template<typename T>
+using gemmBatched_func_def = std::function<cublasStatus_t(
+    cublasHandle_t, cublasOperation_t, cublasOperation_t, int, int, int,
+    const T *, const T **, int, const T **, int, const T *, T **, int, int)>;
+
+template<typename T>
+using trsm_func_def = std::function<cublasStatus_t(
+    cublasHandle_t, cublasSideMode_t, cublasFillMode_t, cublasOperation_t,
+    cublasDiagType_t, int, int, const T *, const T *, int, T *, int)>;
+
+#define BLAS_FUNC_DEF(FUNC) \
+    template<typename T>    \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define BLAS_FUNC(FUNC, TYPE, PREFIX)           \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &cublas##PREFIX##FUNC;           \
+    }
+
+BLAS_FUNC_DEF(gemm)
+BLAS_FUNC(gemm, float, S)
+BLAS_FUNC(gemm, cfloat, C)
+BLAS_FUNC(gemm, double, D)
+BLAS_FUNC(gemm, cdouble, Z)
+BLAS_FUNC(gemm, __half, H)
+
+BLAS_FUNC_DEF(gemmBatched)
+BLAS_FUNC(gemmBatched, float, S)
+BLAS_FUNC(gemmBatched, cfloat, C)
+BLAS_FUNC(gemmBatched, double, D)
+BLAS_FUNC(gemmBatched, cdouble, Z)
+BLAS_FUNC(gemmBatched, __half, H)
+
+template<>
+gemm_func_def<schar> gemm_func<schar>() {
+    TYPE_ERROR(3, af_dtype::s8);
+    return gemm_func_def<schar>();
+}
+template<>
+gemmBatched_func_def<schar> gemmBatched_func<schar>() {
+    TYPE_ERROR(3, af_dtype::s8);
+    return gemmBatched_func_def<schar>();
+}
+
+BLAS_FUNC_DEF(trsm)
+BLAS_FUNC(trsm, float, S)
+BLAS_FUNC(trsm, cfloat, C)
+BLAS_FUNC(trsm, double, D)
+BLAS_FUNC(trsm, cdouble, Z)
+
+#undef BLAS_FUNC
+#undef BLAS_FUNC_DEF
+
+template<typename T, bool conjugate>
+struct dot_func_def_t {
+    typedef cublasStatus_t (*dot_func_def)(cublasHandle_t, int, const T *, int,
+                                           const T *, int, T *);
+};
+
+#define BLAS_FUNC_DEF(FUNC)              \
+    template<typename T, bool conjugate> \
+    typename FUNC##_func_def_t<T, conjugate>::FUNC##_func_def FUNC##_func();
+
+#define BLAS_FUNC(FUNC, TYPE, CONJUGATE, PREFIX)                       \
+    template<>                                                         \
+    typename FUNC##_func_def_t<TYPE, CONJUGATE>::FUNC##_func_def       \
+        FUNC##_func<TYPE, CONJUGATE>() {                               \
+        return (FUNC##_func_def_t<TYPE, CONJUGATE>::FUNC##_func_def) & \
+               cublas##PREFIX##FUNC;                                   \
+    }
+
+BLAS_FUNC_DEF(dot)
+BLAS_FUNC(dot, float, true, S)
+BLAS_FUNC(dot, double, true, D)
+BLAS_FUNC(dot, float, false, S)
+BLAS_FUNC(dot, double, false, D)
+
+#undef BLAS_FUNC
+
+#define BLAS_FUNC(FUNC, TYPE, CONJUGATE, PREFIX, SUFFIX)               \
+    template<>                                                         \
+    typename FUNC##_func_def_t<TYPE, CONJUGATE>::FUNC##_func_def       \
+        FUNC##_func<TYPE, CONJUGATE>() {                               \
+        return (FUNC##_func_def_t<TYPE, CONJUGATE>::FUNC##_func_def) & \
+               cublas##PREFIX##FUNC##SUFFIX;                           \
+    }
+
+BLAS_FUNC_DEF(dot)
+BLAS_FUNC(dot, cfloat, true, C, c)
+BLAS_FUNC(dot, cdouble, true, Z, c)
+BLAS_FUNC(dot, cfloat, false, C, u)
+BLAS_FUNC(dot, cdouble, false, Z, u)
+
+#undef BLAS_FUNC
+#undef BLAS_FUNC_DEF
+
+template<typename T>
+cublasGemmAlgo_t selectGEMMAlgorithm() {
+    return CUBLAS_GEMM_DEFAULT;
+}
+
+template<>
+cublasGemmAlgo_t selectGEMMAlgorithm<common::half>() {
+    auto dev              = getDeviceProp(getActiveDeviceId());
+    cublasGemmAlgo_t algo = CUBLAS_GEMM_DEFAULT;
+    if (dev.major >= 7) { algo = CUBLAS_GEMM_DEFAULT_TENSOR_OP; }
+    return algo;
+}
+
+template<>
+cublasGemmAlgo_t selectGEMMAlgorithm<__half>() {
+    return selectGEMMAlgorithm<common::half>();
+}
+
+template<typename Ti, typename To = Ti>
+cublasStatus_t gemmDispatch(BlasHandle handle, cublasOperation_t lOpts,
+                            cublasOperation_t rOpts, int M, int N, int K,
+                            const To *alpha, const Array<Ti> &lhs, dim_t lStride,
+                            const Array<Ti> &rhs, dim_t rStride, const To *beta,
+                            Array<To> &out, dim_t oleading) {
+    auto prop = getDeviceProp(getActiveDeviceId());
+#if __CUDACC_VER_MAJOR__ >= 10
+    if (prop.major > 3 && __CUDACC_VER_MAJOR__ >= 10) {
+        return cublasGemmEx(
+            blasHandle(), lOpts, rOpts, M, N, K, alpha, lhs.get(), getType<Ti>(),
+            lStride, rhs.get(), getType<Ti>(), rStride, beta, out.get(),
+            getType<To>(), out.strides()[1],
+            getComputeType<To>(),  // Compute type
+
+            // NOTE: When using the CUBLAS_GEMM_DEFAULT_TENSOR_OP algorithm
+            // for the cublasGemm*Ex functions, the performance of fp32
+            // operations seems to increase dramatically. Their numerical
+            // accuracy also differs from that of the regular gemm functions.
+            // The CUBLAS_GEMM_DEFAULT algorithm selection does not show
+            // this change. Does this imply that the TENSOR_OP function
+            // performs the computation in fp16 even when the compute
+            // type is CUDA_R_32F?
+            selectGEMMAlgorithm<Ti>());
+    } else {
+#endif
+        using Nt = typename common::kernel_type<Ti>::native;
+        return gemm_func<Nt>()(blasHandle(), lOpts, rOpts, M, N, K, (Nt *)alpha,
+                               (Nt *)lhs.get(), lStride, (Nt *)rhs.get(),
+                               rStride, (Nt *)beta, (Nt *)out.get(), oleading);
+
+#if __CUDACC_VER_MAJOR__ >= 10
+    }
+#endif
+}
+
+template<typename Ti, typename To = Ti>
+cublasStatus_t gemmBatchedDispatch(BlasHandle handle, cublasOperation_t lOpts,
+                                   cublasOperation_t rOpts, int M, int N, int K,
+                                   const To *alpha, const Ti **lptrs,
+                                   int lStrides, const Ti **rptrs, int rStrides,
+                                   const To *beta, To **optrs, int oStrides,
+                                   int batchSize) {
+    auto prop = getDeviceProp(getActiveDeviceId());
+#if __CUDACC_VER_MAJOR__ >= 10
+    if (prop.major > 3) {
+        return cublasGemmBatchedEx(
+            blasHandle(), lOpts, rOpts, M, N, K, alpha, (const void **)lptrs,
+            getType<Ti>(), lStrides, (const void **)rptrs, getType<Ti>(),
+            rStrides, beta, (void **)optrs, getType<To>(), oStrides, batchSize,
+            getComputeType<To>(),  // compute type
+            // NOTE: When using the CUBLAS_GEMM_DEFAULT_TENSOR_OP algorithm
+            // for the cublasGemm*Ex functions, fp32 performance seems to
+            // increase dramatically. The numerical accuracy also differs
+            // from that of the regular gemm functions. The
+            // CUBLAS_GEMM_DEFAULT algorithm selection does not show this
+            // change. Does this imply that the TENSOR_OP algorithm
+            // performs the computation in fp16 even when the compute
+            // type is CUDA_R_32F?
+            selectGEMMAlgorithm<Ti>());
+    } else {
+#endif
+        using Nt = typename common::kernel_type<Ti>::native;
+        return gemmBatched_func<Nt>()(
+            blasHandle(), lOpts, rOpts, M, N, K, (const Nt *)alpha,
+            (const Nt **)lptrs, lStrides, (const Nt **)rptrs, rStrides,
+            (const Nt *)beta, (Nt **)optrs, oStrides, batchSize);
+#if __CUDACC_VER_MAJOR__ >= 10
+    }
+#endif
+}
+
+template<typename Ti, typename To>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs, const To *alpha,
+          const Array<Ti> &lhs, const Array<Ti> &rhs, const To *beta) {
+    const cublasOperation_t lOpts = toCblasTranspose(optLhs);
+    const cublasOperation_t rOpts = toCblasTranspose(optRhs);
+
+    const int aRowDim = (lOpts == CUBLAS_OP_N) ? 0 : 1;
+    const int aColDim = (lOpts == CUBLAS_OP_N) ? 1 : 0;
+    const int bColDim = (rOpts == CUBLAS_OP_N) ? 1 : 0;
+
+    const dim4 lDims = lhs.dims();
+    const dim4 rDims = rhs.dims();
+    const int M      = lDims[aRowDim];
+    const int N      = rDims[bColDim];
+    const int K      = lDims[aColDim];
+    const dim4 oDims = out.dims();
+
+    dim4 lStrides = lhs.strides();
+    dim4 rStrides = rhs.strides();
+    dim4 oStrides = out.strides();
+
+    if (oDims.ndims() <= 2) {
+        CUBLAS_CHECK((gemmDispatch<Ti, To>(blasHandle(), lOpts, rOpts, M, N, K, alpha,
+                                           lhs, lStrides[1], rhs, rStrides[1], beta,
+                                           out, oStrides[1])));
+    } else {
+        int batchSize = oDims[2] * oDims[3];
+        vector<const Ti *> lptrs(batchSize);
+        vector<const Ti *> rptrs(batchSize);
+        vector<To *> optrs(batchSize);
+
+        bool is_l_d2_batched = oDims[2] == lDims[2];
+        bool is_l_d3_batched = oDims[3] == lDims[3];
+
+        bool is_r_d2_batched = oDims[2] == rDims[2];
+        bool is_r_d3_batched = oDims[3] == rDims[3];
+
+        const Ti *lptr = lhs.get();
+        const Ti *rptr = rhs.get();
+        To *optr    = out.get();
+
+        for (int n = 0; n < batchSize; n++) {
+            int w    = n / oDims[2];
+            int z    = n - w * oDims[2];
+            int loff = z * (is_l_d2_batched * lStrides[2]) +
+                       w * (is_l_d3_batched * lStrides[3]);
+            int roff = z * (is_r_d2_batched * rStrides[2]) +
+                       w * (is_r_d3_batched * rStrides[3]);
+            lptrs[n] = lptr + loff;
+            rptrs[n] = rptr + roff;
+            optrs[n] = optr + z * oStrides[2] + w * oStrides[3];
+        }
+
+        size_t bytes = batchSize * sizeof(Ti **);
+        auto d_lptrs = memAlloc<uchar>(bytes);
+        auto d_rptrs = memAlloc<uchar>(bytes);
+        auto d_optrs = memAlloc<uchar>(bytes);
+        CUDA_CHECK(cudaMemcpyAsync(d_lptrs.get(), lptrs.data(), bytes,
+                                   cudaMemcpyHostToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(d_rptrs.get(), rptrs.data(), bytes,
+                                   cudaMemcpyHostToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(d_optrs.get(), optrs.data(), bytes,
+                                   cudaMemcpyHostToDevice, getActiveStream()));
+
+        // Synchronize before the gemm call so that the host waits only for
+        // the pointer copies above, not for the computation, even though it
+        // would seem more natural to synchronize afterwards
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
+
+        using Nt = typename common::kernel_type<Ti>::native;
+        CUBLAS_CHECK(gemmBatchedDispatch(
+            blasHandle(), lOpts, rOpts, M, N, K, alpha,
+            (const Ti **)d_lptrs.get(), lStrides[1], (const Ti **)d_rptrs.get(),
+            rStrides[1], beta, (To **)d_optrs.get(), oStrides[1], batchSize));
+    }
+}
+
+template<typename T>
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs) {
+    auto lhs_ = (optLhs == AF_MAT_NONE ? lhs : conj<T>(lhs));
+    auto rhs_ = (optRhs == AF_MAT_NONE ? rhs : conj<T>(rhs));
+    auto temp = arithOp<T, af_mul_t>(lhs_, rhs_, lhs_.dims());
+    return reduce<af_add_t, T, T>(temp, 0, false, 0);
+}
+
+template<typename T>
+void trsm(const Array<T> &lhs, Array<T> &rhs, af_mat_prop trans, bool is_upper,
+          bool is_left, bool is_unit) {
+    // dim4 lDims = lhs.dims();
+    dim4 rDims = rhs.dims();
+    int M      = rDims[0];
+    int N      = rDims[1];
+
+    T alpha = scalar<T>(1);
+
+    dim4 lStrides = lhs.strides();
+    dim4 rStrides = rhs.strides();
+
+    CUBLAS_CHECK(trsm_func<T>()(
+        blasHandle(), is_left ? CUBLAS_SIDE_LEFT : CUBLAS_SIDE_RIGHT,
+        is_upper ? CUBLAS_FILL_MODE_UPPER : CUBLAS_FILL_MODE_LOWER,
+        toCblasTranspose(trans),
+        is_unit ? CUBLAS_DIAG_UNIT : CUBLAS_DIAG_NON_UNIT, M, N, &alpha,
+        lhs.get(), lStrides[1], rhs.get(), rStrides[1]));
+}
+
+#define INSTANTIATE_GEMM(TYPE, OUTTYPE)                                      \
+    template void gemm<TYPE>(Array<OUTTYPE> & out, af_mat_prop optLhs,       \
+                             af_mat_prop optRhs, const OUTTYPE *alpha,       \
+                             const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
+                             const OUTTYPE *beta);
+
+INSTANTIATE_GEMM(float, float)
+INSTANTIATE_GEMM(cfloat, cfloat)
+INSTANTIATE_GEMM(double, double)
+INSTANTIATE_GEMM(cdouble, cdouble)
+INSTANTIATE_GEMM(half, half)
+INSTANTIATE_GEMM(schar, float)
+
+#define INSTANTIATE_DOT(TYPE)                                                  \
+    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs,                     \
+                                   const Array<TYPE> &rhs, af_mat_prop optLhs, \
+                                   af_mat_prop optRhs);
+
+INSTANTIATE_DOT(float)
+INSTANTIATE_DOT(double)
+INSTANTIATE_DOT(cfloat)
+INSTANTIATE_DOT(cdouble)
+INSTANTIATE_DOT(half)
+
+#define INSTANTIATE_TRSM(TYPE)                                               \
+    template void trsm<TYPE>(const Array<TYPE> &lhs, Array<TYPE> &rhs,       \
+                             af_mat_prop trans, bool is_upper, bool is_left, \
+                             bool is_unit);
+
+INSTANTIATE_TRSM(float)
+INSTANTIATE_TRSM(cfloat)
+INSTANTIATE_TRSM(double)
+INSTANTIATE_TRSM(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
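Review note on the batched path in `gemm` above: the per-batch pointer arithmetic can be checked on the host without CUDA. The following is a minimal sketch under stated assumptions, not the backend's code. `Dim4` and `batchOffsets` are hypothetical names introduced here, and `Dim4` only stands in for `af::dim4`. It mirrors the `loff`/`roff` computation: a batch dimension whose extent differs from the output's is treated as broadcast and contributes no stride.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for af::dim4: four extents or four strides.
using Dim4 = std::array<long long, 4>;

// Element offset of each 2D slice of one input, for every (z, w) batch index
// of the output. A batch dimension that does not match the output extent is
// treated as broadcast, so its stride contribution is zero.
std::vector<long long> batchOffsets(const Dim4 &outDims, const Dim4 &inDims,
                                    const Dim4 &inStrides) {
    assert(outDims[2] > 0 && outDims[3] > 0);
    const bool d2Batched      = outDims[2] == inDims[2];
    const bool d3Batched      = outDims[3] == inDims[3];
    const long long batchSize = outDims[2] * outDims[3];

    std::vector<long long> offsets(static_cast<std::size_t>(batchSize));
    for (long long n = 0; n < batchSize; n++) {
        const long long w = n / outDims[2];      // index along dimension 3
        const long long z = n - w * outDims[2];  // index along dimension 2
        offsets[static_cast<std::size_t>(n)] =
            z * (d2Batched ? inStrides[2] : 0) +
            w * (d3Batched ? inStrides[3] : 0);
    }
    return offsets;
}
```

With an output of shape 4x5x3x2 and a left operand of shape 4x6x3x1, the operand is batched along dimension 2 and broadcast along dimension 3, so the same three slices are reused for both values of `w`.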
diff --git a/src/backend/cuda/blas.hpp b/src/backend/cuda/blas.hpp
index c0a6e966e6..37432911e2 100644
--- a/src/backend/cuda/blas.hpp
+++ b/src/backend/cuda/blas.hpp
@@ -7,23 +7,35 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/blas.h>
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
+template<typename Ti, typename To = Ti>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta);
 
 template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs);
+Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+                af_mat_prop optRhs) {
+    int Mdim     = optLhs == AF_MAT_NONE ? 0 : 1;
+    int Ndim     = optRhs == AF_MAT_NONE ? 1 : 0;
+    Array<T> res = createEmptyArray<T>(
+        dim4(lhs.dims()[Mdim], rhs.dims()[Ndim], lhs.dims()[2], lhs.dims()[3]));
+    constexpr T alpha = 1.0;
+    constexpr T beta  = 0.0;
+    gemm(res, optLhs, optRhs, &alpha, lhs, rhs, &beta);
+    return res;
+}
 
 template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs);
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs);
 
 template<typename T>
 void trsm(const Array<T> &lhs, Array<T> &rhs, af_mat_prop trans = AF_MAT_NONE,
           bool is_upper = false, bool is_left = true, bool is_unit = false);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
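Review note on `matmul` and `gemm`: both derive the GEMM sizes from the operand dimensions and the transpose options. The sketch below restates that selection in plain C++ so it can be tested in isolation. `GemmShape` and `gemmShape` are hypothetical names, the boolean flags stand in for `af_mat_prop` values other than `AF_MAT_NONE`, and the inner-dimension assertion is an addition that the diff itself does not contain.

```cpp
#include <array>
#include <cassert>

// Sizes passed to a GEMM call: out is M x N, the shared inner extent is K.
struct GemmShape {
    int M;
    int N;
    int K;
};

// dims[0] is the row count and dims[1] the column count (column-major).
// A transposed operand swaps which extent plays the row or column role.
GemmShape gemmShape(const std::array<int, 2> &lhsDims,
                    const std::array<int, 2> &rhsDims, bool transLhs,
                    bool transRhs) {
    const int aRowDim = transLhs ? 1 : 0;
    const int aColDim = transLhs ? 0 : 1;
    const int bRowDim = transRhs ? 1 : 0;
    const int bColDim = transRhs ? 0 : 1;
    // Added check: the inner extents must agree once transposes are applied.
    assert(lhsDims[aColDim] == rhsDims[bRowDim]);
    return GemmShape{lhsDims[aRowDim], rhsDims[bColDim], lhsDims[aColDim]};
}
```

A 3x4 operand times a 4x5 operand and a transposed 4x3 operand times a transposed 5x4 operand both resolve to the same M, N and K.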
diff --git a/src/backend/cuda/canny.cpp b/src/backend/cuda/canny.cpp
new file mode 100644
index 0000000000..ebf8ba2e04
--- /dev/null
+++ b/src/backend/cuda/canny.cpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <canny.hpp>
+#include <err_cuda.hpp>
+#include <kernel/canny.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy) {
+    Array<float> out = createValueArray<float>(mag.dims(), 0);
+    kernel::nonMaxSuppression<float>(out, mag, gx, gy);
+    return out;
+}
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak) {
+    Array<char> out = createValueArray<char>(strong.dims(), 0);
+    kernel::edgeTrackingHysteresis<char>(out, strong, weak);
+    return out;
+}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/canny.hpp b/src/backend/cuda/canny.hpp
new file mode 100644
index 0000000000..7f8142493b
--- /dev/null
+++ b/src/backend/cuda/canny.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy);
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cast.hpp b/src/backend/cuda/cast.hpp
index 0b9bd81e25..214d24845a 100644
--- a/src/backend/cuda/cast.hpp
+++ b/src/backend/cuda/cast.hpp
@@ -8,65 +8,82 @@
  ********************************************************/
 
 #pragma once
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <complex>
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <common/jit/UnaryNode.hpp>
 #include <err_cuda.hpp>
 #include <math.hpp>
 #include <optypes.hpp>
 #include <types.hpp>
-#include <JIT/UnaryNode.hpp>
-#include <Array.hpp>
+#include <af/dim4.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename To, typename Ti>
-struct CastOp
-{
-    std::string func;
-    CastOp() {
-        std::string tmp = std::string("___mk") + afShortName<To>();
-        func = cuMangledName<Ti, false>(tmp.c_str());
-    }
+struct CastOp {
+    const char *name() { return ""; }
+};
+
+#define CAST_FN(TYPE)                                \
+    template<typename Ti>                            \
+    struct CastOp<TYPE, Ti> {                        \
+        const char *name() { return "(" #TYPE ")"; } \
+    };
 
-    const std::string name()
-    {
-        return func;
-    }
+CAST_FN(int)
+CAST_FN(unsigned int)
+CAST_FN(unsigned char)
+CAST_FN(signed char)
+CAST_FN(unsigned short)
+CAST_FN(short)
+CAST_FN(float)
+CAST_FN(double)
+
+template<typename Ti>
+struct CastOp<common::half, Ti> {
+    const char *name() { return "(__half)"; }
 };
 
+#define CAST_CFN(TYPE)                                    \
+    template<typename Ti>                                 \
+    struct CastOp<TYPE, Ti> {                             \
+        const char *name() { return "__convert_" #TYPE; } \
+    };
 
-template<typename To, typename Ti>
-struct CastWrapper
-{
-    Array<To> operator()(const Array<Ti> &in)
-    {
-        CastOp<To, Ti> cop;
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<To>(),
-                                                  afShortName<To>(),
-                                                  cop.name(),
-                                                  in_node, af_cast_t);
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+CAST_CFN(cfloat)
+CAST_CFN(cdouble)
+CAST_CFN(char)
+
+template<>
+struct CastOp<cfloat, cdouble> {
+    const char *name() { return "__convert_z2c"; }
 };
 
-template<typename T>
-struct CastWrapper<T, T>
-{
-    Array<T> operator()(const Array<T> &in)
-    {
-        return in;
-    }
+template<>
+struct CastOp<cdouble, cfloat> {
+    const char *name() { return "__convert_c2z"; }
 };
 
-template<typename To, typename Ti>
-Array<To> cast(const Array<Ti> &in)
-{
-    CastWrapper<To, Ti> cast_op;
-    return cast_op(in);
-}
+template<>
+struct CastOp<cfloat, cfloat> {
+    const char *name() { return "__convert_c2c"; }
+};
+
+template<>
+struct CastOp<cdouble, cdouble> {
+    const char *name() { return "__convert_z2z"; }
+};
+
+// Casting from half to unsigned char causes compilation issues. First convert
+// to short then to unsigned char
+template<>
+struct CastOp<unsigned char, common::half> {
+    const char *name() { return "(short)"; }
+};
+
+#undef CAST_FN
+#undef CAST_CFN
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
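Review note on the new `CastOp` layout: the header now relies on partial specialization over the destination type, with full specializations overriding individual type pairs. The sketch below shows that resolution order in isolation. It is only an illustration: `CastName`, `CAST_NAME` and `Half` are hypothetical names, `Half` merely stands in for `common::half`, and `name()` is made static here for brevity.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for common::half.
struct Half {};

// Primary template: no cast text is known for this destination type.
template<typename To, typename Ti>
struct CastName {
    static const char *name() { return ""; }
};

// Partial specialization on the destination type only; any source type
// yields a C-style cast to TYPE.
#define CAST_NAME(TYPE)                                     \
    template<typename Ti>                                   \
    struct CastName<TYPE, Ti> {                             \
        static const char *name() { return "(" #TYPE ")"; } \
    };

CAST_NAME(int)
CAST_NAME(float)
CAST_NAME(unsigned char)

#undef CAST_NAME

// Full specialization: wins over the partial one for this exact pair.
template<>
struct CastName<unsigned char, Half> {
    static const char *name() { return "(short)"; }
};
```

The full specialization is selected only for the exact `unsigned char` from `Half` pair. Every other source type falls back to the partial specialization for `unsigned char`, and destination types without a specialization get the empty string from the primary template.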
diff --git a/src/backend/cuda/cholesky.cpp b/src/backend/cuda/cholesky.cpp
new file mode 100644
index 0000000000..7c48dbb40c
--- /dev/null
+++ b/src/backend/cuda/cholesky.cpp
@@ -0,0 +1,128 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cholesky.hpp>
+#include <common/err_common.hpp>
+
+#include <copy.hpp>
+#include <cublas_v2.h>
+#include <cusolverDn.hpp>
+#include <identity.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <triangle.hpp>
+
+#include <common/err_common.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// cusolverStatus_t cusolverDn<>potrf_bufferSize(
+//        cusolverDnHandle_t handle,
+//        cublasFillMode_t uplo,
+//        int n,
+//        <> *A,
+//        int lda,
+//        int *Lwork );
+//
+// cusolverStatus_t cusolverDn<>potrf(
+//        cusolverDnHandle_t handle,
+//        cublasFillMode_t uplo,
+//        int n,
+//        <> *A, int lda,
+//        <> *Workspace, int Lwork,
+//        int *devInfo );
+
+template<typename T>
+struct potrf_func_def_t {
+    using potrf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t,
+                                                cublasFillMode_t, int, T *, int,
+                                                T *, int, int *);
+};
+
+template<typename T>
+struct potrf_buf_func_def_t {
+    using potrf_buf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t,
+                                                    cublasFillMode_t, int, T *,
+                                                    int, int *);
+};
+
+#define CH_FUNC_DEF(FUNC)                                         \
+    template<typename T>                                          \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func(); \
+                                                                  \
+    template<typename T>                                          \
+    typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def FUNC##_buf_func();
+
+#define CH_FUNC(FUNC, TYPE, PREFIX)                                         \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cusolverDn##PREFIX##FUNC;                                    \
+    }                                                                       \
+                                                                            \
+    template<>                                                              \
+    typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def               \
+        FUNC##_buf_func<TYPE>() {                                           \
+        return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def) &         \
+               cusolverDn##PREFIX##FUNC##_bufferSize;                       \
+    }
+
+CH_FUNC_DEF(potrf)
+CH_FUNC(potrf, float, S)
+CH_FUNC(potrf, double, D)
+CH_FUNC(potrf, cfloat, C)
+CH_FUNC(potrf, cdouble, Z)
+
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
+    Array<T> out = copyArray<T>(in);
+    *info        = cholesky_inplace(out, is_upper);
+
+    triangle<T>(out, out, is_upper, false);
+
+    return out;
+}
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
+    dim4 iDims = in.dims();
+    int N      = iDims[0];
+
+    int lwork = 0;
+
+    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
+    if (is_upper) { uplo = CUBLAS_FILL_MODE_UPPER; }
+
+    CUSOLVER_CHECK(potrf_buf_func<T>()(solverDnHandle(), uplo, N, in.get(),
+                                       in.strides()[1], &lwork));
+
+    auto workspace = memAlloc<T>(lwork);
+    auto d_info    = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(potrf_func<T>()(solverDnHandle(), uplo, N, in.get(),
+                                   in.strides()[1], workspace.get(), lwork,
+                                   d_info.get()));
+
+    // FIXME: should return h_info
+    return 0;
+}
+
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
+
+INSTANTIATE_CH(float)
+INSTANTIATE_CH(cfloat)
+INSTANTIATE_CH(double)
+INSTANTIATE_CH(cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
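Review note on `CH_FUNC_DEF` and `CH_FUNC`: the macros map an element type to the matching type-prefixed cuSOLVER routine through explicit specializations of a function template. The sketch below reproduces that dispatch without cuSOLVER so it compiles anywhere. `fakeSpotrf`, `fakeDpotrf` and `factorize` are hypothetical stand-ins, and the simplified signature is an assumption made for the example rather than the library's.

```cpp
#include <cassert>

// Hypothetical stand-ins for the type-prefixed library routines. Each one
// returns the element size it was written for, so the dispatch is observable.
int fakeSpotrf(float *, int) { return static_cast<int>(sizeof(float)); }
int fakeDpotrf(double *, int) { return static_cast<int>(sizeof(double)); }

// Function pointer type for a given element type.
template<typename T>
struct potrf_func_def_t {
    using potrf_func_def = int (*)(T *, int);
};

// Declared for every T, defined only for the supported types below.
template<typename T>
typename potrf_func_def_t<T>::potrf_func_def potrf_func();

template<>
potrf_func_def_t<float>::potrf_func_def potrf_func<float>() {
    return &fakeSpotrf;
}

template<>
potrf_func_def_t<double>::potrf_func_def potrf_func<double>() {
    return &fakeDpotrf;
}

// Generic caller: the routine is chosen at compile time from T.
template<typename T>
int factorize(T *data, int n) {
    return potrf_func<T>()(data, n);
}
```

Calling `factorize` with an unsupported element type fails at link time because `potrf_func` has no definition for it, which is the same behavior the declaration-only primary template gives in the diff.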
diff --git a/src/backend/cuda/cholesky.cu b/src/backend/cuda/cholesky.cu
deleted file mode 100644
index d785eef3ef..0000000000
--- a/src/backend/cuda/cholesky.cu
+++ /dev/null
@@ -1,180 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <err_common.hpp>
-#include <cholesky.hpp>
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <cusolverDnManager.hpp>
-#include <cublas_v2.h>
-#include <identity.hpp>
-#include <iostream>
-#include <memory.hpp>
-#include <copy.hpp>
-#include <triangle.hpp>
-
-#include <math.hpp>
-#include <err_common.hpp>
-
-namespace cuda
-{
-
-using cusolver::getDnHandle;
-
-//cusolverStatus_t cusolverDn<>potrf_bufferSize(
-//        cusolverDnHandle_t handle,
-//        cublasFillMode_t uplo,
-//        int n,
-//        <> *A,
-//        int lda,
-//        int *Lwork );
-//
-//cusolverStatus_t cusolverDn<>potrf(
-//        cusolverDnHandle_t handle,
-//        cublasFillMode_t uplo,
-//        int n,
-//        <> *A, int lda,
-//        <> *Workspace, int Lwork,
-//        int *devInfo );
-
-template<typename T>
-struct potrf_func_def_t
-{
-    typedef cusolverStatus_t (*potrf_func_def) (
-                              cusolverDnHandle_t,
-                              cublasFillMode_t,
-                              int,
-                              T *, int,
-                              T *,
-                              int, int *);
-};
-
-template<typename T>
-struct potrf_buf_func_def_t
-{
-    typedef cusolverStatus_t (*potrf_buf_func_def) (
-                              cusolverDnHandle_t,
-                              cublasFillMode_t,
-                              int,
-                              T *, int,
-                              int *);
-};
-
-#define CH_FUNC_DEF( FUNC )                                                     \
-template<typename T>                                                            \
-typename FUNC##_func_def_t<T>::FUNC##_func_def                                  \
-FUNC##_func();                                                                  \
-                                                                                \
-template<typename T>                                                            \
-typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def                          \
-FUNC##_buf_func();                                                              \
-
-#define CH_FUNC( FUNC, TYPE, PREFIX )                                                           \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>()                \
-{ return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)&cusolverDn##PREFIX##FUNC; }                 \
-                                                                                                \
-template<> typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def FUNC##_buf_func<TYPE>()    \
-{ return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def)&cusolverDn##PREFIX##FUNC##_bufferSize; }
-
-CH_FUNC_DEF( potrf )
-CH_FUNC(potrf , float  , S)
-CH_FUNC(potrf , double , D)
-CH_FUNC(potrf , cfloat , C)
-CH_FUNC(potrf , cdouble, Z)
-
-template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
-
-    Array<T> out = copyArray<T>(in);
-    *info = cholesky_inplace(out, is_upper);
-
-    if (is_upper) triangle<T, true , false>(out, out);
-    else          triangle<T, false, false>(out, out);
-
-    return out;
-}
-
-template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
-    dim4 iDims = in.dims();
-    int N = iDims[0];
-
-    int lwork = 0;
-
-    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;
-    if(is_upper)
-        uplo = CUBLAS_FILL_MODE_UPPER;
-
-    CUSOLVER_CHECK(potrf_buf_func<T>()(getDnHandle(),
-                                       uplo,
-                                       N,
-                                       in.get(), in.strides()[1],
-                                       &lwork));
-
-    T *workspace = memAlloc<T>(lwork);
-    int *d_info = memAlloc<int>(1);
-
-    CUSOLVER_CHECK(potrf_func<T>()(getDnHandle(),
-                                   uplo,
-                                   N,
-                                   in.get(), in.strides()[1],
-                                   workspace, lwork,
-                                   d_info));
-
-    memFree(workspace);
-    memFree(d_info);
-
-    //FIXME: should return h_info
-    return 0;
-}
-
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);   \
-
-
-INSTANTIATE_CH(float)
-INSTANTIATE_CH(cfloat)
-INSTANTIATE_CH(double)
-INSTANTIATE_CH(cdouble)
-}
-
-#else
-namespace cuda
-{
-
-template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);
-
-INSTANTIATE_CH(float)
-INSTANTIATE_CH(cfloat)
-INSTANTIATE_CH(double)
-INSTANTIATE_CH(cdouble)
-
-}
-
-#endif
diff --git a/src/backend/cuda/cholesky.hpp b/src/backend/cuda/cholesky.hpp
index cae0484f1d..4a97aab757 100644
--- a/src/backend/cuda/cholesky.hpp
+++ b/src/backend/cuda/cholesky.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
 
-    template<typename T>
-    int cholesky_inplace(Array<T> &in, const bool is_upper);
-}
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/compile_module.cpp b/src/backend/cuda/compile_module.cpp
new file mode 100644
index 0000000000..d7ee8182bc
--- /dev/null
+++ b/src/backend/cuda/compile_module.cpp
@@ -0,0 +1,506 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/compile_module.hpp>  //compileModule & loadModuleFromDisk
+#include <common/kernel_cache.hpp>    //getKernel(Module&, ...)
+
+#include <Module.hpp>
+#include <common/Logger.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/internal_enums.hpp>
+#include <common/util.hpp>
+#include <device_manager.hpp>
+#include <kernel_headers/jit_cuh.hpp>
+#include <nvrtc_kernel_headers/Binary_hpp.hpp>
+#include <nvrtc_kernel_headers/Param_hpp.hpp>
+#include <nvrtc_kernel_headers/Transform_hpp.hpp>
+#include <nvrtc_kernel_headers/assign_kernel_param_hpp.hpp>
+#include <nvrtc_kernel_headers/backend_hpp.hpp>
+#include <nvrtc_kernel_headers/cuComplex_h.hpp>
+#include <nvrtc_kernel_headers/cuda_fp16_h.hpp>
+#include <nvrtc_kernel_headers/cuda_fp16_hpp.hpp>
+#include <nvrtc_kernel_headers/defines_h.hpp>
+#include <nvrtc_kernel_headers/dims_param_hpp.hpp>
+#include <nvrtc_kernel_headers/half_hpp.hpp>
+#include <nvrtc_kernel_headers/internal_enums_hpp.hpp>
+#include <nvrtc_kernel_headers/interp_hpp.hpp>
+#include <nvrtc_kernel_headers/kernel_type_hpp.hpp>
+#include <nvrtc_kernel_headers/math_constants_h.hpp>
+#include <nvrtc_kernel_headers/math_hpp.hpp>
+#include <nvrtc_kernel_headers/minmax_op_hpp.hpp>
+#include <nvrtc_kernel_headers/optypes_hpp.hpp>
+#include <nvrtc_kernel_headers/shared_hpp.hpp>
+#include <nvrtc_kernel_headers/traits_hpp.hpp>
+#include <nvrtc_kernel_headers/types_hpp.hpp>
+#include <nvrtc_kernel_headers/utility_hpp.hpp>
+#include <nvrtc_kernel_headers/vector_functions_h.hpp>
+#include <nvrtc_kernel_headers/vector_types_h.hpp>
+#include <nvrtc_kernel_headers/version_h.hpp>
+#include <optypes.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+#include <af/version.h>
+
+#include <nvrtc.h>
+
+#include <algorithm>
+#include <array>
+#include <chrono>
+#include <cmath>
+#include <cstdlib>
+#include <cstring>
+#include <fstream>
+#include <iterator>
+#include <memory>
+#include <numeric>
+#include <string>
+#include <type_traits>
+#include <utility>
+#include <vector>
+
+using arrayfire::common::getCacheDirectory;
+using arrayfire::common::makeTempFilename;
+using arrayfire::common::removeFile;
+using arrayfire::common::renameFile;
+using arrayfire::cuda::getComputeCapability;
+using arrayfire::cuda::getDeviceProp;
+using detail::Module;
+using nonstd::span;
+using std::accumulate;
+using std::array;
+using std::back_insert_iterator;
+using std::begin;
+using std::end;
+using std::extent;
+using std::find_if;
+using std::make_pair;
+using std::ofstream;
+using std::pair;
+using std::string;
+using std::to_string;
+using std::transform;
+using std::unique_ptr;
+using std::vector;
+using std::chrono::duration_cast;
+using std::chrono::high_resolution_clock;
+using std::chrono::milliseconds;
+
+constexpr size_t linkLogSize = 2048;
+
+#define CU_LINK_CHECK(fn)                                               \
+    do {                                                                \
+        CUresult res = (fn);                                            \
+        if (res == CUDA_SUCCESS) break;                                 \
+        array<char, linkLogSize + 512> cu_err_msg;                      \
+        const char *cu_err_name;                                        \
+        cuGetErrorName(res, &cu_err_name);                              \
+        snprintf(cu_err_msg.data(), cu_err_msg.size(),                  \
+                 "CU Link Error %s(%d): %s\n", cu_err_name, (int)(res), \
+                 linkError);                                            \
+        AF_ERROR(cu_err_msg.data(), AF_ERR_INTERNAL);                   \
+    } while (0)
+
+#define NVRTC_CHECK(fn)                                                   \
+    do {                                                                  \
+        nvrtcResult res = (fn);                                           \
+        if (res == NVRTC_SUCCESS) break;                                  \
+        array<char, 4096> nvrtc_err_msg;                                  \
+        snprintf(nvrtc_err_msg.data(), nvrtc_err_msg.size(),              \
+                 "NVRTC Error(%d): %s\n", res, nvrtcGetErrorString(res)); \
+        AF_ERROR(nvrtc_err_msg.data(), AF_ERR_INTERNAL);                  \
+    } while (0)
+
+#define NVRTC_COMPILE_CHECK(fn)                              \
+    do {                                                     \
+        nvrtcResult res = (fn);                              \
+        if (res == NVRTC_SUCCESS) break;                     \
+        size_t logSize;                                      \
+        nvrtcGetProgramLogSize(prog, &logSize);              \
+        vector<char> log(logSize + 1);                       \
+        nvrtcGetProgramLog(prog, log.data());                \
+        log[logSize] = '\0';                                 \
+        array<char, 4096> nvrtc_err_msg;                     \
+        snprintf(nvrtc_err_msg.data(), nvrtc_err_msg.size(), \
+                 "NVRTC Error(%d): %s\nLog: \n%s\n", res,    \
+                 nvrtcGetErrorString(res), log.data());      \
+        AF_ERROR(nvrtc_err_msg.data(), AF_ERR_INTERNAL);     \
+    } while (0)
+
+spdlog::logger *getLogger() {
+    static std::shared_ptr<spdlog::logger> logger(
+        arrayfire::common::loggerFactory("jit"));
+    return logger.get();
+}
+
+string getKernelCacheFilename(const int device, const string &key) {
+    const auto computeFlag = getComputeCapability(device);
+    const string computeVersion =
+        to_string(computeFlag.first) + to_string(computeFlag.second);
+
+    return "KER" + key + "_CU_" + computeVersion + "_AF_" +
+           to_string(AF_API_VERSION_CURRENT) + ".bin";
+}
+
+namespace arrayfire {
+namespace common {
+
+Module compileModule(const string &moduleKey, span<const string> sources,
+                     span<const string> opts, span<const string> kInstances,
+                     const bool sourceIsJIT) {
+    nvrtcProgram prog;
+    using namespace arrayfire::cuda;
+    if (sourceIsJIT) {
+        constexpr const char *header_names[] = {
+            "utility",        "cuda_fp16.hpp",      "cuda_fp16.h",
+            "vector_types.h", "vector_functions.h",
+        };
+        constexpr size_t numHeaders = extent<decltype(header_names)>::value;
+        array<const char *, numHeaders> headers = {
+            "", cuda_fp16_hpp, cuda_fp16_h, vector_types_h, vector_functions_h,
+        };
+        static_assert(headers.size() == numHeaders,
+                      "headers array contains fewer sources than header_names");
+        NVRTC_CHECK(nvrtcCreateProgram(&prog, sources[0].c_str(),
+                                       moduleKey.c_str(), numHeaders,
+                                       headers.data(), header_names));
+    } else {
+        constexpr static const char *includeNames[] = {
+            "math.h",          // DUMMY ENTRY TO SATISFY cuComplex_h inclusion
+            "stdbool.h",       // DUMMY ENTRY TO SATISFY af/defines.h inclusion
+            "stdlib.h",        // DUMMY ENTRY TO SATISFY af/defines.h inclusion
+            "vector_types.h",  // DUMMY ENTRY TO SATISFY cuComplex_h inclusion
+            "utility",         // DUMMY ENTRY TO SATISFY utility inclusion
+            "backend.hpp",
+            "cuComplex.h",
+            "jit.cuh",
+            "math.hpp",
+            "optypes.hpp",
+            "Param.hpp",
+            "shared.hpp",
+            "types.hpp",
+            "cuda_fp16.hpp",
+            "cuda_fp16.h",
+            "common/Binary.hpp",
+            "common/Transform.hpp",
+            "common/half.hpp",
+            "common/kernel_type.hpp",
+            "af/traits.hpp",
+            "interp.hpp",
+            "math_constants.h",
+            "af/defines.h",
+            "af/version.h",
+            "utility.hpp",
+            "assign_kernel_param.hpp",
+            "dims_param.hpp",
+            "common/internal_enums.hpp",
+            "minmax_op.hpp",
+            "vector_functions.h",
+        };
+
+        constexpr size_t numHeaders = extent<decltype(includeNames)>::value;
+        static const array<string, numHeaders> sourceStrings = {{
+            string(""),  // DUMMY ENTRY TO SATISFY cuComplex_h inclusion
+            string(""),  // DUMMY ENTRY TO SATISFY af/defines.h inclusion
+            string(""),  // DUMMY ENTRY TO SATISFY af/defines.h inclusion
+            string(""),  // DUMMY ENTRY TO SATISFY cuComplex_h inclusion
+            string(""),  // DUMMY ENTRY TO SATISFY utility inclusion
+            string(backend_hpp, backend_hpp_len),
+            string(cuComplex_h, cuComplex_h_len),
+            string(jit_cuh, jit_cuh_len),
+            string(math_hpp, math_hpp_len),
+            string(optypes_hpp, optypes_hpp_len),
+            string(Param_hpp, Param_hpp_len),
+            string(shared_hpp, shared_hpp_len),
+            string(types_hpp, types_hpp_len),
+            string(cuda_fp16_hpp, cuda_fp16_hpp_len),
+            string(cuda_fp16_h, cuda_fp16_h_len),
+            string(Binary_hpp, Binary_hpp_len),
+            string(Transform_hpp, Transform_hpp_len),
+            string(half_hpp, half_hpp_len),
+            string(kernel_type_hpp, kernel_type_hpp_len),
+            string(traits_hpp, traits_hpp_len),
+            string(interp_hpp, interp_hpp_len),
+            string(math_constants_h, math_constants_h_len),
+            string(defines_h, defines_h_len),
+            string(version_h, version_h_len),
+            string(utility_hpp, utility_hpp_len),
+            string(assign_kernel_param_hpp, assign_kernel_param_hpp_len),
+            string(dims_param_hpp, dims_param_hpp_len),
+            string(internal_enums_hpp, internal_enums_hpp_len),
+            string(minmax_op_hpp, minmax_op_hpp_len),
+            string(vector_functions_h, vector_functions_h_len),
+        }};
+
+        static const char *headers[] = {
+            sourceStrings[0].c_str(),  sourceStrings[1].c_str(),
+            sourceStrings[2].c_str(),  sourceStrings[3].c_str(),
+            sourceStrings[4].c_str(),  sourceStrings[5].c_str(),
+            sourceStrings[6].c_str(),  sourceStrings[7].c_str(),
+            sourceStrings[8].c_str(),  sourceStrings[9].c_str(),
+            sourceStrings[10].c_str(), sourceStrings[11].c_str(),
+            sourceStrings[12].c_str(), sourceStrings[13].c_str(),
+            sourceStrings[14].c_str(), sourceStrings[15].c_str(),
+            sourceStrings[16].c_str(), sourceStrings[17].c_str(),
+            sourceStrings[18].c_str(), sourceStrings[19].c_str(),
+            sourceStrings[20].c_str(), sourceStrings[21].c_str(),
+            sourceStrings[22].c_str(), sourceStrings[23].c_str(),
+            sourceStrings[24].c_str(), sourceStrings[25].c_str(),
+            sourceStrings[26].c_str(), sourceStrings[27].c_str(),
+            sourceStrings[28].c_str(), sourceStrings[29].c_str()};
+        static_assert(extent<decltype(headers)>::value == numHeaders,
+                      "headers array contains fewer sources than includeNames");
+        NVRTC_CHECK(nvrtcCreateProgram(&prog, sources[0].c_str(),
+                                       moduleKey.c_str(), numHeaders, headers,
+                                       includeNames));
+    }
+
+    int device       = getActiveDeviceId();
+    auto computeFlag = getComputeCapability(device);
+    array<char, 32> arch;
+    snprintf(arch.data(), arch.size(), "--gpu-architecture=compute_%d%d",
+             computeFlag.first, computeFlag.second);
+    vector<const char *> compiler_options = {
+        arch.data(),
+#if CUDA_VERSION >= 11000
+        "--std=c++17",
+#else
+        "--std=c++14",
+#endif
+        "--device-as-default-execution-space",
+#ifdef AF_WITH_FAST_MATH
+        "--use_fast_math",
+        "-DAF_WITH_FAST_MATH",
+#endif
+#if !(defined(NDEBUG) || defined(__aarch64__) || defined(__LP64__))
+        "--device-debug",
+        "--generate-line-info"
+#endif
+    };
+    if (!sourceIsJIT) {
+        transform(begin(opts), end(opts),
+                  back_insert_iterator<vector<const char *>>(compiler_options),
+                  [](const string &s) { return s.data(); });
+
+        for (auto &instantiation : kInstances) {
+            NVRTC_CHECK(nvrtcAddNameExpression(prog, instantiation.c_str()));
+        }
+    }
+
+    auto compile = high_resolution_clock::now();
+    NVRTC_COMPILE_CHECK(nvrtcCompileProgram(prog, compiler_options.size(),
+                                            compiler_options.data()));
+    auto compile_end = high_resolution_clock::now();
+    size_t ptx_size;
+    vector<char> ptx;
+    NVRTC_CHECK(nvrtcGetPTXSize(prog, &ptx_size));
+    ptx.resize(ptx_size);
+    NVRTC_CHECK(nvrtcGetPTX(prog, ptx.data()));
+
+    char linkInfo[linkLogSize]  = {0};
+    char linkError[linkLogSize] = {0};
+
+    CUlinkState linkState;
+    CUjit_option linkOptions[] = {
+        CU_JIT_INFO_LOG_BUFFER, CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
+        CU_JIT_ERROR_LOG_BUFFER, CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
+        CU_JIT_LOG_VERBOSE};
+
+    void *linkOptionValues[] = {
+        linkInfo, reinterpret_cast<void *>(linkLogSize), linkError,
+        reinterpret_cast<void *>(linkLogSize), reinterpret_cast<void *>(1)};
+
+    auto link = high_resolution_clock::now();
+    CU_LINK_CHECK(cuLinkCreate(5, linkOptions, linkOptionValues, &linkState));
+    CU_LINK_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void *)ptx.data(),
+                                ptx.size(), moduleKey.c_str(), 0, NULL, NULL));
+
+    void *cubin = nullptr;
+    size_t cubinSize;
+
+    CUmodule modOut = nullptr;
+    CU_LINK_CHECK(cuLinkComplete(linkState, &cubin, &cubinSize));
+    CU_CHECK(cuModuleLoadData(&modOut, cubin));
+    auto link_end = high_resolution_clock::now();
+
+    Module retVal(modOut);
+    if (!sourceIsJIT) {
+        for (auto &instantiation : kInstances) {
+            // name points to memory owned by prog (freed on program destroy)
+            const char *name = nullptr;
+            NVRTC_CHECK(
+                nvrtcGetLoweredName(prog, instantiation.c_str(), &name));
+            retVal.add(instantiation, string(name, strlen(name)));
+        }
+    }
+
+#ifdef AF_CACHE_KERNELS_TO_DISK
+    // save kernel in cache
+    const string &cacheDirectory = getCacheDirectory();
+    if (!cacheDirectory.empty()) {
+        const string cacheFile = cacheDirectory + AF_PATH_SEPARATOR +
+                                 getKernelCacheFilename(device, moduleKey);
+        const string tempFile =
+            cacheDirectory + AF_PATH_SEPARATOR + makeTempFilename();
+        try {
+            // write the mangled-name map (non-JIT only), followed by the
+            // CUBIN hash, size and data
+            ofstream out(tempFile, std::ios::binary);
+            if (!sourceIsJIT) {
+                size_t mangledNamesListSize = retVal.map().size();
+                out.write(reinterpret_cast<const char *>(&mangledNamesListSize),
+                          sizeof(mangledNamesListSize));
+                for (auto &iter : retVal.map()) {
+                    size_t kySize   = iter.first.size();
+                    size_t vlSize   = iter.second.size();
+                    const char *key = iter.first.c_str();
+                    const char *val = iter.second.c_str();
+                    out.write(reinterpret_cast<const char *>(&kySize),
+                              sizeof(kySize));
+                    out.write(key, iter.first.size());
+                    out.write(reinterpret_cast<const char *>(&vlSize),
+                              sizeof(vlSize));
+                    out.write(val, iter.second.size());
+                }
+            }
+
+            // compute CUBIN hash
+            const size_t cubinHash = deterministicHash(cubin, cubinSize);
+
+            out.write(reinterpret_cast<const char *>(&cubinHash),
+                      sizeof(cubinHash));
+            out.write(reinterpret_cast<const char *>(&cubinSize),
+                      sizeof(cubinSize));
+            out.write(static_cast<const char *>(cubin), cubinSize);
+            out.close();
+
+            // Try to rename the temporary file to the final cache file. If
+            // this fails, another thread has already finished compiling this
+            // kernel, so discard the temporary file.
+            if (!renameFile(tempFile, cacheFile)) { removeFile(tempFile); }
+        } catch (const std::ios_base::failure &e) {
+            AF_TRACE("{{{:<30} : failed saving binary to {} for {}, {}}}",
+                     moduleKey, cacheFile, getDeviceProp(device).name,
+                     e.what());
+        }
+    }
+#endif
+
+    CU_LINK_CHECK(cuLinkDestroy(linkState));
+    NVRTC_CHECK(nvrtcDestroyProgram(&prog));
+
+    // Skip the --std flag (index 1): it is fixed per build (c++14 or c++17)
+    // and adds no useful information to the trace.
+    auto listOpts = [](vector<const char *> &in) {
+        return accumulate(begin(in) + 2, end(in), string(in[0]),
+                          [](const string &lhs, const string &rhs) {
+                              return lhs + ", " + rhs;
+                          });
+    };
+    AF_TRACE("{{ {:<20} : compile:{:>5} ms, link:{:>4} ms, {{ {} }}, {} }}",
+             moduleKey,
+             duration_cast<milliseconds>(compile_end - compile).count(),
+             duration_cast<milliseconds>(link_end - link).count(),
+             listOpts(compiler_options), getDeviceProp(device).name);
+    return retVal;
+}
+
+Module loadModuleFromDisk(const int device, const string &moduleKey,
+                          const bool isJIT) {
+    const string &cacheDirectory = getCacheDirectory();
+    if (cacheDirectory.empty()) return Module{nullptr};
+
+    const string cacheFile = cacheDirectory + AF_PATH_SEPARATOR +
+                             getKernelCacheFilename(device, moduleKey);
+
+    CUmodule modOut = nullptr;
+    Module retVal{nullptr};
+    try {
+        std::ifstream in(cacheFile, std::ios::binary);
+        if (!in) {
+            AF_TRACE("{{{:<20} : Unable to open {} for {}}}", moduleKey,
+                     cacheFile, getDeviceProp(device).name);
+            removeFile(cacheFile);  // Remove if exists
+            return Module{nullptr};
+        }
+        in.exceptions(std::ios::failbit | std::ios::badbit);
+
+        if (!isJIT) {
+            size_t mangledListSize = 0;
+            in.read(reinterpret_cast<char *>(&mangledListSize),
+                    sizeof(mangledListSize));
+            for (size_t i = 0; i < mangledListSize; ++i) {
+                size_t keySize = 0;
+                in.read(reinterpret_cast<char *>(&keySize), sizeof(keySize));
+                vector<char> key;
+                key.resize(keySize);
+                in.read(key.data(), keySize);
+
+                size_t itemSize = 0;
+                in.read(reinterpret_cast<char *>(&itemSize), sizeof(itemSize));
+                vector<char> item;
+                item.resize(itemSize);
+                in.read(item.data(), itemSize);
+
+                retVal.add(string(key.data(), keySize),
+                           string(item.data(), itemSize));
+            }
+        }
+
+        size_t cubinHash = 0;
+        in.read(reinterpret_cast<char *>(&cubinHash), sizeof(cubinHash));
+        size_t cubinSize = 0;
+        in.read(reinterpret_cast<char *>(&cubinSize), sizeof(cubinSize));
+        vector<char> cubin(cubinSize);
+        in.read(cubin.data(), cubinSize);
+        in.close();
+
+        // check CUBIN binary data has not been corrupted
+        const size_t recomputedHash =
+            deterministicHash(cubin.data(), cubinSize);
+        if (recomputedHash != cubinHash) {
+            AF_ERROR("Module on disk seems to be corrupted", AF_ERR_LOAD_SYM);
+        }
+
+        CU_CHECK(cuModuleLoadData(&modOut, cubin.data()));
+
+        AF_TRACE("{{{:<20} : loaded from {} for {} }}", moduleKey, cacheFile,
+                 getDeviceProp(device).name);
+
+        retVal.set(modOut);
+    } catch (const std::ios_base::failure &e) {
+        AF_TRACE("{{{:<20} : Unable to read {} for {}}}", moduleKey, cacheFile,
+                 getDeviceProp(device).name);
+        removeFile(cacheFile);
+    } catch (const AfError &e) {
+        if (e.getError() == AF_ERR_LOAD_SYM) {
+            AF_TRACE(
+                "{{{:<20} : Corrupt binary({}) found on disk for {}, removed}}",
+                moduleKey, cacheFile, getDeviceProp(device).name);
+        } else {
+            if (modOut != nullptr) { CU_CHECK(cuModuleUnload(modOut)); }
+            AF_TRACE(
+                "{{{:<20} : cuModuleLoadData failed with content from {} for "
+                "{}, {}}}",
+                moduleKey, cacheFile, getDeviceProp(device).name, e.what());
+        }
+        removeFile(cacheFile);
+    }
+    return retVal;
+}
+
+arrayfire::cuda::Kernel getKernel(const Module &mod, const string &nameExpr,
+                                  const bool sourceWasJIT) {
+    std::string name  = (sourceWasJIT ? nameExpr : mod.mangledName(nameExpr));
+    CUfunction kernel = nullptr;
+    CU_CHECK(cuModuleGetFunction(&kernel, mod.get(), name.c_str()));
+    return {nameExpr, mod.get(), kernel};
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/cuda/complex.hpp b/src/backend/cuda/complex.hpp
index 82304b9a22..81f39dd785 100644
--- a/src/backend/cuda/complex.hpp
+++ b/src/backend/cuda/complex.hpp
@@ -6,99 +6,88 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <optypes.hpp>
-#include <err_cuda.hpp>
-#include <JIT/BinaryNode.hpp>
-#include <JIT/UnaryNode.hpp>
-
-namespace cuda
-{
-    template<typename T> static const std::string cplx_name() { return "@___noop"; }
-    template<> STATIC_ const std::string cplx_name<cfloat >() { return cuMangledName<float , true>("___cplx"); }
-    template<> STATIC_ const std::string cplx_name<cdouble>() { return cuMangledName<double, true>("___cplx"); }
-
-    template<typename T> static const std::string real_name() { return "@___noop"; }
-    template<> STATIC_ const std::string real_name<cfloat >() { return cuMangledName<cfloat , false>("___real"); }
-    template<> STATIC_ const std::string real_name<cdouble>() { return cuMangledName<cdouble, false>("___real"); }
 
-    template<typename T> static const std::string imag_name() { return "@___noop"; }
-    template<> STATIC_ const std::string imag_name<cfloat >() { return cuMangledName<cfloat , false>("___imag"); }
-    template<> STATIC_ const std::string imag_name<cdouble>() { return cuMangledName<cdouble, false>("___imag"); }
+#pragma once
 
-    template<typename T> static const std::string abs_name() { return "@___noop"; }
-    template<> STATIC_ const std::string abs_name<float  >() { return cuMangledName<float  , false>("___abs"); }
-    template<> STATIC_ const std::string abs_name<double >() { return cuMangledName<double , false>("___abs"); }
-    template<> STATIC_ const std::string abs_name<cfloat >() { return cuMangledName<cfloat , false>("___abs"); }
-    template<> STATIC_ const std::string abs_name<cdouble>() { return cuMangledName<cdouble, false>("___abs"); }
-
-    template<typename T> static const std::string conj_name() { return "@___noop"; }
-    template<> STATIC_ const std::string conj_name<cfloat >() { return cuMangledName<cfloat , false>("___conj"); }
-    template<> STATIC_ const std::string conj_name<cdouble>() { return cuMangledName<cdouble, false>("___conj"); }
+#include <Array.hpp>
+#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <common/jit/UnaryNode.hpp>
+#include <optypes.hpp>
+#include <af/dim4.hpp>
 
-    template<typename To, typename Ti>
-    Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs, const af::dim4 &odims)
-    {
-        JIT::Node_ptr lhs_node = lhs.getNode();
-        JIT::Node_ptr rhs_node = rhs.getNode();
+namespace arrayfire {
+namespace cuda {
+template<typename To, typename Ti>
+Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<To, Ti, af_cplx2_t>(lhs, rhs, odims);
+}
 
-        JIT::BinaryNode *node = new JIT::BinaryNode(irname<To>(),
-                                                    afShortName<To>(),
-                                                    cplx_name<To>(),
-                                                    lhs_node,
-                                                    rhs_node, (int)(af_cplx2_t));
+template<typename To, typename Ti>
+Array<To> real(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__creal", in_node, af_real_t);
 
-        return createNodeArray<To>(odims, JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-    template<typename To, typename Ti>
-    Array<To> real(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<To>(),
-                                                  afShortName<To>(),
-                                                  real_name<Ti>(),
-                                                  in_node, af_real_t);
+template<typename To, typename Ti>
+Array<To> imag(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__cimag", in_node, af_imag_t);
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-    template<typename To, typename Ti>
-    Array<To> imag(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<To>(),
-                                                  afShortName<To>(),
-                                                  imag_name<Ti>(),
-                                                  in_node, af_imag_t);
+template<typename T>
+static const char *abs_name() {
+    return "fabs";
+}
+template<>
+inline const char *abs_name<cfloat>() {
+    return "__cabsf";
+}
+template<>
+inline const char *abs_name<cdouble>() {
+    return "__cabs";
+}
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+template<typename To, typename Ti>
+Array<To> abs(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              abs_name<Ti>(), in_node, af_abs_t);
 
-    template<typename To, typename Ti>
-    Array<To> abs(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<To>(),
-                                                  afShortName<To>(),
-                                                  abs_name<Ti>(),
-                                                  in_node, af_abs_t);
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+template<typename T>
+static const char *conj_name() {
+    return "__noop";
+}
+template<>
+inline const char *conj_name<cfloat>() {
+    return "__cconjf";
+}
+template<>
+inline const char *conj_name<cdouble>() {
+    return "__cconj";
+}
 
-    template<typename T>
-    Array<T> conj(const Array<T> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<T>(),
-                                                  afShortName<T>(),
-                                                  conj_name<T>(),
-                                                  in_node, af_conj_t);
+template<typename T>
+Array<T> conj(const Array<T> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<T>::af_type),
+                              conj_name<T>(), in_node, af_conj_t);
 
-        return createNodeArray<T>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<T>(in.dims(), common::Node_ptr(node));
 }
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/convolve.cpp b/src/backend/cuda/convolve.cpp
index 4cf3642d5c..043bfdcc9e 100644
--- a/src/backend/cuda/convolve.cpp
+++ b/src/backend/cuda/convolve.cpp
@@ -7,61 +7,65 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <cast.hpp>
+#include <common/half.hpp>
 #include <convolve.hpp>
-#include <kernel/convolve.hpp>
 #include <err_cuda.hpp>
+#include <kernel/convolve.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
+#include <type_traits>
 
 using af::dim4;
+using arrayfire::common::half;
+using std::conditional;
+using std::is_same;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind)
-{
-    const dim4 sDims    = signal.dims();
-    const dim4 fDims    = filter.dims();
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand) {
+    const dim4 &sDims = signal.dims();
+    const dim4 &fDims = filter.dims();
 
     dim4 oDims(1);
     if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sDims[d]+fDims[d]-1;
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
             } else {
-                oDims[d] = (d<baseDim ? sDims[d]+fDims[d]-1 : sDims[d]);
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
             }
         }
     } else {
         oDims = sDims;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fDims[i];
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
         }
     }
 
-    Array<T> out   = createEmptyArray<T>(oDims);
+    Array<T> out = createEmptyArray<T>(oDims);
 
-    kernel::convolve_nd<T, accT, baseDim, expand>(out, signal, filter, kind);
+    kernel::convolve_nd<T, accT>(out, signal, filter, kind, rank, expand);
 
     return out;
 }
 
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter)
-{
-    const dim4 cfDims   = c_filter.dims();
-    const dim4 rfDims   = r_filter.dims();
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand) {
+    const dim4 &cfDims = c_filter.dims();
+    const dim4 &rfDims = r_filter.dims();
 
-    const dim_t cfLen= cfDims.elements();
-    const dim_t rfLen= rfDims.elements();
+    const dim_t cfLen = cfDims.elements();
+    const dim_t rfLen = rfDims.elements();
 
-    const dim4 sDims = signal.dims();
-    dim4 tDims = sDims;
-    dim4 oDims = sDims;
+    const dim4 &sDims = signal.dims();
+    dim4 tDims        = sDims;
+    dim4 oDims        = sDims;
 
     if (expand) {
         tDims[0] += cfLen - 1;
@@ -69,32 +73,36 @@ Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<ac
         oDims[1] += rfLen - 1;
     }
 
-    Array<T> temp= createEmptyArray<T>(tDims);
-    Array<T> out = createEmptyArray<T>(oDims);
+    Array<T> temp = createEmptyArray<T>(tDims);
+    Array<T> out  = createEmptyArray<T>(oDims);
 
-    kernel::convolve2<T, accT, 0, expand>(temp, signal, c_filter);
-    kernel::convolve2<T, accT, 1, expand>(out, temp, r_filter);
+    kernel::convolve2<T, accT>(temp, signal, c_filter, 0, expand);
+    kernel::convolve2<T, accT>(out, temp, r_filter, 1, expand);
 
     return out;
 }
 
-#define INSTANTIATE(T, accT)                                            \
-    template Array<T> convolve <T, accT, 1, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 1, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve2<T, accT, true >(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter); \
-    template Array<T> convolve2<T, accT, false>(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
+#define INSTANTIATE(T, accT)                                                   \
+    template Array<T> convolve<T, accT>(Array<T> const &, Array<accT> const &, \
+                                        AF_BATCH_KIND, const int, const bool); \
+    template Array<T> convolve2<T, accT>(Array<T> const &,                     \
+                                         Array<accT> const &,                  \
+                                         Array<accT> const &, const bool);
 
 INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
-
-}
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+#undef INSTANTIATE
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/convolve.hpp b/src/backend/cuda/convolve.hpp
index 513745736d..b7faa73f00 100644
--- a/src/backend/cuda/convolve.hpp
+++ b/src/backend/cuda/convolve.hpp
@@ -8,15 +8,34 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind);
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand);
 
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand);
 
-}
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation);
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> &convolved_output, af::dim4 stride,
+                           af::dim4 padding, af::dim4 dilation);
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/convolveNN.cpp b/src/backend/cuda/convolveNN.cpp
new file mode 100644
index 0000000000..d4be5d9616
--- /dev/null
+++ b/src/backend/cuda/convolveNN.cpp
@@ -0,0 +1,540 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <convolve.hpp>
+
+#include <Array.hpp>
+#include <blas.hpp>
+#include <common/cast.hpp>
+#include <common/half.hpp>
+#include <common/indexing_helpers.hpp>
+#include <common/moddims.hpp>
+#include <common/unique_handle.hpp>
+#ifdef WITH_CUDNN
+#include <cudnn.hpp>
+#endif
+#include <err_cuda.hpp>
+#include <kernel/convolve.hpp>
+#include <platform.hpp>
+#include <reorder.hpp>
+#include <transpose.hpp>
+#include <unwrap.hpp>
+#include <wrap.hpp>
+#include <af/dim4.hpp>
+
+#include <type_traits>
+#include <utility>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::flip;
+using arrayfire::common::half;
+using arrayfire::common::make_handle;
+using arrayfire::common::modDims;
+using std::conditional;
+using std::is_same;
+using std::pair;
+using std::tie;
+using std::vector;
+
+namespace arrayfire {
+namespace cuda {
+
+#ifdef WITH_CUDNN
+
+auto getLogger() { return getCudnnPlugin().getLogger(); }
+
+template<typename Desc, typename T>
+auto toCudnn(Array<T> arr) {
+    auto descriptor = make_handle<Desc>();
+    cudnnSet(descriptor, getCudnnDataType<T>(), arr.dims());
+    return descriptor;
+}
+
+template<typename T>
+using scale_type =
+    typename conditional<is_same<T, double>::value, double, float>::type;
+
+pair<cudnnConvolutionFwdAlgo_t, size_t> getForwardAlgorithm(
+    cudnnHandle_t cudnn, cudnnTensorDescriptor_t input_descriptor,
+    cudnnFilterDescriptor_t filter_descriptor,
+    cudnnConvolutionDescriptor_t convolution_descriptor,
+    cudnnTensorDescriptor_t output_descriptor) {
+    cudnnConvolutionFwdAlgo_t convolution_algorithm;
+    size_t workspace_bytes = 0;
+
+    auto version = getCudnnPlugin().getVersion();
+    if (version.major() >= 8) {
+        int maxAlgoCount = 0;
+        CUDNN_CHECK(cuda::cudnnGetConvolutionForwardAlgorithmMaxCount(
+            cudnn, &maxAlgoCount));
+
+        vector<cudnnConvolutionFwdAlgoPerf_t> perfResults(maxAlgoCount);
+        int returnAlgoCount = 0;
+        CUDNN_CHECK(cuda::cudnnFindConvolutionForwardAlgorithm(
+            cudnn, input_descriptor, filter_descriptor, convolution_descriptor,
+            output_descriptor, maxAlgoCount, &returnAlgoCount,
+            perfResults.data()));
+
+        for (int i = 0; i < returnAlgoCount; ++i) {
+            if (perfResults[i].status == CUDNN_STATUS_SUCCESS) {
+                convolution_algorithm = perfResults[i].algo;
+                workspace_bytes       = perfResults[i].memory;
+                break;
+            }
+        }
+    } else {
+        const int memory_limit =
+            0;  // TODO: set to remaining space in memory manager?
+        CUDNN_CHECK(cuda::cudnnGetConvolutionForwardAlgorithm(
+            cudnn, input_descriptor, filter_descriptor, convolution_descriptor,
+            output_descriptor, CUDNN_CONVOLUTION_FWD_PREFER_FASTEST,
+            memory_limit, &convolution_algorithm));
+        CUDNN_CHECK(cuda::cudnnGetConvolutionForwardWorkspaceSize(
+            cudnn, input_descriptor, filter_descriptor, convolution_descriptor,
+            output_descriptor, convolution_algorithm, &workspace_bytes));
+    }
+
+    return {convolution_algorithm, workspace_bytes};
+}
+
+template<typename T>
+Array<T> convolve2_cudnn(const Array<T> &signal, const Array<T> &filter,
+                         const dim4 &stride, const dim4 &padding,
+                         const dim4 &dilation) {
+    cudnnHandle_t cudnn = nnHandle();
+
+    cudnnDataType_t cudnn_dtype = getCudnnDataType<T>();
+    auto input_descriptor       = toCudnn<cudnnTensorDescriptor_t>(signal);
+    auto filter_descriptor      = toCudnn<cudnnFilterDescriptor_t>(filter);
+
+    // create convolution descriptor
+    auto convolution_descriptor = make_handle<cudnnConvolutionDescriptor_t>();
+
+    CUDNN_CHECK(cuda::cudnnSetConvolution2dDescriptor(
+        convolution_descriptor, padding[1], padding[0], stride[1], stride[0],
+        dilation[1], dilation[0], CUDNN_CONVOLUTION, cudnn_dtype));
+
+    // get output dimensions
+    const int tensorDims = 4;
+    int convolved_output_dim[tensorDims];
+    CUDNN_CHECK(cuda::cudnnGetConvolutionNdForwardOutputDim(
+        convolution_descriptor, input_descriptor, filter_descriptor, tensorDims,
+        convolved_output_dim));
+
+    // unpack output dimensions (reported in NCHW order)
+    const int n_out = convolved_output_dim[0];
+    const int c_out = convolved_output_dim[1];
+    const int h_out = convolved_output_dim[2];
+    const int w_out = convolved_output_dim[3];
+
+    // prepare output array and its descriptor
+    dim4 odims(w_out, h_out, c_out, n_out);
+    Array<T> out = createEmptyArray<T>(odims);
+
+    auto output_descriptor = toCudnn<cudnnTensorDescriptor_t>(out);
+
+    // get convolution algorithm
+    cudnnConvolutionFwdAlgo_t convolution_algorithm;
+    size_t workspace_bytes = 0;
+
+    tie(convolution_algorithm, workspace_bytes) =
+        getForwardAlgorithm(cudnn, input_descriptor, filter_descriptor,
+                            convolution_descriptor, output_descriptor);
+
+    auto workspace_buffer = memAlloc<char>(workspace_bytes);
+
+    // perform convolution
+    auto alpha = scalar<scale_type<T>>(1.0);
+    auto beta  = scalar<scale_type<T>>(0.0);
+    CUDNN_CHECK(cuda::cudnnConvolutionForward(
+        cudnn, &alpha, input_descriptor, signal.device(), filter_descriptor,
+        filter.device(), convolution_descriptor, convolution_algorithm,
+        (void *)workspace_buffer.get(), workspace_bytes, &beta,
+        output_descriptor, out.device()));
+
+    return out;
+}
+
+template<typename T>
+constexpr void checkTypeSupport() {
+    static_assert(std::is_same<float, T>::value ||
+                      std::is_same<double, T>::value ||
+                      std::is_same<half, T>::value,
+                  "Invalid CuDNN data type: only f64, f32, f16 are supported");
+}
+
+#endif
+
+template<typename T>
+Array<T> convolve2_base(const Array<T> &signal, const Array<T> &filter,
+                        const dim4 &stride, const dim4 &padding,
+                        const dim4 &dilation) {
+    dim4 sDims = signal.dims();
+    dim4 fDims = filter.dims();
+
+    dim_t outputWidth =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t outputHeight =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(signal, fDims[0], fDims[1], stride[0], stride[1], padding[0],
+               padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsedFilter = filter;
+
+    collapsedFilter = flip(collapsedFilter, {1, 1, 0, 0});
+    collapsedFilter = modDims(collapsedFilter,
+                              dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    T alpha        = scalar<T>(1.0);
+    T beta         = scalar<T>(0.0);
+    const int Mdim = 1;
+    const int Ndim = 1;
+    Array<T> res   = createEmptyArray<T>(
+        dim4(unwrapped.dims()[Mdim], collapsedFilter.dims()[Ndim],
+               unwrapped.dims()[2], unwrapped.dims()[3]));
+    gemm(res, AF_MAT_TRANS, AF_MAT_NONE, &alpha, unwrapped, collapsedFilter,
+         &beta);
+    res = modDims(res, dim4(outputWidth, outputHeight, signal.dims()[3],
+                            collapsedFilter.dims()[1]));
+    Array<T> out = reorder(res, dim4(0, 1, 3, 2));
+
+    return out;
+}
+
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation) {
+#ifdef WITH_CUDNN
+    if (getCudnnPlugin().isLoaded()) {
+        checkTypeSupport<T>();
+        return convolve2_cudnn<T>(signal, filter, stride, padding, dilation);
+    }
+#endif
+    return convolve2_base<T>(signal, filter, stride, padding, dilation);
+}
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> convolve2<T>(Array<T> const &signal,                    \
+                                   Array<T> const &filter, const dim4 stride, \
+                                   const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> data_gradient_base(const Array<T> &incoming_gradient,
+                            const Array<T> &original_signal,
+                            const Array<T> &original_filter,
+                            const Array<T> &convolved_output, af::dim4 stride,
+                            af::dim4 padding, af::dim4 dilation) {
+    UNUSED(convolved_output);
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &sDims = original_signal.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    Array<T> collapsed_filter = original_filter;
+
+    collapsed_filter = flip(collapsed_filter, {1, 1, 0, 0});
+    collapsed_filter = modDims(collapsed_filter,
+                               dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    T alpha        = scalar<T>(1.0);
+    T beta         = scalar<T>(0.0);
+    const int Mdim = 0;
+    const int Ndim = 0;
+    Array<T> res   = createEmptyArray<T>(
+        dim4(collapsed_gradient.dims()[Mdim], collapsed_filter.dims()[Ndim],
+               collapsed_gradient.dims()[3], collapsed_gradient.dims()[3]));
+    gemm(res, AF_MAT_NONE, AF_MAT_TRANS, &alpha, collapsed_gradient,
+         collapsed_filter, &beta);
+    res = modDims(res, dim4(res.dims()[0] / sDims[3], sDims[3],
+                            fDims[0] * fDims[1], sDims[2]));
+    res = reorder(res, dim4(0, 2, 3, 1));
+
+    const bool retCols = false;
+    res = wrap_dilated(res, sDims[0], sDims[1], fDims[0], fDims[1], stride[0],
+                       stride[1], padding[0], padding[1], dilation[0],
+                       dilation[1], retCols);
+
+    return res;
+}
+
+#ifdef WITH_CUDNN
+template<typename T>
+Array<T> data_gradient_cudnn(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation) {
+    UNUSED(convolved_output);
+    auto cudnn = nnHandle();
+
+    dim4 sDims = original_signal.dims();
+    dim4 fDims = original_filter.dims();
+
+    cudnnDataType_t cudnn_dtype = getCudnnDataType<T>();
+
+    // create dx (data gradient) and dy (incoming gradient) descriptors
+    auto dx_descriptor = toCudnn<cudnnTensorDescriptor_t>(original_signal);
+    auto dy_descriptor = toCudnn<cudnnTensorDescriptor_t>(incoming_gradient);
+
+    // create filter (w) descriptor
+    auto w_descriptor = make_handle<cudnnFilterDescriptor_t>();
+
+    CUDNN_CHECK(cuda::cudnnSetFilter4dDescriptor(w_descriptor, cudnn_dtype,
+                                                 CUDNN_TENSOR_NCHW, fDims[3],
+                                                 fDims[2], fDims[1], fDims[0]));
+
+    // create convolution descriptor
+    auto convolution_descriptor = make_handle<cudnnConvolutionDescriptor_t>();
+
+    CUDNN_CHECK(cuda::cudnnSetConvolution2dDescriptor(
+        convolution_descriptor, padding[1], padding[0], stride[1], stride[0],
+        dilation[1], dilation[0], CUDNN_CONVOLUTION, cudnn_dtype));
+
+    cudnnConvolutionBwdDataAlgo_t bwd_data_convolution_algorithm;
+    if ((dilation[0] == 1 && dilation[1] == 1) || is_same<T, half>::value) {
+        bwd_data_convolution_algorithm = CUDNN_CONVOLUTION_BWD_DATA_ALGO_1;
+    } else {
+        bwd_data_convolution_algorithm = CUDNN_CONVOLUTION_BWD_DATA_ALGO_0;
+    }
+
+    // figure out scratch space memory requirements
+    size_t workspace_bytes;
+    CUDNN_CHECK(cuda::cudnnGetConvolutionBackwardDataWorkspaceSize(
+        cudnn, w_descriptor, dy_descriptor, convolution_descriptor,
+        dx_descriptor, bwd_data_convolution_algorithm, &workspace_bytes));
+
+    dim4 odims(sDims[0], sDims[1], sDims[2], sDims[3]);
+    Array<T> out = createEmptyArray<T>(odims);
+
+    auto workspace_buffer = memAlloc<char>(workspace_bytes);
+
+    // compute the gradient with respect to the input data
+    auto alpha = scalar<scale_type<T>>(1.0);
+    auto beta  = scalar<scale_type<T>>(0.0);
+
+    CUDNN_CHECK(cuda::cudnnConvolutionBackwardData(
+        cudnn, &alpha, w_descriptor, original_filter.get(), dy_descriptor,
+        incoming_gradient.get(), convolution_descriptor,
+        bwd_data_convolution_algorithm, (void *)workspace_buffer.get(),
+        workspace_bytes, &beta, dx_descriptor, out.device()));
+
+    return out;
+}
+#endif
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> &convolved_output, af::dim4 stride,
+                           af::dim4 padding, af::dim4 dilation) {
+#ifdef WITH_CUDNN
+    if (getCudnnPlugin().isLoaded()) {
+        checkTypeSupport<T>();
+        return data_gradient_cudnn<T>(incoming_gradient, original_signal,
+                                      original_filter, convolved_output, stride,
+                                      padding, dilation);
+    }
+#endif
+    return data_gradient_base<T>(incoming_gradient, original_signal,
+                                 original_filter, convolved_output, stride,
+                                 padding, dilation);
+}
+
+template<typename T>
+Array<T> filter_gradient_base(const Array<T> &incoming_gradient,
+                              const Array<T> &original_signal,
+                              const Array<T> &original_filter,
+                              const Array<T> &convolved_output, af::dim4 stride,
+                              af::dim4 padding, af::dim4 dilation) {
+    UNUSED(convolved_output);
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(original_signal, fDims[0], fDims[1], stride[0], stride[1],
+               padding[0], padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    T alpha        = scalar<T>(1.0);
+    T beta         = scalar<T>(0.0);
+    const int Mdim = 0;
+    const int Ndim = 1;
+    Array<T> res   = createEmptyArray<T>(
+        dim4(unwrapped.dims()[Mdim], collapsed_gradient.dims()[Ndim],
+               unwrapped.dims()[2], unwrapped.dims()[3]));
+    gemm(res, AF_MAT_NONE, AF_MAT_NONE, &alpha, unwrapped, collapsed_gradient,
+         &beta);
+    res = modDims(res, dim4(fDims[0], fDims[1], fDims[2], fDims[3]));
+
+    return flip(res, {1, 1, 0, 0});
+}
+
+#ifdef WITH_CUDNN
+
+pair<cudnnConvolutionBwdFilterAlgo_t, size_t> getBackwardFilterAlgorithm(
+    cudnnHandle_t cudnn, cudnnTensorDescriptor_t x_descriptor,
+    cudnnTensorDescriptor_t dy_descriptor,
+    cudnnConvolutionDescriptor_t convolution_descriptor,
+    cudnnFilterDescriptor_t dw_descriptor) {
+    // determine algorithm to use
+    cudnnConvolutionBwdFilterAlgo_t bwd_filt_convolution_algorithm;
+    // figure out scratch space memory requirements
+    size_t workspace_bytes = 0;
+
+    auto version = getCudnnPlugin().getVersion();
+    if (version.major() >= 8) {
+        int maxAlgoCount = 0;
+        CUDNN_CHECK(cuda::cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(
+            cudnn, &maxAlgoCount));
+
+        vector<cudnnConvolutionBwdFilterAlgoPerf_t> perfResults(maxAlgoCount);
+        int returnAlgoCount = 0;
+        CUDNN_CHECK(cuda::cudnnFindConvolutionBackwardFilterAlgorithm(
+            cudnn, x_descriptor, dy_descriptor, convolution_descriptor,
+            dw_descriptor, maxAlgoCount, &returnAlgoCount, perfResults.data()));
+
+        for (int i = 0; i < returnAlgoCount; ++i) {
+            if (perfResults[i].status == CUDNN_STATUS_SUCCESS) {
+                bwd_filt_convolution_algorithm = perfResults[i].algo;
+                workspace_bytes                = perfResults[i].memory;
+                break;
+            }
+        }
+    } else {
+        CUDNN_CHECK(cuda::cudnnGetConvolutionBackwardFilterAlgorithm(
+            cudnn, x_descriptor, dy_descriptor, convolution_descriptor,
+            dw_descriptor, CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST, 0,
+            &bwd_filt_convolution_algorithm));
+        CUDNN_CHECK(cuda::cudnnGetConvolutionBackwardFilterWorkspaceSize(
+            cudnn, x_descriptor, dy_descriptor, convolution_descriptor,
+            dw_descriptor, bwd_filt_convolution_algorithm, &workspace_bytes));
+    }
+    return {bwd_filt_convolution_algorithm, workspace_bytes};
+}
+
+template<typename T>
+Array<T> filter_gradient_cudnn(const Array<T> &incoming_gradient,
+                               const Array<T> &original_signal,
+                               const Array<T> &original_filter,
+                               const Array<T> &convolved_output,
+                               af::dim4 stride, af::dim4 padding,
+                               af::dim4 dilation) {
+    UNUSED(convolved_output);
+    auto cudnn = nnHandle();
+
+    const dim4 &fDims = original_filter.dims();
+
+    // create x (signal) and dy (incoming gradient) descriptors
+    cudnnDataType_t cudnn_dtype = getCudnnDataType<T>();
+    auto x_descriptor  = toCudnn<cudnnTensorDescriptor_t>(original_signal);
+    auto dy_descriptor = toCudnn<cudnnTensorDescriptor_t>(incoming_gradient);
+
+    // create convolution descriptor
+    auto convolution_descriptor = make_handle<cudnnConvolutionDescriptor_t>();
+
+    CUDNN_CHECK(cuda::cudnnSetConvolution2dDescriptor(
+        convolution_descriptor, padding[1], padding[0], stride[1], stride[0],
+        dilation[1], dilation[0], CUDNN_CONVOLUTION, cudnn_dtype));
+
+    // create output filter gradient descriptor
+    auto dw_descriptor = toCudnn<cudnnFilterDescriptor_t>(original_filter);
+
+    // determine algorithm to use
+    cudnnConvolutionBwdFilterAlgo_t bwd_filt_convolution_algorithm;
+    // figure out scratch space memory requirements
+    size_t workspace_bytes = 0;
+
+    tie(bwd_filt_convolution_algorithm, workspace_bytes) =
+        getBackwardFilterAlgorithm(cudnn, x_descriptor, dy_descriptor,
+                                   convolution_descriptor, dw_descriptor);
+
+    // prepare output array and scratch space
+    Array<T> out          = createEmptyArray<T>(fDims);
+    auto workspace_buffer = memAlloc<char>(workspace_bytes);
+
+    // compute the gradient with respect to the filter
+    auto alpha = scalar<scale_type<T>>(1.0);
+    auto beta  = scalar<scale_type<T>>(0.0);
+    CUDNN_CHECK(cuda::cudnnConvolutionBackwardFilter(
+        cudnn, &alpha, x_descriptor, original_signal.device(), dy_descriptor,
+        incoming_gradient.device(), convolution_descriptor,
+        bwd_filt_convolution_algorithm, (void *)workspace_buffer.get(),
+        workspace_bytes, &beta, dw_descriptor, out.device()));
+
+    return out;
+}
+#endif
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation) {
+#ifdef WITH_CUDNN
+    if (getCudnnPlugin().isLoaded()) {
+        checkTypeSupport<T>();
+        return filter_gradient_cudnn<T>(incoming_gradient, original_signal,
+                                        original_filter, convolved_output,
+                                        stride, padding, dilation);
+    }
+#endif
+    return filter_gradient_base<T>(incoming_gradient, original_signal,
+                                   original_filter, convolved_output, stride,
+                                   padding, dilation);
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> conv2DataGradient<T>(                                 \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);        \
+    template Array<T> conv2FilterGradient<T>(                               \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cuda
+}  // namespace arrayfire
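The output-size arithmetic that `convolve2_base` uses for `outputWidth` and `outputHeight` in the file above can be checked in isolation. The following is a minimal host-only sketch, not ArrayFire code: the helper name `convOutputDim` is hypothetical, and it assumes the same truncating integer division as the original expressions.

```cpp
#include <cstdint>

// Hypothetical helper (not part of ArrayFire). It mirrors the expression
//   1 + (s + 2 * p - ((f - 1) * d + 1)) / stride
// from convolve2_base, where a dilated filter spans (f - 1) * d + 1
// input elements along the dimension being computed.
constexpr std::int64_t convOutputDim(std::int64_t signal, std::int64_t filter,
                                     std::int64_t stride, std::int64_t padding,
                                     std::int64_t dilation) {
    return 1 + (signal + 2 * padding - ((filter - 1) * dilation + 1)) / stride;
}

// Compile-time checks against hand-computed cases.
static_assert(convOutputDim(5, 3, 1, 0, 1) == 3, "3-tap filter, no padding");
static_assert(convOutputDim(7, 3, 2, 1, 1) == 4, "stride 2, padding 1");
static_assert(convOutputDim(7, 3, 1, 0, 2) == 3, "dilation 2 spans 5 inputs");
```

For instance, a 28-element dimension with a 5-tap filter, stride 1, padding 2 and dilation 1 gives `1 + (28 + 4 - 5) / 1`, so the output keeps 28 elements along that dimension.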
diff --git a/src/backend/cuda/copy.cpp b/src/backend/cuda/copy.cpp
new file mode 100644
index 0000000000..5d1701d965
--- /dev/null
+++ b/src/backend/cuda/copy.cpp
@@ -0,0 +1,204 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+
+#include <Array.hpp>
+#include <common/complex.hpp>
+#include <common/half.hpp>
+#include <cuda_runtime_api.h>
+#include <kernel/memcopy.hpp>
+#include <math.hpp>
+
+using arrayfire::common::half;
+using arrayfire::common::is_complex;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copyData(T *data, const Array<T> &src) {
+    if (src.elements() > 0) {
+        Array<T> lin = src.isReady() && src.isLinear() ? src : copyArray(src);
+        // lin is now guaranteed linear
+        auto stream = getActiveStream();
+        CUDA_CHECK(cudaMemcpyAsync(data, lin.get(), lin.elements() * sizeof(T),
+                                   cudaMemcpyDeviceToHost, stream));
+        CUDA_CHECK(cudaStreamSynchronize(stream));
+    }
+}
+
+template<typename T>
+Array<T> copyArray(const Array<T> &src) {
+    Array<T> out = createEmptyArray<T>(src.dims());
+    if (src.elements() > 0) {
+        if (src.isReady()) {
+            if (src.isLinear()) {
+                CUDA_CHECK(cudaMemcpyAsync(
+                    out.get(), src.get(), src.elements() * sizeof(T),
+                    cudaMemcpyDeviceToDevice, getActiveStream()));
+            } else {
+                kernel::memcopy<T>(out, src, src.ndims());
+            }
+        } else {
+            evalNodes<T>(out, src.getNode().get());
+        }
+    }
+    return out;
+}
+
+template<typename T>
+void multiply_inplace(Array<T> &src, double norm) {
+    if (src.elements() > 0) {
+        kernel::copy<T, T>(src, src, src.ndims(), scalar<T>(0), norm);
+    }
+}
+
+template<typename inType, typename outType>
+struct copyWrapper {
+    void operator()(Array<outType> &dst, Array<inType> const &src) {
+        kernel::copy<inType, outType>(dst, src, dst.ndims(), scalar<outType>(0),
+                                      1.0);
+    }
+};
+
+template<typename T>
+struct copyWrapper<T, T> {
+    void operator()(Array<T> &dst, Array<T> const &src) {
+        if (src.elements() > 0) {
+            if (dst.dims() == src.dims()) {
+                if (src.isReady()) {
+                    if (dst.isLinear() && src.isLinear()) {
+                        CUDA_CHECK(cudaMemcpyAsync(
+                            dst.get(), src.get(), src.elements() * sizeof(T),
+                            cudaMemcpyDeviceToDevice, getActiveStream()));
+                    } else {
+                        kernel::memcopy<T>(dst, src, src.ndims());
+                    }
+                } else {
+                    Param<T> info(dst.get(), src.dims().dims,
+                                  dst.strides().dims);
+                    evalNodes(info, src.getNode().get());
+                }
+            } else {
+                // dst has more elements than src, so default has to be applied
+                kernel::copy<T, T>(dst, src, dst.ndims(), scalar<T>(0), 1.0);
+            }
+        }
+    }
+};
+
+template<typename inType, typename outType>
+void copyArray(Array<outType> &dst, Array<inType> const &src) {
+    static_assert(!(is_complex<inType>::value && !is_complex<outType>::value),
+                  "Cannot copy from complex value to a non complex value");
+    copyWrapper<inType, outType> copyFn;
+    copyFn(dst, src);
+}
+
+#define INSTANTIATE(T)                                        \
+    template void copyData<T>(T * data, const Array<T> &src); \
+    template Array<T> copyArray<T>(const Array<T> &src);      \
+    template void multiply_inplace<T>(Array<T> & src, double norm);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COPY_ARRAY(SRC_T)                                 \
+    template void copyArray<SRC_T, float>(Array<float> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, double>(Array<double> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,     \
+                                            Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, int>(Array<int> & dst,             \
+                                        Array<SRC_T> const &src);     \
+    template void copyArray<SRC_T, uint>(Array<uint> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, intl>(Array<intl> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, uintl>(Array<uintl> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, short>(Array<short> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, ushort>(Array<ushort> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, schar>(Array<schar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, uchar>(Array<uchar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, char>(Array<char> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, half>(Array<half> & dst,           \
+                                         Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY(float)
+INSTANTIATE_COPY_ARRAY(double)
+INSTANTIATE_COPY_ARRAY(int)
+INSTANTIATE_COPY_ARRAY(uint)
+INSTANTIATE_COPY_ARRAY(intl)
+INSTANTIATE_COPY_ARRAY(uintl)
+INSTANTIATE_COPY_ARRAY(short)
+INSTANTIATE_COPY_ARRAY(ushort)
+INSTANTIATE_COPY_ARRAY(schar)
+INSTANTIATE_COPY_ARRAY(uchar)
+INSTANTIATE_COPY_ARRAY(char)
+INSTANTIATE_COPY_ARRAY(half)
+
+#define INSTANTIATE_COPY_ARRAY_COMPLEX(SRC_T)                        \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,      \
+                                           Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,    \
+                                            Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY_COMPLEX(cfloat)
+INSTANTIATE_COPY_ARRAY_COMPLEX(cdouble)
+
+template<typename T>
+T getScalar(const Array<T> &src) {
+    T retVal{};
+    CUDA_CHECK(cudaMemcpyAsync(&retVal, src.get(), sizeof(T),
+                               cudaMemcpyDeviceToHost, getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
+    return retVal;
+}
+
+#define INSTANTIATE_GETSCALAR(T) template T getScalar(const Array<T> &in);
+
+INSTANTIATE_GETSCALAR(float)
+INSTANTIATE_GETSCALAR(double)
+INSTANTIATE_GETSCALAR(cfloat)
+INSTANTIATE_GETSCALAR(cdouble)
+INSTANTIATE_GETSCALAR(int)
+INSTANTIATE_GETSCALAR(uint)
+INSTANTIATE_GETSCALAR(schar)
+INSTANTIATE_GETSCALAR(uchar)
+INSTANTIATE_GETSCALAR(char)
+INSTANTIATE_GETSCALAR(intl)
+INSTANTIATE_GETSCALAR(uintl)
+INSTANTIATE_GETSCALAR(short)
+INSTANTIATE_GETSCALAR(ushort)
+INSTANTIATE_GETSCALAR(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/copy.cu b/src/backend/cuda/copy.cu
deleted file mode 100644
index c58b1cb43f..0000000000
--- a/src/backend/cuda/copy.cu
+++ /dev/null
@@ -1,186 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <cuda_runtime_api.h>
-#include <af/array.h>
-#include <af/defines.h>
-#include <Array.hpp>
-#include <copy.hpp>
-#include <kernel/memcopy.hpp>
-#include <err_cuda.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-
-    template<typename T>
-    void copyData(T *data, const Array<T> &A)
-    {
-        // FIXME: Merge this with copyArray
-        evalArray(A);
-
-        Array<T> out = A;
-        const T *ptr = NULL;
-
-        if (A.isOwner() || // No offsets, No strides
-            A.ndims() == 1 // Simple offset, no strides.
-            ) {
-
-            //A.get() gets data with offsets
-            ptr = A.get();
-        } else {
-            //FIXME: Think about implementing eval
-            out = copyArray(A);
-            ptr = out.get();
-        }
-
-        CUDA_CHECK(cudaMemcpy(data, ptr,
-                              A.elements() * sizeof(T),
-                              cudaMemcpyDeviceToHost));
-
-        return;
-    }
-
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A)
-    {
-        Array<T> out = createEmptyArray<T>(A.dims());
-
-        if (A.isLinear()) {
-            CUDA_CHECK(cudaMemcpyAsync(out.get(), A.get(),
-                                       A.elements() * sizeof(T),
-                                       cudaMemcpyDeviceToDevice));
-        } else {
-            // FIXME: Seems to fail when using Param<T>
-            kernel::memcopy(out.get(), out.strides().get(), A.get(), A.dims().get(),
-                            A.strides().get(), (uint)A.ndims());
-        }
-        return out;
-    }
-
-    template<typename inType, typename outType>
-    Array<outType> padArray(Array<inType> const &in, dim4 const &dims, outType default_value, double factor)
-    {
-        ARG_ASSERT(1, (in.ndims() == dims.ndims()));
-        Array<outType> ret = createEmptyArray<outType>(dims);
-        kernel::copy<inType, outType>(ret, in, in.ndims(), default_value, factor);
-        return ret;
-    }
-
-    template<typename inType, typename outType>
-    struct copyWrapper {
-        void operator()(Array<outType> &out, Array<inType> const &in)
-        {
-            kernel::copy<inType, outType>(out, in, in.ndims(), scalar<outType>(0), 1);
-        }
-    };
-
-    template<typename T>
-    struct copyWrapper<T, T> {
-        void operator()(Array<T> &out, Array<T> const &in)
-        {
-            if (out.isLinear() &&
-                in.isLinear() &&
-                out.elements() == in.elements())
-            {
-                CUDA_CHECK(cudaMemcpyAsync(out.get(), in.get(),
-                                           in.elements() * sizeof(T),
-                                           cudaMemcpyDeviceToDevice));
-            } else {
-                kernel::copy<T, T>(out, in, in.ndims(), scalar<T>(0), 1);
-            }
-        }
-    };
-
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, Array<inType> const &in)
-    {
-        ARG_ASSERT(1, (in.ndims() == out.dims().ndims()));
-        copyWrapper<inType, outType> copyFn;
-        copyFn(out, in);
-    }
-
-#define INSTANTIATE(T)                                              \
-    template void      copyData<T> (T *data, const Array<T> &from); \
-    template Array<T> copyArray<T>(const Array<T> &A);              \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl   )
-    INSTANTIATE(uintl  )
-
-#define INSTANTIATE_PAD_ARRAY(SRC_T)                                    \
-    template Array<float  > padArray<SRC_T, float  >(Array<SRC_T> const &src, dim4 const &dims, float   default_value, double factor); \
-    template Array<double > padArray<SRC_T, double >(Array<SRC_T> const &src, dim4 const &dims, double  default_value, double factor); \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template Array<int    > padArray<SRC_T, int    >(Array<SRC_T> const &src, dim4 const &dims, int     default_value, double factor); \
-    template Array<uint   > padArray<SRC_T, uint   >(Array<SRC_T> const &src, dim4 const &dims, uint    default_value, double factor); \
-    template Array<intl    > padArray<SRC_T, intl    >(Array<SRC_T> const &src, dim4 const &dims, intl     default_value, double factor); \
-    template Array<uintl   > padArray<SRC_T, uintl   >(Array<SRC_T> const &src, dim4 const &dims, uintl    default_value, double factor); \
-    template Array<uchar  > padArray<SRC_T, uchar  >(Array<SRC_T> const &src, dim4 const &dims, uchar   default_value, double factor); \
-    template Array<char   > padArray<SRC_T, char   >(Array<SRC_T> const &src, dim4 const &dims, char    default_value, double factor); \
-    template void copyArray<SRC_T, float  >(Array<float  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, double >(Array<double > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cfloat >(Array<cfloat > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cdouble>(Array<cdouble> &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, int    >(Array<int    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uint   >(Array<uint   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, intl    >(Array<intl    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uintl   >(Array<uintl   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uchar  >(Array<uchar  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, char   >(Array<char   > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY(float )
-    INSTANTIATE_PAD_ARRAY(double)
-    INSTANTIATE_PAD_ARRAY(int   )
-    INSTANTIATE_PAD_ARRAY(uint  )
-    INSTANTIATE_PAD_ARRAY(intl   )
-    INSTANTIATE_PAD_ARRAY(uintl  )
-    INSTANTIATE_PAD_ARRAY(uchar )
-    INSTANTIATE_PAD_ARRAY(char  )
-
-#define INSTANTIATE_PAD_ARRAY_COMPLEX(SRC_T)                            \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template void copyArray<SRC_T, cfloat  >(Array<cfloat  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cdouble   >(Array<cdouble > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cfloat )
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cdouble)
-
-#define SPECILIAZE_UNUSED_COPYARRAY(SRC_T, DST_T) \
-    template<> void copyArray<SRC_T, DST_T>(Array<DST_T> &out, Array<SRC_T> const &in) \
-    {\
-        CUDA_NOT_SUPPORTED();\
-    }
-
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uintl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uintl)
-}
diff --git a/src/backend/cuda/copy.hpp b/src/backend/cuda/copy.hpp
index 02e672aa7f..454e50679e 100644
--- a/src/backend/cuda/copy.hpp
+++ b/src/backend/cuda/copy.hpp
@@ -8,22 +8,57 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
+// Copies (blocking) data from an Array<T> object to a contiguous host-side
+// pointer.
+//
+// \param dst  The destination pointer on the host system.
+// \param src  The source array.
+template<typename T>
+void copyData(T *dst, const Array<T> &src);
 
-    template<typename T>
-    void copyData(T *data, const Array<T> &A);
+// Create a deep copy of the \p src Array with the same size and shape. The new
+// Array will not maintain the subarray metadata of the \p src array.
+//
+// \param   src  The source Array<T> object.
+// \returns      A new Array<T> object with the same shape and data as the
+//               \p src Array<T>
+template<typename T>
+Array<T> copyArray(const Array<T> &src);
 
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A);
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, const Array<inType> &in);
 
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, const Array<inType> &in);
+// Resize Array to target dimensions and convert type
+//
+// Depending on the \p outDims, the output Array can be either truncated
+// or padded (towards end of respective dimensions).
+//
+// While resizing, if the output dimensions are larger than the input, the
+// elements beyond the input dimensions are set to the \p defaultValue.
+//
+// \param[in] in is input Array
+// \param[in] outDims is the target output dimensions
+// \param[in] defaultValue is the value to which padded locations are set.
+// \param[in] scale is the value by which all output elements are scaled.
+//
+// \returns Array<outType>
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue = outType(0), double scale = 1.0);
 
-    template<typename inType, typename outType>
-    Array<outType> padArray(Array<inType> const &in, dim4 const &dims,
-                            outType default_value, double factor=1.0);
-}
+template<typename T>
+Array<T> padArrayBorders(Array<T> const &in, dim4 const &lowerBoundPadding,
+                         dim4 const &upperBoundPadding,
+                         const af::borderType btype);
+
+template<typename T>
+void multiply_inplace(Array<T> &in, double val);
+
+template<typename T>
+T getScalar(const Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
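The resize-and-convert behaviour documented for `reshape` above can be illustrated with a small host-side sketch. This is not the ArrayFire implementation: `resizeAndConvert` and `resizeAndConvertExample` are hypothetical names, the sketch is one-dimensional, and it assumes that padded elements receive `defaultValue` unscaled while copied elements are multiplied by `scale` before conversion.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1-D host analogue of "resize to target dimensions and
// convert type". Assumption: padding is not scaled.
template<typename InType, typename OutType>
std::vector<OutType> resizeAndConvert(const std::vector<InType> &in,
                                      std::size_t outLen,
                                      OutType defaultValue = OutType(0),
                                      double scale         = 1.0) {
    // Start with every output element set to the padding value.
    std::vector<OutType> out(outLen, defaultValue);
    // Copy only the overlapping range: truncates if outLen < in.size(),
    // leaves the tail padded if outLen > in.size().
    const std::size_t n = in.size() < outLen ? in.size() : outLen;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<OutType>(in[i] * scale);
    }
    return out;
}

// Usage sketch: pad {1, 2, 3} to five floats, doubling the copied values.
inline void resizeAndConvertExample() {
    const std::vector<int> src{1, 2, 3};
    const auto padded = resizeAndConvert<int, float>(src, 5, -1.0f, 2.0);
    assert(padded.size() == 5);
    assert(padded[2] == 6.0f && padded[3] == -1.0f);
}
```

The defaulted `defaultValue` and `scale` arguments mirror the defaults in the declaration above, so a plain type conversion needs only the input and the target size.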
diff --git a/src/backend/cuda/count.cu b/src/backend/cuda/count.cu
index 5d58b7a4d7..3cb5806a88 100644
--- a/src/backend/cuda/count.cu
+++ b/src/backend/cuda/count.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    // count
-    INSTANTIATE(af_notzero_t, float  , uint)
-    INSTANTIATE(af_notzero_t, double , uint)
-    INSTANTIATE(af_notzero_t, cfloat , uint)
-    INSTANTIATE(af_notzero_t, cdouble, uint)
-    INSTANTIATE(af_notzero_t, int    , uint)
-    INSTANTIATE(af_notzero_t, uint   , uint)
-    INSTANTIATE(af_notzero_t, char   , uint)
-    INSTANTIATE(af_notzero_t, uchar  , uint)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// count
+INSTANTIATE(af_notzero_t, float, uint)
+INSTANTIATE(af_notzero_t, double, uint)
+INSTANTIATE(af_notzero_t, cfloat, uint)
+INSTANTIATE(af_notzero_t, cdouble, uint)
+INSTANTIATE(af_notzero_t, int, uint)
+INSTANTIATE(af_notzero_t, uint, uint)
+INSTANTIATE(af_notzero_t, intl, uint)
+INSTANTIATE(af_notzero_t, uintl, uint)
+INSTANTIATE(af_notzero_t, short, uint)
+INSTANTIATE(af_notzero_t, ushort, uint)
+INSTANTIATE(af_notzero_t, char, uint)
+INSTANTIATE(af_notzero_t, schar, uint)
+INSTANTIATE(af_notzero_t, uchar, uint)
+INSTANTIATE(af_notzero_t, half, uint)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cublas.cpp b/src/backend/cuda/cublas.cpp
new file mode 100644
index 0000000000..31111deda4
--- /dev/null
+++ b/src/backend/cuda/cublas.cpp
@@ -0,0 +1,36 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cublas.hpp>
+
+#include <common/err_common.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+const char* errorString(cublasStatus_t err) {
+    switch (err) {
+        case CUBLAS_STATUS_SUCCESS: return "CUBLAS_STATUS_SUCCESS";
+        case CUBLAS_STATUS_NOT_INITIALIZED:
+            return "CUBLAS_STATUS_NOT_INITIALIZED";
+        case CUBLAS_STATUS_ALLOC_FAILED: return "CUBLAS_STATUS_ALLOC_FAILED";
+        case CUBLAS_STATUS_INVALID_VALUE: return "CUBLAS_STATUS_INVALID_VALUE";
+        case CUBLAS_STATUS_ARCH_MISMATCH: return "CUBLAS_STATUS_ARCH_MISMATCH";
+        case CUBLAS_STATUS_MAPPING_ERROR: return "CUBLAS_STATUS_MAPPING_ERROR";
+        case CUBLAS_STATUS_EXECUTION_FAILED:
+            return "CUBLAS_STATUS_EXECUTION_FAILED";
+        case CUBLAS_STATUS_INTERNAL_ERROR:
+            return "CUBLAS_STATUS_INTERNAL_ERROR";
+        case CUBLAS_STATUS_NOT_SUPPORTED: return "CUBLAS_STATUS_NOT_SUPPORTED";
+        default: return "UNKNOWN";
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cublas.hpp b/src/backend/cuda/cublas.hpp
new file mode 100644
index 0000000000..d0611263d8
--- /dev/null
+++ b/src/backend/cuda/cublas.hpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/defines.hpp>
+#include <common/unique_handle.hpp>
+#include <cublas_v2.h>
+
+DEFINE_HANDLER(cublasHandle_t, cublasCreate, cublasDestroy);
+
+namespace arrayfire {
+namespace cuda {
+
+const char* errorString(cublasStatus_t err);
+
+#define CUBLAS_CHECK(fn)                                                    \
+    do {                                                                    \
+        cublasStatus_t _error = fn;                                         \
+        if (_error != CUBLAS_STATUS_SUCCESS) {                              \
+            char _err_msg[1024];                                            \
+            snprintf(_err_msg, sizeof(_err_msg), "CUBLAS Error (%d): %s\n", \
+                     (int)(_error), arrayfire::cuda::errorString(_error));  \
+            AF_ERROR(_err_msg, AF_ERR_INTERNAL);                            \
+        }                                                                   \
+    } while (0)
+
+}  // namespace cuda
+}  // namespace arrayfire
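The `CUBLAS_CHECK` macro above wraps its body in `do { ... } while (0)` so that it behaves like a single statement. A minimal standalone sketch of the same pattern follows; `DemoStatus`, `demoErrorString`, `DEMO_CHECK`, `demoCall`, and `runChecked` are hypothetical stand-ins, and `std::runtime_error` is used here in place of `AF_ERROR`, whose definition is not shown in this diff.

```cpp
#include <cassert>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library status code such as cublasStatus_t.
enum DemoStatus { DEMO_SUCCESS = 0, DEMO_ALLOC_FAILED = 3 };

inline const char *demoErrorString(DemoStatus err) {
    switch (err) {
        case DEMO_SUCCESS: return "DEMO_SUCCESS";
        case DEMO_ALLOC_FAILED: return "DEMO_ALLOC_FAILED";
        default: return "UNKNOWN";
    }
}

// Evaluates fn once; on failure, formats a message and throws.
#define DEMO_CHECK(fn)                                              \
    do {                                                            \
        DemoStatus _error = (fn);                                   \
        if (_error != DEMO_SUCCESS) {                               \
            char _err_msg[1024];                                    \
            std::snprintf(_err_msg, sizeof(_err_msg),               \
                          "Demo Error (%d): %s", (int)(_error),     \
                          demoErrorString(_error));                 \
            throw std::runtime_error(_err_msg);                     \
        }                                                           \
    } while (0)

// Hypothetical library call that fails on request.
inline DemoStatus demoCall(bool fail) {
    return fail ? DEMO_ALLOC_FAILED : DEMO_SUCCESS;
}

inline std::string runChecked(bool fail) {
    try {
        // The unbraced if/else compiles only because the macro expands
        // to exactly one statement.
        if (fail)
            DEMO_CHECK(demoCall(true));
        else
            DEMO_CHECK(demoCall(false));
        return "ok";
    } catch (const std::runtime_error &e) { return e.what(); }
}

inline void demoCheckExample() {
    assert(runChecked(false) == "ok");
    assert(runChecked(true) == "Demo Error (3): DEMO_ALLOC_FAILED");
}
```

Without the `do { ... } while (0)` wrapper, the trailing semicolon after the macro call would terminate the `if` early and leave the `else` without a matching `if`.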
diff --git a/src/backend/cuda/cublasManager.cpp b/src/backend/cuda/cublasManager.cpp
deleted file mode 100644
index 1f34d54bbc..0000000000
--- a/src/backend/cuda/cublasManager.cpp
+++ /dev/null
@@ -1,74 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <stdio.h>
-#include <platform.hpp>
-#include <err_common.hpp>
-#include <cublasManager.hpp>
-#include <boost/scoped_ptr.hpp>
-#include <platform.hpp>
-
-namespace cublas {
-
-    const char *errorString(cublasStatus_t err)
-    {
-
-        switch(err)
-        {
-        case    CUBLAS_STATUS_SUCCESS:              return "CUBLAS_STATUS_SUCCESS";
-        case    CUBLAS_STATUS_NOT_INITIALIZED:      return "CUBLAS_STATUS_NOT_INITIALIZED";
-        case    CUBLAS_STATUS_ALLOC_FAILED:         return "CUBLAS_STATUS_ALLOC_FAILED";
-        case    CUBLAS_STATUS_INVALID_VALUE:        return "CUBLAS_STATUS_INVALID_VALUE";
-        case    CUBLAS_STATUS_ARCH_MISMATCH:        return "CUBLAS_STATUS_ARCH_MISMATCH";
-        case    CUBLAS_STATUS_MAPPING_ERROR:        return "CUBLAS_STATUS_MAPPING_ERROR";
-        case    CUBLAS_STATUS_EXECUTION_FAILED:     return "CUBLAS_STATUS_EXECUTION_FAILED";
-        case    CUBLAS_STATUS_INTERNAL_ERROR:       return "CUBLAS_STATUS_INTERNAL_ERROR";
-#if CUDA_VERSION > 5050
-        case    CUBLAS_STATUS_NOT_SUPPORTED:        return "CUBLAS_STATUS_NOT_SUPPORTED";
-#endif
-        default:                                    return "UNKNOWN";
-        }
-    }
-
-    //RAII class around the cublas Handle
-    class cublasHandle
-    {
-        cublasHandle_t handle;
-    public:
-
-        cublasHandle()  : handle(0)
-        {
-            CUBLAS_CHECK(cublasCreate(&handle));
-        }
-
-        ~cublasHandle()
-        {
-            cublasDestroy(handle);
-        }
-
-        cublasHandle_t get() const
-        {
-            return handle;
-        }
-    };
-
-    cublasHandle_t getHandle()
-    {
-        using boost::scoped_ptr;
-        static scoped_ptr<cublasHandle> handle[cuda::DeviceManager::MAX_DEVICES];
-
-        int id = cuda::getActiveDeviceId();
-
-        if(!handle[id]) {
-            handle[id].reset(new cublasHandle());
-        }
-
-        return handle[id]->get();
-    }
-}
diff --git a/src/backend/cuda/cublasManager.hpp b/src/backend/cuda/cublasManager.hpp
deleted file mode 100644
index 06b5aa1e7c..0000000000
--- a/src/backend/cuda/cublasManager.hpp
+++ /dev/null
@@ -1,37 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <stdio.h>
-#include <err_common.hpp>
-#include <cublas_v2.h>
-
-
-namespace cublas {
-
-    const char * errorString(cublasStatus_t err);
-    cublasHandle_t getHandle();
-}
-
-
-#define CUBLAS_CHECK(fn) do {                   \
-        cublasStatus_t _error = fn;             \
-        if (_error != CUBLAS_STATUS_SUCCESS) {  \
-            char _err_msg[1024];                \
-            snprintf(_err_msg,                  \
-                     sizeof(_err_msg),          \
-                     "CUBLAS Error (%d): %s\n", \
-                     (int)(_error),             \
-                     cublas::errorString(       \
-                         _error));              \
-                                                \
-            AF_ERROR(_err_msg,                  \
-                     AF_ERR_INTERNAL);          \
-        }                                       \
-    } while(0)
diff --git a/src/backend/cuda/cudaDataType.hpp b/src/backend/cuda/cudaDataType.hpp
new file mode 100644
index 0000000000..3746d0b4b9
--- /dev/null
+++ b/src/backend/cuda/cudaDataType.hpp
@@ -0,0 +1,86 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/half.hpp>
+#include <library_types.h>  // cudaDataType enum
+#include <types.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+inline cudaDataType_t getType();
+
+template<>
+inline cudaDataType_t getType<float>() {
+    return CUDA_R_32F;
+}
+
+template<>
+inline cudaDataType_t getType<cfloat>() {
+    return CUDA_C_32F;
+}
+
+template<>
+inline cudaDataType_t getType<double>() {
+    return CUDA_R_64F;
+}
+
+template<>
+inline cudaDataType_t getType<cdouble>() {
+    return CUDA_C_64F;
+}
+
+template<>
+inline cudaDataType_t getType<common::half>() {
+    return CUDA_R_16F;
+}
+
+template<>
+inline cudaDataType_t getType<uchar>() {
+    return CUDA_R_8I;
+}
+
+template<>
+inline cudaDataType_t getType<schar>() {
+    return CUDA_R_8I;
+}
+
+/* only supports LStride/RStride % 4 == 0 */
+template<>
+inline cudaDataType_t getType<int>() {
+    return CUDA_R_32I;
+}
+
+template<typename T>
+inline cudaDataType_t getComputeType() {
+    return getType<T>();
+}
+
+template<>
+inline cudaDataType_t getComputeType<common::half>() {
+    cudaDataType_t algo = getType<common::half>();
+    // There is probably a bug in the NVIDIA CUDA docs and/or drivers: according
+    // to https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx,
+    // computeType could be 32F even if the A/B inputs are 16F. But compute
+    // capability 6.1 GPUs (for example GTX 10X0) don't seem capable of
+    // computing at f32 when the inputs are f16: results are inf, and
+    // cublasGemmEx even returns OK. For now the override below is commented
+    // out; the drawback is that f16 computation on these GPUs is very slow:
+    //
+    // auto dev = getDeviceProp(getActiveDeviceId());
+    // if (dev.major == 6 && dev.minor == 1) { algo = CUDA_R_32F; }
+
+    return algo;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
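The `getType<T>()` specializations above map a C++ element type to a library enum at compile time. A minimal sketch of that technique without CUDA headers follows; `DemoDataType`, `demoGetType`, `demoGetComputeType`, and `demoTypeExample` are hypothetical names, and `std::complex` stands in for ArrayFire's `cfloat`/`cdouble` types.

```cpp
#include <cassert>
#include <complex>

// Hypothetical stand-in for an enum such as cudaDataType_t.
enum class DemoDataType { R_32F, C_32F, R_64F, C_64F };

// The primary template is declared but never defined, so requesting an
// unmapped type fails to build instead of returning a wrong tag.
template<typename T>
inline DemoDataType demoGetType();

template<>
inline DemoDataType demoGetType<float>() {
    return DemoDataType::R_32F;
}

template<>
inline DemoDataType demoGetType<std::complex<float>>() {
    return DemoDataType::C_32F;
}

template<>
inline DemoDataType demoGetType<double>() {
    return DemoDataType::R_64F;
}

template<>
inline DemoDataType demoGetType<std::complex<double>>() {
    return DemoDataType::C_64F;
}

// Generic fallback: compute in the same type as the storage type. A
// specialization could override this for individual element types.
template<typename T>
inline DemoDataType demoGetComputeType() {
    return demoGetType<T>();
}

inline void demoTypeExample() {
    assert(demoGetType<float>() == DemoDataType::R_32F);
    assert(demoGetComputeType<std::complex<double>>() == DemoDataType::C_64F);
}
```

Keeping the primary template undefined is a design choice: supporting a new element type requires an explicit specialization, so a missing mapping shows up at build time rather than at run time.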
diff --git a/src/backend/cuda/cudnn.cpp b/src/backend/cuda/cudnn.cpp
new file mode 100644
index 0000000000..5b8a500d00
--- /dev/null
+++ b/src/backend/cuda/cudnn.cpp
@@ -0,0 +1,307 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cudnn.hpp>
+#include <err_cuda.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+const char *errorString(cudnnStatus_t err) {
+    switch (err) {
+        case CUDNN_STATUS_SUCCESS: return "CUDNN_STATUS_SUCCESS";
+        case CUDNN_STATUS_NOT_INITIALIZED:
+            return "CUDNN_STATUS_NOT_INITIALIZED";
+        case CUDNN_STATUS_ALLOC_FAILED: return "CUDNN_STATUS_ALLOC_FAILED";
+        case CUDNN_STATUS_BAD_PARAM: return "CUDNN_STATUS_BAD_PARAM";
+        case CUDNN_STATUS_INTERNAL_ERROR: return "CUDNN_STATUS_INTERNAL_ERROR";
+        case CUDNN_STATUS_INVALID_VALUE: return "CUDNN_STATUS_INVALID_VALUE";
+        case CUDNN_STATUS_ARCH_MISMATCH: return "CUDNN_STATUS_ARCH_MISMATCH";
+        case CUDNN_STATUS_MAPPING_ERROR: return "CUDNN_STATUS_MAPPING_ERROR";
+        case CUDNN_STATUS_EXECUTION_FAILED:
+            return "CUDNN_STATUS_EXECUTION_FAILED";
+        case CUDNN_STATUS_NOT_SUPPORTED: return "CUDNN_STATUS_NOT_SUPPORTED";
+        case CUDNN_STATUS_LICENSE_ERROR: return "CUDNN_STATUS_LICENSE_ERROR";
+#if CUDNN_VERSION >= 6000
+        case CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING:
+            return "CUDNN_STATUS_RUNTIME_PREREQUISITE_MISSING";
+#if CUDNN_VERSION >= 7000
+        case CUDNN_STATUS_RUNTIME_IN_PROGRESS:
+            return "CUDNN_STATUS_RUNTIME_IN_PROGRESS";
+        case CUDNN_STATUS_RUNTIME_FP_OVERFLOW:
+            return "CUDNN_STATUS_RUNTIME_FP_OVERFLOW";
+#if CUDNN_VERSION >= 8000
+        case CUDNN_STATUS_VERSION_MISMATCH:
+            return "CUDNN_STATUS_VERSION_MISMATCH";
+#endif
+#endif
+#endif
+        default: return "UNKNOWN";
+    }
+}
+
+template<>
+cudnnDataType_t getCudnnDataType<float>() {
+    return CUDNN_DATA_FLOAT;
+}
+template<>
+cudnnDataType_t getCudnnDataType<double>() {
+    return CUDNN_DATA_DOUBLE;
+}
+
+#if CUDNN_VERSION >= 6000
+template<>
+cudnnDataType_t getCudnnDataType<int>() {
+    return CUDNN_DATA_INT32;
+}
+
+#if CUDNN_VERSION >= 7100
+/// TODONT COMMIT
+template<>
+cudnnDataType_t getCudnnDataType<signed char>() {
+    return CUDNN_DATA_INT8;
+}
+
+template<>
+cudnnDataType_t getCudnnDataType<unsigned char>() {
+    return CUDNN_DATA_UINT8;
+}
+#endif
+#endif
+
+template<>
+cudnnDataType_t getCudnnDataType<common::half>() {
+    return CUDNN_DATA_HALF;
+}
+
+void cudnnSet(cudnnTensorDescriptor_t desc, cudnnDataType_t cudnn_dtype,
+              dim4 dims) {
+    CUDNN_CHECK(cuda::cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW,
+                                                 cudnn_dtype, dims[3], dims[2],
+                                                 dims[1], dims[0]));
+}
+
+void cudnnSet(cudnnFilterDescriptor_t desc, cudnnDataType_t cudnn_dtype,
+              dim4 dims) {
+    CUDNN_CHECK(cuda::cudnnSetFilter4dDescriptor(desc, cudnn_dtype,
+                                                 CUDNN_TENSOR_NCHW, dims[3],
+                                                 dims[2], dims[1], dims[0]));
+}
+
+cudnnStatus_t cudnnSetConvolution2dDescriptor(
+    cudnnConvolutionDescriptor_t convDesc,
+    int pad_h,     // zero-padding height
+    int pad_w,     // zero-padding width
+    int u,         // vertical filter stride
+    int v,         // horizontal filter stride
+    int upscalex,  // upscale the input in x-direction
+    int upscaley,  // upscale the input in y-direction
+    cudnnConvolutionMode_t mode, cudnnDataType_t computeType) {
+    return
+#if CUDNN_VERSION >= 6000
+        getCudnnPlugin().cudnnSetConvolution2dDescriptor(
+            convDesc, pad_h, pad_w, u, v, upscalex, upscaley, mode,
+            computeType);
+#elif CUDNN_VERSION >= 4000
+        getCudnnPlugin().cudnnSetConvolution2dDescriptor(
+            convDesc, pad_h, pad_w, u, v, upscalex, upscaley, mode);
+#else
+        static_assert(1 != 1, "cuDNN version not supported");
+#endif
+}
+
+cudnnStatus_t cudnnSetFilter4dDescriptor(cudnnFilterDescriptor_t filterDesc,
+                                         cudnnDataType_t dataType,
+                                         cudnnTensorFormat_t format, int k,
+                                         int c, int h, int w) {
+#if CUDNN_VERSION >= 6000
+    int version = getCudnnPlugin().cudnnGetVersion();
+    if (version >= 6000) {
+        return getCudnnPlugin().cudnnSetFilter4dDescriptor(filterDesc, dataType,
+                                                           format, k, c, h, w);
+    } else if (version == 4000) {
+        return getCudnnPlugin().cudnnSetFilter4dDescriptor_v4(
+            filterDesc, dataType, format, k, c, h, w);
+    }
+    CUDA_NOT_SUPPORTED(
+        "cudnnSetFilter4dDescriptor not supported for the current version of "
+        "cuDNN");
+#elif CUDNN_VERSION == 4000
+    return getCudnnPlugin().cudnnSetFilter4dDescriptor_v4(filterDesc, dataType,
+                                                          format, k, c, h, w);
+#else
+    static_assert(1 != 1, "cuDNN version not supported");
+#endif
+}
+
+cudnnStatus_t cudnnSetTensor4dDescriptor(cudnnTensorDescriptor_t tensorDesc,
+                                         cudnnTensorFormat_t format,
+                                         cudnnDataType_t dataType, int n, int c,
+                                         int h, int w) {
+    return getCudnnPlugin().cudnnSetTensor4dDescriptor(tensorDesc, format,
+                                                       dataType, n, c, h, w);
+}
+
+cudnnStatus_t cudnnGetConvolutionBackwardDataWorkspaceSize(
+    cudnnHandle_t handle, const cudnnFilterDescriptor_t wDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t dxDesc, cudnnConvolutionBwdDataAlgo_t algo,
+    size_t *sizeInBytes) {
+    return getCudnnPlugin().cudnnGetConvolutionBackwardDataWorkspaceSize(
+        handle, wDesc, dyDesc, convDesc, dxDesc, algo, sizeInBytes);
+}
+
+cudnnStatus_t cudnnConvolutionBackwardData(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnFilterDescriptor_t wDesc, const void *w,
+    const cudnnTensorDescriptor_t dyDesc, const void *dy,
+    const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionBwdDataAlgo_t algo, void *workSpace,
+    size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnTensorDescriptor_t dxDesc, void *dx) {
+    return getCudnnPlugin().cudnnConvolutionBackwardData(
+        handle, alpha, wDesc, w, dyDesc, dy, convDesc, algo, workSpace,
+        workSpaceSizeInBytes, beta, dxDesc, dx);
+}
+
+cudnnStatus_t cudnnGetConvolutionNdForwardOutputDim(
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t inputTensorDesc,
+    const cudnnFilterDescriptor_t filterDesc, int nbDims,
+    int tensorOuputDimA[]) {
+    return getCudnnPlugin().cudnnGetConvolutionNdForwardOutputDim(
+        convDesc, inputTensorDesc, filterDesc, nbDims, tensorOuputDimA);
+}
+
+cudnnStatus_t cudnnGetConvolutionForwardAlgorithmMaxCount(cudnnHandle_t handle,
+                                                          int *count) {
+    return getCudnnPlugin().cudnnGetConvolutionForwardAlgorithmMaxCount(handle,
+                                                                        count);
+}
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(
+    cudnnHandle_t handle, int *count) {
+    return getCudnnPlugin().cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(
+        handle, count);
+}
+
+cudnnStatus_t cudnnGetConvolutionForwardWorkspaceSize(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc, cudnnConvolutionFwdAlgo_t algo,
+    size_t *sizeInBytes) {
+    return getCudnnPlugin().cudnnGetConvolutionForwardWorkspaceSize(
+        handle, xDesc, wDesc, convDesc, yDesc, algo, sizeInBytes);
+}
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterWorkspaceSize(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t gradDesc,
+    cudnnConvolutionBwdFilterAlgo_t algo, size_t *sizeInBytes) {
+    return getCudnnPlugin().cudnnGetConvolutionBackwardFilterWorkspaceSize(
+        handle, xDesc, dyDesc, convDesc, gradDesc, algo, sizeInBytes);
+}
+
+cudnnStatus_t cudnnFindConvolutionForwardAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc, const int requestedAlgoCount,
+    int *returnedAlgoCount, cudnnConvolutionFwdAlgoPerf_t *perfResults) {
+    return getCudnnPlugin().cudnnFindConvolutionForwardAlgorithm(
+        handle, xDesc, wDesc, convDesc, yDesc, requestedAlgoCount,
+        returnedAlgoCount, perfResults);
+}
+
+cudnnStatus_t cudnnFindConvolutionBackwardFilterAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t dwDesc, const int requestedAlgoCount,
+    int *returnedAlgoCount, cudnnConvolutionBwdFilterAlgoPerf_t *perfResults) {
+    return getCudnnPlugin().cudnnFindConvolutionBackwardFilterAlgorithm(
+        handle, xDesc, dyDesc, convDesc, dwDesc, requestedAlgoCount,
+        returnedAlgoCount, perfResults);
+}
+
+cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc,
+    cudnnConvolutionFwdPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionFwdAlgo_t *algo) {
+    auto version = getCudnnPlugin().getVersion();
+    if (version.major() < 8) {
+        return getCudnnPlugin().cudnnGetConvolutionForwardAlgorithm(
+            handle, xDesc, wDesc, convDesc, yDesc, preference,
+            memoryLimitInBytes, algo);
+    } else {
+        AF_ERROR(
+            "cudnnGetConvolutionForwardAlgorithm has been removed since cuDNN "
+            "8",
+            AF_ERR_NOT_SUPPORTED);
+        return CUDNN_STATUS_SUCCESS;
+    }
+}
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t dwDesc,
+    cudnnConvolutionBwdFilterPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionBwdFilterAlgo_t *algo) {
+    auto version = getCudnnPlugin().getVersion();
+    if (version.major() < 8) {
+        return getCudnnPlugin().cudnnGetConvolutionBackwardFilterAlgorithm(
+            handle, xDesc, dyDesc, convDesc, dwDesc, preference,
+            memoryLimitInBytes, algo);
+    } else {
+        AF_ERROR(
+            "cudnnGetConvolutionBackwardFilterAlgorithm has been removed since "
+            "cuDNN 8",
+            AF_ERR_NOT_SUPPORTED);
+        return CUDNN_STATUS_SUCCESS;
+    }
+}
+
+cudnnStatus_t cudnnConvolutionForward(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnTensorDescriptor_t xDesc, const void *x,
+    const cudnnFilterDescriptor_t wDesc, const void *w,
+    const cudnnConvolutionDescriptor_t convDesc, cudnnConvolutionFwdAlgo_t algo,
+    void *workSpace, size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnTensorDescriptor_t yDesc, void *y) {
+    return getCudnnPlugin().cudnnConvolutionForward(
+        handle, alpha, xDesc, x, wDesc, w, convDesc, algo, workSpace,
+        workSpaceSizeInBytes, beta, yDesc, y);
+}
+
+cudnnStatus_t cudnnConvolutionBackwardFilter(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnTensorDescriptor_t xDesc, const void *x,
+    const cudnnTensorDescriptor_t dyDesc, const void *dy,
+    const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionBwdFilterAlgo_t algo, void *workSpace,
+    size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnFilterDescriptor_t dwDesc, void *dw) {
+    return getCudnnPlugin().cudnnConvolutionBackwardFilter(
+        handle, alpha, xDesc, x, dyDesc, dy, convDesc, algo, workSpace,
+        workSpaceSizeInBytes, beta, dwDesc, dw);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cudnn.hpp b/src/backend/cuda/cudnn.hpp
new file mode 100644
index 0000000000..5cd8f5f7e6
--- /dev/null
+++ b/src/backend/cuda/cudnn.hpp
@@ -0,0 +1,188 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/defines.hpp>
+#include <common/half.hpp>
+#include <common/unique_handle.hpp>
+#include <cudnnModule.hpp>
+#include <af/dim4.hpp>
+
+// clang-format off
+DEFINE_HANDLER(cudnnHandle_t, arrayfire::cuda::getCudnnPlugin().cudnnCreate, arrayfire::cuda::getCudnnPlugin().cudnnDestroy);
+
+DEFINE_HANDLER(cudnnTensorDescriptor_t, arrayfire::cuda::getCudnnPlugin().cudnnCreateTensorDescriptor, arrayfire::cuda::getCudnnPlugin().cudnnDestroyTensorDescriptor);
+
+DEFINE_HANDLER(cudnnFilterDescriptor_t, arrayfire::cuda::getCudnnPlugin().cudnnCreateFilterDescriptor, arrayfire::cuda::getCudnnPlugin().cudnnDestroyFilterDescriptor);
+
+DEFINE_HANDLER(cudnnConvolutionDescriptor_t, arrayfire::cuda::getCudnnPlugin().cudnnCreateConvolutionDescriptor, arrayfire::cuda::getCudnnPlugin().cudnnDestroyConvolutionDescriptor);
+// clang-format on
+
+namespace arrayfire {
+namespace cuda {
+
+const char *errorString(cudnnStatus_t err);
+
+#define CUDNN_CHECK(fn)                                                     \
+    do {                                                                    \
+        cudnnStatus_t _error = (fn);                                        \
+        if (_error == CUDNN_STATUS_SUCCESS) {                               \
+            break;                                                          \
+        } else if (_error == CUDNN_STATUS_ALLOC_FAILED) {                   \
+            AF_ERROR(                                                       \
+                "CUDNN Error(CUDNN_STATUS_ALLOC_FAILED): Error allocating " \
+                "for function call",                                        \
+                AF_ERR_NO_MEM);                                             \
+        } else if (_error == CUDNN_STATUS_NOT_SUPPORTED) {                  \
+            CUDA_NOT_SUPPORTED(                                             \
+                "CUDNN Error(CUDNN_STATUS_NOT_SUPPORTED): This version of " \
+                "CUDNN does not support the data type or the size of this " \
+                "operation");                                               \
+        } else {                                                            \
+            char _err_msg[1024];                                            \
+            snprintf(_err_msg, sizeof(_err_msg), "CUDNN Error(%s): \n",     \
+                     errorString(_error));                                  \
+            AF_ERROR(_err_msg, AF_ERR_INTERNAL);                            \
+        }                                                                   \
+    } while (0)
+
+/// Returns a cuDNN type based on the template parameter
+template<typename T>
+cudnnDataType_t getCudnnDataType();
+
+void cudnnSet(cudnnTensorDescriptor_t desc, cudnnDataType_t cudnn_dtype,
+              af::dim4 dims);
+
+void cudnnSet(cudnnFilterDescriptor_t desc, cudnnDataType_t cudnn_dtype,
+              af::dim4 dims);
+
+// cuDNN Wrappers
+//
+// cuDNN often deprecates and introduces function names between releases. In
+// order to prevent locking ArrayFire versions to specific cuDNN versions, we
+// wrap all cuDNN calls so that the main codebase is not full of ifdefs. The
+// following functions are wrappers around cuDNN functions that abstract out
+// the differences between cuDNN versions.
+//
+
+cudnnStatus_t cudnnSetConvolution2dDescriptor(
+    cudnnConvolutionDescriptor_t convDesc,
+    int pad_h,     // zero-padding height
+    int pad_w,     // zero-padding width
+    int u,         // vertical filter stride
+    int v,         // horizontal filter stride
+    int upscalex,  // upscale the input in x-direction
+    int upscaley,  // upscale the input in y-direction
+    cudnnConvolutionMode_t mode, cudnnDataType_t computeType);
+
+cudnnStatus_t cudnnSetFilter4dDescriptor(cudnnFilterDescriptor_t filterDesc,
+                                         cudnnDataType_t dataType,
+                                         cudnnTensorFormat_t format, int k,
+                                         int c, int h, int w);
+
+cudnnStatus_t cudnnSetTensor4dDescriptor(
+    cudnnTensorDescriptor_t tensorDesc, cudnnTensorFormat_t format,
+    cudnnDataType_t dataType, /* image data type */
+    int n,                    /* number of inputs (batch size) */
+    int c,                    /* number of input feature maps */
+    int h,                    /* height of input section */
+    int w);                   /* width of input section */
+
+cudnnStatus_t cudnnGetConvolutionBackwardDataWorkspaceSize(
+    cudnnHandle_t handle, const cudnnFilterDescriptor_t wDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t dxDesc, cudnnConvolutionBwdDataAlgo_t algo,
+    size_t *sizeInBytes);
+
+cudnnStatus_t cudnnConvolutionBackwardData(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnFilterDescriptor_t wDesc, const void *w,
+    const cudnnTensorDescriptor_t dyDesc, const void *dy,
+    const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionBwdDataAlgo_t algo, void *workSpace,
+    size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnTensorDescriptor_t dxDesc, void *dx);
+
+cudnnStatus_t cudnnGetConvolutionNdForwardOutputDim(
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t inputTensorDesc,
+    const cudnnFilterDescriptor_t filterDesc, int nbDims,
+    int tensorOuputDimA[]);
+
+cudnnStatus_t cudnnGetConvolutionForwardAlgorithmMaxCount(cudnnHandle_t handle,
+                                                          int *count);
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithmMaxCount(
+    cudnnHandle_t handle, int *count);
+
+cudnnStatus_t cudnnGetConvolutionForwardWorkspaceSize(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc, cudnnConvolutionFwdAlgo_t algo,
+    size_t *sizeInBytes);
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterWorkspaceSize(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t gradDesc,
+    cudnnConvolutionBwdFilterAlgo_t algo, size_t *sizeInBytes);
+
+cudnnStatus_t cudnnFindConvolutionForwardAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc, const int requestedAlgoCount,
+    int *returnedAlgoCount, cudnnConvolutionFwdAlgoPerf_t *perfResults);
+
+cudnnStatus_t cudnnFindConvolutionBackwardFilterAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t dwDesc, const int requestedAlgoCount,
+    int *returnedAlgoCount, cudnnConvolutionBwdFilterAlgoPerf_t *perfResults);
+
+cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc,
+    cudnnConvolutionFwdPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionFwdAlgo_t *algo);
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t dwDesc,
+    cudnnConvolutionBwdFilterPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionBwdFilterAlgo_t *algo);
+
+cudnnStatus_t cudnnConvolutionForward(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnTensorDescriptor_t xDesc, const void *x,
+    const cudnnFilterDescriptor_t wDesc, const void *w,
+    const cudnnConvolutionDescriptor_t convDesc, cudnnConvolutionFwdAlgo_t algo,
+    void *workSpace, size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnTensorDescriptor_t yDesc, void *y);
+
+cudnnStatus_t cudnnConvolutionBackwardFilter(
+    cudnnHandle_t handle, const void *alpha,
+    const cudnnTensorDescriptor_t xDesc, const void *x,
+    const cudnnTensorDescriptor_t dyDesc, const void *dy,
+    const cudnnConvolutionDescriptor_t convDesc,
+    cudnnConvolutionBwdFilterAlgo_t algo, void *workSpace,
+    size_t workSpaceSizeInBytes, const void *beta,
+    const cudnnFilterDescriptor_t dwDesc, void *dw);
+
+}  // namespace cuda
+}  // namespace arrayfire
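Review note: the wrapper comment in cudnn.hpp describes hiding version differences behind one call site. A minimal, self-contained sketch of that idea follows. `FakePlugin`, `loadFakePlugin`, and `getAlgorithmWrapper` are hypothetical names invented for illustration; they are not ArrayFire or cuDNN APIs, and the real code reports the failure through `AF_ERROR` rather than `std::runtime_error`.

```cpp
#include <functional>
#include <stdexcept>

// Hypothetical stand-in for a dynamically loaded library (illustration only).
struct FakePlugin {
    int majorVersion;
    // Left empty when the loaded library no longer exports the symbol.
    std::function<int(int)> getAlgorithm;
};

// Binds the symbol only for versions that still provide it, mirroring the
// "major() < 8" guard used when the symbols are initialized.
inline FakePlugin loadFakePlugin(int majorVersion) {
    FakePlugin plugin{majorVersion, nullptr};
    if (majorVersion < 8) {
        plugin.getAlgorithm = [](int preference) { return preference + 1; };
    }
    return plugin;
}

// Single entry point for callers: forwards to the library on old versions
// and fails loudly on versions where the function was removed.
inline int getAlgorithmWrapper(const FakePlugin &plugin, int preference) {
    if (plugin.majorVersion < 8) { return plugin.getAlgorithm(preference); }
    throw std::runtime_error("getAlgorithm has been removed since version 8");
}
```

Callers only ever see `getAlgorithmWrapper`, so the version check lives in one place instead of being repeated at every call site.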
diff --git a/src/backend/cuda/cudnnModule.cpp b/src/backend/cuda/cudnnModule.cpp
new file mode 100644
index 0000000000..66c4b4ab06
--- /dev/null
+++ b/src/backend/cuda/cudnnModule.cpp
@@ -0,0 +1,185 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cudnnModule.hpp>
+
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/Logger.hpp>
+#include <common/err_common.hpp>
+#include <common/util.hpp>
+#include <device_manager.hpp>
+#include <utility.hpp>
+
+#include <array>
+#include <string>
+#include <tuple>
+
+using arrayfire::common::fromCudaVersion;
+using arrayfire::common::Version;
+using std::make_tuple;
+using std::string;
+
+namespace arrayfire {
+namespace cuda {
+
+// clang-format off
+// The latest version of each minor release is listed below
+constexpr std::array<common::Version, 11> cudnnVersions = {
+    Version(8, 0,  1),
+    Version(7, 6,  5),
+    Version(7, 5,  1),
+    Version(7, 4,  2),
+    Version(7, 3,  1),
+    Version(7, 2,  1),
+    Version(7, 1,  4),
+    Version(7, 0,  5),
+    Version(6, 0, 21),
+    Version(5, 1, 10),
+    Version(4, 0,  7)
+};
+// clang-format on
+
+spdlog::logger* cudnnModule::getLogger() const noexcept {
+    return module.getLogger();
+}
+
+Version cudnnVersionComponents(size_t version) {
+    int major = static_cast<int>(version / 1000);
+    int minor = static_cast<int>((version - (major * 1000)) / 100);
+    int patch = static_cast<int>(version - (major * 1000) - (minor * 100));
+    return {major, minor, patch};
+}
+
+Version cudaRuntimeVersionComponents(size_t version) {
+    int major = static_cast<int>(version / 1000);
+    int minor = static_cast<int>((version - (major * 1000)) / 10);
+    int patch =
+        static_cast<int>((version - (major * 1000) - (minor * 10)) / 10);
+    return {major, minor, patch};
+}
+
+Version getCudnnVersion(const LibHandle& handle) {
+    std::function<size_t()> fptr(reinterpret_cast<size_t (*)()>(
+        common::getFunctionPointer(handle, "cudnnGetVersion")));
+    size_t v = fptr();
+
+    return cudnnVersionComponents(v);
+}
+
+cudnnModule::cudnnModule()
+    : module({"cudnn"}, {"", "64_8", "64_7", "64_6", "64_5", "64_4"}, {""},
+             cudnnVersions.size(), cudnnVersions.data(), getCudnnVersion) {
+    if (!module.isLoaded()) {
+        AF_TRACE(
+            "WARNING: Unable to load cuDNN: {}"
+            "\ncuDNN failed to load. Try installing cuDNN or check if cuDNN is "
+            "in the search path. On Linux, you can set the LD_DEBUG=libs "
+            "environment variable to debug loading issues. Falling back to "
+            "matmul based implementation",
+            module.getErrorMessage());
+
+        return;
+    }
+
+    MODULE_FUNCTION_INIT(cudnnGetVersion);
+
+    size_t cudnn_rtversion_val = 0;
+
+    Version cudnn_version = module.getVersion();
+    if (cudnn_version < Version(6)) {
+        AF_TRACE(
+            "Warning: This version of cuDNN({}) does not support "
+            "cudnnGetCudartVersion. No runtime checks performed.",
+            cudnn_version);
+    } else {
+        MODULE_FUNCTION_INIT(cudnnGetCudartVersion);
+        cudnn_rtversion_val = this->cudnnGetCudartVersion();
+    }
+
+    Version cudnn_rtversion = cudaRuntimeVersionComponents(cudnn_rtversion_val);
+
+    AF_TRACE("cuDNN Version: {} cuDNN CUDA Runtime: {}", cudnn_version,
+             cudnn_rtversion);
+
+    Version compiled_cudnn_version = fromCudaVersion(CUDNN_VERSION);
+
+    // Check to see if the version of cuDNN ArrayFire was compiled against
+    // is compatible with the version loaded at runtime
+    if (compiled_cudnn_version.major() <= 6 &&
+        compiled_cudnn_version < cudnn_version) {
+        string error_msg = fmt::format(
+            "ArrayFire was compiled with an older version of cuDNN({}.{}) that "
+            "does not support the version that was loaded at runtime({}.{}).",
+            CUDNN_MAJOR, CUDNN_MINOR, cudnn_version.major(),
+            cudnn_version.minor());
+        AF_ERROR(error_msg, AF_ERR_NOT_SUPPORTED);
+    }
+
+    int afcuda_runtime_version = 0;
+    cudaRuntimeGetVersion(&afcuda_runtime_version);
+    Version afcuda_runtime = fromCudaVersion(afcuda_runtime_version);
+    if (afcuda_runtime != cudnn_rtversion) {
+        getLogger()->warn(
+            "WARNING: ArrayFire CUDA Runtime({}) and cuDNN CUDA "
+            "Runtime({}) do not match. For maximum compatibility, make sure "
+            "the two versions match. (Ignoring check)",
+            // NOTE: the int version formats from CUDA and cuDNN are different,
+            // so both values are converted to Version objects before they
+            // are compared and logged
+            afcuda_runtime, cudnn_rtversion);
+    }
+
+    MODULE_FUNCTION_INIT(cudnnConvolutionBackwardData);
+    MODULE_FUNCTION_INIT(cudnnConvolutionBackwardFilter);
+    MODULE_FUNCTION_INIT(cudnnConvolutionForward);
+    MODULE_FUNCTION_INIT(cudnnCreate);
+    MODULE_FUNCTION_INIT(cudnnCreateConvolutionDescriptor);
+    MODULE_FUNCTION_INIT(cudnnCreateFilterDescriptor);
+    MODULE_FUNCTION_INIT(cudnnCreateTensorDescriptor);
+    MODULE_FUNCTION_INIT(cudnnDestroy);
+    MODULE_FUNCTION_INIT(cudnnDestroyConvolutionDescriptor);
+    MODULE_FUNCTION_INIT(cudnnDestroyFilterDescriptor);
+    MODULE_FUNCTION_INIT(cudnnDestroyTensorDescriptor);
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionBackwardDataWorkspaceSize);
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionForwardAlgorithmMaxCount);
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionBackwardFilterAlgorithmMaxCount);
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionForwardWorkspaceSize);
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionBackwardFilterWorkspaceSize);
+    MODULE_FUNCTION_INIT(cudnnFindConvolutionForwardAlgorithm);
+    MODULE_FUNCTION_INIT(cudnnFindConvolutionBackwardFilterAlgorithm);
+    if (cudnn_version.major() < 8) {
+        MODULE_FUNCTION_INIT(cudnnGetConvolutionForwardAlgorithm);
+        MODULE_FUNCTION_INIT(cudnnGetConvolutionBackwardFilterAlgorithm);
+    }
+    MODULE_FUNCTION_INIT(cudnnGetConvolutionNdForwardOutputDim);
+    MODULE_FUNCTION_INIT(cudnnSetConvolution2dDescriptor);
+    MODULE_FUNCTION_INIT(cudnnSetFilter4dDescriptor);
+    if (cudnn_version.major() == 4) {
+        MODULE_FUNCTION_INIT(cudnnSetFilter4dDescriptor_v4);
+    }
+    MODULE_FUNCTION_INIT(cudnnSetStream);
+    MODULE_FUNCTION_INIT(cudnnSetTensor4dDescriptor);
+
+    if (!module.symbolsLoaded()) {
+        string error_message =
+            "Error loading cuDNN symbols. ArrayFire was unable to load some "
+            "symbols from the cuDNN library. Please create an issue on the "
+            "ArrayFire repository with information about the installed cuDNN "
+            "and ArrayFire on your system.";
+        AF_ERROR(error_message, AF_ERR_LOAD_LIB);
+    }
+}
+
+cudnnModule& getCudnnPlugin() noexcept {
+    static auto* plugin = new cudnnModule();
+    return *plugin;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
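Review note: `cudnnVersionComponents` in cudnnModule.cpp unpacks an integer laid out as `major * 1000 + minor * 100 + patch`. The sketch below restates that decoding with modulo arithmetic so the layout is easier to check by hand. `SimpleVersion` and `decodeCudnnVersion` are hypothetical names used only here, and the layout is assumed from the function in the patch, not taken from the cuDNN documentation.

```cpp
#include <cstddef>

// Hypothetical stand-in for common::Version, used only in this sketch.
struct SimpleVersion {
    int majorPart;
    int minorPart;
    int patchPart;
};

// Decodes major * 1000 + minor * 100 + patch, the layout assumed by
// cudnnVersionComponents in the patch.
inline SimpleVersion decodeCudnnVersion(std::size_t version) {
    int majorPart = static_cast<int>(version / 1000);
    int minorPart = static_cast<int>((version % 1000) / 100);
    int patchPart = static_cast<int>(version % 100);
    return {majorPart, minorPart, patchPart};
}
```

Under that assumption, 7605 decodes to 7.6.5 and 8001 to 8.0.1, which matches the first two entries of the `cudnnVersions` table.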
diff --git a/src/backend/cuda/cudnnModule.hpp b/src/backend/cuda/cudnnModule.hpp
new file mode 100644
index 0000000000..26856f69d7
--- /dev/null
+++ b/src/backend/cuda/cudnnModule.hpp
@@ -0,0 +1,112 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/DependencyModule.hpp>
+
+#include <cudnn.h>
+
+#include <memory>
+#include <tuple>
+
+#if CUDNN_VERSION > 4000
+// This function is not available on versions greater than v4
+cudnnStatus_t cudnnSetFilter4dDescriptor_v4(
+    cudnnFilterDescriptor_t filterDesc,
+    cudnnDataType_t dataType,  // image data type
+    cudnnTensorFormat_t format,
+    int k,   // number of output feature maps
+    int c,   // number of input feature maps
+    int h,   // height of each input filter
+    int w);  // width of each input filter
+#else
+// This function is only available on newer versions of cudnn
+size_t cudnnGetCudartVersion(void);
+#endif
+
+#if CUDNN_VERSION >= 8000
+typedef enum {
+    CUDNN_CONVOLUTION_FWD_NO_WORKSPACE            = 0,
+    CUDNN_CONVOLUTION_FWD_PREFER_FASTEST          = 1,
+    CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT = 2,
+} cudnnConvolutionFwdPreference_t;
+
+typedef enum {
+    CUDNN_CONVOLUTION_BWD_FILTER_NO_WORKSPACE            = 0,
+    CUDNN_CONVOLUTION_BWD_FILTER_PREFER_FASTEST          = 1,
+    CUDNN_CONVOLUTION_BWD_FILTER_SPECIFY_WORKSPACE_LIMIT = 2,
+} cudnnConvolutionBwdFilterPreference_t;
+
+cudnnStatus_t cudnnGetConvolutionForwardAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnFilterDescriptor_t wDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnTensorDescriptor_t yDesc,
+    cudnnConvolutionFwdPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionFwdAlgo_t* algo);
+
+cudnnStatus_t cudnnGetConvolutionBackwardFilterAlgorithm(
+    cudnnHandle_t handle, const cudnnTensorDescriptor_t xDesc,
+    const cudnnTensorDescriptor_t dyDesc,
+    const cudnnConvolutionDescriptor_t convDesc,
+    const cudnnFilterDescriptor_t dwDesc,
+    cudnnConvolutionBwdFilterPreference_t preference, size_t memoryLimitInBytes,
+    cudnnConvolutionBwdFilterAlgo_t* algo);
+#endif
+
+namespace arrayfire {
+namespace cuda {
+
+class cudnnModule {
+    common::DependencyModule module;
+
+   public:
+    cudnnModule();
+    MODULE_MEMBER(cudnnConvolutionBackwardData);
+    MODULE_MEMBER(cudnnConvolutionBackwardFilter);
+    MODULE_MEMBER(cudnnConvolutionForward);
+    MODULE_MEMBER(cudnnCreate);
+    MODULE_MEMBER(cudnnCreateConvolutionDescriptor);
+    MODULE_MEMBER(cudnnCreateFilterDescriptor);
+    MODULE_MEMBER(cudnnCreateTensorDescriptor);
+    MODULE_MEMBER(cudnnDestroy);
+    MODULE_MEMBER(cudnnDestroyConvolutionDescriptor);
+    MODULE_MEMBER(cudnnDestroyFilterDescriptor);
+    MODULE_MEMBER(cudnnDestroyTensorDescriptor);
+    MODULE_MEMBER(cudnnGetConvolutionBackwardDataWorkspaceSize);
+    MODULE_MEMBER(cudnnGetConvolutionForwardAlgorithmMaxCount);
+    MODULE_MEMBER(cudnnGetConvolutionBackwardFilterAlgorithmMaxCount);
+    MODULE_MEMBER(cudnnFindConvolutionForwardAlgorithm);
+    MODULE_MEMBER(cudnnFindConvolutionBackwardFilterAlgorithm);
+    MODULE_MEMBER(cudnnGetConvolutionForwardWorkspaceSize);
+    MODULE_MEMBER(cudnnGetConvolutionBackwardFilterWorkspaceSize);
+    MODULE_MEMBER(cudnnGetConvolutionForwardAlgorithm);
+    MODULE_MEMBER(cudnnGetConvolutionBackwardFilterAlgorithm);
+    MODULE_MEMBER(cudnnGetConvolutionNdForwardOutputDim);
+    MODULE_MEMBER(cudnnSetConvolution2dDescriptor);
+    MODULE_MEMBER(cudnnSetFilter4dDescriptor);
+    MODULE_MEMBER(cudnnSetFilter4dDescriptor_v4);
+    MODULE_MEMBER(cudnnGetVersion);
+    MODULE_MEMBER(cudnnGetCudartVersion);
+    MODULE_MEMBER(cudnnSetStream);
+    MODULE_MEMBER(cudnnSetTensor4dDescriptor);
+
+    spdlog::logger* getLogger() const noexcept;
+
+    /// Returns the version of the cuDNN loaded at runtime
+    common::Version getVersion() const noexcept { return module.getVersion(); }
+
+    bool isLoaded() const noexcept { return module.isLoaded(); }
+};
+
+cudnnModule& getCudnnPlugin() noexcept;
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cufft.cu b/src/backend/cuda/cufft.cu
new file mode 100644
index 0000000000..69d7229b6b
--- /dev/null
+++ b/src/backend/cuda/cufft.cu
@@ -0,0 +1,124 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cufft.hpp>
+
+#include <memory.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+const char *_cufftGetResultString(cufftResult res) {
+    switch (res) {
+        case CUFFT_SUCCESS: return "cuFFT: success";
+
+        case CUFFT_INVALID_PLAN: return "cuFFT: invalid plan handle passed";
+
+        case CUFFT_ALLOC_FAILED: return "cuFFT: resources allocation failed";
+
+        case CUFFT_INVALID_TYPE: return "cuFFT: invalid type (deprecated)";
+
+        case CUFFT_INVALID_VALUE:
+            return "cuFFT: invalid parameters passed to cuFFT API";
+
+        case CUFFT_INTERNAL_ERROR:
+            return "cuFFT: internal error detected using cuFFT";
+
+        case CUFFT_EXEC_FAILED: return "cuFFT: FFT execution failed";
+
+        case CUFFT_SETUP_FAILED: return "cuFFT: library initialization failed";
+
+        case CUFFT_INVALID_SIZE: return "cuFFT: invalid size parameters passed";
+
+        case CUFFT_UNALIGNED_DATA: return "cuFFT: unaligned data (deprecated)";
+
+        case CUFFT_INCOMPLETE_PARAMETER_LIST:
+            return "cuFFT: call is missing parameters";
+
+        case CUFFT_INVALID_DEVICE:
+            return "cuFFT: plan execution different than plan creation";
+
+        case CUFFT_PARSE_ERROR: return "cuFFT: plan parse error";
+
+        case CUFFT_NO_WORKSPACE: return "cuFFT: no workspace provided";
+
+        case CUFFT_NOT_IMPLEMENTED: return "cuFFT: not implemented";
+
+        case CUFFT_LICENSE_ERROR: return "cuFFT: license error";
+
+#if CUDA_VERSION >= 8000
+        case CUFFT_NOT_SUPPORTED: return "cuFFT: not supported";
+#endif
+    }
+
+    return "cuFFT: unknown error";
+}
+
+SharedPlan findPlan(int rank, int *n, int *inembed, int istride, int idist,
+                    int *onembed, int ostride, int odist, cufftType type,
+                    int batch) {
+    // create the key string
+    char key_str_temp[64];
+    sprintf(key_str_temp, "%d:", rank);
+
+    std::string key_string(key_str_temp);
+
+    for (int r = 0; r < rank; ++r) {
+        sprintf(key_str_temp, "%d:", n[r]);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (inembed != NULL) {
+        for (int r = 0; r < rank; ++r) {
+            sprintf(key_str_temp, "%d:", inembed[r]);
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, "%d:%d:", istride, idist);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (onembed != NULL) {
+        for (int r = 0; r < rank; ++r) {
+            sprintf(key_str_temp, "%d:", onembed[r]);
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, "%d:%d:", ostride, odist);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    sprintf(key_str_temp, "%d:%d", (int)type, batch);
+    key_string.append(std::string(key_str_temp));
+
+    PlanCache &planner = arrayfire::cuda::fftManager();
+    SharedPlan retVal  = planner.find(key_string);
+
+    if (retVal) return retVal;
+
+    PlanType *temp  = (PlanType *)malloc(sizeof(PlanType));
+    cufftResult res = cufftPlanMany(temp, rank, n, inembed, istride, idist,
+                                    onembed, ostride, odist, type, batch);
+
+    // If plan creation fails, clean up the memory we hold on to and try again
+    if (res != CUFFT_SUCCESS) {
+        arrayfire::cuda::signalMemoryCleanup();
+        CUFFT_CHECK(cufftPlanMany(temp, rank, n, inembed, istride, idist,
+                                  onembed, ostride, odist, type, batch));
+    }
+
+    retVal.reset(temp, [](PlanType *p) {
+        cufftDestroy(*p);
+        free(p);
+    });
+    // push the plan into plan cache
+    planner.push(key_string, retVal);
+
+    return retVal;
+}
+}  // namespace cuda
+}  // namespace arrayfire
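Review note: `findPlan` in cufft.cu builds its cache key by repeatedly formatting integers into a fixed 64-byte buffer. As a possible alternative, the sketch below builds the same colon-separated key with `std::to_string`, which avoids the temporary buffer. `appendField` and `makePlanKey` are hypothetical helpers, and `type` is passed as a plain `int` so the example compiles without the cuFFT headers.

```cpp
#include <string>

// Appends "<value>:" to the key.
inline void appendField(std::string &key, int value) {
    key += std::to_string(value);
    key += ':';
}

// Builds a key in the same field order as findPlan: rank, the dimensions,
// the optional input and output embeddings with their stride and distance,
// then "type:batch" without a trailing colon.
inline std::string makePlanKey(int rank, const int *n, const int *inembed,
                               int istride, int idist, const int *onembed,
                               int ostride, int odist, int type, int batch) {
    std::string key;
    appendField(key, rank);
    for (int r = 0; r < rank; ++r) { appendField(key, n[r]); }

    if (inembed != nullptr) {
        for (int r = 0; r < rank; ++r) { appendField(key, inembed[r]); }
        appendField(key, istride);
        appendField(key, idist);
    }

    if (onembed != nullptr) {
        for (int r = 0; r < rank; ++r) { appendField(key, onembed[r]); }
        appendField(key, ostride);
        appendField(key, odist);
    }

    key += std::to_string(type);
    key += ':';
    key += std::to_string(batch);
    return key;
}
```

As in `findPlan`, the stride and distance values are ignored when the matching embedding pointer is null, so those calls share a key regardless of the stride arguments.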
diff --git a/src/backend/cuda/cufft.hpp b/src/backend/cuda/cufft.hpp
new file mode 100644
index 0000000000..80ba06c8f5
--- /dev/null
+++ b/src/backend/cuda/cufft.hpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/FFTPlanCache.hpp>
+#include <common/err_common.hpp>
+#include <common/unique_handle.hpp>
+#include <cufft.h>
+#include <cstdio>
+
+DEFINE_HANDLER(cufftHandle, cufftCreate, cufftDestroy);
+
+namespace arrayfire {
+namespace cuda {
+
+typedef cufftHandle PlanType;
+typedef std::shared_ptr<PlanType> SharedPlan;
+
+const char *_cufftGetResultString(cufftResult res);
+
+SharedPlan findPlan(int rank, int *n, int *inembed, int istride, int idist,
+                    int *onembed, int ostride, int odist, cufftType type,
+                    int batch);
+
+class PlanCache : public common::FFTPlanCache<PlanCache, PlanType> {
+    friend SharedPlan findPlan(int rank, int *n, int *inembed, int istride,
+                               int idist, int *onembed, int ostride, int odist,
+                               cufftType type, int batch);
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
+
+#define CUFFT_CHECK(fn)                                                   \
+    do {                                                                  \
+        cufftResult _cufft_res = fn;                                      \
+        if (_cufft_res != CUFFT_SUCCESS) {                                \
+            char cufft_res_msg[1024];                                     \
+            snprintf(cufft_res_msg, sizeof(cufft_res_msg),                \
+                     "cuFFT Error (%d): %s\n", (int)(_cufft_res),         \
+                     arrayfire::cuda::_cufftGetResultString(_cufft_res)); \
+                                                                          \
+            AF_ERROR(cufft_res_msg, AF_ERR_INTERNAL);                     \
+        }                                                                 \
+    } while (0)
diff --git a/src/backend/cuda/cusolverDn.cpp b/src/backend/cuda/cusolverDn.cpp
new file mode 100644
index 0000000000..3cbfec6898
--- /dev/null
+++ b/src/backend/cuda/cusolverDn.cpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cusolverDn.hpp>
+#include <debug_cuda.hpp>
+#include <platform.hpp>
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+const char *errorString(cusolverStatus_t err) {
+    switch (err) {
+        case CUSOLVER_STATUS_SUCCESS: return "CUSOLVER_STATUS_SUCCESS";
+        case CUSOLVER_STATUS_NOT_INITIALIZED:
+            return "CUSOLVER_STATUS_NOT_INITIALIZED";
+        case CUSOLVER_STATUS_ALLOC_FAILED:
+            return "CUSOLVER_STATUS_ALLOC_FAILED";
+        case CUSOLVER_STATUS_INVALID_VALUE:
+            return "CUSOLVER_STATUS_INVALID_VALUE";
+        case CUSOLVER_STATUS_ARCH_MISMATCH:
+            return "CUSOLVER_STATUS_ARCH_MISMATCH";
+        case CUSOLVER_STATUS_MAPPING_ERROR:
+            return "CUSOLVER_STATUS_MAPPING_ERROR";
+        case CUSOLVER_STATUS_EXECUTION_FAILED:
+            return "CUSOLVER_STATUS_EXECUTION_FAILED";
+        case CUSOLVER_STATUS_INTERNAL_ERROR:
+            return "CUSOLVER_STATUS_INTERNAL_ERROR";
+        case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
+            return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
+        case CUSOLVER_STATUS_NOT_SUPPORTED:
+            return "CUSOLVER_STATUS_NOT_SUPPORTED";
+        case CUSOLVER_STATUS_ZERO_PIVOT: return "CUSOLVER_STATUS_ZERO_PIVOT";
+        case CUSOLVER_STATUS_INVALID_LICENSE:
+            return "CUSOLVER_STATUS_INVALID_LICENSE";
+        default: return "UNKNOWN";
+    }
+}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusolverDn.hpp b/src/backend/cuda/cusolverDn.hpp
new file mode 100644
index 0000000000..e9edab58b5
--- /dev/null
+++ b/src/backend/cuda/cusolverDn.hpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/unique_handle.hpp>
+#include <cusolverDn.h>
+
+DEFINE_HANDLER(cusolverDnHandle_t, cusolverDnCreate, cusolverDnDestroy);
+
+namespace arrayfire {
+namespace cuda {
+
+const char* errorString(cusolverStatus_t err);
+
+#define CUSOLVER_CHECK(fn)                                                    \
+    do {                                                                      \
+        cusolverStatus_t _error = fn;                                         \
+        if (_error != CUSOLVER_STATUS_SUCCESS) {                              \
+            char _err_msg[1024];                                              \
+            snprintf(_err_msg, sizeof(_err_msg), "CUSOLVER Error (%d): %s\n", \
+                     (int)(_error), arrayfire::cuda::errorString(_error));    \
+                                                                              \
+            AF_ERROR(_err_msg, AF_ERR_INTERNAL);                              \
+        }                                                                     \
+    } while (0)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusolverDnManager.cpp b/src/backend/cuda/cusolverDnManager.cpp
deleted file mode 100644
index 7b4564182f..0000000000
--- a/src/backend/cuda/cusolverDnManager.cpp
+++ /dev/null
@@ -1,81 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <cusolverDnManager.hpp>
-#include <platform.hpp>
-
-#include <stdexcept>
-#include <string>
-#include <iostream>
-#include <boost/scoped_ptr.hpp>
-
-namespace cusolver {
-
-    const char *errorString(cusolverStatus_t err)
-    {
-        switch(err) {
-        case    CUSOLVER_STATUS_SUCCESS                     :   return "CUSOLVER_STATUS_SUCCESS"                    ;
-        case    CUSOLVER_STATUS_NOT_INITIALIZED             :   return "CUSOLVER_STATUS_NOT_INITIALIZED"            ;
-        case    CUSOLVER_STATUS_ALLOC_FAILED                :   return "CUSOLVER_STATUS_ALLOC_FAILED"               ;
-        case    CUSOLVER_STATUS_INVALID_VALUE               :   return "CUSOLVER_STATUS_INVALID_VALUE"              ;
-        case    CUSOLVER_STATUS_ARCH_MISMATCH               :   return "CUSOLVER_STATUS_ARCH_MISMATCH"              ;
-        case    CUSOLVER_STATUS_MAPPING_ERROR               :   return "CUSOLVER_STATUS_MAPPING_ERROR"              ;
-        case    CUSOLVER_STATUS_EXECUTION_FAILED            :   return "CUSOLVER_STATUS_EXECUTION_FAILED"           ;
-        case    CUSOLVER_STATUS_INTERNAL_ERROR              :   return "CUSOLVER_STATUS_INTERNAL_ERROR"             ;
-        case    CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED   :   return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED"  ;
-        case    CUSOLVER_STATUS_NOT_SUPPORTED               :   return "CUSOLVER_STATUS_NOT_SUPPORTED"              ;
-        case    CUSOLVER_STATUS_ZERO_PIVOT                  :   return "CUSOLVER_STATUS_ZERO_PIVOT"                 ;
-        case    CUSOLVER_STATUS_INVALID_LICENSE             :   return "CUSOLVER_STATUS_INVALID_LICENSE"            ;
-        default                                             :   return "UNKNOWN";
-        }
-    }
-
-
-//RAII class around the cusolver Handle
-    class cusolverDnHandle
-    {
-        cusolverDnHandle_t handle;
-    public:
-
-        cusolverDnHandle()
-            : handle(0)
-        {
-            CUSOLVER_CHECK(cusolverDnCreate(&handle));
-        }
-
-        ~cusolverDnHandle()
-        {
-            cusolverDnDestroy(handle);
-        }
-
-        cusolverDnHandle_t get() const
-        {
-            return handle;
-        }
-    };
-
-    cusolverDnHandle_t getDnHandle()
-    {
-        using boost::scoped_ptr;
-        static scoped_ptr<cusolverDnHandle> handle[cuda::DeviceManager::MAX_DEVICES];
-
-        int id = cuda::getActiveDeviceId();
-
-        if(!handle[id]) {
-            handle[id].reset(new cusolverDnHandle());
-        }
-
-        return handle[id]->get();
-    }
-
-}
-
-#endif
diff --git a/src/backend/cuda/cusolverDnManager.hpp b/src/backend/cuda/cusolverDnManager.hpp
deleted file mode 100644
index f4035a4043..0000000000
--- a/src/backend/cuda/cusolverDnManager.hpp
+++ /dev/null
@@ -1,41 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-#pragma once
-
-#include <cuda_runtime.h>
-#include <cusolverDn.h>
-#include <stdio.h>
-#include <err_common.hpp>
-
-namespace cusolver
-{
-
-    const char * errorString(cusolverStatus_t err);
-    cusolverDnHandle_t getDnHandle();
-}
-
-#define CUSOLVER_CHECK(fn) do {                     \
-        cusolverStatus_t _error = fn;               \
-        if (_error != CUSOLVER_STATUS_SUCCESS) {    \
-            char _err_msg[1024];                    \
-            snprintf(_err_msg,                      \
-                     sizeof(_err_msg),              \
-                     "CUBLAS Error (%d): %s\n",     \
-                     (int)(_error),                 \
-                     cusolver::errorString(         \
-                         _error));                  \
-                                                    \
-            AF_ERROR(_err_msg,                      \
-                     AF_ERR_INTERNAL);              \
-        }                                           \
-    } while(0)
-
-#endif
diff --git a/src/backend/cuda/cusparse.cpp b/src/backend/cuda/cusparse.cpp
new file mode 100644
index 0000000000..224d798327
--- /dev/null
+++ b/src/backend/cuda/cusparse.cpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cusparse.hpp>
+#include <platform.hpp>
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+const char* errorString(cusparseStatus_t err) {
+    switch (err) {
+        case CUSPARSE_STATUS_SUCCESS: return "CUSPARSE_STATUS_SUCCESS";
+        case CUSPARSE_STATUS_NOT_INITIALIZED:
+            return "CUSPARSE_STATUS_NOT_INITIALIZED";
+        case CUSPARSE_STATUS_ALLOC_FAILED:
+            return "CUSPARSE_STATUS_ALLOC_FAILED";
+        case CUSPARSE_STATUS_INVALID_VALUE:
+            return "CUSPARSE_STATUS_INVALID_VALUE";
+        case CUSPARSE_STATUS_ARCH_MISMATCH:
+            return "CUSPARSE_STATUS_ARCH_MISMATCH";
+        case CUSPARSE_STATUS_MAPPING_ERROR:
+            return "CUSPARSE_STATUS_MAPPING_ERROR";
+        case CUSPARSE_STATUS_EXECUTION_FAILED:
+            return "CUSPARSE_STATUS_EXECUTION_FAILED";
+        case CUSPARSE_STATUS_INTERNAL_ERROR:
+            return "CUSPARSE_STATUS_INTERNAL_ERROR";
+        case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
+            return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
+        case CUSPARSE_STATUS_ZERO_PIVOT: return "CUSPARSE_STATUS_ZERO_PIVOT";
+        default: return "UNKNOWN";
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusparse.hpp b/src/backend/cuda/cusparse.hpp
new file mode 100644
index 0000000000..e7b5a51e33
--- /dev/null
+++ b/src/backend/cuda/cusparse.hpp
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/SparseArray.hpp>
+#include <common/defines.hpp>
+#include <common/unique_handle.hpp>
+#include <cudaDataType.hpp>
+#include <cusparseModule.hpp>
+#include <cusparse_v2.h>
+#include <err_cuda.hpp>
+
+#if defined(AF_USE_NEW_CUSPARSE_API)
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+cusparseStatus_t createSpMatDescr(
+    cusparseSpMatDescr_t *out, const arrayfire::common::SparseArray<T> &arr) {
+    auto &_ = arrayfire::cuda::getCusparsePlugin();
+    switch (arr.getStorage()) {
+        case AF_STORAGE_CSR: {
+            return _.cusparseCreateCsr(
+                out, arr.dims()[0], arr.dims()[1], arr.getNNZ(),
+                (void *)arr.getRowIdx().get(), (void *)arr.getColIdx().get(),
+                (void *)arr.getValues().get(), CUSPARSE_INDEX_32I,
+                CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, getType<T>());
+        }
+#if CUSPARSE_VERSION >= 11300
+        case AF_STORAGE_CSC: {
+            return _.cusparseCreateCsc(
+                out, arr.dims()[0], arr.dims()[1], arr.getNNZ(),
+                (void *)arr.getColIdx().get(), (void *)arr.getRowIdx().get(),
+                (void *)arr.getValues().get(), CUSPARSE_INDEX_32I,
+                CUSPARSE_INDEX_32I, CUSPARSE_INDEX_BASE_ZERO, getType<T>());
+        }
+#else
+        case AF_STORAGE_CSC:
+            CUDA_NOT_SUPPORTED(
+                "Sparse not supported for CSC on this version of the CUDA "
+                "Toolkit");
+#endif
+        case AF_STORAGE_COO: {
+            return _.cusparseCreateCoo(
+                out, arr.dims()[0], arr.dims()[1], arr.getNNZ(),
+                (void *)arr.getColIdx().get(), (void *)arr.getRowIdx().get(),
+                (void *)arr.getValues().get(), CUSPARSE_INDEX_32I,
+                CUSPARSE_INDEX_BASE_ZERO, getType<T>());
+        }
+    }
+    return CUSPARSE_STATUS_SUCCESS;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
+#endif
+
+// clang-format off
+DEFINE_HANDLER(cusparseHandle_t, arrayfire::cuda::getCusparsePlugin().cusparseCreate, arrayfire::cuda::getCusparsePlugin().cusparseDestroy);
+DEFINE_HANDLER(cusparseMatDescr_t, arrayfire::cuda::getCusparsePlugin().cusparseCreateMatDescr, arrayfire::cuda::getCusparsePlugin().cusparseDestroyMatDescr);
+#if defined(AF_USE_NEW_CUSPARSE_API)
+DEFINE_HANDLER(cusparseSpMatDescr_t, arrayfire::cuda::createSpMatDescr, arrayfire::cuda::getCusparsePlugin().cusparseDestroySpMat);
+DEFINE_HANDLER(cusparseDnVecDescr_t, arrayfire::cuda::getCusparsePlugin().cusparseCreateDnVec, arrayfire::cuda::getCusparsePlugin().cusparseDestroyDnVec);
+DEFINE_HANDLER(cusparseDnMatDescr_t, arrayfire::cuda::getCusparsePlugin().cusparseCreateDnMat, arrayfire::cuda::getCusparsePlugin().cusparseDestroyDnMat);
+#endif
+// clang-format on
+
+namespace arrayfire {
+namespace cuda {
+
+const char *errorString(cusparseStatus_t err);
+
+#define CUSPARSE_CHECK(fn)                                                    \
+    do {                                                                      \
+        cusparseStatus_t _error = fn;                                         \
+        if (_error != CUSPARSE_STATUS_SUCCESS) {                              \
+            char _err_msg[1024];                                              \
+            snprintf(_err_msg, sizeof(_err_msg), "CUSPARSE Error (%d): %s\n", \
+                     (int)(_error), arrayfire::cuda::errorString(_error));    \
+                                                                              \
+            AF_ERROR(_err_msg, AF_ERR_INTERNAL);                              \
+        }                                                                     \
+    } while (0)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusparseModule.cpp b/src/backend/cuda/cusparseModule.cpp
new file mode 100644
index 0000000000..a7dba5dc77
--- /dev/null
+++ b/src/backend/cuda/cusparseModule.cpp
@@ -0,0 +1,174 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cusparse.hpp>
+#include <cusparseModule.hpp>
+
+#include <common/Version.hpp>
+#include <common/err_common.hpp>
+#include <cusparse.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+
+#include <cuda.h>
+#include <string>
+
+using arrayfire::common::Version;
+
+namespace arrayfire {
+namespace cuda {
+
+common::Version getCusparseVersion(const LibHandle& handle) {
+    std::function<cusparseStatus_t(libraryPropertyType, int*)> fptr(
+        reinterpret_cast<cusparseStatus_t (*)(libraryPropertyType, int*)>(
+            common::getFunctionPointer(handle, "cusparseGetProperty")));
+
+    int major, minor, patch;
+    CUSPARSE_CHECK(fptr(MAJOR_VERSION, &major));
+    CUSPARSE_CHECK(fptr(MINOR_VERSION, &minor));
+    CUSPARSE_CHECK(fptr(PATCH_LEVEL, &patch));
+
+    Version out{major, minor, patch};
+    return out;
+}
+
+cusparseModule::cusparseModule()
+    :
+#ifdef AF_cusparse_STATIC_LINKING
+    module(nullptr, nullptr)
+#else
+    module({"cusparse"}, {"64_12", "64_11", "64_10", "64_9", "64_8"}, {""}, 0,
+           nullptr, getCusparseVersion)
+#endif
+{
+#ifdef AF_cusparse_STATIC_LINKING
+    AF_TRACE("cuSparse linked statically.");
+#undef MODULE_FUNCTION_INIT
+#define MODULE_FUNCTION_INIT(NAME) NAME = &::NAME
+#else
+    if (!module.isLoaded()) {
+        AF_TRACE(
+            "WARNING: Unable to load cuSparse: {}\n"
+            "cuSparse failed to load. Try installing cuSparse or check if\n"
+            "cuSparse is in the search path. On Linux, you can set the\n"
+            "LD_DEBUG=libs environment variable to debug loading issues.\n"
+            "Falling back to matmul based implementation",
+            module.getErrorMessage());
+
+        return;
+    }
+#endif
+
+    MODULE_FUNCTION_INIT(cusparseGetVersion);
+
+#if CUSPARSE_VERSION < 11300
+    MODULE_FUNCTION_INIT(cusparseCcsc2dense);
+    MODULE_FUNCTION_INIT(cusparseCcsr2dense);
+    MODULE_FUNCTION_INIT(cusparseCdense2csc);
+    MODULE_FUNCTION_INIT(cusparseCdense2csr);
+    MODULE_FUNCTION_INIT(cusparseCgthr);
+    MODULE_FUNCTION_INIT(cusparseDcsc2dense);
+    MODULE_FUNCTION_INIT(cusparseDcsr2dense);
+    MODULE_FUNCTION_INIT(cusparseDdense2csc);
+    MODULE_FUNCTION_INIT(cusparseDdense2csr);
+    MODULE_FUNCTION_INIT(cusparseDgthr);
+    MODULE_FUNCTION_INIT(cusparseScsc2dense);
+    MODULE_FUNCTION_INIT(cusparseScsr2dense);
+    MODULE_FUNCTION_INIT(cusparseSdense2csc);
+    MODULE_FUNCTION_INIT(cusparseSdense2csr);
+    MODULE_FUNCTION_INIT(cusparseSgthr);
+    MODULE_FUNCTION_INIT(cusparseZcsc2dense);
+    MODULE_FUNCTION_INIT(cusparseZcsr2dense);
+    MODULE_FUNCTION_INIT(cusparseZdense2csc);
+    MODULE_FUNCTION_INIT(cusparseZdense2csr);
+    MODULE_FUNCTION_INIT(cusparseZgthr);
+#else
+    MODULE_FUNCTION_INIT(cusparseCreateCsc);
+    MODULE_FUNCTION_INIT(cusparseSparseToDense_bufferSize);
+    MODULE_FUNCTION_INIT(cusparseSparseToDense);
+    MODULE_FUNCTION_INIT(cusparseDenseToSparse_bufferSize);
+    MODULE_FUNCTION_INIT(cusparseDenseToSparse_analysis);
+    MODULE_FUNCTION_INIT(cusparseDenseToSparse_convert);
+    MODULE_FUNCTION_INIT(cusparseSpMatGetSize);
+    MODULE_FUNCTION_INIT(cusparseCsrSetPointers);
+    MODULE_FUNCTION_INIT(cusparseCscSetPointers);
+    MODULE_FUNCTION_INIT(cusparseSetPointerMode);
+    MODULE_FUNCTION_INIT(cusparseXcsrsort_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseXcsrsort);
+#endif
+
+    MODULE_FUNCTION_INIT(cusparseCnnz);
+    MODULE_FUNCTION_INIT(cusparseCreateCsr);
+    MODULE_FUNCTION_INIT(cusparseCreateCoo);
+    MODULE_FUNCTION_INIT(cusparseCreateDnMat);
+    MODULE_FUNCTION_INIT(cusparseCreateDnVec);
+    MODULE_FUNCTION_INIT(cusparseCreateIdentityPermutation);
+    MODULE_FUNCTION_INIT(cusparseCreate);
+    MODULE_FUNCTION_INIT(cusparseCreateMatDescr);
+    MODULE_FUNCTION_INIT(cusparseDestroyDnMat);
+    MODULE_FUNCTION_INIT(cusparseDestroyDnVec);
+    MODULE_FUNCTION_INIT(cusparseDestroy);
+    MODULE_FUNCTION_INIT(cusparseDestroyMatDescr);
+    MODULE_FUNCTION_INIT(cusparseDestroySpMat);
+    MODULE_FUNCTION_INIT(cusparseDnnz);
+    MODULE_FUNCTION_INIT(cusparseSetMatIndexBase);
+    MODULE_FUNCTION_INIT(cusparseSetMatType);
+    MODULE_FUNCTION_INIT(cusparseSetStream);
+    MODULE_FUNCTION_INIT(cusparseSnnz);
+    MODULE_FUNCTION_INIT(cusparseSpMM_bufferSize);
+    MODULE_FUNCTION_INIT(cusparseSpMM);
+    MODULE_FUNCTION_INIT(cusparseSpMV_bufferSize);
+    MODULE_FUNCTION_INIT(cusparseSpMV);
+    MODULE_FUNCTION_INIT(cusparseXcoo2csr);
+    MODULE_FUNCTION_INIT(cusparseXcoosort_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseXcoosortByColumn);
+    MODULE_FUNCTION_INIT(cusparseXcoosortByRow);
+    MODULE_FUNCTION_INIT(cusparseXcsr2coo);
+#if CUSPARSE_VERSION < 11000
+    MODULE_FUNCTION_INIT(cusparseXcsrgeamNnz);
+    MODULE_FUNCTION_INIT(cusparseScsrgeam);
+    MODULE_FUNCTION_INIT(cusparseDcsrgeam);
+    MODULE_FUNCTION_INIT(cusparseCcsrgeam);
+    MODULE_FUNCTION_INIT(cusparseZcsrgeam);
+#else
+    MODULE_FUNCTION_INIT(cusparseXcsrgeam2Nnz);
+    MODULE_FUNCTION_INIT(cusparseScsrgeam2_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseScsrgeam2);
+    MODULE_FUNCTION_INIT(cusparseDcsrgeam2_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseDcsrgeam2);
+    MODULE_FUNCTION_INIT(cusparseCcsrgeam2_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseCcsrgeam2);
+    MODULE_FUNCTION_INIT(cusparseZcsrgeam2_bufferSizeExt);
+    MODULE_FUNCTION_INIT(cusparseZcsrgeam2);
+#endif
+    MODULE_FUNCTION_INIT(cusparseZnnz);
+
+#ifndef AF_cusparse_STATIC_LINKING
+    if (!module.symbolsLoaded()) {
+        std::string error_message =
+            "Error loading cuSparse symbols. ArrayFire was unable to load some "
+            "symbols from the cuSparse library. Please create an issue on the "
+            "ArrayFire repository with information about the installed "
+            "cuSparse and ArrayFire on your system.";
+        AF_ERROR(error_message, AF_ERR_LOAD_LIB);
+    }
+#endif
+}
+
+spdlog::logger* cusparseModule::getLogger() const noexcept {
+    return module.getLogger();
+}
+
+cusparseModule& getCusparsePlugin() noexcept {
+    static auto* plugin = new cusparseModule();
+    return *plugin;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusparseModule.hpp b/src/backend/cuda/cusparseModule.hpp
new file mode 100644
index 0000000000..fc3bb09b76
--- /dev/null
+++ b/src/backend/cuda/cusparseModule.hpp
@@ -0,0 +1,118 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/DependencyModule.hpp>
+#include <cuda.h>
+#include <cusparse_v2.h>
+
+namespace arrayfire {
+namespace cuda {
+class cusparseModule {
+    arrayfire::common::DependencyModule module;
+
+   public:
+    cusparseModule();
+    ~cusparseModule() = default;
+
+    MODULE_MEMBER(cusparseGetVersion);
+
+#if CUSPARSE_VERSION < 11300
+    MODULE_MEMBER(cusparseCcsc2dense);
+    MODULE_MEMBER(cusparseCcsr2dense);
+    MODULE_MEMBER(cusparseCdense2csc);
+    MODULE_MEMBER(cusparseCdense2csr);
+    MODULE_MEMBER(cusparseCgthr);
+    MODULE_MEMBER(cusparseDcsc2dense);
+    MODULE_MEMBER(cusparseDcsr2dense);
+    MODULE_MEMBER(cusparseDdense2csc);
+    MODULE_MEMBER(cusparseDdense2csr);
+    MODULE_MEMBER(cusparseDgthr);
+    MODULE_MEMBER(cusparseScsc2dense);
+    MODULE_MEMBER(cusparseScsr2dense);
+    MODULE_MEMBER(cusparseSdense2csc);
+    MODULE_MEMBER(cusparseSdense2csr);
+    MODULE_MEMBER(cusparseSgthr);
+    MODULE_MEMBER(cusparseZcsc2dense);
+    MODULE_MEMBER(cusparseZcsr2dense);
+    MODULE_MEMBER(cusparseZdense2csc);
+    MODULE_MEMBER(cusparseZdense2csr);
+    MODULE_MEMBER(cusparseZgthr);
+#else
+    MODULE_MEMBER(cusparseCreateCsc);
+    MODULE_MEMBER(cusparseSparseToDense);
+    MODULE_MEMBER(cusparseSparseToDense_bufferSize);
+    MODULE_MEMBER(cusparseDenseToSparse_bufferSize);
+    MODULE_MEMBER(cusparseDenseToSparse_analysis);
+    MODULE_MEMBER(cusparseDenseToSparse_convert);
+    MODULE_MEMBER(cusparseSpMatGetSize);
+    MODULE_MEMBER(cusparseCsrSetPointers);
+    MODULE_MEMBER(cusparseCscSetPointers);
+    MODULE_MEMBER(cusparseGather);
+    MODULE_MEMBER(cusparseSetPointerMode);
+    MODULE_MEMBER(cusparseXcsrsort_bufferSizeExt);
+    MODULE_MEMBER(cusparseXcsrsort);
+#endif
+
+    MODULE_MEMBER(cusparseCreateCoo);
+    MODULE_MEMBER(cusparseCreateCsr);
+    MODULE_MEMBER(cusparseDestroyDnMat);
+    MODULE_MEMBER(cusparseDestroyDnVec);
+    MODULE_MEMBER(cusparseDestroy);
+    MODULE_MEMBER(cusparseDestroyMatDescr);
+    MODULE_MEMBER(cusparseDestroySpMat);
+    MODULE_MEMBER(cusparseCnnz);
+    MODULE_MEMBER(cusparseCreateDnMat);
+    MODULE_MEMBER(cusparseCreateDnVec);
+    MODULE_MEMBER(cusparseCreateIdentityPermutation);
+    MODULE_MEMBER(cusparseCreate);
+    MODULE_MEMBER(cusparseCreateMatDescr);
+    MODULE_MEMBER(cusparseDnnz);
+    MODULE_MEMBER(cusparseSetMatIndexBase);
+    MODULE_MEMBER(cusparseSetMatType);
+    MODULE_MEMBER(cusparseSetStream);
+    MODULE_MEMBER(cusparseSnnz);
+    MODULE_MEMBER(cusparseSpMM_bufferSize);
+    MODULE_MEMBER(cusparseSpMM);
+    MODULE_MEMBER(cusparseSpMV_bufferSize);
+    MODULE_MEMBER(cusparseSpMV);
+    MODULE_MEMBER(cusparseXcoo2csr);
+    MODULE_MEMBER(cusparseXcoosort_bufferSizeExt);
+    MODULE_MEMBER(cusparseXcoosortByColumn);
+    MODULE_MEMBER(cusparseXcoosortByRow);
+    MODULE_MEMBER(cusparseXcsr2coo);
+
+#if CUSPARSE_VERSION < 11000
+    MODULE_MEMBER(cusparseCcsrgeam);
+    MODULE_MEMBER(cusparseDcsrgeam);
+    MODULE_MEMBER(cusparseScsrgeam);
+    MODULE_MEMBER(cusparseZcsrgeam);
+    MODULE_MEMBER(cusparseXcsrgeamNnz);
+#else
+    MODULE_MEMBER(cusparseCcsrgeam2_bufferSizeExt);
+    MODULE_MEMBER(cusparseCcsrgeam2);
+    MODULE_MEMBER(cusparseDcsrgeam2_bufferSizeExt);
+    MODULE_MEMBER(cusparseDcsrgeam2);
+    MODULE_MEMBER(cusparseScsrgeam2_bufferSizeExt);
+    MODULE_MEMBER(cusparseScsrgeam2);
+    MODULE_MEMBER(cusparseZcsrgeam2_bufferSizeExt);
+    MODULE_MEMBER(cusparseZcsrgeam2);
+    MODULE_MEMBER(cusparseXcsrgeam2Nnz);
+#endif
+
+    MODULE_MEMBER(cusparseZnnz);
+
+    spdlog::logger* getLogger() const noexcept;
+};
+
+cusparseModule& getCusparsePlugin() noexcept;
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/cusparse_descriptor_helpers.hpp b/src/backend/cuda/cusparse_descriptor_helpers.hpp
new file mode 100644
index 0000000000..340a049b11
--- /dev/null
+++ b/src/backend/cuda/cusparse_descriptor_helpers.hpp
@@ -0,0 +1,49 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#if defined(AF_USE_NEW_CUSPARSE_API)
+// CUDA Toolkit 10.0 or later
+
+#include <common/unique_handle.hpp>
+#include <cudaDataType.hpp>
+#include <cusparse.hpp>
+
+#include <utility>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+auto cusparseDescriptor(const common::SparseArray<T> &in) {
+    auto dims = in.dims();
+
+    return common::make_handle<cusparseSpMatDescr_t>(in);
+}
+
+template<typename T>
+auto denVecDescriptor(const Array<T> &in) {
+    return common::make_handle<cusparseDnVecDescr_t>(
+        in.elements(), (void *)(in.get()), getType<T>());
+}
+
+template<typename T>
+auto denMatDescriptor(const Array<T> &in) {
+    auto dims    = in.dims();
+    auto strides = in.strides();
+    return common::make_handle<cusparseDnMatDescr_t>(
+        dims[0], dims[1], strides[1], (void *)in.get(), getType<T>(),
+        CUSPARSE_ORDER_COL);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
+
+#endif
diff --git a/src/backend/cuda/debug_cuda.hpp b/src/backend/cuda/debug_cuda.hpp
index 807d0a2702..555944a5ed 100644
--- a/src/backend/cuda/debug_cuda.hpp
+++ b/src/backend/cuda/debug_cuda.hpp
@@ -8,19 +8,67 @@
  ********************************************************/
 
 #pragma once
+#include <common/Logger.hpp>
 #include <err_cuda.hpp>
+#include <platform.hpp>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel_logger {
+
+inline auto getLogger() {
+    static auto logger = common::loggerFactory("kernel");
+    return logger;
+}
+}  // namespace kernel_logger
+}  // namespace cuda
+}  // namespace arrayfire
+
+template<>
+struct fmt::formatter<dim3> : fmt::formatter<std::string> {
+    // parse is inherited from formatter<std::string>.
+    template<typename FormatContext>
+    auto format(dim3 c, FormatContext& ctx) {
+        std::string name = fmt::format("{} {} {}", c.x, c.y, c.z);
+        return formatter<std::string>::format(name, ctx);
+    }
+};
+
+#define CUDA_LAUNCH_SMEM(fn, blks, thrds, smem_size, ...)                   \
+    do {                                                                    \
+        {                                                                   \
+            using namespace arrayfire::cuda::kernel_logger;                 \
+            AF_TRACE(                                                       \
+                "Launching {}: Blocks: [{}] Threads: [{}] "                 \
+                "Shared Memory: {}",                                        \
+                #fn, blks, thrds, smem_size);                               \
+        }                                                                   \
+        fn<<<blks, thrds, smem_size, arrayfire::cuda::getActiveStream()>>>( \
+            __VA_ARGS__);                                                   \
+    } while (false)
+
+#define CUDA_LAUNCH(fn, blks, thrds, ...) \
+    CUDA_LAUNCH_SMEM(fn, blks, thrds, 0, __VA_ARGS__)
 
 // FIXME: Add a special flag for debug
 #ifndef NDEBUG
 
-#define POST_LAUNCH_CHECK() do {                \
-        CUDA_CHECK(cudaDeviceSynchronize());    \
-    } while(0)                                  \
+#define POST_LAUNCH_CHECK()                                                    \
+    do {                                                                       \
+        CUDA_CHECK(cudaStreamSynchronize(arrayfire::cuda::getActiveStream())); \
+    } while (0)
 
 #else
 
-#define POST_LAUNCH_CHECK() do {                \
-        CUDA_CHECK(cudaPeekAtLastError());      \
-    } while(0)                                  \
+#define POST_LAUNCH_CHECK()                                                 \
+    do {                                                                    \
+        if (arrayfire::cuda::synchronize_calls()) {                         \
+            CUDA_CHECK(                                                     \
+                cudaStreamSynchronize(arrayfire::cuda::getActiveStream())); \
+        } else {                                                            \
+            CUDA_CHECK(cudaPeekAtLastError());                              \
+        }                                                                   \
+    } while (0)
 
 #endif
diff --git a/src/backend/cuda/device_manager.cpp b/src/backend/cuda/device_manager.cpp
new file mode 100644
index 0000000000..ee7ce76980
--- /dev/null
+++ b/src/backend/cuda/device_manager.cpp
@@ -0,0 +1,780 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <device_manager.hpp>
+
+#if defined(OS_WIN)
+#include <windows.h>
+#endif
+
+#include <GraphicsResourceManager.hpp>
+#include <build_version.hpp>
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/defines.hpp>
+#include <common/graphics_common.hpp>
+#include <common/host_memory.hpp>
+#include <common/util.hpp>
+#include <cublas_v2.h>  // needed for af/cuda.h
+#include <driver.h>
+#include <err_cuda.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <spdlog/spdlog.h>
+#include <af/cuda.h>
+#include <af/version.h>
+// cuda_gl_interop.h does not include OpenGL headers for ARM
+// __gl_h_ should be defined by glad.h inclusion
+#include <cuda_gl_interop.h>
+#include <utility.hpp>
+
+#include <nvrtc.h>
+
+#include <algorithm>
+#include <array>
+#include <memory>
+#include <mutex>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <thread>
+#include <utility>
+
+using arrayfire::common::fromCudaVersion;
+using arrayfire::common::getEnvVar;
+using std::begin;
+using std::end;
+using std::find;
+using std::find_if;
+using std::make_pair;
+using std::pair;
+using std::string;
+using std::stringstream;
+
+namespace arrayfire {
+namespace cuda {
+
+struct cuNVRTCcompute {
+    /// The CUDA Toolkit version returned by cudaRuntimeGetVersion
+    int cudaVersion;
+    /// Maximum major compute flag supported by cudaVersion
+    int major;
+    /// Maximum minor compute flag supported by cudaVersion
+    int minor;
+    /// Maximum minor compute flag supported on embedded (Jetson) platforms
+    int embedded_minor;
+};
+
+/// Represents a CUDA Toolkit version and its associated minimum required
+/// driver versions.
+struct ToolkitDriverVersions {
+    /// The CUDA Toolkit version returned by cudaDriverGetVersion or
+    /// cudaRuntimeGetVersion
+    int version;
+
+    /// The minimum GPU driver version required for the \p version toolkit on
+    /// Linux or macOS
+    float unix_min_version;
+
+    /// The minimum GPU driver version required for the \p version toolkit on
+    /// Windows
+    float windows_min_version;
+};
+
+// clang-format off
+static const int jetsonComputeCapabilities[] = {
+    8070,
+    7020,
+    6020,
+    5030,
+    3020,
+};
+// clang-format on
+
+// clang-format off
+static const cuNVRTCcompute Toolkit2MaxCompute[] = {
+    {12090, 9, 0, 0},
+    {12080, 9, 0, 0},
+    {12070, 9, 0, 0},
+    {12060, 9, 0, 0},
+    {12050, 9, 0, 0},
+    {12040, 9, 0, 0},
+    {12030, 9, 0, 0},
+    {12020, 9, 0, 0},
+    {12010, 9, 0, 0},
+    {12000, 9, 0, 0},
+    {11080, 9, 0, 0},
+    {11070, 8, 7, 0},
+    {11060, 8, 6, 0},
+    {11050, 8, 6, 0},
+    {11040, 8, 6, 0},
+    {11030, 8, 6, 0},
+    {11020, 8, 6, 0},
+    {11010, 8, 6, 0},
+    {11000, 8, 0, 0},
+    {10020, 7, 5, 2},
+    {10010, 7, 5, 2},
+    {10000, 7, 0, 2},
+    { 9020, 7, 0, 2},
+    { 9010, 7, 0, 2},
+    { 9000, 7, 0, 2},
+    { 8000, 5, 2, 3},
+    { 7050, 5, 2, 3},
+    { 7000, 5, 2, 3}};
+// clang-format on
+
+// A pair of a compute capability and the number of cores in each streaming
+// multiprocessor (SM) for that architecture
+struct ComputeCapabilityToStreamingProcessors {
+    // The compute capability in hex
+    // 0xMm (hex), M = major version, m = minor version
+    int compute_capability;
+    // Number of CUDA cores per SM
+    int cores_per_sm;
+};
+
+/// Map giving the minimum device driver needed in order to run a given version
+/// of CUDA for both Linux/Mac and Windows from:
+/// https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
+// clang-format off
+static const ToolkitDriverVersions
+    CudaToDriverVersion[] = {
+        {12090, 525.60f, 528.33f},
+        {12080, 525.60f, 528.33f},
+        {12070, 525.60f, 528.33f},
+        {12060, 525.60f, 528.33f},
+        {12050, 525.60f, 528.33f},
+        {12040, 525.60f, 528.33f},
+        {12030, 525.60f, 528.33f},
+        {12020, 525.60f, 528.33f},
+        {12010, 525.60f, 528.33f},
+        {12000, 525.60f, 528.33f},
+        {11080, 450.80f, 452.39f},
+        {11070, 450.80f, 452.39f},
+        {11060, 450.80f, 452.39f},
+        {11050, 450.80f, 452.39f},
+        {11040, 450.80f, 452.39f},
+        {11030, 450.80f, 452.39f},
+        {11020, 450.80f, 452.39f},
+        {11010, 450.80f, 452.39f},
+        {11000, 450.36f, 451.22f},
+        {10020, 440.33f, 441.22f},
+        {10010, 418.39f, 418.96f},
+        {10000, 410.48f, 411.31f},
+        {9020,  396.37f, 398.26f},
+        {9010,  390.46f, 391.29f},
+        {9000,  384.81f, 385.54f},
+        {8000,  375.26f, 376.51f},
+        {7050,  352.31f, 353.66f},
+        {7000,  346.46f, 347.62f}};
+// clang-format on
+
+// Vector of minimum supported compute versions for CUDA toolkit (i+1).*
+// where i is the index of the vector
+static const std::array<int, 12> minSV{{1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 3, 5}};
+
+static ComputeCapabilityToStreamingProcessors gpus[] = {
+    {0x10, 8},   {0x11, 8},   {0x12, 8},   {0x13, 8},   {0x20, 32},
+    {0x21, 48},  {0x30, 192}, {0x32, 192}, {0x35, 192}, {0x37, 192},
+    {0x50, 128}, {0x52, 128}, {0x53, 128}, {0x60, 64},  {0x61, 128},
+    {0x62, 128}, {0x70, 64},  {0x75, 64},  {0x80, 64},  {0x86, 128},
+    {0x87, 128}, {0x89, 128}, {0x90, 128}, {-1, -1},
+};
+
+// pulled from CUTIL from CUDA SDK
+static inline int compute2cores(unsigned major, unsigned minor) {
+    for (int i = 0; gpus[i].compute_capability != -1; ++i) {
+        if (static_cast<unsigned>(gpus[i].compute_capability) ==
+            (major << 4U) + minor) {
+            return gpus[i].cores_per_sm;
+        }
+    }
+    return 0;
+}
+
+static inline int getMinSupportedCompute(int cudaMajorVer) {
+    int CVSize = static_cast<int>(minSV.size());
+    return (cudaMajorVer > CVSize ? minSV[CVSize - 1]
+                                  : minSV[cudaMajorVer - 1]);
+}
+
+bool isEmbedded(pair<int, int> compute) {
+    int version = compute.first * 1000 + compute.second * 10;
+    return end(jetsonComputeCapabilities) !=
+           find(begin(jetsonComputeCapabilities),
+                end(jetsonComputeCapabilities), version);
+}
+
+bool checkDeviceWithRuntime(int runtime, pair<int, int> compute) {
+    auto rt = find_if(
+        begin(Toolkit2MaxCompute), end(Toolkit2MaxCompute),
+        [runtime](cuNVRTCcompute c) { return c.cudaVersion == runtime; });
+    if (rt == end(Toolkit2MaxCompute)) {
+        spdlog::get("platform")
+            ->warn(
+                "CUDA runtime version({}) not recognized. Please "
+                "create an issue or a pull request on the ArrayFire repository "
+                "to update the Toolkit2MaxCompute array with this version of "
+                "the CUDA Runtime. Continuing.",
+                fromCudaVersion(runtime));
+        return true;
+    }
+
+    if (rt->major >= compute.first) {
+        if (rt->major == compute.first) {
+            return rt->minor >= compute.second;
+        } else {
+            return true;
+        }
+    } else {
+        return false;
+    }
+}
+
+/// Check for compatible compute version based on runtime cuda toolkit version
+void checkAndSetDevMaxCompute(pair<int, int> &computeCapability) {
+    auto originalCompute = computeCapability;
+    int rtCudaVer        = 0;
+    CUDA_CHECK(cudaRuntimeGetVersion(&rtCudaVer));
+    auto tkitMaxCompute = find_if(
+        begin(Toolkit2MaxCompute), end(Toolkit2MaxCompute),
+        [rtCudaVer](cuNVRTCcompute v) { return rtCudaVer == v.cudaVersion; });
+
+    bool embeddedDevice = isEmbedded(computeCapability);
+
+    // If runtime cuda version is found in toolkit array
+    // check for max possible compute for that cuda version
+    if (tkitMaxCompute != end(Toolkit2MaxCompute) &&
+        computeCapability.first >= tkitMaxCompute->major) {
+        int minorVersion = embeddedDevice ? tkitMaxCompute->embedded_minor
+                                          : tkitMaxCompute->minor;
+
+        if (computeCapability.second > minorVersion) {
+            computeCapability = make_pair(tkitMaxCompute->major, minorVersion);
+            spdlog::get("platform")
+                ->warn(
+                    "The compute capability for the current device({}.{}) "
+                    "exceeds maximum supported by ArrayFire's CUDA "
+                    "runtime({}.{}). Download or rebuild the latest version of "
+                    "ArrayFire to avoid this warning. Using {}.{} for JIT "
+                    "compilation kernels.",
+                    originalCompute.first, originalCompute.second,
+                    computeCapability.first, computeCapability.second,
+                    computeCapability.first, computeCapability.second);
+        }
+    } else if (computeCapability.first >= Toolkit2MaxCompute[0].major) {
+        // If runtime cuda version is NOT found in toolkit array
+        // use the top most toolkit max compute
+        int minorVersion = embeddedDevice ? Toolkit2MaxCompute[0].embedded_minor
+                                          : Toolkit2MaxCompute[0].minor;
+        if (computeCapability.second > minorVersion) {
+            computeCapability =
+                make_pair(Toolkit2MaxCompute[0].major, minorVersion);
+            spdlog::get("platform")
+                ->warn(
+                    "CUDA runtime version({}) not recognized. Targeting "
+                    "compute {}.{} for this device which is the latest compute "
+                    "capability supported by ArrayFire's CUDA runtime({}.{}). "
+                    "Please create an issue or a pull request on the ArrayFire "
+                    "repository to update the Toolkit2MaxCompute array with "
+                    "this version of the CUDA Runtime.",
+                    fromCudaVersion(rtCudaVer), originalCompute.first,
+                    originalCompute.second, computeCapability.first,
+                    computeCapability.second, computeCapability.first,
+                    computeCapability.second);
+        }
+    } else if (computeCapability.first < 3) {
+        // all compute versions prior to Kepler, we don't support
+        // don't change the computeCapability.
+        spdlog::get("platform")
+            ->warn(
+                "The compute capability of the current device({}.{}) is "
+                "lower than the minimum compute version ArrayFire "
+                "supports.",
+                originalCompute.first, originalCompute.second);
+    }
+}
+
+pair<int, int> getComputeCapability(const int device) {
+    return DeviceManager::getInstance().devJitComputes[device];
+}
+
+// Return true if greater, false if lesser.
+// if equal, it continues to next comparison
+#define COMPARE(a, b, f)                   \
+    do {                                   \
+        if ((a)->f > (b)->f) return true;  \
+        if ((a)->f < (b)->f) return false; \
+        break;                             \
+    } while (0)
+
+static inline bool card_compare_compute(const cudaDevice_t &l,
+                                        const cudaDevice_t &r) {
+    const cudaDevice_t *lc = &l;
+    const cudaDevice_t *rc = &r;
+
+    COMPARE(lc, rc, prop.major);
+    COMPARE(lc, rc, prop.minor);
+    COMPARE(lc, rc, flops);
+    COMPARE(lc, rc, prop.totalGlobalMem);
+    COMPARE(lc, rc, nativeId);
+    return false;
+}
+
+static inline bool card_compare_flops(const cudaDevice_t &l,
+                                      const cudaDevice_t &r) {
+    const cudaDevice_t *lc = &l;
+    const cudaDevice_t *rc = &r;
+
+    COMPARE(lc, rc, flops);
+    COMPARE(lc, rc, prop.totalGlobalMem);
+    COMPARE(lc, rc, prop.major);
+    COMPARE(lc, rc, prop.minor);
+    COMPARE(lc, rc, nativeId);
+    return false;
+}
+
+static inline bool card_compare_mem(const cudaDevice_t &l,
+                                    const cudaDevice_t &r) {
+    const cudaDevice_t *lc = &l;
+    const cudaDevice_t *rc = &r;
+
+    COMPARE(lc, rc, prop.totalGlobalMem);
+    COMPARE(lc, rc, flops);
+    COMPARE(lc, rc, prop.major);
+    COMPARE(lc, rc, prop.minor);
+    COMPARE(lc, rc, nativeId);
+    return false;
+}
+
+static inline bool card_compare_num(const cudaDevice_t &l,
+                                    const cudaDevice_t &r) {
+    const cudaDevice_t *lc = &l;
+    const cudaDevice_t *rc = &r;
+
+    COMPARE(lc, rc, nativeId);
+    return false;
+}
+
+bool DeviceManager::checkGraphicsInteropCapability() {
+    static std::once_flag checkInteropFlag;
+    static bool capable = true;
+
+    std::call_once(checkInteropFlag, []() {
+        unsigned int pCudaEnabledDeviceCount = 0;
+        int pCudaGraphicsEnabledDeviceIds    = 0;
+        cudaGetLastError();  // Reset Errors
+        cudaError_t err = cudaGLGetDevices(
+            &pCudaEnabledDeviceCount, &pCudaGraphicsEnabledDeviceIds,
+            getDeviceCount(), cudaGLDeviceListAll);
+        if (err == cudaErrorOperatingSystem) {
+            // OS Support Failure - Happens when devices are in TCC mode or
+            // do not have a display connected
+            capable = false;
+        }
+        cudaGetLastError();  // Reset Errors
+    });
+
+    return capable;
+}
+
+DeviceManager &DeviceManager::getInstance() {
+    static auto *my_instance = new DeviceManager();
+    return *my_instance;
+}
+
+void DeviceManager::setMemoryManager(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a memory manager and the default memory
+    // manager still hasn't been initialized, so initialize it anyways so we
+    // don't inadvertently reset to it when we first call memoryManager()
+    memoryManager();
+    // Calls shutdown() on the existing memory manager.
+    if (memManager) { memManager->shutdownAllocator(); }
+    memManager = std::move(newMgr);
+    // Set the backend memory manager for this new manager to register native
+    // functions correctly.
+    std::unique_ptr<cuda::Allocator> deviceMemoryManager(new Allocator());
+    memManager->setAllocator(std::move(deviceMemoryManager));
+    memManager->initialize();
+}
+
+void DeviceManager::resetMemoryManager() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_CUDA_MEM_DEBUG));
+    setMemoryManager(std::move(mgr));
+}
+
+void DeviceManager::setMemoryManagerPinned(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a pinned memory manager and the default
+    // memory manager still hasn't been initialized, so initialize it anyways so
+    // we don't inadvertently reset to it when we first call
+    // pinnedMemoryManager()
+    pinnedMemoryManager();
+    // Calls shutdown() on the existing memory manager.
+    if (pinnedMemManager) { pinnedMemManager->shutdownAllocator(); }
+    pinnedMemManager = std::move(newMgr);
+    // Set the backend memory manager for this new manager to register native
+    // functions correctly.
+    std::unique_ptr<cuda::AllocatorPinned> deviceMemoryManager(
+        new AllocatorPinned());
+    pinnedMemManager->setAllocator(std::move(deviceMemoryManager));
+    pinnedMemManager->initialize();
+}
+
+void DeviceManager::resetMemoryManagerPinned() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_CUDA_MEM_DEBUG));
+    setMemoryManagerPinned(std::move(mgr));
+}
+
+/// A debug only function that checks to see if the driver or runtime
+/// version is part of the CudaToDriverVersion array. If the runtime
+/// version is not part of the array then an error is thrown in debug
+/// mode. If the driver version is not part of the array, then a message
+/// is displayed in the error stream.
+///
+/// \param[in] runtime_version  The version integer returned by
+///                             cudaRuntimeGetVersion
+/// \param[in] driver_version   The version integer returned by
+///                             cudaDriverGetVersion
+/// \note: only works in debug builds
+void debugRuntimeCheck(spdlog::logger *logger, int runtime_version,
+                       int driver_version) {
+    auto runtime_it =
+        find_if(begin(CudaToDriverVersion), end(CudaToDriverVersion),
+                [runtime_version](ToolkitDriverVersions ver) {
+                    return runtime_version == ver.version;
+                });
+    auto driver_it =
+        find_if(begin(CudaToDriverVersion), end(CudaToDriverVersion),
+                [driver_version](ToolkitDriverVersions ver) {
+                    return driver_version == ver.version;
+                });
+
+    auto getLogger = [&logger]() -> spdlog::logger * { return logger; };
+
+    // If the runtime version is not part of the CudaToDriverVersion array,
+    // display a message in the trace. Do not throw an error unless this is
+    // a debug build
+    if (runtime_it == end(CudaToDriverVersion)) {
+        constexpr size_t buf_size = 256;
+        char buf[buf_size];
+        const char *err_msg =
+            "CUDA runtime version({}) not recognized. Please create an issue "
+            "or a pull request on the ArrayFire repository to update the "
+            "CudaToDriverVersion variable with this version of the CUDA "
+            "runtime.\n";
+        fmt::format_to_n(buf, buf_size, err_msg,
+                         fromCudaVersion(runtime_version));
+        AF_TRACE("{}", buf);
+#ifndef NDEBUG
+        AF_ERROR(buf, AF_ERR_RUNTIME);
+#endif
+    }
+
+    if (driver_it == end(CudaToDriverVersion)) {
+        AF_TRACE(
+            "CUDA driver version({}) not part of the CudaToDriverVersion "
+            "array. Please create an issue or a pull request on the ArrayFire "
+            "repository to update the CudaToDriverVersion variable with this "
+            "version of the CUDA driver.\n",
+            fromCudaVersion(driver_version));
+    }
+}
+
+// Check if the device driver version is recent enough to run the cuda libs
+// linked with afcuda:
+void DeviceManager::checkCudaVsDriverVersion() {
+    const std::string driverVersionString = getDriverVersion();
+
+    int driver  = 0;
+    int runtime = 0;
+    CUDA_CHECK(cudaDriverGetVersion(&driver));
+    CUDA_CHECK(cudaRuntimeGetVersion(&runtime));
+
+    AF_TRACE("CUDA Driver supports up to CUDA {} ArrayFire CUDA Runtime {}",
+             fromCudaVersion(driver), fromCudaVersion(runtime));
+
+    debugRuntimeCheck(getLogger(), runtime, driver);
+
+    int runtime_major = runtime / 1000;
+    int driver_major = driver / 1000;
+
+    if (runtime_major > driver_major) {
+        string msg =
+            "ArrayFire was built with CUDA {} which requires GPU driver "
+            "version {} or later. Please download and install the latest "
+            "drivers from https://www.nvidia.com/drivers for your GPU. "
+            "Alternatively, you could rebuild ArrayFire with CUDA Toolkit "
+            "version {} to use the current drivers.";
+
+        auto runtime_it =
+            find_if(begin(CudaToDriverVersion), end(CudaToDriverVersion),
+                    [runtime](ToolkitDriverVersions ver) {
+                        return runtime == ver.version;
+                    });
+
+        constexpr size_t buf_size = 1024;
+        // If the runtime version is not part of the CudaToDriverVersion
+        // array, display a message in the trace and return early because
+        // the minimum driver version for this runtime is unknown
+        if (runtime_it == end(CudaToDriverVersion)) {
+            char buf[buf_size];
+            char err_msg[] =
+                "CUDA runtime version(%s) not recognized. Please create an "
+                "issue or a pull request on the ArrayFire repository to "
+                "update the CudaToDriverVersion variable with this "
+                "version of the CUDA Toolkit.";
+            snprintf(buf, buf_size, err_msg,
+                     fmt::format("{}", fromCudaVersion(runtime)).c_str());
+            AF_TRACE("{}", buf);
+            return;
+        }
+
+        float minimumDriverVersion =
+#ifdef OS_WIN
+            runtime_it->windows_min_version;
+#else
+            runtime_it->unix_min_version;
+#endif
+
+        char buf[buf_size];
+        fmt::format_to_n(buf, buf_size, msg, fromCudaVersion(runtime),
+                         minimumDriverVersion, fromCudaVersion(driver));
+
+        AF_ERROR(buf, AF_ERR_DRIVER);
+    }
+}
+
+/// This function initializes and deletes a nvrtcProgram object. There seems to
+/// be a bug in nvrtc which fails if this is first done on a child thread. We
+/// are assuming that the initialization is done in the main thread.
+void initNvrtc() {
+    nvrtcProgram prog;
+    nvrtcCreateProgram(&prog, " ", "dummy", 0, nullptr, nullptr);
+    nvrtcDestroyProgram(&prog);
+}
+
+DeviceManager::DeviceManager()
+    : logger(common::loggerFactory("platform"))
+    , cuDevices(0)
+    , nDevices(0)
+    , fgMngr(new arrayfire::common::ForgeManager()) {
+    try {
+        checkCudaVsDriverVersion();
+
+        CUDA_CHECK(cudaGetDeviceCount(&nDevices));
+        AF_TRACE("Found {} CUDA devices", nDevices);
+        if (nDevices == 0) {
+            AF_ERROR("No CUDA capable devices found", AF_ERR_DRIVER);
+            return;
+        }
+        cuDevices.reserve(nDevices);
+
+        int cudaRtVer = 0;
+        CUDA_CHECK(cudaRuntimeGetVersion(&cudaRtVer));
+        int cudaMajorVer = cudaRtVer / 1000;
+
+        for (int i = 0; i < nDevices; i++) {
+            cudaDevice_t dev{};
+            CUDA_CHECK(cudaGetDeviceProperties(&dev.prop, i));
+            if (dev.prop.major < getMinSupportedCompute(cudaMajorVer)) {
+                AF_TRACE("Unsupported device: {}", dev.prop.name);
+                continue;
+            } else {
+                dev.flops = static_cast<size_t>(dev.prop.multiProcessorCount) *
+                            compute2cores(dev.prop.major, dev.prop.minor) *
+                            dev.prop.clockRate;
+                dev.nativeId = i;
+                AF_TRACE(
+                    "Found device: {} (sm_{}{}) ({:0.3} GB | ~{} GFLOPs | {} "
+                    "SMs)",
+                    dev.prop.name, dev.prop.major, dev.prop.minor,
+                    dev.prop.totalGlobalMem / 1024. / 1024. / 1024.,
+                    dev.flops / 1024. / 1024. * 2,
+                    dev.prop.multiProcessorCount);
+                cuDevices.push_back(dev);
+            }
+        }
+    } catch (const AfError &err) {
+        // If one of the CUDA functions threw an exception, catch it and wrap it
+        // into a more informative ArrayFire exception.
+        if (err.getError() == AF_ERR_INTERNAL) {
+            AF_ERROR(
+                "Error initializing CUDA runtime. Check your CUDA device is "
+                "visible to the OS and you have installed the correct driver. "
+                "Try running the nvidia-smi utility to debug any driver "
+                "issues.",
+                AF_ERR_RUNTIME);
+        } else {
+            throw;
+        }
+    }
+    nDevices = cuDevices.size();
+
+    sortDevices();
+
+    // Set all default peer access to false
+    for (auto &dev_map : device_peer_access_map)
+        for (auto &dev_access : dev_map) { dev_access = false; }
+
+    // Enable peer 2 peer access to device memory if available
+    for (int i = 0; i < nDevices; i++) {
+        for (int j = 0; j < nDevices; j++) {
+            if (i != j) {
+                int can_access_peer;
+                CUDA_CHECK(cudaDeviceCanAccessPeer(&can_access_peer, i, j));
+                if (can_access_peer) {
+                    CUDA_CHECK(cudaSetDevice(i));
+                    AF_TRACE("Peer access enabled for {}({}) and {}({})", i,
+                             cuDevices[i].prop.name, j, cuDevices[j].prop.name);
+                    CUDA_CHECK(cudaDeviceEnablePeerAccess(j, 0));
+                    device_peer_access_map[i][j] = true;
+                }
+            } else {
+                device_peer_access_map[i][j] = true;
+            }
+        }
+    }
+
+    // Initialize all streams to 0.
+    // Streams will be created in setActiveDevice()
+    for (int i = 0; i < MAX_DEVICES; i++) {
+        streams[i] = static_cast<cudaStream_t>(0);
+        if (i < nDevices) {
+            auto prop =
+                make_pair(cuDevices[i].prop.major, cuDevices[i].prop.minor);
+            checkAndSetDevMaxCompute(prop);
+            devJitComputes.emplace_back(prop);
+        }
+    }
+
+    std::string deviceENV = getEnvVar("AF_CUDA_DEFAULT_DEVICE");
+    AF_TRACE("AF_CUDA_DEFAULT_DEVICE: {}", deviceENV);
+    if (deviceENV.empty()) {
+        setActiveDevice(0, cuDevices[0].nativeId);
+    } else {
+        stringstream s(deviceENV);
+        int def_device = -1;
+        s >> def_device;
+        if (def_device < 0 || def_device >= nDevices) {
+            getLogger()->warn(
+                "AF_CUDA_DEFAULT_DEVICE({}) out of range. Setting default "
+                "device to 0.",
+                def_device);
+            setActiveDevice(0, cuDevices[0].nativeId);
+        } else {
+            setActiveDevice(def_device, cuDevices[def_device].nativeId);
+        }
+    }
+    initNvrtc();
+    AF_TRACE("Default device: {}({})", getActiveDeviceId(),
+             cuDevices[getActiveDeviceId()].prop.name);
+}
+
+spdlog::logger *DeviceManager::getLogger() { return logger.get(); }
+
+void DeviceManager::sortDevices(sort_mode mode) {
+    switch (mode) {
+        case memory:
+            std::stable_sort(cuDevices.begin(), cuDevices.end(),
+                             card_compare_mem);
+            break;
+        case flops:
+            std::stable_sort(cuDevices.begin(), cuDevices.end(),
+                             card_compare_flops);
+            break;
+        case compute:
+            std::stable_sort(cuDevices.begin(), cuDevices.end(),
+                             card_compare_compute);
+            break;
+        case none:
+        default:
+            std::stable_sort(cuDevices.begin(), cuDevices.end(),
+                             card_compare_num);
+            break;
+    }
+}
+
+int DeviceManager::setActiveDevice(int device, int nId) {
+    thread_local bool retryFlag = true;
+
+    int numDevices = cuDevices.size();
+
+    if (device >= numDevices) { return -1; }
+
+    int old = getActiveDeviceId();
+
+    if (nId == -1) { nId = getDeviceNativeId(device); }
+
+    cudaError_t err = cudaSetDevice(nId);
+
+    if (err == cudaSuccess) {
+        tlocalActiveDeviceId() = device;
+        return old;
+    }
+
+    // For the first time a thread calls setDevice,
+    // if the requested device is unavailable, try checking
+    // for other available devices - while loop below
+    if (!retryFlag) {
+        CUDA_CHECK(err);
+        return old;
+    }
+
+    // Comes only when retryFlag is true. Set it to false
+    retryFlag = false;
+
+    while (true) {
+        // Check for errors other than DevicesUnavailable
+        // If success, return. Else throw error
+        // If DevicesUnavailable, try other devices (while loop below)
+        if (err != cudaErrorDeviceAlreadyInUse) {
+            CUDA_CHECK(err);
+            tlocalActiveDeviceId() = device;
+            return old;
+        }
+        cudaGetLastError();  // Reset error stack
+#ifndef NDEBUG
+        getLogger()->warn(
+            "Warning: Device {} is unavailable. Using next available "
+            "device \n",
+            device);
+#endif
+        // Comes here if the device is in exclusive mode or
+        // otherwise fails streamCreate with this error.
+        // All other errors will error out
+        device++;
+        if (device >= numDevices) { break; }
+
+        // Can't call getNativeId here as it will cause an infinite loop with
+        // the constructor
+        nId = cuDevices[device].nativeId;
+
+        err = cudaSetDevice(nId);
+    }
+
+    // If all devices fail with DeviceAlreadyInUse, then throw this error
+    CUDA_CHECK(err);
+
+    return old;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/device_manager.hpp b/src/backend/cuda/device_manager.hpp
new file mode 100644
index 0000000000..ca43efaf1f
--- /dev/null
+++ b/src/backend/cuda/device_manager.hpp
@@ -0,0 +1,147 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <platform.hpp>
+
+#include <array>
+#include <memory>
+#include <mutex>
+#include <string>
+#include <utility>
+#include <vector>
+
+using arrayfire::common::MemoryManagerBase;
+
+#ifndef AF_CUDA_MEM_DEBUG
+#define AF_CUDA_MEM_DEBUG 0
+#endif
+
+namespace arrayfire {
+namespace cuda {
+
+struct cudaDevice_t {
+    cudaDeviceProp prop;
+    size_t flops;
+    int nativeId;
+};
+
+int& tlocalActiveDeviceId();
+
+bool checkDeviceWithRuntime(int runtime, std::pair<int, int> compute);
+
+class DeviceManager {
+   public:
+    static const int MAX_DEVICES = 16;
+
+    static bool checkGraphicsInteropCapability();
+
+    static DeviceManager& getInstance();
+    ~DeviceManager();
+
+    spdlog::logger* getLogger();
+
+    friend MemoryManagerBase& memoryManager();
+
+    friend void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManager();
+
+    void resetMemoryManager();
+
+    friend MemoryManagerBase& pinnedMemoryManager();
+
+    friend void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManagerPinned();
+
+    void resetMemoryManagerPinned();
+
+    friend arrayfire::common::ForgeManager& forgeManager();
+
+    friend GraphicsResourceManager& interopManager();
+
+    friend std::string getDeviceInfo(int device) noexcept;
+
+    friend std::string getPlatformInfo() noexcept;
+
+    friend std::string getDriverVersion() noexcept;
+
+    friend std::string getCUDARuntimeVersion() noexcept;
+
+    friend std::string getDeviceInfo() noexcept;
+
+    friend int getDeviceCount();
+
+    friend int getDeviceNativeId(int device);
+
+    friend int getDeviceIdFromNativeId(int nativeId);
+
+    friend cudaStream_t getStream(int device);
+
+    friend int setDevice(int device);
+
+    friend const cudaDeviceProp& getDeviceProp(int device);
+
+    friend std::pair<int, int> getComputeCapability(const int device);
+
+    friend bool isDeviceBufferAccessible(int buf_device_id, int execution_id);
+
+   private:
+    DeviceManager();
+
+    // Following two declarations are required to
+    // avoid accidental copy/assignment
+    // of instance returned by getInstance to other
+    // variables
+    DeviceManager(DeviceManager const&);
+    void operator=(DeviceManager const&);
+
+    // Attributes
+    enum sort_mode { flops = 0, memory = 1, compute = 2, none = 3 };
+
+    // Checks if the Graphics driver is capable of running the CUDA toolkit
+    // version that ArrayFire was compiled against
+    void checkCudaVsDriverVersion();
+    void sortDevices(sort_mode mode = flops);
+
+    int setActiveDevice(int device, int nId = -1);
+
+    std::shared_ptr<spdlog::logger> logger;
+
+    /// A matrix of booleans where true indicates that the corresponding
+    /// coordinate devices can access each other's buffers. False indicates
+    /// buffers need to be copied over to the other device
+    std::array<std::array<bool, MAX_DEVICES>, MAX_DEVICES>
+        device_peer_access_map;
+
+    std::vector<cudaDevice_t> cuDevices;
+    std::vector<std::pair<int, int>> devJitComputes;
+
+    int nDevices;
+    cudaStream_t streams[MAX_DEVICES]{};
+
+    std::unique_ptr<arrayfire::common::ForgeManager> fgMngr;
+
+    std::unique_ptr<MemoryManagerBase> memManager;
+
+    std::unique_ptr<MemoryManagerBase> pinnedMemManager;
+
+    std::unique_ptr<GraphicsResourceManager> gfxManagers[MAX_DEVICES];
+
+    std::mutex mutex;
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/diagonal.cpp b/src/backend/cuda/diagonal.cpp
new file mode 100644
index 0000000000..b5dd2b5c0b
--- /dev/null
+++ b/src/backend/cuda/diagonal.cpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <diagonal.hpp>
+#include <err_cuda.hpp>
+#include <kernel/diagonal.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num) {
+    int size     = in.dims()[0] + std::abs(num);
+    int batch    = in.dims()[1];
+    Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
+
+    kernel::diagCreate<T>(out, in, num);
+
+    return out;
+}
+
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num) {
+    const dim_t *idims = in.dims().get();
+    dim_t size         = std::min(idims[0], idims[1]) - std::abs(num);
+    Array<T> out       = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
+
+    kernel::diagExtract<T>(out, in, num);
+
+    return out;
+}
+
+#define INSTANTIATE_DIAGONAL(T)                                          \
+    template Array<T> diagExtract<T>(const Array<T> &in, const int num); \
+    template Array<T> diagCreate<T>(const Array<T> &in, const int num);
+
+INSTANTIATE_DIAGONAL(float)
+INSTANTIATE_DIAGONAL(double)
+INSTANTIATE_DIAGONAL(cfloat)
+INSTANTIATE_DIAGONAL(cdouble)
+INSTANTIATE_DIAGONAL(int)
+INSTANTIATE_DIAGONAL(uint)
+INSTANTIATE_DIAGONAL(intl)
+INSTANTIATE_DIAGONAL(uintl)
+INSTANTIATE_DIAGONAL(char)
+INSTANTIATE_DIAGONAL(schar)
+INSTANTIATE_DIAGONAL(uchar)
+INSTANTIATE_DIAGONAL(short)
+INSTANTIATE_DIAGONAL(ushort)
+INSTANTIATE_DIAGONAL(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/diagonal.cu b/src/backend/cuda/diagonal.cu
deleted file mode 100644
index 05b8025de5..0000000000
--- a/src/backend/cuda/diagonal.cu
+++ /dev/null
@@ -1,60 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <Array.hpp>
-#include <diagonal.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
-#include <kernel/diagonal.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num)
-    {
-        int size = in.dims()[0] + std::abs(num);
-        int batch = in.dims()[1];
-        Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
-
-        kernel::diagCreate<T>(out, in, num);
-
-        return out;
-    }
-
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num)
-    {
-        const dim_t *idims = in.dims().get();
-        dim_t size = std::max(idims[0], idims[1]) - std::abs(num);
-        Array<T> out = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
-
-        kernel::diagExtract<T>(out, in, num);
-
-        return out;
-    }
-
-#define INSTANTIATE_DIAGONAL(T)                                          \
-    template Array<T>  diagExtract<T>    (const Array<T> &in, const int num); \
-    template Array<T>  diagCreate <T>    (const Array<T> &in, const int num);
-
-    INSTANTIATE_DIAGONAL(float)
-    INSTANTIATE_DIAGONAL(double)
-    INSTANTIATE_DIAGONAL(cfloat)
-    INSTANTIATE_DIAGONAL(cdouble)
-    INSTANTIATE_DIAGONAL(int)
-    INSTANTIATE_DIAGONAL(uint)
-    INSTANTIATE_DIAGONAL(intl)
-    INSTANTIATE_DIAGONAL(uintl)
-    INSTANTIATE_DIAGONAL(char)
-    INSTANTIATE_DIAGONAL(uchar)
-
-}
diff --git a/src/backend/cuda/diagonal.hpp b/src/backend/cuda/diagonal.hpp
index db671dece3..a1a9828a2a 100644
--- a/src/backend/cuda/diagonal.hpp
+++ b/src/backend/cuda/diagonal.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num);
 
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num);
-}
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/diff.cpp b/src/backend/cuda/diff.cpp
new file mode 100644
index 0000000000..b21ab36b72
--- /dev/null
+++ b/src/backend/cuda/diff.cpp
@@ -0,0 +1,65 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <diff.hpp>
+#include <err_cuda.hpp>
+#include <kernel/diff.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> diff(const Array<T> &in, const int dim, const bool isDiff2) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims[dim] -= (isDiff2 + 1);
+
+    if (iDims.elements() == 0 || oDims.elements() == 0) {
+        AF_ERROR("Elements are 0", AF_ERR_SIZE);
+    }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::diff<T>(out, in, in.ndims(), dim, isDiff2);
+
+    return out;
+}
+
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, false);
+}
+
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, true);
+}
+
+#define INSTANTIATE(T)                                             \
+    template Array<T> diff1<T>(const Array<T> &in, const int dim); \
+    template Array<T> diff2<T>(const Array<T> &in, const int dim);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/diff.cu b/src/backend/cuda/diff.cu
deleted file mode 100644
index f7b1a6ec18..0000000000
--- a/src/backend/cuda/diff.cu
+++ /dev/null
@@ -1,72 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <diff.hpp>
-#include <kernel/diff.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-
-    template<typename T, bool isDiff2>
-    static Array<T> diff(const Array<T> &in, const int dim)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-        oDims[dim] -= (isDiff2 + 1);
-
-        if(iDims.elements() == 0 || oDims.elements() == 0) {
-            AF_ERROR("Elements are 0", AF_ERR_SIZE);
-        }
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        switch (dim) {
-            case (0):   kernel::diff<T, 0, isDiff2>(out, in, in.ndims());
-                break;
-            case (1):   kernel::diff<T, 1, isDiff2>(out, in, in.ndims());
-                break;
-            case (2):   kernel::diff<T, 2, isDiff2>(out, in, in.ndims());
-                break;
-            case (3):   kernel::diff<T, 3, isDiff2>(out, in, in.ndims());
-                break;
-        }
-
-        return out;
-    }
-
-    template<typename T>
-    Array<T> diff1(const Array<T> &in, const int dim)
-    {
-        return diff<T, false>(in, dim);
-    }
-
-    template<typename T>
-    Array<T> diff2(const Array<T> &in, const int dim)
-    {
-        return diff<T, true>(in, dim);
-    }
-
-#define INSTANTIATE(T)                                                  \
-    template Array<T> diff1<T>  (const Array<T> &in, const int dim);   \
-    template Array<T> diff2<T>  (const Array<T> &in, const int dim);   \
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-
-}
diff --git a/src/backend/cuda/diff.hpp b/src/backend/cuda/diff.hpp
index b0b66d0b54..c2b4900862 100644
--- a/src/backend/cuda/diff.hpp
+++ b/src/backend/cuda/diff.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> diff1(const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim);
 
-    template<typename T>
-    Array<T> diff2(const Array<T> &in, const int dim);
-}
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/dilate.cu b/src/backend/cuda/dilate.cu
deleted file mode 100644
index 0da33f2969..0000000000
--- a/src/backend/cuda/dilate.cu
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph_impl.hpp"
-
-namespace cuda
-{
-
-INSTANTIATE(float , true)
-INSTANTIATE(double, true)
-INSTANTIATE(char  , true)
-INSTANTIATE(int   , true)
-INSTANTIATE(uint  , true)
-INSTANTIATE(uchar , true)
-
-}
diff --git a/src/backend/cuda/dilate3d.cu b/src/backend/cuda/dilate3d.cu
deleted file mode 100644
index 32b0babc9d..0000000000
--- a/src/backend/cuda/dilate3d.cu
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph3d_impl.hpp"
-
-namespace cuda
-{
-
-INSTANTIATE(float , true)
-INSTANTIATE(double, true)
-INSTANTIATE(char  , true)
-INSTANTIATE(int   , true)
-INSTANTIATE(uint  , true)
-INSTANTIATE(uchar , true)
-
-}
diff --git a/src/backend/cuda/dims_param.hpp b/src/backend/cuda/dims_param.hpp
new file mode 100644
index 0000000000..273eaf13cb
--- /dev/null
+++ b/src/backend/cuda/dims_param.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cuda {
+
+typedef struct {
+    int dim[4];
+} dims_t;
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/driver.cpp b/src/backend/cuda/driver.cpp
index b1f03e254c..4edcbf664f 100644
--- a/src/backend/cuda/driver.cpp
+++ b/src/backend/cuda/driver.cpp
@@ -7,17 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <stdio.h>
-#include <string.h>
 #include <driver.h>
+#include <cstdio>
+#include <cstring>
 
 #ifdef OS_WIN
-#include <windows.h>
 #include <stdlib.h>
+#include <windows.h>
 #define snprintf _snprintf
 
-int nvDriverVersion(char *result, int len)
-{
+int nvDriverVersion(char *result, int len) {
 #ifndef OS_WIN
     LPCTSTR lptstrFilename = "nvcuda.dll";
     DWORD dwLen, dwHandle;
@@ -40,8 +39,8 @@ int nvDriverVersion(char *result, int len)
     rv = VerQueryValue(lpData, "\\", (LPVOID *)&lpBuffer, &buflen);
     if (!rv) return 0;
 
-    version = (HIWORD(lpBuffer->dwFileVersionLS) - 10)*10000 +
-               LOWORD(lpBuffer->dwFileVersionLS);
+    version = (HIWORD(lpBuffer->dwFileVersionLS) - 10) * 10000 +
+              LOWORD(lpBuffer->dwFileVersionLS);
     fversion = version / 100.f;
 
     snprintf(result, len, "%.2f", fversion);
@@ -50,53 +49,54 @@ int nvDriverVersion(char *result, int len)
 #else
     snprintf(result, len, "%.2f", 0.0);
 #endif
-    return 1;
+    return 0;
 }
 
 #else
 
-int nvDriverVersion(char *result, int len)
-{
+int nvDriverVersion(char *result, int len) {
     int pos = 0, epos = 0, i = 0;
     char buffer[1024];
     FILE *f = NULL;
 
-    if (NULL == (f = fopen("/proc/driver/nvidia/version", "r"))) {
-        return 0;
-    }
+    if (NULL == (f = fopen("/proc/driver/nvidia/version", "re"))) { return 0; }
     if (fgets(buffer, 1024, f) == NULL) {
-        if(f) fclose(f);
+        if (f) { fclose(f); }
         return 0;
     }
 
-    //just close it now since we've already read what we need
-    if(f) fclose(f);
+    // just close it now since we've already read what we need
+    if (f) { fclose(f); }
 
     for (i = 1; i < 8; i++) {
-
-        while (buffer[pos] != ' ' && buffer[pos] != '\t')
-            if (pos >= 1024 || buffer[pos] == '\0' || buffer[pos] == '\n')
+        while (buffer[pos] != ' ' && buffer[pos] != '\t') {
+            if (pos >= 1024 || buffer[pos] == '\0' || buffer[pos] == '\n') {
                 return 0;
-            else
+            } else {
                 pos++;
-        while (buffer[pos] == ' ' || buffer[pos] == '\t')
-            if (pos >= 1024 || buffer[pos] == '\0' || buffer[pos] == '\n')
+            }
+        }
+        while (buffer[pos] == ' ' || buffer[pos] == '\t') {
+            if (pos >= 1024 || buffer[pos] == '\0' || buffer[pos] == '\n') {
                 return 0;
-            else
+            } else {
                 pos++;
+            }
+        }
     }
 
     epos = pos;
     while (buffer[epos] != ' ' && buffer[epos] != '\t') {
-        if (epos >= 1024 || buffer[epos] == '\0' || buffer[epos] == '\n')
+        if (epos >= 1024 || buffer[epos] == '\0' || buffer[epos] == '\n') {
             return 0;
-        else
+        } else {
             epos++;
+        }
     }
 
     buffer[epos] = '\0';
 
-    strncpy(result, buffer+pos, len);
+    strncpy(result, buffer + pos, len);
 
     return 1;
 }
diff --git a/src/backend/cuda/driver.h b/src/backend/cuda/driver.h
index 835c3fef17..fa828301f9 100644
--- a/src/backend/cuda/driver.h
+++ b/src/backend/cuda/driver.h
@@ -13,7 +13,7 @@
 extern "C" {
 #endif
 
-int nvDriverVersion(char *buffer, int len);
+int nvDriverVersion(char *result, int len);
 
 #ifdef __cplusplus
 }
diff --git a/src/backend/cuda/erode.cu b/src/backend/cuda/erode.cu
deleted file mode 100644
index dbb2c8ece9..0000000000
--- a/src/backend/cuda/erode.cu
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph_impl.hpp"
-
-namespace cuda
-{
-
-INSTANTIATE(float , false)
-INSTANTIATE(double, false)
-INSTANTIATE(char  , false)
-INSTANTIATE(int   , false)
-INSTANTIATE(uint  , false)
-INSTANTIATE(uchar , false)
-
-}
diff --git a/src/backend/cuda/erode3d.cu b/src/backend/cuda/erode3d.cu
deleted file mode 100644
index 808198a455..0000000000
--- a/src/backend/cuda/erode3d.cu
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph3d_impl.hpp"
-
-namespace cuda
-{
-
-INSTANTIATE(float , false)
-INSTANTIATE(double, false)
-INSTANTIATE(char  , false)
-INSTANTIATE(int   , false)
-INSTANTIATE(uint  , false)
-INSTANTIATE(uchar , false)
-
-}
diff --git a/src/backend/cuda/err_cuda.hpp b/src/backend/cuda/err_cuda.hpp
index 761af19c95..f6db7e6822 100644
--- a/src/backend/cuda/err_cuda.hpp
+++ b/src/backend/cuda/err_cuda.hpp
@@ -8,26 +8,50 @@
  ********************************************************/
 
 #pragma once
+#include <common/defines.hpp>
+#include <common/err_common.hpp>
 #include <stdio.h>
-#include <err_common.hpp>
 
-#define CUDA_NOT_SUPPORTED() do {                       \
-        throw SupportError(__FILE__, __LINE__, "CUDA"); \
-    } while(0)
+#define CUDA_NOT_SUPPORTED(message)                                         \
+    do {                                                                    \
+        throw SupportError(__AF_FUNC__, __AF_FILENAME__, __LINE__, "CUDA",  \
+                           message, boost::stacktrace::stacktrace());       \
+    } while (0)
 
-#define CUDA_CHECK(fn) do {                     \
-        cudaError_t _cuda_error = fn;           \
-        if (_cuda_error != cudaSuccess) {       \
-            char cuda_err_msg[1024];            \
-            snprintf(cuda_err_msg,                \
-                     sizeof(cuda_err_msg),      \
-                     "CUDA Error (%d): %s\n",   \
-                     (int)(_cuda_error),        \
-                     cudaGetErrorString(        \
-                         _cuda_error));         \
-                                                \
-            AF_ERROR(cuda_err_msg,              \
-                     AF_ERR_INTERNAL);          \
-        }                                       \
-    } while(0)
+#define CU_CHECK(fn)                                                          \
+    do {                                                                      \
+        CUresult res = fn;                                                    \
+        if (res == CUDA_SUCCESS) break;                                       \
+        char cu_err_msg[1024];                                                \
+        const char* cu_err_name;                                              \
+        const char* cu_err_string;                                            \
+        CUresult nameErr, strErr;                                             \
+        nameErr = cuGetErrorName(res, &cu_err_name);                          \
+        strErr  = cuGetErrorString(res, &cu_err_string);                      \
+        if (nameErr == CUDA_SUCCESS && strErr == CUDA_SUCCESS) {              \
+            snprintf(cu_err_msg, sizeof(cu_err_msg), "CU Error %s(%d): %s\n", \
+                     cu_err_name, (int)(res), cu_err_string);                 \
+            AF_ERROR(cu_err_msg, AF_ERR_INTERNAL);                            \
+        } else {                                                              \
+            AF_ERROR("CU Unknown error.\n", AF_ERR_INTERNAL);                 \
+        }                                                                     \
+    } while (0)
 
+#define CUDA_CHECK(fn)                                               \
+    do {                                                             \
+        cudaError_t _cuda_error = fn;                                \
+        if (_cuda_error != cudaSuccess) {                            \
+            char cuda_err_msg[1024];                                 \
+            snprintf(cuda_err_msg, sizeof(cuda_err_msg),             \
+                     "CUDA Error (%d): %s\n", (int)(_cuda_error),    \
+                     cudaGetErrorString(cudaGetLastError()));        \
+                                                                     \
+            if (_cuda_error == cudaErrorMemoryAllocation) {          \
+                AF_ERROR(cuda_err_msg, AF_ERR_NO_MEM);               \
+            } else if (_cuda_error == cudaErrorDevicesUnavailable) { \
+                AF_ERROR(cuda_err_msg, AF_ERR_DRIVER);               \
+            } else {                                                 \
+                AF_ERROR(cuda_err_msg, AF_ERR_INTERNAL);             \
+            }                                                        \
+        }                                                            \
+    } while (0)
diff --git a/src/backend/cuda/err_cufft.hpp b/src/backend/cuda/err_cufft.hpp
deleted file mode 100644
index 492fcd9ed1..0000000000
--- a/src/backend/cuda/err_cufft.hpp
+++ /dev/null
@@ -1,88 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <stdio.h>
-#include <err_common.hpp>
-#include <cuda.h> // Need this for CUDA_VERSION
-#include <cufft.h>
-
-static const char * _cufftGetResultString(cufftResult res)
-{
-    switch (res)
-    {
-        case CUFFT_SUCCESS:
-            return "cuFFT: success";
-
-        case CUFFT_INVALID_PLAN:
-            return "cuFFT: invalid plan handle passed";
-
-        case CUFFT_ALLOC_FAILED:
-            return "cuFFT: resources allocation failed";
-
-        case CUFFT_INVALID_TYPE:
-            return "cuFFT: invalid type (deprecated)";
-
-        case CUFFT_INVALID_VALUE:
-            return "cuFFT: invalid parameters passed to cuFFT API";
-
-        case CUFFT_INTERNAL_ERROR:
-            return "cuFFT: internal error detected using cuFFT";
-
-        case CUFFT_EXEC_FAILED:
-            return "cuFFT: FFT execution failed";
-
-        case CUFFT_SETUP_FAILED:
-            return "cuFFT: library initialization failed";
-
-        case CUFFT_INVALID_SIZE:
-            return "cuFFT: invalid size parameters passed";
-
-        case CUFFT_UNALIGNED_DATA:
-            return "cuFFT: unaligned data (deprecated)";
-
-        case CUFFT_INCOMPLETE_PARAMETER_LIST:
-            return "cuFFT: call is missing parameters";
-
-        case CUFFT_INVALID_DEVICE:
-            return "cuFFT: plan execution different than plan creation";
-
-        case CUFFT_PARSE_ERROR:
-            return "cuFFT: plan parse error";
-
-        case CUFFT_NO_WORKSPACE:
-            return "cuFFT: no workspace provided";
-
-#if CUDA_VERSION >= 6050
-        case CUFFT_NOT_IMPLEMENTED:
-            return "cuFFT: not implemented";
-
-        case CUFFT_LICENSE_ERROR:
-            return "cuFFT: license error";
-#endif
-    }
-
-    return "cuFFT: unknown error";
-}
-
-#define CUFFT_CHECK(fn) do {                    \
-        cufftResult _cufft_res = fn;            \
-        if (_cufft_res != CUFFT_SUCCESS) {      \
-            char cufft_res_msg[1024];           \
-            snprintf(cufft_res_msg,             \
-                     sizeof(cufft_res_msg),     \
-                     "cuFFT Error (%d): %s\n",  \
-                     (int)(_cufft_res),         \
-                     _cufftGetResultString(     \
-                         _cufft_res));          \
-                                                \
-            AF_ERROR(cufft_res_msg,             \
-                     AF_ERR_INTERNAL);          \
-        }                                       \
-    } while(0)
diff --git a/src/backend/cuda/exampleFunction.cpp b/src/backend/cuda/exampleFunction.cpp
new file mode 100644
index 0000000000..12bf635785
--- /dev/null
+++ b/src/backend/cuda/exampleFunction.cpp
@@ -0,0 +1,70 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Header with the CUDA backend-specific
+// Array class implementation, which inherits
+// from the ArrayInfo base class
+#include <Array.hpp>
+
+#include <exampleFunction.hpp>  // cuda backend function header
+
+// Error-checking functions and macros
+// specific to the CUDA backend
+#include <err_cuda.hpp>
+
+// This header lives under src/backend/cuda/kernel.
+// It defines the CUDA kernel and the wrapper
+// function to which the main computation of your
+// algorithm should be relayed
+#include <kernel/exampleFunction.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method) {
+    dim4 outputDims;  // this should be '= a.dims();' in most cases
+                      // but ultimately depends on the
+                      // algorithm you are implementing.
+
+    Array<T> out = createEmptyArray<T>(outputDims);
+    // Please use the create***Array<T> helper
+    // functions defined in Array.hpp to create
+    // different types of Arrays. Check that
+    // file to see which types of Arrays you
+    // can create.
+
+    // Relay the actual computation to the CUDA kernel wrapper
+    kernel::exampleFunc<T>(out, a, b, method);
+
+    return out;  // return the result
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> exampleFunction<T>(const Array<T> &a, const Array<T> &b, \
+                                         const af_someenum_t method);
+
+// Instantiations must be provided for every type
+// that appears in the switch-case statement
+// in src/api/c/exampleFunction.cpp
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/exampleFunction.cu b/src/backend/cuda/exampleFunction.cu
deleted file mode 100644
index 4a7db44fb4..0000000000
--- a/src/backend/cuda/exampleFunction.cu
+++ /dev/null
@@ -1,65 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>                    // header with cuda backend specific
-                                        // Array class implementation that inherits
-                                        // ArrayInfo base class
-
-#include <exampleFunction.hpp>          // cuda backend function header
-
-#include <err_cuda.hpp>                 // error check functions and Macros
-                                        // specific to cuda backend
-
-#include <kernel/exampleFunction.hpp>   // this header under the folder src/cuda/kernel
-                                        // defines the CUDA kernel and its wrapper
-                                        // function to which the main computation of your
-                                        // algorithm should be relayed to
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method)
-{
-    dim4 outputDims;                    // this should be '= in.dims();' in most cases
-                                        // but would definitely depend on the type of
-                                        // algorithm you are implementing.
-
-    Array<T> out = createEmptyArray<T>(outputDims);
-                                        // Please use the create***Array<T> helper
-                                        // functions defined in Array.hpp to create
-                                        // different types of Arrays. Please check the
-                                        // file to know what are the different types you
-                                        // can create.
-
-    // Relay the actual computation to CUDA kernel wrapper
-    kernel::exampleFunc<T>(out, in, method);
-
-    return out;                         // return the result
-}
-
-
-#define INSTANTIATE(T)  \
-    template Array<T> exampleFunction<T>(const Array<T> &in, const af_someenum_t method);
-
-// INSTANTIATIONS for all the types which
-// are present in the switch case statement
-// in src/api/c/exampleFunction.cpp should be available
-INSTANTIATE(float)
-INSTANTIATE(double)
-INSTANTIATE(int)
-INSTANTIATE(uint)
-INSTANTIATE(uchar)
-INSTANTIATE(char)
-INSTANTIATE(cfloat)
-INSTANTIATE(cdouble)
-
-}
diff --git a/src/backend/cuda/exampleFunction.hpp b/src/backend/cuda/exampleFunction.hpp
index a297ef7ee4..d0e9938dda 100644
--- a/src/backend/cuda/exampleFunction.hpp
+++ b/src/backend/cuda/exampleFunction.hpp
@@ -9,8 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/fast.cu b/src/backend/cuda/fast.cu
index 7bd6f4772a..63e9a57cb4 100644
--- a/src/backend/cuda/fast.cu
+++ b/src/backend/cuda/fast.cu
@@ -7,57 +7,65 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/features.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-#include <handle.hpp>
+#include <fast.hpp>
+
+#include <LookupTable1D.hpp>
 #include <kernel/fast.hpp>
+#include <kernel/fast_lut.hpp>
+#include <af/dim4.hpp>
+
+#include <mutex>
 
 using af::dim4;
 using af::features;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
-              const bool non_max, const float feature_ratio, const unsigned edge)
-{
-    const dim4 dims = in.dims();
-
+              const bool non_max, const float feature_ratio,
+              const unsigned edge) {
     unsigned nfeat;
     float *d_x_out;
     float *d_y_out;
     float *d_score_out;
 
-    kernel::fast<T>(&nfeat, &d_x_out, &d_y_out, &d_score_out, in,
-                    thr, arc_length, non_max, feature_ratio, edge);
+    // TODO(pradeep) Figure out a better way to create the LUT Array only once
+    const Array<unsigned char> lut = createHostDataArray(
+        af::dim4(sizeof(FAST_LUT) / sizeof(unsigned char)), FAST_LUT);
+
+    LookupTable1D<unsigned char> fastLUT(lut);
+
+    kernel::fast<T>(&nfeat, &d_x_out, &d_y_out, &d_score_out, in, thr,
+                    arc_length, non_max, feature_ratio, edge, fastLUT);
 
     if (nfeat > 0) {
         const dim4 out_dims(nfeat);
 
-        x_out = createDeviceDataArray<float>(out_dims, d_x_out);
-        y_out = createDeviceDataArray<float>(out_dims, d_y_out);
+        x_out     = createDeviceDataArray<float>(out_dims, d_x_out);
+        y_out     = createDeviceDataArray<float>(out_dims, d_y_out);
         score_out = createDeviceDataArray<float>(out_dims, d_score_out);
     }
-
     return nfeat;
 }
 
-#define INSTANTIATE(T)                                                                              \
-    template unsigned fast<T>(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,    \
-                              const Array<T> &in, const float thr, const unsigned arc_length,       \
-                              const bool nonmax, const float feature_ratio, const unsigned edge);
+#define INSTANTIATE(T)                                                        \
+    template unsigned fast<T>(                                                \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const float thr, const unsigned arc_length,       \
+        const bool nonmax, const float feature_ratio, const unsigned edge);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/fast.hpp b/src/backend/cuda/fast.hpp
index e3f3606d7d..d60c671634 100644
--- a/src/backend/cuda/fast.hpp
+++ b/src/backend/cuda/fast.hpp
@@ -7,17 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <Array.hpp>
+#include <af/features.h>
 
 using af::features;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
-              const bool non_max, const float feature_ratio, const unsigned edge);
+              const bool non_max, const float feature_ratio,
+              const unsigned edge);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/fast_pyramid.cpp b/src/backend/cuda/fast_pyramid.cpp
new file mode 100644
index 0000000000..ba0b6dfbf4
--- /dev/null
+++ b/src/backend/cuda/fast_pyramid.cpp
@@ -0,0 +1,129 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <fast_pyramid.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <fast.hpp>
+#include <resize.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using std::vector;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void fast_pyramid(vector<unsigned> &feat_pyr, vector<Array<float>> &x_pyr,
+                  vector<Array<float>> &y_pyr, vector<unsigned> &lvl_best,
+                  vector<float> &lvl_scl, vector<Array<T>> &img_pyr,
+                  const Array<T> &in, const float fast_thr,
+                  const unsigned max_feat, const float scl_fctr,
+                  const unsigned levels, const unsigned patch_size) {
+    dim4 indims         = in.dims();
+    unsigned min_side   = std::min(indims[0], indims[1]);
+    unsigned max_levels = 0;
+    float scl_sum       = 0.f;
+
+    for (unsigned i = 0; i < levels; i++) {
+        min_side /= scl_fctr;
+
+        // Minimum image side for a descriptor to be computed
+        if (min_side < patch_size || max_levels == levels) { break; }
+
+        max_levels++;
+        scl_sum += 1.f / std::pow(scl_fctr, static_cast<float>(i));
+    }
+
+    // Compute number of features to keep for each level
+    lvl_best.resize(max_levels);
+    lvl_scl.resize(max_levels);
+    unsigned feat_sum = 0;
+    for (unsigned i = 0; i < max_levels - 1; i++) {
+        auto scl   = std::pow(scl_fctr, static_cast<float>(i));
+        lvl_scl[i] = scl;
+
+        lvl_best[i] = ceil((max_feat / scl_sum) / lvl_scl[i]);
+        feat_sum += lvl_best[i];
+    }
+    lvl_scl[max_levels - 1] =
+        std::pow(scl_fctr, static_cast<float>(max_levels) - 1);
+    lvl_best[max_levels - 1] = max_feat - feat_sum;
+
+    // Hold multi-scale image pyramids
+    static const dim4 dims0;
+    static const CParam<T> emptyCParam(NULL, dims0.get(), dims0.get());
+
+    img_pyr.reserve(max_levels);
+
+    // Create multi-scale image pyramid
+    for (unsigned i = 0; i < max_levels; i++) {
+        if (i == 0) {
+            // First level is used in its original size
+            img_pyr.push_back(in);
+        } else {
+            // Resize previous level image to current level dimensions
+            dim4 dims(round(indims[0] / lvl_scl[i]),
+                      round(indims[1] / lvl_scl[i]));
+
+            img_pyr.push_back(createEmptyArray<T>(dims));
+            img_pyr[i] =
+                resize(img_pyr[i - 1], dims[0], dims[1], AF_INTERP_BILINEAR);
+        }
+    }
+
+    feat_pyr.resize(max_levels);
+
+    // Round feature size to an odd integer (even sizes round up by one)
+    float size = 2.f * floor(patch_size / 2.f) + 1.f;
+
+    // Avoid keeping features that are too wide and might not fit the image.
+    // The widest case is a patch rotated by 45 degrees, which extends half
+    // of its diagonal, size * sqrt(2.f) / 2.f, from the feature center.
+    unsigned edge = ceil(size * sqrt(2.f) / 2.f);
+
+    for (unsigned i = 0; i < max_levels; i++) {
+        Array<float> x_out     = createEmptyArray<float>(dim4());
+        Array<float> y_out     = createEmptyArray<float>(dim4());
+        Array<float> score_out = createEmptyArray<float>(dim4());
+
+        unsigned lvl_feat = fast(x_out, y_out, score_out, img_pyr[i], fast_thr,
+                                 9, 1, 0.14f, edge);
+
+        if (lvl_feat > 0) {
+            feat_pyr[i] = lvl_feat;
+            x_pyr.push_back(x_out);
+            y_pyr.push_back(y_out);
+        } else {
+            feat_pyr[i] = 0;
+        }
+    }
+}
+
+#define INSTANTIATE(T)                                                      \
+    template void fast_pyramid<T>(                                          \
+        vector<unsigned> &, vector<Array<float>> &, vector<Array<float>> &, \
+        vector<unsigned> &, vector<float> &, vector<Array<T>> &,            \
+        const Array<T> &, const float, const unsigned, const float,         \
+        const unsigned, const unsigned);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
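Review note on fast_pyramid.cpp: the level planning at the top of fast_pyramid can be read in isolation from the CUDA calls. The sketch below reproduces that arithmetic as a standalone function. `PyramidPlan` and `planPyramid` are hypothetical names that do not appear in the patch, and the two guards marked as assumptions (an early return when no level fits, and a clamp on the last level's budget) are additions for illustration, not behavior of the patched code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type; the patch writes into lvl_scl and lvl_best.
struct PyramidPlan {
    std::vector<float> scale;    // downscale factor per level
    std::vector<unsigned> best;  // feature budget per level
};

// Standalone sketch of the level planning done in fast_pyramid.
inline PyramidPlan planPyramid(unsigned width, unsigned height,
                               unsigned max_feat, float scl_fctr,
                               unsigned levels, unsigned patch_size) {
    assert(scl_fctr > 1.0f);

    unsigned min_side   = std::min(width, height);
    unsigned max_levels = 0;
    float scl_sum       = 0.f;

    for (unsigned i = 0; i < levels; i++) {
        // Same truncating update as `min_side /= scl_fctr` in the patch
        min_side = static_cast<unsigned>(min_side / scl_fctr);

        // Stop once a descriptor patch no longer fits the shrunken side
        if (min_side < patch_size) { break; }

        max_levels++;
        scl_sum += 1.f / std::pow(scl_fctr, static_cast<float>(i));
    }

    PyramidPlan plan;
    // Assumption: return an empty plan rather than index max_levels - 1
    if (max_levels == 0) { return plan; }

    plan.scale.resize(max_levels);
    plan.best.resize(max_levels);

    // All levels but the last get a share proportional to 1 / scale
    unsigned feat_sum = 0;
    for (unsigned i = 0; i + 1 < max_levels; i++) {
        plan.scale[i] = std::pow(scl_fctr, static_cast<float>(i));
        plan.best[i]  = static_cast<unsigned>(
            std::ceil((max_feat / scl_sum) / plan.scale[i]));
        feat_sum += plan.best[i];
    }

    // The last level receives whatever remains of the budget
    plan.scale[max_levels - 1] =
        std::pow(scl_fctr, static_cast<float>(max_levels - 1));
    // Assumption: clamp at zero so rounding up cannot wrap the unsigned value
    plan.best[max_levels - 1] = max_feat > feat_sum ? max_feat - feat_sum : 0;
    return plan;
}
```

For a 256x256 input with max_feat = 150, scl_fctr = 2, levels = 4 and patch_size = 31, this yields three levels with budgets 86, 43 and 21, which sum to max_feat.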
diff --git a/src/backend/cuda/fast_pyramid.cu b/src/backend/cuda/fast_pyramid.cu
deleted file mode 100644
index 3c2223683f..0000000000
--- a/src/backend/cuda/fast_pyramid.cu
+++ /dev/null
@@ -1,54 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/features.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-#include <handle.hpp>
-#include <kernel/fast_pyramid.hpp>
-
-using af::dim4;
-using af::features;
-
-namespace cuda
-{
-
-template<typename T>
-void fast_pyramid(std::vector<unsigned>& feat_pyr, std::vector<float*>& d_x_pyr,
-                  std::vector<float*>& d_y_pyr, std::vector<unsigned>& lvl_best,
-                  std::vector<float>& lvl_scl, std::vector<CParam<T> >& img_pyr,
-                  const Array<T>& image,
-                  const float fast_thr, const unsigned max_feat,
-                  const float scl_fctr, const unsigned levels,
-                  const unsigned patch_size)
-{
-    kernel::fast_pyramid<T>(feat_pyr, d_x_pyr, d_y_pyr, lvl_best, lvl_scl, img_pyr,
-                            image, fast_thr, max_feat, scl_fctr, levels, patch_size);
-}
-
-#define INSTANTIATE(T)\
-    template void fast_pyramid<T>(std::vector<unsigned>& feat_pyr, std::vector<float*>& d_x_pyr,    \
-                                  std::vector<float*>& d_y_pyr, std::vector<unsigned>& lvl_best,    \
-                                  std::vector<float>& lvl_scl, std::vector<CParam<T> >& img_pyr,    \
-                                  const Array<T>& image,                                            \
-                                  const float fast_thr, const unsigned max_feat,                    \
-                                  const float scl_fctr, const unsigned levels,                      \
-                                  const unsigned patch_size);
-
-INSTANTIATE(float )
-INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
diff --git a/src/backend/cuda/fast_pyramid.hpp b/src/backend/cuda/fast_pyramid.hpp
index d232411b2a..af8e902ea2 100644
--- a/src/backend/cuda/fast_pyramid.hpp
+++ b/src/backend/cuda/fast_pyramid.hpp
@@ -7,21 +7,22 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
-#include <Array.hpp>
+#pragma once
 
-using af::features;
+#include <Array.hpp>
 
-namespace cuda
-{
+#include <vector>
 
+namespace arrayfire {
+namespace cuda {
 template<typename T>
-void fast_pyramid(std::vector<unsigned>& feat_pyr, std::vector<float*>& d_x_pyr,
-                  std::vector<float*>& d_y_pyr, std::vector<unsigned>& lvl_best,
-                  std::vector<float>& lvl_scl, std::vector<CParam<T> >& img_pyr,
-                  const Array<T>& image,
+void fast_pyramid(std::vector<unsigned> &feat_pyr,
+                  std::vector<Array<float>> &d_x_pyr,
+                  std::vector<Array<float>> &d_y_pyr,
+                  std::vector<unsigned> &lvl_best, std::vector<float> &lvl_scl,
+                  std::vector<Array<T>> &img_pyr, const Array<T> &in,
                   const float fast_thr, const unsigned max_feat,
                   const float scl_fctr, const unsigned levels,
                   const unsigned patch_size);
-
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/fft.cpp b/src/backend/cuda/fft.cpp
deleted file mode 100644
index 31d1f7c6c2..0000000000
--- a/src/backend/cuda/fft.cpp
+++ /dev/null
@@ -1,243 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <copy.hpp>
-#include <fft.hpp>
-#include <err_cuda.hpp>
-#include <err_cufft.hpp>
-#include <cufft.h>
-#include <math.hpp>
-#include <string>
-#include <cstdio>
-#include <memory.hpp>
-
-using af::dim4;
-using std::string;
-
-namespace cuda
-{
-
-
-// cuFFTPlanner will do very basic plan caching.
-// it looks for required candidate in mHandles array and returns if found one.
-// otherwise, it will create a plan and set it at the mAvailSlotIndex and increment
-// the slot index variable in ciruclar fashion 0 to MAX_PLAN_CACHE, then back to zero and repeat.
-class cuFFTPlanner
-{
-    friend void find_cufft_plan(cufftHandle &plan, int rank, int *n,
-                                int *inembed, int istride, int idist,
-                                int *onembed, int ostride, int odist,
-                                cufftType type, int batch);
-
-    public:
-        static cuFFTPlanner& getInstance() {
-            static cuFFTPlanner single_instance;
-            return single_instance;
-        }
-
-    private:
-        cuFFTPlanner() : mAvailSlotIndex(0) {}
-        cuFFTPlanner(cuFFTPlanner const&);
-        void operator=(cuFFTPlanner const&);
-
-        static const int MAX_PLAN_CACHE = 5;
-
-        int                  mAvailSlotIndex;
-        cufftHandle mHandles[MAX_PLAN_CACHE];
-        string         mKeys[MAX_PLAN_CACHE];
-};
-
-void find_cufft_plan(cufftHandle &plan, int rank, int *n,
-                     int *inembed, int istride, int idist,
-                     int *onembed, int ostride, int odist,
-                     cufftType type, int batch)
-{
-    cuFFTPlanner &planner = cuFFTPlanner::getInstance();
-    // create the key string
-    char key_str_temp[64];
-    sprintf(key_str_temp, "%d:", rank);
-
-    string key_string(key_str_temp);
-
-    for(int r=0; r<rank; ++r) {
-        sprintf(key_str_temp, "%d:", n[r]);
-        key_string.append(std::string(key_str_temp));
-    }
-
-    if (inembed!=NULL) {
-        for(int r=0; r<rank; ++r) {
-            sprintf(key_str_temp, "%d:", inembed[r]);
-            key_string.append(std::string(key_str_temp));
-        }
-        sprintf(key_str_temp, "%d:%d:", istride, idist);
-        key_string.append(std::string(key_str_temp));
-    }
-
-    if (onembed!=NULL) {
-        for(int r=0; r<rank; ++r) {
-            sprintf(key_str_temp, "%d:", onembed[r]);
-            key_string.append(std::string(key_str_temp));
-        }
-        sprintf(key_str_temp, "%d:%d:", ostride, odist);
-        key_string.append(std::string(key_str_temp));
-    }
-
-    sprintf(key_str_temp, "%d:%d", (int)type, batch);
-    key_string.append(std::string(key_str_temp));
-
-    // find the matching plan_index in the array cuFFTPlanner::mKeys
-    int plan_index = -1;
-    for (int i=0; i<cuFFTPlanner::MAX_PLAN_CACHE; ++i) {
-        if (key_string==planner.mKeys[i]) {
-            plan_index = i;
-            break;
-        }
-    }
-    // return mHandles[plan_index] if plan_index valid
-    if (plan_index!=-1) {
-        plan = planner.mHandles[plan_index];
-        return;
-    }
-    // otherwise create a new plan and set it at mAvailSlotIndex
-    // and finally set it to output plan variable
-    int slot_index = planner.mAvailSlotIndex;
-    cufftDestroy(planner.mHandles[slot_index]); // We ignore both return values
-
-    cufftHandle temp;
-    cufftResult res = cufftPlanMany(&temp, rank, n, inembed, istride, idist, onembed, ostride, odist, type, batch);
-
-    // If plan creation fails, clean up the memory we hold on to and try again
-    if (res != CUFFT_SUCCESS) {
-        garbageCollect();
-        CUFFT_CHECK(cufftPlanMany(&temp, rank, n, inembed, istride, idist, onembed, ostride, odist, type, batch));
-    }
-
-    plan = temp;
-    planner.mHandles[slot_index] = temp;
-    planner.mKeys[slot_index] = key_string;
-    planner.mAvailSlotIndex = (slot_index + 1)%cuFFTPlanner::MAX_PLAN_CACHE;
-}
-
-template<typename T>
-struct cufft_transform;
-
-#define CUFFT_FUNC(T, TRANSFORM_TYPE)                               \
-    template<>                                                      \
-    struct cufft_transform<T>                                       \
-    {                                                               \
-        enum { type = CUFFT_##TRANSFORM_TYPE };                     \
-        cufftResult                                                 \
-            operator() (cufftHandle plan, T *in, T *out, int dir) { \
-            return cufftExec##TRANSFORM_TYPE(plan, in, out, dir);   \
-        }                                                           \
-    };
-
-CUFFT_FUNC(cfloat , C2C)
-CUFFT_FUNC(cdouble, Z2Z)
-
-template<int rank>
-void computeDims(int rdims[rank], const dim4 &idims)
-{
-    for (int i = 0; i < rank; i++) {
-        rdims[i] = idims[(rank -1) - i];
-    }
-}
-
-template<typename T, int rank, bool direction>
-void fft_common(Array<T> &out, const Array<T> &in)
-{
-    const dim4 idims    = in.dims();
-    const dim4 istrides = in.strides();
-    const dim4 ostrides = out.strides();
-
-    int in_dims[rank];
-    int in_embed[rank];
-    int out_embed[rank];
-
-    computeDims<rank>(in_dims, idims);
-    computeDims<rank>(in_embed, in.getDataDims());
-    computeDims<rank>(out_embed, out.getDataDims());
-
-    int batch = 1;
-    for (int i = rank; i < 4; i++) {
-        batch *= idims[i];
-    }
-
-    cufftHandle plan;
-    find_cufft_plan(plan, rank, in_dims,
-                    in_embed , istrides[0], istrides[rank],
-                    out_embed, ostrides[0], ostrides[rank],
-                    (cufftType)cufft_transform<T>::type, batch);
-
-    cufft_transform<T> transform;
-    CUFFT_CHECK(transform(plan, (T *)in.get(), out.get(), direction ? CUFFT_FORWARD : CUFFT_INVERSE));
-}
-
-void computePaddedDims(dim4 &pdims,
-                       const dim4 &idims,
-                       const dim_t npad,
-                       dim_t const * const pad)
-{
-    for (int i = 0; i < 4; i++) {
-        pdims[i] = (i < (int)npad) ? pad[i] : idims[i];
-    }
-}
-
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, (rank>=1 && rank<=3));
-
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
-
-    Array<outType> ret = padArray<inType, outType>(in, pdims, scalar<outType>(0), norm_factor);
-    fft_common<outType, rank, true>(ret, ret);
-
-    return ret;
-}
-
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, (rank>=1 && rank<=3));
-
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
-
-    Array<T> ret = padArray<T, T>(in, pdims, scalar<T>(0), norm_factor);
-    fft_common<T, rank, false>(ret, ret);
-
-    return ret;
-}
-
-#define INSTANTIATE1(T1, T2)\
-    template Array<T2> fft <T1, T2, 1, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 2, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 3, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
-
-INSTANTIATE1(float  , cfloat )
-INSTANTIATE1(double , cdouble)
-
-#define INSTANTIATE2(T)\
-    template Array<T> fft <T, T, 1, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 2, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 3, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 1>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 2>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 3>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
-
-INSTANTIATE2(cfloat )
-INSTANTIATE2(cdouble)
-
-}
diff --git a/src/backend/cuda/fft.cu b/src/backend/cuda/fft.cu
new file mode 100644
index 0000000000..800e6571d2
--- /dev/null
+++ b/src/backend/cuda/fft.cu
@@ -0,0 +1,163 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <fft.hpp>
+
+#include <Array.hpp>
+#include <copy.hpp>
+#include <cufft.hpp>
+#include <debug_cuda.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <af/dim4.hpp>
+
+#include <array>
+
+using af::dim4;
+using std::array;
+using std::string;
+
+namespace arrayfire {
+namespace cuda {
+void setFFTPlanCacheSize(size_t numPlans) {
+    fftManager().setMaxCacheSize(numPlans);
+}
+
+template<typename T>
+struct cufft_transform;
+
+#define CUFFT_FUNC(T, TRANSFORM_TYPE)                                      \
+    template<>                                                             \
+    struct cufft_transform<T> {                                            \
+        enum { type = CUFFT_##TRANSFORM_TYPE };                            \
+        cufftResult operator()(cufftHandle plan, T *in, T *out, int dir) { \
+            return cufftExec##TRANSFORM_TYPE(plan, in, out, dir);          \
+        }                                                                  \
+    };
+
+CUFFT_FUNC(cfloat, C2C)
+CUFFT_FUNC(cdouble, Z2Z)
+
+template<typename To, typename Ti>
+struct cufft_real_transform;
+
+#define CUFFT_REAL_FUNC(To, Ti, TRANSFORM_TYPE)                     \
+    template<>                                                      \
+    struct cufft_real_transform<To, Ti> {                           \
+        enum { type = CUFFT_##TRANSFORM_TYPE };                     \
+        cufftResult operator()(cufftHandle plan, Ti *in, To *out) { \
+            return cufftExec##TRANSFORM_TYPE(plan, in, out);        \
+        }                                                           \
+    };
+
+CUFFT_REAL_FUNC(cfloat, float, R2C)
+CUFFT_REAL_FUNC(cdouble, double, D2Z)
+
+CUFFT_REAL_FUNC(float, cfloat, C2R)
+CUFFT_REAL_FUNC(double, cdouble, Z2D)
+
+inline array<int, AF_MAX_DIMS> computeDims(const int rank, const dim4 &idims) {
+    array<int, AF_MAX_DIMS> retVal = {};
+    for (int i = 0; i < rank; i++) { retVal[i] = idims[(rank - 1) - i]; }
+    return retVal;
+}
+
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction) {
+    const dim4 idims    = in.dims();
+    const dim4 istrides = in.strides();
+
+    auto t_dims   = computeDims(rank, idims);
+    auto in_embed = computeDims(rank, in.getDataDims());
+
+    int batch = 1;
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= idims[i]; }
+
+    SharedPlan plan =
+        findPlan(rank, t_dims.data(), in_embed.data(), istrides[0],
+                 istrides[rank], in_embed.data(), istrides[0], istrides[rank],
+                 (cufftType)cufft_transform<T>::type, batch);
+
+    cufft_transform<T> transform;
+    CUFFT_CHECK(cufftSetStream(*plan.get(), getActiveStream()));
+    CUFFT_CHECK(transform(*plan.get(), (T *)in.get(), in.get(),
+                          direction ? CUFFT_FORWARD : CUFFT_INVERSE));
+}
+
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank) {
+    dim4 idims = in.dims();
+    dim4 odims = in.dims();
+
+    odims[0] = odims[0] / 2 + 1;
+
+    Array<Tc> out = createEmptyArray<Tc>(odims);
+
+    auto t_dims    = computeDims(rank, idims);
+    auto in_embed  = computeDims(rank, in.getDataDims());
+    auto out_embed = computeDims(rank, out.getDataDims());
+
+    int batch = 1;
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= idims[i]; }
+
+    dim4 istrides = in.strides();
+    dim4 ostrides = out.strides();
+
+    SharedPlan plan =
+        findPlan(rank, t_dims.data(), in_embed.data(), istrides[0],
+                 istrides[rank], out_embed.data(), ostrides[0], ostrides[rank],
+                 (cufftType)cufft_real_transform<Tc, Tr>::type, batch);
+
+    cufft_real_transform<Tc, Tr> transform;
+    CUFFT_CHECK(cufftSetStream(*plan.get(), getActiveStream()));
+    CUFFT_CHECK(transform(*plan.get(), (Tr *)in.get(), out.get()));
+    return out;
+}
+
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank) {
+    Array<Tr> out = createEmptyArray<Tr>(odims);
+
+    auto t_dims    = computeDims(rank, odims);
+    auto in_embed  = computeDims(rank, in.getDataDims());
+    auto out_embed = computeDims(rank, out.getDataDims());
+
+    int batch = 1;
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= odims[i]; }
+
+    dim4 istrides = in.strides();
+    dim4 ostrides = out.strides();
+
+    cufft_real_transform<Tr, Tc> transform;
+
+    SharedPlan plan =
+        findPlan(rank, t_dims.data(), in_embed.data(), istrides[0],
+                 istrides[rank], out_embed.data(), ostrides[0], ostrides[rank],
+                 (cufftType)cufft_real_transform<Tr, Tc>::type, batch);
+
+    CUFFT_CHECK(cufftSetStream(*plan.get(), getActiveStream()));
+    CUFFT_CHECK(transform(*plan.get(), (Tc *)in.get(), out.get()));
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template void fft_inplace<T>(Array<T> &, const int, const bool);
+
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+#define INSTANTIATE_REAL(Tr, Tc)                                               \
+    template Array<Tc> fft_r2c<Tc, Tr>(const Array<Tr> &, const int);          \
+    template Array<Tr> fft_c2r<Tr, Tc>(const Array<Tc> &in, const dim4 &odims, \
+                                       const int);
+
+INSTANTIATE_REAL(float, cfloat)
+INSTANTIATE_REAL(double, cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
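Review note on fft.cu: all three entry points (fft_inplace, fft_r2c, fft_c2r) prepare the plan arguments the same way. The first `rank` dimensions are reversed, since ArrayFire stores dimension 0 as the fastest-varying one while cuFFT's plan-many interface expects the slowest-varying dimension first, and the remaining dimensions are multiplied into a batch count. The sketch below shows that index arithmetic without any CUDA dependency. `Dims`, `kMaxDims`, `reverseLeadingDims`, `batchCount` and `r2cOutputDims` are hypothetical stand-ins for `af::dim4`, `AF_MAX_DIMS` and the inline logic in the patch.

```cpp
#include <array>
#include <cassert>

constexpr int kMaxDims = 4;                    // stands in for AF_MAX_DIMS
using Dims = std::array<long long, kMaxDims>;  // stands in for af::dim4

// Mirrors computeDims: reverse the first `rank` dims, zero-fill the rest.
inline std::array<int, kMaxDims> reverseLeadingDims(int rank,
                                                    const Dims& dims) {
    assert(rank >= 1 && rank <= 3);
    std::array<int, kMaxDims> out = {};
    for (int i = 0; i < rank; i++) {
        out[i] = static_cast<int>(dims[(rank - 1) - i]);
    }
    return out;
}

// Dimensions at index >= rank are treated as independent transforms.
inline int batchCount(int rank, const Dims& dims) {
    int batch = 1;
    for (int i = rank; i < kMaxDims; i++) {
        batch *= static_cast<int>(dims[i]);
    }
    return batch;
}

// Real-to-complex output keeps only n / 2 + 1 elements along dim 0.
inline Dims r2cOutputDims(const Dims& in) {
    Dims out = in;
    out[0]   = out[0] / 2 + 1;
    return out;
}
```

With dims {8, 6, 5, 3} and rank 2, the transform sizes passed to the plan become {6, 8}, the batch count is 15, and a real-to-complex transform produces 5 elements along the first dimension.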
diff --git a/src/backend/cuda/fft.hpp b/src/backend/cuda/fft.hpp
index bde7c2c545..5cc2bf42e4 100644
--- a/src/backend/cuda/fft.hpp
+++ b/src/backend/cuda/fft.hpp
@@ -9,16 +9,19 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+void setFFTPlanCacheSize(size_t numPlans);
 
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+template<typename T>
+void fft_inplace(Array<T> &out, const int rank, const bool direction);
 
-template<typename T, int rank, bool direction>
-void fft_common(Array<T> &out, const Array<T> &in);
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank);
 
-}
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/fftconvolve.cpp b/src/backend/cuda/fftconvolve.cpp
new file mode 100644
index 0000000000..cb8359423e
--- /dev/null
+++ b/src/backend/cuda/fftconvolve.cpp
@@ -0,0 +1,123 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <fftconvolve.hpp>
+
+#include <Array.hpp>
+#include <fft.hpp>
+#include <kernel/fftconvolve.hpp>
+#include <af/dim4.hpp>
+
+#include <type_traits>
+
+using af::dim4;
+using std::conditional;
+using std::is_integral;
+using std::is_same;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+dim4 calcPackedSize(Array<T> const& i1, Array<T> const& i2, const int rank) {
+    const dim4& i1d = i1.dims();
+    const dim4& i2d = i2.dims();
+
+    dim_t pd[AF_MAX_DIMS] = {1, 1, 1, 1};
+
+    dim_t max_d0 = (i1d[0] > i2d[0]) ? i1d[0] : i2d[0];
+    dim_t min_d0 = (i1d[0] < i2d[0]) ? i1d[0] : i2d[0];
+    pd[0]        = nextpow2(static_cast<unsigned>(
+        static_cast<int>(ceil(max_d0 / 2.f)) + min_d0 - 1));
+
+    for (int k = 1; k < AF_MAX_DIMS; k++) {
+        if (k < rank) {
+            pd[k] = nextpow2(static_cast<unsigned>(i1d[k] + i2d[k] - 1));
+        } else {
+            pd[k] = i1d[k];
+        }
+    }
+
+    return dim4(pd[0], pd[1], pd[2], pd[3]);
+}
+
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank) {
+    using convT = typename conditional<is_integral<T>::value ||
+                                           is_same<T, float>::value ||
+                                           is_same<T, cfloat>::value,
+                                       float, double>::type;
+    using cT    = typename conditional<is_same<convT, float>::value, cfloat,
+                                    cdouble>::type;
+
+    const dim4& sDims = signal.dims();
+    const dim4& fDims = filter.dims();
+
+    dim4 oDims(1);
+    if (expand) {
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
+            } else {
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
+            }
+        }
+    } else {
+        oDims = sDims;
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
+        }
+    }
+
+    const dim4 spDims       = calcPackedSize<T>(signal, filter, rank);
+    const dim4 fpDims       = calcPackedSize<T>(filter, signal, rank);
+    Array<cT> signal_packed = createEmptyArray<cT>(spDims);
+    Array<cT> filter_packed = createEmptyArray<cT>(fpDims);
+
+    kernel::packDataHelper<cT, T>(signal_packed, filter_packed, signal, filter);
+
+    fft_inplace<cT>(signal_packed, rank, true);
+    fft_inplace<cT>(filter_packed, rank, true);
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::complexMultiplyHelper<T, cT>(signal_packed, filter_packed, kind);
+
+    if (kind == AF_BATCH_RHS) {
+        fft_inplace<cT>(filter_packed, rank, false);
+        kernel::reorderOutputHelper<T, cT>(out, filter_packed, signal, filter,
+                                           expand, rank);
+    } else {
+        fft_inplace<cT>(signal_packed, rank, false);
+        kernel::reorderOutputHelper<T, cT>(out, signal_packed, signal, filter,
+                                           expand, rank);
+    }
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> fftconvolve<T>(Array<T> const&, Array<T> const&, \
+                                     const bool, AF_BATCH_KIND, const int);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(int)
+INSTANTIATE(uchar)
+INSTANTIATE(schar)
+INSTANTIATE(char)
+INSTANTIATE(uintl)
+INSTANTIATE(intl)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cuda
+}  // namespace arrayfire
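Review note on fftconvolve.cpp: calcPackedSize determines the padded shape of the complex buffers that are handed to fft_inplace. The sketch below restates that computation on plain arrays so the padding can be checked by hand. `Shape`, `kDims`, `nextPow2` and `packedSize` are hypothetical names; in particular `nextPow2` is an assumed stand-in for the backend's `nextpow2` helper and is taken here to return the smallest power of two that is not less than its argument.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>

constexpr int kDims = 4;                     // stands in for AF_MAX_DIMS
using Shape = std::array<long long, kDims>;  // stands in for af::dim4

// Assumed behavior of the backend helper: smallest power of two >= x.
inline unsigned nextPow2(unsigned x) {
    unsigned p = 1;
    while (p < x) { p <<= 1; }
    return p;
}

// Mirrors calcPackedSize for two inputs and a convolution rank of 1 to 3.
inline Shape packedSize(const Shape& a, const Shape& b, int rank) {
    assert(rank >= 1 && rank <= 3);
    Shape pd = {1, 1, 1, 1};

    const long long max_d0 = std::max(a[0], b[0]);
    const long long min_d0 = std::min(a[0], b[0]);

    // First dimension: half of the longer side plus the shorter side
    pd[0] = nextPow2(static_cast<unsigned>(
        static_cast<int>(std::ceil(max_d0 / 2.f)) + min_d0 - 1));

    for (int k = 1; k < kDims; k++) {
        if (k < rank) {
            // Convolved dimensions are padded to the full output length
            pd[k] = nextPow2(static_cast<unsigned>(a[k] + b[k] - 1));
        } else {
            // Batch dimensions keep the size of the first argument
            pd[k] = a[k];
        }
    }
    return pd;
}
```

For a 100x50 signal and a 9x5 filter with rank 2, both call orders give a packed shape of {64, 64, 1, 1}: the first dimension is ceil(100 / 2) + 9 - 1 = 58 rounded up to 64, and the second is 50 + 5 - 1 = 54 rounded up to 64.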
diff --git a/src/backend/cuda/fftconvolve.cu b/src/backend/cuda/fftconvolve.cu
deleted file mode 100644
index 14b7da0246..0000000000
--- a/src/backend/cuda/fftconvolve.cu
+++ /dev/null
@@ -1,123 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <fftconvolve.hpp>
-#include <kernel/fftconvolve.hpp>
-#include <err_cuda.hpp>
-
-#include <fft.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-static const dim4 calcPackedSize(Array<T> const& i1,
-                                 Array<T> const& i2,
-                                 const dim_t baseDim)
-{
-    const dim4 i1d = i1.dims();
-    const dim4 i2d = i2.dims();
-
-    dim_t pd[4];
-
-    // Pack both signal and filter on same memory array, this will ensure
-    // better use of batched cuFFT capabilities
-    for (dim_t k = 0; k < 4; k++) {
-        if (k == 0)
-            pd[k] = nextpow2((unsigned)(i1d[k] + i2d[k] - 1)) / 2;
-        else if (k < baseDim)
-            pd[k] = nextpow2((unsigned)(i1d[k] + i2d[k] - 1));
-        else if (k == baseDim)
-            pd[k] = i1d[k];
-        else
-            pd[k] = 1;
-    }
-
-    return dim4(pd[0], pd[1], pd[2], pd[3]);
-}
-
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind)
-{
-    const dim4 sDims = signal.dims();
-    const dim4 fDims = filter.dims();
-
-    dim4 oDims(1);
-    if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sDims[d]+fDims[d]-1;
-            } else {
-                oDims[d] = (d<baseDim ? sDims[d]+fDims[d]-1 : sDims[d]);
-            }
-        }
-    } else {
-        oDims = sDims;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fDims[i];
-        }
-    }
-
-    const dim4 spDims = calcPackedSize<T>(signal, filter, baseDim);
-    const dim4 fpDims = calcPackedSize<T>(filter, signal, baseDim);
-    Array<cT> signal_packed = createEmptyArray<cT>(spDims);
-    Array<cT> filter_packed = createEmptyArray<cT>(fpDims);
-
-    kernel::packDataHelper<cT, T>(signal_packed, filter_packed, signal, filter, baseDim);
-
-    fft_common<cT, baseDim, true>(signal_packed, signal_packed);
-    fft_common<cT, baseDim, true>(filter_packed, filter_packed);
-
-    Array<T> out = createEmptyArray<T>(oDims);
-
-    if (expand)
-        kernel::complexMultiplyHelper<T, cT>(out, signal_packed, filter_packed, signal, filter, kind);
-    else
-        kernel::complexMultiplyHelper<T, cT>(out, signal_packed, filter_packed, signal, filter, kind);
-
-    if (kind == ONE2MANY) {
-        fft_common<cT, baseDim, false>(filter_packed, filter_packed);
-        if (expand)
-            kernel::reorderOutputHelper<T, cT, roundOut, baseDim, true >(out, filter_packed, signal, filter, kind);
-        else
-            kernel::reorderOutputHelper<T, cT, roundOut, baseDim, false>(out, filter_packed, signal, filter, kind);
-    } else {
-        fft_common<cT, baseDim, false>(signal_packed, signal_packed);
-        if (expand)
-            kernel::reorderOutputHelper<T, cT, roundOut, baseDim, true >(out, signal_packed, signal, filter, kind);
-        else
-            kernel::reorderOutputHelper<T, cT, roundOut, baseDim, false>(out, signal_packed, signal, filter, kind);
-    }
-
-    return out;
-}
-
-#define INSTANTIATE(T, convT, cT, isDouble, roundOut)                                                   \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 1>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 2>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 3>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-INSTANTIATE(double, double, cdouble, true , false)
-INSTANTIATE(float , float,  cfloat,  false, false)
-INSTANTIATE(uint  , float,  cfloat,  false, true)
-INSTANTIATE(int   , float,  cfloat,  false, true)
-INSTANTIATE(uchar , float,  cfloat,  false, true)
-INSTANTIATE(char  , float,  cfloat,  false, true)
-
-}
diff --git a/src/backend/cuda/fftconvolve.hpp b/src/backend/cuda/fftconvolve.hpp
index 5eea28d376..c158bdaa3d 100644
--- a/src/backend/cuda/fftconvolve.hpp
+++ b/src/backend/cuda/fftconvolve.hpp
@@ -8,12 +8,12 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-}
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/flood_fill.cpp b/src/backend/cuda/flood_fill.cpp
new file mode 100644
index 0000000000..2165f8a6c8
--- /dev/null
+++ b/src/backend/cuda/flood_fill.cpp
@@ -0,0 +1,40 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <flood_fill.hpp>
+
+#include <err_cuda.hpp>
+#include <kernel/flood_fill.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup) {
+    auto out = createValueArray(image.dims(), T(0));
+    kernel::floodFill<T>(out, image, seedsX, seedsY, newValue, lowValue,
+                         highValue, nlookup);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> floodFill(const Array<T>&, const Array<uint>&,           \
+                                const Array<uint>&, const T, const T, const T, \
+                                const af::connectivity);
+
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(ushort)
+INSTANTIATE(uchar)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/flood_fill.hpp b/src/backend/cuda/flood_fill.hpp
new file mode 100644
index 0000000000..6716abeae7
--- /dev/null
+++ b/src/backend/cuda/flood_fill.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup = AF_CONNECTIVITY_8);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/gradient.cpp b/src/backend/cuda/gradient.cpp
new file mode 100644
index 0000000000..b7274a736f
--- /dev/null
+++ b/src/backend/cuda/gradient.cpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gradient.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/gradient.hpp>
+#include <math.hpp>
+
+#include <stdexcept>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in) {
+    kernel::gradient<T>(grad0, grad1, in);
+}
+
+#define INSTANTIATE(T)                                            \
+    template void gradient<T>(Array<T> & grad0, Array<T> & grad1, \
+                              const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
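Note on the `INSTANTIATE` pattern used in `gradient.cpp` and the neighboring files: the header only declares the template, while the `.cpp` defines it and then explicitly instantiates it once per supported element type. Other translation units can then link against those symbols without seeing the definition. A minimal standalone sketch of the same idea follows. It uses a hypothetical `scale` function over `std::vector` instead of ArrayFire's `Array<T>`, so none of these names are part of ArrayFire.

```cpp
#include <complex>
#include <vector>

// Declaration, as it would appear in a header (hypothetical function).
template<typename T>
std::vector<T> scale(const std::vector<T> &in, const T factor);

// Definition, as it would appear in exactly one .cpp file.
template<typename T>
std::vector<T> scale(const std::vector<T> &in, const T factor) {
    std::vector<T> out;
    out.reserve(in.size());
    for (const T &v : in) { out.push_back(v * factor); }
    return out;
}

// Explicit instantiation: one line per element type the library supports.
#define INSTANTIATE(T) \
    template std::vector<T> scale<T>(const std::vector<T> &, const T);

INSTANTIATE(float)
INSTANTIATE(double)
INSTANTIATE(std::complex<float>)
INSTANTIATE(std::complex<double>)
#undef INSTANTIATE
```

When the definition lives only in the `.cpp`, using a type that is not in the `INSTANTIATE` list fails at link time rather than silently compiling a new instantiation, which keeps the set of supported types explicit.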
diff --git a/src/backend/cuda/gradient.cu b/src/backend/cuda/gradient.cu
deleted file mode 100644
index 30b36b0bfa..0000000000
--- a/src/backend/cuda/gradient.cu
+++ /dev/null
@@ -1,32 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <gradient.hpp>
-#include <kernel/gradient.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in)
-    {
-        kernel::gradient<T>(grad0, grad1, in);
-    }
-
-#define INSTANTIATE(T)                                                                  \
-    template void gradient<T>(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);    \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-}
diff --git a/src/backend/cuda/gradient.hpp b/src/backend/cuda/gradient.hpp
index ecae97d854..46ff6db000 100644
--- a/src/backend/cuda/gradient.hpp
+++ b/src/backend/cuda/gradient.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/hamming.cu b/src/backend/cuda/hamming.cu
deleted file mode 100644
index 2022675e1c..0000000000
--- a/src/backend/cuda/hamming.cu
+++ /dev/null
@@ -1,62 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-#include <handle.hpp>
-#include <kernel/hamming.hpp>
-#include <kernel/transpose.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist)
-{
-    uint sample_dim = (dist_dim == 0) ? 1 : 0;
-    const dim4 qDims = query.dims();
-    const dim4 tDims = train.dims();
-
-    const dim4 outDims(n_dist, qDims[sample_dim]);
-
-    idx  = createEmptyArray<uint>(outDims);
-    dist = createEmptyArray<uint>(outDims);
-
-    Array<T> queryT = query;
-    Array<T> trainT = train;
-
-    if (dist_dim == 0) {
-        const dim4 queryTDims = dim4(qDims[1], qDims[0], qDims[2], qDims[3]);
-        const dim4 trainTDims = dim4(tDims[1], tDims[0], tDims[2], tDims[3]);
-        queryT = createEmptyArray<T>(queryTDims);
-        trainT = createEmptyArray<T>(trainTDims);
-
-        kernel::transpose<T, false>(queryT, query, query.ndims());
-        kernel::transpose<T, false>(trainT, train, train.ndims());
-    }
-
-    kernel::hamming_matcher<T>(idx, dist, queryT, trainT, 1, n_dist);
-}
-
-#define INSTANTIATE(T)\
-    template void hamming_matcher<T>(Array<uint>& idx, Array<uint>& dist,           \
-                                     const Array<T>& query, const Array<T>& train,  \
-                                     const uint dist_dim, const uint n_dist);
-
-INSTANTIATE(uchar)
-INSTANTIATE(uint)
-
-}
diff --git a/src/backend/cuda/hamming.hpp b/src/backend/cuda/hamming.hpp
deleted file mode 100644
index 37d3944c43..0000000000
--- a/src/backend/cuda/hamming.hpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-
-using af::features;
-
-namespace cuda
-{
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist);
-
-}
diff --git a/src/backend/cuda/harris.cu b/src/backend/cuda/harris.cu
new file mode 100644
index 0000000000..1c9c9a482c
--- /dev/null
+++ b/src/backend/cuda/harris.cu
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/harris.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr) {
+    unsigned nfeat;
+    float *d_x_out;
+    float *d_y_out;
+    float *d_score_out;
+
+    kernel::harris<T, convAccT>(&nfeat, &d_x_out, &d_y_out, &d_score_out, in,
+                                max_corners, min_response, sigma, filter_len,
+                                k_thr);
+
+    if (nfeat > 0) {
+        const dim4 out_dims(nfeat);
+
+        x_out     = createDeviceDataArray<float>(out_dims, d_x_out);
+        y_out     = createDeviceDataArray<float>(out_dims, d_y_out);
+        score_out = createDeviceDataArray<float>(out_dims, d_score_out);
+    }
+
+    return nfeat;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned harris<T, convAccT>(                                    \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned max_corners,                       \
+        const float min_response, const float sigma,                          \
+        const unsigned filter_len, const float k_thr);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/harris.hpp b/src/backend/cuda/harris.hpp
new file mode 100644
index 0000000000..4cf4fc8084
--- /dev/null
+++ b/src/backend/cuda/harris.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/hist_graphics.cpp b/src/backend/cuda/hist_graphics.cpp
new file mode 100644
index 0000000000..cabadeb1ad
--- /dev/null
+++ b/src/backend/cuda/hist_graphics.cpp
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
+#include <err_cuda.hpp>
+#include <hist_graphics.hpp>
+
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_histogram(const Array<T> &data, fg_histogram hist) {
+    auto stream = getActiveStream();
+    if (DeviceManager::checkGraphicsInteropCapability()) {
+        const T *d_P = data.get();
+
+        auto res = interopManager().getHistogramResources(hist);
+
+        size_t bytes = 0;
+        T *d_vbo     = NULL;
+        cudaGraphicsMapResources(1, res[0].get(), stream);
+        cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &bytes,
+                                             *(res[0].get()));
+        cudaMemcpyAsync(d_vbo, d_P, bytes, cudaMemcpyDeviceToDevice, stream);
+        cudaGraphicsUnmapResources(1, res[0].get(), stream);
+
+        CheckGL("After cuda resource copy");
+
+        POST_LAUNCH_CHECK();
+    } else {
+        ForgeModule &_ = common::forgePlugin();
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_histogram_vertex_buffer(&buffer, hist));
+        FG_CHECK(_.fg_get_histogram_vertex_buffer_size(&bytes, hist));
+
+        CheckGL("Begin CUDA fallback-resource copy");
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, data.get(), bytes,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+        CheckGL("End CUDA fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T) \
+    template void copy_histogram<T>(const Array<T> &, fg_histogram);
+
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/hist_graphics.cu b/src/backend/cuda/hist_graphics.cu
deleted file mode 100644
index d1424d8157..0000000000
--- a/src/backend/cuda/hist_graphics.cu
+++ /dev/null
@@ -1,52 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined (WITH_GRAPHICS)
-
-#include <interopManager.hpp>
-#include <Array.hpp>
-#include <hist_graphics.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-
-namespace cuda
-{
-
-template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist)
-{
-    const T *d_P = data.get();
-
-    InteropManager& intrpMngr = InteropManager::getInstance();
-
-    cudaGraphicsResource *cudaVBOResource = intrpMngr.getBufferResource(hist);
-    // Map resource. Copy data to VBO. Unmap resource.
-    size_t num_bytes = hist->size();
-    T* d_vbo = NULL;
-    cudaGraphicsMapResources(1, &cudaVBOResource, 0);
-    cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &num_bytes, cudaVBOResource);
-    cudaMemcpy(d_vbo, d_P, num_bytes, cudaMemcpyDeviceToDevice);
-    cudaGraphicsUnmapResources(1, &cudaVBOResource, 0);
-
-    CheckGL("After cuda resource copy");
-
-    POST_LAUNCH_CHECK();
-}
-
-#define INSTANTIATE(T)  \
-    template void copy_histogram<T>(const Array<T> &data, const fg::Histogram* hist);
-
-INSTANTIATE(float)
-INSTANTIATE(int)
-INSTANTIATE(uint)
-INSTANTIATE(uchar)
-
-}
-
-#endif  // WITH_GRAPHICS
diff --git a/src/backend/cuda/hist_graphics.hpp b/src/backend/cuda/hist_graphics.hpp
index c8a6176b59..348d84ba3c 100644
--- a/src/backend/cuda/hist_graphics.hpp
+++ b/src/backend/cuda/hist_graphics.hpp
@@ -9,18 +9,14 @@
 
 #pragma once
 
-#if defined (WITH_GRAPHICS)
-
-#include <graphics_common.hpp>
 #include <Array.hpp>
+#include <common/graphics_common.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist);
-
-}
-
-#endif
+void copy_histogram(const Array<T> &data, fg_histogram hist);
 
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/histogram.cpp b/src/backend/cuda/histogram.cpp
new file mode 100644
index 0000000000..f012d6e64b
--- /dev/null
+++ b/src/backend/cuda/histogram.cpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <histogram.hpp>
+#include <kernel/histogram.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear) {
+    const dim4 &dims = in.dims();
+    dim4 outDims     = dim4(nbins, 1, dims[2], dims[3]);
+    Array<uint> out  = createValueArray<uint>(outDims, uint(0));
+    kernel::histogram<T>(out, in, nbins, minval, maxval, isLinear);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                    \
+    template Array<uint> histogram<T>(const Array<T> &, const unsigned &, \
+                                      const double &, const double &,     \
+                                      const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
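Note on `histogram.cpp`: the host wrapper only allocates a zero-filled `nbins x 1 x dims[2] x dims[3]` output and forwards to `kernel::histogram`, whose body is not part of this diff. The sketch below is a CPU illustration of what a clamped, equal-width binning over one batch could look like. The binning rule (`step = (maxval - minval) / nbins`, out-of-range values clamped to the first or last bin) and the name `histogramSketch` are assumptions made for illustration and are not taken from the kernel.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical CPU sketch, not the ArrayFire kernel.
// Assumes nbins > 0 and maxval > minval.
std::vector<unsigned> histogramSketch(const std::vector<float> &in,
                                      const unsigned nbins,
                                      const double minval,
                                      const double maxval) {
    // Mirrors createValueArray<uint>(outDims, uint(0)) for a single batch.
    std::vector<unsigned> out(nbins, 0u);
    const double step = (maxval - minval) / nbins;
    for (const float v : in) {
        int bin = static_cast<int>((v - minval) / step);
        // Assumed policy: clamp out-of-range samples to the edge bins.
        bin = std::max(bin, 0);
        bin = std::min(bin, static_cast<int>(nbins) - 1);
        ++out[bin];
    }
    return out;
}
```

For example, with `nbins = 4` over `[0, 4]`, the samples `{0, 0.5, 1.5, 3.9, 4.0}` land in bins `0, 0, 1, 3, 3` under this assumed rule.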
diff --git a/src/backend/cuda/histogram.cu b/src/backend/cuda/histogram.cu
deleted file mode 100644
index e9a980fa22..0000000000
--- a/src/backend/cuda/histogram.cu
+++ /dev/null
@@ -1,62 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <histogram.hpp>
-#include <kernel/histogram.hpp>
-#include <err_cuda.hpp>
-#include <vector>
-
-using af::dim4;
-using std::vector;
-
-namespace cuda
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval)
-{
-
-    ARG_ASSERT(1, (nbins<=kernel::MAX_BINS));
-
-    const dim4 dims     = in.dims();
-    dim4 outDims        = dim4(nbins, 1, dims[2], dims[3]);
-    Array<outType> out  = createValueArray<outType>(outDims, outType(0));
-
-    // create an array to hold min and max values for
-    // batch operation handling, this will reduce
-    // number of concurrent reads to one single memory location
-    dim_t mmNElems= dims[2] * dims[3];
-    cfloat init;
-    init.x = minval;
-    init.y = maxval;
-    vector<cfloat> h_minmax(mmNElems, init);
-
-    dim4 minmax_dims(mmNElems*2);
-    Array<cfloat> minmax = createHostDataArray<cfloat>(minmax_dims, &h_minmax.front());
-
-    kernel::histogram<inType, outType>(out, in, minmax.get(), nbins);
-
-    return out;
-}
-
-#define INSTANTIATE(in_t,out_t)\
-template Array<out_t> histogram(const Array<in_t> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-INSTANTIATE(float , uint)
-INSTANTIATE(double, uint)
-INSTANTIATE(char  , uint)
-INSTANTIATE(int   , uint)
-INSTANTIATE(uint  , uint)
-INSTANTIATE(uchar , uint)
-
-}
diff --git a/src/backend/cuda/histogram.hpp b/src/backend/cuda/histogram.hpp
index 8fa4592f29..f9498d422c 100644
--- a/src/backend/cuda/histogram.hpp
+++ b/src/backend/cuda/histogram.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/homography.cu b/src/backend/cuda/homography.cu
new file mode 100644
index 0000000000..7b70064902
--- /dev/null
+++ b/src/backend/cuda/homography.cu
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <err_cuda.hpp>
+#include <kernel/homography.hpp>
+#include <af/dim4.hpp>
+#include <algorithm>
+
+#include <limits>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+#define RANSACConfidence 0.99f
+#define LMEDSConfidence 0.99f
+#define LMEDSOutlierRatio 0.4f
+
+template<typename T>
+int homography(Array<T> &bestH, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations) {
+    const af::dim4 idims    = x_src.dims();
+    const unsigned nsamples = idims[0];
+
+    unsigned iter    = iterations;
+    Array<float> err = createEmptyArray<float>(dim4());
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        iter = ::std::min(
+            iter, (unsigned)(log(1.f - LMEDSConfidence) /
+                             log(1.f - pow(1.f - LMEDSOutlierRatio, 4.f))));
+        err = createValueArray<float>(af::dim4(nsamples, iter),
+                                      std::numeric_limits<float>::max());
+    }
+
+    af::dim4 rdims(4, iter);
+    Array<float> fctr = createValueArray<float>(rdims, (float)nsamples);
+    Array<float> rnd  = arithOp<float, af_mul_t>(initial, fctr, rdims);
+
+    Array<T> tmpH = createValueArray<T>(af::dim4(9, iter), (T)0);
+
+    return kernel::computeH<T>(bestH, tmpH, err, x_src, y_src, x_dst, y_dst,
+                               rnd, iter, nsamples, inlier_thr, htype);
+}
+
+#define INSTANTIATE(T)                                                      \
+    template int homography<T>(                                             \
+        Array<T> & H, const Array<float> &x_src, const Array<float> &y_src, \
+        const Array<float> &x_dst, const Array<float> &y_dst,               \
+        const Array<float> &initial, const af_homography_type htype,        \
+        const float inlier_thr, const unsigned iterations);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+
+}  // namespace cuda
+}  // namespace arrayfire
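Note on the LMedS branch in `homography.cu`: the requested iteration count is capped at `log(1 - confidence) / log(1 - (1 - outlierRatio)^4)`. The exponent 4 presumably reflects the four point correspondences drawn per homography sample (the `rdims(4, iter)` random index array points the same way). With the constants from the diff, `0.6^4 = 0.1296` and `ln(0.01) / ln(0.8704)` is about 33.2, so the cast truncates the cap to 33. The helper below reproduces only that arithmetic; `lmedsIterations` and the `k`-prefixed constant names are hypothetical and do not exist in ArrayFire.

```cpp
#include <algorithm>
#include <cmath>

// Values copied from the #defines in homography.cu.
constexpr float kLMEDSConfidence   = 0.99f;
constexpr float kLMEDSOutlierRatio = 0.4f;

// Hypothetical helper: iteration count after applying the LMedS cap.
unsigned lmedsIterations(const unsigned requested) {
    // Probability that all 4 sampled correspondences are inliers.
    const float allInliers = std::pow(1.f - kLMEDSOutlierRatio, 4.f);
    // Samples needed to draw an all-inlier set with the given confidence.
    const auto cap = static_cast<unsigned>(
        std::log(1.f - kLMEDSConfidence) / std::log(1.f - allInliers));
    return std::min(requested, cap);
}
```

A caller asking for 1000 iterations would therefore run 33, while a request for 10 is left unchanged.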
diff --git a/src/backend/cuda/homography.hpp b/src/backend/cuda/homography.hpp
new file mode 100644
index 0000000000..95c4bdf853
--- /dev/null
+++ b/src/backend/cuda/homography.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+int homography(Array<T> &H, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/hsv_rgb.cpp b/src/backend/cuda/hsv_rgb.cpp
new file mode 100644
index 0000000000..d4eda7ef58
--- /dev/null
+++ b/src/backend/cuda/hsv_rgb.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <hsv_rgb.hpp>
+#include <kernel/hsv_rgb.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> hsv2rgb(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::hsv2rgb_convert<T>(out, in, true);
+    return out;
+}
+
+template<typename T>
+Array<T> rgb2hsv(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::hsv2rgb_convert<T>(out, in, false);
+    return out;
+}
+
+#define INSTANTIATE(T)                                \
+    template Array<T> hsv2rgb<T>(const Array<T>& in); \
+    template Array<T> rgb2hsv<T>(const Array<T>& in);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/hsv_rgb.cu b/src/backend/cuda/hsv_rgb.cu
deleted file mode 100644
index f2e4f3f84d..0000000000
--- a/src/backend/cuda/hsv_rgb.cu
+++ /dev/null
@@ -1,50 +0,0 @@
-/*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <hsv_rgb.hpp>
-#include <kernel/hsv_rgb.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-Array<T> hsv2rgb(const Array<T>& in)
-{
-    Array<T> out   = createEmptyArray<T>(in.dims());
-
-    kernel::hsv2rgb_convert<T, true>(out, in);
-
-    return out;
-}
-
-template<typename T>
-Array<T> rgb2hsv(const Array<T>& in)
-{
-    Array<T> out   = createEmptyArray<T>(in.dims());
-
-    kernel::hsv2rgb_convert<T, false>(out, in);
-
-    return out;
-}
-
-#define INSTANTIATE(T)  \
-    template Array<T> hsv2rgb<T>(const Array<T>& in); \
-    template Array<T> rgb2hsv<T>(const Array<T>& in); \
-
-INSTANTIATE(double)
-INSTANTIATE(float )
-
-}
diff --git a/src/backend/cuda/hsv_rgb.hpp b/src/backend/cuda/hsv_rgb.hpp
index aef0837d4f..26288245e6 100644
--- a/src/backend/cuda/hsv_rgb.hpp
+++ b/src/backend/cuda/hsv_rgb.hpp
@@ -1,16 +1,16 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 Array<T> hsv2rgb(const Array<T>& in);
@@ -18,4 +18,5 @@ Array<T> hsv2rgb(const Array<T>& in);
 template<typename T>
 Array<T> rgb2hsv(const Array<T>& in);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/identity.cpp b/src/backend/cuda/identity.cpp
new file mode 100644
index 0000000000..ee62dcf549
--- /dev/null
+++ b/src/backend/cuda/identity.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <identity.hpp>
+#include <kernel/identity.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <debug_cuda.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> identity(const dim4& dims) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::identity<T>(out);
+    return out;
+}
+
+#define INSTANTIATE_IDENTITY(T) \
+    template Array<T> identity<T>(const af::dim4& dims);
+
+INSTANTIATE_IDENTITY(float)
+INSTANTIATE_IDENTITY(double)
+INSTANTIATE_IDENTITY(cfloat)
+INSTANTIATE_IDENTITY(cdouble)
+INSTANTIATE_IDENTITY(int)
+INSTANTIATE_IDENTITY(uint)
+INSTANTIATE_IDENTITY(intl)
+INSTANTIATE_IDENTITY(uintl)
+INSTANTIATE_IDENTITY(char)
+INSTANTIATE_IDENTITY(schar)
+INSTANTIATE_IDENTITY(uchar)
+INSTANTIATE_IDENTITY(short)
+INSTANTIATE_IDENTITY(ushort)
+INSTANTIATE_IDENTITY(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/identity.cu b/src/backend/cuda/identity.cu
deleted file mode 100644
index 264d5b8a5e..0000000000
--- a/src/backend/cuda/identity.cu
+++ /dev/null
@@ -1,42 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <Array.hpp>
-#include <identity.hpp>
-#include <debug_cuda.hpp>
-#include <kernel/identity.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> identity(const dim4& dims)
-    {
-        Array<T> out  = createEmptyArray<T>(dims);
-        kernel::identity<T>(out);
-        return out;
-    }
-
-#define INSTANTIATE_IDENTITY(T)                              \
-    template Array<T>  identity<T>    (const af::dim4 &dims);
-
-    INSTANTIATE_IDENTITY(float)
-    INSTANTIATE_IDENTITY(double)
-    INSTANTIATE_IDENTITY(cfloat)
-    INSTANTIATE_IDENTITY(cdouble)
-    INSTANTIATE_IDENTITY(int)
-    INSTANTIATE_IDENTITY(uint)
-    INSTANTIATE_IDENTITY(intl)
-    INSTANTIATE_IDENTITY(uintl)
-    INSTANTIATE_IDENTITY(char)
-    INSTANTIATE_IDENTITY(uchar)
-
-}
diff --git a/src/backend/cuda/identity.hpp b/src/backend/cuda/identity.hpp
index 9b92f7d989..f03d9f6199 100644
--- a/src/backend/cuda/identity.hpp
+++ b/src/backend/cuda/identity.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> identity(const dim4& dim);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> identity(const dim4& dim);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/iir.cpp b/src/backend/cuda/iir.cpp
new file mode 100644
index 0000000000..63a662b885
--- /dev/null
+++ b/src/backend/cuda/iir.cpp
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <convolve.hpp>
+#include <err_cuda.hpp>
+#include <iir.hpp>
+#include <kernel/iir.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x) {
+    AF_BATCH_KIND type = x.ndims() == 1 ? AF_BATCH_NONE : AF_BATCH_SAME;
+    if (x.ndims() != b.ndims()) {
+        type = (x.ndims() < b.ndims()) ? AF_BATCH_RHS : AF_BATCH_LHS;
+    }
+
+    // Extract the first N elements
+    Array<T> c = convolve<T, T>(x, b, type, 1, true);
+    dim4 cdims = c.dims();
+    cdims[0]   = x.dims()[0];
+    c.resetDims(cdims);
+
+    int num_a = a.dims()[0];
+
+    if (num_a == 1) { return c; }
+
+    dim4 ydims = c.dims();
+    Array<T> y = createEmptyArray<T>(ydims);
+
+    if (a.ndims() > 1) {
+        kernel::iir<T, true>(y, c, a);
+    } else {
+        kernel::iir<T, false>(y, c, a);
+    }
+    return y;
+}
+
+#define INSTANTIATE(T)                                          \
+    template Array<T> iir(const Array<T> &b, const Array<T> &a, \
+                          const Array<T> &x);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/iir.cu b/src/backend/cuda/iir.cu
deleted file mode 100644
index 4530406989..0000000000
--- a/src/backend/cuda/iir.cu
+++ /dev/null
@@ -1,64 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <iir.hpp>
-#include <err_cuda.hpp>
-#include <math.hpp>
-#include <arith.hpp>
-#include <convolve.hpp>
-#include <kernel/iir.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x)
-    {
-
-        ConvolveBatchKind type = x.ndims() == 1 ? ONE2ONE : MANY2MANY;
-        if (x.ndims() != b.ndims()) {
-            type = (x.ndims() < b.ndims()) ? ONE2MANY : MANY2ONE;
-        }
-
-        // Extract the first N elements
-        Array<T> c = convolve<T, T, 1, true>(x, b, type);
-        dim4 cdims = c.dims();
-        cdims[0] = x.dims()[0];
-        c.resetDims(cdims);
-
-        int num_a = a.dims()[0];
-
-        if (num_a == 1) return c;
-
-        dim4 ydims = c.dims();
-        Array<T> y = createEmptyArray<T>(ydims);
-
-        if (a.ndims() > 1) {
-            kernel::iir<T,  true>(y, c, a);
-        } else {
-            kernel::iir<T, false>(y, c, a);
-        }
-        return y;
-    }
-
-#define INSTANTIATE(T)                          \
-    template Array<T> iir(const Array<T> &b,    \
-                          const Array<T> &a,    \
-                          const Array<T> &x);   \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-}
diff --git a/src/backend/cuda/iir.hpp b/src/backend/cuda/iir.hpp
index a3f88581dc..1ad18333f3 100644
--- a/src/backend/cuda/iir.hpp
+++ b/src/backend/cuda/iir.hpp
@@ -9,9 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x);
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/image.cpp b/src/backend/cuda/image.cpp
new file mode 100644
index 0000000000..23bccf616e
--- /dev/null
+++ b/src/backend/cuda/image.cpp
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Parts of this code sourced from SnopyDogy
+// https://gist.github.com/SnopyDogy/a9a22497a893ec86aa3e
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
+#include <err_cuda.hpp>
+#include <image.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image) {
+    auto stream = getActiveStream();
+    if (DeviceManager::checkGraphicsInteropCapability()) {
+        auto res = interopManager().getImageResources(image);
+
+        const T *d_X = in.get();
+        size_t bytes = 0;
+        T *d_pixels  = NULL;
+        cudaGraphicsMapResources(1, res[0].get(), stream);
+        cudaGraphicsResourceGetMappedPointer((void **)&d_pixels, &bytes,
+                                             *(res[0].get()));
+        cudaMemcpyAsync(d_pixels, d_X, bytes, cudaMemcpyDeviceToDevice, stream);
+        cudaGraphicsUnmapResources(1, res[0].get(), stream);
+
+        POST_LAUNCH_CHECK();
+        CheckGL("After cuda resource copy");
+    } else {
+        ForgeModule &_ = common::forgePlugin();
+        CheckGL("Begin CUDA fallback-resource copy");
+        unsigned data_size = 0, buffer = 0;
+        FG_CHECK(_.fg_get_image_size(&data_size, image));
+        FG_CHECK(_.fg_get_pixel_buffer(&buffer, image));
+
+        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer);
+        glBufferData(GL_PIXEL_UNPACK_BUFFER, data_size, 0, GL_STREAM_DRAW);
+        auto *ptr = static_cast<GLubyte *>(
+            glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, in.get(), data_size,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
+        }
+        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
+        CheckGL("End CUDA fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T) template void copy_image<T>(const Array<T> &, fg_image);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/image.cu b/src/backend/cuda/image.cu
deleted file mode 100644
index bf6e7dfb58..0000000000
--- a/src/backend/cuda/image.cu
+++ /dev/null
@@ -1,58 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-// Parts of this code sourced from SnopyDogy
-// https://gist.github.com/SnopyDogy/a9a22497a893ec86aa3e
-
-#if defined(WITH_GRAPHICS)
-
-#include <Array.hpp>
-#include <image.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <interopManager.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-void copy_image(const Array<T> &in, const fg::Image* image)
-{
-    InteropManager& intrpMngr = InteropManager::getInstance();
-
-    cudaGraphicsResource *cudaPBOResource = intrpMngr.getBufferResource(image);
-
-    const T *d_X = in.get();
-    // Map resource. Copy data to PBO. Unmap resource.
-    size_t num_bytes;
-    T* d_pbo = NULL;
-    cudaGraphicsMapResources(1, &cudaPBOResource, 0);
-    cudaGraphicsResourceGetMappedPointer((void **)&d_pbo, &num_bytes, cudaPBOResource);
-    cudaMemcpy(d_pbo, d_X, num_bytes, cudaMemcpyDeviceToDevice);
-    cudaGraphicsUnmapResources(1, &cudaPBOResource, 0);
-
-    POST_LAUNCH_CHECK();
-    CheckGL("After cuda resource copy");
-}
-
-#define INSTANTIATE(T)      \
-    template void copy_image<T>(const Array<T> &in, const fg::Image* image);
-
-INSTANTIATE(float)
-INSTANTIATE(double)
-INSTANTIATE(int)
-INSTANTIATE(uint)
-INSTANTIATE(uchar)
-INSTANTIATE(char)
-
-}
-
-#endif
diff --git a/src/backend/cuda/image.hpp b/src/backend/cuda/image.hpp
index ece664b3c2..2a98743dd4 100644
--- a/src/backend/cuda/image.hpp
+++ b/src/backend/cuda/image.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cuda {
 
-namespace cuda
-{
-    template<typename T>
-    void copy_image(const Array<T> &in, const fg::Image* image);
-}
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image);
 
-#endif
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/index.cpp b/src/backend/cuda/index.cpp
new file mode 100644
index 0000000000..dbb7d1ad60
--- /dev/null
+++ b/src/backend/cuda/index.cpp
@@ -0,0 +1,100 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <index.hpp>
+
+#include <Array.hpp>
+#include <assign_kernel_param.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <handle.hpp>
+#include <kernel/index.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> index(const Array<T>& in, const af_index_t idxrs[]) {
+    IndexKernelParam p;
+    std::vector<af_seq> seqs(4, af_span);
+    // create seq vector to retrieve output
+    // dimensions & offsets
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
+    }
+
+    // retrieve dimensions, strides and offsets
+    const dim4& iDims = in.dims();
+    dim4 dDims        = in.getDataDims();
+    dim4 oDims        = toDims(seqs, iDims);
+    dim4 iOffs        = toOffset(seqs, dDims);
+    dim4 iStrds       = in.strides();
+
+    for (dim_t i = 0; i < 4; ++i) {
+        p.isSeq[i] = idxrs[i].isSeq;
+        p.offs[i]  = iOffs[i];
+        p.strds[i] = iStrds[i];
+        p.steps[i] = 0;
+        if (idxrs[i].isSeq) {
+            af_seq seq = idxrs[i].idx.seq;
+            // The step for af_span used in the kernel must be 1
+            if (seq.begin == af_span.begin && seq.end == af_span.end &&
+                seq.step == af_span.step)
+                p.steps[i] = 1;
+            else
+                p.steps[i] = seq.step;
+        }
+    }
+
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
+    // loop through the indices to read the af_array indices
+    for (dim_t x = 0; x < 4; ++x) {
+        // set idxPtrs to null
+        p.ptr[x] = 0;
+        // set index pointers where applicable
+        if (!p.isSeq[x]) {
+            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
+            p.ptr[x]   = idxArrs[x].get();
+            // set output array ith dimension value
+            oDims[x] = idxArrs[x].elements();
+        }
+    }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+    if (oDims.elements() == 0) { return out; }
+
+    kernel::index<T>(out, in, p);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> index<T>(const Array<T>& in, const af_index_t idxrs[]);
+
+INSTANTIATE(cdouble)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(int)
+INSTANTIATE(uintl)
+INSTANTIATE(intl)
+INSTANTIATE(uchar)
+INSTANTIATE(schar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/index.cu b/src/backend/cuda/index.cu
deleted file mode 100644
index 988f589ddb..0000000000
--- a/src/backend/cuda/index.cu
+++ /dev/null
@@ -1,85 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <handle.hpp>
-#include <index.hpp>
-#include <kernel/index.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-Array<T> index(const Array<T>& in, const af_index_t idxrs[])
-{
-    kernel::IndexKernelParam_t p;
-    std::vector<af_seq> seqs(4, af_span);
-    // create seq vector to retrieve output
-    // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
-        if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
-        }
-    }
-
-    // retrieve dimensions, strides and offsets
-    dim4 iDims = in.dims();
-    dim4 dDims = in.getDataDims();
-    dim4 oDims = toDims  (seqs, iDims);
-    dim4 iOffs = toOffset(seqs, dDims);
-    dim4 iStrds= toStride(seqs, dDims);
-
-    for (dim_t i=0; i<4; ++i) {
-        p.isSeq[i] = idxrs[i].isSeq;
-        p.offs[i]  = iOffs[i];
-        p.strds[i] = iStrds[i];
-    }
-
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
-    // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
-        // set idxPtrs to null
-        p.ptr[x] = 0;
-        // set index pointers were applicable
-        if (!p.isSeq[x]) {
-            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
-            p.ptr[x] = idxArrs[x].get();
-            // set output array ith dimension value
-            oDims[x] = idxArrs[x].elements();
-        }
-    }
-
-    Array<T> out = createEmptyArray<T>(oDims);
-
-    kernel::index<T>(out, in, p);
-
-    return out;
-}
-
-#define INSTANTIATE(T) \
-    template Array<T> index<T>(const Array<T>& in, const af_index_t idxrs[]);
-
-INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
-
-}
diff --git a/src/backend/cuda/index.hpp b/src/backend/cuda/index.hpp
index 52e0201c56..5966078eaf 100644
--- a/src/backend/cuda/index.hpp
+++ b/src/backend/cuda/index.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/index.h>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 Array<T> index(const Array<T>& in, const af_index_t idxrs[]);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/interopManager.cu b/src/backend/cuda/interopManager.cu
deleted file mode 100644
index 7d44caeb88..0000000000
--- a/src/backend/cuda/interopManager.cu
+++ /dev/null
@@ -1,94 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-// Parts of this code sourced from SnopyDogy
-// https://gist.github.com/SnopyDogy/a9a22497a893ec86aa3e
-
-#if defined(WITH_GRAPHICS)
-
-#include <interopManager.hpp>
-#include <err_cuda.hpp>
-#include <cstdio>
-
-namespace cuda
-{
-
-void InteropManager::destroyResources()
-{
-    int n = getActiveDeviceId();
-    for(iter_t iter = interop_maps[n].begin(); iter != interop_maps[n].end(); iter++)
-        CUDA_CHECK(cudaGraphicsUnregisterResource(iter->second));
-}
-
-InteropManager::~InteropManager()
-{
-    for(int i = 0; i < getDeviceCount(); i++) {
-        setDevice(i);
-        destroyResources();
-    }
-}
-
-InteropManager& InteropManager::getInstance()
-{
-    static InteropManager my_instance;
-    return my_instance;
-}
-
-cudaGraphicsResource* InteropManager::getBufferResource(const fg::Image* key)
-{
-    int device = getActiveDeviceId();
-    void* key_value = (void*)key;
-
-    if(interop_maps[device].find(key_value) == interop_maps[device].end()) {
-        cudaGraphicsResource *cudaPBOResource;
-        // Register PBO with CUDA
-        CUDA_CHECK(cudaGraphicsGLRegisterBuffer(&cudaPBOResource, key->pbo(), cudaGraphicsMapFlagsWriteDiscard));
-        interop_maps[device][key_value] = cudaPBOResource;
-    }
-
-    return interop_maps[device][key_value];
-}
-
-cudaGraphicsResource* InteropManager::getBufferResource(const fg::Plot* key)
-{
-    int device = getActiveDeviceId();
-    void* key_value = (void*)key;
-
-    iter_t iter = interop_maps[device].find(key_value);
-
-    if(interop_maps[device].find(key_value) == interop_maps[device].end()) {
-        cudaGraphicsResource *cudaVBOResource;
-        // Register VBO with CUDA
-        CUDA_CHECK(cudaGraphicsGLRegisterBuffer(&cudaVBOResource, key->vbo(), cudaGraphicsMapFlagsWriteDiscard));
-        interop_maps[device][key_value] = cudaVBOResource;
-    }
-
-    return interop_maps[device][key_value];
-}
-
-cudaGraphicsResource* InteropManager::getBufferResource(const fg::Histogram* key)
-{
-    int device = getActiveDeviceId();
-    void* key_value = (void*)key;
-
-    iter_t iter = interop_maps[device].find(key_value);
-
-    if(interop_maps[device].find(key_value) == interop_maps[device].end()) {
-        cudaGraphicsResource *cudaVBOResource;
-        // Register VBO with CUDA
-        CUDA_CHECK(cudaGraphicsGLRegisterBuffer(&cudaVBOResource, key->vbo(), cudaGraphicsMapFlagsWriteDiscard));
-        interop_maps[device][key_value] = cudaVBOResource;
-    }
-
-    return interop_maps[device][key_value];
-}
-
-}
-
-#endif
diff --git a/src/backend/cuda/interopManager.hpp b/src/backend/cuda/interopManager.hpp
deleted file mode 100644
index f6d3904eb5..0000000000
--- a/src/backend/cuda/interopManager.hpp
+++ /dev/null
@@ -1,54 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-// Parts of this code sourced from SnopyDogy
-// https://gist.github.com/SnopyDogy/a9a22497a893ec86aa3e
-
-#if defined(WITH_GRAPHICS)
-
-#include <platform.hpp>
-#include <graphics_common.hpp>
-
-#include <cuda.h>
-#include <cuda_runtime.h>
-#include <cuda_gl_interop.h>
-
-#include <map>
-
-using af::dim4;
-
-namespace cuda
-{
-
-typedef std::map<void *, cudaGraphicsResource *> interop_t;
-typedef interop_t::iterator iter_t;
-
-// Manager Class for cudaPBOResource: calls garbage collection at the end of the program
-class InteropManager
-{
-    private:
-        interop_t interop_maps[DeviceManager::MAX_DEVICES];
-
-    public:
-        static InteropManager& getInstance();
-        ~InteropManager();
-        cudaGraphicsResource* getBufferResource(const fg::Image* handle);
-        cudaGraphicsResource* getBufferResource(const fg::Plot* handle);
-        cudaGraphicsResource* getBufferResource(const fg::Histogram* handle);
-
-    protected:
-        InteropManager() {}
-        InteropManager(InteropManager const&);
-        void operator=(InteropManager const&);
-        void destroyResources();
-};
-
-}
-
-#endif
diff --git a/src/backend/cuda/inverse.cpp b/src/backend/cuda/inverse.cpp
new file mode 100644
index 0000000000..db7059d4a9
--- /dev/null
+++ b/src/backend/cuda/inverse.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <inverse.hpp>
+
+#include <identity.hpp>
+#include <solve.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> inverse(const Array<T> &in) {
+    Array<T> I = identity<T>(in.dims());
+    return solve<T>(in, I);
+}
+
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/inverse.cu b/src/backend/cuda/inverse.cu
deleted file mode 100644
index 96295f39ac..0000000000
--- a/src/backend/cuda/inverse.cu
+++ /dev/null
@@ -1,60 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <inverse.hpp>
-#include <err_common.hpp>
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <solve.hpp>
-#include <identity.hpp>
-#include <handle.hpp>
-
-namespace cuda
-{
-
-template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
-    Array<T> I = identity<T>(in.dims());
-    return solve<T>(in, I);
-}
-
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
-
-INSTANTIATE(float)
-INSTANTIATE(cfloat)
-INSTANTIATE(double)
-INSTANTIATE(cdouble)
-
-}
-
-#else
-namespace cuda
-{
-
-template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-              AF_ERR_NOT_CONFIGURED);
-}
-
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
-
-INSTANTIATE(float)
-INSTANTIATE(cfloat)
-INSTANTIATE(double)
-INSTANTIATE(cdouble)
-
-}
-
-#endif
diff --git a/src/backend/cuda/inverse.hpp b/src/backend/cuda/inverse.hpp
index a8eb3eaa96..7c662b8cda 100644
--- a/src/backend/cuda/inverse.hpp
+++ b/src/backend/cuda/inverse.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> inverse(const Array<T> &in);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> inverse(const Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/iota.cpp b/src/backend/cuda/iota.cpp
new file mode 100644
index 0000000000..0ac6dbee74
--- /dev/null
+++ b/src/backend/cuda/iota.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <iota.hpp>
+#include <kernel/iota.hpp>
+#include <math.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> iota(const dim4 &dims, const dim4 &tile_dims) {
+    dim4 outdims = dims * tile_dims;
+
+    Array<T> out = createEmptyArray<T>(outdims);
+    kernel::iota<T>(out, dims);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/iota.cu b/src/backend/cuda/iota.cu
deleted file mode 100644
index b5aa16cab4..0000000000
--- a/src/backend/cuda/iota.cu
+++ /dev/null
@@ -1,39 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <iota.hpp>
-#include <kernel/iota.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> iota(const dim4 &dims, const dim4 &tile_dims)
-    {
-        dim4 outdims = dims * tile_dims;
-
-        Array<T> out = createEmptyArray<T>(outdims);
-        kernel::iota<T>(out, dims, tile_dims);
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                          \
-    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-}
-
diff --git a/src/backend/cuda/iota.hpp b/src/backend/cuda/iota.hpp
index a63b3b3259..5232fdddbc 100644
--- a/src/backend/cuda/iota.hpp
+++ b/src/backend/cuda/iota.hpp
@@ -8,13 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
-}
-
-
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/ireduce.cpp b/src/backend/cuda/ireduce.cpp
new file mode 100644
index 0000000000..a2236230d4
--- /dev/null
+++ b/src/backend/cuda/ireduce.cpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <ireduce.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <af/dim4.hpp>
+
+#undef _GLIBCXX_USE_INT128
+#include <err_cuda.hpp>
+#include <kernel/ireduce.hpp>
+
+#include <complex>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim) {
+    Array<uint> rlen = createEmptyArray<uint>(af::dim4(0));
+    kernel::ireduce<T, op>(out, loc.get(), in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen) {
+    kernel::ireduce<T, op>(out, loc.get(), in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in) {
+    return kernel::ireduce_all<T, op>(loc, in);
+}
+
+#define INSTANTIATE(ROp, T)                                           \
+    template void ireduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim); \
+    template void rreduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim,  \
+                                  const Array<uint> &rlen);           \
+    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);
+
+// min
+INSTANTIATE(af_min_t, float)
+INSTANTIATE(af_min_t, double)
+INSTANTIATE(af_min_t, cfloat)
+INSTANTIATE(af_min_t, cdouble)
+INSTANTIATE(af_min_t, int)
+INSTANTIATE(af_min_t, uint)
+INSTANTIATE(af_min_t, intl)
+INSTANTIATE(af_min_t, uintl)
+INSTANTIATE(af_min_t, short)
+INSTANTIATE(af_min_t, ushort)
+INSTANTIATE(af_min_t, char)
+INSTANTIATE(af_min_t, schar)
+INSTANTIATE(af_min_t, uchar)
+INSTANTIATE(af_min_t, half)
+
+// max
+INSTANTIATE(af_max_t, float)
+INSTANTIATE(af_max_t, double)
+INSTANTIATE(af_max_t, cfloat)
+INSTANTIATE(af_max_t, cdouble)
+INSTANTIATE(af_max_t, int)
+INSTANTIATE(af_max_t, uint)
+INSTANTIATE(af_max_t, intl)
+INSTANTIATE(af_max_t, uintl)
+INSTANTIATE(af_max_t, short)
+INSTANTIATE(af_max_t, ushort)
+INSTANTIATE(af_max_t, char)
+INSTANTIATE(af_max_t, schar)
+INSTANTIATE(af_max_t, uchar)
+INSTANTIATE(af_max_t, half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/ireduce.cu b/src/backend/cuda/ireduce.cu
deleted file mode 100644
index 79fffd0b8e..0000000000
--- a/src/backend/cuda/ireduce.cu
+++ /dev/null
@@ -1,64 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <ireduce.hpp>
-
-#undef _GLIBCXX_USE_INT128
-#include <complex>
-#include <kernel/ireduce.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim)
-    {
-        kernel::ireduce<T, op>(out, loc.get(), in, dim);
-    }
-
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in)
-    {
-        return kernel::ireduce_all<T, op>(loc, in);
-    }
-
-#define INSTANTIATE(ROp, T)                                             \
-    template void ireduce<ROp, T>(Array<T> &out, Array<uint> &loc,      \
-                                  const Array<T> &in, const int dim);   \
-    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);  \
-
-    //min
-    INSTANTIATE(af_min_t, float  )
-    INSTANTIATE(af_min_t, double )
-    INSTANTIATE(af_min_t, cfloat )
-    INSTANTIATE(af_min_t, cdouble)
-    INSTANTIATE(af_min_t, int    )
-    INSTANTIATE(af_min_t, uint   )
-    INSTANTIATE(af_min_t, char   )
-    INSTANTIATE(af_min_t, uchar  )
-
-    //max
-    INSTANTIATE(af_max_t, float  )
-    INSTANTIATE(af_max_t, double )
-    INSTANTIATE(af_max_t, cfloat )
-    INSTANTIATE(af_max_t, cdouble)
-    INSTANTIATE(af_max_t, int    )
-    INSTANTIATE(af_max_t, uint   )
-    INSTANTIATE(af_max_t, char   )
-    INSTANTIATE(af_max_t, uchar  )
-}
diff --git a/src/backend/cuda/ireduce.hpp b/src/backend/cuda/ireduce.hpp
index 483cb255e5..f65eb863a4 100644
--- a/src/backend/cuda/ireduce.hpp
+++ b/src/backend/cuda/ireduce.hpp
@@ -7,16 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace cuda
-{
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim);
 
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in);
-}
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen);
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/jit.cpp b/src/backend/cuda/jit.cpp
index 3fb6d3e4f5..171ec66f61 100644
--- a/src/backend/cuda/jit.cpp
+++ b/src/backend/cuda/jit.cpp
@@ -7,453 +7,586 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <map>
-#include <stdexcept>
+#include <Kernel.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/half.hpp>
+#include <common/jit/ModdimNode.hpp>
+#include <common/jit/Node.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/kernel_cache.hpp>
+#include <common/util.hpp>
 #include <copy.hpp>
-#include <JIT/Node.hpp>
-#include <ptx_headers/arith.hpp>
-#include <ptx_headers/logic.hpp>
-#include <ptx_headers/exp.hpp>
-#include <ptx_headers/numeric.hpp>
-#include <ptx_headers/trig.hpp>
-#include <ptx_headers/hyper.hpp>
-#include <ptx_headers/cast.hpp>
-#include <platform.hpp>
-#include <dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
 #include <err_cuda.hpp>
+#include <jit/ShiftNode.hpp>
+#include <kernel_headers/jit_cuh.hpp>
 #include <math.hpp>
-#include <vector>
-#include <nvvm.h>
-#include <boost/functional/hash.hpp>
-#include <boost/scoped_ptr.hpp>
-
-using std::vector;
-using boost::scoped_ptr;
+#include <platform.hpp>
+#include <threadsMgt.hpp>
+#include <type_util.hpp>
+#include <af/dim4.hpp>
 
-namespace cuda
-{
+#include <algorithm>
+#include <cstdlib>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <vector>
 
-using JIT::Node;
+using arrayfire::common::findModule;
+using arrayfire::common::getEnvVar;
+using arrayfire::common::getFuncName;
+using arrayfire::common::half;
+using arrayfire::common::isBufferOrShift;
+using arrayfire::common::kNodeType;
+using arrayfire::common::ModdimNode;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ids;
+using arrayfire::common::Node_map_t;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::common::saveKernel;
+using arrayfire::cuda::jit::BufferNode;
+using arrayfire::cuda::jit::ShiftNode;
+
+using std::array;
+using std::equal;
+using std::find_if;
+using std::for_each;
+using std::shared_ptr;
 using std::string;
 using std::stringstream;
-using JIT::str_map_t;
-using JIT::str_map_iter;
-
-const char *layout64 = "target datalayout = \"e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64\"\n\n\n";
-const char *layout32 = "target datalayout = \"e-p:32:32:32-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v16:16:16-v32:32:32-v64:64:64-v128:128:128-n16:32:64\"\n\n\n";
-
-const char *triple64 = "target triple = \"nvptx64-unknown-cuda\"\n\n";
-const char *triple32 = "target triple = \"nvptx-unknown-cuda\"\n\n";
-
-static string getFuncName(Node *node, bool is_linear)
-{
-    node->setId(0);
-
-    stringstream funcName;
-    stringstream hashName;
-
-    if (is_linear) funcName << "L_"; //Kernel Linear
-    else           funcName << "G_"; //Kernel General
-
-    funcName << node->getNameStr();
-    node->genKerName(funcName);
-    funcName.str();
-
-    boost::hash<std::string> hash_fn;
-
-    hashName << "@KER";
-    hashName << hash_fn(funcName.str());
-    return hashName.str();
-}
-
-static string getKernelString(string funcName, Node *node, bool is_linear)
-{
-    stringstream kerStream;
-    stringstream annStream;
-    str_map_t declStrs;
+using std::to_string;
+using std::vector;
 
-    int id = node->getId();
+namespace arrayfire {
+namespace cuda {
 
-    if (sizeof(void *) == 8) {
-        kerStream << layout64;
-        kerStream << triple64;
-    } else {
-        kerStream << layout32;
-        kerStream << triple32;
-    }
+static string getKernelString(const string& funcName,
+                              const vector<Node*>& full_nodes,
+                              const vector<Node_ids>& full_ids,
+                              const vector<int>& output_ids,
+                              const bool is_linear, const bool loop0,
+                              const bool loop1, const bool loop2,
+                              const bool loop3) {
+    const std::string includeFileStr(jit_cuh, jit_cuh_len);
 
-    kerStream << "define void " << funcName << " (" << std::endl;
-    node->genParams(kerStream, annStream, is_linear);
-    kerStream << node->getTypeStr() <<"* %out,\n"
-              << "i32 %ostr0, i32 %ostr1, i32 %ostr2, i32 %ostr3,\n"
-              << "i32 %odim0, i32 %odim1, i32 %odim2, i32 %odim3,\n"
-              << "i32 %blkx, i32 %blky) {"
-              << "\n\n";
-
-    kerStream << "entry:\n\n";
-    kerStream << "%tidx = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()\n";
-    kerStream << "%bdmx = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()\n";
-    kerStream << "%bidx = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()\n";
-    kerStream << "%bidy = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.y()\n";
-    kerStream << "%gdmx = call i32 @llvm.nvvm.read.ptx.sreg.nctaid.x()\n";
-    kerStream << "\n\n";
-
-    if (!is_linear) {
-
-        kerStream << "%tidy = call i32 @llvm.nvvm.read.ptx.sreg.tid.y()\n";
-        kerStream << "%bdmy = call i32 @llvm.nvvm.read.ptx.sreg.ntid.y()\n";
-
-        kerStream << "%id2 = sdiv i32 %bidx, %blkx\n";
-        kerStream << "%id3 = sdiv i32 %bidy, %blky\n";
-        kerStream << "%id2m = mul i32 %id2, %blkx\n";
-        kerStream << "%id3m = mul i32 %id3, %blky\n";
-        kerStream << "%blk_x = sub i32 %bidx, %id2m\n";
-        kerStream << "%blk_y = sub i32 %bidy, %id3m\n";
-        kerStream << "%id0m = mul i32 %blk_x, %bdmx\n";
-        kerStream << "%id1m = mul i32 %blk_y, %bdmy\n";
-        kerStream << "%id0 = add i32 %tidx, %id0m\n";
-        kerStream << "%id1 = add i32 %tidy, %id1m\n";
-        kerStream << "\n\n";
-
-        kerStream << "%off3o = mul i32 %id3, %ostr3\n";
-        kerStream << "%off2o = mul i32 %id2, %ostr2\n";
-        kerStream << "%off1o = mul i32 %id1, %ostr1\n";
-        kerStream << "%off23o = add i32 %off3o, %off2o\n";
-        kerStream << "%off123o = add i32 %off23o, %off1o\n";
-        kerStream << "%idxa = add i32 %off123o, %id0\n";
-        kerStream << "%idx = sext i32 %idxa to i64\n";
-        kerStream << "\n\n";
-
-        kerStream << "%cmp3 = icmp slt i32 %id3, %odim3\n";
-        kerStream << "%cmp2 = icmp slt i32 %id2, %odim2\n";
-        kerStream << "%cmp1 = icmp slt i32 %id1, %odim1\n";
-        kerStream << "%cmp0 = icmp slt i32 %id0, %odim0\n";
-
-        kerStream << "br i1 %cmp3, label %check2, label %end\n";
-        kerStream << "\ncheck2:\n";
-        kerStream << "br i1 %cmp2, label %check1, label %end\n";
-        kerStream << "\ncheck1:\n";
-        kerStream << "br i1 %cmp1, label %check0, label %end\n";
-        kerStream << "\ncheck0:\n";
-        kerStream << "br i1 %cmp0, label %core, label %end\n";
-
-    } else {
-
-        kerStream << "%boff = mul i32 %bidy, %gdmx\n";
-        kerStream << "%bid  = add i32 %boff, %bidx\n";
-        kerStream << "%goff = mul i32 %bid , %bdmx\n";
-        kerStream << "%gid  = add i32 %goff ,%tidx\n";
-        kerStream << "%idx  = sext i32 %gid to i64\n";
-        kerStream << "%el1  = mul i32 %odim0, %odim1\n";
-        kerStream << "%el2  = mul i32 %el1  , %odim2\n";
-        kerStream << "%el3  = mul i32 %el2  , %odim3\n";
-        kerStream << "%cmp0 = icmp slt i32 %gid, %el3\n";
-        kerStream << "br i1 %cmp0, label %core, label %end\n";
+    const std::string paramTStr = R"JIT(
+template<typename T>
+struct Param {
+    dim_t dims[4];
+    dim_t strides[4];
+    T *ptr;
+};
+)JIT";
+
+    std::string typedefStr{"typedef unsigned int uint;\ntypedef "};
+    typedefStr += getFullName<dim_t>();
+    typedefStr += " dim_t;\n";
+
+    // Common CUDA code
+    // This part of the code does not change with the kernel.
+
+    static const char* kernelVoid = "extern \"C\" __global__ void\n";
+    static const char* dimParams  = "";
+
+    static const char* blockStart = "{";
+    static const char* blockEnd   = "\n}\n";
+
+    static const char* linearInit = R"JIT(
+    int idx = blockIdx.x * blockDim.x + threadIdx.x;
+    const int idxEnd = outref.dims[0];
+    if (idx < idxEnd) {)JIT";
+    static const char* linearEnd  = R"JIT(
+    })JIT";
+
+    static const char* linearLoop0Start = R"JIT(
+        const int idxID0Inc = gridDim.x*blockDim.x;
+        do {)JIT";
+    static const char* linearLoop0End   = R"JIT(
+            idx += idxID0Inc;
+            if (idx >= idxEnd) break;
+        } while (true);)JIT";
+
+    // ///////////////////////////////////////////////
+    // oInfo = output optimized information (dims, strides, offset).
+    //         oInfo has removed dimensions, to optimize block scheduling
+    // iInfo = input internal information (dims, strides, offset)
+    //         iInfo has the original dimensions, auto generated code
+    //
+    // Loop3 is fastest and becomes the innermost loop, since
+    //      - #of loops is known upfront
+    // Loop1 is used for extra dynamic looping (writing into cache)
+    // Loop0 is used for extra dynamic looping (writing into cache),
+    //       VECTORS ONLY!!
+    // All loops are conditional and independent. Format Loop1 & Loop3
+    // ////////////////////////////
+    //  *stridedLoopNInit               // Always
+    //  *stridedLoop1Init               // Conditional
+    //  *stridedLoop2Init               // Conditional
+    //  *stridedLoop3Init               // Conditional
+    //  *stridedLoop1Start              // Conditional
+    //      *stridedLoop2Start          // Conditional
+    //          *stridedLoop3Start      // Conditional
+    //              auto generated code // Always
+    //          *stridedLoop3End        // Conditional
+    //      *stridedLoop2End            // Conditional
+    //  *stridedLoop1End                // Conditional
+    //  *stridedEnd                     // Always
+    //
+    // Format loop0 (Vector only)
+    // //////////////////////////
+    // *stridedLoop0Init                // Always
+    // *stridedLoop0Start               // Always
+    //      auto generated code         // Always
+    // *stridedLoop0End                 // Always
+    // *stridedEnd                      // Always
+
+    // -----
+    static const char* stridedLoop0Init  = R"JIT(
+    int id0 = blockIdx.x * blockDim.x + threadIdx.x;
+    const int id0End = outref.dims[0];
+    if (id0 < id0End) {
+#define id1 0
+#define id2 0
+#define id3 0
+        const int ostrides0 = outref.strides[0];
+        int idx = ostrides0*id0;)JIT";
+    static const char* stridedLoop0Start = R"JIT(
+        const int id0Inc = gridDim.x*blockDim.x;
+        const int idxID0Inc = ostrides0*id0Inc;
+        do {)JIT";
+    static const char* stridedLoop0End   = R"JIT(
+            id0 += id0Inc;
+            if (id0 >= id0End) break;
+            idx += idxID0Inc;
+        } while (true);)JIT";
+
+    static const char* stridedLoopNInit = R"JIT(
+    int id0 = blockIdx.x * blockDim.x + threadIdx.x;
+    int id1 = blockIdx.y * blockDim.y + threadIdx.y;
+    const int id0End = outref.dims[0];
+    const int id1End = outref.dims[1];
+    if ((id0 < id0End) & (id1 < id1End)) {
+        int id2 = blockIdx.z * blockDim.z + threadIdx.z;
+#define id3 0
+        const int ostrides1 = outref.strides[1];
+        int idx = (int)outref.strides[0]*id0 + ostrides1*id1 + (int)outref.strides[2]*id2;)JIT";
+    static const char* stridedEnd       = R"JIT(
+    })JIT";
+
+    static const char* stridedLoop3Init  = R"JIT(
+#undef id3
+        int id3 = 0;
+        const int id3End = outref.dims[3];
+        const int idxID3Inc = outref.strides[3];)JIT";
+    static const char* stridedLoop3Start = R"JIT(
+                    const int idxBaseID3 = idx;
+                    do {)JIT";
+    // Looping over outside dim3 means that all dimensions are present,
+    // so the internal id3 can be used directly
+    static const char* stridedLoop3End = R"JIT(
+                       ++id3;
+                       if (id3 == id3End) break;
+                       idx += idxID3Inc;
+                    } while (true);
+                    id3 = 0;
+                    idx = idxBaseID3;)JIT";
+
+    static const char* stridedLoop2Init  = R"JIT(
+        const int id2End = outref.dims[2];
+        const int id2Inc = gridDim.z*blockDim.z;
+        const int idxID2Inc = (int)outref.strides[2]*id2Inc;)JIT";
+    static const char* stridedLoop2Start = R"JIT(
+                const int idxBaseID2 = idx;
+                const int baseID2 = id2;
+                do {)JIT";
+    static const char* stridedLoop2End   = R"JIT(
+                    id2 += id2Inc;
+                    if (id2 >= id2End) break;
+                    idx += idxID2Inc;
+                } while (true);
+                id2 = baseID2;
+                idx = idxBaseID2;)JIT";
+
+    // No reset of od1/id[decode.dim1] is necessary since this is the overall
+    // loop
+    static const char* stridedLoop1Init  = R"JIT(
+        const int id1Inc = gridDim.y*blockDim.y;
+        const int idxID1Inc = ostrides1*id1Inc;)JIT";
+    static const char* stridedLoop1Start = R"JIT(
+            do {)JIT";
+    static const char* stridedLoop1End   = R"JIT(
+                id1 += id1Inc;
+                if (id1 >= id1End) break;
+                idx += idxID1Inc;
+            } while (true);)JIT";
+
+    // Reuse stringstreams, because they are very costly during initialization
+    thread_local stringstream inParamStream;
+    thread_local stringstream outParamStream;
+    thread_local stringstream inOffsetsStream;
+    thread_local stringstream opsStream;
+    thread_local stringstream outrefStream;
+    thread_local stringstream kerStream;
+
+    string ret;
+    try {
+        int oid{0};
+        for (size_t i{0}; i < full_nodes.size(); i++) {
+            const auto& node{full_nodes[i]};
+            const auto& ids_curr{full_ids[i]};
+            // Generate input parameters, only needs current id
+            node->genParams(inParamStream, ids_curr.id, is_linear);
+            // Generate input offsets, only needs current id
+            node->genOffsets(inOffsetsStream, ids_curr.id, is_linear);
+            // Generate the core function body, needs children ids as well
+            node->genFuncs(opsStream, ids_curr);
+            for (size_t output_idx{0}; output_idx < output_ids.size();
+                 ++output_idx) {
+                if (output_ids[output_idx] == ids_curr.id) {
+                    // Also generate the output parameters
+                    outParamStream << (oid == 0 ? "" : ",\n") << "Param<"
+                                   << full_nodes[ids_curr.id]->getTypeStr()
+                                   << "> out" << oid;
+                    // Generate code to write the output (offset already in ptr)
+                    opsStream << "out" << output_idx << ".ptr[idx] = val"
+                              << ids_curr.id << ";\n";
+                    ++oid;
+                }
+            }
+        }
+
+        outrefStream << "\n    const Param<"
+                     << full_nodes[output_ids[0]]->getTypeStr()
+                     << "> &outref = out0;";
+
+        // Put various blocks into a single stream
+        kerStream << typedefStr << includeFileStr << "\n\n"
+                  << paramTStr << '\n'
+                  << kernelVoid << funcName << "(\n"
+                  << inParamStream.str() << outParamStream.str() << dimParams
+                  << ')' << blockStart << outrefStream.str();
+        if (is_linear) {
+            kerStream << linearInit;
+            if (loop0) kerStream << linearLoop0Start;
+            kerStream << "\n\n" << inOffsetsStream.str() << opsStream.str();
+            if (loop0) kerStream << linearLoop0End;
+            kerStream << linearEnd;
+        } else {
+            if (loop0) {
+                kerStream << stridedLoop0Init << stridedLoop0Start;
+            } else {
+                kerStream << stridedLoopNInit;
+                if (loop3) kerStream << stridedLoop3Init;
+                if (loop2) kerStream << stridedLoop2Init;
+                if (loop1) kerStream << stridedLoop1Init << stridedLoop1Start;
+                if (loop2) kerStream << stridedLoop2Start;
+                if (loop3) kerStream << stridedLoop3Start;
+            }
+            kerStream << "\n\n" << inOffsetsStream.str() << opsStream.str();
+            if (loop3) kerStream << stridedLoop3End;
+            if (loop2) kerStream << stridedLoop2End;
+            if (loop1) kerStream << stridedLoop1End;
+            if (loop0) kerStream << stridedLoop0End;
+            kerStream << stridedEnd;
+        }
+        kerStream << blockEnd;
+        ret = kerStream.str();
+    } catch (...) {
+        // Prepare for next round
+        inParamStream.str("");
+        outParamStream.str("");
+        inOffsetsStream.str("");
+        opsStream.str("");
+        outrefStream.str("");
+        kerStream.str("");
+        throw;
     }
 
-    kerStream << "\n";
-    kerStream << "end:\n\n";
-    kerStream << "ret void\n";
-
-    kerStream <<"\n";
-    kerStream << "core:\n\n";
-    node->genOffsets(kerStream, is_linear);
-
-    node->genFuncs(kerStream, declStrs, is_linear);
+    // Prepare for next round
+    inParamStream.str("");
+    outParamStream.str("");
+    inOffsetsStream.str("");
+    opsStream.str("");
+    outrefStream.str("");
+    kerStream.str("");
 
-    kerStream << "%outIdx = getelementptr inbounds " << node->getTypeStr() << "* %out, i64 %idx\n";
-    kerStream << "store "
-              << node->getTypeStr()
-              << " %val" << id << ", "
-              << node->getTypeStr()
-              << "* %outIdx\n";
-
-    kerStream << "\nret void\n";
-    kerStream << "\n}\n\n";
-
-    for(str_map_iter iterator = declStrs.begin(); iterator != declStrs.end(); iterator++) {
-        kerStream << iterator->first << "\n";
-    }
-
-    kerStream
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.tid.x() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.tid.y() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.ntid.x() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.ntid.y() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.x() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.ctaid.y() nounwind readnone\n"
-        << "declare i32 @llvm.nvvm.read.ptx.sreg.nctaid.x() nounwind readnone\n";
-
-    kerStream << "\n";
-
-    kerStream << "!nvvm.annotations = !{!1}\n"
-              << "!1 = metadata !{void (\n"
-              << annStream.str()
-              << node->getTypeStr() << "*,\n"
-              << "i32, i32, i32, i32,\n"
-              << "i32, i32, i32, i32,\n"
-              << "i32, i32\n"
-              << ")* " << funcName << ",\n "
-              << "metadata !\"kernel\", i32 1}\n";
-
-    return kerStream.str();
+    return ret;
 }
 
-#define NVVM_CHECK(fn, msg) do {                \
-        nvvmResult res = fn;                    \
-        if (res == NVVM_SUCCESS) break;         \
-        char nvvm_err_msg[1024];                \
-        snprintf(nvvm_err_msg,                    \
-                 sizeof(nvvm_err_msg),          \
-                 "NVVM Error (%d): %s\n",       \
-                 (int)(res), msg);              \
-        AF_ERROR(nvvm_err_msg,                  \
-                 AF_ERR_INTERNAL);              \
-                                                \
-    } while(0)
-
-static char *irToPtx(string IR, size_t *ptx_size)
-{
-    nvvmProgram prog;
-
-    NVVM_CHECK(nvvmCreateProgram(&prog), "Failed to create program");
-
-    NVVM_CHECK(nvvmAddModuleToProgram(prog, IR.c_str(), IR.size(), "generated kernel"),
-               "Failed to add module");
-
-//#ifdef NDEBUG
-#if 0
-    NVVM_CHECK(nvvmCompileProgram(prog, 0, NULL), "Failed to compile program");
-#else
-    nvvmResult comp_res = nvvmCompileProgram(prog, 0, NULL);
-    if (comp_res != NVVM_SUCCESS) {
-        size_t log_size = 0;
-        nvvmGetProgramLogSize(prog, &log_size);
-        printf("%ld, %zu\n", IR.size(), log_size);
-        scoped_ptr<char> log(new char[log_size]);
-        nvvmGetProgramLog(prog, log.get());
-        printf("LOG:\n%s\n%s", log.get(), IR.c_str());
-        NVVM_CHECK(comp_res, "Failed to compile program");
+static CUfunction getKernel(const vector<Node*>& output_nodes,
+                            const vector<int>& output_ids,
+                            const vector<Node*>& full_nodes,
+                            const vector<Node_ids>& full_ids,
+                            const bool is_linear, const bool loop0,
+                            const bool loop1, const bool loop2,
+                            const bool loop3) {
+    const string funcName{getFuncName(output_nodes, output_ids, full_nodes,
+                                      full_ids, is_linear, loop0, loop1, loop2,
+                                      loop3)};
+    // A forward lookup in module cache helps avoid recompiling
+    // the JIT source generated from identical JIT-trees.
+    const auto entry{
+        findModule(getActiveDeviceId(), deterministicHash(funcName))};
+
+    if (!entry) {
+        const string jitKer{getKernelString(funcName, full_nodes, full_ids,
+                                            output_ids, is_linear, loop0, loop1,
+                                            loop2, loop3)};
+        saveKernel(funcName, jitKer, ".cu");
+
+        const common::Source jit_src{jitKer.c_str(), jitKer.size(),
+                                     deterministicHash(jitKer)};
+
+        return common::getKernel(funcName, {{jit_src}}, {}, {}, true).get();
     }
-#endif
-
-    NVVM_CHECK(nvvmGetCompiledResultSize(prog, ptx_size), "Can not get ptx size");
-
-    char *ptx = new char[*ptx_size];
-    NVVM_CHECK(nvvmGetCompiledResult(prog, ptx), "Can not get ptx from NVVM IR");
-
-    NVVM_CHECK(nvvmDestroyProgram(&prog), "Failed to destroy program");
-    return ptx;
+    return common::getKernel(entry, funcName, true).get();
 }
 
-typedef struct {
-    CUmodule prog;
-    CUfunction ker;
-} kc_entry_t;
-
-
-const size_t size = 1024;
-char linkInfo[size];
-char linkError[size];
-
+template<typename T>
+void evalNodes(vector<Param<T>>& outputs, const vector<Node*>& output_nodes) {
+    const unsigned nrOutputs{static_cast<unsigned>(output_nodes.size())};
+    if (nrOutputs == 0) { return; }
+    assert(outputs.size() == output_nodes.size());
+    dim_t* outDims{outputs[0].dims};
+    dim_t* outStrides{outputs[0].strides};
 #ifndef NDEBUG
-#define CU_CHECK(fn) do {                       \
-        CUresult res = fn;                      \
-        if (res == CUDA_SUCCESS) break;         \
-        char cu_err_msg[1024];                  \
-        snprintf(cu_err_msg,                    \
-                 sizeof(cu_err_msg),            \
-                 "CU Error (%d)\n%s\n",         \
-                 (int)(res), linkError);        \
-        AF_ERROR(cu_err_msg,                    \
-                 AF_ERR_INTERNAL);              \
-    } while(0)
-#else
-#define CU_CHECK(fn) do {                       \
-        CUresult res = fn;                      \
-        if (res == CUDA_SUCCESS) break;         \
-        char cu_err_msg[1024];                  \
-        snprintf(cu_err_msg,                    \
-                 sizeof(cu_err_msg),            \
-                 "CU Error (%d)\n",             \
-                 (int)(res));                   \
-        AF_ERROR(cu_err_msg,                    \
-                 AF_ERR_INTERNAL);              \
-    } while(0)
+    for_each(
+        begin(outputs) + 1, end(outputs),
+        [outDims, outStrides](Param<T>& output) {
+            assert(equal(output.dims, output.dims + AF_MAX_DIMS, outDims) &&
+                   equal(output.strides, output.strides + AF_MAX_DIMS,
+                         outStrides));
+        });
 #endif
 
-static kc_entry_t compileKernel(const char *ker_name, string jit_ker)
-{
-    size_t ptx_size;
-    scoped_ptr<const char> ptx(irToPtx(jit_ker, &ptx_size));
-
-    CUlinkState linkState;
-
-    linkInfo[0] = 0;
-    linkError[0] = 0;
-
-    CUjit_option linkOptions[] = {
-        CU_JIT_INFO_LOG_BUFFER,
-        CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES,
-        CU_JIT_ERROR_LOG_BUFFER,
-        CU_JIT_ERROR_LOG_BUFFER_SIZE_BYTES,
-        CU_JIT_LOG_VERBOSE
-    };
-
-    void *linkOptionValues[] = {
-        linkInfo,
-        reinterpret_cast<void*>(1024),
-        linkError,
-        reinterpret_cast<void*>(1024),
-        reinterpret_cast<void*>(1)
-    };
-
-    CU_CHECK(cuLinkCreate(5, linkOptions, linkOptionValues, &linkState));
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)ptx.get(),
-                           ptx_size, ker_name, 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)arith_ptx,
-                           arith_ptx_len, "arith", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)cast_ptx,
-                           cast_ptx_len, "cast", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)exp_ptx,
-                           exp_ptx_len, "exp", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)hyper_ptx,
-                           hyper_ptx_len, "hyper", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)logic_ptx,
-                           logic_ptx_len, "logic", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)numeric_ptx,
-                           numeric_ptx_len, "numeric", 0, NULL, NULL));
-
-    CU_CHECK(cuLinkAddData(linkState, CU_JIT_INPUT_PTX, (void*)trig_ptx,
-                           trig_ptx_len, "trig", 0, NULL, NULL));
-
-    void *cubin;
-    size_t cubinSize;
-
-    CUmodule module;
-    CUfunction kernel;
-
-    CU_CHECK(cuLinkComplete(linkState, &cubin, &cubinSize));
-    CU_CHECK(cuModuleLoadDataEx(&module, cubin, 0, 0, 0));
-    CU_CHECK(cuModuleGetFunction(&kernel, module, ker_name + 1));
-
-    kc_entry_t entry = {module, kernel};
-
-    return entry;
-}
-
-static CUfunction getKernel(Node *node, bool is_linear)
-{
-
-    string funcName = getFuncName(node, is_linear);
-
-    typedef std::map<string, kc_entry_t> kc_t;
-    static kc_t kernelCaches[DeviceManager::MAX_DEVICES];
-    int device = getActiveDeviceId();
-
-    kc_t::iterator idx = kernelCaches[device].find(funcName);
-    kc_entry_t entry = {NULL, NULL};
-
-    if (idx == kernelCaches[device].end()) {
-        string jit_ker = getKernelString(funcName, node, is_linear);
-        entry = compileKernel(funcName.c_str(), jit_ker);
-        kernelCaches[device][funcName] = entry;
-    } else {
-        entry = idx->second;
+    dim_t ndims{outDims[3] > 1   ? 4
+                : outDims[2] > 1 ? 3
+                : outDims[1] > 1 ? 2
+                : outDims[0] > 0 ? 1
+                                 : 0};
+    bool is_linear{true};
+    dim_t numOutElems{1};
+    for (dim_t dim{0}; dim < ndims; ++dim) {
+        is_linear &= (numOutElems == outStrides[dim]);
+        numOutElems *= outDims[dim];
+    }
+    if (numOutElems == 0) { return; }
+
+    // Use thread local to reuse the memory every time you are
+    // here.
+    thread_local Node_map_t nodes;
+    thread_local vector<Node*> full_nodes;
+    thread_local vector<Node_ids> full_ids;
+    thread_local vector<int> output_ids;
+
+    try {
+        // Reserve some space to improve performance at smaller
+        // sizes
+        constexpr size_t CAP{1024};
+        if (full_nodes.capacity() < CAP) {
+            nodes.reserve(CAP);
+            output_ids.reserve(10);
+            full_nodes.reserve(CAP);
+            full_ids.reserve(CAP);
+        }
+
+        const af::dtype outputType{output_nodes[0]->getType()};
+        const size_t outputSizeofType{size_of(outputType)};
+        for (Node* node : output_nodes) {
+            assert(node->getType() == outputType);
+            const int id = node->getNodesMap(nodes, full_nodes, full_ids);
+            output_ids.push_back(id);
+        }
+
+        size_t inputSize{0};
+        unsigned nrInputs{0};
+        bool moddimsFound{false};
+        for (const Node* node : full_nodes) {
+            is_linear &= node->isLinear(outDims);
+            moddimsFound |= (node->getOp() == af_moddims_t);
+            if (node->isBuffer()) {
+                ++nrInputs;
+                inputSize += node->getBytes();
+            }
+        }
+        const size_t outputSize{numOutElems * outputSizeofType * nrOutputs};
+        const size_t totalSize{inputSize + outputSize};
+
+        bool emptyColumnsFound{false};
+        if (is_linear) {
+            outDims[0]    = numOutElems;
+            outDims[1]    = 1;
+            outDims[2]    = 1;
+            outDims[3]    = 1;
+            outStrides[0] = 1;
+            outStrides[1] = numOutElems;
+            outStrides[2] = numOutElems;
+            outStrides[3] = numOutElems;
+            ndims         = 1;
+        } else {
+            emptyColumnsFound = ndims > (outDims[0] == 1   ? 1
+                                         : outDims[1] == 1 ? 2
+                                         : outDims[2] == 1 ? 3
+                                                           : 4);
+        }
+
+        // Keep node_clones in scope so that the cloned nodes stay alive while
+        // they are referenced below, in case moddims handling or empty-column
+        // elimination has to modify them
+        vector<Node_ptr> node_clones;
+        if (moddimsFound || emptyColumnsFound) {
+            node_clones.reserve(full_nodes.size());
+            for (Node* node : full_nodes) {
+                node_clones.emplace_back(node->clone());
+            }
+
+            for (const Node_ids& ids : full_ids) {
+                auto& children{node_clones[ids.id]->m_children};
+                for (int i{0}; i < Node::kMaxChildren && children[i] != nullptr;
+                     i++) {
+                    children[i] = node_clones[ids.child_ids[i]];
+                }
+            }
+
+            if (moddimsFound) {
+                const auto isModdim{[](const Node_ptr& node) {
+                    return node->getOp() == af_moddims_t;
+                }};
+                for (auto nodeIt{begin(node_clones)}, endIt{end(node_clones)};
+                     (nodeIt = find_if(nodeIt, endIt, isModdim)) != endIt;
+                     ++nodeIt) {
+                    const ModdimNode* mn{
+                        static_cast<ModdimNode*>(nodeIt->get())};
+
+                    const auto new_strides{calcStrides(mn->m_new_shape)};
+                    const auto isBuffer{
+                        [](const Node& ptr) { return ptr.isBuffer(); }};
+                    for (NodeIterator<> it{nodeIt->get()},
+                         end{NodeIterator<>()};
+                         (it = find_if(it, end, isBuffer)) != end; ++it) {
+                        BufferNode<T>* buf{static_cast<BufferNode<T>*>(&(*it))};
+                        buf->m_param.dims[0]    = mn->m_new_shape[0];
+                        buf->m_param.dims[1]    = mn->m_new_shape[1];
+                        buf->m_param.dims[2]    = mn->m_new_shape[2];
+                        buf->m_param.dims[3]    = mn->m_new_shape[3];
+                        buf->m_param.strides[0] = new_strides[0];
+                        buf->m_param.strides[1] = new_strides[1];
+                        buf->m_param.strides[2] = new_strides[2];
+                        buf->m_param.strides[3] = new_strides[3];
+                    }
+                }
+            }
+            if (emptyColumnsFound) {
+                common::removeEmptyDimensions<Param<T>, BufferNode<T>,
+                                              ShiftNode<T>>(outputs,
+                                                            node_clones);
+            }
+
+            full_nodes.clear();
+            for (Node_ptr& node : node_clones) {
+                full_nodes.push_back(node.get());
+            }
+        }
+
+        threadsMgt<dim_t> th(outDims, ndims);
+        const dim3 threads{th.genThreads()};
+        const dim3 blocks{th.genBlocks(threads, nrInputs, nrOutputs, totalSize,
+                                       outputSizeofType)};
+        auto ker = getKernel(output_nodes, output_ids, full_nodes, full_ids,
+                             is_linear, th.loop0, th.loop1, th.loop2, th.loop3);
+
+        vector<void*> args;
+        for (const Node* node : full_nodes) {
+            node->setArgs(0, is_linear,
+                          [&](int /*id*/, const void* ptr, size_t /*size*/,
+                              bool /*is_buffer*/) {
+                              args.push_back(const_cast<void*>(ptr));
+                          });
+        }
+
+        for (auto& out : outputs) { args.push_back(static_cast<void*>(&out)); }
+
+        {
+            using namespace arrayfire::cuda::kernel_logger;
+            AF_TRACE(
+                "Launching : Dims: [{},{},{},{}] Blocks: [{}] "
+                "Threads: [{}] threads: {}",
+                outDims[0], outDims[1], outDims[2], outDims[3], blocks, threads,
+                blocks.x * threads.x * blocks.y * threads.y * blocks.z *
+                    threads.z);
+        }
+        CU_CHECK(cuLaunchKernel(ker, blocks.x, blocks.y, blocks.z, threads.x,
+                                threads.y, threads.z, 0, getActiveStream(),
+                                args.data(), NULL));
+    } catch (...) {
+        // Reset the thread local vectors
+        nodes.clear();
+        output_ids.clear();
+        full_nodes.clear();
+        full_ids.clear();
+        throw;
     }
 
-    return entry.ker;
+    // Reset the thread local vectors
+    nodes.clear();
+    output_ids.clear();
+    full_nodes.clear();
+    full_ids.clear();
 }
 
 template<typename T>
-void evalNodes(Param<T> &out, Node *node)
-{
-    bool is_linear = node->isLinear(out.dims);
-    CUfunction ker = getKernel(node, is_linear);
-    vector<void *> args;
-    node->setArgs(args, is_linear);
-
-    void *ptr = (void *)out.ptr;
-    int strides[] = {(int)out.strides[0],
-                     (int)out.strides[1],
-                     (int)out.strides[2],
-                     (int)out.strides[3]};
-
-    int dims[] = {(int)out.dims[0],
-                  (int)out.dims[1],
-                  (int)out.dims[2],
-                  (int)out.dims[3]};
-
-    args.push_back((void *)&ptr);
-    for (int i = 0; i < 4; i++) args.push_back((void *)(strides + i));
-    for (int i = 0; i < 4; i++) args.push_back((void *)(dims + i));
-
-    int threads_x = 1, threads_y = 1;
-    int blocks_x_ = 1, blocks_y_ = 1;
-    int blocks_x  = 1, blocks_y = 1;
-
-    if (is_linear) {
-
-        threads_x = 256;
-        threads_y =  1;
-
-        int blocks = divup((out.dims[0] *
-                            out.dims[1] *
-                            out.dims[2] *
-                            out.dims[3]), threads_x);
-
-        blocks_y_ = divup(blocks, 65535);
-        blocks_x_ = divup(blocks, blocks_y_);
-
-        blocks_x = blocks_x_;
-        blocks_y = blocks_y_;
-
-    } else {
-
-        threads_x = 32;
-        threads_y =  8;
-
-        blocks_x_ = divup(out.dims[0], threads_x);
-        blocks_y_ = divup(out.dims[1], threads_y);
-
-        blocks_x = blocks_x_ * out.dims[2];
-        blocks_y = blocks_y_ * out.dims[3];
-    }
-
-    args.push_back((void *)&blocks_x_);
-    args.push_back((void *)&blocks_y_);
-
-    CU_CHECK(cuLaunchKernel(ker,
-                            blocks_x,
-                            blocks_y,
-                            1,
-                            threads_x,
-                            threads_y,
-                            1,
-                            0,
-                            NULL,
-                            &args.front(),
-                            NULL));
+void evalNodes(Param<T> out, Node* node) {
+    vector<Param<T>> outputs{out};
+    vector<Node*> nodes{node};
+    evalNodes(outputs, nodes);
 }
 
-template void evalNodes<float  >(Param<float  > &out, Node *node);
-template void evalNodes<double >(Param<double > &out, Node *node);
-template void evalNodes<cfloat >(Param<cfloat > &out, Node *node);
-template void evalNodes<cdouble>(Param<cdouble> &out, Node *node);
-template void evalNodes<int    >(Param<int    > &out, Node *node);
-template void evalNodes<uint   >(Param<uint   > &out, Node *node);
-template void evalNodes<char   >(Param<char   > &out, Node *node);
-template void evalNodes<uchar  >(Param<uchar  > &out, Node *node);
-template void evalNodes<intl   >(Param<intl   > &out, Node *node);
-template void evalNodes<uintl  >(Param<uintl  > &out, Node *node);
-
-
-}
+template void evalNodes<float>(Param<float> out, Node* node);
+template void evalNodes<double>(Param<double> out, Node* node);
+template void evalNodes<cfloat>(Param<cfloat> out, Node* node);
+template void evalNodes<cdouble>(Param<cdouble> out, Node* node);
+template void evalNodes<int>(Param<int> out, Node* node);
+template void evalNodes<uint>(Param<uint> out, Node* node);
+template void evalNodes<char>(Param<char> out, Node* node);
+template void evalNodes<schar>(Param<schar> out, Node* node);
+template void evalNodes<uchar>(Param<uchar> out, Node* node);
+template void evalNodes<intl>(Param<intl> out, Node* node);
+template void evalNodes<uintl>(Param<uintl> out, Node* node);
+template void evalNodes<short>(Param<short> out, Node* node);
+template void evalNodes<ushort>(Param<ushort> out, Node* node);
+template void evalNodes<half>(Param<half> out, Node* node);
+
+template void evalNodes<float>(vector<Param<float>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<double>(vector<Param<double>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<cfloat>(vector<Param<cfloat>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<cdouble>(vector<Param<cdouble>>& out,
+                                 const vector<Node*>& node);
+template void evalNodes<int>(vector<Param<int>>& out,
+                             const vector<Node*>& node);
+template void evalNodes<uint>(vector<Param<uint>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<char>(vector<Param<char>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<schar>(vector<Param<schar>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<uchar>(vector<Param<uchar>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<intl>(vector<Param<intl>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<uintl>(vector<Param<uintl>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<short>(vector<Param<short>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<ushort>(vector<Param<ushort>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<half>(vector<Param<half>>& out,
+                              const vector<Node*>& node);
+}  // namespace cuda
+}  // namespace arrayfire
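The linearity test at the top of the multi-output `evalNodes` in the file above can be illustrated in isolation. The sketch below is not ArrayFire code: `Dims4`, `LinearInfo` and `checkLinear` are hypothetical names, and `std::array<long long, 4>` stands in for the `dim_t` arrays used there.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-ins for illustration; not part of ArrayFire.
using Dims4 = std::array<long long, 4>;

struct LinearInfo {
    bool isLinear;       // true when the strides describe a packed layout
    long long elements;  // product of the first ndims dimensions
};

// Walks the dimensions from fastest to slowest. The layout is packed when
// each stride equals the number of elements spanned by the lower dimensions.
inline LinearInfo checkLinear(const Dims4& dims, const Dims4& strides,
                              int ndims) {
    assert(ndims >= 0 && ndims <= 4);
    bool isLinear      = true;
    long long elements = 1;
    for (int d = 0; d < ndims; ++d) {
        isLinear = isLinear && (elements == strides[d]);
        elements *= dims[d];
    }
    return {isLinear, elements};
}
```

When the result is linear, the function above collapses the output to a single dimension of `elements` entries, which is why the element count is returned alongside the flag in this sketch.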
diff --git a/src/backend/cuda/jit/BufferNode.hpp b/src/backend/cuda/jit/BufferNode.hpp
new file mode 100644
index 0000000000..8692b72515
--- /dev/null
+++ b/src/backend/cuda/jit/BufferNode.hpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/BufferNodeBase.hpp>
+#include "../Param.hpp"
+
+namespace arrayfire {
+namespace cuda {
+namespace jit {
+template<typename T>
+using BufferNode = common::BufferNodeBase<std::shared_ptr<T>, Param<T>>;
+}  // namespace jit
+}  // namespace cuda
+
+namespace common {
+
+template<typename DataType, typename ParamType>
+bool BufferNodeBase<DataType, ParamType>::operator==(
+    const BufferNodeBase<DataType, ParamType> &other) const noexcept {
+    // clang-format off
+    return m_data.get() == other.m_data.get() &&
+           m_bytes == other.m_bytes &&
+           m_param.ptr == other.m_param.ptr &&
+           m_linear_buffer == other.m_linear_buffer &&
+           m_param.dims[0] == other.m_param.dims[0] &&
+           m_param.dims[1] == other.m_param.dims[1] &&
+           m_param.dims[2] == other.m_param.dims[2] &&
+           m_param.dims[3] == other.m_param.dims[3] &&
+           m_param.strides[0] == other.m_param.strides[0] &&
+           m_param.strides[1] == other.m_param.strides[1] &&
+           m_param.strides[2] == other.m_param.strides[2] &&
+           m_param.strides[3] == other.m_param.strides[3];
+    // clang-format on
+}
+
+}  // namespace common
+
+}  // namespace arrayfire
diff --git a/src/backend/cuda/jit/ShiftNode.hpp b/src/backend/cuda/jit/ShiftNode.hpp
new file mode 100644
index 0000000000..16bdf5d0f9
--- /dev/null
+++ b/src/backend/cuda/jit/ShiftNode.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/jit/ShiftNodeBase.hpp>
+#include <jit/BufferNode.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace jit {
+
+template<typename T>
+using ShiftNode = common::ShiftNodeBase<BufferNode<T>>;
+
+}  // namespace jit
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/jit/kernel_generators.hpp b/src/backend/cuda/jit/kernel_generators.hpp
new file mode 100644
index 0000000000..02f58f432d
--- /dev/null
+++ b/src/backend/cuda/jit/kernel_generators.hpp
@@ -0,0 +1,110 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/Node.hpp>
+
+#include <functional>
+#include <iomanip>
+#include <memory>
+#include <sstream>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+
+namespace {
+
+/// Writes the declaration of one kernel input parameter to the kernel source
+void generateParamDeclaration(std::stringstream& kerStream, int id,
+                              bool is_linear, const std::string& m_type_str) {
+    if (is_linear) {
+        kerStream << m_type_str << " *in" << id << "_ptr,\n";
+    } else {
+        kerStream << "Param<" << m_type_str << "> in" << id << ",\n";
+    }
+}
+
+/// Calls the setArg function to set the arguments for a kernel call
+template<typename T>
+int setBufferKernelArguments(
+    int start_id, bool is_linear,
+    std::function<void(int id, const void* ptr, size_t arg_size,
+                       bool is_buffer)>& setArg,
+    const std::shared_ptr<T>& ptr, const Param<T>& info) {
+    UNUSED(ptr);
+    if (is_linear) {
+        setArg(start_id, static_cast<const void*>(&info.ptr), sizeof(T*), true);
+    } else {
+        setArg(start_id, static_cast<const void*>(&info), sizeof(Param<T>),
+               true);
+    }
+    return start_id + 1;
+}
+
+/// Generates the code to calculate the offsets for a buffer
+void generateBufferOffsets(std::stringstream& kerStream, int id, bool is_linear,
+                           const std::string& type_str) {
+    const std::string idx_str  = std::string("idx") + std::to_string(id);
+    const std::string info_str = std::string("in") + std::to_string(id);
+
+    if (is_linear) {
+        kerStream << "#define " << idx_str << " idx\n";
+    } else {
+        kerStream << "int " << idx_str << " = id0*(id0<" << info_str
+                  << ".dims[0])*" << info_str << ".strides[0] + id1*(id1<"
+                  << info_str << ".dims[1])*" << info_str
+                  << ".strides[1] + id2*(id2<" << info_str << ".dims[2])*"
+                  << info_str << ".strides[2] + id3*(id3<" << info_str
+                  << ".dims[3])*" << info_str << ".strides[3];\n";
+        kerStream << type_str << " *in" << id << "_ptr = in" << id << ".ptr;\n";
+    }
+}
+
+/// Generates the code to read a buffer and store it in a local variable
+void generateBufferRead(std::stringstream& kerStream, int id,
+                        const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "_ptr[idx" << id
+              << "];\n";
+}
+
+inline void generateShiftNodeOffsets(std::stringstream& kerStream, int id,
+                                     bool is_linear,
+                                     const std::string& type_str) {
+    UNUSED(is_linear);
+    const std::string idx_str  = std::string("idx") + std::to_string(id);
+    const std::string info_str = std::string("in") + std::to_string(id);
+    const std::string id_str = std::string("sh_id_") + std::to_string(id) + '_';
+    const std::string shift_str =
+        std::string("shift") + std::to_string(id) + '_';
+
+    for (int i = 0; i < 4; i++) {
+        kerStream << "int " << id_str << i << " = __circular_mod(id" << i
+                  << " + " << shift_str << i << ", " << info_str << ".dims["
+                  << i << "]);\n";
+    }
+    kerStream << "int " << idx_str << " = " << id_str << "0*(" << id_str << "0<"
+              << info_str << ".dims[0])*" << info_str << ".strides[0] + "
+              << id_str << "1*(" << id_str << "1<" << info_str << ".dims[1])*"
+              << info_str << ".strides[1] + " << id_str << "2*(" << id_str
+              << "2<" << info_str << ".dims[2])*" << info_str
+              << ".strides[2] + " << id_str << "3*(" << id_str << "3<"
+              << info_str << ".dims[3])*" << info_str << ".strides[3];\n";
+    kerStream << type_str << " *in" << id << "_ptr = in" << id << ".ptr;\n";
+}
+
+inline void generateShiftNodeRead(std::stringstream& kerStream, int id,
+                                  const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "_ptr[idx" << id
+              << "];\n";
+}
+
+}  // namespace
+}  // namespace cuda
+}  // namespace arrayfire
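The non-linear branch of `generateBufferOffsets` above emits an index expression in which each coordinate is multiplied by the result of a bounds comparison, so a coordinate at or beyond the buffer's extent contributes zero to the offset. A host-side mirror of that arithmetic is sketched below as an illustration only: `Dims4` and `broadcastOffset` are hypothetical names, and reading the expression as a way to reuse size-1 dimensions across the output is an assumption, not something the source states.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the dims/strides members of Param<T>.
using Dims4 = std::array<int, 4>;

// Mirrors the generated expression
//   id0*(id0<dims[0])*strides[0] + ... + id3*(id3<dims[3])*strides[3]
// A coordinate that is out of range for this buffer adds nothing.
inline int broadcastOffset(const Dims4& ids, const Dims4& dims,
                           const Dims4& strides) {
    int offset = 0;
    for (int i = 0; i < 4; ++i) {
        const int inRange = ids[i] < dims[i] ? 1 : 0;
        offset += ids[i] * inRange * strides[i];
    }
    return offset;
}
```

For a column vector with dims `{4, 1, 1, 1}`, every output column `id1 > 0` maps back to the same four elements, since only `id0` contributes to the offset.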
diff --git a/src/backend/cuda/join.cpp b/src/backend/cuda/join.cpp
new file mode 100644
index 0000000000..5065412342
--- /dev/null
+++ b/src/backend/cuda/join.cpp
@@ -0,0 +1,240 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <join.hpp>
+#include <kernel/memcopy.hpp>
+
+#include <algorithm>
+#include <map>
+#include <stdexcept>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> join(const int jdim, const Array<T> &first, const Array<T> &second) {
+    // All dimensions except join dimension must be equal
+    const dim4 &fdims{first.dims()};
+    const dim4 &sdims{second.dims()};
+    // Compute output dims
+    dim4 odims(fdims);
+    odims.dims[jdim] += sdims.dims[jdim];
+    Array<T> out{createEmptyArray<T>(odims)};
+    const cudaStream_t activeStream{getActiveStream()};
+
+    // Top speed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  Top speed when
+    //      --> size(in) < L2CacheSize/2
+    // 2 arrays: top speeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - size(in) < L2CacheSize/2
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    //      - size(in) >= L2CacheSize/2
+    //          --> memcpy will only reach the lower, very-large-array
+    //              speed.  The kernel will be called twice
+    if (fdims.dims[jdim] == sdims.dims[jdim]) {
+        const size_t L2CacheSize{getL2CacheSize(getActiveDeviceId())};
+        if (!(first.isReady() || second.isReady()) ||
+            (fdims.elements() * sizeof(T) * 2 * 2 < L2CacheSize)) {
+            // Both arrays have the same size along jdim and are either both
+            // unevaluated or small enough to fit into the cache, so handle
+            // them in one JIT kernel instead of slower individual copies
+            const dim_t *outStrides{out.strides().dims};
+            vector<Param<T>> outputs{
+                {out.get(), fdims.dims, outStrides},
+                {out.get() + fdims.dims[jdim] * outStrides[jdim], sdims.dims,
+                 outStrides}};
+            // Extend the life of the returned node, by saving the
+            // corresponding shared_ptr
+            const Node_ptr fNode{first.getNode()};
+            const Node_ptr sNode{second.getNode()};
+            vector<Node *> nodes{fNode.get(), sNode.get()};
+            evalNodes(outputs, nodes);
+            return out;
+        }
+        // fall through: processing each array individually is faster
+    }
+
+    // Handle each array individually
+    if (first.isReady()) {
+        if (1LL + jdim >= first.ndims() && first.isLinear()) {
+            // first & out are linear
+            CUDA_CHECK(cudaMemcpyAsync(out.get(), first.get(),
+                                       first.elements() * sizeof(T),
+                                       cudaMemcpyDeviceToDevice, activeStream));
+        } else {
+            kernel::memcopy<T>(out, first, first.ndims());
+        }
+    } else {
+        // Write the result directly in the out array
+        const Param<T> output(out.get(), fdims.dims, out.strides().dims);
+        evalNodes(output, first.getNode().get());
+    }
+
+    if (second.isReady()) {
+        if (1LL + jdim >= second.ndims() && second.isLinear()) {
+            // second & out are linear
+            CUDA_CHECK(cudaMemcpyAsync(
+                out.get() + fdims.dims[jdim] * out.strides().dims[jdim],
+                second.get(), second.elements() * sizeof(T),
+                cudaMemcpyDeviceToDevice, activeStream));
+        } else {
+            Param<T> output(
+                out.get() + fdims.dims[jdim] * out.strides().dims[jdim],
+                sdims.dims, out.strides().dims);
+            kernel::memcopy<T>(output, second, second.ndims());
+        }
+    } else {
+        // Write the result directly in the out array
+        const Param<T> output(
+            out.get() + fdims.dims[jdim] * out.strides().dims[jdim], sdims.dims,
+            out.strides().dims);
+        evalNodes(output, second.getNode().get());
+    }
+
+    return (out);
+}
+
+template<typename T>
+void join(Array<T> &out, const int jdim, const vector<Array<T>> &inputs) {
+    class eval {
+       public:
+        vector<Param<T>> outputs;
+        vector<Node_ptr> nodePtrs;
+        vector<Node *> nodes;
+        vector<const Array<T> *> ins;
+    };
+    std::map<dim_t, eval> evals;
+    const cudaStream_t activeStream{getActiveStream()};
+    const size_t L2CacheSize{getL2CacheSize(getActiveDeviceId())};
+
+    // Top speed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  Top speed when
+    //      --> size(in) <= L2CacheSize/2
+    // 2 arrays: top speeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - else
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    // 3 arrays: top speeds
+    //      - size(in) < L2CacheSize/2/3
+    //          --> JIT can copy 3 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - else
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called multiple times
+
+    // Group all arrays according to size
+    dim_t outOffset{0};
+    for (const Array<T> &iArray : inputs) {
+        const dim_t *idims{iArray.dims().dims};
+        eval &e{evals[idims[jdim]]};
+        e.outputs.emplace_back(out.get() + outOffset, idims,
+                               out.strides().dims);
+        // Extend life of the returned node by saving the corresponding
+        // shared_ptr
+        e.nodePtrs.emplace_back(iArray.getNode());
+        e.nodes.push_back(e.nodePtrs.back().get());
+        e.ins.push_back(&iArray);
+        outOffset += idims[jdim] * out.strides().dims[jdim];
+    }
+
+    for (auto &eval : evals) {
+        auto &s{eval.second};
+        if (s.ins.size() == 1 ||
+            s.ins[0]->elements() * sizeof(T) * 2 * 2 > L2CacheSize) {
+            // Process (evaluated arrays) individually for
+            //  - single small array
+            //  - very large arrays
+            auto nodeIt{begin(s.nodes)};
+            auto outputIt{begin(s.outputs)};
+            for (const Array<T> *in : s.ins) {
+                if (in->isReady()) {
+                    if (1LL + jdim >= in->ndims() && in->isLinear()) {
+                        CUDA_CHECK(cudaMemcpyAsync(outputIt->ptr, in->get(),
+                                                   in->elements() * sizeof(T),
+                                                   cudaMemcpyDeviceToDevice,
+                                                   activeStream));
+                    } else {
+                        kernel::memcopy<T>(*outputIt, *in, in->ndims());
+                    }
+                    // eliminate this array from the list, so that it will
+                    // not be processed as bulk via JIT
+                    outputIt = s.outputs.erase(outputIt);
+                    nodeIt   = s.nodes.erase(nodeIt);
+                } else {
+                    ++outputIt;
+                    ++nodeIt;
+                }
+            }
+        }
+        evalNodes(s.outputs, s.nodes);
+    }
+}
+
+#define INSTANTIATE(T)                                               \
+    template Array<T> join<T>(const int jdim, const Array<T> &first, \
+                              const Array<T> &second);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#undef INSTANTIATE
+
+#define INSTANTIATE(T)                                    \
+    template void join<T>(Array<T> & out, const int jdim, \
+                          const vector<Array<T>> &inputs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#undef INSTANTIATE
+}  // namespace cuda
+}  // namespace arrayfire
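The two-array `join` above chooses between one fused JIT kernel and per-array copies using the readiness of the inputs and the L2 cache size. That decision can be sketched as a standalone predicate. This is an illustration under assumptions, not ArrayFire's API: `preferFusedJitCopy` and its parameters are hypothetical names, and the caller is assumed to have already checked that both inputs have the same extent along the join dimension.

```cpp
#include <cassert>  // only needed by callers that check the result
#include <cstddef>

// Hypothetical predicate mirroring the condition in the two-array join:
// fuse when neither input is evaluated yet, or when the bytes of one input,
// counted for two arrays and for input plus output, stay below the L2 size.
inline bool preferFusedJitCopy(bool firstReady, bool secondReady,
                               std::size_t inputBytes,
                               std::size_t l2CacheBytes) {
    const bool neitherEvaluated = !(firstReady || secondReady);
    const bool fitsInCache      = inputBytes * 2 * 2 < l2CacheBytes;
    return neitherEvaluated || fitsInCache;
}
```

With a 4 MiB cache, two evaluated 1 MiB inputs sit exactly at the limit and take the per-array path, while two unevaluated inputs are fused regardless of size.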
diff --git a/src/backend/cuda/join.cu b/src/backend/cuda/join.cu
deleted file mode 100644
index 074326e167..0000000000
--- a/src/backend/cuda/join.cu
+++ /dev/null
@@ -1,201 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <join.hpp>
-#include <kernel/join.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<int dim>
-    af::dim4 calcOffset(const af::dim4 dims)
-    {
-        af::dim4 offset;
-        offset[0] = (dim == 0) ? dims[0] : 0;
-        offset[1] = (dim == 1) ? dims[1] : 0;
-        offset[2] = (dim == 2) ? dims[2] : 0;
-        offset[3] = (dim == 3) ? dims[3] : 0;
-        return offset;
-    }
-
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second)
-    {
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        af::dim4 fdims = first.dims();
-        af::dim4 sdims = second.dims();
-
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = fdims[i] + sdims[i];
-            } else {
-                odims[i] = fdims[i];
-            }
-        }
-
-        Array<Tx> out = createEmptyArray<Tx>(odims);
-
-        af::dim4 zero(0,0,0,0);
-
-        switch(dim) {
-            case 0:
-                kernel::join<Tx, Tx, 0>(out, first,  zero);
-                kernel::join<Tx, Ty, 0>(out, second, calcOffset<0>(fdims));
-                break;
-            case 1:
-                kernel::join<Tx, Tx, 1>(out, first,  zero);
-                kernel::join<Tx, Ty, 1>(out, second, calcOffset<1>(fdims));
-                break;
-            case 2:
-                kernel::join<Tx, Tx, 2>(out, first,  zero);
-                kernel::join<Tx, Ty, 2>(out, second, calcOffset<2>(fdims));
-                break;
-            case 3:
-                kernel::join<Tx, Tx, 3>(out, first,  zero);
-                kernel::join<Tx, Ty, 3>(out, second, calcOffset<3>(fdims));
-                break;
-        }
-
-        return out;
-    }
-
-    template<typename T, int n_arrays>
-    void join_wrapper(const int dim, Array<T> &out, const std::vector<Array<T> > &inputs)
-    {
-        af::dim4 zero(0,0,0,0);
-        af::dim4 d = zero;
-
-        switch(dim) {
-            case 0:
-                kernel::join<T, T, 0>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 0>(out, inputs[i], calcOffset<0>(d));
-                }
-                break;
-            case 1:
-                kernel::join<T, T, 1>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 1>(out, inputs[i], calcOffset<1>(d));
-                }
-                break;
-            case 2:
-                kernel::join<T, T, 1>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 2>(out, inputs[i], calcOffset<2>(d));
-                }
-                break;
-            case 3:
-                kernel::join<T, T, 3>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 3>(out, inputs[i], calcOffset<3>(d));
-                }
-                break;
-        }
-    }
-
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T> > &inputs)
-    {
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        const dim_t n_arrays = inputs.size();
-        std::vector<af::dim4> idims(n_arrays);
-
-        dim_t dim_size = 0;
-        for(int i = 0; i < (int)idims.size(); i++) {
-            idims[i] = inputs[i].dims();
-            dim_size += idims[i][dim];
-        }
-
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = dim_size;
-            } else {
-                odims[i] = idims[0][i];
-            }
-        }
-
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(n_arrays) {
-            case 1:
-                join_wrapper<T, 1>(dim, out, inputs);
-                break;
-            case 2:
-                join_wrapper<T, 2>(dim, out, inputs);
-                break;
-            case 3:
-                join_wrapper<T, 3>(dim, out, inputs);
-                break;
-            case 4:
-                join_wrapper<T, 4>(dim, out, inputs);
-                break;
-            case 5:
-                join_wrapper<T, 5>(dim, out, inputs);
-                break;
-            case 6:
-                join_wrapper<T, 6>(dim, out, inputs);
-                break;
-            case 7:
-                join_wrapper<T, 7>(dim, out, inputs);
-                break;
-            case 8:
-                join_wrapper<T, 8>(dim, out, inputs);
-                break;
-            case 9:
-                join_wrapper<T, 9>(dim, out, inputs);
-                break;
-            case 10:
-                join_wrapper<T,10>(dim, out, inputs);
-                break;
-        }
-        return out;
-    }
-
-#define INSTANTIATE(Tx, Ty)                                                                             \
-    template Array<Tx> join<Tx, Ty>(const int dim, const Array<Tx> &first, const Array<Ty> &second);   \
-
-    INSTANTIATE(float,   float)
-    INSTANTIATE(double,  double)
-    INSTANTIATE(cfloat,  cfloat)
-    INSTANTIATE(cdouble, cdouble)
-    INSTANTIATE(int,     int)
-    INSTANTIATE(uint,    uint)
-    INSTANTIATE(intl,    intl)
-    INSTANTIATE(uintl,   uintl)
-    INSTANTIATE(uchar,   uchar)
-    INSTANTIATE(char,    char)
-
-#undef INSTANTIATE
-
-#define INSTANTIATE(T)                                                                              \
-    template Array<T> join<T>(const int dim, const std::vector<Array<T> > &inputs);
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-
-#undef INSTANTIATE
-}
diff --git a/src/backend/cuda/join.hpp b/src/backend/cuda/join.hpp
index 6f9ee27782..18767feae9 100644
--- a/src/backend/cuda/join.hpp
+++ b/src/backend/cuda/join.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> join(const int dim, const Array<T> &first, const Array<T> &second);
 
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T> > &inputs);
-}
+template<typename T>
+void join(Array<T> &out, const int dim, const std::vector<Array<T>> &inputs);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/anisotropic_diffusion.cuh b/src/backend/cuda/kernel/anisotropic_diffusion.cuh
new file mode 100644
index 0000000000..8b108b434d
--- /dev/null
+++ b/src/backend/cuda/kernel/anisotropic_diffusion.cuh
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+__forceinline__ __device__ int index(const int x, const int y, const int dim0,
+                                     const int dim1, const int stride0,
+                                     const int stride1) {
+    return clamp(x, 0, dim0 - 1) * stride0 + clamp(y, 0, dim1 - 1) * stride1;
+}
+
+__device__
+float quadratic(const float value) { return 1.0f / (1.0f + value); }
+
+template<af_flux_function FluxEnum>
+__device__ float gradientUpdate(const float mct, const float C, const float S,
+                                const float N, const float W, const float E,
+                                const float SE, const float SW, const float NE,
+                                const float NW) {
+    float delta = 0;
+
+    float dx, dy, df, db, cx, cxd;
+
+    // centralized derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // half-d's and conductance along first dimension
+    df = E - C;
+    db = C - W;
+
+    if (FluxEnum == AF_FLUX_EXPONENTIAL) {
+        cx  = expf((df * df + 0.25f * afpowf(dy + 0.5f * (SE - NE), 2)) * mct);
+        cxd = expf((db * db + 0.25f * afpowf(dy + 0.5f * (SW - NW), 2)) * mct);
+    } else {
+        cx =
+            quadratic((df * df + 0.25f * afpowf(dy + 0.5f * (SE - NE), 2)) * mct);
+        cxd =
+            quadratic((db * db + 0.25f * afpowf(dy + 0.5f * (SW - NW), 2)) * mct);
+    }
+    delta += (cx * df - cxd * db);
+
+    // half-d's and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    if (FluxEnum == AF_FLUX_EXPONENTIAL) {
+        cx  = expf((df * df + 0.25f * afpowf(dx + 0.5f * (SE - SW), 2)) * mct);
+        cxd = expf((db * db + 0.25f * afpowf(dx + 0.5f * (NE - NW), 2)) * mct);
+    } else {
+        cx =
+            quadratic((df * df + 0.25f * afpowf(dx + 0.5f * (SE - SW), 2)) * mct);
+        cxd =
+            quadratic((db * db + 0.25f * afpowf(dx + 0.5f * (NE - NW), 2)) * mct);
+    }
+    delta += (cx * df - cxd * db);
+
+    return delta;
+}
+
+__device__ float curvatureUpdate(const float mct, const float C, const float S,
+                                 const float N, const float W, const float E,
+                                 const float SE, const float SW, const float NE,
+                                 const float NW) {
+    float delta     = 0;
+    float prop_grad = 0;
+
+    float df0, db0;
+    float dx, dy, df, db, cx, cxd, gmf, gmb, gmsqf, gmsqb;
+
+    // centralized derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // half-d's and conductance along first dimension
+    df  = E - C;
+    db  = C - W;
+    df0 = df;
+    db0 = db;
+
+    gmsqf = (df * df + 0.25f * afpowf(dy + 0.5f * (SE - NE), 2));
+    gmsqb = (db * db + 0.25f * afpowf(dy + 0.5f * (SW - NW), 2));
+
+    gmf = sqrtf(1.0e-10 + gmsqf);
+    gmb = sqrtf(1.0e-10 + gmsqb);
+
+    cx  = expf(gmsqf * mct);
+    cxd = expf(gmsqb * mct);
+
+    delta += ((df / gmf) * cx - (db / gmb) * cxd);
+
+    // half-d's and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    gmsqf = (df * df + 0.25f * afpowf(dx + 0.5f * (SE - SW), 2));
+    gmsqb = (db * db + 0.25f * afpowf(dx + 0.5f * (NE - NW), 2));
+    gmf   = sqrtf(1.0e-10 + gmsqf);
+    gmb   = sqrtf(1.0e-10 + gmsqb);
+
+    cx  = expf(gmsqf * mct);
+    cxd = expf(gmsqb * mct);
+
+    delta += ((df / gmf) * cx - (db / gmb) * cxd);
+
+    if (delta > 0) {
+        prop_grad +=
+            (afpowf(fminf(db0, 0.0f), 2.0f) + afpowf(fmaxf(df0, 0.0f), 2.0f));
+        prop_grad +=
+            (afpowf(fminf(db, 0.0f), 2.0f) + afpowf(fmaxf(df, 0.0f), 2.0f));
+    } else {
+        prop_grad +=
+            (afpowf(fmaxf(db0, 0.0f), 2.0f) + afpowf(fminf(df0, 0.0f), 2.0f));
+        prop_grad +=
+            (afpowf(fmaxf(db, 0.0f), 2.0f) + afpowf(fminf(df, 0.0f), 2.0f));
+    }
+
+    return sqrtf(prop_grad) * delta;
+}
+
+template<typename T, af_flux_function FluxEnum, bool isMCDE>
+__global__ void diffUpdate(Param<T> inout, const float dt, const float mct,
+                           const unsigned blkX, const unsigned blkY) {
+    const unsigned RADIUS          = 1;
+    const unsigned SHRD_MEM_WIDTH  = THREADS_X + 2 * RADIUS;
+    const unsigned SHRD_MEM_HEIGHT = THREADS_Y * YDIM_LOAD + 2 * RADIUS;
+
+    __shared__ float shrdMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+
+    const int l0 = inout.dims[0];
+    const int l1 = inout.dims[1];
+    const int s0 = inout.strides[0];
+    const int s1 = inout.strides[1];
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    const int b2 = blockIdx.x / blkX;
+    const int b3 = blockIdx.y / blkY;
+
+    const int gx = blockDim.x * (blockIdx.x - b2 * blkX) + lx;
+    int gy       = blockDim.y * (blockIdx.y - b3 * blkY) + ly;
+
+    T* img = (T*)inout.ptr + (b3 * inout.strides[3] + b2 * inout.strides[2]);
+
+#pragma unroll
+    for (int b = ly, gy2 = gy - RADIUS; b < SHRD_MEM_HEIGHT;
+         b += blockDim.y, gy2 += blockDim.y) {
+#pragma unroll
+        for (int a = lx, gx2 = gx - RADIUS; a < SHRD_MEM_WIDTH;
+             a += blockDim.x, gx2 += blockDim.x) {
+            shrdMem[b][a] = img[index(gx2, gy2, l0, l1, s0, s1)];
+        }
+    }
+    __syncthreads();
+
+    int i = lx + RADIUS;
+    int j = ly + RADIUS;
+
+#pragma unroll
+    for (int ld = 0; ld < YDIM_LOAD; ++ld, j += blockDim.y, gy += blockDim.y) {
+        float C     = shrdMem[j][i];
+        float delta = 0.0f;
+        if (isMCDE) {
+            delta = curvatureUpdate(
+                mct, C, shrdMem[j][i + 1], shrdMem[j][i - 1], shrdMem[j - 1][i],
+                shrdMem[j + 1][i], shrdMem[j + 1][i + 1], shrdMem[j - 1][i + 1],
+                shrdMem[j + 1][i - 1], shrdMem[j - 1][i - 1]);
+        } else {
+            delta = gradientUpdate<FluxEnum>(
+                mct, C, shrdMem[j][i + 1], shrdMem[j][i - 1], shrdMem[j - 1][i],
+                shrdMem[j + 1][i], shrdMem[j + 1][i + 1], shrdMem[j - 1][i + 1],
+                shrdMem[j + 1][i - 1], shrdMem[j - 1][i - 1]);
+        }
+        if (gy < l1 && gx < l0) {
+            img[gx * s0 + gy * s1] = (T)(C + delta * dt);
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/anisotropic_diffusion.hpp b/src/backend/cuda/kernel/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..f376b8842e
--- /dev/null
+++ b/src/backend/cuda/kernel/anisotropic_diffusion.hpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/anisotropic_diffusion_cuh.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr int THREADS_X = 32;
+constexpr int THREADS_Y = 8;
+constexpr int YDIM_LOAD = 2 * THREADS_X / THREADS_Y;
+
+template<typename T>
+void anisotropicDiffusion(Param<T> inout, const float dt, const float mct,
+                          const af::fluxFunction fftype, bool isMCDE) {
+    auto diffUpdate = common::getKernel(
+        "arrayfire::cuda::diffUpdate", {{anisotropic_diffusion_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(fftype),
+                     TemplateArg(isMCDE)),
+        {{DefineValue(THREADS_X), DefineValue(THREADS_Y),
+          DefineValue(YDIM_LOAD)}});
+
+    dim3 threads(THREADS_X, THREADS_Y, 1);
+
+    int blkX = divup(inout.dims[0], threads.x);
+    int blkY = divup(inout.dims[1], threads.y * YDIM_LOAD);
+
+    dim3 blocks(blkX * inout.dims[2], blkY * inout.dims[3], 1);
+
+    const int maxBlkY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    const int blkZ    = divup(blocks.y, maxBlkY);
+
+    if (blkZ > 1) {
+        blocks.y = maxBlkY;
+        blocks.z = blkZ;
+    }
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    diffUpdate(qArgs, inout, dt, mct, blkX, blkY);
+
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/approx.hpp b/src/backend/cuda/kernel/approx.hpp
index b3dfc96a72..46490c06b1 100644
--- a/src/backend/cuda/kernel/approx.hpp
+++ b/src/backend/cuda/kernel/approx.hpp
@@ -7,244 +7,79 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/approx1_cuh.hpp>
+#include <nvrtc_kernel_headers/approx2_cuh.hpp>
+#include <af/defines.h>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 16;
-        static const int TY = 16;
-        static const int THREADS = 256;
-
-        ///////////////////////////////////////////////////////////////////////////
-        // nearest-neighbor resampling
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename Ty, typename Tp>
-        __device__ inline static
-        void core_nearest1(const int idx, const int idy, const int idz, const int idw,
-                           Param<Ty> out, CParam<Ty> in, CParam<Tp> pos,
-                           const float offGrid)
-        {
-            const int omId = idw * out.strides[3] + idz * out.strides[2]
-                                + idy * out.strides[1] + idx;
-            const int pmId = idx;
-
-            const Tp x = pos.ptr[pmId];
-            if (x < 0 || in.dims[0] < x+1) {
-                out.ptr[omId] = scalar<Ty>(offGrid);
-                return;
-            }
-
-            int ioff = idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1];
-            const int iMem = round(x) + ioff;
-
-            Ty yt = in.ptr[iMem];
-            out.ptr[omId] = yt;
-        }
-
-        template<typename Ty, typename Tp>
-        __device__ inline static
-        void core_nearest2(const int idx, const int idy, const int idz, const int idw,
-                           Param<Ty> out, CParam<Ty> in,
-                           CParam<Tp> pos, CParam<Tp> qos, const float offGrid)
-        {
-            const int omId = idw * out.strides[3] + idz * out.strides[2]
-                                + idy * out.strides[1] + idx;
-            const int pmId = idy * pos.strides[1] + idx;
-            const int qmId = idy * qos.strides[1] + idx;
-
-            const Tp x = pos.ptr[pmId], y = qos.ptr[qmId];
-            if (x < 0 || y < 0 || in.dims[0] < x+1 || in.dims[1] < y+1) {
-                out.ptr[omId] = scalar<Ty>(offGrid);
-                return;
-            }
-
-            const int grid_x = round(x), grid_y = round(y); // nearest grid
-            const int imId = idw * in.strides[3] + idz * in.strides[2]
-                             + grid_y * in.strides[1] + grid_x;
-
-            Ty val = in.ptr[imId];
-            out.ptr[omId] = val;
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // linear resampling
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename Ty, typename Tp>
-        __device__ inline static
-        void core_linear1(const int idx, const int idy, const int idz, const int idw,
-                          Param<Ty> out, CParam<Ty> in, CParam<Tp> pos,
-                          const float offGrid)
-        {
-            const int omId = idw * out.strides[3] + idz * out.strides[2]
-                                + idy * out.strides[1] + idx;
-            const int pmId = idx;
-
-            const Tp pVal = pos.ptr[pmId];
-            if (pVal < 0 || in.dims[0] < pVal+1) {
-                out.ptr[omId] = scalar<Ty>(offGrid);
-                return;
-            }
-
-            const Tp grid_x = floor(pVal);  // nearest grid
-            const Tp off_x = pVal - grid_x; // fractional offset
-
-            int ioff = idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1] + grid_x;
-
-            // Check if pVal and pVal + 1 are both valid indices
-            bool cond = (pVal < in.dims[0] - 1);
-            // Compute Left and Right Weighted Values
-            Ty yl = ((Tp)1.0 - off_x) * in.ptr[ioff];
-            Ty yr = cond ? (off_x) * in.ptr[ioff + 1] : scalar<Ty>(0);
-            Ty yo = yl + yr;
-            // Compute Weight used
-            Tp wt = cond ? (Tp)1.0 : (Tp)(1.0 - off_x);
-            // Write final value
-            out.ptr[omId] = (yo / wt);
-        }
-
-        template<typename Ty, typename Tp>
-        __device__ inline static
-        void core_linear2(const int idx, const int idy, const int idz, const int idw,
-                           Param<Ty> out, CParam<Ty> in,
-                           CParam<Tp> pos, CParam<Tp> qos, const float offGrid)
-        {
-            const int omId = idw * out.strides[3] + idz * out.strides[2]
-                                + idy * out.strides[1] + idx;
-            const int pmId = idy * pos.strides[1] + idx;
-            const int qmId = idy * qos.strides[1] + idx;
-
-            const Tp x = pos.ptr[pmId], y = qos.ptr[qmId];
-            if (x < 0 || y < 0 || in.dims[0] < x+1 || in.dims[1] < y+1) {
-                out.ptr[omId] = scalar<Ty>(offGrid);
-                return;
-            }
-
-            const Tp grid_x = floor(x),   grid_y = floor(y);   // nearest grid
-            const Tp off_x  = x - grid_x, off_y  = y - grid_y; // fractional offset
-
-            int ioff = idw * in.strides[3] + idz * in.strides[2] + grid_y * in.strides[1] + grid_x;
-
-            // Check if pVal and pVal + 1 are both valid indices
-            bool condY = (y < in.dims[1] - 1);
-            bool condX = (x < in.dims[0] - 1);
-
-            // Compute wieghts used
-            Tp wt00 = ((Tp)1.0 - off_x) * ((Tp)1.0 - off_y);
-            Tp wt10 = (condY) ? ((Tp)1.0 - off_x) * (off_y) : 0;
-            Tp wt01 = (condX) ? (off_x) * ((Tp)1.0 - off_y) : 0;
-            Tp wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-            Tp wt = wt00 + wt10 + wt01 + wt11;
-
-            // Compute Weighted Values
-            Ty zero = scalar<Ty>(0);
-            Ty y00 =                    wt00 * in.ptr[ioff];
-            Ty y10 = (condY) ?          wt10 * in.ptr[ioff + in.strides[1]]     : zero;
-            Ty y01 = (condX) ?          wt01 * in.ptr[ioff + 1]                 : zero;
-            Ty y11 = (condX && condY) ? wt11 * in.ptr[ioff + in.strides[1] + 1] : zero;
-            Ty yo = y00 + y10 + y01 + y11;
-
-            // Write Final Value
-            out.ptr[omId] = (yo / wt);
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Approx Kernel
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename Ty, typename Tp, af_interp_type method>
-        __global__
-        void approx1_kernel(Param<Ty> out, CParam<Ty> in, CParam<Tp> pos,
-                            const float offGrid, const int blocksMatX)
-        {
-            const int idw = blockIdx.y / out.dims[2];
-            const int idz = blockIdx.y - idw * out.dims[2];
-
-            const int idy = blockIdx.x / blocksMatX;
-            const int blockIdx_x = blockIdx.x - idy * blocksMatX;
-            const int idx = blockIdx_x * blockDim.x + threadIdx.x;
-
-            if (idx >= out.dims[0] || idy >= out.dims[1] ||
-                idz >= out.dims[2] || idw >= out.dims[3])
-                return;
-
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    core_nearest1(idx, idy, idz, idw, out, in, pos, offGrid);
-                    break;
-                case AF_INTERP_LINEAR:
-                    core_linear1(idx, idy, idz, idw, out, in, pos, offGrid);
-                    break;
-                default:
-                    break;
-            }
-        }
-
-        template<typename Ty, typename Tp, af_interp_type method>
-        __global__
-        void approx2_kernel(Param<Ty> out, CParam<Ty> in,
-                      CParam<Tp> pos, CParam<Tp> qos, const float offGrid,
-                      const int blocksMatX, const int blocksMatY)
-        {
-            const int idz = blockIdx.x / blocksMatX;
-            const int idw = blockIdx.y / blocksMatY;
-
-            int blockIdx_x = blockIdx.x - idz * blocksMatX;
-            int blockIdx_y = blockIdx.y - idw * blocksMatY;
-
-            int idx = threadIdx.x + blockIdx_x * blockDim.x;
-            int idy = threadIdx.y + blockIdx_y * blockDim.y;
-
-            if (idx >= out.dims[0] || idy >= out.dims[1] ||
-                idz >= out.dims[2] || idw >= out.dims[3])
-                return;
-
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    core_nearest2(idx, idy, idz, idw, out, in, pos, qos, offGrid);
-                    break;
-                case AF_INTERP_LINEAR:
-                    core_linear2(idx, idy, idz, idw, out, in, pos, qos, offGrid);
-                    break;
-                default:
-                    break;
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template <typename Ty, typename Tp, af_interp_type method>
-        void approx1(Param<Ty> out, CParam<Ty> in,
-               CParam<Tp> pos, const float offGrid)
-        {
-            dim3 threads(THREADS, 1, 1);
-            int blocksPerMat = divup(out.dims[0], threads.x);
-            dim3 blocks(blocksPerMat * out.dims[1], out.dims[2] * out.dims[3]);
-
-            approx1_kernel<Ty, Tp, method><<<blocks, threads>>>
-                          (out, in, pos, offGrid, blocksPerMat);
-            POST_LAUNCH_CHECK();
-        }
-
-        template <typename Ty, typename Tp, af_interp_type method>
-        void approx2(Param<Ty> out, CParam<Ty> in,
-                    CParam<Tp> pos, CParam<Tp> qos, const float offGrid)
-        {
-            dim3 threads(TX, TY, 1);
-            int blocksPerMatX = divup(out.dims[0], threads.x);
-            int blocksPerMatY = divup(out.dims[1], threads.y);
-            dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3]);
-
-            approx2_kernel<Ty, Tp, method><<<blocks, threads>>>
-                          (out, in, pos, qos, offGrid, blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+// Kernel Launch Config Values
+static const int TX      = 16;
+static const int TY      = 16;
+static const int THREADS = 256;
+
+template<typename Ty, typename Tp>
+void approx1(Param<Ty> yo, CParam<Ty> yi, CParam<Tp> xo, const int xdim,
+             const Tp &xi_beg, const Tp &xi_step, const float offGrid,
+             const af::interpType method, const int order) {
+    auto approx1 = common::getKernel(
+        "arrayfire::cuda::approx1", {{approx1_cuh_src}},
+        TemplateArgs(TemplateTypename<Ty>(), TemplateTypename<Tp>(),
+                     TemplateArg(xdim), TemplateArg(order)));
+
+    dim3 threads(THREADS, 1, 1);
+    int blocksPerMat = divup(yo.dims[0], threads.x);
+    dim3 blocks(blocksPerMat * yo.dims[1], yo.dims[2] * yo.dims[3]);
+
+    bool batch = !(xo.dims[1] == 1 && xo.dims[2] == 1 && xo.dims[3] == 1);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    approx1(qArgs, yo, yi, xo, xi_beg, Tp(1) / xi_step, offGrid, blocksPerMat,
+            batch, method);
+
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ty, typename Tp>
+void approx2(Param<Ty> zo, CParam<Ty> zi, CParam<Tp> xo, const int xdim,
+             const Tp &xi_beg, const Tp &xi_step, CParam<Tp> yo, const int ydim,
+             const Tp &yi_beg, const Tp &yi_step, const float offGrid,
+             const af::interpType method, const int order) {
+    auto approx2 = common::getKernel(
+        "arrayfire::cuda::approx2", {{approx2_cuh_src}},
+        TemplateArgs(TemplateTypename<Ty>(), TemplateTypename<Tp>(),
+                     TemplateArg(xdim), TemplateArg(ydim), TemplateArg(order)));
+
+    dim3 threads(TX, TY, 1);
+    int blocksPerMatX = divup(zo.dims[0], threads.x);
+    int blocksPerMatY = divup(zo.dims[1], threads.y);
+    dim3 blocks(blocksPerMatX * zo.dims[2], blocksPerMatY * zo.dims[3]);
+
+    bool batch = !(xo.dims[2] == 1 && xo.dims[3] == 1);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    approx2(qArgs, zo, zi, xo, xi_beg, Tp(1) / xi_step, yo, yi_beg,
+            Tp(1) / yi_step, offGrid, blocksPerMatX, blocksPerMatY, batch,
+            method);
+
+    POST_LAUNCH_CHECK();
 }
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/approx1.cuh b/src/backend/cuda/kernel/approx1.cuh
new file mode 100644
index 0000000000..9ccf95e504
--- /dev/null
+++ b/src/backend/cuda/kernel/approx1.cuh
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <interp.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Ty, typename Tp, int xdim, int order>
+__global__ void approx1(Param<Ty> yo, CParam<Ty> yi, CParam<Tp> xo,
+                        const Tp xi_beg, const Tp xi_step_reproc,
+                        const float offGrid, const int blocksMatX,
+                        const bool batch, af::interpType method) {
+    const int idy        = blockIdx.x / blocksMatX;
+    const int blockIdx_x = blockIdx.x - idy * blocksMatX;
+    const int idx        = blockIdx_x * blockDim.x + threadIdx.x;
+
+    const int idw = (blockIdx.y + blockIdx.z * gridDim.y) / yo.dims[2];
+    const int idz = (blockIdx.y + blockIdx.z * gridDim.y) - idw * yo.dims[2];
+
+    if (idx >= yo.dims[0] || idy >= yo.dims[1] || idz >= yo.dims[2] ||
+        idw >= yo.dims[3])
+        return;
+
+    // FIXME: Only cubic interpolation is doing clamping
+    // We need to make it consistent across all methods
+    // Not changing the behavior because tests will fail
+    const bool clamp = order == 3;
+
+    bool is_off[] = {xo.dims[0] > 1, xo.dims[1] > 1, xo.dims[2] > 1,
+                     xo.dims[3] > 1};
+
+    int xo_idx = idx * is_off[0];
+    if (batch) {
+        xo_idx += idw * xo.strides[3] * is_off[3];
+        xo_idx += idz * xo.strides[2] * is_off[2];
+        xo_idx += idy * xo.strides[1] * is_off[1];
+    }
+
+    const Tp x = (xo.ptr[xo_idx] - xi_beg) * xi_step_reproc;
+
+    const int yo_idx =
+        idw * yo.strides[3] + idz * yo.strides[2] + idy * yo.strides[1] + idx;
+
+#pragma unroll
+    for (int flagIdx = 0; flagIdx < 4; ++flagIdx) { is_off[flagIdx] = true; }
+    is_off[xdim] = false;
+
+    if (x < 0 || yi.dims[xdim] < x + 1) {
+        yo.ptr[yo_idx] = scalar<Ty>(offGrid);
+        return;
+    }
+
+    int yi_idx = idx * is_off[0];
+    yi_idx += idw * yi.strides[3] * is_off[3];
+    yi_idx += idz * yi.strides[2] * is_off[2];
+    yi_idx += idy * yi.strides[1] * is_off[1];
+
+    Interp1<Ty, Tp, xdim, order> interp;
+    interp(yo, yo_idx, yi, yi_idx, x, method, 1, clamp);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/approx2.cuh b/src/backend/cuda/kernel/approx2.cuh
new file mode 100644
index 0000000000..7d4179643e
--- /dev/null
+++ b/src/backend/cuda/kernel/approx2.cuh
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <interp.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Ty, typename Tp, int xdim, int ydim, int order>
+__global__ void approx2(Param<Ty> zo, CParam<Ty> zi, CParam<Tp> xo,
+                        const Tp xi_beg, const Tp xi_step_reproc, CParam<Tp> yo,
+                        const Tp yi_beg, const Tp yi_step_reproc,
+                        const float offGrid, const int blocksMatX,
+                        const int blocksMatY, const bool batch,
+                        af::interpType method) {
+    const int idz        = blockIdx.x / blocksMatX;
+    const int blockIdx_x = blockIdx.x - idz * blocksMatX;
+    const int idx        = threadIdx.x + blockIdx_x * blockDim.x;
+
+    const int idw = (blockIdx.y + blockIdx.z * gridDim.y) / blocksMatY;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - idw * blocksMatY;
+    const int idy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (idx >= zo.dims[0] || idy >= zo.dims[1] || idz >= zo.dims[2] ||
+        idw >= zo.dims[3])
+        return;
+
+    // FIXME: Only cubic interpolation is doing clamping
+    // We need to make it consistent across all methods
+    // Not changing the behavior because tests will fail
+    const bool clamp = order == 3;
+
+    bool is_off[] = {xo.dims[0] > 1, xo.dims[1] > 1, xo.dims[2] > 1,
+                     xo.dims[3] > 1};
+
+    const int zo_idx =
+        idw * zo.strides[3] + idz * zo.strides[2] + idy * zo.strides[1] + idx;
+    int xo_idx = idy * xo.strides[1] * is_off[1] + idx * is_off[0];
+    int yo_idx = idy * yo.strides[1] * is_off[1] + idx * is_off[0];
+    if (batch) {
+        xo_idx +=
+            idw * xo.strides[3] * is_off[3] + idz * xo.strides[2] * is_off[2];
+        yo_idx +=
+            idw * yo.strides[3] * is_off[3] + idz * yo.strides[2] * is_off[2];
+    }
+
+    const Tp x = (xo.ptr[xo_idx] - xi_beg) * xi_step_reproc;
+    const Tp y = (yo.ptr[yo_idx] - yi_beg) * yi_step_reproc;
+
+#pragma unroll
+    for (int flagIdx = 0; flagIdx < 4; ++flagIdx) { is_off[flagIdx] = true; }
+    is_off[xdim] = false;
+    is_off[ydim] = false;
+
+    if (x < 0 || y < 0 || zi.dims[xdim] < x + 1 || zi.dims[ydim] < y + 1) {
+        zo.ptr[zo_idx] = scalar<Ty>(offGrid);
+        return;
+    }
+
+    int zi_idx = idy * zi.strides[1] * is_off[1] + idx * is_off[0];
+    zi_idx += idw * zi.strides[3] * is_off[3] + idz * zi.strides[2] * is_off[2];
+
+    Interp2<Ty, Tp, xdim, ydim, order> interp;
+    interp(zo, zo_idx, zi, zi_idx, x, y, method, 1, clamp);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/assign.cuh b/src/backend/cuda/kernel/assign.cuh
new file mode 100644
index 0000000000..ddf159288b
--- /dev/null
+++ b/src/backend/cuda/kernel/assign.cuh
@@ -0,0 +1,63 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <assign_kernel_param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void assign(Param<T> out, CParam<T> in, const AssignKernelParam p,
+                       const int nBBS0, const int nBBS1) {
+    // retrieve index pointers
+    // these can be 0 where af_array index is not used
+    const uint* ptr0 = p.ptr[0];
+    const uint* ptr1 = p.ptr[1];
+    const uint* ptr2 = p.ptr[2];
+    const uint* ptr3 = p.ptr[3];
+    // retrieve booleans that tell us which index to use
+    const bool s0 = p.isSeq[0];
+    const bool s1 = p.isSeq[1];
+    const bool s2 = p.isSeq[2];
+    const bool s3 = p.isSeq[3];
+
+    const int gz = blockIdx.x / nBBS0;
+    const int gw = (blockIdx.y + blockIdx.z * gridDim.y) / nBBS1;
+    const int gx = blockDim.x * (blockIdx.x - gz * nBBS0) + threadIdx.x;
+    const int gy =
+        blockDim.y * ((blockIdx.y + blockIdx.z * gridDim.y) - gw * nBBS1) +
+        threadIdx.y;
+
+    if (gx < in.dims[0] && gy < in.dims[1] && gz < in.dims[2] &&
+        gw < in.dims[3]) {
+        // calculate pointer offsets for output
+        int i =
+            p.strds[0] * trimIndex(s0 ? gx + p.offs[0] : ptr0[gx], out.dims[0]);
+        int j =
+            p.strds[1] * trimIndex(s1 ? gy + p.offs[1] : ptr1[gy], out.dims[1]);
+        int k =
+            p.strds[2] * trimIndex(s2 ? gz + p.offs[2] : ptr2[gz], out.dims[2]);
+        int l =
+            p.strds[3] * trimIndex(s3 ? gw + p.offs[3] : ptr3[gw], out.dims[3]);
+        // offset input and output pointers
+        const T* src =
+            (const T*)in.ptr + (gx * in.strides[0] + gy * in.strides[1] +
+                                gz * in.strides[2] + gw * in.strides[3]);
+        T* dst = (T*)out.ptr + (i + j + k + l);
+        // set the output
+        dst[0] = src[0];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/assign.hpp b/src/backend/cuda/kernel/assign.hpp
index 7284437bc8..008de72d37 100644
--- a/src/backend/cuda/kernel/assign.hpp
+++ b/src/backend/cuda/kernel/assign.hpp
@@ -7,81 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
-#include <math.hpp>
-#include <utility.hpp>
+#include <assign_kernel_param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/assign_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 32;
-static const int THREADS_Y =  8;
-
-typedef struct {
-    int  offs[4];
-    int strds[4];
-    bool     isSeq[4];
-    uint*      ptr[4];
-} AssignKernelParam_t;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 template<typename T>
-__global__
-void AssignKernel(Param<T> out, CParam<T> in, const AssignKernelParam_t p,
-                 const int nBBS0, const int nBBS1)
-{
-    // retrieve index pointers
-    // these can be 0 where af_array index is not used
-    const uint* ptr0 = p.ptr[0];
-    const uint* ptr1 = p.ptr[1];
-    const uint* ptr2 = p.ptr[2];
-    const uint* ptr3 = p.ptr[3];
-    // retrive booleans that tell us which index to use
-    const bool s0 = p.isSeq[0];
-    const bool s1 = p.isSeq[1];
-    const bool s2 = p.isSeq[2];
-    const bool s3 = p.isSeq[3];
+void assign(Param<T> out, CParam<T> in, const AssignKernelParam& p) {
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
 
-    const int gz = blockIdx.x/nBBS0;
-    const int gw = blockIdx.y/nBBS1;
-    const int gx = blockDim.x * (blockIdx.x - gz*nBBS0) + threadIdx.x;
-    const int gy = blockDim.y * (blockIdx.y - gw*nBBS1) + threadIdx.y;
+    auto assignKer =
+        common::getKernel("arrayfire::cuda::assign", {{assign_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
 
-    if (gx<in.dims[0] && gy<in.dims[1] && gz<in.dims[2] && gw<in.dims[3]) {
-        // calculate pointer offsets for input
-        int i = p.strds[0] * trimIndex(s0 ? gx+p.offs[0] : ptr0[gx], out.dims[0]);
-        int j = p.strds[1] * trimIndex(s1 ? gy+p.offs[1] : ptr1[gy], out.dims[1]);
-        int k = p.strds[2] * trimIndex(s2 ? gz+p.offs[2] : ptr2[gz], out.dims[2]);
-        int l = p.strds[3] * trimIndex(s3 ? gw+p.offs[3] : ptr3[gw], out.dims[3]);
-        // offset input and output pointers
-        const T *src = (const T*)in.ptr + (gx*in.strides[0]+gy*in.strides[1]+ gz*in.strides[2]+gw*in.strides[3]);
-        T *dst = (T*)out.ptr +(i+j+k+l);
-        // set the output
-        dst[0] = src[0];
-    }
-}
-
-template<typename T>
-void assign(Param<T> out, CParam<T> in, const AssignKernelParam_t& p)
-{
     const dim3 threads(THREADS_X, THREADS_Y);
 
     int blks_x = divup(in.dims[0], threads.x);
     int blks_y = divup(in.dims[1], threads.y);
 
-    dim3 blocks(blks_x*in.dims[2], blks_y*in.dims[3]);
+    dim3 blocks(blks_x * in.dims[2], blks_y * in.dims[3]);
 
-    AssignKernel<T> <<<blocks, threads>>> (out, in, p, blks_x, blks_y);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-    POST_LAUNCH_CHECK();
-}
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-}
+    assignKer(qArgs, out, in, p, blks_x, blks_y);
 
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/atomics.hpp b/src/backend/cuda/kernel/atomics.hpp
new file mode 100644
index 0000000000..cea1678e59
--- /dev/null
+++ b/src/backend/cuda/kernel/atomics.hpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+template<typename T>
+__device__ T atomicAdd(T *ptr, T val) {
+    return ::atomicAdd(ptr, val);
+}
+
+#define SPECIALIZE(T, fn1, fn2)                                                \
+    template<>                                                                 \
+    __device__ T atomicAdd<T>(T * ptr, T val) {                                \
+        unsigned long long int *ptr_as_ull = (unsigned long long int *)ptr;    \
+        unsigned long long int old         = *ptr_as_ull, assumed;             \
+        do {                                                                   \
+            assumed = old;                                                     \
+            old     = atomicCAS(ptr_as_ull, assumed, fn2(val + fn1(assumed))); \
+        } while (assumed != old);                                              \
+        return fn1(old);                                                       \
+    }
+
+SPECIALIZE(double, __longlong_as_double, __double_as_longlong)
+SPECIALIZE(intl, intl, uintl)
+SPECIALIZE(uintl, uintl, uintl)
+
+template<>
+__device__ cfloat atomicAdd<cfloat>(cfloat *ptr, cfloat val) {
+    float *fptr = (float *)(ptr);
+    cfloat res;
+    res.x = ::atomicAdd(fptr + 0, val.x);
+    res.y = ::atomicAdd(fptr + 1, val.y);
+    return res;
+}
+
+template<>
+__device__ cdouble atomicAdd<cdouble>(cdouble *ptr, cdouble val) {
+    double *fptr = (double *)(ptr);
+    cdouble res;
+    res.x = atomicAdd(fptr + 0, val.x);
+    res.y = atomicAdd(fptr + 1, val.y);
+    return res;
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/bilateral.cuh b/src/backend/cuda/kernel/bilateral.cuh
new file mode 100644
index 0000000000..6fdfbd1a3d
--- /dev/null
+++ b/src/backend/cuda/kernel/bilateral.cuh
@@ -0,0 +1,112 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+inline __device__ int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
+}
+
+template<typename inType, typename outType>
+inline __device__ void load2ShrdMem(outType *shrd, const inType *const in,
+                                    int lx, int ly, int shrdStride, int dim0,
+                                    int dim1, int gx, int gy, int inStride1,
+                                    int inStride0) {
+    shrd[ly * shrdStride + lx] = in[lIdx(
+        clamp(gx, 0, dim0 - 1), clamp(gy, 0, dim1 - 1), inStride1, inStride0)];
+}
+
+template<typename inType, typename outType>
+__global__ void bilateral(Param<outType> out, CParam<inType> in,
+                          float sigma_space, float sigma_color, int gaussOff,
+                          int nBBS0, int nBBS1) {
+    SharedMemory<outType> shared;
+    outType *localMem = shared.getPointer();
+    outType *gauss2d  = localMem + gaussOff;
+
+    const int radius                    = max((int)(sigma_space * 1.5f), 1);
+    const int padding                   = 2 * radius;
+    const int window_size               = padding + 1;
+    const int shrdLen                   = THREADS_X + padding;
+    const float variance_range          = sigma_color * sigma_color;
+    const float variance_space          = sigma_space * sigma_space;
+    const float variance_space_neg2     = -2.0 * variance_space;
+    const float inv_variance_range_neg2 = -0.5 / variance_range;
+
+    // gfor batch offsets
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+    const inType *iptr =
+        (const inType *)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    outType *optr =
+        (outType *)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+
+    const int gx = THREADS_X * (blockIdx.x - b2 * nBBS0) + lx;
+    const int gy = THREADS_Y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    // generate gauss2d spatial variance values for block
+    if (lx < window_size && ly < window_size) {
+        int x = lx - radius;
+        int y = ly - radius;
+        gauss2d[ly * window_size + lx] =
+            __expf(((x * x) + (y * y)) / variance_space_neg2);
+    }
+
+    // pull image to local memory
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += blockDim.y, gy2 += blockDim.y) {
+        // move a row set of blockDim.y rows along the columns
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += blockDim.x, gx2 += blockDim.x) {
+            load2ShrdMem<inType, outType>(
+                localMem, iptr, a, b, shrdLen, in.dims[0], in.dims[1],
+                gx2 - radius, gy2 - radius, in.strides[1], in.strides[0]);
+        }
+    }
+
+    __syncthreads();
+
+    if (gx < in.dims[0] && gy < in.dims[1]) {
+        lx += radius;
+        ly += radius;
+        const outType center_color = localMem[ly * shrdLen + lx];
+        outType res                = 0;
+        outType norm               = 0;
+        int joff                   = (ly - radius) * shrdLen + (lx - radius);
+        int goff                   = 0;
+
+#pragma unroll
+        for (int wj = 0; wj < window_size; ++wj) {
+#pragma unroll
+            for (int wi = 0; wi < window_size; ++wi) {
+                const outType tmp_color = localMem[joff + wi];
+                const outType c         = center_color - tmp_color;
+                const outType gauss_range =
+                    __expf(c * c * inv_variance_range_neg2);
+                const outType weight = gauss2d[goff + wi] * gauss_range;
+                norm += weight;
+                res += tmp_color * weight;
+            }
+            joff += shrdLen;
+            goff += window_size;
+        }
+        optr[gy * out.strides[1] + gx] = res / norm;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/bilateral.hpp b/src/backend/cuda/kernel/bilateral.hpp
index 21740e04b1..c32d946792 100644
--- a/src/backend/cuda/kernel/bilateral.hpp
+++ b/src/backend/cuda/kernel/bilateral.hpp
@@ -7,137 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-#include "shared.hpp"
+#include <nvrtc_kernel_headers/bilateral_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
-inline __device__
-int lIdx(int x, int y, int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
-
-inline __device__
-int clamp(int f, int a, int b)
-{
-    return max(a, min(f, b));
-}
-
-inline __device__
-float gaussian1d(float x, float variance)
-{
-    return exp((x * x) / (-2.f * variance));
-}
-
 template<typename inType, typename outType>
-inline __device__
-void load2ShrdMem(outType * shrd, const inType * const in,
-                  int lx, int ly, int shrdStride,
-                  int dim0, int dim1,
-                  int gx, int gy,
-                  int inStride1, int inStride0)
-{
-    shrd[ly*shrdStride+lx] = in[lIdx(clamp(gx, 0, dim0-1), clamp(gy, 0, dim1-1), inStride1, inStride0)];
-}
-
-template<typename inType, typename outType>
-static __global__
-void bilateralKernel(Param<outType> out, CParam<inType> in,
-                     float sigma_space, float sigma_color,
-                     int gaussOff, int nBBS0, int nBBS1)
-{
-    SharedMemory<outType> shared;
-    outType *localMem = shared.getPointer();
-    outType *gauss2d  = localMem + gaussOff;
-
-    const int radius      = max((int)(sigma_space * 1.5f), 1);
-    const int padding     = 2 * radius;
-    const int window_size = padding + 1;
-    const int shrdLen     = THREADS_X + padding;
-    const float variance_range = sigma_color * sigma_color;
-    const float variance_space = sigma_space * sigma_space;
-
-    // gfor batch offsets
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-    const inType* iptr  = (const inType *) in.ptr + (b2 * in.strides[2]  + b3 * in.strides[3] );
-    outType*       optr = (outType *     )out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
-
-    int lx = threadIdx.x;
-    int ly = threadIdx.y;
-
-    const int gx = THREADS_X * (blockIdx.x-b2*nBBS0) + lx;
-    const int gy = THREADS_Y * (blockIdx.y-b3*nBBS1) + ly;
-
-    // generate gauss2d spatial variance values for block
-    if (lx<window_size && ly<window_size) {
-        int x = lx - radius;
-        int y = ly - radius;
-        gauss2d[ly*window_size+lx] = exp( ((x*x) + (y*y)) / (-2.f * variance_space));
-    }
-
-    // pull image to local memory
-    load2ShrdMem<inType, outType>(localMem, iptr, lx, ly, shrdLen,
-            in.dims[0], in.dims[1], gx-radius, gy-radius, in.strides[1], in.strides[0]);
-
-    int lx2 = lx + THREADS_X;
-    int ly2 = ly + THREADS_Y;
-    int gx2 = gx + THREADS_X;
-    int gy2 = gy + THREADS_Y;
-
-    if (lx<padding) {
-        load2ShrdMem<inType, outType>(localMem, iptr, lx2, ly, shrdLen,
-                in.dims[0], in.dims[1], gx2-radius, gy-radius, in.strides[1], in.strides[0]);
-    }
-    if (ly<padding) {
-        load2ShrdMem<inType, outType>(localMem, iptr, lx, ly2, shrdLen,
-                in.dims[0], in.dims[1], gx-radius, gy2-radius, in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2ShrdMem<inType, outType>(localMem, iptr, lx2, ly2, shrdLen,
-                in.dims[0], in.dims[1], gx2-radius, gy2-radius, in.strides[1], in.strides[0]);
-    }
-    __syncthreads();
+void bilateral(Param<outType> out, CParam<inType> in, float s_sigma,
+               float c_sigma) {
+    auto bilateral = common::getKernel(
+        "arrayfire::cuda::bilateral", {{bilateral_cuh_src}},
+        TemplateArgs(TemplateTypename<inType>(), TemplateTypename<outType>()),
+        {{DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
 
-    if (gx<in.dims[0] && gy<in.dims[1]) {
-        lx += radius;
-        ly += radius;
-        const outType center_color = localMem[ly*shrdLen+lx];
-        outType res  = 0;
-        outType norm = 0;
-#pragma unroll
-        for(int wj=0; wj<window_size; ++wj) {
-            int joff = (ly+wj-radius)*shrdLen + (lx-radius);
-            int goff = wj*window_size;
-#pragma unroll
-            for(int wi=0; wi<window_size; ++wi) {
-                const outType tmp_color   = localMem[joff+wi];
-                const outType gauss_range = gaussian1d(center_color - tmp_color, variance_range);
-                const outType weight      = gauss2d[goff+wi] * gauss_range;
-                norm += weight;
-                res  += tmp_color * weight;
-            }
-        }
-        optr[gy*out.strides[1]+gx] = res / norm;
-    }
-}
-
-template<typename inType, typename outType, bool isColor>
-void bilateral(Param<outType> out, CParam<inType> in, float s_sigma, float c_sigma)
-{
     dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
 
     int blk_x = divup(in.dims[0], THREADS_X);
@@ -146,18 +36,28 @@ void bilateral(Param<outType> out, CParam<inType> in, float s_sigma, float c_sig
     dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
 
     // calculate shared memory size
-    int radius = (int)std::max(s_sigma * 1.5f, 1.f);
-    int num_shrd_elems    = (THREADS_X + 2 * radius) * (THREADS_Y + 2 * radius);
-    int num_gauss_elems   = (2 * radius + 1)*(2 * radius + 1);
-    int total_shrd_size   = sizeof(outType) * (num_shrd_elems + num_gauss_elems);
+    int radius          = (int)std::max(s_sigma * 1.5f, 1.f);
+    int num_shrd_elems  = (THREADS_X + 2 * radius) * (THREADS_Y + 2 * radius);
+    int num_gauss_elems = (2 * radius + 1) * (2 * radius + 1);
+    size_t total_shrd_size =
+        sizeof(outType) * (num_shrd_elems + num_gauss_elems);
+
+    size_t MAX_SHRD_SIZE = getDeviceProp(getActiveDeviceId()).sharedMemPerBlock;
+    if (total_shrd_size > MAX_SHRD_SIZE) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nCUDA Bilateral filter doesn't support %f spatial sigma\n",
+                 s_sigma);
+        CUDA_NOT_SUPPORTED(errMessage);
+    }
 
-    bilateralKernel<inType, outType>
-        <<<blocks, threads, total_shrd_size>>>
-        (out, in, s_sigma, c_sigma, num_shrd_elems, blk_x, blk_y);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), total_shrd_size);
 
-    POST_LAUNCH_CHECK();
-}
+    bilateral(qArgs, out, in, s_sigma, c_sigma, num_shrd_elems, blk_x, blk_y);
 
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/canny.cuh b/src/backend/cuda/kernel/canny.cuh
new file mode 100644
index 0000000000..bdd9ac2217
--- /dev/null
+++ b/src/backend/cuda/kernel/canny.cuh
@@ -0,0 +1,317 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+// hasChanged is a variable in kernel space
+// used to track the convergence of
+// the breadth-first search algorithm
+__device__ int hasChanged = 0;
+
+namespace arrayfire {
+namespace cuda {
+
+__forceinline__ __device__ int lIdx(int x, int y, int stride0, int stride1) {
+    return (x * stride0 + y * stride1);
+}
+
+template<typename T>
+__global__ void nonMaxSuppression(Param<float> output, CParam<T> in,
+                                  CParam<T> dx, CParam<T> dy, unsigned nBBS0,
+                                  unsigned nBBS1) {
+    const unsigned SHRD_MEM_WIDTH  = THREADS_X + 2;  // Columns
+    const unsigned SHRD_MEM_HEIGHT = THREADS_Y + 2;  // Rows
+
+    // Declare shared memory with a 1-pixel border
+    __shared__ T shrdMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+
+    // local thread indices
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = blockIdx.x / nBBS0;
+    const unsigned b3 = blockIdx.y / nBBS1;
+
+    // global indices
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + lx;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    // Offset input and output pointers to second pixel of second column/row
+    // to skip the border
+    const T* mag = (const T*)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    const T* dX = (const T*)dx.ptr + (b2 * dx.strides[2] + b3 * dx.strides[3]) +
+                  dx.strides[1] + 1;
+    const T* dY = (const T*)dy.ptr + (b2 * dy.strides[2] + b3 * dy.strides[3]) +
+                  dy.strides[1] + 1;
+    T* out = (float*)output.ptr +
+             (b2 * output.strides[2] + b3 * output.strides[3]) +
+             output.strides[1] + 1;
+
+    // pull image to shared memory
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < SHRD_MEM_HEIGHT && gy2 < in.dims[1];
+         b += blockDim.y, gy2 += blockDim.y)
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < SHRD_MEM_WIDTH && gx2 < in.dims[0];
+             a += blockDim.x, gx2 += blockDim.x)
+            shrdMem[b][a] = mag[lIdx(gx2, gy2, in.strides[0], in.strides[1])];
+
+    int i = lx + 1;
+    int j = ly + 1;
+
+    __syncthreads();
+
+    if (gx < in.dims[0] - 2 && gy < in.dims[1] - 2) {
+        int idx = lIdx(gx, gy, in.strides[0], in.strides[1]);
+
+        const float cmag = shrdMem[j][i];
+
+        if (cmag == 0.0f)
+            out[idx] = (T)0;
+        else {
+            const float dx = dX[idx];
+            const float dy = dY[idx];
+            const float se = shrdMem[j + 1][i + 1];
+            const float nw = shrdMem[j - 1][i - 1];
+            const float ea = shrdMem[j][i + 1];
+            const float we = shrdMem[j][i - 1];
+            const float ne = shrdMem[j - 1][i + 1];
+            const float sw = shrdMem[j + 1][i - 1];
+            const float no = shrdMem[j - 1][i];
+            const float so = shrdMem[j + 1][i];
+
+            float a1, a2, b1, b2, alpha;
+
+            if (dx >= 0) {
+                if (dy >= 0) {
+                    const bool isDxMagGreater = (dx - dy) >= 0;
+
+                    a1    = isDxMagGreater ? ea : so;
+                    a2    = isDxMagGreater ? we : no;
+                    b1    = se;
+                    b2    = nw;
+                    alpha = isDxMagGreater ? dy / dx : dx / dy;
+                } else {
+                    const bool isDyMagGreater = (dx + dy) >= 0;
+
+                    a1    = isDyMagGreater ? ea : no;
+                    a2    = isDyMagGreater ? we : so;
+                    b1    = ne;
+                    b2    = sw;
+                    alpha = isDyMagGreater ? -dy / dx : dx / -dy;
+                }
+            } else {
+                if (dy >= 0) {
+                    const bool isDxMagGreater = (dx + dy) >= 0;
+
+                    a1    = isDxMagGreater ? so : we;
+                    a2    = isDxMagGreater ? no : ea;
+                    b1    = sw;
+                    b2    = ne;
+                    alpha = isDxMagGreater ? -dx / dy : dy / -dx;
+                } else {
+                    const bool isDyMagGreater = (-dx + dy) >= 0;
+
+                    a1    = isDyMagGreater ? we : no;
+                    a2    = isDyMagGreater ? ea : so;
+                    b1    = nw;
+                    b2    = se;
+                    alpha = isDyMagGreater ? dy / dx : dx / dy;
+                }
+            }
+
+            float mag1 = (1 - alpha) * a1 + alpha * b1;
+            float mag2 = (1 - alpha) * a2 + alpha * b2;
+
+            if (cmag > mag1 && cmag > mag2) {
+                out[idx] = cmag;
+            } else {
+                out[idx] = (T)0;
+            }
+        }
+    }
+}
+
+template<typename T>
+__global__ void initEdgeOut(Param<T> output, CParam<T> strong, CParam<T> weak,
+                            unsigned nBBS0, unsigned nBBS1) {
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = blockIdx.x / nBBS0;
+    const unsigned b3 = blockIdx.y / nBBS1;
+
+    // global indices
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + threadIdx.x;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + threadIdx.y;
+
+    // Offset input and output pointers to second pixel of second column/row
+    // to skip the border
+    const T* wPtr = weak.ptr + (b2 * weak.strides[2] + b3 * weak.strides[3]) +
+                    weak.strides[1] + 1;
+    const T* sPtr = strong.ptr +
+                    (b2 * strong.strides[2] + b3 * strong.strides[3]) +
+                    strong.strides[1] + 1;
+    T* oPtr = output.ptr + (b2 * output.strides[2] + b3 * output.strides[3]) +
+              output.strides[1] + 1;
+
+    if (gx < (output.dims[0] - 2) && gy < (output.dims[1] - 2)) {
+        int idx   = lIdx(gx, gy, output.strides[0], output.strides[1]);
+        oPtr[idx] = (sPtr[idx] > 0 ? STRONG : (wPtr[idx] > 0 ? WEAK : NOEDGE));
+    }
+}
+
+#define VALID_BLOCK_IDX(j, i)                             \
+    ((j) > 0 && (j) < (SHRD_MEM_HEIGHT - 1) && (i) > 0 && \
+     (i) < (SHRD_MEM_WIDTH - 1))
+
+template<typename T>
+__global__ void edgeTrack(Param<T> output, unsigned nBBS0, unsigned nBBS1) {
+    const unsigned SHRD_MEM_WIDTH  = THREADS_X + 2;  // Cols
+    const unsigned SHRD_MEM_HEIGHT = THREADS_Y + 2;  // Rows
+
+    // shared memory with 1 pixel border
+    // edge labels are stored as int, so each shared memory tile holds
+    // (16+2)*(16+2) = 324 elements, i.e. 1296 bytes
+    __shared__ int outMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+
+    // local thread indices
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = blockIdx.x / nBBS0;
+    const unsigned b3 = blockIdx.y / nBBS1;
+
+    // global indices
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + lx;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    // Offset output pointer to the current batch only; the 1-pixel border
+    // offset is applied when the result is written back
+    T* oPtr = output.ptr + (b2 * output.strides[2] + b3 * output.strides[3]);
+
+    // pull image to shared memory
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < SHRD_MEM_HEIGHT;
+         b += blockDim.y, gy2 += blockDim.y) {
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < SHRD_MEM_WIDTH;
+             a += blockDim.x, gx2 += blockDim.x) {
+            int x = gx2;
+            int y = gy2;
+            if (x >= 0 && x < output.dims[0] && y >= 0 && y < output.dims[1])
+                outMem[b][a] =
+                    oPtr[lIdx(x, y, output.strides[0], output.strides[1])];
+            else
+                outMem[b][a] = NOEDGE;
+        }
+    }
+
+    int i = lx + 1;
+    int j = ly + 1;
+
+    __syncthreads();
+
+    int continueIter = 1;
+
+    while (continueIter) {
+        int nw, no, ne, we, ea, sw, so, se;
+
+        if (outMem[j][i] == WEAK) {
+            nw = outMem[j - 1][i - 1];
+            no = outMem[j - 1][i];
+            ne = outMem[j - 1][i + 1];
+            we = outMem[j][i - 1];
+            ea = outMem[j][i + 1];
+            sw = outMem[j + 1][i - 1];
+            so = outMem[j + 1][i];
+            se = outMem[j + 1][i + 1];
+
+            bool hasStrongNeighbour =
+                nw == STRONG || no == STRONG || ne == STRONG || ea == STRONG ||
+                se == STRONG || so == STRONG || sw == STRONG || we == STRONG;
+
+            if (hasStrongNeighbour) outMem[j][i] = STRONG;
+        }
+
+        __syncthreads();
+
+        // Check if there are any STRONG pixels with weak neighbours.
+        // This search however ignores 1-pixel border encompassing the
+        // shared memory tile region.
+        bool hasWeakNeighbour = false;
+        if (outMem[j][i] == STRONG) {
+            nw = outMem[j - 1][i - 1] == WEAK && VALID_BLOCK_IDX(j - 1, i - 1);
+            no = outMem[j - 1][i] == WEAK && VALID_BLOCK_IDX(j - 1, i);
+            ne = outMem[j - 1][i + 1] == WEAK && VALID_BLOCK_IDX(j - 1, i + 1);
+            we = outMem[j][i - 1] == WEAK && VALID_BLOCK_IDX(j, i - 1);
+            ea = outMem[j][i + 1] == WEAK && VALID_BLOCK_IDX(j, i + 1);
+            sw = outMem[j + 1][i - 1] == WEAK && VALID_BLOCK_IDX(j + 1, i - 1);
+            so = outMem[j + 1][i] == WEAK && VALID_BLOCK_IDX(j + 1, i);
+            se = outMem[j + 1][i + 1] == WEAK && VALID_BLOCK_IDX(j + 1, i + 1);
+
+            hasWeakNeighbour = nw || no || ne || ea || se || so || sw || we;
+        }
+
+        continueIter = __syncthreads_or(hasWeakNeighbour);
+    };
+
+    // Check if any 1-pixel border ring
+    // has weak pixels with strong candidates
+    // within the main region, then increment hasChanged.
+    int cu = outMem[j][i];
+    int nw = outMem[j - 1][i - 1];
+    int no = outMem[j - 1][i];
+    int ne = outMem[j - 1][i + 1];
+    int ea = outMem[j][i + 1];
+    int se = outMem[j + 1][i + 1];
+    int so = outMem[j + 1][i];
+    int sw = outMem[j + 1][i - 1];
+    int we = outMem[j][i - 1];
+
+    bool hasWeakNeighbour = nw == WEAK || no == WEAK || ne == WEAK ||
+                            ea == WEAK || se == WEAK || so == WEAK ||
+                            sw == WEAK || we == WEAK;
+
+    if (__syncthreads_or(cu == STRONG && hasWeakNeighbour) && lx == 0 &&
+        ly == 0)
+        atomicAdd(&hasChanged, 1);
+
+    // Update output with shared memory result
+    if (gx < (output.dims[0] - 2) && gy < (output.dims[1] - 2))
+        oPtr[lIdx(gx, gy, output.strides[0], output.strides[1]) +
+             output.strides[1] + 1] = outMem[j][i];
+}
+
+template<typename T>
+__global__ void suppressLeftOver(Param<T> output, unsigned nBBS0,
+                                 unsigned nBBS1) {
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = blockIdx.x / nBBS0;
+    const unsigned b3 = blockIdx.y / nBBS1;
+
+    // global indices
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + threadIdx.x;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + threadIdx.y;
+
+    // Offset input and output pointers to second pixel of second column/row
+    // to skip the border
+    T* oPtr = output.ptr + (b2 * output.strides[2] + b3 * output.strides[3]) +
+              output.strides[1] + 1;
+
+    if (gx < (output.dims[0] - 2) && gy < (output.dims[1] - 2)) {
+        int idx = lIdx(gx, gy, output.strides[0], output.strides[1]);
+        T val   = oPtr[idx];
+        if (val == WEAK) oPtr[idx] = NOEDGE;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/canny.hpp b/src/backend/cuda/kernel/canny.hpp
new file mode 100644
index 0000000000..ef3dc6c40c
--- /dev/null
+++ b/src/backend/cuda/kernel/canny.hpp
@@ -0,0 +1,96 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/canny_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const int STRONG = 1;
+static const int WEAK   = 2;
+static const int NOEDGE = 0;
+
+static const int THREADS_X = 16;
+static const int THREADS_Y = 16;
+
+template<typename T>
+void nonMaxSuppression(Param<T> output, CParam<T> magnitude, CParam<T> dx,
+                       CParam<T> dy) {
+    auto nonMaxSuppress = common::getKernel(
+        "arrayfire::cuda::nonMaxSuppression", {{canny_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()),
+        {{DefineValue(STRONG), DefineValue(WEAK), DefineValue(NOEDGE),
+          DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
+
+    dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(magnitude.dims[0] - 2, threads.x);
+    int blk_y = divup(magnitude.dims[1] - 2, threads.y);
+
+    // Launch dims[2] * blk_x blocks along x and dims[3] * blk_y blocks along y
+    dim3 blocks(blk_x * magnitude.dims[2], blk_y * magnitude.dims[3]);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    nonMaxSuppress(qArgs, output, magnitude, dx, dy, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T>
+void edgeTrackingHysteresis(Param<T> output, CParam<T> strong, CParam<T> weak) {
+    auto initEdgeOut = common::getKernel(
+        "arrayfire::cuda::initEdgeOut", {{canny_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()),
+        {{DefineValue(STRONG), DefineValue(WEAK), DefineValue(NOEDGE),
+          DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
+    auto edgeTrack = common::getKernel(
+        "arrayfire::cuda::edgeTrack", {{canny_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()),
+        {{DefineValue(STRONG), DefineValue(WEAK), DefineValue(NOEDGE),
+          DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
+    auto suppressLeftOver = common::getKernel(
+        "arrayfire::cuda::suppressLeftOver", {{canny_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()),
+        {{DefineValue(STRONG), DefineValue(WEAK), DefineValue(NOEDGE),
+          DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
+
+    dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(weak.dims[0] - 2, threads.x);
+    int blk_y = divup(weak.dims[1] - 2, threads.y);
+
+    // Launch dims[2] * blk_x blocks along x and dims[3] * blk_y blocks along y
+    dim3 blocks(blk_x * weak.dims[2], blk_y * weak.dims[3]);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    initEdgeOut(qArgs, output, strong, weak, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+
+    auto flagPtr = edgeTrack.getDevPtr("hasChanged");
+
+    int notFinished = 1;
+    while (notFinished) {
+        notFinished = 0;
+        edgeTrack.setFlag(flagPtr, &notFinished);
+        edgeTrack(qArgs, output, blk_x, blk_y);
+        POST_LAUNCH_CHECK();
+        notFinished = edgeTrack.getFlag(flagPtr);
+    }
+    suppressLeftOver(qArgs, output, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/config.hpp b/src/backend/cuda/kernel/config.hpp
index c879e4fec6..9bef1d7784 100644
--- a/src/backend/cuda/kernel/config.hpp
+++ b/src/backend/cuda/kernel/config.hpp
@@ -9,14 +9,14 @@
 
 #pragma once
 
-namespace cuda
-{
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-    static const uint THREADS_PER_BLOCK = 256;
-    static const uint THREADS_X = 32;
-    static const uint THREADS_Y = THREADS_PER_BLOCK / THREADS_X;
-    static const uint REPEAT    = 32;
-}
-}
+static const uint THREADS_PER_BLOCK = 256;
+static const uint THREADS_X         = 32;
+static const uint THREADS_Y         = THREADS_PER_BLOCK / THREADS_X;
+static const uint REPEAT            = 32;
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve.cu b/src/backend/cuda/kernel/convolve.cu
deleted file mode 100644
index d0ee8b5d86..0000000000
--- a/src/backend/cuda/kernel/convolve.cu
+++ /dev/null
@@ -1,506 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_cuda.hpp>
-#include <math.hpp>
-#include "shared.hpp"
-#include <convolve.hpp>
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int THREADS   = 256;
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-static const int CUBE_X    =  8;
-static const int CUBE_Y    =  8;
-static const int CUBE_Z    =  4;
-
-// below shared MAX_*_LEN's are calculated based on
-// a maximum shared memory configuration of 48KB per block
-// considering complex types as well
-static const int MAX_CONV1_FILTER_LEN = 129;
-static const int MAX_CONV2_FILTER_LEN = 17;
-static const int MAX_CONV3_FILTER_LEN = 5;
-
-// we shall declare the maximum size required of above all three cases
-// and re-use the same constant memory locations for every case
-__constant__ char cFilter[2*(2*(MAX_CONV1_FILTER_LEN-1)+THREADS)*sizeof(double)];
-
-template<typename T, typename aT, bool expand>
-__global__
-void convolve1(Param<T> out, CParam<T> signal, int fLen,
-               int nBBS0, int nBBS1,
-               int o1, int o2, int o3,
-               int s1, int s2, int s3)
-{
-    SharedMemory<T> shared;
-    T * shrdMem = shared.getPointer();
-
-    const int padding = fLen-1;
-    const int shrdLen = blockDim.x + 2*padding;
-    const unsigned b1 = blockIdx.x/nBBS0;   /* [0 {1} 2 3] */
-    const unsigned b3 = blockIdx.y/nBBS1;   /* [0 1 2 {3}] */
-    const unsigned b2 = blockIdx.y-nBBS1*b3;/* [0 1 {2} 3] */
-
-    T *dst = (T *)out.ptr + (b1 * out.strides[1] +  /* activated with batched input signal */
-                             o1 * out.strides[1] +  /* activated with batched input filter */
-                             b2 * out.strides[2] +  /* activated with batched input signal */
-                             o2 * out.strides[2] +  /* activated with batched input filter */
-                             b3 * out.strides[3] +  /* activated with batched input signal */
-                             o3 * out.strides[3]);  /* activated with batched input filter */
-
-    const T *src = (const T *)signal.ptr + (b1 * signal.strides[1] + /* activated with batched input signal */
-                                            s1 * signal.strides[1] + /* activated with batched input filter */
-                                            b2 * signal.strides[2] + /* activated with batched input signal */
-                                            s2 * signal.strides[2] + /* activated with batched input filter */
-                                            b3 * signal.strides[3] + /* activated with batched input signal */
-                                            s3 * signal.strides[3]); /* activated with batched input filter */
-
-    const aT *impulse = (const aT *)cFilter;
-
-    int gx  = blockDim.x*(blockIdx.x-b1*nBBS0);
-
-    int s0 = signal.strides[0];
-    int d0 = signal.dims[0];
-    for (int i=threadIdx.x; i<shrdLen; i+=blockDim.x) {
-        int idx= gx-padding + i;
-        shrdMem[i]  = (idx>=0 && idx<d0) ? src[idx*s0] : scalar<T>(0);
-    }
-    __syncthreads();
-    gx += threadIdx.x;
-
-    if (gx<out.dims[0]) {
-        int lx   = threadIdx.x + padding + (expand ? 0 : fLen>>1);
-        aT accum = scalar<aT>(0);
-        for(int f=0; f<fLen; ++f) {
-            accum = accum + (shrdMem[lx-f]*impulse[f]);
-        }
-        dst[gx] = (T)accum;
-    }
-}
-
-template<typename T, typename aT, bool expand, int fLen0, int fLen1>
-__global__
-void convolve2(Param<T> out, CParam<T> signal, int nBBS0,
-               int nBBS1, int o2, int o3, int s2, int s3)
-{
-    const size_t C_SIZE  = (THREADS_X+2*(fLen0-1))* (THREADS_Y+2*(fLen1-1));
-    __shared__ T shrdMem[C_SIZE];
-
-    const int radius0  = fLen0-1;
-    const int radius1  = fLen1-1;
-    const int padding0 = 2*radius0;
-    const int padding1 = 2*radius1;
-    const int shrdLen0 = THREADS_X + padding0;
-    const int shrdLen1 = THREADS_Y + padding1;
-
-    unsigned b0  = blockIdx.x/nBBS0;
-    unsigned b1  = blockIdx.y/nBBS1;
-    T *dst = (T *)out.ptr + (b0 * out.strides[2] + /* activated with batched input signal */
-                             o2 * out.strides[2] + /* activated with batched input filter */
-                             b1 * out.strides[3] + /* activated with batched input signal */
-                             o3 * out.strides[3]); /* activated with batched input filter */
-
-    const T *src = (const T *)signal.ptr + (b0 * signal.strides[2] + /* activated with batched input signal */
-                                            s2 * signal.strides[2] + /* activated with batched input filter */
-                                            b1 * signal.strides[3] + /* activated with batched input signal */
-                                            s3 * signal.strides[3]); /* activated with batched input filter */
-
-    const aT *impulse  = (const aT *)cFilter;
-
-    int lx  = threadIdx.x;
-    int ly  = threadIdx.y;
-    int gx  = THREADS_X * (blockIdx.x-b0*nBBS0) + lx;
-    int gy  = THREADS_Y * (blockIdx.y-b1*nBBS1) + ly;
-
-    int s0 = signal.strides[0];
-    int s1 = signal.strides[1];
-    int d0 = signal.dims[0];
-    int d1 = signal.dims[1];
-    // below loops are traditional loops, they only run multiple
-    // times filter length is more than launch size
-#pragma unroll
-    for (int b=ly, gy2=gy; b<shrdLen1; b+=THREADS_Y, gy2+=THREADS_Y) {
-        int j = gy2-radius1;
-        bool is_j  = j>=0 && j<d1;
-        // move row_set THREADS_Y along coloumns
-#pragma unroll
-        for (int a=lx, gx2=gx; a<shrdLen0; a+=THREADS_X, gx2+=THREADS_X) {
-            int i = gx2-radius0;
-            bool is_i  = i>=0 && i<d0;
-            shrdMem[b*shrdLen0+a] = (is_i && is_j ? src[i*s0+j*s1] : scalar<T>(0));
-        }
-    }
-    __syncthreads();
-
-    if (gx<out.dims[0] && gy<out.dims[1]) {
-        int ci = lx + radius0 + (expand ? 0 : fLen0>>1);
-        int cj = ly + radius1 + (expand ? 0 : fLen1>>1);
-
-        aT accum = scalar<aT>(0);
-#pragma unroll
-        for(int fj=0; fj<fLen1; ++fj) {
-#pragma unroll
-            for(int fi=0; fi<fLen0; ++fi) {
-                aT f_val = impulse[fj*fLen0+fi];
-                T s_val = shrdMem[(cj-fj)*shrdLen0 + (ci-fi)];
-                accum   = accum + s_val*f_val;
-            }
-        }
-        dst[gy*out.strides[1]+gx] = (T)accum;
-    }
-}
-
-__inline__ __device__
-int index(int i, int j, int k, int jstride, int kstride)
-{
-    return i+j*jstride+k*kstride;
-}
-
-template<typename T, typename aT, bool expand>
-__global__
-void convolve3(Param<T> out, CParam<T> signal, int fLen0, int fLen1,
-               int fLen2, int nBBS, int o3, int s3)
-{
-    SharedMemory<T> shared;
-
-    T * shrdMem       = shared.getPointer();
-    int radius0  = fLen0-1;
-    int radius1  = fLen1-1;
-    int radius2  = fLen2-1;
-    int shrdLen0 = blockDim.x + 2*radius0;
-    int shrdLen1 = blockDim.y + 2*radius1;
-    int shrdLen2 = blockDim.z + 2*radius2;
-    int skStride = shrdLen0 * shrdLen1;
-    int fStride  = fLen0 * fLen1;
-    unsigned b2  = blockIdx.x/nBBS;
-
-    T *dst = (T *)out.ptr + (b2 * out.strides[3] + /* activated with batched input signal */
-                             o3 * out.strides[3]); /* activated with batched input filter */
-
-    const T *src = (const T *)signal.ptr + (b2 * signal.strides[3] + /* activated with batched input signal */
-                                            s3 * signal.strides[3]); /* activated with batched input filter */
-
-    const aT *impulse  = (const aT *)cFilter;
-
-    int lx  = threadIdx.x;
-    int ly  = threadIdx.y;
-    int lz  = threadIdx.z;
-    int gx  = blockDim.x * (blockIdx.x-b2*nBBS) + lx;
-    int gy  = blockDim.y * blockIdx.y + ly;
-    int gz  = blockDim.z * blockIdx.z + lz;
-
-    int s0 = signal.strides[0];
-    int s1 = signal.strides[1];
-    int s2 = signal.strides[2];
-    int d0 = signal.dims[0];
-    int d1 = signal.dims[1];
-    int d2 = signal.dims[2];
-#pragma unroll
-    for (int c=lz, gz2=gz; c<shrdLen2; c+=CUBE_Z, gz2+=CUBE_Z) {
-        int k = gz2-radius2;
-        bool is_k  = k>=0 && k<d2;
-#pragma unroll
-        for (int b=ly, gy2=gy; b<shrdLen1; b+=CUBE_Y, gy2+=CUBE_Y) {
-            int j = gy2-radius1;
-            bool is_j  = j>=0 && j<d1;
-#pragma unroll
-            for (int a=lx, gx2=gx; a<shrdLen0; a+=CUBE_X, gx2+=CUBE_X) {
-                int i = gx2-radius0;
-                bool is_i  = i>=0 && i<d0;
-                shrdMem[c*skStride+b*shrdLen0+a] =
-                    (is_i && is_j && is_k ? src[i*s0+j*s1+k*s2] : scalar<T>(0));
-            }
-        }
-    }
-    __syncthreads();
-
-    if (gx<out.dims[0] && gy<out.dims[1] && gz<out.dims[2]) {
-        int ci = lx + radius0 + (expand ? 0 : fLen0>>1);
-        int cj = ly + radius1 + (expand ? 0 : fLen1>>1);
-        int ck = lz + radius2 + (expand ? 0 : fLen2>>1);
-
-        aT accum = scalar<aT>(0);
-#pragma unroll
-        for(int fk=0; fk<fLen2; ++fk) {
-#pragma unroll
-            for(int fj=0; fj<fLen1; ++fj) {
-#pragma unroll
-                for(int fi=0; fi<fLen0; ++fi) {
-                    aT f_val = impulse[index(fi, fj, fk, fLen0, fStride)];
-                    T s_val = shrdMem[index(ci-fi, cj-fj, ck-fk, shrdLen0, skStride)];
-                    accum   = accum + s_val*f_val;
-                }
-            }
-        }
-        dst[index(gx, gy, gz, out.strides[1], out.strides[2])] = (T)accum;
-    }
-}
-
-struct conv_kparam_t {
-    dim3              mBlocks;
-    dim3             mThreads;
-    size_t        mSharedSize;
-    int           mBlk_x;
-    int           mBlk_y;
-    bool       outHasNoOffset;
-    bool        inHasNoOffset;
-    bool     launchMoreBlocks;
-    int             o[3];
-    int             s[3];
-};
-
-template<typename T>
-void prepareKernelArgs(conv_kparam_t &params, dim_t oDims[], dim_t fDims[], int baseDim)
-{
-    int batchDims[4] = {1, 1, 1, 1};
-    for(int i=baseDim; i<4; ++i) {
-        batchDims[i] = (params.launchMoreBlocks ? 1 : oDims[i]);
-    }
-
-    if (baseDim==1) {
-        params.mThreads    = dim3(THREADS, 1);
-        params.mBlk_x      = divup(oDims[0], params.mThreads.x);
-        params.mBlk_y      = batchDims[2];
-        params.mBlocks     = dim3(params.mBlk_x * batchDims[1], params.mBlk_y * batchDims[3]);
-        params.mSharedSize = (params.mThreads.x+2*(fDims[0]-1)) * sizeof(T);
-    } else if (baseDim==2) {
-        params.mThreads    = dim3(THREADS_X, THREADS_Y);
-        params.mBlk_x      = divup(oDims[0], params.mThreads.x);
-        params.mBlk_y      = divup(oDims[1], params.mThreads.y);
-        params.mBlocks     = dim3(params.mBlk_x * batchDims[2], params.mBlk_y * batchDims[3]);
-    } else if (baseDim==3) {
-        params.mThreads    = dim3(CUBE_X, CUBE_Y, CUBE_Z);
-        params.mBlk_x      = divup(oDims[0], params.mThreads.x);
-        params.mBlk_y      = divup(oDims[1], params.mThreads.y);
-        int blk_z     = divup(oDims[2], params.mThreads.z);
-        params.mBlocks     = dim3(params.mBlk_x * batchDims[3], params.mBlk_y, blk_z);
-        params.mSharedSize = (params.mThreads.x+2*(fDims[0]-1)) *
-                             (params.mThreads.y+2*(fDims[1]-1)) *
-                             (params.mThreads.z+2*(fDims[2]-1)) * sizeof(T);
-    }
-}
-
-template<typename T, typename aT, bool expand, int f0, int f1>
-void conv2Helper(const conv_kparam_t &p, Param<T> out, CParam<T> sig)
-{
-    (convolve2<T, aT, expand, f0, f1>)
-        <<<p.mBlocks, p.mThreads>>>(out, sig, p.mBlk_x, p.mBlk_y, p.o[1], p.o[2], p.s[1], p.s[2]);
-
-    POST_LAUNCH_CHECK();
-}
-
-template<typename T, typename aT, bool expand, int f0>
-void conv2Helper(const conv_kparam_t &p, Param<T> out, CParam<T> sig, int f1)
-{
-    switch(f1) {
-        case  1: conv2Helper<T, aT, expand, f0,  1>(p, out, sig); break;
-        case  2: conv2Helper<T, aT, expand, f0,  2>(p, out, sig); break;
-        case  3: conv2Helper<T, aT, expand, f0,  3>(p, out, sig); break;
-        case  4: conv2Helper<T, aT, expand, f0,  4>(p, out, sig); break;
-        case  5: conv2Helper<T, aT, expand, f0,  5>(p, out, sig); break;
-        default: CUDA_NOT_SUPPORTED();
-    }
-}
-
-template<typename T, typename aT, bool expand>
-void conv2Helper(const conv_kparam_t &p, Param<T> out, CParam<T> sig, int f0, int f1)
-{
-    switch(f0) {
-        case  1: conv2Helper<T, aT, expand,  1>(p, out, sig, f1); break;
-        case  2: conv2Helper<T, aT, expand,  2>(p, out, sig, f1); break;
-        case  3: conv2Helper<T, aT, expand,  3>(p, out, sig, f1); break;
-        case  4: conv2Helper<T, aT, expand,  4>(p, out, sig, f1); break;
-        case  5: conv2Helper<T, aT, expand,  5>(p, out, sig, f1); break;
-        default: {
-                     if (f0==f1) {
-                         switch(f1) {
-                             case  6: conv2Helper<T, aT, expand,  6,  6>(p, out, sig); break;
-                             case  7: conv2Helper<T, aT, expand,  7,  7>(p, out, sig); break;
-                             case  8: conv2Helper<T, aT, expand,  8,  8>(p, out, sig); break;
-                             case  9: conv2Helper<T, aT, expand,  9,  9>(p, out, sig); break;
-                             case 10: conv2Helper<T, aT, expand, 10, 10>(p, out, sig); break;
-                             case 11: conv2Helper<T, aT, expand, 11, 11>(p, out, sig); break;
-                             case 12: conv2Helper<T, aT, expand, 12, 12>(p, out, sig); break;
-                             case 13: conv2Helper<T, aT, expand, 13, 13>(p, out, sig); break;
-                             case 14: conv2Helper<T, aT, expand, 14, 14>(p, out, sig); break;
-                             case 15: conv2Helper<T, aT, expand, 15, 15>(p, out, sig); break;
-                             case 16: conv2Helper<T, aT, expand, 16, 16>(p, out, sig); break;
-                             case 17: conv2Helper<T, aT, expand, 17, 17>(p, out, sig); break;
-                             default: CUDA_NOT_SUPPORTED();
-                         }
-                     } else
-                         CUDA_NOT_SUPPORTED();
-                 } break;
-    }
-}
-
-template<typename T, typename aT, bool expand>
-void convolve_1d(conv_kparam_t &p, Param<T> out, CParam<T> sig, CParam<aT> filt)
-{
-    prepareKernelArgs<T>(p, out.dims, filt.dims, 1);
-
-    int filterLen = filt.dims[0];
-
-    for (int b3=0; b3<filt.dims[3]; ++b3) {
-        int f3Off = b3 * filt.strides[3];
-
-        for (int b2=0; b2<filt.dims[2]; ++b2) {
-            int f2Off = b2 * filt.strides[2];
-
-            for (int b1=0; b1<filt.dims[1]; ++b1) {
-                int f1Off = b1 * filt.strides[1];
-
-                // FIXME: if the filter array is strided, direct copy of symbols
-                // might cause issues
-                CUDA_CHECK(cudaMemcpyToSymbol(kernel::cFilter,
-                                              filt.ptr+(f1Off+f2Off+f3Off),
-                                              filterLen*sizeof(aT),
-                                              0, cudaMemcpyDeviceToDevice));
-
-                p.o[0] = (p.outHasNoOffset ? 0 : b1);
-                p.o[1] = (p.outHasNoOffset ? 0 : b2);
-                p.o[2] = (p.outHasNoOffset ? 0 : b3);
-                p.s[0] = (p.inHasNoOffset ? 0 : b1);
-                p.s[1] = (p.inHasNoOffset ? 0 : b2);
-                p.s[2] = (p.inHasNoOffset ? 0 : b3);
-
-                (convolve1<T, aT, expand>)
-                    <<<p.mBlocks, p.mThreads, p.mSharedSize>>>
-                    (out, sig, filt.dims[0], p.mBlk_x, p.mBlk_y,
-                     p.o[0], p.o[1], p.o[2], p.s[0], p.s[1], p.s[2]);
-
-                POST_LAUNCH_CHECK();
-            }
-        }
-    }
-}
-
-template<typename T, typename aT, bool expand>
-void convolve_2d(conv_kparam_t &p, Param<T> out, CParam<T> sig, CParam<aT> filt)
-{
-    prepareKernelArgs<T>(p, out.dims, filt.dims, 2);
-
-    int filterLen = filt.dims[0] * filt.dims[1];
-
-    for (int b3=0; b3<filt.dims[3]; ++b3) {
-        int f3Off = b3 * filt.strides[3];
-
-        for (int b2=0; b2<filt.dims[2]; ++b2) {
-            int f2Off = b2 * filt.strides[2];
-
-            // FIXME: if the filter array is strided, direct copy of symbols
-            // might cause issues
-            CUDA_CHECK(cudaMemcpyToSymbol(kernel::cFilter,
-                                          filt.ptr+(f2Off+f3Off),
-                                          filterLen*sizeof(aT),
-                                          0, cudaMemcpyDeviceToDevice));
-
-            p.o[1] = (p.outHasNoOffset ? 0 : b2);
-            p.o[2] = (p.outHasNoOffset ? 0 : b3);
-            p.s[1] = (p.inHasNoOffset ? 0 : b2);
-            p.s[2] = (p.inHasNoOffset ? 0 : b3);
-
-            conv2Helper<T, aT, expand>(p, out, sig, filt.dims[0], filt.dims[1]);
-        }
-    }
-}
-
-template<typename T, typename aT, bool expand>
-void convolve_3d(conv_kparam_t &p, Param<T> out, CParam<T> sig, CParam<aT> filt)
-{
-    prepareKernelArgs<T>(p, out.dims, filt.dims, 3);
-
-    int filterLen = filt.dims[0] * filt.dims[1] * filt.dims[2];
-
-    for (int b3=0; b3<filt.dims[3]; ++b3) {
-        int f3Off = b3 * filt.strides[3];
-
-        // FIXME: if the filter array is strided, direct copy of symbols
-        // might cause issues
-        CUDA_CHECK(cudaMemcpyToSymbol(kernel::cFilter,
-                    filt.ptr+f3Off,
-                    filterLen*sizeof(aT),
-                    0, cudaMemcpyDeviceToDevice));
-
-        p.o[2] = (p.outHasNoOffset ? 0 : b3);
-        p.s[2] = (p.inHasNoOffset ? 0 : b3);
-
-        (convolve3<T, aT, expand>)
-            <<<p.mBlocks, p.mThreads, p.mSharedSize>>>
-            (out, sig, filt.dims[0], filt.dims[1], filt.dims[2], p.mBlk_x, p.o[2], p.s[2]);
-
-        POST_LAUNCH_CHECK();
-    }
-}
-
-template<typename T, typename aT, int baseDim, bool expand>
-void convolve_nd(Param<T> out, CParam<T> signal, CParam<aT> filt, ConvolveBatchKind kind)
-{
-    bool callKernel = true;
-
-    int MCFL2 = kernel::MAX_CONV2_FILTER_LEN;
-    int MCFL3 = kernel::MAX_CONV3_FILTER_LEN;
-    switch(baseDim) {
-        case 1: if (filt.dims[0]>kernel::MAX_CONV1_FILTER_LEN) callKernel = false; break;
-        case 2: if ((filt.dims[0]*filt.dims[1]) > (MCFL2 * MCFL2)) callKernel = false; break;
-        case 3: if ((filt.dims[0]*filt.dims[1]*filt.dims[2]) > (MCFL3 * MCFL3 * MCFL3)) callKernel = false; break;
-    }
-
-    if (!callKernel) { CUDA_NOT_SUPPORTED(); }
-
-    conv_kparam_t param;
-    for (int i=0; i<3; ++i) {
-        param.o[i] = 0;
-        param.s[i] = 0;
-    }
-    param.launchMoreBlocks = kind==MANY2MANY || kind==ONE2MANY;
-    param.outHasNoOffset = kind==MANY2ONE || kind==ONE2ONE;
-    param.inHasNoOffset  = kind!=MANY2MANY;
-
-    switch(baseDim) {
-        case 1: convolve_1d<T, aT, expand>(param, out, signal, filt); break;
-        case 2: convolve_2d<T, aT, expand>(param, out, signal, filt); break;
-        case 3: convolve_3d<T, aT, expand>(param, out, signal, filt); break;
-    }
-
-    POST_LAUNCH_CHECK();
-}
-
-#define INSTANTIATE(T, aT)  \
-	template void convolve_nd<T, aT, 1, true >(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-	template void convolve_nd<T, aT, 1, false>(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-	template void convolve_nd<T, aT, 2, true >(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-	template void convolve_nd<T, aT, 2, false>(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-	template void convolve_nd<T, aT, 3, true >(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-	template void convolve_nd<T, aT, 3, false>(Param<T> out, CParam<T> signal, CParam<aT> filter, ConvolveBatchKind kind);\
-
-
-INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
-
-}
-
-}
diff --git a/src/backend/cuda/kernel/convolve.hpp b/src/backend/cuda/kernel/convolve.hpp
index 47e0267d62..38339f2de2 100644
--- a/src/backend/cuda/kernel/convolve.hpp
+++ b/src/backend/cuda/kernel/convolve.hpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2019, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -7,26 +7,331 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/defines.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-#include "shared.hpp"
+#include <nvrtc_kernel_headers/convolve1_cuh.hpp>
+#include <nvrtc_kernel_headers/convolve2_cuh.hpp>
+#include <nvrtc_kernel_headers/convolve3_cuh.hpp>
+#include <nvrtc_kernel_headers/convolve_separable_cuh.hpp>
+#include <traits.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const int CONV_THREADS = 256;
+
+static const int CONV2_THREADS_X = 16;
+static const int CONV2_THREADS_Y = 16;
+
+static const int CONV3_CUBE_X = 8;
+static const int CONV3_CUBE_Y = 8;
+static const int CONV3_CUBE_Z = 4;
+
+// The MAX_*_LEN values below are derived from a maximum shared memory
+// configuration of 48KB per block, taking complex types into account
+// as well
+static const int MAX_CONV1_FILTER_LEN = 129;
+static const int MAX_CONV2_FILTER_LEN = 17;
+static const int MAX_CONV3_FILTER_LEN = 5;
+
+constexpr static const char* conv_c_name  = "cFilter";
+constexpr static const char* sconv_c_name = "sFilter";
+
+struct conv_kparam_t {
+    dim3 mBlocks;
+    dim3 mThreads;
+    size_t mSharedSize;
+    int mBlk_x;
+    int mBlk_y;
+    bool outHasNoOffset;
+    bool inHasNoOffset;
+    bool launchMoreBlocks;
+    int o[3];
+    int s[3];
+};
+
+template<typename T>
+void prepareKernelArgs(conv_kparam_t& params, dim_t oDims[], dim_t fDims[],
+                       int baseDim) {
+    int batchDims[4] = {1, 1, 1, 1};
+    for (int i = baseDim; i < 4; ++i) {
+        batchDims[i] = (params.launchMoreBlocks ? 1 : oDims[i]);
+    }
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    if (baseDim == 1) {
+        params.mThreads = dim3(CONV_THREADS, 1);
+        params.mBlk_x   = divup(oDims[0], params.mThreads.x);
+        params.mBlk_y   = batchDims[2];
+        params.mBlocks =
+            dim3(params.mBlk_x * batchDims[1], params.mBlk_y * batchDims[3]);
+        params.mSharedSize =
+            (params.mThreads.x + 2 * (fDims[0] - 1)) * sizeof(T);
+        params.mBlocks.z = divup(params.mBlocks.y, maxBlocksY);
+        params.mBlocks.y = divup(params.mBlocks.y, params.mBlocks.z);
+    } else if (baseDim == 2) {
+        params.mThreads = dim3(CONV2_THREADS_X, CONV2_THREADS_Y);
+        params.mBlk_x   = divup(oDims[0], params.mThreads.x);
+        params.mBlk_y   = divup(oDims[1], params.mThreads.y);
+        params.mBlocks =
+            dim3(params.mBlk_x * batchDims[2], params.mBlk_y * batchDims[3]);
+        params.mBlocks.z = divup(params.mBlocks.y, maxBlocksY);
+        params.mBlocks.y = divup(params.mBlocks.y, params.mBlocks.z);
+    } else if (baseDim == 3) {
+        params.mThreads = dim3(CONV3_CUBE_X, CONV3_CUBE_Y, CONV3_CUBE_Z);
+        params.mBlk_x   = divup(oDims[0], params.mThreads.x);
+        params.mBlk_y   = divup(oDims[1], params.mThreads.y);
+        int blk_z       = divup(oDims[2], params.mThreads.z);
+        params.mBlocks =
+            dim3(params.mBlk_x * batchDims[3], params.mBlk_y, blk_z);
+        params.mSharedSize = (params.mThreads.x + 2 * (fDims[0] - 1)) *
+                             (params.mThreads.y + 2 * (fDims[1] - 1)) *
+                             (params.mThreads.z + 2 * (fDims[2] - 1)) *
+                             sizeof(T);
+    }
+}
+
+template<typename T, typename aT>
+void convolve_1d(conv_kparam_t& p, Param<T> out, CParam<T> sig, CParam<aT> filt,
+                 const bool expand) {
+    auto convolve1 = common::getKernel(
+        "arrayfire::cuda::convolve1", {{convolve1_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateTypename<aT>(),
+                     TemplateArg(expand)),
+        {{DefineValue(MAX_CONV1_FILTER_LEN), DefineValue(CONV_THREADS)}});
+
+    prepareKernelArgs<T>(p, out.dims, filt.dims, 1);
+
+    size_t filterSize = filt.dims[0] * sizeof(aT);
+
+    for (int b3 = 0; b3 < filt.dims[3]; ++b3) {
+        int f3Off = b3 * filt.strides[3];
+
+        for (int b2 = 0; b2 < filt.dims[2]; ++b2) {
+            int f2Off = b2 * filt.strides[2];
+
+            for (int b1 = 0; b1 < filt.dims[1]; ++b1) {
+                int f1Off      = b1 * filt.strides[1];
+                const aT* fptr = filt.ptr + (f1Off + f2Off + f3Off);
+
+                // FIXME: handle the case where the filter array is strided
+                auto constMemPtr = convolve1.getDevPtr(conv_c_name);
+                convolve1.copyToReadOnly(constMemPtr,
+                                         reinterpret_cast<CUdeviceptr>(fptr),
+                                         filterSize);
+
+                p.o[0] = (p.outHasNoOffset ? 0 : b1);
+                p.o[1] = (p.outHasNoOffset ? 0 : b2);
+                p.o[2] = (p.outHasNoOffset ? 0 : b3);
+                p.s[0] = (p.inHasNoOffset ? 0 : b1);
+                p.s[1] = (p.inHasNoOffset ? 0 : b2);
+                p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+                EnqueueArgs qArgs(p.mBlocks, p.mThreads, getActiveStream(),
+                                  p.mSharedSize);
+                convolve1(qArgs, out, sig, filt.dims[0], p.mBlk_x, p.mBlk_y,
+                          p.o[0], p.o[1], p.o[2], p.s[0], p.s[1], p.s[2]);
+                POST_LAUNCH_CHECK();
+            }
+        }
+    }
+}
+
+template<typename T, typename aT>
+void conv2Helper(const conv_kparam_t& p, Param<T> out, CParam<T> sig,
+                 const aT* fptr, int f0, int f1, const bool expand) {
+    const bool isFilterSizeLt5  = (f0 <= 5 && f1 <= 5);
+    const bool isFilterGt5AndSq = (f0 == f1 && f0 > 5 && f0 < 18);
+
+    if (!(isFilterSizeLt5 || isFilterGt5AndSq)) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nCUDA Convolution doesn't support %dx%d kernel\n", f0, f1);
+        CUDA_NOT_SUPPORTED(errMessage);
+    }
+
+    auto convolve2 = common::getKernel(
+        "arrayfire::cuda::convolve2", {{convolve2_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateTypename<aT>(),
+                     TemplateArg(expand), TemplateArg(f0), TemplateArg(f1)),
+        {{DefineValue(MAX_CONV1_FILTER_LEN), DefineValue(CONV_THREADS),
+          DefineValue(CONV2_THREADS_X), DefineValue(CONV2_THREADS_Y)}});
+
+    // FIXME: handle the case where the filter array is strided
+    auto constMemPtr = convolve2.getDevPtr(conv_c_name);
+    convolve2.copyToReadOnly(constMemPtr, reinterpret_cast<CUdeviceptr>(fptr),
+                             f0 * f1 * sizeof(aT));
+
+    EnqueueArgs qArgs(p.mBlocks, p.mThreads, getActiveStream());
+    convolve2(qArgs, out, sig, p.mBlk_x, p.mBlk_y, p.o[1], p.o[2], p.s[1],
+              p.s[2]);
+    POST_LAUNCH_CHECK();
+}
 
-namespace cuda
-{
+template<typename T, typename aT>
+void convolve_2d(conv_kparam_t& p, Param<T> out, CParam<T> sig, CParam<aT> filt,
+                 const bool expand) {
+    prepareKernelArgs<T>(p, out.dims, filt.dims, 2);
 
-namespace kernel
-{
+    for (int b3 = 0; b3 < filt.dims[3]; ++b3) {
+        int f3Off = b3 * filt.strides[3];
 
-template<typename T, typename accType, int baseDim, bool expand>
-void convolve_nd(Param<T> out, CParam<T> signal, CParam<accType> filter, ConvolveBatchKind kind);
+        for (int b2 = 0; b2 < filt.dims[2]; ++b2) {
+            int f2Off = b2 * filt.strides[2];
 
-template<typename T, typename accType, int conv_dim, bool expand>
-void convolve2(Param<T> out, CParam<T> signal, CParam<accType> filter);
+            const aT* fptr = filt.ptr + (f2Off + f3Off);
 
+            p.o[1] = (p.outHasNoOffset ? 0 : b2);
+            p.o[2] = (p.outHasNoOffset ? 0 : b3);
+            p.s[1] = (p.inHasNoOffset ? 0 : b2);
+            p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+            conv2Helper<T, aT>(p, out, sig, fptr, filt.dims[0], filt.dims[1],
+                               expand);
+        }
+    }
 }
 
+template<typename T, typename aT>
+void convolve_3d(conv_kparam_t& p, Param<T> out, CParam<T> sig, CParam<aT> filt,
+                 const bool expand) {
+    auto convolve3 = common::getKernel(
+        "arrayfire::cuda::convolve3", {{convolve3_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateTypename<aT>(),
+                     TemplateArg(expand)),
+        {{DefineValue(MAX_CONV1_FILTER_LEN), DefineValue(CONV_THREADS),
+          DefineValue(CONV3_CUBE_X), DefineValue(CONV3_CUBE_Y),
+          DefineValue(CONV3_CUBE_Z)}});
+
+    prepareKernelArgs<T>(p, out.dims, filt.dims, 3);
+
+    size_t filterSize = filt.dims[0] * filt.dims[1] * filt.dims[2] * sizeof(aT);
+
+    for (int b3 = 0; b3 < filt.dims[3]; ++b3) {
+        int f3Off = b3 * filt.strides[3];
+
+        const aT* fptr = filt.ptr + f3Off;
+
+        // FIXME: strided filter arrays are not handled by this direct copy
+        auto constMemPtr = convolve3.getDevPtr(conv_c_name);
+        convolve3.copyToReadOnly(
+            constMemPtr, reinterpret_cast<CUdeviceptr>(fptr), filterSize);
+
+        p.o[2] = (p.outHasNoOffset ? 0 : b3);
+        p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+        EnqueueArgs qArgs(p.mBlocks, p.mThreads, getActiveStream(),
+                          p.mSharedSize);
+        convolve3(qArgs, out, sig, filt.dims[0], filt.dims[1], filt.dims[2],
+                  p.mBlk_x, p.o[2], p.s[2]);
+        POST_LAUNCH_CHECK();
+    }
 }
+
+template<typename T, typename aT>
+void convolve_nd(Param<T> out, CParam<T> signal, CParam<aT> filt,
+                 AF_BATCH_KIND kind, int baseDim, bool expand) {
+    bool callKernel = true;
+
+    int MCFL2 = kernel::MAX_CONV2_FILTER_LEN;
+    int MCFL3 = kernel::MAX_CONV3_FILTER_LEN;
+    switch (baseDim) {
+        case 1:
+            if (filt.dims[0] > kernel::MAX_CONV1_FILTER_LEN) callKernel = false;
+            break;
+        case 2:
+            if ((filt.dims[0] * filt.dims[1]) > (MCFL2 * MCFL2))
+                callKernel = false;
+            break;
+        case 3:
+            if ((filt.dims[0] * filt.dims[1] * filt.dims[2]) >
+                (MCFL3 * MCFL3 * MCFL3))
+                callKernel = false;
+            break;
+    }
+
+    if (!callKernel) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nCUDA N Dimensional Convolution doesn't support "
+                 "%lldx%lldx%lld kernel\n",
+                 filt.dims[0], filt.dims[1], filt.dims[2]);
+        CUDA_NOT_SUPPORTED(errMessage);
+    }
+
+    conv_kparam_t param;
+    for (int i = 0; i < 3; ++i) {
+        param.o[i] = 0;
+        param.s[i] = 0;
+    }
+    param.launchMoreBlocks = kind == AF_BATCH_SAME || kind == AF_BATCH_RHS;
+    param.outHasNoOffset   = kind == AF_BATCH_LHS || kind == AF_BATCH_NONE;
+    param.inHasNoOffset    = kind != AF_BATCH_SAME;
+
+    switch (baseDim) {
+        case 1: convolve_1d<T, aT>(param, out, signal, filt, expand); break;
+        case 2: convolve_2d<T, aT>(param, out, signal, filt, expand); break;
+        case 3: convolve_3d<T, aT>(param, out, signal, filt, expand); break;
+    }
+
+    POST_LAUNCH_CHECK();
+}
+
+static const int SCONV_THREADS_X = 16;
+static const int SCONV_THREADS_Y = 16;
+
+// The shared-memory MAX_*_LEN limits below are calculated for
+// a maximum shared memory configuration of 48KB per block,
+// taking complex types into account as well
+static const int MAX_SCONV_FILTER_LEN = 31;
+
+template<typename T, typename aT>
+void convolve2(Param<T> out, CParam<T> signal, CParam<aT> filter, int conv_dim,
+               bool expand) {
+    int fLen =
+        filter.dims[0] * filter.dims[1] * filter.dims[2] * filter.dims[3];
+
+    if (fLen > kernel::MAX_SCONV_FILTER_LEN) {
+        // TODO: fall back to an FFT-based convolution
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nCUDA convolution supports max kernel size of %d\n",
+                 kernel::MAX_SCONV_FILTER_LEN);
+        CUDA_NOT_SUPPORTED(errMessage);
+    }
+
+    auto convolve2_separable = common::getKernel(
+        "arrayfire::cuda::convolve2_separable", {{convolve_separable_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateTypename<aT>(),
+                     TemplateArg(conv_dim), TemplateArg(expand),
+                     TemplateArg(fLen)),
+        {{DefineValue(MAX_SCONV_FILTER_LEN), DefineValue(SCONV_THREADS_X),
+          DefineValue(SCONV_THREADS_Y)}});
+
+    dim3 threads(SCONV_THREADS_X, SCONV_THREADS_Y);
+
+    int blk_x = divup(out.dims[0], threads.x);
+    int blk_y = divup(out.dims[1], threads.y);
+
+    dim3 blocks(blk_x * signal.dims[2], blk_y * signal.dims[3]);
+
+    // FIXME: strided filter arrays are not handled by this direct copy
+    auto constMemPtr = convolve2_separable.getDevPtr(sconv_c_name);
+    convolve2_separable.copyToReadOnly(
+        constMemPtr, reinterpret_cast<CUdeviceptr>(filter.ptr),
+        fLen * sizeof(aT));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    convolve2_separable(qArgs, out, signal, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve1.cuh b/src/backend/cuda/kernel/convolve1.cuh
new file mode 100644
index 0000000000..f82c85427c
--- /dev/null
+++ b/src/backend/cuda/kernel/convolve1.cuh
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+
+__constant__ char cFilter[2 * (2 * (MAX_CONV1_FILTER_LEN - 1) + CONV_THREADS) *
+                          sizeof(double)];
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename aT, bool expand>
+__global__ void convolve1(Param<T> out, CParam<T> signal, int fLen, int nBBS0,
+                          int nBBS1, int o1, int o2, int o3, int s1, int s2,
+                          int s3) {
+    SharedMemory<T> shared;
+    T *shrdMem = shared.getPointer();
+
+    const int padding = fLen - 1;
+    const int shrdLen = blockDim.x + 2 * padding;
+    const unsigned b1 = blockIdx.x / nBBS0; /* [0 {1} 2 3] */
+    const unsigned b3 =
+        (blockIdx.y + blockIdx.z * gridDim.y) / nBBS1; /* [0 1 2 {3}] */
+    const unsigned b2 =
+        (blockIdx.y + blockIdx.z * gridDim.y) - nBBS1 * b3; /* [0 1 {2} 3] */
+    if (b2 >= out.dims[2] || b3 >= out.dims[3]) return;
+
+    T *dst = (T *)out.ptr +
+             (b1 * out.strides[1] + /* activated with batched input signal */
+              o1 * out.strides[1] + /* activated with batched input filter */
+              b2 * out.strides[2] + /* activated with batched input signal */
+              o2 * out.strides[2] + /* activated with batched input filter */
+              b3 * out.strides[3] + /* activated with batched input signal */
+              o3 * out.strides[3]); /* activated with batched input filter */
+
+    const T *src =
+        (const T *)signal.ptr +
+        (b1 * signal.strides[1] + /* activated with batched input signal */
+         s1 * signal.strides[1] + /* activated with batched input filter */
+         b2 * signal.strides[2] + /* activated with batched input signal */
+         s2 * signal.strides[2] + /* activated with batched input filter */
+         b3 * signal.strides[3] + /* activated with batched input signal */
+         s3 * signal.strides[3]); /* activated with batched input filter */
+
+    const aT *impulse = (const aT *)cFilter;
+
+    int gx = blockDim.x * (blockIdx.x - b1 * nBBS0);
+
+    int s0 = signal.strides[0];
+    int d0 = signal.dims[0];
+    for (int i = threadIdx.x; i < shrdLen; i += blockDim.x) {
+        int idx    = gx - padding + i;
+        shrdMem[i] = (idx >= 0 && idx < d0) ? src[idx * s0] : scalar<T>(0);
+    }
+    __syncthreads();
+    gx += threadIdx.x;
+
+    if (gx < out.dims[0]) {
+        int lx   = threadIdx.x + padding + (expand ? 0 : fLen >> 1);
+        aT accum = scalar<aT>(0);
+        for (int f = 0; f < fLen; ++f) {
+            accum = accum + (shrdMem[lx - f] * impulse[f]);
+        }
+        dst[gx] = (T)accum;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve2.cuh b/src/backend/cuda/kernel/convolve2.cuh
new file mode 100644
index 0000000000..3699cb9e51
--- /dev/null
+++ b/src/backend/cuda/kernel/convolve2.cuh
@@ -0,0 +1,101 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+__constant__ char cFilter[2 * (2 * (MAX_CONV1_FILTER_LEN - 1) + CONV_THREADS) *
+                          sizeof(double)];
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename aT, bool expand, int fLen0, int fLen1>
+__global__ void convolve2(Param<T> out, CParam<T> signal, int nBBS0, int nBBS1,
+                          int o2, int o3, int s2, int s3) {
+    const size_t C_SIZE = (CONV2_THREADS_X + 2 * (fLen0 - 1)) *
+                          (CONV2_THREADS_Y + 2 * (fLen1 - 1));
+    __shared__ T shrdMem[C_SIZE];
+
+    const int radius0  = fLen0 - 1;
+    const int radius1  = fLen1 - 1;
+    const int padding0 = 2 * radius0;
+    const int padding1 = 2 * radius1;
+    const int shrdLen0 = CONV2_THREADS_X + padding0;
+    const int shrdLen1 = CONV2_THREADS_Y + padding1;
+
+    unsigned b0 = blockIdx.x / nBBS0;
+    unsigned b1 = (blockIdx.y + blockIdx.z * gridDim.y) / nBBS1;
+    T *dst      = (T *)out.ptr +
+             (b0 * out.strides[2] + /* activated with batched input signal */
+              o2 * out.strides[2] + /* activated with batched input filter */
+              b1 * out.strides[3] + /* activated with batched input signal */
+              o3 * out.strides[3]); /* activated with batched input filter */
+
+    const T *src =
+        (const T *)signal.ptr +
+        (b0 * signal.strides[2] + /* activated with batched input signal */
+         s2 * signal.strides[2] + /* activated with batched input filter */
+         b1 * signal.strides[3] + /* activated with batched input signal */
+         s3 * signal.strides[3]); /* activated with batched input filter */
+
+    const aT *impulse = (const aT *)cFilter;
+
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+    int gx = CONV2_THREADS_X * (blockIdx.x - b0 * nBBS0) + lx;
+    int gy =
+        CONV2_THREADS_Y * ((blockIdx.y + blockIdx.z * gridDim.y) - b1 * nBBS1) +
+        ly;
+
+    if (b1 >= out.dims[3]) return;
+
+    int s0 = signal.strides[0];
+    int s1 = signal.strides[1];
+    int d0 = signal.dims[0];
+    int d1 = signal.dims[1];
+    // the loops below are ordinary loops; they iterate more than
+    // once only if the filter length exceeds the launch size
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < shrdLen1;
+         b += CONV2_THREADS_Y, gy2 += CONV2_THREADS_Y) {
+        int j     = gy2 - radius1;
+        bool is_j = j >= 0 && j < d1;
+        // move row_set CONV2_THREADS_Y along columns
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < shrdLen0;
+             a += CONV2_THREADS_X, gx2 += CONV2_THREADS_X) {
+            int i     = gx2 - radius0;
+            bool is_i = i >= 0 && i < d0;
+            shrdMem[b * shrdLen0 + a] =
+                (is_i && is_j ? src[i * s0 + j * s1] : scalar<T>(0));
+        }
+    }
+    __syncthreads();
+
+    if (gx < out.dims[0] && gy < out.dims[1]) {
+        int ci = lx + radius0 + (expand ? 0 : fLen0 >> 1);
+        int cj = ly + radius1 + (expand ? 0 : fLen1 >> 1);
+
+        aT accum = scalar<aT>(0);
+#pragma unroll
+        for (int fj = 0; fj < fLen1; ++fj) {
+#pragma unroll
+            for (int fi = 0; fi < fLen0; ++fi) {
+                aT f_val = impulse[fj * fLen0 + fi];
+                T s_val  = shrdMem[(cj - fj) * shrdLen0 + (ci - fi)];
+                accum    = accum + s_val * f_val;
+            }
+        }
+        dst[gy * out.strides[1] + gx] = (T)accum;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve3.cuh b/src/backend/cuda/kernel/convolve3.cuh
new file mode 100644
index 0000000000..18ad939054
--- /dev/null
+++ b/src/backend/cuda/kernel/convolve3.cuh
@@ -0,0 +1,111 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+
+__constant__ char cFilter[2 * (2 * (MAX_CONV1_FILTER_LEN - 1) + CONV_THREADS) *
+                          sizeof(double)];
+
+namespace arrayfire {
+namespace cuda {
+
+__inline__ int index(int i, int j, int k, int jstride, int kstride) {
+    return i + j * jstride + k * kstride;
+}
+
+template<typename T, typename aT, bool expand>
+__global__ void convolve3(Param<T> out, CParam<T> signal, int fLen0, int fLen1,
+                          int fLen2, int nBBS, int o3, int s3) {
+    SharedMemory<T> shared;
+
+    T *shrdMem   = shared.getPointer();
+    int radius0  = fLen0 - 1;
+    int radius1  = fLen1 - 1;
+    int radius2  = fLen2 - 1;
+    int shrdLen0 = blockDim.x + 2 * radius0;
+    int shrdLen1 = blockDim.y + 2 * radius1;
+    int shrdLen2 = blockDim.z + 2 * radius2;
+    int skStride = shrdLen0 * shrdLen1;
+    int fStride  = fLen0 * fLen1;
+    unsigned b2  = blockIdx.x / nBBS;
+
+    T *dst = (T *)out.ptr +
+             (b2 * out.strides[3] + /* activated with batched input signal */
+              o3 * out.strides[3]); /* activated with batched input filter */
+
+    const T *src =
+        (const T *)signal.ptr +
+        (b2 * signal.strides[3] + /* activated with batched input signal */
+         s3 * signal.strides[3]); /* activated with batched input filter */
+
+    const aT *impulse = (const aT *)cFilter;
+
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+    int lz = threadIdx.z;
+    int gx = blockDim.x * (blockIdx.x - b2 * nBBS) + lx;
+    int gy = blockDim.y * blockIdx.y + ly;
+    int gz = blockDim.z * blockIdx.z + lz;
+
+    int s0 = signal.strides[0];
+    int s1 = signal.strides[1];
+    int s2 = signal.strides[2];
+    int d0 = signal.dims[0];
+    int d1 = signal.dims[1];
+    int d2 = signal.dims[2];
+#pragma unroll
+    for (int c = lz, gz2 = gz; c < shrdLen2;
+         c += CONV3_CUBE_Z, gz2 += CONV3_CUBE_Z) {
+        int k     = gz2 - radius2;
+        bool is_k = k >= 0 && k < d2;
+#pragma unroll
+        for (int b = ly, gy2 = gy; b < shrdLen1;
+             b += CONV3_CUBE_Y, gy2 += CONV3_CUBE_Y) {
+            int j     = gy2 - radius1;
+            bool is_j = j >= 0 && j < d1;
+#pragma unroll
+            for (int a = lx, gx2 = gx; a < shrdLen0;
+                 a += CONV3_CUBE_X, gx2 += CONV3_CUBE_X) {
+                int i     = gx2 - radius0;
+                bool is_i = i >= 0 && i < d0;
+                shrdMem[c * skStride + b * shrdLen0 + a] =
+                    (is_i && is_j && is_k ? src[i * s0 + j * s1 + k * s2]
+                                          : scalar<T>(0));
+            }
+        }
+    }
+    __syncthreads();
+
+    if (gx < out.dims[0] && gy < out.dims[1] && gz < out.dims[2]) {
+        int ci = lx + radius0 + (expand ? 0 : fLen0 >> 1);
+        int cj = ly + radius1 + (expand ? 0 : fLen1 >> 1);
+        int ck = lz + radius2 + (expand ? 0 : fLen2 >> 1);
+
+        aT accum = scalar<aT>(0);
+#pragma unroll
+        for (int fk = 0; fk < fLen2; ++fk) {
+#pragma unroll
+            for (int fj = 0; fj < fLen1; ++fj) {
+#pragma unroll
+                for (int fi = 0; fi < fLen0; ++fi) {
+                    aT f_val = impulse[index(fi, fj, fk, fLen0, fStride)];
+                    T s_val = shrdMem[index(ci - fi, cj - fj, ck - fk, shrdLen0,
+                                            skStride)];
+                    accum   = accum + s_val * f_val;
+                }
+            }
+        }
+        dst[index(gx, gy, gz, out.strides[1], out.strides[2])] = (T)accum;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve_separable.cpp b/src/backend/cuda/kernel/convolve_separable.cpp
new file mode 100644
index 0000000000..14a62d1f1e
--- /dev/null
+++ b/src/backend/cuda/kernel/convolve_separable.cpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <kernel/convolve.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+#define INSTANTIATE(T, aT) \
+    template void convolve2<T, aT>(Param<T>, CParam<T>, CParam<aT>, int, bool);
+
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/convolve_separable.cu b/src/backend/cuda/kernel/convolve_separable.cu
deleted file mode 100644
index 9b3092d2c7..0000000000
--- a/src/backend/cuda/kernel/convolve_separable.cu
+++ /dev/null
@@ -1,193 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_cuda.hpp>
-#include <math.hpp>
-#include <convolve.hpp>
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-// below shared MAX_*_LEN's are calculated based on
-// a maximum shared memory configuration of 48KB per block
-// considering complex types as well
-static const int MAX_SCONV_FILTER_LEN = 31;
-
-// we shall declare the maximum size required of above all three cases
-// and re-use the same constant memory locations for every case
-__constant__ char sFilter[2*THREADS_Y*(2*(MAX_SCONV_FILTER_LEN-1)+THREADS_X)*sizeof(double)];
-
-template<typename T, typename accType, int conv_dim, bool expand, int fLen>
-__global__
-void convolve2_separable(Param<T> out, CParam<T> signal, int nBBS0, int nBBS1)
-{
-    const int smem_len =   (conv_dim==0 ?
-                                (THREADS_X+2*(fLen-1))* THREADS_Y:
-                                (THREADS_Y+2*(fLen-1))* THREADS_X);
-    __shared__ T shrdMem[smem_len];
-
-    const int radius  = fLen-1;
-    const int padding = 2*radius;
-    const int s0      = signal.strides[0];
-    const int s1      = signal.strides[1];
-    const int d0      = signal.dims[0];
-    const int d1      = signal.dims[1];
-    const int shrdLen = THREADS_X + (conv_dim==0 ? padding : 0);
-
-    unsigned b2  = blockIdx.x/nBBS0;
-    unsigned b3  = blockIdx.y/nBBS1;
-    T *dst       = (T *)out.ptr          + (b2*out.strides[2] + b3*out.strides[3]);
-    const T *src = (const T *)signal.ptr + (b2*signal.strides[2] + b3*signal.strides[3]);
-    const accType *impulse  = (const accType *)sFilter;
-
-    int lx = threadIdx.x;
-    int ly = threadIdx.y;
-    int ox = THREADS_X * (blockIdx.x-b2*nBBS0) + lx;
-    int oy = THREADS_Y * (blockIdx.y-b3*nBBS1) + ly;
-    int gx = ox;
-    int gy = oy;
-
-    // below if-else statement is based on template parameter
-    if (conv_dim==0) {
-        gx += (expand ? 0 : fLen>>1);
-        int endX = ((fLen-1)<<1) + THREADS_X;
-
-#pragma unroll
-        for(int lx = threadIdx.x, glb_x = gx; lx<endX; lx += THREADS_X, glb_x += THREADS_X) {
-            int i = glb_x - radius;
-            int j = gy;
-            bool is_i  = i>=0 && i<d0;
-            bool is_j  = j>=0 && j<d1;
-            shrdMem[ly*shrdLen+lx] = (is_i && is_j ? src[i*s0 + j*s1] : scalar<T>(0));
-        }
-
-    } else if (conv_dim==1) {
-        gy += (expand ? 0 : fLen>>1);
-        int endY = ((fLen-1)<<1) + THREADS_Y;
-
-#pragma unroll
-        for(int ly = threadIdx.y, glb_y = gy; ly<endY; ly += THREADS_Y, glb_y += THREADS_Y) {
-            int i = gx;
-            int j = glb_y - radius;
-            bool is_i  = i>=0 && i<d0;
-            bool is_j  = j>=0 && j<d1;
-            shrdMem[ly*shrdLen+lx] = (is_i && is_j ? src[i*s0 + j*s1] : scalar<T>(0));
-        }
-    }
-    __syncthreads();
-
-    if (ox<out.dims[0] && oy<out.dims[1]) {
-        // below conditional statement is based on template parameter
-        int i  = (conv_dim==0 ? lx : ly) + radius;
-        accType accum = scalar<accType>(0);
-#pragma unroll
-        for(int f=0; f<fLen; ++f) {
-            accType f_val = impulse[f];
-            // below conditional statement is based on template parameter
-            int s_idx = (conv_dim==0 ? (ly*shrdLen+(i-f)) : ((i-f)*shrdLen+lx));
-            T s_val = shrdMem[s_idx];
-            accum   = accum + s_val*f_val;
-        }
-        dst[oy*out.strides[1]+ox] = (T)accum;
-    }
-}
-
-template<typename T, typename aT, int cDim, bool expand, int f>
-void conv2Helper(dim3 blks, dim3 thrds, Param<T> out, CParam<T> sig, int nBBS0, int nBBS1)
-{
-   (convolve2_separable<T, aT, cDim, expand, f>)<<<blks, thrds>>>(out, sig, nBBS0, nBBS1);
-}
-
-template<typename T, typename accType, int conv_dim, bool expand>
-void convolve2(Param<T> out, CParam<T> signal, CParam<accType> filter)
-{
-    int fLen = filter.dims[0] * filter.dims[1] * filter.dims[2] * filter.dims[3];
-    if(fLen > kernel::MAX_SCONV_FILTER_LEN) {
-        // call upon fft
-        CUDA_NOT_SUPPORTED();
-    }
-
-    dim3 threads(THREADS_X, THREADS_Y);
-
-    int blk_x = divup(out.dims[0], threads.x);
-    int blk_y = divup(out.dims[1], threads.y);
-
-    dim3 blocks(blk_x*signal.dims[2], blk_y*signal.dims[3]);
-
-
-   // FIX ME: if the filter array is strided, direct copy of symbols
-   // might cause issues
-   CUDA_CHECK(cudaMemcpyToSymbol(kernel::sFilter, filter.ptr, fLen*sizeof(accType), 0, cudaMemcpyDeviceToDevice));
-
-    switch(fLen) {
-        case  2: conv2Helper<T, accType, conv_dim, expand,  2>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  3: conv2Helper<T, accType, conv_dim, expand,  3>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  4: conv2Helper<T, accType, conv_dim, expand,  4>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  5: conv2Helper<T, accType, conv_dim, expand,  5>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  6: conv2Helper<T, accType, conv_dim, expand,  6>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  7: conv2Helper<T, accType, conv_dim, expand,  7>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  8: conv2Helper<T, accType, conv_dim, expand,  8>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case  9: conv2Helper<T, accType, conv_dim, expand,  9>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 10: conv2Helper<T, accType, conv_dim, expand, 10>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 11: conv2Helper<T, accType, conv_dim, expand, 11>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 12: conv2Helper<T, accType, conv_dim, expand, 12>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 13: conv2Helper<T, accType, conv_dim, expand, 13>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 14: conv2Helper<T, accType, conv_dim, expand, 14>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 15: conv2Helper<T, accType, conv_dim, expand, 15>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 16: conv2Helper<T, accType, conv_dim, expand, 16>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 17: conv2Helper<T, accType, conv_dim, expand, 17>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 18: conv2Helper<T, accType, conv_dim, expand, 18>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 19: conv2Helper<T, accType, conv_dim, expand, 19>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 20: conv2Helper<T, accType, conv_dim, expand, 20>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 21: conv2Helper<T, accType, conv_dim, expand, 21>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 22: conv2Helper<T, accType, conv_dim, expand, 22>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 23: conv2Helper<T, accType, conv_dim, expand, 23>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 24: conv2Helper<T, accType, conv_dim, expand, 24>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 25: conv2Helper<T, accType, conv_dim, expand, 25>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 26: conv2Helper<T, accType, conv_dim, expand, 26>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 27: conv2Helper<T, accType, conv_dim, expand, 27>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 28: conv2Helper<T, accType, conv_dim, expand, 28>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 29: conv2Helper<T, accType, conv_dim, expand, 29>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 30: conv2Helper<T, accType, conv_dim, expand, 30>(blocks, threads, out, signal, blk_x, blk_y); break;
-        case 31: conv2Helper<T, accType, conv_dim, expand, 31>(blocks, threads, out, signal, blk_x, blk_y); break;
-        default: CUDA_NOT_SUPPORTED();
-    }
-
-   POST_LAUNCH_CHECK();
-}
-
-#define INSTANTIATE(T, accType)                                         \
-	template void convolve2<T, accType, 0, true >(Param<T> out, CParam<T> signal, CParam<accType> filter); \
-	template void convolve2<T, accType, 0, false>(Param<T> out, CParam<T> signal, CParam<accType> filter); \
-	template void convolve2<T, accType, 1, true >(Param<T> out, CParam<T> signal, CParam<accType> filter); \
-	template void convolve2<T, accType, 1, false>(Param<T> out, CParam<T> signal, CParam<accType> filter); \
-
-
-INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
-
-}
-
-}
diff --git a/src/backend/cuda/kernel/convolve_separable.cuh b/src/backend/cuda/kernel/convolve_separable.cuh
new file mode 100644
index 0000000000..ead157df92
--- /dev/null
+++ b/src/backend/cuda/kernel/convolve_separable.cuh
@@ -0,0 +1,101 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+__constant__ char sFilter[2 * SCONV_THREADS_Y *
+                          (2 * (MAX_SCONV_FILTER_LEN - 1) + SCONV_THREADS_X) *
+                          sizeof(double)];
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename accType, int conv_dim, bool expand, int fLen>
+__global__ void convolve2_separable(Param<T> out, CParam<T> signal, int nBBS0,
+                                    int nBBS1) {
+    const int smem_len =
+        (conv_dim == 0 ? (SCONV_THREADS_X + 2 * (fLen - 1)) * SCONV_THREADS_Y
+                       : (SCONV_THREADS_Y + 2 * (fLen - 1)) * SCONV_THREADS_X);
+    __shared__ T shrdMem[smem_len];
+
+    const int radius  = fLen - 1;
+    const int padding = 2 * radius;
+    const int s0      = signal.strides[0];
+    const int s1      = signal.strides[1];
+    const int d0      = signal.dims[0];
+    const int d1      = signal.dims[1];
+    const int shrdLen = SCONV_THREADS_X + (conv_dim == 0 ? padding : 0);
+
+    unsigned b2  = blockIdx.x / nBBS0;
+    unsigned b3  = blockIdx.y / nBBS1;
+    T *dst       = (T *)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+    const T *src = (const T *)signal.ptr +
+                   (b2 * signal.strides[2] + b3 * signal.strides[3]);
+    const accType *impulse = (const accType *)sFilter;
+
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+    int ox = SCONV_THREADS_X * (blockIdx.x - b2 * nBBS0) + lx;
+    int oy = SCONV_THREADS_Y * (blockIdx.y - b3 * nBBS1) + ly;
+    int gx = ox;
+    int gy = oy;
+
+    // this if-else depends only on template parameter conv_dim
+    if (conv_dim == 0) {
+        gx += (expand ? 0 : fLen >> 1);
+        int endX = ((fLen - 1) << 1) + SCONV_THREADS_X;
+
+#pragma unroll
+        for (int lx = threadIdx.x, glb_x = gx; lx < endX;
+             lx += SCONV_THREADS_X, glb_x += SCONV_THREADS_X) {
+            int i     = glb_x - radius;
+            int j     = gy;
+            bool is_i = i >= 0 && i < d0;
+            bool is_j = j >= 0 && j < d1;
+            shrdMem[ly * shrdLen + lx] =
+                (is_i && is_j ? src[i * s0 + j * s1] : scalar<T>(0));
+        }
+
+    } else if (conv_dim == 1) {
+        gy += (expand ? 0 : fLen >> 1);
+        int endY = ((fLen - 1) << 1) + SCONV_THREADS_Y;
+
+#pragma unroll
+        for (int ly = threadIdx.y, glb_y = gy; ly < endY;
+             ly += SCONV_THREADS_Y, glb_y += SCONV_THREADS_Y) {
+            int i     = gx;
+            int j     = glb_y - radius;
+            bool is_i = i >= 0 && i < d0;
+            bool is_j = j >= 0 && j < d1;
+            shrdMem[ly * shrdLen + lx] =
+                (is_i && is_j ? src[i * s0 + j * s1] : scalar<T>(0));
+        }
+    }
+    __syncthreads();
+
+    if (ox < out.dims[0] && oy < out.dims[1]) {
+        // this conditional depends only on template parameter conv_dim
+        int i         = (conv_dim == 0 ? lx : ly) + radius;
+        accType accum = scalar<accType>(0);
+#pragma unroll
+        for (int f = 0; f < fLen; ++f) {
+            accType f_val = impulse[f];
+            // this conditional depends only on template parameter conv_dim
+            int s_idx = (conv_dim == 0 ? (ly * shrdLen + (i - f))
+                                       : ((i - f) * shrdLen + lx));
+            T s_val   = shrdMem[s_idx];
+            accum     = accum + s_val * f_val;
+        }
+        dst[oy * out.strides[1] + ox] = (T)accum;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/copy.cuh b/src/backend/cuda/kernel/copy.cuh
new file mode 100644
index 0000000000..20f6bfa021
--- /dev/null
+++ b/src/backend/cuda/kernel/copy.cuh
@@ -0,0 +1,306 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/half.hpp>
+#include <dims_param.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__inline__ __device__ static T scale(T value, double factor) {
+    return (T)(double(value) * factor);
+}
+
+template<>
+__inline__ __device__ cfloat scale<cfloat>(cfloat value, double factor) {
+    return make_cuFloatComplex(value.x * factor, value.y * factor);
+}
+
+template<>
+__inline__ __device__ cdouble scale<cdouble>(cdouble value, double factor) {
+    return make_cuDoubleComplex(value.x * factor, value.y * factor);
+}
+
+template<typename inType, typename outType>
+__inline__ __device__ outType convertType(inType value) {
+    return static_cast<outType>(value);
+}
+
+template<>
+__inline__ __device__ char convertType<compute_t<common::half>, char>(
+    compute_t<common::half> value) {
+    return (char)((short)value);
+}
+
+template<>
+__inline__ __device__ compute_t<common::half>
+convertType<char, compute_t<common::half>>(char value) {
+    return compute_t<common::half>(value);
+}
+
+template<>
+__inline__ __device__ schar
+convertType<compute_t<common::half>, schar>(compute_t<common::half> value) {
+    return (schar)((short)value);
+}
+
+template<>
+__inline__ __device__ compute_t<common::half>
+convertType<schar, compute_t<common::half>>(schar value) {
+    return compute_t<common::half>(value);
+}
+
+template<>
+__inline__ __device__ uchar
+convertType<compute_t<common::half>, uchar>(compute_t<common::half> value) {
+    return (uchar)((short)value);
+}
+
+template<>
+__inline__ __device__ compute_t<common::half>
+convertType<uchar, compute_t<common::half>>(uchar value) {
+    return compute_t<common::half>(value);
+}
+
+template<>
+__inline__ __device__ cdouble convertType<cfloat, cdouble>(cfloat value) {
+    return cuComplexFloatToDouble(value);
+}
+
+template<>
+__inline__ __device__ cfloat convertType<cdouble, cfloat>(cdouble value) {
+    return cuComplexDoubleToFloat(value);
+}
+
+#define OTHER_SPECIALIZATIONS(IN_T)                                        \
+    template<>                                                             \
+    __inline__ __device__ cfloat convertType<IN_T, cfloat>(IN_T value) {   \
+        return make_cuFloatComplex(static_cast<float>(value), 0.0f);       \
+    }                                                                      \
+                                                                           \
+    template<>                                                             \
+    __inline__ __device__ cdouble convertType<IN_T, cdouble>(IN_T value) { \
+        return make_cuDoubleComplex(static_cast<double>(value), 0.0);      \
+    }
+
+OTHER_SPECIALIZATIONS(float)
+OTHER_SPECIALIZATIONS(double)
+OTHER_SPECIALIZATIONS(int)
+OTHER_SPECIALIZATIONS(uint)
+OTHER_SPECIALIZATIONS(intl)
+OTHER_SPECIALIZATIONS(uintl)
+OTHER_SPECIALIZATIONS(short)
+OTHER_SPECIALIZATIONS(ushort)
+OTHER_SPECIALIZATIONS(schar)
+OTHER_SPECIALIZATIONS(uchar)
+OTHER_SPECIALIZATIONS(char)
+OTHER_SPECIALIZATIONS(common::half)
+
+// scaledCopy without looping, so dim3 has to be 1.
+// conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] >= dims[1]
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+template<typename inType, typename outType, bool SAME_DIMS, bool FACTOR>
+__global__ void scaledCopy(Param<outType> dst, CParam<inType> src,
+                           const outType default_value, const double factor) {
+    const int id0 = blockIdx.x * blockDim.x + threadIdx.x;
+    const int id1 = blockIdx.y * blockDim.y + threadIdx.y;
+    if ((id0 < (int)dst.dims[0]) & (id1 < (int)dst.dims[1])) {
+        const int id2 = blockIdx.z * blockDim.z + threadIdx.z;
+
+        const int idx_in =
+            id0 * src.strides[0] + id1 * src.strides[1] + id2 * src.strides[2];
+        const int idx_out =
+            id0 * dst.strides[0] + id1 * dst.strides[1] + id2 * dst.strides[2];
+
+        if (SAME_DIMS | ((id0 < (int)src.dims[0]) & (id1 < (int)src.dims[1]) &
+                         (id2 < (int)src.dims[2]))) {
+            dst.ptr[idx_out] = convertType<inType, outType>(
+                FACTOR ? scale<inType>(src.ptr[idx_in], factor)
+                       : src.ptr[idx_in]);
+        } else {
+            dst.ptr[idx_out] = default_value;
+        }
+    }
+}
+
+// scaledCopy with looping over dims[0] -- VECTOR ONLY
+// Conditions:
+//      global dims[0] has no restrictions
+//      only dims[1] == 1 will be processed!!
+//      only dims[2] == 1 will be processed!!
+//      only dims[3] == 1 will be processed!!
+template<typename inType, typename outType, bool SAME_DIMS, bool FACTOR>
+__global__ void scaledCopyLoop0(Param<outType> dst, CParam<inType> src,
+                                const outType default_value,
+                                const double factor) {
+    int id0              = blockIdx.x * blockDim.x + threadIdx.x;
+    const int id0End_out = dst.dims[0];
+    if (id0 < id0End_out) {
+        const int id0End_in     = src.dims[0];
+        const int istrides0     = src.strides[0];
+        const int ostrides0     = dst.strides[0];
+        const int id0Inc        = gridDim.x * blockDim.x;
+        int idx_in              = id0 * istrides0;
+        const int idxID0Inc_in  = id0Inc * istrides0;
+        int idx_out             = id0 * ostrides0;
+        const int idxID0Inc_out = id0Inc * ostrides0;
+
+        while (id0 < id0End_in) {
+            // inside input array, so convert
+            dst.ptr[idx_out] = convertType<inType, outType>(
+                FACTOR ? scale<inType>(src.ptr[idx_in], factor)
+                       : src.ptr[idx_in]);
+            id0 += id0Inc;
+            idx_in += idxID0Inc_in;
+            idx_out += idxID0Inc_out;
+        }
+        if (!SAME_DIMS) {
+            while (id0 < id0End_out) {
+                // outside the input array, so copy default value
+                dst.ptr[idx_out] = default_value;
+                id0 += id0Inc;
+                idx_out += idxID0Inc_out;
+            }
+        }
+    }
+}
+
+// scaledCopy with looping over dims[1]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+template<typename inType, typename outType, bool SAME_DIMS, bool FACTOR>
+__global__ void scaledCopyLoop1(Param<outType> dst, CParam<inType> src,
+                                const outType default_value,
+                                const double factor) {
+    const int id0        = blockIdx.x * blockDim.x + threadIdx.x;
+    int id1              = blockIdx.y * blockDim.y + threadIdx.y;
+    const int id1End_out = dst.dims[1];
+    if ((id0 < (int)dst.dims[0]) & (id1 < id1End_out)) {
+        const int id2       = blockIdx.z * blockDim.z + threadIdx.z;
+        const int ostrides1 = dst.strides[1];
+        const int id1Inc    = gridDim.y * blockDim.y;
+        int idx_out         = id0 * (int)dst.strides[0] + id1 * ostrides1 +
+                      id2 * (int)dst.strides[2];
+        const int idxID1Inc_out = id1Inc * ostrides1;
+        const int id1End_in     = src.dims[1];
+        const int istrides1     = src.strides[1];
+        int idx_in              = id0 * (int)src.strides[0] + id1 * istrides1 +
+                     id2 * (int)src.strides[2];
+        const int idxID1Inc_in = id1Inc * istrides1;
+
+        if (SAME_DIMS | ((id0 < (int)src.dims[0]) & (id2 < src.dims[2]))) {
+            while (id1 < id1End_in) {
+                // inside input array, so convert
+                dst.ptr[idx_out] = convertType<inType, outType>(
+                    FACTOR ? scale<inType>(src.ptr[idx_in], factor)
+                           : src.ptr[idx_in]);
+                id1 += id1Inc;
+                idx_in += idxID1Inc_in;
+                idx_out += idxID1Inc_out;
+            }
+        }
+        if (!SAME_DIMS) {
+            while (id1 < id1End_out) {
+                // outside the input array, so copy default value
+                dst.ptr[idx_out] = default_value;
+                id1 += id1Inc;
+                idx_out += idxID1Inc_out;
+            }
+        }
+    }
+}
+
+// scaledCopy with looping over dims[1], dims[2] and dims[3]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] <= dims[2]
+template<typename inType, typename outType, bool SAME_DIMS, bool FACTOR>
+__global__ void scaledCopyLoop123(Param<outType> out, CParam<inType> in,
+                                  outType default_value, double factor) {
+    const int id0    = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    int id1          = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    const int odims0 = out.dims[0];
+    const int odims1 = out.dims[1];
+    if ((id0 < odims0) & (id1 < odims1)) {
+        int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+        int idxBaseBase_out = id0 * (int)out.strides[0] +
+                              id1 * (int)out.strides[1] +
+                              id2 * (int)out.strides[2];
+        const int idxIncID3_out     = out.strides[3];
+        const int odims2            = out.dims[2];
+        const int idxEndIncID3_out  = out.dims[3] * idxIncID3_out;
+        const int incID1            = gridDim.y * blockDim.y;
+        const int idxBaseIncID1_out = incID1 * (int)out.strides[1];
+        const int incID2            = gridDim.z * blockDim.z;
+        const int idxBaseIncID2_out = incID2 * (int)out.strides[2];
+
+        int idxBaseBase_in = id0 * (int)in.strides[0] +
+                             id1 * (int)in.strides[1] +
+                             id2 * (int)in.strides[2];
+        const int idxIncID3_in     = in.strides[3];
+        const int idims0           = in.dims[0];
+        const int idims1           = in.dims[1];
+        const int idims2           = in.dims[2];
+        const int idxEndIncID3_in  = in.dims[3] * idxIncID3_in;
+        const int idxBaseIncID1_in = incID1 * (int)in.strides[1];
+        const int idxBaseIncID2_in = incID2 * (int)in.strides[2];
+
+        do {
+            int idxBase_in  = idxBaseBase_in;
+            int idxBase_out = idxBaseBase_out;
+            do {
+                int idxEndID3_in  = idxEndIncID3_in + idxBase_in;
+                int idxEndID3_out = idxEndIncID3_out + idxBase_out;
+                int idx_in        = idxBase_in;
+                int idx_out       = idxBase_out;
+                if (SAME_DIMS |
+                    ((id0 < idims0) & (id1 < idims1) & (id2 < idims2))) {
+                    // inside input array, so convert
+                    do {
+                        out.ptr[idx_out] = convertType<inType, outType>(
+                            FACTOR ? scale<inType>(in.ptr[idx_in], factor)
+                                   : in.ptr[idx_in]);
+                        idx_in += idxIncID3_in;
+                        idx_out += idxIncID3_out;
+                    } while (idx_in != idxEndID3_in);
+                }
+                if (!SAME_DIMS) {
+                    while (idx_out != idxEndID3_out) {
+                        // outside the input array, so copy default value
+                        out.ptr[idx_out] = default_value;
+                        idx_out += idxIncID3_out;
+                    }
+                }
+                id1 += incID1;
+                if (id1 >= odims1) break;
+                idxBase_in += idxBaseIncID1_in;
+                idxBase_out += idxBaseIncID1_out;
+            } while (true);
+            id2 += incID2;
+            if (id2 >= odims2) break;
+            idxBaseBase_in += idxBaseIncID2_in;
+            idxBaseBase_out += idxBaseIncID2_out;
+        } while (true);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
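
Reviewer note on `scaledCopyLoop0` above: the kernel uses a grid-stride loop in two phases. Each thread first converts and scales the elements that lie inside the input, then fills the rest of the output with `default_value`. The sketch below is a host-side model of that pattern, not the ArrayFire implementation. The names `scaledCopyLoop0Host` and `scaledCopyHost` are hypothetical, threads are simulated by a plain loop, and the extra `i < endOut` bound in the first phase is an added safety assumption that the kernel itself does not check.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host model of ONE thread of a grid-stride scaled copy.
// tid is the simulated global thread index, nthreads the simulated grid size.
template<typename In, typename Out>
void scaledCopyLoop0Host(std::vector<Out>& dst, const std::vector<In>& src,
                         Out defaultValue, double factor, int tid,
                         int nthreads) {
    const int endOut = static_cast<int>(dst.size());
    const int endIn  = static_cast<int>(src.size());
    int i            = tid;

    // Phase 1: inside the input array, so scale in the input type
    // (as scale<inType> does) and then convert to the output type.
    for (; i < endIn && i < endOut; i += nthreads) {
        const In scaled = static_cast<In>(static_cast<double>(src[i]) * factor);
        dst[i]          = static_cast<Out>(scaled);
    }

    // Phase 2: outside the input array, so write the default value.
    for (; i < endOut; i += nthreads) { dst[i] = defaultValue; }
}

// Runs every simulated thread sequentially and returns the padded output.
template<typename In, typename Out>
std::vector<Out> scaledCopyHost(const std::vector<In>& src, std::size_t outLen,
                                Out defaultValue, double factor,
                                int nthreads) {
    std::vector<Out> dst(outLen);
    for (int tid = 0; tid < nthreads; ++tid) {
        scaledCopyLoop0Host(dst, src, defaultValue, factor, tid, nthreads);
    }
    return dst;
}
```

Because the scaling happens in the input type before conversion, an integer input with a fractional factor is truncated before it reaches a floating-point output. The model reproduces that ordering on purpose.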
diff --git a/src/backend/cuda/kernel/diagonal.cuh b/src/backend/cuda/kernel/diagonal.cuh
new file mode 100644
index 0000000000..6e47af5b22
--- /dev/null
+++ b/src/backend/cuda/kernel/diagonal.cuh
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void createDiagonalMat(Param<T> out, CParam<T> in, int num,
+                                  int blocks_x) {
+    unsigned idz        = blockIdx.x / blocks_x;
+    unsigned blockIdx_x = blockIdx.x - idz * blocks_x;
+
+    unsigned idx = threadIdx.x + blockIdx_x * blockDim.x;
+    unsigned idy =
+        threadIdx.y + (blockIdx.y + blockIdx.z * gridDim.y) * blockDim.y;
+
+    if (idx >= out.dims[0] || idy >= out.dims[1] || idz >= out.dims[2]) return;
+
+    T *optr       = out.ptr + idz * out.strides[2] + idy * out.strides[1] + idx;
+    const T *iptr = in.ptr + idz * in.strides[1] + ((num > 0) ? idx : idy);
+
+    T val = (idx == (idy - num)) ? *iptr : scalar<T>(0);
+    *optr = val;
+}
+
+template<typename T>
+__global__ void extractDiagonal(Param<T> out, CParam<T> in, int num,
+                                int blocks_z) {
+    unsigned idw = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_z;
+    unsigned idz = (blockIdx.y + blockIdx.z * gridDim.y) - idw * blocks_z;
+
+    unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
+
+    if (idx >= out.dims[0] || idz >= out.dims[2] || idw >= out.dims[3]) return;
+
+    T *optr = out.ptr + idz * out.strides[2] + idw * out.strides[3] + idx;
+
+    if (idx >= in.dims[0] || idx >= in.dims[1]) { *optr = scalar<T>(0); return; }
+
+    int i_off     = (num > 0) ? (num * in.strides[1] + idx) : (idx - num);
+    const T *iptr = in.ptr + idz * in.strides[2] + idw * in.strides[3] + i_off;
+    *optr         = iptr[idx * in.strides[1]];
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/diagonal.hpp b/src/backend/cuda/kernel/diagonal.hpp
index 45f9c802c9..40b25e159e 100644
--- a/src/backend/cuda/kernel/diagonal.hpp
+++ b/src/backend/cuda/kernel/diagonal.hpp
@@ -7,84 +7,65 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <platform.hpp>
-#include <debug_cuda.hpp>
-#include <Param.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-namespace kernel
-{
-    template<typename T>
-    __global__ static void
-    diagCreateKernel(Param<T> out, CParam<T> in, int num, int blocks_x)
-    {
-        unsigned idz = blockIdx.x / blocks_x;
-        unsigned blockIdx_x = blockIdx.x - idz * blocks_x;
-
-        unsigned idx = threadIdx.x + blockIdx_x * blockDim.x;
-        unsigned idy = threadIdx.y + blockIdx.y * blockDim.y;
-
-        if (idx >= out.dims[0] ||
-            idy >= out.dims[1] ||
-            idz >= out.dims[2]) return;
-
-
-        T *optr = out.ptr + idz * out.strides[2] + idy * out.strides[1] + idx;
-        const T *iptr = in.ptr  + idz *  in.strides[1] + ((num > 0) ? idx : idy);
-
-        T val = (idx == (idy - num)) ? *iptr : scalar<T>(0);
-        *optr = val;
-    }
+#pragma once
 
-    template<typename T>
-    static void diagCreate(Param<T> out, CParam<T> in, int num)
-    {
-        dim3 threads(32, 8);
-        int blocks_x = divup(out.dims[0], threads.x);
-        int blocks_y = divup(out.dims[1], threads.y);
-        dim3 blocks(blocks_x * out.dims[2], blocks_y);
-
-        diagCreateKernel<T> <<<blocks, threads>>> (out, in, num, blocks_x);
-        POST_LAUNCH_CHECK();
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/diagonal_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+void diagCreate(Param<T> out, CParam<T> in, int num) {
+    auto genDiagMat = common::getKernel("arrayfire::cuda::createDiagonalMat",
+                                        {{diagonal_cuh_src}},
+                                        TemplateArgs(TemplateTypename<T>()));
+
+    dim3 threads(32, 8);
+    int blocks_x = divup(out.dims[0], threads.x);
+    int blocks_y = divup(out.dims[1], threads.y);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y);
+
+    const int maxBlocksY    = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    const int blocksPerMatZ = divup(blocks.y, maxBlocksY);
+    if (blocksPerMatZ > 1) {
+        blocks.y = maxBlocksY;
+        blocks.z = blocksPerMatZ;
     }
 
-    template<typename T>
-    __global__ static void
-    diagExtractKernel(Param<T> out, CParam<T> in, int num, int blocks_z)
-    {
-        unsigned idw = blockIdx.y / blocks_z;
-        unsigned idz = blockIdx.y  - idw * blocks_z;
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        unsigned idx = threadIdx.x + blockIdx.x * blockDim.x;
+    genDiagMat(qArgs, out, in, num, blocks_x);
 
-        if (idx >= out.dims[0] ||
-            idz >= out.dims[2] ||
-            idw >= out.dims[3]) return;
+    POST_LAUNCH_CHECK();
+}
 
-        T *optr = out.ptr + idz * out.strides[2] + idw * out.strides[3] + idx;
+template<typename T>
+void diagExtract(Param<T> out, CParam<T> in, int num) {
+    auto extractDiag = common::getKernel("arrayfire::cuda::extractDiagonal",
+                                         {{diagonal_cuh_src}},
+                                         TemplateArgs(TemplateTypename<T>()));
 
-        if (idx >= in.dims[0] || idx >= in.dims[1]) *optr = scalar<T>(0);
+    dim3 threads(256, 1);
+    int blocks_x = divup(out.dims[0], threads.x);
+    int blocks_z = out.dims[2];
+    dim3 blocks(blocks_x, out.dims[3] * blocks_z);
 
-        int i_off = (num > 0) ? (num * in.strides[1] + idx) : (idx - num);
-        const T *iptr = in.ptr  + idz *  in.strides[2] + idw *  in.strides[3] + i_off;
-        *optr = iptr[idx * in.strides[1]];
-    }
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-    template<typename T>
-    static void diagExtract(Param<T> out, CParam<T> in, int num)
-    {
-        dim3 threads(256, 1);
-        int blocks_x = divup(out.dims[0], threads.x);
-        int blocks_z = out.dims[2];
-        dim3 blocks(blocks_x, out.dims[3] * blocks_z);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        diagExtractKernel<T> <<<blocks, threads>>> (out, in, num, blocks_z);
-        POST_LAUNCH_CHECK();
-    }
+    extractDiag(qArgs, out, in, num, blocks_z);
 
+    POST_LAUNCH_CHECK();
 }
-}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
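
Reviewer note on the launch configuration above: `diagExtract` folds a Y block count that may exceed `maxGridSize[1]` into the grid's Z dimension, and the kernel recovers the logical index as `blockIdx.y + blockIdx.z * gridDim.y`. The sketch below models only that arithmetic on the host. `Grid3`, `foldGridY` and `logicalBlockY` are hypothetical names, and the local `divup` is a stand-in for the one in `common/dispatch.hpp`.

```cpp
// Hypothetical stand-in for CUDA's dim3, used only for this host-side model.
struct Grid3 {
    unsigned x, y, z;
};

// Integer ceiling division (local stand-in for ArrayFire's divup).
constexpr unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Folds a logical Y block count into (y, z) so that y <= maxBlocksY.
// Assumes logicalBlocksY > 0 and maxBlocksY > 0.
inline Grid3 foldGridY(unsigned blocksX, unsigned logicalBlocksY,
                       unsigned maxBlocksY) {
    Grid3 g{blocksX, logicalBlocksY, 1};
    g.z = divup(g.y, maxBlocksY);  // number of Z slices needed
    g.y = divup(g.y, g.z);         // blocks per slice, balanced across slices
    return g;
}

// Mirrors the kernel-side expression blockIdx.y + blockIdx.z * gridDim.y.
constexpr unsigned logicalBlockY(unsigned blockIdxY, unsigned blockIdxZ,
                                 unsigned gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

When the logical count does not divide evenly, `y * z` is slightly larger than the requested count. The surplus blocks produce indices past the array bounds and are expected to exit through the kernels' early `return` checks.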
diff --git a/src/backend/cuda/kernel/diff.cuh b/src/backend/cuda/kernel/diff.cuh
new file mode 100644
index 0000000000..fc02296b5c
--- /dev/null
+++ b/src/backend/cuda/kernel/diff.cuh
@@ -0,0 +1,62 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool D>
+inline void diff_this(T* out, const T* in, const unsigned oMem,
+                      const unsigned iMem0, const unsigned iMem1,
+                      const unsigned iMem2) {
+    // iMem2 can never be 0
+    if (D == 0) {  // Diff1
+        out[oMem] = in[iMem1] - in[iMem0];
+    } else {  // Diff2
+        out[oMem] = in[iMem2] - in[iMem1] - in[iMem1] + in[iMem0];
+    }
+}
+
+template<typename T, unsigned dim, bool isDiff2>
+__global__ void diff(Param<T> out, CParam<T> in, const unsigned oElem,
+                     const unsigned blocksPerMatX,
+                     const unsigned blocksPerMatY) {
+    unsigned idz = blockIdx.x / blocksPerMatX;
+    unsigned idw = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    unsigned blockIdx_x = blockIdx.x - idz * blocksPerMatX;
+    unsigned blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - idw * blocksPerMatY;
+
+    unsigned idx = threadIdx.x + blockIdx_x * blockDim.x;
+    unsigned idy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (idx >= out.dims[0] || idy >= out.dims[1] || idz >= out.dims[2] ||
+        idw >= out.dims[3])
+        return;
+
+    unsigned iMem0 =
+        idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1] + idx;
+    unsigned iMem1 = iMem0 + in.strides[dim];
+    unsigned iMem2 = iMem1 + in.strides[dim];
+
+    unsigned oMem = idw * out.strides[3] + idz * out.strides[2] +
+                    idy * out.strides[1] + idx;
+
+    iMem2 *= isDiff2;
+
+    diff_this<T, isDiff2>(out.ptr, in.ptr, oMem, iMem0, iMem1, iMem2);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
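
Reviewer note on `diff_this` above: the two branches compute a first-order difference, `in[i+1] - in[i]`, and a second-order difference, `in[i+2] - 2*in[i+1] + in[i]`. A small host-side reference like the one below can help check the kernel's output for the 1D case. `diffHost` is a hypothetical helper, not part of ArrayFire, and it assumes the output is one element shorter than the input for the first-order case and two shorter for the second-order case.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 1D host reference for first- and second-order differences.
template<typename T>
std::vector<T> diffHost(const std::vector<T>& in, bool isDiff2) {
    const std::size_t shrink = isDiff2 ? 2 : 1;
    if (in.size() <= shrink) { return {}; }  // too short to difference

    std::vector<T> out(in.size() - shrink);
    for (std::size_t i = 0; i < out.size(); ++i) {
        const T a = in[i];
        const T b = in[i + 1];
        // in[i + 2] is only read in the second-order case, where it is in range.
        out[i] = isDiff2 ? in[i + 2] - b - b + a : b - a;
    }
    return out;
}
```

The second-order expression subtracts `b` twice instead of multiplying by two, matching the form used in the kernel.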
diff --git a/src/backend/cuda/kernel/diff.hpp b/src/backend/cuda/kernel/diff.hpp
index c5f57d9a58..cdce6eaf8f 100644
--- a/src/backend/cuda/kernel/diff.hpp
+++ b/src/backend/cuda/kernel/diff.hpp
@@ -7,91 +7,49 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/diff_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 16;
-        static const unsigned TY = 16;
-
-        template<typename T, bool D>
-        inline __host__ __device__
-        void diff_this(T* out, const T* in, const unsigned oMem, const unsigned iMem0,
-                       const unsigned iMem1, const unsigned iMem2)
-        {
-            //iMem2 can never be 0
-            if(D == 0) {                        // Diff1
-                out[oMem] = in[iMem1] - in[iMem0];
-            } else {                                // Diff2
-                out[oMem] = in[iMem2] - in[iMem1] - in[iMem1] + in[iMem0];
-            }
-        }
-
-        /////////////////////////////////////////////////////////////////////////////
-        // 1st and 2nd Order Differential for 4D along all dimensions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, unsigned dim, bool isDiff2>
-        __global__
-        void diff_kernel(Param<T> out, CParam<T> in, const unsigned oElem,
-                         const unsigned blocksPerMatX, const unsigned blocksPerMatY)
-        {
-            unsigned idz = blockIdx.x / blocksPerMatX;
-            unsigned idw = blockIdx.y / blocksPerMatY;
-
-            unsigned blockIdx_x = blockIdx.x - idz * blocksPerMatX;
-            unsigned blockIdx_y = blockIdx.y - idw * blocksPerMatY;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            unsigned idx = threadIdx.x + blockIdx_x * blockDim.x;
-            unsigned idy = threadIdx.y + blockIdx_y * blockDim.y;
+template<typename T>
+void diff(Param<T> out, CParam<T> in, const int indims, const unsigned dim,
+          const bool isDiff2) {
+    constexpr unsigned TX = 16;
+    constexpr unsigned TY = 16;
 
-            if(idx >= out.dims[0] ||
-               idy >= out.dims[1] ||
-               idz >= out.dims[2] ||
-               idw >= out.dims[3])
-                return;
+    auto diff =
+        common::getKernel("arrayfire::cuda::diff", {{diff_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>(), TemplateArg(dim),
+                                       TemplateArg(isDiff2)));
 
-            unsigned iMem0 = idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1] + idx;
-            unsigned iMem1 = iMem0 + in.strides[dim];
-            unsigned iMem2 = iMem1 + in.strides[dim];
+    dim3 threads(TX, TY, 1);
 
-            unsigned oMem = idw * out.strides[3] + idz * out.strides[2] + idy * out.strides[1] + idx;
+    if (dim == 0 && indims == 1) { threads = dim3(TX * TY, 1, 1); }
 
-            iMem2 *= isDiff2;
+    int blocksPerMatX = divup(out.dims[0], TX);
+    int blocksPerMatY = divup(out.dims[1], TY);
+    dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3], 1);
 
-            diff_this<T, isDiff2>(out.ptr, in.ptr, oMem, iMem0, iMem1, iMem2);
-        }
+    const int oElem = out.dims[0] * out.dims[1] * out.dims[2] * out.dims[3];
 
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, unsigned dim, bool isDiff2>
-        void diff(Param<T> out, CParam<T> in, const int indims)
-        {
-            dim3 threads(TX, TY, 1);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-            if (dim == 0 && indims == 1) {
-                threads = dim3(TX * TY, 1, 1);
-            }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-            int blocksPerMatX = divup(out.dims[0], TX);
-            int blocksPerMatY = divup(out.dims[1], TY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
+    diff(qArgs, out, in, oElem, blocksPerMatX, blocksPerMatY);
 
-            const int oElem = out.dims[0] * out.dims[1] * out.dims[2] * out.dims[3];
-
-            diff_kernel<T, dim, isDiff2> <<<blocks, threads>>>
-                (out, in, oElem, blocksPerMatX, blocksPerMatY);
-
-            POST_LAUNCH_CHECK();
-        }
-}
+    POST_LAUNCH_CHECK();
 }
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/exampleFunction.cuh b/src/backend/cuda/kernel/exampleFunction.cuh
new file mode 100644
index 0000000000..e0a4ddffd6
--- /dev/null
+++ b/src/backend/cuda/kernel/exampleFunction.cuh
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void exampleFunc(Param<T> c, CParam<T> a, CParam<T> b,
+                            const af_someenum_t p) {
+    // get current thread global identifiers along required dimensions
+    int i = blockDim.x * blockIdx.x + threadIdx.x;
+    int j = blockDim.y * blockIdx.y + threadIdx.y;
+
+    if (i < a.dims[0] && j < a.dims[1]) {
+        // if needed use strides array to compute linear index of arrays
+        int src1Idx = i + j * a.strides[1];
+        int src2Idx = i + j * b.strides[1];
+        int dstIdx  = i + j * c.strides[1];
+
+        T* dst        = c.ptr;
+        const T* src1 = a.ptr;
+        const T* src2 = b.ptr;
+
+        // kernel algorithm goes here
+        dst[dstIdx] = src1[src1Idx] + src2[src2Idx];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/exampleFunction.hpp b/src/backend/cuda/kernel/exampleFunction.hpp
index 0366b787c6..4f037eb771 100644
--- a/src/backend/cuda/kernel/exampleFunction.hpp
+++ b/src/backend/cuda/kernel/exampleFunction.hpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2019, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -7,57 +7,50 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>                     // CUDA specific math functions
+#include <Param.hpp>
 
-#include <Param.hpp>                    // This header has the declaration of structures
-                                        // that are passed onto kernel. Operator overloads
-                                        // for creating Param objects from cuda::Array<T>
-                                        // objects is automatic, no special work is needed.
-                                        // Hence, the CUDA kernel wrapper function takes in
-                                        // Param and CParam(constant version of Param) instead
-                                        // of cuda::Array<T>
+#include <common/dispatch.hpp>  // common utility header for CUDA & OpenCL backends
+                                // has the divup macro
 
-#include <dispatch.hpp>                 // common utility header for CUDA & OpenCL backends
-                                        // has the divup macro
+#include <debug_cuda.hpp>  // For Debug only related CUDA validations
 
-#include <err_cuda.hpp>                 // CUDA specific error check functions and macros
+#include <common/kernel_cache.hpp>  // nvrtc cache mechanism API
 
-#include <debug_cuda.hpp>               // For Debug only related CUDA validations
+#include <nvrtc_kernel_headers/exampleFunction_cuh.hpp>  // kernel source compiled at runtime by nvrtc
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-namespace kernel
-{
+namespace kernel {
 
-static const unsigned TX = 16;          // Kernel Launch Config Values
-static const unsigned TY = 16;          // Kernel Launch Config Values
-
-template<typename T>
-__global__
-void exampleFuncKernel(Param<T> out, CParam<T> in, const af_someenum_t p)
-{
-    // kernel implementation goes here
-}
+static const unsigned TX = 16;  // Kernel Launch Config Values
+static const unsigned TY = 16;  // Kernel Launch Config Values
 
+template<typename T>  // CUDA kernel wrapper function
+void exampleFunc(Param<T> c, CParam<T> a, CParam<T> b, const af_someenum_t p) {
+    auto exampleFunc = common::getKernel("arrayfire::cuda::exampleFunc",
+                                         {{exampleFunction_cuh_src}},
+                                         TemplateArgs(TemplateTypename<T>()));
 
-template <typename T>                   // CUDA kernel wrapper function
-void exampleFunc(Param<T> out, CParam<T> in, const af_someenum_t p)
-{
+    dim3 threads(TX, TY, 1);  // set your cuda launch config for blocks
 
-    dim3 threads(TX, TY, 1);            // set your cuda launch config for blocks
+    int blk_x = divup(c.dims[0], threads.x);
+    int blk_y = divup(c.dims[1], threads.y);
+    dim3 blocks(blk_x, blk_y);  // set your cuda launch config for grid
 
-    int blk_x = divup(out.dims[0], threads.x);
-    int blk_y = divup(out.dims[1], threads.y);
-    dim3 blocks(blk_x, blk_y);          // set your opencl launch config for grid
+    // EnqueueArgs encapsulates CUDA kernel launch
+    // configuration parameters. There are various versions
+    // of EnqueueArgs constructors that you can use depending
+    // on your CUDA kernel's needs, such as shared memory.
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-    // launch your kernel
-    exampleFuncKernel<T> <<<blocks, threads>>> (out, in, p);
+    // Call the kernel functor retrieved using arrayfire::common::getKernel
+    exampleFunc(qArgs, c, a, b, p);
 
-    POST_LAUNCH_CHECK();                // Macro for post kernel launch checks
-                                        // these checks are carried  ONLY IN DEBUG mode
+    POST_LAUNCH_CHECK();  // Macro for post kernel launch checks
+                          // these checks are carried out ONLY IN DEBUG mode
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/fast.hpp b/src/backend/cuda/kernel/fast.hpp
index bac1ee2cb8..7b54162b42 100644
--- a/src/backend/cuda/kernel/fast.hpp
+++ b/src/backend/cuda/kernel/fast.hpp
@@ -7,59 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
+#pragma once
+
+#include <LookupTable1D.hpp>
+#include <common/dispatch.hpp>
 #include <debug_cuda.hpp>
-#include <kernel/fast_lut.hpp>
+#include <kernel/shared.hpp>
+#include <math.hpp>
 #include <memory.hpp>
+#include <cub/block/block_reduce.cuh>
 
-namespace cuda
-{
-
-namespace kernel
-{
-
-inline __device__
-int clamp(const int f, const int a, const int b)
-{
-    return max(a, min(f, b));
-}
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-inline __device__
-int idx_y(const int i)
-{
+inline __device__ int idx_y(const int i) {
     int j = i - 4;
     int k = min(j, 8 - j);
     return clamp(k, -3, 3);
 }
 
-inline __device__
-int idx_x(const int i)
-{
-    return idx_y((i + 4) & 15);
-}
+inline __device__ int idx_x(const int i) { return idx_y((i + 4) & 15); }
 
-inline __device__
-int idx(const int x, const int y)
-{
+inline __device__ int idx(const int x, const int y) {
     return ((threadIdx.x + 3 + x) + (blockDim.x + 6) * (threadIdx.y + 3 + y));
 }
 
 // test_greater()
-// Tests if a pixel x >= p + thr
-inline __device__
-int test_greater(const float x, const float p, const float thr)
-{
-    return (x >= p + thr);
+// Tests if a pixel x > p + thr
+inline __device__ int test_greater(const float x, const float p,
+                                   const float thr) {
+    return (x > p + thr);
 }
 
 // test_smaller()
-// Tests if a pixel x <= p - thr
-inline __device__
-int test_smaller(const float x, const float p, const float thr)
-{
-    return (x <= p - thr);
+// Tests if a pixel x < p - thr
+inline __device__ int test_smaller(const float x, const float p,
+                                   const float thr) {
+    return (x < p - thr);
 }
 
 // test_pixel()
@@ -67,235 +52,170 @@ int test_smaller(const float x, const float p, const float thr)
 // Returns  0 when x >= p - thr && x <= p + thr
 // Returns  1 when x > p + thr
 template<typename T>
-inline __device__
-int test_pixel(const T* local_image, const float p, const float thr, const int x, const int y)
-{
-    return -test_smaller((float)local_image[idx(x,y)], p, thr) | test_greater((float)local_image[idx(x,y)], p, thr);
+inline __device__ int test_pixel(const T *local_image, const float p,
+                                 const float thr, const int x, const int y) {
+    return -test_smaller((float)local_image[idx(x, y)], p, thr) +
+           test_greater((float)local_image[idx(x, y)], p, thr);
 }
 
 // max_val()
 // Returns max of x and y
-inline __device__
-int max_val(const int x, const int y)
-{
+inline __device__ int max_val(const int x, const int y) { return max(x, y); }
+inline __device__ unsigned max_val(const unsigned x, const unsigned y) {
     return max(x, y);
 }
-inline __device__
-unsigned max_val(const unsigned x, const unsigned y)
-{
+inline __device__ short max_val(const short x, const short y) {
     return max(x, y);
 }
-inline __device__
-float max_val(const float x, const float y)
-{
+inline __device__ ushort max_val(const ushort x, const ushort y) {
+    return max(x, y);
+}
+inline __device__ float max_val(const float x, const float y) {
     return fmax(x, y);
 }
-inline __device__
-double max_val(const double x, const double y)
-{
+inline __device__ double max_val(const double x, const double y) {
     return fmax(x, y);
 }
 
 // abs_diff()
 // Returns absolute difference of x and y
-inline __device__ int abs_diff(const int x, const int y)
-{
+inline __device__ int abs_diff(const int x, const int y) {
     int i = x - y;
     return max(-i, i);
 }
-inline __device__ unsigned abs_diff(const unsigned x, const unsigned y)
-{
+inline __device__ unsigned abs_diff(const unsigned x, const unsigned y) {
     int i = (int)x - (int)y;
     return max(-i, i);
 }
-inline __device__ float abs_diff(const float x, const float y)
-{
+inline __device__ short abs_diff(const short x, const short y) {
+    short i = x - y;
+    return max(-i, i);
+}
+inline __device__ ushort abs_diff(const ushort x, const ushort y) {
+    int i = (int)x - (int)y;
+    return (ushort)max(-i, i);
+}
+inline __device__ float abs_diff(const float x, const float y) {
     return fabs(x - y);
 }
-inline __device__ double abs_diff(const double x, const double y)
-{
+inline __device__ double abs_diff(const double x, const double y) {
     return fabs(x - y);
 }
 
-// non-specialized class template
-//http://www.naic.edu/~phil/hardware/nvidia/doc/src/simpleTemplates/doc/readme.txt
-template <class T>
-class ExtSharedMem
-{
-    public:
-        // Ensure that we won't compile any un-specialized types
-        __device__ T* getPointer() { extern __shared__ float s_float[]; return s_float; };
-};
-
-// specialization for char
-template <>
-class ExtSharedMem <char>
-{
-    public:
-        __device__ char* getPointer() { extern __shared__ char s_char[]; return s_char; }
-};
-
-// specialization for int
-template <>
-class ExtSharedMem <uchar>
-{
-    public:
-        __device__ uchar* getPointer() { extern __shared__ uchar s_uchar[]; return s_uchar; }
-};
-
-// specialization for int
-template <>
-class ExtSharedMem <int>
-{
-    public:
-        __device__ int* getPointer() { extern __shared__ int s_int[]; return s_int; }
-};
-
-// specialization for unsigned
-template <>
-class ExtSharedMem <unsigned>
-{
-    public:
-        __device__ unsigned* getPointer() { extern __shared__ unsigned s_unsigned[]; return s_unsigned; }
-};
-
-// specialization for float
-template <>
-class ExtSharedMem <float>
-{
-    public:
-        __device__ float* getPointer() { extern __shared__ float s_float[]; return s_float; }
-};
-
-// specialization for double
-template <>
-class ExtSharedMem <double>
-{
-    public:
-        __device__ double* getPointer() { extern __shared__ double s_double[]; return s_double; }
-};
+inline __device__ int lookup(const int n, cudaTextureObject_t tex) {
+    return (int)tex1Dfetch<unsigned char>(tex, n);
+}
 
 template<typename T, int arc_length>
-__device__
-void locate_features_core(
-    T* local_image,
-    float* score,
-    const unsigned idim0,
-    const unsigned idim1,
-    const float thr,
-    int x, int y,
-    const unsigned edge)
-{
+__device__ void locate_features_core(T *local_image, float *score,
+                                     const unsigned idim0, const unsigned idim1,
+                                     const float thr, int x, int y,
+                                     const unsigned edge,
+                                     cudaTextureObject_t luTable) {
     if (x >= idim0 - edge || y >= idim1 - edge) return;
 
     score[y * idim0 + x] = 0.f;
 
-    float p = local_image[idx( 0, 0)];
+    float p = local_image[idx(0, 0)];
 
     // Start by testing opposite pixels of the circle that will result in
     // a non-kepoint
-    int d = test_pixel<T>(local_image, p, thr, -3,  0) | test_pixel<T>(local_image, p, thr, 3,  0);
-    if (d == 0)
-        return;
-
-    d &= test_pixel<T>(local_image, p, thr, -2,  2) | test_pixel<T>(local_image, p, thr,  2, -2);
-    d &= test_pixel<T>(local_image, p, thr,  0,  3) | test_pixel<T>(local_image, p, thr,  0, -3);
-    d &= test_pixel<T>(local_image, p, thr,  2,  2) | test_pixel<T>(local_image, p, thr, -2, -2);
-    if (d == 0)
-        return;
-
-    d &= test_pixel<T>(local_image, p, thr, -3,  1) | test_pixel<T>(local_image, p, thr,  3, -1);
-    d &= test_pixel<T>(local_image, p, thr, -1,  3) | test_pixel<T>(local_image, p, thr,  1, -3);
-    d &= test_pixel<T>(local_image, p, thr,  1,  3) | test_pixel<T>(local_image, p, thr, -1, -3);
-    d &= test_pixel<T>(local_image, p, thr,  3,  1) | test_pixel<T>(local_image, p, thr, -3, -1);
-    if (d == 0)
-        return;
+    int d = test_pixel<T>(local_image, p, thr, -3, 0) |
+            test_pixel<T>(local_image, p, thr, 3, 0);
+    if (d == 0) return;
+
+    d &= test_pixel<T>(local_image, p, thr, -2, 2) |
+         test_pixel<T>(local_image, p, thr, 2, -2);
+    d &= test_pixel<T>(local_image, p, thr, 0, 3) |
+         test_pixel<T>(local_image, p, thr, 0, -3);
+    d &= test_pixel<T>(local_image, p, thr, 2, 2) |
+         test_pixel<T>(local_image, p, thr, -2, -2);
+    if (d == 0) return;
+
+    d &= test_pixel<T>(local_image, p, thr, -3, 1) |
+         test_pixel<T>(local_image, p, thr, 3, -1);
+    d &= test_pixel<T>(local_image, p, thr, -1, 3) |
+         test_pixel<T>(local_image, p, thr, 1, -3);
+    d &= test_pixel<T>(local_image, p, thr, 1, 3) |
+         test_pixel<T>(local_image, p, thr, -1, -3);
+    d &= test_pixel<T>(local_image, p, thr, 3, 1) |
+         test_pixel<T>(local_image, p, thr, -3, -1);
+    if (d == 0) return;
 
     int bright = 0, dark = 0;
     float s_bright = 0, s_dark = 0;
 
-    // Force less loop unrolls to control maximum number of registers and
-    // launch more blocks
-    #pragma unroll 4
+// Force fewer loop unrolls to control maximum number of registers and
+// launch more blocks
+#pragma unroll 4
     for (int i = 0; i < 16; i++) {
         // Get pixel from the circle
-        float p_x = local_image[idx(idx_x(i),idx_y(i))];
+        float p_x = local_image[idx(idx_x(i), idx_y(i))];
 
         // Compute binary vectors with responses for each pixel on circle
         bright |= test_greater(p_x, p, thr) << i;
-        dark   |= test_smaller(p_x, p, thr) << i;
+        dark |= test_smaller(p_x, p, thr) << i;
 
         // Compute scores for brighter and darker pixels
         float weight = abs_diff(p_x, p) - thr;
         s_bright += test_greater(p_x, p, thr) * weight;
-        s_dark   += test_smaller(p_x, p, thr) * weight;
+        s_dark += test_smaller(p_x, p, thr) * weight;
     }
 
     // Checks LUT to verify if there is a segment for which all pixels are much
     // brighter or much darker than central pixel p.
-    if ((int)FAST_LUT[bright] >= arc_length || (int)FAST_LUT[dark] >= arc_length)
+    if (lookup(bright, luTable) >= arc_length ||
+        lookup(dark, luTable) >= arc_length)
         score[x + idim0 * y] = max_val(s_bright, s_dark);
 }
 
 template<typename T>
-__device__
-void load_shared_image(CParam<T> in,
-                       T *local_image,
-                       unsigned ix, unsigned iy,
-                       unsigned bx, unsigned by,
-                       unsigned x, unsigned y,
-                       unsigned lx, unsigned ly,
-                       const unsigned edge)
-{
+__device__ void load_shared_image(CParam<T> in, T *local_image, unsigned ix,
+                                  unsigned iy, unsigned bx, unsigned by,
+                                  unsigned x, unsigned y, unsigned lx,
+                                  unsigned ly, const unsigned edge) {
     // Copy an image patch to shared memory, with a 3-pixel edge
     if (ix < lx && iy < ly && x - 3 < in.dims[0] && y - 3 < in.dims[1]) {
-        local_image[(ix)      + (bx+6) * (iy)]    = in.ptr[(x-3)    + in.dims[0] * (y-3)];
+        local_image[(ix) + (bx + 6) * (iy)] =
+            in.ptr[(x - 3) + in.dims[0] * (y - 3)];
         if (x + lx - 3 < in.dims[0])
-            local_image[(ix + lx) + (bx+6) * (iy)]    = in.ptr[(x+lx-3) + in.dims[0] * (y-3)];
+            local_image[(ix + lx) + (bx + 6) * (iy)] =
+                in.ptr[(x + lx - 3) + in.dims[0] * (y - 3)];
         if (y + ly - 3 < in.dims[1])
-            local_image[(ix)      + (bx+6) * (iy+ly)] = in.ptr[(x-3)    + in.dims[0] * (y+ly-3)];
+            local_image[(ix) + (bx + 6) * (iy + ly)] =
+                in.ptr[(x - 3) + in.dims[0] * (y + ly - 3)];
         if (x + lx - 3 < in.dims[0] && y + ly - 3 < in.dims[1])
-            local_image[(ix + lx) + (bx+6) * (iy+ly)] = in.ptr[(x+lx-3) + in.dims[0] * (y+ly-3)];
+            local_image[(ix + lx) + (bx + 6) * (iy + ly)] =
+                in.ptr[(x + lx - 3) + in.dims[0] * (y + ly - 3)];
     }
 }
 
 template<typename T, int arc_length>
-__global__
-void locate_features(
-    CParam<T> in,
-    float* score,
-    const float thr,
-    const unsigned edge)
-{
+__global__ void locate_features(CParam<T> in, float *score, const float thr,
+                                const unsigned edge,
+                                cudaTextureObject_t luTable) {
     unsigned ix = threadIdx.x;
     unsigned iy = threadIdx.y;
     unsigned bx = blockDim.x;
     unsigned by = blockDim.y;
-    unsigned x = bx * blockIdx.x + ix + edge;
-    unsigned y = by * blockIdx.y + iy + edge;
+    unsigned x  = bx * blockIdx.x + ix + edge;
+    unsigned y  = by * blockIdx.y + iy + edge;
     unsigned lx = bx / 2 + 3;
     unsigned ly = by / 2 + 3;
 
-    ExtSharedMem<T> shared;
-    T* local_image_curr = shared.getPointer();
+    SharedMemory<T> shared;
+    T *local_image_curr = shared.getPointer();
     load_shared_image(in, local_image_curr, ix, iy, bx, by, x, y, lx, ly, edge);
     __syncthreads();
-    locate_features_core<T, arc_length>(local_image_curr, score,
-                                        in.dims[0], in.dims[1], thr, x, y, edge);
+    locate_features_core<T, arc_length>(local_image_curr, score, in.dims[0],
+                                        in.dims[1], thr, x, y, edge, luTable);
 }
 
 template<bool nonmax>
-__global__
-void non_max_counts(
-    unsigned *d_counts,
-    unsigned *d_offsets,
-    unsigned *d_total,
-    float *flags,
-    const float* score,
-    const unsigned idim0,
-    const unsigned idim1,
-    const unsigned edge)
-{
+__global__ void non_max_counts(unsigned *d_counts, unsigned *d_offsets,
+                               unsigned *d_total, float *flags,
+                               const float *score, const unsigned idim0,
+                               const unsigned idim1, const unsigned edge) {
     const int xid = blockIdx.x * blockDim.x * 2 + threadIdx.x;
     const int yid = blockIdx.y * blockDim.y * 8 + threadIdx.y;
     const int tid = blockDim.x * threadIdx.y + threadIdx.x;
@@ -307,13 +227,16 @@ void non_max_counts(
     const int yend = (blockIdx.y + 1) * blockDim.y * 8;
 
     const int bid = blockIdx.y * gridDim.x + blockIdx.x;
-    __shared__ unsigned s_counts[256];
+    using BlockReduce =
+        cub::BlockReduce<unsigned, 32, cub::BLOCK_REDUCE_WARP_REDUCTIONS, 8>;
+
+    __shared__ typename BlockReduce::TempStorage temp_storage;
 
     unsigned count = 0;
     for (int y = yid; y < yend; y += yoff) {
-        if (y >= idim1 - edge-1 || y <= edge+1) continue;
+        if (y >= idim1 - edge - 1 || y <= edge + 1) continue;
         for (int x = xid; x < xend; x += xoff) {
-            if (x >= idim0 - edge-1 || x <= edge+1) continue;
+            if (x >= idim0 - edge - 1 || x <= edge + 1) continue;
 
             float v = score[y * idim0 + x];
             if (v == 0) {
@@ -323,15 +246,16 @@ void non_max_counts(
 
             if (nonmax) {
                 float max_v = v;
-                max_v = max_val(score[x-1 + idim0 * (y-1)], score[x-1 + idim0 * y]);
-                max_v = max_val(max_v, score[x-1 + idim0 * (y+1)]);
-                max_v = max_val(max_v, score[x   + idim0 * (y-1)]);
-                max_v = max_val(max_v, score[x   + idim0 * (y+1)]);
-                max_v = max_val(max_v, score[x+1 + idim0 * (y-1)]);
-                max_v = max_val(max_v, score[x+1 + idim0 * (y)  ]);
-                max_v = max_val(max_v, score[x+1 + idim0 * (y+1)]);
-
-                v = (v > max_v) ? v : 0;
+                max_v       = max_val(score[x - 1 + idim0 * (y - 1)],
+                                      score[x - 1 + idim0 * y]);
+                max_v       = max_val(max_v, score[x - 1 + idim0 * (y + 1)]);
+                max_v       = max_val(max_v, score[x + idim0 * (y - 1)]);
+                max_v       = max_val(max_v, score[x + idim0 * (y + 1)]);
+                max_v       = max_val(max_v, score[x + 1 + idim0 * (y - 1)]);
+                max_v       = max_val(max_v, score[x + 1 + idim0 * (y)]);
+                max_v       = max_val(max_v, score[x + 1 + idim0 * (y + 1)]);
+
+                v                    = (v > max_v) ? v : 0;
                 flags[y * idim0 + x] = v;
                 if (v == 0) continue;
             }
@@ -340,44 +264,21 @@ void non_max_counts(
         }
     }
 
-    s_counts[tid] = count;
-    __syncthreads();
-
-    if (tid >= 128) return;
-    if (tid < 128) s_counts[tid] += s_counts[tid + 128]; __syncthreads();
-
-    if (tid >= 64) return;
-    if (tid <  64) s_counts[tid] += s_counts[tid +  64]; __syncthreads();
-
-    if (tid >= 32) return;
-    if (tid <  32) s_counts[tid] += s_counts[tid +  32];
-    if (tid <  16) s_counts[tid] += s_counts[tid +  16];
-    if (tid <   8) s_counts[tid] += s_counts[tid +   8];
-    if (tid <   4) s_counts[tid] += s_counts[tid +   4];
-    if (tid <   2) s_counts[tid] += s_counts[tid +   2];
-    if (tid <   1) s_counts[tid] += s_counts[tid +   1];
+    unsigned sum = BlockReduce(temp_storage).Sum(count);
 
     if (tid == 0) {
-        unsigned total = s_counts[0] ? atomicAdd(d_total, s_counts[0]) : 0;
-        d_counts [bid] = s_counts[0];
+        unsigned total = sum ? atomicAdd(d_total, sum) : 0;
+        d_counts[bid]  = sum;
         d_offsets[bid] = total;
     }
 }
 
 template<typename T>
-__global__
-void get_features(
-    float *x_out,
-    float *y_out,
-    float *score_out,
-    const T* flags,
-    const unsigned *d_counts,
-    const unsigned *d_offsets,
-    const unsigned total,
-    const unsigned idim0,
-    const unsigned idim1,
-    const unsigned edge)
-{
+__global__ void get_features(float *x_out, float *y_out, float *score_out,
+                             const T *flags, const unsigned *d_counts,
+                             const unsigned *d_offsets, const unsigned total,
+                             const unsigned idim0, const unsigned idim1,
+                             const unsigned edge) {
     const int xid = blockIdx.x * blockDim.x * 2 + threadIdx.x;
     const int yid = blockIdx.y * blockDim.y * 8 + threadIdx.y;
     const int tid = blockDim.x * threadIdx.y + threadIdx.x;
@@ -394,136 +295,153 @@ void get_features(
     __shared__ unsigned s_idx;
 
     if (tid == 0) {
-        s_count  = d_counts [bid];
-        s_idx    = d_offsets[bid];
+        s_count = d_counts[bid];
+        s_idx   = d_offsets[bid];
     }
     __syncthreads();
 
     // Blocks that are empty, please bail
     if (s_count == 0) return;
     for (int y = yid; y < yend; y += yoff) {
-        if (y >= idim1 - edge-1 || y <= edge+1) continue;
+        if (y >= idim1 - edge - 1 || y <= edge + 1) continue;
         for (int x = xid; x < xend; x += xoff) {
-            if (x >= idim0 - edge-1 || x <= edge+1) continue;
+            if (x >= idim0 - edge - 1 || x <= edge + 1) continue;
 
             float v = flags[y * idim0 + x];
             if (v == 0) continue;
 
             unsigned id = atomicAdd(&s_idx, 1u);
             if (id >= total) return;
-            y_out[id] = x;
-            x_out[id] = y;
+            y_out[id]     = x;
+            x_out[id]     = y;
             score_out[id] = v;
         }
     }
 }
 
 template<typename T>
-void fast(unsigned* out_feat,
-          float** x_out,
-          float** y_out,
-          float** score_out,
-          CParam<T> in,
-          const float thr,
-          const unsigned arc_length,
-          const unsigned nonmax,
-          const float feature_ratio,
-          const unsigned edge)
-{
-    const unsigned max_feat = ceil(in.dims[0] * in.dims[1] * feature_ratio);
+void fast(unsigned *out_feat, float **x_out, float **y_out, float **score_out,
+          const Array<T> &in, const float thr, const unsigned arc_length,
+          const unsigned nonmax, const float feature_ratio, const unsigned edge,
+          const LookupTable1D<unsigned char> &luTable) {
+    dim4 indims             = in.dims();
+    const unsigned max_feat = ceil(indims[0] * indims[1] * feature_ratio);
 
     dim3 threads(16, 16);
-    dim3 blocks(divup(in.dims[0]-edge*2, threads.x), divup(in.dims[1]-edge*2, threads.y));
+    dim3 blocks(divup(indims[0] - edge * 2, threads.x),
+                divup(indims[1] - edge * 2, threads.y));
 
     // Matrix containing scores for detected features, scores are stored in the
     // same coordinates as features, dimensions should be equal to in.
-    float *d_score = NULL;
-    size_t score_bytes = in.dims[0] * in.dims[1] * sizeof(float) + sizeof(unsigned);
-    d_score = (float *)memAlloc<char>(score_bytes);
+    auto d_score = memAlloc<float>(indims[0] * indims[1] + 1);
 
-    float *d_flags = d_score;
+    float *d_flags = d_score.get();
+    uptr<float> d_flags_alloc;
     if (nonmax) {
-        d_flags = memAlloc<float>(in.dims[0] * in.dims[1]);
+        d_flags_alloc = memAlloc<float>(indims[0] * indims[1]);
+        d_flags       = d_flags_alloc.get();
     }
 
     // Shared memory size
     size_t shared_size = (threads.x + 6) * (threads.y + 6) * sizeof(T);
 
-    switch(arc_length) {
-    case 9:
-        locate_features<T, 9><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 10:
-        locate_features<T,10><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 11:
-        locate_features<T,11><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 12:
-        locate_features<T,12><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 13:
-        locate_features<T,13><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 14:
-        locate_features<T,14><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 15:
-        locate_features<T,15><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
-    case 16:
-        locate_features<T,16><<<blocks, threads, shared_size>>>(in, d_score, thr, edge);
-        break;
+    switch (arc_length) {
+        case 9:
+            CUDA_LAUNCH_SMEM((locate_features<T, 9>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 10:
+            CUDA_LAUNCH_SMEM((locate_features<T, 10>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 11:
+            CUDA_LAUNCH_SMEM((locate_features<T, 11>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 12:
+            CUDA_LAUNCH_SMEM((locate_features<T, 12>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 13:
+            CUDA_LAUNCH_SMEM((locate_features<T, 13>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 14:
+            CUDA_LAUNCH_SMEM((locate_features<T, 14>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 15:
+            CUDA_LAUNCH_SMEM((locate_features<T, 15>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
+        case 16:
+            CUDA_LAUNCH_SMEM((locate_features<T, 16>), blocks, threads,
+                             shared_size, in, d_score.get(), thr, edge,
+                             luTable.get());
+            break;
     }
 
     POST_LAUNCH_CHECK();
 
     threads.x = 32;
-    threads.y =  8;
+    threads.y = 8;
 
-    blocks.x = divup(in.dims[0], 64);
-    blocks.y = divup(in.dims[1], 64);
+    blocks.x = divup(indims[0], 64);
+    blocks.y = divup(indims[1], 64);
 
-    unsigned *d_total = (unsigned *)(d_score + in.dims[0] * in.dims[1]);
-    CUDA_CHECK(cudaMemset(d_total, 0, sizeof(unsigned)));
-    unsigned *d_counts  = memAlloc<unsigned>(blocks.x * blocks.y);
-    unsigned *d_offsets = memAlloc<unsigned>(blocks.x * blocks.y);
+    unsigned *d_total = (unsigned *)(d_score.get() + (indims[0] * indims[1]));
+    CUDA_CHECK(
+        cudaMemsetAsync(d_total, 0, sizeof(unsigned), getActiveStream()));
+    auto d_counts  = memAlloc<unsigned>(blocks.x * blocks.y);
+    auto d_offsets = memAlloc<unsigned>(blocks.x * blocks.y);
 
     if (nonmax)
-        non_max_counts<true ><<<blocks, threads>>>(d_counts, d_offsets, d_total, d_flags,
-                                                   d_score, in.dims[0], in.dims[1], edge);
+        CUDA_LAUNCH((non_max_counts<true>), blocks, threads, d_counts.get(),
+                    d_offsets.get(), d_total, d_flags, d_score.get(), indims[0],
+                    indims[1], edge);
     else
-        non_max_counts<false><<<blocks, threads>>>(d_counts, d_offsets, d_total, d_flags,
-                                                   d_score, in.dims[0], in.dims[1], edge);
+        CUDA_LAUNCH((non_max_counts<false>), blocks, threads, d_counts.get(),
+                    d_offsets.get(), d_total, d_flags, d_score.get(), indims[0],
+                    indims[1], edge);
 
     POST_LAUNCH_CHECK();
 
     // Dimensions of output array
     unsigned total;
-    CUDA_CHECK(cudaMemcpy(&total, d_total, sizeof(unsigned), cudaMemcpyDeviceToHost));
+    CUDA_CHECK(cudaMemcpyAsync(&total, d_total, sizeof(unsigned),
+                               cudaMemcpyDeviceToHost, getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
     total = total < max_feat ? total : max_feat;
 
     if (total > 0) {
-        *x_out     = memAlloc<float>(total);
-        *y_out     = memAlloc<float>(total);
-        *score_out = memAlloc<float>(total);
+        auto x_out_alloc     = memAlloc<float>(total);
+        auto y_out_alloc     = memAlloc<float>(total);
+        auto score_out_alloc = memAlloc<float>(total);
+        *x_out               = x_out_alloc.get();
+        *y_out               = y_out_alloc.get();
+        *score_out           = score_out_alloc.get();
 
-        get_features<float><<<blocks, threads>>>(*x_out, *y_out, *score_out, d_flags, d_counts,
-                                                 d_offsets, total, in.dims[0], in.dims[1], edge);
+        CUDA_LAUNCH((get_features<float>), blocks, threads, *x_out, *y_out,
+                    *score_out, d_flags, d_counts.get(), d_offsets.get(), total,
+                    indims[0], indims[1], edge);
 
         POST_LAUNCH_CHECK();
+
+        x_out_alloc.release();
+        y_out_alloc.release();
+        score_out_alloc.release();
     }
 
     *out_feat = total;
-
-    memFree<uchar>((uchar *)d_score);
-    memFree(d_counts);
-    memFree(d_offsets);
-    if (nonmax) {
-        memFree(d_flags);
-    }
 }
 
-} // namespace kernel
-
-} // namespace cuda
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/fast_lut.hpp b/src/backend/cuda/kernel/fast_lut.hpp
index aba857120e..5ac82a67c7 100644
--- a/src/backend/cuda/kernel/fast_lut.hpp
+++ b/src/backend/cuda/kernel/fast_lut.hpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2020, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -7,2052 +7,3456 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__constant__ unsigned char FAST_LUT[] = {
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 12, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 12, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,
-    10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 14, 15,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 12, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 15,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9, 10, 10, 10, 10, 11, 11, 12, 15,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9, 10, 10, 11, 15,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9, 10, 15,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 14,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0, 10,
-     0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9, 15,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 14,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 13,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 12,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 11,
-     0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 10,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0, 15,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 12,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 13,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 12,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 14,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 12,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 13,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 12,
-     0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 11,  0,  0,  0,  9,  0,  0,  0, 10,  0,  0,  0,  9,  0,  0,  0, 15,
-     0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 12,  0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 13,
-     0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 12,  0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 14,
-     0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 12,  0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 13,
-     0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 12,  0,  9,  0, 10,  0,  9,  0, 11,  0,  9,  0, 10,  0,  9,  0, 15,
-     9, 10,  9, 11,  9, 10,  9, 12,  9, 10,  9, 11,  9, 10,  9, 13,  9, 10,  9, 11,  9, 10,  9, 12,  9, 10,  9, 11,  9, 10,  9, 14,
-     9, 10,  9, 11,  9, 10,  9, 12,  9, 10,  9, 11,  9, 10,  9, 13,  9, 10,  9, 11,  9, 10,  9, 12,  9, 10,  9, 11,  9, 10,  9, 15,
-    10, 11, 10, 12, 10, 11, 10, 13, 10, 11, 10, 12, 10, 11, 10, 14, 10, 11, 10, 12, 10, 11, 10, 13, 10, 11, 10, 12, 10, 11, 10, 15,
-    11, 12, 11, 13, 11, 12, 11, 14, 11, 12, 11, 13, 11, 12, 11, 15, 12, 13, 12, 14, 12, 13, 12, 15, 13, 14, 13, 15, 14, 15, 15, 16};
+#pragma once
+
+unsigned char FAST_LUT[] = {
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 11, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  9,  9,  9,  10, 10, 11, 12, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  9,  10, 11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9,  10, 10, 10, 10, 11,
+    11, 12, 13, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 11, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  10, 10, 11, 12, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  9,  10, 11, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,
+    9,  9,  9,  9,  9,  9,  9,  9,  9,  10, 10, 10, 10, 10, 10, 10, 10, 11, 11,
+    11, 11, 12, 12, 13, 14, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,
+    10, 11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  10, 10, 11, 12, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 11, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,
+    9,  10, 10, 10, 10, 11, 11, 12, 13, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  9,  10, 11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,  10, 10, 11,
+    12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 11, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,
+    9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  10,
+    10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11,
+    11, 11, 11, 11, 12, 12, 12, 12, 13, 13, 14, 15, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  9,  10, 12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  9,  9,
+    10, 10, 11, 13, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 12,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  11, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    9,  9,  9,  9,  9,  9,  9,  10, 10, 10, 10, 11, 11, 12, 14, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  11, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 12, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    9,  9,  9,  10, 10, 11, 13, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    9,  10, 12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,  9,
+    9,  9,  10, 10, 10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 13, 15, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  12, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 13, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  12, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  9,  9,  9,  10, 10, 11, 14, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  9,  10, 13, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  9,  9,  9,  9,  9,  9,  9,  10, 10, 10, 10, 11, 11,
+    12, 15, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  13,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  9,  10, 14, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  13, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  9,  9,  9,  10, 10, 11, 15, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  13, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  14, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  13, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  9,  10, 15, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  13, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  14, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  12, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  13, 0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  11,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,
+    0,  0,  0,  0,  0,  9,  15, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    11, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  13, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  12, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,
+    0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10,
+    0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  14, 0,  0,  0,
+    0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,
+    0,  9,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  9,  0,
+    0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,
+    0,  0,  0,  12, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,
+    10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,
+    0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,
+    0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  13, 0,  0,  0,  0,  0,  0,  0,  9,
+    0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,
+    0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,
+    0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  12, 0,
+    0,  0,  0,  0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,
+    0,  0,  0,  9,  0,  0,  0,  0,  0,  0,  0,  11, 0,  0,  0,  0,  0,  0,  0,
+    9,  0,  0,  0,  0,  0,  0,  0,  10, 0,  0,  0,  0,  0,  0,  0,  9,  0,  0,
+    0,  0,  0,  0,  0,  15, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,
+    0,  0,  11, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  12,
+    0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  11, 0,  0,  0,
+    9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  13, 0,  0,  0,  9,  0,  0,
+    0,  10, 0,  0,  0,  9,  0,  0,  0,  11, 0,  0,  0,  9,  0,  0,  0,  10, 0,
+    0,  0,  9,  0,  0,  0,  12, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,
+    0,  0,  0,  11, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,
+    14, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  11, 0,  0,
+    0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  12, 0,  0,  0,  9,  0,
+    0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  11, 0,  0,  0,  9,  0,  0,  0,  10,
+    0,  0,  0,  9,  0,  0,  0,  13, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,
+    9,  0,  0,  0,  11, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,
+    0,  12, 0,  0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  11, 0,
+    0,  0,  9,  0,  0,  0,  10, 0,  0,  0,  9,  0,  0,  0,  15, 0,  9,  0,  10,
+    0,  9,  0,  11, 0,  9,  0,  10, 0,  9,  0,  12, 0,  9,  0,  10, 0,  9,  0,
+    11, 0,  9,  0,  10, 0,  9,  0,  13, 0,  9,  0,  10, 0,  9,  0,  11, 0,  9,
+    0,  10, 0,  9,  0,  12, 0,  9,  0,  10, 0,  9,  0,  11, 0,  9,  0,  10, 0,
+    9,  0,  14, 0,  9,  0,  10, 0,  9,  0,  11, 0,  9,  0,  10, 0,  9,  0,  12,
+    0,  9,  0,  10, 0,  9,  0,  11, 0,  9,  0,  10, 0,  9,  0,  13, 0,  9,  0,
+    10, 0,  9,  0,  11, 0,  9,  0,  10, 0,  9,  0,  12, 0,  9,  0,  10, 0,  9,
+    0,  11, 0,  9,  0,  10, 0,  9,  0,  15, 9,  10, 9,  11, 9,  10, 9,  12, 9,
+    10, 9,  11, 9,  10, 9,  13, 9,  10, 9,  11, 9,  10, 9,  12, 9,  10, 9,  11,
+    9,  10, 9,  14, 9,  10, 9,  11, 9,  10, 9,  12, 9,  10, 9,  11, 9,  10, 9,
+    13, 9,  10, 9,  11, 9,  10, 9,  12, 9,  10, 9,  11, 9,  10, 9,  15, 10, 11,
+    10, 12, 10, 11, 10, 13, 10, 11, 10, 12, 10, 11, 10, 14, 10, 11, 10, 12, 10,
+    11, 10, 13, 10, 11, 10, 12, 10, 11, 10, 15, 11, 12, 11, 13, 11, 12, 11, 14,
+    11, 12, 11, 13, 11, 12, 11, 15, 12, 13, 12, 14, 12, 13, 12, 15, 13, 14, 13,
+    15, 14, 15, 15, 16};
diff --git a/src/backend/cuda/kernel/fast_pyramid.hpp b/src/backend/cuda/kernel/fast_pyramid.hpp
deleted file mode 100644
index 61a9c7ac32..0000000000
--- a/src/backend/cuda/kernel/fast_pyramid.hpp
+++ /dev/null
@@ -1,146 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/defines.h>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <memory.hpp>
-
-#include "fast.hpp"
-#include "resize.hpp"
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-template<typename T>
-void fast_pyramid(std::vector<unsigned>& feat_pyr,
-                  std::vector<float*>& d_x_pyr,
-                  std::vector<float*>& d_y_pyr,
-                  std::vector<unsigned>& lvl_best,
-                  std::vector<float>& lvl_scl,
-                  std::vector<CParam<T> >& img_pyr,
-                  CParam<T> in,
-                  const float fast_thr,
-                  const unsigned max_feat,
-                  const float scl_fctr,
-                  const unsigned levels,
-                  const unsigned patch_size)
-{
-    unsigned min_side = std::min(in.dims[0], in.dims[1]);
-    unsigned max_levels = 0;
-    float scl_sum = 0.f;
-
-    for (unsigned i = 0; i < levels; i++) {
-        min_side /= scl_fctr;
-
-        // Minimum image side for a descriptor to be computed
-        if (min_side < patch_size || max_levels == levels) break;
-
-        max_levels++;
-        scl_sum += 1.f / (float)std::pow(scl_fctr,(float)i);
-    }
-
-    // Compute number of features to keep for each level
-    lvl_best.resize(max_levels);
-    lvl_scl.resize(max_levels);
-    unsigned feat_sum = 0;
-    for (unsigned i = 0; i < max_levels-1; i++) {
-        float scl = (float)std::pow(scl_fctr,(float)i);
-        lvl_scl[i] = scl;
-
-        lvl_best[i] = ceil((max_feat / scl_sum) / lvl_scl[i]);
-        feat_sum += lvl_best[i];
-    }
-    lvl_scl[max_levels-1] = (float)std::pow(scl_fctr,(float)max_levels-1);
-    lvl_best[max_levels-1] = max_feat - feat_sum;
-
-    // Hold multi-scale image pyramids
-    img_pyr.reserve(max_levels);
-
-    // Create multi-scale image pyramid
-    for (unsigned i = 0; i < max_levels; i++) {
-        if (i == 0) {
-            // First level is used in its original size
-            img_pyr[i].ptr = in.ptr;
-            for (int k = 0; k < 4; k++) {
-                img_pyr[i].dims[k] = in.dims[k];
-                img_pyr[i].strides[k] = in.strides[k];
-            }
-        }
-        else {
-            // Resize previous level image to current level dimensions
-            Param<T> lvl_img;
-            lvl_img.dims[0] = round(in.dims[0] / lvl_scl[i]);
-            lvl_img.dims[1] = round(in.dims[1] / lvl_scl[i]);
-            lvl_img.strides[0] = 1;
-            lvl_img.strides[1] = lvl_img.dims[0] * lvl_img.strides[0];
-
-            for (int k = 2; k < 4; k++) {
-                lvl_img.dims[k] = 1;
-                lvl_img.strides[k] = lvl_img.dims[k - 1] * lvl_img.strides[k - 1];
-            }
-
-            int lvl_elem = lvl_img.strides[3] * lvl_img.dims[3];
-            lvl_img.ptr = memAlloc<T>(lvl_elem);
-
-            resize<T, AF_INTERP_BILINEAR>(lvl_img, img_pyr[i-1]);
-
-            img_pyr[i].ptr = lvl_img.ptr;
-            for (int k = 0; k < 4; k++) {
-                img_pyr[i].dims[k] = lvl_img.dims[k];
-                img_pyr[i].strides[k] = lvl_img.strides[k];
-            }
-        }
-    }
-
-    feat_pyr.resize(max_levels);
-    d_x_pyr.resize(max_levels);
-    d_y_pyr.resize(max_levels);
-
-    for (unsigned i = 0; i < max_levels; i++) {
-        unsigned lvl_feat = 0;
-        float* d_x_feat = NULL;
-        float* d_y_feat = NULL;
-        float* d_score_feat = NULL;
-
-        // Round feature size to nearest odd integer
-        float size = 2.f * floor(patch_size / 2.f) + 1.f;
-
-        // Avoid keeping features that are too wide and might not fit the image,
-        // sqrt(2.f) is the radius when angle is 45 degrees and represents
-        // widest case possible
-        unsigned edge = ceil(size * sqrt(2.f) / 2.f);
-
-        // Detects FAST features
-        fast(&lvl_feat, &d_x_feat, &d_y_feat, &d_score_feat,
-             img_pyr[i], fast_thr, 9, 1, 0.15f, edge);
-
-        // FAST score is not used
-        memFree(d_score_feat);
-
-        if (lvl_feat == 0) {
-            feat_pyr[i] = 0;
-            d_x_pyr[i] = NULL;
-            d_x_pyr[i] = NULL;
-        }
-        else {
-            feat_pyr[i] = lvl_feat;
-            d_x_pyr[i] = d_x_feat;
-            d_y_pyr[i] = d_y_feat;
-        }
-    }
-}
-
-} // namespace kernel
-
-} // namespace cuda
diff --git a/src/backend/cuda/kernel/fftconvolve.cuh b/src/backend/cuda/kernel/fftconvolve.cuh
new file mode 100644
index 0000000000..350a7b299f
--- /dev/null
+++ b/src/backend/cuda/kernel/fftconvolve.cuh
@@ -0,0 +1,222 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/internal_enums.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename To, typename Ti>
+__global__ void packData(Param<To> out, CParam<Ti> in, const int di0_half,
+                         const bool odd_di0) {
+    const int t = blockDim.x * blockIdx.x + threadIdx.x;
+
+    const int tMax = out.strides[3] * out.dims[3];
+
+    if (t >= tMax) return;
+
+    const int do1 = out.dims[1];
+    const int do2 = out.dims[2];
+    const int so1 = out.strides[1];
+    const int so2 = out.strides[2];
+    const int so3 = out.strides[3];
+
+    const int to0 = t % so1;
+    const int to1 = (t / so1) % do1;
+    const int to2 = (t / so2) % do2;
+    const int to3 = t / so3;
+
+    const int di1 = in.dims[1];
+    const int di2 = in.dims[2];
+    const int si1 = in.strides[1];
+    const int si2 = in.strides[2];
+    const int si3 = in.strides[3];
+
+    const int ti0 = to0;
+    const int ti1 = to1 * si1;
+    const int ti2 = to2 * si2;
+    const int ti3 = to3 * si3;
+
+    const int iidx1 = ti3 + ti2 + ti1 + ti0;
+    const int iidx2 = iidx1 + di0_half;
+    const int oidx  = to3 * so3 + to2 * so2 + to1 * so1 + to0;
+
+    if (to0 < di0_half && to1 < di1 && to2 < di2) {
+        out.ptr[oidx].x = in.ptr[iidx1];
+        if (ti0 == di0_half - 1 && odd_di0)
+            out.ptr[oidx].y = 0;
+        else
+            out.ptr[oidx].y = in.ptr[iidx2];
+    } else {
+        // Pad remaining elements with 0s
+        out.ptr[oidx].x = 0;
+        out.ptr[oidx].y = 0;
+    }
+}
+
+template<typename To, typename Ti>
+__global__ void padArray(Param<To> out, CParam<Ti> in) {
+    const int t = blockDim.x * blockIdx.x + threadIdx.x;
+
+    const int tMax = out.strides[3] * out.dims[3];
+
+    if (t >= tMax) return;
+
+    const int do1 = out.dims[1];
+    const int do2 = out.dims[2];
+    const int so1 = out.strides[1];
+    const int so2 = out.strides[2];
+    const int so3 = out.strides[3];
+
+    const int to0 = t % so1;
+    const int to1 = (t / so1) % do1;
+    const int to2 = (t / so2) % do2;
+    const int to3 = (t / so3);
+
+    const int di0 = in.dims[0];
+    const int di1 = in.dims[1];
+    const int di2 = in.dims[2];
+    const int di3 = in.dims[3];
+    const int si1 = in.strides[1];
+    const int si2 = in.strides[2];
+    const int si3 = in.strides[3];
+
+    const int ti0 = to0;
+    const int ti1 = to1 * si1;
+    const int ti2 = to2 * si2;
+    const int ti3 = to3 * si3;
+
+    const int iidx = ti3 + ti2 + ti1 + ti0;
+
+    const int t2 = to3 * so3 + to2 * so2 + to1 * so1 + to0;
+
+    if (to0 < di0 && to1 < di1 && to2 < di2 && to3 < di3) {
+        // Copy input elements to real elements, set imaginary elements to 0
+        out.ptr[t2].x = in.ptr[iidx];
+        out.ptr[t2].y = 0;
+    } else {
+        // Pad the rest of the matrix with 0s
+        out.ptr[t2].x = 0;
+        out.ptr[t2].y = 0;
+    }
+}
+
+template<typename convT, AF_BATCH_KIND kind>
+__global__ void complexMultiply(Param<convT> out, Param<convT> in1,
+                                Param<convT> in2, const int nelem) {
+    const int t = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (t >= nelem) return;
+
+    if (kind == AF_BATCH_NONE || kind == AF_BATCH_SAME) {
+        // Complex-multiply each signal by its corresponding filter
+        const int ridx = t;
+
+        convT c1 = in1.ptr[ridx];
+        convT c2 = in2.ptr[ridx];
+
+        out.ptr[ridx].x = c1.x * c2.x - c1.y * c2.y;
+        out.ptr[ridx].y = c1.x * c2.y + c1.y * c2.x;
+    } else if (kind == AF_BATCH_LHS) {
+        // Complex-multiply every signal by the single filter
+        const int ridx1 = t;
+        const int ridx2 = t % (in2.strides[3] * in2.dims[3]);
+
+        convT c1 = in1.ptr[ridx1];
+        convT c2 = in2.ptr[ridx2];
+
+        out.ptr[ridx1].x = c1.x * c2.x - c1.y * c2.y;
+        out.ptr[ridx1].y = c1.x * c2.y + c1.y * c2.x;
+    } else if (kind == AF_BATCH_RHS) {
+        // Complex-multiply the single signal by every filter
+        const int ridx1 = t % (in1.strides[3] * in1.dims[3]);
+        const int ridx2 = t;
+
+        convT c1 = in1.ptr[ridx1];
+        convT c2 = in2.ptr[ridx2];
+
+        out.ptr[ridx2].x = c1.x * c2.x - c1.y * c2.y;
+        out.ptr[ridx2].y = c1.x * c2.y + c1.y * c2.x;
+    }
+}
+
+template<typename To, typename Ti, bool expand, bool roundOut>
+__global__ void reorderOutput(Param<To> out, Param<Ti> in, CParam<To> filter,
+                              const int half_di0, const int rank,
+                              const int fftScale) {
+    const int t = blockIdx.x * blockDim.x + threadIdx.x;
+
+    const int tMax = out.strides[3] * out.dims[3];
+
+    if (t >= tMax) return;
+
+    const int do1 = out.dims[1];
+    const int do2 = out.dims[2];
+    const int so1 = out.strides[1];
+    const int so2 = out.strides[2];
+    const int so3 = out.strides[3];
+
+    const int si1 = in.strides[1];
+    const int si2 = in.strides[2];
+    const int si3 = in.strides[3];
+
+    const int to0 = t % so1;
+    const int to1 = (t / so1) % do1;
+    const int to2 = (t / so2) % do2;
+    const int to3 = (t / so3);
+
+    int oidx = to3 * so3 + to2 * so2 + to1 * so1 + to0;
+
+    int ti0, ti1, ti2, ti3;
+    if (expand) {
+        ti0 = to0;
+        ti1 = to1 * si1;
+        ti2 = to2 * si2;
+        ti3 = to3 * si3;
+    } else {
+        ti0 = to0 + filter.dims[0] / 2;
+        ti1 = (to1 + (rank > 1) * (filter.dims[1] / 2)) * si1;
+        ti2 = (to2 + (rank > 2) * (filter.dims[2] / 2)) * si2;
+        ti3 = to3 * si3;
+    }
+
+    // Divide output elements by the cuFFT scale factor; round the result when
+    // roundOut is set, which the host helper does for integral output types
+    if (ti0 < half_di0) {
+        // Copy top elements
+        int iidx = ti3 + ti2 + ti1 + ti0;
+        if (roundOut)
+            out.ptr[oidx] = (To)roundf(in.ptr[iidx].x / fftScale);
+        else
+            out.ptr[oidx] = (To)(in.ptr[iidx].x / fftScale);
+    } else if (ti0 < half_di0 + filter.dims[0] - 1) {
+        // Add signal and filter elements to central part
+        int iidx1 = ti3 + ti2 + ti1 + ti0;
+        int iidx2 = ti3 + ti2 + ti1 + (ti0 - half_di0);
+        if (roundOut)
+            out.ptr[oidx] =
+                (To)roundf((in.ptr[iidx1].x + in.ptr[iidx2].y) / fftScale);
+        else
+            out.ptr[oidx] =
+                (To)((in.ptr[iidx1].x + in.ptr[iidx2].y) / fftScale);
+    } else {
+        // Copy bottom elements
+        const int iidx = ti3 + ti2 + ti1 + (ti0 - half_di0);
+        if (roundOut)
+            out.ptr[oidx] = (To)roundf(in.ptr[iidx].y / fftScale);
+        else
+            out.ptr[oidx] = (To)(in.ptr[iidx].y / fftScale);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/fftconvolve.hpp b/src/backend/cuda/kernel/fftconvolve.hpp
index 2a6a686733..da3657d4de 100644
--- a/src/backend/cuda/kernel/fftconvolve.hpp
+++ b/src/backend/cuda/kernel/fftconvolve.hpp
@@ -7,336 +7,110 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <err_cufft.hpp>
-#include <debug_cuda.hpp>
-#include <Param.hpp>
-#include <memory.hpp>
-#include <cufft.h>
+#pragma once
 
-namespace cuda
-{
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/fftconvolve_cuh.hpp>
 
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS = 256;
 
-template<typename To, typename Ti>
-__global__ void packData(
-    Param<To> out,
-    CParam<Ti> in,
-    const int di0_half,
-    const bool odd_di0)
-{
-    const int t = blockDim.x * blockIdx.x + threadIdx.x;
-
-    const int tMax = out.strides[3] * out.dims[3];
-
-    if (t >= tMax)
-        return;
-
-    const int do1 = out.dims[1];
-    const int do2 = out.dims[2];
-    const int so1 = out.strides[1];
-    const int so2 = out.strides[2];
-    const int so3 = out.strides[3];
-
-    const int to0 = t % so1;
-    const int to1 = (t / so1) % do1;
-    const int to2 = (t / so2) % do2;
-    const int to3 = t / so3;
-
-    const int di1 = in.dims[1];
-    const int di2 = in.dims[2];
-    const int si1 = in.strides[1];
-    const int si2 = in.strides[2];
-    const int si3 = in.strides[3];
-
-    const int ti0 = to0;
-    const int ti1 = to1 * si1;
-    const int ti2 = to2 * si2;
-    const int ti3 = to3 * si3;
-
-    const int iidx1 = ti3 + ti2 + ti1 + ti0;
-    const int iidx2 = iidx1 + di0_half;
-    const int oidx = to3*so3 + to2*so2 + to1*so1 + to0;
-
-    if (to0 < di0_half && to1 < di1 && to2 < di2) {
-        out.ptr[oidx].x = in.ptr[iidx1];
-        if (ti0 == di0_half-1 && odd_di0)
-            out.ptr[oidx].y = 0;
-        else
-            out.ptr[oidx].y = in.ptr[iidx2];
-    }
-    else {
-        // Pad remaining elements with 0s
-        out.ptr[oidx].x = 0;
-        out.ptr[oidx].y = 0;
-    }
-}
-
-template<typename To, typename Ti>
-__global__ void padArray(
-    Param<To> out,
-    CParam<Ti> in)
-{
-    const int t = blockDim.x * blockIdx.x + threadIdx.x;
-
-    const int tMax = out.strides[3] * out.dims[3];
-
-    if (t >= tMax)
-        return;
-
-    const int do1 = out.dims[1];
-    const int do2 = out.dims[2];
-    const int so1 = out.strides[1];
-    const int so2 = out.strides[2];
-    const int so3 = out.strides[3];
-
-    const int to0 = t % so1;
-    const int to1 = (t / so1) % do1;
-    const int to2 = (t / so2) % do2;
-    const int to3 = (t / so3);
-
-    const int di0 = in.dims[0];
-    const int di1 = in.dims[1];
-    const int di2 = in.dims[2];
-    const int di3 = in.dims[3];
-    const int si1 = in.strides[1];
-    const int si2 = in.strides[2];
-    const int si3 = in.strides[3];
-
-    const int ti0 = to0;
-    const int ti1 = to1 * si1;
-    const int ti2 = to2 * si2;
-    const int ti3 = to3 * si3;
-
-    const int iidx = ti3 + ti2 + ti1 + ti0;
-
-    const int t2 = to3*so3 + to2*so2 + to1*so1 + to0;
-
-    if (to0 < di0 && to1 < di1 && to2 < di2 && to3 < di3) {
-        // Copy input elements to real elements, set imaginary elements to 0
-        out.ptr[t2].x = in.ptr[iidx];
-        out.ptr[t2].y = 0;
-    }
-    else {
-        // Pad remaining of the matrix to 0s
-        out.ptr[t2].x = 0;
-        out.ptr[t2].y = 0;
-    }
-}
-
-template<typename convT, ConvolveBatchKind kind>
-__global__ void complexMultiply(
-    Param<convT> out,
-    Param<convT> in1,
-    Param<convT> in2,
-    const int nelem)
-{
-    const int t = blockDim.x * blockIdx.x + threadIdx.x;
-
-    if (t >= nelem)
-        return;
-
-    if (kind == ONE2ONE || kind == MANY2MANY) {
-        // Complex multiply each signal to equivalent filter
-        const int ridx = t;
-
-        convT c1 = in1.ptr[ridx];
-        convT c2 = in2.ptr[ridx];
-
-        out.ptr[ridx].x = c1.x*c2.x - c1.y*c2.y;
-        out.ptr[ridx].y = (c1.x+c1.y) * (c2.x+c2.y) - c1.x*c2.x - c1.y*c2.y;
-    }
-    else if (kind == MANY2ONE) {
-        // Complex multiply all signals to filter
-        const int ridx1 = t;
-        const int ridx2 = t % (in2.strides[3] * in2.dims[3]);
-
-        convT c1 = in1.ptr[ridx1];
-        convT c2 = in2.ptr[ridx2];
-
-        out.ptr[ridx1].x = c1.x*c2.x - c1.y*c2.y;
-        out.ptr[ridx1].y = (c1.x+c1.y) * (c2.x+c2.y) - c1.x*c2.x - c1.y*c2.y;
-    }
-    else if (kind == ONE2MANY) {
-        // Complex multiply signal to all filters
-        const int ridx1 = t % (in1.strides[3] * in1.dims[3]);
-        const int ridx2 = t;
-
-        convT c1 = in1.ptr[ridx1];
-        convT c2 = in2.ptr[ridx2];
-
-        out.ptr[ridx2].x = c1.x*c2.x - c1.y*c2.y;
-        out.ptr[ridx2].y = (c1.x+c1.y) * (c2.x+c2.y) - c1.x*c2.x - c1.y*c2.y;
-    }
-}
-
-template<typename To, typename Ti, bool expand, bool roundOut>
-__global__ void reorderOutput(
-    Param<To> out,
-    Param<Ti> in,
-    CParam<To> filter,
-    const int half_di0,
-    const int baseDim,
-    const int fftScale)
-{
-    const int t = blockIdx.x * blockDim.x + threadIdx.x;
-
-    const int tMax = out.strides[3] * out.dims[3];
-
-    if (t >= tMax)
-        return;
-
-    const int do1 = out.dims[1];
-    const int do2 = out.dims[2];
-    const int so1 = out.strides[1];
-    const int so2 = out.strides[2];
-    const int so3 = out.strides[3];
-
-    const int si1 = in.strides[1];
-    const int si2 = in.strides[2];
-    const int si3 = in.strides[3];
-
-    const int to0 = t % so1;
-    const int to1 = (t / so1) % do1;
-    const int to2 = (t / so2) % do2;
-    const int to3 = (t / so3);
+template<typename convT, typename T>
+void packDataHelper(Param<convT> sig_packed, Param<convT> filter_packed,
+                    CParam<T> sig, CParam<T> filter) {
+    auto packData = common::getKernel(
+        "arrayfire::cuda::packData", {{fftconvolve_cuh_src}},
+        TemplateArgs(TemplateTypename<convT>(), TemplateTypename<T>()));
+    auto padArray = common::getKernel(
+        "arrayfire::cuda::padArray", {{fftconvolve_cuh_src}},
+        TemplateArgs(TemplateTypename<convT>(), TemplateTypename<T>()));
 
-    int oidx = to3*so3 + to2*so2 + to1*so1 + to0;
+    dim_t *sd = sig.dims;
 
-    int ti0, ti1, ti2, ti3;
-    if (expand) {
-        ti0 = to0;
-        ti1 = to1 * si1;
-        ti2 = to2 * si2;
-        ti3 = to3 * si3;
-    }
-    else {
-        ti0 = to0 + filter.dims[0]/2;
-        ti1 = (to1 + (baseDim > 1)*(filter.dims[1]/2)) * si1;
-        ti2 = (to2 + (baseDim > 2)*(filter.dims[2]/2)) * si2;
-        ti3 = to3 * si3;
-    }
+    int sig_packed_elem    = 1;
+    int filter_packed_elem = 1;
 
-    // Divide output elements to cuFFT resulting scale, round result if output
-    // type is single or double precision floating-point
-    if (ti0 < half_di0) {
-        // Copy top elements
-        int iidx = ti3 + ti2 + ti1 + ti0;
-        if (roundOut)
-            out.ptr[oidx] = (To)roundf(in.ptr[iidx].x / fftScale);
-        else
-            out.ptr[oidx] = (To)(in.ptr[iidx].x / fftScale);
+    for (int i = 0; i < 4; i++) {
+        sig_packed_elem *= sig_packed.dims[i];
+        filter_packed_elem *= filter_packed.dims[i];
     }
-    else if (ti0 < half_di0 + filter.dims[0] - 1) {
-        // Add signal and filter elements to central part
-        int iidx1 = ti3 + ti2 + ti1 + ti0;
-        int iidx2 = ti3 + ti2 + ti1 + (ti0 - half_di0);
-        if (roundOut)
-            out.ptr[oidx] = (To)roundf((in.ptr[iidx1].x + in.ptr[iidx2].y) / fftScale);
-        else
-            out.ptr[oidx] = (To)((in.ptr[iidx1].x + in.ptr[iidx2].y) / fftScale);
-    }
-    else {
-        // Copy bottom elements
-        const int iidx = ti3 + ti2 + ti1 + (ti0 - half_di0);
-        if (roundOut)
-            out.ptr[oidx] = (To)roundf(in.ptr[iidx].y / fftScale);
-        else
-            out.ptr[oidx] = (To)(in.ptr[iidx].y / fftScale);
-    }
-}
-
-template<typename convT, typename T>
-void packDataHelper(Param<convT> sig_packed,
-                    Param<convT> filter_packed,
-                    CParam<T> sig,
-                    CParam<T> filter,
-                    const int baseDim)
-{
-    dim_t *sd = sig.dims;
-
-    int sig_packed_elem = sig_packed.strides[3] * sig_packed.dims[3];
-    int filter_packed_elem = filter_packed.strides[3] * filter_packed.dims[3];
 
     // Number of packed complex elements in dimension 0
-    int sig_half_d0 = divup(sd[0], 2);
+    int sig_half_d0      = divup(sd[0], 2);
     bool sig_half_d0_odd = (sd[0] % 2 == 1);
 
     dim3 threads(THREADS);
     dim3 blocks(divup(sig_packed_elem, threads.x));
 
+    EnqueueArgs packQArgs(blocks, threads, getActiveStream());
+
     // Pack signal in a complex matrix where first dimension is half the input
     // (allows faster FFT computation) and pad array to a power of 2 with 0s
-    packData<convT, T><<<blocks, threads>>>(sig_packed, sig, sig_half_d0, sig_half_d0_odd);
+    packData(packQArgs, sig_packed, sig, sig_half_d0, sig_half_d0_odd);
     POST_LAUNCH_CHECK();
 
     blocks = dim3(divup(filter_packed_elem, threads.x));
 
+    EnqueueArgs padQArgs(blocks, threads, getActiveStream());
+
     // Pad filter array with 0s
-    padArray<convT, T><<<blocks, threads>>>(filter_packed, filter);
+    padArray(padQArgs, filter_packed, filter);
     POST_LAUNCH_CHECK();
 }
 
+// TODO(umar): This needs a better name
 template<typename T, typename convT>
-void complexMultiplyHelper(Param<T> out,
-                           Param<convT> sig_packed,
-                           Param<convT> filter_packed,
-                           CParam<T> sig,
-                           CParam<T> filter,
-                           ConvolveBatchKind kind)
-{
-    int sig_packed_elem = sig_packed.strides[3] * sig_packed.dims[3];
-    int filter_packed_elem = filter_packed.strides[3] * filter_packed.dims[3];
+void complexMultiplyHelper(Param<convT> sig_packed, Param<convT> filter_packed,
+                           AF_BATCH_KIND kind) {
+    auto cplxMul = common::getKernel(
+        "arrayfire::cuda::complexMultiply", {{fftconvolve_cuh_src}},
+        TemplateArgs(TemplateTypename<convT>(), TemplateArg(kind)));
+
+    int sig_packed_elem    = 1;
+    int filter_packed_elem = 1;
+
+    for (int i = 0; i < 4; i++) {
+        sig_packed_elem *= sig_packed.dims[i];
+        filter_packed_elem *= filter_packed.dims[i];
+    }
 
     dim3 threads(THREADS);
     dim3 blocks(divup(sig_packed_elem / 2, threads.x));
 
-    int mul_elem = (sig_packed_elem < filter_packed_elem) ?
-                        filter_packed_elem : sig_packed_elem;
-    blocks = dim3(divup(mul_elem, threads.x));
+    int mul_elem = (sig_packed_elem < filter_packed_elem) ? filter_packed_elem
+                                                          : sig_packed_elem;
+    blocks       = dim3(divup(mul_elem, threads.x));
 
-    // Multiply filter and signal FFT arrays
-    switch(kind) {
-        case ONE2ONE:
-            complexMultiply<convT, ONE2ONE  ><<<blocks, threads>>>
-                (sig_packed, sig_packed, filter_packed, mul_elem);
-            break;
-        case MANY2ONE:
-            complexMultiply<convT, MANY2ONE ><<<blocks, threads>>>
-                (sig_packed, sig_packed, filter_packed, mul_elem);
-            break;
-        case ONE2MANY:
-            complexMultiply<convT, ONE2MANY ><<<blocks, threads>>>
-                (filter_packed, sig_packed, filter_packed, mul_elem);
-            break;
-        case MANY2MANY:
-            complexMultiply<convT, MANY2MANY><<<blocks, threads>>>
-                (sig_packed, sig_packed, filter_packed, mul_elem);
-            break;
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    if (kind == AF_BATCH_RHS) {
+        cplxMul(qArgs, filter_packed, sig_packed, filter_packed, mul_elem);
+    } else {
+        cplxMul(qArgs, sig_packed, sig_packed, filter_packed, mul_elem);
     }
     POST_LAUNCH_CHECK();
 }
 
-template<typename T, typename convT, bool roundOut, int baseDim, bool expand>
-void reorderOutputHelper(Param<T> out,
-                         Param<convT> packed,
-                         CParam<T> sig,
-                         CParam<T> filter,
-                         ConvolveBatchKind kind)
-{
-    dim_t *sd = sig.dims;
+template<typename T, typename convT>
+void reorderOutputHelper(Param<T> out, Param<convT> packed, CParam<T> sig,
+                         CParam<T> filter, bool expand, int rank) {
+    constexpr bool RoundResult = std::is_integral<T>::value;
+
+    auto reorderOut = common::getKernel(
+        "arrayfire::cuda::reorderOutput", {{fftconvolve_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateTypename<convT>(),
+                     TemplateArg(expand), TemplateArg(RoundResult)));
+
+    dim_t *sd    = sig.dims;
     int fftScale = 1;
 
     // Calculate the scale by which to divide cuFFT results
-    for (int k = 0; k < baseDim; k++)
-        fftScale *= packed.dims[k];
+    for (int k = 0; k < rank; k++) fftScale *= packed.dims[k];
 
     // Number of packed complex elements in dimension 0
     int sig_half_d0 = divup(sd[0], 2);
@@ -344,11 +118,12 @@ void reorderOutputHelper(Param<T> out,
     dim3 threads(THREADS);
     dim3 blocks(divup(out.strides[3] * out.dims[3], threads.x));
 
-    reorderOutput<T, convT, expand, roundOut><<<blocks, threads>>>
-        (out, packed, filter, sig_half_d0, baseDim, fftScale);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    reorderOut(qArgs, out, packed, filter, sig_half_d0, rank, fftScale);
     POST_LAUNCH_CHECK();
 }
 
-} // namespace kernel
-
-} // namespace cuda
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/flood_fill.cuh b/src/backend/cuda/kernel/flood_fill.cuh
new file mode 100644
index 0000000000..ede793c0d3
--- /dev/null
+++ b/src/backend/cuda/kernel/flood_fill.cuh
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <af/defines.h>
+
+/// doAnotherLaunch is a variable in kernel space
+/// used to track the convergence of
+/// the breadth-first search algorithm
+__device__ int doAnotherLaunch = 0;
+
+namespace arrayfire {
+namespace cuda {
+
+/// Output array is set to the following values during the progression
+/// of the algorithm.
+///
+/// 0 - not processed
+/// 1 - not valid
+/// 2 - valid (candidate for neighborhood walk, pushed onto the queue)
+///
+/// Once the algorithm has finished, every valid pixel in the output is set
+/// to \p newValue and every other pixel is reset to zero.
+template<typename T>
+constexpr T VALID() {
+    return T(2);
+}
+template<typename T>
+constexpr T INVALID() {
+    return T(1);
+}
+template<typename T>
+constexpr T ZERO() {
+    return T(0);
+}
+
+template<typename T>
+__global__ void initSeeds(Param<T> out, CParam<uint> seedsx,
+                          CParam<uint> seedsy) {
+    uint idx = blockDim.x * blockIdx.x + threadIdx.x;
+    if (idx < seedsx.elements()) {
+        uint x                       = seedsx.ptr[idx];
+        uint y                       = seedsy.ptr[idx];
+        out.ptr[x + y * out.dims[0]] = VALID<T>();
+    }
+}
+
+template<typename T>
+__global__ void floodStep(Param<T> out, CParam<T> img, T lowValue,
+                          T highValue) {
+    constexpr int RADIUS      = 1;
+    constexpr int SMEM_WIDTH  = THREADS_X + 2 * RADIUS;
+    constexpr int SMEM_HEIGHT = THREADS_Y + 2 * RADIUS;
+
+    __shared__ T smem[SMEM_HEIGHT][SMEM_WIDTH];
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+    const int gx = blockDim.x * blockIdx.x + lx;
+    const int gy = blockDim.y * blockIdx.y + ly;
+    const int d0 = out.dims[0];
+    const int d1 = out.dims[1];
+    const int s0 = out.strides[0];
+    const int s1 = out.strides[1];
+
+    const T *iptr = (const T *)img.ptr;
+    T *optr       = (T *)out.ptr;
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < SMEM_HEIGHT;
+         b += blockDim.y, gy2 += blockDim.y) {
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < SMEM_WIDTH;
+             a += blockDim.x, gx2 += blockDim.x) {
+            int x      = gx2 - RADIUS;
+            int y      = gy2 - RADIUS;
+            bool inROI = (x >= 0 && x < d0 && y >= 0 && y < d1);
+            smem[b][a] = (inROI ? optr[x * s0 + y * s1] : INVALID<T>());
+        }
+    }
+    int i = lx + RADIUS;
+    int j = ly + RADIUS;
+
+    T tImgVal = iptr[(clamp(gx, 0, int(img.dims[0] - 1)) * img.strides[0] +
+                      clamp(gy, 0, int(img.dims[1] - 1)) * img.strides[1])];
+    const int isPxBtwnThresholds =
+        (tImgVal >= lowValue && tImgVal <= highValue);
+    __syncthreads();
+
+    T origOutVal      = smem[j][i];
+    bool blockChanged = false;
+    bool isBorderPxl  = (lx == 0 || ly == 0 || lx == (blockDim.x - 1) ||
+                        ly == (blockDim.y - 1));
+    do {
+        int validNeighbors = 0;
+#pragma unroll
+        for (int no_j = -RADIUS; no_j <= RADIUS; ++no_j) {
+#pragma unroll
+            for (int no_i = -RADIUS; no_i <= RADIUS; ++no_i) {
+                T currVal = smem[j + no_j][i + no_i];
+                validNeighbors += (currVal == VALID<T>());
+            }
+        }
+        __syncthreads();
+
+        bool outChanged = (smem[j][i] == ZERO<T>() && (validNeighbors > 0));
+        if (outChanged) { smem[j][i] = T(isPxBtwnThresholds + INVALID<T>()); }
+        blockChanged = __syncthreads_or(int(outChanged));
+    } while (blockChanged);
+
+    T newOutVal = smem[j][i];
+
+    bool borderChanged =
+        (isBorderPxl && newOutVal != origOutVal && newOutVal == VALID<T>());
+
+    borderChanged = __syncthreads_or(int(borderChanged));
+
+    if (borderChanged && lx == 0 && ly == 0) {
+        // At least one border pixel changed. Therefore, mark for
+        // another kernel launch to propagate changes beyond the border
+        // of this block
+        doAnotherLaunch = 1;
+    }
+
+    if (gx < d0 && gy < d1) { optr[(gx * s0 + gy * s1)] = smem[j][i]; }
+}
+
+template<typename T>
+__global__ void finalizeOutput(Param<T> out, T newValue) {
+    uint gx = blockDim.x * blockIdx.x + threadIdx.x;
+    uint gy = blockDim.y * blockIdx.y + threadIdx.y;
+    if (gx < out.dims[0] && gy < out.dims[1]) {
+        uint idx     = gx * out.strides[0] + gy * out.strides[1];
+        T val        = out.ptr[idx];
+        out.ptr[idx] = (val == VALID<T>() ? newValue : ZERO<T>());
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/flood_fill.hpp b/src/backend/cuda/kernel/flood_fill.hpp
new file mode 100644
index 0000000000..03e3fd8fea
--- /dev/null
+++ b/src/backend/cuda/kernel/flood_fill.hpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/defines.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/flood_fill_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr int THREADS   = 256;
+constexpr int TILE_DIM  = 32;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = THREADS / TILE_DIM;
+
+// Shared memory per block required by floodFill kernel
+template<typename T>
+constexpr size_t sharedMemRequiredByFloodFill() {
+    // 1-pixel border neighborhood
+    return sizeof(T) * ((THREADS_X + 2) * (THREADS_Y + 2));
+}
+
+template<typename T>
+void floodFill(Param<T> out, CParam<T> image, CParam<uint> seedsx,
+               CParam<uint> seedsy, const T newValue, const T lowValue,
+               const T highValue, const af::connectivity nlookup) {
+    UNUSED(nlookup);
+    if (sharedMemRequiredByFloodFill<T>() >
+        getDeviceProp(getActiveDeviceId()).sharedMemPerBlock) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nCurrent thread's CUDA device doesn't have sufficient "
+                 "shared memory required by FloodFill\n");
+        CUDA_NOT_SUPPORTED(errMessage);
+    }
+
+    auto initSeeds =
+        common::getKernel("arrayfire::cuda::initSeeds", {{flood_fill_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
+    auto floodStep =
+        common::getKernel("arrayfire::cuda::floodStep", {{flood_fill_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()),
+                          {{DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
+    auto finalizeOutput = common::getKernel(
+        "arrayfire::cuda::finalizeOutput", {{flood_fill_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()));
+
+    EnqueueArgs qArgs(dim3(divup(seedsx.elements(), THREADS)), dim3(THREADS),
+                      getActiveStream());
+    initSeeds(qArgs, out, seedsx, seedsy);
+    POST_LAUNCH_CHECK();
+
+    dim3 threads(THREADS_X, THREADS_Y);
+    dim3 blocks(divup(image.dims[0], threads.x),
+                divup(image.dims[1], threads.y));
+    EnqueueArgs fQArgs(blocks, threads, getActiveStream());
+
+    auto continueFlagPtr = floodStep.getDevPtr("doAnotherLaunch");
+
+    for (int doAnotherLaunch = 1; doAnotherLaunch > 0;) {
+        doAnotherLaunch = 0;
+        floodStep.setFlag(continueFlagPtr, &doAnotherLaunch);
+        floodStep(fQArgs, out, image, lowValue, highValue);
+        POST_LAUNCH_CHECK();
+        doAnotherLaunch = floodStep.getFlag(continueFlagPtr);
+    }
+    finalizeOutput(fQArgs, out, newValue);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
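
Reviewer note on flood_fill.hpp: the launcher above repeats floodStep until no block reports a changed border pixel, then finalizeOutput maps the markers to the output values. A minimal CPU sketch of the same fixed-point idea follows, useful as a reference when testing. It assumes a single seed, row-major storage, and marker values 0/1/2 for ZERO/INVALID/VALID; the function name floodFillReference and those marker values are illustrative assumptions, not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical marker values standing in for ZERO<T>(), INVALID<T>() and
// VALID<T>(); the real definitions are not shown in this part of the diff.
enum Marker : int { kZero = 0, kInvalid = 1, kValid = 2 };

// CPU reference: grow a region from one seed over 8-connected pixels whose
// image value lies in [lowValue, highValue]. Row-major layout for simplicity.
std::vector<int> floodFillReference(const std::vector<int>& image, int width,
                                    int height, int seedX, int seedY,
                                    int lowValue, int highValue, int newValue) {
    assert(image.size() == static_cast<std::size_t>(width) * height);
    std::vector<int> out(image.size(), kZero);
    out[seedY * width + seedX] = kValid;

    // Sweep until a full pass changes nothing (the GPU code instead
    // relaunches floodStep while doAnotherLaunch is set).
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                if (out[y * width + x] != kZero) continue;
                int validNeighbors = 0;
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = x + dx;
                        const int ny = y + dy;
                        if (nx < 0 || nx >= width || ny < 0 || ny >= height)
                            continue;
                        validNeighbors += (out[ny * width + nx] == kValid);
                    }
                }
                if (validNeighbors == 0) continue;
                const int v        = image[y * width + x];
                const bool inRange = (v >= lowValue && v <= highValue);
                out[y * width + x] = inRange ? kValid : kInvalid;
                changed            = true;
            }
        }
    }
    // Same mapping as finalizeOutput: VALID becomes newValue, the rest zero.
    for (int& v : out) v = (v == kValid ? newValue : 0);
    return out;
}
```

Because a pixel only ever moves from the zero marker to one of the two other markers, and which one depends only on its own threshold test, the sweep order should not affect the final result. That is why a sequential sweep can serve as a reference for the block-parallel kernel.
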
diff --git a/src/backend/cuda/kernel/gradient.cuh b/src/backend/cuda/kernel/gradient.cuh
new file mode 100644
index 0000000000..19ec419887
--- /dev/null
+++ b/src/backend/cuda/kernel/gradient.cuh
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+#define sidx(y, x) scratch[y + 1][x + 1]
+
+template<typename T>
+__global__ void gradient(Param<T> grad0, Param<T> grad1, CParam<T> in,
+                         const int blocksPerMatX, const int blocksPerMatY) {
+    const int idz = blockIdx.x / blocksPerMatX;
+    const int idw = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - idz * blocksPerMatX;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - idw * blocksPerMatY;
+
+    const int xB = blockIdx_x * blockDim.x;
+    const int yB = blockIdx_y * blockDim.y;
+
+    const int idx = threadIdx.x + xB;
+    const int idy = threadIdx.y + yB;
+
+    bool cond = (idx >= in.dims[0] || idy >= in.dims[1] || idz >= in.dims[2] ||
+                 idw >= in.dims[3]);
+
+    int xmax = (TX > (in.dims[0] - xB)) ? (in.dims[0] - xB) : TX;
+    int ymax = (TY > (in.dims[1] - yB)) ? (in.dims[1] - yB) : TY;
+
+    int iIdx =
+        idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1] + idx;
+
+    int g0dx = idw * grad0.strides[3] + idz * grad0.strides[2] +
+               idy * grad0.strides[1] + idx;
+
+    int g1dx = idw * grad1.strides[3] + idz * grad1.strides[2] +
+               idy * grad1.strides[1] + idx;
+
+    __shared__ T scratch[TY + 2][TX + 2];
+
+    // Multipliers - 0.5 for interior, 1 for edge cases
+    float xf = 0.5 * (1 + (idx == 0 || idx >= (in.dims[0] - 1)));
+    float yf = 0.5 * (1 + (idy == 0 || idy >= (in.dims[1] - 1)));
+
+    // Copy data to scratch space
+    sidx(threadIdx.y, threadIdx.x) = cond ? scalar<T>(0) : in.ptr[iIdx];
+
+    __syncthreads();
+
+    // Copy buffer zone data. Corners such as (0,0) are not used.
+    // Cols
+    if (threadIdx.y == 0) {
+        // Y-1
+        sidx(-1, threadIdx.x)   = (cond || idy == 0)
+                                      ? sidx(0, threadIdx.x)
+                                      : in.ptr[iIdx - in.strides[1]];
+        sidx(ymax, threadIdx.x) = (cond || (idy + ymax) >= in.dims[1])
+                                      ? sidx(ymax - 1, threadIdx.x)
+                                      : in.ptr[iIdx + ymax * in.strides[1]];
+    }
+    // Rows
+    if (threadIdx.x == 0) {
+        sidx(threadIdx.y, -1) =
+            (cond || idx == 0) ? sidx(threadIdx.y, 0) : in.ptr[iIdx - 1];
+        sidx(threadIdx.y, xmax) = (cond || (idx + xmax) >= in.dims[0])
+                                      ? sidx(threadIdx.y, xmax - 1)
+                                      : in.ptr[iIdx + xmax];
+    }
+
+    __syncthreads();
+
+    if (cond) return;
+
+    grad0.ptr[g0dx] = xf * (sidx(threadIdx.y, threadIdx.x + 1) -
+                            sidx(threadIdx.y, threadIdx.x - 1));
+    grad1.ptr[g1dx] = yf * (sidx(threadIdx.y + 1, threadIdx.x) -
+                            sidx(threadIdx.y - 1, threadIdx.x));
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
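
Reviewer note on gradient.cuh: the xf/yf multipliers combined with the replicated halo values give a central difference in the interior and a one-sided difference at the edges. A 1D CPU sketch of that arithmetic follows. The name gradient1D is hypothetical and the sketch covers only one axis of contiguous data, so it is an illustration of the formula rather than the kernel itself.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 1D illustration of the kernel's edge handling:
//   interior: 0.5 * (in[i + 1] - in[i - 1])
//   edges:    1.0 * (one-sided difference), because the missing neighbor
//             is replaced by the edge element itself.
std::vector<float> gradient1D(const std::vector<float>& in) {
    const int n = static_cast<int>(in.size());
    std::vector<float> grad(in.size());
    for (int i = 0; i < n; ++i) {
        const bool edge  = (i == 0 || i >= n - 1);
        const float f    = 0.5f * (1 + edge);  // 0.5 interior, 1 at edges
        const float next = in[std::min(i + 1, n - 1)];
        const float prev = in[std::max(i - 1, 0)];
        grad[i]          = f * (next - prev);
    }
    return grad;
}
```

For the input {1, 4, 9, 16} this yields {3, 4, 6, 7}: the two end values are plain forward and backward differences, and the middle two are halved central differences.
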
diff --git a/src/backend/cuda/kernel/gradient.hpp b/src/backend/cuda/kernel/gradient.hpp
index 1712d6f6f8..3aaf250e60 100644
--- a/src/backend/cuda/kernel/gradient.hpp
+++ b/src/backend/cuda/kernel/gradient.hpp
@@ -7,108 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/gradient_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-
-#define sidx(y, x) scratch[y + 1][x + 1]
-
-        template<typename T>
-        __global__
-        void gradient_kernel(Param<T> grad0, Param<T> grad1, CParam<T> in,
-                             const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int idz = blockIdx.x / blocksPerMatX;
-            const int idw = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - idz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - idw * blocksPerMatY;
-
-            const int xB = blockIdx_x * blockDim.x;
-            const int yB = blockIdx_y * blockDim.y;
-
-            const int idx = threadIdx.x + xB;
-            const int idy = threadIdx.y + yB;
-
-            bool cond = (idx >= in.dims[0] || idy >= in.dims[1] ||
-                         idz >= in.dims[2] || idw >= in.dims[3]);
-
-            int xmax = (TX > (in.dims[0] - xB)) ? (in.dims[0] - xB) : TX;
-            int ymax = (TY > (in.dims[1] - yB)) ? (in.dims[1] - yB) : TY;
+#include <array>
 
-            int iIdx = idw * in.strides[3] + idz * in.strides[2]
-                          + idy * in.strides[1] + idx;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            int g0dx = idw * grad0.strides[3] + idz * grad0.strides[2]
-                          + idy * grad0.strides[1] + idx;
+template<typename T>
+void gradient(Param<T> grad0, Param<T> grad1, CParam<T> in) {
+    constexpr unsigned TX = 32;
+    constexpr unsigned TY = 8;
 
-            int g1dx = idw * grad1.strides[3] + idz * grad1.strides[2]
-                          + idy * grad1.strides[1] + idx;
+    auto gradient =
+        common::getKernel("arrayfire::cuda::gradient", {{gradient_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()),
+                          {{DefineValue(TX), DefineValue(TY)}});
 
-            __shared__ T scratch[TY + 2][TX + 2];
+    dim3 threads(TX, TY, 1);
 
-            // Multipliers - 0.5 for interior, 1 for edge cases
-            float xf = 0.5 * (1 + (idx == 0 || idx >= (in.dims[0] - 1)));
-            float yf = 0.5 * (1 + (idy == 0 || idy >= (in.dims[1] - 1)));
+    int blocksPerMatX = divup(in.dims[0], TX);
+    int blocksPerMatY = divup(in.dims[1], TY);
+    dim3 blocks(blocksPerMatX * in.dims[2], blocksPerMatY * in.dims[3], 1);
 
-            // Copy data to scratch space
-            sidx(threadIdx.y, threadIdx.x) = cond ? scalar<T>(0) : in.ptr[iIdx];
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-            __syncthreads();
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-            // Copy buffer zone data. Corner (0,0) etc, are not used.
-            // Cols
-            if(threadIdx.y == 0) {
-                // Y-1
-                sidx(-1, threadIdx.x) = (cond || idy == 0) ?
-                                        sidx(0, threadIdx.x) : in.ptr[iIdx - in.strides[1]];
-                sidx(ymax, threadIdx.x) = (cond || idy + ymax >= in.dims[1] - 1) ?
-                                        sidx(ymax - 1, threadIdx.x) : in.ptr[iIdx + ymax * in.strides[1]];
-            }
-            // Rows
-            if(threadIdx.x == 0) {
-                sidx(threadIdx.y, -1) = (cond || idx == 0) ?
-                                        sidx(threadIdx.y, 0) : in.ptr[iIdx - 1];
-                sidx(threadIdx.y, xmax) = (cond || idx + xmax >= in.dims[0] - 1) ?
-                                        sidx(threadIdx.y, xmax - 1) : in.ptr[iIdx + xmax];
-            }
-
-            __syncthreads();
-
-            if (cond) return;
-
-            grad0.ptr[g0dx] = xf * (sidx(threadIdx.y, threadIdx.x + 1)
-                                 -  sidx(threadIdx.y, threadIdx.x - 1));
-            grad1.ptr[g1dx] = yf * (sidx(threadIdx.y + 1, threadIdx.x)
-                                 -  sidx(threadIdx.y - 1, threadIdx.x));
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void gradient(Param<T> grad0, Param<T> grad1, CParam<T> in)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(in.dims[0], TX);
-            int blocksPerMatY = divup(in.dims[1], TY);
-            dim3 blocks(blocksPerMatX * in.dims[2],
-                        blocksPerMatY * in.dims[3],
-                        1);
-
-            gradient_kernel<T><<<blocks, threads>>>(grad0, grad1, in, blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    gradient(qArgs, grad0, grad1, in, blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
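
Reviewer note on gradient.hpp: the new launcher folds an oversized Y block count into gridDim.z, and the kernel recovers the logical Y block as blockIdx.y + blockIdx.z * gridDim.y. The host-only sketch below reproduces that arithmetic so it can be checked without a GPU. The names divUp, GridYZ, splitGridY and logicalBlockY are hypothetical stand-ins, and the 65535 limit used in the example is an assumed value of maxGridSize[1], not something stated in the diff.

```cpp
// Stand-in for the divup helper used by the launcher (ceiling division).
inline int divUp(int a, int b) { return (a + b - 1) / b; }

struct GridYZ {
    int y;  // value for blocks.y
    int z;  // value for blocks.z
};

// Same two steps as the launcher: pick enough Z slices to respect the
// Y limit, then spread the logical Y blocks evenly over those slices.
inline GridYZ splitGridY(int logicalBlocksY, int maxBlocksY) {
    const int z = divUp(logicalBlocksY, maxBlocksY);
    const int y = divUp(logicalBlocksY, z);
    return GridYZ{y, z};
}

// Kernel-side recovery of the logical Y block index.
inline int logicalBlockY(int blockIdxY, int blockIdxZ, int gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

With a limit of 65535, 100000 logical blocks become y = 50000 and z = 2. When the count does not divide evenly (for example 131071 becomes y = 43691, z = 3), the grid holds a few more blocks than needed. In the kernel those extra blocks produce idw >= in.dims[3], so the existing cond check makes them return without writing.
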
diff --git a/src/backend/cuda/kernel/hamming.hpp b/src/backend/cuda/kernel/hamming.hpp
deleted file mode 100644
index 99fcf7f783..0000000000
--- a/src/backend/cuda/kernel/hamming.hpp
+++ /dev/null
@@ -1,478 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/defines.h>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <memory.hpp>
-#include <platform.hpp>
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const unsigned THREADS = 256;
-
-template<typename T, unsigned feat_len, bool use_shmem>
-__global__ void hamming_matcher_unroll(
-    unsigned* out_idx,
-    unsigned* out_dist,
-    CParam<T> query,
-    CParam<T> train,
-    const unsigned max_dist)
-{
-    unsigned nquery = query.dims[0];
-    unsigned ntrain = train.dims[0];
-
-    unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
-    unsigned tid = threadIdx.x;
-
-    __shared__ unsigned s_dist[THREADS];
-    __shared__ unsigned s_idx[THREADS];
-
-    extern __shared__ char smem[];
-    T* s_query = (T*)smem;
-    T* s_train = (T*)smem + feat_len;
-
-    s_dist[tid] = max_dist;
-    s_idx[tid]  = 0xffffffff;
-
-    bool valid_feat = (f < ntrain);
-
-    if (valid_feat) {
-        // Copy blockDim.x training features to shared memory
-        if (use_shmem) {
-            #pragma unroll
-            for (unsigned i = 0; i < feat_len; i++) {
-                s_train[i * blockDim.x + tid] = train.ptr[i * ntrain + f];
-            }
-        }
-    }
-    __syncthreads();
-
-    for (unsigned j = 0; j < nquery; j++) {
-        s_dist[tid] = max_dist;
-
-        // Load one query feature that will be tested against all training
-        // features in current block
-        if (tid < feat_len && valid_feat) {
-            s_query[tid] = query.ptr[tid * nquery + j];
-        }
-        __syncthreads();
-
-        unsigned dist = 0;
-        if (valid_feat) {
-            #pragma unroll
-            for (unsigned k = 0; k < feat_len; k++) {
-                // Calculate Hamming distance for 32-bits of descriptor and
-                // accumulates to dist
-                if (use_shmem) {
-                    dist += __popc(s_train[k * blockDim.x + tid] ^ s_query[k]);
-                }
-                else {
-                    dist += __popc(train.ptr[k * ntrain + f] ^ s_query[k]);
-                }
-            }
-
-            // Only stores the feature index and distance if it's smaller
-            // than the best match found so far
-            s_dist[tid] = dist;
-            s_idx[tid]  = f;
-        }
-        __syncthreads();
-
-        // Find best match in training features from block to the current
-        // query feature
-        if (tid < 128) {
-            if (s_dist[tid + 128] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 128];
-                s_idx[tid]  = s_idx[tid + 128];
-            }
-        }
-        __syncthreads();
-        if (tid < 64) {
-            if (s_dist[tid + 64] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 64];
-                s_idx[tid]  = s_idx[tid + 64];
-            }
-        }
-        __syncthreads();
-        if (tid < 32) {
-            if (s_dist[tid + 32] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 32];
-                s_idx[tid]  = s_idx[tid + 32];
-            }
-        }
-        __syncthreads();
-        if (tid < 16) {
-            if (s_dist[tid + 16] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 16];
-                s_idx[tid]  = s_idx[tid + 16];
-            }
-        }
-        __syncthreads();
-        if (tid < 8) {
-            if (s_dist[tid + 8] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 8];
-                s_idx[tid]  = s_idx[tid + 8];
-            }
-        }
-        __syncthreads();
-        if (tid < 4) {
-            if (s_dist[tid + 4] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 4];
-                s_idx[tid]  = s_idx[tid + 4];
-            }
-        }
-        __syncthreads();
-        if (tid < 2) {
-            if (s_dist[tid + 2] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 2];
-                s_idx[tid]  = s_idx[tid + 2];
-            }
-        }
-        __syncthreads();
-        if (tid < 1) {
-            if (s_dist[tid + 1] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 1];
-                s_idx[tid]  = s_idx[tid + 1];
-            }
-        }
-        __syncthreads();
-
-        // Store best match in training features from block to the current
-        // query feature
-        if (valid_feat) {
-            out_dist[j * gridDim.x + blockIdx.x] = s_dist[0];
-            out_idx[j * gridDim.x + blockIdx.x]  = s_idx[0];
-        }
-        __syncthreads();
-    }
-}
-
-template<typename T, bool use_shmem>
-__global__ void hamming_matcher(
-    unsigned* out_idx,
-    unsigned* out_dist,
-    CParam<T> query,
-    CParam<T> train,
-    const unsigned max_dist,
-    const unsigned feat_len)
-{
-    unsigned nquery = query.dims[0];
-    unsigned ntrain = train.dims[0];
-
-    unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
-    unsigned tid = threadIdx.x;
-
-    __shared__ unsigned s_dist[THREADS];
-    __shared__ unsigned s_idx[THREADS];
-
-    extern __shared__ char smem[];
-    T* s_query = (T*)smem;
-    T* s_train = (T*)smem + feat_len;
-
-    s_dist[tid] = max_dist;
-    s_idx[tid]  = 0xffffffff;
-
-    bool valid_feat = (f < ntrain);
-
-    if (valid_feat) {
-        // Copy blockDim.x training features to shared memory
-        if (use_shmem) {
-            for (unsigned i = 0; i < feat_len; i++) {
-                s_train[i * blockDim.x + tid] = train.ptr[i * ntrain + f];
-            }
-        }
-    }
-    __syncthreads();
-
-    for (unsigned j = 0; j < nquery; j++) {
-        s_dist[tid] = max_dist;
-
-        // Load one query feature that will be tested against all training
-        // features in current block
-        if (tid < feat_len && valid_feat) {
-            s_query[tid] = query.ptr[tid * nquery + j];
-        }
-        __syncthreads();
-
-        unsigned dist = 0;
-        if (valid_feat) {
-            for (unsigned k = 0; k < feat_len; k++) {
-                // Calculate Hamming distance for 32-bits of descriptor and
-                // accumulates to dist
-                if (use_shmem) {
-                    dist += __popc(s_train[k * blockDim.x + tid] ^ s_query[k]);
-                }
-                else {
-                    dist += __popc(train.ptr[k * ntrain + f] ^ s_query[k]);
-                }
-            }
-
-            // Only stores the feature index and distance if it's smaller
-            // than the best match found so far
-            s_dist[tid] = dist;
-            s_idx[tid]  = f;
-        }
-        __syncthreads();
-
-        // Find best match in training features from block to the current
-        // query feature
-        if (tid < 128) {
-            if (s_dist[tid + 128] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 128];
-                s_idx[tid]  = s_idx[tid + 128];
-            }
-        }
-        __syncthreads();
-        if (tid < 64) {
-            if (s_dist[tid + 64] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 64];
-                s_idx[tid]  = s_idx[tid + 64];
-            }
-        }
-        __syncthreads();
-        if (tid < 32) {
-            if (s_dist[tid + 32] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 32];
-                s_idx[tid]  = s_idx[tid + 32];
-            }
-        }
-        __syncthreads();
-        if (tid < 16) {
-            if (s_dist[tid + 16] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 16];
-                s_idx[tid]  = s_idx[tid + 16];
-            }
-        }
-        __syncthreads();
-        if (tid < 8) {
-            if (s_dist[tid + 8] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 8];
-                s_idx[tid]  = s_idx[tid + 8];
-            }
-        }
-        __syncthreads();
-        if (tid < 4) {
-            if (s_dist[tid + 4] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 4];
-                s_idx[tid]  = s_idx[tid + 4];
-            }
-        }
-        __syncthreads();
-        if (tid < 2) {
-            if (s_dist[tid + 2] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 2];
-                s_idx[tid]  = s_idx[tid + 2];
-            }
-        }
-        __syncthreads();
-        if (tid < 1) {
-            if (s_dist[tid + 1] < s_dist[tid]) {
-                s_dist[tid] = s_dist[tid + 1];
-                s_idx[tid]  = s_idx[tid + 1];
-            }
-        }
-        __syncthreads();
-
-        // Store best match in training features from block to the current
-        // query feature
-        if (valid_feat) {
-            out_dist[j * gridDim.x + blockIdx.x] = s_dist[0];
-            out_idx[j * gridDim.x + blockIdx.x]  = s_idx[0];
-        }
-        __syncthreads();
-    }
-}
-
-__global__ void select_matches(
-    Param<unsigned> idx,
-    Param<unsigned> dist,
-    const unsigned* in_idx,
-    const unsigned* in_dist,
-    const unsigned nfeat,
-    const unsigned nelem,
-    const unsigned max_dist)
-{
-    unsigned f = blockIdx.x * blockDim.x + threadIdx.x;
-    unsigned sid = threadIdx.x * blockDim.y + threadIdx.y;
-
-    __shared__ unsigned s_dist[THREADS];
-    __shared__ unsigned s_idx[THREADS];
-
-    if (f < nfeat) {
-        s_dist[sid] = max_dist;
-        __syncthreads();
-
-        for (unsigned i = threadIdx.y; i < nelem; i += blockDim.y) {
-            unsigned dist = in_dist[f * nelem + i];
-
-            // Copy all best matches previously found in hamming_matcher() to
-            // shared memory
-            if (dist < s_dist[sid]) {
-                s_dist[sid] = dist;
-                s_idx[sid]  = in_idx[f * nelem + i];
-            }
-            __syncthreads();
-        }
-
-        // Reduce best matches and find the best of them all
-        for (unsigned i = blockDim.y / 2; i > 0; i >>= 1) {
-            if (threadIdx.y < i) {
-                unsigned dist = s_dist[sid + i];
-                if (dist < s_dist[sid]) {
-                    s_dist[sid] = dist;
-                    s_idx[sid]  = s_idx[sid + i];
-                }
-                __syncthreads();
-            }
-        }
-
-        // Store best matches and indexes to training dataset
-        if (threadIdx.y == 0) {
-            dist.ptr[f] = s_dist[threadIdx.x * blockDim.y];
-            idx.ptr[f]  = s_idx[threadIdx.x * blockDim.y];
-        }
-    }
-}
-
-template<typename T>
-void hamming_matcher(Param<uint> idx,
-                     Param<uint> dist,
-                     CParam<T> query,
-                     CParam<T> train,
-                     const dim_t dist_dim,
-                     const unsigned n_dist)
-{
-    const unsigned feat_len = query.dims[dist_dim];
-    const unsigned max_dist = feat_len * 8 * sizeof(T);
-
-    if (feat_len > THREADS) {
-        CUDA_NOT_SUPPORTED();
-    }
-
-    const dim_t sample_dim = (dist_dim == 0) ? 1 : 0;
-
-    const unsigned nquery = query.dims[sample_dim];
-    const unsigned ntrain = train.dims[sample_dim];
-
-    dim3 threads(THREADS, 1);
-    dim3 blocks(divup(ntrain, threads.x), 1);
-
-    // Determine maximum feat_len capable of using shared memory (faster)
-    int device = getActiveDeviceId();
-    cudaDeviceProp prop = getDeviceProp(device);
-    size_t avail_smem = prop.sharedMemPerBlock;
-    size_t smem_predef = 2 * THREADS * sizeof(unsigned) + feat_len * sizeof(T);
-    size_t strain_sz = threads.x * feat_len * sizeof(T);
-    bool use_shmem = (avail_smem >= (smem_predef + strain_sz)) ? true : false;
-    unsigned smem_sz = (use_shmem) ? smem_predef + strain_sz : smem_predef;
-
-    unsigned nblk = blocks.x;
-
-    unsigned* d_blk_idx  = memAlloc<unsigned>(nblk * nquery);
-    unsigned* d_blk_dist = memAlloc<unsigned>(nblk * nquery);
-
-    // For each query vector, find training vector with smallest Hamming
-    // distance per CUDA block
-    if (use_shmem) {
-        switch(feat_len) {
-        // Optimized lengths (faster due to loop unrolling)
-        case 1:
-            hamming_matcher_unroll<T,1,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 2:
-            hamming_matcher_unroll<T,2,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 4:
-            hamming_matcher_unroll<T,4,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 8:
-            hamming_matcher_unroll<T,8,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 16:
-            hamming_matcher_unroll<T,16,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 32:
-            hamming_matcher_unroll<T,32,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 64:
-            hamming_matcher_unroll<T,64,true><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        default:
-            hamming_matcher<T,true><<<blocks, threads, smem_sz>>>
-                           (d_blk_idx, d_blk_dist, query, train, max_dist, feat_len);
-        }
-    }
-    else {
-        switch(feat_len) {
-        // Optimized lengths (faster due to loop unrolling)
-        case 1:
-            hamming_matcher_unroll<T,1,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 2:
-            hamming_matcher_unroll<T,2,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 4:
-            hamming_matcher_unroll<T,4,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 8:
-            hamming_matcher_unroll<T,8,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 16:
-            hamming_matcher_unroll<T,16,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 32:
-            hamming_matcher_unroll<T,32,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        case 64:
-            hamming_matcher_unroll<T,64,false><<<blocks, threads, smem_sz>>>
-                                  (d_blk_idx, d_blk_dist, query, train, max_dist);
-            break;
-        default:
-            hamming_matcher<T,false><<<blocks, threads, smem_sz>>>
-                           (d_blk_idx, d_blk_dist, query, train, max_dist, feat_len);
-        }
-    }
-    POST_LAUNCH_CHECK();
-
-    threads = dim3(32, 8);
-    blocks = dim3(nquery, 1);
-
-    // Reduce all smallest Hamming distances from each block and store final
-    // best match
-    select_matches<<<blocks, threads>>>(idx, dist,
-                                        d_blk_idx, d_blk_dist,
-                                        nquery, nblk, max_dist);
-    POST_LAUNCH_CHECK();
-
-    memFree(d_blk_idx);
-    memFree(d_blk_dist);
-}
-
-} // namespace kernel
-
-} // namespace cuda
diff --git a/src/backend/cuda/kernel/harris.hpp b/src/backend/cuda/kernel/harris.hpp
new file mode 100644
index 0000000000..e956f02441
--- /dev/null
+++ b/src/backend/cuda/kernel/harris.hpp
@@ -0,0 +1,353 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <memory.hpp>
+#include <af/constants.h>
+
+#include "config.hpp"
+#include "convolve.hpp"
+#include "gradient.hpp"
+#include "range.hpp"
+#include "sort_by_key.hpp"
+
+#include <vector>
+
+namespace arrayfire {
+namespace cuda {
+
+namespace kernel {
+
+static const unsigned BLOCK_SIZE = 16;
+
+template<typename T>
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * af::Pi * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+
+// max_val()
+// Returns max of x and y
+inline __device__ int max_val(const int x, const int y) { return max(x, y); }
+inline __device__ unsigned max_val(const unsigned x, const unsigned y) {
+    return max(x, y);
+}
+inline __device__ float max_val(const float x, const float y) {
+    return fmax(x, y);
+}
+inline __device__ double max_val(const double x, const double y) {
+    return fmax(x, y);
+}
+
+template<typename T>
+__global__ void second_order_deriv(T* ixx_out, T* ixy_out, T* iyy_out,
+                                   const unsigned in_len, const T* ix_in,
+                                   const T* iy_in) {
+    const unsigned x = blockDim.x * blockIdx.x + threadIdx.x;
+
+    if (x < in_len) {
+        ixx_out[x] = ix_in[x] * ix_in[x];
+        ixy_out[x] = ix_in[x] * iy_in[x];
+        iyy_out[x] = iy_in[x] * iy_in[x];
+    }
+}
+
+template<typename T>
+__global__ void harris_responses(T* resp_out, const unsigned idim0,
+                                 const unsigned idim1, const T* ixx_in,
+                                 const T* ixy_in, const T* iyy_in,
+                                 const float k_thr, const unsigned border_len) {
+    const unsigned r = border_len;
+
+    const unsigned x = blockDim.x * blockIdx.x + threadIdx.x + r;
+    const unsigned y = blockDim.y * blockIdx.y + threadIdx.y + r;
+
+    if (x < idim1 - r && y < idim0 - r) {
+        const unsigned idx = x * idim0 + y;
+
+        // Calculates matrix trace and determinant
+        T tr  = ixx_in[idx] + iyy_in[idx];
+        T det = ixx_in[idx] * iyy_in[idx] - ixy_in[idx] * ixy_in[idx];
+
+        // Calculates local Harris response
+        resp_out[idx] = det - k_thr * (tr * tr);
+    }
+}
+
+template<typename T>
+__global__ void non_maximal(float* x_out, float* y_out, float* resp_out,
+                            unsigned* count, const unsigned idim0,
+                            const unsigned idim1, const T* resp_in,
+                            const float min_resp, const unsigned border_len,
+                            const unsigned max_corners) {
+    // Responses on the border don't have 8-neighbors to compare, discard them
+    const unsigned r = border_len + 1;
+
+    const unsigned x = blockDim.x * blockIdx.x + threadIdx.x + r;
+    const unsigned y = blockDim.y * blockIdx.y + threadIdx.y + r;
+
+    if (x < idim1 - r && y < idim0 - r) {
+        const T v = resp_in[x * idim0 + y];
+
+        // Find maximum neighborhood response
+        T max_v;
+        max_v = max_val(resp_in[(x - 1) * idim0 + y - 1],
+                        resp_in[x * idim0 + y - 1]);
+        max_v = max_val(max_v, resp_in[(x + 1) * idim0 + y - 1]);
+        max_v = max_val(max_v, resp_in[(x - 1) * idim0 + y]);
+        max_v = max_val(max_v, resp_in[(x + 1) * idim0 + y]);
+        max_v = max_val(max_v, resp_in[(x - 1) * idim0 + y + 1]);
+        max_v = max_val(max_v, resp_in[(x)*idim0 + y + 1]);
+        max_v = max_val(max_v, resp_in[(x + 1) * idim0 + y + 1]);
+
+        // Stores corner to {x,y,resp}_out if its response is strictly greater
+        // than its 8 neighbors and at least the minimum response
+        if (v > max_v && v >= (T)min_resp) {
+            unsigned idx = atomicAdd(count, 1u);
+            if (idx < max_corners) {
+                x_out[idx]    = (float)x;
+                y_out[idx]    = (float)y;
+                resp_out[idx] = (float)v;
+            }
+        }
+    }
+}
+
+__global__ void keep_corners(float* x_out, float* y_out, float* resp_out,
+                             const float* x_in, const float* y_in,
+                             const float* resp_in, const unsigned* resp_idx,
+                             const unsigned n_corners) {
+    const unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
+
+    // Keep only the first n_corners corners
+    if (f < n_corners) {
+        x_out[f]    = x_in[(unsigned)resp_idx[f]];
+        y_out[f]    = y_in[(unsigned)resp_idx[f]];
+        resp_out[f] = resp_in[f];
+    }
+}
+
+int compare(const void* a, const void* b) { return *(float*)a > *(float*)b; }
+
+template<typename T, typename convAccT>
+void harris(unsigned* corners_out, float** x_out, float** y_out,
+            float** resp_out, CParam<T> in, const unsigned max_corners,
+            const float min_response, const float sigma,
+            const unsigned filter_len, const float k_thr) {
+    // Window filter
+    std::vector<convAccT> h_filter(filter_len);
+    // Decide between a rectangular (box) and a Gaussian window filter
+    if (sigma < 0.5f) {
+        for (unsigned i = 0; i < filter_len; i++)
+            h_filter[i] = (T)1.f / (filter_len);
+    } else {
+        gaussian1D<convAccT>(h_filter.data(), (int)filter_len, sigma);
+    }
+
+    // Copy filter to device object
+    Param<convAccT> filter;
+    filter.dims[0]    = filter_len;
+    filter.strides[0] = 1;
+
+    for (int k = 1; k < 4; k++) {
+        filter.dims[k]    = 1;
+        filter.strides[k] = filter.dims[k - 1] * filter.strides[k - 1];
+    }
+
+    int filter_elem   = filter.strides[3] * filter.dims[3];
+    auto filter_alloc = memAlloc<convAccT>(filter_elem);
+    filter.ptr        = filter_alloc.get();
+    CUDA_CHECK(cudaMemcpyAsync(filter.ptr, h_filter.data(),
+                               filter_elem * sizeof(convAccT),
+                               cudaMemcpyHostToDevice, getActiveStream()));
+
+    const unsigned border_len = filter_len / 2 + 1;
+
+    Param<T> ix, iy;
+    for (dim_t i = 0; i < 4; i++) {
+        ix.dims[i] = iy.dims[i] = in.dims[i];
+        ix.strides[i] = iy.strides[i] = in.strides[i];
+    }
+    auto ix_alloc = memAlloc<T>(ix.dims[3] * ix.strides[3]);
+    auto iy_alloc = memAlloc<T>(iy.dims[3] * iy.strides[3]);
+    ix.ptr        = ix_alloc.get();
+    iy.ptr        = iy_alloc.get();
+
+    // Compute first-order derivatives as gradients
+    gradient<T>(iy, ix, in);
+
+    Param<T> ixx, ixy, iyy;
+    Param<T> ixx_tmp, ixy_tmp, iyy_tmp;
+    for (dim_t i = 0; i < 4; i++) {
+        ixx.dims[i] = ixy.dims[i] = iyy.dims[i] = in.dims[i];
+        ixx_tmp.dims[i] = ixy_tmp.dims[i] = iyy_tmp.dims[i] = in.dims[i];
+        ixx.strides[i] = ixy.strides[i] = iyy.strides[i] = in.strides[i];
+        ixx_tmp.strides[i] = ixy_tmp.strides[i] = iyy_tmp.strides[i] =
+            in.strides[i];
+    }
+    auto ixx_alloc = memAlloc<T>(ixx.dims[3] * ixx.strides[3]);
+    auto ixy_alloc = memAlloc<T>(ixy.dims[3] * ixy.strides[3]);
+    auto iyy_alloc = memAlloc<T>(iyy.dims[3] * iyy.strides[3]);
+    ixx.ptr        = ixx_alloc.get();
+    ixy.ptr        = ixy_alloc.get();
+    iyy.ptr        = iyy_alloc.get();
+
+    // Compute second-order derivatives
+    dim3 threads(THREADS_PER_BLOCK, 1);
+    dim3 blocks(divup(in.dims[3] * in.strides[3], threads.x), 1);
+    CUDA_LAUNCH((second_order_deriv<T>), blocks, threads, ixx.ptr, ixy.ptr,
+                iyy.ptr, in.dims[3] * in.strides[3], ix.ptr, iy.ptr);
+
+    auto ixx_tmp_alloc = memAlloc<T>(ixx_tmp.dims[3] * ixx_tmp.strides[3]);
+    auto ixy_tmp_alloc = memAlloc<T>(ixy_tmp.dims[3] * ixy_tmp.strides[3]);
+    auto iyy_tmp_alloc = memAlloc<T>(iyy_tmp.dims[3] * iyy_tmp.strides[3]);
+    ixx_tmp.ptr        = ixx_tmp_alloc.get();
+    ixy_tmp.ptr        = ixy_tmp_alloc.get();
+    iyy_tmp.ptr        = iyy_tmp_alloc.get();
+
+    // Convolve second-order derivatives with proper window filter
+    convolve2<T, convAccT>(ixx_tmp, CParam<T>(ixx), filter, 0, false);
+    convolve2<T, convAccT>(ixx, CParam<T>(ixx_tmp), filter, 1, false);
+    convolve2<T, convAccT>(ixy_tmp, CParam<T>(ixy), filter, 0, false);
+    convolve2<T, convAccT>(ixy, CParam<T>(ixy_tmp), filter, 1, false);
+    convolve2<T, convAccT>(iyy_tmp, CParam<T>(iyy), filter, 0, false);
+    convolve2<T, convAccT>(iyy, CParam<T>(iyy_tmp), filter, 1, false);
+
+    // Number of corners is not known a priori, limit maximum number of corners
+    // according to image dimensions
+    unsigned corner_lim = in.dims[3] * in.strides[3] * 0.2f;
+
+    auto d_corners_found = memAlloc<unsigned>(1);
+    CUDA_CHECK(cudaMemsetAsync(d_corners_found.get(), 0, sizeof(unsigned),
+                               getActiveStream()));
+
+    auto d_x_corners    = memAlloc<float>(corner_lim);
+    auto d_y_corners    = memAlloc<float>(corner_lim);
+    auto d_resp_corners = memAlloc<float>(corner_lim);
+
+    auto d_responses = memAlloc<T>(in.dims[3] * in.strides[3]);
+
+    // Calculate Harris responses for all pixels
+    threads = dim3(BLOCK_SIZE, BLOCK_SIZE);
+    blocks  = dim3(divup(in.dims[1] - border_len * 2, threads.x),
+                   divup(in.dims[0] - border_len * 2, threads.y));
+    CUDA_LAUNCH((harris_responses<T>), blocks, threads, d_responses.get(),
+                in.dims[0], in.dims[1], ixx.ptr, ixy.ptr, iyy.ptr, k_thr,
+                border_len);
+
+    const float min_r = (max_corners > 0) ? 0.f : min_response;
+
+    // Perform non-maximal suppression
+    CUDA_LAUNCH((non_maximal<T>), blocks, threads, d_x_corners.get(),
+                d_y_corners.get(), d_resp_corners.get(), d_corners_found.get(),
+                in.dims[0], in.dims[1], d_responses.get(), min_r, border_len,
+                corner_lim);
+
+    unsigned corners_found = 0;
+    CUDA_CHECK(cudaMemcpyAsync(&corners_found, d_corners_found.get(),
+                               sizeof(unsigned), cudaMemcpyDeviceToHost,
+                               getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+
+    *corners_out =
+        min(corners_found, (max_corners > 0) ? max_corners : corner_lim);
+
+    if (*corners_out == 0) return;
+
+    if (max_corners > 0 && corners_found > *corners_out) {
+        Param<float> harris_responses;
+        Param<unsigned> harris_idx;
+
+        harris_responses.dims[0] = harris_idx.dims[0] = corners_found;
+        harris_responses.strides[0] = harris_idx.strides[0] = 1;
+
+        for (int k = 1; k < 4; k++) {
+            harris_responses.dims[k] = 1;
+            harris_responses.strides[k] =
+                harris_responses.dims[k - 1] * harris_responses.strides[k - 1];
+            harris_idx.dims[k] = 1;
+            harris_idx.strides[k] =
+                harris_idx.dims[k - 1] * harris_idx.strides[k - 1];
+        }
+
+        int sort_elem = harris_responses.strides[3] * harris_responses.dims[3];
+        harris_responses.ptr = d_resp_corners.get();
+        // Create indices using range
+        auto harris_idx_alloc = memAlloc<unsigned>(sort_elem);
+        harris_idx.ptr        = harris_idx_alloc.get();
+        kernel::range<uint>(harris_idx, 0);
+
+        // Sort Harris responses
+        sort0ByKey<float, uint>(harris_responses, harris_idx, false);
+
+        auto x_out_alloc    = memAlloc<float>(*corners_out);
+        auto y_out_alloc    = memAlloc<float>(*corners_out);
+        auto resp_out_alloc = memAlloc<float>(*corners_out);
+        *x_out              = x_out_alloc.get();
+        *y_out              = y_out_alloc.get();
+        *resp_out           = resp_out_alloc.get();
+
+        // Keep only the first *corners_out corners with the highest Harris
+        // responses
+        threads = dim3(THREADS_PER_BLOCK, 1);
+        blocks  = dim3(divup(*corners_out, threads.x), 1);
+        CUDA_LAUNCH(keep_corners, blocks, threads, *x_out, *y_out, *resp_out,
+                    d_x_corners.get(), d_y_corners.get(), harris_responses.ptr,
+                    harris_idx.ptr, *corners_out);
+
+        x_out_alloc.release();
+        y_out_alloc.release();
+        resp_out_alloc.release();
+    } else if (max_corners == 0 && corners_found < corner_lim) {
+        auto x_out_alloc    = memAlloc<float>(*corners_out);
+        auto y_out_alloc    = memAlloc<float>(*corners_out);
+        auto resp_out_alloc = memAlloc<float>(*corners_out);
+        *x_out              = x_out_alloc.get();
+        *y_out              = y_out_alloc.get();
+        *resp_out           = resp_out_alloc.get();
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            *x_out, d_x_corners.get(), *corners_out * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *y_out, d_y_corners.get(), *corners_out * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *resp_out, d_resp_corners.get(), *corners_out * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+
+        x_out_alloc.release();
+        y_out_alloc.release();
+        resp_out_alloc.release();
+    } else {
+        *x_out    = d_x_corners.release();
+        *y_out    = d_y_corners.release();
+        *resp_out = d_resp_corners.release();
+    }
+    filter_alloc.release();
+}
+
+}  // namespace kernel
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/histogram.cuh b/src/backend/cuda/kernel/histogram.cuh
new file mode 100644
index 0000000000..258dc6ff3c
--- /dev/null
+++ b/src/backend/cuda/kernel/histogram.cuh
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool isLinear>
+__global__ void histogram(Param<uint> out, CParam<T> in, int len, int nbins,
+                          float minval, float maxval, int nBBS) {
+    SharedMemory<uint> shared;
+    uint *shrdMem = shared.getPointer();
+
+    // offset input and output to account for batch ops
+    unsigned b2 = blockIdx.x / nBBS;
+    const data_t<T> *iptr =
+        in.ptr + b2 * in.strides[2] + blockIdx.y * in.strides[3];
+    uint *optr = out.ptr + b2 * out.strides[2] + blockIdx.y * out.strides[3];
+
+    int start = (blockIdx.x - b2 * nBBS) * THRD_LOAD * blockDim.x + threadIdx.x;
+    int end   = min((start + THRD_LOAD * blockDim.x), len);
+    float step = (maxval - minval) / (float)nbins;
+    compute_t<T> minvalT(minval);
+
+    // If nbins > max shared memory allocated, then just use atomicAdd on global
+    // memory
+    bool use_global = nbins > MAX_BINS;
+
+    // Zero-initialize the shared memory bins only when they are used
+    if (!use_global) {
+        for (int i = threadIdx.x; i < nbins; i += blockDim.x) shrdMem[i] = 0;
+        __syncthreads();
+    }
+
+    for (int row = start; row < end; row += blockDim.x) {
+        int idx =
+            isLinear
+                ? row
+                : ((row % in.dims[0]) + (row / in.dims[0]) * in.strides[1]);
+        int bin =
+            (int)(static_cast<float>(compute_t<T>(iptr[idx]) - minvalT) / step);
+        bin = (bin < 0) ? 0 : bin;
+        bin = (bin >= nbins) ? (nbins - 1) : bin;
+
+        if (use_global) {
+            atomicAdd((optr + bin), 1);
+        } else {
+            atomicAdd((shrdMem + bin), 1);
+        }
+    }
+
+    // No need to write to global if use_global is true
+    if (!use_global) {
+        __syncthreads();
+        for (int i = threadIdx.x; i < nbins; i += blockDim.x) {
+            atomicAdd((optr + i), shrdMem[i]);
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/histogram.hpp b/src/backend/cuda/kernel/histogram.hpp
index 6926fd0825..ddc0d7fae0 100644
--- a/src/backend/cuda/kernel/histogram.hpp
+++ b/src/backend/cuda/kernel/histogram.hpp
@@ -7,92 +7,43 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include "shared.hpp"
+#include <nvrtc_kernel_headers/histogram_cuh.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-namespace kernel
-{
+constexpr int MAX_BINS  = 4000;
+constexpr int THREADS_X = 256;
+constexpr int THRD_LOAD = 16;
 
-static const unsigned MAX_BINS  = 4000;
-static const int THREADS_X =  256;
-static const int THRD_LOAD =   16;
+template<typename T>
+void histogram(Param<uint> out, CParam<T> in, int nbins, float minval,
+               float maxval, bool isLinear) {
+    auto histogram = common::getKernel(
+        "arrayfire::cuda::histogram", {{histogram_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(isLinear)),
+        {{DefineValue(MAX_BINS), DefineValue(THRD_LOAD)}});
 
-__forceinline__ __device__ int minimum(int a, int b)
-{
-  return (a < b ? a : b);
-}
-
-template<typename inType, typename outType>
-static __global__
-void histogramKernel(Param<outType> out, CParam<inType> in,
-                     const cfloat *d_minmax, int len,
-                     int nbins, int nBBS)
-{
-    SharedMemory<outType> shared;
-    outType * shrdMem = shared.getPointer();
-
-    // offset input and output to account for batch ops
-    unsigned b2 = blockIdx.x / nBBS;
-    const inType *iptr  =  in.ptr + b2 *  in.strides[2] + blockIdx.y *  in.strides[3];
-    outType      *optr  = out.ptr + b2 * out.strides[2] + blockIdx.y * out.strides[3];
-
-    int start = (blockIdx.x-b2*nBBS) * THRD_LOAD * blockDim.x + threadIdx.x;
-    int end   = minimum((start + THRD_LOAD * blockDim.x), len);
-
-    __shared__ float min;
-    __shared__ float step;
-
-    // offset minmax array to account for batch ops
-    d_minmax += (b2 * blockIdx.x + blockIdx.y);
-
-    if (threadIdx.x == 0) {
-        float2 minmax = *d_minmax;
-        min  = minmax.x;
-        step = (minmax.y-minmax.x) / (float)nbins;
-    }
-
-    for (int i = threadIdx.x; i < nbins; i += blockDim.x)
-        shrdMem[i] = 0;
-    __syncthreads();
-
-    for (int row = start; row < end; row += blockDim.x) {
-        int bin = (int)((iptr[row] - min) / step);
-        bin     = (bin < 0)      ? 0         : bin;
-        bin     = (bin >= nbins) ? (nbins-1) : bin;
-        atomicAdd((shrdMem + bin), 1);
-    }
-    __syncthreads();
-
-    for (int i = threadIdx.x; i < nbins; i += blockDim.x) {
-        atomicAdd((optr + i), shrdMem[i]);
-    }
-}
-
-template<typename inType, typename outType>
-void histogram(Param<outType> out, CParam<inType> in, cfloat *d_minmax, int nbins)
-{
     dim3 threads(kernel::THREADS_X, 1);
 
     int nElems = in.dims[0] * in.dims[1];
-    int blk_x  = divup(nElems, THRD_LOAD*THREADS_X);
+    int blk_x  = divup(nElems, THRD_LOAD * THREADS_X);
 
     dim3 blocks(blk_x * in.dims[2], in.dims[3]);
 
-    int smem_size = nbins * sizeof(outType);
-
-    histogramKernel<inType, outType>
-        <<<blocks, threads, smem_size>>>
-        (out, in, d_minmax, nElems, nbins, blk_x);
+    // If nbins > MAX_BINS, we are using global memory so smem_size can be 0.
+    int smem_size = nbins <= MAX_BINS ? (nbins * sizeof(uint)) : 0;
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), smem_size);
+    histogram(qArgs, out, in, nElems, nbins, minval, maxval, blk_x);
     POST_LAUNCH_CHECK();
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/homography.hpp b/src/backend/cuda/kernel/homography.hpp
new file mode 100644
index 0000000000..72627f84a8
--- /dev/null
+++ b/src/backend/cuda/kernel/homography.hpp
@@ -0,0 +1,618 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <memory.hpp>
+#include "ireduce.hpp"
+#include "reduce.hpp"
+#include "sort.hpp"
+
+#include <cfloat>
+
+namespace arrayfire {
+namespace cuda {
+
+namespace kernel {
+
+template<typename T>
+__device__ T sq(T a) {
+    return a * a;
+}
+
+template<typename T>
+struct EPS {
+    __device__ static T eps() { return FLT_EPSILON; }
+};
+
+template<>
+struct EPS<float> {
+    __device__ static float eps() { return FLT_EPSILON; }
+};
+
+template<>
+struct EPS<double> {
+    __device__ static double eps() { return DBL_EPSILON; }
+};
+
+#define RANSACConfidence 0.99f
+#define LMEDSConfidence 0.99f
+#define LMEDSOutlierRatio 0.4f
+
+extern __shared__ char sh[];
+
+template<typename T>
+__device__ void JacobiSVD(int m, int n) {
+    const int iterations = 30;
+
+    int tid_x = threadIdx.x;
+    int bsz_x = blockDim.x;
+    int tid_y = threadIdx.y;
+    // int gid_y = blockIdx.y * blockDim.y + tid_y;
+
+    __shared__ T s_acc1[256];
+    __shared__ T s_acc2[256];
+
+    __shared__ T s_d[16 * 9];
+
+    T* s_V = (T*)sh;
+    T* s_S = (T*)sh + 16 * 81;
+
+    int doff = tid_y * n;
+    int soff = tid_y * 81;
+
+    if (tid_x < n) {
+        T acc1 = 0;
+        for (int i = 0; i < m; i++) {
+            int stid = soff + tid_x * m + i;
+            T t      = s_S[stid];
+            acc1 += t * t;
+            s_V[stid] = (tid_x == i) ? 1 : 0;
+        }
+        s_d[doff + tid_x] = acc1;
+    }
+    __syncthreads();
+
+    for (int it = 0; it < iterations; it++) {
+        for (int i = 0; i < n - 1; i++) {
+            for (int j = i + 1; j < n; j++) {
+                T* Si = s_S + soff + i * m;
+                T* Sj = s_S + soff + j * m;
+
+                T* Vi = s_V + soff + i * n;
+                T* Vj = s_V + soff + j * n;
+
+                T p = (T)0;
+                for (int k = 0; k < m; k++) p += Si[k] * Sj[k];
+
+                T di = s_d[doff + i];
+                T dj = s_d[doff + j];
+                __syncthreads();
+
+                T c = 0, s = 0;
+                T t0 = 0, t1 = 0;
+                int cond = (fabs(p) > m * EPS<T>::eps() * sqrt(di * dj));
+                T a = 0, b = 0;
+
+                if (cond) {
+                    T y  = di - dj;
+                    T r  = hypot(p * 2, y);
+                    T r2 = r * 2;
+                    if (y >= 0) {
+                        c = sqrt((r + y) / r2);
+                        s = p / (r2 * c);
+                    } else {
+                        s = sqrt((r - y) / r2);
+                        c = p / (r2 * s);
+                    }
+
+                    for (int k = tid_x; k < m; k += bsz_x) {
+                        t0                        = c * Si[k] + s * Sj[k];
+                        t1                        = c * Sj[k] - s * Si[k];
+                        Si[k]                     = t0;
+                        Sj[k]                     = t1;
+                        s_acc1[tid_y * bsz_x + k] = t0 * t0;
+                        s_acc2[tid_y * bsz_x + k] = t1 * t1;
+                    }
+                }
+                __syncthreads();
+
+                if (cond) {
+                    a = 0;
+                    b = 0;
+                    for (int k = 0; k < m; k++) {
+                        a += s_acc1[tid_y * bsz_x + k];
+                        b += s_acc2[tid_y * bsz_x + k];
+                    }
+                    s_d[doff + i] = a;
+                    s_d[doff + j] = b;
+                }
+                __syncthreads();
+
+                if (cond) {
+                    for (int l = tid_x; l < n; l += bsz_x) {
+                        T t0 = Vi[l] * c + Vj[l] * s;
+                        T t1 = Vj[l] * c - Vi[l] * s;
+
+                        Vi[l] = t0;
+                        Vj[l] = t1;
+                    }
+                }
+                __syncthreads();
+            }
+        }
+    }
+}
+
+__device__ bool computeMeanScale(
+    float* x_src_mean, float* y_src_mean, float* x_dst_mean, float* y_dst_mean,
+    float* src_scale, float* dst_scale, float* src_pt_x, float* src_pt_y,
+    float* dst_pt_x, float* dst_pt_y, CParam<float> x_src, CParam<float> y_src,
+    CParam<float> x_dst, CParam<float> y_dst, CParam<float> rnd, int i) {
+    const unsigned ridx = rnd.dims[0] * i;
+    unsigned r[4]       = {(unsigned)rnd.ptr[ridx], (unsigned)rnd.ptr[ridx + 1],
+                           (unsigned)rnd.ptr[ridx + 2], (unsigned)rnd.ptr[ridx + 3]};
+
+    // If one of the points is repeated, the sample is bad, but we still
+    // compute the homography so that all threads reach __syncthreads()
+    bool bad = (r[0] == r[1] || r[0] == r[2] || r[0] == r[3] || r[1] == r[2] ||
+                r[1] == r[3] || r[2] == r[3]);
+
+    for (unsigned j = 0; j < 4; j++) {
+        src_pt_x[j] = x_src.ptr[r[j]];
+        src_pt_y[j] = y_src.ptr[r[j]];
+        dst_pt_x[j] = x_dst.ptr[r[j]];
+        dst_pt_y[j] = y_dst.ptr[r[j]];
+    }
+
+    *x_src_mean = (src_pt_x[0] + src_pt_x[1] + src_pt_x[2] + src_pt_x[3]) / 4.f;
+    *y_src_mean = (src_pt_y[0] + src_pt_y[1] + src_pt_y[2] + src_pt_y[3]) / 4.f;
+    *x_dst_mean = (dst_pt_x[0] + dst_pt_x[1] + dst_pt_x[2] + dst_pt_x[3]) / 4.f;
+    *y_dst_mean = (dst_pt_y[0] + dst_pt_y[1] + dst_pt_y[2] + dst_pt_y[3]) / 4.f;
+
+    float src_var = 0.0f, dst_var = 0.0f;
+    for (unsigned j = 0; j < 4; j++) {
+        src_var +=
+            sq(src_pt_x[j] - *x_src_mean) + sq(src_pt_y[j] - *y_src_mean);
+        dst_var +=
+            sq(dst_pt_x[j] - *x_dst_mean) + sq(dst_pt_y[j] - *y_dst_mean);
+    }
+
+    src_var /= 4.f;
+    dst_var /= 4.f;
+
+    *src_scale = sqrt(2.0f) / sqrt(src_var);
+    *dst_scale = sqrt(2.0f) / sqrt(dst_var);
+
+    return !bad;
+}
+
+#define SSPTR(Z, Y, X) (s_S[(Z)*81 + (Y)*9 + (X)])
+
+template<typename T>
+__global__ void buildLinearSystem(Param<T> H, CParam<float> x_src,
+                                  CParam<float> y_src, CParam<float> x_dst,
+                                  CParam<float> y_dst, CParam<float> rnd,
+                                  const unsigned iterations) {
+    unsigned tid_y = threadIdx.y;
+    unsigned i     = blockIdx.y * blockDim.y + tid_y;
+
+    if (i < iterations) {
+        float x_src_mean, y_src_mean;
+        float x_dst_mean, y_dst_mean;
+        float src_scale, dst_scale;
+        float src_pt_x[4], src_pt_y[4], dst_pt_x[4], dst_pt_y[4];
+
+        computeMeanScale(&x_src_mean, &y_src_mean, &x_dst_mean, &y_dst_mean,
+                         &src_scale, &dst_scale, src_pt_x, src_pt_y, dst_pt_x,
+                         dst_pt_y, x_src, y_src, x_dst, y_dst, rnd, i);
+
+        T* s_V = (T*)sh;
+        T* s_S = (T*)sh + 16 * 81;
+
+        // Compute input matrix
+        for (unsigned j = threadIdx.x; j < 4; j += blockDim.x) {
+            float srcx = (src_pt_x[j] - x_src_mean) * src_scale;
+            float srcy = (src_pt_y[j] - y_src_mean) * src_scale;
+            float dstx = (dst_pt_x[j] - x_dst_mean) * dst_scale;
+            float dsty = (dst_pt_y[j] - y_dst_mean) * dst_scale;
+
+            SSPTR(tid_y, 0, j * 2) = 0.0f;
+            SSPTR(tid_y, 1, j * 2) = 0.0f;
+            SSPTR(tid_y, 2, j * 2) = 0.0f;
+            SSPTR(tid_y, 3, j * 2) = -srcx;
+            SSPTR(tid_y, 4, j * 2) = -srcy;
+            SSPTR(tid_y, 5, j * 2) = -1.0f;
+            SSPTR(tid_y, 6, j * 2) = dsty * srcx;
+            SSPTR(tid_y, 7, j * 2) = dsty * srcy;
+            SSPTR(tid_y, 8, j * 2) = dsty;
+
+            SSPTR(tid_y, 0, j * 2 + 1) = srcx;
+            SSPTR(tid_y, 1, j * 2 + 1) = srcy;
+            SSPTR(tid_y, 2, j * 2 + 1) = 1.0f;
+            SSPTR(tid_y, 3, j * 2 + 1) = 0.0f;
+            SSPTR(tid_y, 4, j * 2 + 1) = 0.0f;
+            SSPTR(tid_y, 5, j * 2 + 1) = 0.0f;
+            SSPTR(tid_y, 6, j * 2 + 1) = -dstx * srcx;
+            SSPTR(tid_y, 7, j * 2 + 1) = -dstx * srcy;
+            SSPTR(tid_y, 8, j * 2 + 1) = -dstx;
+
+            if (j == 3) {  // j never reaches 4; zero the 9th entries here
+                SSPTR(tid_y, 0, 8) = 0.0f;
+                SSPTR(tid_y, 1, 8) = 0.0f;
+                SSPTR(tid_y, 2, 8) = 0.0f;
+                SSPTR(tid_y, 3, 8) = 0.0f;
+                SSPTR(tid_y, 4, 8) = 0.0f;
+                SSPTR(tid_y, 5, 8) = 0.0f;
+                SSPTR(tid_y, 6, 8) = 0.0f;
+                SSPTR(tid_y, 7, 8) = 0.0f;
+                SSPTR(tid_y, 8, 8) = 0.0f;
+            }
+        }
+        __syncthreads();
+
+        JacobiSVD<T>(9, 9);
+
+        T vH[9], H_tmp[9];
+        for (unsigned j = 0; j < 9; j++) vH[j] = s_V[tid_y * 81 + 8 * 9 + j];
+
+        H_tmp[0] =
+            src_scale * x_dst_mean * vH[6] + src_scale * vH[0] / dst_scale;
+        H_tmp[1] =
+            src_scale * x_dst_mean * vH[7] + src_scale * vH[1] / dst_scale;
+        H_tmp[2] = x_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                                 src_scale * x_src_mean * vH[6]) +
+                   (vH[2] - src_scale * y_src_mean * vH[1] -
+                    src_scale * x_src_mean * vH[0]) /
+                       dst_scale;
+
+        H_tmp[3] =
+            src_scale * y_dst_mean * vH[6] + src_scale * vH[3] / dst_scale;
+        H_tmp[4] =
+            src_scale * y_dst_mean * vH[7] + src_scale * vH[4] / dst_scale;
+        H_tmp[5] = y_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                                 src_scale * x_src_mean * vH[6]) +
+                   (vH[5] - src_scale * y_src_mean * vH[4] -
+                    src_scale * x_src_mean * vH[3]) /
+                       dst_scale;
+
+        H_tmp[6] = src_scale * vH[6];
+        H_tmp[7] = src_scale * vH[7];
+        H_tmp[8] = vH[8] - src_scale * y_src_mean * vH[7] -
+                   src_scale * x_src_mean * vH[6];
+
+        const unsigned Hidx = H.dims[0] * i;
+        T* H_ptr            = H.ptr + Hidx;
+        for (int h = 0; h < 9; h++) H_ptr[h] = H_tmp[h];
+    }
+}
+
+#undef SSPTR
+
+// LMedS:
+// http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node25.html
+template<typename T>
+__global__ void computeEvalHomography(
+    Param<unsigned> inliers, Param<unsigned> idx, Param<T> H, Param<float> err,
+    CParam<float> x_src, CParam<float> y_src, CParam<float> x_dst,
+    CParam<float> y_dst, CParam<float> rnd, const unsigned iterations,
+    const unsigned nsamples, const float inlier_thr,
+    const af_homography_type htype) {
+    unsigned bid_x = blockIdx.x;
+    unsigned tid_x = threadIdx.x;
+    unsigned i     = bid_x * blockDim.x + tid_x;
+
+    __shared__ unsigned s_inliers[256];
+    __shared__ unsigned s_idx[256];
+
+    s_inliers[tid_x] = 0;
+    s_idx[tid_x]     = 0;
+    __syncthreads();
+
+    if (i < iterations) {
+        const unsigned Hidx = H.dims[0] * i;
+        T* H_ptr            = H.ptr + Hidx;
+        T H_tmp[9];
+        for (int h = 0; h < 9; h++) H_tmp[h] = H_ptr[h];
+
+        if (htype == AF_HOMOGRAPHY_RANSAC) {
+            // Compute inliers
+            unsigned inliers_count = 0;
+            for (unsigned j = 0; j < nsamples; j++) {
+                float z = H_tmp[6] * x_src.ptr[j] + H_tmp[7] * y_src.ptr[j] +
+                          H_tmp[8];
+                float x = (H_tmp[0] * x_src.ptr[j] + H_tmp[1] * y_src.ptr[j] +
+                           H_tmp[2]) /
+                          z;
+                float y = (H_tmp[3] * x_src.ptr[j] + H_tmp[4] * y_src.ptr[j] +
+                           H_tmp[5]) /
+                          z;
+
+                float dist = sq(x_dst.ptr[j] - x) + sq(y_dst.ptr[j] - y);
+                if (dist < inlier_thr * inlier_thr) inliers_count++;
+            }
+
+            s_inliers[tid_x] = inliers_count;
+            s_idx[tid_x]     = i;
+        } else if (htype == AF_HOMOGRAPHY_LMEDS) {
+            // Compute error
+            for (unsigned j = 0; j < nsamples; j++) {
+                float z = H_tmp[6] * x_src.ptr[j] + H_tmp[7] * y_src.ptr[j] +
+                          H_tmp[8];
+                float x = (H_tmp[0] * x_src.ptr[j] + H_tmp[1] * y_src.ptr[j] +
+                           H_tmp[2]) /
+                          z;
+                float y = (H_tmp[3] * x_src.ptr[j] + H_tmp[4] * y_src.ptr[j] +
+                           H_tmp[5]) /
+                          z;
+
+                float dist = sq(x_dst.ptr[j] - x) + sq(y_dst.ptr[j] - y);
+                err.ptr[i * err.dims[0] + j] = sqrt(dist);
+            }
+        }
+    }
+    __syncthreads();
+
+    if (htype == AF_HOMOGRAPHY_RANSAC) {
+        // Find sample with most inliers
+        for (unsigned tx = 128; tx > 0; tx >>= 1) {
+            if (tid_x < tx) {
+                if (s_inliers[tid_x + tx] > s_inliers[tid_x]) {
+                    s_inliers[tid_x] = s_inliers[tid_x + tx];
+                    s_idx[tid_x]     = s_idx[tid_x + tx];
+                }
+            }
+            __syncthreads();
+        }
+
+        inliers.ptr[bid_x] = s_inliers[0];
+        idx.ptr[bid_x]     = s_idx[0];
+    }
+}
+
+__global__ void computeMedian(Param<float> median, Param<unsigned> idx,
+                              CParam<float> err, const unsigned iterations) {
+    const unsigned tid = threadIdx.x;
+    const unsigned bid = blockIdx.x;
+    const unsigned i   = bid * blockDim.x + threadIdx.x;
+
+    __shared__ float s_median[256];
+    __shared__ unsigned s_idx[256];
+
+    s_median[tid] = FLT_MAX;
+    s_idx[tid]    = 0;
+
+    if (i < iterations) {
+        const int nsamples = err.dims[0];
+        float m            = err.ptr[i * nsamples + nsamples / 2];
+        if (nsamples % 2 == 0)
+            m = (m + err.ptr[i * nsamples + nsamples / 2 - 1]) * 0.5f;
+
+        s_idx[tid]    = i;
+        s_median[tid] = m;
+    }
+    __syncthreads();
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) {
+            if (s_median[tid + t] < s_median[tid]) {
+                s_median[tid] = s_median[tid + t];
+                s_idx[tid]    = s_idx[tid + t];
+            }
+        }
+        __syncthreads();
+    }
+
+    median.ptr[bid] = s_median[0];
+    idx.ptr[bid]    = s_idx[0];
+}
+
+#define DIVUP(A, B) (((A) + (B)-1) / (B))
+
+__global__ void findMinMedian(float* minMedian, unsigned* minIdx,
+                              CParam<float> median, CParam<unsigned> idx) {
+    const int tid = threadIdx.x;
+
+    __shared__ float s_minMedian[256];
+    __shared__ unsigned s_minIdx[256];
+
+    s_minMedian[tid] = FLT_MAX;
+    s_minIdx[tid]    = 0;
+    __syncthreads();
+
+    const int loop = DIVUP(median.dims[0], blockDim.x);
+
+    for (int i = 0; i < loop; i++) {
+        int j = i * blockDim.x + tid;
+        if (j < median.dims[0] && median.ptr[j] < s_minMedian[tid]) {
+            s_minMedian[tid] = median.ptr[j];
+            s_minIdx[tid]    = idx.ptr[j];
+        }
+        __syncthreads();
+    }
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) {
+            if (s_minMedian[tid + t] < s_minMedian[tid]) {
+                s_minMedian[tid] = s_minMedian[tid + t];
+                s_minIdx[tid]    = s_minIdx[tid + t];
+            }
+        }
+        __syncthreads();
+    }
+
+    *minMedian = s_minMedian[0];
+    *minIdx    = s_minIdx[0];
+}
+
+#undef DIVUP
+
+template<typename T>
+__global__ void computeLMedSInliers(Param<unsigned> inliers, CParam<T> H,
+                                    CParam<float> x_src, CParam<float> y_src,
+                                    CParam<float> x_dst, CParam<float> y_dst,
+                                    const float minMedian,
+                                    const unsigned nsamples) {
+    unsigned tid = threadIdx.x;
+    unsigned bid = blockIdx.x;
+    unsigned i   = bid * blockDim.x + tid;
+
+    __shared__ T s_H[9];
+    __shared__ unsigned s_inliers[256];
+
+    s_inliers[tid] = 0;
+    __syncthreads();
+
+    if (tid < 9) s_H[tid] = H.ptr[tid];
+    __syncthreads();
+
+    float sigma = max(
+        1.4826f * (1 + 5.f / (nsamples - 4)) * (float)sqrt(minMedian), 1e-6f);
+    float dist_thr = sq(2.5f * sigma);
+
+    if (i < nsamples) {
+        float z = s_H[6] * x_src.ptr[i] + s_H[7] * y_src.ptr[i] + s_H[8];
+        float x = (s_H[0] * x_src.ptr[i] + s_H[1] * y_src.ptr[i] + s_H[2]) / z;
+        float y = (s_H[3] * x_src.ptr[i] + s_H[4] * y_src.ptr[i] + s_H[5]) / z;
+
+        float dist = sq(x_dst.ptr[i] - x) + sq(y_dst.ptr[i] - y);
+        if (dist <= dist_thr) s_inliers[tid] = 1;
+    }
+    __syncthreads();
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) s_inliers[tid] += s_inliers[tid + t];
+        __syncthreads();
+    }
+
+    inliers.ptr[bid] = s_inliers[0];
+}
+
+template<typename T>
+int computeH(Param<T> bestH, Param<T> H, Param<float> err, CParam<float> x_src,
+             CParam<float> y_src, CParam<float> x_dst, CParam<float> y_dst,
+             CParam<float> rnd, const unsigned iterations,
+             const unsigned nsamples, const float inlier_thr,
+             const af_homography_type htype) {
+    dim3 threads(16, 16);
+    dim3 blocks(1, divup(iterations, threads.y));
+
+    // Build linear system and solve SVD
+    size_t ls_shared_sz = threads.x * 81 * 2 * sizeof(T);
+    CUDA_LAUNCH_SMEM((buildLinearSystem<T>), blocks, threads, ls_shared_sz, H,
+                     x_src, y_src, x_dst, y_dst, rnd, iterations);
+    POST_LAUNCH_CHECK();
+
+    threads = dim3(256);
+    blocks  = dim3(divup(iterations, threads.x));
+
+    // Allocate some temporary buffers
+    dim4 idx_dims(blocks.x);
+    Array<unsigned> idx     = createEmptyArray<unsigned>(idx_dims);
+    Array<unsigned> inliers = createEmptyArray<unsigned>(
+        (htype == AF_HOMOGRAPHY_RANSAC) ? blocks.x
+                                        : divup(nsamples, threads.x));
+
+    // Compute (and for RANSAC, evaluate) homographies
+    CUDA_LAUNCH((computeEvalHomography<T>), blocks, threads, inliers, idx, H,
+                err, x_src, y_src, x_dst, y_dst, rnd, iterations, nsamples,
+                inlier_thr, htype);
+    POST_LAUNCH_CHECK();
+
+    unsigned inliersH, idxH;
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        Array<float> median = createEmptyArray<float>(idx_dims);
+        // TODO: Improve this sorting, if the number of iterations is
+        // sufficiently large, this can be *very* slow
+        kernel::sort0<float>(err, true);
+
+        unsigned minIdx;
+        float minMedian;
+
+        // Compute median of every iteration
+        CUDA_LAUNCH((computeMedian), blocks, threads, median, idx, err,
+                    iterations);
+        POST_LAUNCH_CHECK();
+
+        // Reduce medians, only in case iterations > 256
+        if (blocks.x > 1) {
+            blocks = dim3(1);
+
+            auto finalMedian = memAlloc<float>(1);
+            auto finalIdx    = memAlloc<unsigned>(1);
+
+            CUDA_LAUNCH((findMinMedian), blocks, threads, finalMedian.get(),
+                        finalIdx.get(), median, idx);
+            POST_LAUNCH_CHECK();
+
+            CUDA_CHECK(cudaMemcpyAsync(&minMedian, finalMedian.get(),
+                                       sizeof(float), cudaMemcpyDeviceToHost,
+                                       getActiveStream()));
+            CUDA_CHECK(cudaMemcpyAsync(&minIdx, finalIdx.get(),
+                                       sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                       getActiveStream()));
+            CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        } else {
+            CUDA_CHECK(cudaMemcpyAsync(&minMedian, median.get(), sizeof(float),
+                                       cudaMemcpyDeviceToHost,
+                                       getActiveStream()));
+            CUDA_CHECK(cudaMemcpyAsync(&minIdx, idx.get(), sizeof(unsigned),
+                                       cudaMemcpyDeviceToHost,
+                                       getActiveStream()));
+            CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        }
+
+        // Copy best homography to output
+        CUDA_CHECK(cudaMemcpyAsync(bestH.ptr, H.ptr + minIdx * 9, 9 * sizeof(T),
+                                   cudaMemcpyDeviceToDevice,
+                                   getActiveStream()));
+
+        blocks = dim3(divup(nsamples, threads.x));
+        // minMedian was copied to the host and the stream synchronized
+        // above, so it can be passed by value to the kernel launch below
+
+        CUDA_LAUNCH((computeLMedSInliers<T>), blocks, threads, inliers, bestH,
+                    x_src, y_src, x_dst, y_dst, minMedian, nsamples);
+        POST_LAUNCH_CHECK();
+
+        // Adds up the total number of inliers
+        Array<unsigned> totalInliers = createEmptyArray<unsigned>(1);
+        kernel::reduce<unsigned, unsigned, af_add_t>(totalInliers, inliers, 0,
+                                                     false, 0.0);
+
+        CUDA_CHECK(cudaMemcpyAsync(&inliersH, totalInliers.get(),
+                                   sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+
+    } else if (htype == AF_HOMOGRAPHY_RANSAC) {
+        unsigned blockIdx;
+        inliersH = kernel::ireduce_all<unsigned, af_max_t>(&blockIdx, inliers);
+        // Copy back the sample index of the best homography estimate
+        CUDA_CHECK(cudaMemcpyAsync(&idxH, idx.get() + blockIdx,
+                                   sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(bestH.ptr, H.ptr + idxH * 9, 9 * sizeof(T),
+                                   cudaMemcpyDeviceToDevice,
+                                   getActiveStream()));
+    }
+
+    // sync stream so that all pending copies (including the
+    // device-to-device copy into bestH) complete before returning
+    CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+
+    return (int)inliersH;
+}
+
+}  // namespace kernel
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/hsv_rgb.cuh b/src/backend/cuda/kernel/hsv_rgb.cuh
new file mode 100644
index 0000000000..9ffcf0cc61
--- /dev/null
+++ b/src/backend/cuda/kernel/hsv_rgb.cuh
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool isHSV2RGB>
+__global__ void hsvrgbConverter(Param<T> out, CParam<T> in, int nBBS) {
+    // batch offsets
+    unsigned batchId = blockIdx.x / nBBS;
+    const T* src     = (const T*)in.ptr + (batchId * in.strides[3]);
+    T* dst           = (T*)out.ptr + (batchId * out.strides[3]);
+    // global indices
+    int gx = blockDim.x * (blockIdx.x - batchId * nBBS) + threadIdx.x;
+    int gy = blockDim.y * (blockIdx.y + blockIdx.z * gridDim.y) + threadIdx.y;
+
+    if (gx < out.dims[0] && gy < out.dims[1] && batchId < out.dims[3]) {
+        int oIdx0 = gx + gy * out.strides[1];
+        int oIdx1 = oIdx0 + out.strides[2];
+        int oIdx2 = oIdx1 + out.strides[2];
+
+        int iIdx0 = gx * in.strides[0] + gy * in.strides[1];
+        int iIdx1 = iIdx0 + in.strides[2];
+        int iIdx2 = iIdx1 + in.strides[2];
+
+        if (isHSV2RGB) {
+            T H = src[iIdx0];
+            T S = src[iIdx1];
+            T V = src[iIdx2];
+
+            T R, G, B;
+            R = G = B = 0;
+
+            int i = (int)(H * 6);
+            T f   = H * 6 - i;
+            T p   = V * (1 - S);
+            T q   = V * (1 - f * S);
+            T t   = V * (1 - (1 - f) * S);
+
+            switch (i % 6) {
+                case 0: R = V, G = t, B = p; break;
+                case 1: R = q, G = V, B = p; break;
+                case 2: R = p, G = V, B = t; break;
+                case 3: R = p, G = q, B = V; break;
+                case 4: R = t, G = p, B = V; break;
+                case 5: R = V, G = p, B = q; break;
+            }
+
+            dst[oIdx0] = R;
+            dst[oIdx1] = G;
+            dst[oIdx2] = B;
+        } else {
+            T R     = src[iIdx0];
+            T G     = src[iIdx1];
+            T B     = src[iIdx2];
+            T Cmax  = fmax(fmax(R, G), B);
+            T Cmin  = fmin(fmin(R, G), B);
+            T delta = Cmax - Cmin;
+
+            T H = 0;
+
+            if (Cmax != Cmin) {
+                if (Cmax == R) H = (G - B) / delta + (G < B ? 6 : 0);
+                if (Cmax == G) H = (B - R) / delta + 2;
+                if (Cmax == B) H = (R - G) / delta + 4;
+                H = H / 6.0f;
+            }
+
+            dst[oIdx0] = H;
+            dst[oIdx1] = Cmax == 0.0f ? 0 : delta / Cmax;
+            dst[oIdx2] = Cmax;
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/hsv_rgb.hpp b/src/backend/cuda/kernel/hsv_rgb.hpp
index d9a54d32b9..83cae19e33 100644
--- a/src/backend/cuda/kernel/hsv_rgb.hpp
+++ b/src/backend/cuda/kernel/hsv_rgb.hpp
@@ -7,96 +7,25 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/hsv_rgb_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
-template<typename T, bool isHSV2RGB>
-__global__
-void convert(Param<T> out, CParam<T> in, int nBBS)
-{
-    // batch offsets
-    unsigned batchId= blockIdx.x / nBBS;
-    const T* src    = (const T *) in.ptr + (batchId *  in.strides[3]);
-    T*       dst    = (T *      )out.ptr + (batchId * out.strides[3]);
-    // global indices
-    int gx = blockDim.x * (blockIdx.x-batchId*nBBS) + threadIdx.x;
-    int gy = blockDim.y * blockIdx.y + threadIdx.y;
-
-    if (gx < out.dims[0] && gy < out.dims[1]) {
-
-        int oIdx0 = gx + gy * out.strides[1];
-        int oIdx1 = oIdx0 + out.strides[2];
-        int oIdx2 = oIdx1 + out.strides[2];
-
-        int iIdx0 = gx * in.strides[0] + gy * in.strides[1];
-        int iIdx1 = iIdx0 + in.strides[2];
-        int iIdx2 = iIdx1 + in.strides[2];
-
-        if(isHSV2RGB) {
-            T H = src[iIdx0];
-            T S = src[iIdx1];
-            T V = src[iIdx2];
-
-            T R, G, B;
-            R = G = B = 0;
-
-            int   i = (int)(H * 6);
-            T f = H * 6 - i;
-            T p = V * (1 - S);
-            T q = V * (1 - f * S);
-            T t = V * (1 - (1 - f) * S);
-
-            switch (i % 6) {
-                case 0: R = V, G = t, B = p; break;
-                case 1: R = q, G = V, B = p; break;
-                case 2: R = p, G = V, B = t; break;
-                case 3: R = p, G = q, B = V; break;
-                case 4: R = t, G = p, B = V; break;
-                case 5: R = V, G = p, B = q; break;
-            }
+template<typename T>
+void hsv2rgb_convert(Param<T> out, CParam<T> in, bool isHSV2RGB) {
+    auto hsvrgbConverter = common::getKernel(
+        "arrayfire::cuda::hsvrgbConverter", {{hsv_rgb_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(isHSV2RGB)));
 
-            dst[oIdx0] = R;
-            dst[oIdx1] = G;
-            dst[oIdx2] = B;
-        } else {
-            T R = src[iIdx0];
-            T G = src[iIdx1];
-            T B = src[iIdx2];
-            T Cmax = fmax(fmax(R, G), B);
-            T Cmin = fmin(fmin(R, G), B);
-            T delta= Cmax-Cmin;
-
-            T H = 0;
-
-            if (Cmax!=Cmin) {
-                if (Cmax==R) H = (G-B)/delta + (G<B ? 6 : 0);
-                if (Cmax==G) H = (B-R)/delta + 2;
-                if (Cmax==B) H = (R-G)/delta + 4;
-                H = H / 6.0f;
-            }
-
-            dst[oIdx0] = H;
-            dst[oIdx1] = Cmax==0.0f ? 0 : delta/Cmax;
-            dst[oIdx2] = Cmax;
-        }
-    }
-}
-
-template<typename T, bool isHSV2RGB>
-void hsv2rgb_convert(Param<T> out, CParam<T> in)
-{
     const dim3 threads(THREADS_X, THREADS_Y);
 
     int blk_x = divup(in.dims[0], threads.x);
@@ -104,13 +33,17 @@ void hsv2rgb_convert(Param<T> out, CParam<T> in)
 
     // all images are three channels, so batch
     // parameter would be along 4th dimension
-    dim3 blocks(blk_x*in.dims[3], blk_y);
+    dim3 blocks(blk_x * in.dims[3], blk_y);
 
-    convert<T, isHSV2RGB> <<<blocks, threads>>> (out, in, blk_x);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    hsvrgbConverter(qArgs, out, in, blk_x);
     POST_LAUNCH_CHECK();
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/identity.cuh b/src/backend/cuda/kernel/identity.cuh
new file mode 100644
index 0000000000..e8868f0a9a
--- /dev/null
+++ b/src/backend/cuda/kernel/identity.cuh
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void identity(Param<T> out, int blocks_x, int blocks_y) {
+    const dim_t idz = blockIdx.x / blocks_x;
+    const dim_t idw = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+
+    const dim_t blockIdx_x = blockIdx.x - idz * blocks_x;
+    const dim_t blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - idw * blocks_y;
+
+    const dim_t idx = threadIdx.x + blockIdx_x * blockDim.x;
+    const dim_t idy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (idx >= out.dims[0] || idy >= out.dims[1] || idz >= out.dims[2] ||
+        idw >= out.dims[3])
+        return;
+
+    const T one  = scalar<T>(1);
+    const T zero = scalar<T>(0);
+
+    T *ptr = out.ptr + idz * out.strides[2] + idw * out.strides[3];
+    T val  = (idx == idy) ? one : zero;
+    ptr[idx + idy * out.strides[1]] = val;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/identity.hpp b/src/backend/cuda/kernel/identity.hpp
index 670eba71d4..c3aea2dc8b 100644
--- a/src/backend/cuda/kernel/identity.hpp
+++ b/src/backend/cuda/kernel/identity.hpp
@@ -7,52 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <platform.hpp>
-#include <debug_cuda.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-namespace kernel
-{
-
-    template<typename T>
-    __global__
-    static void identity_kernel(Param<T> out, int blocks_x, int blocks_y)
-    {
-        unsigned idz = blockIdx.x / blocks_x;
-        unsigned idw = blockIdx.y / blocks_y;
-
-        unsigned blockIdx_x = blockIdx.x - idz * blocks_x;
-        unsigned blockIdx_y = blockIdx.y - idw * blocks_y;
-
-        unsigned idx = threadIdx.x + blockIdx_x * blockDim.x;
-        unsigned idy = threadIdx.y + blockIdx_y * blockDim.y;
-
-        if(idx >= out.dims[0] ||
-           idy >= out.dims[1] ||
-           idz >= out.dims[2] ||
-           idw >= out.dims[3])
-            return;
-
-        T *ptr = out.ptr + idz * out.strides[2] + idw * out.strides[3];
-        T val = (idx == idy) ? scalar<T>(1) : scalar<T>(0);
-        ptr[idx + idy * out.strides[1]] = val;
-    }
-
-    template<typename T>
-    static void identity(Param<T> out)
-    {
-        dim3 threads(32, 8);
-        int blocks_x = divup(out.dims[0], threads.x);
-        int blocks_y = divup(out.dims[1], threads.y);
-        dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
-
-        identity_kernel<T> <<<blocks, threads>>> (out, blocks_x, blocks_y);
-        POST_LAUNCH_CHECK();
-    }
-}
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/identity_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+void identity(Param<T> out) {
+    auto identity =
+        common::getKernel("arrayfire::cuda::identity", {{identity_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
+
+    dim3 threads(32, 8);
+    int blocks_x = divup(out.dims[0], threads.x);
+    int blocks_y = divup(out.dims[1], threads.y);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    identity(qArgs, out, blocks_x, blocks_y);
+    POST_LAUNCH_CHECK();
 }
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/iir.cuh b/src/backend/cuda/kernel/iir.cuh
new file mode 100644
index 0000000000..e5b195f77a
--- /dev/null
+++ b/src/backend/cuda/kernel/iir.cuh
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool batch_a>
+__global__ void iir(Param<T> y, CParam<T> c, CParam<T> a, const int blocks_y) {
+    __shared__ T s_z[MAX_A_SIZE];
+    __shared__ T s_a[MAX_A_SIZE];
+    __shared__ T s_y;
+
+    const int idz = blockIdx.x;
+    const int idw = blockIdx.y / blocks_y;
+    const int idy = blockIdx.y - idw * blocks_y;
+
+    const int tx    = threadIdx.x;
+    const int num_a = a.dims[0];
+
+    int y_off = idw * y.strides[3] + idz * y.strides[2] + idy * y.strides[1];
+    int c_off = idw * c.strides[3] + idz * c.strides[2] + idy * c.strides[1];
+    int a_off = 0;
+
+    if (batch_a)
+        a_off = idw * a.strides[3] + idz * a.strides[2] + idy * a.strides[1];
+
+    T *d_y           = y.ptr + y_off;
+    const T *d_c     = c.ptr + c_off;
+    const T *d_a     = a.ptr + a_off;
+    const int repeat = (num_a + blockDim.x - 1) / blockDim.x;
+
+    for (int ii = 0; ii < MAX_A_SIZE / blockDim.x; ii++) {
+        int id  = ii * blockDim.x + tx;
+        s_z[id] = scalar<T>(0);
+        s_a[id] = (id < num_a) ? d_a[id] : scalar<T>(0);
+    }
+    __syncthreads();
+
+    for (int i = 0; i < y.dims[0]; i++) {
+        if (tx == 0) {
+            s_y    = (d_c[i] + s_z[0]) / s_a[0];
+            d_y[i] = s_y;
+        }
+        __syncthreads();
+
+#pragma unroll
+        for (int ii = 0; ii < repeat; ii++) {
+            int id = ii * blockDim.x + tx + 1;
+
+            T z = s_z[id] - s_a[id] * s_y;
+            __syncthreads();
+
+            s_z[id - 1] = z;
+            __syncthreads();
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/iir.hpp b/src/backend/cuda/kernel/iir.hpp
index 9c21057745..a17d205fd8 100644
--- a/src/backend/cuda/kernel/iir.hpp
+++ b/src/backend/cuda/kernel/iir.hpp
@@ -7,90 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-
-    namespace kernel
-    {
-
-        static const int MAX_A_SIZE = 1024;
-
-        template<typename T, bool batch_a>
-        __global__
-        void iir_kernel(Param<T> y, CParam<T> c, CParam<T> a,
-                        const int blocks_y)
-        {
-            __shared__ T s_z[MAX_A_SIZE];
-            __shared__ T s_a[MAX_A_SIZE];
-            __shared__ T s_y;
-
-            const int idz = blockIdx.x;
-            const int idw = blockIdx.y / blocks_y;
-            const int idy = blockIdx.y - idw * blocks_y;
-
-            const int tx = threadIdx.x;
-            const int num_a = a.dims[0];
+#include <nvrtc_kernel_headers/iir_cuh.hpp>
 
-            int y_off = idw * y.strides[3] + idz * y.strides[2] + idy * y.strides[1];
-            int c_off = idw * c.strides[3] + idz * c.strides[2] + idy * c.strides[1];
-            int a_off = 0;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            if (batch_a) a_off = idw * a.strides[3] + idz * a.strides[2] + idy * a.strides[1];
+template<typename T, bool batch_a>
+void iir(Param<T> y, CParam<T> c, CParam<T> a) {
+    constexpr int MAX_A_SIZE = 1024;
 
-            T *d_y = y.ptr + y_off;
-            const T *d_c = c.ptr + c_off;
-            const T *d_a = a.ptr + a_off;
-            const int repeat = (num_a + blockDim.x - 1) / blockDim.x;
+    auto iir = common::getKernel(
+        "arrayfire::cuda::iir", {{iir_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(batch_a)),
+        {{DefineValue(MAX_A_SIZE)}});
 
-            for (int ii = 0; ii < MAX_A_SIZE / blockDim.x; ii++) {
-                int id = ii * blockDim.x + tx;
-                s_z[id] = scalar<T>(0);
-                s_a[id] = (id < num_a) ? d_a[id] : scalar<T>(0);
-            }
-            __syncthreads();
+    const int blocks_y = y.dims[1];
+    const int blocks_x = y.dims[2];
 
+    dim3 blocks(blocks_x, blocks_y * y.dims[3]);
 
-            for (int i = 0; i < y.dims[0]; i++) {
-                if (tx == 0) {
-                    s_y = (d_c[i] + s_z[0]) / s_a[0];
-                    d_y[i] = s_y;
-                }
-                __syncthreads();
+    int threads = 256;
+    while (threads > y.dims[0] && threads > 32) threads /= 2;
 
-#pragma unroll
-                for (int ii = 0; ii < repeat; ii++) {
-                    int id = ii * blockDim.x + tx + 1;
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-                    T z = s_z[id] - s_a[id] * s_y;
-                    __syncthreads();
-
-                    s_z[id - 1] = z;
-                    __syncthreads();
-                }
-            }
-        }
-
-        template<typename T, bool batch_a>
-        void iir(Param<T> y, CParam<T> c, CParam<T> a)
-        {
-            const int blocks_y = y.dims[1];
-            const int blocks_x = y.dims[2];
-
-            dim3 blocks(blocks_x,
-                        blocks_y * y.dims[3]);
-
-            int threads = 256;
-            while (threads > y.dims[0] && threads > 32) threads /= 2;
-
-            (iir_kernel<T, batch_a>)<<<blocks, threads>>>(y, c, a, blocks_y);
-        }
-
-    }
+    iir(qArgs, y, c, a, blocks_y);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/index.cuh b/src/backend/cuda/kernel/index.cuh
new file mode 100644
index 0000000000..968e9ae0c6
--- /dev/null
+++ b/src/backend/cuda/kernel/index.cuh
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <assign_kernel_param.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void index(Param<T> out, CParam<T> in, const IndexKernelParam p,
+                      const int nBBS0, const int nBBS1) {
+    // retrieve index pointers
+    // these can be 0 where af_array index is not used
+    const uint* ptr0 = p.ptr[0];
+    const uint* ptr1 = p.ptr[1];
+    const uint* ptr2 = p.ptr[2];
+    const uint* ptr3 = p.ptr[3];
+    // retrieve booleans that tell us which index to use
+    const bool s0 = p.isSeq[0];
+    const bool s1 = p.isSeq[1];
+    const bool s2 = p.isSeq[2];
+    const bool s3 = p.isSeq[3];
+
+    const int gz = blockIdx.x / nBBS0;
+    const int gx = blockDim.x * (blockIdx.x - gz * nBBS0) + threadIdx.x;
+
+    const int gw = (blockIdx.y + blockIdx.z * gridDim.y) / nBBS1;
+    const int gy =
+        blockDim.y * ((blockIdx.y + blockIdx.z * gridDim.y) - gw * nBBS1) +
+        threadIdx.y;
+
+    if (gx < out.dims[0] && gy < out.dims[1] && gz < out.dims[2] &&
+        gw < out.dims[3]) {
+        // calculate pointer offsets for input
+        int i =
+            p.strds[0] *
+            trimIndex(s0 ? gx * p.steps[0] + p.offs[0] : ptr0[gx], in.dims[0]);
+        int j =
+            p.strds[1] *
+            trimIndex(s1 ? gy * p.steps[1] + p.offs[1] : ptr1[gy], in.dims[1]);
+        int k =
+            p.strds[2] *
+            trimIndex(s2 ? gz * p.steps[2] + p.offs[2] : ptr2[gz], in.dims[2]);
+        int l =
+            p.strds[3] *
+            trimIndex(s3 ? gw * p.steps[3] + p.offs[3] : ptr3[gw], in.dims[3]);
+        // offset input and output pointers
+        const T* src = (const T*)in.ptr + (i + j + k + l);
+        T* dst = (T*)out.ptr + (gx * out.strides[0] + gy * out.strides[1] +
+                                gz * out.strides[2] + gw * out.strides[3]);
+        // set the output
+        dst[0] = src[0];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/index.hpp b/src/backend/cuda/kernel/index.hpp
index 11c655c495..d2a4d06d37 100644
--- a/src/backend/cuda/kernel/index.hpp
+++ b/src/backend/cuda/kernel/index.hpp
@@ -7,81 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <math.hpp>
-#include <utility.hpp>
+#include <assign_kernel_param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/index_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 32;
-static const int THREADS_Y =  8;
-
-typedef struct {
-    int  offs[4];
-    int strds[4];
-    bool     isSeq[4];
-    uint*      ptr[4];
-} IndexKernelParam_t;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 template<typename T>
-__global__
-void indexKernel(Param<T> out, CParam<T> in, const IndexKernelParam_t p,
-                 const int nBBS0, const int nBBS1)
-{
-    // retrieve index pointers
-    // these can be 0 where af_array index is not used
-    const uint* ptr0 = p.ptr[0];
-    const uint* ptr1 = p.ptr[1];
-    const uint* ptr2 = p.ptr[2];
-    const uint* ptr3 = p.ptr[3];
-    // retrive booleans that tell us which index to use
-    const bool s0 = p.isSeq[0];
-    const bool s1 = p.isSeq[1];
-    const bool s2 = p.isSeq[2];
-    const bool s3 = p.isSeq[3];
-
-    const int gz = blockIdx.x/nBBS0;
-    const int gw = blockIdx.y/nBBS1;
-    const int gx = blockDim.x * (blockIdx.x - gz*nBBS0) + threadIdx.x;
-    const int gy = blockDim.y * (blockIdx.y - gw*nBBS1) + threadIdx.y;
-
-    if (gx<out.dims[0] && gy<out.dims[1] && gz<out.dims[2] && gw<out.dims[3]) {
-        // calculate pointer offsets for input
-        int i = p.strds[0] * trimIndex(s0 ? gx+p.offs[0] : ptr0[gx], in.dims[0]);
-        int j = p.strds[1] * trimIndex(s1 ? gy+p.offs[1] : ptr1[gy], in.dims[1]);
-        int k = p.strds[2] * trimIndex(s2 ? gz+p.offs[2] : ptr2[gz], in.dims[2]);
-        int l = p.strds[3] * trimIndex(s3 ? gw+p.offs[3] : ptr3[gw], in.dims[3]);
-        // offset input and output pointers
-        const T *src = (const T*)in.ptr + (i+j+k+l);
-        T *dst = (T*)out.ptr +(gx*out.strides[0]+gy*out.strides[1]+ gz*out.strides[2]+gw*out.strides[3]);
-        // set the output
-        dst[0] = src[0];
+void index(Param<T> out, CParam<T> in, const IndexKernelParam& p) {
+    auto index = common::getKernel("arrayfire::cuda::index", {{index_cuh_src}},
+                                   TemplateArgs(TemplateTypename<T>()));
+    dim3 threads;
+    switch (out.dims[1]) {
+        case 1: threads.y = 1; break;
+        case 2: threads.y = 2; break;
+        case 3:
+        case 4: threads.y = 4; break;
+        default: threads.y = 8; break;
     }
-}
-
-template<typename T>
-void index(Param<T> out, CParam<T> in, const IndexKernelParam_t& p)
-{
-    const dim3 threads(THREADS_X, THREADS_Y);
+    threads.x = static_cast<unsigned>(256.f / threads.y);
 
     int blks_x = divup(out.dims[0], threads.x);
     int blks_y = divup(out.dims[1], threads.y);
 
-    dim3 blocks(blks_x*out.dims[2], blks_y*out.dims[3]);
+    dim3 blocks(blks_x * out.dims[2], blks_y * out.dims[3]);
 
-    indexKernel<T> <<<blocks, threads>>> (out, in, p, blks_x, blks_y);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-    POST_LAUNCH_CHECK();
-}
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
+    index(qArgs, out, in, p, blks_x, blks_y);
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/interp.hpp b/src/backend/cuda/kernel/interp.hpp
new file mode 100644
index 0000000000..39fb7a77ff
--- /dev/null
+++ b/src/backend/cuda/kernel/interp.hpp
@@ -0,0 +1,332 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+struct itype_t {
+    typedef float wtype;
+    typedef float vtype;
+};
+
+template<>
+struct itype_t<double> {
+    typedef double wtype;
+    typedef double vtype;
+};
+
+template<>
+struct itype_t<cfloat> {
+    typedef float wtype;
+    typedef cfloat vtype;
+};
+
+template<>
+struct itype_t<cdouble> {
+    typedef double wtype;
+    typedef cdouble vtype;
+};
+
+template<typename Ty, typename Tp>
+__device__ Ty linearInterpFunc(Ty val[2], Tp ratio) {
+    return (1 - ratio) * val[0] + ratio * val[1];
+}
+
+template<typename Ty, typename Tp>
+__device__ Ty bilinearInterpFunc(Ty val[2][2], Tp xratio, Tp yratio) {
+    Ty res[2];
+    res[0] = linearInterpFunc(val[0], xratio);
+    res[1] = linearInterpFunc(val[1], xratio);
+    return linearInterpFunc(res, yratio);
+}
+
+template<typename Ty, typename Tp>
+__device__ inline static Ty cubicInterpFunc(Ty val[4], Tp xratio, bool spline) {
+    Ty a0, a1, a2, a3;
+    if (spline) {
+        a0 = scalar<Ty>(-0.5) * val[0] + scalar<Ty>(1.5) * val[1] +
+             scalar<Ty>(-1.5) * val[2] + scalar<Ty>(0.5) * val[3];
+
+        a1 = scalar<Ty>(1.0) * val[0] + scalar<Ty>(-2.5) * val[1] +
+             scalar<Ty>(2.0) * val[2] + scalar<Ty>(-0.5) * val[3];
+
+        a2 = scalar<Ty>(-0.5) * val[0] + scalar<Ty>(0.5) * val[2];
+
+        a3 = val[1];
+    } else {
+        a0 = val[3] - val[2] - val[0] + val[1];
+        a1 = val[0] - val[1] - a0;
+        a2 = val[2] - val[0];
+        a3 = val[1];
+    }
+
+    Tp xratio2 = xratio * xratio;
+    Tp xratio3 = xratio2 * xratio;
+
+    return a0 * xratio3 + a1 * xratio2 + a2 * xratio + a3;
+}
+
+template<typename Ty, typename Tp>
+__device__ inline static Ty bicubicInterpFunc(Ty val[4][4], Tp xratio,
+                                              Tp yratio, bool spline) {
+    Ty res[4];
+    res[0] = cubicInterpFunc(val[0], xratio, spline);
+    res[1] = cubicInterpFunc(val[1], xratio, spline);
+    res[2] = cubicInterpFunc(val[2], xratio, spline);
+    res[3] = cubicInterpFunc(val[3], xratio, spline);
+    return cubicInterpFunc(res, yratio, spline);
+}
+
+template<typename Ty, typename Tp, int xdim, int order>
+struct Interp1 {};
+
+template<typename Ty, typename Tp, int xdim>
+struct Interp1<Ty, Tp, xdim, 1> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 1) {
+        Ty zero = scalar<Ty>(0);
+
+        const int x_lim    = in.dims[xdim];
+        const int x_stride = in.strides[xdim];
+
+        int xid   = (method == AF_INTERP_LOWER ? floor(x) : round(x));
+        bool cond = xid >= 0 && xid < x_lim;
+        if (clamp) xid = max(0, min(xid, x_lim - 1));
+
+        const int idx = ioff + xid * x_stride;
+
+        for (int n = 0; n < batch; n++) {
+            Ty outval                                  = (cond || clamp)
+                                                             ? in.ptr[idx + n * in.strides[batch_dim]]
+                                                             : zero;
+            out.ptr[ooff + n * out.strides[batch_dim]] = outval;
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int xdim>
+struct Interp1<Ty, Tp, xdim, 2> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 1) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = floor(x);    // lower grid point
+        const WT off_x   = x - grid_x;  // fractional offset
+
+        const int x_lim    = in.dims[xdim];
+        const int x_stride = in.strides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[2] = {true, grid_x + 1 < x_lim};
+        int offx[2]  = {0, cond[1] ? 1 : 0};
+        WT ratio     = off_x;
+        if (method == AF_INTERP_LINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            ratio = (1 - cos(ratio * CUDART_PI)) / 2;
+        }
+
+        Ty zero = scalar<Ty>(0);
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * in.strides[batch_dim];
+            VT val[2] = {
+                (clamp || cond[0]) ? in.ptr[idx_n + offx[0] * x_stride] : zero,
+                (clamp || cond[1]) ? in.ptr[idx_n + offx[1] * x_stride] : zero};
+            out.ptr[ooff + n * out.strides[batch_dim]] =
+                linearInterpFunc(val, ratio);
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int xdim>
+struct Interp1<Ty, Tp, xdim, 3> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 1) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = floor(x);    // lower grid point
+        const WT off_x   = x - grid_x;  // fractional offset
+
+        const int x_lim    = in.dims[xdim];
+        const int x_stride = in.strides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                        grid_x + 2 < x_lim};
+        int offx[4]  = {cond[0] ? -1 : 0, 0, cond[2] ? 1 : 0,
+                       cond[3] ? 2 : (cond[2] ? 1 : 0)};
+
+        bool spline = method == AF_INTERP_CUBIC_SPLINE;
+        Ty zero     = scalar<Ty>(0);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * in.strides[batch_dim];
+            VT val[4];
+            for (int i = 0; i < 4; i++) {
+                val[i] = (clamp || cond[i]) ? in.ptr[idx_n + offx[i] * x_stride]
+                                            : zero;
+            }
+            out.ptr[ooff + n * out.strides[batch_dim]] =
+                cubicInterpFunc(val, off_x, spline);
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int xdim, int ydim, int order>
+struct Interp2 {};
+
+template<typename Ty, typename Tp, int xdim, int ydim>
+struct Interp2<Ty, Tp, xdim, ydim, 1> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, Tp y, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 2) {
+        int xid = (method == AF_INTERP_LOWER ? floor(x) : round(x));
+        int yid = (method == AF_INTERP_LOWER ? floor(y) : round(y));
+
+        const int x_lim    = in.dims[xdim];
+        const int y_lim    = in.dims[ydim];
+        const int x_stride = in.strides[xdim];
+        const int y_stride = in.strides[ydim];
+
+        if (clamp) {
+            xid = max(0, min(xid, x_lim - 1));
+            yid = max(0, min(yid, y_lim - 1));
+        }
+
+        const int idx = ioff + yid * y_stride + xid * x_stride;
+
+        bool condX = xid >= 0 && xid < x_lim;
+        bool condY = yid >= 0 && yid < y_lim;
+
+        Ty zero   = scalar<Ty>(0);
+        bool cond = condX && condY;
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * in.strides[batch_dim];
+            Ty val    = (clamp || cond) ? in.ptr[idx_n] : zero;
+            out.ptr[ooff + n * out.strides[batch_dim]] = val;
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int xdim, int ydim>
+struct Interp2<Ty, Tp, xdim, ydim, 2> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, Tp y, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 2) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = floor(x);
+        const WT off_x   = x - grid_x;
+
+        const int grid_y = floor(y);
+        const WT off_y   = y - grid_y;
+
+        const int x_lim    = in.dims[xdim];
+        const int y_lim    = in.dims[ydim];
+        const int x_stride = in.strides[xdim];
+        const int y_stride = in.strides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        bool condX[2] = {true, x + 1 < x_lim};
+        bool condY[2] = {true, y + 1 < y_lim};
+        int offx[2]   = {0, condX[1] ? 1 : 0};
+        int offy[2]   = {0, condY[1] ? 1 : 0};
+
+        WT xratio = off_x, yratio = off_y;
+        if (method == AF_INTERP_LINEAR_COSINE ||
+            method == AF_INTERP_BILINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            xratio = (1 - cos(xratio * CUDART_PI)) / 2;
+            yratio = (1 - cos(yratio * CUDART_PI)) / 2;
+        }
+
+        Ty zero = scalar<Ty>(0);
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * in.strides[batch_dim];
+            VT val[2][2];
+            for (int j = 0; j < 2; j++) {
+                int ioff_j = idx_n + offy[j] * y_stride;
+                for (int i = 0; i < 2; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] =
+                        (cond) ? in.ptr[ioff_j + offx[i] * x_stride] : zero;
+                }
+            }
+            out.ptr[ooff + n * out.strides[batch_dim]] =
+                bilinearInterpFunc(val, xratio, yratio);
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int xdim, int ydim>
+struct Interp2<Ty, Tp, xdim, ydim, 3> {
+    __device__ void operator()(Param<Ty> out, int ooff, CParam<Ty> in, int ioff,
+                               Tp x, Tp y, af::interpType method, int batch,
+                               bool clamp, int batch_dim = 2) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = floor(x);
+        const WT off_x   = x - grid_x;
+
+        const int grid_y = floor(y);
+        const WT off_y   = y - grid_y;
+
+        const int x_lim    = in.dims[xdim];
+        const int y_lim    = in.dims[ydim];
+        const int x_stride = in.strides[xdim];
+        const int y_stride = in.strides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        // used for setting values at boundaries
+        bool condX[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                         grid_x + 2 < x_lim};
+        bool condY[4] = {grid_y - 1 >= 0, true, grid_y + 1 < y_lim,
+                         grid_y + 2 < y_lim};
+        int offX[4]   = {condX[0] ? -1 : 0, 0, condX[2] ? 1 : 0,
+                       condX[3] ? 2 : (condX[2] ? 1 : 0)};
+        int offY[4]   = {condY[0] ? -1 : 0, 0, condY[2] ? 1 : 0,
+                       condY[3] ? 2 : (condY[2] ? 1 : 0)};
+
+        // for bicubic interpolation, work with 4x4 val at a time
+        Ty zero     = scalar<Ty>(0);
+        bool spline = (method == AF_INTERP_CUBIC_SPLINE ||
+                       method == AF_INTERP_BICUBIC_SPLINE);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * in.strides[batch_dim];
+            VT val[4][4];
+#pragma unroll
+            for (int j = 0; j < 4; j++) {
+                int ioff_j = idx_n + offY[j] * y_stride;
+#pragma unroll
+                for (int i = 0; i < 4; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] =
+                        (cond) ? in.ptr[ioff_j + offX[i] * x_stride] : zero;
+                }
+            }
+
+            out.ptr[ooff + n * out.strides[batch_dim]] =
+                bicubicInterpFunc(val, off_x, off_y, spline);
+        }
+    }
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/iota.cuh b/src/backend/cuda/kernel/iota.cuh
new file mode 100644
index 0000000000..ce0ec56168
--- /dev/null
+++ b/src/backend/cuda/kernel/iota.cuh
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void iota(Param<T> out, const int s0, const int s1, const int s2,
+                     const int s3, const int blocksPerMatX,
+                     const int blocksPerMatY) {
+    const int oz         = blockIdx.x / blocksPerMatX;
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int xx         = threadIdx.x + blockIdx_x * blockDim.x;
+
+    const int ow = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - ow * blocksPerMatY;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (xx >= out.dims[0] || yy >= out.dims[1] || oz >= out.dims[2] ||
+        ow >= out.dims[3])
+        return;
+
+    const int ozw = ow * out.strides[3] + oz * out.strides[2];
+
+    dim_t val = (ow % s3) * s2 * s1 * s0;
+    val += (oz % s2) * s1 * s0;
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    for (int oy = yy; oy < out.dims[1]; oy += incy) {
+        int oyzw   = ozw + oy * out.strides[1];
+        dim_t valY = val + (oy % s1) * s0;
+        for (int ox = xx; ox < out.dims[0]; ox += incx) {
+            int oidx = oyzw + ox;
+
+            out.ptr[oidx] = valY + (ox % s0);
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/iota.hpp b/src/backend/cuda/kernel/iota.hpp
index c29c7c5fcd..1007ec2f1e 100644
--- a/src/backend/cuda/kernel/iota.hpp
+++ b/src/backend/cuda/kernel/iota.hpp
@@ -7,81 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/iota_cuh.hpp>
+#include <af/dim4.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 512;
-        static const unsigned TILEY = 32;
-
-        template<typename T>
-        __global__
-        void iota_kernel(Param<T> out,
-                         const int s0, const int s1, const int s2, const int s3,
-                         const int t0, const int t1, const int t2, const int t3,
-                         const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            if(xx >= out.dims[0] ||
-               yy >= out.dims[1] ||
-               oz >= out.dims[2] ||
-               ow >= out.dims[3])
-                return;
+template<typename T>
+void iota(Param<T> out, const af::dim4 &sdims) {
+    constexpr unsigned IOTA_TX = 32;
+    constexpr unsigned IOTA_TY = 8;
+    constexpr unsigned TILEX   = 512;
+    constexpr unsigned TILEY   = 32;
 
-            const int ozw = ow * out.strides[3] + oz * out.strides[2];
+    auto iota = common::getKernel("arrayfire::cuda::iota", {{iota_cuh_src}},
+                                  TemplateArgs(TemplateTypename<T>()));
 
-            T val = (ow % s3) * s2 * s1 * s0;
-            val  += (oz % s2) * s1 * s0;
+    dim3 threads(IOTA_TX, IOTA_TY, 1);
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+    int blocksPerMatX = divup(out.dims[0], TILEX);
+    int blocksPerMatY = divup(out.dims[1], TILEY);
 
-            for(int oy = yy; oy < out.dims[1]; oy += incy) {
-                int oyzw = ozw + oy * out.strides[1];
-                T valY = val + (oy % s1) * s0;
-                for(int ox = xx; ox < out.dims[0]; ox += incx) {
-                    int oidx = oyzw + ox;
+    dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3], 1);
 
-                    out.ptr[oidx] = valY + (ox % s0);
-                }
-            }
-        }
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void iota(Param<T> out, const dim4 &sdims, const dim4 &tdims)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(out.dims[0], TILEX);
-            int blocksPerMatY = divup(out.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
-
-            iota_kernel<T><<<blocks, threads>>>(out, sdims[0], sdims[1], sdims[2], sdims[3],
-                        tdims[0], tdims[1], tdims[2], tdims[3], blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    iota(qArgs, out, sdims[0], sdims[1], sdims[2], sdims[3], blocksPerMatX,
+         blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/ireduce.cuh b/src/backend/cuda/kernel/ireduce.cuh
new file mode 100644
index 0000000000..6c59a360b1
--- /dev/null
+++ b/src/backend/cuda/kernel/ireduce.cuh
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <minmax_op.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, af_op_t op, uint dim, bool is_first, uint DIMY>
+__global__ static void ireduceDim(Param<T> out, uint *olptr, CParam<T> in,
+                                  const uint *ilptr, uint blocks_x,
+                                  uint blocks_y, uint offset_dim,
+                                  CParam<uint> rlen) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * THREADS_X + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint xid = blockIdx_x * blockDim.x + tidx;
+    const uint yid = blockIdx_y;  // yid of output; updated for input later.
+
+    uint ids[4] = {xid, yid, zid, wid};
+
+    const T *iptr = in.ptr;
+    T *optr       = out.ptr;
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before
+    // offsetting in
+    bool rlen_valid = (ids[0] < rlen.dims[0]) && (ids[1] < rlen.dims[1]) &&
+                      (ids[2] < rlen.dims[2]) && (ids[3] < rlen.dims[3]);
+    const uint *rlenptr = (rlen.ptr && rlen_valid)
+                              ? rlen.ptr + ids[3] * rlen.strides[3] +
+                                    ids[2] * rlen.strides[2] +
+                                    ids[1] * rlen.strides[1] + ids[0]
+                              : nullptr;
+
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    olptr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+             ids[1] * out.strides[1] + ids[0];
+
+    const uint blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y + tidy;
+    iptr += ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+            ids[1] * in.strides[1] + ids[0];
+    if (!is_first)
+        ilptr += ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+                 ids[1] * in.strides[1] + ids[0];
+    const uint id_dim_in = ids[dim];
+
+    const uint istride_dim = in.strides[dim];
+
+    bool is_valid = (ids[0] < in.dims[0]) && (ids[1] < in.dims[1]) &&
+                    (ids[2] < in.dims[2]) && (ids[3] < in.dims[3]);
+
+    T val    = common::Binary<T, op>::init();
+    uint idx = id_dim_in;
+
+    uint lim = (rlenptr) ? *rlenptr : in.dims[dim];
+    lim      = (is_first) ? min((uint)in.dims[dim], lim) : lim;
+    bool within_ragged_bounds =
+        (is_first) ? (idx < lim)
+                   : ((rlenptr) ? ((is_valid) && (*ilptr < lim)) : true);
+    if (is_valid && id_dim_in < in.dims[dim] && within_ragged_bounds) {
+        val = *iptr;
+        if (!is_first) idx = *ilptr;
+    }
+
+    MinMaxOp<op, T> Op(val, idx);
+
+    const uint id_dim_in_start = id_dim_in + offset_dim * blockDim.y;
+
+    __shared__ T s_val[THREADS_X * DIMY];
+    __shared__ uint s_idx[THREADS_X * DIMY];
+
+    for (int id = id_dim_in_start; is_valid && (id < lim);
+         id += offset_dim * blockDim.y) {
+        iptr = iptr + offset_dim * blockDim.y * istride_dim;
+        if (!is_first) {
+            ilptr = ilptr + offset_dim * blockDim.y * istride_dim;
+            Op(*iptr, *ilptr);
+        } else {
+            Op(*iptr, id);
+        }
+    }
+
+    s_val[tid] = Op.m_val;
+    s_idx[tid] = Op.m_idx;
+
+    T *s_vptr    = s_val + tid;
+    uint *s_iptr = s_idx + tid;
+    __syncthreads();
+
+    if (DIMY == 8) {
+        if (tidy < 4) {
+            Op(s_vptr[THREADS_X * 4], s_iptr[THREADS_X * 4]);
+            *s_vptr = Op.m_val;
+            *s_iptr = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    if (DIMY >= 4) {
+        if (tidy < 2) {
+            Op(s_vptr[THREADS_X * 2], s_iptr[THREADS_X * 2]);
+            *s_vptr = Op.m_val;
+            *s_iptr = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    if (DIMY >= 2) {
+        if (tidy < 1) {
+            Op(s_vptr[THREADS_X * 1], s_iptr[THREADS_X * 1]);
+            *s_vptr = Op.m_val;
+            *s_iptr = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    if (tidy == 0 && is_valid && (blockIdx_dim < out.dims[dim])) {
+        *optr  = *s_vptr;
+        *olptr = *s_iptr;
+    }
+}
+
+template<typename T, af_op_t op>
+__device__ void warp_reduce(T *s_ptr, uint *s_idx, uint tidx) {
+    MinMaxOp<op, T> Op(s_ptr[tidx], s_idx[tidx]);
+#pragma unroll
+    for (int n = 16; n >= 1; n >>= 1) {
+        if (tidx < n) {
+            Op(s_ptr[tidx + n], s_idx[tidx + n]);
+            s_ptr[tidx] = Op.m_val;
+            s_idx[tidx] = Op.m_idx;
+        }
+        __syncthreads();
+    }
+}
+
+template<typename T, af_op_t op, bool is_first, uint DIMX>
+__global__ static void ireduceFirst(Param<T> out, uint *olptr, CParam<T> in,
+                                    const uint *ilptr, uint blocks_x,
+                                    uint blocks_y, uint repeat,
+                                    CParam<uint> rlen) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * blockDim.x + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint xid = blockIdx_x * blockDim.x * repeat + tidx;
+    const uint yid = blockIdx_y * blockDim.y + tidy;
+
+    const data_t<T> *iptr = in.ptr;
+    data_t<T> *optr       = out.ptr;
+    const uint *rlenptr   = (rlen.ptr) ? rlen.ptr + wid * rlen.strides[3] +
+                                           zid * rlen.strides[2] +
+                                           yid * rlen.strides[1]
+                                       : nullptr;
+
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+
+    if (!is_first)
+        ilptr +=
+            wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    olptr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+
+    if (yid >= in.dims[1] || zid >= in.dims[2] || wid >= in.dims[3]) return;
+
+    int minlen = rlenptr ? min(*rlenptr, in.dims[0]) : in.dims[0];
+    int lim    = min((int)(xid + repeat * DIMX), minlen);
+
+    compute_t<T> val = common::Binary<compute_t<T>, op>::init();
+    uint idx         = xid;
+
+    if (xid < lim) {
+        val = static_cast<compute_t<T>>(iptr[xid]);
+        if (!is_first) idx = ilptr[xid];
+    }
+
+    MinMaxOp<op, compute_t<T>> Op(val, idx);
+
+    __shared__ compute_t<T> s_val[THREADS_PER_BLOCK];
+    __shared__ uint s_idx[THREADS_PER_BLOCK];
+
+    for (int id = xid + DIMX; id < lim; id += DIMX) {
+        Op(static_cast<compute_t<T>>(iptr[id]), (!is_first) ? ilptr[id] : id);
+    }
+
+    s_val[tid] = Op.m_val;
+    s_idx[tid] = Op.m_idx;
+    __syncthreads();
+
+    compute_t<T> *s_vptr = s_val + tidy * DIMX;
+    uint *s_iptr         = s_idx + tidy * DIMX;
+
+    if (DIMX == 256) {
+        if (tidx < 128) {
+            Op(s_vptr[tidx + 128], s_iptr[tidx + 128]);
+            s_vptr[tidx] = Op.m_val;
+            s_iptr[tidx] = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    if (DIMX >= 128) {
+        if (tidx < 64) {
+            Op(s_vptr[tidx + 64], s_iptr[tidx + 64]);
+            s_vptr[tidx] = Op.m_val;
+            s_iptr[tidx] = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    if (DIMX >= 64) {
+        if (tidx < 32) {
+            Op(s_vptr[tidx + 32], s_iptr[tidx + 32]);
+            s_vptr[tidx] = Op.m_val;
+            s_iptr[tidx] = Op.m_idx;
+        }
+        __syncthreads();
+    }
+
+    warp_reduce<compute_t<T>, op>(s_vptr, s_iptr, tidx);
+
+    if (tidx == 0) {
+        optr[blockIdx_x]  = s_vptr[0];
+        olptr[blockIdx_x] = s_iptr[0];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/ireduce.hpp b/src/backend/cuda/kernel/ireduce.hpp
index 8c0dff1aec..992d0871c4 100644
--- a/src/backend/cuda/kernel/ireduce.hpp
+++ b/src/backend/cuda/kernel/ireduce.hpp
@@ -7,519 +7,256 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <ops.hpp>
-#include <backend.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include "config.hpp"
 #include <memory.hpp>
-#include <boost/scoped_ptr.hpp>
-
-using boost::scoped_ptr;
-
-namespace cuda
-{
-namespace kernel
-{
-
-    template<typename T> __host__ __device__ double cabs(const T in) { return (double)in; }
-    static double __host__ __device__ cabs(const char in) { return (double)(in > 0); }
-    static double __host__ __device__ cabs(const cfloat &in) { return (double)abs(in); }
-    static double __host__ __device__ cabs(const cdouble &in) { return (double)abs(in); }
-
-    template<af_op_t op, typename T>
-    struct MinMaxOp
-    {
-        T m_val;
-        uint m_idx;
-        __host__ __device__ MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
-        }
+#include <minmax_op.hpp>
+#include <nvrtc_kernel_headers/ireduce_cuh.hpp>
+#include "config.hpp"
 
-        __host__ __device__ void operator()(T val, uint idx)
-        {
-            if (cabs(val) < cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx > m_idx)) {
-                m_val = val;
-                m_idx = idx;
-            }
-        }
-    };
-
-    template<typename T>
-    struct MinMaxOp<af_max_t, T>
-    {
-        T m_val;
-        uint m_idx;
-        __host__ __device__ MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
-        }
+#include <memory>
 
-        __host__ __device__ void operator()(T val, uint idx)
-        {
-            if (cabs(val) > cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx <= m_idx)) {
-                m_val = val;
-                m_idx = idx;
-            }
-        }
-    };
-
-    template<typename T, af_op_t op, uint dim, bool is_first, uint DIMY>
-    __global__
-    static void ireduce_dim_kernel(Param<T> out, uint *olptr,
-                                  CParam <T> in, const uint *ilptr,
-                                  uint blocks_x, uint blocks_y, uint offset_dim)
-    {
-        const uint tidx = threadIdx.x;
-        const uint tidy = threadIdx.y;
-        const uint tid  = tidy * THREADS_X + tidx;
-
-        const uint zid = blockIdx.x / blocks_x;
-        const uint wid = blockIdx.y / blocks_y;
-        const uint blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const uint blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const uint xid = blockIdx_x * blockDim.x + tidx;
-        const uint yid = blockIdx_y; // yid  of output. updated for input later.
-
-        uint ids[4] = {xid, yid, zid, wid};
-
-        const T *iptr = in.ptr;
-        T *optr = out.ptr;
-
-        // There is only one element per block for out
-        // There are blockDim.y elements per block for in
-        // Hence increment ids[dim] just after offseting out and before offsetting in
-        optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] + ids[1] * out.strides[1] + ids[0];
-        olptr += ids[3] * out.strides[3] + ids[2] * out.strides[2] + ids[1] * out.strides[1] + ids[0];
-        const uint blockIdx_dim = ids[dim];
-
-        ids[dim] = ids[dim] * blockDim.y + tidy;
-        iptr  += ids[3] * in.strides[3] + ids[2] * in.strides[2] + ids[1] * in.strides[1] + ids[0];
-        if (!is_first) ilptr  += ids[3] * in.strides[3] + ids[2] * in.strides[2] + ids[1] * in.strides[1] + ids[0];
-        const uint id_dim_in = ids[dim];
-
-        const uint istride_dim = in.strides[dim];
-
-        bool is_valid =
-            (ids[0] < in.dims[0]) &&
-            (ids[1] < in.dims[1]) &&
-            (ids[2] < in.dims[2]) &&
-            (ids[3] < in.dims[3]);
-
-        Binary<T, op> ireduce;
-
-        T val = ireduce.init();
-        uint idx = id_dim_in;
-
-        if (is_valid && id_dim_in < in.dims[dim]) {
-            val = *iptr;
-            if (!is_first) idx = *ilptr;
-        }
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-        MinMaxOp<op, T> Op(val, idx);
+template<typename T, af_op_t op, int dim, bool is_first>
+void ireduce_dim_launcher(Param<T> out, uint *olptr, CParam<T> in,
+                          const uint *ilptr, const uint threads_y,
+                          const dim_t blocks_dim[4], CParam<uint> rlen) {
+    dim3 threads(THREADS_X, threads_y);
 
-        const uint id_dim_in_start = id_dim_in + offset_dim * blockDim.y;
+    dim3 blocks(blocks_dim[0] * blocks_dim[2], blocks_dim[1] * blocks_dim[3]);
 
-        __shared__ T s_val[THREADS_X * DIMY];
-        __shared__ uint s_idx[THREADS_X * DIMY];
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        for (int id = id_dim_in_start;
-             is_valid && (id < in.dims[dim]);
-             id += offset_dim * blockDim.y) {
+    auto ireduceDim = common::getKernel(
+        "arrayfire::cuda::ireduceDim", {{ireduce_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op), TemplateArg(dim),
+                     TemplateArg(is_first), TemplateArg(threads_y)),
+        {{DefineValue(THREADS_X)}});
 
-            iptr = iptr + offset_dim * blockDim.y * istride_dim;
-            if (!is_first) {
-                ilptr = ilptr + offset_dim * blockDim.y * istride_dim;
-                Op(*iptr, *ilptr);
-            } else {
-                Op(*iptr, id);
-            }
-        }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        s_val[tid] = Op.m_val;
-        s_idx[tid] = Op.m_idx;
+    ireduceDim(qArgs, out, olptr, in, ilptr, blocks_dim[0], blocks_dim[1],
+               blocks_dim[dim], rlen);
 
-        T *s_vptr = s_val + tid;
-        uint *s_iptr = s_idx + tid;
-        __syncthreads();
+    POST_LAUNCH_CHECK();
+}
 
-        if (DIMY == 8) {
-            if (tidy < 4) {
-                Op(s_vptr[THREADS_X * 4], s_iptr[THREADS_X * 4]);
-                *s_vptr = Op.m_val;
-                *s_iptr = Op.m_idx;
-            }
-            __syncthreads();
-        }
+template<typename T, af_op_t op, int dim>
+void ireduce_dim(Param<T> out, uint *olptr, CParam<T> in, CParam<uint> rlen) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.dims[dim]));
+    uint threads_x = THREADS_X;
 
-        if (DIMY >= 4) {
-            if (tidy < 2) {
-                Op(s_vptr[THREADS_X * 2], s_iptr[THREADS_X * 2]);
-                *s_vptr = Op.m_val;
-                *s_iptr = Op.m_idx;
-            }
-            __syncthreads();
-        }
+    dim_t blocks_dim[] = {divup(in.dims[0], threads_x), in.dims[1], in.dims[2],
+                          in.dims[3]};
 
-        if (DIMY >= 2) {
-            if (tidy < 1) {
-                Op(s_vptr[THREADS_X * 1], s_iptr[THREADS_X * 1]);
-                *s_vptr = Op.m_val;
-                *s_iptr = Op.m_idx;
-            }
-            __syncthreads();
-        }
+    blocks_dim[dim] = divup(in.dims[dim], threads_y * REPEAT);
 
-        if (tidy == 0 && is_valid &&
-            (blockIdx_dim < out.dims[dim])) {
-            *optr = *s_vptr;
-            *olptr = *s_iptr;
-        }
+    Param<T> tmp = out;
+    uint *tlptr  = olptr;
+    uptr<T> tmp_alloc;
+    uptr<uint> tlptr_alloc;
 
-    }
+    if (blocks_dim[dim] > 1) {
+        int tmp_elements = 1;
+        tmp.dims[dim]    = blocks_dim[dim];
 
-    template<typename T, af_op_t op, int dim, bool is_first>
-    void ireduce_dim_launcher(Param<T> out, uint *olptr,
-                             CParam<T> in, const uint *ilptr,
-                             const uint threads_y, const uint blocks_dim[4])
-    {
-        dim3 threads(THREADS_X, threads_y);
-
-        dim3 blocks(blocks_dim[0] * blocks_dim[2],
-                    blocks_dim[1] * blocks_dim[3]);
-
-        switch (threads_y) {
-        case 8:
-            (ireduce_dim_kernel<T, op, dim, is_first, 8>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
-        case 4:
-            (ireduce_dim_kernel<T, op, dim, is_first, 4>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
-        case 2:
-            (ireduce_dim_kernel<T, op, dim, is_first, 2>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
-        case 1:
-            (ireduce_dim_kernel<T, op, dim, is_first, 1>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
-        }
+        for (int k = 0; k < 4; k++) tmp_elements *= tmp.dims[k];
+        tmp_alloc   = memAlloc<T>(tmp_elements);
+        tlptr_alloc = memAlloc<uint>(tmp_elements);
+        tmp.ptr     = tmp_alloc.get();
+        tlptr       = tlptr_alloc.get();
 
-        POST_LAUNCH_CHECK();
+        for (int k = dim + 1; k < 4; k++) tmp.strides[k] *= blocks_dim[dim];
     }
 
-    template<typename T, af_op_t op, int dim>
-    void ireduce_dim(Param<T> out,  uint *olptr, CParam<T> in)
-    {
-        uint threads_y = std::min(THREADS_Y, nextpow2(in.dims[dim]));
-        uint threads_x = THREADS_X;
-
-        uint blocks_dim[] = {divup(in.dims[0], threads_x),
-                             in.dims[1], in.dims[2], in.dims[3]};
-
-        blocks_dim[dim] = divup(in.dims[dim], threads_y * REPEAT);
-
-        Param<T> tmp = out;
-        uint *tlptr = olptr;
-
-        if (blocks_dim[dim] > 1) {
-            int tmp_elements = 1;
-            tmp.dims[dim] = blocks_dim[dim];
+    ireduce_dim_launcher<T, op, dim, true>(tmp, tlptr, in, NULL, threads_y,
+                                           blocks_dim, rlen);
 
-            for (int k = 0; k < 4; k++) tmp_elements *= tmp.dims[k];
-            tmp.ptr = memAlloc<T>(tmp_elements);
-            tlptr = memAlloc<uint>(tmp_elements);
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
 
-            for (int k = dim + 1; k < 4; k++) tmp.strides[k] *= blocks_dim[dim];
-        }
-
-        ireduce_dim_launcher<T, op, dim, true>(tmp, tlptr, in, NULL, threads_y, blocks_dim);
+        ireduce_dim_launcher<T, op, dim, false>(out, olptr, tmp, tlptr,
+                                                threads_y, blocks_dim, rlen);
+    }
+}
 
-        if (blocks_dim[dim] > 1) {
-            blocks_dim[dim] = 1;
+template<typename T, af_op_t op, bool is_first>
+void ireduce_first_launcher(Param<T> out, uint *olptr, CParam<T> in,
+                            const uint *ilptr, const uint blocks_x,
+                            const uint blocks_y, const uint threads_x,
+                            CParam<uint> rlen) {
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * in.dims[2], blocks_y * in.dims[3]);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    uint repeat = divup(in.dims[0], (blocks_x * threads_x));
+
+    // threads_x can take values 32, 64, 128, 256
+    auto ireduceFirst = common::getKernel(
+        "arrayfire::cuda::ireduceFirst", {{ireduce_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op),
+                     TemplateArg(is_first), TemplateArg(threads_x)),
+        {{DefineValue(THREADS_PER_BLOCK)}});
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    ireduceFirst(qArgs, out, olptr, in, ilptr, blocks_x, blocks_y, repeat,
+                 rlen);
+    POST_LAUNCH_CHECK();
+}
 
-            ireduce_dim_launcher<T, op, dim, false>(out, olptr, tmp, tlptr,
-                                                    threads_y, blocks_dim);
+template<typename T, af_op_t op>
+void ireduce_first(Param<T> out, uint *olptr, CParam<T> in, CParam<uint> rlen) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.dims[1], threads_y);
+
+    Param<T> tmp = out;
+    uint *tlptr  = olptr;
+    uptr<T> tmp_alloc;
+    uptr<uint> tlptr_alloc;
+    if (blocks_x > 1) {
+        auto elements = blocks_x * in.dims[1] * in.dims[2] * in.dims[3];
+        tmp_alloc     = memAlloc<T>(elements);
+        tlptr_alloc   = memAlloc<uint>(elements);
+        tmp.ptr       = tmp_alloc.get();
+        tlptr         = tlptr_alloc.get();
+
+        tmp.dims[0] = blocks_x;
+        for (int k = 1; k < 4; k++) tmp.strides[k] *= blocks_x;
+    }
 
-            memFree(tmp.ptr);
-            memFree(tlptr);
-        }
+    ireduce_first_launcher<T, op, true>(tmp, tlptr, in, NULL, blocks_x,
+                                        blocks_y, threads_x, rlen);
 
+    if (blocks_x > 1) {
+        ireduce_first_launcher<T, op, false>(out, olptr, tmp, tlptr, 1,
+                                             blocks_y, threads_x, rlen);
     }
+}
 
-    template<typename T, af_op_t op>
-    __device__ void warp_reduce(T *s_ptr, uint *s_idx, uint tidx)
-    {
-        MinMaxOp<op, T> Op(s_ptr[tidx], s_idx[tidx]);
-#pragma unroll
-        for (int n = 16; n >= 1; n >>= 1) {
-            if (tidx < n) {
-                Op(s_ptr[tidx + n], s_idx[tidx + n]);
-                s_ptr[tidx] = Op.m_val;
-                s_idx[tidx] = Op.m_idx;
-            }
-            __syncthreads();
-        }
+template<typename T, af_op_t op>
+void ireduce(Param<T> out, uint *olptr, CParam<T> in, int dim,
+             CParam<uint> rlen) {
+    switch (dim) {
+        case 0: return ireduce_first<T, op>(out, olptr, in, rlen);
+        case 1: return ireduce_dim<T, op, 1>(out, olptr, in, rlen);
+        case 2: return ireduce_dim<T, op, 2>(out, olptr, in, rlen);
+        case 3: return ireduce_dim<T, op, 3>(out, olptr, in, rlen);
     }
+}
 
+template<typename T, af_op_t op>
+T ireduce_all(uint *idx, CParam<T> in) {
+    using std::unique_ptr;
+    int in_elements = in.dims[0] * in.dims[1] * in.dims[2] * in.dims[3];
 
-    template<typename T, af_op_t op, bool is_first, uint DIMX>
-    __global__
-    static void ireduce_first_kernel(Param<T> out, uint *olptr,
-                                    CParam<T>  in, const uint *ilptr,
-                                    uint blocks_x, uint blocks_y, uint repeat)
-    {
-        const uint tidx = threadIdx.x;
-        const uint tidy = threadIdx.y;
-        const uint tid  = tidy * blockDim.x + tidx;
-
-        const uint zid = blockIdx.x / blocks_x;
-        const uint wid = blockIdx.y / blocks_y;
-        const uint blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const uint blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const uint xid = blockIdx_x * blockDim.x * repeat + tidx;
-        const uint yid = blockIdx_y * blockDim.y + tidy;
-
-        const T *iptr = in.ptr;
-        T *optr = out.ptr;
-
-        iptr += wid *  in.strides[3] + zid *  in.strides[2] + yid *  in.strides[1];
-        optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
-
-        if (!is_first) ilptr += wid *  in.strides[3] + zid *  in.strides[2] + yid *  in.strides[1];
-        olptr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
-
-        if (yid >= in.dims[1] ||
-            zid >= in.dims[2] ||
-            wid >= in.dims[3]) return;
-
-        int lim = min((int)(xid + repeat * DIMX), in.dims[0]);
-
-        Binary<T, op> ireduce;
-
-        T val = ireduce.init();
-        uint idx = xid;
-
-        if (xid < lim) {
-            val = iptr[xid];
-            if (!is_first) idx = ilptr[xid];
-        }
-
-        MinMaxOp<op, T> Op(val, idx);
-
-        __shared__ T s_val[THREADS_PER_BLOCK];
-        __shared__ uint s_idx[THREADS_PER_BLOCK];
-
-
-        for (int id = xid + DIMX; id < lim; id += DIMX) {
-            Op(iptr[id], (!is_first) ? ilptr[id] : id);
-        }
-
-        s_val[tid] = Op.m_val;
-        s_idx[tid] = Op.m_idx;
-        __syncthreads();
-
-        T *s_vptr = s_val + tidy * DIMX;
-        uint *s_iptr = s_idx + tidy * DIMX;
-
-        if (DIMX == 256) {
-            if (tidx < 128) {
-                Op(s_vptr[tidx + 128], s_iptr[tidx + 128]);
-                s_vptr[tidx] = Op.m_val;
-                s_iptr[tidx] = Op.m_idx;
-            }
-            __syncthreads();
-        }
-
-        if (DIMX >= 128) {
-            if (tidx <  64) {
-                Op(s_vptr[tidx +  64], s_iptr[tidx +  64]);
-                s_vptr[tidx] = Op.m_val;
-                s_iptr[tidx] = Op.m_idx;
-            }
-            __syncthreads();
-        }
-
-        if (DIMX >=  64) {
-            if (tidx <  32) {
-                Op(s_vptr[tidx +  32], s_iptr[tidx +  32]);
-                s_vptr[tidx] = Op.m_val;
-                s_iptr[tidx] = Op.m_idx;
-            }
-            __syncthreads();
-        }
-
-        warp_reduce<T, op>(s_vptr, s_iptr, tidx);
-
-        if (tidx == 0) {
-            optr[blockIdx_x] = s_vptr[0];
-            olptr[blockIdx_x] = s_iptr[0];
-        }
+    bool is_linear = (in.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &=
+            (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
     }
 
-    template<typename T, af_op_t op, bool is_first>
-    void ireduce_first_launcher(Param<T> out, uint *olptr, CParam<T> in, const uint *ilptr,
-                               const uint blocks_x, const uint blocks_y, const uint threads_x)
-    {
-
-        dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
-        dim3 blocks(blocks_x * in.dims[2],
-                    blocks_y * in.dims[3]);
-
-        uint repeat = divup(in.dims[0], (blocks_x * threads_x));
-
-        switch (threads_x) {
-        case 32:
-            (ireduce_first_kernel<T, op, is_first,  32>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_x, blocks_y, repeat); break;
-        case 64:
-            (ireduce_first_kernel<T, op, is_first,  64>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_x, blocks_y, repeat); break;
-        case 128:
-            (ireduce_first_kernel<T, op, is_first,  128>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_x, blocks_y, repeat); break;
-        case 256:
-            (ireduce_first_kernel<T, op, is_first,  256>)<<<blocks, threads>>>(
-                out, olptr, in, ilptr, blocks_x, blocks_y, repeat); break;
+    // FIXME: Use better heuristics to get to the optimum number
+    if (!is_linear || in_elements > 4096) {
+        if (is_linear) {
+            in.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.dims[k]    = 1;
+                in.strides[k] = in_elements;
+            }
         }
 
-        POST_LAUNCH_CHECK();
-    }
-
-    template<typename T, af_op_t op>
-    void ireduce_first(Param<T> out, uint *olptr, CParam<T> in)
-    {
         uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_BLOCK);
+        threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
         uint threads_y = THREADS_PER_BLOCK / threads_x;
 
+        Param<T> tmp;
+        uint *tlptr;
+
         uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
         uint blocks_y = divup(in.dims[1], threads_y);
 
-        Param<T> tmp = out;
-        uint *tlptr = olptr;
-        if (blocks_x > 1) {
-            tmp.ptr = memAlloc<T>(blocks_x *
-                                  in.dims[1] *
-                                  in.dims[2] *
-                                  in.dims[3]);
-
-            tlptr = memAlloc<uint>(blocks_x *
-                                   in.dims[1] *
-                                   in.dims[2] *
-                                   in.dims[3]);
-
-            tmp.dims[0] = blocks_x;
-            for (int k = 1; k < 4; k++) tmp.strides[k] *= blocks_x;
-        }
+        tmp.dims[0]    = blocks_x;
+        tmp.strides[0] = 1;
 
-        ireduce_first_launcher<T, op, true>(tmp, tlptr, in, NULL, blocks_x, blocks_y, threads_x);
-
-        if (blocks_x > 1) {
-            ireduce_first_launcher<T, op, false>(out, olptr, tmp, tlptr, 1, blocks_y, threads_x);
-
-            memFree(tmp.ptr);
-            memFree(tlptr);
-        }
-    }
-
-    template<typename T, af_op_t op>
-    void ireduce(Param<T> out, uint *olptr, CParam<T> in, int dim)
-    {
-        switch (dim) {
-        case 0: return ireduce_first<T, op   >(out, olptr, in);
-        case 1: return ireduce_dim  <T, op, 1>(out, olptr, in);
-        case 2: return ireduce_dim  <T, op, 2>(out, olptr, in);
-        case 3: return ireduce_dim  <T, op, 3>(out, olptr, in);
+        for (int k = 1; k < 4; k++) {
+            tmp.dims[k]    = in.dims[k];
+            tmp.strides[k] = tmp.dims[k - 1] * tmp.strides[k - 1];
         }
-    }
 
-    template<typename T, af_op_t op>
-    T ireduce_all(uint *idx, CParam<T> in)
-    {
-        int in_elements = in.strides[3] * in.dims[3];
-
-        // FIXME: Use better heuristics to get to the optimum number
-        if (in_elements > 4096) {
-
-            bool is_linear = (in.strides[0] == 1);
-            for (int k = 1; k < 4; k++) {
-                is_linear &= (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
+        int tmp_elements = tmp.strides[3] * tmp.dims[3];
+
+        // Device buffers are owned by the uptr handles returned by memAlloc
+        auto tmp_alloc   = memAlloc<T>(tmp_elements);
+        auto tlptr_alloc = memAlloc<uint>(tmp_elements);
+        tmp.ptr          = tmp_alloc.get();
+        tlptr            = tlptr_alloc.get();
+        af::dim4 emptysz(0);
+        CParam<uint> rlen(nullptr, emptysz.get(), emptysz.get());
+        ireduce_first_launcher<T, op, true>(tmp, tlptr, in, NULL, blocks_x,
+                                            blocks_y, threads_x, rlen);
+
+        unique_ptr<T[]> h_ptr(new T[tmp_elements]);
+        unique_ptr<uint[]> h_lptr(new uint[tmp_elements]);
+        T *h_ptr_raw     = h_ptr.get();
+        uint *h_lptr_raw = h_lptr.get();
+
+        CUDA_CHECK(cudaMemcpyAsync(h_ptr_raw, tmp.ptr, tmp_elements * sizeof(T),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(h_lptr_raw, tlptr,
+                                   tmp_elements * sizeof(uint),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
+
+        if (!is_linear) {
+            // Converting n-d index into a linear index
+            // in is of size  [   dims0, dims1, dims2, dims3]
+            // tmp is of size [blocks_x, dims1, dims2, dims3]
+            // i / blocks_x gives you the batch number "N"
+            // "N * dims0 + h_lptr_raw[i]" gives the linear index
+            for (int i = 0; i < tmp_elements; i++) {
+                h_lptr_raw[i] += (i / blocks_x) * in.dims[0];
             }
+        }
 
-            if (is_linear) {
-                in.dims[0] = in_elements;
-                for (int k = 1; k < 4; k++) {
-                    in.dims[k] = 1;
-                    in.strides[k] = in_elements;
-                }
-            }
-
-            uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
-            threads_x = std::min(threads_x, THREADS_PER_BLOCK);
-            uint threads_y = THREADS_PER_BLOCK / threads_x;
-
-            Param<T> tmp;
-            uint *tlptr;
-
-            uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
-            uint blocks_y = divup(in.dims[1], threads_y);
-
-            tmp.dims[0] = blocks_x;
-            tmp.strides[0] = 1;
-
-            for (int k = 1; k < 4; k++) {
-                tmp.dims[k] = in.dims[k];
-                tmp.strides[k] = tmp.dims[k - 1] * tmp.strides[k - 1];
-            }
-
-            int tmp_elements = tmp.strides[3] * tmp.dims[3];
-
-            //TODO: Use scoped_ptr
-            tmp.ptr = memAlloc<T>(tmp_elements);
-            tlptr = memAlloc<uint>(tmp_elements);
-            ireduce_first_launcher<T, op, true>(tmp, tlptr, in, NULL, blocks_x, blocks_y, threads_x);
-
-            scoped_ptr<T>       h_ptr(new T[tmp_elements]);
-            scoped_ptr<uint>    h_lptr(new uint[tmp_elements]);
-            T*      h_ptr_raw = h_ptr.get();
-            uint*   h_lptr_raw = h_lptr.get();
-
-            CUDA_CHECK(cudaMemcpy(h_ptr_raw, tmp.ptr, tmp_elements * sizeof(T), cudaMemcpyDeviceToHost));
-            CUDA_CHECK(cudaMemcpy(h_lptr_raw, tlptr, tmp_elements * sizeof(uint), cudaMemcpyDeviceToHost));
-            memFree(tmp.ptr);
-            memFree(tlptr);
-
-            MinMaxOp<op, T> Op(h_ptr_raw[0], h_lptr_raw[0]);
-
-            for (int i = 1; i < tmp_elements; i++) {
-                Op(h_ptr_raw[i], h_lptr_raw[i]);
-            }
+        MinMaxOp<op, T> Op(h_ptr_raw[0], h_lptr_raw[0]);
 
-            *idx = Op.m_idx;
-            return Op.m_val;
-        } else {
+        for (int i = 1; i < tmp_elements; i++) {
+            Op(h_ptr_raw[i], h_lptr_raw[i]);
+        }
 
-            scoped_ptr<T> h_ptr(new T[in_elements]);
-            T* h_ptr_raw = h_ptr.get();
-            CUDA_CHECK(cudaMemcpy(h_ptr_raw, in.ptr, in_elements * sizeof(T), cudaMemcpyDeviceToHost));
+        *idx = Op.m_idx;
+        return Op.m_val;
+    } else {
+        unique_ptr<T[]> h_ptr(new T[in_elements]);
+        T *h_ptr_raw = h_ptr.get();
+        CUDA_CHECK(cudaMemcpyAsync(h_ptr_raw, in.ptr, in_elements * sizeof(T),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
 
-            MinMaxOp<op, T> Op(h_ptr_raw[0], 0);
-            for (int i = 1; i < in_elements; i++) {
-                Op(h_ptr_raw[i], i);
-            }
+        MinMaxOp<op, T> Op(h_ptr_raw[0], 0);
+        for (int i = 1; i < in_elements; i++) { Op(h_ptr_raw[i], i); }
 
-            *idx = Op.m_idx;
-            return Op.m_val;
-        }
+        *idx = Op.m_idx;
+        return Op.m_val;
     }
-
-}
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
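Review note on the index conversion in ireduce_all above: the comment block describes turning each per-block index (relative to its own column along dim 0) into a logical linear index by adding N * dims0, where N = i / blocks_x. The following is a minimal host-only sketch of that arithmetic, not the ArrayFire implementation. The helper name linearize_indices and its vector-based signature are hypothetical, and the result is a dims-based (logical) index, not a strided memory offset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ArrayFire). partial_idx holds blocks_x
// entries per column ("batch"); each entry is an index in [0, dims0)
// relative to the start of its own column.
std::vector<unsigned> linearize_indices(std::vector<unsigned> partial_idx,
                                        unsigned blocks_x, unsigned dims0) {
    for (std::size_t i = 0; i < partial_idx.size(); i++) {
        // i / blocks_x is the batch (column) number N
        unsigned batch = static_cast<unsigned>(i / blocks_x);
        // N * dims0 + local index is the logical linear index
        partial_idx[i] += batch * dims0;
    }
    return partial_idx;
}
```

With dims0 = 10 and blocks_x = 2, the entries of the second column are shifted by 10 and those of the third column by 20, which mirrors the loop that updates h_lptr_raw in the non-linear branch.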
diff --git a/src/backend/cuda/kernel/jit.cuh b/src/backend/cuda/kernel/jit.cuh
new file mode 100644
index 0000000000..879d46f3c2
--- /dev/null
+++ b/src/backend/cuda/kernel/jit.cuh
@@ -0,0 +1,251 @@
+/*******************************************************
+ * Copyright (c) 2025, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+typedef float2 cuFloatComplex;
+typedef cuFloatComplex cfloat;
+
+typedef double2 cuDoubleComplex;
+typedef cuDoubleComplex cdouble;
+
+#include <cuda_fp16.h>
+
+// ----------------------------------------------
+// COMMON OPERATIONS
+// ----------------------------------------------
+
+#define __select(cond, a, b) ((cond) ? (a) : (b))
+#define __not_select(cond, a, b) ((cond) ? (b) : (a))
+#define __circular_mod(a, b) (((a) < (b)) ? (a) : ((a) - (b)))
+
+// ----------------------------------------------
+// REAL NUMBER OPERATIONS
+// ----------------------------------------------
+#define __noop(a) (a)
+#define __add(lhs, rhs) ((lhs) + (rhs))
+#define __sub(lhs, rhs) ((lhs) - (rhs))
+#define __mul(lhs, rhs) ((lhs) * (rhs))
+#define __div(lhs, rhs) ((lhs) / (rhs))
+#define __and(lhs, rhs) ((lhs) && (rhs))
+#define __or(lhs, rhs) ((lhs) || (rhs))
+
+#define __lt(lhs, rhs) ((lhs) < (rhs))
+#define __gt(lhs, rhs) ((lhs) > (rhs))
+#define __le(lhs, rhs) ((lhs) <= (rhs))
+#define __ge(lhs, rhs) ((lhs) >= (rhs))
+#define __eq(lhs, rhs) ((lhs) == (rhs))
+#define __neq(lhs, rhs) ((lhs) != (rhs))
+
+#define __conj(in) (in)
+#define __real(in) (in)
+#define __imag(in) (0)
+#define __abs(in) abs(in)
+#define __sigmoid(in) (1.0 / (1 + exp(-(in))))
+
+#define __bitnot(in) (~(in))
+#define __bitor(lhs, rhs) ((lhs) | (rhs))
+#define __bitand(lhs, rhs) ((lhs) & (rhs))
+#define __bitxor(lhs, rhs) ((lhs) ^ (rhs))
+#define __bitshiftl(lhs, rhs) ((lhs) << (rhs))
+#define __bitshiftr(lhs, rhs) ((lhs) >> (rhs))
+
+#define __min(lhs, rhs) (((lhs) < (rhs)) ? (lhs) : (rhs))
+#define __max(lhs, rhs) (((lhs) > (rhs)) ? (lhs) : (rhs))
+#define __rem(lhs, rhs) ((lhs) % (rhs))
+#define __mod(lhs, rhs) ((lhs) % (rhs))
+
+#define __pow(lhs, rhs)  \
+    static_cast<double>( \
+        pow(static_cast<double>(lhs), static_cast<double>(rhs)))
+#define __powll(lhs, rhs) \
+    __double2ll_rn(pow(__ll2double_rn(lhs), __ll2double_rn(rhs)))
+#define __powul(lhs, rhs) \
+    __double2ull_rn(pow(__ull2double_rn(lhs), __ull2double_rn(rhs)))
+#define __powui(lhs, rhs) \
+    __double2uint_rn(pow(__uint2double_rn(lhs), __uint2double_rn(rhs)))
+#define __powsi(lhs, rhs) \
+    __double2int_rn(pow(__int2double_rn(lhs), __int2double_rn(rhs)))
+
+#define __convert_char(val) (char)((val) != 0)
+#define frem(lhs, rhs) remainder((lhs), (rhs))
+#define fremf(lhs, rhs) remainderf((lhs), (rhs))
+
+// ----------------------------------------------
+// COMPLEX FLOAT OPERATIONS
+// ----------------------------------------------
+
+#define __crealf(in) ((in).x)
+#define __cimagf(in) ((in).y)
+#define __cabsf(in) hypotf((in).x, (in).y)
+
+__device__ cfloat __cplx2f(float x, float y) {
+    cfloat res = {x, y};
+    return res;
+}
+
+__device__ cfloat __cconjf(cfloat in) {
+    cfloat res = {in.x, -in.y};
+    return res;
+}
+
+__device__ cfloat __caddf(cfloat lhs, cfloat rhs) {
+    cfloat res = {lhs.x + rhs.x, lhs.y + rhs.y};
+    return res;
+}
+
+__device__ cfloat __csubf(cfloat lhs, cfloat rhs) {
+    cfloat res = {lhs.x - rhs.x, lhs.y - rhs.y};
+    return res;
+}
+
+__device__ cfloat __cmulf(cfloat lhs, cfloat rhs) {
+    cfloat out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+__device__ cfloat __cdivf(cfloat lhs, cfloat rhs) {
+    // Normalize by absolute value and multiply
+    float rhs_abs     = __cabsf(rhs);
+    float inv_rhs_abs = 1.0f / rhs_abs;
+    float rhs_x       = inv_rhs_abs * rhs.x;
+    float rhs_y       = inv_rhs_abs * rhs.y;
+    cfloat out = {lhs.x * rhs_x + lhs.y * rhs_y, lhs.y * rhs_x - lhs.x * rhs_y};
+    out.x *= inv_rhs_abs;
+    out.y *= inv_rhs_abs;
+    return out;
+}
+
+__device__ cfloat __cminf(cfloat lhs, cfloat rhs) {
+    return __cabsf(lhs) < __cabsf(rhs) ? lhs : rhs;
+}
+
+__device__ cfloat __cmaxf(cfloat lhs, cfloat rhs) {
+    return __cabsf(lhs) > __cabsf(rhs) ? lhs : rhs;
+}
+#define __candf(lhs, rhs) (__cabsf(lhs) && __cabsf(rhs))
+#define __corf(lhs, rhs) (__cabsf(lhs) || __cabsf(rhs))
+#define __ceqf(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
+#define __cneqf(lhs, rhs) (!__ceqf((lhs), (rhs)))
+#define __cltf(lhs, rhs) (__cabsf(lhs) < __cabsf(rhs))
+#define __clef(lhs, rhs) (__cabsf(lhs) <= __cabsf(rhs))
+#define __cgtf(lhs, rhs) (__cabsf(lhs) > __cabsf(rhs))
+#define __cgef(lhs, rhs) (__cabsf(lhs) >= __cabsf(rhs))
+#define __convert_cfloat(real) __cplx2f(real, 0)
+#define __convert_c2c(in) (in)
+#define __convert_z2c(in) __cplx2f((float)(in).x, (float)(in).y)
+
+// ----------------------------------------------
+// COMPLEX DOUBLE OPERATIONS
+// ----------------------------------------------
+#define __creal(in) ((in).x)
+#define __cimag(in) ((in).y)
+#define __cabs(in) hypot((in).x, (in).y)
+
+__device__ cdouble __cplx2(double x, double y) {
+    cdouble res = {x, y};
+    return res;
+}
+
+__device__ cdouble __cconj(cdouble in) {
+    cdouble res = {in.x, -in.y};
+    return res;
+}
+
+__device__ cdouble __cadd(cdouble lhs, cdouble rhs) {
+    cdouble res = {lhs.x + rhs.x, lhs.y + rhs.y};
+    return res;
+}
+
+__device__ cdouble __csub(cdouble lhs, cdouble rhs) {
+    cdouble res = {lhs.x - rhs.x, lhs.y - rhs.y};
+    return res;
+}
+
+__device__ cdouble __cmul(cdouble lhs, cdouble rhs) {
+    cdouble out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+__device__ cdouble __cdiv(cdouble lhs, cdouble rhs) {
+    // Normalize by absolute value and multiply
+    double rhs_abs     = __cabs(rhs);
+    double inv_rhs_abs = 1.0 / rhs_abs;
+    double rhs_x       = inv_rhs_abs * rhs.x;
+    double rhs_y       = inv_rhs_abs * rhs.y;
+    cdouble out        = {lhs.x * rhs_x + lhs.y * rhs_y,
+                          lhs.y * rhs_x - lhs.x * rhs_y};
+    out.x *= inv_rhs_abs;
+    out.y *= inv_rhs_abs;
+    return out;
+}
+
+__device__ cdouble __cmin(cdouble lhs, cdouble rhs) {
+    return __cabs(lhs) < __cabs(rhs) ? lhs : rhs;
+}
+
+__device__ cdouble __cmax(cdouble lhs, cdouble rhs) {
+    return __cabs(lhs) > __cabs(rhs) ? lhs : rhs;
+}
+
+template<typename T>
+static __device__ __inline__ int iszero(T a) {
+    return a == T(0);
+}
+
+template<typename T>
+static __device__ __inline__ int __isinf(const T in) {
+    return isinf(in);
+}
+
+template<>
+__device__ __inline__ int __isinf<__half>(const __half in) {
+#if __CUDA_ARCH__ >= 530
+    return __hisinf(in);
+#else
+    return ::isinf(__half2float(in));
+#endif
+}
+
+__device__ __inline__
+__half hmod(const __half lhs, const __half rhs) {
+#if __CUDA_ARCH__ >= 530
+    return __hsub(lhs, __hmul(htrunc(__hdiv(lhs, rhs)), rhs));
+#else
+    return __float2half(fmodf(__half2float(lhs), __half2float(rhs)));
+#endif
+}
+
+template<typename T>
+static __device__ __inline__ int __isnan(const T in) {
+    return isnan(in);
+}
+
+template<>
+__device__ __inline__ int __isnan<__half>(const __half in) {
+#if __CUDA_ARCH__ >= 530
+    return __hisnan(in);
+#else
+    return ::isnan(__half2float(in));
+#endif
+}
+
+#define __cand(lhs, rhs) (__cabs(lhs) && __cabs(rhs))
+#define __cor(lhs, rhs) (__cabs(lhs) || __cabs(rhs))
+#define __ceq(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
+#define __cneq(lhs, rhs) (!__ceq((lhs), (rhs)))
+#define __clt(lhs, rhs) (__cabs(lhs) < __cabs(rhs))
+#define __cle(lhs, rhs) (__cabs(lhs) <= __cabs(rhs))
+#define __cgt(lhs, rhs) (__cabs(lhs) > __cabs(rhs))
+#define __cge(lhs, rhs) (__cabs(lhs) >= __cabs(rhs))
+#define __convert_cdouble(real) __cplx2(real, 0)
+#define __convert_z2z(in) (in)
+#define __convert_c2z(in) __cplx2((double)(in).x, (double)(in).y)
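Review note on __cdivf / __cdiv in jit.cuh above: both normalize rhs by its absolute value, multiply lhs by the conjugate of the normalized rhs, and then scale by the inverse magnitude once more. The following host-only sketch reproduces that sequence in plain C++ so the arithmetic can be checked without a GPU. The struct cplx and the function cdiv_normalized are illustrative names, not ArrayFire or CUDA API, and the sketch assumes rhs is nonzero and finite.

```cpp
#include <cassert>
#include <cmath>

// Illustrative stand-in for double2 / cdouble (hypothetical type).
struct cplx {
    double x;  // real part
    double y;  // imaginary part
};

// Division via normalization, following the same steps as __cdiv:
//   u   = rhs / |rhs|            (unit-magnitude direction of rhs)
//   out = lhs * conj(u) / |rhs|
cplx cdiv_normalized(cplx lhs, cplx rhs) {
    double inv_abs = 1.0 / std::hypot(rhs.x, rhs.y);
    double ux      = inv_abs * rhs.x;
    double uy      = inv_abs * rhs.y;
    cplx out       = {lhs.x * ux + lhs.y * uy, lhs.y * ux - lhs.x * uy};
    out.x *= inv_abs;
    out.y *= inv_abs;
    return out;
}
```

For example, (1 + 2i) / (3 + 4i) evaluates to 0.44 + 0.08i up to rounding. Scaling by the inverse magnitude in two steps, instead of dividing once by the squared magnitude, keeps the intermediate denominator closer to the magnitude of rhs itself.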
diff --git a/src/backend/cuda/kernel/join.hpp b/src/backend/cuda/kernel/join.hpp
deleted file mode 100644
index 2bf68aa772..0000000000
--- a/src/backend/cuda/kernel/join.hpp
+++ /dev/null
@@ -1,82 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <math.hpp>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 256;
-        static const unsigned TILEY = 32;
-
-        template<typename To, typename Ti, int dim>
-        __global__
-        void join_kernel(Param<To> out, CParam<Ti> in,
-                         const int o0, const int o1, const int o2, const int o3,
-                         const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int iz = blockIdx.x / blocksPerMatX;
-            const int iw = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - iz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - iw * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
-
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
-
-            To *d_out = out.ptr;
-            Ti const *d_in = in.ptr;
-
-            if(iz < in.dims[2] && iw < in.dims[3]) {
-                d_out = d_out + (iz + o2) * out.strides[2] + (iw + o3) * out.strides[3];
-                d_in  = d_in  + iz * in.strides[2] + iw * in.strides[3];
-
-                for (int iy = yy; iy < in.dims[1]; iy += incy) {
-                    Ti const *d_in_ = d_in + iy * in.strides[1];
-                    To *d_out_ = d_out + (iy + o1) * out.strides[1];
-
-                    for (int ix = xx; ix < in.dims[0]; ix += incx) {
-                        d_out_[ix + o0] = d_in_[ix];
-                    }
-                }
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename To, typename Tx, int dim>
-        void join(Param<To> out, CParam<Tx> X, const af::dim4 &offset)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(X.dims[0], TILEX);
-            int blocksPerMatY = divup(X.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * X.dims[2],
-                        blocksPerMatY * X.dims[3],
-                        1);
-
-            join_kernel<To, Tx, dim><<<blocks, threads>>>
-                       (out, X, offset[0], offset[1], offset[2], offset[3],
-                        blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
-}
diff --git a/src/backend/cuda/kernel/lookup.cuh b/src/backend/cuda/kernel/lookup.cuh
new file mode 100644
index 0000000000..753ea8c6db
--- /dev/null
+++ b/src/backend/cuda/kernel/lookup.cuh
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename in_t, typename idx_t>
+__global__ void lookup1D(Param<in_t> out, CParam<in_t> in,
+                         CParam<idx_t> indices, int vDim) {
+    int idx = threadIdx.x + blockIdx.x * THREADS * THRD_LOAD;
+
+    const in_t* inPtr   = (const in_t*)in.ptr;
+    const idx_t* idxPtr = (const idx_t*)indices.ptr;
+
+    in_t* outPtr = (in_t*)out.ptr;
+
+    int en = min(out.dims[vDim], idx + THRD_LOAD * THREADS);
+
+    for (int oIdx = idx; oIdx < en; oIdx += THREADS) {
+        int iIdx     = trimIndex(static_cast<int>(idxPtr[oIdx]), in.dims[vDim]);
+        outPtr[oIdx] = inPtr[iIdx];
+    }
+}
+
+template<typename in_t, typename idx_t, unsigned dim>
+__global__ void lookupND(Param<in_t> out, CParam<in_t> in,
+                         CParam<idx_t> indices, int nBBS0, int nBBS1) {
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+
+    int gz = blockIdx.x / nBBS0;
+    int gw = (blockIdx.y + blockIdx.z * gridDim.y) / nBBS1;
+
+    int gx = blockDim.x * (blockIdx.x - gz * nBBS0) + lx;
+    int gy =
+        blockDim.y * ((blockIdx.y + blockIdx.z * gridDim.y) - gw * nBBS1) + ly;
+
+    const idx_t* idxPtr = (const idx_t*)indices.ptr;
+
+    int i = in.strides[0] *
+            (dim == 0 ? trimIndex((int)idxPtr[gx], in.dims[0]) : gx);
+    int j = in.strides[1] *
+            (dim == 1 ? trimIndex((int)idxPtr[gy], in.dims[1]) : gy);
+    int k = in.strides[2] *
+            (dim == 2 ? trimIndex((int)idxPtr[gz], in.dims[2]) : gz);
+    int l = in.strides[3] *
+            (dim == 3 ? trimIndex((int)idxPtr[gw], in.dims[3]) : gw);
+
+    const in_t* inPtr = (const in_t*)in.ptr + (i + j + k + l);
+    in_t* outPtr = (in_t*)out.ptr + (gx * out.strides[0] + gy * out.strides[1] +
+                                     gz * out.strides[2] + gw * out.strides[3]);
+
+    if (gx < out.dims[0] && gy < out.dims[1] && gz < out.dims[2] &&
+        gw < out.dims[3]) {
+        outPtr[0] = inPtr[0];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/lookup.hpp b/src/backend/cuda/kernel/lookup.hpp
index 1b23ff5b83..4d23596d6c 100644
--- a/src/backend/cuda/kernel/lookup.hpp
+++ b/src/backend/cuda/kernel/lookup.hpp
@@ -7,109 +7,74 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <math.hpp>
-#include <utility.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/lookup_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int THREADS = 256;
-
-static const int THREADS_X = 32;
-static const int THREADS_Y = 8;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-static const int THRD_LOAD = THREADS_X/THREADS_Y;
+constexpr int THREADS   = 256;
+constexpr int THREADS_X = 32;
+constexpr int THREADS_Y = 8;
+constexpr int THRD_LOAD = THREADS_X / THREADS_Y;
 
 template<typename in_t, typename idx_t>
-__global__
-void lookup1D(Param<in_t> out, CParam<in_t> in, CParam<idx_t> indices, int vDim)
-{
-    int idx = threadIdx.x + blockIdx.x * THREADS * THRD_LOAD;
-
-    const in_t* inPtr   = (const in_t*)in.ptr;
-    const idx_t* idxPtr = (const idx_t*)indices.ptr;
-
-    in_t* outPtr  = (in_t*)out.ptr;
-
-    int en = min(out.dims[vDim], idx + THRD_LOAD * THREADS);
-
-    for (int oIdx = idx; oIdx < en; oIdx += THREADS) {
-        int iIdx = trimIndex(idxPtr[oIdx], in.dims[vDim]);
-        outPtr[oIdx] = inPtr[iIdx];
-    }
-}
-
-template<typename in_t, typename idx_t, unsigned dim>
-__global__
-void lookupND(Param<in_t> out, CParam<in_t> in, CParam<idx_t> indices,
-                    int nBBS0, int nBBS1)
-{
-    int lx = threadIdx.x;
-    int ly = threadIdx.y;
-
-    int gz = blockIdx.x/nBBS0;
-    int gw = blockIdx.y/nBBS1;
-
-    int gx = blockDim.x * (blockIdx.x - gz*nBBS0) + lx;
-    int gy = blockDim.y * (blockIdx.y - gw*nBBS1) + ly;
-
-    const idx_t *idxPtr = (const idx_t*)indices.ptr;
-
-    int i = in.strides[0]*(dim==0 ? trimIndex((int)idxPtr[gx], in.dims[0]): gx);
-    int j = in.strides[1]*(dim==1 ? trimIndex((int)idxPtr[gy], in.dims[1]): gy);
-    int k = in.strides[2]*(dim==2 ? trimIndex((int)idxPtr[gz], in.dims[2]): gz);
-    int l = in.strides[3]*(dim==3 ? trimIndex((int)idxPtr[gw], in.dims[3]): gw);
-
-    const in_t *inPtr = (const in_t*)in.ptr + (i+j+k+l);
-    in_t *outPtr = (in_t*)out.ptr +(gx*out.strides[0]+gy*out.strides[1]+
-                                    gz*out.strides[2]+gw*out.strides[3]);
-
-    if (gx<out.dims[0] && gy<out.dims[1] && gz<out.dims[2] && gw<out.dims[3]) {
-        outPtr[0] = inPtr[0];
+void lookup(Param<in_t> out, CParam<in_t> in, CParam<idx_t> indices, int nDims,
+            unsigned dim) {
+    /* find the first dimension that has more than one element */
+    unsigned vDim = 0;
+    for (int i = 0; i < 4; i++) {
+        if (in.dims[i] == 1)
+            vDim++;
+        else
+            break;
     }
-}
 
-template<typename in_t, typename idx_t, unsigned dim>
-void lookup(Param<in_t> out, CParam<in_t> in, CParam<idx_t> indices, int nDims)
-{
-    if (nDims==1) {
+    if (dim == 0 && nDims == 1 && dim == vDim) {
         const dim3 threads(THREADS, 1);
-        /* find which dimension has non-zero # of elements */
-        int vDim = 0;
-        for (int i=0; i<4; i++) {
-            if (in.dims[i]==1)
-                vDim++;
-            else
-                break;
-        }
 
-        int blks = divup(out.dims[vDim], THREADS*THRD_LOAD);
+        int blks = divup(out.dims[vDim], THREADS * THRD_LOAD);
 
         dim3 blocks(blks, 1);
 
-        lookup1D<in_t, idx_t> <<<blocks, threads>>> (out, in, indices, vDim);
+        auto lookup1d = common::getKernel(
+            "arrayfire::cuda::lookup1D", {{lookup_cuh_src}},
+            TemplateArgs(TemplateTypename<in_t>(), TemplateTypename<idx_t>()),
+            {{DefineValue(THREADS), DefineValue(THRD_LOAD)}});
+
+        EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+        lookup1d(qArgs, out, in, indices, vDim);
     } else {
         const dim3 threads(THREADS_X, THREADS_Y);
 
         int blks_x = divup(out.dims[0], threads.x);
         int blks_y = divup(out.dims[1], threads.y);
 
-        dim3 blocks(blks_x*out.dims[2], blks_y*out.dims[3]);
+        dim3 blocks(blks_x * out.dims[2], blks_y * out.dims[3]);
 
-        lookupND<in_t, idx_t, dim> <<<blocks, threads>>> (out, in, indices, blks_x, blks_y);
-    }
+        const int maxBlocksY =
+            getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+        blocks.z = divup(blocks.y, maxBlocksY);
+        blocks.y = divup(blocks.y, blocks.z);
 
-    POST_LAUNCH_CHECK();
-}
+        auto lookupnd = common::getKernel(
+            "arrayfire::cuda::lookupND", {{lookup_cuh_src}},
+            TemplateArgs(TemplateTypename<in_t>(), TemplateTypename<idx_t>(),
+                         TemplateArg(dim)));
+        EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
+        lookupnd(qArgs, out, in, indices, blks_x, blks_y);
+    }
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
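Reviewer note: the ND path now folds an oversized `blocks.y` into `blocks.z` so the launch stays within `maxGridSize[1]`, and `lookupND` rebuilds the linear y block index as `blockIdx.y + blockIdx.z * gridDim.y`. The host-side sketch below reproduces that arithmetic so it can be checked without a device. `Grid2`, `splitBlocksY`, and `linearBlockY` are hypothetical names used only for illustration, and `divUp` is assumed to round up the same way the `divup` helper does.

```cpp
#include <cassert>

// Integer division rounded up (assumed to match the divup helper).
inline int divUp(int a, int b) { return (a + b - 1) / b; }

// Hypothetical stand-in for the y and z fields of dim3.
struct Grid2 {
    int y;
    int z;
};

// Split a logical y block count across grid y and z so that y <= maxBlocksY.
inline Grid2 splitBlocksY(int logicalY, int maxBlocksY) {
    assert(logicalY > 0 && maxBlocksY > 0);
    Grid2 g;
    g.z = divUp(logicalY, maxBlocksY);
    g.y = divUp(logicalY, g.z);
    return g;
}

// Linear y block index as the kernel reconstructs it on the device.
inline int linearBlockY(int blockIdxY, int blockIdxZ, int gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

The product `y * z` can exceed the logical block count by a few blocks because of the rounding. Those extra blocks compute a `gw` outside `out.dims[3]`, so the bounds check at the end of `lookupND` keeps them from writing.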
diff --git a/src/backend/cuda/kernel/lu_split.cuh b/src/backend/cuda/kernel/lu_split.cuh
new file mode 100644
index 0000000000..f2f892bbce
--- /dev/null
+++ b/src/backend/cuda/kernel/lu_split.cuh
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool same_dims>
+__global__ void luSplit(Param<T> lower, Param<T> upper, Param<T> in,
+                        const int blocksPerMatX, const int blocksPerMatY) {
+    const int oz = blockIdx.x / blocksPerMatX;
+    const int ow = blockIdx.y / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
+
+    const int xx = threadIdx.x + blockIdx_x * blockDim.x;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    T *d_l = lower.ptr;
+    T *d_u = upper.ptr;
+    T *d_i = in.ptr;
+
+    if (oz < in.dims[2] && ow < in.dims[3]) {
+        d_i = d_i + oz * in.strides[2] + ow * in.strides[3];
+        d_l = d_l + oz * lower.strides[2] + ow * lower.strides[3];
+        d_u = d_u + oz * upper.strides[2] + ow * upper.strides[3];
+
+        for (int oy = yy; oy < in.dims[1]; oy += incy) {
+            T *Yd_i = d_i + oy * in.strides[1];
+            T *Yd_l = d_l + oy * lower.strides[1];
+            T *Yd_u = d_u + oy * upper.strides[1];
+            for (int ox = xx; ox < in.dims[0]; ox += incx) {
+                if (ox > oy) {
+                    if (same_dims || oy < lower.dims[1]) Yd_l[ox] = Yd_i[ox];
+                    if (!same_dims || ox < upper.dims[0])
+                        Yd_u[ox] = scalar<T>(0);
+                } else if (oy > ox) {
+                    if (same_dims || oy < lower.dims[1])
+                        Yd_l[ox] = scalar<T>(0);
+                    if (!same_dims || ox < upper.dims[0]) Yd_u[ox] = Yd_i[ox];
+                } else if (ox == oy) {
+                    if (same_dims || oy < lower.dims[1])
+                        Yd_l[ox] = scalar<T>(1.0);
+                    if (!same_dims || ox < upper.dims[0]) Yd_u[ox] = Yd_i[ox];
+                }
+            }
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
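Reviewer note: `luSplit` walks every element of `in` and routes it to `lower` or `upper` by comparing its row index (`ox`) with its column index (`oy`). A host reference like the one below can be handy when checking the kernel on small inputs. It is a sketch, not ArrayFire code: `luSplitHost` is a hypothetical name, and it assumes a single contiguous column-major matrix with `lower` sized m x k and `upper` sized k x n, where k = min(m, n).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference. Element (row, col) of a matrix with r rows is
// stored at index row + col * r (column-major).
inline void luSplitHost(const std::vector<float>& in, int m, int n,
                        std::vector<float>& lower,
                        std::vector<float>& upper) {
    assert(static_cast<int>(in.size()) == m * n);
    const int k = std::min(m, n);
    lower.assign(static_cast<std::size_t>(m) * k, 0.0f);
    upper.assign(static_cast<std::size_t>(k) * n, 0.0f);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            const float v = in[row + col * m];
            if (row > col) {
                // Strictly below the diagonal: belongs to L if the column
                // exists in lower; U stays zero there.
                if (col < k) lower[row + col * m] = v;
            } else if (row == col) {
                lower[row + col * m] = 1.0f;  // unit diagonal of L
                upper[row + col * k] = v;
            } else {
                // Strictly above the diagonal: belongs to U; L stays zero.
                upper[row + col * k] = v;
            }
        }
    }
}
```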
diff --git a/src/backend/cuda/kernel/lu_split.hpp b/src/backend/cuda/kernel/lu_split.hpp
index 33182eab4f..467173c218 100644
--- a/src/backend/cuda/kernel/lu_split.hpp
+++ b/src/backend/cuda/kernel/lu_split.hpp
@@ -7,95 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/lu_split_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 128;
-        static const unsigned TILEY = 32;
-
-        template<typename T, bool same_dims>
-        __global__
-        void lu_split_kernel(Param<T> lower, Param<T> upper, Param<T> in,
-                             const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
+#include <array>
 
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+template<typename T>
+void lu_split(Param<T> lower, Param<T> upper, Param<T> in) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
 
-            T *d_l = lower.ptr;
-            T *d_u = upper.ptr;
-            T *d_i = in.ptr;
+    const bool sameDims =
+        lower.dims[0] == in.dims[0] && lower.dims[1] == in.dims[1];
 
-            if(oz < in.dims[2] && ow < in.dims[3]) {
-                d_i = d_i + oz * in.strides[2]    + ow * in.strides[3];
-                d_l = d_l + oz * lower.strides[2] + ow * lower.strides[3];
-                d_u = d_u + oz * upper.strides[2] + ow * upper.strides[3];
+    auto luSplit = common::getKernel(
+        "arrayfire::cuda::luSplit", {{lu_split_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(sameDims)));
 
-                for (int oy = yy; oy < in.dims[1]; oy += incy) {
-                    T *Yd_i = d_i + oy * in.strides[1];
-                    T *Yd_l = d_l +  oy * lower.strides[1];
-                    T *Yd_u = d_u +  oy * upper.strides[1];
-                    for (int ox = xx; ox < in.dims[0]; ox += incx) {
-                        if(ox > oy) {
-                            if(same_dims || oy < lower.dims[1])
-                                Yd_l[ox] = Yd_i[ox];
-                            if(!same_dims || ox < upper.dims[0])
-                                Yd_u[ox] = scalar<T>(0);
-                        } else if (oy > ox) {
-                            if(same_dims || oy < lower.dims[1])
-                                Yd_l[ox] = scalar<T>(0);
-                            if(!same_dims || ox < upper.dims[0])
-                                Yd_u[ox] = Yd_i[ox];
-                        } else if(ox == oy) {
-                            if(same_dims || oy < lower.dims[1])
-                                Yd_l[ox] = scalar<T>(1.0);
-                            if(!same_dims || ox < upper.dims[0])
-                                Yd_u[ox] = Yd_i[ox];
-                        }
-                    }
-                }
-            }
-        }
+    dim3 threads(TX, TY, 1);
 
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void lu_split(Param<T> lower, Param<T> upper, Param<T> in)
-        {
-            dim3 threads(TX, TY, 1);
+    int blocksPerMatX = divup(in.dims[0], TILEX);
+    int blocksPerMatY = divup(in.dims[1], TILEY);
+    dim3 blocks(blocksPerMatX * in.dims[2], blocksPerMatY * in.dims[3], 1);
 
-            int blocksPerMatX = divup(in.dims[0], TILEX);
-            int blocksPerMatY = divup(in.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * in.dims[2],
-                        blocksPerMatY * in.dims[3],
-                        1);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-            if(lower.dims[0] == in.dims[0] && lower.dims[1] == in.dims[1]) {
-                lu_split_kernel<T, true><<<blocks, threads>>>(lower, upper, in, blocksPerMatX, blocksPerMatY);
-            } else {
-                lu_split_kernel<T, false><<<blocks, threads>>>(lower, upper, in, blocksPerMatX, blocksPerMatY);
-            }
-            POST_LAUNCH_CHECK();
-        }
-    }
+    luSplit(qArgs, lower, upper, in, blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
 
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/match_template.cuh b/src/backend/cuda/kernel/match_template.cuh
new file mode 100644
index 0000000000..16cf172e1b
--- /dev/null
+++ b/src/backend/cuda/kernel/match_template.cuh
@@ -0,0 +1,122 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename inType, typename outType, af::matchType mType, bool needMean>
+__global__ void matchTemplate(Param<outType> out, CParam<inType> srch,
+                              CParam<inType> tmplt, int nBBS0, int nBBS1) {
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+
+    int gx = threadIdx.x + (blockIdx.x - b2 * nBBS0) * blockDim.x;
+    int gy = threadIdx.y + (blockIdx.y - b3 * nBBS1) * blockDim.y;
+
+    if (gx < srch.dims[0] && gy < srch.dims[1]) {
+        const int tDim0    = tmplt.dims[0];
+        const int tDim1    = tmplt.dims[1];
+        const int sDim0    = srch.dims[0];
+        const int sDim1    = srch.dims[1];
+        const inType* tptr = (const inType*)tmplt.ptr;
+        int winNumElems    = tDim0 * tDim1;
+
+        outType tImgMean = outType(0);
+        if (needMean) {
+            for (int tj = 0; tj < tDim1; tj++) {
+                int tjStride = tj * tmplt.strides[1];
+
+                for (int ti = 0; ti < tDim0; ti++) {
+                    tImgMean += (outType)tptr[tjStride + ti * tmplt.strides[0]];
+                }
+            }
+            tImgMean /= winNumElems;
+        }
+
+        const inType* sptr = (const inType*)srch.ptr +
+                             (b2 * srch.strides[2] + b3 * srch.strides[3]);
+        outType* optr =
+            (outType*)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+
+        // mean for window
+        // this variable will be used based on mType value
+        outType wImgMean = outType(0);
+        if (needMean) {
+            for (int tj = 0, j = gy; tj < tDim1; tj++, j++) {
+                int jStride = j * srch.strides[1];
+
+                for (int ti = 0, i = gx; ti < tDim0; ti++, i++) {
+                    inType sVal = ((j < sDim1 && i < sDim0)
+                                       ? sptr[jStride + i * srch.strides[0]]
+                                       : inType(0));
+                    wImgMean += (outType)sVal;
+                }
+            }
+            wImgMean /= winNumElems;
+        }
+
+        // run the window match metric
+        outType disparity = outType(0);
+
+        for (int tj = 0, j = gy; tj < tDim1; tj++, j++) {
+            int jStride  = j * srch.strides[1];
+            int tjStride = tj * tmplt.strides[1];
+
+            for (int ti = 0, i = gx; ti < tDim0; ti++, i++) {
+                inType sVal = ((j < sDim1 && i < sDim0)
+                                   ? sptr[jStride + i * srch.strides[0]]
+                                   : inType(0));
+                inType tVal = tptr[tjStride + ti * tmplt.strides[0]];
+
+                outType temp;
+                switch (mType) {
+                    case AF_SAD:
+                        disparity += fabs((outType)sVal - (outType)tVal);
+                        break;
+                    case AF_ZSAD:
+                        disparity += fabs((outType)sVal - wImgMean -
+                                          (outType)tVal + tImgMean);
+                        break;
+                    case AF_LSAD:
+                        disparity +=
+                            fabs((outType)sVal - (wImgMean / tImgMean) * tVal);
+                        break;
+                    case AF_SSD:
+                        disparity += ((outType)sVal - (outType)tVal) *
+                                     ((outType)sVal - (outType)tVal);
+                        break;
+                    case AF_ZSSD:
+                        temp = ((outType)sVal - wImgMean - (outType)tVal +
+                                tImgMean);
+                        disparity += temp * temp;
+                        break;
+                    case AF_LSSD:
+                        temp = ((outType)sVal - (wImgMean / tImgMean) * tVal);
+                        disparity += temp * temp;
+                        break;
+                    case AF_NCC:
+                        // TODO: future implementation
+                        break;
+                    case AF_ZNCC:
+                        // TODO: future implementation
+                        break;
+                    case AF_SHD:
+                        // TODO: future implementation
+                        break;
+                }
+            }
+        }
+        optr[gy * out.strides[1] + gx] = disparity;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
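Reviewer note: each thread of `matchTemplate` anchors the template at one search-image position `(gx, gy)`, treats samples past the image border as zero, and accumulates the selected metric. The host sketch below mirrors only the `AF_SAD` branch for a single 2D image. `sadHost` is a hypothetical name, and the sketch assumes contiguous column-major storage with unit stride along dimension 0, which the kernel does not require.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical host reference for the AF_SAD case (not ArrayFire code).
// Pixel (x, y) of an image with dim0 rows is stored at index x + y * dim0.
inline std::vector<float> sadHost(const std::vector<float>& srch, int sDim0,
                                  int sDim1, const std::vector<float>& tmplt,
                                  int tDim0, int tDim1) {
    assert(static_cast<int>(srch.size()) == sDim0 * sDim1);
    assert(static_cast<int>(tmplt.size()) == tDim0 * tDim1);
    std::vector<float> out(srch.size(), 0.0f);
    for (int gy = 0; gy < sDim1; ++gy) {
        for (int gx = 0; gx < sDim0; ++gx) {
            float disparity = 0.0f;
            for (int tj = 0, j = gy; tj < tDim1; ++tj, ++j) {
                for (int ti = 0, i = gx; ti < tDim0; ++ti, ++i) {
                    // Samples outside the search image are read as zero.
                    const float sVal = (j < sDim1 && i < sDim0)
                                           ? srch[i + j * sDim0]
                                           : 0.0f;
                    const float tVal = tmplt[ti + tj * tDim0];
                    disparity += std::fabs(sVal - tVal);
                }
            }
            out[gx + gy * sDim0] = disparity;
        }
    }
    return out;
}
```

The zero padding means positions near the far border compare part of the template against zeros, so their scores are not directly comparable with interior positions.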
diff --git a/src/backend/cuda/kernel/match_template.hpp b/src/backend/cuda/kernel/match_template.hpp
index e86c398b8f..a605eabab5 100644
--- a/src/backend/cuda/kernel/match_template.hpp
+++ b/src/backend/cuda/kernel/match_template.hpp
@@ -7,139 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/match_template_cuh.hpp>
+#include <af/defines.h>
 
-namespace cuda
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
-template<typename inType, typename outType, af_match_type mType, bool needMean>
-__global__
-void matchTemplate(Param<outType> out, CParam<inType> srch, CParam<inType> tmplt,
-                   int nBBS0, int nBBS1)
-{
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-
-    int gx = threadIdx.x + (blockIdx.x - b2*nBBS0) * blockDim.x;
-    int gy = threadIdx.y + (blockIdx.y - b3*nBBS1)* blockDim.y;
-
-    if (gx < srch.dims[0] && gy < srch.dims[1]) {
-
-        const int tDim0 = tmplt.dims[0];
-        const int tDim1 = tmplt.dims[1];
-        const int sDim0 = srch.dims[0];
-        const int sDim1 = srch.dims[1];
-        const inType* tptr   = (const inType*) tmplt.ptr;
-        int winNumElems = tDim0*tDim1;
-
-        outType tImgMean = outType(0);
-        if (needMean) {
-            for(int tj=0; tj<tDim1; tj++) {
-                int tjStride = tj*tmplt.strides[1];
-
-                for(int ti=0; ti<tDim0; ti++) {
-                    tImgMean += (outType)tptr[ tjStride + ti*tmplt.strides[0] ];
-                }
-            }
-            tImgMean /= winNumElems;
-        }
-
-        const inType* sptr  = (const inType*) srch.ptr + (b2 * srch.strides[2] + b3 * srch.strides[3]);
-        outType* optr       = (outType*) out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
-
-        // mean for window
-        // this variable will be used based on mType value
-        outType wImgMean = outType(0);
-        if (needMean) {
-            for(int tj=0,j=gy; tj<tDim1; tj++, j++) {
-                int jStride = j*srch.strides[1];
-
-                for(int ti=0, i=gx; ti<tDim0; ti++, i++) {
-                    inType sVal = ((j<sDim1 && i<sDim0) ? sptr[jStride + i*srch.strides[0]] : inType(0));
-                    wImgMean += (outType)sVal;
-                }
-            }
-            wImgMean /= winNumElems;
-        }
-
-        // run the window match metric
-        outType disparity = outType(0);
+template<typename inType, typename outType>
+void matchTemplate(Param<outType> out, CParam<inType> srch,
+                   CParam<inType> tmplt, const af::matchType mType,
+                   bool needMean) {
+    auto matchTemplate = common::getKernel(
+        "arrayfire::cuda::matchTemplate", {{match_template_cuh_src}},
+        TemplateArgs(TemplateTypename<inType>(), TemplateTypename<outType>(),
+                     TemplateArg(mType), TemplateArg(needMean)));
 
-        for(int tj=0,j=gy; tj<tDim1; tj++, j++) {
-
-            int jStride  = j*srch.strides[1];
-            int tjStride = tj*tmplt.strides[1];
-
-            for(int ti=0, i=gx; ti<tDim0; ti++, i++) {
-
-                inType sVal = ((j<sDim1 && i<sDim0) ? sptr[jStride + i*srch.strides[0]] : inType(0));
-                inType tVal = tptr[ tjStride + ti*tmplt.strides[0] ];
-
-                outType temp;
-                switch(mType) {
-                    case AF_SAD:
-                        disparity += fabs((outType)sVal-(outType)tVal);
-                        break;
-                    case AF_ZSAD:
-                        disparity += fabs((outType)sVal - wImgMean -
-                                (outType)tVal + tImgMean);
-                        break;
-                    case AF_LSAD:
-                        disparity += fabs((outType)sVal-(wImgMean/tImgMean)*tVal);
-                        break;
-                    case AF_SSD:
-                        disparity += ((outType)sVal-(outType)tVal)*((outType)sVal-(outType)tVal);
-                        break;
-                    case AF_ZSSD:
-                        temp = ((outType)sVal - wImgMean - (outType)tVal + tImgMean);
-                        disparity += temp*temp;
-                        break;
-                    case AF_LSSD:
-                        temp = ((outType)sVal-(wImgMean/tImgMean)*tVal);
-                        disparity += temp*temp;
-                        break;
-                    case AF_NCC:
-                        //TODO: furture implementation
-                        break;
-                    case AF_ZNCC:
-                        //TODO: furture implementation
-                        break;
-                    case AF_SHD:
-                        //TODO: furture implementation
-                        break;
-                }
-            }
-        }
-
-        optr[gy*out.strides[1]+gx] = disparity;
-    }
-}
-
-template<typename inType, typename outType, af_match_type mType, bool needMean>
-void matchTemplate(Param<outType> out, CParam<inType> srch, CParam<inType> tmplt)
-{
     const dim3 threads(THREADS_X, THREADS_Y);
 
     int blk_x = divup(srch.dims[0], threads.x);
     int blk_y = divup(srch.dims[1], threads.y);
 
-    dim3 blocks(blk_x*srch.dims[2], blk_y*srch.dims[3]);
-
-    matchTemplate<inType, outType, mType, needMean> <<< blocks, threads >>> (out, srch, tmplt, blk_x, blk_y);
+    dim3 blocks(blk_x * srch.dims[2], blk_y * srch.dims[3]);
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    matchTemplate(qArgs, out, srch, tmplt, blk_x, blk_y);
     POST_LAUNCH_CHECK();
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/mean.hpp b/src/backend/cuda/kernel/mean.hpp
new file mode 100644
index 0000000000..a26eeac7fd
--- /dev/null
+++ b/src/backend/cuda/kernel/mean.hpp
@@ -0,0 +1,599 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <cuda_fp16.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include "config.hpp"
+
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace cuda {
+
+__device__ auto operator*(float lhs, __half rhs) -> __half {
+    return __float2half(lhs * __half2float(rhs));
+}
+
+__device__ auto operator/(__half lhs, float rhs) -> __half {
+    return __float2half(__half2float(lhs) / rhs);
+}
+
+namespace kernel {
+
+template<typename To, typename Tw>
+__device__ __host__ void stable_mean(To *lhs, Tw *l_wt, To rhs, Tw r_wt) {
+    if (((*l_wt) != (Tw)0) || (r_wt != (Tw)0)) {
+        Tw l_scale = (*l_wt);
+        (*l_wt) += r_wt;
+        l_scale = l_scale / (*l_wt);
+
+        Tw r_scale = r_wt / (*l_wt);
+        (*lhs)     = (l_scale * *lhs) + (r_scale * rhs);
+    }
+}
+
+template<typename Ti, typename Tw, typename To, uint dim, uint DIMY>
+__global__ static void mean_dim_kernel(Param<To> out, Param<Tw> owt,
+                                       CParam<Ti> in, CParam<Tw> iwt,
+                                       uint blocks_x, uint blocks_y,
+                                       uint offset_dim) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * THREADS_X + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint wid        = blockIdx.y / blocks_y;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const uint xid        = blockIdx_x * blockDim.x + tidx;
+    const uint yid = blockIdx_y;  // yid of output; updated for input later.
+
+    uint ids[4] = {xid, yid, zid, wid};
+
+    const Ti *iptr  = in.ptr;
+    const Tw *iwptr = iwt.ptr;
+    To *optr        = out.ptr;
+    Tw *owptr       = owt.ptr;
+
+    int ooffset = ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+                  ids[1] * out.strides[1] + ids[0];
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+    optr += ooffset;
+    if (owptr != NULL) owptr += ooffset;
+
+    const uint blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y + tidy;
+
+    int ioffset = ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+                  ids[1] * in.strides[1] + ids[0];
+    iptr += ioffset;
+    if (iwptr != NULL) iwptr += ioffset;
+
+    const uint id_dim_in   = ids[dim];
+    const uint istride_dim = in.strides[dim];
+
+    bool is_valid = (ids[0] < in.dims[0]) && (ids[1] < in.dims[1]) &&
+                    (ids[2] < in.dims[2]) && (ids[3] < in.dims[3]);
+
+    common::Transform<Ti, compute_t<To>, af_add_t> transform;
+
+    compute_t<To> val    = common::Binary<compute_t<To>, af_add_t>::init();
+    compute_t<Tw> weight = common::Binary<compute_t<Tw>, af_add_t>::init();
+
+    if (is_valid && id_dim_in < in.dims[dim]) {
+        val = transform(*iptr);
+        if (iwptr != NULL) {
+            weight = *iwptr;
+        } else {
+            weight = (Tw)1;
+        }
+    }
+
+    const uint id_dim_in_start = id_dim_in + offset_dim * blockDim.y;
+
+    __shared__ compute_t<To> s_val[THREADS_X * DIMY];
+    __shared__ compute_t<Tw> s_idx[THREADS_X * DIMY];
+
+    for (int id = id_dim_in_start; is_valid && (id < in.dims[dim]);
+         id += offset_dim * blockDim.y) {
+        iptr = iptr + offset_dim * blockDim.y * istride_dim;
+        if (iwptr != NULL) {
+            iwptr = iwptr + offset_dim * blockDim.y * istride_dim;
+            stable_mean(&val, &weight, transform(*iptr), compute_t<Tw>(*iwptr));
+        } else {
+            // Faster version of stable_mean when iwptr is NULL
+            val    = val + (transform(*iptr) - val) / (weight + (Tw)1);
+            weight = weight + (Tw)1;
+        }
+    }
+
+    s_val[tid] = val;
+    s_idx[tid] = weight;
+
+    compute_t<To> *s_vptr = s_val + tid;
+    compute_t<Tw> *s_iptr = s_idx + tid;
+    __syncthreads();
+
+    if (DIMY == 8) {
+        if (tidy < 4) {
+            stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 4],
+                        s_iptr[THREADS_X * 4]);
+        }
+        __syncthreads();
+    }
+
+    if (DIMY >= 4) {
+        if (tidy < 2) {
+            stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 2],
+                        s_iptr[THREADS_X * 2]);
+        }
+        __syncthreads();
+    }
+
+    if (DIMY >= 2) {
+        if (tidy < 1) {
+            stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 1],
+                        s_iptr[THREADS_X * 1]);
+        }
+        __syncthreads();
+    }
+
+    if (tidy == 0 && is_valid && (blockIdx_dim < out.dims[dim])) {
+        *optr = *s_vptr;
+        if (owptr != NULL) *owptr = *s_iptr;
+    }
+}
+
+template<typename Ti, typename Tw, typename To, int dim>
+void mean_dim_launcher(Param<To> out, Param<Tw> owt, CParam<Ti> in,
+                       CParam<Tw> iwt, const uint threads_y,
+                       const dim_t blocks_dim[4]) {
+    dim3 threads(THREADS_X, threads_y);
+
+    dim3 blocks(blocks_dim[0] * blocks_dim[2], blocks_dim[1] * blocks_dim[3]);
+
+    switch (threads_y) {
+        case 8:
+            CUDA_LAUNCH((mean_dim_kernel<Ti, Tw, To, dim, 8>), blocks, threads,
+                        out, owt, in, iwt, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim]);
+            break;
+        case 4:
+            CUDA_LAUNCH((mean_dim_kernel<Ti, Tw, To, dim, 4>), blocks, threads,
+                        out, owt, in, iwt, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim]);
+            break;
+        case 2:
+            CUDA_LAUNCH((mean_dim_kernel<Ti, Tw, To, dim, 2>), blocks, threads,
+                        out, owt, in, iwt, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim]);
+            break;
+        case 1:
+            CUDA_LAUNCH((mean_dim_kernel<Ti, Tw, To, dim, 1>), blocks, threads,
+                        out, owt, in, iwt, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim]);
+            break;
+    }
+
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tw, typename To, int dim>
+void mean_dim(Param<To> out, CParam<Ti> in, CParam<Tw> iwt) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    dim_t blocks_dim[] = {divup(in.dims[0], threads_x), in.dims[1], in.dims[2],
+                          in.dims[3]};
+
+    blocks_dim[dim] = divup(in.dims[dim], threads_y * REPEAT);
+
+    Array<To> tmpOut = createEmptyArray<To>(dim4());
+    Array<Tw> tmpWt  = createEmptyArray<Tw>(dim4());
+
+    if (blocks_dim[dim] > 1) {
+        dim4 dims(4, out.dims);
+        dims[dim] = blocks_dim[dim];
+        tmpOut    = createEmptyArray<To>(dims);
+        tmpWt     = createEmptyArray<Tw>(dims);
+    } else {
+        tmpOut = createParamArray(out, false);
+    }
+
+    mean_dim_launcher<Ti, Tw, To, dim>(tmpOut, tmpWt, in, iwt, threads_y,
+                                       blocks_dim);
+
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
+
+        Array<Tw> owt = createEmptyArray<Tw>(dim4());
+        mean_dim_launcher<To, Tw, To, dim>(out, owt, tmpOut, tmpWt, threads_y,
+                                           blocks_dim);
+    }
+}
+
+template<typename T, typename Tw>
+__device__ void warp_reduce(T *s_ptr, Tw *s_idx, uint tidx) {
+#pragma unroll
+    for (int n = 16; n >= 1; n >>= 1) {
+        if (tidx < n) {
+            stable_mean(s_ptr + tidx, s_idx + tidx, s_ptr[tidx + n],
+                        s_idx[tidx + n]);
+        }
+        __syncthreads();
+    }
+}
+
+// Calculate mean along the first dimension. If iwt is an empty CParam, use a
+// weight of 1 and treat it as a count. If owt is an empty Param, do not write
+// temporary reduced counts/weights to it.
+template<typename Ti, typename Tw, typename To, uint DIMX>
+__global__ static void mean_first_kernel(Param<To> out, Param<Tw> owt,
+                                         CParam<Ti> in, CParam<Tw> iwt,
+                                         uint blocks_x, uint blocks_y,
+                                         uint repeat) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * blockDim.x + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint wid        = blockIdx.y / blocks_y;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const uint xid        = blockIdx_x * blockDim.x * repeat + tidx;
+    const uint yid        = blockIdx_y * blockDim.y + tidy;
+
+    const Ti *iptr  = in.ptr;
+    const Tw *iwptr = iwt.ptr;
+    To *optr        = out.ptr;
+    Tw *owptr       = owt.ptr;
+
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    if (iwptr != NULL)
+        iwptr +=
+            wid * iwt.strides[3] + zid * iwt.strides[2] + yid * iwt.strides[1];
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+    if (owptr != NULL)
+        owptr +=
+            wid * owt.strides[3] + zid * owt.strides[2] + yid * owt.strides[1];
+
+    if (yid >= in.dims[1] || zid >= in.dims[2] || wid >= in.dims[3]) return;
+
+    int lim = min((int)(xid + repeat * DIMX), in.dims[0]);
+
+    common::Transform<Ti, compute_t<To>, af_add_t> transform;
+
+    compute_t<To> val    = common::Binary<compute_t<To>, af_add_t>::init();
+    compute_t<Tw> weight = common::Binary<compute_t<Tw>, af_add_t>::init();
+
+    if (xid < lim) {
+        val = transform(iptr[xid]);
+        if (iwptr != NULL) {
+            weight = iwptr[xid];
+        } else {
+            weight = (Tw)1;
+        }
+    }
+
+    __shared__ compute_t<To> s_val[THREADS_PER_BLOCK];
+    __shared__ compute_t<Tw> s_idx[THREADS_PER_BLOCK];
+
+    if (iwptr != NULL) {
+        for (int id = xid + DIMX; id < lim; id += DIMX) {
+            stable_mean(&val, &weight, transform(iptr[id]),
+                        compute_t<Tw>(iwptr[id]));
+        }
+    } else {
+        for (int id = xid + DIMX; id < lim; id += DIMX) {
+            // Faster version of stable_mean when iwptr is NULL
+            val    = val + (transform(iptr[id]) - val) / (weight + (Tw)1);
+            weight = weight + (Tw)1;
+        }
+    }
+
+    s_val[tid] = val;
+    s_idx[tid] = weight;
+    __syncthreads();
+
+    compute_t<To> *s_vptr = s_val + tidy * DIMX;
+    compute_t<Tw> *s_iptr = s_idx + tidy * DIMX;
+
+    if (DIMX == 256) {
+        if (tidx < 128) {
+            stable_mean(s_vptr + tidx, s_iptr + tidx, s_vptr[tidx + 128],
+                        s_iptr[tidx + 128]);
+        }
+        __syncthreads();
+    }
+
+    if (DIMX >= 128) {
+        if (tidx < 64) {
+            stable_mean(s_vptr + tidx, s_iptr + tidx, s_vptr[tidx + 64],
+                        s_iptr[tidx + 64]);
+        }
+        __syncthreads();
+    }
+
+    if (DIMX >= 64) {
+        if (tidx < 32) {
+            stable_mean(s_vptr + tidx, s_iptr + tidx, s_vptr[tidx + 32],
+                        s_iptr[tidx + 32]);
+        }
+        __syncthreads();
+    }
+
+    warp_reduce<compute_t<To>, compute_t<Tw>>(s_vptr, s_iptr, tidx);
+
+    if (tidx == 0) {
+        optr[blockIdx_x] = s_vptr[0];
+        if (owptr != NULL) owptr[blockIdx_x] = s_iptr[0];
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean_first_launcher(Param<To> out, Param<Tw> owt, CParam<Ti> in,
+                         CParam<Tw> iwt, const uint blocks_x,
+                         const uint blocks_y, const uint threads_x) {
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * in.dims[2], blocks_y * in.dims[3]);
+
+    uint repeat = divup(in.dims[0], (blocks_x * threads_x));
+
+    switch (threads_x) {
+        case 32:
+            CUDA_LAUNCH((mean_first_kernel<Ti, Tw, To, 32>), blocks, threads,
+                        out, owt, in, iwt, blocks_x, blocks_y, repeat);
+            break;
+        case 64:
+            CUDA_LAUNCH((mean_first_kernel<Ti, Tw, To, 64>), blocks, threads,
+                        out, owt, in, iwt, blocks_x, blocks_y, repeat);
+            break;
+        case 128:
+            CUDA_LAUNCH((mean_first_kernel<Ti, Tw, To, 128>), blocks, threads,
+                        out, owt, in, iwt, blocks_x, blocks_y, repeat);
+            break;
+        case 256:
+            CUDA_LAUNCH((mean_first_kernel<Ti, Tw, To, 256>), blocks, threads,
+                        out, owt, in, iwt, blocks_x, blocks_y, repeat);
+            break;
+    }
+
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean_first(Param<To> out, CParam<Ti> in, CParam<Tw> iwt) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.dims[1], threads_y);
+
+    Array<To> tmpOut = createEmptyArray<To>(dim4());
+    Array<Tw> tmpWt  = createEmptyArray<Tw>(dim4());
+    if (blocks_x > 1) {
+        tmpOut = createEmptyArray<To>(
+            {blocks_x, in.dims[1], in.dims[2], in.dims[3]});
+        tmpWt = createEmptyArray<Tw>(
+            {blocks_x, in.dims[1], in.dims[2], in.dims[3]});
+    } else {
+        tmpOut = createParamArray(out, false);
+    }
+
+    mean_first_launcher<Ti, Tw, To>(tmpOut, tmpWt, in, iwt, blocks_x, blocks_y,
+                                    threads_x);
+
+    if (blocks_x > 1) {
+        Param<Tw> owt;
+        owt.ptr = NULL;
+        mean_first_launcher<To, Tw, To>(out, owt, tmpOut, tmpWt, 1, blocks_y,
+                                        threads_x);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean_weighted(Param<To> out, CParam<Ti> in, CParam<Tw> iwt, int dim) {
+    switch (dim) {
+        case 0: return mean_first<Ti, Tw, To>(out, in, iwt);
+        case 1: return mean_dim<Ti, Tw, To, 1>(out, in, iwt);
+        case 2: return mean_dim<Ti, Tw, To, 2>(out, in, iwt);
+        case 3: return mean_dim<Ti, Tw, To, 3>(out, in, iwt);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean(Param<To> out, CParam<Ti> in, int dim) {
+    Param<Tw> dummy_weight;
+    mean_weighted<Ti, Tw, To>(out, in, dummy_weight, dim);
+}
+
+template<typename T, typename Tw>
+T mean_all_weighted(CParam<T> in, CParam<Tw> iwt) {
+    int in_elements = in.dims[0] * in.dims[1] * in.dims[2] * in.dims[3];
+
+    // FIXME: Replace the fixed 4096-element threshold with a better heuristic
+    if (in_elements > 4096) {
+        bool in_is_linear = (in.strides[0] == 1);
+        bool wt_is_linear = (iwt.strides[0] == 1);
+        for (int k = 1; k < 4; k++) {
+            in_is_linear &=
+                (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
+            wt_is_linear &=
+                (iwt.strides[k] == (iwt.strides[k - 1] * iwt.dims[k - 1]));
+        }
+
+        if (in_is_linear && wt_is_linear) {
+            in.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.dims[k]    = 1;
+                in.strides[k] = in_elements;
+            }
+
+            for (int k = 0; k < 4; k++) {
+                iwt.dims[k]    = in.dims[k];
+                iwt.strides[k] = in.strides[k];
+            }
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+        uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+        uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+        uint blocks_y = divup(in.dims[1], threads_y);
+
+        Array<T> tmpOut =
+            createEmptyArray<T>({blocks_x, in.dims[1], in.dims[2], in.dims[3]});
+        Array<Tw> tmpWt = createEmptyArray<Tw>(
+            {blocks_x, in.dims[1], in.dims[2], in.dims[3]});
+
+        int tmp_elements = tmpOut.elements();
+
+        mean_first_launcher<T, Tw, T>(tmpOut, tmpWt, in, iwt, blocks_x,
+                                      blocks_y, threads_x);
+
+        std::vector<T> h_ptr(tmp_elements);
+        std::vector<Tw> h_wptr(tmp_elements);
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_ptr.data(), tmpOut.get(), tmp_elements * sizeof(T),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_wptr.data(), tmpWt.get(), tmp_elements * sizeof(Tw),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaStreamSynchronize(getStream(getActiveDeviceId())));
+
+        compute_t<T> val     = static_cast<compute_t<T>>(h_ptr[0]);
+        compute_t<Tw> weight = static_cast<compute_t<Tw>>(h_wptr[0]);
+
+        for (int i = 1; i < tmp_elements; i++) {
+            stable_mean(&val, &weight, compute_t<T>(h_ptr[i]),
+                        compute_t<Tw>(h_wptr[i]));
+        }
+
+        return static_cast<T>(val);
+    } else {
+        std::vector<T> h_ptr(in_elements);
+        std::vector<Tw> h_wptr(in_elements);
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_ptr.data(), in.ptr, in_elements * sizeof(T),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_wptr.data(), iwt.ptr, in_elements * sizeof(Tw),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaStreamSynchronize(getStream(getActiveDeviceId())));
+
+        compute_t<T> val     = static_cast<compute_t<T>>(h_ptr[0]);
+        compute_t<Tw> weight = static_cast<compute_t<Tw>>(h_wptr[0]);
+        for (int i = 1; i < in_elements; i++) {
+            stable_mean(&val, &weight, compute_t<T>(h_ptr[i]),
+                        compute_t<Tw>(h_wptr[i]));
+        }
+
+        return static_cast<T>(val);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+To mean_all(CParam<Ti> in) {
+    using std::unique_ptr;
+    int in_elements = in.dims[0] * in.dims[1] * in.dims[2] * in.dims[3];
+    bool is_linear  = (in.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
+    }
+
+    // FIXME: Replace the fixed 4096-element threshold with a better heuristic
+    if (in_elements > 4096 || !is_linear) {
+        if (is_linear) {
+            in.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.dims[k]    = 1;
+                in.strides[k] = in_elements;
+            }
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+        uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+        uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+        uint blocks_y = divup(in.dims[1], threads_y);
+
+        dim4 outDims(blocks_x, in.dims[1], in.dims[2], in.dims[3]);
+
+        Array<To> tmpOut = createEmptyArray<To>(outDims);
+        Array<Tw> tmpCt  = createEmptyArray<Tw>(outDims);
+
+        Param<Tw> iwt;
+        mean_first_launcher<Ti, Tw, To>(tmpOut, tmpCt, in, iwt, blocks_x,
+                                        blocks_y, threads_x);
+
+        int tmp_elements = tmpOut.elements();
+        std::vector<To> h_ptr(tmp_elements);
+        std::vector<Tw> h_cptr(tmp_elements);
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_ptr.data(), tmpOut.get(), tmp_elements * sizeof(To),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_cptr.data(), tmpCt.get(), tmp_elements * sizeof(Tw),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaStreamSynchronize(getStream(getActiveDeviceId())));
+
+        compute_t<To> val    = static_cast<compute_t<To>>(h_ptr[0]);
+        compute_t<Tw> weight = static_cast<compute_t<Tw>>(h_cptr[0]);
+
+        for (int i = 1; i < tmp_elements; i++) {
+            stable_mean(&val, &weight, compute_t<To>(h_ptr[i]),
+                        compute_t<Tw>(h_cptr[i]));
+        }
+
+        return static_cast<To>(val);
+    } else {
+        std::vector<Ti> h_ptr(in_elements);
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            h_ptr.data(), in.ptr, in_elements * sizeof(Ti),
+            cudaMemcpyDeviceToHost, getStream(getActiveDeviceId())));
+        CUDA_CHECK(cudaStreamSynchronize(getStream(getActiveDeviceId())));
+
+        common::Transform<Ti, compute_t<To>, af_add_t> transform;
+        compute_t<Tw> count = static_cast<compute_t<Tw>>(1);
+
+        compute_t<To> val    = transform(h_ptr[0]);
+        compute_t<Tw> weight = count;
+        for (int i = 1; i < in_elements; i++) {
+            stable_mean(&val, &weight, transform(h_ptr[i]), count);
+        }
+
+        return static_cast<To>(val);
+    }
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
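Review note on `mean.hpp`: the kernels merge partial results as (mean, weight) pairs through `stable_mean` instead of accumulating a raw sum and dividing at the end. The following is a minimal host-side sketch of that idea, assuming plain `double` data and unit weights. The names `merge_mean` and `chunked_mean` are illustrative and are not part of ArrayFire.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fold the pair (rhs, r_wt) into the running pair (lhs, l_wt).
// Same update as stable_mean: each mean is rescaled by its share of the
// combined weight, so no large running sum is ever formed.
template<typename T, typename W>
void merge_mean(T &lhs, W &l_wt, T rhs, W r_wt) {
    if (l_wt != W(0) || r_wt != W(0)) {
        W l_scale = l_wt;
        l_wt += r_wt;
        l_scale   = l_scale / l_wt;
        W r_scale = r_wt / l_wt;
        lhs       = l_scale * lhs + r_scale * rhs;
    }
}

// Two-pass mean (hypothetical helper): reduce each chunk to a
// (mean, count) pair, then merge the pairs. This loosely mirrors the
// per-block pass followed by the final pass over the temporaries.
double chunked_mean(const std::vector<double> &data, std::size_t chunk) {
    assert(chunk > 0);
    double total_mean = 0.0, total_wt = 0.0;
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        double part_mean = 0.0, part_wt = 0.0;
        const std::size_t stop = std::min(start + chunk, data.size());
        for (std::size_t i = start; i < stop; ++i) {
            merge_mean(part_mean, part_wt, data[i], 1.0);
        }
        merge_mean(total_mean, total_wt, part_mean, part_wt);
    }
    return total_mean;
}
```

Because every partial result carries its own weight, chunks of unequal size (such as the last, shorter chunk) are merged correctly without special handling.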
diff --git a/src/backend/cuda/kernel/meanshift.cuh b/src/backend/cuda/kernel/meanshift.cuh
new file mode 100644
index 0000000000..240c853f46
--- /dev/null
+++ b/src/backend/cuda/kernel/meanshift.cuh
@@ -0,0 +1,130 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename AccType, typename T, int channels>
+__global__ void meanshift(Param<T> out, CParam<T> in, int radius, float cvar,
+                          uint numIters, int nBBS0, int nBBS1) {
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+    const T* iptr =
+        (const T*)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    T* optr      = (T*)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + threadIdx.x;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + threadIdx.y;
+
+    if (gx >= in.dims[0] || gy >= in.dims[1]) return;
+
+    int meanPosI = gx;
+    int meanPosJ = gy;
+
+    T currentCenterColors[channels];
+    T tempColors[channels];
+
+    AccType currentMeanColors[channels];
+
+#pragma unroll
+    for (int ch = 0; ch < channels; ++ch)
+        currentCenterColors[ch] = iptr[(
+            gx * in.strides[0] + gy * in.strides[1] + ch * in.strides[2])];
+
+    const int dim0LenLmt = in.dims[0] - 1;
+    const int dim1LenLmt = in.dims[1] - 1;
+
+    // scope of meanshift iterations begin
+    for (uint it = 0; it < numIters; ++it) {
+        int oldMeanPosJ = meanPosJ;
+        int oldMeanPosI = meanPosI;
+        unsigned count  = 0;
+
+        int shift_x = 0;
+        int shift_y = 0;
+
+#pragma unroll
+        for (int ch = 0; ch < channels; ++ch) currentMeanColors[ch] = 0;
+
+        for (int wj = -radius; wj <= radius; ++wj) {
+            int hit_count = 0;
+            int tj        = meanPosJ + wj;
+
+            if (tj < 0 || tj > dim1LenLmt) continue;
+
+            for (int wi = -radius; wi <= radius; ++wi) {
+                int ti = meanPosI + wi;
+
+                if (ti < 0 || ti > dim0LenLmt) continue;
+
+                AccType norm = 0;
+#pragma unroll
+                for (int ch = 0; ch < channels; ++ch) {
+                    tempColors[ch] =
+                        iptr[(ti * in.strides[0] + tj * in.strides[1] +
+                              ch * in.strides[2])];
+                    AccType diff = (AccType)currentCenterColors[ch] -
+                                   (AccType)tempColors[ch];
+                    norm += (diff * diff);
+                }
+
+                if (norm <= cvar) {
+#pragma unroll
+                    for (int ch = 0; ch < channels; ++ch)
+                        currentMeanColors[ch] += (AccType)tempColors[ch];
+
+                    shift_x += ti;
+                    ++hit_count;
+                }
+            }
+            count += hit_count;
+            shift_y += tj * hit_count;
+        }
+
+        if (count == 0) break;
+
+        const AccType fcount = 1 / (AccType)count;
+
+        meanPosI = __float2int_rz(shift_x * fcount);
+        meanPosJ = __float2int_rz(shift_y * fcount);
+
+#pragma unroll
+        for (int ch = 0; ch < channels; ++ch)
+            currentMeanColors[ch] =
+                __float2int_rz(currentMeanColors[ch] * fcount);
+
+        AccType norm = 0;
+#pragma unroll
+        for (int ch = 0; ch < channels; ++ch) {
+            AccType diff =
+                (AccType)currentCenterColors[ch] - currentMeanColors[ch];
+            norm += (diff * diff);
+        }
+
+        bool stop = (meanPosJ == oldMeanPosJ && meanPosI == oldMeanPosI) ||
+                    ((abs(oldMeanPosJ - meanPosJ) +
+                      abs(oldMeanPosI - meanPosI) + norm) <= 1);
+
+#pragma unroll
+        for (int ch = 0; ch < channels; ++ch)
+            currentCenterColors[ch] = (T)(currentMeanColors[ch]);
+
+        if (stop) break;
+    }  // scope of meanshift iterations end
+
+#pragma unroll
+    for (int ch = 0; ch < channels; ++ch)
+        optr[(gx * out.strides[0] + gy * out.strides[1] +
+              ch * out.strides[2])] = currentCenterColors[ch];
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
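Review note on `meanshift.cuh`: the new kernel reads neighbours straight from global memory and tracks an absolute window centre (`meanPosI`, `meanPosJ`) instead of a shared-memory tile. A single-channel CPU sketch of the per-pixel loop may help when checking the kernel by hand. It assumes a dense image indexed as `img[x + y * w]`, and it divides by the hit count rather than multiplying by a reciprocal. `ShiftResult` and `meanshift_pixel` are hypothetical names used only for this sketch.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

struct ShiftResult {
    int x;        // final window centre along dim 0
    int y;        // final window centre along dim 1
    float color;  // filtered value for the pixel
};

// Single-channel reference for one pixel (illustrative, not ArrayFire API).
ShiftResult meanshift_pixel(const std::vector<float> &img, int w, int h,
                            int gx, int gy, int radius, float cvar,
                            unsigned numIters) {
    assert(gx >= 0 && gx < w && gy >= 0 && gy < h);
    int posX = gx, posY = gy;
    float center = img[gx + gy * w];

    for (unsigned it = 0; it < numIters; ++it) {
        const int oldX = posX, oldY = posY;
        int count = 0, sumX = 0, sumY = 0;
        float sumColor = 0.0f;

        for (int wj = -radius; wj <= radius; ++wj) {
            const int tj = posY + wj;
            if (tj < 0 || tj >= h) continue;  // row outside the image
            for (int wi = -radius; wi <= radius; ++wi) {
                const int ti = posX + wi;
                if (ti < 0 || ti >= w) continue;  // column outside the image
                const float v    = img[ti + tj * w];
                const float diff = center - v;
                if (diff * diff <= cvar) {  // colour gate
                    sumColor += v;
                    sumX += ti;
                    sumY += tj;
                    ++count;
                }
            }
        }
        if (count == 0) break;

        // Truncate toward zero, as __float2int_rz does in the kernel.
        posX = static_cast<int>(static_cast<float>(sumX) / count);
        posY = static_cast<int>(static_cast<float>(sumY) / count);
        const float mean =
            static_cast<float>(static_cast<int>(sumColor / count));

        const float diff = center - mean;
        const bool stop  = (posX == oldX && posY == oldY) ||
                          (std::abs(oldX - posX) + std::abs(oldY - posY) +
                               diff * diff <= 1.0f);
        center = mean;
        if (stop) break;
    }
    return {posX, posY, center};
}
```

On a step edge such as `{0, 0, 0, 100, 100, 100}`, the colour gate keeps the bright pixels out of the average for a dark centre pixel, which is what lets the filter smooth regions without blurring across edges.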
diff --git a/src/backend/cuda/kernel/meanshift.hpp b/src/backend/cuda/kernel/meanshift.hpp
index 11ff10da9e..600f456fb9 100644
--- a/src/backend/cuda/kernel/meanshift.hpp
+++ b/src/backend/cuda/kernel/meanshift.hpp
@@ -7,218 +7,50 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-#include "shared.hpp"
+#include <nvrtc_kernel_headers/meanshift_cuh.hpp>
 
-namespace cuda
-{
+#include <array>
+#include <type_traits>
 
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
-__forceinline__ __device__
-int lIdx(int x, int y,
-              int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
-
-__forceinline__ __device__
-int clamp(int f, int a, int b)
-{
-    return max(a, min(f, b));
-}
-
-template<typename T, int channels>
-inline __device__
-void load2ShrdMem(T * shrd, const T * in,
-                  int lx, int ly,
-                  int shrdStride, int schStride,
-                  int dim0, int dim1,
-                  int gx, int gy,
-                  int ichStride, int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-#pragma unroll
-    for(int ch=0; ch<channels; ++ch)
-        shrd[lIdx(lx, ly, shrdStride, 1)+ch*schStride] = in[lIdx(gx_, gy_, inStride1, inStride0)+ch*ichStride];
-}
-
-template<typename T, int channels>
-static __global__
-void meanshiftKernel(Param<T> out, CParam<T> in,
-                     float space_, int radius, float cvar,
-                     uint iter, int nBBS0, int nBBS1)
-{
-    SharedMemory<T> shared;
-    T * shrdMem = shared.getPointer();
-
-    // calculate necessary offset and window parameters
-    const int padding     = 2*radius + 1;
-    const int wind_len    = padding - 1;
-    const int shrdLen     = blockDim.x + padding;
-    const int schStride   = shrdLen*(blockDim.y + padding);
-    // the variable ichStride will only effect when we have >1
-    // channels. in the other cases, the expression in question
-    // will not use the variable
-    const int ichStride   = in.strides[2];
-
-    // gfor batch offsets
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-    const T* iptr = (const T *) in.ptr + (b2 *  in.strides[2] + b3 *  in.strides[3]);
-    T*       optr = (T *      )out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
-
-    const int lx = threadIdx.x;
-    const int ly = threadIdx.y;
-
-    const int gx = blockDim.x * (blockIdx.x-b2*nBBS0) + lx;
-    const int gy = blockDim.y * (blockIdx.y-b3*nBBS1) + ly;
-
-    int gx2 = gx + blockDim.x;
-    int gy2 = gy + blockDim.y;
-    int lx2 = lx + blockDim.x;
-    int ly2 = ly + blockDim.y;
-    int i   = lx + radius;
-    int j   = ly + radius;
-
-    // pull image to local memory
-    load2ShrdMem<T, channels>(shrdMem, iptr, lx, ly, shrdLen, schStride,
-                              in.dims[0], in.dims[1], gx-radius,
-                              gy-radius, ichStride, in.strides[1], in.strides[0]);
-    if (lx<wind_len) {
-        load2ShrdMem<T, channels>(shrdMem, iptr, lx2, ly, shrdLen, schStride,
-                                  in.dims[0], in.dims[1], gx2-radius,
-                                  gy-radius, ichStride, in.strides[1], in.strides[0]);
-    }
-    if (ly<wind_len) {
-        load2ShrdMem<T, channels>(shrdMem, iptr, lx, ly2, shrdLen, schStride,
-                                  in.dims[0], in.dims[1], gx-radius,
-                                  gy2-radius, ichStride, in.strides[1], in.strides[0]);
-    }
-    if (lx<wind_len && ly<wind_len) {
-        load2ShrdMem<T, channels>(shrdMem, iptr, lx2, ly2, shrdLen, schStride,
-                                  in.dims[0], in.dims[1], gx2-radius,
-                                  gy2-radius, ichStride, in.strides[1], in.strides[0]);
-    }
-    __syncthreads();
-
-    if (gx>=in.dims[0] || gy>=in.dims[1])
-        return;
-
-    float means[channels];
-    float centers[channels];
-    float tmpclrs[channels];
-
-    // clear means and centers for this pixel
-#pragma unroll
-    for(int ch=0; ch<channels; ++ch) {
-        means[ch] = 0.0f;
-        centers[ch] = shrdMem[lIdx(i, j, shrdLen, 1)+ch*schStride];
-    }
-
-    // scope of meanshift iterationd begin
-    for(uint it=0; it<iter; ++it) {
-
-        int count   = 0;
-        int shift_x = 0;
-        int shift_y = 0;
-
-        for(int wj=-radius; wj<=radius; ++wj) {
-            int hit_count = 0;
+template<typename T>
+void meanshift(Param<T> out, CParam<T> in, const float spatialSigma,
+               const float chromaticSigma, const uint numIters, bool IsColor) {
+    typedef typename std::conditional<std::is_same<T, double>::value, double,
+                                      float>::type AccType;
+    auto meanshift = common::getKernel(
+        "arrayfire::cuda::meanshift", {{meanshift_cuh_src}},
+        TemplateArgs(TemplateTypename<AccType>(), TemplateTypename<T>(),
+                     TemplateArg((IsColor ? 3 : 1))  // channels
+                     ));
 
-            for(int wi=-radius; wi<=radius; ++wi) {
-
-                int tj = j + wj;
-                int ti = i + wi;
-
-                // proceed
-                float norm = 0.0f;
-#pragma unroll
-                for(int ch=0; ch<channels; ++ch) {
-                    tmpclrs[ch] = shrdMem[lIdx(ti, tj, shrdLen, 1)+ch*schStride];
-                    norm += (centers[ch]-tmpclrs[ch]) * (centers[ch]-tmpclrs[ch]);
-                }
-
-                if (norm<= cvar) {
-#pragma unroll
-                    for(int ch=0; ch<channels; ++ch)
-                        means[ch] += tmpclrs[ch];
-
-                    shift_x += wi;
-                    ++hit_count;
-                }
-            }
-            count+= hit_count;
-            shift_y += wj*hit_count;
-        }
-
-        if (count==0) { break; }
-
-        const float fcount = 1.f/count;
-        const int mean_x = (int)(shift_x*fcount+0.5f);
-        const int mean_y = (int)(shift_y*fcount+0.5f);
-#pragma unroll
-        for(int ch=0; ch<channels; ++ch)
-            means[ch] *= fcount;
-
-        float norm = 0.f;
-#pragma unroll
-        for(int ch=0; ch<channels; ++ch)
-            norm += ((means[ch]-centers[ch])*(means[ch]-centers[ch]));
-
-        bool stop = ((abs(shift_y-mean_y)+abs(shift_x-mean_x)) + norm) <= 1;
-        shift_x = mean_x;
-        shift_y = mean_y;
-
-#pragma unroll
-        for(int ch=0; ch<channels; ++ch)
-            centers[ch] = means[ch];
-        if (stop) { break; }
-    } // scope of meanshift iterations end
-
-#pragma unroll
-    for(int ch=0; ch<channels; ++ch)
-        optr[lIdx(gx, gy, out.strides[1], out.strides[0])+ch*ichStride] = centers[ch];
-}
-
-template<typename T, bool is_color>
-void meanshift(Param<T> out, CParam<T> in, float s_sigma, float c_sigma, uint iter)
-{
     static dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
 
-    int blk_x = divup(in.dims[0], THREADS_X);
-    int blk_y = divup(in.dims[1], THREADS_Y);
-
-    const int bCount   = (is_color ? 1 : in.dims[2]);
-    const int channels = (is_color ? in.dims[2] : 1); // this has to be 3 for color images
+    int blk_x        = divup(in.dims[0], THREADS_X);
+    int blk_y        = divup(in.dims[1], THREADS_Y);
+    const int bCount = (IsColor ? 1 : in.dims[2]);
 
     dim3 blocks(blk_x * bCount, blk_y * in.dims[3]);
 
     // clamp spatical and chromatic sigma's
-    float space_     = std::min(11.5f, s_sigma);
-    int radius  = std::max((int)(space_ * 1.5f), 1);
-    int padding = 2*radius+1;
-    const float cvar = c_sigma*c_sigma;
-    size_t shrd_size = channels*(threads.x + padding)*(threads.y+padding)*sizeof(T);
-
-    if (is_color)
-        (meanshiftKernel<T, 3>) <<<blocks, threads, shrd_size>>>(out, in, space_, radius, cvar, iter, blk_x, blk_y);
-    else
-        (meanshiftKernel<T, 1>) <<<blocks, threads, shrd_size>>>(out, in, space_, radius, cvar, iter, blk_x, blk_y);
+    int radius       = std::max((int)(spatialSigma * 1.5f), 1);
+    const float cvar = chromaticSigma * chromaticSigma;
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    meanshift(qArgs, out, in, radius, cvar, numIters, blk_x, blk_y);
     POST_LAUNCH_CHECK();
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/medfilt.cuh b/src/backend/cuda/kernel/medfilt.cuh
new file mode 100644
index 0000000000..e2d513cf95
--- /dev/null
+++ b/src/backend/cuda/kernel/medfilt.cuh
@@ -0,0 +1,286 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <shared.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// Exchange trick: Morgan McGuire, ShaderX 2008
+#define swap(a, b)           \
+    {                        \
+        T tmp = a;           \
+        a     = min(a, b);   \
+        b     = max(tmp, b); \
+    }
+
+__forceinline__ __device__ int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
+}
+
+template<typename T, af::borderType pad>
+__device__ void load2ShrdMem(T* shrd, const T* in, int lx, int ly,
+                             int shrdStride, int dim0, int dim1, int gx, int gy,
+                             int inStride1, int inStride0) {
+    switch (pad) {
+        case AF_PAD_ZERO: {
+            if (gx < 0 || gx >= dim0 || gy < 0 || gy >= dim1)
+                shrd[lIdx(lx, ly, shrdStride, 1)] = T(0);
+            else
+                shrd[lIdx(lx, ly, shrdStride, 1)] =
+                    in[lIdx(gx, gy, inStride1, inStride0)];
+        } break;
+        case AF_PAD_SYM: {
+            if (gx < 0) gx *= -1;
+            if (gy < 0) gy *= -1;
+            if (gx >= dim0) gx = 2 * (dim0 - 1) - gx;
+            if (gy >= dim1) gy = 2 * (dim1 - 1) - gy;
+
+            shrd[lIdx(lx, ly, shrdStride, 1)] =
+                in[lIdx(gx, gy, inStride1, inStride0)];
+        } break;
+    }
+}
+
+template<typename T, af::borderType pad>
+__device__ void load2ShrdMem_1d(T* shrd, const T* in, int lx, int dim0, int gx,
+                                int inStride0) {
+    switch (pad) {
+        case AF_PAD_ZERO: {
+            if (gx < 0 || gx >= dim0)
+                shrd[lx] = T(0);
+            else
+                shrd[lx] = in[gx];
+        } break;
+        case AF_PAD_SYM: {
+            if (gx < 0) gx *= -1;
+            if (gx >= dim0) gx = 2 * (dim0 - 1) - gx;
+
+            shrd[lx] = in[gx];
+        } break;
+    }
+}
+
+template<typename T, af::borderType pad, unsigned w_len, unsigned w_wid>
+__global__ void medfilt2(Param<T> out, CParam<T> in, int nBBS0, int nBBS1) {
+    __shared__ T shrdMem[(THREADS_X + w_len - 1) * (THREADS_Y + w_wid - 1)];
+
+    // calculate necessary offset and window parameters
+    const int padding = w_len - 1;
+    const int halo    = padding / 2;
+    const int shrdLen = blockDim.x + padding;
+
+    // batch offsets
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+    const T* iptr =
+        (const T*)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    T* optr = (T*)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+
+    // local neighborhood indices
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+
+    // global indices
+    int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + lx;
+    int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    // pull image to local memory
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += blockDim.y, gy2 += blockDim.y) {
+        // move row_set get_local_size(1) along columns
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += blockDim.x, gx2 += blockDim.x) {
+            load2ShrdMem<T, pad>(shrdMem, iptr, a, b, shrdLen, in.dims[0],
+                                 in.dims[1], gx2 - halo, gy2 - halo,
+                                 in.strides[1], in.strides[0]);
+        }
+    }
+
+    __syncthreads();
+
+    // Only continue if we're at a valid location
+    if (gx < in.dims[0] && gy < in.dims[1]) {
+        const int ARR_SIZE = w_len * (w_wid - w_wid / 2);
+        // pull top half from shared memory into local memory
+        T v[ARR_SIZE];
+#pragma unroll
+        for (int k = 0; k <= w_wid / 2; k++) {
+#pragma unroll
+            for (int i = 0; i < w_len; i++) {
+                v[w_len * k + i] = shrdMem[lIdx(lx + i, ly + k, shrdLen, 1)];
+            }
+        }
+
+        // with each pass, remove min and max values and add new value
+        // initial sort
+        // ensure min in first half, max in second half
+#pragma unroll
+        for (int i = 0; i < ARR_SIZE / 2; i++) {
+            swap(v[i], v[ARR_SIZE - 1 - i]);
+        }
+        // move min in first half to first pos
+#pragma unroll
+        for (int i = 1; i < (ARR_SIZE + 1) / 2; i++) { swap(v[0], v[i]); }
+        // move max in second half to last pos
+#pragma unroll
+        for (int i = ARR_SIZE - 2; i >= ARR_SIZE / 2; i--) {
+            swap(v[i], v[ARR_SIZE - 1]);
+        }
+
+        int last = ARR_SIZE - 1;
+
+        for (int k = 1 + w_wid / 2; k < w_wid; k++) {
+            for (int j = 0; j < w_len; j++) {
+                // add new contestant to first position in array
+                v[0] = shrdMem[lIdx(lx + j, ly + k, shrdLen, 1)];
+
+                last--;
+
+                // place max in last half, min in first half
+                for (int i = 0; i < (last + 1) / 2; i++) {
+                    swap(v[i], v[last - i]);
+                }
+                // now perform swaps on each half such that
+                // max is in last pos, min is in first pos
+                for (int i = 1; i <= last / 2; i++) { swap(v[0], v[i]); }
+                for (int i = last - 1; i >= (last + 1) / 2; i--) {
+                    swap(v[i], v[last]);
+                }
+            }
+        }
+
+        // no more new contestants
+        // may still have to sort the last row
+        // each outer loop drops the min and max
+        for (int k = 1; k < w_len / 2; k++) {
+            // move max/min into respective halves
+            for (int i = k; i < w_len / 2; i++) {
+                swap(v[i], v[w_len - 1 - i]);
+            }
+            // move min into first pos
+            for (int i = k + 1; i <= w_len / 2; i++) { swap(v[k], v[i]); }
+            // move max into last pos
+            for (int i = w_len - k - 2; i >= w_len / 2; i--) {
+                swap(v[i], v[w_len - 1 - k]);
+            }
+        }
+
+        // pick the middle element of the first row
+        optr[gy * out.strides[1] + gx * out.strides[0]] = v[w_len / 2];
+    }
+}
+
+template<typename T, af::borderType pad, unsigned ARR_SIZE>
+__global__ void medfilt1(Param<T> out, CParam<T> in, unsigned w_wid,
+                         int nBBS0) {
+    SharedMemory<T> shared;
+    T* shrdMem = shared.getPointer();
+
+    // calculate necessary offset and window parameters
+    const int padding = w_wid - 1;
+    const int halo    = padding / 2;
+    const int shrdLen = blockDim.x + padding;
+
+    // batch offsets
+    unsigned b1 = blockIdx.x / nBBS0;
+    unsigned b2 = blockIdx.y;
+    unsigned b3 = blockIdx.z;
+
+    const T* iptr =
+        (const T*)in.ptr +
+        (b1 * in.strides[1] + b2 * in.strides[2] + b3 * in.strides[3]);
+    T* optr = (T*)out.ptr +
+              (b1 * out.strides[1] + b2 * out.strides[2] + b3 * out.strides[3]);
+
+    // local neighborhood indices
+    int lx = threadIdx.x;
+
+    // global indices
+    int gx = blockDim.x * (blockIdx.x - b1 * nBBS0) + lx;
+
+    // pull signal to local memory
+    for (int a = lx, gx2 = gx; a < shrdLen;
+         a += blockDim.x, gx2 += blockDim.x) {
+        load2ShrdMem_1d<T, pad>(shrdMem, iptr, a, in.dims[0], gx2 - halo,
+                                in.strides[0]);
+    }
+
+    __syncthreads();
+
+    // Only continue if we're at a valid location
+    if (gx < in.dims[0]) {
+        const int ARR_BOUNDARY = (w_wid - w_wid / 2) + 1;
+        // pull top half from shared memory into local memory
+        T v[ARR_SIZE];
+
+#pragma unroll
+        for (int k = 0; k <= w_wid / 2 + 1; k++) { v[k] = shrdMem[lx + k]; }
+        // with each pass, remove min and max values and add new value
+        // initial sort
+        // ensure min in first half, max in second half
+#pragma unroll
+        for (int i = 0; i < ARR_BOUNDARY / 2; i++) {
+            swap(v[i], v[ARR_BOUNDARY - 1 - i]);
+        }
+        // move min in first half to first pos
+#pragma unroll
+        for (int i = 1; i < (ARR_BOUNDARY + 1) / 2; i++) { swap(v[0], v[i]); }
+        // move max in second half to last pos
+#pragma unroll
+        for (int i = ARR_BOUNDARY - 2; i >= ARR_BOUNDARY / 2; i--) {
+            swap(v[i], v[ARR_BOUNDARY - 1]);
+        }
+
+        int last = ARR_BOUNDARY - 1;
+
+        for (int k = w_wid / 2 + 2; k < w_wid; k++) {
+            // add new contestant to first position in array
+            v[0] = shrdMem[lx + k];
+
+            last--;
+
+            // place max in last half, min in first half
+            for (int i = 0; i < (last + 1) / 2; i++) {
+                swap(v[i], v[last - i]);
+            }
+            // now perform swaps on each half such that
+            // max is in last pos, min is in first pos
+            for (int i = 1; i <= last / 2; i++) { swap(v[0], v[i]); }
+            for (int i = last - 1; i >= (last + 1) / 2; i--) {
+                swap(v[i], v[last]);
+            }
+        }
+
+        // no more new contestants
+        // may still have to sort the last row
+        // each outer loop drops the min and max
+        for (int k = 0; k < last; k++) {
+            // move max/min into respective halves
+            for (int i = k; i < ARR_BOUNDARY / 2; i++) {
+                swap(v[i], v[ARR_BOUNDARY - 1 - i]);
+            }
+            // move min into first pos
+            for (int i = k + 1; i <= ARR_BOUNDARY / 2; i++) {
+                swap(v[k], v[i]);
+            }
+            // move max into last pos
+            for (int i = ARR_BOUNDARY - k - 2; i >= ARR_BOUNDARY / 2; i--) {
+                swap(v[i], v[ARR_BOUNDARY - 1 - k]);
+            }
+        }
+
+        // pick the middle element of the first row
+        optr[gx * out.strides[0]] = v[last / 2];
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
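Reviewer note on medfilt.cuh: going by the comments in medfilt2 and medfilt1, the kernels find the median by repeatedly discarding the current minimum and maximum of a small working set while feeding in the remaining window elements. The host-side C++ sketch below illustrates that idea on a plain std::vector. The name forgetfulMedian is hypothetical and not part of ArrayFire, and the sketch uses std::minmax_element instead of the kernel's min/max exchange macro, so it is a reference for checking results under those assumptions, not the kernel's exact procedure.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not an ArrayFire API): median of an odd-length window.
// A working set of n/2 + 2 elements can never have the median as its minimum
// or maximum, so both extremes are dropped before the next element is added.
template<typename T>
T forgetfulMedian(const std::vector<T>& window) {
    const std::size_t n = window.size();
    assert(n % 2 == 1);  // only odd window sizes are handled in this sketch
    if (n == 1) return window[0];

    std::size_t next = n / 2 + 2;  // index of the first unseen element
    std::vector<T> work(window.begin(), window.begin() + next);

    for (;;) {
        // first smallest and last largest element of the working set
        auto mm = std::minmax_element(work.begin(), work.end());
        auto lo = mm.first;
        auto hi = mm.second;
        // erase the later iterator first so the earlier one stays valid
        if (lo < hi) {
            work.erase(hi);
            work.erase(lo);
        } else {
            work.erase(lo);
            work.erase(hi);
        }
        if (next == n) break;  // no more new contestants
        work.push_back(window[next++]);
    }
    return work.front();  // exactly one element remains
}
```

The result can be compared against sorting the window and taking the middle element, which is a convenient way to validate kernel output for small windows on the host.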
diff --git a/src/backend/cuda/kernel/medfilt.hpp b/src/backend/cuda/kernel/medfilt.hpp
index 16665a4b40..20f3514ec6 100644
--- a/src/backend/cuda/kernel/medfilt.hpp
+++ b/src/backend/cuda/kernel/medfilt.hpp
@@ -7,223 +7,64 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/medfilt_cuh.hpp>
+#include <af/defines.h>
 
-namespace cuda
-{
-
-namespace kernel
-{
-
-static const int MAX_MEDFILTER_LEN = 15;
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const int MAX_MEDFILTER1_LEN = 121;
+static const int MAX_MEDFILTER2_LEN = 15;
+static const int THREADS_X          = 16;
+static const int THREADS_Y          = 16;
+
+template<typename T>
+void medfilt2(Param<T> out, CParam<T> in, const af::borderType pad, int w_len,
+              int w_wid) {
+    UNUSED(w_wid);
+    auto medfilt2 =
+        common::getKernel("arrayfire::cuda::medfilt2", {{medfilt_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>(), TemplateArg(pad),
+                                       TemplateArg(w_len), TemplateArg(w_wid)),
+                          {{DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
 
-// Exchange trick: Morgan McGuire, ShaderX 2008
-#define swap(a,b)    { T tmp = a; a = min(a,b); b = max(tmp,b); }
+    const dim3 threads(THREADS_X, THREADS_Y);
 
-__forceinline__ __device__
-int lIdx(int x, int y, int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
+    int blk_x = divup(in.dims[0], threads.x);
+    int blk_y = divup(in.dims[1], threads.y);
 
-template<typename T, af_border_type pad>
-__device__
-void load2ShrdMem(T * shrd, const T * in,
-                  int lx, int ly, int shrdStride,
-                  int dim0, int dim1,
-                  int gx, int gy,
-                  int inStride1, int inStride0)
-{
-    switch(pad) {
-        case AF_PAD_ZERO:
-            {
-                if (gx<0 || gx>=dim0 || gy<0 || gy>=dim1)
-                    shrd[lIdx(lx, ly, shrdStride, 1)] = T(0);
-                else
-                    shrd[lIdx(lx, ly, shrdStride, 1)] = in[lIdx(gx, gy, inStride1, inStride0)];
-            }
-            break;
-        case AF_PAD_SYM:
-            {
-                if (gx<0) gx *= -1;
-                if (gy<0) gy *= -1;
-                if (gx>=dim0) gx = 2*(dim0-1) - gx;
-                if (gy>=dim1) gy = 2*(dim1-1) - gy;
+    dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
 
-                shrd[lIdx(lx, ly, shrdStride, 1)] = in[lIdx(gx, gy, inStride1, inStride0)];
-            }
-            break;
-    }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    medfilt2(qArgs, out, in, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
 }
 
-template<typename T, af_border_type pad, unsigned w_len, unsigned w_wid>
-__global__
-void medfilt(Param<T> out, CParam<T> in, int nBBS0, int nBBS1)
-{
-    __shared__ T shrdMem[(THREADS_X+w_len-1)*(THREADS_Y+w_wid-1)];
-
-    // calculate necessary offset and window parameters
-    const int padding = w_len-1;
-    const int halo    = padding/2;
-    const int shrdLen = blockDim.x + padding;
-
-    // batch offsets
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-    const T* iptr    = (const T *) in.ptr + (b2 *  in.strides[2] + b3 *  in.strides[3]);
-    T*       optr    = (T *      )out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
-
-    // local neighborhood indices
-    int lx = threadIdx.x;
-    int ly = threadIdx.y;
-
-    // global indices
-    int gx = blockDim.x * (blockIdx.x-b2*nBBS0) + lx;
-    int gy = blockDim.y * (blockIdx.y-b3*nBBS1) + ly;
-
-    // offset values for pulling image to local memory
-    int lx2 = lx + blockDim.x;
-    int ly2 = ly + blockDim.y;
-    int gx2 = gx + blockDim.x;
-    int gy2 = gy + blockDim.y;
-
-    // pull image to local memory
-    load2ShrdMem<T, pad>(shrdMem, iptr, lx, ly, shrdLen,
-                         in.dims[0], in.dims[1],
-                         gx-halo, gy-halo,
-                         in.strides[1], in.strides[0]);
-    if (lx<padding) {
-        load2ShrdMem<T, pad>(shrdMem, iptr, lx2, ly, shrdLen,
-                             in.dims[0], in.dims[1],
-                             gx2-halo, gy-halo,
-                             in.strides[1], in.strides[0]);
-    }
-    if (ly<padding) {
-        load2ShrdMem<T, pad>(shrdMem, iptr, lx, ly2, shrdLen,
-                             in.dims[0], in.dims[1],
-                             gx-halo, gy2-halo,
-                             in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2ShrdMem<T, pad>(shrdMem, iptr, lx2, ly2, shrdLen,
-                             in.dims[0], in.dims[1],
-                             gx2-halo, gy2-halo,
-                             in.strides[1], in.strides[0]);
-    }
-    __syncthreads();
+template<typename T>
+void medfilt1(Param<T> out, CParam<T> in, const af::borderType pad, int w_wid) {
+    auto medfilt1 =
+        common::getKernel("arrayfire::cuda::medfilt1", {{medfilt_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>(), TemplateArg(pad),
+                                       TemplateArg(w_wid)));
 
-    // Only continue if we're at a valid location
-    if (gx < in.dims[0] && gy < in.dims[1]) {
-
-        const int ARR_SIZE = w_len * (w_wid-w_wid/2);
-        // pull top half from shared memory into local memory
-        T v[ARR_SIZE];
-#pragma unroll
-        for(int k = 0; k <= w_wid/2; k++) {
-#pragma unroll
-            for(int i = 0; i < w_len; i++) {
-                v[w_len*k + i] = shrdMem[lIdx(lx+i,ly+k,shrdLen,1)];
-            }
-        }
-
-        // with each pass, remove min and max values and add new value
-        // initial sort
-        // ensure min in first half, max in second half
-#pragma unroll
-        for(int i = 0; i < ARR_SIZE/2; i++) {
-            swap(v[i], v[ARR_SIZE-1-i]);
-        }
-        // move min in first half to first pos
-#pragma unroll
-        for(int i = 1; i < (ARR_SIZE+1)/2; i++) {
-            swap(v[0], v[i]);
-        }
-        // move max in second half to last pos
-#pragma unroll
-        for(int i = ARR_SIZE-2; i >= ARR_SIZE/2; i--) {
-            swap(v[i], v[ARR_SIZE-1]);
-        }
-
-        int last = ARR_SIZE-1;
-
-        for(int k = 1+w_wid/2; k < w_wid; k++) {
-
-            for(int j = 0; j < w_len; j++) {
-
-                // add new contestant to first position in array
-                v[0] = shrdMem[lIdx(lx+j, ly+k, shrdLen, 1)];
-
-                last--;
-
-                // place max in last half, min in first half
-                for(int i = 0; i < (last+1)/2; i++) {
-                    swap(v[i], v[last-i]);
-                }
-                // now perform swaps on each half such that
-                // max is in last pos, min is in first pos
-                for(int i = 1; i <= last/2; i++) {
-                    swap(v[0], v[i]);
-                }
-                for(int i = last-1; i >= (last+1)/2; i--) {
-                    swap(v[i], v[last]);
-                }
-            }
-        }
-
-        // no more new contestants
-        // may still have to sort the last row
-        // each outer loop drops the min and max
-        for(int k = 1; k < w_len/2; k++) {
-            // move max/min into respective halves
-            for(int i = k; i < w_len/2; i++) {
-                swap(v[i], v[w_len-1-i]);
-            }
-            // move min into first pos
-            for(int i = k+1; i <= w_len/2; i++) {
-                swap(v[k], v[i]);
-            }
-            // move max into last pos
-            for(int i = w_len-k-2; i >= w_len/2; i--) {
-                swap(v[i], v[w_len-1-k]);
-            }
-        }
-
-        // pick the middle element of the first row
-        optr[gy*out.strides[1]+gx*out.strides[0]] = v[w_len/2];
-    }
-}
-
-template<typename T, af_border_type pad>
-void medfilt(Param<T> out, CParam<T> in, int w_len, int w_wid)
-{
-    const dim3 threads(THREADS_X, THREADS_Y);
+    const dim3 threads(THREADS_X);
 
     int blk_x = divup(in.dims[0], threads.x);
-    int blk_y = divup(in.dims[1], threads.y);
 
-    dim3 blocks(blk_x*in.dims[2], blk_y*in.dims[3]);
+    dim3 blocks(blk_x * in.dims[1], in.dims[2], in.dims[3]);
 
-    switch(w_len) {
-        case  3: (medfilt<T, pad,  3,  3>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case  5: (medfilt<T, pad,  5,  5>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case  7: (medfilt<T, pad,  7,  7>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case  9: (medfilt<T, pad,  9,  9>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case 11: (medfilt<T, pad, 11, 11>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case 13: (medfilt<T, pad, 13, 13>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-        case 15: (medfilt<T, pad, 15, 15>)<<<blocks, threads>>>(out, in, blk_x, blk_y); break;
-    }
+    const size_t shrdMemBytes = sizeof(T) * (THREADS_X + w_wid - 1);
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), shrdMemBytes);
+    medfilt1(qArgs, out, in, w_wid, blk_x);
     POST_LAUNCH_CHECK();
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
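Reviewer note on medfilt.hpp: the launchers fold the batch dimensions into the grid (for example blk_x * in.dims[2] blocks along x), and the kernels recover the batch index with a division by nBBS0 or nBBS1. The host-side C++ sketch below mirrors that arithmetic. The names divupSketch, BlockCoord, decodeBlock and globalIndex are hypothetical, and divup is assumed here to be plain ceiling division.

```cpp
#include <cassert>

// Assumed behaviour of divup: number of blocks of size `block` covering `n`.
inline int divupSketch(int n, int block) { return (n + block - 1) / block; }

// Hypothetical helper type: where a flattened block index lands.
struct BlockCoord {
    int batch;  // batch slice the block belongs to (b2 or b3 in the kernel)
    int local;  // block index inside that slice
};

// Split a flattened block index into (batch, local), as the kernels do with
// blockIdx.x / nBBS0 and blockIdx.x - b2 * nBBS0.
inline BlockCoord decodeBlock(int blockIdx, int blocksPerSlice) {
    const int batch = blockIdx / blocksPerSlice;
    return {batch, blockIdx - batch * blocksPerSlice};
}

// Global element index of a thread inside its batch slice.
inline int globalIndex(int blockIdx, int blocksPerSlice, int blockDim,
                       int threadIdx) {
    return blockDim * decodeBlock(blockIdx, blocksPerSlice).local + threadIdx;
}
```

With 100 elements along dim 0 and 16 threads per block, each slice needs 7 blocks, so flattened block 15 belongs to batch 2 and is block 1 within it.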
diff --git a/src/backend/cuda/kernel/memcopy.cuh b/src/backend/cuda/kernel/memcopy.cuh
new file mode 100644
index 0000000000..b078a48aea
--- /dev/null
+++ b/src/backend/cuda/kernel/memcopy.cuh
@@ -0,0 +1,227 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// memCopy without looping, so dim3 has to be 1.
+// conditions:
+//      kernel dims[0] >= dims[0]
+//      kernel dims[1] >= dims[1]
+//      kernel dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+template<typename T>
+__global__ void memCopy(Param<T> out, CParam<T> in) {
+    const int id0 = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    const int id1 = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    if ((id0 < (int)in.dims[0]) & (id1 < (int)in.dims[1])) {
+        const int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+
+        out.ptr[id0 * (int)out.strides[0] + id1 * (int)out.strides[1] +
+                id2 * (int)out.strides[2]] =
+            in.ptr[id0 * (int)in.strides[0] + id1 * (int)in.strides[1] +
+                   id2 * (int)in.strides[2]];
+    }
+}
+
+// memCopy with looping over dims[0] -- VECTOR ONLY
+// Conditions:
+//      kernel dims[0] has no restrictions
+//      only dims[1] == 1 will be processed!!
+//      only dims[2] == 1 will be processed!!
+//      only dims[3] == 1 will be processed!!
+template<typename T>
+__global__ void memCopyLoop0(Param<T> out, CParam<T> in) {
+    int id0          = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    const int idims0 = in.dims[0];
+    if (id0 < idims0) {
+        const int incID0        = gridDim.x * blockDim.x;
+        const int istrides0     = in.strides[0];
+        int idx_in              = id0 * istrides0;
+        const int idxIncID0_in  = incID0 * istrides0;
+        const int ostrides0     = out.strides[0];
+        int idx_out             = id0 * ostrides0;
+        const int idxIncID0_out = incID0 * ostrides0;
+
+        do {
+            out.ptr[idx_out] = in.ptr[idx_in];
+            id0 += incID0;
+            if (id0 >= idims0) break;
+            idx_in += idxIncID0_in;
+            idx_out += idxIncID0_out;
+        } while (true);
+    }
+}
+
+// memCopy with looping over dims[1]
+// Conditions:
+//      kernel dims[0] >= dims[0]
+//      kernel dims[1] has no restrictions
+//      kernel dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+template<typename T>
+__global__ void memCopyLoop1(Param<T> out, CParam<T> in) {
+    const int id0    = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    int id1          = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    const int idims1 = in.dims[1];
+    if ((id0 < (int)in.dims[0]) & (id1 < idims1)) {
+        const int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+        const int istrides1 = in.strides[1];
+        int idx_in          = id0 * (int)in.strides[0] + id1 * istrides1 +
+                     id2 * (int)in.strides[2];
+        const int incID1       = gridDim.y * blockDim.y;
+        const int idxIncID1_in = incID1 * istrides1;
+        const int ostrides1    = out.strides[1];
+        int idx_out            = id0 * (int)out.strides[0] + id1 * ostrides1 +
+                      id2 * (int)out.strides[2];
+        const int idxIncID1_out = incID1 * ostrides1;
+
+        do {
+            out.ptr[idx_out] = in.ptr[idx_in];
+            id1 += incID1;
+            if (id1 >= idims1) break;
+            idx_in += idxIncID1_in;
+            idx_out += idxIncID1_out;
+        } while (true);
+    }
+}
+
+// memCopy with looping over dims[3]
+// Conditions:
+//      kernel dims[0] >= dims[0]
+//      kernel dims[1] >= dims[1]
+//      kernel dims[2] == dims[2]
+template<typename T>
+__global__ void memCopyLoop3(Param<T> out, CParam<T> in) {
+    const int id0 = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    const int id1 = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    if ((id0 < (int)in.dims[0]) & (id1 < (int)in.dims[1])) {
+        const int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+        int idx_in    = id0 * (int)in.strides[0] + id1 * (int)in.strides[1] +
+                     id2 * (int)in.strides[2];
+        const int idxIncID3_in = in.strides[3];
+        const int idxEnd_in    = (int)in.dims[3] * idxIncID3_in + idx_in;
+        int idx_out = id0 * (int)out.strides[0] + id1 * (int)out.strides[1] +
+                      id2 * (int)out.strides[2];
+        const int idxIncID3_out = out.strides[3];
+
+        do {
+            out.ptr[idx_out] = in.ptr[idx_in];
+            idx_in += idxIncID3_in;
+            if (idx_in == idxEnd_in) break;
+            idx_out += idxIncID3_out;
+        } while (true);
+    }
+}
+
+// memCopy with looping over dims[1] and dims[3]
+// Conditions:
+//      kernel dims[0] >= dims[0]
+//      kernel dims[1] has no restrictions
+//      kernel dims[2] == dims[2]
+template<typename T>
+__global__ void memCopyLoop13(Param<T> out, CParam<T> in) {
+    const int id0    = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    int id1          = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    const int idims1 = in.dims[1];
+    if ((id0 < (int)in.dims[0]) & (id1 < idims1)) {
+        const int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+        const int istrides1 = in.strides[1];
+        int idxBase_in      = id0 * (int)in.strides[0] + id1 * istrides1 +
+                         id2 * (int)in.strides[2];
+        const int incID1           = gridDim.y * blockDim.y;
+        const int idxBaseIncID1_in = incID1 * istrides1;
+        const int idxIncID3_in     = (int)in.strides[3];
+        int idxEndID3_in = (int)in.dims[3] * idxIncID3_in + idxBase_in;
+        int idxBase_out  = id0 * (int)out.strides[0] +
+                          id1 * (int)out.strides[1] + id2 * (int)out.strides[2];
+        const int idxBaseIncID1_out = incID1 * (int)out.strides[1];
+        const int idxIncID3_out     = (int)out.strides[3];
+
+        do {
+            int idx_in  = idxBase_in;
+            int idx_out = idxBase_out;
+            while (true) {
+                out.ptr[idx_out] = in.ptr[idx_in];
+                idx_in += idxIncID3_in;
+                if (idx_in == idxEndID3_in) break;
+                idx_out += idxIncID3_out;
+            }
+            id1 += incID1;
+            if (id1 >= idims1) break;
+            idxBase_in += idxBaseIncID1_in;
+            idxEndID3_in += idxBaseIncID1_in;
+            idxBase_out += idxBaseIncID1_out;
+        } while (true);
+    }
+}
+
+// memCopy with looping over dims[1],dims[2] and dims[3]
+// Conditions:
+//      kernel dims[0] >= dims[0]
+//      kernel dims[1] has no restrictions
+//      kernel dims[2] <= dims[2]
+template<typename T>
+__global__ void memCopyLoop123(Param<T> out, CParam<T> in) {
+    const int id0    = blockIdx.x * blockDim.x + threadIdx.x;  // Limit 2G
+    int id1          = blockIdx.y * blockDim.y + threadIdx.y;  // Limit 64K
+    const int idims1 = in.dims[1];
+    if ((id0 < (int)in.dims[0]) & (id1 < idims1)) {
+        int id2 = blockIdx.z * blockDim.z + threadIdx.z;  // Limit 64K
+        const int istrides1 = in.strides[1];
+        const int istrides2 = in.strides[2];
+        int idxBaseBase_in =
+            id0 * (int)in.strides[0] + id1 * istrides1 + id2 * istrides2;
+        const int incID1           = gridDim.y * blockDim.y;
+        const int idxBaseIncID1_in = incID1 * istrides1;
+        const int incID2           = gridDim.z * blockDim.z;
+        const int idxBaseIncID2_in = incID2 * istrides2;
+        const int idxIncID3_in     = in.strides[3];
+        const int idxEndIncID3_in  = (int)in.dims[3] * idxIncID3_in;
+
+        const int ostrides1 = out.strides[1];
+        const int ostrides2 = out.strides[2];
+        int idxBaseBase_out =
+            id0 * (int)out.strides[0] + id1 * ostrides1 + id2 * ostrides2;
+        const int idxBaseIncID1_out = incID1 * ostrides1;
+        const int idxBaseIncID2_out = incID2 * ostrides2;
+        const int idxIncID3_out     = out.strides[3];
+        const int idims2            = in.dims[2];
+
+        do {
+            int idxBase_in  = idxBaseBase_in;
+            int idxBase_out = idxBaseBase_out;
+            do {
+                int idxEndID3_in = idxEndIncID3_in + idxBase_in;
+                int idx_in       = idxBase_in;
+                int idx_out      = idxBase_out;
+                do {
+                    out.ptr[idx_out] = in.ptr[idx_in];
+                    idx_in += idxIncID3_in;
+                    if (idx_in == idxEndID3_in) break;
+                    idx_out += idxIncID3_out;
+                } while (true);
+                id1 += incID1;
+                if (id1 >= idims1) break;
+                idxBase_in += idxBaseIncID1_in;
+                idxBase_out += idxBaseIncID1_out;
+            } while (true);
+            id2 += incID2;
+            if (id2 >= idims2) break;
+            idxBaseBase_in += idxBaseIncID2_in;
+            idxBaseBase_out += idxBaseIncID2_out;
+        } while (true);
+    }
+}
+}  // namespace cuda
+}  // namespace arrayfire
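Reviewer note on memcopy.cuh: the memCopyLoop* kernels let each thread handle several elements by stepping its index by the total number of launched threads along that dimension (gridDim * blockDim) and advancing the input and output offsets by the matching strides. The host-side C++ sketch below emulates the memCopyLoop0 pattern with a serial loop over thread ids. The function names and the totalThreads parameter are hypothetical stand-ins for the CUDA built-ins.

```cpp
#include <cassert>
#include <vector>

// Work done by one emulated thread: copy every totalThreads-th element,
// reading with stride istride0 and writing with stride ostride0.
template<typename T>
void copyLoop0Thread(std::vector<T>& out, int ostride0,
                     const std::vector<T>& in, int istride0, int dims0,
                     int threadId, int totalThreads) {
    for (int id0 = threadId; id0 < dims0; id0 += totalThreads) {
        out[id0 * ostride0] = in[id0 * istride0];
    }
}

// Serial emulation of the whole launch: run every thread id in turn.
template<typename T>
void copyLoop0(std::vector<T>& out, int ostride0, const std::vector<T>& in,
               int istride0, int dims0, int totalThreads) {
    for (int t = 0; t < totalThreads; ++t) {
        copyLoop0Thread(out, ostride0, in, istride0, dims0, t, totalThreads);
    }
}
```

Because every element index is visited by exactly one emulated thread, the result does not depend on totalThreads. In the kernels, this same property is what lets the launcher cap the grid size.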
diff --git a/src/backend/cuda/kernel/memcopy.hpp b/src/backend/cuda/kernel/memcopy.hpp
index 96f7375e73..f4d39e6c64 100644
--- a/src/backend/cuda/kernel/memcopy.hpp
+++ b/src/backend/cuda/kernel/memcopy.hpp
@@ -7,216 +7,204 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
-#include <backend.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <backend.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-
-namespace cuda
-{
-namespace kernel
-{
-
-    typedef struct
-    {
-        int dim[4];
-    } dims_t;
-
-    static const uint DIMX = 32;
-    static const uint DIMY =  8;
-
-    template<typename T>
-    __global__ static void
-    memcopy_kernel(T *out, const dims_t ostrides,
-                   const T *in, const dims_t idims,
-                   const dims_t istrides, uint blocks_x, uint blocks_y)
-    {
-        const int tidx = threadIdx.x;
-        const int tidy = threadIdx.y;
-
-        const int zid = blockIdx.x / blocks_x;
-        const int wid = blockIdx.y / blocks_y;
-        const int blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const int blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const int xid = blockIdx_x * blockDim.x + tidx;
-        const int yid = blockIdx_y * blockDim.y + tidy;
-
-        // FIXME: Do more work per block
-        out += wid * ostrides.dim[3] + zid * ostrides.dim[2] + yid * ostrides.dim[1];
-        in  += wid * istrides.dim[3] + zid * istrides.dim[2] + yid * istrides.dim[1];
-
-        int istride0 = istrides.dim[0];
-        if (xid < idims.dim[0] &&
-            yid < idims.dim[1] &&
-            zid < idims.dim[2] &&
-            wid < idims.dim[3]) {
-            out[xid] = in[xid * istride0];
+#include <dims_param.hpp>
+#include <nvrtc_kernel_headers/copy_cuh.hpp>
+#include <nvrtc_kernel_headers/memcopy_cuh.hpp>
+#include <threadsMgt.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+// Increase vectorization by widening the element type up to maxVectorWidth.
+// Example:
+//  for an input array<int>, a return value of 4 means that the array is
+//  treated as array<int4>.
+//
+// Parameters
+//  - IN     maxVectorWidth: maximum vectorization desired
+//  - IN/OUT out: output array.  Its strides are updated, its dims are
+//    left untouched
+//  - IN/OUT indims: ndims of the input array.  Updated when dims[0] becomes 1
+//  - IN/OUT in: input array.  Its dims and strides are updated
+//  - The offsets are already included in in.ptr and out.ptr, so only the
+//    alignment of these pointers is checked
+//
+// Returns
+//  - maximum obtained vectorization (1 when nothing could be vectorized).
+//  - All the parameters are updated accordingly
+//
+template<typename T>
+dim_t vectorizeShape(const dim_t maxVectorWidth, Param<T> &out, dim_t &indims,
+                     CParam<T> &in) {
+    dim_t vectorWidth{1};
+    if ((maxVectorWidth != 1) & (in.strides[0] == 1) & (out.strides[0] == 1)) {
+        // Only adjacent items can be grouped into a base vector type
+        void *in_ptr{(void *)in.ptr};
+        void *out_ptr{(void *)out.ptr};
+        // - global is the OR of the values to be checked.  When global is
+        // divisible by 2, then all source values are as well
+        dim_t global{in.dims[0]};
+        for (int i{1}; i < indims; ++i) {
+            global |= in.strides[i] | out.strides[i];
         }
-
-    }
-
-    template<typename T>
-    void memcopy(T *out, const dim_t *ostrides,
-                 const T *in, const dim_t *idims,
-                 const dim_t *istrides, uint ndims)
-    {
-        dim3 threads(DIMX, DIMY);
-
-        if (ndims == 1) {
-            threads.x *= threads.y;
-            threads.y  = 1;
-       }
-
-        // FIXME: DO more work per block
-        uint blocks_x = divup(idims[0], threads.x);
-        uint blocks_y = divup(idims[1], threads.y);
-
-        dim3 blocks(blocks_x * idims[2],
-                    blocks_y * idims[3]);
-
-        dims_t _ostrides = {{ostrides[0], ostrides[1], ostrides[2], ostrides[3]}};
-        dims_t _istrides = {{istrides[0], istrides[1], istrides[2], istrides[3]}};
-        dims_t _idims = {{idims[0], idims[1], idims[2], idims[3]}};
-
-        (memcopy_kernel<T>)<<<blocks, threads>>>(out, _ostrides,
-                                                 in, _idims, _istrides,
-                                                 blocks_x, blocks_y);
-        POST_LAUNCH_CHECK();
-    }
-
-
-    ////////////////////////////// BEGIN - templated help functions for copy_kernel ////////////////////////////////
-    template<typename T>
-    __inline__ __device__ static
-    T scale(T value, double factor) {
-        return (T)(value*factor);
-    }
-
-    template<>
-    __inline__ __device__
-    cfloat scale<cfloat>(cfloat value, double factor) {
-        return make_cuFloatComplex(value.x*factor, value.y*factor);
-    }
-
-    template<>
-    __inline__ __device__
-    cdouble scale<cdouble>(cdouble value, double factor) {
-        return make_cuDoubleComplex(value.x*factor, value.y*factor);
-    }
-
-    template<typename inType, typename outType>
-    __inline__ __device__
-    outType convertType(inType value) {
-        return (outType)value;
-    }
-
-    template<>
-    __inline__ __device__
-    cdouble convertType<cfloat, cdouble>(cfloat value) {
-        return cuComplexFloatToDouble(value);
-    }
-
-    template<>
-    __inline__ __device__
-    cfloat convertType<cdouble, cfloat>(cdouble value) {
-        return cuComplexDoubleToFloat(value);
-    }
-
-#define OTHER_SPECIALIZATIONS(IN_T)                     \
-    template<>                                          \
-    __inline__ __device__                               \
-    cfloat convertType<IN_T, cfloat>(IN_T value) {      \
-        return make_cuFloatComplex(value, 0.0f);        \
-    }                                                   \
-                                                        \
-    template<>                                          \
-    __inline__ __device__                               \
-    cdouble convertType<IN_T, cdouble>(IN_T value) {    \
-        return make_cuDoubleComplex(value, 0.0);        \
-    }
-
-    OTHER_SPECIALIZATIONS(float )
-    OTHER_SPECIALIZATIONS(double)
-    OTHER_SPECIALIZATIONS(int   )
-    OTHER_SPECIALIZATIONS(uint  )
-    OTHER_SPECIALIZATIONS(intl   )
-    OTHER_SPECIALIZATIONS(uintl  )
-    OTHER_SPECIALIZATIONS(uchar )
-    OTHER_SPECIALIZATIONS(char  )
-    ////////////////////////////// END - templated help functions for copy_kernel //////////////////////////////////
-
-
-    template<typename inType, typename outType, bool same_dims>
-    __global__ static void
-    copy_kernel(Param<outType> dst, CParam<inType> src, outType default_value,
-                double factor, const dims_t trgt, uint blk_x, uint blk_y)
-    {
-        const uint lx = threadIdx.x;
-        const uint ly = threadIdx.y;
-
-        const uint gz = blockIdx.x / blk_x;
-        const uint gw = blockIdx.y / blk_y;
-        const uint blockIdx_x = blockIdx.x - (blk_x) * gz;
-        const uint blockIdx_y = blockIdx.y - (blk_y) * gw;
-        const uint gx = blockIdx_x * blockDim.x + lx;
-        const uint gy = blockIdx_y * blockDim.y + ly;
-
-        const inType * in = src.ptr + (gw * src.strides[3] + gz * src.strides[2] + gy * src.strides[1]);
-        outType * out     = dst.ptr + (gw * dst.strides[3] + gz * dst.strides[2] + gy * dst.strides[1]);
-
-        int istride0 = src.strides[0];
-        int ostride0 = dst.strides[0];
-
-        if (gy < dst.dims[1] && gz < dst.dims[2] && gw < dst.dims[3]) {
-            int loop_offset = blockDim.x * blk_x;
-            bool cond = gy < trgt.dim[1] && gz < trgt.dim[2] && gw < trgt.dim[3];
-            for(int rep=gx; rep<dst.dims[0]; rep+=loop_offset) {
-                outType temp = default_value;
-                if (same_dims || (rep < trgt.dim[0] && cond)) {
-                    temp = convertType<inType, outType>(scale<inType>(in[rep * istride0], factor));
+        // - The buffers are always aligned at 128 Bytes.  The pointers in the
+        // Param<T> structure are, however, direct pointers (including the
+        // offset), so the final pointer has to be checked for alignment
+        size_t filler{64};  // give enough space for the align to move
+        unsigned count{0};
+        while (((global & 1) == 0) & (vectorWidth < maxVectorWidth) &&
+               (in.ptr ==
+                std::align(alignof(T) * vectorWidth * 2, 1, in_ptr, filler)) &&
+               (out.ptr ==
+                std::align(alignof(T) * vectorWidth * 2, 1, out_ptr, filler))) {
+            ++count;
+            vectorWidth <<= 1;
+            global >>= 1;
+        }
+        if (count != 0) {
+            // update the dimensions, to compensate for the vector base
+            // type change
+            in.dims[0] >>= count;
+            for (int i{1}; i < indims; ++i) {
+                in.strides[i] >>= count;
+                out.strides[i] >>= count;
+            }
+            if (in.dims[0] == 1) {
+                // Vectorization has absorbed the full dim0, so eliminate
+                // this dimension
+                --indims;
+                for (int i{0}; i < indims; ++i) {
+                    in.dims[i]     = in.dims[i + 1];
+                    in.strides[i]  = in.strides[i + 1];
+                    out.strides[i] = out.strides[i + 1];
                 }
-                out[rep*ostride0] = temp;
+                in.dims[indims] = 1;
             }
         }
     }
+    return vectorWidth;
+}
 
-    template<typename inType, typename outType>
-    void copy(Param<outType> dst, CParam<inType> src, int ndims, outType default_value, double factor)
-    {
-        dim3 threads(DIMX, DIMY);
-        size_t local_size[] = {DIMX, DIMY};
+template<typename T>
+void memcopy(Param<T> out, CParam<T> in, dim_t indims) {
+    const size_t totalSize{in.elements() * sizeof(T) * 2};
+    removeEmptyColumns(in.dims, indims, out.strides);
+    indims = removeEmptyColumns(in.dims, indims, in.dims, in.strides);
+    indims = combineColumns(in.dims, in.strides, indims, out.strides);
+
+    // Optimize memory access and caching.
+    // Best performance is achieved with the highest vectorization
+    // (<int> --> <int2>,<int4>, ...), since more data is processed per IO.
+
+    // 16 Bytes gives best performance (=cdouble)
+    const dim_t maxVectorWidth{sizeof(T) > 8 ? 1 : 16 / sizeof(T)};
+    const dim_t vectorWidth{vectorizeShape(maxVectorWidth, out, indims, in)};
+    const size_t sizeofNewT{sizeof(T) * vectorWidth};
+
+    threadsMgt<dim_t> th(in.dims, indims);
+    const dim3 threads{th.genThreads()};
+    const dim3 blocks{th.genBlocks(threads, 1, 1, totalSize, sizeofNewT)};
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    // select the kernel with the necessary loopings
+    const char *kernelName{th.loop0   ? "arrayfire::cuda::memCopyLoop0"
+                           : th.loop2 ? "arrayfire::cuda::memCopyLoop123"
+                           : th.loop1 ? th.loop3
+                                            ? "arrayfire::cuda::memCopyLoop13"
+                                            : "arrayfire::cuda::memCopyLoop1"
+                           : th.loop3 ? "arrayfire::cuda::memCopyLoop3"
+                                      : "arrayfire::cuda::memCopy"};
+
+    // Conversion to cuda base vector types.
+    switch (sizeofNewT) {
+        case 1: {
+            auto memCopy{common::getKernel(kernelName, {{memcopy_cuh_src}},
+                                           TemplateArgs(TemplateArg("char")))};
+            memCopy(qArgs, Param<char>((char *)out.ptr, out.dims, out.strides),
+                    CParam<char>((const char *)in.ptr, in.dims, in.strides));
+        } break;
+        case 2: {
+            auto memCopy{common::getKernel(kernelName, {{memcopy_cuh_src}},
+                                           TemplateArgs(TemplateArg("short")))};
+            memCopy(qArgs,
+                    Param<short>((short *)out.ptr, out.dims, out.strides),
+                    CParam<short>((const short *)in.ptr, in.dims, in.strides));
+        } break;
+        case 4: {
+            auto memCopy{common::getKernel(kernelName, {{memcopy_cuh_src}},
+                                           TemplateArgs(TemplateArg("float")))};
+            memCopy(qArgs,
+                    Param<float>((float *)out.ptr, out.dims, out.strides),
+                    CParam<float>((const float *)in.ptr, in.dims, in.strides));
+        } break;
+        case 8: {
+            auto memCopy{
+                common::getKernel(kernelName, {{memcopy_cuh_src}},
+                                  TemplateArgs(TemplateArg("float2")))};
+            memCopy(
+                qArgs, Param<float2>((float2 *)out.ptr, out.dims, out.strides),
+                CParam<float2>((const float2 *)in.ptr, in.dims, in.strides));
+        } break;
+        case 16: {
+            auto memCopy{
+                common::getKernel(kernelName, {{memcopy_cuh_src}},
+                                  TemplateArgs(TemplateArg("float4")))};
+            memCopy(
+                qArgs, Param<float4>((float4 *)out.ptr, out.dims, out.strides),
+                CParam<float4>((const float4 *)in.ptr, in.dims, in.strides));
+        } break;
+        default: assert(false && "types larger than 16 bytes are unsupported");
+    }
+    POST_LAUNCH_CHECK();
+}
 
-        local_size[0] *= local_size[1];
-        if (ndims == 1) {
-            local_size[1] = 1;
+template<typename inType, typename outType>
+void copy(Param<outType> dst, CParam<inType> src, dim_t ondims,
+          outType default_value, double factor) {
+    const size_t totalSize{dst.elements() * sizeof(outType) +
+                           src.elements() * sizeof(inType)};
+    bool same_dims{true};
+    for (dim_t i{0}; i < ondims; ++i) {
+        if (src.dims[i] > dst.dims[i]) {
+            src.dims[i] = dst.dims[i];
+        } else if (src.dims[i] != dst.dims[i]) {
+            same_dims = false;
         }
+    }
+    removeEmptyColumns(dst.dims, ondims, src.dims, src.strides);
+    ondims = removeEmptyColumns(dst.dims, ondims, dst.dims, dst.strides);
+    ondims =
+        combineColumns(dst.dims, dst.strides, ondims, src.dims, src.strides);
 
-        uint blk_x = divup(dst.dims[0], local_size[0]);
-        uint blk_y = divup(dst.dims[1], local_size[1]);
-
-        dim3 blocks(blk_x * dst.dims[2],
-                    blk_y * dst.dims[3]);
-
-        int trgt_l  = std::min(dst.dims[3], src.dims[3]);
-        int trgt_k  = std::min(dst.dims[2], src.dims[2]);
-        int trgt_j  = std::min(dst.dims[1], src.dims[1]);
-        int trgt_i  = std::min(dst.dims[0], src.dims[0]);
-        dims_t trgt_dims = {{trgt_i, trgt_j, trgt_k, trgt_l}};
+    threadsMgt<dim_t> th(dst.dims, ondims);
+    const dim3 threads{th.genThreads()};
+    const dim3 blocks{th.genBlocks(threads, 1, 1, totalSize, sizeof(outType))};
 
-        bool same_dims = ( (src.dims[0]==dst.dims[0]) &&
-                           (src.dims[1]==dst.dims[1]) &&
-                           (src.dims[2]==dst.dims[2]) &&
-                           (src.dims[3]==dst.dims[3]) );
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        if (same_dims)
-            (copy_kernel<inType, outType, true >)<<<blocks, threads>>>(dst, src, default_value, factor, trgt_dims, blk_x, blk_y);
-        else
-            (copy_kernel<inType, outType, false>)<<<blocks, threads>>>(dst, src, default_value, factor, trgt_dims, blk_x, blk_y);
+    auto copy{common::getKernel(
+        th.loop0                 ? "arrayfire::cuda::scaledCopyLoop0"
+        : (th.loop2 || th.loop3) ? "arrayfire::cuda::scaledCopyLoop123"
+        : th.loop1               ? "arrayfire::cuda::scaledCopyLoop1"
+                                 : "arrayfire::cuda::scaledCopy",
+        {{copy_cuh_src}},
+        TemplateArgs(TemplateTypename<inType>(), TemplateTypename<outType>(),
+                     TemplateArg(same_dims), TemplateArg(factor != 1.0)))};
 
-        POST_LAUNCH_CHECK();
-    }
+    copy(qArgs, dst, src, default_value, factor);
 
+    POST_LAUNCH_CHECK();
 }
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
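Note on `vectorizeShape` above: the width selection can be followed on the host without any CUDA types. The sketch below is a simplified illustration, not the ArrayFire code. The function name `vectorWidth` and its flat parameter list are hypothetical, addresses are passed as plain integers, and `elemSize` stands in for `alignof(T)` on the assumption that both are equal for the element type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of ArrayFire): returns the widest
// power-of-two vector width, in elements, that the shape and the pointer
// alignment allow, capped at maxWidth.
inline long long vectorWidth(long long dim0, const long long inStrides[4],
                             const long long outStrides[4], int ndims,
                             std::uintptr_t inAddr, std::uintptr_t outAddr,
                             std::size_t elemSize, long long maxWidth) {
    assert(ndims >= 1 && ndims <= 4);
    // Only adjacent elements can be grouped into a wider type.
    if (maxWidth == 1 || inStrides[0] != 1 || outStrides[0] != 1) return 1;

    // OR together every value that must stay divisible by the vector width.
    long long global = dim0;
    for (int i = 1; i < ndims; ++i) global |= inStrides[i] | outStrides[i];

    long long width = 1;
    while ((global & 1) == 0 && width < maxWidth) {
        // Doubling the width requires both pointers to be aligned to the
        // size of the doubled type (assumption: alignof(T) == elemSize).
        const std::uintptr_t alignment = elemSize * width * 2;
        if (inAddr % alignment != 0 || outAddr % alignment != 0) break;
        width <<= 1;
        global >>= 1;
    }
    return width;
}
```

For a `float` array with `dims[0] == 8`, a second-dimension stride of 8 and 16-byte aligned pointers, the sketch settles on a width of 4, so the data could be copied as `float4`. With `dims[0] == 6` it stops at 2, and a pointer that is only 4-byte aligned keeps the width at 1.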
diff --git a/src/backend/cuda/kernel/moments.cuh b/src/backend/cuda/kernel/moments.cuh
new file mode 100644
index 0000000000..12703a6343
--- /dev/null
+++ b/src/backend/cuda/kernel/moments.cuh
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void moments(Param<float> out, CParam<T> in, af::momentType moment,
+                        const bool pBatch) {
+    const dim_t idw = blockIdx.y / in.dims[2];
+    const dim_t idz = blockIdx.y - idw * in.dims[2];
+
+    const dim_t idy = blockIdx.x;
+    dim_t idx       = threadIdx.x;
+
+    if (idy >= in.dims[1] || idz >= in.dims[2] || idw >= in.dims[3]) return;
+
+    extern __shared__ float blk_moment_sum[];
+    if (threadIdx.x < out.dims[0]) { blk_moment_sum[threadIdx.x] = 0.f; }
+    __syncthreads();
+
+    dim_t mId = idy * in.strides[1] + idx;
+    if (pBatch) { mId += idw * in.strides[3] + idz * in.strides[2]; }
+
+    for (; idx < in.dims[0]; idx += blockDim.x) {
+        dim_t m_off = 0;
+        float val   = (float)in.ptr[mId];
+        mId += blockDim.x;
+
+        if ((moment & AF_MOMENT_M00) > 0) {
+            atomicAdd(blk_moment_sum + m_off++, val);
+        }
+        if ((moment & AF_MOMENT_M01) > 0) {
+            atomicAdd(blk_moment_sum + m_off++, idx * val);
+        }
+        if ((moment & AF_MOMENT_M10) > 0) {
+            atomicAdd(blk_moment_sum + m_off++, idy * val);
+        }
+        if ((moment & AF_MOMENT_M11) > 0) {
+            atomicAdd(blk_moment_sum + m_off, idx * idy * val);
+        }
+    }
+
+    __syncthreads();
+
+    float *offset = const_cast<float *>(
+        out.ptr + (idw * out.strides[3] + idz * out.strides[2]) + threadIdx.x);
+    if (threadIdx.x < out.dims[0])
+        atomicAdd(offset, blk_moment_sum[threadIdx.x]);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/moments.hpp b/src/backend/cuda/kernel/moments.hpp
new file mode 100644
index 0000000000..dcc1161b23
--- /dev/null
+++ b/src/backend/cuda/kernel/moments.hpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/moments_cuh.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const int THREADS = 128;
+
+template<typename T>
+void moments(Param<float> out, CParam<T> in, const af::momentType moment) {
+    auto moments =
+        common::getKernel("arrayfire::cuda::moments", {{moments_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
+
+    dim3 threads(THREADS, 1, 1);
+    dim3 blocks(in.dims[1], in.dims[2] * in.dims[3]);
+
+    bool pBatch = !(in.dims[2] == 1 && in.dims[3] == 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(),
+                      sizeof(float) * out.dims[0]);
+
+    moments(qArgs, out, in, moment, pBatch);
+
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
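Note on the `moments` kernel above: a plain CPU loop over one image makes the accumulated quantities easier to check. This is a minimal reference sketch under stated assumptions, not ArrayFire code. The `Moments` struct and `momentsRef` function are hypothetical names, the image is a single column-major 2D slice, and all four moments are always computed, whereas the kernel packs only the requested ones into consecutive output slots.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type (not part of ArrayFire).
struct Moments {
    float m00, m01, m10, m11;
};

// Reference for one column-major image of dim0 x dim1 elements, where
// consecutive columns are stride1 elements apart.
inline Moments momentsRef(const std::vector<float>& img, int dim0, int dim1,
                          int stride1) {
    assert(dim0 >= 0 && dim1 >= 0 && stride1 >= dim0);
    Moments m{0.f, 0.f, 0.f, 0.f};
    for (int idy = 0; idy < dim1; ++idy) {
        for (int idx = 0; idx < dim0; ++idx) {
            const std::size_t pos =
                static_cast<std::size_t>(idy) * stride1 + idx;
            const float val = img[pos];
            m.m00 += val;              // plain sum
            m.m01 += idx * val;        // weighted by the dim0 index
            m.m10 += idy * val;        // weighted by the dim1 index
            m.m11 += idx * idy * val;  // weighted by both indices
        }
    }
    return m;
}
```

As in the kernel, `idx` runs along the first dimension and `idy` along the second, so `m01` is weighted by the first-dimension index and `m10` by the second-dimension index.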
diff --git a/src/backend/cuda/kernel/morph.cuh b/src/backend/cuda/kernel/morph.cuh
new file mode 100644
index 0000000000..34e7a10e1c
--- /dev/null
+++ b/src/backend/cuda/kernel/morph.cuh
@@ -0,0 +1,231 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+
+// cFilter is used by both 2d morph and 3d morph
+// Maximum kernel size supported for 2d morph is 19x19*8 = 2888
+// Maximum kernel size supported for 3d morph is 7x7x7*8 = 2744
+// We will declare a char array as __constant__ array and allocate
+// size necessary to hold doubles of FILTER_LEN*FILTER_LEN
+__constant__ char
+    cFilter[MAX_MORPH_FILTER_LEN * MAX_MORPH_FILTER_LEN * sizeof(double)];
+
+namespace arrayfire {
+namespace cuda {
+
+__forceinline__ __device__ int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
+}
+
+template<typename T, bool isDilation>
+inline __device__ void load2ShrdMem(T* shrd, const T* const in, int lx, int ly,
+                                    int shrdStride, int dim0, int dim1, int gx,
+                                    int gy, int inStride1, int inStride0) {
+    T val = isDilation ? common::Binary<T, af_max_t>::init()
+                       : common::Binary<T, af_min_t>::init();
+    if (gx >= 0 && gx < dim0 && gy >= 0 && gy < dim1) {
+        val = in[lIdx(gx, gy, inStride1, inStride0)];
+    }
+    shrd[lIdx(lx, ly, shrdStride, 1)] = val;
+}
+
+// kernel assumes mask/filter is square and hence does the
+// necessary operations accordingly.
+//
+// Notes on template arguments for morph:
+//   * T is the data type of the image & kernel
+//   * isDilation indicates whether the current kernel invocation is a
+//   dilation (true) or an erosion (false)
+//   * SeLength is the structuring element length, a.k.a. the kernel window
+//   length. This template parameter takes precedence over the kernel argument
+//   `windLen`.
+//
+// Please make sure at least one of the following variables is not 0.
+//  * SeLength (structuring element a.k.a window/kernel)
+//  * windLen
+// If SeLength is > 0, then that will override the kernel argument.
+template<typename T, bool isDilation, int SeLength = 0>
+__global__ void morph(Param<T> out, CParam<T> in, int nBBS0, int nBBS1,
+                      int windLen = 0) {
+    windLen = (SeLength > 0 ? SeLength : windLen);
+
+    SharedMemory<T> shared;
+    T* shrdMem = shared.getPointer();
+
+    // calculate necessary offset and window parameters
+    const int halo = windLen / 2;
+    const int padding =
+        (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+    const int shrdLen  = blockDim.x + padding + 1;
+    const int shrdLen1 = blockDim.y + padding;
+
+    // gfor batch offsets
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+    const T* iptr =
+        (const T*)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    T* optr = (T*)out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    // global indices
+    const int gx = blockDim.x * (blockIdx.x - b2 * nBBS0) + lx;
+    const int gy = blockDim.y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    // pull image to local memory
+    for (int b = ly, gy2 = gy; b < shrdLen1;
+         b += blockDim.y, gy2 += blockDim.y) {
+        // move row_set get_local_size(1) along columns
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += blockDim.x, gx2 += blockDim.x) {
+            load2ShrdMem<T, isDilation>(
+                shrdMem, iptr, a, b, shrdLen, in.dims[0], in.dims[1],
+                gx2 - halo, gy2 - halo, in.strides[1], in.strides[0]);
+        }
+    }
+
+    int i = lx + halo;
+    int j = ly + halo;
+
+    __syncthreads();
+
+    const T* d_filt = (const T*)cFilter;
+    T acc           = isDilation ? common::Binary<T, af_max_t>::init()
+                                 : common::Binary<T, af_min_t>::init();
+#pragma unroll
+    for (int wj = 0; wj < windLen; ++wj) {
+        int joff   = wj * windLen;
+        int w_joff = (j + wj - halo) * shrdLen;
+#pragma unroll
+        for (int wi = 0; wi < windLen; ++wi) {
+            if (d_filt[joff + wi] > (T)0) {
+                T cur = shrdMem[w_joff + (i + wi - halo)];
+                if (isDilation)
+                    acc = max(acc, cur);
+                else
+                    acc = min(acc, cur);
+            }
+        }
+    }
+
+    if (gx < in.dims[0] && gy < in.dims[1]) {
+        int outIdx   = lIdx(gx, gy, out.strides[1], out.strides[0]);
+        optr[outIdx] = acc;
+    }
+}
+
+__forceinline__ __device__ int lIdx3D(int x, int y, int z, int stride2,
+                                      int stride1, int stride0) {
+    return (z * stride2 + y * stride1 + x * stride0);
+}
+
+template<typename T, bool isDilation>
+inline __device__ void load2ShrdVolume(T* shrd, const T* const in, int lx,
+                                       int ly, int lz, int shrdStride1,
+                                       int shrdStride2, int dim0, int dim1,
+                                       int dim2, int gx, int gy, int gz,
+                                       int inStride2, int inStride1,
+                                       int inStride0) {
+    T val = isDilation ? common::Binary<T, af_max_t>::init()
+                       : common::Binary<T, af_min_t>::init();
+    if (gx >= 0 && gx < dim0 && gy >= 0 && gy < dim1 && gz >= 0 && gz < dim2) {
+        val = in[gx * inStride0 + gy * inStride1 + gz * inStride2];
+    }
+    shrd[lx + ly * shrdStride1 + lz * shrdStride2] = val;
+}
+
+// kernel assumes mask/filter is cubic and hence does the
+// necessary operations accordingly.
+template<typename T, bool isDilation, int windLen>
+__global__ void morph3D(Param<T> out, CParam<T> in, int nBBS) {
+    SharedMemory<T> shared;
+    T* shrdMem = shared.getPointer();
+
+    const int halo = windLen / 2;
+    const int padding =
+        (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+
+    const int se_area  = windLen * windLen;
+    const int shrdLen  = blockDim.x + padding + 1;
+    const int shrdLen1 = blockDim.y + padding;
+    const int shrdLen2 = blockDim.z + padding;
+    const int shrdArea = shrdLen * shrdLen1;
+
+    // gfor batch offsets
+    unsigned batchId = blockIdx.x / nBBS;
+
+    const T* iptr = (const T*)in.ptr + (batchId * in.strides[3]);
+    T* optr       = (T*)out.ptr + (batchId * out.strides[3]);
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+    const int lz = threadIdx.z;
+
+    const int gx = blockDim.x * (blockIdx.x - batchId * nBBS) + lx;
+    const int gy = blockDim.y * blockIdx.y + ly;
+    const int gz = blockDim.z * blockIdx.z + lz;
+
+    for (int c = lz, gz2 = gz; c < shrdLen2;
+         c += blockDim.z, gz2 += blockDim.z) {
+        for (int b = ly, gy2 = gy; b < shrdLen1;
+             b += blockDim.y, gy2 += blockDim.y) {
+            for (int a = lx, gx2 = gx; a < shrdLen;
+                 a += blockDim.x, gx2 += blockDim.x) {
+                load2ShrdVolume<T, isDilation>(
+                    shrdMem, iptr, a, b, c, shrdLen, shrdArea, in.dims[0],
+                    in.dims[1], in.dims[2], gx2 - halo, gy2 - halo, gz2 - halo,
+                    in.strides[2], in.strides[1], in.strides[0]);
+            }
+        }
+    }
+
+    __syncthreads();
+    // indices of voxel owned by current thread
+    int i = lx + halo;
+    int j = ly + halo;
+    int k = lz + halo;
+
+    const T* d_filt = (const T*)cFilter;
+    T acc           = isDilation ? common::Binary<T, af_max_t>::init()
+                                 : common::Binary<T, af_min_t>::init();
+#pragma unroll
+    for (int wk = 0; wk < windLen; ++wk) {
+        int koff   = wk * se_area;
+        int w_koff = (k + wk - halo) * shrdArea;
+#pragma unroll
+        for (int wj = 0; wj < windLen; ++wj) {
+            int joff   = wj * windLen;
+            int w_joff = (j + wj - halo) * shrdLen;
+#pragma unroll
+            for (int wi = 0; wi < windLen; ++wi) {
+                if (d_filt[koff + joff + wi]) {
+                    T cur = shrdMem[w_koff + w_joff + i + wi - halo];
+                    if (isDilation)
+                        acc = max(acc, cur);
+                    else
+                        acc = min(acc, cur);
+                }
+            }
+        }
+    }
+
+    if (gx < in.dims[0] && gy < in.dims[1] && gz < in.dims[2]) {
+        int outIdx =
+            gz * out.strides[2] + gy * out.strides[1] + gx * out.strides[0];
+        optr[outIdx] = acc;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
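Note on the 2D `morph` kernel above: the border handling changed from clamping to padding with the identity value of the reduction, which a CPU reference shows compactly. The sketch below is an illustration under stated assumptions, not ArrayFire code. `morphRef` is a hypothetical name, the image is a dense column-major `float` array, and the identity values are assumed to be -infinity for dilation and +infinity for erosion, standing in for `common::Binary<T, op>::init()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU reference (not part of ArrayFire) for a square
// windLen x windLen structuring element on a dim0 x dim1 image.
inline std::vector<float> morphRef(const std::vector<float>& in, int dim0,
                                   int dim1, const std::vector<float>& mask,
                                   int windLen, bool isDilation) {
    assert(in.size() == static_cast<std::size_t>(dim0) * dim1);
    assert(mask.size() == static_cast<std::size_t>(windLen) * windLen);
    const int halo = windLen / 2;
    // Assumed identity of the reduction: it never wins a max (dilation)
    // or a min (erosion), so out-of-bounds neighbours are ignored.
    const float inf  = std::numeric_limits<float>::infinity();
    const float init = isDilation ? -inf : inf;

    std::vector<float> out(in.size());
    for (int gy = 0; gy < dim1; ++gy) {
        for (int gx = 0; gx < dim0; ++gx) {
            float acc = init;
            for (int wj = 0; wj < windLen; ++wj) {
                for (int wi = 0; wi < windLen; ++wi) {
                    if (mask[wj * windLen + wi] > 0.f) {
                        const int x = gx + wi - halo;
                        const int y = gy + wj - halo;
                        const bool inside =
                            x >= 0 && x < dim0 && y >= 0 && y < dim1;
                        const float cur = inside ? in[y * dim0 + x] : init;
                        acc = isDilation ? std::max(acc, cur)
                                         : std::min(acc, cur);
                    }
                }
            }
            out[gy * dim0 + gx] = acc;
        }
    }
    return out;
}
```

On a 3x3 image holding a single 1 at the centre, dilation with a full 3x3 mask yields all ones and erosion yields all zeros. Neighbours outside the image never contribute to either result.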
diff --git a/src/backend/cuda/kernel/morph.hpp b/src/backend/cuda/kernel/morph.hpp
index f7d29aa9fe..0aff8ff639 100644
--- a/src/backend/cuda/kernel/morph.hpp
+++ b/src/backend/cuda/kernel/morph.hpp
@@ -7,296 +7,40 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-#include "shared.hpp"
+#include <nvrtc_kernel_headers/morph_cuh.hpp>
 
-namespace cuda
-{
+#include <limits>
 
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-static const int MAX_MORPH_FILTER_LEN = 17;
-// cFilter is used by both 2d morph and 3d morph
-// Maximum kernel size supported for 2d morph is 19x19*8 = 2888
-// Maximum kernel size supported for 3d morph is 7x7x7*8 = 2744
-// We will declare a char array as __constant__ array and allocate
-// size necessary to hold doubles of FILTER_LEN*FILTER_LEN
-__constant__ char cFilter[MAX_MORPH_FILTER_LEN*MAX_MORPH_FILTER_LEN*sizeof(double)];
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-static const int CUBE_X    =  8;
-static const int CUBE_Y    =  8;
-static const int CUBE_Z    =  8;
-
-__forceinline__ __device__ int lIdx(int x, int y,
-        int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
-
-__forceinline__ __device__ int clamp(int f, int a, int b)
-{
-    return max(a, min(f, b));
-}
+static const int MAX_MORPH_FILTER_LEN = 19;
+static const int THREADS_X            = 16;
+static const int THREADS_Y            = 16;
+static const int CUBE_X               = 8;
+static const int CUBE_Y               = 8;
+static const int CUBE_Z               = 8;
 
 template<typename T>
-inline __device__ void load2ShrdMem(T * shrd, const T * const in,
-        int lx, int ly, int shrdStride,
-        int dim0, int dim1,
-        int gx, int gy,
-        int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-    shrd[ lIdx(lx, ly, shrdStride, 1) ] = in[ lIdx(gx_, gy_, inStride1, inStride0) ];
-}
-
-// kernel assumes mask/filter is square and hence does the
-// necessary operations accordingly.
-template<typename T, bool isDilation, int windLen>
-static __global__ void morphKernel(Param<T> out, CParam<T> in,
-                                   int nBBS0, int nBBS1)
-{
-    // get shared memory pointer
-    SharedMemory<T> shared;
-    T * shrdMem = shared.getPointer();
-
-    // calculate necessary offset and window parameters
-    const int halo   = windLen/2;
-    const int padding= 2*halo;
-    const int shrdLen= blockDim.x + padding + 1;
-
-    // gfor batch offsets
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-    const T* iptr    = (const T *) in.ptr + (b2 *  in.strides[2] + b3 *  in.strides[3]);
-    T*       optr    = (T *      )out.ptr + (b2 * out.strides[2] + b3 * out.strides[3]);
-
-    int gx, gy, i, j;
-    { //scoping out unnecessary variables
-    // local neighborhood indices
-    const int lx = threadIdx.x;
-    const int ly = threadIdx.y;
-
-    // global indices
-    gx = blockDim.x * (blockIdx.x-b2*nBBS0) + lx;
-    gy = blockDim.y * (blockIdx.y-b3*nBBS1) + ly;
-
-    // offset values for pulling image to local memory
-    int lx2      = lx + blockDim.x;
-    int ly2      = ly + blockDim.y;
-    int gx2      = gx + blockDim.x;
-    int gy2      = gy + blockDim.y;
-
-    // pull image to local memory
-    load2ShrdMem(shrdMem, iptr, lx, ly, shrdLen,
-                 in.dims[0], in.dims[1],
-                 gx-halo, gy-halo,
-                 in.strides[1], in.strides[0]);
-    if (lx<padding) {
-        load2ShrdMem(shrdMem, iptr, lx2, ly, shrdLen,
-                     in.dims[0], in.dims[1],
-                     gx2-halo, gy-halo,
-                     in.strides[1], in.strides[0]);
-    }
-    if (ly<padding) {
-        load2ShrdMem(shrdMem, iptr, lx, ly2, shrdLen,
-                     in.dims[0], in.dims[1],
-                     gx-halo, gy2-halo,
-                     in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2ShrdMem(shrdMem, iptr, lx2, ly2, shrdLen,
-                     in.dims[0], in.dims[1],
-                     gx2-halo, gy2-halo,
-                     in.strides[1], in.strides[0]);
-    }
-    i = lx + halo;
-    j = ly + halo;
-    }
-    __syncthreads();
-
-    const T * d_filt = (const T *)cFilter;
-    T acc = shrdMem[ lIdx(i, j, shrdLen, 1) ];
-#pragma unroll
-    for(int wj=0; wj<windLen; ++wj) {
-        int joff   = wj*windLen;
-        int w_joff = (j+wj-halo)*shrdLen;
-#pragma unroll
-        for(int wi=0; wi<windLen; ++wi) {
-            T cur  = shrdMem[w_joff + (i+wi-halo)];
-            if (d_filt[joff+wi]) {
-                if (isDilation)
-                    acc = max(acc, cur);
-                else
-                    acc = min(acc, cur);
-            }
-        }
-    }
-
-    if (gx<in.dims[0] && gy<in.dims[1]) {
-        int outIdx = lIdx(gx, gy, out.strides[1], out.strides[0]);
-        optr[outIdx] = acc;
-    }
-}
-
-__forceinline__ __device__ int lIdx3D(int x, int y, int z,
-        int stride2, int stride1, int stride0)
-{
-    return (z*stride2 + y*stride1 + x*stride0);
-}
-
-template<typename T>
-inline __device__ void load2ShrdVolume(T * shrd, const T * const in,
-        int lx, int ly, int lz,
-        int shrdStride1, int shrdStride2,
-        int dim0, int dim1, int dim2,
-        int gx, int gy, int gz,
-        int inStride2, int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx,0,dim0-1);
-    int gy_  = clamp(gy,0,dim1-1);
-    int gz_  = clamp(gz,0,dim2-1);
-    int shrdIdx = lx + ly*shrdStride1 + lz*shrdStride2;
-    int inIdx   = gx_*inStride0 + gy_*inStride1 + gz_*inStride2;
-    shrd[ shrdIdx ] = in[ inIdx ];
-}
-
-// kernel assumes mask/filter is square and hence does the
-// necessary operations accordingly.
-template<typename T, bool isDilation, int windLen>
-static __global__ void morph3DKernel(Param<T> out, CParam<T> in, int nBBS)
-{
-    // get shared memory pointer
-    SharedMemory<T> shared;
-    T * shrdMem = shared.getPointer();
-
-    const int halo      = windLen/2;
-    const int padding   = 2*halo;
-
-    const int se_area   = windLen*windLen;
-    const int shrdLen   = blockDim.x + padding + 1;
-    const int shrdArea  = shrdLen * (blockDim.y+padding);
+void morph(Param<T> out, CParam<T> in, CParam<T> mask, bool isDilation) {
+    const int windLen  = mask.dims[0];
+    const int SeLength = (windLen <= 10 ? windLen : 0);
 
-    // gfor batch offsets
-    unsigned batchId = blockIdx.x / nBBS;
+    auto morph = common::getKernel(
+        "arrayfire::cuda::morph", {{morph_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(isDilation),
+                     TemplateArg(SeLength)),
+        {{DefineValue(MAX_MORPH_FILTER_LEN)}});
 
-    const T* iptr    = (const T *) in.ptr + (batchId *  in.strides[3]);
-    T*       optr    = (T *      )out.ptr + (batchId * out.strides[3]);
+    morph.copyToReadOnly(morph.getDevPtr("cFilter"),
+                         reinterpret_cast<CUdeviceptr>(mask.ptr),
+                         mask.dims[0] * mask.dims[1] * sizeof(T));
 
-    int gx, gy, gz, i, j, k;
-    { // scoping out unnecessary variables
-    const int lx = threadIdx.x;
-    const int ly = threadIdx.y;
-    const int lz = threadIdx.z;
-
-    gx = blockDim.x * (blockIdx.x-batchId*nBBS) + lx;
-    gy = blockDim.y * blockIdx.y + ly;
-    gz = blockDim.z * blockIdx.z + lz;
-
-    const int gx2 = gx + blockDim.x;
-    const int gy2 = gy + blockDim.y;
-    const int gz2 = gz + blockDim.z;
-    const int lx2 = lx + blockDim.x;
-    const int ly2 = ly + blockDim.y;
-    const int lz2 = lz + blockDim.z;
-
-    // pull volume to shared memory
-    load2ShrdVolume(shrdMem, iptr, lx, ly, lz, shrdLen, shrdArea,
-                    in.dims[0], in.dims[1], in.dims[2],
-                    gx-halo, gy-halo, gz-halo,
-                    in.strides[2], in.strides[1], in.strides[0]);
-    if (lx<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx2, ly, lz, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx2-halo, gy-halo, gz-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (ly<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx, ly2, lz, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx-halo, gy2-halo, gz-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (lz<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx, ly, lz2, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx-halo, gy-halo, gz2-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx2, ly2, lz, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx2-halo, gy2-halo, gz-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (ly<padding && lz<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx, ly2, lz2, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx-halo, gy2-halo, gz2-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (lz<padding && lx<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx2, ly, lz2, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx2-halo, gy-halo, gz2-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding && lz<padding) {
-        load2ShrdVolume(shrdMem, iptr, lx2, ly2, lz2, shrdLen, shrdArea,
-                        in.dims[0], in.dims[1], in.dims[2],
-                        gx2-halo, gy2-halo, gz2-halo,
-                        in.strides[2], in.strides[1], in.strides[0]);
-    }
-    __syncthreads();
-    // indices of voxel owned by current thread
-    i  = lx + halo;
-    j  = ly + halo;
-    k  = lz + halo;
-    }
-
-    const T * d_filt = (const T *)cFilter;
-    T acc = shrdMem[ lIdx3D(i, j, k, shrdArea, shrdLen, 1) ];
-#pragma unroll
-    for(int wk=0; wk<windLen; ++wk) {
-        int koff   = wk*se_area;
-        int w_koff = (k+wk-halo)*shrdArea;
-#pragma unroll
-        for(int wj=0; wj<windLen; ++wj) {
-        int joff   = wj*windLen;
-        int w_joff = (j+wj-halo)*shrdLen;
-#pragma unroll
-            for(int wi=0; wi<windLen; ++wi) {
-                T cur  = shrdMem[w_koff+w_joff + i+wi-halo];
-                if (d_filt[koff+joff+wi]) {
-                    if (isDilation)
-                        acc = max(acc, cur);
-                    else
-                        acc = min(acc, cur);
-                }
-            }
-        }
-    }
-
-    if (gx<in.dims[0] && gy<in.dims[1] && gz<in.dims[2]) {
-        int outIdx = gz * out.strides[2] +
-                          gy * out.strides[1] +
-                          gx * out.strides[0];
-        optr[outIdx] = acc;
-    }
-}
-
-template<typename T, bool isDilation>
-void morph(Param<T> out, CParam<T> in, int windLen)
-{
     dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
 
     int blk_x = divup(in.dims[0], THREADS_X);
@@ -305,30 +49,34 @@ void morph(Param<T> out, CParam<T> in, int windLen)
     dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
 
     // calculate shared memory size
-    int halo      = windLen/2;
-    int padding   = 2*halo;
-    int shrdLen   = kernel::THREADS_X + padding + 1; // +1 for to avoid bank conflicts
-    int shrdSize  = shrdLen * (kernel::THREADS_Y + padding) * sizeof(T);
-
-    switch(windLen) {
-        case  3: morphKernel<T, isDilation, 3> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case  5: morphKernel<T, isDilation, 5> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case  7: morphKernel<T, isDilation, 7> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case  9: morphKernel<T, isDilation, 9> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case 11: morphKernel<T, isDilation,11> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case 13: morphKernel<T, isDilation,13> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case 15: morphKernel<T, isDilation,15> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case 17: morphKernel<T, isDilation,17> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        case 19: morphKernel<T, isDilation,19> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-        default: morphKernel<T, isDilation, 3> <<< blocks, threads, shrdSize>>>(out, in, blk_x, blk_y); break;
-    }
+    int padding = (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+    int shrdLen =
+        kernel::THREADS_X + padding + 1;  // +1 to avoid bank conflicts
+    int shrdSize = shrdLen * (kernel::THREADS_Y + padding) * sizeof(T);
 
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), shrdSize);
+    morph(qArgs, out, in, blk_x, blk_y, windLen);
     POST_LAUNCH_CHECK();
 }
 
-template<typename T, bool isDilation>
-void morph3d(Param<T> out, CParam<T> in, int windLen)
-{
+template<typename T>
+void morph3d(Param<T> out, CParam<T> in, CParam<T> mask, bool isDilation) {
+    const int windLen = mask.dims[0];
+
+    if (windLen > 7) {
+        CUDA_NOT_SUPPORTED("Morph 3D does not support kernels larger than 7.");
+    }
+
+    auto morph3D = common::getKernel(
+        "arrayfire::cuda::morph3D", {{morph_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(isDilation),
+                     TemplateArg(windLen)),
+        {{DefineValue(MAX_MORPH_FILTER_LEN)}});
+
+    morph3D.copyToReadOnly(
+        morph3D.getDevPtr("cFilter"), reinterpret_cast<CUdeviceptr>(mask.ptr),
+        mask.dims[0] * mask.dims[1] * mask.dims[2] * sizeof(T));
+
     dim3 threads(kernel::CUBE_X, kernel::CUBE_Y, kernel::CUBE_Z);
 
     int blk_x = divup(in.dims[0], CUBE_X);
@@ -337,20 +85,17 @@ void morph3d(Param<T> out, CParam<T> in, int windLen)
     dim3 blocks(blk_x * in.dims[3], blk_y, blk_z);
 
     // calculate shared memory size
-    int halo      = windLen/2;
-    int padding   = 2*halo;
-    int shrdLen   = kernel::CUBE_X + padding + 1; // +1 for to avoid bank conflicts
-    int shrdSize  = shrdLen * (kernel::CUBE_Y + padding) * (kernel::CUBE_Z + padding) * sizeof(T);
-
-    switch(windLen) {
-        case  3: morph3DKernel<T, isDilation, 3> <<< blocks, threads, shrdSize>>>(out, in, blk_x); break;
-        case  5: morph3DKernel<T, isDilation, 5> <<< blocks, threads, shrdSize>>>(out, in, blk_x); break;
-        case  7: morph3DKernel<T, isDilation, 7> <<< blocks, threads, shrdSize>>>(out, in, blk_x); break;
-        default: morph3DKernel<T, isDilation, 3> <<< blocks, threads, shrdSize>>>(out, in, blk_x); break;
-    }
-
+    int padding = (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+    int shrdLen =
+        kernel::CUBE_X + padding + 1;  // +1 to avoid bank conflicts
+    int shrdSize = shrdLen * (kernel::CUBE_Y + padding) *
+                   (kernel::CUBE_Z + padding) * sizeof(T);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), shrdSize);
+    morph3D(qArgs, out, in, blk_x);
     POST_LAUNCH_CHECK();
 }
 
-}
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/nearest_neighbour.hpp b/src/backend/cuda/kernel/nearest_neighbour.hpp
new file mode 100644
index 0000000000..a628c18a48
--- /dev/null
+++ b/src/backend/cuda/kernel/nearest_neighbour.hpp
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+namespace kernel {
+
+static const unsigned THREADS = 256;
+
+template<typename T, typename To, af_match_type dist_type>
+struct dist_op {
+    __DH__ To operator()(T v1, T v2) { return v1 - v2; }
+};
+
+template<typename T, typename To>
+struct dist_op<T, To, AF_SAD> {
+    __device__ To operator()(T v1, T v2) {
+        return fabsf((float)v1 - (float)v2);
+    }
+};
+
+template<typename To>
+struct dist_op<double, To, AF_SAD> {
+    __device__ To operator()(double v1, double v2) {
+        return fabs((double)v1 - (double)v2);
+    }
+};
+
+template<typename T, typename To>
+struct dist_op<T, To, AF_SSD> {
+    __device__ To operator()(T v1, T v2) { return (v1 - v2) * (v1 - v2); }
+};
+
+template<typename To>
+struct dist_op<uint, To, AF_SHD> {
+    __device__ To operator()(uint v1, uint v2) { return __popc(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<uintl, To, AF_SHD> {
+    __device__ To operator()(uintl v1, uintl v2) { return __popcll(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<ushort, To, AF_SHD> {
+    __device__ To operator()(ushort v1, ushort v2) { return __popc(v1 ^ v2); }
+};
+
+template<typename To>
+struct dist_op<uchar, To, AF_SHD> {
+    __device__ To operator()(uchar v1, uchar v2) { return __popc(v1 ^ v2); }
+};
+
+template<typename T, typename To, af_match_type dist_type, bool use_shmem>
+__global__ void all_distances(To* out_dist, CParam<T> query, CParam<T> train,
+                              const To max_dist, const unsigned feat_len,
+                              const unsigned max_feat_len,
+                              const unsigned feat_offset) {
+    unsigned nquery = query.dims[0];
+    unsigned ntrain = train.dims[0];
+
+    unsigned f   = blockDim.x * blockIdx.x + threadIdx.x;
+    unsigned tid = threadIdx.x;
+
+    __shared__ To s_dist[THREADS];
+
+    extern __shared__ char smem[];
+    T* s_query = (T*)smem;
+    T* s_train = (T*)smem + max_feat_len;
+
+    s_dist[tid] = max_dist;
+
+    bool valid_feat = (f < ntrain);
+
+    if (valid_feat) {
+        // Copy blockDim.x training features to shared memory
+        if (use_shmem) {
+            unsigned end_feat = min(feat_offset + max_feat_len, feat_len);
+            for (unsigned i = feat_offset; i < end_feat; i++) {
+                s_train[(i - feat_offset) * blockDim.x + tid] =
+                    train.ptr[i * ntrain + f];
+            }
+        }
+    }
+    __syncthreads();
+
+    dist_op<T, To, dist_type> op;
+
+    for (unsigned j = 0; j < nquery; j++) {
+        s_dist[tid] = max_dist;
+
+        // Load one query feature that will be tested against all training
+        // features in current block
+        if (tid < max_feat_len) {
+            s_query[tid] = query.ptr[(tid + feat_offset) * nquery + j];
+        }
+        __syncthreads();
+
+        To dist = 0;
+        if (valid_feat) {
+            unsigned feat_end = min(feat_offset + max_feat_len, feat_len);
+            for (unsigned k = feat_offset; k < feat_end; k++) {
+                // Compute the distance contribution of feature element k
+                // and accumulate it into dist
+                if (use_shmem) {
+                    dist += op(s_train[(k - feat_offset) * blockDim.x + tid],
+                               s_query[k - feat_offset]);
+                } else {
+                    dist +=
+                        op(train.ptr[k * ntrain + f], s_query[k - feat_offset]);
+                }
+            }
+
+            // Store the distance accumulated over this chunk of feature
+            // elements for the current training feature
+            s_dist[tid] = dist;
+        }
+
+        __syncthreads();
+        // Write the partial distance for this (query, train) pair; chunks
+        // after the first (feat_offset > 0) add to the stored value
+        if (valid_feat) {
+            if (feat_offset == 0)
+                out_dist[j * ntrain + f] = s_dist[tid];
+            else
+                out_dist[j * ntrain + f] += s_dist[tid];
+        }
+        __syncthreads();
+    }
+}
+
+template<typename T, typename To, af_match_type dist_type>
+void all_distances(Param<To> dist, CParam<T> query, CParam<T> train,
+                   const dim_t dist_dim) {
+    const dim_t feat_len = query.dims[dist_dim];
+    const unsigned max_kern_feat_len =
+        std::min(THREADS, static_cast<unsigned>(feat_len));
+    const To max_dist = maxval<To>();
+
+    const dim_t sample_dim = (dist_dim == 0) ? 1 : 0;
+
+    const unsigned ntrain = train.dims[sample_dim];
+
+    dim3 threads(THREADS, 1);
+    dim3 blocks(divup(ntrain, threads.x), 1);
+
+    // Determine maximum feat_len capable of using shared memory (faster)
+    int device          = getActiveDeviceId();
+    cudaDeviceProp prop = getDeviceProp(device);
+    size_t avail_smem   = prop.sharedMemPerBlock;
+    size_t smem_predef =
+        2 * THREADS * sizeof(unsigned) + max_kern_feat_len * sizeof(T);
+    size_t strain_sz = threads.x * max_kern_feat_len * sizeof(T);
+    bool use_shmem   = avail_smem >= (smem_predef + strain_sz);
+    unsigned smem_sz = (use_shmem) ? smem_predef + strain_sz : smem_predef;
+
+    // Compute the distance between every query and training vector,
+    // processing the feature dimension in chunks of THREADS elements
+    for (dim_t feat_offset = 0; feat_offset < feat_len;
+         feat_offset += THREADS) {
+        if (use_shmem) {
+            CUDA_LAUNCH_SMEM((all_distances<T, To, dist_type, true>), blocks,
+                             threads, smem_sz, dist.ptr, query, train, max_dist,
+                             feat_len, max_kern_feat_len, feat_offset);
+        } else {
+            CUDA_LAUNCH_SMEM((all_distances<T, To, dist_type, false>), blocks,
+                             threads, smem_sz, dist.ptr, query, train, max_dist,
+                             feat_len, max_kern_feat_len, feat_offset);
+        }
+    }
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/orb.hpp b/src/backend/cuda/kernel/orb.hpp
index e0ce695edf..c1df7620f5 100644
--- a/src/backend/cuda/kernel/orb.hpp
+++ b/src/backend/cuda/kernel/orb.hpp
@@ -7,67 +7,56 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
+#pragma once
+
+#include <LookupTable1D.hpp>
+#include <common/dispatch.hpp>
 #include <debug_cuda.hpp>
+#include <kernel/convolve.hpp>
+#include <kernel/orb_patch.hpp>
+#include <kernel/range.hpp>
+#include <kernel/sort_by_key.hpp>
 #include <memory.hpp>
 
-#include <convolve_common.hpp>
-#include "convolve.hpp"
-#include "orb_patch.hpp"
-#include "sort_index.hpp"
-
-#include <boost/scoped_ptr.hpp>
-
+using std::unique_ptr;
 using std::vector;
-using boost::scoped_ptr;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-namespace kernel
-{
+constexpr int THREADS   = 256;
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
 
-static const int THREADS   = 256;
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-static const float PI_VAL = 3.14159265358979323846f;
+constexpr float PI_VAL = 3.14159265358979323846f;
 
 template<typename T>
-void gaussian1D(T* out, const int dim, double sigma=0.0)
-{
-    if(!(sigma>0)) sigma = 0.25*dim;
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
 
     T sum = (T)0;
-    for(int i=0;i<dim;i++)
-    {
-        int x = i-(dim-1)/2;
-        T el = 1. / sqrt(2 * PI_VAL * sigma*sigma) * exp(-((x*x)/(2*(sigma*sigma))));
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * PI_VAL * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
         out[i] = el;
-        sum   += el;
+        sum += el;
     }
 
-    for(int k=0;k<dim;k++)
-        out[k] /= sum;
+    for (int k = 0; k < dim; k++) out[k] /= sum;
 }
 
-inline __device__ float block_reduce_sum(float val)
-{
-    __shared__ float data[THREADS_X*THREADS_Y];
+inline __device__ float block_reduce_sum(float val) {
+    __shared__ float data[THREADS_X * THREADS_Y];
 
     unsigned idx = threadIdx.x * blockDim.x + threadIdx.y;
 
     data[idx] = val;
     __syncthreads();
 
-    for (unsigned i = blockDim.y / 2; i > 0; i >>= 1)
-    {
-        if (threadIdx.y < i)
-        {
-            data[idx] += data[idx + i];
-        }
+    for (unsigned i = blockDim.y / 2; i > 0; i >>= 1) {
+        if (threadIdx.y < i) { data[idx] += data[idx + i]; }
 
         __syncthreads();
     }
@@ -76,24 +65,17 @@ inline __device__ float block_reduce_sum(float val)
 }
 
 template<typename T>
-__global__ void keep_features(
-    float* x_out,
-    float* y_out,
-    float* score_out,
-    float* size_out,
-    const float* x_in,
-    const float* y_in,
-    const float* score_in,
-    const unsigned* score_idx,
-    const float* size_in,
-    const unsigned n_feat)
-{
+__global__ void keep_features(float* x_out, float* y_out, float* score_out,
+                              float* size_out, const float* x_in,
+                              const float* y_in, const float* score_in,
+                              const unsigned* score_idx, const float* size_in,
+                              const unsigned n_feat) {
     unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
 
     // Keep only the first n_feat features
     if (f < n_feat) {
-        x_out[f] = x_in[score_idx[f]];
-        y_out[f] = y_in[score_idx[f]];
+        x_out[f]     = x_in[score_idx[f]];
+        y_out[f]     = y_in[score_idx[f]];
         score_out[f] = score_in[f];
         if (size_in != NULL && size_out != NULL)
             size_out[f] = size_in[score_idx[f]];
@@ -101,95 +83,87 @@ __global__ void keep_features(
 }
 
 template<typename T, bool use_scl>
-__global__ void harris_response(
-        float* score_out,
-        float* size_out,
-        const float* x_in,
-        const float* y_in,
-        const float* scl_in,
-        const unsigned total_feat,
-        CParam<T> image,
-        const unsigned block_size,
-        const float k_thr,
-        const unsigned patch_size)
-{
+__global__ void harris_response(float* score_out, float* size_out,
+                                const float* x_in, const float* y_in,
+                                const float* scl_in, const unsigned total_feat,
+                                CParam<T> image, const unsigned block_size,
+                                const float k_thr, const unsigned patch_size) {
     unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
 
+    float ixx = 0.f, iyy = 0.f, ixy = 0.f;
+    float size = 0.f;
+
     if (f < total_feat) {
         unsigned x, y;
         float scl = 1.f;
         if (use_scl) {
             // Update x and y coordinates according to scale
             scl = scl_in[f];
-            x = (unsigned)round(x_in[f] * scl);
-            y = (unsigned)round(y_in[f] * scl);
-        }
-        else {
+            x   = (unsigned)round(x_in[f] * scl);
+            y   = (unsigned)round(y_in[f] * scl);
+        } else {
             x = (unsigned)round(x_in[f]);
             y = (unsigned)round(y_in[f]);
         }
 
         // Round feature size to nearest odd integer
-        float size = 2.f * floor((patch_size * scl) / 2.f) + 1.f;
+        size = 2.f * floor((patch_size * scl) / 2.f) + 1.f;
 
         // Avoid keeping features that might be too wide and might not fit on
         // the image, sqrt(2.f) is the radius when angle is 45 degrees and
         // represents widest case possible
         unsigned patch_r = ceil(size * sqrt(2.f) / 2.f);
-        if (x < patch_r || y < patch_r || x >= image.dims[1] - patch_r || y >= image.dims[0] - patch_r)
+        if (x < patch_r || y < patch_r || x >= image.dims[1] - patch_r ||
+            y >= image.dims[0] - patch_r)
             return;
 
         unsigned r = block_size / 2;
 
-        float ixx = 0.f, iyy = 0.f, ixy = 0.f;
         unsigned block_size_sq = block_size * block_size;
         for (unsigned k = threadIdx.y; k < block_size_sq; k += blockDim.y) {
             int i = k / block_size - r;
             int j = k % block_size - r;
 
             // Calculate local x and y derivatives
-            float ix = image.ptr[(x+i+1) * image.dims[0] + y+j] - image.ptr[(x+i-1) * image.dims[0] + y+j];
-            float iy = image.ptr[(x+i) * image.dims[0] + y+j+1] - image.ptr[(x+i) * image.dims[0] + y+j-1];
+            float ix = image.ptr[(x + i + 1) * image.dims[0] + y + j] -
+                       image.ptr[(x + i - 1) * image.dims[0] + y + j];
+            float iy = image.ptr[(x + i) * image.dims[0] + y + j + 1] -
+                       image.ptr[(x + i) * image.dims[0] + y + j - 1];
 
             // Accumulate second order derivatives
-            ixx += ix*ix;
-            iyy += iy*iy;
-            ixy += ix*iy;
+            ixx += ix * ix;
+            iyy += iy * iy;
+            ixy += ix * iy;
         }
-        __syncthreads();
+    }
+    __syncthreads();
 
-        ixx = block_reduce_sum(ixx);
-        iyy = block_reduce_sum(iyy);
-        ixy = block_reduce_sum(ixy);
+    ixx = block_reduce_sum(ixx);
+    iyy = block_reduce_sum(iyy);
+    ixy = block_reduce_sum(ixy);
 
-        if (threadIdx.y == 0) {
-            float tr = ixx + iyy;
-            float det = ixx*iyy - ixy*ixy;
+    if (f < total_feat && threadIdx.y == 0) {
+        float tr  = ixx + iyy;
+        float det = ixx * iyy - ixy * ixy;
 
-            // Calculate Harris responses
-            float resp = det - k_thr * (tr*tr);
+        // Calculate Harris responses
+        float resp = det - k_thr * (tr * tr);
 
-            // Scale factor
-            // TODO: improve response scaling
-            float rscale = 0.001f;
-            rscale = rscale * rscale * rscale * rscale;
+        // Scale factor
+        // TODO: improve response scaling
+        float rscale = 0.001f;
+        rscale       = rscale * rscale * rscale * rscale;
 
-            score_out[f] = resp * rscale;
-            if (use_scl)
-                size_out[f] = size;
-        }
+        score_out[f] = resp * rscale;
+        if (use_scl) size_out[f] = size;
     }
 }
 
 template<typename T>
-__global__ void centroid_angle(
-    const float* x_in,
-    const float* y_in,
-    float* orientation_out,
-    const unsigned total_feat,
-    CParam<T> image,
-    const unsigned patch_size)
-{
+__global__ void centroid_angle(const float* x_in, const float* y_in,
+                               float* orientation_out,
+                               const unsigned total_feat, CParam<T> image,
+                               const unsigned patch_size) {
     unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
 
     if (f < total_feat) {
@@ -207,7 +181,7 @@ __global__ void centroid_angle(
             int j = k % patch_size - r;
 
             // Calculate first order moments
-            T p = image.ptr[(x+i) * image.dims[0] + y+j];
+            T p = image.ptr[(x + i) * image.dims[0] + y + j];
             m01 += j * p;
             m10 += i * p;
         }
@@ -216,25 +190,19 @@ __global__ void centroid_angle(
         m10 = block_reduce_sum(m10);
 
         if (threadIdx.y == 0) {
-            float angle = atan2((float)m01, (float)m10);
+            float angle        = atan2((float)m01, (float)m10);
             orientation_out[f] = angle;
         }
     }
 }
 
 template<typename T>
-inline __device__ T get_pixel(
-    unsigned x,
-    unsigned y,
-    const float ori,
-    const unsigned size,
-    const int dist_x,
-    const int dist_y,
-    CParam<T> image,
-    const unsigned patch_size)
-{
-    float ori_sin = sin(ori);
-    float ori_cos = cos(ori);
+inline __device__ T get_pixel(unsigned x, unsigned y, const float ori,
+                              const unsigned size, const int dist_x,
+                              const int dist_y, CParam<T> image,
+                              const unsigned patch_size) {
+    float ori_sin   = sin(ori);
+    float ori_cos   = cos(ori);
     float patch_scl = (float)size / (float)patch_size;
 
     // Calculate point coordinates based on orientation and size
@@ -244,24 +212,23 @@ inline __device__ T get_pixel(
     return image.ptr[x * image.dims[0] + y];
 }
 
+inline __device__ int lookup(const int n, cudaTextureObject_t tex) {
+    return tex1Dfetch<int>(tex, n);
+}
+
 template<typename T>
-__global__ void extract_orb(
-    unsigned* desc_out,
-    const unsigned n_feat,
-    float* x_in_out,
-    float* y_in_out,
-    const float* ori_in,
-    float* size_out,
-    CParam<T> image,
-    const float scl,
-    const unsigned patch_size)
-{
+__global__ void extract_orb(unsigned* desc_out, const unsigned n_feat,
+                            float* x_in_out, float* y_in_out,
+                            const float* ori_in, float* size_out,
+                            CParam<T> image, const float scl,
+                            const unsigned patch_size,
+                            cudaTextureObject_t luTable) {
     unsigned f = blockDim.x * blockIdx.x + threadIdx.x;
 
     if (f < n_feat) {
-        unsigned x = (unsigned)round(x_in_out[f]);
-        unsigned y = (unsigned)round(y_in_out[f]);
-        float ori = ori_in[f];
+        unsigned x    = (unsigned)round(x_in_out[f]);
+        unsigned y    = (unsigned)round(y_in_out[f]);
+        float ori     = ori_in[f];
         unsigned size = patch_size;
 
         unsigned r = ceil(patch_size * sqrt(2.f) / 2.f);
@@ -275,21 +242,25 @@ __global__ void extract_orb(
 
             // j < 16 for 256 bits descriptor
             for (unsigned j = 0; j < 16; j++) {
-                // Get position from distribution pattern and values of points p1 and p2
-                int dist_x = d_ref_pat[i*16*4 + j*4];
-                int dist_y = d_ref_pat[i*16*4 + j*4+1];
-                T p1 = get_pixel(x, y, ori, size, dist_x, dist_y, image, patch_size);
-
-                dist_x = d_ref_pat[i*16*4 + j*4+2];
-                dist_y = d_ref_pat[i*16*4 + j*4+3];
-                T p2 = get_pixel(x, y, ori, size, dist_x, dist_y, image, patch_size);
-
-                // Calculate bit based on p1 and p2 and shifts it to correct position
-                v |= (p1 < p2) << (j + 16*(i % 2));
+                // Get position from distribution pattern and values of points
+                // p1 and p2
+                int dist_x = lookup(i * 16 * 4 + j * 4, luTable);
+                int dist_y = lookup(i * 16 * 4 + j * 4 + 1, luTable);
+                T p1       = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                       patch_size);
+
+                dist_x = lookup(i * 16 * 4 + j * 4 + 2, luTable);
+                dist_y = lookup(i * 16 * 4 + j * 4 + 3, luTable);
+                T p2   = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                   patch_size);
+
+                // Compute the bit from p1 and p2 and shift it to the
+                // correct position
+                v |= (p1 < p2) << (j + 16 * (i % 2));
             }
 
             // Store 16 bits of descriptor
-            atomicAdd(&desc_out[f * 8 + i/2], v);
+            atomicAdd(&desc_out[f * 8 + i / 2], v);
         }
 
         if (threadIdx.y == 0) {
@@ -300,35 +271,28 @@ __global__ void extract_orb(
     }
 }
 
-
-
 template<typename T, typename convAccT>
-void orb(unsigned* out_feat,
-         float** d_x,
-         float** d_y,
-         float** d_score,
-         float** d_ori,
-         float** d_size,
-         unsigned** d_desc,
-         vector<unsigned>& feat_pyr,
-         vector<float*>& d_x_pyr,
-         vector<float*>& d_y_pyr,
-         vector<unsigned>& lvl_best,
-         vector<float>& lvl_scl,
-         vector<CParam<T> >& img_pyr,
-         const float fast_thr,
-         const unsigned max_feat,
-         const float scl_fctr,
-         const unsigned levels,
-         const bool blur_img)
-{
+void orb(unsigned* out_feat, float** d_x, float** d_y, float** d_score,
+         float** d_ori, float** d_size, unsigned** d_desc,
+         vector<unsigned>& feat_pyr, vector<float*>& d_x_pyr,
+         vector<float*>& d_y_pyr, vector<unsigned>& lvl_best,
+         vector<float>& lvl_scl, vector<Array<T>>& img_pyr,
+         const float fast_thr, const unsigned max_feat, const float scl_fctr,
+         const unsigned levels, const bool blur_img,
+         const LookupTable1D<int>& luTable) {
+    UNUSED(fast_thr);
+    UNUSED(max_feat);
+    UNUSED(scl_fctr);
+    UNUSED(levels);
     unsigned patch_size = REF_PAT_SIZE;
 
     unsigned max_levels = feat_pyr.size();
 
     // In future implementations, the user will be capable of passing his
     // distribution instead of using the reference one
-    //CUDA_CHECK(cudaMemcpyToSymbol(d_ref_pat, h_ref_pat, 256 * 4 * sizeof(int), 0, cudaMemcpyHostToDevice));
+    // CUDA_CHECK(cudaMemcpyToSymbolAsync(d_ref_pat, h_ref_pat, 256 * 4 *
+    // sizeof(int), 0,
+    // cudaMemcpyHostToDevice, getActiveStream()));
 
     vector<float*> d_score_pyr(max_levels);
     vector<float*> d_ori_pyr(max_levels);
@@ -339,150 +303,101 @@ void orb(unsigned* out_feat,
     unsigned total_feat = 0;
 
     // Calculate a separable Gaussian kernel
-    Param<convAccT> gauss_filter;
+    Array<convAccT> gauss_filter = createEmptyArray<convAccT>(dim4());
     if (blur_img) {
         unsigned gauss_len = 9;
-        scoped_ptr<convAccT> h_gauss(new convAccT[gauss_len]);
-        gaussian1D(h_gauss.get(), gauss_len, 2.f);
-        gauss_filter.dims[0] = gauss_len;
-        gauss_filter.strides[0] = 1;
-
-        for (int k = 1; k < 4; k++) {
-            gauss_filter.dims[k] = 1;
-            gauss_filter.strides[k] = gauss_filter.dims[k - 1] * gauss_filter.strides[k - 1];
-        }
-
-        int gauss_elem = gauss_filter.strides[3] * gauss_filter.dims[3];
-        gauss_filter.ptr = memAlloc<convAccT>(gauss_elem);
-        CUDA_CHECK(cudaMemcpy(gauss_filter.ptr, h_gauss.get(), gauss_elem * sizeof(convAccT), cudaMemcpyHostToDevice));
+        vector<convAccT> h_gauss(gauss_len);
+        gaussian1D(h_gauss.data(), gauss_len, 2.f);
+        dim4 gauss_dim(gauss_len);
+        gauss_filter = createHostDataArray<convAccT>(gauss_dim, h_gauss.data());
+        CUDA_CHECK(cudaMemcpyAsync(gauss_filter.get(), h_gauss.data(),
+                                   h_gauss.size() * sizeof(convAccT),
+                                   cudaMemcpyHostToDevice, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
     }
 
     for (int i = 0; i < (int)max_levels; i++) {
-        if (feat_pyr[i] == 0 || lvl_best[i] == 0) {
-            if (i > 0)
-                memFree((T*)img_pyr[i].ptr);
-            continue;
-        }
+        if (feat_pyr[i] == 0 || lvl_best[i] == 0) { continue; }
 
-        float* d_score_harris = memAlloc<float>(feat_pyr[i]);
+        // auto d_score_harris = memAlloc<float>(feat_pyr[i]);
+        dim4 score_dim(feat_pyr[i]);
+        Array<float> d_score_harris =
+            createEmptyArray<float>(score_dim);  // harris_sorted
 
         // Calculate Harris responses
         // Good block_size >= 7 (must be an odd number)
         dim3 threads(THREADS_X, THREADS_Y);
         dim3 blocks(divup(feat_pyr[i], threads.x), 1);
-        harris_response<T,false><<<blocks, threads>>>(d_score_harris, NULL,
-                                                      d_x_pyr[i], d_y_pyr[i], NULL,
-                                                      feat_pyr[i],
-                                                      img_pyr[i], 7, 0.04f, patch_size);
+        CUDA_LAUNCH((harris_response<T, false>), blocks, threads,
+                    d_score_harris.get(), NULL, d_x_pyr[i], d_y_pyr[i], NULL,
+                    feat_pyr[i], img_pyr[i], 7, 0.04f, patch_size);
         POST_LAUNCH_CHECK();
 
-        Param<float> harris_sorted;
-        Param<unsigned> harris_idx;
-
-        harris_sorted.dims[0] = harris_idx.dims[0] = feat_pyr[i];
-        harris_sorted.strides[0] = harris_idx.strides[0] = 1;
+        dim4 feat_dim(feat_pyr[i]);
+        Array<unsigned> harris_idx = createEmptyArray<unsigned>(feat_dim);
 
-        for (int k = 1; k < 4; k++) {
-            harris_sorted.dims[k] = 1;
-            harris_sorted.strides[k] = harris_sorted.dims[k - 1] * harris_sorted.strides[k - 1];
-            harris_idx.dims[k] = 1;
-            harris_idx.strides[k] = harris_idx.dims[k - 1] * harris_idx.strides[k - 1];
-        }
-
-        int sort_elem = harris_sorted.strides[3] * harris_sorted.dims[3];
-        harris_sorted.ptr = d_score_harris;
-        harris_idx.ptr = memAlloc<unsigned>(sort_elem);
+        // Create indices using range
+        kernel::range<uint>(harris_idx, 0);
 
         // Sort features according to Harris responses
-        sort0_index<float, false>(harris_sorted, harris_idx);
+        kernel::sort0ByKey<float, uint>(d_score_harris, harris_idx, false);
 
         feat_pyr[i] = std::min(feat_pyr[i], lvl_best[i]);
 
-        float* d_x_lvl = memAlloc<float>(feat_pyr[i]);
-        float* d_y_lvl = memAlloc<float>(feat_pyr[i]);
-        float* d_score_lvl = memAlloc<float>(feat_pyr[i]);
+        float* d_x_lvl     = memAlloc<float>(feat_pyr[i]).release();
+        float* d_y_lvl     = memAlloc<float>(feat_pyr[i]).release();
+        float* d_score_lvl = memAlloc<float>(feat_pyr[i]).release();
 
         // Keep only features with higher Harris responses
         threads = dim3(THREADS, 1);
-        blocks = dim3(divup(feat_pyr[i], threads.x), 1);
-        keep_features<T><<<blocks, threads>>>(d_x_lvl, d_y_lvl, d_score_lvl, NULL,
-                                              d_x_pyr[i], d_y_pyr[i], harris_sorted.ptr, harris_idx.ptr,
-                                              NULL, feat_pyr[i]);
+        blocks  = dim3(divup(feat_pyr[i], threads.x), 1);
+        CUDA_LAUNCH((keep_features<T>), blocks, threads, d_x_lvl, d_y_lvl,
+                    d_score_lvl, NULL, d_x_pyr[i], d_y_pyr[i],
+                    d_score_harris.get(), harris_idx.get(), NULL, feat_pyr[i]);
         POST_LAUNCH_CHECK();
 
-        memFree(d_x_pyr[i]);
-        memFree(d_y_pyr[i]);
-        memFree(harris_sorted.ptr);
-        memFree(harris_idx.ptr);
-
-        float* d_ori_lvl = memAlloc<float>(feat_pyr[i]);
+        float* d_ori_lvl = memAlloc<float>(feat_pyr[i]).release();
 
         // Compute orientation of features
         threads = dim3(THREADS_X, THREADS_Y);
         blocks  = dim3(divup(feat_pyr[i], threads.x), 1);
-        centroid_angle<T><<<blocks, threads>>>(d_x_lvl, d_y_lvl, d_ori_lvl, feat_pyr[i],
-                                               img_pyr[i], patch_size);
+        CUDA_LAUNCH((centroid_angle<T>), blocks, threads, d_x_lvl, d_y_lvl,
+                    d_ori_lvl, feat_pyr[i], img_pyr[i], patch_size);
         POST_LAUNCH_CHECK();
 
-        Param<T> lvl_tmp;
-        Param<T> lvl_filt;
-
         if (blur_img) {
-            for (int k = 0; k < 4; k++) {
-                lvl_tmp.dims[k] = img_pyr[i].dims[k];
-                lvl_tmp.strides[k] = img_pyr[i].strides[k];
-                lvl_filt.dims[k] = img_pyr[i].dims[k];
-                lvl_filt.strides[k] = img_pyr[i].strides[k];
-            }
-
-            int lvl_elem = img_pyr[i].strides[3] * img_pyr[i].dims[3];
-            lvl_tmp.ptr = memAlloc<T>(lvl_elem);
-            lvl_filt.ptr = memAlloc<T>(lvl_elem);
+            Array<T> lvl_tmp = createEmptyArray<T>(img_pyr[i].dims());
 
             // Separable Gaussian filtering to reduce noise sensitivity
-            convolve2<T, convAccT, 0, false>(lvl_tmp, img_pyr[i], gauss_filter);
-            convolve2<T, convAccT, 1, false>(lvl_filt, CParam<T>(lvl_tmp), gauss_filter);
-
-            memFree(lvl_tmp.ptr);
-            if (i > 0)
-                memFree((T*)img_pyr[i].ptr);
-
-            img_pyr[i].ptr = lvl_filt.ptr;
-            for (int k = 0; k < 4; k++) {
-                img_pyr[i].dims[k] = lvl_filt.dims[k];
-                img_pyr[i].strides[k] = lvl_filt.strides[k];
-            }
+            convolve2<T, convAccT>(lvl_tmp, img_pyr[i], gauss_filter, 0, false);
+            convolve2<T, convAccT>(img_pyr[i], lvl_tmp, gauss_filter, 1, false);
         }
 
-        float* d_size_lvl = memAlloc<float>(feat_pyr[i]);
+        float* d_size_lvl = memAlloc<float>(feat_pyr[i]).release();
 
-        unsigned* d_desc_lvl = memAlloc<unsigned>(feat_pyr[i] * 8);
-        CUDA_CHECK(cudaMemset(d_desc_lvl, 0, feat_pyr[i] * 8 * sizeof(unsigned)));
+        unsigned* d_desc_lvl = memAlloc<unsigned>(feat_pyr[i] * 8).release();
+        CUDA_CHECK(cudaMemsetAsync(d_desc_lvl, 0,
+                                   feat_pyr[i] * 8 * sizeof(unsigned),
+                                   getActiveStream()));
 
         // Compute ORB descriptors
         threads = dim3(THREADS_X, THREADS_Y);
         blocks  = dim3(divup(feat_pyr[i], threads.x), 1);
-        extract_orb<T><<<blocks, threads>>>(d_desc_lvl, feat_pyr[i],
-                                            d_x_lvl, d_y_lvl, d_ori_lvl, d_size_lvl,
-                                            img_pyr[i], lvl_scl[i], patch_size);
+        CUDA_LAUNCH((extract_orb<T>), blocks, threads, d_desc_lvl, feat_pyr[i],
+                    d_x_lvl, d_y_lvl, d_ori_lvl, d_size_lvl, img_pyr[i],
+                    lvl_scl[i], patch_size, luTable.get());
         POST_LAUNCH_CHECK();
 
-        if (i > 0)
-            memFree((T*)img_pyr[i].ptr);
-
         // Store results to pyramids
         total_feat += feat_pyr[i];
-        d_x_pyr[i] = d_x_lvl;
-        d_y_pyr[i] = d_y_lvl;
+        d_x_pyr[i]     = d_x_lvl;
+        d_y_pyr[i]     = d_y_lvl;
         d_score_pyr[i] = d_score_lvl;
-        d_ori_pyr[i] = d_ori_lvl;
-        d_size_pyr[i] = d_size_lvl;
-        d_desc_pyr[i] = d_desc_lvl;
+        d_ori_pyr[i]   = d_ori_lvl;
+        d_size_pyr[i]  = d_size_lvl;
+        d_desc_pyr[i]  = d_desc_lvl;
     }
 
-    if (blur_img)
-        memFree((T*)gauss_filter.ptr);
-
     // If no features are found, set found features to 0 and return
     if (total_feat == 0) {
         *out_feat = 0;
@@ -490,27 +405,37 @@ void orb(unsigned* out_feat,
     }
 
     // Allocate output memory
-    *d_x     = memAlloc<float>(total_feat);
-    *d_y     = memAlloc<float>(total_feat);
-    *d_score = memAlloc<float>(total_feat);
-    *d_ori   = memAlloc<float>(total_feat);
-    *d_size  = memAlloc<float>(total_feat);
-    *d_desc  = memAlloc<unsigned>(total_feat * 8);
+    *d_x            = memAlloc<float>(total_feat).release();
+    *d_y            = memAlloc<float>(total_feat).release();
+    *d_score        = memAlloc<float>(total_feat).release();
+    *d_ori          = memAlloc<float>(total_feat).release();
+    *d_size         = memAlloc<float>(total_feat).release();
+    *d_desc         = memAlloc<unsigned>(total_feat * 8).release();
     unsigned offset = 0;
     for (unsigned i = 0; i < max_levels; i++) {
-        if (feat_pyr[i] == 0)
-            continue;
-
-        if (i > 0)
-            offset += feat_pyr[i-1];
-
-        CUDA_CHECK(cudaMemcpy(*d_x+offset, d_x_pyr[i], feat_pyr[i] * sizeof(float), cudaMemcpyDeviceToDevice));
-        CUDA_CHECK(cudaMemcpy(*d_y+offset, d_y_pyr[i], feat_pyr[i] * sizeof(float), cudaMemcpyDeviceToDevice));
-        CUDA_CHECK(cudaMemcpy(*d_score+offset, d_score_pyr[i], feat_pyr[i] * sizeof(float), cudaMemcpyDeviceToDevice));
-        CUDA_CHECK(cudaMemcpy(*d_ori+offset, d_ori_pyr[i], feat_pyr[i] * sizeof(float), cudaMemcpyDeviceToDevice));
-        CUDA_CHECK(cudaMemcpy(*d_size+offset, d_size_pyr[i], feat_pyr[i] * sizeof(float), cudaMemcpyDeviceToDevice));
-
-        CUDA_CHECK(cudaMemcpy(*d_desc+(offset*8), d_desc_pyr[i], feat_pyr[i] * 8 * sizeof(unsigned), cudaMemcpyDeviceToDevice));
+        if (i > 0) offset += feat_pyr[i - 1];
+
+        if (feat_pyr[i] == 0) continue;
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_x + offset, d_x_pyr[i], feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_y + offset, d_y_pyr[i], feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_score + offset, d_score_pyr[i], feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_ori + offset, d_ori_pyr[i], feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_size + offset, d_size_pyr[i], feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(*d_desc + (offset * 8), d_desc_pyr[i],
+                                   feat_pyr[i] * 8 * sizeof(unsigned),
+                                   cudaMemcpyDeviceToDevice,
+                                   getActiveStream()));
 
         memFree(d_x_pyr[i]);
         memFree(d_y_pyr[i]);
@@ -524,6 +449,6 @@ void orb(unsigned* out_feat,
     *out_feat = total_feat;
 }
 
-} // namespace kernel
-
-} // namespace cuda
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
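Note on the descriptor packing in `extract_orb`: the shift `j + 16 * (i % 2)` together with the store to `desc_out[f * 8 + i / 2]` suggests that each pair of consecutive `i` iterations fills the low and high 16-bit halves of one 32-bit word. Because the halves do not overlap and the buffer is zeroed first, the addition behaves like a bitwise OR. A minimal host-side sketch of that layout follows. It assumes 16 outer iterations of 16 comparisons each (inferred from the indexing, not shown in this hunk), and `packDescriptor` is a hypothetical name that does not exist in the library.

```cpp
#include <array>
#include <cstdint>

// Hypothetical host-side sketch (not ArrayFire code): pack 256 comparison
// results into 8 32-bit words using the same shift as the kernel.
// Assumption: i and j both run over [0, 16).
std::array<std::uint32_t, 8> packDescriptor(const std::array<bool, 256>& bits) {
    std::array<std::uint32_t, 8> desc{};  // zero-initialised, like the memset
    for (int i = 0; i < 16; ++i) {
        std::uint32_t v = 0;
        for (int j = 0; j < 16; ++j) {
            // Even i fills bits 0..15, odd i fills bits 16..31.
            v |= static_cast<std::uint32_t>(bits[i * 16 + j])
                 << (j + 16 * (i % 2));
        }
        // The halves are disjoint, so += gives the same result as |=.
        desc[i / 2] += v;
    }
    return desc;
}
```

In this sketch, comparison 0 lands in bit 0 of word 0, comparison 16 in bit 16 of word 0, and comparison 255 in bit 31 of word 7.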
diff --git a/src/backend/cuda/kernel/orb_patch.hpp b/src/backend/cuda/kernel/orb_patch.hpp
index 7330feed6e..8a384c24ad 100644
--- a/src/backend/cuda/kernel/orb_patch.hpp
+++ b/src/backend/cuda/kernel/orb_patch.hpp
@@ -9,281 +9,90 @@
 
 #pragma once
 
-namespace cuda
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
 
 // Reference pattern, generated for a patch size of 31x31, as suggested by
 // original ORB paper
-#define REF_PAT_SIZE 31
-#define REF_PAT_SAMPLES 256
-#define REF_PAT_COORDS 4
-#define REF_PAT_LENGTH (REF_PAT_SAMPLES*REF_PAT_COORDS)
+constexpr unsigned REF_PAT_SIZE    = 31;
+constexpr unsigned REF_PAT_SAMPLES = 256;
+constexpr unsigned REF_PAT_COORDS  = 4;
+constexpr unsigned REF_PAT_LENGTH  = (REF_PAT_SAMPLES * REF_PAT_COORDS);
 
 // Current reference pattern was borrowed from OpenCV, a randomly generated
 // pattern will not achieve same quality as it must be trained like described
 // in sections 4.2 and 4.3 of the original ORB paper.
-__constant__ int d_ref_pat[REF_PAT_LENGTH] = {
-    8,-3, 9,5,
-    4,2, 7,-12,
-    -11,9, -8,2,
-    7,-12, 12,-13,
-    2,-13, 2,12,
-    1,-7, 1,6,
-    -2,-10, -2,-4,
-    -13,-13, -11,-8,
-    -13,-3, -12,-9,
-    10,4, 11,9,
-    -13,-8, -8,-9,
-    -11,7, -9,12,
-    7,7, 12,6,
-    -4,-5, -3,0,
-    -13,2, -12,-3,
-    -9,0, -7,5,
-    12,-6, 12,-1,
-    -3,6, -2,12,
-    -6,-13, -4,-8,
-    11,-13, 12,-8,
-    4,7, 5,1,
-    5,-3, 10,-3,
-    3,-7, 6,12,
-    -8,-7, -6,-2,
-    -2,11, -1,-10,
-    -13,12, -8,10,
-    -7,3, -5,-3,
-    -4,2, -3,7,
-    -10,-12, -6,11,
-    5,-12, 6,-7,
-    5,-6, 7,-1,
-    1,0, 4,-5,
-    9,11, 11,-13,
-    4,7, 4,12,
-    2,-1, 4,4,
-    -4,-12, -2,7,
-    -8,-5, -7,-10,
-    4,11, 9,12,
-    0,-8, 1,-13,
-    -13,-2, -8,2,
-    -3,-2, -2,3,
-    -6,9, -4,-9,
-    8,12, 10,7,
-    0,9, 1,3,
-    7,-5, 11,-10,
-    -13,-6, -11,0,
-    10,7, 12,1,
-    -6,-3, -6,12,
-    10,-9, 12,-4,
-    -13,8, -8,-12,
-    -13,0, -8,-4,
-    3,3, 7,8,
-    5,7, 10,-7,
-    -1,7, 1,-12,
-    3,-10, 5,6,
-    2,-4, 3,-10,
-    -13,0, -13,5,
-    -13,-7, -12,12,
-    -13,3, -11,8,
-    -7,12, -4,7,
-    6,-10, 12,8,
-    -9,-1, -7,-6,
-    -2,-5, 0,12,
-    -12,5, -7,5,
-    3,-10, 8,-13,
-    -7,-7, -4,5,
-    -3,-2, -1,-7,
-    2,9, 5,-11,
-    -11,-13, -5,-13,
-    -1,6, 0,-1,
-    5,-3, 5,2,
-    -4,-13, -4,12,
-    -9,-6, -9,6,
-    -12,-10, -8,-4,
-    10,2, 12,-3,
-    7,12, 12,12,
-    -7,-13, -6,5,
-    -4,9, -3,4,
-    7,-1, 12,2,
-    -7,6, -5,1,
-    -13,11, -12,5,
-    -3,7, -2,-6,
-    7,-8, 12,-7,
-    -13,-7, -11,-12,
-    1,-3, 12,12,
-    2,-6, 3,0,
-    -4,3, -2,-13,
-    -1,-13, 1,9,
-    7,1, 8,-6,
-    1,-1, 3,12,
-    9,1, 12,6,
-    -1,-9, -1,3,
-    -13,-13, -10,5,
-    7,7, 10,12,
-    12,-5, 12,9,
-    6,3, 7,11,
-    5,-13, 6,10,
-    2,-12, 2,3,
-    3,8, 4,-6,
-    2,6, 12,-13,
-    9,-12, 10,3,
-    -8,4, -7,9,
-    -11,12, -4,-6,
-    1,12, 2,-8,
-    6,-9, 7,-4,
-    2,3, 3,-2,
-    6,3, 11,0,
-    3,-3, 8,-8,
-    7,8, 9,3,
-    -11,-5, -6,-4,
-    -10,11, -5,10,
-    -5,-8, -3,12,
-    -10,5, -9,0,
-    8,-1, 12,-6,
-    4,-6, 6,-11,
-    -10,12, -8,7,
-    4,-2, 6,7,
-    -2,0, -2,12,
-    -5,-8, -5,2,
-    7,-6, 10,12,
-    -9,-13, -8,-8,
-    -5,-13, -5,-2,
-    8,-8, 9,-13,
-    -9,-11, -9,0,
-    1,-8, 1,-2,
-    7,-4, 9,1,
-    -2,1, -1,-4,
-    11,-6, 12,-11,
-    -12,-9, -6,4,
-    3,7, 7,12,
-    5,5, 10,8,
-    0,-4, 2,8,
-    -9,12, -5,-13,
-    0,7, 2,12,
-    -1,2, 1,7,
-    5,11, 7,-9,
-    3,5, 6,-8,
-    -13,-4, -8,9,
-    -5,9, -3,-3,
-    -4,-7, -3,-12,
-    6,5, 8,0,
-    -7,6, -6,12,
-    -13,6, -5,-2,
-    1,-10, 3,10,
-    4,1, 8,-4,
-    -2,-2, 2,-13,
-    2,-12, 12,12,
-    -2,-13, 0,-6,
-    4,1, 9,3,
-    -6,-10, -3,-5,
-    -3,-13, -1,1,
-    7,5, 12,-11,
-    4,-2, 5,-7,
-    -13,9, -9,-5,
-    7,1, 8,6,
-    7,-8, 7,6,
-    -7,-4, -7,1,
-    -8,11, -7,-8,
-    -13,6, -12,-8,
-    2,4, 3,9,
-    10,-5, 12,3,
-    -6,-5, -6,7,
-    8,-3, 9,-8,
-    2,-12, 2,8,
-    -11,-2, -10,3,
-    -12,-13, -7,-9,
-    -11,0, -10,-5,
-    5,-3, 11,8,
-    -2,-13, -1,12,
-    -1,-8, 0,9,
-    -13,-11, -12,-5,
-    -10,-2, -10,11,
-    -3,9, -2,-13,
-    2,-3, 3,2,
-    -9,-13, -4,0,
-    -4,6, -3,-10,
-    -4,12, -2,-7,
-    -6,-11, -4,9,
-    6,-3, 6,11,
-    -13,11, -5,5,
-    11,11, 12,6,
-    7,-5, 12,-2,
-    -1,12, 0,7,
-    -4,-8, -3,-2,
-    -7,1, -6,7,
-    -13,-12, -8,-13,
-    -7,-2, -6,-8,
-    -8,5, -6,-9,
-    -5,-1, -4,5,
-    -13,7, -8,10,
-    1,5, 5,-13,
-    1,0, 10,-13,
-    9,12, 10,-1,
-    5,-8, 10,-9,
-    -1,11, 1,-13,
-    -9,-3, -6,2,
-    -1,-10, 1,12,
-    -13,1, -8,-10,
-    8,-11, 10,-6,
-    2,-13, 3,-6,
-    7,-13, 12,-9,
-    -10,-10, -5,-7,
-    -10,-8, -8,-13,
-    4,-6, 8,5,
-    3,12, 8,-13,
-    -4,2, -3,-3,
-    5,-13, 10,-12,
-    4,-13, 5,-1,
-    -9,9, -4,3,
-    0,3, 3,-9,
-    -12,1, -6,1,
-    3,2, 4,-8,
-    -10,-10, -10,9,
-    8,-13, 12,12,
-    -8,-12, -6,-5,
-    2,2, 3,7,
-    10,6, 11,-8,
-    6,8, 8,-12,
-    -7,10, -6,5,
-    -3,-9, -3,9,
-    -1,-13, -1,5,
-    -3,-7, -3,4,
-    -8,-2, -8,3,
-    4,2, 12,12,
-    2,-5, 3,11,
-    6,-9, 11,-13,
-    3,-1, 7,12,
-    11,-1, 12,4,
-    -3,0, -3,6,
-    4,-11, 4,12,
-    2,-4, 2,1,
-    -10,-6, -8,1,
-    -13,7, -11,1,
-    -13,12, -11,-13,
-    6,0, 11,-13,
-    0,-1, 1,4,
-    -13,3, -9,-2,
-    -9,8, -6,-3,
-    -13,-6, -8,-2,
-    5,-9, 8,10,
-    2,7, 3,-9,
-    -1,-6, -1,-1,
-    9,5, 11,-2,
-    11,-3, 12,-8,
-    3,0, 3,5,
-    -1,4, 0,10,
-    3,-6, 4,5,
-    -13,0, -10,5,
-    5,8, 12,11,
-    8,9, 9,-6,
-    7,-4, 8,-12,
-    -10,4, -10,9,
-    7,3, 12,4,
-    9,-7, 10,-2,
-    7,0, 12,-2,
-    -1,-6, 0,-11,
+int d_ref_pat[REF_PAT_LENGTH] = {
+    8,   -3,  9,   5,   4,   2,   7,   -12, -11, 9,   -8,  2,   7,   -12, 12,
+    -13, 2,   -13, 2,   12,  1,   -7,  1,   6,   -2,  -10, -2,  -4,  -13, -13,
+    -11, -8,  -13, -3,  -12, -9,  10,  4,   11,  9,   -13, -8,  -8,  -9,  -11,
+    7,   -9,  12,  7,   7,   12,  6,   -4,  -5,  -3,  0,   -13, 2,   -12, -3,
+    -9,  0,   -7,  5,   12,  -6,  12,  -1,  -3,  6,   -2,  12,  -6,  -13, -4,
+    -8,  11,  -13, 12,  -8,  4,   7,   5,   1,   5,   -3,  10,  -3,  3,   -7,
+    6,   12,  -8,  -7,  -6,  -2,  -2,  11,  -1,  -10, -13, 12,  -8,  10,  -7,
+    3,   -5,  -3,  -4,  2,   -3,  7,   -10, -12, -6,  11,  5,   -12, 6,   -7,
+    5,   -6,  7,   -1,  1,   0,   4,   -5,  9,   11,  11,  -13, 4,   7,   4,
+    12,  2,   -1,  4,   4,   -4,  -12, -2,  7,   -8,  -5,  -7,  -10, 4,   11,
+    9,   12,  0,   -8,  1,   -13, -13, -2,  -8,  2,   -3,  -2,  -2,  3,   -6,
+    9,   -4,  -9,  8,   12,  10,  7,   0,   9,   1,   3,   7,   -5,  11,  -10,
+    -13, -6,  -11, 0,   10,  7,   12,  1,   -6,  -3,  -6,  12,  10,  -9,  12,
+    -4,  -13, 8,   -8,  -12, -13, 0,   -8,  -4,  3,   3,   7,   8,   5,   7,
+    10,  -7,  -1,  7,   1,   -12, 3,   -10, 5,   6,   2,   -4,  3,   -10, -13,
+    0,   -13, 5,   -13, -7,  -12, 12,  -13, 3,   -11, 8,   -7,  12,  -4,  7,
+    6,   -10, 12,  8,   -9,  -1,  -7,  -6,  -2,  -5,  0,   12,  -12, 5,   -7,
+    5,   3,   -10, 8,   -13, -7,  -7,  -4,  5,   -3,  -2,  -1,  -7,  2,   9,
+    5,   -11, -11, -13, -5,  -13, -1,  6,   0,   -1,  5,   -3,  5,   2,   -4,
+    -13, -4,  12,  -9,  -6,  -9,  6,   -12, -10, -8,  -4,  10,  2,   12,  -3,
+    7,   12,  12,  12,  -7,  -13, -6,  5,   -4,  9,   -3,  4,   7,   -1,  12,
+    2,   -7,  6,   -5,  1,   -13, 11,  -12, 5,   -3,  7,   -2,  -6,  7,   -8,
+    12,  -7,  -13, -7,  -11, -12, 1,   -3,  12,  12,  2,   -6,  3,   0,   -4,
+    3,   -2,  -13, -1,  -13, 1,   9,   7,   1,   8,   -6,  1,   -1,  3,   12,
+    9,   1,   12,  6,   -1,  -9,  -1,  3,   -13, -13, -10, 5,   7,   7,   10,
+    12,  12,  -5,  12,  9,   6,   3,   7,   11,  5,   -13, 6,   10,  2,   -12,
+    2,   3,   3,   8,   4,   -6,  2,   6,   12,  -13, 9,   -12, 10,  3,   -8,
+    4,   -7,  9,   -11, 12,  -4,  -6,  1,   12,  2,   -8,  6,   -9,  7,   -4,
+    2,   3,   3,   -2,  6,   3,   11,  0,   3,   -3,  8,   -8,  7,   8,   9,
+    3,   -11, -5,  -6,  -4,  -10, 11,  -5,  10,  -5,  -8,  -3,  12,  -10, 5,
+    -9,  0,   8,   -1,  12,  -6,  4,   -6,  6,   -11, -10, 12,  -8,  7,   4,
+    -2,  6,   7,   -2,  0,   -2,  12,  -5,  -8,  -5,  2,   7,   -6,  10,  12,
+    -9,  -13, -8,  -8,  -5,  -13, -5,  -2,  8,   -8,  9,   -13, -9,  -11, -9,
+    0,   1,   -8,  1,   -2,  7,   -4,  9,   1,   -2,  1,   -1,  -4,  11,  -6,
+    12,  -11, -12, -9,  -6,  4,   3,   7,   7,   12,  5,   5,   10,  8,   0,
+    -4,  2,   8,   -9,  12,  -5,  -13, 0,   7,   2,   12,  -1,  2,   1,   7,
+    5,   11,  7,   -9,  3,   5,   6,   -8,  -13, -4,  -8,  9,   -5,  9,   -3,
+    -3,  -4,  -7,  -3,  -12, 6,   5,   8,   0,   -7,  6,   -6,  12,  -13, 6,
+    -5,  -2,  1,   -10, 3,   10,  4,   1,   8,   -4,  -2,  -2,  2,   -13, 2,
+    -12, 12,  12,  -2,  -13, 0,   -6,  4,   1,   9,   3,   -6,  -10, -3,  -5,
+    -3,  -13, -1,  1,   7,   5,   12,  -11, 4,   -2,  5,   -7,  -13, 9,   -9,
+    -5,  7,   1,   8,   6,   7,   -8,  7,   6,   -7,  -4,  -7,  1,   -8,  11,
+    -7,  -8,  -13, 6,   -12, -8,  2,   4,   3,   9,   10,  -5,  12,  3,   -6,
+    -5,  -6,  7,   8,   -3,  9,   -8,  2,   -12, 2,   8,   -11, -2,  -10, 3,
+    -12, -13, -7,  -9,  -11, 0,   -10, -5,  5,   -3,  11,  8,   -2,  -13, -1,
+    12,  -1,  -8,  0,   9,   -13, -11, -12, -5,  -10, -2,  -10, 11,  -3,  9,
+    -2,  -13, 2,   -3,  3,   2,   -9,  -13, -4,  0,   -4,  6,   -3,  -10, -4,
+    12,  -2,  -7,  -6,  -11, -4,  9,   6,   -3,  6,   11,  -13, 11,  -5,  5,
+    11,  11,  12,  6,   7,   -5,  12,  -2,  -1,  12,  0,   7,   -4,  -8,  -3,
+    -2,  -7,  1,   -6,  7,   -13, -12, -8,  -13, -7,  -2,  -6,  -8,  -8,  5,
+    -6,  -9,  -5,  -1,  -4,  5,   -13, 7,   -8,  10,  1,   5,   5,   -13, 1,
+    0,   10,  -13, 9,   12,  10,  -1,  5,   -8,  10,  -9,  -1,  11,  1,   -13,
+    -9,  -3,  -6,  2,   -1,  -10, 1,   12,  -13, 1,   -8,  -10, 8,   -11, 10,
+    -6,  2,   -13, 3,   -6,  7,   -13, 12,  -9,  -10, -10, -5,  -7,  -10, -8,
+    -8,  -13, 4,   -6,  8,   5,   3,   12,  8,   -13, -4,  2,   -3,  -3,  5,
+    -13, 10,  -12, 4,   -13, 5,   -1,  -9,  9,   -4,  3,   0,   3,   3,   -9,
+    -12, 1,   -6,  1,   3,   2,   4,   -8,  -10, -10, -10, 9,   8,   -13, 12,
+    12,  -8,  -12, -6,  -5,  2,   2,   3,   7,   10,  6,   11,  -8,  6,   8,
+    8,   -12, -7,  10,  -6,  5,   -3,  -9,  -3,  9,   -1,  -13, -1,  5,   -3,
+    -7,  -3,  4,   -8,  -2,  -8,  3,   4,   2,   12,  12,  2,   -5,  3,   11,
+    6,   -9,  11,  -13, 3,   -1,  7,   12,  11,  -1,  12,  4,   -3,  0,   -3,
+    6,   4,   -11, 4,   12,  2,   -4,  2,   1,   -10, -6,  -8,  1,   -13, 7,
+    -11, 1,   -13, 12,  -11, -13, 6,   0,   11,  -13, 0,   -1,  1,   4,   -13,
+    3,   -9,  -2,  -9,  8,   -6,  -3,  -13, -6,  -8,  -2,  5,   -9,  8,   10,
+    2,   7,   3,   -9,  -1,  -6,  -1,  -1,  9,   5,   11,  -2,  11,  -3,  12,
+    -8,  3,   0,   3,   5,   -1,  4,   0,   10,  3,   -6,  4,   5,   -13, 0,
+    -10, 5,   5,   8,   12,  11,  8,   9,   9,   -6,  7,   -4,  8,   -12, -10,
+    4,   -10, 9,   7,   3,   12,  4,   9,   -7,  10,  -2,  7,   0,   12,  -2,
+    -1,  -6,  0,   -11,
 };
 
-} // namespace kernel
-
-} // namespace cuda
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/pad_array_borders.cuh b/src/backend/cuda/kernel/pad_array_borders.cuh
new file mode 100644
index 0000000000..73df3261a7
--- /dev/null
+++ b/src/backend/cuda/kernel/pad_array_borders.cuh
@@ -0,0 +1,89 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<af::borderType BType>
+__device__ int idxByndEdge(const int i, const int lb, const int len) {
+    uint retVal;
+    switch (BType) {
+        case AF_PAD_SYM: retVal = trimIndex(i - lb, len); break;
+        case AF_PAD_CLAMP_TO_EDGE: retVal = clamp(i - lb, 0, len - 1); break;
+        case AF_PAD_PERIODIC: {
+            int rem   = (i - lb) % len;
+            bool cond = rem < 0;
+            retVal    = cond * (rem + len) + (1 - cond) * rem;
+        } break;
+        default: retVal = 0; break;  // AF_PAD_ZERO
+    }
+    return retVal;
+}
+
+template<typename T, af::borderType BType>
+__global__ void padBorders(Param<T> out, CParam<T> in, const int l0,
+                           const int l1, const int l2, const int l3,
+                           unsigned blk_x, unsigned blk_y) {
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+    const int k  = blockIdx.x / blk_x;
+    const int l  = blockIdx.y / blk_y;
+
+    const int blockIdx_x = blockIdx.x - (blk_x)*k;
+    const int blockIdx_y = blockIdx.y - (blk_y)*l;
+    const int i          = blockIdx_x * blockDim.x + lx;
+    const int j          = blockIdx_y * blockDim.y + ly;
+
+    const int d0 = in.dims[0];
+    const int d1 = in.dims[1];
+    const int d2 = in.dims[2];
+    const int d3 = in.dims[3];
+    const int s0 = in.strides[0];
+    const int s1 = in.strides[1];
+    const int s2 = in.strides[2];
+    const int s3 = in.strides[3];
+
+    const T* src = in.ptr;
+    T* dst       = out.ptr;
+
+    bool isNotPadding =
+        (l >= l3 && l < (d3 + l3)) && (k >= l2 && k < (d2 + l2)) &&
+        (j >= l1 && j < (d1 + l1)) && (i >= l0 && i < (d0 + l0));
+    T value = scalar<T>(0);
+
+    if (isNotPadding) {
+        unsigned iLOff = (l - l3) * s3;
+        unsigned iKOff = (k - l2) * s2;
+        unsigned iJOff = (j - l1) * s1;
+        unsigned iIOff = (i - l0) * s0;
+
+        value = src[iLOff + iKOff + iJOff + iIOff];
+    } else if (BType != AF_PAD_ZERO) {
+        unsigned iLOff = idxByndEdge<BType>(l, l3, d3) * s3;
+        unsigned iKOff = idxByndEdge<BType>(k, l2, d2) * s2;
+        unsigned iJOff = idxByndEdge<BType>(j, l1, d1) * s1;
+        unsigned iIOff = idxByndEdge<BType>(i, l0, d0) * s0;
+
+        value = src[iLOff + iKOff + iJOff + iIOff];
+    }
+
+    if (i < out.dims[0] && j < out.dims[1] && k < out.dims[2] &&
+        l < out.dims[3]) {
+        unsigned off = (l * out.strides[3] + k * out.strides[2] +
+                        j * out.strides[1] + i * out.strides[0]);
+        dst[off]     = value;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
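Note on `idxByndEdge`: the clamp-to-edge and periodic branches map an index in the padded output back to a valid input index. The periodic branch has to correct for the fact that `%` in C++ can return a negative remainder. The sketch below restates those two branches as plain host functions so the mapping can be checked by hand. The names `clampToEdgeIndex` and `periodicIndex` are hypothetical, and the symmetric branch is left out because `trimIndex` is not defined in this diff.

```cpp
#include <algorithm>

// Hypothetical host-side restatement of two idxByndEdge branches.
// i:   index along one axis of the padded output
// lb:  number of padded elements before the input starts on that axis
// len: input length on that axis

// AF_PAD_CLAMP_TO_EDGE: out-of-range positions reuse the nearest edge element.
int clampToEdgeIndex(int i, int lb, int len) {
    return std::min(std::max(i - lb, 0), len - 1);
}

// AF_PAD_PERIODIC: positions wrap around the input.
int periodicIndex(int i, int lb, int len) {
    int rem = (i - lb) % len;  // may be negative when i < lb
    return rem < 0 ? rem + len : rem;
}
```

For an input of length 4 padded by 2 on the left, output index 0 maps to input index 0 under clamp-to-edge and to input index 2 under periodic padding.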
diff --git a/src/backend/cuda/kernel/pad_array_borders.hpp b/src/backend/cuda/kernel/pad_array_borders.hpp
new file mode 100644
index 0000000000..b52fcf1401
--- /dev/null
+++ b/src/backend/cuda/kernel/pad_array_borders.hpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/pad_array_borders_cuh.hpp>
+#include <af/defines.h>
+
+#include <array>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const int PADB_THREADS_X = 32;
+static const int PADB_THREADS_Y = 8;
+
+template<typename T>
+void padBorders(Param<T> out, CParam<T> in, dim4 const lBoundPadding,
+                const af::borderType btype) {
+    auto padBorders = common::getKernel(
+        "arrayfire::cuda::padBorders", {{pad_array_borders_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(btype)));
+
+    dim3 threads(kernel::PADB_THREADS_X, kernel::PADB_THREADS_Y);
+
+    int blk_x = divup(out.dims[0], PADB_THREADS_X);
+    int blk_y = divup(out.dims[1], PADB_THREADS_Y);
+
+    dim3 blocks(blk_x * out.dims[2], blk_y * out.dims[3]);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    padBorders(qArgs, out, in, lBoundPadding[0], lBoundPadding[1],
+               lBoundPadding[2], lBoundPadding[3], blk_x, blk_y);
+
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/random.hpp b/src/backend/cuda/kernel/random.hpp
deleted file mode 100644
index 0c3167cd52..0000000000
--- a/src/backend/cuda/kernel/random.hpp
+++ /dev/null
@@ -1,178 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <curand_kernel.h>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
-#include <platform.hpp>
-#include <debug_cuda.hpp>
-
-namespace cuda
-{
-namespace kernel
-{
-
-    static const int THREADS = 256;
-    static const int BLOCKS  = 64;
-    static unsigned long long seed = 0;
-    static curandState_t *states[DeviceManager::MAX_DEVICES];
-    static bool is_init[DeviceManager::MAX_DEVICES] = {0};
-
-    template<typename T>
-    __device__
-    void generate_uniform(T *val, curandState_t *state)
-    {
-        *val =  (T)curand(state);
-    }
-
-    template<> __device__
-    void generate_uniform<char>(char *val, curandState_t *state)
-    {
-        *val = curand_uniform(state) > 0.5;
-    }
-
-    template<> __device__
-    void generate_uniform<float>(float *val, curandState_t *state)
-    {
-        *val = curand_uniform(state);
-    }
-
-    template<> __device__
-    void generate_uniform<double>(double *val, curandState_t *state)
-    {
-        *val = curand_uniform_double(state);
-    }
-
-    template<> __device__
-    void generate_uniform<cfloat>(cfloat *cval, curandState_t *state)
-    {
-        cval->x = curand_uniform(state);
-        cval->y = curand_uniform(state);
-    }
-
-    template<> __device__
-    void generate_uniform<cdouble>(cdouble *cval, curandState_t *state)
-    {
-        cval->x = curand_uniform_double(state);
-        cval->y = curand_uniform_double(state);
-    }
-
-
-    __device__
-    void generate_normal(float *val, curandState_t *state)
-    {
-        *val = curand_normal(state);
-    }
-
-
-    __device__
-    void generate_normal(double *val, curandState_t *state)
-    {
-        *val = curand_normal_double(state);
-    }
-
-
-    __device__
-    void generate_normal(cfloat *cval, curandState_t *state)
-    {
-        cval->x = curand_normal(state);
-        cval->y = curand_normal(state);
-    }
-
-
-    __device__
-    void generate_normal(cdouble *cval, curandState_t *state)
-    {
-        cval->x = curand_normal_double(state);
-        cval->y = curand_normal_double(state);
-    }
-
-    __global__ static void
-    setup_kernel(curandState_t *states, unsigned long long seed)
-    {
-        unsigned tid = blockDim.x * blockIdx.x + threadIdx.x;
-        curand_init(seed, tid, 0, &states[tid]);
-    }
-
-    template<typename T>
-    __global__ static void
-    uniform_kernel(T *out, curandState_t *states, size_t elements)
-    {
-        unsigned id = blockDim.x * blockIdx.x + threadIdx.x;
-        curandState_t state = states[id];
-        for (int tid = id; tid < elements; tid += blockDim.x * gridDim.x) {
-            T value;
-            generate_uniform<T>(&value, &state);
-            out[tid] = value;
-        }
-        states[id] = state;
-    }
-
-    template<typename T>
-    __global__ static void
-    normal_kernel(T *out, curandState_t *states, size_t elements)
-    {
-        unsigned id = blockDim.x * blockIdx.x + threadIdx.x;
-        curandState_t state = states[id];
-        for (int tid = id; tid < elements; tid += blockDim.x * gridDim.x) {
-            T value;
-            generate_normal(&value, &state);
-            out[tid] = value;
-        }
-        states[id] = state;
-    }
-
-    void setup_states()
-    {
-        int device = getActiveDeviceId();
-
-        if (!is_init[device]) {
-            CUDA_CHECK(cudaMalloc(&states[device], BLOCKS * THREADS * sizeof(curandState_t)));
-        }
-
-        setup_kernel<<<BLOCKS, THREADS>>>(states[device], seed);
-        POST_LAUNCH_CHECK();
-        is_init[device] = true;
-    }
-
-    template<typename T>
-    void randu(T *out, size_t elements)
-    {
-        int device = getActiveDeviceId();
-
-        int threads = THREADS;
-        int blocks  = divup(elements, THREADS);
-        if (blocks > BLOCKS) blocks = BLOCKS;
-        uniform_kernel<<<blocks, threads>>>(out, states[device], elements);
-        POST_LAUNCH_CHECK();
-    }
-
-    template<typename T>
-    void randn(T *out, size_t elements)
-    {
-        int device = getActiveDeviceId();
-
-        int threads = THREADS;
-        int blocks  = divup(elements, THREADS);
-        if (blocks > BLOCKS) blocks = BLOCKS;
-
-        if (!states[device]) {
-            CUDA_CHECK(cudaMalloc(&states[device], BLOCKS * THREADS * sizeof(curandState_t)));
-
-            setup_kernel<<<BLOCKS, THREADS>>>(states[device], seed);
-
-            POST_LAUNCH_CHECK();
-        }
-
-        normal_kernel<<<blocks, threads>>>(out, states[device], elements);
-
-        POST_LAUNCH_CHECK();
-    }
-}
-}
diff --git a/src/backend/cuda/kernel/random_engine.hpp b/src/backend/cuda/kernel/random_engine.hpp
new file mode 100644
index 0000000000..a5e2305885
--- /dev/null
+++ b/src/backend/cuda/kernel/random_engine.hpp
@@ -0,0 +1,1118 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <kernel/random_engine_mersenne.hpp>
+#include <kernel/random_engine_philox.hpp>
+#include <kernel/random_engine_threefry.hpp>
+#include <random_engine.hpp>
+#include <af/defines.h>
+
+#include <limits>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 530
+__device__ __half hlog(const __half a) {
+    return __float2half(logf(__half2float(a)));
+}
+__device__ __half hsqrt(const __half a) {
+    return __float2half(sqrtf(__half2float(a)));
+}
+__device__ __half hsin(const __half a) {
+    return __float2half(sinf(__half2float(a)));
+}
+__device__ __half hcos(const __half a) {
+    return __float2half(cosf(__half2float(a)));
+}
+__device__ __half __hfma(const __half a, __half b, __half c) {
+    return __float2half(
+        fmaf(__half2float(a), __half2float(b), __half2float(c)));
+}
+#endif
+
+// Utils
+static const int THREADS = 256;
+#define PI_VAL \
+    3.1415926535897932384626433832795028841971693993751058209749445923078164
+
+// Conversion to half adapted from Random123
+// #define HALF_FACTOR (1.0f) / (std::numeric_limits<ushort>::max() + (1.0f))
+// #define HALF_HALF_FACTOR ((0.5f) * HALF_FACTOR)
+//
+// NOTE: The following constants for half were calculated using the formulas
+// above. This is done so that we can avoid unnecessary computations because the
+// __half datatype is not a constexprable type. This prevents the compiler from
+// performing these operations at compile time.
+#define HALF_FACTOR __ushort_as_half(0x100u)
+#define HALF_HALF_FACTOR __ushort_as_half(0x80)
+
+// Conversion to half adapted from Random123
+// #define SIGNED_HALF_FACTOR                                \
+    //((1.0f) / (std::numeric_limits<short>::max() + (1.0f)))
+// #define SIGNED_HALF_HALF_FACTOR ((0.5f) * SIGNED_HALF_FACTOR)
+//
+// NOTE: The following constants for half were calculated using the formulas
+// above. This is done so that we can avoid unnecessary computations because the
+// __half datatype is not a constexprable type. This prevents the compiler from
+// performing these operations at compile time.
+#define SIGNED_HALF_FACTOR __ushort_as_half(0x200u)
+#define SIGNED_HALF_HALF_FACTOR __ushort_as_half(0x100u)
+
+/// This is the largest integer representable by fp16. We need to
+/// make sure that the value converted from ushort is smaller than this
+/// value to avoid generating infinity
+constexpr ushort max_int_before_infinity = 65504;
+
+// Generates 1 - getHalf01(num), nominally rationals in [0, 1)
+__device__ static __half oneMinusGetHalf01(uint num) {
+    // convert to ushort before the min operation
+    ushort v = min(max_int_before_infinity, ushort(num));
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 530
+    return (1.0f - __half2float(__hfma(__ushort2half_rn(v), HALF_FACTOR,
+                                       HALF_HALF_FACTOR)));
+#else
+    __half out = __ushort_as_half(0x3c00u) /*1.0h*/ -
+                 __hfma(__ushort2half_rn(v), HALF_FACTOR, HALF_HALF_FACTOR);
+    if (__hisinf(out)) printf("val: %d ushort: %d\n", num, v);
+    return out;
+#endif
+}
+
+// Generates rationals in (0, 1]
+__device__ static __half getHalf01(uint num) {
+    // convert to ushort before the min operation
+    ushort v = min(max_int_before_infinity, ushort(num));
+    return __hfma(__ushort2half_rn(v), HALF_FACTOR, HALF_HALF_FACTOR);
+}
+
+// Generates rationals in (-1, 1]
+__device__ static __half getHalfNegative11(uint num) {
+    // convert to ushort before the min operation
+    ushort v = min(max_int_before_infinity, ushort(num));
+    return __hfma(__ushort2half_rn(v), SIGNED_HALF_FACTOR,
+                  SIGNED_HALF_HALF_FACTOR);
+}
+
+// Generates rationals in (0, 1]
+__device__ static float getFloat01(uint num) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0f) /
+         (static_cast<float>(std::numeric_limits<unsigned int>::max()) +
+          (1.0f)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return fmaf(static_cast<float>(num), factor, half_factor);
+}
+
+// Generates rationals in (-1, 1]
+__device__ static float getFloatNegative11(uint num) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0) /
+         (static_cast<double>(std::numeric_limits<int>::max()) + (1.0)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return fmaf(static_cast<float>(num), factor, half_factor);
+}
+
+// Generates rationals in (0, 1]
+__device__ static double getDouble01(uint num1, uint num2) {
+    uint64_t n1 = num1;
+    uint64_t n2 = num2;
+    n1 <<= 32;
+    uint64_t num = n1 | n2;
+    constexpr double factor =
+        ((1.0) / (std::numeric_limits<unsigned long long>::max() +
+                  static_cast<double>(1.0)));
+    constexpr double half_factor((0.5) * factor);
+
+    return fma(static_cast<double>(num), factor, half_factor);
+}
+
+// Conversion to doubles adapted from Random123
+constexpr double signed_factor =
+    ((1.0l) / (std::numeric_limits<long long>::max() + (1.0l)));
+constexpr double half_factor = ((0.5) * signed_factor);
+
+// Generates rationals in (-1, 1]
+__device__ static double getDoubleNegative11(uint num1, uint num2) {
+    uint32_t arr[2] = {num2, num1};
+    uint64_t num;
+
+    memcpy(&num, arr, sizeof(uint64_t));
+    return fma(static_cast<double>(num), signed_factor, half_factor);
+}
+
+namespace {
+
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+#define HALF_MATH_FUNC(OP, HALF_OP)    \
+    template<>                         \
+    __device__ __half OP(__half val) { \
+        return ::HALF_OP(val);         \
+    }
+#else
+#define HALF_MATH_FUNC(OP, HALF_OP)     \
+    template<>                          \
+    __device__ __half OP(__half val) {  \
+        float fval = __half2float(val); \
+        return __float2half(OP(fval));  \
+    }
+#endif
+
+#define MATH_FUNC(OP, DOUBLE_OP, FLOAT_OP, HALF_OP) \
+    template<typename T>                            \
+    __device__ T OP(T val);                         \
+    template<>                                      \
+    __device__ double OP(double val) {              \
+        return ::DOUBLE_OP(val);                    \
+    }                                               \
+    template<>                                      \
+    __device__ float OP(float val) {                \
+        return ::FLOAT_OP(val);                     \
+    }                                               \
+    HALF_MATH_FUNC(OP, HALF_OP)
+
+MATH_FUNC(log, log, logf, hlog)
+MATH_FUNC(sqrt, sqrt, sqrtf, hsqrt)
+MATH_FUNC(sin, sin, sinf, hsin)
+MATH_FUNC(cos, cos, cosf, hcos)
+
+template<typename T>
+__device__ void sincos(T val, T *sptr, T *cptr);
+
+template<>
+__device__ void sincos(double val, double *sptr, double *cptr) {
+    ::sincos(val, sptr, cptr);
+}
+
+template<>
+__device__ void sincos(float val, float *sptr, float *cptr) {
+    sincosf(val, sptr, cptr);
+}
+
+template<>
+__device__ void sincos(__half val, __half *sptr, __half *cptr) {
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+    *sptr = sin(val);
+    *cptr = cos(val);
+#else
+    float s, c;
+    float fval = __half2float(val);
+    sincos(fval, &s, &c);
+    *sptr      = __float2half(s);
+    *cptr      = __float2half(c);
+#endif
+}
+
+template<typename T>
+__device__ void sincospi(T val, T *sptr, T *cptr);
+
+template<>
+__device__ void sincospi(double val, double *sptr, double *cptr) {
+    ::sincospi(val, sptr, cptr);
+}
+template<>
+__device__ void sincospi(float val, float *sptr, float *cptr) {
+    sincospif(val, sptr, cptr);
+}
+template<>
+__device__ void sincospi(__half val, __half *sptr, __half *cptr) {
+    // CUDA cannot make __half into a constexpr as of CUDA 11 so we are
+    // converting this offline
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530
+    const __half pi_val = __ushort_as_half(0x4248);  // 0x4248 == 3.14062h
+    val *= pi_val;
+    *sptr = sin(val);
+    *cptr = cos(val);
+#else
+    float fval = __half2float(val);
+    float s, c;
+    sincospi(fval, &s, &c);
+    *sptr = __float2half(s);
+    *cptr = __float2half(c);
+#endif
+}
+
+}  // namespace
+
+template<typename T>
+constexpr T neg_two() {
+    return -2.0;
+}
+
+template<typename T>
+constexpr __device__ T two_pi() {
+    return 2.0 * PI_VAL;
+};
+
+template<typename Td, typename Tc>
+__device__ static void boxMullerTransform(Td *const out1, Td *const out2,
+                                          const Tc &r1, const Tc &r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+    Tc r = sqrt(neg_two<Tc>() * log(r2));
+    Tc s, c;
+
+    // Multiplying by PI instead of 2*PI seems to yield a better distribution
+    // even though the original Box-Muller algorithm calls for 2 * PI
+    // sincos(two_pi<Tc>() * r1, &s, &c);
+    sincospi(r1, &s, &c);
+    *out1 = static_cast<Td>(r * s);
+    *out2 = static_cast<Td>(r * c);
+}
+#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 530
+template<>
+__device__ void boxMullerTransform<common::half, __half>(
+    common::half *const out1, common::half *const out2, const __half &r1,
+    const __half &r2) {
+    float o1, o2;
+    float fr1 = __half2float(r1);
+    float fr2 = __half2float(r2);
+    boxMullerTransform(&o1, &o2, fr1, fr2);
+    *out1 = o1;
+    *out2 = o2;
+}
+#endif
+
+// Writes without boundary checking
+__device__ static void writeOut128Bytes(uchar *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                   = r1;
+    out[index + blockDim.x]      = r1 >> 8;
+    out[index + 2 * blockDim.x]  = r1 >> 16;
+    out[index + 3 * blockDim.x]  = r1 >> 24;
+    out[index + 4 * blockDim.x]  = r2;
+    out[index + 5 * blockDim.x]  = r2 >> 8;
+    out[index + 6 * blockDim.x]  = r2 >> 16;
+    out[index + 7 * blockDim.x]  = r2 >> 24;
+    out[index + 8 * blockDim.x]  = r3;
+    out[index + 9 * blockDim.x]  = r3 >> 8;
+    out[index + 10 * blockDim.x] = r3 >> 16;
+    out[index + 11 * blockDim.x] = r3 >> 24;
+    out[index + 12 * blockDim.x] = r4;
+    out[index + 13 * blockDim.x] = r4 >> 8;
+    out[index + 14 * blockDim.x] = r4 >> 16;
+    out[index + 15 * blockDim.x] = r4 >> 24;
+}
+
+__device__ static void writeOut128Bytes(schar *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    writeOut128Bytes((uchar *)(out), index, r1, r2, r3, r4);
+}
+
+__device__ static void writeOut128Bytes(char *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                   = (r1)&0x1;
+    out[index + blockDim.x]      = (r1 >> 8) & 0x1;
+    out[index + 2 * blockDim.x]  = (r1 >> 16) & 0x1;
+    out[index + 3 * blockDim.x]  = (r1 >> 24) & 0x1;
+    out[index + 4 * blockDim.x]  = (r2)&0x1;
+    out[index + 5 * blockDim.x]  = (r2 >> 8) & 0x1;
+    out[index + 6 * blockDim.x]  = (r2 >> 16) & 0x1;
+    out[index + 7 * blockDim.x]  = (r2 >> 24) & 0x1;
+    out[index + 8 * blockDim.x]  = (r3)&0x1;
+    out[index + 9 * blockDim.x]  = (r3 >> 8) & 0x1;
+    out[index + 10 * blockDim.x] = (r3 >> 16) & 0x1;
+    out[index + 11 * blockDim.x] = (r3 >> 24) & 0x1;
+    out[index + 12 * blockDim.x] = (r4)&0x1;
+    out[index + 13 * blockDim.x] = (r4 >> 8) & 0x1;
+    out[index + 14 * blockDim.x] = (r4 >> 16) & 0x1;
+    out[index + 15 * blockDim.x] = (r4 >> 24) & 0x1;
+}
+
+__device__ static void writeOut128Bytes(short *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                  = r1;
+    out[index + blockDim.x]     = r1 >> 16;
+    out[index + 2 * blockDim.x] = r2;
+    out[index + 3 * blockDim.x] = r2 >> 16;
+    out[index + 4 * blockDim.x] = r3;
+    out[index + 5 * blockDim.x] = r3 >> 16;
+    out[index + 6 * blockDim.x] = r4;
+    out[index + 7 * blockDim.x] = r4 >> 16;
+}
+
+__device__ static void writeOut128Bytes(ushort *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    writeOut128Bytes((short *)(out), index, r1, r2, r3, r4);
+}
+
+__device__ static void writeOut128Bytes(int *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                  = r1;
+    out[index + blockDim.x]     = r2;
+    out[index + 2 * blockDim.x] = r3;
+    out[index + 3 * blockDim.x] = r4;
+}
+
+__device__ static void writeOut128Bytes(uint *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    writeOut128Bytes((int *)(out), index, r1, r2, r3, r4);
+}
+
+__device__ static void writeOut128Bytes(intl *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    intl c1                 = r2;
+    c1                      = (c1 << 32) | r1;
+    intl c2                 = r4;
+    c2                      = (c2 << 32) | r3;
+    out[index]              = c1;
+    out[index + blockDim.x] = c2;
+}
+
+__device__ static void writeOut128Bytes(uintl *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    writeOut128Bytes((intl *)(out), index, r1, r2, r3, r4);
+}
+
+__device__ static void writeOut128Bytes(float *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                  = 1.f - getFloat01(r1);
+    out[index + blockDim.x]     = 1.f - getFloat01(r2);
+    out[index + 2 * blockDim.x] = 1.f - getFloat01(r3);
+    out[index + 3 * blockDim.x] = 1.f - getFloat01(r4);
+}
+
+__device__ static void writeOut128Bytes(cfloat *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index].x              = 1.f - getFloat01(r1);
+    out[index].y              = 1.f - getFloat01(r2);
+    out[index + blockDim.x].x = 1.f - getFloat01(r3);
+    out[index + blockDim.x].y = 1.f - getFloat01(r4);
+}
+
+__device__ static void writeOut128Bytes(double *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]              = 1.0 - getDouble01(r1, r2);
+    out[index + blockDim.x] = 1.0 - getDouble01(r3, r4);
+}
+
+__device__ static void writeOut128Bytes(cdouble *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index].x = 1.0 - getDouble01(r1, r2);
+    out[index].y = 1.0 - getDouble01(r3, r4);
+}
+
+__device__ static void writeOut128Bytes(common::half *out, const uint &index,
+                                        const uint &r1, const uint &r2,
+                                        const uint &r3, const uint &r4) {
+    out[index]                  = oneMinusGetHalf01(r1);
+    out[index + blockDim.x]     = oneMinusGetHalf01(r1 >> 16);
+    out[index + 2 * blockDim.x] = oneMinusGetHalf01(r2);
+    out[index + 3 * blockDim.x] = oneMinusGetHalf01(r2 >> 16);
+    out[index + 4 * blockDim.x] = oneMinusGetHalf01(r3);
+    out[index + 5 * blockDim.x] = oneMinusGetHalf01(r3 >> 16);
+    out[index + 6 * blockDim.x] = oneMinusGetHalf01(r4);
+    out[index + 7 * blockDim.x] = oneMinusGetHalf01(r4 >> 16);
+}
+
+// Normally distributed (Box-Muller) writes without boundary checking
+
+__device__ static void boxMullerWriteOut128Bytes(float *out, const uint &index,
+                                                 const uint &r1, const uint &r2,
+                                                 const uint &r3,
+                                                 const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + blockDim.x],
+                       getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&out[index + 2 * blockDim.x],
+                       &out[index + 3 * blockDim.x], getFloatNegative11(r3),
+                       getFloat01(r4));
+}
+
+__device__ static void boxMullerWriteOut128Bytes(cfloat *out, const uint &index,
+                                                 const uint &r1, const uint &r2,
+                                                 const uint &r3,
+                                                 const uint &r4) {
+    boxMullerTransform(&out[index].x, &out[index].y, getFloatNegative11(r1),
+                       getFloat01(r2));
+    boxMullerTransform(&out[index + blockDim.x].x, &out[index + blockDim.x].y,
+                       getFloatNegative11(r3), getFloat01(r4));
+}
+
+__device__ static void boxMullerWriteOut128Bytes(double *out, const uint &index,
+                                                 const uint &r1, const uint &r2,
+                                                 const uint &r3,
+                                                 const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + blockDim.x],
+                       getDoubleNegative11(r1, r2), getDouble01(r3, r4));
+}
+
+__device__ static void boxMullerWriteOut128Bytes(cdouble *out,
+                                                 const uint &index,
+                                                 const uint &r1, const uint &r2,
+                                                 const uint &r3,
+                                                 const uint &r4) {
+    boxMullerTransform(&out[index].x, &out[index].y,
+                       getDoubleNegative11(r1, r2), getDouble01(r3, r4));
+}
+
+__device__ static void boxMullerWriteOut128Bytes(common::half *out,
+                                                 const uint &index,
+                                                 const uint &r1, const uint &r2,
+                                                 const uint &r3,
+                                                 const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + blockDim.x],
+                       getHalfNegative11(r1), getHalf01(r1 >> 16));
+    boxMullerTransform(&out[index + 2 * blockDim.x],
+                       &out[index + 3 * blockDim.x], getHalfNegative11(r2),
+                       getHalf01(r2 >> 16));
+    boxMullerTransform(&out[index + 4 * blockDim.x],
+                       &out[index + 5 * blockDim.x], getHalfNegative11(r3),
+                       getHalf01(r3 >> 16));
+    boxMullerTransform(&out[index + 6 * blockDim.x],
+                       &out[index + 7 * blockDim.x], getHalfNegative11(r4),
+                       getHalf01(r4 >> 16));
+}
+
+// Writes with boundary checking
+
+__device__ static void partialWriteOut128Bytes(uchar *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = r1 >> 8; }
+    if (index + 2 * blockDim.x < elements) {
+        out[index + 2 * blockDim.x] = r1 >> 16;
+    }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = r1 >> 24;
+    }
+    if (index + 4 * blockDim.x < elements) { out[index + 4 * blockDim.x] = r2; }
+    if (index + 5 * blockDim.x < elements) {
+        out[index + 5 * blockDim.x] = r2 >> 8;
+    }
+    if (index + 6 * blockDim.x < elements) {
+        out[index + 6 * blockDim.x] = r2 >> 16;
+    }
+    if (index + 7 * blockDim.x < elements) {
+        out[index + 7 * blockDim.x] = r2 >> 24;
+    }
+    if (index + 8 * blockDim.x < elements) { out[index + 8 * blockDim.x] = r3; }
+    if (index + 9 * blockDim.x < elements) {
+        out[index + 9 * blockDim.x] = r3 >> 8;
+    }
+    if (index + 10 * blockDim.x < elements) {
+        out[index + 10 * blockDim.x] = r3 >> 16;
+    }
+    if (index + 11 * blockDim.x < elements) {
+        out[index + 11 * blockDim.x] = r3 >> 24;
+    }
+    if (index + 12 * blockDim.x < elements) {
+        out[index + 12 * blockDim.x] = r4;
+    }
+    if (index + 13 * blockDim.x < elements) {
+        out[index + 13 * blockDim.x] = r4 >> 8;
+    }
+    if (index + 14 * blockDim.x < elements) {
+        out[index + 14 * blockDim.x] = r4 >> 16;
+    }
+    if (index + 15 * blockDim.x < elements) {
+        out[index + 15 * blockDim.x] = r4 >> 24;
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(schar *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    partialWriteOut128Bytes((uchar *)(out), index, r1, r2, r3, r4, elements);
+}
+
+__device__ static void partialWriteOut128Bytes(char *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = (r1)&0x1; }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x] = (r1 >> 8) & 0x1;
+    }
+    if (index + 2 * blockDim.x < elements) {
+        out[index + 2 * blockDim.x] = (r1 >> 16) & 0x1;
+    }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = (r1 >> 24) & 0x1;
+    }
+    if (index + 4 * blockDim.x < elements) {
+        out[index + 4 * blockDim.x] = (r2)&0x1;
+    }
+    if (index + 5 * blockDim.x < elements) {
+        out[index + 5 * blockDim.x] = (r2 >> 8) & 0x1;
+    }
+    if (index + 6 * blockDim.x < elements) {
+        out[index + 6 * blockDim.x] = (r2 >> 16) & 0x1;
+    }
+    if (index + 7 * blockDim.x < elements) {
+        out[index + 7 * blockDim.x] = (r2 >> 24) & 0x1;
+    }
+    if (index + 8 * blockDim.x < elements) {
+        out[index + 8 * blockDim.x] = (r3)&0x1;
+    }
+    if (index + 9 * blockDim.x < elements) {
+        out[index + 9 * blockDim.x] = (r3 >> 8) & 0x1;
+    }
+    if (index + 10 * blockDim.x < elements) {
+        out[index + 10 * blockDim.x] = (r3 >> 16) & 0x1;
+    }
+    if (index + 11 * blockDim.x < elements) {
+        out[index + 11 * blockDim.x] = (r3 >> 24) & 0x1;
+    }
+    if (index + 12 * blockDim.x < elements) {
+        out[index + 12 * blockDim.x] = (r4)&0x1;
+    }
+    if (index + 13 * blockDim.x < elements) {
+        out[index + 13 * blockDim.x] = (r4 >> 8) & 0x1;
+    }
+    if (index + 14 * blockDim.x < elements) {
+        out[index + 14 * blockDim.x] = (r4 >> 16) & 0x1;
+    }
+    if (index + 15 * blockDim.x < elements) {
+        out[index + 15 * blockDim.x] = (r4 >> 24) & 0x1;
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(short *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = r1 >> 16; }
+    if (index + 2 * blockDim.x < elements) { out[index + 2 * blockDim.x] = r2; }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = r2 >> 16;
+    }
+    if (index + 4 * blockDim.x < elements) { out[index + 4 * blockDim.x] = r3; }
+    if (index + 5 * blockDim.x < elements) {
+        out[index + 5 * blockDim.x] = r3 >> 16;
+    }
+    if (index + 6 * blockDim.x < elements) { out[index + 6 * blockDim.x] = r4; }
+    if (index + 7 * blockDim.x < elements) {
+        out[index + 7 * blockDim.x] = r4 >> 16;
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(ushort *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    partialWriteOut128Bytes((short *)(out), index, r1, r2, r3, r4, elements);
+}
+
+__device__ static void partialWriteOut128Bytes(int *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = r2; }
+    if (index + 2 * blockDim.x < elements) { out[index + 2 * blockDim.x] = r3; }
+    if (index + 3 * blockDim.x < elements) { out[index + 3 * blockDim.x] = r4; }
+}
+
+__device__ static void partialWriteOut128Bytes(uint *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    partialWriteOut128Bytes((int *)(out), index, r1, r2, r3, r4, elements);
+}
+
+__device__ static void partialWriteOut128Bytes(intl *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    intl c1 = r2;
+    c1      = (c1 << 32) | r1;
+    intl c2 = r4;
+    c2      = (c2 << 32) | r3;
+    if (index < elements) { out[index] = c1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = c2; }
+}
+
+__device__ static void partialWriteOut128Bytes(uintl *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    partialWriteOut128Bytes((intl *)(out), index, r1, r2, r3, r4, elements);
+}
+
+__device__ static void partialWriteOut128Bytes(float *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = 1.f - getFloat01(r1); }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x] = 1.f - getFloat01(r2);
+    }
+    if (index + 2 * blockDim.x < elements) {
+        out[index + 2 * blockDim.x] = 1.f - getFloat01(r3);
+    }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = 1.f - getFloat01(r4);
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(cfloat *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) {
+        out[index].x = 1.f - getFloat01(r1);
+        out[index].y = 1.f - getFloat01(r2);
+    }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x].x = 1.f - getFloat01(r3);
+        out[index + blockDim.x].y = 1.f - getFloat01(r4);
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(double *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = 1.0 - getDouble01(r1, r2); }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x] = 1.0 - getDouble01(r3, r4);
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(cdouble *out, const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) {
+        out[index].x = 1.0 - getDouble01(r1, r2);
+        out[index].y = 1.0 - getDouble01(r3, r4);
+    }
+}
+
+// Normally distributed (Box-Muller) writes with boundary checking
+
+__device__ static void partialBoxMullerWriteOut128Bytes(
+    float *out, const uint &index, const uint &r1, const uint &r2,
+    const uint &r3, const uint &r4, const uint &elements) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = n2; }
+    if (index + 2 * blockDim.x < elements) { out[index + 2 * blockDim.x] = n3; }
+    if (index + 3 * blockDim.x < elements) { out[index + 3 * blockDim.x] = n4; }
+}
+
+__device__ static void partialBoxMullerWriteOut128Bytes(
+    cfloat *out, const uint &index, const uint &r1, const uint &r2,
+    const uint &r3, const uint &r4, const uint &elements) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    if (index < elements) {
+        out[index].x = n1;
+        out[index].y = n2;
+    }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x].x = n3;
+        out[index + blockDim.x].y = n4;
+    }
+}
+
+__device__ static void partialBoxMullerWriteOut128Bytes(
+    double *out, const uint &index, const uint &r1, const uint &r2,
+    const uint &r3, const uint &r4, const uint &elements) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = n2; }
+}
+
+__device__ static void partialBoxMullerWriteOut128Bytes(
+    cdouble *out, const uint &index, const uint &r1, const uint &r2,
+    const uint &r3, const uint &r4, const uint &elements) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    if (index < elements) {
+        out[index].x = n1;
+        out[index].y = n2;
+    }
+}
+
+__device__ static void partialWriteOut128Bytes(common::half *out,
+                                               const uint &index,
+                                               const uint &r1, const uint &r2,
+                                               const uint &r3, const uint &r4,
+                                               const uint &elements) {
+    if (index < elements) { out[index] = oneMinusGetHalf01(r1); }
+    if (index + blockDim.x < elements) {
+        out[index + blockDim.x] = oneMinusGetHalf01(r1 >> 16);
+    }
+    if (index + 2 * blockDim.x < elements) {
+        out[index + 2 * blockDim.x] = oneMinusGetHalf01(r2);
+    }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = oneMinusGetHalf01(r2 >> 16);
+    }
+    if (index + 4 * blockDim.x < elements) {
+        out[index + 4 * blockDim.x] = oneMinusGetHalf01(r3);
+    }
+    if (index + 5 * blockDim.x < elements) {
+        out[index + 5 * blockDim.x] = oneMinusGetHalf01(r3 >> 16);
+    }
+    if (index + 6 * blockDim.x < elements) {
+        out[index + 6 * blockDim.x] = oneMinusGetHalf01(r4);
+    }
+    if (index + 7 * blockDim.x < elements) {
+        out[index + 7 * blockDim.x] = oneMinusGetHalf01(r4 >> 16);
+    }
+}
+
+// Normally distributed (Box-Muller) half-precision writes with boundary checking
+__device__ static void partialBoxMullerWriteOut128Bytes(
+    common::half *out, const uint &index, const uint &r1, const uint &r2,
+    const uint &r3, const uint &r4, const uint &elements) {
+    common::half n[8];
+    boxMullerTransform(n + 0, n + 1, getHalfNegative11(r1),
+                       getHalf01(r1 >> 16));
+    boxMullerTransform(n + 2, n + 3, getHalfNegative11(r2),
+                       getHalf01(r2 >> 16));
+    boxMullerTransform(n + 4, n + 5, getHalfNegative11(r3),
+                       getHalf01(r3 >> 16));
+    boxMullerTransform(n + 6, n + 7, getHalfNegative11(r4),
+                       getHalf01(r4 >> 16));
+    if (index < elements) { out[index] = n[0]; }
+    if (index + blockDim.x < elements) { out[index + blockDim.x] = n[1]; }
+    if (index + 2 * blockDim.x < elements) {
+        out[index + 2 * blockDim.x] = n[2];
+    }
+    if (index + 3 * blockDim.x < elements) {
+        out[index + 3 * blockDim.x] = n[3];
+    }
+    if (index + 4 * blockDim.x < elements) {
+        out[index + 4 * blockDim.x] = n[4];
+    }
+    if (index + 5 * blockDim.x < elements) {
+        out[index + 5 * blockDim.x] = n[5];
+    }
+    if (index + 6 * blockDim.x < elements) {
+        out[index + 6 * blockDim.x] = n[6];
+    }
+    if (index + 7 * blockDim.x < elements) {
+        out[index + 7 * blockDim.x] = n[7];
+    }
+}
+
+template<typename T>
+__global__ void uniformPhilox(T *out, uint hi, uint lo, uint hic, uint loc,
+                              uint elementsPerBlock, uint elements) {
+    uint index  = blockIdx.x * elementsPerBlock + threadIdx.x;
+    uint key[2] = {lo, hi};
+    uint ctr[4] = {loc, hic, 0, 0};
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < loc);
+    ctr[2] += (ctr[1] < hic);
+    if (blockIdx.x != (gridDim.x - 1)) {
+        philox(key, ctr);
+        writeOut128Bytes(out, index, ctr[0], ctr[1], ctr[2], ctr[3]);
+    } else {
+        philox(key, ctr);
+        partialWriteOut128Bytes(out, index, ctr[0], ctr[1], ctr[2], ctr[3],
+                                elements);
+    }
+}
+
+template<typename T>
+__global__ void uniformThreefry(T *out, uint hi, uint lo, uint hic, uint loc,
+                                uint elementsPerBlock, uint elements) {
+    uint index  = blockIdx.x * elementsPerBlock + threadIdx.x;
+    uint key[2] = {lo, hi};
+    uint ctr[2] = {loc, hic};
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < loc);
+    uint o[4];
+
+    threefry(key, ctr, o);
+    uint step = elementsPerBlock / 2;
+    ctr[0] += step;
+    ctr[1] += (ctr[0] < step);
+    threefry(key, ctr, o + 2);
+
+    if (blockIdx.x != (gridDim.x - 1)) {
+        writeOut128Bytes(out, index, o[0], o[1], o[2], o[3]);
+    } else {
+        partialWriteOut128Bytes(out, index, o[0], o[1], o[2], o[3], elements);
+    }
+}
+
+template<typename T>
+__global__ void uniformMersenne(T *const out, uint *const gState,
+                                const uint *const pos_tbl,
+                                const uint *const sh1_tbl,
+                                const uint *const sh2_tbl, uint mask,
+                                const uint *const g_recursion_table,
+                                const uint *const g_temper_table,
+                                uint elementsPerBlock, size_t elements) {
+    __shared__ uint state[STATE_SIZE];
+    __shared__ uint recursion_table[TABLE_SIZE];
+    __shared__ uint temper_table[TABLE_SIZE];
+    uint start                    = blockIdx.x * elementsPerBlock;
+    uint end                      = start + elementsPerBlock;
+    end                           = (end > elements) ? elements : end;
+    int elementsPerBlockIteration = (blockDim.x * 4 * sizeof(uint)) / sizeof(T);
+    int iter = divup((end - start), elementsPerBlockIteration);
+
+    uint pos = pos_tbl[blockIdx.x];
+    uint sh1 = sh1_tbl[blockIdx.x];
+    uint sh2 = sh2_tbl[blockIdx.x];
+    state_read(state, gState);
+    read_table(recursion_table, g_recursion_table);
+    read_table(temper_table, g_temper_table);
+    __syncthreads();
+
+    uint index = start;
+    uint o[4];
+    int offsetX1 = (STATE_SIZE - N + threadIdx.x) % STATE_SIZE;
+    int offsetX2 = (STATE_SIZE - N + threadIdx.x + 1) % STATE_SIZE;
+    int offsetY  = (STATE_SIZE - N + threadIdx.x + pos) % STATE_SIZE;
+    int offsetT  = (STATE_SIZE - N + threadIdx.x + pos - 1) % STATE_SIZE;
+    int offsetO  = threadIdx.x;
+
+    for (int i = 0; i < iter; ++i) {
+        for (int ii = 0; ii < 4; ++ii) {
+            uint r = recursion(recursion_table, mask, sh1, sh2, state[offsetX1],
+                               state[offsetX2], state[offsetY]);
+            state[offsetO] = r;
+            o[ii]          = temper(temper_table, r, state[offsetT]);
+            offsetX1       = (offsetX1 + blockDim.x) % STATE_SIZE;
+            offsetX2       = (offsetX2 + blockDim.x) % STATE_SIZE;
+            offsetY        = (offsetY + blockDim.x) % STATE_SIZE;
+            offsetT        = (offsetT + blockDim.x) % STATE_SIZE;
+            offsetO        = (offsetO + blockDim.x) % STATE_SIZE;
+            __syncthreads();
+        }
+        if (i == iter - 1) {
+            partialWriteOut128Bytes(out, index + threadIdx.x, o[0], o[1], o[2],
+                                    o[3], elements);
+        } else {
+            writeOut128Bytes(out, index + threadIdx.x, o[0], o[1], o[2], o[3]);
+        }
+        index += elementsPerBlockIteration;
+    }
+    state_write(gState, state);
+}
+
+template<typename T>
+__global__ void normalPhilox(T *out, uint hi, uint lo, uint hic, uint loc,
+                             uint elementsPerBlock, uint elements) {
+    uint index  = blockIdx.x * elementsPerBlock + threadIdx.x;
+    uint key[2] = {lo, hi};
+    uint ctr[4] = {loc, hic, 0, 0};
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < loc);
+    ctr[2] += (ctr[1] < hic);
+
+    philox(key, ctr);
+
+    if (blockIdx.x != (gridDim.x - 1)) {
+        boxMullerWriteOut128Bytes(out, index, ctr[0], ctr[1], ctr[2], ctr[3]);
+    } else {
+        partialBoxMullerWriteOut128Bytes(out, index, ctr[0], ctr[1], ctr[2],
+                                         ctr[3], elements);
+    }
+}
+
+template<typename T>
+__global__ void normalThreefry(T *out, uint hi, uint lo, uint hic, uint loc,
+                               uint elementsPerBlock, uint elements) {
+    uint index  = blockIdx.x * elementsPerBlock + threadIdx.x;
+    uint key[2] = {lo, hi};
+    uint ctr[2] = {loc, hic};
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < loc);
+    uint o[4];
+
+    threefry(key, ctr, o);
+    uint step = elementsPerBlock / 2;
+    ctr[0] += step;
+    ctr[1] += (ctr[0] < step);
+    threefry(key, ctr, o + 2);
+
+    if (blockIdx.x != (gridDim.x - 1)) {
+        boxMullerWriteOut128Bytes(out, index, o[0], o[1], o[2], o[3]);
+    } else {
+        partialBoxMullerWriteOut128Bytes(out, index, o[0], o[1], o[2], o[3],
+                                         elements);
+    }
+}
+
+template<typename T>
+__global__ void normalMersenne(T *const out, uint *const gState,
+                               const uint *const pos_tbl,
+                               const uint *const sh1_tbl,
+                               const uint *const sh2_tbl, uint mask,
+                               const uint *const g_recursion_table,
+                               const uint *const g_temper_table,
+                               uint elementsPerBlock, uint elements) {
+    __shared__ uint state[STATE_SIZE];
+    __shared__ uint recursion_table[TABLE_SIZE];
+    __shared__ uint temper_table[TABLE_SIZE];
+    uint start = blockIdx.x * elementsPerBlock;
+    uint end   = start + elementsPerBlock;
+    end        = (end > elements) ? elements : end;
+    int iter = divup((end - start) * sizeof(T), blockDim.x * 4 * sizeof(uint));
+
+    uint pos = pos_tbl[blockIdx.x];
+    uint sh1 = sh1_tbl[blockIdx.x];
+    uint sh2 = sh2_tbl[blockIdx.x];
+    state_read(state, gState);
+    read_table(recursion_table, g_recursion_table);
+    read_table(temper_table, g_temper_table);
+    __syncthreads();
+
+    uint index                    = start;
+    int elementsPerBlockIteration = blockDim.x * 4 * sizeof(uint) / sizeof(T);
+    uint o[4];
+    int offsetX1 = (STATE_SIZE - N + threadIdx.x) % STATE_SIZE;
+    int offsetX2 = (STATE_SIZE - N + threadIdx.x + 1) % STATE_SIZE;
+    int offsetY  = (STATE_SIZE - N + threadIdx.x + pos) % STATE_SIZE;
+    int offsetT  = (STATE_SIZE - N + threadIdx.x + pos - 1) % STATE_SIZE;
+    int offsetO  = threadIdx.x;
+
+    for (int i = 0; i < iter; ++i) {
+        for (int ii = 0; ii < 4; ++ii) {
+            uint r = recursion(recursion_table, mask, sh1, sh2, state[offsetX1],
+                               state[offsetX2], state[offsetY]);
+            state[offsetO] = r;
+            o[ii]          = temper(temper_table, r, state[offsetT]);
+            offsetX1       = (offsetX1 + blockDim.x) % STATE_SIZE;
+            offsetX2       = (offsetX2 + blockDim.x) % STATE_SIZE;
+            offsetY        = (offsetY + blockDim.x) % STATE_SIZE;
+            offsetT        = (offsetT + blockDim.x) % STATE_SIZE;
+            offsetO        = (offsetO + blockDim.x) % STATE_SIZE;
+            __syncthreads();
+        }
+        if (i == iter - 1) {
+            partialBoxMullerWriteOut128Bytes(out, index + threadIdx.x, o[0],
+                                             o[1], o[2], o[3], elements);
+        } else {
+            boxMullerWriteOut128Bytes(out, index + threadIdx.x, o[0], o[1],
+                                      o[2], o[3]);
+        }
+        index += elementsPerBlockIteration;
+    }
+    state_write(gState, state);
+}
+
+template<typename T>
+void uniformDistributionMT(T *out, size_t elements, uint *const state,
+                           const uint *const pos, const uint *const sh1,
+                           const uint *const sh2, uint mask,
+                           const uint *const recursion_table,
+                           const uint *const temper_table) {
+    int threads                = THREADS;
+    int min_elements_per_block = 32 * threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks                 = divup(elements, min_elements_per_block);
+    blocks                     = (blocks > BLOCKS) ? BLOCKS : blocks;
+    uint elementsPerBlock      = divup(elements, blocks);
+    CUDA_LAUNCH(uniformMersenne, blocks, threads, out, state, pos, sh1, sh2,
+                mask, recursion_table, temper_table, elementsPerBlock,
+                elements);
+}
+
+template<typename T>
+void normalDistributionMT(T *out, size_t elements, uint *const state,
+                          const uint *const pos, const uint *const sh1,
+                          const uint *const sh2, uint mask,
+                          const uint *const recursion_table,
+                          const uint *const temper_table) {
+    int threads                = THREADS;
+    int min_elements_per_block = 32 * threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks                 = divup(elements, min_elements_per_block);
+    blocks                     = (blocks > BLOCKS) ? BLOCKS : blocks;
+    uint elementsPerBlock      = divup(elements, blocks);
+    CUDA_LAUNCH(normalMersenne, blocks, threads, out, state, pos, sh1, sh2,
+                mask, recursion_table, temper_table, elementsPerBlock,
+                elements);
+}
+
+template<typename T>
+void uniformDistributionCBRNG(T *out, size_t elements,
+                              const af_random_engine_type type,
+                              const uintl &seed, uintl &counter) {
+    int threads          = THREADS;
+    int elementsPerBlock = threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks           = divup(elements, elementsPerBlock);
+    uint hi              = seed >> 32;
+    uint lo              = seed;
+    uint hic             = counter >> 32;
+    uint loc             = counter;
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            CUDA_LAUNCH(uniformPhilox, blocks, threads, out, hi, lo, hic, loc,
+                        elementsPerBlock, elements);
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            CUDA_LAUNCH(uniformThreefry, blocks, threads, out, hi, lo, hic, loc,
+                        elementsPerBlock, elements);
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+    counter += elements;
+}
+
+template<typename T>
+void normalDistributionCBRNG(T *out, size_t elements,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    int threads          = THREADS;
+    int elementsPerBlock = threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks           = divup(elements, elementsPerBlock);
+    uint hi              = seed >> 32;
+    uint lo              = seed;
+    uint hic             = counter >> 32;
+    uint loc             = counter;
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            CUDA_LAUNCH(normalPhilox, blocks, threads, out, hi, lo, hic, loc,
+                        elementsPerBlock, elements);
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            CUDA_LAUNCH(normalThreefry, blocks, threads, out, hi, lo, hic, loc,
+                        elementsPerBlock, elements);
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+    counter += elements;
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
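Note on the counter arithmetic in the CBRNG kernels above: the 64-bit `counter` is passed as two 32-bit words (`hic`, `loc`), and the per-thread `index` is added with a manual carry (`ctr[1] += (ctr[0] < loc)`). The host-side sketch below illustrates that carry idiom and compares it with native 64-bit addition. `Counter64`, `addWithCarry`, `toU64`, and `checkCarry` are hypothetical names used only for illustration; they are not part of ArrayFire.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical type: a 64-bit counter held as two 32-bit words.
struct Counter64 {
    uint32_t lo;
    uint32_t hi;
};

// Adds a 32-bit offset to the low word and propagates the carry.
// Unsigned addition wraps, so the new low word is smaller than the
// old one exactly when the addition overflowed.
inline Counter64 addWithCarry(Counter64 c, uint32_t offset) {
    const uint32_t oldLo = c.lo;
    c.lo += offset;
    c.hi += (c.lo < oldLo) ? 1u : 0u;
    return c;
}

// Reassembles the two words so the result can be compared with
// ordinary 64-bit arithmetic.
inline uint64_t toU64(Counter64 c) {
    return (static_cast<uint64_t>(c.hi) << 32) | c.lo;
}

// Small self-check: the split addition agrees with uint64_t addition.
inline void checkCarry() {
    const Counter64 start{0xFFFFFFF0u, 7u};
    const uint64_t expected = toU64(start) + 0x20u;
    assert(toU64(addWithCarry(start, 0x20u)) == expected);
}
```

The same comparison trick appears in the Threefry kernels for the second counter step, where the sum is compared against the added `step` instead; either addend works, since a wrapped sum is smaller than both.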
diff --git a/src/backend/cuda/kernel/random_engine_mersenne.hpp b/src/backend/cuda/kernel/random_engine_mersenne.hpp
new file mode 100644
index 0000000000..5b288bc6b4
--- /dev/null
+++ b/src/backend/cuda/kernel/random_engine_mersenne.hpp
@@ -0,0 +1,132 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+#pragma once
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr int N          = 351;
+constexpr int BLOCKS     = 32;
+constexpr int STATE_SIZE = (256 * 3);
+constexpr int TABLE_SIZE = 16;
+
+// Utils
+static inline __device__ void read_table(uint *const sharedTable,
+                                         const uint *const table) {
+    const uint *const t = table + (blockIdx.x * TABLE_SIZE);
+    if (threadIdx.x < TABLE_SIZE) { sharedTable[threadIdx.x] = t[threadIdx.x]; }
+}
+
+static inline __device__ void state_read(uint *const state,
+                                         const uint *const gState) {
+    const uint *const g                 = gState + (blockIdx.x * N);
+    state[STATE_SIZE - N + threadIdx.x] = g[threadIdx.x];
+    if (threadIdx.x < N - blockDim.x) {
+        state[STATE_SIZE - N + blockDim.x + threadIdx.x] =
+            g[blockDim.x + threadIdx.x];
+    }
+}
+
+static inline __device__ void state_write(uint *const gState,
+                                          const uint *const state) {
+    uint *const g  = gState + (blockIdx.x * N);
+    g[threadIdx.x] = state[STATE_SIZE - N + threadIdx.x];
+    if (threadIdx.x < N - blockDim.x) {
+        g[blockDim.x + threadIdx.x] =
+            state[STATE_SIZE - N + blockDim.x + threadIdx.x];
+    }
+}
+
+static inline __device__ uint recursion(const uint *const recursion_table,
+                                        const uint mask, const uint sh1,
+                                        const uint sh2, const uint x1,
+                                        const uint x2, uint y) {
+    uint x = (x1 & mask) ^ x2;
+    x ^= x << sh1;
+    y        = x ^ (y >> sh2);
+    uint mat = recursion_table[y & 0x0f];
+    return y ^ mat;
+}
+
+static inline __device__ uint temper(const uint *const temper_table,
+                                     const uint v, uint t) {
+    t ^= t >> 16;
+    t ^= t >> 8;
+    uint mat = temper_table[t & 0x0f];
+    return v ^ mat;
+}
+
+// Initialization
+
+__global__ void initState(uint *state, const uint *tbl, uintl seed) {
+    __shared__ uint lstate[N];
+    const uint *ltbl = tbl + (TABLE_SIZE * blockIdx.x);
+    uint hidden_seed = ltbl[4] ^ (ltbl[8] << 16);
+    uint tmp         = hidden_seed;
+    tmp += tmp >> 16;
+    tmp += tmp >> 8;
+    tmp &= 0xff;
+    tmp |= tmp << 8;
+    tmp |= tmp << 16;
+    lstate[threadIdx.x] = tmp;
+    __syncthreads();
+    if (threadIdx.x == 0) {
+        lstate[0] = seed;
+        lstate[1] = hidden_seed;
+        for (int i = 1; i < N; ++i) {
+            lstate[i] ^=
+                ((uint)(1812433253) * (lstate[i - 1] ^ (lstate[i - 1] >> 30)) +
+                 i);
+        }
+    }
+    __syncthreads();
+    state[N * blockIdx.x + threadIdx.x] = lstate[threadIdx.x];
+}
+
+void initMersenneState(uint *state, const uint *tbl, uintl seed) {
+    CUDA_LAUNCH(initState, BLOCKS, N, state, tbl, seed);
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
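Note on `temper` above: the two shift-and-XOR steps fold all four bytes of `t` into its low byte, so the low nibble used as the table index is the XOR of the low nibbles of every byte. The host-side sketch below mirrors the device function to make this visible. `temperHost`, `checkTemper`, and the identity table are hypothetical illustration names, not the tables ArrayFire ships.

```cpp
#include <cassert>
#include <cstdint>

// Host-side mirror of the device `temper` function in the diff.
// `table` must have 16 entries, matching TABLE_SIZE in the source.
inline uint32_t temperHost(const uint32_t *table, uint32_t v, uint32_t t) {
    t ^= t >> 16;  // fold the upper half-word into the lower one
    t ^= t >> 8;   // fold the remaining byte; low byte is now b0^b1^b2^b3
    const uint32_t mat = table[t & 0x0f];
    return v ^ mat;
}

// Self-check with a hypothetical identity table (table[i] == i), so the
// return value for v == 0 is simply the selected index.
inline void checkTemper() {
    uint32_t table[16];
    for (uint32_t i = 0; i < 16; ++i) { table[i] = i; }
    // Low nibbles of the bytes are 1, 2, 4, 8; their XOR is 0xF.
    assert(temperHost(table, 0u, 0x01020408u) == 0xFu);
}
```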
diff --git a/src/backend/cuda/kernel/random_engine_philox.hpp b/src/backend/cuda/kernel/random_engine_philox.hpp
new file mode 100644
index 0000000000..8124416e03
--- /dev/null
+++ b/src/backend/cuda/kernel/random_engine_philox.hpp
@@ -0,0 +1,106 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/philox.hpp
+
+constexpr uint m4x32_0 = 0xD2511F53;
+constexpr uint m4x32_1 = 0xCD9E8D57;
+constexpr uint w32_0   = 0x9E3779B9;
+constexpr uint w32_1   = 0xBB67AE85;
+
+static inline __device__ void mulhilo(uint a, uint b, uint &hi, uint &lo) {
+    hi = __umulhi(a, b);
+    lo = a * b;
+}
+
+static inline __device__ void philoxBump(uint k[2]) {
+    k[0] += w32_0;
+    k[1] += w32_1;
+}
+
+static inline __device__ void philoxRound(const uint m0, const uint m1,
+                                          const uint k[2], uint c[4]) {
+    uint hi0, lo0, hi1, lo1;
+    mulhilo(m0, c[0], hi0, lo0);
+    mulhilo(m1, c[2], hi1, lo1);
+    c[0] = hi1 ^ c[1] ^ k[0];
+    c[1] = lo1;
+    c[2] = hi0 ^ c[3] ^ k[1];
+    c[3] = lo0;
+}
+
+static inline __device__ void philox(uint key[2], uint ctr[4]) {
+    // 10 Rounds
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
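Note on `mulhilo` and `philoxRound` above: `__umulhi(a, b)` returns the upper 32 bits of the 64-bit product, which can be reproduced on the host with a widening multiply. The sketch below is a host-side mirror for a single round, useful for checking device output by hand. `mulhiloHost`, `philoxRoundHost`, and `checkPhiloxRound` are hypothetical names, and the multipliers are copied from the constants in the diff.

```cpp
#include <cassert>
#include <cstdint>

// Multipliers copied from m4x32_0 and m4x32_1 in the diff.
constexpr uint32_t kM0 = 0xD2511F53u;
constexpr uint32_t kM1 = 0xCD9E8D57u;

// Host equivalent of the device mulhilo: widen to 64 bits, then split.
inline void mulhiloHost(uint32_t a, uint32_t b, uint32_t &hi, uint32_t &lo) {
    const uint64_t p = static_cast<uint64_t>(a) * b;
    hi = static_cast<uint32_t>(p >> 32);
    lo = static_cast<uint32_t>(p);
}

// One Philox-4x32 round with the same data flow as philoxRound in the diff.
inline void philoxRoundHost(uint32_t c[4], const uint32_t k[2]) {
    uint32_t hi0, lo0, hi1, lo1;
    mulhiloHost(kM0, c[0], hi0, lo0);
    mulhiloHost(kM1, c[2], hi1, lo1);
    c[0] = hi1 ^ c[1] ^ k[0];
    c[1] = lo1;
    c[2] = hi0 ^ c[3] ^ k[1];
    c[3] = lo0;
}

// Self-check: with counter {1,0,0,0} and a zero key, the only nonzero
// product is kM0 * 1, whose low word lands in c[3].
inline void checkPhiloxRound() {
    uint32_t c[4]       = {1u, 0u, 0u, 0u};
    const uint32_t k[2] = {0u, 0u};
    philoxRoundHost(c, k);
    assert(c[0] == 0u && c[1] == 0u && c[2] == 0u && c[3] == kM0);
}
```

A full 10-round host reference would repeat this round with the key bumped by `w32_0` and `w32_1` between rounds, as `philox` does in the diff.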
diff --git a/src/backend/cuda/kernel/random_engine_threefry.hpp b/src/backend/cuda/kernel/random_engine_threefry.hpp
new file mode 100644
index 0000000000..a2bbbcaec1
--- /dev/null
+++ b/src/backend/cuda/kernel/random_engine_threefry.hpp
@@ -0,0 +1,164 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/threefry.hpp
+
+static const uint SKEIN_KS_PARITY32 = 0x1BD11BDA;
+
+static const uint R0 = 13;
+static const uint R1 = 15;
+static const uint R2 = 26;
+static const uint R3 = 6;
+static const uint R4 = 17;
+static const uint R5 = 29;
+static const uint R6 = 16;
+static const uint R7 = 24;
+
+static inline __device__ void setSkeinParity(uint *ptr) {
+    *ptr = SKEIN_KS_PARITY32;
+}
+
+static inline __device__ uint rotL(uint x, uint N) {
+    return (x << (N & 31)) | (x >> ((32 - N) & 31));
+}
+
+__device__ void threefry(uint k[2], uint c[2], uint X[2]) {
+    uint ks[3];
+
+    setSkeinParity(&ks[2]);
+    ks[0] = k[0];
+    X[0]  = c[0];
+    ks[2] ^= k[0];
+    ks[1] = k[1];
+    X[1]  = c[1];
+    ks[2] ^= k[1];
+
+    X[0] += ks[0];
+    X[1] += ks[1];
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=1) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 1; /* X[2-1] += r  */
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=2) */
+    X[0] += ks[2];
+    X[1] += ks[0];
+    X[1] += 2;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=3) */
+    X[0] += ks[0];
+    X[1] += ks[1];
+    X[1] += 3;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=4) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 4;
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
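Review note on the Threefry helper above: each round is the add, rotate, xor sequence `X[0] += X[1]; X[1] = rotL(X[1], R); X[1] ^= X[0];`, and every step is invertible, so distinct counters map to distinct outputs for a fixed key. The host-side sketch below illustrates this under stated assumptions: `rotl32`, `mix`, `unmix` and `roundtrip_check` are hypothetical names that do not exist in ArrayFire, and the code is plain C++ rather than device code.

```cpp
#include <cassert>
#include <cstdint>

// Rotate left. Masking both shift counts with 31 keeps n == 0 and n == 32
// well defined (an unmasked shift by 32 would be undefined behaviour).
constexpr std::uint32_t rotl32(std::uint32_t x, std::uint32_t n) {
    return (x << (n & 31u)) | (x >> ((32u - n) & 31u));
}

// One mix round, written the same way as in the kernel: add, rotate, xor.
inline void mix(std::uint32_t &x0, std::uint32_t &x1, std::uint32_t r) {
    x0 += x1;
    x1 = rotl32(x1, r);
    x1 ^= x0;
}

// Inverse of mix: undo the xor, rotate right by r, undo the add.
inline void unmix(std::uint32_t &x0, std::uint32_t &x1, std::uint32_t r) {
    x1 ^= x0;
    x1 = rotl32(x1, 32u - r);  // rotating left by 32 - r rotates right by r
    x0 -= x1;
}

// Applying mix and then unmix must restore the original pair.
inline void roundtrip_check(std::uint32_t a, std::uint32_t b, std::uint32_t r) {
    std::uint32_t x0 = a, x1 = b;
    mix(x0, x1, r);
    unmix(x0, x1, r);
    assert(x0 == a && x1 == b);
}
```

Calling `roundtrip_check(0x12345678u, 0x9abcdef0u, 13u)` exercises one round with the same rotation amount as `R0`.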
diff --git a/src/backend/cuda/kernel/range.cuh b/src/backend/cuda/kernel/range.cuh
new file mode 100644
index 0000000000..753bbad174
--- /dev/null
+++ b/src/backend/cuda/kernel/range.cuh
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void range(Param<T> out, const int dim, const int blocksPerMatX,
+                      const int blocksPerMatY) {
+    const int mul0 = (dim == 0);
+    const int mul1 = (dim == 1);
+    const int mul2 = (dim == 2);
+    const int mul3 = (dim == 3);
+
+    const int oz = blockIdx.x / blocksPerMatX;
+    const int ow = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - ow * blocksPerMatY;
+
+    const int xx = threadIdx.x + blockIdx_x * blockDim.x;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (xx >= out.dims[0] || yy >= out.dims[1] || oz >= out.dims[2] ||
+        ow >= out.dims[3])
+        return;
+
+    const int ozw = ow * out.strides[3] + oz * out.strides[2];
+
+    int valZW = (mul3 * ow) + (mul2 * oz);
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    for (int oy = yy; oy < out.dims[1]; oy += incy) {
+        compute_t<T> valYZW = valZW + (mul1 * oy);
+        int oyzw            = ozw + oy * out.strides[1];
+        for (int ox = xx; ox < out.dims[0]; ox += incx) {
+            int oidx         = oyzw + ox;
+            compute_t<T> val = valYZW + static_cast<compute_t<T>>(ox * mul0);
+
+            out.ptr[oidx] = val;
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
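Review note on the `range` kernel above: each output element receives its own coordinate along `dim`, which is what the `mul0` to `mul3` selectors compute without branching. The sketch below is a CPU reference that may help when checking the kernel. It assumes a contiguous layout with `dims[0]` varying fastest and an `int` element type; `range_reference` is a hypothetical helper, not an ArrayFire function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the range kernel (hypothetical helper).
// Assumes a contiguous buffer in which dims[0] varies fastest.
inline std::vector<int> range_reference(const std::array<int, 4> &dims,
                                        int dim) {
    assert(dim >= 0 && dim < 4);
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(dims[0]) * dims[1] * dims[2] *
                dims[3]);
    for (int w = 0; w < dims[3]; ++w) {
        for (int z = 0; z < dims[2]; ++z) {
            for (int y = 0; y < dims[1]; ++y) {
                for (int x = 0; x < dims[0]; ++x) {
                    // Equivalent to mul0*x + mul1*y + mul2*z + mul3*w.
                    const int coord[4] = {x, y, z, w};
                    out.push_back(coord[dim]);
                }
            }
        }
    }
    return out;
}
```

For dims `{3, 2, 1, 1}` the reference yields `0 1 2 0 1 2` along dimension 0 and `0 0 0 1 1 1` along dimension 1.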
diff --git a/src/backend/cuda/kernel/range.hpp b/src/backend/cuda/kernel/range.hpp
index 3cabc10356..9b75276dc4 100644
--- a/src/backend/cuda/kernel/range.hpp
+++ b/src/backend/cuda/kernel/range.hpp
@@ -7,83 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/range_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 512;
-        static const unsigned TILEY = 32;
-
-        template<typename T>
-        __global__
-        void range_kernel(Param<T> out, const int dim,
-                          const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int mul0 = (dim == 0);
-            const int mul1 = (dim == 1);
-            const int mul2 = (dim == 2);
-            const int mul3 = (dim == 3);
-
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            if(xx >= out.dims[0] ||
-               yy >= out.dims[1] ||
-               oz >= out.dims[2] ||
-               ow >= out.dims[3])
-                return;
+template<typename T>
+void range(Param<T> out, const int dim) {
+    constexpr unsigned RANGE_TX    = 32;
+    constexpr unsigned RANGE_TY    = 8;
+    constexpr unsigned RANGE_TILEX = 512;
+    constexpr unsigned RANGE_TILEY = 32;
 
-            const int ozw = ow * out.strides[3] + oz * out.strides[2];
+    auto range = common::getKernel("arrayfire::cuda::range", {{range_cuh_src}},
+                                   TemplateArgs(TemplateTypename<T>()));
 
-            T valZW = (mul3 * ow) + (mul2 * oz);
+    dim3 threads(RANGE_TX, RANGE_TY, 1);
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+    int blocksPerMatX = divup(out.dims[0], RANGE_TILEX);
+    int blocksPerMatY = divup(out.dims[1], RANGE_TILEY);
+    dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3], 1);
 
-            for(int oy = yy; oy < out.dims[1]; oy += incy) {
-                T valYZW = valZW + (mul1 * oy);
-                int oyzw = ozw + oy * out.strides[1];
-                for(int ox = xx; ox < out.dims[0]; ox += incx) {
-                    int oidx = oyzw + ox;
-                    T val = valYZW + (ox * mul0);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-                    out.ptr[oidx] = val;
-                }
-            }
-        }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void range(Param<T> out, const int dim)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(out.dims[0], TILEX);
-            int blocksPerMatY = divup(out.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
-
-            range_kernel<T><<<blocks, threads>>>(out, dim, blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    range(qArgs, out, dim, blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
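Review note on the launch configuration above: when `blocks.y` exceeds the device limit, the wrapper spills the excess into `blocks.z`, and the kernel rebuilds the flat index as `blockIdx.y + blockIdx.z * gridDim.y`. The host-side sketch below shows the arithmetic. `divup` here is a local stand-in assumed to be ceiling division, and `GridYZ`, `split_grid_y` and `flat_block_y` are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Local stand-in for a ceiling-division helper (assumption).
constexpr unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

struct GridYZ {
    unsigned y;
    unsigned z;
};

// Split a logical y extent into (y, z) so that y stays within maxBlocksY.
inline GridYZ split_grid_y(unsigned blocksY, unsigned maxBlocksY) {
    assert(blocksY > 0 && maxBlocksY > 0);
    const unsigned z = divup(blocksY, maxBlocksY);
    const unsigned y = divup(blocksY, z);
    return GridYZ{y, z};
}

// Flat y index as reconstructed inside the kernel.
constexpr unsigned flat_block_y(unsigned blockIdxY, unsigned blockIdxZ,
                                unsigned gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

Because both divisions round up, `y * z` can slightly exceed the requested extent. For example, 131071 blocks with a limit of 65535 become `y = 43691`, `z = 3`, which covers 131073 blocks. The two surplus blocks exit through the kernel's bounds check on `ow`.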
diff --git a/src/backend/cuda/kernel/reduce.hpp b/src/backend/cuda/kernel/reduce.hpp
index a5961f39d9..c3cf279b39 100644
--- a/src/backend/cuda/kernel/reduce.hpp
+++ b/src/backend/cuda/kernel/reduce.hpp
@@ -7,441 +7,541 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <ops.hpp>
-#include <backend.hpp>
+#pragma once
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
 #include <debug_cuda.hpp>
-#include "config.hpp"
+#include <err_cuda.hpp>
+#include <math.hpp>
 #include <memory.hpp>
-#include <boost/scoped_ptr.hpp>
-
-using boost::scoped_ptr;
-
-namespace cuda
-{
-namespace kernel
-{
-    template<typename Ti, typename To, af_op_t op, uint dim, uint DIMY>
-    __global__
-    static void reduce_dim_kernel(Param<To> out,
-                                  CParam <Ti> in,
-                                  uint blocks_x, uint blocks_y, uint offset_dim)
-    {
-        const uint tidx = threadIdx.x;
-        const uint tidy = threadIdx.y;
-        const uint tid  = tidy * THREADS_X + tidx;
-
-        const uint zid = blockIdx.x / blocks_x;
-        const uint wid = blockIdx.y / blocks_y;
-        const uint blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const uint blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const uint xid = blockIdx_x * blockDim.x + tidx;
-        const uint yid = blockIdx_y; // yid  of output. updated for input later.
-
-        uint ids[4] = {xid, yid, zid, wid};
-
-        const Ti *iptr = in.ptr;
-        To *optr = out.ptr;
-
-        // There is only one element per block for out
-        // There are blockDim.y elements per block for in
-        // Hence increment ids[dim] just after offseting out and before offsetting in
-        optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] + ids[1] * out.strides[1] + ids[0];
-        const uint blockIdx_dim = ids[dim];
-
-        ids[dim] = ids[dim] * blockDim.y + tidy;
-        iptr  += ids[3] * in.strides[3] + ids[2] * in.strides[2] + ids[1] * in.strides[1] + ids[0];
-        const uint id_dim_in = ids[dim];
-
-        const uint istride_dim = in.strides[dim];
-
-        bool is_valid =
-            (ids[0] < in.dims[0]) &&
-            (ids[1] < in.dims[1]) &&
-            (ids[2] < in.dims[2]) &&
-            (ids[3] < in.dims[3]);
-
-        Transform<Ti, To, op> transform;
-        Binary<To, op> reduce;
-
-        __shared__ To s_val[THREADS_X * DIMY];
-
-        To out_val = reduce.init();
-        for (int id = id_dim_in; is_valid && (id < in.dims[dim]); id += offset_dim * blockDim.y) {
-            To in_val = transform(*iptr);
-            out_val = reduce(in_val, out_val);
-            iptr = iptr + offset_dim * blockDim.y * istride_dim;
-        }
+#include "config.hpp"
 
-        s_val[tid] = out_val;
+#include <cub/warp/warp_reduce.cuh>
+
+#include <climits>
+#include <vector>
+
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op, uint dim, uint DIMY>
+__global__ static void reduce_dim_kernel(Param<To> out, CParam<Ti> in,
+                                         uint blocks_x, uint blocks_y,
+                                         uint offset_dim, bool change_nan,
+                                         To nanval) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * THREADS_X + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint xid        = blockIdx_x * blockDim.x + tidx;
+
+    __shared__ compute_t<To> s_val[THREADS_X * DIMY];
+
+    const uint wid = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint yid = blockIdx_y;  // yid of output; updated for input later.
+
+    uint ids[4] = {xid, yid, zid, wid};
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+    data_t<To> *const optr = out.ptr + ids[3] * out.strides[3] +
+                             ids[2] * out.strides[2] + ids[1] * out.strides[1] +
+                             ids[0];
+
+    const uint blockIdx_dim = ids[dim];
+    ids[dim]                = ids[dim] * blockDim.y + tidy;
+
+    const data_t<Ti> *iptr = in.ptr + ids[3] * in.strides[3] +
+                             ids[2] * in.strides[2] + ids[1] * in.strides[1] +
+                             ids[0];
+
+    const uint id_dim_in   = ids[dim];
+    const uint istride_dim = in.strides[dim];
+
+    bool is_valid = (ids[0] < in.dims[0]) && (ids[1] < in.dims[1]) &&
+                    (ids[2] < in.dims[2]) && (ids[3] < in.dims[3]);
+
+    common::Transform<Ti, compute_t<To>, op> transform;
+    common::Binary<compute_t<To>, op> reduce;
+    compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+    for (int id = id_dim_in; is_valid && (id < in.dims[dim]);
+         id += offset_dim * blockDim.y) {
+        compute_t<To> in_val = transform(*iptr);
+        if (change_nan)
+            in_val = !IS_NAN(in_val) ? in_val : compute_t<To>(nanval);
+        out_val = reduce(in_val, out_val);
+        iptr    = iptr + offset_dim * blockDim.y * istride_dim;
+    }
 
-        To *s_ptr = s_val + tid;
-        __syncthreads();
+    s_val[tid] = out_val;
 
-        if (DIMY == 8) {
-            if (tidy < 4) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 4]);
-            __syncthreads();
-        }
+    compute_t<To> *s_ptr = s_val + tid;
+    __syncthreads();
 
-        if (DIMY >= 4) {
-            if (tidy < 2) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 2]);
-            __syncthreads();
-        }
+    if (DIMY == 8) {
+        if (tidy < 4) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 4]);
+        __syncthreads();
+    }
 
-        if (DIMY >= 2) {
-            if (tidy < 1) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 1]);
-            __syncthreads();
-        }
+    if (DIMY >= 4) {
+        if (tidy < 2) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 2]);
+        __syncthreads();
+    }
 
-        if (tidy == 0 && is_valid &&
-            (blockIdx_dim < out.dims[dim])) {
-            *optr = *s_ptr;
-        }
+    if (DIMY >= 2) {
+        if (tidy < 1) *s_ptr = reduce(*s_ptr, s_ptr[THREADS_X * 1]);
+        __syncthreads();
+    }
 
+    if (tidy == 0 && is_valid && (blockIdx_dim < out.dims[dim])) {
+        *optr = *s_ptr;
     }
+}
+
+template<typename Ti, typename To, af_op_t op, int dim>
+void reduce_dim_launcher(Param<To> out, CParam<Ti> in, const uint threads_y,
+                         const dim_t blocks_dim[4], bool change_nan,
+                         double nanval) {
+    dim3 threads(THREADS_X, threads_y);
 
-    template<typename Ti, typename To, af_op_t op, int dim>
-    void reduce_dim_launcher(Param<To> out, CParam<Ti> in,
-                             const uint threads_y, const uint blocks_dim[4])
-    {
-        dim3 threads(THREADS_X, threads_y);
+    dim3 blocks(blocks_dim[0] * blocks_dim[2], blocks_dim[1] * blocks_dim[3]);
 
-        dim3 blocks(blocks_dim[0] * blocks_dim[2],
-                    blocks_dim[1] * blocks_dim[3]);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        switch (threads_y) {
+    switch (threads_y) {
         case 8:
-            (reduce_dim_kernel<Ti, To, op, dim, 8>)<<<blocks, threads>>>(
-                out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
+            CUDA_LAUNCH((reduce_dim_kernel<Ti, To, op, dim, 8>), blocks,
+                        threads, out, in, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim], change_nan, scalar<To>(nanval));
+            break;
         case 4:
-            (reduce_dim_kernel<Ti, To, op, dim, 4>)<<<blocks, threads>>>(
-                out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
+            CUDA_LAUNCH((reduce_dim_kernel<Ti, To, op, dim, 4>), blocks,
+                        threads, out, in, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim], change_nan, scalar<To>(nanval));
+            break;
         case 2:
-            (reduce_dim_kernel<Ti, To, op, dim, 2>)<<<blocks, threads>>>(
-                out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
+            CUDA_LAUNCH((reduce_dim_kernel<Ti, To, op, dim, 2>), blocks,
+                        threads, out, in, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim], change_nan, scalar<To>(nanval));
+            break;
         case 1:
-            (reduce_dim_kernel<Ti, To, op, dim, 1>)<<<blocks, threads>>>(
-                out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim]); break;
-        }
-
-        POST_LAUNCH_CHECK();
+            CUDA_LAUNCH((reduce_dim_kernel<Ti, To, op, dim, 1>), blocks,
+                        threads, out, in, blocks_dim[0], blocks_dim[1],
+                        blocks_dim[dim], change_nan, scalar<To>(nanval));
+            break;
     }
 
-    template<typename Ti, typename To, af_op_t op, int dim>
-    void reduce_dim(Param<To> out,  CParam<Ti> in)
-    {
-        uint threads_y = std::min(THREADS_Y, nextpow2(in.dims[dim]));
-        uint threads_x = THREADS_X;
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename To, af_op_t op, int dim>
+void reduce_dim(Param<To> out, CParam<Ti> in, bool change_nan, double nanval) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.dims[dim]));
+    uint threads_x = THREADS_X;
 
-        uint blocks_dim[] = {divup(in.dims[0], threads_x),
-                             in.dims[1], in.dims[2], in.dims[3]};
+    dim_t blocks_dim[] = {divup(in.dims[0], threads_x), in.dims[1], in.dims[2],
+                          in.dims[3]};
 
-        blocks_dim[dim] = divup(in.dims[dim], threads_y * REPEAT);
+    blocks_dim[dim] = divup(in.dims[dim], threads_y * REPEAT);
 
-        Param<To> tmp = out;
+    Param<To> tmp = out;
+    uptr<To> tmp_alloc;
+    if (blocks_dim[dim] > 1) {
+        int tmp_elements = 1;
+        tmp.dims[dim]    = blocks_dim[dim];
 
-        if (blocks_dim[dim] > 1) {
-            int tmp_elements = 1;
-            tmp.dims[dim] = blocks_dim[dim];
+        for (int k = 0; k < 4; k++) tmp_elements *= tmp.dims[k];
+        tmp_alloc = memAlloc<To>(tmp_elements);
+        tmp.ptr   = tmp_alloc.get();
 
-            for (int k = 0; k < 4; k++) tmp_elements *= tmp.dims[k];
-            tmp.ptr = memAlloc<To>(tmp_elements);
+        for (int k = dim + 1; k < 4; k++) tmp.strides[k] *= blocks_dim[dim];
+    }
 
-            for (int k = dim + 1; k < 4; k++) tmp.strides[k] *= blocks_dim[dim];
+    reduce_dim_launcher<Ti, To, op, dim>(tmp, in, threads_y, blocks_dim,
+                                         change_nan, nanval);
+
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
+
+        if (op == af_notzero_t) {
+            reduce_dim_launcher<To, To, af_add_t, dim>(
+                out, tmp, threads_y, blocks_dim, change_nan, nanval);
+        } else {
+            reduce_dim_launcher<To, To, op, dim>(
+                out, tmp, threads_y, blocks_dim, change_nan, nanval);
         }
+    }
+}
 
-        reduce_dim_launcher<Ti, To, op, dim>(tmp, in, threads_y, blocks_dim);
+template<typename Ti, typename To, af_op_t op, uint DIMX>
+__global__ static void reduce_first_kernel(Param<To> out, CParam<Ti> in,
+                                           uint blocks_x, uint blocks_y,
+                                           uint repeat, bool change_nan,
+                                           To nanval) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * blockDim.x + tidx;
 
-        if (blocks_dim[dim] > 1) {
-            blocks_dim[dim] = 1;
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint xid        = blockIdx_x * blockDim.x * repeat + tidx;
 
-            if (op == af_notzero_t) {
-                reduce_dim_launcher<To, To, af_add_t, dim>(out, tmp, threads_y, blocks_dim);
-            } else {
-                reduce_dim_launcher<To, To,       op, dim>(out, tmp, threads_y, blocks_dim);
-            }
+    common::Binary<compute_t<To>, op> reduce;
+    common::Transform<Ti, compute_t<To>, op> transform;
 
-            memFree(tmp.ptr);
-        }
+    __shared__ compute_t<To> s_val[THREADS_PER_BLOCK];
 
-    }
+    const uint wid = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint yid = blockIdx_y * blockDim.y + tidy;
+
+    const data_t<Ti> *const iptr =
+        in.ptr +
+        (wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1]);
 
-    template<typename To, af_op_t op>
-    __device__ void warp_reduce_sync(To *s_ptr, uint tidx)
-    {
+    if (yid >= in.dims[1] || zid >= in.dims[2] || wid >= in.dims[3]) return;
 
+    int lim = min((int)(xid + repeat * DIMX), in.dims[0]);
+
+    compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+    for (int id = xid; id < lim; id += DIMX) {
+        compute_t<To> in_val = transform(iptr[id]);
+        if (change_nan)
+            in_val =
+                !IS_NAN(in_val) ? in_val : static_cast<compute_t<To>>(nanval);
+        out_val = reduce(in_val, out_val);
     }
 
-#if (__CUDA_ARCH__ >= 300)
-    template<typename To, af_op_t op>
-    __device__ void warp_reduce_shfl(To *s_ptr, uint tidx)
-    {
+    s_val[tid] = out_val;
+
+    __syncthreads();
+    compute_t<To> *s_ptr = s_val + tidy * DIMX;
 
+    if (DIMX == 256) {
+        if (tidx < 128) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx + 128]);
+        __syncthreads();
     }
-#endif
-
-    template<typename To, af_op_t op>
-    struct WarpReduce
-    {
-        __device__ To operator()(To *s_ptr, uint tidx)
-        {
-            Binary<To, op> reduce;
-#pragma unroll
-            for (int n = 16; n >= 1; n >>= 1) {
-                if (tidx < n) {
-                    s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx + n]);
-                }
-                __syncthreads();
-            }
-            return s_ptr[tidx];
-        }
-    };
-
-
-#if (__CUDA_ARCH__ >= 300)
-#define WARP_REDUCE(T)                                  \
-    template<af_op_t op>                                \
-    struct WarpReduce<T, op>                            \
-    {                                                   \
-        __device__ T operator()(T *s_ptr, uint tidx)    \
-        {                                               \
-            Binary<T, op> reduce;                       \
-                                                        \
-            T val = s_ptr[tidx];                        \
-                                                        \
-            for (int n = 16; n >= 1; n >>= 1) {         \
-                val = reduce(val, __shfl_down(val, n)); \
-            }                                           \
-            return val;                                 \
-        }                                               \
-    };                                                  \
-
-    WARP_REDUCE(float)
-    WARP_REDUCE(int)
-    WARP_REDUCE(uchar) // upcasted to int
-    WARP_REDUCE(char)  // upcasted to int
-#endif
-
-    template<typename Ti, typename To, af_op_t op, uint DIMX>
-    __global__
-    static void reduce_first_kernel(Param<To> out,
-                                    CParam<Ti>  in,
-                                    uint blocks_x, uint blocks_y, uint repeat)
-    {
-        const uint tidx = threadIdx.x;
-        const uint tidy = threadIdx.y;
-        const uint tid  = tidy * blockDim.x + tidx;
-
-        const uint zid = blockIdx.x / blocks_x;
-        const uint wid = blockIdx.y / blocks_y;
-        const uint blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const uint blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const uint xid = blockIdx_x * blockDim.x * repeat + tidx;
-        const uint yid = blockIdx_y * blockDim.y + tidy;
-
-        const Ti *iptr = in.ptr;
-        To *optr = out.ptr;
-
-        iptr += wid *  in.strides[3] + zid *  in.strides[2] + yid *  in.strides[1];
-        optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
-
-        if (yid >= in.dims[1] ||
-            zid >= in.dims[2] ||
-            wid >= in.dims[3]) return;
-
-        Transform<Ti, To, op> transform;
-        Binary<To, op> reduce;
-
-        __shared__ To s_val[THREADS_PER_BLOCK];
-
-        To out_val = reduce.init();
-        int lim = min((int)(xid + repeat * DIMX), in.dims[0]);
-
-        for (int id = xid; id < lim; id += DIMX) {
-            To in_val = transform(iptr[id]);
-            out_val = reduce(in_val, out_val);
-        }
 
-        s_val[tid] = out_val;
+    if (DIMX >= 128) {
+        if (tidx < 64) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx + 64]);
         __syncthreads();
-        To *s_ptr = s_val + tidy * DIMX;
+    }
 
-        if (DIMX == 256) {
-            if (tidx < 128) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx + 128]);
-            __syncthreads();
-        }
+    if (DIMX >= 64) {
+        if (tidx < 32) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx + 32]);
+        __syncthreads();
+    }
 
-        if (DIMX >= 128) {
-            if (tidx <  64) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx +  64]);
-            __syncthreads();
-        }
+    typedef cub::WarpReduce<compute_t<To>> WarpReduce;
+    __shared__ typename WarpReduce::TempStorage temp_storage;
 
-        if (DIMX >=  64) {
-            if (tidx <  32) s_ptr[tidx] = reduce(s_ptr[tidx], s_ptr[tidx +  32]);
-            __syncthreads();
-        }
+    compute_t<To> warp_val = s_ptr[tidx];
+    out_val                = WarpReduce(temp_storage).Reduce(warp_val, reduce);
 
-        out_val = WarpReduce<To, op>()(s_ptr, tidx);
+    data_t<To> *const optr =
+        out.ptr +
+        (wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1]);
+    if (tidx == 0) optr[blockIdx_x] = data_t<To>(out_val);
+}
 
-        if (tidx == 0) {
-            optr[blockIdx_x] = out_val;
-        }
+template<typename Ti, typename To, af_op_t op, uint DIMX>
+__global__ static void reduce_all_kernel(Param<To> out,
+                                         Param<unsigned> retirementCount,
+                                         Param<To> tmp, CParam<Ti> in,
+                                         uint blocks_x, uint blocks_y,
+                                         uint repeat, bool change_nan,
+                                         To nanval) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+    const uint tid  = tidy * DIMX + tidx;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint xid        = blockIdx_x * blockDim.x * repeat + tidx;
+
+    const uint wid = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint yid = blockIdx_y * blockDim.y + tidy;
+
+    common::Binary<compute_t<To>, op> reduce;
+    common::Transform<Ti, compute_t<To>, op> transform;
+
+    const int nwarps = THREADS_PER_BLOCK / 32;
+    __shared__ compute_t<To> s_val[nwarps];
+
+    const data_t<Ti> *const iptr =
+        in.ptr +
+        (wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1]);
+
+    bool cond = yid < in.dims[1] && zid < in.dims[2] && wid < in.dims[3];
+
+    int lim = min((int)(xid + repeat * DIMX), in.dims[0]);
+
+    compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+    for (int id = xid; cond && id < lim; id += DIMX) {
+        compute_t<To> in_val = transform(iptr[id]);
+        if (change_nan)
+            in_val =
+                !IS_NAN(in_val) ? in_val : static_cast<compute_t<To>>(nanval);
+        out_val = reduce(in_val, out_val);
     }
 
-    template<typename Ti, typename To, af_op_t op>
-    void reduce_first_launcher(Param<To> out, CParam<Ti> in,
-                               const uint blocks_x, const uint blocks_y, const uint threads_x)
-    {
+    const int warpid = tid / 32;
+    const int lid    = tid % 32;
 
-        dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
-        dim3 blocks(blocks_x * in.dims[2],
-                    blocks_y * in.dims[3]);
+    typedef cub::WarpReduce<compute_t<To>> WarpReduce;
+    __shared__ typename WarpReduce::TempStorage temp_storage[nwarps];
 
-        uint repeat = divup(in.dims[0], (blocks_x * threads_x));
+    out_val = WarpReduce(temp_storage[warpid]).Reduce(out_val, reduce);
 
-        switch (threads_x) {
-        case 32:
-            (reduce_first_kernel<Ti, To, op,  32>)<<<blocks, threads>>>(
-                out, in, blocks_x, blocks_y, repeat); break;
-        case 64:
-            (reduce_first_kernel<Ti, To, op,  64>)<<<blocks, threads>>>(
-                out, in, blocks_x, blocks_y, repeat); break;
-        case 128:
-            (reduce_first_kernel<Ti, To, op,  128>)<<<blocks, threads>>>(
-                out, in, blocks_x, blocks_y, repeat); break;
-        case 256:
-            (reduce_first_kernel<Ti, To, op,  256>)<<<blocks, threads>>>(
-                out, in, blocks_x, blocks_y, repeat); break;
-        }
+    if (cond && lid == 0) {
+        s_val[warpid] = out_val;
+    } else if (!cond) {
+        s_val[warpid] = common::Binary<compute_t<To>, op>::init();
+    }
+    __syncthreads();
 
-        POST_LAUNCH_CHECK();
+    if (tid < 32) {
+        out_val = tid < nwarps ? s_val[tid]
+                               : common::Binary<compute_t<To>, op>::init();
+        out_val = WarpReduce(temp_storage[0]).Reduce(out_val, reduce);
     }
 
-    template<typename Ti, typename To, af_op_t op>
-    void reduce_first(Param<To> out, CParam<Ti> in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_BLOCK);
-        uint threads_y = THREADS_PER_BLOCK / threads_x;
-
-        uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
-        uint blocks_y = divup(in.dims[1], threads_y);
-
-        Param<To> tmp = out;
-        if (blocks_x > 1) {
-            tmp.ptr = memAlloc<To>(blocks_x *
-                                   in.dims[1] *
-                                   in.dims[2] *
-                                   in.dims[3]);
-
-            tmp.dims[0] = blocks_x;
-            for (int k = 1; k < 4; k++) tmp.strides[k] *= blocks_x;
+    const unsigned total_blocks = (gridDim.x * gridDim.y * gridDim.z);
+    const int uubidx            = (gridDim.x * gridDim.y) * blockIdx.z +
+                       (gridDim.x * blockIdx.y) + blockIdx.x;
+    if (cond && tid == 0) {
+        if (total_blocks != 1) {
+            tmp.ptr[uubidx] = data_t<To>(out_val);
+        } else {
+            out.ptr[0] = data_t<To>(out_val);
         }
+    }
 
-        reduce_first_launcher<Ti, To, op>(tmp, in, blocks_x, blocks_y, threads_x);
-
-        if (blocks_x > 1) {
+    // Last block to perform final reduction
+    if (total_blocks > 1) {
+        __shared__ bool amLast;
 
-            //FIXME: Is there an alternative to the if condition ?
-            if (op == af_notzero_t) {
-                reduce_first_launcher<To, To, af_add_t>(out, tmp, 1, blocks_y, threads_x);
-            } else {
-                reduce_first_launcher<To, To,       op>(out, tmp, 1, blocks_y, threads_x);
-            }
+        // wait until all outstanding memory instructions in this thread are
+        // finished
+        __threadfence();
 
-            memFree(tmp.ptr);
+        // Thread 0 takes a ticket
+        if (tid == 0) {
+            unsigned int ticket = atomicInc(retirementCount.ptr, total_blocks);
+            // If the ticket equals total_blocks - 1, we are the last block
+            amLast = (ticket == (total_blocks - 1));
         }
-    }
+        __syncthreads();  // make amLast visible to the whole block
 
-    template<typename Ti, typename To, af_op_t op>
-    void reduce(Param<To> out, CParam<Ti> in, int dim)
-    {
-        switch (dim) {
-        case 0: return reduce_first<Ti, To, op   >(out, in);
-        case 1: return reduce_dim  <Ti, To, op, 1>(out, in);
-        case 2: return reduce_dim  <Ti, To, op, 2>(out, in);
-        case 3: return reduce_dim  <Ti, To, op, 3>(out, in);
-        }
-    }
+        if (amLast) {
+            int i   = tid;
+            out_val = common::Binary<compute_t<To>, op>::init();
 
-    template<typename Ti, typename To, af_op_t op>
-    To reduce_all(CParam<Ti> in)
-    {
-        int in_elements = in.strides[3] * in.dims[3];
+            while (i < total_blocks) {
+                compute_t<To> in_val = compute_t<To>(tmp.ptr[i]);
+                out_val              = reduce(in_val, out_val);
+                i += THREADS_PER_BLOCK;
+            }
 
-        // FIXME: Use better heuristics to get to the optimum number
-        if (in_elements > 4096) {
+            out_val = WarpReduce(temp_storage[warpid]).Reduce(out_val, reduce);
+            if (lid == 0) { s_val[warpid] = out_val; }
+            __syncthreads();
 
-            bool is_linear = (in.strides[0] == 1);
-            for (int k = 1; k < 4; k++) {
-                is_linear &= (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
+            if (tid < 32) {
+                out_val = tid < nwarps
+                              ? s_val[tid]
+                              : common::Binary<compute_t<To>, op>::init();
+                out_val = WarpReduce(temp_storage[0]).Reduce(out_val, reduce);
             }
 
-            if (is_linear) {
-                in.dims[0] = in_elements;
-                for (int k = 1; k < 4; k++) {
-                    in.dims[k] = 1;
-                    in.strides[k] = in_elements;
-                }
+            if (tid == 0) {
+                out.ptr[0] = out_val;
+
+                // reset retirement count so that next run succeeds
+                retirementCount.ptr[0] = 0;
             }
+        }
+    }
+}
 
-            uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
-            threads_x = std::min(threads_x, THREADS_PER_BLOCK);
-            uint threads_y = THREADS_PER_BLOCK / threads_x;
+template<typename Ti, typename To, af_op_t op>
+void reduce_all_launcher(Param<To> out, CParam<Ti> in, const uint blocks_x,
+                         const uint blocks_y, const uint threads_x,
+                         bool change_nan, double nanval) {
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * in.dims[2], blocks_y * in.dims[3]);
 
-            Param<To> tmp;
+    uint repeat = divup(in.dims[0], (blocks_x * threads_x));
 
-            uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
-            uint blocks_y = divup(in.dims[1], threads_y);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-            tmp.dims[0] = blocks_x;
-            tmp.strides[0] = 1;
+    size_t tmp_elements = static_cast<size_t>(blocks.x) * blocks.y * blocks.z;
+    if (tmp_elements > UINT_MAX) {
+        AF_ERROR("Too many blocks requested (retirementCount == unsigned)",
+                 AF_ERR_RUNTIME);
+    }
+    Array<To> tmp                   = createEmptyArray<To>(tmp_elements);
+    Array<unsigned> retirementCount = createValueArray<unsigned>(1, 0);
 
-            for (int k = 1; k < 4; k++) {
-                tmp.dims[k] = in.dims[k];
-                tmp.strides[k] = tmp.dims[k - 1] * tmp.strides[k - 1];
-            }
+    switch (threads_x) {
+        case 32:
+            CUDA_LAUNCH((reduce_all_kernel<Ti, To, op, 32>), blocks, threads,
+                        out, retirementCount, tmp, in, blocks_x, blocks_y,
+                        repeat, change_nan, scalar<To>(nanval));
+            break;
+        case 64:
+            CUDA_LAUNCH((reduce_all_kernel<Ti, To, op, 64>), blocks, threads,
+                        out, retirementCount, tmp, in, blocks_x, blocks_y,
+                        repeat, change_nan, scalar<To>(nanval));
+            break;
+        case 128:
+            CUDA_LAUNCH((reduce_all_kernel<Ti, To, op, 128>), blocks, threads,
+                        out, retirementCount, tmp, in, blocks_x, blocks_y,
+                        repeat, change_nan, scalar<To>(nanval));
+            break;
+        case 256:
+            CUDA_LAUNCH((reduce_all_kernel<Ti, To, op, 256>), blocks, threads,
+                        out, retirementCount, tmp, in, blocks_x, blocks_y,
+                        repeat, change_nan, scalar<To>(nanval));
+            break;
+    }
 
-            int tmp_elements = tmp.strides[3] * tmp.dims[3];
+    POST_LAUNCH_CHECK();
+}
 
-            tmp.ptr = memAlloc<To>(tmp_elements);
-            reduce_first_launcher<Ti, To, op>(tmp, in, blocks_x, blocks_y, threads_x);
+template<typename Ti, typename To, af_op_t op>
+void reduce_first_launcher(Param<To> out, CParam<Ti> in, const uint blocks_x,
+                           const uint blocks_y, const uint threads_x,
+                           bool change_nan, double nanval) {
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * in.dims[2], blocks_y * in.dims[3]);
 
-            scoped_ptr<To> h_ptr(new To[tmp_elements]);
-            To* h_ptr_raw = h_ptr.get();
+    uint repeat = divup(in.dims[0], (blocks_x * threads_x));
 
-            CUDA_CHECK(cudaMemcpy(h_ptr_raw, tmp.ptr, tmp_elements * sizeof(To), cudaMemcpyDeviceToHost));
-            memFree(tmp.ptr);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-            Binary<To, op> reduce;
-            To out = reduce.init();
-            for (int i = 0; i < tmp_elements; i++) {
-                out = reduce(out, h_ptr_raw[i]);
-            }
+    switch (threads_x) {
+        case 32:
+            CUDA_LAUNCH((reduce_first_kernel<Ti, To, op, 32>), blocks, threads,
+                        out, in, blocks_x, blocks_y, repeat, change_nan,
+                        scalar<To>(nanval));
+            break;
+        case 64:
+            CUDA_LAUNCH((reduce_first_kernel<Ti, To, op, 64>), blocks, threads,
+                        out, in, blocks_x, blocks_y, repeat, change_nan,
+                        scalar<To>(nanval));
+            break;
+        case 128:
+            CUDA_LAUNCH((reduce_first_kernel<Ti, To, op, 128>), blocks, threads,
+                        out, in, blocks_x, blocks_y, repeat, change_nan,
+                        scalar<To>(nanval));
+            break;
+        case 256:
+            CUDA_LAUNCH((reduce_first_kernel<Ti, To, op, 256>), blocks, threads,
+                        out, in, blocks_x, blocks_y, repeat, change_nan,
+                        scalar<To>(nanval));
+            break;
+    }
 
-            return out;
+    POST_LAUNCH_CHECK();
+}
 
-        } else {
+template<typename Ti, typename To, af_op_t op>
+void reduce_first(Param<To> out, CParam<Ti> in, bool change_nan,
+                  double nanval) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.dims[1], threads_y);
+
+    Param<To> tmp = out;
+    uptr<To> tmp_alloc;
+    if (blocks_x > 1) {
+        tmp_alloc =
+            memAlloc<To>(blocks_x * in.dims[1] * in.dims[2] * in.dims[3]);
+        tmp.ptr = tmp_alloc.get();
+
+        tmp.dims[0] = blocks_x;
+        for (int k = 1; k < 4; k++) tmp.strides[k] *= blocks_x;
+    }
 
-            scoped_ptr<Ti> h_ptr(new Ti[in_elements]);
-            Ti* h_ptr_raw = h_ptr.get();
-            CUDA_CHECK(cudaMemcpy(h_ptr_raw, in.ptr, in_elements * sizeof(Ti), cudaMemcpyDeviceToHost));
+    reduce_first_launcher<Ti, To, op>(tmp, in, blocks_x, blocks_y, threads_x,
+                                      change_nan, nanval);
 
-            Transform<Ti, To, op> transform;
-            Binary<To, op> reduce;
-            To out = reduce.init();
+    if (blocks_x > 1) {
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            reduce_first_launcher<To, To, af_add_t>(
+                out, tmp, 1, blocks_y, threads_x, change_nan, nanval);
+        } else {
+            reduce_first_launcher<To, To, op>(out, tmp, 1, blocks_y, threads_x,
+                                              change_nan, nanval);
+        }
+    }
+}
 
-            for (int i = 0; i < in_elements; i++) {
-                out = reduce(out, transform(h_ptr_raw[i]));
-            }
+template<typename Ti, typename To, af_op_t op>
+void reduce(Param<To> out, CParam<Ti> in, int dim, bool change_nan,
+            double nanval) {
+    switch (dim) {
+        case 0: return reduce_first<Ti, To, op>(out, in, change_nan, nanval);
+        case 1: return reduce_dim<Ti, To, op, 1>(out, in, change_nan, nanval);
+        case 2: return reduce_dim<Ti, To, op, 2>(out, in, change_nan, nanval);
+        case 3: return reduce_dim<Ti, To, op, 3>(out, in, change_nan, nanval);
+    }
+}
+template<typename Ti, typename To, af_op_t op>
+void reduce_all(Param<To> out, CParam<Ti> in, bool change_nan, double nanval) {
+    int in_elements = in.dims[0] * in.dims[1] * in.dims[2] * in.dims[3];
+    bool is_linear  = (in.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.strides[k] == (in.strides[k - 1] * in.dims[k - 1]));
+    }
 
-            return out;
+    if (is_linear) {
+        in.dims[0] = in_elements;
+        for (int k = 1; k < 4; k++) {
+            in.dims[k]    = 1;
+            in.strides[k] = in_elements;
         }
     }
 
+    uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    // TODO: perf REPEAT, consider removing or runtime eval
+    // max problem size < SM resident threads, don't use REPEAT
+    uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.dims[1], threads_y);
+
+    reduce_all_launcher<Ti, To, op>(out, in, blocks_x, blocks_y, threads_x,
+                                    change_nan, nanval);
 }
-}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/reduce_by_key.hpp b/src/backend/cuda/kernel/reduce_by_key.hpp
new file mode 100644
index 0000000000..1e04a123ec
--- /dev/null
+++ b/src/backend/cuda/kernel/reduce_by_key.hpp
@@ -0,0 +1,621 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <type_traits>
+#include "config.hpp"
+
+#include <kernel/shfl_intrinsics.hpp>
+#include <cub/device/device_reduce.cuh>
+
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+// Reduces keys across block boundaries
+template<typename Tk, typename To, af_op_t op>
+__global__ void final_boundary_reduce(int *reduced_block_sizes, Param<Tk> keys,
+                                      Param<To> vals, const int n) {
+    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    common::Binary<compute_t<To>, op> reduce;
+
+    if (tid == ((blockIdx.x + 1) * blockDim.x) - 1 &&
+        blockIdx.x < gridDim.x - 1) {
+        Tk k0 = keys.ptr[tid];
+        Tk k1 = keys.ptr[tid + 1];
+        if (k0 == k1) {
+            compute_t<To> v0                = compute_t<To>(vals.ptr[tid]);
+            compute_t<To> v1                = compute_t<To>(vals.ptr[tid + 1]);
+            vals.ptr[tid + 1]               = reduce(v0, v1);
+            reduced_block_sizes[blockIdx.x] = blockDim.x - 1;
+        } else {
+            reduced_block_sizes[blockIdx.x] = blockDim.x;
+        }
+    }
+
+    // if last block, set block size to difference between n and block boundary
+    if (threadIdx.x == 0 && blockIdx.x == gridDim.x - 1) {
+        reduced_block_sizes[blockIdx.x] = n - (blockIdx.x * blockDim.x);
+    }
+}
+
+// Tests if data needs further reduction, including across block boundaries
+template<typename Tk>
+__global__ void test_needs_reduction(int *needs_another_reduction,
+                                     int *needs_block_boundary_reduced,
+                                     CParam<Tk> keys_in, const int n) {
+    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
+    Tk k;
+
+    if (tid < n) { k = keys_in.ptr[tid]; }
+
+    int update_key = (k == shfl_down_sync(k, 1)) &&
+                     (tid < (n - 1)) && ((threadIdx.x % 32) < 31);
+    int remaining_updates = any_sync(update_key);
+
+    __syncthreads();
+
+    if (remaining_updates && (threadIdx.x % 32 == 0))
+        atomicOr(needs_another_reduction, remaining_updates);
+
+    // check across warp boundaries
+    update_key =
+        (((threadIdx.x % 32) == 31)           // last thread in warp
+         && (threadIdx.x < (blockDim.x - 1))  // not last thread in block
+         // next value valid and equal
+         && ((tid + 1) < n) && (k == keys_in.ptr[tid + 1]));
+    remaining_updates = any_sync(update_key);
+
+    // TODO: single per warp? change to assignment rather than atomicOr
+    if (remaining_updates) atomicOr(needs_another_reduction, remaining_updates);
+
+    // last thread in each block checks if any inter-block keys need further
+    // reduction
+    if (tid == ((blockIdx.x + 1) * blockDim.x) - 1 &&
+        blockIdx.x < gridDim.x - 1) {
+        Tk k0 = keys_in.ptr[tid];
+        Tk k1 = keys_in.ptr[tid + 1];
+        if (k0 == k1) { atomicOr(needs_block_boundary_reduced, 1); }
+    }
+}
+
+// Compacts "incomplete" block-sized chunks of data in global memory
+template<typename Tk, typename To>
+__global__ void compact(int *reduced_block_sizes, Param<Tk> keys_out,
+                        Param<To> vals_out, CParam<Tk> keys_in,
+                        CParam<To> vals_in, const int nBlocksZ) {
+    const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
+    const int bidy = blockIdx.y;
+    const int bidz = blockIdx.z % nBlocksZ;
+    const int bidw = blockIdx.z / nBlocksZ;
+
+    // reduced_block_sizes should have inclusive sum of block sizes
+    int nwrite   = (blockIdx.x == 0) ? reduced_block_sizes[0]
+                                     : reduced_block_sizes[blockIdx.x] -
+                                         reduced_block_sizes[blockIdx.x - 1];
+    int writeloc = (blockIdx.x == 0) ? 0 : reduced_block_sizes[blockIdx.x - 1];
+
+    const int bOffset = bidw * vals_in.strides[3] + bidz * vals_in.strides[2] +
+                        bidy * vals_in.strides[1];
+    Tk k = keys_in.ptr[tidx];
+    To v = vals_in.ptr[bOffset + tidx];
+
+    if (threadIdx.x < nwrite) {
+        keys_out.ptr[writeloc + threadIdx.x]           = k;
+        vals_out.ptr[bOffset + writeloc + threadIdx.x] = v;
+    }
+}
+
+// Compacts "incomplete" block-sized chunks of data along `dim` in global memory
+template<typename Tk, typename To>
+__global__ void compact_dim(int *reduced_block_sizes, Param<Tk> keys_out,
+                            Param<To> vals_out, CParam<Tk> keys_in,
+                            CParam<To> vals_in, const int dim,
+                            const int nBlocksZ) {
+    __shared__ int dim_ordering[4];
+    if (threadIdx.x == 0) {
+        int d           = 1;
+        dim_ordering[0] = dim;
+        for (int i = 0; i < 4; ++i) {
+            if (i != dim) dim_ordering[d++] = i;
+        }
+    }
+    __syncthreads();
+
+    const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
+    const int bidy = blockIdx.y;
+    const int bidz = blockIdx.z % nBlocksZ;
+    const int bidw = blockIdx.z / nBlocksZ;
+
+    // reduced_block_sizes should have inclusive sum of block sizes
+    int nwrite   = (blockIdx.x == 0) ? reduced_block_sizes[0]
+                                     : reduced_block_sizes[blockIdx.x] -
+                                         reduced_block_sizes[blockIdx.x - 1];
+    int writeloc = (blockIdx.x == 0) ? 0 : reduced_block_sizes[blockIdx.x - 1];
+
+    const int tid = bidw * vals_in.strides[dim_ordering[3]] +
+                    bidz * vals_in.strides[dim_ordering[2]] +
+                    bidy * vals_in.strides[dim_ordering[1]] +
+                    tidx * vals_in.strides[dim];
+    Tk k = keys_in.ptr[tidx];
+    To v = vals_in.ptr[tid];
+
+    if (threadIdx.x < nwrite) {
+        keys_out.ptr[writeloc + threadIdx.x] = k;
+        const int bOffset = bidw * vals_out.strides[dim_ordering[3]] +
+                            bidz * vals_out.strides[dim_ordering[2]] +
+                            bidy * vals_out.strides[dim_ordering[1]];
+        vals_out
+            .ptr[bOffset + (writeloc + threadIdx.x) * vals_in.strides[dim]] = v;
+    }
+}
+
+const static int maxResPerWarp = 32;  // assume dim 0, no NAN values
+
+// Reduces each block by key
+template<typename Ti, typename Tk, typename To, af_op_t op, uint DIMX>
+__global__ static void reduce_blocks_by_key(int *reduced_block_sizes,
+                                            Param<Tk> reduced_keys,
+                                            Param<To> reduced_vals,
+                                            CParam<Tk> keys, CParam<Ti> vals,
+                                            int n, bool change_nan, To nanval,
+                                            const int nBlocksZ) {
+    const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
+    const int bidy = blockIdx.y;
+    const int bidz = blockIdx.z % nBlocksZ;
+    const int bidw = blockIdx.z / nBlocksZ;
+
+    const int laneid = tidx % 32;
+
+    const int nWarps = DIMX / 32;
+
+    //
+    // Allocate and initialize shared memory
+
+    __shared__ int
+        warpReduceSizes[nWarps];  // number of reduced elements in each warp
+
+    __shared__ compute_t<Tk> warpReduceKeys[nWarps]
+                                           [maxResPerWarp];  // reduced key
+                                                             // segments for
+                                                             // each warp
+    __shared__ compute_t<To> warpReduceVals[nWarps]
+                                           [maxResPerWarp];  // reduced values
+                                                             // for each warp
+                                                             // corresponding to
+                                                             // each key segment
+
+    // space to hold left/right-most keys of each reduced warp to check if
+    // reduction should happen across boundaries
+    __shared__ compute_t<Tk> warpReduceLeftBoundaryKeys[nWarps];
+    __shared__ compute_t<Tk> warpReduceRightBoundaryKeys[nWarps];
+
+    // space to hold right-most values of each reduced warp to check if
+    // reduction should happen across boundaries
+    __shared__ compute_t<To> warpReduceRightBoundaryVals[nWarps];
+
+    // space to compact and finalize all reductions within block
+    __shared__ compute_t<Tk> warpReduceKeysSmemFinal[nWarps * maxResPerWarp];
+    __shared__ compute_t<To> warpReduceValsSmemFinal[nWarps * maxResPerWarp];
+
+    //
+    // will hold final number of reduced elements in block
+    __shared__ int reducedBlockSize;
+
+    if (threadIdx.x == 0) { reducedBlockSize = 0; }
+    if (threadIdx.x < nWarps * maxResPerWarp)
+        warpReduceValsSmemFinal[threadIdx.x] = scalar<compute_t<To>>(0);
+    __syncthreads();
+
+    common::Binary<compute_t<To>, op> reduce;
+    common::Transform<compute_t<Ti>, compute_t<To>, op> transform;
+
+    // load keys and values to threads
+    compute_t<Tk> k;
+    compute_t<To> v;
+    if (tidx < n) {
+        const int tid = bidw * vals.strides[3] + bidz * vals.strides[2] +
+                        bidy * vals.strides[1] +
+                        tidx;  // index for batched inputs
+        k = keys.ptr[tidx];
+        v = transform(compute_t<Ti>(vals.ptr[tid]));
+        if (change_nan) v = IS_NAN(v) ? compute_t<To>(nanval) : v;
+    } else {
+        v = common::Binary<compute_t<To>, op>::init();
+    }
+
+    compute_t<Tk> eq_check = (k != shfl_up_sync(k, 1));
+    // mark threads containing unique keys
+    char unique_flag = (eq_check || (laneid == 0)) && (tidx < n);
+
+    // scan unique flags to enumerate unique keys
+    char unique_id = unique_flag;
+#pragma unroll
+    for (int offset = 1; offset < 32; offset <<= 1) {
+        char y = shfl_up_sync(unique_id, offset);
+        if (laneid >= offset) unique_id += y;
+    }
+
+    //
+    // Reduce each warp by key
+    char all_eq = (k == shfl_down_sync(k, 1));
+    if (all_sync(all_eq)) {  // check special case of single key per warp
+        v = reduce(v, shfl_down_sync(v, 1));
+        v = reduce(v, shfl_down_sync(v, 2));
+        v = reduce(v, shfl_down_sync(v, 4));
+        v = reduce(v, shfl_down_sync(v, 8));
+        v = reduce(v, shfl_down_sync(v, 16));
+    } else {
+        compute_t<To> init = common::Binary<compute_t<To>, op>::init();
+        int eq_check, update_key;
+#pragma unroll
+        for (int delta = 1; delta < 32; delta <<= 1) {
+            eq_check =
+                (unique_id == shfl_down_sync(unique_id, delta));
+
+            // checks if this thread should perform a reduction
+            update_key =
+                eq_check && (laneid < (32 - delta)) && ((tidx + delta) < n);
+
+            // shfls data from neighboring threads
+            compute_t<To> uval = shfl_down_sync(v, delta);
+
+            // update if thread requires it
+            v = reduce(v, (update_key ? uval : init));
+        }
+    }
+
+    const int warpid = threadIdx.x / 32;
+
+    // last thread in warp holds the reduced warp size due to the scan above
+    if (laneid == 31) { warpReduceSizes[warpid] = unique_id; }
+
+    // write left boundary values for each warp
+    if (unique_flag && unique_id == 1) {
+        warpReduceLeftBoundaryKeys[warpid] = k;
+    }
+
+    // write right boundary values for each warp
+    if (unique_flag && unique_id == warpReduceSizes[warpid]) {
+        warpReduceRightBoundaryKeys[warpid] = k;
+        warpReduceRightBoundaryVals[warpid] = v;
+    }
+
+    __syncthreads();
+
+    // if rightmost thread, check next warp's kv,
+    // invalidate self and change warpReduceSizes since first thread of next
+    // warp will update same key
+    // TODO: what if extra empty warps???
+    if (unique_flag && unique_id == warpReduceSizes[warpid] &&
+        warpid < nWarps - 1) {
+        int tid_next_warp = (blockIdx.x * blockDim.x + (warpid + 1) * 32);
+        // check within data range
+        if (tid_next_warp < n && k == warpReduceLeftBoundaryKeys[warpid + 1]) {
+            // disable writing from warps that need carry but aren't terminal
+            if (warpReduceSizes[warpid] > 1 || warpid > 0) { unique_flag = 0; }
+        }
+    }
+    __syncthreads();
+
+    // if leftmost thread, reduce carryover from previous warp(s) if needed
+    if (unique_flag && unique_id == 1 && warpid > 0) {
+        int test_wid = warpid - 1;
+        while (test_wid >= 0 && k == warpReduceRightBoundaryKeys[test_wid]) {
+            v = reduce(v, warpReduceRightBoundaryVals[test_wid]);
+            --warpReduceSizes[test_wid];
+            if (warpReduceSizes[test_wid] > 1) break;
+
+            --test_wid;
+        }
+    }
+
+    if (unique_flag) {
+        warpReduceKeys[warpid][unique_id - 1] = k;
+        warpReduceVals[warpid][unique_id - 1] = v;
+    }
+
+    __syncthreads();
+
+    // at this point, we have nWarps lists in shared memory with each list's
+    // size located in the warpReduceSizes[] array
+    // perform warp-scan to determine each warp's write location
+    int warpSzScan = 0;
+    if (warpid == 0 && laneid < nWarps) {
+        warpSzScan     = warpReduceSizes[laneid];
+        int activemask = 0xFFFFFFFF >> (32 - nWarps);
+#pragma unroll
+        for (int offset = 1; offset < 32; offset <<= 1) {
+            int y = __shfl_up_sync(activemask, warpSzScan, offset);
+            if (laneid >= offset) warpSzScan += y;
+        }
+        warpReduceSizes[laneid] = warpSzScan;
+        // final thread has final reduced size of block
+        if (laneid == nWarps - 1) reducedBlockSize = warpSzScan;
+    }
+    __syncthreads();
+
+    // write reduced block size to global memory
+    if (threadIdx.x == 0) {
+        reduced_block_sizes[blockIdx.x] = reducedBlockSize;
+    }
+
+    // compact reduced keys and values before writing to global memory
+    if (warpid > 0) {
+        int wsz = warpReduceSizes[warpid] - warpReduceSizes[warpid - 1];
+        if (laneid < wsz) {
+            int warpOffset = warpReduceSizes[warpid - 1];
+            warpReduceKeysSmemFinal[warpOffset + laneid] =
+                warpReduceKeys[warpid][laneid];
+            warpReduceValsSmemFinal[warpOffset + laneid] =
+                warpReduceVals[warpid][laneid];
+        }
+    } else {
+        int wsz = warpReduceSizes[warpid];
+        if (laneid < wsz) {
+            warpReduceKeysSmemFinal[laneid] = warpReduceKeys[0][laneid];
+            warpReduceValsSmemFinal[laneid] = warpReduceVals[0][laneid];
+        }
+    }
+    __syncthreads();
+
+    const int bOffset = bidw * reduced_vals.strides[3] +
+                        bidz * reduced_vals.strides[2] +
+                        bidy * reduced_vals.strides[1];
+    // write reduced keys/values per-block
+    if (threadIdx.x < reducedBlockSize) {
+        reduced_keys.ptr[(blockIdx.x * blockDim.x) + threadIdx.x] =
+            warpReduceKeysSmemFinal[threadIdx.x];
+        reduced_vals.ptr[bOffset + (blockIdx.x * blockDim.x) + threadIdx.x] =
+            warpReduceValsSmemFinal[threadIdx.x];
+    }
+}
+
+// Reduces each block by key along dimension `dim`
+template<typename Ti, typename Tk, typename To, af_op_t op, uint DIMX>
+__global__ static void reduce_blocks_dim_by_key(
+    int *reduced_block_sizes, Param<Tk> reduced_keys, Param<To> reduced_vals,
+    CParam<Tk> keys, CParam<Ti> vals, int n, bool change_nan, To nanval,
+    int dim, const int nBlocksZ) {
+    const int tidx = blockIdx.x * blockDim.x + threadIdx.x;
+    const int bidy = blockIdx.y;
+    const int bidz = blockIdx.z % nBlocksZ;
+    const int bidw = blockIdx.z / nBlocksZ;
+
+    const int laneid = tidx % 32;
+    const int nWarps = DIMX / 32;
+
+    //
+    // Allocate and initialize shared memory
+
+    __shared__ int
+        warpReduceSizes[nWarps];  // number of reduced elements in each warp
+
+    __shared__ Tk warpReduceKeys[nWarps][maxResPerWarp];  // reduced key
+                                                          // segments for each
+                                                          // warp
+    __shared__ compute_t<To> warpReduceVals[nWarps]
+                                           [maxResPerWarp];  // reduced values
+                                                             // for each warp
+                                                             // corresponding to
+                                                             // each key segment
+
+    // space to hold left/right-most keys of each reduced warp to check if
+    // reduction should happen across boundaries
+    __shared__ Tk warpReduceLeftBoundaryKeys[nWarps];
+    __shared__ Tk warpReduceRightBoundaryKeys[nWarps];
+
+    // space to hold right-most values of each reduced warp to check if
+    // reduction should happen across boundaries
+    __shared__ compute_t<To> warpReduceRightBoundaryVals[nWarps];
+
+    // space to compact and finalize all reductions within block
+    __shared__ Tk warpReduceKeysSmemFinal[nWarps * maxResPerWarp];
+    __shared__ compute_t<To> warpReduceValsSmemFinal[nWarps * maxResPerWarp];
+
+    //
+    // will hold final number of reduced elements in block
+    __shared__ int reducedBlockSize;
+    __shared__ int dim_ordering[4];
+
+    compute_t<To> init = common::Binary<compute_t<To>, op>::init();
+
+    if (threadIdx.x == 0) {
+        reducedBlockSize = 0;
+        int d            = 1;
+        dim_ordering[0]  = dim;
+        for (int i = 0; i < 4; ++i) {
+            if (i != dim) dim_ordering[d++] = i;
+        }
+    }
+    if (threadIdx.x < nWarps * maxResPerWarp)
+        warpReduceValsSmemFinal[threadIdx.x] = init;
+    __syncthreads();
+
+    common::Binary<compute_t<To>, op> reduce;
+    common::Transform<compute_t<Ti>, compute_t<To>, op> transform;
+
+    // load keys and values to threads
+    Tk k;
+    compute_t<To> v;
+    if (tidx < n) {
+        const int tid = bidw * vals.strides[dim_ordering[3]] +
+                        bidz * vals.strides[dim_ordering[2]] +
+                        bidy * vals.strides[dim_ordering[1]] +
+                        tidx * vals.strides[dim];  // index for batched inputs
+
+        k = keys.ptr[tidx];
+        v = transform(compute_t<Ti>(vals.ptr[tid]));
+        if (change_nan) v = IS_NAN(v) ? compute_t<To>(nanval) : v;
+    } else {
+        v = init;
+    }
+
+    Tk eq_check = (k != shfl_up_sync(k, 1));
+    // mark threads containing unique keys
+    char unique_flag = (eq_check || (laneid == 0)) && (tidx < n);
+
+    // scan unique flags to enumerate unique keys
+    char unique_id = unique_flag;
+#pragma unroll
+    for (int offset = 1; offset < 32; offset <<= 1) {
+        char y = shfl_up_sync(unique_id, offset);
+        if (laneid >= offset) unique_id += y;
+    }
+
+    //
+    // Reduce each warp by key
+    char all_eq = (k == shfl_down_sync(k, 1));
+    if (all_sync(all_eq)) {  // check special case of single key per warp
+        v = reduce(v, shfl_down_sync(v, 1));
+        v = reduce(v, shfl_down_sync(v, 2));
+        v = reduce(v, shfl_down_sync(v, 4));
+        v = reduce(v, shfl_down_sync(v, 8));
+        v = reduce(v, shfl_down_sync(v, 16));
+    } else {
+        compute_t<To> init = common::Binary<compute_t<To>, op>::init();
+        int eq_check, update_key;
+#pragma unroll
+        for (int delta = 1; delta < 32; delta <<= 1) {
+            eq_check =
+                (unique_id == shfl_down_sync(unique_id, delta));
+
+            // checks if this thread should perform a reduction
+            update_key =
+                eq_check && (laneid < (32 - delta)) && ((tidx + delta) < n);
+
+            // shfls data from neighboring threads
+            compute_t<To> uval = shfl_down_sync(v, delta);
+
+            // update if thread requires it
+            v = reduce(v, (update_key ? uval : init));
+        }
+    }
+
+    const int warpid = threadIdx.x / 32;
+
+    // last thread in warp holds the reduced warp size due to the scan above
+    if (laneid == 31) { warpReduceSizes[warpid] = unique_id; }
+
+    // write left boundary values for each warp
+    if (unique_flag && unique_id == 1) {
+        warpReduceLeftBoundaryKeys[warpid] = k;
+    }
+
+    // write right boundary values for each warp
+    if (unique_flag && unique_id == warpReduceSizes[warpid]) {
+        warpReduceRightBoundaryKeys[warpid] = k;
+        warpReduceRightBoundaryVals[warpid] = v;
+    }
+
+    __syncthreads();
+
+    // if rightmost thread, check next warp's kv,
+    // invalidate self and change warpReduceSizes since first thread of next
+    // warp will update same key
+    // TODO: what if extra empty warps???
+    if (unique_flag && unique_id == warpReduceSizes[warpid] &&
+        warpid < nWarps - 1) {
+        int tid_next_warp = (blockIdx.x * blockDim.x + (warpid + 1) * 32);
+        // check within data range
+        if (tid_next_warp < n && k == warpReduceLeftBoundaryKeys[warpid + 1]) {
+            // disable writing from warps that need carry but aren't terminal
+            if (warpReduceSizes[warpid] > 1 || warpid > 0) { unique_flag = 0; }
+        }
+    }
+    __syncthreads();
+
+    // if leftmost thread, reduce carryover from previous warp(s) if needed
+    if (unique_flag && unique_id == 1 && warpid > 0) {
+        int test_wid = warpid - 1;
+        while (test_wid >= 0 && k == warpReduceRightBoundaryKeys[test_wid]) {
+            v = reduce(v, warpReduceRightBoundaryVals[test_wid]);
+            --warpReduceSizes[test_wid];
+            if (warpReduceSizes[test_wid] > 1) break;
+
+            --test_wid;
+        }
+    }
+
+    if (unique_flag) {
+        warpReduceKeys[warpid][unique_id - 1] = k;
+        warpReduceVals[warpid][unique_id - 1] = v;
+    }
+
+    __syncthreads();
+
+    // at this point, we have nWarps lists in shared memory with each list's
+    // size located in the warpReduceSizes[] array
+    // perform warp-scan to determine each warp's write location
+    int warpSzScan = 0;
+    if (warpid == 0 && laneid < nWarps) {
+        warpSzScan     = warpReduceSizes[laneid];
+        int activemask = 0xFFFFFFFF >> (32 - nWarps);
+#pragma unroll
+        for (int offset = 1; offset < 32; offset <<= 1) {
+            int y = __shfl_up_sync(activemask, warpSzScan, offset);
+            if (laneid >= offset) warpSzScan += y;
+        }
+        warpReduceSizes[laneid] = warpSzScan;
+        // final thread has final reduced size of block
+        if (laneid == nWarps - 1) reducedBlockSize = warpSzScan;
+    }
+    __syncthreads();
+
+    // write reduced block size to global memory
+    if (threadIdx.x == 0) {
+        reduced_block_sizes[blockIdx.x] = reducedBlockSize;
+    }
+
+    // compact reduced keys and values before writing to global memory
+    if (warpid > 0) {
+        int wsz = warpReduceSizes[warpid] - warpReduceSizes[warpid - 1];
+        if (laneid < wsz) {
+            int warpOffset = warpReduceSizes[warpid - 1];
+            warpReduceKeysSmemFinal[warpOffset + laneid] =
+                warpReduceKeys[warpid][laneid];
+            warpReduceValsSmemFinal[warpOffset + laneid] =
+                warpReduceVals[warpid][laneid];
+        }
+    } else {
+        int wsz = warpReduceSizes[warpid];
+        if (laneid < wsz) {
+            warpReduceKeysSmemFinal[laneid] = warpReduceKeys[0][laneid];
+            warpReduceValsSmemFinal[laneid] = warpReduceVals[0][laneid];
+        }
+    }
+    __syncthreads();
+
+    // write reduced keys/values per-block
+    if (threadIdx.x < reducedBlockSize) {
+        const int bOffset = bidw * reduced_vals.strides[dim_ordering[3]] +
+                            bidz * reduced_vals.strides[dim_ordering[2]] +
+                            bidy * reduced_vals.strides[dim_ordering[1]];
+        reduced_keys.ptr[(blockIdx.x * blockDim.x) + threadIdx.x] =
+            warpReduceKeysSmemFinal[threadIdx.x];
+        reduced_vals.ptr[bOffset + ((blockIdx.x * blockDim.x) + threadIdx.x) *
+                                       reduced_vals.strides[dim]] =
+            warpReduceValsSmemFinal[threadIdx.x];
+    }
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
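
Review note: the tail of the kernel above runs a shuffle-based inclusive scan over `warpReduceSizes` and then uses the scanned values as end offsets when packing each warp's list into the final shared-memory arrays. The following is a minimal host-side C++ sketch of that idea, not the kernel itself. The function names `inclusiveScanLikeWarp` and `compactWarpLists` are hypothetical, and the per-step snapshot stands in for all lanes executing `__shfl_up_sync` at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host model of the shuffle-based inclusive scan: at each step every "lane"
// adds the value held by the lane `offset` positions below it.
std::vector<int> inclusiveScanLikeWarp(std::vector<int> sizes) {
    assert(sizes.size() <= 32);  // one lane per warp, as in the kernel
    const int n = static_cast<int>(sizes.size());
    for (int offset = 1; offset < 32; offset <<= 1) {
        // The snapshot models all lanes shuffling simultaneously.
        const std::vector<int> prev = sizes;
        for (int lane = offset; lane < n; ++lane) {
            sizes[lane] += prev[lane - offset];
        }
    }
    return sizes;
}

// Packs per-warp lists back to back, using the scanned sizes as end offsets.
std::vector<int> compactWarpLists(const std::vector<std::vector<int>>& lists) {
    std::vector<int> sizes;
    for (const auto& l : lists) sizes.push_back(static_cast<int>(l.size()));
    const std::vector<int> ends = inclusiveScanLikeWarp(sizes);

    std::vector<int> out(ends.empty() ? 0 : ends.back());
    for (std::size_t w = 0; w < lists.size(); ++w) {
        // Warp w starts where warp w - 1 ended; warp 0 starts at zero.
        const int begin = (w == 0) ? 0 : ends[w - 1];
        for (std::size_t i = 0; i < lists[w].size(); ++i) {
            out[begin + i] = lists[w][i];
        }
    }
    return out;
}
```

In this sketch the last scanned entry plays the role of `reducedBlockSize`, and an empty list simply contributes a zero-width range.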
diff --git a/src/backend/cuda/kernel/regions.hpp b/src/backend/cuda/kernel/regions.hpp
index 3f7e21a621..d03aed4517 100644
--- a/src/backend/cuda/kernel/regions.hpp
+++ b/src/backend/cuda/kernel/regions.hpp
@@ -7,25 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <dispatch.hpp>
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
 #include <err_cuda.hpp>
 #include <math.hpp>
-#include <debug_cuda.hpp>
-#include <stdio.h>
 #include <memory.hpp>
+#include <thrust_utils.hpp>
 
 #include <thrust/adjacent_difference.h>
 #include <thrust/binary_search.h>
 #include <thrust/device_vector.h>
-#include <thrust/iterator/counting_iterator.h>
 #include <thrust/host_vector.h>
+#include <thrust/iterator/counting_iterator.h>
 #include <thrust/scan.h>
 #include <thrust/sort.h>
+#include <thrust/system/cuda/detail/par.h>
 #include <thrust/transform_scan.h>
 
-#if __CUDACC__
-
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
@@ -36,34 +34,31 @@ __device__ static int continue_flag = 1;
 
 // Wrapper function for texture fetch
 template<typename T>
-__device__ __inline__
-static T fetch(const int n,
-               cuda::Param<T> equiv_map,
-               cudaTextureObject_t tex)
-{
-// FIXME: Enable capability >= 3.0
-//#if (__CUDA_ARCH__ >= 300)
-#if 0
-    // Kepler bindless texture objects
+static inline __device__ T fetch(const int n,
+                                 arrayfire::cuda::Param<T> equiv_map,
+                                 cudaTextureObject_t tex) {
     return tex1Dfetch<T>(tex, n);
-#else
+}
+
+template<>
+__device__ inline double fetch<double>(const int n,
+                                       arrayfire::cuda::Param<double> equiv_map,
+                                       cudaTextureObject_t tex) {
     return equiv_map.ptr[n];
-#endif
 }
 
 // The initial label kernel distinguishes between valid (nonzero)
 // pixels and "background" (zero) pixels.
 template<typename T, int n_per_thread>
-__global__
-static void initial_label(cuda::Param<T> equiv_map, cuda::CParam<char> bin)
-{
+__global__ static void initial_label(arrayfire::cuda::Param<T> equiv_map,
+                                     arrayfire::cuda::CParam<char> bin) {
     const int base_x = (blockIdx.x * blockDim.x * n_per_thread) + threadIdx.x;
     const int base_y = (blockIdx.y * blockDim.y * n_per_thread) + threadIdx.y;
 
-    // If in bounds and a valid pixel, set the initial label.
-    #pragma unroll
+// If in bounds and a valid pixel, set the initial label.
+#pragma unroll
     for (int xb = 0; xb < n_per_thread; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < n_per_thread; ++yb) {
             const int x = base_x + (xb * blockDim.x);
             const int y = base_y + (yb * blockDim.y);
@@ -76,22 +71,24 @@ static void initial_label(cuda::Param<T> equiv_map, cuda::CParam<char> bin)
 }
 
 template<typename T, int n_per_thread>
-__global__
-static void final_relabel(cuda::Param<T> equiv_map, cuda::CParam<char> bin, const T* d_tmp)
-{
+__global__ static void final_relabel(arrayfire::cuda::Param<T> equiv_map,
+                                     arrayfire::cuda::CParam<char> bin,
+                                     const T* d_tmp) {
     const int base_x = (blockIdx.x * blockDim.x * n_per_thread) + threadIdx.x;
     const int base_y = (blockIdx.y * blockDim.y * n_per_thread) + threadIdx.y;
 
-    // If in bounds and a valid pixel, set the initial label.
-    #pragma unroll
+// If in bounds and a valid pixel, set the final label.
+#pragma unroll
     for (int xb = 0; xb < n_per_thread; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < n_per_thread; ++yb) {
             const int x = base_x + (xb * blockDim.x);
             const int y = base_y + (yb * blockDim.y);
             const int n = y * bin.dims[0] + x;
             if (x < bin.dims[0] && y < bin.dims[1]) {
-                equiv_map.ptr[n] = (bin.ptr[n] > (char)0) ? d_tmp[(int)equiv_map.ptr[n]] : (T)0;
+                equiv_map.ptr[n] = (bin.ptr[n] > (char)0)
+                                       ? d_tmp[(int)equiv_map.ptr[n]]
+                                       : (T)0;
             }
         }
     }
@@ -100,23 +97,19 @@ static void final_relabel(cuda::Param<T> equiv_map, cuda::CParam<char> bin, cons
 // When two labels are equivalent, choose the lower label, but
 // do not choose zero, which indicates invalid.
 template<typename T>
-__device__ __inline__
-static T relabel(const T a, const T b) {
-    return min((a + (cuda::limit_max<T>() * (a == 0))),(b + (cuda::limit_max<T>() * (b == 0))));
-}
-__device__ __inline__
-static double relabel(const double a, const double b) {
-    return fmin((a + (cuda::limit_max<double>() * (a == 0))),(b + (cuda::limit_max<double>() * (b == 0))));
-}
-__device__ __inline__
-static float relabel(const float a, const float b) {
-    return fminf((a + (cuda::limit_max<float>() * (a == 0))),(b + (cuda::limit_max<float>() * (b == 0))));
+__device__ __inline__ static T relabel(const T a, const T b) {
+    T aa = (a == 0) ? arrayfire::cuda::maxval<T>() : a;
+    T bb = (b == 0) ? arrayfire::cuda::maxval<T>() : b;
+    return min(aa, bb);
 }
 
-//Calculates the number of warps at compile time
+// Calculates the number of warps at compile time
 template<unsigned thread_count>
 struct warp_count {
-    enum { value = ((thread_count % 32) == 0 ? thread_count/32 : thread_count/32 + 1)};
+    enum {
+        value = ((thread_count % 32) == 0 ? thread_count / 32
+                                          : thread_count / 32 + 1)
+    };
 };
 
 // The following kernel updates the equivalency map.  This kernel
@@ -128,13 +121,9 @@ struct warp_count {
 // num_warps = 8; // (Could compute this from block dim)
 // Number of elements to handle per thread in each dimension
 // int n_per_thread = 2; // 2x2 per thread = 4 total elems per thread
-template <typename T, int block_dim, int n_per_thread, bool full_conn>
-__global__
-static void update_equiv(cuda::Param<T> equiv_map, const cudaTextureObject_t tex)
-{
-
-    typedef warp_count<block_dim*block_dim> num_warps;
-#if (__CUDA_ARCH__ >= 120) // This function uses warp ballot instructions
+template<typename T, int block_dim, int n_per_thread, bool full_conn>
+__global__ static void update_equiv(arrayfire::cuda::Param<T> equiv_map,
+                                    const cudaTextureObject_t tex) {
     // Basic coordinates
     const int base_x = (blockIdx.x * blockDim.x * n_per_thread) + threadIdx.x;
     const int base_y = (blockIdx.y * blockDim.y * n_per_thread) + threadIdx.y;
@@ -146,46 +135,32 @@ static void update_equiv(cuda::Param<T> equiv_map, const cudaTextureObject_t tex
 
     // Per element write flags and label, initially 0
     char write[n_per_thread * n_per_thread];
-    T    best_label[n_per_thread * n_per_thread];
+    T best_label[n_per_thread * n_per_thread];
 
-    #pragma unroll
+#pragma unroll
     for (int i = 0; i < n_per_thread * n_per_thread; ++i) {
         write[i]      = (char)0;
         best_label[i] = (T)0;
     }
 
     // Cached tile of the equivalency map
-    __shared__ T s_tile[n_per_thread*block_dim][(n_per_thread*block_dim)];
-
-    // Space to track ballot funcs to track convergence
-    __shared__ T s_changed[num_warps::value];
+    __shared__ T s_tile[n_per_thread * block_dim][(n_per_thread * block_dim)];
 
-    const int tn = (threadIdx.y * blockDim.x) + threadIdx.x;
-
-    const int warpIdx = tn / warpSize;
-    s_changed[warpIdx] = (T)0;
-    __syncthreads();
-
-#if (__CUDA_ARCH__ >= 130)
-    #pragma unroll
-#endif
+#pragma unroll
     for (int xb = 0; xb < n_per_thread; ++xb) {
-#if (__CUDA_ARCH__ >= 130)
-        #pragma unroll
-#endif
+#pragma unroll
         for (int yb = 0; yb < n_per_thread; ++yb) {
-
             // Indexing variables
-            const int x = base_x + (xb * blockDim.x);
-            const int y = base_y + (yb * blockDim.y);
-            const int tx = threadIdx.x + (xb * blockDim.x);
-            const int ty = threadIdx.y + (yb * blockDim.y);
+            const int x     = base_x + (xb * blockDim.x);
+            const int y     = base_y + (yb * blockDim.y);
+            const int tx    = threadIdx.x + (xb * blockDim.x);
+            const int ty    = threadIdx.y + (yb * blockDim.y);
             const int tid_i = xb * n_per_thread + yb;
-            const int n = y * width + x;
+            const int n     = y * width + x;
 
             // Get the label for this pixel if we're  in bounds
-            const T orig_label = (x < width && y < height) ?
-                fetch<T>(n, equiv_map, tex) : (T)0;
+            const T orig_label =
+                (x < width && y < height) ? fetch<T>(n, equiv_map, tex) : (T)0;
             s_tile[ty][tx] = orig_label;
 
             // Find the lowest label of the nearest valid pixel
@@ -193,45 +168,53 @@ static void update_equiv(cuda::Param<T> equiv_map, const cudaTextureObject_t tex
             best_label[tid_i] = orig_label;
 
             if (orig_label != (T)0) {
-                const int south_y = min(y, height-2) + 1;
+                const int south_y = min(y, height - 2) + 1;
                 const int north_y = max(y, 1) - 1;
-                const int east_x = min(x, width-2) + 1;
-                const int west_x = max(x, 1) - 1;
+                const int east_x  = min(x, width - 2) + 1;
+                const int west_x  = max(x, 1) - 1;
 
                 // Check bottom
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        fetch((south_y) * width + x, equiv_map, tex));
+                best_label[tid_i] =
+                    relabel(best_label[tid_i],
+                            fetch((south_y)*width + x, equiv_map, tex));
 
                 // Check right neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        fetch(y * width + east_x, equiv_map, tex));
+                best_label[tid_i] =
+                    relabel(best_label[tid_i],
+                            fetch(y * width + east_x, equiv_map, tex));
 
                 // Check left neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        fetch(y * width + west_x, equiv_map, tex));
+                best_label[tid_i] =
+                    relabel(best_label[tid_i],
+                            fetch(y * width + west_x, equiv_map, tex));
 
                 // Check top neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        fetch((north_y) * width + x, equiv_map, tex));
+                best_label[tid_i] =
+                    relabel(best_label[tid_i],
+                            fetch((north_y)*width + x, equiv_map, tex));
 
                 if (full_conn) {
                     // Check NW corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                            fetch((north_y) * width + west_x, equiv_map, tex));
+                    best_label[tid_i] = relabel(
+                        best_label[tid_i],
+                        fetch((north_y)*width + west_x, equiv_map, tex));
 
                     // Check NE corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                            fetch((north_y) * width + east_x, equiv_map, tex));
+                    best_label[tid_i] = relabel(
+                        best_label[tid_i],
+                        fetch((north_y)*width + east_x, equiv_map, tex));
 
                     // Check SW corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                        fetch((south_y) * width + west_x, equiv_map, tex));
+                    best_label[tid_i] = relabel(
+                        best_label[tid_i],
+                        fetch((south_y)*width + west_x, equiv_map, tex));
 
                     // Check SE corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                            fetch((south_y) * width + east_x, equiv_map, tex));
-                } // if connectivity == 8
-            } // if orig_label != 0
+                    best_label[tid_i] = relabel(
+                        best_label[tid_i],
+                        fetch((south_y)*width + east_x, equiv_map, tex));
+                }  // if connectivity == 8
+            }      // if orig_label != 0
 
             // Process the equivalency list.
             T last_label = orig_label;
@@ -239,127 +222,98 @@ static void update_equiv(cuda::Param<T> equiv_map, const cudaTextureObject_t tex
 
             while (best_label[tid_i] != (T)0 && new_label < last_label) {
                 last_label = new_label;
-                new_label = fetch(new_label - (T)1, equiv_map, tex);
+                new_label  = fetch(new_label - (T)1, equiv_map, tex);
             }
 
             if (orig_label != new_label) {
-                tid_changed = true;
+                tid_changed    = true;
                 s_tile[ty][tx] = new_label;
-                write[tid_i] = (char)1;
+                write[tid_i]   = (char)1;
             }
             best_label[tid_i] = new_label;
         }
     }
-    __syncthreads();
 
-    // Determine if any pixel changed
-    bool continue_iter = false;
-    s_changed[warpIdx] = __any((int)tid_changed);
-    __syncthreads();
-
-#if (__CUDA_ARCH__ >= 130)
-    #pragma unroll
-#endif
-    for (int i = 0; i < num_warps::value; i++)
-        continue_iter = continue_iter || (s_changed[i] != 0);
+    bool continue_iter = __syncthreads_or((int)tid_changed);
 
     // Iterate until no pixel in the tile changes
     while (continue_iter) {
-
         // Reset whether or not this thread's pixels have changed.
         tid_changed = false;
 
-#if (__CUDA_ARCH__ >= 130)
-        #pragma unroll
-#endif
+#pragma unroll
         for (int xb = 0; xb < n_per_thread; ++xb) {
-#if (__CUDA_ARCH__ >= 130)
-            #pragma unroll
-#endif
+#pragma unroll
             for (int yb = 0; yb < n_per_thread; ++yb) {
-
                 // Indexing
-                const int tx = threadIdx.x + (xb * blockDim.x);
-                const int ty = threadIdx.y + (yb * blockDim.y);
+                const int tx    = threadIdx.x + (xb * blockDim.x);
+                const int ty    = threadIdx.y + (yb * blockDim.y);
                 const int tid_i = xb * n_per_thread + yb;
 
                 T last_label = best_label[tid_i];
 
                 if (best_label[tid_i] != 0) {
-
-                    const int north_y   = max(ty, 1) -1;
-                    const int south_y   = min(ty, n_per_thread*block_dim - 2) +1;
-                    const int east_x    = min(tx, n_per_thread*block_dim - 2) +1;
-                    const int west_x    = max(tx, 1) -1;
+                    const int north_y = max(ty, 1) - 1;
+                    const int south_y =
+                        min(ty, n_per_thread * block_dim - 2) + 1;
+                    const int east_x =
+                        min(tx, n_per_thread * block_dim - 2) + 1;
+                    const int west_x = max(tx, 1) - 1;
 
                     // Check bottom
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[south_y][tx]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[south_y][tx]);
 
                     // Check right neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[ty][east_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[ty][east_x]);
 
                     // Check left neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[ty][west_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[ty][west_x]);
 
                     // Check top neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[north_y][tx]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[north_y][tx]);
 
                     if (full_conn) {
                         // Check NW corner
-                        best_label[tid_i] = relabel(best_label[tid_i],
-                                                    s_tile[north_y][west_x]);
+                        best_label[tid_i] =
+                            relabel(best_label[tid_i], s_tile[north_y][west_x]);
 
                         // Check NE corner
-                        best_label[tid_i] = relabel(best_label[tid_i],
-                                                    s_tile[north_y][east_x]);
+                        best_label[tid_i] =
+                            relabel(best_label[tid_i], s_tile[north_y][east_x]);
 
                         // Check SW corner
-                        best_label[tid_i] = relabel(best_label[tid_i],
-                                                    s_tile[south_y][west_x]);
+                        best_label[tid_i] =
+                            relabel(best_label[tid_i], s_tile[south_y][west_x]);
 
                         // Check SE corner
-                        best_label[tid_i] = relabel(best_label[tid_i],
-                                                    s_tile[south_y][east_x]);
-                    } // if connectivity == 8
+                        best_label[tid_i] =
+                            relabel(best_label[tid_i], s_tile[south_y][east_x]);
+                    }  // if connectivity == 8
 
                     // This thread's value changed during this iteration if the
                     // best label is not the same as the last label.
                     const bool changed = best_label[tid_i] != last_label;
-                    write[tid_i] = write[tid_i] || changed;
-                    tid_changed  =  tid_changed || changed;
+                    write[tid_i]       = write[tid_i] || changed;
+                    tid_changed        = tid_changed || changed;
                 }
             }
         }
         // Done looking at neighbors for this iteration
-        __syncthreads();
-
-        // Decide if we need to continue iterating
-        s_changed[warpIdx] = __any((int)tid_changed);
-        __syncthreads();
-        continue_iter = false;
-#if (__CUDA_ARCH__ >= 130)
-        #pragma unroll
-#endif
-        for (int i = 0; i < num_warps::value; i++)
-            continue_iter = continue_iter | (s_changed[i] != 0);
+        continue_iter = __syncthreads_or((int)tid_changed);
 
         // If we have to continue iterating, update the tile of the
         // equiv map in shared memory
         if (continue_iter) {
-#if (__CUDA_ARCH__ >= 130)
-            #pragma unroll
-#endif
+#pragma unroll
             for (int xb = 0; xb < n_per_thread; ++xb) {
-#if (__CUDA_ARCH__ >= 130)
-                #pragma unroll
-#endif
+#pragma unroll
                 for (int yb = 0; yb < n_per_thread; ++yb) {
-                    const int tx = threadIdx.x + (xb * blockDim.x);
-                    const int ty = threadIdx.y + (yb * blockDim.y);
+                    const int tx    = threadIdx.x + (xb * blockDim.x);
+                    const int ty    = threadIdx.y + (yb * blockDim.y);
                     const int tid_i = xb * n_per_thread + yb;
                     // Update tile in shared memory
                     s_tile[ty][tx] = best_label[tid_i];
@@ -367,50 +321,44 @@ static void update_equiv(cuda::Param<T> equiv_map, const cudaTextureObject_t tex
             }
             __syncthreads();
         }
-    } // while (continue_iter)
+    }  // while (continue_iter)
 
-    // Write out equiv_map
-#if (__CUDA_ARCH__ >= 130)
-    #pragma unroll
-#endif
+// Write out equiv_map
+#pragma unroll
     for (int xb = 0; xb < n_per_thread; ++xb) {
-#if (__CUDA_ARCH__ >= 130)
-        #pragma unroll
-#endif
+#pragma unroll
         for (int yb = 0; yb < n_per_thread; ++yb) {
-            const int x = base_x + (xb * blockDim.x);
-            const int y = base_y + (yb * blockDim.y);
-            const int n = y * width + x;
+            const int x     = base_x + (xb * blockDim.x);
+            const int y     = base_y + (yb * blockDim.y);
+            const int n     = y * width + x;
             const int tid_i = xb * n_per_thread + yb;
             if (x < width && y < height && write[tid_i]) {
-                equiv_map.ptr[n]  = best_label[tid_i];
-                continue_flag = 1;
+                equiv_map.ptr[n] = best_label[tid_i];
+                continue_flag    = 1;
             }
         }
     }
-#endif // __CUDA_ARCH__ >= 120
 }
 
 template<typename T>
-struct clamp_to_one : public thrust::unary_function<T,T>
-{
-    __host__ __device__ T operator()(const T& in) const
-    {
+struct clamp_to_one : public thrust::unary_function<T, T> {
+    __host__ __device__ T operator()(const T& in) const {
         return (in >= (T)1) ? (T)1 : in;
     }
 };
 
 template<typename T, bool full_conn, int n_per_thread>
-void regions(cuda::Param<T> out, cuda::CParam<char> in, cudaTextureObject_t tex)
-{
-    const dim3 threads(THREADS_X, THREADS_Y);
+void regions(arrayfire::cuda::Param<T> out, arrayfire::cuda::CParam<char> in,
+             cudaTextureObject_t tex) {
+    using arrayfire::cuda::getActiveStream;
+    dim3 threads(THREADS_X, THREADS_Y);
 
-    const int blk_x = divup(in.dims[0], threads.x*2);
-    const int blk_y = divup(in.dims[1], threads.y*2);
+    const int blk_x = divup(in.dims[0], threads.x * 2);
+    const int blk_y = divup(in.dims[1], threads.y * 2);
 
-    const dim3 blocks(blk_x, blk_y);
+    dim3 blocks(blk_x, blk_y);
 
-    (initial_label<T,n_per_thread>)<<<blocks, threads>>>(out, in);
+    CUDA_LAUNCH((initial_label<T, n_per_thread>), blocks, threads, out, in);
 
     POST_LAUNCH_CHECK();
 
@@ -418,16 +366,19 @@ void regions(cuda::Param<T> out, cuda::CParam<char> in, cudaTextureObject_t tex)
 
     while (h_continue) {
         h_continue = 0;
-        CUDA_CHECK(cudaMemcpyToSymbol(continue_flag, &h_continue, sizeof(int),
-                                      0, cudaMemcpyHostToDevice));
+        CUDA_CHECK(
+            cudaMemcpyToSymbolAsync(continue_flag, &h_continue, sizeof(int), 0,
+                                    cudaMemcpyHostToDevice, getActiveStream()));
 
-        (update_equiv<T, 16, n_per_thread, full_conn>)<<<blocks, threads>>>
-            (out, tex);
+        CUDA_LAUNCH((update_equiv<T, 16, n_per_thread, full_conn>), blocks,
+                    threads, out, tex);
 
         POST_LAUNCH_CHECK();
 
-        CUDA_CHECK(cudaMemcpyFromSymbol(&h_continue, continue_flag, sizeof(int),
-                                        0, cudaMemcpyDeviceToHost));
+        CUDA_CHECK(cudaMemcpyFromSymbolAsync(
+            &h_continue, continue_flag, sizeof(int), 0, cudaMemcpyDeviceToHost,
+            getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
     }
 
     // Now, perform the final relabeling.  This converts the equivalency
@@ -435,50 +386,47 @@ void regions(cuda::Param<T> out, cuda::CParam<char> in, cudaTextureObject_t tex)
     // component to being sequentially numbered components starting at
     // 1.
     int size = in.dims[0] * in.dims[1];
-    T* tmp = cuda::memAlloc<T>(size);
-    CUDA_CHECK(cudaMemcpy(tmp, out.ptr, size * sizeof(T),
-                          cudaMemcpyDeviceToDevice));
+    auto tmp = arrayfire::cuda::memAlloc<T>(size);
+    CUDA_CHECK(cudaMemcpyAsync(tmp.get(), out.ptr, size * sizeof(T),
+                               cudaMemcpyDeviceToDevice, getActiveStream()));
 
     // Wrap raw device ptr
-    thrust::device_ptr<T> wrapped_tmp = thrust::device_pointer_cast(tmp);
+    thrust::device_ptr<T> wrapped_tmp = thrust::device_pointer_cast(tmp.get());
 
     // Sort the copy
-    thrust::sort(wrapped_tmp, wrapped_tmp + size);
+    THRUST_SELECT(thrust::sort, wrapped_tmp, wrapped_tmp + size);
+
+    // The max element plus one is the number of label bins
+    // (including the background bin) to compute.
+    const int num_bins = wrapped_tmp[size - 1] + 1;
 
-    // Take the max element, this is the number of label assignments to
-    // compute.
-    int num_bins = wrapped_tmp[size - 1] + 1;
+    // If the number of label bins is at most two,
+    // then either the entire input image is one big
+    // component (1's) or it has only one component other than
+    // the background (0's). Either way, no further
+    // post-processing of labels is required.
+    if (num_bins <= 2) return;
 
-    thrust::device_vector<T> labels(num_bins);
+    arrayfire::cuda::ThrustVector<T> labels(num_bins);
 
     // Find the end of each section of values
     thrust::counting_iterator<T> search_begin(0);
-    thrust::upper_bound(wrapped_tmp,  wrapped_tmp  + size,
-                        search_begin, search_begin + num_bins,
-                        labels.begin());
-    thrust::adjacent_difference(labels.begin(), labels.end(), labels.begin());
+    THRUST_SELECT(thrust::upper_bound, wrapped_tmp, wrapped_tmp + size,
+                  search_begin, search_begin + num_bins, labels.begin());
+
+    THRUST_SELECT(thrust::adjacent_difference, labels.begin(), labels.end(),
+                  labels.begin());
 
     // Operators for the scan
     clamp_to_one<T> clamp;
     thrust::plus<T> add;
 
-    // Perform the scan -- this can computes the correct labels for each
-    // component
-    thrust::transform_exclusive_scan(labels.begin(),
-                                     labels.end(),
-                                     labels.begin(),
-                                     clamp,
-                                     0,
-                                     add);
-
+    // Perform scan -- this computes the correct labels for each component
+    THRUST_SELECT(thrust::transform_exclusive_scan, labels.begin(),
+                  labels.end(), labels.begin(), clamp, 0, add);
     // Apply the correct labels to the equivalency map
-    (final_relabel<T,n_per_thread>)<<<blocks,threads>>>(out,
-                                                        in,
-                                                        thrust::raw_pointer_cast(&labels[0]));
+    CUDA_LAUNCH((final_relabel<T, n_per_thread>), blocks, threads, out, in,
+                thrust::raw_pointer_cast(&labels[0]));
 
     POST_LAUNCH_CHECK();
-
-    cuda::memFree(tmp);
 }
-
-#endif // __CUDACC__
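
Review note: the final relabeling in `regions()` turns sparse equivalence labels into consecutive component ids through sort, `upper_bound`, `adjacent_difference`, and a clamped exclusive scan. The following is a minimal host-side C++ sketch of the same sequence using standard algorithms instead of Thrust. The function name `compactLabels` is hypothetical, and the plain loop at the end stands in for `thrust::transform_exclusive_scan` followed by the `final_relabel` kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Maps sparse positive labels to 1..k in increasing order; 0 stays background.
std::vector<int> compactLabels(const std::vector<int>& equiv) {
    assert(!equiv.empty());
    std::vector<int> sorted = equiv;
    std::sort(sorted.begin(), sorted.end());

    // Largest label plus one bin for the background value 0.
    const int numBins = sorted.back() + 1;
    if (numBins <= 2) return equiv;  // at most one component: nothing to do

    // ends[v] = number of elements <= v (end of each run in the sorted copy).
    std::vector<int> ends(numBins);
    for (int v = 0; v < numBins; ++v) {
        ends[v] = static_cast<int>(
            std::upper_bound(sorted.begin(), sorted.end(), v) - sorted.begin());
    }

    // counts[v] = number of elements equal to v.
    std::vector<int> counts(numBins);
    std::adjacent_difference(ends.begin(), ends.end(), counts.begin());

    // Exclusive scan of the clamped counts: newLabel[v] is the number of
    // non-empty bins below v, so the background bin pushes components to 1..k.
    std::vector<int> newLabel(numBins);
    int running = 0;
    for (int v = 0; v < numBins; ++v) {
        newLabel[v] = running;
        running += (counts[v] >= 1) ? 1 : counts[v];
    }

    std::vector<int> out(equiv.size());
    for (std::size_t i = 0; i < equiv.size(); ++i) {
        out[i] = (equiv[i] > 0) ? newLabel[equiv[i]] : 0;
    }
    return out;
}
```

With labels 3 and 7 in use, bin 0 (background) is counted first, so 3 maps to 1 and 7 maps to 2. As in the kernel code, the numbering starts at 1 only because the background bin is non-empty.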
diff --git a/src/backend/cuda/kernel/reorder.cuh b/src/backend/cuda/kernel/reorder.cuh
new file mode 100644
index 0000000000..4f1db7bf3a
--- /dev/null
+++ b/src/backend/cuda/kernel/reorder.cuh
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void reorder(Param<T> out, CParam<T> in, const int d0, const int d1,
+                        const int d2, const int d3, const int blocksPerMatX,
+                        const int blocksPerMatY) {
+    const int oz = blockIdx.x / blocksPerMatX;
+    const int ow = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - ow * blocksPerMatY;
+
+    const int xx = threadIdx.x + blockIdx_x * blockDim.x;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (xx >= out.dims[0] || yy >= out.dims[1] || oz >= out.dims[2] ||
+        ow >= out.dims[3])
+        return;
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    const int rdims[] = {d0, d1, d2, d3};
+    const int o_off   = ow * out.strides[3] + oz * out.strides[2];
+    int ids[4]        = {0};
+    ids[rdims[3]]     = ow;
+    ids[rdims[2]]     = oz;
+
+    for (int oy = yy; oy < out.dims[1]; oy += incy) {
+        ids[rdims[1]] = oy;
+        for (int ox = xx; ox < out.dims[0]; ox += incx) {
+            ids[rdims[0]] = ox;
+
+            const int oIdx = o_off + oy * out.strides[1] + ox;
+
+            const int iIdx = ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+                             ids[1] * in.strides[1] + ids[0];
+
+            out.ptr[oIdx] = in.ptr[iIdx];
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
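
Review note: the `reorder` kernel above scatters each output coordinate into `ids[]` through `rdims` and then reads the input at those ids. The following is a minimal host-side C++ sketch of that index mapping. It assumes contiguous column-major strides for both arrays and that output dimension `k` has the extent of input dimension `rdims[k]`. The names `Dims` and `reorderHost` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Dims = std::array<int, 4>;

// Host model of the reorder index mapping for contiguous 4D arrays.
std::vector<int> reorderHost(const std::vector<int>& in, const Dims& inDims,
                             const Dims& rdims) {
    Dims outDims{}, inStrides{}, outStrides{};
    for (int k = 0; k < 4; ++k) outDims[k] = inDims[rdims[k]];
    inStrides[0] = outStrides[0] = 1;
    for (int k = 1; k < 4; ++k) {
        inStrides[k]  = inStrides[k - 1] * inDims[k - 1];
        outStrides[k] = outStrides[k - 1] * outDims[k - 1];
    }
    assert(in.size() == static_cast<std::size_t>(inStrides[3] * inDims[3]));

    std::vector<int> out(in.size());
    Dims ids{};
    for (int ow = 0; ow < outDims[3]; ++ow) {
        ids[rdims[3]] = ow;
        for (int oz = 0; oz < outDims[2]; ++oz) {
            ids[rdims[2]] = oz;
            for (int oy = 0; oy < outDims[1]; ++oy) {
                ids[rdims[1]] = oy;
                for (int ox = 0; ox < outDims[0]; ++ox) {
                    ids[rdims[0]] = ox;
                    const int oIdx = ow * outStrides[3] + oz * outStrides[2] +
                                     oy * outStrides[1] + ox;
                    const int iIdx = ids[3] * inStrides[3] +
                                     ids[2] * inStrides[2] +
                                     ids[1] * inStrides[1] + ids[0];
                    out[oIdx] = in[iIdx];
                }
            }
        }
    }
    return out;
}
```

Passing `rdims = {1, 0, 2, 3}` swaps the first two dimensions, which for a 2x3 column-major matrix is a transpose. The kernel covers the same index space with a grid-stride loop over `ox` and `oy` instead of four nested loops.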
diff --git a/src/backend/cuda/kernel/reorder.hpp b/src/backend/cuda/kernel/reorder.hpp
index eb71e9d4bb..e54ebcf417 100644
--- a/src/backend/cuda/kernel/reorder.hpp
+++ b/src/backend/cuda/kernel/reorder.hpp
@@ -7,84 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/reorder_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 512;
-        static const unsigned TILEY = 32;
-
-        template<typename T>
-        __global__
-        void reorder_kernel(Param<T> out, CParam<T> in, const int d0, const int d1,
-                            const int d2, const int d3,
-                            const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            if(xx >= out.dims[0] ||
-               yy >= out.dims[1] ||
-               oz >= out.dims[2] ||
-               ow >= out.dims[3])
-                return;
+template<typename T>
+void reorder(Param<T> out, CParam<T> in, const dim_t *rdims) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 512;
+    constexpr unsigned TILEY = 32;
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+    auto reorder =
+        common::getKernel("arrayfire::cuda::reorder", {{reorder_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
 
-            const int rdims[] = {d0, d1, d2, d3};
-            const int o_off   = ow * out.strides[3] + oz * out.strides[2];
-                  int ids[4]  = {0};
-            ids[rdims[3]] = ow;
-            ids[rdims[2]] = oz;
+    dim3 threads(TX, TY, 1);
 
-            for(int oy = yy; oy < out.dims[1]; oy += incy) {
-                ids[rdims[1]] = oy;
-                for(int ox = xx; ox < out.dims[0]; ox += incx) {
-                    ids[rdims[0]] = ox;
+    int blocksPerMatX = divup(out.dims[0], TILEX);
+    int blocksPerMatY = divup(out.dims[1], TILEY);
+    dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3], 1);
 
-                    const int oIdx = o_off + oy * out.strides[1] + ox;
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-                    const int iIdx = ids[3] * in.strides[3] + ids[2] * in.strides[2] +
-                                          ids[1] * in.strides[1] + ids[0];
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-                    out.ptr[oIdx] = in.ptr[iIdx];
-                }
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void reorder(Param<T> out, CParam<T> in, const dim_t *rdims)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(out.dims[0], TILEX);
-            int blocksPerMatY = divup(out.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
-
-            reorder_kernel<T><<<blocks, threads>>>(out, in, rdims[0], rdims[1], rdims[2], rdims[3],
-                                                   blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    reorder(qArgs, out, in, rdims[0], rdims[1], rdims[2], rdims[3],
+            blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/resize.cuh b/src/backend/cuda/kernel/resize.cuh
new file mode 100644
index 0000000000..8186804dae
--- /dev/null
+++ b/src/backend/cuda/kernel/resize.cuh
@@ -0,0 +1,120 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <interp.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// nearest-neighbor resampling
+template<typename T>
+__host__ __device__ void resize_n(Param<T> out, CParam<T> in, const int o_off,
+                                  const int i_off, const int blockIdx_x,
+                                  const int blockIdx_y, const float xf,
+                                  const float yf) {
+    const int ox = threadIdx.x + blockIdx_x * blockDim.x;
+    const int oy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    int ix = round(ox * xf);
+    int iy = round(oy * yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    out.ptr[o_off + ox + oy * out.strides[1]] =
+        in.ptr[i_off + ix + iy * in.strides[1]];
+}
+
+// bilinear resampling
+template<typename T>
+__host__ __device__ void resize_b(Param<T> out, CParam<T> in, const int o_off,
+                                  const int i_off, const int blockIdx_x,
+                                  const int blockIdx_y, const float xf_,
+                                  const float yf_) {
+    const int ox = threadIdx.x + blockIdx_x * blockDim.x;
+    const int oy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    float xf = ox * xf_;
+    float yf = oy * yf_;
+
+    int ix = floorf(xf);
+    int iy = floorf(yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    float b = xf - ix;
+    float a = yf - iy;
+
+    const int ix2 = ix + 1 < in.dims[0] ? ix + 1 : ix;
+    const int iy2 = iy + 1 < in.dims[1] ? iy + 1 : iy;
+
+    typedef typename itype_t<T>::wtype WT;
+    typedef typename itype_t<T>::vtype VT;
+
+    const T *iptr = in.ptr + i_off;
+
+    const VT p1 = iptr[ix + in.strides[1] * iy];
+    const VT p2 = iptr[ix + in.strides[1] * iy2];
+    const VT p3 = iptr[ix2 + in.strides[1] * iy];
+    const VT p4 = iptr[ix2 + in.strides[1] * iy2];
+
+    VT val = scalar<WT>((1.0f - a) * (1.0f - b)) * p1 +
+             scalar<WT>((a) * (1.0f - b)) * p2 +
+             scalar<WT>((1.0f - a) * (b)) * p3 + scalar<WT>((a) * (b)) * p4;
+
+    out.ptr[o_off + ox + oy * out.strides[1]] = val;
+}
+
+// lower resampling
+template<typename T>
+__host__ __device__ void resize_l(Param<T> out, CParam<T> in, const int o_off,
+                                  const int i_off, const int blockIdx_x,
+                                  const int blockIdx_y, const float xf,
+                                  const float yf) {
+    const int ox = threadIdx.x + blockIdx_x * blockDim.x;
+    const int oy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    int ix = (ox * xf);
+    int iy = (oy * yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    out.ptr[o_off + ox + oy * out.strides[1]] =
+        in.ptr[i_off + ix + iy * in.strides[1]];
+}
+
+template<typename T, af::interpType method>
+__global__ void resize(Param<T> out, CParam<T> in, const int b0, const int b1,
+                       const float xf, const float yf) {
+    const int bIdx = blockIdx.x / b0;
+    const int bIdy = blockIdx.y / b1;
+    // channel adjustment
+    const int i_off      = bIdx * in.strides[2] + bIdy * in.strides[3];
+    const int o_off      = bIdx * out.strides[2] + bIdy * out.strides[3];
+    const int blockIdx_x = blockIdx.x - bIdx * b0;
+    const int blockIdx_y = blockIdx.y - bIdy * b1;
+
+    // core
+    if (method == AF_INTERP_NEAREST) {
+        resize_n(out, in, o_off, i_off, blockIdx_x, blockIdx_y, xf, yf);
+    } else if (method == AF_INTERP_BILINEAR) {
+        resize_b(out, in, o_off, i_off, blockIdx_x, blockIdx_y, xf, yf);
+    } else if (method == AF_INTERP_LOWER) {
+        resize_l(out, in, o_off, i_off, blockIdx_x, blockIdx_y, xf, yf);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/resize.hpp b/src/backend/cuda/kernel/resize.hpp
index 524ee6365e..6129fe1e64 100644
--- a/src/backend/cuda/kernel/resize.hpp
+++ b/src/backend/cuda/kernel/resize.hpp
@@ -7,160 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <dispatch.hpp>
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
+#include <nvrtc_kernel_headers/resize_cuh.hpp>
+#include <af/defines.h>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 16;
-        static const unsigned TY = 16;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-        template<typename T>
-        struct itype_t
-        {
-            typedef float wtype;
-            typedef float vtype;
-        };
+// Kernel Launch Config Values
+static const unsigned TX = 16;
+static const unsigned TY = 16;
 
-        template<>
-        struct itype_t<double>
-        {
-            typedef double wtype;
-            typedef double vtype;
-        };
+template<typename T>
+void resize(Param<T> out, CParam<T> in, af_interp_type method) {
+    auto resize = common::getKernel(
+        "arrayfire::cuda::resize", {{resize_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(method)));
 
-        template<>
-        struct itype_t<cfloat>
-        {
-            typedef float  wtype;
-            typedef cfloat vtype;
-        };
+    dim3 threads(TX, TY, 1);
+    dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
+    int blocksPerMatX = blocks.x;
+    int blocksPerMatY = blocks.y;
 
-        template<>
-        struct itype_t<cdouble>
-        {
-            typedef double  wtype;
-            typedef cdouble vtype;
-        };
+    if (in.dims[2] > 1) { blocks.x *= in.dims[2]; }
+    if (in.dims[3] > 1) { blocks.y *= in.dims[3]; }
+    float xf = (float)in.dims[0] / (float)out.dims[0];
+    float yf = (float)in.dims[1] / (float)out.dims[1];
 
-        ///////////////////////////////////////////////////////////////////////////
-        // nearest-neighbor resampling
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        __host__ __device__
-        void resize_n(Param<T> out, CParam<T> in,
-                      const int o_off, const int i_off,
-                      const int blockIdx_x, const int blockIdx_y,
-                      const float xf, const float yf)
-        {
-            const int ox = threadIdx.x + blockIdx_x * blockDim.x;
-            const int oy = threadIdx.y + blockIdx_y * blockDim.y;
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-            int ix = round(ox * xf);
-            int iy = round(oy * yf);
+    resize(qArgs, out, in, blocksPerMatX, blocksPerMatY, xf, yf);
 
-            if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
-            if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
-            if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
-
-            out.ptr[o_off + ox + oy * out.strides[1]] = in.ptr[i_off + ix + iy * in.strides[1]];
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // bilinear resampling
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        __host__ __device__
-        void resize_b(Param<T> out, CParam<T> in,
-                      const int o_off, const int i_off,
-                      const int blockIdx_x, const int blockIdx_y,
-                      const float xf_, const float yf_)
-        {
-            const int ox = threadIdx.x + blockIdx_x * blockDim.x;
-            const int oy = threadIdx.y + blockIdx_y * blockDim.y;
-
-            float xf = ox * xf_;
-            float yf = oy * yf_;
-
-            int ix = floorf(xf);
-            int iy = floorf(yf);
-
-            if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
-            if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
-            if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
-
-            float b = xf - ix;
-            float a = yf - iy;
-
-            const int ix2 = ix + 1 <  in.dims[0] ? ix + 1 : ix;
-            const int iy2 = iy + 1 <  in.dims[1] ? iy + 1 : iy;
-
-            typedef typename itype_t<T>::wtype WT;
-            typedef typename itype_t<T>::vtype VT;
-
-            const T *iptr = in.ptr + i_off;
-
-            const VT p1 = iptr[ix  + in.strides[1] * iy ];
-            const VT p2 = iptr[ix  + in.strides[1] * iy2];
-            const VT p3 = iptr[ix2 + in.strides[1] * iy ] ;
-            const VT p4 = iptr[ix2 + in.strides[1] * iy2];
-
-            VT val = scalar<WT>((1.0f-a) * (1.0f-b)) * p1 +
-                     scalar<WT>((a)      * (1.0f-b)) * p2 +
-                     scalar<WT>((1.0f-a) * (b)     ) * p3 +
-                     scalar<WT>((a)      * (b)     ) * p4;
-
-            out.ptr[o_off + ox + oy * out.strides[1]] = val;
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Resize Kernel
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, af_interp_type method>
-        __global__
-        void resize_kernel(Param<T> out, CParam<T> in,
-                           const int b0, const int b1, const float xf, const float yf)
-        {
-            const int bIdx = blockIdx.x / b0;
-            const int bIdy = blockIdx.y / b1;
-            // channel adjustment
-            const int i_off = bIdx * in.strides[2]  + bIdy * in.strides[3];
-            const int o_off = bIdx * out.strides[2] + bIdy * out.strides[3];
-            const int blockIdx_x =  blockIdx.x - bIdx * b0;
-            const int blockIdx_y =  blockIdx.y - bIdy * b1;
-
-            // core
-            if(method == AF_INTERP_NEAREST) {
-                resize_n(out, in, o_off, i_off, blockIdx_x, blockIdx_y, xf, yf);
-            } else if(method == AF_INTERP_BILINEAR) {
-                resize_b(out, in, o_off, i_off, blockIdx_x, blockIdx_y, xf, yf);
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template <typename T, af_interp_type method>
-        void resize(Param<T> out, CParam<T> in)
-        {
-            dim3 threads(TX, TY, 1);
-            dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
-            int blocksPerMatX = blocks.x;
-            int blocksPerMatY = blocks.y;
-
-            if (in.dims[2] > 1) { blocks.x *= in.dims[2]; }
-            if (in.dims[3] > 1) { blocks.y *= in.dims[3]; }
-            float xf = (float)in.dims[0] / (float)out.dims[0];
-            float yf = (float)in.dims[1] / (float)out.dims[1];
-
-            resize_kernel<T, method><<<blocks, threads>>>(out, in, blocksPerMatX, blocksPerMatY, xf, yf);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/rotate.cuh b/src/backend/cuda/kernel/rotate.cuh
new file mode 100644
index 0000000000..f6fa755ac2
--- /dev/null
+++ b/src/backend/cuda/kernel/rotate.cuh
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <interp.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+typedef struct {
+    float tmat[6];
+} tmat_t;
+
+template<typename T, int order>
+__global__ void rotate(Param<T> out, CParam<T> in, const tmat_t t,
+                       const int nimages, const int nbatches,
+                       const int blocksXPerImage, const int blocksYPerImage,
+                       af::interpType method) {
+    // Compute which image set
+    const int setId      = blockIdx.x / blocksXPerImage;
+    const int blockIdx_x = blockIdx.x - setId * blocksXPerImage;
+
+    const int batch      = blockIdx.y / blocksYPerImage;
+    const int blockIdx_y = blockIdx.y - batch * blocksYPerImage;
+
+    // Get thread indices
+    const int xido = blockIdx_x * blockDim.x + threadIdx.x;
+    const int yido = blockIdx_y * blockDim.y + threadIdx.y;
+
+    const int limages = min(out.dims[2] - setId * nimages, nimages);
+
+    if (xido >= out.dims[0] || yido >= out.dims[1]) return;
+
+    // Compute input index
+    typedef typename itype_t<T>::wtype WT;
+    WT xidi = xido * t.tmat[0] + yido * t.tmat[1] + t.tmat[2];
+    WT yidi = xido * t.tmat[3] + yido * t.tmat[4] + t.tmat[5];
+
+    // Global offset
+    //          Offset for transform channel + Offset for image channel.
+    int outoff     = setId * nimages * out.strides[2] + batch * out.strides[3];
+    int inoff      = setId * nimages * in.strides[2] + batch * in.strides[3];
+    const int loco = outoff + (yido * out.strides[1] + xido);
+
+    if (order > 1) {
+        // Special conditions to deal with boundaries for bilinear and bicubic
+        // FIXME: Ideally this condition should be removed or be present for all
+        // methods, but the tests expect different behavior for bilinear and
+        // nearest.
+        if (xidi < -0.0001 || yidi < -0.0001 || in.dims[0] < xidi ||
+            in.dims[1] < yidi) {
+            for (int i = 0; i < nimages; i++) {
+                out.ptr[loco + i * out.strides[2]] = scalar<T>(0.0f);
+            }
+            return;
+        }
+    }
+
+    Interp2<T, WT, 0, 1, order> interp;
+    // FIXME: Nearest and lower do not clamp, but the other methods do.
+    // Make this consistent.
+    bool clamp = order != 1;
+    interp(out, loco, in, inoff, xidi, yidi, method, limages, clamp);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/rotate.hpp b/src/backend/cuda/kernel/rotate.hpp
index d84a454d58..f1aa40585a 100644
--- a/src/backend/cuda/kernel/rotate.hpp
+++ b/src/backend/cuda/kernel/rotate.hpp
@@ -7,116 +7,83 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include "transform_interp.hpp"
-
-namespace cuda
-{
-    namespace kernel
+#include <nvrtc_kernel_headers/rotate_cuh.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+// Kernel Launch Config Values
+constexpr unsigned TX = 16;
+constexpr unsigned TY = 16;
+// Used for batching images
+constexpr int TI = 4;
+
+typedef struct {
+    float tmat[6];
+} tmat_t;
+
+template<typename T>
+void rotate(Param<T> out, CParam<T> in, const float theta,
+            const af::interpType method, const int order) {
+    auto rotate = common::getKernel(
+        "arrayfire::cuda::rotate", {{rotate_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(order)));
+
+    const float c = cos(-theta), s = sin(-theta);
+    float tx, ty;
     {
-        // Kernel Launch Config Values
-        static const unsigned TX = 16;
-        static const unsigned TY = 16;
-        // Used for batching images
-        static const unsigned TI = 4;
-
-        typedef struct {
-            float tmat[6];
-        } tmat_t;
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Rotate Kernel
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, af_interp_type method>
-        __global__ static void
-        rotate_kernel(Param<T> out, CParam<T> in, const tmat_t t,
-                      const int nimages, const int nbatches,
-                      const int blocksXPerImage, const int blocksYPerImage)
-        {
-            // Compute which image set
-            const int setId = blockIdx.x / blocksXPerImage;
-            const int blockIdx_x = blockIdx.x - setId * blocksXPerImage;
-
-            const int batch = blockIdx.y / blocksYPerImage;
-            const int blockIdx_y = blockIdx.y - batch * blocksYPerImage;
-
-            // Get thread indices
-            const int xx = blockIdx_x * blockDim.x + threadIdx.x;
-            const int yy = blockIdx_y * blockDim.y + threadIdx.y;
-
-            const int limages = min(out.dims[2] - setId * nimages, nimages);
-
-            if(xx >= out.dims[0] || yy >= out.dims[1])
-                return;
-
-            // Global offset
-            //          Offset for transform channel + Offset for image channel.
-                  T *optr = out.ptr + setId * nimages * out.strides[2] + batch * out.strides[3];
-            const T *iptr = in.ptr  + setId * nimages * in.strides[2]  + batch * in.strides[3];
-
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    transform_n(optr, out, iptr, in, t.tmat, xx, yy, limages); break;
-                case AF_INTERP_BILINEAR:
-                    transform_b(optr, out, iptr, in, t.tmat, xx, yy, limages); break;
-                default: break;
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template <typename T, af_interp_type method>
-        void rotate(Param<T> out, CParam<T> in, const float theta)
-        {
-            const float c = cos(-theta), s = sin(-theta);
-            float tx, ty;
-            {
-                const float nx = 0.5 * (in.dims[0] - 1);
-                const float ny = 0.5 * (in.dims[1] - 1);
-                const float mx = 0.5 * (out.dims[0] - 1);
-                const float my = 0.5 * (out.dims[1] - 1);
-                const float sx = (mx * c + my *-s);
-                const float sy = (mx * s + my * c);
-                tx = -(sx - nx);
-                ty = -(sy - ny);
-            }
-
-            // Rounding error. Anything more than 3 decimal points wont make a diff
-            tmat_t t;
-            t.tmat[0] = round( c * 1000) / 1000.0f;
-            t.tmat[1] = round(-s * 1000) / 1000.0f;
-            t.tmat[2] = round(tx * 1000) / 1000.0f;
-            t.tmat[3] = round( s * 1000) / 1000.0f;
-            t.tmat[4] = round( c * 1000) / 1000.0f;
-            t.tmat[5] = round(ty * 1000) / 1000.0f;
-
-            int nimages = in.dims[2];
-            int nbatches = in.dims[3];
-
-            dim3 threads(TX, TY, 1);
-            dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
-
-            const int blocksXPerImage = blocks.x;
-            const int blocksYPerImage = blocks.y;
-
-            if(nimages > TI) {
-                int tile_images = divup(nimages, TI);
-                nimages = TI;
-                blocks.x = blocks.x * tile_images;
-            }
-
-            blocks.y = blocks.y * nbatches;
-
-            rotate_kernel<T, method><<<blocks, threads>>> (out, in, t, nimages, nbatches,
-                                    blocksXPerImage, blocksYPerImage);
-
-            POST_LAUNCH_CHECK();
-        }
+        const float nx = 0.5 * (in.dims[0] - 1);
+        const float ny = 0.5 * (in.dims[1] - 1);
+        const float mx = 0.5 * (out.dims[0] - 1);
+        const float my = 0.5 * (out.dims[1] - 1);
+        const float sx = (mx * c + my * -s);
+        const float sy = (mx * s + my * c);
+        tx             = -(sx - nx);
+        ty             = -(sy - ny);
+    }
+
+    // Rounding error. Anything beyond 3 decimal places won't make a difference
+    tmat_t t;
+    t.tmat[0] = round(c * 1000) / 1000.0f;
+    t.tmat[1] = round(-s * 1000) / 1000.0f;
+    t.tmat[2] = round(tx * 1000) / 1000.0f;
+    t.tmat[3] = round(s * 1000) / 1000.0f;
+    t.tmat[4] = round(c * 1000) / 1000.0f;
+    t.tmat[5] = round(ty * 1000) / 1000.0f;
+
+    int nimages  = in.dims[2];
+    int nbatches = in.dims[3];
+
+    dim3 threads(TX, TY, 1);
+    dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
+
+    const int blocksXPerImage = blocks.x;
+    const int blocksYPerImage = blocks.y;
+
+    if (nimages > TI) {
+        int tile_images = divup(nimages, TI);
+        nimages         = TI;
+        blocks.x        = blocks.x * tile_images;
     }
+
+    blocks.y = blocks.y * nbatches;
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    rotate(qArgs, out, in, t, nimages, nbatches, blocksXPerImage,
+           blocksYPerImage, method);
+
+    POST_LAUNCH_CHECK();
 }
 
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_by_key/CMakeLists.txt b/src/backend/cuda/kernel/scan_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..8280fd4e74
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_by_key/CMakeLists.txt
@@ -0,0 +1,28 @@
+# Copyright (c) 2020, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/scan_by_key_impl.cpp" FILESTRINGS)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_BINARY_OPS")
+        string(REPLACE "// SBK_BINARY_OPS:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_BINARY_OPS ${TEMP})
+    endif()
+endforeach()
+
+foreach(SBK_BINARY_OP ${SBK_BINARY_OPS})
+  configure_file(
+    "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/scan_by_key_impl.cpp"
+    "${CMAKE_CURRENT_BINARY_DIR}/kernel/scan_by_key/scan_by_key_impl_${SBK_BINARY_OP}.cpp"
+  )
+
+  list(
+    APPEND
+    scan_by_key_sources
+    "${CMAKE_CURRENT_BINARY_DIR}/kernel/scan_by_key/scan_by_key_impl_${SBK_BINARY_OP}.cpp"
+  )
+endforeach(SBK_BINARY_OP ${SBK_BINARY_OPS})
diff --git a/src/backend/cuda/kernel/scan_by_key/scan_by_key_impl.cpp b/src/backend/cuda/kernel/scan_by_key/scan_by_key_impl.cpp
new file mode 100644
index 0000000000..b1480e6628
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_by_key/scan_by_key_impl.cpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/scan_dim_by_key_impl.hpp>
+#include <kernel/scan_first_by_key_impl.hpp>
+
+// This file instantiates scan_first_by_key and scan_dim_by_key as separate
+// object files. CMake reads the line below to determine the instantiations
+// SBK_BINARY_OPS:af_add_t af_mul_t af_max_t af_min_t
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// clang-format off
+INSTANTIATE_SCAN_FIRST_BY_KEY_OP( @SBK_BINARY_OP@ )
+INSTANTIATE_SCAN_DIM_BY_KEY_OP( @SBK_BINARY_OP@ )
+// clang-format on
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_dim.cuh b/src/backend/cuda/kernel/scan_dim.cuh
new file mode 100644
index 0000000000..a7f4066c80
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_dim.cuh
@@ -0,0 +1,172 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass,
+         uint DIMY, bool inclusive_scan>
+__global__ void scan_dim(Param<To> out, Param<To> tmp, CParam<Ti> in,
+                         uint blocks_x, uint blocks_y, uint blocks_dim,
+                         uint lim) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+    const int tid  = tidy * THREADS_X + tidx;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const int xid = blockIdx_x * blockDim.x + tidx;
+    const int yid = blockIdx_y;  // yid of output; updated for input later.
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    const Ti *iptr = in.ptr;
+    To *optr       = out.ptr;
+    To *tptr       = tmp.ptr;
+
+    // There is only one element per block for tmp
+    // There are blockDim.y * lim elements per block for in and out
+    // Hence update ids[dim] just after offsetting tmp and before offsetting
+    // in and out
+    tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] +
+            ids[1] * tmp.strides[1] + ids[0];
+    const int blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y * lim + tidy;
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    iptr += ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+            ids[1] * in.strides[1] + ids[0];
+    int id_dim        = ids[dim];
+    const int out_dim = out.dims[dim];
+
+    bool is_valid = (ids[0] < out.dims[0]) && (ids[1] < out.dims[1]) &&
+                    (ids[2] < out.dims[2]) && (ids[3] < out.dims[3]);
+
+    const int ostride_dim = out.strides[dim];
+    const int istride_dim = in.strides[dim];
+
+    __shared__ To s_val[THREADS_X * DIMY * 2];
+    __shared__ To s_tmp[THREADS_X];
+    To *sptr = s_val + tid;
+
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+
+    const To init = common::Binary<To, op>::init();
+    To val        = init;
+
+    const bool isLast = (tidy == (DIMY - 1));
+
+    for (int k = 0; k < lim; k++) {
+        if (isLast) s_tmp[tidx] = val;
+
+        bool cond = (is_valid) && (id_dim < out_dim);
+        val       = cond ? transform(*iptr) : init;
+        *sptr     = val;
+        __syncthreads();
+
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMY; off *= 2) {
+            if (tidy >= off) val = binop(val, sptr[(start - off) * THREADS_X]);
+            start                   = DIMY - start;
+            sptr[start * THREADS_X] = val;
+
+            __syncthreads();
+        }
+
+        val = binop(val, s_tmp[tidx]);
+        if (inclusive_scan) {
+            if (cond) { *optr = val; }
+        } else if (is_valid) {
+            if (id_dim == (out_dim - 1)) {
+                *(optr - (id_dim * ostride_dim)) = init;
+            } else if (id_dim < (out_dim - 1)) {
+                *(optr + ostride_dim) = val;
+            }
+        }
+        id_dim += blockDim.y;
+        iptr += blockDim.y * istride_dim;
+        optr += blockDim.y * ostride_dim;
+        __syncthreads();
+    }
+
+    if (!isFinalPass && is_valid && (blockIdx_dim < tmp.dims[dim]) && isLast) {
+        *tptr = val;
+    }
+}
+
+template<typename To, af_op_t op, int dim>
+__global__ void scan_dim_bcast(Param<To> out, CParam<To> tmp, uint blocks_x,
+                               uint blocks_y, uint blocks_dim, uint lim,
+                               bool inclusive_scan) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const int xid = blockIdx_x * blockDim.x + tidx;
+    const int yid = blockIdx_y;  // yid of output; updated for input later.
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    const To *tptr = tmp.ptr;
+    To *optr       = out.ptr;
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+    tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] +
+            ids[1] * tmp.strides[1] + ids[0];
+    const int blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y * lim + tidy;
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    const int id_dim  = ids[dim];
+    const int out_dim = out.dims[dim];
+
+    // Shift broadcast one step to the right for exclusive scan (#2366)
+    int offset = inclusive_scan ? 0 : out.strides[dim];
+    optr += offset;
+
+    bool is_valid = (ids[0] < out.dims[0]) && (ids[1] < out.dims[1]) &&
+                    (ids[2] < out.dims[2]) && (ids[3] < out.dims[3]);
+
+    if (!is_valid) return;
+    if (blockIdx_dim == 0) return;
+
+    To accum = *(tptr - tmp.strides[dim]);
+
+    common::Binary<To, op> binop;
+    const int ostride_dim = out.strides[dim];
+
+    for (int k = 0, id = id_dim; is_valid && k < lim && (id < out_dim);
+         k++, id += blockDim.y) {
+        *optr = binop(*optr, accum);
+        optr += blockDim.y * ostride_dim;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_dim.hpp b/src/backend/cuda/kernel/scan_dim.hpp
index 72e80b3e88..9fc32c61e9 100644
--- a/src/backend/cuda/kernel/scan_dim.hpp
+++ b/src/backend/cuda/kernel/scan_dim.hpp
@@ -7,283 +7,120 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <ops.hpp>
-#include <backend.hpp>
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
+#include <backend.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <err_cuda.hpp>
 #include <memory.hpp>
+#include <nvrtc_kernel_headers/scan_dim_cuh.hpp>
 #include "config.hpp"
 
-namespace cuda
-{
-namespace kernel
-{
-
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass, uint DIMY>
-    __global__
-    static void scan_dim_kernel(Param<To> out,
-                                Param<To> tmp,
-                                CParam<Ti>  in,
-                                uint blocks_x,
-                                uint blocks_y,
-                                uint blocks_dim,
-                                uint lim)
-    {
-        const int tidx = threadIdx.x;
-        const int tidy = threadIdx.y;
-        const int tid  = tidy * THREADS_X + tidx;
-
-        const int zid = blockIdx.x / blocks_x;
-        const int wid = blockIdx.y / blocks_y;
-        const int blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const int blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const int xid = blockIdx_x * blockDim.x + tidx;
-        const int yid = blockIdx_y; // yid  of output. updated for input later.
-
-        int ids[4] = {xid, yid, zid, wid};
-
-        const Ti *iptr = in.ptr;
-        To *optr = out.ptr;
-        To *tptr = tmp.ptr;
-
-        // There is only one element per block for out
-        // There are blockDim.y elements per block for in
-        // Hence increment ids[dim] just after offseting out and before offsetting in
-        tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] + ids[1] * tmp.strides[1] + ids[0];
-        const int blockIdx_dim = ids[dim];
-
-        ids[dim] = ids[dim] * blockDim.y * lim + tidy;
-        optr  += ids[3] * out.strides[3] + ids[2] * out.strides[2] + ids[1] * out.strides[1] + ids[0];
-        iptr  += ids[3] *  in.strides[3] + ids[2] *  in.strides[2] + ids[1] *  in.strides[1] + ids[0];
-        int id_dim = ids[dim];
-        const int out_dim = out.dims[dim];
-
-        bool is_valid =
-            (ids[0] < out.dims[0]) &&
-            (ids[1] < out.dims[1]) &&
-            (ids[2] < out.dims[2]) &&
-            (ids[3] < out.dims[3]);
-
-        const int ostride_dim = out.strides[dim];
-        const int istride_dim =  in.strides[dim];
-
-        __shared__ To s_val[THREADS_X * DIMY * 2];
-        __shared__ To s_tmp[THREADS_X];
-        To *sptr =  s_val + tid;
-
-        Transform<Ti, To, op> transform;
-        Binary<To, op> binop;
-
-        const To init = binop.init();
-        To val = init;
-
-        const bool isLast = (tidy == (DIMY - 1));
-
-        for (int k = 0; k < lim; k++) {
-
-            if (isLast) s_tmp[tidx] = val;
-
-            bool cond = (is_valid) && (id_dim < out_dim);
-            val = cond ? transform(*iptr) : init;
-            *sptr = val;
-            __syncthreads();
-
-            int start = 0;
-#pragma unroll
-            for (int off = 1; off < DIMY; off *= 2) {
-
-                if (tidy >= off) val = binop(val, sptr[(start - off) * THREADS_X]);
-                start = DIMY - start;
-                sptr[start * THREADS_X] = val;
-
-                __syncthreads();
-            }
-
-            val = binop(val, s_tmp[tidx]);
-            __syncthreads();
-            if (cond) *optr = val;
-
-            id_dim += blockDim.y;
-            iptr += blockDim.y * istride_dim;
-            optr += blockDim.y * ostride_dim;
-        }
-
-        if (!isFinalPass &&
-            is_valid &&
-            (blockIdx_dim < tmp.dims[dim]) &&
-            isLast) {
-            *tptr = val;
-            }
-    }
-
-    template<typename To, af_op_t op, int dim>
-    __global__
-    static void bcast_dim_kernel(Param<To> out,
-                                 CParam<To> tmp,
-                                 uint blocks_x,
-                                 uint blocks_y,
-                                 uint blocks_dim,
-                                 uint lim)
-    {
-        const int tidx = threadIdx.x;
-        const int tidy = threadIdx.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-        const int zid = blockIdx.x / blocks_x;
-        const int wid = blockIdx.y / blocks_y;
-        const int blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const int blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const int xid = blockIdx_x * blockDim.x + tidx;
-        const int yid = blockIdx_y; // yid  of output. updated for input later.
+template<typename Ti, typename To, af_op_t op>
+static void scan_dim_launcher(Param<To> out, Param<To> tmp, CParam<Ti> in,
+                              const uint threads_y, const dim_t blocks_all[4],
+                              int dim, bool isFinalPass, bool inclusive_scan) {
+    auto scan_dim = common::getKernel(
+        "arrayfire::cuda::scan_dim", {{scan_dim_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<To>(),
+                     TemplateArg(op), TemplateArg(dim),
+                     TemplateArg(isFinalPass), TemplateArg(threads_y),
+                     TemplateArg(inclusive_scan)),
+        {{DefineValue(THREADS_X)}});
 
-        int ids[4] = {xid, yid, zid, wid};
+    dim3 threads(THREADS_X, threads_y);
 
-        const To *tptr = tmp.ptr;
-        To *optr = out.ptr;
+    dim3 blocks(blocks_all[0] * blocks_all[2], blocks_all[1] * blocks_all[3]);
 
-        // There is only one element per block for out
-        // There are blockDim.y elements per block for in
-        // Hence increment ids[dim] just after offseting out and before offsetting in
-        tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] + ids[1] * tmp.strides[1] + ids[0];
-        const int blockIdx_dim = ids[dim];
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        ids[dim] = ids[dim] * blockDim.y * lim + tidy;
-        optr  += ids[3] * out.strides[3] + ids[2] * out.strides[2] + ids[1] * out.strides[1] + ids[0];
-        const int id_dim = ids[dim];
-        const int out_dim = out.dims[dim];
+    uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
 
-        bool is_valid =
-            (ids[0] < out.dims[0]) &&
-            (ids[1] < out.dims[1]) &&
-            (ids[2] < out.dims[2]) &&
-            (ids[3] < out.dims[3]);
-
-        if (!is_valid) return;
-        if (blockIdx_dim == 0) return;
-
-        To accum = *(tptr - tmp.strides[dim]);
-
-        Binary<To, op> binop;
-        const int ostride_dim = out.strides[dim];
-
-        for (int k = 0, id = id_dim;
-             is_valid && k < lim && (id < out_dim);
-             k++, id += blockDim.y) {
-
-            *optr = binop(*optr,accum);
-            optr += blockDim.y * ostride_dim;
-        }
-    }
-
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass>
-    static void scan_dim_launcher(Param<To> out,
-                           Param<To> tmp,
-                           CParam<Ti> in,
-                           const uint threads_y,
-                           const uint blocks_all[4])
-    {
-        dim3 threads(THREADS_X, threads_y);
-
-        dim3 blocks(blocks_all[0] * blocks_all[2],
-                    blocks_all[1] * blocks_all[3]);
-
-        uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
-
-        switch (threads_y) {
-        case 8:
-            (scan_dim_kernel<Ti, To, op, dim, isFinalPass, 8>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_all[0], blocks_all[1], blocks_all[dim], lim); break;
-        case 4:
-            (scan_dim_kernel<Ti, To, op, dim, isFinalPass, 4>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_all[0], blocks_all[1], blocks_all[dim], lim); break;
-        case 2:
-            (scan_dim_kernel<Ti, To, op, dim, isFinalPass, 2>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_all[0], blocks_all[1], blocks_all[dim], lim); break;
-        case 1:
-            (scan_dim_kernel<Ti, To, op, dim, isFinalPass, 1>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_all[0], blocks_all[1], blocks_all[dim], lim); break;
-        }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scan_dim(qArgs, out, tmp, in, blocks_all[0], blocks_all[1], blocks_all[dim],
+             lim);
+    POST_LAUNCH_CHECK();
+}
 
-        POST_LAUNCH_CHECK();
-    }
+template<typename To, af_op_t op>
+static void bcast_dim_launcher(Param<To> out, CParam<To> tmp,
+                               const uint threads_y, const dim_t blocks_all[4],
+                               int dim, bool inclusive_scan) {
+    auto scan_dim_bcast = common::getKernel(
+        "arrayfire::cuda::scan_dim_bcast", {{scan_dim_cuh_src}},
+        TemplateArgs(TemplateTypename<To>(), TemplateArg(op),
+                     TemplateArg(dim)));
 
+    dim3 threads(THREADS_X, threads_y);
 
+    dim3 blocks(blocks_all[0] * blocks_all[2], blocks_all[1] * blocks_all[3]);
 
-    template<typename To, af_op_t op, int dim>
-    static void bcast_dim_launcher(Param<To> out,
-                                   CParam<To> tmp,
-                                   const uint threads_y,
-                                   const uint blocks_all[4])
-    {
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        dim3 threads(THREADS_X, threads_y);
+    uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
 
-        dim3 blocks(blocks_all[0] * blocks_all[2],
-                    blocks_all[1] * blocks_all[3]);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scan_dim_bcast(qArgs, out, tmp, blocks_all[0], blocks_all[1],
+                   blocks_all[dim], lim, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
 
-        uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
+template<typename Ti, typename To, af_op_t op>
+static void scan_dim(Param<To> out, CParam<Ti> in, int dim,
+                     bool inclusive_scan) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(out.dims[dim]));
+    uint threads_x = THREADS_X;
 
-        (bcast_dim_kernel<To, op, dim>)<<<blocks, threads>>>(
-            out, tmp, blocks_all[0], blocks_all[1], blocks_all[dim], lim);
+    dim_t blocks_all[] = {divup(out.dims[0], threads_x), out.dims[1],
+                          out.dims[2], out.dims[3]};
 
-        POST_LAUNCH_CHECK();
-    }
+    blocks_all[dim] = divup(out.dims[dim], threads_y * REPEAT);
 
-    template<typename Ti, typename To, af_op_t op, int dim>
-    static void scan_dim(Param<To> out, CParam<Ti> in)
-    {
-        uint threads_y = std::min(THREADS_Y, nextpow2(out.dims[dim]));
-        uint threads_x = THREADS_X;
+    if (blocks_all[dim] == 1) {
+        scan_dim_launcher<Ti, To, op>(out, out, in, threads_y, blocks_all, dim,
+                                      true, inclusive_scan);
+    } else {
+        Param<To> tmp = out;
 
-        uint blocks_all[] = {divup(out.dims[0], threads_x),
-                             out.dims[1], out.dims[2], out.dims[3]};
+        tmp.dims[dim]  = blocks_all[dim];
+        tmp.strides[0] = 1;
+        for (int k = 1; k < 4; k++)
+            tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
 
-        blocks_all[dim] = divup(out.dims[dim], threads_y * REPEAT);
+        int tmp_elements = tmp.strides[3] * tmp.dims[3];
+        auto tmp_alloc   = memAlloc<To>(tmp_elements);
+        tmp.ptr          = tmp_alloc.get();
 
-        if (blocks_all[dim] == 1) {
+        scan_dim_launcher<Ti, To, op>(out, tmp, in, threads_y, blocks_all, dim,
+                                      false, inclusive_scan);
 
-            scan_dim_launcher<Ti, To, op, dim, true>(out, out, in,
-                                                     threads_y,
-                                                     blocks_all);
+        int bdim        = blocks_all[dim];
+        blocks_all[dim] = 1;
 
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            scan_dim_launcher<To, To, af_add_t>(tmp, tmp, tmp, threads_y,
+                                                blocks_all, dim, true, true);
         } else {
-
-            Param<To> tmp = out;
-
-            tmp.dims[dim] = blocks_all[dim];
-            tmp.strides[0] = 1;
-            for (int k = 1; k < 4; k++) tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
-
-            int tmp_elements = tmp.strides[3] * tmp.dims[3];
-            tmp.ptr = memAlloc<To>(tmp_elements);
-
-            scan_dim_launcher<Ti, To, op, dim, false>(out, tmp, in,
-                                                      threads_y,
-                                                      blocks_all);
-
-            int bdim = blocks_all[dim];
-            blocks_all[dim] = 1;
-
-            //FIXME: Is there an alternative to the if condition ?
-            if (op == af_notzero_t) {
-                scan_dim_launcher<To, To, af_add_t, dim, true>(tmp, tmp, tmp,
-                                                               threads_y,
-                                                               blocks_all);
-            } else {
-                scan_dim_launcher<To, To,       op, dim, true>(tmp, tmp, tmp,
-                                                               threads_y,
-                                                               blocks_all);
-            }
-
-            blocks_all[dim] = bdim;
-            bcast_dim_launcher<To, op, dim>(out, tmp, threads_y, blocks_all);
-
-            memFree(tmp.ptr);
+            scan_dim_launcher<To, To, op>(tmp, tmp, tmp, threads_y, blocks_all,
+                                          dim, true, true);
         }
-    }
 
+        blocks_all[dim] = bdim;
+        bcast_dim_launcher<To, op>(out, tmp, threads_y, blocks_all, dim,
+                                   inclusive_scan);
+    }
 }
-}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_dim_by_key.cuh b/src/backend/cuda/kernel/scan_dim_by_key.cuh
new file mode 100644
index 0000000000..06de7c1ae1
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_dim_by_key.cuh
@@ -0,0 +1,372 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Tk>
+__device__ inline char calculate_head_flags_dim(const Tk *kptr, int id,
+                                                int stride) {
+    return (id == 0) ? 1 : ((*kptr) != (*(kptr - stride)));
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+__global__ void scanbykey_dim_nonfinal(Param<To> out, Param<To> tmp,
+                                       Param<char> tflg, Param<int> tlid,
+                                       CParam<Ti> in, CParam<Tk> key, int dim,
+                                       uint blocks_x, uint blocks_y, uint lim,
+                                       bool inclusive_scan) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+    const int tid  = tidy * THREADS_X + tidx;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x + tidx;
+    const int yid = blockIdx_y;  // yid of output; updated for input later.
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    const Ti *iptr = in.ptr;
+    const Tk *kptr = key.ptr;
+    To *optr       = out.ptr;
+    To *tptr       = tmp.ptr;
+    char *tfptr    = tflg.ptr;
+    int *tiptr     = tlid.ptr;
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+    tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] +
+            ids[1] * tmp.strides[1] + ids[0];
+    tfptr += ids[3] * tflg.strides[3] + ids[2] * tflg.strides[2] +
+             ids[1] * tflg.strides[1] + ids[0];
+    tiptr += ids[3] * tlid.strides[3] + ids[2] * tlid.strides[2] +
+             ids[1] * tlid.strides[1] + ids[0];
+    const int blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y * lim + tidy;
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    iptr += ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+            ids[1] * in.strides[1] + ids[0];
+    kptr += ids[3] * key.strides[3] + ids[2] * key.strides[2] +
+            ids[1] * key.strides[1] + ids[0];
+    int id_dim        = ids[dim];
+    const int out_dim = out.dims[dim];
+
+    bool is_valid = (ids[0] < out.dims[0]) && (ids[1] < out.dims[1]) &&
+                    (ids[2] < out.dims[2]) && (ids[3] < out.dims[3]);
+
+    const int ostride_dim = out.strides[dim];
+    const int istride_dim = in.strides[dim];
+
+    __shared__ char s_flg[THREADS_X * DIMY * 2];
+    __shared__ To s_val[THREADS_X * DIMY * 2];
+    __shared__ char s_ftmp[THREADS_X];
+    __shared__ To s_tmp[THREADS_X];
+    __shared__ int boundaryid[THREADS_X];
+    To *sptr    = s_val + tid;
+    char *sfptr = s_flg + tid;
+
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+
+    const To init = common::Binary<To, op>::init();
+    To val        = init;
+
+    const bool isLast = (tidy == (DIMY - 1));
+    if (isLast) {
+        s_tmp[tidx]      = val;
+        s_ftmp[tidx]     = 0;
+        boundaryid[tidx] = -1;
+    }
+    __syncthreads();
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        if (id_dim < out_dim) {
+            flag = calculate_head_flags_dim(kptr, id_dim, key.strides[dim]);
+        } else {
+            flag = 0;
+        }
+
+        // Load val from global in
+        if (inclusive_scan) {
+            if (id_dim >= out_dim) {
+                val = init;
+            } else {
+                val = transform(*iptr);
+            }
+        } else {
+            if ((id_dim == 0) || (id_dim >= out_dim) || flag) {
+                val = init;
+            } else {
+                val = transform(*(iptr - istride_dim));
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((tidy == 0) && (flag == 0)) {
+            val  = binop(val, s_tmp[tidx]);
+            flag = s_ftmp[tidx];
+        }
+
+        // Write to shared memory
+        *sptr  = val;
+        *sfptr = flag;
+        __syncthreads();
+
+        // Segmented Scan
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMY; off *= 2) {
+            if (tidy >= off) {
+                val = sfptr[start * THREADS_X]
+                          ? val
+                          : binop(val, sptr[(start - off) * THREADS_X]);
+                flag =
+                    sfptr[start * THREADS_X] | sfptr[(start - off) * THREADS_X];
+            }
+            start                    = DIMY - start;
+            sptr[start * THREADS_X]  = val;
+            sfptr[start * THREADS_X] = flag;
+
+            __syncthreads();
+        }
+
+        // Identify segment boundary
+        if (tidy == 0) {
+            if ((s_ftmp[tidx] == 0) && (sfptr[start * THREADS_X] == 1)) {
+                boundaryid[tidx] = id_dim;
+            }
+        } else {
+            if ((sfptr[(start - 1) * THREADS_X] == 0) &&
+                (sfptr[start * THREADS_X] == 1)) {
+                boundaryid[tidx] = id_dim;
+            }
+        }
+        __syncthreads();
+
+        if (is_valid && (id_dim < out_dim)) *optr = val;
+        if (isLast) {
+            s_tmp[tidx]  = val;
+            s_ftmp[tidx] = flag;
+        }
+        id_dim += blockDim.y;
+        kptr += blockDim.y * key.strides[dim];
+        iptr += blockDim.y * istride_dim;
+        optr += blockDim.y * ostride_dim;
+        __syncthreads();
+    }
+
+    if (is_valid && (blockIdx_dim < tmp.dims[dim]) && isLast) {
+        *tptr        = val;
+        *tfptr       = flag;
+        int boundary = boundaryid[tidx];
+        *tiptr       = (boundary == -1) ? id_dim : boundary;
+    }
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+__global__ void scanbykey_dim_final(Param<To> out, CParam<Ti> in,
+                                    CParam<Tk> key, int dim, uint blocks_x,
+                                    uint blocks_y, uint lim,
+                                    bool calculateFlags, bool inclusive_scan) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+    const int tid  = tidy * THREADS_X + tidx;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x + tidx;
+    const int yid = blockIdx_y;  // yid of output; updated for input later.
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    const Ti *iptr = in.ptr;
+    const Tk *kptr = key.ptr;
+    To *optr       = out.ptr;
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+
+    ids[dim] = ids[dim] * blockDim.y * lim + tidy;
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    iptr += ids[3] * in.strides[3] + ids[2] * in.strides[2] +
+            ids[1] * in.strides[1] + ids[0];
+    kptr += ids[3] * key.strides[3] + ids[2] * key.strides[2] +
+            ids[1] * key.strides[1] + ids[0];
+    int id_dim        = ids[dim];
+    const int out_dim = out.dims[dim];
+
+    bool is_valid = (ids[0] < out.dims[0]) && (ids[1] < out.dims[1]) &&
+                    (ids[2] < out.dims[2]) && (ids[3] < out.dims[3]);
+
+    const int ostride_dim = out.strides[dim];
+    const int istride_dim = in.strides[dim];
+
+    __shared__ char s_flg[THREADS_X * DIMY * 2];
+    __shared__ To s_val[THREADS_X * DIMY * 2];
+    __shared__ char s_ftmp[THREADS_X];
+    __shared__ To s_tmp[THREADS_X];
+    To *sptr    = s_val + tid;
+    char *sfptr = s_flg + tid;
+
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+
+    const To init = common::Binary<To, op>::init();
+    To val        = init;
+
+    const bool isLast = (tidy == (DIMY - 1));
+    if (isLast) {
+        s_tmp[tidx]  = val;
+        s_ftmp[tidx] = 0;
+    }
+    __syncthreads();
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        if (calculateFlags) {
+            if (id_dim < out_dim) {
+                flag = calculate_head_flags_dim(kptr, id_dim, key.strides[dim]);
+            } else {
+                flag = 0;
+            }
+        } else {
+            flag = *kptr;
+        }
+
+        // Load val from global in
+        if (inclusive_scan) {
+            if (id_dim >= out_dim) {
+                val = init;
+            } else {
+                val = transform(*iptr);
+            }
+        } else {
+            if ((id_dim == 0) || (id_dim >= out_dim) || flag) {
+                val = init;
+            } else {
+                val = transform(*(iptr - istride_dim));
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((tidy == 0) && (flag == 0)) {
+            val  = binop(val, s_tmp[tidx]);
+            flag = s_ftmp[tidx];
+        }
+
+        // Write to shared memory
+        *sptr  = val;
+        *sfptr = flag;
+        __syncthreads();
+
+        // Segmented Scan
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMY; off *= 2) {
+            if (tidy >= off) {
+                val = sfptr[start * THREADS_X]
+                          ? val
+                          : binop(val, sptr[(start - off) * THREADS_X]);
+                flag =
+                    sfptr[start * THREADS_X] | sfptr[(start - off) * THREADS_X];
+            }
+            start                    = DIMY - start;
+            sptr[start * THREADS_X]  = val;
+            sfptr[start * THREADS_X] = flag;
+
+            __syncthreads();
+        }
+
+        if (is_valid && (id_dim < out_dim)) *optr = val;
+        if (isLast) {
+            s_tmp[tidx]  = val;
+            s_ftmp[tidx] = flag;
+        }
+        id_dim += blockDim.y;
+        kptr += blockDim.y * key.strides[dim];
+        iptr += blockDim.y * istride_dim;
+        optr += blockDim.y * ostride_dim;
+        __syncthreads();
+    }
+}
+
+template<typename To, af_op_t op>
+__global__ void scanbykey_dim_bcast(Param<To> out, CParam<To> tmp,
+                                    Param<int> tlid, int dim, uint blocks_x,
+                                    uint blocks_y, uint blocks_dim, uint lim) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x + tidx;
+    const int yid = blockIdx_y;  // yid of output; updated for input later.
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    const To *tptr  = tmp.ptr;
+    To *optr        = out.ptr;
+    const int *iptr = tlid.ptr;
+
+    // There is only one element per block for out
+    // There are blockDim.y elements per block for in
+    // Hence increment ids[dim] just after offsetting out and before offsetting
+    // in
+    tptr += ids[3] * tmp.strides[3] + ids[2] * tmp.strides[2] +
+            ids[1] * tmp.strides[1] + ids[0];
+    iptr += ids[3] * tlid.strides[3] + ids[2] * tlid.strides[2] +
+            ids[1] * tlid.strides[1] + ids[0];
+    const int blockIdx_dim = ids[dim];
+
+    ids[dim] = ids[dim] * blockDim.y * lim + tidy;
+    optr += ids[3] * out.strides[3] + ids[2] * out.strides[2] +
+            ids[1] * out.strides[1] + ids[0];
+    const int id_dim = ids[dim];
+
+    bool is_valid = (ids[0] < out.dims[0]) && (ids[1] < out.dims[1]) &&
+                    (ids[2] < out.dims[2]) && (ids[3] < out.dims[3]);
+
+    if (!is_valid) return;
+    if (blockIdx_dim == 0) return;
+
+    int boundary = *iptr;
+    To accum     = *(tptr - tmp.strides[dim]);
+
+    common::Binary<To, op> binop;
+    const int ostride_dim = out.strides[dim];
+
+    for (int k = 0, id = id_dim; is_valid && k < lim && (id < boundary);
+         k++, id += blockDim.y) {
+        *optr = binop(*optr, accum);
+        optr += blockDim.y * ostride_dim;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_dim_by_key.hpp b/src/backend/cuda/kernel/scan_dim_by_key.hpp
new file mode 100644
index 0000000000..05092499d6
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_dim_by_key.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Param.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scan_dim_by_key(Param<To> out, CParam<Ti> in, CParam<Tk> key, int dim,
+                     bool inclusive_scan);
+}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_dim_by_key_impl.hpp b/src/backend/cuda/kernel/scan_dim_by_key_impl.hpp
new file mode 100644
index 0000000000..0a07b7fa1e
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_dim_by_key_impl.hpp
@@ -0,0 +1,171 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <kernel/config.hpp>
+#include <memory.hpp>
+#include <nvrtc_kernel_headers/scan_dim_by_key_cuh.hpp>
+#include <optypes.hpp>
+#include <traits.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scan_dim_nonfinal_launcher(Param<To> out, Param<To> tmp,
+                                       Param<char> tflg, Param<int> tlid,
+                                       CParam<Ti> in, CParam<Tk> key,
+                                       const int dim, const uint threads_y,
+                                       const dim_t blocks_all[4],
+                                       bool inclusive_scan) {
+    auto scanbykey_dim_nonfinal = common::getKernel(
+        "arrayfire::cuda::scanbykey_dim_nonfinal", {{scan_dim_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<Tk>(),
+                     TemplateTypename<To>(), TemplateArg(op)),
+        {{DefineValue(THREADS_X), DefineKeyValue(DIMY, threads_y)}});
+
+    dim3 threads(THREADS_X, threads_y);
+
+    dim3 blocks(blocks_all[0] * blocks_all[2], blocks_all[1] * blocks_all[3]);
+
+    uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_dim_nonfinal(qArgs, out, tmp, tflg, tlid, in, key, dim,
+                           blocks_all[0], blocks_all[1], lim, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scan_dim_final_launcher(Param<To> out, CParam<Ti> in,
+                                    CParam<Tk> key, const int dim,
+                                    const uint threads_y,
+                                    const dim_t blocks_all[4],
+                                    bool calculateFlags, bool inclusive_scan) {
+    auto scanbykey_dim_final = common::getKernel(
+        "arrayfire::cuda::scanbykey_dim_final", {{scan_dim_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<Tk>(),
+                     TemplateTypename<To>(), TemplateArg(op)),
+        {{DefineValue(THREADS_X), DefineKeyValue(DIMY, threads_y)}});
+
+    dim3 threads(THREADS_X, threads_y);
+
+    dim3 blocks(blocks_all[0] * blocks_all[2], blocks_all[1] * blocks_all[3]);
+
+    uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_dim_final(qArgs, out, in, key, dim, blocks_all[0], blocks_all[1],
+                        lim, calculateFlags, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename To, af_op_t op>
+static void bcast_dim_launcher(Param<To> out, CParam<To> tmp, Param<int> tlid,
+                               const int dim, const uint threads_y,
+                               const dim_t blocks_all[4]) {
+    auto scanbykey_dim_bcast = common::getKernel(
+        "arrayfire::cuda::scanbykey_dim_bcast", {{scan_dim_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<To>(), TemplateArg(op)));
+    dim3 threads(THREADS_X, threads_y);
+    dim3 blocks(blocks_all[0] * blocks_all[2], blocks_all[1] * blocks_all[3]);
+
+    uint lim = divup(out.dims[dim], (threads_y * blocks_all[dim]));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_dim_bcast(qArgs, out, tmp, tlid, dim, blocks_all[0],
+                        blocks_all[1], blocks_all[dim], lim);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scan_dim_by_key(Param<To> out, CParam<Ti> in, CParam<Tk> key, int dim,
+                     bool inclusive_scan) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(out.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    dim_t blocks_all[] = {divup(out.dims[0], threads_x), out.dims[1],
+                          out.dims[2], out.dims[3]};
+
+    blocks_all[dim] = divup(out.dims[dim], threads_y * REPEAT);
+
+    if (blocks_all[dim] == 1) {
+        scan_dim_final_launcher<Ti, Tk, To, op>(
+            out, in, key, dim, threads_y, blocks_all, true, inclusive_scan);
+
+    } else {
+        Param<To> tmp = out;
+        Param<char> tmpflg;
+        Param<int> tmpid;
+
+        tmp.dims[dim]  = blocks_all[dim];
+        tmp.strides[0] = 1;
+        for (int k = 1; k < 4; k++)
+            tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
+        for (int k = 0; k < 4; k++) {
+            tmpflg.strides[k] = tmp.strides[k];
+            tmpid.strides[k]  = tmp.strides[k];
+            tmpflg.dims[k]    = tmp.dims[k];
+            tmpid.dims[k]     = tmp.dims[k];
+        }
+
+        int tmp_elements  = tmp.strides[3] * tmp.dims[3];
+        auto tmp_alloc    = memAlloc<To>(tmp_elements);
+        auto tmpflg_alloc = memAlloc<char>(tmp_elements);
+        auto tmpid_alloc  = memAlloc<int>(tmp_elements);
+        tmp.ptr           = tmp_alloc.get();
+        tmpflg.ptr        = tmpflg_alloc.get();
+        tmpid.ptr         = tmpid_alloc.get();
+
+        scan_dim_nonfinal_launcher<Ti, Tk, To, op>(out, tmp, tmpflg, tmpid, in,
+                                                   key, dim, threads_y,
+                                                   blocks_all, inclusive_scan);
+
+        int bdim        = blocks_all[dim];
+        blocks_all[dim] = 1;
+        scan_dim_final_launcher<To, char, To, op>(
+            tmp, tmp, tmpflg, dim, threads_y, blocks_all, false, true);
+
+        blocks_all[dim] = bdim;
+        bcast_dim_launcher<To, op>(out, tmp, tmpid, dim, threads_y, blocks_all);
+    }
+}
+
+}  // namespace kernel
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY(ROp, Ti, Tk, To)           \
+    template void scan_dim_by_key<Ti, Tk, To, ROp>(            \
+        Param<To> out, CParam<Ti> in, CParam<Tk> key, int dim, \
+        bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, Tk)         \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY_OP(ROp)      \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, int)  \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, uint) \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, intl) \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, uintl)
+}  // namespace cuda
+}  // namespace arrayfire
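
Reviewer note: the three launchers above (non-final scan, final scan over the per-block partials, broadcast) together compute a segmented scan along `dim`. As a rough aid for checking results, the sketch below is a sequential host-side reference for the add case. It assumes a 1-D input and integer data, and the name `scanByKeyReference` is hypothetical and not part of ArrayFire. The head-flag rule and the exclusive-scan behavior (a segment head yields the identity) mirror `calculate_head_flags` and the `inclusive_scan` branches of the by-key kernels in this patch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference (not an ArrayFire API): running sum that
// restarts whenever the key changes between adjacent elements.
std::vector<int> scanByKeyReference(const std::vector<int>& keys,
                                    const std::vector<int>& vals,
                                    bool inclusive) {
    assert(keys.size() == vals.size());
    std::vector<int> out(vals.size());
    int running = 0;  // identity element for addition
    for (std::size_t i = 0; i < vals.size(); ++i) {
        // Head flag: the first element, or a key that differs from its
        // predecessor, starts a new segment.
        const bool head = (i == 0) || (keys[i] != keys[i - 1]);
        if (head) running = 0;
        if (inclusive) {
            running += vals[i];
            out[i] = running;  // includes the current element
        } else {
            out[i] = running;  // excludes the current element
            running += vals[i];
        }
    }
    return out;
}
```

For keys `{0, 0, 1, 1, 1, 2}` and values `{1, 2, 3, 4, 5, 6}`, the inclusive call yields `{1, 3, 3, 7, 12, 6}` and the exclusive call yields `{0, 1, 0, 3, 7, 0}`.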
diff --git a/src/backend/cuda/kernel/scan_first.cuh b/src/backend/cuda/kernel/scan_first.cuh
new file mode 100644
index 0000000000..31abbd57a5
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_first.cuh
@@ -0,0 +1,138 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Ti, typename To, af_op_t op, bool isFinalPass, uint DIMX,
+         bool inclusive_scan>
+__global__ void scan_first(Param<To> out, Param<To> tmp, CParam<Ti> in,
+                           uint blocks_x, uint blocks_y, uint lim) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const int xid = blockIdx_x * blockDim.x * lim + tidx;
+    const int yid = blockIdx_y * blockDim.y + tidy;
+
+    bool cond_yzw =
+        (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
+
+    if (!cond_yzw) return;  // retire warps early
+
+    const Ti *iptr = in.ptr;
+    To *optr       = out.ptr;
+    To *tptr       = tmp.ptr;
+
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+    tptr += wid * tmp.strides[3] + zid * tmp.strides[2] + yid * tmp.strides[1];
+
+    const int DIMY            = THREADS_PER_BLOCK / DIMX;
+    const int SHARED_MEM_SIZE = (2 * DIMX + 1) * (DIMY);
+
+    __shared__ To s_val[SHARED_MEM_SIZE];
+    __shared__ To s_tmp[DIMY];
+
+    To *sptr = s_val + tidy * (2 * DIMX + 1);
+
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+
+    const To init = common::Binary<To, op>::init();
+    int id        = xid;
+    To val        = init;
+
+    const bool isLast = (tidx == (DIMX - 1));
+
+    for (int k = 0; k < lim; k++) {
+        if (isLast) s_tmp[tidy] = val;
+
+        bool cond  = (id < out.dims[0]);
+        val        = cond ? transform(iptr[id]) : init;
+        sptr[tidx] = val;
+        __syncthreads();
+
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMX; off *= 2) {
+            if (tidx >= off) val = binop(val, sptr[(start - off) + tidx]);
+            start              = DIMX - start;
+            sptr[start + tidx] = val;
+
+            __syncthreads();
+        }
+
+        val = binop(val, s_tmp[tidy]);
+
+        if (inclusive_scan) {
+            if (cond) { optr[id] = val; }
+        } else {
+            if (id == (out.dims[0] - 1)) {
+                optr[0] = init;
+            } else if (id < (out.dims[0] - 1)) {
+                optr[id + 1] = val;
+            }
+        }
+        id += blockDim.x;
+        __syncthreads();
+    }
+
+    if (!isFinalPass && isLast) { tptr[blockIdx_x] = val; }
+}
+
+template<typename To, af_op_t op>
+__global__ void scan_first_bcast(Param<To> out, CParam<To> tmp, uint blocks_x,
+                                 uint blocks_y, uint lim, bool inclusive_scan) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const int xid = blockIdx_x * blockDim.x * lim + tidx;
+    const int yid = blockIdx_y * blockDim.y + tidy;
+
+    if (blockIdx_x == 0) return;
+
+    bool cond =
+        (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
+    if (!cond) return;
+
+    To *optr       = out.ptr;
+    const To *tptr = tmp.ptr;
+
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+    tptr += wid * tmp.strides[3] + zid * tmp.strides[2] + yid * tmp.strides[1];
+
+    common::Binary<To, op> binop;
+    To accum = tptr[blockIdx_x - 1];
+
+    // Shift broadcast one step to the right for exclusive scan (#2366)
+    int offset = !inclusive_scan;
+    for (int k = 0, id = xid + offset; k < lim && id < out.dims[0];
+         k++, id += blockDim.x) {
+        optr[id] = binop(accum, optr[id]);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
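
Reviewer note: `scan_first` and `scan_first_bcast` split a long scan into per-block scans, a scan over the per-block totals, and a broadcast of each preceding total into later blocks. The sketch below walks through that three-phase structure sequentially on the host for the inclusive add case only. It is an illustration under simplifying assumptions (1-D data, `int` values, one "block" is a contiguous chunk), and `blockedInclusiveScan` is a hypothetical name, not code from this patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side illustration of a blocked inclusive prefix sum.
std::vector<int> blockedInclusiveScan(const std::vector<int>& in,
                                      std::size_t blockSize) {
    assert(blockSize > 0);
    const std::size_t n         = in.size();
    const std::size_t numBlocks = (n + blockSize - 1) / blockSize;
    std::vector<int> out(n);
    std::vector<int> totals(numBlocks, 0);

    // Phase 1: scan each block independently and record its total
    // (the role of the non-final pass writing into tmp).
    for (std::size_t b = 0; b < numBlocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * blockSize);
        int running           = 0;
        for (std::size_t i = b * blockSize; i < end; ++i) {
            running += in[i];
            out[i] = running;
        }
        totals[b] = running;
    }

    // Phase 2: inclusive scan over the block totals
    // (the role of the final pass run on tmp).
    for (std::size_t b = 1; b < numBlocks; ++b) totals[b] += totals[b - 1];

    // Phase 3: every block except the first adds the scanned total of the
    // block before it (the role of the broadcast kernel).
    for (std::size_t b = 1; b < numBlocks; ++b) {
        const std::size_t end = std::min(n, (b + 1) * blockSize);
        for (std::size_t i = b * blockSize; i < end; ++i) {
            out[i] += totals[b - 1];
        }
    }
    return out;
}
```

The result should not depend on `blockSize`: for input `{1, 2, 3, 4, 5, 6, 7}` both a block size of 3 and a block size of 1 give `{1, 3, 6, 10, 15, 21, 28}`.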
diff --git a/src/backend/cuda/kernel/scan_first.hpp b/src/backend/cuda/kernel/scan_first.hpp
index fb370c9880..868816f4ed 100644
--- a/src/backend/cuda/kernel/scan_first.hpp
+++ b/src/backend/cuda/kernel/scan_first.hpp
@@ -7,245 +7,110 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <ops.hpp>
-#include <backend.hpp>
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
+#include <backend.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <err_cuda.hpp>
 #include <memory.hpp>
+#include <nvrtc_kernel_headers/scan_first_cuh.hpp>
 #include "config.hpp"
 
-namespace cuda
-{
-namespace kernel
-{
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass, uint DIMX>
-    __global__
-    static void scan_first_kernel(Param<To> out,
-                                  Param<To> tmp,
-                                  CParam<Ti>  in,
-                                  uint blocks_x,
-                                  uint blocks_y,
-                                  uint lim)
-    {
-        const int tidx = threadIdx.x;
-        const int tidy = threadIdx.y;
-
-        const int zid = blockIdx.x / blocks_x;
-        const int wid = blockIdx.y / blocks_y;
-        const int blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const int blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const int xid = blockIdx_x * blockDim.x * lim + tidx;
-        const int yid = blockIdx_y * blockDim.y + tidy;
-
-        bool cond_yzw = (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
-
-        if (!cond_yzw) return; // retire warps early
-
-        const Ti *iptr = in.ptr;
-        To *optr = out.ptr;
-        To *tptr = tmp.ptr;
-
-        iptr += wid *  in.strides[3] + zid *  in.strides[2] + yid *  in.strides[1];
-        optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
-        tptr += wid * tmp.strides[3] + zid * tmp.strides[2] + yid * tmp.strides[1];
-
-
-        const int DIMY = THREADS_PER_BLOCK / DIMX;
-        const int SHARED_MEM_SIZE = (2 * DIMX + 1) * (DIMY);
-
-        __shared__ To s_val[SHARED_MEM_SIZE];
-        __shared__ To s_tmp[DIMY];
-
-        To *sptr = s_val + tidy * (2 * DIMX + 1);
-
-        Transform<Ti, To, op> transform;
-        Binary<To, op> binop;
-
-        const To init = binop.init();
-        int id = xid;
-        To val = init;
-
-        const bool isLast = (tidx == (DIMX - 1));
-
-        for (int k = 0; k < lim; k++) {
-
-            if (isLast) s_tmp[tidy] = val;
-
-            bool cond = ((id < out.dims[0]));
-            val = cond ? transform(iptr[id]) : init;
-            sptr[tidx] = val;
-            __syncthreads();
-
-
-            int start = 0;
-#pragma unroll
-            for (int off = 1; off < DIMX; off *= 2) {
-
-                if (tidx >= off) val = binop(val, sptr[(start - off) + tidx]);
-                start = DIMX - start;
-                sptr[start + tidx] = val;
-
-                __syncthreads();
-            }
-
-            val = binop(val, s_tmp[tidy]);
-            if (cond) optr[id] = val;
-            id += blockDim.x;
-            __syncthreads();
-        }
-
-        if (!isFinalPass && isLast) {
-            tptr[blockIdx_x] = val;
-        }
-    }
-
-    template<typename To, af_op_t op>
-    __global__
-    static void bcast_first_kernel(Param<To> out,
-                                   CParam<To> tmp,
-                                   uint blocks_x,
-                                   uint blocks_y,
-                                   uint lim)
-    {
-        const int tidx = threadIdx.x;
-        const int tidy = threadIdx.y;
-
-        const int zid = blockIdx.x / blocks_x;
-        const int wid = blockIdx.y / blocks_y;
-        const int blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const int blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const int xid = blockIdx_x * blockDim.x * lim + tidx;
-        const int yid = blockIdx_y * blockDim.y + tidy;
-
-        if (blockIdx_x == 0) return;
-
-        bool cond = (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
-        if (!cond) return;
-
-        To *optr = out.ptr;
-        const To *tptr = tmp.ptr;
-
-        optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
-        tptr += wid * tmp.strides[3] + zid * tmp.strides[2] + yid * tmp.strides[1];
-
-        Binary<To, op> binop;
-        To accum = tptr[blockIdx_x - 1];
-
-        for (int k = 0, id = xid;
-             k < lim && id < out.dims[0];
-             k++, id += blockDim.x) {
-
-            optr[id] = binop(accum, optr[id]);
-        }
-
-    }
-
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass>
-    static void scan_first_launcher(Param<To> out,
-                             Param<To> tmp,
-                             CParam<Ti> in,
-                             const uint blocks_x,
-                             const uint blocks_y,
-                             const uint threads_x)
-    {
-
-        dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
-        dim3 blocks(blocks_x * out.dims[2],
-                    blocks_y * out.dims[3]);
-
-        uint lim = divup(out.dims[0], (threads_x * blocks_x));
-
-        switch (threads_x) {
-        case 32:
-            (scan_first_kernel<Ti, To, op, isFinalPass,  32>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_x, blocks_y, lim); break;
-        case 64:
-            (scan_first_kernel<Ti, To, op, isFinalPass,  64>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_x, blocks_y, lim); break;
-        case 128:
-            (scan_first_kernel<Ti, To, op, isFinalPass,  128>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_x, blocks_y, lim); break;
-        case 256:
-            (scan_first_kernel<Ti, To, op, isFinalPass,  256>)<<<blocks, threads>>>(
-                out, tmp, in, blocks_x, blocks_y, lim); break;
-        }
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op>
+static void scan_first_launcher(Param<To> out, Param<To> tmp, CParam<Ti> in,
+                                const uint blocks_x, const uint blocks_y,
+                                const uint threads_x, bool isFinalPass,
+                                bool inclusive_scan) {
+    auto scan_first = common::getKernel(
+        "arrayfire::cuda::scan_first", {{scan_first_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<To>(),
+                     TemplateArg(op), TemplateArg(isFinalPass),
+                     TemplateArg(threads_x), TemplateArg(inclusive_scan)),
+        {{DefineValue(THREADS_PER_BLOCK)}});
+
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    uint lim = divup(out.dims[0], (threads_x * blocks_x));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scan_first(qArgs, out, tmp, in, blocks_x, blocks_y, lim);
+    POST_LAUNCH_CHECK();
+}
 
-        POST_LAUNCH_CHECK();
-    }
+template<typename To, af_op_t op>
+static void bcast_first_launcher(Param<To> out, CParam<To> tmp,
+                                 const uint blocks_x, const uint blocks_y,
+                                 const uint threads_x, bool inclusive_scan) {
+    auto scan_first_bcast = common::getKernel(
+        "arrayfire::cuda::scan_first_bcast", {{scan_first_cuh_src}},
+        TemplateArgs(TemplateTypename<To>(), TemplateArg(op)));
 
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
 
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-    template<typename To, af_op_t op>
-    static void bcast_first_launcher(Param<To> out,
-                                     CParam<To> tmp,
-                                     const uint blocks_x,
-                                     const uint blocks_y,
-                                     const uint threads_x)
-    {
+    uint lim = divup(out.dims[0], (threads_x * blocks_x));
 
-        dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
-        dim3 blocks(blocks_x * out.dims[2],
-                    blocks_y * out.dims[3]);
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scan_first_bcast(qArgs, out, tmp, blocks_x, blocks_y, lim, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
 
-        uint lim = divup(out.dims[0], (threads_x * blocks_x));
+template<typename Ti, typename To, af_op_t op>
+static void scan_first(Param<To> out, CParam<Ti> in, bool inclusive_scan) {
+    uint threads_x = nextpow2(std::max(32u, (uint)out.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
 
-        (bcast_first_kernel<To, op>)<<<blocks, threads>>>(
-            out, tmp, blocks_x, blocks_y, lim);
+    uint blocks_x = divup(out.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(out.dims[1], threads_y);
 
-        POST_LAUNCH_CHECK();
-    }
+    if (blocks_x == 1) {
+        scan_first_launcher<Ti, To, op>(out, out, in, blocks_x, blocks_y,
+                                        threads_x, true, inclusive_scan);
 
-    template<typename Ti, typename To, af_op_t op>
-    static void scan_first(Param<To> out, CParam<Ti> in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)out.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_BLOCK);
-        uint threads_y = THREADS_PER_BLOCK / threads_x;
+    } else {
+        Param<To> tmp = out;
 
-        uint blocks_x = divup(out.dims[0], threads_x * REPEAT);
-        uint blocks_y = divup(out.dims[1], threads_y);
+        tmp.dims[0]    = blocks_x;
+        tmp.strides[0] = 1;
+        for (int k = 1; k < 4; k++)
+            tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
 
-        if (blocks_x == 1) {
+        int tmp_elements = tmp.strides[3] * tmp.dims[3];
+        auto tmp_alloc   = memAlloc<To>(tmp_elements);
+        tmp.ptr          = tmp_alloc.get();
 
-            scan_first_launcher<Ti, To, op, true>(out, out, in,
-                                                  blocks_x, blocks_y,
-                                                  threads_x);
+        scan_first_launcher<Ti, To, op>(out, tmp, in, blocks_x, blocks_y,
+                                        threads_x, false, inclusive_scan);
 
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            scan_first_launcher<To, To, af_add_t>(tmp, tmp, tmp, 1, blocks_y,
+                                                  threads_x, true, true);
         } else {
-
-            Param<To> tmp = out;
-
-            tmp.dims[0] = blocks_x;
-            tmp.strides[0] = 1;
-            for (int k = 1; k < 4; k++) tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
-
-            int tmp_elements = tmp.strides[3] * tmp.dims[3];
-            tmp.ptr = memAlloc<To>(tmp_elements);
-
-            scan_first_launcher<Ti, To, op, false>(out, tmp, in,
-                                                   blocks_x, blocks_y,
-                                                   threads_x);
-
-            //FIXME: Is there an alternative to the if condition ?
-            if (op == af_notzero_t) {
-                scan_first_launcher<To, To, af_add_t, true>(tmp, tmp, tmp,
-                                                             1, blocks_y,
-                                                             threads_x);
-            } else {
-                scan_first_launcher<To, To,       op, true>(tmp, tmp, tmp,
-                                                            1, blocks_y,
-                                                            threads_x);
-            }
-
-            bcast_first_launcher<To, op>(out, tmp, blocks_x, blocks_y, threads_x);
-
-            memFree(tmp.ptr);
+            scan_first_launcher<To, To, op>(tmp, tmp, tmp, 1, blocks_y,
+                                            threads_x, true, true);
         }
-    }
 
+        bcast_first_launcher<To, op>(out, tmp, blocks_x, blocks_y, threads_x,
+                                     inclusive_scan);
+    }
 }
-}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
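
Reviewer note: the new launchers fold an oversized `blocks.y` into `blocks.z` using two `divup` calls, and the kernels recover the linear index as `blockIdx.y + blockIdx.z * gridDim.y`. The sketch below reproduces only that arithmetic on the host so the rounding can be checked by hand. `Grid2`, `divUp`, `foldGridY`, and `linearBlockY` are hypothetical helper names, and the limit 65535 used in the example values is an assumed device limit, not a value taken from this patch.

```cpp
#include <cassert>

// Hypothetical helpers mirroring the launcher arithmetic; not ArrayFire APIs.
struct Grid2 {
    unsigned y;
    unsigned z;
};

constexpr unsigned divUp(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Split a requested y extent into (y, z) so that y never exceeds maxBlocksY.
Grid2 foldGridY(unsigned blocksY, unsigned maxBlocksY) {
    assert(blocksY > 0 && maxBlocksY > 0);
    Grid2 g{};
    g.z = divUp(blocksY, maxBlocksY);  // number of y "slabs" needed
    g.y = divUp(blocksY, g.z);         // slab height, at most maxBlocksY
    return g;
}

// Kernel-side reconstruction of the original linear y block index.
constexpr unsigned linearBlockY(unsigned blockIdxY, unsigned blockIdxZ,
                                unsigned gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

Because both steps round up, `y * z` can exceed the requested extent: a request of 131071 with a limit of 65535 folds to `y = 43691`, `z = 3`, which launches 131073 blocks. The two surplus blocks map to linear indices past the end, which is why the kernels still need their `wid < out.dims[3]` bounds check.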
diff --git a/src/backend/cuda/kernel/scan_first_by_key.cuh b/src/backend/cuda/kernel/scan_first_by_key.cuh
new file mode 100644
index 0000000000..8f876e2470
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_first_by_key.cuh
@@ -0,0 +1,317 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Tk>
+__device__ inline char calculate_head_flags(const Tk *kptr, int id,
+                                            int previd) {
+    return (id == 0) ? 1 : (kptr[id] != kptr[previd]);
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+__global__ void scanbykey_first_nonfinal(Param<To> out, Param<To> tmp,
+                                         Param<char> tflg, Param<int> tlid,
+                                         CParam<Ti> in, CParam<Tk> key,
+                                         uint blocks_x, uint blocks_y, uint lim,
+                                         bool inclusive_scan) {
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+    const To init = common::Binary<To, op>::init();
+    To val        = init;
+
+    const int istride         = in.strides[0];
+    const int DIMY            = THREADS_PER_BLOCK / DIMX;
+    const int SHARED_MEM_SIZE = (2 * DIMX + 1) * (DIMY);
+    __shared__ char s_flg[SHARED_MEM_SIZE];
+    __shared__ To s_val[SHARED_MEM_SIZE];
+    __shared__ char s_ftmp[DIMY];
+    __shared__ To s_tmp[DIMY];
+    __shared__ int boundaryid[DIMY];
+
+    const int tidx       = threadIdx.x;
+    const int tidy       = threadIdx.y;
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x * lim + tidx;
+    const int yid        = blockIdx_y * blockDim.y + tidy;
+    bool cond_yzw =
+        (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
+    if (!cond_yzw) return;  // retire warps early
+
+    To *sptr    = s_val + tidy * (2 * DIMX + 1);
+    char *sfptr = s_flg + tidy * (2 * DIMX + 1);
+    int id      = xid;
+
+    const bool isLast = (tidx == (DIMX - 1));
+    if (isLast) {
+        s_tmp[tidy]      = init;
+        s_ftmp[tidy]     = 0;
+        boundaryid[tidy] = -1;
+    }
+    __syncthreads();
+
+    const Ti *iptr = in.ptr;
+    const Tk *kptr = key.ptr;
+    To *optr       = out.ptr;
+    To *tptr       = tmp.ptr;
+    char *tfptr    = tflg.ptr;
+    int *tiptr     = tlid.ptr;
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    kptr += wid * key.strides[3] + zid * key.strides[2] + yid * key.strides[1];
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+    tptr += wid * tmp.strides[3] + zid * tmp.strides[2] + yid * tmp.strides[1];
+    tfptr +=
+        wid * tflg.strides[3] + zid * tflg.strides[2] + yid * tflg.strides[1];
+    tiptr +=
+        wid * tlid.strides[3] + zid * tlid.strides[2] + yid * tlid.strides[1];
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        if (id < out.dims[0]) {
+            flag = calculate_head_flags(kptr, id, id - 1);
+        } else {
+            flag = 0;
+        }
+
+        // Load val from global in
+        if (inclusive_scan) {
+            if (id >= out.dims[0]) {
+                val = init;
+            } else {
+                val = transform(iptr[id]);
+            }
+        } else {
+            if ((id == 0) || (id >= out.dims[0]) || flag) {
+                val = init;
+            } else {
+                val = transform(iptr[id - istride]);
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((tidx == 0) && (flag == 0)) {
+            val  = binop(val, s_tmp[tidy]);
+            flag = s_ftmp[tidy];
+        }
+
+        // Write to shared memory
+        sptr[tidx]  = val;
+        sfptr[tidx] = flag;
+        __syncthreads();
+
+        // Segmented Scan
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMX; off *= 2) {
+            if (tidx >= off) {
+                val  = sfptr[start + tidx]
+                           ? val
+                           : binop(val, sptr[(start - off) + tidx]);
+                flag = sfptr[start + tidx] | sfptr[(start - off) + tidx];
+            }
+            start               = DIMX - start;
+            sptr[start + tidx]  = val;
+            sfptr[start + tidx] = flag;
+
+            __syncthreads();
+        }
+
+        // Identify segment boundary
+        if (tidx == 0) {
+            if ((s_ftmp[tidy] == 0) && (sfptr[tidx] == 1)) {
+                boundaryid[tidy] = id;
+            }
+        } else {
+            if ((sfptr[tidx - 1] == 0) && (sfptr[tidx] == 1)) {
+                boundaryid[tidy] = id;
+            }
+        }
+        __syncthreads();
+
+        if (id < out.dims[0]) optr[id] = val;
+        if (isLast) {
+            s_tmp[tidy]  = val;
+            s_ftmp[tidy] = flag;
+        }
+        id += blockDim.x;
+        __syncthreads();
+    }
+    if (isLast) {
+        tptr[blockIdx_x]  = val;
+        tfptr[blockIdx_x] = flag;
+        int boundary      = boundaryid[tidy];
+        tiptr[blockIdx_x] = (boundary == -1) ? id : boundary;
+    }
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+__global__ void scanbykey_first_final(Param<To> out, CParam<Ti> in,
+                                      CParam<Tk> key, uint blocks_x,
+                                      uint blocks_y, uint lim,
+                                      bool calculateFlags,
+                                      bool inclusive_scan) {
+    common::Transform<Ti, To, op> transform;
+    common::Binary<To, op> binop;
+    const To init = common::Binary<To, op>::init();
+    To val        = init;
+
+    const int istride         = in.strides[0];
+    const int DIMY            = THREADS_PER_BLOCK / DIMX;
+    const int SHARED_MEM_SIZE = (2 * DIMX + 1) * (DIMY);
+    __shared__ char s_flg[SHARED_MEM_SIZE];
+    __shared__ To s_val[SHARED_MEM_SIZE];
+    __shared__ char s_ftmp[DIMY];
+    __shared__ To s_tmp[DIMY];
+
+    const int tidx       = threadIdx.x;
+    const int tidy       = threadIdx.y;
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x * lim + tidx;
+    const int yid        = blockIdx_y * blockDim.y + tidy;
+    bool cond_yzw =
+        (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
+    if (!cond_yzw) return;  // retire warps early
+
+    To *sptr    = s_val + tidy * (2 * DIMX + 1);
+    char *sfptr = s_flg + tidy * (2 * DIMX + 1);
+    int id      = xid;
+
+    const bool isLast = (tidx == (DIMX - 1));
+    if (isLast) {
+        s_tmp[tidy]  = init;
+        s_ftmp[tidy] = 0;
+    }
+    __syncthreads();
+
+    const Ti *iptr = in.ptr;
+    const Tk *kptr = key.ptr;
+    To *optr       = out.ptr;
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+    kptr += wid * key.strides[3] + zid * key.strides[2] + yid * key.strides[1];
+    optr += wid * out.strides[3] + zid * out.strides[2] + yid * out.strides[1];
+
+    for (int k = 0; k < lim; k++) {
+        char flag = 0;
+        if (calculateFlags) {
+            if (id < out.dims[0]) {
+                flag = calculate_head_flags(kptr, id, id - key.strides[0]);
+            }
+        } else {
+            flag = kptr[id];
+        }
+
+        // Load val from global in
+        if (inclusive_scan) {
+            if (id >= out.dims[0]) {
+                val = init;
+            } else {
+                val = transform(iptr[id]);
+            }
+        } else {
+            if ((id == 0) || (id >= out.dims[0]) || flag) {
+                val = init;
+            } else {
+                val = transform(iptr[id - istride]);
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((tidx == 0) && (flag == 0)) {
+            val  = binop(val, s_tmp[tidy]);
+            flag = flag | s_ftmp[tidy];
+        }
+
+        // Write to shared memory
+        sptr[tidx]  = val;
+        sfptr[tidx] = flag;
+        __syncthreads();
+
+        // Segmented Scan
+        int start = 0;
+#pragma unroll
+        for (int off = 1; off < DIMX; off *= 2) {
+            if (tidx >= off) {
+                val  = sfptr[start + tidx]
+                           ? val
+                           : binop(val, sptr[(start - off) + tidx]);
+                flag = sfptr[start + tidx] | sfptr[(start - off) + tidx];
+            }
+            start               = DIMX - start;
+            sptr[start + tidx]  = val;
+            sfptr[start + tidx] = flag;
+
+            __syncthreads();
+        }
+
+        if (id < out.dims[0]) optr[id] = val;
+        if (isLast) {
+            s_tmp[tidy]  = val;
+            s_ftmp[tidy] = flag;
+        }
+        id += blockDim.x;
+        __syncthreads();
+    }
+}
+
+template<typename To, af_op_t op>
+__global__ void scanbykey_first_bcast(Param<To> out, Param<To> tmp,
+                                      Param<int> tlid, uint blocks_x,
+                                      uint blocks_y, uint lim) {
+    const int tidx = threadIdx.x;
+    const int tidy = threadIdx.y;
+
+    const int zid        = blockIdx.x / blocks_x;
+    const int wid        = blockIdx.y / blocks_y;
+    const int blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const int blockIdx_y = blockIdx.y - (blocks_y)*wid;
+    const int xid        = blockIdx_x * blockDim.x * lim + tidx;
+    const int yid        = blockIdx_y * blockDim.y + tidy;
+
+    if (blockIdx_x != 0) {
+        bool cond =
+            (yid < out.dims[1]) && (zid < out.dims[2]) && (wid < out.dims[3]);
+        if (cond) {
+            To *optr        = out.ptr;
+            const To *tptr  = tmp.ptr;
+            const int *iptr = tlid.ptr;
+
+            optr += wid * out.strides[3] + zid * out.strides[2] +
+                    yid * out.strides[1];
+            tptr += wid * tmp.strides[3] + zid * tmp.strides[2] +
+                    yid * tmp.strides[1];
+            iptr += wid * tlid.strides[3] + zid * tlid.strides[2] +
+                    yid * tlid.strides[1];
+
+            common::Binary<To, op> binop;
+            int boundary = iptr[blockIdx_x];
+            To accum     = tptr[blockIdx_x - 1];
+
+            for (int k = 0, id = xid;
+                 k < lim && id < out.dims[0] && id < boundary;
+                 k++, id += blockDim.x) {
+                optr[id] = binop(accum, optr[id]);
+            }
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
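Reviewer note on scan_first_by_key.cuh: the kernels above compute a segmented scan along the first dimension, restarting at every head flag. The host-side sketch below is one way to express the expected result for a single column with addition as the operator. It is not ArrayFire code: the name `scanByKeyReference` is hypothetical, and it assumes a head flag is set wherever a key differs from its predecessor (`calculate_head_flags` is defined outside this excerpt).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a 1D scan-by-key with operator +.
// Assumption: a new segment starts at index 0 and wherever the key changes.
std::vector<int> scanByKeyReference(const std::vector<int>& keys,
                                    const std::vector<int>& vals,
                                    bool inclusive) {
    assert(keys.size() == vals.size());
    std::vector<int> out(vals.size());
    int running = 0;  // identity element of addition
    for (std::size_t i = 0; i < vals.size(); ++i) {
        const bool head = (i == 0) || (keys[i] != keys[i - 1]);
        if (head) running = 0;  // restart the scan at a segment head
        if (inclusive) {
            running += vals[i];
            out[i] = running;   // includes the current element
        } else {
            out[i] = running;   // excludes the current element
            running += vals[i];
        }
    }
    return out;
}
```

A reference like this could be used in a unit test to compare against the device output for both values of `inclusive_scan`.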
diff --git a/src/backend/cuda/kernel/scan_first_by_key.hpp b/src/backend/cuda/kernel/scan_first_by_key.hpp
new file mode 100644
index 0000000000..80491a1c65
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_first_by_key.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Param.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scan_first_by_key(Param<To> out, CParam<Ti> in, CParam<Tk> key,
+                       bool inclusive_scan);
+}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/scan_first_by_key_impl.hpp b/src/backend/cuda/kernel/scan_first_by_key_impl.hpp
new file mode 100644
index 0000000000..bf873fdd3d
--- /dev/null
+++ b/src/backend/cuda/kernel/scan_first_by_key_impl.hpp
@@ -0,0 +1,157 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <kernel/config.hpp>
+#include <memory.hpp>
+#include <nvrtc_kernel_headers/scan_first_by_key_cuh.hpp>
+#include <optypes.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scan_nonfinal_launcher(Param<To> out, Param<To> tmp,
+                                   Param<char> tflg, Param<int> tlid,
+                                   CParam<Ti> in, CParam<Tk> key,
+                                   const uint blocks_x, const uint blocks_y,
+                                   const uint threads_x, bool inclusive_scan) {
+    auto scanbykey_first_nonfinal = common::getKernel(
+        "arrayfire::cuda::scanbykey_first_nonfinal",
+        {{scan_first_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<Tk>(),
+                     TemplateTypename<To>(), TemplateArg(op)),
+        {{DefineValue(THREADS_PER_BLOCK), DefineKeyValue(DIMX, threads_x)}});
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    uint lim = divup(out.dims[0], (threads_x * blocks_x));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_first_nonfinal(qArgs, out, tmp, tflg, tlid, in, key, blocks_x,
+                             blocks_y, lim, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scan_final_launcher(Param<To> out, CParam<Ti> in, CParam<Tk> key,
+                                const uint blocks_x, const uint blocks_y,
+                                const uint threads_x, bool calculateFlags,
+                                bool inclusive_scan) {
+    auto scanbykey_first_final = common::getKernel(
+        "arrayfire::cuda::scanbykey_first_final", {{scan_first_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<Tk>(),
+                     TemplateTypename<To>(), TemplateArg(op)),
+        {{DefineValue(THREADS_PER_BLOCK), DefineKeyValue(DIMX, threads_x)}});
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    uint lim = divup(out.dims[0], (threads_x * blocks_x));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_first_final(qArgs, out, in, key, blocks_x, blocks_y, lim,
+                          calculateFlags, inclusive_scan);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename To, af_op_t op>
+static void bcast_first_launcher(Param<To> out, Param<To> tmp, Param<int> tlid,
+                                 const dim_t blocks_x, const dim_t blocks_y,
+                                 const uint threads_x) {
+    auto scanbykey_first_bcast = common::getKernel(
+        "arrayfire::cuda::scanbykey_first_bcast", {{scan_first_by_key_cuh_src}},
+        TemplateArgs(TemplateTypename<To>(), TemplateArg(op)));
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+    uint lim = divup(out.dims[0], (threads_x * blocks_x));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    scanbykey_first_bcast(qArgs, out, tmp, tlid, blocks_x, blocks_y, lim);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scan_first_by_key(Param<To> out, CParam<Ti> in, CParam<Tk> key,
+                       bool inclusive_scan) {
+    uint threads_x = nextpow2(std::max(32u, (uint)out.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = static_cast<uint>(divup(out.dims[0], threads_x * REPEAT));
+    uint blocks_y = static_cast<uint>(divup(out.dims[1], threads_y));
+
+    if (blocks_x == 1) {
+        scan_final_launcher<Ti, Tk, To, op>(out, in, key, blocks_x, blocks_y,
+                                            threads_x, true, inclusive_scan);
+    } else {
+        Param<char> tmpflg;
+        Param<int> tmpid;
+        Param<To> tmp  = out;
+        tmp.dims[0]    = blocks_x;
+        tmp.strides[0] = 1;
+        for (int k = 1; k < AF_MAX_DIMS; k++) {
+            tmp.strides[k] = tmp.strides[k - 1] * tmp.dims[k - 1];
+        }
+        for (int k = 0; k < AF_MAX_DIMS; k++) {
+            tmpflg.dims[k]    = tmp.dims[k];
+            tmpflg.strides[k] = tmp.strides[k];
+            tmpid.dims[k]     = tmp.dims[k];
+            tmpid.strides[k]  = tmp.strides[k];
+        }
+
+        int tmp_elements  = tmp.strides[3] * tmp.dims[3];
+        auto tmp_alloc    = memAlloc<To>(tmp_elements);
+        auto tmpflg_alloc = memAlloc<char>(tmp_elements);
+        auto tmpid_alloc  = memAlloc<int>(tmp_elements);
+        tmp.ptr           = tmp_alloc.get();
+        tmpflg.ptr        = tmpflg_alloc.get();
+        tmpid.ptr         = tmpid_alloc.get();
+
+        scan_nonfinal_launcher<Ti, Tk, To, op>(out, tmp, tmpflg, tmpid, in, key,
+                                               blocks_x, blocks_y, threads_x,
+                                               inclusive_scan);
+
+        scan_final_launcher<To, char, To, op>(tmp, tmp, tmpflg, 1, blocks_y,
+                                              threads_x, false, true);
+
+        bcast_first_launcher<To, op>(out, tmp, tmpid, blocks_x, blocks_y,
+                                     threads_x);
+    }
+}
+}  // namespace kernel
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, Ti, Tk, To) \
+    template void scan_first_by_key<Ti, Tk, To, ROp>(  \
+        Param<To> out, CParam<Ti> in, CParam<Tk> key, bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, Tk)         \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY_OP(ROp)      \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, int)  \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, uint) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, intl) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, uintl)
+}  // namespace cuda
+}  // namespace arrayfire
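Reviewer note on scan_first_by_key_impl.hpp: the launch arithmetic in `scan_first_by_key` and its launchers can be checked in isolation. The sketch below restates it on the host under stated assumptions: `nextPow2` and `divUp` are hypothetical stand-ins for the `nextpow2` and `divup` helpers, and `threadsPerBlock` and `repeat` are passed as parameters because `THREADS_PER_BLOCK` and `REPEAT` are defined outside this excerpt.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the values derived in scan_first_by_key.
struct ScanLaunchConfig {
    unsigned threads_x, threads_y, blocks_x, blocks_y, lim;
};

// Smallest power of two that is >= x (stand-in for nextpow2).
inline unsigned nextPow2(unsigned x) {
    unsigned p = 1;
    while (p < x) p *= 2;
    return p;
}

// Integer ceiling division (stand-in for divup).
inline unsigned divUp(unsigned a, unsigned b) { return (a + b - 1) / b; }

inline ScanLaunchConfig scanLaunchConfig(unsigned dim0, unsigned dim1,
                                         unsigned threadsPerBlock,
                                         unsigned repeat) {
    assert(dim0 > 0 && dim1 > 0 && threadsPerBlock >= 32 && repeat > 0);
    ScanLaunchConfig c{};
    // At least one warp wide, at most a full block wide.
    c.threads_x = std::min(nextPow2(std::max(32u, dim0)), threadsPerBlock);
    c.threads_y = threadsPerBlock / c.threads_x;
    c.blocks_x  = divUp(dim0, c.threads_x * repeat);
    c.blocks_y  = divUp(dim1, c.threads_y);
    // Iterations each block performs along the first dimension.
    c.lim = divUp(dim0, c.threads_x * c.blocks_x);
    return c;
}
```

When `blocks_x` comes out as 1 the single-pass path is taken; otherwise the per-block totals are scanned in a second pass and broadcast back in a third.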
diff --git a/src/backend/cuda/kernel/select.cuh b/src/backend/cuda/kernel/select.cuh
new file mode 100644
index 0000000000..c5988594cd
--- /dev/null
+++ b/src/backend/cuda/kernel/select.cuh
@@ -0,0 +1,103 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+int getOffset(dim_t *dims, dim_t *strides, dim_t *refdims, int ids[4]) {
+    int off = 0;
+    off += ids[3] * (dims[3] == refdims[3]) * strides[3];
+    off += ids[2] * (dims[2] == refdims[2]) * strides[2];
+    off += ids[1] * (dims[1] == refdims[1]) * strides[1];
+    return off;
+}
+
+template<typename T, bool is_same>
+__global__ void select(Param<T> out, CParam<char> cond, CParam<T> a,
+                       CParam<T> b, int blk_x, int blk_y) {
+    const int idz = blockIdx.x / blk_x;
+    const int idw = (blockIdx.y + blockIdx.z * gridDim.y) / blk_y;
+
+    const int blockIdx_x = blockIdx.x - idz * blk_x;
+    const int blockIdx_y = (blockIdx.y + blockIdx.z * gridDim.y) - idw * blk_y;
+
+    const int idy  = blockIdx_y * blockDim.y + threadIdx.y;
+    const int idx0 = blockIdx_x * blockDim.x + threadIdx.x;
+
+    if (idw >= out.dims[3] || idz >= out.dims[2] || idy >= out.dims[1]) {
+        return;
+    }
+
+    const int off =
+        idw * out.strides[3] + idz * out.strides[2] + idy * out.strides[1];
+    T *optr = out.ptr + off;
+
+    const T *aptr    = a.ptr;
+    const T *bptr    = b.ptr;
+    const char *cptr = cond.ptr;
+
+    int ids[] = {idx0, idy, idz, idw};
+    aptr += getOffset(a.dims, a.strides, out.dims, ids);
+    bptr += getOffset(b.dims, b.strides, out.dims, ids);
+    cptr += getOffset(cond.dims, cond.strides, out.dims, ids);
+
+    if (is_same) {
+        for (int idx = idx0; idx < out.dims[0]; idx += blockDim.x * blk_x) {
+            optr[idx] = cptr[idx] ? aptr[idx] : bptr[idx];
+        }
+    } else {
+        bool csame = cond.dims[0] == out.dims[0];
+        bool asame = a.dims[0] == out.dims[0];
+        bool bsame = b.dims[0] == out.dims[0];
+        for (int idx = idx0; idx < out.dims[0]; idx += blockDim.x * blk_x) {
+            optr[idx] =
+                cptr[csame * idx] ? aptr[asame * idx] : bptr[bsame * idx];
+        }
+    }
+}
+
+template<typename T, bool flip>
+__global__ void selectScalar(Param<T> out, CParam<char> cond, CParam<T> a, T b,
+                             int blk_x, int blk_y) {
+    const int idz = blockIdx.x / blk_x;
+    const int idw = (blockIdx.y + blockIdx.z * gridDim.y) / blk_y;
+
+    const int blockIdx_x = blockIdx.x - idz * blk_x;
+    const int blockIdx_y = (blockIdx.y + blockIdx.z * gridDim.y) - idw * blk_y;
+
+    const int idx0 = blockIdx_x * blockDim.x + threadIdx.x;
+    const int idy  = blockIdx_y * blockDim.y + threadIdx.y;
+
+    const int off =
+        idw * out.strides[3] + idz * out.strides[2] + idy * out.strides[1];
+
+    T *optr = out.ptr + off;
+
+    const T *aptr    = a.ptr;
+    const char *cptr = cond.ptr;
+
+    int ids[] = {idx0, idy, idz, idw};
+    aptr += getOffset(a.dims, a.strides, out.dims, ids);
+    cptr += getOffset(cond.dims, cond.strides, out.dims, ids);
+
+    if (idw >= out.dims[3] || idz >= out.dims[2] || idy >= out.dims[1]) {
+        return;
+    }
+
+    for (int idx = idx0; idx < out.dims[0]; idx += blockDim.x * blk_x) {
+        optr[idx] = ((cptr[idx]) ^ flip) ? aptr[idx] : b;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
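Reviewer note on select.cuh: in the non-`is_same` branch, an operand whose first dimension differs from the output is read at index 0 for every element, because its index is multiplied by a 0/1 flag. The host-side sketch below shows that indexing for one row. It is a simplified illustration, not ArrayFire code: `selectReference` is a hypothetical name, and it assumes each operand has either `n` elements or exactly one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for one row of select with broadcasting.
// Assumption: each operand has either n elements or a single element.
template<typename T>
std::vector<T> selectReference(const std::vector<char>& cond,
                               const std::vector<T>& a,
                               const std::vector<T>& b, std::size_t n) {
    assert(cond.size() == n || cond.size() == 1);
    assert(a.size() == n || a.size() == 1);
    assert(b.size() == n || b.size() == 1);
    // 1 when the operand spans the row, 0 when it is broadcast.
    const std::size_t csame = (cond.size() == n) ? 1 : 0;
    const std::size_t asame = (a.size() == n) ? 1 : 0;
    const std::size_t bsame = (b.size() == n) ? 1 : 0;
    std::vector<T> out(n);
    for (std::size_t idx = 0; idx < n; ++idx) {
        out[idx] = cond[csame * idx] ? a[asame * idx] : b[bsame * idx];
    }
    return out;
}
```

The `getOffset` helper applies the same idea to the higher dimensions by multiplying each index by a dimension-equality test.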
diff --git a/src/backend/cuda/kernel/select.hpp b/src/backend/cuda/kernel/select.hpp
new file mode 100644
index 0000000000..4df1d3da83
--- /dev/null
+++ b/src/backend/cuda/kernel/select.hpp
@@ -0,0 +1,86 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <math.hpp>
+#include <nvrtc_kernel_headers/select_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr uint DIMX  = 32;
+constexpr uint DIMY  = 8;
+constexpr int REPEAT = 64;
+
+template<typename T>
+void select(Param<T> out, CParam<char> cond, CParam<T> a, CParam<T> b,
+            int ndims) {
+    bool is_same = true;
+    for (int i = 0; i < 4; i++) { is_same &= (a.dims[i] == b.dims[i]); }
+
+    auto select = common::getKernel(
+        "arrayfire::cuda::select", {{select_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(is_same)));
+
+    dim3 threads(DIMX, DIMY);
+
+    if (ndims == 1) {
+        threads.x *= threads.y;
+        threads.y = 1;
+    }
+
+    int blk_x = divup(out.dims[0], REPEAT * threads.x);
+    int blk_y = divup(out.dims[1], threads.y);
+
+    dim3 blocks(blk_x * out.dims[2], blk_y * out.dims[3]);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    select(qArgs, out, cond, a, b, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T>
+void select_scalar(Param<T> out, CParam<char> cond, CParam<T> a, const T b,
+                   int ndims, bool flip) {
+    auto selectScalar = common::getKernel(
+        "arrayfire::cuda::selectScalar", {{select_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(flip)));
+
+    dim3 threads(DIMX, DIMY);
+
+    if (ndims == 1) {
+        threads.x *= threads.y;
+        threads.y = 1;
+    }
+
+    int blk_x = divup(out.dims[0], REPEAT * threads.x);
+    int blk_y = divup(out.dims[1], threads.y);
+
+    dim3 blocks(blk_x * out.dims[2], blk_y * out.dims[3]);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    selectScalar(qArgs, out, cond, a, b, blk_x, blk_y);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
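Reviewer note on select.hpp: `select` folds an oversized `blocks.y` into `blocks.z`, and the kernel rebuilds the linear index as `blockIdx.y + blockIdx.z * gridDim.y`. The sketch below reproduces that arithmetic on the host. The names `GridYZ`, `splitGridY` and `linearBlockY` are hypothetical, and the device limit is passed in as a plain parameter.

```cpp
#include <cassert>

// Hypothetical pair holding the y and z grid extents after the split.
struct GridYZ {
    unsigned y, z;
};

// Fold a y extent that may exceed the device limit into y and z.
inline GridYZ splitGridY(unsigned blocksY, unsigned maxBlocksY) {
    assert(blocksY > 0 && maxBlocksY > 0);
    GridYZ g{};
    g.z = (blocksY + maxBlocksY - 1) / maxBlocksY;  // divup(blocks.y, max)
    g.y = (blocksY + g.z - 1) / g.z;                // divup(blocks.y, z)
    return g;
}

// Linear y index as reconstructed inside the kernel.
inline unsigned linearBlockY(unsigned blockIdxY, unsigned blockIdxZ,
                             unsigned gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

Since `y * z` can be slightly larger than the original extent, the kernel's early return on out-of-range indices is what keeps the surplus blocks harmless. Note that `select_scalar` does not perform this split in the code shown.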
diff --git a/src/backend/cuda/kernel/shared.hpp b/src/backend/cuda/kernel/shared.hpp
index eb7b432a12..d1f15653c3 100644
--- a/src/backend/cuda/kernel/shared.hpp
+++ b/src/backend/cuda/kernel/shared.hpp
@@ -9,32 +9,39 @@
 
 #pragma once
 
-namespace cuda
-{
+#ifdef __CUDACC_RTC__
 
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+struct SharedMemory {
+    __DH__ T* getPointer() {
+        extern __shared__ T ptr[];
+        return ptr;
+    }
+};
+}  // namespace cuda
+}  // namespace arrayfire
+
+#else
 
-template <typename T>
-struct SharedMemory
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+struct SharedMemory {
     // return a pointer to the runtime-sized shared memory array.
-    __device__ T* getPointer()
-    {
-        extern __device__ void Error_UnsupportedType(); // Ensure that we won't compile any un-specialized types
-        Error_UnsupportedType();
-        return (T*)0;
-    }
+    __device__ T* getPointer();
 };
 
-#define SPECIALIZE(T)                           \
-    template <>                                 \
-    struct SharedMemory <T>                     \
-    {                                           \
-        __device__ T* getPointer() {            \
-            extern __shared__ T ptr_##T##_[];   \
-                return ptr_##T##_;              \
-        }                                       \
+#define SPECIALIZE(T)                         \
+    template<>                                \
+    struct SharedMemory<T> {                  \
+        __device__ T* getPointer() {          \
+            extern __shared__ T ptr_##T##_[]; \
+            return ptr_##T##_;                \
+        }                                     \
     };
 
 SPECIALIZE(float)
@@ -44,9 +51,17 @@ SPECIALIZE(cdouble)
 SPECIALIZE(char)
 SPECIALIZE(int)
 SPECIALIZE(uint)
+SPECIALIZE(short)
+SPECIALIZE(ushort)
+SPECIALIZE(schar)
 SPECIALIZE(uchar)
+SPECIALIZE(intl)
+SPECIALIZE(uintl)
 
 #undef SPECIALIZE
 
-}
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
+
+#endif
diff --git a/src/backend/cuda/kernel/shfl_intrinsics.hpp b/src/backend/cuda/kernel/shfl_intrinsics.hpp
new file mode 100644
index 0000000000..a91dc74148
--- /dev/null
+++ b/src/backend/cuda/kernel/shfl_intrinsics.hpp
@@ -0,0 +1,113 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr unsigned int FULL_MASK = 0xffffffff;
+
+//__all_sync wrapper
+template<typename T>
+__device__ T all_sync(T var) {
+#if (CUDA_VERSION >= 9000)
+    return __all_sync(FULL_MASK, var);
+#else
+    return __all(var);
+#endif
+}
+
+//__any_sync wrapper
+template<typename T>
+__device__ T any_sync(T var) {
+#if (CUDA_VERSION >= 9000)
+    return __any_sync(FULL_MASK, var);
+#else
+    return __any(var);
+#endif
+}
+
+//__ballot_sync wrapper
+template<typename T>
+__device__ T ballot_sync(T var) {
+#if (CUDA_VERSION >= 9000)
+    return __ballot_sync(FULL_MASK, var);
+#else
+    return __ballot(var);
+#endif
+}
+
+//__shfl_down_sync wrapper
+template<typename T>
+__device__ T shfl_down_sync(T var, int delta) {
+#if (CUDA_VERSION >= 9000)
+    return __shfl_down_sync(FULL_MASK, var, delta);
+#else
+    return __shfl_down(var, delta);
+#endif
+}
+// specialization for cfloat
+template<>
+inline __device__ cfloat shfl_down_sync(cfloat var, int delta) {
+#if (CUDA_VERSION >= 9000)
+    cfloat res = {__shfl_down_sync(FULL_MASK, var.x, delta),
+                  __shfl_down_sync(FULL_MASK, var.y, delta)};
+#else
+    cfloat res  = {__shfl_down(var.x, delta), __shfl_down(var.y, delta)};
+#endif
+    return res;
+}
+// specialization for cdouble
+template<>
+inline __device__ cdouble shfl_down_sync(cdouble var,
+                                         int delta) {
+#if (CUDA_VERSION >= 9000)
+    cdouble res = {__shfl_down_sync(FULL_MASK, var.x, delta),
+                   __shfl_down_sync(FULL_MASK, var.y, delta)};
+#else
+    cdouble res = {__shfl_down(var.x, delta), __shfl_down(var.y, delta)};
+#endif
+    return res;
+}
+
+//__shfl_up_sync wrapper
+template<typename T>
+__device__ T shfl_up_sync(T var, int delta) {
+#if (CUDA_VERSION >= 9000)
+    return __shfl_up_sync(FULL_MASK, var, delta);
+#else
+    return __shfl_up(var, delta);
+#endif
+}
+// specialization for cfloat
+template<>
+inline __device__ cfloat shfl_up_sync(cfloat var, int delta) {
+#if (CUDA_VERSION >= 9000)
+    cfloat res = {__shfl_up_sync(FULL_MASK, var.x, delta),
+                  __shfl_up_sync(FULL_MASK, var.y, delta)};
+#else
+    cfloat res  = {__shfl_up(var.x, delta), __shfl_up(var.y, delta)};
+#endif
+    return res;
+}
+// specialization for cdouble
+template<>
+inline __device__ cdouble shfl_up_sync(cdouble var, int delta) {
+#if (CUDA_VERSION >= 9000)
+    cdouble res = {__shfl_up_sync(FULL_MASK, var.x, delta),
+                   __shfl_up_sync(FULL_MASK, var.y, delta)};
+#else
+    cdouble res = {__shfl_up(var.x, delta), __shfl_up(var.y, delta)};
+#endif
+    return res;
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
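Reviewer note on shfl_intrinsics.hpp: a common use of a `shfl_down_sync` wrapper is a warp-level tree reduction. The sketch below simulates that pattern on the host so it can be reasoned about without a GPU. It is an illustration only: `shflDownStep` and `warpReduceSum` are hypothetical names, and the simulation assumes a full 32-lane warp in which lanes whose source falls outside the warp keep their own value.

```cpp
#include <array>
#include <cassert>

constexpr int kWarpSize = 32;

// Host simulation of one shuffle-down step: lane i reads lane i + delta,
// and lanes whose source is out of range read their own value.
inline std::array<int, kWarpSize> shflDownStep(
    const std::array<int, kWarpSize>& lanes, int delta) {
    assert(delta >= 0 && delta < kWarpSize);
    std::array<int, kWarpSize> out{};
    for (int lane = 0; lane < kWarpSize; ++lane) {
        const int src = lane + delta;
        out[lane]     = (src < kWarpSize) ? lanes[src] : lanes[lane];
    }
    return out;
}

// Tree reduction: after log2(32) = 5 steps, lane 0 holds the warp sum.
inline int warpReduceSum(std::array<int, kWarpSize> lanes) {
    for (int delta = kWarpSize / 2; delta > 0; delta /= 2) {
        const std::array<int, kWarpSize> shifted = shflDownStep(lanes, delta);
        for (int lane = 0; lane < kWarpSize; ++lane) {
            lanes[lane] += shifted[lane];
        }
    }
    return lanes[0];
}
```

Only lane 0 is guaranteed to hold the full sum at the end; the other lanes hold partial results.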
diff --git a/src/backend/cuda/kernel/shift.hpp b/src/backend/cuda/kernel/shift.hpp
deleted file mode 100644
index 5cbed9f3c4..0000000000
--- a/src/backend/cuda/kernel/shift.hpp
+++ /dev/null
@@ -1,104 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <math.hpp>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <cassert>
-
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 128;
-        static const unsigned TILEY = 32;
-
-        __host__ __device__
-        static inline int simple_mod(const int i, const int dim)
-        {
-            return (i < dim) ? i : (i - dim);
-        }
-
-        template<typename T>
-        __global__
-        void shift_kernel(Param<T> out, CParam<T> in, const int d0, const int d1,
-                            const int d2, const int d3,
-                            const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
-
-            if(xx >= out.dims[0] ||
-               yy >= out.dims[1] ||
-               oz >= out.dims[2] ||
-               ow >= out.dims[3])
-                return;
-
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
-
-            const int iw = simple_mod((ow + d3), out.dims[3]);
-            const int iz = simple_mod((oz + d2), out.dims[2]);
-
-            const int o_off = ow * out.strides[3] + oz * out.strides[2];
-            const int i_off = iw *  in.strides[3] + iz *  in.strides[2];
-
-            for(int oy = yy; oy < out.dims[1]; oy += incy) {
-                const int iy = simple_mod((oy + d1), out.dims[1]);
-                for(int ox = xx; ox < out.dims[0]; ox += incx) {
-                    const int ix = simple_mod((ox + d0), out.dims[0]);
-
-                    const int oIdx = o_off + oy * out.strides[1] + ox;
-                    const int iIdx = i_off + iy *  in.strides[1] + ix;
-
-                    out.ptr[oIdx] = in.ptr[iIdx];
-                }
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void shift(Param<T> out, CParam<T> in, const int *sdims)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(out.dims[0], TILEX);
-            int blocksPerMatY = divup(out.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
-
-            int sdims_[4];
-            // Need to do this because we are mapping output to input in the kernel
-            for(int i = 0; i < 4; i++) {
-                // sdims_[i] will always be positive and always [0, oDims[i]].
-                // Negative shifts are converted to position by going the other way round
-                sdims_[i] = -(sdims[i] % (int)out.dims[i]) + out.dims[i] * (sdims[i] > 0);
-                assert(sdims_[i] >= 0 && sdims_[i] <= out.dims[i]);
-            }
-
-            shift_kernel<T><<<blocks, threads>>>(out, in, sdims_[0], sdims_[1], sdims_[2], sdims_[3],
-                                                 blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
-}
diff --git a/src/backend/cuda/kernel/sift.hpp b/src/backend/cuda/kernel/sift.hpp
new file mode 100644
index 0000000000..9c3e3bf7b8
--- /dev/null
+++ b/src/backend/cuda/kernel/sift.hpp
@@ -0,0 +1,1409 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// The source code contained in this file is based on the original code by
+// Rob Hess. Please note that SIFT is an algorithm patented and protected
+// by US law. As of 29-Dec-2020, the patent stands expired. It can be looked
+// up here - https://patents.google.com/patent/US6711293B1/en
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <memory.hpp>
+#include <thrust_utils.hpp>
+#include <af/defines.h>
+#include "shared.hpp"
+
+#include "convolve.hpp"
+#include "resize.hpp"
+
+#include <thrust/device_ptr.h>
+#include <thrust/device_vector.h>
+#include <thrust/gather.h>
+#include <thrust/generate.h>
+#include <thrust/random.h>
+#include <thrust/sequence.h>
+#include <thrust/sort.h>
+
+#include <cfloat>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+static const dim_t SIFT_THREADS   = 256;
+static const dim_t SIFT_THREADS_X = 32;
+static const dim_t SIFT_THREADS_Y = 8;
+
+#define PI_VAL 3.14159265358979323846f
+
+// default width of descriptor histogram array
+#define DESCR_WIDTH 4
+
+// default number of bins per histogram in descriptor array
+#define DESCR_HIST_BINS 8
+
+// assumed gaussian blur for input image
+#define INIT_SIGMA 0.5f
+
+// width of border in which to ignore keypoints
+#define IMG_BORDER 5
+
+// maximum steps of keypoint interpolation before failure
+#define MAX_INTERP_STEPS 5
+
+// default number of bins in histogram for orientation assignment
+#define ORI_HIST_BINS 36
+
+// determines gaussian sigma for orientation assignment
+#define ORI_SIG_FCTR 1.5f
+
+// determines the radius of the region used in orientation assignment
+#define ORI_RADIUS (3.0f * ORI_SIG_FCTR)
+
+// number of passes of orientation histogram smoothing
+#define SMOOTH_ORI_PASSES 2
+
+// orientation magnitude relative to max that results in new feature
+#define ORI_PEAK_RATIO 0.8f
+
+// determines the size of a single descriptor orientation histogram
+#define DESCR_SCL_FCTR 3.f
+
+// threshold on magnitude of elements of descriptor vector
+#define DESC_MAG_THR 0.2f
+
+// factor used to convert floating-point descriptor to unsigned char
+#define INT_DESCR_FCTR 512.f
+
+// Number of GLOH bins in radial direction
+static const unsigned GLOHRadialBins = 3;
+
+// Radii of GLOH descriptors
+__constant__ float GLOHRadii[GLOHRadialBins] = {6.f, 11.f, 15.f};
+
+// Number of GLOH angular bins (excluding the inner-most radial section)
+static const unsigned GLOHAngularBins = 8;
+
+// Number of GLOH bins per histogram in descriptor
+static const unsigned GLOHHistBins = 16;
+
+template<typename T>
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * PI_VAL * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+
+template<typename T>
+Array<T> gauss_filter(float sigma) {
+    // Using 6-sigma rule
+    unsigned gauss_len = std::min((unsigned)round(sigma * 6 + 1) | 1, 31u);
+
+    std::vector<T> h_gauss(gauss_len);
+    gaussian1D(h_gauss.data(), gauss_len, sigma);
+
+    Array<T> gauss_filter =
+        createHostDataArray(dim4(gauss_len), h_gauss.data());
+    return gauss_filter;
+}
+
+template<int N>
+__inline__ __device__ void gaussianElimination(float* A, float* b, float* x) {
+// forward elimination
+#pragma unroll
+    for (int i = 0; i < N - 1; i++) {
+#pragma unroll
+        for (int j = i + 1; j < N; j++) {
+            float s = A[j * N + i] / A[i * N + i];
+
+#pragma unroll
+            for (int k = i; k < N; k++) A[j * N + k] -= s * A[i * N + k];
+
+            b[j] -= s * b[i];
+        }
+    }
+
+#pragma unroll
+    for (int i = 0; i < N; i++) x[i] = 0;
+
+    // backward substitution: solve from the last row up to the first
+    float sum = 0;
+#pragma unroll
+    for (int i = N - 1; i >= 0; i--) {
+        sum = b[i];
+#pragma unroll
+        for (int j = i + 1; j < N; j++) sum -= A[i * N + j] * x[j];
+        x[i] = sum / A[i * N + i];
+    }
+}
+
+__inline__ __device__ void normalizeDesc(float* desc, float* accum,
+                                         const int histlen) {
+    int tid_x = threadIdx.x;
+    int tid_y = threadIdx.y;
+    int bsz_x = blockDim.x;
+
+    for (int i = tid_x; i < histlen; i += bsz_x)
+        accum[i] = desc[tid_y * histlen + i] * desc[tid_y * histlen + i];
+    __syncthreads();
+
+    if (tid_x < 64) accum[tid_x] += accum[tid_x + 64];
+    __syncthreads();
+    if (tid_x < 32) accum[tid_x] += accum[tid_x + 32];
+    __syncthreads();
+    if (tid_x < 16) accum[tid_x] += accum[tid_x + 16];
+    __syncthreads();
+    if (tid_x < 8) accum[tid_x] += accum[tid_x + 8];
+    __syncthreads();
+    if (tid_x < 4) accum[tid_x] += accum[tid_x + 4];
+    __syncthreads();
+    if (tid_x < 2) accum[tid_x] += accum[tid_x + 2];
+    __syncthreads();
+    if (tid_x < 1) accum[tid_x] += accum[tid_x + 1];
+    __syncthreads();
+
+    float len_sq  = accum[0];
+    float len_inv = 1.0f / sqrtf(len_sq);
+
+    for (int i = tid_x; i < histlen; i += bsz_x) {
+        desc[tid_y * histlen + i] *= len_inv;
+    }
+    __syncthreads();
+}
+
+__inline__ __device__ void normalizeGLOHDesc(float* desc, float* accum,
+                                             const int histlen) {
+    int tid_x = threadIdx.x;
+    int tid_y = threadIdx.y;
+    int bsz_x = blockDim.x;
+
+    for (int i = tid_x; i < histlen; i += bsz_x)
+        accum[i] = desc[tid_y * histlen + i] * desc[tid_y * histlen + i];
+    __syncthreads();
+
+    if (tid_x < 128) accum[tid_x] += accum[tid_x + 128];
+    __syncthreads();
+    if (tid_x < 64) accum[tid_x] += accum[tid_x + 64];
+    __syncthreads();
+    if (tid_x < 32) accum[tid_x] += accum[tid_x + 32];
+    __syncthreads();
+    if (tid_x < 16)
+        // GLOH is 272-dimensional, accumulating last 16 descriptors
+        accum[tid_x] += accum[tid_x + 16] + accum[tid_x + 256];
+    __syncthreads();
+    if (tid_x < 8) accum[tid_x] += accum[tid_x + 8];
+    __syncthreads();
+    if (tid_x < 4) accum[tid_x] += accum[tid_x + 4];
+    __syncthreads();
+    if (tid_x < 2) accum[tid_x] += accum[tid_x + 2];
+    __syncthreads();
+    if (tid_x < 1) accum[tid_x] += accum[tid_x + 1];
+    __syncthreads();
+
+    float len_sq  = accum[0];
+    float len_inv = 1.0f / sqrtf(len_sq);
+
+    for (int i = tid_x; i < histlen; i += bsz_x) {
+        desc[tid_y * histlen + i] *= len_inv;
+    }
+    __syncthreads();
+}
+
+template<typename T>
+__global__ void sub(Param<T> out, CParam<T> in, const unsigned nel,
+                    const unsigned n_layers) {
+    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (i < nel) {
+        for (unsigned l = 0; l < n_layers; l++)
+            out.ptr[l * nel + i] =
+                in.ptr[(l + 1) * nel + i] - in.ptr[l * nel + i];
+    }
+}
+
+#define SCPTR(Y, X) (s_center[(Y)*s_i + (X)])
+#define SPPTR(Y, X) (s_prev[(Y)*s_i + (X)])
+#define SNPTR(Y, X) (s_next[(Y)*s_i + (X)])
+#define DPTR(Z, Y, X) (dog.ptr[(Z)*imel + (Y)*dim0 + (X)])
+
+// Determines whether a pixel is a scale-space extremum by comparing it to its
+// 3x3x3 pixel neighborhood.
+template<typename T>
+__global__ void detectExtrema(float* x_out, float* y_out, unsigned* layer_out,
+                              unsigned* counter, CParam<T> dog,
+                              const unsigned max_feat, const float threshold) {
+    const int dim0 = dog.dims[0];
+    const int dim1 = dog.dims[1];
+    const int imel = dim0 * dim1;
+
+    const int tid_i = threadIdx.x;
+    const int tid_j = threadIdx.y;
+    const int bsz_i = blockDim.x;
+    const int bsz_j = blockDim.y;
+    const int i     = blockIdx.x * bsz_i + tid_i + IMG_BORDER;
+    const int j     = blockIdx.y * bsz_j + tid_j + IMG_BORDER;
+
+    const int x = tid_i + 1;
+    const int y = tid_j + 1;
+
+    // One pixel border for each side
+    const int s_i = bsz_i + 2;
+    const int s_j = bsz_j + 2;
+
+    SharedMemory<float> shared;
+    float* shrdMem  = shared.getPointer();
+    float* s_next   = shrdMem;
+    float* s_center = shrdMem + s_i * s_j;
+    float* s_prev   = shrdMem + s_i * s_j * 2;
+
+    for (int l = 1; l < dog.dims[2] - 1; l++) {
+        const int s_i_half = s_i / 2;
+        const int s_j_half = s_j / 2;
+        if (tid_i < s_i_half && tid_j < s_j_half && i < dim0 - IMG_BORDER + 1 &&
+            j < dim1 - IMG_BORDER + 1) {
+            SNPTR(tid_j, tid_i) = DPTR(l + 1, j - 1, i - 1);
+            SCPTR(tid_j, tid_i) = DPTR(l, j - 1, i - 1);
+            SPPTR(tid_j, tid_i) = DPTR(l - 1, j - 1, i - 1);
+
+            SNPTR(tid_j, tid_i + s_i_half) =
+                DPTR((l + 1), j - 1, i - 1 + s_i_half);
+            SCPTR(tid_j, tid_i + s_i_half) = DPTR((l), j - 1, i - 1 + s_i_half);
+            SPPTR(tid_j, tid_i + s_i_half) =
+                DPTR((l - 1), j - 1, i - 1 + s_i_half);
+
+            SNPTR(tid_j + s_j_half, tid_i) =
+                DPTR(l + 1, j - 1 + s_j_half, i - 1);
+            SCPTR(tid_j + s_j_half, tid_i) = DPTR(l, j - 1 + s_j_half, i - 1);
+            SPPTR(tid_j + s_j_half, tid_i) =
+                DPTR(l - 1, j - 1 + s_j_half, i - 1);
+
+            SNPTR(tid_j + s_j_half, tid_i + s_i_half) =
+                DPTR(l + 1, j - 1 + s_j_half, i - 1 + s_i_half);
+            SCPTR(tid_j + s_j_half, tid_i + s_i_half) =
+                DPTR(l, j - 1 + s_j_half, i - 1 + s_i_half);
+            SPPTR(tid_j + s_j_half, tid_i + s_i_half) =
+                DPTR(l - 1, j - 1 + s_j_half, i - 1 + s_i_half);
+        }
+        __syncthreads();
+
+        float p = SCPTR(y, x);
+
+        if (abs(p) > threshold && i < dim0 - IMG_BORDER &&
+            j < dim1 - IMG_BORDER &&
+            ((p > 0 && p > SCPTR(y - 1, x - 1) && p > SCPTR(y - 1, x) &&
+              p > SCPTR(y - 1, x + 1) && p > SCPTR(y, x - 1) &&
+              p > SCPTR(y, x + 1) && p > SCPTR(y + 1, x - 1) &&
+              p > SCPTR(y + 1, x) && p > SCPTR(y + 1, x + 1) &&
+              p > SPPTR(y - 1, x - 1) && p > SPPTR(y - 1, x) &&
+              p > SPPTR(y - 1, x + 1) && p > SPPTR(y, x - 1) &&
+              p > SPPTR(y, x) && p > SPPTR(y, x + 1) &&
+              p > SPPTR(y + 1, x - 1) && p > SPPTR(y + 1, x) &&
+              p > SPPTR(y + 1, x + 1) && p > SNPTR(y - 1, x - 1) &&
+              p > SNPTR(y - 1, x) && p > SNPTR(y - 1, x + 1) &&
+              p > SNPTR(y, x - 1) && p > SNPTR(y, x) && p > SNPTR(y, x + 1) &&
+              p > SNPTR(y + 1, x - 1) && p > SNPTR(y + 1, x) &&
+              p > SNPTR(y + 1, x + 1)) ||
+             (p < 0 && p < SCPTR(y - 1, x - 1) && p < SCPTR(y - 1, x) &&
+              p < SCPTR(y - 1, x + 1) && p < SCPTR(y, x - 1) &&
+              p < SCPTR(y, x + 1) && p < SCPTR(y + 1, x - 1) &&
+              p < SCPTR(y + 1, x) && p < SCPTR(y + 1, x + 1) &&
+              p < SPPTR(y - 1, x - 1) && p < SPPTR(y - 1, x) &&
+              p < SPPTR(y - 1, x + 1) && p < SPPTR(y, x - 1) &&
+              p < SPPTR(y, x) && p < SPPTR(y, x + 1) &&
+              p < SPPTR(y + 1, x - 1) && p < SPPTR(y + 1, x) &&
+              p < SPPTR(y + 1, x + 1) && p < SNPTR(y - 1, x - 1) &&
+              p < SNPTR(y - 1, x) && p < SNPTR(y - 1, x + 1) &&
+              p < SNPTR(y, x - 1) && p < SNPTR(y, x) && p < SNPTR(y, x + 1) &&
+              p < SNPTR(y + 1, x - 1) && p < SNPTR(y + 1, x) &&
+              p < SNPTR(y + 1, x + 1)))) {
+            unsigned idx = atomicAdd(counter, 1u);
+            if (idx < max_feat) {
+                x_out[idx]     = (float)j;
+                y_out[idx]     = (float)i;
+                layer_out[idx] = l;
+            }
+        }
+        __syncthreads();
+    }
+}
+
+#undef SCPTR
+#undef SPPTR
+#undef SNPTR
+#define CPTR(Y, X) (center_ptr[(Y)*dim0 + (X)])
+#define PPTR(Y, X) (prev_ptr[(Y)*dim0 + (X)])
+#define NPTR(Y, X) (next_ptr[(Y)*dim0 + (X)])
+
+// Interpolates a scale-space extremum's location and scale to subpixel
+// accuracy to form an image feature. Rejects features with low contrast.
+// Based on Section 4 of Lowe's paper.
+template<typename T>
+__global__ void interpolateExtrema(
+    float* x_out, float* y_out, unsigned* layer_out, float* response_out,
+    float* size_out, unsigned* counter, const float* x_in, const float* y_in,
+    const unsigned* layer_in, const unsigned extrema_feat,
+    const CParam<T> dog_octave, const unsigned max_feat, const unsigned octave,
+    const unsigned n_layers, const float contrast_thr, const float edge_thr,
+    const float sigma, const float img_scale) {
+    const unsigned f = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (f < extrema_feat) {
+        const float first_deriv_scale  = img_scale * 0.5f;
+        const float second_deriv_scale = img_scale;
+        const float cross_deriv_scale  = img_scale * 0.25f;
+
+        float xl = 0, xy = 0, xx = 0, contr = 0;
+        int i = 0;
+
+        unsigned x     = x_in[f];
+        unsigned y     = y_in[f];
+        unsigned layer = layer_in[f];
+
+        const int dim0 = dog_octave.dims[0];
+        const int dim1 = dog_octave.dims[1];
+        const int imel = dim0 * dim1;
+
+        const T* prev_ptr   = dog_octave.ptr + (layer - 1) * imel;
+        const T* center_ptr = dog_octave.ptr + (layer)*imel;
+        const T* next_ptr   = dog_octave.ptr + (layer + 1) * imel;
+
+        for (i = 0; i < MAX_INTERP_STEPS; i++) {
+            float dD[3] = {
+                (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+                (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+                (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+
+            float d2 = CPTR(x, y) * 2.f;
+            float dxx =
+                (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+            float dyy =
+                (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+            float dss = (NPTR(x, y) + PPTR(x, y) - d2) * second_deriv_scale;
+            float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                         CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                        cross_deriv_scale;
+            float dxs = (NPTR(x + 1, y) - NPTR(x - 1, y) - PPTR(x + 1, y) +
+                         PPTR(x - 1, y)) *
+                        cross_deriv_scale;
+            float dys = (NPTR(x, y + 1) - NPTR(x, y - 1) - PPTR(x, y + 1) +
+                         PPTR(x, y - 1)) *
+                        cross_deriv_scale;
+
+            float H[9] = {dxx, dxy, dxs, dxy, dyy, dys, dxs, dys, dss};
+
+            float X[3];
+            gaussianElimination<3>(H, dD, X);
+
+            xl = -X[2];
+            xy = -X[1];
+            xx = -X[0];
+
+            if (abs(xl) < 0.5f && abs(xy) < 0.5f && abs(xx) < 0.5f) break;
+
+            x += round(xx);
+            y += round(xy);
+            layer += round(xl);
+
+            if (layer < 1 || layer > n_layers || x < IMG_BORDER ||
+                x >= dim1 - IMG_BORDER || y < IMG_BORDER ||
+                y >= dim0 - IMG_BORDER)
+                return;
+        }
+
+        // ensure convergence of interpolation
+        if (i >= MAX_INTERP_STEPS) return;
+
+        float dD[3] = {
+            (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+            (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+            (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+        float X[3] = {xx, xy, xl};
+
+        float P = dD[0] * X[0] + dD[1] * X[1] + dD[2] * X[2];
+
+        contr = CPTR(x, y) * img_scale + P * 0.5f;
+        if (abs(contr) < (contrast_thr / n_layers)) return;
+
+        // principal curvatures are computed using the trace and det of Hessian
+        float d2  = CPTR(x, y) * 2.f;
+        float dxx = (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+        float dyy = (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+        float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                     CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                    cross_deriv_scale;
+
+        float tr  = dxx + dyy;
+        float det = dxx * dyy - dxy * dxy;
+
+        // add FLT_EPSILON for double-precision compatibility
+        if (det <= 0 || tr * tr * edge_thr >=
+                            (edge_thr + 1) * (edge_thr + 1) * det + FLT_EPSILON)
+            return;
+
+        unsigned ridx = atomicAdd(counter, 1u);
+
+        if (ridx < max_feat) {
+            x_out[ridx]        = (x + xx) * (1 << octave);
+            y_out[ridx]        = (y + xy) * (1 << octave);
+            layer_out[ridx]    = layer;
+            response_out[ridx] = abs(contr);
+            size_out[ridx] =
+                sigma * pow(2.f, octave + (layer + xl) / n_layers) * 2.f;
+        }
+    }
+}
+
+#undef CPTR
+#undef PPTR
+#undef NPTR
+
+// Remove duplicate keypoints
+__global__ void removeDuplicates(float* x_out, float* y_out,
+                                 unsigned* layer_out, float* response_out,
+                                 float* size_out, unsigned* counter,
+                                 const float* x_in, const float* y_in,
+                                 const unsigned* layer_in,
+                                 const float* response_in, const float* size_in,
+                                 const unsigned total_feat) {
+    const unsigned f = blockIdx.x * blockDim.x + threadIdx.x;
+
+    if (f >= total_feat) return;
+
+    float prec_fctr = 1e4f;
+
+    if (f < total_feat - 1) {
+        if (round(x_in[f] * prec_fctr) == round(x_in[f + 1] * prec_fctr) &&
+            round(y_in[f] * prec_fctr) == round(y_in[f + 1] * prec_fctr) &&
+            layer_in[f] == layer_in[f + 1] &&
+            round(response_in[f] * prec_fctr) ==
+                round(response_in[f + 1] * prec_fctr) &&
+            round(size_in[f] * prec_fctr) == round(size_in[f + 1] * prec_fctr))
+            return;
+    }
+
+    unsigned idx = atomicAdd(counter, 1);
+
+    x_out[idx]        = x_in[f];
+    y_out[idx]        = y_in[f];
+    layer_out[idx]    = layer_in[f];
+    response_out[idx] = response_in[f];
+    size_out[idx]     = size_in[f];
+}
+
+#define IPTR(Y, X) (img_ptr[(Y)*dim0 + (X)])
+
+// Computes a canonical orientation for each image feature in an array.  Based
+// on Section 5 of Lowe's paper.  This function adds features to the array when
+// there is more than one dominant orientation at a given feature location.
+template<typename T>
+__global__ void calcOrientation(
+    float* x_out, float* y_out, unsigned* layer_out, float* response_out,
+    float* size_out, float* ori_out, unsigned* counter, const float* x_in,
+    const float* y_in, const unsigned* layer_in, const float* response_in,
+    const float* size_in, const unsigned total_feat,
+    const CParam<T> gauss_octave, const unsigned max_feat,
+    const unsigned octave, const bool double_input) {
+    const int tid_x = threadIdx.x;
+    const int tid_y = threadIdx.y;
+    const int bsz_x = blockDim.x;
+    const int bsz_y = blockDim.y;
+
+    const unsigned f = blockIdx.y * bsz_y + tid_y;
+
+    const int n = ORI_HIST_BINS;
+
+    SharedMemory<float> shared;
+    float* shrdMem  = shared.getPointer();
+    float* hist     = shrdMem;
+    float* temphist = shrdMem + n * 8;
+
+    // Initialize temporary histogram
+    for (int i = tid_x; i < ORI_HIST_BINS; i += bsz_x)
+        hist[tid_y * n + i] = 0.f;
+    __syncthreads();
+
+    float real_x, real_y, response, size;
+    unsigned layer;
+
+    if (f < total_feat) {
+        // Load keypoint information
+        real_x   = x_in[f];
+        real_y   = y_in[f];
+        layer    = layer_in[f];
+        response = response_in[f];
+        size     = size_in[f];
+
+        const int pt_x = (int)round(real_x / (1 << octave));
+        const int pt_y = (int)round(real_y / (1 << octave));
+
+        // Calculate auxiliary parameters
+        const float scl_octv  = size * 0.5f / (1 << octave);
+        const int radius      = (int)round(ORI_RADIUS * scl_octv);
+        const float sigma     = ORI_SIG_FCTR * scl_octv;
+        const int len         = (radius * 2 + 1);
+        const float exp_denom = 2.f * sigma * sigma;
+
+        const int dim0 = gauss_octave.dims[0];
+        const int dim1 = gauss_octave.dims[1];
+        const int imel = dim0 * dim1;
+
+        // Points img to correct Gaussian pyramid layer
+        const T* img_ptr = gauss_octave.ptr + layer * imel;
+
+        // Calculate orientation histogram
+        for (int l = tid_x; l < len * len; l += bsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = pt_y + i;
+            int x = pt_x + j;
+            if (y < 1 || y >= dim0 - 1 || x < 1 || x >= dim1 - 1) continue;
+
+            float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+            float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+            float mag = sqrt(dx * dx + dy * dy);
+            float ori = atan2(dy, dx);
+            float w   = exp(-(i * i + j * j) / exp_denom);
+
+            int bin = round(n * (ori + PI_VAL) / (2.f * PI_VAL));
+            bin     = bin < n ? bin : 0;
+
+            atomicAdd(&hist[tid_y * n + bin], w * mag);
+        }
+    }
+    __syncthreads();
+
+    for (int i = 0; i < SMOOTH_ORI_PASSES; i++) {
+        for (int j = tid_x; j < n; j += bsz_x) {
+            temphist[tid_y * n + j] = hist[tid_y * n + j];
+        }
+        __syncthreads();
+        for (int j = tid_x; j < n; j += bsz_x) {
+            float prev = (j == 0) ? temphist[tid_y * n + n - 1]
+                                  : temphist[tid_y * n + j - 1];
+            float next = (j + 1 == n) ? temphist[tid_y * n]
+                                      : temphist[tid_y * n + j + 1];
+            hist[tid_y * n + j] =
+                0.25f * prev + 0.5f * temphist[tid_y * n + j] + 0.25f * next;
+        }
+        __syncthreads();
+    }
+
+    for (int i = tid_x; i < n; i += bsz_x)
+        temphist[tid_y * n + i] = hist[tid_y * n + i];
+    __syncthreads();
+
+    if (tid_x < 16)
+        temphist[tid_y * n + tid_x] =
+            fmax(hist[tid_y * n + tid_x], hist[tid_y * n + tid_x + 16]);
+    __syncthreads();
+    if (tid_x < 8)
+        temphist[tid_y * n + tid_x] =
+            fmax(temphist[tid_y * n + tid_x], temphist[tid_y * n + tid_x + 8]);
+    __syncthreads();
+    if (tid_x < 4) {
+        temphist[tid_y * n + tid_x] =
+            fmax(temphist[tid_y * n + tid_x], hist[tid_y * n + tid_x + 32]);
+        temphist[tid_y * n + tid_x] =
+            fmax(temphist[tid_y * n + tid_x], temphist[tid_y * n + tid_x + 4]);
+    }
+    __syncthreads();
+    if (tid_x < 2)
+        temphist[tid_y * n + tid_x] =
+            fmax(temphist[tid_y * n + tid_x], temphist[tid_y * n + tid_x + 2]);
+    __syncthreads();
+    if (tid_x < 1)
+        temphist[tid_y * n + tid_x] =
+            fmax(temphist[tid_y * n + tid_x], temphist[tid_y * n + tid_x + 1]);
+    __syncthreads();
+    float omax = temphist[tid_y * n];
+
+    if (f < total_feat) {
+        float mag_thr = (float)(omax * ORI_PEAK_RATIO);
+        int l, r;
+        for (int j = tid_x; j < n; j += bsz_x) {
+            l = (j == 0) ? n - 1 : j - 1;
+            r = (j + 1) % n;
+            if (hist[tid_y * n + j] > hist[tid_y * n + l] &&
+                hist[tid_y * n + j] > hist[tid_y * n + r] &&
+                hist[tid_y * n + j] >= mag_thr) {
+                int idx = atomicAdd(counter, 1);
+
+                if (idx < max_feat) {
+                    float bin =
+                        j +
+                        0.5f * (hist[tid_y * n + l] - hist[tid_y * n + r]) /
+                            (hist[tid_y * n + l] - 2.0f * hist[tid_y * n + j] +
+                             hist[tid_y * n + r]);
+                    bin = (bin < 0.0f) ? bin + n : (bin >= n) ? bin - n : bin;
+                    float ori = 360.f - ((360.f / n) * bin);
+
+                    float new_real_x = real_x;
+                    float new_real_y = real_y;
+                    float new_size   = size;
+
+                    if (double_input) {
+                        float scale = 0.5f;
+                        new_real_x *= scale;
+                        new_real_y *= scale;
+                        new_size *= scale;
+                    }
+
+                    x_out[idx]        = new_real_x;
+                    y_out[idx]        = new_real_y;
+                    layer_out[idx]    = layer;
+                    response_out[idx] = response;
+                    size_out[idx]     = new_size;
+                    ori_out[idx]      = ori;
+                }
+            }
+        }
+    }
+}
+
+// Computes feature descriptors for features in an array.  Based on Section 6
+// of Lowe's paper.
+template<typename T>
+__global__ void computeDescriptor(
+    float* desc_out, const unsigned desc_len, const unsigned histsz,
+    const float* x_in, const float* y_in, const unsigned* layer_in,
+    const float* response_in, const float* size_in, const float* ori_in,
+    const unsigned total_feat, const CParam<T> gauss_octave, const int d,
+    const int n, const float scale, const int n_layers) {
+    const int tid_x = threadIdx.x;
+    const int tid_y = threadIdx.y;
+    const int bsz_x = blockDim.x;
+    const int bsz_y = blockDim.y;
+
+    const int f = blockIdx.y * bsz_y + tid_y;
+
+    SharedMemory<float> shared;
+    float* shrdMem = shared.getPointer();
+    float* desc    = shrdMem;
+    float* accum   = shrdMem + desc_len * histsz;
+
+    for (int i = tid_x; i < desc_len * histsz; i += bsz_x)
+        desc[tid_y * desc_len + i] = 0.f;
+    __syncthreads();
+
+    if (f < total_feat) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        const int dim0 = gauss_octave.dims[0];
+        const int dim1 = gauss_octave.dims[1];
+        const int imel = dim0 * dim1;
+
+        // Points img to correct Gaussian pyramid layer
+        const T* img_ptr = gauss_octave.ptr + layer * imel;
+
+        float cos_t        = cosf(ori);
+        float sin_t        = sinf(ori);
+        float bins_per_rad = n / (PI_VAL * 2.f);
+        float exp_denom    = d * d * 0.5f;
+        float hist_width   = DESCR_SCL_FCTR * size * scale * 0.5f;
+        int radius         = hist_width * sqrtf(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        int len            = radius * 2 + 1;
+        const int hist_off = (tid_x % histsz) * desc_len;
+
+        // Calculate orientation histogram
+        for (int l = tid_x; l < len * len; l += bsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t) / hist_width;
+            float y_rot = (j * sin_t + i * cos_t) / hist_width;
+            float xbin  = x_rot + d / 2 - 0.5f;
+            float ybin  = y_rot + d / 2 - 0.5f;
+
+            if (ybin > -1.0f && ybin < d && xbin > -1.0f && xbin < d && y > 0 &&
+                y < dim0 - 1 && x > 0 && x < dim1 - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrtf(dx * dx + dy * dy);
+                float grad_ori = atan2f(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-(x_rot * x_rot + y_rot * y_rot) / exp_denom);
+                float obin = grad_ori * bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int x0 = floor(xbin);
+                int y0 = floor(ybin);
+                int o0 = floor(obin);
+                xbin -= x0;
+                ybin -= y0;
+                obin -= o0;
+
+                for (int yl = 0; yl <= 1; yl++) {
+                    int yb = y0 + yl;
+                    if (yb >= 0 && yb < d) {
+                        float v_y = mag * ((yl == 0) ? 1.0f - ybin : ybin);
+                        for (int xl = 0; xl <= 1; xl++) {
+                            int xb = x0 + xl;
+                            if (xb >= 0 && xb < d) {
+                                float v_x =
+                                    v_y * ((xl == 0) ? 1.0f - xbin : xbin);
+                                for (int ol = 0; ol <= 1; ol++) {
+                                    int ob = (o0 + ol) % n;
+                                    float v_o =
+                                        v_x * ((ol == 0) ? 1.0f - obin : obin);
+                                    atomicAdd(
+                                        &desc[hist_off + tid_y * desc_len +
+                                              (yb * d + xb) * n + ob],
+                                        v_o);
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+        }
+    }
+    __syncthreads();
+
+    // Combine histograms (reduces previous atomicAdd overhead)
+    for (int l = tid_x; l < desc_len * 4; l += bsz_x)
+        desc[l] += desc[l + 4 * desc_len];
+    __syncthreads();
+    for (int l = tid_x; l < desc_len * 2; l += bsz_x)
+        desc[l] += desc[l + 2 * desc_len];
+    __syncthreads();
+    for (int l = tid_x; l < desc_len; l += bsz_x) desc[l] += desc[l + desc_len];
+    __syncthreads();
+
+    normalizeDesc(desc, accum, desc_len);
+
+    for (int i = tid_x; i < desc_len; i += bsz_x)
+        desc[tid_y * desc_len + i] =
+            min(desc[tid_y * desc_len + i], DESC_MAG_THR);
+    __syncthreads();
+
+    normalizeDesc(desc, accum, desc_len);
+
+    if (f < total_feat) {
+        // Calculate final descriptor values
+        for (int k = tid_x; k < desc_len; k += bsz_x)
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[tid_y * desc_len + k] * INT_DESCR_FCTR));
+    }
+}
+
+// Computes GLOH feature descriptors for features in an array. Based on Section
+// III-B of Mikolajczyk and Schmid paper.
+template<typename T>
+__global__ void computeGLOHDescriptor(
+    float* desc_out, const unsigned desc_len, const unsigned histsz,
+    const float* x_in, const float* y_in, const unsigned* layer_in,
+    const float* response_in, const float* size_in, const float* ori_in,
+    const unsigned total_feat, const CParam<T> gauss_octave, const int d,
+    const unsigned rb, const unsigned ab, const unsigned hb, const float scale,
+    const int n_layers) {
+    const int tid_x = threadIdx.x;
+    const int tid_y = threadIdx.y;
+    const int bsz_x = blockDim.x;
+    const int bsz_y = blockDim.y;
+
+    const int f = blockIdx.y * bsz_y + tid_y;
+
+    SharedMemory<float> shared;
+    float* shrdMem = shared.getPointer();
+    float* desc    = shrdMem;
+    float* accum   = shrdMem + desc_len * histsz;
+
+    for (int i = tid_x; i < desc_len * histsz; i += bsz_x)
+        desc[tid_y * desc_len + i] = 0.f;
+    __syncthreads();
+
+    if (f < total_feat) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        const int dim0 = gauss_octave.dims[0];
+        const int dim1 = gauss_octave.dims[1];
+        const int imel = dim0 * dim1;
+
+        // Points img to correct Gaussian pyramid layer
+        const T* img_ptr = gauss_octave.ptr + layer * imel;
+
+        float cos_t              = cosf(ori);
+        float sin_t              = sinf(ori);
+        float hist_bins_per_rad  = hb / (PI_VAL * 2.f);
+        float polar_bins_per_rad = ab / (PI_VAL * 2.f);
+        float exp_denom          = GLOHRadii[rb - 1] * 0.5f;
+
+        float hist_width = DESCR_SCL_FCTR * size * scale * 0.5f;
+
+        // Keep same descriptor radius used for SIFT
+        int radius = hist_width * sqrt(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        // Alternative radius size calculation, changing the radius weight
+        // (rw) in the range of 0.25f-0.75f gives different results,
+        // increasing it tends to show a better recall rate but with a
+        // smaller amount of correct matches
+        // float rw = 0.5f;
+        // int radius = hist_width * GLOHRadii[rb-1] * rw + 0.5f;
+
+        int len            = radius * 2 + 1;
+        const int hist_off = (tid_x % histsz) * desc_len;
+
+        // Calculate orientation histogram
+        for (int l = tid_x; l < len * len; l += bsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t);
+            float y_rot = (j * sin_t + i * cos_t);
+
+            float r = sqrt(x_rot * x_rot + y_rot * y_rot) / radius *
+                      GLOHRadii[rb - 1];
+            float theta = atan2(y_rot, x_rot);
+            while (theta < 0.0f) theta += PI_VAL * 2;
+            while (theta >= PI_VAL * 2) theta -= PI_VAL * 2;
+
+            float tbin = theta * polar_bins_per_rad;
+            float rbin =
+                (r < GLOHRadii[0])
+                    ? r / GLOHRadii[0]
+                    : ((r < GLOHRadii[1])
+                           ? 1 + (r - GLOHRadii[0]) /
+                                     (float)(GLOHRadii[1] - GLOHRadii[0])
+                           : min(2 + (r - GLOHRadii[1]) /
+                                         (float)(GLOHRadii[2] - GLOHRadii[1]),
+                                 3.f - FLT_EPSILON));
+
+            if (r <= GLOHRadii[rb - 1] && y > 0 && y < dim0 - 1 && x > 0 &&
+                x < dim1 - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrtf(dx * dx + dy * dy);
+                float grad_ori = atan2f(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-r / exp_denom);
+                float obin = grad_ori * hist_bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int t0 = floor(tbin);
+                int r0 = floor(rbin);
+                int o0 = floor(obin);
+                tbin -= t0;
+                rbin -= r0;
+                obin -= o0;
+
+                for (int rl = 0; rl <= 1; rl++) {
+                    int rb    = (rbin > 0.5f) ? (r0 + rl) : (r0 - rl);
+                    float v_r = mag * ((rl == 0) ? 1.0f - rbin : rbin);
+                    if (rb >= 0 && rb <= 2) {
+                        for (int tl = 0; tl <= 1; tl++) {
+                            int tb    = (t0 + tl) % ab;
+                            float v_t = v_r * ((tl == 0) ? 1.0f - tbin : tbin);
+                            for (int ol = 0; ol <= 1; ol++) {
+                                int ob = (o0 + ol) % hb;
+                                float v_o =
+                                    v_t * ((ol == 0) ? 1.0f - obin : obin);
+                                unsigned idx =
+                                    (rb > 0) *
+                                        (hb + ((rb - 1) * ab + tb) * hb) +
+                                    ob;
+                                atomicAdd(
+                                    &desc[hist_off + tid_y * desc_len + idx],
+                                    v_o);
+                            }
+                        }
+                    }
+                }
+            }
+        }
+    }
+    __syncthreads();
+
+    // Combine partial histograms (separate copies reduce atomicAdd contention)
+    for (int l = tid_x; l < desc_len * 4; l += bsz_x)
+        desc[l] += desc[l + 4 * desc_len];
+    __syncthreads();
+    for (int l = tid_x; l < desc_len * 2; l += bsz_x)
+        desc[l] += desc[l + 2 * desc_len];
+    __syncthreads();
+    for (int l = tid_x; l < desc_len; l += bsz_x) desc[l] += desc[l + desc_len];
+    __syncthreads();
+
+    normalizeGLOHDesc(desc, accum, desc_len);
+
+    for (int i = tid_x; i < desc_len; i += bsz_x)
+        desc[tid_y * desc_len + i] =
+            min(desc[tid_y * desc_len + i], DESC_MAG_THR);
+    __syncthreads();
+
+    normalizeGLOHDesc(desc, accum, desc_len);
+
+    if (f < total_feat) {
+        // Calculate final descriptor values
+        for (int k = tid_x; k < desc_len; k += bsz_x)
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[tid_y * desc_len + k] * INT_DESCR_FCTR));
+    }
+}
+
+#undef IPTR
+
+template<typename T, typename convAccT>
+Array<T> createInitialImage(CParam<T> img, const float init_sigma,
+                            const bool double_input) {
+    dim4 dims((double_input) ? img.dims[0] * 2 : img.dims[0],
+              (double_input) ? img.dims[1] * 2 : img.dims[1]);
+    Array<T> init_img = createEmptyArray<T>(dims);
+    Array<T> init_tmp = createEmptyArray<T>(dims);
+
+    float s = (double_input)
+                  ? std::max((float)sqrt(init_sigma * init_sigma -
+                                         INIT_SIGMA * INIT_SIGMA * 4),
+                             0.1f)
+                  : std::max((float)sqrt(init_sigma * init_sigma -
+                                         INIT_SIGMA * INIT_SIGMA),
+                             0.1f);
+
+    Array<convAccT> filter = gauss_filter<convAccT>(s);
+
+    if (double_input) {
+        resize<T>(init_img, img, AF_INTERP_BILINEAR);
+        convolve2<T, convAccT>(init_tmp, init_img, filter, 0, false);
+    } else
+        convolve2<T, convAccT>(init_tmp, img, filter, 0, false);
+
+    convolve2<T, convAccT>(init_img, CParam<T>(init_tmp), filter, 1, false);
+
+    return init_img;
+}
+
+template<typename T, typename convAccT>
+std::vector<Array<T>> buildGaussPyr(Param<T> init_img, const unsigned n_octaves,
+                                    const unsigned n_layers,
+                                    const float init_sigma) {
+    // Precompute Gaussian sigmas using the following formula:
+    // \sigma_{total}^2 = \sigma_{i}^2 + \sigma_{i-1}^2
+    std::vector<float> sig_layers(n_layers + 3);
+    sig_layers[0] = init_sigma;
+    float k       = std::pow(2.0f, 1.0f / n_layers);
+    for (unsigned i = 1; i < n_layers + 3; i++) {
+        float sig_prev  = std::pow(k, i - 1) * init_sigma;
+        float sig_total = sig_prev * k;
+        sig_layers[i] = std::sqrt(sig_total * sig_total - sig_prev * sig_prev);
+    }
+
+    // Gaussian Pyramid
+    std::vector<Array<T>> gauss_pyr;
+    std::vector<Array<T>> tmp_pyr;
+    gauss_pyr.reserve(n_octaves);
+    tmp_pyr.reserve(n_octaves * (n_layers + 3));
+    for (unsigned o = 0; o < n_octaves; o++) {
+        gauss_pyr.push_back(createEmptyArray<T>(
+            {(o == 0) ? init_img.dims[0] : gauss_pyr[o - 1].dims()[0] / 2,
+             (o == 0) ? init_img.dims[1] : gauss_pyr[o - 1].dims()[1] / 2,
+             n_layers + 3}));
+
+        for (unsigned l = 0; l < n_layers + 3; l++) {
+            unsigned src_idx = (l == 0) ? (o - 1) * (n_layers + 3) + n_layers
+                                        : o * (n_layers + 3) + l - 1;
+            unsigned idx     = o * (n_layers + 3) + l;
+
+            if (o == 0 && l == 0) {
+                tmp_pyr.push_back(createParamArray(init_img, false));
+            } else if (l == 0) {
+                tmp_pyr.push_back(
+                    createEmptyArray<T>({tmp_pyr[src_idx].dims()[0] / 2,
+                                         tmp_pyr[src_idx].dims()[1] / 2}));
+                resize<T>(tmp_pyr[idx], tmp_pyr[src_idx], AF_INTERP_BILINEAR);
+            } else {
+                tmp_pyr.push_back(createEmptyArray<T>(tmp_pyr[src_idx].dims()));
+                Array<T> tmp = createEmptyArray<T>(tmp_pyr[src_idx].dims());
+                Array<convAccT> filter = gauss_filter<convAccT>(sig_layers[l]);
+
+                convolve2<T, convAccT>(tmp, tmp_pyr[src_idx], filter, 0, false);
+                convolve2<T, convAccT>(tmp_pyr[idx], CParam<T>(tmp), filter, 1,
+                                       false);
+
+                // memFree(tmp.ptr);
+            }
+
+            const unsigned imel   = tmp_pyr[idx].elements();
+            const unsigned offset = imel * l;
+
+            CUDA_CHECK(cudaMemcpyAsync(
+                gauss_pyr[o].get() + offset, tmp_pyr[idx].get(),
+                imel * sizeof(T), cudaMemcpyDeviceToDevice, getActiveStream()));
+        }
+    }
+    return gauss_pyr;
+}
+
+template<typename T>
+std::vector<Array<T>> buildDoGPyr(std::vector<Array<T>>& gauss_pyr,
+                                  const unsigned n_octaves,
+                                  const unsigned n_layers) {
+    // DoG Pyramid
+    std::vector<Array<T>> dog_pyr;
+    dog_pyr.reserve(n_octaves);
+
+    for (unsigned o = 0; o < n_octaves; o++) {
+        dog_pyr.push_back(createEmptyArray<T>(
+            {gauss_pyr[o].dims()[0], gauss_pyr[o].dims()[1],
+             gauss_pyr[o].dims()[2] - 1, gauss_pyr[o].dims()[3]}));
+
+        const unsigned nel = dog_pyr[o].dims()[1] * dog_pyr[o].strides()[1];
+        const unsigned dog_layers = n_layers + 2;
+
+        dim3 threads(SIFT_THREADS);
+        dim3 blocks(divup(nel, threads.x));
+        CUDA_LAUNCH((sub<T>), blocks, threads, dog_pyr[o], gauss_pyr[o], nel,
+                    dog_layers);
+        POST_LAUNCH_CHECK();
+    }
+
+    return dog_pyr;
+}
+
+template<typename T>
+void update_permutation(thrust::device_ptr<T>& keys,
+                        arrayfire::cuda::ThrustVector<int>& permutation) {
+    // temporary storage for keys
+    arrayfire::cuda::ThrustVector<T> temp(permutation.size());
+
+    // permute the keys with the current reordering
+    THRUST_SELECT((thrust::gather), permutation.begin(), permutation.end(),
+                  keys, temp.begin());
+
+    // stable_sort the permuted keys and update the permutation
+    THRUST_SELECT((thrust::stable_sort_by_key), temp.begin(), temp.end(),
+                  permutation.begin());
+}
+
+template<typename T>
+void apply_permutation(thrust::device_ptr<T>& keys,
+                       arrayfire::cuda::ThrustVector<int>& permutation) {
+    // copy keys to temporary vector
+    arrayfire::cuda::ThrustVector<T> temp(keys, keys + permutation.size());
+
+    // permute the keys
+    THRUST_SELECT((thrust::gather), permutation.begin(), permutation.end(),
+                  temp.begin(), keys);
+}
+
+template<typename T, typename convAccT>
+void sift(unsigned* out_feat, unsigned* out_dlen, float** d_x, float** d_y,
+          float** d_score, float** d_ori, float** d_size, float** d_desc,
+          CParam<T> img, const unsigned n_layers, const float contrast_thr,
+          const float edge_thr, const float init_sigma, const bool double_input,
+          const float img_scale, const float feature_ratio,
+          const bool compute_GLOH) {
+    unsigned min_dim = min(img.dims[0], img.dims[1]);
+    if (double_input) min_dim *= 2;
+
+    const unsigned n_octaves = floor(log(min_dim) / log(2)) - 2;
+
+    Array<T> init_img =
+        createInitialImage<T, convAccT>(img, init_sigma, double_input);
+
+    std::vector<Array<T>> gauss_pyr =
+        buildGaussPyr<T, convAccT>(init_img, n_octaves, n_layers, init_sigma);
+
+    std::vector<Array<T>> dog_pyr =
+        buildDoGPyr<T>(gauss_pyr, n_octaves, n_layers);
+
+    std::vector<uptr<float>> d_x_pyr(n_octaves);
+    std::vector<uptr<float>> d_y_pyr(n_octaves);
+    std::vector<uptr<float>> d_response_pyr(n_octaves);
+    std::vector<uptr<float>> d_size_pyr(n_octaves);
+    std::vector<uptr<float>> d_ori_pyr(n_octaves);
+    std::vector<uptr<float>> d_desc_pyr(n_octaves);
+    std::vector<unsigned> feat_pyr(n_octaves);
+    unsigned total_feat = 0;
+
+    const unsigned d  = DESCR_WIDTH;
+    const unsigned n  = DESCR_HIST_BINS;
+    const unsigned rb = GLOHRadialBins;
+    const unsigned ab = GLOHAngularBins;
+    const unsigned hb = GLOHHistBins;
+    const unsigned desc_len =
+        (compute_GLOH) ? (1 + (rb - 1) * ab) * hb : d * d * n;
+
+    uptr<unsigned> d_count = memAlloc<unsigned>(1);
+    for (unsigned i = 0; i < n_octaves; i++) {
+        if (dog_pyr[i].dims()[0] - 2 * IMG_BORDER < 1 ||
+            dog_pyr[i].dims()[1] - 2 * IMG_BORDER < 1)
+            continue;
+
+        const unsigned imel     = dog_pyr[i].dims()[0] * dog_pyr[i].dims()[1];
+        const unsigned max_feat = ceil(imel * feature_ratio);
+
+        CUDA_CHECK(cudaMemsetAsync(d_count.get(), 0, sizeof(unsigned),
+                                   getActiveStream()));
+
+        uptr<float> d_extrema_x        = memAlloc<float>(max_feat);
+        uptr<float> d_extrema_y        = memAlloc<float>(max_feat);
+        uptr<unsigned> d_extrema_layer = memAlloc<unsigned>(max_feat);
+
+        int dim0 = dog_pyr[i].dims()[0];
+        int dim1 = dog_pyr[i].dims()[1];
+
+        dim3 threads(SIFT_THREADS_X, SIFT_THREADS_Y);
+        dim3 blocks(divup(dim0 - 2 * IMG_BORDER, threads.x),
+                    divup(dim1 - 2 * IMG_BORDER, threads.y));
+
+        float extrema_thr = 0.5f * contrast_thr / n_layers;
+        const size_t extrema_shared_size =
+            (threads.x + 2) * (threads.y + 2) * 3 * sizeof(float);
+        CUDA_LAUNCH_SMEM((detectExtrema<T>), blocks, threads,
+                         extrema_shared_size, d_extrema_x.get(),
+                         d_extrema_y.get(), d_extrema_layer.get(),
+                         d_count.get(), dog_pyr[i], max_feat, extrema_thr);
+        POST_LAUNCH_CHECK();
+
+        unsigned extrema_feat = 0;
+        CUDA_CHECK(cudaMemcpyAsync(&extrema_feat, d_count.get(),
+                                   sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        extrema_feat = min(extrema_feat, max_feat);
+
+        if (extrema_feat == 0) { continue; }
+
+        CUDA_CHECK(cudaMemsetAsync(d_count.get(), 0, sizeof(unsigned),
+                                   getActiveStream()));
+
+        auto d_interp_x        = memAlloc<float>(extrema_feat);
+        auto d_interp_y        = memAlloc<float>(extrema_feat);
+        auto d_interp_layer    = memAlloc<unsigned>(extrema_feat);
+        auto d_interp_response = memAlloc<float>(extrema_feat);
+        auto d_interp_size     = memAlloc<float>(extrema_feat);
+
+        threads = dim3(SIFT_THREADS, 1);
+        blocks  = dim3(divup(extrema_feat, threads.x), 1);
+
+        CUDA_LAUNCH((interpolateExtrema<T>), blocks, threads, d_interp_x.get(),
+                    d_interp_y.get(), d_interp_layer.get(),
+                    d_interp_response.get(), d_interp_size.get(), d_count.get(),
+                    d_extrema_x.get(), d_extrema_y.get(), d_extrema_layer.get(),
+                    extrema_feat, dog_pyr[i], max_feat, i, n_layers,
+                    contrast_thr, edge_thr, init_sigma, img_scale);
+        POST_LAUNCH_CHECK();
+
+        unsigned interp_feat = 0;
+        CUDA_CHECK(cudaMemcpyAsync(&interp_feat, d_count.get(),
+                                   sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        interp_feat = min(interp_feat, max_feat);
+
+        CUDA_CHECK(cudaMemsetAsync(d_count.get(), 0, sizeof(unsigned),
+                                   getActiveStream()));
+
+        if (interp_feat == 0) { continue; }
+
+        thrust::device_ptr<float> interp_x_ptr =
+            thrust::device_pointer_cast(d_interp_x.get());
+        thrust::device_ptr<float> interp_y_ptr =
+            thrust::device_pointer_cast(d_interp_y.get());
+        thrust::device_ptr<unsigned> interp_layer_ptr =
+            thrust::device_pointer_cast(d_interp_layer.get());
+        thrust::device_ptr<float> interp_response_ptr =
+            thrust::device_pointer_cast(d_interp_response.get());
+        thrust::device_ptr<float> interp_size_ptr =
+            thrust::device_pointer_cast(d_interp_size.get());
+
+        arrayfire::cuda::ThrustVector<int> permutation(interp_feat);
+        thrust::sequence(permutation.begin(), permutation.end());
+
+        update_permutation<float>(interp_size_ptr, permutation);
+        update_permutation<float>(interp_response_ptr, permutation);
+        update_permutation<unsigned>(interp_layer_ptr, permutation);
+        update_permutation<float>(interp_y_ptr, permutation);
+        update_permutation<float>(interp_x_ptr, permutation);
+
+        apply_permutation<float>(interp_size_ptr, permutation);
+        apply_permutation<float>(interp_response_ptr, permutation);
+        apply_permutation<unsigned>(interp_layer_ptr, permutation);
+        apply_permutation<float>(interp_y_ptr, permutation);
+        apply_permutation<float>(interp_x_ptr, permutation);
+
+        auto d_nodup_x        = memAlloc<float>(interp_feat);
+        auto d_nodup_y        = memAlloc<float>(interp_feat);
+        auto d_nodup_layer    = memAlloc<unsigned>(interp_feat);
+        auto d_nodup_response = memAlloc<float>(interp_feat);
+        auto d_nodup_size     = memAlloc<float>(interp_feat);
+
+        threads = dim3(SIFT_THREADS, 1);
+        blocks  = dim3(divup(interp_feat, threads.x), 1);
+
+        CUDA_LAUNCH((removeDuplicates), blocks, threads, d_nodup_x.get(),
+                    d_nodup_y.get(), d_nodup_layer.get(),
+                    d_nodup_response.get(), d_nodup_size.get(), d_count.get(),
+                    d_interp_x.get(), d_interp_y.get(), d_interp_layer.get(),
+                    d_interp_response.get(), d_interp_size.get(), interp_feat);
+        POST_LAUNCH_CHECK();
+
+        unsigned nodup_feat = 0;
+        CUDA_CHECK(cudaMemcpyAsync(&nodup_feat, d_count.get(), sizeof(unsigned),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        CUDA_CHECK(cudaMemsetAsync(d_count.get(), 0, sizeof(unsigned),
+                                   getActiveStream()));
+
+        const unsigned max_oriented_feat = nodup_feat * 3;
+
+        auto d_oriented_x        = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_y        = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_layer    = memAlloc<unsigned>(max_oriented_feat);
+        auto d_oriented_response = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_size     = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_ori      = memAlloc<float>(max_oriented_feat);
+
+        threads = dim3(SIFT_THREADS_X, SIFT_THREADS_Y);
+        blocks  = dim3(1, divup(nodup_feat, threads.y));
+
+        const size_t ori_shared_size =
+            ORI_HIST_BINS * threads.y * 2 * sizeof(float);
+        CUDA_LAUNCH_SMEM(
+            (calcOrientation<T>), blocks, threads, ori_shared_size,
+            d_oriented_x.get(), d_oriented_y.get(), d_oriented_layer.get(),
+            d_oriented_response.get(), d_oriented_size.get(),
+            d_oriented_ori.get(), d_count.get(), d_nodup_x.get(),
+            d_nodup_y.get(), d_nodup_layer.get(), d_nodup_response.get(),
+            d_nodup_size.get(), nodup_feat, CParam<T>(gauss_pyr[i]),
+            max_oriented_feat, i, double_input);
+        POST_LAUNCH_CHECK();
+
+        unsigned oriented_feat = 0;
+        CUDA_CHECK(cudaMemcpyAsync(&oriented_feat, d_count.get(),
+                                   sizeof(unsigned), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        oriented_feat = min(oriented_feat, max_oriented_feat);
+
+        if (oriented_feat == 0) { continue; }
+
+        auto d_desc = memAlloc<float>(oriented_feat * desc_len);
+
+        float scale = 1.f / (1 << i);
+        if (double_input) scale *= 2.f;
+
+        threads = dim3(SIFT_THREADS, 1);
+        blocks  = dim3(1, divup(oriented_feat, threads.y));
+
+        const unsigned histsz    = 8;
+        const size_t shared_size = desc_len * (histsz + 1) * sizeof(float);
+
+        if (compute_GLOH)
+            CUDA_LAUNCH_SMEM((computeGLOHDescriptor<T>), blocks, threads,
+                             shared_size, d_desc.get(), desc_len, histsz,
+                             d_oriented_x.get(), d_oriented_y.get(),
+                             d_oriented_layer.get(), d_oriented_response.get(),
+                             d_oriented_size.get(), d_oriented_ori.get(),
+                             oriented_feat, gauss_pyr[i], d, rb, ab, hb, scale,
+                             n_layers);
+        else
+            CUDA_LAUNCH_SMEM((computeDescriptor<T>), blocks, threads,
+                             shared_size, d_desc.get(), desc_len, histsz,
+                             d_oriented_x.get(), d_oriented_y.get(),
+                             d_oriented_layer.get(), d_oriented_response.get(),
+                             d_oriented_size.get(), d_oriented_ori.get(),
+                             oriented_feat, CParam<T>(gauss_pyr[i]), d, n,
+                             scale, n_layers);
+        POST_LAUNCH_CHECK();
+
+        total_feat += oriented_feat;
+        feat_pyr[i] = oriented_feat;
+
+        if (oriented_feat > 0) {
+            d_x_pyr[i]        = std::move(d_oriented_x);
+            d_y_pyr[i]        = std::move(d_oriented_y);
+            d_response_pyr[i] = std::move(d_oriented_response);
+            d_ori_pyr[i]      = std::move(d_oriented_ori);
+            d_size_pyr[i]     = std::move(d_oriented_size);
+            d_desc_pyr[i]     = std::move(d_desc);
+        }
+    }
+
+    // Allocate output memory
+    *d_x     = memAlloc<float>(total_feat).release();
+    *d_y     = memAlloc<float>(total_feat).release();
+    *d_score = memAlloc<float>(total_feat).release();
+    *d_ori   = memAlloc<float>(total_feat).release();
+    *d_size  = memAlloc<float>(total_feat).release();
+    *d_desc  = memAlloc<float>(total_feat * desc_len).release();
+
+    unsigned offset = 0;
+    for (unsigned i = 0; i < n_octaves; i++) {
+        if (feat_pyr[i] == 0) continue;
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_x + offset, d_x_pyr[i].get(), feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_y + offset, d_y_pyr[i].get(), feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(*d_score + offset, d_response_pyr[i].get(),
+                                   feat_pyr[i] * sizeof(float),
+                                   cudaMemcpyDeviceToDevice,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_ori + offset, d_ori_pyr[i].get(), feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(
+            *d_size + offset, d_size_pyr[i].get(), feat_pyr[i] * sizeof(float),
+            cudaMemcpyDeviceToDevice, getActiveStream()));
+
+        CUDA_CHECK(
+            cudaMemcpyAsync(*d_desc + (offset * desc_len), d_desc_pyr[i].get(),
+                            feat_pyr[i] * desc_len * sizeof(float),
+                            cudaMemcpyDeviceToDevice, getActiveStream()));
+
+        offset += feat_pyr[i];
+    }
+
+    // Set the number of output features and the descriptor length
+    *out_feat = total_feat;
+    *out_dlen = desc_len;
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/sobel.cuh b/src/backend/cuda/kernel/sobel.cuh
new file mode 100644
index 0000000000..03e333c414
--- /dev/null
+++ b/src/backend/cuda/kernel/sobel.cuh
@@ -0,0 +1,91 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+__device__ int reflect101(int index, int endIndex) {
+    return abs(endIndex - abs(endIndex - index));
+}
+
+template<typename Ti>
+__device__ Ti load2ShrdMem(const Ti* in, int d0, int d1, int gx, int gy,
+                           int inStride1, int inStride0) {
+    int idx =
+        reflect101(gx, d0 - 1) * inStride0 + reflect101(gy, d1 - 1) * inStride1;
+    return in[idx];
+}
+
+template<typename Ti, typename To>
+__global__ void sobel3x3(Param<To> dx, Param<To> dy, CParam<Ti> in, int nBBS0,
+                         int nBBS1) {
+    __shared__ Ti shrdMem[THREADS_X + 2][THREADS_Y + 2];
+
+    // calculate necessary offset and window parameters
+    const int radius  = 1;
+    const int padding = 2 * radius;
+    const int shrdLen = blockDim.x + padding;
+
+    // batch offsets
+    unsigned b2 = blockIdx.x / nBBS0;
+    unsigned b3 = blockIdx.y / nBBS1;
+    const Ti* iptr =
+        (const Ti*)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
+    To* dxptr = (To*)dx.ptr + (b2 * dx.strides[2] + b3 * dx.strides[3]);
+    To* dyptr = (To*)dy.ptr + (b2 * dy.strides[2] + b3 * dy.strides[3]);
+
+    // local neighborhood indices
+    int lx = threadIdx.x;
+    int ly = threadIdx.y;
+
+    // global indices
+    int gx = THREADS_X * (blockIdx.x - b2 * nBBS0) + lx;
+    int gy = THREADS_Y * (blockIdx.y - b3 * nBBS1) + ly;
+
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += blockDim.y, gy2 += blockDim.y) {
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += blockDim.x, gx2 += blockDim.x) {
+            shrdMem[a][b] =
+                load2ShrdMem<Ti>(iptr, in.dims[0], in.dims[1], gx2 - radius,
+                                 gy2 - radius, in.strides[1], in.strides[0]);
+        }
+    }
+
+    __syncthreads();
+
+    // Only continue if we're at a valid location
+    if (gx < in.dims[0] && gy < in.dims[1]) {
+        int i  = lx + radius;
+        int j  = ly + radius;
+        int _i = i - 1;
+        int i_ = i + 1;
+        int _j = j - 1;
+        int j_ = j + 1;
+
+        float NW = shrdMem[_i][_j];
+        float SW = shrdMem[i_][_j];
+        float NE = shrdMem[_i][j_];
+        float SE = shrdMem[i_][j_];
+
+        float t1                       = shrdMem[_i][j];
+        float t2                       = shrdMem[i_][j];
+        dxptr[gy * dx.strides[1] + gx] = (SW + SE - (NW + NE) + 2 * (t2 - t1));
+
+        t1                             = shrdMem[i][_j];
+        t2                             = shrdMem[i][j_];
+        dyptr[gy * dy.strides[1] + gx] = (NE + SE - (NW + SW) + 2 * (t2 - t1));
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/sobel.hpp b/src/backend/cuda/kernel/sobel.hpp
index e7de1ea445..710b930404 100644
--- a/src/backend/cuda/kernel/sobel.hpp
+++ b/src/backend/cuda/kernel/sobel.hpp
@@ -7,131 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/sobel_cuh.hpp>
 
-namespace cuda
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
 static const int THREADS_X = 16;
 static const int THREADS_Y = 16;
 
-template<typename Ti>
-__device__
-Ti load2ShrdMem(const Ti * in,
-               int dim0, int dim1,
-               int gx, int gy,
-               int inStride1, int inStride0)
-{
-    if (gx<0 || gx>=dim0 || gy<0 || gy>=dim1)
-        return Ti(0);
-    else
-        return in[gx*inStride0+gy*inStride1];
-}
-
 template<typename Ti, typename To>
-__global__
-void sobel3x3(Param<To> dx, Param<To> dy, CParam<Ti> in, int nBBS0, int nBBS1)
-{
-    __shared__ Ti shrdMem[THREADS_X+2][THREADS_Y+2];
-
-    // calculate necessary offset and window parameters
-    const int radius  = 1;
-    const int padding = 2*radius;
-
-    // batch offsets
-    unsigned b2 = blockIdx.x / nBBS0;
-    unsigned b3 = blockIdx.y / nBBS1;
-    const Ti* iptr     = (const Ti *)in.ptr + (b2 * in.strides[2] + b3 * in.strides[3]);
-    To*       dxptr    = (To *      )dx.ptr + (b2 * dx.strides[2] + b3 * dx.strides[3]);
-    To*       dyptr    = (To *      )dy.ptr + (b2 * dy.strides[2] + b3 * dy.strides[3]);
-
-    // local neighborhood indices
-    int lx = threadIdx.x;
-    int ly = threadIdx.y;
-
-    // global indices
-    int gx = THREADS_X * (blockIdx.x-b2*nBBS0) + lx;
-    int gy = THREADS_Y * (blockIdx.y-b3*nBBS1) + ly;
-
-    // offset values for pulling image to local memory
-    int lx2 = lx + THREADS_X;
-    int ly2 = ly + THREADS_Y;
-    int gx2 = gx + THREADS_X;
-    int gy2 = gy + THREADS_Y;
-
-    // pull image to local memory
-    shrdMem[lx][ly] = load2ShrdMem<Ti>(iptr, in.dims[0], in.dims[1],
-                                      gx-radius, gy-radius,
-                                      in.strides[1], in.strides[0]);
-    if (lx<padding) {
-        shrdMem[lx2][ly] = load2ShrdMem<Ti>(iptr, in.dims[0], in.dims[1],
-                                           gx2-radius, gy-radius,
-                                           in.strides[1], in.strides[0]);
-    }
-    if (ly<padding) {
-        shrdMem[lx][ly2] = load2ShrdMem<Ti>(iptr, in.dims[0], in.dims[1],
-                                           gx-radius, gy2-radius,
-                                           in.strides[1], in.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        shrdMem[lx2][ly2] = load2ShrdMem<Ti>(iptr, in.dims[0], in.dims[1],
-                                            gx2-radius, gy2-radius,
-                                            in.strides[1], in.strides[0]);
-    }
-    __syncthreads();
+void sobel(Param<To> dx, Param<To> dy, CParam<Ti> in,
+           const unsigned& ker_size) {
+    UNUSED(ker_size);
 
-    // Only continue if we're at a valid location
-    if (gx < in.dims[0] && gy < in.dims[1]) {
-        int i = lx + radius;
-        int j = ly + radius;
-        int _i = i-1;
-        int i_ = i+1;
-        int _j = j-1;
-        int j_ = j+1;
+    auto sobel3x3 = common::getKernel(
+        "arrayfire::cuda::sobel3x3", {{sobel_cuh_src}},
+        TemplateArgs(TemplateTypename<Ti>(), TemplateTypename<To>()),
+        {{DefineValue(THREADS_X), DefineValue(THREADS_Y)}});
 
-        float NW = shrdMem[_i][_j];
-        float SW = shrdMem[i_][_j];
-        float NE = shrdMem[_i][j_];
-        float SE = shrdMem[i_][j_];
-
-        float t1 = shrdMem[i][_j];
-        float t2 = shrdMem[i][j_];
-        dxptr[gy*dx.strides[1]+gx] = (NW+SW - (NE+SE) + 2*(t1-t2));
-
-        t1 = shrdMem[_i][j];
-        t2 = shrdMem[i_][j];
-        dyptr[gy*dy.strides[1]+gx] = (NW+NE - (SW+SE) + 2*(t1-t2));
-
-    }
-}
-
-template<typename Ti, typename To>
-void sobel(Param<To> dx, Param<To> dy, CParam<Ti> in, const unsigned &ker_size)
-{
     const dim3 threads(THREADS_X, THREADS_Y);
 
     int blk_x = divup(in.dims[0], threads.x);
     int blk_y = divup(in.dims[1], threads.y);
 
-    dim3 blocks(blk_x*in.dims[2], blk_y*in.dims[3]);
+    dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
 
-    //TODO: add more cases when 5x5 and 7x7 kernels are done
-    switch(ker_size) {
-        case  3:
-            (sobel3x3<Ti, To>) <<< blocks, threads >>> (dx, dy, in, blk_x, blk_y);
-            break;
-    }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-    POST_LAUNCH_CHECK();
-}
+    // TODO: dispatch to other kernels once 5x5 and 7x7 support is added
+    // Note: ker_size is validated at the sobel API entry point
+    sobel3x3(qArgs, dx, dy, in, blk_x, blk_y);
 
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/sort.hpp b/src/backend/cuda/kernel/sort.hpp
index 34256d7b4a..23ee41b820 100644
--- a/src/backend/cuda/kernel/sort.hpp
+++ b/src/backend/cuda/kernel/sort.hpp
@@ -7,48 +7,78 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
 #include <debug_cuda.hpp>
-#include <thrust/device_ptr.h>
+#include <err_cuda.hpp>
+#include <handle.hpp>
+#include <iota.hpp>
+#include <kernel/thrust_sort_by_key.hpp>
+#include <math.hpp>
 #include <thrust/sort.h>
+#include <thrust_utils.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, bool isAscending>
-        void sort0(Param<T> val)
-        {
-            thrust::device_ptr<T> val_ptr = thrust::device_pointer_cast(val.ptr);
-
-            for(int w = 0; w < val.dims[3]; w++) {
-                int valW = w * val.strides[3];
-                for(int z = 0; z < val.dims[2]; z++) {
-                    int valWZ = valW + z * val.strides[2];
-                    for(int y = 0; y < val.dims[1]; y++) {
-
-                        int valOffset = valWZ + y * val.strides[1];
-
-                        if(isAscending) {
-                            thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0]);
-                        } else {
-                            thrust::sort(val_ptr + valOffset, val_ptr + valOffset + val.dims[0],
-                                         thrust::greater<T>());
-                        }
-                    }
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Wrapper functions
+template<typename T>
+void sort0Iterative(Param<T> val, bool isAscending) {
+    for (int w = 0; w < val.dims[3]; w++) {
+        int valW = w * val.strides[3];
+        for (int z = 0; z < val.dims[2]; z++) {
+            int valWZ = valW + z * val.strides[2];
+            for (int y = 0; y < val.dims[1]; y++) {
+                int valOffset = valWZ + y * val.strides[1];
+
+                if (isAscending) {
+                    THRUST_SELECT(thrust::sort, val.ptr + valOffset,
+                                  val.ptr + valOffset + val.dims[0]);
+                } else {
+                    THRUST_SELECT(thrust::sort, val.ptr + valOffset,
+                                  val.ptr + valOffset + val.dims[0],
+                                  thrust::greater<T>());
                 }
             }
-            POST_LAUNCH_CHECK();
         }
     }
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T>
+void sortBatched(Param<T> pVal, int dim, bool isAscending) {
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pVal.dims[i];
+
+    // Sort dimension
+    // tileDims * seqDims = inDims
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    Array<uint> pKey = iota<uint>(seqDims, tileDims);
+
+    pVal = flat(pVal);
+
+    // Sort indices
+    // sort_by_key<T, uint, isAscending>(*resVal, *resKey, val, key, 0);
+    thrustSortByKey(pVal.ptr, pKey.get(), pVal.dims[0], isAscending);
+
+    // Needs to be ascending (true) in order to maintain the indices properly
+    thrustSortByKey(pKey.get(), pVal.ptr, pVal.dims[0], true);
+}
+
+template<typename T>
+void sort0(Param<T> val, bool isAscending) {
+    int higherDims = val.dims[1] * val.dims[2] * val.dims[3];
+
+    if (higherDims > 10)
+        sortBatched<T>(val, 0, isAscending);
+    else
+        kernel::sort0Iterative<T>(val, isAscending);
 }
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
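Note on `sortBatched` above: it tags every element with the index of the slice it belongs to (the `iota` keys), sorts the flattened buffer by value, then sorts again by the tag so the slices regroup with their elements already in order. Below is a minimal host-side sketch of that two-pass idea, assuming a contiguous column-major buffer. `batchedColumnSort` is a hypothetical name, not an ArrayFire function, and `std::stable_sort` stands in for the Thrust calls. The sketch relies on the second pass being stable.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of ArrayFire): sorts every column of a
// contiguous column-major buffer with two whole-buffer passes instead of
// one sort call per column.
std::vector<float> batchedColumnSort(const std::vector<float>& data,
                                     std::size_t rows, bool ascending) {
    assert(rows > 0 && data.size() % rows == 0);
    using Item = std::pair<float, unsigned>;  // (value, column id)

    // Tag each element with its column id; this plays the role of the
    // iota-generated keys in the diff.
    std::vector<Item> items(data.size());
    for (std::size_t i = 0; i < data.size(); ++i) {
        items[i] = Item(data[i], static_cast<unsigned>(i / rows));
    }

    // Pass 1: order the whole buffer by value, carrying the column ids along.
    if (ascending) {
        std::stable_sort(
            items.begin(), items.end(),
            [](const Item& a, const Item& b) { return a.first < b.first; });
    } else {
        std::stable_sort(
            items.begin(), items.end(),
            [](const Item& a, const Item& b) { return a.first > b.first; });
    }

    // Pass 2: regroup by column id, always ascending. Stability keeps the
    // value order produced by pass 1 inside each column.
    std::stable_sort(
        items.begin(), items.end(),
        [](const Item& a, const Item& b) { return a.second < b.second; });

    std::vector<float> out(data.size());
    for (std::size_t i = 0; i < items.size(); ++i) out[i] = items[i].first;
    return out;
}
```

With two columns of three elements, `{3, 1, 2, 9, 7, 8}`, the ascending call returns `{1, 2, 3, 7, 8, 9}` and the descending call returns `{3, 2, 1, 9, 8, 7}`. The trade-off is two large sorts against many small ones, which is presumably why `sort0` only takes this path when the batch count is above a threshold.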
diff --git a/src/backend/cuda/kernel/sort_by_key.hpp b/src/backend/cuda/kernel/sort_by_key.hpp
index 7c63c2824f..aea6bebb85 100644
--- a/src/backend/cuda/kernel/sort_by_key.hpp
+++ b/src/backend/cuda/kernel/sort_by_key.hpp
@@ -7,53 +7,93 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
 #include <debug_cuda.hpp>
-#include <thrust/device_ptr.h>
-#include <thrust/sort.h>
-
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename Tk, typename Tv, bool isAscending>
-        void sort0_by_key(Param<Tk> okey, Param<Tv> oval)
-        {
-            thrust::device_ptr<Tk>       okey_ptr = thrust::device_pointer_cast(okey.ptr);
-            thrust::device_ptr<Tv>       oval_ptr = thrust::device_pointer_cast(oval.ptr);
-
-            for(int w = 0; w < okey.dims[3]; w++) {
-                int okeyW = w * okey.strides[3];
-                int ovalW = w * oval.strides[3];
-                for(int z = 0; z < okey.dims[2]; z++) {
-                    int okeyWZ = okeyW + z * okey.strides[2];
-                    int ovalWZ = ovalW + z * oval.strides[2];
-                    for(int y = 0; y < okey.dims[1]; y++) {
-
-                        int okeyOffset = okeyWZ + y * okey.strides[1];
-                        int ovalOffset = ovalWZ + y * oval.strides[1];
-
-                        if(isAscending) {
-                            thrust::sort_by_key(okey_ptr + okeyOffset, okey_ptr + okeyOffset + okey.dims[0],
-                                                oval_ptr + ovalOffset);
-                        } else {
-                            thrust::sort_by_key(okey_ptr + okeyOffset, okey_ptr + okeyOffset + okey.dims[0],
-                                                oval_ptr + ovalOffset, thrust::greater<Tk>());
-                        }
-                    }
-                }
+#include <err_cuda.hpp>
+#include <iota.hpp>
+#include <kernel/thrust_sort_by_key.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Wrapper functions
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param<Tk> okey, Param<Tv> oval, bool isAscending) {
+    for (int w = 0; w < okey.dims[3]; w++) {
+        int okeyW = w * okey.strides[3];
+        int ovalW = w * oval.strides[3];
+        for (int z = 0; z < okey.dims[2]; z++) {
+            int okeyWZ = okeyW + z * okey.strides[2];
+            int ovalWZ = ovalW + z * oval.strides[2];
+            for (int y = 0; y < okey.dims[1]; y++) {
+                int okeyOffset = okeyWZ + y * okey.strides[1];
+                int ovalOffset = ovalWZ + y * oval.strides[1];
+
+                thrustSortByKey<Tk, Tv>(okey.ptr + okeyOffset,
+                                        oval.ptr + ovalOffset, okey.dims[0],
+                                        isAscending);
             }
-            POST_LAUNCH_CHECK();
         }
     }
+    POST_LAUNCH_CHECK();
+}
+
+template<typename Tk, typename Tv>
+void sortByKeyBatched(Param<Tk> pKey, Param<Tv> pVal, const int dim,
+                      bool isAscending) {
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pKey.dims[i];
+
+    const dim_t elements = inDims.elements();
+
+    // Sort dimension
+    // tileDims * seqDims = inDims
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    Array<uint> Seq = iota<uint>(seqDims, tileDims);
+
+    Tk *Key   = pKey.ptr;
+    auto cKey = memAlloc<Tk>(elements);
+    CUDA_CHECK(cudaMemcpyAsync(cKey.get(), Key, elements * sizeof(Tk),
+                               cudaMemcpyDeviceToDevice, getActiveStream()));
+
+    Tv *Val = pVal.ptr;
+    thrustSortByKey(Key, Val, elements, isAscending);
+    thrustSortByKey(cKey.get(), Seq.get(), elements, isAscending);
+
+    auto cSeq = memAlloc<uint>(elements);
+    CUDA_CHECK(cudaMemcpyAsync(cSeq.get(), Seq.get(), elements * sizeof(uint),
+                               cudaMemcpyDeviceToDevice, getActiveStream()));
+
+    // This always needs to be ascending
+    thrustSortByKey(Seq.get(), Val, elements, true);
+    thrustSortByKey(cSeq.get(), Key, elements, true);
+
+    // No need of doing moddims here because the original Array<T>
+    // dimensions have not been changed
+    // val.modDims(inDims);
+}
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param<Tk> okey, Param<Tv> oval, bool isAscending) {
+    int higherDims = okey.dims[1] * okey.dims[2] * okey.dims[3];
+
+    // Batched sort performs 4x sort-by-key calls, but this is only useful
+    // before the GPU is saturated. The GPU is saturated at around 100,000
+    // integers. Call batched sort only if both conditions are met.
+    if (higherDims > 4 && okey.dims[0] < 100000)
+        kernel::sortByKeyBatched<Tk, Tv>(okey, oval, 0, isAscending);
+    else
+        kernel::sort0ByKeyIterative<Tk, Tv>(okey, oval, isAscending);
 }
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/sort_index.hpp b/src/backend/cuda/kernel/sort_index.hpp
deleted file mode 100644
index df1febace8..0000000000
--- a/src/backend/cuda/kernel/sort_index.hpp
+++ /dev/null
@@ -1,57 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <math.hpp>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <thrust/device_ptr.h>
-#include <thrust/sequence.h>
-#include <thrust/sort.h>
-
-namespace cuda
-{
-    namespace kernel
-    {
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, bool isAscending>
-        void sort0_index(Param<T> val, Param<unsigned> idx)
-        {
-            thrust::device_ptr<T>        val_ptr = thrust::device_pointer_cast(val.ptr);
-            thrust::device_ptr<unsigned> idx_ptr = thrust::device_pointer_cast(idx.ptr);
-
-            for(int w = 0; w < val.dims[3]; w++) {
-                int valW = w * val.strides[3];
-                int idxW = w * idx.strides[3];
-                for(int z = 0; z < val.dims[2]; z++) {
-                    int valWZ = valW + z * val.strides[2];
-                    int idxWZ = idxW + z * idx.strides[2];
-                    for(int y = 0; y < val.dims[1]; y++) {
-
-                        int valOffset = valWZ + y * val.strides[1];
-                        int idxOffset = idxWZ + y * idx.strides[1];
-
-                        thrust::sequence(idx_ptr + idxOffset, idx_ptr + idxOffset + idx.dims[0]);
-                        if(isAscending) {
-                            thrust::sort_by_key(val_ptr + valOffset, val_ptr + valOffset + val.dims[0],
-                                                idx_ptr + idxOffset);
-                        } else {
-                            thrust::sort_by_key(val_ptr + valOffset, val_ptr + valOffset + val.dims[0],
-                                                idx_ptr + idxOffset, thrust::greater<T>());
-                        }
-                    }
-                }
-            }
-            POST_LAUNCH_CHECK();
-        }
-    }
-}
diff --git a/src/backend/cuda/kernel/sparse.cuh b/src/backend/cuda/kernel/sparse.cuh
new file mode 100644
index 0000000000..84825bdd24
--- /dev/null
+++ b/src/backend/cuda/kernel/sparse.cuh
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void coo2Dense(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+                          CParam<int> colIdx) {
+    for (int i = threadIdx.x; i < reps * blockDim.x; i += blockDim.x) {
+        int id = i + blockIdx.x * blockDim.x * reps;
+        if (id >= values.dims[0]) return;
+
+        T v   = values.ptr[id];
+        int r = rowIdx.ptr[id];
+        int c = colIdx.ptr[id];
+
+        int offset = r + c * output.strides[1];
+
+        output.ptr[offset] = v;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
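Note on `coo2Dense` above: each COO entry `(r, c, v)` is scattered to offset `r + c * strides[1]` of a column-major output. Below is a minimal host-side sketch of the same scatter, assuming a contiguous output where the column stride equals the row count. `cooToDense` is a hypothetical name. Unlike the kernel, the sketch zero-fills the output itself and skips out-of-range indices. Whether the kernel's caller does either is not shown in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ArrayFire): expands COO triplets into a
// dense column-major matrix of size rows x cols.
std::vector<float> cooToDense(const std::vector<float>& values,
                              const std::vector<int>& rowIdx,
                              const std::vector<int>& colIdx, int rows,
                              int cols) {
    assert(rows >= 0 && cols >= 0);
    assert(values.size() == rowIdx.size() && values.size() == colIdx.size());

    std::vector<float> dense(static_cast<std::size_t>(rows) * cols, 0.0f);
    for (std::size_t i = 0; i < values.size(); ++i) {
        const int r = rowIdx[i];
        const int c = colIdx[i];
        // Assumption of this sketch: silently skip indices outside the matrix.
        if (r < 0 || r >= rows || c < 0 || c >= cols) continue;

        // Same offset formula as the kernel, with strides[1] == rows.
        dense[static_cast<std::size_t>(r) + static_cast<std::size_t>(c) * rows] =
            values[i];
    }
    return dense;
}
```

For a 2x2 matrix with entries `(0, 1) = 5` and `(1, 0) = 7`, the result is `{0, 7, 5, 0}` in column-major order. On the device, the loop over `i` is what the kernel spreads across threads, with each block covering `blockDim.x * reps` consecutive entries.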
diff --git a/src/backend/cuda/kernel/sparse.hpp b/src/backend/cuda/kernel/sparse.hpp
new file mode 100644
index 0000000000..60068d3e20
--- /dev/null
+++ b/src/backend/cuda/kernel/sparse.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/sparse_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+void coo2dense(Param<T> output, CParam<T> values, CParam<int> rowIdx,
+               CParam<int> colIdx) {
+    constexpr int reps = 4;
+
+    auto coo2Dense = common::getKernel(
+        "arrayfire::cuda::coo2Dense", {{sparse_cuh_src}},
+        TemplateArgs(TemplateTypename<T>()), {{DefineValue(reps)}});
+
+    dim3 threads(256, 1, 1);
+
+    dim3 blocks(divup(values.dims[0], threads.x * reps), 1, 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    coo2Dense(qArgs, output, values, rowIdx, colIdx);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/sparse_arith.cuh b/src/backend/cuda/kernel/sparse_arith.cuh
new file mode 100644
index 0000000000..5357805abe
--- /dev/null
+++ b/src/backend/cuda/kernel/sparse_arith.cuh
@@ -0,0 +1,156 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, af_op_t op>
+struct arith_op {
+    T operator()(T v1, T v2) { return T(0); }
+};
+
+template<typename T>
+struct arith_op<T, af_add_t> {
+    T operator()(T v1, T v2) { return v1 + v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_sub_t> {
+    T operator()(T v1, T v2) { return v1 - v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_mul_t> {
+    T operator()(T v1, T v2) { return v1 * v2; }
+};
+
+template<typename T>
+struct arith_op<T, af_div_t> {
+    T operator()(T v1, T v2) { return v1 / v2; }
+};
+
+// All Kernels follow below naming convention
+// <format>ArithXYZ where
+// <format> is either csr or coo
+// X - D for Dense output, S for sparse output
+// Y - D for Dense lhs, S for sparse lhs
+// Z - D for Dense rhs, S for sparse rhs
+
+template<typename T, af_op_t op>
+__global__ void csrArithDSD(Param<T> out, CParam<T> values, CParam<int> rowIdx,
+                            CParam<int> colIdx, CParam<T> rhs,
+                            const bool reverse) {
+    const int row = blockIdx.x * TY + threadIdx.y;
+
+    if (row >= out.dims[0]) return;
+
+    const int rowStartIdx = rowIdx.ptr[row];
+    const int rowEndIdx   = rowIdx.ptr[row + 1];
+
+    // Repeat loop until all values in the row are computed
+    for (int idx = rowStartIdx + threadIdx.x; idx < rowEndIdx; idx += TX) {
+        const int col = colIdx.ptr[idx];
+
+        if (row >= out.dims[0] || col >= out.dims[1]) continue;  // Bad indices
+
+        // Get Values
+        const T val  = values.ptr[idx];
+        const T rval = rhs.ptr[col * rhs.strides[1] + row];
+
+        const int offset = col * out.strides[1] + row;
+        if (reverse)
+            out.ptr[offset] = arith_op<T, op>()(rval, val);
+        else
+            out.ptr[offset] = arith_op<T, op>()(val, rval);
+    }
+}
+
+template<typename T, af_op_t op>
+__global__ void cooArithDSD(Param<T> out, CParam<T> values, CParam<int> rowIdx,
+                            CParam<int> colIdx, CParam<T> rhs,
+                            const bool reverse) {
+    const int idx = blockIdx.x * THREADS + threadIdx.x;
+
+    if (idx >= values.dims[0]) return;
+
+    const int row = rowIdx.ptr[idx];
+    const int col = colIdx.ptr[idx];
+
+    if (row >= out.dims[0] || col >= out.dims[1]) return;  // Bad indices
+
+    // Get Values
+    const T val  = values.ptr[idx];
+    const T rval = rhs.ptr[col * rhs.strides[1] + row];
+
+    const int offset = col * out.strides[1] + row;
+    if (reverse)
+        out.ptr[offset] = arith_op<T, op>()(rval, val);
+    else
+        out.ptr[offset] = arith_op<T, op>()(val, rval);
+}
+
+template<typename T, af_op_t op>
+__global__ void csrArithSSD(Param<T> values, Param<int> rowIdx,
+                            Param<int> colIdx, CParam<T> rhs,
+                            const bool reverse) {
+    const int row = blockIdx.x * TY + threadIdx.y;
+
+    if (row >= rhs.dims[0]) return;
+
+    const int rowStartIdx = rowIdx.ptr[row];
+    const int rowEndIdx   = rowIdx.ptr[row + 1];
+
+    // Repeat loop until all values in the row are computed
+    for (int idx = rowStartIdx + threadIdx.x; idx < rowEndIdx; idx += TX) {
+        const int col = colIdx.ptr[idx];
+
+        if (row >= rhs.dims[0] || col >= rhs.dims[1]) continue;  // Bad indices
+
+        // Get Values
+        const T val  = values.ptr[idx];
+        const T rval = rhs.ptr[col * rhs.strides[1] + row];
+
+        if (reverse)
+            values.ptr[idx] = arith_op<T, op>()(rval, val);
+        else
+            values.ptr[idx] = arith_op<T, op>()(val, rval);
+    }
+}
+
+template<typename T, af_op_t op>
+__global__ void cooArithSSD(Param<T> values, Param<int> rowIdx,
+                            Param<int> colIdx, CParam<T> rhs,
+                            const bool reverse) {
+    const int idx = blockIdx.x * THREADS + threadIdx.x;
+
+    if (idx >= values.dims[0]) return;
+
+    const int row = rowIdx.ptr[idx];
+    const int col = colIdx.ptr[idx];
+
+    if (row >= rhs.dims[0] || col >= rhs.dims[1]) return;  // Bad indices
+
+    // Get Values
+    const T val  = values.ptr[idx];
+    const T rval = rhs.ptr[col * rhs.strides[1] + row];
+
+    if (reverse)
+        values.ptr[idx] = arith_op<T, op>()(rval, val);
+    else
+        values.ptr[idx] = arith_op<T, op>()(val, rval);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
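Note on `csrArithSSD` above: each stored value of the sparse operand is combined with the dense element at the same (row, column) position, and the result overwrites the stored value, so the sparsity pattern is unchanged. Below is a minimal host-side sketch of that traversal, assuming a contiguous column-major dense operand. `csrArithInPlace` is a hypothetical name, and the `op` functor stands in for the `arith_op` specializations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper (not part of ArrayFire): applies op between each stored
// CSR value and the dense element at the same (row, col), in place.
// rhs is a dense rows x cols matrix in column-major order.
template<typename Op>
void csrArithInPlace(std::vector<float>& values, const std::vector<int>& rowPtr,
                     const std::vector<int>& colIdx,
                     const std::vector<float>& rhs, int rows, int cols, Op op,
                     bool reverse) {
    assert(rowPtr.size() == static_cast<std::size_t>(rows) + 1);
    assert(rhs.size() == static_cast<std::size_t>(rows) * cols);

    for (int row = 0; row < rows; ++row) {
        // rowPtr[row] .. rowPtr[row + 1] delimits the entries of this row.
        for (int idx = rowPtr[row]; idx < rowPtr[row + 1]; ++idx) {
            const int col = colIdx[idx];
            if (col < 0 || col >= cols) continue;  // bad index, skip it

            const float val  = values[idx];
            const float rval = rhs[static_cast<std::size_t>(col) * rows + row];

            // reverse swaps the operand order, which matters for sub and div.
            values[idx] = reverse ? op(rval, val) : op(val, rval);
        }
    }
}
```

With stored values `{1, 2, 3}`, row pointers `{0, 2, 3}`, column indices `{0, 1, 1}` and a dense operand `{10, 20, 30, 40}`, subtraction yields `{-9, -28, -37}`, or `{9, 28, 37}` with `reverse` set. In the kernel, the outer loop maps to `threadIdx.y` (one row per Y thread) and the inner loop is strided by `TX` across `threadIdx.x`.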
diff --git a/src/backend/cuda/kernel/sparse_arith.hpp b/src/backend/cuda/kernel/sparse_arith.hpp
new file mode 100644
index 0000000000..b21d2130e5
--- /dev/null
+++ b/src/backend/cuda/kernel/sparse_arith.hpp
@@ -0,0 +1,109 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/sparse_arith_cuh.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr unsigned TX      = 32;
+constexpr unsigned TY      = 8;
+constexpr unsigned THREADS = TX * TY;
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param<T> out, CParam<T> values, CParam<int> rowIdx,
+                      CParam<int> colIdx, CParam<T> rhs, const bool reverse) {
+    auto csrArithDSD = common::getKernel(
+        "arrayfire::cuda::csrArithDSD", {{sparse_arith_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op)),
+        {{DefineValue(TX), DefineValue(TY)}});
+
+    // Each Y for threads does one row
+    dim3 threads(TX, TY, 1);
+
+    // No. of blocks = divup(no. of rows, threads.y). No blocks on Y
+    dim3 blocks(divup(out.dims[0], TY), 1, 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    csrArithDSD(qArgs, out, values, rowIdx, colIdx, rhs, reverse);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param<T> out, CParam<T> values, CParam<int> rowIdx,
+                      CParam<int> colIdx, CParam<T> rhs, const bool reverse) {
+    auto cooArithDSD = common::getKernel(
+        "arrayfire::cuda::cooArithDSD", {{sparse_arith_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op)),
+        {{DefineValue(THREADS)}});
+
+    // Linear indexing with one element per thread
+    dim3 threads(THREADS, 1, 1);
+
+    // No. of blocks = divup(no. of values, THREADS). No blocks on Y
+    dim3 blocks(divup(values.dims[0], THREADS), 1, 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    cooArithDSD(qArgs, out, values, rowIdx, colIdx, rhs, reverse);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+                      CParam<T> rhs, const bool reverse) {
+    auto csrArithSSD = common::getKernel(
+        "arrayfire::cuda::csrArithSSD", {{sparse_arith_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op)),
+        {{DefineValue(TX), DefineValue(TY)}});
+
+    // Each Y for threads does one row
+    dim3 threads(TX, TY, 1);
+
+    // No. of blocks = divup(no. of rows, threads.y). No blocks on Y
+    dim3 blocks(divup(rhs.dims[0], TY), 1, 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    csrArithSSD(qArgs, values, rowIdx, colIdx, rhs, reverse);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+                      CParam<T> rhs, const bool reverse) {
+    auto cooArithSSD = common::getKernel(
+        "arrayfire::cuda::cooArithSSD", {{sparse_arith_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(op)),
+        {{DefineValue(THREADS)}});
+
+    // Linear indexing with one element per thread
+    dim3 threads(THREADS, 1, 1);
+
+    // No. of blocks = divup(no. of values, THREADS). No blocks on Y
+    dim3 blocks(divup(values.dims[0], THREADS), 1, 1);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    cooArithSSD(qArgs, values, rowIdx, colIdx, rhs, reverse);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/susan.cuh b/src/backend/cuda/kernel/susan.cuh
new file mode 100644
index 0000000000..5bb7f28805
--- /dev/null
+++ b/src/backend/cuda/kernel/susan.cuh
@@ -0,0 +1,125 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+#include <shared.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+inline __device__ int max_val(const int x, const int y) { return max(x, y); }
+inline __device__ unsigned max_val(const unsigned x, const unsigned y) {
+    return max(x, y);
+}
+inline __device__ float max_val(const float x, const float y) {
+    return fmax(x, y);
+}
+inline __device__ double max_val(const double x, const double y) {
+    return fmax(x, y);
+}
+
+template<typename T>
+__global__ void susan(T* out, const T* in, const unsigned idim0,
+                      const unsigned idim1, const unsigned radius,
+                      const float t, const float g, const unsigned edge) {
+    const int rSqrd   = radius * radius;
+    const int windLen = 2 * radius + 1;
+    const int shrdLen = BLOCK_X + windLen - 1;
+
+    SharedMemory<T> shared;
+    T* shrdMem = shared.getPointer();
+
+    const unsigned lx = threadIdx.x;
+    const unsigned ly = threadIdx.y;
+    const unsigned gx = blockDim.x * blockIdx.x + lx + edge;
+    const unsigned gy = blockDim.y * blockIdx.y + ly + edge;
+
+    const unsigned nucleusIdx = (ly + radius) * shrdLen + lx + radius;
+    shrdMem[nucleusIdx] = gx < idim0 && gy < idim1 ? in[gy * idim0 + gx] : 0;
+    T m_0               = shrdMem[nucleusIdx];
+
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < shrdLen; b += BLOCK_Y, gy2 += BLOCK_Y) {
+        int j = gy2 - radius;
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < shrdLen; a += BLOCK_X, gx2 += BLOCK_X) {
+            int i = gx2 - radius;
+            shrdMem[b * shrdLen + a] =
+                (i < idim0 && j < idim1 ? in[j * idim0 + i] : m_0);
+        }
+    }
+    __syncthreads();
+
+    if (gx < idim0 - edge && gy < idim1 - edge) {
+        unsigned idx = gy * idim0 + gx;
+        float nM     = 0.0f;
+#pragma unroll
+        for (int p = 0; p < windLen; ++p) {
+#pragma unroll
+            for (int q = 0; q < windLen; ++q) {
+                int i = p - radius;
+                int j = q - radius;
+                int a = lx + radius + i;
+                int b = ly + radius + j;
+                if (i * i + j * j < rSqrd) {
+                    float c       = m_0;
+                    float m       = shrdMem[b * shrdLen + a];
+                    float exp_pow = afpowf((m - c) / t, 6.0f);
+                    float cM      = expf(-exp_pow);
+                    nM += cM;
+                }
+            }
+        }
+        out[idx] = nM < g ? g - nM : T(0);
+    }
+}
+
+template<typename T>
+__global__ void nonMax(float* x_out, float* y_out, float* resp_out,
+                       unsigned* count, const unsigned idim0,
+                       const unsigned idim1, const T* resp_in,
+                       const unsigned edge, const unsigned max_corners) {
+    // Responses on the border don't have 8-neighbors to compare, discard them
+    const unsigned r = edge + 1;
+
+    const unsigned gx = blockDim.x * blockIdx.x + threadIdx.x + r;
+    const unsigned gy = blockDim.y * blockIdx.y + threadIdx.y + r;
+
+    if (gx < idim0 - r && gy < idim1 - r) {
+        const T v = resp_in[gy * idim0 + gx];
+
+        // Find maximum neighborhood response
+        T max_v;
+        max_v = max_val(resp_in[(gy - 1) * idim0 + gx - 1],
+                        resp_in[gy * idim0 + gx - 1]);
+        max_v = max_val(max_v, resp_in[(gy + 1) * idim0 + gx - 1]);
+        max_v = max_val(max_v, resp_in[(gy - 1) * idim0 + gx]);
+        max_v = max_val(max_v, resp_in[(gy + 1) * idim0 + gx]);
+        max_v = max_val(max_v, resp_in[(gy - 1) * idim0 + gx + 1]);
+        max_v = max_val(max_v, resp_in[(gy)*idim0 + gx + 1]);
+        max_v = max_val(max_v, resp_in[(gy + 1) * idim0 + gx + 1]);
+
+        // Stores corner to {x,y,resp}_out if its response is maximum compared
+        // to its 8-neighborhood and greater than or equal to minimum response
+        if (v > max_v) {
+            unsigned idx = atomicAdd(count, 1u);
+            if (idx < max_corners) {
+                x_out[idx]    = (float)gx;
+                y_out[idx]    = (float)gy;
+                resp_out[idx] = (float)v;
+            }
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/susan.hpp b/src/backend/cuda/kernel/susan.hpp
new file mode 100644
index 0000000000..28a96a1e6d
--- /dev/null
+++ b/src/backend/cuda/kernel/susan.hpp
@@ -0,0 +1,75 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/susan_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+constexpr unsigned BLOCK_X = 16;
+constexpr unsigned BLOCK_Y = 16;
+
+template<typename T>
+void susan_responses(T* out, const T* in, const unsigned idim0,
+                     const unsigned idim1, const int radius, const float t,
+                     const float g, const unsigned edge) {
+    auto susan =
+        common::getKernel("arrayfire::cuda::susan", {{susan_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()),
+                          {{DefineValue(BLOCK_X), DefineValue(BLOCK_Y)}});
+
+    dim3 threads(BLOCK_X, BLOCK_Y);
+    dim3 blocks(divup(idim0 - edge * 2, BLOCK_X),
+                divup(idim1 - edge * 2, BLOCK_Y));
+    const size_t SMEM_SIZE =
+        (BLOCK_X + 2 * radius) * (BLOCK_Y + 2 * radius) * sizeof(T);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream(), SMEM_SIZE);
+
+    susan(qArgs, out, in, idim0, idim1, radius, t, g, edge);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T>
+void nonMaximal(float* x_out, float* y_out, float* resp_out, unsigned* count,
+                const unsigned idim0, const unsigned idim1, const T* resp_in,
+                const unsigned edge, const unsigned max_corners) {
+    auto nonMax =
+        common::getKernel("arrayfire::cuda::nonMax", {{susan_cuh_src}},
+                          TemplateArgs(TemplateTypename<T>()));
+
+    dim3 threads(BLOCK_X, BLOCK_Y);
+    dim3 blocks(divup(idim0 - edge * 2, BLOCK_X),
+                divup(idim1 - edge * 2, BLOCK_Y));
+
+    auto d_corners_found = memAlloc<unsigned>(1);
+    CUDA_CHECK(cudaMemsetAsync(d_corners_found.get(), 0, sizeof(unsigned),
+                               getActiveStream()));
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    nonMax(qArgs, x_out, y_out, resp_out, d_corners_found.get(), idim0, idim1,
+           resp_in, edge, max_corners);
+    POST_LAUNCH_CHECK();
+
+    CUDA_CHECK(cudaMemcpyAsync(count, d_corners_found.get(), sizeof(unsigned),
+                               cudaMemcpyDeviceToHost, getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/thrust_sort_by_key.hpp b/src/backend/cuda/kernel/thrust_sort_by_key.hpp
new file mode 100644
index 0000000000..9bf2a9b7a3
--- /dev/null
+++ b/src/backend/cuda/kernel/thrust_sort_by_key.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Wrapper functions
+template<typename Tk, typename Tv>
+void thrustSortByKey(Tk *keyPtr, Tv *valPtr, int elements, bool isAscending);
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/thrust_sort_by_key/CMakeLists.txt b/src/backend/cuda/kernel/thrust_sort_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..6c2f7f3c49
--- /dev/null
+++ b/src/backend/cuda/kernel/thrust_sort_by_key/CMakeLists.txt
@@ -0,0 +1,37 @@
+# Copyright (c) 2020, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS
+    "${CMAKE_CURRENT_SOURCE_DIR}/kernel/thrust_sort_by_key/thrust_sort_by_key_impl.cu"
+    FILESTRINGS)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_TYPES")
+        string(REPLACE "// SBK_TYPES:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_TYPES ${TEMP})
+    elseif(${STR} MATCHES "// SBK_INSTS:")
+        string(REPLACE "// SBK_INSTS:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_INSTS ${TEMP})
+    endif()
+endforeach()
+
+foreach(SBK_TYPE ${SBK_TYPES})
+  foreach(SBK_INST ${SBK_INSTS})
+    set(INSTANTIATESBK_INST "INSTANTIATE${SBK_INST}")
+
+    configure_file(
+      "${CMAKE_CURRENT_SOURCE_DIR}/kernel/thrust_sort_by_key/thrust_sort_by_key_impl.cu"
+      "${CMAKE_CURRENT_BINARY_DIR}/kernel/thrust_sort_by_key/thrust_sort_by_key_impl_${SBK_TYPE}_${SBK_INST}.cu"
+    )
+
+    list(
+      APPEND
+      thrust_sort_sources
+      "${CMAKE_CURRENT_BINARY_DIR}/kernel/thrust_sort_by_key/thrust_sort_by_key_impl_${SBK_TYPE}_${SBK_INST}.cu"
+    )
+  endforeach(SBK_INST ${SBK_INSTS})
+endforeach(SBK_TYPE ${SBK_TYPES})
diff --git a/src/backend/cuda/kernel/thrust_sort_by_key/thrust_sort_by_key_impl.cu b/src/backend/cuda/kernel/thrust_sort_by_key/thrust_sort_by_key_impl.cu
new file mode 100644
index 0000000000..7a7e3616c9
--- /dev/null
+++ b/src/backend/cuda/kernel/thrust_sort_by_key/thrust_sort_by_key_impl.cu
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/thrust_sort_by_key_impl.hpp>
+
+// This file instantiates sort_by_key as separate object files from CMake
+// The 2 lines below are read by CMake to determine the instantiations
+// SBK_TYPES:float double int uint intl uintl short ushort char schar uchar
+// SBK_INSTS:0 1
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// clang-format off
+@INSTANTIATESBK_INST@ ( @SBK_TYPE@ )
+// clang-format on
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/thrust_sort_by_key_impl.hpp b/src/backend/cuda/kernel/thrust_sort_by_key_impl.hpp
new file mode 100644
index 0000000000..e909a786de
--- /dev/null
+++ b/src/backend/cuda/kernel/thrust_sort_by_key_impl.hpp
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <debug_cuda.hpp>
+#include <kernel/thrust_sort_by_key.hpp>
+#include <thrust/sort.h>
+#include <thrust_utils.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+// Wrapper functions
+template<typename Tk, typename Tv>
+void thrustSortByKey(Tk *keyPtr, Tv *valPtr, int elements, bool isAscending) {
+    if (isAscending) {
+        THRUST_SELECT(thrust::stable_sort_by_key, keyPtr, keyPtr + elements,
+                      valPtr);
+    } else {
+        THRUST_SELECT(thrust::stable_sort_by_key, keyPtr, keyPtr + elements,
+                      valPtr, thrust::greater<Tk>());
+    }
+    POST_LAUNCH_CHECK();
+}
+
+#define INSTANTIATE(Tk, Tv)                                         \
+    template void thrustSortByKey<Tk, Tv>(Tk * keyPtr, Tv * valPtr, \
+                                          int elements, bool isAscending);
+
+#define INSTANTIATE0(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)
+
+#define INSTANTIATE1(Tk)    \
+    INSTANTIATE(Tk, int)    \
+    INSTANTIATE(Tk, uint)   \
+    INSTANTIATE(Tk, short)  \
+    INSTANTIATE(Tk, ushort) \
+    INSTANTIATE(Tk, intl)   \
+    INSTANTIATE(Tk, uintl)
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/tile.cuh b/src/backend/cuda/kernel/tile.cuh
new file mode 100644
index 0000000000..705ac70647
--- /dev/null
+++ b/src/backend/cuda/kernel/tile.cuh
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void tile(Param<T> out, CParam<T> in, const int blocksPerMatX,
+                     const int blocksPerMatY) {
+    const int oz = blockIdx.x / blocksPerMatX;
+    const int ow = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - ow * blocksPerMatY;
+
+    const int xx = threadIdx.x + blockIdx_x * blockDim.x;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    if (xx >= out.dims[0] || yy >= out.dims[1] || oz >= out.dims[2] ||
+        ow >= out.dims[3])
+        return;
+
+    const int iz  = oz % in.dims[2];
+    const int iw  = ow % in.dims[3];
+    const int izw = iw * in.strides[3] + iz * in.strides[2];
+    const int ozw = ow * out.strides[3] + oz * out.strides[2];
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    for (int oy = yy; oy < out.dims[1]; oy += incy) {
+        const int iy = oy % in.dims[1];
+        for (int ox = xx; ox < out.dims[0]; ox += incx) {
+            const int ix = ox % in.dims[0];
+
+            int iMem = izw + iy * in.strides[1] + ix;
+            int oMem = ozw + oy * out.strides[1] + ox;
+
+            out.ptr[oMem] = in.ptr[iMem];
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/tile.hpp b/src/backend/cuda/kernel/tile.hpp
index a0325cbc93..e25bdce4b7 100644
--- a/src/backend/cuda/kernel/tile.hpp
+++ b/src/backend/cuda/kernel/tile.hpp
@@ -7,79 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/tile_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 512;
-        static const unsigned TILEY = 32;
-
-        template<typename T>
-        __global__
-        void tile_kernel(Param<T> out, CParam<T> in,
-                         const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            if(xx >= out.dims[0] ||
-               yy >= out.dims[1] ||
-               oz >= out.dims[2] ||
-               ow >= out.dims[3])
-                return;
+template<typename T>
+void tile(Param<T> out, CParam<T> in) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 512;
+    constexpr unsigned TILEY = 32;
 
-            const int iz = oz % in.dims[2];
-            const int iw = ow % in.dims[3];
-            const int izw = iw * in.strides[3] + iz * in.strides[2];
-            const int ozw = ow * out.strides[3] + oz * out.strides[2];
+    auto tile = common::getKernel("arrayfire::cuda::tile", {{tile_cuh_src}},
+                                  TemplateArgs(TemplateTypename<T>()));
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+    dim3 threads(TX, TY, 1);
 
-            for(int oy = yy; oy < out.dims[1]; oy += incy) {
-                const int iy = oy % in.dims[1];
-                for(int ox = xx; ox < out.dims[0]; ox += incx) {
-                    const int ix = ox % in.dims[0];
+    int blocksPerMatX = divup(out.dims[0], TILEX);
+    int blocksPerMatY = divup(out.dims[1], TILEY);
+    dim3 blocks(blocksPerMatX * out.dims[2], blocksPerMatY * out.dims[3], 1);
 
-                    int iMem = izw + iy * in.strides[1] + ix;
-                    int oMem = ozw + oy * out.strides[1] + ox;
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-                    out.ptr[oMem] = in.ptr[iMem];
-                }
-            }
-        }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T>
-        void tile(Param<T> out, CParam<T> in)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(out.dims[0], TILEX);
-            int blocksPerMatY = divup(out.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * out.dims[2],
-                        blocksPerMatY * out.dims[3],
-                        1);
-
-            tile_kernel<T><<<blocks, threads>>>(out, in, blocksPerMatX, blocksPerMatY);
-            POST_LAUNCH_CHECK();
-        }
-    }
+    tile(qArgs, out, in, blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/topk.hpp b/src/backend/cuda/kernel/topk.hpp
new file mode 100644
index 0000000000..22f7c34f93
--- /dev/null
+++ b/src/backend/cuda/kernel/topk.hpp
@@ -0,0 +1,194 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <cub/block/block_radix_sort.cuh>
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/dispatch.hpp>
+#include <debug_cuda.hpp>
+#include <math.hpp>
+#include <types.hpp>
+
+#include <limits>
+
+using cub::BlockRadixSort;
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+static const int TOPK_THRDS_PER_BLK = 256;
+static const int TOPK_IDX_THRD_LOAD = 4;
+
+template<typename T, bool READ_INDEX>
+static __global__ void kerTopkDim0(Param<T> ovals, Param<uint> oidxs,
+                                   CParam<T> ivals, CParam<uint> iidxs,
+                                   const int k, const af::topkFunction order,
+                                   uint numLaunchBlocksY) {
+    using ValueType       = uint;
+    using BlockRadixSortT = BlockRadixSort<compute_t<T>, TOPK_THRDS_PER_BLK,
+                                           TOPK_IDX_THRD_LOAD, ValueType>;
+
+    struct keyValBlocks {
+        // Used for rearranging each thread's (granule's) data items. Global
+        // memory is read coalesced, but the radix sort needs a blocked layout
+        // where each thread owns TOPK_IDX_THRD_LOAD=4 consecutive items, so
+        // this shared memory is used to rearrange between the two.
+        compute_t<T> keys[TOPK_IDX_THRD_LOAD * TOPK_THRDS_PER_BLK];
+        ValueType vals[TOPK_IDX_THRD_LOAD * TOPK_THRDS_PER_BLK];
+    };
+
+    union smemUnion {
+        // used for cub radix sort
+        typename BlockRadixSortT::TempStorage sortmem;
+        // used for rearranging
+        keyValBlocks blkt;
+    } __shared__ smem;
+
+    const int bw = blockIdx.y / numLaunchBlocksY;
+    const int bz = blockIdx.z;
+    const int by = (blockIdx.y - bw * numLaunchBlocksY);
+
+    const uint elements = ivals.dims[0];
+
+    const data_t<T>* kdata = ivals.ptr + by * ivals.strides[1] +
+                             bz * ivals.strides[2] + bw * ivals.strides[3];
+
+    const ValueType* idata = iidxs.ptr + by * iidxs.strides[1] +
+                             bz * iidxs.strides[2] + bw * iidxs.strides[3];
+
+    T* ores = ovals.ptr + by * ovals.strides[1] + bz * ovals.strides[2] +
+              bw * ovals.strides[3];
+    uint* ires = oidxs.ptr + by * oidxs.strides[1] + bz * oidxs.strides[2] +
+                 bw * oidxs.strides[3];
+
+    compute_t<T> keys[TOPK_IDX_THRD_LOAD];
+    ValueType vals[TOPK_IDX_THRD_LOAD];
+
+    const int blockOffset =
+        blockDim.x * blockIdx.x * TOPK_IDX_THRD_LOAD + threadIdx.x;
+// each block will load consecutive data items while iterating a block-width at
+// a time [B0][][]...[][B1][][]...[] ... [BN][][]...[]
+#pragma unroll
+    for (uint li = 0, i = blockOffset; li < TOPK_IDX_THRD_LOAD;
+         i += blockDim.x, li++) {
+        if (i < elements) {
+            smem.blkt.keys[li * TOPK_THRDS_PER_BLK + threadIdx.x] =
+                static_cast<compute_t<T>>(kdata[i]);
+            smem.blkt.vals[li * TOPK_THRDS_PER_BLK + threadIdx.x] =
+                (READ_INDEX) ? idata[i] : i;
+        } else {
+            smem.blkt.keys[li * TOPK_THRDS_PER_BLK + threadIdx.x] =
+                (order & AF_TOPK_MAX) ? minval<compute_t<T>>()
+                                      : maxval<compute_t<T>>();
+            smem.blkt.vals[li * TOPK_THRDS_PER_BLK + threadIdx.x] =
+                maxval<ValueType>();
+        }
+    }
+    __syncthreads();
+
+#pragma unroll
+    for (uint li = 0; li < TOPK_IDX_THRD_LOAD; li++) {
+        // transposed read into registers for cub radix sort
+        keys[li] = smem.blkt.keys[li + (threadIdx.x * TOPK_IDX_THRD_LOAD)];
+        vals[li] = smem.blkt.vals[li + (threadIdx.x * TOPK_IDX_THRD_LOAD)];
+    }
+    __syncthreads();
+
+    if (order & AF_TOPK_MAX) {
+        BlockRadixSortT(smem.sortmem)
+            .SortDescendingBlockedToStriped(keys, vals);
+    } else {
+        BlockRadixSortT(smem.sortmem).SortBlockedToStriped(keys, vals);
+    }
+
+    if (threadIdx.x < k) {
+        int oidx   = threadIdx.x + blockIdx.x * k;
+        ores[oidx] = keys[0];
+        ires[oidx] = vals[0];
+    }
+}
+
+template<typename T>
+void topkDim0(Param<T> ovals, Param<uint> oidxs, CParam<T> ivals, const int k,
+              const af::topkFunction order) {
+    dim3 threads(TOPK_THRDS_PER_BLK, 1);
+    const int thrdLoad = TOPK_IDX_THRD_LOAD;
+
+    int numBlocksX = divup(ivals.dims[0], threads.x * thrdLoad);
+    dim3 blocks(numBlocksX, ivals.dims[1] * ivals.dims[3], ivals.dims[2]);
+
+    // The algorithm is to iteratively find top k elements among each block
+    // of threads until there is only one block to launch.
+    // The additional memory used for values and indices is allocated only
+    // before the first iteration and reused for further iterations.
+
+    // Temporary storage allocation for iterations
+    Array<T> tvals    = createEmptyArray<T>(dim4());
+    Array<uint> tidxs = createEmptyArray<uint>(dim4());
+
+    if (numBlocksX > 1) {
+        tvals = createEmptyArray<T>(dim4(k * numBlocksX, ivals.dims[1]));
+        // TODO(umar): this can be smaller because the first iteration is not
+        // reading this array.
+        tidxs = createEmptyArray<uint>(dim4(k * numBlocksX, ivals.dims[1]));
+    }
+
+    int prevBlocksX = 1;
+
+    CParam<T> iivals    = ivals;
+    CParam<uint> iiidxs = tidxs;
+
+    int dims0      = tvals.dims()[0];
+    bool first_run = true;
+    do {
+        if (blocks.x == 1) {
+            tvals = createParamArray(ovals, false);
+            tidxs = createParamArray(oidxs, false);
+        }
+
+        if (first_run) {
+            // Launch topk which doesn't read the index values from global
+            // memory
+            CUDA_LAUNCH((kerTopkDim0<T, false>), blocks, threads, tvals, tidxs,
+                        iivals, iiidxs, k, order, ivals.dims[1]);
+            first_run = false;
+        } else {
+            CUDA_LAUNCH((kerTopkDim0<T, true>), blocks, threads, tvals, tidxs,
+                        iivals, iiidxs, k, order, ivals.dims[1]);
+        }
+
+        POST_LAUNCH_CHECK();
+
+        prevBlocksX = blocks.x;
+        blocks.x    = divup(dims0, threads.x * thrdLoad);
+
+        // set output of current iteration as input for the next iteration
+        iivals = tvals;
+        iiidxs = tidxs;
+
+        dims0 = blocks.x * k;
+
+        tvals.setDataDims(dim4(dims0, tvals.elements() / (float)dims0));
+        tidxs.setDataDims(dim4(dims0, tidxs.elements() / (float)dims0));
+    } while (prevBlocksX > 1);
+}
+
+template<typename T>
+inline void topk(Param<T> ovals, Param<uint> oidxs, CParam<T> ivals,
+                 const int k, const int dim, const af::topkFunction order) {
+    assert(dim == 0);
+    // TODO Add switch statement when support for other dims is added
+    topkDim0<T>(ovals, oidxs, ivals, k, order);
+}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transform.cuh b/src/backend/cuda/kernel/transform.cuh
new file mode 100644
index 0000000000..f2d2f2c909
--- /dev/null
+++ b/src/backend/cuda/kernel/transform.cuh
@@ -0,0 +1,174 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <interp.hpp>
+
+__constant__ float
+    c_tmat[3072];  // Allows 512 Affine Transforms and 341 Persp. Transforms
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__device__ void calc_transf_inverse(T *txo, const T *txi,
+                                    const bool perspective) {
+    if (perspective) {
+        txo[0] = txi[4] * txi[8] - txi[5] * txi[7];
+        txo[1] = -(txi[1] * txi[8] - txi[2] * txi[7]);
+        txo[2] = txi[1] * txi[5] - txi[2] * txi[4];
+
+        txo[3] = -(txi[3] * txi[8] - txi[5] * txi[6]);
+        txo[4] = txi[0] * txi[8] - txi[2] * txi[6];
+        txo[5] = -(txi[0] * txi[5] - txi[2] * txi[3]);
+
+        txo[6] = txi[3] * txi[7] - txi[4] * txi[6];
+        txo[7] = -(txi[0] * txi[7] - txi[1] * txi[6]);
+        txo[8] = txi[0] * txi[4] - txi[1] * txi[3];
+
+        T det = txi[0] * txo[0] + txi[1] * txo[3] + txi[2] * txo[6];
+
+        txo[0] /= det;
+        txo[1] /= det;
+        txo[2] /= det;
+        txo[3] /= det;
+        txo[4] /= det;
+        txo[5] /= det;
+        txo[6] /= det;
+        txo[7] /= det;
+        txo[8] /= det;
+    } else {
+        T det = txi[0] * txi[4] - txi[1] * txi[3];
+
+        txo[0] = txi[4] / det;
+        txo[1] = txi[3] / det;
+        txo[3] = txi[1] / det;
+        txo[4] = txi[0] / det;
+
+        txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
+        txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
+    }
+}
+
+template<typename T, bool inverse, int order>
+__global__ void transform(Param<T> out, CParam<T> in, const int nImg2,
+                          const int nImg3, const int nTfs2, const int nTfs3,
+                          const int batchImg2, const int blocksXPerImage,
+                          const int blocksYPerImage, const bool perspective,
+                          af::interpType method) {
+    // Image Ids
+    const int imgId2 = blockIdx.x / blocksXPerImage;
+    const int imgId3 = blockIdx.y / blocksYPerImage;
+
+    // Block in local image
+    const int blockIdx_x = blockIdx.x - imgId2 * blocksXPerImage;
+    const int blockIdx_y = blockIdx.y - imgId3 * blocksYPerImage;
+
+    // Get thread indices in local image
+    const int xido = blockIdx_x * blockDim.x + threadIdx.x;
+    const int yido = blockIdx_y * blockDim.y + threadIdx.y;
+
+    // Image iteration loop count for image batching
+    int limages = min(max(out.dims[2] - imgId2 * nImg2, 1), batchImg2);
+
+    if (xido >= out.dims[0] || yido >= out.dims[1]) return;
+
+    // Index of transform
+    const int eTfs2 = max((nTfs2 / nImg2), 1);
+    const int eTfs3 = max((nTfs3 / nImg3), 1);
+
+    int t_idx3        = -1;  // init
+    int t_idx2        = -1;  // init
+    int t_idx2_offset = 0;
+
+    if (nTfs3 == 1) {
+        t_idx3 = 0;  // Always 0 as only 1 transform defined
+    } else {
+        if (nTfs3 == nImg3) {
+            t_idx3 = imgId3;  // One to one batch with all transforms defined
+        } else {
+            t_idx3        = blockIdx.z / eTfs2;  // Transform batched, calculate
+            t_idx2_offset = t_idx3 * nTfs2;
+        }
+    }
+
+    if (nTfs2 == 1) {
+        t_idx2 = 0;  // Always 0 as only 1 transform defined
+    } else {
+        if (nTfs2 == nImg2) {
+            t_idx2 = imgId2;  // One to one batch with all transforms defined
+        } else {
+            t_idx2 =
+                blockIdx.z - t_idx2_offset;  // Transform batched, calculate
+        }
+    }
+
+    // Linear transform index
+    const int t_idx = t_idx2 + t_idx3 * nTfs2;
+    int outoff      = 0;
+
+    // Global offsets
+    const int inoff =
+        imgId2 * batchImg2 * in.strides[2] + imgId3 * in.strides[3];
+    if (nImg2 == nTfs2 || nImg2 > 1) {  // One-to-One or Image on dim2
+        outoff += imgId2 * batchImg2 * out.strides[2];
+    } else {  // Transform batched on dim2
+        outoff += t_idx2 * out.strides[2];
+    }
+
+    if (nImg3 == nTfs3 || nImg3 > 1) {  // One-to-One or Image on dim3
+        outoff += imgId3 * out.strides[3];
+    } else {  // Transform batched on dim3
+        outoff += t_idx3 * out.strides[3];
+    }
+
+    // Transform is in constant memory.
+    const int transf_len  = (perspective ? 9 : 6);
+    const float *tmat_ptr = c_tmat + t_idx * transf_len;
+    float tmat[9];
+
+    // We expect an inverse transform matrix by default
+    // If it is a forward transform, then we need its inverse
+    if (inverse) {
+#pragma unroll 3
+        for (int i = 0; i < transf_len; i++) tmat[i] = tmat_ptr[i];
+    } else {
+        calc_transf_inverse(tmat, tmat_ptr, perspective);
+    }
+
+    const int loco = outoff + (yido * out.strides[1] + xido);
+
+    // Compute input index
+    typedef typename itype_t<T>::wtype WT;
+    WT xidi = xido * tmat[0] + yido * tmat[1] + tmat[2];
+    WT yidi = xido * tmat[3] + yido * tmat[4] + tmat[5];
+
+    if (perspective) {
+        const WT W = xido * tmat[6] + yido * tmat[7] + tmat[8];
+        xidi /= W;
+        yidi /= W;
+    }
+
+    if (xidi < -0.0001 || yidi < -0.0001 || in.dims[0] <= xidi ||
+        in.dims[1] <= yidi) {
+        for (int i = 0; i < limages; i++) {
+            out.ptr[loco + i * out.strides[2]] = scalar<T>(0.0f);
+        }
+        return;
+    }
+
+    Interp2<T, WT, 0, 1, order> interp;
+    // FIXME: Nearest and lower do not do clamping, but other methods do
+    // Make it consistent
+    bool clamp = order != 1;
+    interp(out, loco, in, inoff, xidi, yidi, method, limages, clamp);
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transform.hpp b/src/backend/cuda/kernel/transform.hpp
index 88af1a2119..5405fcc9cc 100644
--- a/src/backend/cuda/kernel/transform.hpp
+++ b/src/backend/cuda/kernel/transform.hpp
@@ -7,135 +7,72 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include "transform_interp.hpp"
-
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 16;
-        static const unsigned TY = 16;
-        // Used for batching images
-        static const unsigned TI = 4;
-
-        __constant__ float c_tmat[6 * 256];
-
-        template <typename T>
-        __host__ __device__
-        void calc_affine_inverse(T *txo, const T *txi)
-        {
-            T det = txi[0]*txi[4] - txi[1]*txi[3];
-
-            txo[0] = txi[4] / det;
-            txo[1] = txi[3] / det;
-            txo[3] = txi[1] / det;
-            txo[4] = txi[0] / det;
-
-            txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
-            txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Transform Kernel
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, bool inverse, af_interp_type method>
-        __global__ static void
-        transform_kernel(Param<T> out, CParam<T> in, const int nimages,
-                         const int ntransforms, const int blocksXPerImage)
-        {
-            // Compute which image set
-            const int setId = blockIdx.x / blocksXPerImage;
-            const int blockIdx_x = blockIdx.x - setId * blocksXPerImage;
-
-            // Get thread indices
-            const int xx = blockIdx_x * blockDim.x + threadIdx.x;
-            const int yy = blockIdx.y * blockDim.y + threadIdx.y;
-
-            const int limages = min(out.dims[2] - setId * nimages, nimages);
-
-            if(xx >= out.dims[0] || yy >= out.dims[1] * ntransforms)
-                return;
-
-            // Index of channel of images and transform
-            //const int i_idx = xx / out.dims[0];
-            const int t_idx = yy / out.dims[1];
-
-            // Index in local channel -> This is output index
-            //const int xido = xx - i_idx * out.dims[0];
-            const int xido = xx;
-            const int yido = yy - t_idx * out.dims[1];
-
-            // Global offset
-            //          Offset for transform channel + Offset for image channel.
-                  T *optr = out.ptr + t_idx * nimages * out.strides[2] + setId * nimages * out.strides[2];
-            const T *iptr = in.ptr  + setId * nimages * in.strides[2];
-
-            // Transform is in constant memory.
-            const float *tmat_ptr = c_tmat + t_idx * 6;
-            float tmat[6];
-
-            // We expect a inverse transform matrix by default
-            // If it is an forward transform, then we need its inverse
-            if(inverse) {
-                #pragma unroll
-                for(int i = 0; i < 6; i++)
-                    tmat[i] = tmat_ptr[i];
-            } else {
-                calc_affine_inverse(tmat, tmat_ptr);
-            }
-
-            if (xido >= out.dims[0] && yido >= out.dims[1]) return;
-
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    transform_n(optr, out, iptr, in, tmat, xido, yido, limages); break;
-                case AF_INTERP_BILINEAR:
-                    transform_b(optr, out, iptr, in, tmat, xido, yido, limages); break;
-                default: break;
-            }
-        }
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template <typename T, af_interp_type method>
-        void transform(Param<T> out, CParam<T> in, CParam<float> tf,
-                       const bool inverse)
-        {
-            int nimages = in.dims[2];
-            // Multiplied in src/backend/transform.cpp
-            const int ntransforms = out.dims[2] / in.dims[2];
-
-            // Copy transform to constant memory.
-            CUDA_CHECK(cudaMemcpyToSymbol(c_tmat, tf.ptr, ntransforms * 6 * sizeof(float), 0,
-                                          cudaMemcpyDeviceToDevice));
-
-            dim3 threads(TX, TY, 1);
-            dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
-
-            const int blocksXPerImage = blocks.x;
-            if(nimages > TI) {
-                int tile_images = divup(nimages, TI);
-                nimages = TI;
-                blocks.x = blocks.x * tile_images;
-            }
-
-            if (ntransforms > 1) { blocks.y *= ntransforms; }
-
-            if(inverse) {
-                transform_kernel<T, true, method><<<blocks, threads>>>
-                                (out, in, nimages, ntransforms, blocksXPerImage);
-            } else {
-                transform_kernel<T, false, method><<<blocks, threads>>>
-                                (out, in, nimages, ntransforms, blocksXPerImage);
-            }
-            POST_LAUNCH_CHECK();
-        }
-    }
+#include <nvrtc_kernel_headers/transform_cuh.hpp>
+#include <af/defines.h>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+// Kernel Launch Config Values
+static const unsigned TX = 16;
+static const unsigned TY = 16;
+// Used for batching images
+static const unsigned TI = 4;
+
+template<typename T>
+void transform(Param<T> out, CParam<T> in, CParam<float> tf, const bool inverse,
+               const bool perspective, const af::interpType method, int order) {
+    auto transform = common::getKernel(
+        "arrayfire::cuda::transform", {{transform_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(inverse),
+                     TemplateArg(order)));
+
+    const unsigned int nImg2  = in.dims[2];
+    const unsigned int nImg3  = in.dims[3];
+    const unsigned int nTfs2  = tf.dims[2];
+    const unsigned int nTfs3  = tf.dims[3];
+    const unsigned int tf_len = (perspective) ? 9 : 6;
+
+    // Copy transform to constant memory.
+    auto constPtr = transform.getDevPtr("c_tmat");
+    transform.copyToReadOnly(constPtr, reinterpret_cast<CUdeviceptr>(tf.ptr),
+                             nTfs2 * nTfs3 * tf_len * sizeof(float));
+
+    dim3 threads(TX, TY, 1);
+    dim3 blocks(divup(out.dims[0], threads.x), divup(out.dims[1], threads.y));
+
+    const int blocksXPerImage = blocks.x;
+    const int blocksYPerImage = blocks.y;
+
+    // Takes care of all types of batching
+    // One-to-one batching is only done on blocks.x
+    // TODO If dim2 is not one-to-one batched, then divide blocks.x by factor
+    int batchImg2 = 1;
+    if (nImg2 != nTfs2) batchImg2 = std::min(nImg2, TI);
+
+    blocks.x *= (nImg2 / batchImg2);
+    blocks.y *= nImg3;
+
+    // Use blocks.z for transforms
+    blocks.z *= std::max((nTfs2 / nImg2), 1u) * std::max((nTfs3 / nImg3), 1u);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    transform(qArgs, out, in, nImg2, nImg3, nTfs2, nTfs3, batchImg2,
+              blocksXPerImage, blocksYPerImage, perspective, method);
+
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transform_interp.hpp b/src/backend/cuda/kernel/transform_interp.hpp
deleted file mode 100644
index e91750881b..0000000000
--- a/src/backend/cuda/kernel/transform_interp.hpp
+++ /dev/null
@@ -1,133 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-namespace cuda
-{
-    namespace kernel
-    {
-        template<typename T>
-        struct itype_t
-        {
-            typedef float wtype;
-            typedef float vtype;
-        };
-
-        template<>
-        struct itype_t<double>
-        {
-            typedef double wtype;
-            typedef double vtype;
-        };
-
-        template<>
-        struct itype_t<cfloat>
-        {
-            typedef float  wtype;
-            typedef cfloat vtype;
-        };
-
-        template<>
-        struct itype_t<cdouble>
-        {
-            typedef double  wtype;
-            typedef cdouble vtype;
-        };
-
-        template<typename T>
-        __device__
-        void transform_n(T *optr, Param<T> out, const T *iptr, CParam<T> in, const float *tmat,
-                         const int xido, const int yido, const int nimages)
-        {
-            // Compute input index
-            int xidi = round(xido * tmat[0]
-                             + yido * tmat[1]
-                             + tmat[2]);
-            int yidi = round(xido * tmat[3]
-                             + yido * tmat[4]
-                             + tmat[5]);
-
-            // Makes scale give same output as resize
-            // But fails rotate tests
-            //if (xidi >= in.dims[0]) { xidi = in.dims[0] - 1; }
-            //if (yidi >= in.dims[1]) { yidi = in.dims[1] - 1; }
-
-            const int loci = yidi * in.strides[1]  + xidi;
-            const int loco = yido * out.strides[1] + xido;
-
-            for(int i = 0; i < nimages; i++) {
-                // Compute memory location of indices
-                int ioff = loci + i * in.strides[2];
-                int ooff = loco + i * out.strides[2];
-
-                // Copy to output
-                T val = scalar<T>(0);
-                if (xidi < in.dims[0] && yidi < in.dims[1] && xidi >= 0 && yidi >= 0) val = iptr[ioff];
-
-                optr[ooff] = val;
-            }
-        }
-
-        template<typename T>
-        __device__
-        void transform_b(T *optr, Param<T> out, const T *iptr, CParam<T> in, const float *tmat,
-                         const int xido, const int yido, const int nimages)
-        {
-            const int loco = (yido * out.strides[1] + xido);
-
-            // Compute input index
-            const float xidi = xido * tmat[0]
-                             + yido * tmat[1]
-                                    + tmat[2];
-            const float yidi = xido * tmat[3]
-                             + yido * tmat[4]
-                                    + tmat[5];
-
-            if (xidi < -0.0001 || yidi < -0.0001 || in.dims[0] < xidi || in.dims[1] < yidi) {
-                for(int i = 0; i < nimages; i++) {
-                    optr[loco + i * out.strides[2]] = scalar<T>(0.0f);
-                }
-                return;
-            }
-
-            typedef typename itype_t<T>::wtype WT;
-            typedef typename itype_t<T>::vtype VT;
-
-            const WT grd_x = floor(xidi),  grd_y = floor(yidi);
-            const WT off_x = xidi - grd_x, off_y = yidi - grd_y;
-
-            // Check if pVal and pVal + 1 are both valid indices
-            const bool condY = (yidi < in.dims[1] - 1);
-            const bool condX = (xidi < in.dims[0] - 1);
-
-            // Compute weights used
-            const WT wt00 = (1.0 - off_x) * (1.0 - off_y);
-            const WT wt10 = (condY) ? (1.0 - off_x) * (off_y)     : 0;
-            const WT wt01 = (condX) ? (off_x) * (1.0 - off_y)     : 0;
-            const WT wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-            const WT wt = wt00 + wt10 + wt01 + wt11;
-
-            const int loci = grd_y * in.strides[1] + grd_x;
-            T zero = scalar<T>(0.0f);
-            for(int i = 0; i < nimages; i++) {
-                const int ioff = loci + (i * in.strides[2]);
-                const int ooff = loco + (i * out.strides[2]);
-
-                // Compute Weighted Values
-                VT v00 =                    wt00 * iptr[ioff];
-                VT v10 = (condY) ?          wt10 * iptr[ioff + in.strides[1]]     : zero;
-                VT v01 = (condX) ?          wt01 * iptr[ioff + 1]                 : zero;
-                VT v11 = (condX && condY) ? wt11 * iptr[ioff + in.strides[1] + 1] : zero;
-                VT vo  = v00 + v10 + v01 + v11;
-
-                optr[ooff] = (vo / wt);
-            }
-        }
-    }
-}
diff --git a/src/backend/cuda/kernel/transpose.cuh b/src/backend/cuda/kernel/transpose.cuh
new file mode 100644
index 0000000000..444a61b819
--- /dev/null
+++ b/src/backend/cuda/kernel/transpose.cuh
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool conjugate>
+__device__ T doOp(T in) {
+    if (conjugate)
+        return conj(in);
+    else
+        return in;
+}
+
+template<typename T, bool conjugate, bool is32Multiple>
+__global__ void transpose(Param<T> out, CParam<T> in, const int blocksPerMatX,
+                          const int blocksPerMatY) {
+    __shared__ T shrdMem[TILE_DIM][TILE_DIM + 1];
+
+    const int oDim0 = out.dims[0];
+    const int oDim1 = out.dims[1];
+    const int iDim0 = in.dims[0];
+    const int iDim1 = in.dims[1];
+
+    const int oStride1 = out.strides[1];
+    const int iStride1 = in.strides[1];
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    const int batchId_x  = blockIdx.x / blocksPerMatX;
+    const int blockIdx_x = (blockIdx.x - batchId_x * blocksPerMatX);
+
+    const int batchId_y = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (batchId_y * blocksPerMatY);
+
+    if (batchId_x >= in.dims[2] || batchId_y >= in.dims[3]) return;
+
+    const int x0 = TILE_DIM * blockIdx_x;
+    const int y0 = TILE_DIM * blockIdx_y;
+
+    int gx = lx + x0;
+    int gy = ly + y0;
+
+    in.ptr += batchId_x * in.strides[2] + batchId_y * in.strides[3];
+    out.ptr += batchId_x * out.strides[2] + batchId_y * out.strides[3];
+
+#pragma unroll
+    for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+        int gy_ = gy + repeat;
+        if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
+            shrdMem[ly + repeat][lx] = in.ptr[gy_ * iStride1 + gx];
+    }
+    __syncthreads();
+
+    gx = lx + y0;
+    gy = ly + x0;
+
+#pragma unroll
+    for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+        int gy_ = gy + repeat;
+        if (is32Multiple || (gx < oDim0 && gy_ < oDim1))
+            out.ptr[gy_ * oStride1 + gx] =
+                doOp<T, conjugate>(shrdMem[lx][ly + repeat]);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transpose.hpp b/src/backend/cuda/kernel/transpose.hpp
index 86c5e9b4f2..f84ff89b96 100644
--- a/src/backend/cuda/kernel/transpose.hpp
+++ b/src/backend/cuda/kernel/transpose.hpp
@@ -7,108 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-    static const int TILE_DIM  = 32;
-    static const int THREADS_X = TILE_DIM;
-    static const int THREADS_Y = 256 / TILE_DIM;
-
-    template<typename T, bool conjugate>
-    __device__ T doOp(T in)
-    {
-        if (conjugate) return conj(in);
-        else return in;
-    }
-
-    // Kernel is going access original data in colleased format
-    template<typename T, bool conjugate, bool is32Multiple>
-    __global__
-    void transpose(Param<T> out, CParam<T> in,
-                   const int blocksPerMatX, const int blocksPerMatY)
-    {
-        __shared__ T shrdMem[TILE_DIM][TILE_DIM+1];
-        // create variables to hold output dimensions
-        const int oDim0 = out.dims[0];
-        const int oDim1 = out.dims[1];
-        const int iDim0 = in.dims[0];
-        const int iDim1 = in.dims[1];
-
-        // calculate strides
-        const int oStride1 = out.strides[1];
-        const int iStride1 = in.strides[1];
+#include <nvrtc_kernel_headers/transpose_cuh.hpp>
 
-        const int lx = threadIdx.x;
-        const int ly = threadIdx.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-        // batch based block Id
-        const int batchId_x = blockIdx.x / blocksPerMatX;
-        const int blockIdx_x = (blockIdx.x - batchId_x * blocksPerMatX);
+static const int TILE_DIM  = 32;
+static const int THREADS_X = TILE_DIM;
+static const int THREADS_Y = 256 / TILE_DIM;
 
-        const int batchId_y = blockIdx.y / blocksPerMatY;
-        const int blockIdx_y = (blockIdx.y - batchId_y * blocksPerMatY);
+template<typename T>
+void transpose(Param<T> out, CParam<T> in, const bool conjugate,
+               const bool is32multiple) {
+    auto transpose = common::getKernel(
+        "arrayfire::cuda::transpose", {{transpose_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(conjugate),
+                     TemplateArg(is32multiple)),
+        {{DefineValue(TILE_DIM), DefineValue(THREADS_Y)}});
 
-        const int x0 = TILE_DIM * blockIdx_x;
-        const int y0 = TILE_DIM * blockIdx_y;
+    dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
 
-        // calculate global indices
-        int gx      = lx + x0;
-        int gy      = ly + y0;
+    int blk_x = divup(in.dims[0], TILE_DIM);
+    int blk_y = divup(in.dims[1], TILE_DIM);
+    dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        // offset in and out based on batch id
-        in.ptr  += batchId_x *  in.strides[2] + batchId_y *  in.strides[3];
-        out.ptr += batchId_x * out.strides[2] + batchId_y * out.strides[3];
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-#pragma unroll
-        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-            int gy_ = gy+repeat;
-            if (is32Multiple || (gx<iDim0 && gy_<iDim1))
-                shrdMem[ly + repeat][lx] = in.ptr[gy_ * iStride1 + gx];
-        }
-        __syncthreads();
+    transpose(qArgs, out, in, blk_x, blk_y);
 
-        gx = lx + y0;
-        gy = ly + x0;
-
-#pragma unroll
-        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-            int gy_ = gy+repeat;
-            if (is32Multiple || (gx<oDim0 && gy_<oDim1))
-                out.ptr[gy_ * oStride1 + gx] = doOp<T, conjugate>(shrdMem[lx][ly + repeat]);
-        }
-    }
-
-    template<typename T, bool conjugate>
-    void transpose(Param<T> out, CParam<T> in, const int ndims)
-    {
-        // dimensions passed to this function should be input dimensions
-        // any necessary transformations and dimension related calculations are
-        // carried out here and inside the kernel
-        dim3 threads(kernel::THREADS_X,kernel::THREADS_Y);
-
-
-        int blk_x = divup(in.dims[0],TILE_DIM);
-        int blk_y = divup(in.dims[1],TILE_DIM);
-        // launch batch * blk_x blocks along x dimension
-        dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
-
-        if (in.dims[0] % TILE_DIM == 0 && in.dims[1] % TILE_DIM == 0)
-            (transpose<T, conjugate, true >)<<<blocks, threads>>>(out, in, blk_x, blk_y);
-        else
-            (transpose<T, conjugate, false>)<<<blocks, threads>>>(out, in, blk_x, blk_y);
-
-        POST_LAUNCH_CHECK();
-    }
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transpose_inplace.cuh b/src/backend/cuda/kernel/transpose_inplace.cuh
new file mode 100644
index 0000000000..8d0b3cdb04
--- /dev/null
+++ b/src/backend/cuda/kernel/transpose_inplace.cuh
@@ -0,0 +1,122 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool conjugate>
+__device__ T doOp(T in) {
+    if (conjugate)
+        return conj(in);
+    else
+        return in;
+}
+
+// Hint from txbob
+// https://devtalk.nvidia.com/default/topic/765696/efficient-in-place-transpose-of-multiple-square-float-matrices
+//
+// The kernel accesses the original data in a coalesced manner
+template<typename T, bool conjugate, bool is32Multiple>
+__global__ void transposeIP(Param<T> in, const int blocksPerMatX,
+                            const int blocksPerMatY) {
+    __shared__ T shrdMem_s[TILE_DIM][TILE_DIM + 1];
+    __shared__ T shrdMem_d[TILE_DIM][TILE_DIM + 1];
+
+    // create variables to hold input dimensions
+    const int iDim0 = in.dims[0];
+    const int iDim1 = in.dims[1];
+
+    // calculate strides
+    const int iStride1 = in.strides[1];
+
+    const int lx = threadIdx.x;
+    const int ly = threadIdx.y;
+
+    // batch based block Id
+    const int batchId_x  = blockIdx.x / blocksPerMatX;
+    const int blockIdx_x = (blockIdx.x - batchId_x * blocksPerMatX);
+
+    const int batchId_y  = blockIdx.y / blocksPerMatY;
+    const int blockIdx_y = (blockIdx.y - batchId_y * blocksPerMatY);
+
+    const int x0 = TILE_DIM * blockIdx_x;
+    const int y0 = TILE_DIM * blockIdx_y;
+
+    // offset the in-place buffer pointer based on batch id
+    T *iptr = in.ptr + batchId_x * in.strides[2] + batchId_y * in.strides[3];
+
+    if (blockIdx_y > blockIdx_x) {  // Off diagonal blocks
+        // calculate global indices
+        int gx = lx + x0;
+        int gy = ly + y0;
+        int dx = lx + y0;
+        int dy = ly + x0;
+
+        // Copy to shared memory
+#pragma unroll
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int gy_ = gy + repeat;
+            if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
+                shrdMem_s[ly + repeat][lx] = iptr[gy_ * iStride1 + gx];
+
+            int dy_ = dy + repeat;
+            if (is32Multiple || (dx < iDim0 && dy_ < iDim1))
+                shrdMem_d[ly + repeat][lx] = iptr[dy_ * iStride1 + dx];
+        }
+
+        __syncthreads();
+
+        // Copy from shared to global memory
+#pragma unroll
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int dy_ = dy + repeat;
+            if (is32Multiple || (dx < iDim0 && dy_ < iDim1))
+                iptr[dy_ * iStride1 + dx] =
+                    doOp<T, conjugate>(shrdMem_s[lx][ly + repeat]);
+
+            int gy_ = gy + repeat;
+            if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
+                iptr[gy_ * iStride1 + gx] =
+                    doOp<T, conjugate>(shrdMem_d[lx][ly + repeat]);
+        }
+
+    } else if (blockIdx_y == blockIdx_x) {  // Diagonal blocks
+        // calculate global indices
+        int gx = lx + x0;
+        int gy = ly + y0;
+
+        // offset the in-place buffer pointer based on batch id
+        iptr = in.ptr + batchId_x * in.strides[2] + batchId_y * in.strides[3];
+
+        // Copy to shared memory
+#pragma unroll
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int gy_ = gy + repeat;
+            if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
+                shrdMem_s[ly + repeat][lx] = iptr[gy_ * iStride1 + gx];
+        }
+
+        __syncthreads();
+
+        // Copy from shared to global memory
+#pragma unroll
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int gy_ = gy + repeat;
+            if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
+                iptr[gy_ * iStride1 + gx] =
+                    doOp<T, conjugate>(shrdMem_s[lx][ly + repeat]);
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/transpose_inplace.hpp b/src/backend/cuda/kernel/transpose_inplace.hpp
index b25917a9b2..5ff28020c4 100644
--- a/src/backend/cuda/kernel/transpose_inplace.hpp
+++ b/src/backend/cuda/kernel/transpose_inplace.hpp
@@ -7,149 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
-#include <math.hpp>
-
-namespace cuda
-{
-
-namespace kernel
-{
-
-    static const int TILE_DIM  = 32;
-    static const int THREADS_X = TILE_DIM;
-    static const int THREADS_Y = 256 / TILE_DIM;
-
-    template<typename T, bool conjugate>
-    __device__ T doOp(T in)
-    {
-        if (conjugate) return conj(in);
-        else return in;
-    }
-
-    // Hint from txbob
-    // https://devtalk.nvidia.com/default/topic/765696/efficient-in-place-transpose-of-multiple-square-float-matrices
-    //
-    // Kernel is going access original data in colleased format
-    template<typename T, bool conjugate, bool is32Multiple>
-    __global__
-    void transposeIP(Param<T> in, const int blocksPerMatX, const int blocksPerMatY)
-    {
-        __shared__ T shrdMem_s[TILE_DIM][TILE_DIM+1];
-        __shared__ T shrdMem_d[TILE_DIM][TILE_DIM+1];
-
-        // create variables to hold output dimensions
-        const int iDim0 = in.dims[0];
-        const int iDim1 = in.dims[1];
-
-        // calculate strides
-        const int iStride1 = in.strides[1];
-
-        const int lx = threadIdx.x;
-        const int ly = threadIdx.y;
-
-        // batch based block Id
-        const int batchId_x = blockIdx.x / blocksPerMatX;
-        const int blockIdx_x = (blockIdx.x - batchId_x * blocksPerMatX);
-
-        const int batchId_y = blockIdx.y / blocksPerMatY;
-        const int blockIdx_y = (blockIdx.y - batchId_y * blocksPerMatY);
-
-        const int x0 = TILE_DIM * blockIdx_x;
-        const int y0 = TILE_DIM * blockIdx_y;
-
-        // offset in and out based on batch id
-        T *iptr = in.ptr + batchId_x * in.strides[2] + batchId_y * in.strides[3];
+#include <nvrtc_kernel_headers/transpose_inplace_cuh.hpp>
 
-        if(blockIdx_y > blockIdx_x) {       // Off diagonal blocks
-            // calculate global indices
-            int gx      = lx + x0;
-            int gy      = ly + y0;
-            int dx      = lx + y0;
-            int dy      = ly + x0;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            // Copy to shared memory
-#pragma unroll
-            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+static const int TILE_DIM  = 32;
+static const int THREADS_X = TILE_DIM;
+static const int THREADS_Y = 256 / TILE_DIM;
 
-                int gy_ = gy + repeat;
-                if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
-                    shrdMem_s[ly + repeat][lx] = iptr[gy_ * iStride1 + gx];
+template<typename T>
+void transpose_inplace(Param<T> in, const bool conjugate,
+                       const bool is32multiple) {
+    auto transposeIP = common::getKernel(
+        "arrayfire::cuda::transposeIP", {{transpose_inplace_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(conjugate),
+                     TemplateArg(is32multiple)),
+        {{DefineValue(TILE_DIM), DefineValue(THREADS_Y)}});
 
-                int dy_ = dy + repeat;
-                if (is32Multiple || (dx < iDim0 && dy_ < iDim1))
-                    shrdMem_d[ly + repeat][lx] = iptr[dy_ * iStride1 + dx];
-            }
+    // dimensions passed to this function should be input dimensions
+    // any necessary transformations and dimension related calculations are
+    // carried out here and inside the kernel
+    dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
 
-            __syncthreads();
+    int blk_x = divup(in.dims[0], TILE_DIM);
+    int blk_y = divup(in.dims[1], TILE_DIM);
+    dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
 
-            // Copy from shared to global memory
-#pragma unroll
-            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-                int dy_ = dy + repeat;
-                if (is32Multiple || (dx < iDim0 && dy_ < iDim1))
-                    iptr[dy_ * iStride1 + dx] = doOp<T, conjugate>(shrdMem_s[lx][ly + repeat]);
+    transposeIP(qArgs, in, blk_x, blk_y);
 
-                int gy_ = gy + repeat;
-                if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
-                    iptr[gy_ * iStride1 + gx] = doOp<T, conjugate>(shrdMem_d[lx][ly + repeat]);
-            }
-
-        } else if (blockIdx_y == blockIdx_x) {    // Diagonal blocks
-            // calculate global indices
-            int gx      = lx + x0;
-            int gy      = ly + y0;
-
-            // offset in and out based on batch id
-            iptr = in.ptr + batchId_x * in.strides[2] + batchId_y * in.strides[3];
-
-            // Copy to shared memory
-#pragma unroll
-            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-                int gy_ = gy + repeat;
-                if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
-                    shrdMem_s[ly + repeat][lx] = iptr[gy_ * iStride1 + gx];
-            }
-
-            __syncthreads();
-
-            // Copy from shared to global memory
-#pragma unroll
-            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-                int gy_ = gy + repeat;
-                if (is32Multiple || (gx < iDim0 && gy_ < iDim1))
-                    iptr[gy_ * iStride1 + gx] = doOp<T, conjugate>(shrdMem_s[lx][ly + repeat]);
-            }
-        }
-    }
-
-    template<typename T, bool conjugate>
-    void transpose_inplace(Param<T> in)
-    {
-        // dimensions passed to this function should be input dimensions
-        // any necessary transformations and dimension related calculations are
-        // carried out here and inside the kernel
-        dim3 threads(kernel::THREADS_X, kernel::THREADS_Y);
-
-
-        int blk_x = divup(in.dims[0],TILE_DIM);
-        int blk_y = divup(in.dims[1],TILE_DIM);
-
-        // launch batch * blk_x blocks along x dimension
-        dim3 blocks(blk_x * in.dims[2], blk_y * in.dims[3]);
-
-        if (in.dims[0] % TILE_DIM == 0 && in.dims[1] % TILE_DIM == 0)
-            (transposeIP<T, conjugate, true >)<<<blocks, threads>>>(in, blk_x, blk_y);
-        else
-            (transposeIP<T, conjugate, false>)<<<blocks, threads>>>(in, blk_x, blk_y);
-
-        POST_LAUNCH_CHECK();
-    }
+    POST_LAUNCH_CHECK();
 }
 
-}
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/triangle.cuh b/src/backend/cuda/kernel/triangle.cuh
new file mode 100644
index 0000000000..841a7c636f
--- /dev/null
+++ b/src/backend/cuda/kernel/triangle.cuh
@@ -0,0 +1,63 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool is_upper, bool is_unit_diag>
+__global__ void triangle(Param<T> r, CParam<T> in, const int blocksPerMatX,
+                         const int blocksPerMatY) {
+    const int oz = blockIdx.x / blocksPerMatX;
+    const int ow = (blockIdx.y + blockIdx.z * gridDim.y) / blocksPerMatY;
+
+    const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
+    const int blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - ow * blocksPerMatY;
+
+    const int xx = threadIdx.x + blockIdx_x * blockDim.x;
+    const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+
+    const int incy = blocksPerMatY * blockDim.y;
+    const int incx = blocksPerMatX * blockDim.x;
+
+    T *d_r       = r.ptr;
+    const T *d_i = in.ptr;
+
+    const T one  = scalar<T>(1);
+    const T zero = scalar<T>(0);
+
+    if (oz < r.dims[2] && ow < r.dims[3]) {
+        d_i = d_i + oz * in.strides[2] + ow * in.strides[3];
+        d_r = d_r + oz * r.strides[2] + ow * r.strides[3];
+
+        for (int oy = yy; oy < r.dims[1]; oy += incy) {
+            const T *Yd_i = d_i + oy * in.strides[1];
+            T *Yd_r       = d_r + oy * r.strides[1];
+
+            for (int ox = xx; ox < r.dims[0]; ox += incx) {
+                bool cond         = is_upper ? (oy >= ox) : (oy <= ox);
+                bool do_unit_diag = is_unit_diag && (ox == oy);
+                if (cond) {
+                    // Written this way because tests failed on compute 53
+                    Yd_r[ox] = do_unit_diag ? one : Yd_i[ox];
+                } else {
+                    Yd_r[ox] = zero;
+                }
+            }
+        }
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/triangle.hpp b/src/backend/cuda/kernel/triangle.hpp
index 6bce765703..ba922a3115 100644
--- a/src/backend/cuda/kernel/triangle.hpp
+++ b/src/backend/cuda/kernel/triangle.hpp
@@ -7,81 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <math.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
-#include <err_cuda.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <nvrtc_kernel_headers/triangle_cuh.hpp>
 
-namespace cuda
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const unsigned TX = 32;
-        static const unsigned TY = 8;
-        static const unsigned TILEX = 128;
-        static const unsigned TILEY = 32;
-
-        template<typename T, bool is_upper, bool is_unit_diag>
-        __global__
-        void triangle_kernel(Param<T> r, CParam<T> in,
-                             const int blocksPerMatX, const int blocksPerMatY)
-        {
-            const int oz = blockIdx.x / blocksPerMatX;
-            const int ow = blockIdx.y / blocksPerMatY;
-
-            const int blockIdx_x = blockIdx.x - oz * blocksPerMatX;
-            const int blockIdx_y = blockIdx.y - ow * blocksPerMatY;
-
-            const int xx = threadIdx.x + blockIdx_x * blockDim.x;
-            const int yy = threadIdx.y + blockIdx_y * blockDim.y;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-            const int incy = blocksPerMatY * blockDim.y;
-            const int incx = blocksPerMatX * blockDim.x;
+template<typename T>
+void triangle(Param<T> r, CParam<T> in, bool is_upper, bool is_unit_diag) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
 
-            T *d_r = r.ptr;
-            const T *d_i = in.ptr;
+    auto triangle = common::getKernel(
+        "arrayfire::cuda::triangle", {{triangle_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(is_upper),
+                     TemplateArg(is_unit_diag)));
 
-            if(oz < r.dims[2] && ow < r.dims[3]) {
-                d_i = d_i + oz * in.strides[2]    + ow * in.strides[3];
-                d_r = d_r + oz * r.strides[2] + ow * r.strides[3];
+    dim3 threads(TX, TY, 1);
 
-                for (int oy = yy; oy < r.dims[1]; oy += incy) {
-                    const T *Yd_i = d_i + oy * in.strides[1];
-                    T *Yd_r = d_r +  oy * r.strides[1];
+    int blocksPerMatX = divup(r.dims[0], TILEX);
+    int blocksPerMatY = divup(r.dims[1], TILEY);
+    dim3 blocks(blocksPerMatX * r.dims[2], blocksPerMatY * r.dims[3], 1);
 
-                    for (int ox = xx; ox < r.dims[0]; ox += incx) {
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-                        bool cond = is_upper ? (oy >= ox) : (oy <= ox);
-                        bool do_unit_diag  = is_unit_diag && (ox == oy);
-                        if(cond) {
-                            Yd_r[ox] = do_unit_diag ? scalar<T>(1) : Yd_i[ox];
-                        } else {
-                            Yd_r[ox] = scalar<T>(0);
-                        }
-                    }
-                }
-            }
-        }
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
 
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template<typename T, bool is_upper, bool is_unit_diag>
-        void triangle(Param<T> r, CParam<T> in)
-        {
-            dim3 threads(TX, TY, 1);
-
-            int blocksPerMatX = divup(r.dims[0], TILEX);
-            int blocksPerMatY = divup(r.dims[1], TILEY);
-            dim3 blocks(blocksPerMatX * r.dims[2],
-                        blocksPerMatY * r.dims[3],
-                        1);
-
-            triangle_kernel<T, is_upper, is_unit_diag><<<blocks, threads>>>(r, in, blocksPerMatX, blocksPerMatY);
-
-            POST_LAUNCH_CHECK();
-        }
-    }
+    triangle(qArgs, r, in, blocksPerMatX, blocksPerMatY);
+    POST_LAUNCH_CHECK();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/unwrap.cuh b/src/backend/cuda/kernel/unwrap.cuh
new file mode 100644
index 0000000000..415727a281
--- /dev/null
+++ b/src/backend/cuda/kernel/unwrap.cuh
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool is_column>
+__global__ void unwrap(Param<T> out, CParam<T> in, const int wx, const int wy,
+                       const int sx, const int sy, const int px, const int py,
+                       const int dx, const int dy, const int nx, int reps) {
+    // Compute channel and volume
+    const int w = (blockIdx.y + blockIdx.z * gridDim.y) / in.dims[2];
+    const int z = (blockIdx.y + blockIdx.z * gridDim.y) % in.dims[2];
+
+    if (w >= in.dims[3] || z >= in.dims[2]) return;
+
+    // Compute offset for channel and volume
+    const int cOut = w * out.strides[3] + z * out.strides[2];
+    const int cIn  = w * in.strides[3] + z * in.strides[2];
+
+    // Compute the output column index
+    const int id = is_column ? (blockIdx.x * blockDim.y + threadIdx.y)
+                             : (blockIdx.x * blockDim.x + threadIdx.x);
+
+    if (id >= (is_column ? out.dims[1] : out.dims[0])) return;
+
+    // Compute the starting index of window in x and y of input
+    const int startx = (id % nx) * sx;
+    const int starty = (id / nx) * sy;
+
+    const int spx = startx - px;
+    const int spy = starty - py;
+
+    // Offset the global pointers to the respective starting indices
+    T* optr       = out.ptr + cOut + id * (is_column ? out.strides[1] : 1);
+    const T* iptr = in.ptr + cIn;
+
+    // Compute output index local to column
+    int outIdx        = is_column ? threadIdx.x : threadIdx.y;
+    const int oStride = is_column ? blockDim.x : blockDim.y;
+    bool cond         = (spx >= 0 && spx + (wx * dx) < in.dims[0] && spy >= 0 &&
+                 spy + (wy * dy) < in.dims[1]);
+
+    for (int i = 0; i < reps; i++) {
+        if (outIdx >= (is_column ? out.dims[0] : out.dims[1])) return;
+
+        // Compute input index local to window
+        const int x = outIdx % wx;
+        const int y = outIdx / wx;
+
+        const int xpad = spx + x * dx;
+        const int ypad = spy + y * dy;
+
+        // Copy
+        T val = scalar<T>(0.0);
+        if (cond || (xpad >= 0 && xpad < in.dims[0] && ypad >= 0 &&
+                     ypad < in.dims[1])) {
+            const int inIdx = ypad * in.strides[1] + xpad * in.strides[0];
+            val             = iptr[inIdx];
+        }
+
+        if (is_column) {
+            optr[outIdx] = val;
+        } else {
+            optr[outIdx * out.strides[1]] = val;
+        }
+        outIdx += oStride;
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
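Review note on `unwrap.cuh`: the index arithmetic in the kernel is easier to check against a plain host loop. The following is a minimal sketch, assuming a single column-major 2D slice and the `is_column == true` layout (one window per output column). `unwrapReference` is a hypothetical helper written for this note and is not part of ArrayFire. The window counts `nx` and `ny` are passed in, as they are to the kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the column-mode unwrap kernel.
// `in` is one column-major slice of size inW x inH. Each of the nx * ny
// windows becomes one contiguous run of wx * wy output elements.
std::vector<float> unwrapReference(const std::vector<float> &in, int inW,
                                   int inH, int wx, int wy, int sx, int sy,
                                   int px, int py, int dx, int dy, int nx,
                                   int ny) {
    const int winElems = wx * wy;
    std::vector<float> out(static_cast<std::size_t>(winElems) * nx * ny, 0.0f);

    for (int id = 0; id < nx * ny; ++id) {
        // Top-left corner of the window in padded input coordinates
        const int spx = (id % nx) * sx - px;
        const int spy = (id / nx) * sy - py;

        for (int outIdx = 0; outIdx < winElems; ++outIdx) {
            // Position inside the window, scaled by the dilation
            const int xpad = spx + (outIdx % wx) * dx;
            const int ypad = spy + (outIdx / wx) * dy;

            // Elements that fall into the padding stay zero
            if (xpad >= 0 && xpad < inW && ypad >= 0 && ypad < inH) {
                out[static_cast<std::size_t>(id) * winElems + outIdx] =
                    in[static_cast<std::size_t>(ypad) * inW + xpad];
            }
        }
    }
    return out;
}
```

For a 3x3 input holding 0..8 with a 2x2 window, unit stride, and no padding, the first window unwraps to 0, 1, 3, 4. The kernel's `cond` flag is only a fast path that skips the per-element bounds test when the whole window lies inside the input, so this sketch omits it.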
diff --git a/src/backend/cuda/kernel/unwrap.hpp b/src/backend/cuda/kernel/unwrap.hpp
new file mode 100644
index 0000000000..20ad8e67e3
--- /dev/null
+++ b/src/backend/cuda/kernel/unwrap.hpp
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <kernel/config.hpp>
+#include <nvrtc_kernel_headers/unwrap_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+void unwrap(Param<T> out, CParam<T> in, const int wx, const int wy,
+            const int sx, const int sy, const int px, const int py,
+            const int dx, const int dy, const int nx, const bool is_column) {
+    auto unwrap = common::getKernel(
+        "arrayfire::cuda::unwrap", {{unwrap_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(is_column)));
+
+    dim3 threads, blocks;
+    int reps;
+
+    if (is_column) {
+        int TX = std::min(THREADS_PER_BLOCK, nextpow2(out.dims[0]));
+
+        threads = dim3(TX, THREADS_PER_BLOCK / TX);
+        blocks = dim3(divup(out.dims[1], threads.y), out.dims[2] * out.dims[3]);
+        reps   = divup((wx * wy),
+                       threads.x);  // is > 1 only when TX == 256 && wx * wy > 256
+    } else {
+        threads = dim3(THREADS_X, THREADS_Y);
+        blocks = dim3(divup(out.dims[0], threads.x), out.dims[2] * out.dims[3]);
+
+        reps = divup((wx * wy), threads.y);
+    }
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    unwrap(qArgs, out, in, wx, wy, sx, sy, px, py, dx, dy, nx, reps);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
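Review note on the launch configuration: the launchers in this change split an oversized `blocks.y` across `blocks.z`, and the kernels rebuild the logical index as `blockIdx.y + blockIdx.z * gridDim.y`. The sketch below reproduces that arithmetic on the host. `Grid`, `splitGridY`, and `logicalBlockY` are hypothetical names for this note, and the local `divup` stands in for the ArrayFire helper of the same name.

```cpp
// Stand-in for the ArrayFire divup helper: integer ceiling division.
inline int divup(int a, int b) { return (a + b - 1) / b; }

// Hypothetical pair of grid extents along y and z.
struct Grid {
    int y;
    int z;
};

// Split a logical y block count so that the y extent never exceeds
// maxBlocksY. This mirrors the two divup lines in the launchers.
inline Grid splitGridY(int logicalBlocksY, int maxBlocksY) {
    Grid g;
    g.z = divup(logicalBlocksY, maxBlocksY);
    g.y = divup(logicalBlocksY, g.z);
    return g;
}

// Kernel-side reconstruction of the logical y block index.
inline int logicalBlockY(int blockIdxY, int blockIdxZ, int gridDimY) {
    return blockIdxY + blockIdxZ * gridDimY;
}
```

Because both steps round up, `g.y * g.z` can exceed the requested count: 7 logical blocks with a limit of 3 yield a 3 x 3 grid. The surplus blocks are the reason the kernels still need their early-return bounds checks.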
diff --git a/src/backend/cuda/kernel/where.cuh b/src/backend/cuda/kernel/where.cuh
new file mode 100644
index 0000000000..a9e31d2739
--- /dev/null
+++ b/src/backend/cuda/kernel/where.cuh
@@ -0,0 +1,60 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+__global__ void where(uint *optr, CParam<uint> otmp, CParam<uint> rtmp,
+                      CParam<T> in, uint blocks_x, uint blocks_y, uint lim) {
+    const uint tidx = threadIdx.x;
+    const uint tidy = threadIdx.y;
+
+    const uint zid        = blockIdx.x / blocks_x;
+    const uint wid        = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+    const uint blockIdx_x = blockIdx.x - (blocks_x)*zid;
+    const uint blockIdx_y =
+        (blockIdx.y + blockIdx.z * gridDim.y) - (blocks_y)*wid;
+    const uint xid = blockIdx_x * blockDim.x * lim + tidx;
+    const uint yid = blockIdx_y * blockDim.y + tidy;
+
+    const uint *otptr = otmp.ptr;
+    const uint *rtptr = rtmp.ptr;
+    const T *iptr     = in.ptr;
+
+    const uint off =
+        wid * otmp.strides[3] + zid * otmp.strides[2] + yid * otmp.strides[1];
+    const uint bid = wid * rtmp.strides[3] + zid * rtmp.strides[2] +
+                     yid * rtmp.strides[1] + blockIdx_x;
+
+    otptr +=
+        wid * otmp.strides[3] + zid * otmp.strides[2] + yid * otmp.strides[1];
+    iptr += wid * in.strides[3] + zid * in.strides[2] + yid * in.strides[1];
+
+    bool cond =
+        (yid < otmp.dims[1]) && (zid < otmp.dims[2]) && (wid < otmp.dims[3]);
+    T zero = scalar<T>(0);
+
+    if (!cond) return;
+
+    uint accum = (bid == 0) ? 0 : rtptr[bid - 1];
+
+    for (uint k = 0, id = xid; k < lim && id < otmp.dims[0];
+         k++, id += blockDim.x) {
+        uint idx = otptr[id] + accum;
+        if (iptr[id] != zero) optr[idx - 1] = (off + id);
+    }
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/where.hpp b/src/backend/cuda/kernel/where.hpp
index a3e6e4e5cd..0b500d4628 100644
--- a/src/backend/cuda/kernel/where.hpp
+++ b/src/backend/cuda/kernel/where.hpp
@@ -7,139 +7,101 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <ops.hpp>
-#include <backend.hpp>
 #include <Param.hpp>
-#include <dispatch.hpp>
-#include <math.hpp>
-#include <err_cuda.hpp>
+#include <backend.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_cuda.hpp>
+#include <err_cuda.hpp>
 #include <memory.hpp>
+#include <nvrtc_kernel_headers/where_cuh.hpp>
 #include "config.hpp"
 #include "scan_first.hpp"
 
-namespace cuda
-{
-namespace kernel
-{
-
-    template<typename T>
-    __global__
-    static void get_out_idx(uint *optr,
-                            CParam<uint> otmp,
-                            CParam<uint> rtmp,
-                            CParam<T> in,
-                            uint blocks_x,
-                            uint blocks_y,
-                            uint lim)
-    {
-        const uint tidx = threadIdx.x;
-        const uint tidy = threadIdx.y;
-
-        const uint zid = blockIdx.x / blocks_x;
-        const uint wid = blockIdx.y / blocks_y;
-        const uint blockIdx_x = blockIdx.x - (blocks_x) * zid;
-        const uint blockIdx_y = blockIdx.y - (blocks_y) * wid;
-        const uint xid = blockIdx_x * blockDim.x * lim + tidx;
-        const uint yid = blockIdx_y * blockDim.y + tidy;
-
-        const uint *otptr = otmp.ptr;
-        const uint *rtptr = rtmp.ptr;
-        const T *iptr = in.ptr;
-
-        const uint off = wid * otmp.strides[3] + zid * otmp.strides[2] + yid * otmp.strides[1];
-        const uint bid = wid * rtmp.strides[3] + zid * rtmp.strides[2] + yid * rtmp.strides[1] + blockIdx_x;
-
-        otptr += wid * otmp.strides[3] + zid * otmp.strides[2] + yid * otmp.strides[1];
-        iptr  += wid *   in.strides[3] + zid *   in.strides[2] + yid *   in.strides[1];
-
-        bool cond = (yid < otmp.dims[1]) && (zid < otmp.dims[2]) && (wid < otmp.dims[3]);
-        T zero = scalar<T>(0);
-
-        if (!cond) return;
-
-        uint accum = (bid == 0) ? 0 : rtptr[bid - 1];
-
-        for (uint k = 0, id = xid;
-             k < lim && id < otmp.dims[0];
-             k++, id += blockDim.x) {
-
-            uint idx = otptr[id] + accum;
-            if (iptr[id] != zero) optr[idx - 1] = (off + id);
-        }
-    }
-
-    template<typename T>
-    static void where(Param<uint> &out, CParam<T> in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_BLOCK);
-        uint threads_y = THREADS_PER_BLOCK / threads_x;
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
 
-        uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
-        uint blocks_y = divup(in.dims[1], threads_y);
+template<typename T>
+static void where(Param<uint> &out, CParam<T> in) {
+    auto where = common::getKernel("arrayfire::cuda::where", {{where_cuh_src}},
+                                   TemplateArgs(TemplateTypename<T>()));
 
-        Param<uint> rtmp;
-        Param<uint> otmp;
-        rtmp.dims[0] = blocks_x;
-        otmp.dims[0] = in.dims[0];
-        rtmp.strides[0] = 1;
-        otmp.strides[0] = 1;
+    uint threads_x = nextpow2(std::max(32u, (uint)in.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
 
-        for (int k = 1; k < 4; k++) {
-            rtmp.dims[k] = in.dims[k];
-            rtmp.strides[k] = rtmp.strides[k - 1] * rtmp.dims[k - 1];
+    uint blocks_x = divup(in.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.dims[1], threads_y);
 
-            otmp.dims[k] = in.dims[k];
-            otmp.strides[k] = otmp.strides[k - 1] * otmp.dims[k - 1];
-        }
+    Param<uint> rtmp;
+    Param<uint> otmp;
+    rtmp.dims[0]    = blocks_x;
+    otmp.dims[0]    = in.dims[0];
+    rtmp.strides[0] = 1;
+    otmp.strides[0] = 1;
 
-        int rtmp_elements = rtmp.strides[3] * rtmp.dims[3];
-        rtmp.ptr = memAlloc<uint>(rtmp_elements);
+    for (int k = 1; k < 4; k++) {
+        rtmp.dims[k]    = in.dims[k];
+        rtmp.strides[k] = rtmp.strides[k - 1] * rtmp.dims[k - 1];
 
-        int otmp_elements = otmp.strides[3] * otmp.dims[3];
-        otmp.ptr = memAlloc<uint>(otmp_elements);
+        otmp.dims[k]    = in.dims[k];
+        otmp.strides[k] = otmp.strides[k - 1] * otmp.dims[k - 1];
+    }
 
-        scan_first_launcher<T, uint, af_notzero_t, false>(otmp, rtmp, in,
-                                                          blocks_x, blocks_y,
-                                                          threads_x);
+    int rtmp_elements = rtmp.strides[3] * rtmp.dims[3];
+    int otmp_elements = otmp.strides[3] * otmp.dims[3];
+    auto rtmp_alloc   = memAlloc<uint>(rtmp_elements);
+    auto otmp_alloc   = memAlloc<uint>(otmp_elements);
+    rtmp.ptr          = rtmp_alloc.get();
+    otmp.ptr          = otmp_alloc.get();
+
+    scan_first_launcher<T, uint, af_notzero_t>(
+        otmp, rtmp, in, blocks_x, blocks_y, threads_x, false, true);
+
+    // Linearize the dimensions and perform scan
+    Param<uint> ltmp = rtmp;
+    ltmp.dims[0]     = rtmp_elements;
+    for (int k = 1; k < 4; k++) {
+        ltmp.dims[k]    = 1;
+        ltmp.strides[k] = rtmp_elements;
+    }
 
-        // Linearize the dimensions and perform scan
-        Param<uint> ltmp = rtmp;
-        ltmp.dims[0] = rtmp_elements;
-        for (int k = 1; k < 4; k++) {
-            ltmp.dims[k] = 1;
-            ltmp.strides[k] = rtmp_elements;
-        }
+    scan_first<uint, uint, af_add_t>(ltmp, ltmp, true);
 
-        scan_first<uint, uint, af_add_t>(ltmp, ltmp);
+    // Get output size and allocate output
+    uint total;
+    CUDA_CHECK(cudaMemcpyAsync(&total, rtmp.ptr + rtmp_elements - 1,
+                               sizeof(uint), cudaMemcpyDeviceToHost,
+                               getActiveStream()));
+    CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
 
-        // Get output size and allocate output
-        uint total;
-        CUDA_CHECK(cudaMemcpy(&total, rtmp.ptr + rtmp_elements - 1,
-                              sizeof(uint), cudaMemcpyDeviceToHost));
+    auto out_alloc = memAlloc<uint>(total);
+    out.ptr        = out_alloc.get();
 
-        out.ptr = memAlloc<uint>(total);
+    out.dims[0]    = total;
+    out.strides[0] = 1;
+    for (int k = 1; k < 4; k++) {
+        out.dims[k]    = 1;
+        out.strides[k] = total;
+    }
 
-        out.dims[0] = total;
-        out.strides[0] = 1;
-        for (int k = 1; k < 4; k++) {
-            out.dims[k] = 1;
-            out.strides[k] = total;
-        }
+    dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
+    dim3 blocks(blocks_x * in.dims[2], blocks_y * in.dims[3]);
 
-        dim3 threads(threads_x, THREADS_PER_BLOCK / threads_x);
-        dim3 blocks(blocks_x * in.dims[2],
-                    blocks_y * in.dims[3]);
+    uint lim = divup(otmp.dims[0], (threads_x * blocks_x));
 
-        uint lim = divup(otmp.dims[0], (threads_x * blocks_x));
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
 
-        (get_out_idx<T>)<<<blocks, threads>>>(out.ptr, otmp, rtmp, in, blocks_x, blocks_y, lim);
-        POST_LAUNCH_CHECK();
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+    where(qArgs, out.ptr, otmp, rtmp, in, blocks_x, blocks_y, lim);
+    POST_LAUNCH_CHECK();
 
-        memFree(rtmp.ptr);
-        memFree(otmp.ptr);
-    }
-}
+    out_alloc.release();
 }
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
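Review note on `where`: the launcher scans the "not zero" flags, reads the last scan value as the output length, and then lets each nonzero element write its own index to slot `scan - 1`. A minimal single-threaded sketch of the same idea follows, assuming a flat `float` input. `whereReference` is a hypothetical name, and the blocked two-level scan used on the device is collapsed into one loop here.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: indices of the nonzero entries of `in`.
std::vector<unsigned> whereReference(const std::vector<float> &in) {
    // Inclusive scan of the "not zero" flags
    std::vector<unsigned> scan(in.size());
    unsigned running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        running += (in[i] != 0.0f) ? 1u : 0u;
        scan[i] = running;
    }

    // The last scan entry is the number of nonzero elements
    const unsigned total = in.empty() ? 0u : scan.back();
    std::vector<unsigned> out(total);

    // Scatter: each nonzero element owns output slot scan[i] - 1
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != 0.0f) { out[scan[i] - 1] = static_cast<unsigned>(i); }
    }
    return out;
}
```

The scatter step needs no synchronization because the scan assigns every nonzero element a distinct slot, which is what allows the device kernel to write `optr[idx - 1]` from independent threads.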
diff --git a/src/backend/cuda/kernel/wrap.cuh b/src/backend/cuda/kernel/wrap.cuh
new file mode 100644
index 0000000000..9200d78f13
--- /dev/null
+++ b/src/backend/cuda/kernel/wrap.cuh
@@ -0,0 +1,148 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, bool is_column>
+__global__ void wrap(Param<T> out, CParam<T> in, const int wx, const int wy,
+                     const int sx, const int sy, const int px, const int py,
+                     const int nx, const int ny, int blocks_x, int blocks_y) {
+    int idx2 = blockIdx.x / blocks_x;
+    int idx3 = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+
+    int blockIdx_x = blockIdx.x - idx2 * blocks_x;
+    int blockIdx_y = (blockIdx.y + blockIdx.z * gridDim.y) - idx3 * blocks_y;
+
+    int oidx0 = threadIdx.x + blockDim.x * blockIdx_x;
+    int oidx1 = threadIdx.y + blockDim.y * blockIdx_y;
+
+    T *optr       = out.ptr + idx2 * out.strides[2] + idx3 * out.strides[3];
+    const T *iptr = in.ptr + idx2 * in.strides[2] + idx3 * in.strides[3];
+
+    if (oidx0 >= out.dims[0] || oidx1 >= out.dims[1] || idx2 >= out.dims[2] ||
+        idx3 >= out.dims[3])
+        return;
+
+    int pidx0 = oidx0 + px;
+    int pidx1 = oidx1 + py;
+
+    // The last time a value appears in the unwrapped index is
+    // padded_index / stride. Each previous index has the value appear
+    // "stride" locations earlier. We work our way back from the last index.
+
+    const int x_end = min(pidx0 / sx, nx - 1);
+    const int y_end = min(pidx1 / sy, ny - 1);
+
+    const int x_off = pidx0 - sx * x_end;
+    const int y_off = pidx1 - sy * y_end;
+
+    T val   = scalar<T>(0);
+    int idx = 1;
+
+    for (int y = y_end, yo = y_off; y >= 0 && yo < wy; yo += sy, y--) {
+        int win_end_y = yo * wx;
+        int dim_end_y = y * nx;
+
+        for (int x = x_end, xo = x_off; x >= 0 && xo < wx; xo += sx, x--) {
+            int win_end = win_end_y + xo;
+            int dim_end = dim_end_y + x;
+
+            if (is_column) {
+                idx = dim_end * in.strides[1] + win_end;
+            } else {
+                idx = dim_end + win_end * in.strides[1];
+            }
+
+            val = val + iptr[idx];
+        }
+    }
+
+    optr[oidx1 * out.strides[1] + oidx0] = val;
+}
+
+template<typename T, bool is_column>
+__global__ void wrap_dilated(Param<T> out, CParam<T> in, const int wx,
+                             const int wy, const int sx, const int sy,
+                             const int px, const int py, const int dx,
+                             const int dy, const int nx, const int ny,
+                             int blocks_x, int blocks_y) {
+    int idx2 = blockIdx.x / blocks_x;
+    int idx3 = (blockIdx.y + blockIdx.z * gridDim.y) / blocks_y;
+
+    int blockIdx_x = blockIdx.x - idx2 * blocks_x;
+    int blockIdx_y = (blockIdx.y + blockIdx.z * gridDim.y) - idx3 * blocks_y;
+
+    int oidx0 = threadIdx.x + blockDim.x * blockIdx_x;
+    int oidx1 = threadIdx.y + blockDim.y * blockIdx_y;
+
+    T *optr       = out.ptr + idx2 * out.strides[2] + idx3 * out.strides[3];
+    const T *iptr = in.ptr + idx2 * in.strides[2] + idx3 * in.strides[3];
+
+    if (oidx0 >= out.dims[0] || oidx1 >= out.dims[1] || idx2 >= out.dims[2] ||
+        idx3 >= out.dims[3])
+        return;
+
+    int eff_wx = wx + (wx - 1) * (dx - 1);
+    int eff_wy = wy + (wy - 1) * (dy - 1);
+
+    int pidx0 = oidx0 + px;
+    int pidx1 = oidx1 + py;
+
+    // A padded index can only appear in windows [x_start, x_end) and
+    // [y_start, y_end). Visit each candidate window and accumulate the
+    // entries whose offset is a multiple of the dilation.
+
+    const int x_start = (pidx0 < eff_wx) ? 0 : (pidx0 - eff_wx) / sx + 1;
+    const int y_start = (pidx1 < eff_wy) ? 0 : (pidx1 - eff_wy) / sy + 1;
+
+    const int x_end = min(pidx0 / sx + 1, nx);
+    const int y_end = min(pidx1 / sy + 1, ny);
+
+    T val   = scalar<T>(0);
+    int idx = 1;
+
+    for (int y = y_start; y < y_end; y++) {
+        int fy      = (pidx1 - y * sy);
+        bool yvalid = (fy % dy == 0) && (y < ny);
+        fy /= dy;
+
+        int win_end_y = fy * wx;
+        int dim_end_y = y * nx;
+
+        for (int x = x_start; x < x_end; x++) {
+            int fx      = (pidx0 - x * sx);
+            bool xvalid = (fx % dx == 0) && (x < nx);
+            fx /= dx;
+
+            int win_end = win_end_y + fx;
+            int dim_end = dim_end_y + x;
+
+            if (is_column) {
+                idx = dim_end * in.strides[1] + win_end;
+            } else {
+                idx = dim_end + win_end * in.strides[1];
+            }
+
+            T ival;
+            ival = (yvalid && xvalid) ? iptr[idx] : scalar<T>(0);
+            val  = val + ival;
+        }
+    }
+
+    optr[oidx1 * out.strides[1] + oidx0] = val;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/kernel/wrap.hpp b/src/backend/cuda/kernel/wrap.hpp
new file mode 100644
index 0000000000..e95db0f3f3
--- /dev/null
+++ b/src/backend/cuda/kernel/wrap.hpp
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_cuda.hpp>
+#include <kernel/config.hpp>
+#include <nvrtc_kernel_headers/wrap_cuh.hpp>
+
+namespace arrayfire {
+namespace cuda {
+namespace kernel {
+
+template<typename T>
+void wrap(Param<T> out, CParam<T> in, const int wx, const int wy, const int sx,
+          const int sy, const int px, const int py, const bool is_column) {
+    auto wrap = common::getKernel(
+        "arrayfire::cuda::wrap", {{wrap_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(is_column)));
+
+    int nx = (out.dims[0] + 2 * px - wx) / sx + 1;
+    int ny = (out.dims[1] + 2 * py - wy) / sy + 1;
+
+    dim3 threads(THREADS_X, THREADS_Y);
+    int blocks_x = divup(out.dims[0], threads.x);
+    int blocks_y = divup(out.dims[1], threads.y);
+
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    wrap(qArgs, out, in, wx, wy, sx, sy, px, py, nx, ny, blocks_x, blocks_y);
+    POST_LAUNCH_CHECK();
+}
+
+template<typename T>
+void wrap_dilated(Param<T> out, CParam<T> in, const dim_t wx, const dim_t wy,
+                  const dim_t sx, const dim_t sy, const dim_t px,
+                  const dim_t py, const dim_t dx, const dim_t dy,
+                  const bool is_column) {
+    auto wrap = common::getKernel(
+        "arrayfire::cuda::wrap_dilated", {{wrap_cuh_src}},
+        TemplateArgs(TemplateTypename<T>(), TemplateArg(is_column)));
+
+    int nx = 1 + (out.dims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    int ny = 1 + (out.dims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    dim3 threads(THREADS_X, THREADS_Y);
+    int blocks_x = divup(out.dims[0], threads.x);
+    int blocks_y = divup(out.dims[1], threads.y);
+
+    dim3 blocks(blocks_x * out.dims[2], blocks_y * out.dims[3]);
+
+    const int maxBlocksY = getDeviceProp(getActiveDeviceId()).maxGridSize[1];
+    blocks.z             = divup(blocks.y, maxBlocksY);
+    blocks.y             = divup(blocks.y, blocks.z);
+
+    EnqueueArgs qArgs(blocks, threads, getActiveStream());
+
+    wrap(qArgs, out, in, wx, wy, sx, sy, px, py, dx, dy, nx, ny, blocks_x,
+         blocks_y);
+    POST_LAUNCH_CHECK();
+}
+
+}  // namespace kernel
+}  // namespace cuda
+}  // namespace arrayfire
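Review note on the window counts: `wrap` and `wrap_dilated` derive `nx` and `ny` from the output size, padding, window size, stride, and (for the dilated case) dilation. The sketch below isolates those formulas so they can be checked by hand. The function names are hypothetical and chosen for this note only.

```cpp
// Extent covered by a window of w taps spaced d apart.
// Equal to the kernel's w + (w - 1) * (d - 1).
inline int effectiveWindow(int w, int d) { return (w - 1) * d + 1; }

// Number of window positions along one axis, as computed in wrap().
inline int windowCount(int dim, int w, int s, int p) {
    return (dim + 2 * p - w) / s + 1;
}

// Number of window positions along one axis, as computed in wrap_dilated().
inline int windowCountDilated(int dim, int w, int s, int p, int d) {
    return 1 + (dim + 2 * p - effectiveWindow(w, d)) / s;
}
```

With a dilation of 1 the two counts agree, so `windowCountDilated(5, 3, 1, 0, 1)` and `windowCount(5, 3, 1, 0)` both give 3. Raising the dilation to 2 widens the effective window to 5 and leaves a single position.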
diff --git a/src/backend/cuda/logic.hpp b/src/backend/cuda/logic.hpp
index 7b29b19d48..88c11b3d09 100644
--- a/src/backend/cuda/logic.hpp
+++ b/src/backend/cuda/logic.hpp
@@ -7,25 +7,22 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <optypes.hpp>
-#include <err_cuda.hpp>
-#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <af/dim4.hpp>
 
-namespace cuda
-{
-    template<typename T, af_op_t op>
-    Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<char, T, op>(lhs, rhs, odims);
-    }
+namespace arrayfire {
+namespace cuda {
+template<typename T, af_op_t op>
+Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs,
+                    const af::dim4 &odims) {
+    return common::createBinaryNode<char, T, op>(lhs, rhs, odims);
+}
 
-    template<typename T, af_op_t op>
-    Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<T, T, op>(lhs, rhs, odims);
-    }
+template<typename T, af_op_t op>
+Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/lookup.cpp b/src/backend/cuda/lookup.cpp
new file mode 100644
index 0000000000..ca5b8f79ed
--- /dev/null
+++ b/src/backend/cuda/lookup.cpp
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <lookup.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/lookup.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename in_t, typename idx_t>
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim) {
+    const dim4 &iDims = input.dims();
+
+    dim4 oDims(1);
+    for (dim_t d = 0; d < 4; ++d) {
+        oDims[d] = (d == dim ? indices.elements() : iDims[d]);
+    }
+
+    Array<in_t> out = createEmptyArray<in_t>(oDims);
+
+    dim_t nDims = iDims.ndims();
+
+    kernel::lookup<in_t, idx_t>(out, input, indices, nDims, dim);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> lookup<T, float>(const Array<T> &, const Array<float> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, double>(                                       \
+        const Array<T> &, const Array<double> &, const unsigned);              \
+    template Array<T> lookup<T, int>(const Array<T> &, const Array<int> &,     \
+                                     const unsigned);                          \
+    template Array<T> lookup<T, unsigned>(                                     \
+        const Array<T> &, const Array<unsigned> &, const unsigned);            \
+    template Array<T> lookup<T, short>(const Array<T> &, const Array<short> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, ushort>(                                       \
+        const Array<T> &, const Array<ushort> &, const unsigned);              \
+    template Array<T> lookup<T, intl>(const Array<T> &, const Array<intl> &,   \
+                                      const unsigned);                         \
+    template Array<T> lookup<T, uintl>(const Array<T> &, const Array<uintl> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, schar>(const Array<T> &, const Array<schar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, uchar>(const Array<T> &, const Array<uchar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, half>(const Array<T> &, const Array<half> &,   \
+                                      const unsigned);
+
+INSTANTIATE(float);
+INSTANTIATE(cfloat);
+INSTANTIATE(double);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(unsigned);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(half);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/lookup.cu b/src/backend/cuda/lookup.cu
deleted file mode 100644
index 8f910dea6a..0000000000
--- a/src/backend/cuda/lookup.cu
+++ /dev/null
@@ -1,58 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <lookup.hpp>
-#include <err_cuda.hpp>
-#include <kernel/lookup.hpp>
-
-namespace cuda
-{
-
-template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim)
-{
-    const dim4 iDims = input.dims();
-
-    dim4 oDims(1);
-    for (dim_t d=0; d<4; ++d)
-        oDims[d] = (d==dim ? indices.elements() : iDims[d]);
-
-    Array<in_t> out = createEmptyArray<in_t>(oDims);
-
-    dim_t nDims = iDims.ndims();
-
-    switch(dim) {
-        case 0: kernel::lookup<in_t, idx_t, 0>(out, input, indices, nDims); break;
-        case 1: kernel::lookup<in_t, idx_t, 1>(out, input, indices, nDims); break;
-        case 2: kernel::lookup<in_t, idx_t, 2>(out, input, indices, nDims); break;
-        case 3: kernel::lookup<in_t, idx_t, 3>(out, input, indices, nDims); break;
-    }
-
-    return out;
-}
-
-#define INSTANTIATE(T)  \
-    template Array<T> lookup<T, float   >(const Array<T> &input, const Array<float   > &indices, const unsigned dim); \
-    template Array<T> lookup<T, double  >(const Array<T> &input, const Array<double  > &indices, const unsigned dim); \
-    template Array<T> lookup<T, int     >(const Array<T> &input, const Array<int     > &indices, const unsigned dim); \
-    template Array<T> lookup<T, unsigned>(const Array<T> &input, const Array<unsigned> &indices, const unsigned dim); \
-    template Array<T> lookup<T, uchar   >(const Array<T> &input, const Array<uchar   > &indices, const unsigned dim);
-
-INSTANTIATE(float   );
-INSTANTIATE(cfloat  );
-INSTANTIATE(double  );
-INSTANTIATE(cdouble );
-INSTANTIATE(int     );
-INSTANTIATE(unsigned);
-INSTANTIATE(intl    );
-INSTANTIATE(uintl   );
-INSTANTIATE(uchar   );
-INSTANTIATE(char    );
-
-}
diff --git a/src/backend/cuda/lookup.hpp b/src/backend/cuda/lookup.hpp
index c8732952f1..0dc298805b 100644
--- a/src/backend/cuda/lookup.hpp
+++ b/src/backend/cuda/lookup.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-
+namespace arrayfire {
+namespace cuda {
 template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim);
-
-}
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/lu.cpp b/src/backend/cuda/lu.cpp
new file mode 100644
index 0000000000..addae1e7ba
--- /dev/null
+++ b/src/backend/cuda/lu.cpp
@@ -0,0 +1,151 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <lu.hpp>
+
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <cusolverDn.hpp>
+#include <kernel/lu_split.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace cuda {
+
+// cusolverStatus_t CUDENSEAPI cusolverDn<>getrf_bufferSize(
+//        cusolverDnHandle_t handle,
+//        int m, int n,
+//        <> *A,
+//        int lda, int *Lwork );
+//
+//
+// cusolverStatus_t CUDENSEAPI cusolverDn<>getrf(
+//        cusolverDnHandle_t handle,
+//        int m, int n,
+//        <> *A,
+//        int lda,
+//        <> *Workspace,
+//        int *devIpiv, int *devInfo );
+
+template<typename T>
+struct getrf_func_def_t {
+    using getrf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t, int, int,
+                                                T *, int, T *, int *, int *);
+};
+
+template<typename T>
+struct getrf_buf_func_def_t {
+    using getrf_buf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t, int,
+                                                    int, T *, int, int *);
+};
+
+#define LU_FUNC_DEF(FUNC)                                         \
+    template<typename T>                                          \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func(); \
+                                                                  \
+    template<typename T>                                          \
+    typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def FUNC##_buf_func();
+
+#define LU_FUNC(FUNC, TYPE, PREFIX)                                         \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cusolverDn##PREFIX##FUNC;                                    \
+    }                                                                       \
+                                                                            \
+    template<>                                                              \
+    typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def               \
+        FUNC##_buf_func<TYPE>() {                                           \
+        return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def) &         \
+               cusolverDn##PREFIX##FUNC##_bufferSize;                       \
+    }
+
+LU_FUNC_DEF(getrf)
+LU_FUNC(getrf, float, S)
+LU_FUNC(getrf, double, D)
+LU_FUNC(getrf, cfloat, C)
+LU_FUNC(getrf, cdouble, Z)
+
+void convertPivot(Array<int> &pivot, int out_sz) {
+    dim_t d0 = pivot.dims()[0];
+
+    std::vector<int> d_po(out_sz);
+    for (int i = 0; i < out_sz; i++) { d_po[i] = i; }
+
+    std::vector<int> d_pi(d0);
+    copyData(&d_pi[0], pivot);
+
+    for (int j = 0; j < d0; j++) {
+        // 1 indexed in pivot
+        std::swap(d_po[j], d_po[d_pi[j] - 1]);
+    }
+
+    pivot = createHostDataArray<int>(out_sz, &d_po[0]);
+}
+
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> in_copy = copyArray<T>(in);
+    pivot            = lu_inplace(in_copy);
+
+    // SPLIT into lower and upper
+    dim4 ldims(M, std::min(M, N));
+    dim4 udims(std::min(M, N), N);
+    lower = createEmptyArray<T>(ldims);
+    upper = createEmptyArray<T>(udims);
+    kernel::lu_split<T>(lower, upper, in_copy);
+}
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<int> pivot = createEmptyArray<int>(af::dim4(std::min(M, N), 1, 1, 1));
+
+    int lwork = 0;
+
+    CUSOLVER_CHECK(getrf_buf_func<T>()(solverDnHandle(), M, N, in.get(),
+                                       in.strides()[1], &lwork));
+
+    auto workspace = memAlloc<T>(lwork);
+    auto info      = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(getrf_func<T>()(solverDnHandle(), M, N, in.get(),
+                                   in.strides()[1], workspace.get(),
+                                   pivot.get(), info.get()));
+
+    if (convert_pivot) { convertPivot(pivot, M); }
+
+    return pivot;
+}
+
+bool isLAPACKAvailable() { return true; }
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> & in,             \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> & lower, Array<T> & upper,      \
+                        Array<int> & pivot, const Array<T> &in);
+
+INSTANTIATE_LU(float)
+INSTANTIATE_LU(cfloat)
+INSTANTIATE_LU(double)
+INSTANTIATE_LU(cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
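
Note on `convertPivot` in the new `lu.cpp`: `getrf` reports pivots LAPACK-style, as a sequence of 1-indexed row interchanges, and the loop replays those interchanges on an identity vector to get a 0-indexed permutation. The following is a minimal host-only sketch of that idea, assuming plain `std::vector` in place of `Array<int>`; the helper name `pivotsToPermutation` is hypothetical and not part of ArrayFire.

```cpp
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper (not ArrayFire API): replays LAPACK-style pivots.
// ipiv[j] = r (1-indexed) means "row j was swapped with row r-1" at step j.
// Returns a 0-indexed permutation of length `rows`.
std::vector<int> pivotsToPermutation(const std::vector<int> &ipiv, int rows) {
    std::vector<int> perm(rows);
    std::iota(perm.begin(), perm.end(), 0);  // start from the identity

    for (std::size_t j = 0; j < ipiv.size(); ++j) {
        // Apply the j-th interchange; the -1 converts to 0-indexing.
        std::swap(perm[j], perm[ipiv[j] - 1]);
    }
    return perm;
}
```

With `ipiv = {3, 3, 3}` and `rows = 3`, the swaps give `{2, 1, 0}`, then `{2, 0, 1}`, then no change. The sketch assumes every pivot entry lies in `[1, rows]`, which `getrf` is expected to satisfy on success.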
diff --git a/src/backend/cuda/lu.cu b/src/backend/cuda/lu.cu
deleted file mode 100644
index 85dedf50e0..0000000000
--- a/src/backend/cuda/lu.cu
+++ /dev/null
@@ -1,195 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <lu.hpp>
-#include <err_common.hpp>
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <cusolverDnManager.hpp>
-#include <memory.hpp>
-#include <copy.hpp>
-#include <math.hpp>
-#include <err_common.hpp>
-
-#include <kernel/lu_split.hpp>
-
-namespace cuda
-{
-
-using cusolver::getDnHandle;
-
-//cusolverStatus_t CUDENSEAPI cusolverDn<>getrf_bufferSize(
-//        cusolverDnHandle_t handle,
-//        int m, int n,
-//        <> *A,
-//        int lda, int *Lwork );
-//
-//
-//cusolverStatus_t CUDENSEAPI cusolverDn<>getrf(
-//        cusolverDnHandle_t handle,
-//        int m, int n,
-//        <> *A,
-//        int lda,
-//        <> *Workspace,
-//        int *devIpiv, int *devInfo );
-
-template<typename T>
-struct getrf_func_def_t
-{
-    typedef cusolverStatus_t (*getrf_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int,
-                              T *,
-                              int *, int *);
-};
-
-template<typename T>
-struct getrf_buf_func_def_t
-{
-    typedef cusolverStatus_t (*getrf_buf_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int, int *);
-};
-
-#define LU_FUNC_DEF( FUNC )                                                     \
-template<typename T>                                                            \
-typename FUNC##_func_def_t<T>::FUNC##_func_def                                  \
-FUNC##_func();                                                                  \
-                                                                                \
-template<typename T>                                                            \
-typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def                          \
-FUNC##_buf_func();
-
-
-#define LU_FUNC( FUNC, TYPE, PREFIX )                                                   \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def                            \
-FUNC##_func<TYPE>()                                                                     \
-{ return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)&cusolverDn##PREFIX##FUNC; }         \
-                                                                                        \
-template<> typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def                    \
-FUNC##_buf_func<TYPE>()                                                                 \
-{ return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def)& cusolverDn##PREFIX##FUNC##_bufferSize; }
-
-LU_FUNC_DEF( getrf )
-LU_FUNC(getrf , float  , S)
-LU_FUNC(getrf , double , D)
-LU_FUNC(getrf , cfloat , C)
-LU_FUNC(getrf , cdouble, Z)
-
-void convertPivot(Array<int> &pivot, int out_sz)
-{
-    dim_t d0 = pivot.dims()[0];
-
-    std::vector<int> d_po(out_sz);
-    for(int i = 0; i < out_sz; i++) {
-        d_po[i] = i;
-    }
-
-    std::vector<int> d_pi(d0);
-    copyData(&d_pi[0], pivot);
-
-    for(int j = 0; j < d0; j++) {
-        // 1 indexed in pivot
-        std::swap(d_po[j], d_po[d_pi[j] - 1]);
-    }
-
-    pivot = createHostDataArray<int>(out_sz, &d_po[0]);
-}
-
-
-template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
-    dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
-    Array<T> in_copy = copyArray<T>(in);
-    pivot = lu_inplace(in_copy);
-
-    // SPLIT into lower and upper
-    dim4 ldims(M, min(M, N));
-    dim4 udims(min(M, N), N);
-    lower = createEmptyArray<T>(ldims);
-    upper = createEmptyArray<T>(udims);
-    kernel::lu_split<T>(lower, upper, in_copy);
-}
-
-template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
-    dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
-    Array<int> pivot = createEmptyArray<int>(af::dim4(min(M, N), 1, 1, 1));
-
-    int lwork = 0;
-
-    CUSOLVER_CHECK(getrf_buf_func<T>()(getDnHandle(),
-                                       M, N,
-                                       in.get(), in.strides()[1],
-                                       &lwork));
-
-    T *workspace = memAlloc<T>(lwork);
-    int *info = memAlloc<int>(1);
-
-    CUSOLVER_CHECK(getrf_func<T>()(getDnHandle(),
-                                   M, N,
-                                   in.get(), in.strides()[1],
-                                   workspace,
-                                   pivot.get(),
-                                   info));
-
-    if(convert_pivot) convertPivot(pivot, M);
-
-    memFree(workspace);
-    memFree(info);
-
-    return pivot;
-}
-
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
-
-INSTANTIATE_LU(float)
-INSTANTIATE_LU(cfloat)
-INSTANTIATE_LU(double)
-INSTANTIATE_LU(cdouble)
-}
-
-#else
-namespace cuda
-{
-template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
-
-INSTANTIATE_LU(float)
-INSTANTIATE_LU(cfloat)
-INSTANTIATE_LU(double)
-INSTANTIATE_LU(cdouble)
-}
-#endif
diff --git a/src/backend/cuda/lu.hpp b/src/backend/cuda/lu.hpp
index 0753129d6b..7ed639bef4 100644
--- a/src/backend/cuda/lu.hpp
+++ b/src/backend/cuda/lu.hpp
@@ -7,14 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in);
 
-    template<typename T>
-    Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
-}
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
+
+bool isLAPACKAvailable();
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/match_template.cpp b/src/backend/cuda/match_template.cpp
new file mode 100644
index 0000000000..63b50435b7
--- /dev/null
+++ b/src/backend/cuda/match_template.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/match_template.hpp>
+#include <match_template.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType) {
+    Array<outType> out = createEmptyArray<outType>(sImg.dims());
+    bool needMean = mType == AF_ZSAD || mType == AF_LSAD || mType == AF_ZSSD ||
+                    mType == AF_LSSD || mType == AF_ZNCC;
+    kernel::matchTemplate<inType, outType>(out, sImg, tImg, mType, needMean);
+    return out;
+}
+
+#define INSTANTIATE(in_t, out_t)                       \
+    template Array<out_t> match_template<in_t, out_t>( \
+        const Array<in_t> &, const Array<in_t> &, const af::matchType);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace cuda
+}  // namespace arrayfire
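
Note on the new `match_template.cpp`: the match type moves from a template parameter to a runtime argument, so each type pair needs one explicit instantiation instead of nine, and `needMean` becomes an ordinary runtime flag. A rough sketch of that predicate follows, assuming a stand-in enum `MatchType` and a hypothetical function `needsMean`; neither is ArrayFire's actual `af::matchType` API.

```cpp
// Stand-in for af::matchType, for illustration only.
enum class MatchType { SAD, ZSAD, LSAD, SSD, ZSSD, LSSD, NCC, ZNCC, SHD };

// Hypothetical predicate mirroring the needMean expression in the diff:
// these five metrics are the ones for which the diff sets needMean.
constexpr bool needsMean(MatchType m) {
    return m == MatchType::ZSAD || m == MatchType::LSAD ||
           m == MatchType::ZSSD || m == MatchType::LSSD ||
           m == MatchType::ZNCC;
}
```

The flag is computed once per call on the host and passed to the kernel launcher, as in the `kernel::matchTemplate` call above.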
diff --git a/src/backend/cuda/match_template.cu b/src/backend/cuda/match_template.cu
deleted file mode 100644
index 5b30eb03e8..0000000000
--- a/src/backend/cuda/match_template.cu
+++ /dev/null
@@ -1,58 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-#include <match_template.hpp>
-#include <kernel/match_template.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg)
-{
-    Array<outType> out = createEmptyArray<outType>(sImg.dims());
-
-    bool needMean = mType==AF_ZSAD || mType==AF_LSAD ||
-                    mType==AF_ZSSD || mType==AF_LSSD ||
-                    mType==AF_ZNCC;
-
-    if (needMean)
-        kernel::matchTemplate<inType, outType, mType, true>(out, sImg, tImg);
-    else
-        kernel::matchTemplate<inType, outType, mType, false>(out, sImg, tImg);
-
-    return out;
-}
-
-#define INSTANTIATE(in_t, out_t)\
-    template Array<out_t> match_template<in_t, out_t, AF_SAD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SSD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_NCC >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZNCC>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SHD >(const Array<in_t> &sImg, const Array<in_t> &tImg);
-
-INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
-
-}
diff --git a/src/backend/cuda/match_template.hpp b/src/backend/cuda/match_template.hpp
index 7803d90e23..fe98cea5e9 100644
--- a/src/backend/cuda/match_template.hpp
+++ b/src/backend/cuda/match_template.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg);
-
-}
+namespace arrayfire {
+namespace cuda {
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/math.cpp b/src/backend/cuda/math.cpp
deleted file mode 100644
index 928ad7ea80..0000000000
--- a/src/backend/cuda/math.cpp
+++ /dev/null
@@ -1,29 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <math.hpp>
-
-namespace cuda
-{
-    cfloat division(cfloat lhs, double rhs)
-    {
-        cfloat retVal;
-        retVal.x = real(lhs) / rhs;
-        retVal.y = imag(lhs) / rhs;
-        return retVal;
-    }
-
-    cdouble division(cdouble lhs, double rhs)
-    {
-        cdouble retVal;
-        retVal.x = real(lhs) / rhs;
-        retVal.y = imag(lhs) / rhs;
-        return retVal;
-    }
-}
diff --git a/src/backend/cuda/math.hpp b/src/backend/cuda/math.hpp
index 20301fd977..28574ac7e2 100644
--- a/src/backend/cuda/math.hpp
+++ b/src/backend/cuda/math.hpp
@@ -8,202 +8,462 @@
  ********************************************************/
 
 #pragma once
-#include <af/defines.h>
-#include <limits>
-#include <algorithm>
-#include "backend.hpp"
-#include "types.hpp"
+
+#ifndef __CUDACC_RTC__
+
+#include <common/defines.hpp>
 
 #ifdef __CUDACC__
-#include <math_functions.h>
+#include <cuda_runtime_api.h>
+#endif  //__CUDACC__
+
+#include <algorithm>
+#include <climits>
+#include <limits>
+
+#endif  //__CUDACC_RTC__
+
+#include <backend.hpp>
+#include <common/half.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+
+#include <cuda_fp16.h>
 #include <math_constants.h>
+
+namespace arrayfire {
+namespace cuda {
+
+#ifdef AF_WITH_FAST_MATH
+constexpr bool fast_math = true;
+#else
+constexpr bool fast_math = false;
 #endif
 
-namespace cuda
-{
-    template<typename T> static inline __DH__ T abs(T val)  { return abs(val); }
-    static inline __DH__ float  abs(float  val) { return fabsf(val); }
-    static inline __DH__ double abs(double val) { return fabs (val); }
-    static inline __DH__ float  abs(cfloat  cval) { return cuCabsf(cval); }
-    static inline __DH__ double abs(cdouble cval) { return cuCabs (cval); }
+template<typename T>
+static inline __DH__ T abs(T val) {
+    return ::abs(val);
+}
+static inline __DH__ int abs(int val) { return (val > 0 ? val : -val); }
+static inline __DH__ char abs(char val) { return (val > 0 ? val : -val); }
+static inline __DH__ float abs(float val) { return fabsf(val); }
+static inline __DH__ double abs(double val) { return fabs(val); }
+static inline __DH__ float abs(cfloat cval) { return cuCabsf(cval); }
+static inline __DH__ double abs(cdouble cval) { return cuCabs(cval); }
 
-    static inline __DH__ size_t min(size_t lhs, size_t rhs) { return lhs < rhs ? lhs : rhs; }
-    static inline __DH__ size_t max(size_t lhs, size_t rhs) { return lhs > rhs ? lhs : rhs; }
+static inline __DH__ size_t min(size_t lhs, size_t rhs) {
+    return lhs < rhs ? lhs : rhs;
+}
+static inline __DH__ size_t max(size_t lhs, size_t rhs) {
+    return lhs > rhs ? lhs : rhs;
+}
 
-#ifndef __CUDA_ARCH__
-    template<typename T> static inline __DH__ T min(T lhs, T rhs) { return std::min(lhs, rhs);}
-    template<typename T> static inline __DH__ T max(T lhs, T rhs) { return std::max(lhs, rhs);}
+#ifdef __CUDA_ARCH__
+static inline __device__ __half abs(__half val) {
+    return __short_as_half(__half_as_short(val) & 0x7FFF);
+}
+
+template<typename T>
+inline __DH__ T min(T lhs, T rhs) {
+    return ::min(lhs, rhs);
+}
+
+template<typename T>
+inline __DH__ T max(T lhs, T rhs) {
+    return ::max(lhs, rhs);
+}
+
+template<>
+inline __DH__ __half min<__half>(__half lhs, __half rhs) {
+#if __CUDA_ARCH__ >= 530
+    return __hlt(lhs, rhs) ? lhs : rhs;
 #else
-    template<typename T> static inline __DH__ T min(T lhs, T rhs) { return ::min(lhs, rhs);}
-    template<typename T> static inline __DH__ T max(T lhs, T rhs) { return ::max(lhs, rhs);}
+    return __half2float(lhs) < __half2float(rhs) ? lhs : rhs;
 #endif
+}
 
-    template<> __DH__
-    STATIC_ cfloat  max<cfloat >(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
+template<>
+inline __DH__ __half max<__half>(__half lhs, __half rhs) {
+#if __CUDA_ARCH__ >= 530
+    return __hgt(lhs, rhs) ? lhs : rhs;
+#else
+    return __half2float(lhs) > __half2float(rhs) ? lhs : rhs;
+#endif
+}
 
-    template<> __DH__
-    STATIC_ cdouble max<cdouble>(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
+#else
+template<typename T>
+static inline __DH__ T min(T lhs, T rhs) {
+    return std::min(lhs, rhs);
+}
+template<typename T>
+static inline __DH__ T max(T lhs, T rhs) {
+    return std::max(lhs, rhs);
+}
+#endif
 
-    template<> __DH__
-    STATIC_ cfloat  min<cfloat >(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs :  rhs;
-    }
+template<>
+__DH__ inline cfloat max<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
 
-    template<> __DH__
-    STATIC_ cdouble min<cdouble>(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs :  rhs;
-    }
+template<>
+__DH__ inline cdouble max<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
 
-    template<typename T> __DH__
-    static T scalar(double val)
-    {
-        return (T)(val);
-    }
+template<>
+__DH__ inline cfloat min<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
 
-    template<> __DH__
-    STATIC_ cfloat scalar<cfloat >(double val)
-    {
-        cfloat  cval = {(float)val, 0};
-        return cval;
-    }
+template<>
+__DH__ inline cdouble min<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
 
-    template<> __DH__
-    STATIC_ cdouble scalar<cdouble >(double val)
-    {
-        cdouble  cval = {val, 0};
-        return cval;
-    }
+template<typename T>
+__DH__ static T scalar(double val) {
+    return (T)(val);
+}
 
-    template<typename To, typename Ti> __DH__
-    static To scalar(Ti real, Ti imag)
-    {
-        To  cval = {real, imag};
-        return cval;
-    }
+template<>
+__DH__ inline cfloat scalar<cfloat>(double val) {
+    cfloat cval = {(float)val, 0};
+    return cval;
+}
+
+template<>
+__DH__ inline cdouble scalar<cdouble>(double val) {
+    cdouble cval = {val, 0};
+    return cval;
+}
+
+template<typename To, typename Ti>
+__DH__ static To scalar(Ti real, Ti imag) {
+    To cval = {real, imag};
+    return cval;
+}
 
 #ifndef __CUDA_ARCH__
-    template <typename T> T limit_max() { return std::numeric_limits<T>::max(); }
-    template <typename T> T limit_min() { return std::numeric_limits<T>::min(); }
+
+template<typename T>
+inline T maxval() {
+    AF_IF_CONSTEXPR(std::is_floating_point<T>::value && !fast_math) {
+        return std::numeric_limits<T>::infinity();
+    }
+    else { return std::numeric_limits<T>::max(); }
+}
+template<typename T>
+inline T minval() {
+    AF_IF_CONSTEXPR(std::is_floating_point<T>::value && !fast_math) {
+        return -std::numeric_limits<T>::infinity();
+    }
+    else { return std::numeric_limits<T>::lowest(); }
+}
 #else
-    template <typename T> __device__ T limit_max() { return 1u << (8 * sizeof(T) - 1); }
-    template <typename T> __device__ T limit_min() { return scalar<T>(0); }
-
-    template<> __device__  int    limit_max<int>()    { return 0x7fffffff; }
-    template<> __device__  int    limit_min<int>()    { return 0x80000000; }
-    template<> __device__  char   limit_max<char>()   { return 0x7f; }
-    template<> __device__  char   limit_min<char>()   { return 0x80; }
-    template<> __device__  float  limit_max<float>()  { return  CUDART_INF_F; }
-    template<> __device__  float  limit_min<float>()  { return -CUDART_INF_F; }
-    template<> __device__  double limit_max<double>() { return  CUDART_INF; }
-    template<> __device__  double limit_min<double>() { return -CUDART_INF; }
+template<typename T>
+inline __device__ T maxval() {
+    return 1u << (8 * sizeof(T) - 1);
+}
+template<typename T>
+inline __device__ T minval() {
+    return scalar<T>(0);
+}
+
+template<>
+inline __device__ int maxval<int>() {
+    return 0x7fffffff;
+}
+template<>
+inline __device__ int minval<int>() {
+    return 0x80000000;
+}
+template<>
+inline __device__ intl maxval<intl>() {
+    return 0x7fffffffffffffff;
+}
+template<>
+inline __device__ intl minval<intl>() {
+    return 0x8000000000000000;
+}
+template<>
+inline __device__ uintl maxval<uintl>() {
+    return 1ULL << (8 * sizeof(uintl) - 1);
+}
+template<>
+inline __device__ schar maxval<schar>() {
+    return 0x7f;
+}
+template<>
+inline __device__ schar minval<schar>() {
+    return 0x80;
+}
+template<>
+inline __device__ char maxval<char>() {
+    return 0x7f;
+}
+template<>
+inline __device__ char minval<char>() {
+    return 0x80;
+}
+template<>
+inline __device__ float maxval<float>() {
+    return CUDART_INF_F;
+}
+template<>
+inline __device__ float minval<float>() {
+    return -CUDART_INF_F;
+}
+template<>
+inline __device__ double maxval<double>() {
+    return CUDART_INF;
+}
+template<>
+inline __device__ double minval<double>() {
+    return -CUDART_INF;
+}
+template<>
+inline __device__ short maxval<short>() {
+    return 0x7fff;
+}
+template<>
+inline __device__ short minval<short>() {
+    return 0x8000;
+}
+template<>
+inline __device__ ushort maxval<ushort>() {
+    return ((ushort)1) << (8 * sizeof(ushort) - 1);
+}
+template<>
+inline __device__ common::half maxval<common::half>() {
+    return common::half(65537.f);
+}
+template<>
+inline __device__ common::half minval<common::half>() {
+    return common::half(-65537.f);
+}
+template<>
+inline __device__ __half maxval<__half>() {
+    return __float2half(CUDART_INF);
+}
+template<>
+inline __device__ __half minval<__half>() {
+    return __float2half(-CUDART_INF);
+}
 #endif
 
 #define upcast cuComplexFloatToDouble
 #define downcast cuComplexDoubleToFloat
 
 #ifdef __GNUC__
-//This suprresses unused function warnings in gcc
-//FIXME: Check if the warnings exist in other compilers
+// This suppresses unused function warnings in gcc
+// FIXME: Check if the warnings exist in other compilers
 #define __SDH__ static __DH__ __attribute__((unused))
 #else
 #define __SDH__ static __DH__
 #endif
-__SDH__ float  real(cfloat  c) { return cuCrealf(c); }
-__SDH__ double real(cdouble c) { return cuCreal(c);  }
+__SDH__ float real(cfloat c) { return cuCrealf(c); }
+__SDH__ double real(cdouble c) { return cuCreal(c); }
+
+__SDH__ float imag(cfloat c) { return cuCimagf(c); }
+__SDH__ double imag(cdouble c) { return cuCimag(c); }
 
-__SDH__ float  imag(cfloat  c) { return cuCimagf(c); }
-__SDH__ double imag(cdouble c) { return cuCimag(c);  }
+template<typename T>
+static inline __DH__ auto is_nan(const T &val) -> bool {
+    return false;
+}
 
-template<typename T> T
-__SDH__  conj(T  x) { return x; }
-__SDH__ cfloat  conj(cfloat  c) { return cuConjf(c);}
+template<>
+inline __DH__ auto is_nan<float>(const float &val) -> bool {
+    return ::isnan(val);
+}
+
+template<>
+inline __DH__ auto is_nan<double>(const double &val) -> bool {
+    return ::isnan(val);
+}
+
+#ifdef __CUDA_ARCH__
+template<>
+inline __device__ auto is_nan<__half>(const __half &val) -> bool {
+#if __CUDA_ARCH__ >= 530
+    return __hisnan(val);
+#else
+    return ::isnan(__half2float(val));
+#endif
+}
+#endif
+
+template<>
+inline auto is_nan<cfloat>(const cfloat &in) -> bool {
+    return ::isnan(real(in)) || ::isnan(imag(in));
+}
+
+template<>
+inline auto is_nan<cdouble>(const cdouble &in) -> bool {
+    return ::isnan(real(in)) || ::isnan(imag(in));
+}
+
+template<typename T>
+T __SDH__ conj(T x) {
+    return x;
+}
+__SDH__ cfloat conj(cfloat c) { return cuConjf(c); }
 __SDH__ cdouble conj(cdouble c) { return cuConj(c); }
 
-__SDH__ cfloat make_cfloat(bool     x) { return make_cuComplex(x,0);     }
-__SDH__ cfloat make_cfloat(int      x) { return make_cuComplex(x,0);     }
-__SDH__ cfloat make_cfloat(unsigned x) { return make_cuComplex(x,0);     }
-__SDH__ cfloat make_cfloat(float    x) { return make_cuComplex(x,0);     }
-__SDH__ cfloat make_cfloat(double   x) { return make_cuComplex(x,0);     }
-__SDH__ cfloat make_cfloat(cfloat   x) { return x;                    }
-__SDH__ cfloat make_cfloat(cdouble  c) { return make_cuComplex(c.x,c.y); }
-
-__SDH__ cdouble make_cdouble(bool      x) { return make_cuDoubleComplex(x,0);       }
-__SDH__ cdouble make_cdouble(int       x) { return make_cuDoubleComplex(x,0);       }
-__SDH__ cdouble make_cdouble(unsigned  x) { return make_cuDoubleComplex(x,0);       }
-__SDH__ cdouble make_cdouble(float     x) { return make_cuDoubleComplex(x,0);       }
-__SDH__ cdouble make_cdouble(double    x) { return make_cuDoubleComplex(x,0);       }
-__SDH__ cdouble make_cdouble(cdouble   x) { return x;                       }
-__SDH__ cdouble make_cdouble(cfloat    c) { return make_cuDoubleComplex(c.x,c.y);   }
+__SDH__ cfloat make_cfloat(bool x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(int x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(unsigned x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(short x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(ushort x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(float x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(double x) {
+    return make_cuComplex(static_cast<float>(x), 0);
+}
+__SDH__ cfloat make_cfloat(cfloat x) { return x; }
+__SDH__ cfloat make_cfloat(cdouble c) { return make_cuComplex(c.x, c.y); }
 
-__SDH__ cfloat make_cfloat(float x, float y) { return make_cuComplex(x, y); }
-__SDH__ cdouble make_cdouble(double x, double y) { return make_cuDoubleComplex(x, y); }
+__SDH__ cdouble make_cdouble(bool x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(int x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(unsigned x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(short x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(ushort x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(float x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(double x) {
+    return make_cuDoubleComplex(static_cast<double>(x), 0);
+}
+__SDH__ cdouble make_cdouble(cdouble x) { return x; }
+__SDH__ cdouble make_cdouble(cfloat c) {
+    return make_cuDoubleComplex(static_cast<double>(c.x), c.y);
+}
 
+__SDH__ cfloat make_cfloat(float x, float y) { return make_cuComplex(x, y); }
+__SDH__ cdouble make_cdouble(double x, double y) {
+    return make_cuDoubleComplex(x, y);
+}
 
-#define BINOP(OP, cfn, zfn)                                             \
-    __SDH__ cfloat   operator OP(cfloat  a, cfloat  b)                  \
-    { return cfn(a,b); }                                                \
-    __SDH__ cdouble  operator OP(cdouble a, cfloat  b)                  \
-    { return zfn(a,upcast(b)); }                                        \
-    __SDH__ cdouble  operator OP(cfloat  a, cdouble b)                  \
-    { return zfn(upcast(a),b); }                                        \
-    __SDH__ cdouble  operator OP(cdouble a, cdouble b)                  \
-    { return zfn(a,b); }                                                \
-                                                                        \
+#define BINOP(OP, cfn, zfn)                                              \
+    __SDH__ cfloat operator OP(cfloat a, cfloat b) { return cfn(a, b); } \
+    __SDH__ cdouble operator OP(cdouble a, cfloat b) {                   \
+        return zfn(a, upcast(b));                                        \
+    }                                                                    \
+    __SDH__ cdouble operator OP(cfloat a, cdouble b) {                   \
+        return zfn(upcast(a), b);                                        \
+    }                                                                    \
+    __SDH__ cdouble operator OP(cdouble a, cdouble b) { return zfn(a, b); }
 
-    BINOP(+, cuCaddf, cuCadd)
-    BINOP(-, cuCsubf, cuCsub)
-    BINOP(*, cuCmulf, cuCmul)
-    BINOP(/, cuCdivf, cuCdiv)
+BINOP(+, cuCaddf, cuCadd)
+BINOP(-, cuCsubf, cuCsub)
+BINOP(*, cuCmulf, cuCmul)
+BINOP(/, cuCdivf, cuCdiv)
 
 #undef BINOP
 
-#define BINOP_SCALAR(T, TR, R)                  \
-    __SDH__ R operator *(TR a, T b)             \
-    { return make_##R(a * b.x, a * b.y); }      \
-                                                \
-    __SDH__ R operator *(T a, TR b)             \
-    { return make_##R(a.x * b,  a.y * b); }     \
-                                                \
-    __SDH__ R operator +(TR a, T b)             \
-    { return make_##R(a + b.x, a + b.y); }      \
-                                                \
-    __SDH__ R operator +(T a, TR b)             \
-    { return make_##R(a.x + b,  a.y + b); }     \
-                                                \
-    __SDH__ R operator -(TR a, T b)             \
-    { return make_##R(a - b.x, a - b.y); }      \
-                                                \
-    __SDH__ R operator -(T a, TR b)             \
-    { return make_##R(a.x - b,  a.y - b); }     \
-                                                \
-    __SDH__ R operator /(T a, TR b)             \
-    { return make_##R(a.x / b, a.y / b); }      \
-                                                \
-    __SDH__ R operator /(TR a, T b)             \
-    { return make_##R(a) / b; }                 \
-                                                \
-
-    BINOP_SCALAR(cfloat, float, cfloat)
-    BINOP_SCALAR(cfloat, double, cdouble)
-    BINOP_SCALAR(cdouble, float, cdouble)
-    BINOP_SCALAR(cdouble, double, cdouble)
+#define BINOP_SCALAR(T, TR, R)                                            \
+    __SDH__ R operator*(TR a, T b) { return make_##R(a * b.x, a * b.y); } \
+                                                                          \
+    __SDH__ R operator*(T a, TR b) { return make_##R(a.x * b, a.y * b); } \
+                                                                          \
+    __SDH__ R operator+(TR a, T b) { return make_##R(a + b.x, a + b.y); } \
+                                                                          \
+    __SDH__ R operator+(T a, TR b) { return make_##R(a.x + b, a.y + b); } \
+                                                                          \
+    __SDH__ R operator-(TR a, T b) { return make_##R(a - b.x, a - b.y); } \
+                                                                          \
+    __SDH__ R operator-(T a, TR b) { return make_##R(a.x - b, a.y - b); } \
+                                                                          \
+    __SDH__ R operator/(T a, TR b) { return make_##R(a.x / b, a.y / b); } \
+                                                                          \
+    __SDH__ R operator/(TR a, T b) { return make_##R(a) / b; }
+
+BINOP_SCALAR(cfloat, float, cfloat)
+BINOP_SCALAR(cfloat, double, cdouble)
+BINOP_SCALAR(cdouble, float, cdouble)
+BINOP_SCALAR(cdouble, double, cdouble)
 
 #undef BINOP_SCALAR
 
-__SDH__ bool operator ==(cfloat a, cfloat b) { return (a.x == b.x) && (a.y == b.y); }
-__SDH__ bool operator !=(cfloat a, cfloat b) { return !(a == b); }
-__SDH__ bool operator ==(cdouble a, cdouble b) { return (a.x == b.x) && (a.y == b.y); }
-__SDH__ bool operator !=(cdouble a, cdouble b) { return !(a == b); }
+template<typename T>
+static inline T division(T lhs, double rhs) {
+    return lhs / rhs;
+}
+
+static inline cfloat division(cfloat lhs, double rhs) {
+    cfloat retVal;
+    retVal.x = real(lhs) / rhs;
+    retVal.y = imag(lhs) / rhs;
+    return retVal;
+}
+
+static inline cdouble division(cdouble lhs, double rhs) {
+    cdouble retVal;
+    retVal.x = real(lhs) / rhs;
+    retVal.y = imag(lhs) / rhs;
+    return retVal;
+}
+
+template<typename T, typename Compare>
+constexpr const __DH__ T clamp(const T value, const T lo, const T hi,
+                               Compare comp) {
+    return comp(value, lo) ? lo : comp(hi, value) ? hi : value;
+}
+
+template<typename T>
+constexpr const __DH__ T clamp(const T value, const T lo, const T hi) {
+    return clamp(value, lo, hi, [](auto lhs, auto rhs) { return lhs < rhs; });
+}
+
+#ifdef AF_WITH_FAST_MATH
+/// The single-precision pow function gives consistently wrong results
+/// when fast math is enabled, so this overload performs the operation
+/// in double precision instead
+__device__ inline double afpowf(double x, double y) { return pow(x, y); }
+#else
+/// Fast math is not enabled here, so this overload calls powf directly;
+/// the conversion to double above is only needed when fast math
+/// is used
+__device__ inline float afpowf(float x, float y) { return powf(x, y); }
+#endif
 
-    template<typename T> static inline T division(T lhs, double rhs) { return lhs / rhs; }
-    cfloat division(cfloat lhs, double rhs);
-    cdouble division(cdouble lhs, double rhs);
+}  // namespace cuda
+}  // namespace arrayfire
+
+__SDH__ bool operator==(arrayfire::cuda::cfloat a, arrayfire::cuda::cfloat b) {
+    return (a.x == b.x) && (a.y == b.y);
+}
+__SDH__ bool operator!=(arrayfire::cuda::cfloat a, arrayfire::cuda::cfloat b) {
+    return !(a == b);
+}
+__SDH__ bool operator==(arrayfire::cuda::cdouble a,
+                        arrayfire::cuda::cdouble b) {
+    return (a.x == b.x) && (a.y == b.y);
+}
+__SDH__ bool operator!=(arrayfire::cuda::cdouble a,
+                        arrayfire::cuda::cdouble b) {
+    return !(a == b);
 }
diff --git a/src/backend/cuda/max.cu b/src/backend/cuda/max.cu
index 68efee55f8..9fe7b92409 100644
--- a/src/backend/cuda/max.cu
+++ b/src/backend/cuda/max.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //max
-    INSTANTIATE(af_max_t, float  , float  )
-    INSTANTIATE(af_max_t, double , double )
-    INSTANTIATE(af_max_t, cfloat , cfloat )
-    INSTANTIATE(af_max_t, cdouble, cdouble)
-    INSTANTIATE(af_max_t, int    , int    )
-    INSTANTIATE(af_max_t, uint   , uint   )
-    INSTANTIATE(af_max_t, char   , char   )
-    INSTANTIATE(af_max_t, uchar  , uchar  )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// max
+INSTANTIATE(af_max_t, float, float)
+INSTANTIATE(af_max_t, double, double)
+INSTANTIATE(af_max_t, cfloat, cfloat)
+INSTANTIATE(af_max_t, cdouble, cdouble)
+INSTANTIATE(af_max_t, int, int)
+INSTANTIATE(af_max_t, uint, uint)
+INSTANTIATE(af_max_t, intl, intl)
+INSTANTIATE(af_max_t, uintl, uintl)
+INSTANTIATE(af_max_t, char, char)
+INSTANTIATE(af_max_t, schar, schar)
+INSTANTIATE(af_max_t, uchar, uchar)
+INSTANTIATE(af_max_t, short, short)
+INSTANTIATE(af_max_t, ushort, ushort)
+INSTANTIATE(af_max_t, half, half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/mean.cu b/src/backend/cuda/mean.cu
new file mode 100644
index 0000000000..b4dab3b866
--- /dev/null
+++ b/src/backend/cuda/mean.cu
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/dim4.hpp>
+
+#undef _GLIBCXX_USE_INT128
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/mean.hpp>
+#include <mean.hpp>
+#include <complex>
+
+using af::dim4;
+using arrayfire::common::half;
+using std::swap;
+namespace arrayfire {
+namespace cuda {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in) {
+    return kernel::mean_all<Ti, Tw, To>(in);
+}
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts) {
+    return kernel::mean_all_weighted<T, Tw>(in, wts);
+}
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::mean<Ti, Tw, To>(out, in, dim);
+    return out;
+}
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim) {
+    dim4 odims   = in.dims();
+    odims[dim]   = 1;
+    Array<T> out = createEmptyArray<T>(odims);
+    kernel::mean_weighted<T, Tw, T>(out, in, wts, dim);
+    return out;
+}
+
+#define INSTANTIATE(Ti, Tw, To)                        \
+    template To mean<Ti, Tw, To>(const Array<Ti>& in); \
+    template Array<To> mean<Ti, Tw, To>(const Array<Ti>& in, const int dim);
+
+INSTANTIATE(double, double, double);
+INSTANTIATE(float, float, float);
+INSTANTIATE(int, float, float);
+INSTANTIATE(unsigned, float, float);
+INSTANTIATE(intl, double, double);
+INSTANTIATE(uintl, double, double);
+INSTANTIATE(short, float, float);
+INSTANTIATE(ushort, float, float);
+INSTANTIATE(uchar, float, float);
+INSTANTIATE(schar, float, float);
+INSTANTIATE(char, float, float);
+INSTANTIATE(cfloat, float, cfloat);
+INSTANTIATE(cdouble, double, cdouble);
+INSTANTIATE(half, float, half);
+INSTANTIATE(half, float, float);
+
+#define INSTANTIATE_WGT(T, Tw)                                              \
+    template T mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts);       \
+    template Array<T> mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts, \
+                                  const int dim);
+
+INSTANTIATE_WGT(double, double);
+INSTANTIATE_WGT(float, float);
+INSTANTIATE_WGT(cfloat, float);
+INSTANTIATE_WGT(cdouble, double);
+INSTANTIATE_WGT(half, float);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/mean.hpp b/src/backend/cuda/mean.hpp
new file mode 100644
index 0000000000..af1810550c
--- /dev/null
+++ b/src/backend/cuda/mean.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in);
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts);
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim);
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/meanshift.cpp b/src/backend/cuda/meanshift.cpp
new file mode 100644
index 0000000000..83d12cb3ef
--- /dev/null
+++ b/src/backend/cuda/meanshift.cpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/meanshift.hpp>
+#include <meanshift.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor) {
+    const dim4 &dims = in.dims();
+    Array<T> out     = createEmptyArray<T>(dims);
+    kernel::meanshift<T>(out, in, spatialSigma, chromaticSigma, numIterations,
+                         isColor);
+    return out;
+}
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> meanshift<T>(const Array<T> &, const float &, \
+                                   const float &, const unsigned &, \
+                                   const bool &);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/meanshift.cu b/src/backend/cuda/meanshift.cu
deleted file mode 100644
index 0fa1ac3ca3..0000000000
--- a/src/backend/cuda/meanshift.cu
+++ /dev/null
@@ -1,46 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <meanshift.hpp>
-#include <kernel/meanshift.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T, bool is_color>
-Array<T> meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter)
-{
-    const dim4 dims = in.dims();
-
-    Array<T> out   = createEmptyArray<T>(dims);
-
-    kernel::meanshift<T, is_color>(out, in, s_sigma, c_sigma, iter);
-
-    return out;
-}
-
-#define INSTANTIATE(T) \
-    template Array<T> meanshift<T, true >(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter); \
-    template Array<T> meanshift<T, false>(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
-
-INSTANTIATE(float )
-INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
diff --git a/src/backend/cuda/meanshift.hpp b/src/backend/cuda/meanshift.hpp
index a12fe6a16e..267a978cb1 100644
--- a/src/backend/cuda/meanshift.hpp
+++ b/src/backend/cuda/meanshift.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
-
-template<typename T, bool is_color>
-Array<T> meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
-
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/medfilt.cpp b/src/backend/cuda/medfilt.cpp
new file mode 100644
index 0000000000..cca97dd644
--- /dev/null
+++ b/src/backend/cuda/medfilt.cpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <medfilt.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/medfilt.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType pad) {
+    ARG_ASSERT(2, (w_wid <= kernel::MAX_MEDFILTER1_LEN));
+    ARG_ASSERT(2, (w_wid % 2 != 0));
+
+    const dim4 &dims = in.dims();
+    Array<T> out     = createEmptyArray<T>(dims);
+
+    kernel::medfilt1<T>(out, in, pad, w_wid);
+
+    return out;
+}
+
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType pad) {
+    ARG_ASSERT(2, (w_len <= kernel::MAX_MEDFILTER2_LEN));
+    ARG_ASSERT(2, (w_len % 2 != 0));
+
+    const dim4 &dims = in.dims();
+    Array<T> out     = createEmptyArray<T>(dims);
+
+    kernel::medfilt2<T>(out, in, pad, w_len, w_wid);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> medfilt1<T>(const Array<T> &in, const int w_wid, \
+                                  const af::borderType);               \
+    template Array<T> medfilt2<T>(const Array<T> &in, const int w_len, \
+                                  const int w_wid, const af::borderType);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/medfilt.cu b/src/backend/cuda/medfilt.cu
deleted file mode 100644
index 9a99caea01..0000000000
--- a/src/backend/cuda/medfilt.cu
+++ /dev/null
@@ -1,48 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <medfilt.hpp>
-#include <kernel/medfilt.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T, af_border_type pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid)
-{
-    ARG_ASSERT(2, (w_len<=kernel::MAX_MEDFILTER_LEN));
-
-    const dim4 dims     = in.dims();
-
-    Array<T> out      = createEmptyArray<T>(dims);
-
-    kernel::medfilt<T, pad>(out, in, w_len, w_wid);
-
-    return out;
-}
-
-#define INSTANTIATE(T)\
-    template Array<T> medfilt<T, AF_PAD_ZERO     >(const Array<T> &in, dim_t w_len, dim_t w_wid); \
-    template Array<T> medfilt<T, AF_PAD_SYM>(const Array<T> &in, dim_t w_len, dim_t w_wid);
-
-INSTANTIATE(float )
-INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
diff --git a/src/backend/cuda/medfilt.hpp b/src/backend/cuda/medfilt.hpp
index f13935b794..e9bc1d2f2d 100644
--- a/src/backend/cuda/medfilt.hpp
+++ b/src/backend/cuda/medfilt.hpp
@@ -9,10 +9,16 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename T, af_border_type edge_pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid);
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType edge_pad);
 
-}
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType edge_pad);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/memory.cpp b/src/backend/cuda/memory.cpp
index 730dbed1b4..616547d6af 100644
--- a/src/backend/cuda/memory.cpp
+++ b/src/backend/cuda/memory.cpp
@@ -8,343 +8,185 @@
  ********************************************************/
 
 #include <memory.hpp>
+
+#include <Event.hpp>
+#include <common/Logger.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/util.hpp>
 #include <cuda.h>
-#include <cuda_runtime_api.h>
 #include <cuda_runtime.h>
+#include <cuda_runtime_api.h>
 #include <err_cuda.hpp>
-#include <types.hpp>
-#include <map>
-#include <dispatch.hpp>
 #include <platform.hpp>
+#include <spdlog/spdlog.h>
+#include <types.hpp>
+#include <af/dim4.hpp>
 
-namespace cuda
-{
-    static size_t memory_resolution = 1024; //1KB
-
-    void setMemStepSize(size_t step_bytes)
-    {
-        memory_resolution = step_bytes;
-    }
-
-    size_t getMemStepSize(void)
-    {
-        return memory_resolution;
-    }
-
-    template<typename T>
-    static void cudaFreeWrapper(T *ptr)
-    {
-        cudaError_t err = cudaFree(ptr);
-        if (err != cudaErrorCudartUnloading) // see issue #167
-            CUDA_CHECK(err);
-    }
-
-    template<typename T>
-    static void pinnedFreeWrapper(T *ptr)
-    {
-        cudaError_t err = cudaFreeHost(ptr);
-        if (err != cudaErrorCudartUnloading) // see issue #167
-            CUDA_CHECK(err);
-    }
-
-#ifdef AF_CUDA_MEM_DEBUG
-
-    template<typename T>
-    T* memAlloc(const size_t &elements)
-    {
-        T* ptr = NULL;
-        CUDA_CHECK(cudaMalloc(&ptr, elements * sizeof(T)));
-        return ptr;
-    }
-
-    template<typename T>
-    void memFree(T *ptr)
-    {
-        cudaFreeWrapper(ptr); // Free it because we are not sure what the size is
-    }
-
-    template<typename T>
-    T* pinnedAlloc(const size_t &elements)
-    {
-        T* ptr = NULL;
-        CUDA_CHECK(cudaMallocHost((void **)(&ptr), elements * sizeof(T)));
-        return (T*)ptr;
-    }
-
-    template<typename T>
-    void pinnedFree(T *ptr)
-    {
-        pinnedFreeWrapper(ptr); // Free it because we are not sure what the size is
-    }
-
-#else
-
-    // Manager Class
-    // Dummy used to call garbage collection at the end of the program
-    class Manager
-    {
-        public:
-        static bool initialized;
-        Manager()
-        {
-            initialized = true;
-        }
-
-        ~Manager()
-        {
-            for(int i = 0; i < getDeviceCount(); i++) {
-                setDevice(i);
-                garbageCollect();
-            }
-            pinnedGarbageCollect();
-        }
-    };
+#include <cstdlib>
+#include <mutex>
 
-    bool Manager::initialized = false;
+using af::dim4;
+using arrayfire::common::bytesToString;
+using arrayfire::common::half;
 
-    static void managerInit()
-    {
-        if(Manager::initialized == false)
-            static Manager pm = Manager();
-    }
+using std::move;
 
-    typedef struct
-    {
-        bool is_free;
-        bool is_unlinked;
-        size_t bytes;
-    } mem_info;
+namespace arrayfire {
+namespace cuda {
+float getMemoryPressure() { return memoryManager().getMemoryPressure(); }
+float getMemoryPressureThreshold() {
+    return memoryManager().getMemoryPressureThreshold();
+}
 
-    static size_t used_bytes[DeviceManager::MAX_DEVICES] = {0};
-    static size_t used_buffers[DeviceManager::MAX_DEVICES] = {0};
-    static size_t total_bytes[DeviceManager::MAX_DEVICES] = {0};
-    typedef std::map<void *, mem_info> mem_t;
-    typedef mem_t::iterator mem_iter;
+bool jitTreeExceedsMemoryPressure(size_t bytes) {
+    return memoryManager().jitTreeExceedsMemoryPressure(bytes);
+}
 
-    mem_t memory_maps[DeviceManager::MAX_DEVICES];
+void setMemStepSize(size_t step_bytes) {
+    memoryManager().setMemStepSize(step_bytes);
+}
 
-    void garbageCollect()
-    {
-        int n = getActiveDeviceId();
+size_t getMemStepSize() { return memoryManager().getMemStepSize(); }
 
-        for(mem_iter iter = memory_maps[n].begin();
-            iter != memory_maps[n].end(); ++iter) {
+void signalMemoryCleanup() { memoryManager().signalMemoryCleanup(); }
 
-            if ((iter->second).is_free) {
+void shutdownMemoryManager() { memoryManager().shutdown(); }
 
-                if (!(iter->second).is_unlinked) {
-                    cudaFreeWrapper(iter->first);
-                }
-                total_bytes[n] -= iter->second.bytes;
-            }
-        }
+void shutdownPinnedMemoryManager() { pinnedMemoryManager().shutdown(); }
 
-        mem_iter memory_curr = memory_maps[n].begin();
-        mem_iter memory_end  = memory_maps[n].end();
+void printMemInfo(const char *msg, const int device) {
+    memoryManager().printInfo(msg, device);
+}
 
-        while(memory_curr != memory_end) {
-            if (memory_curr->second.is_free) {
-                memory_maps[n].erase(memory_curr++);
-            } else {
-                ++memory_curr;
-            }
-        }
-    }
+template<typename T>
+uptr<T> memAlloc(const size_t &elements) {
+    // TODO: make memAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = memoryManager().alloc(false, 1, dims.get(), sizeof(T));
+    return uptr<T>(static_cast<T *>(ptr), memFree);
+}
 
-    template<typename T>
-    T* memAlloc(const size_t &elements)
-    {
-        managerInit();
-        int n = getActiveDeviceId();
-        T* ptr = NULL;
-        size_t alloc_bytes = divup(sizeof(T) * elements, memory_resolution) * memory_resolution;
-
-        if (elements > 0) {
-
-            // FIXME: Add better checks for garbage collection
-            // Perhaps look at total memory available as a metric
-            if (memory_maps[n].size() >= MAX_BUFFERS || used_bytes[n] >= MAX_BYTES) {
-                garbageCollect();
-            }
-
-            for(mem_iter iter = memory_maps[n].begin();
-                iter != memory_maps[n].end(); ++iter) {
-
-                mem_info info = iter->second;
-
-                if (  info.is_free &&
-                     !info.is_unlinked &&
-                      info.bytes == alloc_bytes) {
-
-                    iter->second.is_free = false;
-                    used_bytes[n] += alloc_bytes;
-                    used_buffers[n]++;
-                    return (T *)iter->first;
-                }
-            }
-
-            // Perform garbage collection if memory can not be allocated
-            if (cudaMalloc((void **)&ptr, alloc_bytes) != cudaSuccess) {
-                garbageCollect();
-                CUDA_CHECK(cudaMalloc((void **)(&ptr), alloc_bytes));
-            }
-
-            mem_info info = {false, false, alloc_bytes};
-            memory_maps[n][ptr] = info;
-            used_bytes[n] += alloc_bytes;
-            used_buffers[n]++;
-            total_bytes[n] += alloc_bytes;
-        }
-        return ptr;
-    }
+void *memAllocUser(const size_t &bytes) {
+    dim4 dims(bytes);
+    void *ptr = memoryManager().alloc(true, 1, dims.get(), 1);
+    return ptr;
+}
 
-    template<typename T>
-    void memFree(T *ptr)
-    {
-        int n = getActiveDeviceId();
-        mem_iter iter = memory_maps[n].find((void *)ptr);
+void memFree(void *ptr) { memoryManager().unlock(ptr, false); }
 
-        if (iter != memory_maps[n].end()) {
+void memFreeUser(void *ptr) { memoryManager().unlock(ptr, true); }
 
-            if ((iter->second).is_unlinked) return;
+void memLock(const void *ptr) {
+    memoryManager().userLock(const_cast<void *>(ptr));
+}
 
-            iter->second.is_free = true;
-            used_bytes[n] -= iter->second.bytes;
-            used_buffers[n]--;
+void memUnlock(const void *ptr) {
+    memoryManager().userUnlock(const_cast<void *>(ptr));
+}
 
-        } else {
-            cudaFreeWrapper(ptr); // Free it because we are not sure what the size is
-        }
-    }
+bool isLocked(const void *ptr) {
+    return memoryManager().isUserLocked(const_cast<void *>(ptr));
+}
 
-    template<typename T>
-    void memUnlink(T *ptr)
-    {
-        int n = getActiveDeviceId();
-        mem_iter iter = memory_maps[n].find((void *)ptr);
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers) {
+    memoryManager().usageInfo(alloc_bytes, alloc_buffers, lock_bytes,
+                              lock_buffers);
+}
 
-        if (iter != memory_maps[n].end()) {
+template<typename T>
+T *pinnedAlloc(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), sizeof(T));
+    return static_cast<T *>(ptr);
+}
 
-            iter->second.is_free = true;
-            iter->second.is_unlinked = true;
-            used_bytes[n] -= iter->second.bytes;
-            used_buffers[n]--;
+void pinnedFree(void *ptr) { pinnedMemoryManager().unlock(ptr, false); }
+
+#define INSTANTIATE(T)                                 \
+    template uptr<T> memAlloc(const size_t &elements); \
+    template T *pinnedAlloc(const size_t &elements);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+template<>
+void *pinnedAlloc<void>(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), 1);
+    return ptr;
+}
 
-        } else {
-            mem_info info = { false,
-                              false,
-                              100 }; //This number is not relevant
+Allocator::Allocator() { logger = common::loggerFactory("mem"); }
 
-            memory_maps[n][ptr] = info;
+void Allocator::shutdown() {
+    for (int n = 0; n < getDeviceCount(); n++) {
+        try {
+            setDevice(n);
+            shutdownMemoryManager();
+        } catch (const AfError &err) {
+            continue;  // Do not throw any errors while shutting down
         }
     }
+}
 
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers)
-    {
-        int n = getActiveDeviceId();
-        if (alloc_bytes   ) *alloc_bytes   = total_bytes[n];
-        if (alloc_buffers ) *alloc_buffers = memory_maps[n].size();
-        if (lock_bytes    ) *lock_bytes    = used_bytes[n];
-        if (lock_buffers  ) *lock_buffers  = used_buffers[n];
-    }
+int Allocator::getActiveDeviceId() { return cuda::getActiveDeviceId(); }
 
-    //////////////////////////////////////////////////////////////////////////////
-    mem_t pinned_maps;
-    static size_t pinned_used_bytes = 0;
+size_t Allocator::getMaxMemorySize(int id) { return getDeviceMemorySize(id); }
 
-    void pinnedGarbageCollect()
-    {
-        for(mem_iter iter = pinned_maps.begin(); iter != pinned_maps.end(); ++iter) {
-            if ((iter->second).is_free) {
-                pinnedFreeWrapper(iter->first);
-            }
-        }
+void *Allocator::nativeAlloc(const size_t bytes) {
+    void *ptr = NULL;
+    CUDA_CHECK(cudaMalloc(&ptr, bytes));
+    AF_TRACE("nativeAlloc: {:>7} {}", bytesToString(bytes), ptr);
+    return ptr;
+}
 
-        mem_iter memory_curr = pinned_maps.begin();
-        mem_iter memory_end  = pinned_maps.end();
+void Allocator::nativeFree(void *ptr) {
+    AF_TRACE("nativeFree:          {}", ptr);
+    cudaError_t err = cudaFree(ptr);
+    if (err != cudaErrorCudartUnloading) { CUDA_CHECK(err); }
+}
 
-        while(memory_curr != memory_end) {
-            if (memory_curr->second.is_free) {
-                pinned_maps.erase(memory_curr++);
-            } else {
-                ++memory_curr;
-            }
-        }
-    }
+AllocatorPinned::AllocatorPinned() { logger = common::loggerFactory("mem"); }
 
-    template<typename T>
-    T* pinnedAlloc(const size_t &elements)
-    {
-        managerInit();
-        T* ptr = NULL;
-        // Allocate the higher megabyte. Overhead of creating pinned memory is
-        // more so we want more resuable memory.
-        size_t alloc_bytes = divup(sizeof(T) * elements, 1048576) * 1048576;
-
-        if (elements > 0) {
-
-            // FIXME: Add better checks for garbage collection
-            // Perhaps look at total memory available as a metric
-            if (pinned_maps.size() >= MAX_BUFFERS || pinned_used_bytes >= MAX_BYTES) {
-                pinnedGarbageCollect();
-            }
-
-            for(mem_iter iter = pinned_maps.begin();
-                iter != pinned_maps.end(); ++iter) {
-
-                mem_info info = iter->second;
-                if (info.is_free && info.bytes == alloc_bytes) {
-                    iter->second.is_free = false;
-                    pinned_used_bytes += alloc_bytes;
-                    return (T *)iter->first;
-                }
-            }
-
-            // Perform garbage collection if memory can not be allocated
-            if (cudaMallocHost((void **)&ptr, alloc_bytes) != cudaSuccess) {
-                pinnedGarbageCollect();
-                CUDA_CHECK(cudaMallocHost((void **)(&ptr), alloc_bytes));
-            }
-
-            mem_info info = {false, false, alloc_bytes};
-            pinned_maps[ptr] = info;
-            pinned_used_bytes += alloc_bytes;
-        }
-        return (T*)ptr;
-    }
+void AllocatorPinned::shutdown() { shutdownPinnedMemoryManager(); }
 
-    template<typename T>
-    void pinnedFree(T *ptr)
-    {
-        mem_iter iter = pinned_maps.find((void *)ptr);
+int AllocatorPinned::getActiveDeviceId() {
+    return 0;  // pinned uses a single vector
+}
 
-        if (iter != pinned_maps.end()) {
-            iter->second.is_free = true;
-            pinned_used_bytes -= iter->second.bytes;
-        } else {
-            pinnedFreeWrapper(ptr); // Free it because we are not sure what the size is
-        }
-    }
+size_t AllocatorPinned::getMaxMemorySize(int id) {
+    UNUSED(id);
+    return getHostMemorySize();
+}
 
-#endif
-
-#define INSTANTIATE(T)                              \
-    template T* memAlloc(const size_t &elements);   \
-    template void memFree(T* ptr);                  \
-    template void memUnlink(T* ptr);                \
-    template T* pinnedAlloc(const size_t &elements);\
-    template void pinnedFree(T* ptr);               \
-
-    INSTANTIATE(float)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+void *AllocatorPinned::nativeAlloc(const size_t bytes) {
+    void *ptr;
+    CUDA_CHECK(cudaMallocHost(&ptr, bytes));
+    AF_TRACE("Pinned::nativeAlloc: {:>7} {}", bytesToString(bytes), ptr);
+    return ptr;
+}
 
+void AllocatorPinned::nativeFree(void *ptr) {
+    AF_TRACE("Pinned::nativeFree:          {}", ptr);
+    cudaError_t err = cudaFreeHost(ptr);
+    if (err != cudaErrorCudartUnloading) { CUDA_CHECK(err); }
 }
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/memory.hpp b/src/backend/cuda/memory.hpp
index 12fe452517..039879a90e 100644
--- a/src/backend/cuda/memory.hpp
+++ b/src/backend/cuda/memory.hpp
@@ -8,24 +8,80 @@
  ********************************************************/
 #pragma once
 
-#include <af/defines.h>
-namespace cuda
-{
-    template<typename T> T* memAlloc(const size_t &elements);
-    template<typename T> void memFree(T* ptr);
-    template<typename T> void memUnlink(T *ptr);
-
-    template<typename T> T* pinnedAlloc(const size_t &elements);
-    template<typename T> void pinnedFree(T* ptr);
-
-    static const unsigned MAX_BUFFERS   = 100;
-    static const unsigned MAX_BYTES     = (1 << 30);
-
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers);
-    void garbageCollect();
-    void pinnedGarbageCollect();
-
-    void setMemStepSize(size_t step_bytes);
-    size_t getMemStepSize(void);
-}
+#include <common/AllocatorInterface.hpp>
+
+#include <cstdlib>
+#include <functional>
+#include <memory>
+
+namespace arrayfire {
+namespace cuda {
+float getMemoryPressure();
+float getMemoryPressureThreshold();
+
+void memFree(void *ptr);
+
+template<typename T>
+using uptr = std::unique_ptr<T[], std::function<void(T[])>>;
+
+template<typename T>
+uptr<T> memAlloc(const size_t &elements);
+
+void *memAllocUser(const size_t &bytes);
+
+// These need to be two separate functions rather than one function with a
+// default argument, because the function is used as the deleter of a shared
+// pointer, which cannot supply default arguments.
+
+void memFreeUser(void *ptr);
+
+void memLock(const void *ptr);
+void memUnlock(const void *ptr);
+bool isLocked(const void *ptr);
+
+template<typename T>
+T *pinnedAlloc(const size_t &elements);
+void pinnedFree(void *ptr);
+
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers);
+void signalMemoryCleanup();
+void shutdownMemoryManager();
+void pinnedGarbageCollect();
+
+void printMemInfo(const char *msg, const int device);
+
+float getMemoryPressure();
+float getMemoryPressureThreshold();
+bool jitTreeExceedsMemoryPressure(size_t bytes);
+void setMemStepSize(size_t step_bytes);
+size_t getMemStepSize(void);
+
+class Allocator final : public arrayfire::common::AllocatorInterface {
+   public:
+    Allocator();
+    ~Allocator() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+};
+
+// CUDA pinned memory does not depend on the device.
+// We therefore pass 1 as numDevices to the constructor so that it creates a
+// single vector of memory_info.
+// When allocating and freeing, it does not matter which device is active.
+class AllocatorPinned final : public arrayfire::common::AllocatorInterface {
+   public:
+    AllocatorPinned();
+    ~AllocatorPinned() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
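Reviewer note on the `uptr` alias declared in `memory.hpp` above: a `std::unique_ptr` with a type-erased `std::function` deleter lets `memAlloc` return a buffer that hands itself back to the memory manager when it goes out of scope. The sketch below imitates that pattern with host memory only. `fakeAlloc`, `g_liveBuffers`, `demoUptr`, and the use of `std::malloc`/`std::free` are stand-ins assumed for illustration; they are not ArrayFire's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <memory>
#include <new>

// Same alias shape as in the patch.
template<typename T>
using uptr = std::unique_ptr<T[], std::function<void(T[])>>;

// Hypothetical counter standing in for the memory manager's bookkeeping.
static int g_liveBuffers = 0;

// Hypothetical allocator: host malloc stands in for a device allocation.
// Assumes elements > 0.
template<typename T>
uptr<T> fakeAlloc(const std::size_t elements) {
    T *raw = static_cast<T *>(std::malloc(elements * sizeof(T)));
    if (raw == nullptr) { throw std::bad_alloc(); }
    ++g_liveBuffers;
    // The deleter runs when the uptr is destroyed or reset.
    return uptr<T>(raw, [](T *p) {
        std::free(p);
        --g_liveBuffers;
    });
}

// Hypothetical usage: the buffer is released at the end of the inner scope.
inline void demoUptr() {
    {
        uptr<int> buf = fakeAlloc<int>(4);
        buf[0] = 42;
        assert(g_liveBuffers == 1);
    }
    assert(g_liveBuffers == 0);
}
```

The type-erased deleter costs a little more than a plain function pointer, but it keeps the alias independent of how the buffer is eventually released.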
diff --git a/src/backend/cuda/min.cu b/src/backend/cuda/min.cu
index 13b3596500..b0fad5733c 100644
--- a/src/backend/cuda/min.cu
+++ b/src/backend/cuda/min.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //min
-    INSTANTIATE(af_min_t, float  , float  )
-    INSTANTIATE(af_min_t, double , double )
-    INSTANTIATE(af_min_t, cfloat , cfloat )
-    INSTANTIATE(af_min_t, cdouble, cdouble)
-    INSTANTIATE(af_min_t, int    , int    )
-    INSTANTIATE(af_min_t, uint   , uint   )
-    INSTANTIATE(af_min_t, char   , char   )
-    INSTANTIATE(af_min_t, uchar  , uchar  )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// min
+INSTANTIATE(af_min_t, float, float)
+INSTANTIATE(af_min_t, double, double)
+INSTANTIATE(af_min_t, cfloat, cfloat)
+INSTANTIATE(af_min_t, cdouble, cdouble)
+INSTANTIATE(af_min_t, int, int)
+INSTANTIATE(af_min_t, uint, uint)
+INSTANTIATE(af_min_t, intl, intl)
+INSTANTIATE(af_min_t, uintl, uintl)
+INSTANTIATE(af_min_t, char, char)
+INSTANTIATE(af_min_t, schar, schar)
+INSTANTIATE(af_min_t, uchar, uchar)
+INSTANTIATE(af_min_t, short, short)
+INSTANTIATE(af_min_t, ushort, ushort)
+INSTANTIATE(af_min_t, half, half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/minmax_op.hpp b/src/backend/cuda/minmax_op.hpp
new file mode 100644
index 0000000000..a2b7149a07
--- /dev/null
+++ b/src/backend/cuda/minmax_op.hpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/Binary.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+static double cabs(const T &in) {
+    return (double)in;
+}
+
+template<>
+double cabs<char>(const char &in) {
+    return (double)(in > 0);
+}
+
+template<>
+double cabs<cfloat>(const cfloat &in) {
+    return (double)abs(in);
+}
+
+template<>
+double cabs<cdouble>(const cdouble &in) {
+    return (double)abs(in);
+}
+
+template<af_op_t op, typename T>
+struct MinMaxOp {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        using arrayfire::cuda::is_nan;
+        if (is_nan(val)) { m_val = common::Binary<compute_t<T>, op>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) < cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx > m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+template<typename T>
+struct MinMaxOp<af_max_t, T> {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        using arrayfire::cuda::is_nan;
+        if (is_nan(val)) { m_val = common::Binary<T, af_max_t>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) > cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx <= m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/moments.cpp b/src/backend/cuda/moments.cpp
new file mode 100644
index 0000000000..fa37b033e1
--- /dev/null
+++ b/src/backend/cuda/moments.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <moments.hpp>
+
+#include <Array.hpp>
+#include <debug_cuda.hpp>
+#include <err_cuda.hpp>
+#include <kernel/moments.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+static inline unsigned bitCount(unsigned v) {
+    v = v - ((v >> 1U) & 0x55555555U);
+    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
+    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
+}
+
+using af::dim4;
+
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment) {
+    in.eval();
+    dim4 odims, idims = in.dims();
+    dim_t moments_dim = bitCount(moment);
+
+    odims[0] = moments_dim;
+    odims[1] = 1;
+    odims[2] = idims[2];
+    odims[3] = idims[3];
+
+    Array<float> out = createValueArray<float>(odims, 0.f);
+    out.eval();
+
+    kernel::moments<T>(out, in, moment);
+    return out;
+}
+
+#define INSTANTIATE(T)                                   \
+    template Array<float> moments<T>(const Array<T> &in, \
+                                     const af_moment_type moment);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace cuda
+}  // namespace arrayfire
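Reviewer note on `bitCount` in `moments.cpp` above: the helper is the classic branch-free population count, and `moments()` uses it to turn the `af_moment_type` bit mask into the extent of the first output dimension. A minimal standalone sketch of the same bit trick follows, assuming a 32-bit `unsigned`. `bitCountNaive` and `checkBitCount` are hypothetical names added here only to cross-check it and are not part of the patch.

```cpp
#include <cassert>

// Same popcount as in the patch: sums bits in 2-, 4-, then 8-bit lanes;
// the final multiply gathers the four byte counts into the top byte.
inline unsigned bitCount(unsigned v) {
    v = v - ((v >> 1U) & 0x55555555U);
    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
}

// Hypothetical reference implementation: clears the lowest set bit per loop.
inline unsigned bitCountNaive(unsigned v) {
    unsigned n = 0;
    while (v != 0U) {
        v &= v - 1U;
        ++n;
    }
    return n;
}

// Hypothetical self-check comparing both implementations on a few masks.
inline void checkBitCount() {
    const unsigned samples[] = {0U, 1U, 3U, 0xFU, 0x80000000U, 0xFFFFFFFFU};
    for (unsigned s : samples) { assert(bitCount(s) == bitCountNaive(s)); }
}
```

For instance, a mask with four bits set, such as `0xF`, gives a first output dimension of 4.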
diff --git a/src/backend/cuda/moments.hpp b/src/backend/cuda/moments.hpp
new file mode 100644
index 0000000000..54791ac590
--- /dev/null
+++ b/src/backend/cuda/moments.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/morph.cpp b/src/backend/cuda/morph.cpp
new file mode 100644
index 0000000000..f09f20bded
--- /dev/null
+++ b/src/backend/cuda/morph.cpp
@@ -0,0 +1,62 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/morph.hpp>
+#include <morph.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    const dim4 mdims = mask.dims();
+    if (mdims[0] != mdims[1]) {
+        CUDA_NOT_SUPPORTED("Rectangular masks are not supported");
+    }
+    if (mdims[0] > 19) {
+        CUDA_NOT_SUPPORTED("Kernels > 19x19 are not supported");
+    }
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::morph<T>(out, in, mask, isDilation);
+    return out;
+}
+
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    const dim4 mdims = mask.dims();
+    if (mdims[0] != mdims[1] || mdims[0] != mdims[2]) {
+        CUDA_NOT_SUPPORTED("Only cubic masks are supported");
+    }
+    if (mdims[0] > 7) { CUDA_NOT_SUPPORTED("Kernels > 7x7x7 not supported"); }
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::morph3d<T>(out, in, mask, isDilation);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                    \
+    template Array<T> morph<T>(const Array<T> &, const Array<T> &, bool); \
+    template Array<T> morph3d<T>(const Array<T> &, const Array<T> &, bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/morph.hpp b/src/backend/cuda/morph.hpp
index d218577e8f..7b072ef669 100644
--- a/src/backend/cuda/morph.hpp
+++ b/src/backend/cuda/morph.hpp
@@ -9,13 +9,12 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation);
 
-template<typename T, bool isDilation>
-Array<T> morph(const Array<T> &in, const Array<T> &mask);
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask);
-
-}
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/morph3d_impl.hpp b/src/backend/cuda/morph3d_impl.hpp
deleted file mode 100644
index bd98ebd44b..0000000000
--- a/src/backend/cuda/morph3d_impl.hpp
+++ /dev/null
@@ -1,51 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <morph.hpp>
-#include <kernel/morph.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 mdims = mask.dims();
-
-    if (mdims[0] != mdims[1] || mdims[0] != mdims[2])
-        AF_ERROR("Only cube masks are supported in CUDA backend", AF_ERR_SIZE);
-
-    if (mdims[0] > 7)
-        AF_ERROR("Upto 7x7x7 kernels are only supported in CUDA backend", AF_ERR_SIZE);
-
-    Array<T> out       = createEmptyArray<T>(in.dims());
-
-    CUDA_CHECK(cudaMemcpyToSymbol(kernel::cFilter, mask.get(),
-                                  mdims[0] * mdims[1] *mdims[2] * sizeof(T),
-                                  0, cudaMemcpyDeviceToDevice));
-
-    if (isDilation)
-        kernel::morph3d<T, true >(out, in, mdims[0]);
-    else
-        kernel::morph3d<T, false>(out, in, mdims[0]);
-
-    return out;
-}
-
-}
-
-#define INSTANTIATE(T, ISDILATE)                                        \
-    template Array<T> morph3d<T, ISDILATE>(const Array<T> &in, const Array<T> &mask);
diff --git a/src/backend/cuda/morph_impl.hpp b/src/backend/cuda/morph_impl.hpp
deleted file mode 100644
index 0b5b653488..0000000000
--- a/src/backend/cuda/morph_impl.hpp
+++ /dev/null
@@ -1,50 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <morph.hpp>
-#include <kernel/morph.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T, bool isDilation>
-Array<T>  morph(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 mdims = mask.dims();
-
-    if (mdims[0] != mdims[1])
-        AF_ERROR("Only square masks are supported in cuda morph currently", AF_ERR_SIZE);
-    if (mdims[0] > 19)
-        AF_ERROR("Upto 19x19 square kernels are only supported in cuda currently", AF_ERR_SIZE);
-
-    Array<T> out = createEmptyArray<T>(in.dims());
-
-    CUDA_CHECK(cudaMemcpyToSymbol(kernel::cFilter, mask.get(),
-                                  mdims[0] * mdims[1] * sizeof(T),
-                                  0, cudaMemcpyDeviceToDevice));
-
-    if (isDilation)
-        kernel::morph<T, true >(out, in, mdims[0]);
-    else
-        kernel::morph<T, false>(out, in, mdims[0]);
-
-    return out;
-}
-
-}
-
-#define INSTANTIATE(T, ISDILATE)                                        \
-    template Array<T> morph  <T, ISDILATE>(const Array<T> &in, const Array<T> &mask);
diff --git a/src/backend/cuda/nearest_neighbour.cu b/src/backend/cuda/nearest_neighbour.cu
new file mode 100644
index 0000000000..dc10695f8a
--- /dev/null
+++ b/src/backend/cuda/nearest_neighbour.cu
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/nearest_neighbour.hpp>
+#include <math.hpp>
+#include <topk.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist, const af_match_type dist_type) {
+    uint sample_dim  = (dist_dim == 0) ? 1 : 0;
+    const dim4 qDims = query.dims();
+    const dim4 tDims = train.dims();
+
+    const dim4 outDims(n_dist, qDims[sample_dim]);
+    const dim4 distDims(tDims[sample_dim], qDims[sample_dim]);
+
+    Array<To> tmp_dists = createEmptyArray<To>(distDims);
+
+    idx  = createEmptyArray<uint>(outDims);
+    dist = createEmptyArray<To>(outDims);
+
+    Array<T> queryT = dist_dim == 0 ? transpose(query, false) : query;
+    Array<T> trainT = dist_dim == 0 ? transpose(train, false) : train;
+
+    switch (dist_type) {
+        case AF_SAD:
+            kernel::all_distances<T, To, AF_SAD>(tmp_dists, queryT, trainT, 1);
+            break;
+        case AF_SSD:
+            kernel::all_distances<T, To, AF_SSD>(tmp_dists, queryT, trainT, 1);
+            break;
+        case AF_SHD:
+            kernel::all_distances<T, To, AF_SHD>(tmp_dists, queryT, trainT, 1);
+            break;
+        default: AF_ERROR("Unsupported dist_type", AF_ERR_NOT_CONFIGURED);
+    }
+
+    topk(dist, idx, tmp_dists, n_dist, 0, AF_TOPK_MIN);
+}
+
+#define INSTANTIATE(T, To)                                             \
+    template void nearest_neighbour<T, To>(                            \
+        Array<uint> & idx, Array<To> & dist, const Array<T>& query,    \
+        const Array<T>& train, const uint dist_dim, const uint n_dist, \
+        const af_match_type dist_type);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, uint)
+INSTANTIATE(intl, intl)
+INSTANTIATE(uintl, uintl)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, uint)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, uint)
+
+INSTANTIATE(uintl, uint)  // For Hamming
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/nearest_neighbour.hpp b/src/backend/cuda/nearest_neighbour.hpp
new file mode 100644
index 0000000000..a1e8bd21bf
--- /dev/null
+++ b/src/backend/cuda/nearest_neighbour.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist,
+                       const af_match_type dist_type = AF_SSD);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/orb.cu b/src/backend/cuda/orb.cu
index b29c7affe7..83da734ce2 100644
--- a/src/backend/cuda/orb.cu
+++ b/src/backend/cuda/orb.cu
@@ -7,39 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
+#include <orb.hpp>
+
 #include <Array.hpp>
+#include <LookupTable1D.hpp>
 #include <err_cuda.hpp>
-#include <handle.hpp>
 #include <fast_pyramid.hpp>
 #include <kernel/orb.hpp>
 #include <kernel/orb_patch.hpp>
+#include <af/dim4.hpp>
+
+#include <type_traits>
 
 using af::dim4;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T, typename convAccT>
-unsigned orb(Array<float> &x, Array<float> &y,
-             Array<float> &score, Array<float> &ori,
-             Array<float> &size, Array<uint> &desc,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img)
-{
-    const dim4 dims = image.dims();
-
+unsigned orb(Array<float> &x, Array<float> &y, Array<float> &score,
+             Array<float> &ori, Array<float> &size, Array<uint> &desc,
+             const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img) {
     std::vector<unsigned> feat_pyr, lvl_best;
     std::vector<float> lvl_scl;
-    std::vector<float*> d_x_pyr, d_y_pyr;
-    std::vector<CParam<T> > img_pyr;
+    std::vector<Array<float>> x_pyr, y_pyr;
+    std::vector<Array<T>> img_pyr;
+
+    fast_pyramid<T>(feat_pyr, x_pyr, y_pyr, lvl_best, lvl_scl, img_pyr, image,
+                    fast_thr, max_feat, scl_fctr, levels, REF_PAT_SIZE);
 
-    fast_pyramid<T>(feat_pyr, d_x_pyr, d_y_pyr, lvl_best, lvl_scl, img_pyr,
-                    image, fast_thr, max_feat, scl_fctr, levels, REF_PAT_SIZE);
+    const size_t num_levels = feat_pyr.size();
+
+    std::vector<float *> d_x_pyr(num_levels, nullptr),
+        d_y_pyr(num_levels, nullptr);
+
+    for (size_t i = 0; i < feat_pyr.size(); ++i) {
+        if (feat_pyr[i] > 0) {
+            d_x_pyr[i] = static_cast<float *>(x_pyr[i].get());
+            d_y_pyr[i] = static_cast<float *>(y_pyr[i].get());
+        }
+    }
 
     unsigned nfeat_out;
     float *x_out;
@@ -49,14 +58,20 @@ unsigned orb(Array<float> &x, Array<float> &y,
     float *size_out;
     unsigned *desc_out;
 
-    kernel::orb<T, convAccT>(&nfeat_out, &x_out, &y_out, &score_out, &orientation_out, &size_out,
-                             &desc_out, feat_pyr, d_x_pyr, d_y_pyr, lvl_best, lvl_scl, img_pyr,
-                             fast_thr, max_feat, scl_fctr, levels, blur_img);
+    // TODO(pradeep) Figure out a better way to create lut Array only once
+    const Array<int> lut = createHostDataArray(
+        af::dim4(sizeof(d_ref_pat) / sizeof(int)), d_ref_pat);
 
-    if (nfeat_out > 0) {
+    LookupTable1D<int> orbLUT(lut);
+
+    kernel::orb<T, convAccT>(
+        &nfeat_out, &x_out, &y_out, &score_out, &orientation_out, &size_out,
+        &desc_out, feat_pyr, d_x_pyr, d_y_pyr, lvl_best, lvl_scl, img_pyr,
+        fast_thr, max_feat, scl_fctr, levels, blur_img, orbLUT);
 
-        if (x_out == NULL || y_out == NULL || score_out == NULL || orientation_out == NULL ||
-            size_out == NULL || desc_out == NULL) {
+    if (nfeat_out > 0) {
+        if (x_out == NULL || y_out == NULL || score_out == NULL ||
+            orientation_out == NULL || size_out == NULL || desc_out == NULL) {
             AF_ERROR("orb_descriptor: feature array is null.", AF_ERR_SIZE);
         }
 
@@ -69,22 +84,20 @@ unsigned orb(Array<float> &x, Array<float> &y,
         ori   = createDeviceDataArray<float>(feat_dims, orientation_out);
         size  = createDeviceDataArray<float>(feat_dims, size_out);
         desc  = createDeviceDataArray<unsigned>(desc_dims, desc_out);
-
     }
 
     return nfeat_out;
 }
 
-#define INSTANTIATE(T, convAccT)                                                        \
-    template unsigned orb<T, convAccT>(Array<float> &x, Array<float> &y,                \
-                                       Array<float> &score, Array<float> &ori,          \
-                                       Array<float> &size, Array<uint> &desc,           \
-                                       const Array<T>& image,                           \
-                                       const float fast_thr, const unsigned max_feat,   \
-                                       const float scl_fctr, const unsigned levels,     \
-                                       const bool blur_img);
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned orb<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,             \
+        Array<float> & ori, Array<float> & size, Array<uint> & desc,          \
+        const Array<T> &image, const float fast_thr, const unsigned max_feat, \
+        const float scl_fctr, const unsigned levels, const bool blur_img);
 
-INSTANTIATE(float , float )
+INSTANTIATE(float, float)
 INSTANTIATE(double, double)
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/orb.hpp b/src/backend/cuda/orb.hpp
index c0c61c906a..c40a1f9026 100644
--- a/src/backend/cuda/orb.hpp
+++ b/src/backend/cuda/orb.hpp
@@ -7,21 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <Array.hpp>
+#include <af/features.h>
 
 using af::features;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T, typename convAccT>
 unsigned orb(Array<float> &x, Array<float> &y, Array<float> &score,
              Array<float> &orientation, Array<float> &size,
-             Array<unsigned> &desc,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img);
+             Array<unsigned> &desc, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/pad_array_borders.cpp b/src/backend/cuda/pad_array_borders.cpp
new file mode 100644
index 0000000000..af563733d2
--- /dev/null
+++ b/src/backend/cuda/pad_array_borders.cpp
@@ -0,0 +1,58 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/pad_array_borders.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> padArrayBorders(Array<T> const& in, dim4 const& lowerBoundPadding,
+                         dim4 const& upperBoundPadding,
+                         const af::borderType btype) {
+    const dim4& iDims = in.dims();
+
+    dim4 oDims(lowerBoundPadding[0] + iDims[0] + upperBoundPadding[0],
+               lowerBoundPadding[1] + iDims[1] + upperBoundPadding[1],
+               lowerBoundPadding[2] + iDims[2] + upperBoundPadding[2],
+               lowerBoundPadding[3] + iDims[3] + upperBoundPadding[3]);
+
+    if (oDims == iDims) { return in; }
+
+    auto ret = createEmptyArray<T>(oDims);
+
+    kernel::padBorders<T>(ret, in, lowerBoundPadding, btype);
+
+    return ret;
+}
+
+#define INSTANTIATE_PAD_ARRAY_BORDERS(T)                               \
+    template Array<T> padArrayBorders<T>(Array<T> const&, dim4 const&, \
+                                         dim4 const&, const af::borderType);
+
+INSTANTIATE_PAD_ARRAY_BORDERS(cfloat)
+INSTANTIATE_PAD_ARRAY_BORDERS(cdouble)
+INSTANTIATE_PAD_ARRAY_BORDERS(float)
+INSTANTIATE_PAD_ARRAY_BORDERS(double)
+INSTANTIATE_PAD_ARRAY_BORDERS(int)
+INSTANTIATE_PAD_ARRAY_BORDERS(uint)
+INSTANTIATE_PAD_ARRAY_BORDERS(intl)
+INSTANTIATE_PAD_ARRAY_BORDERS(uintl)
+INSTANTIATE_PAD_ARRAY_BORDERS(schar)
+INSTANTIATE_PAD_ARRAY_BORDERS(uchar)
+INSTANTIATE_PAD_ARRAY_BORDERS(char)
+INSTANTIATE_PAD_ARRAY_BORDERS(ushort)
+INSTANTIATE_PAD_ARRAY_BORDERS(short)
+INSTANTIATE_PAD_ARRAY_BORDERS(common::half)
+}  // namespace cuda
+}  // namespace arrayfire
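Reviewer note on `padArrayBorders` above: the output shape is the input shape plus the lower and upper padding along each of the four axes, and the function returns the input unchanged when the shapes are equal. A standalone sketch of that shape arithmetic follows, with `Dims4`, `paddedDims`, and `demoPaddedDims` as hypothetical stand-ins for `af::dim4` and the inline computation in the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for af::dim4: four signed extents.
using Dims4 = std::array<long long, 4>;

// Per-axis output extent: lower padding + input extent + upper padding.
inline Dims4 paddedDims(const Dims4 &in, const Dims4 &lower,
                        const Dims4 &upper) {
    Dims4 out{};
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = lower[i] + in[i] + upper[i];
    }
    return out;
}

// Hypothetical usage: zero padding leaves the shape untouched, which is the
// case where the patched function returns its input directly.
inline void demoPaddedDims() {
    const Dims4 in{3, 3, 1, 1};
    const Dims4 none{0, 0, 0, 0};
    assert(paddedDims(in, none, none) == in);
}
```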
diff --git a/src/backend/cuda/platform.cpp b/src/backend/cuda/platform.cpp
index 59c47edb89..0de2451c4d 100644
--- a/src/backend/cuda/platform.cpp
+++ b/src/backend/cuda/platform.cpp
@@ -7,179 +7,253 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/version.h>
-#include <platform.hpp>
+#if defined(OS_WIN)
+#include <windows.h>
+#endif
+
+#ifdef WITH_CUDNN
+#include <cudnn.hpp>
+#include <cudnnModule.hpp>
+#endif
+
+#include <GraphicsResourceManager.hpp>
+#include <build_version.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/defines.hpp>
+#include <common/err_common.hpp>
+#include <common/graphics_common.hpp>
+#include <common/host_memory.hpp>
+#include <common/unique_handle.hpp>
+#include <common/util.hpp>
+#include <cublas.hpp>
+#include <cufft.hpp>
+#include <cusolverDn.hpp>
+#include <cusparse.hpp>
+#include <cusparseModule.hpp>
+#include <device_manager.hpp>
 #include <driver.h>
-#include <vector>
-#include <string>
-#include <algorithm>
-#include <sstream>
-#include <stdexcept>
-#include <cstdio>
-#include <cstring>
 #include <err_cuda.hpp>
+#include <memory.hpp>
+#include <spdlog/spdlog.h>
+#include <utility.hpp>
+#include <af/cuda.h>
+#include <af/device.h>
+#include <af/version.h>
 
-using namespace std;
-
-namespace cuda
-{
-
-///////////////////////////////////////////////////////////////////////////
-// HELPERS
-///////////////////////////////////////////////////////////////////////////
-// pulled from CUTIL from CUDA SDK
-static inline int compute2cores(int major, int minor)
-{
-    struct {
-        int compute; // 0xMm (hex), M = major version, m = minor version
-        int cores;
-    } gpus[] = {
-        { 0x10,  8 },
-        { 0x11,  8 },
-        { 0x12,  8 },
-        { 0x13,  8 },
-        { 0x20, 32 },
-        { 0x21, 48 },
-        { 0x30, 192 },
-        { 0x35, 192 },
-        { 0x50, 128 },
-        {   -1, -1  },
-    };
-
-    for (int i = 0; gpus[i].compute != -1; ++i) {
-        if (gpus[i].compute == (major << 4) + minor)
-            return gpus[i].cores;
-    }
-    return 0;
+#include <array>
+#include <cstdlib>
+#include <memory>
+#include <mutex>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <thread>
+#include <type_traits>
+
+using std::call_once;
+using std::make_unique;
+using std::once_flag;
+using std::ostringstream;
+using std::runtime_error;
+using std::string;
+using std::to_string;
+using std::unique_ptr;
+using std::vector;
+
+using arrayfire::common::getEnvVar;
+using arrayfire::common::int_version_to_string;
+using arrayfire::common::MemoryManagerBase;
+using arrayfire::common::unique_handle;
+using arrayfire::cuda::Allocator;
+using arrayfire::cuda::AllocatorPinned;
+
+namespace arrayfire {
+namespace cuda {
+
+static string get_system() {
+    string arch = (sizeof(void *) == 4) ? "32-bit " : "64-bit ";
+
+    return arch +
+#if defined(OS_LNX)
+           "Linux";
+#elif defined(OS_WIN)
+           "Windows";
+#elif defined(OS_MAC)
+           "Mac OSX";
+#endif
 }
 
-// compare two cards based on (in order):
-//   1. flops (theoretical)
-//   2. total memory
-
-#define COMPARE(a,b,f) do {                     \
-        return ((a)->f >= (b)->f);              \
-    } while (0);
-
-
-static inline bool card_compare_compute(const cudaDevice_t &l, const cudaDevice_t &r)
-{
-    const cudaDevice_t *lc = &l;
-    const cudaDevice_t *rc = &r;
+unique_handle<cublasHandle_t> *cublasManager(const int deviceId) {
+    thread_local unique_handle<cublasHandle_t>
+        handles[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+
+    call_once(initFlags[deviceId], [&] {
+        CUBLAS_CHECK((cublasStatus_t)handles[deviceId].create());
+        // TODO(pradeep) When multiple streams per device
+        // is added to CUDA backend, move the cublasSetStream
+        // call outside of call_once scope.
+        CUBLAS_CHECK(cublasSetStream(handles[deviceId], getStream(deviceId)));
+#ifdef AF_WITH_FAST_MATH
+        CUBLAS_CHECK(
+            cublasSetMathMode(handles[deviceId], CUBLAS_TF32_TENSOR_OP_MATH));
+        CUBLAS_CHECK(
+            cublasSetAtomicsMode(handles[deviceId], CUBLAS_ATOMICS_ALLOWED));
+#endif
+    });
 
-    COMPARE(lc, rc, prop.major);
-    COMPARE(lc, rc, prop.minor);
-    COMPARE(lc, rc, flops);
-    COMPARE(lc, rc, prop.totalGlobalMem);
-    COMPARE(lc, rc, nativeId);
-    return 0;
+    return &handles[deviceId];
 }
 
-static inline bool card_compare_flops(const cudaDevice_t &l, const cudaDevice_t &r)
-{
-    const cudaDevice_t *lc = &l;
-    const cudaDevice_t *rc = &r;
+#ifdef WITH_CUDNN
+unique_handle<cudnnHandle_t> *nnManager(const int deviceId) {
+    thread_local unique_handle<cudnnHandle_t>
+        cudnnHandles[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+
+    auto *handle        = &cudnnHandles[deviceId];
+    cudnnStatus_t error = CUDNN_STATUS_SUCCESS;
+    call_once(initFlags[deviceId], [handle, &error] {
+        auto getLogger = [&] { return spdlog::get("platform"); };
+        AF_TRACE("Initializing cuDNN");
+        error = static_cast<cudnnStatus_t>(handle->create());
+
+        // Not throwing an AF_ERROR here because we are in a lambda that could
+        // be executing on another thread.
+        if (!(*handle)) { getLogger()->error("Error initializing cuDNN"); }
+    });
+    if (error) {
+        string error_msg = fmt::format(
+            "Error initializing cuDNN({}): {}.",
+            static_cast<std::underlying_type<cudnnStatus_t>::type>(error),
+            errorString(error));
+        AF_ERROR(error_msg, AF_ERR_RUNTIME);
+    }
+    CUDNN_CHECK(getCudnnPlugin().cudnnSetStream(cudnnHandles[deviceId],
+                                                getStream(deviceId)));
 
-    COMPARE(lc, rc, flops);
-    COMPARE(lc, rc, prop.totalGlobalMem);
-    COMPARE(lc, rc, prop.major);
-    COMPARE(lc, rc, prop.minor);
-    COMPARE(lc, rc, nativeId);
-    return 0;
+    return handle;
 }
+#endif
 
-static inline bool card_compare_mem(const cudaDevice_t &l, const cudaDevice_t &r)
-{
-    const cudaDevice_t *lc = &l;
-    const cudaDevice_t *rc = &r;
-
-    COMPARE(lc, rc, prop.totalGlobalMem);
-    COMPARE(lc, rc, flops);
-    COMPARE(lc, rc, prop.major);
-    COMPARE(lc, rc, prop.minor);
-    COMPARE(lc, rc, nativeId);
-    return 0;
+unique_ptr<PlanCache> &cufftManager(const int deviceId) {
+    thread_local unique_ptr<PlanCache> caches[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+    call_once(initFlags[deviceId],
+              [&] { caches[deviceId] = make_unique<PlanCache>(); });
+    return caches[deviceId];
 }
 
-static inline bool card_compare_num(const cudaDevice_t &l, const cudaDevice_t &r)
-{
-    const cudaDevice_t *lc = &l;
-    const cudaDevice_t *rc = &r;
-
-    COMPARE(lc, rc, nativeId);
-    return 0;
+unique_handle<cusolverDnHandle_t> *cusolverManager(const int deviceId) {
+    thread_local unique_handle<cusolverDnHandle_t>
+        handles[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+    call_once(initFlags[deviceId], [&] {
+        handles[deviceId].create();
+        // TODO(pradeep) When multiple streams per device
+        // are added to CUDA backend, move the cusolverDnSetStream
+        // call outside of call_once scope.
+        CUSOLVER_CHECK(
+            cusolverDnSetStream(handles[deviceId], getStream(deviceId)));
+    });
+    // TODO(pradeep) prior to this change, stream was being synced in get solver
+    // handle because of some cusolver bug. Re-enable that if this change
+    // doesn't work and solver tests fail.
+    // https://gist.github.com/shehzan10/414c3d04a40e7c4a03ed3c2e1b9072e7
+    // cuSolver Streams patch:
+    // CUDA_CHECK(cudaStreamSynchronize(getStream(deviceId)));
+
+    return &handles[deviceId];
 }
 
-static const char *get_system(void)
-{
-    return
-#if defined(ARCH_32)
-    "32-bit "
-#elif defined(ARCH_64)
-    "64-bit "
-#endif
+unique_handle<cusparseHandle_t> *cusparseManager(const int deviceId) {
+    thread_local unique_handle<cusparseHandle_t>
+        handles[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+    call_once(initFlags[deviceId], [&] {
+        auto &_ = getCusparsePlugin();
+        handles[deviceId].create();
+        // TODO(pradeep) When multiple streams per device
+        // are added to CUDA backend, move the cusparseSetStream
+        // call outside of call_once scope.
+        CUSPARSE_CHECK(
+            _.cusparseSetStream(handles[deviceId], getStream(deviceId)));
+    });
+    return &handles[deviceId];
+}
 
-#if defined(OS_LNX)
-    "Linux";
-#elif defined(OS_WIN)
-    "Windows";
-#elif defined(OS_MAC)
-    "Mac OSX";
+DeviceManager::~DeviceManager() {
+    try {
+        // Reset the cuSOLVER, cuSPARSE, cuFFT, cuBLAS, and (when enabled)
+        // cuDNN handles of all devices
+        for (int i = 0; i < nDevices; ++i) {
+            setDevice(i);
+            cusolverManager(i)->reset();
+            cusparseManager(i)->reset();
+            cufftManager(i).reset();
+            cublasManager(i)->reset();
+#ifdef WITH_CUDNN
+            nnManager(i)->reset();
 #endif
+        }
+    } catch (const AfError &err) {
+        AF_TRACE(
+            "Exception thrown during destruction of DeviceManager(ignoring). "
+            "{}({}):{} "
+            "{}",
+            err.getFileName(), err.getLine(), err.getFunctionName(),
+            err.what());
+    } catch (...) {
+        AF_TRACE(
+            "Unknown exception thrown during destruction of "
+            "DeviceManager(ignoring)");
+    }
 }
 
-template <typename T>
-static inline string toString(T val)
-{
-    stringstream s;
-    s << val;
-    return s.str();
+bool isDeviceBufferAccessible(int buf_device_id, int execution_id) {
+    DeviceManager &mngr = DeviceManager::getInstance();
+    return buf_device_id == execution_id ||
+           mngr.device_peer_access_map[buf_device_id][execution_id];
 }
 
-///////////////////////////////////////////////////////////////////////////
-// Wrapper Functions
-///////////////////////////////////////////////////////////////////////////
-string getInfo()
-{
-    ostringstream info;
-    info << "ArrayFire v" << AF_VERSION
-         << " (CUDA, " << get_system() << ", build " << AF_REVISION << ")" << std::endl;
-    info << getPlatformInfo();
-    for (int i = 0; i < getDeviceCount(); ++i) {
-        info << getDeviceInfo(i);
-    }
-    return info.str();
-}
+int getBackend() { return AF_BACKEND_CUDA; }
 
-string getDeviceInfo(int device)
-{
-    cudaDeviceProp dev = getDeviceProp(device);
+string getDeviceInfo(int device) noexcept {
+    const cudaDeviceProp &dev = getDeviceProp(device);
 
     size_t mem_gpu_total = dev.totalGlobalMem;
-    //double cc = double(dev.major) + double(dev.minor) / 10;
+    // double cc = double(dev.major) + double(dev.minor) / 10;
 
     bool show_braces = getActiveDeviceId() == device;
 
-    string id = (show_braces ? string("[") : "-") + toString(device) +
+    string id = (show_braces ? string("[") : "-") + to_string(device) +
                 (show_braces ? string("]") : "-");
     string name(dev.name);
-    string memory = toString((mem_gpu_total / (1024 * 1024))
-                          + !!(mem_gpu_total % (1024 * 1024)))
-                    + string(" MB");
-    string compute = string("CUDA Compute ") + toString(dev.major) + string(".") + toString(dev.minor);
-
-    string info = id + string(" ")  +
-                name + string(", ") +
-              memory + string(", ") +
-             compute + string("\n");
+    string memory = to_string((mem_gpu_total / (1024 * 1024)) +
+                              !!(mem_gpu_total % (1024 * 1024))) +
+                    string(" MB");
+    string compute = string("CUDA Compute ") + to_string(dev.major) +
+                     string(".") + to_string(dev.minor);
+
+    string info = id + string(" ") + name + string(", ") + memory +
+                  string(", ") + compute + string("\n");
     return info;
 }
 
-string getPlatformInfo()
-{
+string getDeviceInfo() noexcept {
+    ostringstream info;
+    info << "ArrayFire v" << AF_VERSION << " (CUDA, " << get_system()
+         << ", build " << AF_REVISION << ")\n";
+    info << getPlatformInfo();
+    for (int i = 0; i < getDeviceCount(); ++i) { info << getDeviceInfo(i); }
+    return info.str();
+}
+
+string getPlatformInfo() noexcept {
     string driverVersion = getDriverVersion();
-    std::string cudaRuntime = getCUDARuntimeVersion();
-    string platform = "Platform: CUDA Toolkit " + cudaRuntime;
+    string cudaRuntime   = getCUDARuntimeVersion();
+    string platform      = "Platform: CUDA Runtime " + cudaRuntime;
     if (!driverVersion.empty()) {
         platform.append(", Driver: ");
         platform.append(driverVersion);
@@ -188,25 +262,35 @@ string getPlatformInfo()
     return platform;
 }
 
-bool isDoubleSupported(int device)
-{
+bool isDoubleSupported(int device) noexcept {
+    UNUSED(device);
     return true;
 }
 
-void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-{
-    if (getDeviceCount() <= 0) {
-        printf("No CUDA-capable devices detected.\n");
-        return;
-    }
+bool isHalfSupported(int device) {
+    static std::array<bool, DeviceManager::MAX_DEVICES> half_supported = []() {
+        std::array<bool, DeviceManager::MAX_DEVICES> out{};
+        int count = getDeviceCount();
+        for (int i = 0; i < count; i++) {
+            const auto &prop = getDeviceProp(i);
+            int compute      = prop.major * 1000 + prop.minor * 10;
+            out[i]           = compute >= 5030;
+        }
+        return out;
+    }();
+    return half_supported[device];
+}
+
+void devprop(char *d_name, char *d_platform, char *d_toolkit, char *d_compute) {
+    if (getDeviceCount() <= 0) { return; }
 
-    cudaDeviceProp dev = getDeviceProp(getActiveDeviceId());
+    const cudaDeviceProp &dev = getDeviceProp(getActiveDeviceId());
 
     // Name
-    snprintf(d_name, 32, "%s", dev.name);
+    snprintf(d_name, 256, "%s", dev.name);
 
-    //Platform
-    std::string cudaRuntime = getCUDARuntimeVersion();
+    // Platform
+    string cudaRuntime = getCUDARuntimeVersion();
     snprintf(d_platform, 10, "CUDA");
     snprintf(d_toolkit, 64, "v%s", cudaRuntime.c_str());
 
@@ -214,152 +298,309 @@ void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
     snprintf(d_compute, 10, "%d.%d", dev.major, dev.minor);
 
     // Sanitize input
-    for (int i = 0; i < 31; i++) {
+    for (int i = 0; i < 255; i++) {
         if (d_name[i] == ' ') {
-            if (d_name[i + 1] == 0 || d_name[i + 1] == ' ') d_name[i] = 0;
-            else d_name[i] = '_';
+            if (d_name[i + 1] == 0 || d_name[i + 1] == ' ') {
+                d_name[i] = 0;
+            } else {
+                d_name[i] = '_';
+            }
         }
     }
 }
 
-string getDriverVersion()
-{
-    char driverVersion[1024] = {" ",};
+string getDriverVersion() noexcept {
+    char driverVersion[1024] = {" "};
     int x = nvDriverVersion(driverVersion, sizeof(driverVersion));
     if (x != 1) {
-        #if !defined(OS_MAC) && !defined(__arm__)  // HACK Mac OSX 10.7 needs new method for fetching driver
-        throw runtime_error("Invalid driver");
-        #endif
+// Windows, OSX, and Tegra need a new way to fetch the driver version
+#if !defined(OS_WIN) && !defined(OS_MAC) && !defined(__arm__) && \
+    !defined(__aarch64__)
+        return "N/A";
+#endif
         int driver = 0;
-        CUDA_CHECK(cudaDriverGetVersion(&driver));
-        return string("CUDA Driver Version: ") + toString(driver);
+        if (cudaDriverGetVersion(&driver)) { return "N/A"; }
+        return to_string(driver);
     } else {
         return string(driverVersion);
     }
 }
 
-string getCUDARuntimeVersion()
-{
+string getCUDARuntimeVersion() noexcept {
     int runtime = 0;
-    CUDA_CHECK(cudaRuntimeGetVersion(&runtime));
-    if(runtime / 100.f > 0)
-        return toString((runtime / 1000) + (runtime % 1000)/ 100.);
-    else
-        return toString(runtime / 1000) + string(".0");
+    if (cudaSuccess == cudaRuntimeGetVersion(&runtime)) {
+        return int_version_to_string(runtime);
+    } else {
+        return int_version_to_string(CUDA_VERSION);
+    }
+}
+
+int &getMaxJitSize() {
+    constexpr int MAX_JIT_LEN = 100;
+    thread_local int length   = 0;
+    if (length <= 0) {
+        string env_var = getEnvVar("AF_CUDA_MAX_JIT_LEN");
+        if (!env_var.empty()) {
+            int input_len = stoi(env_var);
+            length        = input_len > 0 ? input_len : MAX_JIT_LEN;
+        } else {
+            length = MAX_JIT_LEN;
+        }
+    }
 
+    return length;
 }
 
-int getDeviceCount()
-{
-    return DeviceManager::getInstance().nDevices;
+int &tlocalActiveDeviceId() {
+    thread_local int activeDeviceId = 0;
+
+    return activeDeviceId;
 }
 
-int getActiveDeviceId()
-{
-    return DeviceManager::getInstance().activeDev;
+int getDeviceCount() {
+    int count = 0;
+    if (cudaGetDeviceCount(&count)) {
+        return 0;
+    } else {
+        return count;
+    }
+}
+
+void init() {
+    thread_local auto err =
+        cudaSetDevice(getDeviceNativeId(getActiveDeviceId()));
+    thread_local auto queue2 = getActiveStream();
+    UNUSED(err);
+    UNUSED(queue2);
 }
 
-int getDeviceNativeId(int device)
-{
-    if(device < (int)DeviceManager::getInstance().cuDevices.size())
+int getActiveDeviceId() { return tlocalActiveDeviceId(); }
+
+int getDeviceNativeId(int device) {
+    if (device <
+        static_cast<int>(DeviceManager::getInstance().cuDevices.size())) {
         return DeviceManager::getInstance().cuDevices[device].nativeId;
+    }
     return -1;
 }
 
-int setDevice(int device)
-{
+int getDeviceIdFromNativeId(int nativeId) {
+    DeviceManager &mngr = DeviceManager::getInstance();
+
+    int devId = 0;
+    for (devId = 0; devId < mngr.nDevices; ++devId) {
+        if (nativeId == mngr.cuDevices[devId].nativeId) { break; }
+    }
+    return devId;
+}
+
+cudaStream_t getStream(int device) {
+    static once_flag streamInitFlags[DeviceManager::MAX_DEVICES];
+
+    call_once(streamInitFlags[device], [device]() {
+        DeviceManager &inst = DeviceManager::getInstance();
+        CUDA_CHECK(cudaStreamCreate(&(inst.streams[device])));
+    });
+
+    return DeviceManager::getInstance().streams[device];
+}
+
+cudaStream_t getActiveStream() { return getStream(getActiveDeviceId()); }
+
+cudaStream_t getQueueHandle(int device) { return getStream(device); }
+
+size_t getDeviceMemorySize(int device) {
+    return getDeviceProp(device).totalGlobalMem;
+}
+
+size_t getHostMemorySize() { return common::getHostMemorySize(); }
+
+int setDevice(int device) {
     return DeviceManager::getInstance().setActiveDevice(device);
 }
 
-cudaDeviceProp getDeviceProp(int device)
-{
-    if(device < (int)DeviceManager::getInstance().cuDevices.size())
-        return DeviceManager::getInstance().cuDevices[device].prop;
-    return DeviceManager::getInstance().cuDevices[0].prop;
+size_t getL2CacheSize(const int device) {
+    return getDeviceProp(device).l2CacheSize;
+}
+
+const int *getMaxGridSize(const int device) {
+    return getDeviceProp(device).maxGridSize;
 }
 
-///////////////////////////////////////////////////////////////////////////
-// DeviceManager Class Functions
-///////////////////////////////////////////////////////////////////////////
-DeviceManager& DeviceManager::getInstance()
-{
-    static DeviceManager my_instance;
-    return my_instance;
+unsigned getMemoryBusWidth(const int device) {
+    return getDeviceProp(device).memoryBusWidth;
 }
 
-DeviceManager::DeviceManager()
-    : cuDevices(0), activeDev(0), nDevices(0)
-{
-    CUDA_CHECK(cudaGetDeviceCount(&nDevices));
-    if (nDevices == 0)
-        throw runtime_error("No CUDA-Capable devices found");
+unsigned getMultiProcessorCount(const int device) {
+    return getDeviceProp(device).multiProcessorCount;
+}
 
-    cuDevices.reserve(nDevices);
+unsigned getMaxParallelThreads(const int device) {
+    const cudaDeviceProp &prop{getDeviceProp(device)};
+    return prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
+}
 
-    for(int i = 0; i < nDevices; i++) {
-        cudaDevice_t dev;
-        cudaGetDeviceProperties(&dev.prop, i);
-        dev.flops = dev.prop.multiProcessorCount * compute2cores(dev.prop.major, dev.prop.minor) * dev.prop.clockRate;
-        dev.nativeId = i;
-        cuDevices.push_back(dev);
-    }
+const cudaDeviceProp &getDeviceProp(const int device) {
+    const vector<cudaDevice_t> &devs = DeviceManager::getInstance().cuDevices;
+    if (device < static_cast<int>(devs.size())) { return devs[device].prop; }
+    return devs[0].prop;
+}
 
-    sortDevices();
+MemoryManagerBase &memoryManager() {
+    static once_flag flag;
 
-    const char* deviceENV = getenv("AF_CUDA_DEFAULT_DEVICE");
-    if(!deviceENV) {
-        setActiveDevice(0, cuDevices[0].nativeId);
-    } else {
-        stringstream s(deviceENV);
-        int def_device = -1;
-        s >> def_device;
-        if(def_device < 0 || def_device >= nDevices) {
-            printf("WARNING: AF_CUDA_DEFAULT_DEVICE is out of range\n");
-            printf("Setting default device as 0\n");
-            setActiveDevice(0, cuDevices[0].nativeId);
-        } else {
-            setActiveDevice(def_device, cuDevices[def_device].nativeId);
-        }
-    }
+    DeviceManager &inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.memManager = make_unique<common::DefaultMemoryManager>(
+            getDeviceCount(), common::MAX_BUFFERS,
+            AF_MEM_DEBUG || AF_CUDA_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<Allocator> deviceMemoryManager(new Allocator());
+        inst.memManager->setAllocator(move(deviceMemoryManager));
+        inst.memManager->initialize();
+    });
+
+    return *(inst.memManager.get());
 }
 
-void DeviceManager::sortDevices(sort_mode mode)
-{
-    switch(mode) {
-        case memory :
-            sort(cuDevices.begin(), cuDevices.end(), card_compare_mem);
-            break;
-        case flops :
-            sort(cuDevices.begin(), cuDevices.end(), card_compare_flops);
-            break;
-        case compute :
-            sort(cuDevices.begin(), cuDevices.end(), card_compare_compute);
-            break;
-        case none : default :
-            sort(cuDevices.begin(), cuDevices.end(), card_compare_num);
-            break;
-    }
+MemoryManagerBase &pinnedMemoryManager() {
+    static once_flag flag;
+
+    DeviceManager &inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.pinnedMemManager = make_unique<common::DefaultMemoryManager>(
+            1, common::MAX_BUFFERS, AF_MEM_DEBUG || AF_CUDA_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<AllocatorPinned> deviceMemoryManager(new AllocatorPinned());
+        inst.pinnedMemManager->setAllocator(move(deviceMemoryManager));
+        inst.pinnedMemManager->initialize();
+    });
+
+    return *(inst.pinnedMemManager.get());
+}
+
+void setMemoryManager(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManager(move(mgr));
+}
+
+void resetMemoryManager() {
+    return DeviceManager::getInstance().resetMemoryManager();
 }
 
-int DeviceManager::setActiveDevice(int device, int nId)
-{
-    if(device > (int)cuDevices.size()) {
-        return -1;
+void setMemoryManagerPinned(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManagerPinned(move(mgr));
+}
+
+void resetMemoryManagerPinned() {
+    return DeviceManager::getInstance().resetMemoryManagerPinned();
+}
+
+arrayfire::common::ForgeManager &forgeManager() {
+    return *(DeviceManager::getInstance().fgMngr);
+}
+
+GraphicsResourceManager &interopManager() {
+    static once_flag initFlags[DeviceManager::MAX_DEVICES];
+
+    int id = getActiveDeviceId();
+
+    DeviceManager &inst = DeviceManager::getInstance();
+
+    call_once(initFlags[id], [&] {
+        inst.gfxManagers[id] = make_unique<GraphicsResourceManager>();
+    });
+
+    return *(inst.gfxManagers[id].get());
+}
+
+PlanCache &fftManager() { return *(cufftManager(getActiveDeviceId()).get()); }
+
+BlasHandle blasHandle() { return *cublasManager(getActiveDeviceId()); }
+
+#ifdef WITH_CUDNN
+cudnnHandle_t nnHandle() {
+    // Keep the getCudnnPlugin call here because module loading can throw an
+    // exception the first time it is called. We want to avoid that because
+    // the unique handle object is marked noexcept and could terminate if
+    // the module is not loaded correctly.
+    static cudnnModule keep_me_to_avoid_exceptions_exceptions =
+        getCudnnPlugin();
+    static unique_handle<cudnnHandle_t> *handle =
+        nnManager(getActiveDeviceId());
+    if (*handle) {
+        return *handle;
     } else {
-        int old = activeDev;
-        if(nId == -1) nId = getDeviceNativeId(device);
-        CUDA_CHECK(cudaSetDevice(nId));
-        activeDev = device;
-        return old;
+        AF_ERROR("Error Initializing cuDNN\n", AF_ERR_RUNTIME);
     }
 }
+#endif
 
-void sync(int device)
-{
+SolveHandle solverDnHandle() { return *cusolverManager(getActiveDeviceId()); }
+
+SparseHandle sparseHandle() { return *cusparseManager(getActiveDeviceId()); }
+
+void sync(int device) {
     int currDevice = getActiveDeviceId();
     setDevice(device);
-    CUDA_CHECK(cudaDeviceSynchronize());
+    CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
     setDevice(currDevice);
 }
 
+bool synchronize_calls() {
+    static const bool sync = getEnvVar("AF_SYNCHRONOUS_CALLS") == "1";
+    return sync;
+}
+
+bool &evalFlag() {
+    thread_local bool flag = true;
+    return flag;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
+
+af_err afcu_get_stream(cudaStream_t *stream, int id) {
+    try {
+        *stream = arrayfire::cuda::getStream(id);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcu_get_native_id(int *nativeid, int id) {
+    try {
+        *nativeid = arrayfire::cuda::getDeviceNativeId(id);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcu_set_native_id(int nativeid) {
+    try {
+        arrayfire::cuda::setDevice(
+            arrayfire::cuda::getDeviceIdFromNativeId(nativeid));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcu_cublasSetMathMode(cublasMath_t mode) {
+    try {
+        CUBLAS_CHECK(cublasSetMathMode(arrayfire::cuda::blasHandle(), mode));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+namespace af {
+template<>
+__half *array::device<__half>() const {
+    void *ptr = NULL;
+    af_get_device_ptr(&ptr, get());
+    return static_cast<__half *>(ptr);
 }
+}  // namespace af
diff --git a/src/backend/cuda/platform.hpp b/src/backend/cuda/platform.hpp
index de63c03ab5..be9f0b9996 100644
--- a/src/backend/cuda/platform.hpp
+++ b/src/backend/cuda/platform.hpp
@@ -11,97 +11,153 @@
 
 #include <cuda.h>
 #include <cuda_runtime.h>
-#include <vector>
+
+#include <memory>
 #include <string>
-#if defined(WITH_GRAPHICS)
-#include <fg/window.h>
+#include <utility>
+
+/* Forward declarations of the opaque structures holding
+ * the following library contexts
+ *  * cuBLAS
+ *  * cuSparse
+ *  * cuSolver
+ */
+struct cublasContext;
+typedef struct cublasContext* BlasHandle;
+struct cusparseContext;
+typedef struct cusparseContext* SparseHandle;
+struct cusolverDnContext;
+typedef struct cusolverDnContext* SolveHandle;
+
+#ifdef WITH_CUDNN
+struct cudnnContext;
+typedef struct cudnnContext* cudnnHandle_t;
 #endif
 
-namespace cuda
-{
+namespace spdlog {
+class logger;
+}
+
+namespace arrayfire {
+namespace common {
+class ForgeManager;
+class MemoryManagerBase;
+}  // namespace common
+}  // namespace arrayfire
+
+using arrayfire::common::MemoryManagerBase;
+
+namespace arrayfire {
+namespace cuda {
+
+class GraphicsResourceManager;
+class PlanCache;
 
-std::string getInfo();
+int getBackend();
 
-std::string getDeviceInfo(int device);
+std::string getDeviceInfo() noexcept;
+std::string getDeviceInfo(int device) noexcept;
 
-std::string getPlatformInfo();
+std::string getPlatformInfo() noexcept;
 
-std::string getDriverVersion();
+std::string getDriverVersion() noexcept;
 
-std::string getCUDARuntimeVersion();
+// Returns the CUDA runtime version as a string. If the query fails, the
+// CUDA_VERSION the library was built with is returned instead
+std::string getCUDARuntimeVersion() noexcept;
 
-std::string getInfo();
+// Returns true if double is supported by the device
+bool isDoubleSupported(int device) noexcept;
 
-bool isDoubleSupported(int device);
+// Returns true if half is supported by the device
+bool isHalfSupported(int device);
 
-void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute);
+
+int& getMaxJitSize();
 
 int getDeviceCount();
 
+void init();
+
 int getActiveDeviceId();
 
 int getDeviceNativeId(int device);
 
-int setDevice(int device);
+cudaStream_t getStream(int device);
 
-void sync(int device);
+cudaStream_t getActiveStream();
 
-cudaDeviceProp getDeviceProp(int device);
+/// Returns true if the buffer on device buf_device_id can be accessed by
+/// kernels on device execution_id
+///
+/// \param[in] buf_device_id The device id of the buffer
+/// \param[in] execution_id The device where the buffer will be accessed.
+bool isDeviceBufferAccessible(int buf_device_id, int execution_id);
 
-struct cudaDevice_t {
-    cudaDeviceProp prop;
-    size_t flops;
-    int nativeId;
-};
+/// Return a handle to the stream for the device.
+///
+/// \param[in] device The device of the returned stream
+/// \returns The handle to the queue/stream
+cudaStream_t getQueueHandle(int device);
 
-class DeviceManager
-{
-    public:
-        static const unsigned MAX_DEVICES = 16;
+size_t getDeviceMemorySize(int device);
 
-        static DeviceManager& getInstance();
+size_t getHostMemorySize();
 
-        friend std::string getDeviceInfo(int device);
+size_t getL2CacheSize(const int device);
 
-        friend std::string getPlatformInfo();
+// Returns int[3] of maxGridSize
+const int* getMaxGridSize(const int device);
 
-        friend std::string getDriverVersion();
+unsigned getMemoryBusWidth(const int device);
 
-        friend std::string getCUDARuntimeVersion();
+// Maximum number of threads the device can actually run in parallel, without
+// scheduling
+unsigned getMaxParallelThreads(const int device);
 
-        friend std::string getInfo();
+unsigned getMultiProcessorCount(const int device);
 
-        friend int getDeviceCount();
+int setDevice(int device);
 
-        friend int getActiveDeviceId();
+void sync(int device);
 
-        friend int getDeviceNativeId(int device);
+// Returns true if the AF_SYNCHRONOUS_CALLS environment variable is set to 1
+bool synchronize_calls();
 
-        friend int setDevice(int device);
+const cudaDeviceProp& getDeviceProp(const int device);
 
-        friend cudaDeviceProp getDeviceProp(int device);
+std::pair<int, int> getComputeCapability(const int device);
 
-    private:
-        DeviceManager();
+bool& evalFlag();
 
-        // Following two declarations are required to
-        // avoid copying accidental copy/assignment
-        // of instance returned by getInstance to other
-        // variables
-        DeviceManager(DeviceManager const&);
-        void operator=(DeviceManager const&);
+MemoryManagerBase& memoryManager();
 
-        // Attributes
-        std::vector<cudaDevice_t> cuDevices;
+MemoryManagerBase& pinnedMemoryManager();
 
-        enum sort_mode {flops = 0, memory = 1, compute = 2, none = 3};
+void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
 
-        void sortDevices(sort_mode mode = flops);
+void resetMemoryManager();
 
-        int setActiveDevice(int device, int native = -1);
+void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
 
-        int activeDev;
-        int nDevices;
-};
+void resetMemoryManagerPinned();
 
-}
+arrayfire::common::ForgeManager& forgeManager();
+
+GraphicsResourceManager& interopManager();
+
+PlanCache& fftManager();
+
+BlasHandle blasHandle();
+
+#ifdef WITH_CUDNN
+cudnnHandle_t nnHandle();
+#endif
+
+SolveHandle solverDnHandle();
+
+SparseHandle sparseHandle();
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/plot.cpp b/src/backend/cuda/plot.cpp
new file mode 100644
index 0000000000..e69b149790
--- /dev/null
+++ b/src/backend/cuda/plot.cpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
+#include <err_cuda.hpp>
+#include <plot.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot) {
+    auto stream = getActiveStream();
+    if (DeviceManager::checkGraphicsInteropCapability()) {
+        const T *d_P = P.get();
+
+        auto res = interopManager().getPlotResources(plot);
+
+        size_t bytes = 0;
+        T *d_vbo     = NULL;
+        cudaGraphicsMapResources(1, res[0].get(), stream);
+        cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &bytes,
+                                             *(res[0].get()));
+        cudaMemcpyAsync(d_vbo, d_P, bytes, cudaMemcpyDeviceToDevice, stream);
+        cudaGraphicsUnmapResources(1, res[0].get(), stream);
+
+        CheckGL("After cuda resource copy");
+
+        POST_LAUNCH_CHECK();
+    } else {
+        ForgeModule &_ = common::forgePlugin();
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_plot_vertex_buffer(&buffer, plot));
+        FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
+
+        CheckGL("Begin CUDA fallback-resource copy");
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, P.get(), bytes,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+        CheckGL("End CUDA fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T) template void copy_plot<T>(const Array<T> &, fg_plot);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/plot.cu b/src/backend/cuda/plot.cu
deleted file mode 100644
index 195988e0ab..0000000000
--- a/src/backend/cuda/plot.cu
+++ /dev/null
@@ -1,58 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined (WITH_GRAPHICS)
-
-#include <interopManager.hpp>
-#include <Array.hpp>
-#include <plot.hpp>
-#include <err_cuda.hpp>
-#include <debug_cuda.hpp>
-#include <join.hpp>
-#include <reduce.hpp>
-#include <reorder.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-void copy_plot(const Array<T> &P, fg::Plot* plot)
-{
-    const T *d_P = P.get();
-
-    InteropManager& intrpMngr = InteropManager::getInstance();
-
-    cudaGraphicsResource *cudaVBOResource = intrpMngr.getBufferResource(plot);
-    // Map resource. Copy data to VBO. Unmap resource.
-    size_t num_bytes = plot->size();
-    T* d_vbo = NULL;
-    cudaGraphicsMapResources(1, &cudaVBOResource, 0);
-    cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &num_bytes, cudaVBOResource);
-    cudaMemcpy(d_vbo, d_P, num_bytes, cudaMemcpyDeviceToDevice);
-    cudaGraphicsUnmapResources(1, &cudaVBOResource, 0);
-
-    CheckGL("After cuda resource copy");
-
-    POST_LAUNCH_CHECK();
-}
-
-#define INSTANTIATE(T)  \
-    template void copy_plot<T>(const Array<T> &P, fg::Plot* plot);
-
-INSTANTIATE(float)
-INSTANTIATE(double)
-INSTANTIATE(int)
-INSTANTIATE(uint)
-INSTANTIATE(uchar)
-
-}
-
-#endif  // WITH_GRAPHICS
diff --git a/src/backend/cuda/plot.hpp b/src/backend/cuda/plot.hpp
index d41e4fb595..ff0739105d 100644
--- a/src/backend/cuda/plot.hpp
+++ b/src/backend/cuda/plot.hpp
@@ -7,16 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    void copy_plot(const Array<T> &P, fg::Plot* plot);
-}
+namespace arrayfire {
+namespace cuda {
 
-#endif
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot);
 
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/print.hpp b/src/backend/cuda/print.hpp
index cdd08a0277..2343992350 100644
--- a/src/backend/cuda/print.hpp
+++ b/src/backend/cuda/print.hpp
@@ -8,22 +8,20 @@
  ********************************************************/
 
 #pragma once
-#include <ostream>
 #include <backend.hpp>
+#include <types.hpp>
+#include <ostream>
 
-namespace cuda
-{
-    static std::ostream&
-    operator<<(std::ostream &out, const cfloat& var)
-    {
-        out << "(" << var.x << "," << var.y << ")";
-        return out;
-    }
+namespace arrayfire {
+namespace cuda {
+static std::ostream& operator<<(std::ostream& out, const cfloat& var) {
+    out << "(" << var.x << "," << var.y << ")";
+    return out;
+}
 
-    static std::ostream&
-    operator<<(std::ostream &out, const cdouble& var)
-    {
-        out << "(" << var.x << "," << var.y << ")";
-        return out;
-    }
+static std::ostream& operator<<(std::ostream& out, const cdouble& var) {
+    out << "(" << var.x << "," << var.y << ")";
+    return out;
 }
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/product.cu b/src/backend/cuda/product.cu
index d8463f9b50..fb26c95562 100644
--- a/src/backend/cuda/product.cu
+++ b/src/backend/cuda/product.cu
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //sum
-    INSTANTIATE(af_mul_t, float  , float  )
-    INSTANTIATE(af_mul_t, double , double )
-    INSTANTIATE(af_mul_t, cfloat , cfloat )
-    INSTANTIATE(af_mul_t, cdouble, cdouble)
-    INSTANTIATE(af_mul_t, int    , int    )
-    INSTANTIATE(af_mul_t, uint   , uint   )
-    INSTANTIATE(af_mul_t, char   , int    )
-    INSTANTIATE(af_mul_t, uchar  , uint   )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// mul
+INSTANTIATE(af_mul_t, float, float)
+INSTANTIATE(af_mul_t, double, double)
+INSTANTIATE(af_mul_t, cfloat, cfloat)
+INSTANTIATE(af_mul_t, cdouble, cdouble)
+INSTANTIATE(af_mul_t, int, int)
+INSTANTIATE(af_mul_t, uint, uint)
+INSTANTIATE(af_mul_t, intl, intl)
+INSTANTIATE(af_mul_t, uintl, uintl)
+INSTANTIATE(af_mul_t, char, int)
+INSTANTIATE(af_mul_t, schar, int)
+INSTANTIATE(af_mul_t, uchar, uint)
+INSTANTIATE(af_mul_t, short, int)
+INSTANTIATE(af_mul_t, ushort, uint)
+INSTANTIATE(af_mul_t, half, float)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/qr.cpp b/src/backend/cuda/qr.cpp
new file mode 100644
index 0000000000..f388944127
--- /dev/null
+++ b/src/backend/cuda/qr.cpp
@@ -0,0 +1,214 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <qr.hpp>
+
+#include <copy.hpp>
+#include <cublas_v2.h>
+#include <cusolverDn.hpp>
+#include <identity.hpp>
+#include <kernel/triangle.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// cusolverStatus_t cusolverDn<>geqrf_bufferSize(
+//        cusolverDnHandle_t handle,
+//        int m, int n,
+//        <> *A,
+//        int lda,
+//        int *Lwork );
+//
+// cusolverStatus_t cusolverDn<>geqrf(
+//        cusolverDnHandle_t handle,
+//        int m, int n,
+//        <> *A, int lda,
+//        <> *TAU,
+//        <> *Workspace,
+//        int Lwork, int *devInfo );
+//
+// cusolverStatus_t cusolverDn<>mqr(
+//        cusolverDnHandle_t handle,
+//        cublasSideMode_t side, cublasOperation_t trans,
+//        int m, int n, int k,
+//        const <> *A, int lda,
+//        const <> *tau,
+//        <> *C, int ldc,
+//        <> *work,
+//        int lwork, int *devInfo);
+
+template<typename T>
+struct geqrf_func_def_t {
+    using geqrf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t, int, int,
+                                                T *, int, T *, T *, int, int *);
+};
+
+template<typename T>
+struct geqrf_buf_func_def_t {
+    using geqrf_buf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t, int,
+                                                    int, T *, int, int *);
+};
+
+template<typename T>
+struct mqr_func_def_t {
+    using mqr_func_def = cusolverStatus_t (*)(cusolverDnHandle_t,
+                                              cublasSideMode_t,
+                                              cublasOperation_t, int, int, int,
+                                              const T *, int, const T *, T *,
+                                              int, T *, int, int *);
+};
+
+template<typename T>
+struct mqr_buf_func_def_t {
+    using mqr_buf_func_def = cusolverStatus_t (*)(cusolverDnHandle_t,
+                                                  cublasSideMode_t,
+                                                  cublasOperation_t, int, int, int,
+                                                  const T *, int, const T *, T *,
+                                                  int, int *);
+};
+
+
+#define QR_FUNC_DEF(FUNC)                                         \
+    template<typename T>                                          \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func(); \
+                                                                  \
+    template<typename T>                                          \
+    typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def FUNC##_buf_func();
+
+#define QR_FUNC(FUNC, TYPE, PREFIX)                                         \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cusolverDn##PREFIX##FUNC;                                    \
+    }                                                                       \
+                                                                            \
+    template<>                                                              \
+    typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def               \
+        FUNC##_buf_func<TYPE>() {                                           \
+        return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def) &         \
+               cusolverDn##PREFIX##FUNC##_bufferSize;                       \
+    }
+
+QR_FUNC_DEF(geqrf)
+QR_FUNC(geqrf, float, S)
+QR_FUNC(geqrf, double, D)
+QR_FUNC(geqrf, cfloat, C)
+QR_FUNC(geqrf, cdouble, Z)
+
+#define MQR_FUNC_DEF(FUNC)                                        \
+    template<typename T>                                          \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func(); \
+                                                                  \
+    template<typename T>                                          \
+    typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def FUNC##_buf_func();
+
+#define MQR_FUNC(FUNC, TYPE, PREFIX)                                        \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cusolverDn##PREFIX;                                          \
+    }                                                                       \
+                                                                            \
+    template<>                                                              \
+    typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def               \
+        FUNC##_buf_func<TYPE>() {                                           \
+        return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def) &         \
+               cusolverDn##PREFIX##_bufferSize;                             \
+    }
+
+MQR_FUNC_DEF(mqr)
+MQR_FUNC(mqr, float, Sormqr)
+MQR_FUNC(mqr, double, Dormqr)
+MQR_FUNC(mqr, cfloat, Cunmqr)
+MQR_FUNC(mqr, cdouble, Zunmqr)
+
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> in_copy = copyArray<T>(in);
+
+    int lwork = 0;
+
+    CUSOLVER_CHECK(geqrf_buf_func<T>()(solverDnHandle(), M, N, in_copy.get(),
+                                       in_copy.strides()[1], &lwork));
+
+    auto workspace = memAlloc<T>(lwork);
+
+    t         = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+    auto info = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(geqrf_func<T>()(solverDnHandle(), M, N, in_copy.get(),
+                                   in_copy.strides()[1], t.get(),
+                                   workspace.get(), lwork, info.get()));
+
+    // Split the packed geqrf result into q and r
+    dim4 rdims(M, N);
+    r = createEmptyArray<T>(rdims);
+
+    kernel::triangle<T>(r, in_copy, true, false);
+
+    int mn = max(M, N);
+    dim4 qdims(M, mn);
+    q = identity<T>(qdims);
+
+    CUSOLVER_CHECK(mqr_buf_func<T>()(
+        solverDnHandle(), CUBLAS_SIDE_LEFT, CUBLAS_OP_N, q.dims()[0],
+        q.dims()[1], min(M, N), in_copy.get(), in_copy.strides()[1], t.get(),
+        q.get(), q.strides()[1], &lwork));
+
+    workspace = memAlloc<T>(lwork);
+
+    CUSOLVER_CHECK(mqr_func<T>()(
+        solverDnHandle(), CUBLAS_SIDE_LEFT, CUBLAS_OP_N, q.dims()[0],
+        q.dims()[1], min(M, N), in_copy.get(), in_copy.strides()[1], t.get(),
+        q.get(), q.strides()[1], workspace.get(), lwork, info.get()));
+
+    q.resetDims(dim4(M, M));
+}
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+
+    int lwork = 0;
+
+    CUSOLVER_CHECK(geqrf_buf_func<T>()(solverDnHandle(), M, N, in.get(),
+                                       in.strides()[1], &lwork));
+
+    auto workspace = memAlloc<T>(lwork);
+    auto info      = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(geqrf_func<T>()(solverDnHandle(), M, N, in.get(),
+                                   in.strides()[1], t.get(), workspace.get(),
+                                   lwork, info.get()));
+
+    return t;
+}
+
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
+
+INSTANTIATE_QR(float)
+INSTANTIATE_QR(cfloat)
+INSTANTIATE_QR(double)
+INSTANTIATE_QR(cdouble)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/qr.cu b/src/backend/cuda/qr.cu
deleted file mode 100644
index 4654ee6e89..0000000000
--- a/src/backend/cuda/qr.cu
+++ /dev/null
@@ -1,251 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <qr.hpp>
-#include <err_common.hpp>
-
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <cusolverDnManager.hpp>
-#include <cublas_v2.h>
-#include <identity.hpp>
-#include <memory.hpp>
-#include <copy.hpp>
-
-#include <math.hpp>
-#include <err_common.hpp>
-
-#include <kernel/triangle.hpp>
-
-namespace cuda
-{
-
-using cusolver::getDnHandle;
-
-//cusolverStatus_t cusolverDn<>geqrf_bufferSize(
-//        cusolverDnHandle_t handle,
-//        int m, int n,
-//        <> *A,
-//        int lda,
-//        int *Lwork );
-//
-//cusolverStatus_t cusolverDn<>geqrf(
-//        cusolverDnHandle_t handle,
-//        int m, int n,
-//        <> *A, int lda,
-//        <> *TAU,
-//        <> *Workspace,
-//        int Lwork, int *devInfo );
-//
-//cusolverStatus_t cusolverDn<>mqr(
-//        cusolverDnHandle_t handle,
-//        cublasSideMode_t side, cublasOperation_t trans,
-//        int m, int n, int k,
-//        const double *A, int lda,
-//        const double *tau,
-//        double *C, int ldc,
-//        double *work,
-//        int lwork, int *devInfo);
-
-template<typename T>
-struct geqrf_func_def_t
-{
-    typedef cusolverStatus_t (*geqrf_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int,
-                              T *,
-                              T *,
-                              int, int *);
-};
-
-template<typename T>
-struct geqrf_buf_func_def_t
-{
-    typedef cusolverStatus_t (*geqrf_buf_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int, int *);
-};
-
-template<typename T>
-struct mqr_func_def_t
-{
-    typedef cusolverStatus_t (*mqr_func_def) (
-                              cusolverDnHandle_t,
-                              cublasSideMode_t, cublasOperation_t,
-                              int, int, int,
-                              const T *, int,
-                              const T *,
-                              T *, int,
-                              T *, int,
-                              int *);
-};
-
-#define QR_FUNC_DEF( FUNC )                                                     \
-template<typename T>                                                            \
-typename FUNC##_func_def_t<T>::FUNC##_func_def                                  \
-FUNC##_func();                                                                  \
-                                                                                \
-template<typename T>                                                            \
-typename FUNC##_buf_func_def_t<T>::FUNC##_buf_func_def                          \
-FUNC##_buf_func();                                                              \
-
-#define QR_FUNC( FUNC, TYPE, PREFIX )                                                           \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>()                \
-{ return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)&cusolverDn##PREFIX##FUNC; }                 \
-                                                                                                \
-template<> typename FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def FUNC##_buf_func<TYPE>()    \
-{ return (FUNC##_buf_func_def_t<TYPE>::FUNC##_buf_func_def)& cusolverDn##PREFIX##FUNC##_bufferSize; }
-
-QR_FUNC_DEF( geqrf )
-QR_FUNC(geqrf , float  , S)
-QR_FUNC(geqrf , double , D)
-QR_FUNC(geqrf , cfloat , C)
-QR_FUNC(geqrf , cdouble, Z)
-
-#define MQR_FUNC_DEF( FUNC )                                                        \
-template<typename T>                                                                \
-typename FUNC##_func_def_t<T>::FUNC##_func_def                                      \
-FUNC##_func();
-
-#define MQR_FUNC( FUNC, TYPE, PREFIX )                                              \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>()    \
-{ return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)&cusolverDn##PREFIX; }
-
-MQR_FUNC_DEF( mqr )
-MQR_FUNC(mqr , float  , Sormqr)
-MQR_FUNC(mqr , double , Dormqr)
-MQR_FUNC(mqr , cfloat , Cunmqr)
-MQR_FUNC(mqr , cdouble, Zunmqr)
-
-template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
-{
-    dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
-    Array<T> in_copy = copyArray<T>(in);
-
-    int lwork = 0;
-
-    CUSOLVER_CHECK(geqrf_buf_func<T>()(getDnHandle(),
-                                       M, N,
-                                       in_copy.get(), in_copy.strides()[1],
-                                       &lwork));
-
-    T *workspace = memAlloc<T>(lwork);
-
-    t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
-    int *info = memAlloc<int>(1);
-
-    CUSOLVER_CHECK(geqrf_func<T>()(getDnHandle(),
-                                   M, N,
-                                   in_copy.get(), in_copy.strides()[1],
-                                   t.get(),
-                                   workspace,
-                                   lwork, info));
-
-    // SPLIT into q and r
-    dim4 rdims(M, N);
-    r = createEmptyArray<T>(rdims);
-
-    kernel::triangle<T, true, false>(r, in_copy);
-
-    int mn = max(M, N);
-    dim4 qdims(M, mn);
-    q = identity<T>(qdims);
-
-    CUSOLVER_CHECK(mqr_func<T>()(getDnHandle(),
-                                 CUBLAS_SIDE_LEFT, CUBLAS_OP_N,
-                                 q.dims()[0],
-                                 q.dims()[1],
-                                 min(M, N),
-                                 in_copy.get(), in_copy.strides()[1],
-                                 t.get(),
-                                 q.get(), q.strides()[1],
-                                 workspace, lwork,
-                                 info));
-
-    q.resetDims(dim4(M, M));
-
-    memFree(workspace);
-    memFree(info);
-}
-
-template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
-    dim4 iDims = in.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-
-    Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
-
-    int lwork = 0;
-
-    CUSOLVER_CHECK(geqrf_buf_func<T>()(getDnHandle(),
-                                       M, N,
-                                       in.get(), in.strides()[1],
-                                       &lwork));
-
-    T *workspace = memAlloc<T>(lwork);
-    int *info = memAlloc<int>(1);
-
-    CUSOLVER_CHECK(geqrf_func<T>()(getDnHandle(),
-                                   M, N,
-                                   in.get(), in.strides()[1],
-                                   t.get(),
-                                   workspace, lwork,
-                                   info));
-
-    memFree(workspace);
-    memFree(info);
-    return t;
-}
-
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
-
-INSTANTIATE_QR(float)
-INSTANTIATE_QR(cfloat)
-INSTANTIATE_QR(double)
-INSTANTIATE_QR(cdouble)
-}
-
-#else
-namespace cuda
-{
-
-template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
-    AF_ERROR("CUDA cusolver not available. Linear Algebra is disabled",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
-
-INSTANTIATE_QR(float)
-INSTANTIATE_QR(cfloat)
-INSTANTIATE_QR(double)
-INSTANTIATE_QR(cdouble)
-
-}
-
-#endif
diff --git a/src/backend/cuda/qr.hpp b/src/backend/cuda/qr.hpp
index acedfd520c..46121cc211 100644
--- a/src/backend/cuda/qr.hpp
+++ b/src/backend/cuda/qr.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
 
-    template<typename T>
-    Array<T> qr_inplace(Array<T> &in);
-}
+template<typename T>
+Array<T> qr_inplace(Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/random.cu b/src/backend/cuda/random.cu
deleted file mode 100644
index 737c6c1fe7..0000000000
--- a/src/backend/cuda/random.cu
+++ /dev/null
@@ -1,64 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <random.hpp>
-#include <kernel/random.hpp>
-#include <cassert>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> randu(const af::dim4 &dims)
-    {
-        if (!kernel::is_init[getActiveDeviceId()]) kernel::setup_states();
-        Array<T> out = createEmptyArray<T>(dims);
-        kernel::randu(out.get(), out.elements());
-        return out;
-    }
-
-    template<typename T>
-    Array<T> randn(const af::dim4 &dims)
-    {
-        if (!kernel::is_init[getActiveDeviceId()]) kernel::setup_states();
-        Array<T> out  = createEmptyArray<T>(dims);
-        kernel::randn(out.get(), out.elements());
-        return out;
-    }
-
-    template Array<float>   randu<float>   (const af::dim4 &dims);
-    template Array<double>  randu<double>  (const af::dim4 &dims);
-    template Array<cfloat>  randu<cfloat>  (const af::dim4 &dims);
-    template Array<cdouble> randu<cdouble> (const af::dim4 &dims);
-    template Array<int>     randu<int>     (const af::dim4 &dims);
-    template Array<uint>    randu<uint>    (const af::dim4 &dims);
-    template Array<char>    randu<char>    (const af::dim4 &dims);
-    template Array<uchar>   randu<uchar>   (const af::dim4 &dims);
-
-    template Array<float>   randn<float>   (const af::dim4 &dims);
-    template Array<double>  randn<double>  (const af::dim4 &dims);
-    template Array<cfloat>  randn<cfloat>  (const af::dim4 &dims);
-    template Array<cdouble> randn<cdouble> (const af::dim4 &dims);
-
-
-    void setSeed(const uintl seed)
-    {
-        kernel::seed = seed;
-        kernel::setup_states();
-    }
-
-    uintl getSeed()
-    {
-        return kernel::seed;
-    }
-
-
-}
diff --git a/src/backend/cuda/random.hpp b/src/backend/cuda/random.hpp
deleted file mode 100644
index 250af773b3..0000000000
--- a/src/backend/cuda/random.hpp
+++ /dev/null
@@ -1,23 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <Array.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> randu(const af::dim4 &dims);
-
-    template<typename T>
-    Array<T> randn(const af::dim4 &dims);
-
-    void setSeed(const uintl seed);
-    uintl getSeed();
-}
diff --git a/src/backend/cuda/random_engine.cu b/src/backend/cuda/random_engine.cu
new file mode 100644
index 0000000000..26cdbdc23b
--- /dev/null
+++ b/src/backend/cuda/random_engine.cu
@@ -0,0 +1,163 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/random_engine.hpp>
+#include <af/dim4.hpp>
+#include <cassert>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl) {
+    kernel::initMersenneState(state.get(), tbl.get(), seed);
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionCBRNG<T>(out.get(), out.elements(), type, seed,
+                                        counter);
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionCBRNG<T>(out.get(), out.elements(), type, seed,
+                                       counter);
+    return out;
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionMT<T>(out.get(), out.elements(), state.get(),
+                                     pos.get(), sh1.get(), sh2.get(), mask,
+                                     recursion_table.get(), temper_table.get());
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionMT<T>(out.get(), out.elements(), state.get(),
+                                    pos.get(), sh1.get(), sh2.get(), mask,
+                                    recursion_table.get(), temper_table.get());
+    return out;
+}
+
+#define INSTANTIATE_UNIFORM(T)                                   \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define INSTANTIATE_NORMAL(T)                                    \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define COMPLEX_UNIFORM_DISTRIBUTION(T, TR)                                 \
+    template<>                                                              \
+    Array<T> uniformDistribution<T>(const af::dim4 &dims,                   \
+                                    const af_random_engine_type type,       \
+                                    const uintl &seed, uintl &counter) {    \
+        Array<T> out    = createEmptyArray<T>(dims);                        \
+        TR *outPtr      = (TR *)out.get();                                  \
+        size_t elements = out.elements() * 2;                               \
+        kernel::uniformDistributionCBRNG<TR>(outPtr, elements, type, seed,  \
+                                             counter);                      \
+        return out;                                                         \
+    }                                                                       \
+    template<>                                                              \
+    Array<T> uniformDistribution<T>(                                        \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,             \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,            \
+        Array<uint> temper_table, Array<uint> state) {                      \
+        Array<T> out    = createEmptyArray<T>(dims);                        \
+        TR *outPtr      = (TR *)out.get();                                  \
+        size_t elements = out.elements() * 2;                               \
+        kernel::uniformDistributionMT<TR>(                                  \
+            outPtr, elements, state.get(), pos.get(), sh1.get(), sh2.get(), \
+            mask, recursion_table.get(), temper_table.get());               \
+        return out;                                                         \
+    }
+
+#define COMPLEX_NORMAL_DISTRIBUTION(T, TR)                                  \
+    template<>                                                              \
+    Array<T> normalDistribution<T>(const af::dim4 &dims,                    \
+                                   const af_random_engine_type type,        \
+                                   const uintl &seed, uintl &counter) {     \
+        Array<T> out    = createEmptyArray<T>(dims);                        \
+        TR *outPtr      = (TR *)out.get();                                  \
+        size_t elements = out.elements() * 2;                               \
+        kernel::normalDistributionCBRNG<TR>(outPtr, elements, type, seed,   \
+                                            counter);                       \
+        return out;                                                         \
+    }                                                                       \
+    template<>                                                              \
+    Array<T> normalDistribution<T>(                                         \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,             \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,            \
+        Array<uint> temper_table, Array<uint> state) {                      \
+        Array<T> out    = createEmptyArray<T>(dims);                        \
+        TR *outPtr      = (TR *)out.get();                                  \
+        size_t elements = out.elements() * 2;                               \
+        kernel::normalDistributionMT<TR>(                                   \
+            outPtr, elements, state.get(), pos.get(), sh1.get(), sh2.get(), \
+            mask, recursion_table.get(), temper_table.get());               \
+        return out;                                                         \
+    }
+
+INSTANTIATE_UNIFORM(float)
+INSTANTIATE_UNIFORM(double)
+INSTANTIATE_UNIFORM(int)
+INSTANTIATE_UNIFORM(uint)
+INSTANTIATE_UNIFORM(intl)
+INSTANTIATE_UNIFORM(uintl)
+INSTANTIATE_UNIFORM(char)
+INSTANTIATE_UNIFORM(schar)
+INSTANTIATE_UNIFORM(uchar)
+INSTANTIATE_UNIFORM(short)
+INSTANTIATE_UNIFORM(ushort)
+INSTANTIATE_UNIFORM(half)
+
+INSTANTIATE_NORMAL(float)
+INSTANTIATE_NORMAL(double)
+INSTANTIATE_NORMAL(half)
+
+COMPLEX_UNIFORM_DISTRIBUTION(cdouble, double)
+COMPLEX_UNIFORM_DISTRIBUTION(cfloat, float)
+
+COMPLEX_NORMAL_DISTRIBUTION(cdouble, double)
+COMPLEX_NORMAL_DISTRIBUTION(cfloat, float)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/random_engine.hpp b/src/backend/cuda/random_engine.hpp
new file mode 100644
index 0000000000..8062f6feb7
--- /dev/null
+++ b/src/backend/cuda/random_engine.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace cuda {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/range.cpp b/src/backend/cuda/range.cpp
new file mode 100644
index 0000000000..f821f283f7
--- /dev/null
+++ b/src/backend/cuda/range.cpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <range.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/range.hpp>
+#include <math.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim) {
+    // Select the dimension along which the sequence runs;
+    // the other dimensions are simply tiled
+    int _seq_dim = seq_dim;
+    if (seq_dim < 0) {
+        _seq_dim = 0;  // column wise sequence
+    }
+
+    if (_seq_dim < 0 || _seq_dim > 3) {
+        AF_ERROR("Invalid rep selection", AF_ERR_ARG);
+    }
+
+    Array<T> out = createEmptyArray<T>(dim);
+    kernel::range<T>(out, _seq_dim);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> range<T>(const af::dim4& dims, const int seq_dim);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/range.cu b/src/backend/cuda/range.cu
deleted file mode 100644
index 17e6472b72..0000000000
--- a/src/backend/cuda/range.cu
+++ /dev/null
@@ -1,46 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <range.hpp>
-#include <kernel/range.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> range(const dim4& dim, const int seq_dim)
-    {
-        // Set dimension along which the sequence should be
-        // Other dimensions are simply tiled
-        int _seq_dim = seq_dim;
-        if(seq_dim < 0) {
-            _seq_dim = 0;   // column wise sequence
-        }
-
-        if(_seq_dim < 0 || _seq_dim > 3)
-            AF_ERROR("Invalid rep selection", AF_ERR_ARG);
-
-        Array<T> out = createEmptyArray<T>(dim);
-        kernel::range<T>(out, _seq_dim);
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                      \
-    template Array<T> range<T>(const af::dim4 &dims, const int seq_dim);    \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-}
diff --git a/src/backend/cuda/range.hpp b/src/backend/cuda/range.hpp
index f49f3dff46..7ad50970aa 100644
--- a/src/backend/cuda/range.hpp
+++ b/src/backend/cuda/range.hpp
@@ -8,11 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> range(const dim4& dim, const int seq_dim = -1);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim = -1);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/reduce.hpp b/src/backend/cuda/reduce.hpp
index 2af2f5efd4..70f7cf848d 100644
--- a/src/backend/cuda/reduce.hpp
+++ b/src/backend/cuda/reduce.hpp
@@ -6,16 +6,23 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <af/array.h>
+#pragma once
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan = false,
+                 double nanval = 0);
 
-namespace cuda
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim);
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan = false, double nanval = 0);
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in);
-}
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan = false,
+                     double nanval = 0);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/reduce_impl.hpp b/src/backend/cuda/reduce_impl.hpp
index abc6d5c2eb..bbb91d79d9 100644
--- a/src/backend/cuda/reduce_impl.hpp
+++ b/src/backend/cuda/reduce_impl.hpp
@@ -7,39 +7,370 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
+
 #include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 
 #undef _GLIBCXX_USE_INT128
+#include <Array.hpp>
+#include <Event.hpp>
+#include <err_cuda.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/reduce_by_key.hpp>
 #include <reduce.hpp>
+#include <set.hpp>
+
+#include <cub/device/device_scan.cuh>
+
 #include <complex>
-#include <kernel/reduce.hpp>
-#include <err_cuda.hpp>
 
-using std::swap;
 using af::dim4;
-namespace cuda
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim)
-    {
-
-        dim4 odims = in.dims();
-        odims[dim] = 1;
-        Array<To> out = createEmptyArray<To>(odims);
-        kernel::reduce<Ti, To, op>(out, in, dim);
-        return out;
+using std::swap;
+
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan,
+                 double nanval) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::reduce<Ti, To, op>(out, in, dim, change_nan, nanval);
+    return out;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key_dim(Array<Tk> &keys_out, Array<To> &vals_out,
+                       const Array<Tk> &keys, const Array<Ti> &vals,
+                       bool change_nan, double nanval, const int dim) {
+    std::vector<int> dim_ordering = {dim};
+    for (int i = 0; i < 4; ++i) {
+        if (i != dim) { dim_ordering.push_back(i); }
     }
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in)
-    {
-        return kernel::reduce_all<Ti, To, op>(in);
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    // allocate space for output and temporary working arrays
+    Array<Tk> reduced_keys   = createEmptyArray<Tk>(kdims);
+    Array<To> reduced_vals   = createEmptyArray<To>(odims);
+    Array<Tk> t_reduced_keys = createEmptyArray<Tk>(kdims);
+    Array<To> t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether more reduction is necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    // reset flags
+    CUDA_CHECK(cudaMemsetAsync(needs_another_reduction.get(), 0, sizeof(int),
+                               getActiveStream()));
+    CUDA_CHECK(cudaMemsetAsync(needs_block_boundary_reduction.get(), 0,
+                               sizeof(int), getActiveStream()));
+
+    int nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+
+    auto reduced_block_sizes = memAlloc<int>(numBlocksD0);
+
+    size_t temp_storage_bytes = 0;
+    cub::DeviceScan::InclusiveSum(
+        NULL, temp_storage_bytes, reduced_block_sizes.get(),
+        reduced_block_sizes.get(), numBlocksD0, getActiveStream());
+    auto d_temp_storage = memAlloc<char>(temp_storage_bytes);
+
+    int n_reduced_host = nelems;
+    int needs_another_reduction_host;
+    int needs_block_boundary_reduction_host;
+
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+        dim3 blocks(numBlocksD0, odims[dim_ordering[1]],
+                    odims[dim_ordering[2]] * odims[dim_ordering[3]]);
+
+        int folded_dim_sz = odims[dim_ordering[2]];
+        if (first_pass) {
+            CUDA_LAUNCH(
+                (kernel::reduce_blocks_dim_by_key<Ti, Tk, To, op, numThreads>),
+                blocks, numThreads, reduced_block_sizes.get(), reduced_keys,
+                reduced_vals, keys, vals, nelems, change_nan,
+                scalar<To>(nanval), dim, folded_dim_sz);
+            POST_LAUNCH_CHECK();
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = op == af_notzero_t ? af_add_t : op;
+            CUDA_LAUNCH(
+                (kernel::reduce_blocks_dim_by_key<To, Tk, To, op2, numThreads>),
+                blocks, numThreads, reduced_block_sizes.get(), reduced_keys,
+                reduced_vals, t_reduced_keys, t_reduced_vals, n_reduced_host,
+                change_nan, scalar<To>(nanval), dim, folded_dim_sz);
+            POST_LAUNCH_CHECK();
+        }
+
+        cub::DeviceScan::InclusiveSum(
+            (void *)d_temp_storage.get(), temp_storage_bytes,
+            reduced_block_sizes.get(), reduced_block_sizes.get(), numBlocksD0,
+            getActiveStream());
+
+        CUDA_LAUNCH((kernel::compact_dim<Tk, To>), blocks, numThreads,
+                    reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                    reduced_keys, reduced_vals, dim, folded_dim_sz);
+        POST_LAUNCH_CHECK();
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            &n_reduced_host, reduced_block_sizes.get() + (numBlocksD0 - 1),
+            sizeof(int), cudaMemcpyDeviceToHost, getActiveStream()));
+        Event reduce_host_event = makeEvent(getActiveStream());
+
+        // reset flags
+        CUDA_CHECK(cudaMemsetAsync(needs_another_reduction.get(), 0,
+                                   sizeof(int), getActiveStream()));
+        CUDA_CHECK(cudaMemsetAsync(needs_block_boundary_reduction.get(), 0,
+                                   sizeof(int), getActiveStream()));
+
+        reduce_host_event.block();
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        CUDA_LAUNCH((kernel::test_needs_reduction<Tk>), numBlocksD0, numThreads,
+                    needs_another_reduction.get(),
+                    needs_block_boundary_reduction.get(), t_reduced_keys,
+                    n_reduced_host);
+        POST_LAUNCH_CHECK();
+
+        CUDA_CHECK(cudaMemcpyAsync(&needs_another_reduction_host,
+                                   needs_another_reduction.get(), sizeof(int),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(&needs_block_boundary_reduction_host,
+                                   needs_block_boundary_reduction.get(),
+                                   sizeof(int), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            dim3 blocks(numBlocksD0, odims[dim_ordering[1]],
+                        odims[dim_ordering[2]] * odims[dim_ordering[3]]);
+            CUDA_LAUNCH((kernel::final_boundary_reduce<Tk, To, op>), blocks,
+                        numThreads, reduced_block_sizes.get(), t_reduced_keys,
+                        t_reduced_vals, n_reduced_host);
+            POST_LAUNCH_CHECK();
+
+            cub::DeviceScan::InclusiveSum(
+                (void *)d_temp_storage.get(), temp_storage_bytes,
+                reduced_block_sizes.get(), reduced_block_sizes.get(),
+                numBlocksD0, getActiveStream());
+
+            CUDA_CHECK(cudaMemcpyAsync(
+                &n_reduced_host, reduced_block_sizes.get() + (numBlocksD0 - 1),
+                sizeof(int), cudaMemcpyDeviceToHost, getActiveStream()));
+            reduce_host_event.mark(getActiveStream());
+
+            CUDA_LAUNCH((kernel::compact_dim<Tk, To>), blocks, numThreads,
+                        reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                        t_reduced_keys, t_reduced_vals, dim, folded_dim_sz);
+            POST_LAUNCH_CHECK();
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+            reduce_host_event.block();
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    kdims[0]   = n_reduced_host;
+    odims[dim] = n_reduced_host;
+    std::vector<af_seq> kindex, vindex;
+    for (int i = 0; i < odims.ndims(); ++i) {
+        af_seq sk = {0.0, (double)kdims[i] - 1, 1.0};
+        af_seq sv = {0.0, (double)odims[i] - 1, 1.0};
+        kindex.push_back(sk);
+        vindex.push_back(sv);
     }
+
+    keys_out = createSubArray<Tk>(t_reduced_keys, kindex, true);
+    vals_out = createSubArray<To>(t_reduced_vals, vindex, true);
 }
 
-#define INSTANTIATE(Op, Ti, To)                                         \
-    template Array<To> reduce<Op, Ti, To>(const Array<Ti> &in, const int dim); \
-    template To reduce_all<Op, Ti, To>(const Array<Ti> &in);
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key_first(Array<Tk> &keys_out, Array<To> &vals_out,
+                         const Array<Tk> &keys, const Array<Ti> &vals,
+                         bool change_nan, double nanval) {
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    // allocate space for output and temporary working arrays
+    Array<Tk> reduced_keys   = createEmptyArray<Tk>(kdims);
+    Array<To> reduced_vals   = createEmptyArray<To>(odims);
+    Array<Tk> t_reduced_keys = createEmptyArray<Tk>(kdims);
+    Array<To> t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether more reduction is necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    // reset flags
+    CUDA_CHECK(cudaMemsetAsync(needs_another_reduction.get(), 0, sizeof(int),
+                               getActiveStream()));
+    CUDA_CHECK(cudaMemsetAsync(needs_block_boundary_reduction.get(), 0,
+                               sizeof(int), getActiveStream()));
+
+    int nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+
+    auto reduced_block_sizes = memAlloc<int>(numBlocksD0);
+
+    size_t temp_storage_bytes = 0;
+    cub::DeviceScan::InclusiveSum(
+        NULL, temp_storage_bytes, reduced_block_sizes.get(),
+        reduced_block_sizes.get(), numBlocksD0, getActiveStream());
+    auto d_temp_storage = memAlloc<char>(temp_storage_bytes);
+
+    int n_reduced_host = nelems;
+    int needs_another_reduction_host;
+    int needs_block_boundary_reduction_host;
+
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+        dim3 blocks(numBlocksD0, odims[1], odims[2] * odims[3]);
+
+        if (first_pass) {
+            CUDA_LAUNCH(
+                (kernel::reduce_blocks_by_key<Ti, Tk, To, op, numThreads>),
+                blocks, numThreads, reduced_block_sizes.get(), reduced_keys,
+                reduced_vals, keys, vals, nelems, change_nan,
+                scalar<To>(nanval), odims[2]);
+            POST_LAUNCH_CHECK();
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = op == af_notzero_t ? af_add_t : op;
+            CUDA_LAUNCH(
+                (kernel::reduce_blocks_by_key<To, Tk, To, op2, numThreads>),
+                blocks, numThreads, reduced_block_sizes.get(), reduced_keys,
+                reduced_vals, t_reduced_keys, t_reduced_vals, n_reduced_host,
+                change_nan, scalar<To>(nanval), odims[2]);
+            POST_LAUNCH_CHECK();
+        }
+
+        cub::DeviceScan::InclusiveSum(
+            (void *)d_temp_storage.get(), temp_storage_bytes,
+            reduced_block_sizes.get(), reduced_block_sizes.get(), numBlocksD0,
+            getActiveStream());
+
+        CUDA_LAUNCH((kernel::compact<Tk, To>), blocks, numThreads,
+                    reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                    reduced_keys, reduced_vals, odims[2]);
+        POST_LAUNCH_CHECK();
+
+        CUDA_CHECK(cudaMemcpyAsync(
+            &n_reduced_host, reduced_block_sizes.get() + (numBlocksD0 - 1),
+            sizeof(int), cudaMemcpyDeviceToHost, getActiveStream()));
+        Event reduce_host_event = makeEvent(getActiveStream());
+
+        // reset flags
+        CUDA_CHECK(cudaMemsetAsync(needs_another_reduction.get(), 0,
+                                   sizeof(int), getActiveStream()));
+        CUDA_CHECK(cudaMemsetAsync(needs_block_boundary_reduction.get(), 0,
+                                   sizeof(int), getActiveStream()));
+
+        reduce_host_event.block();
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        CUDA_LAUNCH((kernel::test_needs_reduction<Tk>), numBlocksD0, numThreads,
+                    needs_another_reduction.get(),
+                    needs_block_boundary_reduction.get(), t_reduced_keys,
+                    n_reduced_host);
+        POST_LAUNCH_CHECK();
+
+        CUDA_CHECK(cudaMemcpyAsync(&needs_another_reduction_host,
+                                   needs_another_reduction.get(), sizeof(int),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(&needs_block_boundary_reduction_host,
+                                   needs_block_boundary_reduction.get(),
+                                   sizeof(int), cudaMemcpyDeviceToHost,
+                                   getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(getActiveStream()));
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            // TODO: fold 3,4 dimensions
+            blocks = dim3(numBlocksD0, odims[1], odims[2]);
+            CUDA_LAUNCH((kernel::final_boundary_reduce<Tk, To, op>), blocks,
+                        numThreads, reduced_block_sizes.get(), t_reduced_keys,
+                        t_reduced_vals, n_reduced_host);
+            POST_LAUNCH_CHECK();
+
+            cub::DeviceScan::InclusiveSum(
+                (void *)d_temp_storage.get(), temp_storage_bytes,
+                reduced_block_sizes.get(), reduced_block_sizes.get(),
+                numBlocksD0, getActiveStream());
+
+            CUDA_CHECK(cudaMemcpyAsync(
+                &n_reduced_host, reduced_block_sizes.get() + (numBlocksD0 - 1),
+                sizeof(int), cudaMemcpyDeviceToHost, getActiveStream()));
+            reduce_host_event.mark(getActiveStream());
+
+            CUDA_LAUNCH((kernel::compact<Tk, To>), blocks, numThreads,
+                        reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                        t_reduced_keys, t_reduced_vals, odims[2]);
+            POST_LAUNCH_CHECK();
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+            reduce_host_event.block();
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    kdims[0] = n_reduced_host;
+    odims[0] = n_reduced_host;
+    std::vector<af_seq> kindex, vindex;
+    for (int i = 0; i < odims.ndims(); ++i) {
+        af_seq sk = {0.0, (double)kdims[i] - 1, 1.0};
+        af_seq sv = {0.0, (double)odims[i] - 1, 1.0};
+        kindex.push_back(sk);
+        vindex.push_back(sv);
+    }
+
+    keys_out = createSubArray<Tk>(t_reduced_keys, kindex, true);
+    vals_out = createSubArray<To>(t_reduced_vals, vindex, true);
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan, double nanval) {
+    if (dim == 0) {
+        reduce_by_key_first<op, Ti, Tk, To>(keys_out, vals_out, keys, vals,
+                                            change_nan, nanval);
+    } else {
+        reduce_by_key_dim<op, Ti, Tk, To>(keys_out, vals_out, keys, vals,
+                                          change_nan, nanval, dim);
+    }
+}
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan, double nanval) {
+    Array<To> out = createEmptyArray<To>(1);
+    kernel::reduce_all<Ti, To, op>(out, in, change_nan, nanval);
+    return out;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
+
+#define INSTANTIATE(Op, Ti, To)                                                \
+    template Array<To> reduce<Op, Ti, To>(const Array<Ti> &in, const int dim,  \
+                                          bool change_nan, double nanval);     \
+    template void reduce_by_key<Op, Ti, int, To>(                              \
+        Array<int> & keys_out, Array<To> & vals_out, const Array<int> &keys,   \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template void reduce_by_key<Op, Ti, uint, To>(                             \
+        Array<uint> & keys_out, Array<To> & vals_out, const Array<uint> &keys, \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template Array<To> reduce_all<Op, Ti, To>(const Array<Ti> &in,             \
+                                              bool change_nan, double nanval);
diff --git a/src/backend/cuda/regions.cu b/src/backend/cuda/regions.cu
index 656048c9e9..7de5c54c05 100644
--- a/src/backend/cuda/regions.cu
+++ b/src/backend/cuda/regions.cu
@@ -7,63 +7,71 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <regions.hpp>
-#include <kernel/regions.hpp>
 #include <err_cuda.hpp>
+#include <kernel/regions.hpp>
+#include <regions.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
-Array<T>  regions(const Array<char> &in, af_connectivity connectivity)
-{
-    ARG_ASSERT(2, (connectivity==AF_CONNECTIVITY_4 || connectivity==AF_CONNECTIVITY_8));
-
+Array<T> regions(const Array<char> &in, af_connectivity connectivity) {
     const dim4 dims = in.dims();
 
-    Array<T>  out  = createEmptyArray<T>(dims);
+    Array<T> out = createEmptyArray<T>(dims);
 
     // Create bindless texture object for the equiv map.
     cudaTextureObject_t tex = 0;
-    // FIXME: Currently disabled, only supported on capaibility >= 3.0
-    //if (compute >= 3.0) {
-    //    cudaResourceDesc resDesc;
-    //    memset(&resDesc, 0, sizeof(resDesc));
-    //    resDesc.resType = cudaResourceTypeLinear;
-    //    resDesc.res.linear.devPtr = out->get();
-    //    resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
-    //    resDesc.res.linear.desc.x = 32; // bits per channel
-    //    resDesc.res.linear.sizeInBytes = dims[0] * dims[1] * sizeof(float);
-    //    cudaTextureDesc texDesc;
-    //    memset(&texDesc, 0, sizeof(texDesc));
-    //    texDesc.readMode = cudaReadModeElementType;
-    //    CUDA_CHECK(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));
-    //}
 
-    switch(connectivity) {
-        case AF_CONNECTIVITY_4:
-            ::regions<T, false, 2>(out, in, tex);
-            break;
-        case AF_CONNECTIVITY_8:
-            ::regions<T, true,  2>(out, in, tex);
-            break;
+    // Use texture objects with compute 3.0 or higher
+    if (!std::is_same<T, double>::value) {
+        cudaResourceDesc resDesc;
+        memset(&resDesc, 0, sizeof(resDesc));
+        resDesc.resType           = cudaResourceTypeLinear;
+        resDesc.res.linear.devPtr = out.get();
+
+        if (std::is_signed<T>::value)
+            resDesc.res.linear.desc.f = cudaChannelFormatKindSigned;
+        else if (std::is_unsigned<T>::value)
+            resDesc.res.linear.desc.f = cudaChannelFormatKindUnsigned;
+        else
+            resDesc.res.linear.desc.f = cudaChannelFormatKindFloat;
+
+        resDesc.res.linear.desc.x      = sizeof(T) * 8;  // bits per channel
+        resDesc.res.linear.sizeInBytes = dims[0] * dims[1] * sizeof(T);
+        cudaTextureDesc texDesc;
+        memset(&texDesc, 0, sizeof(texDesc));
+        texDesc.readMode = cudaReadModeElementType;
+        CUDA_CHECK(cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL));
+    }
+
+    switch (connectivity) {
+        case AF_CONNECTIVITY_4: ::regions<T, false, 2>(out, in, tex); break;
+        case AF_CONNECTIVITY_8: ::regions<T, true, 2>(out, in, tex); break;
     }
 
+    // The iterative procedure (while loop) in kernel::regions
+    // synchronizes the stream towards the end of the loop, so it is
+    // safe to destroy the texture object here
+    CUDA_CHECK(cudaDestroyTextureObject(tex));
+
     return out;
 }
 
-#define INSTANTIATE(T)\
-    template Array<T>  regions<T>(const Array<char> &in, af_connectivity connectivity);
+#define INSTANTIATE(T)                                  \
+    template Array<T> regions<T>(const Array<char> &in, \
+                                 af_connectivity connectivity);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/regions.hpp b/src/backend/cuda/regions.hpp
index ac6550122f..34959c4f62 100644
--- a/src/backend/cuda/regions.hpp
+++ b/src/backend/cuda/regions.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
 Array<T> regions(const Array<char> &in, af_connectivity connectivity);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/reorder.cpp b/src/backend/cuda/reorder.cpp
new file mode 100644
index 0000000000..286dcde6ad
--- /dev/null
+++ b/src/backend/cuda/reorder.cpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <reorder.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/reorder.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(0);
+    for (int i = 0; i < 4; i++) { oDims[i] = iDims[rdims[i]]; }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::reorder<T>(out, in, rdims.get());
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/reorder.cu b/src/backend/cuda/reorder.cu
deleted file mode 100644
index 2c920e632a..0000000000
--- a/src/backend/cuda/reorder.cu
+++ /dev/null
@@ -1,47 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <reorder.hpp>
-#include <kernel/reorder.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims(0);
-        for(int i = 0; i < 4; i++)
-            oDims[i] = iDims[rdims[i]];
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        kernel::reorder<T>(out, in, rdims.get());
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                         \
-    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-
-}
diff --git a/src/backend/cuda/reorder.hpp b/src/backend/cuda/reorder.hpp
index 3adb5e20dc..bda5fc449c 100644
--- a/src/backend/cuda/reorder.hpp
+++ b/src/backend/cuda/reorder.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/reshape.cpp b/src/backend/cuda/reshape.cpp
new file mode 100644
index 0000000000..329b7883cb
--- /dev/null
+++ b/src/backend/cuda/reshape.cpp
@@ -0,0 +1,84 @@
+
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+
+#include <common/half.hpp>
+#include <kernel/memcopy.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue, double scale) {
+    Array<outType> out = createEmptyArray<outType>(outDims);
+    if (out.elements() > 0) {
+        kernel::copy<inType, outType>(out, in, in.ndims(), defaultValue, scale);
+    }
+    return out;
+}
+
+#define INSTANTIATE(SRC_T)                                                    \
+    template Array<float> reshape<SRC_T, float>(Array<SRC_T> const &,         \
+                                                dim4 const &, float, double); \
+    template Array<double> reshape<SRC_T, double>(                            \
+        Array<SRC_T> const &, dim4 const &, double, double);                  \
+    template Array<cfloat> reshape<SRC_T, cfloat>(                            \
+        Array<SRC_T> const &, dim4 const &, cfloat, double);                  \
+    template Array<cdouble> reshape<SRC_T, cdouble>(                          \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);                 \
+    template Array<int> reshape<SRC_T, int>(Array<SRC_T> const &,             \
+                                            dim4 const &, int, double);       \
+    template Array<uint> reshape<SRC_T, uint>(Array<SRC_T> const &,           \
+                                              dim4 const &, uint, double);    \
+    template Array<intl> reshape<SRC_T, intl>(Array<SRC_T> const &,           \
+                                              dim4 const &, intl, double);    \
+    template Array<uintl> reshape<SRC_T, uintl>(Array<SRC_T> const &,         \
+                                                dim4 const &, uintl, double); \
+    template Array<short> reshape<SRC_T, short>(Array<SRC_T> const &,         \
+                                                dim4 const &, short, double); \
+    template Array<ushort> reshape<SRC_T, ushort>(                            \
+        Array<SRC_T> const &, dim4 const &, ushort, double);                  \
+    template Array<schar> reshape<SRC_T, schar>(Array<SRC_T> const &,         \
+                                                dim4 const &, schar, double); \
+    template Array<uchar> reshape<SRC_T, uchar>(Array<SRC_T> const &,         \
+                                                dim4 const &, uchar, double); \
+    template Array<char> reshape<SRC_T, char>(Array<SRC_T> const &,           \
+                                              dim4 const &, char, double);    \
+    template Array<half> reshape<SRC_T, half>(Array<SRC_T> const &,           \
+                                              dim4 const &, half, double);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COMPLEX(SRC_T)                           \
+    template Array<cfloat> reshape<SRC_T, cfloat>(           \
+        Array<SRC_T> const &, dim4 const &, cfloat, double); \
+    template Array<cdouble> reshape<SRC_T, cdouble>(         \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);
+
+INSTANTIATE_COMPLEX(cfloat)
+INSTANTIATE_COMPLEX(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/resize.cpp b/src/backend/cuda/resize.cpp
new file mode 100644
index 0000000000..dec6f09d26
--- /dev/null
+++ b/src/backend/cuda/resize.cpp
@@ -0,0 +1,50 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <resize.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/resize.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(odim0, odim1, iDims[2], iDims[3]);
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::resize<T>(out, in, method);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> resize<T>(const Array<T> &in, const dim_t odim0, \
+                                const dim_t odim1,                     \
+                                const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/resize.cu b/src/backend/cuda/resize.cu
deleted file mode 100644
index dd5442636d..0000000000
--- a/src/backend/cuda/resize.cu
+++ /dev/null
@@ -1,57 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <resize.hpp>
-#include <kernel/resize.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                     const af_interp_type method)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims(odim0, odim1, iDims[2], iDims[3]);
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::resize<T, AF_INTERP_NEAREST >(out, in);
-                break;
-            case AF_INTERP_BILINEAR:
-                kernel::resize<T, AF_INTERP_BILINEAR>(out, in);
-                break;
-            default:
-                break;
-        }
-
-        return out;
-    }
-
-
-#define INSTANTIATE(T)                                                                            \
-    template Array<T> resize<T> (const Array<T> &in, const dim_t odim0, const dim_t odim1, \
-                                 const af_interp_type method);
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-}
diff --git a/src/backend/cuda/resize.hpp b/src/backend/cuda/resize.hpp
index 025e149115..ee2f1a0117 100644
--- a/src/backend/cuda/resize.hpp
+++ b/src/backend/cuda/resize.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/rotate.cpp b/src/backend/cuda/rotate.cpp
new file mode 100644
index 0000000000..7edb0de7a6
--- /dev/null
+++ b/src/backend/cuda/rotate.cpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <rotate.hpp>
+
+#include <kernel/rotate.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method) {
+    Array<T> out = createEmptyArray<T>(odims);
+    kernel::rotate<T>(out, in, theta, method, interpOrder(method));
+    return out;
+}
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> rotate(const Array<T> &in, const float theta, \
+                             const af::dim4 &odims,                 \
+                             const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/rotate.cu b/src/backend/cuda/rotate.cu
deleted file mode 100644
index d5efadce1f..0000000000
--- a/src/backend/cuda/rotate.cu
+++ /dev/null
@@ -1,53 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <rotate.hpp>
-#include <kernel/rotate.hpp>
-#include <stdexcept>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                     const af_interp_type method)
-    {
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::rotate<T, AF_INTERP_NEAREST> (out, in, theta);
-                break;
-            case AF_INTERP_BILINEAR:
-                kernel::rotate<T, AF_INTERP_BILINEAR>(out, in, theta);
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-        }
-
-        return out;
-    }
-
-
-#define INSTANTIATE(T)                                                                          \
-    template Array<T> rotate(const Array<T> &in, const float theta,                            \
-                             const af::dim4 &odims, const af_interp_type method); \
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-}
diff --git a/src/backend/cuda/rotate.hpp b/src/backend/cuda/rotate.hpp
index 91f761bdb5..a9e271de04 100644
--- a/src/backend/cuda/rotate.hpp
+++ b/src/backend/cuda/rotate.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/scalar.hpp b/src/backend/cuda/scalar.hpp
index b2bd1606cc..250062b535 100644
--- a/src/backend/cuda/scalar.hpp
+++ b/src/backend/cuda/scalar.hpp
@@ -8,18 +8,30 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <optypes.hpp>
+#include <common/jit/ScalarNode.hpp>
 #include <math.hpp>
-#include <JIT/ScalarNode.hpp>
+#include <optypes.hpp>
+#include <memory>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
-Array<T> createScalarNode(const dim4 &size, const T val)
-{
-    JIT::ScalarNode<T> *node = new JIT::ScalarNode<T>(val);
-    return createNodeArray<T>(size, JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
+Array<T> createScalarNode(const dim4 &size, const T val) {
+#if _MSC_VER > 1914
+    // FIXME(pradeep) - Needed only in CUDA backend, didn't notice any
+    // issues in other backends.
+    // Either keep this guard, or enable extended alignment
+    // by defining _ENABLE_EXTENDED_ALIGNED_STORAGE before <type_traits>
+    // header is included
+    using ScalarNode    = common::ScalarNode<T>;
+    using ScalarNodePtr = std::shared_ptr<ScalarNode>;
+    return createNodeArray<T>(size, ScalarNodePtr(new ScalarNode(val)));
+#else
+    return createNodeArray<T>(size,
+                              std::make_shared<common::ScalarNode<T>>(val));
+#endif
 }
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/scan.cpp b/src/backend/cuda/scan.cpp
new file mode 100644
index 0000000000..cf3f2a0b70
--- /dev/null
+++ b/src/backend/cuda/scan.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <af/dim4.hpp>
+
+#undef _GLIBCXX_USE_INT128
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <scan.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan) {
+    Array<To> out = createEmptyArray<To>(in.dims());
+
+    if (dim == 0) {
+        kernel::scan_first<Ti, To, op>(out, in, inclusive_scan);
+    } else {
+        kernel::scan_dim<Ti, To, op>(out, in, dim, inclusive_scan);
+    }
+
+    return out;
+}
+
+#define INSTANTIATE_SCAN(ROp, Ti, To)                                        \
+    template Array<To> scan<ROp, Ti, To>(const Array<Ti>& in, const int dim, \
+                                         bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_ALL(ROp)           \
+    INSTANTIATE_SCAN(ROp, float, float)     \
+    INSTANTIATE_SCAN(ROp, double, double)   \
+    INSTANTIATE_SCAN(ROp, cfloat, cfloat)   \
+    INSTANTIATE_SCAN(ROp, cdouble, cdouble) \
+    INSTANTIATE_SCAN(ROp, int, int)         \
+    INSTANTIATE_SCAN(ROp, uint, uint)       \
+    INSTANTIATE_SCAN(ROp, intl, intl)       \
+    INSTANTIATE_SCAN(ROp, uintl, uintl)     \
+    INSTANTIATE_SCAN(ROp, char, int)        \
+    INSTANTIATE_SCAN(ROp, char, uint)       \
+    INSTANTIATE_SCAN(ROp, schar, int)       \
+    INSTANTIATE_SCAN(ROp, uchar, uint)      \
+    INSTANTIATE_SCAN(ROp, short, int)       \
+    INSTANTIATE_SCAN(ROp, ushort, uint)
+
+INSTANTIATE_SCAN(af_notzero_t, char, uint)
+INSTANTIATE_SCAN_ALL(af_add_t)
+INSTANTIATE_SCAN_ALL(af_mul_t)
+INSTANTIATE_SCAN_ALL(af_min_t)
+INSTANTIATE_SCAN_ALL(af_max_t)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/scan.cu b/src/backend/cuda/scan.cu
deleted file mode 100644
index 0a29d6a0cd..0000000000
--- a/src/backend/cuda/scan.cu
+++ /dev/null
@@ -1,53 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-
-#undef _GLIBCXX_USE_INT128
-#include <scan.hpp>
-#include <complex>
-#include <kernel/scan_first.hpp>
-#include <kernel/scan_dim.hpp>
-
-namespace cuda
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti> &in, const int dim)
-    {
-        Array<To> out = createEmptyArray<To>(in.dims());
-
-        switch (dim) {
-        case 0: kernel::scan_first<Ti, To, op   >(out, in); break;
-        case 1: kernel::scan_dim  <Ti, To, op, 1>(out, in); break;
-        case 2: kernel::scan_dim  <Ti, To, op, 2>(out, in); break;
-        case 3: kernel::scan_dim  <Ti, To, op, 3>(out, in); break;
-        }
-
-        return out;
-    }
-
-
-#define INSTANTIATE(ROp, Ti, To)                                        \
-    template Array<To> scan<ROp, Ti, To>(const Array<Ti> &in, const int dim); \
-
-    //accum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-    INSTANTIATE(af_notzero_t, char  , uint   )
-}
diff --git a/src/backend/cuda/scan.hpp b/src/backend/cuda/scan.hpp
index 536accd1d3..b26202fba7 100644
--- a/src/backend/cuda/scan.hpp
+++ b/src/backend/cuda/scan.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace cuda
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti>& in, const int dim);
-}
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan = true);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/scan_by_key.cpp b/src/backend/cuda/scan_by_key.cpp
new file mode 100644
index 0000000000..b7d476cc56
--- /dev/null
+++ b/src/backend/cuda/scan_by_key.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <optypes.hpp>
+
+#undef _GLIBCXX_USE_INT128
+#include <kernel/scan_dim_by_key.hpp>
+#include <kernel/scan_first_by_key.hpp>
+#include <scan_by_key.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan) {
+    Array<To> out = createEmptyArray<To>(in.dims());
+
+    if (dim == 0) {
+        kernel::scan_first_by_key<Ti, Tk, To, op>(out, in, key, inclusive_scan);
+    } else {
+        kernel::scan_dim_by_key<Ti, Tk, To, op>(out, in, key, dim,
+                                                inclusive_scan);
+    }
+    return out;
+}
+
+#define INSTANTIATE_SCAN_BY_KEY(ROp, Ti, Tk, To)                  \
+    template Array<To> scan<ROp, Ti, Tk, To>(                     \
+        const Array<Tk>& key, const Array<Ti>& in, const int dim, \
+        bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_BY_KEY_ALL(ROp, Tk)           \
+    INSTANTIATE_SCAN_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_OP(ROp)           \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, int)  \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uint) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, intl) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uintl)
+
+INSTANTIATE_SCAN_OP(af_add_t)
+INSTANTIATE_SCAN_OP(af_mul_t)
+INSTANTIATE_SCAN_OP(af_min_t)
+INSTANTIATE_SCAN_OP(af_max_t)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/scan_by_key.hpp b/src/backend/cuda/scan_by_key.hpp
new file mode 100644
index 0000000000..5b95c75978
--- /dev/null
+++ b/src/backend/cuda/scan_by_key.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/select.cpp b/src/backend/cuda/select.cpp
new file mode 100644
index 0000000000..0b78263efd
--- /dev/null
+++ b/src/backend/cuda/select.cpp
@@ -0,0 +1,137 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <select.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <common/jit/NaryNode.hpp>
+#include <err_cuda.hpp>
+#include <kernel/select.hpp>
+#include <scalar.hpp>
+
+#include <memory>
+
+using arrayfire::common::half;
+using arrayfire::common::NaryNode;
+using arrayfire::common::Node_ptr;
+using std::make_shared;
+using std::max;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b) {
+    kernel::select<T>(out, cond, a, b, out.ndims());
+}
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b) {
+    kernel::select_scalar<T>(out, cond, a, b, out.ndims(), flip);
+}
+
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const af::dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(
+        NaryNode(static_cast<af::dtype>(dtype_traits<T>::af_type), "__select",
+                 3, {{cond_node, a_node, b_node}}, af_select_t, height));
+
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T>(cond, a, b, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const af::dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    Array<T> b       = createScalarNode<T>(odims, b_val);
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(NaryNode(
+        static_cast<af::dtype>(dtype_traits<T>::af_type),
+        (flip ? "__not_select" : "__select"), 3, {{cond_node, a_node, b_node}},
+        flip ? af_not_select_t : af_select_t, height));
+
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T, flip>(cond, a, b_val, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+#define INSTANTIATE(T)                                                   \
+    template Array<T> createSelectNode<T>(                               \
+        const Array<char> &cond, const Array<T> &a, const Array<T> &b,   \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, true>(                         \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, false>(                        \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template void select<T>(Array<T> & out, const Array<char> &cond,     \
+                            const Array<T> &a, const Array<T> &b);       \
+    template void select_scalar<T, true>(Array<T> & out,                 \
+                                         const Array<char> &cond,        \
+                                         const Array<T> &a, const T &b); \
+    template void select_scalar<T, false>(Array<T> & out,                \
+                                          const Array<char> &cond,       \
+                                          const Array<T> &a, const T &b)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(uint);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(char);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(half);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/select.hpp b/src/backend/cuda/select.hpp
new file mode 100644
index 0000000000..530aab097f
--- /dev/null
+++ b/src/backend/cuda/select.hpp
@@ -0,0 +1,31 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b);
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b);
+
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const af::dim4 &odims);
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const af::dim4 &odims);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/set.cu b/src/backend/cuda/set.cu
index b1641658df..d558d6e938 100644
--- a/src/backend/cuda/set.cu
+++ b/src/backend/cuda/set.cu
@@ -7,115 +7,126 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <set.hpp>
 #include <copy.hpp>
+#include <debug_cuda.hpp>
+#include <set.hpp>
 #include <sort.hpp>
-#include <err_cuda.hpp>
+#include <thrust_utils.hpp>
+#include <af/dim4.hpp>
 
 #include <thrust/device_ptr.h>
+#include <thrust/set_operations.h>
 #include <thrust/sort.h>
 #include <thrust/unique.h>
-#include <thrust/set_operations.h>
 
-namespace cuda
-{
-    using af::dim4;
+#include <algorithm>
 
-    template<typename T>
-    Array<T> setUnique(const Array<T> &in,
-                        const bool is_sorted)
-    {
-        Array<T> out = copyArray<T>(in);
+namespace arrayfire {
+namespace cuda {
+using af::dim4;
 
-        thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
-        thrust::device_ptr<T> out_ptr_end = out_ptr + out.dims()[0];
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted) {
+    Array<T> out = copyArray<T>(in);
 
-        if(!is_sorted) thrust::sort(out_ptr, out_ptr_end);
-        thrust::device_ptr<T> out_ptr_last = thrust::unique(out_ptr, out_ptr_end);
+    thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
+    thrust::device_ptr<T> out_ptr_end = out_ptr + out.elements();
 
-        out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
-        return out;
-    }
+    if (!is_sorted) THRUST_SELECT(thrust::sort, out_ptr, out_ptr_end);
+    thrust::device_ptr<T> out_ptr_last;
+    THRUST_SELECT_OUT(out_ptr_last, thrust::unique, out_ptr, out_ptr_end);
+
+    out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
+    return out;
+}
 
-    template<typename T>
-    Array<T> setUnion(const Array<T> &first,
-                       const Array<T> &second,
-                       const bool is_unique)
-    {
-        Array<T> unique_first = first;
-        Array<T> unique_second = second;
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique) {
+    Array<T> unique_first  = first;
+    Array<T> unique_second = second;
 
-        if (!is_unique) {
-            unique_first = setUnique(first, false);
-            unique_second = setUnique(second, false);
-        }
+    if (!is_unique) {
+        unique_first  = setUnique(first, false);
+        unique_second = setUnique(second, false);
+    }
 
-        dim_t out_size = unique_first.dims()[0] + unique_second.dims()[0];
-        Array<T> out = createEmptyArray<T>(dim4(out_size));
+    dim_t out_size = unique_first.elements() + unique_second.elements();
+    Array<T> out   = createEmptyArray<T>(dim4(out_size));
 
-        thrust::device_ptr<T> first_ptr = thrust::device_pointer_cast<T>(unique_first.get());
-        thrust::device_ptr<T> first_ptr_end = first_ptr + unique_first.dims()[0];
+    thrust::device_ptr<T> first_ptr =
+        thrust::device_pointer_cast<T>(unique_first.get());
+    thrust::device_ptr<T> first_ptr_end = first_ptr + unique_first.elements();
 
-        thrust::device_ptr<T> second_ptr = thrust::device_pointer_cast<T>(unique_second.get());
-        thrust::device_ptr<T> second_ptr_end = second_ptr + unique_second.dims()[0];
+    thrust::device_ptr<T> second_ptr =
+        thrust::device_pointer_cast<T>(unique_second.get());
+    thrust::device_ptr<T> second_ptr_end =
+        second_ptr + unique_second.elements();
 
-        thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
+    thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
 
-        thrust::device_ptr<T> out_ptr_last = thrust::set_union(first_ptr, first_ptr_end,
-                                                               second_ptr, second_ptr_end,
-                                                               out_ptr);
+    thrust::device_ptr<T> out_ptr_last;
+    THRUST_SELECT_OUT(out_ptr_last, thrust::set_union, first_ptr, first_ptr_end,
+                      second_ptr, second_ptr_end, out_ptr);
 
-        out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
+    out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
 
-        return out;
-    }
+    return out;
+}
 
-    template<typename T>
-    Array<T> setIntersect(const Array<T> &first,
-                           const Array<T> &second,
-                           const bool is_unique)
-    {
-        Array<T> unique_first = first;
-        Array<T> unique_second = second;
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique) {
+    Array<T> unique_first  = first;
+    Array<T> unique_second = second;
 
-        if (!is_unique) {
-            unique_first = setUnique(first, false);
-            unique_second = setUnique(second, false);
-        }
+    if (!is_unique) {
+        unique_first  = setUnique(first, false);
+        unique_second = setUnique(second, false);
+    }
 
-        dim_t out_size = std::max(unique_first.dims()[0], unique_second.dims()[0]);
-        Array<T> out = createEmptyArray<T>(dim4(out_size));
+    dim_t out_size =
+        std::max(unique_first.elements(), unique_second.elements());
+    Array<T> out = createEmptyArray<T>(dim4(out_size));
 
-        thrust::device_ptr<T> first_ptr = thrust::device_pointer_cast<T>(unique_first.get());
-        thrust::device_ptr<T> first_ptr_end = first_ptr + unique_first.dims()[0];
+    thrust::device_ptr<T> first_ptr =
+        thrust::device_pointer_cast<T>(unique_first.get());
+    thrust::device_ptr<T> first_ptr_end = first_ptr + unique_first.elements();
 
-        thrust::device_ptr<T> second_ptr = thrust::device_pointer_cast<T>(unique_second.get());
-        thrust::device_ptr<T> second_ptr_end = second_ptr + unique_second.dims()[0];
+    thrust::device_ptr<T> second_ptr =
+        thrust::device_pointer_cast<T>(unique_second.get());
+    thrust::device_ptr<T> second_ptr_end =
+        second_ptr + unique_second.elements();
 
-        thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
+    thrust::device_ptr<T> out_ptr = thrust::device_pointer_cast<T>(out.get());
 
-        thrust::device_ptr<T> out_ptr_last = thrust::set_intersection(first_ptr, first_ptr_end,
-                                                                      second_ptr, second_ptr_end,
-                                                                      out_ptr);
+    thrust::device_ptr<T> out_ptr_last;
+    THRUST_SELECT_OUT(out_ptr_last, thrust::set_intersection, first_ptr,
+                      first_ptr_end, second_ptr, second_ptr_end, out_ptr);
 
-        out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
+    out.resetDims(dim4(thrust::distance(out_ptr, out_ptr_last)));
 
-        return out;
-    }
+    return out;
+}
 
-#define INSTANTIATE(T)                                                  \
+#define INSTANTIATE(T)                                                        \
     template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
-    template Array<T> setUnion<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-    template Array<T> setIntersect<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-}
+    template Array<T> setUnion<T>(                                            \
+        const Array<T> &first, const Array<T> &second, const bool is_unique); \
+    template Array<T> setIntersect<T>(                                        \
+        const Array<T> &first, const Array<T> &second, const bool is_unique);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/set.hpp b/src/backend/cuda/set.hpp
index 01f048bf07..872599ad40 100644
--- a/src/backend/cuda/set.hpp
+++ b/src/backend/cuda/set.hpp
@@ -7,19 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T> Array<T> setUnique(const Array<T> &in,
-                                            const bool is_sorted);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted);
 
-    template<typename T> Array<T> setUnion(const Array<T> &first,
-                                           const Array<T> &second,
-                                           const bool is_unique);
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique);
 
-    template<typename T> Array<T> setIntersect(const Array<T> &first,
-                                               const Array<T> &second,
-                                               const bool is_unique);
-}
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/shift.cpp b/src/backend/cuda/shift.cpp
new file mode 100644
index 0000000000..f073d3c844
--- /dev/null
+++ b/src/backend/cuda/shift.cpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/jit/ShiftNodeBase.hpp>
+#include <err_cuda.hpp>
+#include <jit/BufferNode.hpp>
+#include <jit/ShiftNode.hpp>
+#include <shift.hpp>
+
+#include <memory>
+
+using af::dim4;
+
+using arrayfire::common::Node_ptr;
+using arrayfire::cuda::jit::BufferNode;
+using arrayfire::cuda::jit::ShiftNode;
+
+using std::array;
+using std::make_shared;
+using std::static_pointer_cast;
+using std::string;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]) {
+    // Shift should only be the first node in the JIT tree.
+    // Force input to be evaluated so that in is always a buffer.
+    in.eval();
+
+    string name_str("Sh");
+    name_str += shortname<T>(true);
+    const dim4 &iDims = in.dims();
+    dim4 oDims        = iDims;
+
+    array<int, 4> shifts{};
+    for (int i = 0; i < 4; i++) {
+        // shifts[i] will always be non-negative and lie in [0, oDims[i]].
+        // Negative shifts are converted to positive ones by going the other
+        // way round.
+        shifts[i] = -(sdims[i] % static_cast<int>(oDims[i])) +
+                    oDims[i] * (sdims[i] > 0);
+        assert(shifts[i] >= 0 && shifts[i] <= oDims[i]);
+    }
+
+    auto node = make_shared<ShiftNode<T>>(
+        static_cast<af::dtype>(af::dtype_traits<T>::af_type),
+        static_pointer_cast<BufferNode<T>>(in.getNode()), shifts);
+    return createNodeArray<T>(oDims, Node_ptr(node));
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/shift.cu b/src/backend/cuda/shift.cu
deleted file mode 100644
index 2d59b39bd0..0000000000
--- a/src/backend/cuda/shift.cu
+++ /dev/null
@@ -1,42 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <shift.hpp>
-#include <kernel/shift.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4])
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        kernel::shift<T>(out, in, sdims);
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                  \
-    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-}
diff --git a/src/backend/cuda/shift.hpp b/src/backend/cuda/shift.hpp
index b8b4377eed..68c4ccd9bf 100644
--- a/src/backend/cuda/shift.hpp
+++ b/src/backend/cuda/shift.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4]);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sift.cu b/src/backend/cuda/sift.cu
new file mode 100644
index 0000000000..dbfb46a63b
--- /dev/null
+++ b/src/backend/cuda/sift.cu
@@ -0,0 +1,75 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sift.hpp>
+
+#include <kernel/sift.hpp>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH) {
+    unsigned nfeat_out;
+    unsigned desc_len;
+    float* x_out;
+    float* y_out;
+    float* score_out;
+    float* orientation_out;
+    float* size_out;
+    float* desc_out;
+
+    kernel::sift<T, convAccT>(
+        &nfeat_out, &desc_len, &x_out, &y_out, &score_out, &orientation_out,
+        &size_out, &desc_out, in, n_layers, contrast_thr, edge_thr, init_sigma,
+        double_input, img_scale, feature_ratio, compute_GLOH);
+
+    if (nfeat_out > 0) {
+        if (x_out == NULL || y_out == NULL || score_out == NULL ||
+            orientation_out == NULL || size_out == NULL || desc_out == NULL) {
+            AF_ERROR("sift: feature array is null.", AF_ERR_SIZE);
+        }
+
+        const dim4 feat_dims(nfeat_out);
+        const dim4 desc_dims(desc_len, nfeat_out);
+
+        x     = createDeviceDataArray<float>(feat_dims, x_out);
+        y     = createDeviceDataArray<float>(feat_dims, y_out);
+        score = createDeviceDataArray<float>(feat_dims, score_out);
+        ori   = createDeviceDataArray<float>(feat_dims, orientation_out);
+        size  = createDeviceDataArray<float>(feat_dims, size_out);
+        desc  = createDeviceDataArray<float>(desc_dims, desc_out);
+    }
+
+    return nfeat_out;
+}
+
+#define INSTANTIATE(T, convAccT)                                               \
+    template unsigned sift<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,              \
+        Array<float> & ori, Array<float> & size, Array<float> & desc,          \
+        const Array<T>& in, const unsigned n_layers, const float contrast_thr, \
+        const float edge_thr, const float init_sigma, const bool double_input, \
+        const float img_scale, const float feature_ratio,                      \
+        const bool compute_GLOH);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sift.hpp b/src/backend/cuda/sift.hpp
new file mode 100644
index 0000000000..a177c345ae
--- /dev/null
+++ b/src/backend/cuda/sift.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sobel.cpp b/src/backend/cuda/sobel.cpp
new file mode 100644
index 0000000000..1861d0c76c
--- /dev/null
+++ b/src/backend/cuda/sobel.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/sobel.hpp>
+#include <sobel.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename Ti, typename To>
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size) {
+    Array<To> dx = createEmptyArray<To>(img.dims());
+    Array<To> dy = createEmptyArray<To>(img.dims());
+
+    kernel::sobel<Ti, To>(dx, dy, img, ker_size);
+
+    return std::make_pair(dx, dy);
+}
+
+#define INSTANTIATE(Ti, To)                                    \
+    template std::pair<Array<To>, Array<To>> sobelDerivatives( \
+        const Array<Ti> &img, const unsigned &ker_size);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, int)
+INSTANTIATE(char, int)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, int)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, int)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sobel.cu b/src/backend/cuda/sobel.cu
deleted file mode 100644
index 6f9b1948c6..0000000000
--- a/src/backend/cuda/sobel.cu
+++ /dev/null
@@ -1,46 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <sobel.hpp>
-#include <kernel/sobel.hpp>
-#include <err_cuda.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size)
-{
-    Array<To> dx = createEmptyArray<To>(img.dims());
-    Array<To> dy = createEmptyArray<To>(img.dims());
-
-    kernel::sobel<Ti, To>(dx, dy, img, ker_size);
-
-    return std::make_pair(dx, dy);
-}
-
-#define INSTANTIATE(Ti, To)                                             \
-    template std::pair< Array<To>, Array<To> >                          \
-    sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
-
-INSTANTIATE(float , float)
-INSTANTIATE(double, double)
-INSTANTIATE(int   , int)
-INSTANTIATE(uint  , int)
-INSTANTIATE(char  , int)
-INSTANTIATE(uchar , int)
-
-}
diff --git a/src/backend/cuda/sobel.hpp b/src/backend/cuda/sobel.hpp
index 096930f4c9..f566459138 100644
--- a/src/backend/cuda/sobel.hpp
+++ b/src/backend/cuda/sobel.hpp
@@ -10,11 +10,12 @@
 #include <Array.hpp>
 #include <utility>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/solve.cu b/src/backend/cuda/solve.cu
index 7077c1fbc3..568e44b136 100644
--- a/src/backend/cuda/solve.cu
+++ b/src/backend/cuda/solve.cu
@@ -7,35 +7,84 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <err_common.hpp>
 #include <solve.hpp>
 
-#if defined(WITH_CUDA_LINEAR_ALGEBRA)
-
-#include <cusolverDnManager.hpp>
-#include <cublas_v2.h>
+#include <blas.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <cublas.hpp>
+#include <cusolverDn.hpp>
+#include <err_cuda.hpp>
 #include <identity.hpp>
-#include <iostream>
+#include <lu.hpp>
+#include <math.hpp>
 #include <memory.hpp>
-#include <copy.hpp>
+#include <platform.hpp>
+#include <qr.hpp>
 #include <transpose.hpp>
 
-#include <math.hpp>
-#include <err_common.hpp>
+namespace arrayfire {
+namespace cuda {
+
+// cublasStatus_t cublas<>getrsBatched( cublasHandle_t handle,
+//                                      cublasOperation_t trans,
+//                                      int n,
+//                                      int nrhs,
+//                                      const <> *Aarray[],
+//                                      int lda,
+//                                      const int *devIpiv,
+//                                      <> *Barray[],
+//                                      int ldb,
+//                                      int *info,
+//                                      int batchSize);
 
-#include <blas.hpp>
-#include <lu.hpp>
-#include <qr.hpp>
+template<typename T>
+struct getrsBatched_func_def_t {
+    typedef cublasStatus_t (*getrsBatched_func_def)(cublasHandle_t,
+                                                    cublasOperation_t, int, int,
+                                                    const T **, int,
+                                                    const int *, T **, int,
+                                                    int *, int);
+};
 
-#include <handle.hpp>
-#include <cstdio>
+// cublasStatus_t cublas<>getrfBatched(cublasHandle_t handle,
+//                                     int n,
+//                                     <> *A[],
+//                                     int lda,
+//                                     int *P,
+//                                     int *info,
+//                                     int batchSize);
+
+template<typename T>
+struct getrfBatched_func_def_t {
+    typedef cublasStatus_t (*getrfBatched_func_def)(cublasHandle_t, int, T **,
+                                                    int, int *, int *, int);
+};
+
+#define SOLVE_BATCH_FUNC_DEF(FUNC) \
+    template<typename T>           \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func();
+
+#define SOLVE_BATCH_FUNC(FUNC, TYPE, PREFIX)                                \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cublas##PREFIX##FUNC;                                        \
+    }
 
-namespace cuda
-{
+SOLVE_BATCH_FUNC_DEF(getrfBatched)
+SOLVE_BATCH_FUNC(getrfBatched, float, S)
+SOLVE_BATCH_FUNC(getrfBatched, double, D)
+SOLVE_BATCH_FUNC(getrfBatched, cfloat, C)
+SOLVE_BATCH_FUNC(getrfBatched, cdouble, Z)
 
-using cusolver::getDnHandle;
+SOLVE_BATCH_FUNC_DEF(getrsBatched)
+SOLVE_BATCH_FUNC(getrsBatched, float, S)
+SOLVE_BATCH_FUNC(getrsBatched, double, D)
+SOLVE_BATCH_FUNC(getrsBatched, cfloat, C)
+SOLVE_BATCH_FUNC(getrsBatched, cdouble, Z)
 
-//cusolverStatus_t cusolverDn<>getrs(
+// cusolverStatus_t cusolverDn<>getrs(
 //    cusolverDnHandle_t handle,
 //    cublasOperation_t trans,
 //    int n, int nrhs,
@@ -45,41 +94,38 @@ using cusolver::getDnHandle;
 //    int *devInfo );
 
 template<typename T>
-struct getrs_func_def_t
-{
-    typedef cusolverStatus_t (*getrs_func_def) (
-                              cusolverDnHandle_t,
-                              cublasOperation_t,
-                              int, int,
-                              const T *, int,
-                              const int *,
-                              T *, int,
-                              int *);
+struct getrs_func_def_t {
+    typedef cusolverStatus_t (*getrs_func_def)(cusolverDnHandle_t,
+                                               cublasOperation_t, int, int,
+                                               const T *, int, const int *, T *,
+                                               int, int *);
 };
 
-#define SOLVE_FUNC_DEF( FUNC )                                                  \
-template<typename T>                                                            \
-typename FUNC##_func_def_t<T>::FUNC##_func_def                                  \
-FUNC##_func();
+#define SOLVE_FUNC_DEF(FUNC) \
+    template<typename T>     \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func();
 
-#define SOLVE_FUNC( FUNC, TYPE, PREFIX )                                                    \
-template<> typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>()            \
-{ return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)&cusolverDn##PREFIX##FUNC; }             \
+#define SOLVE_FUNC(FUNC, TYPE, PREFIX)                                      \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               cusolverDn##PREFIX##FUNC;                                    \
+    }
 
-SOLVE_FUNC_DEF( getrs )
-SOLVE_FUNC(getrs , float  , S)
-SOLVE_FUNC(getrs , double , D)
-SOLVE_FUNC(getrs , cfloat , C)
-SOLVE_FUNC(getrs , cdouble, Z)
+SOLVE_FUNC_DEF(getrs)
+SOLVE_FUNC(getrs, float, S)
+SOLVE_FUNC(getrs, double, D)
+SOLVE_FUNC(getrs, cfloat, C)
+SOLVE_FUNC(getrs, cdouble, Z)
 
-//cusolverStatus_t cusolverDn<>geqrf_bufferSize(
+// cusolverStatus_t cusolverDn<>geqrf_bufferSize(
 //        cusolverDnHandle_t handle,
 //        int m, int n,
 //        <> *A,
 //        int lda,
 //        int *Lwork );
 //
-//cusolverStatus_t cusolverDn<>geqrf(
+// cusolverStatus_t cusolverDn<>geqrf(
 //        cusolverDnHandle_t handle,
 //        int m, int n,
 //        <> *A, int lda,
@@ -87,7 +133,7 @@ SOLVE_FUNC(getrs , cdouble, Z)
 //        <> *Workspace,
 //        int Lwork, int *devInfo );
 //
-//cusolverStatus_t cusolverDn<>mqr(
+// cusolverStatus_t cusolverDn<>mqr(
 //        cusolverDnHandle_t handle,
 //        cublasSideMode_t side, cublasOperation_t trans,
 //        int m, int n, int k,
@@ -98,132 +144,222 @@ SOLVE_FUNC(getrs , cdouble, Z)
 //        int lwork, int *devInfo);
 
 template<typename T>
-struct geqrf_solve_func_def_t
-{
-    typedef cusolverStatus_t (*geqrf_solve_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int,
-                              T *,
-                              T *,
-                              int, int *);
+struct geqrf_solve_func_def_t {
+    typedef cusolverStatus_t (*geqrf_solve_func_def)(cusolverDnHandle_t, int,
+                                                     int, T *, int, T *, T *,
+                                                     int, int *);
 };
 
 template<typename T>
-struct geqrf_solve_buf_func_def_t
-{
-    typedef cusolverStatus_t (*geqrf_solve_buf_func_def) (
-                              cusolverDnHandle_t, int, int,
-                              T *, int, int *);
+struct geqrf_solve_buf_func_def_t {
+    typedef cusolverStatus_t (*geqrf_solve_buf_func_def)(cusolverDnHandle_t,
+                                                         int, int, T *, int,
+                                                         int *);
 };
 
 template<typename T>
-struct mqr_solve_func_def_t
-{
-    typedef cusolverStatus_t (*mqr_solve_func_def) (
-                              cusolverDnHandle_t,
-                              cublasSideMode_t, cublasOperation_t,
-                              int, int, int,
-                              const T *, int,
-                              const T *,
-                              T *, int,
-                              T *, int,
-                              int *);
+struct mqr_solve_func_def_t {
+    typedef cusolverStatus_t (*mqr_solve_func_def)(
+        cusolverDnHandle_t, cublasSideMode_t, cublasOperation_t, int, int, int,
+        const T *, int, const T *, T *, int, T *, int, int *);
 };
 
-#define QR_FUNC_DEF( FUNC )                                                     \
-template<typename T>                                                            \
-static typename FUNC##_solve_func_def_t<T>::FUNC##_solve_func_def               \
-FUNC##_solve_func();                                                            \
+template<typename T>
+struct mqr_solve_buf_func_def_t {
+    typedef cusolverStatus_t (*mqr_solve_buf_func_def)(
+        cusolverDnHandle_t, cublasSideMode_t, cublasOperation_t, int, int, int,
+        const T *, int, const T *, T *, int, int *);
+};
+
+#define QR_FUNC_DEF(FUNC)                                                     \
+    template<typename T>                                                      \
+    static typename FUNC##_solve_func_def_t<T>::FUNC##_solve_func_def         \
+        FUNC##_solve_func();                                                  \
+                                                                              \
+    template<typename T>                                                      \
+    static typename FUNC##_solve_buf_func_def_t<T>::FUNC##_solve_buf_func_def \
+        FUNC##_solve_buf_func();
+
+#define QR_FUNC(FUNC, TYPE, PREFIX)                                       \
+    template<>                                                            \
+    typename FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def         \
+        FUNC##_solve_func<TYPE>() {                                       \
+        return (FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def) &   \
+               cusolverDn##PREFIX##FUNC;                                  \
+    }                                                                     \
+                                                                          \
+    template<>                                                            \
+    typename FUNC##_solve_buf_func_def_t<TYPE>::FUNC##_solve_buf_func_def \
+        FUNC##_solve_buf_func<TYPE>() {                                   \
+        return (FUNC##_solve_buf_func_def_t<                              \
+                   TYPE>::FUNC##_solve_buf_func_def) &                    \
+               cusolverDn##PREFIX##FUNC##_bufferSize;                     \
+    }
+
+QR_FUNC_DEF(geqrf)
+QR_FUNC(geqrf, float, S)
+QR_FUNC(geqrf, double, D)
+QR_FUNC(geqrf, cfloat, C)
+QR_FUNC(geqrf, cdouble, Z)
+
+#define MQR_FUNC_DEF(FUNC)                                                    \
+    template<typename T>                                                      \
+    static typename FUNC##_solve_func_def_t<T>::FUNC##_solve_func_def         \
+        FUNC##_solve_func();                                                  \
+                                                                              \
+    template<typename T>                                                      \
+    static typename FUNC##_solve_buf_func_def_t<T>::FUNC##_solve_buf_func_def \
+        FUNC##_solve_buf_func();
+
+#define MQR_FUNC(FUNC, TYPE, PREFIX)                                            \
+    template<>                                                                  \
+    typename FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def               \
+        FUNC##_solve_func<TYPE>() {                                             \
+        return (FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def) &         \
+               cusolverDn##PREFIX;                                              \
+    }                                                                           \
                                                                                 \
-template<typename T>                                                            \
-static typename FUNC##_solve_buf_func_def_t<T>::FUNC##_solve_buf_func_def       \
-FUNC##_solve_buf_func();                                                        \
-
-#define QR_FUNC( FUNC, TYPE, PREFIX )                                                                               \
-template<> typename FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def FUNC##_solve_func<TYPE>()                  \
-{ return (FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def)&cusolverDn##PREFIX##FUNC; }                         \
-                                                                                                                    \
-template<> typename FUNC##_solve_buf_func_def_t<TYPE>::FUNC##_solve_buf_func_def FUNC##_solve_buf_func<TYPE>()      \
-{ return (FUNC##_solve_buf_func_def_t<TYPE>::FUNC##_solve_buf_func_def)& cusolverDn##PREFIX##FUNC##_bufferSize; }
-
-QR_FUNC_DEF( geqrf )
-QR_FUNC(geqrf , float  , S)
-QR_FUNC(geqrf , double , D)
-QR_FUNC(geqrf , cfloat , C)
-QR_FUNC(geqrf , cdouble, Z)
-
-#define MQR_FUNC_DEF( FUNC )                                                            \
-template<typename T>                                                                    \
-static typename FUNC##_solve_func_def_t<T>::FUNC##_solve_func_def                       \
-FUNC##_solve_func();
-
-#define MQR_FUNC( FUNC, TYPE, PREFIX )                                                  \
-template<> typename FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def                \
-FUNC##_solve_func<TYPE>()                                                               \
-{ return (FUNC##_solve_func_def_t<TYPE>::FUNC##_solve_func_def)&cusolverDn##PREFIX; }   \
-
-MQR_FUNC_DEF( mqr )
-MQR_FUNC(mqr , float  , Sormqr)
-MQR_FUNC(mqr , double , Dormqr)
-MQR_FUNC(mqr , cfloat , Cunmqr)
-MQR_FUNC(mqr , cdouble, Zunmqr)
+    template<>                                                                  \
+    typename FUNC##_solve_buf_func_def_t<TYPE>::FUNC##_solve_buf_func_def       \
+        FUNC##_solve_buf_func<TYPE>() {                                         \
+        return (FUNC##_solve_buf_func_def_t<TYPE>::FUNC##_solve_buf_func_def) & \
+               cusolverDn##PREFIX##_bufferSize;                                 \
+    }
+
+MQR_FUNC_DEF(mqr)
+MQR_FUNC(mqr, float, Sormqr)
+MQR_FUNC(mqr, double, Dormqr)
+MQR_FUNC(mqr, cfloat, Cunmqr)
+MQR_FUNC(mqr, cdouble, Zunmqr)
 
 template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    int N = A.dims()[0];
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    UNUSED(options);
+    int N    = A.dims()[0];
     int NRHS = b.dims()[1];
 
-    Array< T > B = copyArray<T>(b);
+    Array<T> B = copyArray<T>(b);
+
+    auto info = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(getrs_func<T>()(solverDnHandle(), CUBLAS_OP_N, N, NRHS,
+                                   A.get(), A.strides()[1], pivot.get(),
+                                   B.get(), B.strides()[1], info.get()));
+
+    return B;
+}
+
+template<typename T>
+Array<T> generalSolveBatched(const Array<T> &a, const Array<T> &b) {
+    Array<T> A = copyArray<T>(a);
+    Array<T> B = copyArray<T>(b);
+
+    dim4 aDims = a.dims();
+    int M      = aDims[0];
+    int N      = aDims[1];
+    int NRHS   = b.dims()[1];
+
+    if (M != N) {
+        AF_ERROR("Batched solve requires square matrices", AF_ERR_ARG);
+    }
+
+    int batchz = aDims[2];
+    int batchw = aDims[3];
+    int batch  = batchz * batchw;
+
+    size_t bytes         = batch * sizeof(T *);
+    using unique_mem_ptr = std::unique_ptr<char, void (*)(void *)>;
+
+    unique_mem_ptr aBatched_host_mem(pinnedAlloc<char>(bytes),
+                                     pinnedFree);
+    unique_mem_ptr bBatched_host_mem(pinnedAlloc<char>(bytes),
+                                     pinnedFree);
+
+    T *a_ptr               = A.get();
+    T *b_ptr               = B.get();
+    T **aBatched_host_ptrs = (T **)aBatched_host_mem.get();
+    T **bBatched_host_ptrs = (T **)bBatched_host_mem.get();
+
+    for (int i = 0; i < batchw; i++) {
+        for (int j = 0; j < batchz; j++) {
+            aBatched_host_ptrs[i * batchz + j] =
+                a_ptr + j * A.strides()[2] + i * A.strides()[3];
+            bBatched_host_ptrs[i * batchz + j] =
+                b_ptr + j * B.strides()[2] + i * B.strides()[3];
+        }
+    }
+
+    unique_mem_ptr aBatched_device_mem(pinnedAlloc<char>(bytes), pinnedFree);
+    unique_mem_ptr bBatched_device_mem(pinnedAlloc<char>(bytes), pinnedFree);
+
+    T **aBatched_device_ptrs = (T **)aBatched_device_mem.get();
+    T **bBatched_device_ptrs = (T **)bBatched_device_mem.get();
+
+    CUDA_CHECK(cudaMemcpyAsync(aBatched_device_ptrs, aBatched_host_ptrs, bytes,
+                               cudaMemcpyHostToDevice,
+                               getStream(getActiveDeviceId())));
 
-    int *info = memAlloc<int>(1);
+    // Perform batched LU
+    // getrf requires pivot and info to be device pointers
+    Array<int> pivots = createEmptyArray<int>(af::dim4(N, batch, 1, 1));
+    Array<int> info   = createEmptyArray<int>(af::dim4(batch, 1, 1, 1));
 
-    CUSOLVER_CHECK(getrs_func<T>()(getDnHandle(),
-                                   CUBLAS_OP_N,
-                                   N, NRHS,
-                                   A.get(), A.strides()[1],
-                                   pivot.get(),
-                                   B.get(), B.strides()[1],
-                                   info));
+    CUBLAS_CHECK(getrfBatched_func<T>()(blasHandle(), N, aBatched_device_ptrs,
+                                        A.strides()[1], pivots.get(),
+                                        info.get(), batch));
 
-    memFree(info);
+    CUDA_CHECK(cudaMemcpyAsync(bBatched_device_ptrs, bBatched_host_ptrs, bytes,
+                               cudaMemcpyHostToDevice,
+                               getStream(getActiveDeviceId())));
+
+    // getrs requires info to be host pointer
+    unique_mem_ptr info_host_mem(pinnedAlloc<char>(batch * sizeof(int)),
+                                 pinnedFree);
+    CUBLAS_CHECK(getrsBatched_func<T>()(
+        blasHandle(), CUBLAS_OP_N, N, NRHS, (const T **)aBatched_device_ptrs,
+        A.strides()[1], pivots.get(), bBatched_device_ptrs, B.strides()[1],
+        (int *)info_host_mem.get(), batch));
     return B;
 }
 
 template<typename T>
-Array<T> generalSolve(const Array<T> &a, const Array<T> &b)
-{
+Array<T> generalSolve(const Array<T> &a, const Array<T> &b) {
+    if (a.dims()[2] > 1 || a.dims()[3] > 1) {
+        return generalSolveBatched(a, b);
+    }
+
     int M = a.dims()[0];
     int N = a.dims()[1];
     int K = b.dims()[1];
 
-    Array<T> A = copyArray<T>(a);
-    Array<T> B = copyArray<T>(b);
+    Array<T> A       = copyArray<T>(a);
+    Array<T> B       = copyArray<T>(b);
     Array<int> pivot = lu_inplace(A, false);
 
-    int *info = memAlloc<int>(1);
+    auto info = memAlloc<int>(1);
 
-    CUSOLVER_CHECK(getrs_func<T>()(getDnHandle(),
-                                   CUBLAS_OP_N,
-                                   N, K,
-                                   A.get(), A.strides()[1],
-                                   pivot.get(),
-                                   B.get(), B.strides()[1],
-                                   info));
-    memFree(info);
+    CUSOLVER_CHECK(getrs_func<T>()(solverDnHandle(), CUBLAS_OP_N, N, K, A.get(),
+                                   A.strides()[1], pivot.get(), B.get(),
+                                   B.strides()[1], info.get()));
     return B;
 }
 
 template<typename T>
-cublasOperation_t trans() { return CUBLAS_OP_T; }
-template<> cublasOperation_t trans<cfloat>() { return CUBLAS_OP_C; }
-template<> cublasOperation_t trans<cdouble>() { return CUBLAS_OP_C; }
-
+cublasOperation_t trans() {
+    return CUBLAS_OP_T;
+}
+template<>
+cublasOperation_t trans<cfloat>() {
+    return CUBLAS_OP_C;
+}
+template<>
+cublasOperation_t trans<cdouble>() {
+    return CUBLAS_OP_C;
+}
 
 template<typename T>
-Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
-{
+Array<T> leastSquares(const Array<T> &a, const Array<T> &b) {
     int M = a.dims()[0];
     int N = a.dims()[1];
     int K = b.dims()[1];
@@ -231,6 +367,7 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
     Array<T> B = createEmptyArray<T>(dim4());
 
     if (M < N) {
+        const dim4 NullShape(0, 0, 0, 0);
 
         // Least squres for this case is solved using the following
         // solve(A, B) == matmul(Q, Xpad);
@@ -242,27 +379,26 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
 
         // QR is performed on the transpose of A
         Array<T> A = transpose<T>(a, true);
-        B = padArray<T, T>(b, dim4(N, K), scalar<T>(0));
+        dim4 endPadding(N - b.dims()[0], K - b.dims()[1], 0, 0);
+        B = (endPadding == NullShape
+                 ? copyArray(b)
+                 : padArrayBorders(b, NullShape, endPadding, AF_PAD_ZERO));
 
         int lwork = 0;
 
         // Get workspace needed for QR
-        CUSOLVER_CHECK(geqrf_solve_buf_func<T>()(getDnHandle(),
-                                                 A.dims()[0], A.dims()[1],
-                                                 A.get(), A.strides()[1],
-                                                 &lwork));
+        CUSOLVER_CHECK(geqrf_solve_buf_func<T>()(solverDnHandle(), A.dims()[0],
+                                                 A.dims()[1], A.get(),
+                                                 A.strides()[1], &lwork));
 
-        T *workspace = memAlloc<T>(lwork);
-        Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
-        int *info = memAlloc<int>(1);
+        auto workspace = memAlloc<T>(lwork);
+        Array<T> t     = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+        auto info      = memAlloc<int>(1);
 
         // In place Perform in place QR
-        CUSOLVER_CHECK(geqrf_solve_func<T>()(getDnHandle(),
-                                             A.dims()[0], A.dims()[1],
-                                             A.get(), A.strides()[1],
-                                             t.get(),
-                                             workspace, lwork,
-                                             info));
+        CUSOLVER_CHECK(geqrf_solve_func<T>()(
+            solverDnHandle(), A.dims()[0], A.dims()[1], A.get(), A.strides()[1],
+            t.get(), workspace.get(), lwork, info.get()));
 
         // R1 = R(seq(M), seq(M));
         A.resetDims(dim4(M, M));
@@ -275,22 +411,19 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
         B.resetDims(dim4(N, K));
 
         // matmul(Q, Bpad)
-        CUSOLVER_CHECK(mqr_solve_func<T>()(getDnHandle(),
-                                           CUBLAS_SIDE_LEFT, CUBLAS_OP_N,
-                                           B.dims()[0],
-                                           B.dims()[1],
-                                           A.dims()[0],
-                                           A.get(), A.strides()[1],
-                                           t.get(),
-                                           B.get(), B.strides()[1],
-                                           workspace, lwork,
-                                           info));
-
-        memFree(workspace);
-        memFree(info);
+        CUSOLVER_CHECK(mqr_solve_buf_func<T>()(
+            solverDnHandle(), CUBLAS_SIDE_LEFT, CUBLAS_OP_N, B.dims()[0],
+            B.dims()[1], A.dims()[0], A.get(), A.strides()[1], t.get(), B.get(),
+            B.strides()[1], &lwork));
+
+        workspace = memAlloc<T>(lwork);
+
+        CUSOLVER_CHECK(mqr_solve_func<T>()(
+            solverDnHandle(), CUBLAS_SIDE_LEFT, CUBLAS_OP_N, B.dims()[0],
+            B.dims()[1], A.dims()[0], A.get(), A.strides()[1], t.get(), B.get(),
+            B.strides()[1], workspace.get(), lwork, info.get()));
 
     } else if (M > N) {
-
         // Least squres for this case is solved using the following
         // solve(A, B) == tri_solve(R1, Bt);
         // Where:
@@ -300,119 +433,81 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
         // A  == matmul(Q, R);
 
         Array<T> A = copyArray<T>(a);
-        B = copyArray(b);
+        B          = copyArray(b);
 
         int lwork = 0;
 
         // Get workspace needed for QR
-        CUSOLVER_CHECK(geqrf_solve_buf_func<T>()(getDnHandle(),
-                                                 A.dims()[0], A.dims()[1],
-                                                 A.get(), A.strides()[1],
-                                                 &lwork));
+        CUSOLVER_CHECK(geqrf_solve_buf_func<T>()(solverDnHandle(), A.dims()[0],
+                                                 A.dims()[1], A.get(),
+                                                 A.strides()[1], &lwork));
 
-        T *workspace = memAlloc<T>(lwork);
-        Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
-        int *info = memAlloc<int>(1);
+        auto workspace = memAlloc<T>(lwork);
+        Array<T> t     = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+        auto info      = memAlloc<int>(1);
 
         // In place Perform in place QR
-        CUSOLVER_CHECK(geqrf_solve_func<T>()(getDnHandle(),
-                                             A.dims()[0], A.dims()[1],
-                                             A.get(), A.strides()[1],
-                                             t.get(),
-                                             workspace, lwork,
-                                             info));
+        CUSOLVER_CHECK(geqrf_solve_func<T>()(
+            solverDnHandle(), A.dims()[0], A.dims()[1], A.get(), A.strides()[1],
+            t.get(), workspace.get(), lwork, info.get()));
 
         // matmul(Q1, B)
-        CUSOLVER_CHECK(mqr_solve_func<T>()(getDnHandle(),
-                                           CUBLAS_SIDE_LEFT,
-                                           trans<T>(),
-                                           M, K, N,
-                                           A.get(), A.strides()[1],
-                                           t.get(),
-                                           B.get(), B.strides()[1],
-                                           workspace, lwork,
-                                           info));
+        CUSOLVER_CHECK(mqr_solve_buf_func<T>()(
+            solverDnHandle(), CUBLAS_SIDE_LEFT, trans<T>(), M, K, N, A.get(),
+            A.strides()[1], t.get(), B.get(), B.strides()[1], &lwork));
+
+        workspace = memAlloc<T>(lwork);
+
+        CUSOLVER_CHECK(mqr_solve_func<T>()(
+            solverDnHandle(), CUBLAS_SIDE_LEFT, trans<T>(), M, K, N, A.get(),
+            A.strides()[1], t.get(), B.get(), B.strides()[1], workspace.get(),
+            lwork, info.get()));
 
         // tri_solve(R1, Bt)
         A.resetDims(dim4(N, N));
         B.resetDims(dim4(N, K));
         trsm(A, B, AF_MAT_NONE, true, true, false);
-
-        memFree(workspace);
-        memFree(info);
     }
     return B;
 }
 
 template<typename T>
-Array<T> triangleSolve(const Array<T> &A, const Array<T> &b, const af_mat_prop options)
-{
+Array<T> triangleSolve(const Array<T> &A, const Array<T> &b,
+                       const af_mat_prop options) {
     Array<T> B = copyArray<T>(b);
     trsm(A, B,
-         AF_MAT_NONE, // transpose flag
+         AF_MAT_NONE,  // transpose flag
          options & AF_MAT_UPPER ? true : false,
-         true, // is_left
+         true,  // is_left
          options & AF_MAT_DIAG_UNIT ? true : false);
     return B;
 }
 
 template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
-    if (options & AF_MAT_UPPER ||
-        options & AF_MAT_LOWER) {
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    if (options & AF_MAT_UPPER || options & AF_MAT_LOWER) {
         return triangleSolve<T>(a, b, options);
     }
 
-    if(a.dims()[0] == a.dims()[1]) {
+    if (a.dims()[0] == a.dims()[1]) {
         return generalSolve<T>(a, b);
     } else {
         return leastSquares<T>(a, b);
     }
 }
 
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
-    template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
-
-INSTANTIATE_SOLVE(float)
-INSTANTIATE_SOLVE(cfloat)
-INSTANTIATE_SOLVE(double)
-INSTANTIATE_SOLVE(cdouble)
-
-}
-
-#else
-namespace cuda
-{
-
-template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on CUDA",
-             AF_ERR_NOT_CONFIGURED);
-}
-
-template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on CUDA",
-              AF_ERR_NOT_CONFIGURED);
-}
-
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
     template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
 
 INSTANTIATE_SOLVE(float)
 INSTANTIATE_SOLVE(cfloat)
 INSTANTIATE_SOLVE(double)
 INSTANTIATE_SOLVE(cdouble)
-}
 
-#endif
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/solve.hpp b/src/backend/cuda/solve.hpp
index 34da8f6527..20205aa771 100644
--- a/src/backend/cuda/solve.hpp
+++ b/src/backend/cuda/solve.hpp
@@ -7,15 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options = AF_MAT_NONE);
 
-    template<typename T>
-    Array<T> solveLU(const Array<T> &a, const Array<int> &pivot,
-                     const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
-}
+template<typename T>
+Array<T> solveLU(const Array<T> &a, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options = AF_MAT_NONE);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sort.cu b/src/backend/cuda/sort.cu
index dc74b800a4..d56899a87d 100644
--- a/src/backend/cuda/sort.cu
+++ b/src/backend/cuda/sort.cu
@@ -9,35 +9,56 @@
 
 #include <Array.hpp>
 #include <copy.hpp>
-#include <sort.hpp>
+#include <err_cuda.hpp>
 #include <kernel/sort.hpp>
 #include <math.hpp>
+#include <reorder.hpp>
+#include <sort.hpp>
 #include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim)
-    {
-        Array<T> out = copyArray<T>(in);
-        switch(dim) {
 
-        case 0: kernel::sort0<T, isAscending>(out);
-            break;
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending) {
+    Array<T> out = copyArray<T>(in);
+    switch (dim) {
+        case 0: kernel::sort0<T>(out, isAscending); break;
+        case 1: kernel::sortBatched<T>(out, 1, isAscending); break;
+        case 2: kernel::sortBatched<T>(out, 2, isAscending); break;
+        case 3: kernel::sortBatched<T>(out, 3, isAscending); break;
         default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-        }
-        return out;
     }
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> sort<T, true>(const Array<T> &in, const unsigned dim); \
-    template Array<T>  sort<T,false>(const Array<T> &in, const unsigned dim); \
+    if (dim != 0) {
+        af::dim4 preorderDims = out.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = out.dims()[dim];
+        for (int i = 1; i <= (int)dim; i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = out.dims()[i - 1];
+        }
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+        out.setDataDims(preorderDims);
+        out = reorder<T>(out, reorderDims);
+    }
+    return out;
 }
+
+#define INSTANTIATE(T)                                                \
+    template Array<T> sort<T>(const Array<T> &in, const unsigned dim, \
+                              bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sort.hpp b/src/backend/cuda/sort.hpp
index 8f4f3a03ef..f6b8832f01 100644
--- a/src/backend/cuda/sort.hpp
+++ b/src/backend/cuda/sort.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sort_by_key.cu b/src/backend/cuda/sort_by_key.cu
new file mode 100644
index 0000000000..21d9efc5b2
--- /dev/null
+++ b/src/backend/cuda/sort_by_key.cu
@@ -0,0 +1,88 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <copy.hpp>
+#include <err_cuda.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <math.hpp>
+#include <reorder.hpp>
+#include <sort_by_key.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace cuda {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const uint dim, bool isAscending) {
+    okey = copyArray<Tk>(ikey);
+    oval = copyArray<Tv>(ival);
+
+    switch (dim) {
+        case 0: kernel::sort0ByKey<Tk, Tv>(okey, oval, isAscending); break;
+        case 1:
+        case 2:
+        case 3:
+            kernel::sortByKeyBatched<Tk, Tv>(okey, oval, dim, isAscending);
+            break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+
+    if (dim != 0) {
+        af::dim4 preorderDims = okey.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = okey.dims()[dim];
+        for (int i = 1; i <= (int)dim; i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = okey.dims()[i - 1];
+        }
+
+        okey.setDataDims(preorderDims);
+        oval.setDataDims(preorderDims);
+
+        okey = reorder<Tk>(okey, reorderDims);
+        oval = reorder<Tv>(oval, reorderDims);
+    }
+}
+
+#define INSTANTIATE(Tk, Tv)                                        \
+    template void sort_by_key<Tk, Tv>(                             \
+        Array<Tk> & okey, Array<Tv> & oval, const Array<Tk> &ikey, \
+        const Array<Tv> &ival, const uint dim, bool);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+INSTANTIATE1(float)
+INSTANTIATE1(double)
+INSTANTIATE1(int)
+INSTANTIATE1(uint)
+INSTANTIATE1(short)
+INSTANTIATE1(ushort)
+INSTANTIATE1(char)
+INSTANTIATE1(schar)
+INSTANTIATE1(uchar)
+INSTANTIATE1(intl)
+INSTANTIATE1(uintl)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sort_by_key.hpp b/src/backend/cuda/sort_by_key.hpp
index 561df04d80..e44badc6a8 100644
--- a/src/backend/cuda/sort_by_key.hpp
+++ b/src/backend/cuda/sort_by_key.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const unsigned dim);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sort_by_key/ascd_f32.cu b/src/backend/cuda/sort_by_key/ascd_f32.cu
deleted file mode 100644
index 44b770402c..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_f32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(float, true)
-}
diff --git a/src/backend/cuda/sort_by_key/ascd_f64.cu b/src/backend/cuda/sort_by_key/ascd_f64.cu
deleted file mode 100644
index 17b54a3903..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_f64.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(double, true)
-}
diff --git a/src/backend/cuda/sort_by_key/ascd_s32.cu b/src/backend/cuda/sort_by_key/ascd_s32.cu
deleted file mode 100644
index 75adbddc0b..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_s32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(int, true)
-}
diff --git a/src/backend/cuda/sort_by_key/ascd_s8.cu b/src/backend/cuda/sort_by_key/ascd_s8.cu
deleted file mode 100644
index f47a397727..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_s8.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(char, true)
-}
diff --git a/src/backend/cuda/sort_by_key/ascd_u32.cu b/src/backend/cuda/sort_by_key/ascd_u32.cu
deleted file mode 100644
index 6f7939aa12..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_u32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(uint, true)
-}
diff --git a/src/backend/cuda/sort_by_key/ascd_u8.cu b/src/backend/cuda/sort_by_key/ascd_u8.cu
deleted file mode 100644
index a2e1dec887..0000000000
--- a/src/backend/cuda/sort_by_key/ascd_u8.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(uchar, true)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_f32.cu b/src/backend/cuda/sort_by_key/desc_f32.cu
deleted file mode 100644
index 1bbb10bbba..0000000000
--- a/src/backend/cuda/sort_by_key/desc_f32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(float, false)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_f64.cu b/src/backend/cuda/sort_by_key/desc_f64.cu
deleted file mode 100644
index ecbed78878..0000000000
--- a/src/backend/cuda/sort_by_key/desc_f64.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(double, false)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_s32.cu b/src/backend/cuda/sort_by_key/desc_s32.cu
deleted file mode 100644
index 49904437f4..0000000000
--- a/src/backend/cuda/sort_by_key/desc_s32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(int, false)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_s8.cu b/src/backend/cuda/sort_by_key/desc_s8.cu
deleted file mode 100644
index cad78dfc84..0000000000
--- a/src/backend/cuda/sort_by_key/desc_s8.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(char, false)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_u32.cu b/src/backend/cuda/sort_by_key/desc_u32.cu
deleted file mode 100644
index ae2ad4bc84..0000000000
--- a/src/backend/cuda/sort_by_key/desc_u32.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(uint, false)
-}
diff --git a/src/backend/cuda/sort_by_key/desc_u8.cu b/src/backend/cuda/sort_by_key/desc_u8.cu
deleted file mode 100644
index 51d8096620..0000000000
--- a/src/backend/cuda/sort_by_key/desc_u8.cu
+++ /dev/null
@@ -1,15 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <sort_by_key_impl.hpp>
-
-namespace cuda
-{
-    INSTANTIATE1(uchar, false)
-}
diff --git a/src/backend/cuda/sort_by_key_impl.hpp b/src/backend/cuda/sort_by_key_impl.hpp
deleted file mode 100644
index 32758b47a5..0000000000
--- a/src/backend/cuda/sort_by_key_impl.hpp
+++ /dev/null
@@ -1,45 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <copy.hpp>
-#include <sort_by_key.hpp>
-#include <kernel/sort_by_key.hpp>
-#include <math.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const uint dim)
-    {
-        okey = copyArray<Tk>(ikey);
-        oval = copyArray<Tv>(ival);
-        switch(dim) {
-        case 0: kernel::sort0_by_key<Tk, Tv, isAscending>(okey, oval);
-            break;
-        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-        }
-    }
-
-#define INSTANTIATE(Tk, Tv, dr)                                         \
-    template void                                                       \
-    sort_by_key<Tk, Tv, dr>(Array<Tk> &okey, Array<Tv> &oval,           \
-                            const Array<Tk> &ikey, const Array<Tv> &ival, const uint dim); \
-
-#define INSTANTIATE1(Tk,    dr) \
-    INSTANTIATE(Tk, float,  dr) \
-    INSTANTIATE(Tk, double, dr) \
-    INSTANTIATE(Tk, int,    dr) \
-    INSTANTIATE(Tk, uint,   dr) \
-    INSTANTIATE(Tk, char,   dr) \
-    INSTANTIATE(Tk, uchar,  dr)
-}
diff --git a/src/backend/cuda/sort_index.cu b/src/backend/cuda/sort_index.cu
index b80287b90f..d923f7c6e9 100644
--- a/src/backend/cuda/sort_index.cu
+++ b/src/backend/cuda/sort_index.cu
@@ -8,38 +8,67 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <sort_index.hpp>
 #include <copy.hpp>
-#include <kernel/sort_index.hpp>
+#include <err_cuda.hpp>
+#include <kernel/sort_by_key.hpp>
 #include <math.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
+#include <sort_index.hpp>
 #include <stdexcept>
-#include <err_cuda.hpp>
 
-namespace cuda
-{
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<uint> &idx, const Array<T> &in, const uint dim)
-    {
-        val = copyArray<T>(in);
-        idx = createEmptyArray<uint>(in.dims());
-        switch(dim) {
-            case 0: kernel::sort0_index<T, isAscending>(val, idx);
-                    break;
-            default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-        }
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void sort_index(Array<T> &okey, Array<uint> &oval, const Array<T> &in,
+                const uint dim, bool isAscending) {
+    okey = copyArray<T>(in);
+    oval = range<uint>(in.dims(), dim);
+
+    switch (dim) {
+        case 0: kernel::sort0ByKey<T, uint>(okey, oval, isAscending); break;
+        case 1:
+        case 2:
+        case 3:
+            kernel::sortByKeyBatched<T, uint>(okey, oval, dim, isAscending);
+            break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
     }
 
-#define INSTANTIATE(T)                                                  \
-    template void sort_index<T, true>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
-    template void sort_index<T,false>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
+    if (dim != 0) {
+        af::dim4 preorderDims = okey.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = okey.dims()[dim];
+        for (int i = 1; i <= (int)dim; i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = okey.dims()[i - 1];
+        }
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+        okey.setDataDims(preorderDims);
+        oval.setDataDims(preorderDims);
 
+        okey = reorder<T>(okey, reorderDims);
+        oval = reorder<uint>(oval, reorderDims);
+    }
 }
+
+#define INSTANTIATE(T)                                              \
+    template void sort_index<T>(Array<T> & val, Array<uint> & idx,  \
+                                const Array<T> &in, const uint dim, \
+                                bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+
+}  // namespace cuda
+}  // namespace arrayfire
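
Review note: the `dim != 0` branch of `sort_index` above relabels the batched result so that the sorted dimension comes first, then calls `reorder` to restore the caller's layout. The host-only sketch below mirrors just that index arithmetic so it can be checked without a GPU. `Dims`, `ReorderPlan`, `makeReorderPlan` and `applyReorder` are hypothetical names for this illustration, not part of ArrayFire, and `applyReorder` assumes that `reorder` produces `out.dims[i] = in.dims[reorderDims[i]]`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for af::dim4 (illustration only).
using Dims = std::array<long long, 4>;

struct ReorderPlan {
    Dims preorderDims;                    // dims to label the data with before reorder
    std::array<unsigned, 4> reorderDims;  // permutation handed to reorder
};

// Mirrors the loop in the dim != 0 branch of sort_index.
ReorderPlan makeReorderPlan(const Dims &dims, unsigned dim) {
    assert(dim < 4);
    ReorderPlan plan;
    plan.preorderDims = dims;
    plan.reorderDims  = {0u, 1u, 2u, 3u};
    if (dim == 0) { return plan; }  // nothing to do for the first dimension

    plan.reorderDims[dim] = 0;
    plan.preorderDims[0]  = dims[dim];
    for (unsigned i = 1; i <= dim; ++i) {
        plan.reorderDims[i - 1] = i;
        plan.preorderDims[i]    = dims[i - 1];
    }
    return plan;
}

// Assumed reorder semantics: output dim i takes the extent of input dim
// reorderDims[i].
Dims applyReorder(const ReorderPlan &plan) {
    Dims out{};
    for (std::size_t i = 0; i < 4; ++i) {
        out[i] = plan.preorderDims[plan.reorderDims[i]];
    }
    return out;
}
```

Under that assumption, applying the plan to the relabelled dims gives back the original extents, which is what the caller of `sort_index` expects to see.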
diff --git a/src/backend/cuda/sort_index.hpp b/src/backend/cuda/sort_index.hpp
index d85c076d1c..1355f9ea8a 100644
--- a/src/backend/cuda/sort_index.hpp
+++ b/src/backend/cuda/sort_index.hpp
@@ -7,11 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<unsigned> &idx, const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void sort_index(Array<T> &val, Array<unsigned> &idx, const Array<T> &in,
+                const unsigned dim, bool isAscending);
+}  // namespace cuda
+}  // namespace arrayfire
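
Review note: with this change `isAscending` becomes a runtime argument, and the implementation builds the index output by sorting a `range` of indices by key. A minimal host-side sketch of that contract for a 1-D input follows. `sortIndexHost` is a hypothetical helper for illustration, and the use of `std::stable_sort` is a choice of this sketch; the diff does not say how the CUDA kernel orders equal keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical host analogue of sort_index for a 1-D input:
// idx starts as 0..n-1 (the "range") and is sorted by the keys it points at.
template<typename T>
void sortIndexHost(std::vector<T> &keys, std::vector<unsigned> &idx,
                   const std::vector<T> &in, bool isAscending) {
    idx.resize(in.size());
    std::iota(idx.begin(), idx.end(), 0u);

    // The direction is chosen at run time, as in the new signature.
    std::stable_sort(idx.begin(), idx.end(),
                     [&in, isAscending](unsigned a, unsigned b) {
                         return isAscending ? in[a] < in[b] : in[a] > in[b];
                     });

    // Gather the keys in sorted order.
    keys.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) { keys[i] = in[idx[i]]; }
}
```

One template instantiation per key type then covers both directions, which matches the shorter `INSTANTIATE(T)` list in `sort_index.cu`.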
diff --git a/src/backend/cuda/sparse.cu b/src/backend/cuda/sparse.cu
new file mode 100644
index 0000000000..3c39c72695
--- /dev/null
+++ b/src/backend/cuda/sparse.cu
@@ -0,0 +1,543 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sparse.hpp>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <cudaDataType.hpp>
+#include <cusparse.hpp>
+#include <cusparseModule.hpp>
+#include <cusparse_descriptor_helpers.hpp>
+#include <handle.hpp>
+#include <kernel/sparse.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <where.hpp>
+
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+
+using namespace common;
+
+// cusparseStatus_t cusparseZdense2csr(cusparseHandle_t handle,
+//                                    int m, int n,
+//                                    const cusparseMatDescr_t descrA,
+//                                    const cuDoubleComplex *A, int lda,
+//                                    const int *nnzPerRow,
+//                                    cuDoubleComplex *csrValA,
+//                                    int *csrRowPtrA, int *csrColIndA)
+template<typename T>
+struct dense2csr_func_def_t {
+    typedef cusparseStatus_t (*dense2csr_func_def)(cusparseHandle_t, int, int,
+                                                   const cusparseMatDescr_t,
+                                                   const T *, int, const int *,
+                                                   T *, int *, int *);
+};
+
+// cusparseStatus_t cusparseZdense2csc(cusparseHandle_t handle,
+//                                    int m, int n,
+//                                    const cusparseMatDescr_t descrA,
+//                                    const cuDoubleComplex *A, int lda,
+//                                    const int *nnzPerCol,
+//                                    cuDoubleComplex *cscValA,
+//                                    int *cscRowIndA, int *cscColPtrA)
+template<typename T>
+struct dense2csc_func_def_t {
+    typedef cusparseStatus_t (*dense2csc_func_def)(cusparseHandle_t, int, int,
+                                                   const cusparseMatDescr_t,
+                                                   const T *, int, const int *,
+                                                   T *, int *, int *);
+};
+
+// cusparseStatus_t cusparseZcsr2dense(cusparseHandle_t handle,
+//                                    int m, int n,
+//                                    const cusparseMatDescr_t descrA,
+//                                    const cuDoubleComplex *csrValA,
+//                                    const int *csrRowPtrA,
+//                                    const int *csrColIndA,
+//                                    cuDoubleComplex *A, int lda)
+template<typename T>
+struct csr2dense_func_def_t {
+    typedef cusparseStatus_t (*csr2dense_func_def)(cusparseHandle_t, int, int,
+                                                   const cusparseMatDescr_t,
+                                                   const T *, const int *,
+                                                   const int *, T *, int);
+};
+
+// cusparseStatus_t cusparseZcsc2dense(cusparseHandle_t handle,
+//                                    int m, int n,
+//                                    const cusparseMatDescr_t descrA,
+//                                    const cuDoubleComplex *cscValA,
+//                                    const int *cscRowIndA,
+//                                    const int *cscColPtrA,
+//                                    cuDoubleComplex *A, int lda)
+template<typename T>
+struct csc2dense_func_def_t {
+    typedef cusparseStatus_t (*csc2dense_func_def)(cusparseHandle_t, int, int,
+                                                   const cusparseMatDescr_t,
+                                                   const T *, const int *,
+                                                   const int *, T *, int);
+};
+
+// cusparseStatus_t cusparseZnnz(cusparseHandle_t handle,
+//                              cusparseDirection_t dirA,
+//                              int m, int n,
+//                              const cusparseMatDescr_t descrA,
+//                              const cuDoubleComplex *A, int lda,
+//                              int *nnzPerRowColumn,
+//                              int *nnzTotalDevHostPtr)
+template<typename T>
+struct nnz_func_def_t {
+    typedef cusparseStatus_t (*nnz_func_def)(cusparseHandle_t,
+                                             cusparseDirection_t, int, int,
+                                             const cusparseMatDescr_t,
+                                             const T *, int, int *, int *);
+};
+
+// cusparseStatus_t cusparseZgthr(cusparseHandle_t handle,
+//                               int nnz,
+//                               const cuDoubleComplex *y,
+//                               cuDoubleComplex *xVal, const int *xInd,
+//                               cusparseIndexBase_t idxBase)
+template<typename T>
+struct gthr_func_def_t {
+    typedef cusparseStatus_t (*gthr_func_def)(cusparseHandle_t, int, const T *,
+                                              T *, const int *,
+                                              cusparseIndexBase_t);
+};
+
+#define SPARSE_FUNC_DEF(FUNC) \
+    template<typename T>      \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func();
+
+#define SPARSE_FUNC(FUNC, TYPE, PREFIX)                                     \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        cusparseModule &_ = getCusparsePlugin();                            \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def)(                  \
+            _.cusparse##PREFIX##FUNC);                                      \
+    }
+
+/// Newer versions of cusparse use matrix descriptor instead of types encoded in
+/// their names
+#if CUSPARSE_VERSION < 11300
+SPARSE_FUNC_DEF(dense2csr)
+SPARSE_FUNC(dense2csr, float, S)
+SPARSE_FUNC(dense2csr, double, D)
+SPARSE_FUNC(dense2csr, cfloat, C)
+SPARSE_FUNC(dense2csr, cdouble, Z)
+
+SPARSE_FUNC_DEF(dense2csc)
+SPARSE_FUNC(dense2csc, float, S)
+SPARSE_FUNC(dense2csc, double, D)
+SPARSE_FUNC(dense2csc, cfloat, C)
+SPARSE_FUNC(dense2csc, cdouble, Z)
+
+SPARSE_FUNC_DEF(csr2dense)
+SPARSE_FUNC(csr2dense, float, S)
+SPARSE_FUNC(csr2dense, double, D)
+SPARSE_FUNC(csr2dense, cfloat, C)
+SPARSE_FUNC(csr2dense, cdouble, Z)
+
+SPARSE_FUNC_DEF(csc2dense)
+SPARSE_FUNC(csc2dense, float, S)
+SPARSE_FUNC(csc2dense, double, D)
+SPARSE_FUNC(csc2dense, cfloat, C)
+SPARSE_FUNC(csc2dense, cdouble, Z)
+
+SPARSE_FUNC_DEF(gthr)
+SPARSE_FUNC(gthr, float, S)
+SPARSE_FUNC(gthr, double, D)
+SPARSE_FUNC(gthr, cfloat, C)
+SPARSE_FUNC(gthr, cdouble, Z)
+#endif
+
+SPARSE_FUNC_DEF(nnz)
+SPARSE_FUNC(nnz, float, S)
+SPARSE_FUNC(nnz, double, D)
+SPARSE_FUNC(nnz, cfloat, C)
+SPARSE_FUNC(nnz, cdouble, Z)
+
+#undef SPARSE_FUNC
+#undef SPARSE_FUNC_DEF
+
+// COO counterpart of sparseConvertDenseToStorage. Function templates cannot
+// be partially specialized, so the COO path lives in its own function.
+template<typename T>
+SparseArray<T> sparseConvertDenseToCOO(const Array<T> &in) {
+    Array<uint> nonZeroIdx_ = where<T>(in);
+    Array<int> nonZeroIdx   = cast<int, uint>(nonZeroIdx_);
+
+    dim_t nNZ = nonZeroIdx.elements();
+
+    Array<int> constDim = createValueArray<int>(dim4(nNZ), in.dims()[0]);
+
+    Array<int> rowIdx =
+        arithOp<int, af_mod_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+    Array<int> colIdx =
+        arithOp<int, af_div_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+
+    Array<T> values = copyArray<T>(in);
+    values.modDims(dim4(values.elements()));
+    values = lookup<T, int>(values, nonZeroIdx, 0);
+
+    return createArrayDataSparseArray<T>(in.dims(), values, rowIdx, colIdx,
+                                         AF_STORAGE_COO);
+}
+
+template<typename T, af_storage stype>
+SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in) {
+    const int M = in.dims()[0];
+    const int N = in.dims()[1];
+
+    cusparseModule &_ = getCusparsePlugin();
+#if CUSPARSE_VERSION < 11300
+    // Create Sparse Matrix Descriptor
+    cusparseMatDescr_t descr = 0;
+    CUSPARSE_CHECK(_.cusparseCreateMatDescr(&descr));
+    _.cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
+    _.cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
+
+    int d                   = -1;
+    cusparseDirection_t dir = CUSPARSE_DIRECTION_ROW;
+
+    if (stype == AF_STORAGE_CSR) {
+        d   = M;
+        dir = CUSPARSE_DIRECTION_ROW;
+    } else {
+        d   = N;
+        dir = CUSPARSE_DIRECTION_COLUMN;
+    }
+    Array<int> nnzPerDir = createEmptyArray<int>(dim4(d));
+
+    int nNZ = -1;
+    CUSPARSE_CHECK(nnz_func<T>()(sparseHandle(), dir, M, N, descr, in.get(),
+                                 in.strides()[1], nnzPerDir.get(), &nNZ));
+
+    Array<int> rowIdx = createEmptyArray<int>(dim4());
+    Array<int> colIdx = createEmptyArray<int>(dim4());
+
+    if (stype == AF_STORAGE_CSR) {
+        rowIdx = createEmptyArray<int>(dim4(M + 1));
+        colIdx = createEmptyArray<int>(dim4(nNZ));
+    } else {
+        rowIdx = createEmptyArray<int>(dim4(nNZ));
+        colIdx = createEmptyArray<int>(dim4(N + 1));
+    }
+    Array<T> values = createEmptyArray<T>(dim4(nNZ));
+
+    if (stype == AF_STORAGE_CSR) {
+        CUSPARSE_CHECK(dense2csr_func<T>()(
+            sparseHandle(), M, N, descr, in.get(), in.strides()[1],
+            nnzPerDir.get(), values.get(), rowIdx.get(), colIdx.get()));
+    } else {
+        CUSPARSE_CHECK(dense2csc_func<T>()(
+            sparseHandle(), M, N, descr, in.get(), in.strides()[1],
+            nnzPerDir.get(), values.get(), rowIdx.get(), colIdx.get()));
+    }
+    // Destroy Sparse Matrix Descriptor
+    CUSPARSE_CHECK(_.cusparseDestroyMatDescr(descr));
+
+    return createArrayDataSparseArray<T>(in.dims(), values, rowIdx, colIdx,
+                                         stype);
+#else
+    auto matA = denMatDescriptor(in);
+    cusparseSpMatDescr_t matB;
+
+    Array<int> d_offsets = createEmptyArray<int>(0);
+
+    if (stype == AF_STORAGE_CSR) {
+        d_offsets = createEmptyArray<int>(M + 1);
+        // Create sparse matrix B in CSR format
+        CUSPARSE_CHECK(
+            _.cusparseCreateCsr(&matB, M, N, 0, d_offsets.get(), nullptr,
+                                nullptr, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
+                                CUSPARSE_INDEX_BASE_ZERO, getType<T>()));
+    } else {
+        d_offsets = createEmptyArray<int>(N + 1);
+        CUSPARSE_CHECK(
+            _.cusparseCreateCsc(&matB, M, N, 0, d_offsets.get(), nullptr,
+                                nullptr, CUSPARSE_INDEX_32I, CUSPARSE_INDEX_32I,
+                                CUSPARSE_INDEX_BASE_ZERO, getType<T>()));
+    }
+
+    // allocate an external buffer if needed
+    size_t bufferSize;
+    CUSPARSE_CHECK(_.cusparseDenseToSparse_bufferSize(
+        sparseHandle(), matA, matB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
+        &bufferSize));
+
+    auto dBuffer = memAlloc<char>(bufferSize);
+
+    // run the analysis phase of the Dense to Sparse conversion
+    CUSPARSE_CHECK(_.cusparseDenseToSparse_analysis(
+        sparseHandle(), matA, matB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
+        dBuffer.get()));
+    // get number of non-zero elements
+    int64_t num_rows_tmp, num_cols_tmp, nnz;
+    CUSPARSE_CHECK(
+        _.cusparseSpMatGetSize(matB, &num_rows_tmp, &num_cols_tmp, &nnz));
+
+    auto d_ind    = createEmptyArray<int>(nnz);
+    auto d_values = createEmptyArray<T>(nnz);
+    // the index and value arrays above are sized from nnz; attach them
+    // to matB together with the offsets
+    if (stype == AF_STORAGE_CSR) {
+        // matB was created with null index/value pointers;
+        // reset offsets, column indices, and values pointers
+        CUSPARSE_CHECK(_.cusparseCsrSetPointers(matB, d_offsets.get(),
+                                                d_ind.get(), d_values.get()));
+
+    } else {
+        // reset offsets, column indices, and values pointers
+        CUSPARSE_CHECK(_.cusparseCscSetPointers(matB, d_offsets.get(),
+                                                d_ind.get(), d_values.get()));
+    }
+    // execute Dense to Sparse conversion
+    CUSPARSE_CHECK(_.cusparseDenseToSparse_convert(
+        sparseHandle(), matA, matB, CUSPARSE_DENSETOSPARSE_ALG_DEFAULT,
+        dBuffer.get()));
+
+    if (stype == AF_STORAGE_CSR) {
+        size_t pBufferSizeInBytes = 0;
+        auto desc                 = make_handle<cusparseMatDescr_t>();
+        CUSPARSE_CHECK(_.cusparseXcsrsort_bufferSizeExt(
+            sparseHandle(), M, N, nnz, d_offsets.get(), d_ind.get(),
+            &pBufferSizeInBytes));
+        auto pBuffer = memAlloc<char>(pBufferSizeInBytes);
+        Array<int> P = createEmptyArray<int>(nnz);
+        CUSPARSE_CHECK(
+            _.cusparseCreateIdentityPermutation(sparseHandle(), nnz, P.get()));
+        CUSPARSE_CHECK(_.cusparseXcsrsort(
+            sparseHandle(), M, N, nnz, desc, (int *)d_offsets.get(),
+            (int *)d_ind.get(), P.get(), pBuffer.get()));
+        d_values = lookup(d_values, P, 0);
+        return createArrayDataSparseArray<T>(in.dims(), d_values, d_offsets,
+                                             d_ind, stype, false);
+    } else {
+        return createArrayDataSparseArray<T>(in.dims(), d_values, d_ind,
+                                             d_offsets, stype, false);
+    }
+#endif
+}
+
+// COO counterpart of sparseConvertStorageToDense. Function templates cannot
+// be partially specialized, so the COO path lives in its own function.
+template<typename T>
+Array<T> sparseConvertCOOToDense(const SparseArray<T> &in) {
+    Array<T> dense = createValueArray<T>(in.dims(), scalar<T>(0));
+
+    const Array<T> values   = in.getValues();
+    const Array<int> rowIdx = in.getRowIdx();
+    const Array<int> colIdx = in.getColIdx();
+
+    kernel::coo2dense<T>(dense, values, rowIdx, colIdx);
+
+    return dense;
+}
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const SparseArray<T> &in) {
+    // Create Sparse Matrix Descriptor
+    cusparseModule &_ = getCusparsePlugin();
+#if CUSPARSE_VERSION < 11300
+    cusparseMatDescr_t descr = 0;
+    CUSPARSE_CHECK(_.cusparseCreateMatDescr(&descr));
+    _.cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
+    _.cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
+
+    int M          = in.dims()[0];
+    int N          = in.dims()[1];
+    Array<T> dense = createValueArray<T>(in.dims(), scalar<T>(0));
+    int d_strides1 = dense.strides()[1];
+
+    if (stype == AF_STORAGE_CSR) {
+        CUSPARSE_CHECK(
+            csr2dense_func<T>()(sparseHandle(), M, N, descr,
+                                in.getValues().get(), in.getRowIdx().get(),
+                                in.getColIdx().get(), dense.get(), d_strides1));
+    } else {
+        CUSPARSE_CHECK(
+            csc2dense_func<T>()(sparseHandle(), M, N, descr,
+                                in.getValues().get(), in.getRowIdx().get(),
+                                in.getColIdx().get(), dense.get(), d_strides1));
+    }
+
+    // Destroy Sparse Matrix Descriptor
+    CUSPARSE_CHECK(_.cusparseDestroyMatDescr(descr));
+#else
+    unique_handle<cusparseSpMatDescr_t> inhandle = cusparseDescriptor(in);
+
+    Array<T> dense = createEmptyArray<T>(in.dims());
+    unique_handle<cusparseDnMatDescr_t> outhandle = denMatDescriptor(dense);
+
+    size_t bufferSize = 0;
+    CUSPARSE_CHECK(_.cusparseSparseToDense_bufferSize(
+        sparseHandle(), inhandle, outhandle, CUSPARSE_SPARSETODENSE_ALG_DEFAULT,
+        &bufferSize));
+
+    auto dBuffer = memAlloc<char>(bufferSize);
+    CUSPARSE_CHECK(_.cusparseSparseToDense(sparseHandle(), inhandle, outhandle,
+        CUSPARSE_SPARSETODENSE_ALG_DEFAULT, dBuffer.get()));
+
+#endif
+
+    return dense;
+}
+
+template<typename T, af_storage dest, af_storage src>
+SparseArray<T> sparseConvertStorageToStorage(const SparseArray<T> &in) {
+    using std::shared_ptr;
+    in.eval();
+
+    int nNZ                  = in.getNNZ();
+    SparseArray<T> converted = createEmptySparseArray<T>(in.dims(), nNZ, dest);
+
+    cusparseModule &_ = getCusparsePlugin();
+    if (src == AF_STORAGE_CSR && dest == AF_STORAGE_COO) {
+        // Copy colIdx as is
+        CUDA_CHECK(
+            cudaMemcpyAsync(converted.getColIdx().get(), in.getColIdx().get(),
+                            in.getColIdx().elements() * sizeof(int),
+                            cudaMemcpyDeviceToDevice, getActiveStream()));
+
+        // cusparse function to expand compressed row into coordinate
+        CUSPARSE_CHECK(_.cusparseXcsr2coo(
+            sparseHandle(), in.getRowIdx().get(), nNZ, in.dims()[0],
+            converted.getRowIdx().get(), CUSPARSE_INDEX_BASE_ZERO));
+
+        // Call sort
+        size_t pBufferSizeInBytes = 0;
+        CUSPARSE_CHECK(_.cusparseXcoosort_bufferSizeExt(
+            sparseHandle(), in.dims()[0], in.dims()[1], nNZ,
+            converted.getRowIdx().get(), converted.getColIdx().get(),
+            &pBufferSizeInBytes));
+        auto pBuffer = memAlloc<char>(pBufferSizeInBytes);
+
+        // shared_ptr<int> P(memAlloc<int>(nNZ).release(), memFree<int>);
+        Array<int> P = createEmptyArray<int>(nNZ);
+        CUSPARSE_CHECK(
+            _.cusparseCreateIdentityPermutation(sparseHandle(), nNZ, P.get()));
+
+        CUSPARSE_CHECK(_.cusparseXcoosortByRow(
+            sparseHandle(), in.dims()[0], in.dims()[1], nNZ,
+            converted.getRowIdx().get(), converted.getColIdx().get(), P.get(),
+            pBuffer.get()));
+
+        converted.getValues() = lookup<T, int>(in.getValues(), P, 0);
+
+    } else if (src == AF_STORAGE_COO && dest == AF_STORAGE_CSR) {
+        // The cusparse csr sort function is not behaving correctly.
+        // So the workaround is to convert the COO into row major and then
+        // convert it to CSR
+
+        int M = in.dims()[0];
+        int N = in.dims()[1];
+        // Deep copy input into temporary COO Row Major
+        SparseArray<T> cooT = createArrayDataSparseArray<T>(
+            in.dims(), in.getValues(), in.getRowIdx(), in.getColIdx(),
+            in.getStorage(), true);
+
+        // Call sort to convert column major to row major
+        {
+            size_t pBufferSizeInBytes = 0;
+            CUSPARSE_CHECK(_.cusparseXcoosort_bufferSizeExt(
+                sparseHandle(), M, N, nNZ, cooT.getRowIdx().get(),
+                cooT.getColIdx().get(), &pBufferSizeInBytes));
+            auto pBuffer = memAlloc<char>(pBufferSizeInBytes);
+
+            Array<int> P = createEmptyArray<int>(nNZ);
+            CUSPARSE_CHECK(_.cusparseCreateIdentityPermutation(sparseHandle(),
+                                                               nNZ, P.get()));
+
+            CUSPARSE_CHECK(_.cusparseXcoosortByRow(
+                sparseHandle(), M, N, nNZ, cooT.getRowIdx().get(),
+                cooT.getColIdx().get(), P.get(), pBuffer.get()));
+
+            converted.getValues() = lookup<T, int>(in.getValues(), P, 0);
+        }
+
+        // Copy colIdx as is (values were already permuted above)
+        copyArray<int, int>(converted.getColIdx(), cooT.getColIdx());
+
+        // cusparse function to compress row from coordinate
+        CUSPARSE_CHECK(_.cusparseXcoo2csr(
+            sparseHandle(), cooT.getRowIdx().get(), nNZ, M,
+            converted.getRowIdx().get(), CUSPARSE_INDEX_BASE_ZERO));
+
+        // No need to call CSRSORT
+
+    } else {
+        // Should never come here
+        AF_ERROR("CUDA Backend invalid conversion combination",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return converted;
+}
+
+#define INSTANTIATE_TO_STORAGE(T, S)                     \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSR>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSC>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_COO>( \
+        const SparseArray<T> &in);
+
+#define INSTANTIATE_COO_SPECIAL(T)                                 \
+    template<>                                                     \
+    SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_COO>( \
+        const Array<T> &in) {                                      \
+        return sparseConvertDenseToCOO<T>(in);                     \
+    }                                                              \
+    template<>                                                     \
+    Array<T> sparseConvertStorageToDense<T, AF_STORAGE_COO>(       \
+        const SparseArray<T> &in) {                                \
+        return sparseConvertCOOToDense<T>(in);                     \
+    }
+
+#define INSTANTIATE_SPARSE(T)                                               \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSR>( \
+        const Array<T> &in);                                                \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSC>( \
+        const Array<T> &in);                                                \
+                                                                            \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSR>(       \
+        const SparseArray<T> &in);                                          \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSC>(       \
+        const SparseArray<T> &in);                                          \
+                                                                            \
+    INSTANTIATE_COO_SPECIAL(T)                                              \
+                                                                            \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSR)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSC)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_COO)
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+#undef INSTANTIATE_TO_STORAGE
+#undef INSTANTIATE_COO_SPECIAL
+#undef INSTANTIATE_SPARSE
+
+}  // namespace cuda
+}  // namespace arrayfire
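Reviewer note on the COO-to-CSR branch above: the workaround sorts the COO entries by row through a permutation, gathers the values through that permutation, and then compresses the sorted row indices into row offsets. The following is a minimal host-side sketch of those three steps, assuming plain `std::vector` storage. The `CsrSketch` struct and `cooToCsr` function are hypothetical names used only for illustration; they are not part of ArrayFire or cuSPARSE, and the tie-break on column index is this sketch's own choice.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Hypothetical host-side container, not an ArrayFire type.
struct CsrSketch {
    std::vector<int> rowPtr;     // rows + 1 offsets
    std::vector<int> colIdx;     // one column index per non-zero
    std::vector<double> values;  // one value per non-zero
};

// Sort COO entries by row through a permutation, then compress the rows.
CsrSketch cooToCsr(int rows, const std::vector<int> &rowIdx,
                   const std::vector<int> &colIdx,
                   const std::vector<double> &values) {
    const int nnz = static_cast<int>(values.size());

    // Start from the identity permutation.
    std::vector<int> perm(nnz);
    std::iota(perm.begin(), perm.end(), 0);

    // Order the permutation by row, breaking ties by column.
    std::stable_sort(perm.begin(), perm.end(), [&](int a, int b) {
        if (rowIdx[a] != rowIdx[b]) { return rowIdx[a] < rowIdx[b]; }
        return colIdx[a] < colIdx[b];
    });

    CsrSketch out;
    out.rowPtr.assign(rows + 1, 0);
    out.colIdx.resize(nnz);
    out.values.resize(nnz);

    // Gather columns and values through the permutation and count the
    // number of entries in each row.
    for (int i = 0; i < nnz; ++i) {
        out.colIdx[i] = colIdx[perm[i]];
        out.values[i] = values[perm[i]];
        ++out.rowPtr[rowIdx[perm[i]] + 1];
    }

    // A prefix sum turns the per-row counts into row offsets.
    std::partial_sum(out.rowPtr.begin(), out.rowPtr.end(),
                     out.rowPtr.begin());
    return out;
}
```

Because the entries are already in row-major order after the sort, no further CSR sort is needed, which matches the comment at the end of that branch.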
diff --git a/src/backend/cuda/sparse.hpp b/src/backend/cuda/sparse.hpp
new file mode 100644
index 0000000000..ae4f42ccf6
--- /dev/null
+++ b/src/backend/cuda/sparse.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T, af_storage stype>
+common::SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in);
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const common::SparseArray<T> &in);
+
+template<typename T, af_storage dest, af_storage src>
+common::SparseArray<T> sparseConvertStorageToStorage(
+    const common::SparseArray<T> &in);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sparse_arith.cu b/src/backend/cuda/sparse_arith.cu
new file mode 100644
index 0000000000..8a60aba4d3
--- /dev/null
+++ b/src/backend/cuda/sparse_arith.cu
@@ -0,0 +1,305 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sparse_arith.hpp>
+#include <sparse_arith.hpp>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <common/unique_handle.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <cusparse.hpp>
+#include <cusparse_descriptor_helpers.hpp>
+#include <handle.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <sparse.hpp>
+#include <sparse_handle.hpp>
+#include <where.hpp>
+
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+
+using namespace common;
+using std::numeric_limits;
+
+template<typename T>
+T getInf() {
+    return scalar<T>(numeric_limits<T>::infinity());
+}
+
+template<>
+cfloat getInf() {
+    return scalar<cfloat, float>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in CUDA
+}
+
+template<>
+cdouble getInf() {
+    return scalar<cdouble, double>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in CUDA
+}
+
+template<typename T, af_op_t op>
+Array<T> arithOpD(const SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    Array<T> out  = createEmptyArray<T>(dim4(0));
+    Array<T> zero = createValueArray<T>(rhs.dims(), scalar<T>(0));
+    switch (op) {
+        case af_add_t: out = copyArray<T>(rhs); break;
+        case af_sub_t:
+            out = reverse ? copyArray<T>(rhs)
+                          : arithOp<T, af_sub_t>(zero, rhs, rhs.dims());
+            break;
+        default: out = copyArray<T>(rhs);
+    }
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const Array<T> &rhs,
+                       const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    SparseArray<T> out = createArrayDataSparseArray<T>(
+        lhs.dims(), lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+        lhs.getStorage(), true);
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+#define SPARSE_ARITH_OP_FUNC_DEF(FUNC) \
+    template<typename T>               \
+    FUNC##_def<T> FUNC##_func();
+
+#define SPARSE_ARITH_OP_FUNC(FUNC, TYPE, INFIX)  \
+    template<>                                   \
+    FUNC##_def<TYPE> FUNC##_func<TYPE>() {       \
+        cusparseModule &_ = getCusparsePlugin(); \
+        return _.cusparse##INFIX##FUNC;          \
+    }
+
+#if CUSPARSE_VERSION >= 11000
+
+template<typename T>
+using csrgeam2_bufferSizeExt_def = cusparseStatus_t (*)(
+    cusparseHandle_t, int, int, const T *, const cusparseMatDescr_t, int,
+    const T *, const int *, const int *, const T *, const cusparseMatDescr_t,
+    int, const T *, const int *, const int *, const cusparseMatDescr_t,
+    const T *, const int *, const int *, size_t *);
+
+#define SPARSE_ARITH_OP_BUFFER_SIZE_FUNC_DEF(FUNC) \
+    template<typename T>                           \
+    FUNC##_def<T> FUNC##_func();
+
+SPARSE_ARITH_OP_BUFFER_SIZE_FUNC_DEF(csrgeam2_bufferSizeExt);
+
+#define SPARSE_ARITH_OP_BUFFER_SIZE_FUNC(FUNC, TYPE, INFIX) \
+    template<>                                              \
+    FUNC##_def<TYPE> FUNC##_func<TYPE>() {                  \
+        cusparseModule &_ = getCusparsePlugin();            \
+        return _.cusparse##INFIX##FUNC;                     \
+    }
+
+SPARSE_ARITH_OP_BUFFER_SIZE_FUNC(csrgeam2_bufferSizeExt, float, S);
+SPARSE_ARITH_OP_BUFFER_SIZE_FUNC(csrgeam2_bufferSizeExt, double, D);
+SPARSE_ARITH_OP_BUFFER_SIZE_FUNC(csrgeam2_bufferSizeExt, cfloat, C);
+SPARSE_ARITH_OP_BUFFER_SIZE_FUNC(csrgeam2_bufferSizeExt, cdouble, Z);
+
+template<typename T>
+using csrgeam2_def = cusparseStatus_t (*)(cusparseHandle_t, int, int, const T *,
+                                          const cusparseMatDescr_t, int,
+                                          const T *, const int *, const int *,
+                                          const T *, const cusparseMatDescr_t,
+                                          int, const T *, const int *,
+                                          const int *, const cusparseMatDescr_t,
+                                          T *, int *, int *, void *);
+
+SPARSE_ARITH_OP_FUNC_DEF(csrgeam2);
+
+SPARSE_ARITH_OP_FUNC(csrgeam2, float, S);
+SPARSE_ARITH_OP_FUNC(csrgeam2, double, D);
+SPARSE_ARITH_OP_FUNC(csrgeam2, cfloat, C);
+SPARSE_ARITH_OP_FUNC(csrgeam2, cdouble, Z);
+
+#else
+
+template<typename T>
+using csrgeam_def = cusparseStatus_t (*)(cusparseHandle_t, int, int, const T *,
+                                         const cusparseMatDescr_t, int,
+                                         const T *, const int *, const int *,
+                                         const T *, const cusparseMatDescr_t,
+                                         int, const T *, const int *,
+                                         const int *, const cusparseMatDescr_t,
+                                         T *, int *, int *);
+
+SPARSE_ARITH_OP_FUNC_DEF(csrgeam);
+
+SPARSE_ARITH_OP_FUNC(csrgeam, float, S);
+SPARSE_ARITH_OP_FUNC(csrgeam, double, D);
+SPARSE_ARITH_OP_FUNC(csrgeam, cfloat, C);
+SPARSE_ARITH_OP_FUNC(csrgeam, cdouble, Z);
+
+#endif
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const SparseArray<T> &rhs) {
+    cusparseModule &_ = getCusparsePlugin();
+    af::storage sfmt  = lhs.getStorage();
+    auto ldesc        = make_handle<cusparseMatDescr_t>();
+    auto rdesc        = make_handle<cusparseMatDescr_t>();
+    auto odesc        = make_handle<cusparseMatDescr_t>();
+
+    const dim4 ldims      = lhs.dims();
+    const int M           = ldims[0];
+    const int N           = ldims[1];
+    const dim_t nnzA      = lhs.getNNZ();
+    const dim_t nnzB      = rhs.getNNZ();
+    const int *csrRowPtrA = lhs.getRowIdx().get();
+    const int *csrColPtrA = lhs.getColIdx().get();
+    const int *csrRowPtrB = rhs.getRowIdx().get();
+    const int *csrColPtrB = rhs.getColIdx().get();
+
+    int baseC, nnzC = M + 1;
+
+    auto nnzDevHostPtr = memAlloc<int>(1);
+    auto outRowIdx     = createValueArray<int>(M + 1, 0);
+
+    T alpha = scalar<T>(1);
+    T beta  = op == af_sub_t ? scalar<T>(-1) : scalar<T>(1);
+
+    T *csrValC      = nullptr;
+    int *csrColIndC = nullptr;
+
+#if CUSPARSE_VERSION < 11000
+    CUSPARSE_CHECK(_.cusparseXcsrgeamNnz(
+        sparseHandle(), M, N, ldesc, nnzA, csrRowPtrA, csrColPtrA, rdesc, nnzB,
+        csrRowPtrB, csrColPtrB, odesc, outRowIdx.get(), nnzDevHostPtr.get()));
+#else
+    size_t pBufferSize = 0;
+
+    CUSPARSE_CHECK(csrgeam2_bufferSizeExt_func<T>()(
+        sparseHandle(), M, N, &alpha, ldesc, nnzA, lhs.getValues().get(),
+        csrRowPtrA, csrColPtrA, &beta, rdesc, nnzB, rhs.getValues().get(),
+        csrRowPtrB, csrColPtrB, odesc, csrValC, outRowIdx.get(), csrColIndC,
+        &pBufferSize));
+
+    auto tmpBuffer = memAlloc<char>(pBufferSize);
+    CUSPARSE_CHECK(_.cusparseXcsrgeam2Nnz(
+        sparseHandle(), M, N, ldesc, nnzA, csrRowPtrA, csrColPtrA, rdesc, nnzB,
+        csrRowPtrB, csrColPtrB, odesc, outRowIdx.get(), nnzDevHostPtr.get(),
+        tmpBuffer.get()));
+#endif
+    if (NULL != nnzDevHostPtr) {
+        CUDA_CHECK(cudaMemcpyAsync(&nnzC, nnzDevHostPtr.get(), sizeof(int),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+    } else {
+        CUDA_CHECK(cudaMemcpyAsync(&nnzC, outRowIdx.get() + M, sizeof(int),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaMemcpyAsync(&baseC, outRowIdx.get(), sizeof(int),
+                                   cudaMemcpyDeviceToHost, getActiveStream()));
+        CUDA_CHECK(cudaStreamSynchronize(cuda::getActiveStream()));
+        nnzC -= baseC;
+    }
+    auto outColIdx = createEmptyArray<int>(nnzC);
+    auto outValues = createEmptyArray<T>(nnzC);
+
+#if CUSPARSE_VERSION < 11000
+    CUSPARSE_CHECK(csrgeam_func<T>()(
+        sparseHandle(), M, N, &alpha, ldesc, nnzA, lhs.getValues().get(),
+        csrRowPtrA, csrColPtrA, &beta, rdesc, nnzB, rhs.getValues().get(),
+        csrRowPtrB, csrColPtrB, odesc, outValues.get(), outRowIdx.get(),
+        outColIdx.get()));
+#else
+    CUSPARSE_CHECK(csrgeam2_func<T>()(
+        sparseHandle(), M, N, &alpha, ldesc, nnzA, lhs.getValues().get(),
+        csrRowPtrA, csrColPtrA, &beta, rdesc, nnzB, rhs.getValues().get(),
+        csrRowPtrB, csrColPtrB, odesc, outValues.get(), outRowIdx.get(),
+        outColIdx.get(), tmpBuffer.get()));
+#endif
+    SparseArray<T> retVal = createArrayDataSparseArray(
+        ldims, outValues, outRowIdx, outColIdx, sfmt);
+    return retVal;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> arithOpD<T, af_add_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_sub_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_mul_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_div_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
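Reviewer note on the sparse-sparse `arithOp` above: it computes `C = alpha * A + beta * B` on CSR inputs, with `alpha = 1` and `beta = -1` for subtraction. The sketch below shows the same row-wise merge on the host, assuming column indices are sorted within each row. `CsrHost` and `csrAdd` are hypothetical names for illustration only. Unlike the cuSPARSE path, which first counts the output non-zeros and then fills preallocated buffers, this sketch grows the output in a single pass.

```cpp
#include <vector>

// Hypothetical host-side CSR container, not an ArrayFire type.
struct CsrHost {
    int rows;
    std::vector<int> rowPtr;
    std::vector<int> colIdx;
    std::vector<double> values;
};

// C = alpha * A + beta * B; the output pattern is the union of both inputs,
// so entries that cancel are kept as explicit zeros.
CsrHost csrAdd(double alpha, const CsrHost &A, double beta,
               const CsrHost &B) {
    CsrHost C{A.rows, std::vector<int>(A.rows + 1, 0), {}, {}};
    for (int r = 0; r < A.rows; ++r) {
        int a = A.rowPtr[r], aEnd = A.rowPtr[r + 1];
        int b = B.rowPtr[r], bEnd = B.rowPtr[r + 1];
        while (a < aEnd || b < bEnd) {
            if (b == bEnd || (a < aEnd && A.colIdx[a] < B.colIdx[b])) {
                // Column present only in A.
                C.colIdx.push_back(A.colIdx[a]);
                C.values.push_back(alpha * A.values[a]);
                ++a;
            } else if (a == aEnd || B.colIdx[b] < A.colIdx[a]) {
                // Column present only in B.
                C.colIdx.push_back(B.colIdx[b]);
                C.values.push_back(beta * B.values[b]);
                ++b;
            } else {
                // Column present in both inputs.
                C.colIdx.push_back(A.colIdx[a]);
                C.values.push_back(alpha * A.values[a] + beta * B.values[b]);
                ++a;
                ++b;
            }
        }
        C.rowPtr[r + 1] = static_cast<int>(C.colIdx.size());
    }
    return C;
}
```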
diff --git a/src/backend/cuda/sparse_arith.hpp b/src/backend/cuda/sparse_arith.hpp
new file mode 100644
index 0000000000..a3628df405
--- /dev/null
+++ b/src/backend/cuda/sparse_arith.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+// Functions cannot be overloaded on return type alone, so the dense-result
+// and sparse-result variants are given separate names.
+template<typename T, af_op_t op>
+Array<T> arithOpD(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const Array<T> &rhs, const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const common::SparseArray<T> &rhs);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sparse_blas.cu b/src/backend/cuda/sparse_blas.cu
new file mode 100644
index 0000000000..f0ef6a45c3
--- /dev/null
+++ b/src/backend/cuda/sparse_blas.cu
@@ -0,0 +1,243 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sparse_blas.hpp>
+
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <cudaDataType.hpp>
+#include <cuda_runtime.h>
+#include <cusparse.hpp>
+#include <cusparseModule.hpp>
+#include <cusparse_descriptor_helpers.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace cuda {
+
+cusparseOperation_t toCusparseTranspose(af_mat_prop opt) {
+    cusparseOperation_t out = CUSPARSE_OPERATION_NON_TRANSPOSE;
+    switch (opt) {
+        case AF_MAT_NONE: out = CUSPARSE_OPERATION_NON_TRANSPOSE; break;
+        case AF_MAT_TRANS: out = CUSPARSE_OPERATION_TRANSPOSE; break;
+        case AF_MAT_CTRANS: out = CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+    return out;
+}
+
+#if CUSPARSE_VERSION < 11300
+#define AF_CUSPARSE_SPMV_CSR_ALG1 CUSPARSE_CSRMV_ALG1
+#define AF_CUSPARSE_SPMV_ALG_DEFAULT CUSPARSE_MV_ALG_DEFAULT
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_CSRMM_ALG1
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_CSRMM_ALG1
+#elif CUSPARSE_VERSION < 11400
+#define AF_CUSPARSE_SPMV_CSR_ALG1 CUSPARSE_CSRMV_ALG1
+#define AF_CUSPARSE_SPMV_ALG_DEFAULT CUSPARSE_MV_ALG_DEFAULT
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_SPMM_CSR_ALG1
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_SPMM_CSR_ALG1
+#else
+#define AF_CUSPARSE_SPMV_CSR_ALG1 CUSPARSE_SPMV_CSR_ALG1
+#define AF_CUSPARSE_SPMV_ALG_DEFAULT CUSPARSE_SPMV_ALG_DEFAULT
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_SPMM_CSR_ALG1
+#define AF_CUSPARSE_SPMM_CSR_ALG1 CUSPARSE_SPMM_CSR_ALG1
+#endif
+
+#if defined(AF_USE_NEW_CUSPARSE_API)
+
+template<typename T>
+size_t spmvBufferSize(cusparseOperation_t opA, const T *alpha,
+                      const cusparseSpMatDescr_t matA,
+                      const cusparseDnVecDescr_t vecX, const T *beta,
+                      const cusparseDnVecDescr_t vecY) {
+    size_t retVal     = 0;
+    cusparseModule &_ = getCusparsePlugin();
+    CUSPARSE_CHECK(_.cusparseSpMV_bufferSize(
+        sparseHandle(), opA, alpha, matA, vecX, beta, vecY, getComputeType<T>(),
+        AF_CUSPARSE_SPMV_CSR_ALG1, &retVal));
+    return retVal;
+}
+
+template<typename T>
+void spmv(cusparseOperation_t opA, const T *alpha,
+          const cusparseSpMatDescr_t matA, const cusparseDnVecDescr_t vecX,
+          const T *beta, const cusparseDnVecDescr_t vecY, void *buffer) {
+    cusparseModule &_ = getCusparsePlugin();
+    CUSPARSE_CHECK(_.cusparseSpMV(sparseHandle(), opA, alpha, matA, vecX, beta,
+                                  vecY, getComputeType<T>(),
+                                  AF_CUSPARSE_SPMV_ALG_DEFAULT, buffer));
+}
+
+template<typename T>
+size_t spmmBufferSize(cusparseOperation_t opA, cusparseOperation_t opB,
+                      const T *alpha, const cusparseSpMatDescr_t matA,
+                      const cusparseDnMatDescr_t matB, const T *beta,
+                      const cusparseDnMatDescr_t matC) {
+    size_t retVal     = 0;
+    cusparseModule &_ = getCusparsePlugin();
+    CUSPARSE_CHECK(_.cusparseSpMM_bufferSize(
+        sparseHandle(), opA, opB, alpha, matA, matB, beta, matC,
+        getComputeType<T>(), AF_CUSPARSE_SPMM_CSR_ALG1, &retVal));
+    return retVal;
+}
+
+template<typename T>
+void spmm(cusparseOperation_t opA, cusparseOperation_t opB, const T *alpha,
+          const cusparseSpMatDescr_t matA, const cusparseDnMatDescr_t matB,
+          const T *beta, const cusparseDnMatDescr_t matC, void *buffer) {
+    cusparseModule &_ = getCusparsePlugin();
+    CUSPARSE_CHECK(_.cusparseSpMM(sparseHandle(), opA, opB, alpha, matA, matB,
+                                  beta, matC, getComputeType<T>(),
+                                  AF_CUSPARSE_SPMM_CSR_ALG1, buffer));
+}
+
+#else
+
+template<typename T>
+struct csrmv_func_def_t {
+    typedef cusparseStatus_t (*csrmv_func_def)(
+        cusparseHandle_t handle, cusparseOperation_t transA, int m, int n,
+        int k, const T *alpha, const cusparseMatDescr_t descrA,
+        const T *csrValA, const int *csrRowPtrA, const int *csrColIndA,
+        const T *x, const T *beta, T *y);
+};
+
+template<typename T>
+struct csrmm_func_def_t {
+    typedef cusparseStatus_t (*csrmm_func_def)(
+        cusparseHandle_t handle, cusparseOperation_t transA, int m, int n,
+        int k, int nnz, const T *alpha, const cusparseMatDescr_t descrA,
+        const T *csrValA, const int *csrRowPtrA, const int *csrColIndA,
+        const T *B, int ldb, const T *beta, T *C, int ldc);
+};
+
+#define SPARSE_FUNC_DEF(FUNC) \
+    template<typename T>      \
+    typename FUNC##_func_def_t<T>::FUNC##_func_def FUNC##_func();
+
+#define SPARSE_FUNC(FUNC, TYPE, PREFIX)                                     \
+    template<>                                                              \
+    typename FUNC##_func_def_t<TYPE>::FUNC##_func_def FUNC##_func<TYPE>() { \
+        cusparseModule &_ = getCusparsePlugin();                            \
+        return (FUNC##_func_def_t<TYPE>::FUNC##_func_def) &                 \
+               _.cusparse##PREFIX##FUNC;                                    \
+    }
+
+SPARSE_FUNC_DEF(csrmm)
+SPARSE_FUNC(csrmm, float, S)
+SPARSE_FUNC(csrmm, double, D)
+SPARSE_FUNC(csrmm, cfloat, C)
+SPARSE_FUNC(csrmm, cdouble, Z)
+
+SPARSE_FUNC_DEF(csrmv)
+SPARSE_FUNC(csrmv, float, S)
+SPARSE_FUNC(csrmv, double, D)
+SPARSE_FUNC(csrmv, cfloat, C)
+SPARSE_FUNC(csrmv, cdouble, Z)
+
+#undef SPARSE_FUNC
+#undef SPARSE_FUNC_DEF
+
+#endif
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    // Similar Operations to GEMM
+    cusparseOperation_t lOpts = toCusparseTranspose(optLhs);
+
+    int lRowDim = (lOpts == CUSPARSE_OPERATION_NON_TRANSPOSE) ? 0 : 1;
+    // int lColDim = (lOpts == CUSPARSE_OPERATION_NON_TRANSPOSE) ? 1 : 0;
+    static const int rColDim = 1;  // Unsupported : (rOpts ==
+                                   // CUSPARSE_OPERATION_NON_TRANSPOSE) ? 1 : 0;
+
+    dim4 lDims = lhs.dims();
+    dim4 rDims = rhs.dims();
+    int M      = lDims[lRowDim];
+    int N      = rDims[rColDim];
+    // int K = lDims[lColDim];
+
+    Array<T> out = createEmptyArray<T>(af::dim4(M, N, 1, 1));
+    T alpha      = scalar<T>(1);
+    T beta       = scalar<T>(0);
+
+    dim4 rStrides = rhs.strides();
+
+#if defined(AF_USE_NEW_CUSPARSE_API)
+
+    auto spMat = cusparseDescriptor<T>(lhs);
+
+    if (rDims[rColDim] == 1) {
+        auto dnVec = denVecDescriptor<T>(rhs);
+        auto dnOut = denVecDescriptor<T>(out);
+        size_t bufferSize =
+            spmvBufferSize<T>(lOpts, &alpha, spMat, dnVec, &beta, dnOut);
+        auto tempBuffer = createEmptyArray<char>(dim4(bufferSize));
+        spmv<T>(lOpts, &alpha, spMat, dnVec, &beta, dnOut, tempBuffer.get());
+    } else {
+        cusparseOperation_t rOpts = toCusparseTranspose(optRhs);
+
+        auto dnMat = denMatDescriptor<T>(rhs);
+        auto dnOut = denMatDescriptor<T>(out);
+        size_t bufferSize =
+            spmmBufferSize<T>(lOpts, rOpts, &alpha, spMat, dnMat, &beta, dnOut);
+        auto tempBuffer = createEmptyArray<char>(dim4(bufferSize));
+        spmm<T>(lOpts, rOpts, &alpha, spMat, dnMat, &beta, dnOut,
+                tempBuffer.get());
+    }
+
+#else
+
+    cusparseModule &_ = getCusparsePlugin();
+    // Create Sparse Matrix Descriptor
+    cusparseMatDescr_t descr = 0;
+    CUSPARSE_CHECK(_.cusparseCreateMatDescr(&descr));
+    CUSPARSE_CHECK(_.cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL));
+    CUSPARSE_CHECK(_.cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO));
+
+    // Call Matrix-Vector or Matrix-Matrix
+    // Note:
+    // Do not use M, N, K here. Use lDims and rDims instead.
+    // This is because the function wants row/col of A
+    // and not OP(A) (gemm wants row/col of OP(A)).
+    if (rDims[rColDim] == 1) {
+        CUSPARSE_CHECK(csrmv_func<T>()(
+            sparseHandle(), lOpts, lDims[0], lDims[1], lhs.getNNZ(), &alpha,
+            descr, lhs.getValues().get(), lhs.getRowIdx().get(),
+            lhs.getColIdx().get(), rhs.get(), &beta, out.get()));
+    } else {
+        CUSPARSE_CHECK(csrmm_func<T>()(
+            sparseHandle(), lOpts, lDims[0], rDims[rColDim], lDims[1],
+            lhs.getNNZ(), &alpha, descr, lhs.getValues().get(),
+            lhs.getRowIdx().get(), lhs.getColIdx().get(), rhs.get(),
+            rStrides[1], &beta, out.get(), out.dims()[0]));
+    }
+    CUSPARSE_CHECK(_.cusparseDestroyMatDescr(descr));
+
+#endif
+
+    return out;
+}
+
+#define INSTANTIATE_SPARSE(T)                                            \
+    template Array<T> matmul<T>(const common::SparseArray<T> &lhs,       \
+                                const Array<T> &rhs, af_mat_prop optLhs, \
+                                af_mat_prop optRhs);
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+}  // namespace cuda
+}  // namespace arrayfire
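Reviewer note on `matmul` above: when the right-hand side has a single column, the code dispatches to a sparse matrix-vector product of the form `y = alpha * A * x + beta * y`, called with `alpha = 1` and `beta = 0`. A minimal host-side sketch of that operation for a non-transposed CSR matrix follows. `csrSpmv` is a hypothetical helper written for illustration and is not an ArrayFire or cuSPARSE function.

```cpp
#include <cstddef>
#include <vector>

// y = alpha * A * x + beta * y for a non-transposed CSR matrix A.
// y is taken by value so the caller's vector is left untouched.
std::vector<double> csrSpmv(double alpha, const std::vector<int> &rowPtr,
                            const std::vector<int> &colIdx,
                            const std::vector<double> &values,
                            const std::vector<double> &x, double beta,
                            std::vector<double> y) {
    const std::size_t rows = rowPtr.size() - 1;
    for (std::size_t r = 0; r < rows; ++r) {
        // Dot product of row r with x, touching only stored entries.
        double acc = 0.0;
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k) {
            acc += values[k] * x[colIdx[k]];
        }
        y[r] = alpha * acc + beta * y[r];
    }
    return y;
}
```

With `beta = 0` the initial contents of `y` do not contribute to the result in this sketch, so the output only needs the right size.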
diff --git a/src/backend/cuda/sparse_blas.hpp b/src/backend/cuda/sparse_blas.hpp
new file mode 100644
index 0000000000..d4b41defd0
--- /dev/null
+++ b/src/backend/cuda/sparse_blas.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/sum.cu b/src/backend/cuda/sum.cu
index 4f668742b6..6a52c2c369 100644
--- a/src/backend/cuda/sum.cu
+++ b/src/backend/cuda/sum.cu
@@ -7,17 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace cuda
-{
-    //sum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+// sum
+INSTANTIATE(af_add_t, float, float)
+INSTANTIATE(af_add_t, double, double)
+INSTANTIATE(af_add_t, cfloat, cfloat)
+INSTANTIATE(af_add_t, cdouble, cdouble)
+INSTANTIATE(af_add_t, int, int)
+INSTANTIATE(af_add_t, int, float)
+INSTANTIATE(af_add_t, uint, uint)
+INSTANTIATE(af_add_t, uint, float)
+INSTANTIATE(af_add_t, intl, intl)
+INSTANTIATE(af_add_t, intl, double)
+INSTANTIATE(af_add_t, uintl, uintl)
+INSTANTIATE(af_add_t, uintl, double)
+INSTANTIATE(af_add_t, char, int)
+INSTANTIATE(af_add_t, char, float)
+INSTANTIATE(af_add_t, schar, int)
+INSTANTIATE(af_add_t, schar, float)
+INSTANTIATE(af_add_t, uchar, uint)
+INSTANTIATE(af_add_t, uchar, float)
+INSTANTIATE(af_add_t, short, int)
+INSTANTIATE(af_add_t, short, float)
+INSTANTIATE(af_add_t, ushort, uint)
+INSTANTIATE(af_add_t, ushort, float)
+INSTANTIATE(af_add_t, half, half)
+INSTANTIATE(af_add_t, half, float)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/surface.cpp b/src/backend/cuda/surface.cpp
new file mode 100644
index 0000000000..61f3457036
--- /dev/null
+++ b/src/backend/cuda/surface.cpp
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
+#include <err_cuda.hpp>
+#include <surface.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface) {
+    auto stream = getActiveStream();
+    if (DeviceManager::checkGraphicsInteropCapability()) {
+        const T *d_P = P.get();
+
+        auto res = interopManager().getSurfaceResources(surface);
+
+        size_t bytes = 0;
+        T *d_vbo     = NULL;
+        cudaGraphicsMapResources(1, res[0].get(), stream);
+        cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &bytes,
+                                             *(res[0].get()));
+        cudaMemcpyAsync(d_vbo, d_P, bytes, cudaMemcpyDeviceToDevice, stream);
+        cudaGraphicsUnmapResources(1, res[0].get(), stream);
+
+        CheckGL("After cuda resource copy");
+
+        POST_LAUNCH_CHECK();
+    } else {
+        ForgeModule &_ = forgePlugin();
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_surface_vertex_buffer(&buffer, surface));
+        FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+        CheckGL("Begin CUDA fallback-resource copy");
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, P.get(), bytes,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+        CheckGL("End CUDA fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T) \
+    template void copy_surface<T>(const Array<T> &, fg_surface);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/surface.hpp b/src/backend/cuda/surface.hpp
new file mode 100644
index 0000000000..896344c73b
--- /dev/null
+++ b/src/backend/cuda/surface.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface);
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/susan.cpp b/src/backend/cuda/susan.cpp
new file mode 100644
index 0000000000..5f1d07d913
--- /dev/null
+++ b/src/backend/cuda/susan.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <susan.hpp>
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <kernel/susan.hpp>
+#include <af/features.h>
+
+#include <algorithm>
+
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out, Array<float> &resp_out,
+               const Array<T> &in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge) {
+    dim4 idims = in.dims();
+
+    const unsigned corner_lim = in.elements() * feature_ratio;
+    auto x_corners            = memAlloc<float>(corner_lim);
+    auto y_corners            = memAlloc<float>(corner_lim);
+    auto resp_corners         = memAlloc<float>(corner_lim);
+
+    auto resp              = memAlloc<T>(in.elements());
+    unsigned corners_found = 0;
+
+    kernel::susan_responses<T>(resp.get(), in.get(), idims[0], idims[1], radius,
+                               diff_thr, geom_thr, edge);
+
+    kernel::nonMaximal<T>(x_corners.get(), y_corners.get(), resp_corners.get(),
+                          &corners_found, idims[0], idims[1], resp.get(), edge,
+                          corner_lim);
+
+    const unsigned corners_out = std::min(corners_found, corner_lim);
+    if (corners_out == 0) {
+        x_out    = createEmptyArray<float>(dim4());
+        y_out    = createEmptyArray<float>(dim4());
+        resp_out = createEmptyArray<float>(dim4());
+        return 0;
+    } else {
+        x_out = createDeviceDataArray<float>(
+            dim4(corners_out), static_cast<void *>(x_corners.get()));
+        y_out = createDeviceDataArray<float>(
+            dim4(corners_out), static_cast<void *>(y_corners.get()));
+        resp_out = createDeviceDataArray<float>(
+            dim4(corners_out), static_cast<void *>(resp_corners.get()));
+        x_corners.release();
+        y_corners.release();
+        resp_corners.release();
+        return corners_out;
+    }
+}
+
+#define INSTANTIATE(T)                                                        \
+    template unsigned susan<T>(                                               \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & resp_out,  \
+        const Array<T> &in, const unsigned radius, const float diff_thr,      \
+        const float geom_thr, const float feature_ratio, const unsigned edge);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/susan.hpp b/src/backend/cuda/susan.hpp
new file mode 100644
index 0000000000..2266320485
--- /dev/null
+++ b/src/backend/cuda/susan.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out, Array<float> &resp_out,
+               const Array<T> &in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/svd.cpp b/src/backend/cuda/svd.cpp
new file mode 100644
index 0000000000..6ec71739ba
--- /dev/null
+++ b/src/backend/cuda/svd.cpp
@@ -0,0 +1,118 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/err_common.hpp>
+#include <svd.hpp>
+
+#include <algorithm>
+#include <copy.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include "transpose.hpp"
+
+#include <cusolverDn.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+cusolverStatus_t gesvd_buf_func(cusolverDnHandle_t /*handle*/, int /*m*/,
+                                int /*n*/, int * /*Lwork*/) {
+    return CUSOLVER_STATUS_ARCH_MISMATCH;
+}
+
+template<typename T, typename Tr>
+cusolverStatus_t gesvd_func(cusolverDnHandle_t /*handle*/, char /*jobu*/,
+                            char /*jobvt*/, int /*m*/, int /*n*/, T * /*A*/,
+                            int /*lda*/, Tr * /*S*/, T * /*U*/, int /*ldu*/,
+                            T * /*VT*/, int /*ldvt*/, T * /*Work*/,
+                            int /*Lwork*/, Tr * /*rwork*/, int * /*devInfo*/) {
+    return CUSOLVER_STATUS_ARCH_MISMATCH;
+}
+
+#define SVD_SPECIALIZE(T, Tr, X)                                         \
+    template<>                                                           \
+    cusolverStatus_t gesvd_buf_func<T>(cusolverDnHandle_t handle, int m, \
+                                       int n, int *Lwork) {              \
+        return cusolverDn##X##gesvd_bufferSize(handle, m, n, Lwork);     \
+    }
+
+SVD_SPECIALIZE(float, float, S);
+SVD_SPECIALIZE(double, double, D);
+SVD_SPECIALIZE(cfloat, float, C);
+SVD_SPECIALIZE(cdouble, double, Z);
+
+#undef SVD_SPECIALIZE
+
+#define SVD_SPECIALIZE(T, Tr, X)                                              \
+    template<>                                                                \
+    cusolverStatus_t gesvd_func<T, Tr>(                                       \
+        cusolverDnHandle_t handle, char jobu, char jobvt, int m, int n, T *A, \
+        int lda, Tr *S, T *U, int ldu, T *VT, int ldvt, T *Work, int Lwork,   \
+        Tr *rwork, int *devInfo) {                                            \
+        return cusolverDn##X##gesvd(handle, jobu, jobvt, m, n, A, lda, S, U,  \
+                                    ldu, VT, ldvt, Work, Lwork, rwork,        \
+                                    devInfo);                                 \
+    }
+
+SVD_SPECIALIZE(float, float, S);
+SVD_SPECIALIZE(double, double, D);
+SVD_SPECIALIZE(cfloat, float, C);
+SVD_SPECIALIZE(cdouble, double, Z);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    int lwork = 0;
+
+    CUSOLVER_CHECK(gesvd_buf_func<T>(solverDnHandle(), M, N, &lwork));
+
+    auto lWorkspace = memAlloc<T>(lwork);
+    auto rWorkspace = memAlloc<Tr>(5 * std::min(M, N));
+
+    auto info = memAlloc<int>(1);
+
+    CUSOLVER_CHECK(gesvd_func<T, Tr>(
+        solverDnHandle(), 'A', 'A', M, N, in.get(), M, s.get(), u.get(), M,
+        vt.get(), N, lWorkspace.get(), lwork, rWorkspace.get(), info.get()));
+}
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    if (M >= N) {
+        Array<T> in_copy = copyArray(in);
+        svdInPlace(s, u, vt, in_copy);
+    } else {
+        Array<T> in_trans = transpose(in, true);
+        svdInPlace(s, vt, u, in_trans);
+        transpose_inplace(vt, true);
+        transpose_inplace(u, true);
+    }
+}
+
+#define INSTANTIATE(T, Tr)                                               \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/svd.hpp b/src/backend/cuda/svd.hpp
new file mode 100644
index 0000000000..21cd52b684
--- /dev/null
+++ b/src/backend/cuda/svd.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/threadsMgt.hpp b/src/backend/cuda/threadsMgt.hpp
new file mode 100644
index 0000000000..147dff5586
--- /dev/null
+++ b/src/backend/cuda/threadsMgt.hpp
@@ -0,0 +1,329 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace cuda {
+// OVERALL USAGE (With looping):
+// ...                                                      // OWN CODE
+// threadsMgt<T> th(...);                                   // backend.hpp
+// const dim3 threads{th.genThreads()};                     // backend.hpp
+// const dim3 blocks{th.genBlocks(threads,..)};             // backend.hpp
+// arrayfire::cuda::Kernel KER{GETKERNEL(..., th.loop0, th.loop1, th.loop2,
+//                               th.loop3)};                // OWN CODE
+// KER(threads,blocks,...);                                 // OWN CODE
+// ...                                                      // OWN CODE
+//
+// OVERALL USAGE (without looping):
+// ...                                                      // OWN CODE
+// threadsMgt<T> th(...);                                   // backend.hpp
+// const dim3 threads{th.genThreads()};                     // backend.hpp
+// const dim3 blocks{th.genBlocksFull(threads)};            // backend.hpp
+// arrayfire::cuda::Kernel KER{GETKERNEL(...)};             // OWN CODE
+// KER(threads,blocks,...);                                 // OWN CODE
+// ...                                                      // OWN CODE
+template<typename T>
+class threadsMgt {
+   public:
+    bool loop0, loop1, loop2, loop3;
+
+   private:
+    const unsigned d0, d1, d2, d3;
+    const T ndims;
+    const unsigned maxParallelThreads;
+
+   public:
+    // INPUT: dims of the output array
+    // INPUT: ndims of previous dims
+    threadsMgt(const T dims[4], const T ndims);
+
+    // Generate optimal thread values
+    inline const dim3 genThreads() const;
+
+    // INPUT threads, generated by genThreads()
+    // OUTPUT blocks, supposing that each element results in 1 thread
+    inline dim3 genBlocksFull(const dim3& threads) const;
+
+    // Generate the optimal block values
+    // INPUT threads, generated by genThreads()
+    // INPUT nrInputs = number of input buffers read by kernel in parallel
+    // INPUT nrOutputs = number of output buffers written by kernel in parallel
+    // INPUT totalSize = size of all input arrays and all output arrays together
+    // INPUT sizeofT = size of 1 element TO BE WRITTEN
+    // OUTPUT blocks, assuming that the previously calculated loopings will be
+    // executed in the kernel
+    inline dim3 genBlocks(const dim3& threads, const unsigned nrInputs,
+                          const unsigned nrOutputs, const size_t totalSize,
+                          const size_t sizeofT);
+};
+
+// INPUT: dims of the output array
+// INPUT: ndims of previous dims
+template<typename T>
+threadsMgt<T>::threadsMgt(const T dims[4], const T ndims)
+    : loop0(false)
+    , loop1(false)
+    , loop2(false)
+    , loop3(false)
+    , d0(static_cast<unsigned>(dims[0]))
+    , d1(static_cast<unsigned>(dims[1]))
+    , d2(static_cast<unsigned>(dims[2]))
+    , d3(static_cast<unsigned>(dims[3]))
+    , ndims(ndims)
+    , maxParallelThreads(getMaxParallelThreads(getActiveDeviceId())){};
+
+// Generate optimal thread values
+template<typename T>
+const dim3 threadsMgt<T>::genThreads() const {
+    // Performance is mainly dependent on:
+    //    - reducing memory latency, by preferring a sequential read of
+    //    cachelines (principally dim0)
+    //    - more parallel threads --> higher occupation of available
+    //    threads
+    //    - more I/O operations per thread --> dims[3] indicates the #
+    //    of I/Os handled by the kernel inside each thread, and outside
+    //    the scope of the block scheduler
+    // High performance is achievable with occupation rates as low as
+    // 30%. Here we aim at 50%, to also cover older hardware with slower
+    // cores.
+    // https://stackoverflow.com/questions/7737772/improving-kernel-performance-by-increasing-occupancy
+    // http://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf
+    // https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf
+    // https://en.wikipedia.org/wiki/Graphics_Core_Next#SIMD_Vector_Unit
+
+    // The performance for vectors is independent from array sizes.
+    if ((d1 == 1) & (d2 == 1)) return dim3(128U);
+
+    // TOTAL OCCUPATION = occup(dim0) * occup(dim1) * occup(dim2).
+    // For linearized arrays, each linear block is allocated to a dim,
+    // resulting in large numbers for dim0 & dim1.
+    // - For dim2, we only return exact divisors of the array dims[2], so
+    // occup(dim2)=100%
+    // - For dim0 & dim1, we aim somewhere between 30% and 50%
+    //      * Having 2 blocks filled + 1 thread in block 3 --> occup >
+    //      2/3=66%
+    //      * Having 3 blocks filled + 1 thread in block 4 --> occup >
+    //      3/4=75%
+    //      * Having 4 blocks filled + 1 thread in block 5 --> occup >
+    //      4/5=80%
+    constexpr unsigned OCCUPANCY_FACTOR{2U};  // at least 2 blocks filled
+
+    // NVIDIA:
+    //  warp             = 32
+    //  possible blocks  = [32, 64, 96, 128, 160, 192, 224, 256, .. 1024]
+    //  best performance = [32, 64, 96, 128]
+    //  optimal perf     = 128; any combination
+    //   NVIDIA always processes full warps.
+    //   Allocating partial warps (<32) reduces throughput.
+    //   Performance reaches a plateau from 128, with a slight
+    //   slowdown for very large sizes.
+    // For algorithm below:
+    //  parallelThreads  = [32, 64, 96, 128]
+    constexpr unsigned minThreads{32};
+    const unsigned relevantElements{d0 * d1 * d2};
+    constexpr unsigned warp{32};
+
+    // For small arrays, we reduce the maximum threads in 1 block to
+    // improve parallelism.  In the worst case the scheduler can have 1
+    // block per CU, even when only partly loaded. Range for block is:
+    // [minThreads ... 4 * warp multiple]
+    //   * NVIDIA: [4*32=128 threads]
+    // At 4 * warp multiple, full wavefronts (queue of 4 partial
+    // wavefronts) are all occupied.
+
+    // We need at least maxParallelThreads to occupy all the CUs.
+    const unsigned parallelThreads{
+        relevantElements <= maxParallelThreads
+            ? minThreads
+            : std::min(4U, relevantElements / maxParallelThreads) * warp};
+
+    // Priority 1: keep cachelines filled.  Apparently sharing
+    // cachelines between CUs has a heavy cost. Testing confirmed that
+    // the occupation is mostly > 50%
+    const unsigned threads0{d0 == 1 ? 1
+                            : d0 <= minThreads
+                                ? minThreads  // better distribution
+                                : std::min(128U, (divup(d0, warp) * warp))};
+
+    // Priority 2: Fill the block, while respecting the occupation limit
+    // (>66%) (through parallelThreads limit)
+    const unsigned threads1{
+        (threads0 * 64U <= parallelThreads) &&
+                (!(d1 & (64U - 1U)) || (d1 > OCCUPANCY_FACTOR * 64U))
+            ? 64U
+        : (threads0 * 32U <= parallelThreads) &&
+                (!(d1 & (32U - 1U)) || (d1 > OCCUPANCY_FACTOR * 32U))
+            ? 32U
+        : (threads0 * 16U <= parallelThreads) &&
+                (!(d1 & (16U - 1U)) || (d1 > OCCUPANCY_FACTOR * 16U))
+            ? 16U
+        : (threads0 * 8U <= parallelThreads) &&
+                (!(d1 & (8U - 1U)) || (d1 > OCCUPANCY_FACTOR * 8U))
+            ? 8U
+        : (threads0 * 4U <= parallelThreads) &&
+                (!(d1 & (4U - 1U)) || (d1 > OCCUPANCY_FACTOR * 4U))
+            ? 4U
+        : (threads0 * 2U <= parallelThreads) &&
+                (!(d1 & (2U - 1U)) || (d1 > OCCUPANCY_FACTOR * 2U))
+            ? 2U
+            : 1U};
+
+    const unsigned threads01{threads0 * threads1};
+    if ((d2 == 1) | (threads01 * 2 > parallelThreads))
+        return dim3(threads0, threads1);
+
+    // Priority 3: Only exact divisors are used, so that
+    //  - overflow checking is not needed in the kernel.
+    //  - the occupation rate is never reduced
+    // Chances are low that threads2 will be different from 1.
+    const unsigned threads2{
+        (threads01 * 8 <= parallelThreads) && !(d2 & (8U - 1U))   ? 8U
+        : (threads01 * 4 <= parallelThreads) && !(d2 & (4U - 1U)) ? 4U
+        : (threads01 * 2 <= parallelThreads) && !(d2 & (2U - 1U)) ? 2U
+                                                                  : 1U};
+    return dim3(threads0, threads1, threads2);
+};
+
+// INPUT threads, generated by genThreads()
+// OUTPUT blocks, supposing that each element results in 1 thread
+template<typename T>
+inline dim3 threadsMgt<T>::genBlocksFull(const dim3& threads) const {
+    // One thread per element, so round each dimension up to a whole
+    // number of blocks.
+    return dim3(divup(d0, threads.x), divup(d1, threads.y),
+                divup(d2, threads.z));
+};
+
+// Generate the optimal block values
+// INPUT threads, generated by genThreads()
+// INPUT nrInputs = number of input buffers read by kernel in parallel
+// INPUT nrOutputs = number of output buffers written by kernel in parallel
+// INPUT totalSize = size of all input arrays and all output arrays together
+// INPUT sizeofT = size of 1 element TO BE WRITTEN
+// OUTPUT blocks, assuming that the previously calculated loopings will be
+// executed in the kernel
+template<typename T>
+inline dim3 threadsMgt<T>::genBlocks(const dim3& threads,
+                                     const unsigned nrInputs,
+                                     const unsigned nrOutputs,
+                                     const size_t totalSize,
+                                     const size_t sizeofT) {
+    // The bottleneck of any kernel is dependent on the type of memory
+    // used.
+    // a) For very small arrays (elements < maxParallelThreads), each
+    //  element receives its own thread.
+    // b) For arrays (in+out) smaller than 3/2 L2cache, memory access no
+    //  longer is the bottleneck, because enough L2cache is available at any
+    //  time. Threads are limited to reduce scheduling overhead.
+    // c) For very large arrays and type sizes (<long double), 1 thread will
+    //  not generate enough data to keep the memory sync mechanism
+    //  saturated, so we start looping inside each thread.
+    dim3 blocks{1};
+    const int activeDeviceId{getActiveDeviceId()};
+    const unsigned* maxGridSize{
+        reinterpret_cast<const unsigned*>(getMaxGridSize(activeDeviceId))};
+    const size_t L2CacheSize{getL2CacheSize(activeDeviceId)};
+    const unsigned cacheLine{getMemoryBusWidth(activeDeviceId)};
+    const unsigned multiProcessorCount{getMultiProcessorCount(activeDeviceId)};
+    const unsigned maxThreads{maxParallelThreads *
+                              (sizeofT * nrInputs * nrInputs > 8 ? 1 : 2)};
+
+    if (ndims == 1) {
+        if (d0 > maxThreads) {
+            if (totalSize * 2 > L2CacheSize * 3) {
+                // General formula to calculate best #loops
+                // Dedicated GPUs:
+                //  32/sizeof(T)**2/#outBuffers*(3/4)**(#inBuffers-1)
+                // Integrated GPUs:
+                //  4/sizeof(T)/#outBuffers*(3/4)**(#inBuffers-1)
+                unsigned largeVolDivider{cacheLine == 64
+                                             ? sizeofT == 1   ? 4
+                                               : sizeofT == 2 ? 2
+                                                              : 1
+                                             : (sizeofT == 1   ? 32
+                                                : sizeofT == 2 ? 8
+                                                               : 1) /
+                                                   nrOutputs};
+                for (unsigned i{1}; i < nrInputs; ++i)
+                    largeVolDivider = largeVolDivider * 3 / 4;
+                if (largeVolDivider > 1) {
+                    blocks.x = d0 / (largeVolDivider * threads.x);
+                    if (blocks.x == 0) blocks.x = 1;
+                    loop0 = true;
+                }
+            } else {
+                // A reduction to (1|2*)maxParallelThreads will be
+                // performed
+                blocks.x = maxThreads / threads.x;
+                if (blocks.x == 0) blocks.x = 1;
+                loop0 = true;
+            }
+        }
+        if (!loop0) { blocks.x = divup(d0, threads.x); }
+    } else {
+        loop3    = d3 != 1;
+        blocks.x = divup(d0, threads.x);
+        blocks.z = divup(d2, threads.z);
+        // contains the mandatory loops introduced by dim3 and dim2
+        // gridSize overflow
+        unsigned dim2and3Multiplier{d3};
+        if (blocks.z > maxGridSize[2]) {
+            dim2and3Multiplier = dim2and3Multiplier * blocks.z / maxGridSize[2];
+            blocks.z           = maxGridSize[2];
+            loop2              = true;
+        }
+        if ((d1 > threads.y) &
+            (threads.x * blocks.x * d1 * threads.z * blocks.z > maxThreads)) {
+            if ((d0 * sizeofT * 8 > cacheLine * multiProcessorCount) &
+                (totalSize * 2 > L2CacheSize * 3)) {
+                // General formula to calculate best #loops
+                // Dedicated GPUs:
+                //  32/sizeof(T)**2/#outBuffers*(3/4)**(#inBuffers-1)
+                // Integrated GPUs:
+                //  4/sizeof(T)/#outBuffers*(3/4)**(#inBuffers-1)
+                unsigned largeVolDivider{
+                    cacheLine == 64 ? sizeofT == 1   ? 4
+                                      : sizeofT == 2 ? 2
+                                                     : 1
+                                    : (sizeofT == 1   ? 32
+                                       : sizeofT == 2 ? 8
+                                       : sizeofT == 4 ? 2
+                                                      : 1) /
+                                          (dim2and3Multiplier * nrOutputs)};
+                for (unsigned i{1}; i < nrInputs; ++i)
+                    largeVolDivider = largeVolDivider * 3 / 4;
+                if (largeVolDivider > 1) {
+                    blocks.y = d1 / (largeVolDivider * threads.y);
+                    if (blocks.y == 0) blocks.y = 1;
+                    loop1 = true;
+                }
+            } else {
+                // A reduction to (1|2*)maxParallelThreads will be
+                // performed
+                blocks.y = maxThreads / (threads.x * blocks.x * threads.z *
+                                         blocks.z * threads.y);
+                if (blocks.y == 0) blocks.y = 1;
+                loop1 = true;
+            }
+        }
+        if (!loop1) { blocks.y = divup(d1, threads.y); }
+        // Check on new overflows
+        if (blocks.y > maxGridSize[1]) {
+            blocks.y = maxGridSize[1];
+            loop1    = true;
+        }
+    }
+
+    return blocks;
+};
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/thrust_utils.hpp b/src/backend/cuda/thrust_utils.hpp
new file mode 100644
index 0000000000..0646b934ba
--- /dev/null
+++ b/src/backend/cuda/thrust_utils.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <ThrustArrayFirePolicy.hpp>
+#include <thrust/system/cuda/detail/par.h>
+#include <thrust/version.h>
+#include <ThrustAllocator.cuh>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+using ThrustVector = thrust::device_vector<T, ThrustAllocator<T>>;
+}  // namespace cuda
+}  // namespace arrayfire
+
+#define THRUST_SELECT(fn, ...) \
+    fn(arrayfire::cuda::ThrustArrayFirePolicy(), __VA_ARGS__)
+#define THRUST_SELECT_OUT(res, fn, ...) \
+    res = fn(arrayfire::cuda::ThrustArrayFirePolicy(), __VA_ARGS__)
diff --git a/src/backend/cuda/tile.cpp b/src/backend/cuda/tile.cpp
new file mode 100644
index 0000000000..edd2a7b686
--- /dev/null
+++ b/src/backend/cuda/tile.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <tile.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/tile.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims *= tileDims;
+
+    if (iDims.elements() == 0 || oDims.elements() == 0) {
+        AF_ERROR("Elements are 0", AF_ERR_SIZE);
+    }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::tile<T>(out, in);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/tile.cu b/src/backend/cuda/tile.cu
deleted file mode 100644
index b44d8b9a7a..0000000000
--- a/src/backend/cuda/tile.cu
+++ /dev/null
@@ -1,48 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <tile.hpp>
-#include <kernel/tile.hpp>
-#include <stdexcept>
-#include <err_cuda.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-        oDims *= tileDims;
-
-        if(iDims.elements() == 0 || oDims.elements() == 0) {
-            AF_ERROR("Elements are 0", AF_ERR_SIZE);
-        }
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        kernel::tile<T>(out, in);
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                         \
-    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-
-}
diff --git a/src/backend/cuda/tile.hpp b/src/backend/cuda/tile.hpp
index 85c895a6fe..888e77aa13 100644
--- a/src/backend/cuda/tile.hpp
+++ b/src/backend/cuda/tile.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/topk.cu b/src/backend/cuda/topk.cu
new file mode 100644
index 0000000000..12dde72684
--- /dev/null
+++ b/src/backend/cuda/topk.cu
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/topk.hpp>
+#include <topk.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void topk(Array<T>& ovals, Array<uint>& oidxs, const Array<T>& ivals,
+          const int k, const int dim, const af::topkFunction order) {
+    dim4 outDims = ivals.dims();
+    outDims[dim] = k;
+
+    ovals = createEmptyArray<T>(outDims);
+    oidxs = createEmptyArray<uint>(outDims);
+
+    kernel::topk<T>(ovals, oidxs, ivals, k, dim, order);
+}
+
+#define INSTANTIATE(T)                                                         \
+    template void topk<T>(Array<T>&, Array<uint>&, const Array<T>&, const int, \
+                          const int, const af::topkFunction);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(half)
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/topk.hpp b/src/backend/cuda/topk.hpp
new file mode 100644
index 0000000000..f3c27f433c
--- /dev/null
+++ b/src/backend/cuda/topk.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void topk(Array<T>& keys, Array<unsigned>& vals, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/traits.hpp b/src/backend/cuda/traits.hpp
index 5d293febab..3ca7a63324 100644
--- a/src/backend/cuda/traits.hpp
+++ b/src/backend/cuda/traits.hpp
@@ -9,8 +9,9 @@
 
 #pragma once
 
-#include <af/traits.hpp>
+#include <common/traits.hpp>
 #include <cuComplex.h>
+#include <cuda_fp16.h>
 
 namespace af {
 
@@ -28,6 +29,13 @@ struct dtype_traits<cuDoubleComplex> {
     static const char* getName() { return "cuDoubleComplex"; }
 };
 
-}
+template<>
+struct dtype_traits<__half> {
+    enum { af_type = f16 };
+    typedef __half base_type;
+    static const char* getName() { return "__half"; }
+};
+
+}  // namespace af
 
 using af::dtype_traits;
diff --git a/src/backend/cuda/transform.cpp b/src/backend/cuda/transform.cpp
new file mode 100644
index 0000000000..af8b561191
--- /dev/null
+++ b/src/backend/cuda/transform.cpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <transform.hpp>
+
+#include <copy.hpp>
+#include <kernel/transform.hpp>
+#include <utility.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af::interpType method, const bool inverse,
+               const bool perspective) {
+    // TODO: Temporary fix; subarray handling must be fixed upstream.
+    // tf must be linear, although an offset is allowed.
+    const Array<float> tf_Lin = tf.isLinear() ? tf : copyArray(tf);
+
+    kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                         interpOrder(method));
+}
+
+#define INSTANTIATE(T)                                                       \
+    template void transform(Array<T> &out, const Array<T> &in,               \
+                            const Array<float> &tf,                          \
+                            const af_interp_type method, const bool inverse, \
+                            const bool perspective);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/transform.cu b/src/backend/cuda/transform.cu
deleted file mode 100644
index 13e1a40a4d..0000000000
--- a/src/backend/cuda/transform.cu
+++ /dev/null
@@ -1,56 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-#include <transform.hpp>
-#include <kernel/transform.hpp>
-#include <stdexcept>
-
-namespace cuda
-{
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &transform, const af::dim4 &odims,
-                        const af_interp_type method, const bool inverse)
-    {
-        const af::dim4 idims = in.dims();
-
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::transform<T, AF_INTERP_NEAREST> (out, in, transform, inverse);
-                break;
-            case AF_INTERP_BILINEAR:
-                kernel::transform<T, AF_INTERP_BILINEAR>(out, in, transform, inverse);
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-        }
-
-        return out;
-    }
-
-
-#define INSTANTIATE(T)                                                                          \
-    template Array<T> transform(const Array<T> &in, const Array<float> &transform,             \
-                                 const af::dim4 &odims, const af_interp_type method,            \
-                                 const bool inverse);                                           \
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-}
diff --git a/src/backend/cuda/transform.hpp b/src/backend/cuda/transform.hpp
index eb3d71d097..8e9e4b6990 100644
--- a/src/backend/cuda/transform.hpp
+++ b/src/backend/cuda/transform.hpp
@@ -7,12 +7,13 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &tf, const af::dim4 &odims,
-                        const af_interp_type method, const bool inverse);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/transpose.cpp b/src/backend/cuda/transpose.cpp
new file mode 100644
index 0000000000..03d6f3b91d
--- /dev/null
+++ b/src/backend/cuda/transpose.cpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/transpose.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> transpose(const Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+
+    dim4 outDims = dim4(inDims[1], inDims[0], inDims[2], inDims[3]);
+
+    Array<T> out = createEmptyArray<T>(outDims);
+
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+
+    kernel::transpose<T>(out, in, conjugate, is32multiple);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> transpose(const Array<T> &in, const bool conjugate);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/transpose.cu b/src/backend/cuda/transpose.cu
deleted file mode 100644
index e787b6ede4..0000000000
--- a/src/backend/cuda/transpose.cu
+++ /dev/null
@@ -1,50 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <transpose.hpp>
-#include <kernel/transpose.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-Array<T> transpose(const Array<T> &in, const bool conjugate)
-{
-    const dim4 inDims   = in.dims();
-    const dim4 inStrides= in.strides();
-
-    dim4 outDims  = dim4(inDims[1],inDims[0],inDims[2],inDims[3]);
-
-    Array<T> out  = createEmptyArray<T>(outDims);
-
-    if(conjugate)   { kernel::transpose<T, true>(out, in, inDims.ndims()); }
-    else            { kernel::transpose<T, false>(out, in, inDims.ndims());}
-
-    return out;
-}
-
-#define INSTANTIATE(T)                                                              \
-    template Array<T> transpose(const Array<T> &in, const bool conjugate);
-
-INSTANTIATE(float  )
-INSTANTIATE(cfloat )
-INSTANTIATE(double )
-INSTANTIATE(cdouble)
-INSTANTIATE(char   )
-INSTANTIATE(int    )
-INSTANTIATE(uint   )
-INSTANTIATE(uchar  )
-INSTANTIATE(intl   )
-INSTANTIATE(uintl  )
-
-}
diff --git a/src/backend/cuda/transpose.hpp b/src/backend/cuda/transpose.hpp
index 48089a0aa2..e612754323 100644
--- a/src/backend/cuda/transpose.hpp
+++ b/src/backend/cuda/transpose.hpp
@@ -9,13 +9,14 @@
 
 #include <Array.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
 template<typename T>
-Array<T>  transpose(const Array<T> &in, const bool conjugate);
+Array<T> transpose(const Array<T> &in, const bool conjugate);
 
 template<typename T>
 void transpose_inplace(Array<T> &in, const bool conjugate);
 
-}
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/transpose_inplace.cpp b/src/backend/cuda/transpose_inplace.cpp
new file mode 100644
index 0000000000..dcc8c5664b
--- /dev/null
+++ b/src/backend/cuda/transpose_inplace.cpp
@@ -0,0 +1,49 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/transpose_inplace.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void transpose_inplace(Array<T> &in, const bool conjugate) {
+    const dim4 inDims = in.dims();
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+    kernel::transpose_inplace<T>(in, conjugate, is32multiple);
+}
+
+#define INSTANTIATE(T) \
+    template void transpose_inplace(Array<T> &in, const bool conjugate);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/transpose_inplace.cu b/src/backend/cuda/transpose_inplace.cu
deleted file mode 100644
index 98613bc846..0000000000
--- a/src/backend/cuda/transpose_inplace.cu
+++ /dev/null
@@ -1,42 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <transpose.hpp>
-#include <kernel/transpose_inplace.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T>
-void transpose_inplace(Array<T> &in, const bool conjugate)
-{
-    if(conjugate)   { kernel::transpose_inplace<T, true >(in); }
-    else            { kernel::transpose_inplace<T, false>(in); }
-}
-
-#define INSTANTIATE(T)                                                              \
-    template void transpose_inplace(Array<T> &in, const bool conjugate);
-
-INSTANTIATE(float  )
-INSTANTIATE(cfloat )
-INSTANTIATE(double )
-INSTANTIATE(cdouble)
-INSTANTIATE(char   )
-INSTANTIATE(int    )
-INSTANTIATE(uint   )
-INSTANTIATE(uchar  )
-INSTANTIATE(intl   )
-INSTANTIATE(uintl  )
-
-}
-
diff --git a/src/backend/cuda/triangle.cpp b/src/backend/cuda/triangle.cpp
new file mode 100644
index 0000000000..c32e984626
--- /dev/null
+++ b/src/backend/cuda/triangle.cpp
@@ -0,0 +1,58 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <triangle.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/triangle.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag) {
+    kernel::triangle<T>(out, in, is_upper, is_unit_diag);
+}
+
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    triangle<T>(out, in, is_upper, is_unit_diag);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                  \
+    template void triangle<T>(Array<T> &, const Array<T> &, const bool, \
+                              const bool);                              \
+    template Array<T> triangle<T>(const Array<T> &, const bool, const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/triangle.cu b/src/backend/cuda/triangle.cu
deleted file mode 100644
index 99970a0d72..0000000000
--- a/src/backend/cuda/triangle.cu
+++ /dev/null
@@ -1,55 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <triangle.hpp>
-#include <kernel/triangle.hpp>
-
-using af::dim4;
-
-namespace cuda
-{
-
-template<typename T, bool is_upper, bool is_unit_diag>
-void triangle(Array<T> &out, const Array<T> &in)
-{
-    kernel::triangle<T, is_upper, is_unit_diag>(out, in);
-}
-
-
-template<typename T, bool is_upper, bool is_unit_diag>
-Array<T> triangle(const Array<T> &in)
-{
-    Array<T> out = createEmptyArray<T>(in.dims());
-    triangle<T, is_upper, is_unit_diag>(out, in);
-    return out;
-}
-
-#define INSTANTIATE(T)                                                  \
-    template void triangle<T, true ,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, true , false>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false, false>(Array<T> &out, const Array<T> &in); \
-    template Array<T> triangle<T, true ,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, false,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, true , false>(const Array<T> &in);    \
-    template Array<T> triangle<T, false, false>(const Array<T> &in);    \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-}
diff --git a/src/backend/cuda/triangle.hpp b/src/backend/cuda/triangle.hpp
index 2a37f39c62..98c3480126 100644
--- a/src/backend/cuda/triangle.hpp
+++ b/src/backend/cuda/triangle.hpp
@@ -7,14 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T, bool is_upper, bool is_unit_diag>
-    void triangle(Array<T> &out, const Array<T> &in);
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag);
 
-    template<typename T, bool is_upper, bool is_unit_diag>
-    Array<T> triangle(const Array<T> &in);
-}
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/types.cpp b/src/backend/cuda/types.cpp
deleted file mode 100644
index f83913bce9..0000000000
--- a/src/backend/cuda/types.cpp
+++ /dev/null
@@ -1,92 +0,0 @@
-/*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#include <af/defines.h>
-#include "types.hpp"
-#include <sstream>
-
-namespace cuda
-{
-    template<typename T > const char *cuShortName() { return "q"; }
-    template<> const char *cuShortName<float   >() { return "f"; }
-    template<> const char *cuShortName<double  >() { return "d"; }
-    template<> const char *cuShortName<cfloat  >() { return "6float2"; }
-    template<> const char *cuShortName<cdouble >() { return "7double2"; }
-    template<> const char *cuShortName<int     >() { return "i"; }
-    template<> const char *cuShortName<uint    >() { return "j"; }
-    template<> const char *cuShortName<char    >() { return "c"; }
-    template<> const char *cuShortName<uchar   >() { return "h"; }
-    template<> const char *cuShortName<intl    >() { return "x"; }
-    template<> const char *cuShortName<uintl   >() { return "y"; }
-
-    template<typename T > const char *afShortName(bool caps) { return caps ?  "Q" : "q"; }
-    template<> const char *afShortName<float   >(bool caps) { return caps ?  "S" : "s"; }
-    template<> const char *afShortName<double  >(bool caps) { return caps ?  "D" : "d"; }
-    template<> const char *afShortName<cfloat  >(bool caps) { return caps ?  "C" : "c"; }
-    template<> const char *afShortName<cdouble >(bool caps) { return caps ?  "Z" : "z"; }
-    template<> const char *afShortName<int     >(bool caps) { return caps ?  "I" : "i"; }
-    template<> const char *afShortName<uint    >(bool caps) { return caps ?  "U" : "u"; }
-    template<> const char *afShortName<char    >(bool caps) { return caps ?  "J" : "j"; }
-    template<> const char *afShortName<uchar   >(bool caps) { return caps ?  "V" : "v"; }
-    template<> const char *afShortName<intl    >(bool caps) { return caps ?  "X" : "x"; }
-    template<> const char *afShortName<uintl   >(bool caps) { return caps ?  "Y" : "y"; }
-
-    template<typename T > const char *irname() { return  "i32"; }
-    template<> const char *irname<float   >() { return  "float"; }
-    template<> const char *irname<double  >() { return  "double"; }
-    template<> const char *irname<cfloat  >() { return  "<2 x float>"; }
-    template<> const char *irname<cdouble >() { return  "<2 x double>"; }
-    template<> const char *irname<int     >() { return  "i32"; }
-    template<> const char *irname<uint    >() { return  "i32"; }
-    template<> const char *irname<intl    >() { return  "i64"; }
-    template<> const char *irname<uintl   >() { return  "i64"; }
-    template<> const char *irname<char    >() { return  "i8"; }
-    template<> const char *irname<uchar   >() { return  "i8"; }
-
-    template <typename T>
-    static inline std::string toString(T val)
-    {
-        std::stringstream s;
-        s << val;
-        return s.str();
-    }
-
-    template<typename T, bool binary>
-    const std::string cuMangledName(const char *fn)
-    {
-        std::string cname(cuShortName<T>());
-        std::string fname(fn);
-        size_t flen = fname.size();
-
-        std::string res = std::string("@_Z") + toString(flen) + fname + cname;
-        if (binary) {
-            if (cname.size() > 1) {
-                res = res + "S_";
-            } else {
-                res = res + cname;
-            }
-        }
-        return res;
-    }
-
-#define INSTANTIATE(T)                                                  \
-    template const std::string cuMangledName<T, false>(const char *fn); \
-    template const std::string cuMangledName<T, true>(const char *fn);  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-}
diff --git a/src/backend/cuda/types.hpp b/src/backend/cuda/types.hpp
index 0d807ae364..2230948f3a 100644
--- a/src/backend/cuda/types.hpp
+++ b/src/backend/cuda/types.hpp
@@ -8,21 +8,173 @@
  ********************************************************/
 
 #pragma once
+
+#include <common/kernel_type.hpp>
 #include <cuComplex.h>
-#include <string>
+#include <cuda_fp16.h>
+
+namespace arrayfire {
+namespace common {
+class half;
+}  // namespace common
+}  // namespace arrayfire
+
+#ifdef __CUDACC_RTC__
+
+using dim_t = long long;
+
+#else  //__CUDACC_RTC__
+
+#include <af/traits.hpp>
+
+#endif  //__CUDACC_RTC__
+
+namespace arrayfire {
+namespace cuda {
+
+using cdouble = cuDoubleComplex;
+using cfloat  = cuFloatComplex;
+using intl    = long long;
+using schar   = signed char;
+using uchar   = unsigned char;
+using uint    = unsigned int;
+using uintl   = unsigned long long;
+using ushort  = unsigned short;
+using ulong   = unsigned long long;
+
+template<typename T>
+using compute_t = typename common::kernel_type<T>::compute;
+
+template<typename T>
+using data_t = typename common::kernel_type<T>::data;
+
+#ifndef __CUDACC_RTC__
+namespace {
+template<typename T>
+inline const char *shortname(bool caps = false) {
+    return caps ? "Q" : "q";
+}
+template<>
+inline const char *shortname<float>(bool caps) {
+    return caps ? "S" : "s";
+}
+template<>
+inline const char *shortname<double>(bool caps) {
+    return caps ? "D" : "d";
+}
+template<>
+inline const char *shortname<cfloat>(bool caps) {
+    return caps ? "C" : "c";
+}
+template<>
+inline const char *shortname<cdouble>(bool caps) {
+    return caps ? "Z" : "z";
+}
+template<>
+inline const char *shortname<int>(bool caps) {
+    return caps ? "I" : "i";
+}
+template<>
+inline const char *shortname<uint>(bool caps) {
+    return caps ? "U" : "u";
+}
+template<>
+inline const char *shortname<char>(bool caps) {
+    return caps ? "J" : "j";
+}
+template<>
+inline const char *shortname<schar>(bool caps) {
+    return caps ? "A" : "a"; // TODO
+}
+template<>
+inline const char *shortname<uchar>(bool caps) {
+    return caps ? "V" : "v";
+}
+template<>
+inline const char *shortname<intl>(bool caps) {
+    return caps ? "X" : "x";
+}
+template<>
+inline const char *shortname<uintl>(bool caps) {
+    return caps ? "Y" : "y";
+}
+template<>
+inline const char *shortname<short>(bool caps) {
+    return caps ? "P" : "p";
+}
+template<>
+inline const char *shortname<ushort>(bool caps) {
+    return caps ? "Q" : "q";
+}
+template<>
+inline const char *shortname<arrayfire::common::half>(bool caps) {
+    return caps ? "H" : "h";
+}
 
-namespace cuda
-{
-    typedef cuFloatComplex   cfloat;
-    typedef cuDoubleComplex cdouble;
-    typedef unsigned int   uint;
-    typedef unsigned char uchar;
+template<typename T>
+inline const char *getFullName();
 
-    template<typename T> struct is_complex          { static const bool value = false;  };
-    template<> struct           is_complex<cfloat>  { static const bool value = true;   };
-    template<> struct           is_complex<cdouble> { static const bool value = true;   };
+#define SPECIALIZE(T)                     \
+    template<>                            \
+    inline const char *getFullName<T>() { \
+        return #T;                        \
+    }
 
-    template<typename T, bool binary> const std::string cuMangledName(const char *fn);
-    template<typename T > const char *afShortName(bool caps = true);
-    template<typename T > const char *irname();
+SPECIALIZE(float)
+SPECIALIZE(double)
+SPECIALIZE(cfloat)
+SPECIALIZE(cdouble)
+SPECIALIZE(char)
+SPECIALIZE(signed char)
+SPECIALIZE(unsigned char)
+SPECIALIZE(short)
+SPECIALIZE(unsigned short)
+SPECIALIZE(int)
+SPECIALIZE(unsigned int)
+SPECIALIZE(unsigned long long)
+SPECIALIZE(long long)
+
+template<>
+inline const char *getFullName<common::half>() {
+    return "half";
 }
+#undef SPECIALIZE
+}  // namespace
+#endif  //__CUDACC_RTC__
+
+}  // namespace cuda
+
+namespace common {
+
+template<typename T>
+struct kernel_type;
+
+template<>
+struct kernel_type<arrayfire::common::half> {
+    using data = arrayfire::common::half;
+
+#ifdef __CUDA_ARCH__
+
+    // These are the types within a kernel
+#if __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
+    using compute = __half;
+#else
+    using compute = float;
+#endif
+    using native = compute;
+
+#else  // __CUDA_ARCH__
+
+    // Outside of a CUDA kernel, use float.
+    using compute = float;
+
+#if defined(__NVCC__) || defined(__CUDACC_RTC__)
+    using native  = __half;
+#else
+    using native = common::half;
+#endif
+
+#endif  // __CUDA_ARCH__
+};
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/cuda/unary.hpp b/src/backend/cuda/unary.hpp
index 290e7c021a..5fd9e48f52 100644
--- a/src/backend/cuda/unary.hpp
+++ b/src/backend/cuda/unary.hpp
@@ -7,38 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
 #include <Array.hpp>
-#include <optypes.hpp>
+#include <common/jit/NaryNode.hpp>
+#include <common/jit/UnaryNode.hpp>
 #include <math.hpp>
-#include <err_cuda.hpp>
-#include <JIT/UnaryNode.hpp>
+#include <optypes.hpp>
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-template<typename T, af_op_t op>
-struct UnOp
-{
-    const char *name()
-    {
-        return "noop";
+template<af_op_t op>
+static const char *unaryName();
+
+#define UNARY_DECL(OP, FNAME)                     \
+    template<>                                    \
+    inline const char *unaryName<af_##OP##_t>() { \
+        return FNAME;                             \
     }
-};
-
-#define UNARY_FN(fn)                                \
-    template<typename T>                            \
-    struct UnOp<T, af_##fn##_t>                     \
-    {                                               \
-        std::string res;                            \
-        UnOp() :                                    \
-            res(cuMangledName<T, false>("___"#fn))  \
-        {                                           \
-        }                                           \
-        const std::string name()                    \
-        {                                           \
-            return res;                             \
-        }                                           \
-    };                                              \
+
+#define UNARY_FN(OP) UNARY_DECL(OP, #OP)
 
 UNARY_FN(sin)
 UNARY_FN(cos)
@@ -57,6 +45,7 @@ UNARY_FN(acosh)
 UNARY_FN(atanh)
 
 UNARY_FN(exp)
+UNARY_DECL(sigmoid, "__sigmoid")
 UNARY_FN(expm1)
 UNARY_FN(erf)
 UNARY_FN(erfc)
@@ -70,63 +59,55 @@ UNARY_FN(log10)
 UNARY_FN(log2)
 
 UNARY_FN(sqrt)
+UNARY_FN(rsqrt)
 UNARY_FN(cbrt)
 
-UNARY_FN(sign )
-UNARY_FN(round)
 UNARY_FN(trunc)
+UNARY_FN(round)
+UNARY_FN(signbit)
 UNARY_FN(ceil)
 UNARY_FN(floor)
 
-#undef UNARY_FN
-
-    template<typename T, af_op_t op>
-    Array<T> unaryOp(const Array<T> &in)
-    {
-
-        UnOp<T, op> uop;
+UNARY_DECL(bitnot, "__bitnot")
+UNARY_DECL(isinf, "__isinf")
+UNARY_DECL(isnan, "__isnan")
+UNARY_FN(iszero)
+UNARY_DECL(noop, "__noop")
 
-        JIT::Node_ptr in_node = in.getNode();
-
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<T>(),
-                                                  afShortName<T>(),
-                                                  uop.name(),
-                                                  in_node, op);
-
-        return createNodeArray<T>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+#undef UNARY_DECL
+#undef UNARY_FN
 
+template<typename T, af_op_t op>
+Array<T> unaryOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node;
+    using arrayfire::common::Node_ptr;
+    using std::array;
+
+    auto createUnary = [](array<Node_ptr, 1> &operands) {
+        return common::Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(af::dtype_traits<T>::af_type),
+            unaryName<op>(), operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<T>(outDim, out);
+}
 
-#define UNARY2_FN(op, fn)                           \
-    template<typename T>                            \
-    struct UnOp<T, af_##op##_t>                     \
-    {                                               \
-        std::string res;                            \
-        UnOp() :                                    \
-            res(cuMangledName<T, false>("___"#fn))  \
-        {                                           \
-        }                                           \
-        const std::string name()                    \
-        {                                           \
-            return res;                             \
-        }                                           \
-    };                                              \
-
-
-UNARY2_FN(isnan, isNaN)
-UNARY2_FN(isinf, isINF)
-UNARY2_FN(iszero, iszero)
-
-    template<typename T, af_op_t op>
-    Array<char> checkOp(const Array<T> &in)
-    {
-        UnOp<T, op> uop;
-
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(irname<char>(),
-                                                  afShortName<char>(),
-                                                  uop.name(),
-                                                  in_node, op);
-        return createNodeArray<char>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+template<typename T, af_op_t op>
+Array<char> checkOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node_ptr;
+
+    auto createUnary = [](std::array<Node_ptr, 1> &operands) {
+        return Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(dtype_traits<char>::af_type),
+            unaryName<op>(), operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<char>(outDim, out);
 }
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/unwrap.cpp b/src/backend/cuda/unwrap.cpp
new file mode 100644
index 0000000000..9d96aec1d9
--- /dev/null
+++ b/src/backend/cuda/unwrap.cpp
@@ -0,0 +1,67 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <unwrap.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/unwrap.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+
+    dim_t nx = 1 + (idims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (idims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    af::dim4 odims(wx * wy, nx * ny, idims[2], idims[3]);
+
+    if (!is_column) { std::swap(odims[0], odims[1]); }
+
+    Array<T> outArray = createEmptyArray<T>(odims);
+    kernel::unwrap<T>(outArray, in, wx, wy, sx, sy, px, py, dx, dy, nx,
+                      is_column);
+
+    return outArray;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> unwrap<T>(                                            \
+        const Array<T> &in, const dim_t wx, const dim_t wy, const dim_t sx, \
+        const dim_t sy, const dim_t px, const dim_t py, const dim_t dx,     \
+        const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/unwrap.hpp b/src/backend/cuda/unwrap.hpp
new file mode 100644
index 0000000000..dbb1f8ee24
--- /dev/null
+++ b/src/backend/cuda/unwrap.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/utility.cpp b/src/backend/cuda/utility.cpp
new file mode 100644
index 0000000000..724f546326
--- /dev/null
+++ b/src/backend/cuda/utility.cpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <utility.hpp>
+
+#include <err_cuda.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+int interpOrder(const af_interp_type p) noexcept {
+    int order = 1;
+    switch (p) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER: order = 1; break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+        case AF_INTERP_BILINEAR_COSINE: order = 2; break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+        case AF_INTERP_BICUBIC_SPLINE: order = 3; break;
+    }
+    return order;
+}
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/utility.hpp b/src/backend/cuda/utility.hpp
index bae4cc78b3..d3ff338bf6 100644
--- a/src/backend/cuda/utility.hpp
+++ b/src/backend/cuda/utility.hpp
@@ -8,22 +8,27 @@
  ********************************************************/
 
 #pragma once
+
+#include <backend.hpp>
 #include <af/defines.h>
-#include "backend.hpp"
 
-namespace cuda
-{
+namespace arrayfire {
+namespace cuda {
 
-static __DH__ dim_t trimIndex(const int &idx, const dim_t &len)
-{
+[[gnu::unused]] static __DH__ dim_t trimIndex(const int &idx,
+                                              const dim_t &len) {
     int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
     }
     return ret_val;
 }
 
-}
+int interpOrder(const af_interp_type p) noexcept;
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/vector_field.cpp b/src/backend/cuda/vector_field.cpp
new file mode 100644
index 0000000000..a0528cddb1
--- /dev/null
+++ b/src/backend/cuda/vector_field.cpp
@@ -0,0 +1,112 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_cuda.hpp>
+#include <device_manager.hpp>
+#include <err_cuda.hpp>
+#include <vector_field.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeManager;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield) {
+    auto stream = getActiveStream();
+    if (DeviceManager::checkGraphicsInteropCapability()) {
+        auto res = interopManager().getVectorFieldResources(vfield);
+        cudaGraphicsResource_t resources[2] = {*res[0].get(), *res[1].get()};
+
+        cudaGraphicsMapResources(2, resources, stream);
+
+        // Points
+        {
+            const T *ptr = points.get();
+            size_t bytes = 0;
+            T *d_vbo     = NULL;
+            cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &bytes,
+                                                 resources[0]);
+            cudaMemcpyAsync(d_vbo, ptr, bytes, cudaMemcpyDeviceToDevice,
+                            stream);
+        }
+        // Directions
+        {
+            const T *ptr = directions.get();
+            size_t bytes = 0;
+            T *d_vbo     = NULL;
+            cudaGraphicsResourceGetMappedPointer((void **)&d_vbo, &bytes,
+                                                 resources[1]);
+            cudaMemcpyAsync(d_vbo, ptr, bytes, cudaMemcpyDeviceToDevice,
+                            stream);
+        }
+        cudaGraphicsUnmapResources(2, resources, stream);
+
+        CheckGL("After cuda resource copy");
+
+        POST_LAUNCH_CHECK();
+    } else {
+        ForgeModule &_ = forgePlugin();
+        CheckGL("Begin CUDA fallback-resource copy");
+        unsigned size1 = 0, size2 = 0;
+        unsigned buff1 = 0, buff2 = 0;
+        FG_CHECK(_.fg_get_vector_field_vertex_buffer_size(&size1, vfield));
+        FG_CHECK(_.fg_get_vector_field_direction_buffer_size(&size2, vfield));
+        FG_CHECK(_.fg_get_vector_field_vertex_buffer(&buff1, vfield));
+        FG_CHECK(_.fg_get_vector_field_direction_buffer(&buff2, vfield));
+
+        // Points
+        glBindBuffer(GL_ARRAY_BUFFER, buff1);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, points.get(), size1,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+        // Directions
+        glBindBuffer(GL_ARRAY_BUFFER, buff2);
+        ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            CUDA_CHECK(cudaMemcpyAsync(ptr, directions.get(), size2,
+                                       cudaMemcpyDeviceToHost, stream));
+            CUDA_CHECK(cudaStreamSynchronize(stream));
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+        CheckGL("End CUDA fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T)                                                     \
+    template void copy_vector_field<T>(const Array<T> &, const Array<T> &, \
+                                       fg_vector_field);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/vector_field.hpp b/src/backend/cuda/vector_field.hpp
new file mode 100644
index 0000000000..086e1bbf27
--- /dev/null
+++ b/src/backend/cuda/vector_field.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/where.cpp b/src/backend/cuda/where.cpp
new file mode 100644
index 0000000000..862b25fa24
--- /dev/null
+++ b/src/backend/cuda/where.cpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_cuda.hpp>
+#include <af/dim4.hpp>
+
+#undef _GLIBCXX_USE_INT128
+#include <kernel/where.hpp>
+#include <where.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<uint> where(const Array<T> &in) {
+    Param<uint> out;
+    kernel::where<T>(out, in);
+    return createParamArray<uint>(out, true);
+}
+
+#define INSTANTIATE(T) template Array<uint> where<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/where.cu b/src/backend/cuda/where.cu
deleted file mode 100644
index 8e4f9cfe80..0000000000
--- a/src/backend/cuda/where.cu
+++ /dev/null
@@ -1,46 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_cuda.hpp>
-
-#undef _GLIBCXX_USE_INT128
-#include <where.hpp>
-#include <complex>
-#include <kernel/where.hpp>
-
-namespace cuda
-{
-    template<typename T>
-    Array<uint> where(const Array<T> &in)
-    {
-        Param<uint> out;
-        kernel::where<T>(out, in);
-        return createParamArray<uint>(out);
-    }
-
-
-#define INSTANTIATE(T)                                  \
-    template Array<uint> where<T>(const Array<T> &in);    \
-
-    INSTANTIATE(float  )
-    INSTANTIATE(cfloat )
-    INSTANTIATE(double )
-    INSTANTIATE(cdouble)
-    INSTANTIATE(char   )
-    INSTANTIATE(int    )
-    INSTANTIATE(uint   )
-    INSTANTIATE(intl   )
-    INSTANTIATE(uintl  )
-    INSTANTIATE(uchar  )
-
-}
diff --git a/src/backend/cuda/where.hpp b/src/backend/cuda/where.hpp
index 1f181ba389..a2e9ccdab6 100644
--- a/src/backend/cuda/where.hpp
+++ b/src/backend/cuda/where.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace cuda
-{
-    template<typename T>
-    Array<uint> where(const Array<T>& in);
-}
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+Array<uint> where(const Array<T>& in);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/wrap.cpp b/src/backend/cuda/wrap.cpp
new file mode 100644
index 0000000000..dd7901cc0e
--- /dev/null
+++ b/src/backend/cuda/wrap.cpp
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <wrap.hpp>
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <err_cuda.hpp>
+#include <kernel/wrap.hpp>
+#include <math.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace cuda {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    kernel::wrap<T>(out, in, wx, wy, sx, sy, px, py, is_column);
+}
+
+#define INSTANTIATE(T)                                                        \
+    template void wrap<T>(Array<T> & out, const Array<T> &in, const dim_t wx, \
+                          const dim_t wy, const dim_t sx, const dim_t sy,     \
+                          const dim_t px, const dim_t py,                     \
+                          const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+    af::dim4 odims(ox, oy, idims[2], idims[3]);
+    Array<T> out = createValueArray<T>(odims, scalar<T>(0));
+
+    kernel::wrap_dilated<T>(out, in, wx, wy, sx, sy, px, py, dx, dy, is_column);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> wrap_dilated<T>(                                      \
+        const Array<T> &in, const dim_t ox, const dim_t oy, const dim_t wx, \
+        const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,     \
+        const dim_t py, const dim_t dx, const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/cuda/wrap.hpp b/src/backend/cuda/wrap.hpp
new file mode 100644
index 0000000000..312b24a23e
--- /dev/null
+++ b/src/backend/cuda/wrap.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace cuda {
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column);
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace cuda
+}  // namespace arrayfire
diff --git a/src/backend/dim4.cpp b/src/backend/dim4.cpp
deleted file mode 100644
index 7958e96441..0000000000
--- a/src/backend/dim4.cpp
+++ /dev/null
@@ -1,260 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <limits>
-#include <numeric>
-#include <cmath>
-#include <cfloat>
-#include <af/dim4.hpp>
-#include <ArrayInfo.hpp>
-#include <err_common.hpp>
-
-namespace af
-{
-
-using std::vector;
-using std::numeric_limits;
-using std::abs;
-
-dim4::dim4()
-{
-    dims[0] = 0;
-    dims[1] = 0;
-    dims[2] = 0;
-    dims[3] = 0;
-}
-
-dim4::dim4( dim_t first,
-            dim_t second,
-            dim_t third,
-            dim_t fourth)
-{
-    dims[0] = first;
-    dims[1] = second;
-    dims[2] = third;
-    dims[3] = fourth;
-}
-
-dim4::dim4(const dim4& other)
-{
-    dims[0] = other.dims[0];
-    dims[1] = other.dims[1];
-    dims[2] = other.dims[2];
-    dims[3] = other.dims[3];
-}
-
-dim4::dim4(const unsigned ndims_, const dim_t * const dims_)
-{
-    for (unsigned i = 0; i < 4; i++) {
-        dims[i] = ndims_ > i ? dims_[i] : 1;
-    }
-}
-
-
-dim_t
-dim4::elements() const
-{
-    return dims[0] * dims[1] * dims[2] * dims[3];
-}
-
-dim_t
-dim4::elements()
-{
-    return static_cast<const dim4&>(*this).elements();
-}
-
-dim_t
-dim4::ndims() const
-{
-    int num = elements();
-    if (num == 0) return 0;
-    if (num == 1) return 1;
-
-    if (dims[3] != 1) return 4;
-    if (dims[2] != 1) return 3;
-    if (dims[1] != 1) return 2;
-
-    return 1;
-}
-
-dim_t
-dim4::ndims()
-{
-    return static_cast<const dim4&>(*this).ndims();
-}
-
-const dim_t&
-dim4::operator[](const unsigned dim) const
-{
-    return dims[dim];
-}
-
-dim_t &
-dim4::operator[](const unsigned dim)
-{
-    return const_cast<dim_t&>(static_cast<const dim4&>((*this))[dim]);
-}
-
-bool
-dim4::operator==(const dim4 &other) const
-{
-    bool ret = true;
-    for(unsigned i = 0; i < 4 && ret; i++) {
-        ret = (*this)[i] == other[i];
-    }
-    return ret;
-}
-
-bool
-dim4::operator!=(const dim4 &other) const
-{
-    return !((*this) == other);
-}
-
-dim4&
-dim4::operator*=(const dim4 &other)
-{
-    for(unsigned i = 0; i < 4; i++) {
-        (*this)[i] *= other[i];
-    }
-    return *this;
-}
-
-dim4&
-dim4::operator+=(const dim4 &other)
-{
-    for(unsigned i = 0; i < 4; i++) {
-        (*this)[i] = (*this)[i] + other[i];
-    }
-    return *this;
-}
-
-dim4&
-dim4::operator-=(const dim4 &other)
-{
-    for(unsigned i = 0; i < 4; i++) {
-        (*this)[i] = (*this)[i] - other[i];
-    }
-    return *this;
-}
-
-dim4 operator+(const dim4& first, const dim4& second)
-{
-    dim4 dims;
-    for(unsigned i = 0; i < 4; i++) {
-        dims[i] = first[i] + second[i];
-    }
-    return dims;
-}
-
-dim4 operator-(const dim4& first, const dim4& second)
-{
-    dim4 dims;
-    for(unsigned i = 0; i < 4; i++) {
-        dims[i] = first[i] - second[i];
-    }
-    return dims;
-}
-
-dim4 operator*(const dim4& first, const dim4& second)
-{
-    dim4 dims;
-    for(unsigned i = 0; i < 4; i++) {
-        dims[i] = first[i] * second[i];
-    }
-    return dims;
-}
-
-
-bool
-isEnd(const af_seq &seq)    { return (seq.end <= -1); }
-
-bool
-isSpan(const af_seq &seq)   { return (seq.step == 0 && seq.begin == 1 && seq.end == 1); }
-
-size_t
-seqElements(const af_seq &seq) {
-    size_t out = 0;
-    if      (seq.step > DBL_MIN)    { out = ((seq.end - seq.begin) / abs(seq.step)) + 1;    }
-    else if (seq.step < -DBL_MIN)   { out = ((seq.begin - seq.end) / abs(seq.step)) + 1;    }
-    else                            { out = numeric_limits<size_t>::max();                  }
-
-    return out;
-}
-
-dim_t calcDim(const af_seq &seq, const dim_t &parentDim)
-{
-    dim_t outDim = 1;
-    if  (isSpan(seq)) {
-        outDim = parentDim;
-    } else if (isEnd(seq)) {
-        if(seq.begin == -1) {   // only end is passed as seq
-            outDim = 1;
-        } else if (seq.begin < 0) {
-            af_seq temp = {parentDim + seq.begin,
-                           parentDim + seq.end,
-                           seq.step};
-            outDim = seqElements(temp);
-        } else {    // end is passed as a part of seq
-            af_seq temp = {seq.begin, parentDim + seq.end, seq.step};
-            outDim = seqElements(temp);
-        }
-    } else {
-        DIM_ASSERT(1, seq.begin >= -DBL_MIN && seq.begin < parentDim);
-        DIM_ASSERT(1, seq.end < parentDim);
-        outDim = seqElements(seq);
-    }
-
-    return outDim;
-}
-}
-
-using af::dim4;
-using std::vector;
-
-dim4
-toDims(const vector<af_seq>& seqs, const dim4 &parentDims)
-{
-    dim4 outDims(1, 1, 1, 1);
-    for(unsigned i = 0; i < seqs.size(); i++ ) {
-        outDims[i] = af::calcDim(seqs[i], parentDims[i]);
-        if (outDims[i] > parentDims[i])
-            AF_ERROR("Size mismatch between input and output", AF_ERR_SIZE);
-    }
-    return outDims;
-}
-
-dim4
-toOffset(const vector<af_seq>& seqs, const dim4 &parentDims)
-{
-    dim4 outOffsets(0, 0, 0, 0);
-    for(unsigned i = 0; i < seqs.size(); i++ ) {
-        if (seqs[i].step !=0 && seqs[i].begin >= 0) {
-            outOffsets[i] = seqs[i].begin;
-        } else if (seqs[i].begin <= -1) {
-            outOffsets[i] = parentDims[i] + seqs[i].begin;
-        } else {
-            outOffsets[i] = 0;
-        }
-
-        if (outOffsets[i] >= parentDims[i])
-            AF_ERROR("Index out of range", AF_ERR_SIZE);
-    }
-    return outOffsets;
-}
-
-dim4
-toStride(const vector<af_seq>& seqs, const af::dim4 &parentDims)
-{
-    dim4 out(calcStrides(parentDims));
-    for(unsigned i = 0; i < seqs.size(); i++ ) {
-        if  (seqs[i].step != 0) {   out[i] *= seqs[i].step; }
-    }
-    return out;
-}
diff --git a/src/backend/lapacke.cpp b/src/backend/lapacke.cpp
deleted file mode 100644
index 28e166b718..0000000000
--- a/src/backend/lapacke.cpp
+++ /dev/null
@@ -1,167 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#if defined(__APPLE__) && !defined(AF_CUDA)
-#include <Accelerate/Accelerate.h>
-#include "lapacke.hpp"
-#include <cstdint>
-
-#if INTPTR_MAX == INT16MAX
-    #define BS 16
-#elif INTPTR_MAX == INT32MAX
-    #define BS 32
-#elif INTPTR_MAX == INT64MAX
-    #define BS 64
-#else
-    #define BS 32
-#endif
-
-#define LAPACK_FUNC(X, T, TO)                                                       \
-int LAPACKE_##X##geqrf(int layout, int M, int N, T *A, int lda, T *tau)             \
-{                                                                                   \
-    int lwork = N * BS;                                                             \
-    T *work = new T[lwork];                                                         \
-    int info = 0;                                                                   \
-    X##geqrf_(&M, &N, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);               \
-    delete [] work;                                                                 \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##geqrf_work(int layout, int M, int N, T *A, int lda,                \
-                            T *tau, T *work, int lwork)                             \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##geqrf_(&M, &N, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);               \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##getrf(int layout, int M, int N, T *A, int lda, int *pivot)         \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##getrf_(&M, &N, (TO)A, &lda, pivot, &info);                                   \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##getrs(int layout, char trans, int M, int N, const T *A,            \
-                       int lda, const int *pivot, T *B, int ldb)                    \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##getrs_(&trans, &M, &N, (TO)A, &lda, (int *)pivot, (TO)B, &ldb, &info);       \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##potrf(int layout, char uplo, int N, T *A, int lda)                 \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##potrf_(&uplo, &N, (TO)A, &lda, &info);                                       \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##gesv(int layout, int N, int nrhs, T *A, int lda,                   \
-                      int *pivot, T *B, int ldb)                                    \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##gesv_(&N, &nrhs, (TO)A, &lda, pivot, (TO)B, &ldb, &info);                    \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##gels(int layout, char trans, int M, int N, int nrhs,               \
-                      T *A, int lda, T *B, int ldb)                                 \
-{                                                                                   \
-    int lwork = std::min(M, N) + std::max(M, std::max(N, nrhs)) * BS;               \
-    T *work = new T[lwork];                                                         \
-    int info = 0;                                                                   \
-    X##gels_(&trans, &M, &N, &nrhs, (TO)A, &lda,                                    \
-                       (TO)B, &ldb, (TO)work, &lwork, &info);                       \
-    delete [] work;                                                                 \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##getri(int layout, int N, T *A, int lda, const int *pivot)          \
-{                                                                                   \
-    int lwork = N * BS;                                                             \
-    T *work = new T[lwork];                                                         \
-    int info = 0;                                                                   \
-    X##getri_(&N, (TO)A, &lda, const_cast<int *>(pivot),                            \
-                        (TO)work, &lwork, &info);                                   \
-    delete [] work;                                                                 \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##trtri(int layout, char uplo, char diag, int N, T *A, int lda)      \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##trtri_(&uplo, &diag, &N, (TO)A, &lda, &info);                                \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##trtrs(int layout, char uplo, char trans, char diag,                \
-                       int N, int NRHS, const T *A, int lda, T *B, int ldb)         \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##trtrs_(&uplo, &trans, &diag, &N, &NRHS, (TO)A, &lda, (TO)B, &ldb, &info);    \
-    return info;                                                                    \
-}                                                                                   \
-int LAPACKE_##X##larft(int layout, char direct, char storev, int N, int K,          \
-                       const T *v, int ldv, const T *tau, T *t, int ldt)            \
-{                                                                                   \
-    X##larft_(&direct, &storev, &N, &K, (TO)v, &ldv,                                \
-                        (TO)const_cast<T*>(tau), (TO)t, &ldt);                      \
-    return 0;                                                                       \
-}                                                                                   \
-int LAPACKE_##X##laswp(int layout, int N, T *A, int lda,                            \
-                       int k1, int k2, const int *pivot, int incx)                  \
-{                                                                                   \
-    X##laswp_(&N, (TO)A, &lda, &k1, &k2, const_cast<int*>(pivot), &incx);           \
-    return 0;                                                                       \
-}                                                                                   \
-
-LAPACK_FUNC(s, float, float*)
-LAPACK_FUNC(d, double, double*)
-LAPACK_FUNC(c, cfloat, __CLPK_complex*)
-LAPACK_FUNC(z, cdouble, __CLPK_doublecomplex*)
-
-#define LAPACK_GQR(P, X, T, TO)                                                     \
-int LAPACKE_##X##P(int layout, int M, int N, int K, T *A, int lda, const T *tau)    \
-{                                                                                   \
-    int lwork = N * 32;                                                             \
-    T *work = new T[lwork];                                                         \
-    int info = 0;                                                                   \
-    X##P##_(&M, &N, &K, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);             \
-    delete [] work;                                                                 \
-    return info;                                                                    \
-}                                                                                   \
-
-LAPACK_GQR(orgqr, s, float, float*)
-LAPACK_GQR(orgqr, d, double, double*)
-LAPACK_GQR(ungqr, c, cfloat, __CLPK_complex*)
-LAPACK_GQR(ungqr, z, cdouble, __CLPK_doublecomplex*)
-
-#define LAPACK_GQR_WORK(P, X, T, TO)                                                \
-int LAPACKE_##X##P##_work(int layout, int M, int N, int K, T *A, int lda,           \
-                          const T *tau, T *work, int lwork)                         \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##P##_(&M, &N, &K, (TO)A, &lda, (TO)tau, (TO)work, &lwork, &info);             \
-    return info;                                                                    \
-}                                                                                   \
-
-LAPACK_GQR_WORK(orgqr, s, float, float*)
-LAPACK_GQR_WORK(orgqr, d, double, double*)
-LAPACK_GQR_WORK(ungqr, c, cfloat, __CLPK_complex*)
-LAPACK_GQR_WORK(ungqr, z, cdouble, __CLPK_doublecomplex*)
-
-#define LAPACK_MQR_WORK(P, X, T, TO)                                                \
-int LAPACKE_##X##P##_work(int layout, char side, char trans, int M, int N, int K,   \
-                          const T *A, int lda, const T *tau, T *c, int ldc,         \
-                          T *work, int lwork)                                       \
-{                                                                                   \
-    int info = 0;                                                                   \
-    X##P##_(&side, &trans, &M, &N, &K, (TO)A, &lda, (TO)tau, (TO)c, &ldc,           \
-                      (TO)work, &lwork, &info);                                     \
-    return info;                                                                    \
-}                                                                                   \
-
-LAPACK_MQR_WORK(ormqr, s, float, float*)
-LAPACK_MQR_WORK(ormqr, d, double, double*)
-LAPACK_MQR_WORK(unmqr, c, cfloat, __CLPK_complex*)
-LAPACK_MQR_WORK(unmqr, z, cdouble, __CLPK_doublecomplex*)
-
-#endif
diff --git a/src/backend/lapacke.hpp b/src/backend/lapacke.hpp
deleted file mode 100644
index 9954fe1881..0000000000
--- a/src/backend/lapacke.hpp
+++ /dev/null
@@ -1,76 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#if defined(__APPLE__)
-#include <af/defines.h>
-#include <types.hpp>
-#include <backend.hpp>
-
-using detail::cfloat;
-using detail::cdouble;
-
-#define LAPACK_FUNC(X, T)                                                           \
-int LAPACKE_##X##geqrf(int layout, int M, int N, T *A, int lda, T *tau);            \
-int LAPACKE_##X##geqrf_work(int layout, int M, int N, T *A, int lda,                \
-                            T *tau, T *work, int lwork);                            \
-int LAPACKE_##X##getrf(int layout, int M, int N, T *A, int lda, int *pivot);        \
-int LAPACKE_##X##potrf(int layout, char uplo, int N, T *A, int lda);                \
-int LAPACKE_##X##gesv(int layout, int N, int nrhs, T *A, int lda,                   \
-                      int *pivot, T *B, int ldb);                                   \
-int LAPACKE_##X##gels(int layout, char trans, int M, int N, int nrhs,               \
-                      T *A, int lda, T *B, int ldb);                                \
-int LAPACKE_##X##getri(int layout, int N, T *A, int lda, const int *pivot);         \
-int LAPACKE_##X##trtri(int layout, char uplo, char diag, int N, T *A, int lda);     \
-int LAPACKE_##X##larft(int layout, char direct, char storev, int N, int K,          \
-                       const T *v, int ldv, const T *tau, T *t, int ldt);           \
-int LAPACKE_##X##laswp(int layout, int N, T *A, int lda,                            \
-                       int k1, int k2, const int * pivot, int incx);                \
-int LAPACKE_##X##getrs(int layout, char trans, int M, int N, const T *A,            \
-                       int lda, const int *pivot, T *B, int ldb);                   \
-int LAPACKE_##X##trtrs(int layout, char uplo, char trans, char diag,                \
-                       int N, int NRHS, const T *A, int lda, T *B, int ldb);        \
-
-LAPACK_FUNC(s, float)
-LAPACK_FUNC(d, double)
-LAPACK_FUNC(c, cfloat)
-LAPACK_FUNC(z, cdouble)
-
-#define LAPACK_GQR(P, X, T)                                                         \
-int LAPACKE_##X##P(int layout, int M, int N, int K, T *A, int lda, const T *tau);   \
-
-LAPACK_GQR(orgqr, s, float)
-LAPACK_GQR(orgqr, d, double)
-LAPACK_GQR(ungqr, c, cfloat)
-LAPACK_GQR(ungqr, z, cdouble)
-
-#define LAPACK_GQR_WORK(P, X, T)                                                    \
-int LAPACKE_##X##P##_work(int layout, int M, int N, int K, T *A, int lda,           \
-                          const T *tau, T *work, int lwork);                        \
-
-LAPACK_GQR_WORK(orgqr, s, float)
-LAPACK_GQR_WORK(orgqr, d, double)
-LAPACK_GQR_WORK(ungqr, c, cfloat)
-LAPACK_GQR_WORK(ungqr, z, cdouble)
-
-#define LAPACK_MQR_WORK(P, X, T)                                                    \
-int LAPACKE_##X##P##_work(int layout, char side, char trans, int M, int N, int K,   \
-                          const T *A, int lda, const T *tau, T *c, int ldc,         \
-                          T *work, int lwork);                                      \
-
-LAPACK_MQR_WORK(ormqr, s, float)
-LAPACK_MQR_WORK(ormqr, d, double)
-LAPACK_MQR_WORK(unmqr, c, cfloat)
-LAPACK_MQR_WORK(unmqr, z, cdouble)
-
-#undef LAPACK_FUNC
-#undef LAPACK_GQR
-#undef LAPACK_GQR_WORK
-#undef LAPACK_MQR_WORK
-
-#endif
diff --git a/src/backend/oneapi/Array.cpp b/src/backend/oneapi/Array.cpp
new file mode 100644
index 0000000000..57c8f111ee
--- /dev/null
+++ b/src/backend/oneapi/Array.cpp
@@ -0,0 +1,595 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+#include <Param.hpp>
+#include <common/Logger.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/half.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/jit/ScalarNode.hpp>
+#include <common/util.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <jit/BufferNode.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <scalar.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
+
+#include <cstddef>
+#include <cstdlib>
+#include <memory>
+#include <numeric>
+
+#include <cstdio>
+#include <cstring>
+#include <iostream>
+
+#include <vector>
+
+using af::dim4;
+using af::dtype_traits;
+
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::oneapi::jit::BufferNode;
+
+using nonstd::span;
+using std::accumulate;
+using std::is_standard_layout;
+using std::make_shared;
+using std::shared_ptr;
+using std::vector;
+
+using sycl::buffer;
+
+namespace arrayfire {
+namespace oneapi {
+namespace {
+template<typename T>
+shared_ptr<BufferNode<T>> bufferNodePtr() {
+    return make_shared<BufferNode<T>>(
+        static_cast<af::dtype>(dtype_traits<T>::af_type));
+}
+
+template<typename T>
+void verifyTypeSupport() {}
+
+template<>
+void verifyTypeSupport<double>() {
+    if (!isDoubleSupported(getActiveDeviceId())) {
+        AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
+    }
+}
+
+template<>
+void verifyTypeSupport<cdouble>() {
+    if (!isDoubleSupported(getActiveDeviceId())) {
+        AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
+    }
+}
+
+template<>
+void verifyTypeSupport<arrayfire::common::half>() {
+    if (!isHalfSupported(getActiveDeviceId())) {
+        AF_ERROR("Half precision not supported", AF_ERR_NO_HALF);
+    }
+}
+}  // namespace
+
+template<typename T>
+void checkAndMigrate(const Array<T> &arr) {
+    if (arr.getDevId() != detail::getActiveDeviceId()) {
+        AF_ERROR("Input Array not created on current device", AF_ERR_DEVICE);
+    }
+}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(memAlloc<T>(info.elements()).release(), memFree<T>)
+    , data_dims(dims)
+    , node()
+    , owner(true) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, Node_ptr n)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data_dims(dims)
+    , node(std::move(n))
+    , owner(true) {
+    if (node->isBuffer()) {
+        data = std::static_pointer_cast<BufferNode<T>>(node)->getDataPointer();
+    }
+}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, const T *const in_data)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(memAlloc<T>(info.elements()).release(), memFree<T>)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    static_assert(is_standard_layout<Array<T>>::value,
+                  "Array<T> must be a standard layout type");
+    static_assert(std::is_nothrow_move_assignable<Array<T>>::value,
+                  "Array<T> is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<Array<T>>::value,
+                  "Array<T> is not move constructible");
+    static_assert(
+        offsetof(Array<T>, info) == 0,
+        "Array<T>::info must be the first member variable of Array<T>");
+    getQueue()
+        .submit([&](sycl::handler &h) {
+            h.copy(in_data, data->get_access(h, sycl::range(info.elements())));
+        })
+        .wait();
+}
+
+template<typename T>
+Array<T>::Array(const af::dim4 &dims, buffer<T> *const mem, size_t offset,
+                bool copy)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(copy ? memAlloc<T>(info.elements()).release() : new buffer<T>(*mem),
+           memFree<T>)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (copy) {
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                h.copy(mem->get_access(h, sycl::range(info.elements())),
+                       data->get_access(h));
+            })
+            .wait();
+    }
+}
+
+template<typename T>
+Array<T>::Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset_,
+                const dim4 &stride)
+    : info(parent.getDevId(), dims, offset_, stride,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(parent.getData())
+    , data_dims(parent.getDataDims())
+    , node()
+    , owner(false) {}
+
+template<typename T>
+Array<T>::Array(Param<T> &tmp, bool owner_)
+    : info(getActiveDeviceId(),
+           dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2],
+                tmp.info.dims[3]),
+           0,
+           dim4(tmp.info.strides[0], tmp.info.strides[1], tmp.info.strides[2],
+                tmp.info.strides[3]),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(
+          tmp.data, owner_ ? memFree<T> : [](sycl::buffer<T> * /*unused*/) {})
+    , data_dims(dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2],
+                     tmp.info.dims[3]))
+    , node()
+    , owner(owner_) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, const dim4 &strides, dim_t offset_,
+                const T *const in_data, bool is_device)
+    : info(getActiveDeviceId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data()
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (is_device) {
+        buffer<T> *ptr;
+        std::memcpy(&ptr, in_data, sizeof(buffer<T> *));
+        data = make_shared<buffer<T>>(*ptr);
+    } else {
+        data = memAlloc<T>(info.elements());
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                h.copy(in_data, data->get_access(h, sycl::range(info.total())));
+            })
+            .wait();
+    }
+}
+
+template<typename T>
+void Array<T>::eval() {
+    if (isReady()) { return; }
+
+    this->setId(getActiveDeviceId());
+    data = std::shared_ptr<sycl::buffer<T>>(
+        memAlloc<T>(info.elements()).release(), memFree<T>);
+
+    // Do not replace this with cast operator
+    KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
+                   {strides()[0], strides()[1], strides()[2], strides()[3]},
+                   0};
+
+    Param<T> res{data.get(), info};
+
+    evalNodes(res, getNode().get());
+    node.reset();
+}
+
+template<typename T>
+void Array<T>::eval() const {
+    const_cast<Array<T> *>(this)->eval();
+}
+
+template<typename T>
+buffer<T> *Array<T>::device() {
+    if (!isOwner() || getOffset() || data.use_count() > 1) {
+        *this = copyArray<T>(*this);
+    }
+    return this->get();
+}
+
+template<typename T>
+void evalMultiple(vector<Array<T> *> arrays) {
+    vector<Param<T>> outputs;
+    vector<Array<T> *> output_arrays;
+    vector<Node *> nodes;
+
+    // Check if all the arrays have the same dimensions
+    auto it = std::adjacent_find(begin(arrays), end(arrays),
+                                 [](const Array<T> *l, const Array<T> *r) {
+                                     return l->dims() != r->dims();
+                                 });
+
+    // If they are not the same, eval individually
+    if (it != end(arrays)) {
+        for (auto ptr : arrays) { ptr->eval(); }
+        return;
+    }
+
+    for (Array<T> *array : arrays) {
+        if (array->isReady()) { continue; }
+
+        const ArrayInfo info = array->info;
+
+        array->setId(getActiveDeviceId());
+        array->data = std::shared_ptr<buffer<T>>(
+            memAlloc<T>(info.elements()).release(), memFree<T>);
+
+        // Do not replace this with cast operator
+        KParam kInfo = {
+            {info.dims()[0], info.dims()[1], info.dims()[2], info.dims()[3]},
+            {info.strides()[0], info.strides()[1], info.strides()[2],
+             info.strides()[3]},
+            0};
+
+        outputs.emplace_back(array->data.get(), kInfo);
+        output_arrays.push_back(array);
+        nodes.push_back(array->getNode().get());
+    }
+
+    evalNodes(outputs, nodes);
+
+    for (Array<T> *array : output_arrays) { array->node.reset(); }
+}
+
+template<typename T>
+Node_ptr Array<T>::getNode() {
+    if (node) { return node; }
+
+    AParam<T, sycl::access_mode::read> info = *this;
+    unsigned bytes = this->dims().elements() * sizeof(T);
+    auto nn        = bufferNodePtr<T>();
+    nn->setData(info, data, bytes, isLinear());
+
+    return nn;
+}
+
+template<typename T>
+Node_ptr Array<T>::getNode() const {
+    return const_cast<Array<T> *>(this)->getNode();
+}
+
+/// This function should be called after a new JIT node is created. It will
+/// return true if the newly created node will generate a valid kernel. If
+/// false the node will fail to compile or the node and its referenced buffers
+/// are consuming too many resources. If false, the node's child nodes should
+/// be evaluated before continuing.
+///
+/// We eval in the following cases:
+///
+/// 1. Too many bytes are locked up by JIT causing memory
+///    pressure. Too many bytes is assumed to be half of all bytes
+///    allocated so far.
+///
+/// 2. The number of parameters we are passing into the kernel exceeds the
+///    limitation on the platform. For NVIDIA this is 4096 bytes.
+template<typename T>
+kJITHeuristics passesJitHeuristics(span<Node *> root_nodes) {
+    if (!evalFlag()) { return kJITHeuristics::Pass; }
+    static auto getLogger = [&] { return common::loggerFactory("jit"); };
+    for (const Node *n : root_nodes) {
+        if (n->getHeight() > static_cast<int>(getMaxJitSize())) {
+            AF_TRACE(
+                "JIT tree evaluated because tree height exceeds limit: {} > "
+                "{}",
+                n->getHeight(), getMaxJitSize());
+            return kJITHeuristics::TreeHeight;
+        }
+    }
+
+    // TODO(umar): add memory based checks for JIT kernel generation
+    bool isBufferLimit =
+        false;  // getMemoryPressure() >= getMemoryPressureThreshold();
+    // auto platform      = getActivePlatform();
+
+    // The Apple platform can have either an NVIDIA card or an AMD card
+    // bool isIntel = platform == AFCL_PLATFORM_INTEL;
+
+    /// Intel's param_size limit is much smaller than on other platforms,
+    /// so we need to start checking earlier with smaller trees
+    int heightCheckLimit = 3;
+
+    // A lightweight check based on the height of the node. This is
+    // an inexpensive operation and does not traverse the JIT tree.
+    bool atHeightLimit =
+        std::any_of(std::begin(root_nodes), std::end(root_nodes),
+                    [heightCheckLimit](Node *n) {
+                        return (n->getHeight() + 1 >= heightCheckLimit);
+                    });
+
+    if (atHeightLimit || isBufferLimit) {
+        // This is the base parameter size if the kernel had no
+        // arguments
+        size_t base_param_size =
+            (sizeof(T *) + sizeof(Param<T>)) * root_nodes.size() +
+            (3 * sizeof(uint));
+
+        const sycl::device &device = getDevice();
+        size_t max_param_size =
+            device.get_info<sycl::info::device::max_parameter_size>();
+        // typical values:
+        //   NVIDIA     = 4096
+        //   AMD        = 3520  (AMD A10 iGPU = 1024)
+        //   Intel iGPU = 1024
+        max_param_size -= base_param_size;
+
+        struct tree_info {
+            size_t total_buffer_size;
+            size_t num_buffers;
+            size_t param_scalar_size;
+        };
+
+        tree_info info{0, 0, 0};
+        for (Node *n : root_nodes) {
+            NodeIterator<> it(n);
+            info = accumulate(
+                it, NodeIterator<>(), info, [](tree_info &prev, Node &n) {
+                    if (n.isBuffer()) {
+                        auto &buf_node = static_cast<BufferNode<T> &>(n);
+                        // getBytes returns the size of the data Array.
+                        // Sub arrays will be represented by their parent
+                        // size.
+                        prev.total_buffer_size += buf_node.getBytes();
+                        prev.num_buffers++;
+                    } else {
+                        prev.param_scalar_size += n.getParamBytes();
+                    }
+                    return prev;
+                });
+        }
+        isBufferLimit = jitTreeExceedsMemoryPressure(info.total_buffer_size);
+
+        size_t param_size =
+            (info.num_buffers * (sizeof(Param<T>) + sizeof(T *)) +
+             info.param_scalar_size);
+
+        bool isParamLimit = param_size >= max_param_size;
+
+        if (isParamLimit) {
+            AF_TRACE(
+                "JIT tree evaluated because of kernel parameter size: {} >= {}",
+                param_size, max_param_size);
+            return kJITHeuristics::KernelParameterSize;
+        }
+        if (isBufferLimit) {
+            AF_TRACE("JIT tree evaluated because of memory pressure: {}",
+                     info.total_buffer_size);
+            return kJITHeuristics::MemoryPressure;
+        }
+    }
+    return kJITHeuristics::Pass;
+}
+
+template<typename T>
+void *getDevicePtr(const Array<T> &arr) {
+    const buffer<T> *buf = arr.device();
+    return (void *)buf;
+}
+
+template<typename T>
+Array<T> createNodeArray(const dim4 &dims, Node_ptr node) {
+    verifyTypeSupport<T>();
+    Array<T> out = Array<T>(dims, node);
+    return out;
+}
+
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent, const vector<af_seq> &index,
+                        bool copy) {
+    parent.eval();
+
+    dim4 dDims          = parent.getDataDims();
+    dim4 parent_strides = parent.strides();
+
+    if (parent.isLinear() == false) {
+        const Array<T> parentCopy = copyArray(parent);
+        return createSubArray(parentCopy, index, copy);
+    }
+
+    const dim4 &pDims = parent.dims();
+
+    dim4 dims    = toDims(index, pDims);
+    dim4 strides = toStride(index, dDims);
+
+    // Find total offsets after indexing
+    dim4 offsets = toOffset(index, pDims);
+    dim_t offset = parent.getOffset();
+    for (int i = 0; i < 4; i++) { offset += offsets[i] * parent_strides[i]; }
+
+    Array<T> out = Array<T>(parent, dims, offset, strides);
+
+    if (!copy) { return out; }
+
+    if (strides[0] != 1 || strides[1] < 0 || strides[2] < 0 || strides[3] < 0) {
+        out = copyArray(out);
+    }
+
+    return out;
+}
+
+template<typename T>
+Array<T> createHostDataArray(const dim4 &dims, const T *const data) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims, data);
+}
+
+template<typename T>
+Array<T> createDeviceDataArray(const dim4 &dims, void *data, bool copy) {
+    verifyTypeSupport<T>();
+
+    return Array<T>(dims, static_cast<buffer<T> *>(data), 0, copy);
+}
+
+template<typename T>
+Array<T> createValueArray(const dim4 &dims, const T &value) {
+    verifyTypeSupport<T>();
+    return createScalarNode<T>(dims, value);
+}
+
+template<typename T>
+Array<T> createEmptyArray(const dim4 &dims) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims);
+}
+
+template<typename T>
+Array<T> createParamArray(Param<T> &tmp, bool owner) {
+    verifyTypeSupport<T>();
+    return Array<T>(tmp, owner);
+}
+
+template<typename T>
+void destroyArray(Array<T> *A) {
+    delete A;
+}
+
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data,
+                        const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
+    auto arr_get = arr.get();
+    getQueue()
+        .submit([&](sycl::handler &h) {
+            auto host_acc =
+                arr_get->template get_access<sycl::access_mode::write>(
+                    h, sycl::range(bytes / sizeof(T)), arr.getOffset());
+            h.copy(data, host_acc);
+        })
+        .wait();
+}
+
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
+
+    sycl::buffer<T> *dataptr =
+        static_cast<sycl::buffer<T> *>(const_cast<void *>(data));
+    auto arr_get = arr.get();
+    getQueue().submit([&](sycl::handler &h) {
+        auto src_acc = dataptr->template get_access<sycl::access_mode::read>(
+            h, sycl::range(bytes / sizeof(T)));
+        auto dst_acc = arr_get->template get_access<sycl::access_mode::write>(
+            h, sycl::range(bytes / sizeof(T)), arr.getOffset());
+        h.copy(src_acc, dst_acc);
+    });
+}
+
+template<typename T>
+void Array<T>::setDataDims(const dim4 &new_dims) {
+    data_dims = new_dims;
+    modDims(new_dims);
+}
+
+template<typename T>
+size_t Array<T>::getAllocatedBytes() const {
+    if (!isReady()) { return 0; }
+    size_t bytes = memoryManager().allocated(data.get());
+    // External device pointer
+    if (bytes == 0 && data.get()) { return data_dims.elements() * sizeof(T); }
+    return bytes;
+}
+
+#define INSTANTIATE(T)                                                       \
+    template Array<T> createHostDataArray<T>(const dim4 &dims,               \
+                                             const T *const data);           \
+    template Array<T> createDeviceDataArray<T>(const dim4 &dims, void *data, \
+                                               bool copy);                   \
+    template Array<T> createValueArray<T>(const dim4 &dims, const T &value); \
+    template Array<T> createEmptyArray<T>(const dim4 &dims);                 \
+    template Array<T> createParamArray<T>(Param<T> & tmp, bool owner);       \
+    template Array<T> createSubArray<T>(                                     \
+        const Array<T> &parent, const vector<af_seq> &index, bool copy);     \
+    template void destroyArray<T>(Array<T> * A);                             \
+    template Array<T> createNodeArray<T>(const dim4 &dims, Node_ptr node);   \
+    template Array<T>::Array(const dim4 &dims, const dim4 &strides,          \
+                             dim_t offset, const T *const in_data,           \
+                             bool is_device);                                \
+    template Array<T>::Array(const dim4 &dims, buffer<T> *mem,               \
+                             size_t src_offset, bool copy);                  \
+    template Node_ptr Array<T>::getNode();                                   \
+    template Node_ptr Array<T>::getNode() const;                             \
+    template void Array<T>::eval();                                          \
+    template void Array<T>::eval() const;                                    \
+    template buffer<T> *Array<T>::device();                                  \
+    template void writeHostDataArray<T>(Array<T> & arr, const T *const data, \
+                                        const size_t bytes);                 \
+    template void writeDeviceDataArray<T>(                                   \
+        Array<T> & arr, const void *const data, const size_t bytes);         \
+    template void evalMultiple<T>(vector<Array<T> *> arrays);                \
+    template kJITHeuristics passesJitHeuristics<T>(span<Node *> node);       \
+    template void *getDevicePtr<T>(const Array<T> &arr);                     \
+    template void Array<T>::setDataDims(const dim4 &new_dims);               \
+    template size_t Array<T>::getAllocatedBytes() const;                     \
+    template void checkAndMigrate<T>(const Array<T> &arr);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Array.hpp b/src/backend/oneapi/Array.hpp
new file mode 100644
index 0000000000..5e7ec490f1
--- /dev/null
+++ b/src/backend/oneapi/Array.hpp
@@ -0,0 +1,375 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <kernel/KParam.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <nonstd/span.hpp>
+#include <algorithm>
+#include <cstdlib>
+#include <memory>
+#include <type_traits>
+#include <vector>
+
+enum class kJITHeuristics;
+
+namespace arrayfire {
+namespace common {
+template<typename T>
+class SparseArray;
+
+class Node;
+
+using Node_ptr = std::shared_ptr<Node>;
+
+}  // namespace common
+namespace oneapi {
+
+template<typename T>
+struct Param;
+template<typename T, sycl::access_mode AM>
+struct AParam;
+
+template<typename T>
+using Buffer_ptr = std::shared_ptr<sycl::buffer<T>>;
+using af::dim4;
+template<typename T>
+class Array;
+
+/// Checks if the Array object can be migrated to the current device and if not,
+/// an error is thrown
+///
+/// \param[in] arr The Array that will be checked.
+template<typename T>
+void checkAndMigrate(const Array<T> &arr);
+
+template<typename T>
+void evalMultiple(std::vector<Array<T> *> arrays);
+
+template<typename T>
+void evalNodes(Param<T> &out, common::Node *node);
+
+template<typename T>
+void evalNodes(std::vector<Param<T>> &outputs,
+               const std::vector<common::Node *> &nodes);
+
+/// Creates a new Array object backed by a JIT node and returns it by value.
+template<typename T>
+Array<T> createNodeArray(const af::dim4 &dims, common::Node_ptr node);
+
+/// Creates a new Array object with every element set to \p value.
+template<typename T>
+Array<T> createValueArray(const af::dim4 &dims, const T &value);
+
+/// Creates a new Array object by copying data from a host pointer.
+template<typename T>
+Array<T> createHostDataArray(const af::dim4 &dims, const T *const data);
+
+/// Creates an Array<T> object from a device pointer.
+///
+/// \param[in] dims The shape of the resulting Array.
+/// \param[in] data The device pointer to the data
+/// \param[in] copy If true, memory will be allocated and the data will be
+///                 copied to the device. If false the data will be used
+///                 directly
+/// \returns The new Array<T> object based on the device pointer.
+template<typename T>
+Array<T> createDeviceDataArray(const af::dim4 &dims, void *data,
+                               bool copy = false);
+
+template<typename T>
+Array<T> createStridedArray(const af::dim4 &dims, const af::dim4 &strides,
+                            dim_t offset, const T *const in_data,
+                            bool is_device) {
+    return Array<T>(dims, strides, offset, in_data, is_device);
+}
+
+/// Copies data to an existing Array object from a host pointer
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data, const size_t bytes);
+
+/// Copies data to an existing Array object from a device pointer
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes);
+
+/// Creates an empty array of a given size. No data is initialized
+///
+/// \param[in] dims The dimensions of the output array
+template<typename T>
+Array<T> createEmptyArray(const af::dim4 &dims);
+
+/// Create an Array object from Param object.
+///
+/// \param[in] tmp   The Param object from which the Array is created.
+/// \param[in] owner If true, the new Array<T> object is the owner of the
+///                  data. If false, the Array<T> will not delete the object
+///                  on destruction.
+template<typename T>
+Array<T> createParamArray(Param<T> &tmp, bool owner);
+
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent,
+                        const std::vector<af_seq> &index, bool copy = true);
+
+/// Destroys the Array object pointed to by \p A.
+template<typename T>
+void destroyArray(Array<T> *A);
+
+/// \brief Checks if the Node can be compiled successfully and the buffer
+///        references are not consuming most of the allocated memory
+///
+/// \param [in] node The root node which needs to be checked
+///
+/// \returns a kJITHeuristics value indicating whether the kernel generated by
+///          this node will fail to compile or its nodes use too much memory.
+template<typename T>
+kJITHeuristics passesJitHeuristics(nonstd::span<common::Node *> node);
+
+template<typename T>
+void *getDevicePtr(const Array<T> &arr);
+
+template<typename T>
+void *getRawPtr(const Array<T> &arr) {
+    // const sycl::buffer<T> *buf = arr.get();
+    // if (!buf) return NULL;
+    // cl_mem mem = (*buf)();
+    // return (void *)mem;
+
+    // TODO:
+    return nullptr;
+}
+
+template<typename T>
+using mapped_ptr = std::unique_ptr<T, std::function<void(void *)>>;
+
+template<typename T>
+class Array {
+    ArrayInfo info;  // This must be the first element of Array<T>
+
+    /// Pointer to the data
+    std::shared_ptr<sycl::buffer<T>> data;
+
+    /// The shape of the underlying parent data.
+    af::dim4 data_dims;
+
+    /// Null if this is a buffer node. Otherwise this points to a JIT node
+    common::Node_ptr node;
+
+    /// If true, the Array object is the parent. If false the data object points
+    /// to another array's data
+    bool owner;
+
+    Array(const af::dim4 &dims);
+
+    Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset,
+          const dim4 &stride);
+    Array(Param<T> &tmp, bool owner);
+    explicit Array(const af::dim4 &dims, common::Node_ptr n);
+    explicit Array(const af::dim4 &dims, const T *const in_data);
+
+    explicit Array(const af::dim4 &dims, sycl::buffer<T> *const mem,
+                   size_t offset, bool copy);
+
+    std::shared_ptr<sycl::buffer<T>> getData() const { return data; }
+
+   public:
+    Array(const Array<T> &other) = default;
+
+    Array(Array<T> &&other) noexcept = default;
+
+    Array<T> &operator=(Array<T> other) noexcept {
+        swap(other);
+        return *this;
+    }
+
+    void swap(Array<T> &other) noexcept {
+        using std::swap;
+        swap(info, other.info);
+        swap(data, other.data);
+        swap(data_dims, other.data_dims);
+        swap(node, other.node);
+        swap(owner, other.owner);
+    }
+
+    Array(const af::dim4 &dims, const af::dim4 &strides, dim_t offset,
+          const T *const in_data, bool is_device = false);
+    void resetInfo(const af::dim4 &dims) { info.resetInfo(dims); }
+    void resetDims(const af::dim4 &dims) { info.resetDims(dims); }
+    void modDims(const af::dim4 &newDims) { info.modDims(newDims); }
+    void modStrides(const af::dim4 &newStrides) { info.modStrides(newStrides); }
+    void setId(int id) { info.setId(id); }
+
+#define INFO_FUNC(RET_TYPE, NAME) \
+    RET_TYPE NAME() const { return info.NAME(); }
+
+    INFO_FUNC(const af_dtype &, getType)
+    INFO_FUNC(const af::dim4 &, strides)
+    INFO_FUNC(dim_t, elements)
+    INFO_FUNC(dim_t, ndims)
+    INFO_FUNC(const af::dim4 &, dims)
+    INFO_FUNC(int, getDevId)
+
+#undef INFO_FUNC
+
+#define INFO_IS_FUNC(NAME) \
+    bool NAME() const { return info.NAME(); }
+
+    INFO_IS_FUNC(isEmpty);
+    INFO_IS_FUNC(isScalar);
+    INFO_IS_FUNC(isRow);
+    INFO_IS_FUNC(isColumn);
+    INFO_IS_FUNC(isVector);
+    INFO_IS_FUNC(isComplex);
+    INFO_IS_FUNC(isReal);
+    INFO_IS_FUNC(isDouble);
+    INFO_IS_FUNC(isSingle);
+    INFO_IS_FUNC(isHalf);
+    INFO_IS_FUNC(isRealFloating);
+    INFO_IS_FUNC(isFloating);
+    INFO_IS_FUNC(isInteger);
+    INFO_IS_FUNC(isBool);
+    INFO_IS_FUNC(isLinear);
+    INFO_IS_FUNC(isSparse);
+
+#undef INFO_IS_FUNC
+    ~Array() = default;
+
+    bool isReady() const { return static_cast<bool>(node) == false; }
+    bool isOwner() const { return owner; }
+
+    void eval();
+    void eval() const;
+
+    sycl::buffer<T> *device();
+    sycl::buffer<T> *device() const {
+        return const_cast<Array<T> *>(this)->device();
+    }
+
+    sycl::buffer<T> *get() const {
+        if (!isReady()) { eval(); }
+        return data.get();
+    }
+
+    template<typename outT>
+    sycl::buffer<outT> getBufferWithOffset(dim_t offset = -1) const {
+        offset             = (offset == -1) ? getOffset() : offset;
+        dim_t sz_remaining = data_dims.elements() - offset;
+        if constexpr (std::is_same_v<outT, T>) {
+            if (offset == 0) { return *get(); }
+            return sycl::buffer<outT, 1>(*get(), sycl::id<1>(offset),
+                                         sycl::range<1>(sz_remaining));
+        } else {
+            if (offset == 0) { return get()->template reinterpret<outT, 1>(); }
+            return sycl::buffer<T, 1>(*get(), sycl::id<1>(offset),
+                                      sycl::range<1>(sz_remaining))
+                .template reinterpret<outT, 1>();
+        }
+    }
+
+    int useCount() const { return data.use_count(); }
+
+    dim_t getOffset() const { return info.getOffset(); }
+
+    dim4 getDataDims() const { return data_dims; }
+
+    void setDataDims(const dim4 &new_dims);
+
+    size_t getAllocatedBytes() const;
+
+    operator Param<T>() const {
+        KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
+                       {strides()[0], strides()[1], strides()[2], strides()[3]},
+                       getOffset()};
+
+        Param<T> out{(sycl::buffer<T> *)this->get(), info};
+        return out;
+    }
+
+    operator AParam<T, sycl::access_mode::write>() {
+        AParam<T, sycl::access_mode::write> out(*getData(), dims().get(),
+                                                strides().get(), getOffset());
+        return out;
+    }
+
+    operator AParam<T, sycl::access_mode::read>() const {
+        AParam<T, sycl::access_mode::read> out(*getData(), dims().get(),
+                                               strides().get(), getOffset());
+        return out;
+    }
+
+    operator KParam() const {
+        KParam kinfo = {
+            {dims()[0], dims()[1], dims()[2], dims()[3]},
+            {strides()[0], strides()[1], strides()[2], strides()[3]},
+            getOffset()};
+
+        return kinfo;
+    }
+
+    common::Node_ptr getNode() const;
+    common::Node_ptr getNode();
+
+   public:
+    mapped_ptr<T> getMappedPtr(cl_map_flags map_flags = CL_MAP_READ |
+                                                        CL_MAP_WRITE) const {
+        if (!isReady()) eval();
+        auto func = [data = data](void *ptr) {
+            if (ptr != nullptr) {
+                // cl_int err = getQueue().enqueueUnmapMemObject(*data, ptr);
+                // UNUSED(err);
+                ptr = nullptr;
+            }
+        };
+
+        // T *ptr = (T *)getQueue().enqueueMapBuffer(
+        //*static_cast<const sycl::buffer<T> *>(get()), CL_TRUE, map_flags,
+        // getOffset() * sizeof(T), elements() * sizeof(T), nullptr, nullptr,
+        // nullptr);
+
+        return mapped_ptr<T>(nullptr, func);
+    }
+
+    friend void evalMultiple<T>(std::vector<Array<T> *> arrays);
+
+    friend Array<T> createValueArray<T>(const af::dim4 &dims, const T &value);
+    friend Array<T> createHostDataArray<T>(const af::dim4 &dims,
+                                           const T *const data);
+    friend Array<T> createDeviceDataArray<T>(const af::dim4 &dims, void *data,
+                                             bool copy);
+    friend Array<T> createStridedArray<T>(const af::dim4 &dims,
+                                          const af::dim4 &strides, dim_t offset,
+                                          const T *const in_data,
+                                          bool is_device);
+
+    friend Array<T> createEmptyArray<T>(const af::dim4 &dims);
+    friend Array<T> createParamArray<T>(Param<T> &tmp, bool owner);
+    friend Array<T> createNodeArray<T>(const af::dim4 &dims,
+                                       common::Node_ptr node);
+
+    friend Array<T> createSubArray<T>(const Array<T> &parent,
+                                      const std::vector<af_seq> &index,
+                                      bool copy);
+
+    friend void destroyArray<T>(Array<T> *arr);
+    friend void *getDevicePtr<T>(const Array<T> &arr);
+    friend void *getRawPtr<T>(const Array<T> &arr);
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/CMakeLists.txt b/src/backend/oneapi/CMakeLists.txt
new file mode 100644
index 0000000000..a41d3fa3b7
--- /dev/null
+++ b/src/backend/oneapi/CMakeLists.txt
@@ -0,0 +1,400 @@
+# Copyright (c) 2022, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+if(AF_BUILD_ONEAPI)
+    enable_language(SYCL)
+endif()
+
+include(InternalUtils)
+include(build_cl2hpp)
+include(FileToString)
+
+add_library(afoneapi
+  Array.cpp
+  Array.hpp
+  Event.cpp
+  Event.hpp
+  GraphicsResourceManager.cpp
+  GraphicsResourceManager.hpp
+  Module.hpp
+  Param.cpp
+  Param.hpp
+  all.cpp
+  anisotropic_diffusion.cpp
+  anisotropic_diffusion.hpp
+  any.cpp
+  approx1.cpp
+  approx2.cpp
+  approx.hpp
+  arith.hpp
+  assign.cpp
+  assign.hpp
+  backend.hpp
+  bilateral.cpp
+  bilateral.hpp
+  binary.hpp
+  blas.cpp
+  blas.hpp
+  canny.cpp
+  canny.hpp
+  cast.hpp
+  cholesky.cpp
+  cholesky.hpp
+  compile_module.cpp
+  complex.hpp
+  convolve.cpp
+  convolve.hpp
+  convolve_separable.cpp
+  copy.cpp
+  copy.hpp
+  count.cpp
+  device_manager.cpp
+  device_manager.hpp
+  diagonal.cpp
+  diagonal.hpp
+  diff.cpp
+  diff.hpp
+  err_oneapi.hpp
+  errorcodes.cpp
+  errorcodes.hpp
+  exampleFunction.cpp
+  exampleFunction.hpp
+  fast.cpp
+  fast.hpp
+  fft.cpp
+  fft.hpp
+  fftconvolve.cpp
+  fftconvolve.hpp
+  flood_fill.cpp
+  flood_fill.hpp
+  gradient.cpp
+  gradient.hpp
+  harris.cpp
+  harris.hpp
+  hist_graphics.cpp
+  hist_graphics.hpp
+  histogram.cpp
+  histogram.hpp
+  homography.cpp
+  homography.hpp
+  hsv_rgb.cpp
+  hsv_rgb.hpp
+  identity.cpp
+  identity.hpp
+  iir.cpp
+  iir.hpp
+  image.cpp
+  image.hpp
+  index.cpp
+  index.hpp
+  inverse.cpp
+  inverse.hpp
+  iota.cpp
+  iota.hpp
+  ireduce.cpp
+  ireduce.hpp
+  jit.cpp
+  jit/BufferNode.hpp
+  jit/ShiftNode.hpp
+  jit/kernel_generators.hpp
+  join.cpp
+  join.hpp
+  logic.hpp
+  lookup.cpp
+  lookup.hpp
+  lu.cpp
+  lu.hpp
+  match_template.cpp
+  match_template.hpp
+  math.cpp
+  math.hpp
+  max.cpp
+  mean.cpp
+  mean.hpp
+  meanshift.cpp
+  meanshift.hpp
+  medfilt.cpp
+  medfilt.hpp
+  memory.cpp
+  memory.hpp
+  min.cpp
+  minmax_op.hpp
+  moments.cpp
+  moments.hpp
+  morph.cpp
+  morph.hpp
+  nearest_neighbour.cpp
+  nearest_neighbour.hpp
+  orb.cpp
+  orb.hpp
+  platform.cpp
+  platform.hpp
+  plot.cpp
+  plot.hpp
+  print.hpp
+  product.cpp
+  qr.cpp
+  qr.hpp
+  random_engine.cpp
+  random_engine.hpp
+  range.cpp
+  range.hpp
+  reduce.hpp
+  reduce_impl.hpp
+  regions.cpp
+  regions.hpp
+  reorder.cpp
+  reorder.hpp
+  reshape.cpp
+  resize.cpp
+  resize.hpp
+  rotate.cpp
+  rotate.hpp
+  scalar.hpp
+  scan.cpp
+  scan.hpp
+  scan_by_key.cpp
+  scan_by_key.hpp
+  select.cpp
+  select.hpp
+  set.cpp
+  set.hpp
+  shift.cpp
+  shift.hpp
+  sift.cpp
+  sift.hpp
+  sobel.cpp
+  sobel.hpp
+  solve.cpp
+  solve.hpp
+  sort.cpp
+  sort.hpp
+  sort_by_key.cpp
+  sort_by_key.hpp
+  sort_index.cpp
+  sort_index.hpp
+  sparse.cpp
+  sparse.hpp
+  sparse_arith.cpp
+  sparse_arith.hpp
+  sparse_blas.cpp
+  sparse_blas.hpp
+  sum.cpp
+  surface.cpp
+  surface.hpp
+  susan.cpp
+  susan.hpp
+  svd.cpp
+  svd.hpp
+  tile.cpp
+  tile.hpp
+  topk.cpp
+  topk.hpp
+  transform.cpp
+  transform.hpp
+  transpose.cpp
+  transpose_inplace.cpp
+  transpose.hpp
+  triangle.cpp
+  triangle.hpp
+  types.hpp
+  unwrap.cpp
+  unwrap.hpp
+  vector_field.cpp
+  vector_field.hpp
+  where.cpp
+  where.hpp
+  wrap.cpp
+  wrap.hpp
+  )
+
+target_sources(afoneapi
+  PRIVATE
+    kernel/KParam.hpp
+    kernel/accessors.hpp
+    kernel/approx1.hpp
+    kernel/approx2.hpp
+    kernel/assign.hpp
+    kernel/bilateral.hpp
+    kernel/convolve_separable.cpp
+    kernel/diagonal.hpp
+    kernel/diff.hpp
+    kernel/fftconvolve_common.hpp
+    kernel/fftconvolve_multiply.hpp
+    kernel/fftconvolve_pack.hpp
+    kernel/fftconvolve_pad.hpp
+    kernel/fftconvolve_reorder.hpp
+    kernel/histogram.hpp
+    kernel/iir.hpp
+    kernel/identity.hpp
+    kernel/interp.hpp
+    kernel/iota.hpp
+    kernel/ireduce.hpp
+    kernel/lu_split.hpp
+    kernel/memcopy.hpp
+    kernel/mean.hpp
+    kernel/pad_array_borders.hpp
+    kernel/random_engine.hpp
+    kernel/random_engine_write.hpp
+    kernel/random_engine_mersenne.hpp
+    kernel/random_engine_philox.hpp
+    kernel/random_engine_threefry.hpp
+    kernel/range.hpp
+    kernel/reduce.hpp
+    kernel/reduce_all.hpp
+    kernel/reduce_by_key.hpp
+    kernel/reduce_first.hpp
+    kernel/reduce_dim.hpp
+    kernel/reorder.hpp
+    kernel/scan_first.hpp
+    kernel/scan_dim.hpp
+    kernel/sort.hpp
+    kernel/sort_by_key.hpp
+    kernel/sparse.hpp
+    kernel/sparse_arith.hpp
+    kernel/transpose.hpp
+    kernel/transpose_inplace.hpp
+    kernel/triangle.hpp
+    kernel/unwrap.hpp
+    kernel/where.hpp
+    kernel/wrap.hpp
+    kernel/wrap_dilated.hpp
+)
+
+function(set_sycl_language)
+  foreach(target ${ARGV})
+    set_target_properties(${target}
+      PROPERTIES
+        LINKER_LANGUAGE SYCL)
+
+    get_target_property(target_type ${target} TYPE)
+    if(NOT (${target_type} STREQUAL "INTERFACE_LIBRARY"))
+      target_compile_options(${target} PRIVATE ${MSVC_RUNTIME})
+    endif()
+
+    get_target_property(TGT_SOURCES ${target} SOURCES)
+    if(NOT TGT_SOURCES)
+      get_target_property(TGT_SOURCES ${target} INTERFACE_SOURCES)
+    endif()
+
+    foreach(FILE ${TGT_SOURCES})
+      get_filename_component(FILE_EXTENSION ${FILE} EXT)
+      if(FILE_EXTENSION STREQUAL ".cpp")
+        set_source_files_properties(${FILE} PROPERTIES LANGUAGE SYCL)
+      endif()
+    endforeach()
+  endforeach()
+endfunction()
+
+set(kernel_src
+  ${CMAKE_CURRENT_SOURCE_DIR}/../opencl/kernel/KParam.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/../opencl/kernel/jit.cl
+)
+
+set(kernel_headers_dir "kernel_headers")
+
+file_to_string(
+  SOURCES ${kernel_src}
+  VARNAME kernel_files
+  EXTENSION "hpp"
+  OUTPUT_DIR ${kernel_headers_dir}
+  TARGETS cl_kernel_targets
+  NAMESPACE "arrayfire oneapi opencl"
+)
+
+add_dependencies(afoneapi ${cl_kernel_targets})
+
+add_library(ArrayFire::afoneapi ALIAS afoneapi)
+
+arrayfire_set_default_cxx_flags(afoneapi)
+
+include("${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/CMakeLists.txt")
+
+target_include_directories(afoneapi
+  SYSTEM PRIVATE
+    ${SYCL_INCLUDE_DIR}
+)
+
+target_include_directories(afoneapi
+  PUBLIC
+    $<BUILD_INTERFACE:${ArrayFire_SOURCE_DIR}/include>
+    $<BUILD_INTERFACE:${ArrayFire_BINARY_DIR}/include>
+    $<INSTALL_INTERFACE:${AF_INSTALL_INC_DIR}>
+  PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${CMAKE_CURRENT_BINARY_DIR}
+  )
+
+target_compile_options(afoneapi
+  PRIVATE
+  $<$<COMPILE_LANGUAGE:SYCL>:
+    -fno-sycl-id-queries-fit-in-int
+    -sycl-std=2020
+    $<$<PLATFORM_ID:Linux>: -fno-sycl-rdc>
+    >
+)
+
+target_compile_definitions(afoneapi
+  PRIVATE
+    AF_ONEAPI
+    WITH_LINEAR_ALGEBRA
+    CL_TARGET_OPENCL_VERSION=300
+    CL_HPP_TARGET_OPENCL_VERSION=300
+    CL_HPP_MINIMUM_OPENCL_VERSION=110
+    CL_HPP_ENABLE_EXCEPTIONS
+    AF_MKL_INTERFACE_SIZE=${MKL_INTERFACE_INTEGER_SIZE}
+  )
+if(MKL_INTERFACE_INTEGER_SIZE EQUAL 8)
+  target_compile_definitions(afoneapi PRIVATE MKL_ILP64)
+endif()
+
+cmake_host_system_information(RESULT NumberOfThreads
+  QUERY NUMBER_OF_LOGICAL_CORES)
+
+target_link_libraries(afoneapi
+  PRIVATE
+    c_api_interface
+    cpp_api_interface
+    oneapi_sort_by_key
+    afcommon_interface
+    OpenCL::OpenCL
+    OpenCL::cl2hpp
+    -fno-sycl-id-queries-fit-in-int
+    $<$<PLATFORM_ID:Linux>:-flink-huge-device-code>
+    $<$<PLATFORM_ID:Linux>:-fvisibility-inlines-hidden>
+    $<$<PLATFORM_ID:Linux>:-fno-sycl-rdc>
+    $<$<PLATFORM_ID:Linux>:-Wl,--build-id>
+    -fsycl-max-parallel-link-jobs=${NumberOfThreads}
+    MKL::MKL_SYCL
+  )
+set_sycl_language(afcommon_interface
+  oneapi_sort_by_key
+  c_api_interface
+  cpp_api_interface
+  afoneapi)
+
+
+#af_split_debug_info(afoneapi ${AF_INSTALL_LIB_DIR})
+
+install(TARGETS afoneapi
+  EXPORT ArrayFireoneAPITargets
+  COMPONENT oneapi
+  PUBLIC_HEADER DESTINATION af
+  RUNTIME DESTINATION ${AF_INSTALL_BIN_DIR}
+  LIBRARY DESTINATION ${AF_INSTALL_LIB_DIR}
+  ARCHIVE DESTINATION ${AF_INSTALL_LIB_DIR}
+  FRAMEWORK DESTINATION framework
+  INCLUDES DESTINATION ${AF_INSTALL_INC_DIR}
+)
+
+source_group(include REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/include/*)
+source_group(api\\cpp REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/cpp/*)
+source_group(api\\c   REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/c/*)
+source_group(backend  REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/backend/common/*|${CMAKE_CURRENT_SOURCE_DIR}/*)
+source_group(backend\\kernel  REGULAR_EXPRESSION ${CMAKE_CURRENT_SOURCE_DIR}/kernel/*)
+source_group("generated files" FILES ${ArrayFire_BINARY_DIR}/src/backend/build_version.hpp ${ArrayFire_BINARY_DIR}/include/af/version.h)
+source_group("" FILES CMakeLists.txt)
diff --git a/src/backend/oneapi/Event.cpp b/src/backend/oneapi/Event.cpp
new file mode 100644
index 0000000000..60bc8bcb77
--- /dev/null
+++ b/src/backend/oneapi/Event.cpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Event.hpp>
+
+#include <err_oneapi.hpp>
+#include <events.hpp>
+#include <platform.hpp>
+#include <af/event.h>
+#include <memory>
+
+#include <memory>
+
+using std::make_unique;
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace oneapi {
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(sycl::queue& queue) {
+    Event e;
+    if (e.create() == 0) { e.mark(queue); }
+    return e;
+}
+
+af_event createEvent() {
+    auto e = make_unique<Event>();
+    // Ensure the default SYCL queue is initialized
+    getQueue();
+    if (e->create() != 0) {
+        AF_ERROR("Could not create event", AF_ERR_RUNTIME);
+    }
+    Event& ref = *e.release();
+    return getHandle(ref);
+}
+
+void markEventOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active queue
+    if (event.mark(getQueue()) != 0) {
+        AF_ERROR("Could not mark event on active queue", AF_ERR_RUNTIME);
+    }
+}
+
+void enqueueWaitOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active queue
+    if (event.enqueueWait(getQueue()) != 0) {
+        AF_ERROR("Could not enqueue wait on active queue for event",
+                 AF_ERR_RUNTIME);
+    }
+}
+
+void block(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    if (event.block() != 0) {
+        AF_ERROR("Could not block on active queue for event", AF_ERR_RUNTIME);
+    }
+}
+
+af_event createAndMarkEvent() {
+    af_event handle = createEvent();
+    markEventOnActiveQueue(handle);
+    return handle;
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Event.hpp b/src/backend/oneapi/Event.hpp
new file mode 100644
index 0000000000..44af139cda
--- /dev/null
+++ b/src/backend/oneapi/Event.hpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/EventBase.hpp>
+#include <af/event.h>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+class OneAPIEventPolicy {
+   public:
+    using EventType = sycl::event *;
+    using QueueType = sycl::queue;
+    using ErrorType = int;
+
+    static ErrorType createAndMarkEvent(EventType *e) noexcept {
+        *e = new sycl::event;
+        return 0;
+    }
+
+    static ErrorType markEvent(EventType *e, QueueType stream) noexcept {
+        **e = stream.ext_oneapi_submit_barrier();
+        return 0;
+    }
+
+    static ErrorType waitForEvent(EventType *e, QueueType stream) noexcept {
+        stream.ext_oneapi_submit_barrier({**e});
+        return 0;
+    }
+
+    static ErrorType syncForEvent(EventType *e) noexcept {
+        (*e)->wait();
+        return 0;
+    }
+
+    static ErrorType destroyEvent(EventType *e) noexcept {
+        delete *e;
+        return 0;
+    }
+};
+
+using Event = common::EventBase<OneAPIEventPolicy>;
+
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(sycl::queue &queue);
+
+af_event createEvent();
+
+void markEventOnActiveQueue(af_event eventHandle);
+
+void enqueueWaitOnActiveQueue(af_event eventHandle);
+
+void block(af_event eventHandle);
+
+af_event createAndMarkEvent();
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/GraphicsResourceManager.cpp b/src/backend/oneapi/GraphicsResourceManager.cpp
new file mode 100644
index 0000000000..cb03ce0a4f
--- /dev/null
+++ b/src/backend/oneapi/GraphicsResourceManager.cpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <GraphicsResourceManager.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+GraphicsResourceManager::ShrdResVector
+GraphicsResourceManager::registerResources(
+    const std::vector<uint32_t>& resources) {
+    ShrdResVector output;
+    return output;
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/GraphicsResourceManager.hpp b/src/backend/oneapi/GraphicsResourceManager.hpp
new file mode 100644
index 0000000000..1f19c6f8c0
--- /dev/null
+++ b/src/backend/oneapi/GraphicsResourceManager.hpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/InteropManager.hpp>
+
+#include <map>
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+class GraphicsResourceManager
+    : public common::InteropManager<GraphicsResourceManager, std::byte> {
+   public:
+    using ShrdResVector = std::vector<std::shared_ptr<std::byte>>;
+
+    GraphicsResourceManager() {}
+    static ShrdResVector registerResources(
+        const std::vector<uint32_t>& resources);
+
+   protected:
+    GraphicsResourceManager(GraphicsResourceManager const&);
+    void operator=(GraphicsResourceManager const&);
+};
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Kernel.hpp b/src/backend/oneapi/Kernel.hpp
new file mode 100644
index 0000000000..c0f15356f8
--- /dev/null
+++ b/src/backend/oneapi/Kernel.hpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/KernelInterface.hpp>
+#include <common/Logger.hpp>
+
+#include <backend.hpp>
+#include <sycl/sycl.hpp>
+#include <string>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel_logger {
+inline auto getLogger() -> spdlog::logger* {
+    static auto logger = common::loggerFactory("kernel");
+    return logger.get();
+}
+}  // namespace kernel_logger
+
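+/// Placeholder functor for launching kernels. It currently only logs the
+/// kernel name; the actual launch call is commented out.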
+struct Enqueuer {
+    template<typename... Args>
+    void operator()(std::string name, sycl::kernel ker, const Enqueuer& qArgs,
+                    Args&&... args) {
+        // auto launchOp = cl::KernelFunctor<Args...>(ker);
+        using namespace kernel_logger;
+        AF_TRACE("Launching {}", name);
+        // launchOp(qArgs, std::forward<Args>(args)...);
+    }
+};
+
+class Kernel {
+    //   public:
+    //    using BaseClass =
+    //      common::KernelInterface<ModuleType, KernelType, Enqueuer,
+    //      sycl::buffer<float>*>;
+    //
+    //  Kernel() : {}
+    //    Kernel(std::string name, ModuleType mod, KernelType ker)
+    //        : BaseClass(name, mod, ker) {}
+    //
+    //    // clang-format off
+    //    [[deprecated("OpenCL backend doesn't need Kernel::getDevPtr method")]]
+    //    DevPtrType getDevPtr(const char* name) final;
+    //    // clang-format on
+    //
+    //    void copyToReadOnly(DevPtrType dst, DevPtrType src, size_t bytes)
+    // final;
+    //
+    //    void setFlag(DevPtrType dst, int* scalarValPtr,
+    //                 const bool syncCopy = false) final;
+    //
+    //    int getFlag(DevPtrType src) final;
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Module.hpp b/src/backend/oneapi/Module.hpp
new file mode 100644
index 0000000000..dc2afe676d
--- /dev/null
+++ b/src/backend/oneapi/Module.hpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/ModuleInterface.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+/// oneapi backend wrapper for an executable sycl::kernel_bundle object
+class Module
+    : public common::ModuleInterface<
+          sycl::kernel_bundle<sycl::bundle_state::executable> *> {
+   public:
+    using ModuleType = sycl::kernel_bundle<sycl::bundle_state::executable> *;
+    using BaseClass  = common::ModuleInterface<ModuleType>;
+
+    /// \brief Create an uninitialized Module
+    Module() = default;
+
+    /// \brief Create a module given a pointer to an executable kernel bundle
+    Module(ModuleType mod) : BaseClass(mod) {}
+
+    /// \brief Reports whether the wrapped kernel bundle is empty
+    operator bool() const final { return get()->empty(); }
+
+    /// Unload the module
+    void unload() final {
+        // TODO(oneapi): Unload kernel/program
+        ;
+    }
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Param.cpp b/src/backend/oneapi/Param.cpp
new file mode 100644
index 0000000000..6528f707f4
--- /dev/null
+++ b/src/backend/oneapi/Param.cpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <kernel/KParam.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Param<T> makeParam(sycl::buffer<T> &mem, int off, const int dims[4],
+                   const int strides[4]) {
+    Param<T> out;
+    out.data        = &mem;
+    out.info.offset = off;
+    for (int i = 0; i < 4; i++) {
+        out.info.dims[i]    = dims[i];
+        out.info.strides[i] = strides[i];
+    }
+    return out;
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/Param.hpp b/src/backend/oneapi/Param.hpp
new file mode 100644
index 0000000000..4a935c5e2c
--- /dev/null
+++ b/src/backend/oneapi/Param.hpp
@@ -0,0 +1,119 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <sycl/sycl.hpp>
+
+#include <kernel/KParam.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <optional>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+struct Param {
+    sycl::buffer<T>* data;
+    KParam info;
+    Param& operator=(const Param& other) = default;
+    Param(const Param& other)            = default;
+    Param(Param&& other)                 = default;
+
+    dim_t* dims_ptr() { return info.dims; }
+    dim_t* strides_ptr() { return info.strides; }
+
+    // AF_DEPRECATED("Use Array<T>")
+    Param() : data(nullptr), info{{0, 0, 0, 0}, {0, 0, 0, 0}, 0} {}
+
+    // AF_DEPRECATED("Use Array<T>")
+    Param(sycl::buffer<T>* data_, KParam info_) : data(data_), info(info_) {}
+
+    template<sycl::access::mode MODE>
+    sycl::accessor<data_t<T>, 1, MODE> get_accessor(sycl::handler& h) const {
+        auto o = data->template reinterpret<data_t<T>>();
+        return sycl::accessor<data_t<T>, 1, MODE>(o, h);
+    }
+
+    ~Param() = default;
+};
+
+template<typename T, sycl::access_mode AM>
+struct AParam {
+    sycl::accessor<T, 1, AM, sycl::target::device,
+                   sycl::access::placeholder::true_t>
+        data;
+    af::dim4 dims;
+    af::dim4 strides;
+    dim_t offset;
+    AParam& operator=(const AParam& other) = default;
+    AParam(const AParam& other)            = default;
+    AParam(AParam&& other)                 = default;
+
+    dim_t* dims_ptr() { return dims.get(); }
+    dim_t* strides_ptr() { return strides.get(); }
+
+    // AF_DEPRECATED("Use Array<T>")
+    AParam() : data(), dims{0, 0, 0, 0}, strides{0, 0, 0, 0}, offset(0) {}
+
+    AParam(sycl::buffer<T, 1>& data_, const dim_t dims_[4],
+           const dim_t strides_[4], dim_t offset_)
+        : data(data_), dims(4, dims_), strides(4, strides_), offset(offset_) {}
+    // AF_DEPRECATED("Use Array<T>")
+    AParam(sycl::handler& h, sycl::buffer<T, 1>& data_, const dim_t dims_[4],
+           const dim_t strides_[4], dim_t offset_)
+        : data(data_), dims(4, dims_), strides(4, strides_), offset(offset_) {
+        require(h);
+    }
+
+    template<sycl::access::mode MODE>
+    sycl::accessor<data_t<T>, 1, MODE> get_accessor(sycl::handler& h) const {
+        return *data;
+    }
+
+    void require(sycl::handler& h) const { h.require(data); }
+
+    operator KParam() const {
+        return KParam{{dims[0], dims[1], dims[2], dims[3]},
+                      {strides[0], strides[1], strides[2], strides[3]},
+                      offset};
+    }
+
+    ~AParam() = default;
+};
+
+// AF_DEPRECATED("Use Array<T>")
+template<typename T>
+Param<T> makeParam(sycl::buffer<T>& mem, int off, const int dims[4],
+                   const int strides[4]);
+
+namespace opencl {
+
+template<typename T>
+struct Param {
+    cl_mem data;
+    KParam info;
+    Param& operator=(const Param& other) = default;
+    Param(const Param& other)            = default;
+    Param(Param&& other)                 = default;
+    Param(cl_mem data_, KParam info_) : data(data_), info(info_) {}
+
+    // AF_DEPRECATED("Use Array<T>")
+    Param() : data(nullptr), info{{0, 0, 0, 0}, {0, 0, 0, 0}, 0} {}
+
+    // AF_DEPRECATED("Use Array<T>")
+    Param(sycl::buffer<T>* data_, KParam info_) : data(data_), info(info_) {}
+
+    ~Param() = default;
+};
+}  // namespace opencl
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/all.cpp b/src/backend/oneapi/all.cpp
new file mode 100644
index 0000000000..e4e86232d2
--- /dev/null
+++ b/src/backend/oneapi/all.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// alltrue
+INSTANTIATE(af_and_t, float, char)
+INSTANTIATE(af_and_t, double, char)
+INSTANTIATE(af_and_t, cfloat, char)
+INSTANTIATE(af_and_t, cdouble, char)
+INSTANTIATE(af_and_t, int, char)
+INSTANTIATE(af_and_t, uint, char)
+INSTANTIATE(af_and_t, intl, char)
+INSTANTIATE(af_and_t, uintl, char)
+INSTANTIATE(af_and_t, char, char)
+INSTANTIATE(af_and_t, schar, char)
+INSTANTIATE(af_and_t, uchar, char)
+INSTANTIATE(af_and_t, short, char)
+INSTANTIATE(af_and_t, ushort, char)
+INSTANTIATE(af_and_t, half, char)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/anisotropic_diffusion.cpp b/src/backend/oneapi/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..912ee6d986
--- /dev/null
+++ b/src/backend/oneapi/anisotropic_diffusion.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <anisotropic_diffusion.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+#define INSTANTIATE(T)                                     \
+    template void anisotropicDiffusion<T>(                 \
+        Array<T> & inout, const float dt, const float mct, \
+        const af::fluxFunction fftype, const af::diffusionEq eq);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/anisotropic_diffusion.hpp b/src/backend/oneapi/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..71ed5a9bc4
--- /dev/null
+++ b/src/backend/oneapi/anisotropic_diffusion.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/any.cpp b/src/backend/oneapi/any.cpp
new file mode 100644
index 0000000000..82e242a989
--- /dev/null
+++ b/src/backend/oneapi/any.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// anytrue
+INSTANTIATE(af_or_t, float, char)
+INSTANTIATE(af_or_t, double, char)
+INSTANTIATE(af_or_t, cfloat, char)
+INSTANTIATE(af_or_t, cdouble, char)
+INSTANTIATE(af_or_t, int, char)
+INSTANTIATE(af_or_t, uint, char)
+INSTANTIATE(af_or_t, intl, char)
+INSTANTIATE(af_or_t, uintl, char)
+INSTANTIATE(af_or_t, char, char)
+INSTANTIATE(af_or_t, schar, char)
+INSTANTIATE(af_or_t, uchar, char)
+INSTANTIATE(af_or_t, short, char)
+INSTANTIATE(af_or_t, ushort, char)
+INSTANTIATE(af_or_t, half, char)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/approx.cpp b/src/backend/oneapi/approx.cpp
new file mode 100644
index 0000000000..825c9072fb
--- /dev/null
+++ b/src/backend/oneapi/approx.cpp
@@ -0,0 +1,88 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <approx.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/approx1.hpp>
+#include <kernel/approx2.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx1<Ty, Tp, 1>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+            kernel::approx1<Ty, Tp, 2>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+            kernel::approx1<Ty, Tp, 3>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        default: break;
+    }
+}
+
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 1);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 2);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 3);
+            break;
+        default: break;
+    }
+}
+
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx1<Ty, Tp>(                                \
+        Array<Ty> & yo, const Array<Ty> &yi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const af_interp_type method, const float offGrid);        \
+    template void approx2<Ty, Tp>(                                \
+        Array<Ty> & zo, const Array<Ty> &zi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const Array<Tp> &yo, const int ydim, const Tp &yi_beg,    \
+        const Tp &yi_step, const af_interp_type method, const float offGrid);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/approx.hpp b/src/backend/oneapi/approx.hpp
new file mode 100644
index 0000000000..b895dac8aa
--- /dev/null
+++ b/src/backend/oneapi/approx.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid);
+
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/approx1.cpp b/src/backend/oneapi/approx1.cpp
new file mode 100644
index 0000000000..0271d0a4ed
--- /dev/null
+++ b/src/backend/oneapi/approx1.cpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <approx.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/approx1.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx1<Ty, Tp, 1>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+            kernel::approx1<Ty, Tp, 2>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+            kernel::approx1<Ty, Tp, 3>(yo, yi, xo, xdim, xi_beg, xi_step,
+                                       offGrid, method);
+            break;
+        default: break;
+    }
+}
+
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx1<Ty, Tp>(                                \
+        Array<Ty> & yo, const Array<Ty> &yi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const af_interp_type method, const float offGrid);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/approx2.cpp b/src/backend/oneapi/approx2.cpp
new file mode 100644
index 0000000000..e491a5be5e
--- /dev/null
+++ b/src/backend/oneapi/approx2.cpp
@@ -0,0 +1,58 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <approx.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/approx2.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx2<Ty, Tp, 1>(zo, zi, xo, xdim, xi_beg, xi_step, yo,
+                                       ydim, yi_beg, yi_step, offGrid, method);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::approx2<Ty, Tp, 2>(zo, zi, xo, xdim, xi_beg, xi_step, yo,
+                                       ydim, yi_beg, yi_step, offGrid, method);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::approx2<Ty, Tp, 3>(zo, zi, xo, xdim, xi_beg, xi_step, yo,
+                                       ydim, yi_beg, yi_step, offGrid, method);
+            break;
+        default: break;
+    }
+}
+
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx2<Ty, Tp>(                                \
+        Array<Ty> & zo, const Array<Ty> &zi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const Array<Tp> &yo, const int ydim, const Tp &yi_beg,    \
+        const Tp &yi_step, const af_interp_type method, const float offGrid);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/arith.hpp b/src/backend/oneapi/arith.hpp
new file mode 100644
index 0000000000..815df91b57
--- /dev/null
+++ b/src/backend/oneapi/arith.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <optypes.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &&lhs, const Array<T> &&rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
+}
+
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/assign.cpp b/src/backend/oneapi/assign.cpp
new file mode 100644
index 0000000000..de436495db
--- /dev/null
+++ b/src/backend/oneapi/assign.cpp
@@ -0,0 +1,91 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <assign.hpp>
+#include <kernel/assign.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <handle.hpp>
+#include <memory.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs) {
+    AssignKernelParam p;
+    std::vector<af_seq> seqs(4, af_span);
+    // create seq vector to retrieve output
+    // dimensions, offsets & strides
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
+    }
+
+    // retrieve dimensions, strides and offsets
+    const dim4& dDims = out.dims();
+    // retrieve offsets & strides for the array
+    // to which rhs is being copied
+    dim4 dstOffs  = toOffset(seqs, dDims);
+    dim4 dstStrds = toStride(seqs, dDims);
+
+    for (dim_t i = 0; i < 4; ++i) {
+        p.isSeq[i] = idxrs[i].isSeq;
+        p.offs[i]  = dstOffs[i];
+        p.strds[i] = dstStrds[i];
+    }
+
+    sycl::buffer<uint>* bPtrs[4];
+
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
+    // look through indices to read af_array indices
+    for (dim_t x = 0; x < 4; ++x) {
+        // set index pointers where applicable
+        if (!p.isSeq[x]) {
+            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
+            bPtrs[x]   = idxArrs[x].get();
+        } else {
+            // allocate a 1-element buffer to prevent OpenCL from failing; use
+            // direct buffer allocation rather than the memory manager to avoid
+            // reference count discrepancies between different backends
+            static auto* empty = new sycl::buffer<uint>(sycl::range{1});
+            bPtrs[x]           = empty;
+        }
+    }
+
+    kernel::assign<T>(out, rhs, p, bPtrs);
+    return;
+}
+
+#define INSTANTIATE(T)                                                \
+    template void assign<T>(Array<T> & out, const af_index_t idxrs[], \
+                            const Array<T>& rhs);
+
+INSTANTIATE(cdouble)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/assign.hpp b/src/backend/oneapi/assign.hpp
new file mode 100644
index 0000000000..cb26fd515b
--- /dev/null
+++ b/src/backend/oneapi/assign.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/index.h>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/backend.hpp b/src/backend/oneapi/backend.hpp
new file mode 100644
index 0000000000..2eb14151d8
--- /dev/null
+++ b/src/backend/oneapi/backend.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifdef __DH__
+#undef __DH__
+#endif
+
+#ifdef __CUDACC__
+#define __DH__ __device__ __host__
+#else
+#define __DH__
+#endif
+
+namespace arrayfire {
+namespace oneapi {}
+}  // namespace arrayfire
+
+namespace detail = arrayfire::oneapi;
diff --git a/src/backend/oneapi/bilateral.cpp b/src/backend/oneapi/bilateral.cpp
new file mode 100644
index 0000000000..6520cf9ffa
--- /dev/null
+++ b/src/backend/oneapi/bilateral.cpp
@@ -0,0 +1,44 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <bilateral.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/bilateral.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &sSigma,
+                         const float &cSigma) {
+    Array<outType> out = createEmptyArray<outType>(in.dims());
+    kernel::bilateral<inType, outType>(out, in, sSigma, cSigma);
+    return out;
+}
+
+#define INSTANTIATE(inT, outT)                                    \
+    template Array<outT> bilateral<inT, outT>(const Array<inT> &, \
+                                              const float &, const float &);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/bilateral.hpp b/src/backend/oneapi/bilateral.hpp
new file mode 100644
index 0000000000..f88145cd7b
--- /dev/null
+++ b/src/backend/oneapi/bilateral.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &spatialSigma,
+                         const float &chromaticSigma);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/binary.hpp b/src/backend/oneapi/binary.hpp
new file mode 100644
index 0000000000..8bd36aff7e
--- /dev/null
+++ b/src/backend/oneapi/binary.hpp
@@ -0,0 +1,133 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <optypes.hpp>
+#include <common/half.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename To, typename Ti, af_op_t op>
+struct BinOp;
+
+#define BINARY_TYPE_1(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
+
+BINARY_TYPE_1(eq)
+BINARY_TYPE_1(neq)
+BINARY_TYPE_1(lt)
+BINARY_TYPE_1(le)
+BINARY_TYPE_1(gt)
+BINARY_TYPE_1(ge)
+BINARY_TYPE_1(add)
+BINARY_TYPE_1(sub)
+BINARY_TYPE_1(mul)
+BINARY_TYPE_1(div)
+BINARY_TYPE_1(and)
+BINARY_TYPE_1(or)
+BINARY_TYPE_1(bitand)
+BINARY_TYPE_1(bitor)
+BINARY_TYPE_1(bitxor)
+BINARY_TYPE_1(bitshiftl)
+BINARY_TYPE_1(bitshiftr)
+
+#undef BINARY_TYPE_1
+
+#define BINARY_TYPE_2(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, float, af_##fn##_t> {           \
+        const char *name() { return "f" #fn; }       \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, double, af_##fn##_t> {          \
+        const char *name() { return "f" #fn; }       \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
+
+BINARY_TYPE_2(min)
+BINARY_TYPE_2(max)
+BINARY_TYPE_2(rem)
+BINARY_TYPE_2(mod)
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_pow_t> {
+    const char *name() { return "__pow"; }
+};
+
+#define POW_BINARY_OP(INTYPE, OPNAME)         \
+    template<typename To>                     \
+    struct BinOp<To, INTYPE, af_pow_t> {      \
+        const char *name() { return OPNAME; } \
+    };
+
+POW_BINARY_OP(double, "pow")
+POW_BINARY_OP(float, "pow")
+POW_BINARY_OP(half, "pow")
+POW_BINARY_OP(intl, "__powll")
+POW_BINARY_OP(uintl, "__powul")
+POW_BINARY_OP(uint, "__powui")
+POW_BINARY_OP(int, "__powsi")
+
+#undef POW_BINARY_OP
+
+template<typename Ti>
+struct BinOp<cfloat, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2f"; }
+};
+
+template<typename Ti>
+struct BinOp<cdouble, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2"; }
+};
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_cplx2_t> {
+    const char *name() { return "noop"; }
+};
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_atan2_t> {
+    const char *name() { return "atan2"; }
+};
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_hypot_t> {
+    const char *name() { return "hypot"; }
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/blas.cpp b/src/backend/oneapi/blas.cpp
new file mode 100644
index 0000000000..93ae6559a4
--- /dev/null
+++ b/src/backend/oneapi/blas.cpp
@@ -0,0 +1,251 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <blas.hpp>
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <common/half.hpp>
+#include <common/kernel_type.hpp>
+#include <common/traits.hpp>
+#include <complex.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <transpose.hpp>
+#include <types.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <oneapi/mkl/blas.hpp>
+
+#include <complex>
+#include <vector>
+
+using arrayfire::common::half;
+
+// Converts an af_mat_prop option to a transpose type for oneMKL
+static oneapi::mkl::transpose toBlasTranspose(af_mat_prop opt) {
+    switch (opt) {
+        case AF_MAT_NONE: return oneapi::mkl::transpose::nontrans;
+        case AF_MAT_TRANS: return oneapi::mkl::transpose::trans;
+        case AF_MAT_CTRANS: return oneapi::mkl::transpose::conjtrans;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+}
+
+template<typename T>
+static void gemvDispatch(sycl::queue queue, oneapi::mkl::transpose lOpts,
+                         oneapi::mkl::transpose rOpts, int M, int N,
+                         const T *alpha, const arrayfire::oneapi::Array<T> &lhs,
+                         dim_t lStride, const arrayfire::oneapi::Array<T> &x,
+                         dim_t incx, const T *beta,
+                         arrayfire::oneapi::Array<T> &out, dim_t oInc) {
+    using Dt                   = arrayfire::oneapi::data_t<T>;
+    const af::dim4 lStrides    = lhs.strides();
+    const af::dim4 xStrides    = x.strides();
+    const af::dim4 oStrides    = out.strides();
+    sycl::buffer<Dt, 1> lhsBuf = lhs.template getBufferWithOffset<Dt>();
+    sycl::buffer<Dt, 1> xBuf   = x.template getBufferWithOffset<Dt>();
+    sycl::buffer<Dt, 1> outBuf = out.template getBufferWithOffset<Dt>();
+    if constexpr (!std::is_same_v<T, arrayfire::common::half>) {
+        ::oneapi::mkl::blas::gemv(queue, lOpts, (int64_t)M, (int64_t)N,
+                                  (T)*alpha, lhsBuf, (int64_t)lStride, xBuf,
+                                  (int64_t)incx, (T)*beta, outBuf,
+                                  (int64_t)oInc);
+    }
+}
+
+template<typename T>
+static void gemmDispatch(sycl::queue queue, oneapi::mkl::transpose lOpts,
+                         oneapi::mkl::transpose rOpts, int M, int N, int K,
+                         const T *alpha, const arrayfire::oneapi::Array<T> &lhs,
+                         dim_t lStride, const arrayfire::oneapi::Array<T> &rhs,
+                         dim_t rStride, const T *beta,
+                         arrayfire::oneapi::Array<T> &out, dim_t oleading) {
+    using Dt                = arrayfire::oneapi::data_t<T>;
+    const af::dim4 lStrides = lhs.strides();
+
+    const af::dim4 rStrides    = rhs.strides();
+    const af::dim4 oStrides    = out.strides();
+    sycl::buffer<Dt, 1> lhsBuf = lhs.template getBufferWithOffset<Dt>();
+    sycl::buffer<Dt, 1> rhsBuf = rhs.template getBufferWithOffset<Dt>();
+    sycl::buffer<Dt, 1> outBuf = out.template getBufferWithOffset<Dt>();
+    ::oneapi::mkl::blas::gemm(queue, lOpts, rOpts, M, N, K, *alpha, lhsBuf,
+                              lStride, rhsBuf, rStride, *beta, outBuf,
+                              oleading);
+}
+
+namespace arrayfire {
+namespace oneapi {
+
+void initBlas() { /*gpu_blas_init();*/
+}
+
+void deInitBlas() { /*gpu_blas_deinit();*/
+}
+
+bool isStrideMonotonic(const af::dim4 &dim) {
+    return (dim[0] <= dim[1]) && (dim[1] <= dim[2]) && (dim[2] <= dim[3]);
+}
+
+template<typename Ti, typename To>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta) {
+    const auto lOpts = toBlasTranspose(optLhs);
+    const auto rOpts = toBlasTranspose(optRhs);
+
+    const auto aRowDim = (optLhs == AF_MAT_NONE) ? 0 : 1;
+    const auto aColDim = (optLhs == AF_MAT_NONE) ? 1 : 0;
+    const auto bColDim = (optRhs == AF_MAT_NONE) ? 1 : 0;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+    const int M       = lDims[aRowDim];
+    const int N       = rDims[bColDim];
+    const int K       = lDims[aColDim];
+    const dim4 oDims  = out.dims();
+
+    const dim4 &lStrides = lhs.strides();
+    const dim4 &rStrides = rhs.strides();
+    const dim4 oStrides  = out.strides();
+
+    if (oDims.ndims() <= 2) {  // if non-batched
+        if (rhs.dims()[bColDim] == 1) {
+            if constexpr (std::is_same_v<Ti, arrayfire::common::half>) {
+                // currently no half support for gemv, use gemm instead
+                gemmDispatch<Ti>(getQueue(), lOpts, rOpts, M, N, K, alpha, lhs,
+                                 lStrides[1], rhs, rStrides[1], beta, out,
+                                 oStrides[1]);
+            } else {
+                dim_t incr =
+                    (optRhs == AF_MAT_NONE) ? rStrides[0] : rStrides[1];
+                gemvDispatch<Ti>(getQueue(), lOpts, rOpts, lDims[0], lDims[1],
+                                 alpha, lhs, lStrides[1], rhs, incr, beta, out,
+                                 oStrides[0]);
+            }
+        } else {
+            gemmDispatch<Ti>(getQueue(), lOpts, rOpts, M, N, K, alpha, lhs,
+                             lStrides[1], rhs, rStrides[1], beta, out,
+                             oStrides[1]);
+        }
+    } else {  // if batched
+        using Dt = arrayfire::oneapi::data_t<Ti>;
+
+        int64_t batchSize = static_cast<int64_t>(oDims[2] * oDims[3]);
+
+        bool is_l_d2_batched = (oDims[2] == lDims[2]) && lDims[2] != 1;
+        bool is_l_d3_batched = (oDims[3] == lDims[3]) && lDims[3] != 1;
+        bool is_r_d2_batched = (oDims[2] == rDims[2]) && rDims[2] != 1;
+        bool is_r_d3_batched = (oDims[3] == rDims[3]) && rDims[3] != 1;
+
+        // MKL requires stridec >= ldc * n, which may not hold for reordered
+        // outputs. If the output strides are monotonic, MKL's requirements
+        // for batching can be met.
+        bool canBatchMKL = isStrideMonotonic(oStrides);
+        if (canBatchMKL) {
+            sycl::buffer<Dt, 1> lhsBuf = lhs.template getBufferWithOffset<Dt>();
+            sycl::buffer<Dt, 1> rhsBuf = rhs.template getBufferWithOffset<Dt>();
+            sycl::buffer<Dt, 1> outBuf = out.template getBufferWithOffset<Dt>();
+
+            const int64_t lda = lStrides[1];
+            const int64_t ldb = rStrides[1];
+            const int64_t ldc = oStrides[1];
+
+            dim_t lstride = (is_l_d2_batched) ? lStrides[2]
+                            : is_l_d3_batched ? lStrides[3]
+                                              : 0;
+            dim_t rstride = (is_r_d2_batched) ? rStrides[2]
+                            : is_r_d3_batched ? rStrides[3]
+                                              : 0;
+
+            ::oneapi::mkl::blas::gemm_batch(getQueue(), lOpts, rOpts, M, N, K,
+                                            *alpha, lhsBuf, lda, lstride,
+                                            rhsBuf, ldb, rstride, *beta, outBuf,
+                                            ldc, oStrides[2], batchSize);
+        } else {
+            std::vector<sycl::buffer<Dt>> lptrs;
+            std::vector<sycl::buffer<Dt>> rptrs;
+            std::vector<sycl::buffer<Dt>> optrs;
+
+            lptrs.reserve(batchSize);
+            rptrs.reserve(batchSize);
+            optrs.reserve(batchSize);
+
+            for (int n = 0; n < batchSize; n++) {
+                ptrdiff_t w = n / oDims[2];
+                ptrdiff_t z = n - w * oDims[2];
+
+                ptrdiff_t loff = z * (is_l_d2_batched * lStrides[2]) +
+                                 w * (is_l_d3_batched * lStrides[3]);
+                ptrdiff_t roff = z * (is_r_d2_batched * rStrides[2]) +
+                                 w * (is_r_d3_batched * rStrides[3]);
+                ptrdiff_t zoff = z * oStrides[2] + w * oStrides[3];
+
+                lptrs.emplace_back(lhs.template getBufferWithOffset<Dt>(loff));
+                rptrs.emplace_back(rhs.template getBufferWithOffset<Dt>(roff));
+                optrs.emplace_back(out.template getBufferWithOffset<Dt>(zoff));
+            }
+
+            for (int n = 0; n < batchSize; n++) {
+                ::oneapi::mkl::blas::gemm(getQueue(), lOpts, rOpts, M, N, K,
+                                          *alpha, lptrs[n], lStrides[1],
+                                          rptrs[n], rStrides[1], *beta,
+                                          optrs[n], oStrides[1]);
+            }
+        }
+    }
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<>
+void gemm<schar, float>(Array<float> &out, af_mat_prop optLhs,
+                        af_mat_prop optRhs, const float *alpha,
+                        const Array<schar> &lhs, const Array<schar> &rhs,
+                        const float *beta) {
+    TYPE_ERROR(3, af_dtype::s8);
+}
+
+template<typename T>
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs) {
+    auto lhs_ = (optLhs == AF_MAT_NONE ? lhs : conj<T>(lhs));
+    auto rhs_ = (optRhs == AF_MAT_NONE ? rhs : conj<T>(rhs));
+    auto temp = arithOp<T, af_mul_t>(lhs_, rhs_, lhs_.dims());
+    return reduce<af_add_t, T, T>(temp, 0, false, 0);
+}
+
+#define INSTANTIATE_GEMM(TYPE)                                               \
+    template void gemm<TYPE>(Array<TYPE> & out, af_mat_prop optLhs,          \
+                             af_mat_prop optRhs, const TYPE *alpha,          \
+                             const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
+                             const TYPE *beta);
+
+INSTANTIATE_GEMM(float)
+INSTANTIATE_GEMM(cfloat)
+INSTANTIATE_GEMM(double)
+INSTANTIATE_GEMM(cdouble)
+INSTANTIATE_GEMM(half)
+
+#define INSTANTIATE_DOT(TYPE)                                                  \
+    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs,                     \
+                                   const Array<TYPE> &rhs, af_mat_prop optLhs, \
+                                   af_mat_prop optRhs);
+
+INSTANTIATE_DOT(float)
+INSTANTIATE_DOT(double)
+INSTANTIATE_DOT(cfloat)
+INSTANTIATE_DOT(cdouble)
+INSTANTIATE_DOT(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/blas.hpp b/src/backend/oneapi/blas.hpp
new file mode 100644
index 0000000000..af65f56d12
--- /dev/null
+++ b/src/backend/oneapi/blas.hpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <math.hpp>
+
+// This file contains the common interface for OneAPI BLAS
+// functions
+
+namespace arrayfire {
+namespace oneapi {
+
+void initBlas();
+void deInitBlas();
+
+template<typename Ti, typename To = Ti>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta);
+
+template<typename T>
+Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+                af_mat_prop optRhs) {
+    int Mdim     = optLhs == AF_MAT_NONE ? 0 : 1;
+    int Ndim     = optRhs == AF_MAT_NONE ? 1 : 0;
+    Array<T> res = createEmptyArray<T>(
+        dim4(lhs.dims()[Mdim], rhs.dims()[Ndim], lhs.dims()[2], lhs.dims()[3]));
+    static const T alpha = scalar<T>(1.0);
+    static const T beta  = scalar<T>(0.0);
+    gemm(res, optLhs, optRhs, &alpha, lhs, rhs, &beta);
+    return res;
+}
+
+template<typename T>
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/canny.cpp b/src/backend/oneapi/canny.cpp
new file mode 100644
index 0000000000..4e9e7fceb2
--- /dev/null
+++ b/src/backend/oneapi/canny.cpp
@@ -0,0 +1,30 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <canny.hpp>
+#include <err_oneapi.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/canny.hpp b/src/backend/oneapi/canny.hpp
new file mode 100644
index 0000000000..c9bbe36edd
--- /dev/null
+++ b/src/backend/oneapi/canny.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy);
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/cast.hpp b/src/backend/oneapi/cast.hpp
new file mode 100644
index 0000000000..11b64c9631
--- /dev/null
+++ b/src/backend/oneapi/cast.hpp
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <common/jit/UnaryNode.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename To, typename Ti>
+struct CastOp {
+    const char *name() { return ""; }
+};
+
+#define CAST_FN(TYPE)                                   \
+    template<typename Ti>                               \
+    struct CastOp<TYPE, Ti> {                           \
+        const char *name() { return "convert_" #TYPE; } \
+    };
+
+CAST_FN(int)
+CAST_FN(uint)
+CAST_FN(uchar)
+CAST_FN(float)
+CAST_FN(double)
+
+template<typename Ti>
+struct CastOp<schar, Ti> {
+    const char *name() { return "convert_char"; }
+};
+
+#define CAST_CFN(TYPE)                                    \
+    template<typename Ti>                                 \
+    struct CastOp<TYPE, Ti> {                             \
+        const char *name() { return "__convert_" #TYPE; } \
+    };
+
+CAST_CFN(cfloat)
+CAST_CFN(cdouble)
+CAST_CFN(char)
+
+template<>
+struct CastOp<cfloat, cdouble> {
+    const char *name() { return "__convert_z2c"; }
+};
+
+template<>
+struct CastOp<cdouble, cfloat> {
+    const char *name() { return "__convert_c2z"; }
+};
+
+template<>
+struct CastOp<cfloat, cfloat> {
+    const char *name() { return "__convert_c2c"; }
+};
+
+template<>
+struct CastOp<cdouble, cdouble> {
+    const char *name() { return "__convert_z2z"; }
+};
+
+#undef CAST_FN
+#undef CAST_CFN
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/cholesky.cpp b/src/backend/oneapi/cholesky.cpp
new file mode 100644
index 0000000000..d399034383
--- /dev/null
+++ b/src/backend/oneapi/cholesky.cpp
@@ -0,0 +1,111 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <blas.hpp>
+#include <cholesky.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <platform.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <memory.hpp>
+#include <oneapi/mkl/lapack.hpp>
+#include <triangle.hpp>
+#include <algorithm>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
+    dim4 iDims    = in.dims();
+    dim4 iStrides = in.strides();
+    int64_t N     = iDims[0];
+    int64_t LDA   = iStrides[1];
+
+    int64_t lwork = 0;
+
+    ::oneapi::mkl::uplo uplo = ::oneapi::mkl::uplo::lower;
+    if (is_upper) { uplo = ::oneapi::mkl::uplo::upper; }
+
+    lwork = ::oneapi::mkl::lapack::potrf_scratchpad_size<compute_t<T>>(
+        getQueue(), uplo, N, LDA);
+
+    auto workspace = memAlloc<compute_t<T>>(std::max<int64_t>(lwork, 1));
+    sycl::buffer<compute_t<T>> in_buffer =
+        in.template getBufferWithOffset<compute_t<T>>();
+
+    try {
+        ::oneapi::mkl::lapack::potrf(getQueue(), uplo, N, in_buffer, LDA,
+                                     *workspace, workspace->size());
+    } catch (::oneapi::mkl::lapack::exception const &e) {
+        AF_ERROR(
+            "Unexpected exception caught during synchronous "
+            "call to LAPACK API",
+            AF_ERR_RUNTIME);
+        return e.info();
+    }
+
+    return 0;
+}
+
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
+    Array<T> out = copyArray<T>(in);
+    *info        = cholesky_inplace(out, is_upper);
+
+    triangle<T>(out, out, is_upper, false);
+
+    return out;
+}
+
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
+
+INSTANTIATE_CH(float)
+INSTANTIATE_CH(cfloat)
+INSTANTIATE_CH(double)
+INSTANTIATE_CH(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI backend",
+             AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI backend",
+             AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
+
+INSTANTIATE_CH(float)
+INSTANTIATE_CH(cfloat)
+INSTANTIATE_CH(double)
+INSTANTIATE_CH(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/oneapi/cholesky.hpp b/src/backend/oneapi/cholesky.hpp
new file mode 100644
index 0000000000..ab2bef5cc8
--- /dev/null
+++ b/src/backend/oneapi/cholesky.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/compile_module.cpp b/src/backend/oneapi/compile_module.cpp
new file mode 100644
index 0000000000..016b2d7dcf
--- /dev/null
+++ b/src/backend/oneapi/compile_module.cpp
@@ -0,0 +1,116 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/compile_module.hpp>  //compileModule & loadModuleFromDisk
+#include <common/kernel_cache.hpp>    //getKernel(Module&, ...)
+
+#include <common/Logger.hpp>
+#include <common/defines.hpp>
+#include <common/util.hpp>
+#include <err_oneapi.hpp>
+#include <platform.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <cctype>
+#include <cstdio>
+#include <fstream>
+#include <sstream>
+#include <string>
+#include <vector>
+
+using arrayfire::common::loggerFactory;
+using arrayfire::oneapi::Kernel;
+using arrayfire::oneapi::Module;
+using fmt::format;
+// using arrayfire::oneapi::getActiveDeviceId;
+// using arrayfire::oneapi::getDevice;
+using spdlog::logger;
+using sycl::bundle_state;
+using sycl::kernel_bundle;
+
+using std::begin;
+using std::end;
+using std::ofstream;
+using std::ostringstream;
+using std::shared_ptr;
+using std::string;
+using std::to_string;
+using std::transform;
+using std::vector;
+using std::chrono::duration_cast;
+using std::chrono::high_resolution_clock;
+using std::chrono::milliseconds;
+
+logger *getLogger() {
+    static shared_ptr<logger> logger(loggerFactory("jit"));
+    return logger.get();
+}
+
+string getProgramBuildLog(const kernel_bundle<bundle_state::executable> &prog) {
+    ONEAPI_NOT_SUPPORTED("");
+    return "";
+}
+
+//#define THROW_BUILD_LOG_EXCEPTION(PROG)                              \
+//    do {                                                             \
+//        string build_error = getProgramBuildLog(PROG);               \
+//        string info        = getEnvVar("AF_OPENCL_SHOW_BUILD_INFO"); \
+//        if (!info.empty() && info != "0") puts(build_error.c_str()); \
+//        AF_ERROR(build_error, AF_ERR_INTERNAL);                      \
+//    } while (0)
+
+namespace arrayfire {
+namespace oneapi {
+
+/*
+get_kernel_bundle<>() needs sycl::context
+kernel_bundle<bundle_state::executable> buildProgram(const vector<string>
+&kernelSources, const vector<string> &compileOpts) { ONEAPI_NOT_SUPPORTED("");
+    kernel_bundle<bundle_state::executable> bb;
+    return bb;
+}
+*/
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+string getKernelCacheFilename(const int device, const string &key) {
+    ONEAPI_NOT_SUPPORTED("");
+    return "";
+}
+
+namespace common {
+
+/*
+Module compileModule(const string &moduleKey, const vector<string> &sources,
+                     const vector<string> &options,
+                     const vector<string> &kInstances, const bool isJIT) {
+    ONEAPI_NOT_SUPPORTED("");
+    Module m{};
+    return m;
+}
+
+Module loadModuleFromDisk(const int device, const string &moduleKey,
+                          const bool isJIT) {
+    ONEAPI_NOT_SUPPORTED("");
+    Module m{};
+    return m;
+}
+
+Kernel getKernel(const Module &mod, const string &nameExpr,
+                 const bool sourceWasJIT) {
+    ONEAPI_NOT_SUPPORTED("");
+    return {nameExpr, &mod.get(), sycl::Kernel()};
+}
+*/
+
+}  // namespace common
diff --git a/src/backend/oneapi/complex.hpp b/src/backend/oneapi/complex.hpp
new file mode 100644
index 0000000000..c480fa6474
--- /dev/null
+++ b/src/backend/oneapi/complex.hpp
@@ -0,0 +1,92 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <common/jit/UnaryNode.hpp>
+#include <optypes.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename To, typename Ti>
+Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<To, Ti, af_cplx2_t>(lhs, rhs, odims);
+}
+
+template<typename To, typename Ti>
+Array<To> real(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__creal", in_node, af_real_t);
+
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
+
+template<typename To, typename Ti>
+Array<To> imag(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__cimag", in_node, af_imag_t);
+
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
+
+template<typename T>
+static const char *abs_name() {
+    return "fabs";
+}
+template<>
+inline const char *abs_name<cfloat>() {
+    return "__cabsf";
+}
+template<>
+inline const char *abs_name<cdouble>() {
+    return "__cabs";
+}
+
+template<typename To, typename Ti>
+Array<To> abs(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              abs_name<Ti>(), in_node, af_abs_t);
+
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
+
+template<typename T>
+static const char *conj_name() {
+    return "__noop";
+}
+template<>
+inline const char *conj_name<cfloat>() {
+    return "__cconjf";
+}
+template<>
+inline const char *conj_name<cdouble>() {
+    return "__cconj";
+}
+
+template<typename T>
+Array<T> conj(const Array<T> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<T>::af_type),
+                              conj_name<T>(), in_node, af_conj_t);
+
+    return createNodeArray<T>(in.dims(), common::Node_ptr(node));
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/convolve.cpp b/src/backend/oneapi/convolve.cpp
new file mode 100644
index 0000000000..0e443d7b77
--- /dev/null
+++ b/src/backend/oneapi/convolve.cpp
@@ -0,0 +1,253 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <blas.hpp>
+#include <common/half.hpp>
+#include <common/indexing_helpers.hpp>
+#include <common/moddims.hpp>
+#include <convolve.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/convolve.hpp>
+#include <reorder.hpp>
+#include <transpose.hpp>
+#include <unwrap.hpp>
+#include <wrap.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::flip;
+using arrayfire::common::half;
+using arrayfire::common::modDims;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand) {
+    const dim4 &sDims = signal.dims();
+    const dim4 &fDims = filter.dims();
+
+    dim4 oDims(1);
+    if (expand) {
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
+            } else {
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
+            }
+        }
+    } else {
+        oDims = sDims;
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
+        }
+    }
+
+    Array<T> out    = createEmptyArray<T>(oDims);
+    bool callKernel = true;
+
+    dim_t MCFL2 = kernel::MAX_CONV2_FILTER_LEN;
+    dim_t MCFL3 = kernel::MAX_CONV3_FILTER_LEN;
+    switch (rank) {
+        case 1:
+            if (fDims[0] > kernel::MAX_CONV1_FILTER_LEN) { callKernel = false; }
+            break;
+        case 2:
+            if ((fDims[0] * fDims[1]) > (MCFL2 * MCFL2)) { callKernel = false; }
+            break;
+        case 3:
+            if ((fDims[0] * fDims[1] * fDims[2]) > (MCFL3 * MCFL3 * MCFL3)) {
+                callKernel = false;
+            }
+            break;
+        default: AF_ERROR("rank only supports values 1-3.", AF_ERR_UNKNOWN);
+    }
+
+    if (!callKernel) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nOneAPI N Dimensional Convolution doesn't support "
+                 "%lldx%lldx%lld kernel\n",
+                 fDims[0], fDims[1], fDims[2]);
+        ONEAPI_NOT_SUPPORTED(errMessage);
+    }
+
+    kernel::convolve_nd<T, accT>(out, signal, filter, kind, rank, expand);
+
+    return out;
+}
+
+#define INSTANTIATE(T, accT)                                                   \
+    template Array<T> convolve<T, accT>(Array<T> const &, Array<accT> const &, \
+                                        AF_BATCH_KIND, const int, const bool);
+
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> convolve2_unwrap(const Array<T> &signal, const Array<T> &filter,
+                          const dim4 &stride, const dim4 &padding,
+                          const dim4 &dilation) {
+    dim4 sDims = signal.dims();
+    dim4 fDims = filter.dims();
+
+    dim_t outputWidth =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t outputHeight =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(signal, fDims[0], fDims[1], stride[0], stride[1], padding[0],
+               padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsedFilter = filter;
+
+    collapsedFilter = flip(collapsedFilter, {1, 1, 0, 0});
+    collapsedFilter = modDims(collapsedFilter,
+                              dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsedFilter, AF_MAT_TRANS, AF_MAT_NONE);
+    res = modDims(res, dim4(outputWidth, outputHeight, signal.dims()[3],
+                            collapsedFilter.dims()[1]));
+    Array<T> out = reorder(res, dim4(0, 1, 3, 2));
+
+    return out;
+}
+
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation) {
+    Array<T> out =
+        convolve2_unwrap<T>(signal, filter, stride, padding, dilation);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> convolve2<T>(Array<T> const &signal,                    \
+                                   Array<T> const &filter, const dim4 stride, \
+                                   const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> & /*convolved_output*/,
+                           af::dim4 stride, af::dim4 padding,
+                           af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &sDims = original_signal.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    Array<T> collapsed_filter = original_filter;
+
+    collapsed_filter = flip(collapsed_filter, {1, 1, 0, 0});
+    collapsed_filter = modDims(collapsed_filter,
+                               dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(collapsed_gradient, collapsed_filter, AF_MAT_NONE, AF_MAT_TRANS);
+    res = modDims(res, dim4(res.dims()[0] / sDims[3], sDims[3],
+                            fDims[0] * fDims[1], sDims[2]));
+    res = reorder(res, dim4(0, 2, 3, 1));
+
+    const bool retCols = false;
+    res = wrap_dilated(res, sDims[0], sDims[1], fDims[0], fDims[1], stride[0],
+                       stride[1], padding[0], padding[1], dilation[0],
+                       dilation[1], retCols);
+
+    return res;
+}
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> & /*convolved_output*/,
+                             af::dim4 stride, af::dim4 padding,
+                             af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(original_signal, fDims[0], fDims[1], stride[0], stride[1],
+               padding[0], padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsed_gradient, AF_MAT_NONE, AF_MAT_NONE);
+    res = modDims(res, dim4(fDims[0], fDims[1], fDims[2], fDims[3]));
+
+    auto out = flip(res, {1, 1, 0, 0});
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> conv2DataGradient<T>(                                 \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);        \
+    template Array<T> conv2FilterGradient<T>(                               \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/convolve.hpp b/src/backend/oneapi/convolve.hpp
new file mode 100644
index 0000000000..6551416170
--- /dev/null
+++ b/src/backend/oneapi/convolve.hpp
@@ -0,0 +1,41 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand);
+
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand);
+
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation);
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> &convolved_output, af::dim4 stride,
+                           af::dim4 padding, af::dim4 dilation);
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/convolve_separable.cpp b/src/backend/oneapi/convolve_separable.cpp
new file mode 100644
index 0000000000..ddf5c27a7e
--- /dev/null
+++ b/src/backend/oneapi/convolve_separable.cpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <convolve.hpp>
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/convolve_separable.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter,
+                   Array<accT> const& r_filter, const bool expand) {
+    const auto cflen = c_filter.elements();
+    const auto rflen = r_filter.elements();
+
+    if ((cflen > kernel::MAX_SCONV_FILTER_LEN) ||
+        (rflen > kernel::MAX_SCONV_FILTER_LEN)) {
+        // TODO call upon fft
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\noneAPI Separable convolution doesn't support %llu(column) "
+                 "%llu(row) filters\n",
+                 cflen, rflen);
+        ONEAPI_NOT_SUPPORTED(errMessage);
+    }
+
+    const dim4& sDims = signal.dims();
+    dim4 tDims        = sDims;
+    dim4 oDims        = sDims;
+
+    if (expand) {
+        tDims[0] += cflen - 1;
+        oDims[0] += cflen - 1;
+        oDims[1] += rflen - 1;
+    }
+
+    Array<T> temp = createEmptyArray<T>(tDims);
+    Array<T> out  = createEmptyArray<T>(oDims);
+
+    kernel::convSep<T, accT>(temp, signal, c_filter, 0, expand);
+    kernel::convSep<T, accT>(out, temp, r_filter, 1, expand);
+
+    return out;
+}
+
+#define INSTANTIATE(T, accT)                                                  \
+    template Array<T> convolve2<T, accT>(Array<T> const&, Array<accT> const&, \
+                                         Array<accT> const&, const bool);
+
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(intl, float)
+INSTANTIATE(uintl, float)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/copy.cpp b/src/backend/oneapi/copy.cpp
new file mode 100644
index 0000000000..a89023261e
--- /dev/null
+++ b/src/backend/oneapi/copy.cpp
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <copy.hpp>
+
+#include <Array.hpp>
+#include <common/complex.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/memcopy.hpp>
+#include <math.hpp>
+
+using arrayfire::common::half;
+using arrayfire::common::is_complex;
+
+using sycl::access_mode;
+using sycl::accessor;
+using sycl::buffer;
+using sycl::id;
+using sycl::range;
+using sycl::target;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copyData(T *data, const Array<T> &src) {
+    if (src.elements() > 0) {
+        Array<T> lin = src.isReady() && src.isLinear() ? src : copyArray(src);
+        size_t elements = lin.elements();
+        Param<T> p      = lin;
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                sycl::range rr(elements);
+                sycl::id offset_id(p.info.offset);
+                auto offset_acc =
+                    p.data->template get_access<sycl::access_mode::read_write>(
+                        h, rr, offset_id);
+                h.copy(offset_acc, data);
+            })
+            .wait();
+    }
+}
+
+template<typename T>
+Array<T> copyArray(const Array<T> &A) {
+    Array<T> out = createEmptyArray<T>(A.dims());
+    if (A.elements() == 0) { return out; }
+
+    dim_t offset = A.getOffset();
+    if (A.isReady()) {
+        if (A.isLinear()) {
+            // FIXME: Add checks
+
+            sycl::buffer<T> *A_buf   = A.get();
+            sycl::buffer<T> *out_buf = out.get();
+
+            size_t aelem = A.elements();
+            getQueue().submit([&](sycl::handler &h) {
+                range rr(aelem);
+                id offset_id(offset);
+                accessor offset_acc_A =
+                    A_buf->template get_access<access_mode::read>(h, rr,
+                                                                  offset_id);
+                accessor acc_out =
+                    out_buf->template get_access<access_mode::write>(h);
+
+                h.copy(offset_acc_A, acc_out);
+            });
+        } else {
+            kernel::memcopy<T>(out.get(), out.strides().get(), A.get(),
+                               A.dims().get(), A.strides().get(), offset,
+                               (uint)A.ndims());
+        }
+    } else {
+        Param info = {out.get(),
+                      {{A.dims().dims[0], A.dims().dims[1], A.dims().dims[2],
+                        A.dims().dims[3]},
+                       {out.strides().dims[0], out.strides().dims[1],
+                        out.strides().dims[2], out.strides().dims[3]},
+                       0}};
+        evalNodes(info, A.getNode().get());
+    }
+    return out;
+}
+
+template<typename T>
+void multiply_inplace(Array<T> &in, double val) {
+    kernel::copy<T, T>(in, in, in.ndims(), scalar<T>(0), val, true);
+}
+
+template<typename inType, typename outType>
+struct copyWrapper {
+    void operator()(Array<outType> &out, Array<inType> const &in) {
+        kernel::copy<inType, outType>(out, in, in.ndims(), scalar<outType>(0),
+                                      1, in.dims() == out.dims());
+    }
+};
+
+template<typename T>
+struct copyWrapper<T, T> {
+    void operator()(Array<T> &out, Array<T> const &in) {
+        if (out.isLinear() && in.isLinear() &&
+            out.elements() == in.elements()) {
+            dim_t in_offset  = in.getOffset();
+            dim_t out_offset = out.getOffset();
+
+            sycl::buffer<T> *in_buf  = in.get();
+            sycl::buffer<T> *out_buf = out.get();
+
+            getQueue()
+                .submit([&](sycl::handler &h) {
+                    sycl::range rr(in.elements());
+                    sycl::id in_offset_id(in_offset);
+                    sycl::id out_offset_id(out_offset);
+
+                    auto offset_acc_in =
+                        in_buf->template get_access<access_mode::read>(
+                            h, rr, in_offset_id);
+                    auto offset_acc_out =
+                        out_buf->template get_access<access_mode::write>(
+                            h, rr, out_offset_id);
+
+                    h.copy(offset_acc_in, offset_acc_out);
+                })
+                .wait();
+        } else {
+            kernel::copy<T, T>(out, in, in.ndims(), scalar<T>(0), 1,
+                               in.dims() == out.dims());
+        }
+    }
+};
+
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, Array<inType> const &in) {
+    static_assert(!(is_complex<inType>::value && !is_complex<outType>::value),
+                  "Cannot copy from complex value to a non complex value");
+    copyWrapper<inType, outType> copyFn;
+    copyFn(out, in);
+}
+
+#define INSTANTIATE(T)                                         \
+    template void copyData<T>(T * data, const Array<T> &from); \
+    template Array<T> copyArray<T>(const Array<T> &A);         \
+    template void multiply_inplace<T>(Array<T> & in, double norm);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COPY_ARRAY(SRC_T)                                 \
+    template void copyArray<SRC_T, float>(Array<float> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, double>(Array<double> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,     \
+                                            Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, int>(Array<int> & dst,             \
+                                        Array<SRC_T> const &src);     \
+    template void copyArray<SRC_T, uint>(Array<uint> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, intl>(Array<intl> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, uintl>(Array<uintl> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, short>(Array<short> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, ushort>(Array<ushort> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, schar>(Array<schar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, uchar>(Array<uchar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, char>(Array<char> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, half>(Array<half> & dst,           \
+                                         Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY(float)
+INSTANTIATE_COPY_ARRAY(double)
+INSTANTIATE_COPY_ARRAY(int)
+INSTANTIATE_COPY_ARRAY(uint)
+INSTANTIATE_COPY_ARRAY(intl)
+INSTANTIATE_COPY_ARRAY(uintl)
+INSTANTIATE_COPY_ARRAY(schar)
+INSTANTIATE_COPY_ARRAY(uchar)
+INSTANTIATE_COPY_ARRAY(char)
+INSTANTIATE_COPY_ARRAY(short)
+INSTANTIATE_COPY_ARRAY(ushort)
+INSTANTIATE_COPY_ARRAY(half)
+
+#define INSTANTIATE_COPY_ARRAY_COMPLEX(SRC_T)                        \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,      \
+                                           Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,    \
+                                            Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY_COMPLEX(cfloat)
+INSTANTIATE_COPY_ARRAY_COMPLEX(cdouble)
+
+template<typename T>
+T getScalar(const Array<T> &in) {
+    T retVal{};
+
+    auto in_get = in.get();
+    getQueue()
+        .submit([&](sycl::handler &h) {
+            auto acc_in =
+                in_get->template get_access<sycl::access::mode::read>(
+                    h, sycl::range{1},
+                    sycl::id{static_cast<uintl>(in.getOffset())});
+            h.copy(acc_in, &retVal);
+        })
+        .wait();
+
+    return retVal;
+}
+
+#define INSTANTIATE_GETSCALAR(T) template T getScalar(const Array<T> &in);
+
+INSTANTIATE_GETSCALAR(float)
+INSTANTIATE_GETSCALAR(double)
+INSTANTIATE_GETSCALAR(cfloat)
+INSTANTIATE_GETSCALAR(cdouble)
+INSTANTIATE_GETSCALAR(int)
+INSTANTIATE_GETSCALAR(uint)
+INSTANTIATE_GETSCALAR(schar)
+INSTANTIATE_GETSCALAR(uchar)
+INSTANTIATE_GETSCALAR(char)
+INSTANTIATE_GETSCALAR(intl)
+INSTANTIATE_GETSCALAR(uintl)
+INSTANTIATE_GETSCALAR(short)
+INSTANTIATE_GETSCALAR(ushort)
+INSTANTIATE_GETSCALAR(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/copy.hpp b/src/backend/oneapi/copy.hpp
new file mode 100644
index 0000000000..85b3b861ea
--- /dev/null
+++ b/src/backend/oneapi/copy.hpp
@@ -0,0 +1,69 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+#include <kernel/pad_array_borders.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void copyData(T *data, const Array<T> &A);
+
+template<typename T>
+Array<T> copyArray(const Array<T> &A);
+
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, const Array<inType> &in);
+
+// Resize Array to target dimensions and convert type
+//
+// Depending on the \p outDims, the output Array can be either truncated
+// or padded (towards end of respective dimensions).
+//
+// While resizing, if the output dimensions are larger than the input, the
+// elements beyond the input dimensions are set to \p defaultValue.
+//
+// \param[in] in is input Array
+// \param[in] outDims is the target output dimensions
+// \param[in] defaultValue is the value to which padded locations are set.
+// \param[in] scale is the value by which all output elements are scaled.
+//
+// \returns Array<outType>
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue = outType(0), double scale = 1.0);
+
+template<typename T>
+Array<T> padArrayBorders(Array<T> const &in, dim4 const &lowerBoundPadding,
+                         dim4 const &upperBoundPadding,
+                         const af::borderType btype) {
+    auto iDims = in.dims();
+
+    dim4 oDims(lowerBoundPadding[0] + iDims[0] + upperBoundPadding[0],
+               lowerBoundPadding[1] + iDims[1] + upperBoundPadding[1],
+               lowerBoundPadding[2] + iDims[2] + upperBoundPadding[2],
+               lowerBoundPadding[3] + iDims[3] + upperBoundPadding[3]);
+
+    if (oDims == iDims) { return in; }
+
+    auto ret = createEmptyArray<T>(oDims);
+
+    kernel::padBorders<T>(ret, in, lowerBoundPadding, btype);
+
+    return ret;
+}
+
+template<typename T>
+void multiply_inplace(Array<T> &in, double val);
+
+template<typename T>
+T getScalar(const Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/count.cpp b/src/backend/oneapi/count.cpp
new file mode 100644
index 0000000000..4ed59eb3b9
--- /dev/null
+++ b/src/backend/oneapi/count.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// count
+INSTANTIATE(af_notzero_t, float, uint)
+INSTANTIATE(af_notzero_t, double, uint)
+INSTANTIATE(af_notzero_t, cfloat, uint)
+INSTANTIATE(af_notzero_t, cdouble, uint)
+INSTANTIATE(af_notzero_t, int, uint)
+INSTANTIATE(af_notzero_t, uint, uint)
+INSTANTIATE(af_notzero_t, intl, uint)
+INSTANTIATE(af_notzero_t, uintl, uint)
+INSTANTIATE(af_notzero_t, char, uint)
+INSTANTIATE(af_notzero_t, schar, uint)
+INSTANTIATE(af_notzero_t, uchar, uint)
+INSTANTIATE(af_notzero_t, short, uint)
+INSTANTIATE(af_notzero_t, ushort, uint)
+INSTANTIATE(af_notzero_t, half, uint)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/debug_oneapi.hpp b/src/backend/oneapi/debug_oneapi.hpp
new file mode 100644
index 0000000000..ea7cf992ee
--- /dev/null
+++ b/src/backend/oneapi/debug_oneapi.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <platform.hpp>
+
+#ifndef NDEBUG
+
+#define ONEAPI_DEBUG_FINISH(Q) Q.wait_and_throw()
+
+#else
+
+#define ONEAPI_DEBUG_FINISH(Q)                                   \
+    do {                                                         \
+        if (oneapi::synchronize_calls()) { Q.wait_and_throw(); } \
+    } while (false)
+
+#endif
diff --git a/src/backend/oneapi/device_manager.cpp b/src/backend/oneapi/device_manager.cpp
new file mode 100644
index 0000000000..56125382a0
--- /dev/null
+++ b/src/backend/oneapi/device_manager.cpp
@@ -0,0 +1,303 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <device_manager.hpp>
+
+#include <GraphicsResourceManager.hpp>
+#include <build_version.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/defines.hpp>
+#include <common/graphics_common.hpp>
+#include <common/host_memory.hpp>
+#include <common/util.hpp>
+#include <err_oneapi.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <af/oneapi.h>
+#include <af/version.h>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <iterator>
+#include <sstream>
+#include <string>
+#include <vector>
+
+using arrayfire::common::ForgeManager;
+using arrayfire::common::getEnvVar;
+using std::begin;
+using std::end;
+using std::find;
+using std::make_unique;
+using std::move;
+using std::string;
+using std::stringstream;
+using std::unique_ptr;
+using std::vector;
+using sycl::device;
+using sycl::platform;
+
+using af::dtype_traits;
+
+namespace arrayfire {
+namespace oneapi {
+
+static inline bool compare_default(const unique_ptr<sycl::device>& ldev,
+                                   const unique_ptr<sycl::device>& rdev) {
+    using sycl::info::device_type;
+
+    auto ldt = ldev->get_info<sycl::info::device::device_type>();
+    auto rdt = rdev->get_info<sycl::info::device::device_type>();
+
+    if (ldt == rdt) {
+        auto l_mem = ldev->get_info<sycl::info::device::global_mem_size>();
+        auto r_mem = rdev->get_info<sycl::info::device::global_mem_size>();
+        return l_mem > r_mem;
+    } else {
+        if (ldt == device_type::gpu)
+            return true;
+        else if (rdt == device_type::gpu)
+            return false;
+        else if (ldt == device_type::cpu)
+            return true;
+        else if (rdt == device_type::cpu)
+            return false;
+    }
+    return false;
+}
+
+auto arrayfire_exception_handler(sycl::exception_list exceptions) {
+    for (std::exception_ptr const& e : exceptions) {
+        try {
+            std::rethrow_exception(e);
+        } catch (sycl::exception const& ex) {
+            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
+        }
+    }
+}
+
+DeviceManager::DeviceManager()
+    : logger(common::loggerFactory("platform"))
+    , mUserDeviceOffset(0)
+    , fgMngr(nullptr) {
+    vector<sycl::platform> platforms;
+    try {
+        platforms = sycl::platform::get_platforms();
+    } catch (sycl::exception& err) {
+        AF_ERROR(
+            "No sycl platforms found on this system. Ensure you have "
+            "installed the device driver as well as the runtime.",
+            AF_ERR_RUNTIME);
+    }
+
+    fgMngr = std::make_unique<ForgeManager>();
+
+    AF_TRACE("Found {} sycl platforms", platforms.size());
+    // Iterate through platforms, get all available devices and store them
+    for (auto& platform : platforms) {
+        vector<sycl::device> current_devices;
+        current_devices = platform.get_devices();
+        AF_TRACE("Found {} devices on platform {}", current_devices.size(),
+                 platform.get_info<sycl::info::platform::name>());
+
+        for (auto& dev : current_devices) {
+            mDevices.emplace_back(make_unique<sycl::device>(dev));
+            AF_TRACE("Found device {} on platform {}",
+                     dev.get_info<sycl::info::device::name>(),
+                     platform.get_info<sycl::info::platform::name>());
+        }
+    }
+
+    int nDevices = mDevices.size();
+    AF_TRACE("Found {} sycl devices", nDevices);
+
+    if (nDevices == 0) { AF_ERROR("No sycl devices found", AF_ERR_RUNTIME); }
+
+    // Sort sycl devices based on default criteria
+    stable_sort(mDevices.begin(), mDevices.end(), compare_default);
+
+    auto devices = move(mDevices);
+    mDevices.clear();
+
+    // Create contexts and queues once the sort is done
+    for (int i = 0; i < nDevices; i++) {
+        if (devices[i]->is_gpu() || devices[i]->is_cpu()) {
+            try {
+                mContexts.push_back(make_unique<sycl::context>(*devices[i]));
+                mQueues.push_back(
+                    make_unique<sycl::queue>(*mContexts.back(), *devices[i],
+                                             arrayfire_exception_handler));
+                mIsGLSharingOn.push_back(false);
+                // TODO:
+                // mDeviceTypes.push_back(getDeviceTypeEnum(*devices[i]));
+                // mPlatforms.push_back(getPlatformEnum(*devices[i]));
+                mDevices.emplace_back(std::move(devices[i]));
+
+                std::string options;
+#ifdef AF_WITH_FAST_MATH
+                options = fmt::format(" -cl-std=CL3.0 -D dim_t={} -cl-fast-relaxed-math",
+                                      dtype_traits<dim_t>::getName());
+#else
+                options = fmt::format(" -cl-std=CL3.0 -D dim_t={}",
+                                      dtype_traits<dim_t>::getName());
+#endif
+                mBaseOpenCLBuildFlags.push_back(options);
+                if (mDevices.back()->has(sycl::aspect::fp64)) {
+                    mBaseOpenCLBuildFlags.back() += " -DUSE_DOUBLE";
+                }
+                if (mDevices.back()->has(sycl::aspect::fp16)) {
+                    mBaseOpenCLBuildFlags.back() += " -D USE_HALF";
+                }
+            } catch (sycl::exception& err) {
+                AF_TRACE("Error creating context for device {} with error {}\n",
+                         devices[i]->get_info<sycl::info::device::name>(),
+                         err.what());
+            }
+        }
+    }
+    nDevices = mDevices.size();
+
+    bool default_device_set = false;
+    string deviceENV        = getEnvVar("AF_ONEAPI_DEFAULT_DEVICE");
+
+    if (!deviceENV.empty()) {
+        stringstream s(deviceENV);
+        int def_device = -1;
+        s >> def_device;
+        if (def_device >= static_cast<int>(mQueues.size()) ||
+            def_device >= static_cast<int>(DeviceManager::MAX_DEVICES)) {
+            AF_TRACE(
+                "AF_ONEAPI_DEFAULT_DEVICE ({}) "
+                "is out of range. Setting default device to 0",
+                def_device);
+            def_device = 0;
+        } else {
+            setActiveContext(def_device);
+            default_device_set = true;
+        }
+    }
+
+    deviceENV = getEnvVar("AF_ONEAPI_DEFAULT_DEVICE_TYPE");
+    if (!default_device_set && !deviceENV.empty()) {
+        sycl::info::device_type default_device_type =
+            sycl::info::device_type::gpu;
+        if (deviceENV == "CPU") {
+            default_device_type = sycl::info::device_type::cpu;
+        } else if (deviceENV == "ACC") {
+            default_device_type = sycl::info::device_type::accelerator;
+        }
+
+        // default_device_set is false here (checked above); reuse it.
+        for (int i = 0; i < nDevices; i++) {
+            if (mDevices[i]->get_info<sycl::info::device::device_type>() ==
+                default_device_type) {
+                default_device_set = true;
+                AF_TRACE("Setting to first available {}({})", deviceENV, i);
+                setActiveContext(i);
+                break;
+            }
+        }
+        if (!default_device_set) {
+            AF_TRACE(
+                "AF_ONEAPI_DEFAULT_DEVICE_TYPE={} "
+                "is not available. Using default device 0",
+                deviceENV);
+        }
+    }
+
+    // Define AF_DISABLE_GRAPHICS with any value to disable initialization
+    string noGraphicsENV = getEnvVar("AF_DISABLE_GRAPHICS");
+    if (fgMngr->plugin().isLoaded() && noGraphicsENV.empty()) {
+        // TODO: handle forge shared contexts
+    }
+
+    mUserDeviceOffset = mDevices.size();
+
+    // TODO: init other needed libraries?
+    // blas? program cache?
+    AF_TRACE("Default device: {}", getActiveDeviceId());
+}
+
+spdlog::logger* DeviceManager::getLogger() { return logger.get(); }
+
+DeviceManager& DeviceManager::getInstance() {
+    static auto* my_instance = new DeviceManager();
+    return *my_instance;
+}
+
+void DeviceManager::setMemoryManager(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a memory manager and the default memory
+    // manager still hasn't been initialized, so initialize it anyways so we
+    // don't inadvertently reset to it when we first call memoryManager()
+    memoryManager();
+    // Calls shutdown() on the existing memory manager.
+    if (memManager) { memManager->shutdownAllocator(); }
+    memManager = std::move(newMgr);
+    // Set the backend memory manager for this new manager to register native
+    // functions correctly.
+    std::unique_ptr<oneapi::Allocator> deviceMemoryManager(
+        new oneapi::Allocator());
+    memManager->setAllocator(std::move(deviceMemoryManager));
+    memManager->initialize();
+}
+
+void DeviceManager::resetMemoryManager() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_ONEAPI_MEM_DEBUG));
+    setMemoryManager(std::move(mgr));
+}
+
+void DeviceManager::setMemoryManagerPinned(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a pinned memory manager and the default
+    // memory manager still hasn't been initialized, so initialize it anyways so
+    // we don't inadvertently reset to it when we first call
+    // pinnedMemoryManager()
+    pinnedMemoryManager();
+    // Calls shutdown() on the existing memory manager.
+    if (pinnedMemManager) { pinnedMemManager->shutdownAllocator(); }
+    // Set the backend pinned memory manager for this new manager to register
+    // native functions correctly.
+    pinnedMemManager = std::move(newMgr);
+    std::unique_ptr<oneapi::AllocatorPinned> deviceMemoryManager(
+        new oneapi::AllocatorPinned());
+    pinnedMemManager->setAllocator(std::move(deviceMemoryManager));
+    pinnedMemManager->initialize();
+}
+
+void DeviceManager::resetMemoryManagerPinned() {
+    // Replace with default pinned memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_ONEAPI_MEM_DEBUG));
+    setMemoryManagerPinned(std::move(mgr));
+}
+
+DeviceManager::~DeviceManager() {
+    for (int i = 0; i < getDeviceCount(); ++i) { gfxManagers[i] = nullptr; }
+    memManager       = nullptr;
+    pinnedMemManager = nullptr;
+
+    // TODO: cleanup mQueues, mContexts, mDevices??
+}
+
+void DeviceManager::markDeviceForInterop(const int device,
+                                         const void* wHandle) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/device_manager.hpp b/src/backend/oneapi/device_manager.hpp
new file mode 100644
index 0000000000..28be51631b
--- /dev/null
+++ b/src/backend/oneapi/device_manager.hpp
@@ -0,0 +1,159 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <sycl/sycl.hpp>
+
+#include <memory>
+#include <mutex>
+#include <string>
+#include <vector>
+
+#ifndef AF_ONEAPI_MEM_DEBUG
+#define AF_ONEAPI_MEM_DEBUG 0
+#endif
+
+namespace spdlog {
+class logger;
+}
+
+namespace arrayfire {
+namespace common {
+class ForgeManager;
+class MemoryManagerBase;
+}  // namespace common
+}  // namespace arrayfire
+
+using arrayfire::common::MemoryManagerBase;
+
+namespace arrayfire {
+namespace oneapi {
+
+// oneapi namespace forward declarations
+class GraphicsResourceManager;
+struct kc_entry_t;  // kernel cache entry
+
+class DeviceManager {
+    friend MemoryManagerBase& memoryManager();
+
+    friend void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManager();
+
+    void resetMemoryManager();
+
+    friend MemoryManagerBase& pinnedMemoryManager();
+
+    friend void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+    friend void resetMemoryManagerPinned();
+
+    void resetMemoryManagerPinned();
+
+    friend arrayfire::common::ForgeManager& forgeManager();
+
+    friend GraphicsResourceManager& interopManager();
+
+    friend void addKernelToCache(int device, const std::string& key,
+                                 const kc_entry_t entry);
+
+    friend void removeKernelFromCache(int device, const std::string& key);
+
+    friend kc_entry_t kernelCache(int device, const std::string& key);
+
+    friend std::string getDeviceInfo() noexcept;
+
+    friend int getDeviceCount() noexcept;
+
+    // friend int getDeviceIdFromNativeId(cl_device_id id);
+
+    friend const sycl::context& getContext();
+
+    friend sycl::queue& getQueue();
+
+    friend sycl::queue* getQueueHandle(int device_id);
+
+    friend const sycl::device& getDevice(int id);
+
+    friend const std::string& getActiveDeviceBaseBuildFlags();
+
+    friend size_t getDeviceMemorySize(int device);
+
+    friend bool isGLSharingSupported();
+
+    friend bool isDoubleSupported(unsigned device);
+
+    friend bool isHalfSupported(unsigned device);
+
+    friend void devprop(char* d_name, char* d_platform, char* d_toolkit,
+                        char* d_compute);
+
+    friend int setDevice(int device);
+
+    friend void addDeviceContext(sycl::device& dev, sycl::context& ctx,
+                                 sycl::queue& que);
+
+    friend void setDeviceContext(sycl::device& dev, sycl::context& ctx);
+
+    friend void removeDeviceContext(sycl::device& dev, sycl::context& ctx);
+
+    friend int getActiveDeviceType();
+
+    friend int getActivePlatform();
+
+   public:
+    static const int MAX_DEVICES = 32;
+
+    static DeviceManager& getInstance();
+
+    ~DeviceManager();
+
+    spdlog::logger* getLogger();
+
+   protected:
+    DeviceManager();
+
+    // The following two declarations are required to
+    // avoid accidental copy/assignment of the
+    // instance returned by getInstance to other
+    // variables
+    DeviceManager(DeviceManager const&);
+    void operator=(DeviceManager const&);
+    void markDeviceForInterop(const int device, const void* wHandle);
+
+   private:
+    // Attributes
+    std::shared_ptr<spdlog::logger> logger;
+    std::mutex deviceMutex;
+    std::vector<std::unique_ptr<sycl::device>> mDevices;
+    std::vector<std::unique_ptr<sycl::context>> mContexts;
+    std::vector<std::unique_ptr<sycl::queue>> mQueues;
+    std::vector<bool> mIsGLSharingOn;
+    std::vector<std::string> mBaseOpenCLBuildFlags;
+    std::vector<int> mDeviceTypes;
+    std::vector<int> mPlatforms;
+    unsigned mUserDeviceOffset;
+
+    std::unique_ptr<arrayfire::common::ForgeManager> fgMngr;
+    std::unique_ptr<MemoryManagerBase> memManager;
+    std::unique_ptr<MemoryManagerBase> pinnedMemManager;
+    std::unique_ptr<GraphicsResourceManager> gfxManagers[MAX_DEVICES];
+    std::mutex mutex;
+
+    // using BoostProgCache = boost::shared_ptr<boost::compute::program_cache>;
+    // std::vector<BoostProgCache*> mBoostProgCacheVector;
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/diagonal.cpp b/src/backend/oneapi/diagonal.cpp
new file mode 100644
index 0000000000..900f53ba3c
--- /dev/null
+++ b/src/backend/oneapi/diagonal.cpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <diagonal.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/diagonal.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num) {
+    int size     = in.dims()[0] + std::abs(num);
+    int batch    = in.dims()[1];
+    Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
+
+    kernel::diagCreate<T>(out, in, num);
+
+    return out;
+}
+
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num) {
+    const dim_t *idims = in.dims().get();
+    dim_t size         = std::min(idims[0], idims[1]) - std::abs(num);
+    Array<T> out       = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
+
+    kernel::diagExtract<T>(out, in, num);
+
+    return out;
+}
+
+#define INSTANTIATE_DIAGONAL(T)                                          \
+    template Array<T> diagExtract<T>(const Array<T> &in, const int num); \
+    template Array<T> diagCreate<T>(const Array<T> &in, const int num);
+
+INSTANTIATE_DIAGONAL(float)
+INSTANTIATE_DIAGONAL(double)
+INSTANTIATE_DIAGONAL(cfloat)
+INSTANTIATE_DIAGONAL(cdouble)
+INSTANTIATE_DIAGONAL(int)
+INSTANTIATE_DIAGONAL(uint)
+INSTANTIATE_DIAGONAL(intl)
+INSTANTIATE_DIAGONAL(uintl)
+INSTANTIATE_DIAGONAL(char)
+INSTANTIATE_DIAGONAL(schar)
+INSTANTIATE_DIAGONAL(uchar)
+INSTANTIATE_DIAGONAL(short)
+INSTANTIATE_DIAGONAL(ushort)
+INSTANTIATE_DIAGONAL(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/diagonal.hpp b/src/backend/oneapi/diagonal.hpp
new file mode 100644
index 0000000000..1329cdd9d2
--- /dev/null
+++ b/src/backend/oneapi/diagonal.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num);
+
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/diff.cpp b/src/backend/oneapi/diff.cpp
new file mode 100644
index 0000000000..01cd18e37e
--- /dev/null
+++ b/src/backend/oneapi/diff.cpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <diff.hpp>
+#include <kernel/diff.hpp>
+#include <af/dim4.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> diff(const Array<T> &in, const int dim, const bool isDiff2) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims[dim] -= (isDiff2 + 1);
+
+    if (iDims.elements() == 0 || oDims.elements() == 0) {
+        throw std::runtime_error("Elements are 0");
+    }
+    Array<T> out = createEmptyArray<T>(oDims);
+    kernel::diff<T>(out, in, in.ndims(), dim, isDiff2);
+    return out;
+}
+
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, false);
+}
+
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, true);
+}
+
+#define INSTANTIATE(T)                                             \
+    template Array<T> diff1<T>(const Array<T> &in, const int dim); \
+    template Array<T> diff2<T>(const Array<T> &in, const int dim);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(char)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/diff.hpp b/src/backend/oneapi/diff.hpp
new file mode 100644
index 0000000000..9679f90c59
--- /dev/null
+++ b/src/backend/oneapi/diff.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim);
+
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/err_oneapi.hpp b/src/backend/oneapi/err_oneapi.hpp
new file mode 100644
index 0000000000..4f187b6273
--- /dev/null
+++ b/src/backend/oneapi/err_oneapi.hpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/err_common.hpp>
+
+#define ONEAPI_NOT_SUPPORTED(message)                                       \
+    do {                                                                    \
+        throw SupportError(__AF_FUNC__, __AF_FILENAME__, __LINE__, "oneAPI",\
+                           message, boost::stacktrace::stacktrace());       \
+    } while (0)
+
+#define CL_CHECK(call)                                                      \
+    do {                                                                    \
+        if (cl_int err = (call)) {                                          \
+            char cl_err_msg[2048];                                          \
+            const char* cl_err_call = #call;                                \
+            snprintf(cl_err_msg, sizeof(cl_err_msg),                        \
+                     "CL Error %s(%d): %d = %s\n", __FILE__, __LINE__, err, \
+                     cl_err_call);                                          \
+            AF_ERROR(cl_err_msg, AF_ERR_INTERNAL);                          \
+        }                                                                   \
+    } while (0)
+
+#define CL_CHECK_BUILD(call)                                                  \
+    do {                                                                      \
+        if (cl_int err = (call)) {                                            \
+            char log[8192];                                                   \
+            char cl_err_msg[8192];                                            \
+            const char* cl_err_call = #call;                                  \
+            size_t log_ret;                                                   \
+            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 8192, log, \
+                                  &log_ret);                                  \
+            snprintf(cl_err_msg, sizeof(cl_err_msg),                          \
+                     "OpenCL Error building %s(%d): %d = %s\nLog:\n%s",       \
+                     __FILE__, __LINE__, err, cl_err_call, log);              \
+            AF_ERROR(cl_err_msg, AF_ERR_INTERNAL);                            \
+        }                                                                     \
+    } while (0)
diff --git a/src/backend/oneapi/errorcodes.cpp b/src/backend/oneapi/errorcodes.cpp
new file mode 100644
index 0000000000..cf7152fa00
--- /dev/null
+++ b/src/backend/oneapi/errorcodes.cpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <errorcodes.hpp>
+
+std::string getErrorMessage(int error_code) {
+    ONEAPI_NOT_SUPPORTED("");
+    // return boost::compute::opencl_error::to_string(error_code);
+    return "";
+}
diff --git a/src/backend/oneapi/errorcodes.hpp b/src/backend/oneapi/errorcodes.hpp
new file mode 100644
index 0000000000..ff30326ae9
--- /dev/null
+++ b/src/backend/oneapi/errorcodes.hpp
@@ -0,0 +1,14 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <string>
+
+std::string getErrorMessage(int error_code);
diff --git a/src/backend/oneapi/exampleFunction.cpp b/src/backend/oneapi/exampleFunction.cpp
new file mode 100644
index 0000000000..9a006febff
--- /dev/null
+++ b/src/backend/oneapi/exampleFunction.cpp
@@ -0,0 +1,69 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>  // header with oneapi backend specific
+                      // Array class implementation that inherits
+                      // ArrayInfo base class
+
+#include <exampleFunction.hpp>  // oneapi backend function header
+
+#include <err_oneapi.hpp>  // error check functions and Macros
+                           // specific to oneapi backend
+
+// #include <kernel/exampleFunction.hpp>  // this header under the folder
+//  src/backend/oneapi/kernel
+//  defines the oneAPI kernel wrapper
+//  function to which the main computation of your
+//  algorithm should be relayed
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method) {
+    ONEAPI_NOT_SUPPORTED("");
+    dim4 outputDims;  // this should be '= in.dims();' in most cases
+                      // but would definitely depend on the type of
+                      // algorithm you are implementing.
+
+    Array<T> out = createEmptyArray<T>(outputDims);
+    // Please use the create***Array<T> helper
+    // functions defined in Array.hpp to create
+    // different types of Arrays. Check that
+    // file to see which types you
+    // can create.
+
+    // Relay the actual computation to the oneAPI kernel wrapper
+    // kernel::exampleFunc<T>(out, a, b, method);
+
+    return out;  // return the result
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> exampleFunction<T>(const Array<T> &a, const Array<T> &b, \
+                                         const af_someenum_t method);
+
+// INSTANTIATIONS for all the types which
+// are present in the switch case statement
+// in src/api/c/exampleFunction.cpp should be available
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/exampleFunction.hpp b/src/backend/oneapi/exampleFunction.hpp
new file mode 100644
index 0000000000..5e5978a057
--- /dev/null
+++ b/src/backend/oneapi/exampleFunction.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fast.cpp b/src/backend/oneapi/fast.cpp
new file mode 100644
index 0000000000..a5b0934f97
--- /dev/null
+++ b/src/backend/oneapi/fast.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
+              const Array<T> &in, const float thr, const unsigned arc_length,
+              const bool non_max, const float feature_ratio,
+              const unsigned edge) {
+    ONEAPI_NOT_SUPPORTED("");
+    return 0;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template unsigned fast<T>(                                                \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const float thr, const unsigned arc_length,       \
+        const bool non_max, const float feature_ratio, const unsigned edge);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fast.hpp b/src/backend/oneapi/fast.hpp
new file mode 100644
index 0000000000..4f9c7cf7f4
--- /dev/null
+++ b/src/backend/oneapi/fast.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
+              const Array<T> &in, const float thr, const unsigned arc_length,
+              const bool non_max, const float feature_ratio,
+              const unsigned edge);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fft.cpp b/src/backend/oneapi/fft.cpp
new file mode 100644
index 0000000000..03ae19efc6
--- /dev/null
+++ b/src/backend/oneapi/fft.cpp
@@ -0,0 +1,291 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <fft.hpp>
+
+#include <common/dispatch.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <onefft.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
+
+#include <oneapi/mkl/dfti.hpp>
+#include <oneapi/mkl/exceptions.hpp>
+
+#include <cstdint>
+#include <memory>
+
+using std::make_shared;
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+void setFFTPlanCacheSize(size_t numPlans) {}
+
+std::string genPlanHashStr(int rank, ::oneapi::mkl::dft::precision precision,
+                           ::oneapi::mkl::dft::domain domain,
+                           const bool isInPlace, const dim_t *n,
+                           std::int64_t *istrides, int ibatch,
+                           std::int64_t *ostrides, int obatch, int nbatch) {
+    // create the key string
+    char key_str_temp[64];
+    sprintf(key_str_temp, "%d:", rank);
+
+    std::string key_string(key_str_temp);
+
+    if (precision == ::oneapi::mkl::dft::precision::SINGLE) {
+        key_string.append("S:");
+    } else if (precision == ::oneapi::mkl::dft::precision::DOUBLE) {
+        key_string.append("D:");
+    }
+    if (domain == ::oneapi::mkl::dft::domain::REAL) {
+        key_string.append("R:");
+    } else if (domain == ::oneapi::mkl::dft::domain::COMPLEX) {
+        key_string.append("C:");
+    }
+    if (isInPlace) {
+        key_string.append("IIP:");
+    } else {
+        key_string.append("OOP:");
+    }
+
+    for (int r = 0; r < rank; ++r) {
+        sprintf(key_str_temp, "%lld:", n[r]);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (istrides != nullptr) {
+        for (int r = 0; r < rank + 1; ++r) {
+            sprintf(key_str_temp, "%lld:", static_cast<long long>(istrides[r]));
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, "%d:", ibatch);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (ostrides != nullptr) {
+        for (int r = 0; r < rank + 1; ++r) {
+            sprintf(key_str_temp, "%lld:", static_cast<long long>(ostrides[r]));
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, "%d:", obatch);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    sprintf(key_str_temp, "%d", nbatch);
+    key_string.append(std::string(key_str_temp));
+
+    return key_string;
+}
+
+std::vector<std::int64_t> computeStrides(const int rank, const dim4 istrides,
+                                         const dim_t offset) {
+    if (rank == 2) return {offset, istrides[1], istrides[0]};
+    if (rank == 3) return {offset, istrides[2], istrides[1], istrides[0]};
+    if (rank == 4)
+        return {offset, istrides[3], istrides[2], istrides[1], istrides[0]};
+    return {offset, istrides[0]};
+}
+
+template<::oneapi::mkl::dft::precision precision,
+         ::oneapi::mkl::dft::domain domain>
+PlanType findPlan(int rank, const bool isInPlace, const dim_t *idims,
+                  std::int64_t *istrides, int ibatch, std::int64_t *ostrides,
+                  int obatch, int nbatch) {
+    using desc_ty = ::oneapi::mkl::dft::descriptor<precision, domain>;
+
+    std::string key_string =
+        genPlanHashStr(rank, precision, domain, isInPlace, idims, istrides,
+                       ibatch, ostrides, obatch, nbatch);
+
+    PlanCache &planner               = arrayfire::oneapi::fftManager();
+    std::shared_ptr<PlanType> retVal = (planner.find(key_string));
+    if (retVal) { return *retVal; }
+
+    desc_ty *desc = [rank, &idims]() {
+        if (rank == 1) return new desc_ty(static_cast<int64_t>(idims[0]));
+        if (rank == 2) return new desc_ty({idims[1], idims[0]});
+        if (rank == 3) return new desc_ty({idims[2], idims[1], idims[0]});
+        return new desc_ty({idims[3], idims[2], idims[1], idims[0]});
+    }();
+
+    if (rank > 1) {
+        desc->set_value(::oneapi::mkl::dft::config_param::INPUT_STRIDES,
+                        istrides);
+        desc->set_value(::oneapi::mkl::dft::config_param::OUTPUT_STRIDES,
+                        ostrides);
+    }
+
+    if (isInPlace) {
+        desc->set_value(::oneapi::mkl::dft::config_param::PLACEMENT,
+                        DFTI_INPLACE);
+    } else {
+        desc->set_value(::oneapi::mkl::dft::config_param::PLACEMENT,
+                        DFTI_NOT_INPLACE);
+    }
+
+    desc->set_value(::oneapi::mkl::dft::config_param::NUMBER_OF_TRANSFORMS,
+                    (int64_t)nbatch);
+
+    desc->set_value(::oneapi::mkl::dft::config_param::FWD_DISTANCE, ibatch);
+    desc->set_value(::oneapi::mkl::dft::config_param::BWD_DISTANCE, obatch);
+
+    if constexpr (domain == ::oneapi::mkl::dft::domain::COMPLEX) {
+        desc->set_value(::oneapi::mkl::dft::config_param::COMPLEX_STORAGE,
+                        DFTI_COMPLEX_COMPLEX);
+    } else {
+        desc->set_value(
+            ::oneapi::mkl::dft::config_param::CONJUGATE_EVEN_STORAGE,
+            DFTI_COMPLEX_COMPLEX);
+        desc->set_value(::oneapi::mkl::dft::config_param::PACKED_FORMAT,
+                        DFTI_CCE_FORMAT);
+    }
+
+    try {
+        desc->commit(getQueue());
+    } catch (::oneapi::mkl::device_bad_alloc &e) {
+        // If plan creation fails, clean up the memory we hold on to and try
+        // again
+        arrayfire::oneapi::signalMemoryCleanup();
+        desc->commit(getQueue());
+    }
+
+    // push the plan into plan cache
+    std::shared_ptr<void> ptr(desc);
+    planner.push(key_string, make_shared<PlanType>(ptr));
+    return ptr;
+}
+
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction) {
+    const dim4 idims    = in.dims();
+    const dim4 istrides = in.strides();
+
+    constexpr bool is_single = std::is_same_v<T, cfloat>;
+    constexpr auto precision = (is_single)
+                                   ? ::oneapi::mkl::dft::precision::SINGLE
+                                   : ::oneapi::mkl::dft::precision::DOUBLE;
+    using desc_ty =
+        ::oneapi::mkl::dft::descriptor<precision,
+                                       ::oneapi::mkl::dft::domain::COMPLEX>;
+
+    // TODO[STF]: Array offsets are ignored here.
+    // Passing in.getOffset() as s0 throws Invalid Descriptor when targeting
+    // the GPU; on the CPU it does not throw, but the results are wrong.
+    // Strides not working? TODO: test with standalone oneMKL.
+    // Perhaps in.getDataDims() is needed instead of in.dims()?
+    std::vector<std::int64_t> fft_input_strides =
+        computeStrides(rank, istrides, 0);
+    // computeStrides(rank, istrides, in.getOffset());  // TODO[STF]: see above
+    int batch = 1;
+    for (int i = rank; i < 4; i++) { batch *= idims[i]; }
+
+    const bool isInPlace = true;
+    PlanType descP = findPlan<precision, ::oneapi::mkl::dft::domain::COMPLEX>(
+        rank, isInPlace, idims.get(), fft_input_strides.data(), istrides[rank],
+        fft_input_strides.data(), istrides[rank], batch);
+
+    desc_ty *desc = (desc_ty *)descP.get();
+
+    if (direction)
+        ::oneapi::mkl::dft::compute_forward(*desc, *in.get());
+    else
+        ::oneapi::mkl::dft::compute_backward(*desc, *in.get());
+}
+
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank) {
+    const dim4 idims    = in.dims();
+    const dim4 istrides = in.strides();
+    Array<Tc> out       = createEmptyArray<Tc>(
+        dim4({idims[0] / 2 + 1, idims[1], idims[2], idims[3]}));
+    const dim4 ostrides = out.strides();
+
+    constexpr bool is_single = std::is_same_v<Tr, float>;
+    constexpr auto precision = (is_single)
+                                   ? ::oneapi::mkl::dft::precision::SINGLE
+                                   : ::oneapi::mkl::dft::precision::DOUBLE;
+    using desc_ty =
+        ::oneapi::mkl::dft::descriptor<precision,
+                                       ::oneapi::mkl::dft::domain::REAL>;
+
+    std::vector<std::int64_t> fft_input_strides =
+        computeStrides(rank, istrides, in.getOffset());
+    std::vector<std::int64_t> fft_output_strides =
+        computeStrides(rank, ostrides, out.getOffset());
+
+    int batch = 1;
+    for (int i = rank; i < 4; i++) { batch *= idims[i]; }
+
+    const bool isInPlace = false;
+    PlanType descP = findPlan<precision, ::oneapi::mkl::dft::domain::REAL>(
+        rank, isInPlace, idims.get(), fft_input_strides.data(), istrides[rank],
+        fft_output_strides.data(), ostrides[rank], batch);
+
+    desc_ty *desc = (desc_ty *)descP.get();
+
+    ::oneapi::mkl::dft::compute_forward(*desc, *in.get(), *out.get());
+
+    return out;
+}
+
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank) {
+    const dim4 idims    = in.dims();
+    const dim4 istrides = in.strides();
+    Array<Tr> out       = createEmptyArray<Tr>(odims);
+    const dim4 ostrides = out.strides();
+
+    constexpr bool is_single = std::is_same_v<Tr, float>;
+    constexpr auto precision = (is_single)
+                                   ? ::oneapi::mkl::dft::precision::SINGLE
+                                   : ::oneapi::mkl::dft::precision::DOUBLE;
+    using desc_ty =
+        ::oneapi::mkl::dft::descriptor<precision,
+                                       ::oneapi::mkl::dft::domain::REAL>;
+
+    std::vector<std::int64_t> fft_input_strides =
+        computeStrides(rank, istrides, in.getOffset());
+    std::vector<std::int64_t> fft_output_strides =
+        computeStrides(rank, ostrides, out.getOffset());
+
+    int batch = 1;
+    for (int i = rank; i < 4; i++) { batch *= odims[i]; }
+
+    const bool isInPlace = false;
+    PlanType descP = findPlan<precision, ::oneapi::mkl::dft::domain::REAL>(
+        rank, isInPlace, odims.get(), fft_input_strides.data(), ostrides[rank],
+        fft_output_strides.data(), istrides[rank], batch);
+
+    desc_ty *desc = (desc_ty *)descP.get();
+
+    ::oneapi::mkl::dft::compute_backward(*desc, *in.get(), *out.get());
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template void fft_inplace<T>(Array<T> &, const int, const bool);
+
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+#define INSTANTIATE_REAL(Tr, Tc)                                        \
+    template Array<Tc> fft_r2c<Tc, Tr>(const Array<Tr> &, const int);   \
+    template Array<Tr> fft_c2r<Tr, Tc>(const Array<Tc> &, const dim4 &, \
+                                       const int);
+
+INSTANTIATE_REAL(float, cfloat)
+INSTANTIATE_REAL(double, cdouble)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fft.hpp b/src/backend/oneapi/fft.hpp
new file mode 100644
index 0000000000..ca82f06118
--- /dev/null
+++ b/src/backend/oneapi/fft.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+void setFFTPlanCacheSize(size_t numPlans);
+
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction);
+
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank);
+
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fftconvolve.cpp b/src/backend/oneapi/fftconvolve.cpp
new file mode 100644
index 0000000000..85718f4f4f
--- /dev/null
+++ b/src/backend/oneapi/fftconvolve.cpp
@@ -0,0 +1,160 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <fftconvolve.hpp>
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <err_oneapi.hpp>
+#include <fft.hpp>
+#include <af/dim4.hpp>
+
+#include <kernel/fftconvolve_common.hpp>
+#include <kernel/fftconvolve_multiply.hpp>
+#include <kernel/fftconvolve_pack.hpp>
+#include <kernel/fftconvolve_pad.hpp>
+#include <kernel/fftconvolve_reorder.hpp>
+
+#include <cmath>
+#include <type_traits>
+#include <vector>
+
+using af::dim4;
+using std::ceil;
+using std::conditional;
+using std::is_integral;
+using std::is_same;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+dim4 calcPackedSize(Array<T> const& i1, Array<T> const& i2, const dim_t rank) {
+    const dim4& i1d = i1.dims();
+    const dim4& i2d = i2.dims();
+
+    dim_t pd[4] = {1, 1, 1, 1};
+
+    // Pack both signal and filter into the same memory array; this ensures
+    // better use of batched FFT capabilities
+    pd[0] = nextpow2(static_cast<unsigned>(
+        static_cast<int>(ceil(i1d[0] / 2.f)) + i2d[0] - 1));
+
+    for (dim_t k = 1; k < rank; k++) {
+        pd[k] = nextpow2(static_cast<unsigned>(i1d[k] + i2d[k] - 1));
+    }
+
+    dim_t i1batch = 1;
+    dim_t i2batch = 1;
+    for (int k = rank; k < 4; k++) {
+        i1batch *= i1d[k];
+        i2batch *= i2d[k];
+    }
+    pd[rank] = (i1batch + i2batch);
+
+    return dim4(pd[0], pd[1], pd[2], pd[3]);
+}
+
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank) {
+    using convT = typename conditional<is_integral<T>::value ||
+                                           is_same<T, float>::value ||
+                                           is_same<T, cfloat>::value,
+                                       float, double>::type;
+    using cT    = typename conditional<is_same<convT, float>::value, cfloat,
+                                    cdouble>::type;
+
+    const dim4& sDims = signal.dims();
+    const dim4& fDims = filter.dims();
+
+    dim4 oDims(1);
+    if (expand) {
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
+            } else {
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
+            }
+        }
+    } else {
+        oDims = sDims;
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
+        }
+    }
+
+    const dim4 pDims = calcPackedSize<T>(signal, filter, rank);
+    Array<cT> packed = createEmptyArray<cT>(pDims);
+
+    kernel::packDataHelper<cT, T>(packed, signal, filter, rank, kind);
+    kernel::padDataHelper<cT, T>(packed, signal, filter, rank, kind);
+
+    fft_inplace<cT>(packed, rank, true);
+
+    kernel::complexMultiplyHelper<cT, T>(packed, signal, filter, rank, kind);
+
+    // Compute inverse FFT only on complex-multiplied data
+    if (kind == AF_BATCH_RHS) {
+        vector<af_seq> seqs;
+        for (int k = 0; k < AF_MAX_DIMS; k++) {
+            if (k < rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k] - 1), 1.});
+            } else if (k == rank) {
+                seqs.push_back({1., static_cast<double>(pDims[k] - 1), 1.});
+            } else {
+                seqs.push_back({0., 0., 1.});
+            }
+        }
+
+        Array<cT> subPacked = createSubArray<cT>(packed, seqs);
+        fft_inplace<cT>(subPacked, rank, false);
+    } else {
+        vector<af_seq> seqs;
+        for (int k = 0; k < AF_MAX_DIMS; k++) {
+            if (k < rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k]) - 1, 1.});
+            } else if (k == rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k] - 2), 1.});
+            } else {
+                seqs.push_back({0., 0., 1.});
+            }
+        }
+
+        Array<cT> subPacked = createSubArray<cT>(packed, seqs);
+        fft_inplace<cT>(subPacked, rank, false);
+    }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::reorderOutputHelper<T, cT>(out, packed, signal, filter, rank, kind,
+                                       expand);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> fftconvolve<T>(Array<T> const&, Array<T> const&, \
+                                     const bool, AF_BATCH_KIND, const int);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(int)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(uintl)
+INSTANTIATE(intl)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/fftconvolve.hpp b/src/backend/oneapi/fftconvolve.hpp
new file mode 100644
index 0000000000..88ad3c9b9d
--- /dev/null
+++ b/src/backend/oneapi/fftconvolve.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/flood_fill.cpp b/src/backend/oneapi/flood_fill.cpp
new file mode 100644
index 0000000000..2d9d22d696
--- /dev/null
+++ b/src/backend/oneapi/flood_fill.cpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <flood_fill.hpp>
+
+#include <err_oneapi.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup) {
+    ONEAPI_NOT_SUPPORTED("");
+    auto out = createValueArray(image.dims(), T(0));
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> floodFill(const Array<T>&, const Array<uint>&,           \
+                                const Array<uint>&, const T, const T, const T, \
+                                const af::connectivity);
+
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(ushort)
+INSTANTIATE(uchar)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/flood_fill.hpp b/src/backend/oneapi/flood_fill.hpp
new file mode 100644
index 0000000000..00ddce1b70
--- /dev/null
+++ b/src/backend/oneapi/flood_fill.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup = AF_CONNECTIVITY_8);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/gradient.cpp b/src/backend/oneapi/gradient.cpp
new file mode 100644
index 0000000000..0ab39d7e8d
--- /dev/null
+++ b/src/backend/oneapi/gradient.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <gradient.hpp>
+#include <kernel/gradient.hpp>
+#include <math.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in) {
+    kernel::gradient<T>(grad0, grad1, in);
+}
+
+#define INSTANTIATE(T)                                            \
+    template void gradient<T>(Array<T> & grad0, Array<T> & grad1, \
+                              const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/gradient.hpp b/src/backend/oneapi/gradient.hpp
new file mode 100644
index 0000000000..b90fb6ecc7
--- /dev/null
+++ b/src/backend/oneapi/gradient.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/harris.cpp b/src/backend/oneapi/harris.cpp
new file mode 100644
index 0000000000..d266a18bad
--- /dev/null
+++ b/src/backend/oneapi/harris.cpp
@@ -0,0 +1,42 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr) {
+    ONEAPI_NOT_SUPPORTED("");
+    return 0;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned harris<T, convAccT>(                                    \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned max_corners,                       \
+        const float min_response, const float sigma,                          \
+        const unsigned filter_len, const float k_thr);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/harris.hpp b/src/backend/oneapi/harris.hpp
new file mode 100644
index 0000000000..eba87bd404
--- /dev/null
+++ b/src/backend/oneapi/harris.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/hist_graphics.cpp b/src/backend/oneapi/hist_graphics.cpp
new file mode 100644
index 0000000000..e016337a54
--- /dev/null
+++ b/src/backend/oneapi/hist_graphics.cpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <err_oneapi.hpp>
+#include <hist_graphics.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_histogram(const Array<T> &data, fg_histogram hist) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+#define INSTANTIATE(T) \
+    template void copy_histogram<T>(const Array<T> &, fg_histogram);
+
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/hist_graphics.hpp b/src/backend/oneapi/hist_graphics.hpp
new file mode 100644
index 0000000000..578a9bde70
--- /dev/null
+++ b/src/backend/oneapi/hist_graphics.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_histogram(const Array<T> &data, fg_histogram hist);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/histogram.cpp b/src/backend/oneapi/histogram.cpp
new file mode 100644
index 0000000000..872431f14c
--- /dev/null
+++ b/src/backend/oneapi/histogram.cpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <histogram.hpp>
+#include <kernel/histogram.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear) {
+    const dim4 &dims = in.dims();
+    dim4 outDims     = dim4(nbins, 1, dims[2], dims[3]);
+    Array<uint> out  = createValueArray<uint>(outDims, uint(0));
+    kernel::histogram<T>(out, in, nbins, minval, maxval, isLinear);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                    \
+    template Array<uint> histogram<T>(const Array<T> &, const unsigned &, \
+                                      const double &, const double &,     \
+                                      const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/histogram.hpp b/src/backend/oneapi/histogram.hpp
new file mode 100644
index 0000000000..67be10a0d3
--- /dev/null
+++ b/src/backend/oneapi/histogram.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/homography.cpp b/src/backend/oneapi/homography.cpp
new file mode 100644
index 0000000000..2bf05ef672
--- /dev/null
+++ b/src/backend/oneapi/homography.cpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <homography.hpp>
+
+#include <arith.hpp>
+#include <err_oneapi.hpp>
+#include <af/dim4.hpp>
+
+#include <algorithm>
+#include <limits>
+
+using af::dim4;
+using std::numeric_limits;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+int homography(Array<T> &bestH, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations) {
+    ONEAPI_NOT_SUPPORTED("");
+    return 0;
+}
+
+#define INSTANTIATE(T)                                                     \
+    template int homography(                                               \
+        Array<T> &H, const Array<float> &x_src, const Array<float> &y_src, \
+        const Array<float> &x_dst, const Array<float> &y_dst,              \
+        const Array<float> &initial, const af_homography_type htype,       \
+        const float inlier_thr, const unsigned iterations);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/homography.hpp b/src/backend/oneapi/homography.hpp
new file mode 100644
index 0000000000..456b692330
--- /dev/null
+++ b/src/backend/oneapi/homography.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+int homography(Array<T> &H, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/hsv_rgb.cpp b/src/backend/oneapi/hsv_rgb.cpp
new file mode 100644
index 0000000000..fb9d86b5ec
--- /dev/null
+++ b/src/backend/oneapi/hsv_rgb.cpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <hsv_rgb.hpp>
+
+#include <err_oneapi.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> hsv2rgb(const Array<T>& in) {
+    ONEAPI_NOT_SUPPORTED("");
+    Array<T> out = createEmptyArray<T>(in.dims());
+    return out;
+}
+
+template<typename T>
+Array<T> rgb2hsv(const Array<T>& in) {
+    ONEAPI_NOT_SUPPORTED("");
+    Array<T> out = createEmptyArray<T>(in.dims());
+    return out;
+}
+
+#define INSTANTIATE(T)                                \
+    template Array<T> hsv2rgb<T>(const Array<T>& in); \
+    template Array<T> rgb2hsv<T>(const Array<T>& in);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/hsv_rgb.hpp b/src/backend/oneapi/hsv_rgb.hpp
new file mode 100644
index 0000000000..73abd86410
--- /dev/null
+++ b/src/backend/oneapi/hsv_rgb.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> hsv2rgb(const Array<T>& in);
+
+template<typename T>
+Array<T> rgb2hsv(const Array<T>& in);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/identity.cpp b/src/backend/oneapi/identity.cpp
new file mode 100644
index 0000000000..68a592ab88
--- /dev/null
+++ b/src/backend/oneapi/identity.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <identity.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/identity.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> identity(const dim4& dims) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::identity<T>(out);
+    return out;
+}
+
+#define INSTANTIATE_IDENTITY(T) \
+    template Array<T> identity<T>(const af::dim4& dims);
+
+INSTANTIATE_IDENTITY(float)
+INSTANTIATE_IDENTITY(double)
+INSTANTIATE_IDENTITY(cfloat)
+INSTANTIATE_IDENTITY(cdouble)
+INSTANTIATE_IDENTITY(int)
+INSTANTIATE_IDENTITY(uint)
+INSTANTIATE_IDENTITY(intl)
+INSTANTIATE_IDENTITY(uintl)
+INSTANTIATE_IDENTITY(char)
+INSTANTIATE_IDENTITY(schar)
+INSTANTIATE_IDENTITY(uchar)
+INSTANTIATE_IDENTITY(short)
+INSTANTIATE_IDENTITY(ushort)
+INSTANTIATE_IDENTITY(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/identity.hpp b/src/backend/oneapi/identity.hpp
new file mode 100644
index 0000000000..4b1057d04a
--- /dev/null
+++ b/src/backend/oneapi/identity.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> identity(const dim4& dim);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/iir.cpp b/src/backend/oneapi/iir.cpp
new file mode 100644
index 0000000000..4a7654bd38
--- /dev/null
+++ b/src/backend/oneapi/iir.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <arith.hpp>
+#include <convolve.hpp>
+#include <err_oneapi.hpp>
+#include <iir.hpp>
+#include <kernel/iir.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x) {
+    AF_BATCH_KIND type = x.ndims() == 1 ? AF_BATCH_NONE : AF_BATCH_SAME;
+    if (x.ndims() != b.ndims()) {
+        type = (x.ndims() < b.ndims()) ? AF_BATCH_RHS : AF_BATCH_LHS;
+    }
+
+    // Extract the first N elements
+    Array<T> c = convolve<T, T>(x, b, type, 1, true);
+    dim4 cdims = c.dims();
+    cdims[0]   = x.dims()[0];
+    c.resetDims(cdims);
+
+    int num_a = a.dims()[0];
+
+    if (num_a == 1) { return c; }
+
+    size_t local_bytes_req = (num_a * 2 + 1) * sizeof(T);
+    if (local_bytes_req >
+        getDevice().get_info<sycl::info::device::local_mem_size>()) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\ncurrent OneAPI device does not have sufficient local "
+                 "memory,\n"
+                 "for iir kernel, %zu(required) > %zu(available)\n",
+                 local_bytes_req,
+                 getDevice().get_info<sycl::info::device::local_mem_size>());
+        AF_ERROR(errMessage, AF_ERR_RUNTIME);
+    }
+
+    dim4 ydims = c.dims();
+    Array<T> y = createEmptyArray<T>(ydims);
+
+    if (a.ndims() > 1) {
+        kernel::iir<T, true>(y, c, a);
+    } else {
+        kernel::iir<T, false>(y, c, a);
+    }
+    return y;
+}
+
+#define INSTANTIATE(T)                                          \
+    template Array<T> iir(const Array<T> &b, const Array<T> &a, \
+                          const Array<T> &x);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/iir.hpp b/src/backend/oneapi/iir.hpp
new file mode 100644
index 0000000000..3c50f539ee
--- /dev/null
+++ b/src/backend/oneapi/iir.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/image.cpp b/src/backend/oneapi/image.cpp
new file mode 100644
index 0000000000..7aa8b4b667
--- /dev/null
+++ b/src/backend/oneapi/image.cpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <err_oneapi.hpp>
+#include <image.hpp>
+
+#include <stdexcept>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+#define INSTANTIATE(T) template void copy_image<T>(const Array<T> &, fg_image);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/image.hpp b/src/backend/oneapi/image.hpp
new file mode 100644
index 0000000000..6e644a3e48
--- /dev/null
+++ b/src/backend/oneapi/image.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/index.cpp b/src/backend/oneapi/index.cpp
new file mode 100644
index 0000000000..af204b0820
--- /dev/null
+++ b/src/backend/oneapi/index.cpp
@@ -0,0 +1,94 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <index.hpp>
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <handle.hpp>
+#include <kernel/assign_kernel_param.hpp>
+#include <kernel/index.hpp>
+#include <memory.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+using arrayfire::oneapi::IndexKernelParam;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> index(const Array<T>& in, const af_index_t idxrs[]) {
+    IndexKernelParam p;
+    std::vector<af_seq> seqs(4, af_span);
+    // create seq vector to retrieve output
+    // dimensions, strides & offsets
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
+    }
+
+    // retrieve dimensions, strides and offsets
+    const dim4& iDims = in.dims();
+    dim4 dDims        = in.getDataDims();
+    dim4 oDims        = toDims(seqs, iDims);
+    dim4 iOffs        = toOffset(seqs, dDims);
+    dim4 iStrds       = in.strides();
+
+    for (dim_t i = 0; i < 4; ++i) {
+        p.isSeq[i] = idxrs[i].isSeq;
+        p.offs[i]  = iOffs[i];
+        p.strds[i] = iStrds[i];
+        p.steps[i] = 0;
+        if (idxrs[i].isSeq) {
+            af_seq seq = idxrs[i].idx.seq;
+            // The step for af_span used in the kernel must be 1
+            if (seq.begin == af_span.begin && seq.end == af_span.end &&
+                seq.step == af_span.step)
+                p.steps[i] = 1;
+            else
+                p.steps[i] = seq.step;
+        }
+    }
+
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4(1)));
+    // loop through the indexers to read the af_array indices
+    for (dim_t x = 0; x < 4; ++x) {
+        if (!p.isSeq[x]) {
+            idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
+            oDims[x]   = idxArrs[x].elements();
+        }
+    }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+    if (oDims.elements() == 0) { return out; }
+    kernel::index<T>(out, in, p, idxArrs);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> index<T>(const Array<T>& in, const af_index_t idxrs[]);
+
+INSTANTIATE(cdouble)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
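Review note on `index.cpp`: the loop that fills `p.steps` maps every indexer to the step the kernel will use. The sketch below restates that mapping on the host so it can be checked in isolation. It is only an illustration: `Seq`, `Indexer`, `kSpan` and `kernelStep` are hypothetical names that are not part of the patch, and the span sentinel value `{1, 1, 0}` is an assumption about `af_span`.

```cpp
#include <cassert>

// Hypothetical stand-in for af_seq; the field names mirror the patch.
struct Seq {
    double begin;
    double end;
    double step;
};

// Assumed sentinel playing the role of af_span ("take the whole axis").
constexpr Seq kSpan{1.0, 1.0, 0.0};

// Hypothetical stand-in for af_index_t, reduced to what the loop reads.
struct Indexer {
    bool isSeq;
    Seq seq;  // only meaningful when isSeq is true
};

// Returns the per-axis step handed to the kernel: 0 for array indexers,
// 1 for the span sentinel, and the sequence's own step otherwise.
int kernelStep(const Indexer& idx) {
    if (!idx.isSeq) { return 0; }
    const Seq& s = idx.seq;
    const bool isSpan =
        s.begin == kSpan.begin && s.end == kSpan.end && s.step == kSpan.step;
    return isSpan ? 1 : static_cast<int>(s.step);
}
```

Rewriting the span's step to 1 matters because the sentinel itself carries a step of 0 (under the assumption above), which would otherwise make the kernel read the same element repeatedly.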
diff --git a/src/backend/oneapi/index.hpp b/src/backend/oneapi/index.hpp
new file mode 100644
index 0000000000..cebd4c3ea5
--- /dev/null
+++ b/src/backend/oneapi/index.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/index.h>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> index(const Array<T>& in, const af_index_t idxrs[]);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/inverse.cpp b/src/backend/oneapi/inverse.cpp
new file mode 100644
index 0000000000..2779393906
--- /dev/null
+++ b/src/backend/oneapi/inverse.cpp
@@ -0,0 +1,58 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <identity.hpp>
+#include <solve.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> inverse(const Array<T> &in) {
+    Array<T> I = identity<T>(in.dims());
+    return solve<T>(in, I);
+}
+
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> inverse(const Array<T> &in) {
+    ONEAPI_NOT_SUPPORTED("");
+    AF_ERROR("Linear Algebra is disabled on OneAPI backend",
+             AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif
diff --git a/src/backend/oneapi/inverse.hpp b/src/backend/oneapi/inverse.hpp
new file mode 100644
index 0000000000..5b37d94978
--- /dev/null
+++ b/src/backend/oneapi/inverse.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> inverse(const Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/iota.cpp b/src/backend/oneapi/iota.cpp
new file mode 100644
index 0000000000..e775f0dde6
--- /dev/null
+++ b/src/backend/oneapi/iota.cpp
@@ -0,0 +1,47 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <iota.hpp>
+#include <kernel/iota.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> iota(const dim4 &dims, const dim4 &tile_dims) {
+    dim4 outdims = dims * tile_dims;
+
+    Array<T> out = createEmptyArray<T>(outdims);
+    kernel::iota<T>(out, dims);
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace oneapi
+}  // namespace arrayfire
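Review note on `iota.cpp`: the output shape is `dims * tile_dims`, and the kernel receives the untiled `dims` so it can wrap indices. The host-side sketch below shows the result this is expected to produce, assuming column-major storage and assuming the sequence `0 .. prod(dims) - 1` repeats along each tiled axis. `Dim4` and `iotaHost` are hypothetical names used only for this illustration; the real kernel lives in `kernel/iota.hpp` and is not shown in this part of the patch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 4-D shape type standing in for af::dim4.
using Dim4 = std::array<std::size_t, 4>;

// Fills a column-major buffer of shape dims * tile with the sequence
// 0 .. prod(dims) - 1, repeated along every tiled axis.
std::vector<int> iotaHost(const Dim4& dims, const Dim4& tile) {
    Dim4 out{};
    for (int i = 0; i < 4; ++i) { out[i] = dims[i] * tile[i]; }

    std::vector<int> result(out[0] * out[1] * out[2] * out[3]);
    std::size_t o = 0;  // linear output index, axis 0 varies fastest
    for (std::size_t w = 0; w < out[3]; ++w) {
        for (std::size_t z = 0; z < out[2]; ++z) {
            for (std::size_t y = 0; y < out[1]; ++y) {
                for (std::size_t x = 0; x < out[0]; ++x) {
                    // Position inside the untiled block.
                    const std::size_t bx = x % dims[0];
                    const std::size_t by = y % dims[1];
                    const std::size_t bz = z % dims[2];
                    const std::size_t bw = w % dims[3];
                    result[o++] = static_cast<int>(
                        bx + dims[0] * (by + dims[1] * (bz + dims[2] * bw)));
                }
            }
        }
    }
    return result;
}
```

With `dims = {2, 2, 1, 1}` and `tile = {1, 2, 1, 1}` the sketch yields a 2x4 buffer holding `0 1 2 3 0 1 2 3` in memory order.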
diff --git a/src/backend/oneapi/iota.hpp b/src/backend/oneapi/iota.hpp
new file mode 100644
index 0000000000..ffce49d1bd
--- /dev/null
+++ b/src/backend/oneapi/iota.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/ireduce.cpp b/src/backend/oneapi/ireduce.cpp
new file mode 100644
index 0000000000..c4bfc7604f
--- /dev/null
+++ b/src/backend/oneapi/ireduce.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <ireduce.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/ireduce.hpp>
+#include <optypes.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim) {
+    Array<uint> rlen = createEmptyArray<uint>(af::dim4(0));
+    kernel::ireduce<T, op>(out, loc, in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen) {
+    kernel::ireduce<T, op>(out, loc, in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in) {
+    return kernel::ireduce_all<T, op>(loc, in);
+}
+
+#define INSTANTIATE(ROp, T)                                           \
+    template void ireduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim); \
+    template void rreduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim,  \
+                                  const Array<uint> &rlen);           \
+    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);
+
+// min
+INSTANTIATE(af_min_t, float)
+INSTANTIATE(af_min_t, double)
+INSTANTIATE(af_min_t, cfloat)
+INSTANTIATE(af_min_t, cdouble)
+INSTANTIATE(af_min_t, int)
+INSTANTIATE(af_min_t, uint)
+INSTANTIATE(af_min_t, intl)
+INSTANTIATE(af_min_t, uintl)
+INSTANTIATE(af_min_t, char)
+INSTANTIATE(af_min_t, schar)
+INSTANTIATE(af_min_t, uchar)
+INSTANTIATE(af_min_t, short)
+INSTANTIATE(af_min_t, ushort)
+INSTANTIATE(af_min_t, half)
+
+// max
+INSTANTIATE(af_max_t, float)
+INSTANTIATE(af_max_t, double)
+INSTANTIATE(af_max_t, cfloat)
+INSTANTIATE(af_max_t, cdouble)
+INSTANTIATE(af_max_t, int)
+INSTANTIATE(af_max_t, uint)
+INSTANTIATE(af_max_t, intl)
+INSTANTIATE(af_max_t, uintl)
+INSTANTIATE(af_max_t, char)
+INSTANTIATE(af_max_t, schar)
+INSTANTIATE(af_max_t, uchar)
+INSTANTIATE(af_max_t, short)
+INSTANTIATE(af_max_t, ushort)
+INSTANTIATE(af_max_t, half)
+}  // namespace oneapi
+}  // namespace arrayfire
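Review note on `ireduce.cpp`: `ireduce_all` returns the reduced value and writes the winning position through `loc`. A minimal host-side sketch of that contract is given below. `ireduceAllHost` is a hypothetical helper, not part of the patch, and its tie-breaking (the first of equal values wins) is an assumption of the sketch that may differ from the device kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Indexed reduction over a non-empty host vector: returns the winning value
// and stores its position in *loc. better(a, b) is true when a should
// replace b (std::less for min, std::greater for max).
template<typename T, typename Compare>
T ireduceAllHost(unsigned* loc, const std::vector<T>& in, Compare better) {
    assert(!in.empty());  // precondition of this sketch
    T best           = in[0];
    unsigned bestIdx = 0;
    for (std::size_t i = 1; i < in.size(); ++i) {
        // The strict comparison keeps the first of several equal values.
        if (better(in[i], best)) {
            best    = in[i];
            bestIdx = static_cast<unsigned>(i);
        }
    }
    *loc = bestIdx;
    return best;
}
```

Passing the comparator as a parameter plays the role of the `af_min_t` / `af_max_t` template argument in the patch: one routine serves both reductions.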
diff --git a/src/backend/oneapi/ireduce.hpp b/src/backend/oneapi/ireduce.hpp
new file mode 100644
index 0000000000..99a1e45aac
--- /dev/null
+++ b/src/backend/oneapi/ireduce.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim);
+
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen);
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/jit.cpp b/src/backend/oneapi/jit.cpp
new file mode 100644
index 0000000000..bda9e43ccf
--- /dev/null
+++ b/src/backend/oneapi/jit.cpp
@@ -0,0 +1,685 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <CL/cl.h>
+#include <jit/ShiftNode.hpp>
+#include <jit/kernel_generators.hpp>
+
+#include <kernel_headers/KParam.hpp>
+#include <kernel_headers/jit.hpp>
+
+#include <Array.hpp>
+#include <Kernel.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/jit/ModdimNode.hpp>
+#include <common/jit/Node.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/jit/ShiftNodeBase.hpp>
+#include <common/util.hpp>
+#include <copy.hpp>
+#include <device_manager.hpp>
+#include <err_oneapi.hpp>
+#include <jit/BufferNode.hpp>
+#include <platform.hpp>
+#include <type_util.hpp>
+#include <af/dim4.hpp>
+
+#include <sycl/backend.hpp>
+#include <sycl/sycl.hpp>
+
+#include <array>
+#include <cstdio>
+#include <functional>
+#include <mutex>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <unordered_map>
+#include <vector>
+
+using arrayfire::common::getFuncName;
+using arrayfire::common::half;
+using arrayfire::common::kNodeType;
+using arrayfire::common::ModdimNode;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ids;
+using arrayfire::common::Node_map_t;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::common::ShiftNodeBase;
+using arrayfire::oneapi::getActiveDeviceBaseBuildFlags;
+using arrayfire::oneapi::jit::BufferNode;
+using arrayfire::oneapi::jit::ShiftNode;
+
+using std::array;
+using std::begin;
+using std::end;
+using std::find;
+using std::find_if;
+using std::string;
+using std::stringstream;
+using std::to_string;
+using std::unordered_map;
+using std::vector;
+
+using sycl::backend;
+
+namespace arrayfire {
+
+namespace opencl {
+
+const static string DEFAULT_MACROS_STR(R"JIT(
+#ifdef USE_DOUBLE
+#pragma OPENCL EXTENSION cl_khr_fp64 : enable
+#endif
+#ifdef USE_HALF
+#pragma OPENCL EXTENSION cl_khr_fp16 : enable
+#else
+#define half short
+#endif
+#ifndef M_PI
+#define \
+ M_PI 3.1415926535897932384626433832795028841971693993751058209749445923078164
+#endif
+)JIT");
+
+string getKernelString(const string& funcName,
+                       const nonstd::span<Node* const> full_nodes,
+                       nonstd::span<const Node_ids> full_ids,
+                       const nonstd::span<int const> output_ids,
+                       const bool is_linear, const bool loop0, const bool loop1,
+                       const bool loop3) {
+    // Common OpenCL code
+    // This part of the code does not change with the kernel.
+
+    static const char* kernelVoid = R"JIT(
+__kernel void )JIT";
+    static const char* dimParams  = "KParam oInfo";
+    static const char* blockStart = "{";
+    static const char* blockEnd   = "\n}\n";
+
+    static const char* linearInit = R"JIT(
+   int idx = get_global_id(0);
+   const int idxEnd = oInfo.dims[0];
+   if (idx < idxEnd) {
+)JIT";
+    static const char* linearEnd  = R"JIT(
+   })JIT";
+
+    static const char* linearLoop0Start = R"JIT(
+        const int idxID0Inc = get_global_size(0);
+        do {)JIT";
+    static const char* linearLoop0End   = R"JIT(
+            idx += idxID0Inc;
+            if (idx >= idxEnd) break;
+        } while (true);)JIT";
+
+    // ///////////////////////////////////////////////
+    // oInfo = output optimized information (dims, strides, offset).
+    //         oInfo has dimensions removed, to optimize block scheduling
+    // iInfo = input internal information (dims, strides, offset)
+    //         iInfo has the original dimensions, auto generated code
+    //
+    // Loop3 is fastest and becomes the inner loop, since
+    //      - the number of iterations is known upfront
+    // Loop1 is used for extra dynamic looping (writing into cache)
+    // All loops are conditional and independent
+    // Format Loop1 & Loop3
+    // ////////////////////////////
+    //  *stridedLoopNInit               // Always
+    //  *stridedLoop1Init               // Conditional
+    //  *stridedLoop2Init               // Conditional
+    //  *stridedLoop3Init               // Conditional
+    //  *stridedLoop1Start              // Conditional
+    //      *stridedLoop3Start          // Conditional
+    //          auto generated code     // Always
+    //      *stridedLoop3End            // Conditional
+    //  *stridedLoop1End                // Conditional
+    //  *stridedEnd                     // Always
+    //
+    // format loop0 (Vector only)
+    // //////////////////////////
+    // *stridedLoop0Init                // Always
+    // *stridedLoop0Start               // Always
+    //      auto generated code         // Always
+    // *stridedLoop0End                 // Always
+    // *stridedEnd                      // Always
+
+    static const char* stridedLoop0Init  = R"JIT(
+    int id0 = get_global_id(0);
+    const int id0End = oInfo.dims[0];
+    if (id0 < id0End) {
+#define id1 0
+#define id2 0
+#define id3 0
+        const int ostrides0 = oInfo.strides[0];
+        int idx = ostrides0*id0;)JIT";
+    static const char* stridedLoop0Start = R"JIT(
+        const int id0Inc = get_global_size(0);
+        const int idxID0Inc = ostrides0*id0Inc;
+        do {)JIT";
+    static const char* stridedLoop0End   = R"JIT(
+            id0 += id0Inc;
+            if (id0 >= id0End) break;
+            idx += idxID0Inc;
+        } while (true);)JIT";
+
+    // -------------
+    static const char* stridedLoopNInit = R"JIT(
+    int id0 = get_global_id(0);
+    int id1 = get_global_id(1);
+    const int id0End = oInfo.dims[0];
+    const int id1End = oInfo.dims[1];
+    if ((id0 < id0End) & (id1 < id1End)) {
+        const int id2 = get_global_id(2);
+#define id3 0
+        const int ostrides1 = oInfo.strides[1];
+        int idx = (int)oInfo.strides[0]*id0 + ostrides1*id1 + (int)oInfo.strides[2]*id2;)JIT";
+    static const char* stridedEnd       = R"JIT(
+    })JIT";
+
+    static const char* stridedLoop3Init  = R"JIT(
+#undef id3
+        int id3 = 0;
+        const int id3End = oInfo.dims[3];
+        const int idxID3Inc = oInfo.strides[3];)JIT";
+    static const char* stridedLoop3Start = R"JIT(
+                const int idxBaseID3 = idx;
+                do {)JIT";
+    static const char* stridedLoop3End   = R"JIT(
+                    ++id3;
+                    if (id3 == id3End) break;
+                    idx += idxID3Inc;
+                } while (true);
+                id3 = 0;
+                idx = idxBaseID3;)JIT";
+
+    static const char* stridedLoop1Init  = R"JIT(
+        const int id1Inc = get_global_size(1);
+        const int idxID1Inc = id1Inc * ostrides1;)JIT";
+    static const char* stridedLoop1Start = R"JIT(
+        do {)JIT";
+    static const char* stridedLoop1End   = R"JIT(
+            id1 += id1Inc;
+            if (id1 >= id1End) break;
+            idx += idxID1Inc;
+        } while (true);)JIT";
+
+    // Reuse stringstreams, because they are very costly to initialize
+    thread_local stringstream inParamStream;
+    thread_local stringstream outParamStream;
+    thread_local stringstream outOffsetStream;
+    thread_local stringstream inOffsetsStream;
+    thread_local stringstream opsStream;
+    thread_local stringstream kerStream;
+
+    string ret;
+    try {
+        int oid{0};
+        for (size_t i{0}; i < full_nodes.size(); i++) {
+            const auto& node{full_nodes[i]};
+            const auto& ids_curr{full_ids[i]};
+            // Generate input parameters, only needs current id
+            node->genParams(inParamStream, ids_curr.id, is_linear);
+            // Generate input offsets, only needs current id
+            node->genOffsets(inOffsetsStream, ids_curr.id, is_linear);
+            // Generate the core function body, needs children ids as well
+            node->genFuncs(opsStream, ids_curr);
+            for (size_t output_idx{0}; output_idx < output_ids.size();
+                 ++output_idx) {
+                if (output_ids[output_idx] == ids_curr.id) {
+                    outParamStream
+                        << "__global " << full_nodes[ids_curr.id]->getTypeStr()
+                        << " *out" << oid << ", int offset" << oid << ",\n";
+                    // Apply output offset
+                    outOffsetStream << "\nout" << oid << " += offset" << oid
+                                    << ';';
+                    // Generate code to write the output
+                    opsStream << "out" << output_idx << "[idx] = val"
+                              << ids_curr.id << ";\n";
+                    ++oid;
+                }
+            }
+        }
+
+        kerStream << DEFAULT_MACROS_STR << kernelVoid << funcName << "(\n"
+                  << inParamStream.str() << outParamStream.str() << dimParams
+                  << ")" << blockStart;
+        if (is_linear) {
+            kerStream << linearInit << inOffsetsStream.str()
+                      << outOffsetStream.str() << '\n';
+            if (loop0) kerStream << linearLoop0Start;
+            kerStream << "\n\n" << opsStream.str();
+            if (loop0) kerStream << linearLoop0End;
+            kerStream << linearEnd;
+        } else {
+            if (loop0) {
+                kerStream << stridedLoop0Init << outOffsetStream.str() << '\n'
+                          << stridedLoop0Start;
+            } else {
+                kerStream << stridedLoopNInit << outOffsetStream.str() << '\n';
+                if (loop3) kerStream << stridedLoop3Init;
+                if (loop1) kerStream << stridedLoop1Init << stridedLoop1Start;
+                if (loop3) kerStream << stridedLoop3Start;
+            }
+            kerStream << "\n\n" << inOffsetsStream.str() << opsStream.str();
+            if (loop3) kerStream << stridedLoop3End;
+            if (loop1) kerStream << stridedLoop1End;
+            if (loop0) kerStream << stridedLoop0End;
+            kerStream << stridedEnd;
+        }
+        kerStream << blockEnd;
+        ret = kerStream.str();
+    } catch (...) {
+        // Prepare for next round, limit memory
+        inParamStream.str("");
+        outParamStream.str("");
+        inOffsetsStream.str("");
+        outOffsetStream.str("");
+        opsStream.str("");
+        kerStream.str("");
+        throw;
+    }
+    // Prepare for next round, limit memory
+    inParamStream.str("");
+    outParamStream.str("");
+    inOffsetsStream.str("");
+    outOffsetStream.str("");
+    opsStream.str("");
+    kerStream.str("");
+
+    return ret;
+}
+
+// cl::Kernel getKernel(const vector<Node*>& output_nodes,
+//                      const vector<int>& output_ids,
+//                      const vector<Node*>& full_nodes,
+//                      const vector<Node_ids>& full_ids, const bool is_linear)
+//                      {
+//     ONEAPI_NOT_SUPPORTED("");
+//     return common::getKernel("", "", true).get();
+// }
+
+static unordered_map<cl_device_id, std::string> device_name_map;
+static std::mutex device_name_map_mutex;
+static unordered_map<std::string, cl_kernel> kernel_map;
+static std::mutex kernel_map_mutex;
+
+template<typename T>
+cl_kernel getKernel(
+    std::string funcName, cl_context ctx, cl_device_id dev, cl_command_queue q,
+    const nonstd::span<Node* const> full_nodes,
+    nonstd::span<Node_ids const> full_ids, nonstd::span<int const> output_ids,
+    nonstd::span<oneapi::AParam<T, sycl::access_mode::write> const> ap,
+    bool is_linear) {
+    std::string devName;
+    {
+        std::lock_guard<std::mutex> lock(device_name_map_mutex);
+
+        auto devNameIt = device_name_map.find(dev);
+        if (devNameIt == device_name_map.end()) {
+            size_t devNameSz;
+            CL_CHECK(
+                clGetDeviceInfo(dev, CL_DEVICE_NAME, 0, nullptr, &devNameSz));
+            string newDevName(devNameSz, '\0');
+            CL_CHECK(clGetDeviceInfo(dev, CL_DEVICE_NAME, devNameSz,
+                                     newDevName.data(), nullptr));
+            device_name_map[dev] = newDevName;
+            devName              = newDevName;
+        } else {
+            devName = devNameIt->second;
+        }
+    }
+
+    vector<cl_kernel> kernels(10);
+    bool kernel_found;
+    string kernelHash = funcName + devName;
+    {
+        std::lock_guard<std::mutex> lock(kernel_map_mutex);
+        kernel_found = !(kernel_map.find(kernelHash) == end(kernel_map));
+    }
+    if (kernel_found) {
+        std::lock_guard<std::mutex> lock(kernel_map_mutex);
+        kernels[0] = kernel_map[kernelHash];
+    } else {
+        string jitstr = arrayfire::opencl::getKernelString(
+            funcName, full_nodes, full_ids, output_ids, is_linear, false, false,
+            ap[0].dims[2] > 1);
+
+        cl_int err;
+        vector<const char*> jitsources = {
+            {arrayfire::oneapi::opencl::KParam_hpp,
+             arrayfire::oneapi::opencl::jit_cl, jitstr.c_str()}};
+        vector<size_t> jitsizes = {arrayfire::oneapi::opencl::KParam_hpp_len,
+                                   arrayfire::oneapi::opencl::jit_cl_len,
+                                   jitstr.size()};
+
+        cl_program prog = clCreateProgramWithSource(
+            ctx, jitsources.size(), jitsources.data(), jitsizes.data(), &err);
+
+        std::string options = getActiveDeviceBaseBuildFlags();
+
+        CL_CHECK_BUILD(
+            clBuildProgram(prog, 1, &dev, options.c_str(), nullptr, nullptr));
+
+        cl_uint ret_kernels = 0;
+        CL_CHECK(
+            clCreateKernelsInProgram(prog, 1, kernels.data(), &ret_kernels));
+
+        std::lock_guard<std::mutex> lock(kernel_map_mutex);
+        kernel_map[kernelHash] = kernels[0];
+        CL_CHECK(clReleaseProgram(prog));
+    }
+    return kernels[0];
+}
+
+}  // namespace opencl
+
+namespace oneapi {
+
+template<typename T>
+void evalNodes(vector<Param<T>>& outputs, const vector<Node*>& output_nodes) {
+    if (outputs.empty()) return;
+    Node_map_t nodes;
+    vector<Node*> full_nodes;
+    vector<Node_ids> full_ids;
+    vector<int> output_ids;
+    vector<Node_ptr> node_clones;
+
+    bool is_linear{true};
+    dim_t numOutElems{1};
+    assert(outputs.size() == output_nodes.size());
+    KParam& out_info{outputs[0].info};
+    dim_t* outDims{out_info.dims};
+    dim_t* outStrides{out_info.strides};
+    // unsigned nrInputs{0};
+
+    dim_t ndims{outDims[3] > 1   ? 4
+                : outDims[2] > 1 ? 3
+                : outDims[1] > 1 ? 2
+                : outDims[0] > 0 ? 1
+                                 : 0};
+    for (dim_t dim{0}; dim < ndims; ++dim) {
+        is_linear &= (numOutElems == outStrides[dim]);
+        numOutElems *= outDims[dim];
+    }
+    if (numOutElems == 0) { return; }
+
+    for (Node* node : output_nodes) {
+        const int id{node->getNodesMap(nodes, full_nodes, full_ids)};
+        output_ids.push_back(id);
+    }
+
+    node_clones.clear();
+    node_clones.reserve(full_nodes.size());
+    for (Node* node : full_nodes) { node_clones.emplace_back(node->clone()); }
+
+    bool moddimsFound{false};
+    for (const Node* node : full_nodes) {
+        is_linear &= node->isLinear(outDims);
+        moddimsFound |= (node->getOp() == af_moddims_t);
+        // if (node->isBuffer()) { ++nrInputs; }
+    }
+
+    bool emptyColumnsFound{false};
+    if (is_linear) {
+        outDims[0]    = numOutElems;
+        outDims[1]    = 1;
+        outDims[2]    = 1;
+        outDims[3]    = 1;
+        outStrides[0] = 1;
+        outStrides[1] = numOutElems;
+        outStrides[2] = numOutElems;
+        outStrides[3] = numOutElems;
+        ndims         = 1;
+    } else {
+        emptyColumnsFound = ndims > (outDims[0] == 1   ? 1
+                                     : outDims[1] == 1 ? 2
+                                     : outDims[2] == 1 ? 3
+                                                       : 4);
+    }
+
+    //  Keep in global scope, so that the nodes remain active for later
+    //  referral in case moddims operations or column elimination have to
+    //  take place. Avoid all cloning/copying when no moddims node is present
+    //  (high chance).
+    if (moddimsFound || emptyColumnsFound) {
+        for (const Node_ids& ids : full_ids) {
+            auto& children{node_clones[ids.id]->m_children};
+            for (int i{0}; i < Node::kMaxChildren && children[i] != nullptr;
+                 i++) {
+                children[i] = node_clones[ids.child_ids[i]];
+            }
+        }
+
+        if (moddimsFound) {
+            const auto isModdim{[](const Node_ptr& ptr) {
+                return ptr->getOp() == af_moddims_t;
+            }};
+            for (auto nodeIt{begin(node_clones)}, endIt{end(node_clones)};
+                 (nodeIt = find_if(nodeIt, endIt, isModdim)) != endIt;
+                 ++nodeIt) {
+                const ModdimNode* mn{static_cast<ModdimNode*>(nodeIt->get())};
+
+                const auto new_strides{calcStrides(mn->m_new_shape)};
+                const auto isBuffer{
+                    [](const Node& node) { return node.isBuffer(); }};
+                for (NodeIterator<> it{nodeIt->get()}, end{NodeIterator<>()};
+                     (it = find_if(it, end, isBuffer)) != end; ++it) {
+                    jit::BufferNode<T>* buf{
+                        static_cast<jit::BufferNode<T>*>(&(*it))};
+                    buf->m_param.dims[0]    = mn->m_new_shape[0];
+                    buf->m_param.dims[1]    = mn->m_new_shape[1];
+                    buf->m_param.dims[2]    = mn->m_new_shape[2];
+                    buf->m_param.dims[3]    = mn->m_new_shape[3];
+                    buf->m_param.strides[0] = new_strides[0];
+                    buf->m_param.strides[1] = new_strides[1];
+                    buf->m_param.strides[2] = new_strides[2];
+                    buf->m_param.strides[3] = new_strides[3];
+                }
+            }
+        }
+        if (emptyColumnsFound) {
+            common::removeEmptyDimensions<Param<T>, BufferNode<T>,
+                                          ShiftNode<T>>(outputs, node_clones);
+        }
+    }
+
+    full_nodes.clear();
+    for (Node_ptr& node : node_clones) { full_nodes.push_back(node.get()); }
+
+    const string funcName{getFuncName(output_nodes, output_ids, full_nodes,
+                                      full_ids, is_linear, false, false, false,
+                                      outputs[0].info.dims[2] > 1)};
+
+    getQueue().submit([&](sycl::handler& h) {
+        for (Node* node : full_nodes) {
+            switch (node->getNodeType()) {
+                case kNodeType::Buffer: {
+                    BufferNode<T>* n = static_cast<BufferNode<T>*>(node);
+                    n->m_param.require(h);
+                } break;
+                case kNodeType::Shift: {
+                    ShiftNodeBase<jit::BufferNode<T>>* sn =
+                        static_cast<ShiftNodeBase<jit::BufferNode<T>>*>(node);
+                    sn->getBufferNode().m_param.require(h);
+                } break;
+                default: break;
+            }
+        }
+        vector<AParam<T, sycl::access_mode::write>> ap;
+        transform(begin(outputs), end(outputs), back_inserter(ap),
+                  [&](const Param<T>& p) {
+                      return AParam<T, sycl::access_mode::write>(
+                          h, *p.data, p.info.dims, p.info.strides,
+                          p.info.offset);
+                  });
+
+        h.host_task([ap, full_nodes, output_ids, full_ids, is_linear, funcName,
+                     node_clones, nodes, outputs](sycl::interop_handle hh) {
+            switch (hh.get_backend()) {
+                case backend::opencl: {
+                    auto ncc = node_clones;
+
+                    cl_command_queue q = hh.get_native_queue<backend::opencl>();
+                    cl_context ctx   = hh.get_native_context<backend::opencl>();
+                    cl_device_id dev = hh.get_native_device<backend::opencl>();
+
+                    cl_kernel kernel = arrayfire::opencl::getKernel<T>(
+                        funcName, ctx, dev, q, full_nodes, full_ids, output_ids,
+                        ap, is_linear);
+                    int nargs{0};
+                    for (Node* node : full_nodes) {
+                        nargs = node->setArgs(
+                            nargs, is_linear,
+                            [&kernel, &hh, &is_linear](int id, const void* ptr,
+                                                       size_t arg_size,
+                                                       bool is_buffer) {
+                                if (is_buffer) {
+                                    auto* info = static_cast<
+                                        AParam<T, sycl::access_mode::read>*>(
+                                        const_cast<void*>(ptr));
+                                    vector<cl_mem> mem =
+                                        hh.get_native_mem<backend::opencl>(
+                                            info->data);
+                                    if (is_linear) {
+                                        CL_CHECK(clSetKernelArg(kernel, id++,
+                                                                sizeof(cl_mem),
+                                                                &mem[0]));
+                                        CL_CHECK(clSetKernelArg(kernel, id++,
+                                                                sizeof(dim_t),
+                                                                &info->offset));
+                                    } else {
+                                        CL_CHECK(clSetKernelArg(kernel, id++,
+                                                                sizeof(cl_mem),
+                                                                &mem[0]));
+                                        KParam ooo = *info;
+                                        CL_CHECK(clSetKernelArg(kernel, id++,
+                                                                sizeof(KParam),
+                                                                &ooo));
+                                    }
+
+                                } else {
+                                    CL_CHECK(clSetKernelArg(kernel, id,
+                                                            arg_size, ptr));
+                                }
+                            });
+                    }
+
+                    // Set output parameters
+                    vector<cl_mem> mem;
+                    for (const auto& output : ap) {
+                        mem = hh.get_native_mem<backend::opencl>(output.data);
+                        cl_mem mmm = mem[0];
+                        CL_CHECK(clSetKernelArg(kernel, nargs++, sizeof(cl_mem),
+                                                &mmm));
+                        int off = output.offset;
+                        CL_CHECK(
+                            clSetKernelArg(kernel, nargs++, sizeof(int), &off));
+                    }
+                    const KParam ooo = ap[0];
+                    CL_CHECK(
+                        clSetKernelArg(kernel, nargs++, sizeof(KParam), &ooo));
+                    array<size_t, 3> offset{0, 0, 0};
+                    array<size_t, 3> global;
+                    int ndims = 0;
+                    if (is_linear) {
+                        global = {(size_t)ap[0].dims.elements(), 0, 0};
+                        ndims  = 1;
+                    } else {
+                        global = {(size_t)ap[0].dims[0], (size_t)ap[0].dims[1],
+                                  (size_t)ap[0].dims[2]};
+                        ndims  = 3;
+                    }
+
+                    {
+                        using namespace oneapi::kernel_logger;
+                        AF_TRACE(
+                            "Launching {}: Dims: [{},{},{},{}] Global: "
+                            "[{},{},{}] threads: {}",
+                            funcName, ap[0].dims[0], ap[0].dims[1],
+                            ap[0].dims[2], ap[0].dims[3], global[0], global[1],
+                            global[2],
+                            global[0] * std::max<size_t>(1, global[1]) *
+                                std::max<size_t>(1, global[2]));
+                    }
+
+                    cl_event kernel_event;
+                    CL_CHECK(clEnqueueNDRangeKernel(
+                        q, kernel, ndims, offset.data(), global.data(), nullptr,
+                        0, nullptr, &kernel_event));
+                    CL_CHECK(clEnqueueBarrierWithWaitList(q, 1, &kernel_event,
+                                                          nullptr));
+                    CL_CHECK(clReleaseEvent(kernel_event));
+
+                    CL_CHECK(clReleaseDevice(dev));
+                    CL_CHECK(clReleaseContext(ctx));
+                    CL_CHECK(clReleaseCommandQueue(q));
+
+                } break;
+                default: ONEAPI_NOT_SUPPORTED("Backend not supported");
+            }
+        });
+    });
+}
+
+template<typename T>
+void evalNodes(Param<T>& out, Node* node) {
+    vector<Param<T>> outputs{out};
+    vector<Node*> nodes{node};
+    oneapi::evalNodes(outputs, nodes);
+}
+
+template void evalNodes<float>(Param<float>& out, Node* node);
+template void evalNodes<double>(Param<double>& out, Node* node);
+template void evalNodes<cfloat>(Param<cfloat>& out, Node* node);
+template void evalNodes<cdouble>(Param<cdouble>& out, Node* node);
+template void evalNodes<int>(Param<int>& out, Node* node);
+template void evalNodes<uint>(Param<uint>& out, Node* node);
+template void evalNodes<char>(Param<char>& out, Node* node);
+template void evalNodes<schar>(Param<schar>& out, Node* node);
+template void evalNodes<uchar>(Param<uchar>& out, Node* node);
+template void evalNodes<intl>(Param<intl>& out, Node* node);
+template void evalNodes<uintl>(Param<uintl>& out, Node* node);
+template void evalNodes<short>(Param<short>& out, Node* node);
+template void evalNodes<ushort>(Param<ushort>& out, Node* node);
+template void evalNodes<half>(Param<half>& out, Node* node);
+
+template void evalNodes<float>(vector<Param<float>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<double>(vector<Param<double>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<cfloat>(vector<Param<cfloat>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<cdouble>(vector<Param<cdouble>>& out,
+                                 const vector<Node*>& node);
+template void evalNodes<int>(vector<Param<int>>& out,
+                             const vector<Node*>& node);
+template void evalNodes<uint>(vector<Param<uint>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<char>(vector<Param<char>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<schar>(vector<Param<schar>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<uchar>(vector<Param<uchar>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<intl>(vector<Param<intl>>& out,
+                              const vector<Node*>& node);
+template void evalNodes<uintl>(vector<Param<uintl>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<short>(vector<Param<short>>& out,
+                               const vector<Node*>& node);
+template void evalNodes<ushort>(vector<Param<ushort>>& out,
+                                const vector<Node*>& node);
+template void evalNodes<half>(vector<Param<half>>& out,
+                              const vector<Node*>& node);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/jit/BufferNode.hpp b/src/backend/oneapi/jit/BufferNode.hpp
new file mode 100644
index 0000000000..d10ca24cc3
--- /dev/null
+++ b/src/backend/oneapi/jit/BufferNode.hpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/jit/BufferNodeBase.hpp>
+#include <jit/kernel_generators.hpp>
+
+#include <memory>
+
+namespace arrayfire {
+namespace oneapi {
+namespace jit {
+template<typename T>
+using BufferNode = common::BufferNodeBase<std::shared_ptr<sycl::buffer<T>>,
+                                          AParam<T, sycl::access_mode::read>>;
+}  // namespace jit
+}  // namespace oneapi
+
+namespace common {
+
+template<typename DataType, typename ParamType>
+bool BufferNodeBase<DataType, ParamType>::operator==(
+    const BufferNodeBase<DataType, ParamType> &other) const noexcept {
+    // clang-format off
+    return m_data.get() == other.m_data.get() &&
+           m_bytes == other.m_bytes &&
+           m_param.offset == other.m_param.offset &&
+           m_linear_buffer == other.m_linear_buffer &&
+           m_param.dims[0] == other.m_param.dims[0] &&
+           m_param.dims[1] == other.m_param.dims[1] &&
+           m_param.dims[2] == other.m_param.dims[2] &&
+           m_param.dims[3] == other.m_param.dims[3] &&
+           m_param.strides[0] == other.m_param.strides[0] &&
+           m_param.strides[1] == other.m_param.strides[1] &&
+           m_param.strides[2] == other.m_param.strides[2] &&
+           m_param.strides[3] == other.m_param.strides[3];
+    // clang-format on
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/jit/ShiftNode.hpp b/src/backend/oneapi/jit/ShiftNode.hpp
new file mode 100644
index 0000000000..6a87b28729
--- /dev/null
+++ b/src/backend/oneapi/jit/ShiftNode.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/jit/ShiftNodeBase.hpp>
+#include <jit/BufferNode.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace jit {
+
+template<typename T>
+using ShiftNode = common::ShiftNodeBase<BufferNode<T>>;
+
+}  // namespace jit
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/jit/kernel_generators.hpp b/src/backend/oneapi/jit/kernel_generators.hpp
new file mode 100644
index 0000000000..9ca9cd984e
--- /dev/null
+++ b/src/backend/oneapi/jit/kernel_generators.hpp
@@ -0,0 +1,115 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <err_oneapi.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <functional>
+#include <memory>
+#include <sstream>
+#include <string>
+
+namespace arrayfire {
+namespace oneapi {
+
+namespace {
+
+/// Creates the string used to declare a buffer parameter of the kernel
+inline void generateParamDeclaration(std::stringstream& kerStream, int id,
+                                     bool is_linear,
+                                     const std::string& m_type_str) {
+    if (is_linear) {
+        kerStream << "__global " << m_type_str << " *in" << id
+                  << ", dim_t iInfo" << id << "_offset, \n";
+    } else {
+        kerStream << "__global " << m_type_str << " *in" << id
+                  << ", KParam iInfo" << id << ", \n";
+    }
+}
+
+/// Calls the setArg function to set the arguments for a kernel call
+template<typename T>
+inline int setBufferKernelArguments(
+    int start_id, bool is_linear,
+    std::function<void(int id, const void* ptr, size_t arg_size,
+                       bool is_buffer)>& setArg,
+    const std::shared_ptr<sycl::buffer<T>>& ptr,
+    const AParam<T, sycl::access_mode::read>& info) {
+    setArg(start_id + 0, static_cast<const void*>(&info),
+           sizeof(AParam<T, sycl::access_mode::read>), true);
+    return start_id + 2;
+}
+
+/// Generates the code to calculate the offsets for a buffer
+inline void generateBufferOffsets(std::stringstream& kerStream, int id,
+                                  bool is_linear, const std::string& type_str) {
+    UNUSED(type_str);
+    std::string idx_str  = std::string("int idx") + std::to_string(id);
+    std::string info_str = std::string("iInfo") + std::to_string(id);
+
+    if (is_linear) {
+        kerStream << idx_str << " = idx + " << info_str << "_offset;\n";
+    } else {
+        kerStream << idx_str << " = (id3 < " << info_str << ".dims[3]) * "
+                  << info_str << ".strides[3] * id3 + (id2 < " << info_str
+                  << ".dims[2]) * " << info_str << ".strides[2] * id2 + (id1 < "
+                  << info_str << ".dims[1]) * " << info_str
+                  << ".strides[1] * id1 + (id0 < " << info_str << ".dims[0]) * "
+                  << info_str << ".strides[0]  * id0 + " << info_str
+                  << ".offset;\n";
+    }
+}
+
+/// Generates the code to read a buffer and store it in a local variable
+inline void generateBufferRead(std::stringstream& kerStream, int id,
+                               const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "[idx" << id
+              << "];\n";
+}
+
+inline void generateShiftNodeOffsets(std::stringstream& kerStream, int id,
+                                     bool is_linear,
+                                     const std::string& type_str) {
+    UNUSED(is_linear);
+    UNUSED(type_str);
+    std::string idx_str   = std::string("idx") + std::to_string(id);
+    std::string info_str  = std::string("iInfo") + std::to_string(id);
+    std::string id_str    = std::string("sh_id_") + std::to_string(id) + "_";
+    std::string shift_str = std::string("shift") + std::to_string(id) + "_";
+
+    for (int i = 0; i < 4; i++) {
+        kerStream << "int " << id_str << i << " = __circular_mod(id" << i
+                  << " + " << shift_str << i << ", " << info_str << ".dims["
+                  << i << "]);\n";
+    }
+
+    kerStream << "int " << idx_str << " = (" << id_str << "3 < " << info_str
+              << ".dims[3]) * " << info_str << ".strides[3] * " << id_str
+              << "3;\n";
+    kerStream << idx_str << " += (" << id_str << "2 < " << info_str
+              << ".dims[2]) * " << info_str << ".strides[2] * " << id_str
+              << "2;\n";
+    kerStream << idx_str << " += (" << id_str << "1 < " << info_str
+              << ".dims[1]) * " << info_str << ".strides[1] * " << id_str
+              << "1;\n";
+    kerStream << idx_str << " += (" << id_str << "0 < " << info_str
+              << ".dims[0]) * " << id_str << "0 + " << info_str << ".offset;\n";
+}
+
+inline void generateShiftNodeRead(std::stringstream& kerStream, int id,
+                                  const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "[idx" << id
+              << "];\n";
+}
+}  // namespace
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/join.cpp b/src/backend/oneapi/join.cpp
new file mode 100644
index 0000000000..a64e6edb9d
--- /dev/null
+++ b/src/backend/oneapi/join.cpp
@@ -0,0 +1,303 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <join.hpp>
+#include <kernel/memcopy.hpp>
+#include <platform.hpp>
+
+#include <algorithm>
+#include <map>
+#include <stdexcept>
+#include <vector>
+
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using std::transform;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+dim4 calcOffset(const dim4 &dims, int dim) {
+    dim4 offset;
+    offset[0] = (dim == 0) ? dims[0] : 0;
+    offset[1] = (dim == 1) ? dims[1] : 0;
+    offset[2] = (dim == 2) ? dims[2] : 0;
+    offset[3] = (dim == 3) ? dims[3] : 0;
+    return offset;
+}
+
+template<typename T>
+Array<T> join(const int jdim, const Array<T> &first, const Array<T> &second) {
+    // All dimensions except join dimension must be equal
+    const dim4 &fdims{first.dims()};
+    const dim4 &sdims{second.dims()};
+
+    // Compute output dims
+    dim4 odims(fdims);
+    odims.dims[jdim] += sdims.dims[jdim];
+    Array<T> out = createEmptyArray<T>(odims);
+
+    // topspeed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  topspeed
+    //      --> size(in) <= L2CacheSize/2
+    // 2 arrays: topspeeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - size(in) < L2CacheSize/2
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    //      - size(in) >= L2CacheSize/2
+    //          --> memcpy will achieve veryLargeArray speed.  The kernel
+    //              will be called twice
+    if (fdims.dims[jdim] == sdims.dims[jdim]) {
+        const size_t L2CacheSize{getL2CacheSize(oneapi::getDevice())};
+        if (!(first.isReady() || second.isReady()) ||
+            (fdims.elements() * sizeof(T) * 2 * 2 < L2CacheSize)) {
+            // Both arrays have the same size & everything fits into the
+            // cache, so process them in 1 JIT kernel instead of individual
+            // copies, which are always slower
+            const dim_t *outStrides{out.strides().dims};
+            vector<Param<T>> outputs{
+                {out.get(),
+                 {{fdims.dims[0], fdims.dims[1], fdims.dims[2], fdims.dims[3]},
+                  {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+                  0}},
+                {out.get(),
+                 {{sdims.dims[0], sdims.dims[1], sdims.dims[2], sdims.dims[3]},
+                  {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+                  fdims.dims[jdim] * outStrides[jdim]}}};
+            // Extend the life of the returned node by saving the
+            // corresponding shared_ptr
+            const Node_ptr fNode{first.getNode()};
+            const Node_ptr sNode{second.getNode()};
+            vector<Node *> nodes{fNode.get(), sNode.get()};
+            evalNodes(outputs, nodes);
+            return out;
+        }
+        // Fall through: processing the arrays individually is faster
+    }
+
+    // Handle each array individually
+    if (first.isReady()) {
+        if (1LL + jdim >= first.ndims() && first.isLinear()) {
+            // first & out are linear
+            auto first_array = first.get();
+            auto out_array = out.get();
+            getQueue().submit([&](sycl::handler &h) {
+                sycl::range sz(first.elements());
+                sycl::id src_offset(first.getOffset());
+                sycl::accessor offset_acc_src =
+                    first_array->template get_access<sycl::access_mode::read>(
+                        h, sz, src_offset);
+                sycl::id dst_offset(0);
+                sycl::accessor offset_acc_dst =
+                    out_array->template get_access<sycl::access_mode::write>(
+                        h, sz, dst_offset);
+                h.copy(offset_acc_src, offset_acc_dst);
+            });
+        } else {
+            kernel::memcopy<T>(out.get(), out.strides().get(), first.get(),
+                               fdims.get(), first.strides().get(),
+                               first.getOffset(), first.ndims());
+        }
+    } else {
+        // Write the result directly in the out array
+        const dim_t *outStrides{out.strides().dims};
+        Param<T> output{
+            out.get(),
+            {{fdims.dims[0], fdims.dims[1], fdims.dims[2], fdims.dims[3]},
+             {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+             0}};
+        evalNodes(output, first.getNode().get());
+    }
+
+    if (second.isReady()) {
+        if (1LL + jdim >= second.ndims() && second.isLinear()) {
+            // second & out are linear
+            auto second_array = second.get();
+            auto out_array = out.get();
+            getQueue().submit([&](sycl::handler &h) {
+                sycl::range sz(second.elements());
+                sycl::id src_offset(second.getOffset());
+                sycl::accessor offset_acc_src =
+                    second_array->template get_access<sycl::access_mode::read>(
+                        h, sz, src_offset);
+                sycl::id dst_offset(fdims.dims[jdim] *
+                                    out.strides().dims[jdim]);
+                sycl::accessor offset_acc_dst =
+                    out_array->template get_access<sycl::access_mode::write>(
+                        h, sz, dst_offset);
+                h.copy(offset_acc_src, offset_acc_dst);
+            });
+        } else {
+            kernel::memcopy<T>(out.get(), out.strides().get(), second.get(),
+                               sdims.get(), second.strides().get(),
+                               second.getOffset(), second.ndims(),
+                               fdims.dims[jdim] * out.strides().dims[jdim]);
+        }
+    } else {
+        // Write the result directly in the out array
+        const dim_t *outStrides{out.strides().dims};
+        Param<T> output{
+            out.get(),
+            {{sdims.dims[0], sdims.dims[1], sdims.dims[2], sdims.dims[3]},
+             {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+             fdims.dims[jdim] * outStrides[jdim]}};
+        evalNodes(output, second.getNode().get());
+    }
+    return out;
+}
+
+template<typename T>
+void join(Array<T> &out, const int jdim, const vector<Array<T>> &inputs) {
+    class eval {
+       public:
+        vector<Param<T>> outputs;
+        vector<Node_ptr> nodePtrs;
+        vector<Node *> nodes;
+        vector<const Array<T> *> ins;
+    };
+    std::map<dim_t, eval> evals;
+    const dim_t *ostrides{out.strides().dims};
+    const size_t L2CacheSize{getL2CacheSize(oneapi::getDevice())};
+
+    // topspeed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  topspeed
+    //      --> size(in) <= L2CacheSize/2
+    // 2 arrays: topspeeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - size(in) < L2CacheSize/2
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    //      - size(in) >= L2CacheSize/2
+    //          --> memcpy will achieve veryLargeArray speed.  The kernel
+    //              will be called twice
+
+    // Group all arrays according to size
+    dim_t outOffset{0};
+    for (const Array<T> &iArray : inputs) {
+        const dim_t *idims{iArray.dims().dims};
+        eval &e{evals[idims[jdim]]};
+        const Param output{
+            out.get(),
+            {{idims[0], idims[1], idims[2], idims[3]},
+             {ostrides[0], ostrides[1], ostrides[2], ostrides[3]},
+             outOffset}};
+        e.outputs.push_back(output);
+        // Extend life of the returned node by saving the corresponding
+        // shared_ptr
+        e.nodePtrs.emplace_back(iArray.getNode());
+        e.nodes.push_back(e.nodePtrs.back().get());
+        e.ins.push_back(&iArray);
+        outOffset += idims[jdim] * ostrides[jdim];
+    }
+
+    for (auto &eval : evals) {
+        auto &s{eval.second};
+        if (s.ins.size() == 1 ||
+            s.ins[0]->elements() * sizeof(T) * 2 * 2 > L2CacheSize) {
+            // Process (evaluate arrays) individually for
+            //  - single small array
+            //  - very large arrays
+            auto nodeIt{begin(s.nodes)};
+            auto outputIt{begin(s.outputs)};
+            for (const Array<T> *in : s.ins) {
+                if (in->isReady()) {
+                    if (1LL + jdim >= in->ndims() && in->isLinear()) {
+                        auto in_array = in->get();
+                        getQueue().submit([&](sycl::handler &h) {
+                            sycl::range sz(in->elements());
+                            sycl::id src_offset(in->getOffset());
+                            sycl::accessor offset_acc_src =
+                                in_array
+                                    ->template get_access<
+                                        sycl::access_mode::read>(h, sz,
+                                                                 src_offset);
+                            sycl::id dst_offset(outputIt->info.offset);
+                            sycl::accessor offset_acc_dst =
+                                outputIt->data->template get_access<
+                                    sycl::access_mode::write>(h, sz,
+                                                              dst_offset);
+                            h.copy(offset_acc_src, offset_acc_dst);
+                        });
+                    } else {
+                        kernel::memcopy<T>(
+                            outputIt->data,
+                            af::dim4(4, outputIt->info.strides).get(),
+                            in->get(), in->dims().get(), in->strides().get(),
+                            in->getOffset(), in->ndims(),
+                            outputIt->info.offset);
+                    }
+                    // eliminate this array from the list, so that it will
+                    // not be processed in bulk via JIT
+                    outputIt = s.outputs.erase(outputIt);
+                    nodeIt   = s.nodes.erase(nodeIt);
+                } else {
+                    ++outputIt;
+                    ++nodeIt;
+                }
+            }
+        }
+        evalNodes(s.outputs, s.nodes);
+    }
+}
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> join<T>(const int dim, const Array<T> &first, \
+                              const Array<T> &second);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#undef INSTANTIATE
+
+#define INSTANTIATE(T)                                   \
+    template void join<T>(Array<T> & out, const int dim, \
+                          const vector<Array<T>> &inputs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#undef INSTANTIATE
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/join.hpp b/src/backend/oneapi/join.hpp
new file mode 100644
index 0000000000..818047cae2
--- /dev/null
+++ b/src/backend/oneapi/join.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> join(const int dim, const Array<T> &first, const Array<T> &second);
+
+template<typename T>
+void join(Array<T> &out, const int dim, const std::vector<Array<T>> &inputs);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/KParam.hpp b/src/backend/oneapi/kernel/KParam.hpp
new file mode 100644
index 0000000000..c1cf30be4b
--- /dev/null
+++ b/src/backend/oneapi/kernel/KParam.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifndef __KPARAM_H
+#define __KPARAM_H
+
+// #ifndef __OPENCL_VERSION__
+//  Only define dim_t in host code. dim_t is defined when setting the program
+//  options in program.cpp
+#include <af/defines.h>
+// #endif
+
+// Defines the size and shape of the data in the OpenCL buffer
+typedef struct {
+    dim_t dims[4];
+    dim_t strides[4];
+    dim_t offset;
+} KParam;
+
+#endif
diff --git a/src/backend/oneapi/kernel/accessors.hpp b/src/backend/oneapi/kernel/accessors.hpp
new file mode 100644
index 0000000000..902f48b0e0
--- /dev/null
+++ b/src/backend/oneapi/kernel/accessors.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <sycl/sycl.hpp>
+
+template<typename T>
+using read_accessor = sycl::accessor<T, 1, sycl::access::mode::read>;
+
+template<typename T>
+using write_accessor = sycl::accessor<T, 1, sycl::access::mode::write>;
diff --git a/src/backend/oneapi/kernel/approx1.hpp b/src/backend/oneapi/kernel/approx1.hpp
new file mode 100644
index 0000000000..ed2290ffc9
--- /dev/null
+++ b/src/backend/oneapi/kernel/approx1.hpp
@@ -0,0 +1,157 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/interp.hpp>
+#include <traits.hpp>
+#include <af/constants.h>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr int TILE_DIM  = 32;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = 256 / TILE_DIM;
+
+template<typename Ty, typename Tp, int order>
+class approx1Kernel {
+   public:
+    approx1Kernel(write_accessor<Ty> d_yo, const KParam yoInfo,
+                  read_accessor<Ty> d_yi, const KParam yiInfo,
+                  read_accessor<Tp> d_xo, const KParam xoInfo, const Tp xi_beg,
+                  const Tp xi_step_reproc, const Ty offGrid,
+                  const int blocksMatX, const af_interp_type method,
+                  const bool batch, const int XDIM)
+        : d_yo_(d_yo)
+        , yoInfo_(yoInfo)
+        , d_yi_(d_yi)
+        , yiInfo_(yiInfo)
+        , d_xo_(d_xo)
+        , xoInfo_(xoInfo)
+        , xi_beg_(xi_beg)
+        , xi_step_reproc_(xi_step_reproc)
+        , offGrid_(offGrid)
+        , blocksMatX_(blocksMatX)
+        , method_(method)
+        , batch_(batch)
+        , XDIM_(XDIM) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int idw = g.get_group_id(1) / yoInfo_.dims[2];
+        const int idz = g.get_group_id(1) - idw * yoInfo_.dims[2];
+
+        const int idy        = g.get_group_id(0) / blocksMatX_;
+        const int blockIdx_x = g.get_group_id(0) - idy * blocksMatX_;
+        const int idx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+
+        if (idx >= yoInfo_.dims[0] || idy >= yoInfo_.dims[1] ||
+            idz >= yoInfo_.dims[2] || idw >= yoInfo_.dims[3])
+            return;
+
+        // FIXME: Only cubic interpolation is doing clamping
+        // We need to make it consistent across all methods
+        // Not changing the behavior because tests will fail
+        const bool doclamp = order == 3;
+
+        bool is_off[] = {xoInfo_.dims[0] > 1, xoInfo_.dims[1] > 1,
+                         xoInfo_.dims[2] > 1, xoInfo_.dims[3] > 1};
+
+        const int yo_idx = idw * yoInfo_.strides[3] + idz * yoInfo_.strides[2] +
+                           idy * yoInfo_.strides[1] + idx + yoInfo_.offset;
+
+        int xo_idx = idx * is_off[0] + xoInfo_.offset;
+        if (batch_) {
+            xo_idx += idw * xoInfo_.strides[3] * is_off[3];
+            xo_idx += idz * xoInfo_.strides[2] * is_off[2];
+            xo_idx += idy * xoInfo_.strides[1] * is_off[1];
+        }
+
+        const Tp x = (d_xo_[xo_idx] - xi_beg_) * xi_step_reproc_;
+
+#pragma unroll
+        for (int flagIdx = 0; flagIdx < 4; ++flagIdx) {
+            is_off[flagIdx] = true;
+        }
+        is_off[XDIM_] = false;
+
+        if (x < 0 || yiInfo_.dims[XDIM_] < x + 1) {
+            d_yo_[yo_idx] = offGrid_;
+            return;
+        }
+
+        int yi_idx = idx * is_off[0] + yiInfo_.offset;
+        yi_idx += idw * yiInfo_.strides[3] * is_off[3];
+        yi_idx += idz * yiInfo_.strides[2] * is_off[2];
+        yi_idx += idy * yiInfo_.strides[1] * is_off[1];
+
+        Interp1<Ty, Tp, order> interp;
+        interp(d_yo_, yoInfo_, yo_idx, d_yi_, yiInfo_, yi_idx, x, XDIM_,
+               method_, 1, doclamp);
+    }
+
+   protected:
+    write_accessor<Ty> d_yo_;
+    const KParam yoInfo_;
+    read_accessor<Ty> d_yi_;
+    const KParam yiInfo_;
+    read_accessor<Tp> d_xo_;
+    const KParam xoInfo_;
+    const Tp xi_beg_;
+    const Tp xi_step_reproc_;
+    const Ty offGrid_;
+    const int blocksMatX_;
+    const af_interp_type method_;
+    const bool batch_;
+    const int XDIM_;
+};
+
+template<typename Ty, typename Tp, int order>
+void approx1(Param<Ty> yo, const Param<Ty> yi, const Param<Tp> xo,
+             const int xdim, const Tp xi_beg, const Tp xi_step,
+             const float offGrid, const af_interp_type method) {
+    constexpr int THREADS = 256;
+
+    auto local        = sycl::range{THREADS, 1};
+    uint blocksPerMat = divup(yo.info.dims[0], local[0]);
+    auto global       = sycl::range{blocksPerMat * local[0] * yo.info.dims[1],
+                              yo.info.dims[2] * yo.info.dims[3] * local[1]};
+
+    bool batch =
+        !(xo.info.dims[1] == 1 && xo.info.dims[2] == 1 && xo.info.dims[3] == 1);
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<Ty> yoAcc{*yo.data, h};
+        read_accessor<Ty> yiAcc{*yi.data, h};
+        read_accessor<Tp> xoAcc{*xo.data, h};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       approx1Kernel<Ty, Tp, order>(
+                           yoAcc, yo.info, yiAcc, yi.info, xoAcc, xo.info,
+                           xi_beg, Tp(1) / xi_step, (Ty)offGrid,
+                           (uint)blocksPerMat, method, batch, xdim));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/approx2.hpp b/src/backend/oneapi/kernel/approx2.hpp
new file mode 100644
index 0000000000..c173b527b1
--- /dev/null
+++ b/src/backend/oneapi/kernel/approx2.hpp
@@ -0,0 +1,187 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/interp.hpp>
+#include <traits.hpp>
+#include <af/constants.h>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr int TILE_DIM  = 32;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = 256 / TILE_DIM;
+
+template<typename Ty, typename Tp, int order>
+class approx2Kernel {
+   public:
+    approx2Kernel(write_accessor<Ty> d_zo, const KParam zo,
+                  read_accessor<Ty> d_zi, const KParam zi,
+                  read_accessor<Tp> d_xo, const KParam xo,
+                  read_accessor<Tp> d_yo, const KParam yo, const Tp xi_beg,
+                  const Tp xi_step_reproc, const Tp yi_beg,
+                  const Tp yi_step_reproc, const Ty offGrid,
+                  const int blocksMatX, const int blocksMatY, const bool batch,
+                  const af_interp_type method, const int XDIM, const int YDIM)
+        : d_zo_(d_zo)
+        , zoInfo_(zo)
+        , d_zi_(d_zi)
+        , ziInfo_(zi)
+        , d_xo_(d_xo)
+        , xoInfo_(xo)
+        , d_yo_(d_yo)
+        , yoInfo_(yo)
+        , xi_beg_(xi_beg)
+        , xi_step_reproc_(xi_step_reproc)
+        , yi_beg_(yi_beg)
+        , yi_step_reproc_(yi_step_reproc)
+        , offGrid_(offGrid)
+        , blocksMatX_(blocksMatX)
+        , blocksMatY_(blocksMatY)
+        , batch_(batch)
+        , method_(method)
+        , XDIM_(XDIM)
+        , YDIM_(YDIM) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int idz = g.get_group_id(0) / blocksMatX_;
+        const int idw = g.get_group_id(1) / blocksMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - idz * blocksMatX_;
+        const int blockIdx_y = g.get_group_id(1) - idw * blocksMatY_;
+
+        const int idx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int idy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        if (idx >= zoInfo_.dims[0] || idy >= zoInfo_.dims[1] ||
+            idz >= zoInfo_.dims[2] || idw >= zoInfo_.dims[3])
+            return;
+
+        // FIXME: Only cubic interpolation is doing clamping
+        // We need to make it consistent across all methods
+        // Not changing the behavior because tests will fail
+        const bool doclamp = order == 3;
+
+        bool is_off[] = {xoInfo_.dims[0] > 1, xoInfo_.dims[1] > 1,
+                         xoInfo_.dims[2] > 1, xoInfo_.dims[3] > 1};
+
+        const int zo_idx = idw * zoInfo_.strides[3] + idz * zoInfo_.strides[2] +
+                           idy * zoInfo_.strides[1] + idx + zoInfo_.offset;
+        int xo_idx = idy * xoInfo_.strides[1] * is_off[1] + idx * is_off[0] +
+                     xoInfo_.offset;
+
+        int yo_idx = idy * yoInfo_.strides[1] * is_off[1] + idx * is_off[0] +
+                     yoInfo_.offset;
+        if (batch_) {
+            xo_idx += idw * xoInfo_.strides[3] * is_off[3] +
+                      idz * xoInfo_.strides[2] * is_off[2];
+            yo_idx += idw * yoInfo_.strides[3] * is_off[3] +
+                      idz * yoInfo_.strides[2] * is_off[2];
+        }
+
+#pragma unroll
+        for (int flagIdx = 0; flagIdx < 4; ++flagIdx) {
+            is_off[flagIdx] = true;
+        }
+        is_off[XDIM_] = false;
+        is_off[YDIM_] = false;
+
+        const Tp x = (d_xo_[xo_idx] - xi_beg_) * xi_step_reproc_;
+        const Tp y = (d_yo_[yo_idx] - yi_beg_) * yi_step_reproc_;
+
+        if (x < 0 || y < 0 || ziInfo_.dims[XDIM_] < x + 1 ||
+            ziInfo_.dims[YDIM_] < y + 1) {
+            d_zo_[zo_idx] = offGrid_;
+            return;
+        }
+
+        int zi_idx = idy * ziInfo_.strides[1] * is_off[1] + idx * is_off[0] +
+                     ziInfo_.offset;
+        zi_idx += idw * ziInfo_.strides[3] * is_off[3] +
+                  idz * ziInfo_.strides[2] * is_off[2];
+
+        Interp2<Ty, Tp, order> interp;
+        interp(d_zo_, zoInfo_, zo_idx, d_zi_, ziInfo_, zi_idx, x, y, XDIM_,
+               YDIM_, method_, 1, doclamp);
+    }
+
+   protected:
+    write_accessor<Ty> d_zo_;
+    const KParam zoInfo_;
+    read_accessor<Ty> d_zi_;
+    const KParam ziInfo_;
+    read_accessor<Tp> d_xo_;
+    const KParam xoInfo_;
+    read_accessor<Tp> d_yo_;
+    const KParam yoInfo_;
+    const Tp xi_beg_;
+    const Tp xi_step_reproc_;
+    const Tp yi_beg_;
+    const Tp yi_step_reproc_;
+    const Ty offGrid_;
+    const int blocksMatX_;
+    const int blocksMatY_;
+    const bool batch_;
+    const af_interp_type method_;
+    const int XDIM_;
+    const int YDIM_;
+};
+
+template<typename Ty, typename Tp, int order>
+void approx2(Param<Ty> zo, const Param<Ty> zi, const Param<Tp> xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Param<Tp> yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const float offGrid,
+             const af_interp_type method) {
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+
+    auto local          = sycl::range{TX, TY};
+    dim_t blocksPerMatX = divup(zo.info.dims[0], local[0]);
+    dim_t blocksPerMatY = divup(zo.info.dims[1], local[1]);
+    auto global = sycl::range{blocksPerMatX * local[0] * zo.info.dims[2],
+                              blocksPerMatY * local[1] * zo.info.dims[3]};
+
+    // xo is batched when it extends along dim 2 or dim 3
+    bool batch = !(xo.info.dims[2] == 1 && xo.info.dims[3] == 1);
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<Ty> zoAcc{*zo.data, h};
+        read_accessor<Ty> ziAcc{*zi.data, h};
+        read_accessor<Tp> xoAcc{*xo.data, h};
+        read_accessor<Tp> yoAcc{*yo.data, h};
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            approx2Kernel<Ty, Tp, order>(
+                zoAcc, zo.info, ziAcc, zi.info, xoAcc, xo.info, yoAcc, yo.info,
+                xi_beg, Tp(1) / xi_step, yi_beg, Tp(1) / yi_step, (Ty)offGrid,
+                static_cast<int>(blocksPerMatX),
+                static_cast<int>(blocksPerMatY), batch, method, xdim, ydim));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/assign.hpp b/src/backend/oneapi/kernel/assign.hpp
new file mode 100644
index 0000000000..1b69827d18
--- /dev/null
+++ b/src/backend/oneapi/kernel/assign.hpp
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/assign_kernel_param.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+static int trimIndex(int idx, const int len) {
+    int ret_val = idx;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
+    }
+    return ret_val;
+}
+
+template<typename T>
+class assignKernel {
+   public:
+    assignKernel(write_accessor<T> out, KParam oInfo, read_accessor<T> in,
+                 KParam iInfo, AssignKernelParam p, const int nBBS0,
+                 const int nBBS1)
+        : out_(out)
+        , in_(in)
+        , oInfo_(oInfo)
+        , iInfo_(iInfo)
+        , p_(p)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        // retrieve booleans that tell us which index to use
+        const bool s0 = p_.isSeq[0];
+        const bool s1 = p_.isSeq[1];
+        const bool s2 = p_.isSeq[2];
+        const bool s3 = p_.isSeq[3];
+
+        sycl::group g = it.get_group();
+        const int gz  = g.get_group_id(0) / nBBS0_;
+        const int gw  = g.get_group_id(1) / nBBS1_;
+        const int gx =
+            g.get_local_range(0) * (g.get_group_id(0) - gz * nBBS0_) +
+            it.get_local_id(0);
+        const int gy =
+            g.get_local_range(1) * (g.get_group_id(1) - gw * nBBS1_) +
+            it.get_local_id(1);
+
+        size_t idims0 = iInfo_.dims[0];
+        size_t idims1 = iInfo_.dims[1];
+        size_t idims2 = iInfo_.dims[2];
+        size_t idims3 = iInfo_.dims[3];
+
+        if (gx < idims0 && gy < idims1 && gz < idims2 && gw < idims3) {
+            // calculate pointer offsets for output
+            int i =
+                p_.strds[0] *
+                trimIndex(s0 ? gx + p_.offs[0] : p_.ptr[0][gx], oInfo_.dims[0]);
+            int j =
+                p_.strds[1] *
+                trimIndex(s1 ? gy + p_.offs[1] : p_.ptr[1][gy], oInfo_.dims[1]);
+            int k =
+                p_.strds[2] *
+                trimIndex(s2 ? gz + p_.offs[2] : p_.ptr[2][gz], oInfo_.dims[2]);
+            int l =
+                p_.strds[3] *
+                trimIndex(s3 ? gw + p_.offs[3] : p_.ptr[3][gw], oInfo_.dims[3]);
+
+            const T* iptr = in_.get_pointer();
+            // offset input and output pointers
+            const T* src =
+                iptr + (gx * iInfo_.strides[0] + gy * iInfo_.strides[1] +
+                        gz * iInfo_.strides[2] + gw * iInfo_.strides[3] +
+                        iInfo_.offset);
+
+            T* optr = out_.get_pointer();
+            T* dst  = optr + (i + j + k + l) + oInfo_.offset;
+            // set the output
+            dst[0] = src[0];
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    read_accessor<T> in_;
+    KParam oInfo_, iInfo_;
+    AssignKernelParam p_;
+    const int nBBS0_, nBBS1_;
+};
+
+template<typename T>
+void assign(Param<T> out, const Param<T> in, const AssignKernelParam& p,
+            sycl::buffer<uint>* bPtr[4]) {
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
+    using sycl::access_mode;
+
+    sycl::range<2> local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    sycl::range<2> global(blk_x * in.info.dims[2] * THREADS_X,
+                          blk_y * in.info.dims[3] * THREADS_Y);
+
+    getQueue().submit([&](sycl::handler& h) {
+        auto pp = p;
+        write_accessor<T> out_acc{*out.data, h};
+        read_accessor<T> in_acc{*in.data, h};
+
+        pp.ptr[0] = bPtr[0]->template get_access<access_mode::read>(h);
+        pp.ptr[1] = bPtr[1]->template get_access<access_mode::read>(h);
+        pp.ptr[2] = bPtr[2]->template get_access<access_mode::read>(h);
+        pp.ptr[3] = bPtr[3]->template get_access<access_mode::read>(h);
+
+        h.parallel_for(sycl::nd_range<2>(global, local),
+                       assignKernel<T>(out_acc, out.info, in_acc, in.info, pp,
+                                       blk_x, blk_y));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/assign_kernel_param.hpp b/src/backend/oneapi/kernel/assign_kernel_param.hpp
new file mode 100644
index 0000000000..e2eec56d18
--- /dev/null
+++ b/src/backend/oneapi/kernel/assign_kernel_param.hpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <sycl/sycl.hpp>
+
+#include <array>
+
+namespace arrayfire {
+namespace oneapi {
+
+typedef struct {
+    int offs[4];
+    int strds[4];
+    int steps[4];
+    bool isSeq[4];
+    std::array<sycl::accessor<unsigned int, 1, sycl::access::mode::read,
+                              sycl::access::target::device>,
+               4>
+        ptr;
+
+} AssignKernelParam;
+
+using IndexKernelParam = AssignKernelParam;
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/bilateral.hpp b/src/backend/oneapi/kernel/bilateral.hpp
new file mode 100644
index 0000000000..210c92e911
--- /dev/null
+++ b/src/backend/oneapi/kernel/bilateral.hpp
@@ -0,0 +1,218 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename outType, bool USE_NATIVE_EXP>
+auto exp_native_nonnative(float in) {
+    if constexpr (USE_NATIVE_EXP)
+        return sycl::native::exp(in);
+    else
+        return sycl::exp(in);
+}
+
+template<typename outType, typename inType, bool USE_NATIVE_EXP>
+class bilateralKernel {
+   public:
+    bilateralKernel(write_accessor<outType> d_dst, KParam oInfo,
+                    read_accessor<inType> d_src, KParam iInfo,
+                    sycl::local_accessor<outType, 1> localMem,
+                    sycl::local_accessor<outType, 1> gauss2d, float sigma_space,
+                    float sigma_color, int gaussOff, int nBBS0, int nBBS1)
+        : d_dst_(d_dst)
+        , oInfo_(oInfo)
+        , d_src_(d_src)
+        , iInfo_(iInfo)
+        , localMem_(localMem)
+        , gauss2d_(gauss2d)
+        , sigma_space_(sigma_space)
+        , sigma_color_(sigma_color)
+        , gaussOff_(gaussOff)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g              = it.get_group();
+        const int radius           = sycl::max((int)(sigma_space_ * 1.5f), 1);
+        const int padding          = 2 * radius;
+        const int window_size      = padding + 1;
+        const int shrdLen          = g.get_local_range(0) + padding;
+        const float variance_range = sigma_color_ * sigma_color_;
+        const float variance_space = sigma_space_ * sigma_space_;
+        const float variance_space_neg2     = -2.0 * variance_space;
+        const float inv_variance_range_neg2 = -0.5 / (variance_range);
+
+        // gfor batch offsets
+        unsigned b2 = g.get_group_id(0) / nBBS0_;
+        unsigned b3 = g.get_group_id(1) / nBBS1_;
+
+        const inType* in =
+            d_src_.get_pointer() +
+            (b2 * iInfo_.strides[2] + b3 * iInfo_.strides[3] + iInfo_.offset);
+        outType* out = d_dst_.get_pointer() +
+                       (b2 * oInfo_.strides[2] + b3 * oInfo_.strides[3]);
+
+        int lx = it.get_local_id(0);
+        int ly = it.get_local_id(1);
+
+        const int gx =
+            g.get_local_range(0) * (g.get_group_id(0) - b2 * nBBS0_) + lx;
+        const int gy =
+            g.get_local_range(1) * (g.get_group_id(1) - b3 * nBBS1_) + ly;
+
+        // generate gauss2d_ spatial variance values for block
+        if (lx < window_size && ly < window_size) {
+            int x = lx - radius;
+            int y = ly - radius;
+            gauss2d_[ly * window_size + lx] =
+                exp_native_nonnative<outType, USE_NATIVE_EXP>(
+                    ((x * x) + (y * y)) / variance_space_neg2);
+        }
+
+        int s0 = iInfo_.strides[0];
+        int s1 = iInfo_.strides[1];
+        int d0 = iInfo_.dims[0];
+        int d1 = iInfo_.dims[1];
+        // pull image to local memory
+        for (int b = ly, gy2 = gy; b < shrdLen;
+             b += g.get_local_range(1), gy2 += g.get_local_range(1)) {
+            // move row_set g.get_local_range(1) along columns
+            for (int a = lx, gx2 = gx; a < shrdLen;
+                 a += g.get_local_range(0), gx2 += g.get_local_range(0)) {
+                load2LocalMem(localMem_, in, a, b, shrdLen, d0, d1,
+                              gx2 - radius, gy2 - radius, s1, s0);
+            }
+        }
+
+        it.barrier();
+
+        if (gx < iInfo_.dims[0] && gy < iInfo_.dims[1]) {
+            lx += radius;
+            ly += radius;
+            outType center_color = localMem_[ly * shrdLen + lx];
+            outType res          = 0;
+            outType norm         = 0;
+
+            int joff = (ly - radius) * shrdLen + (lx - radius);
+            int goff = 0;
+
+            for (int wj = 0; wj < window_size; ++wj) {
+                for (int wi = 0; wi < window_size; ++wi) {
+                    outType tmp_color = localMem_[joff + wi];
+                    const outType c   = center_color - tmp_color;
+                    outType gauss_range =
+                        exp_native_nonnative<outType, USE_NATIVE_EXP>(
+                            c * c * inv_variance_range_neg2);
+                    outType weight = gauss2d_[goff + wi] * gauss_range;
+                    norm += weight;
+                    res += tmp_color * weight;
+                }
+                joff += shrdLen;
+                goff += window_size;
+            }
+            out[gy * oInfo_.strides[1] + gx] = res / norm;
+        }
+    }
+
+    int lIdx(int x, int y, int stride1, int stride0) const {
+        return (y * stride1 + x * stride0);
+    }
+
+    template<class T>
+    constexpr const T& clamp0(const T& v, const T& lo, const T& hi) const {
+        return (v < lo) ? lo : (hi < v) ? hi : v;
+    }
+
+    void load2LocalMem(sycl::local_accessor<outType, 1> shrd, const inType* in,
+                       int lx, int ly, int shrdStride, int dim0, int dim1,
+                       int gx, int gy, int inStride1, int inStride0) const {
+        int gx_ = sycl::clamp(gx, 0, dim0 - 1);
+        int gy_ = sycl::clamp(gy, 0, dim1 - 1);
+        shrd[lIdx(lx, ly, shrdStride, 1)] =
+            (outType)in[lIdx(gx_, gy_, inStride1, inStride0)];
+    }
+
+   private:
+    write_accessor<outType> d_dst_;
+    KParam oInfo_;
+    read_accessor<inType> d_src_;
+    KParam iInfo_;
+    sycl::local_accessor<outType, 1> localMem_;
+    sycl::local_accessor<outType, 1> gauss2d_;
+    float sigma_space_;
+    float sigma_color_;
+    int gaussOff_;
+    int nBBS0_;
+    int nBBS1_;
+};
+
+template<typename inType, typename outType>
+void bilateral(Param<outType> out, const Param<inType> in, const float s_sigma,
+               const float c_sigma) {
+    constexpr int THREADS_X     = 16;
+    constexpr int THREADS_Y     = 16;
+    constexpr bool UseNativeExp = !std::is_same<inType, double>::value ||
+                                  std::is_same<inType, cdouble>::value;
+
+    auto local = sycl::range{THREADS_X, THREADS_Y};
+
+    const int blk_x = divup(in.info.dims[0], THREADS_X);
+    const int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    auto global = sycl::range{(size_t)(blk_x * in.info.dims[2] * THREADS_X),
+                              (size_t)(blk_y * in.info.dims[3] * THREADS_Y)};
+
+    // calculate local memory size
+    int radius          = (int)std::max(s_sigma * 1.5f, 1.f);
+    int num_shrd_elems  = (THREADS_X + 2 * radius) * (THREADS_Y + 2 * radius);
+    int num_gauss_elems = (2 * radius + 1) * (2 * radius + 1);
+    size_t localMemSize = (num_shrd_elems + num_gauss_elems) * sizeof(outType);
+    size_t MaxLocalSize =
+        getQueue().get_device().get_info<sycl::info::device::local_mem_size>();
+    if (localMemSize > MaxLocalSize) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nOneAPI Bilateral filter doesn't support %f spatial sigma\n",
+                 s_sigma);
+        ONEAPI_NOT_SUPPORTED(errMessage);
+    }
+
+    getQueue().submit([&](sycl::handler& h) {
+        read_accessor<inType> inAcc{*in.data, h};
+        write_accessor<outType> outAcc{*out.data, h};
+
+        auto localMem = sycl::local_accessor<outType, 1>(num_shrd_elems, h);
+        auto gauss2d  = sycl::local_accessor<outType, 1>(num_gauss_elems, h);
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       bilateralKernel<outType, inType, UseNativeExp>(
+                           outAcc, out.info, inAcc, in.info, localMem, gauss2d,
+                           s_sigma, c_sigma, num_shrd_elems, blk_x, blk_y));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/convolve.hpp b/src/backend/oneapi/kernel/convolve.hpp
new file mode 100644
index 0000000000..ebec7dbe88
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve.hpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <af/defines.h>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+// The MAX_*_LEN limits below are calculated assuming a maximum
+// shared (local) memory configuration of 48KB per block,
+// taking complex types into account as well
+constexpr int MAX_CONV1_FILTER_LEN = 129;
+constexpr int MAX_CONV2_FILTER_LEN = 17;
+constexpr int MAX_CONV3_FILTER_LEN = 5;
+
+constexpr int MAX_SCONV_FILTER_LEN = 31;
+
+constexpr int THREADS   = 256;
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
+constexpr int CUBE_X    = 8;
+constexpr int CUBE_Y    = 8;
+constexpr int CUBE_Z    = 4;
+
+template<typename aT>
+struct conv_kparam_t {
+    sycl::range<3> global{0, 0, 0};
+    sycl::range<3> local{0, 0, 0};
+    size_t loc_size;
+    int nBBS0;
+    int nBBS1;
+    bool outHasNoOffset;
+    bool inHasNoOffset;
+    bool launchMoreBlocks;
+    int o[3];
+    int s[3];
+    sycl::buffer<aT> *impulse;
+};
+
+template<typename T>
+T binOp(T lhs, T rhs) {
+    return lhs * rhs;
+}
+
+template<typename aT>
+void prepareKernelArgs(conv_kparam_t<aT> &param, dim_t *oDims,
+                       const dim_t *fDims, const int rank) {
+    using sycl::range;
+
+    int batchDims[4] = {1, 1, 1, 1};
+    for (int i = rank; i < 4; ++i) {
+        batchDims[i] = (param.launchMoreBlocks ? 1 : oDims[i]);
+    }
+
+    if (rank == 1) {
+        param.local    = range<3>{THREADS, 1, 1};
+        param.nBBS0    = divup(oDims[0], THREADS);
+        param.nBBS1    = batchDims[2];
+        param.global   = range<3>(param.nBBS0 * THREADS * batchDims[1],
+                                param.nBBS1 * batchDims[3], 1);
+        param.loc_size = (THREADS + 2 * (fDims[0] - 1));
+    } else if (rank == 2) {
+        param.local  = range<3>{THREADS_X, THREADS_Y, 1};
+        param.nBBS0  = divup(oDims[0], THREADS_X);
+        param.nBBS1  = divup(oDims[1], THREADS_Y);
+        param.global = range<3>(param.nBBS0 * THREADS_X * batchDims[2],
+                                param.nBBS1 * THREADS_Y * batchDims[3], 1);
+    } else if (rank == 3) {
+        param.local    = range<3>{CUBE_X, CUBE_Y, CUBE_Z};
+        param.nBBS0    = divup(oDims[0], CUBE_X);
+        param.nBBS1    = divup(oDims[1], CUBE_Y);
+        int blk_z      = divup(oDims[2], CUBE_Z);
+        param.global   = range<3>(param.nBBS0 * CUBE_X * batchDims[3],
+                                param.nBBS1 * CUBE_Y, blk_z * CUBE_Z);
+        param.loc_size = (CUBE_X + 2 * (fDims[0] - 1)) *
+                         (CUBE_Y + 2 * (fDims[1] - 1)) *
+                         (CUBE_Z + 2 * (fDims[2] - 1));
+    }
+}
+
+template<typename T>
+void memcpyBuffer(sycl::buffer<T, 1> &dest, sycl::buffer<T, 1> &src,
+                  const size_t n, const size_t srcOffset) {
+    getQueue().submit([&](auto &h) {
+        sycl::accessor srcAcc{src, h, sycl::range{n}, sycl::id{srcOffset},
+                              sycl::read_only};
+        sycl::accessor destAcc{
+            dest,         h, sycl::range{n}, sycl::id{0}, sycl::write_only,
+            sycl::no_init};
+        h.copy(srcAcc, destAcc);
+    });
+}
+
+#include "convolve1.hpp"
+#include "convolve2.hpp"
+#include "convolve3.hpp"
+
+template<typename T, typename aT>
+void convolve_nd(Param<T> out, const Param<T> signal, const Param<aT> filter,
+                 AF_BATCH_KIND kind, const int rank, const bool expand) {
+    conv_kparam_t<aT> param;
+
+    for (int i = 0; i < 3; ++i) {
+        param.o[i] = 0;
+        param.s[i] = 0;
+    }
+    param.launchMoreBlocks = kind == AF_BATCH_SAME || kind == AF_BATCH_RHS;
+    param.outHasNoOffset   = kind == AF_BATCH_LHS || kind == AF_BATCH_NONE;
+    param.inHasNoOffset    = kind != AF_BATCH_SAME;
+
+    prepareKernelArgs<aT>(param, out.info.dims, filter.info.dims, rank);
+
+    switch (rank) {
+        case 1: conv1<T, aT>(param, out, signal, filter, expand); break;
+        case 2: conv2<T, aT>(param, out, signal, filter, expand); break;
+        case 3: conv3<T, aT>(param, out, signal, filter, expand); break;
+    }
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/convolve1.hpp b/src/backend/oneapi/kernel/convolve1.hpp
new file mode 100644
index 0000000000..41c6facae6
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve1.hpp
@@ -0,0 +1,183 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+template<typename T, typename aT>
+class conv1HelperCreateKernel {
+   public:
+    conv1HelperCreateKernel(write_accessor<T> out, KParam oInfo,
+                            read_accessor<T> signal, KParam sInfo,
+                            sycl::local_accessor<aT> localMem,
+                            read_accessor<aT> impulse, KParam fInfo, int nBBS0,
+                            int nBBS1, int ostep1, int ostep2, int ostep3,
+                            int sstep1, int sstep2, int sstep3,
+                            const bool expand)
+        : out_(out)
+        , oInfo_(oInfo)
+        , signal_(signal)
+        , sInfo_(sInfo)
+        , localMem_(localMem)
+        , impulse_(impulse)
+        , fInfo_(fInfo)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1)
+        , ostep1_(ostep1)
+        , ostep2_(ostep2)
+        , ostep3_(ostep3)
+        , sstep1_(sstep1)
+        , sstep2_(sstep2)
+        , sstep3_(sstep3)
+        , expand_(expand) {}
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g = it.get_group();
+
+        int fLen          = fInfo_.dims[0];
+        int padding       = fLen - 1;
+        int shrdLen       = g.get_local_range(0) + 2 * padding;
+        const unsigned b1 = g.get_group_id(0) / nBBS0_;
+        const unsigned b0 = g.get_group_id(0) - nBBS0_ * b1;
+        const unsigned b3 = g.get_group_id(1) / nBBS1_;
+        const unsigned b2 = g.get_group_id(1) - nBBS1_ * b3;
+
+        T *dst =
+            out_.get_pointer() +
+            (b1 * oInfo_.strides[1] + /* activated with batched input signal_ */
+             ostep1_ *
+                 oInfo_.strides[1] +  /* activated with batched input filter */
+             b2 * oInfo_.strides[2] + /* activated with batched input signal_ */
+             ostep2_ *
+                 oInfo_.strides[2] +  /* activated with batched input filter */
+             b3 * oInfo_.strides[3] + /* activated with batched input signal_ */
+             ostep3_ *
+                 oInfo_.strides[3]); /* activated with batched input filter */
+
+        T const *src =
+            signal_.get_pointer() + sInfo_.offset +
+            (b1 * sInfo_.strides[1] + /* activated with batched input signal_ */
+             sstep1_ *
+                 sInfo_.strides[1] +  /* activated with batched input filter */
+             b2 * sInfo_.strides[2] + /* activated with batched input signal_ */
+             sstep2_ *
+                 sInfo_.strides[2] +  /* activated with batched input filter */
+             b3 * sInfo_.strides[3] + /* activated with batched input signal_ */
+             sstep3_ *
+                 sInfo_.strides[3]); /* activated with batched input filter */
+
+        int gx = g.get_local_range(0) * b0;
+
+        for (int i = it.get_local_id(0); i < shrdLen;
+             i += g.get_local_range(0)) {
+            int idx      = gx - padding + i;
+            localMem_[i] = (idx >= 0 && idx < sInfo_.dims[0])
+                               ? src[idx * sInfo_.strides[0]]
+                               : (T)(0);
+        }
+        it.barrier();
+        gx += it.get_local_id(0);
+
+        if (gx >= 0 && gx < oInfo_.dims[0]) {
+            int lx   = it.get_local_id(0) + padding + (expand_ ? 0 : fLen >> 1);
+            aT accum = (aT)(0);
+            for (int f = 0; f < fLen; ++f) {
+                // binOp will do MUL_OP for convolution operation
+                accum = accum + binOp((aT)localMem_[lx - f], (aT)impulse_[f]);
+            }
+            dst[gx] = (T)accum;
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    read_accessor<T> signal_;
+    KParam sInfo_;
+    sycl::local_accessor<aT> localMem_;
+    read_accessor<aT> impulse_;
+    KParam fInfo_;
+    int nBBS0_;
+    int nBBS1_;
+    int ostep1_;
+    int ostep2_;
+    int ostep3_;
+    int sstep1_;
+    int sstep2_;
+    int sstep3_;
+    const bool expand_;
+};
+
+template<typename T, typename aT>
+void conv1Helper(const conv_kparam_t<aT> &param, Param<T> &out,
+                 const Param<T> &signal, const Param<aT> &filter,
+                 const int rank, const bool expand) {
+    auto Q = getQueue();
+    Q.submit([&](auto &h) {
+        sycl::local_accessor<aT> localMem(param.loc_size, h);
+        write_accessor<T> outAcc{*out.data, h};
+        read_accessor<T> signalAcc{*signal.data, h};
+        read_accessor<aT> impulseAcc{*param.impulse, h};
+        h.parallel_for(
+            sycl::nd_range{param.global, param.local},
+            conv1HelperCreateKernel<T, aT>(
+                outAcc, out.info, signalAcc, signal.info, localMem, impulseAcc,
+                filter.info, param.nBBS0, param.nBBS1, param.o[0], param.o[1],
+                param.o[2], param.s[0], param.s[1], param.s[2], expand));
+    });
+    ONEAPI_DEBUG_FINISH(Q);
+}
+
+template<typename T, typename aT>
+void conv1(conv_kparam_t<aT> &p, Param<T> &out, const Param<T> &sig,
+           const Param<aT> &filt, const bool expand) {
+    const size_t se_size = filt.info.dims[0];
+    sycl::buffer<aT> impulse{sycl::range(filt.info.dims[0])};
+    int f0Off = filt.info.offset;
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
+        int f3Off = b3 * filt.info.strides[3];
+
+        for (int b2 = 0; b2 < filt.info.dims[2]; ++b2) {
+            int f2Off = b2 * filt.info.strides[2];
+
+            for (int b1 = 0; b1 < filt.info.dims[1]; ++b1) {
+                int f1Off = b1 * filt.info.strides[1];
+
+                const size_t srcOffset = f0Off + f1Off + f2Off + f3Off;
+                memcpyBuffer(impulse, *filt.data, se_size, srcOffset);
+                p.impulse = &impulse;
+
+                p.o[0] = (p.outHasNoOffset ? 0 : b1);
+                p.o[1] = (p.outHasNoOffset ? 0 : b2);
+                p.o[2] = (p.outHasNoOffset ? 0 : b3);
+                p.s[0] = (p.inHasNoOffset ? 0 : b1);
+                p.s[1] = (p.inHasNoOffset ? 0 : b2);
+                p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+                conv1Helper<T, aT>(p, out, sig, filt, 1, expand);
+            }
+        }
+    }
+}
+
+#define INSTANTIATE_CONV1(T, aT)                                    \
+    template void conv1<T, aT>(conv_kparam_t<aT> &, Param<T> &,     \
+                               const Param<T> &, const Param<aT> &, \
+                               const bool);
+
+INSTANTIATE_CONV1(cdouble, cdouble)
+INSTANTIATE_CONV1(cfloat, cfloat)
+INSTANTIATE_CONV1(double, double)
+INSTANTIATE_CONV1(float, float)
+INSTANTIATE_CONV1(uint, float)
+INSTANTIATE_CONV1(int, float)
+INSTANTIATE_CONV1(schar, float)
+INSTANTIATE_CONV1(uchar, float)
+INSTANTIATE_CONV1(char, float)
+INSTANTIATE_CONV1(ushort, float)
+INSTANTIATE_CONV1(short, float)
+INSTANTIATE_CONV1(uintl, float)
+INSTANTIATE_CONV1(intl, float)
diff --git a/src/backend/oneapi/kernel/convolve2.hpp b/src/backend/oneapi/kernel/convolve2.hpp
new file mode 100644
index 0000000000..45bfa6c108
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve2.hpp
@@ -0,0 +1,199 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+template<typename T, typename aT>
+class conv2HelperCreateKernel {
+   public:
+    conv2HelperCreateKernel(write_accessor<T> out, KParam oInfo,
+                            read_accessor<T> signal, KParam sInfo,
+                            read_accessor<aT> impulse, KParam fInfo, int nBBS0,
+                            int nBBS1, int ostep2, int ostep3, int sstep2,
+                            int sstep3, sycl::local_accessor<aT> localMem,
+                            const int f0, const int f1, const bool expand)
+        : out_(out)
+        , oInfo_(oInfo)
+        , signal_(signal)
+        , sInfo_(sInfo)
+        , impulse_(impulse)
+        , fInfo_(fInfo)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1)
+        , ostep2_(ostep2)
+        , ostep3_(ostep3)
+        , sstep2_(sstep2)
+        , sstep3_(sstep3)
+        , localMem_(localMem)
+        , f0_(f0)
+        , f1_(f1)
+        , expand_(expand) {}
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g = it.get_group();
+
+        int radius0  = f0_ - 1;
+        int radius1  = f1_ - 1;
+        int padding0 = 2 * radius0;
+        int padding1 = 2 * radius1;
+        int shrdLen0 = g.get_local_range(0) + padding0;
+        int shrdLen1 = g.get_local_range(1) + padding1;
+
+        unsigned b0 = g.get_group_id(0) / nBBS0_;
+        unsigned b1 = g.get_group_id(1) / nBBS1_;
+
+        T *dst =
+            out_.get_pointer() +
+            (b0 * oInfo_.strides[2] + /* activated with batched input signal_ */
+             ostep2_ *
+                 oInfo_.strides[2] +  /* activated with batched input filter */
+             b1 * oInfo_.strides[3] + /* activated with batched input signal_ */
+             ostep3_ *
+                 oInfo_.strides[3]); /* activated with batched input filter */
+
+        const T *src =
+            signal_.get_pointer() + sInfo_.offset +
+            (b0 * sInfo_.strides[2] + /* activated with batched input signal_ */
+             sstep2_ *
+                 sInfo_.strides[2] +  /* activated with batched input filter */
+             b1 * sInfo_.strides[3] + /* activated with batched input signal_ */
+             sstep3_ *
+                 sInfo_.strides[3]); /* activated with batched input filter */
+
+        int lx = it.get_local_id(0);
+        int ly = it.get_local_id(1);
+        int gx = g.get_local_range(0) * (g.get_group_id(0) - b0 * nBBS0_) + lx;
+        int gy = g.get_local_range(1) * (g.get_group_id(1) - b1 * nBBS1_) + ly;
+
+        // below loops are traditional loops, they only run multiple
+        // times when the filter length is more than the launch size
+        int s0 = sInfo_.strides[0];
+        int s1 = sInfo_.strides[1];
+        int d0 = sInfo_.dims[0];
+        int d1 = sInfo_.dims[1];
+        for (int b = ly, gy2 = gy; b < shrdLen1;
+             b += g.get_local_range(1), gy2 += g.get_local_range(1)) {
+            int j     = gy2 - radius1;
+            bool is_j = j >= 0 && j < d1;
+            // move row_set g.get_local_range(1) along columns
+            for (int a = lx, gx2 = gx; a < shrdLen0;
+                 a += g.get_local_range(0), gx2 += g.get_local_range(0)) {
+                int i     = gx2 - radius0;
+                bool is_i = i >= 0 && i < d0;
+                localMem_[b * shrdLen0 + a] =
+                    (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
+            }
+        }
+        it.barrier();
+
+        if (gx < oInfo_.dims[0] && gy < oInfo_.dims[1]) {
+            int ci = lx + radius0 + (expand_ ? 0 : f0_ >> 1);
+            int cj = ly + radius1 + (expand_ ? 0 : f1_ >> 1);
+
+            aT accum = (aT)(0);
+            for (int fj = 0; fj < f1_; ++fj) {
+                for (int fi = 0; fi < f0_; ++fi) {
+                    aT f_val = impulse_[fj * f0_ + fi];
+                    T s_val  = localMem_[(cj - fj) * shrdLen0 + (ci - fi)];
+
+                    // binOp will do MUL_OP for convolution operation
+                    accum = accum + binOp((aT)s_val, (aT)f_val);
+                }
+            }
+            dst[gy * oInfo_.strides[1] + gx] = (T)accum;
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    read_accessor<T> signal_;
+    KParam sInfo_;
+    read_accessor<aT> impulse_;
+    KParam fInfo_;
+    int nBBS0_;
+    int nBBS1_;
+    int ostep2_;
+    int ostep3_;
+    int sstep2_;
+    int sstep3_;
+    sycl::local_accessor<aT> localMem_;
+    const int f0_;
+    const int f1_;
+    const bool expand_;
+};
+
+template<typename T, typename aT>
+void conv2Helper(const conv_kparam_t<aT> &param, Param<T> out,
+                 const Param<T> signal, const Param<aT> filter,
+                 const bool expand) {
+    const int f0 = filter.info.dims[0];
+    const int f1 = filter.info.dims[1];
+    const size_t LOC_SIZE =
+        (THREADS_X + 2 * (f0 - 1)) * (THREADS_Y + 2 * (f1 - 1));
+
+    auto Q = getQueue();
+    Q.submit([&](auto &h) {
+        sycl::local_accessor<aT> localMem(LOC_SIZE, h);
+        write_accessor<T> outAcc{*out.data, h};
+        read_accessor<T> signalAcc{*signal.data, h};
+        read_accessor<aT> impulseAcc{*param.impulse, h};
+        h.parallel_for(
+            sycl::nd_range{param.global, param.local},
+            conv2HelperCreateKernel<T, aT>(
+                outAcc, out.info, signalAcc, signal.info, impulseAcc,
+                filter.info, param.nBBS0, param.nBBS1, param.o[1], param.o[2],
+                param.s[1], param.s[2], localMem, f0, f1, expand));
+    });
+    ONEAPI_DEBUG_FINISH(Q);
+}
+
+template<typename T, typename aT>
+void conv2(conv_kparam_t<aT> &p, Param<T> &out, const Param<T> &sig,
+           const Param<aT> &filt, const bool expand) {
+    size_t se_size = filt.info.dims[0] * filt.info.dims[1];
+    sycl::buffer<aT> impulse{sycl::range(se_size)};
+    int f0Off = filt.info.offset;
+
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
+        int f3Off = b3 * filt.info.strides[3];
+
+        for (int b2 = 0; b2 < filt.info.dims[2]; ++b2) {
+            int f2Off = b2 * filt.info.strides[2];
+
+            const size_t srcOffset = f2Off + f3Off + f0Off;
+            memcpyBuffer(impulse, *filt.data, se_size, srcOffset);
+            p.impulse = &impulse;
+
+            p.o[1] = (p.outHasNoOffset ? 0 : b2);
+            p.o[2] = (p.outHasNoOffset ? 0 : b3);
+            p.s[1] = (p.inHasNoOffset ? 0 : b2);
+            p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+            conv2Helper<T, aT>(p, out, sig, filt, expand);
+        }
+    }
+}
+
+#define INSTANTIATE_CONV2(T, aT)                                    \
+    template void conv2<T, aT>(conv_kparam_t<aT> &, Param<T> &,     \
+                               const Param<T> &, const Param<aT> &, \
+                               const bool);
+
+INSTANTIATE_CONV2(char, float)
+INSTANTIATE_CONV2(cfloat, cfloat)
+INSTANTIATE_CONV2(cdouble, cdouble)
+INSTANTIATE_CONV2(float, float)
+INSTANTIATE_CONV2(double, double)
+INSTANTIATE_CONV2(short, float)
+INSTANTIATE_CONV2(int, float)
+INSTANTIATE_CONV2(intl, float)
+INSTANTIATE_CONV2(ushort, float)
+INSTANTIATE_CONV2(uint, float)
+INSTANTIATE_CONV2(uintl, float)
+INSTANTIATE_CONV2(schar, float)
+INSTANTIATE_CONV2(uchar, float)
diff --git a/src/backend/oneapi/kernel/convolve3.hpp b/src/backend/oneapi/kernel/convolve3.hpp
new file mode 100644
index 0000000000..bdfcc4eb24
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve3.hpp
@@ -0,0 +1,202 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+inline int index(int i, int j, int k, int jstride, int kstride) {
+    return i + j * jstride + k * kstride;
+}
+
+template<typename T, typename aT>
+class conv3HelperCreateKernel {
+   public:
+    conv3HelperCreateKernel(write_accessor<T> out, KParam oInfo,
+                            read_accessor<T> signal, KParam sInfo,
+                            sycl::local_accessor<aT> localMem,
+                            read_accessor<aT> impulse, KParam fInfo, int nBBS0,
+                            int nBBS1, int ostep1, int ostep2, int ostep3,
+                            int sstep1, int sstep2, int sstep3,
+                            const bool EXPAND)
+        : out_(out)
+        , oInfo_(oInfo)
+        , signal_(signal)
+        , sInfo_(sInfo)
+        , localMem_(localMem)
+        , impulse_(impulse)
+        , fInfo_(fInfo)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1)
+        , ostep1_(ostep1)
+        , ostep2_(ostep2)
+        , ostep3_(ostep3)
+        , sstep1_(sstep1)
+        , sstep2_(sstep2)
+        , sstep3_(sstep3)
+        , EXPAND_(EXPAND) {}
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g = it.get_group();
+        int fLen0     = fInfo_.dims[0];
+        int fLen1     = fInfo_.dims[1];
+        int fLen2     = fInfo_.dims[2];
+        int radius0   = fLen0 - 1;
+        int radius1   = fLen1 - 1;
+        int radius2   = fLen2 - 1;
+        int shrdLen0  = g.get_local_range(0) + 2 * radius0;
+        int shrdLen1  = g.get_local_range(1) + 2 * radius1;
+        int shrdLen2  = g.get_local_range(2) + 2 * radius2;
+        int skStride  = shrdLen0 * shrdLen1;
+        int fStride   = fLen0 * fLen1;
+        unsigned b2   = g.get_group_id(0) / nBBS0_;
+
+        T *dst =
+            out_.get_pointer() +
+            (b2 * oInfo_.strides[3] + /* activated with batched input signal_ */
+             ostep3_ *
+                 oInfo_.strides[3]); /* activated with batched input filter */
+
+        const T *src =
+            signal_.get_pointer() + sInfo_.offset +
+            (b2 * sInfo_.strides[3] + /* activated with batched input signal_ */
+             sstep3_ *
+                 sInfo_.strides[3]); /* activated with batched input filter */
+
+        int lx = it.get_local_id(0);
+        int ly = it.get_local_id(1);
+        int lz = it.get_local_id(2);
+        int gx = g.get_local_range(0) * (g.get_group_id(0) - b2 * nBBS0_) + lx;
+        int gy = g.get_local_range(1) * g.get_group_id(1) + ly;
+        int gz = g.get_local_range(2) * g.get_group_id(2) + lz;
+
+        int s0 = sInfo_.strides[0];
+        int s1 = sInfo_.strides[1];
+        int s2 = sInfo_.strides[2];
+        int d0 = sInfo_.dims[0];
+        int d1 = sInfo_.dims[1];
+        int d2 = sInfo_.dims[2];
+
+        for (int c = lz, gz2 = gz; c < shrdLen2;
+             c += g.get_local_range(2), gz2 += g.get_local_range(2)) {
+            int k     = gz2 - radius2;
+            bool is_k = k >= 0 && k < d2;
+            for (int b = ly, gy2 = gy; b < shrdLen1;
+                 b += g.get_local_range(1), gy2 += g.get_local_range(1)) {
+                int j     = gy2 - radius1;
+                bool is_j = j >= 0 && j < d1;
+                for (int a = lx, gx2 = gx; a < shrdLen0;
+                     a += g.get_local_range(0), gx2 += g.get_local_range(0)) {
+                    int i     = gx2 - radius0;
+                    bool is_i = i >= 0 && i < d0;
+                    localMem_[c * skStride + b * shrdLen0 + a] =
+                        (is_i && is_j && is_k ? src[i * s0 + j * s1 + k * s2]
+                                              : (T)(0));
+                }
+            }
+        }
+        it.barrier();
+
+        if (gx < oInfo_.dims[0] && gy < oInfo_.dims[1] && gz < oInfo_.dims[2]) {
+            int ci = lx + radius0 + (EXPAND_ ? 0 : fLen0 >> 1);
+            int cj = ly + radius1 + (EXPAND_ ? 0 : fLen1 >> 1);
+            int ck = lz + radius2 + (EXPAND_ ? 0 : fLen2 >> 1);
+
+            aT accum = (aT)(0);
+            for (int fk = 0; fk < fLen2; ++fk) {
+                for (int fj = 0; fj < fLen1; ++fj) {
+                    for (int fi = 0; fi < fLen0; ++fi) {
+                        aT f_val = impulse_[index(fi, fj, fk, fLen0, fStride)];
+                        T s_val  = localMem_[index(ci - fi, cj - fj, ck - fk,
+                                                   shrdLen0, skStride)];
+
+                        // binOp will do MUL_OP for convolution operation
+                        accum = accum + binOp((aT)s_val, (aT)f_val);
+                    }
+                }
+            }
+            dst[index(gx, gy, gz, oInfo_.strides[1], oInfo_.strides[2])] =
+                (T)accum;
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    read_accessor<T> signal_;
+    KParam sInfo_;
+    sycl::local_accessor<aT> localMem_;
+    read_accessor<aT> impulse_;
+    KParam fInfo_;
+    int nBBS0_;
+    int nBBS1_;
+    int ostep1_;
+    int ostep2_;
+    int ostep3_;
+    int sstep1_;
+    int sstep2_;
+    int sstep3_;
+    const bool EXPAND_;
+};
+
+template<typename T, typename aT>
+void conv3Helper(const conv_kparam_t<aT> &param, Param<T> &out,
+                 const Param<T> &signal, const Param<aT> &impulse,
+                 const int rank, const bool EXPAND) {
+    auto Q = getQueue();
+    Q.submit([&](auto &h) {
+        sycl::local_accessor<aT> localMem(param.loc_size, h);
+        write_accessor<T> outAcc{*out.data, h};
+        read_accessor<T> signalAcc{*signal.data, h};
+        read_accessor<aT> impulseAcc{*param.impulse, h};
+        h.parallel_for(
+            sycl::nd_range{param.global, param.local},
+            conv3HelperCreateKernel<T, aT>(
+                outAcc, out.info, signalAcc, signal.info, localMem, impulseAcc,
+                impulse.info, param.nBBS0, param.nBBS1, param.o[0], param.o[1],
+                param.o[2], param.s[0], param.s[1], param.s[2], EXPAND));
+    });
+    ONEAPI_DEBUG_FINISH(Q);
+}
+
+template<typename T, typename aT>
+void conv3(conv_kparam_t<aT> &p, Param<T> &out, const Param<T> &sig,
+           const Param<aT> &filt, const bool expand) {
+    size_t se_size = filt.info.dims[0] * filt.info.dims[1] * filt.info.dims[2];
+    sycl::buffer<aT> impulse{sycl::range(se_size)};
+    int f0Off = filt.info.offset;
+
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
+        int f3Off = b3 * filt.info.strides[3];
+
+        const size_t srcOffset = f3Off + f0Off;
+        memcpyBuffer(impulse, *filt.data, se_size, srcOffset);
+        p.impulse = &impulse;
+
+        p.o[2] = (p.outHasNoOffset ? 0 : b3);
+        p.s[2] = (p.inHasNoOffset ? 0 : b3);
+
+        conv3Helper<T, aT>(p, out, sig, filt, 3, expand);
+    }
+}
+
+#define INSTANTIATE_CONV3(T, aT)                                    \
+    template void conv3<T, aT>(conv_kparam_t<aT> &, Param<T> &,     \
+                               const Param<T> &, const Param<aT> &, \
+                               const bool);
+
+INSTANTIATE_CONV3(cdouble, cdouble)
+INSTANTIATE_CONV3(cfloat, cfloat)
+INSTANTIATE_CONV3(double, double)
+INSTANTIATE_CONV3(float, float)
+INSTANTIATE_CONV3(uint, float)
+INSTANTIATE_CONV3(int, float)
+INSTANTIATE_CONV3(schar, float)
+INSTANTIATE_CONV3(uchar, float)
+INSTANTIATE_CONV3(char, float)
+INSTANTIATE_CONV3(ushort, float)
+INSTANTIATE_CONV3(short, float)
+INSTANTIATE_CONV3(uintl, float)
+INSTANTIATE_CONV3(intl, float)
diff --git a/src/backend/oneapi/kernel/convolve_separable.cpp b/src/backend/oneapi/kernel/convolve_separable.cpp
new file mode 100644
index 0000000000..0f3dfacb30
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve_separable.cpp
@@ -0,0 +1,213 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+using read_accessor = sycl::accessor<T, 1, sycl::access::mode::read>;
+template<typename T>
+using write_accessor = sycl::accessor<T, 1, sycl::access::mode::write>;
+
+template<typename T, typename accType>
+class convolveSeparableCreateKernel {
+   public:
+    convolveSeparableCreateKernel(write_accessor<T> out, KParam oInfo,
+                                  read_accessor<T> signal, KParam sInfo,
+                                  read_accessor<accType> impulse, int nBBS0,
+                                  int nBBS1, const int FLEN, const int CONV_DIM,
+                                  const bool EXPAND,
+                                  sycl::local_accessor<T> localMem)
+        : out_(out)
+        , oInfo_(oInfo)
+        , signal_(signal)
+        , sInfo_(sInfo)
+        , impulse_(impulse)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1)
+        , FLEN_(FLEN)
+        , CONV_DIM_(CONV_DIM)
+        , EXPAND_(EXPAND)
+        , localMem_(localMem) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const int radius  = FLEN_ - 1;
+        const int padding = 2 * radius;
+        const int s0      = sInfo_.strides[0];
+        const int s1      = sInfo_.strides[1];
+        const int d0      = sInfo_.dims[0];
+        const int d1      = sInfo_.dims[1];
+        const int shrdLen =
+            g.get_local_range(0) + (CONV_DIM_ == 0 ? padding : 0);
+
+        unsigned b2 = g.get_group_id(0) / nBBS0_;
+        unsigned b3 = g.get_group_id(1) / nBBS1_;
+        T *dst      = out_.get_pointer() +
+                 (b2 * oInfo_.strides[2] + b3 * oInfo_.strides[3]);
+        const T *src = signal_.get_pointer() +
+                       (b2 * sInfo_.strides[2] + b3 * sInfo_.strides[3]) +
+                       sInfo_.offset;
+
+        int lx = it.get_local_id(0);
+        int ly = it.get_local_id(1);
+        int ox = g.get_local_range(0) * (g.get_group_id(0) - b2 * nBBS0_) + lx;
+        int oy = g.get_local_range(1) * (g.get_group_id(1) - b3 * nBBS1_) + ly;
+        int gx = ox;
+        int gy = oy;
+
+        // CONV_DIM_ is a runtime member here; in the OpenCL kernel it is a
+        // macro value passed at kernel compilation
+        if (CONV_DIM_ == 0) {
+            gx += (EXPAND_ ? 0 : FLEN_ >> 1);
+            int endX = ((FLEN_ - 1) << 1) + g.get_local_range(0);
+            for (int lx = it.get_local_id(0), glb_x = gx; lx < endX;
+                 lx += g.get_local_range(0), glb_x += g.get_local_range(0)) {
+                int i     = glb_x - radius;
+                int j     = gy;
+                bool is_i = i >= 0 && i < d0;
+                bool is_j = j >= 0 && j < d1;
+                localMem_[ly * shrdLen + lx] =
+                    (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
+            }
+
+        } else if (CONV_DIM_ == 1) {
+            gy += (EXPAND_ ? 0 : FLEN_ >> 1);
+            int endY = ((FLEN_ - 1) << 1) + g.get_local_range(1);
+            for (int ly = it.get_local_id(1), glb_y = gy; ly < endY;
+                 ly += g.get_local_range(1), glb_y += g.get_local_range(1)) {
+                int i     = gx;
+                int j     = glb_y - radius;
+                bool is_i = i >= 0 && i < d0;
+                bool is_j = j >= 0 && j < d1;
+                localMem_[ly * shrdLen + lx] =
+                    (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
+            }
+        }
+        it.barrier();
+
+        if (ox < oInfo_.dims[0] && oy < oInfo_.dims[1]) {
+            // the conditional below picks the tile axis from the runtime
+            // CONV_DIM_ member (a compile-time macro in the OpenCL kernel)
+            int i         = (CONV_DIM_ == 0 ? lx : ly) + radius;
+            accType accum = (accType)(0);
+            for (int f = 0; f < FLEN_; ++f) {
+                accType f_val = impulse_[f];
+                // the conditional below walks the tile along x when
+                // CONV_DIM_ == 0 and along y otherwise (runtime value)
+                int s_idx = (CONV_DIM_ == 0 ? (ly * shrdLen + (i - f))
+                                            : ((i - f) * shrdLen + lx));
+                T s_val   = localMem_[s_idx];
+
+                // binOp omitted from OpenCL implementation (see
+                // convolve_separable.cl)
+                accum = accum + (accType)s_val * (accType)f_val;
+            }
+            dst[oy * oInfo_.strides[1] + ox] = (T)accum;
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    read_accessor<T> signal_;
+    KParam sInfo_;
+    read_accessor<accType> impulse_;
+    int nBBS0_;
+    int nBBS1_;
+    const int FLEN_;
+    const int CONV_DIM_;
+    const bool EXPAND_;
+    sycl::local_accessor<T> localMem_;
+};
+
+template<typename T>
+void memcpyBuffer(sycl::buffer<T, 1> &dest, sycl::buffer<T, 1> &src,
+                  const size_t n, const size_t srcOffset) {
+    getQueue().submit([&](auto &h) {
+        sycl::accessor srcAcc{src, h, sycl::range{n}, sycl::id{srcOffset},
+                              sycl::read_only};
+        sycl::accessor destAcc{
+            dest,         h, sycl::range{n}, sycl::id{0}, sycl::write_only,
+            sycl::no_init};
+        h.copy(srcAcc, destAcc);
+    });
+}
+
+template<typename T, typename accType>
+void convSep(Param<T> out, const Param<T> signal, const Param<accType> filter,
+             const int conv_dim, const bool expand) {
+    if (!(conv_dim == 0 || conv_dim == 1)) {
+        AF_ERROR(
+            "Separable convolution accepts only 0 or 1 as convolution "
+            "dimension",
+            AF_ERR_NOT_SUPPORTED);
+    }
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+
+    const int fLen       = filter.info.dims[0] * filter.info.dims[1];
+    const size_t C0_SIZE = (THREADS_X + 2 * (fLen - 1)) * THREADS_Y;
+    const size_t C1_SIZE = (THREADS_Y + 2 * (fLen - 1)) * THREADS_X;
+    size_t locSize       = (conv_dim == 0 ? C0_SIZE : C1_SIZE);
+
+    auto local = sycl::range(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(out.info.dims[0], THREADS_X);
+    int blk_y = divup(out.info.dims[1], THREADS_Y);
+
+    auto global = sycl::range(blk_x * signal.info.dims[2] * THREADS_X,
+                              blk_y * signal.info.dims[3] * THREADS_Y);
+
+    sycl::buffer<accType> mBuff = {sycl::range(fLen)};
+    memcpyBuffer(mBuff, *filter.data, fLen, 0);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_signal{*signal.data, h, sycl::read_only};
+        sycl::accessor d_out{*out.data, h, sycl::write_only, sycl::no_init};
+        sycl::accessor d_mBuff{mBuff, h, sycl::read_only};
+        sycl::local_accessor<T> localMem(locSize, h);
+        h.parallel_for(sycl::nd_range{global, local},
+                       convolveSeparableCreateKernel<T, accType>(
+                           d_out, out.info, d_signal, signal.info, d_mBuff,
+                           blk_x, blk_y, fLen, conv_dim, expand, localMem));
+    });
+}
+
+#define INSTANTIATE(T, accT)                                          \
+    template void convSep<T, accT>(Param<T>, const Param<T>,          \
+                                   const Param<accT> filt, const int, \
+                                   const bool);
+
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/convolve_separable.hpp b/src/backend/oneapi/kernel/convolve_separable.hpp
new file mode 100644
index 0000000000..0339c9c614
--- /dev/null
+++ b/src/backend/oneapi/kernel/convolve_separable.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+// MAX_SCONV_FILTER_LEN below is calculated based on
+// a maximum shared memory configuration of 48KB per block,
+// considering complex types as well
+constexpr int MAX_SCONV_FILTER_LEN = 31;
+
+template<typename T, typename accT>
+void convSep(Param<T> out, const Param<T> sig, const Param<accT> filt,
+             const int cDim, const bool expand);
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/default_config.hpp b/src/backend/oneapi/kernel/default_config.hpp
new file mode 100644
index 0000000000..c2ed8ae3dc
--- /dev/null
+++ b/src/backend/oneapi/kernel/default_config.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+static const uint THREADS_PER_BLOCK = 256;
+static const uint THREADS_X         = 32;
+static const uint THREADS_Y         = THREADS_PER_BLOCK / THREADS_X;
+static const uint REPEAT            = 32;
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/diagonal.hpp b/src/backend/oneapi/kernel/diagonal.hpp
new file mode 100644
index 0000000000..91db3fbda1
--- /dev/null
+++ b/src/backend/oneapi/kernel/diagonal.hpp
@@ -0,0 +1,161 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class diagCreateKernel {
+   public:
+    diagCreateKernel(write_accessor<T> oData, KParam oInfo,
+                     read_accessor<T> iData, KParam iInfo, int num,
+                     int groups_x)
+        : oData_(oData)
+        , oInfo_(oInfo)
+        , iData_(iData)
+        , iInfo_(iInfo)
+        , num_(num)
+        , groups_x_(groups_x) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g      = it.get_group();
+        unsigned idz       = g.get_group_id(0) / groups_x_;
+        unsigned groupId_x = g.get_group_id(0) - idz * groups_x_;
+
+        unsigned idx = it.get_local_id(0) + groupId_x * g.get_local_range(0);
+        unsigned idy = it.get_global_id(1);
+
+        if (idx >= oInfo_.dims[0] || idy >= oInfo_.dims[1] ||
+            idz >= oInfo_.dims[2])
+            return;
+
+        T *optr = oData_.get_pointer();
+        optr += idz * oInfo_.strides[2] + idy * oInfo_.strides[1] + idx;
+
+        const T *iptr = iData_.get_pointer();
+        iptr +=
+            idz * iInfo_.strides[1] + ((num_ > 0) ? idx : idy) + iInfo_.offset;
+
+        T val = (idx == (idy - num_)) ? *iptr : (T)(0);
+        *optr = val;
+    }
+
+   private:
+    write_accessor<T> oData_;
+    KParam oInfo_;
+    read_accessor<T> iData_;
+    KParam iInfo_;
+    int num_;
+    int groups_x_;
+};
+
+template<typename T>
+static void diagCreate(Param<T> out, Param<T> in, int num) {
+    auto local   = sycl::range{32, 8};
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_y = divup(out.info.dims[1], local[1]);
+    auto global  = sycl::range{groups_x * local[0] * out.info.dims[2],
+                              groups_y * local[1]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> oData{*out.data, h};
+        read_accessor<T> iData{*in.data, h};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       diagCreateKernel<T>(oData, out.info, iData, in.info, num,
+                                           groups_x));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class diagExtractKernel {
+   public:
+    diagExtractKernel(write_accessor<T> oData, KParam oInfo,
+                      read_accessor<T> iData, KParam iInfo, int num,
+                      int groups_z)
+        : oData_(oData)
+        , oInfo_(oInfo)
+        , iData_(iData)
+        , iInfo_(iInfo)
+        , num_(num)
+        , groups_z_(groups_z) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        unsigned idw  = g.get_group_id(1) / groups_z_;
+        unsigned idz  = g.get_group_id(1) - idw * groups_z_;
+
+        unsigned idx = it.get_global_id(0);
+
+        if (idx >= oInfo_.dims[0] || idz >= oInfo_.dims[2] ||
+            idw >= oInfo_.dims[3])
+            return;
+
+        T *optr = oData_.get_pointer();
+        optr += idz * oInfo_.strides[2] + idw * oInfo_.strides[3] + idx;
+
+        if (idx >= iInfo_.dims[0] || idx >= iInfo_.dims[1]) {
+            *optr = (T)(0);
+            return;
+        }
+
+        int i_off = ((num_ > 0) ? (num_ * iInfo_.strides[1] + idx)
+                                : (idx - num_)) + iInfo_.offset;
+
+        const T *iptr = iData_.get_pointer();
+        iptr += idz * iInfo_.strides[2] + idw * iInfo_.strides[3] + i_off;
+
+        *optr = iptr[idx * iInfo_.strides[1]];
+    }
+
+   private:
+    write_accessor<T> oData_;
+    KParam oInfo_;
+    read_accessor<T> iData_;
+    KParam iInfo_;
+    int num_;
+    int groups_z_;
+};
+
+template<typename T>
+static void diagExtract(Param<T> out, Param<T> in, int num) {
+    auto local   = sycl::range{256, 1};
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_z = out.info.dims[2];
+    auto global  = sycl::range{groups_x * local[0],
+                              groups_z * local[1] * out.info.dims[3]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> oData{*out.data, h};
+        read_accessor<T> iData{*in.data, h};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       diagExtractKernel<T>(oData, out.info, iData, in.info,
+                                            num, groups_z));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
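Reviewer note on diagonal.hpp: both kernels fold an extra array dimension into a 2D launch by multiplying the group count along one axis by the batch size, then recover the batch index inside the kernel with a divide and a subtract. The host-only sketch below illustrates that decomposition. `GroupSplit` and `splitGroupId` are hypothetical names used for illustration and are not part of ArrayFire.

```cpp
#include <cassert>

// Hypothetical helper (not part of ArrayFire): splits a flattened group id
// into a batch index and a group index within that batch, the same way
// diagCreateKernel derives idz and groupId_x from get_group_id(0).
struct GroupSplit {
    unsigned batch;    // plays the role of idz
    unsigned group_x;  // plays the role of groupId_x
};

inline GroupSplit splitGroupId(unsigned flat_group_id, unsigned groups_x) {
    assert(groups_x > 0);
    const unsigned batch   = flat_group_id / groups_x;
    const unsigned group_x = flat_group_id - batch * groups_x;
    return GroupSplit{batch, group_x};
}
```

For example, with `dims[0] = 100` and a local size of 32 there are `divup(100, 32) = 4` groups per batch, so flattened group 9 belongs to batch 2 and is group 1 within it.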
diff --git a/src/backend/oneapi/kernel/diff.hpp b/src/backend/oneapi/kernel/diff.hpp
new file mode 100644
index 0000000000..5276786646
--- /dev/null
+++ b/src/backend/oneapi/kernel/diff.hpp
@@ -0,0 +1,123 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class diffKernel {
+   public:
+    diffKernel(write_accessor<T> outAcc, const read_accessor<T> inAcc,
+               const KParam op, const KParam ip, const int oElem,
+               const int blocksPerMatX, const int blocksPerMatY,
+               const bool isDiff2, const unsigned DIM)
+        : outAcc_(outAcc)
+        , inAcc_(inAcc)
+        , op_(op)
+        , ip_(ip)
+        , oElem_(oElem)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY)
+        , isDiff2_(isDiff2)
+        , DIM_(DIM) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int idz = g.get_group_id(0) / blocksPerMatX_;
+        const int idw = g.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - idz * blocksPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - idw * blocksPerMatY_;
+
+        const int idx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int idy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        if (idx >= op_.dims[0] || idy >= op_.dims[1] || idz >= op_.dims[2] ||
+            idw >= op_.dims[3])
+            return;
+
+        int iMem0 = idw * ip_.strides[3] + idz * ip_.strides[2] +
+                    idy * ip_.strides[1] + idx;
+        int iMem1 = iMem0 + ip_.strides[DIM_];
+        int iMem2 = iMem1 + ip_.strides[DIM_];
+
+        int oMem = idw * op_.strides[3] + idz * op_.strides[2] +
+                   idy * op_.strides[1] + idx;
+
+        iMem2 *= isDiff2_;
+
+        T *out      = outAcc_.get_pointer();
+        const T *in = inAcc_.get_pointer() + ip_.offset;
+        if (isDiff2_ == 0) {
+            out[oMem] = in[iMem1] - in[iMem0];
+        } else {
+            out[oMem] = in[iMem2] - in[iMem1] - in[iMem1] + in[iMem0];
+        }
+
+        // diff_this(out, in + ip.offset, oMem, iMem0, iMem1, iMem2);
+    }
+
+   private:
+    write_accessor<T> outAcc_;
+    const read_accessor<T> inAcc_;
+    const KParam op_;
+    const KParam ip_;
+    const int oElem_;
+    const int blocksPerMatX_;
+    const int blocksPerMatY_;
+    const bool isDiff2_;
+    const unsigned DIM_;
+};
+
+template<typename T>
+void diff(Param<T> out, const Param<T> in, const unsigned indims,
+          const unsigned dim, const bool isDiff2) {
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+
+    auto local = sycl::range{TX, TY};
+    if (dim == 0 && indims == 1) { local = sycl::range{TX * TY, 1}; }
+
+    int blocksPerMatX = divup(out.info.dims[0], local[0]);
+    int blocksPerMatY = divup(out.info.dims[1], local[1]);
+    auto global       = sycl::range{local[0] * blocksPerMatX * out.info.dims[2],
+                              local[1] * blocksPerMatY * out.info.dims[3]};
+
+    const int oElem = out.info.dims[0] * out.info.dims[1] * out.info.dims[2] *
+                      out.info.dims[3];
+
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<T> inAcc   = {*in.data, h};
+        write_accessor<T> outAcc = {*out.data, h};
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            diffKernel<T>(outAcc, inAcc, out.info, in.info, oElem,
+                          blocksPerMatX, blocksPerMatY, isDiff2, dim));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
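Reviewer note on diff.hpp: the kernel computes either a first-order difference, `in[i + 1] - in[i]`, or a second-order difference, `in[i + 2] - 2 * in[i + 1] + in[i]`, along the chosen dimension. A minimal 1D host reference, assuming contiguous data, can serve as a check. `diff1D` is a hypothetical name and not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D reference (not part of ArrayFire) for the arithmetic in
// diffKernel: first-order difference when isDiff2 is false, second-order
// difference when it is true.
inline std::vector<int> diff1D(const std::vector<int>& in, bool isDiff2) {
    const std::size_t shrink = isDiff2 ? 2 : 1;
    if (in.size() <= shrink) return {};
    std::vector<int> out(in.size() - shrink);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (!isDiff2) {
            out[i] = in[i + 1] - in[i];
        } else {
            // Same term order as the kernel: in2 - in1 - in1 + in0
            out[i] = in[i + 2] - in[i + 1] - in[i + 1] + in[i];
        }
    }
    return out;
}
```

For the input `{1, 4, 9, 16}` the first-order result is `{3, 5, 7}` and the second-order result is `{2, 2}`.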
diff --git a/src/backend/oneapi/kernel/fftconvolve_common.hpp b/src/backend/oneapi/kernel/fftconvolve_common.hpp
new file mode 100644
index 0000000000..6caf9923d2
--- /dev/null
+++ b/src/backend/oneapi/kernel/fftconvolve_common.hpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr int THREADS = 256;
+
+template<typename T, typename convT>
+void calcParamSizes(Param<T>& sig_tmp, Param<T>& filter_tmp,
+                    Param<convT>& packed, Param<T>& sig, Param<T>& filter,
+                    const int rank, AF_BATCH_KIND kind) {
+    sig_tmp.info.dims[0] = filter_tmp.info.dims[0] = packed.info.dims[0];
+    sig_tmp.info.strides[0] = filter_tmp.info.strides[0] = 1;
+
+    for (int k = 1; k < 4; k++) {
+        if (k < rank) {
+            sig_tmp.info.dims[k]    = packed.info.dims[k];
+            filter_tmp.info.dims[k] = packed.info.dims[k];
+        } else {
+            sig_tmp.info.dims[k]    = sig.info.dims[k];
+            filter_tmp.info.dims[k] = filter.info.dims[k];
+        }
+
+        sig_tmp.info.strides[k] =
+            sig_tmp.info.strides[k - 1] * sig_tmp.info.dims[k - 1];
+        filter_tmp.info.strides[k] =
+            filter_tmp.info.strides[k - 1] * filter_tmp.info.dims[k - 1];
+    }
+
+    // NOTE: The OpenCL implementation on which this oneAPI port is
+    // based treated the incoming `packed` buffer as an array of real
+    // scalars instead of complex numbers. OpenCL accomplished this
+    // by aliasing the buffers, as in the two commented-out lines below.
+    // This note is kept to explain the SYCL buffer reinterprets in the
+    // fftconvolve kernel invocations.
+
+    // sig_tmp.data    = packed.data;
+    // filter_tmp.data = packed.data;
+
+    // Calculate memory offsets for packed signal and filter
+    if (kind == AF_BATCH_RHS) {
+        filter_tmp.info.offset = 0;
+        sig_tmp.info.offset =
+            filter_tmp.info.strides[3] * filter_tmp.info.dims[3] * 2;
+    } else {
+        sig_tmp.info.offset = 0;
+        filter_tmp.info.offset =
+            sig_tmp.info.strides[3] * sig_tmp.info.dims[3] * 2;
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/fftconvolve_multiply.hpp b/src/backend/oneapi/kernel/fftconvolve_multiply.hpp
new file mode 100644
index 0000000000..32516f4056
--- /dev/null
+++ b/src/backend/oneapi/kernel/fftconvolve_multiply.hpp
@@ -0,0 +1,153 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class fftconvolve_multiplyCreateKernel {
+   public:
+    fftconvolve_multiplyCreateKernel(write_accessor<T> d_out, KParam oInfo,
+                                     read_accessor<T> d_in1, KParam i1Info,
+                                     read_accessor<T> d_in2, KParam i2Info,
+                                     const int nelem, const int kind)
+        : d_out_(d_out)
+        , oInfo_(oInfo)
+        , d_in1_(d_in1)
+        , i1Info_(i1Info)
+        , d_in2_(d_in2)
+        , i2Info_(i2Info)
+        , nelem_(nelem)
+        , kind_(kind) {}
+    void operator()(sycl::nd_item<1> it) const {
+        const int t = it.get_global_id(0);
+
+        if (t >= nelem_) return;
+
+        if (kind_ == AF_BATCH_NONE || kind_ == AF_BATCH_SAME) {
+            // Complex-multiply each signal by its corresponding filter
+            const int ridx = t * 2;
+            const int iidx = t * 2 + 1;
+
+            T a = d_in1_[i1Info_.offset + ridx];
+            T b = d_in1_[i1Info_.offset + iidx];
+            T c = d_in2_[i2Info_.offset + ridx];
+            T d = d_in2_[i2Info_.offset + iidx];
+
+            d_out_[oInfo_.offset + ridx] = a * c - b * d;
+            d_out_[oInfo_.offset + iidx] = a * d + b * c;
+        } else if (kind_ == AF_BATCH_LHS) {
+            // Complex-multiply all signals by the single filter
+            const int ridx1 = t * 2;
+            const int iidx1 = t * 2 + 1;
+
+            // Treating complex output array as real-only array,
+            // thus, multiply strides by 2
+            const int ridx2 =
+                ridx1 % (i2Info_.strides[3] * i2Info_.dims[3] * 2);
+            const int iidx2 =
+                iidx1 % (i2Info_.strides[3] * i2Info_.dims[3] * 2);
+
+            T a = d_in1_[i1Info_.offset + ridx1];
+            T b = d_in1_[i1Info_.offset + iidx1];
+            T c = d_in2_[i2Info_.offset + ridx2];
+            T d = d_in2_[i2Info_.offset + iidx2];
+
+            d_out_[oInfo_.offset + ridx1] = a * c - b * d;
+            d_out_[oInfo_.offset + iidx1] = a * d + b * c;
+        } else if (kind_ == AF_BATCH_RHS) {
+            // Complex-multiply the single signal by all filters
+            const int ridx2 = t * 2;
+            const int iidx2 = t * 2 + 1;
+
+            // Treating complex output array as real-only array,
+            // thus, multiply strides by 2
+            const int ridx1 =
+                ridx2 % (i1Info_.strides[3] * i1Info_.dims[3] * 2);
+            const int iidx1 =
+                iidx2 % (i1Info_.strides[3] * i1Info_.dims[3] * 2);
+
+            T a = d_in1_[i1Info_.offset + ridx1];
+            T b = d_in1_[i1Info_.offset + iidx1];
+            T c = d_in2_[i2Info_.offset + ridx2];
+            T d = d_in2_[i2Info_.offset + iidx2];
+
+            d_out_[oInfo_.offset + ridx2] = a * c - b * d;
+            d_out_[oInfo_.offset + iidx2] = a * d + b * c;
+        }
+    }
+
+   private:
+    write_accessor<T> d_out_;
+    KParam oInfo_;
+    read_accessor<T> d_in1_;
+    KParam i1Info_;
+    read_accessor<T> d_in2_;
+    KParam i2Info_;
+    const int nelem_;
+    const int kind_;
+};
+
+template<typename convT, typename T>
+void complexMultiplyHelper(Param<convT> packed, Param<T> sig, Param<T> filter,
+                           const int rank, AF_BATCH_KIND kind) {
+    Param<T> sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
+    int filter_packed_elem =
+        filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
+    int mul_elem = (sig_packed_elem < filter_packed_elem) ? filter_packed_elem
+                                                          : sig_packed_elem;
+    int blocks   = divup(mul_elem, THREADS);
+
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(blocks * THREADS);
+
+    // Treat complex output as an array of scalars
+    using convScalarT      = typename convT::value_type;
+    auto packed_num_elem   = (*packed.data).get_range().size();
+    auto packed_tmp_buffer = (*packed.data)
+                                 .template reinterpret<convScalarT>(
+                                     sycl::range<1>{packed_num_elem * 2});
+    auto sig_tmp_buffer = (*packed.data)
+                              .template reinterpret<convScalarT>(
+                                  sycl::range<1>{packed_num_elem * 2});
+    auto filter_tmp_buffer = (*packed.data)
+                                 .template reinterpret<convScalarT>(
+                                     sycl::range<1>{packed_num_elem * 2});
+
+    getQueue().submit([&](auto &h) {
+        write_accessor<convScalarT> d_packed    = {packed_tmp_buffer, h};
+        read_accessor<convScalarT> d_sig_tmp    = {sig_tmp_buffer, h};
+        read_accessor<convScalarT> d_filter_tmp = {filter_tmp_buffer, h};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            fftconvolve_multiplyCreateKernel<typename convT::value_type>(
+                d_packed, packed.info, d_sig_tmp, sig_tmp.info, d_filter_tmp,
+                filter_tmp.info, mul_elem, (int)kind));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
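Reviewer note on fftconvolve_multiply.hpp: the kernel treats the packed complex buffer as interleaved `(re, im)` scalars and multiplies element pairs by hand. The host-only sketch below mirrors the `AF_BATCH_NONE` / `AF_BATCH_SAME` branch only, assuming zero offsets and equal-length inputs. `multiplyInterleaved` is a hypothetical name and not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not part of ArrayFire): multiplies two
// arrays of complex numbers stored as interleaved (re, im) scalar pairs.
inline std::vector<float> multiplyInterleaved(const std::vector<float>& in1,
                                              const std::vector<float>& in2) {
    assert(in1.size() == in2.size() && in1.size() % 2 == 0);
    std::vector<float> out(in1.size());
    const std::size_t nelem = in1.size() / 2;  // number of complex elements
    for (std::size_t t = 0; t < nelem; ++t) {
        const std::size_t ridx = t * 2;      // index of the real part
        const std::size_t iidx = t * 2 + 1;  // index of the imaginary part
        const float a = in1[ridx], b = in1[iidx];
        const float c = in2[ridx], d = in2[iidx];
        out[ridx] = a * c - b * d;  // Re((a + bi)(c + di))
        out[iidx] = a * d + b * c;  // Im((a + bi)(c + di))
    }
    return out;
}
```

As a sanity check, `(1 + 2i)(3 + 4i) = -5 + 10i`, so the inputs `{1, 2}` and `{3, 4}` produce `{-5, 10}`.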
diff --git a/src/backend/oneapi/kernel/fftconvolve_pack.hpp b/src/backend/oneapi/kernel/fftconvolve_pack.hpp
new file mode 100644
index 0000000000..5f8afc2b7a
--- /dev/null
+++ b/src/backend/oneapi/kernel/fftconvolve_pack.hpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+#include <iostream>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename inputType, typename outputType>
+class fftconvolve_packCreateKernel {
+   public:
+    fftconvolve_packCreateKernel(write_accessor<outputType> d_out, KParam oInfo,
+                                 read_accessor<inputType> d_in, KParam iInfo,
+                                 const int di0_half, const int odd_di0)
+        : d_out_(d_out)
+        , oInfo_(oInfo)
+        , d_in_(d_in)
+        , iInfo_(iInfo)
+        , di0_half_(di0_half)
+        , odd_di0_(odd_di0) {}
+    void operator()(sycl::nd_item<1> it) const {
+        const int t = it.get_global_id(0);
+
+        const int tMax = oInfo_.strides[3] * oInfo_.dims[3];
+
+        if (t >= tMax) return;
+
+        // const int do0 = oInfo_.dims[0];
+        const int do1 = oInfo_.dims[1];
+        const int do2 = oInfo_.dims[2];
+
+        const int so1 = oInfo_.strides[1];
+        const int so2 = oInfo_.strides[2];
+        const int so3 = oInfo_.strides[3];
+
+        const int to0 = t % so1;
+        const int to1 = (t / so1) % do1;
+        const int to2 = (t / so2) % do2;
+        const int to3 = t / so3;
+
+        // const int di0 = iInfo_.dims[0];
+        const int di1 = iInfo_.dims[1];
+        const int di2 = iInfo_.dims[2];
+
+        const int si1 = iInfo_.strides[1];
+        const int si2 = iInfo_.strides[2];
+        const int si3 = iInfo_.strides[3];
+
+        const int ti0 = to0;
+        const int ti1 = to1 * si1;
+        const int ti2 = to2 * si2;
+        const int ti3 = to3 * si3;
+
+        const int iidx1 = iInfo_.offset + ti3 + ti2 + ti1 + ti0;
+        const int iidx2 = iidx1 + di0_half_;
+
+        // Treating complex output array as real-only array,
+        // thus, multiply strides by 2
+        const int oidx1 = oInfo_.offset + to3 * so3 * 2 + to2 * so2 * 2 +
+                          to1 * so1 * 2 + to0 * 2;
+        const int oidx2 = oidx1 + 1;
+
+        if (to0 < di0_half_ && to1 < di1 && to2 < di2) {
+            d_out_[oidx1] = (outputType)d_in_[iidx1];
+            if (ti0 == di0_half_ - 1 && odd_di0_ == 1)
+                d_out_[oidx2] = (outputType)0;
+            else
+                d_out_[oidx2] = (outputType)d_in_[iidx2];
+        } else {
+            // Pad remaining elements with 0s
+            d_out_[oidx1] = (outputType)0;
+            d_out_[oidx2] = (outputType)0;
+        }
+    }
+
+   private:
+    write_accessor<outputType> d_out_;
+    KParam oInfo_;
+    read_accessor<inputType> d_in_;
+    KParam iInfo_;
+    const int di0_half_;
+    const int odd_di0_;
+};
+
+template<typename convT, typename T>
+void packDataHelper(Param<convT> packed, Param<T> sig, Param<T> filter,
+                    const int rank, AF_BATCH_KIND kind) {
+    Param<T> sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
+
+    // Number of packed complex elements in dimension 0
+    int sig_half_d0     = divup(sig.info.dims[0], 2);
+    int sig_half_d0_odd = sig.info.dims[0] % 2;
+
+    int blocks = divup(sig_packed_elem, THREADS);
+
+    // Launch configuration: 1D range of THREADS-wide work-groups
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(blocks * THREADS);
+
+    // Treat complex output as an array of scalars
+    using convScalarT    = typename convT::value_type;
+    auto packed_num_elem = (*packed.data).get_range().size();
+    auto sig_tmp_buffer  = (*packed.data)
+                              .template reinterpret<convScalarT>(
+                                  sycl::range<1>{packed_num_elem * 2});
+
+    getQueue().submit([&](auto &h) {
+        read_accessor<T> d_sig                = {*sig.data, h};
+        write_accessor<convScalarT> d_sig_tmp = {sig_tmp_buffer, h};
+        h.parallel_for(sycl::nd_range{global, local},
+                       fftconvolve_packCreateKernel<T, convScalarT>(
+                           d_sig_tmp, sig_tmp.info, d_sig, sig.info,
+                           sig_half_d0, sig_half_d0_odd));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
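Reviewer note on fftconvolve_pack.hpp: along dimension 0 the kernel packs the first half of the real signal into the real parts and the second half into the imaginary parts of the complex output, zero-filling the last imaginary slot when the length is odd and zero-padding everything past the half length. A 1D host sketch of that layout, assuming zero offsets, is given here. `packHalves` is a hypothetical name and not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical 1D reference (not part of ArrayFire) for the packing done by
// fftconvolve_packCreateKernel. The result holds padded_len complex values
// as interleaved (re, im) scalars.
inline std::vector<float> packHalves(const std::vector<float>& sig,
                                     std::size_t padded_len) {
    const std::size_t half = (sig.size() + 1) / 2;  // divup(n, 2)
    const bool odd         = (sig.size() % 2) == 1;
    std::vector<float> out(padded_len * 2, 0.0f);   // zero padding by default
    for (std::size_t t = 0; t < padded_len && t < half; ++t) {
        out[2 * t] = sig[t];  // first half goes to the real part
        // Second half goes to the imaginary part; an odd-length signal has
        // no partner for the last real element, so that slot stays 0.
        const bool last_of_odd = odd && (t == half - 1);
        out[2 * t + 1] = last_of_odd ? 0.0f : sig[t + half];
    }
    return out;
}
```

For the signal `{1, 2, 3, 4, 5}` padded to 4 complex elements, the packed output is `{1, 4, 2, 5, 3, 0, 0, 0}`.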
diff --git a/src/backend/oneapi/kernel/fftconvolve_pad.hpp b/src/backend/oneapi/kernel/fftconvolve_pad.hpp
new file mode 100644
index 0000000000..6d60506236
--- /dev/null
+++ b/src/backend/oneapi/kernel/fftconvolve_pad.hpp
@@ -0,0 +1,122 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename inputType, typename outputType>
+class fftconvolve_padCreateKernel {
+   public:
+    fftconvolve_padCreateKernel(write_accessor<outputType> d_out, KParam oInfo,
+                                read_accessor<inputType> d_in, KParam iInfo)
+        : d_out_(d_out), oInfo_(oInfo), d_in_(d_in), iInfo_(iInfo) {}
+    void operator()(sycl::nd_item<1> it) const {
+        const int t = it.get_global_id(0);
+
+        const int tMax = oInfo_.strides[3] * oInfo_.dims[3];
+
+        if (t >= tMax) return;
+
+        // const int do0 = oInfo_.dims[0];
+        const int do1 = oInfo_.dims[1];
+        const int do2 = oInfo_.dims[2];
+
+        const int so1 = oInfo_.strides[1];
+        const int so2 = oInfo_.strides[2];
+        const int so3 = oInfo_.strides[3];
+
+        const int to0 = t % so1;
+        const int to1 = (t / so1) % do1;
+        const int to2 = (t / so2) % do2;
+        const int to3 = (t / so3);
+
+        const int di0 = iInfo_.dims[0];
+        const int di1 = iInfo_.dims[1];
+        const int di2 = iInfo_.dims[2];
+        const int di3 = iInfo_.dims[3];
+
+        const int si1 = iInfo_.strides[1];
+        const int si2 = iInfo_.strides[2];
+        const int si3 = iInfo_.strides[3];
+
+        const int ti0 = to0;
+        const int ti1 = to1 * si1;
+        const int ti2 = to2 * si2;
+        const int ti3 = to3 * si3;
+
+        const int iidx = iInfo_.offset + ti3 + ti2 + ti1 + ti0;
+
+        const int oidx = oInfo_.offset + t * 2;
+
+        if (to0 < di0 && to1 < di1 && to2 < di2 && to3 < di3) {
+            // Copy input elements to real elements, set imaginary elements to 0
+            d_out_[oidx]     = (outputType)d_in_[iidx];
+            d_out_[oidx + 1] = (outputType)0;
+        } else {
+            // Pad the remainder of the matrix with 0s
+            d_out_[oidx]     = (outputType)0;
+            d_out_[oidx + 1] = (outputType)0;
+        }
+    }
+
+   private:
+    write_accessor<outputType> d_out_;
+    KParam oInfo_;
+    read_accessor<inputType> d_in_;
+    KParam iInfo_;
+};
+
+template<typename convT, typename T>
+void padDataHelper(Param<convT> packed, Param<T> sig, Param<T> filter,
+                   const int rank, AF_BATCH_KIND kind) {
+    Param<T> sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    int filter_packed_elem =
+        filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
+
+    int blocks = divup(filter_packed_elem, THREADS);
+
+    // Launch configuration: 1D range of THREADS-wide work-groups
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(blocks * THREADS);
+
+    // Treat complex output as an array of scalars
+    using convScalarT      = typename convT::value_type;
+    auto packed_num_elem   = (*packed.data).get_range().size();
+    auto filter_tmp_buffer = (*packed.data)
+                                 .template reinterpret<convScalarT>(
+                                     sycl::range<1>{packed_num_elem * 2});
+
+    getQueue().submit([&](auto &h) {
+        read_accessor<T> d_filter = {*filter.data, h, sycl::read_only};
+        write_accessor<convScalarT> d_filter_tmp = {filter_tmp_buffer, h};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            fftconvolve_padCreateKernel<T, convScalarT>(
+                d_filter_tmp, filter_tmp.info, d_filter, filter.info));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/fftconvolve_reorder.hpp b/src/backend/oneapi/kernel/fftconvolve_reorder.hpp
new file mode 100644
index 0000000000..589242007a
--- /dev/null
+++ b/src/backend/oneapi/kernel/fftconvolve_reorder.hpp
@@ -0,0 +1,187 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T, typename convScalarT>
+class fftconvolve_reorderCreateKernel {
+   public:
+    fftconvolve_reorderCreateKernel(write_accessor<T> d_out, KParam oInfo,
+                                    read_accessor<convScalarT> d_in,
+                                    KParam iInfo, KParam fInfo,
+                                    const int half_di0, const int baseDim,
+                                    const int fftScale, const bool EXPAND,
+                                    const bool ROUND_OUT)
+        : d_out_(d_out)
+        , oInfo_(oInfo)
+        , d_in_(d_in)
+        , iInfo_(iInfo)
+        , fInfo_(fInfo)
+        , half_di0_(half_di0)
+        , baseDim_(baseDim)
+        , fftScale_(fftScale)
+        , EXPAND_(EXPAND)
+        , ROUND_OUT_(ROUND_OUT) {}
+    void operator()(sycl::nd_item<1> it) const {
+        const int t = it.get_global_id(0);
+
+        const int tMax = oInfo_.strides[3] * oInfo_.dims[3];
+
+        if (t >= tMax) return;
+
+        // const int do0 = oInfo_.dims[0];
+        const int do1 = oInfo_.dims[1];
+        const int do2 = oInfo_.dims[2];
+
+        const int so1 = oInfo_.strides[1];
+        const int so2 = oInfo_.strides[2];
+        const int so3 = oInfo_.strides[3];
+
+        // Treating complex input array as real-only array,
+        // thus, multiply dimension 0 and strides by 2
+        const int si1 = iInfo_.strides[1] * 2;
+        const int si2 = iInfo_.strides[2] * 2;
+        const int si3 = iInfo_.strides[3] * 2;
+
+        const int to0 = t % so1;
+        const int to1 = (t / so1) % do1;
+        const int to2 = (t / so2) % do2;
+        const int to3 = (t / so3);
+
+        int oidx = to3 * so3 + to2 * so2 + to1 * so1 + to0;
+
+        int ti0, ti1, ti2, ti3;
+        if (EXPAND_) {
+            ti0 = to0;
+            ti1 = to1 * si1;
+            ti2 = to2 * si2;
+            ti3 = to3 * si3;
+        } else {
+            ti0 = to0 + fInfo_.dims[0] / 2;
+            ti1 = (to1 + (baseDim_ > 1) * (fInfo_.dims[1] / 2)) * si1;
+            ti2 = (to2 + (baseDim_ > 2) * (fInfo_.dims[2] / 2)) * si2;
+            ti3 = to3 * si3;
+        }
+
+        // Divide output elements by the FFT scale; round the result if
+        // the output type is integral (ROUND_OUT_ is set for integer T)
+        if (ti0 < half_di0_) {
+            // Copy top elements
+            int iidx = iInfo_.offset + ti3 + ti2 + ti1 + ti0 * 2;
+            if (ROUND_OUT_)
+                d_out_[oidx] = (T)round(d_in_[iidx] / fftScale_);
+            else
+                d_out_[oidx] = (T)(d_in_[iidx] / fftScale_);
+        } else if (ti0 < half_di0_ + fInfo_.dims[0] - 1) {
+            // Add central elements
+            int iidx1 = iInfo_.offset + ti3 + ti2 + ti1 + ti0 * 2;
+            int iidx2 =
+                iInfo_.offset + ti3 + ti2 + ti1 + (ti0 - half_di0_) * 2 + 1;
+            if (ROUND_OUT_)
+                d_out_[oidx] =
+                    (T)round((d_in_[iidx1] + d_in_[iidx2]) / fftScale_);
+            else
+                d_out_[oidx] = (T)((d_in_[iidx1] + d_in_[iidx2]) / fftScale_);
+        } else {
+            // Copy bottom elements
+            const int iidx =
+                iInfo_.offset + ti3 + ti2 + ti1 + (ti0 - half_di0_) * 2 + 1;
+            if (ROUND_OUT_)
+                d_out_[oidx] = (T)round(d_in_[iidx] / fftScale_);
+            else
+                d_out_[oidx] = (T)(d_in_[iidx] / fftScale_);
+        }
+    }
+
+   private:
+    write_accessor<T> d_out_;
+    KParam oInfo_;
+    read_accessor<convScalarT> d_in_;
+    KParam iInfo_;
+    KParam fInfo_;
+    const int half_di0_;
+    const int baseDim_;
+    const int fftScale_;
+    const bool EXPAND_;
+    const bool ROUND_OUT_;
+};
+
+template<typename T, typename convT>
+void reorderOutputHelper(Param<T> out, Param<convT> packed, Param<T> sig,
+                         Param<T> filter, const int rank, AF_BATCH_KIND kind,
+                         bool expand) {
+    int fftScale = 1;
+
+    // Calculate the scale by which to divide the FFT results
+    for (int k = 0; k < rank; k++) fftScale *= packed.info.dims[k];
+
+    Param<T> sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    // Number of packed complex elements in dimension 0
+    int sig_half_d0 = divup(sig.info.dims[0], 2);
+
+    int blocks = divup(out.info.strides[3] * out.info.dims[3], THREADS);
+
+    constexpr bool round_out = std::is_integral<T>::value;
+
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(blocks * THREADS);
+
+    using convScalarT = typename convT::value_type;
+
+    if (kind == AF_BATCH_RHS) {
+        auto packed_num_elem   = (*packed.data).get_range().size();
+        auto filter_tmp_buffer = (*packed.data)
+                                     .template reinterpret<convScalarT>(
+                                         sycl::range<1>{packed_num_elem * 2});
+        getQueue().submit([&](auto &h) {
+            read_accessor<convScalarT> d_filter_tmp = {filter_tmp_buffer, h};
+            write_accessor<T> d_out = {*out.data, h, sycl::write_only};
+            h.parallel_for(
+                sycl::nd_range{global, local},
+                fftconvolve_reorderCreateKernel<T, convScalarT>(
+                    d_out, out.info, d_filter_tmp, filter_tmp.info, filter.info,
+                    sig_half_d0, rank, fftScale, expand, round_out));
+        });
+    } else {
+        auto packed_num_elem = (*packed.data).get_range().size();
+        auto sig_tmp_buffer  = (*packed.data)
+                                  .template reinterpret<convScalarT>(
+                                      sycl::range<1>{packed_num_elem * 2});
+        getQueue().submit([&](auto &h) {
+            read_accessor<convScalarT> d_sig_tmp = {sig_tmp_buffer, h,
+                                                    sycl::read_only};
+            write_accessor<T> d_out              = {*out.data, h};
+            h.parallel_for(
+                sycl::nd_range{global, local},
+                fftconvolve_reorderCreateKernel<T, convScalarT>(
+                    d_out, out.info, d_sig_tmp, sig_tmp.info, filter.info,
+                    sig_half_d0, rank, fftScale, expand, round_out));
+        });
+    }
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
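The reorder kernel above turns each flat work-item id `t` into four output coordinates (`to0`..`to3`) before mapping them to the packed input. A minimal host-side sketch of that decomposition follows; `unravelIndex` is a hypothetical helper, not part of ArrayFire, and it assumes a contiguous layout with `strides[0] == 1`, as the kernel's `t % so1` implies.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper (illustration only): split a flat offset t into four
// coordinates the same way the kernel computes to0, to1, to2 and to3.
// Assumes a contiguous layout, i.e. strides[0] == 1.
std::array<int, 4> unravelIndex(int t, const std::array<int, 4>& dims,
                                const std::array<int, 4>& strides) {
    assert(t >= 0 && strides[0] == 1);
    return {t % strides[1],              // to0
            (t / strides[1]) % dims[1],  // to1
            (t / strides[2]) % dims[2],  // to2
            t / strides[3]};             // to3
}
```

Recombining the coordinates with the same strides gives back `t`, which is how the kernel forms `oidx`.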
diff --git a/src/backend/oneapi/kernel/gradient.hpp b/src/backend/oneapi/kernel/gradient.hpp
new file mode 100644
index 0000000000..f8ae841444
--- /dev/null
+++ b/src/backend/oneapi/kernel/gradient.hpp
@@ -0,0 +1,158 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+#define sidx(y, x) scratch_[((y + 1) * (TX + 2)) + (x + 1)]
+
+template<typename T, int TX, int TY>
+class gradientCreateKernel {
+   public:
+    gradientCreateKernel(write_accessor<T> d_grad0, const KParam grad0,
+                         write_accessor<T> d_grad1, const KParam grad1,
+                         read_accessor<T> d_in, const KParam in,
+                         const int blocksPerMatX, const int blocksPerMatY,
+                         sycl::local_accessor<T> scratch)
+        : d_grad0_(d_grad0)
+        , grad0_(grad0)
+        , d_grad1_(d_grad1)
+        , grad1_(grad1)
+        , d_in_(d_in)
+        , in_(in)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY)
+        , scratch_(scratch) {}
+    void operator()(sycl::nd_item<2> it) const {
+        auto g = it.get_group();
+
+        const int idz = g.get_group_id(0) / blocksPerMatX_;
+        const int idw = g.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - idz * blocksPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - idw * blocksPerMatY_;
+
+        const int xB = blockIdx_x * g.get_local_range(0);
+        const int yB = blockIdx_y * g.get_local_range(1);
+
+        const int tx = it.get_local_id(0);
+        const int ty = it.get_local_id(1);
+
+        const int idx = tx + xB;
+        const int idy = ty + yB;
+
+        const bool cond = (idx >= in_.dims[0] || idy >= in_.dims[1] ||
+                           idz >= in_.dims[2] || idw >= in_.dims[3]);
+
+        int xmax = (TX > (in_.dims[0] - xB)) ? (in_.dims[0] - xB) : TX;
+        int ymax = (TY > (in_.dims[1] - yB)) ? (in_.dims[1] - yB) : TY;
+
+        int iIdx = in_.offset + idw * in_.strides[3] + idz * in_.strides[2] +
+                   idy * in_.strides[1] + idx;
+
+        int g0dx = idw * grad0_.strides[3] + idz * grad0_.strides[2] +
+                   idy * grad0_.strides[1] + idx;
+
+        int g1dx = idw * grad1_.strides[3] + idz * grad1_.strides[2] +
+                   idy * grad1_.strides[1] + idx;
+
+        // Multipliers - 0.5 for interior, 1 for edge cases
+        typename std::conditional<std::is_same<T, std::complex<double>>::value,
+                                  double, float>::type
+            xf = 0.5 * (1 + (idx == 0 || idx >= (in_.dims[0] - 1))),
+            yf = 0.5 * (1 + (idy == 0 || idy >= (in_.dims[1] - 1)));
+
+        // Copy data to scratch space
+        T zero = (T)(0);
+        if (cond) {
+            sidx(ty, tx) = zero;
+        } else {
+            sidx(ty, tx) = d_in_[iIdx];
+        }
+
+        it.barrier();
+
+        // Copy halo (buffer zone) data. Corners such as (0,0) are not used.
+        // Cols
+        if (ty == 0) {
+            // Y-1
+            sidx(-1, tx) =
+                (cond || idy == 0) ? sidx(0, tx) : d_in_[iIdx - in_.strides[1]];
+            sidx(ymax, tx) = (cond || (idy + ymax) >= in_.dims[1])
+                                 ? sidx(ymax - 1, tx)
+                                 : d_in_[iIdx + ymax * in_.strides[1]];
+        }
+        // Rows
+        if (tx == 0) {
+            sidx(ty, -1)   = (cond || idx == 0) ? sidx(ty, 0) : d_in_[iIdx - 1];
+            sidx(ty, xmax) = (cond || (idx + xmax) >= in_.dims[0])
+                                 ? sidx(ty, xmax - 1)
+                                 : d_in_[iIdx + xmax];
+        }
+
+        it.barrier();
+
+        if (cond) return;
+
+        d_grad0_[g0dx] = xf * (sidx(ty, tx + 1) - sidx(ty, tx - 1));
+        d_grad1_[g1dx] = yf * (sidx(ty + 1, tx) - sidx(ty - 1, tx));
+    }
+
+   private:
+    write_accessor<T> d_grad0_;
+    const KParam grad0_;
+    write_accessor<T> d_grad1_;
+    const KParam grad1_;
+    read_accessor<T> d_in_;
+    const KParam in_;
+    const int blocksPerMatX_;
+    const int blocksPerMatY_;
+    sycl::local_accessor<T> scratch_;
+};
+
+template<typename T>
+void gradient(Param<T> grad0, Param<T> grad1, const Param<T> in) {
+    constexpr int TX = 32;
+    constexpr int TY = 8;
+
+    auto local = sycl::range{TX, TY};
+
+    int blocksPerMatX = divup(in.info.dims[0], TX);
+    int blocksPerMatY = divup(in.info.dims[1], TY);
+    auto global       = sycl::range{local[0] * blocksPerMatX * in.info.dims[2],
+                              local[1] * blocksPerMatY * in.info.dims[3]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> grad0Acc{*grad0.data, h};
+        write_accessor<T> grad1Acc{*grad1.data, h};
+        read_accessor<T> inAcc{*in.data, h};
+        auto scratch = sycl::local_accessor<T>((TY + 2) * (TX + 2), h);
+        h.parallel_for(sycl::nd_range{global, local},
+                       gradientCreateKernel<T, TX, TY>(
+                           grad0Acc, grad0.info, grad1Acc, grad1.info, inAcc,
+                           in.info, blocksPerMatX, blocksPerMatY, scratch));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
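As a reading aid for the gradient kernel above, here is a host-only sketch of the same stencil along one dimension. It assumes the behaviour visible in the kernel: interior points use a central difference scaled by 0.5, and edge points replicate the border value and use a multiplier of 1. `gradient1d` is a hypothetical name and the sketch ignores tiling, batching and complex types.

```cpp
#include <cassert>
#include <vector>

// Hypothetical 1-D reference (illustration only) for the kernel's stencil.
std::vector<double> gradient1d(const std::vector<double>& x) {
    const int n = static_cast<int>(x.size());
    assert(n > 0);
    std::vector<double> g(n);
    for (int i = 0; i < n; ++i) {
        const bool edge = (i == 0 || i >= n - 1);
        // The halo replicates the border value, so the missing neighbour
        // is replaced by the element itself.
        const double left  = x[i == 0 ? 0 : i - 1];
        const double right = x[i >= n - 1 ? n - 1 : i + 1];
        g[i] = (edge ? 1.0 : 0.5) * (right - left);
    }
    return g;
}
```

For the input `{1, 4, 9, 16}` this yields one-sided differences at both ends and central differences in between.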
diff --git a/src/backend/oneapi/kernel/histogram.hpp b/src/backend/oneapi/kernel/histogram.hpp
new file mode 100644
index 0000000000..bd574c9e2d
--- /dev/null
+++ b/src/backend/oneapi/kernel/histogram.hpp
@@ -0,0 +1,159 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+#define MAX_BINS 4000
+#define THREADS_X 256
+#define THRD_LOAD 16
+
+// using memory_order = memory_order;
+// using memory_scope = memory_scope;
+
+template<typename T>
+using local_atomic_ref =
+    sycl::atomic_ref<T, sycl::memory_order::relaxed,
+                     sycl::memory_scope::work_group,
+                     sycl::access::address_space::local_space>;
+
+template<typename T>
+using global_atomic_ref =
+    sycl::atomic_ref<T, sycl::memory_order::relaxed, sycl::memory_scope::system,
+                     sycl::access::address_space::global_space>;
+
+template<typename T>
+class histogramKernel {
+   public:
+    histogramKernel(write_accessor<uint> d_dst, KParam oInfo,
+                    const read_accessor<T> d_src, KParam iInfo,
+                    sycl::local_accessor<uint, 1> localMemAcc, int len,
+                    int nbins, float minval, float maxval, int nBBS,
+                    const bool isLinear)
+        : d_dst_(d_dst)
+        , oInfo_(oInfo)
+        , d_src_(d_src)
+        , iInfo_(iInfo)
+        , localMemAcc_(localMemAcc)
+        , len_(len)
+        , nbins_(nbins)
+        , minval_(minval)
+        , maxval_(maxval)
+        , nBBS_(nBBS)
+        , isLinear_(isLinear) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        unsigned b2   = g.get_group_id(0) / nBBS_;
+        int start     = (g.get_group_id(0) - b2 * nBBS_) * THRD_LOAD *
+                        g.get_local_range(0) +
+                    it.get_local_id(0);
+        int end =
+            sycl::min((int)(start + THRD_LOAD * g.get_local_range(0)), len_);
+
+        // offset input and output to account for batch ops
+        const T *in = d_src_.get_pointer() + b2 * iInfo_.strides[2] +
+                      g.get_group_id(1) * iInfo_.strides[3] + iInfo_.offset;
+        uint outOffset =
+            b2 * oInfo_.strides[2] + g.get_group_id(1) * oInfo_.strides[3];
+
+        float dx = (maxval_ - minval_) / (float)nbins_;
+
+        bool use_global = nbins_ > MAX_BINS;
+
+        if (!use_global) {
+            for (int i = it.get_local_id(0); i < nbins_;
+                 i += g.get_local_range(0))
+                localMemAcc_[i] = 0;
+            it.barrier();
+        }
+
+        for (int row = start; row < end; row += g.get_local_range(0)) {
+            const int i0  = row % iInfo_.dims[0];
+            const int i1  = row / iInfo_.dims[0];
+            const int idx = isLinear_ ? row : i0 + i1 * iInfo_.strides[1];
+
+            int bin = (int)(((float)in[idx] - minval_) / dx);
+            bin     = sycl::max(bin, 0);
+            bin     = sycl::min(bin, (int)nbins_ - 1);
+
+            if (use_global) {
+                global_atomic_ref<uint>(d_dst_[outOffset + bin])++;
+            } else {
+                local_atomic_ref<uint>(localMemAcc_[bin])++;
+            }
+        }
+
+        if (!use_global) {
+            it.barrier();
+            for (int i = it.get_local_id(0); i < nbins_;
+                 i += g.get_local_range(0)) {
+                global_atomic_ref<uint>(d_dst_[outOffset + i]) +=
+                    localMemAcc_[i];
+            }
+        }
+    }
+
+   private:
+    write_accessor<uint> d_dst_;
+    KParam oInfo_;
+    read_accessor<T> d_src_;
+    KParam iInfo_;
+    sycl::local_accessor<uint, 1> localMemAcc_;
+    int len_;
+    int nbins_;
+    float minval_;
+    float maxval_;
+    int nBBS_;
+    bool isLinear_;
+};
+
+template<typename T>
+void histogram(Param<uint> out, const Param<T> in, int nbins, float minval,
+               float maxval, bool isLinear) {
+    int nElems  = in.info.dims[0] * in.info.dims[1];
+    int blk_x   = divup(nElems, THRD_LOAD * THREADS_X);
+    int locSize = nbins <= MAX_BINS ? nbins : 1;  // element count, not bytes
+
+    auto local           = sycl::range{THREADS_X, 1};
+    const size_t global0 = blk_x * in.info.dims[2] * THREADS_X;
+    const size_t global1 = in.info.dims[3];
+    auto global          = sycl::range{global0, global1};
+
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<T> inAcc{*in.data, h};
+        write_accessor<uint> outAcc{*out.data, h};
+
+        auto localMem = sycl::local_accessor<uint, 1>(locSize, h);
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            histogramKernel<T>(outAcc, out.info, inAcc, in.info, localMem,
+                               nElems, nbins, minval, maxval, blk_x, isLinear));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
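The binning rule used by the histogram kernel above can be checked on the host with a short sequential sketch. `histogramRef` is a hypothetical name; the sketch assumes the same bin width `dx = (maxval - minval) / nbins` and the same clamping of out-of-range values into the first and last bins, and leaves out atomics, local memory and batching.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sequential reference (illustration only) for the binning rule.
std::vector<unsigned> histogramRef(const std::vector<float>& in, int nbins,
                                   float minval, float maxval) {
    assert(nbins > 0 && maxval > minval);
    const float dx = (maxval - minval) / static_cast<float>(nbins);
    std::vector<unsigned> out(nbins, 0u);
    for (float v : in) {
        int bin = static_cast<int>((v - minval) / dx);
        bin     = std::max(bin, 0);          // values below minval go to bin 0
        bin     = std::min(bin, nbins - 1);  // values >= maxval go to the last bin
        ++out[bin];
    }
    return out;
}
```

Note that a value equal to `maxval` lands in the last bin because of the upper clamp.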
diff --git a/src/backend/oneapi/kernel/identity.hpp b/src/backend/oneapi/kernel/identity.hpp
new file mode 100644
index 0000000000..0f6911606a
--- /dev/null
+++ b/src/backend/oneapi/kernel/identity.hpp
@@ -0,0 +1,84 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class identityKernel {
+   public:
+    identityKernel(write_accessor<T> out, KParam oInfo, const int groups_x,
+                   const int groups_y)
+        : out_(out), oInfo_(oInfo), groups_x_(groups_x), groups_y_(groups_y) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        size_t idz = g.get_group_id(0) / groups_x_;
+        size_t idw = g.get_group_id(1) / groups_y_;
+
+        size_t groupId_x = g.get_group_id(0) - idz * groups_x_;
+        size_t groupId_y = g.get_group_id(1) - idw * groups_y_;
+
+        size_t idx = it.get_local_id(0) + groupId_x * g.get_local_range(0);
+        size_t idy = it.get_local_id(1) + groupId_y * g.get_local_range(1);
+
+        size_t xlim = oInfo_.dims[0];
+        size_t ylim = oInfo_.dims[1];
+        size_t zlim = oInfo_.dims[2];
+        size_t wlim = oInfo_.dims[3];
+        if (idx < xlim && idy < ylim && idz < zlim && idw < wlim) {
+            const T one  = scalar<T>(1);
+            const T zero = scalar<T>(0);
+
+            T *ptr = out_.get_pointer() + idz * oInfo_.strides[2] +
+                     idw * oInfo_.strides[3];
+            T val                              = (idx == idy) ? one : zero;
+            ptr[idx + idy * oInfo_.strides[1]] = val;
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    int groups_x_;
+    int groups_y_;
+};
+
+template<typename T>
+void identity(Param<T> out) {
+    sycl::range<2> local{32, 8};
+
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_y = divup(out.info.dims[1], local[1]);
+    sycl::range<2> global{groups_x * out.info.dims[2] * local[0],
+                          groups_y * out.info.dims[3] * local[1]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> oData{*out.data, h};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       identityKernel<T>(oData, out.info, groups_x, groups_y));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/iir.hpp b/src/backend/oneapi/kernel/iir.hpp
new file mode 100644
index 0000000000..938202f32f
--- /dev/null
+++ b/src/backend/oneapi/kernel/iir.hpp
@@ -0,0 +1,151 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T, bool batch_a>
+class iirKernel {
+   public:
+    iirKernel(write_accessor<T> y, KParam yInfo, read_accessor<T> c,
+              KParam cInfo, read_accessor<T> a, KParam aInfo,
+              sycl::local_accessor<T> s_z, sycl::local_accessor<T> s_a,
+              sycl::local_accessor<T> s_y, int groups_y)
+        : y_(y)
+        , yInfo_(yInfo)
+        , c_(c)
+        , cInfo_(cInfo)
+        , a_(a)
+        , aInfo_(aInfo)
+        , s_z_(s_z)
+        , s_a_(s_a)
+        , s_y_(s_y)
+        , groups_y_(groups_y) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const int idz = g.get_group_id(0);
+        const int idw = g.get_group_id(1) / groups_y_;
+        const int idy = g.get_group_id(1) - idw * groups_y_;
+
+        const int tx    = it.get_local_id(0);
+        const int num_a = aInfo_.dims[0];
+
+        int y_off = idw * yInfo_.strides[3] + idz * yInfo_.strides[2] +
+                    idy * yInfo_.strides[1];
+        int c_off = idw * cInfo_.strides[3] + idz * cInfo_.strides[2] +
+                    idy * cInfo_.strides[1];
+        int a_off = 0;
+
+        if (batch_a)
+            a_off = idw * aInfo_.strides[3] + idz * aInfo_.strides[2] +
+                    idy * aInfo_.strides[1];
+
+        T *d_y       = y_.get_pointer() + y_off;
+        const T *d_c = c_.get_pointer() + c_off;
+        const T *d_a = a_.get_pointer() + a_off;
+        const int repeat =
+            (num_a + g.get_local_range(0) - 1) / g.get_local_range(0);
+
+        for (int ii = tx; ii < num_a; ii += g.get_local_range(0)) {
+            s_z_[ii] = scalar<T>(0);
+            s_a_[ii] = (ii < num_a) ? d_a[ii] : scalar<T>(0);
+        }
+        group_barrier(g);
+
+        for (int i = 0; i < yInfo_.dims[0]; i++) {
+            if (tx == 0) {
+                s_y_[0] = (d_c[i] + s_z_[0]) / s_a_[0];
+                d_y[i]  = s_y_[0];
+            }
+            group_barrier(g);
+
+            for (int ii = 0; ii < repeat; ii++) {
+                int id = ii * g.get_local_range(0) + tx + 1;
+
+                T z;
+
+                if (id < num_a) {
+                    z = s_z_[id] - s_a_[id] * s_y_[0];
+                } else {
+                    z = scalar<T>(0);
+                }
+                group_barrier(g);
+
+                if ((id - 1) < num_a) { s_z_[id - 1] = z; }
+                group_barrier(g);
+            }
+        }
+    }
+
+   protected:
+    write_accessor<T> y_;
+    KParam yInfo_;
+    read_accessor<T> c_;
+    KParam cInfo_;
+    read_accessor<T> a_;
+    KParam aInfo_;
+    sycl::local_accessor<T> s_z_;
+    sycl::local_accessor<T> s_a_;
+    sycl::local_accessor<T> s_y_;
+    int groups_y_;
+};
+
+template<typename T, bool batch_a>
+void iir(Param<T> y, Param<T> c, Param<T> a) {
+    const size_t groups_y = y.info.dims[1];
+    const size_t groups_x = y.info.dims[2];
+
+    size_t threads = 256;
+    while (threads > y.info.dims[0] && threads > 32) threads /= 2;
+    sycl::range<2> local = sycl::range{threads, 1};
+
+    sycl::range<2> global =
+        sycl::range<2>{groups_x * local[0], groups_y * y.info.dims[3]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> yAcc{*y.data, h};
+        read_accessor<T> cAcc{*c.data, h};
+        read_accessor<T> aAcc{*a.data, h};
+
+        unsigned num_a = a.info.dims[0];
+
+        auto s_z = sycl::local_accessor<T>(num_a, h);
+        auto s_a = sycl::local_accessor<T>(num_a, h);
+        auto s_y = sycl::local_accessor<T>(1, h);
+
+        if (batch_a) {
+            h.parallel_for(sycl::nd_range{global, local},
+                           iirKernel<T, true>(yAcc, y.info, cAcc, c.info, aAcc,
+                                              a.info, s_z, s_a, s_y, groups_y));
+        } else {
+            h.parallel_for(
+                sycl::nd_range{global, local},
+                iirKernel<T, false>(yAcc, y.info, cAcc, c.info, aAcc, a.info,
+                                    s_z, s_a, s_y, groups_y));
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
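The IIR kernel above keeps a delay line in local memory and updates it cooperatively after each output sample. A sequential host sketch of the same recurrence is given below. `iirRef` is a hypothetical name, and the sketch assumes, as the kernel's arguments suggest, that `c` already holds the feed-forward part of the filter and `a` holds the feedback coefficients with `a[0] != 0`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential reference (illustration only) for the recurrence
// y[i] = (c[i] + z[0]) / a[0], followed by a shift of the delay line z.
std::vector<double> iirRef(const std::vector<double>& c,
                           const std::vector<double>& a) {
    const int na = static_cast<int>(a.size());
    assert(na > 0 && a[0] != 0.0);
    std::vector<double> z(na, 0.0);
    std::vector<double> y(c.size());
    for (std::size_t i = 0; i < c.size(); ++i) {
        y[i] = (c[i] + z[0]) / a[0];
        // z[k] is read before it is overwritten in the next iteration, so the
        // in-place update matches the kernel's read/barrier/write sequence.
        for (int k = 1; k < na; ++k) z[k - 1] = z[k] - a[k] * y[i];
        z[na - 1] = 0.0;
    }
    return y;
}
```

With `a = {1, -0.5}` and a unit impulse in `c`, the output halves at every step, which is the expected response of `y[n] = c[n] + 0.5 * y[n - 1]`.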
diff --git a/src/backend/oneapi/kernel/index.hpp b/src/backend/oneapi/kernel/index.hpp
new file mode 100644
index 0000000000..e86c0bd808
--- /dev/null
+++ b/src/backend/oneapi/kernel/index.hpp
@@ -0,0 +1,163 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/assign_kernel_param.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class indexKernel {
+    write_accessor<T> out;
+    KParam outp;
+    read_accessor<T> in;
+    KParam inp;
+    IndexKernelParam p;
+    int nBBS0;
+    int nBBS1;
+
+   public:
+    indexKernel(write_accessor<T> out_, KParam outp_, read_accessor<T> in_,
+                KParam inp_, const IndexKernelParam p_, const int nBBS0_,
+                const int nBBS1_)
+        : out(out_)
+        , outp(outp_)
+        , in(in_)
+        , inp(inp_)
+        , p(p_)
+        , nBBS0(nBBS0_)
+        , nBBS1(nBBS1_) {}
+
+    int trimIndex(int idx, const int len) const {
+        int ret_val = idx;
+        if (ret_val < 0) {
+            int offset = (abs(ret_val) - 1) % len;
+            ret_val    = offset;
+        } else if (ret_val >= len) {
+            int offset = abs(ret_val) % len;
+            ret_val    = len - offset - 1;
+        }
+        return ret_val;
+    }
+
+    void operator()(sycl::nd_item<3> it) const {
+        // retrieve index pointers
+        // these can be 0 where af_array index is not used
+        sycl::group g    = it.get_group();
+        const uint* ptr0 = p.ptr[0].get_pointer();
+        const uint* ptr1 = p.ptr[1].get_pointer();
+        const uint* ptr2 = p.ptr[2].get_pointer();
+        const uint* ptr3 = p.ptr[3].get_pointer();
+        // retrieve booleans that tell us which index to use
+        const bool s0 = p.isSeq[0];
+        const bool s1 = p.isSeq[1];
+        const bool s2 = p.isSeq[2];
+        const bool s3 = p.isSeq[3];
+
+        const int gz = g.get_group_id(0) / nBBS0;
+        const int gx = g.get_local_range(0) * (g.get_group_id(0) - gz * nBBS0) +
+                       it.get_local_id(0);
+
+        const int gw =
+            (g.get_group_id(1) + g.get_group_id(2) * g.get_group_range(1)) /
+            nBBS1;
+        const int gy =
+            g.get_local_range(1) * ((g.get_group_id(1) +
+                                     g.get_group_id(2) * g.get_group_range(1)) -
+                                    gw * nBBS1) +
+            it.get_local_id(1);
+
+        size_t odims0 = outp.dims[0];
+        size_t odims1 = outp.dims[1];
+        size_t odims2 = outp.dims[2];
+        size_t odims3 = outp.dims[3];
+
+        if (gx < odims0 && gy < odims1 && gz < odims2 && gw < odims3) {
+            // calculate pointer offsets for input
+            int i = p.strds[0] *
+                    trimIndex(s0 ? gx * p.steps[0] + p.offs[0] : ptr0[gx],
+                              inp.dims[0]);
+            int j = p.strds[1] *
+                    trimIndex(s1 ? gy * p.steps[1] + p.offs[1] : ptr1[gy],
+                              inp.dims[1]);
+            int k = p.strds[2] *
+                    trimIndex(s2 ? gz * p.steps[2] + p.offs[2] : ptr2[gz],
+                              inp.dims[2]);
+            int l = p.strds[3] *
+                    trimIndex(s3 ? gw * p.steps[3] + p.offs[3] : ptr3[gw],
+                              inp.dims[3]);
+            // offset input and output pointers
+            const T* src = (const T*)in.get_pointer() + (i + j + k + l);
+            T* dst       = (T*)out.get_pointer() +
+                     (gx * outp.strides[0] + gy * outp.strides[1] +
+                      gz * outp.strides[2] + gw * outp.strides[3]);
+            // set the output
+            dst[0] = src[0];
+        }
+    }
+};
+
+template<typename T>
+void index(Param<T> out, Param<T> in, IndexKernelParam& p,
+           std::vector<Array<uint>>& idxArrs) {
+    sycl::range<3> threads(0, 0, 1);
+    switch (out.info.dims[1]) {
+        case 1: threads[1] = 1; break;
+        case 2: threads[1] = 2; break;
+        case 3:
+        case 4: threads[1] = 4; break;
+        default: threads[1] = 8; break;
+    }
+    threads[0] = static_cast<unsigned>(256.f / threads[1]);
+
+    int blks_x = divup(out.info.dims[0], threads[0]);
+    int blks_y = divup(out.info.dims[1], threads[1]);
+
+    sycl::range<3> blocks(blks_x * out.info.dims[2], blks_y * out.info.dims[3],
+                          1);
+
+    const size_t maxBlocksY =
+        getDevice().get_info<sycl::info::device::max_work_item_sizes<3>>()[2];
+    blocks[2] = divup(blocks[1], maxBlocksY);
+    blocks[1] = divup(blocks[1], blocks[2]);
+    blocks[1] = blocks[1] * threads[1];
+    blocks[0] *= threads[0];
+
+    sycl::nd_range<3> marange(blocks, threads);
+    sycl::buffer<uint> *idxArrs_get[4];
+    for (dim_t x = 0; x < 4; ++x)
+        idxArrs_get[x] = idxArrs[x].get();
+    getQueue().submit([&](sycl::handler& h) {
+        auto pp = p;
+        for (dim_t x = 0; x < 4; ++x) {
+            pp.ptr[x] =
+                idxArrs_get[x]->get_access<sycl::access::mode::read>(h);
+        }
+
+        h.parallel_for(
+            marange,
+            indexKernel<T>(
+                out.data->template get_access<sycl::access::mode::write>(h),
+                out.info,
+                in.data->template get_access<sycl::access::mode::read>(h),
+                in.info, pp, blks_x, blks_y));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
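The `trimIndex` member of the index kernel above folds out-of-range indices back into `[0, len)` by mirroring them at the array borders. The host-side copy below reproduces that logic so its behaviour can be checked in isolation; it is a sketch for illustration, and the `assert` on `len` is an added assumption that the kernel itself does not enforce.

```cpp
#include <cassert>
#include <cstdlib>

// Host-side copy of the kernel's trimIndex logic (illustration only).
int trimIndex(int idx, int len) {
    assert(len > 0);
    if (idx < 0) {
        // -1 maps to 0, -2 maps to 1, and so on, wrapping modulo len
        return (std::abs(idx) - 1) % len;
    }
    if (idx >= len) {
        // len maps to len - 1, len + 1 maps to len - 2, and so on
        return len - (idx % len) - 1;
    }
    return idx;
}
```

In-range indices pass through unchanged, so the mirroring only affects indices that would otherwise read outside the input.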
diff --git a/src/backend/oneapi/kernel/interp.hpp b/src/backend/oneapi/kernel/interp.hpp
new file mode 100644
index 0000000000..bfc894dfdf
--- /dev/null
+++ b/src/backend/oneapi/kernel/interp.hpp
@@ -0,0 +1,345 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+#include <types.hpp>
+#include <af/constants.h>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+struct itype_t {
+    typedef float wtype;
+    typedef float vtype;
+};
+
+template<>
+struct itype_t<double> {
+    typedef double wtype;
+    typedef double vtype;
+};
+
+template<>
+struct itype_t<cfloat> {
+    typedef float wtype;
+    typedef cfloat vtype;
+};
+
+template<>
+struct itype_t<cdouble> {
+    typedef double wtype;
+    typedef cdouble vtype;
+};
+
+template<typename Ty, typename Tp>
+Ty linearInterpFunc(Ty val[2], Tp ratio) {
+    return (1 - ratio) * val[0] + ratio * val[1];
+}
+
+template<typename Ty, typename Tp>
+Ty bilinearInterpFunc(Ty val[2][2], Tp xratio, Tp yratio) {
+    Ty res[2];
+    res[0] = linearInterpFunc(val[0], xratio);
+    res[1] = linearInterpFunc(val[1], xratio);
+    return linearInterpFunc(res, yratio);
+}
+
+template<typename Ty, typename Tp>
+inline static Ty cubicInterpFunc(Ty val[4], Tp xratio, bool spline) {
+    Ty a0, a1, a2, a3;
+    if (spline) {
+        a0 = scalar<Ty>(-0.5) * val[0] + scalar<Ty>(1.5) * val[1] +
+             scalar<Ty>(-1.5) * val[2] + scalar<Ty>(0.5) * val[3];
+
+        a1 = scalar<Ty>(1.0) * val[0] + scalar<Ty>(-2.5) * val[1] +
+             scalar<Ty>(2.0) * val[2] + scalar<Ty>(-0.5) * val[3];
+
+        a2 = scalar<Ty>(-0.5) * val[0] + scalar<Ty>(0.5) * val[2];
+
+        a3 = val[1];
+    } else {
+        a0 = val[3] - val[2] - val[0] + val[1];
+        a1 = val[0] - val[1] - a0;
+        a2 = val[2] - val[0];
+        a3 = val[1];
+    }
+
+    Tp xratio2 = xratio * xratio;
+    Tp xratio3 = xratio2 * xratio;
+
+    return a0 * xratio3 + a1 * xratio2 + a2 * xratio + a3;
+}
+
+template<typename Ty, typename Tp>
+inline static Ty bicubicInterpFunc(Ty val[4][4], Tp xratio, Tp yratio,
+                                   bool spline) {
+    Ty res[4];
+    res[0] = cubicInterpFunc(val[0], xratio, spline);
+    res[1] = cubicInterpFunc(val[1], xratio, spline);
+    res[2] = cubicInterpFunc(val[2], xratio, spline);
+    res[3] = cubicInterpFunc(val[3], xratio, spline);
+    return cubicInterpFunc(res, yratio, spline);
+}
+
+template<typename Ty, typename Tp, int order>
+struct Interp1 {};
+
+template<typename Ty, typename Tp>
+struct Interp1<Ty, Tp, 1> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x,
+                    int xdim, af::interpType method, int batch, bool clamp,
+                    int batch_dim = 1) {
+        Ty zero = scalar<Ty>(0);
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int x_stride = iInfo.strides[xdim];
+
+        int xid = (method == AF_INTERP_LOWER ? sycl::floor(x) : sycl::round(x));
+        bool cond = xid >= 0 && xid < x_lim;
+        if (clamp) xid = sycl::max((int)0, sycl::min(xid, x_lim - 1));
+
+        const int idx = ioff + xid * x_stride;
+
+        for (int n = 0; n < batch; n++) {
+            Ty outval =
+                (cond || clamp) ? in[idx + n * iInfo.strides[batch_dim]] : zero;
+            out[ooff + n * oInfo.strides[batch_dim]] = outval;
+        }
+    }
+};
+
+template<typename Ty, typename Tp>
+struct Interp1<Ty, Tp, 2> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x,
+                    int xdim, af::interpType method, int batch, bool clamp,
+                    int batch_dim = 1) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = sycl::floor(x);  // lower grid point
+        const WT off_x   = x - grid_x;      // fractional offset
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int x_stride = iInfo.strides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[2] = {true, grid_x + 1 < x_lim};
+        int offx[2]  = {0, cond[1] ? 1 : 0};
+        WT ratio     = off_x;
+        if (method == AF_INTERP_LINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            ratio = (1 - sycl::cospi(ratio)) / 2;
+        }
+
+        Ty zero = scalar<Ty>(0);
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * iInfo.strides[batch_dim];
+            VT val[2] = {
+                (clamp || cond[0]) ? in[idx_n + offx[0] * x_stride] : zero,
+                (clamp || cond[1]) ? in[idx_n + offx[1] * x_stride] : zero};
+            out[ooff + n * oInfo.strides[batch_dim]] =
+                linearInterpFunc(val, ratio);
+        }
+    }
+};
+
+template<typename Ty, typename Tp>
+struct Interp1<Ty, Tp, 3> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x,
+                    int xdim, af::interpType method, int batch, bool clamp,
+                    int batch_dim = 1) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = sycl::floor(x);  // lower grid point
+        const WT off_x   = x - grid_x;      // fractional offset
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int x_stride = iInfo.strides[xdim];
+        const int idx      = ioff + grid_x * x_stride;
+
+        bool cond[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                        grid_x + 2 < x_lim};
+        int offx[4]  = {cond[0] ? -1 : 0, 0, cond[2] ? 1 : 0,
+                       cond[3] ? 2 : (cond[2] ? 1 : 0)};
+
+        bool spline = method == AF_INTERP_CUBIC_SPLINE;
+        Ty zero     = scalar<Ty>(0);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * iInfo.strides[batch_dim];
+            VT val[4];
+            for (int i = 0; i < 4; i++) {
+                val[i] =
+                    (clamp || cond[i]) ? in[idx_n + offx[i] * x_stride] : zero;
+            }
+            out[ooff + n * oInfo.strides[batch_dim]] =
+                cubicInterpFunc(val, off_x, spline);
+        }
+    }
+};
+
+template<typename Ty, typename Tp, int order>
+struct Interp2 {};
+
+template<typename Ty, typename Tp>
+struct Interp2<Ty, Tp, 1> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x, Tp y,
+                    int xdim, int ydim, af::interpType method, int batch,
+                    bool clamp, int batch_dim = 2) {
+        int xid = (method == AF_INTERP_LOWER ? sycl::floor(x) : sycl::round(x));
+        int yid = (method == AF_INTERP_LOWER ? sycl::floor(y) : sycl::round(y));
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int y_lim    = iInfo.dims[ydim];
+        const int x_stride = iInfo.strides[xdim];
+        const int y_stride = iInfo.strides[ydim];
+
+        if (clamp) {
+            xid = sycl::max(0, sycl::min(xid, x_lim - 1));
+            yid = sycl::max(0, sycl::min(yid, y_lim - 1));
+        }
+
+        const int idx = ioff + yid * y_stride + xid * x_stride;
+
+        bool condX = xid >= 0 && xid < x_lim;
+        bool condY = yid >= 0 && yid < y_lim;
+
+        Ty zero   = scalar<Ty>(0);
+        bool cond = condX && condY;
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * iInfo.strides[batch_dim];
+            Ty val    = (clamp || cond) ? in[idx_n] : zero;
+            out[ooff + n * oInfo.strides[batch_dim]] = val;
+        }
+    }
+};
+
+template<typename Ty, typename Tp>
+struct Interp2<Ty, Tp, 2> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x, Tp y,
+                    int xdim, int ydim, af::interpType method, int batch,
+                    bool clamp, int batch_dim = 2) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = sycl::floor(x);
+        const WT off_x   = x - grid_x;
+
+        const int grid_y = sycl::floor(y);
+        const WT off_y   = y - grid_y;
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int y_lim    = iInfo.dims[ydim];
+        const int x_stride = iInfo.strides[xdim];
+        const int y_stride = iInfo.strides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        bool condX[2] = {true, grid_x + 1 < x_lim};
+        bool condY[2] = {true, grid_y + 1 < y_lim};
+        int offx[2]   = {0, condX[1] ? 1 : 0};
+        int offy[2]   = {0, condY[1] ? 1 : 0};
+
+        WT xratio = off_x, yratio = off_y;
+        if (method == AF_INTERP_LINEAR_COSINE ||
+            method == AF_INTERP_BILINEAR_COSINE) {
+            // Smooth the fractional part with cosine
+            xratio = (1 - sycl::cospi(xratio)) / 2;
+            yratio = (1 - sycl::cospi(yratio)) / 2;
+        }
+
+        Ty zero = scalar<Ty>(0);
+
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * iInfo.strides[batch_dim];
+            VT val[2][2];
+            for (int j = 0; j < 2; j++) {
+                int ioff_j = idx_n + offy[j] * y_stride;
+                for (int i = 0; i < 2; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] = (cond) ? in[ioff_j + offx[i] * x_stride] : zero;
+                }
+            }
+            out[ooff + n * oInfo.strides[batch_dim]] =
+                bilinearInterpFunc(val, xratio, yratio);
+        }
+    }
+};
+
+template<typename Ty, typename Tp>
+struct Interp2<Ty, Tp, 3> {
+    void operator()(write_accessor<Ty> out, KParam oInfo, int ooff,
+                    read_accessor<Ty> in, KParam iInfo, int ioff, Tp x, Tp y,
+                    int xdim, int ydim, af::interpType method, int batch,
+                    bool clamp, int batch_dim = 2) {
+        typedef typename itype_t<Tp>::wtype WT;
+        typedef typename itype_t<Ty>::vtype VT;
+
+        const int grid_x = sycl::floor(x);
+        const WT off_x   = x - grid_x;
+
+        const int grid_y = sycl::floor(y);
+        const WT off_y   = y - grid_y;
+
+        const int x_lim    = iInfo.dims[xdim];
+        const int y_lim    = iInfo.dims[ydim];
+        const int x_stride = iInfo.strides[xdim];
+        const int y_stride = iInfo.strides[ydim];
+        const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+        // used for setting values at boundaries
+        bool condX[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                         grid_x + 2 < x_lim};
+        bool condY[4] = {grid_y - 1 >= 0, true, grid_y + 1 < y_lim,
+                         grid_y + 2 < y_lim};
+        int offX[4]   = {condX[0] ? -1 : 0, 0, condX[2] ? 1 : 0,
+                       condX[3] ? 2 : (condX[2] ? 1 : 0)};
+        int offY[4]   = {condY[0] ? -1 : 0, 0, condY[2] ? 1 : 0,
+                       condY[3] ? 2 : (condY[2] ? 1 : 0)};
+
+        // for bicubic interpolation, work with 4x4 val at a time
+        Ty zero     = scalar<Ty>(0);
+        bool spline = (method == AF_INTERP_CUBIC_SPLINE ||
+                       method == AF_INTERP_BICUBIC_SPLINE);
+        for (int n = 0; n < batch; n++) {
+            int idx_n = idx + n * iInfo.strides[batch_dim];
+            VT val[4][4];
+#pragma unroll
+            for (int j = 0; j < 4; j++) {
+                int ioff_j = idx_n + offY[j] * y_stride;
+#pragma unroll
+                for (int i = 0; i < 4; i++) {
+                    bool cond = clamp || (condX[i] && condY[j]);
+                    val[j][i] = (cond) ? in[ioff_j + offX[i] * x_stride] : zero;
+                }
+            }
+
+            out[ooff + n * oInfo.strides[batch_dim]] =
+                bicubicInterpFunc(val, off_x, off_y, spline);
+        }
+    }
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
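Review note on `interp.hpp`: the following host-only sketch mirrors the interpolation formulas in the hunk above so the coefficients can be checked off-device. It assumes `double` samples and uses `std::cos` as a stand-in for `sycl::cospi`. The names `linearRef`, `cosineRatio`, and `cubicRef` are hypothetical and are not part of this patch.

```cpp
#include <cmath>

// Host-side mirror of linearInterpFunc: blend two samples by ratio in [0, 1].
double linearRef(const double val[2], double ratio) {
    return (1 - ratio) * val[0] + ratio * val[1];
}

// Cosine smoothing of the fractional offset, as applied for the *_COSINE
// methods. Assumption: std::cos(pi * r) stands in for sycl::cospi(r).
double cosineRatio(double ratio) {
    const double pi = std::acos(-1.0);
    return (1 - std::cos(pi * ratio)) / 2;
}

// Host-side mirror of cubicInterpFunc. val[1] and val[2] bracket the query
// point; t is the fractional offset from val[1].
double cubicRef(const double val[4], double t, bool spline) {
    double a0, a1, a2, a3;
    if (spline) {
        // Same coefficients as the spline branch in the patch.
        a0 = -0.5 * val[0] + 1.5 * val[1] - 1.5 * val[2] + 0.5 * val[3];
        a1 = 1.0 * val[0] - 2.5 * val[1] + 2.0 * val[2] - 0.5 * val[3];
        a2 = -0.5 * val[0] + 0.5 * val[2];
        a3 = val[1];
    } else {
        // Same coefficients as the non-spline branch in the patch.
        a0 = val[3] - val[2] - val[0] + val[1];
        a1 = val[0] - val[1] - a0;
        a2 = val[2] - val[0];
        a3 = val[1];
    }
    const double t2 = t * t;
    const double t3 = t2 * t;
    return a0 * t3 + a1 * t2 + a2 * t + a3;
}
```

For both coefficient sets, `cubicRef(v, 0, spline)` returns `v[1]` and `cubicRef(v, 1, spline)` returns `v[2]`, so the curve passes through the two middle samples.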
diff --git a/src/backend/oneapi/kernel/iota.hpp b/src/backend/oneapi/kernel/iota.hpp
new file mode 100644
index 0000000000..f334695ef5
--- /dev/null
+++ b/src/backend/oneapi/kernel/iota.hpp
@@ -0,0 +1,119 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class iotaKernel {
+   public:
+    iotaKernel(write_accessor<T> out, KParam oinfo, const int s0, const int s1,
+               const int s2, const int s3, const int blocksPerMatX,
+               const int blocksPerMatY)
+        : out_(out)
+        , oinfo_(oinfo)
+        , s0_(s0)
+        , s1_(s1)
+        , s2_(s2)
+        , s3_(s3)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group gg = it.get_group();
+        const int oz   = gg.get_group_id(0) / blocksPerMatX_;
+        const int ow   = gg.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = gg.get_group_id(0) - oz * blocksPerMatX_;
+        const int blockIdx_y = gg.get_group_id(1) - ow * blocksPerMatY_;
+
+        const int xx = it.get_local_id(0) + blockIdx_x * gg.get_local_range(0);
+        const int yy = it.get_local_id(1) + blockIdx_y * gg.get_local_range(1);
+
+        size_t odims0 = oinfo_.dims[0];
+        size_t odims1 = oinfo_.dims[1];
+        size_t odims2 = oinfo_.dims[2];
+        size_t odims3 = oinfo_.dims[3];
+
+        if (xx < odims0 && yy < odims1 && oz < odims2 && ow < odims3) {
+            const int ozw = ow * oinfo_.strides[3] + oz * oinfo_.strides[2];
+
+            compute_t<T> val =
+                static_cast<compute_t<T>>((ow % s3_) * s2_ * s1_ * s0_);
+            val += static_cast<compute_t<T>>((oz % s2_) * s1_ * s0_);
+
+            const int incy = blocksPerMatY_ * gg.get_local_range(1);
+            const int incx = blocksPerMatX_ * gg.get_local_range(0);
+
+            for (int oy = yy; oy < odims1; oy += incy) {
+                compute_t<T> valY = val + (oy % s1_) * s0_;
+                int oyzw          = ozw + oy * oinfo_.strides[1];
+                for (int ox = xx; ox < odims0; ox += incx) {
+                    int oidx   = oyzw + ox;
+                    out_[oidx] = valY + (ox % s0_);
+                }
+            }
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oinfo_;
+    int s0_, s1_, s2_, s3_;
+    int blocksPerMatX_, blocksPerMatY_;
+};
+
+template<typename T>
+void iota(Param<T> out, const af::dim4& sdims) {
+    constexpr int IOTA_TX = 32;
+    constexpr int IOTA_TY = 8;
+    constexpr int TILEX   = 512;
+    constexpr int TILEY   = 32;
+
+    sycl::range<2> local(IOTA_TX, IOTA_TY);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    sycl::range<2> global(local[0] * blocksPerMatX * out.info.dims[2],
+                          local[1] * blocksPerMatY * out.info.dims[3]);
+    sycl::nd_range<2> ndrange(global, local);
+
+    getQueue().submit([&](sycl::handler& h) {
+        write_accessor<T> out_acc{*out.data, h};
+
+        h.parallel_for(ndrange, iotaKernel<T>(out_acc, out.info,
+                                              static_cast<int>(sdims[0]),
+                                              static_cast<int>(sdims[1]),
+                                              static_cast<int>(sdims[2]),
+                                              static_cast<int>(sdims[3]),
+                                              blocksPerMatX, blocksPerMatY));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
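Review note on `iota.hpp`: a minimal host reference for the values `iotaKernel` writes, intended as a sketch for comparing device output in a test. It assumes a contiguous, column-major output and `int` elements. The function name `iotaRef` and its signature are hypothetical and are not part of this patch.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host reference (not part of this patch).
// odims: extents of the output; sdims: extents of the untiled sequence.
// Assumes the output is contiguous with dimension 0 varying fastest.
std::vector<int> iotaRef(const std::array<int, 4>& odims,
                         const std::array<int, 4>& sdims) {
    const std::size_t total = static_cast<std::size_t>(odims[0]) * odims[1] *
                              odims[2] * odims[3];
    std::vector<int> out(total);
    for (int ow = 0; ow < odims[3]; ++ow) {
        for (int oz = 0; oz < odims[2]; ++oz) {
            for (int oy = 0; oy < odims[1]; ++oy) {
                for (int ox = 0; ox < odims[0]; ++ox) {
                    // Same value formula as the kernel: wrap each coordinate
                    // by the sequence extent, then linearize.
                    const int val =
                        (ow % sdims[3]) * sdims[2] * sdims[1] * sdims[0] +
                        (oz % sdims[2]) * sdims[1] * sdims[0] +
                        (oy % sdims[1]) * sdims[0] + (ox % sdims[0]);
                    const std::size_t idx =
                        ((static_cast<std::size_t>(ow) * odims[2] + oz) *
                             odims[1] + oy) * odims[0] + ox;
                    out[idx] = val;
                }
            }
        }
    }
    return out;
}
```

When `odims` equals `sdims` the result is simply 0, 1, 2, and so on in memory order; larger `odims` repeat the sequence along the tiled dimensions.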
diff --git a/src/backend/oneapi/kernel/ireduce.hpp b/src/backend/oneapi/kernel/ireduce.hpp
new file mode 100644
index 0000000000..9ba79ed61b
--- /dev/null
+++ b/src/backend/oneapi/kernel/ireduce.hpp
@@ -0,0 +1,699 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <minmax_op.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <climits>
+#include <complex>
+#include <iostream>
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T, af_op_t op, uint dim, bool is_first, uint DIMY>
+class ireduceDimKernelSMEM {
+   public:
+    ireduceDimKernelSMEM(write_accessor<T> out, KParam oInfo,
+                         write_accessor<uint> oloc, KParam olocInfo,
+                         read_accessor<T> in, KParam iInfo,
+                         read_accessor<uint> iloc, KParam ilocInfo,
+                         uint groups_x, uint groups_y, uint groups_dim,
+                         bool rlenValid, read_accessor<uint> rlen,
+                         KParam rlenInfo,
+                         sycl::local_accessor<compute_t<T>, 1> s_val,
+                         sycl::local_accessor<uint, 1> s_idx)
+        : out_(out)
+        , oInfo_(oInfo)
+        , oloc_(oloc)
+        , olocInfo_(olocInfo)
+        , in_(in)
+        , iInfo_(iInfo)
+        , iloc_(iloc)
+        , ilocInfo_(ilocInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , groups_dim_(groups_dim)
+        , rlenValid_(rlenValid)
+        , rlen_(rlen)
+        , rlenInfo_(rlenInfo)
+        , s_val_(s_val)
+        , s_idx_(s_idx) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) + lidx;
+        const uint yid       = groupId_y;
+
+        uint ids[4] = {xid, yid, zid, wid};
+        T *optr     = out_.get_pointer() + ids[3] * oInfo_.strides[3] +
+                  ids[2] * oInfo_.strides[2] + ids[1] * oInfo_.strides[1] +
+                  ids[0] + oInfo_.offset;
+
+        uint *olptr = oloc_.get_pointer() + ids[3] * oInfo_.strides[3] +
+                      ids[2] * oInfo_.strides[2] + ids[1] * oInfo_.strides[1] +
+                      ids[0] + oInfo_.offset;
+
+        // There is only one element per work-group for out
+        // There are get_local_range(1) elements per work-group for in
+        // Hence increment ids[dim] just after offsetting out and before
+        // offsetting in
+        const bool rlen_valid =
+            (ids[0] < rlenInfo_.dims[0]) && (ids[1] < rlenInfo_.dims[1]) &&
+            (ids[2] < rlenInfo_.dims[2]) && (ids[3] < rlenInfo_.dims[3]);
+        const bool rlen_nonnull = rlenValid_;
+        const uint *rlenptr =
+            (rlen_nonnull && rlen_valid)
+                ? rlen_.get_pointer() + ids[3] * rlenInfo_.strides[3] +
+                      ids[2] * rlenInfo_.strides[2] +
+                      ids[1] * rlenInfo_.strides[1] + ids[0] + rlenInfo_.offset
+                : nullptr;
+
+        const uint groupIdx_dim = ids[dim];
+
+        // add thread offset for reduced dim for inputs
+        ids[dim] = ids[dim] * g.get_local_range(1) + lidy;
+
+        const T *iptr = in_.get_pointer() + ids[3] * iInfo_.strides[3] +
+                        ids[2] * iInfo_.strides[2] +
+                        ids[1] * iInfo_.strides[1] + ids[0] + iInfo_.offset;
+        const uint *ilptr;
+        if (!is_first) {
+            ilptr = iloc_.get_pointer() + ids[3] * iInfo_.strides[3] +
+                    ids[2] * iInfo_.strides[2] + ids[1] * iInfo_.strides[1] +
+                    ids[0] + iInfo_.offset;
+        }
+
+        const uint id_dim_in   = ids[dim];
+        const uint istride_dim = iInfo_.strides[dim];
+
+        size_t xlim   = iInfo_.dims[0];
+        size_t ylim   = iInfo_.dims[1];
+        size_t zlim   = iInfo_.dims[2];
+        size_t wlim   = iInfo_.dims[3];
+        bool is_valid = (ids[0] < xlim) && (ids[1] < ylim) && (ids[2] < zlim) &&
+                        (ids[3] < wlim);
+
+        compute_t<T> out_val = common::Binary<compute_t<T>, op>::init();
+        uint out_idx         = id_dim_in;
+
+        uint lim = rlenptr ? *rlenptr : iInfo_.dims[0];
+        lim      = is_first ? sycl::min((uint)iInfo_.dims[dim], lim) : lim;
+
+        bool within_ragged_bounds =
+            (is_first) ? (out_idx < lim)
+                       : ((rlenptr) ? ((is_valid) && (*ilptr < lim)) : true);
+        if (is_valid && id_dim_in < iInfo_.dims[dim] && within_ragged_bounds) {
+            out_val = *iptr;
+            if (!is_first) out_idx = *ilptr;
+        }
+
+        MinMaxOp<op, compute_t<T>> Op(out_val, out_idx);
+
+        const uint id_dim_in_start =
+            id_dim_in + groups_dim_ * g.get_local_range(1);
+        for (int id = id_dim_in_start; is_valid && (id < lim);
+             id += groups_dim_ * g.get_local_range(1)) {
+            iptr = iptr + groups_dim_ * g.get_local_range(1) * istride_dim;
+            if (!is_first) {
+                ilptr =
+                    ilptr + groups_dim_ * g.get_local_range(1) * istride_dim;
+                Op(*iptr, *ilptr);
+            } else {
+                Op(*iptr, id);
+            }
+        }
+
+        s_val_[lid] = Op.m_val;
+        s_idx_[lid] = Op.m_idx;
+        it.barrier();
+
+        compute_t<T> *s_vptr = s_val_.get_pointer() + lid;
+        uint *s_iptr         = s_idx_.get_pointer() + lid;
+
+        if (DIMY == 8) {
+            if (lidy < 4) {
+                Op(s_vptr[g.get_local_range(0) * 4],
+                   s_iptr[g.get_local_range(0) * 4]);
+                *s_vptr = Op.m_val;
+                *s_iptr = Op.m_idx;
+            }
+            it.barrier();
+        }
+        if (DIMY >= 4) {
+            if (lidy < 2) {
+                Op(s_vptr[g.get_local_range(0) * 2],
+                   s_iptr[g.get_local_range(0) * 2]);
+                *s_vptr = Op.m_val;
+                *s_iptr = Op.m_idx;
+            }
+            it.barrier();
+        }
+        if (DIMY >= 2) {
+            if (lidy < 1) {
+                Op(s_vptr[g.get_local_range(0) * 1],
+                   s_iptr[g.get_local_range(0) * 1]);
+                *s_vptr = Op.m_val;
+                *s_iptr = Op.m_idx;
+            }
+            it.barrier();
+        }
+        if (is_valid && lidy == 0 && (groupIdx_dim < oInfo_.dims[dim])) {
+            *optr  = data_t<T>(s_vptr[0]);
+            *olptr = s_iptr[0];
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    write_accessor<uint> oloc_;
+    KParam olocInfo_;
+    read_accessor<T> in_;
+    KParam iInfo_;
+    read_accessor<uint> iloc_;
+    KParam ilocInfo_;
+    uint groups_x_, groups_y_, groups_dim_;
+    bool rlenValid_;
+    read_accessor<uint> rlen_;
+    KParam rlenInfo_;
+    sycl::local_accessor<compute_t<T>, 1> s_val_;
+    sycl::local_accessor<uint, 1> s_idx_;
+};
+
+template<typename T, af_op_t op, int dim, bool is_first>
+void ireduce_dim_launcher(Param<T> out, Param<uint> oloc, Param<T> in,
+                          Param<uint> iloc, const uint threads_y,
+                          const dim_t groups_dim[4], Param<uint> rlen) {
+    sycl::range<2> local(creduce::THREADS_X, threads_y);
+    sycl::range<2> global(groups_dim[0] * groups_dim[2] * local[0],
+                          groups_dim[1] * groups_dim[3] * local[1]);
+
+    auto iempty = memAlloc<uint>(1);
+    auto rempty = memAlloc<uint>(1);
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> out_acc{*out.data, h};
+        write_accessor<uint> oloc_acc{*oloc.data, h};
+        read_accessor<T> in_acc{*in.data, h};
+
+        read_accessor<uint> iloc_acc{*iempty, h};
+        if (iloc.info.dims[0] * iloc.info.dims[1] * iloc.info.dims[2] *
+                iloc.info.dims[3] >
+            0) {
+            iloc_acc = read_accessor<uint>{*iloc.data, h};
+        }
+
+        read_accessor<uint> rlen_acc{*rempty, h};
+        bool rlenValid = (rlen.info.dims[0] * rlen.info.dims[1] *
+                              rlen.info.dims[2] * rlen.info.dims[3] >
+                          0);
+        if (rlenValid) { rlen_acc = read_accessor<uint>{*rlen.data, h}; }
+
+        auto shrdVal = sycl::local_accessor<compute_t<T>, 1>(
+            creduce::THREADS_PER_BLOCK, h);
+        auto shrdLoc =
+            sycl::local_accessor<uint, 1>(creduce::THREADS_PER_BLOCK, h);
+
+        switch (threads_y) {
+            case 8:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceDimKernelSMEM<T, op, dim, is_first, 8>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_dim[0], groups_dim[1],
+                        groups_dim[dim], rlenValid, rlen_acc, rlen.info,
+                        shrdVal, shrdLoc));
+                break;
+            case 4:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceDimKernelSMEM<T, op, dim, is_first, 4>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_dim[0], groups_dim[1],
+                        groups_dim[dim], rlenValid, rlen_acc, rlen.info,
+                        shrdVal, shrdLoc));
+                break;
+            case 2:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceDimKernelSMEM<T, op, dim, is_first, 2>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_dim[0], groups_dim[1],
+                        groups_dim[dim], rlenValid, rlen_acc, rlen.info,
+                        shrdVal, shrdLoc));
+                break;
+            case 1:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceDimKernelSMEM<T, op, dim, is_first, 1>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_dim[0], groups_dim[1],
+                        groups_dim[dim], rlenValid, rlen_acc, rlen.info,
+                        shrdVal, shrdLoc));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op, int dim>
+void ireduce_dim(Param<T> out, Param<uint> oloc, Param<T> in,
+                 Param<uint> rlen) {
+    uint threads_y = std::min(creduce::THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = creduce::THREADS_X;
+
+    dim_t blocks_dim[] = {divup(in.info.dims[0], threads_x), in.info.dims[1],
+                          in.info.dims[2], in.info.dims[3]};
+
+    blocks_dim[dim] = divup(in.info.dims[dim], threads_y * creduce::REPEAT);
+
+    Param<T> tmp      = out;
+    Param<uint> tlptr = oloc;
+    bufptr<T> tmp_alloc;
+    bufptr<uint> tlptr_alloc;
+
+    if (blocks_dim[dim] > 1) {
+        int tmp_elements   = 1;
+        tmp.info.dims[dim] = blocks_dim[dim];
+
+        for (int k = 0; k < 4; k++) tmp_elements *= tmp.info.dims[k];
+        tmp_alloc   = memAlloc<T>(tmp_elements);
+        tlptr_alloc = memAlloc<uint>(tmp_elements);
+        tmp.data    = tmp_alloc.get();
+        tlptr.data  = tlptr_alloc.get();
+
+        for (int k = dim + 1; k < 4; k++)
+            tmp.info.strides[k] *= blocks_dim[dim];
+    }
+
+    Param<uint> nullparam;
+    ireduce_dim_launcher<T, op, dim, true>(tmp, tlptr, in, nullparam, threads_y,
+                                           blocks_dim, rlen);
+
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
+
+        ireduce_dim_launcher<T, op, dim, false>(out, oloc, tmp, tlptr,
+                                                threads_y, blocks_dim, rlen);
+    }
+}
+
+template<typename T, af_op_t op, bool is_first, uint DIMX>
+class ireduceFirstKernelSMEM {
+   public:
+    ireduceFirstKernelSMEM(write_accessor<T> out, KParam oInfo,
+                           write_accessor<uint> oloc, KParam olocInfo,
+                           read_accessor<T> in, KParam iInfo,
+                           read_accessor<uint> iloc, KParam ilocInfo,
+                           uint groups_x, uint groups_y, uint repeat,
+                           bool rlenValid, read_accessor<uint> rlen,
+                           KParam rlenInfo,
+                           sycl::local_accessor<compute_t<T>, 1> s_val,
+                           sycl::local_accessor<uint, 1> s_idx)
+        : out_(out)
+        , oInfo_(oInfo)
+        , oloc_(oloc)
+        , olocInfo_(olocInfo)
+        , in_(in)
+        , iInfo_(iInfo)
+        , iloc_(iloc)
+        , ilocInfo_(ilocInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , repeat_(repeat)
+        , rlenValid_(rlenValid)
+        , rlen_(rlen)
+        , rlenInfo_(rlenInfo)
+        , s_val_(s_val)
+        , s_idx_(s_idx) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid = groupId_x * g.get_local_range(0) * repeat_ + lidx;
+        const uint yid = groupId_y * g.get_local_range(1) + lidy;
+
+        const T *iptr = in_.get_pointer() + wid * iInfo_.strides[3] +
+                        zid * iInfo_.strides[2] + yid * iInfo_.strides[1] +
+                        iInfo_.offset;
+
+        T *optr = out_.get_pointer() + wid * oInfo_.strides[3] +
+                  zid * oInfo_.strides[2] + yid * oInfo_.strides[1] +
+                  oInfo_.offset;
+
+        const uint *rlenptr =
+            (rlenValid_) ? rlen_.get_pointer() + wid * rlenInfo_.strides[3] +
+                               zid * rlenInfo_.strides[2] +
+                               yid * rlenInfo_.strides[1] + rlenInfo_.offset
+                         : nullptr;
+
+        const uint *ilptr;
+        if (!is_first) {
+            ilptr = iloc_.get_pointer() + wid * iInfo_.strides[3] +
+                    zid * iInfo_.strides[2] + yid * iInfo_.strides[1] +
+                    iInfo_.offset;
+        }
+        uint *olptr = oloc_.get_pointer() + wid * oInfo_.strides[3] +
+                      zid * oInfo_.strides[2] + yid * oInfo_.strides[1] +
+                      oInfo_.offset;
+
+        size_t ylim   = iInfo_.dims[1];
+        size_t zlim   = iInfo_.dims[2];
+        size_t wlim   = iInfo_.dims[3];
+        bool is_valid = (yid < ylim) && (zid < zlim) && (wid < wlim);
+        // bool is_valid = (yid < iInfo_.dims[1]) && (zid < iInfo_.dims[2]) &&
+        //(wid < iInfo_.dims[3]);
+
+        int minlen = rlenptr ? sycl::min(*rlenptr, (uint)iInfo_.dims[0])
+                             : iInfo_.dims[0];
+        int lim    = sycl::min((int)(xid + repeat_ * DIMX), minlen);
+
+        compute_t<T> out_val = common::Binary<compute_t<T>, op>::init();
+        uint idx             = xid;
+
+        if (xid < lim && is_valid) {
+            out_val = static_cast<compute_t<T>>(iptr[xid]);
+            if (!is_first) idx = ilptr[xid];
+        }
+
+        MinMaxOp<op, compute_t<T>> Op(out_val, idx);
+        for (int id = xid; is_valid && id < lim; id += DIMX) {
+            Op(static_cast<compute_t<T>>(iptr[id]),
+               (!is_first) ? ilptr[id] : id);
+        }
+
+        s_val_[lid] = Op.m_val;
+        s_idx_[lid] = Op.m_idx;
+        it.barrier();
+
+        compute_t<T> *s_vptr = s_val_.get_pointer() + lidy * DIMX;
+        uint *s_iptr         = s_idx_.get_pointer() + lidy * DIMX;
+
+        if (DIMX == 256) {
+            if (lidx < 128) {
+                Op(s_vptr[lidx + 128], s_iptr[lidx + 128]);
+                s_vptr[lidx] = Op.m_val;
+                s_iptr[lidx] = Op.m_idx;
+            }
+            it.barrier();
+        }
+
+        if (DIMX >= 128) {
+            if (lidx < 64) {
+                Op(s_vptr[lidx + 64], s_iptr[lidx + 64]);
+                s_vptr[lidx] = Op.m_val;
+                s_iptr[lidx] = Op.m_idx;
+            }
+            it.barrier();
+        }
+
+        if (DIMX >= 64) {
+            if (lidx < 32) {
+                Op(s_vptr[lidx + 32], s_iptr[lidx + 32]);
+                s_vptr[lidx] = Op.m_val;
+                s_iptr[lidx] = Op.m_idx;
+            }
+            it.barrier();
+        }
+
+        // TODO: replace with subgroup operations in optimized kernels
+        if (lidx < 16) {
+            Op(s_vptr[lidx + 16], s_iptr[lidx + 16]);
+            s_vptr[lidx] = Op.m_val;
+            s_iptr[lidx] = Op.m_idx;
+        }
+        it.barrier();
+
+        if (lidx < 8) {
+            Op(s_vptr[lidx + 8], s_iptr[lidx + 8]);
+            s_vptr[lidx] = Op.m_val;
+            s_iptr[lidx] = Op.m_idx;
+        }
+        it.barrier();
+
+        if (lidx < 4) {
+            Op(s_vptr[lidx + 4], s_iptr[lidx + 4]);
+            s_vptr[lidx] = Op.m_val;
+            s_iptr[lidx] = Op.m_idx;
+        }
+        it.barrier();
+
+        if (lidx < 2) {
+            Op(s_vptr[lidx + 2], s_iptr[lidx + 2]);
+            s_vptr[lidx] = Op.m_val;
+            s_iptr[lidx] = Op.m_idx;
+        }
+        it.barrier();
+
+        if (lidx < 1) {
+            Op(s_vptr[lidx + 1], s_iptr[lidx + 1]);
+            s_vptr[lidx] = Op.m_val;
+            s_iptr[lidx] = Op.m_idx;
+        }
+        it.barrier();
+
+        if (is_valid && lidx == 0) {
+            optr[groupId_x]  = data_t<T>(s_vptr[0]);
+            olptr[groupId_x] = s_iptr[0];
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    write_accessor<uint> oloc_;
+    KParam olocInfo_;
+    read_accessor<T> in_;
+    KParam iInfo_;
+    read_accessor<uint> iloc_;
+    KParam ilocInfo_;
+    uint groups_x_, groups_y_, repeat_;
+    bool rlenValid_;
+    read_accessor<uint> rlen_;
+    KParam rlenInfo_;
+    sycl::local_accessor<compute_t<T>, 1> s_val_;
+    sycl::local_accessor<uint, 1> s_idx_;
+};
+
+template<typename T, af_op_t op, bool is_first>
+void ireduce_first_launcher(Param<T> out, Param<uint> oloc, Param<T> in,
+                            Param<uint> iloc, const uint groups_x,
+                            const uint groups_y, const uint threads_x,
+                            Param<uint> rlen) {
+    sycl::range<2> local(threads_x, creduce::THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * in.info.dims[2] * local[0],
+                          groups_y * in.info.dims[3] * local[1]);
+
+    uint repeat = divup(in.info.dims[0], (groups_x * threads_x));
+
+    auto iempty = memAlloc<uint>(1);
+    auto rempty = memAlloc<uint>(1);
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> out_acc{*out.data, h};
+        write_accessor<uint> oloc_acc{*oloc.data, h};
+        read_accessor<T> in_acc{*in.data, h};
+
+        read_accessor<uint> iloc_acc{*iempty, h};
+        if (iloc.info.dims[0] * iloc.info.dims[1] * iloc.info.dims[2] *
+                iloc.info.dims[3] >
+            0) {
+            iloc_acc = read_accessor<uint>{*iloc.data, h};
+        }
+
+        read_accessor<uint> rlen_acc{*rempty, h};
+        bool rlenValid = (rlen.info.dims[0] * rlen.info.dims[1] *
+                              rlen.info.dims[2] * rlen.info.dims[3] >
+                          0);
+        if (rlenValid) { rlen_acc = read_accessor<uint>{*rlen.data, h}; }
+
+        auto shrdVal = sycl::local_accessor<compute_t<T>, 1>(
+            creduce::THREADS_PER_BLOCK, h);
+        auto shrdLoc =
+            sycl::local_accessor<uint, 1>(creduce::THREADS_PER_BLOCK, h);
+
+        switch (threads_x) {
+            case 32:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceFirstKernelSMEM<T, op, is_first, 32>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_x, groups_y, repeat,
+                        rlenValid, rlen_acc, rlen.info, shrdVal, shrdLoc));
+                break;
+            case 64:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceFirstKernelSMEM<T, op, is_first, 64>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_x, groups_y, repeat,
+                        rlenValid, rlen_acc, rlen.info, shrdVal, shrdLoc));
+                break;
+            case 128:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceFirstKernelSMEM<T, op, is_first, 128>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_x, groups_y, repeat,
+                        rlenValid, rlen_acc, rlen.info, shrdVal, shrdLoc));
+                break;
+            case 256:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    ireduceFirstKernelSMEM<T, op, is_first, 256>(
+                        out_acc, out.info, oloc_acc, oloc.info, in_acc, in.info,
+                        iloc_acc, iloc.info, groups_x, groups_y, repeat,
+                        rlenValid, rlen_acc, rlen.info, shrdVal, shrdLoc));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+void ireduce_first(Param<T> out, Param<uint> oloc, Param<T> in,
+                   Param<uint> rlen) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, creduce::THREADS_PER_BLOCK);
+    uint threads_y = creduce::THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.info.dims[0], threads_x * creduce::REPEAT);
+    uint blocks_y = divup(in.info.dims[1], threads_y);
+
+    Param<T> tmp      = out;
+    Param<uint> tlptr = oloc;
+    bufptr<T> tmp_alloc;
+    bufptr<uint> tlptr_alloc;
+    if (blocks_x > 1) {
+        auto elements =
+            blocks_x * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+        tmp_alloc   = memAlloc<T>(elements);
+        tlptr_alloc = memAlloc<uint>(elements);
+        tmp.data    = tmp_alloc.get();
+        tlptr.data  = tlptr_alloc.get();
+
+        tmp.info.dims[0] = blocks_x;
+        for (int k = 1; k < 4; k++) tmp.info.strides[k] *= blocks_x;
+    }
+
+    Param<uint> nullparam;
+    ireduce_first_launcher<T, op, true>(tmp, tlptr, in, nullparam, blocks_x,
+                                        blocks_y, threads_x, rlen);
+
+    if (blocks_x > 1) {
+        ireduce_first_launcher<T, op, false>(out, oloc, tmp, tlptr, 1, blocks_y,
+                                             threads_x, rlen);
+    }
+}
+
+template<typename T, af_op_t op>
+void ireduce(Param<T> out, Param<uint> oloc, Param<T> in, int dim,
+             Param<uint> rlen) {
+    switch (dim) {
+        case 0: return ireduce_first<T, op>(out, oloc, in, rlen);
+        case 1: return ireduce_dim<T, op, 1>(out, oloc, in, rlen);
+        case 2: return ireduce_dim<T, op, 2>(out, oloc, in, rlen);
+        case 3: return ireduce_dim<T, op, 3>(out, oloc, in, rlen);
+    }
+}
+
+template<typename T, af_op_t op>
+T ireduce_all(uint *idx, Param<T> in) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
+
+    if (is_linear) {
+        in.info.dims[0] = in_elements;
+        for (int k = 1; k < 4; k++) {
+            in.info.dims[k]    = 1;
+            in.info.strides[k] = in_elements;
+        }
+    }
+
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, creduce::THREADS_PER_BLOCK);
+    uint threads_y = creduce::THREADS_PER_BLOCK / threads_x;
+
+    // TODO(perf): tune REPEAT; consider removing it or choosing it at runtime.
+    // If the problem fits within the device's resident threads, skip REPEAT.
+    uint groups_x = divup(in.info.dims[0], threads_x * creduce::REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
+
+    Array<T> tmp = createEmptyArray<T>(
+        {groups_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+
+    int tmp_elements  = tmp.elements();
+    Array<uint> tlptr = createEmptyArray<uint>({tmp_elements, 1, 1, 1});
+
+    Param<uint> nullparam;
+    Array<uint> rlen = createEmptyArray<uint>(af::dim4(0));
+    ireduce_first_launcher<T, op, true>(tmp, tlptr, in, nullparam, groups_x,
+                                        groups_y, threads_x, rlen);
+
+    sycl::host_accessor h_ptr_raw{*tmp.get()};
+    sycl::host_accessor h_lptr_raw{*tlptr.get()};
+    if (!is_linear) {
+        // Convert each per-batch index into a linear index.
+        // in is of size    [   dims0, dims1, dims2, dims3]
+        // tlptr is of size [groups_x, dims1, dims2, dims3]
+        // i / groups_x gives the batch number "N"
+        // "N * dims0 + tlptr[i]" gives the linear index
+        for (int i = 0; i < tmp_elements; i++) {
+            h_lptr_raw[i] += (i / groups_x) * in.info.dims[0];
+        }
+    }
+
+    MinMaxOp<op, T> Op(h_ptr_raw[0], h_lptr_raw[0]);
+
+    for (int i = 1; i < tmp_elements; i++) { Op(h_ptr_raw[i], h_lptr_raw[i]); }
+
+    *idx = Op.m_idx;
+    return Op.m_val;
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/lookup.hpp b/src/backend/oneapi/kernel/lookup.hpp
new file mode 100644
index 0000000000..6bceca3e97
--- /dev/null
+++ b/src/backend/oneapi/kernel/lookup.hpp
@@ -0,0 +1,131 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+inline int trimIndex(int idx, const int len) {
+    int ret_val = idx;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
+    }
+    return ret_val;
+}
+
+template<typename in_t, typename idx_t>
+class lookupNDCreateKernel {
+   public:
+    lookupNDCreateKernel(write_accessor<in_t> out, KParam oInfo,
+                         read_accessor<in_t> in, KParam iInfo,
+                         read_accessor<idx_t> indices, KParam idxInfo,
+                         int nBBS0, int nBBS1, const int DIM)
+        : out_(out)
+        , oInfo_(oInfo)
+        , in_(in)
+        , iInfo_(iInfo)
+        , indices_(indices)
+        , idxInfo_(idxInfo)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1)
+        , DIM_(DIM) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int lx = it.get_local_id(0);
+        int ly = it.get_local_id(1);
+
+        int gz = g.get_group_id(0) / nBBS0_;
+        int gw = g.get_group_id(1) / nBBS1_;
+
+        int gx = g.get_local_range(0) * (g.get_group_id(0) - gz * nBBS0_) + lx;
+        int gy = g.get_local_range(1) * (g.get_group_id(1) - gw * nBBS1_) + ly;
+
+        const idx_t *idxPtr = indices_.get_pointer() + idxInfo_.offset;
+
+        int i = iInfo_.strides[0] *
+                (DIM_ == 0 ? trimIndex((int)idxPtr[gx], iInfo_.dims[0]) : gx);
+        int j = iInfo_.strides[1] *
+                (DIM_ == 1 ? trimIndex((int)idxPtr[gy], iInfo_.dims[1]) : gy);
+        int k = iInfo_.strides[2] *
+                (DIM_ == 2 ? trimIndex((int)idxPtr[gz], iInfo_.dims[2]) : gz);
+        int l = iInfo_.strides[3] *
+                (DIM_ == 3 ? trimIndex((int)idxPtr[gw], iInfo_.dims[3]) : gw);
+
+        const in_t *inPtr = in_.get_pointer() + (i + j + k + l) + iInfo_.offset;
+        in_t *outPtr =
+            out_.get_pointer() +
+            (gx * oInfo_.strides[0] + gy * oInfo_.strides[1] +
+             gz * oInfo_.strides[2] + gw * oInfo_.strides[3] + oInfo_.offset);
+
+        if (gx < oInfo_.dims[0] && gy < oInfo_.dims[1] && gz < oInfo_.dims[2] &&
+            gw < oInfo_.dims[3]) {
+            outPtr[0] = inPtr[0];
+        }
+    }
+
+   private:
+    write_accessor<in_t> out_;
+    KParam oInfo_;
+    read_accessor<in_t> in_;
+    KParam iInfo_;
+    read_accessor<idx_t> indices_;
+    KParam idxInfo_;
+    int nBBS0_;
+    int nBBS1_;
+    const int DIM_;
+};
+
+template<typename in_t, typename idx_t>
+void lookup(Param<in_t> out, const Param<in_t> in, const Param<idx_t> indices,
+            const unsigned dim) {
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
+
+    auto local = sycl::range(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(out.info.dims[0], THREADS_X);
+    int blk_y = divup(out.info.dims[1], THREADS_Y);
+
+    auto global = sycl::range(blk_x * out.info.dims[2] * THREADS_X,
+                              blk_y * out.info.dims[3] * THREADS_Y);
+
+    getQueue().submit([&](auto &h) {
+        write_accessor<in_t> d_out{*out.data, h};
+        read_accessor<in_t> d_in{*in.data, h};
+        read_accessor<idx_t> d_indices{*indices.data, h};
+        h.parallel_for(sycl::nd_range{global, local},
+                       lookupNDCreateKernel<in_t, idx_t>(
+                           d_out, out.info, d_in, in.info, d_indices,
+                           indices.info, blk_x, blk_y, dim));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/lu_split.hpp b/src/backend/oneapi/kernel/lu_split.hpp
new file mode 100644
index 0000000000..6d52fb3835
--- /dev/null
+++ b/src/backend/oneapi/kernel/lu_split.hpp
@@ -0,0 +1,139 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+
+#include <math.hpp>
+#include <array>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T, bool same_dims>
+class luSplitKernel {
+   public:
+    luSplitKernel(write_accessor<T> lower, KParam lInfo,
+                  write_accessor<T> upper, KParam uInfo, read_accessor<T> in,
+                  KParam iInfo, const int groupsPerMatX,
+                  const int groupsPerMatY)
+        : lower_(lower)
+        , lInfo_(lInfo)
+        , upper_(upper)
+        , uInfo_(uInfo)
+        , in_(in)
+        , iInfo_(iInfo)
+        , groupsPerMatX_(groupsPerMatX)
+        , groupsPerMatY_(groupsPerMatY) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int oz  = g.get_group_id(0) / groupsPerMatX_;
+        const int ow  = g.get_group_id(1) / groupsPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - oz * groupsPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - ow * groupsPerMatY_;
+
+        const int xx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int yy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        const int incy = groupsPerMatY_ * g.get_local_range(1);
+        const int incx = groupsPerMatX_ * g.get_local_range(0);
+
+        T *d_l       = lower_.get_pointer();
+        T *d_u       = upper_.get_pointer();
+        const T *d_i = in_.get_pointer();
+
+        if (oz < iInfo_.dims[2] && ow < iInfo_.dims[3]) {
+            d_i = d_i + oz * iInfo_.strides[2] + ow * iInfo_.strides[3];
+            d_l = d_l + oz * lInfo_.strides[2] + ow * lInfo_.strides[3];
+            d_u = d_u + oz * uInfo_.strides[2] + ow * uInfo_.strides[3];
+
+            for (int oy = yy; oy < iInfo_.dims[1]; oy += incy) {
+                const T *Yd_i = d_i + oy * iInfo_.strides[1];
+                T *Yd_l       = d_l + oy * lInfo_.strides[1];
+                T *Yd_u       = d_u + oy * uInfo_.strides[1];
+                for (int ox = xx; ox < iInfo_.dims[0]; ox += incx) {
+                    if (ox > oy) {
+                        if (same_dims || oy < lInfo_.dims[1])
+                            Yd_l[ox] = Yd_i[ox];
+                        if (!same_dims || ox < uInfo_.dims[0])
+                            Yd_u[ox] = scalar<T>(0);
+                    } else if (oy > ox) {
+                        if (same_dims || oy < lInfo_.dims[1])
+                            Yd_l[ox] = scalar<T>(0);
+                        if (!same_dims || ox < uInfo_.dims[0])
+                            Yd_u[ox] = Yd_i[ox];
+                    } else if (ox == oy) {
+                        if (same_dims || oy < lInfo_.dims[1])
+                            Yd_l[ox] = scalar<T>(1.0);
+                        if (!same_dims || ox < uInfo_.dims[0])
+                            Yd_u[ox] = Yd_i[ox];
+                    }
+                }
+            }
+        }
+    }
+
+   protected:
+    write_accessor<T> lower_;
+    KParam lInfo_;
+    write_accessor<T> upper_;
+    KParam uInfo_;
+    read_accessor<T> in_;
+    KParam iInfo_;
+    int groupsPerMatX_;
+    int groupsPerMatY_;
+};
+
+template<typename T>
+void lu_split(Param<T> lower, Param<T> upper, Param<T> in) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
+
+    const bool sameDims = lower.info.dims[0] == in.info.dims[0] &&
+                          lower.info.dims[1] == in.info.dims[1];
+
+    sycl::range<2> local(TX, TY);
+
+    int groupsPerMatX = divup(in.info.dims[0], TILEX);
+    int groupsPerMatY = divup(in.info.dims[1], TILEY);
+    sycl::range<2> global(groupsPerMatX * in.info.dims[2] * local[0],
+                          groupsPerMatY * in.info.dims[3] * local[1]);
+
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<T> iData{*in.data, h};
+        write_accessor<T> lData{*lower.data, h};
+        write_accessor<T> uData{*upper.data, h};
+
+        if (sameDims) {
+            h.parallel_for(sycl::nd_range{global, local},
+                           luSplitKernel<T, true>(
+                               lData, lower.info, uData, upper.info, iData,
+                               in.info, groupsPerMatX, groupsPerMatY));
+        } else {
+            h.parallel_for(sycl::nd_range{global, local},
+                           luSplitKernel<T, false>(
+                               lData, lower.info, uData, upper.info, iData,
+                               in.info, groupsPerMatX, groupsPerMatY));
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/mean.hpp b/src/backend/oneapi/kernel/mean.hpp
new file mode 100644
index 0000000000..4c8533b1ec
--- /dev/null
+++ b/src/backend/oneapi/kernel/mean.hpp
@@ -0,0 +1,743 @@
+#pragma once
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+
+namespace kernel {
+
+template<typename To, typename Tw>
+void stable_mean(To *lhs, Tw *l_wt, To rhs, Tw r_wt) {
+    if (((*l_wt) != (Tw)0) || (r_wt != (Tw)0)) {
+        Tw l_scale = (*l_wt);
+        (*l_wt) += r_wt;
+        l_scale = l_scale / (*l_wt);
+
+        Tw r_scale = r_wt / (*l_wt);
+        (*lhs)     = (l_scale * *lhs) + (r_scale * rhs);
+    }
+}
+
+template<typename Ti, typename Tw, typename To, uint dim, uint DIMY>
+class meanDimKernelSMEM {
+   public:
+    meanDimKernelSMEM(write_accessor<To> out, KParam oInfo,
+                      write_accessor<Tw> owt, KParam owInfo,
+                      read_accessor<Ti> in, KParam iInfo, read_accessor<Tw> iwt,
+                      KParam iwInfo, uint groups_x, uint groups_y,
+                      uint offset_dim,
+                      sycl::local_accessor<compute_t<To>, 1> s_val,
+                      sycl::local_accessor<compute_t<Tw>, 1> s_idx,
+                      bool input_weight, bool output_weight)
+        : out_(out)
+        , owt_(owt)
+        , in_(in)
+        , iwt_(iwt)
+        , oInfo_(oInfo)
+        , owInfo_(owInfo)
+        , iInfo_(iInfo)
+        , iwInfo_(iwInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , offset_dim_(offset_dim)
+        , s_val_(s_val)
+        , s_idx_(s_idx)
+        , input_weight_(input_weight)
+        , output_weight_(output_weight) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid        = g.get_group_id(0) / groups_x_;
+        const uint wid        = g.get_group_id(1) / groups_y_;
+        const uint groupIdx_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupIdx_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid        = groupIdx_x * g.get_local_range(0) + lidx;
+        const uint yid =
+            groupIdx_y;  // yid of the output; updated for the input later.
+
+        uint ids[4] = {xid, yid, zid, wid};
+
+        const Ti *iptr = in_.get_pointer();
+        To *optr       = out_.get_pointer();
+
+        uint ooffset = ids[3] * oInfo_.strides[3] + ids[2] * oInfo_.strides[2] +
+                       ids[1] * oInfo_.strides[1] + ids[0] + oInfo_.offset;
+        // There is only one element per work-group for out.
+        // There are get_local_range(1) elements per work-group for in.
+        // Hence increment ids[dim] just after offsetting out and before
+        // offsetting in.
+        optr += ooffset;
+
+        const uint blockIdx_dim = ids[dim];
+        ids[dim]                = ids[dim] * g.get_local_range(1) + lidy;
+
+        uint ioffset = ids[3] * iInfo_.strides[3] + ids[2] * iInfo_.strides[2] +
+                       ids[1] * iInfo_.strides[1] + ids[0] + iInfo_.offset;
+        iptr += ioffset;
+
+        const Tw *iwptr = nullptr;
+        Tw *owptr       = nullptr;
+
+        if (output_weight_) owptr = owt_.get_pointer() + ooffset;
+        if (input_weight_) iwptr = iwt_.get_pointer() + ioffset;
+
+        const uint id_dim_in   = ids[dim];
+        const uint istride_dim = iInfo_.strides[dim];
+
+        bool is_valid = (ids[0] < iInfo_.dims[0]) &&
+                        (ids[1] < iInfo_.dims[1]) &&
+                        (ids[2] < iInfo_.dims[2]) && (ids[3] < iInfo_.dims[3]);
+
+        common::Transform<Ti, compute_t<To>, af_add_t> transform;
+
+        compute_t<To> val    = common::Binary<compute_t<To>, af_add_t>::init();
+        compute_t<Tw> weight = common::Binary<compute_t<Tw>, af_add_t>::init();
+
+        if (is_valid && id_dim_in < iInfo_.dims[dim]) {
+            val = transform(*iptr);
+            if (iwptr) {
+                weight = *iwptr;
+            } else {
+                weight = (Tw)1;
+            }
+        }
+
+        const uint id_dim_in_start =
+            id_dim_in + offset_dim_ * g.get_local_range(1);
+
+        for (int id = id_dim_in_start; is_valid && (id < iInfo_.dims[dim]);
+             id += offset_dim_ * g.get_local_range(1)) {
+            iptr = iptr + offset_dim_ * g.get_local_range(1) * istride_dim;
+            if (input_weight_) {
+                iwptr =
+                    iwptr + offset_dim_ * g.get_local_range(1) * istride_dim;
+                stable_mean(&val, &weight, transform(*iptr),
+                            compute_t<Tw>(*iwptr));
+            } else {
+                // Faster version of stable_mean when iwptr is NULL
+                val    = val + (transform(*iptr) - val) / (weight + (Tw)1);
+                weight = weight + (Tw)1;
+            }
+        }
+
+        s_val_[lid] = val;
+        s_idx_[lid] = weight;
+
+        compute_t<To> *s_vptr = s_val_.get_pointer() + lid;
+        compute_t<Tw> *s_iptr = s_idx_.get_pointer() + lid;
+        group_barrier(g);
+
+        if (DIMY == 8) {
+            if (lidy < 4) {
+                stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 4],
+                            s_iptr[THREADS_X * 4]);
+            }
+            group_barrier(g);
+        }
+
+        if (DIMY >= 4) {
+            if (lidy < 2) {
+                stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 2],
+                            s_iptr[THREADS_X * 2]);
+            }
+            group_barrier(g);
+        }
+
+        if (DIMY >= 2) {
+            if (lidy < 1) {
+                stable_mean(s_vptr, s_iptr, s_vptr[THREADS_X * 1],
+                            s_iptr[THREADS_X * 1]);
+            }
+            group_barrier(g);
+        }
+
+        if (lidy == 0 && is_valid && (blockIdx_dim < oInfo_.dims[dim])) {
+            *optr = *s_vptr;
+            if (output_weight_) *owptr = *s_iptr;
+        }
+    }
+
+   protected:
+    write_accessor<To> out_;
+    write_accessor<Tw> owt_;
+    read_accessor<Ti> in_;
+    read_accessor<Tw> iwt_;
+    KParam oInfo_, owInfo_, iInfo_, iwInfo_;
+    const uint groups_x_, groups_y_, offset_dim_;
+    sycl::local_accessor<compute_t<To>, 1> s_val_;
+    sycl::local_accessor<compute_t<Tw>, 1> s_idx_;
+    bool input_weight_, output_weight_;
+};
+
+template<typename Ti, typename Tw, typename To, int dim>
+void mean_dim_launcher(Param<To> out, Param<Tw> owt, Param<Ti> in,
+                       Param<Tw> iwt, const uint threads_y,
+                       const dim_t blocks_dim[4]) {
+    sycl::range<2> local(THREADS_X, threads_y);
+    sycl::range<2> global(blocks_dim[0] * blocks_dim[2] * local[0],
+                          blocks_dim[1] * blocks_dim[3] * local[1]);
+
+    auto empty  = memAlloc<Tw>(1);
+    auto oempty = memAlloc<Tw>(1);
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        auto s_val =
+            sycl::local_accessor<compute_t<To>, 1>(THREADS_PER_BLOCK, h);
+        auto s_idx =
+            sycl::local_accessor<compute_t<Tw>, 1>(THREADS_PER_BLOCK, h);
+
+        bool input_weight = ((iwt.info.dims[0] * iwt.info.dims[1] *
+                              iwt.info.dims[2] * iwt.info.dims[3]) != 0);
+
+        bool output_weight = ((owt.info.dims[0] * owt.info.dims[1] *
+                               owt.info.dims[2] * owt.info.dims[3]) != 0);
+
+        write_accessor<Tw> owt_acc{(output_weight) ? *owt.data : *oempty, h};
+        read_accessor<Tw> iwt_acc{(input_weight) ? *iwt.data : *empty, h};
+
+        switch (threads_y) {
+            case 8:
+                h.parallel_for(sycl::nd_range<2>(global, local),
+                               meanDimKernelSMEM<Ti, Tw, To, dim, 8>(
+                                   out_acc, out.info, owt_acc, owt.info, in_acc,
+                                   in.info, iwt_acc, iwt.info, blocks_dim[0],
+                                   blocks_dim[1], blocks_dim[dim], s_val, s_idx,
+                                   input_weight, output_weight));
+                break;
+            case 4:
+                h.parallel_for(sycl::nd_range<2>(global, local),
+                               meanDimKernelSMEM<Ti, Tw, To, dim, 4>(
+                                   out_acc, out.info, owt_acc, owt.info, in_acc,
+                                   in.info, iwt_acc, iwt.info, blocks_dim[0],
+                                   blocks_dim[1], blocks_dim[dim], s_val, s_idx,
+                                   input_weight, output_weight));
+                break;
+            case 2:
+                h.parallel_for(sycl::nd_range<2>(global, local),
+                               meanDimKernelSMEM<Ti, Tw, To, dim, 2>(
+                                   out_acc, out.info, owt_acc, owt.info, in_acc,
+                                   in.info, iwt_acc, iwt.info, blocks_dim[0],
+                                   blocks_dim[1], blocks_dim[dim], s_val, s_idx,
+                                   input_weight, output_weight));
+                break;
+            case 1:
+                h.parallel_for(sycl::nd_range<2>(global, local),
+                               meanDimKernelSMEM<Ti, Tw, To, dim, 1>(
+                                   out_acc, out.info, owt_acc, owt.info, in_acc,
+                                   in.info, iwt_acc, iwt.info, blocks_dim[0],
+                                   blocks_dim[1], blocks_dim[dim], s_val, s_idx,
+                                   input_weight, output_weight));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tw, typename To, int dim>
+void mean_dim(Param<To> out, Param<Ti> in, Param<Tw> iwt) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    dim_t blocks_dim[] = {divup(in.info.dims[0], threads_x), in.info.dims[1],
+                          in.info.dims[2], in.info.dims[3]};
+
+    blocks_dim[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
+
+    Array<To> tmpOut = createEmptyArray<To>(dim4());
+    Array<Tw> tmpWt  = createEmptyArray<Tw>(dim4());
+
+    if (blocks_dim[dim] > 1) {
+        dim4 dims(4, out.info.dims);
+        dims[dim] = blocks_dim[dim];
+        tmpOut    = createEmptyArray<To>(dims);
+        tmpWt     = createEmptyArray<Tw>(dims);
+    } else {
+        tmpOut = createParamArray(out, false);
+    }
+
+    mean_dim_launcher<Ti, Tw, To, dim>(tmpOut, tmpWt, in, iwt, threads_y,
+                                       blocks_dim);
+
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
+
+        Array<Tw> owt = createEmptyArray<Tw>(dim4());
+        mean_dim_launcher<To, Tw, To, dim>(out, owt, tmpOut, tmpWt, threads_y,
+                                           blocks_dim);
+    }
+}
+
+// Calculate the mean along the first dimension. If iwt is an empty Param,
+// use a weight of 1 and treat it as a count. If owt is an empty Param, do
+// not write the temporary reduced counts/weights to it.
+template<typename Ti, typename Tw, typename To>
+class meanFirstKernelSMEM {
+   public:
+    meanFirstKernelSMEM(write_accessor<To> out, KParam oInfo,
+                        write_accessor<Tw> owt, KParam owInfo,
+                        read_accessor<Ti> in, KParam iInfo,
+                        read_accessor<Tw> iwt, KParam iwInfo, const uint DIMX,
+                        const uint groups_x, const uint groups_y,
+                        const uint repeat,
+                        sycl::local_accessor<compute_t<To>, 1> s_val,
+                        sycl::local_accessor<compute_t<Tw>, 1> s_idx,
+                        bool input_weight, bool output_weight)
+        : out_(out)
+        , owt_(owt)
+        , in_(in)
+        , iwt_(iwt)
+        , oInfo_(oInfo)
+        , owInfo_(owInfo)
+        , iInfo_(iInfo)
+        , iwInfo_(iwInfo)
+        , DIMX_(DIMX)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , repeat_(repeat)
+        , s_val_(s_val)
+        , s_idx_(s_idx)
+        , input_weight_(input_weight)
+        , output_weight_(output_weight) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * DIMX_ + lidx;
+
+        const uint zid        = g.get_group_id(0) / groups_x_;
+        const uint wid        = g.get_group_id(1) / groups_y_;
+        const uint groupIdx_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupIdx_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid = groupIdx_x * g.get_local_range(0) * repeat_ + lidx;
+        const uint yid = groupIdx_y * g.get_local_range(1) + lidy;
+
+        const Ti *iptr = in_.get_pointer();
+        To *optr       = out_.get_pointer();
+
+        iptr += wid * iInfo_.strides[3] + zid * iInfo_.strides[2] +
+                yid * iInfo_.strides[1] + iInfo_.offset;
+        optr += wid * oInfo_.strides[3] + zid * oInfo_.strides[2] +
+                yid * oInfo_.strides[1] + oInfo_.offset;
+
+        const Tw *iwptr = nullptr;
+        Tw *owptr       = nullptr;
+        if (input_weight_)
+            iwptr = iwt_.get_pointer() + wid * iwInfo_.strides[3] +
+                    zid * iwInfo_.strides[2] + yid * iwInfo_.strides[1] +
+                    iwInfo_.offset;
+
+        if (output_weight_)
+            owptr = owt_.get_pointer() + wid * owInfo_.strides[3] +
+                    zid * owInfo_.strides[2] + yid * owInfo_.strides[1] +
+                    owInfo_.offset;
+
+        bool cond = (yid < iInfo_.dims[1] && zid < iInfo_.dims[2] &&
+                     wid < iInfo_.dims[3]);
+
+        int lim = min((dim_t)(xid + repeat_ * DIMX_), iInfo_.dims[0]);
+
+        common::Transform<Ti, compute_t<To>, af_add_t> transform;
+
+        compute_t<To> val    = common::Binary<compute_t<To>, af_add_t>::init();
+        compute_t<Tw> weight = common::Binary<compute_t<Tw>, af_add_t>::init();
+
+        if (cond && xid < lim) {
+            val = transform(iptr[xid]);
+            if (input_weight_) {
+                weight = iwptr[xid];
+            } else {
+                weight = (Tw)1;
+            }
+        }
+
+        if (input_weight_) {
+            for (int id = xid + DIMX_; cond && id < lim; id += DIMX_) {
+                stable_mean(&val, &weight, transform(iptr[id]),
+                            compute_t<Tw>(iwptr[id]));
+            }
+        } else {
+            for (int id = xid + DIMX_; cond && id < lim; id += DIMX_) {
+                // Faster version of stable_mean when iwptr is NULL
+                val = val + (transform(iptr[id]) - compute_t<To>(val)) /
+                                (weight + (Tw)1);
+                weight = weight + (Tw)1;
+            }
+        }
+
+        s_val_[lid] = val;
+        s_idx_[lid] = weight;
+        group_barrier(g);
+
+        compute_t<To> *s_vptr = s_val_.get_pointer() + lidy * DIMX_;
+        compute_t<Tw> *s_iptr = s_idx_.get_pointer() + lidy * DIMX_;
+
+        if (DIMX_ == 256) {
+            if (lidx < 128) {
+                stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 128],
+                            s_iptr[lidx + 128]);
+            }
+            group_barrier(g);
+        }
+
+        if (DIMX_ >= 128) {
+            if (lidx < 64) {
+                stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 64],
+                            s_iptr[lidx + 64]);
+            }
+            group_barrier(g);
+        }
+
+        if (DIMX_ >= 64) {
+            if (lidx < 32) {
+                stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 32],
+                            s_iptr[lidx + 32]);
+            }
+            group_barrier(g);
+        }
+
+        if (lidx < 16) {
+            stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 16],
+                        s_iptr[lidx + 16]);
+        }
+        group_barrier(g);
+
+        if (lidx < 8) {
+            stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 8],
+                        s_iptr[lidx + 8]);
+        }
+        group_barrier(g);
+
+        if (lidx < 4) {
+            stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 4],
+                        s_iptr[lidx + 4]);
+        }
+        group_barrier(g);
+
+        if (lidx < 2) {
+            stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 2],
+                        s_iptr[lidx + 2]);
+        }
+        group_barrier(g);
+
+        if (lidx < 1) {
+            stable_mean(s_vptr + lidx, s_iptr + lidx, s_vptr[lidx + 1],
+                        s_iptr[lidx + 1]);
+        }
+        group_barrier(g);
+
+        if (cond && lidx == 0) {
+            optr[groupIdx_x] = s_vptr[0];
+            if (output_weight_) owptr[groupIdx_x] = s_iptr[0];
+        }
+    }
+
+   protected:
+    write_accessor<To> out_;
+    write_accessor<Tw> owt_;
+    read_accessor<Ti> in_;
+    read_accessor<Tw> iwt_;
+    KParam oInfo_, owInfo_, iInfo_, iwInfo_;
+    const uint DIMX_, groups_x_, groups_y_, repeat_;
+    sycl::local_accessor<compute_t<To>, 1> s_val_;
+    sycl::local_accessor<compute_t<Tw>, 1> s_idx_;
+    bool input_weight_, output_weight_;
+};
+
+template<typename Ti, typename Tw, typename To>
+void mean_first_launcher(Param<To> out, Param<Tw> owt, Param<Ti> in,
+                         Param<Tw> iwt, const uint groups_x,
+                         const uint groups_y, const uint threads_x) {
+    sycl::range<2> local(threads_x, THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * in.info.dims[2] * local[0],
+                          groups_y * in.info.dims[3] * local[1]);
+
+    uint repeat = divup(in.info.dims[0], (groups_x * threads_x));
+
+    auto empty  = memAlloc<Tw>(1);
+    auto oempty = memAlloc<Tw>(1);
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        auto s_val =
+            sycl::local_accessor<compute_t<To>, 1>(THREADS_PER_BLOCK, h);
+        auto s_idx =
+            sycl::local_accessor<compute_t<Tw>, 1>(THREADS_PER_BLOCK, h);
+
+        bool input_weight = ((iwt.info.dims[0] * iwt.info.dims[1] *
+                              iwt.info.dims[2] * iwt.info.dims[3]) != 0);
+
+        bool output_weight = ((owt.info.dims[0] * owt.info.dims[1] *
+                               owt.info.dims[2] * owt.info.dims[3]) != 0);
+
+        write_accessor<Tw> owt_acc{(output_weight) ? *owt.data : *oempty, h};
+        read_accessor<Tw> iwt_acc{(input_weight) ? *iwt.data : *empty, h};
+
+        h.parallel_for(
+            sycl::nd_range<2>(global, local),
+            meanFirstKernelSMEM<Ti, Tw, To>(
+                out_acc, out.info, owt_acc, owt.info, in_acc, in.info, iwt_acc,
+                iwt.info, threads_x, groups_x, groups_y, repeat, s_val, s_idx,
+                input_weight, output_weight));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean_first(Param<To> out, Param<Ti> in, Param<Tw> iwt) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint blocks_y = divup(in.info.dims[1], threads_y);
+
+    Array<To> tmpOut = createEmptyArray<To>(dim4());
+    Array<Tw> tmpWt  = createEmptyArray<Tw>(dim4());
+    if (blocks_x > 1) {
+        tmpOut = createEmptyArray<To>(
+            {blocks_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+        tmpWt = createEmptyArray<Tw>(
+            {blocks_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+    } else {
+        tmpOut = createParamArray(out, false);
+    }
+
+    mean_first_launcher<Ti, Tw, To>(tmpOut, tmpWt, in, iwt, blocks_x, blocks_y,
+                                    threads_x);
+
+    if (blocks_x > 1) {
+        Param<Tw> owt;
+        owt.data = nullptr;
+        mean_first_launcher<To, Tw, To>(out, owt, tmpOut, tmpWt, 1, blocks_y,
+                                        threads_x);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean_weighted(Param<To> out, Param<Ti> in, Param<Tw> iwt, int dim) {
+    switch (dim) {
+        case 0: return mean_first<Ti, Tw, To>(out, in, iwt);
+        case 1: return mean_dim<Ti, Tw, To, 1>(out, in, iwt);
+        case 2: return mean_dim<Ti, Tw, To, 2>(out, in, iwt);
+        case 3: return mean_dim<Ti, Tw, To, 3>(out, in, iwt);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean(Param<To> out, Param<Ti> in, int dim) {
+    Param<Tw> dummy_weight;
+    mean_weighted<Ti, Tw, To>(out, in, dummy_weight, dim);
+}
+
+template<typename T, typename Tw>
+T mean_all_weighted(Param<T> in, Param<Tw> iwt) {
+    uintl in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+    // FIXME: Use a better heuristic than a fixed element-count threshold
+    if (in_elements > 4096) {
+        bool in_is_linear = (in.info.strides[0] == 1);
+        bool wt_is_linear = (iwt.info.strides[0] == 1);
+        for (int k = 1; k < 4; k++) {
+            in_is_linear &= (in.info.strides[k] ==
+                             (in.info.strides[k - 1] * in.info.dims[k - 1]));
+            wt_is_linear &= (iwt.info.strides[k] ==
+                             (iwt.info.strides[k - 1] * iwt.info.dims[k - 1]));
+        }
+
+        if (in_is_linear && wt_is_linear) {
+            in.info.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.info.dims[k]    = 1;
+                in.info.strides[k] = in_elements;
+            }
+
+            for (int k = 0; k < 4; k++) {
+                iwt.info.dims[k]    = in.info.dims[k];
+                iwt.info.strides[k] = in.info.strides[k];
+            }
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+        uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+        uint blocks_x = divup(in.info.dims[0], threads_x * REPEAT);
+        uint blocks_y = divup(in.info.dims[1], threads_y);
+
+        Array<T> tmpOut = createEmptyArray<T>(
+            {blocks_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+        Array<Tw> tmpWt = createEmptyArray<Tw>(
+            {blocks_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+
+        uintl tmp_elements = tmpOut.elements();
+
+        mean_first_launcher<T, Tw, T>(tmpOut, tmpWt, in, iwt, blocks_x,
+                                      blocks_y, threads_x);
+
+        compute_t<T> val;
+        auto tmpOut_get = tmpOut.get();
+        auto tmpWt_get = tmpWt.get();
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                auto acc_in =
+                    tmpOut_get->get_host_access(h, sycl::read_only);
+                auto acc_wt =
+                    tmpWt_get->get_host_access(h, sycl::read_only);
+
+                h.host_task([acc_in, acc_wt, tmp_elements, &val] {
+                    val = static_cast<compute_t<T>>(acc_in[0]);
+                    compute_t<Tw> weight =
+                        static_cast<compute_t<Tw>>(acc_wt[0]);
+
+                    for (uintl i = 1; i < tmp_elements; i++) {
+                        stable_mean(&val, &weight, compute_t<T>(acc_in[i]),
+                                    compute_t<Tw>(acc_wt[i]));
+                    }
+                });
+            })
+            .wait();
+        return static_cast<T>(val);
+    } else {
+        compute_t<T> val;
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                auto acc_in = in.data->get_host_access(
+                    h, sycl::range{in_elements}, sycl::read_only);
+                auto acc_wt = iwt.data->get_host_access(
+                    h, sycl::range{in_elements}, sycl::read_only);
+
+                h.host_task([acc_in, acc_wt, in_elements, &val]() {
+                    val                  = acc_in[0];
+                    compute_t<Tw> weight = acc_wt[0];
+                    for (uintl i = 1; i < in_elements; i++) {
+                        stable_mean(&val, &weight, compute_t<T>(acc_in[i]),
+                                    compute_t<Tw>(acc_wt[i]));
+                    }
+                });
+            })
+            .wait();
+        return static_cast<T>(val);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+To mean_all(Param<Ti> in) {
+    using std::unique_ptr;
+    uintl in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
+
+    // FIXME: Use a better heuristic than a fixed element-count threshold
+    if (in_elements > 4096 || !is_linear) {
+        if (is_linear) {
+            in.info.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.info.dims[k]    = 1;
+                in.info.strides[k] = in_elements;
+            }
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+        uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+        uint blocks_x = divup(in.info.dims[0], threads_x * REPEAT);
+        uint blocks_y = divup(in.info.dims[1], threads_y);
+
+        dim4 outDims(blocks_x, in.info.dims[1], in.info.dims[2],
+                     in.info.dims[3]);
+
+        Array<To> tmpOut = createEmptyArray<To>(outDims);
+        Array<Tw> tmpCt  = createEmptyArray<Tw>(outDims);
+
+        Param<Tw> iwt;
+        mean_first_launcher<Ti, Tw, To>(tmpOut, tmpCt, in, iwt, blocks_x,
+                                        blocks_y, threads_x);
+
+        uintl tmp_elements = tmpOut.elements();
+
+        compute_t<To> val;
+        auto tmpOut_get = tmpOut.get();
+        auto tmpCt_get = tmpCt.get();
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                auto out =
+                    tmpOut_get->get_host_access(h, sycl::read_only);
+                auto ct =
+                    tmpCt_get->get_host_access(h, sycl::read_only);
+
+                h.host_task([out, ct, tmp_elements, &val] {
+                    val                  = static_cast<compute_t<To>>(out[0]);
+                    compute_t<Tw> weight = static_cast<compute_t<Tw>>(ct[0]);
+
+                    for (uintl i = 1; i < tmp_elements; i++) {
+                        stable_mean(&val, &weight, compute_t<To>(out[i]),
+                                    compute_t<Tw>(ct[i]));
+                    }
+                });
+            })
+            .wait();
+        return static_cast<To>(val);
+    } else {
+        compute_t<To> val;
+        getQueue()
+            .submit([&](sycl::handler &h) {
+                auto acc_in =
+                    in.data->get_host_access(h, sycl::read_only);
+                h.host_task([acc_in, in_elements, &val]() {
+                    common::Transform<Ti, compute_t<To>, af_add_t> transform;
+                    compute_t<Tw> count = static_cast<compute_t<Tw>>(1);
+
+                    val                  = transform(acc_in[0]);
+                    compute_t<Tw> weight = count;
+                    for (uintl i = 1; i < in_elements; i++) {
+                        stable_mean(&val, &weight, transform(acc_in[i]), count);
+                    }
+                });
+            })
+            .wait();
+        return static_cast<To>(val);
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
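The kernels and host tasks in mean.hpp fold partial (mean, weight) pairs together with `stable_mean`, whose definition is not part of this section. The following is a minimal sketch of that kind of pairwise merge, assuming a standard weighted-mean update. `RunningMean`, `merge_mean`, and `chunked_mean` are hypothetical names used only for illustration; they are not ArrayFire APIs and may differ from the real `stable_mean`.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical illustration type: a partial mean plus the weight (or count)
// of the elements that produced it.
struct RunningMean {
    double mean;
    double weight;
};

// Merge rhs into lhs without forming the raw sum. Assumed to be similar in
// spirit to stable_mean; the real helper is not shown in this section.
inline void merge_mean(RunningMean &lhs, const RunningMean &rhs) {
    const double total = lhs.weight + rhs.weight;
    if (total == 0.0) return;  // nothing to merge, avoid dividing by zero
    lhs.mean += (rhs.mean - lhs.mean) * (rhs.weight / total);
    lhs.weight = total;
}

// Two-stage reduction: each chunk produces a partial (mean, count) pair,
// then the partials are merged, mirroring the tmpOut/tmpWt pass in the diff.
// chunk must be greater than zero.
inline double chunked_mean(const std::vector<double> &data, std::size_t chunk) {
    RunningMean total{0.0, 0.0};
    for (std::size_t start = 0; start < data.size(); start += chunk) {
        RunningMean partial{0.0, 0.0};
        const std::size_t stop = std::min(start + chunk, data.size());
        for (std::size_t i = start; i < stop; ++i) {
            merge_mean(partial, RunningMean{data[i], 1.0});
        }
        merge_mean(total, partial);
    }
    return total.mean;
}
```

Carrying the weight next to each partial mean is what lets the second launch (or the host task) combine chunks of unequal size correctly: a chunk holding one element contributes with weight 1, a full chunk with its full count.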
diff --git a/src/backend/oneapi/kernel/meanshift.hpp b/src/backend/oneapi/kernel/meanshift.hpp
new file mode 100644
index 0000000000..ef28998d4d
--- /dev/null
+++ b/src/backend/oneapi/kernel/meanshift.hpp
@@ -0,0 +1,227 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+inline int convert_int_rtz(float number) { return ((int)(number)); }
+
+template<typename T, typename AccType, const int MAX_CHANNELS>
+class meanshiftCreateKernel {
+   public:
+    meanshiftCreateKernel(write_accessor<T> d_dst, KParam oInfo,
+                          read_accessor<T> d_src, KParam iInfo, int radius,
+                          float cvar, unsigned numIters, int nBBS0, int nBBS1)
+        : d_dst_(d_dst)
+        , oInfo_(oInfo)
+        , d_src_(d_src)
+        , iInfo_(iInfo)
+        , radius_(radius)
+        , cvar_(cvar)
+        , numIters_(numIters)
+        , nBBS0_(nBBS0)
+        , nBBS1_(nBBS1) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        unsigned b2 = g.get_group_id(0) / nBBS0_;
+        unsigned b3 = g.get_group_id(1) / nBBS1_;
+        const int gx =
+            g.get_local_range(0) * (g.get_group_id(0) - b2 * nBBS0_) +
+            it.get_local_id(0);
+        const int gy =
+            g.get_local_range(1) * (g.get_group_id(1) - b3 * nBBS1_) +
+            it.get_local_id(1);
+
+        if (gx < iInfo_.dims[0] && gy < iInfo_.dims[1]) {
+            const T* iptr =
+                d_src_.get_pointer() + (b2 * iInfo_.strides[2] +
+                                        b3 * iInfo_.strides[3] + iInfo_.offset);
+            T* optr = d_dst_.get_pointer() +
+                      (b2 * oInfo_.strides[2] + b3 * oInfo_.strides[3]);
+
+            int meanPosI = gx;
+            int meanPosJ = gy;
+
+            T currentCenterColors[MAX_CHANNELS];
+            T tempColors[MAX_CHANNELS];
+
+            AccType currentMeanColors[MAX_CHANNELS];
+
+#pragma unroll
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                currentCenterColors[ch] =
+                    iptr[gx * iInfo_.strides[0] + gy * iInfo_.strides[1] +
+                         ch * iInfo_.strides[2]];
+
+            const int dim0LenLmt = iInfo_.dims[0] - 1;
+            const int dim1LenLmt = iInfo_.dims[1] - 1;
+
+            // scope of meanshift iterations begin
+            for (uint iter = 0; iter < numIters_; ++iter) {
+                int oldMeanPosJ = meanPosJ;
+                int oldMeanPosI = meanPosI;
+                unsigned count  = 0;
+
+                int shift_x = 0;
+                int shift_y = 0;
+
+                for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                    currentMeanColors[ch] = 0;
+
+                for (int wj = -radius_; wj <= radius_; ++wj) {
+                    int hit_count = 0;
+                    int tj        = meanPosJ + wj;
+
+                    if (tj < 0 || tj > dim1LenLmt) continue;
+
+                    for (int wi = -radius_; wi <= radius_; ++wi) {
+                        int ti = meanPosI + wi;
+
+                        if (ti < 0 || ti > dim0LenLmt) continue;
+
+                        AccType norm = 0;
+#pragma unroll
+                        for (int ch = 0; ch < MAX_CHANNELS; ++ch) {
+                            unsigned idx = ti * iInfo_.strides[0] +
+                                           tj * iInfo_.strides[1] +
+                                           ch * iInfo_.strides[2];
+                            tempColors[ch] = iptr[idx];
+                            AccType diff   = (AccType)currentCenterColors[ch] -
+                                           (AccType)tempColors[ch];
+                            norm += (diff * diff);
+                        }
+
+                        if (norm <= cvar_) {
+#pragma unroll
+                            for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                                currentMeanColors[ch] +=
+                                    (AccType)tempColors[ch];
+
+                            shift_x += ti;
+                            ++hit_count;
+                        }
+                    }
+                    count += hit_count;
+                    shift_y += tj * hit_count;
+                }
+
+                if (count == 0) break;
+
+                const AccType fcount = 1 / (AccType)count;
+
+                meanPosI = convert_int_rtz(shift_x * fcount);
+                meanPosJ = convert_int_rtz(shift_y * fcount);
+
+#pragma unroll
+                for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                    currentMeanColors[ch] =
+                        convert_int_rtz(currentMeanColors[ch] * fcount);
+
+                AccType norm = 0;
+#pragma unroll
+                for (int ch = 0; ch < MAX_CHANNELS; ++ch) {
+                    AccType diff = (AccType)currentCenterColors[ch] -
+                                   currentMeanColors[ch];
+                    norm += (diff * diff);
+                }
+
+                bool stop =
+                    (meanPosJ == oldMeanPosJ && meanPosI == oldMeanPosI) ||
+                    ((abs(oldMeanPosJ - meanPosJ) +
+                      abs(oldMeanPosI - meanPosI)) +
+                     norm) <= 1;
+
+#pragma unroll
+                for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                    currentCenterColors[ch] = (T)(currentMeanColors[ch]);
+
+                if (stop) break;
+            }  // scope of meanshift iterations end
+
+#pragma unroll
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                optr[gx * oInfo_.strides[0] + gy * oInfo_.strides[1] +
+                     ch * oInfo_.strides[2]] = currentCenterColors[ch];
+        }
+    }
+
+   private:
+    write_accessor<T> d_dst_;
+    KParam oInfo_;
+    read_accessor<T> d_src_;
+    KParam iInfo_;
+    int radius_;
+    float cvar_;
+    unsigned numIters_;
+    int nBBS0_;
+    int nBBS1_;
+};
+
+template<typename T>
+void meanshift(Param<T> out, const Param<T> in, const float spatialSigma,
+               const float chromaticSigma, const uint numIters,
+               const bool is_color) {
+    using AccType = typename std::conditional<std::is_same<T, double>::value,
+                                              double, float>::type;
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+
+    const int MAX_CHANNELS = (is_color ? 3 : 1);
+
+    auto local = sycl::range(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    const int bCount = (is_color ? 1 : in.info.dims[2]);
+
+    auto global = sycl::range(bCount * blk_x * THREADS_X,
+                              in.info.dims[3] * blk_y * THREADS_Y);
+
+    // clamp spatial and chromatic sigmas
+    int radius = std::max((int)(spatialSigma * 1.5f), 1);
+
+    const float cvar = chromaticSigma * chromaticSigma;
+
+    getQueue().submit([&](auto& h) {
+        read_accessor<T> d_src{*in.data, h};
+        write_accessor<T> d_dst{*out.data, h};
+        if (MAX_CHANNELS == 3) {
+            h.parallel_for(sycl::nd_range{global, local},
+                           meanshiftCreateKernel<T, AccType, 3>(
+                               d_dst, out.info, d_src, in.info, radius, cvar,
+                               numIters, blk_x, blk_y));
+        } else {
+            h.parallel_for(sycl::nd_range{global, local},
+                           meanshiftCreateKernel<T, AccType, 1>(
+                               d_dst, out.info, d_src, in.info, radius, cvar,
+                               numIters, blk_x, blk_y));
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
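The kernels in this diff (for example `meanshiftCreateKernel`) run on a 2D launch but address 4D data: the host multiplies the number of blocks along each axis by the size of a batch dimension, and the kernel splits each group id back into a batch index and a block index within that batch. The following is a host-side sketch of that arithmetic, assuming plain unsigned integers. `ceil_div`, `GroupCoord`, `split_group_id`, and `global_index` are hypothetical names used only for illustration; `ceil_div` stands in for the `divup` helper used in the source.

```cpp
// Hypothetical illustration type: where a work-group sits after unfolding.
struct GroupCoord {
    unsigned batch;  // index along the folded dimension (b2 / b3 in the kernel)
    unsigned block;  // block index within that batch
};

// Stand-in for the divup helper: number of blocks needed to cover a items.
inline unsigned ceil_div(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Split a flat group id into (batch, block), as the kernel does with
// g.get_group_id(0) / nBBS0_ and the matching subtraction.
inline GroupCoord split_group_id(unsigned group_id, unsigned blocks_per_batch) {
    const unsigned batch = group_id / blocks_per_batch;
    const unsigned block = group_id - batch * blocks_per_batch;
    return GroupCoord{batch, block};
}

// Position of a work-item inside its batch along one axis (gx / gy).
inline unsigned global_index(unsigned group_id, unsigned local_id,
                             unsigned local_range, unsigned blocks_per_batch) {
    const GroupCoord c = split_group_id(group_id, blocks_per_batch);
    return local_range * c.block + local_id;
}
```

For a 40-element axis with 16 work-items per group, `ceil_div(40, 16)` gives 3 blocks per batch, so group id 7 maps to batch 2, block 1. Work-items whose index reaches past the axis length are masked out by the bounds check in the kernel.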
diff --git a/src/backend/oneapi/kernel/memcopy.hpp b/src/backend/oneapi/kernel/memcopy.hpp
new file mode 100644
index 0000000000..64bd26ba1e
--- /dev/null
+++ b/src/backend/oneapi/kernel/memcopy.hpp
@@ -0,0 +1,346 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/traits.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+using factortypes = typename std::conditional<std::is_same_v<T, double> ||
+                                                  std::is_same_v<T, cdouble>,
+                                              double, float>::type;
+
+template<typename T, typename FACTORTYPE = factortypes<T>>
+inline T scale(T value, FACTORTYPE factor) {
+    return (T)(FACTORTYPE(value) * factor);
+}
+
+template<>
+inline cfloat scale<cfloat, float>(cfloat value, float factor) {
+    return cfloat{static_cast<float>(value.real() * factor),
+                  static_cast<float>(value.imag() * factor)};
+}
+
+template<>
+inline cdouble scale<cdouble, double>(cdouble value, double factor) {
+    return cdouble{value.real() * factor, value.imag() * factor};
+}
+
+typedef struct {
+    dim_t dim[4];
+} dims_t;
+
+template<typename T>
+class memCopy {
+   public:
+    memCopy(write_accessor<T> out, dims_t ostrides, int ooffset,
+            read_accessor<T> in, dims_t idims, dims_t istrides, int ioffset,
+            int groups_0, int groups_1)
+        : out_(out)
+        , ostrides_(ostrides)
+        , ooffset_(ooffset)
+        , in_(in)
+        , idims_(idims)
+        , istrides_(istrides)
+        , ioffset_(ioffset)
+        , groups_0_(groups_0)
+        , groups_1_(groups_1) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        const int lid0 = it.get_local_id(0);
+        const int lid1 = it.get_local_id(1);
+
+        sycl::group gg       = it.get_group();
+        const int id2        = gg.get_group_id(0) / groups_0_;
+        const int id3        = gg.get_group_id(1) / groups_1_;
+        const int group_id_0 = gg.get_group_id(0) - groups_0_ * id2;
+        const int group_id_1 = gg.get_group_id(1) - groups_1_ * id3;
+        const int id0        = group_id_0 * gg.get_local_range(0) + lid0;
+        const int id1        = group_id_1 * gg.get_local_range(1) + lid1;
+
+        const T *iptr = in_.get_pointer();
+        // FIXME: Do more work per work group
+
+        T *optr = out_.get_pointer();
+        optr += id3 * ostrides_.dim[3] + id2 * ostrides_.dim[2] +
+                id1 * ostrides_.dim[1] + ooffset_;
+        iptr += id3 * istrides_.dim[3] + id2 * istrides_.dim[2] +
+                id1 * istrides_.dim[1] + ioffset_;
+
+        int istride0 = istrides_.dim[0];
+        size_t idd0  = idims_.dim[0];
+        size_t idd1  = idims_.dim[1];
+        size_t idd2  = idims_.dim[2];
+        size_t idd3  = idims_.dim[3];
+
+        if (id0 < idd0 && id1 < idd1 && id2 < idd2 && id3 < idd3) {
+            optr[id0] = iptr[id0 * istride0];
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    dims_t ostrides_;
+    int ooffset_;
+    read_accessor<T> in_;
+    dims_t idims_, istrides_;
+    int ioffset_, groups_0_, groups_1_;
+};
+
+constexpr uint DIM0 = 32;
+constexpr uint DIM1 = 8;
+
+template<typename T>
+void memcopy(sycl::buffer<T> *out, const dim_t *ostrides,
+             const sycl::buffer<T> *in, const dim_t *idims,
+             const dim_t *istrides, dim_t ioffset, uint indims,
+             dim_t ooffset = 0) {
+    dims_t _ostrides = {{ostrides[0], ostrides[1], ostrides[2], ostrides[3]}};
+    dims_t _istrides = {{istrides[0], istrides[1], istrides[2], istrides[3]}};
+    dims_t _idims    = {{idims[0], idims[1], idims[2], idims[3]}};
+
+    size_t local_size[2] = {DIM0, DIM1};
+    if (indims == 1) {
+        local_size[0] *= local_size[1];
+        local_size[1] = 1;
+    }
+
+    int groups_0 = divup(idims[0], local_size[0]);
+    int groups_1 = divup(idims[1], local_size[1]);
+
+    sycl::range<2> local(local_size[0], local_size[1]);
+    sycl::range<2> global(groups_0 * idims[2] * local_size[0],
+                          groups_1 * idims[3] * local_size[1]);
+    sycl::nd_range<2> ndrange(global, local);
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> out_acc{*out, h};
+        read_accessor<T> in_acc{*const_cast<sycl::buffer<T> *>(in), h};
+
+        h.parallel_for(ndrange,
+                       memCopy<T>(out_acc, _ostrides, ooffset, in_acc, _idims,
+                                  _istrides, ioffset, groups_0, groups_1));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename inType, typename outType>
+inline outType convertType(inType value) {
+    return static_cast<outType>(value);
+}
+
+template<>
+inline char convertType<compute_t<arrayfire::common::half>, char>(
+    compute_t<arrayfire::common::half> value) {
+    return (char)((short)value);
+}
+
+template<>
+inline compute_t<arrayfire::common::half>
+convertType<char, compute_t<arrayfire::common::half>>(char value) {
+    return compute_t<arrayfire::common::half>(value);
+}
+
+template<>
+signed char inline convertType<compute_t<arrayfire::common::half>, signed char>(
+    compute_t<arrayfire::common::half> value) {
+    return (signed char)((short)value);
+}
+
+template<>
+inline compute_t<arrayfire::common::half>
+convertType<signed char, compute_t<arrayfire::common::half>>(
+    signed char value) {
+    return compute_t<arrayfire::common::half>(value);
+}
+
+template<>
+unsigned char inline convertType<compute_t<arrayfire::common::half>,
+                                 unsigned char>(
+    compute_t<arrayfire::common::half> value) {
+    return (unsigned char)((short)value);
+}
+
+template<>
+inline compute_t<arrayfire::common::half>
+convertType<unsigned char, compute_t<arrayfire::common::half>>(
+    unsigned char value) {
+    return compute_t<arrayfire::common::half>(value);
+}
+
+#define OTHER_SPECIALIZATIONS(IN_T)                         \
+    template<>                                              \
+    inline cfloat convertType<IN_T, cfloat>(IN_T value) {   \
+        return cfloat(static_cast<float>(value), 0.0f);     \
+    }                                                       \
+                                                            \
+    template<>                                              \
+    inline cdouble convertType<IN_T, cdouble>(IN_T value) { \
+        return cdouble(static_cast<double>(value), 0.0);    \
+    }
+
+OTHER_SPECIALIZATIONS(float)
+OTHER_SPECIALIZATIONS(double)
+OTHER_SPECIALIZATIONS(int)
+OTHER_SPECIALIZATIONS(uint)
+OTHER_SPECIALIZATIONS(intl)
+OTHER_SPECIALIZATIONS(uintl)
+OTHER_SPECIALIZATIONS(short)
+OTHER_SPECIALIZATIONS(ushort)
+OTHER_SPECIALIZATIONS(schar)
+OTHER_SPECIALIZATIONS(uchar)
+OTHER_SPECIALIZATIONS(char)
+OTHER_SPECIALIZATIONS(arrayfire::common::half)
+
+template<typename inType, typename outType, bool SAMEDIMS>
+class reshapeCopy {
+   public:
+    reshapeCopy(write_accessor<outType> dst, KParam oInfo,
+                read_accessor<inType> src, KParam iInfo, outType default_value,
+                factortypes<inType> factor, dims_t trgt, int blk_x, int blk_y)
+        : dst_(dst)
+        , src_(src)
+        , oInfo_(oInfo)
+        , iInfo_(iInfo)
+        , default_value_(default_value)
+        , factor_(factor)
+        , trgt_(trgt)
+        , blk_x_(blk_x)
+        , blk_y_(blk_y) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        const uint lx = it.get_local_id(0);
+        const uint ly = it.get_local_id(1);
+
+        sycl::group gg  = it.get_group();
+        uint gz         = gg.get_group_id(0) / blk_x_;
+        uint gw         = gg.get_group_id(1) / blk_y_;
+        uint blockIdx_x = gg.get_group_id(0) - (blk_x_)*gz;
+        uint blockIdx_y = gg.get_group_id(1) - (blk_y_)*gw;
+        uint gx         = blockIdx_x * gg.get_local_range(0) + lx;
+        uint gy         = blockIdx_y * gg.get_local_range(1) + ly;
+
+        const inType *srcptr = src_.get_pointer();
+        outType *dstptr      = dst_.get_pointer();
+
+        const inType *in =
+            srcptr + (gw * iInfo_.strides[3] + gz * iInfo_.strides[2] +
+                      gy * iInfo_.strides[1] + iInfo_.offset);
+        outType *out =
+            dstptr + (gw * oInfo_.strides[3] + gz * oInfo_.strides[2] +
+                      gy * oInfo_.strides[1] + oInfo_.offset);
+
+        uint istride0 = iInfo_.strides[0];
+        uint ostride0 = oInfo_.strides[0];
+
+        size_t odims0 = oInfo_.dims[0];
+        size_t odims1 = oInfo_.dims[1];
+        size_t odims2 = oInfo_.dims[2];
+        size_t odims3 = oInfo_.dims[3];
+
+        size_t tdims0 = trgt_.dim[0];
+        size_t tdims1 = trgt_.dim[1];
+        size_t tdims2 = trgt_.dim[2];
+        size_t tdims3 = trgt_.dim[3];
+
+        if (gy < odims1 && gz < odims2 && gw < odims3) {
+            int loop_offset = gg.get_local_range(0) * blk_x_;
+            bool cond       = gy < tdims1 && gz < tdims2 && gw < tdims3;
+            for (int rep = gx; rep < odims0; rep += loop_offset) {
+                outType temp = default_value_;
+                if (SAMEDIMS || (rep < tdims0 && cond)) {
+                    temp = convertType<inType, outType>(
+                        scale<inType>(in[rep * istride0], factor_));
+                }
+                out[rep * ostride0] = temp;
+            }
+        }
+    }
+
+   protected:
+    write_accessor<outType> dst_;
+    read_accessor<inType> src_;
+    KParam oInfo_, iInfo_;
+    outType default_value_;
+    factortypes<inType> factor_;
+    dims_t trgt_;
+    int blk_x_, blk_y_;
+};
+
+template<typename inType, typename outType>
+void copy(Param<outType> dst, const Param<inType> src, const int ndims,
+          const outType default_value, const double factor,
+          const bool same_dims) {
+    using std::string;
+
+    sycl::range<2> local(DIM0, DIM1);
+    size_t local_size[] = {DIM0, DIM1};
+
+    local_size[0] *= local_size[1];
+    if (ndims == 1) { local_size[1] = 1; }
+
+    int blk_x = divup(dst.info.dims[0], local_size[0]);
+    int blk_y = divup(dst.info.dims[1], local_size[1]);
+
+    sycl::range<2> global(blk_x * dst.info.dims[2] * DIM0,
+                          blk_y * dst.info.dims[3] * DIM1);
+
+    sycl::nd_range<2> ndrange(global, local);
+
+    dims_t trgt_dims;
+    if (same_dims) {
+        trgt_dims = {{dst.info.dims[0], dst.info.dims[1], dst.info.dims[2],
+                      dst.info.dims[3]}};
+    } else {
+        dim_t trgt_l = std::min(dst.info.dims[3], src.info.dims[3]);
+        dim_t trgt_k = std::min(dst.info.dims[2], src.info.dims[2]);
+        dim_t trgt_j = std::min(dst.info.dims[1], src.info.dims[1]);
+        dim_t trgt_i = std::min(dst.info.dims[0], src.info.dims[0]);
+        trgt_dims    = {{trgt_i, trgt_j, trgt_k, trgt_l}};
+    }
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<outType> dst_acc{*dst.data, h};
+        read_accessor<inType> src_acc{
+            *const_cast<sycl::buffer<inType> *>(src.data), h};
+
+        if (same_dims) {
+            h.parallel_for(ndrange,
+                           reshapeCopy<inType, outType, true>(
+                               dst_acc, dst.info, src_acc, src.info,
+                               default_value, factor, trgt_dims, blk_x, blk_y));
+        } else {
+            h.parallel_for(ndrange,
+                           reshapeCopy<inType, outType, false>(
+                               dst_acc, dst.info, src_acc, src.info,
+                               default_value, factor, trgt_dims, blk_x, blk_y));
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/pad_array_borders.hpp b/src/backend/oneapi/kernel/pad_array_borders.hpp
new file mode 100644
index 0000000000..c5401a65c2
--- /dev/null
+++ b/src/backend/oneapi/kernel/pad_array_borders.hpp
@@ -0,0 +1,213 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+#include <af/defines.h>
+
+#include <sycl/sycl.hpp>
+
+#include <array>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T, int BType>
+class padBordersKernel {
+   public:
+    padBordersKernel(write_accessor<T> out, KParam oInfo, read_accessor<T> in,
+                     KParam iInfo, const dim_t l0, const dim_t l1,
+                     const dim_t l2, const dim_t l3, const int groups_x,
+                     const int groups_y)
+        : out_(out)
+        , oInfo_(oInfo)
+        , in_(in)
+        , iInfo_(iInfo)
+        , l0_(l0)
+        , l1_(l1)
+        , l2_(l2)
+        , l3_(l3)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int lx  = it.get_local_id(0);
+        const int ly  = it.get_local_id(1);
+        const int k   = g.get_group_id(0) / groups_x_;
+        const int l   = g.get_group_id(1) / groups_y_;
+
+        const int blockIdx_x = g.get_group_id(0) - (groups_x_)*k;
+        const int blockIdx_y = g.get_group_id(1) - (groups_y_)*l;
+        const int i          = blockIdx_x * g.get_local_range(0) + lx;
+        const int j          = blockIdx_y * g.get_local_range(1) + ly;
+
+        const size_t d0 = iInfo_.dims[0];
+        const size_t d1 = iInfo_.dims[1];
+        const size_t d2 = iInfo_.dims[2];
+        const size_t d3 = iInfo_.dims[3];
+        const size_t s0 = iInfo_.strides[0];
+        const size_t s1 = iInfo_.strides[1];
+        const size_t s2 = iInfo_.strides[2];
+        const size_t s3 = iInfo_.strides[3];
+
+        const T* src = in_.get_pointer() + iInfo_.offset;
+        T* dst       = out_.get_pointer();
+
+        bool isNotPadding =
+            (l >= l3_ && l < (d3 + l3_)) && (k >= l2_ && k < (d2 + l2_)) &&
+            (j >= l1_ && j < (d1 + l1_)) && (i >= l0_ && i < (d0 + l0_));
+
+        T value = scalar<T>(0);
+        if (isNotPadding) {
+            unsigned iLOff = (l - l3_) * s3;
+            unsigned iKOff = (k - l2_) * s2;
+            unsigned iJOff = (j - l1_) * s1;
+            unsigned iIOff = (i - l0_) * s0;
+
+            value = src[iLOff + iKOff + iJOff + iIOff];
+        } else if (BType != AF_PAD_ZERO) {
+            unsigned iLOff =
+                padBordersKernel<T, BType>::idxByndEdge(l, l3_, d3) * s3;
+            unsigned iKOff =
+                padBordersKernel<T, BType>::idxByndEdge(k, l2_, d2) * s2;
+            unsigned iJOff =
+                padBordersKernel<T, BType>::idxByndEdge(j, l1_, d1) * s1;
+            unsigned iIOff =
+                padBordersKernel<T, BType>::idxByndEdge(i, l0_, d0) * s0;
+
+            value = src[iLOff + iKOff + iJOff + iIOff];
+        }
+
+        size_t xlim = oInfo_.dims[0];
+        size_t ylim = oInfo_.dims[1];
+        size_t zlim = oInfo_.dims[2];
+        size_t wlim = oInfo_.dims[3];
+
+        size_t woStrides = oInfo_.strides[3];
+        size_t zoStrides = oInfo_.strides[2];
+        size_t yoStrides = oInfo_.strides[1];
+        size_t xoStrides = oInfo_.strides[0];
+
+        if (i < xlim && j < ylim && k < zlim && l < wlim) {
+            unsigned off =
+                (l * woStrides + k * zoStrides + j * yoStrides + i * xoStrides);
+            dst[off] = value;
+        }
+    }
+
+    static int trimIndex(int idx, const int len) {
+        int ret_val = idx;
+        if (ret_val < 0) {
+            int offset = (abs(ret_val) - 1) % len;
+            ret_val    = offset;
+        } else if (ret_val >= len) {
+            int offset = abs(ret_val) % len;
+            ret_val    = len - offset - 1;
+        }
+        return ret_val;
+    }
+
+    static int idxByndEdge(const int i, const int lb, const int len) {
+        int retVal;
+        switch (BType) {
+            case AF_PAD_SYM:
+                retVal = padBordersKernel<T, BType>::trimIndex(i - lb, len);
+                break;
+            case AF_PAD_CLAMP_TO_EDGE:
+                retVal = sycl::clamp(i - lb, 0, len - 1);
+                break;
+            case AF_PAD_PERIODIC: {
+                int rem   = (i - lb) % len;
+                bool cond = rem < 0;
+                retVal    = cond * (rem + len) + (1 - cond) * rem;
+            } break;
+            default: retVal = 0; break;  // AF_PAD_ZERO
+        }
+        return retVal;
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oInfo_;
+    read_accessor<T> in_;
+    KParam iInfo_;
+    const dim_t l0_;
+    const dim_t l1_;
+    const dim_t l2_;
+    const dim_t l3_;
+    const int groups_x_;
+    const int groups_y_;
+};
+
+static const int PADB_THREADS_X = 32;
+static const int PADB_THREADS_Y = 8;
+
+template<typename T>
+void padBorders(Param<T> out, Param<T> in, dim4 const lBoundPadding,
+                const af::borderType btype) {
+    sycl::range<2> local(PADB_THREADS_X, PADB_THREADS_Y);
+
+    int groups_x = divup(out.info.dims[0], PADB_THREADS_X);
+    int groups_y = divup(out.info.dims[1], PADB_THREADS_Y);
+
+    sycl::range<2> global(groups_x * out.info.dims[2] * local[0],
+                          groups_y * out.info.dims[3] * local[1]);
+
+    getQueue().submit([&](sycl::handler& h) {
+        read_accessor<T> iData{*in.data, h};
+        write_accessor<T> oData{*out.data, h};
+
+        switch (btype) {
+            case AF_PAD_ZERO:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    padBordersKernel<T, AF_PAD_ZERO>(
+                        oData, out.info, iData, in.info, lBoundPadding[0],
+                        lBoundPadding[1], lBoundPadding[2], lBoundPadding[3],
+                        groups_x, groups_y));
+                break;
+            case AF_PAD_SYM:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    padBordersKernel<T, AF_PAD_SYM>(
+                        oData, out.info, iData, in.info, lBoundPadding[0],
+                        lBoundPadding[1], lBoundPadding[2], lBoundPadding[3],
+                        groups_x, groups_y));
+                break;
+            case AF_PAD_CLAMP_TO_EDGE:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    padBordersKernel<T, AF_PAD_CLAMP_TO_EDGE>(
+                        oData, out.info, iData, in.info, lBoundPadding[0],
+                        lBoundPadding[1], lBoundPadding[2], lBoundPadding[3],
+                        groups_x, groups_y));
+                break;
+            case AF_PAD_PERIODIC:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    padBordersKernel<T, AF_PAD_PERIODIC>(
+                        oData, out.info, iData, in.info, lBoundPadding[0],
+                        lBoundPadding[1], lBoundPadding[2], lBoundPadding[3],
+                        groups_x, groups_y));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/random_engine.hpp b/src/backend/oneapi/kernel/random_engine.hpp
new file mode 100644
index 0000000000..7e97a6fc59
--- /dev/null
+++ b/src/backend/oneapi/kernel/random_engine.hpp
@@ -0,0 +1,197 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/random_engine_mersenne.hpp>
+#include <kernel/random_engine_philox.hpp>
+#include <kernel/random_engine_threefry.hpp>
+#include <kernel/random_engine_write.hpp>
+#include <random_engine.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/defines.h>
+
+#include <functional>
+#include <string>
+#include <vector>
+
+static const int N          = 351;
+static const int TABLE_SIZE = 16;
+static const int MAX_BLOCKS = 32;
+static const int STATE_SIZE = (256 * 3);
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+static const uint THREADS           = 256;
+static const uint THREADS_PER_GROUP = 256;
+static const uint THREADS_X         = 32;
+static const uint THREADS_Y         = THREADS_PER_GROUP / THREADS_X;
+static const uint REPEAT            = 32;
+
+template<typename T>
+void uniformDistributionCBRNG(Param<T> out, const size_t elements,
+                              const af_random_engine_type type,
+                              const uintl &seed, uintl &counter) {
+    int threads          = THREADS;
+    int elementsPerBlock = threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks           = divup(elements, elementsPerBlock);
+    uint hi              = seed >> 32;
+    uint lo              = seed;
+    uint hic             = counter >> 32;
+    uint loc             = counter;
+    sycl::nd_range<1> ndrange(sycl::range<1>(blocks * threads),
+                              sycl::range<1>(threads));
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            getQueue().submit([&](sycl::handler &h) {
+                write_accessor<T> out_acc{*out.data, h};
+
+                h.parallel_for(ndrange,
+                               uniformPhilox<T>(out_acc, hi, lo, hic, loc,
+                                                elementsPerBlock, elements));
+            });
+            ONEAPI_DEBUG_FINISH(getQueue());
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            getQueue().submit([&](sycl::handler &h) {
+                write_accessor<T> out_acc{*out.data, h};
+
+                h.parallel_for(ndrange,
+                               uniformThreefry<T>(out_acc, hi, lo, hic, loc,
+                                                  elementsPerBlock, elements));
+            });
+            ONEAPI_DEBUG_FINISH(getQueue());
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+    counter += elements;
+}
+
+template<typename T>
+void normalDistributionCBRNG(Param<T> out, const size_t elements,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    int threads          = THREADS;
+    int elementsPerBlock = threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks           = divup(elements, elementsPerBlock);
+    uint hi              = seed >> 32;
+    uint lo              = seed;
+    uint hic             = counter >> 32;
+    uint loc             = counter;
+    sycl::nd_range<1> ndrange(sycl::range<1>(blocks * threads),
+                              sycl::range<1>(threads));
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            getQueue().submit([&](sycl::handler &h) {
+                write_accessor<T> out_acc{*out.data, h};
+
+                h.parallel_for(ndrange,
+                               normalPhilox<T>(out_acc, hi, lo, hic, loc,
+                                               elementsPerBlock, elements));
+            });
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            getQueue().submit([&](sycl::handler &h) {
+                write_accessor<T> out_acc{*out.data, h};
+
+                h.parallel_for(ndrange,
+                               normalThreefry<T>(out_acc, hi, lo, hic, loc,
+                                                 elementsPerBlock, elements));
+            });
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+    counter += elements;
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void uniformDistributionMT(Param<T> out, const size_t elements,
+                           Param<uint> state, Param<uint> pos, Param<uint> sh1,
+                           Param<uint> sh2, const uint mask,
+                           Param<uint> recursion_table,
+                           Param<uint> temper_table) {
+    int threads                = THREADS;
+    int min_elements_per_block = 32 * threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks                 = divup(elements, min_elements_per_block);
+    blocks                     = (blocks > BLOCKS) ? BLOCKS : blocks;
+    uint elementsPerBlock      = divup(elements, blocks);
+
+    sycl::nd_range<1> ndrange(sycl::range<1>(blocks * threads),
+                              sycl::range<1>(threads));
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> out_acc{*out.data, h};
+        auto state_acc     = state.data->get_access(h);
+        auto pos_acc       = pos.data->get_access(h);
+        auto sh1_acc       = sh1.data->get_access(h);
+        auto sh2_acc       = sh2.data->get_access(h);
+        auto recursion_acc = recursion_table.data->get_access(h);
+        auto temper_acc    = temper_table.data->get_access(h);
+
+        auto lstate_acc     = sycl::local_accessor<uint, 1>(STATE_SIZE, h);
+        auto lrecursion_acc = sycl::local_accessor<uint, 1>(TABLE_SIZE, h);
+        auto ltemper_acc    = sycl::local_accessor<uint, 1>(TABLE_SIZE, h);
+
+        h.parallel_for(
+            ndrange, uniformMersenne<T>(
+                         out_acc, state_acc, pos_acc, sh1_acc, sh2_acc, mask,
+                         recursion_acc, temper_acc, lstate_acc, lrecursion_acc,
+                         ltemper_acc, elementsPerBlock, elements));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void normalDistributionMT(Param<T> out, const size_t elements,
+                          Param<uint> state, Param<uint> pos, Param<uint> sh1,
+                          Param<uint> sh2, const uint mask,
+                          Param<uint> recursion_table,
+                          Param<uint> temper_table) {
+    int threads                = THREADS;
+    int min_elements_per_block = 32 * threads * 4 * sizeof(uint) / sizeof(T);
+    int blocks                 = divup(elements, min_elements_per_block);
+    blocks                     = (blocks > BLOCKS) ? BLOCKS : blocks;
+    uint elementsPerBlock      = divup(elements, blocks);
+
+    sycl::nd_range<1> ndrange(sycl::range<1>(blocks * threads),
+                              sycl::range<1>(threads));
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<T> out_acc{*out.data, h};
+        auto state_acc     = state.data->get_access(h);
+        auto pos_acc       = pos.data->get_access(h);
+        auto sh1_acc       = sh1.data->get_access(h);
+        auto sh2_acc       = sh2.data->get_access(h);
+        auto recursion_acc = recursion_table.data->get_access(h);
+        auto temper_acc    = temper_table.data->get_access(h);
+
+        auto lstate_acc     = sycl::local_accessor<uint, 1>(STATE_SIZE, h);
+        auto lrecursion_acc = sycl::local_accessor<uint, 1>(TABLE_SIZE, h);
+        auto ltemper_acc    = sycl::local_accessor<uint, 1>(TABLE_SIZE, h);
+
+        h.parallel_for(
+            ndrange, normalMersenne<T>(out_acc, state_acc, pos_acc, sh1_acc,
+                                       sh2_acc, mask, recursion_acc, temper_acc,
+                                       lstate_acc, lrecursion_acc, ltemper_acc,
+                                       elementsPerBlock, elements));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/random_engine_mersenne.hpp b/src/backend/oneapi/kernel/random_engine_mersenne.hpp
new file mode 100644
index 0000000000..acb56f3c9f
--- /dev/null
+++ b/src/backend/oneapi/kernel/random_engine_mersenne.hpp
@@ -0,0 +1,358 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+#pragma once
+#include <kernel/accessors.hpp>
+#include <kernel/random_engine_write.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr int N          = 351;
+constexpr int BLOCKS     = 32;
+constexpr int STATE_SIZE = (256 * 3);
+constexpr int TABLE_SIZE = 16;
+
+// Utils
+static inline void read_table(uint *const sharedTable, const uint *const table,
+                              size_t groupId, size_t localId) {
+    const uint *const t = table + (groupId * TABLE_SIZE);
+    if (localId < TABLE_SIZE) { sharedTable[localId] = t[localId]; }
+}
+
+static inline void state_read(uint *const state, const uint *const gState,
+                              size_t groupRange, size_t groupId,
+                              size_t localId) {
+    const uint *const g             = gState + (groupId * N);
+    state[STATE_SIZE - N + localId] = g[localId];
+    if (localId < N - groupRange) {
+        state[STATE_SIZE - N + groupRange + localId] = g[groupRange + localId];
+    }
+}
+
+static inline void state_write(uint *const gState, const uint *const state,
+                               size_t groupRange, size_t groupId,
+                               size_t localId) {
+    uint *const g = gState + (groupId * N);
+    g[localId]    = state[STATE_SIZE - N + localId];
+    if (localId < N - groupRange) {
+        g[groupRange + localId] = state[STATE_SIZE - N + groupRange + localId];
+    }
+}
+
+static inline uint recursion(const uint *const recursion_table, const uint mask,
+                             const uint sh1, const uint sh2, const uint x1,
+                             const uint x2, uint y) {
+    uint x = (x1 & mask) ^ x2;
+    x ^= x << sh1;
+    y        = x ^ (y >> sh2);
+    uint mat = recursion_table[y & 0x0f];
+    return y ^ mat;
+}
+
+static inline uint temper(const uint *const temper_table, const uint v,
+                          uint t) {
+    t ^= t >> 16;
+    t ^= t >> 8;
+    uint mat = temper_table[t & 0x0f];
+    return v ^ mat;
+}
+
+// Initialization
+class initMersenneKernel {
+   public:
+    initMersenneKernel(write_accessor<uint> state, read_accessor<uint> tbl,
+                       sycl::local_accessor<uint, 1> lstate, uintl seed)
+        : state_(state), tbl_(tbl), lstate_(lstate), seed_(seed) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+
+        const uint *ltbl =
+            tbl_.get_pointer() + (TABLE_SIZE * g.get_group_id(0));
+        uint hidden_seed = ltbl[4] ^ (ltbl[8] << 16);
+        uint tmp         = hidden_seed;
+        tmp += tmp >> 16;
+        tmp += tmp >> 8;
+        tmp &= 0xff;
+        tmp |= tmp << 8;
+        tmp |= tmp << 16;
+        lstate_[it.get_local_id(0)] = tmp;
+        it.barrier();
+        if (it.get_local_id(0) == 0) {
+            lstate_[0] = seed_;
+            lstate_[1] = hidden_seed;
+            for (int i = 1; i < N; ++i) {
+                lstate_[i] ^= ((uint)(1812433253) *
+                                   (lstate_[i - 1] ^ (lstate_[i - 1] >> 30)) +
+                               i);
+            }
+        }
+        it.barrier();
+        state_[N * g.get_group_id(0) + it.get_local_id(0)] =
+            lstate_[it.get_local_id(0)];
+    }
+
+   protected:
+    write_accessor<uint> state_;
+    read_accessor<uint> tbl_;
+    sycl::local_accessor<uint, 1> lstate_;
+    uintl seed_;
+};
+
+void initMersenneState(Param<uint> state, const Param<uint> tbl, uintl seed) {
+    sycl::nd_range<1> ndrange({BLOCKS * N}, {N});
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<uint> state_acc{*state.data, h};
+        read_accessor<uint> tbl_acc{*tbl.data, h};
+        auto lstate_acc = sycl::local_accessor<uint, 1>(N, h);
+
+        h.parallel_for(
+            ndrange, initMersenneKernel(state_acc, tbl_acc, lstate_acc, seed));
+    });
+    // TODO: do we need to sync before using Mersenne generators?
+    // force wait() here?
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class uniformMersenne {
+   public:
+    uniformMersenne(write_accessor<T> out, sycl::accessor<uint> gState,
+                    sycl::accessor<uint> pos_tbl, sycl::accessor<uint> sh1_tbl,
+                    sycl::accessor<uint> sh2_tbl, uint mask,
+                    sycl::accessor<uint> g_recursion_table,
+                    sycl::accessor<uint> g_temper_table,
+                    // local memory caches of global state
+                    sycl::local_accessor<uint, 1> state,
+                    sycl::local_accessor<uint, 1> recursion_table,
+                    sycl::local_accessor<uint, 1> temper_table,
+                    uint elementsPerBlock, size_t elements)
+        : out_(out)
+        , gState_(gState)
+        , pos_tbl_(pos_tbl)
+        , sh1_tbl_(sh1_tbl)
+        , sh2_tbl_(sh2_tbl)
+        , mask_(mask)
+        , g_recursion_table_(g_recursion_table)
+        , g_temper_table_(g_temper_table)
+        , state_(state)
+        , recursion_table_(recursion_table)
+        , temper_table_(temper_table)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+        uint start    = g.get_group_id(0) * elementsPerBlock_;
+        uint end      = start + elementsPerBlock_;
+        end           = (end > elements_) ? elements_ : end;
+        int elementsPerBlockIteration =
+            (g.get_local_range(0) * 4 * sizeof(uint)) / sizeof(T);
+        int iter = divup((end - start), elementsPerBlockIteration);
+
+        uint pos = pos_tbl_[it.get_group(0)];
+        uint sh1 = sh1_tbl_[it.get_group(0)];
+        uint sh2 = sh2_tbl_[it.get_group(0)];
+        state_read(state_.get_pointer(), gState_.get_pointer(),
+                   g.get_local_range(0), g.get_group_id(0), it.get_local_id(0));
+        read_table(recursion_table_.get_pointer(),
+                   g_recursion_table_.get_pointer(), g.get_group_id(0),
+                   it.get_local_id(0));
+        read_table(temper_table_.get_pointer(), g_temper_table_.get_pointer(),
+                   g.get_group_id(0), it.get_local_id(0));
+        it.barrier();
+
+        uint index = start;
+        uint o[4];
+        int offsetX1 = (STATE_SIZE - N + it.get_local_id(0)) % STATE_SIZE;
+        int offsetX2 = (STATE_SIZE - N + it.get_local_id(0) + 1) % STATE_SIZE;
+        int offsetY  = (STATE_SIZE - N + it.get_local_id(0) + pos) % STATE_SIZE;
+        int offsetT =
+            (STATE_SIZE - N + it.get_local_id(0) + pos - 1) % STATE_SIZE;
+        int offsetO = it.get_local_id(0);
+
+        for (int i = 0; i < iter; ++i) {
+            for (int ii = 0; ii < 4; ++ii) {
+                uint r = recursion(recursion_table_.get_pointer(), mask_, sh1,
+                                   sh2, state_[offsetX1], state_[offsetX2],
+                                   state_[offsetY]);
+                state_[offsetO] = r;
+                o[ii] = temper(temper_table_.get_pointer(), r, state_[offsetT]);
+                offsetX1 = (offsetX1 + g.get_local_range(0)) % STATE_SIZE;
+                offsetX2 = (offsetX2 + g.get_local_range(0)) % STATE_SIZE;
+                offsetY  = (offsetY + g.get_local_range(0)) % STATE_SIZE;
+                offsetT  = (offsetT + g.get_local_range(0)) % STATE_SIZE;
+                offsetO  = (offsetO + g.get_local_range(0)) % STATE_SIZE;
+                it.barrier();
+            }
+            if (i == iter - 1) {
+                partialWriteOut128Bytes(
+                    out_.get_pointer(), index + it.get_local_id(0),
+                    g.get_local_range(0), o[0], o[1], o[2], o[3], elements_);
+            } else {
+                writeOut128Bytes(out_.get_pointer(), index + it.get_local_id(0),
+                                 g.get_local_range(0), o[0], o[1], o[2], o[3]);
+            }
+            index += elementsPerBlockIteration;
+        }
+        state_write(gState_.get_pointer(), state_.get_pointer(),
+                    g.get_local_range(0), g.get_group_id(0),
+                    it.get_local_id(0));
+    }
+
+   protected:
+    write_accessor<T> out_;
+    sycl::accessor<uint> gState_;
+    sycl::accessor<uint> pos_tbl_, sh1_tbl_, sh2_tbl_;
+    uint mask_;
+    sycl::accessor<uint> g_recursion_table_, g_temper_table_;
+    sycl::local_accessor<uint, 1> state_, recursion_table_, temper_table_;
+    uint elementsPerBlock_;
+    size_t elements_;
+};
+
+template<typename T>
+class normalMersenne {
+   public:
+    normalMersenne(write_accessor<T> out, sycl::accessor<uint> gState,
+                   sycl::accessor<uint> pos_tbl, sycl::accessor<uint> sh1_tbl,
+                   sycl::accessor<uint> sh2_tbl, uint mask,
+                   sycl::accessor<uint> g_recursion_table,
+                   sycl::accessor<uint> g_temper_table,
+                   // local memory caches of global state
+                   sycl::local_accessor<uint, 1> state,
+                   sycl::local_accessor<uint, 1> recursion_table,
+                   sycl::local_accessor<uint, 1> temper_table,
+                   uint elementsPerBlock, size_t elements)
+        : out_(out)
+        , gState_(gState)
+        , pos_tbl_(pos_tbl)
+        , sh1_tbl_(sh1_tbl)
+        , sh2_tbl_(sh2_tbl)
+        , mask_(mask)
+        , g_recursion_table_(g_recursion_table)
+        , g_temper_table_(g_temper_table)
+        , state_(state)
+        , recursion_table_(recursion_table)
+        , temper_table_(temper_table)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+        uint start    = g.get_group_id(0) * elementsPerBlock_;
+        uint end      = start + elementsPerBlock_;
+        end           = (end > elements_) ? elements_ : end;
+        int elementsPerBlockIteration =
+            (g.get_local_range(0) * 4 * sizeof(uint)) / sizeof(T);
+        int iter = divup((end - start), elementsPerBlockIteration);
+
+        uint pos = pos_tbl_[it.get_group(0)];
+        uint sh1 = sh1_tbl_[it.get_group(0)];
+        uint sh2 = sh2_tbl_[it.get_group(0)];
+        state_read(state_.get_pointer(), gState_.get_pointer(),
+                   g.get_local_range(0), g.get_group_id(0), it.get_local_id(0));
+        read_table(recursion_table_.get_pointer(),
+                   g_recursion_table_.get_pointer(), g.get_group_id(0),
+                   it.get_local_id(0));
+        read_table(temper_table_.get_pointer(), g_temper_table_.get_pointer(),
+                   g.get_group_id(0), it.get_local_id(0));
+        it.barrier();
+
+        uint index = start;
+        uint o[4];
+        int offsetX1 = (STATE_SIZE - N + it.get_local_id(0)) % STATE_SIZE;
+        int offsetX2 = (STATE_SIZE - N + it.get_local_id(0) + 1) % STATE_SIZE;
+        int offsetY  = (STATE_SIZE - N + it.get_local_id(0) + pos) % STATE_SIZE;
+        int offsetT =
+            (STATE_SIZE - N + it.get_local_id(0) + pos - 1) % STATE_SIZE;
+        int offsetO = it.get_local_id(0);
+
+        for (int i = 0; i < iter; ++i) {
+            for (int ii = 0; ii < 4; ++ii) {
+                uint r = recursion(recursion_table_.get_pointer(), mask_, sh1,
+                                   sh2, state_[offsetX1], state_[offsetX2],
+                                   state_[offsetY]);
+                state_[offsetO] = r;
+                o[ii] = temper(temper_table_.get_pointer(), r, state_[offsetT]);
+                offsetX1 = (offsetX1 + g.get_local_range(0)) % STATE_SIZE;
+                offsetX2 = (offsetX2 + g.get_local_range(0)) % STATE_SIZE;
+                offsetY  = (offsetY + g.get_local_range(0)) % STATE_SIZE;
+                offsetT  = (offsetT + g.get_local_range(0)) % STATE_SIZE;
+                offsetO  = (offsetO + g.get_local_range(0)) % STATE_SIZE;
+                it.barrier();
+            }
+            if (i == iter - 1) {
+                partialBoxMullerWriteOut128Bytes(
+                    out_.get_pointer(), index + it.get_local_id(0),
+                    g.get_local_range(0), o[0], o[1], o[2], o[3], elements_);
+            } else {
+                boxMullerWriteOut128Bytes(
+                    out_.get_pointer(), index + it.get_local_id(0),
+                    g.get_local_range(0), o[0], o[1], o[2], o[3]);
+            }
+            index += elementsPerBlockIteration;
+        }
+        state_write(gState_.get_pointer(), state_.get_pointer(),
+                    g.get_local_range(0), g.get_group_id(0),
+                    it.get_local_id(0));
+    }
+
+   protected:
+    write_accessor<T> out_;
+    sycl::accessor<uint> gState_;
+    sycl::accessor<uint> pos_tbl_, sh1_tbl_, sh2_tbl_;
+    uint mask_;
+    sycl::accessor<uint> g_recursion_table_, g_temper_table_;
+    sycl::local_accessor<uint, 1> state_, recursion_table_, temper_table_;
+    uint elementsPerBlock_;
+    size_t elements_;
+};
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
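
Review note on the Mersenne kernels above: each work-item emits four 32-bit words per pass of the inner loop, so one pass of a work-group fills `local_range * 4 * sizeof(uint) / sizeof(T)` output elements, and the outer loop runs `divup(end - start, elementsPerBlockIteration)` times. The host-side sketch below reproduces only that arithmetic. It assumes `divup` is the usual ceiling division; `BlockPlan` and `planBlock` are hypothetical names used for illustration and are not part of this patch.

```cpp
#include <cstddef>
#include <cstdint>

// Ceiling division; assumed to match the divup helper used by the kernel.
constexpr std::uint32_t divup(std::uint32_t a, std::uint32_t b) {
    return (a + b - 1) / b;
}

// Hypothetical summary of the per-group loop bounds computed in operator().
struct BlockPlan {
    std::uint32_t start;                 // first element owned by the group
    std::uint32_t end;                   // one past the last element, clamped
    std::uint32_t elementsPerIteration;  // elements written per outer pass
    std::uint32_t iterations;            // number of outer passes
};

template<typename T>
BlockPlan planBlock(std::uint32_t groupId, std::uint32_t localRange,
                    std::uint32_t elementsPerBlock, std::size_t elements) {
    BlockPlan p{};
    p.start = groupId * elementsPerBlock;
    const std::uint32_t end = p.start + elementsPerBlock;
    // Clamp the block to the size of the output, as the kernel does.
    p.end = (end > elements) ? static_cast<std::uint32_t>(elements) : end;
    // Four uints (16 bytes) per work-item per pass, reinterpreted as T.
    p.elementsPerIteration = static_cast<std::uint32_t>(
        (localRange * 4 * sizeof(std::uint32_t)) / sizeof(T));
    p.iterations = divup(p.end - p.start, p.elementsPerIteration);
    return p;
}
```

For example, with `T = float`, a local range of 256 and `elementsPerBlock = 4096`, one pass covers 1024 elements and a full block takes 4 passes; only the last pass goes through the bounds-checked `partial*` writer.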
diff --git a/src/backend/oneapi/kernel/random_engine_philox.hpp b/src/backend/oneapi/kernel/random_engine_philox.hpp
new file mode 100644
index 0000000000..afa29394e2
--- /dev/null
+++ b/src/backend/oneapi/kernel/random_engine_philox.hpp
@@ -0,0 +1,191 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+#include <kernel/accessors.hpp>
+#include <kernel/random_engine_write.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/philox.hpp
+
+constexpr uint m4x32_0 = 0xD2511F53;
+constexpr uint m4x32_1 = 0xCD9E8D57;
+constexpr uint w32_0   = 0x9E3779B9;
+constexpr uint w32_1   = 0xBB67AE85;
+
+static inline void mulhilo(uint a, uint b, uint& hi, uint& lo) {
+    hi = sycl::mul_hi(a, b);
+    lo = a * b;
+}
+
+static inline void philoxBump(uint k[2]) {
+    k[0] += w32_0;
+    k[1] += w32_1;
+}
+
+static inline void philoxRound(const uint m0, const uint m1, const uint k[2],
+                               uint c[4]) {
+    uint hi0, lo0, hi1, lo1;
+    mulhilo(m0, c[0], hi0, lo0);
+    mulhilo(m1, c[2], hi1, lo1);
+    c[0] = hi1 ^ c[1] ^ k[0];
+    c[1] = lo1;
+    c[2] = hi0 ^ c[3] ^ k[1];
+    c[3] = lo0;
+}
+
+static inline void philox(uint key[2], uint ctr[4]) {
+    // 10 Rounds
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+    philoxBump(key);
+    philoxRound(m4x32_0, m4x32_1, key, ctr);
+}
+
+template<typename T>
+class uniformPhilox {
+   public:
+    uniformPhilox(write_accessor<T> out, uint hi, uint lo, uint hic, uint loc,
+                  uint elementsPerBlock, uint elements)
+        : out_(out)
+        , hi_(hi)
+        , lo_(lo)
+        , hic_(hic)
+        , loc_(loc)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+
+        uint index = g.get_group_id(0) * elementsPerBlock_ + it.get_local_id(0);
+        uint key[2] = {lo_, hi_};
+        uint ctr[4] = {loc_, hic_, 0, 0};
+        ctr[0] += index;
+        ctr[1] += (ctr[0] < loc_);
+        ctr[2] += (ctr[1] < hic_);
+        T* optr = out_.get_pointer();
+        if (g.get_group_id(0) != (g.get_group_range(0) - 1)) {
+            philox(key, ctr);
+            writeOut128Bytes(optr, index, g.get_local_range(0), ctr[0], ctr[1],
+                             ctr[2], ctr[3]);
+        } else {
+            philox(key, ctr);
+            partialWriteOut128Bytes(optr, index, g.get_local_range(0), ctr[0],
+                                    ctr[1], ctr[2], ctr[3], elements_);
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    uint hi_, lo_, hic_, loc_;
+    uint elementsPerBlock_, elements_;
+};
+
+template<typename T>
+class normalPhilox {
+   public:
+    normalPhilox(write_accessor<T> out, uint hi, uint lo, uint hic, uint loc,
+                 uint elementsPerBlock, uint elements)
+        : out_(out)
+        , hi_(hi)
+        , lo_(lo)
+        , hic_(hic)
+        , loc_(loc)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+
+        uint index = g.get_group_id(0) * elementsPerBlock_ + it.get_local_id(0);
+        uint key[2] = {lo_, hi_};
+        uint ctr[4] = {loc_, hic_, 0, 0};
+        ctr[0] += index;
+        ctr[1] += (ctr[0] < loc_);
+        ctr[2] += (ctr[1] < hic_);
+
+        philox(key, ctr);
+
+        T* optr = out_.get_pointer();
+        if (g.get_group_id(0) != (g.get_group_range(0) - 1)) {
+            boxMullerWriteOut128Bytes(optr, index, g.get_local_range(0), ctr[0],
+                                      ctr[1], ctr[2], ctr[3]);
+        } else {
+            partialBoxMullerWriteOut128Bytes(optr, index, g.get_local_range(0),
+                                             ctr[0], ctr[1], ctr[2], ctr[3],
+                                             elements_);
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    uint hi_, lo_, hic_, loc_;
+    uint elementsPerBlock_, elements_;
+};
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
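
Review note on `random_engine_philox.hpp`: the round function and the counter setup can be exercised on the host without a SYCL device. The sketch below is a plain C++ port, not the code under review. It assumes a 64-bit multiply is an acceptable stand-in for `sycl::mul_hi`, and it swaps the raw arrays for `std::array`; `philox4x32_10` and `makeCounter` are hypothetical names introduced here.

```cpp
#include <array>
#include <cstdint>

// Multipliers and key increments, copied from the header above.
constexpr std::uint32_t kM0 = 0xD2511F53u;
constexpr std::uint32_t kM1 = 0xCD9E8D57u;
constexpr std::uint32_t kW0 = 0x9E3779B9u;
constexpr std::uint32_t kW1 = 0xBB67AE85u;

// Host stand-in for sycl::mul_hi plus the low word: widen to 64 bits.
inline void mulhilo(std::uint32_t a, std::uint32_t b, std::uint32_t& hi,
                    std::uint32_t& lo) {
    const std::uint64_t p = static_cast<std::uint64_t>(a) * b;
    hi = static_cast<std::uint32_t>(p >> 32);
    lo = static_cast<std::uint32_t>(p);
}

// One Philox-4x32 round, same data flow as philoxRound in the header.
inline void philoxRound(const std::array<std::uint32_t, 2>& k,
                        std::array<std::uint32_t, 4>& c) {
    std::uint32_t hi0, lo0, hi1, lo1;
    mulhilo(kM0, c[0], hi0, lo0);
    mulhilo(kM1, c[2], hi1, lo1);
    const std::uint32_t c0 = hi1 ^ c[1] ^ k[0];
    const std::uint32_t c2 = hi0 ^ c[3] ^ k[1];
    c = {c0, lo1, c2, lo0};
}

// Ten rounds with a key bump between consecutive rounds (nine bumps).
inline std::array<std::uint32_t, 4> philox4x32_10(
    std::array<std::uint32_t, 2> key, std::array<std::uint32_t, 4> ctr) {
    for (int round = 0; round < 10; ++round) {
        if (round > 0) {
            key[0] += kW0;
            key[1] += kW1;
        }
        philoxRound(key, ctr);
    }
    return ctr;
}

// Counter setup as in operator(): add the element index to the base
// counter (loc, hic) and propagate the carries into the upper words.
inline std::array<std::uint32_t, 4> makeCounter(std::uint32_t loc,
                                                std::uint32_t hic,
                                                std::uint32_t index) {
    std::array<std::uint32_t, 4> ctr = {loc, hic, 0u, 0u};
    ctr[0] += index;
    ctr[1] += (ctr[0] < loc);
    ctr[2] += (ctr[1] < hic);
    return ctr;
}
```

Because every round is invertible for a fixed key, two work-items with different counters never produce the same four output words, which is the property the kernels rely on when they derive the counter from the global element index.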
diff --git a/src/backend/oneapi/kernel/random_engine_threefry.hpp b/src/backend/oneapi/kernel/random_engine_threefry.hpp
new file mode 100644
index 0000000000..1969bf3b69
--- /dev/null
+++ b/src/backend/oneapi/kernel/random_engine_threefry.hpp
@@ -0,0 +1,254 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+#pragma once
+#include <kernel/accessors.hpp>
+#include <kernel/random_engine_write.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/threefry.hpp
+
+static const uint SKEIN_KS_PARITY32 = 0x1BD11BDA;
+
+static const uint R0 = 13;
+static const uint R1 = 15;
+static const uint R2 = 26;
+static const uint R3 = 6;
+static const uint R4 = 17;
+static const uint R5 = 29;
+static const uint R6 = 16;
+static const uint R7 = 24;
+
+static inline void setSkeinParity(uint* ptr) { *ptr = SKEIN_KS_PARITY32; }
+
+static inline uint rotL(uint x, uint N) {
+    return (x << (N & 31)) | (x >> ((32 - N) & 31));
+}
+
+static inline void threefry(uint k[2], uint c[2], uint X[2]) {
+    uint ks[3];
+
+    setSkeinParity(&ks[2]);
+    ks[0] = k[0];
+    X[0]  = c[0];
+    ks[2] ^= k[0];
+    ks[1] = k[1];
+    X[1]  = c[1];
+    ks[2] ^= k[1];
+
+    X[0] += ks[0];
+    X[1] += ks[1];
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=1) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 1; /* X[2-1] += r  */
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=2) */
+    X[0] += ks[2];
+    X[1] += ks[0];
+    X[1] += 2;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=3) */
+    X[0] += ks[0];
+    X[1] += ks[1];
+    X[1] += 3;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=4) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 4;
+}
+
+template<typename T>
+class uniformThreefry {
+   public:
+    uniformThreefry(write_accessor<T> out, uint hi, uint lo, uint hic, uint loc,
+                    uint elementsPerBlock, uint elements)
+        : out_(out)
+        , hi_(hi)
+        , lo_(lo)
+        , hic_(hic)
+        , loc_(loc)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+        uint index = g.get_group_id(0) * elementsPerBlock_ + it.get_local_id(0);
+
+        uint key[2] = {lo_, hi_};
+        uint ctr[4] = {loc_, hic_, 0, 0};
+        ctr[0] += index;
+        ctr[1] += (ctr[0] < loc_);
+        uint o[4];
+
+        threefry(key, ctr, o);
+        uint step = elementsPerBlock_ / 2;
+        ctr[0] += step;
+        ctr[1] += (ctr[0] < step);
+        threefry(key, ctr, o + 2);
+
+        T* optr = out_.get_pointer();
+        if (g.get_group_id(0) != (g.get_group_range(0) - 1)) {
+            writeOut128Bytes(optr, index, g.get_local_range(0), o[0], o[1],
+                             o[2], o[3]);
+        } else {
+            partialWriteOut128Bytes(optr, index, g.get_local_range(0), o[0],
+                                    o[1], o[2], o[3], elements_);
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    uint hi_, lo_, hic_, loc_;
+    uint elementsPerBlock_, elements_;
+};
+
+template<typename T>
+class normalThreefry {
+   public:
+    normalThreefry(write_accessor<T> out, uint hi, uint lo, uint hic, uint loc,
+                   uint elementsPerBlock, uint elements)
+        : out_(out)
+        , hi_(hi)
+        , lo_(lo)
+        , hic_(hic)
+        , loc_(loc)
+        , elementsPerBlock_(elementsPerBlock)
+        , elements_(elements) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+        uint index = g.get_group_id(0) * elementsPerBlock_ + it.get_local_id(0);
+
+        uint key[2] = {lo_, hi_};
+        uint ctr[4] = {loc_, hic_, 0, 0};
+        ctr[0] += index;
+        ctr[1] += (ctr[0] < loc_);
+        uint o[4];
+
+        threefry(key, ctr, o);
+        uint step = elementsPerBlock_ / 2;
+        ctr[0] += step;
+        ctr[1] += (ctr[0] < step);
+        threefry(key, ctr, o + 2);
+
+        T* optr = out_.get_pointer();
+        if (g.get_group_id(0) != (g.get_group_range(0) - 1)) {
+            boxMullerWriteOut128Bytes(optr, index, g.get_local_range(0), o[0],
+                                      o[1], o[2], o[3]);
+        } else {
+            partialBoxMullerWriteOut128Bytes(optr, index, g.get_local_range(0),
+                                             o[0], o[1], o[2], o[3], elements_);
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    uint hi_, lo_, hic_, loc_;
+    uint elementsPerBlock_, elements_;
+};
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
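
Review note on `random_engine_threefry.hpp`: the unrolled body follows a regular pattern, which the host-side sketch below writes as a loop. Round `d` rotates by `R[d % 8]`, and after every fourth round the key schedule word `ks[r % 3]` is added to `X[0]` and `ks[(r + 1) % 3] + r` to `X[1]`, with `r = d / 4 + 1`. This is a plain C++ port for checking, not the code under review. The `rounds` parameter is an addition of this sketch: it defaults to the 16 rounds unrolled in the header, and setting it to 20 (Random123's default for Threefry-2x32, to my knowledge) allows comparison against published test vectors. `threefry2x32` is a hypothetical name.

```cpp
#include <array>
#include <cstdint>

constexpr std::uint32_t kSkeinParity32 = 0x1BD11BDAu;
// Rotation constants R0..R7, copied from the header above.
constexpr std::array<std::uint32_t, 8> kRot = {13, 15, 26, 6, 17, 29, 16, 24};

inline std::uint32_t rotL(std::uint32_t x, std::uint32_t n) {
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

// Loop form of the unrolled threefry() in the header.
// rounds = 16 mirrors the header; other values are for experimentation.
inline std::array<std::uint32_t, 2> threefry2x32(
    const std::array<std::uint32_t, 2>& key,
    const std::array<std::uint32_t, 2>& ctr, unsigned rounds = 16) {
    // Key schedule: the two key words plus their parity-masked XOR.
    const std::array<std::uint32_t, 3> ks = {
        key[0], key[1], kSkeinParity32 ^ key[0] ^ key[1]};

    // Initial key injection (r = 0).
    std::array<std::uint32_t, 2> x = {ctr[0] + ks[0], ctr[1] + ks[1]};

    for (unsigned d = 0; d < rounds; ++d) {
        // Mix step: add, rotate, xor.
        x[0] += x[1];
        x[1] = rotL(x[1], kRot[d % 8]);
        x[1] ^= x[0];
        // Key injection after every fourth round.
        if (d % 4 == 3) {
            const std::uint32_t r = d / 4 + 1;
            x[0] += ks[r % 3];
            x[1] += ks[(r + 1) % 3] + r;
        }
    }
    return x;
}
```

In the kernels each work-item calls the generator twice, the second time with the counter advanced by `elementsPerBlock_ / 2`, to obtain the four words `o[0..3]` that the write helpers expect.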
diff --git a/src/backend/oneapi/kernel/random_engine_write.hpp b/src/backend/oneapi/kernel/random_engine_write.hpp
new file mode 100644
index 0000000000..a96d7d07fe
--- /dev/null
+++ b/src/backend/oneapi/kernel/random_engine_write.hpp
@@ -0,0 +1,661 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <sycl/sycl.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+// Generates rationals in (0, 1]
+static float getFloat01(uint num) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0f) /
+         (static_cast<float>(std::numeric_limits<unsigned int>::max()) +
+          (1.0f)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return sycl::fma(static_cast<float>(num), factor, half_factor);
+}
+
+// Generates rationals in (0, 2] (num is unsigned, so never negative)
+static float getFloatNegative11(uint num) {
+    // Conversion to floats adapted from Random123
+    constexpr float factor =
+        ((1.0) /
+         (static_cast<double>(std::numeric_limits<int>::max()) + (1.0)));
+    constexpr float half_factor = ((0.5f) * factor);
+
+    return sycl::fma(static_cast<float>(num), factor, half_factor);
+}
+
+// Generates rationals in (0, 1]
+static double getDouble01(uint num1, uint num2) {
+    uint64_t n1 = num1;
+    uint64_t n2 = num2;
+    n1 <<= 32;
+    uint64_t num = n1 | n2;
+    constexpr double factor =
+        ((1.0) /
+         (static_cast<double>(std::numeric_limits<unsigned long long>::max()) +
+          static_cast<double>(1.0)));
+    constexpr double half_factor((0.5) * factor);
+
+    return sycl::fma(static_cast<double>(num), factor, half_factor);
+}
+
+// Conversion to doubles adapted from Random123
+constexpr double signed_factor =
+    ((1.0l) / (static_cast<long double>(std::numeric_limits<long long>::max()) +
+               (1.0l)));
+constexpr double half_factor = ((0.5) * signed_factor);
+
+// Generates rationals in (0, 2] (num is unsigned, so never negative)
+static double getDoubleNegative11(uint num1, uint num2) {
+    uint32_t arr[2] = {num2, num1};
+    uint64_t num;
+
+    memcpy(&num, arr, sizeof(uint64_t));
+    return sycl::fma(static_cast<double>(num), signed_factor, half_factor);
+}
+
+/// This is the largest finite value representable by fp16. We need to
+/// make sure that the value converted from ushort does not exceed this
+/// value to avoid generating infinity
+#define MAX_INT_BEFORE_INFINITY (ushort)65504u
+
+// Generates rationals in (0, 1]
+static sycl::half getHalf01(uint num, uint index) {
+    sycl::half v = static_cast<sycl::half>(min(MAX_INT_BEFORE_INFINITY,
+                       static_cast<ushort>(num >> (16U * (index & 1U)) & 0x0000ffff)));
+
+    const sycl::half half_factor{1.526e-5}; // (1 / (USHRT_MAX + 1))
+    const sycl::half half_half_factor{7.6e-6}; // (0.5 * half_factor)
+    return sycl::fma(v, half_factor, half_half_factor);
+}
+
+static sycl::half oneMinusGetHalf01(uint num, uint index) {
+    return static_cast<sycl::half>(1.) - getHalf01(num, index);
+}
+
+// Generates rationals in (-1, 1]
+static sycl::half getHalfNegative11(uint num, uint index) {
+    sycl::half v = static_cast<sycl::half>(min(MAX_INT_BEFORE_INFINITY,
+                       static_cast<ushort>(num >> (16U * (index & 1U)) & 0x0000ffff)));
+
+    const sycl::half signed_half_factor{3.05e-5}; // (1 / (SHRT_MAX + 1))
+    const sycl::half signed_half_half_factor{1.526e-5}; // (0.5 * signed_half_factor)
+    return sycl::fma(v, signed_half_factor, signed_half_half_factor);
+}
+
+namespace {
+template<typename T>
+void sincospi(T val, T *sptr, T *cptr) {
+    *sptr = sycl::sinpi(val);
+    *cptr = sycl::cospi(val);
+}
+}  // namespace
+
+template<typename T>
+constexpr T neg_two() {
+    return -2.0;
+}
+//
+// template<typename T>
+// constexpr __device__ T two_pi() {
+//    return 2.0 * PI_VAL;
+//};
+//
+template<typename Tc>
+static void boxMullerTransform(cfloat *const cOut, const Tc &r1, const Tc &r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+    Tc r = sycl::sqrt(neg_two<Tc>() * sycl::log(r2));
+    Tc s, c;
+
+    // Multiplying by PI instead of 2*PI seems to yield a better distribution
+    // even though the original Box-Muller algorithm calls for 2 * PI
+    // sincos(two_pi<Tc>() * r1, &s, &c);
+    sincospi(r1, &s, &c);
+    cOut->real(static_cast<float>(r * s));
+    cOut->imag(static_cast<float>(r * c));
+}
+
+template<typename Tc>
+static void boxMullerTransform(cdouble *const cOut, const Tc &r1,
+                               const Tc &r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+    Tc r = sycl::sqrt(neg_two<Tc>() * sycl::log(r2));
+    Tc s, c;
+
+    // Multiplying by PI instead of 2*PI seems to yield a better distribution
+    // even though the original Box-Muller algorithm calls for 2 * PI
+    // sincos(two_pi<Tc>() * r1, &s, &c);
+    sincospi(r1, &s, &c);
+    cOut->real(static_cast<double>(r * s));
+    cOut->imag(static_cast<double>(r * c));
+}
+
+template<typename Td, typename Tc>
+static void boxMullerTransform(Td *const out1, Td *const out2, const Tc &r1,
+                               const Tc &r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+    Tc r = sycl::sqrt(neg_two<Tc>() * sycl::log(r2));
+    Tc s, c;
+
+    // Multiplying by PI instead of 2*PI seems to yield a better distribution
+    // even though the original Box-Muller algorithm calls for 2 * PI
+    // sincos(two_pi<Tc>() * r1, &s, &c);
+    sincospi(r1, &s, &c);
+    *out1 = static_cast<Td>(r * s);
+    *out2 = static_cast<Td>(r * c);
+}
+
+// Writes without boundary checking
+static void writeOut128Bytes(uchar *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]                = r1;
+    out[index + groupSz]      = r1 >> 8;
+    out[index + 2 * groupSz]  = r1 >> 16;
+    out[index + 3 * groupSz]  = r1 >> 24;
+    out[index + 4 * groupSz]  = r2;
+    out[index + 5 * groupSz]  = r2 >> 8;
+    out[index + 6 * groupSz]  = r2 >> 16;
+    out[index + 7 * groupSz]  = r2 >> 24;
+    out[index + 8 * groupSz]  = r3;
+    out[index + 9 * groupSz]  = r3 >> 8;
+    out[index + 10 * groupSz] = r3 >> 16;
+    out[index + 11 * groupSz] = r3 >> 24;
+    out[index + 12 * groupSz] = r4;
+    out[index + 13 * groupSz] = r4 >> 8;
+    out[index + 14 * groupSz] = r4 >> 16;
+    out[index + 15 * groupSz] = r4 >> 24;
+}
+
+static void writeOut128Bytes(schar *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    writeOut128Bytes((uchar *)(out), index, groupSz, r1, r2, r3, r4);
+}
+
+static void writeOut128Bytes(char *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]                = (r1)&0x1;
+    out[index + groupSz]      = (r1 >> 8) & 0x1;
+    out[index + 2 * groupSz]  = (r1 >> 16) & 0x1;
+    out[index + 3 * groupSz]  = (r1 >> 24) & 0x1;
+    out[index + 4 * groupSz]  = (r2)&0x1;
+    out[index + 5 * groupSz]  = (r2 >> 8) & 0x1;
+    out[index + 6 * groupSz]  = (r2 >> 16) & 0x1;
+    out[index + 7 * groupSz]  = (r2 >> 24) & 0x1;
+    out[index + 8 * groupSz]  = (r3)&0x1;
+    out[index + 9 * groupSz]  = (r3 >> 8) & 0x1;
+    out[index + 10 * groupSz] = (r3 >> 16) & 0x1;
+    out[index + 11 * groupSz] = (r3 >> 24) & 0x1;
+    out[index + 12 * groupSz] = (r4)&0x1;
+    out[index + 13 * groupSz] = (r4 >> 8) & 0x1;
+    out[index + 14 * groupSz] = (r4 >> 16) & 0x1;
+    out[index + 15 * groupSz] = (r4 >> 24) & 0x1;
+}
+
+static void writeOut128Bytes(short *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]               = r1;
+    out[index + groupSz]     = r1 >> 16;
+    out[index + 2 * groupSz] = r2;
+    out[index + 3 * groupSz] = r2 >> 16;
+    out[index + 4 * groupSz] = r3;
+    out[index + 5 * groupSz] = r3 >> 16;
+    out[index + 6 * groupSz] = r4;
+    out[index + 7 * groupSz] = r4 >> 16;
+}
+
+static void writeOut128Bytes(ushort *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    writeOut128Bytes((short *)(out), index, groupSz, r1, r2, r3, r4);
+}
+
+static void writeOut128Bytes(int *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]               = r1;
+    out[index + groupSz]     = r2;
+    out[index + 2 * groupSz] = r3;
+    out[index + 3 * groupSz] = r4;
+}
+
+static void writeOut128Bytes(uint *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    writeOut128Bytes((int *)(out), index, groupSz, r1, r2, r3, r4);
+}
+
+static void writeOut128Bytes(intl *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    intl c1              = r2;
+    c1                   = (c1 << 32) | r1;
+    intl c2              = r4;
+    c2                   = (c2 << 32) | r3;
+    out[index]           = c1;
+    out[index + groupSz] = c2;
+}
+
+static void writeOut128Bytes(uintl *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    writeOut128Bytes((intl *)(out), index, groupSz, r1, r2, r3, r4);
+}
+
+static void writeOut128Bytes(float *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]               = 1.f - getFloat01(r1);
+    out[index + groupSz]     = 1.f - getFloat01(r2);
+    out[index + 2 * groupSz] = 1.f - getFloat01(r3);
+    out[index + 3 * groupSz] = 1.f - getFloat01(r4);
+}
+
+static void writeOut128Bytes(cfloat *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]           = {1.f - getFloat01(r1), 1.f - getFloat01(r2)};
+    out[index + groupSz] = {1.f - getFloat01(r3), 1.f - getFloat01(r4)};
+}
+
+static void writeOut128Bytes(double *out, const uint &index, const uint groupSz,
+                             const uint &r1, const uint &r2, const uint &r3,
+                             const uint &r4) {
+    out[index]           = 1.0 - getDouble01(r1, r2);
+    out[index + groupSz] = 1.0 - getDouble01(r3, r4);
+}
+
+static void writeOut128Bytes(cdouble *out, const uint &index,
+                             const uint groupSz, const uint &r1, const uint &r2,
+                             const uint &r3, const uint &r4) {
+    out[index] = {1.0 - getDouble01(r1, r2), 1.0 - getDouble01(r3, r4)};
+}
+
+static void writeOut128Bytes(arrayfire::common::half *out, const uint &index,
+                             const uint groupSz, const uint &r1, const uint &r2,
+                             const uint &r3, const uint &r4) {
+    out[index]               = oneMinusGetHalf01(r1, 0);
+    out[index + groupSz]     = oneMinusGetHalf01(r1, 1);
+    out[index + 2 * groupSz] = oneMinusGetHalf01(r2, 0);
+    out[index + 3 * groupSz] = oneMinusGetHalf01(r2, 1);
+    out[index + 4 * groupSz] = oneMinusGetHalf01(r3, 0);
+    out[index + 5 * groupSz] = oneMinusGetHalf01(r3, 1);
+    out[index + 6 * groupSz] = oneMinusGetHalf01(r4, 0);
+    out[index + 7 * groupSz] = oneMinusGetHalf01(r4, 1);
+}
+
+// Normally distributed (Box-Muller) writes without boundary checking
+
+static void boxMullerWriteOut128Bytes(float *out, const uint &index,
+                                      const uint groupSz, const uint &r1,
+                                      const uint &r2, const uint &r3,
+                                      const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + groupSz],
+                       getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&out[index + 2 * groupSz], &out[index + 3 * groupSz],
+                       getFloatNegative11(r3), getFloat01(r4));
+}
+
+static void boxMullerWriteOut128Bytes(cfloat *out, const uint &index,
+                                      const uint groupSz, const uint &r1,
+                                      const uint &r2, const uint &r3,
+                                      const uint &r4) {
+    boxMullerTransform(&out[index], getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&out[index + groupSz], getFloatNegative11(r3),
+                       getFloat01(r4));
+}
+
+static void boxMullerWriteOut128Bytes(double *out, const uint &index,
+                                      const uint groupSz, const uint &r1,
+                                      const uint &r2, const uint &r3,
+                                      const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + groupSz],
+                       getDoubleNegative11(r1, r2), getDouble01(r3, r4));
+}
+
+static void boxMullerWriteOut128Bytes(cdouble *out, const uint &index,
+                                      const uint groupSz, const uint &r1,
+                                      const uint &r2, const uint &r3,
+                                      const uint &r4) {
+    boxMullerTransform(&out[index], getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+}
+
+static void boxMullerWriteOut128Bytes(arrayfire::common::half *out,
+                                      const uint &index, const uint groupSz,
+                                      const uint &r1, const uint &r2,
+                                      const uint &r3, const uint &r4) {
+    boxMullerTransform(&out[index], &out[index + groupSz],
+                       getHalfNegative11(r1, 0), getHalf01(r1, 1));
+    boxMullerTransform(&out[index + 2 * groupSz], &out[index + 3 * groupSz],
+                       getHalfNegative11(r2, 0), getHalf01(r2, 1));
+    boxMullerTransform(&out[index + 4 * groupSz], &out[index + 5 * groupSz],
+                       getHalfNegative11(r3, 0), getHalf01(r3, 1));
+    boxMullerTransform(&out[index + 6 * groupSz], &out[index + 7 * groupSz],
+                       getHalfNegative11(r4, 0), getHalf01(r4, 1));
+}
+
+// Writes with boundary checking
+
+static void partialWriteOut128Bytes(uchar *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + groupSz < elements) { out[index + groupSz] = r1 >> 8; }
+    if (index + 2 * groupSz < elements) { out[index + 2 * groupSz] = r1 >> 16; }
+    if (index + 3 * groupSz < elements) { out[index + 3 * groupSz] = r1 >> 24; }
+    if (index + 4 * groupSz < elements) { out[index + 4 * groupSz] = r2; }
+    if (index + 5 * groupSz < elements) { out[index + 5 * groupSz] = r2 >> 8; }
+    if (index + 6 * groupSz < elements) { out[index + 6 * groupSz] = r2 >> 16; }
+    if (index + 7 * groupSz < elements) { out[index + 7 * groupSz] = r2 >> 24; }
+    if (index + 8 * groupSz < elements) { out[index + 8 * groupSz] = r3; }
+    if (index + 9 * groupSz < elements) { out[index + 9 * groupSz] = r3 >> 8; }
+    if (index + 10 * groupSz < elements) {
+        out[index + 10 * groupSz] = r3 >> 16;
+    }
+    if (index + 11 * groupSz < elements) {
+        out[index + 11 * groupSz] = r3 >> 24;
+    }
+    if (index + 12 * groupSz < elements) { out[index + 12 * groupSz] = r4; }
+    if (index + 13 * groupSz < elements) {
+        out[index + 13 * groupSz] = r4 >> 8;
+    }
+    if (index + 14 * groupSz < elements) {
+        out[index + 14 * groupSz] = r4 >> 16;
+    }
+    if (index + 15 * groupSz < elements) {
+        out[index + 15 * groupSz] = r4 >> 24;
+    }
+}
+
+static void partialWriteOut128Bytes(schar *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    partialWriteOut128Bytes((uchar *)(out), index, groupSz, r1, r2, r3, r4,
+                            elements);
+}
+
+static void partialWriteOut128Bytes(char *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = (r1)&0x1; }
+    if (index + groupSz < elements) { out[index + groupSz] = (r1 >> 8) & 0x1; }
+    if (index + 2 * groupSz < elements) {
+        out[index + 2 * groupSz] = (r1 >> 16) & 0x1;
+    }
+    if (index + 3 * groupSz < elements) {
+        out[index + 3 * groupSz] = (r1 >> 24) & 0x1;
+    }
+    if (index + 4 * groupSz < elements) { out[index + 4 * groupSz] = (r2)&0x1; }
+    if (index + 5 * groupSz < elements) {
+        out[index + 5 * groupSz] = (r2 >> 8) & 0x1;
+    }
+    if (index + 6 * groupSz < elements) {
+        out[index + 6 * groupSz] = (r2 >> 16) & 0x1;
+    }
+    if (index + 7 * groupSz < elements) {
+        out[index + 7 * groupSz] = (r2 >> 24) & 0x1;
+    }
+    if (index + 8 * groupSz < elements) { out[index + 8 * groupSz] = (r3)&0x1; }
+    if (index + 9 * groupSz < elements) {
+        out[index + 9 * groupSz] = (r3 >> 8) & 0x1;
+    }
+    if (index + 10 * groupSz < elements) {
+        out[index + 10 * groupSz] = (r3 >> 16) & 0x1;
+    }
+    if (index + 11 * groupSz < elements) {
+        out[index + 11 * groupSz] = (r3 >> 24) & 0x1;
+    }
+    if (index + 12 * groupSz < elements) {
+        out[index + 12 * groupSz] = (r4)&0x1;
+    }
+    if (index + 13 * groupSz < elements) {
+        out[index + 13 * groupSz] = (r4 >> 8) & 0x1;
+    }
+    if (index + 14 * groupSz < elements) {
+        out[index + 14 * groupSz] = (r4 >> 16) & 0x1;
+    }
+    if (index + 15 * groupSz < elements) {
+        out[index + 15 * groupSz] = (r4 >> 24) & 0x1;
+    }
+}
+
+static void partialWriteOut128Bytes(short *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + groupSz < elements) { out[index + groupSz] = r1 >> 16; }
+    if (index + 2 * groupSz < elements) { out[index + 2 * groupSz] = r2; }
+    if (index + 3 * groupSz < elements) { out[index + 3 * groupSz] = r2 >> 16; }
+    if (index + 4 * groupSz < elements) { out[index + 4 * groupSz] = r3; }
+    if (index + 5 * groupSz < elements) { out[index + 5 * groupSz] = r3 >> 16; }
+    if (index + 6 * groupSz < elements) { out[index + 6 * groupSz] = r4; }
+    if (index + 7 * groupSz < elements) { out[index + 7 * groupSz] = r4 >> 16; }
+}
+
+static void partialWriteOut128Bytes(ushort *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    partialWriteOut128Bytes((short *)(out), index, groupSz, r1, r2, r3, r4,
+                            elements);
+}
+
+static void partialWriteOut128Bytes(int *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + groupSz < elements) { out[index + groupSz] = r2; }
+    if (index + 2 * groupSz < elements) { out[index + 2 * groupSz] = r3; }
+    if (index + 3 * groupSz < elements) { out[index + 3 * groupSz] = r4; }
+}
+
+static void partialWriteOut128Bytes(uint *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    partialWriteOut128Bytes((int *)(out), index, groupSz, r1, r2, r3, r4,
+                            elements);
+}
+
+static void partialWriteOut128Bytes(intl *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    intl c1 = r2;
+    c1      = (c1 << 32) | r1;
+    intl c2 = r4;
+    c2      = (c2 << 32) | r3;
+    if (index < elements) { out[index] = c1; }
+    if (index + groupSz < elements) { out[index + groupSz] = c2; }
+}
+
+static void partialWriteOut128Bytes(uintl *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    partialWriteOut128Bytes((intl *)(out), index, groupSz, r1, r2, r3, r4,
+                            elements);
+}
+
+static void partialWriteOut128Bytes(float *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = 1.f - getFloat01(r1); }
+    if (index + groupSz < elements) {
+        out[index + groupSz] = 1.f - getFloat01(r2);
+    }
+    if (index + 2 * groupSz < elements) {
+        out[index + 2 * groupSz] = 1.f - getFloat01(r3);
+    }
+    if (index + 3 * groupSz < elements) {
+        out[index + 3 * groupSz] = 1.f - getFloat01(r4);
+    }
+}
+
+static void partialWriteOut128Bytes(cfloat *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) {
+        out[index] = {1.f - getFloat01(r1), 1.f - getFloat01(r2)};
+    }
+    if (index + groupSz < elements) {
+        out[index + groupSz] = {1.f - getFloat01(r3), 1.f - getFloat01(r4)};
+    }
+}
+
+static void partialWriteOut128Bytes(double *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) { out[index] = 1.0 - getDouble01(r1, r2); }
+    if (index + groupSz < elements) {
+        out[index + groupSz] = 1.0 - getDouble01(r3, r4);
+    }
+}
+
+static void partialWriteOut128Bytes(cdouble *out, const uint &index,
+                                    const uint groupSz, const uint &r1,
+                                    const uint &r2, const uint &r3,
+                                    const uint &r4, const uint &elements) {
+    if (index < elements) {
+        out[index] = {1.0 - getDouble01(r1, r2), 1.0 - getDouble01(r3, r4)};
+    }
+}
+
+// Normally distributed (Box-Muller) writes with boundary checking
+static void partialBoxMullerWriteOut128Bytes(float *out, const uint &index,
+                                             const uint groupSz, const uint &r1,
+                                             const uint &r2, const uint &r3,
+                                             const uint &r4,
+                                             const uint &elements) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + groupSz < elements) { out[index + groupSz] = n2; }
+    if (index + 2 * groupSz < elements) { out[index + 2 * groupSz] = n3; }
+    if (index + 3 * groupSz < elements) { out[index + 3 * groupSz] = n4; }
+}
+
+static void partialBoxMullerWriteOut128Bytes(cfloat *out, const uint &index,
+                                             const uint groupSz, const uint &r1,
+                                             const uint &r2, const uint &r3,
+                                             const uint &r4,
+                                             const uint &elements) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    if (index < elements) { out[index] = {n1, n2}; }
+    if (index + groupSz < elements) { out[index + groupSz] = {n3, n4}; }
+}
+
+static void partialBoxMullerWriteOut128Bytes(double *out, const uint &index,
+                                             const uint groupSz, const uint &r1,
+                                             const uint &r2, const uint &r3,
+                                             const uint &r4,
+                                             const uint &elements) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + groupSz < elements) { out[index + groupSz] = n2; }
+}
+
+static void partialBoxMullerWriteOut128Bytes(cdouble *out, const uint &index,
+                                             const uint groupSz, const uint &r1,
+                                             const uint &r2, const uint &r3,
+                                             const uint &r4,
+                                             const uint &elements) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    if (index < elements) { out[index] = {n1, n2}; }
+}
+
+static void partialWriteOut128Bytes(arrayfire::common::half *out,
+                                    const uint &index, const uint groupSz,
+                                    const uint &r1, const uint &r2,
+                                    const uint &r3, const uint &r4,
+                                    const uint &elements) {
+    if (index < elements) { out[index] = oneMinusGetHalf01(r1, 0); }
+    if (index + groupSz < elements) {
+        out[index + groupSz] = oneMinusGetHalf01(r1, 1);
+    }
+    if (index + 2 * groupSz < elements) {
+        out[index + 2 * groupSz] = oneMinusGetHalf01(r2, 0);
+    }
+    if (index + 3 * groupSz < elements) {
+        out[index + 3 * groupSz] = oneMinusGetHalf01(r2, 1);
+    }
+    if (index + 4 * groupSz < elements) {
+        out[index + 4 * groupSz] = oneMinusGetHalf01(r3, 0);
+    }
+    if (index + 5 * groupSz < elements) {
+        out[index + 5 * groupSz] = oneMinusGetHalf01(r3, 1);
+    }
+    if (index + 6 * groupSz < elements) {
+        out[index + 6 * groupSz] = oneMinusGetHalf01(r4, 0);
+    }
+    if (index + 7 * groupSz < elements) {
+        out[index + 7 * groupSz] = oneMinusGetHalf01(r4, 1);
+    }
+}
+
+// Normally distributed (Box-Muller) half writes with boundary checking
+static void partialBoxMullerWriteOut128Bytes(arrayfire::common::half *out,
+                                             const uint &index,
+                                             const uint groupSz, const uint &r1,
+                                             const uint &r2, const uint &r3,
+                                             const uint &r4,
+                                             const uint &elements) {
+    sycl::half n1, n2;
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r1, 0), getHalf01(r1, 1));
+    if (index < elements) { out[index] = n1; }
+    if (index + groupSz < elements) { out[index + groupSz] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r2, 0), getHalf01(r2, 1));
+    if (index + 2 * groupSz < elements) { out[index + 2 * groupSz] = n1; }
+    if (index + 3 * groupSz < elements) { out[index + 3 * groupSz] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r3, 0), getHalf01(r3, 1));
+    if (index + 4 * groupSz < elements) { out[index + 4 * groupSz] = n1; }
+    if (index + 5 * groupSz < elements) { out[index + 5 * groupSz] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r4, 0), getHalf01(r4, 1));
+    if (index + 6 * groupSz < elements) { out[index + 6 * groupSz] = n1; }
+    if (index + 7 * groupSz < elements) { out[index + 7 * groupSz] = n2; }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
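Review note (illustrative only, not part of the patch): the `uchar` overload of `partialWriteOut128Bytes` above unrolls sixteen bounds-checked stores, one per byte of `r1`..`r4`, each `groupSz` elements apart. The same pattern can be sketched as a loop on the host. The name `scatterBytesChecked` and the use of `std::vector` and `std::size_t` indexing are assumptions made for this sketch; the kernel code itself works on raw pointers with `uint` arithmetic.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side helper, not an ArrayFire function.
// Scatters the 16 bytes held in four 32-bit words into `out`, starting at
// `index` and stepping by `groupSz`, and skips any slot past the end.
inline void scatterBytesChecked(std::vector<std::uint8_t> &out,
                                std::uint32_t index, std::uint32_t groupSz,
                                const std::array<std::uint32_t, 4> &r) {
    for (std::uint32_t k = 0; k < 16; ++k) {
        // Slot k of this work-item lives k * groupSz elements after `index`.
        const std::size_t pos =
            static_cast<std::size_t>(index) +
            static_cast<std::size_t>(k) * groupSz;
        if (pos < out.size()) {
            // Word k / 4 supplies the byte; k % 4 selects which byte of it.
            out[pos] = static_cast<std::uint8_t>(r[k / 4] >> (8 * (k % 4)));
        }
    }
}
```

The unrolled form in the patch avoids the division and modulo per store; the loop is only meant to make the index and shift pattern easier to review.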
diff --git a/src/backend/oneapi/kernel/range.hpp b/src/backend/oneapi/kernel/range.hpp
new file mode 100644
index 0000000000..b8678179c2
--- /dev/null
+++ b/src/backend/oneapi/kernel/range.hpp
@@ -0,0 +1,118 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class rangeOp {
+   public:
+    rangeOp(write_accessor<T> out, KParam oinfo, const int dim,
+            const int blocksPerMatX, const int blocksPerMatY)
+        : out_(out)
+        , oinfo_(oinfo)
+        , dim_(dim)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        const int mul0 = (dim_ == 0);
+        const int mul1 = (dim_ == 1);
+        const int mul2 = (dim_ == 2);
+        const int mul3 = (dim_ == 3);
+
+        sycl::group g = it.get_group();
+        const int oz  = g.get_group_id(0) / blocksPerMatX_;
+        const int ow  = g.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - oz * blocksPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - ow * blocksPerMatY_;
+
+        const int xx = it.get_local_id(0) + blockIdx_x * it.get_local_range(0);
+        const int yy = it.get_local_id(1) + blockIdx_y * it.get_local_range(1);
+
+        const size_t odx = oinfo_.dims[0];
+        const size_t ody = oinfo_.dims[1];
+        const size_t odz = oinfo_.dims[2];
+        const size_t odw = oinfo_.dims[3];
+
+        if (xx < odx && yy < ody && oz < odz && ow < odw) {
+            const int ozw = ow * oinfo_.strides[3] + oz * oinfo_.strides[2];
+
+            const int incy = blocksPerMatY_ * g.get_local_range(1);
+            const int incx = blocksPerMatX_ * g.get_local_range(0);
+
+            compute_t<T> valZW = (mul3 * ow) + (mul2 * oz);
+
+            T* optr = out_.get_pointer();
+            for (int oy = yy; oy < oinfo_.dims[1]; oy += incy) {
+                compute_t<T> valYZW = valZW + (mul1 * oy);
+                int oyzw            = ozw + oy * oinfo_.strides[1];
+                for (int ox = xx; ox < oinfo_.dims[0]; ox += incx) {
+                    int oidx         = oyzw + ox;
+                    compute_t<T> val = valYZW + (mul0 * ox);
+
+                    optr[oidx] = val;
+                }
+            }
+        }
+    }
+
+   protected:
+    write_accessor<T> out_;
+    KParam oinfo_;
+    int dim_;
+    int blocksPerMatX_, blocksPerMatY_;
+};
+
+template<typename T>
+void range(Param<T> out, const int dim) {
+    constexpr int RANGE_TX    = 32;
+    constexpr int RANGE_TY    = 8;
+    constexpr int RANGE_TILEX = 512;
+    constexpr int RANGE_TILEY = 32;
+
+    sycl::range<2> local(RANGE_TX, RANGE_TY);
+
+    int blocksPerMatX = divup(out.info.dims[0], RANGE_TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], RANGE_TILEY);
+    sycl::range<2> global(local[0] * blocksPerMatX * out.info.dims[2],
+                          local[1] * blocksPerMatY * out.info.dims[3]);
+    sycl::nd_range<2> ndrange(global, local);
+
+    getQueue().submit([&](sycl::handler& h) {
+        write_accessor<T> out_acc{*out.data, h};
+
+        h.parallel_for(ndrange, rangeOp<T>(out_acc, out.info, dim,
+                                           blocksPerMatX, blocksPerMatY));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
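Review note (illustrative only, not part of the patch): `rangeOp` above writes, for every output element, that element's coordinate along `dim` (the `mul0`..`mul3` flags select which of `ox`, `oy`, `oz`, `ow` contributes). A host-side reference that could be used to check the kernel's output might look like the sketch below. The function name `rangeReference`, the `float` element type, and the dense layout with dimension 0 varying fastest are assumptions of this sketch.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not an ArrayFire function.
// Fills a dense 4-D array (dimension 0 varies fastest) so that each element
// holds its own coordinate along `dim`.
inline std::vector<float> rangeReference(const std::array<int, 4> &dims,
                                         int dim) {
    const std::size_t total = static_cast<std::size_t>(dims[0]) * dims[1] *
                              dims[2] * dims[3];
    std::vector<float> out(total);
    std::size_t idx = 0;
    for (int w = 0; w < dims[3]; ++w) {
        for (int z = 0; z < dims[2]; ++z) {
            for (int y = 0; y < dims[1]; ++y) {
                for (int x = 0; x < dims[0]; ++x) {
                    const int coord[4] = {x, y, z, w};
                    out[idx++] = static_cast<float>(coord[dim]);
                }
            }
        }
    }
    return out;
}
```

For a 3 x 2 array, `dim = 0` yields `0 1 2 0 1 2` and `dim = 1` yields `0 0 0 1 1 1` in memory order.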
diff --git a/src/backend/oneapi/kernel/reduce.hpp b/src/backend/oneapi/kernel/reduce.hpp
new file mode 100644
index 0000000000..7089cb9b4e
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce.hpp
@@ -0,0 +1,115 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/reduce_all.hpp>
+#include <kernel/reduce_config.hpp>
+#include <kernel/reduce_dim.hpp>
+#include <kernel/reduce_first.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+
+#include <algorithm>
+#include <climits>
+#include <complex>
+#include <iostream>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_default_dispatch(Param<To> out, Param<Ti> in, int dim,
+                             bool change_nan, double nanval) {
+    switch (dim) {
+        case 0:
+            return reduce_first_default<Ti, To, op>(out, in, change_nan,
+                                                    nanval);
+        case 1:
+            return reduce_dim_default<Ti, To, op, 1>(out, in, change_nan,
+                                                     nanval);
+        case 2:
+            return reduce_dim_default<Ti, To, op, 2>(out, in, change_nan,
+                                                     nanval);
+        case 3:
+            return reduce_dim_default<Ti, To, op, 3>(out, in, change_nan,
+                                                     nanval);
+    }
+}
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_cpu_dispatch(Param<To> out, Param<Ti> in, int dim, bool change_nan,
+                         double nanval) {
+    // TODO: use kernels optimized for SIMD-based subgroup sizes
+    reduce_default_dispatch<Ti, To, op>(out, in, dim, change_nan, nanval);
+}
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_gpu_dispatch(Param<To> out, Param<Ti> in, int dim, bool change_nan,
+                         double nanval) {
+    // TODO: use kernels optimized for GPU subgroup sizes
+    reduce_default_dispatch<Ti, To, op>(out, in, dim, change_nan, nanval);
+}
+
+template<typename Ti, typename To, af_op_t op>
+void reduce(Param<To> out, Param<Ti> in, int dim, bool change_nan,
+            double nanval) {
+    // TODO: logic to dispatch to different kernels depending on device type
+    if (getQueue().get_device().is_cpu()) {
+        reduce_cpu_dispatch<Ti, To, op>(out, in, dim, change_nan, nanval);
+    } else if (getQueue().get_device().is_gpu()) {
+        reduce_gpu_dispatch<Ti, To, op>(out, in, dim, change_nan, nanval);
+    } else {
+        reduce_default_dispatch<Ti, To, op>(out, in, dim, change_nan, nanval);
+    }
+}
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_all(Param<To> out, Param<Ti> in, bool change_nan, double nanval) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
+
+    if (is_linear) {
+        in.info.dims[0] = in_elements;
+        for (int k = 1; k < 4; k++) {
+            in.info.dims[k]    = 1;
+            in.info.strides[k] = in_elements;
+        }
+    }
+
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, creduce::THREADS_PER_BLOCK);
+    uint threads_y = creduce::THREADS_PER_BLOCK / threads_x;
+
+    // TODO(perf): consider removing REPEAT or evaluating it at runtime;
+    // skip REPEAT when the problem has fewer elements than SM resident threads
+    uint blocks_x = divup(in.info.dims[0], threads_x * creduce::REPEAT);
+    uint blocks_y = divup(in.info.dims[1], threads_y);
+
+    reduce_all_launcher_default<Ti, To, op>(out, in, blocks_x, blocks_y,
+                                            threads_x, change_nan, nanval);
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
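Review note (illustrative only, not part of the patch): `reduce_all` above treats the input as one flat vector when its strides describe a dense, unpadded layout, and only then collapses `dims[0]` to the total element count. The check can be sketched in isolation as follows. The name `isLinear` and the `long long` element type are assumptions of this sketch; the patch reads the values from `in.info.dims` and `in.info.strides`.

```cpp
#include <array>

// Hypothetical standalone version of the linearity test, not an ArrayFire
// function. A layout is linear when stride 0 is 1 and every later stride
// equals the previous stride times the previous dimension.
inline bool isLinear(const std::array<long long, 4> &dims,
                     const std::array<long long, 4> &strides) {
    bool linear = (strides[0] == 1);
    for (int k = 1; k < 4; ++k) {
        linear = linear && (strides[k] == strides[k - 1] * dims[k - 1]);
    }
    return linear;
}
```

A 4 x 3 x 2 x 1 array with strides `{1, 4, 12, 24}` passes the check, while the same dimensions with a padded leading stride such as `{1, 5, 15, 30}` do not, so the latter keeps its original 4-D shape.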
diff --git a/src/backend/oneapi/kernel/reduce_all.hpp b/src/backend/oneapi/kernel/reduce_all.hpp
new file mode 100644
index 0000000000..7a1e842425
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce_all.hpp
@@ -0,0 +1,280 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <climits>
+#include <complex>
+#include <iostream>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+using global_atomic_ref =
+    sycl::atomic_ref<T, sycl::memory_order::relaxed, sycl::memory_scope::system,
+                     sycl::access::address_space::global_space>;
+
+template<typename Ti, typename To, af_op_t op>
+class reduceAllKernelSMEM {
+   public:
+    reduceAllKernelSMEM(write_accessor<To> out, KParam oInfo,
+                        sycl::accessor<unsigned> retCount,
+                        sycl::accessor<To> tmp, KParam tmpInfo,
+                        read_accessor<Ti> in, KParam iInfo, uint DIMX,
+                        uint groups_x, uint groups_y, uint repeat,
+                        bool change_nan, To nanval,
+                        sycl::local_accessor<compute_t<To>, 1> s_ptr,
+                        sycl::local_accessor<bool, 1> amLast)
+        : out_(out)
+        , retCount_(retCount)
+        , tmp_(tmp)
+        , in_(in)
+        , oInfo_(oInfo)
+        , tmpInfo_(tmpInfo)
+        , iInfo_(iInfo)
+        , DIMX_(DIMX)
+        , repeat_(repeat)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , change_nan_(change_nan)
+        , nanval_(nanval)
+        , s_ptr_(s_ptr)
+        , amLast_(amLast) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * DIMX_ + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid = groupId_x * g.get_local_range(0) * repeat_ + lidx;
+        const uint yid = groupId_y * g.get_local_range(1) + lidy;
+
+        common::Binary<compute_t<To>, op> reduce;
+        common::Transform<Ti, compute_t<To>, op> transform;
+
+        auto iptr = in_.get_pointer() + wid * iInfo_.strides[3] +
+                    zid * iInfo_.strides[2] + yid * iInfo_.strides[1] +
+                    iInfo_.offset;
+
+        bool cond = (yid < iInfo_.dims[1]) && (zid < iInfo_.dims[2]) &&
+                    (wid < iInfo_.dims[3]);
+
+        dim_t last = (xid + repeat_ * DIMX_);
+        int lim    = min(last, iInfo_.dims[0]);
+
+        compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+        for (int id = xid; cond && id < lim; id += DIMX_) {
+            compute_t<To> in_val = transform(iptr[id]);
+            if (change_nan_)
+                in_val = !IS_NAN(in_val) ? in_val
+                                         : static_cast<compute_t<To>>(nanval_);
+            out_val = reduce(in_val, out_val);
+        }
+
+        s_ptr_[lid] = out_val;
+
+        group_barrier(g);
+
+        if (creduce::THREADS_PER_BLOCK == 256) {
+            if (lid < 128) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 128]);
+            group_barrier(g);
+        }
+
+        if (creduce::THREADS_PER_BLOCK >= 128) {
+            if (lid < 64) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 64]);
+            group_barrier(g);
+        }
+
+        if (creduce::THREADS_PER_BLOCK >= 64) {
+            if (lid < 32) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 32]);
+            group_barrier(g);
+        }
+
+        // TODO: replace with subgroup operations in optimized kernels
+        if (lid < 16) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 16]);
+        group_barrier(g);
+
+        if (lid < 8) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 8]);
+        group_barrier(g);
+
+        if (lid < 4) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 4]);
+        group_barrier(g);
+
+        if (lid < 2) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 2]);
+        group_barrier(g);
+
+        if (lid < 1) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 1]);
+        group_barrier(g);
+
+        const unsigned total_blocks =
+            (g.get_group_range(0) * g.get_group_range(1));
+        const int uubidx =
+            (g.get_group_range(0) * g.get_group_id(1)) + g.get_group_id(0);
+        if (cond && lid == 0) {
+            if (total_blocks != 1) {
+                tmp_[uubidx] = s_ptr_[0];
+            } else {
+                out_[0] = s_ptr_[0];
+            }
+        }
+
+        // Last block to perform final reduction
+        if (total_blocks > 1) {
+            sycl::atomic_fence(sycl::memory_order::seq_cst,
+                               sycl::memory_scope::device);
+
+            // thread 0 takes a ticket
+            if (lid == 0) {
+                unsigned int ticket = global_atomic_ref<uint>(retCount_[0])++;
+                // If the ticket equals total_blocks - 1, we are the last block
+                amLast_[0] = (ticket == (total_blocks - 1));
+            }
+            group_barrier(g);
+
+            if (amLast_[0]) {
+                int i   = lid;
+                out_val = common::Binary<compute_t<To>, op>::init();
+
+                while (i < total_blocks) {
+                    compute_t<To> in_val = compute_t<To>(tmp_[i]);
+                    out_val              = reduce(in_val, out_val);
+                    i += creduce::THREADS_PER_BLOCK;
+                }
+
+                s_ptr_[lid] = out_val;
+                group_barrier(g);
+
+                // reduce final block
+                if (creduce::THREADS_PER_BLOCK == 256) {
+                    if (lid < 128)
+                        s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 128]);
+                    group_barrier(g);
+                }
+
+                if (creduce::THREADS_PER_BLOCK >= 128) {
+                    if (lid < 64)
+                        s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 64]);
+                    group_barrier(g);
+                }
+
+                if (creduce::THREADS_PER_BLOCK >= 64) {
+                    if (lid < 32)
+                        s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 32]);
+                    group_barrier(g);
+                }
+
+                if (lid < 16)
+                    s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 16]);
+                group_barrier(g);
+
+                if (lid < 8) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 8]);
+                group_barrier(g);
+
+                if (lid < 4) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 4]);
+                group_barrier(g);
+
+                if (lid < 2) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 2]);
+                group_barrier(g);
+
+                if (lid < 1) s_ptr_[lid] = reduce(s_ptr_[lid], s_ptr_[lid + 1]);
+                group_barrier(g);
+
+                if (lid == 0) {
+                    out_[0] = s_ptr_[0];
+
+                    // reset retirement count so that next run succeeds
+                    retCount_[0] = 0;
+                }
+            }
+        }
+    }
+
+   protected:
+    write_accessor<To> out_;
+    sycl::accessor<unsigned> retCount_;
+    sycl::accessor<To> tmp_;
+    read_accessor<Ti> in_;
+    KParam oInfo_, tmpInfo_, iInfo_;
+    uint DIMX_, repeat_;
+    uint groups_x_, groups_y_;
+    bool change_nan_;
+    To nanval_;
+    sycl::local_accessor<compute_t<To>, 1> s_ptr_;
+    sycl::local_accessor<bool, 1> amLast_;
+};
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_all_launcher_default(Param<To> out, Param<Ti> in,
+                                 const uint groups_x, const uint groups_y,
+                                 const uint threads_x, bool change_nan,
+                                 double nanval) {
+    sycl::range<2> local(threads_x, creduce::THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * in.info.dims[2] * local[0],
+                          groups_y * in.info.dims[3] * local[1]);
+
+    uint repeat = divup(in.info.dims[0], (groups_x * threads_x));
+
+    long tmp_elements = groups_x * in.info.dims[2] * groups_y * in.info.dims[3];
+    if (tmp_elements > UINT_MAX) {
+        AF_ERROR(
+            "Too many blocks requested (typeof(retirementCount) == unsigned)",
+            AF_ERR_RUNTIME);
+    }
+
+    Array<To> tmp = createEmptyArray<To>(tmp_elements);
+    auto tmp_get = tmp.get();
+
+    Array<unsigned> retirementCount = createValueArray<unsigned>(1, 0);
+    auto ret_get = retirementCount.get();
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        auto retCount_acc = ret_get->get_access(h);
+        auto tmp_acc      = tmp_get->get_access(h);
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        auto shrdMem = sycl::local_accessor<compute_t<To>, 1>(
+            creduce::THREADS_PER_BLOCK, h);
+        auto amLast = sycl::local_accessor<bool, 1>(1, h);
+        h.parallel_for(
+            sycl::nd_range<2>(global, local),
+            reduceAllKernelSMEM<Ti, To, op>(
+                out_acc, out.info, retCount_acc, tmp_acc, (KParam)tmp, in_acc,
+                in.info, threads_x, groups_x, groups_y, repeat, change_nan,
+                scalar<To>(nanval), shrdMem, amLast));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/reduce_by_key.hpp b/src/backend/oneapi/kernel/reduce_by_key.hpp
new file mode 100644
index 0000000000..329fd33109
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce_by_key.hpp
@@ -0,0 +1,694 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <type_traits>
+
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+// Reduces keys across block boundaries
+template<typename Tk, typename To, af_op_t op>
+class finalBoundaryReduceKernel {
+   public:
+    finalBoundaryReduceKernel(write_accessor<int> reduced_block_sizes,
+                              read_accessor<Tk> iKeys, KParam iKInfo,
+                              sycl::accessor<To> oVals, KParam oVInfo,
+                              const int n)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , n_(n) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint gid = it.get_global_id(0);
+        const uint bid = g.get_group_id(0);
+
+        common::Binary<compute_t<To>, op> binOp;
+        if (gid == ((bid + 1) * it.get_local_range(0)) - 1 &&
+            bid < g.get_group_range(0) - 1) {
+            Tk k0 = iKeys_[gid + iKInfo_.offset];
+            Tk k1 = iKeys_[gid + 1 + iKInfo_.offset];
+
+            if (k0 == k1) {
+                compute_t<To> v0          = compute_t<To>(oVals_[gid]);
+                compute_t<To> v1          = compute_t<To>(oVals_[gid + 1]);
+                oVals_[gid + 1]           = binOp(v0, v1);
+                reduced_block_sizes_[bid] = it.get_local_range(0) - 1;
+            } else {
+                reduced_block_sizes_[bid] = it.get_local_range(0);
+            }
+        }
+
+        // if last block, set block size to difference between n and block
+        // boundary
+        if (lid == 0 && bid == g.get_group_range(0) - 1) {
+            reduced_block_sizes_[bid] = n_ - (bid * it.get_local_range(0));
+        }
+    }
+
+   protected:
+    write_accessor<int> reduced_block_sizes_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    sycl::accessor<To> oVals_;
+    KParam oVInfo_;
+    int n_;
+};
+
+template<typename Tk, typename To, af_op_t op>
+class finalBoundaryReduceDimKernel {
+   public:
+    finalBoundaryReduceDimKernel(write_accessor<int> reduced_block_sizes,
+                                 read_accessor<Tk> iKeys, KParam iKInfo,
+                                 sycl::accessor<To> oVals, KParam oVInfo,
+                                 const int n, const int nGroupsZ)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , n_(n)
+        , nGroupsZ_(nGroupsZ) {}
+
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint gid = it.get_global_id(0);
+        const uint bid = g.get_group_id(0);
+
+        common::Binary<compute_t<To>, op> binOp;
+        if (gid == ((bid + 1) * it.get_local_range(0)) - 1 &&
+            bid < g.get_group_range(0) - 1) {
+            Tk k0 = iKeys_[gid + iKInfo_.offset];
+            Tk k1 = iKeys_[gid + 1 + iKInfo_.offset];
+
+            if (k0 == k1) {
+                compute_t<To> v0          = compute_t<To>(oVals_[gid]);
+                compute_t<To> v1          = compute_t<To>(oVals_[gid + 1]);
+                oVals_[gid + 1]           = binOp(v0, v1);
+                reduced_block_sizes_[bid] = it.get_local_range(0) - 1;
+            } else {
+                reduced_block_sizes_[bid] = it.get_local_range(0);
+            }
+        }
+
+        // if last block, set block size to difference between n and block
+        // boundary
+        if (lid == 0 && bid == g.get_group_range(0) - 1) {
+            reduced_block_sizes_[bid] = n_ - (bid * it.get_local_range(0));
+        }
+    }
+
+   protected:
+    write_accessor<int> reduced_block_sizes_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    sycl::accessor<To> oVals_;
+    KParam oVInfo_;
+    int n_;
+    int nGroupsZ_;
+};
+
+template<typename T>
+using global_atomic_ref =
+    sycl::atomic_ref<T, sycl::memory_order::relaxed, sycl::memory_scope::system,
+                     sycl::access::address_space::global_space>;
+
+// Tests if data needs further reduction, including across block boundaries
+template<typename Tk>
+class testNeedsReductionKernel {
+   public:
+    testNeedsReductionKernel(sycl::accessor<int> needs_another_reduction,
+                             sycl::accessor<int> needs_block_boundary_reduced,
+                             read_accessor<Tk> iKeys, KParam iKInfo,
+                             const int n, const int DIMX,
+                             sycl::local_accessor<Tk> l_keys)
+        : needs_another_reduction_(needs_another_reduction)
+        , needs_block_boundary_reduced_(needs_block_boundary_reduced)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , n_(n)
+        , DIMX_(DIMX)
+        , l_keys_(l_keys) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint gid = it.get_global_id(0);
+        const uint bid = g.get_group_id(0);
+
+        Tk k = scalar<Tk>(0);
+        if (gid < n_) { k = iKeys_[gid + iKInfo_.offset]; }
+
+        l_keys_[lid] = k;
+        it.barrier();
+
+        int update_key =
+            (lid < DIMX_ - 2) && (k == l_keys_[lid + 1]) && (gid < (n_ - 1));
+
+        if (update_key) {
+            global_atomic_ref<int>(needs_another_reduction_[0]) |= update_key;
+        }
+
+        it.barrier();
+
+        // last thread in each block checks if any inter-block keys need further
+        // reduction
+        if (gid == ((bid + 1) * DIMX_) - 1 &&
+            bid < (g.get_group_range(0) - 1)) {
+            Tk k0 = iKeys_[gid + iKInfo_.offset];
+            Tk k1 = iKeys_[gid + 1 + iKInfo_.offset];
+            if (k0 == k1) {
+                global_atomic_ref<int>(needs_block_boundary_reduced_[0]) |= 1;
+            }
+        }
+    }
+
+   protected:
+    sycl::accessor<int> needs_another_reduction_;
+    sycl::accessor<int> needs_block_boundary_reduced_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    int n_;
+    int DIMX_;
+    sycl::local_accessor<Tk> l_keys_;
+};
+
+// Compacts "incomplete" block-sized chunks of data in global memory
+template<typename Tk, typename To>
+class compactKernel {
+   public:
+    compactKernel(read_accessor<int> reduced_block_sizes,
+                  write_accessor<Tk> oKeys, KParam oKInfo,
+                  write_accessor<To> oVals, KParam oVInfo,
+                  read_accessor<Tk> iKeys, KParam iKInfo,
+                  read_accessor<To> iVals, KParam iVInfo, int nGroupsZ)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , oKeys_(oKeys)
+        , oKInfo_(oKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , iVals_(iVals)
+        , iVInfo_(iVInfo)
+        , nGroupsZ_(nGroupsZ) {}
+
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint bid = g.get_group_id(0);
+        const uint gid = it.get_global_id(0);
+
+        const int bidy = g.get_group_id(1);
+        const int bidz = g.get_group_id(2) % nGroupsZ_;
+        const int bidw = g.get_group_id(2) / nGroupsZ_;
+
+        const int bOffset = bidw * oVInfo_.strides[3] +
+                            bidz * oVInfo_.strides[2] +
+                            bidy * oVInfo_.strides[1];
+
+        // reduced_block_sizes should have inclusive sum of block sizes
+        int nwrite =
+            (bid == 0)
+                ? reduced_block_sizes_[0]
+                : (reduced_block_sizes_[bid] - reduced_block_sizes_[bid - 1]);
+        int writeloc = (bid == 0) ? 0 : reduced_block_sizes_[bid - 1];
+
+        Tk k = iKeys_[gid + iKInfo_.offset];
+        To v = iVals_[bOffset + gid + iVInfo_.offset];
+
+        if (lid < nwrite) {
+            oKeys_[writeloc + lid]           = k;
+            oVals_[bOffset + writeloc + lid] = v;
+        }
+    }
+
+   protected:
+    read_accessor<int> reduced_block_sizes_;
+    write_accessor<Tk> oKeys_;
+    KParam oKInfo_;
+    write_accessor<To> oVals_;
+    KParam oVInfo_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    read_accessor<To> iVals_;
+    KParam iVInfo_;
+    int nGroupsZ_;
+};
+
+// Compacts "incomplete" block-sized chunks in global memory along DIM
+template<typename Tk, typename To>
+class compactDimKernel {
+   public:
+    compactDimKernel(read_accessor<int> reduced_block_sizes,
+                     write_accessor<Tk> oKeys, KParam oKInfo,
+                     write_accessor<To> oVals, KParam oVInfo,
+                     read_accessor<Tk> iKeys, KParam iKInfo,
+                     read_accessor<To> iVals, KParam iVInfo, int nGroupsZ,
+                     int DIM)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , oKeys_(oKeys)
+        , oKInfo_(oKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , iVals_(iVals)
+        , iVInfo_(iVInfo)
+        , nGroupsZ_(nGroupsZ)
+        , DIM_(DIM) {}
+
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g = it.get_group();
+
+        const uint lid  = it.get_local_id(0);
+        const uint gidx = it.get_global_id(0);
+        const uint bid  = g.get_group_id(0);
+
+        const int bidy = g.get_group_id(1);
+        const int bidz = g.get_group_id(2) % nGroupsZ_;
+        const int bidw = g.get_group_id(2) / nGroupsZ_;
+
+        int dims_ordering[4];
+        dims_ordering[0] = DIM_;
+        int d            = 1;
+        for (int i = 0; i < 4; ++i) {
+            if (i != DIM_) dims_ordering[d++] = i;
+        }
+
+        Tk k;
+        To v;
+
+        // reduced_block_sizes should have inclusive sum of block sizes
+        int nwrite =
+            (bid == 0)
+                ? reduced_block_sizes_[0]
+                : (reduced_block_sizes_[bid] - reduced_block_sizes_[bid - 1]);
+        int writeloc = (bid == 0) ? 0 : reduced_block_sizes_[bid - 1];
+
+        const int tid = bidw * iVInfo_.strides[dims_ordering[3]] +
+                        bidz * iVInfo_.strides[dims_ordering[2]] +
+                        bidy * iVInfo_.strides[dims_ordering[1]] +
+                        gidx * iVInfo_.strides[DIM_];
+        k = iKeys_[gidx + iKInfo_.offset];
+        v = iVals_[tid + iVInfo_.offset];
+
+        if (lid < nwrite) {
+            oKeys_[writeloc + lid] = k;
+            const int bOffset      = bidw * oVInfo_.strides[dims_ordering[3]] +
+                                bidz * oVInfo_.strides[dims_ordering[2]] +
+                                bidy * oVInfo_.strides[dims_ordering[1]];
+            oVals_[bOffset + (writeloc + lid) * oVInfo_.strides[DIM_]] = v;
+        }
+    }
+
+   protected:
+    read_accessor<int> reduced_block_sizes_;
+    write_accessor<Tk> oKeys_;
+    KParam oKInfo_;
+    write_accessor<To> oVals_;
+    KParam oVInfo_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    read_accessor<To> iVals_;
+    KParam iVInfo_;
+    int nGroupsZ_;
+    int DIM_;
+};
+
+// Reduces each block by key
+template<typename Ti, typename Tk, typename To, af_op_t op>
+class reduceBlocksByKeyKernel {
+   public:
+    reduceBlocksByKeyKernel(sycl::accessor<int> reduced_block_sizes,
+                            write_accessor<Tk> oKeys, KParam oKInfo,
+                            write_accessor<To> oVals, KParam oVInfo,
+                            read_accessor<Tk> iKeys, KParam iKInfo,
+                            read_accessor<Ti> iVals, KParam iVInfo,
+                            int change_nan, To nanval, int n, int nGroupsZ,
+                            int DIMX, sycl::local_accessor<Tk> l_keys,
+                            sycl::local_accessor<compute_t<To>> l_vals,
+                            sycl::local_accessor<Tk> l_reduced_keys,
+                            sycl::local_accessor<compute_t<To>> l_reduced_vals,
+                            sycl::local_accessor<int> l_unique_ids,
+                            sycl::local_accessor<int> l_wg_temp,
+                            sycl::local_accessor<int> l_unique_flags,
+                            sycl::local_accessor<int> l_reduced_block_size)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , oKeys_(oKeys)
+        , oKInfo_(oKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , iVals_(iVals)
+        , iVInfo_(iVInfo)
+        , change_nan_(change_nan)
+        , nanval_(nanval)
+        , n_(n)
+        , nGroupsZ_(nGroupsZ)
+        , DIMX_(DIMX)
+        , l_keys_(l_keys)
+        , l_vals_(l_vals)
+        , l_reduced_keys_(l_reduced_keys)
+        , l_reduced_vals_(l_reduced_vals)
+        , l_unique_ids_(l_unique_ids)
+        , l_wg_temp_(l_wg_temp)
+        , l_unique_flags_(l_unique_flags)
+        , l_reduced_block_size_(l_reduced_block_size) {}
+
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint gid = it.get_global_id(0);
+
+        const int bidy = g.get_group_id(1);
+        const int bidz = g.get_group_id(2) % nGroupsZ_;
+        const int bidw = g.get_group_id(2) / nGroupsZ_;
+
+        const compute_t<To> init_val =
+            common::Binary<compute_t<To>, op>::init();
+        common::Binary<compute_t<To>, op> binOp;
+        common::Transform<Ti, compute_t<To>, op> transform;
+
+        if (lid == 0) { l_reduced_block_size_[0] = 0; }
+
+        // load keys and values to threads
+        Tk k            = scalar<Tk>(0);
+        compute_t<To> v = init_val;
+        if (gid < n_) {
+            k                 = iKeys_[gid + iKInfo_.offset];
+            const int bOffset = bidw * iVInfo_.strides[3] +
+                                bidz * iVInfo_.strides[2] +
+                                bidy * iVInfo_.strides[1];
+            v = transform(iVals_[bOffset + gid + iVInfo_.offset]);
+            if (change_nan_) v = IS_NAN(v) ? nanval_ : v;
+        }
+
+        l_keys_[lid] = k;
+        l_vals_[lid] = v;
+
+        l_reduced_keys_[lid] = k;
+        it.barrier();
+
+        // mark threads containing unique keys
+        int eq_check    = (lid > 0) ? (k != l_reduced_keys_[lid - 1]) : 0;
+        int unique_flag = (eq_check || (lid == 0)) && (gid < n_);
+
+        l_unique_flags_[lid] = unique_flag;
+        int unique_id =
+            work_group_scan_inclusive_add(it, l_wg_temp_, l_unique_flags_);
+
+        l_unique_ids_[lid] = unique_id;
+
+        if (lid == DIMX_ - 1) l_reduced_block_size_[0] = unique_id;
+
+        for (int off = 1; off < DIMX_; off *= 2) {
+            it.barrier();
+            int test_unique_id =
+                (lid + off < DIMX_) ? l_unique_ids_[lid + off] : ~unique_id;
+            eq_check = (unique_id == test_unique_id);
+            int update_key =
+                eq_check && (lid < (DIMX_ - off)) &&
+                ((gid + off) <
+                 n_);  // checks if this thread should perform a reduction
+            compute_t<To> uval = (update_key) ? l_vals_[lid + off] : init_val;
+            it.barrier();
+            l_vals_[lid] =
+                binOp(l_vals_[lid], uval);  // update if thread requires it
+        }
+
+        if (unique_flag) {
+            l_reduced_keys_[unique_id - 1] = k;
+            l_reduced_vals_[unique_id - 1] = l_vals_[lid];
+        }
+        it.barrier();
+
+        const int bid = g.get_group_id(0);
+        if (lid < l_reduced_block_size_[0]) {
+            const int bOffset = bidw * oVInfo_.strides[3] +
+                                bidz * oVInfo_.strides[2] +
+                                bidy * oVInfo_.strides[1];
+            oKeys_[bid * DIMX_ + lid]               = l_reduced_keys_[lid];
+            oVals_[bOffset + ((bid * DIMX_) + lid)] = l_reduced_vals_[lid];
+        }
+
+        reduced_block_sizes_[bid] = l_reduced_block_size_[0];
+    }
+
+    int work_group_scan_inclusive_add(sycl::nd_item<3> it,
+                                      sycl::local_accessor<int> wg_temp,
+                                      sycl::local_accessor<int> arr) const {
+        const uint lid = it.get_local_id(0);
+        int *active_buf;
+
+        int val    = arr[lid];
+        active_buf = arr.get_pointer();
+
+        bool swap_buffer = false;
+        for (int off = 1; off <= DIMX_; off *= 2) {
+            it.barrier();
+            if (lid >= off) { val = val + active_buf[lid - off]; }
+            swap_buffer = !swap_buffer;
+            active_buf =
+                swap_buffer ? wg_temp.get_pointer() : arr.get_pointer();
+            active_buf[lid] = val;
+        }
+
+        int res = active_buf[lid];
+        return res;
+    }
+
+   protected:
+    sycl::accessor<int> reduced_block_sizes_;
+    write_accessor<Tk> oKeys_;
+    KParam oKInfo_;
+    write_accessor<To> oVals_;
+    KParam oVInfo_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    read_accessor<Ti> iVals_;
+    KParam iVInfo_;
+    int change_nan_;
+    To nanval_;
+    int n_;
+    int nGroupsZ_;
+    int DIMX_;
+    sycl::local_accessor<Tk> l_keys_;
+    sycl::local_accessor<compute_t<To>> l_vals_;
+    sycl::local_accessor<Tk> l_reduced_keys_;
+    sycl::local_accessor<compute_t<To>> l_reduced_vals_;
+    sycl::local_accessor<int> l_unique_ids_;
+    sycl::local_accessor<int> l_wg_temp_;
+    sycl::local_accessor<int> l_unique_flags_;
+    sycl::local_accessor<int> l_reduced_block_size_;
+};
+
+// Reduces each block by key along dimension DIM
+template<typename Ti, typename Tk, typename To, af_op_t op>
+class reduceBlocksByKeyDimKernel {
+   public:
+    reduceBlocksByKeyDimKernel(
+        sycl::accessor<int> reduced_block_sizes, write_accessor<Tk> oKeys,
+        KParam oKInfo, write_accessor<To> oVals, KParam oVInfo,
+        read_accessor<Tk> iKeys, KParam iKInfo, read_accessor<Ti> iVals,
+        KParam iVInfo, int change_nan, To nanval, int n, int nGroupsZ, int DIMX,
+        int DIM, sycl::local_accessor<Tk> l_keys,
+        sycl::local_accessor<compute_t<To>> l_vals,
+        sycl::local_accessor<Tk> l_reduced_keys,
+        sycl::local_accessor<compute_t<To>> l_reduced_vals,
+        sycl::local_accessor<int> l_unique_ids,
+        sycl::local_accessor<int> l_wg_temp,
+        sycl::local_accessor<int> l_unique_flags,
+        sycl::local_accessor<int> l_reduced_block_size)
+        : reduced_block_sizes_(reduced_block_sizes)
+        , oKeys_(oKeys)
+        , oKInfo_(oKInfo)
+        , oVals_(oVals)
+        , oVInfo_(oVInfo)
+        , iKeys_(iKeys)
+        , iKInfo_(iKInfo)
+        , iVals_(iVals)
+        , iVInfo_(iVInfo)
+        , change_nan_(change_nan)
+        , nanval_(nanval)
+        , n_(n)
+        , nGroupsZ_(nGroupsZ)
+        , DIMX_(DIMX)
+        , DIM_(DIM)
+        , l_keys_(l_keys)
+        , l_vals_(l_vals)
+        , l_reduced_keys_(l_reduced_keys)
+        , l_reduced_vals_(l_reduced_vals)
+        , l_unique_ids_(l_unique_ids)
+        , l_wg_temp_(l_wg_temp)
+        , l_unique_flags_(l_unique_flags)
+        , l_reduced_block_size_(l_reduced_block_size) {}
+
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g  = it.get_group();
+        const uint lid = it.get_local_id(0);
+        const uint gid = it.get_global_id(0);
+
+        const int bidy = g.get_group_id(1);
+        const int bidz = g.get_group_id(2) % nGroupsZ_;
+        const int bidw = g.get_group_id(2) / nGroupsZ_;
+
+        const compute_t<To> init_val =
+            common::Binary<compute_t<To>, op>::init();
+        common::Binary<compute_t<To>, op> binOp;
+        common::Transform<Ti, compute_t<To>, op> transform;
+
+        if (lid == 0) { l_reduced_block_size_[0] = 0; }
+
+        int dims_ordering[4];
+        dims_ordering[0] = DIM_;
+        int d            = 1;
+        for (int i = 0; i < 4; ++i) {
+            if (i != DIM_) dims_ordering[d++] = i;
+        }
+        it.barrier();
+
+        // load keys and values to threads
+        Tk k            = scalar<Tk>(0);
+        compute_t<To> v = init_val;
+        if (gid < n_) {
+            k                 = iKeys_[gid + iKInfo_.offset];
+            const int bOffset = bidw * iVInfo_.strides[dims_ordering[3]] +
+                                bidz * iVInfo_.strides[dims_ordering[2]] +
+                                bidy * iVInfo_.strides[dims_ordering[1]];
+            v = transform(
+                iVals_[bOffset + gid * iVInfo_.strides[DIM_] + iVInfo_.offset]);
+            if (change_nan_) v = IS_NAN(v) ? nanval_ : v;
+        }
+
+        l_keys_[lid] = k;
+        l_vals_[lid] = v;
+
+        l_reduced_keys_[lid] = k;
+        it.barrier();
+
+        // mark threads containing unique keys
+        int eq_check    = (lid > 0) ? (k != l_reduced_keys_[lid - 1]) : 0;
+        int unique_flag = (eq_check || (lid == 0)) && (gid < n_);
+
+        l_unique_flags_[lid] = unique_flag;
+        int unique_id =
+            work_group_scan_inclusive_add(it, l_wg_temp_, l_unique_flags_);
+
+        l_unique_ids_[lid] = unique_id;
+
+        if (lid == DIMX_ - 1) l_reduced_block_size_[0] = unique_id;
+
+        for (int off = 1; off < DIMX_; off *= 2) {
+            it.barrier();
+            int test_unique_id =
+                (lid + off < DIMX_) ? l_unique_ids_[lid + off] : ~unique_id;
+            eq_check = (unique_id == test_unique_id);
+            int update_key =
+                eq_check && (lid < (DIMX_ - off)) &&
+                ((gid + off) <
+                 n_);  // checks if this thread should perform a reduction
+            compute_t<To> uval = (update_key) ? l_vals_[lid + off] : init_val;
+            it.barrier();
+            l_vals_[lid] =
+                binOp(l_vals_[lid], uval);  // update if thread requires it
+        }
+
+        if (unique_flag) {
+            l_reduced_keys_[unique_id - 1] = k;
+            l_reduced_vals_[unique_id - 1] = l_vals_[lid];
+        }
+        it.barrier();
+
+        const int bid = g.get_group_id(0);
+        if (lid < l_reduced_block_size_[0]) {
+            const int bOffset = bidw * oVInfo_.strides[dims_ordering[3]] +
+                                bidz * oVInfo_.strides[dims_ordering[2]] +
+                                bidy * oVInfo_.strides[dims_ordering[1]];
+            oKeys_[gid] = l_reduced_keys_[lid];
+            oVals_[bOffset + (gid)*oVInfo_.strides[DIM_]] =
+                l_reduced_vals_[lid];
+        }
+
+        reduced_block_sizes_[bid] = l_reduced_block_size_[0];
+    }
+
+    int work_group_scan_inclusive_add(sycl::nd_item<3> it,
+                                      sycl::local_accessor<int> wg_temp,
+                                      sycl::local_accessor<int> arr) const {
+        const uint lid = it.get_local_id(0);
+        int *active_buf;
+
+        int val    = arr[lid];
+        active_buf = arr.get_pointer();
+
+        bool swap_buffer = false;
+        for (int off = 1; off <= DIMX_; off *= 2) {
+            it.barrier();
+            if (lid >= off) { val = val + active_buf[lid - off]; }
+            swap_buffer = !swap_buffer;
+            active_buf =
+                swap_buffer ? wg_temp.get_pointer() : arr.get_pointer();
+            active_buf[lid] = val;
+        }
+
+        int res = active_buf[lid];
+        return res;
+    }
+
+   protected:
+    sycl::accessor<int> reduced_block_sizes_;
+    write_accessor<Tk> oKeys_;
+    KParam oKInfo_;
+    write_accessor<To> oVals_;
+    KParam oVInfo_;
+    read_accessor<Tk> iKeys_;
+    KParam iKInfo_;
+    read_accessor<Ti> iVals_;
+    KParam iVInfo_;
+    int change_nan_;
+    To nanval_;
+    int n_;
+    int nGroupsZ_;
+    int DIMX_;
+    int DIM_;
+    sycl::local_accessor<Tk> l_keys_;
+    sycl::local_accessor<compute_t<To>> l_vals_;
+    sycl::local_accessor<Tk> l_reduced_keys_;
+    sycl::local_accessor<compute_t<To>> l_reduced_vals_;
+    sycl::local_accessor<int> l_unique_ids_;
+    sycl::local_accessor<int> l_wg_temp_;
+    sycl::local_accessor<int> l_unique_flags_;
+    sycl::local_accessor<int> l_reduced_block_size_;
+};
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/reduce_config.hpp b/src/backend/oneapi/kernel/reduce_config.hpp
new file mode 100644
index 0000000000..ca892f4cc8
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce_config.hpp
@@ -0,0 +1,27 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+namespace creduce {
+// TODO: are different values more appropriate for reduce on oneapi?
+static const uint THREADS_PER_BLOCK = 256;
+static const uint THREADS_X         = 32;
+static const uint THREADS_Y         = THREADS_PER_BLOCK / THREADS_X;
+static const uint REPEAT            = 32;
+
+}  // namespace creduce
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/reduce_dim.hpp b/src/backend/oneapi/kernel/reduce_dim.hpp
new file mode 100644
index 0000000000..0cc7055f14
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce_dim.hpp
@@ -0,0 +1,229 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+
+#include <algorithm>
+#include <climits>
+#include <complex>
+#include <iostream>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op, uint dim, uint DIMY>
+class reduceDimKernelSMEM {
+   public:
+    reduceDimKernelSMEM(Param<To> out, Param<Ti> in, uint groups_x,
+                        uint groups_y, uint offset_dim, bool change_nan,
+                        To nanval, sycl::local_accessor<compute_t<To>, 1> s_val,
+                        sycl::handler &h)
+        : out_(out.template get_accessor<sycl::access::mode::write>(h))
+        , in_(in.template get_accessor<sycl::access::mode::read>(h))
+        , oInfo_(out.info)
+        , iInfo_(in.info)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , offset_dim_(offset_dim)
+        , change_nan_(change_nan)
+        , nanval_(nanval)
+        , s_val_(s_val) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) + lidx;
+        const uint yid       = groupId_y;
+
+        uint ids[4] = {xid, yid, zid, wid};
+        using sycl::global_ptr;
+
+        data_t<To> *optr = out_.get_pointer() + ids[3] * oInfo_.strides[3] +
+                           ids[2] * oInfo_.strides[2] +
+                           ids[1] * oInfo_.strides[1] + ids[0];
+
+        const uint groupIdx_dim = ids[dim];
+        ids[dim]                = ids[dim] * g.get_local_range(1) + lidy;
+
+        const data_t<Ti> *iptr =
+            in_.get_pointer() + ids[3] * iInfo_.strides[3] +
+            ids[2] * iInfo_.strides[2] + ids[1] * iInfo_.strides[1] + ids[0] +
+            iInfo_.offset;
+
+        const uint id_dim_in   = ids[dim];
+        const uint istride_dim = iInfo_.strides[dim];
+        bool is_valid          = (ids[0] < iInfo_.dims[0]) &&
+                        (ids[1] < iInfo_.dims[1]) &&
+                        (ids[2] < iInfo_.dims[2]) && (ids[3] < iInfo_.dims[3]);
+
+        common::Binary<compute_t<To>, op> reduce;
+        common::Transform<data_t<Ti>, compute_t<To>, op> transform;
+
+        compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+        for (int id = id_dim_in; is_valid && (id < iInfo_.dims[dim]);
+             id += offset_dim_ * g.get_local_range(1)) {
+            compute_t<To> in_val = transform(*iptr);
+            if (change_nan_) {
+                in_val = !IS_NAN(in_val) ? in_val
+                                         : static_cast<compute_t<To>>(nanval_);
+            }
+            out_val = reduce(in_val, out_val);
+            iptr += offset_dim_ * g.get_local_range(1) * istride_dim;
+        }
+
+        s_val_[lid] = out_val;
+
+        it.barrier();
+        compute_t<To> *s_ptr = s_val_.get_pointer() + lid;
+
+        if (DIMY == 8) {
+            if (lidy < 4)
+                *s_ptr = reduce(*s_ptr, s_ptr[creduce::THREADS_X * 4]);
+            it.barrier();
+        }
+
+        if (DIMY >= 4) {
+            if (lidy < 2)
+                *s_ptr = reduce(*s_ptr, s_ptr[creduce::THREADS_X * 2]);
+            it.barrier();
+        }
+
+        if (DIMY >= 2) {
+            if (lidy < 1)
+                *s_ptr = reduce(*s_ptr, s_ptr[creduce::THREADS_X * 1]);
+            it.barrier();
+        }
+
+        if (lidy == 0 && is_valid && (groupIdx_dim < oInfo_.dims[dim])) {
+            *optr = data_t<To>(*s_ptr);
+        }
+    }
+
+   protected:
+    write_accessor<data_t<To>> out_;
+    read_accessor<data_t<Ti>> in_;
+    KParam oInfo_, iInfo_;
+    uint groups_x_, groups_y_, offset_dim_;
+    bool change_nan_;
+    To nanval_;
+    sycl::local_accessor<compute_t<To>, 1> s_val_;
+};
+
+template<typename Ti, typename To, af_op_t op, uint dim>
+void reduce_dim_launcher_default(Param<To> out, Param<Ti> in,
+                                 const uint threads_y,
+                                 const dim_t blocks_dim[4], bool change_nan,
+                                 double nanval) {
+    sycl::range<2> local(creduce::THREADS_X, threads_y);
+    sycl::range<2> global(blocks_dim[0] * blocks_dim[2] * local[0],
+                          blocks_dim[1] * blocks_dim[3] * local[1]);
+
+    getQueue().submit([&](sycl::handler &h) {
+        auto shrdMem = sycl::local_accessor<compute_t<To>, 1>(
+            creduce::THREADS_X * threads_y, h);
+
+        switch (threads_y) {
+            case 8:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceDimKernelSMEM<Ti, To, op, dim, 8>(
+                        out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim],
+                        change_nan, scalar<To>(nanval), shrdMem, h));
+                break;
+            case 4:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceDimKernelSMEM<Ti, To, op, dim, 4>(
+                        out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim],
+                        change_nan, scalar<To>(nanval), shrdMem, h));
+                break;
+            case 2:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceDimKernelSMEM<Ti, To, op, dim, 2>(
+                        out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim],
+                        change_nan, scalar<To>(nanval), shrdMem, h));
+                break;
+            case 1:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceDimKernelSMEM<Ti, To, op, dim, 1>(
+                        out, in, blocks_dim[0], blocks_dim[1], blocks_dim[dim],
+                        change_nan, scalar<To>(nanval), shrdMem, h));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename To, af_op_t op, int dim>
+void reduce_dim_default(Param<To> out, Param<Ti> in, bool change_nan,
+                        double nanval) {
+    uint threads_y = std::min(creduce::THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = creduce::THREADS_X;
+
+    dim_t blocks_dim[] = {divup(in.info.dims[0], threads_x), in.info.dims[1],
+                          in.info.dims[2], in.info.dims[3]};
+    blocks_dim[dim]    = divup(in.info.dims[dim], threads_y * creduce::REPEAT);
+
+    Param<To> tmp = out;
+    bufptr<To> tmp_alloc;
+    if (blocks_dim[dim] > 1) {
+        tmp.info.dims[dim] = blocks_dim[dim];
+        int tmp_elements   = tmp.info.dims[0] * tmp.info.dims[1] *
+                           tmp.info.dims[2] * tmp.info.dims[3];
+
+        tmp_alloc = memAlloc<To>(tmp_elements);
+        tmp.data  = tmp_alloc.get();
+
+        tmp.info.dims[dim] = blocks_dim[dim];
+        for (int k = dim + 1; k < 4; k++)
+            tmp.info.strides[k] *= blocks_dim[dim];
+    }
+
+    reduce_dim_launcher_default<Ti, To, op, dim>(tmp, in, threads_y, blocks_dim,
+                                                 change_nan, nanval);
+
+    if (blocks_dim[dim] > 1) {
+        blocks_dim[dim] = 1;
+
+        if (op == af_notzero_t) {
+            reduce_dim_launcher_default<To, To, af_add_t, dim>(
+                out, tmp, threads_y, blocks_dim, change_nan, nanval);
+        } else {
+            reduce_dim_launcher_default<To, To, op, dim>(
+                out, tmp, threads_y, blocks_dim, change_nan, nanval);
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/reduce_first.hpp b/src/backend/oneapi/kernel/reduce_first.hpp
new file mode 100644
index 0000000000..152120648b
--- /dev/null
+++ b/src/backend/oneapi/kernel/reduce_first.hpp
@@ -0,0 +1,233 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce_config.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <climits>
+#include <complex>
+#include <iostream>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op, uint DIMX>
+class reduceFirstKernelSMEM {
+   public:
+    reduceFirstKernelSMEM(write_accessor<To> out, KParam oInfo,
+                          read_accessor<Ti> in, KParam iInfo, uint groups_x,
+                          uint groups_y, uint repeat, bool change_nan,
+                          To nanval,
+                          sycl::local_accessor<compute_t<To>, 1> s_val)
+        : out_(out)
+        , oInfo_(oInfo)
+        , iInfo_(iInfo)
+        , in_(in)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , repeat_(repeat)
+        , change_nan_(change_nan)
+        , nanval_(nanval)
+        , s_val_(s_val) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid = groupId_x * g.get_local_range(0) * repeat_ + lidx;
+        const uint yid = groupId_y * g.get_local_range(1) + lidy;
+
+        common::Binary<compute_t<To>, op> reduce;
+        common::Transform<Ti, compute_t<To>, op> transform;
+
+        const Ti *iptr = in_.get_pointer() + wid * iInfo_.strides[3] +
+                         zid * iInfo_.strides[2] + yid * iInfo_.strides[1] +
+                         iInfo_.offset;
+
+        auto optr = out_.get_pointer() + wid * oInfo_.strides[3] +
+                    zid * oInfo_.strides[2] + yid * oInfo_.strides[1];
+
+        bool cond = (yid < iInfo_.dims[1]) && (zid < iInfo_.dims[2]) &&
+                    (wid < iInfo_.dims[3]);
+
+        dim_t last = (xid + repeat_ * DIMX);
+        int lim    = sycl::min(last, iInfo_.dims[0]);
+
+        compute_t<To> out_val = common::Binary<compute_t<To>, op>::init();
+        for (int id = xid; cond && id < lim; id += DIMX) {
+            compute_t<To> in_val = transform(iptr[id]);
+            if (change_nan_)
+                in_val = !IS_NAN(in_val) ? in_val
+                                         : static_cast<compute_t<To>>(nanval_);
+            out_val = reduce(in_val, out_val);
+        }
+
+        s_val_[lid] = out_val;
+
+        it.barrier();
+        compute_t<To> *s_ptr = s_val_.get_pointer() + lidy * DIMX;
+
+        if (DIMX == 256) {
+            if (lidx < 128)
+                s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 128]);
+            it.barrier();
+        }
+
+        if (DIMX >= 128) {
+            if (lidx < 64) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 64]);
+            it.barrier();
+        }
+
+        if (DIMX >= 64) {
+            if (lidx < 32) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 32]);
+            it.barrier();
+        }
+
+        // TODO: replace with subgroup operations in optimized kernels
+        if (lidx < 16) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 16]);
+        it.barrier();
+
+        if (lidx < 8) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 8]);
+        it.barrier();
+
+        if (lidx < 4) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 4]);
+        it.barrier();
+
+        if (lidx < 2) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 2]);
+        it.barrier();
+
+        if (lidx < 1) s_ptr[lidx] = reduce(s_ptr[lidx], s_ptr[lidx + 1]);
+        it.barrier();
+
+        if (cond && lidx == 0) optr[groupId_x] = data_t<To>(s_ptr[lidx]);
+    }
+
+   protected:
+    write_accessor<To> out_;
+    KParam oInfo_, iInfo_;
+    read_accessor<Ti> in_;
+    uint groups_x_, groups_y_, repeat_;
+    bool change_nan_;
+    To nanval_;
+    sycl::local_accessor<compute_t<To>, 1> s_val_;
+};
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_first_launcher_default(Param<To> out, Param<Ti> in,
+                                   const uint groups_x, const uint groups_y,
+                                   const uint threads_x, bool change_nan,
+                                   double nanval) {
+    sycl::range<2> local(threads_x, creduce::THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * in.info.dims[2] * local[0],
+                          groups_y * in.info.dims[3] * local[1]);
+
+    uint repeat = divup(in.info.dims[0], (groups_x * threads_x));
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        auto shrdMem = sycl::local_accessor<compute_t<To>, 1>(
+            creduce::THREADS_PER_BLOCK, h);
+
+        switch (threads_x) {
+            case 32:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceFirstKernelSMEM<Ti, To, op, 32>(
+                        out_acc, out.info, in_acc, in.info, groups_x, groups_y,
+                        repeat, change_nan, scalar<To>(nanval), shrdMem));
+                break;
+            case 64:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceFirstKernelSMEM<Ti, To, op, 64>(
+                        out_acc, out.info, in_acc, in.info, groups_x, groups_y,
+                        repeat, change_nan, scalar<To>(nanval), shrdMem));
+                break;
+            case 128:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceFirstKernelSMEM<Ti, To, op, 128>(
+                        out_acc, out.info, in_acc, in.info, groups_x, groups_y,
+                        repeat, change_nan, scalar<To>(nanval), shrdMem));
+                break;
+            case 256:
+                h.parallel_for(
+                    sycl::nd_range<2>(global, local),
+                    reduceFirstKernelSMEM<Ti, To, op, 256>(
+                        out_acc, out.info, in_acc, in.info, groups_x, groups_y,
+                        repeat, change_nan, scalar<To>(nanval), shrdMem));
+                break;
+        }
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename To, af_op_t op>
+void reduce_first_default(Param<To> out, Param<Ti> in, bool change_nan,
+                          double nanval) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, creduce::THREADS_PER_BLOCK);
+    uint threads_y = creduce::THREADS_PER_BLOCK / threads_x;
+
+    uint blocks_x = divup(in.info.dims[0], threads_x * creduce::REPEAT);
+    uint blocks_y = divup(in.info.dims[1], threads_y);
+
+    Param<To> tmp = out;
+    bufptr<To> tmp_alloc;
+    if (blocks_x > 1) {
+        tmp_alloc = memAlloc<To>(blocks_x * in.info.dims[1] * in.info.dims[2] *
+                                 in.info.dims[3]);
+        tmp.data  = tmp_alloc.get();
+
+        tmp.info.dims[0] = blocks_x;
+        for (int k = 1; k < 4; k++) tmp.info.strides[k] *= blocks_x;
+    }
+
+    reduce_first_launcher_default<Ti, To, op>(tmp, in, blocks_x, blocks_y,
+                                              threads_x, change_nan, nanval);
+
+    if (blocks_x > 1) {
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            reduce_first_launcher_default<To, To, af_add_t>(
+                out, tmp, 1, blocks_y, threads_x, change_nan, nanval);
+        } else {
+            reduce_first_launcher_default<To, To, op>(
+                out, tmp, 1, blocks_y, threads_x, change_nan, nanval);
+        }
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/reorder.hpp b/src/backend/oneapi/kernel/reorder.hpp
new file mode 100644
index 0000000000..adf1c8f57b
--- /dev/null
+++ b/src/backend/oneapi/kernel/reorder.hpp
@@ -0,0 +1,127 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class reorderCreateKernel {
+   public:
+    reorderCreateKernel(write_accessor<T> out, read_accessor<T> in,
+                        const KParam op, const KParam ip, const int d0,
+                        const int d1, const int d2, const int d3,
+                        const int blocksPerMatX, const int blocksPerMatY)
+        : out_(out)
+        , in_(in)
+        , op_(op)
+        , ip_(ip)
+        , d0_(d0)
+        , d1_(d1)
+        , d2_(d2)
+        , d3_(d3)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const int oz = g.get_group_id(0) / blocksPerMatX_;
+        const int ow = g.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - oz * blocksPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - ow * blocksPerMatY_;
+
+        const int xx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int yy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        bool valid = (xx < op_.dims[0] && yy < op_.dims[1] &&
+                      oz < op_.dims[2] && ow < op_.dims[3]);
+
+        const int incy = blocksPerMatY_ * g.get_local_range(1);
+        const int incx = blocksPerMatX_ * g.get_local_range(0);
+
+        const int o_off    = ow * op_.strides[3] + oz * op_.strides[2];
+        const int rdims[4] = {d0_, d1_, d2_, d3_};
+        int ids[4]         = {0};
+
+        ids[rdims[3]] = ow;
+        ids[rdims[2]] = oz;
+
+        for (int oy = yy; oy < op_.dims[1]; oy += incy) {
+            ids[rdims[1]] = oy;
+            for (int ox = xx; ox < op_.dims[0]; ox += incx) {
+                ids[rdims[0]] = ox;
+
+                const int oIdx = o_off + oy * op_.strides[1] + ox;
+
+                const int iIdx = ids[3] * ip_.strides[3] +
+                                 ids[2] * ip_.strides[2] +
+                                 ids[1] * ip_.strides[1] + ids[0];
+
+                if (valid) { out_[oIdx] = in_[ip_.offset + iIdx]; }
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    read_accessor<T> in_;
+    const KParam op_;
+    const KParam ip_;
+    const int d0_;
+    const int d1_;
+    const int d2_;
+    const int d3_;
+    const int blocksPerMatX_;
+    const int blocksPerMatY_;
+};
+
+template<typename T>
+void reorder(Param<T> out, const Param<T> in, const dim_t* rdims) {
+    constexpr int TX    = 32;
+    constexpr int TY    = 8;
+    constexpr int TILEX = 512;
+    constexpr int TILEY = 32;
+
+    auto local = sycl::range(TX, TY);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    auto global       = sycl::range(local[0] * blocksPerMatX * out.info.dims[2],
+                                    local[1] * blocksPerMatY * out.info.dims[3]);
+
+    getQueue().submit([&](auto& h) {
+        read_accessor<T> d_in{*in.data, h};
+        write_accessor<T> d_out{*out.data, h};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            reorderCreateKernel<T>(
+                d_out, d_in, out.info, in.info, static_cast<int>(rdims[0]),
+                static_cast<int>(rdims[1]), static_cast<int>(rdims[2]),
+                static_cast<int>(rdims[3]), blocksPerMatX, blocksPerMatY));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/resize.hpp b/src/backend/oneapi/kernel/resize.hpp
new file mode 100644
index 0000000000..50cc041ab5
--- /dev/null
+++ b/src/backend/oneapi/kernel/resize.hpp
@@ -0,0 +1,228 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename AT, typename BT>
+BT mul(AT a, BT b) {
+    return a * b;
+}
+template<typename AT>
+std::complex<double> mul(AT a, std::complex<double> b) {
+    return std::complex<double>(a * b.real(), a * b.imag());
+}
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+////////////////////////////////////////////////////////////////////////////////////
+// nearest-neighbor resampling
+template<typename T>
+void resize_n_(T* d_out, const KParam out, const T* d_in, const KParam in,
+               const int blockIdx_x, const int blockIdx_y, const float xf,
+               const float yf, sycl::nd_item<2>& it) {
+    sycl::group g = it.get_group();
+    int const ox  = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+    int const oy  = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+    // int ix = convert_int_rtp(ox * xf);
+    // int iy = convert_int_rtp(oy * yf);
+    int ix = sycl::round(ox * xf);
+    int iy = sycl::round(oy * yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    d_out[ox + oy * out.strides[1]] = d_in[ix + iy * in.strides[1]];
+}
+
+////////////////////////////////////////////////////////////////////////////////////
+// bilinear resampling
+template<typename T, typename VT>
+void resize_b_(T* d_out, const KParam out, const T* d_in, const KParam in,
+               const int blockIdx_x, const int blockIdx_y, const float xf_,
+               const float yf_, sycl::nd_item<2>& it) {
+    sycl::group g = it.get_group();
+
+    int const ox = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+    int const oy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+    float xf = ox * xf_;
+    float yf = oy * yf_;
+
+    int ix = sycl::floor(xf);
+
+    int iy = sycl::floor(yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    float b = xf - ix;
+    float a = yf - iy;
+
+    const int ix2 = (ix + 1) < in.dims[0] ? (ix + 1) : ix;
+    const int iy2 = (iy + 1) < in.dims[1] ? (iy + 1) : iy;
+
+    const VT p1 = d_in[ix + in.strides[1] * iy];
+    const VT p2 = d_in[ix + in.strides[1] * iy2];
+    const VT p3 = d_in[ix2 + in.strides[1] * iy];
+    const VT p4 = d_in[ix2 + in.strides[1] * iy2];
+
+    d_out[ox + oy * out.strides[1]] =
+        mul(((1.0f - a) * (1.0f - b)), p1) + mul(((a) * (1.0f - b)), p2) +
+        mul(((1.0f - a) * (b)), p3) + mul(((a) * (b)), p4);
+}
+
+////////////////////////////////////////////////////////////////////////////////////
+// lower resampling
+template<typename T>
+void resize_l_(T* d_out, const KParam out, const T* d_in, const KParam in,
+               const int blockIdx_x, const int blockIdx_y, const float xf,
+               const float yf, sycl::nd_item<2>& it) {
+    sycl::group g = it.get_group();
+
+    int const ox = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+    int const oy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+    int ix = (ox * xf);
+    int iy = (oy * yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    d_out[ox + oy * out.strides[1]] = d_in[ix + iy * in.strides[1]];
+}
+
+template<typename T, int method>
+class resizeCreateKernel {
+   public:
+    resizeCreateKernel(write_accessor<T> d_out, const KParam out,
+                       read_accessor<T> d_in, const KParam in, const int b0,
+                       const int b1, const float xf, const float yf)
+        : d_out_(d_out)
+        , out_(out)
+        , d_in_(d_in)
+        , in_(in)
+        , b0_(b0)
+        , b1_(b1)
+        , xf_(xf)
+        , yf_(yf) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int bIdx = g.get_group_id(0) / b0_;
+        int bIdy = g.get_group_id(1) / b1_;
+        // batch adjustment
+        int i_off = bIdy * in_.strides[3] + bIdx * in_.strides[2] + in_.offset;
+        int o_off = bIdy * out_.strides[3] + bIdx * out_.strides[2];
+        int blockIdx_x = g.get_group_id(0) - bIdx * b0_;
+        int blockIdx_y = g.get_group_id(1) - bIdy * b1_;
+
+        switch (method) {
+            case AF_INTERP_NEAREST:
+                resize_n_<T>(d_out_.get_pointer() + o_off, out_,
+                             d_in_.get_pointer() + i_off, in_, blockIdx_x,
+                             blockIdx_y, xf_, yf_, it);
+                break;
+            case AF_INTERP_BILINEAR:
+                resize_b_<T, vtype_t<T>>(d_out_.get_pointer() + o_off, out_,
+                                         d_in_.get_pointer() + i_off, in_,
+                                         blockIdx_x, blockIdx_y, xf_, yf_, it);
+                break;
+            case AF_INTERP_LOWER:
+                resize_l_<T>(d_out_.get_pointer() + o_off, out_,
+                             d_in_.get_pointer() + i_off, in_, blockIdx_x,
+                             blockIdx_y, xf_, yf_, it);
+                break;
+        }
+    }
+
+   private:
+    write_accessor<T> d_out_;
+    const KParam out_;
+    read_accessor<T> d_in_;
+    const KParam in_;
+    const int b0_;
+    const int b1_;
+    const float xf_;
+    const float yf_;
+};
+
+template<typename T>
+void resize(Param<T> out, const Param<T> in, const af_interp_type method) {
+    constexpr int RESIZE_TX = 16;
+    constexpr int RESIZE_TY = 16;
+
+    auto local = sycl::range(RESIZE_TX, RESIZE_TY);
+
+    int blocksPerMatX = divup(out.info.dims[0], local[0]);
+    int blocksPerMatY = divup(out.info.dims[1], local[1]);
+    auto global       = sycl::range(local[0] * blocksPerMatX * in.info.dims[2],
+                                    local[1] * blocksPerMatY * in.info.dims[3]);
+
+    double xd = (double)in.info.dims[0] / (double)out.info.dims[0];
+    double yd = (double)in.info.dims[1] / (double)out.info.dims[1];
+
+    float xf = (float)xd, yf = (float)yd;
+
+    getQueue().submit([&](auto& h) {
+        read_accessor<T> d_in{*in.data, h};
+        write_accessor<T> d_out{*out.data, h};
+        switch (method) {
+            case AF_INTERP_NEAREST:
+                h.parallel_for(sycl::nd_range{global, local},
+                               resizeCreateKernel<T, AF_INTERP_NEAREST>(
+                                   d_out, out.info, d_in, in.info,
+                                   blocksPerMatX, blocksPerMatY, xf, yf));
+                break;
+            case AF_INTERP_BILINEAR:
+                h.parallel_for(sycl::nd_range{global, local},
+                               resizeCreateKernel<T, AF_INTERP_BILINEAR>(
+                                   d_out, out.info, d_in, in.info,
+                                   blocksPerMatX, blocksPerMatY, xf, yf));
+                break;
+            case AF_INTERP_LOWER:
+                h.parallel_for(sycl::nd_range{global, local},
+                               resizeCreateKernel<T, AF_INTERP_LOWER>(
+                                   d_out, out.info, d_in, in.info,
+                                   blocksPerMatX, blocksPerMatY, xf, yf));
+                break;
+            default: break;
+        }
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/rotate.hpp b/src/backend/oneapi/kernel/rotate.hpp
new file mode 100644
index 0000000000..2bb945f9a2
--- /dev/null
+++ b/src/backend/oneapi/kernel/rotate.hpp
@@ -0,0 +1,211 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/interp.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+typedef struct {
+    float tmat[6];
+} tmat_t;
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+template<typename T, typename InterpInTy, typename InterpPosTy,
+         int INTERP_ORDER>
+class rotateCreateKernel {
+   public:
+    rotateCreateKernel(write_accessor<T> d_out, const KParam out,
+                       read_accessor<T> d_in, const KParam in, const tmat_t t,
+                       const int nimages, const int batches,
+                       const int blocksXPerImage, const int blocksYPerImage,
+                       af::interpType method)
+        : d_out_(d_out)
+        , out_(out)
+        , d_in_(d_in)
+        , in_(in)
+        , t_(t)
+        , nimages_(nimages)
+        , batches_(batches)
+        , blocksXPerImage_(blocksXPerImage)
+        , blocksYPerImage_(blocksYPerImage)
+        , method_(method) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        // Compute which image set
+        const int setId      = g.get_group_id(0) / blocksXPerImage_;
+        const int blockIdx_x = g.get_group_id(0) - setId * blocksXPerImage_;
+
+        const int batch      = g.get_group_id(1) / blocksYPerImage_;
+        const int blockIdx_y = g.get_group_id(1) - batch * blocksYPerImage_;
+
+        // Get thread indices
+        const int xido = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int yido = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        const int limages =
+            std::min((int)out_.dims[2] - setId * nimages_, nimages_);
+
+        if (xido >= (unsigned)out_.dims[0] || yido >= (unsigned)out_.dims[1])
+            return;
+
+        InterpPosTy xidi = xido * t_.tmat[0] + yido * t_.tmat[1] + t_.tmat[2];
+        InterpPosTy yidi = xido * t_.tmat[3] + yido * t_.tmat[4] + t_.tmat[5];
+
+        int outoff = out_.offset + setId * nimages_ * out_.strides[2] +
+                     batch * out_.strides[3];
+        int inoff = in_.offset + setId * nimages_ * in_.strides[2] +
+                    batch * in_.strides[3];
+
+        const int loco = outoff + (yido * out_.strides[1] + xido);
+
+        InterpInTy zero = (InterpInTy)0;
+        if constexpr (INTERP_ORDER > 1) {
+            // Special conditions to deal with boundaries for bilinear and
+            // bicubic
+            // FIXME: Ideally this condition should be removed or be present for
+            // all methods, but tests expect different behavior for bilinear
+            // and nearest.
+            if (xidi < (InterpPosTy)-0.0001 || yidi < (InterpPosTy)-0.0001 ||
+                in_.dims[0] <= xidi || in_.dims[1] <= yidi) {
+                for (int i = 0; i < limages; i++) {
+                    d_out_[loco + i * out_.strides[2]] = zero;
+                }
+                return;
+            }
+        }
+
+        // FIXME: Nearest and lower do not do clamping, but other methods do
+        // Make it consistent
+        constexpr bool doclamp = INTERP_ORDER != 1;
+        Interp2<T, InterpPosTy, INTERP_ORDER> interp2;
+        interp2(d_out_, out_, loco, d_in_, in_, inoff, xidi, yidi, 0, 1,
+                method_, limages, doclamp, 2);
+    }
+
+   private:
+    write_accessor<T> d_out_;
+    const KParam out_;
+    read_accessor<T> d_in_;
+    const KParam in_;
+    const tmat_t t_;
+    const int nimages_;
+    const int batches_;
+    const int blocksXPerImage_;
+    const int blocksYPerImage_;
+    af::interpType method_;
+};
+
+template<typename T>
+void rotate(Param<T> out, const Param<T> in, const float theta,
+            af_interp_type method, int order) {
+    using std::string;
+
+    using BT = typename dtype_traits<T>::base_type;
+
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+
+    // Used for batching images
+    constexpr int TI = 4;
+
+    const float c = cos(-theta), s = sin(-theta);
+    float tx, ty;
+    {
+        const float nx = 0.5 * (in.info.dims[0] - 1);
+        const float ny = 0.5 * (in.info.dims[1] - 1);
+        const float mx = 0.5 * (out.info.dims[0] - 1);
+        const float my = 0.5 * (out.info.dims[1] - 1);
+        const float sx = (mx * c + my * -s);
+        const float sy = (mx * s + my * c);
+        tx             = -(sx - nx);
+        ty             = -(sy - ny);
+    }
+
+    // Rounding error: anything past 3 decimal places won't make a difference
+    tmat_t t;
+    t.tmat[0] = round(c * 1000) / 1000.0f;
+    t.tmat[1] = round(-s * 1000) / 1000.0f;
+    t.tmat[2] = round(tx * 1000) / 1000.0f;
+    t.tmat[3] = round(s * 1000) / 1000.0f;
+    t.tmat[4] = round(c * 1000) / 1000.0f;
+    t.tmat[5] = round(ty * 1000) / 1000.0f;
+
+    auto local = sycl::range(TX, TY);
+
+    int nimages               = in.info.dims[2];
+    int nbatches              = in.info.dims[3];
+    int global_x              = local[0] * divup(out.info.dims[0], local[0]);
+    int global_y              = local[1] * divup(out.info.dims[1], local[1]);
+    const int blocksXPerImage = global_x / local[0];
+    const int blocksYPerImage = global_y / local[1];
+
+    if (nimages > TI) {
+        int tile_images = divup(nimages, TI);
+        nimages         = TI;
+        global_x        = global_x * tile_images;
+    }
+    global_y *= nbatches;
+
+    auto global = sycl::range(global_x, global_y);
+
+    getQueue().submit([&](auto &h) {
+        read_accessor<T> d_in{*in.data, h};
+        write_accessor<T> d_out{*out.data, h};
+        switch (order) {
+            case 1:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    rotateCreateKernel<T, T, wtype_t<BT>, 1>(
+                        d_out, out.info, d_in, in.info, t, nimages, nbatches,
+                        blocksXPerImage, blocksYPerImage, method));
+                break;
+            case 2:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    rotateCreateKernel<T, T, wtype_t<BT>, 2>(
+                        d_out, out.info, d_in, in.info, t, nimages, nbatches,
+                        blocksXPerImage, blocksYPerImage, method));
+                break;
+            case 3:
+                h.parallel_for(
+                    sycl::nd_range{global, local},
+                    rotateCreateKernel<T, T, wtype_t<BT>, 3>(
+                        d_out, out.info, d_in, in.info, t, nimages, nbatches,
+                        blocksXPerImage, blocksYPerImage, method));
+                break;
+            default: throw std::string("invalid interpolation order");
+        }
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/scan_dim.hpp b/src/backend/oneapi/kernel/scan_dim.hpp
new file mode 100644
index 0000000000..52450f5c98
--- /dev/null
+++ b/src/backend/oneapi/kernel/scan_dim.hpp
@@ -0,0 +1,341 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op, int dim>
+class scanDimKernel {
+   public:
+    scanDimKernel(write_accessor<To> out_acc, KParam oInfo,
+                  write_accessor<To> tmp_acc, KParam tInfo,
+                  read_accessor<Ti> in_acc, KParam iInfo, const uint groups_x,
+                  const uint groups_y, const uint blocks_dim, const uint lim,
+                  const bool isFinalPass, const uint DIMY,
+                  const bool inclusive_scan, sycl::local_accessor<To, 1> s_val,
+                  sycl::local_accessor<To, 1> s_tmp)
+        : out_acc_(out_acc)
+        , tmp_acc_(tmp_acc)
+        , in_acc_(in_acc)
+        , oInfo_(oInfo)
+        , tInfo_(tInfo)
+        , iInfo_(iInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , blocks_dim_(blocks_dim)
+        , lim_(lim)
+        , DIMY_(DIMY)
+        , isFinalPass_(isFinalPass)
+        , inclusive_scan_(inclusive_scan)
+        , s_val_(s_val)
+        , s_tmp_(s_tmp) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+        const uint lid  = lidy * g.get_local_range(0) + lidx;
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) + lidx;
+        const uint yid       = groupId_y;
+
+        uint ids[4] = {xid, yid, zid, wid};
+
+        const Ti *iptr = in_acc_.get_pointer();
+        To *optr       = out_acc_.get_pointer();
+        To *tptr       = tmp_acc_.get_pointer();
+
+        // There is only one element per block for out
+        // There are blockDim.y elements per block for in
+        // Hence increment ids[dim] just after offsetting out and before
+        // offsetting in
+        tptr += ids[3] * tInfo_.strides[3] + ids[2] * tInfo_.strides[2] +
+                ids[1] * tInfo_.strides[1] + ids[0];
+
+        const int groupIdx_dim = ids[dim];
+        ids[dim]               = ids[dim] * g.get_local_range(1) * lim_ + lidy;
+
+        optr += ids[3] * oInfo_.strides[3] + ids[2] * oInfo_.strides[2] +
+                ids[1] * oInfo_.strides[1] + ids[0];
+        iptr += ids[3] * iInfo_.strides[3] + ids[2] * iInfo_.strides[2] +
+                ids[1] * iInfo_.strides[1] + ids[0] + iInfo_.offset;
+        int id_dim        = ids[dim];
+        const int out_dim = oInfo_.dims[dim];
+
+        bool is_valid = (ids[0] < oInfo_.dims[0]) &&
+                        (ids[1] < oInfo_.dims[1]) &&
+                        (ids[2] < oInfo_.dims[2]) && (ids[3] < oInfo_.dims[3]);
+
+        const int ostride_dim = oInfo_.strides[dim];
+        const int istride_dim = iInfo_.strides[dim];
+
+        To *sptr = s_val_.get_pointer() + lid;
+
+        common::Transform<Ti, To, op> transform;
+        common::Binary<To, op> binop;
+
+        const To init = common::Binary<To, op>::init();
+        To val        = init;
+
+        const bool isLast = (lidy == (DIMY_ - 1));
+
+        for (int k = 0; k < lim_; k++) {
+            if (isLast) s_tmp_[lidx] = val;
+
+            bool cond = (is_valid) && (id_dim < out_dim);
+            val       = cond ? transform(*iptr) : init;
+            *sptr     = val;
+            group_barrier(g);
+
+            int start = 0;
+#pragma unroll
+            for (int off = 1; off < DIMY_; off *= 2) {
+                if (lidy >= off)
+                    val = binop(val, sptr[(start - off) * (int)THREADS_X]);
+                start                   = DIMY_ - start;
+                sptr[start * THREADS_X] = val;
+
+                group_barrier(g);
+            }
+
+            val = binop(val, s_tmp_[lidx]);
+            if (inclusive_scan_) {
+                if (cond) { *optr = val; }
+            } else if (is_valid) {
+                if (id_dim == (out_dim - 1)) {
+                    *(optr - (id_dim * ostride_dim)) = init;
+                } else if (id_dim < (out_dim - 1)) {
+                    *(optr + ostride_dim) = val;
+                }
+            }
+            id_dim += g.get_local_range(1);
+            iptr += g.get_local_range(1) * istride_dim;
+            optr += g.get_local_range(1) * ostride_dim;
+            group_barrier(g);
+        }
+
+        if (!isFinalPass_ && is_valid && (groupIdx_dim < tInfo_.dims[dim]) &&
+            isLast) {
+            *tptr = val;
+        }
+    }
+
+   protected:
+    write_accessor<To> out_acc_;
+    write_accessor<To> tmp_acc_;
+    read_accessor<Ti> in_acc_;
+    KParam oInfo_, tInfo_, iInfo_;
+    const uint groups_x_, groups_y_, blocks_dim_, lim_, DIMY_;
+    const bool isFinalPass_, inclusive_scan_;
+    sycl::local_accessor<To, 1> s_val_;
+    sycl::local_accessor<To, 1> s_tmp_;
+};
+
+template<typename To, af_op_t op, int dim>
+class scanDimBcastKernel {
+   public:
+    scanDimBcastKernel(write_accessor<To> out_acc, KParam oInfo,
+                       read_accessor<To> tmp_acc, KParam tInfo,
+                       const uint groups_x, const uint groups_y,
+                       const uint groups_dim, const uint lim,
+                       const bool inclusive_scan)
+        : out_acc_(out_acc)
+        , tmp_acc_(tmp_acc)
+        , oInfo_(oInfo)
+        , tInfo_(tInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , groups_dim_(groups_dim)
+        , lim_(lim)
+        , inclusive_scan_(inclusive_scan) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) + lidx;
+        const uint yid       = groupId_y;
+
+        uint ids[4] = {xid, yid, zid, wid};
+
+        const To *tptr = tmp_acc_.get_pointer();
+        To *optr       = out_acc_.get_pointer();
+
+        // There is only one element per block for out
+        // There are blockDim.y elements per block for in
+        // Hence increment ids[dim] just after offsetting out and before
+        // offsetting in
+        tptr += ids[3] * tInfo_.strides[3] + ids[2] * tInfo_.strides[2] +
+                ids[1] * tInfo_.strides[1] + ids[0];
+
+        const int groupIdx_dim = ids[dim];
+        ids[dim]               = ids[dim] * g.get_local_range(1) * lim_ + lidy;
+
+        optr += ids[3] * oInfo_.strides[3] + ids[2] * oInfo_.strides[2] +
+                ids[1] * oInfo_.strides[1] + ids[0];
+        const int id_dim  = ids[dim];
+        const int out_dim = oInfo_.dims[dim];
+
+        // Shift broadcast one step to the right for exclusive scan (#2366)
+        int offset = inclusive_scan_ ? 0 : oInfo_.strides[dim];
+        optr += offset;
+
+        bool is_valid = (ids[0] < oInfo_.dims[0]) &&
+                        (ids[1] < oInfo_.dims[1]) &&
+                        (ids[2] < oInfo_.dims[2]) && (ids[3] < oInfo_.dims[3]);
+
+        if (!is_valid) return;
+        if (groupIdx_dim == 0) return;
+
+        To accum = *(tptr - tInfo_.strides[dim]);
+
+        common::Binary<To, op> binop;
+        const int ostride_dim = oInfo_.strides[dim];
+
+        for (int k = 0, id = id_dim; is_valid && k < lim_ && (id < out_dim);
+             k++, id += g.get_local_range(1)) {
+            *optr = binop(*optr, accum);
+            optr += g.get_local_range(1) * ostride_dim;
+        }
+    }
+
+   protected:
+    write_accessor<To> out_acc_;
+    read_accessor<To> tmp_acc_;
+    KParam oInfo_, tInfo_;
+    const uint groups_x_, groups_y_, groups_dim_, lim_;
+    const bool inclusive_scan_;
+};
+
+template<typename Ti, typename To, af_op_t op, int dim>
+static void scan_dim_launcher(Param<To> out, Param<To> tmp, Param<Ti> in,
+                              const uint threads_y, const dim_t blocks_all[4],
+                              bool isFinalPass, bool inclusive_scan) {
+    sycl::range<2> local(THREADS_X, threads_y);
+    sycl::range<2> global(blocks_all[0] * blocks_all[2] * local[0],
+                          blocks_all[1] * blocks_all[3] * local[1]);
+
+    uint lim = divup(out.info.dims[dim], (threads_y * blocks_all[dim]));
+
+    getQueue().submit([&](sycl::handler &h) {
+        // TODO: specify access modes in all kernels
+        write_accessor<To> out_acc{*out.data, h};
+        write_accessor<To> tmp_acc{*tmp.data, h};
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        auto s_val = sycl::local_accessor<compute_t<To>, 1>(
+            THREADS_X * threads_y * 2, h);
+        auto s_tmp = sycl::local_accessor<compute_t<To>, 1>(THREADS_X, h);
+
+        h.parallel_for(
+            sycl::nd_range<2>(global, local),
+            scanDimKernel<Ti, To, op, dim>(
+                out_acc, out.info, tmp_acc, tmp.info, in_acc, in.info,
+                blocks_all[0], blocks_all[1], blocks_all[dim], lim, isFinalPass,
+                threads_y, inclusive_scan, s_val, s_tmp));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename To, af_op_t op, int dim>
+static void bcast_dim_launcher(Param<To> out, Param<To> tmp,
+                               const uint threads_y, const dim_t blocks_all[4],
+                               bool inclusive_scan) {
+    sycl::range<2> local(THREADS_X, threads_y);
+    sycl::range<2> global(blocks_all[0] * blocks_all[2] * local[0],
+                          blocks_all[1] * blocks_all[3] * local[1]);
+
+    uint lim = divup(out.info.dims[dim], (threads_y * blocks_all[dim]));
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        read_accessor<To> tmp_acc{*tmp.data, h};
+
+        h.parallel_for(
+            sycl::nd_range<2>(global, local),
+            scanDimBcastKernel<To, op, dim>(
+                out_acc, out.info, tmp_acc, tmp.info, blocks_all[0],
+                blocks_all[1], blocks_all[dim], lim, inclusive_scan));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename To, af_op_t op, int dim>
+static void scan_dim(Param<To> out, Param<Ti> in, bool inclusive_scan) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(out.info.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    dim_t blocks_all[] = {divup(out.info.dims[0], threads_x), out.info.dims[1],
+                          out.info.dims[2], out.info.dims[3]};
+
+    blocks_all[dim] = divup(out.info.dims[dim], threads_y * REPEAT);
+
+    if (blocks_all[dim] == 1) {
+        scan_dim_launcher<Ti, To, op, dim>(out, out, in, threads_y, blocks_all,
+                                           true, inclusive_scan);
+    } else {
+        Param<To> tmp = out;
+
+        tmp.info.dims[dim]  = blocks_all[dim];
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++)
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
+
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
+        auto tmp_alloc   = memAlloc<To>(tmp_elements);
+        tmp.data         = tmp_alloc.get();
+
+        scan_dim_launcher<Ti, To, op, dim>(out, tmp, in, threads_y, blocks_all,
+                                           false, inclusive_scan);
+
+        int bdim        = blocks_all[dim];
+        blocks_all[dim] = 1;
+
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            scan_dim_launcher<To, To, af_add_t, dim>(tmp, tmp, tmp, threads_y,
+                                                     blocks_all, true, true);
+        } else {
+            scan_dim_launcher<To, To, op, dim>(tmp, tmp, tmp, threads_y,
+                                               blocks_all, true, true);
+        }
+
+        blocks_all[dim] = bdim;
+        bcast_dim_launcher<To, op, dim>(out, tmp, threads_y, blocks_all,
+                                        inclusive_scan);
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/scan_first.hpp b/src/backend/oneapi/kernel/scan_first.hpp
new file mode 100644
index 0000000000..4aa7fc502e
--- /dev/null
+++ b/src/backend/oneapi/kernel/scan_first.hpp
@@ -0,0 +1,292 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op>
+class scanFirstKernel {
+   public:
+    scanFirstKernel(write_accessor<To> out_acc, KParam oInfo,
+                    write_accessor<To> tmp_acc, KParam tInfo,
+                    read_accessor<Ti> in_acc, KParam iInfo, const uint groups_x,
+                    const uint groups_y, const uint lim, const bool isFinalPass,
+                    const uint DIMX, const bool inclusive_scan,
+                    sycl::local_accessor<To, 1> s_val,
+                    sycl::local_accessor<To, 1> s_tmp)
+        : out_acc_(out_acc)
+        , tmp_acc_(tmp_acc)
+        , in_acc_(in_acc)
+        , oInfo_(oInfo)
+        , tInfo_(tInfo)
+        , iInfo_(iInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , lim_(lim)
+        , DIMX_(DIMX)
+        , isFinalPass_(isFinalPass)
+        , inclusive_scan_(inclusive_scan)
+        , s_val_(s_val)
+        , s_tmp_(s_tmp) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) * lim_ + lidx;
+        const uint yid       = groupId_y * g.get_local_range(1) + lidy;
+
+        bool cond_yzw = (yid < oInfo_.dims[1]) && (zid < oInfo_.dims[2]) &&
+                        (wid < oInfo_.dims[3]);
+
+        // if (!cond_yzw) return;  // retire warps early TODO: move
+
+        const Ti *iptr = in_acc_.get_pointer();
+        To *optr       = out_acc_.get_pointer();
+        To *tptr       = tmp_acc_.get_pointer();
+
+        iptr += wid * iInfo_.strides[3] + zid * iInfo_.strides[2] +
+                yid * iInfo_.strides[1] + iInfo_.offset;
+        optr += wid * oInfo_.strides[3] + zid * oInfo_.strides[2] +
+                yid * oInfo_.strides[1];
+        tptr += wid * tInfo_.strides[3] + zid * tInfo_.strides[2] +
+                yid * tInfo_.strides[1];
+
+        To *sptr = s_val_.get_pointer() + lidy * (2 * DIMX_ + 1);
+
+        common::Transform<Ti, To, op> transform;
+        common::Binary<To, op> binop;
+
+        const To init = common::Binary<To, op>::init();
+        int id        = xid;
+        To val        = init;
+
+        const bool isLast = (lidx == (DIMX_ - 1));
+        for (int k = 0; k < lim_; k++) {
+            if (isLast) s_tmp_[lidy] = val;
+
+            bool cond  = (id < oInfo_.dims[0]) && cond_yzw;
+            val        = cond ? transform(iptr[id]) : init;
+            sptr[lidx] = val;
+            group_barrier(g);
+
+            int start = 0;
+            for (int off = 1; off < DIMX_; off *= 2) {
+                if (lidx >= off) val = binop(val, sptr[(start - off) + lidx]);
+                start              = DIMX_ - start;
+                sptr[start + lidx] = val;
+
+                group_barrier(g);
+            }
+
+            val = binop(val, s_tmp_[lidy]);
+
+            if (inclusive_scan_) {
+                if (cond) { optr[id] = val; }
+            } else {
+                if (cond_yzw && id == (oInfo_.dims[0] - 1)) {
+                    optr[0] = init;
+                } else if (cond_yzw && id < (oInfo_.dims[0] - 1)) {
+                    optr[id + 1] = val;
+                }
+            }
+            id += g.get_local_range(0);
+            group_barrier(g);
+        }
+
+        if (!isFinalPass_ && isLast && cond_yzw) { tptr[groupId_x] = val; }
+    }
+
+   protected:
+    write_accessor<To> out_acc_;
+    write_accessor<To> tmp_acc_;
+    read_accessor<Ti> in_acc_;
+    KParam oInfo_, tInfo_, iInfo_;
+    const uint groups_x_, groups_y_, lim_, DIMX_;
+    const bool isFinalPass_, inclusive_scan_;
+    sycl::local_accessor<To, 1> s_val_;
+    sycl::local_accessor<To, 1> s_tmp_;
+};
+
+template<typename To, af_op_t op>
+class scanFirstBcastKernel {
+   public:
+    scanFirstBcastKernel(write_accessor<To> out_acc, KParam oInfo,
+                         read_accessor<To> tmp_acc, KParam tInfo,
+                         const uint groups_x, const uint groups_y,
+                         const uint lim, const bool inclusive_scan)
+        : out_acc_(out_acc)
+        , tmp_acc_(tmp_acc)
+        , oInfo_(oInfo)
+        , tInfo_(tInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , lim_(lim)
+        , inclusive_scan_(inclusive_scan) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) * lim_ + lidx;
+        const uint yid       = groupId_y * g.get_local_range(1) + lidy;
+
+        if (groupId_x == 0) return;
+
+        bool cond = (yid < oInfo_.dims[1]) && (zid < oInfo_.dims[2]) &&
+                    (wid < oInfo_.dims[3]);
+        if (!cond) return;
+
+        To *optr       = out_acc_.get_pointer();
+        const To *tptr = tmp_acc_.get_pointer();
+
+        optr += wid * oInfo_.strides[3] + zid * oInfo_.strides[2] +
+                yid * oInfo_.strides[1];
+        tptr += wid * tInfo_.strides[3] + zid * tInfo_.strides[2] +
+                yid * tInfo_.strides[1];
+
+        common::Binary<To, op> binop;
+        To accum = tptr[groupId_x - 1];
+
+        // Shift broadcast one step to the right for exclusive scan (#2366)
+        int offset = !inclusive_scan_;
+        for (int k = 0, id = xid + offset; k < lim_ && id < oInfo_.dims[0];
+             k++, id += g.get_local_range(0)) {
+            optr[id] = binop(accum, optr[id]);
+        }
+    }
+
+   protected:
+    write_accessor<To> out_acc_;
+    read_accessor<To> tmp_acc_;
+    KParam oInfo_, tInfo_;
+    const uint groups_x_, groups_y_, lim_;
+    const bool inclusive_scan_;
+};
+
+template<typename Ti, typename To, af_op_t op>
+static void scan_first_launcher(Param<To> out, Param<To> tmp, Param<Ti> in,
+                                const uint groups_x, const uint groups_y,
+                                const uint threads_x, bool isFinalPass,
+                                bool inclusive_scan) {
+    sycl::range<2> local(threads_x, THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * out.info.dims[2] * local[0],
+                          groups_y * out.info.dims[3] * local[1]);
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        write_accessor<To> tmp_acc{*tmp.data, h};
+        read_accessor<Ti> in_acc{*in.data, h};
+
+        const int DIMY            = THREADS_PER_BLOCK / threads_x;
+        const int SHARED_MEM_SIZE = (2 * threads_x + 1) * (DIMY);
+        auto s_val = sycl::local_accessor<compute_t<To>, 1>(SHARED_MEM_SIZE, h);
+        auto s_tmp = sycl::local_accessor<compute_t<To>, 1>(DIMY, h);
+
+        // TODO threads_x as template arg for #pragma unroll?
+        h.parallel_for(sycl::nd_range<2>(global, local),
+                       scanFirstKernel<Ti, To, op>(
+                           out_acc, out.info, tmp_acc, tmp.info, in_acc,
+                           in.info, groups_x, groups_y, lim, isFinalPass,
+                           threads_x, inclusive_scan, s_val, s_tmp));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename To, af_op_t op>
+static void bcast_first_launcher(Param<To> out, Param<To> tmp,
+                                 const uint groups_x, const uint groups_y,
+                                 const uint threads_x, bool inclusive_scan) {
+    sycl::range<2> local(threads_x, THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * out.info.dims[2] * local[0],
+                          groups_y * out.info.dims[3] * local[1]);
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<To> out_acc{*out.data, h};
+        read_accessor<To> tmp_acc{*tmp.data, h};
+
+        h.parallel_for(sycl::nd_range<2>(global, local),
+                       scanFirstBcastKernel<To, op>(
+                           out_acc, out.info, tmp_acc, tmp.info, groups_x,
+                           groups_y, lim, inclusive_scan));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename To, af_op_t op>
+static void scan_first(Param<To> out, Param<Ti> in, bool inclusive_scan) {
+    uint threads_x = nextpow2(std::max(32u, (uint)out.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint groups_x = divup(out.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(out.info.dims[1], threads_y);
+
+    if (groups_x == 1) {
+        scan_first_launcher<Ti, To, op>(out, out, in, groups_x, groups_y,
+                                        threads_x, true, inclusive_scan);
+    } else {
+        Param<To> tmp = out;
+
+        tmp.info.dims[0]    = groups_x;
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++)
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
+
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
+        auto tmp_alloc   = memAlloc<To>(tmp_elements);
+        tmp.data         = tmp_alloc.get();
+
+        scan_first_launcher<Ti, To, op>(out, tmp, in, groups_x, groups_y,
+                                        threads_x, false, inclusive_scan);
+
+        // FIXME: Is there an alternative to the if condition?
+        if (op == af_notzero_t) {
+            scan_first_launcher<To, To, af_add_t>(tmp, tmp, tmp, 1, groups_y,
+                                                  threads_x, true, true);
+        } else {
+            scan_first_launcher<To, To, op>(tmp, tmp, tmp, 1, groups_y,
+                                            threads_x, true, true);
+        }
+
+        bcast_first_launcher<To, op>(out, tmp, groups_x, groups_y, threads_x,
+                                     inclusive_scan);
+    }
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/select.hpp b/src/backend/oneapi/kernel/select.hpp
new file mode 100644
index 0000000000..06db45ad79
--- /dev/null
+++ b/src/backend/oneapi/kernel/select.hpp
@@ -0,0 +1,256 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr uint DIMX  = 32;
+constexpr uint DIMY  = 8;
+constexpr int REPEAT = 64;
+
+inline int getOffset(const dim_t *dims, const dim_t *strides,
+                     const dim_t *refdims, int ids[4]) {
+    int off = 0;
+    off += ids[3] * (dims[3] == refdims[3]) * strides[3];
+    off += ids[2] * (dims[2] == refdims[2]) * strides[2];
+    off += ids[1] * (dims[1] == refdims[1]) * strides[1];
+    return off;
+}
+
+template<typename T>
+class selectKernelCreateKernel {
+   public:
+    selectKernelCreateKernel(write_accessor<T> optr, KParam oinfo,
+                             read_accessor<char> cptr_, KParam cinfo,
+                             read_accessor<T> aptr_, KParam ainfo,
+                             read_accessor<T> bptr_, KParam binfo, int groups_0,
+                             int groups_1, const bool is_same)
+        : optr_(optr)
+        , oinfo_(oinfo)
+        , cptr__(cptr_)
+        , cinfo_(cinfo)
+        , aptr__(aptr_)
+        , ainfo_(ainfo)
+        , bptr__(bptr_)
+        , binfo_(binfo)
+        , groups_0_(groups_0)
+        , groups_1_(groups_1)
+        , is_same_(is_same) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const char *cptr = cptr__.get_pointer() + cinfo_.offset;
+        const T *aptr    = aptr__.get_pointer() + ainfo_.offset;
+        const T *bptr    = bptr__.get_pointer() + binfo_.offset;
+
+        const int idz = g.get_group_id(0) / groups_0_;
+        const int idw = g.get_group_id(1) / groups_1_;
+
+        const int group_id_0 = g.get_group_id(0) - idz * groups_0_;
+        const int group_id_1 = g.get_group_id(1) - idw * groups_1_;
+
+        const int idx0 = group_id_0 * g.get_local_range(0) + it.get_local_id(0);
+        const int idy  = group_id_1 * g.get_local_range(1) + it.get_local_id(1);
+
+        const int off = idw * oinfo_.strides[3] + idz * oinfo_.strides[2] +
+                        idy * oinfo_.strides[1];
+
+        const bool valid = (idw < oinfo_.dims[3] && idz < oinfo_.dims[2] &&
+                            idy < oinfo_.dims[1]);
+
+        int ids[] = {idx0, idy, idz, idw};
+
+        T *optr_pointer = optr_.get_pointer();
+        optr_pointer += off;
+        aptr += getOffset(ainfo_.dims, ainfo_.strides, oinfo_.dims, ids);
+        bptr += getOffset(binfo_.dims, binfo_.strides, oinfo_.dims, ids);
+        cptr += getOffset(cinfo_.dims, cinfo_.strides, oinfo_.dims, ids);
+
+        if (is_same_) {
+            for (int idx = idx0; idx < oinfo_.dims[0];
+                 idx += g.get_local_range(0) * groups_0_) {
+                if (valid)
+                    optr_pointer[idx] = (cptr[idx]) ? aptr[idx] : bptr[idx];
+            }
+        } else {
+            bool csame = cinfo_.dims[0] == oinfo_.dims[0];
+            bool asame = ainfo_.dims[0] == oinfo_.dims[0];
+            bool bsame = binfo_.dims[0] == oinfo_.dims[0];
+            for (int idx = idx0; idx < oinfo_.dims[0];
+                 idx += g.get_local_range(0) * groups_0_) {
+                if (valid)
+                    optr_pointer[idx] = (cptr[csame * idx]) ? aptr[asame * idx]
+                                                            : bptr[bsame * idx];
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> optr_;
+    KParam oinfo_;
+    read_accessor<char> cptr__;
+    KParam cinfo_;
+    read_accessor<T> aptr__;
+    KParam ainfo_;
+    read_accessor<T> bptr__;
+    KParam binfo_;
+    int groups_0_;
+    int groups_1_;
+    const bool is_same_;
+};
+
+template<typename T>
+void selectLauncher(Param<T> out, Param<char> cond, Param<T> a, Param<T> b,
+                    const int ndims, const bool is_same) {
+    int threads[] = {DIMX, DIMY};
+
+    if (ndims == 1) {
+        threads[0] *= threads[1];
+        threads[1] = 1;
+    }
+
+    auto local = sycl::range(threads[0], threads[1]);
+
+    int groups_0 = divup(out.info.dims[0], REPEAT * local[0]);
+    int groups_1 = divup(out.info.dims[1], local[1]);
+
+    auto global = sycl::range(groups_0 * out.info.dims[2] * local[0],
+                              groups_1 * out.info.dims[3] * local[1]);
+
+    getQueue().submit([&](auto &h) {
+        write_accessor<T> d_out{*out.data, h};
+        read_accessor<char> d_cond{*cond.data, h};
+        read_accessor<T> d_a{*a.data, h};
+        read_accessor<T> d_b{*b.data, h};
+        h.parallel_for(sycl::nd_range{global, local},
+                       selectKernelCreateKernel<T>(
+                           d_out, out.info, d_cond, cond.info, d_a, a.info, d_b,
+                           b.info, groups_0, groups_1, is_same));
+    });
+}
+
+template<typename T>
+class selectScalarCreateKernel {
+   public:
+    selectScalarCreateKernel(write_accessor<T> optr, KParam oinfo,
+                             read_accessor<char> cptr_, KParam cinfo,
+                             read_accessor<T> aptr_, KParam ainfo, T b,
+                             int groups_0, int groups_1, const bool flip)
+        : optr_(optr)
+        , oinfo_(oinfo)
+        , cptr__(cptr_)
+        , cinfo_(cinfo)
+        , aptr__(aptr_)
+        , ainfo_(ainfo)
+        , b_(b)
+        , groups_0_(groups_0)
+        , groups_1_(groups_1)
+        , flip_(flip) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const char *cptr = cptr__.get_pointer() + cinfo_.offset;
+        const T *aptr    = aptr__.get_pointer() + ainfo_.offset;
+
+        const int idz = g.get_group_id(0) / groups_0_;
+        const int idw = g.get_group_id(1) / groups_1_;
+
+        const int group_id_0 = g.get_group_id(0) - idz * groups_0_;
+        const int group_id_1 = g.get_group_id(1) - idw * groups_1_;
+
+        const int idx0 = group_id_0 * g.get_local_range(0) + it.get_local_id(0);
+        const int idy  = group_id_1 * g.get_local_range(1) + it.get_local_id(1);
+
+        const int off = idw * oinfo_.strides[3] + idz * oinfo_.strides[2] +
+                        idy * oinfo_.strides[1];
+
+        int ids[] = {idx0, idy, idz, idw};
+        T *optr   = optr_.get_pointer();
+        optr += off;
+        aptr += getOffset(ainfo_.dims, ainfo_.strides, oinfo_.dims, ids);
+        cptr += getOffset(cinfo_.dims, cinfo_.strides, oinfo_.dims, ids);
+
+        if (idw >= oinfo_.dims[3] || idz >= oinfo_.dims[2] ||
+            idy >= oinfo_.dims[1]) {
+            return;
+        }
+
+        for (int idx = idx0; idx < oinfo_.dims[0];
+             idx += g.get_local_range(0) * groups_0_) {
+            optr[idx] = (cptr[idx] ^ flip_) ? aptr[idx] : b_;
+        }
+    }
+
+   private:
+    write_accessor<T> optr_;
+    KParam oinfo_;
+    read_accessor<char> cptr__;
+    KParam cinfo_;
+    read_accessor<T> aptr__;
+    KParam ainfo_;
+    T b_;
+    int groups_0_;
+    int groups_1_;
+    const bool flip_;
+};
+
+template<typename T>
+void select(Param<T> out, Param<char> cond, Param<T> a, Param<T> b, int ndims) {
+    bool is_same = true;
+    for (int i = 0; i < 4; i++) {
+        is_same &= (a.info.dims[i] == b.info.dims[i]);
+    }
+    selectLauncher<T>(out, cond, a, b, ndims, is_same);
+}
+
+template<typename T>
+void select_scalar(Param<T> out, Param<char> cond, Param<T> a, const T b,
+                   const int ndims, const bool flip) {
+    int threads[] = {DIMX, DIMY};
+
+    if (ndims == 1) {
+        threads[0] *= threads[1];
+        threads[1] = 1;
+    }
+
+    auto local = sycl::range(threads[0], threads[1]);
+
+    int groups_0 = divup(out.info.dims[0], REPEAT * local[0]);
+    int groups_1 = divup(out.info.dims[1], local[1]);
+
+    auto global = sycl::range(groups_0 * out.info.dims[2] * local[0],
+                              groups_1 * out.info.dims[3] * local[1]);
+
+    getQueue().submit([&](auto &h) {
+        write_accessor<T> d_out{*out.data, h};
+        read_accessor<char> d_cond{*cond.data, h};
+        read_accessor<T> d_a{*a.data, h};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            selectScalarCreateKernel<T>(d_out, out.info, d_cond, cond.info, d_a,
+                                        a.info, b, groups_0, groups_1, flip));
+    });
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/sort.hpp b/src/backend/oneapi/kernel/sort.hpp
new file mode 100644
index 0000000000..71bedd1f50
--- /dev/null
+++ b/src/backend/oneapi/kernel/sort.hpp
@@ -0,0 +1,119 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+// oneDPL headers should be included before standard headers
+#define ONEDPL_USE_PREDEFINED_POLICIES 0
+#include <oneapi/dpl/algorithm>
+#include <oneapi/dpl/execution>
+#include <oneapi/dpl/iterator>
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_oneapi.hpp>
+#include <iota.hpp>
+#include <traits.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+void sort0Iterative(Param<T> val, bool isAscending) {
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+    for (int w = 0; w < val.info.dims[3]; w++) {
+        int valW = w * val.info.strides[3];
+        for (int z = 0; z < val.info.dims[2]; z++) {
+            int valWZ = valW + z * val.info.strides[2];
+            for (int y = 0; y < val.info.dims[1]; y++) {
+                int valOffset = valWZ + y * val.info.strides[1];
+
+                auto buf_begin = ::oneapi::dpl::begin(*val.data) + valOffset;
+                auto buf_end   = buf_begin + val.info.dims[0];
+                if (isAscending) {
+                    std::sort(dpl_policy, buf_begin, buf_end,
+                              [](auto lhs, auto rhs) { return lhs < rhs; });
+                    // std::less<T>()); // mangled name errors in icx for now
+                } else {
+                    std::sort(dpl_policy, buf_begin, buf_end,
+                              [](auto lhs, auto rhs) { return lhs > rhs; });
+                    // std::greater<T>()); // mangled name errors in icx for now
+                }
+            }
+        }
+    }
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void sortBatched(Param<T> pVal, int dim, bool isAscending) {
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pVal.info.dims[i];
+
+    // Sort dimension
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    Array<uint> pKey = iota<uint>(seqDims, tileDims);
+
+    pKey.setDataDims(inDims.elements());
+
+    // Flat
+    pVal.info.dims[0]    = inDims.elements();
+    pVal.info.strides[0] = 1;
+    for (int i = 1; i < 4; i++) {
+        pVal.info.dims[i]    = 1;
+        pVal.info.strides[i] = pVal.info.strides[i - 1] * pVal.info.dims[i - 1];
+    }
+
+    // Sort indices
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    auto key_begin    = ::oneapi::dpl::begin(*pKey.get());
+    auto key_end      = key_begin + pKey.dims()[0];
+    auto val_begin    = ::oneapi::dpl::begin(*pVal.data);
+    auto val_end      = val_begin + pVal.info.dims[0];
+    auto zipped_begin = dpl::make_zip_iterator(key_begin, val_begin);
+    auto zipped_end   = dpl::make_zip_iterator(key_end, val_end);
+
+    // sort values first
+    if (isAscending) {
+        std::sort(dpl_policy, zipped_begin, zipped_end, [](auto lhs, auto rhs) {
+            return std::get<1>(lhs) < std::get<1>(rhs);
+        });
+    } else {
+        std::sort(dpl_policy, zipped_begin, zipped_end, [](auto lhs, auto rhs) {
+            return std::get<1>(lhs) > std::get<1>(rhs);
+        });
+    }
+    // sort by key second; value order within a key must be preserved
+    std::sort(dpl_policy, zipped_begin, zipped_end, [](auto lhs, auto rhs) {
+        return std::get<0>(lhs) < std::get<0>(rhs);
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void sort0(Param<T> val, bool isAscending) {
+    int higherDims = val.info.dims[1] * val.info.dims[2] * val.info.dims[3];
+    // TODO: Make a better heuristic
+    if (higherDims > 10)
+        sortBatched<T>(val, 0, isAscending);
+    else
+        sort0Iterative<T>(val, isAscending);
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/sort_by_key.hpp b/src/backend/oneapi/kernel/sort_by_key.hpp
new file mode 100644
index 0000000000..3a1d7d38a8
--- /dev/null
+++ b/src/backend/oneapi/kernel/sort_by_key.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param<Tk> pKey, Param<Tv> pVal, bool isAscending);
+
+template<typename Tk, typename Tv>
+void sortByKeyBatched(Param<Tk> pKey, Param<Tv> pVal, const int dim,
+                      bool isAscending);
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param<Tk> pKey, Param<Tv> pVal, bool isAscending);
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/sort_by_key/CMakeLists.txt b/src/backend/oneapi/kernel/sort_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..08b1d35f73
--- /dev/null
+++ b/src/backend/oneapi/kernel/sort_by_key/CMakeLists.txt
@@ -0,0 +1,63 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp" FILESTRINGS)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_TYPES")
+        string(REPLACE "// SBK_TYPES:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_TYPES ${TEMP})
+    endif()
+endforeach()
+
+add_library(oneapi_sort_by_key INTERFACE)
+foreach(SBK_TYPE ${SBK_TYPES})
+  add_library(oneapi_sort_by_key_${SBK_TYPE} OBJECT
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp"
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key_impl.hpp"
+    )
+
+  set_source_files_properties("${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp"
+    PROPERTIES
+      LANGUAGE SYCL)
+  set_target_properties(oneapi_sort_by_key_${SBK_TYPE}
+    PROPERTIES
+      COMPILE_DEFINITIONS "TYPE=${SBK_TYPE};AFDLL;$<TARGET_PROPERTY:Boost::boost,INTERFACE_COMPILE_DEFINITIONS>"
+      CXX_STANDARD 17
+      CXX_EXTENSIONS OFF
+      CXX_VISIBILITY_PRESET hidden
+      FOLDER "Generated Targets")
+
+  arrayfire_set_default_cxx_flags(oneapi_sort_by_key_${SBK_TYPE})
+
+  target_include_directories(oneapi_sort_by_key_${SBK_TYPE}
+    PUBLIC
+      .
+      ../../api/c
+      ${ArrayFire_SOURCE_DIR}/include
+      ${ArrayFire_BINARY_DIR}/include
+    PRIVATE
+      ../common
+      ..
+      )
+
+  target_compile_options(oneapi_sort_by_key_${SBK_TYPE}
+    PRIVATE
+      $<$<COMPILE_LANGUAGE:SYCL>: -fno-sycl-id-queries-fit-in-int
+                                  -sycl-std=2020
+                                  ${MSVC_RUNTIME}
+                                  $<$<PLATFORM_ID:Linux>: -fno-sycl-rdc>>)
+
+  target_include_directories(oneapi_sort_by_key_${SBK_TYPE}
+    SYSTEM PRIVATE
+      ${span-lite_SOURCE_DIR}/include
+      $<TARGET_PROPERTY:Boost::boost,INTERFACE_INCLUDE_DIRECTORIES>)
+
+  set_target_properties(oneapi_sort_by_key_${SBK_TYPE} PROPERTIES POSITION_INDEPENDENT_CODE ON)
+  target_sources(oneapi_sort_by_key
+    INTERFACE $<TARGET_OBJECTS:oneapi_sort_by_key_${SBK_TYPE}>)
+endforeach(SBK_TYPE ${SBK_TYPES})
diff --git a/src/backend/oneapi/kernel/sort_by_key/sort_by_key_impl.cpp b/src/backend/oneapi/kernel/sort_by_key/sort_by_key_impl.cpp
new file mode 100644
index 0000000000..0b0a8fb13f
--- /dev/null
+++ b/src/backend/oneapi/kernel/sort_by_key/sort_by_key_impl.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sort_by_key_impl.hpp>
+
+// SBK_TYPES:float double int uint intl uintl short ushort char schar uchar half
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+INSTANTIATE1(TYPE);
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/sort_by_key_impl.hpp b/src/backend/oneapi/kernel/sort_by_key_impl.hpp
new file mode 100644
index 0000000000..2e462db4b6
--- /dev/null
+++ b/src/backend/oneapi/kernel/sort_by_key_impl.hpp
@@ -0,0 +1,224 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#if defined(__clang__)
+#pragma clang diagnostic push
+// temporary ignores for DPL internals
+#pragma clang diagnostic ignored "-Wunused-variable"
+#pragma clang diagnostic ignored "-Wdeprecated-declarations"
+#endif
+
+// oneDPL headers should be included before standard headers
+#define ONEDPL_USE_PREDEFINED_POLICIES 0
+#include <oneapi/dpl/algorithm>
+#include <oneapi/dpl/execution>
+#include <oneapi/dpl/iterator>
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_oneapi.hpp>
+#include <iota.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+using arrayfire::common::half;
+
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param<Tk> pKey, Param<Tv> pVal, bool isAscending) {
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    for (int w = 0; w < pKey.info.dims[3]; w++) {
+        int pKeyW = w * pKey.info.strides[3];
+        int pValW = w * pVal.info.strides[3];
+        for (int z = 0; z < pKey.info.dims[2]; z++) {
+            int pKeyWZ = pKeyW + z * pKey.info.strides[2];
+            int pValWZ = pValW + z * pVal.info.strides[2];
+            for (int y = 0; y < pKey.info.dims[1]; y++) {
+                int pKeyOffset = pKeyWZ + y * pKey.info.strides[1];
+                int pValOffset = pValWZ + y * pVal.info.strides[1];
+
+                auto key_begin =
+                    ::oneapi::dpl::begin(
+                        pKey.data->template reinterpret<compute_t<Tk>>()) +
+                    pKeyOffset;
+                auto key_end   = key_begin + pKey.info.dims[0];
+                auto val_begin = ::oneapi::dpl::begin(*pVal.data) + pValOffset;
+                auto val_end   = val_begin + pVal.info.dims[0];
+
+                auto zipped_begin =
+                    ::oneapi::dpl::make_zip_iterator(key_begin, val_begin);
+                auto zipped_end =
+                    ::oneapi::dpl::make_zip_iterator(key_end, val_end);
+
+                // sort by key
+                if (isAscending) {
+                    std::sort(dpl_policy, zipped_begin, zipped_end,
+                              [](auto lhs, auto rhs) {
+                                  return std::get<0>(lhs) < std::get<0>(rhs);
+                              });
+                } else {
+                    std::sort(dpl_policy, zipped_begin, zipped_end,
+                              [](auto lhs, auto rhs) {
+                                  return std::get<0>(lhs) > std::get<0>(rhs);
+                              });
+                }
+            }
+        }
+    }
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename Tv>
+void sortByKeyBatched(Param<Tk> pKey, Param<Tv> pVal, const int dim,
+                      bool isAscending) {
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pKey.info.dims[i];
+
+    const dim_t elements = inDims.elements();
+
+    // Sort dimension
+    // tileDims * seqDims = inDims
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    Array<uint> Seq = iota<uint>(seqDims, tileDims);
+
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    // set up iterators for seq, key, val, and new cKey
+    auto seq_begin = ::oneapi::dpl::begin(*Seq.get());
+    auto seq_end   = seq_begin + elements;
+    auto key_begin =
+        ::oneapi::dpl::begin(pKey.data->template reinterpret<compute_t<Tk>>());
+    auto key_end = key_begin + elements;
+
+    auto val_begin = ::oneapi::dpl::begin(*pVal.data);
+    auto val_end   = val_begin + elements;
+
+    auto cKey = memAlloc<Tk>(elements);
+    auto cKey_get = cKey.get();
+    getQueue().submit([&](sycl::handler &h) {
+        h.copy(pKey.data->template reinterpret<compute_t<Tk>>().get_access(
+                   h, elements),
+               cKey_get->template reinterpret<compute_t<Tk>>().get_access(
+                   h, elements));
+    });
+    auto ckey_begin =
+        ::oneapi::dpl::begin(cKey.get()->template reinterpret<compute_t<Tk>>());
+    auto ckey_end = ckey_begin + elements;
+
+    {
+        auto zipped_begin_KV  = dpl::make_zip_iterator(key_begin, val_begin);
+        auto zipped_end_KV    = dpl::make_zip_iterator(key_end, val_end);
+        auto zipped_begin_cKS = dpl::make_zip_iterator(ckey_begin, seq_begin);
+        auto zipped_end_cKS   = dpl::make_zip_iterator(ckey_end, seq_end);
+        if (isAscending) {
+            std::sort(dpl_policy, zipped_begin_KV, zipped_end_KV,
+                      [](auto lhs, auto rhs) {
+                          return std::get<0>(lhs) < std::get<0>(rhs);
+                      });
+            std::sort(dpl_policy, zipped_begin_cKS, zipped_end_cKS,
+                      [](auto lhs, auto rhs) {
+                          return std::get<0>(lhs) < std::get<0>(rhs);
+                      });
+        } else {
+            std::sort(dpl_policy, zipped_begin_KV, zipped_end_KV,
+                      [](auto lhs, auto rhs) {
+                          return std::get<0>(lhs) > std::get<0>(rhs);
+                      });
+            std::sort(dpl_policy, zipped_begin_cKS, zipped_end_cKS,
+                      [](auto lhs, auto rhs) {
+                          return std::get<0>(lhs) > std::get<0>(rhs);
+                      });
+        }
+    }
+
+    auto Seq_get = Seq.get();
+    auto cSeq = memAlloc<uint>(elements);
+    auto cSeq_get = cSeq.get();
+    getQueue().submit([&](sycl::handler &h) {
+        h.copy(Seq_get->get_access(h, elements),
+               cSeq_get->get_access(h, elements));
+    });
+    auto cseq_begin = ::oneapi::dpl::begin(*cSeq.get());
+    auto cseq_end   = cseq_begin + elements;
+
+    {
+        auto zipped_begin_SV  = dpl::make_zip_iterator(seq_begin, val_begin);
+        auto zipped_end_SV    = dpl::make_zip_iterator(seq_end, val_end);
+        auto zipped_begin_cSK = dpl::make_zip_iterator(cseq_begin, key_begin);
+        auto zipped_end_cSK   = dpl::make_zip_iterator(cseq_end, key_end);
+        std::sort(dpl_policy, zipped_begin_SV, zipped_end_SV,
+                  [](auto lhs, auto rhs) {
+                      return std::get<0>(lhs) < std::get<0>(rhs);
+                  });
+        std::sort(dpl_policy, zipped_begin_cSK, zipped_end_cSK,
+                  [](auto lhs, auto rhs) {
+                      return std::get<0>(lhs) < std::get<0>(rhs);
+                  });
+    }
+}
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param<Tk> pKey, Param<Tv> pVal, bool isAscending) {
+    int higherDims = pKey.info.dims[1] * pKey.info.dims[2] * pKey.info.dims[3];
+    // Batched sort performs four sort-by-key passes over the whole array,
+    // which is only worthwhile before the GPU is saturated.
+    // The GPU is saturated at around 1,000,000 integers.
+    // Call batched sort only if both conditions are met.
+    if (higherDims > 4 && pKey.info.dims[0] < 1000000) {
+        kernel::sortByKeyBatched<Tk, Tv>(pKey, pVal, 0, isAscending);
+    } else {
+        kernel::sort0ByKeyIterative<Tk, Tv>(pKey, pVal, isAscending);
+    }
+}
+
+#define INSTANTIATE(Tk, Tv)                                                   \
+    template void sort0ByKey<Tk, Tv>(Param<Tk> okey, Param<Tv> oval,          \
+                                     bool isAscending);                       \
+    template void sort0ByKeyIterative<Tk, Tv>(Param<Tk> okey, Param<Tv> oval, \
+                                              bool isAscending);              \
+    template void sortByKeyBatched<Tk, Tv>(Param<Tk> okey, Param<Tv> oval,    \
+                                           const int dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#endif
diff --git a/src/backend/oneapi/kernel/sparse.hpp b/src/backend/oneapi/kernel/sparse.hpp
new file mode 100644
index 0000000000..24458ed77d
--- /dev/null
+++ b/src/backend/oneapi/kernel/sparse.hpp
@@ -0,0 +1,472 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class coo2DenseCreateKernel {
+   public:
+    coo2DenseCreateKernel(write_accessor<T> oPtr, const KParam output,
+                          read_accessor<T> vPtr, const KParam values,
+                          read_accessor<int> rPtr, const KParam rowIdx,
+                          read_accessor<int> cPtr, const KParam colIdx)
+        : oPtr_(oPtr)
+        , output_(output)
+        , vPtr_(vPtr)
+        , values_(values)
+        , rPtr_(rPtr)
+        , rowIdx_(rowIdx)
+        , cPtr_(cPtr)
+        , colIdx_(colIdx) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const int dimSize = g.get_local_range(0);
+
+        for (int i = it.get_local_id(0); i < REPEAT * dimSize; i += dimSize) {
+            const int id =
+                g.get_group_id(0) * g.get_local_range(0) * REPEAT + i;
+            if (id >= values_.dims[0]) return;
+
+            T v   = vPtr_[id + values_.offset];
+            int r = rPtr_[id + rowIdx_.offset];
+            int c = cPtr_[id + colIdx_.offset];
+
+            int offset = r + c * output_.strides[1];
+
+            oPtr_[offset] = v;
+        }
+    }
+
+   private:
+    write_accessor<T> oPtr_;
+    const KParam output_;
+    read_accessor<T> vPtr_;
+    const KParam values_;
+    read_accessor<int> rPtr_;
+    const KParam rowIdx_;
+    read_accessor<int> cPtr_;
+    const KParam colIdx_;
+};
+
+template<typename T>
+void coo2dense(Param<T> out, const Param<T> values, const Param<int> rowIdx,
+               const Param<int> colIdx) {
+    auto local  = sycl::range(THREADS_PER_BLOCK, 1);
+    auto global = sycl::range(
+        divup(values.info.dims[0], local[0] * REPEAT) * THREADS_PER_BLOCK, 1);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_out{*out.data, h, sycl::write_only, sycl::no_init};
+        // values is only read by the kernel, so request read-only access
+        sycl::accessor d_values{*values.data, h, sycl::read_only};
+        h.parallel_for(sycl::nd_range{global, local},
+                       coo2DenseCreateKernel<T>(
+                           d_out, out.info, d_values, values.info, d_rowIdx,
+                           rowIdx.info, d_colIdx, colIdx.info));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, int THREADS>
+class csr2DenseCreateKernel {
+   public:
+    csr2DenseCreateKernel(write_accessor<T> output, read_accessor<T> values,
+                          read_accessor<int> rowidx, read_accessor<int> colidx,
+                          const int M, const int v_off, const int r_off, const int c_off)
+        : output_(output)
+        , values_(values)
+        , rowidx_(rowidx)
+        , colidx_(colidx)
+        , M_(M)
+        , v_off_(v_off)
+        , r_off_(r_off)
+        , c_off_(c_off) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int lid = it.get_local_id(0);
+        for (int rowId = g.get_group_id(0); rowId < M_;
+             rowId += it.get_group_range(0)) {
+            int colStart = rowidx_[rowId + r_off_];
+            int colEnd   = rowidx_[rowId + r_off_ + 1];
+            for (int colId = colStart + lid; colId < colEnd; colId += THREADS) {
+                output_[rowId + colidx_[colId + c_off_] * M_] = values_[colId + v_off_];
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> output_;
+    read_accessor<T> values_;
+    read_accessor<int> rowidx_;
+    read_accessor<int> colidx_;
+    const int M_;
+    const int v_off_;
+    const int r_off_;
+    const int c_off_;
+};
+
+template<typename T>
+void csr2dense(Param<T> output, const Param<T> values, const Param<int> rowIdx,
+               const Param<int> colIdx) {
+    constexpr int MAX_GROUPS = 4096;
+    // FIXME: This needs to be based on nonzeros per row
+    constexpr int threads = 64;
+
+    const int M = rowIdx.info.dims[0] - 1;
+
+    auto local   = sycl::range(threads, 1);
+    int groups_x = std::min((int)(divup(M, local[0])), MAX_GROUPS);
+    auto global  = sycl::range(local[0] * groups_x, 1);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_values{*values.data, h, sycl::read_only};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_output{*output.data, h, sycl::write_only,
+                                sycl::no_init};
+        h.parallel_for(sycl::nd_range{global, local},
+                       csr2DenseCreateKernel<T, threads>(
+                           d_output, d_values, d_rowIdx, d_colIdx, M,
+                           static_cast<int>(values.info.offset),
+                           static_cast<int>(rowIdx.info.offset),
+                           static_cast<int>(colIdx.info.offset)));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class dense2csrCreateKernel {
+   public:
+    dense2csrCreateKernel(write_accessor<T> svalptr,
+                          write_accessor<int> scolptr, read_accessor<T> dvalptr,
+                          const KParam valinfo, read_accessor<int> dcolptr,
+                          const KParam colinfo, read_accessor<int> rowptr)
+        : svalptr_(svalptr)
+        , scolptr_(scolptr)
+        , dvalptr_(dvalptr)
+        , valinfo_(valinfo)
+        , dcolptr_(dcolptr)
+        , colinfo_(colinfo)
+        , rowptr_(rowptr) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        // sycl::group g = it.get_group();
+
+        int gidx = it.get_global_id(0);
+        int gidy = it.get_global_id(1);
+
+        if (gidx >= (unsigned)valinfo_.dims[0]) return;
+        if (gidy >= (unsigned)valinfo_.dims[1]) return;
+
+        int rowoff       = rowptr_[gidx];
+        auto svalptr_ptr = svalptr_.get_pointer();
+        auto scolptr_ptr = scolptr_.get_pointer();
+
+        auto dvalptr_ptr = dvalptr_.get_pointer();
+        auto dcolptr_ptr = dcolptr_.get_pointer();
+
+        T val = dvalptr_ptr[gidx + gidy * (unsigned)valinfo_.strides[1] + valinfo_.offset];
+
+        if constexpr (std::is_same_v<decltype(val), std::complex<float>> ||
+                      std::is_same_v<decltype(val), std::complex<double>>) {
+            if (val.real() == 0 && val.imag() == 0) return;
+        } else {
+            if (val == 0) return;
+        }
+
+        int oloc = dcolptr_ptr[gidx + gidy * colinfo_.strides[1] + colinfo_.offset];
+        svalptr_ptr[oloc + rowoff - 1] = val;
+        scolptr_ptr[oloc + rowoff - 1] = gidy;
+    }
+
+   private:
+    write_accessor<T> svalptr_;
+    write_accessor<int> scolptr_;
+    read_accessor<T> dvalptr_;
+    const KParam valinfo_;
+    read_accessor<int> dcolptr_;
+    const KParam colinfo_;
+    read_accessor<int> rowptr_;
+};
+
+template<typename T>
+void dense2csr(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+               const Param<T> dense) {
+    int num_rows = dense.info.dims[0];
+    int num_cols = dense.info.dims[1];
+
+    // sd1 contains output of scan along dim 1 of dense
+    Array<int> sd1 = createEmptyArray<int>(dim4(num_rows, num_cols));
+    // rd1 contains output of nonzero count along dim 1 of dense
+    Array<int> rd1 = createEmptyArray<int>(num_rows);
+
+    scan_dim<T, int, af_notzero_t, 1>(sd1, dense, true);
+    reduce_dim_default<T, int, af_notzero_t, 1>(rd1, dense, 0, 0);
+    scan_first<int, int, af_add_t>(rowIdx, rd1, false);
+
+    const int nnz = values.info.dims[0];
+
+    const sycl::id<1> fillOffset(rowIdx.info.offset +
+                                 (rowIdx.info.dims[0] - 1));
+    const sycl::range<1> fillRange(rowIdx.info.dims[0] - fillOffset[0]);
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_rowIdx{*rowIdx.data, h, fillRange, fillOffset};
+        h.fill(d_rowIdx, nnz);
+    });
+
+    auto local   = sycl::range(THREADS_X, THREADS_Y);
+    int groups_x = divup(dense.info.dims[0], local[0]);
+    int groups_y = divup(dense.info.dims[1], local[1]);
+    auto global  = sycl::range(groups_x * local[0], groups_y * local[1]);
+
+    const Param<int> sdParam = sd1;
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_dense{*dense.data, h, sycl::read_only};
+        sycl::accessor d_sdParam{*sdParam.data, h, sycl::read_only};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_values{*values.data, h, sycl::write_only,
+                                sycl::no_init};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::write_only,
+                                sycl::no_init};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            dense2csrCreateKernel<T>(d_values, d_colIdx, d_dense, dense.info,
+                                     d_sdParam, sdParam.info, d_rowIdx));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class swapIndexCreateKernel {
+   public:
+    swapIndexCreateKernel(write_accessor<T> ovalues, write_accessor<int> oindex,
+                          read_accessor<T> ivalues, read_accessor<int> iindex,
+                          read_accessor<int> swapIdx, const int nNZ)
+        : ovalues_(ovalues)
+        , oindex_(oindex)
+        , ivalues_(ivalues)
+        , iindex_(iindex)
+        , swapIdx_(swapIdx)
+        , nNZ_(nNZ) {}
+
+    void operator()(sycl::item<1> it) const {
+        int id = it.get_id(0);
+        if (id < nNZ_) {
+            int idx = swapIdx_[id];
+
+            ovalues_[id] = ivalues_[idx];
+            oindex_[id]  = iindex_[idx];
+        }
+    }
+
+   private:
+    write_accessor<T> ovalues_;
+    write_accessor<int> oindex_;
+    read_accessor<T> ivalues_;
+    read_accessor<int> iindex_;
+    read_accessor<int> swapIdx_;
+    const int nNZ_;
+};
+
+template<typename T>
+void swapIndex(Param<T> ovalues, Param<int> oindex, const Param<T> ivalues,
+               sycl::buffer<int> iindex, const Param<int> swapIdx) {
+    auto global = sycl::range(ovalues.info.dims[0]);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_ivalues{*ivalues.data, h, sycl::read_only};
+        sycl::accessor d_iindex{iindex, h, sycl::read_only};
+        sycl::accessor d_swapIdx{*swapIdx.data, h, sycl::read_only};
+        sycl::accessor d_ovalues{*ovalues.data, h, sycl::write_only,
+                                 sycl::no_init};
+        sycl::accessor d_oindex{*oindex.data, h, sycl::write_only,
+                                sycl::no_init};
+
+        h.parallel_for(global,
+                       swapIndexCreateKernel<T>(
+                           d_ovalues, d_oindex, d_ivalues, d_iindex, d_swapIdx,
+                           static_cast<int>(ovalues.info.dims[0])));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class csr2CooCreateKernel {
+   public:
+    csr2CooCreateKernel(write_accessor<int> orowidx,
+                        write_accessor<int> ocolidx, read_accessor<int> irowidx,
+                        read_accessor<int> icolidx, const int M)
+        : orowidx_(orowidx)
+        , ocolidx_(ocolidx)
+        , irowidx_(irowidx)
+        , icolidx_(icolidx)
+        , M_(M) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int lid = it.get_local_id(0);
+        for (int rowId = g.get_group_id(0); rowId < M_;
+             rowId += it.get_group_range(0)) {
+            int colStart = irowidx_[rowId];
+            int colEnd   = irowidx_[rowId + 1];
+            for (int colId = colStart + lid; colId < colEnd;
+                 colId += g.get_local_range(0)) {
+                orowidx_[colId] = rowId;
+                ocolidx_[colId] = icolidx_[colId];
+            }
+        }
+    }
+
+   private:
+    write_accessor<int> orowidx_;
+    write_accessor<int> ocolidx_;
+    read_accessor<int> irowidx_;
+    read_accessor<int> icolidx_;
+    const int M_;
+};
+
+template<typename T>
+void csr2coo(Param<T> ovalues, Param<int> orowIdx, Param<int> ocolIdx,
+             const Param<T> ivalues, const Param<int> irowIdx,
+             const Param<int> icolIdx, Param<int> index) {
+    const int MAX_GROUPS = 4096;
+    int M                = irowIdx.info.dims[0] - 1;
+    // FIXME: This needs to be based on nonzeros per row
+    int threads = 64;
+
+    auto scratch = memAlloc<int>(orowIdx.info.dims[0]);
+
+    auto local   = sycl::range(threads, 1);
+    int groups_x = std::min((int)(divup(M, local[0])), MAX_GROUPS);
+    auto global  = sycl::range(local[0] * groups_x, 1);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_irowIdx{*irowIdx.data, h, sycl::read_only};
+        sycl::accessor d_icolIdx{*icolIdx.data, h, sycl::read_only};
+        sycl::accessor d_scratch{*scratch, h, sycl::write_only, sycl::no_init};
+        sycl::accessor d_ocolIdx{*ocolIdx.data, h, sycl::write_only,
+                                 sycl::no_init};
+        h.parallel_for(sycl::nd_range{global, local},
+                       csr2CooCreateKernel<T>(d_scratch, d_ocolIdx, d_irowIdx,
+                                              d_icolIdx, M));
+    });
+
+    // Now we need to sort this into column major
+    kernel::sort0ByKeyIterative<int, int>(ocolIdx, index, true);
+
+    // Now use index to sort values and rows
+    kernel::swapIndex<T>(ovalues, orowIdx, ivalues, *scratch, index);
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+class csrReduceKernel {
+   public:
+    csrReduceKernel(write_accessor<int> orowidx, read_accessor<int> irowidx,
+                    const int M, const int nNZ)
+        : orowidx_(orowidx), irowidx_(irowidx), M_(M), nNZ_(nNZ) {}
+
+    void operator()(sycl::item<1> it) const {
+        int id = it.get_id(0);
+
+        if (id < nNZ_) {
+            // Read COO row indices
+            int iRId  = irowidx_[id];
+            int iRId1 = 0;
+            if (id > 0) iRId1 = irowidx_[id - 1];
+
+            // If id is 0, then mark the edge cases of csrRow[0] and csrRow[M]
+            if (id == 0) {
+                orowidx_[id] = 0;
+                orowidx_[M_] = nNZ_;
+            } else if (iRId1 != iRId) {
+                // If iRId1 and iRId differ, a new row starts at index id. For
+                // example, if iRId is 5 and iRId1 is 4, row 4 has ended and
+                // row 5 begins at index id. The for-loop is needed because
+                // there can be any number of empty rows between iRId1 and
+                // iRId, and all of them should be marked with id.
+                for (int i = iRId1 + 1; i <= iRId; i++) orowidx_[i] = id;
+            }
+
+            // The last X rows are corner cases if they don't have any values
+            if (id < M_) {
+                if (id > irowidx_[nNZ_ - 1] && orowidx_[id] == 0) {
+                    orowidx_[id] = nNZ_;
+                }
+            }
+        }
+    }
+
+   private:
+    write_accessor<int> orowidx_;
+    read_accessor<int> irowidx_;
+    const int M_;
+    const int nNZ_;
+};
+
+template<typename T>
+void coo2csr(Param<T> ovalues, Param<int> orowIdx, Param<int> ocolIdx,
+             const Param<T> ivalues, const Param<int> irowIdx,
+             const Param<int> icolIdx, Param<int> index, Param<int> rowCopy,
+             const int M) {
+    // Sort the row indices (keys) so the entries end up in row-major order
+    kernel::sort0ByKeyIterative<int, int>(rowCopy, index, true);
+
+    // Now use index to reorder values and column indices
+    kernel::swapIndex<T>(ovalues, ocolIdx, ivalues, *icolIdx.data, index);
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+
+    auto global = sycl::range(irowIdx.info.dims[0]);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_orowIdx{*orowIdx.data, h, sycl::write_only};
+        sycl::accessor d_rowCopy{*rowCopy.data, h, sycl::read_only};
+        h.parallel_for(
+            sycl::range{global},
+            csrReduceKernel<T>(d_orowIdx, d_rowCopy, M,
+                               static_cast<int>(ovalues.info.dims[0])));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
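Reviewer note on `coo2csr`: the row-offset step performed by `csrReduceKernel` can be cross-checked against a small host-side reference. The sketch below is not part of the patch and does not use the kernel's neighbour-comparison scheme. It swaps in a plain count-then-prefix-sum, which should yield the same `M + 1` offsets for row indices that are already sorted. The function name `cooRowsToCsrOffsets` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not in the patch): build the M + 1 CSR
// row offsets from COO row indices that are already sorted ascending.
inline std::vector<int> cooRowsToCsrOffsets(const std::vector<int> &sortedRows,
                                            int M) {
    std::vector<int> offsets(static_cast<std::size_t>(M) + 1, 0);

    // Count the nonzeros of each row into the slot one past the row.
    for (std::size_t i = 0; i < sortedRows.size(); ++i) {
        const int row = sortedRows[i];
        assert(row >= 0 && row < M);                 // index in range
        assert(i == 0 || sortedRows[i - 1] <= row);  // input is sorted
        ++offsets[static_cast<std::size_t>(row) + 1];
    }

    // An inclusive prefix sum turns per-row counts into start offsets.
    // Empty rows simply repeat the previous offset.
    for (int r = 0; r < M; ++r) { offsets[r + 1] += offsets[r]; }
    return offsets;
}
```

Comparing this against the device result on inputs with empty leading, middle, and trailing rows would exercise the corner cases the kernel comments mention.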
diff --git a/src/backend/oneapi/kernel/sparse_arith.hpp b/src/backend/oneapi/kernel/sparse_arith.hpp
new file mode 100644
index 0000000000..b46baa69df
--- /dev/null
+++ b/src/backend/oneapi/kernel/sparse_arith.hpp
@@ -0,0 +1,570 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr unsigned TX      = 32;
+constexpr unsigned TY      = 8;
+constexpr unsigned THREADS = TX * TY;
+
+template<typename T>
+using global_atomic_ref =
+    sycl::atomic_ref<T, sycl::memory_order::relaxed, sycl::memory_scope::system,
+                     sycl::access::address_space::global_space>;
+
+template<typename T, af_op_t op>
+class sparseArithCSRKernel {
+   public:
+    sparseArithCSRKernel(write_accessor<T> oPtr, const KParam out,
+                         read_accessor<T> values, read_accessor<int> rowIdx,
+                         read_accessor<int> colIdx, const int nNZ,
+                         read_accessor<T> rPtr, const KParam rhs,
+                         const int reverse)
+        : oPtr_(oPtr)
+        , out_(out)
+        , values_(values)
+        , rowIdx_(rowIdx)
+        , colIdx_(colIdx)
+        , nNZ_(nNZ)
+        , rPtr_(rPtr)
+        , rhs_(rhs)
+        , reverse_(reverse) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        common::Binary<T, op> binOP;
+
+        const int row =
+            g.get_group_id(0) * g.get_local_range(1) + it.get_local_id(1);
+
+        if (row < out_.dims[0]) {
+            const int rowStartIdx = rowIdx_[row];
+            const int rowEndIdx   = rowIdx_[row + 1];
+
+            // Repeat loop until all values in the row are computed
+            for (int idx = rowStartIdx + it.get_local_id(0); idx < rowEndIdx;
+                 idx += g.get_local_range(0)) {
+                const int col = colIdx_[idx];
+
+                if (row >= out_.dims[0] || col >= out_.dims[1])
+                    continue;  // Bad indices
+
+                // Get Values
+                const T val  = values_[idx];
+                const T rval = rPtr_[col * rhs_.strides[1] + row];
+
+                const int offset = col * out_.strides[1] + row;
+                if (reverse_)
+                    oPtr_[offset] = binOP(rval, val);
+                else
+                    oPtr_[offset] = binOP(val, rval);
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> oPtr_;
+    const KParam out_;
+    read_accessor<T> values_;
+    read_accessor<int> rowIdx_;
+    read_accessor<int> colIdx_;
+    const int nNZ_;
+    read_accessor<T> rPtr_;
+    const KParam rhs_;
+    const int reverse_;
+};
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param<T> out, const Param<T> values,
+                      const Param<int> rowIdx, const Param<int> colIdx,
+                      const Param<T> rhs, const bool reverse) {
+    auto local  = sycl::range(TX, TY);
+    auto global = sycl::range(divup(out.info.dims[0], TY) * TX, TY);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_out{*out.data, h, sycl::write_only};
+        sycl::accessor d_values{*values.data, h, sycl::read_only};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_rhs{*rhs.data, h, sycl::read_only};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       sparseArithCSRKernel<T, op>(
+                           d_out, out.info, d_values, d_rowIdx, d_colIdx,
+                           static_cast<int>(values.info.dims[0]), d_rhs,
+                           rhs.info, static_cast<int>(reverse)));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+class sparseArithCOOKernel {
+   public:
+    sparseArithCOOKernel(write_accessor<T> oPtr, const KParam out,
+                         read_accessor<T> values, read_accessor<int> rowIdx,
+                         read_accessor<int> colIdx, const int nNZ,
+                         read_accessor<T> rPtr, const KParam rhs,
+                         const int reverse)
+        : oPtr_(oPtr)
+        , out_(out)
+        , values_(values)
+        , rowIdx_(rowIdx)
+        , colIdx_(colIdx)
+        , nNZ_(nNZ)
+        , rPtr_(rPtr)
+        , rhs_(rhs)
+        , reverse_(reverse) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        common::Binary<T, op> binOP;
+
+        const int idx = it.get_global_id(0);
+
+        if (idx < nNZ_) {
+            const int row = rowIdx_[idx];
+            const int col = colIdx_[idx];
+
+            if (row >= out_.dims[0] || col >= out_.dims[1])
+                return;  // Bad indices
+
+            // Get Values
+            const T val  = values_[idx];
+            const T rval = rPtr_[col * rhs_.strides[1] + row];
+
+            const int offset = col * out_.strides[1] + row;
+            if (reverse_)
+                oPtr_[offset] = binOP(rval, val);
+            else
+                oPtr_[offset] = binOP(val, rval);
+        }
+    }
+
+   private:
+    write_accessor<T> oPtr_;
+    const KParam out_;
+    read_accessor<T> values_;
+    read_accessor<int> rowIdx_;
+    read_accessor<int> colIdx_;
+    const int nNZ_;
+    read_accessor<T> rPtr_;
+    const KParam rhs_;
+    const int reverse_;
+};
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param<T> out, const Param<T> values,
+                      const Param<int> rowIdx, const Param<int> colIdx,
+                      const Param<T> rhs, const bool reverse) {
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(divup(values.info.dims[0], THREADS) * THREADS);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_out{*out.data, h, sycl::write_only};
+        sycl::accessor d_values{*values.data, h, sycl::read_only};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_rhs{*rhs.data, h, sycl::read_only};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       sparseArithCOOKernel<T, op>(
+                           d_out, out.info, d_values, d_rowIdx, d_colIdx,
+                           static_cast<int>(values.info.dims[0]), d_rhs,
+                           rhs.info, static_cast<int>(reverse)));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+class sparseArithCSR2Kernel {
+   public:
+    sparseArithCSR2Kernel(sycl::accessor<T> values, read_accessor<int> rowIdx,
+                          read_accessor<int> colIdx, const int nNZ,
+                          read_accessor<T> rPtr, const KParam rhs,
+                          const int reverse)
+        : values_(values)
+        , rowIdx_(rowIdx)
+        , colIdx_(colIdx)
+        , nNZ_(nNZ)
+        , rPtr_(rPtr)
+        , rhs_(rhs)
+        , reverse_(reverse) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        common::Binary<T, op> binOP;
+
+        const int row =
+            g.get_group_id(0) * g.get_local_range(1) + it.get_local_id(1);
+
+        if (row < rhs_.dims[0]) {
+            const int rowStartIdx = rowIdx_[row];
+            const int rowEndIdx   = rowIdx_[row + 1];
+
+            // Repeat loop until all values in the row are computed
+            for (int idx = rowStartIdx + it.get_local_id(0); idx < rowEndIdx;
+                 idx += g.get_local_range(0)) {
+                const int col = colIdx_[idx];
+
+                if (row >= rhs_.dims[0] || col >= rhs_.dims[1])
+                    continue;  // Bad indices
+
+                // Get Values
+                const T val  = values_[idx];
+                const T rval = rPtr_[col * rhs_.strides[1] + row];
+
+                if (reverse_)
+                    values_[idx] = binOP(rval, val);
+                else
+                    values_[idx] = binOP(val, rval);
+            }
+        }
+    }
+
+   private:
+    sycl::accessor<T> values_;
+    read_accessor<int> rowIdx_;
+    read_accessor<int> colIdx_;
+    const int nNZ_;
+    read_accessor<T> rPtr_;
+    const KParam rhs_;
+    const int reverse_;
+};
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+                      const Param<T> rhs, const bool reverse) {
+    auto local  = sycl::range(TX, TY);
+    auto global = sycl::range(divup(values.info.dims[0], TY) * TX, TY);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_values{*values.data, h, sycl::read_write};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_rhs{*rhs.data, h, sycl::read_only};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       sparseArithCSR2Kernel<T, op>(
+                           d_values, d_rowIdx, d_colIdx,
+                           static_cast<int>(values.info.dims[0]), d_rhs,
+                           rhs.info, static_cast<int>(reverse)));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+class sparseArithCOO2Kernel {
+   public:
+    sparseArithCOO2Kernel(sycl::accessor<T> values, read_accessor<int> rowIdx,
+                          read_accessor<int> colIdx, const int nNZ,
+                          read_accessor<T> rPtr, const KParam rhs,
+                          const int reverse)
+        : values_(values)
+        , rowIdx_(rowIdx)
+        , colIdx_(colIdx)
+        , nNZ_(nNZ)
+        , rPtr_(rPtr)
+        , rhs_(rhs)
+        , reverse_(reverse) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        common::Binary<T, op> binOP;
+
+        const int idx = it.get_global_id(0);
+
+        if (idx < nNZ_) {
+            const int row = rowIdx_[idx];
+            const int col = colIdx_[idx];
+
+            if (row >= rhs_.dims[0] || col >= rhs_.dims[1])
+                return;  // Bad indices
+
+            // Get Values
+            const T val  = values_[idx];
+            const T rval = rPtr_[col * rhs_.strides[1] + row];
+
+            if (reverse_)
+                values_[idx] = binOP(rval, val);
+            else
+                values_[idx] = binOP(val, rval);
+        }
+    }
+
+   private:
+    sycl::accessor<T> values_;
+    read_accessor<int> rowIdx_;
+    read_accessor<int> colIdx_;
+    const int nNZ_;
+    read_accessor<T> rPtr_;
+    const KParam rhs_;
+    const int reverse_;
+};
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param<T> values, Param<int> rowIdx, Param<int> colIdx,
+                      const Param<T> rhs, const bool reverse) {
+    auto local  = sycl::range(THREADS);
+    auto global = sycl::range(divup(values.info.dims[0], THREADS) * THREADS);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_values{*values.data, h, sycl::read_write};
+        sycl::accessor d_rowIdx{*rowIdx.data, h, sycl::read_only};
+        sycl::accessor d_colIdx{*colIdx.data, h, sycl::read_only};
+        sycl::accessor d_rhs{*rhs.data, h, sycl::read_only};
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       sparseArithCOO2Kernel<T, op>(
+                           d_values, d_rowIdx, d_colIdx,
+                           static_cast<int>(values.info.dims[0]), d_rhs,
+                           rhs.info, static_cast<int>(reverse)));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+class csrCalcOutNNZKernel {
+   public:
+    csrCalcOutNNZKernel(write_accessor<unsigned> nnzc,
+                        write_accessor<int> oRowIdx, unsigned M,
+                        read_accessor<int> lRowIdx, read_accessor<int> lColIdx,
+                        read_accessor<int> rRowIdx, read_accessor<int> rColIdx,
+                        sycl::local_accessor<unsigned, 1> blkNNZ)
+        : nnzc_(nnzc)
+        , oRowIdx_(oRowIdx)
+        , M_(M)
+        , lRowIdx_(lRowIdx)
+        , lColIdx_(lColIdx)
+        , rRowIdx_(rRowIdx)
+        , rColIdx_(rColIdx)
+        , blkNNZ_(blkNNZ) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        sycl::group g = it.get_group();
+
+        const uint row = it.get_global_id(0);
+        const uint tid = it.get_local_id(0);
+
+        const bool valid = row < M_;
+
+        const uint lEnd = (valid ? lRowIdx_[row + 1] : 0);
+        const uint rEnd = (valid ? rRowIdx_[row + 1] : 0);
+
+        blkNNZ_[tid] = 0;
+        it.barrier();
+
+        uint l   = (valid ? lRowIdx_[row] : 0);
+        uint r   = (valid ? rRowIdx_[row] : 0);
+        uint nnz = 0;
+        while (l < lEnd && r < rEnd) {
+            uint lci = lColIdx_[l];
+            uint rci = rColIdx_[r];
+            l += (lci <= rci);
+            r += (lci >= rci);
+            nnz++;
+        }
+        nnz += (lEnd - l);
+        nnz += (rEnd - r);
+
+        blkNNZ_[tid] = nnz;
+        it.barrier();
+
+        if (valid) oRowIdx_[row + 1] = nnz;
+
+        for (uint s = g.get_local_range(0) / 2; s > 0; s >>= 1) {
+            if (tid < s) { blkNNZ_[tid] += blkNNZ_[tid + s]; }
+            it.barrier();
+        }
+
+        if (tid == 0) {
+            nnz = blkNNZ_[0];
+            global_atomic_ref<uint>(nnzc_[0]) += nnz;
+        }
+    }
+
+   private:
+    write_accessor<unsigned> nnzc_;
+    write_accessor<int> oRowIdx_;
+    unsigned M_;
+    read_accessor<int> lRowIdx_;
+    read_accessor<int> lColIdx_;
+    read_accessor<int> rRowIdx_;
+    read_accessor<int> rColIdx_;
+    sycl::local_accessor<unsigned, 1> blkNNZ_;
+};
+
+static void csrCalcOutNNZ(Param<int> outRowIdx, unsigned &nnzC, const uint M,
+                          const uint N, uint nnzA, const Param<int> lrowIdx,
+                          const Param<int> lcolIdx, uint nnzB,
+                          const Param<int> rrowIdx, const Param<int> rcolIdx) {
+    UNUSED(N);
+    UNUSED(nnzA);
+    UNUSED(nnzB);
+
+    auto local  = sycl::range(256);
+    auto global = sycl::range(divup(M, local[0]) * local[0]);
+
+    Array<unsigned> out = createValueArray<unsigned>(1, 0);
+    auto out_get        = out.get();
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_out{*out_get, h, sycl::write_only};
+        sycl::accessor d_outRowIdx{*outRowIdx.data, h, sycl::write_only};
+        sycl::accessor d_lRowIdx{*lrowIdx.data, h, sycl::read_only};
+        sycl::accessor d_lColIdx{*lcolIdx.data, h, sycl::read_only};
+        sycl::accessor d_rRowIdx{*rrowIdx.data, h, sycl::read_only};
+        sycl::accessor d_rColIdx{*rcolIdx.data, h, sycl::read_only};
+
+        auto blkNNZ = sycl::local_accessor<unsigned, 1>(local[0], h);
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            csrCalcOutNNZKernel(d_out, d_outRowIdx, M, d_lRowIdx, d_lColIdx,
+                                d_rRowIdx, d_rColIdx, blkNNZ));
+    });
+
+    {
+        sycl::host_accessor nnz_acc{*out.get(), sycl::read_only};
+        nnzC = nnz_acc[0];
+    }
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+class ssarithCSRKernel {
+   public:
+    ssarithCSRKernel(write_accessor<T> oVals, write_accessor<int> oColIdx,
+                     read_accessor<int> oRowIdx, unsigned M, unsigned N,
+                     unsigned nnza, read_accessor<T> lVals,
+                     read_accessor<int> lRowIdx, read_accessor<int> lColIdx,
+                     unsigned nnzb, read_accessor<T> rVals,
+                     read_accessor<int> rRowIdx, read_accessor<int> rColIdx)
+        : oVals_(oVals)
+        , oColIdx_(oColIdx)
+        , oRowIdx_(oRowIdx)
+        , M_(M)
+        , N_(N)
+        , nnza_(nnza)
+        , lVals_(lVals)
+        , lRowIdx_(lRowIdx)
+        , lColIdx_(lColIdx)
+        , nnzb_(nnzb)
+        , rVals_(rVals)
+        , rRowIdx_(rRowIdx)
+        , rColIdx_(rColIdx) {}
+
+    void operator()(sycl::nd_item<1> it) const {
+        common::Binary<T, op> binOP;
+
+        const uint row = it.get_global_id(0);
+
+        const bool valid  = row < M_;
+        const uint lEnd   = (valid ? lRowIdx_[row + 1] : 0);
+        const uint rEnd   = (valid ? rRowIdx_[row + 1] : 0);
+        const uint offset = (valid ? oRowIdx_[row] : 0);
+
+        T *ovPtr   = oVals_.get_pointer() + offset;
+        int *ocPtr = oColIdx_.get_pointer() + offset;
+
+        uint l = (valid ? lRowIdx_[row] : 0);
+        uint r = (valid ? rRowIdx_[row] : 0);
+
+        uint nnz = 0;
+        while (l < lEnd && r < rEnd) {
+            uint lci = lColIdx_[l];
+            uint rci = rColIdx_[r];
+
+            T lhs = (lci <= rci ? lVals_[l] : common::Binary<T, op>::init());
+            T rhs = (lci >= rci ? rVals_[r] : common::Binary<T, op>::init());
+
+            ovPtr[nnz] = binOP(lhs, rhs);
+            ocPtr[nnz] = (lci <= rci) ? lci : rci;
+
+            l += (lci <= rci);
+            r += (lci >= rci);
+            nnz++;
+        }
+        while (l < lEnd) {
+            ovPtr[nnz] = binOP(lVals_[l], common::Binary<T, op>::init());
+            ocPtr[nnz] = lColIdx_[l];
+            l++;
+            nnz++;
+        }
+        while (r < rEnd) {
+            ovPtr[nnz] = binOP(common::Binary<T, op>::init(), rVals_[r]);
+            ocPtr[nnz] = rColIdx_[r];
+            r++;
+            nnz++;
+        }
+    }
+
+   private:
+    write_accessor<T> oVals_;
+    write_accessor<int> oColIdx_;
+    read_accessor<int> oRowIdx_;
+    unsigned M_, N_;
+    unsigned nnza_;
+    read_accessor<T> lVals_;
+    read_accessor<int> lRowIdx_;
+    read_accessor<int> lColIdx_;
+    unsigned nnzb_;
+    read_accessor<T> rVals_;
+    read_accessor<int> rRowIdx_;
+    read_accessor<int> rColIdx_;
+};
+
+template<typename T, af_op_t op>
+void ssArithCSR(Param<T> oVals, Param<int> oColIdx, const Param<int> oRowIdx,
+                const uint M, const uint N, unsigned nnzA, const Param<T> lVals,
+                const Param<int> lRowIdx, const Param<int> lColIdx,
+                unsigned nnzB, const Param<T> rVals, const Param<int> rRowIdx,
+                const Param<int> rColIdx) {
+    auto local  = sycl::range(256);
+    auto global = sycl::range(divup(M, local[0]) * local[0]);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_oVals{*oVals.data, h, sycl::write_only};
+        sycl::accessor d_oColIdx{*oColIdx.data, h, sycl::write_only};
+        sycl::accessor d_oRowIdx{*oRowIdx.data, h, sycl::read_only};
+
+        sycl::accessor d_lVals{*lVals.data, h, sycl::read_only};
+        sycl::accessor d_lRowIdx{*lRowIdx.data, h, sycl::read_only};
+        sycl::accessor d_lColIdx{*lColIdx.data, h, sycl::read_only};
+
+        sycl::accessor d_rVals{*rVals.data, h, sycl::read_only};
+        sycl::accessor d_rRowIdx{*rRowIdx.data, h, sycl::read_only};
+        sycl::accessor d_rColIdx{*rColIdx.data, h, sycl::read_only};
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            ssarithCSRKernel<T, op>(d_oVals, d_oColIdx, d_oRowIdx, M, N, nnzA,
+                                    d_lVals, d_lRowIdx, d_lColIdx, nnzB,
+                                    d_rVals, d_rRowIdx, d_rColIdx));
+    });
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
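Reviewer note on `csrCalcOutNNZKernel`: the per-row count is a two-pointer merge over the sorted column indices of the left and right operands. The host-side sketch below mirrors that loop for a single row so the counting rule can be checked in isolation. It is not part of the patch, the name `unionRowNnz` is hypothetical, and it assumes each row's column indices are sorted ascending with no duplicates.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not in the patch): number of entries in
// the union of two sorted, duplicate-free column-index lists of one row.
inline std::size_t unionRowNnz(const std::vector<int> &lCols,
                               const std::vector<int> &rCols) {
    assert(std::is_sorted(lCols.begin(), lCols.end()));
    assert(std::is_sorted(rCols.begin(), rCols.end()));

    std::size_t l = 0, r = 0, nnz = 0;
    while (l < lCols.size() && r < rCols.size()) {
        const int lci = lCols[l];
        const int rci = rCols[r];
        // Advance the side holding the smaller column, or both on a tie,
        // and emit exactly one output entry per step.
        l += (lci <= rci);
        r += (lci >= rci);
        ++nnz;
    }
    // Leftover entries on either side have no partner in the other row.
    nnz += (lCols.size() - l) + (rCols.size() - r);
    return nnz;
}
```

For example, columns `{0, 2, 5}` and `{2, 3}` share only column 2, so the merged row holds four entries. `ssarithCSRKernel` walks the rows in the same order when it writes values and column indices.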
diff --git a/src/backend/oneapi/kernel/tile.hpp b/src/backend/oneapi/kernel/tile.hpp
new file mode 100644
index 0000000000..39cea65af3
--- /dev/null
+++ b/src/backend/oneapi/kernel/tile.hpp
@@ -0,0 +1,110 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class tileCreateKernel {
+   public:
+    tileCreateKernel(write_accessor<T> out, read_accessor<T> in,
+                     const KParam op, const KParam ip, const int blocksPerMatX,
+                     const int blocksPerMatY)
+        : out_(out)
+        , in_(in)
+        , op_(op)
+        , ip_(ip)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        const int oz = g.get_group_id(0) / blocksPerMatX_;
+        const int ow = g.get_group_id(1) / blocksPerMatY_;
+
+        const int blockIdx_x = g.get_group_id(0) - oz * blocksPerMatX_;
+        const int blockIdx_y = g.get_group_id(1) - ow * blocksPerMatY_;
+
+        const int xx = it.get_local_id(0) + blockIdx_x * g.get_local_range(0);
+        const int yy = it.get_local_id(1) + blockIdx_y * g.get_local_range(1);
+
+        const bool valid = (xx < op_.dims[0] && yy < op_.dims[1] &&
+                            oz < op_.dims[2] && ow < op_.dims[3]);
+
+        const int iz  = oz % ip_.dims[2];
+        const int iw  = ow % ip_.dims[3];
+        const int izw = iw * ip_.strides[3] + iz * ip_.strides[2];
+        const int ozw = ow * op_.strides[3] + oz * op_.strides[2];
+
+        const int incy = blocksPerMatY_ * g.get_local_range(1);
+        const int incx = blocksPerMatX_ * g.get_local_range(0);
+
+        for (int oy = yy; oy < op_.dims[1]; oy += incy) {
+            const int iy = oy % ip_.dims[1];
+            for (int ox = xx; ox < op_.dims[0]; ox += incx) {
+                const int ix = ox % ip_.dims[0];
+
+                int iMem = izw + iy * ip_.strides[1] + ix;
+                int oMem = ozw + oy * op_.strides[1] + ox;
+
+                if (valid) out_[oMem] = in_[ip_.offset + iMem];
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> out_;
+    read_accessor<T> in_;
+    const KParam op_;
+    const KParam ip_;
+    const int blocksPerMatX_;
+    const int blocksPerMatY_;
+};
+
+template<typename T>
+void tile(Param<T> out, const Param<T> in) {
+    constexpr int TX    = 32;
+    constexpr int TY    = 8;
+    constexpr int TILEX = 512;
+    constexpr int TILEY = 32;
+
+    auto local = sycl::range(TX, TY);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    auto global       = sycl::range(local[0] * blocksPerMatX * out.info.dims[2],
+                                    local[1] * blocksPerMatY * out.info.dims[3]);
+
+    getQueue().submit([&](auto &h) {
+        write_accessor<T> d_out{*out.data, h};
+        read_accessor<T> d_in{*in.data, h};
+        h.parallel_for(sycl::nd_range{global, local},
+                       tileCreateKernel<T>(d_out, d_in, out.info, in.info,
+                                           blocksPerMatX, blocksPerMatY));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
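Review note on `tile.hpp`: the kernel maps each output coordinate back to the input with a per-axis modulo (`ix = ox % ip_.dims[0]`, and likewise for `iy`, `iz`, `iw`). The host-side sketch below reproduces only that mapping, so it could serve as a cross-check in a unit test. `Dims`, `linearIndex`, and `tileReference` are hypothetical names that do not exist in ArrayFire, and the sketch assumes densely packed column-major storage with a zero offset, which the kernel itself does not require.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration only (not ArrayFire API).
using Dims = std::array<int, 4>;

// Column-major linear index for a densely packed 4D array (assumption).
inline std::size_t linearIndex(const Dims &d, int x, int y, int z, int w) {
    return static_cast<std::size_t>(x) +
           static_cast<std::size_t>(d[0]) *
               (y + static_cast<std::size_t>(d[1]) *
                        (z + static_cast<std::size_t>(d[2]) * w));
}

// Serial reference: every output element reads the input element whose
// coordinate is the output coordinate modulo the input extent on that axis.
template<typename T>
std::vector<T> tileReference(const std::vector<T> &in, const Dims &idims,
                             const Dims &odims) {
    assert(in.size() == static_cast<std::size_t>(idims[0]) * idims[1] *
                            idims[2] * idims[3]);
    std::vector<T> out(static_cast<std::size_t>(odims[0]) * odims[1] *
                       odims[2] * odims[3]);
    for (int ow = 0; ow < odims[3]; ++ow)
        for (int oz = 0; oz < odims[2]; ++oz)
            for (int oy = 0; oy < odims[1]; ++oy)
                for (int ox = 0; ox < odims[0]; ++ox)
                    out[linearIndex(odims, ox, oy, oz, ow)] =
                        in[linearIndex(idims, ox % idims[0], oy % idims[1],
                                       oz % idims[2], ow % idims[3])];
    return out;
}
```

Tiling a 3-element column to a 6x2 output with `tileReference(in, Dims{3, 1, 1, 1}, Dims{6, 2, 1, 1})` repeats the column twice along axis 0 and twice along axis 1. The serial loops trade speed for readability; the SYCL kernel covers the same index space with a strided loop per work-item.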
diff --git a/src/backend/oneapi/kernel/transform.hpp b/src/backend/oneapi/kernel/transform.hpp
new file mode 100644
index 0000000000..874e9638c7
--- /dev/null
+++ b/src/backend/oneapi/kernel/transform.hpp
@@ -0,0 +1,298 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/interp.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+template<bool PERSPECTIVE>
+void calc_transf_inverse(float *txo, const float *txi) {
+    if constexpr (PERSPECTIVE) {
+        txo[0] = txi[4] * txi[8] - txi[5] * txi[7];
+        txo[1] = -(txi[1] * txi[8] - txi[2] * txi[7]);
+        txo[2] = txi[1] * txi[5] - txi[2] * txi[4];
+
+        txo[3] = -(txi[3] * txi[8] - txi[5] * txi[6]);
+        txo[4] = txi[0] * txi[8] - txi[2] * txi[6];
+        txo[5] = -(txi[0] * txi[5] - txi[2] * txi[3]);
+
+        txo[6] = txi[3] * txi[7] - txi[4] * txi[6];
+        txo[7] = -(txi[0] * txi[7] - txi[1] * txi[6]);
+        txo[8] = txi[0] * txi[4] - txi[1] * txi[3];
+
+        float det = txi[0] * txo[0] + txi[1] * txo[3] + txi[2] * txo[6];
+
+        txo[0] /= det;
+        txo[1] /= det;
+        txo[2] /= det;
+        txo[3] /= det;
+        txo[4] /= det;
+        txo[5] /= det;
+        txo[6] /= det;
+        txo[7] /= det;
+        txo[8] /= det;
+    } else {
+        float det = txi[0] * txi[4] - txi[1] * txi[3];
+
+        txo[0] = txi[4] / det;
+        txo[1] = txi[3] / det;
+        txo[3] = txi[1] / det;
+        txo[4] = txi[0] / det;
+
+        txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
+        txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
+    }
+}
+
+template<typename T, typename InterpPosTy, bool PERSPECTIVE, int INTERP_ORDER>
+class transformCreateKernel {
+   public:
+    transformCreateKernel(write_accessor<T> d_out, const KParam out,
+                          read_accessor<T> d_in, const KParam in,
+                          read_accessor<float> c_tmat, const KParam tf,
+                          const int nImg2, const int nImg3, const int nTfs2,
+                          const int nTfs3, const int batchImg2,
+                          const int blocksXPerImage, const int blocksYPerImage,
+                          const af::interpType method, const bool INVERSE)
+        : d_out_(d_out)
+        , out_(out)
+        , d_in_(d_in)
+        , in_(in)
+        , c_tmat_(c_tmat)
+        , tf_(tf)
+        , nImg2_(nImg2)
+        , nImg3_(nImg3)
+        , nTfs2_(nTfs2)
+        , nTfs3_(nTfs3)
+        , batchImg2_(batchImg2)
+        , blocksXPerImage_(blocksXPerImage)
+        , blocksYPerImage_(blocksYPerImage)
+        , method_(method)
+        , INVERSE_(INVERSE) {}
+    void operator()(sycl::nd_item<3> it) const {
+        sycl::group g = it.get_group();
+
+        // Image Ids
+        const int imgId2 = g.get_group_id(0) / blocksXPerImage_;
+        const int imgId3 = g.get_group_id(1) / blocksYPerImage_;
+
+        // Block in local image
+        const int blockIdx_x = g.get_group_id(0) - imgId2 * blocksXPerImage_;
+        const int blockIdx_y = g.get_group_id(1) - imgId3 * blocksYPerImage_;
+
+        // Get thread indices in local image
+        const int xido = blockIdx_x * g.get_local_range(0) + it.get_local_id(0);
+        const int yido = blockIdx_y * g.get_local_range(1) + it.get_local_id(1);
+
+        // Image iteration loop count for image batching
+        int limages = sycl::min(
+            sycl::max((int)(out_.dims[2] - imgId2 * nImg2_), 1), batchImg2_);
+
+        if (xido >= out_.dims[0] || yido >= out_.dims[1]) return;
+
+        // Index of transform
+        const int eTfs2 = sycl::max((nTfs2_ / nImg2_), 1);
+
+        int t_idx3        = -1;  // init
+        int t_idx2        = -1;  // init
+        int t_idx2_offset = 0;
+
+        const int blockIdx_z = g.get_group_id(2);
+
+        if (nTfs3_ == 1) {
+            t_idx3 = 0;  // Always 0 as only 1 transform defined
+        } else {
+            if (nTfs3_ == nImg3_) {
+                t_idx3 =
+                    imgId3;  // One to one batch with all transforms defined
+            } else {
+                t_idx3 = blockIdx_z / eTfs2;  // Transform batched, calculate
+                t_idx2_offset = t_idx3 * nTfs2_;
+            }
+        }
+
+        if (nTfs2_ == 1) {
+            t_idx2 = 0;  // Always 0 as only 1 transform defined
+        } else {
+            if (nTfs2_ == nImg2_) {
+                t_idx2 =
+                    imgId2;  // One to one batch with all transforms defined
+            } else {
+                t_idx2 =
+                    blockIdx_z - t_idx2_offset;  // Transform batched, calculate
+            }
+        }
+
+        // Linear transform index
+        const int t_idx = t_idx2 + t_idx3 * nTfs2_;
+
+        // Global offset
+        int outoff = out_.offset;
+        int inoff  = imgId2 * batchImg2_ * in_.strides[2] +
+                    imgId3 * in_.strides[3] + in_.offset;
+        if (nImg2_ == nTfs2_ || nImg2_ > 1) {  // One-to-One or Image on dim2
+            outoff += imgId2 * batchImg2_ * out_.strides[2];
+        } else {  // Transform batched on dim2
+            outoff += t_idx2 * out_.strides[2];
+        }
+
+        if (nImg3_ == nTfs3_ || nImg3_ > 1) {  // One-to-One or Image on dim3
+            outoff += imgId3 * out_.strides[3];
+        } else {  // Transform batched on dim3
+            outoff += t_idx3 * out_.strides[3];
+        }
+
+        // Transform is in global memory.
+        // Needs offset to correct transform being processed.
+        const int transf_len = PERSPECTIVE ? 9 : 6;
+        using TMatTy =
+            typename std::conditional<PERSPECTIVE, float[9], float[6]>::type;
+        TMatTy tmat;
+        const float *tmat_ptr =
+            c_tmat_.get_pointer() + tf_.offset + t_idx * transf_len;
+
+        // We expect an inverse transform matrix by default
+        // If it is a forward transform, then we need its inverse
+        if (INVERSE_) {
+#pragma unroll 3
+            for (int i = 0; i < transf_len; i++) tmat[i] = tmat_ptr[i];
+        } else {
+            calc_transf_inverse<PERSPECTIVE>(tmat, tmat_ptr);
+        }
+
+        InterpPosTy xidi = xido * tmat[0] + yido * tmat[1] + tmat[2];
+        InterpPosTy yidi = xido * tmat[3] + yido * tmat[4] + tmat[5];
+
+        if constexpr (PERSPECTIVE) {
+            const InterpPosTy W = xido * tmat[6] + yido * tmat[7] + tmat[8];
+            xidi /= W;
+            yidi /= W;
+        }
+        const int loco = outoff + (yido * out_.strides[1] + xido);
+        // FIXME: Nearest and lower do not do clamping, but other methods do
+        // Make it consistent
+        const bool doclamp = INTERP_ORDER != 1;
+
+        T zero = (T)0;
+        if (xidi < (InterpPosTy)-0.0001f || yidi < (InterpPosTy)-0.0001f ||
+            in_.dims[0] <= xidi || in_.dims[1] <= yidi) {
+            for (int n = 0; n < limages; n++) {
+                d_out_[loco + n * out_.strides[2]] = zero;
+            }
+            return;
+        }
+
+        Interp2<T, InterpPosTy, INTERP_ORDER> interp2;
+        interp2(d_out_, out_, loco, d_in_, in_, inoff, xidi, yidi, 0, 1,
+                method_, limages, doclamp, 2);
+    }
+
+   private:
+    write_accessor<T> d_out_;
+    const KParam out_;
+    read_accessor<T> d_in_;
+    const KParam in_;
+    read_accessor<float> c_tmat_;
+    const KParam tf_;
+    const int nImg2_;
+    const int nImg3_;
+    const int nTfs2_;
+    const int nTfs3_;
+    const int batchImg2_;
+    const int blocksXPerImage_;
+    const int blocksYPerImage_;
+    const af::interpType method_;
+    const bool INVERSE_;
+};
+
+template<typename T>
+void transform(Param<T> out, const Param<T> in, const Param<float> tf,
+               bool isInverse, bool isPerspective, af_interp_type method,
+               int order) {
+    using std::string;
+
+    using BT = typename dtype_traits<T>::base_type;
+
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+    // Used for batching images
+    constexpr int TI = 4;
+
+    const int nImg2 = in.info.dims[2];
+    const int nImg3 = in.info.dims[3];
+    const int nTfs2 = tf.info.dims[2];
+    const int nTfs3 = tf.info.dims[3];
+
+    auto local = sycl::range(TX, TY, 1);
+
+    int batchImg2 = 1;
+    if (nImg2 != nTfs2) batchImg2 = fmin(nImg2, TI);
+
+    const int blocksXPerImage = divup(out.info.dims[0], local[0]);
+    const int blocksYPerImage = divup(out.info.dims[1], local[1]);
+
+    int global_x = local[0] * blocksXPerImage * (nImg2 / batchImg2);
+    int global_y = local[1] * blocksYPerImage * nImg3;
+    int global_z =
+        local[2] * fmax((nTfs2 / nImg2), 1) * fmax((nTfs3 / nImg3), 1);
+
+    auto global = sycl::range(global_x, global_y, global_z);
+
+#define INVOKE(PERSPECTIVE, INTERP_ORDER)                                      \
+    h.parallel_for(                                                            \
+        sycl::nd_range{global, local},                                         \
+        transformCreateKernel<T, wtype_t<BT>, PERSPECTIVE, INTERP_ORDER>(      \
+            d_out, out.info, d_in, in.info, d_tf, tf.info, nImg2, nImg3,       \
+            nTfs2, nTfs3, batchImg2, blocksXPerImage, blocksYPerImage, method, \
+            isInverse));
+
+    getQueue().submit([&](auto &h) {
+        read_accessor<T> d_in{*in.data, h};
+        read_accessor<float> d_tf{*tf.data, h};
+        write_accessor<T> d_out{*out.data, h};
+
+        if (isPerspective == true && order == 1) INVOKE(true, 1);
+        if (isPerspective == true && order == 2) INVOKE(true, 2);
+        if (isPerspective == true && order == 3) INVOKE(true, 3);
+
+        if (isPerspective == false && order == 1) INVOKE(false, 1);
+        if (isPerspective == false && order == 2) INVOKE(false, 2);
+        if (isPerspective == false && order == 3) INVOKE(false, 3);
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
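Review note on `transform.hpp`: the `PERSPECTIVE` branch of `calc_transf_inverse` builds the inverse of a row-major 3x3 matrix from its cofactors and divides by the determinant, and the kernel then maps an output pixel through the matrix and divides by `W`. The sketch below restates those two steps on the host so the arithmetic can be checked in isolation. `Mat3`, `inverse3x3`, `multiply`, and `project` are hypothetical names, not ArrayFire API, and a singular matrix (zero determinant) is not handled, just as in the kernel.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical host-side types for illustration only.
using Mat3 = std::array<float, 9>;  // row-major 3x3 matrix

// Inverse via the adjugate: r holds the transposed cofactors, then every
// entry is divided by the determinant (expanded along the first row of m).
inline Mat3 inverse3x3(const Mat3 &m) {
    Mat3 r{};
    r[0] = m[4] * m[8] - m[5] * m[7];
    r[1] = -(m[1] * m[8] - m[2] * m[7]);
    r[2] = m[1] * m[5] - m[2] * m[4];
    r[3] = -(m[3] * m[8] - m[5] * m[6]);
    r[4] = m[0] * m[8] - m[2] * m[6];
    r[5] = -(m[0] * m[5] - m[2] * m[3]);
    r[6] = m[3] * m[7] - m[4] * m[6];
    r[7] = -(m[0] * m[7] - m[1] * m[6]);
    r[8] = m[0] * m[4] - m[1] * m[3];
    const float det = m[0] * r[0] + m[1] * r[3] + m[2] * r[6];
    for (float &v : r) v /= det;  // assumes det != 0
    return r;
}

// Plain 3x3 matrix product, used only to verify the inverse.
inline Mat3 multiply(const Mat3 &a, const Mat3 &b) {
    Mat3 c{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k) c[i * 3 + j] += a[i * 3 + k] * b[k * 3 + j];
    return c;
}

// Map a point through the matrix, including the homogeneous divide.
inline std::array<float, 2> project(const Mat3 &t, float x, float y) {
    const float w = x * t[6] + y * t[7] + t[8];
    return {(x * t[0] + y * t[1] + t[2]) / w,
            (x * t[3] + y * t[4] + t[5]) / w};
}
```

A quick sanity check is that `multiply(m, inverse3x3(m))` is close to the identity and that projecting a point forward through `m` and back through the inverse returns the original coordinates.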
diff --git a/src/backend/oneapi/kernel/transpose.hpp b/src/backend/oneapi/kernel/transpose.hpp
new file mode 100644
index 0000000000..2752111534
--- /dev/null
+++ b/src/backend/oneapi/kernel/transpose.hpp
@@ -0,0 +1,164 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+constexpr int TILE_DIM  = 32;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = 256 / TILE_DIM;
+
+template<typename T>
+T getConjugate(const T &in) {
+    // For non-complex types return same
+    return in;
+}
+
+template<>
+cfloat getConjugate(const cfloat &in) {
+    return std::conj(in);
+}
+
+template<>
+cdouble getConjugate(const cdouble &in) {
+    return std::conj(in);
+}
+
+template<typename T>
+class transposeKernel {
+   public:
+    transposeKernel(sycl::accessor<T, 1, sycl::access::mode::write> oData,
+                    const KParam out,
+                    const sycl::accessor<T, 1, sycl::access::mode::read> iData,
+                    const KParam in, const int blocksPerMatX,
+                    const int blocksPerMatY, const bool conjugate,
+                    const bool IS32MULTIPLE, sycl::local_accessor<T, 1> shrdMem)
+        : oData_(oData)
+        , out_(out)
+        , iData_(iData)
+        , in_(in)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY)
+        , conjugate_(conjugate)
+        , IS32MULTIPLE_(IS32MULTIPLE)
+        , shrdMem_(shrdMem) {}
+    void operator()(sycl::nd_item<2> it) const {
+        const int shrdStride = TILE_DIM + 1;
+
+        const int oDim0 = out_.dims[0];
+        const int oDim1 = out_.dims[1];
+        const int iDim0 = in_.dims[0];
+        const int iDim1 = in_.dims[1];
+
+        // calculate strides
+        const int oStride1 = out_.strides[1];
+        const int iStride1 = in_.strides[1];
+
+        const int lx = it.get_local_id(0);
+        const int ly = it.get_local_id(1);
+
+        // batch based block Id
+        sycl::group g        = it.get_group();
+        const int batchId_x  = g.get_group_id(0) / blocksPerMatX_;
+        const int blockIdx_x = (g.get_group_id(0) - batchId_x * blocksPerMatX_);
+
+        const int batchId_y  = g.get_group_id(1) / blocksPerMatY_;
+        const int blockIdx_y = (g.get_group_id(1) - batchId_y * blocksPerMatY_);
+
+        const int x0 = TILE_DIM * blockIdx_x;
+        const int y0 = TILE_DIM * blockIdx_y;
+
+        // calculate global indices
+        int gx = lx + x0;
+        int gy = ly + y0;
+
+        // offset in_ and out_ based on batch id
+        // also add the subBuffer offsets
+        const T *iDataPtr = iData_.get_pointer();
+        T *oDataPtr       = oData_.get_pointer();
+        iDataPtr += batchId_x * in_.strides[2] + batchId_y * in_.strides[3] +
+                    in_.offset;
+        oDataPtr += batchId_x * out_.strides[2] + batchId_y * out_.strides[3] +
+                    out_.offset;
+
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int gy_ = gy + repeat;
+            if (IS32MULTIPLE_ || (gx < iDim0 && gy_ < iDim1))
+                shrdMem_[(ly + repeat) * shrdStride + lx] =
+                    iDataPtr[gy_ * iStride1 + gx];
+        }
+        it.barrier();
+
+        gx = lx + y0;
+        gy = ly + x0;
+
+        for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+            int gy_ = gy + repeat;
+            if (IS32MULTIPLE_ || (gx < oDim0 && gy_ < oDim1)) {
+                const T val = shrdMem_[lx * shrdStride + ly + repeat];
+                oDataPtr[gy_ * oStride1 + gx] =
+                    conjugate_ ? getConjugate(val) : val;
+            }
+        }
+    }
+
+   private:
+    sycl::accessor<T, 1, sycl::access::mode::write> oData_;
+    KParam out_;
+    sycl::accessor<T, 1, sycl::access::mode::read> iData_;
+    KParam in_;
+    int blocksPerMatX_;
+    int blocksPerMatY_;
+    bool conjugate_;
+    bool IS32MULTIPLE_;
+    sycl::local_accessor<T, 1> shrdMem_;
+};
+
+template<typename T>
+void transpose(Param<T> out, const Param<T> in, const bool conjugate,
+               const bool IS32MULTIPLE) {
+    auto local = sycl::range{THREADS_X, THREADS_Y};
+
+    const int blk_x = divup(in.info.dims[0], TILE_DIM);
+    const int blk_y = divup(in.info.dims[1], TILE_DIM);
+
+    auto global = sycl::range{blk_x * local[0] * in.info.dims[2],
+                              blk_y * local[1] * in.info.dims[3]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        auto r = in.data->template get_access<sycl::access::mode::read>(h);
+        auto q = out.data->template get_access<sycl::access::mode::write>(h);
+
+        auto shrdMem = sycl::local_accessor<T, 1>(TILE_DIM * (TILE_DIM + 1), h);
+
+        h.parallel_for(sycl::nd_range{global, local},
+                       transposeKernel<T>(q, out.info, r, in.info, blk_x, blk_y,
+                                          conjugate, IS32MULTIPLE, shrdMem));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
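Review note on `transpose.hpp`: the kernel stages a `TILE_DIM` x `TILE_DIM` tile in local memory with a row stride of `TILE_DIM + 1`, then writes it back with the local indices swapped. The host-side sketch below follows the same two stages and the same bounds checks. `tiledTranspose` is a hypothetical name, the data is assumed to be a single densely packed column-major matrix, and the padded stride is kept only to mirror the kernel's indexing (on a GPU such padding is commonly used to avoid local-memory bank conflicts; on the host it has no effect).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference (not ArrayFire API).
// in holds a d0 x d1 column-major matrix; the result is d1 x d0.
template<typename T>
std::vector<T> tiledTranspose(const std::vector<T> &in, int d0, int d1,
                              int tile = 32) {
    assert(in.size() == static_cast<std::size_t>(d0) * d1);
    const int stride = tile + 1;  // padded row stride, as in the kernel
    std::vector<T> scratch(static_cast<std::size_t>(tile) * stride);
    std::vector<T> out(in.size());

    for (int y0 = 0; y0 < d1; y0 += tile) {
        for (int x0 = 0; x0 < d0; x0 += tile) {
            // Stage 1: copy the input tile into scratch, row by row.
            for (int ly = 0; ly < tile; ++ly) {
                for (int lx = 0; lx < tile; ++lx) {
                    const int gx = x0 + lx;
                    const int gy = y0 + ly;
                    if (gx < d0 && gy < d1)
                        scratch[ly * stride + lx] = in[gy * d0 + gx];
                }
            }
            // Stage 2: read scratch with lx and ly swapped and write to the
            // mirrored tile position in the output.
            for (int ly = 0; ly < tile; ++ly) {
                for (int lx = 0; lx < tile; ++lx) {
                    const int gx = y0 + lx;  // fastest output axis
                    const int gy = x0 + ly;
                    if (gx < d1 && gy < d0)
                        out[gy * d1 + gx] = scratch[lx * stride + ly];
                }
            }
        }
    }
    return out;
}
```

Calling `tiledTranspose(in, 3, 2, 2)` on a 3x2 matrix exercises a partial tile on the boundary, which is the case the `IS32MULTIPLE` flag lets the kernel skip checking for. Conjugation is left out of this sketch.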
diff --git a/src/backend/oneapi/kernel/transpose_inplace.hpp b/src/backend/oneapi/kernel/transpose_inplace.hpp
new file mode 100644
index 0000000000..721a3befb9
--- /dev/null
+++ b/src/backend/oneapi/kernel/transpose_inplace.hpp
@@ -0,0 +1,193 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+static T getConjugate(const T &in) {
+    // For non-complex types return same
+    return in;
+}
+
+template<>
+cfloat getConjugate(const cfloat &in) {
+    return std::conj(in);
+}
+
+template<>
+cdouble getConjugate(const cdouble &in) {
+    return std::conj(in);
+}
+
+#define doOp(v) (conjugate_ ? getConjugate((v)) : (v))
+
+constexpr dim_t TILE_DIM  = 16;
+constexpr dim_t THREADS_X = TILE_DIM;
+constexpr dim_t THREADS_Y = 256 / TILE_DIM;
+
+template<typename T>
+class transposeInPlaceKernel {
+   public:
+    transposeInPlaceKernel(const sycl::accessor<T> iData, const KParam in,
+                           const int blocksPerMatX, const int blocksPerMatY,
+                           const bool conjugate, const bool IS32MULTIPLE,
+                           sycl::local_accessor<T, 1> shrdMem_s,
+                           sycl::local_accessor<T, 1> shrdMem_d)
+        : iData_(iData)
+        , in_(in)
+        , blocksPerMatX_(blocksPerMatX)
+        , blocksPerMatY_(blocksPerMatY)
+        , conjugate_(conjugate)
+        , IS32MULTIPLE_(IS32MULTIPLE)
+        , shrdMem_s_(shrdMem_s)
+        , shrdMem_d_(shrdMem_d) {}
+    void operator()(sycl::nd_item<2> it) const {
+        const int shrdStride = TILE_DIM + 1;
+
+        // create variables to hold input dimensions
+        const int iDim0 = in_.dims[0];
+        const int iDim1 = in_.dims[1];
+
+        // calculate strides
+        const int iStride1 = in_.strides[1];
+
+        const int lx = it.get_local_id(0);
+        const int ly = it.get_local_id(1);
+
+        // batch based block Id
+        sycl::group g        = it.get_group();
+        const int batchId_x  = g.get_group_id(0) / blocksPerMatX_;
+        const int blockIdx_x = (g.get_group_id(0) - batchId_x * blocksPerMatX_);
+
+        const int batchId_y  = g.get_group_id(1) / blocksPerMatY_;
+        const int blockIdx_y = (g.get_group_id(1) - batchId_y * blocksPerMatY_);
+
+        const int x0 = TILE_DIM * blockIdx_x;
+        const int y0 = TILE_DIM * blockIdx_y;
+
+        T *iDataPtr = iData_.get_pointer();
+        iDataPtr += batchId_x * in_.strides[2] + batchId_y * in_.strides[3] +
+                    in_.offset;
+
+        if (blockIdx_y > blockIdx_x) {
+            // calculate global indices
+            int gx = lx + x0;
+            int gy = ly + y0;
+            int dx = lx + y0;
+            int dy = ly + x0;
+
+            // Copy to shared memory
+            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+                int gy_ = gy + repeat;
+                if (IS32MULTIPLE_ || (gx < iDim0 && gy_ < iDim1))
+                    shrdMem_s_[(ly + repeat) * shrdStride + lx] =
+                        iDataPtr[gy_ * iStride1 + gx];
+
+                int dy_ = dy + repeat;
+                if (IS32MULTIPLE_ || (dx < iDim0 && dy_ < iDim1))
+                    shrdMem_d_[(ly + repeat) * shrdStride + lx] =
+                        iDataPtr[dy_ * iStride1 + dx];
+            }
+
+            it.barrier();
+
+            // Copy from shared memory to global memory
+            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+                int dy_ = dy + repeat;
+                if (IS32MULTIPLE_ || (dx < iDim0 && dy_ < iDim1))
+                    iDataPtr[dy_ * iStride1 + dx] =
+                        doOp(shrdMem_s_[(ly + repeat) + (shrdStride * lx)]);
+
+                int gy_ = gy + repeat;
+                if (IS32MULTIPLE_ || (gx < iDim0 && gy_ < iDim1))
+                    iDataPtr[gy_ * iStride1 + gx] =
+                        doOp(shrdMem_d_[(ly + repeat) + (shrdStride * lx)]);
+            }
+
+        } else if (blockIdx_y == blockIdx_x) {
+            // calculate global indices
+            int gx = lx + x0;
+            int gy = ly + y0;
+
+            // Copy to shared memory
+            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+                int gy_ = gy + repeat;
+                if (IS32MULTIPLE_ || (gx < iDim0 && gy_ < iDim1))
+                    shrdMem_s_[(ly + repeat) * shrdStride + lx] =
+                        iDataPtr[gy_ * iStride1 + gx];
+            }
+
+            it.barrier();
+
+            // Copy from shared memory to global memory
+            for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
+                int gy_ = gy + repeat;
+                if (IS32MULTIPLE_ || (gx < iDim0 && gy_ < iDim1))
+                    iDataPtr[gy_ * iStride1 + gx] =
+                        doOp(shrdMem_s_[(ly + repeat) + (shrdStride * lx)]);
+            }
+        }
+    }
+
+   private:
+    sycl::accessor<T> iData_;
+    KParam in_;
+    int blocksPerMatX_;
+    int blocksPerMatY_;
+    bool conjugate_;
+    bool IS32MULTIPLE_;
+    sycl::local_accessor<T, 1> shrdMem_s_;
+    sycl::local_accessor<T, 1> shrdMem_d_;
+};
+
+template<typename T>
+void transpose_inplace(Param<T> in, const bool conjugate,
+                       const bool IS32MULTIPLE) {
+    auto local = sycl::range{THREADS_X, THREADS_Y};
+
+    int blk_x = divup(in.info.dims[0], TILE_DIM);
+    int blk_y = divup(in.info.dims[1], TILE_DIM);
+
+    auto global = sycl::range{blk_x * local[0] * in.info.dims[2],
+                              blk_y * local[1] * in.info.dims[3]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        auto r = in.data->get_access(h);
+        auto shrdMem_s =
+            sycl::local_accessor<T, 1>(TILE_DIM * (TILE_DIM + 1), h);
+        auto shrdMem_d =
+            sycl::local_accessor<T, 1>(TILE_DIM * (TILE_DIM + 1), h);
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            transposeInPlaceKernel<T>(r, in.info, blk_x, blk_y, conjugate,
+                                      IS32MULTIPLE, shrdMem_s, shrdMem_d));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
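Review note on `transpose_inplace.hpp`: the kernel only does work for tile pairs with `blockIdx_y > blockIdx_x`, which it swaps with their mirror tile, and for diagonal tiles with `blockIdx_y == blockIdx_x`, which it transposes in place. The remaining tiles return immediately. The sketch below shows the same traversal order on the host. `transposeInPlaceBlocked` is a hypothetical name, the matrix is assumed to be square and densely packed, and conjugation is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical host reference (not ArrayFire API).
// a holds an n x n column-major matrix and is transposed in place.
template<typename T>
void transposeInPlaceBlocked(std::vector<T> &a, int n, int tile = 16) {
    assert(a.size() == static_cast<std::size_t>(n) * n);
    for (int y0 = 0; y0 < n; y0 += tile) {
        // Only tiles on or below the block diagonal are visited; each one
        // also updates its mirror tile, so the rest need no work.
        for (int x0 = 0; x0 <= y0; x0 += tile) {
            const int yEnd = std::min(y0 + tile, n);
            const int xEnd = std::min(x0 + tile, n);
            for (int y = y0; y < yEnd; ++y) {
                // Inside a diagonal tile stop before the diagonal element,
                // so every pair is swapped exactly once.
                const int xStop = (x0 == y0) ? y : xEnd;
                for (int x = x0; x < xStop; ++x)
                    std::swap(a[y * n + x], a[x * n + y]);
            }
        }
    }
}
```

Because each element pair is swapped directly, the host version needs no scratch buffers; the kernel instead stages both tiles in local memory so that all reads complete before any write.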
diff --git a/src/backend/oneapi/kernel/triangle.hpp b/src/backend/oneapi/kernel/triangle.hpp
new file mode 100644
index 0000000000..4634f69570
--- /dev/null
+++ b/src/backend/oneapi/kernel/triangle.hpp
@@ -0,0 +1,121 @@
+/*******************************************************
+ * Copyright (c) 2022 ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class triangleKernel {
+   public:
+    triangleKernel(write_accessor<T> rAcc, KParam rinfo, read_accessor<T> iAcc,
+                   KParam iinfo, const int groups_x, const int groups_y,
+                   const bool is_upper, const bool is_unit_diag)
+        : rAcc_(rAcc)
+        , rinfo_(rinfo)
+        , iAcc_(iAcc)
+        , iinfo_(iinfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , is_upper_(is_upper)
+        , is_unit_diag_(is_unit_diag) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+        const int oz  = g.get_group_id(0) / groups_x_;
+        const int ow  = g.get_group_id(1) / groups_y_;
+
+        const int groupId_0 = g.get_group_id(0) - oz * groups_x_;
+        const int groupId_1 = g.get_group_id(1) - ow * groups_y_;
+
+        const int xx = it.get_local_id(0) + groupId_0 * it.get_local_range(0);
+        const int yy = it.get_local_id(1) + groupId_1 * it.get_local_range(1);
+
+        const int incy = groups_y_ * it.get_local_range(1);
+        const int incx = groups_x_ * it.get_local_range(0);
+
+        T *d_r       = rAcc_.get_pointer();
+        const T *d_i = iAcc_.get_pointer() + iinfo_.offset;
+
+        if (oz < rinfo_.dims[2] && ow < rinfo_.dims[3]) {
+            d_i = d_i + oz * iinfo_.strides[2] + ow * iinfo_.strides[3];
+            d_r = d_r + oz * rinfo_.strides[2] + ow * rinfo_.strides[3];
+
+            for (int oy = yy; oy < rinfo_.dims[1]; oy += incy) {
+                const T *Yd_i = d_i + oy * iinfo_.strides[1];
+                T *Yd_r       = d_r + oy * rinfo_.strides[1];
+
+                for (int ox = xx; ox < rinfo_.dims[0]; ox += incx) {
+                    bool cond         = is_upper_ ? (oy >= ox) : (oy <= ox);
+                    bool do_unit_diag = is_unit_diag_ && (oy == ox);
+                    if (cond) {
+                        Yd_r[ox] = do_unit_diag ? (T)(1) : Yd_i[ox];
+                    } else {
+                        Yd_r[ox] = (T)(0);
+                    }
+                }
+            }
+        }
+    }
+
+   private:
+    write_accessor<T> rAcc_;
+    KParam rinfo_;
+    read_accessor<T> iAcc_;
+    KParam iinfo_;
+    const int groups_x_;
+    const int groups_y_;
+    const bool is_upper_;
+    const bool is_unit_diag_;
+};
+
+template<typename T>
+void triangle(Param<T> out, const Param<T> in, bool is_upper,
+              bool is_unit_diag) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
+
+    auto local = sycl::range{TX, TY};
+
+    int groups_x = divup(out.info.dims[0], TILEX);
+    int groups_y = divup(out.info.dims[1], TILEY);
+
+    auto global = sycl::range{groups_x * out.info.dims[2] * local[0],
+                              groups_y * out.info.dims[3] * local[1]};
+
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<T> iAcc{*in.data, h};
+        write_accessor<T> rAcc{*out.data, h};
+
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            triangleKernel<T>(rAcc, out.info, iAcc, in.info, groups_x, groups_y,
+                              is_upper, is_unit_diag));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
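Review note on `triangle.hpp`: for each element the kernel keeps the input value when `oy >= ox` (upper) or `oy <= ox` (lower), writes zero otherwise, and optionally forces the diagonal to one. A host-side restatement of that predicate follows, intended as a test oracle. `triangleReference` is a hypothetical name, and the sketch assumes one densely packed column-major matrix where `ox` indexes axis 0 and `oy` indexes axis 1.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host reference (not ArrayFire API).
// in holds a d0 x d1 column-major matrix; ox runs along axis 0, oy along 1.
template<typename T>
std::vector<T> triangleReference(const std::vector<T> &in, int d0, int d1,
                                 bool isUpper, bool isUnitDiag) {
    assert(in.size() == static_cast<std::size_t>(d0) * d1);
    std::vector<T> out(in.size());
    for (int oy = 0; oy < d1; ++oy) {
        for (int ox = 0; ox < d0; ++ox) {
            const std::size_t idx = static_cast<std::size_t>(oy) * d0 + ox;
            // Same predicate as the kernel: the diagonal belongs to both
            // the upper and the lower triangle.
            const bool keep = isUpper ? (oy >= ox) : (oy <= ox);
            const bool unit = isUnitDiag && (oy == ox);
            if (keep) {
                out[idx] = unit ? T(1) : in[idx];
            } else {
                out[idx] = T(0);
            }
        }
    }
    return out;
}
```

For the 2x2 matrix stored as `{1, 2, 3, 4}`, the upper triangle zeroes the element at `(ox, oy) = (1, 0)`, and the lower triangle with a unit diagonal zeroes `(0, 1)` and replaces both diagonal entries with one.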
diff --git a/src/backend/oneapi/kernel/unwrap.hpp b/src/backend/oneapi/kernel/unwrap.hpp
new file mode 100644
index 0000000000..43301fd744
--- /dev/null
+++ b/src/backend/oneapi/kernel/unwrap.hpp
@@ -0,0 +1,174 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/default_config.hpp>
+
+#include <sycl/sycl.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class unwrapCreateKernel {
+   public:
+    unwrapCreateKernel(sycl::accessor<T, 1, sycl::access::mode::write> d_out,
+                       const KParam out,
+                       sycl::accessor<T, 1, sycl::access::mode::read> d_in,
+                       const KParam in, const int wx, const int wy,
+                       const int sx, const int sy, const int px, const int py,
+                       const int dx, const int dy, const int nx, const int reps,
+                       const bool IS_COLUMN)
+        : d_out_(d_out)
+        , out_(out)
+        , d_in_(d_in)
+        , in_(in)
+        , wx_(wx)
+        , wy_(wy)
+        , sx_(sx)
+        , sy_(sy)
+        , px_(px)
+        , py_(py)
+        , dx_(dx)
+        , dy_(dy)
+        , nx_(nx)
+        , reps_(reps)
+        , IS_COLUMN_(IS_COLUMN) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        // Compute channel and volume
+        const int w = g.get_group_id(1) / in_.dims[2];
+        const int z = g.get_group_id(1) - w * in_.dims[2];
+
+        if (w >= in_.dims[3] || z >= in_.dims[2]) return;
+
+        // Compute offset for channel and volume
+        const int cOut = w * out_.strides[3] + z * out_.strides[2];
+        const int cIn  = w * in_.strides[3] + z * in_.strides[2];
+
+        // Compute the output column index
+        const int id = IS_COLUMN_ ? (g.get_group_id(0) * g.get_local_range(1) +
+                                     it.get_local_id(1))
+                                  : it.get_global_id(0);
+
+        if (id >= (IS_COLUMN_ ? out_.dims[1] : out_.dims[0])) return;
+
+        // Compute the starting index of the window in x and y of the input
+        const int startx = (id % nx_) * sx_;
+        const int starty = (id / nx_) * sy_;
+
+        const int spx = startx - px_;
+        const int spy = starty - py_;
+
+        // Offset the global pointers to the respective starting indices
+        T *optr = d_out_.get_pointer() + cOut +
+                  id * (IS_COLUMN_ ? out_.strides[1] : 1);
+        const T *iptr = d_in_.get_pointer() + cIn + in_.offset;
+
+        bool cond = (spx >= 0 && spx + (wx_ * dx_) < in_.dims[0] && spy >= 0 &&
+                     spy + (wy_ * dy_) < in_.dims[1]);
+
+        // Compute output index local to column
+        int outIdx = IS_COLUMN_ ? it.get_local_id(0) : it.get_local_id(1);
+        const int oStride =
+            IS_COLUMN_ ? it.get_local_range(0) : it.get_local_range(1);
+
+        for (int i = 0; i < reps_; i++) {
+            if (outIdx >= (IS_COLUMN_ ? out_.dims[0] : out_.dims[1])) return;
+
+            // Compute input index local to window
+            const int y = outIdx / wx_;
+            const int x = outIdx % wx_;
+
+            const int xpad = spx + x * dx_;
+            const int ypad = spy + y * dy_;
+
+            // Copy
+            T val = (T)0;
+            if (cond || (xpad >= 0 && xpad < in_.dims[0] && ypad >= 0 &&
+                         ypad < in_.dims[1])) {
+                const int inIdx = ypad * in_.strides[1] + xpad * in_.strides[0];
+                val             = iptr[inIdx];
+            }
+
+            if (IS_COLUMN_) {
+                optr[outIdx] = val;
+            } else {
+                optr[outIdx * out_.strides[1]] = val;
+            }
+
+            outIdx += oStride;
+        }
+    }
+
+   private:
+    sycl::accessor<T, 1, sycl::access::mode::write> d_out_;
+    const KParam out_;
+    sycl::accessor<T, 1, sycl::access::mode::read> d_in_;
+    const KParam in_;
+    const int wx_;
+    const int wy_;
+    const int sx_;
+    const int sy_;
+    const int px_;
+    const int py_;
+    const int dx_;
+    const int dy_;
+    const int nx_;
+    const int reps_;
+    const bool IS_COLUMN_;
+};
+
+template<typename T>
+void unwrap(Param<T> out, const Param<T> in, const dim_t wx, const dim_t wy,
+            const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+            const dim_t dx, const dim_t dy, const dim_t nx,
+            const bool IS_COLUMN) {
+    dim_t TX = 1, TY = 1;
+    dim_t BX       = 1;
+    const dim_t BY = out.info.dims[2] * out.info.dims[3];
+    int reps       = 1;
+
+    if (IS_COLUMN) {
+        TX   = std::min(THREADS_PER_BLOCK, nextpow2(out.info.dims[0]));
+        TY   = THREADS_PER_BLOCK / TX;
+        BX   = divup(out.info.dims[1], TY);
+        reps = divup((wx * wy), TX);
+    } else {
+        TX   = THREADS_X;
+        TY   = THREADS_Y;
+        BX   = divup(out.info.dims[0], TX);
+        reps = divup((wx * wy), TY);
+    }
+
+    auto local  = sycl::range(TX, TY);
+    auto global = sycl::range(local[0] * BX, local[1] * BY);
+
+    getQueue().submit([&](auto &h) {
+        sycl::accessor d_out{*out.data, h, sycl::write_only, sycl::no_init};
+        sycl::accessor d_in{*in.data, h, sycl::read_only};
+        h.parallel_for(
+            sycl::nd_range{global, local},
+            unwrapCreateKernel<T>(d_out, out.info, d_in, in.info, wx, wy, sx,
+                                  sy, px, py, dx, dy, nx, reps, IS_COLUMN));
+    });
+
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/where.hpp b/src/backend/oneapi/kernel/where.hpp
new file mode 100644
index 0000000000..69f2f7719a
--- /dev/null
+++ b/src/backend/oneapi/kernel/where.hpp
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/dispatch.hpp>
+#include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <kernel/scan_first.hpp>
+#include <memory.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <math.hpp>
+
+#include <algorithm>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class whereKernel {
+   public:
+    whereKernel(write_accessor<uint> out_acc, KParam oInfo,
+                read_accessor<uint> otmp_acc, KParam otInfo,
+                read_accessor<uint> rtmp_acc, KParam rtInfo,
+                read_accessor<T> in_acc, KParam iInfo, uint groups_x,
+                uint groups_y, uint lim)
+        : out_acc_(out_acc)
+        , otmp_acc_(otmp_acc)
+        , rtmp_acc_(rtmp_acc)
+        , in_acc_(in_acc)
+        , oInfo_(oInfo)
+        , otInfo_(otInfo)
+        , rtInfo_(rtInfo)
+        , iInfo_(iInfo)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , lim_(lim) {}
+
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g   = it.get_group();
+        const uint lidx = it.get_local_id(0);
+        const uint lidy = it.get_local_id(1);
+
+        const uint zid       = g.get_group_id(0) / groups_x_;
+        const uint wid       = g.get_group_id(1) / groups_y_;
+        const uint groupId_x = g.get_group_id(0) - (groups_x_)*zid;
+        const uint groupId_y = g.get_group_id(1) - (groups_y_)*wid;
+        const uint xid       = groupId_x * g.get_local_range(0) * lim_ + lidx;
+        const uint yid       = groupId_y * g.get_local_range(1) + lidy;
+
+        const uint *otptr = otmp_acc_.get_pointer();
+        const uint *rtptr = rtmp_acc_.get_pointer();
+        const T *iptr     = in_acc_.get_pointer();
+
+        const uint off = wid * otInfo_.strides[3] + zid * otInfo_.strides[2] +
+                         yid * otInfo_.strides[1];
+        const uint bid = wid * rtInfo_.strides[3] + zid * rtInfo_.strides[2] +
+                         yid * rtInfo_.strides[1] + groupId_x;
+
+        otptr += wid * otInfo_.strides[3] + zid * otInfo_.strides[2] +
+                 yid * otInfo_.strides[1];
+        iptr += wid * iInfo_.strides[3] + zid * iInfo_.strides[2] +
+                yid * iInfo_.strides[1] + iInfo_.offset;
+
+        size_t odims0 = otInfo_.dims[0];
+        size_t odims1 = otInfo_.dims[1];
+        size_t odims2 = otInfo_.dims[2];
+        size_t odims3 = otInfo_.dims[3];
+        bool cond     = (yid < odims1) && (zid < odims2) && (wid < odims3);
+        T zero        = scalar<T>(0);
+
+        if (cond) {
+            uint accum = (bid == 0) ? 0 : rtptr[bid - 1];
+
+            for (uint k = 0, id = xid; k < lim_ && id < odims0;
+                 k++, id += g.get_local_range(0)) {
+                uint idx = otptr[id] + accum;
+                if (iptr[id] != zero) out_acc_[idx - 1] = (off + id);
+            }
+        }
+    }
+
+   protected:
+    write_accessor<uint> out_acc_;
+    read_accessor<uint> otmp_acc_;
+    read_accessor<uint> rtmp_acc_;
+    read_accessor<T> in_acc_;
+    KParam oInfo_, otInfo_, rtInfo_, iInfo_;
+    uint groups_x_, groups_y_, lim_;
+};
+
+template<typename T>
+static void where(Param<uint> &out, Param<T> in) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_BLOCK);
+    uint threads_y = THREADS_PER_BLOCK / threads_x;
+
+    uint groups_x = divup((uint)in.info.dims[0], (uint)(threads_x * REPEAT));
+    uint groups_y = divup(in.info.dims[1], threads_y);
+
+    Param<uint> rtmp;
+    Param<uint> otmp;
+    rtmp.info.dims[0]    = groups_x;
+    otmp.info.dims[0]    = in.info.dims[0];
+    rtmp.info.strides[0] = 1;
+    otmp.info.strides[0] = 1;
+
+    for (int k = 1; k < 4; k++) {
+        rtmp.info.dims[k]    = in.info.dims[k];
+        rtmp.info.strides[k] = rtmp.info.strides[k - 1] * rtmp.info.dims[k - 1];
+
+        otmp.info.dims[k]    = in.info.dims[k];
+        otmp.info.strides[k] = otmp.info.strides[k - 1] * otmp.info.dims[k - 1];
+    }
+
+    uintl rtmp_elements = rtmp.info.strides[3] * rtmp.info.dims[3];
+    uintl otmp_elements = otmp.info.strides[3] * otmp.info.dims[3];
+    auto rtmp_alloc     = memAlloc<uint>(rtmp_elements);
+    auto otmp_alloc     = memAlloc<uint>(otmp_elements);
+    rtmp.data           = rtmp_alloc.get();
+    otmp.data           = otmp_alloc.get();
+
+    scan_first_launcher<T, uint, af_notzero_t>(
+        otmp, rtmp, in, groups_x, groups_y, threads_x, false, true);
+
+    // Linearize the dimensions and perform scan
+    Param<uint> ltmp  = rtmp;
+    ltmp.info.dims[0] = rtmp_elements;
+    for (int k = 1; k < 4; k++) {
+        ltmp.info.dims[k]    = 1;
+        ltmp.info.strides[k] = rtmp_elements;
+    }
+
+    scan_first<uint, uint, af_add_t>(ltmp, ltmp, true);
+
+    // Get output size and allocate output
+    uint total;
+
+    getQueue()
+        .submit([&](sycl::handler &h) {
+            auto acc_in = rtmp.data->get_access(h, sycl::range{1},
+                                                sycl::id{rtmp_elements - 1});
+            h.copy(acc_in, &total);
+        })
+        .wait();
+
+    auto out_alloc = memAlloc<uint>(std::max(1U, total));
+    out.data       = out_alloc.get();
+
+    out.info.dims[0]    = total;
+    out.info.strides[0] = 1;
+    for (int k = 1; k < 4; k++) {
+        out.info.dims[k]    = 1;
+        out.info.strides[k] = total;
+    }
+
+    sycl::range<2> local(threads_x, THREADS_PER_BLOCK / threads_x);
+    sycl::range<2> global(groups_x * in.info.dims[2] * local[0],
+                          groups_y * in.info.dims[3] * local[1]);
+    uint lim = divup(otmp.info.dims[0], (threads_x * groups_x));
+
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<uint> out_acc{*out.data, h};
+        read_accessor<uint> otmp_acc{*otmp.data, h};
+        read_accessor<uint> rtmp_acc{*rtmp.data, h};
+        read_accessor<T> in_acc{*in.data, h};
+
+        h.parallel_for(sycl::nd_range<2>(global, local),
+                       whereKernel<T>(out_acc, out.info, otmp_acc, otmp.info,
+                                      rtmp_acc, rtmp.info, in_acc, in.info,
+                                      groups_x, groups_y, lim));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+    out_alloc.release();
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/wrap.hpp b/src/backend/oneapi/kernel/wrap.hpp
new file mode 100644
index 0000000000..e29403b604
--- /dev/null
+++ b/src/backend/oneapi/kernel/wrap.hpp
@@ -0,0 +1,159 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <math.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class wrapCreateKernel {
+   public:
+    wrapCreateKernel(write_accessor<T> optrAcc, KParam out,
+                     read_accessor<T> iptrAcc, KParam in, const int wx,
+                     const int wy, const int sx, const int sy, const int px,
+                     const int py, const int nx, const int ny, int groups_x,
+                     int groups_y, const bool is_column)
+        : optrAcc_(optrAcc)
+        , out_(out)
+        , iptrAcc_(iptrAcc)
+        , in_(in)
+        , wx_(wx)
+        , wy_(wy)
+        , sx_(sx)
+        , sy_(sy)
+        , px_(px)
+        , py_(py)
+        , nx_(nx)
+        , ny_(ny)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , is_column_(is_column) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int idx2 = g.get_group_id(0) / groups_x_;
+        int idx3 = g.get_group_id(1) / groups_y_;
+
+        int groupId_x = g.get_group_id(0) - idx2 * groups_x_;
+        int groupId_y = g.get_group_id(1) - idx3 * groups_y_;
+
+        int oidx0 = it.get_local_id(0) + g.get_local_range(0) * groupId_x;
+        int oidx1 = it.get_local_id(1) + g.get_local_range(1) * groupId_y;
+
+        T *optr = optrAcc_.get_pointer() + idx2 * out_.strides[2] +
+                  idx3 * out_.strides[3] + out_.offset;
+        const T *iptr = iptrAcc_.get_pointer() + idx2 * in_.strides[2] +
+                        idx3 * in_.strides[3] + in_.offset;
+
+        if (oidx0 >= out_.dims[0] || oidx1 >= out_.dims[1]) return;
+
+        int pidx0 = oidx0 + px_;
+        int pidx1 = oidx1 + py_;
+
+        // The last time a value appears in the unwrapped index is
+        // padded_index / stride. Each previous index has the value appear
+        // "stride" locations earlier, so we work back from the last index.
+
+        const int x_end = sycl::min(pidx0 / sx_, nx_ - 1);
+        const int y_end = sycl::min(pidx1 / sy_, ny_ - 1);
+
+        const int x_off = pidx0 - sx_ * x_end;
+        const int y_off = pidx1 - sy_ * y_end;
+
+        T val   = (T)0;
+        int idx = 1;
+
+        for (int y = y_end, yo = y_off; y >= 0 && yo < wy_; yo += sy_, y--) {
+            int win_end_y = yo * wx_;
+            int dim_end_y = y * nx_;
+
+            for (int x = x_end, xo = x_off; x >= 0 && xo < wx_;
+                 xo += sx_, x--) {
+                int win_end = win_end_y + xo;
+                int dim_end = dim_end_y + x;
+
+                if (is_column_) {
+                    idx = dim_end * in_.strides[1] + win_end;
+                } else {
+                    idx = dim_end + win_end * in_.strides[1];
+                }
+
+                // No need to include anything special for complex
+                // Add for complex numbers is just vector add of reals
+                // Might need to change if we generalize add to more binary ops
+                val = val + iptr[idx];
+            }
+        }
+
+        optr[oidx1 * out_.strides[1] + oidx0] = val;
+    }
+
+   private:
+    write_accessor<T> optrAcc_;
+    KParam out_;
+    read_accessor<T> iptrAcc_;
+    KParam in_;
+    const int wx_;
+    const int wy_;
+    const int sx_;
+    const int sy_;
+    const int px_;
+    const int py_;
+    const int nx_;
+    const int ny_;
+    int groups_x_;
+    int groups_y_;
+    const bool is_column_;
+};
+
+template<typename T>
+void wrap(Param<T> out, const Param<T> in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    dim_t nx = (out.info.dims[0] + 2 * px - wx) / sx + 1;
+    dim_t ny = (out.info.dims[1] + 2 * py - wy) / sy + 1;
+
+    auto local = sycl::range{THREADS_X, THREADS_Y};
+
+    dim_t groups_x = divup(out.info.dims[0], local[0]);
+    dim_t groups_y = divup(out.info.dims[1], local[1]);
+
+    auto global = sycl::range{groups_x * local[0] * out.info.dims[2],
+                              groups_y * local[1] * out.info.dims[3]};
+
+    auto Q = getQueue();
+    Q.submit([&](sycl::handler &h) {
+        sycl::accessor outAcc{*out.data, h, sycl::write_only, sycl::no_init};
+        sycl::accessor inAcc{*in.data, h, sycl::read_only};
+        h.parallel_for(sycl::nd_range{global, local},
+                       wrapCreateKernel<T>(outAcc, out.info, inAcc, in.info, wx,
+                                           wy, sx, sy, px, py, nx, ny, groups_x,
+                                           groups_y, is_column));
+    });
+    ONEAPI_DEBUG_FINISH(Q);
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/kernel/wrap_dilated.hpp b/src/backend/oneapi/kernel/wrap_dilated.hpp
new file mode 100644
index 0000000000..41112fbce4
--- /dev/null
+++ b/src/backend/oneapi/kernel/wrap_dilated.hpp
@@ -0,0 +1,176 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/default_config.hpp>
+#include <math.hpp>
+
+#include <sycl/sycl.hpp>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+namespace kernel {
+
+template<typename T>
+class wrapDilatedCreateKernel {
+   public:
+    wrapDilatedCreateKernel(write_accessor<data_t<T>> optrAcc, KParam out,
+                            read_accessor<data_t<T>> iptrAcc, KParam in,
+                            const int wx, const int wy, const int sx,
+                            const int sy, const int px, const int py,
+                            const int dx, const int dy, const int nx,
+                            const int ny, int groups_x, int groups_y,
+                            const bool is_column)
+        : optrAcc_(optrAcc)
+        , out_(out)
+        , iptrAcc_(iptrAcc)
+        , in_(in)
+        , wx_(wx)
+        , wy_(wy)
+        , sx_(sx)
+        , sy_(sy)
+        , px_(px)
+        , py_(py)
+        , dx_(dx)
+        , dy_(dy)
+        , nx_(nx)
+        , ny_(ny)
+        , groups_x_(groups_x)
+        , groups_y_(groups_y)
+        , is_column_(is_column) {}
+    void operator()(sycl::nd_item<2> it) const {
+        sycl::group g = it.get_group();
+
+        int idx2 = g.get_group_id(0) / groups_x_;
+        int idx3 = g.get_group_id(1) / groups_y_;
+
+        int groupId_x = g.get_group_id(0) - idx2 * groups_x_;
+        int groupId_y = g.get_group_id(1) - idx3 * groups_y_;
+
+        int oidx0 = it.get_local_id(0) + g.get_local_range(0) * groupId_x;
+        int oidx1 = it.get_local_id(1) + g.get_local_range(1) * groupId_y;
+
+        data_t<T> *optr = optrAcc_.get_pointer() + idx2 * out_.strides[2] +
+                          idx3 * out_.strides[3];
+        const data_t<T> *iptr = iptrAcc_.get_pointer() + idx2 * in_.strides[2] +
+                                idx3 * in_.strides[3] + in_.offset;
+
+        if (oidx0 >= out_.dims[0] || oidx1 >= out_.dims[1]) return;
+
+        int eff_wx = wx_ + (wx_ - 1) * (dx_ - 1);
+        int eff_wy = wy_ + (wy_ - 1) * (dy_ - 1);
+
+        int pidx0 = oidx0 + px_;
+        int pidx1 = oidx1 + py_;
+
+        // The last time a value appears in the unwrapped index is
+        // padded_index / stride. Each previous index has the value appear
+        // "stride" locations earlier, so we work back from the last index.
+
+        const int y_start = (pidx1 < eff_wy) ? 0 : (pidx1 - eff_wy) / sy_ + 1;
+        const int y_end   = sycl::min(pidx1 / sy_ + 1, ny_);
+
+        const int x_start = (pidx0 < eff_wx) ? 0 : (pidx0 - eff_wx) / sx_ + 1;
+        const int x_end   = sycl::min(pidx0 / sx_ + 1, nx_);
+
+        compute_t<T> val(0);
+        int idx = 1;
+
+        for (int y = y_start; y < y_end; y++) {
+            int fy      = (pidx1 - y * sy_);
+            bool yvalid = (fy % dy_ == 0) && (y < ny_);
+            fy /= dy_;
+
+            int win_end_y = fy * wx_;
+            int dim_end_y = y * nx_;
+
+            for (int x = x_start; x < x_end; x++) {
+                int fx      = (pidx0 - x * sx_);
+                bool xvalid = (fx % dx_ == 0) && (x < nx_);
+                fx /= dx_;
+
+                int win_end = win_end_y + fx;
+                int dim_end = dim_end_y + x;
+
+                if (is_column_) {
+                    idx = dim_end * in_.strides[1] + win_end;
+                } else {
+                    idx = dim_end + win_end * in_.strides[1];
+                }
+
+                compute_t<T> ival;
+                ival = (yvalid && xvalid) ? iptr[idx] : compute_t<T>(0);
+                val  = val + ival;
+            }
+        }
+
+        optr[oidx1 * out_.strides[1] + oidx0] = val;
+    }
+
+   private:
+    write_accessor<data_t<T>> optrAcc_;
+    KParam out_;
+    read_accessor<data_t<T>> iptrAcc_;
+    KParam in_;
+    const int wx_;
+    const int wy_;
+    const int sx_;
+    const int sy_;
+    const int px_;
+    const int py_;
+    const int dx_;
+    const int dy_;
+    const int nx_;
+    const int ny_;
+    int groups_x_;
+    int groups_y_;
+    const bool is_column_;
+};
+
+template<typename T>
+void wrap_dilated(Param<T> out, const Param<T> in, const dim_t wx,
+                  const dim_t wy, const dim_t sx, const dim_t sy,
+                  const dim_t px, const dim_t py, const dim_t dx,
+                  const dim_t dy, const bool is_column) {
+    dim_t nx = 1 + (out.info.dims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (out.info.dims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    auto local = sycl::range{THREADS_X, THREADS_Y};
+
+    dim_t groups_x = divup(out.info.dims[0], local[0]);
+    dim_t groups_y = divup(out.info.dims[1], local[1]);
+
+    auto global = sycl::range{local[0] * groups_x * out.info.dims[2],
+                              local[1] * groups_y * out.info.dims[3]};
+
+    auto Q = getQueue();
+    Q.submit([&](sycl::handler &h) {
+        write_accessor<data_t<T>> outAcc =
+            out.template get_accessor<sycl::access_mode::write>(h);
+        read_accessor<data_t<T>> inAcc =
+            in.template get_accessor<sycl::access_mode::read>(h);
+        h.parallel_for(sycl::nd_range{global, local},
+                       wrapDilatedCreateKernel<T>(
+                           outAcc, out.info, inAcc, in.info, wx, wy, sx, sy, px,
+                           py, dx, dy, nx, ny, groups_x, groups_y, is_column));
+    });
+    ONEAPI_DEBUG_FINISH(Q);
+}
+
+}  // namespace kernel
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/logic.hpp b/src/backend/oneapi/logic.hpp
new file mode 100644
index 0000000000..650d079159
--- /dev/null
+++ b/src/backend/oneapi/logic.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <err_oneapi.hpp>
+#include <optypes.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T, af_op_t op>
+Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs,
+                    const af::dim4 &odims) {
+    return common::createBinaryNode<char, T, op>(lhs, rhs, odims);
+}
+
+template<typename T, af_op_t op>
+Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/lookup.cpp b/src/backend/oneapi/lookup.cpp
new file mode 100644
index 0000000000..da658e12aa
--- /dev/null
+++ b/src/backend/oneapi/lookup.cpp
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <lookup.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/lookup.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename in_t, typename idx_t>
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim) {
+    const dim4 &iDims = input.dims();
+
+    dim4 oDims(1);
+    for (dim_t d = 0; d < 4; ++d) {
+        oDims[d] = (d == dim ? indices.elements() : iDims[d]);
+    }
+
+    Array<in_t> out = createEmptyArray<in_t>(oDims);
+
+    kernel::lookup<in_t, idx_t>(out, input, indices, dim);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> lookup<T, float>(const Array<T> &, const Array<float> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, double>(                                       \
+        const Array<T> &, const Array<double> &, const unsigned);              \
+    template Array<T> lookup<T, int>(const Array<T> &, const Array<int> &,     \
+                                     const unsigned);                          \
+    template Array<T> lookup<T, unsigned>(                                     \
+        const Array<T> &, const Array<unsigned> &, const unsigned);            \
+    template Array<T> lookup<T, short>(const Array<T> &, const Array<short> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, ushort>(                                       \
+        const Array<T> &, const Array<ushort> &, const unsigned);              \
+    template Array<T> lookup<T, intl>(const Array<T> &, const Array<intl> &,   \
+                                      const unsigned);                         \
+    template Array<T> lookup<T, uintl>(const Array<T> &, const Array<uintl> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, schar>(const Array<T> &, const Array<schar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, uchar>(const Array<T> &, const Array<uchar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, half>(const Array<T> &, const Array<half> &,   \
+                                      const unsigned)
+
+INSTANTIATE(float);
+INSTANTIATE(cfloat);
+INSTANTIATE(double);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(unsigned);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(ushort);
+INSTANTIATE(short);
+INSTANTIATE(half);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/lookup.hpp b/src/backend/oneapi/lookup.hpp
new file mode 100644
index 0000000000..78d8da1ac1
--- /dev/null
+++ b/src/backend/oneapi/lookup.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename in_t, typename idx_t>
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/lu.cpp b/src/backend/oneapi/lu.cpp
new file mode 100644
index 0000000000..27e6bd4bf3
--- /dev/null
+++ b/src/backend/oneapi/lu.cpp
@@ -0,0 +1,140 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <lu.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <blas.hpp>
+#include <copy.hpp>
+#include <kernel/lu_split.hpp>
+#include <memory.hpp>
+#include <oneapi/mkl/lapack.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+Array<int> convertPivot(sycl::buffer<int64_t> &pivot, int in_sz, int out_sz,
+                        bool convert_pivot) {
+    std::vector<int> d_po(out_sz);
+    for (int i = 0; i < out_sz; i++) { d_po[i] = i; }
+
+    auto d_pi = pivot.get_host_access();
+
+    if (convert_pivot) {
+        for (int j = 0; j < in_sz; j++) {
+            // 1 indexed in pivot
+            std::swap(d_po[j], d_po[d_pi[j] - 1]);
+        }
+
+        Array<int> res = createHostDataArray(dim4(out_sz), &d_po[0]);
+        return res;
+    } else {
+        d_po.resize(in_sz);
+        for (int j = 0; j < in_sz; j++) { d_po[j] = static_cast<int>(d_pi[j]); }
+    }
+    Array<int> res = createHostDataArray(dim4(in_sz), &d_po[0]);
+    return res;
+}
+
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+    int MN     = std::min(M, N);
+
+    Array<T> in_copy = copyArray<T>(in);
+    pivot            = lu_inplace(in_copy);
+
+    // SPLIT into lower and upper
+    dim4 ldims(M, MN);
+    dim4 udims(MN, N);
+    lower = createEmptyArray<T>(ldims);
+    upper = createEmptyArray<T>(udims);
+    kernel::lu_split<T>(lower, upper, in_copy);
+}
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
+    dim4 iDims    = in.dims();
+    dim4 iStrides = in.strides();
+    int64_t M     = iDims[0];
+    int64_t N     = iDims[1];
+    int64_t MN    = std::min(M, N);
+    int64_t LDA   = iStrides[1];
+
+    std::int64_t scratchpad_size =
+        ::oneapi::mkl::lapack::getrf_scratchpad_size<T>(getQueue(), M, N, LDA);
+
+    auto ipiv       = memAlloc<int64_t>(MN);
+    auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+
+    sycl::buffer<compute_t<T>> in_buffer =
+        in.template getBufferWithOffset<compute_t<T>>();
+    ::oneapi::mkl::lapack::getrf(getQueue(), M, N, in_buffer, LDA, *ipiv,
+                                 *scratchpad, scratchpad->size());
+
+    Array<int> pivot = convertPivot(*ipiv, MN, M, convert_pivot);
+    return pivot;
+}
+
+bool isLAPACKAvailable() { return true; }
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> & in,             \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> & lower, Array<T> & upper,      \
+                        Array<int> & pivot, const Array<T> &in);
+
+INSTANTIATE_LU(float)
+INSTANTIATE_LU(cfloat)
+INSTANTIATE_LU(double)
+INSTANTIATE_LU(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI backend",
+             AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI backend",
+             AF_ERR_NOT_CONFIGURED);
+}
+
+bool isLAPACKAvailable() { return false; }
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> & in,             \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> & lower, Array<T> & upper,      \
+                        Array<int> & pivot, const Array<T> &in);
+
+INSTANTIATE_LU(float)
+INSTANTIATE_LU(cfloat)
+INSTANTIATE_LU(double)
+INSTANTIATE_LU(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
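Reviewer sketch (not part of the patch): the pivot conversion in lu.cpp applies LAPACK-style 1-based row interchanges, in order, to an index vector. The standalone version below assumes the output starts as the identity permutation, since that initialization lies outside this excerpt. The name `pivotsToPermutation` is hypothetical and does not exist in ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper, not an ArrayFire function: expands a sequence of
// 1-based row interchanges (as getrf reports them) into a 0-based permutation.
std::vector<int> pivotsToPermutation(const std::vector<std::int64_t> &ipiv,
                                     int rows) {
    assert(rows >= 0 && ipiv.size() <= static_cast<std::size_t>(rows));
    std::vector<int> perm(rows);
    std::iota(perm.begin(), perm.end(), 0);  // assumed starting point: identity
    for (std::size_t j = 0; j < ipiv.size(); ++j) {
        // Entry j says "row j was swapped with row ipiv[j]", counted from 1.
        assert(ipiv[j] >= 1 && ipiv[j] <= rows);
        std::swap(perm[j], perm[ipiv[j] - 1]);
    }
    return perm;
}
```

The swaps have to be replayed in order because later interchanges act on rows that earlier ones may already have moved.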
diff --git a/src/backend/oneapi/lu.hpp b/src/backend/oneapi/lu.hpp
new file mode 100644
index 0000000000..a6b1eeb982
--- /dev/null
+++ b/src/backend/oneapi/lu.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in);
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
+
+bool isLAPACKAvailable();
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/match_template.cpp b/src/backend/oneapi/match_template.cpp
new file mode 100644
index 0000000000..10b84757ac
--- /dev/null
+++ b/src/backend/oneapi/match_template.cpp
@@ -0,0 +1,41 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <match_template.hpp>
+
+#include <err_oneapi.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType) {
+    ONEAPI_NOT_SUPPORTED("match_template Not supported");
+    Array<outType> out = createEmptyArray<outType>(sImg.dims());
+    return out;
+}
+
+#define INSTANTIATE(in_t, out_t)                       \
+    template Array<out_t> match_template<in_t, out_t>( \
+        const Array<in_t> &, const Array<in_t> &, const af::matchType);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/match_template.hpp b/src/backend/oneapi/match_template.hpp
new file mode 100644
index 0000000000..84ea6d337a
--- /dev/null
+++ b/src/backend/oneapi/match_template.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/math.cpp b/src/backend/oneapi/math.cpp
new file mode 100644
index 0000000000..18bafd324b
--- /dev/null
+++ b/src/backend/oneapi/math.cpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include "math.hpp"
+#include <common/half.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+cfloat division(cfloat lhs, double rhs) {
+    cfloat retVal(real(lhs) / rhs, imag(lhs) / rhs);
+    return retVal;
+}
+
+cdouble division(cdouble lhs, double rhs) {
+    cdouble retVal(real(lhs) / rhs, imag(lhs) / rhs);
+    return retVal;
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/math.hpp b/src/backend/oneapi/math.hpp
new file mode 100644
index 0000000000..7362874442
--- /dev/null
+++ b/src/backend/oneapi/math.hpp
@@ -0,0 +1,182 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/defines.hpp>
+#include <common/half.hpp>
+#include <af/defines.h>
+
+#include <backend.hpp>
+#include <types.hpp>
+
+#include <algorithm>
+#include <complex>
+#include <climits>
+#include <limits>
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#else
+/* Other */
+#endif
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+static inline T abs(T val) {
+    return std::abs(val);
+}
+template<typename T>
+static inline T min(T lhs, T rhs) {
+    return std::min(lhs, rhs);
+}
+template<typename T>
+static inline T max(T lhs, T rhs) {
+    return std::max(lhs, rhs);
+}
+
+template<typename T>
+static inline T division(T lhs, double rhs) {
+    return lhs / rhs;
+}
+cfloat division(cfloat lhs, double rhs);
+cdouble division(cdouble lhs, double rhs);
+
+template<>
+inline cfloat max<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
+
+template<>
+inline cdouble max<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
+
+template<>
+inline cfloat min<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
+
+template<>
+inline cdouble min<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
+
+template<typename T>
+static inline auto is_nan(const T &val) -> bool {
+    return false;
+}
+
+template<>
+inline auto is_nan<sycl::half>(const sycl::half &val) -> bool {
+    return sycl::isnan(val);
+}
+
+template<>
+inline auto is_nan<float>(const float &val) -> bool {
+    return sycl::isnan(val);
+}
+
+template<>
+inline auto is_nan<double>(const double &val) -> bool {
+    return sycl::isnan(val);
+}
+
+template<>
+inline auto is_nan<cfloat>(const cfloat &in) -> bool {
+    return sycl::isnan(real(in)) || sycl::isnan(imag(in));
+}
+
+template<>
+inline auto is_nan<cdouble>(const cdouble &in) -> bool {
+    return sycl::isnan(real(in)) || sycl::isnan(imag(in));
+}
+
+template<typename T>
+static T scalar(double val) {
+    return (T)(val);
+}
+
+template<>
+inline cfloat scalar<cfloat>(double val) {
+    cfloat cval(static_cast<float>(val));
+    return cval;
+}
+
+template<>
+inline cdouble scalar<cdouble>(double val) {
+    cdouble cval(val);
+    return cval;
+}
+
+template<typename To, typename Ti>
+static To scalar(Ti real, Ti imag) {
+    To cval(real, imag);
+    return cval;
+}
+
+template<typename T>
+inline T maxval() {
+    return std::numeric_limits<T>::max();
+}
+template<typename T>
+inline T minval() {
+    return std::numeric_limits<T>::min();
+}
+template<>
+inline float maxval() {
+    return std::numeric_limits<float>::infinity();
+}
+template<>
+inline double maxval() {
+    return std::numeric_limits<double>::infinity();
+}
+
+template<>
+inline arrayfire::common::half maxval() {
+    return std::numeric_limits<arrayfire::common::half>::infinity();
+}
+
+template<>
+inline float minval() {
+    return -std::numeric_limits<float>::infinity();
+}
+
+template<>
+inline double minval() {
+    return -std::numeric_limits<double>::infinity();
+}
+template<>
+inline sycl::half minval() {
+    return -1 * std::numeric_limits<sycl::half>::infinity();
+}
+
+template<typename T>
+static inline T real(T in) {
+    return std::real(in);
+}
+
+template<typename T>
+static inline T imag(T in) {
+    return std::imag(in);
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic pop
+#else
+/* Other */
+#endif
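Reviewer sketch (not part of the patch): math.hpp specializes `minval` and `maxval` for floating-point types to return infinities. One likely reason is that `std::numeric_limits<float>::min()` is the smallest positive normal value, not the most negative one, so it cannot serve as the starting value of a max reduction. The names `reductionFloor`, `maxReduce` and `reductionExample` below are hypothetical and only illustrate that point.

```cpp
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-in for minval<T>(): -infinity where the type has one,
// numeric_limits<T>::min() otherwise (the most negative value for integers).
template<typename T>
T reductionFloor() {
    return std::numeric_limits<T>::has_infinity
               ? -std::numeric_limits<T>::infinity()
               : std::numeric_limits<T>::min();
}

// Max reduction that starts from the floor, so an empty input yields the floor.
template<typename T>
T maxReduce(const std::vector<T> &values) {
    T acc = reductionFloor<T>();
    for (const T &v : values) {
        if (v > acc) { acc = v; }
    }
    return acc;
}

// Usage example: all-negative inputs still reduce to the right answer.
inline void reductionExample() {
    assert(std::numeric_limits<float>::min() > 0.0f);  // why min() is unsuitable
    assert(maxReduce<float>({-3.0f, -1.0f, -2.0f}) == -1.0f);
    assert(maxReduce<int>({-5, -7}) == -5);
}
```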
diff --git a/src/backend/oneapi/max.cpp b/src/backend/oneapi/max.cpp
new file mode 100644
index 0000000000..fa21d78c1c
--- /dev/null
+++ b/src/backend/oneapi/max.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// max
+INSTANTIATE(af_max_t, float, float)
+INSTANTIATE(af_max_t, double, double)
+INSTANTIATE(af_max_t, cfloat, cfloat)
+INSTANTIATE(af_max_t, cdouble, cdouble)
+INSTANTIATE(af_max_t, int, int)
+INSTANTIATE(af_max_t, uint, uint)
+INSTANTIATE(af_max_t, intl, intl)
+INSTANTIATE(af_max_t, uintl, uintl)
+INSTANTIATE(af_max_t, char, char)
+INSTANTIATE(af_max_t, schar, schar)
+INSTANTIATE(af_max_t, uchar, uchar)
+INSTANTIATE(af_max_t, short, short)
+INSTANTIATE(af_max_t, ushort, ushort)
+INSTANTIATE(af_max_t, half, half)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/mean.cpp b/src/backend/oneapi/mean.cpp
new file mode 100644
index 0000000000..2f94101f56
--- /dev/null
+++ b/src/backend/oneapi/mean.cpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <mean.hpp>
+
+#include <common/half.hpp>
+#include <kernel/mean.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+using std::swap;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in) {
+    return kernel::mean_all<Ti, Tw, To>(in);
+}
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts) {
+    return kernel::mean_all_weighted<T, Tw>(in, wts);
+}
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::mean<Ti, Tw, To>(out, in, dim);
+    return out;
+}
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim) {
+    dim4 odims   = in.dims();
+    odims[dim]   = 1;
+    Array<T> out = createEmptyArray<T>(odims);
+    kernel::mean_weighted<T, Tw, T>(out, in, wts, dim);
+    return out;
+}
+
+#define INSTANTIATE(Ti, Tw, To)                        \
+    template To mean<Ti, Tw, To>(const Array<Ti>& in); \
+    template Array<To> mean<Ti, Tw, To>(const Array<Ti>& in, const int dim);
+
+INSTANTIATE(double, double, double);
+INSTANTIATE(float, float, float);
+INSTANTIATE(int, float, float);
+INSTANTIATE(unsigned, float, float);
+INSTANTIATE(intl, double, double);
+INSTANTIATE(uintl, double, double);
+INSTANTIATE(short, float, float);
+INSTANTIATE(ushort, float, float);
+INSTANTIATE(schar, float, float);
+INSTANTIATE(uchar, float, float);
+INSTANTIATE(char, float, float);
+INSTANTIATE(cfloat, float, cfloat);
+INSTANTIATE(cdouble, double, cdouble);
+INSTANTIATE(half, float, half);
+INSTANTIATE(half, float, float);
+
+#define INSTANTIATE_WGT(T, Tw)                                              \
+    template T mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts);       \
+    template Array<T> mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts, \
+                                  const int dim);
+
+INSTANTIATE_WGT(double, double);
+INSTANTIATE_WGT(float, float);
+INSTANTIATE_WGT(cfloat, float);
+INSTANTIATE_WGT(cdouble, double);
+INSTANTIATE_WGT(half, float);
+
+}  // namespace oneapi
+}  // namespace arrayfire
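Reviewer sketch (not part of the patch): the weighted overloads in mean.cpp forward to `kernel::mean_all_weighted` and `kernel::mean_weighted`, whose bodies are not shown in this excerpt. Assuming they compute the textbook weighted mean, sum(w * x) / sum(w), a host-side reference for checking results could look like the following. `weightedMeanRef` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not the oneAPI kernel:
// textbook weighted mean, sum(w[i] * x[i]) / sum(w[i]).
double weightedMeanRef(const std::vector<double> &values,
                       const std::vector<double> &weights) {
    assert(values.size() == weights.size());
    double weightedSum = 0.0;
    double weightTotal = 0.0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        weightedSum += weights[i] * values[i];
        weightTotal += weights[i];
    }
    return weightedSum / weightTotal;
}
```

With equal weights this reduces to the ordinary arithmetic mean, which gives a quick sanity check against the unweighted overloads.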
diff --git a/src/backend/oneapi/mean.hpp b/src/backend/oneapi/mean.hpp
new file mode 100644
index 0000000000..1ff66440b5
--- /dev/null
+++ b/src/backend/oneapi/mean.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in);
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts);
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim);
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/meanshift.cpp b/src/backend/oneapi/meanshift.cpp
new file mode 100644
index 0000000000..825b26eb88
--- /dev/null
+++ b/src/backend/oneapi/meanshift.cpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/meanshift.hpp>
+#include <meanshift.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor) {
+    const dim4 &dims = in.dims();
+    Array<T> out     = createEmptyArray<T>(dims);
+    kernel::meanshift<T>(out, in, spatialSigma, chromaticSigma, numIterations,
+                         isColor);
+    return out;
+}
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> meanshift<T>(const Array<T> &, const float &, \
+                                   const float &, const unsigned &, \
+                                   const bool &);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/meanshift.hpp b/src/backend/oneapi/meanshift.hpp
new file mode 100644
index 0000000000..dbe26b4c85
--- /dev/null
+++ b/src/backend/oneapi/meanshift.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/medfilt.cpp b/src/backend/oneapi/medfilt.cpp
new file mode 100644
index 0000000000..50c2cc3dd8
--- /dev/null
+++ b/src/backend/oneapi/medfilt.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/medfilt.hpp>
+#include <medfilt.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType pad) {
+    ONEAPI_NOT_SUPPORTED("medfilt1 Not supported");
+
+    // ARG_ASSERT(2, (w_wid <= kernel::MAX_MEDFILTER1_LEN));
+    // ARG_ASSERT(2, (w_wid % 2 != 0));
+
+    const dim4 &dims = in.dims();
+
+    Array<T> out = createEmptyArray<T>(dims);
+
+    // kernel::medfilt1<T>(out, in, w_wid, pad);
+
+    return out;
+}
+
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType pad) {
+    ONEAPI_NOT_SUPPORTED("medfilt2 Not supported");
+
+    // ARG_ASSERT(2, (w_len % 2 != 0));
+    // ARG_ASSERT(2, (w_len <= kernel::MAX_MEDFILTER2_LEN));
+
+    Array<T> out = createEmptyArray<T>(in.dims());
+    // kernel::medfilt2<T>(out, in, pad, w_len, w_wid);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> medfilt1<T>(const Array<T> &in, const int w_wid, \
+                                  const af::borderType);               \
+    template Array<T> medfilt2<T>(const Array<T> &in, const int w_len, \
+                                  const int w_wid, const af::borderType);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/medfilt.hpp b/src/backend/oneapi/medfilt.hpp
new file mode 100644
index 0000000000..eb459f7dd9
--- /dev/null
+++ b/src/backend/oneapi/medfilt.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType edge_pad);
+
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType edge_pad);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/memory.cpp b/src/backend/oneapi/memory.cpp
new file mode 100644
index 0000000000..3482742b73
--- /dev/null
+++ b/src/backend/oneapi/memory.cpp
@@ -0,0 +1,227 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/Logger.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <errorcodes.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <spdlog/spdlog.h>
+#include <types.hpp>
+#include <af/dim4.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <utility>
+
+using arrayfire::common::bytesToString;
+
+using af::dim4;
+using std::function;
+using std::move;
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace oneapi {
+float getMemoryPressure() { return memoryManager().getMemoryPressure(); }
+float getMemoryPressureThreshold() {
+    return memoryManager().getMemoryPressureThreshold();
+}
+
+bool jitTreeExceedsMemoryPressure(size_t bytes) {
+    return memoryManager().jitTreeExceedsMemoryPressure(bytes);
+}
+
+void setMemStepSize(size_t step_bytes) {
+    memoryManager().setMemStepSize(step_bytes);
+}
+
+size_t getMemStepSize() { return memoryManager().getMemStepSize(); }
+
+void signalMemoryCleanup() { memoryManager().signalMemoryCleanup(); }
+
+void shutdownMemoryManager() { memoryManager().shutdown(); }
+
+void shutdownPinnedMemoryManager() { pinnedMemoryManager().shutdown(); }
+
+void printMemInfo(const char *msg, const int device) {
+    memoryManager().printInfo(msg, device);
+}
+
+template<typename T>
+// unique_ptr<cl::Buffer, function<void(cl::Buffer *)>> memAlloc(
+// unique_ptr<int, function<void(int *)>> memAlloc(
+std::unique_ptr<sycl::buffer<T>, std::function<void(sycl::buffer<T> *)>>
+memAlloc(const size_t &elements) {
+    if (elements) {
+        dim4 dims(elements);
+
+        // The alloc function returns a pointer to a buffer<std::byte> object.
+        // We need to reinterpret that object into buffer<T> while keeping the
+        // same pointer value for memory accounting purposes. We achieve this
+        // by assigning the reinterpreted buffer back into the original
+        // pointer. This deletes the buffer<std::byte> object and replaces it
+        // with the buffer<T> object. We do the reverse in the memFree function.
+        auto *ptr = static_cast<sycl::buffer<std::byte> *>(
+            memoryManager().alloc(false, 1, dims.get(), sizeof(T)));
+        sycl::buffer<T> *optr = static_cast<sycl::buffer<T> *>((void *)ptr);
+        size_t bytes          = ptr->byte_size();
+
+        // TODO(umar): This could be a DANGEROUS function because we are calling
+        // delete on the reinterpreted buffer<T> instead of the original
+        // buffer<byte> object
+        *optr = ptr->template reinterpret<T>(sycl::range(bytes / sizeof(T)));
+        return unique_ptr<sycl::buffer<T>, function<void(sycl::buffer<T> *)>>(
+            optr, memFree<T>);
+    } else {
+        return unique_ptr<sycl::buffer<T>, function<void(sycl::buffer<T> *)>>(
+            nullptr, memFree<T>);
+    }
+}
+
+void *memAllocUser(const size_t &bytes) {
+    dim4 dims(bytes);
+    void *ptr = memoryManager().alloc(true, 1, dims.get(), 1);
+    return ptr;
+}
+
+template<typename T>
+void memFree(sycl::buffer<T> *ptr) {
+    if (ptr) {
+        sycl::buffer<std::byte> *optr =
+            static_cast<sycl::buffer<std::byte> *>((void *)ptr);
+        size_t bytes = ptr->byte_size();
+        *optr        = ptr->template reinterpret<std::byte>(sycl::range(bytes));
+        memoryManager().unlock(optr, false);
+    }
+}
+
+void memFreeUser(void *ptr) { memoryManager().unlock(ptr, true); }
+
+template<typename T>
+void memLock(const sycl::buffer<T> *ptr) {
+    memoryManager().userLock(static_cast<const void *>(ptr));
+}
+
+template<typename T>
+void memUnlock(const sycl::buffer<T> *ptr) {
+    memoryManager().userUnlock(static_cast<const void *>(ptr));
+}
+
+bool isLocked(const void *ptr) {
+    return memoryManager().isUserLocked(const_cast<void *>(ptr));
+}
+
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers) {
+    memoryManager().usageInfo(alloc_bytes, alloc_buffers, lock_bytes,
+                              lock_buffers);
+}
+
+template<typename T>
+T *pinnedAlloc(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), sizeof(T));
+    return static_cast<T *>(ptr);
+}
+
+void pinnedFree(void *ptr) { pinnedMemoryManager().unlock(ptr, false); }
+
+// template unique_ptr<int, function<void(int *)>> memAlloc<T>(
+#define INSTANTIATE(T)                                               \
+    template std::unique_ptr<sycl::buffer<T>,                        \
+                             std::function<void(sycl::buffer<T> *)>> \
+    memAlloc(const size_t &elements);                                \
+    template T *pinnedAlloc(const size_t &elements);                 \
+    template void memLock(const sycl::buffer<T> *buf);               \
+    template void memUnlock(const sycl::buffer<T> *buf);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(arrayfire::common::half)
+INSTANTIATE(int64_t)
+
+template<>
+void *pinnedAlloc<void>(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), 1);
+    return ptr;
+}
+
+Allocator::Allocator() { logger = common::loggerFactory("mem"); }
+
+void Allocator::shutdown() { shutdownMemoryManager(); }
+
+int Allocator::getActiveDeviceId() { return oneapi::getActiveDeviceId(); }
+
+size_t Allocator::getMaxMemorySize(int id) {
+    return oneapi::getDeviceMemorySize(id);
+}
+
+void *Allocator::nativeAlloc(const size_t bytes) {
+    auto *ptr = new sycl::buffer<std::byte>(sycl::range(bytes));
+    AF_TRACE("nativeAlloc: {} {}", bytesToString(bytes),
+             static_cast<void *>(ptr));
+    return ptr;
+}
+
+void Allocator::nativeFree(void *ptr) {
+    auto *buf = static_cast<sycl::buffer<std::byte> *>(ptr);
+    AF_TRACE("nativeFree:          {}", ptr);
+    delete buf;
+}
+
+AllocatorPinned::AllocatorPinned() { logger = common::loggerFactory("mem"); }
+
+void AllocatorPinned::shutdown() { shutdownPinnedMemoryManager(); }
+
+int AllocatorPinned::getActiveDeviceId() { return oneapi::getActiveDeviceId(); }
+
+size_t AllocatorPinned::getMaxMemorySize(int id) {
+    return oneapi::getDeviceMemorySize(id);
+}
+
+void *AllocatorPinned::nativeAlloc(const size_t bytes) {
+    void *ptr = NULL;
+    try {
+        ptr = sycl::malloc_host<unsigned char>(bytes, getQueue());
+    } catch (...) {
+        auto str = fmt::format("Failed to allocate device memory of size {}",
+                               bytesToString(bytes));
+        AF_ERROR(str, AF_ERR_NO_MEM);
+    }
+    AF_TRACE("Pinned::nativeAlloc: {:>7} {}", bytesToString(bytes), ptr);
+    return ptr;
+}
+
+void AllocatorPinned::nativeFree(void *ptr) {
+    AF_TRACE("Pinned::nativeFree:          {}", ptr);
+    try {
+        sycl::free(ptr, getQueue());
+    } catch (...) {
+        AF_ERROR("Failed to release device memory.", AF_ERR_RUNTIME);
+    }
+}
+}  // namespace oneapi
+}  // namespace arrayfire
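Reviewer sketch (not part of the patch): `memAlloc` returns a `std::unique_ptr` whose deleter hands the buffer back to the memory manager instead of destroying it. The toy below shows the same ownership pattern without SYCL. `ToyPool`, `ToyPtr` and `toyAlloc` are hypothetical names, and the pool is far simpler than ArrayFire's memory manager.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Hypothetical pool: hands out heap buffers and takes them back on unlock.
struct ToyPool {
    std::vector<std::vector<int> *> freeList;
    int unlockCount = 0;

    std::vector<int> *alloc(std::size_t n) { return new std::vector<int>(n); }

    void unlock(std::vector<int> *p) {
        assert(p != nullptr);
        ++unlockCount;
        freeList.push_back(p);  // kept for reuse, not deleted here
    }

    ~ToyPool() {
        for (auto *p : freeList) { delete p; }
    }
};

using ToyPtr = std::unique_ptr<std::vector<int>,
                               std::function<void(std::vector<int> *)>>;

// Mirrors the shape of memAlloc: zero elements yields a null pointer that
// still carries a valid deleter, anything else is returned to the pool.
ToyPtr toyAlloc(ToyPool &pool, std::size_t n) {
    auto release = [&pool](std::vector<int> *p) { pool.unlock(p); };
    if (n == 0) { return ToyPtr(nullptr, release); }
    return ToyPtr(pool.alloc(n), release);
}
```

`std::unique_ptr` only invokes its deleter when the held pointer is non-null, so the empty case never reaches the pool. The pool must outlive every `ToyPtr` created from it, because the deleter captures it by reference.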
diff --git a/src/backend/oneapi/memory.hpp b/src/backend/oneapi/memory.hpp
new file mode 100644
index 0000000000..ebe5f2403b
--- /dev/null
+++ b/src/backend/oneapi/memory.hpp
@@ -0,0 +1,92 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/AllocatorInterface.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <cstdlib>
+#include <functional>
+#include <map>
+#include <memory>
+#include <vector>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+using bufptr =
+    std::unique_ptr<sycl::buffer<T>, std::function<void(sycl::buffer<T> *)>>;
+
+template<typename T>
+bufptr<T> memAlloc(const size_t &elements);
+void *memAllocUser(const size_t &bytes);
+
+// These need to be two separate functions rather than one function with a
+// default argument, because they are used as smart-pointer deleters, which
+// cannot rely on default arguments.
+template<typename T>
+void memFree(sycl::buffer<T> *ptr);
+void memFreeUser(void *ptr);
+
+template<typename T>
+void memLock(const sycl::buffer<T> *ptr);
+
+template<typename T>
+void memUnlock(const sycl::buffer<T> *ptr);
+
+bool isLocked(const void *ptr);
+
+template<typename T>
+T *pinnedAlloc(const size_t &elements);
+
+void pinnedFree(void *ptr);
+
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers);
+void signalMemoryCleanup();
+void shutdownMemoryManager();
+void pinnedGarbageCollect();
+
+void printMemInfo(const char *msg, const int device);
+
+float getMemoryPressure();
+float getMemoryPressureThreshold();
+bool jitTreeExceedsMemoryPressure(size_t bytes);
+void setMemStepSize(size_t step_bytes);
+size_t getMemStepSize(void);
+
+class Allocator final : public common::AllocatorInterface {
+   public:
+    Allocator();
+    ~Allocator() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+};
+
+class AllocatorPinned final : public common::AllocatorInterface {
+   public:
+    AllocatorPinned();
+    ~AllocatorPinned() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+
+   private:
+    std::vector<std::map<void *, void *>> pinnedMaps;
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/min.cpp b/src/backend/oneapi/min.cpp
new file mode 100644
index 0000000000..fe1a5a3fa4
--- /dev/null
+++ b/src/backend/oneapi/min.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// min
+INSTANTIATE(af_min_t, float, float)
+INSTANTIATE(af_min_t, double, double)
+INSTANTIATE(af_min_t, cfloat, cfloat)
+INSTANTIATE(af_min_t, cdouble, cdouble)
+INSTANTIATE(af_min_t, int, int)
+INSTANTIATE(af_min_t, uint, uint)
+INSTANTIATE(af_min_t, intl, intl)
+INSTANTIATE(af_min_t, uintl, uintl)
+INSTANTIATE(af_min_t, char, char)
+INSTANTIATE(af_min_t, schar, schar)
+INSTANTIATE(af_min_t, uchar, uchar)
+INSTANTIATE(af_min_t, short, short)
+INSTANTIATE(af_min_t, ushort, ushort)
+INSTANTIATE(af_min_t, half, half)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/minmax_op.hpp b/src/backend/oneapi/minmax_op.hpp
new file mode 100644
index 0000000000..40159d3ec9
--- /dev/null
+++ b/src/backend/oneapi/minmax_op.hpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/Binary.hpp>
+#include <math.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+static double cabs(const T &in) {
+    return (double)in;
+}
+
+template<>
+double cabs<char>(const char &in) {
+    return (double)(in > 0);
+}
+
+template<>
+double cabs<cfloat>(const cfloat &in) {
+    return (double)abs(in);
+}
+
+template<>
+double cabs<cdouble>(const cdouble &in) {
+    return (double)abs(in);
+}
+
+template<af_op_t op, typename T>
+struct MinMaxOp {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        if (is_nan(val)) { m_val = common::Binary<compute_t<T>, op>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) < cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx > m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+template<typename T>
+struct MinMaxOp<af_max_t, T> {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {
+        if (is_nan(val)) { m_val = common::Binary<T, af_max_t>::init(); }
+    }
+
+    void operator()(T val, uint idx) {
+        if ((cabs(val) > cabs(m_val) ||
+             (cabs(val) == cabs(m_val) && idx <= m_idx))) {
+            m_val = val;
+            m_idx = idx;
+        }
+    }
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
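Review note on minmax_op.hpp: the two `operator()` bodies above break ties on the index differently. Reading the comparisons as written, the generic (min) version moves to the larger index on equal magnitude (`idx > m_idx`), while the `af_max_t` specialization moves to the smaller or equal index (`idx <= m_idx`). The sketch below reproduces only that comparison logic for plain `double` values. `MinTracker` and `MaxTracker` are hypothetical names used for illustration; NaN handling and `cabs` are left out, so this is not the backend's implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for the generic (min) MinMaxOp: on equal values the
// larger index replaces the stored one.
struct MinTracker {
    double m_val;
    unsigned m_idx;

    void operator()(double val, unsigned idx) {
        if (val < m_val || (val == m_val && idx > m_idx)) {
            m_val = val;
            m_idx = idx;
        }
    }
};

// Hypothetical stand-in for the af_max_t specialization: on equal values the
// smaller (or equal) index replaces the stored one.
struct MaxTracker {
    double m_val;
    unsigned m_idx;

    void operator()(double val, unsigned idx) {
        if (val > m_val || (val == m_val && idx <= m_idx)) {
            m_val = val;
            m_idx = idx;
        }
    }
};
```

Feeding the same value at several indices shows the asymmetry: `MinTracker` ends on the highest tied index and `MaxTracker` on the lowest. Whether that asymmetry is intended is not stated in the patch.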
diff --git a/src/backend/oneapi/moments.cpp b/src/backend/oneapi/moments.cpp
new file mode 100644
index 0000000000..76e385990b
--- /dev/null
+++ b/src/backend/oneapi/moments.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+// #include <debug_opencl.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/moments.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+static inline unsigned bitCount(unsigned v) {
+    v = v - ((v >> 1U) & 0x55555555U);
+    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
+    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
+}
+
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment) {
+    ONEAPI_NOT_SUPPORTED("moments Not supported");
+
+    in.eval();
+    dim4 odims, idims = in.dims();
+    dim_t moments_dim = bitCount(moment);
+
+    odims[0] = moments_dim;
+    odims[1] = 1;
+    odims[2] = idims[2];
+    odims[3] = idims[3];
+
+    Array<float> out = createValueArray<float>(odims, 0.f);
+    out.eval();
+
+    // kernel::moments<T>(out, in, moment);
+    return out;
+}
+
+#define INSTANTIATE(T)                                   \
+    template Array<float> moments<T>(const Array<T> &in, \
+                                     const af_moment_type moment);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace oneapi
+}  // namespace arrayfire
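Review note on moments.cpp: `bitCount` is a branch-free population count, and `moments()` uses its result as the first output dimension, so the output holds one row per bit set in `moment`. The sketch below copies the helper unchanged and wraps it in a hypothetical `momentsDim` function for illustration. It assumes `unsigned` is 32 bits wide, which the masks in the helper also imply, and the flag values in the usage note are examples rather than values taken from the ArrayFire headers.

```cpp
#include <cassert>

// Same bit-twiddling population count as bitCount() in moments.cpp.
// Pairs of bits are summed first, then nibbles, then bytes; the final
// multiply accumulates the four byte counts into the top byte.
static inline unsigned bitCount(unsigned v) {
    v = v - ((v >> 1U) & 0x55555555U);
    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
}

// Hypothetical helper: number of output rows for a bitmask of requested
// moments, mirroring `dim_t moments_dim = bitCount(moment);` in the patch.
static inline unsigned momentsDim(unsigned momentFlags) {
    return bitCount(momentFlags);
}
```

For example, a mask with a single bit set gives one output row, and a mask such as `0xF` with four bits set gives four.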
diff --git a/src/backend/oneapi/moments.hpp b/src/backend/oneapi/moments.hpp
new file mode 100644
index 0000000000..3dcf1e194f
--- /dev/null
+++ b/src/backend/oneapi/moments.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/morph.cpp b/src/backend/oneapi/morph.cpp
new file mode 100644
index 0000000000..11f3d3df7a
--- /dev/null
+++ b/src/backend/oneapi/morph.cpp
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/morph.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <morph.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    ONEAPI_NOT_SUPPORTED("morph Not supported");
+
+    // const dim4 mdims = mask.dims();
+    // if (mdims[0] != mdims[1]) {
+    //     OPENCL_NOT_SUPPORTED("Rectangular masks are not supported");
+    // }
+    // if (mdims[0] > 19) {
+    //     OPENCL_NOT_SUPPORTED("Kernels > 19x19 are not supported");
+    // }
+    const dim4 dims = in.dims();
+    Array<T> out    = createEmptyArray<T>(dims);
+    // kernel::morph<T>(out, in, mask, isDilation);
+    return out;
+}
+
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    ONEAPI_NOT_SUPPORTED("morph3d Not supported");
+
+    // const dim4 mdims = mask.dims();
+    // if (mdims[0] != mdims[1] || mdims[0] != mdims[2]) {
+    //     OPENCL_NOT_SUPPORTED("Only cubic masks are supported");
+    // }
+    // if (mdims[0] > 7) {
+    //     OPENCL_NOT_SUPPORTED("Kernels > 7x7x7 masks are not supported");
+    // }
+    Array<T> out = createEmptyArray<T>(in.dims());
+    // kernel::morph3d<T>(out, in, mask, isDilation);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                    \
+    template Array<T> morph<T>(const Array<T> &, const Array<T> &, bool); \
+    template Array<T> morph3d<T>(const Array<T> &, const Array<T> &, bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/morph.hpp b/src/backend/oneapi/morph.hpp
new file mode 100644
index 0000000000..47d3399f87
--- /dev/null
+++ b/src/backend/oneapi/morph.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation);
+
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/nearest_neighbour.cpp b/src/backend/oneapi/nearest_neighbour.cpp
new file mode 100644
index 0000000000..bec80b5cce
--- /dev/null
+++ b/src/backend/oneapi/nearest_neighbour.cpp
@@ -0,0 +1,91 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/nearest_neighbour.hpp>
+#include <math.hpp>
+#include <topk.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+// using cl::Device;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename To, af_match_type dist_type>
+void nearest_neighbour_(Array<uint>& idx, Array<To>& dist,
+                        const Array<T>& query, const Array<T>& train,
+                        const uint dist_dim, const uint n_dist) {
+    ONEAPI_NOT_SUPPORTED("nearest_neighbour_ Not supported");
+
+    uint sample_dim   = (dist_dim == 0) ? 1 : 0;
+    const dim4& qDims = query.dims();
+    const dim4& tDims = train.dims();
+
+    const dim4 outDims(n_dist, qDims[sample_dim]);
+    const dim4 distDims(tDims[sample_dim], qDims[sample_dim]);
+
+    Array<To> tmp_dists = createEmptyArray<To>(distDims);
+
+    idx  = createEmptyArray<uint>(outDims);
+    dist = createEmptyArray<To>(outDims);
+
+    Array<T> queryT = dist_dim == 0 ? transpose(query, false) : query;
+    Array<T> trainT = dist_dim == 0 ? transpose(train, false) : train;
+
+    // kernel::allDistances<T, To>(tmp_dists, queryT, trainT, 1, dist_type);
+
+    topk(dist, idx, tmp_dists, n_dist, 0, AF_TOPK_MIN);
+}
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist, const af_match_type dist_type) {
+    switch (dist_type) {
+        case AF_SAD:
+            nearest_neighbour_<T, To, AF_SAD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        case AF_SSD:
+            nearest_neighbour_<T, To, AF_SSD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        case AF_SHD:
+            nearest_neighbour_<T, To, AF_SHD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        default: AF_ERROR("Unsupported dist_type", AF_ERR_NOT_CONFIGURED);
+    }
+}
+
+#define INSTANTIATE(T, To)                                             \
+    template void nearest_neighbour<T, To>(                            \
+        Array<uint> & idx, Array<To> & dist, const Array<T>& query,    \
+        const Array<T>& train, const uint dist_dim, const uint n_dist, \
+        const af_match_type dist_type);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, uint)
+INSTANTIATE(intl, intl)
+INSTANTIATE(uintl, uintl)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, uint)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, uint)
+
+INSTANTIATE(uintl, uint)  // For Hamming
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/nearest_neighbour.hpp b/src/backend/oneapi/nearest_neighbour.hpp
new file mode 100644
index 0000000000..1af9889b00
--- /dev/null
+++ b/src/backend/oneapi/nearest_neighbour.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist,
+                       const af_match_type dist_type = AF_SSD);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/onefft.hpp b/src/backend/oneapi/onefft.hpp
new file mode 100644
index 0000000000..a31a91d1e1
--- /dev/null
+++ b/src/backend/oneapi/onefft.hpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <common/FFTPlanCache.hpp>
+#include <memory.hpp>
+#include <oneapi/mkl/dfti.hpp>
+
+#include <cstdint>
+
+namespace arrayfire {
+namespace oneapi {
+
+using ::oneapi::mkl::dft::domain;
+using ::oneapi::mkl::dft::precision;
+
+using PlanType   = std::shared_ptr<void>;
+using SharedPlan = std::shared_ptr<PlanType>;
+
+template<precision p, domain d>
+PlanType findPlan(int rank, const bool isInPlace, int *n,
+                  std::int64_t *istrides, int ibatch, std::int64_t *ostrides,
+                  int obatch, int nbatch);
+
+class PlanCache : public common::FFTPlanCache<PlanCache, PlanType> {
+    template<precision p, domain d>
+    friend PlanType findPlan(int rank, const bool isInPlace, int *n,
+                             std::int64_t *istrides, int ibatch,
+                             std::int64_t *ostrides, int obatch, int nbatch);
+};
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/orb.cpp b/src/backend/oneapi/orb.cpp
new file mode 100644
index 0000000000..b00cf0395f
--- /dev/null
+++ b/src/backend/oneapi/orb.cpp
@@ -0,0 +1,70 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/orb.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned orb(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
+             Array<float> &ori_out, Array<float> &size_out,
+             Array<uint> &desc_out, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img) {
+    ONEAPI_NOT_SUPPORTED("orb Not supported");
+    return 0;
+
+    // unsigned nfeat;
+
+    // Param x;
+    // Param y;
+    // Param score;
+    // Param ori;
+    // Param size;
+    // Param desc;
+
+    // kernel::orb<T, convAccT>(&nfeat, x, y, score, ori, size, desc, image,
+    //                          fast_thr, max_feat, scl_fctr, levels, blur_img);
+
+    // if (nfeat > 0) {
+    //     const dim4 out_dims(nfeat);
+    //     const dim4 desc_dims(8, nfeat);
+
+    //     x_out     = createParamArray<float>(x, true);
+    //     y_out     = createParamArray<float>(y, true);
+    //     score_out = createParamArray<float>(score, true);
+    //     ori_out   = createParamArray<float>(ori, true);
+    //     size_out  = createParamArray<float>(size, true);
+    //     desc_out  = createParamArray<unsigned>(desc, true);
+    // }
+
+    // return nfeat;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned orb<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,             \
+        Array<float> & ori, Array<float> & size, Array<uint> & desc,          \
+        const Array<T> &image, const float fast_thr, const unsigned max_feat, \
+        const float scl_fctr, const unsigned levels, const bool blur_img);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/orb.hpp b/src/backend/oneapi/orb.hpp
new file mode 100644
index 0000000000..ab29a6813b
--- /dev/null
+++ b/src/backend/oneapi/orb.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned orb(Array<float> &x, Array<float> &y, Array<float> &score,
+             Array<float> &orientation, Array<float> &size,
+             Array<unsigned> &desc, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/platform.cpp b/src/backend/oneapi/platform.cpp
new file mode 100644
index 0000000000..3994a907a5
--- /dev/null
+++ b/src/backend/oneapi/platform.cpp
@@ -0,0 +1,734 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <platform.hpp>
+
+#include <GraphicsResourceManager.hpp>
+#include <blas.hpp>
+#include <build_version.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/graphics_common.hpp>
+#include <common/host_memory.hpp>
+#include <common/util.hpp>
+#include <device_manager.hpp>
+#include <err_oneapi.hpp>
+#include <errorcodes.hpp>
+#include <memory.hpp>
+#include <onefft.hpp>
+#include <af/oneapi.h>
+#include <af/version.h>
+
+#ifdef OS_MAC
+#include <OpenGL/CGLCurrent.h>
+#endif
+
+#include <sycl/sycl.hpp>
+
+#include <cctype>
+#include <cstdlib>
+#include <functional>
+#include <map>
+#include <mutex>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <utility>
+#include <vector>
+
+using sycl::aspect;
+using sycl::context;
+using sycl::device;
+using sycl::platform;
+using sycl::queue;
+
+using std::begin;
+using std::call_once;
+using std::end;
+using std::endl;
+using std::find_if;
+using std::get;
+using std::make_pair;
+using std::make_unique;
+using std::map;
+using std::move;
+using std::once_flag;
+using std::ostringstream;
+using std::pair;
+using std::string;
+using std::to_string;
+using std::unique_ptr;
+using std::vector;
+
+using arrayfire::common::getEnvVar;
+using arrayfire::common::ltrim;
+using arrayfire::common::MemoryManagerBase;
+using arrayfire::oneapi::Allocator;
+using arrayfire::oneapi::AllocatorPinned;
+
+namespace arrayfire {
+namespace oneapi {
+
+static string get_system() {
+    string arch = (sizeof(void*) == 4) ? "32-bit " : "64-bit ";
+
+    return arch +
+#if defined(OS_LNX)
+           "Linux";
+#elif defined(OS_WIN)
+           "Windows";
+#elif defined(OS_MAC)
+           "Mac OSX";
+#endif
+}
+
+int getBackend() { return AF_BACKEND_ONEAPI; }
+
+bool verify_present(const string& pname, const string ref) {
+    auto iter =
+        search(begin(pname), end(pname), begin(ref), end(ref),
+               [](const string::value_type& l, const string::value_type& r) {
+                   return tolower(l) == tolower(r);
+               });
+
+    return iter != end(pname);
+}
+
+// TODO: update to new platforms?
+inline string platformMap(string& platStr) {
+    using strmap_t                = map<string, string>;
+    static const strmap_t platMap = {
+        make_pair("NVIDIA CUDA", "NVIDIA"),
+        make_pair("Intel(R) OpenCL", "INTEL"),
+        make_pair("AMD Accelerated Parallel Processing", "AMD"),
+        make_pair("Intel Gen OCL Driver", "BEIGNET"),
+        make_pair("Intel(R) OpenCL HD Graphics", "INTEL"),
+        make_pair("Apple", "APPLE"),
+        make_pair("Portable Computing Language", "POCL"),
+    };
+
+    auto idx = platMap.find(platStr);
+
+    if (idx == platMap.end()) {
+        return platStr;
+    } else {
+        return idx->second;
+    }
+}
+
+af_oneapi_platform getPlatformEnum(sycl::device dev) {
+    string pname = getPlatformName(dev);
+    if (verify_present(pname, "AMD"))
+        return AF_ONEAPI_PLATFORM_AMD;
+    else if (verify_present(pname, "NVIDIA"))
+        return AF_ONEAPI_PLATFORM_NVIDIA;
+    else if (verify_present(pname, "INTEL"))
+        return AF_ONEAPI_PLATFORM_INTEL;
+    else if (verify_present(pname, "APPLE"))
+        return AF_ONEAPI_PLATFORM_APPLE;
+    else if (verify_present(pname, "BEIGNET"))
+        return AF_ONEAPI_PLATFORM_BEIGNET;
+    else if (verify_present(pname, "POCL"))
+        return AF_ONEAPI_PLATFORM_POCL;
+    return AF_ONEAPI_PLATFORM_UNKNOWN;
+}
+
+string getDeviceInfo() noexcept {
+    ostringstream info;
+    info << "ArrayFire v" << AF_VERSION << " (oneAPI, " << get_system()
+         << ", build " << AF_REVISION << ")\n";
+
+    try {
+        DeviceManager& devMngr = DeviceManager::getInstance();
+
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        unsigned nDevices = 0;
+        for (auto& device : devMngr.mDevices) {
+            // const Platform platform(device->getInfo<CL_DEVICE_PLATFORM>());
+
+            string dstr = device->get_info<sycl::info::device::name>();
+            bool show_braces =
+                (static_cast<unsigned>(getActiveDeviceId()) == nDevices);
+
+            string id = (show_braces ? string("[") : "-") +
+                        to_string(nDevices) + (show_braces ? string("]") : "-");
+            size_t msize =
+                device->get_info<sycl::info::device::global_mem_size>();
+            info << id << " " << getPlatformName(*device) << ": " << ltrim(dstr)
+                 << ", " << msize / 1048576 << " MB";
+            info << " (";
+            if (device->has(aspect::fp64)) { info << "fp64 "; }
+            if (device->has(aspect::fp16) &&
+                device->get_info<sycl::info::device::native_vector_width_half>() != 0)
+                { info << "fp16 "; }
+            info << "\b)";
+#ifndef NDEBUG
+            info << " -- ";
+            string devVersion = device->get_info<sycl::info::device::version>();
+            string driVersion =
+                device->get_info<sycl::info::device::driver_version>();
+            info << devVersion;
+            info << " -- Device driver " << driVersion;
+            info << " -- Unified Memory ("
+                 << (isHostUnifiedMemory(*device) ? "True" : "False") << ")";
+#endif
+            info << endl;
+
+            nDevices++;
+        }
+    } catch (const AfError& err) {
+        UNUSED(err);
+        info << "No platforms found.\n";
+        // Don't throw an exception here. Info should pass even if the system
+        // doesn't have the correct drivers installed.
+    }
+    return info.str();
+}
+
+string getPlatformName(const sycl::device& device) {
+    std::string platStr =
+        device.get_platform().get_info<sycl::info::platform::name>();
+    // return platformMap(platStr);
+    return platStr;
+}
+
+typedef pair<unsigned, unsigned> device_id_t;
+
+pair<unsigned, unsigned>& tlocalActiveDeviceId() {
+    // First element is active context id
+    // Second element is active queue id
+    thread_local device_id_t activeDeviceId(0, 0);
+
+    return activeDeviceId;
+}
+
+void setActiveContext(int device) {
+    tlocalActiveDeviceId() = make_pair(device, device);
+}
+
+int getDeviceCount() noexcept try {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return static_cast<int>(devMngr.mQueues.size());
+} catch (const AfError& err) {
+    UNUSED(err);
+    // If device manager threw an error then return 0 because no platforms
+    // were found
+    return 0;
+}
+
+void init() {
+    thread_local const DeviceManager& devMngr = DeviceManager::getInstance();
+    UNUSED(devMngr);
+}
+
+unsigned getActiveDeviceId() {
+    // Second element is the queue id, which is
+    // what "active device id" means in the oneAPI backend
+    return get<1>(tlocalActiveDeviceId());
+}
+
+/*
+int getDeviceIdFromNativeId(cl_device_id id) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    int nDevices = static_cast<int>(devMngr.mDevices.size());
+    int devId    = 0;
+    for (devId = 0; devId < nDevices; ++devId) {
+        //TODO: how to get cl_device_id from sycl::device
+        if (id == devMngr.mDevices[devId]->get()) { return devId; }
+    }
+    // TODO: reasonable if no match??
+    return -1;
+}
+*/
+
+int getActiveDeviceType() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mDeviceTypes[get<1>(devId)];
+}
+
+int getActivePlatform() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mPlatforms[get<1>(devId)];
+}
+
+const sycl::context& getContext() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return *(devMngr.mContexts[get<0>(devId)]);
+}
+
+sycl::queue& getQueue() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return *(devMngr.mQueues[get<1>(devId)]);
+}
+
+sycl::queue* getQueueHandle(int device_id) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mQueues[device_id].get();
+}
+
+const sycl::device& getDevice(int id) {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    if (id == -1) { id = get<1>(devId); }
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return *(devMngr.mDevices[id]);
+}
+
+const std::string& getActiveDeviceBaseBuildFlags() {
+    device_id_t& devId     = tlocalActiveDeviceId();
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return devMngr.mBaseOpenCLBuildFlags[get<1>(devId)];
+}
+
+size_t getDeviceMemorySize(int device) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    // Assumes devices are not deallocated or invalidated during execution
+    sycl::device& dev = *devMngr.mDevices[device];
+    return dev.get_info<sycl::info::device::global_mem_size>();
+}
+
+size_t getHostMemorySize() { return common::getHostMemorySize(); }
+
+sycl::info::device_type getDeviceType() {
+    const sycl::device& device = getDevice();
+    sycl::info::device_type type =
+        device.get_info<sycl::info::device::device_type>();
+    return type;
+}
+
+bool isHostUnifiedMemory(const sycl::device& device) {
+    return device.has(sycl::aspect::usm_host_allocations);
+}
+
+bool OneAPICPUOffload(bool forceOffloadOSX) {
+    static const bool offloadEnv = getEnvVar("AF_ONEAPI_CPU_OFFLOAD") != "0";
+    bool offload                 = false;
+    if (offloadEnv) { offload = isHostUnifiedMemory(getDevice()); }
+#if OS_MAC
+    // FORCED OFFLOAD FOR LAPACK FUNCTIONS ON OSX UNIFIED MEMORY DEVICES
+    //
+    // On OSX Unified Memory devices (Intel), always offload LAPACK but not GEMM
+    // irrespective of the AF_ONEAPI_CPU_OFFLOAD value.
+    // From GEMM, OneAPICPUOffload(false) is called, which renders the
+    // forceOffloadOSX argument inconsequential to the returned result.
+    //
+    // Issue https://github.com/arrayfire/arrayfire/issues/662
+    //
+    // Make sure device has unified memory
+    bool osx_offload = isHostUnifiedMemory(getDevice());
+    // Force condition
+    offload = osx_offload && (offload || forceOffloadOSX);
+#else
+    UNUSED(forceOffloadOSX);
+#endif
+    return offload;
+}
+
+bool isGLSharingSupported() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mIsGLSharingOn[get<1>(devId)];
+}
+
+bool isDoubleSupported(unsigned device) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        sycl::device& dev = *devMngr.mDevices[device];
+        return dev.has(sycl::aspect::fp64);
+    }
+}
+
+bool isHalfSupported(unsigned device) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return devMngr.mDevices[device]->has(sycl::aspect::fp16) &&
+           devMngr.mDevices[device]->get_info<sycl::info::device::native_vector_width_half>() != 0;
+}
+
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute) {
+    ONEAPI_NOT_SUPPORTED("");
+}
+
+int setDevice(int device) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    if (device >= static_cast<int>(devMngr.mQueues.size()) ||
+        device >= static_cast<int>(DeviceManager::MAX_DEVICES)) {
+        return -1;
+    } else {
+        int old = getActiveDeviceId();
+        setActiveContext(device);
+        return old;
+    }
+}
+
+void sync(int device) {
+    int currDevice = getActiveDeviceId();
+    setDevice(device);
+    getQueue().wait();
+    setDevice(currDevice);
+}
+
+void addDeviceContext(sycl::device& dev, sycl::context& ctx, sycl::queue& que) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    int nDevices = 0;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+
+        auto tDevice  = make_unique<sycl::device>(dev);
+        auto tContext = make_unique<sycl::context>(ctx);
+        // A queue, once created, at least has an implicit context and device
+        auto tQueue = make_unique<sycl::queue>(que);
+
+        devMngr.mPlatforms.push_back(getPlatformEnum(*tDevice));
+        // FIXME: add OpenGL Interop for user provided contexts later
+        devMngr.mIsGLSharingOn.push_back(false);
+        devMngr.mDeviceTypes.push_back(static_cast<int>(
+            tDevice->get_info<sycl::info::device::device_type>()));
+
+        devMngr.mDevices.push_back(move(tDevice));
+        devMngr.mContexts.push_back(move(tContext));
+        devMngr.mQueues.push_back(move(tQueue));
+        nDevices = static_cast<int>(devMngr.mDevices.size()) - 1;
+
+        // TODO: cache?
+    }
+
+    // Last/newly added device needs memory management
+    memoryManager().addMemoryManagement(nDevices);
+}
+
+void setDeviceContext(sycl::device& dev, sycl::context& ctx) {
+    // FIXME: add OpenGL Interop for user provided contexts later
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    const int dCount = static_cast<int>(devMngr.mDevices.size());
+    for (int i = 0; i < dCount; ++i) {
+        if (*devMngr.mDevices[i] == dev && *devMngr.mContexts[i] == ctx) {
+            setActiveContext(i);
+            return;
+        }
+    }
+    AF_ERROR("No matching device found", AF_ERR_ARG);
+}
+
+void removeDeviceContext(sycl::device& dev, sycl::context& ctx) {
+    if (getDevice() == dev && getContext() == ctx) {
+        AF_ERROR("Cannot pop the device currently in use", AF_ERR_ARG);
+    }
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    int deleteIdx = -1;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+
+        const int dCount = static_cast<int>(devMngr.mDevices.size());
+        for (int i = 0; i < dCount; ++i) {
+            if (*devMngr.mDevices[i] == dev && *devMngr.mContexts[i] == ctx) {
+                deleteIdx = i;
+                break;
+            }
+        }
+    }
+
+    if (deleteIdx == -1) {
+        AF_ERROR("No matching device found", AF_ERR_ARG);
+    } else if (deleteIdx < static_cast<int>(devMngr.mUserDeviceOffset)) {
+        AF_ERROR("Cannot pop ArrayFire internal devices", AF_ERR_ARG);
+    } else {
+        // remove memory management for device added by user outside of the lock
+        memoryManager().removeMemoryManagement(deleteIdx);
+
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        // FIXME: this case can potentially cause issues due to the
+        // modification of the device pool stl containers.
+
+        // IF the currently active device is enumerated at a position
+        // that lies ahead of the device that has been requested to be
+        // removed, we just pop the entries from the pool since that
+        // has no side effects.
+        devMngr.mDevices.erase(devMngr.mDevices.begin() + deleteIdx);
+        devMngr.mContexts.erase(devMngr.mContexts.begin() + deleteIdx);
+        devMngr.mQueues.erase(devMngr.mQueues.begin() + deleteIdx);
+        devMngr.mPlatforms.erase(devMngr.mPlatforms.begin() + deleteIdx);
+
+        // FIXME: add OpenGL Interop for user provided contexts later
+        devMngr.mIsGLSharingOn.erase(devMngr.mIsGLSharingOn.begin() +
+                                     deleteIdx);
+
+        // OTHERWISE, update(decrement) the thread local active device ids
+        device_id_t& devId = tlocalActiveDeviceId();
+
+        if (deleteIdx < static_cast<int>(devId.first)) {
+            device_id_t newVals = make_pair(devId.first - 1, devId.second - 1);
+            devId               = newVals;
+        }
+    }
+}
+
+unsigned getMemoryBusWidth(const sycl::device& device) {
+    return device.get_info<sycl::info::device::global_mem_cache_line_size>();
+}
+
+size_t getL2CacheSize(const sycl::device& device) {
+    return device.get_info<sycl::info::device::global_mem_cache_size>();
+}
+
+unsigned getComputeUnits(const sycl::device& device) {
+    return device.get_info<sycl::info::device::max_compute_units>();
+}
+
+unsigned getMaxParallelThreads(const sycl::device& device) {
+    return getComputeUnits(device) * 2048;
+}
+
+bool synchronize_calls() {
+    static const bool sync = getEnvVar("AF_SYNCHRONOUS_CALLS") == "1";
+    return sync;
+}
+
+int& getMaxJitSize() {
+#if defined(OS_MAC)
+    constexpr int MAX_JIT_LEN = 50;
+#else
+    constexpr int MAX_JIT_LEN = 100;
+#endif
+    thread_local int length = 0;
+    if (length <= 0) {
+        string env_var = getEnvVar("AF_OPENCL_MAX_JIT_LEN");
+        if (!env_var.empty()) {
+            int input_len = stoi(env_var);
+            length        = input_len > 0 ? input_len : MAX_JIT_LEN;
+        } else {
+            length = MAX_JIT_LEN;
+        }
+    }
+    return length;
+}
+
+bool& evalFlag() {
+    thread_local bool flag = true;
+    return flag;
+}
+
+MemoryManagerBase& memoryManager() {
+    static once_flag flag;
+
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.memManager = make_unique<common::DefaultMemoryManager>(
+            getDeviceCount(), common::MAX_BUFFERS,
+            AF_MEM_DEBUG || AF_ONEAPI_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<Allocator> deviceMemoryManager;
+        deviceMemoryManager = make_unique<Allocator>();
+        inst.memManager->setAllocator(move(deviceMemoryManager));
+        inst.memManager->initialize();
+    });
+
+    return *(inst.memManager.get());
+}
+
+MemoryManagerBase& pinnedMemoryManager() {
+    static once_flag flag;
+
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.pinnedMemManager = make_unique<common::DefaultMemoryManager>(
+            getDeviceCount(), common::MAX_BUFFERS,
+            AF_MEM_DEBUG || AF_ONEAPI_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<AllocatorPinned> deviceMemoryManager;
+        deviceMemoryManager = make_unique<AllocatorPinned>();
+        inst.pinnedMemManager->setAllocator(move(deviceMemoryManager));
+        inst.pinnedMemManager->initialize();
+    });
+
+    return *(inst.pinnedMemManager.get());
+}
+
+void setMemoryManager(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManager(move(mgr));
+}
+
+void resetMemoryManager() {
+    return DeviceManager::getInstance().resetMemoryManager();
+}
+
+void setMemoryManagerPinned(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManagerPinned(move(mgr));
+}
+
+void resetMemoryManagerPinned() {
+    return DeviceManager::getInstance().resetMemoryManagerPinned();
+}
+
+arrayfire::common::ForgeManager& forgeManager() {
+    return *(DeviceManager::getInstance().fgMngr);
+}
+
+GraphicsResourceManager& interopManager() {
+    static once_flag initFlags[DeviceManager::MAX_DEVICES];
+
+    int id = getActiveDeviceId();
+
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(initFlags[id], [&] {
+        inst.gfxManagers[id] = make_unique<GraphicsResourceManager>();
+    });
+
+    return *(inst.gfxManagers[id].get());
+}
+
+unique_ptr<PlanCache>& oneFFTManager(const int deviceId) {
+    thread_local unique_ptr<PlanCache> caches[DeviceManager::MAX_DEVICES];
+    thread_local once_flag initFlags[DeviceManager::MAX_DEVICES];
+    call_once(initFlags[deviceId],
+              [&] { caches[deviceId] = make_unique<PlanCache>(); });
+    return caches[deviceId];
+}
+
+PlanCache& fftManager() { return *oneFFTManager(getActiveDeviceId()); }
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+/*
+//TODO: select which external api functions to expose and add to
+header+implement
+
+using namespace oneapi;
+
+af_err afcl_get_device_type(afcl_device_type* res) {
+    try {
+        *res = static_cast<afcl_device_type>(getActiveDeviceType());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_platform(afcl_platform* res) {
+    try {
+        *res = static_cast<afcl_platform>(getActivePlatform());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_context(cl_context* ctx, const bool retain) {
+    try {
+        *ctx = getContext()();
+        if (retain) { clRetainContext(*ctx); }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_queue(cl_command_queue* queue, const bool retain) {
+    try {
+        *queue = getQueue()();
+        if (retain) { clRetainCommandQueue(*queue); }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_device_id(cl_device_id* id) {
+    try {
+        *id = getDevice()();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_set_device_id(cl_device_id id) {
+    try {
+        setDevice(getDeviceIdFromNativeId(id));
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_add_device_context(cl_device_id dev, cl_context ctx,
+                               cl_command_queue que) {
+    try {
+        addDeviceContext(dev, ctx, que);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_set_device_context(cl_device_id dev, cl_context ctx) {
+    try {
+        setDeviceContext(dev, ctx);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_delete_device_context(cl_device_id dev, cl_context ctx) {
+    try {
+        removeDeviceContext(dev, ctx);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+*/
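Reviewer note on `removeDeviceContext` in the file above: the function looks up the index first and validates it afterwards, so the order of the checks matters. A failed lookup yields -1, which also compares less than any non-negative offset, so the "not found" test has to run before the "internal device" test. The standalone sketch below illustrates only that ordering. The names `findSlot`, `classify` and `userOffset` are hypothetical and not part of ArrayFire; the sketch assumes the pool is a plain vector and that entries below an offset are internal.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for the device pool lookup: returns the index of
// `key` in `pool`, or -1 when it is absent.
int findSlot(const std::vector<int>& pool, int key) {
    auto it = std::find(pool.begin(), pool.end(), key);
    return it == pool.end() ? -1 : static_cast<int>(it - pool.begin());
}

// Classifies a lookup result. The "not found" test must come first,
// because -1 is also smaller than any non-negative userOffset.
std::string classify(int idx, int userOffset) {
    if (idx == -1) { return "not found"; }
    if (idx < userOffset) { return "internal"; }
    return "removable";
}
```

With a pool of three entries and `userOffset == 1`, a missing key is reported as "not found" rather than being mistaken for an internal device.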
diff --git a/src/backend/oneapi/platform.hpp b/src/backend/oneapi/platform.hpp
new file mode 100644
index 0000000000..bceb1e5db6
--- /dev/null
+++ b/src/backend/oneapi/platform.hpp
@@ -0,0 +1,141 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/oneapi.h>
+
+#include <sycl/sycl.hpp>
+
+#include <memory>
+#include <string>
+
+// Forward declarations
+namespace spdlog {
+class logger;
+}
+
+namespace arrayfire {
+namespace common {
+class MemoryManagerBase;
+class ForgeManager;
+}  // namespace common
+}  // namespace arrayfire
+
+using arrayfire::common::MemoryManagerBase;
+
+namespace arrayfire {
+namespace oneapi {
+
+// Forward declarations
+class GraphicsResourceManager;
+class PlanCache;  // clfft
+
+bool verify_present(const std::string& pname, const std::string ref);
+
+int getBackend();
+
+std::string getDeviceInfo() noexcept;
+
+int getDeviceCount() noexcept;
+
+void init();
+
+unsigned getActiveDeviceId();
+
+int& getMaxJitSize();
+
+const sycl::context& getContext();
+
+sycl::queue& getQueue();
+
+/// Return a handle to the queue for the device.
+///
+/// \param[in] device The device of the returned queue
+/// \returns The handle to the queue
+sycl::queue* getQueueHandle(int device);
+
+const sycl::device& getDevice(int id = -1);
+
+const std::string& getActiveDeviceBaseBuildFlags();
+
+size_t getDeviceMemorySize(int device);
+
+size_t getHostMemorySize();
+
+unsigned getMemoryBusWidth(const sycl::device& device);
+
+size_t getL2CacheSize(const sycl::device& device);
+
+unsigned getComputeUnits(const sycl::device& device);
+
+// Maximum number of threads the device can actually run in parallel,
+// i.e. without time-slicing them through the scheduler
+unsigned getMaxParallelThreads(const sycl::device& device);
+
+// sycl::device::is_cpu,is_gpu,is_accelerator
+sycl::info::device_type getDeviceType();
+
+bool isHostUnifiedMemory(const sycl::device& device);
+
+bool OneAPICPUOffload(bool forceOffloadOSX = true);
+
+bool isGLSharingSupported();
+
+bool isDoubleSupported(unsigned device);
+
+// Returns true if 16-bit precision floats are supported by the device
+bool isHalfSupported(unsigned device);
+
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute);
+
+std::string getPlatformName(const sycl::device& device);
+
+int setDevice(int device);
+
+void addDeviceContext(sycl::device& dev, sycl::context& ctx, sycl::queue& que);
+
+void setDeviceContext(sycl::device& dev, sycl::context& ctx);
+
+void removeDeviceContext(sycl::device& dev, sycl::context& ctx);
+
+void sync(int device);
+
+bool synchronize_calls();
+
+int getActiveDeviceType();
+
+int getActivePlatform();
+
+bool& evalFlag();
+
+MemoryManagerBase& memoryManager();
+
+void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+void resetMemoryManager();
+
+MemoryManagerBase& pinnedMemoryManager();
+
+void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+void resetMemoryManagerPinned();
+
+arrayfire::common::ForgeManager& forgeManager();
+
+GraphicsResourceManager& interopManager();
+
+PlanCache& fftManager();
+
+// afcl::platform getPlatformEnum(cl::Device dev);
+
+void setActiveContext(int device);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/plot.cpp b/src/backend/oneapi/plot.cpp
new file mode 100644
index 0000000000..3bd287fbd6
--- /dev/null
+++ b/src/backend/oneapi/plot.cpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+// #include <GraphicsResourceManager.hpp>
+// #include <debug_opencl.hpp>
+#include <err_oneapi.hpp>
+#include <plot.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot) {
+    ONEAPI_NOT_SUPPORTED("copy_plot Not supported");
+
+    // ForgeModule &_ = common::forgePlugin();
+    // if (isGLSharingSupported()) {
+    //     CheckGL("Begin OpenCL resource copy");
+    //     const cl::Buffer *d_P = P.get();
+    //     unsigned bytes        = 0;
+    //     FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
+
+    //     auto res = interopManager().getPlotResources(plot);
+
+    //     std::vector<cl::Memory> shared_objects;
+    //     shared_objects.push_back(*(res[0].get()));
+
+    //     glFinish();
+
+    //     // Use of events:
+    //     //
+    //     https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+    //     cl::Event event;
+
+    //     getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+    //     event.wait();
+    //     getQueue().enqueueCopyBuffer(*d_P, *(res[0].get()), 0, 0, bytes,
+    //     NULL,
+    //                                  &event);
+    //     getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+    //     event.wait();
+
+    //     CL_DEBUG_FINISH(getQueue());
+    //     CheckGL("End OpenCL resource copy");
+    // } else {
+    //     unsigned bytes = 0, buffer = 0;
+    //     FG_CHECK(_.fg_get_plot_vertex_buffer(&buffer, plot));
+    //     FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
+
+    //     CheckGL("Begin OpenCL fallback-resource copy");
+    //     glBindBuffer(GL_ARRAY_BUFFER, buffer);
+    //     auto *ptr =
+    //         static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER,
+    //         GL_WRITE_ONLY));
+    //     if (ptr) {
+    //         getQueue().enqueueReadBuffer(*P.get(), CL_TRUE, 0, bytes, ptr);
+    //         glUnmapBuffer(GL_ARRAY_BUFFER);
+    //     }
+    //     glBindBuffer(GL_ARRAY_BUFFER, 0);
+    //     CheckGL("End OpenCL fallback-resource copy");
+    // }
+}
+
+#define INSTANTIATE(T) template void copy_plot<T>(const Array<T> &, fg_plot);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/plot.hpp b/src/backend/oneapi/plot.hpp
new file mode 100644
index 0000000000..ed8bd5e118
--- /dev/null
+++ b/src/backend/oneapi/plot.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/print.hpp b/src/backend/oneapi/print.hpp
new file mode 100644
index 0000000000..686445db49
--- /dev/null
+++ b/src/backend/oneapi/print.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <backend.hpp>
+#include <types.hpp>
+
+#include <ostream>
+
+namespace arrayfire {
+namespace oneapi {
+static std::ostream& operator<<(std::ostream& out, const cfloat& var) {
+    out << "(" << std::real(var) << "," << std::imag(var) << ")";
+    return out;
+}
+
+static std::ostream& operator<<(std::ostream& out, const cdouble& var) {
+    out << "(" << std::real(var) << "," << std::imag(var) << ")";
+    return out;
+}
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/product.cpp b/src/backend/oneapi/product.cpp
new file mode 100644
index 0000000000..4aa9cb61dd
--- /dev/null
+++ b/src/backend/oneapi/product.cpp
@@ -0,0 +1,33 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// product
+INSTANTIATE(af_mul_t, float, float)
+INSTANTIATE(af_mul_t, double, double)
+INSTANTIATE(af_mul_t, cfloat, cfloat)
+INSTANTIATE(af_mul_t, cdouble, cdouble)
+INSTANTIATE(af_mul_t, int, int)
+INSTANTIATE(af_mul_t, uint, uint)
+INSTANTIATE(af_mul_t, intl, intl)
+INSTANTIATE(af_mul_t, uintl, uintl)
+INSTANTIATE(af_mul_t, char, int)
+INSTANTIATE(af_mul_t, schar, int)
+INSTANTIATE(af_mul_t, uchar, uint)
+INSTANTIATE(af_mul_t, short, int)
+INSTANTIATE(af_mul_t, ushort, uint)
+INSTANTIATE(af_mul_t, half, float)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/qr.cpp b/src/backend/oneapi/qr.cpp
new file mode 100644
index 0000000000..64884e4c24
--- /dev/null
+++ b/src/backend/oneapi/qr.cpp
@@ -0,0 +1,162 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <qr.hpp>
+
+#include <err_oneapi.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+
+#include <blas.hpp>
+#include <copy.hpp>
+#include <identity.hpp>
+#include <kernel/triangle.hpp>
+#include <memory.hpp>
+#include <oneapi/mkl/lapack.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+using sycl::buffer;
+
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> in_copy = copyArray<T>(in);
+
+    // Get workspace needed for QR
+    std::int64_t scratchpad_size =
+        ::oneapi::mkl::lapack::geqrf_scratchpad_size<compute_t<T>>(
+            getQueue(), iDims[0], iDims[1], in_copy.strides()[1]);
+
+    auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+
+    t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+
+    buffer<compute_t<T>> iBuf =
+        in_copy.template getBufferWithOffset<compute_t<T>>();
+    buffer<compute_t<T>> tBuf = t.template getBufferWithOffset<compute_t<T>>();
+    ::oneapi::mkl::lapack::geqrf(getQueue(), M, N, iBuf, in_copy.strides()[1],
+                                 tBuf, *scratchpad, scratchpad->size());
+    // SPLIT into q and r
+    dim4 rdims(M, N);
+    r = createEmptyArray<T>(rdims);
+
+    constexpr bool is_upper     = true;
+    constexpr bool is_unit_diag = false;
+    kernel::triangle<T>(r, in_copy, is_upper, is_unit_diag);
+
+    int mn = max(M, N);
+    dim4 qdims(M, mn);
+    q = identity<T>(qdims);
+
+    buffer<compute_t<T>> qBuf = q.template getBufferWithOffset<compute_t<T>>();
+    if constexpr (std::is_floating_point<compute_t<T>>()) {
+        std::int64_t scratchpad_size =
+            ::oneapi::mkl::lapack::ormqr_scratchpad_size<compute_t<T>>(
+                getQueue(), ::oneapi::mkl::side::left,
+                ::oneapi::mkl::transpose::nontrans, q.dims()[0], q.dims()[1],
+                min(M, N), in_copy.strides()[1], q.strides()[1]);
+
+        auto scratchpad_ormqr = memAlloc<compute_t<T>>(scratchpad_size);
+        ::oneapi::mkl::lapack::ormqr(
+            getQueue(), ::oneapi::mkl::side::left,
+            ::oneapi::mkl::transpose::nontrans, q.dims()[0], q.dims()[1],
+            min(M, N), iBuf, in_copy.strides()[1], tBuf, qBuf, q.strides()[1],
+            *scratchpad_ormqr, scratchpad_ormqr->size());
+
+    } else if constexpr (common::isComplex(static_cast<af::dtype>(
+                             dtype_traits<compute_t<T>>::af_type))) {
+        std::int64_t scratchpad_size =
+            ::oneapi::mkl::lapack::unmqr_scratchpad_size<compute_t<T>>(
+                getQueue(), ::oneapi::mkl::side::left,
+                ::oneapi::mkl::transpose::nontrans, q.dims()[0], q.dims()[1],
+                min(M, N), in_copy.strides()[1], q.strides()[1]);
+
+        auto scratchpad_ormqr = memAlloc<compute_t<T>>(scratchpad_size);
+        ::oneapi::mkl::lapack::unmqr(
+            getQueue(), ::oneapi::mkl::side::left,
+            ::oneapi::mkl::transpose::nontrans, q.dims()[0], q.dims()[1],
+            min(M, N), iBuf, in_copy.strides()[1], tBuf, qBuf, q.strides()[1],
+            *scratchpad_ormqr, scratchpad_ormqr->size());
+    }
+    q.resetDims(dim4(M, M));
+}
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in) {
+    dim4 iDims    = in.dims();
+    dim4 iStrides = in.strides();
+    int M         = iDims[0];
+    int N         = iDims[1];
+
+    Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+
+    // Get workspace needed for QR
+    std::int64_t scratchpad_size =
+        ::oneapi::mkl::lapack::geqrf_scratchpad_size<compute_t<T>>(
+            getQueue(), iDims[0], iDims[1], iStrides[1]);
+
+    auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+
+    buffer<compute_t<T>> iBuf = in.template getBufferWithOffset<compute_t<T>>();
+    buffer<compute_t<T>> tBuf = t.template getBufferWithOffset<compute_t<T>>();
+    // Perform in-place QR
+    ::oneapi::mkl::lapack::geqrf(getQueue(), iDims[0], iDims[1], iBuf,
+                                 iStrides[1], tBuf, *scratchpad,
+                                 scratchpad->size());
+    return t;
+}
+
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
+
+INSTANTIATE_QR(float)
+INSTANTIATE_QR(cfloat)
+INSTANTIATE_QR(double)
+INSTANTIATE_QR(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
+
+INSTANTIATE_QR(float)
+INSTANTIATE_QR(cfloat)
+INSTANTIATE_QR(double)
+INSTANTIATE_QR(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
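Reviewer note on the "SPLIT into q and r" step in `qr` above: after `geqrf`, the factored matrix holds R on and above the diagonal and the Householder reflectors below it, which is why `r` is produced by an upper-triangle copy of `in_copy`. The sketch below is a minimal host-side illustration of that copy for a column-major M-by-N matrix. The name `upperTriangle` is hypothetical; this is not `kernel::triangle`, only what that kernel is assumed to compute when called with `is_upper = true` and `is_unit_diag = false`.

```cpp
#include <cstddef>
#include <vector>

// Copies the upper triangle (diagonal included) of a column-major m x n
// matrix with leading dimension ld, zeroing everything below the diagonal.
std::vector<double> upperTriangle(const std::vector<double>& a, std::size_t m,
                                  std::size_t n, std::size_t ld) {
    std::vector<double> r(m * n, 0.0);
    for (std::size_t j = 0; j < n; ++j) {
        for (std::size_t i = 0; i < m && i <= j; ++i) {
            r[i + j * m] = a[i + j * ld];
        }
    }
    return r;
}
```

The leading dimension is passed separately from the row count for the same reason the diff passes `in_copy.strides()[1]` to the LAPACK calls: the stride between columns need not equal the number of rows.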
diff --git a/src/backend/oneapi/qr.hpp b/src/backend/oneapi/qr.hpp
new file mode 100644
index 0000000000..ad8ed882a0
--- /dev/null
+++ b/src/backend/oneapi/qr.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &orig);
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/random_engine.cpp b/src/backend/oneapi/random_engine.cpp
new file mode 100644
index 0000000000..e3eac5da0b
--- /dev/null
+++ b/src/backend/oneapi/random_engine.cpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/random_engine.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl) {
+    kernel::initMersenneState(state, tbl, seed);
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionCBRNG<T>(out, out.elements(), type, seed,
+                                        counter);
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionCBRNG<T>(out, out.elements(), type, seed,
+                                       counter);
+    return out;
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionMT<T>(out, out.elements(), state, pos, sh1, sh2,
+                                     mask, recursion_table, temper_table);
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionMT<T>(out, out.elements(), state, pos, sh1, sh2,
+                                    mask, recursion_table, temper_table);
+    return out;
+}
+
+#define INSTANTIATE_UNIFORM(T)                                   \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define INSTANTIATE_NORMAL(T)                                    \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+INSTANTIATE_UNIFORM(float)
+INSTANTIATE_UNIFORM(double)
+INSTANTIATE_UNIFORM(cfloat)
+INSTANTIATE_UNIFORM(cdouble)
+INSTANTIATE_UNIFORM(int)
+INSTANTIATE_UNIFORM(uint)
+INSTANTIATE_UNIFORM(intl)
+INSTANTIATE_UNIFORM(uintl)
+INSTANTIATE_UNIFORM(char)
+INSTANTIATE_UNIFORM(schar)
+INSTANTIATE_UNIFORM(uchar)
+INSTANTIATE_UNIFORM(short)
+INSTANTIATE_UNIFORM(ushort)
+INSTANTIATE_UNIFORM(half)
+
+INSTANTIATE_NORMAL(float)
+INSTANTIATE_NORMAL(double)
+INSTANTIATE_NORMAL(cdouble)
+INSTANTIATE_NORMAL(cfloat)
+INSTANTIATE_NORMAL(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/random_engine.hpp b/src/backend/oneapi/random_engine.hpp
new file mode 100644
index 0000000000..7738294d06
--- /dev/null
+++ b/src/backend/oneapi/random_engine.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace oneapi {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/range.cpp b/src/backend/oneapi/range.cpp
new file mode 100644
index 0000000000..c08a7bea91
--- /dev/null
+++ b/src/backend/oneapi/range.cpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <kernel/range.hpp>
+#include <range.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim) {
+    // Select the dimension along which the sequence is generated;
+    // the other dimensions are simply tiled
+    int _seq_dim = seq_dim;
+    if (seq_dim < 0) {
+        _seq_dim = 0;  // default to a column-wise sequence
+    }
+
+    if (_seq_dim < 0 || _seq_dim > 3) {
+        AF_ERROR("Invalid rep selection", AF_ERR_ARG);
+    }
+
+    Array<T> out = createEmptyArray<T>(dim);
+    kernel::range<T>(out, _seq_dim);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> range<T>(const af::dim4& dims, const int seq_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/range.hpp b/src/backend/oneapi/range.hpp
new file mode 100644
index 0000000000..6a997c6787
--- /dev/null
+++ b/src/backend/oneapi/range.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim = -1);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/reduce.hpp b/src/backend/oneapi/reduce.hpp
new file mode 100644
index 0000000000..6d6ab31670
--- /dev/null
+++ b/src/backend/oneapi/reduce.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan = false,
+                 double nanval = 0);
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan = false, double nanval = 0);
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan = false,
+                     double nanval = 0);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/reduce_impl.hpp b/src/backend/oneapi/reduce_impl.hpp
new file mode 100644
index 0000000000..b2c478c71f
--- /dev/null
+++ b/src/backend/oneapi/reduce_impl.hpp
@@ -0,0 +1,649 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#if defined(__clang__)
+#pragma clang diagnostic push
+// temporary ignores for DPL internals
+#pragma clang diagnostic ignored "-Wunused-variable"
+#pragma clang diagnostic ignored "-Wdeprecated-declarations"
+#endif
+
+// oneDPL headers should be included before standard headers
+#define ONEDPL_USE_PREDEFINED_POLICIES 0
+#include <oneapi/dpl/execution>
+#include <oneapi/dpl/iterator>
+#include <oneapi/dpl/numeric>
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/accessors.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/reduce_by_key.hpp>
+#include <reduce.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+using af::dim4;
+using std::swap;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan,
+                 double nanval) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::reduce<Ti, To, op>(out, in, dim, change_nan, nanval);
+    return out;
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void reduceBlocksByKey(sycl::buffer<int> &reduced_block_sizes,
+                       Array<Tk> keys_out, Array<To> vals_out,
+                       const Array<Tk> keys, const Array<Ti> vals,
+                       int change_nan, double nanval, const int n,
+                       const int threads_x) {
+    int numBlocks = divup(n, threads_x);
+
+    sycl::range<3> local(threads_x, 1, 1);
+    sycl::range<3> global(local[0] * numBlocks, vals_out.dims()[1],
+                          vals_out.dims()[2] * vals_out.dims()[3]);
+
+    auto keys_out_get = keys_out.get();
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    auto vals_get = vals.get();
+    getQueue().submit([&](sycl::handler &h) {
+        sycl::accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        write_accessor<Tk> keys_out_acc{*keys_out_get, h};
+        write_accessor<To> vals_out_acc{*vals_out_get, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        read_accessor<Ti> vals_acc{*vals_get, h};
+
+        auto l_keys         = sycl::local_accessor<Tk>(threads_x, h);
+        auto l_vals         = sycl::local_accessor<compute_t<To>>(threads_x, h);
+        auto l_reduced_keys = sycl::local_accessor<Tk>(threads_x, h);
+        auto l_reduced_vals = sycl::local_accessor<compute_t<To>>(threads_x, h);
+        auto l_unique_ids   = sycl::local_accessor<int>(threads_x, h);
+        auto l_wq_temp      = sycl::local_accessor<int>(threads_x, h);
+        auto l_unique_flags = sycl::local_accessor<int>(threads_x, h);
+        auto l_reduced_block_size = sycl::local_accessor<int>(1, h);
+
+        h.parallel_for(
+            sycl::nd_range<3>(global, local),
+            kernel::reduceBlocksByKeyKernel<Ti, Tk, To, op>(
+                reduced_block_sizes_acc, keys_out_acc, keys_out, vals_out_acc,
+                vals_out, keys_acc, keys, vals_acc, vals, change_nan,
+                scalar<To>(nanval), n, static_cast<int>(vals_out.dims()[2]),
+                threads_x, l_keys, l_vals, l_reduced_keys, l_reduced_vals,
+                l_unique_ids, l_wq_temp, l_unique_flags, l_reduced_block_size));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void reduceBlocksByKeyDim(sycl::buffer<int> &reduced_block_sizes,
+                          Array<Tk> keys_out, Array<To> vals_out,
+                          const Array<Tk> keys, const Array<Ti> vals,
+                          int change_nan, double nanval, const int n,
+                          const int threads_x, const int dim,
+                          std::vector<int> dim_ordering) {
+    int numBlocks = divup(n, threads_x);
+
+    sycl::range<3> local(threads_x, 1, 1);
+    sycl::range<3> global(
+        local[0] * numBlocks, vals_out.dims()[dim_ordering[1]],
+        vals_out.dims()[dim_ordering[2]] * vals_out.dims()[dim_ordering[3]]);
+
+    auto keys_out_get = keys_out.get();
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    auto vals_get = vals.get();
+    getQueue().submit([&](sycl::handler &h) {
+        sycl::accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        write_accessor<Tk> keys_out_acc{*keys_out_get, h};
+        write_accessor<To> vals_out_acc{*vals_out_get, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        read_accessor<Ti> vals_acc{*vals_get, h};
+
+        auto l_keys         = sycl::local_accessor<Tk>(threads_x, h);
+        auto l_vals         = sycl::local_accessor<compute_t<To>>(threads_x, h);
+        auto l_reduced_keys = sycl::local_accessor<Tk>(threads_x, h);
+        auto l_reduced_vals = sycl::local_accessor<compute_t<To>>(threads_x, h);
+        auto l_unique_ids   = sycl::local_accessor<int>(threads_x, h);
+        auto l_wq_temp      = sycl::local_accessor<int>(threads_x, h);
+        auto l_unique_flags = sycl::local_accessor<int>(threads_x, h);
+        auto l_reduced_block_size = sycl::local_accessor<int>(1, h);
+
+        h.parallel_for(
+            sycl::nd_range<3>(global, local),
+            kernel::reduceBlocksByKeyDimKernel<Ti, Tk, To, op>(
+                reduced_block_sizes_acc, keys_out_acc, keys_out, vals_out_acc,
+                vals_out, keys_acc, keys, vals_acc, vals, change_nan,
+                scalar<To>(nanval), n, static_cast<int>(vals_out.dims()[2]),
+                threads_x, dim, l_keys, l_vals, l_reduced_keys, l_reduced_vals,
+                l_unique_ids, l_wq_temp, l_unique_flags, l_reduced_block_size));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To, af_op_t op>
+void finalBoundaryReduce(sycl::buffer<int> &reduced_block_sizes, Array<Tk> keys,
+                         Array<To> vals_out, const int n, const int numBlocks,
+                         const int threads_x) {
+    sycl::range<1> local(threads_x);
+    sycl::range<1> global(local[0] * numBlocks);
+
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        sycl::accessor<To> vals_out_acc{*vals_out_get, h};
+
+        h.parallel_for(sycl::nd_range<1>(global, local),
+                       kernel::finalBoundaryReduceKernel<Tk, To, op>(
+                           reduced_block_sizes_acc, keys_acc, keys,
+                           vals_out_acc, vals_out, n));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To, af_op_t op>
+void finalBoundaryReduceDim(sycl::buffer<int> &reduced_block_sizes,
+                            Array<Tk> keys, Array<To> vals_out, const int n,
+                            const int numBlocks, const int threads_x,
+                            const int dim, std::vector<int> dim_ordering) {
+    sycl::range<3> local(threads_x, 1, 1);
+    sycl::range<3> global(
+        local[0] * numBlocks, vals_out.dims()[dim_ordering[1]],
+        vals_out.dims()[dim_ordering[2]] * vals_out.dims()[dim_ordering[3]]);
+
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    getQueue().submit([&](sycl::handler &h) {
+        write_accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        sycl::accessor<To> vals_out_acc{*vals_out_get, h};
+
+        // TODO: fold 3,4 dimensions
+        h.parallel_for(
+            sycl::nd_range<3>(global, local),
+            kernel::finalBoundaryReduceDimKernel<Tk, To, op>(
+                reduced_block_sizes_acc, keys_acc, keys, vals_out_acc, vals_out,
+                n, vals_out.dims()[dim_ordering[2]]));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To>
+void compact(sycl::buffer<int> &reduced_block_sizes, Array<Tk> &keys_out,
+             Array<To> &vals_out, const Array<Tk> &keys, const Array<To> &vals,
+             const int numBlocks, const int threads_x) {
+    sycl::range<3> local(threads_x, 1, 1);
+    sycl::range<3> global(local[0] * numBlocks, vals_out.dims()[1],
+                          vals_out.dims()[2] * vals_out.dims()[3]);
+
+    auto keys_out_get = keys_out.get();
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    auto vals_get = vals.get();
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        write_accessor<Tk> keys_out_acc{*keys_out_get, h};
+        write_accessor<To> vals_out_acc{*vals_out_get, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        read_accessor<To> vals_acc{*vals_get, h};
+
+        h.parallel_for(sycl::nd_range<3>(global, local),
+                       kernel::compactKernel<Tk, To>(
+                           reduced_block_sizes_acc, keys_out_acc, keys_out,
+                           vals_out_acc, vals_out, keys_acc, keys, vals_acc,
+                           vals, static_cast<int>(vals_out.dims()[2])));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To>
+void compactDim(sycl::buffer<int> &reduced_block_sizes, Array<Tk> &keys_out,
+                Array<To> &vals_out, const Array<Tk> &keys,
+                const Array<To> &vals, const int numBlocks, const int threads_x,
+                const int dim, std::vector<int> dim_ordering) {
+    sycl::range<3> local(threads_x, 1, 1);
+    sycl::range<3> global(
+        local[0] * numBlocks, vals_out.dims()[dim_ordering[1]],
+        vals_out.dims()[dim_ordering[2]] * vals_out.dims()[dim_ordering[3]]);
+
+    auto keys_out_get = keys_out.get();
+    auto vals_out_get = vals_out.get();
+    auto keys_get = keys.get();
+    auto vals_get = vals.get();
+    getQueue().submit([&](sycl::handler &h) {
+        read_accessor<int> reduced_block_sizes_acc{reduced_block_sizes, h};
+        write_accessor<Tk> keys_out_acc{*keys_out_get, h};
+        write_accessor<To> vals_out_acc{*vals_out_get, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        read_accessor<To> vals_acc{*vals_get, h};
+
+        h.parallel_for(
+            sycl::nd_range<3>(global, local),
+            kernel::compactDimKernel<Tk, To>(
+                reduced_block_sizes_acc, keys_out_acc, keys_out, vals_out_acc,
+                vals_out, keys_acc, keys, vals_acc, vals,
+                static_cast<int>(vals_out.dims()[dim_ordering[2]]), dim));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk>
+void testNeedsReduction(sycl::buffer<int> needs_reduction,
+                        sycl::buffer<int> needs_boundary, const Array<Tk> &keys,
+                        const int n, const int numBlocks, const int threads_x) {
+    sycl::range<1> local(threads_x);
+    sycl::range<1> global(local[0] * numBlocks);
+
+    auto keys_get = keys.get();
+    getQueue().submit([&](sycl::handler &h) {
+        sycl::accessor<int> needs_reduction_acc{needs_reduction, h};
+        sycl::accessor<int> needs_boundary_acc{needs_boundary, h};
+        read_accessor<Tk> keys_acc{*keys_get, h};
+        auto l_keys = sycl::local_accessor<Tk>(threads_x, h);
+
+        h.parallel_for(sycl::nd_range<1>(global, local),
+                       kernel::testNeedsReductionKernel<Tk>(
+                           needs_reduction_acc, needs_boundary_acc, keys_acc,
+                           keys, n, threads_x, l_keys));
+    });
+    ONEAPI_DEBUG_FINISH(getQueue());
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+int reduce_by_key_first(Array<Tk> &keys_out, Array<To> &vals_out,
+                        const Array<Tk> &keys, const Array<Ti> &vals,
+                        bool change_nan, double nanval) {
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    Array<Tk> reduced_keys   = createEmptyArray<Tk>(kdims);
+    Array<To> reduced_vals   = createEmptyArray<To>(odims);
+    Array<Tk> t_reduced_keys = createEmptyArray<Tk>(kdims);
+    Array<To> t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether more reduction is necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    // reset flags
+    getQueue().submit([&](sycl::handler &h) {
+        auto wacc =
+            needs_another_reduction->get_access<sycl::access_mode::write>(h);
+        h.fill(wacc, 0);
+    });
+    getQueue().submit([&](sycl::handler &h) {
+        auto wacc = needs_block_boundary_reduction
+                        ->get_access<sycl::access_mode::write>(h);
+        h.fill(wacc, 0);
+    });
+
+    size_t nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+    auto reduced_block_sizes      = memAlloc<int>(numBlocksD0);
+
+    int n_reduced_host = nelems;
+
+    int needs_another_reduction_host        = 0;
+    int needs_block_boundary_reduction_host = 0;
+
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        if (first_pass) {
+            reduceBlocksByKey<Ti, Tk, To, op>(
+                *reduced_block_sizes.get(), reduced_keys, reduced_vals, keys,
+                vals, change_nan, nanval, n_reduced_host, numThreads);
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = (op == af_notzero_t) ? af_add_t : op;
+            reduceBlocksByKey<To, Tk, To, op2>(
+                *reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                t_reduced_keys, t_reduced_vals, change_nan, nanval,
+                n_reduced_host, numThreads);
+        }
+
+        auto val_buf_begin = ::oneapi::dpl::begin(*reduced_block_sizes.get());
+        auto val_buf_end   = val_buf_begin + numBlocksD0;
+        std::inclusive_scan(dpl_policy, val_buf_begin, val_buf_end,
+                            val_buf_begin);
+
+        compact<Tk, To>(*reduced_block_sizes.get(), t_reduced_keys,
+                        t_reduced_vals, reduced_keys, reduced_vals, numBlocksD0,
+                        numThreads);
+
+        sycl::event reduce_host_event =
+            getQueue().submit([&](sycl::handler &h) {
+                sycl::range rr(1);
+                sycl::id offset_id(numBlocksD0 - 1);
+                auto offset_acc =
+                    reduced_block_sizes
+                        ->template get_access<sycl::access_mode::read>(
+                            h, rr, offset_id);
+                h.copy(offset_acc, &n_reduced_host);
+            });
+
+        // reset flags
+        getQueue().submit([&](sycl::handler &h) {
+            auto wacc =
+                needs_another_reduction->get_access<sycl::access_mode::write>(
+                    h);
+            h.fill(wacc, 0);
+        });
+        getQueue().submit([&](sycl::handler &h) {
+            auto wacc = needs_block_boundary_reduction
+                            ->get_access<sycl::access_mode::write>(h);
+            h.fill(wacc, 0);
+        });
+
+        reduce_host_event.wait();
+
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        testNeedsReduction<Tk>(*needs_another_reduction.get(),
+                               *needs_block_boundary_reduction.get(),
+                               t_reduced_keys, n_reduced_host, numBlocksD0,
+                               numThreads);
+
+        sycl::event host_flag0_event = getQueue().submit([&](sycl::handler &h) {
+            sycl::range rr(1);
+            auto acc =
+                needs_another_reduction
+                    ->template get_access<sycl::access_mode::read>(h, rr);
+            h.copy(acc, &needs_another_reduction_host);
+        });
+        sycl::event host_flag1_event = getQueue().submit([&](sycl::handler &h) {
+            sycl::range rr(1);
+            auto acc =
+                needs_block_boundary_reduction
+                    ->template get_access<sycl::access_mode::read>(h, rr);
+            h.copy(acc, &needs_block_boundary_reduction_host);
+        });
+
+        host_flag1_event.wait();
+        host_flag0_event.wait();
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            finalBoundaryReduce<Tk, To, op>(
+                *reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                n_reduced_host, numBlocksD0, numThreads);
+
+            auto val_buf_begin =
+                ::oneapi::dpl::begin(*reduced_block_sizes.get());
+            auto val_buf_end = val_buf_begin + numBlocksD0;
+            std::inclusive_scan(dpl_policy, val_buf_begin, val_buf_end,
+                                val_buf_begin);
+
+            sycl::event reduce_host_event =
+                getQueue().submit([&](sycl::handler &h) {
+                    sycl::range rr(1);
+                    sycl::id offset_id(numBlocksD0 - 1);
+                    auto offset_acc =
+                        reduced_block_sizes
+                            ->template get_access<sycl::access_mode::read>(
+                                h, rr, offset_id);
+                    h.copy(offset_acc, &n_reduced_host);
+                });
+
+            compact<Tk, To>(*reduced_block_sizes.get(), reduced_keys,
+                            reduced_vals, t_reduced_keys, t_reduced_vals,
+                            numBlocksD0, numThreads);
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+            reduce_host_event.wait();
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    keys_out = t_reduced_keys;
+    vals_out = t_reduced_vals;
+    return n_reduced_host;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+int reduce_by_key_dim(Array<Tk> &keys_out, Array<To> &vals_out,
+                      const Array<Tk> &keys, const Array<Ti> &vals,
+                      bool change_nan, double nanval, const int dim) {
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    std::vector<int> dim_ordering = {dim};
+    for (int i = 0; i < 4; ++i) {
+        if (i != dim) { dim_ordering.push_back(i); }
+    }
+
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    Array<Tk> reduced_keys   = createEmptyArray<Tk>(kdims);
+    Array<To> reduced_vals   = createEmptyArray<To>(odims);
+    Array<Tk> t_reduced_keys = createEmptyArray<Tk>(kdims);
+    Array<To> t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether more reduction is necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    // reset flags
+    getQueue().submit([&](sycl::handler &h) {
+        auto wacc =
+            needs_another_reduction->get_access<sycl::access_mode::write>(h);
+        h.fill(wacc, 0);
+    });
+    getQueue().submit([&](sycl::handler &h) {
+        auto wacc = needs_block_boundary_reduction
+                        ->get_access<sycl::access_mode::write>(h);
+        h.fill(wacc, 0);
+    });
+
+    int nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+    auto reduced_block_sizes      = memAlloc<int>(numBlocksD0);
+
+    int n_reduced_host = nelems;
+
+    int needs_another_reduction_host        = 0;
+    int needs_block_boundary_reduction_host = 0;
+
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        if (first_pass) {
+            reduceBlocksByKeyDim<Ti, Tk, To, op>(
+                *reduced_block_sizes.get(), reduced_keys, reduced_vals, keys,
+                vals, change_nan, nanval, n_reduced_host, numThreads, dim,
+                dim_ordering);
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = op == af_notzero_t ? af_add_t : op;
+            reduceBlocksByKeyDim<To, Tk, To, op2>(
+                *reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                t_reduced_keys, t_reduced_vals, change_nan, nanval,
+                n_reduced_host, numThreads, dim, dim_ordering);
+        }
+
+        auto val_buf_begin = ::oneapi::dpl::begin(*reduced_block_sizes.get());
+        auto val_buf_end   = val_buf_begin + numBlocksD0;
+        std::inclusive_scan(dpl_policy, val_buf_begin, val_buf_end,
+                            val_buf_begin);
+
+        compactDim<Tk, To>(*reduced_block_sizes.get(), t_reduced_keys,
+                           t_reduced_vals, reduced_keys, reduced_vals,
+                           numBlocksD0, numThreads, dim, dim_ordering);
+
+        sycl::event reduce_host_event =
+            getQueue().submit([&](sycl::handler &h) {
+                sycl::range rr(1);
+                sycl::id offset_id(numBlocksD0 - 1);
+                auto offset_acc =
+                    reduced_block_sizes
+                        ->template get_access<sycl::access_mode::read>(
+                            h, rr, offset_id);
+                h.copy(offset_acc, &n_reduced_host);
+            });
+
+        // reset flags
+        getQueue().submit([&](sycl::handler &h) {
+            auto wacc =
+                needs_another_reduction->get_access<sycl::access_mode::write>(
+                    h);
+            h.fill(wacc, 0);
+        });
+        getQueue().submit([&](sycl::handler &h) {
+            auto wacc = needs_block_boundary_reduction
+                            ->get_access<sycl::access_mode::write>(h);
+            h.fill(wacc, 0);
+        });
+
+        reduce_host_event.wait();
+
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        testNeedsReduction<Tk>(*needs_another_reduction.get(),
+                               *needs_block_boundary_reduction.get(),
+                               t_reduced_keys, n_reduced_host, numBlocksD0,
+                               numThreads);
+
+        sycl::event host_flag0_event = getQueue().submit([&](sycl::handler &h) {
+            sycl::range rr(1);
+            auto acc =
+                needs_another_reduction
+                    ->template get_access<sycl::access_mode::read>(h, rr);
+            h.copy(acc, &needs_another_reduction_host);
+        });
+        sycl::event host_flag1_event = getQueue().submit([&](sycl::handler &h) {
+            sycl::range rr(1);
+            auto acc =
+                needs_block_boundary_reduction
+                    ->template get_access<sycl::access_mode::read>(h, rr);
+            h.copy(acc, &needs_block_boundary_reduction_host);
+        });
+
+        host_flag1_event.wait();
+        host_flag0_event.wait();
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            finalBoundaryReduceDim<Tk, To, op>(
+                *reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                n_reduced_host, numBlocksD0, numThreads, dim, dim_ordering);
+
+            auto val_buf_begin =
+                ::oneapi::dpl::begin(*reduced_block_sizes.get());
+            auto val_buf_end = val_buf_begin + numBlocksD0;
+            std::inclusive_scan(dpl_policy, val_buf_begin, val_buf_end,
+                                val_buf_begin);
+
+            sycl::event reduce_host_event =
+                getQueue().submit([&](sycl::handler &h) {
+                    sycl::range rr(1);
+                    sycl::id offset_id(numBlocksD0 - 1);
+                    auto offset_acc =
+                        reduced_block_sizes
+                            ->template get_access<sycl::access_mode::read>(
+                                h, rr, offset_id);
+                    h.copy(offset_acc, &n_reduced_host);
+                });
+
+            compactDim<Tk, To>(*reduced_block_sizes.get(), reduced_keys,
+                               reduced_vals, t_reduced_keys, t_reduced_vals,
+                               numBlocksD0, numThreads, dim, dim_ordering);
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+            reduce_host_event.wait();
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    keys_out = t_reduced_keys;
+    vals_out = t_reduced_vals;
+
+    return n_reduced_host;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan, double nanval) {
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    // prepare output arrays
+    Array<Tk> reduced_keys = createEmptyArray<Tk>(dim4());
+    Array<To> reduced_vals = createEmptyArray<To>(dim4());
+
+    size_t n_reduced = 0;
+    if (dim == 0) {
+        n_reduced = reduce_by_key_first<op, Ti, Tk, To>(
+            reduced_keys, reduced_vals, keys, vals, change_nan, nanval);
+    } else {
+        n_reduced = reduce_by_key_dim<op, Ti, Tk, To>(
+            reduced_keys, reduced_vals, keys, vals, change_nan, nanval, dim);
+    }
+
+    kdims[0]   = n_reduced;
+    odims[dim] = n_reduced;
+    std::vector<af_seq> kindex, vindex;
+    for (int i = 0; i < odims.ndims(); ++i) {
+        af_seq sk = {0.0, (double)kdims[i] - 1, 1.0};
+        af_seq sv = {0.0, (double)odims[i] - 1, 1.0};
+        kindex.push_back(sk);
+        vindex.push_back(sv);
+    }
+
+    keys_out = createSubArray<Tk>(reduced_keys, kindex, true);
+    vals_out = createSubArray<To>(reduced_vals, vindex, true);
+}
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan, double nanval) {
+    Array<To> out = createEmptyArray<To>(1);
+    kernel::reduce_all<Ti, To, op>(out, in, change_nan, nanval);
+    return out;
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#define INSTANTIATE(Op, Ti, To)                                                \
+    template Array<To> reduce<Op, Ti, To>(const Array<Ti> &in, const int dim,  \
+                                          bool change_nan, double nanval);     \
+    template void reduce_by_key<Op, Ti, int, To>(                              \
+        Array<int> & keys_out, Array<To> & vals_out, const Array<int> &keys,   \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template void reduce_by_key<Op, Ti, uint, To>(                             \
+        Array<uint> & keys_out, Array<To> & vals_out, const Array<uint> &keys, \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template Array<To> reduce_all<Op, Ti, To>(const Array<Ti> &in,             \
+                                              bool change_nan, double nanval);
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#endif
diff --git a/src/backend/oneapi/regions.cpp b/src/backend/oneapi/regions.cpp
new file mode 100644
index 0000000000..983b3b9000
--- /dev/null
+++ b/src/backend/oneapi/regions.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/regions.hpp>
+#include <regions.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> regions(const Array<char> &in, af_connectivity connectivity) {
+    ONEAPI_NOT_SUPPORTED("regions is not supported");
+
+    const af::dim4 &dims = in.dims();
+    Array<T> out         = createEmptyArray<T>(dims);
+    // kernel::regions<T>(out, in, connectivity == AF_CONNECTIVITY_8, 2);
+    return out;
+}
+
+#define INSTANTIATE(T)                                  \
+    template Array<T> regions<T>(const Array<char> &in, \
+                                 af_connectivity connectivity);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/regions.hpp b/src/backend/oneapi/regions.hpp
new file mode 100644
index 0000000000..34e90f2918
--- /dev/null
+++ b/src/backend/oneapi/regions.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> regions(const Array<char> &in, af_connectivity connectivity);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/reorder.cpp b/src/backend/oneapi/reorder.cpp
new file mode 100644
index 0000000000..d9e264f70c
--- /dev/null
+++ b/src/backend/oneapi/reorder.cpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/reorder.hpp>
+#include <reorder.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(0);
+    for (int i = 0; i < 4; i++) { oDims[i] = iDims[rdims[i]]; }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::reorder<T>(out, in, rdims.get());
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/reorder.hpp b/src/backend/oneapi/reorder.hpp
new file mode 100644
index 0000000000..a587bc9de3
--- /dev/null
+++ b/src/backend/oneapi/reorder.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/reshape.cpp b/src/backend/oneapi/reshape.cpp
new file mode 100644
index 0000000000..2b15f686e9
--- /dev/null
+++ b/src/backend/oneapi/reshape.cpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+
+#include <common/half.hpp>
+#include <kernel/memcopy.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue, double scale) {
+    Array<outType> out = createEmptyArray<outType>(outDims);
+    if (out.elements() > 0) {
+        kernel::copy<inType, outType>(out, in, in.ndims(), defaultValue, scale,
+                                      in.dims() == outDims);
+    }
+    return out;
+}
+
+#define INSTANTIATE(SRC_T)                                                    \
+    template Array<float> reshape<SRC_T, float>(Array<SRC_T> const &,         \
+                                                dim4 const &, float, double); \
+    template Array<double> reshape<SRC_T, double>(                            \
+        Array<SRC_T> const &, dim4 const &, double, double);                  \
+    template Array<cfloat> reshape<SRC_T, cfloat>(                            \
+        Array<SRC_T> const &, dim4 const &, cfloat, double);                  \
+    template Array<cdouble> reshape<SRC_T, cdouble>(                          \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);                 \
+    template Array<int> reshape<SRC_T, int>(Array<SRC_T> const &,             \
+                                            dim4 const &, int, double);       \
+    template Array<uint> reshape<SRC_T, uint>(Array<SRC_T> const &,           \
+                                              dim4 const &, uint, double);    \
+    template Array<intl> reshape<SRC_T, intl>(Array<SRC_T> const &,           \
+                                              dim4 const &, intl, double);    \
+    template Array<uintl> reshape<SRC_T, uintl>(Array<SRC_T> const &,         \
+                                                dim4 const &, uintl, double); \
+    template Array<short> reshape<SRC_T, short>(Array<SRC_T> const &,         \
+                                                dim4 const &, short, double); \
+    template Array<ushort> reshape<SRC_T, ushort>(                            \
+        Array<SRC_T> const &, dim4 const &, ushort, double);                  \
+    template Array<schar> reshape<SRC_T, schar>(Array<SRC_T> const &,         \
+                                                dim4 const &, schar, double); \
+    template Array<uchar> reshape<SRC_T, uchar>(Array<SRC_T> const &,         \
+                                                dim4 const &, uchar, double); \
+    template Array<char> reshape<SRC_T, char>(Array<SRC_T> const &,           \
+                                              dim4 const &, char, double);    \
+    template Array<half> reshape<SRC_T, half>(Array<SRC_T> const &,           \
+                                              dim4 const &, half, double);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COMPLEX(SRC_T)                           \
+    template Array<cfloat> reshape<SRC_T, cfloat>(           \
+        Array<SRC_T> const &, dim4 const &, cfloat, double); \
+    template Array<cdouble> reshape<SRC_T, cdouble>(         \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);
+
+INSTANTIATE_COMPLEX(cfloat)
+INSTANTIATE_COMPLEX(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/resize.cpp b/src/backend/oneapi/resize.cpp
new file mode 100644
index 0000000000..b73f42eabb
--- /dev/null
+++ b/src/backend/oneapi/resize.cpp
@@ -0,0 +1,49 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/resize.hpp>
+#include <resize.hpp>
+#include <af/dim4.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(odim0, odim1, iDims[2], iDims[3]);
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::resize<T>(out, in, method);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> resize<T>(const Array<T> &in, const dim_t odim0, \
+                                const dim_t odim1,                     \
+                                const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/resize.hpp b/src/backend/oneapi/resize.hpp
new file mode 100644
index 0000000000..4cd7aa39aa
--- /dev/null
+++ b/src/backend/oneapi/resize.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/rotate.cpp b/src/backend/oneapi/rotate.cpp
new file mode 100644
index 0000000000..bcd7b5810a
--- /dev/null
+++ b/src/backend/oneapi/rotate.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <rotate.hpp>
+
+#include <kernel/rotate.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method) {
+    Array<T> out = createEmptyArray<T>(odims);
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::rotate<T>(out, in, theta, method, 1);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::rotate<T>(out, in, theta, method, 2);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::rotate<T>(out, in, theta, method, 3);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
+    }
+    return out;
+}
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> rotate(const Array<T> &in, const float theta, \
+                             const af::dim4 &odims,                 \
+                             const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/rotate.hpp b/src/backend/oneapi/rotate.hpp
new file mode 100644
index 0000000000..ee6114da0d
--- /dev/null
+++ b/src/backend/oneapi/rotate.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/scalar.hpp b/src/backend/oneapi/scalar.hpp
new file mode 100644
index 0000000000..9e5ac25704
--- /dev/null
+++ b/src/backend/oneapi/scalar.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <common/jit/ScalarNode.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> createScalarNode(const dim4 &size, const T val) {
+    return createNodeArray<T>(size,
+                              std::make_shared<common::ScalarNode<T>>(val));
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/scan.cpp b/src/backend/oneapi/scan.cpp
new file mode 100644
index 0000000000..9aaae59b49
--- /dev/null
+++ b/src/backend/oneapi/scan.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <scan.hpp>
+
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusiveScan) {
+    Array<To> out = createEmptyArray<To>(in.dims());
+
+    Param<To> Out = out;
+    Param<Ti> In  = in;
+
+    switch (dim) {
+        case 0: kernel::scan_first<Ti, To, op>(Out, In, inclusiveScan); break;
+        case 1: kernel::scan_dim<Ti, To, op, 1>(Out, In, inclusiveScan); break;
+        case 2: kernel::scan_dim<Ti, To, op, 2>(Out, In, inclusiveScan); break;
+        case 3: kernel::scan_dim<Ti, To, op, 3>(Out, In, inclusiveScan); break;
+    }
+
+    return out;
+}
+
+#define INSTANTIATE_SCAN(ROp, Ti, To) \
+    template Array<To> scan<ROp, Ti, To>(const Array<Ti>&, const int, bool);
+
+#define INSTANTIATE_SCAN_ALL(ROp)           \
+    INSTANTIATE_SCAN(ROp, float, float)     \
+    INSTANTIATE_SCAN(ROp, double, double)   \
+    INSTANTIATE_SCAN(ROp, cfloat, cfloat)   \
+    INSTANTIATE_SCAN(ROp, cdouble, cdouble) \
+    INSTANTIATE_SCAN(ROp, int, int)         \
+    INSTANTIATE_SCAN(ROp, uint, uint)       \
+    INSTANTIATE_SCAN(ROp, intl, intl)       \
+    INSTANTIATE_SCAN(ROp, uintl, uintl)     \
+    INSTANTIATE_SCAN(ROp, char, uint)       \
+    INSTANTIATE_SCAN(ROp, schar, int)       \
+    INSTANTIATE_SCAN(ROp, uchar, uint)      \
+    INSTANTIATE_SCAN(ROp, short, int)       \
+    INSTANTIATE_SCAN(ROp, ushort, uint)
+
+INSTANTIATE_SCAN(af_notzero_t, char, uint)
+INSTANTIATE_SCAN_ALL(af_add_t)
+INSTANTIATE_SCAN_ALL(af_mul_t)
+INSTANTIATE_SCAN_ALL(af_min_t)
+INSTANTIATE_SCAN_ALL(af_max_t)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/scan.hpp b/src/backend/oneapi/scan.hpp
new file mode 100644
index 0000000000..59522a8c4b
--- /dev/null
+++ b/src/backend/oneapi/scan.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan = true);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/scan_by_key.cpp b/src/backend/oneapi/scan_by_key.cpp
new file mode 100644
index 0000000000..dabca1815a
--- /dev/null
+++ b/src/backend/oneapi/scan_by_key.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <scan.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+// #include <kernel/scan_dim_by_key.hpp>
+// #include <kernel/scan_first_by_key.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan) {
+    ONEAPI_NOT_SUPPORTED("scan by key Not supported");
+
+    Array<To> out = createEmptyArray<To>(in.dims());
+
+    // Param Out = out;
+    // Param Key = key;
+    // Param In  = in;
+
+    // if (dim == 0) {
+    //     kernel::scanFirstByKey<Ti, Tk, To, op>(Out, In, Key,
+    //                                            inclusive_scan);
+    // } else {
+    //     kernel::scanDimByKey<Ti, Tk, To, op>(Out, In, Key, dim,
+    //                                          inclusive_scan);
+    // }
+    return out;
+}
+
+#define INSTANTIATE_SCAN_BY_KEY(ROp, Ti, Tk, To)                  \
+    template Array<To> scan<ROp, Ti, Tk, To>(                     \
+        const Array<Tk>& key, const Array<Ti>& in, const int dim, \
+        bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_BY_KEY_ALL(ROp, Tk)           \
+    INSTANTIATE_SCAN_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_BY_KEY_OP(ROp)    \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, int)  \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uint) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, intl) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uintl)
+
+INSTANTIATE_SCAN_BY_KEY_OP(af_add_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_mul_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_min_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_max_t)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/scan_by_key.hpp b/src/backend/oneapi/scan_by_key.hpp
new file mode 100644
index 0000000000..7512f479c1
--- /dev/null
+++ b/src/backend/oneapi/scan_by_key.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan = true);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/select.cpp b/src/backend/oneapi/select.cpp
new file mode 100644
index 0000000000..b24b1fa340
--- /dev/null
+++ b/src/backend/oneapi/select.cpp
@@ -0,0 +1,139 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <select.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <common/jit/NaryNode.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/select.hpp>
+#include <scalar.hpp>
+
+#include <nonstd/span.hpp>
+#include <memory>
+
+using af::dim4;
+
+using arrayfire::common::half;
+using arrayfire::common::NaryNode;
+
+using std::make_shared;
+using std::max;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(NaryNode(
+        static_cast<af::dtype>(af::dtype_traits<T>::af_type), "__select", 3,
+        {{cond_node, a_node, b_node}}, af_select_t, height));
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T>(cond, a, b, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    Array<T> b       = createScalarNode<T>(odims, b_val);
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(NaryNode(
+        static_cast<af::dtype>(af::dtype_traits<T>::af_type),
+        (flip ? "__not_select" : "__select"), 3, {{cond_node, a_node, b_node}},
+        (flip ? af_not_select_t : af_select_t), height));
+
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T, flip>(cond, a, b_val, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b) {
+    kernel::select<T>(out, cond, a, b, out.ndims());
+}
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b) {
+    kernel::select_scalar<T>(out, cond, a, b, out.ndims(), flip);
+}
+
+#define INSTANTIATE(T)                                                   \
+    template Array<T> createSelectNode<T>(                               \
+        const Array<char> &cond, const Array<T> &a, const Array<T> &b,   \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, true>(                         \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, false>(                        \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template void select<T>(Array<T> & out, const Array<char> &cond,     \
+                            const Array<T> &a, const Array<T> &b);       \
+    template void select_scalar<T, true>(Array<T> & out,                 \
+                                         const Array<char> &cond,        \
+                                         const Array<T> &a, const T &b); \
+    template void select_scalar<T, false>(Array<T> & out,                \
+                                          const Array<char> &cond,       \
+                                          const Array<T> &a, const T &b)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(uint);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(char);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(half);
+
+#undef INSTANTIATE
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/select.hpp b/src/backend/oneapi/select.hpp
new file mode 100644
index 0000000000..754a0ec44d
--- /dev/null
+++ b/src/backend/oneapi/select.hpp
@@ -0,0 +1,31 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b);
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b);
+
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const af::dim4 &odims);
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const af::dim4 &odims);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/set.cpp b/src/backend/oneapi/set.cpp
new file mode 100644
index 0000000000..4c4b68e4b0
--- /dev/null
+++ b/src/backend/oneapi/set.cpp
@@ -0,0 +1,137 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+// oneDPL headers should be included before standard headers
+#define ONEDPL_USE_PREDEFINED_POLICIES 0
+#include <oneapi/dpl/algorithm>
+#include <oneapi/dpl/execution>
+#include <oneapi/dpl/iterator>
+
+#include <Array.hpp>
+#include <common/deprecated.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <set.hpp>
+#include <sort.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+using af::dim4;
+
+using std::conditional;
+using std::is_same;
+template<typename T>
+using ltype_t = typename conditional<is_same<T, intl>::value, cl_long, T>::type;
+
+template<typename T>
+using type_t =
+    typename conditional<is_same<T, uintl>::value, cl_ulong, ltype_t<T>>::type;
+
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted) {
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    Array<T> out = copyArray<T>(in);
+
+    auto out_begin = ::oneapi::dpl::begin(*out.get());
+    auto out_end   = out_begin + out.elements();
+
+    if (!is_sorted) {
+        std::sort(dpl_policy, out_begin, out_end,
+                  [](auto lhs, auto rhs) { return lhs < rhs; });
+    }
+
+    out_end = std::unique(dpl_policy, out_begin, out_end);
+
+    out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
+
+    return out;
+}
+
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique) {
+    Array<T> unique_first  = first;
+    Array<T> unique_second = second;
+
+    if (!is_unique) {
+        unique_first  = setUnique(first, false);
+        unique_second = setUnique(second, false);
+    }
+
+    size_t out_size = unique_first.elements() + unique_second.elements();
+    Array<T> out    = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
+
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    auto first_begin = ::oneapi::dpl::begin(*unique_first.get());
+    auto first_end   = first_begin + unique_first.elements();
+
+    auto second_begin = ::oneapi::dpl::begin(*unique_second.get());
+    auto second_end   = second_begin + unique_second.elements();
+
+    auto out_begin = ::oneapi::dpl::begin(*out.get());
+
+    auto out_end = std::set_union(dpl_policy, first_begin, first_end,
+                                  second_begin, second_end, out_begin);
+    out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
+    return out;
+}
+
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique) {
+    Array<T> unique_first  = first;
+    Array<T> unique_second = second;
+
+    if (!is_unique) {
+        unique_first  = setUnique(first, false);
+        unique_second = setUnique(second, false);
+    }
+
+    size_t out_size =
+        std::max(unique_first.elements(), unique_second.elements());
+    Array<T> out = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
+
+    auto dpl_policy = ::oneapi::dpl::execution::make_device_policy(getQueue());
+
+    auto first_begin = ::oneapi::dpl::begin(*unique_first.get());
+    auto first_end   = first_begin + unique_first.elements();
+
+    auto second_begin = ::oneapi::dpl::begin(*unique_second.get());
+    auto second_end   = second_begin + unique_second.elements();
+
+    auto out_begin = ::oneapi::dpl::begin(*out.get());
+
+    auto out_end = std::set_intersection(dpl_policy, first_begin, first_end,
+                                         second_begin, second_end, out_begin);
+    out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
+    return out;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
+    template Array<T> setUnion<T>(                                            \
+        const Array<T> &first, const Array<T> &second, const bool is_unique); \
+    template Array<T> setIntersect<T>(                                        \
+        const Array<T> &first, const Array<T> &second, const bool is_unique);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/set.hpp b/src/backend/oneapi/set.hpp
new file mode 100644
index 0000000000..beef4a44b4
--- /dev/null
+++ b/src/backend/oneapi/set.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted);
+
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique);
+
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/shift.cpp b/src/backend/oneapi/shift.cpp
new file mode 100644
index 0000000000..7e5e31bf37
--- /dev/null
+++ b/src/backend/oneapi/shift.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <shift.hpp>
+
+#include <common/jit/ShiftNodeBase.hpp>
+#include <err_oneapi.hpp>
+#include <traits.hpp>
+
+using af::dim4;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::ShiftNodeBase;
+using std::array;
+using std::make_shared;
+using std::static_pointer_cast;
+using std::string;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+using ShiftNode = ShiftNodeBase<jit::BufferNode<T>>;
+
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]) {
+    // Shift should only be the first node in the JIT tree.
+    // Force input to be evaluated so that in is always a buffer.
+    in.eval();
+
+    string name_str("Sh");
+    name_str += shortname<T>(true);
+    const dim4 &iDims = in.dims();
+    dim4 oDims        = iDims;
+
+    array<int, 4> shifts{};
+    for (int i = 0; i < 4; i++) {
+        // shifts[i] will always be non-negative and in [0, oDims[i]].
+        // Negative shifts are converted to positive by going the other way
+        // round
+        shifts[i] = -(sdims[i] % static_cast<int>(oDims[i])) +
+                    oDims[i] * (sdims[i] > 0);
+        assert(shifts[i] >= 0 && shifts[i] <= oDims[i]);
+    }
+
+    auto node = make_shared<ShiftNode<T>>(
+        static_cast<af::dtype>(dtype_traits<T>::af_type),
+        static_pointer_cast<jit::BufferNode<T>>(in.getNode()), shifts);
+    return createNodeArray<T>(oDims, common::Node_ptr(node));
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/shift.hpp b/src/backend/oneapi/shift.hpp
new file mode 100644
index 0000000000..1c808479d0
--- /dev/null
+++ b/src/backend/oneapi/shift.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sift.cpp b/src/backend/oneapi/sift.cpp
new file mode 100644
index 0000000000..72dccab12d
--- /dev/null
+++ b/src/backend/oneapi/sift.cpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sift.hpp>
+
+// #include <kernel/sift.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x_out, Array<float>& y_out, Array<float>& score_out,
+              Array<float>& ori_out, Array<float>& size_out,
+              Array<float>& desc_out, const Array<T>& in,
+              const unsigned n_layers, const float contrast_thr,
+              const float edge_thr, const float init_sigma,
+              const bool double_input, const float img_scale,
+              const float feature_ratio, const bool compute_GLOH) {
+    ONEAPI_NOT_SUPPORTED("sift Not supported");
+    return 0;
+
+    // unsigned nfeat_out;
+    // unsigned desc_len;
+
+    // Param x;
+    // Param y;
+    // Param score;
+    // Param ori;
+    // Param size;
+    // Param desc;
+
+    // kernel::sift<T, convAccT>(&nfeat_out, &desc_len, x, y, score, ori, size,
+    //                           desc, in, n_layers, contrast_thr, edge_thr,
+    //                           init_sigma, double_input, img_scale,
+    //                           feature_ratio, compute_GLOH);
+
+    // if (nfeat_out > 0) {
+    //     const dim4 out_dims(nfeat_out);
+    //     const dim4 desc_dims(desc_len, nfeat_out);
+
+    //     x_out     = createParamArray<float>(x, true);
+    //     y_out     = createParamArray<float>(y, true);
+    //     score_out = createParamArray<float>(score, true);
+    //     ori_out   = createParamArray<float>(ori, true);
+    //     size_out  = createParamArray<float>(size, true);
+    //     desc_out  = createParamArray<float>(desc, true);
+    // }
+
+    // return nfeat_out;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned sift<T, convAccT>(                                      \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        Array<float> & ori_out, Array<float> & size_out,                      \
+        Array<float> & desc_out, const Array<T>& in, const unsigned n_layers, \
+        const float contrast_thr, const float edge_thr,                       \
+        const float init_sigma, const bool double_input,                      \
+        const float img_scale, const float feature_ratio,                     \
+        const bool compute_GLOH);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sift.hpp b/src/backend/oneapi/sift.hpp
new file mode 100644
index 0000000000..ae656a73fd
--- /dev/null
+++ b/src/backend/oneapi/sift.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sobel.cpp b/src/backend/oneapi/sobel.cpp
new file mode 100644
index 0000000000..e919a37b77
--- /dev/null
+++ b/src/backend/oneapi/sobel.cpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/sobel.hpp>
+#include <sobel.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename Ti, typename To>
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size) {
+    ONEAPI_NOT_SUPPORTED("sobelDerivatives Not supported");
+
+    Array<To> dx = createEmptyArray<To>(img.dims());
+    Array<To> dy = createEmptyArray<To>(img.dims());
+
+    // switch (ker_size) {
+    //     case 3: kernel::sobel<Ti, To, 3>(dx, dy, img); break;
+    // }
+
+    return std::make_pair(dx, dy);
+}
+
+#define INSTANTIATE(Ti, To)                                    \
+    template std::pair<Array<To>, Array<To>> sobelDerivatives( \
+        const Array<Ti> &img, const unsigned &ker_size);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, int)
+INSTANTIATE(char, int)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, int)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, int)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sobel.hpp b/src/backend/oneapi/sobel.hpp
new file mode 100644
index 0000000000..44e2356dc5
--- /dev/null
+++ b/src/backend/oneapi/sobel.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <utility>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename Ti, typename To>
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/solve.cpp b/src/backend/oneapi/solve.cpp
new file mode 100644
index 0000000000..4d213d25ae
--- /dev/null
+++ b/src/backend/oneapi/solve.cpp
@@ -0,0 +1,374 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <solve.hpp>
+
+#include <err_oneapi.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <Array.hpp>
+#include <blas.hpp>
+#include <common/cast.hpp>
+#include <copy.hpp>
+#include <lu.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <oneapi/mkl/blas.hpp>
+#include <oneapi/mkl/lapack.hpp>
+#include <platform.hpp>
+#include <transpose.hpp>
+
+#include <common/traits.hpp>
+#include <algorithm>
+#include <type_traits>
+#include <vector>
+
+using arrayfire::common::cast;
+using std::min;
+using std::vector;
+using sycl::buffer;
+
+namespace arrayfire {
+namespace oneapi {
+
+static ::oneapi::mkl::transpose toMKLTranspose(af_mat_prop opt) {
+    switch (opt) {
+        case AF_MAT_NONE: return ::oneapi::mkl::transpose::nontrans;
+        case AF_MAT_TRANS: return ::oneapi::mkl::transpose::trans;
+        case AF_MAT_CTRANS: return ::oneapi::mkl::transpose::conjtrans;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+}
+
+template<typename T>
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    const int64_t N    = A.dims()[0];
+    const int64_t NRHS = b.dims()[1];
+    const int64_t LDA  = A.strides()[1];
+    const int64_t LDB  = b.strides()[1];
+
+    ::oneapi::mkl::transpose opts = toMKLTranspose(options);
+    std::int64_t scratchpad_size =
+        ::oneapi::mkl::lapack::getrs_scratchpad_size<compute_t<T>>(
+            getQueue(), opts, N, NRHS, LDA, LDB);
+
+    Array<intl> ipiv        = cast<intl, int>(pivot);
+    buffer<int64_t> ipivBuf = ipiv.get()->reinterpret<int64_t, 1>();
+    auto scratchpad         = memAlloc<compute_t<T>>(scratchpad_size);
+
+    Array<compute_t<T>> B     = copyArray<compute_t<T>>(b);
+    buffer<compute_t<T>> aBuf = A.template getBufferWithOffset<compute_t<T>>();
+    buffer<compute_t<T>> bBuf = B.template getBufferWithOffset<compute_t<T>>();
+
+    ::oneapi::mkl::lapack::getrs(getQueue(), opts, N, NRHS, aBuf, LDA, ipivBuf,
+                                 bBuf, LDB, *scratchpad, scratchpad->size());
+    return B;
+}
+
+template<typename T>
+Array<T> generalSolve(const Array<T> &a, const Array<T> &b) {
+    int batches = a.dims()[2] * a.dims()[3];
+
+    dim4 aDims = a.dims();
+    dim4 bDims = b.dims();
+    int M      = aDims[0];
+    int N      = aDims[1];
+    int K      = bDims[1];
+    int MN     = std::min(M, N);
+
+    int lda        = a.strides()[1];
+    int astride    = a.strides()[2];
+    auto ipiv      = memAlloc<int64_t>(MN * batches);
+    int ipivstride = MN;
+
+    int ldb     = b.strides()[1];
+    int bstride = b.strides()[2];
+
+    vector<int> info(batches, 0);
+
+    Array<T> A = copyArray<T>(a);  // A will be overwritten by L,U
+    Array<T> B = copyArray<T>(b);  // will be overwritten with solution
+
+    std::int64_t scratchpad_size =
+        ::oneapi::mkl::lapack::getrf_batch_scratchpad_size<compute_t<T>>(
+            getQueue(), M, N, lda, astride, ipivstride, batches);
+
+    auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+
+    buffer<compute_t<T>> aBuf = A.template getBufferWithOffset<compute_t<T>>();
+    buffer<compute_t<T>> bBuf = B.template getBufferWithOffset<compute_t<T>>();
+    ::oneapi::mkl::lapack::getrf_batch(getQueue(), M, N, aBuf, lda, astride,
+                                       *ipiv, ipivstride, batches, *scratchpad,
+                                       scratchpad->size());
+
+    scratchpad_size =
+        ::oneapi::mkl::lapack::getrs_batch_scratchpad_size<compute_t<T>>(
+            getQueue(), ::oneapi::mkl::transpose::nontrans, N, K, lda, astride,
+            ipivstride, ldb, bstride, batches);
+
+    auto scratchpad_rs = memAlloc<compute_t<T>>(scratchpad_size);
+
+    ::oneapi::mkl::lapack::getrs_batch(
+        getQueue(), ::oneapi::mkl::transpose::nontrans, N, K, aBuf, lda,
+        astride, *ipiv, ipivstride, bBuf, ldb, bstride, batches, *scratchpad_rs,
+        scratchpad_rs->size());
+
+    return B;
+}
+
+template<typename T>
+Array<T> leastSquares(const Array<T> &a, const Array<T> &b) {
+    int64_t M  = a.dims()[0];
+    int64_t N  = a.dims()[1];
+    int64_t K  = b.dims()[1];
+    int64_t MN = min(M, N);
+
+    Array<T> B = createEmptyArray<T>(dim4());
+
+    if (M < N) {
+        const dim4 NullShape(0, 0, 0, 0);
+
+        // Least squares for this case is solved using the following
+        // solve(A, B) == matmul(Q, Xpad);
+        // Where:
+        // Xpad == pad(Xt, N - M, 1);
+        // Xt   == tri_solve(R1, B);
+        // R1   == R(seq(M), seq(M));
+        // transpose(A) == matmul(Q, R);
+
+        // QR is performed on the transpose of A
+        Array<T> A = transpose<T>(a, true);
+        dim4 endPadding(N - b.dims()[0], K - b.dims()[1], 0, 0);
+        B = (endPadding == NullShape
+                 ? copyArray(b)
+                 : padArrayBorders(b, NullShape, endPadding, AF_PAD_ZERO));
+
+        // Get workspace needed for QR
+        std::int64_t scratchpad_size =
+            ::oneapi::mkl::lapack::geqrf_scratchpad_size<compute_t<T>>(
+                getQueue(), A.dims()[0], A.dims()[1], A.strides()[1]);
+
+        auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+        auto t          = memAlloc<compute_t<T>>(MN);
+
+        buffer<compute_t<T>> aBuf =
+            A.template getBufferWithOffset<compute_t<T>>();
+        // Perform in-place QR
+        ::oneapi::mkl::lapack::geqrf(getQueue(), A.dims()[0], A.dims()[1], aBuf,
+                                     A.strides()[1], *t, *scratchpad,
+                                     scratchpad->size());
+
+        // R1 = R(seq(M), seq(M));
+        A.resetDims(dim4(M, M));
+
+        // Bt = tri_solve(R1, B);
+        B.resetDims(dim4(M, K));
+
+        buffer<compute_t<T>> bBuf =
+            B.template getBufferWithOffset<compute_t<T>>();
+        // TODO: move to helper? trsm<T>(A, B, AF_MAT_CTRANS, true, true,
+        // false);
+        compute_t<T> alpha = scalar<compute_t<T>>(1);
+        ::oneapi::mkl::blas::trsm(
+            getQueue(), ::oneapi::mkl::side::left, ::oneapi::mkl::uplo::upper,
+            ::oneapi::mkl::transpose::conjtrans, ::oneapi::mkl::diag::nonunit,
+            B.dims()[0], B.dims()[1], alpha, aBuf, A.strides()[1], bBuf,
+            B.strides()[1]);
+
+        // Bpad = pad(Bt, ..)
+        B.resetDims(dim4(N, K));
+
+        // matmul(Q, Bpad)
+        if constexpr (std::is_floating_point<compute_t<T>>()) {
+            std::int64_t scratchpad_size =
+                ::oneapi::mkl::lapack::ormqr_scratchpad_size<compute_t<T>>(
+                    getQueue(), ::oneapi::mkl::side::left,
+                    ::oneapi::mkl::transpose::nontrans, B.dims()[0],
+                    B.dims()[1], A.dims()[0], A.strides()[1], B.strides()[1]);
+
+            auto scratchpad_ormqr = memAlloc<compute_t<T>>(scratchpad_size);
+            ::oneapi::mkl::lapack::ormqr(
+                getQueue(), ::oneapi::mkl::side::left,
+                ::oneapi::mkl::transpose::nontrans, B.dims()[0], B.dims()[1],
+                A.dims()[0], aBuf, A.strides()[1], *t, bBuf, B.strides()[1],
+                *scratchpad_ormqr, scratchpad_ormqr->size());
+        } else if constexpr (common::isComplex(static_cast<af::dtype>(
+                                 dtype_traits<compute_t<T>>::af_type))) {
+            std::int64_t scratchpad_size =
+                ::oneapi::mkl::lapack::unmqr_scratchpad_size<compute_t<T>>(
+                    getQueue(), ::oneapi::mkl::side::left,
+                    ::oneapi::mkl::transpose::nontrans, B.dims()[0],
+                    B.dims()[1], A.dims()[0], A.strides()[1], B.strides()[1]);
+
+            auto scratchpad_unmqr = memAlloc<compute_t<T>>(scratchpad_size);
+            ::oneapi::mkl::lapack::unmqr(
+                getQueue(), ::oneapi::mkl::side::left,
+                ::oneapi::mkl::transpose::nontrans, B.dims()[0], B.dims()[1],
+                A.dims()[0], aBuf, A.strides()[1], *t, bBuf, B.strides()[1],
+                *scratchpad_unmqr, scratchpad_unmqr->size());
+        }
+
+    } else if (M > N) {
+        // Least squares for this case is solved using the following
+        // solve(A, B) == tri_solve(R1, Bt);
+        // Where:
+        // R1 == R(seq(N), seq(N));
+        // Bt == matmul(transpose(Q1), B);
+        // Q1 == Q(span, seq(N));
+        // A  == matmul(Q, R);
+
+        Array<T> A = copyArray<T>(a);
+        B          = copyArray(b);
+
+        // Get workspace needed for QR
+        std::int64_t scratchpad_size =
+            ::oneapi::mkl::lapack::geqrf_scratchpad_size<compute_t<T>>(
+                getQueue(), M, N, A.strides()[1]);
+
+        auto scratchpad = memAlloc<compute_t<T>>(scratchpad_size);
+        auto t          = memAlloc<compute_t<T>>(MN);
+
+        buffer<compute_t<T>> aBuf =
+            A.template getBufferWithOffset<compute_t<T>>();
+        // Perform in-place QR
+        ::oneapi::mkl::lapack::geqrf(getQueue(), M, N, aBuf, A.strides()[1], *t,
+                                     *scratchpad, scratchpad->size());
+
+        // Bt = matmul(transpose(Q1), B)
+        buffer<compute_t<T>> bBuf =
+            B.template getBufferWithOffset<compute_t<T>>();
+        if constexpr (std::is_floating_point<compute_t<T>>()) {
+            std::int64_t scratchpad_size =
+                ::oneapi::mkl::lapack::ormqr_scratchpad_size<compute_t<T>>(
+                    getQueue(), ::oneapi::mkl::side::left,
+                    ::oneapi::mkl::transpose::trans, M, K, N, A.strides()[1],
+                    b.strides()[1]);
+
+            auto scratchpad_ormqr = memAlloc<compute_t<T>>(scratchpad_size);
+            ::oneapi::mkl::lapack::ormqr(getQueue(), ::oneapi::mkl::side::left,
+                                         ::oneapi::mkl::transpose::trans, M, K,
+                                         N, aBuf, A.strides()[1], *t, bBuf,
+                                         b.strides()[1], *scratchpad_ormqr,
+                                         scratchpad_ormqr->size());
+        } else if constexpr (common::isComplex(static_cast<af::dtype>(
+                                 dtype_traits<compute_t<T>>::af_type))) {
+            std::int64_t scratchpad_size =
+                ::oneapi::mkl::lapack::unmqr_scratchpad_size<compute_t<T>>(
+                    getQueue(), ::oneapi::mkl::side::left,
+                    ::oneapi::mkl::transpose::conjtrans, M, K, N,
+                    A.strides()[1], b.strides()[1]);
+
+            auto scratchpad_unmqr = memAlloc<compute_t<T>>(scratchpad_size);
+            ::oneapi::mkl::lapack::unmqr(getQueue(), ::oneapi::mkl::side::left,
+                                         ::oneapi::mkl::transpose::conjtrans, M,
+                                         K, N, aBuf, A.strides()[1], *t, bBuf,
+                                         b.strides()[1], *scratchpad_unmqr,
+                                         scratchpad_unmqr->size());
+        }
+
+        // tri_solve(R1, Bt)
+        A.resetDims(dim4(N, N));
+        B.resetDims(dim4(N, K));
+
+        compute_t<T> alpha = scalar<compute_t<T>>(1);
+        ::oneapi::mkl::blas::trsm(
+            getQueue(), ::oneapi::mkl::side::left, ::oneapi::mkl::uplo::upper,
+            ::oneapi::mkl::transpose::nontrans, ::oneapi::mkl::diag::nonunit, N,
+            K, alpha, aBuf, A.strides()[1], bBuf, B.strides()[1]);
+    }
+
+    return B;
+}
+
+template<typename T>
+Array<T> triangleSolve(const Array<T> &A, const Array<T> &b,
+                       const af_mat_prop options) {
+    Array<compute_t<T>> B = copyArray<T>(b);
+
+    compute_t<T> alpha       = scalar<compute_t<T>>(1);
+    ::oneapi::mkl::uplo uplo = (options & AF_MAT_UPPER)
+                                   ? ::oneapi::mkl::uplo::upper
+                                   : ::oneapi::mkl::uplo::lower;
+
+    ::oneapi::mkl::diag unitdiag = (options & AF_MAT_DIAG_UNIT)
+                                       ? ::oneapi::mkl::diag::unit
+                                       : ::oneapi::mkl::diag::nonunit;
+
+    buffer<compute_t<T>> aBuf = A.template getBufferWithOffset<compute_t<T>>();
+    buffer<compute_t<T>> bBuf = B.template getBufferWithOffset<compute_t<T>>();
+
+    ::oneapi::mkl::blas::trsm(getQueue(), ::oneapi::mkl::side::left, uplo,
+                              ::oneapi::mkl::transpose::nontrans, unitdiag,
+                              B.dims()[0], B.dims()[1], alpha, aBuf,
+                              A.strides()[1], bBuf, B.strides()[1]);
+    return B;
+}
+
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    if (options & AF_MAT_UPPER || options & AF_MAT_LOWER) {
+        return triangleSolve<T>(a, b, options);
+    }
+
+    if (a.dims()[0] == a.dims()[1]) {
+        return generalSolve<T>(a, b);
+    } else {
+        return leastSquares<T>(a, b);
+    }
+}
+
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
+    template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
+
+INSTANTIATE_SOLVE(float)
+INSTANTIATE_SOLVE(cfloat)
+INSTANTIATE_SOLVE(double)
+INSTANTIATE_SOLVE(cdouble)
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
+    template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
+
+INSTANTIATE_SOLVE(float)
+INSTANTIATE_SOLVE(cfloat)
+INSTANTIATE_SOLVE(double)
+INSTANTIATE_SOLVE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/oneapi/solve.hpp b/src/backend/oneapi/solve.hpp
new file mode 100644
index 0000000000..a0c8924fa9
--- /dev/null
+++ b/src/backend/oneapi/solve.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options = AF_MAT_NONE);
+
+template<typename T>
+Array<T> solveLU(const Array<T> &a, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options = AF_MAT_NONE);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sort.cpp b/src/backend/oneapi/sort.cpp
new file mode 100644
index 0000000000..9bfbeb9094
--- /dev/null
+++ b/src/backend/oneapi/sort.cpp
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(__clang__)
+#pragma clang diagnostic push
+// temporary ignores for DPL internals
+#pragma clang diagnostic ignored "-Wunused-variable"
+#pragma clang diagnostic ignored "-Wdeprecated-declarations"
+#endif
+
+#include <kernel/sort.hpp>
+
+#include <Array.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <reorder.hpp>
+#include <sort.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending) {
+    Array<T> out = copyArray<T>(in);
+    switch (dim) {
+        case 0: kernel::sort0<T>(out, isAscending); break;
+        case 1: kernel::sortBatched<T>(out, 1, isAscending); break;
+        case 2: kernel::sortBatched<T>(out, 2, isAscending); break;
+        case 3: kernel::sortBatched<T>(out, 3, isAscending); break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+
+    if (dim != 0) {
+        af::dim4 preorderDims = out.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = out.dims()[dim];
+        for (int i = 1; i <= static_cast<int>(dim); i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = out.dims()[i - 1];
+        }
+
+        out.setDataDims(preorderDims);
+        out = reorder<T>(out, reorderDims);
+    }
+    return out;
+}
+
+#define INSTANTIATE(T)                                                \
+    template Array<T> sort<T>(const Array<T> &in, const unsigned dim, \
+                              bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#endif
diff --git a/src/backend/oneapi/sort.hpp b/src/backend/oneapi/sort.hpp
new file mode 100644
index 0000000000..73512ed973
--- /dev/null
+++ b/src/backend/oneapi/sort.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sort_by_key.cpp b/src/backend/oneapi/sort_by_key.cpp
new file mode 100644
index 0000000000..ba24249955
--- /dev/null
+++ b/src/backend/oneapi/sort_by_key.cpp
@@ -0,0 +1,87 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <math.hpp>
+#include <reorder.hpp>
+#include <sort_by_key.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending) {
+    okey = copyArray<Tk>(ikey);
+    oval = copyArray<Tv>(ival);
+
+    switch (dim) {
+        case 0: kernel::sort0ByKey<Tk, Tv>(okey, oval, isAscending); break;
+        case 1:
+        case 2:
+        case 3:
+            kernel::sortByKeyBatched<Tk, Tv>(okey, oval, dim, isAscending);
+            break;
+        default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+
+    if (dim != 0) {
+        af::dim4 preorderDims = okey.dims();
+        af::dim4 reorderDims(0, 1, 2, 3);
+        reorderDims[dim] = 0;
+        preorderDims[0]  = okey.dims()[dim];
+        for (int i = 1; i <= (int)dim; i++) {
+            reorderDims[i - 1] = i;
+            preorderDims[i]    = okey.dims()[i - 1];
+        }
+
+        okey.setDataDims(preorderDims);
+        oval.setDataDims(preorderDims);
+
+        okey = reorder<Tk>(okey, reorderDims);
+        oval = reorder<Tv>(oval, reorderDims);
+    }
+}
+
+#define INSTANTIATE(Tk, Tv)                                        \
+    template void sort_by_key<Tk, Tv>(                             \
+        Array<Tk> & okey, Array<Tv> & oval, const Array<Tk> &ikey, \
+        const Array<Tv> &ival, const uint dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+INSTANTIATE1(float)
+INSTANTIATE1(double)
+INSTANTIATE1(int)
+INSTANTIATE1(uint)
+INSTANTIATE1(short)
+INSTANTIATE1(ushort)
+INSTANTIATE1(char)
+INSTANTIATE1(schar)
+INSTANTIATE1(uchar)
+INSTANTIATE1(intl)
+INSTANTIATE1(uintl)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sort_by_key.hpp b/src/backend/oneapi/sort_by_key.hpp
new file mode 100644
index 0000000000..665fdccaca
--- /dev/null
+++ b/src/backend/oneapi/sort_by_key.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sort_index.cpp b/src/backend/oneapi/sort_index.cpp
new file mode 100644
index 0000000000..a8c547f8a1
--- /dev/null
+++ b/src/backend/oneapi/sort_index.cpp
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <math.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
+#include <sort_index.hpp>
+#include <stdexcept>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void sort_index(Array<T> &okey, Array<uint> &oval, const Array<T> &in,
+                const uint dim, bool isAscending) {
+    try {
+        // okey contains values, oval contains indices
+        okey = copyArray<T>(in);
+        oval = range<uint>(in.dims(), dim);
+        oval.eval();
+
+        switch (dim) {
+            case 0: kernel::sort0ByKey<T, uint>(okey, oval, isAscending); break;
+            case 1:
+            case 2:
+            case 3:
+                kernel::sortByKeyBatched<T, uint>(okey, oval, dim, isAscending);
+                break;
+            default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+        }
+
+        if (dim != 0) {
+            af::dim4 preorderDims = okey.dims();
+            af::dim4 reorderDims(0, 1, 2, 3);
+            reorderDims[dim] = 0;
+            preorderDims[0]  = okey.dims()[dim];
+            for (uint i = 1; i <= dim; i++) {
+                reorderDims[i - 1] = i;
+                preorderDims[i]    = okey.dims()[i - 1];
+            }
+
+            okey.setDataDims(preorderDims);
+            oval.setDataDims(preorderDims);
+
+            okey = reorder<T>(okey, reorderDims);
+            oval = reorder<uint>(oval, reorderDims);
+        }
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
+}
+
+#define INSTANTIATE(T)                                              \
+    template void sort_index<T>(Array<T> & val, Array<uint> & idx,  \
+                                const Array<T> &in, const uint dim, \
+                                bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(arrayfire::common::half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sort_index.hpp b/src/backend/oneapi/sort_index.hpp
new file mode 100644
index 0000000000..30d6db07b9
--- /dev/null
+++ b/src/backend/oneapi/sort_index.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void sort_index(Array<T> &okey, Array<unsigned> &oval, const Array<T> &in,
+                const unsigned dim, bool isAscending);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse.cpp b/src/backend/oneapi/sparse.cpp
new file mode 100644
index 0000000000..2e9a67213f
--- /dev/null
+++ b/src/backend/oneapi/sparse.cpp
@@ -0,0 +1,226 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sparse.hpp>
+#include <sparse.hpp>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/moddims.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <range.hpp>
+#include <reduce.hpp>
+#include <where.hpp>
+
+#include <stdexcept>
+#include <string>
+
+#include <handle.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+using namespace common;
+
+#define P(exp) af_print_array_gen(#exp, getHandle(exp), 2)
+
+// Acts as the COO specialization of sparseConvertDenseToStorage, because
+// function templates cannot be partially specialized
+template<typename T>
+SparseArray<T> sparseConvertDenseToCOO(const Array<T> &in) {
+    in.eval();
+
+    Array<uint> nonZeroIdx_ = where<T>(in);
+    Array<int> nonZeroIdx   = cast<int, uint>(nonZeroIdx_);
+    nonZeroIdx.eval();
+
+    dim_t nNZ = nonZeroIdx.elements();
+
+    Array<int> constDim = createValueArray<int>(dim4(nNZ), in.dims()[0]);
+    constDim.eval();
+
+    Array<int> rowIdx =
+        arithOp<int, af_mod_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+    Array<int> colIdx =
+        arithOp<int, af_div_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+
+    Array<T> values = copyArray<T>(in);
+    values          = modDims(values, dim4(values.elements()));
+    values          = lookup<T, int>(values, nonZeroIdx, 0);
+
+    return createArrayDataSparseArray<T>(in.dims(), values, rowIdx, colIdx,
+                                         AF_STORAGE_COO);
+}
+
+template<typename T, af_storage stype>
+SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in_) {
+    in_.eval();
+
+    uint nNZ = getScalar<uint>(reduce_all<af_notzero_t, T, uint>(in_));
+
+    SparseArray<T> sparse_ = createEmptySparseArray<T>(in_.dims(), nNZ, stype);
+    sparse_.eval();
+
+    Array<T> &values   = sparse_.getValues();
+    Array<int> &rowIdx = sparse_.getRowIdx();
+    Array<int> &colIdx = sparse_.getColIdx();
+
+    kernel::dense2csr<T>(values, rowIdx, colIdx, in_);
+
+    return sparse_;
+}
+
+// Acts as the COO specialization of sparseConvertStorageToDense, because
+// function templates cannot be partially specialized
+template<typename T>
+Array<T> sparseConvertCOOToDense(const SparseArray<T> &in) {
+    in.eval();
+
+    Array<T> dense = createValueArray<T>(in.dims(), scalar<T>(0));
+    dense.eval();
+
+    const Array<T> values   = in.getValues();
+    const Array<int> rowIdx = in.getRowIdx();
+    const Array<int> colIdx = in.getColIdx();
+
+    kernel::coo2dense<T>(dense, values, rowIdx, colIdx);
+
+    return dense;
+}
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const SparseArray<T> &in_) {
+    if (stype != AF_STORAGE_CSR) {
+        AF_ERROR("oneAPI Backend only supports CSR or COO to Dense",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    in_.eval();
+
+    Array<T> dense_ = createValueArray<T>(in_.dims(), scalar<T>(0));
+    dense_.eval();
+
+    const Array<T> &values   = in_.getValues();
+    const Array<int> &rowIdx = in_.getRowIdx();
+    const Array<int> &colIdx = in_.getColIdx();
+
+    if (stype == AF_STORAGE_CSR) {
+        kernel::csr2dense<T>(dense_, values, rowIdx, colIdx);
+    } else {
+        AF_ERROR("oneAPI Backend only supports CSR or COO to Dense",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return dense_;
+}
+
+template<typename T, af_storage dest, af_storage src>
+SparseArray<T> sparseConvertStorageToStorage(const SparseArray<T> &in) {
+    in.eval();
+
+    SparseArray<T> converted = createEmptySparseArray<T>(
+        in.dims(), static_cast<int>(in.getNNZ()), dest);
+    converted.eval();
+
+    if (src == AF_STORAGE_CSR && dest == AF_STORAGE_COO) {
+        Array<int> index = range<int>(in.getNNZ(), 0);
+        index.eval();
+
+        Array<T> &ovalues         = converted.getValues();
+        Array<int> &orowIdx       = converted.getRowIdx();
+        Array<int> &ocolIdx       = converted.getColIdx();
+        const Array<T> &ivalues   = in.getValues();
+        const Array<int> &irowIdx = in.getRowIdx();
+        const Array<int> &icolIdx = in.getColIdx();
+
+        kernel::csr2coo<T>(ovalues, orowIdx, ocolIdx, ivalues, irowIdx, icolIdx,
+                           index);
+
+    } else if (src == AF_STORAGE_COO && dest == AF_STORAGE_CSR) {
+        Array<int> index = range<int>(in.getNNZ(), 0);
+        index.eval();
+
+        Array<T> &ovalues         = converted.getValues();
+        Array<int> &orowIdx       = converted.getRowIdx();
+        Array<int> &ocolIdx       = converted.getColIdx();
+        const Array<T> &ivalues   = in.getValues();
+        const Array<int> &irowIdx = in.getRowIdx();
+        const Array<int> &icolIdx = in.getColIdx();
+
+        Array<int> rowCopy = copyArray<int>(irowIdx);
+        rowCopy.eval();
+
+        kernel::coo2csr<T>(ovalues, orowIdx, ocolIdx, ivalues, irowIdx, icolIdx,
+                           index, rowCopy, in.dims()[0]);
+
+    } else {
+        // Should never reach here
+        AF_ERROR("oneAPI Backend invalid conversion combination",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return converted;
+}
+
+#define INSTANTIATE_TO_STORAGE(T, S)                     \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSR>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSC>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_COO>( \
+        const SparseArray<T> &in);
+
+#define INSTANTIATE_COO_SPECIAL(T)                                 \
+    template<>                                                     \
+    SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_COO>( \
+        const Array<T> &in) {                                      \
+        return sparseConvertDenseToCOO<T>(in);                     \
+    }                                                              \
+    template<>                                                     \
+    Array<T> sparseConvertStorageToDense<T, AF_STORAGE_COO>(       \
+        const SparseArray<T> &in) {                                \
+        return sparseConvertCOOToDense<T>(in);                     \
+    }
+
+#define INSTANTIATE_SPARSE(T)                                               \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSR>( \
+        const Array<T> &in);                                                \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSC>( \
+        const Array<T> &in);                                                \
+                                                                            \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSR>(       \
+        const SparseArray<T> &in);                                          \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSC>(       \
+        const SparseArray<T> &in);                                          \
+                                                                            \
+    INSTANTIATE_COO_SPECIAL(T)                                              \
+                                                                            \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSR)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSC)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_COO)
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+#undef INSTANTIATE_TO_STORAGE
+#undef INSTANTIATE_COO_SPECIAL
+#undef INSTANTIATE_SPARSE
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse.hpp b/src/backend/oneapi/sparse.hpp
new file mode 100644
index 0000000000..e7440fc405
--- /dev/null
+++ b/src/backend/oneapi/sparse.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, af_storage stype>
+common::SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in);
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const common::SparseArray<T> &in);
+
+template<typename T, af_storage dest, af_storage src>
+common::SparseArray<T> sparseConvertStorageToStorage(
+    const common::SparseArray<T> &in);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse_arith.cpp b/src/backend/oneapi/sparse_arith.cpp
new file mode 100644
index 0000000000..4b3e7301c4
--- /dev/null
+++ b/src/backend/oneapi/sparse_arith.cpp
@@ -0,0 +1,179 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <err_oneapi.hpp>
+#include <kernel/sparse_arith.hpp>
+#include <sparse.hpp>
+
+#include <stdexcept>
+#include <string>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <scan.hpp>
+#include <where.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+using namespace common;
+using std::numeric_limits;
+
+template<typename T>
+T getInf() {
+    return scalar<T>(numeric_limits<T>::infinity());
+}
+
+template<>
+cfloat getInf() {
+    return scalar<cfloat, float>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in OpenCL
+}
+
+template<>
+cdouble getInf() {
+    return scalar<cdouble, double>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in OpenCL
+}
+
+template<typename T, af_op_t op>
+Array<T> arithOpD(const SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    Array<T> out  = createEmptyArray<T>(dim4(0));
+    Array<T> zero = createValueArray<T>(rhs.dims(), scalar<T>(0));
+    switch (op) {
+        case af_add_t: out = copyArray<T>(rhs); break;
+        case af_sub_t:
+            out = reverse ? copyArray<T>(rhs)
+                          : arithOp<T, af_sub_t>(zero, rhs, rhs.dims());
+            break;
+        default: out = copyArray<T>(rhs);
+    }
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const Array<T> &rhs,
+                       const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    SparseArray<T> out = createArrayDataSparseArray<T>(
+        lhs.dims(), lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+        lhs.getStorage(), true);
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const SparseArray<T> &rhs) {
+    lhs.eval();
+    rhs.eval();
+    af::storage sfmt = lhs.getStorage();
+
+    const dim4 &ldims = lhs.dims();
+
+    const uint M = ldims[0];
+    const uint N = ldims[1];
+
+    const dim_t nnzA = lhs.getNNZ();
+    const dim_t nnzB = rhs.getNNZ();
+
+    auto temp = createValueArray<int>(dim4(M + 1), scalar<int>(0));
+    temp.eval();
+
+    unsigned nnzC = 0;
+    kernel::csrCalcOutNNZ(temp, nnzC, M, N, nnzA, lhs.getRowIdx(),
+                          lhs.getColIdx(), nnzB, rhs.getRowIdx(),
+                          rhs.getColIdx());
+
+    auto outRowIdx = scan<af_add_t, int, int>(temp, 0);
+
+    auto outColIdx = createEmptyArray<int>(dim4(nnzC));
+    auto outValues = createEmptyArray<T>(dim4(nnzC));
+
+    kernel::ssArithCSR<T, op>(outValues, outColIdx, outRowIdx, M, N, nnzA,
+                              lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+                              nnzB, rhs.getValues(), rhs.getRowIdx(),
+                              rhs.getColIdx());
+
+    SparseArray<T> retVal = createArrayDataSparseArray(
+        ldims, outValues, outRowIdx, outColIdx, sfmt);
+    return retVal;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> arithOpD<T, af_add_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_sub_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_mul_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_div_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse_arith.hpp b/src/backend/oneapi/sparse_arith.hpp
new file mode 100644
index 0000000000..b35d4963e1
--- /dev/null
+++ b/src/backend/oneapi/sparse_arith.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+// Functions cannot be overloaded on return type alone, so the variant that
+// returns a dense Array needs a separate name (arithOpD).
+template<typename T, af_op_t op>
+Array<T> arithOpD(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const Array<T> &rhs, const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const common::SparseArray<T> &rhs);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse_blas.cpp b/src/backend/oneapi/sparse_blas.cpp
new file mode 100644
index 0000000000..0494a5806e
--- /dev/null
+++ b/src/backend/oneapi/sparse_blas.cpp
@@ -0,0 +1,104 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sparse_blas.hpp>
+
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+#include <oneapi/mkl/spblas.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <cassert>
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace oneapi {
+
+using namespace common;
+
+// Converts an af_mat_prop option to the corresponding oneMKL transpose type
+static ::oneapi::mkl::transpose toBlasTranspose(af_mat_prop opt) {
+    switch (opt) {
+        case AF_MAT_NONE: return ::oneapi::mkl::transpose::nontrans;
+        case AF_MAT_TRANS: return ::oneapi::mkl::transpose::trans;
+        case AF_MAT_CTRANS: return ::oneapi::mkl::transpose::conjtrans;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+}
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhsIn,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    int lRowDim = (optLhs == AF_MAT_NONE) ? 0 : 1;
+    static const int rColDim =
+        1;  // Unsupported : (optRhs == AF_MAT_NONE) ? 1 : 0;
+
+    dim4 lDims    = lhs.dims();
+    dim4 rDims    = rhsIn.dims();
+    dim4 rStrides = rhsIn.strides();
+    int M         = lDims[lRowDim];
+    int N         = rDims[rColDim];
+
+    Array<T> out  = createEmptyArray<T>(af::dim4(M, N, 1, 1));
+    dim4 oStrides = out.strides();
+
+    static const T alpha = scalar<T>(1.0);
+    static const T beta  = scalar<T>(0.0);
+
+    const Array<T>& values      = lhs.getValues();
+    const Array<int>& rowIdx    = lhs.getRowIdx();
+    const Array<int>& colIdx    = lhs.getColIdx();
+    sycl::buffer<T, 1> valBuf   = values.template getBufferWithOffset<T>();
+    sycl::buffer<int, 1> rowBuf = rowIdx.template getBufferWithOffset<int>();
+    sycl::buffer<int, 1> colBuf = colIdx.template getBufferWithOffset<int>();
+
+    const auto lOpts = toBlasTranspose(optLhs);
+    const auto rOpts = toBlasTranspose(optRhs);
+
+    sycl::buffer<T, 1> rhsBuf = rhsIn.template getBufferWithOffset<T>();
+    sycl::buffer<T, 1> outBuf = out.template getBufferWithOffset<T>();
+
+    ::oneapi::mkl::sparse::matrix_handle_t CSRHandle = nullptr;
+    ::oneapi::mkl::sparse::init_matrix_handle(&CSRHandle);
+    ::oneapi::mkl::sparse::set_csr_data(
+        getQueue(), CSRHandle, lDims[0], lDims[1],
+        ::oneapi::mkl::index_base::zero, rowBuf, colBuf, valBuf);
+
+    if (N == 1) {
+        ::oneapi::mkl::sparse::gemv(getQueue(), lOpts, alpha, CSRHandle, rhsBuf,
+                                    beta, outBuf);
+    } else {
+        ::oneapi::mkl::sparse::gemm(
+            getQueue(), ::oneapi::mkl::layout::col_major, lOpts, rOpts, alpha,
+            CSRHandle, rhsBuf, N, rStrides[1], beta, outBuf, oStrides[1]);
+    }
+    ::oneapi::mkl::sparse::release_matrix_handle(getQueue(), &CSRHandle);
+    return out;
+}
+
+#define INSTANTIATE_SPARSE(T)                                            \
+    template Array<T> matmul<T>(const common::SparseArray<T>& lhs,       \
+                                const Array<T>& rhs, af_mat_prop optLhs, \
+                                af_mat_prop optRhs);
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sparse_blas.hpp b/src/backend/oneapi/sparse_blas.hpp
new file mode 100644
index 0000000000..a5acc6ffc0
--- /dev/null
+++ b/src/backend/oneapi/sparse_blas.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/sum.cpp b/src/backend/oneapi/sum.cpp
new file mode 100644
index 0000000000..990979ba25
--- /dev/null
+++ b/src/backend/oneapi/sum.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/half.hpp>
+#include "reduce_impl.hpp"
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+// sum
+INSTANTIATE(af_add_t, float, float)
+INSTANTIATE(af_add_t, double, double)
+INSTANTIATE(af_add_t, cfloat, cfloat)
+INSTANTIATE(af_add_t, cdouble, cdouble)
+INSTANTIATE(af_add_t, int, int)
+INSTANTIATE(af_add_t, int, float)
+INSTANTIATE(af_add_t, uint, uint)
+INSTANTIATE(af_add_t, uint, float)
+INSTANTIATE(af_add_t, intl, intl)
+INSTANTIATE(af_add_t, intl, double)
+INSTANTIATE(af_add_t, uintl, uintl)
+INSTANTIATE(af_add_t, uintl, double)
+INSTANTIATE(af_add_t, char, int)
+INSTANTIATE(af_add_t, char, float)
+INSTANTIATE(af_add_t, schar, int)
+INSTANTIATE(af_add_t, schar, float)
+INSTANTIATE(af_add_t, uchar, uint)
+INSTANTIATE(af_add_t, uchar, float)
+INSTANTIATE(af_add_t, short, int)
+INSTANTIATE(af_add_t, short, float)
+INSTANTIATE(af_add_t, ushort, uint)
+INSTANTIATE(af_add_t, ushort, float)
+INSTANTIATE(af_add_t, half, half)
+INSTANTIATE(af_add_t, half, float)
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/surface.cpp b/src/backend/oneapi/surface.cpp
new file mode 100644
index 0000000000..ac50627938
--- /dev/null
+++ b/src/backend/oneapi/surface.cpp
@@ -0,0 +1,87 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+// #include <GraphicsResourceManager.hpp>
+// #include <debug_oneapi.hpp>
+#include <err_oneapi.hpp>
+#include <surface.hpp>
+
+using af::dim4;
+// using cl::Memory;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface) {
+    ONEAPI_NOT_SUPPORTED("copy_surface Not supported");
+    // ForgeModule &_ = common::forgePlugin();
+    // if (isGLSharingSupported()) {
+    //     CheckGL("Begin OpenCL resource copy");
+    //     const cl::Buffer *d_P = P.get();
+    //     unsigned bytes        = 0;
+    //     FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+    //     auto res = interopManager().getSurfaceResources(surface);
+
+    //     vector<Memory> shared_objects;
+    //     shared_objects.push_back(*(res[0].get()));
+
+    //     glFinish();
+
+    //     // Use of events:
+    //     //
+    //     // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+    //     cl::Event event;
+
+    //     getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+    //     event.wait();
+    //     getQueue().enqueueCopyBuffer(*d_P, *(res[0].get()), 0, 0, bytes,
+    //     NULL,
+    //                                  &event);
+    //     getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+    //     event.wait();
+
+    //     CL_DEBUG_FINISH(getQueue());
+    //     CheckGL("End OpenCL resource copy");
+    // } else {
+    //     unsigned bytes = 0, buffer = 0;
+    //     FG_CHECK(_.fg_get_surface_vertex_buffer(&buffer, surface));
+    //     FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+    //     CheckGL("Begin OpenCL fallback-resource copy");
+    //     glBindBuffer(GL_ARRAY_BUFFER, buffer);
+    //     auto *ptr =
+    //         static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER,
+    //         GL_WRITE_ONLY));
+    //     if (ptr) {
+    //         getQueue().enqueueReadBuffer(*P.get(), CL_TRUE, 0, bytes, ptr);
+    //         glUnmapBuffer(GL_ARRAY_BUFFER);
+    //     }
+    //     glBindBuffer(GL_ARRAY_BUFFER, 0);
+    //     CheckGL("End OpenCL fallback-resource copy");
+    // }
+}
+
+#define INSTANTIATE(T) \
+    template void copy_surface<T>(const Array<T> &, fg_surface);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/surface.hpp b/src/backend/oneapi/surface.hpp
new file mode 100644
index 0000000000..2d868301e0
--- /dev/null
+++ b/src/backend/oneapi/surface.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/susan.cpp b/src/backend/oneapi/susan.cpp
new file mode 100644
index 0000000000..b51acf13df
--- /dev/null
+++ b/src/backend/oneapi/susan.cpp
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+// #include <kernel/susan.hpp>
+#include <af/features.h>
+#include <algorithm>
+#include <cmath>
+
+using af::features;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out, Array<float> &resp_out,
+               const Array<T> &in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge) {
+    dim4 idims = in.dims();
+
+    const unsigned corner_lim = in.elements() * feature_ratio;
+    Array<float> x_corners    = createEmptyArray<float>({corner_lim});
+    Array<float> y_corners    = createEmptyArray<float>({corner_lim});
+    Array<float> resp_corners = createEmptyArray<float>({corner_lim});
+
+    // auto resp = memAlloc<float>(in.elements());
+
+    ONEAPI_NOT_SUPPORTED("");
+    return 0;
+
+    // kernel::susan<T>(resp.get(), in.get(), in.getOffset(), idims[0],
+    // idims[1],
+    //                  diff_thr, geom_thr, edge, radius);
+
+    // unsigned corners_found = kernel::nonMaximal<T>(
+    //     x_corners.get(), y_corners.get(), resp_corners.get(), idims[0],
+    //     idims[1], resp.get(), edge, corner_lim);
+
+    // const unsigned corners_out = std::min(corners_found, corner_lim);
+    // if (corners_out == 0) {
+    //     x_out    = createEmptyArray<float>(dim4());
+    //     y_out    = createEmptyArray<float>(dim4());
+    //     resp_out = createEmptyArray<float>(dim4());
+    // } else {
+    //     vector<af_seq> idx{{0., static_cast<double>(corners_out - 1.0), 1.}};
+    //     x_out    = createSubArray(x_corners, idx);
+    //     y_out    = createSubArray(y_corners, idx);
+    //     resp_out = createSubArray(resp_corners, idx);
+    // }
+    // return corners_out;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template unsigned susan<T>(                                               \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned radius, const float diff_thr,      \
+        const float geom_thr, const float feature_ratio, const unsigned edge);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/susan.hpp b/src/backend/oneapi/susan.hpp
new file mode 100644
index 0000000000..1a0c4ffe8c
--- /dev/null
+++ b/src/backend/oneapi/susan.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out,
+               Array<float> &score_out, const Array<T> &in,
+               const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/svd.cpp b/src/backend/oneapi/svd.cpp
new file mode 100644
index 0000000000..7255226e1b
--- /dev/null
+++ b/src/backend/oneapi/svd.cpp
@@ -0,0 +1,115 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <blas.hpp>
+#include <common/err_common.hpp>
+#include <copy.hpp>
+#include <err_oneapi.hpp>  // error-checking functions and macros
+#include <math.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <reduce.hpp>
+#include <svd.hpp>  // oneapi backend function header
+#include <transpose.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include "oneapi/mkl/lapack.hpp"
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    dim4 iDims = in.dims();
+    int64_t M  = iDims[0];
+    int64_t N  = iDims[1];
+
+    dim4 iStrides = in.strides();
+    dim4 uStrides = u.strides();
+    dim4 vStrides = vt.strides();
+    int64_t LDA   = iStrides[1];
+    int64_t LDU   = uStrides[1];
+    int64_t LDVt  = vStrides[1];
+
+    int64_t scratch_size =
+        ::oneapi::mkl::lapack::gesvd_scratchpad_size<compute_t<T>>(
+            getQueue(), ::oneapi::mkl::jobsvd::vectors,
+            ::oneapi::mkl::jobsvd::vectors, M, N, LDA, LDU, LDVt);
+
+    auto scratchpad = memAlloc<compute_t<T>>(scratch_size);
+
+    sycl::buffer<compute_t<T>> in_buffer =
+        in.template getBufferWithOffset<compute_t<T>>();
+
+    sycl::buffer<compute_t<Tr>> sBuf =
+        s.template getBufferWithOffset<compute_t<Tr>>();
+    sycl::buffer<compute_t<T>> uBuf =
+        u.template getBufferWithOffset<compute_t<T>>();
+    sycl::buffer<compute_t<T>> vtBuf =
+        vt.template getBufferWithOffset<compute_t<T>>();
+
+    ::oneapi::mkl::lapack::gesvd(getQueue(), ::oneapi::mkl::jobsvd::vectors,
+                                 ::oneapi::mkl::jobsvd::vectors, M, N,
+                                 in_buffer, LDA, sBuf, uBuf, LDU, vtBuf, LDVt,
+                                 *scratchpad, scratchpad->size());
+}
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    Array<T> in_copy = copyArray<T>(in);
+    svdInPlace(s, u, vt, in_copy);
+}
+
+#define INSTANTIATE(T, Tr)                                               \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    ONEAPI_NOT_SUPPORTED("");
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    ONEAPI_NOT_SUPPORTED("");
+    AF_ERROR("Linear Algebra is disabled on OneAPI", AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE(T, Tr)                                               \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace oneapi
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/oneapi/svd.hpp b/src/backend/oneapi/svd.hpp
new file mode 100644
index 0000000000..4b001d2ad0
--- /dev/null
+++ b/src/backend/oneapi/svd.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/tile.cpp b/src/backend/oneapi/tile.cpp
new file mode 100644
index 0000000000..928d0e2b19
--- /dev/null
+++ b/src/backend/oneapi/tile.cpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <err_oneapi.hpp>
+#include <kernel/tile.hpp>
+#include <tile.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims *= tileDims;
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::tile<T>(out, in);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/tile.hpp b/src/backend/oneapi/tile.hpp
new file mode 100644
index 0000000000..f11e2aa711
--- /dev/null
+++ b/src/backend/oneapi/tile.hpp
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/topk.cpp b/src/backend/oneapi/topk.cpp
new file mode 100644
index 0000000000..17a14ce810
--- /dev/null
+++ b/src/backend/oneapi/topk.cpp
@@ -0,0 +1,71 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <index.hpp>
+#include <sort.hpp>
+#include <sort_index.hpp>
+#include <types.hpp>
+
+#include <algorithm>
+#include <cmath>
+#include <numeric>
+#include <vector>
+
+using arrayfire::common::half;
+
+using std::iota;
+using std::min;
+using std::partial_sort_copy;
+using std::transform;
+using std::vector;
+
+namespace arrayfire {
+namespace oneapi {
+vector<af_index_t> indexForTopK(const int k) {
+    af_index_t idx;
+    idx.idx.seq = af_seq{0.0, static_cast<double>(k) - 1.0, 1.0};
+    idx.isSeq   = true;
+    idx.isBatch = false;
+
+    af_index_t sp;
+    sp.idx.seq = af_span;
+    sp.isSeq   = true;
+    sp.isBatch = false;
+
+    return vector<af_index_t>({idx, sp, sp, sp});
+}
+
+template<typename T>
+void topk(Array<T>& vals, Array<unsigned>& idxs, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order) {
+    auto values  = createEmptyArray<T>(in.dims());
+    auto indices = createEmptyArray<unsigned>(in.dims());
+    sort_index(values, indices, in, dim, order & AF_TOPK_MIN);
+    auto indVec = indexForTopK(k);
+    vals        = index<T>(values, indVec.data());
+    idxs        = index<unsigned>(indices, indVec.data());
+}
+
+#define INSTANTIATE(T)                                                  \
+    template void topk<T>(Array<T>&, Array<unsigned>&, const Array<T>&, \
+                          const int, const int, const af::topkFunction);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/topk.hpp b/src/backend/oneapi/topk.hpp
new file mode 100644
index 0000000000..fa816b9ca7
--- /dev/null
+++ b/src/backend/oneapi/topk.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void topk(Array<T>& keys, Array<unsigned>& vals, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/traits.hpp b/src/backend/oneapi/traits.hpp
new file mode 100644
index 0000000000..57e1949082
--- /dev/null
+++ b/src/backend/oneapi/traits.hpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/defines.hpp>
+#include <common/traits.hpp>
+#include <types.hpp>
+
+#include <sstream>
+#include <string>
+
+namespace af {
+
+template<typename T>
+static bool iscplx() {
+    return false;
+}
+template<>
+inline bool iscplx<arrayfire::oneapi::cfloat>() {
+    return true;
+}
+template<>
+inline bool iscplx<arrayfire::oneapi::cdouble>() {
+    return true;
+}
+
+template<typename T>
+inline std::string scalar_to_option(const T &val) {
+    using namespace arrayfire::common;
+    using namespace std;
+    return to_string(+val);
+}
+
+template<>
+inline std::string scalar_to_option<cl_float2>(const cl_float2 &val) {
+    std::ostringstream ss;
+    ss << val.s[0] << "," << val.s[1];
+    return ss.str();
+}
+
+template<>
+inline std::string scalar_to_option<cl_double2>(const cl_double2 &val) {
+    std::ostringstream ss;
+    ss << val.s[0] << "," << val.s[1];
+    return ss.str();
+}
+}  // namespace af
+
+using af::dtype_traits;
diff --git a/src/backend/oneapi/transform.cpp b/src/backend/oneapi/transform.cpp
new file mode 100644
index 0000000000..00edc15817
--- /dev/null
+++ b/src/backend/oneapi/transform.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <transform.hpp>
+
+#include <copy.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/transform.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective) {
+    // TODO: Temporary Fix, must fix handling subarrays upstream
+    // tf has to be linear, although offset is allowed.
+    const Array<float> tf_Lin = tf.isLinear() ? tf : copyArray(tf);
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 1);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 2);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 3);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
+    }
+}
+
+#define INSTANTIATE(T)                                                       \
+    template void transform(Array<T> &out, const Array<T> &in,               \
+                            const Array<float> &tf,                          \
+                            const af_interp_type method, const bool inverse, \
+                            const bool perspective);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/transform.hpp b/src/backend/oneapi/transform.hpp
new file mode 100644
index 0000000000..ea62f261b0
--- /dev/null
+++ b/src/backend/oneapi/transform.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/transpose.cpp b/src/backend/oneapi/transpose.cpp
new file mode 100644
index 0000000000..1f41e96cde
--- /dev/null
+++ b/src/backend/oneapi/transpose.cpp
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <err_oneapi.hpp>
+#include <kernel/transpose.hpp>
+#include <transpose.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> transpose(const Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+    dim4 outDims       = dim4(inDims[1], inDims[0], inDims[2], inDims[3]);
+    Array<T> out       = createEmptyArray<T>(outDims);
+
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+    kernel::transpose<T>(out, in, conjugate, is32multiple);
+
+    return out;
+}
+
+#define INSTANTIATE(T) \
+    template Array<T> transpose(const Array<T> &in, const bool conjugate);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/transpose.hpp b/src/backend/oneapi/transpose.hpp
new file mode 100644
index 0000000000..88ca4abce0
--- /dev/null
+++ b/src/backend/oneapi/transpose.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> transpose(const Array<T> &in, const bool conjugate);
+
+template<typename T>
+void transpose_inplace(Array<T> &in, const bool conjugate);
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/transpose_inplace.cpp b/src/backend/oneapi/transpose_inplace.cpp
new file mode 100644
index 0000000000..013027f780
--- /dev/null
+++ b/src/backend/oneapi/transpose_inplace.cpp
@@ -0,0 +1,52 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/transpose_inplace.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void transpose_inplace(Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+
+    kernel::transpose_inplace<T>(in, conjugate, is32multiple);
+}
+
+#define INSTANTIATE(T) \
+    template void transpose_inplace(Array<T> &in, const bool conjugate);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/triangle.cpp b/src/backend/oneapi/triangle.cpp
new file mode 100644
index 0000000000..c8ab5e2b16
--- /dev/null
+++ b/src/backend/oneapi/triangle.cpp
@@ -0,0 +1,59 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <kernel/triangle.hpp>
+
+#include <err_oneapi.hpp>
+#include <triangle.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag) {
+    kernel::triangle<T>(out, in, is_upper, is_unit_diag);
+}
+
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    triangle<T>(out, in, is_upper, is_unit_diag);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                  \
+    template void triangle<T>(Array<T> &, const Array<T> &, const bool, \
+                              const bool);                              \
+    template Array<T> triangle<T>(const Array<T> &, const bool, const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/triangle.hpp b/src/backend/oneapi/triangle.hpp
new file mode 100644
index 0000000000..d56a26c126
--- /dev/null
+++ b/src/backend/oneapi/triangle.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag);
+
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/types.hpp b/src/backend/oneapi/types.hpp
new file mode 100644
index 0000000000..395687396c
--- /dev/null
+++ b/src/backend/oneapi/types.hpp
@@ -0,0 +1,177 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/kernel_type.hpp>
+#include <common/traits.hpp>
+#include <af/compilers.h>
+#include <af/traits.hpp>
+
+#include <sycl/sycl.hpp>
+
+#include <algorithm>
+#include <array>
+#include <complex>
+#include <string>
+
+namespace arrayfire {
+namespace common {
+/// This is a CPU-based half type whose values need to be converted to float
+/// before they are used
+template<>
+struct kernel_type<common::half> {
+    using data = sycl::half;
+
+    // These are the types within a kernel
+    using native = sycl::half;
+
+    using compute = sycl::half;
+};
+}  // namespace common
+}  // namespace arrayfire
+
+namespace arrayfire {
+
+namespace oneapi {
+using cdouble = std::complex<double>;
+using cfloat  = std::complex<float>;
+using intl    = long long;
+using schar   = signed char;
+using uchar   = unsigned char;
+using uint    = unsigned int;
+using uintl   = unsigned long long;
+using ushort  = unsigned short;
+
+template<typename T>
+using compute_t = typename common::kernel_type<T>::compute;
+
+template<typename T>
+using data_t = typename common::kernel_type<T>::data;
+
+template<typename T>
+struct ToNumStr {
+    std::string operator()(T val);
+    template<typename CONVERSION_TYPE>
+    std::string operator()(CONVERSION_TYPE val);
+};
+
+template<typename T>
+inline const char *shortname(bool caps = false) {
+    return caps ? "X" : "x";
+}
+
+template<>
+inline const char *shortname<float>(bool caps) {
+    return caps ? "S" : "s";
+}
+template<>
+inline const char *shortname<double>(bool caps) {
+    return caps ? "D" : "d";
+}
+template<>
+inline const char *shortname<cfloat>(bool caps) {
+    return caps ? "C" : "c";
+}
+template<>
+inline const char *shortname<cdouble>(bool caps) {
+    return caps ? "Z" : "z";
+}
+template<>
+inline const char *shortname<int>(bool caps) {
+    return caps ? "I" : "i";
+}
+template<>
+inline const char *shortname<uint>(bool caps) {
+    return caps ? "U" : "u";
+}
+template<>
+inline const char *shortname<char>(bool caps) {
+    return caps ? "J" : "j";
+}
+template<>
+inline const char *shortname<schar>(bool caps) {
+    return caps ? "A" : "a"; // TODO
+}
+template<>
+inline const char *shortname<uchar>(bool caps) {
+    return caps ? "V" : "v";
+}
+template<>
+inline const char *shortname<intl>(bool caps) {
+    return caps ? "L" : "l";
+}
+template<>
+inline const char *shortname<uintl>(bool caps) {
+    return caps ? "K" : "k";
+}
+template<>
+inline const char *shortname<short>(bool caps) {
+    return caps ? "P" : "p";
+}
+template<>
+inline const char *shortname<ushort>(bool caps) {
+    return caps ? "Q" : "q";
+}
+
+template<typename T>
+inline const char *getFullName() {
+    return af::dtype_traits<T>::getName();
+}
+
+template<>
+inline const char *getFullName<schar>() {
+    return "signed char";
+}
+
+template<>
+inline const char *getFullName<cfloat>() {
+    return "float2";
+}
+
+template<>
+inline const char *getFullName<cdouble>() {
+    return "double2";
+}
+
+#if 0
+template<typename... ARGS>
+AF_CONSTEXPR const char *getTypeBuildDefinition() {
+    using arrayfire::common::half;
+    using std::any_of;
+    using std::array;
+    using std::begin;
+    using std::end;
+    using std::is_same;
+    array<bool, sizeof...(ARGS)> is_half    = {is_same<ARGS, half>::value...};
+    array<bool, sizeof...(ARGS)> is_double  = {is_same<ARGS, double>::value...};
+    array<bool, sizeof...(ARGS)> is_cdouble = {
+        is_same<ARGS, cdouble>::value...};
+
+    bool half_def =
+        any_of(begin(is_half), end(is_half), [](bool val) { return val; });
+    bool double_def =
+        any_of(begin(is_double), end(is_double), [](bool val) { return val; });
+    bool cdouble_def = any_of(begin(is_cdouble), end(is_cdouble),
+                              [](bool val) { return val; });
+
+    if (half_def && (double_def || cdouble_def)) {
+        return " -D USE_HALF -D USE_DOUBLE";
+    } else if (half_def) {
+        return " -D USE_HALF";
+    } else if (double_def || cdouble_def) {
+        return " -D USE_DOUBLE";
+    } else {
+        return "";
+    }
+}
+#endif
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/unary.hpp b/src/backend/oneapi/unary.hpp
new file mode 100644
index 0000000000..2c9ccf54ce
--- /dev/null
+++ b/src/backend/oneapi/unary.hpp
@@ -0,0 +1,113 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+#include <common/jit/UnaryNode.hpp>
+#include <math.hpp>
+#include <optypes.hpp>
+#include <af/traits.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<af_op_t op>
+static const char *unaryName();
+
+#define UNARY_DECL(OP, FNAME)                     \
+    template<>                                    \
+    inline const char *unaryName<af_##OP##_t>() { \
+        return FNAME;                             \
+    }
+
+#define UNARY_FN(OP) UNARY_DECL(OP, #OP)
+
+UNARY_FN(sin)
+UNARY_FN(cos)
+UNARY_FN(tan)
+
+UNARY_FN(asin)
+UNARY_FN(acos)
+UNARY_FN(atan)
+
+UNARY_FN(sinh)
+UNARY_FN(cosh)
+UNARY_FN(tanh)
+
+UNARY_FN(asinh)
+UNARY_FN(acosh)
+UNARY_FN(atanh)
+
+UNARY_FN(exp)
+UNARY_DECL(sigmoid, "__sigmoid")
+UNARY_FN(expm1)
+UNARY_FN(erf)
+UNARY_FN(erfc)
+
+UNARY_FN(tgamma)
+UNARY_FN(lgamma)
+
+UNARY_FN(log)
+UNARY_FN(log1p)
+UNARY_FN(log10)
+UNARY_FN(log2)
+
+UNARY_FN(sqrt)
+UNARY_FN(rsqrt)
+UNARY_FN(cbrt)
+
+UNARY_FN(trunc)
+UNARY_FN(round)
+UNARY_FN(signbit)
+UNARY_FN(ceil)
+UNARY_FN(floor)
+
+UNARY_FN(isinf)
+UNARY_FN(isnan)
+UNARY_FN(iszero)
+UNARY_DECL(noop, "__noop")
+
+UNARY_DECL(bitnot, "__bitnot")
+
+#undef UNARY_FN
+
+template<typename T, af_op_t op>
+Array<T> unaryOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node;
+    using arrayfire::common::Node_ptr;
+    using std::array;
+
+    auto createUnary = [](array<Node_ptr, 1> &operands) {
+        return common::Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(af::dtype_traits<T>::af_type),
+            unaryName<op>(), operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<T>(outDim, out);
+}
+
+template<typename T, af_op_t op>
+Array<char> checkOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node_ptr;
+
+    auto createUnary = [](std::array<Node_ptr, 1> &operands) {
+        return Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(af::dtype_traits<char>::af_type),
+            unaryName<op>(), operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<char>(outDim, out);
+}
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/unwrap.cpp b/src/backend/oneapi/unwrap.cpp
new file mode 100644
index 0000000000..bfc95e0f18
--- /dev/null
+++ b/src/backend/oneapi/unwrap.cpp
@@ -0,0 +1,65 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/unwrap.hpp>
+#include <unwrap.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+
+    dim_t nx = 1 + (idims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (idims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    af::dim4 odims(wx * wy, nx * ny, idims[2], idims[3]);
+
+    if (!is_column) { std::swap(odims[0], odims[1]); }
+
+    Array<T> outArray = createEmptyArray<T>(odims);
+    kernel::unwrap<T>(outArray, in, wx, wy, sx, sy, px, py, dx, dy, nx,
+                      is_column);
+
+    return outArray;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> unwrap<T>(                                            \
+        const Array<T> &in, const dim_t wx, const dim_t wy, const dim_t sx, \
+        const dim_t sy, const dim_t px, const dim_t py, const dim_t dx,     \
+        const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace oneapi
+}  // namespace arrayfire
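Review note on `unwrap.cpp`: the `nx` / `ny` expressions above count how many window positions fit along an axis once padding, stride, and dilation are taken into account. The arithmetic can be checked on its own with the minimal sketch below. It is not ArrayFire code: `windowCount` is a hypothetical helper name, and plain `long long` stands in for `dim_t`.

```cpp
// Hypothetical helper mirroring the nx/ny expression in unwrap():
// number of window positions along one axis for input length `len`,
// window size `w`, stride `s`, padding `p`, and dilation `d`.
long long windowCount(long long len, long long w, long long s, long long p,
                      long long d) {
    // A dilated window spans (w - 1) * d + 1 input elements.
    const long long extent = ((w - 1) * d) + 1;
    // Padding is applied on both sides; integer division drops any partial
    // step at the end of the axis.
    return 1 + (len + 2 * p - extent) / s;
}
```

For a 5-element axis with a 3-wide window, stride 1, and no padding, `windowCount(5, 3, 1, 0, 1)` gives 3 positions. With dilation 2 the window spans all 5 elements, so `windowCount(5, 3, 1, 0, 2)` gives 1.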
diff --git a/src/backend/oneapi/unwrap.hpp b/src/backend/oneapi/unwrap.hpp
new file mode 100644
index 0000000000..9977e99af4
--- /dev/null
+++ b/src/backend/oneapi/unwrap.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/vector_field.cpp b/src/backend/oneapi/vector_field.cpp
new file mode 100644
index 0000000000..d67fa73c51
--- /dev/null
+++ b/src/backend/oneapi/vector_field.cpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <err_oneapi.hpp>
+#include <vector_field.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield) {}
+
+#define INSTANTIATE(T)                                                     \
+    template void copy_vector_field<T>(const Array<T> &, const Array<T> &, \
+                                       fg_vector_field);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/vector_field.hpp b/src/backend/oneapi/vector_field.hpp
new file mode 100644
index 0000000000..b6bf83a52e
--- /dev/null
+++ b/src/backend/oneapi/vector_field.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/where.cpp b/src/backend/oneapi/where.cpp
new file mode 100644
index 0000000000..fd08b975b8
--- /dev/null
+++ b/src/backend/oneapi/where.cpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_oneapi.hpp>
+#include <kernel/where.hpp>
+#include <where.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+Array<uint> where(const Array<T> &in) {
+    Param<uint> Out;
+    Param<T> In = in;
+    kernel::where<T>(Out, In);
+    return createParamArray<uint>(Out, true);
+}
+
+#define INSTANTIATE(T) template Array<uint> where<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/where.hpp b/src/backend/oneapi/where.hpp
new file mode 100644
index 0000000000..e4b1b0b87f
--- /dev/null
+++ b/src/backend/oneapi/where.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+template<typename T>
+Array<uint> where(const Array<T>& in);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/wrap.cpp b/src/backend/oneapi/wrap.cpp
new file mode 100644
index 0000000000..21c47ac007
--- /dev/null
+++ b/src/backend/oneapi/wrap.cpp
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <wrap.hpp>
+
+#include <kernel/wrap.hpp>
+#include <kernel/wrap_dilated.hpp>
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <err_oneapi.hpp>
+#include <math.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    kernel::wrap<T>(out, in, wx, wy, sx, sy, px, py, is_column);
+}
+
+#define INSTANTIATE(T)                                                        \
+    template void wrap<T>(Array<T> & out, const Array<T> &in, const dim_t wx, \
+                          const dim_t wy, const dim_t sx, const dim_t sy,     \
+                          const dim_t px, const dim_t py,                     \
+                          const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+    af::dim4 odims(ox, oy, idims[2], idims[3]);
+    Array<T> out = createValueArray<T>(odims, scalar<T>(0));
+
+    kernel::wrap_dilated<T>(out, in, wx, wy, sx, sy, px, py, dx, dy, is_column);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> wrap_dilated<T>(                                      \
+        const Array<T> &in, const dim_t ox, const dim_t oy, const dim_t wx, \
+        const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,     \
+        const dim_t py, const dim_t dx, const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/oneapi/wrap.hpp b/src/backend/oneapi/wrap.hpp
new file mode 100644
index 0000000000..245632cbca
--- /dev/null
+++ b/src/backend/oneapi/wrap.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace oneapi {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column);
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace oneapi
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Array.cpp b/src/backend/opencl/Array.cpp
index fcaec5c673..38fbfc4d84 100644
--- a/src/backend/opencl/Array.cpp
+++ b/src/backend/opencl/Array.cpp
@@ -7,318 +7,592 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
+
+#include <common/Logger.hpp>
+#include <common/half.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/jit/ScalarNode.hpp>
+#include <common/util.hpp>
 #include <copy.hpp>
-#include <scalar.hpp>
-#include <JIT/BufferNode.hpp>
 #include <err_opencl.hpp>
+#include <jit/BufferNode.hpp>
 #include <memory.hpp>
 #include <platform.hpp>
+#include <scalar.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
+#include <af/opencl.h>
+
+#include <cstddef>
+#include <cstdlib>
+#include <memory>
+#include <numeric>
+
+#include <cstdio>
+#include <cstdlib>
+#include <iostream>
+
+#include <vector>
 
 using af::dim4;
+using af::dtype_traits;
+
+using cl::Buffer;
+
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::opencl::jit::BufferNode;
+
+using nonstd::span;
+using std::accumulate;
+using std::is_standard_layout;
+using std::make_shared;
+using std::shared_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+shared_ptr<BufferNode> bufferNodePtr() {
+    return make_shared<BufferNode>(
+        static_cast<af::dtype>(dtype_traits<T>::af_type));
+}
 
-namespace opencl
-{
-
-    const int MAX_JIT_LEN = 20;
-    using JIT::BufferNode;
-    using JIT::Node;
-    using JIT::Node_ptr;
-
-    template<typename T>
-    Array<T>::Array(af::dim4 dims) :
-        ArrayInfo(getActiveDeviceId(), dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(bufferAlloc(ArrayInfo::elements() * sizeof(T)), bufferFree),
-        data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {
+namespace {
+template<typename T>
+void verifyTypeSupport() {}
+
+template<>
+void verifyTypeSupport<double>() {
+    if (!isDoubleSupported(getActiveDeviceId())) {
+        AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
     }
+}
 
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, JIT::Node_ptr n) :
-        ArrayInfo(-1, dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(),
-        data_dims(dims),
-        node(n), ready(false), offset(0), owner(true)
-    {
+template<>
+void verifyTypeSupport<cdouble>() {
+    if (!isDoubleSupported(getActiveDeviceId())) {
+        AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
     }
+}
 
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, const T * const in_data) :
-        ArrayInfo(getActiveDeviceId(), dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(bufferAlloc(ArrayInfo::elements()*sizeof(T)), bufferFree),
-        data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {
-        getQueue().enqueueWriteBuffer(*data.get(), CL_TRUE, 0, sizeof(T)*ArrayInfo::elements(), in_data);
+template<>
+void verifyTypeSupport<common::half>() {
+    if (!isHalfSupported(getActiveDeviceId())) {
+        AF_ERROR("Half precision not supported", AF_ERR_NO_HALF);
+    }
+}
+}  // namespace
+
+template<typename T>
+Array<T>::Array(const dim4 &dims)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(memAlloc<T>(info.elements()).release(), bufferFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, Node_ptr n)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data_dims(dims)
+    , node(std::move(n))
+    , owner(true) {
+    if (node->isBuffer()) {
+        data = std::static_pointer_cast<BufferNode>(node)->getDataPointer();
     }
+}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, const T *const in_data)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(memAlloc<T>(info.elements()).release(), bufferFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    static_assert(is_standard_layout<Array<T>>::value,
+                  "Array<T> must be a standard layout type");
+    static_assert(std::is_nothrow_move_assignable<Array<T>>::value,
+                  "Array<T> is not move assignable");
+    static_assert(std::is_nothrow_move_constructible<Array<T>>::value,
+                  "Array<T> is not move constructible");
+    static_assert(
+        offsetof(Array<T>, info) == 0,
+        "Array<T>::info must be the first member variable of Array<T>");
+    getQueue().enqueueWriteBuffer(*data.get(), CL_TRUE, 0,
+                                  sizeof(T) * info.elements(), in_data);
+}
 
-    template<typename T>
-    Array<T>::Array(af::dim4 dims, cl_mem mem) :
-        ArrayInfo(getActiveDeviceId(), dims, af::dim4(0,0,0,0), calcStrides(dims), (af_dtype)dtype_traits<T>::af_type),
-        data(new cl::Buffer(mem), bufferFree),
-        data_dims(dims),
-        node(), ready(true), offset(0), owner(true)
-    {
+template<typename T>
+Array<T>::Array(const dim4 &dims, cl_mem mem, size_t src_offset, bool copy)
+    : info(getActiveDeviceId(), dims, 0, calcStrides(dims),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(
+          copy ? memAlloc<T>(info.elements()).release() : new Buffer(mem, true),
+          bufferFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (copy) {
+        clRetainMemObject(mem);
+        Buffer src_buf = Buffer(mem);
+        getQueue().enqueueCopyBuffer(src_buf, *data.get(), src_offset, 0,
+                                     sizeof(T) * info.elements());
     }
+}
 
-    template<typename T>
-    Array<T>::Array(const Array<T>& parent, const dim4 &dims, const dim4 &offsets, const dim4 &stride) :
-        ArrayInfo(parent.getDevId(), dims, offsets, stride, (af_dtype)dtype_traits<T>::af_type),
-        data(parent.getData()),
-        data_dims(parent.getDataDims()),
-        node(),
-        ready(true),
-        offset(parent.getOffset() + calcOffset(parent.strides(), offsets)),
-        owner(false)
-    { }
-
-
-    template<typename T>
-    Array<T>::Array(Param &tmp) :
-        ArrayInfo(getActiveDeviceId(), af::dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2], tmp.info.dims[3]),
-                  af::dim4(0, 0, 0, 0),
-                  af::dim4(tmp.info.strides[0], tmp.info.strides[1],
-                           tmp.info.strides[2], tmp.info.strides[3]),
-                  (af_dtype)dtype_traits<T>::af_type),
-        data(tmp.data, bufferFree),
-        data_dims(af::dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2], tmp.info.dims[3])),
-        node(), ready(true), offset(0), owner(true)
-    {
+template<typename T>
+Array<T>::Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset_,
+                const dim4 &stride)
+    : info(parent.getDevId(), dims, offset_, stride,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(parent.getData())
+    , data_dims(parent.getDataDims())
+    , node()
+    , owner(false) {}
+
+template<typename T>
+Array<T>::Array(Param &tmp, bool owner_)
+    : info(getActiveDeviceId(),
+           dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2],
+                tmp.info.dims[3]),
+           0,
+           dim4(tmp.info.strides[0], tmp.info.strides[1], tmp.info.strides[2],
+                tmp.info.strides[3]),
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(
+          tmp.data, owner_ ? bufferFree : [](Buffer * /*unused*/) {})
+    , data_dims(dim4(tmp.info.dims[0], tmp.info.dims[1], tmp.info.dims[2],
+                     tmp.info.dims[3]))
+    , node()
+    , owner(owner_) {}
+
+template<typename T>
+Array<T>::Array(const dim4 &dims, const dim4 &strides, dim_t offset_,
+                const T *const in_data, bool is_device)
+    : info(getActiveDeviceId(), dims, offset_, strides,
+           static_cast<af_dtype>(dtype_traits<T>::af_type))
+    , data(
+          is_device
+              ? (new Buffer(reinterpret_cast<cl_mem>(const_cast<T *>(in_data))))
+              : (memAlloc<T>(info.elements()).release()),
+          bufferFree)
+    , data_dims(dims)
+    , node()
+    , owner(true) {
+    if (!is_device) {
+        getQueue().enqueueWriteBuffer(*data.get(), CL_TRUE, 0,
+                                      sizeof(T) * info.total(), in_data);
     }
+}
 
+template<typename T>
+void checkAndMigrate(Array<T> &arr) {
+    int arr_id = arr.getDevId();
+    int cur_id = detail::getActiveDeviceId();
+    if (!isDeviceBufferAccessible(arr_id, cur_id)) {
+        auto getLogger = [&] { return spdlog::get("platform"); };
+        AF_TRACE("Migrating array from {} to {}.", arr_id, cur_id);
+        auto migrated_data           = memAlloc<T>(arr.elements());
+        void *mapped_migrated_buffer = getQueue().enqueueMapBuffer(
+            *migrated_data, CL_TRUE, CL_MAP_WRITE_INVALIDATE_REGION, 0,
+            sizeof(T) * arr.elements());
+        setDevice(arr_id);
+        Buffer &buf = *arr.get();
+        getQueue().enqueueReadBuffer(buf, CL_TRUE, 0,
+                                     sizeof(T) * arr.elements(),
+                                     mapped_migrated_buffer);
+        setDevice(cur_id);
+        getQueue().enqueueUnmapMemObject(*migrated_data,
+                                         mapped_migrated_buffer);
+        arr.data.reset(migrated_data.release(), bufferFree);
+        arr.setId(cur_id);
+    }
+}
 
-    template<typename T>
-    void Array<T>::eval()
-    {
-        if (isReady()) return;
+template<typename T>
+void Array<T>::eval() {
+    if (isReady()) { return; }
 
-        this->setId(getActiveDeviceId());
-        data = Buffer_ptr(bufferAlloc(elements() * sizeof(T)), bufferFree);
+    this->setId(getActiveDeviceId());
+    data = std::shared_ptr<cl::Buffer>(memAlloc<T>(info.elements()).release(),
+                                       bufferFree);
 
-        // Do not replace this with cast operator
-        KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
-                       {strides()[0], strides()[1], strides()[2], strides()[3]},
-                       0};
+    // Do not replace this with cast operator
+    KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
+                   {strides()[0], strides()[1], strides()[2], strides()[3]},
+                   0};
 
-        Param res = {data.get(), info};
+    Param res = {data.get(), info};
 
-        evalNodes(res, this->getNode().get());
-        ready = true;
+    evalNodes(res, getNode().get());
+    node.reset();
+}
 
-        Node_ptr prev = node;
-        prev->resetFlags();
-        // FIXME: Replace the current node in any JIT possible trees with the new BufferNode
-        node.reset();
+template<typename T>
+void Array<T>::eval() const {
+    const_cast<Array<T> *>(this)->eval();
+}
+
+template<typename T>
+Buffer *Array<T>::device() {
+    if (!isOwner() || getOffset() || data.use_count() > 1) {
+        *this = copyArray<T>(*this);
     }
+    return this->get();
+}
 
-    template<typename T>
-    void Array<T>::eval() const
-    {
-        if (isReady()) return;
-        const_cast<Array<T> *>(this)->eval();
+template<typename T>
+void evalMultiple(vector<Array<T> *> arrays) {
+    vector<Param> outputs;
+    vector<Array<T> *> output_arrays;
+    vector<Node *> nodes;
+
+    // Check if all the arrays have the same dimensions
+    auto it = std::adjacent_find(begin(arrays), end(arrays),
+                                 [](const Array<T> *l, const Array<T> *r) {
+                                     return l->dims() != r->dims();
+                                 });
+
+    // If they are not the same, evaluate each array individually
+    if (it != end(arrays)) {
+        for (auto ptr : arrays) { ptr->eval(); }
+        return;
     }
 
-    template<typename T>
-    Array<T>::~Array()
-    { }
-
-    template<typename T>
-    Node_ptr Array<T>::getNode() const
-    {
-        if (!node) {
-            bool is_linear = isLinear();
-            unsigned bytes = this->getDataDims().elements() * sizeof(T);
-            BufferNode *buf_node = new BufferNode(dtype_traits<T>::getName(),
-                                                  shortname<T>(true),
-                                                  *this, data,
-                                                  bytes,
-                                                  is_linear);
-            const_cast<Array<T> *>(this)->node = Node_ptr(reinterpret_cast<Node *>(buf_node));
-        }
+    for (Array<T> *array : arrays) {
+        if (array->isReady()) { continue; }
+
+        const ArrayInfo info = array->info;
 
-        return node;
+        array->setId(getActiveDeviceId());
+        array->data = std::shared_ptr<cl::Buffer>(
+            memAlloc<T>(info.elements()).release(), bufferFree);
+
+        // Do not replace this with cast operator
+        KParam kInfo = {
+            {info.dims()[0], info.dims()[1], info.dims()[2], info.dims()[3]},
+            {info.strides()[0], info.strides()[1], info.strides()[2],
+             info.strides()[3]},
+            0};
+
+        outputs.emplace_back(array->data.get(), kInfo);
+        output_arrays.push_back(array);
+        nodes.push_back(array->getNode().get());
     }
 
-    using af::dim4;
+    evalNodes(outputs, nodes);
 
-    template<typename T>
-    Array<T> createNodeArray(const dim4 &dims, Node_ptr node)
-    {
-        verifyDoubleSupport<T>();
+    for (Array<T> *array : output_arrays) { array->node.reset(); }
+}
 
-        Array<T> out =  Array<T>(dims, node);
+template<typename T>
+Node_ptr Array<T>::getNode() {
+    if (node) { return node; }
 
-        unsigned length =0, buf_count = 0, bytes = 0;
+    KParam kinfo   = *this;
+    unsigned bytes = this->dims().elements() * sizeof(T);
+    auto nn        = bufferNodePtr<T>();
+    nn->setData(kinfo, data, bytes, isLinear());
 
-        Node *n = node.get();
-        n->getInfo(length, buf_count, bytes);
-        n->resetFlags();
+    return nn;
+}
 
-        if (length > MAX_JIT_LEN ||
-            buf_count >= MAX_BUFFERS ||
-            bytes >= MAX_BYTES) {
-            out.eval();
-        }
+template<typename T>
+Node_ptr Array<T>::getNode() const {
+    return const_cast<Array<T> *>(this)->getNode();
+}
 
-        return out;
+/// This function should be called after a new JIT node is created. It will
+/// return true if the newly created node will generate a valid kernel. If
+/// false the node will fail to compile or the node and its referenced buffers
+/// are consuming too many resources. If false, the node's child nodes should
+/// be evaluated before continuing.
+///
+/// We eval in the following cases:
+///
+/// 1. Too many bytes are locked up by JIT causing memory
+///    pressure. Too many bytes is assumed to be half of all bytes
+///    allocated so far.
+///
+/// 2. The number of parameters we are passing into the kernel exceeds the
+///    limitation on the platform. For NVIDIA this is 4096 bytes. The
+template<typename T>
+kJITHeuristics passesJitHeuristics(span<Node *> root_nodes) {
+    if (!evalFlag()) { return kJITHeuristics::Pass; }
+    static auto getLogger = [&] { return common::loggerFactory("jit"); };
+    for (const Node *n : root_nodes) {
+        if (n->getHeight() > static_cast<int>(getMaxJitSize())) {
+            AF_TRACE(
+                "JIT tree evaluated because of tree height exceeds limit: {} > "
+                "{}",
+                n->getHeight(), getMaxJitSize());
+            return kJITHeuristics::TreeHeight;
+        }
     }
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy)
-    {
-        parent.eval();
+    bool isBufferLimit = getMemoryPressure() >= getMemoryPressureThreshold();
+    auto platform      = getActivePlatformVendor();
+
+    // The Apple platform can have either an NVIDIA card or an AMD card
+    bool isIntel = platform == AFCL_PLATFORM_INTEL;
+
+    /// Intel's param_size limit is much smaller than on the other platforms,
+    /// so we need to start checking earlier with smaller trees
+    int heightCheckLimit =
+        isIntel && getDeviceType() == CL_DEVICE_TYPE_GPU ? 3 : 6;
+
+    // A lightweight check based on the height of the node. This is
+    // an inexpensive operation and does not traverse the JIT tree.
+    bool atHeightLimit =
+        std::any_of(std::begin(root_nodes), std::end(root_nodes),
+                    [heightCheckLimit](Node *n) {
+                        return (n->getHeight() + 1 >= heightCheckLimit);
+                    });
+
+    if (atHeightLimit || isBufferLimit) {
+        // This is the base parameter size if the kernel had no
+        // arguments
+        size_t base_param_size =
+            (sizeof(T *) + sizeof(KParam)) * root_nodes.size() +
+            (3 * sizeof(uint));
+
+        const cl::Device &device = getDevice();
+        // typical values:
+        //   NVIDIA     = 4096
+        //   AMD        = 3520  (AMD A10 iGPU = 1024)
+        //   Intel iGPU = 1024
+        //
+        // Setting the maximum to 5120 bytes keeps compile times reasonable.
+        // This still results in large kernels, but they are not excessive.
+        size_t max_param_size =
+            min(static_cast<cl::size_type>(5120),
+                device.getInfo<CL_DEVICE_MAX_PARAMETER_SIZE>());
+        max_param_size -= base_param_size;
+
+        struct tree_info {
+            size_t total_buffer_size;
+            size_t num_buffers;
+            size_t param_scalar_size;
+        };
+
+        tree_info info{0, 0, 0};
+        for (Node *n : root_nodes) {
+            NodeIterator<> it(n);
+            info = accumulate(
+                it, NodeIterator<>(), info, [](tree_info &prev, Node &n) {
+                    if (n.isBuffer()) {
+                        auto &buf_node = static_cast<BufferNode &>(n);
+                        // getBytes returns the size of the data Array.
+                        // Sub arrays will be represented by their parent
+                        // size.
+                        prev.total_buffer_size += buf_node.getBytes();
+                        prev.num_buffers++;
+                    } else {
+                        prev.param_scalar_size += n.getParamBytes();
+                    }
+                    return prev;
+                });
+        }
+        isBufferLimit = jitTreeExceedsMemoryPressure(info.total_buffer_size);
 
-        dim4 dDims = parent.getDataDims();
-        dim4 pDims = parent.dims();
+        size_t param_size = (info.num_buffers * (sizeof(KParam) + sizeof(T *)) +
+                             info.param_scalar_size);
 
-        dim4 dims   = toDims  (index, pDims);
-        dim4 offset = toOffset(index, dDims);
-        dim4 stride = toStride (index, dDims);
+        bool isParamLimit = param_size >= max_param_size;
 
-        Array<T> out = Array<T>(parent, dims, offset, stride);
+        if (isParamLimit) {
+            AF_TRACE(
+                "JIT tree evaluated because of kernel parameter size: {} >= {}",
+                param_size, max_param_size);
+            return kJITHeuristics::KernelParameterSize;
+        }
+        if (isBufferLimit) {
+            AF_TRACE("JIT tree evaluated because of memory pressure: {}",
+                     info.total_buffer_size);
+            return kJITHeuristics::MemoryPressure;
+        }
+    }
+    return kJITHeuristics::Pass;
+}
 
-        if (!copy) return out;
+template<typename T>
+void *getDevicePtr(const Array<T> &arr) {
+    const cl::Buffer *buf = arr.device();
+    if (!buf) { return NULL; }
+    memLock(buf);
+    cl_mem mem = (*buf)();
+    return (void *)mem;
+}
 
-        if (stride[0] != 1 ||
-            stride[1] <  0 ||
-            stride[2] <  0 ||
-            stride[3] <  0) {
+template<typename T>
+Array<T> createNodeArray(const dim4 &dims, Node_ptr node) {
+    verifyTypeSupport<T>();
+    Array<T> out = Array<T>(dims, node);
+    return out;
+}
 
-            out = copyArray(out);
-        }
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent, const vector<af_seq> &index,
+                        bool copy) {
+    parent.eval();
 
-        return out;
-    }
+    dim4 dDims          = parent.getDataDims();
+    dim4 parent_strides = parent.strides();
 
-    template<typename T>
-    Array<T>
-    createHostDataArray(const dim4 &size, const T * const data)
-    {
-        verifyDoubleSupport<T>();
-        return Array<T>(size, data);
+    if (parent.isLinear() == false) {
+        const Array<T> parentCopy = copyArray(parent);
+        return createSubArray(parentCopy, index, copy);
     }
 
-    template<typename T>
-    Array<T>
-    createDeviceDataArray(const dim4 &size, const void *data)
-    {
-        verifyDoubleSupport<T>();
+    const dim4 &pDims = parent.dims();
 
-        return Array<T>(size, (cl_mem)(data));
-    }
+    dim4 dims    = toDims(index, pDims);
+    dim4 strides = toStride(index, dDims);
 
-    template<typename T>
-    Array<T>
-    createValueArray(const dim4 &size, const T& value)
-    {
-        verifyDoubleSupport<T>();
-        return createScalarNode<T>(size, value);
-    }
+    // Find total offsets after indexing
+    dim4 offsets = toOffset(index, pDims);
+    dim_t offset = parent.getOffset();
+    for (int i = 0; i < 4; i++) { offset += offsets[i] * parent_strides[i]; }
 
-    template<typename T>
-    Array<T>
-    createEmptyArray(const dim4 &size)
-    {
-        verifyDoubleSupport<T>();
-        return Array<T>(size);
-    }
+    Array<T> out = Array<T>(parent, dims, offset, strides);
 
-    template<typename T>
-    Array<T> *initArray()
-    {
-        return new Array<T>(dim4());
-    }
+    if (!copy) { return out; }
 
-    template<typename T>
-    Array<T>
-    createParamArray(Param &tmp)
-    {
-        verifyDoubleSupport<T>();
-        return Array<T>(tmp);
+    if (strides[0] != 1 || strides[1] < 0 || strides[2] < 0 || strides[3] < 0) {
+        out = copyArray(out);
     }
 
-    template<typename T>
-    void
-    destroyArray(Array<T> *A)
-    {
-        delete A;
-    }
+    return out;
+}
 
-    template<typename T>
-    void evalArray(const Array<T> &A)
-    {
-        A.eval();
-    }
+template<typename T>
+Array<T> createHostDataArray(const dim4 &dims, const T *const data) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims, data);
+}
 
-    template<typename T>
-    void
-    writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes)
-    {
-        if (!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
+template<typename T>
+Array<T> createDeviceDataArray(const dim4 &dims, void *data, bool copy) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims, static_cast<cl_mem>(data), 0, copy);
+}
 
-        getQueue().enqueueWriteBuffer(*arr.get(), CL_TRUE,
-                                      arr.getOffset(),
-                                      bytes,
-                                      data);
+template<typename T>
+Array<T> createValueArray(const dim4 &dims, const T &value) {
+    verifyTypeSupport<T>();
+    return createScalarNode<T>(dims, value);
+}
 
-        return;
-    }
+template<typename T>
+Array<T> createEmptyArray(const dim4 &dims) {
+    verifyTypeSupport<T>();
+    return Array<T>(dims);
+}
 
-    template<typename T>
-    void
-    writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes)
-    {
-        if (!arr.isOwner()) {
-            arr = createEmptyArray<T>(arr.dims());
-        }
+template<typename T>
+Array<T> createParamArray(Param &tmp, bool owner) {
+    verifyTypeSupport<T>();
+    return Array<T>(tmp, owner);
+}
 
-        cl::Buffer& buf = *arr.get();
+template<typename T>
+void destroyArray(Array<T> *A) {
+    delete A;
+}
 
-        clRetainMemObject((cl_mem)(data));
-        cl::Buffer data_buf = cl::Buffer((cl_mem)(data));
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data,
+                        const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
 
-        getQueue().enqueueCopyBuffer(data_buf, buf,
-                                     0, (size_t)arr.getOffset(),
-                                     bytes);
+    getQueue().enqueueWriteBuffer(*arr.get(), CL_TRUE, arr.getOffset(), bytes,
+                                  data);
+}
 
-        return;
-    }
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes) {
+    if (!arr.isOwner()) { arr = copyArray<T>(arr); }
+
+    Buffer &buf = *arr.get();
 
+    clRetainMemObject(reinterpret_cast<cl_mem>(const_cast<void *>(data)));
+    Buffer data_buf =
+        Buffer(reinterpret_cast<cl_mem>(const_cast<void *>(data)));
 
-#define INSTANTIATE(T)                                                  \
-    template       Array<T>  createHostDataArray<T>   (const dim4 &size, const T * const data); \
-    template       Array<T>  createDeviceDataArray<T> (const dim4 &size, const void *data); \
-    template       Array<T>  createValueArray<T>      (const dim4 &size, const T &value); \
-    template       Array<T>  createEmptyArray<T>      (const dim4 &size); \
-    template       Array<T>  *initArray<T      >      ();               \
-    template       Array<T>  createParamArray<T>      (Param &tmp);  \
-    template       Array<T>  createSubArray<T>        (const Array<T> &parent, \
-                                                       const std::vector<af_seq> &index, \
-                                                       bool copy);      \
-    template       void      destroyArray<T>          (Array<T> *A);    \
-    template       void      evalArray<T>             (const Array<T> &A); \
-    template       Array<T>  createNodeArray<T>       (const dim4 &size, JIT::Node_ptr node); \
-    template       Array<T>::~Array        ();                          \
-    template       void Array<T>::eval();                               \
-    template       void Array<T>::eval() const;                         \
-    template       void      writeHostDataArray<T>    (Array<T> &arr, const T * const data, const size_t bytes); \
-    template       void      writeDeviceDataArray<T>  (Array<T> &arr, const void * const data, const size_t bytes); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+    getQueue().enqueueCopyBuffer(data_buf, buf, 0,
+                                 static_cast<size_t>(arr.getOffset()), bytes);
+}
+
+template<typename T>
+void Array<T>::setDataDims(const dim4 &new_dims) {
+    data_dims = new_dims;
+    modDims(new_dims);
+}
 
+template<typename T>
+size_t Array<T>::getAllocatedBytes() const {
+    if (!isReady()) { return 0; }
+    size_t bytes = memoryManager().allocated(data.get());
+    // External device pointer
+    if (bytes == 0 && data.get()) { return data_dims.elements() * sizeof(T); }
+    return bytes;
 }
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> createHostDataArray<T>(const dim4 &dims,                \
+                                             const T *const data);            \
+    template Array<T> createDeviceDataArray<T>(const dim4 &dims, void *data,  \
+                                               bool copy);                    \
+    template Array<T> createValueArray<T>(const dim4 &dims, const T &value);  \
+    template Array<T> createEmptyArray<T>(const dim4 &dims);                  \
+    template Array<T> createParamArray<T>(Param & tmp, bool owner);           \
+    template Array<T> createSubArray<T>(                                      \
+        const Array<T> &parent, const vector<af_seq> &index, bool copy);      \
+    template void destroyArray<T>(Array<T> * A);                              \
+    template Array<T> createNodeArray<T>(const dim4 &dims, Node_ptr node);    \
+    template Array<T>::Array(const dim4 &dims, const dim4 &strides,           \
+                             dim_t offset, const T *const in_data,            \
+                             bool is_device);                                 \
+    template Array<T>::Array(const dim4 &dims, cl_mem mem, size_t src_offset, \
+                             bool copy);                                      \
+    template Node_ptr Array<T>::getNode();                                    \
+    template Node_ptr Array<T>::getNode() const;                              \
+    template void Array<T>::eval();                                           \
+    template void Array<T>::eval() const;                                     \
+    template Buffer *Array<T>::device();                                      \
+    template void writeHostDataArray<T>(Array<T> & arr, const T *const data,  \
+                                        const size_t bytes);                  \
+    template void writeDeviceDataArray<T>(                                    \
+        Array<T> & arr, const void *const data, const size_t bytes);          \
+    template void evalMultiple<T>(vector<Array<T> *> arrays);                 \
+    template kJITHeuristics passesJitHeuristics<T>(span<Node *> node);        \
+    template void *getDevicePtr<T>(const Array<T> &arr);                      \
+    template void Array<T>::setDataDims(const dim4 &new_dims);                \
+    template size_t Array<T>::getAllocatedBytes() const;                      \
+    template void checkAndMigrate<T>(Array<T> & arr);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
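The parameter-size and memory-pressure checks in passesJitHeuristics above can be sketched in isolation. This is a minimal standalone sketch, not the backend's implementation: KParamSketch, TreeInfo, Verdict, checkTree and the memory_limit argument are hypothetical stand-ins (memory_limit replaces the call to jitTreeExceedsMemoryPressure, whose exact comparison is not shown in this diff), and the sketch clamps the budget at zero where the diff subtracts unconditionally.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the backend's KParam (dims, strides, offset).
struct KParamSketch {
    std::int64_t dims[4];
    std::int64_t strides[4];
    std::int64_t offset;
};

// Hypothetical subset of the outcomes reported by the heuristic.
enum class Verdict { Pass, KernelParameterSize, MemoryPressure };

// Totals gathered while walking the JIT tree.
struct TreeInfo {
    std::size_t total_buffer_size;  // bytes referenced by buffer nodes
    std::size_t num_buffers;        // number of buffer nodes
    std::size_t param_scalar_size;  // bytes of scalar kernel parameters
};

// device_max_param: value reported for CL_DEVICE_MAX_PARAMETER_SIZE.
// memory_limit: assumed byte threshold for the memory-pressure test.
Verdict checkTree(const TreeInfo &info, std::size_t num_outputs,
                  std::size_t device_max_param, std::size_t memory_limit) {
    assert(num_outputs > 0);  // assumption: at least one root node
    const std::size_t per_buffer = sizeof(KParamSketch) + sizeof(void *);

    // Parameters every kernel carries: one pointer and KParam per output
    // plus three unsigned values.
    const std::size_t base = per_buffer * num_outputs + 3 * sizeof(unsigned);

    // Cap the device limit at 5120 bytes, then subtract the fixed part.
    const std::size_t cap = std::min<std::size_t>(5120, device_max_param);
    const std::size_t budget = cap > base ? cap - base : 0;

    const std::size_t param_size =
        info.num_buffers * per_buffer + info.param_scalar_size;

    // Same order as the diff: parameter size first, then memory pressure.
    if (param_size >= budget) { return Verdict::KernelParameterSize; }
    if (info.total_buffer_size >= memory_limit) {
        return Verdict::MemoryPressure;
    }
    return Verdict::Pass;
}
```

With a device limit of 1024 bytes, a tree referencing 20 buffers exceeds the budget and would be evaluated early, while the same tree fits within a 4096-byte limit.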
diff --git a/src/backend/opencl/Array.hpp b/src/backend/opencl/Array.hpp
index 6372a6e466..05b0468333 100644
--- a/src/backend/opencl/Array.hpp
+++ b/src/backend/opencl/Array.hpp
@@ -8,166 +8,330 @@
  ********************************************************/
 
 #pragma once
+
+#include <Param.hpp>
+#include <backend.hpp>
+#include <common/ArrayInfo.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/jit/Node.hpp>
+#include <err_opencl.hpp>
+#include <jit/BufferNode.hpp>
+#include <memory.hpp>
 #include <platform.hpp>
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <ArrayInfo.hpp>
 #include <traits.hpp>
-#include <backend.hpp>
 #include <types.hpp>
-#include <traits.hpp>
-#include <Param.hpp>
-#include <JIT/Node.hpp>
-#include <memory.hpp>
+
+#include <af/dim4.hpp>
+
+#include <nonstd/span.hpp>
+#include <algorithm>
+#include <cstdlib>
 #include <memory>
+#include <vector>
 
-namespace opencl
-{
-    using af::dim4;
-    typedef std::shared_ptr<cl::Buffer> Buffer_ptr;
+namespace common {
+template<typename T>
+class SparseArray;
+}
 
-    template<typename T> class Array;
+namespace arrayfire {
+namespace opencl {
+typedef std::shared_ptr<cl::Buffer> Buffer_ptr;
+using af::dim4;
+template<typename T>
+class Array;
+
+/// Checks if the Array object can be migrated to the current device and if not,
+/// an error is thrown
+///
+/// \param[in] arr The Array that will be checked.
+template<typename T>
+void checkAndMigrate(Array<T> &arr);
+
+template<typename T>
+void evalMultiple(std::vector<Array<T> *> arrays);
+
+void evalNodes(Param &out, common::Node *node);
+void evalNodes(std::vector<Param> &outputs,
+               const std::vector<common::Node *> &nodes);
+
+/// Creates a new Array object backed by a JIT node and returns it by value.
+template<typename T>
+Array<T> createNodeArray(const af::dim4 &dims, common::Node_ptr node);
+
+/// Creates a new Array object where every element is \p value.
+template<typename T>
+Array<T> createValueArray(const af::dim4 &dims, const T &value);
+
+/// Creates a new Array object from host data and returns it by value.
+template<typename T>
+Array<T> createHostDataArray(const af::dim4 &dims, const T *const data);
+
+/// Creates an Array<T> object from a device pointer.
+///
+/// \param[in] dims The shape of the resulting Array.
+/// \param[in] data The device pointer to the data
+/// \param[in] copy If true, memory will be allocated and the data will be
+///                 copied into it. If false, the data will be used
+///                 directly.
+/// \returns The new Array<T> object based on the device pointer.
+template<typename T>
+Array<T> createDeviceDataArray(const af::dim4 &dims, void *data,
+                               bool copy = false);
+
+template<typename T>
+Array<T> createStridedArray(const af::dim4 &dims, const af::dim4 &strides,
+                            dim_t offset, const T *const in_data,
+                            bool is_device) {
+    return Array<T>(dims, strides, offset, in_data, is_device);
+}
 
-    void evalNodes(Param &out, JIT::Node *node);
+/// Copies data to an existing Array object from a host pointer
+template<typename T>
+void writeHostDataArray(Array<T> &arr, const T *const data, const size_t bytes);
+
+/// Copies data to an existing Array object from a device pointer
+template<typename T>
+void writeDeviceDataArray(Array<T> &arr, const void *const data,
+                          const size_t bytes);
+
+/// Creates an empty array of a given size. No data is initialized
+///
+/// \param[in] dims The dimensions of the output array
+template<typename T>
+Array<T> createEmptyArray(const af::dim4 &dims);
+
+/// Creates an Array object from a Param object.
+///
+/// \param[in] tmp   The Param object that the Array is created from.
+/// \param[in] owner If true, the new Array<T> object is the owner of the
+///                  data. If false, the Array<T> will not delete the object
+///                  on destruction.
+template<typename T>
+Array<T> createParamArray(Param &tmp, bool owner);
+
+template<typename T>
+Array<T> createSubArray(const Array<T> &parent,
+                        const std::vector<af_seq> &index, bool copy = true);
+
+/// Deletes an Array object that was allocated on the heap.
+template<typename T>
+void destroyArray(Array<T> *A);
+
+/// \brief Checks if the Node can be compiled successfully and its buffer
+///        references are not consuming most of the allocated memory
+///
+/// \param [in] node The root nodes which need to be checked
+///
+/// \returns kJITHeuristics::Pass if the checks succeed; otherwise the reason
+///          the kernel would fail to compile or use too much memory.
+template<typename T>
+kJITHeuristics passesJitHeuristics(nonstd::span<common::Node *> node);
+
+template<typename T>
+void *getDevicePtr(const Array<T> &arr);
+
+template<typename T>
+void *getRawPtr(const Array<T> &arr) {
+    const cl::Buffer *buf = arr.get();
+    if (!buf) return NULL;
+    cl_mem mem = (*buf)();
+    return (void *)mem;
+}
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createNodeArray(const af::dim4 &size, JIT::Node_ptr node);
+template<typename T>
+using mapped_ptr = std::unique_ptr<T, std::function<void(void *)>>;
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createValueArray(const af::dim4 &size, const T& value);
+template<typename T>
+class Array {
+    ArrayInfo info;  // This must be the first element of Array<T>
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    Array<T> createHostDataArray(const af::dim4 &size, const T * const data);
+    /// Pointer to the data
+    std::shared_ptr<cl::Buffer> data;
 
-    template<typename T>
-    Array<T> createDeviceDataArray(const af::dim4 &size, const void *data);
+    /// The shape of the underlying parent data.
+    af::dim4 data_dims;
 
-    // Copies data to an existing Array object from a host pointer
-    template<typename T>
-    void writeHostDataArray(Array<T> &arr, const T * const data, const size_t bytes);
+    /// Null if this is a buffer node. Otherwise this points to a JIT node
+    common::Node_ptr node;
 
-    // Copies data to an existing Array object from a device pointer
-    template<typename T>
-    void writeDeviceDataArray(Array<T> &arr, const void * const data, const size_t bytes);
+    /// If true, the Array object is the parent. If false the data object points
+    /// to another array's data
+    bool owner;
 
-    // Create an Array object and do not assign any values to it
-    template<typename T> Array<T> *initArray();
+    Array(const af::dim4 &dims);
 
-    template<typename T>
-    Array<T> createEmptyArray(const af::dim4 &size);
+    Array(const Array<T> &parent, const dim4 &dims, const dim_t &offset,
+          const dim4 &stride);
+    Array(Param &tmp, bool owner);
+    explicit Array(const af::dim4 &dims, common::Node_ptr n);
+    explicit Array(const af::dim4 &dims, const T *const in_data);
+    explicit Array(const af::dim4 &dims, cl_mem mem, size_t offset, bool copy);
 
-    // Create an Array object from Param
-    template<typename T>
-    Array<T> createParamArray(Param &tmp);
+    std::shared_ptr<cl::Buffer> getData() const { return data; }
 
-    template<typename T>
-    Array<T> createSubArray(const Array<T>& parent,
-                            const std::vector<af_seq> &index,
-                            bool copy=true);
+   public:
+    Array(const Array<T> &other) = default;
 
-    template<typename T>
-    void evalArray(const Array<T> &A);
+    Array(Array<T> &&other) noexcept = default;
 
-    // Creates a new Array object on the heap and returns a reference to it.
-    template<typename T>
-    void destroyArray(Array<T> *A);
+    Array<T> &operator=(Array<T> other) noexcept {
+        swap(other);
+        return *this;
+    }
 
-    template<typename T>
-    void *getDevicePtr(const Array<T>& arr)
-    {
-        memUnlink((T *)arr.get());
-        return (void *)((*arr.get())());
+    void swap(Array<T> &other) noexcept {
+        using std::swap;
+        swap(info, other.info);
+        swap(data, other.data);
+        swap(data_dims, other.data_dims);
+        swap(node, other.node);
+        swap(owner, other.owner);
     }
 
-    template<typename T>
-    class Array : public ArrayInfo
-    {
-        Buffer_ptr  data;
-        af::dim4 data_dims;
-
-        JIT::Node_ptr node;
-        bool ready;
-        dim_t offset;
-        bool owner;
-
-        Array(af::dim4 dims);
-        Array(const Array<T>& parnt, const dim4 &dims, const dim4 &offset, const dim4 &stride);
-        Array(Param &tmp);
-        explicit Array(af::dim4 dims, JIT::Node_ptr n);
-        explicit Array(af::dim4 dims, const T * const in_data);
-        explicit Array(af::dim4 dims, cl_mem mem);
-
-    public:
-
-        ~Array();
-
-        bool isReady() const { return ready; }
-        bool isOwner() const { return owner; }
-
-        void eval();
-        void eval() const;
-
-        //FIXME: This should do a copy if it is not owner. You do not want to overwrite parents data
-        cl::Buffer *get()
-        {
-            if (!isReady()) eval();
-            return data.get();
-        }
-
-        const cl::Buffer *get() const
-        {
-            if (!isReady()) eval();
-            return data.get();
-        }
-
-        const dim_t getOffset() const
-        {
-            return offset;
-        }
-
-        Buffer_ptr getData() const
-        {
-            return data;
-        }
-
-        dim4 getDataDims() const
-        {
-            // This is for moddims
-            // dims and data_dims are different when moddims is used
-            return isOwner() ? dims() : data_dims;
-        }
-
-        operator Param() const
-        {
-            KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
-                           {strides()[0], strides()[1], strides()[2], strides()[3]},
-                           getOffset()};
-
-            Param out = {(cl::Buffer *)this->get(), info};
-            return out;
-        }
-
-        JIT::Node_ptr getNode() const;
-
-        friend Array<T> createValueArray<T>(const af::dim4 &size, const T& value);
-        friend Array<T> createHostDataArray<T>(const af::dim4 &size, const T * const data);
-        friend Array<T> createDeviceDataArray<T>(const af::dim4 &size, const void *data);
-
-        friend Array<T> *initArray<T>();
-        friend Array<T> createEmptyArray<T>(const af::dim4 &size);
-        friend Array<T> createParamArray<T>(Param &tmp);
-        friend Array<T> createNodeArray<T>(const af::dim4 &dims, JIT::Node_ptr node);
-
-        friend Array<T> createSubArray<T>(const Array<T>& parent,
-                                          const std::vector<af_seq> &index,
-                                          bool copy);
-
-        friend void destroyArray<T>(Array<T> *arr);
-        friend void evalArray<T>(const Array<T> &arr);
-        friend void *getDevicePtr<T>(const Array<T>& arr);
-    };
+    Array(const af::dim4 &dims, const af::dim4 &strides, dim_t offset,
+          const T *const in_data, bool is_device = false);
+    void resetInfo(const af::dim4 &dims) { info.resetInfo(dims); }
+    void resetDims(const af::dim4 &dims) { info.resetDims(dims); }
+    void modDims(const af::dim4 &newDims) { info.modDims(newDims); }
+    void modStrides(const af::dim4 &newStrides) { info.modStrides(newStrides); }
+    void setId(int id) { info.setId(id); }
+
+#define INFO_FUNC(RET_TYPE, NAME) \
+    RET_TYPE NAME() const { return info.NAME(); }
+
+    INFO_FUNC(const af_dtype &, getType)
+    INFO_FUNC(const af::dim4 &, strides)
+    INFO_FUNC(dim_t, elements)
+    INFO_FUNC(dim_t, ndims)
+    INFO_FUNC(const af::dim4 &, dims)
+    INFO_FUNC(int, getDevId)
+
+#undef INFO_FUNC
+
+#define INFO_IS_FUNC(NAME) \
+    bool NAME() const { return info.NAME(); }
+
+    INFO_IS_FUNC(isEmpty);
+    INFO_IS_FUNC(isScalar);
+    INFO_IS_FUNC(isRow);
+    INFO_IS_FUNC(isColumn);
+    INFO_IS_FUNC(isVector);
+    INFO_IS_FUNC(isComplex);
+    INFO_IS_FUNC(isReal);
+    INFO_IS_FUNC(isDouble);
+    INFO_IS_FUNC(isSingle);
+    INFO_IS_FUNC(isHalf);
+    INFO_IS_FUNC(isRealFloating);
+    INFO_IS_FUNC(isFloating);
+    INFO_IS_FUNC(isInteger);
+    INFO_IS_FUNC(isBool);
+    INFO_IS_FUNC(isLinear);
+    INFO_IS_FUNC(isSparse);
+
+#undef INFO_IS_FUNC
+    ~Array() = default;
+
+    bool isReady() const { return static_cast<bool>(node) == false; }
+    bool isOwner() const { return owner; }
+
+    void eval();
+    void eval() const;
+
+    cl::Buffer *device();
+    cl::Buffer *device() const {
+        return const_cast<Array<T> *>(this)->device();
+    }
 
-}
+    // FIXME: This should do a copy if it is not the owner. You do not want to
+    // overwrite the parent's data
+    cl::Buffer *get() {
+        if (!isReady()) eval();
+        return data.get();
+    }
+
+    const cl::Buffer *get() const {
+        if (!isReady()) eval();
+        return data.get();
+    }
+
+    int useCount() const { return data.use_count(); }
+
+    dim_t getOffset() const { return info.getOffset(); }
+
+    dim4 getDataDims() const { return data_dims; }
+
+    void setDataDims(const dim4 &new_dims);
+
+    size_t getAllocatedBytes() const;
+
+    operator Param() const {
+        KParam info = {{dims()[0], dims()[1], dims()[2], dims()[3]},
+                       {strides()[0], strides()[1], strides()[2], strides()[3]},
+                       getOffset()};
+
+        Param out{(cl::Buffer *)this->get(), info};
+        return out;
+    }
+
+    operator KParam() const {
+        KParam kinfo = {
+            {dims()[0], dims()[1], dims()[2], dims()[3]},
+            {strides()[0], strides()[1], strides()[2], strides()[3]},
+            getOffset()};
+
+        return kinfo;
+    }
+
+    common::Node_ptr getNode() const;
+    common::Node_ptr getNode();
+
+   public:
+    mapped_ptr<T> getMappedPtr(cl_map_flags map_flags = CL_MAP_READ |
+                                                        CL_MAP_WRITE) const {
+        if (!isReady()) eval();
+        auto func = [data = data](void *ptr) {
+            if (ptr != nullptr) {
+                cl_int err = getQueue().enqueueUnmapMemObject(*data, ptr);
+                UNUSED(err);
+                ptr = nullptr;
+            }
+        };
+
+        T *ptr = (T *)getQueue().enqueueMapBuffer(
+            *static_cast<const cl::Buffer *>(get()), CL_TRUE, map_flags,
+            getOffset() * sizeof(T), elements() * sizeof(T), nullptr, nullptr,
+            nullptr);
+
+        return mapped_ptr<T>(ptr, func);
+    }
+
+    friend void evalMultiple<T>(std::vector<Array<T> *> arrays);
+
+    friend Array<T> createValueArray<T>(const af::dim4 &dims, const T &value);
+    friend Array<T> createHostDataArray<T>(const af::dim4 &dims,
+                                           const T *const data);
+    friend Array<T> createDeviceDataArray<T>(const af::dim4 &dims, void *data,
+                                             bool copy);
+    friend Array<T> createStridedArray<T>(const af::dim4 &dims,
+                                          const af::dim4 &strides, dim_t offset,
+                                          const T *const in_data,
+                                          bool is_device);
+
+    friend Array<T> createEmptyArray<T>(const af::dim4 &dims);
+    friend Array<T> createParamArray<T>(Param &tmp, bool owner);
+    friend Array<T> createNodeArray<T>(const af::dim4 &dims,
+                                       common::Node_ptr node);
+
+    friend Array<T> createSubArray<T>(const Array<T> &parent,
+                                      const std::vector<af_seq> &index,
+                                      bool copy);
+
+    friend void destroyArray<T>(Array<T> *arr);
+    friend void *getDevicePtr<T>(const Array<T> &arr);
+    friend void *getRawPtr<T>(const Array<T> &arr);
+    friend void checkAndMigrate<T>(Array<T> &arr);
+};
+
+}  // namespace opencl
+}  // namespace arrayfire
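The mapped_ptr alias used by getMappedPtr above pairs a std::unique_ptr with a std::function deleter, so the unmap step runs automatically when the pointer goes out of scope, and the deleter's captured shared_ptr keeps the buffer alive until then. The sketch below shows that pattern without any OpenCL calls. FakeBuffer and mapBuffer are hypothetical stand-ins, and the counter increment takes the place of enqueueUnmapMemObject.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Same alias shape as in the header: the deleter is a type-erased callable.
template<typename T>
using mapped_ptr = std::unique_ptr<T, std::function<void(void *)>>;

// Hypothetical stand-in for a device buffer that can be mapped to the host.
struct FakeBuffer {
    int storage[4]  = {1, 2, 3, 4};
    int unmap_calls = 0;  // counts how often the "unmap" step ran
};

// Returns a pointer into the buffer whose deleter performs the unmap step.
// The lambda copies the shared_ptr, so the buffer outlives the mapping.
mapped_ptr<int> mapBuffer(std::shared_ptr<FakeBuffer> buf) {
    assert(buf != nullptr);
    auto unmap = [buf](void *ptr) {
        if (ptr != nullptr) { ++buf->unmap_calls; }
    };
    return mapped_ptr<int>(buf->storage, unmap);
}
```

A caller holds the returned mapped_ptr only for as long as host access is needed. Leaving the scope triggers the deleter exactly once.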
diff --git a/src/backend/opencl/CMakeLists.txt b/src/backend/opencl/CMakeLists.txt
index 88bb4f17d1..23bedeedab 100644
--- a/src/backend/opencl/CMakeLists.txt
+++ b/src/backend/opencl/CMakeLists.txt
@@ -1,239 +1,675 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-PROJECT(ARRAYFIRE)
-
-FIND_PACKAGE(OpenCL REQUIRED)
-
-INCLUDE("${CMAKE_MODULE_PATH}/CLKernelToH.cmake")
-
-IF(USE_OPENCL_F77_BLAS)
-    MESSAGE("Using F77 BLAS")
-    ADD_DEFINITIONS(-DUSE_F77_BLAS)
-ENDIF()
-
-IF(USE_OPENCL_MKL)
-    MESSAGE("Using MKL")
-    ADD_DEFINITIONS(-DUSE_MKL)
-ENDIF()
-
-IF(APPLE)
-    FIND_PACKAGE(LAPACK)
-ELSE(APPLE) # Linux and Windows
-    FIND_PACKAGE(LAPACKE)
-ENDIF(APPLE)
-
-IF(NOT LAPACK_FOUND)
-    MESSAGE(WARNING "LAPACK not found. Functionality will be disabled")
-ELSE(NOT LAPACK_FOUND)
-    ADD_DEFINITIONS(-DLAPACK_${BLA_VENDOR})
-    ADD_DEFINITIONS(-DWITH_OPENCL_LINEAR_ALGEBRA)
-ENDIF()
-
-IF(NOT UNIX)
-    ADD_DEFINITIONS(-DAFDLL)
-ENDIF()
-
-ADD_DEFINITIONS(-DAF_OPENCL
-                -D__CL_ENABLE_EXCEPTIONS)
-
-OPTION(USE_SYSTEM_CLBLAS "Use system clBLAS" OFF)
-IF(USE_SYSTEM_CLBLAS)
-    FIND_PACKAGE(clBLAS REQUIRED)
-ELSE()
-    INCLUDE("${CMAKE_MODULE_PATH}/build_clBLAS.cmake")
-ENDIF()
-INCLUDE_DIRECTORIES(${CLBLAS_INCLUDE_DIRS})
-LINK_DIRECTORIES(${CLBLAS_LIBRARY_DIR})
-
-OPTION(USE_SYSTEM_CLFFT "Use system clFFT" OFF)
-IF(USE_SYSTEM_CLFFT)
-    FIND_PACKAGE(clFFT REQUIRED)
-ELSE()
-    INCLUDE("${CMAKE_MODULE_PATH}/build_clFFT.cmake")
-ENDIF()
-INCLUDE_DIRECTORIES(${CLFFT_INCLUDE_DIRS})
-LINK_DIRECTORIES(${CLFFT_LIBRARY_DIR})
-
-ADD_DEFINITIONS( -DBOOST_ALL_NO_LIB )
-SET(Boost_USE_STATIC_LIBS OFF)
-FIND_PACKAGE(Boost 1.48 REQUIRED)
-
-OPTION(USE_SYSTEM_BOOST_COMPUTE "Use system BoostCompute" OFF)
-IF(USE_SYSTEM_BOOST_COMPUTE)
-    FIND_PACKAGE(BoostCompute REQUIRED)
-ELSE()
-    INCLUDE("${CMAKE_MODULE_PATH}/build_boost_compute.cmake")
-ENDIF()
-
-SET( cl_kernel_headers
-    "kernel_headers")
-
-INCLUDE_DIRECTORIES(
-    ${CMAKE_INCLUDE_PATH}
-    "${CMAKE_SOURCE_DIR}/src/backend/opencl"
-    ${OpenCL_INCLUDE_DIRS}
-    "${CMAKE_CURRENT_BINARY_DIR}"
-    ${CLBLAS_INCLUDE_DIRS}
-    ${CLFFT_INCLUDE_DIRS}
-    ${Boost_INCLUDE_DIR}
-    ${BoostCompute_INCLUDE_DIRS}
-    ${LAPACK_INCLUDE_DIR}
-    )
-
-FILE(GLOB opencl_headers
-  "*.hpp"
-  "*.h")
-
-FILE(GLOB opencl_sources
-    "*.cpp")
-
-FILE(GLOB magma_sources
-    "magma/*.cpp")
-
-FILE(GLOB magma_headers
-    "magma/*.h")
-
-FILE(GLOB jit_sources
-    "jit/*.hpp")
-
-FILE(GLOB kernel_headers
-    "kernel/*.hpp")
-
-FILE(GLOB opencl_kernels
-    "kernel/*.cl")
-
-FILE(GLOB kernel_sources
-     "kernel/*.cpp")
-
-FILE(GLOB conv_ker_headers
-    "kernel/convolve/*.hpp")
-
-FILE(GLOB conv_ker_sources
-     "kernel/convolve/*.cpp")
-
-source_group(backend\\opencl\\Headers FILES ${opencl_headers})
-source_group(backend\\opencl\\Sources FILES ${opencl_sources})
-source_group(backend\\opencl\\JIT FILES ${jit_sources})
-source_group(backend\\opencl\\kernel\\Headers FILES ${kernel_headers})
-source_group(backend\\opencl\\kernel\\cl FILES ${opencl_kernels})
-source_group(backend\\opencl\\kernel\\Sources FILES ${kernel_sources})
-source_group(backend\\opencl\\kernel\\convolve\\Headers FILES ${conv_ker_headers})
-source_group(backend\\opencl\\kernel\\convolve\\Sources FILES ${conv_ker_sources})
-source_group(backend\\opencl\\magma\\Sources FILES ${magma_sources})
-source_group(backend\\opencl\\magma\\Headers FILES ${magma_headers})
-
-FILE(GLOB backend_headers
-    "../*.hpp"
-    "../*.h"
-    )
-
-FILE(GLOB backend_sources
-    "../*.cpp"
-    )
-source_group(backend\\Headers FILES ${backend_headers})
-source_group(backend\\Sources FILES ${backend_sources})
-
-FILE(GLOB c_headers
-    "../../api/c/*.hpp"
-    "../../api/c/*.h"
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+dependency_check(OpenCL_FOUND "OpenCL not found.")
+
+# OpenCL back end needs to use MKL LP64 interface
+set(MKL_INTERFACE_INTEGER_SIZE 4)
+set(MKL_INTERFACE "lp64")
+
+include(InternalUtils)
+include(build_cl2hpp)
+include(build_CLBlast)
+include(build_clFFT)
+include(FileToString)
+
+generate_product_version(af_opencl_ver_res_file
+  FILE_NAME "afopencl"
+  FILE_DESCRIPTION "OpenCL Backend Dynamic-link library"
+)
+
+set(kernel_src
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/KParam.hpp
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/anisotropic_diffusion.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/approx1.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/approx2.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/assign.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/bilateral.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/convolve_separable.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/coo2dense.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/copy.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/cscmm.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/cscmv.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/csr2coo.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/csr2dense.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/csrmm.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/csrmv.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/dense2csr.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/diag_create.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/diag_extract.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/diff.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/example.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/fast.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/fftconvolve_multiply.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/fftconvolve_pack.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/fftconvolve_reorder.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/flood_fill.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/gradient.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/harris.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/histogram.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/homography.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/hsv_rgb.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/identity.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/iir.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/index.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/interp.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/iops.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/iota.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/ireduce_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/ireduce_first.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/jit.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/laset_band.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/laset.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/laswp.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/lookup.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/lu_split.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/matchTemplate.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/mean_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/mean_first.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/mean_ops.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/meanshift.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/medfilt1.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/medfilt2.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/memcopy.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/moments.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/morph.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/nearest_neighbour.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/nonmax_suppression.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/ops.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/orb.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/pad_array_borders.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/random_engine_mersenne.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/random_engine_mersenne_init.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/random_engine_philox.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/random_engine_threefry.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/random_engine_write.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/range.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_all.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_blocks_by_key_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_blocks_by_key_first.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_by_key_boundary.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_by_key_boundary_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_by_key_compact.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_by_key_compact_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_by_key_needs_reduction.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reduce_first.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/regions.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/reorder.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/resize.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/rotate.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_dim_by_key.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_dim.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_first_by_key.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_first.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/select.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sift_nonfree.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sobel.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sparse_arith_common.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sparse_arith_coo.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sparse_arith_csr.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/sp_sp_arith_csr.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/ssarith_calc_out_nnz.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/susan.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/swapdblk.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/tile.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/trace_edge.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transform.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transpose.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/transpose_inplace.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/triangle.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/unwrap.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/where.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/wrap.cl
+  ${CMAKE_CURRENT_SOURCE_DIR}/kernel/wrap_dilated.cl
+)
+
+set(kernel_headers_dir "kernel_headers")
+
+file_to_string(
+    SOURCES ${kernel_src}
+    VARNAME kernel_files
+    EXTENSION "hpp"
+    OUTPUT_DIR ${kernel_headers_dir}
+    TARGETS cl_kernel_targets
+    NAMESPACE "arrayfire opencl"
     )
 
-FILE(GLOB c_sources
-    "../../api/c/*.cpp"
+if(OpenCL_VERSION_MAJOR LESS 3)
+  set(opencl_compile_definitions
+    CL_TARGET_OPENCL_VERSION=120
+    CL_HPP_TARGET_OPENCL_VERSION=120
+    CL_HPP_MINIMUM_OPENCL_VERSION=120
+    CL_HPP_ENABLE_EXCEPTIONS)
+else()
+  set(opencl_compile_definitions
+    CL_TARGET_OPENCL_VERSION=300
+    CL_HPP_TARGET_OPENCL_VERSION=300
+    CL_HPP_MINIMUM_OPENCL_VERSION=110
+    CL_HPP_ENABLE_EXCEPTIONS)
+endif()
+
+include(kernel/scan_by_key/CMakeLists.txt)
+include(kernel/sort_by_key/CMakeLists.txt)
+
+add_library(afopencl "")
+add_library(ArrayFire::afopencl ALIAS afopencl)
+
+target_sources(afopencl
+  PRIVATE
+    $<$<PLATFORM_ID:Windows>:${af_opencl_ver_res_file}>
+    Array.cpp
+    Array.hpp
+    Kernel.cpp
+    Kernel.hpp
+    Module.hpp
+    Param.cpp
+    Param.hpp
+    all.cpp
+    anisotropic_diffusion.cpp
+    anisotropic_diffusion.hpp
+    any.cpp
+    api.cpp
+    approx.cpp
+    approx.hpp
+    arith.hpp
+    assign.cpp
+    assign.hpp
+    backend.hpp
+    bilateral.cpp
+    bilateral.hpp
+    binary.hpp
+    blas.cpp
+    blas.hpp
+    canny.cpp
+    canny.hpp
+    cast.hpp
+    cholesky.cpp
+    cholesky.hpp
+    clfft.cpp
+    clfft.hpp
+    compile_module.cpp
+    complex.hpp
+    convolve.cpp
+    convolve.hpp
+    convolve_separable.cpp
+    copy.cpp
+    copy.hpp
+    count.cpp
+    debug_opencl.hpp
+    device_manager.cpp
+    device_manager.hpp
+    diagonal.cpp
+    diagonal.hpp
+    diff.cpp
+    diff.hpp
+    err_clblast.hpp
+    err_opencl.hpp
+    errorcodes.cpp
+    errorcodes.hpp
+    Event.hpp
+    Event.cpp
+    exampleFunction.cpp
+    exampleFunction.hpp
+    fast.cpp
+    fast.hpp
+    fft.cpp
+    fft.hpp
+    fftconvolve.cpp
+    fftconvolve.hpp
+    flood_fill.cpp
+    flood_fill.hpp
+    GraphicsResourceManager.cpp
+    GraphicsResourceManager.hpp
+    gradient.cpp
+    gradient.hpp
+    harris.cpp
+    harris.hpp
+    hist_graphics.cpp
+    hist_graphics.hpp
+    histogram.cpp
+    histogram.hpp
+    homography.cpp
+    homography.hpp
+    hsv_rgb.cpp
+    hsv_rgb.hpp
+    identity.cpp
+    identity.hpp
+    iir.cpp
+    iir.hpp
+    image.cpp
+    image.hpp
+    index.cpp
+    index.hpp
+    inverse.cpp
+    inverse.hpp
+    iota.cpp
+    iota.hpp
+    ireduce.cpp
+    ireduce.hpp
+    jit.cpp
+    join.cpp
+    join.hpp
+    logic.hpp
+    lookup.cpp
+    lookup.hpp
+    lu.cpp
+    lu.hpp
+    match_template.cpp
+    match_template.hpp
+    math.cpp
+    math.hpp
+    max.cpp
+    mean.cpp
+    mean.hpp
+    meanshift.cpp
+    meanshift.hpp
+    medfilt.cpp
+    medfilt.hpp
+    memory.cpp
+    memory.hpp
+    min.cpp
+    moments.cpp
+    moments.hpp
+    morph.cpp
+    morph.hpp
+    nearest_neighbour.cpp
+    nearest_neighbour.hpp
+    orb.cpp
+    orb.hpp
+    platform.cpp
+    platform.hpp
+    plot.cpp
+    plot.hpp
+    print.hpp
+    product.cpp
+    qr.cpp
+    qr.hpp
+    random_engine.cpp
+    random_engine.hpp
+    range.cpp
+    range.hpp
+    reduce.hpp
+    reduce_impl.hpp
+    regions.cpp
+    regions.hpp
+    reorder.cpp
+    reorder.hpp
+    resize.cpp
+    resize.hpp
+    reshape.cpp
+    rotate.cpp
+    rotate.hpp
+    scalar.hpp
+    scan.cpp
+    scan.hpp
+    scan_by_key.cpp
+    scan_by_key.hpp
+    select.cpp
+    select.hpp
+    set.cpp
+    set.hpp
+    shift.cpp
+    shift.hpp
+    sift.cpp
+    sift.hpp
+    sobel.cpp
+    sobel.hpp
+    solve.cpp
+    solve.hpp
+    sort.cpp
+    sort.hpp
+    sort_by_key.cpp
+    sort_by_key.hpp
+    sort_index.cpp
+    sort_index.hpp
+    sparse.cpp
+    sparse.hpp
+    sparse_arith.cpp
+    sparse_arith.hpp
+    sparse_blas.cpp
+    sparse_blas.hpp
+    sum.cpp
+    surface.cpp
+    surface.hpp
+    susan.cpp
+    susan.hpp
+    svd.cpp
+    svd.hpp
+    tile.cpp
+    tile.hpp
+    threadsMgt.hpp
+    topk.cpp
+    topk.hpp
+    traits.hpp
+    transform.cpp
+    transform.hpp
+    transpose.cpp
+    transpose.hpp
+    transpose_inplace.cpp
+    triangle.cpp
+    triangle.hpp
+    types.hpp
+    types.cpp
+    unary.hpp
+    unwrap.cpp
+    unwrap.hpp
+    vector_field.cpp
+    vector_field.hpp
+    where.cpp
+    where.hpp
+    wrap.cpp
+    wrap.hpp
     )
-source_group(api\\c\\Headers FILES ${c_headers})
-source_group(api\\c\\Sources FILES ${c_sources})
 
 
-FILE(GLOB cpp_sources
-    "../../api/cpp/*.cpp"
+target_sources(afopencl
+  PRIVATE
+    kernel/KParam.hpp
+    kernel/anisotropic_diffusion.hpp
+    kernel/approx.hpp
+    kernel/assign.hpp
+    kernel/bilateral.hpp
+    kernel/canny.hpp
+    kernel/config.cpp
+    kernel/config.hpp
+    kernel/convolve.hpp
+    kernel/convolve_separable.cpp
+    kernel/convolve_separable.hpp
+    kernel/cscmm.hpp
+    kernel/cscmv.hpp
+    kernel/csrmm.hpp
+    kernel/csrmv.hpp
+    kernel/diagonal.hpp
+    kernel/diff.hpp
+    kernel/exampleFunction.hpp
+    kernel/fast.hpp
+    kernel/fftconvolve.hpp
+    kernel/flood_fill.hpp
+    kernel/gradient.hpp
+    kernel/harris.hpp
+    kernel/histogram.hpp
+    kernel/homography.hpp
+    kernel/hsv_rgb.hpp
+    kernel/identity.hpp
+    kernel/iir.hpp
+    kernel/index.hpp
+    kernel/interp.hpp
+    kernel/iota.hpp
+    kernel/ireduce.hpp
+    kernel/laset.hpp
+    #kernel/laset_band.hpp
+    kernel/laswp.hpp
+    kernel/lookup.hpp
+    kernel/lu_split.hpp
+    kernel/match_template.hpp
+    kernel/mean.hpp
+    kernel/meanshift.hpp
+    kernel/medfilt.hpp
+    kernel/memcopy.hpp
+    kernel/moments.hpp
+    kernel/morph.hpp
+    kernel/names.hpp
+    kernel/nearest_neighbour.hpp
+    kernel/orb.hpp
+    kernel/pad_array_borders.hpp
+    kernel/random_engine.hpp
+    kernel/range.hpp
+    kernel/reduce.hpp
+    kernel/regions.hpp
+    kernel/reorder.hpp
+    kernel/resize.hpp
+    kernel/rotate.hpp
+    kernel/scan_dim.hpp
+    kernel/scan_dim_by_key.hpp
+    kernel/scan_dim_by_key_impl.hpp
+    kernel/scan_first.hpp
+    kernel/scan_first_by_key.hpp
+    kernel/scan_first_by_key_impl.hpp
+    kernel/select.hpp
+    kernel/sift.hpp
+    kernel/sobel.hpp
+    kernel/sort.hpp
+    kernel/sort_by_key.hpp
+    kernel/sort_by_key_impl.hpp
+    kernel/sort_helper.hpp
+    kernel/sparse.hpp
+    kernel/sparse_arith.hpp
+    kernel/susan.hpp
+    kernel/swapdblk.hpp
+    kernel/tile.hpp
+    kernel/transform.hpp
+    kernel/transpose.hpp
+    kernel/transpose_inplace.hpp
+    kernel/triangle.hpp
+    kernel/unwrap.hpp
+    kernel/where.hpp
+    kernel/wrap.hpp
+
+    kernel/convolve/conv1.cpp
+    kernel/convolve/conv2_b8.cpp
+    kernel/convolve/conv2_c32.cpp
+    kernel/convolve/conv2_c64.cpp
+    kernel/convolve/conv2_f32.cpp
+    kernel/convolve/conv2_f64.cpp
+    kernel/convolve/conv2_impl.hpp
+    kernel/convolve/conv2_s8.cpp
+    kernel/convolve/conv2_s16.cpp
+    kernel/convolve/conv2_s32.cpp
+    kernel/convolve/conv2_s64.cpp
+    kernel/convolve/conv2_u16.cpp
+    kernel/convolve/conv2_u32.cpp
+    kernel/convolve/conv2_u64.cpp
+    kernel/convolve/conv2_u8.cpp
+    kernel/convolve/conv3.cpp
+    kernel/convolve/conv_common.hpp
     )
-source_group(api\\cpp\\Sources FILES ${cpp_sources})
-
-FILE(GLOB kernel_src ${opencl_kernels} "kernel/KParam.hpp")
 
-CL_KERNEL_TO_H(
-    SOURCES ${kernel_src}
-    VARNAME kernel_files
-    EXTENSION "hpp"
-    OUTPUT_DIR ${cl_kernel_headers}
-    TARGETS cl_kernel_targets
-    NAMESPACE "opencl"
-    EOF "0"
+target_sources(afopencl
+  PRIVATE
+    jit/BufferNode.hpp
+    jit/ShiftNode.hpp
+    jit/kernel_generators.hpp
+  )
+
+target_sources(afopencl
+  PRIVATE
+    ${kernel_files}
+  )
+
+target_sources(afopencl
+  PRIVATE
+    cpu/cpu_blas.cpp
+    cpu/cpu_blas.hpp
+    cpu/cpu_cholesky.cpp
+    cpu/cpu_cholesky.hpp
+    cpu/cpu_helper.hpp
+    cpu/cpu_inverse.cpp
+    cpu/cpu_inverse.hpp
+    cpu/cpu_lu.cpp
+    cpu/cpu_lu.hpp
+    cpu/cpu_qr.cpp
+    cpu/cpu_qr.hpp
+    cpu/cpu_solve.cpp
+    cpu/cpu_solve.hpp
+    cpu/cpu_sparse_blas.cpp
+    cpu/cpu_sparse_blas.hpp
+    cpu/cpu_svd.cpp
+    cpu/cpu_svd.hpp
+    cpu/cpu_triangle.hpp
+  )
+
+target_include_directories(afopencl
+  PUBLIC
+    $<BUILD_INTERFACE:${ArrayFire_SOURCE_DIR}/include>
+    $<BUILD_INTERFACE:${ArrayFire_BINARY_DIR}/include>
+    $<INSTALL_INTERFACE:${AF_INSTALL_INC_DIR}>
+  PRIVATE
+    ${CMAKE_CURRENT_SOURCE_DIR}
+    ${CMAKE_CURRENT_BINARY_DIR}
+    magma
+    ../../api/c
+    ../../../include
+  )
+
+if(NOT TARGET clblast)
+  add_dependencies(afopencl ${cl_kernel_targets} CLBlast-ext)
+endif()
+
+set_target_properties(afopencl PROPERTIES POSITION_INDEPENDENT_CODE ON)
+
+target_compile_definitions(afopencl
+  PRIVATE
+    ${opencl_compile_definitions}
+    AF_OPENCL
+  )
+
+target_link_libraries(afopencl
+  PRIVATE
+    c_api_interface
+    cpp_api_interface
+    OpenCL::OpenCL
+    OpenCL::cl2hpp
+    afcommon_interface
+    clFFT
+    clblast
+    opencl_scan_by_key
+    opencl_sort_by_key
+    Threads::Threads
     )
 
-# OS Definitions
-IF(UNIX)
-    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fPIC -pthread")
-ELSE(${UNIX}) #Windows
-    SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}")
-    SET(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /bigobj")
-    SET(CMAKE_CXX_FLAGS_RELWITHDEBINFO "${CMAKE_CXX_FLAGS_RELWITHDEBINFO} /bigobj")
-ENDIF()
-
-
-ADD_LIBRARY(afopencl SHARED
-            ${opencl_headers}
-            ${opencl_sources}
-            ${jit_sources}
-            ${kernel_headers}
-            ${opencl_kernels}
-            ${kernel_sources}
-            ${conv_ker_headers}
-            ${conv_ker_sources}
-            ${backend_headers}
-            ${backend_sources}
-            ${c_sources}
-            ${c_headers}
-            ${cpp_sources}
-            ${magma_sources}
-            ${magma_headers}
-            )
-
-ADD_DEPENDENCIES(afopencl ${cl_kernel_targets})
-
-TARGET_LINK_LIBRARIES(afopencl
-    ${OpenCL_LIBRARIES}
-    ${FreeImage_LIBS}
-    ${CLBLAS_LIBRARIES}
-    ${CLFFT_LIBRARIES}
-    ${CMAKE_DL_LIBS}
-    ${Boost_LIBRARIES})
-
-IF(LAPACK_FOUND)
-   TARGET_LINK_LIBRARIES(afopencl
-   ${LAPACK_LIBRARIES}
-   )
-ENDIF()
-
-SET_TARGET_PROPERTIES(afopencl PROPERTIES
-    VERSION "${AF_VERSION}"
-    SOVERSION "${AF_VERSION_MAJOR}")
-
-IF(FORGE_FOUND)
-    TARGET_LINK_LIBRARIES(afopencl
-                          ${FORGE_LIBRARIES}
-                         )
-ENDIF()
-
-# locally built but not installed libraries (clBLAS and clFFT) must NOT appear in the
-# link interface. The best option would be to use LINK_PRIVATE, but unfortunately
-# it is not available for older cmake than 2.8.7, so they must be remove from
-# the link interface manually. Both LINK_INTERFACE_LIBRARIES and INTERFACE_LINK_LIBRARIES
-# are used to keep it working with older cmake versions that 2.8.12 and avoid
-# warnings on newer versions - see CMP0022
-FOREACH(property LINK_INTERFACE_LIBRARIES INTERFACE_LINK_LIBRARIES)
-    GET_TARGET_PROPERTY(value afopencl ${property})
-    IF(value)
-        LIST(REMOVE_ITEM value ${CLBLAS_LIBRARIES} ${CLFFT_LIBRARIES})
-        SET_TARGET_PROPERTIES(afopencl PROPERTIES LINK_INTERFACE_LIBRARIES "${value}")
-        SET_TARGET_PROPERTIES(afopencl PROPERTIES INTERFACE_LINK_LIBRARIES "${value}")
-    ENDIF()
-ENDFOREACH()
-
-INSTALL(TARGETS afopencl EXPORT OpenCL DESTINATION "${AF_INSTALL_LIB_DIR}"
-        COMPONENT libraries)
-
-export(TARGETS afopencl FILE ArrayFireOpenCL.cmake)
-INSTALL(EXPORT OpenCL DESTINATION "${AF_INSTALL_CMAKE_DIR}"
-    COMPONENT cmake
-    FILE ArrayFireOpenCL.cmake)
+if(APPLE)
+  target_link_libraries(afopencl PRIVATE OpenGL::GL)
+endif()
+
+if(LAPACK_FOUND OR BUILD_WITH_MKL)
+  target_sources(afopencl
+    PRIVATE
+      magma/gebrd.cpp
+      magma/geqrf2.cpp
+      magma/geqrf3.cpp
+      magma/getrf.cpp
+      magma/getrs.cpp
+      magma/labrd.cpp
+      magma/larfb.cpp
+      magma/laset.cpp
+      #magma/laset_band.cpp
+      magma/laswp.cpp
+      magma/magma.h
+      magma/magma_blas.h
+      magma/magma_blas_clblast.h
+      magma/magma_common.h
+      magma/magma_cpu_blas.h
+      magma/magma_cpu_lapack.h
+      magma/magma_data.h
+      magma/magma_helper.cpp
+      magma/magma_helper.h
+      magma/magma_sync.h
+      magma/magma_types.h
+      magma/potrf.cpp
+      magma/swapdblk.cpp
+      magma/transpose.cpp
+      magma/transpose_inplace.cpp
+      magma/ungqr.cpp
+      magma/unmqr.cpp
+      #magma/unmqr2.cpp
+      )
+
+  if(BUILD_WITH_MKL)
+    target_compile_definitions(afopencl PRIVATE USE_MKL)
+    target_compile_definitions(afopencl PRIVATE AF_MKL_INTERFACE_SIZE=${MKL_INTERFACE_INTEGER_SIZE})
+    if(MKL_BATCH)
+      target_compile_definitions(afopencl PRIVATE AF_USE_MKL_BATCH)
+    endif()
+
+    if(AF_WITH_STATIC_MKL)
+      target_link_libraries(afopencl PRIVATE MKL::Static)
+      target_compile_definitions(afopencl PRIVATE USE_STATIC_MKL)
+    else()
+      target_link_libraries(afopencl PRIVATE MKL::RT)
+    endif()
+  else()
+    if(USE_CPU_F77_BLAS)
+      target_compile_definitions(afopencl PRIVATE USE_F77_BLAS)
+    endif()
+
+    target_include_directories(afopencl
+      SYSTEM PRIVATE
+        ${CBLAS_INCLUDE_DIR})
+
+    check_cxx_compiler_flag("-Wl,--start-group -Werror" group_flags)
+    if(group_flags)
+      set(START_GROUP -Wl,--start-group)
+      set(END_GROUP -Wl,--end-group)
+    endif()
+    target_link_libraries(afopencl
+      PRIVATE
+        ${START_GROUP}
+        ${LAPACK_LIBRARIES}
+        LAPACKE::LAPACKE
+        ${CBLAS_LIBRARIES}
+        ${END_GROUP}
+      )
+  endif()
+
+  target_compile_definitions(afopencl PRIVATE WITH_LINEAR_ALGEBRA)
+endif()
+
+af_split_debug_info(afopencl ${AF_INSTALL_LIB_DIR})
+
+install(TARGETS afopencl
+  EXPORT ArrayFireOpenCLTargets
+  COMPONENT opencl
+  PUBLIC_HEADER DESTINATION af
+  RUNTIME DESTINATION ${AF_INSTALL_BIN_DIR}
+  LIBRARY DESTINATION ${AF_INSTALL_LIB_DIR}
+  ARCHIVE DESTINATION ${AF_INSTALL_LIB_DIR}
+  FRAMEWORK DESTINATION framework
+  INCLUDES DESTINATION ${AF_INSTALL_INC_DIR}
+  )
+
+if(NOT APPLE AND AF_INSTALL_STANDALONE)
+  if(UNIX)
+    get_filename_component(opencl_outpath "${OpenCL_LIBRARIES}" REALPATH)
+    install(FILES ${opencl_outpath}
+        DESTINATION ${AF_INSTALL_LIB_DIR}
+        RENAME "${CMAKE_SHARED_LIBRARY_PREFIX}OpenCL${CMAKE_SHARED_LIBRARY_SUFFIX}.1"
+        COMPONENT opencl_dependencies)
+  else()
+    find_file(OpenCL_DLL_LIBRARY
+      NAMES ${CMAKE_SHARED_LIBRARY_PREFIX}OpenCL${CMAKE_SHARED_LIBRARY_SUFFIX}
+      PATHS
+        ENV "PROGRAMFILES(X86)"
+        ENV "PROGRAMFILES"
+        ENV AMDAPPSDKROOT
+        ENV INTELOCLSDKROOT
+        ENV CUDA_PATH
+        ENV NVSDKCOMPUTE_ROOT
+        ENV ATISTREAMSDKROOT
+      PATH_SUFFIXES
+        "AMD APP SDK/bin/x86_64"
+        "bin/x86_64"
+        "bin/x64"
+        "bin/icd/x64"
+        "OpenCL SDK/bin/icd/x64"
+        "Intel/OpenCL SDK/bin/icd/x64"
+        "OpenCL SDK/bin/icd/x64"
+        "NVIDIA Corporation/OpenCL")
+    mark_as_advanced(OpenCL_DLL_LIBRARY)
+    install(FILES "${OpenCL_DLL_LIBRARY}"
+        DESTINATION ${AF_INSTALL_BIN_DIR}
+        COMPONENT opencl_dependencies)
+  endif()
+endif()
+
+source_group(include REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/include/*)
+source_group(api\\cpp REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/cpp/*)
+source_group(api\\c   REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/api/c/*)
+source_group(backend  REGULAR_EXPRESSION ${ArrayFire_SOURCE_DIR}/src/backend/common/*|${CMAKE_CURRENT_SOURCE_DIR}/*)
+source_group(backend\\kernel  REGULAR_EXPRESSION ${CMAKE_CURRENT_SOURCE_DIR}/kernel/*|${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/*|${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/*)
+source_group("generated files" FILES ${ArrayFire_BINARY_DIR}/src/backend/build_version.hpp ${ArrayFire_BINARY_DIR}/include/af/version.h)
diff --git a/src/backend/opencl/Event.cpp b/src/backend/opencl/Event.cpp
new file mode 100644
index 0000000000..bc93b60a62
--- /dev/null
+++ b/src/backend/opencl/Event.cpp
@@ -0,0 +1,74 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Event.hpp>
+
+#include <common/err_common.hpp>
+#include <events.hpp>
+#include <platform.hpp>
+#include <af/event.h>
+#include <memory>
+
+#include <memory>
+
+using std::make_unique;
+using std::unique_ptr;
+
+namespace arrayfire {
+namespace opencl {
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(cl::CommandQueue& queue) {
+    Event e;
+    if (e.create() == CL_SUCCESS) { e.mark(queue()); }
+    return e;
+}
+
+af_event createEvent() {
+    auto e = make_unique<Event>();
+    // Ensure the default CL command queue is initialized
+    getQueue()();
+    if (e->create() != CL_SUCCESS) {
+        AF_ERROR("Could not create event", AF_ERR_RUNTIME);
+    }
+    Event& ref = *e.release();
+    return getHandle(ref);
+}
+
+void markEventOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active command queue
+    if (event.mark(getQueue()()) != CL_SUCCESS) {
+        AF_ERROR("Could not mark event on active queue", AF_ERR_RUNTIME);
+    }
+}
+
+void enqueueWaitOnActiveQueue(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    // Use the currently-active command queue
+    if (event.enqueueWait(getQueue()()) != CL_SUCCESS) {
+        AF_ERROR("Could not enqueue wait on active queue for event",
+                 AF_ERR_RUNTIME);
+    }
+}
+
+void block(af_event eventHandle) {
+    Event& event = getEvent(eventHandle);
+    if (event.block() != CL_SUCCESS) {
+        AF_ERROR("Could not block on event", AF_ERR_RUNTIME);
+    }
+}
+
+af_event createAndMarkEvent() {
+    af_event handle = createEvent();
+    markEventOnActiveQueue(handle);
+    return handle;
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Event.hpp b/src/backend/opencl/Event.hpp
new file mode 100644
index 0000000000..c8420a9dff
--- /dev/null
+++ b/src/backend/opencl/Event.hpp
@@ -0,0 +1,61 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+
+#include <cl2hpp.hpp>
+#include <common/EventBase.hpp>
+#include <af/event.h>
+
+namespace arrayfire {
+namespace opencl {
+class OpenCLEventPolicy {
+   public:
+    using EventType = cl_event;
+    using QueueType = cl_command_queue;
+    using ErrorType = cl_int;
+
+    static cl_int createAndMarkEvent(cl_event *e) noexcept {
+        // No-op: OpenCL creates the event when it is marked (see markEvent)
+        return CL_SUCCESS;
+    }
+
+    static cl_int markEvent(cl_event *e, cl_command_queue stream) noexcept {
+        return clEnqueueMarkerWithWaitList(stream, 0, nullptr, e);
+    }
+
+    static cl_int waitForEvent(cl_event *e, cl_command_queue stream) noexcept {
+        return clEnqueueMarkerWithWaitList(stream, 1, e, nullptr);
+    }
+
+    static cl_int syncForEvent(cl_event *e) noexcept {
+        return clWaitForEvents(1, e);
+    }
+
+    static cl_int destroyEvent(cl_event *e) noexcept {
+        return clReleaseEvent(*e);
+    }
+};
+
+using Event = common::EventBase<OpenCLEventPolicy>;
+
+/// \brief Creates a new event and marks it in the queue
+Event makeEvent(cl::CommandQueue &queue);
+
+af_event createEvent();
+
+void markEventOnActiveQueue(af_event eventHandle);
+
+void enqueueWaitOnActiveQueue(af_event eventHandle);
+
+void block(af_event eventHandle);
+
+af_event createAndMarkEvent();
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/GraphicsResourceManager.cpp b/src/backend/opencl/GraphicsResourceManager.cpp
new file mode 100644
index 0000000000..fe1f703a5f
--- /dev/null
+++ b/src/backend/opencl/GraphicsResourceManager.cpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <GraphicsResourceManager.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace opencl {
+GraphicsResourceManager::ShrdResVector
+GraphicsResourceManager::registerResources(
+    const std::vector<uint32_t>& resources) {
+    ShrdResVector output;
+
+    for (auto id : resources) {
+        output.emplace_back(new cl::BufferGL(
+            getContext(), CL_MEM_WRITE_ONLY,  // NOLINT(hicpp-signed-bitwise)
+            id, NULL));
+    }
+
+    return output;
+}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/GraphicsResourceManager.hpp b/src/backend/opencl/GraphicsResourceManager.hpp
new file mode 100644
index 0000000000..130a564df1
--- /dev/null
+++ b/src/backend/opencl/GraphicsResourceManager.hpp
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/InteropManager.hpp>
+
+#include <map>
+#include <vector>
+
+namespace cl {
+class Buffer;
+}
+
+namespace arrayfire {
+namespace opencl {
+class GraphicsResourceManager
+    : public common::InteropManager<GraphicsResourceManager, cl::Buffer> {
+   public:
+    using ShrdResVector = std::vector<std::shared_ptr<cl::Buffer>>;
+
+    GraphicsResourceManager() {}
+    static ShrdResVector registerResources(
+        const std::vector<uint32_t>& resources);
+
+   protected:
+    GraphicsResourceManager(GraphicsResourceManager const&);
+    void operator=(GraphicsResourceManager const&);
+};
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/JIT/BinaryNode.hpp b/src/backend/opencl/JIT/BinaryNode.hpp
deleted file mode 100644
index f087760b87..0000000000
--- a/src/backend/opencl/JIT/BinaryNode.hpp
+++ /dev/null
@@ -1,134 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace opencl
-{
-
-namespace JIT
-{
-
-    class BinaryNode : public Node
-    {
-    private:
-        std::string m_op_str;
-        Node_ptr m_lhs, m_rhs;
-        int m_op;
-
-    public:
-        BinaryNode(const char *out_type_str, const char *name_str,
-                   const char *op_str,
-                   Node_ptr lhs, Node_ptr rhs, int op)
-            : Node(out_type_str, name_str),
-              m_op_str(op_str),
-              m_lhs(lhs),
-              m_rhs(rhs),
-              m_op(op)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return m_lhs->isLinear(dims) && m_rhs->isLinear(dims);
-        }
-
-        void genParams(std::stringstream &kerStream)
-        {
-            if (m_gen_param) return;
-            if (!(m_lhs->isGenParam())) m_lhs->genParams(kerStream);
-            if (!(m_rhs->isGenParam())) m_rhs->genParams(kerStream);
-            m_gen_param = true;
-        }
-
-        int setArgs(cl::Kernel &ker, int id)
-        {
-            id = m_lhs->setArgs(ker, id);
-            id = m_rhs->setArgs(ker, id);
-            return id;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            if (!(m_lhs->isGenOffset())) m_lhs->genOffsets(kerStream, is_linear);
-            if (!(m_rhs->isGenOffset())) m_rhs->genOffsets(kerStream, is_linear);
-            m_gen_offset = true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            m_lhs->genKerName(kerStream);
-            m_rhs->genKerName(kerStream);
-
-            if (m_gen_name) return;
-            // Make the dec representation of enum part of the Kernel name
-            kerStream << "_" << std::setw(3) << std::setfill('0') << std::dec << m_op;
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_lhs->getId();
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_rhs->getId();
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream)
-        {
-            if (m_gen_func) return;
-
-            if (!(m_lhs->isGenFunc())) m_lhs->genFuncs(kerStream);
-            if (!(m_rhs->isGenFunc())) m_rhs->genFuncs(kerStream);
-
-            kerStream << m_type_str << " val" << m_id << " = "
-                      << m_op_str << "(val" << m_lhs->getId()
-                      << ", val" << m_rhs->getId() << ");"
-                      << "\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            id = m_lhs->setId(id);
-            id = m_rhs->setId(id);
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            m_lhs->getInfo(len, buf_count, bytes);
-            m_rhs->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_lhs->resetFlags();
-            m_rhs->resetFlags();
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/opencl/JIT/BufferNode.hpp b/src/backend/opencl/JIT/BufferNode.hpp
deleted file mode 100644
index 71723b99df..0000000000
--- a/src/backend/opencl/JIT/BufferNode.hpp
+++ /dev/null
@@ -1,154 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace opencl
-{
-
-namespace JIT
-{
-
-
-    class BufferNode : public Node
-    {
-    private:
-        const std::shared_ptr<cl::Buffer> m_data;
-        const Param m_param;
-        const unsigned m_bytes;
-        bool m_set_arg;
-        bool m_linear;
-
-    public:
-
-        BufferNode(const char *type_str,
-                   const char *name_str,
-                   const Param param,
-                   const std::shared_ptr<cl::Buffer> data,
-                   const unsigned bytes,
-                   const bool is_linear)
-            : Node(type_str, name_str),
-              m_data(data),
-              m_param(param),
-              m_bytes(bytes),
-              m_set_arg(false),
-              m_linear(is_linear)
-        {}
-
-        bool isLinear(dim_t dims[4])
-        {
-            bool same_dims = true;
-            for (int i = 0; same_dims && i < 4; i++) {
-                same_dims &= (dims[i] == m_param.info.dims[i]);
-            }
-            return m_linear && same_dims;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            if (m_gen_name) return;
-
-            kerStream << "_" << m_name_str;
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genParams(std::stringstream &kerStream)
-        {
-            if (m_gen_param) return;
-            kerStream << "__global " << m_type_str << " *in" << m_id
-                      << ", KParam iInfo" << m_id << ", " << "\n";
-            m_gen_param = true;
-        }
-
-        int setArgs(cl::Kernel &ker, int id)
-        {
-            if (m_set_arg) return id;
-
-            ker.setArg(id + 0, *m_param.data);
-            ker.setArg(id + 1,  m_param.info);
-
-            m_set_arg = true;
-            return id + 2;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-
-            std::string idx_str = std::string("int idx") + std::to_string(m_id);
-            std::string info_str = std::string("iInfo") + std::to_string(m_id);;
-
-            if (!is_linear) {
-                kerStream << idx_str << " = "
-                          << "(id3 < " << info_str << ".dims[3]) * "
-                          << info_str << ".strides[3] * id3 + "
-                          << "(id2 < " << info_str << ".dims[2]) * "
-                          << info_str << ".strides[2] * id2 + "
-                          << "(id1 < " << info_str << ".dims[1]) * "
-                          << info_str << ".strides[1] * id1 + "
-                          << "(id0 < " << info_str << ".dims[0]) * "
-                          << "id0 + " << info_str << ".offset;"
-                          << "\n";
-            } else {
-                kerStream << idx_str << " = idx + " << info_str << ".offset;" << "\n";
-            }
-
-            m_gen_offset = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream)
-        {
-            if (m_gen_func) return;
-
-            kerStream << m_type_str << " val" << m_id << " = "
-                      << "in" << m_id << "[idx" << m_id << "];"
-                      << "\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            len++;
-            buf_count++;
-            bytes += m_bytes;
-            m_set_id = true;
-            return;
-        }
-
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_gen_name = false;
-            m_set_arg = false;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/opencl/JIT/Node.hpp b/src/backend/opencl/JIT/Node.hpp
deleted file mode 100644
index fedf7fb9bd..0000000000
--- a/src/backend/opencl/JIT/Node.hpp
+++ /dev/null
@@ -1,87 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <platform.hpp>
-#include <af/array.h>
-#include <optypes.hpp>
-#include <string>
-#include <vector>
-#include <memory>
-
-namespace opencl
-{
-
-namespace JIT
-{
-    using std::shared_ptr;
-
-    class Node
-    {
-    protected:
-        std::string m_type_str;
-        std::string m_name_str;
-        int m_id;
-        bool m_set_id;
-        bool m_gen_func;
-        bool m_gen_param;
-        bool m_gen_offset;
-        bool m_gen_name;
-
-    public:
-
-        Node(const char *type_str, const char *name_str)
-            : m_type_str(type_str),
-              m_name_str(name_str),
-              m_id(-1),
-              m_set_id(false),
-              m_gen_func(false),
-              m_gen_param(false),
-              m_gen_offset(false),
-              m_gen_name(false)
-        {}
-
-        virtual void genKerName(std::stringstream &kerStream) {}
-        virtual void genParams  (std::stringstream &kerStream) {}
-        virtual void genOffsets (std::stringstream &kerStream, bool is_linear) {}
-        virtual void genFuncs   (std::stringstream &kerStream) { m_gen_func = true;}
-
-        virtual int setArgs (cl::Kernel &ker, int id) { return id; }
-
-        virtual int setId(int id) { m_set_id = true; return id; }
-
-        virtual void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            len = 0;
-            buf_count = 0;
-            bytes = 0;
-        }
-
-
-        virtual void resetFlags() {}
-
-        virtual bool isLinear(dim_t dims[4]) { return true; }
-
-        std::string getTypeStr() { return m_type_str; }
-
-        bool isGenFunc() { return m_gen_func; }
-        bool isGenParam() { return m_gen_param; }
-        bool isGenOffset() { return m_gen_offset; }
-
-        int getId()  { return m_id; }
-        std::string getNameStr() { return m_name_str; }
-
-        virtual ~Node() {}
-    };
-
-    typedef shared_ptr<Node> Node_ptr;
-
-}
-
-}
diff --git a/src/backend/opencl/JIT/ScalarNode.hpp b/src/backend/opencl/JIT/ScalarNode.hpp
deleted file mode 100644
index 9eaa544134..0000000000
--- a/src/backend/opencl/JIT/ScalarNode.hpp
+++ /dev/null
@@ -1,115 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <math.hpp>
-#include <types.hpp>
-#include <iomanip>
-
-namespace opencl
-{
-
-namespace JIT
-{
-
-    template <typename T>
-    class ScalarNode : public Node
-    {
-    private:
-        const T m_val;
-        bool m_set_arg;
-
-    public:
-
-        ScalarNode(T val)
-            : Node(dtype_traits<T>::getName(), shortname<T>(false)),
-              m_val(val),
-              m_set_arg(false)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            if (m_gen_name) return;
-
-            kerStream << "_" << m_name_str;
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genParams(std::stringstream &kerStream)
-        {
-            if (m_gen_param) return;
-            kerStream << m_type_str << " scalar" << m_id << ", " << "\n";
-            m_gen_param = true;
-        }
-
-        int setArgs(cl::Kernel &ker, int id)
-        {
-            if (m_set_arg) return id;
-            ker.setArg(id, m_val);
-            m_set_arg = true;
-            return id + 1;
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            m_gen_offset = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream)
-        {
-            if (m_gen_func) return;
-
-            kerStream << m_type_str << " val" << m_id << " = "
-                      << "scalar" << m_id << ";"
-                      << "\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-            len++;
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_gen_name = false;
-            m_set_arg = false;
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/opencl/JIT/UnaryNode.hpp b/src/backend/opencl/JIT/UnaryNode.hpp
deleted file mode 100644
index 78fda23e92..0000000000
--- a/src/backend/opencl/JIT/UnaryNode.hpp
+++ /dev/null
@@ -1,121 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include "Node.hpp"
-#include <iomanip>
-
-namespace opencl
-{
-
-namespace JIT
-{
-
-    class UnaryNode : public Node
-    {
-    private:
-        std::string m_op_str;
-        Node_ptr m_child;
-        int m_op;
-
-    public:
-        UnaryNode(const char *out_type_str, const char *name_str,
-                  const char *op_str,
-                  Node_ptr child, int op)
-            : Node(out_type_str, name_str),
-              m_op_str(op_str),
-              m_child(child),
-              m_op(op)
-        {
-        }
-
-        bool isLinear(dim_t dims[4])
-        {
-            return m_child->isLinear(dims);
-        }
-
-        void genParams(std::stringstream &kerStream)
-        {
-            if (m_gen_param) return;
-            if (!(m_child->isGenParam())) m_child->genParams(kerStream);
-            m_gen_param = true;
-        }
-
-        int setArgs(cl::Kernel &ker, int id)
-        {
-            return m_child->setArgs(ker, id);
-        }
-
-        void genOffsets(std::stringstream &kerStream, bool is_linear)
-        {
-            if (m_gen_offset) return;
-            if (!(m_child->isGenOffset())) m_child->genOffsets(kerStream, is_linear);
-            m_gen_offset = true;
-        }
-
-        void genKerName(std::stringstream &kerStream)
-        {
-            m_child->genKerName(kerStream);
-
-            // Make the dec representation of enum part of the Kernel name
-            kerStream << "_" << std::setw(3) << std::setfill('0') << std::dec << m_op;
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_child->getId();
-            kerStream << std::setw(3) << std::setfill('0') << std::dec << m_id << std::dec;
-            m_gen_name = true;
-        }
-
-        void genFuncs(std::stringstream &kerStream)
-        {
-            if (m_gen_func) return;
-
-            if (!(m_child->isGenFunc())) m_child->genFuncs(kerStream);
-
-            kerStream << m_type_str << " val" << m_id << " = "
-                      << m_op_str << "(val" << m_child->getId() << ");"
-                      << "\n";
-
-            m_gen_func = true;
-        }
-
-        int setId(int id)
-        {
-            if (m_set_id) return id;
-
-            id = m_child->setId(id);
-
-            m_id = id;
-            m_set_id = true;
-
-            return m_id + 1;
-        }
-
-        void getInfo(unsigned &len, unsigned &buf_count, unsigned &bytes)
-        {
-            if (m_set_id) return;
-
-            m_child->getInfo(len, buf_count, bytes);
-            len++;
-
-            m_set_id = true;
-            return;
-        }
-
-        void resetFlags()
-        {
-            m_set_id = false;
-            m_gen_func = false;
-            m_gen_param = false;
-            m_gen_offset = false;
-            m_child->resetFlags();
-        }
-    };
-
-}
-
-}
diff --git a/src/backend/opencl/Kernel.cpp b/src/backend/opencl/Kernel.cpp
new file mode 100644
index 0000000000..b5d818b6d2
--- /dev/null
+++ b/src/backend/opencl/Kernel.cpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Kernel.hpp>
+
+#include <backend.hpp>
+#include <cl2hpp.hpp>
+#include <common/defines.hpp>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+Kernel::DevPtrType Kernel::getDevPtr(const char* name) {
+    UNUSED(name);
+    return nullptr;
+}
+
+void Kernel::copyToReadOnly(Kernel::DevPtrType dst, Kernel::DevPtrType src,
+                            size_t bytes) {
+    getQueue().enqueueCopyBuffer(*src, *dst, 0, 0, bytes);
+}
+
+void Kernel::setFlag(Kernel::DevPtrType dst, int* scalarValPtr,
+                     const bool syncCopy) {
+    UNUSED(syncCopy);
+    getQueue().enqueueFillBuffer(*dst, *scalarValPtr, 0, sizeof(int));
+}
+
+int Kernel::getFlag(Kernel::DevPtrType src) {
+    int retVal = 0;
+    getQueue().enqueueReadBuffer(*src, CL_TRUE, 0, sizeof(int), &retVal);
+    return retVal;
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Kernel.hpp b/src/backend/opencl/Kernel.hpp
new file mode 100644
index 0000000000..c5582d8f1c
--- /dev/null
+++ b/src/backend/opencl/Kernel.hpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/KernelInterface.hpp>
+#include <common/Logger.hpp>
+
+#include <backend.hpp>
+#include <cl2hpp.hpp>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel_logger {
+inline auto getLogger() -> spdlog::logger* {
+    static auto logger = common::loggerFactory("kernel");
+    return logger.get();
+}
+}  // namespace kernel_logger
+
+struct Enqueuer {
+    template<typename... Args>
+    void operator()(std::string name, cl::Kernel ker,
+                    const cl::EnqueueArgs& qArgs, Args&&... args) {
+        auto launchOp = cl::KernelFunctor<Args...>(ker);
+        using namespace kernel_logger;
+        AF_TRACE("Launching {}", name);
+        launchOp(qArgs, std::forward<Args>(args)...);
+    }
+};
+
+class Kernel
+    : public common::KernelInterface<const cl::Program*, cl::Kernel, Enqueuer,
+                                     cl::Buffer*> {
+   public:
+    using BaseClass =
+        common::KernelInterface<ModuleType, KernelType, Enqueuer, DevPtrType>;
+
+    Kernel() : BaseClass("", nullptr, cl::Kernel{nullptr, false}) {}
+    Kernel(std::string name, ModuleType mod, KernelType ker)
+        : BaseClass(name, mod, ker) {}
+
+    // clang-format off
+    [[deprecated("OpenCL backend doesn't need Kernel::getDevPtr method")]]
+    DevPtrType getDevPtr(const char* name) final;
+    // clang-format on
+
+    void copyToReadOnly(DevPtrType dst, DevPtrType src, size_t bytes) final;
+
+    void setFlag(DevPtrType dst, int* scalarValPtr,
+                 const bool syncCopy = false) final;
+
+    int getFlag(DevPtrType src) final;
+};
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Module.hpp b/src/backend/opencl/Module.hpp
new file mode 100644
index 0000000000..b8a8d6a3b5
--- /dev/null
+++ b/src/backend/opencl/Module.hpp
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/ModuleInterface.hpp>
+
+#include <cl2hpp.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+/// OpenCL backend wrapper for cl::Program object
+class Module : public common::ModuleInterface<cl::Program> {
+   public:
+    using ModuleType = cl::Program;
+    using BaseClass  = common::ModuleInterface<ModuleType>;
+
+    /// \brief Create an uninitialized Module
+    Module() = default;
+
+    /// \brief Create a module given a cl::Program type
+    Module(ModuleType mod) : BaseClass(mod) {}
+
+    /// \brief Check whether the module holds a valid (non-null) cl::Program
+    operator bool() const final { return get()(); }
+
+    /// Unload the module
+    void unload() final { set(cl::Program()); }
+};
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Param.cpp b/src/backend/opencl/Param.cpp
index 552513aaf2..3b791c96ea 100644
--- a/src/backend/opencl/Param.cpp
+++ b/src/backend/opencl/Param.cpp
@@ -7,22 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
 #include <Param.hpp>
-#include <platform.hpp>
 #include <kernel/KParam.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace opencl {
+Param::Param() : data(nullptr), info{{0, 0, 0, 0}, {0, 0, 0, 0}, 0} {}
+Param::Param(cl::Buffer *data_, KParam info_) : data(data_), info(info_) {}
 
-namespace opencl
-{
-    Param makeParam(cl_mem mem, int off, int dims[4], int strides[4])
-    {
-        Param out;
-        out.data = new cl::Buffer(mem);
-        out.info.offset = off;
-        for (int i = 0; i < 4; i++) {
-            out.info.dims[i] = dims[i];
-            out.info.strides[i] = strides[i];
-        }
-        return out;
+Param makeParam(cl::Buffer &mem, int off, const int dims[4],
+                const int strides[4]) {
+    Param out;
+    out.data        = &mem;
+    out.info.offset = off;
+    for (int i = 0; i < 4; i++) {
+        out.info.dims[i]    = dims[i];
+        out.info.strides[i] = strides[i];
     }
+    return out;
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/Param.hpp b/src/backend/opencl/Param.hpp
index bd948b1bfe..879c92c677 100644
--- a/src/backend/opencl/Param.hpp
+++ b/src/backend/opencl/Param.hpp
@@ -8,17 +8,32 @@
  ********************************************************/
 
 #pragma once
-#include <platform.hpp>
+
+#include <cl2hpp.hpp>
 #include <kernel/KParam.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+
+struct Param {
+    cl::Buffer* data;
+    KParam info;
+    Param& operator=(const Param& other) = default;
+    Param(const Param& other)            = default;
+    Param(Param&& other)                 = default;
+
+    dim_t* dims_ptr() { return info.dims; }
+    dim_t* strides_ptr() { return info.strides; }
 
-    typedef struct
-    {
-        cl::Buffer *data;
-        KParam info;
-    } Param;
+    // AF_DEPRECATED("Use Array<T>")
+    Param();
+    // AF_DEPRECATED("Use Array<T>")
+    Param(cl::Buffer* data_, KParam info_);
+    ~Param() = default;
+};
 
-    Param makeParam(cl_mem mem, int off, int dims[4], int strides[4]);
-}
+// AF_DEPRECATED("Use Array<T>")
+Param makeParam(cl::Buffer& mem, int off, const int dims[4],
+                const int strides[4]);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/all.cpp b/src/backend/opencl/all.cpp
index 8f224c91c3..d81d9def34 100644
--- a/src/backend/opencl/all.cpp
+++ b/src/backend/opencl/all.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //alltrue
-    INSTANTIATE(af_and_t, float  , char)
-    INSTANTIATE(af_and_t, double , char)
-    INSTANTIATE(af_and_t, cfloat , char)
-    INSTANTIATE(af_and_t, cdouble, char)
-    INSTANTIATE(af_and_t, int    , char)
-    INSTANTIATE(af_and_t, uint   , char)
-    INSTANTIATE(af_and_t, char   , char)
-    INSTANTIATE(af_and_t, uchar  , char)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// alltrue
+INSTANTIATE(af_and_t, float, char)
+INSTANTIATE(af_and_t, double, char)
+INSTANTIATE(af_and_t, cfloat, char)
+INSTANTIATE(af_and_t, cdouble, char)
+INSTANTIATE(af_and_t, int, char)
+INSTANTIATE(af_and_t, uint, char)
+INSTANTIATE(af_and_t, intl, char)
+INSTANTIATE(af_and_t, uintl, char)
+INSTANTIATE(af_and_t, char, char)
+INSTANTIATE(af_and_t, schar, char)
+INSTANTIATE(af_and_t, uchar, char)
+INSTANTIATE(af_and_t, short, char)
+INSTANTIATE(af_and_t, ushort, char)
+INSTANTIATE(af_and_t, half, char)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/anisotropic_diffusion.cpp b/src/backend/opencl/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..19e065c14f
--- /dev/null
+++ b/src/backend/opencl/anisotropic_diffusion.cpp
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <anisotropic_diffusion.hpp>
+#include <copy.hpp>
+#include <kernel/anisotropic_diffusion.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq) {
+    if (eq == AF_DIFFUSION_MCDE) {
+        kernel::anisotropicDiffusion<T, true>(inout, dt, mct, fftype);
+    } else {
+        kernel::anisotropicDiffusion<T, false>(inout, dt, mct, fftype);
+    }
+}
+
+#define INSTANTIATE(T)                                     \
+    template void anisotropicDiffusion<T>(                 \
+        Array<T> & inout, const float dt, const float mct, \
+        const af::fluxFunction fftype, const af::diffusionEq eq);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/anisotropic_diffusion.hpp b/src/backend/opencl/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..a1a76a29dc
--- /dev/null
+++ b/src/backend/opencl/anisotropic_diffusion.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void anisotropicDiffusion(Array<T>& inout, const float dt, const float mct,
+                          const af::fluxFunction fftype,
+                          const af::diffusionEq eq);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/any.cpp b/src/backend/opencl/any.cpp
index e58c5a7a9a..ee2d16ab63 100644
--- a/src/backend/opencl/any.cpp
+++ b/src/backend/opencl/any.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //anytrue
-    INSTANTIATE(af_or_t, float  , char)
-    INSTANTIATE(af_or_t, double , char)
-    INSTANTIATE(af_or_t, cfloat , char)
-    INSTANTIATE(af_or_t, cdouble, char)
-    INSTANTIATE(af_or_t, int    , char)
-    INSTANTIATE(af_or_t, uint   , char)
-    INSTANTIATE(af_or_t, char   , char)
-    INSTANTIATE(af_or_t, uchar  , char)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// anytrue
+INSTANTIATE(af_or_t, float, char)
+INSTANTIATE(af_or_t, double, char)
+INSTANTIATE(af_or_t, cfloat, char)
+INSTANTIATE(af_or_t, cdouble, char)
+INSTANTIATE(af_or_t, int, char)
+INSTANTIATE(af_or_t, uint, char)
+INSTANTIATE(af_or_t, intl, char)
+INSTANTIATE(af_or_t, uintl, char)
+INSTANTIATE(af_or_t, char, char)
+INSTANTIATE(af_or_t, schar, char)
+INSTANTIATE(af_or_t, uchar, char)
+INSTANTIATE(af_or_t, short, char)
+INSTANTIATE(af_or_t, ushort, char)
+INSTANTIATE(af_or_t, half, char)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/api.cpp b/src/backend/opencl/api.cpp
new file mode 100644
index 0000000000..df3f6783a1
--- /dev/null
+++ b/src/backend/opencl/api.cpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <af/array.h>
+#include <af/opencl.h>
+#include <cstring>
+
+namespace af {
+template<>
+AFAPI cl_mem *array::device() const {
+    void *dptr = nullptr;
+    af_err err = af_get_device_ptr(&dptr, get());
+    if (err != AF_SUCCESS) {
+        throw af::exception("Failed to get cl_mem from array object");
+    }
+    auto *mem_ptr = new cl_mem;  // allocated after the check so a throw cannot leak it
+    std::memcpy(mem_ptr, &dptr, sizeof(void *));
+    return mem_ptr;
+}
+}  // namespace af
diff --git a/src/backend/opencl/approx.cpp b/src/backend/opencl/approx.cpp
index 867157264d..cc8c6994a9 100644
--- a/src/backend/opencl/approx.cpp
+++ b/src/backend/opencl/approx.cpp
@@ -7,71 +7,81 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
 #include <approx.hpp>
-#include <kernel/approx.hpp>
-#include <stdexcept>
-
-namespace opencl
-{
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                      const af_interp_type method, const float offGrid)
-    {
-        af::dim4 odims = in.dims();
-        odims[0] = pos.dims()[0];
 
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
+#include <kernel/approx.hpp>
 
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::approx1<Ty, Tp, AF_INTERP_NEAREST>(out, in, pos, offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                kernel::approx1<Ty, Tp, AF_INTERP_LINEAR> (out, in, pos, offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
+namespace arrayfire {
+namespace opencl {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx1<Ty, Tp>(yo, yi, xo, xdim, xi_beg, xi_step, offGrid,
+                                    method, 1);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+            kernel::approx1<Ty, Tp>(yo, yi, xo, xdim, xi_beg, xi_step, offGrid,
+                                    method, 2);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+            kernel::approx1<Ty, Tp>(yo, yi, xo, xdim, xi_beg, xi_step, offGrid,
+                                    method, 3);
+            break;
+        default: break;
     }
+}
 
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                      const af_interp_type method, const float offGrid)
-    {
-        af::dim4 odims = pos0.dims();
-        odims[2] = in.dims()[2];
-        odims[3] = in.dims()[3];
-
-        // Create output placeholder
-        Array<Ty> out = createEmptyArray<Ty>(odims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::approx2<Ty, Tp, AF_INTERP_NEAREST>(out, in, pos0, pos1, offGrid);
-                break;
-            case AF_INTERP_LINEAR:
-                kernel::approx2<Ty, Tp, AF_INTERP_LINEAR> (out, in, pos0, pos1, offGrid);
-                break;
-            default:
-                break;
-        }
-        return out;
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid) {
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 1);
+            break;
+        case AF_INTERP_LINEAR:
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_LINEAR_COSINE:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 2);
+            break;
+        case AF_INTERP_CUBIC:
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_CUBIC_SPLINE:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::approx2<Ty, Tp>(zo, zi, xo, xdim, xi_beg, xi_step, yo, ydim,
+                                    yi_beg, yi_step, offGrid, method, 3);
+            break;
+        default: break;
     }
+}
 
-#define INSTANTIATE(Ty, Tp)                                             \
-    template Array<Ty> approx1<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos, \
-                                       const af_interp_type method, const float offGrid); \
-    template Array<Ty> approx2<Ty, Tp>(const Array<Ty> &in, const Array<Tp> &pos0, \
-                                       const Array<Tp> &pos1, const af_interp_type method, \
-                                       const float offGrid);            \
+#define INSTANTIATE(Ty, Tp)                                       \
+    template void approx1<Ty, Tp>(                                \
+        Array<Ty> & yo, const Array<Ty> &yi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const af_interp_type method, const float offGrid);        \
+    template void approx2<Ty, Tp>(                                \
+        Array<Ty> & zo, const Array<Ty> &zi, const Array<Tp> &xo, \
+        const int xdim, const Tp &xi_beg, const Tp &xi_step,      \
+        const Array<Tp> &yo, const int ydim, const Tp &yi_beg,    \
+        const Tp &yi_step, const af_interp_type method, const float offGrid);
 
-    INSTANTIATE(float  , float )
-    INSTANTIATE(double , double)
-    INSTANTIATE(cfloat , float )
-    INSTANTIATE(cdouble, double)
-}
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace opencl
+}  // namespace arrayfire
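
Note on the dispatch above: the new `approx1`/`approx2` bodies group interpolation methods by the trailing integer passed to the kernel (1 for nearest/lower, 2 for the linear family, 3 for the cubic family). The standalone sketch below restates that grouping for the 1-D case. `InterpMethod` and `interpOrder` are hypothetical stand-ins, not ArrayFire API, and returning 0 for an unhandled method is an assumption of this sketch (the diff falls through `default: break;`).

```cpp
// Hypothetical stand-in for af_interp_type; names mirror the 1-D cases above.
enum class InterpMethod {
    Nearest,
    Lower,
    Linear,
    LinearCosine,
    Cubic,
    CubicSpline,
    Unknown
};

// Returns the order value the dispatch passes to the kernel, or 0 (an
// assumption of this sketch) when the method is not handled.
constexpr int interpOrder(InterpMethod m) {
    switch (m) {
        case InterpMethod::Nearest:
        case InterpMethod::Lower: return 1;
        case InterpMethod::Linear:
        case InterpMethod::LinearCosine: return 2;
        case InterpMethod::Cubic:
        case InterpMethod::CubicSpline: return 3;
        default: return 0;
    }
}
```

Keeping the method-to-order mapping in one place like this would make it harder for the 1-D and 2-D switches to drift apart, though whether that fits the backend's structure is a judgment call for the maintainers.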
diff --git a/src/backend/opencl/approx.hpp b/src/backend/opencl/approx.hpp
index 4e515f6f64..5a2b7e3212 100644
--- a/src/backend/opencl/approx.hpp
+++ b/src/backend/opencl/approx.hpp
@@ -7,16 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename Ty, typename Tp>
-    Array<Ty> approx1(const Array<Ty> &in, const Array<Tp> &pos,
-                      const af_interp_type method, const float offGrid);
+namespace arrayfire {
+namespace opencl {
+template<typename Ty, typename Tp>
+void approx1(Array<Ty> &yo, const Array<Ty> &yi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const af_interp_type method, const float offGrid);
 
-    template<typename Ty, typename Tp>
-    Array<Ty> approx2(const Array<Ty> &in, const Array<Tp> &pos0, const Array<Tp> &pos1,
-                      const af_interp_type method, const float offGrid);
-}
+template<typename Ty, typename Tp>
+void approx2(Array<Ty> &zo, const Array<Ty> &zi, const Array<Tp> &xo,
+             const int xdim, const Tp &xi_beg, const Tp &xi_step,
+             const Array<Tp> &yo, const int ydim, const Tp &yi_beg,
+             const Tp &yi_step, const af_interp_type method,
+             const float offGrid);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/arith.hpp b/src/backend/opencl/arith.hpp
index 244b32d339..932a86d814 100644
--- a/src/backend/opencl/arith.hpp
+++ b/src/backend/opencl/arith.hpp
@@ -7,18 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
+#pragma once
+
 #include <Array.hpp>
+#include <common/jit/BinaryNode.hpp>
 #include <optypes.hpp>
-#include <binary.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &&lhs, const Array<T> &&rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
+}
 
-namespace opencl
-{
-    template<typename T, af_op_t op>
-    Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<T, T, op>(lhs, rhs, odims);
-    }
+template<typename T, af_op_t op>
+Array<T> arithOp(const Array<T> &lhs, const Array<T> &rhs,
+                 const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/assign.cpp b/src/backend/opencl/assign.cpp
index 15d579db0d..fbe0370dde 100644
--- a/src/backend/opencl/assign.cpp
+++ b/src/backend/opencl/assign.cpp
@@ -7,82 +7,109 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <handle.hpp>
 #include <assign.hpp>
 #include <kernel/assign.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
 #include <err_opencl.hpp>
+#include <handle.hpp>
 #include <memory.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+
+static std::mutex mtx;
+static std::map<std::pair<const cl::Context*, const int /* deviceId */>,
+                cl::Buffer*>
+    cachedEmptyBuffers;
 
 template<typename T>
-void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs)
-{
+void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs) {
     kernel::AssignKernelParam_t p;
     std::vector<af_seq> seqs(4, af_span);
     // create seq vector to retrieve output
     // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
-        if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
-        }
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
     }
 
     // retrieve dimensions, strides and offsets
-    dim4 dDims = out.dims();
+    const dim4& dDims = out.dims();
     // retrieve dimensions & strides for array
     // to which rhs is being copied to
-    dim4 dstOffs = toOffset(seqs, dDims);
-    dim4 dstStrds= toStride(seqs, dDims);
+    dim4 dstOffs  = toOffset(seqs, dDims);
+    dim4 dstStrds = toStride(seqs, dDims);
 
-    for (dim_t i=0; i<4; ++i) {
+    for (dim_t i = 0; i < 4; ++i) {
         p.isSeq[i] = idxrs[i].isSeq;
         p.offs[i]  = dstOffs[i];
         p.strds[i] = dstStrds[i];
     }
 
-    Buffer* bPtrs[4];
+    cl::Buffer* bPtrs[4];
+
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
+
+    // Prepare a shared placeholder buffer for sequence (non-array) indices.
+    // A buffer belongs to a context; the device id is also part of the cache
+    // key to avoid copying between devices.
+    cl::Buffer* emptyBuffer;
+    {
+        std::lock_guard<std::mutex> lck(mtx);
+        const auto dependent = std::make_pair<const cl::Context*, const int>(
+            &getContext(), getActiveDeviceId());
+        auto it = cachedEmptyBuffers.find(dependent);
+        if (it == cachedEmptyBuffers.end()) {
+            emptyBuffer = new cl::Buffer(
+                getContext(),
+                CL_MEM_READ_ONLY,  // NOLINT(hicpp-signed-bitwise)
+                sizeof(uint));
+            cachedEmptyBuffers[dependent] = emptyBuffer;
+        } else {
+            emptyBuffer = it->second;
+        }
+    }
 
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
     // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
+    for (dim_t x = 0; x < 4; ++x) {
         // set index pointers were applicable
         if (!p.isSeq[x]) {
             idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
-            bPtrs[x] = idxArrs[x].get();
-        }
-        else {
-            // alloc an 1-element buffer to avoid OpenCL from failing
-            bPtrs[x] = bufferAlloc(sizeof(uint));
+            bPtrs[x]   = idxArrs[x].get();
+        } else {
+            // Use a 1-element buffer to keep OpenCL from failing. It is
+            // allocated directly rather than via the memory manager to avoid
+            // reference count discrepancies between different backends.
+            bPtrs[x] = emptyBuffer;
         }
     }
 
     kernel::assign<T>(out, rhs, p, bPtrs);
-
-    for (dim_t x=0; x<4; ++x) {
-        if (p.isSeq[x]) bufferFree(bPtrs[x]);
-    }
 }
 
-#define INSTANTIATE(T) \
-    template void assign<T>(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
+#define INSTANTIATE(T)                                                \
+    template void assign<T>(Array<T> & out, const af_index_t idxrs[], \
+                            const Array<T>& rhs);
 
 INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
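
Note on the cached placeholder buffer above: `assign` now looks up one buffer per (context, device id) pair under a mutex instead of allocating and freeing a 1-element buffer on every call. The sketch below shows that lookup-or-create pattern in isolation. `Context`, `Buffer`, and `EmptyBufferCache` are hypothetical stand-ins for `cl::Context`, `cl::Buffer`, and the file-level statics. Unlike the diff, which stores raw `cl::Buffer*` created with `new`, this sketch uses `std::unique_ptr` so the cache owns its entries.

```cpp
#include <cstddef>
#include <map>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical stand-ins for cl::Context and cl::Buffer.
struct Context {};
struct Buffer {
    explicit Buffer(std::size_t bytes) : size(bytes) {}
    std::size_t size;
};

class EmptyBufferCache {
   public:
    // Returns the placeholder buffer for (ctx, deviceId), creating it on
    // first use. The lock makes concurrent first calls safe.
    Buffer* get(const Context* ctx, int deviceId) {
        std::lock_guard<std::mutex> lock(mtx_);
        const auto key = std::make_pair(ctx, deviceId);
        auto it        = cache_.find(key);
        if (it == cache_.end()) {
            it = cache_
                     .emplace(key, std::make_unique<Buffer>(sizeof(unsigned)))
                     .first;
        }
        return it->second.get();
    }

   private:
    std::mutex mtx_;
    std::map<std::pair<const Context*, int>, std::unique_ptr<Buffer>> cache_;
};
```

Repeated calls with the same context and device id return the same pointer, so the kernel launch can reuse it for every sequence index without touching the memory manager.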
diff --git a/src/backend/opencl/assign.hpp b/src/backend/opencl/assign.hpp
index b4f2db0340..6283ad8ceb 100644
--- a/src/backend/opencl/assign.hpp
+++ b/src/backend/opencl/assign.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/index.h>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 void assign(Array<T>& out, const af_index_t idxrs[], const Array<T>& rhs);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/backend.hpp b/src/backend/opencl/backend.hpp
index 527d379168..30392a7b9a 100644
--- a/src/backend/opencl/backend.hpp
+++ b/src/backend/opencl/backend.hpp
@@ -21,4 +21,4 @@
 
 #include "types.hpp"
 
-namespace detail = opencl;
+namespace detail = arrayfire::opencl;
diff --git a/src/backend/opencl/bilateral.cpp b/src/backend/opencl/bilateral.cpp
index 1cd54d973b..6475377e75 100644
--- a/src/backend/opencl/bilateral.cpp
+++ b/src/backend/opencl/bilateral.cpp
@@ -7,35 +7,37 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
 #include <bilateral.hpp>
 #include <kernel/bilateral.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma)
-{
-    Array<outType> out       = createEmptyArray<outType>(in.dims());
-    kernel::bilateral<inType, outType, isColor>(out, in, s_sigma, c_sigma);
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &sSigma,
+                         const float &cSigma) {
+    Array<outType> out = createEmptyArray<outType>(in.dims());
+    kernel::bilateral<inType, outType>(out, in, sSigma, cSigma);
     return out;
 }
 
-#define INSTANTIATE(inT, outT)\
-template Array<outT> bilateral<inT, outT,true >(const Array<inT> &in, const float &s_sigma, const float &c_sigma);\
-template Array<outT> bilateral<inT, outT,false>(const Array<inT> &in, const float &s_sigma, const float &c_sigma);
+#define INSTANTIATE(inT, outT)                                    \
+    template Array<outT> bilateral<inT, outT>(const Array<inT> &, \
+                                              const float &, const float &);
 
 INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/bilateral.hpp b/src/backend/opencl/bilateral.hpp
index d28e7b1249..05fd52c429 100644
--- a/src/backend/opencl/bilateral.hpp
+++ b/src/backend/opencl/bilateral.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
-
-template<typename inType, typename outType, bool isColor>
-Array<outType> bilateral(const Array<inType> &in, const float &s_sigma, const float &c_sigma);
-
-}
+namespace arrayfire {
+namespace opencl {
+template<typename inType, typename outType>
+Array<outType> bilateral(const Array<inType> &in, const float &spatialSigma,
+                         const float &chromaticSigma);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/binary.hpp b/src/backend/opencl/binary.hpp
index 4f58cb49e6..546c5bc085 100644
--- a/src/backend/opencl/binary.hpp
+++ b/src/backend/opencl/binary.hpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -8,52 +8,32 @@
  ********************************************************/
 
 #pragma once
-#include <af/dim4.hpp>
-#include <Array.hpp>
 #include <optypes.hpp>
-#include <math.hpp>
-#include <JIT/BinaryNode.hpp>
-
-namespace opencl
-{
-
-    template<typename To, typename Ti, af_op_t op>
-    struct BinOp
-    {
-        const char *name()
-        {
-            return "noop";
-        }
-    };
+#include <common/half.hpp>
+
+using arrayfire::common::half;
 
-#define BINARY_TYPE_1(fn)                      \
-    template<typename To, typename Ti>          \
-    struct BinOp<To, Ti, af_##fn##_t>           \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__"#fn;                     \
-        }                                       \
-    };                                          \
-                                                \
-    template<typename To>                       \
-    struct BinOp<To, cfloat, af_##fn##_t>       \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__c"#fn"f";                 \
-        }                                       \
-    };                                          \
-                                                \
-    template<typename To>                       \
-    struct BinOp<To, cdouble, af_##fn##_t>      \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__c"#fn;                    \
-        }                                       \
-    };                                          \
+namespace arrayfire {
+namespace opencl {
 
+template<typename To, typename Ti, af_op_t op>
+struct BinOp;
+
+#define BINARY_TYPE_1(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
 
 BINARY_TYPE_1(eq)
 BINARY_TYPE_1(neq)
@@ -75,116 +55,84 @@ BINARY_TYPE_1(bitshiftr)
 
 #undef BINARY_TYPE_1
 
-#define BINARY_TYPE_2(fn)                       \
-    template<typename To, typename Ti>          \
-    struct BinOp<To, Ti, af_##fn##_t>           \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__"#fn;                     \
-        }                                       \
-    };                                          \
-    template<typename To>                       \
-    struct BinOp<To, float, af_##fn##_t>        \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "f"#fn;                      \
-        }                                       \
-    };                                          \
-    template<typename To>                       \
-    struct BinOp<To, double, af_##fn##_t>       \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "f"#fn;                      \
-        }                                       \
-    };                                          \
-    template<typename To>                       \
-    struct BinOp<To, cfloat, af_##fn##_t>       \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__c"#fn"f";                 \
-        }                                       \
-    };                                          \
-                                                \
-    template<typename To>                       \
-    struct BinOp<To, cdouble, af_##fn##_t>      \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__c"#fn;                    \
-        }                                       \
-    };                                          \
-
+#define BINARY_TYPE_2(fn)                            \
+    template<typename To, typename Ti>               \
+    struct BinOp<To, Ti, af_##fn##_t> {              \
+        const char *name() { return "__" #fn; }      \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, float, af_##fn##_t> {           \
+        const char *name() { return "f" #fn; }       \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, double, af_##fn##_t> {          \
+        const char *name() { return "f" #fn; }       \
+    };                                               \
+    template<typename To>                            \
+    struct BinOp<To, cfloat, af_##fn##_t> {          \
+        const char *name() { return "__c" #fn "f"; } \
+    };                                               \
+                                                     \
+    template<typename To>                            \
+    struct BinOp<To, cdouble, af_##fn##_t> {         \
+        const char *name() { return "__c" #fn; }     \
+    };
 
 BINARY_TYPE_2(min)
 BINARY_TYPE_2(max)
-BINARY_TYPE_2(pow)
 BINARY_TYPE_2(rem)
 BINARY_TYPE_2(mod)
 
+template<>
+struct BinOp<common::half, common::half, af_mod_t> {
+    const char *name() { return "fmod"; }
+};
+
+template<typename To, typename Ti>
+struct BinOp<To, Ti, af_pow_t> {
+    const char *name() { return "__pow"; }
+};
+
+#define POW_BINARY_OP(INTYPE, OPNAME)         \
+    template<typename To>                     \
+    struct BinOp<To, INTYPE, af_pow_t> {      \
+        const char *name() { return OPNAME; } \
+    };
+
+POW_BINARY_OP(double, "pow")
+POW_BINARY_OP(float, "pow")
+POW_BINARY_OP(half, "pow")
+POW_BINARY_OP(intl, "__powll")
+POW_BINARY_OP(uintl, "__powul")
+POW_BINARY_OP(uint, "__powui")
+POW_BINARY_OP(int, "__powsi")
+
+#undef POW_BINARY_OP
+
 template<typename Ti>
-struct BinOp<cfloat, Ti, af_cplx2_t>
-{
-    const char *name()
-    {
-        return "__cplx2f";
-    }
+struct BinOp<cfloat, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2f"; }
 };
 
 template<typename Ti>
-struct BinOp<cdouble, Ti, af_cplx2_t>
-{
-    const char *name()
-    {
-        return "__cplx2";
-    }
+struct BinOp<cdouble, Ti, af_cplx2_t> {
+    const char *name() { return "__cplx2"; }
 };
 
 template<typename To, typename Ti>
-struct BinOp<To, Ti, af_cplx2_t>
-{
-    const char *name()
-    {
-        return "noop";
-    }
+struct BinOp<To, Ti, af_cplx2_t> {
+    const char *name() { return "noop"; }
 };
 
 template<typename To, typename Ti>
-struct BinOp<To, Ti, af_atan2_t>
-{
-    const char *name()
-    {
-        return "atan2";
-    }
+struct BinOp<To, Ti, af_atan2_t> {
+    const char *name() { return "atan2"; }
 };
 
 template<typename To, typename Ti>
-struct BinOp<To, Ti, af_hypot_t>
-{
-    const char *name()
-    {
-        return "hypot";
-    }
+struct BinOp<To, Ti, af_hypot_t> {
+    const char *name() { return "hypot"; }
 };
 
-template<typename To, typename Ti, af_op_t op>
-Array<To> createBinaryNode(const Array<Ti> &lhs, const Array<Ti> &rhs, const af::dim4 &odims)
-{
-    BinOp<To, Ti, op> bop;
-
-    JIT::Node_ptr lhs_node = lhs.getNode();
-    JIT::Node_ptr rhs_node = rhs.getNode();
-    JIT::BinaryNode *node = new JIT::BinaryNode(dtype_traits<To>::getName(),
-                                                shortname<To>(true),
-                                                bop.name(),
-                                                lhs_node,
-                                                rhs_node, (int)(op));
-
-    return createNodeArray<To>(odims, JIT::Node_ptr(
-                                   reinterpret_cast<JIT::Node *>(node)));
-}
-
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/blas.cpp b/src/backend/opencl/blas.cpp
index 6777fd22f1..8010fe555d 100644
--- a/src/backend/opencl/blas.cpp
+++ b/src/backend/opencl/blas.cpp
@@ -8,170 +8,170 @@
  ********************************************************/
 
 #include <blas.hpp>
-#include <af/blas.h>
+
 #include <Array.hpp>
-#include <cassert>
-#include <string>
-#include <functional>
-#include <stdexcept>
-#include <mutex>
-#include <err_common.hpp>
-#include <err_clblas.hpp>
+#include <arith.hpp>
+#include <common/half.hpp>
+#include <common/traits.hpp>
+#include <complex.hpp>
+#include <err_opencl.hpp>
 #include <math.hpp>
+#include <reduce.hpp>
+#include <transpose.hpp>
+
+#include <complex>
+#include <vector>
+
+// Includes one of the supported OpenCL BLAS back-ends (e.g. clBLAS, CLBlast)
+#include <cpu/cpu_blas.hpp>
+#include <magma/magma_blas.h>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
 
-namespace opencl
-{
-
-using std::is_floating_point;
-using std::enable_if;
-using std::once_flag;
-using std::call_once;
-using std::runtime_error;
-using std::to_string;
-
-clblasTranspose
-toClblasTranspose(af_mat_prop opt)
-{
-    clblasTranspose out = clblasNoTrans;
-    switch(opt) {
-        case AF_MAT_NONE        : out = clblasNoTrans;   break;
-        case AF_MAT_TRANS           : out = clblasTrans;     break;
-        case AF_MAT_CTRANS : out = clblasConjTrans; break;
-        default                     : AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+void initBlas() { gpu_blas_init(); }
+
+void deInitBlas() { gpu_blas_deinit(); }
+
+// Converts an af_mat_prop option to a transpose type for one of the OpenCL
+// BLAS back-ends
+OPENCL_BLAS_TRANS_T
+toBlasTranspose(af_mat_prop opt) {
+    switch (opt) {
+        case AF_MAT_NONE: return OPENCL_BLAS_NO_TRANS;
+        case AF_MAT_TRANS: return OPENCL_BLAS_TRANS;
+        case AF_MAT_CTRANS: return OPENCL_BLAS_CONJ_TRANS;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
     }
-    return out;
 }
 
-#define BLAS_FUNC_DEF(NAME)                                             \
-template<typename T>                                                    \
-struct NAME##_func;
-
-#define BLAS_FUNC(NAME, TYPE, PREFIX)                                   \
-template<>                                                              \
-struct NAME##_func<TYPE>                                                \
-{                                                                       \
-    template<typename... Args>                                          \
-    clblasStatus                                                        \
-    operator() (Args... args) { return clblas##PREFIX##NAME(args...); } \
-};
-
-BLAS_FUNC_DEF(gemm)
-BLAS_FUNC(gemm, float,      S)
-BLAS_FUNC(gemm, double,     D)
-BLAS_FUNC(gemm, cfloat,     C)
-BLAS_FUNC(gemm, cdouble,    Z)
-
-BLAS_FUNC_DEF(gemv)
-BLAS_FUNC(gemv, float,      S)
-BLAS_FUNC(gemv, double,     D)
-BLAS_FUNC(gemv, cfloat,     C)
-BLAS_FUNC(gemv, cdouble,    Z)
-
-BLAS_FUNC_DEF( dot )
-BLAS_FUNC(dot, float,       S)
-BLAS_FUNC(dot, double,      D)
-
-#undef BLAS_FUNC_DEF
-#undef BLAS_FUNC
-
 template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    initBlas();
-    clblasTranspose lOpts = toClblasTranspose(optLhs);
-    clblasTranspose rOpts = toClblasTranspose(optRhs);
-
-    int aRowDim = (lOpts == clblasNoTrans) ? 0 : 1;
-    int aColDim = (lOpts == clblasNoTrans) ? 1 : 0;
-    int bColDim = (rOpts == clblasNoTrans) ? 1 : 0;
-
-    dim4 lDims = lhs.dims();
-    dim4 rDims = rhs.dims();
-    int M = lDims[aRowDim];
-    int N = rDims[bColDim];
-    int K = lDims[aColDim];
-
-    //FIXME: Leaks on errors.
-    Array<T> out = createEmptyArray<T>(af::dim4(M, N, 1, 1));
-    auto alpha = scalar<T>(1);
-    auto beta  = scalar<T>(0);
-
-    dim4 lStrides = lhs.strides();
-    dim4 rStrides = rhs.strides();
-    cl::Event event;
-    if(rDims[bColDim] == 1) {
-        N = lDims[aColDim];
-        gemv_func<T> gemv;
-        CLBLAS_CHECK(
-            gemv(
-                clblasColumnMajor, lOpts,
-                lDims[0], lDims[1],
-                alpha,
-                (*lhs.get())(),    lhs.getOffset(),   lStrides[1],
-                (*rhs.get())(),    rhs.getOffset(),   rStrides[0],
-                beta ,
-                (*out.get())(),   out.getOffset(),             1,
-                1, &getQueue()(), 0, nullptr, &event())
-            );
-    } else {
-        gemm_func<T> gemm;
-        CLBLAS_CHECK(
-            gemm(
-                clblasColumnMajor, lOpts, rOpts,
-                M, N, K,
-                alpha,
-                (*lhs.get())(),    lhs.getOffset(),   lStrides[1],
-                (*rhs.get())(),    rhs.getOffset(),   rStrides[1],
-                beta,
-                (*out.get())(),   out.getOffset(),  out.dims()[0],
-                1, &getQueue()(), 0, nullptr, &event())
-            );
+void gemm_fallback(Array<T> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+                   const T *alpha, const Array<T> &lhs, const Array<T> &rhs,
+                   const T *beta) {
+    cpu::gemm(out, optLhs, optRhs, alpha, lhs, rhs, beta);
+}
 
+template<>
+void gemm_fallback<half>(Array<half> & /*out*/, af_mat_prop /*optLhs*/,
+                         af_mat_prop /*optRhs*/, const half * /*alpha*/,
+                         const Array<half> & /*lhs*/,
+                         const Array<half> & /*rhs*/, const half * /*beta*/) {
+    assert(false && "CPU fallback not implemented for f16");
+}
+
+template<typename Ti, typename To>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta) {
+#if defined(WITH_LINEAR_ALGEBRA)
+    // Do not force offload gemm on OSX Intel devices
+    if (OpenCLCPUOffload(false) &&
+        static_cast<af_dtype>(dtype_traits<Ti>::af_type) != f16) {
+        gemm_fallback(out, optLhs, optRhs, alpha, lhs, rhs, beta);
+        return;
+    }
+#endif
+    const auto lOpts = toBlasTranspose(optLhs);
+    const auto rOpts = toBlasTranspose(optRhs);
+
+    const auto aRowDim = (lOpts == OPENCL_BLAS_NO_TRANS) ? 0 : 1;
+    const auto aColDim = (lOpts == OPENCL_BLAS_NO_TRANS) ? 1 : 0;
+    const auto bColDim = (rOpts == OPENCL_BLAS_NO_TRANS) ? 1 : 0;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+    const int M       = lDims[aRowDim];
+    const int N       = rDims[bColDim];
+    const int K       = lDims[aColDim];
+    const dim4 oDims  = out.dims();
+
+    const dim4 &lStrides = lhs.strides();
+    const dim4 &rStrides = rhs.strides();
+    const dim4 oStrides  = out.strides();
+
+    int batchSize = static_cast<int>(oDims[2] * oDims[3]);
+
+    bool is_l_d2_batched = oDims[2] == lDims[2];
+    bool is_l_d3_batched = oDims[3] == lDims[3];
+    bool is_r_d2_batched = oDims[2] == rDims[2];
+    bool is_r_d3_batched = oDims[3] == rDims[3];
+
+    for (int n = 0; n < batchSize; n++) {
+        int w = static_cast<int>(n / oDims[2]);
+        int z = static_cast<int>(n - w * oDims[2]);
+
+        int loff = z * (is_l_d2_batched * lStrides[2]) +
+                   w * (is_l_d3_batched * lStrides[3]);
+        int roff = z * (is_r_d2_batched * rStrides[2]) +
+                   w * (is_r_d3_batched * rStrides[3]);
+
+        dim_t lOffset = lhs.getOffset() + loff;
+        dim_t rOffset = rhs.getOffset() + roff;
+        dim_t oOffset = out.getOffset() + z * oStrides[2] + w * oStrides[3];
+
+        cl::Event event;
+        if (rDims[bColDim] == 1) {
+            dim_t incr = (optRhs == AF_MAT_NONE) ? rStrides[0] : rStrides[1];
+            gpu_blas_gemv_func<Ti> gemv;
+            OPENCL_BLAS_CHECK(gemv(lOpts, lDims[0], lDims[1], *alpha,
+                                   (*lhs.get())(), lOffset, lStrides[1],
+                                   (*rhs.get())(), rOffset, incr, *beta,
+                                   (*out.get())(), oOffset, oStrides[0], 1,
+                                   &getQueue()(), 0, nullptr, &event()));
+        } else {
+            gpu_blas_gemm_func<Ti> gemm;
+            OPENCL_BLAS_CHECK(gemm(lOpts, rOpts, M, N, K, *alpha,
+                                   (*lhs.get())(), lOffset, lStrides[1],
+                                   (*rhs.get())(), rOffset, rStrides[1], *beta,
+                                   (*out.get())(), oOffset, oStrides[1], 1,
+                                   &getQueue()(), 0, nullptr, &event()));
+        }
     }
+}
 
-    return out;
+template<>
+void gemm<schar, float>(Array<float> &out, af_mat_prop optLhs,
+                        af_mat_prop optRhs, const float *alpha,
+                        const Array<schar> &lhs, const Array<schar> &rhs,
+                        const float *beta) {
+    TYPE_ERROR(3, af_dtype::s8);
 }
 
 template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs)
-{
-    initBlas();
-
-    int N = lhs.dims()[0];
-    dot_func<T> dot;
-    cl::Event event;
-    auto out = createEmptyArray<T>(af::dim4(1));
-    cl::Buffer scratch(getContext(), CL_MEM_READ_WRITE, sizeof(T) * N);
-    CLBLAS_CHECK(
-        dot(N,
-            (*out.get())(), out.getOffset(),
-            (*lhs.get())(),  lhs.getOffset(), lhs.strides()[0],
-            (*rhs.get())(),  rhs.getOffset(), rhs.strides()[0],
-            scratch(),
-            1, &getQueue()(), 0, nullptr, &event())
-        );
-    return out;
-}
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs) {
+    const Array<T> lhs_ = (optLhs == AF_MAT_NONE ? lhs : conj<T>(lhs));
+    const Array<T> rhs_ = (optRhs == AF_MAT_NONE ? rhs : conj<T>(rhs));
 
-#define INSTANTIATE_BLAS(TYPE)                                                          \
-    template Array<TYPE> matmul<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs,  \
-                    af_mat_prop optLhs, af_mat_prop optRhs);
+    const Array<T> temp = arithOp<T, af_mul_t>(lhs_, rhs_, lhs_.dims());
+    return reduce<af_add_t, T, T>(temp, 0, false, 0);
+}
 
-INSTANTIATE_BLAS(float)
-INSTANTIATE_BLAS(cfloat)
-INSTANTIATE_BLAS(double)
-INSTANTIATE_BLAS(cdouble)
+#define INSTANTIATE_GEMM(TYPE)                                               \
+    template void gemm<TYPE>(Array<TYPE> & out, af_mat_prop optLhs,          \
+                             af_mat_prop optRhs, const TYPE *alpha,          \
+                             const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
+                             const TYPE *beta);
 
-#define INSTANTIATE_DOT(TYPE)                                                       \
-    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
-                                   af_mat_prop optLhs, af_mat_prop optRhs);
+INSTANTIATE_GEMM(float)
+INSTANTIATE_GEMM(cfloat)
+INSTANTIATE_GEMM(double)
+INSTANTIATE_GEMM(cdouble)
+INSTANTIATE_GEMM(half)
 
-template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-              af_mat_prop optLhs, af_mat_prop optRhs);
+#define INSTANTIATE_DOT(TYPE)                                                  \
+    template Array<TYPE> dot<TYPE>(const Array<TYPE> &lhs,                     \
+                                   const Array<TYPE> &rhs, af_mat_prop optLhs, \
+                                   af_mat_prop optRhs);
 
 INSTANTIATE_DOT(float)
 INSTANTIATE_DOT(double)
-}
+INSTANTIATE_DOT(cfloat)
+INSTANTIATE_DOT(cdouble)
+INSTANTIATE_DOT(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
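
Note on the batched loop in `gemm` above: each linear batch index `n` is split into a coordinate `z` along dimension 2 and `w` along dimension 3, and an operand only advances along a dimension when its extent matches the output's. Otherwise the stride is multiplied by `false` and the same matrix is reused for every batch. The following is a minimal host-only sketch of that arithmetic, not the backend code itself. `BatchInfo` and `batchOffset` are hypothetical names introduced here for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for one operand's extents and strides in dims 2 and 3.
struct BatchInfo {
    std::int64_t dim2, dim3;        // extents of dimensions 2 and 3
    std::int64_t stride2, stride3;  // element strides of dimensions 2 and 3
};

// Element offset of the matrix that operand `in` contributes to batch `n`
// of an output whose dimensions 2 and 3 have extents (oDim2, oDim3).
std::int64_t batchOffset(std::int64_t n, std::int64_t oDim2,
                         std::int64_t oDim3, const BatchInfo &in) {
    // Split the linear batch index into coordinates along dims 2 and 3.
    const std::int64_t w = n / oDim2;
    const std::int64_t z = n - w * oDim2;

    // The operand is batched along a dimension only when its extent matches
    // the output's; otherwise it is broadcast and contributes no offset.
    const bool d2Batched = (in.dim2 == oDim2);
    const bool d3Batched = (in.dim3 == oDim3);

    return z * (d2Batched ? in.stride2 : 0) + w * (d3Batched ? in.stride3 : 0);
}
```

For an output with a 3 x 2 batch, an operand with `dim3 == 1` gets the same offset for `n = 1` and `n = 4`, because only `z` contributes.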
diff --git a/src/backend/opencl/blas.hpp b/src/backend/opencl/blas.hpp
index 2d7a89a46e..fc4571d4b5 100644
--- a/src/backend/opencl/blas.hpp
+++ b/src/backend/opencl/blas.hpp
@@ -8,25 +8,38 @@
  ********************************************************/
 
 #pragma once
-#include <af/defines.h>
-#include <af/blas.h>
 #include <Array.hpp>
-#include <clBLAS.h>
-#include <mutex>
 
-namespace opencl
-{
+// This file contains the common interface for OpenCL BLAS
+// functions. They can be implemented in different back-ends,
+// such as CLBlast or clBLAS.
 
-template<typename T>
-Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs,
-                af_mat_prop optLhs, af_mat_prop optRhs);
-template<typename T>
-Array<T> dot(const Array<T> &lhs, const Array<T> &rhs,
-             af_mat_prop optLhs, af_mat_prop optRhs);
+namespace arrayfire {
+namespace opencl {
 
-STATIC_ void
-initBlas() {
-    static std::once_flag clblasSetupFlag;
-    call_once(clblasSetupFlag, clblasSetup);
-}
+void initBlas();
+void deInitBlas();
+
+template<typename Ti, typename To = Ti>
+void gemm(Array<To> &out, af_mat_prop optLhs, af_mat_prop optRhs,
+          const To *alpha, const Array<Ti> &lhs, const Array<Ti> &rhs,
+          const To *beta);
+
+template<typename T>
+Array<T> matmul(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+                af_mat_prop optRhs) {
+    int Mdim     = optLhs == AF_MAT_NONE ? 0 : 1;
+    int Ndim     = optRhs == AF_MAT_NONE ? 1 : 0;
+    Array<T> res = createEmptyArray<T>(
+        dim4(lhs.dims()[Mdim], rhs.dims()[Ndim], lhs.dims()[2], lhs.dims()[3]));
+    static const T alpha = T(1.0);
+    static const T beta  = T(0.0);
+    gemm(res, optLhs, optRhs, &alpha, lhs, rhs, &beta);
+    return res;
 }
+
+template<typename T>
+Array<T> dot(const Array<T> &lhs, const Array<T> &rhs, af_mat_prop optLhs,
+             af_mat_prop optRhs);
+}  // namespace opencl
+}  // namespace arrayfire
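
Note on `matmul` in the header above: it is now a thin wrapper that sizes the result from the transpose options and calls `gemm` with `alpha = 1` and `beta = 0`. The sketch below reproduces that relationship on the host with a naive column-major reference GEMM. It assumes real `float` data and a single matrix (no batching). `Mat`, `gemmRef` and `matmulRef` are hypothetical names that do not exist in ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major matrix helper (zero-initialised).
struct Mat {
    int rows, cols;
    std::vector<float> data;
    Mat(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, 0.0f) {}
    float &at(int i, int j) {
        return data[i + static_cast<std::size_t>(j) * rows];
    }
    float at(int i, int j) const {
        return data[i + static_cast<std::size_t>(j) * rows];
    }
};

// Reference GEMM: out = alpha * op(lhs) * op(rhs) + beta * out,
// where op(X) is X or its transpose depending on the flag.
void gemmRef(Mat &out, bool transLhs, bool transRhs, float alpha,
             const Mat &lhs, const Mat &rhs, float beta) {
    const int M = transLhs ? lhs.cols : lhs.rows;
    const int K = transLhs ? lhs.rows : lhs.cols;
    const int N = transRhs ? rhs.rows : rhs.cols;
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < M; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                const float a = transLhs ? lhs.at(k, i) : lhs.at(i, k);
                const float b = transRhs ? rhs.at(j, k) : rhs.at(k, j);
                acc += a * b;
            }
            out.at(i, j) = alpha * acc + beta * out.at(i, j);
        }
    }
}

// matmul as a special case of GEMM: alpha = 1, beta = 0.
Mat matmulRef(const Mat &lhs, const Mat &rhs, bool transLhs, bool transRhs) {
    Mat res(transLhs ? lhs.cols : lhs.rows, transRhs ? rhs.rows : rhs.cols);
    gemmRef(res, transLhs, transRhs, 1.0f, lhs, rhs, 0.0f);
    return res;
}
```

Exposing `alpha` and `beta` on `gemm` lets callers accumulate into an existing output, while `matmul` keeps its original signature.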
diff --git a/src/backend/opencl/canny.cpp b/src/backend/opencl/canny.cpp
new file mode 100644
index 0000000000..cf4965fd5c
--- /dev/null
+++ b/src/backend/opencl/canny.cpp
@@ -0,0 +1,38 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <canny.hpp>
+#include <err_opencl.hpp>
+#include <kernel/canny.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace opencl {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy) {
+    Array<float> out = createValueArray<float>(mag.dims(), 0);
+
+    kernel::nonMaxSuppression<float>(out, mag, gx, gy);
+
+    return out;
+}
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak) {
+    Array<char> out = createValueArray<char>(strong.dims(), 0);
+
+    kernel::edgeTrackingHysteresis<char>(out, strong, weak);
+
+    return out;
+}
+}  // namespace opencl
+}  // namespace arrayfire
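
Note on `edgeTrackingByHysteresis` above: the wrapper only allocates a zeroed output and forwards to the kernel, so the algorithm is not visible in this file. For orientation, here is a generic CPU sketch of hysteresis edge tracking as commonly described for Canny. It is not the OpenCL kernel. It assumes a row-major image and 8-connectivity, and `hysteresisRef` is a hypothetical name.

```cpp
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Generic hysteresis sketch: keep every strong pixel, plus every weak pixel
// reachable from a strong one through 8-connected weak pixels.
std::vector<char> hysteresisRef(const std::vector<char> &strong,
                                const std::vector<char> &weak, int rows,
                                int cols) {
    std::vector<char> out(strong.size(), 0);
    std::queue<std::pair<int, int>> frontier;

    // Seed the search with every strong pixel.
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (strong[r * cols + c]) {
                out[r * cols + c] = 1;
                frontier.push(std::make_pair(r, c));
            }
        }
    }

    // Breadth-first growth into neighbouring weak pixels.
    while (!frontier.empty()) {
        const std::pair<int, int> p = frontier.front();
        frontier.pop();
        for (int dr = -1; dr <= 1; ++dr) {
            for (int dc = -1; dc <= 1; ++dc) {
                const int nr = p.first + dr;
                const int nc = p.second + dc;
                if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                const int idx = nr * cols + nc;
                if (weak[idx] && !out[idx]) {
                    out[idx] = 1;
                    frontier.push(std::make_pair(nr, nc));
                }
            }
        }
    }
    return out;
}
```

A weak pixel with no path to a strong pixel stays zero in the output, which matches the zero-filled array the wrapper creates before calling the kernel.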
diff --git a/src/backend/opencl/canny.hpp b/src/backend/opencl/canny.hpp
new file mode 100644
index 0000000000..e7ad6dda0d
--- /dev/null
+++ b/src/backend/opencl/canny.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+Array<float> nonMaximumSuppression(const Array<float>& mag,
+                                   const Array<float>& gx,
+                                   const Array<float>& gy);
+
+Array<char> edgeTrackingByHysteresis(const Array<char>& strong,
+                                     const Array<char>& weak);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cast.hpp b/src/backend/opencl/cast.hpp
index f99a86d38d..cef1d76c0e 100644
--- a/src/backend/opencl/cast.hpp
+++ b/src/backend/opencl/cast.hpp
@@ -8,37 +8,28 @@
  ********************************************************/
 
 #pragma once
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <complex>
+#include <Array.hpp>
+#include <common/jit/UnaryNode.hpp>
 #include <err_opencl.hpp>
 #include <math.hpp>
 #include <optypes.hpp>
-#include <JIT/UnaryNode.hpp>
+#include <traits.hpp>
 #include <types.hpp>
-#include <Array.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename To, typename Ti>
-struct CastOp
-{
-    const char *name()
-    {
-        return "";
-    }
+struct CastOp {
+    const char *name() { return ""; }
 };
 
-#define CAST_FN(TYPE)                           \
-    template<typename Ti>                       \
-    struct CastOp<TYPE, Ti>                     \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "convert_"#TYPE;             \
-        }                                       \
+#define CAST_FN(TYPE)                                   \
+    template<typename Ti>                               \
+    struct CastOp<TYPE, Ti> {                           \
+        const char *name() { return "convert_" #TYPE; } \
     };
 
 CAST_FN(int)
@@ -47,92 +38,43 @@ CAST_FN(uchar)
 CAST_FN(float)
 CAST_FN(double)
 
-#define CAST_CFN(TYPE)                          \
-    template<typename Ti>                       \
-    struct CastOp<TYPE, Ti>                     \
-    {                                           \
-        const char *name()                      \
-        {                                       \
-            return "__convert_"#TYPE;           \
-        }                                       \
-    };
+template<typename Ti>
+struct CastOp<schar, Ti> {
+    const char *name() { return "convert_char"; }
+};
 
+#define CAST_CFN(TYPE)                                    \
+    template<typename Ti>                                 \
+    struct CastOp<TYPE, Ti> {                             \
+        const char *name() { return "__convert_" #TYPE; } \
+    };
 
 CAST_CFN(cfloat)
 CAST_CFN(cdouble)
 CAST_CFN(char)
 
-
 template<>
-struct CastOp<cfloat, cdouble>
-{
-    const char *name()
-    {
-        return "__convert_z2c";
-    }
+struct CastOp<cfloat, cdouble> {
+    const char *name() { return "__convert_z2c"; }
 };
 
-
 template<>
-struct CastOp<cdouble, cfloat>
-{
-    const char *name()
-    {
-        return "__convert_c2z";
-    }
+struct CastOp<cdouble, cfloat> {
+    const char *name() { return "__convert_c2z"; }
 };
 
 template<>
-struct CastOp<cfloat, cfloat>
-{
-    const char *name()
-    {
-        return "__convert_c2c";
-    }
+struct CastOp<cfloat, cfloat> {
+    const char *name() { return "__convert_c2c"; }
 };
 
-
 template<>
-struct CastOp<cdouble, cdouble>
-{
-    const char *name()
-    {
-        return "__convert_z2z";
-    }
+struct CastOp<cdouble, cdouble> {
+    const char *name() { return "__convert_z2z"; }
 };
 
 #undef CAST_FN
 #undef CAST_CFN
 
-template<typename To, typename Ti>
-struct CastWrapper
-{
-    Array<To> operator()(const Array<Ti> &in)
-    {
-        CastOp<To, Ti> cop;
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<To>::getName(),
-                                                  shortname<To>(true),
-                                                  cop.name(),
-                                                  in_node, af_cast_t);
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
-};
-
-template<typename T>
-struct CastWrapper<T, T>
-{
-    Array<T> operator()(const Array<T> &in)
-    {
-        return in;
-    }
-};
-
-template<typename To, typename Ti>
-Array<To> cast(const Array<Ti> &in)
-{
-    CastWrapper<To, Ti> cast_op;
-    return cast_op(in);
-}
-
-}
+}  // namespace opencl
+}  // namespace arrayfire
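
Note on `CastOp` in the header above: the name of the OpenCL conversion function is selected at compile time. A primary template returns an empty name, macro-generated partial specialisations key on the destination type, and full specialisations handle the complex-to-complex pairs. The sketch below reduces this to a standalone form. The `cfloat` and `cdouble` structs are hypothetical stand-ins for the backend's typedefs, and only a subset of the types is shown.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the backend's complex typedefs.
struct cfloat { float re, im; };
struct cdouble { double re, im; };

// Primary template: no conversion function name by default.
template<typename To, typename Ti>
struct CastOp {
    const char *name() { return ""; }
};

// Partial specialisation on the destination type; #TYPE stringifies it and
// adjacent string literals are concatenated by the compiler.
#define CAST_FN(TYPE)                                   \
    template<typename Ti>                               \
    struct CastOp<TYPE, Ti> {                           \
        const char *name() { return "convert_" #TYPE; } \
    };

CAST_FN(int)
CAST_FN(float)

// Same pattern with a different prefix for the custom conversions.
#define CAST_CFN(TYPE)                                    \
    template<typename Ti>                                 \
    struct CastOp<TYPE, Ti> {                             \
        const char *name() { return "__convert_" #TYPE; } \
    };

CAST_CFN(cfloat)

// A full specialisation takes precedence over the partial one above.
template<>
struct CastOp<cfloat, cdouble> {
    const char *name() { return "__convert_z2c"; }
};

#undef CAST_FN
#undef CAST_CFN
```

With these definitions, `CastOp<float, int>().name()` yields `"convert_float"`, while `CastOp<cfloat, cdouble>().name()` picks the full specialisation and yields `"__convert_z2c"`.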
diff --git a/src/backend/opencl/cholesky.cpp b/src/backend/opencl/cholesky.cpp
index 78fe999645..4d140ba099 100644
--- a/src/backend/opencl/cholesky.cpp
+++ b/src/backend/opencl/cholesky.cpp
@@ -7,98 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <blas.hpp>
 #include <cholesky.hpp>
 #include <copy.hpp>
-#include <err_common.hpp>
-#include <blas.hpp>
 #include <err_opencl.hpp>
 
-#if defined(WITH_OPENCL_LINEAR_ALGEBRA)
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <cpu/cpu_cholesky.hpp>
 #include <magma/magma.h>
 #include <triangle.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
-    try {
-        initBlas();
-
-        dim4 iDims = in.dims();
-        int N = iDims[0];
-
-        magma_uplo_t uplo = is_upper ? MagmaUpper : MagmaLower;
-
-        int info = 0;
-        cl::Buffer *in_buf = in.get();
-        magma_potrf_gpu<T>(uplo, N,
-                           (*in_buf)(), in.getOffset(),  in.strides()[1],
-                           getQueue()(), &info);
-        return info;
-    } catch (cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
+    if (OpenCLCPUOffload()) { return cpu::cholesky_inplace(in, is_upper); }
+
+    dim4 iDims = in.dims();
+    int N      = iDims[0];
+
+    magma_uplo_t uplo = is_upper ? MagmaUpper : MagmaLower;
+
+    int info           = 0;
+    cl::Buffer *in_buf = in.get();
+    magma_potrf_gpu<T>(uplo, N, (*in_buf)(), in.getOffset(), in.strides()[1],
+                       getQueue()(), &info);
+    return info;
 }
 
 template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
-    try {
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
+    if (OpenCLCPUOffload()) { return cpu::cholesky(info, in, is_upper); }
 
-        Array<T> out = copyArray<T>(in);
-        *info = cholesky_inplace(out, is_upper);
+    Array<T> out = copyArray<T>(in);
+    *info        = cholesky_inplace(out, is_upper);
 
-        if (is_upper) triangle<T, true , false>(out, out);
-        else          triangle<T, false, false>(out, out);
+    triangle<T>(out, out, is_upper, false);
 
-        return out;
-
-    } catch (cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+    return out;
 }
 
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);   \
-
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
 
 INSTANTIATE_CH(float)
 INSTANTIATE_CH(cfloat)
 INSTANTIATE_CH(double)
 INSTANTIATE_CH(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper)
-{
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-int cholesky_inplace(Array<T> &in, const bool is_upper)
-{
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_CH(T)                                                                   \
-    template int cholesky_inplace<T>(Array<T> &in, const bool is_upper);                    \
-    template Array<T> cholesky<T>   (int *info, const Array<T> &in, const bool is_upper);   \
-
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
 
 INSTANTIATE_CH(float)
 INSTANTIATE_CH(cfloat)
 INSTANTIATE_CH(double)
 INSTANTIATE_CH(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
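
Note on `cholesky` above: it copies the input, factorises the copy in place, then calls `triangle` to zero the half of the matrix that does not belong to the factor. The following CPU sketch shows that sequence for the lower-triangular case only. It assumes a dense column-major matrix with leading dimension `n` and follows the LAPACK `potrf` convention for `info`. `choleskyLowerInplace` and `choleskyLower` are hypothetical names, and nothing here calls MAGMA.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// In-place lower Cholesky of a column-major n x n matrix.
// Returns 0 on success, or k > 0 if the leading minor of order k is not
// positive definite (the LAPACK potrf convention).
int choleskyLowerInplace(std::vector<double> &a, int n) {
    for (int j = 0; j < n; ++j) {
        // Diagonal entry: subtract the squares of the row computed so far.
        double d = a[j + j * n];
        for (int k = 0; k < j; ++k) d -= a[j + k * n] * a[j + k * n];
        if (d <= 0.0) return j + 1;
        const double ljj = std::sqrt(d);
        a[j + j * n] = ljj;

        // Entries below the diagonal in column j.
        for (int i = j + 1; i < n; ++i) {
            double s = a[i + j * n];
            for (int k = 0; k < j; ++k) s -= a[i + k * n] * a[j + k * n];
            a[i + j * n] = s / ljj;
        }
    }
    return 0;
}

// Out-of-place variant: copy, factorise, then zero the strictly upper
// triangle, which the in-place routine leaves untouched.
std::vector<double> choleskyLower(int *info, const std::vector<double> &in,
                                  int n) {
    std::vector<double> out = in;
    *info = choleskyLowerInplace(out, n);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < j; ++i) out[i + j * n] = 0.0;
    }
    return out;
}
```

As in the patched code, the out-of-place function reports failure through `info` rather than by throwing, so callers are expected to check it before using the result.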
diff --git a/src/backend/opencl/cholesky.hpp b/src/backend/opencl/cholesky.hpp
index ff973a2df4..be1805bc96 100644
--- a/src/backend/opencl/cholesky.hpp
+++ b/src/backend/opencl/cholesky.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
 
-    template<typename T>
-    int cholesky_inplace(Array<T> &in, const bool is_upper);
-}
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cl.hpp b/src/backend/opencl/cl.hpp
deleted file mode 100644
index bd7caf5003..0000000000
--- a/src/backend/opencl/cl.hpp
+++ /dev/null
@@ -1,12452 +0,0 @@
-/*******************************************************************************
- * Copyright (c) 2008-2013 The Khronos Group Inc.
- *
- * Permission is hereby granted, free of charge, to any person obtaining a
- * copy of this software and/or associated documentation files (the
- * "Materials"), to deal in the Materials without restriction, including
- * without limitation the rights to use, copy, modify, merge, publish,
- * distribute, sublicense, and/or sell copies of the Materials, and to
- * permit persons to whom the Materials are furnished to do so, subject to
- * the following conditions:
- *
- * The above copyright notice and this permission notice shall be included
- * in all copies or substantial portions of the Materials.
- *
- * THE MATERIALS ARE PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
- * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
- * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
- * IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
- * CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
- * TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
- * MATERIALS OR THE USE OR OTHER DEALINGS IN THE MATERIALS.
- ******************************************************************************/
-
-/*! \file
- *
- *   \brief C++ bindings for OpenCL 1.0 (rev 48), OpenCL 1.1 (rev 33) and 
- *       OpenCL 1.2 (rev 15)    
- *   \author Benedict R. Gaster, Laurent Morichetti and Lee Howes
- *   
- *   Additions and fixes from:
- *       Brian Cole, March 3rd 2010 and April 2012 
- *       Matt Gruenke, April 2012.
- *       Bruce Merry, February 2013.
- *       Tom Deakin and Simon McIntosh-Smith, July 2013
- *   
- *   \version 1.2.6
- *   \date August 2013
- *
- *   Optional extension support
- *
- *         cl
- *         cl_ext_device_fission
- *                #define USE_CL_DEVICE_FISSION
- */
-
-/*! \mainpage
- * \section intro Introduction
- * For many large applications C++ is the language of choice and so it seems
- * reasonable to define C++ bindings for OpenCL.
- *
- *
- * The interface is contained with a single C++ header file \em cl.hpp and all
- * definitions are contained within the namespace \em cl. There is no additional
- * requirement to include \em cl.h and to use either the C++ or original C
- * bindings it is enough to simply include \em cl.hpp.
- *
- * The bindings themselves are lightweight and correspond closely to the
- * underlying C API. Using the C++ bindings introduces no additional execution
- * overhead.
- *
- * For detail documentation on the bindings see:
- *
- * The OpenCL C++ Wrapper API 1.2 (revision 09)
- *  http://www.khronos.org/registry/cl/specs/opencl-cplusplus-1.2.pdf
- *
- * \section example Example
- *
- * The following example shows a general use case for the C++
- * bindings, including support for the optional exception feature and
- * also the supplied vector and string classes, see following sections for
- * decriptions of these features.
- *
- * \code
- * #define __CL_ENABLE_EXCEPTIONS
- * 
- * #if defined(__APPLE__) || defined(__MACOSX)
- * #include <OpenCL/cl.hpp>
- * #else
- * #include <CL/cl.hpp>
- * #endif
- * #include <cstdio>
- * #include <cstdlib>
- * #include <iostream>
- * 
- *  const char * helloStr  = "__kernel void "
- *                           "hello(void) "
- *                           "{ "
- *                           "  "
- *                           "} ";
- * 
- *  int
- *  main(void)
- *  {
- *     cl_int err = CL_SUCCESS;
- *     try {
- *
- *       std::vector<cl::Platform> platforms;
- *       cl::Platform::get(&platforms);
- *       if (platforms.size() == 0) {
- *           std::cout << "Platform size 0\n";
- *           return -1;
- *       }
- *
- *       cl_context_properties properties[] = 
- *          { CL_CONTEXT_PLATFORM, (cl_context_properties)(platforms[0])(), 0};
- *       cl::Context context(CL_DEVICE_TYPE_CPU, properties); 
- * 
- *       std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
- * 
- *       cl::Program::Sources source(1,
- *           std::make_pair(helloStr,strlen(helloStr)));
- *       cl::Program program_ = cl::Program(context, source);
- *       program_.build(devices);
- * 
- *       cl::Kernel kernel(program_, "hello", &err);
- * 
- *       cl::Event event;
- *       cl::CommandQueue queue(context, devices[0], 0, &err);
- *       queue.enqueueNDRangeKernel(
- *           kernel, 
- *           cl::NullRange, 
- *           cl::NDRange(4,4),
- *           cl::NullRange,
- *           NULL,
- *           &event); 
- * 
- *       event.wait();
- *     }
- *     catch (cl::Error err) {
- *        std::cerr 
- *           << "ERROR: "
- *           << err.what()
- *           << "("
- *           << err.err()
- *           << ")"
- *           << std::endl;
- *     }
- * 
- *    return EXIT_SUCCESS;
- *  }
- * 
- * \endcode
- *
- */
-#ifndef CL_HPP_
-#define CL_HPP_
-
-#ifdef _WIN32
-
-#include <windows.h>
-#include <malloc.h>
-#include <iterator>
-#include <intrin.h>
-
-#if defined(__CL_ENABLE_EXCEPTIONS)
-#include <exception>
-#endif // #if defined(__CL_ENABLE_EXCEPTIONS)
-
-#pragma push_macro("max")
-#undef max
-#if defined(USE_DX_INTEROP)
-#include <CL/cl_d3d10.h>
-#include <CL/cl_dx9_media_sharing.h>
-#endif
-#endif // _WIN32
-
-// 
-#if defined(USE_CL_DEVICE_FISSION)
-#include <CL/cl_ext.h>
-#endif
-
-#if defined(__APPLE__) || defined(__MACOSX)
-#include <OpenGL/OpenGL.h>
-#include <OpenCL/opencl.h>
-#include <libkern/OSAtomic.h>
-#else
-#include <GL/gl.h>
-#include <CL/opencl.h>
-#endif // !__APPLE__
-
-// To avoid accidentally taking ownership of core OpenCL types
-// such as cl_kernel constructors are made explicit
-// under OpenCL 1.2
-#if defined(CL_VERSION_1_2) && !defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-#define __CL_EXPLICIT_CONSTRUCTORS explicit
-#else // #if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-#define __CL_EXPLICIT_CONSTRUCTORS 
-#endif // #if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-
-// Define deprecated prefixes and suffixes to ensure compilation
-// in case they are not pre-defined
-#if !defined(CL_EXT_PREFIX__VERSION_1_1_DEPRECATED)
-#define CL_EXT_PREFIX__VERSION_1_1_DEPRECATED  
-#endif // #if !defined(CL_EXT_PREFIX__VERSION_1_1_DEPRECATED)
-#if !defined(CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED)
-#define CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-#endif // #if !defined(CL_EXT_PREFIX__VERSION_1_1_DEPRECATED)
-
-#if !defined(CL_CALLBACK)
-#define CL_CALLBACK
-#endif //CL_CALLBACK
-
-#include <utility>
-#include <limits>
-
-#if !defined(__NO_STD_VECTOR)
-#include <vector>
-#endif
-
-#if !defined(__NO_STD_STRING)
-#include <string>
-#endif 
-
-#if defined(linux) || defined(__APPLE__) || defined(__MACOSX)
-#include <alloca.h>
-
-#include <emmintrin.h>
-#include <xmmintrin.h>
-#endif // linux
-
-#include <cstring>
-
-
-/*! \namespace cl
- *
- * \brief The OpenCL C++ bindings are defined within this namespace.
- *
- */
-namespace cl {
-
-class Memory;
-
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2)) 
-#define __INIT_CL_EXT_FCN_PTR(name) \
-    if(!pfn_##name) { \
-        pfn_##name = (PFN_##name) \
-            clGetExtensionFunctionAddress(#name); \
-        if(!pfn_##name) { \
-        } \
-    }
-#endif // #if defined(CL_VERSION_1_1)
-
-#if defined(CL_VERSION_1_2)
-#define __INIT_CL_EXT_FCN_PTR_PLATFORM(platform, name) \
-    if(!pfn_##name) { \
-        pfn_##name = (PFN_##name) \
-            clGetExtensionFunctionAddressForPlatform(platform, #name); \
-        if(!pfn_##name) { \
-        } \
-    }
-#endif // #if defined(CL_VERSION_1_1)
-
-class Program;
-class Device;
-class Context;
-class CommandQueue;
-class Memory;
-class Buffer;
-
-#if defined(__CL_ENABLE_EXCEPTIONS)
-/*! \brief Exception class 
- * 
- *  This may be thrown by API functions when __CL_ENABLE_EXCEPTIONS is defined.
- */
-class Error : public std::exception
-{
-private:
-    cl_int err_;
-    const char * errStr_;
-public:
-    /*! \brief Create a new CL error exception for a given error code
-     *  and corresponding message.
-     * 
-     *  \param err error code value.
-     *
-     *  \param errStr a descriptive string that must remain in scope until
-     *                handling of the exception has concluded.  If set, it
-     *                will be returned by what().
-     */
-    Error(cl_int err, const char * errStr = NULL) : err_(err), errStr_(errStr)
-    {}
-
-    ~Error() throw() {}
-
-    /*! \brief Get error string associated with exception
-     *
-     * \return A memory pointer to the error message string.
-     */
-    virtual const char * what() const throw ()
-    {
-        if (errStr_ == NULL) {
-            return "empty";
-        }
-        else {
-            return errStr_;
-        }
-    }
-
-    /*! \brief Get error code associated with exception
-     *
-     *  \return The error code.
-     */
-    cl_int err(void) const { return err_; }
-};
-
-#define __ERR_STR(x) #x
-#else
-#define __ERR_STR(x) NULL
-#endif // __CL_ENABLE_EXCEPTIONS
-
-
-namespace detail
-{
-#if defined(__CL_ENABLE_EXCEPTIONS)
-static inline cl_int errHandler (
-    cl_int err,
-    const char * errStr = NULL)
-{
-    if (err != CL_SUCCESS) {
-        throw Error(err, errStr);
-    }
-    return err;
-}
-#else
-static inline cl_int errHandler (cl_int err, const char * errStr = NULL)
-{
-    (void) errStr; // suppress unused variable warning
-    return err;
-}
-#endif // __CL_ENABLE_EXCEPTIONS
-}
-
-
-
-//! \cond DOXYGEN_DETAIL
-#if !defined(__CL_USER_OVERRIDE_ERROR_STRINGS)
-#define __GET_DEVICE_INFO_ERR               __ERR_STR(clGetDeviceInfo)
-#define __GET_PLATFORM_INFO_ERR             __ERR_STR(clGetPlatformInfo)
-#define __GET_DEVICE_IDS_ERR                __ERR_STR(clGetDeviceIDs)
-#define __GET_PLATFORM_IDS_ERR              __ERR_STR(clGetPlatformIDs)
-#define __GET_CONTEXT_INFO_ERR              __ERR_STR(clGetContextInfo)
-#define __GET_EVENT_INFO_ERR                __ERR_STR(clGetEventInfo)
-#define __GET_EVENT_PROFILE_INFO_ERR        __ERR_STR(clGetEventProfileInfo)
-#define __GET_MEM_OBJECT_INFO_ERR           __ERR_STR(clGetMemObjectInfo)
-#define __GET_IMAGE_INFO_ERR                __ERR_STR(clGetImageInfo)
-#define __GET_SAMPLER_INFO_ERR              __ERR_STR(clGetSamplerInfo)
-#define __GET_KERNEL_INFO_ERR               __ERR_STR(clGetKernelInfo)
-#if defined(CL_VERSION_1_2)
-#define __GET_KERNEL_ARG_INFO_ERR               __ERR_STR(clGetKernelArgInfo)
-#endif // #if defined(CL_VERSION_1_2)
-#define __GET_KERNEL_WORK_GROUP_INFO_ERR    __ERR_STR(clGetKernelWorkGroupInfo)
-#define __GET_PROGRAM_INFO_ERR              __ERR_STR(clGetProgramInfo)
-#define __GET_PROGRAM_BUILD_INFO_ERR        __ERR_STR(clGetProgramBuildInfo)
-#define __GET_COMMAND_QUEUE_INFO_ERR        __ERR_STR(clGetCommandQueueInfo)
-
-#define __CREATE_CONTEXT_ERR                __ERR_STR(clCreateContext)
-#define __CREATE_CONTEXT_FROM_TYPE_ERR      __ERR_STR(clCreateContextFromType)
-#define __GET_SUPPORTED_IMAGE_FORMATS_ERR   __ERR_STR(clGetSupportedImageFormats)
-
-#define __CREATE_BUFFER_ERR                 __ERR_STR(clCreateBuffer)
-#define __COPY_ERR                          __ERR_STR(cl::copy)
-#define __CREATE_SUBBUFFER_ERR              __ERR_STR(clCreateSubBuffer)
-#define __CREATE_GL_BUFFER_ERR              __ERR_STR(clCreateFromGLBuffer)
-#define __CREATE_GL_RENDER_BUFFER_ERR       __ERR_STR(clCreateFromGLBuffer)
-#define __GET_GL_OBJECT_INFO_ERR            __ERR_STR(clGetGLObjectInfo)
-#if defined(CL_VERSION_1_2)
-#define __CREATE_IMAGE_ERR                  __ERR_STR(clCreateImage)
-#define __CREATE_GL_TEXTURE_ERR             __ERR_STR(clCreateFromGLTexture)
-#define __IMAGE_DIMENSION_ERR               __ERR_STR(Incorrect image dimensions)
-#endif // #if defined(CL_VERSION_1_2)
-#define __CREATE_SAMPLER_ERR                __ERR_STR(clCreateSampler)
-#define __SET_MEM_OBJECT_DESTRUCTOR_CALLBACK_ERR __ERR_STR(clSetMemObjectDestructorCallback)
-
-#define __CREATE_USER_EVENT_ERR             __ERR_STR(clCreateUserEvent)
-#define __SET_USER_EVENT_STATUS_ERR         __ERR_STR(clSetUserEventStatus)
-#define __SET_EVENT_CALLBACK_ERR            __ERR_STR(clSetEventCallback)
-#define __WAIT_FOR_EVENTS_ERR               __ERR_STR(clWaitForEvents)
-
-#define __CREATE_KERNEL_ERR                 __ERR_STR(clCreateKernel)
-#define __SET_KERNEL_ARGS_ERR               __ERR_STR(clSetKernelArg)
-#define __CREATE_PROGRAM_WITH_SOURCE_ERR    __ERR_STR(clCreateProgramWithSource)
-#define __CREATE_PROGRAM_WITH_BINARY_ERR    __ERR_STR(clCreateProgramWithBinary)
-#if defined(CL_VERSION_1_2)
-#define __CREATE_PROGRAM_WITH_BUILT_IN_KERNELS_ERR    __ERR_STR(clCreateProgramWithBuiltInKernels)
-#endif // #if defined(CL_VERSION_1_2)
-#define __BUILD_PROGRAM_ERR                 __ERR_STR(clBuildProgram)
-#if defined(CL_VERSION_1_2)
-#define __COMPILE_PROGRAM_ERR                  __ERR_STR(clCompileProgram)
-
-#endif // #if defined(CL_VERSION_1_2)
-#define __CREATE_KERNELS_IN_PROGRAM_ERR     __ERR_STR(clCreateKernelsInProgram)
-
-#define __CREATE_COMMAND_QUEUE_ERR          __ERR_STR(clCreateCommandQueue)
-#define __SET_COMMAND_QUEUE_PROPERTY_ERR    __ERR_STR(clSetCommandQueueProperty)
-#define __ENQUEUE_READ_BUFFER_ERR           __ERR_STR(clEnqueueReadBuffer)
-#define __ENQUEUE_READ_BUFFER_RECT_ERR      __ERR_STR(clEnqueueReadBufferRect)
-#define __ENQUEUE_WRITE_BUFFER_ERR          __ERR_STR(clEnqueueWriteBuffer)
-#define __ENQUEUE_WRITE_BUFFER_RECT_ERR     __ERR_STR(clEnqueueWriteBufferRect)
-#define __ENQEUE_COPY_BUFFER_ERR            __ERR_STR(clEnqueueCopyBuffer)
-#define __ENQEUE_COPY_BUFFER_RECT_ERR       __ERR_STR(clEnqueueCopyBufferRect)
-#define __ENQUEUE_FILL_BUFFER_ERR           __ERR_STR(clEnqueueFillBuffer)
-#define __ENQUEUE_READ_IMAGE_ERR            __ERR_STR(clEnqueueReadImage)
-#define __ENQUEUE_WRITE_IMAGE_ERR           __ERR_STR(clEnqueueWriteImage)
-#define __ENQUEUE_COPY_IMAGE_ERR            __ERR_STR(clEnqueueCopyImage)
-#define __ENQUEUE_FILL_IMAGE_ERR           __ERR_STR(clEnqueueFillImage)
-#define __ENQUEUE_COPY_IMAGE_TO_BUFFER_ERR  __ERR_STR(clEnqueueCopyImageToBuffer)
-#define __ENQUEUE_COPY_BUFFER_TO_IMAGE_ERR  __ERR_STR(clEnqueueCopyBufferToImage)
-#define __ENQUEUE_MAP_BUFFER_ERR            __ERR_STR(clEnqueueMapBuffer)
-#define __ENQUEUE_MAP_IMAGE_ERR             __ERR_STR(clEnqueueMapImage)
-#define __ENQUEUE_UNMAP_MEM_OBJECT_ERR      __ERR_STR(clEnqueueUnmapMemObject)
-#define __ENQUEUE_NDRANGE_KERNEL_ERR        __ERR_STR(clEnqueueNDRangeKernel)
-#define __ENQUEUE_TASK_ERR                  __ERR_STR(clEnqueueTask)
-#define __ENQUEUE_NATIVE_KERNEL             __ERR_STR(clEnqueueNativeKernel)
-#if defined(CL_VERSION_1_2)
-#define __ENQUEUE_MIGRATE_MEM_OBJECTS_ERR   __ERR_STR(clEnqueueMigrateMemObjects)
-#endif // #if defined(CL_VERSION_1_2)
-
-#define __ENQUEUE_ACQUIRE_GL_ERR            __ERR_STR(clEnqueueAcquireGLObjects)
-#define __ENQUEUE_RELEASE_GL_ERR            __ERR_STR(clEnqueueReleaseGLObjects)
-
-
-#define __RETAIN_ERR                        __ERR_STR(Retain Object)
-#define __RELEASE_ERR                       __ERR_STR(Release Object)
-#define __FLUSH_ERR                         __ERR_STR(clFlush)
-#define __FINISH_ERR                        __ERR_STR(clFinish)
-#define __VECTOR_CAPACITY_ERR               __ERR_STR(Vector capacity error)
-
-/**
- * CL 1.2 version that uses device fission.
- */
-#if defined(CL_VERSION_1_2)
-#define __CREATE_SUB_DEVICES                __ERR_STR(clCreateSubDevices)
-#else
-#define __CREATE_SUB_DEVICES                __ERR_STR(clCreateSubDevicesEXT)
-#endif // #if defined(CL_VERSION_1_2)
-
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2)) 
-#define __ENQUEUE_MARKER_ERR                __ERR_STR(clEnqueueMarker)
-#define __ENQUEUE_WAIT_FOR_EVENTS_ERR       __ERR_STR(clEnqueueWaitForEvents)
-#define __ENQUEUE_BARRIER_ERR               __ERR_STR(clEnqueueBarrier)
-#define __UNLOAD_COMPILER_ERR               __ERR_STR(clUnloadCompiler)
-#define __CREATE_GL_TEXTURE_2D_ERR          __ERR_STR(clCreateFromGLTexture2D)
-#define __CREATE_GL_TEXTURE_3D_ERR          __ERR_STR(clCreateFromGLTexture3D)
-#define __CREATE_IMAGE2D_ERR                __ERR_STR(clCreateImage2D)
-#define __CREATE_IMAGE3D_ERR                __ERR_STR(clCreateImage3D)
-#endif // #if defined(CL_VERSION_1_1)
-
-#endif // __CL_USER_OVERRIDE_ERROR_STRINGS
-//! \endcond
-
-/**
- * CL 1.2 marker and barrier commands
- */
-#if defined(CL_VERSION_1_2)
-#define __ENQUEUE_MARKER_WAIT_LIST_ERR                __ERR_STR(clEnqueueMarkerWithWaitList)
-#define __ENQUEUE_BARRIER_WAIT_LIST_ERR               __ERR_STR(clEnqueueBarrierWithWaitList)
-#endif // #if defined(CL_VERSION_1_2)
-
-#if !defined(__USE_DEV_STRING) && !defined(__NO_STD_STRING)
-typedef std::string STRING_CLASS;
-#elif !defined(__USE_DEV_STRING) 
-
-/*! \class string
- * \brief Simple string class, that provides a limited subset of std::string
- * functionality but avoids many of the issues that come with that class.
- *
- *  \note Deprecated. Please use std::string as default or
- *  re-define the string class to match the std::string
- *  interface by defining STRING_CLASS
- */
-class CL_EXT_PREFIX__VERSION_1_1_DEPRECATED string CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-{
-private:
-    ::size_t size_;
-    char * str_;
-public:
-    //! \brief Constructs an empty string, allocating no memory.
-    string(void) : size_(0), str_(NULL)
-    {
-    }
-
-    /*! \brief Constructs a string populated from an arbitrary value of
-     *  specified size.
-     * 
-     *  An extra '\0' is added, in case none was contained in str.
-     *
-     *  \param str the initial value of the string instance.  Note that '\0'     
-     *             characters receive no special treatment.  If NULL,
-     *             the string is left empty, with a size of 0.
-     *
-     *  \param size the number of characters to copy from str.
-     */
-    string(const char * str, ::size_t size) :
-        size_(size),
-        str_(NULL)
-    {
-        if( size > 0 ) {
-            str_ = new char[size_+1];
-            if (str_ != NULL) {
-                memcpy(str_, str, size_  * sizeof(char));
-                str_[size_] = '\0';
-            }
-            else {
-                size_ = 0;
-            }
-        }
-    }
-
-    /*! \brief Constructs a string populated from a null-terminated value.
-     *
-     *  \param str the null-terminated initial value of the string instance.
-     *             If NULL, the string is left empty, with a size of 0.
-     */
-    string(const char * str) :
-        size_(0),
-        str_(NULL)
-    {
-        if( str ) {
-            size_= ::strlen(str);
-        }
-        if( size_ > 0 ) {
-            str_ = new char[size_ + 1];
-            if (str_ != NULL) {
-                memcpy(str_, str, (size_ + 1) * sizeof(char));
-            }
-        }
-    }
-
-    void resize( ::size_t n )
-    {
-        if( size_ == n ) {
-            return;
-        }
-        if (n == 0) {
-            if( str_ ) {
-                delete [] str_;
-            }
-            str_ = NULL;
-            size_ = 0;
-        } 
-        else {
-            char *newString = new char[n + 1];
-            ::size_t copySize = n;
-            if( size_ < n ) {
-                copySize = size_;
-            }
-            size_ = n;
-            
-            if(str_) {
-                memcpy(newString, str_, (copySize + 1) * sizeof(char));
-            }
-            if( copySize < size_ ) {
-                memset(newString + copySize, 0, size_ - copySize);
-            }
-            newString[size_] = '\0';
-
-            delete [] str_;
-            str_ = newString;
-        }
-    }
-
-    const char& operator[] ( ::size_t pos ) const
-    {
-        return str_[pos];
-    }
-
-    char& operator[] ( ::size_t pos )
-    {
-        return str_[pos];
-    }
-
-    /*! \brief Copies the value of another string to this one.
-     *
-     *  \param rhs the string to copy.
-     *
-     *  \returns a reference to the modified instance.
-     */
-    string& operator=(const string& rhs)
-    {
-        if (this == &rhs) {
-            return *this;
-        }
-
-        if( str_ != NULL ) {
-            delete [] str_;
-            str_ = NULL;
-            size_ = 0;
-        }
-
-        if (rhs.size_ == 0 || rhs.str_ == NULL) {
-            str_ = NULL;
-            size_ = 0;
-        } 
-        else {
-            str_ = new char[rhs.size_ + 1];
-            size_ = rhs.size_;
-            
-            if (str_ != NULL) {
-                memcpy(str_, rhs.str_, (size_ + 1) * sizeof(char));
-            }
-            else {
-                size_ = 0;
-            }
-        }
-
-        return *this;
-    }
-
-    /*! \brief Constructs a string by copying the value of another instance.
-     *
-     *  \param rhs the string to copy.
-     */
-    string(const string& rhs) :
-        size_(0),
-        str_(NULL)
-    {
-        *this = rhs;
-    }
-
-    //! \brief Destructor - frees memory used to hold the current value.
-    ~string()
-    {
-        delete[] str_;
-        str_ = NULL;
-    }
-    
-    //! \brief Queries the length of the string, excluding any added '\0's.
-    ::size_t size(void) const   { return size_; }
-
-    //! \brief Queries the length of the string, excluding any added '\0's.
-    ::size_t length(void) const { return size(); }
-
-    /*! \brief Returns a pointer to the private copy held by this instance,
-     *  or "" if empty/unset.
-     */
-    const char * c_str(void) const { return (str_) ? str_ : "";}
-};
-typedef cl::string STRING_CLASS;
-#endif // #elif !defined(__USE_DEV_STRING) 
-
-#if !defined(__USE_DEV_VECTOR) && !defined(__NO_STD_VECTOR)
-#define VECTOR_CLASS std::vector
-#elif !defined(__USE_DEV_VECTOR) 
-#define VECTOR_CLASS cl::vector 
-
-#if !defined(__MAX_DEFAULT_VECTOR_SIZE)
-#define __MAX_DEFAULT_VECTOR_SIZE 10
-#endif
-
-/*! \class vector
- *  \brief Fixed sized vector implementation that mirrors
- *  std::vector functionality.
- *
- *  \note Deprecated. Please use std::vector as default or
- *  re-define the vector class to match the std::vector
- *  interface by defining VECTOR_CLASS
- *
- *  \note Not recommended for use with custom objects as
- *  current implementation will construct N elements
- *
- *  \note
- *  This differs from std::vector<> not just in memory allocation,
- *  but also in terms of when members are constructed, destroyed,
- *  and assigned instead of being copy constructed.
- *
- *  \param T type of element contained in the vector.
- *
- *  \param N maximum size of the vector.
- */
-template <typename T, unsigned int N = __MAX_DEFAULT_VECTOR_SIZE>
-class CL_EXT_PREFIX__VERSION_1_1_DEPRECATED vector CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-{
-private:
-    T data_[N];
-    unsigned int size_;
-
-public:
-    //! \brief Constructs an empty vector with no memory allocated.
-    vector() :  
-        size_(static_cast<unsigned int>(0))
-    {}
-
-    //! \brief Destroys all of the vector's elements.
-    ~vector() 
-    {
-        clear();
-    }
-
-    //! \brief Returns the number of elements currently contained.
-    unsigned int size(void) const
-    {
-        return size_;
-    }
-    
-    /*! \brief Empties the vector of all elements.
-     *  \note
-     *  This does not deallocate memory but will invoke destructors
-     *  on contained elements.
-     */
-    void clear()
-    {
-        while(!empty()) {
-            pop_back();
-        }
-    }
-
-    /*! \brief Appends an element after the last valid element.
-     * Calling this on a vector that has reached capacity will throw an 
-     * exception if exceptions are enabled.
-     */
-    void push_back (const T& x)
-    { 
-        if (size() < N) {    
-            new (&data_[size_]) T(x);
-            size_++;
-        } else {
-            detail::errHandler(CL_MEM_OBJECT_ALLOCATION_FAILURE, __VECTOR_CAPACITY_ERR);
-        }
-    }
-
-    /*! \brief Removes the last valid element from the vector.
-     * Calling this on an empty vector will throw an exception
-     * if exceptions are enabled.
-     */
-    void pop_back(void)
-    {
-        if (size_ != 0) {
-            --size_;
-            data_[size_].~T();
-        } else {
-            detail::errHandler(CL_MEM_OBJECT_ALLOCATION_FAILURE, __VECTOR_CAPACITY_ERR);
-        }
-    }
-  
-    /*! \brief Constructs with a value copied from another.
-     *
-     *  \param vec the vector to copy.
-     */
-    vector(const vector<T, N>& vec) : 
-        size_(vec.size_)
-    {
-        if (size_ != 0) {    
-            assign(vec.begin(), vec.end());
-        }
-    } 
-
-    /*! \brief Constructs with a specified number of initial elements.
-     *
-     *  \param size number of initial elements.
-     *
-     *  \param val value of initial elements.
-     */
-    vector(unsigned int size, const T& val = T()) :
-        size_(0)
-    {
-        for (unsigned int i = 0; i < size; i++) {
-            push_back(val);
-        }
-    }
-
-    /*! \brief Overwrites the current content with that copied from another
-     *         instance.
-     *
-     *  \param rhs vector to copy.
-     *
-     *  \returns a reference to this.
-     */
-    vector<T, N>& operator=(const vector<T, N>& rhs)
-    {
-        if (this == &rhs) {
-            return *this;
-        }
-
-        if (rhs.size_ != 0) {    
-            assign(rhs.begin(), rhs.end());
-        } else {
-            clear();
-        }
-    
-        return *this;
-    }
-
-    /*! \brief Tests equality against another instance.
-     *
-     *  \param vec the vector against which to compare.
-     */
-    bool operator==(vector<T,N> &vec)
-    {
-        if (size() != vec.size()) {
-            return false;
-        }
-
-        for( unsigned int i = 0; i < size(); ++i ) {
-            if( operator[](i) != vec[i] ) {
-                return false;
-            }
-        }
-        return true;
-    }
-  
-    //! \brief Conversion operator to T*.
-    operator T* ()             { return data_; }
-
-    //! \brief Conversion operator to const T*.
-    operator const T* () const { return data_; }
-   
-    //! \brief Tests whether this instance has any elements.
-    bool empty (void) const
-    {
-        return size_==0;
-    }
-  
-    //! \brief Returns the maximum number of elements this instance can hold.
-    unsigned int max_size (void) const
-    {
-        return N;
-    }
-
-    //! \brief Returns the maximum number of elements this instance can hold.
-    unsigned int capacity () const
-    {
-        return N;
-    }
-
-    /*! \brief Returns a reference to a given element.
-     *
-     *  \param index which element to access.
-     *  \note
-     *  The caller is responsible for ensuring index is >= 0 and < size().
-     */
-    T& operator[](int index)
-    {
-        return data_[index];
-    }
-  
-    /*! \brief Returns a const reference to a given element.
-     *
-     *  \param index which element to access.
-     *
-     *  \note
-     *  The caller is responsible for ensuring index is >= 0 and < size().
-     */
-    const T& operator[](int index) const
-    {
-        return data_[index];
-    }
-  
-    /*! \brief Assigns elements of the vector based on a source iterator range.
-     *
-     *  \param start Beginning iterator of source range
-     *  \param end End iterator of source range
-     *
-     *  \note
-     *  Will throw an exception if exceptions are enabled and size exceeded.
-     */
-    template<class I>
-    void assign(I start, I end)
-    {
-        clear();   
-        while(start != end) {
-            push_back(*start);
-            start++;
-        }
-    }
-
-    /*! \class iterator
-     * \brief Const iterator class for vectors
-     */
-    class iterator
-    {
-    private:
-        const vector<T,N> *vec_;
-        int index_;
-
-        /**
-         * Internal iterator constructor to capture reference
-         * to the vector it iterates over rather than taking 
-         * the vector by copy.
-         */
-        iterator (const vector<T,N> &vec, int index) :
-            vec_(&vec)
-        {            
-            if( !vec.empty() ) {
-                index_ = index;
-            } else {
-                index_ = -1;
-            }
-        }
-
-    public:
-        iterator(void) : 
-            vec_(NULL),
-            index_(-1)
-        {
-        }
-
-        iterator(const iterator& rhs) :
-            vec_(rhs.vec_),
-            index_(rhs.index_)
-        {
-        }
-
-        ~iterator(void) {}
-
-        static iterator begin(const cl::vector<T,N> &vec)
-        {
-            iterator i(vec, 0);
-
-            return i;
-        }
-
-        static iterator end(const cl::vector<T,N> &vec)
-        {
-            iterator i(vec, vec.size());
-
-            return i;
-        }
-    
-        bool operator==(iterator i)
-        {
-            return ((vec_ == i.vec_) && 
-                    (index_ == i.index_));
-        }
-
-        bool operator!=(iterator i)
-        {
-            return (!(*this==i));
-        }
-
-        iterator& operator++()
-        {
-            ++index_;
-            return *this;
-        }
-
-        iterator operator++(int)
-        {
-            iterator retVal(*this);
-            ++index_;
-            return retVal;
-        }
-
-        iterator& operator--()
-        {
-            --index_;
-            return *this;
-        }
-
-        iterator operator--(int)
-        {
-            iterator retVal(*this);
-            --index_;
-            return retVal;
-        }
-
-        const T& operator *() const
-        {
-            return (*vec_)[index_];
-        }
-    };
-
-    iterator begin(void)
-    {
-        return iterator::begin(*this);
-    }
-
-    iterator begin(void) const
-    {
-        return iterator::begin(*this);
-    }
-
-    iterator end(void)
-    {
-        return iterator::end(*this);
-    }
-
-    iterator end(void) const
-    {
-        return iterator::end(*this);
-    }
-
-    T& front(void)
-    {
-        return data_[0];
-    }
-
-    T& back(void)
-    {
-        return data_[size_-1];
-    }
-
-    const T& front(void) const
-    {
-        return data_[0];
-    }
-
-    const T& back(void) const
-    {
-        return data_[size_-1];
-    }
-};  
-#endif // #if !defined(__USE_DEV_VECTOR) && !defined(__NO_STD_VECTOR)
-
-
-
-
-
-namespace detail {
-#define __DEFAULT_NOT_INITIALIZED 1 
-#define __DEFAULT_BEING_INITIALIZED 2
-#define __DEFAULT_INITIALIZED 4
-
-    /*
-     * Compare and exchange primitives are needed for handling of defaults
-    */
-    inline int compare_exchange(volatile int * dest, int exchange, int comparand)
-    {
-#ifdef _WIN32
-        return (int)(InterlockedCompareExchange(
-           (volatile long*)dest, 
-           (long)exchange, 
-           (long)comparand));
-#elif defined(__APPLE__) || defined(__MACOSX)
-        return OSAtomicOr32Orig((uint32_t)exchange, (volatile uint32_t*)dest);
-#else // !_WIN32 || defined(__APPLE__) || defined(__MACOSX)
-        return (__sync_val_compare_and_swap(
-            dest, 
-            comparand, 
-            exchange));
-#endif // !_WIN32
-    }
-
-    inline void fence() { _mm_mfence(); }
-}; // namespace detail
-
-    
-/*! \brief class used to interface between C++ and
- *  OpenCL C calls that require arrays of size_t values, whose
- *  size is known statically.
- */
-template <int N>
-class size_t
-{ 
-private:
-    ::size_t data_[N];
-
-public:
-    //! \brief Initialize size_t to all 0s
-    size_t()
-    {
-        for( int i = 0; i < N; ++i ) {
-            data_[i] = 0;
-        }
-    }
-
-    ::size_t& operator[](int index)
-    {
-        return data_[index];
-    }
-
-    const ::size_t& operator[](int index) const
-    {
-        return data_[index];
-    }
-
-    //! \brief Conversion operator to ::size_t*.
-    operator ::size_t* ()             { return data_; }
-
-    //! \brief Conversion operator to const ::size_t*.
-    operator const ::size_t* () const { return data_; }
-};
-
-namespace detail {
-
-// Generic getInfoHelper. The final parameter is used to guide overload
-// resolution: the actual parameter passed is an int, which makes this
-// a worse conversion sequence than a specialization that declares the
-// parameter as an int.
-template<typename Functor, typename T>
-inline cl_int getInfoHelper(Functor f, cl_uint name, T* param, long)
-{
-    return f(name, sizeof(T), param, NULL);
-}
-
-// Specialized getInfoHelper for VECTOR_CLASS params
-template <typename Func, typename T>
-inline cl_int getInfoHelper(Func f, cl_uint name, VECTOR_CLASS<T>* param, long)
-{
-    ::size_t required;
-    cl_int err = f(name, 0, NULL, &required);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    T* value = (T*) alloca(required);
-    err = f(name, required, value, NULL);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    param->assign(&value[0], &value[required/sizeof(T)]);
-    return CL_SUCCESS;
-}
-
-/* Specialization for reference-counted types. This depends on the
- * existence of Wrapper<T>::cl_type, and none of the other types having the
- * cl_type member. Note that simply specifying the parameter as Wrapper<T>
- * does not work, because when using a derived type (e.g. Context) the generic
- * template will provide a better match.
- */
-template <typename Func, typename T>
-inline cl_int getInfoHelper(Func f, cl_uint name, VECTOR_CLASS<T>* param, int, typename T::cl_type = 0)
-{
-    ::size_t required;
-    cl_int err = f(name, 0, NULL, &required);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    typename T::cl_type * value = (typename T::cl_type *) alloca(required);
-    err = f(name, required, value, NULL);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    ::size_t elements = required / sizeof(typename T::cl_type);
-    param->assign(&value[0], &value[elements]);
-    for (::size_t i = 0; i < elements; i++)
-    {
-        if (value[i] != NULL)
-        {
-            err = (*param)[i].retain();
-            if (err != CL_SUCCESS) {
-                return err;
-            }
-        }
-    }
-    return CL_SUCCESS;
-}
-
-// Specialized for getInfo<CL_PROGRAM_BINARIES>
-template <typename Func>
-inline cl_int getInfoHelper(Func f, cl_uint name, VECTOR_CLASS<char *>* param, int)
-{
-    cl_int err = f(name, param->size() * sizeof(char *), &(*param)[0], NULL);
-
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    return CL_SUCCESS;
-}
-
-// Specialized getInfoHelper for STRING_CLASS params
-template <typename Func>
-inline cl_int getInfoHelper(Func f, cl_uint name, STRING_CLASS* param, long)
-{
-    ::size_t required;
-    cl_int err = f(name, 0, NULL, &required);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    char* value = (char*) alloca(required);
-    err = f(name, required, value, NULL);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    *param = value;
-    return CL_SUCCESS;
-}
-
-// Specialized getInfoHelper for cl::size_t params
-template <typename Func, ::size_t N>
-inline cl_int getInfoHelper(Func f, cl_uint name, size_t<N>* param, long)
-{
-    ::size_t required;
-    cl_int err = f(name, 0, NULL, &required);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    ::size_t* value = (::size_t*) alloca(required);
-    err = f(name, required, value, NULL);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-
-    for(int i = 0; i < N; ++i) {
-        (*param)[i] = value[i];
-    }
-
-    return CL_SUCCESS;
-}
-
-template<typename T> struct ReferenceHandler;
-
-/* Specialization for reference-counted types. This depends on the
- * existence of Wrapper<T>::cl_type, and none of the other types having the
- * cl_type member. Note that simply specifying the parameter as Wrapper<T>
- * does not work, because when using a derived type (e.g. Context) the generic
- * template will provide a better match.
- */
-template<typename Func, typename T>
-inline cl_int getInfoHelper(Func f, cl_uint name, T* param, int, typename T::cl_type = 0)
-{
-    typename T::cl_type value;
-    cl_int err = f(name, sizeof(value), &value, NULL);
-    if (err != CL_SUCCESS) {
-        return err;
-    }
-    *param = value;
-    if (value != NULL)
-    {
-        err = param->retain();
-        if (err != CL_SUCCESS) {
-            return err;
-        }
-    }
-    return CL_SUCCESS;
-}
-
-#define __PARAM_NAME_INFO_1_0(F) \
-    F(cl_platform_info, CL_PLATFORM_PROFILE, STRING_CLASS) \
-    F(cl_platform_info, CL_PLATFORM_VERSION, STRING_CLASS) \
-    F(cl_platform_info, CL_PLATFORM_NAME, STRING_CLASS) \
-    F(cl_platform_info, CL_PLATFORM_VENDOR, STRING_CLASS) \
-    F(cl_platform_info, CL_PLATFORM_EXTENSIONS, STRING_CLASS) \
-    \
-    F(cl_device_info, CL_DEVICE_TYPE, cl_device_type) \
-    F(cl_device_info, CL_DEVICE_VENDOR_ID, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_COMPUTE_UNITS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_WORK_GROUP_SIZE, ::size_t) \
-    F(cl_device_info, CL_DEVICE_MAX_WORK_ITEM_SIZES, VECTOR_CLASS< ::size_t>) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_LONG, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_CLOCK_FREQUENCY, cl_uint) \
-    F(cl_device_info, CL_DEVICE_ADDRESS_BITS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_READ_IMAGE_ARGS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_WRITE_IMAGE_ARGS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MAX_MEM_ALLOC_SIZE, cl_ulong) \
-    F(cl_device_info, CL_DEVICE_IMAGE2D_MAX_WIDTH, ::size_t) \
-    F(cl_device_info, CL_DEVICE_IMAGE2D_MAX_HEIGHT, ::size_t) \
-    F(cl_device_info, CL_DEVICE_IMAGE3D_MAX_WIDTH, ::size_t) \
-    F(cl_device_info, CL_DEVICE_IMAGE3D_MAX_HEIGHT, ::size_t) \
-    F(cl_device_info, CL_DEVICE_IMAGE3D_MAX_DEPTH, ::size_t) \
-    F(cl_device_info, CL_DEVICE_IMAGE_SUPPORT, cl_bool) \
-    F(cl_device_info, CL_DEVICE_MAX_PARAMETER_SIZE, ::size_t) \
-    F(cl_device_info, CL_DEVICE_MAX_SAMPLERS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MEM_BASE_ADDR_ALIGN, cl_uint) \
-    F(cl_device_info, CL_DEVICE_MIN_DATA_TYPE_ALIGN_SIZE, cl_uint) \
-    F(cl_device_info, CL_DEVICE_SINGLE_FP_CONFIG, cl_device_fp_config) \
-    F(cl_device_info, CL_DEVICE_GLOBAL_MEM_CACHE_TYPE, cl_device_mem_cache_type) \
-    F(cl_device_info, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE, cl_uint)\
-    F(cl_device_info, CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, cl_ulong) \
-    F(cl_device_info, CL_DEVICE_GLOBAL_MEM_SIZE, cl_ulong) \
-    F(cl_device_info, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, cl_ulong) \
-    F(cl_device_info, CL_DEVICE_MAX_CONSTANT_ARGS, cl_uint) \
-    F(cl_device_info, CL_DEVICE_LOCAL_MEM_TYPE, cl_device_local_mem_type) \
-    F(cl_device_info, CL_DEVICE_LOCAL_MEM_SIZE, cl_ulong) \
-    F(cl_device_info, CL_DEVICE_ERROR_CORRECTION_SUPPORT, cl_bool) \
-    F(cl_device_info, CL_DEVICE_PROFILING_TIMER_RESOLUTION, ::size_t) \
-    F(cl_device_info, CL_DEVICE_ENDIAN_LITTLE, cl_bool) \
-    F(cl_device_info, CL_DEVICE_AVAILABLE, cl_bool) \
-    F(cl_device_info, CL_DEVICE_COMPILER_AVAILABLE, cl_bool) \
-    F(cl_device_info, CL_DEVICE_EXECUTION_CAPABILITIES, cl_device_exec_capabilities) \
-    F(cl_device_info, CL_DEVICE_QUEUE_PROPERTIES, cl_command_queue_properties) \
-    F(cl_device_info, CL_DEVICE_PLATFORM, cl_platform_id) \
-    F(cl_device_info, CL_DEVICE_NAME, STRING_CLASS) \
-    F(cl_device_info, CL_DEVICE_VENDOR, STRING_CLASS) \
-    F(cl_device_info, CL_DRIVER_VERSION, STRING_CLASS) \
-    F(cl_device_info, CL_DEVICE_PROFILE, STRING_CLASS) \
-    F(cl_device_info, CL_DEVICE_VERSION, STRING_CLASS) \
-    F(cl_device_info, CL_DEVICE_EXTENSIONS, STRING_CLASS) \
-    \
-    F(cl_context_info, CL_CONTEXT_REFERENCE_COUNT, cl_uint) \
-    F(cl_context_info, CL_CONTEXT_DEVICES, VECTOR_CLASS<Device>) \
-    F(cl_context_info, CL_CONTEXT_PROPERTIES, VECTOR_CLASS<cl_context_properties>) \
-    \
-    F(cl_event_info, CL_EVENT_COMMAND_QUEUE, cl::CommandQueue) \
-    F(cl_event_info, CL_EVENT_COMMAND_TYPE, cl_command_type) \
-    F(cl_event_info, CL_EVENT_REFERENCE_COUNT, cl_uint) \
-    F(cl_event_info, CL_EVENT_COMMAND_EXECUTION_STATUS, cl_int) \
-    \
-    F(cl_profiling_info, CL_PROFILING_COMMAND_QUEUED, cl_ulong) \
-    F(cl_profiling_info, CL_PROFILING_COMMAND_SUBMIT, cl_ulong) \
-    F(cl_profiling_info, CL_PROFILING_COMMAND_START, cl_ulong) \
-    F(cl_profiling_info, CL_PROFILING_COMMAND_END, cl_ulong) \
-    \
-    F(cl_mem_info, CL_MEM_TYPE, cl_mem_object_type) \
-    F(cl_mem_info, CL_MEM_FLAGS, cl_mem_flags) \
-    F(cl_mem_info, CL_MEM_SIZE, ::size_t) \
-    F(cl_mem_info, CL_MEM_HOST_PTR, void*) \
-    F(cl_mem_info, CL_MEM_MAP_COUNT, cl_uint) \
-    F(cl_mem_info, CL_MEM_REFERENCE_COUNT, cl_uint) \
-    F(cl_mem_info, CL_MEM_CONTEXT, cl::Context) \
-    \
-    F(cl_image_info, CL_IMAGE_FORMAT, cl_image_format) \
-    F(cl_image_info, CL_IMAGE_ELEMENT_SIZE, ::size_t) \
-    F(cl_image_info, CL_IMAGE_ROW_PITCH, ::size_t) \
-    F(cl_image_info, CL_IMAGE_SLICE_PITCH, ::size_t) \
-    F(cl_image_info, CL_IMAGE_WIDTH, ::size_t) \
-    F(cl_image_info, CL_IMAGE_HEIGHT, ::size_t) \
-    F(cl_image_info, CL_IMAGE_DEPTH, ::size_t) \
-    \
-    F(cl_sampler_info, CL_SAMPLER_REFERENCE_COUNT, cl_uint) \
-    F(cl_sampler_info, CL_SAMPLER_CONTEXT, cl::Context) \
-    F(cl_sampler_info, CL_SAMPLER_NORMALIZED_COORDS, cl_bool) \
-    F(cl_sampler_info, CL_SAMPLER_ADDRESSING_MODE, cl_addressing_mode) \
-    F(cl_sampler_info, CL_SAMPLER_FILTER_MODE, cl_filter_mode) \
-    \
-    F(cl_program_info, CL_PROGRAM_REFERENCE_COUNT, cl_uint) \
-    F(cl_program_info, CL_PROGRAM_CONTEXT, cl::Context) \
-    F(cl_program_info, CL_PROGRAM_NUM_DEVICES, cl_uint) \
-    F(cl_program_info, CL_PROGRAM_DEVICES, VECTOR_CLASS<Device>) \
-    F(cl_program_info, CL_PROGRAM_SOURCE, STRING_CLASS) \
-    F(cl_program_info, CL_PROGRAM_BINARY_SIZES, VECTOR_CLASS< ::size_t>) \
-    F(cl_program_info, CL_PROGRAM_BINARIES, VECTOR_CLASS<char *>) \
-    \
-    F(cl_program_build_info, CL_PROGRAM_BUILD_STATUS, cl_build_status) \
-    F(cl_program_build_info, CL_PROGRAM_BUILD_OPTIONS, STRING_CLASS) \
-    F(cl_program_build_info, CL_PROGRAM_BUILD_LOG, STRING_CLASS) \
-    \
-    F(cl_kernel_info, CL_KERNEL_FUNCTION_NAME, STRING_CLASS) \
-    F(cl_kernel_info, CL_KERNEL_NUM_ARGS, cl_uint) \
-    F(cl_kernel_info, CL_KERNEL_REFERENCE_COUNT, cl_uint) \
-    F(cl_kernel_info, CL_KERNEL_CONTEXT, cl::Context) \
-    F(cl_kernel_info, CL_KERNEL_PROGRAM, cl::Program) \
-    \
-    F(cl_kernel_work_group_info, CL_KERNEL_WORK_GROUP_SIZE, ::size_t) \
-    F(cl_kernel_work_group_info, CL_KERNEL_COMPILE_WORK_GROUP_SIZE, cl::size_t<3>) \
-    F(cl_kernel_work_group_info, CL_KERNEL_LOCAL_MEM_SIZE, cl_ulong) \
-    \
-    F(cl_command_queue_info, CL_QUEUE_CONTEXT, cl::Context) \
-    F(cl_command_queue_info, CL_QUEUE_DEVICE, cl::Device) \
-    F(cl_command_queue_info, CL_QUEUE_REFERENCE_COUNT, cl_uint) \
-    F(cl_command_queue_info, CL_QUEUE_PROPERTIES, cl_command_queue_properties)
-
-#if defined(CL_VERSION_1_1)
-#define __PARAM_NAME_INFO_1_1(F) \
-    F(cl_context_info, CL_CONTEXT_NUM_DEVICES, cl_uint)\
-    F(cl_device_info, CL_DEVICE_PREFERRED_VECTOR_WIDTH_HALF, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_CHAR, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_SHORT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_INT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_LONG, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_FLOAT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_DOUBLE, cl_uint) \
-    F(cl_device_info, CL_DEVICE_NATIVE_VECTOR_WIDTH_HALF, cl_uint) \
-    F(cl_device_info, CL_DEVICE_DOUBLE_FP_CONFIG, cl_device_fp_config) \
-    F(cl_device_info, CL_DEVICE_HALF_FP_CONFIG, cl_device_fp_config) \
-    F(cl_device_info, CL_DEVICE_HOST_UNIFIED_MEMORY, cl_bool) \
-    F(cl_device_info, CL_DEVICE_OPENCL_C_VERSION, STRING_CLASS) \
-    \
-    F(cl_mem_info, CL_MEM_ASSOCIATED_MEMOBJECT, cl::Memory) \
-    F(cl_mem_info, CL_MEM_OFFSET, ::size_t) \
-    \
-    F(cl_kernel_work_group_info, CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE, ::size_t) \
-    F(cl_kernel_work_group_info, CL_KERNEL_PRIVATE_MEM_SIZE, cl_ulong) \
-    \
-    F(cl_event_info, CL_EVENT_CONTEXT, cl::Context)
-#endif // CL_VERSION_1_1
-
-    
-#if defined(CL_VERSION_1_2)
-#define __PARAM_NAME_INFO_1_2(F) \
-    F(cl_image_info, CL_IMAGE_BUFFER, cl::Buffer) \
-    \
-    F(cl_program_info, CL_PROGRAM_NUM_KERNELS, ::size_t) \
-    F(cl_program_info, CL_PROGRAM_KERNEL_NAMES, STRING_CLASS) \
-    \
-    F(cl_program_build_info, CL_PROGRAM_BINARY_TYPE, cl_program_binary_type) \
-    \
-    F(cl_kernel_info, CL_KERNEL_ATTRIBUTES, STRING_CLASS) \
-    \
-    F(cl_kernel_arg_info, CL_KERNEL_ARG_ADDRESS_QUALIFIER, cl_kernel_arg_address_qualifier) \
-    F(cl_kernel_arg_info, CL_KERNEL_ARG_ACCESS_QUALIFIER, cl_kernel_arg_access_qualifier) \
-    F(cl_kernel_arg_info, CL_KERNEL_ARG_TYPE_NAME, STRING_CLASS) \
-    F(cl_kernel_arg_info, CL_KERNEL_ARG_NAME, STRING_CLASS) \
-    \
-    F(cl_device_info, CL_DEVICE_PARENT_DEVICE, cl_device_id) \
-    F(cl_device_info, CL_DEVICE_PARTITION_PROPERTIES, VECTOR_CLASS<cl_device_partition_property>) \
-    F(cl_device_info, CL_DEVICE_PARTITION_TYPE, VECTOR_CLASS<cl_device_partition_property>)  \
-    F(cl_device_info, CL_DEVICE_REFERENCE_COUNT, cl_uint) \
-    F(cl_device_info, CL_DEVICE_PREFERRED_INTEROP_USER_SYNC, ::size_t) \
-    F(cl_device_info, CL_DEVICE_PARTITION_AFFINITY_DOMAIN, cl_device_affinity_domain) \
-    F(cl_device_info, CL_DEVICE_BUILT_IN_KERNELS, STRING_CLASS)
-#endif // #if defined(CL_VERSION_1_2)
-
-#if defined(USE_CL_DEVICE_FISSION)
-#define __PARAM_NAME_DEVICE_FISSION(F) \
-    F(cl_device_info, CL_DEVICE_PARENT_DEVICE_EXT, cl_device_id) \
-    F(cl_device_info, CL_DEVICE_PARTITION_TYPES_EXT, VECTOR_CLASS<cl_device_partition_property_ext>) \
-    F(cl_device_info, CL_DEVICE_AFFINITY_DOMAINS_EXT, VECTOR_CLASS<cl_device_partition_property_ext>) \
-    F(cl_device_info, CL_DEVICE_REFERENCE_COUNT_EXT , cl_uint) \
-    F(cl_device_info, CL_DEVICE_PARTITION_STYLE_EXT, VECTOR_CLASS<cl_device_partition_property_ext>)
-#endif // USE_CL_DEVICE_FISSION
-
-template <typename enum_type, cl_int Name>
-struct param_traits {};
-
-#define __CL_DECLARE_PARAM_TRAITS(token, param_name, T) \
-struct token;                                        \
-template<>                                           \
-struct param_traits<detail:: token,param_name>       \
-{                                                    \
-    enum { value = param_name };                     \
-    typedef T param_type;                            \
-};
-
-__PARAM_NAME_INFO_1_0(__CL_DECLARE_PARAM_TRAITS)
-#if defined(CL_VERSION_1_1)
-__PARAM_NAME_INFO_1_1(__CL_DECLARE_PARAM_TRAITS)
-#endif // CL_VERSION_1_1
-#if defined(CL_VERSION_1_2)
-__PARAM_NAME_INFO_1_2(__CL_DECLARE_PARAM_TRAITS)
-#endif // CL_VERSION_1_2
-
-#if defined(USE_CL_DEVICE_FISSION)
-__PARAM_NAME_DEVICE_FISSION(__CL_DECLARE_PARAM_TRAITS)
-#endif // USE_CL_DEVICE_FISSION
-
-#ifdef CL_PLATFORM_ICD_SUFFIX_KHR
-__CL_DECLARE_PARAM_TRAITS(cl_platform_info, CL_PLATFORM_ICD_SUFFIX_KHR, STRING_CLASS)
-#endif
-
-#ifdef CL_DEVICE_PROFILING_TIMER_OFFSET_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_PROFILING_TIMER_OFFSET_AMD, cl_ulong)
-#endif
-
-#ifdef CL_DEVICE_GLOBAL_FREE_MEMORY_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_GLOBAL_FREE_MEMORY_AMD, VECTOR_CLASS< ::size_t>)
-#endif
-#ifdef CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_SIMD_PER_COMPUTE_UNIT_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_SIMD_WIDTH_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_SIMD_WIDTH_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_SIMD_INSTRUCTION_WIDTH_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_SIMD_INSTRUCTION_WIDTH_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_WAVEFRONT_WIDTH_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_WAVEFRONT_WIDTH_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_GLOBAL_MEM_CHANNELS_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANKS_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_GLOBAL_MEM_CHANNEL_BANK_WIDTH_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_LOCAL_MEM_SIZE_PER_COMPUTE_UNIT_AMD, cl_uint)
-#endif
-#ifdef CL_DEVICE_LOCAL_MEM_BANKS_AMD
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_LOCAL_MEM_BANKS_AMD, cl_uint)
-#endif
-
-#ifdef CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_COMPUTE_CAPABILITY_MAJOR_NV, cl_uint)
-#endif
-#ifdef CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_COMPUTE_CAPABILITY_MINOR_NV, cl_uint)
-#endif
-#ifdef CL_DEVICE_REGISTERS_PER_BLOCK_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_REGISTERS_PER_BLOCK_NV, cl_uint)
-#endif
-#ifdef CL_DEVICE_WARP_SIZE_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_WARP_SIZE_NV, cl_uint)
-#endif
-#ifdef CL_DEVICE_GPU_OVERLAP_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_GPU_OVERLAP_NV, cl_bool)
-#endif
-#ifdef CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_KERNEL_EXEC_TIMEOUT_NV, cl_bool)
-#endif
-#ifdef CL_DEVICE_INTEGRATED_MEMORY_NV
-__CL_DECLARE_PARAM_TRAITS(cl_device_info, CL_DEVICE_INTEGRATED_MEMORY_NV, cl_bool)
-#endif
-
-// Convenience functions
-
-template <typename Func, typename T>
-inline cl_int
-getInfo(Func f, cl_uint name, T* param)
-{
-    return getInfoHelper(f, name, param, 0);
-}
-
-template <typename Func, typename Arg0>
-struct GetInfoFunctor0
-{
-    Func f_; const Arg0& arg0_;
-    cl_int operator ()(
-        cl_uint param, ::size_t size, void* value, ::size_t* size_ret)
-    { return f_(arg0_, param, size, value, size_ret); }
-};
-
-template <typename Func, typename Arg0, typename Arg1>
-struct GetInfoFunctor1
-{
-    Func f_; const Arg0& arg0_; const Arg1& arg1_;
-    cl_int operator ()(
-        cl_uint param, ::size_t size, void* value, ::size_t* size_ret)
-    { return f_(arg0_, arg1_, param, size, value, size_ret); }
-};
-
-template <typename Func, typename Arg0, typename T>
-inline cl_int
-getInfo(Func f, const Arg0& arg0, cl_uint name, T* param)
-{
-    GetInfoFunctor0<Func, Arg0> f0 = { f, arg0 };
-    return getInfoHelper(f0, name, param, 0);
-}
-
-template <typename Func, typename Arg0, typename Arg1, typename T>
-inline cl_int
-getInfo(Func f, const Arg0& arg0, const Arg1& arg1, cl_uint name, T* param)
-{
-    GetInfoFunctor1<Func, Arg0, Arg1> f0 = { f, arg0, arg1 };
-    return getInfoHelper(f0, name, param, 0);
-}
-
-template<typename T>
-struct ReferenceHandler
-{ };
-
-#if defined(CL_VERSION_1_2)
-/**
- * OpenCL 1.2 devices do have retain/release.
- */
-template <>
-struct ReferenceHandler<cl_device_id>
-{
-    /**
-     * Retain the device.
-     * \param device A valid device created using createSubDevices
-     * \return 
-     *   CL_SUCCESS if the function executed successfully.
-     *   CL_INVALID_DEVICE if device was not a valid subdevice
-     *   CL_OUT_OF_RESOURCES
-     *   CL_OUT_OF_HOST_MEMORY
-     */
-    static cl_int retain(cl_device_id device)
-    { return ::clRetainDevice(device); }
-    /**
-     * Release the device.
-     * \param device A valid device created using createSubDevices
-     * \return 
-     *   CL_SUCCESS if the function executed successfully.
-     *   CL_INVALID_DEVICE if device was not a valid subdevice
-     *   CL_OUT_OF_RESOURCES
-     *   CL_OUT_OF_HOST_MEMORY
-     */
-    static cl_int release(cl_device_id device)
-    { return ::clReleaseDevice(device); }
-};
-#else // #if defined(CL_VERSION_1_2)
-/**
- * OpenCL 1.1 devices do not have retain/release.
- */
-template <>
-struct ReferenceHandler<cl_device_id>
-{
-    // cl_device_id does not have retain().
-    static cl_int retain(cl_device_id)
-    { return CL_SUCCESS; }
-    // cl_device_id does not have release().
-    static cl_int release(cl_device_id)
-    { return CL_SUCCESS; }
-};
-#endif // #if defined(CL_VERSION_1_2)
-
-template <>
-struct ReferenceHandler<cl_platform_id>
-{
-    // cl_platform_id does not have retain().
-    static cl_int retain(cl_platform_id)
-    { return CL_SUCCESS; }
-    // cl_platform_id does not have release().
-    static cl_int release(cl_platform_id)
-    { return CL_SUCCESS; }
-};
-
-template <>
-struct ReferenceHandler<cl_context>
-{
-    static cl_int retain(cl_context context)
-    { return ::clRetainContext(context); }
-    static cl_int release(cl_context context)
-    { return ::clReleaseContext(context); }
-};
-
-template <>
-struct ReferenceHandler<cl_command_queue>
-{
-    static cl_int retain(cl_command_queue queue)
-    { return ::clRetainCommandQueue(queue); }
-    static cl_int release(cl_command_queue queue)
-    { return ::clReleaseCommandQueue(queue); }
-};
-
-template <>
-struct ReferenceHandler<cl_mem>
-{
-    static cl_int retain(cl_mem memory)
-    { return ::clRetainMemObject(memory); }
-    static cl_int release(cl_mem memory)
-    { return ::clReleaseMemObject(memory); }
-};
-
-template <>
-struct ReferenceHandler<cl_sampler>
-{
-    static cl_int retain(cl_sampler sampler)
-    { return ::clRetainSampler(sampler); }
-    static cl_int release(cl_sampler sampler)
-    { return ::clReleaseSampler(sampler); }
-};
-
-template <>
-struct ReferenceHandler<cl_program>
-{
-    static cl_int retain(cl_program program)
-    { return ::clRetainProgram(program); }
-    static cl_int release(cl_program program)
-    { return ::clReleaseProgram(program); }
-};
-
-template <>
-struct ReferenceHandler<cl_kernel>
-{
-    static cl_int retain(cl_kernel kernel)
-    { return ::clRetainKernel(kernel); }
-    static cl_int release(cl_kernel kernel)
-    { return ::clReleaseKernel(kernel); }
-};
-
-template <>
-struct ReferenceHandler<cl_event>
-{
-    static cl_int retain(cl_event event)
-    { return ::clRetainEvent(event); }
-    static cl_int release(cl_event event)
-    { return ::clReleaseEvent(event); }
-};
-
-
-// Extracts version number with major in the upper 16 bits, minor in the lower 16
-static cl_uint getVersion(const char *versionInfo)
-{
-    int highVersion = 0;
-    int lowVersion = 0;
-    int index = 7;
-    while(versionInfo[index] != '.' ) {
-        highVersion *= 10;
-        highVersion += versionInfo[index]-'0';
-        ++index;
-    }
-    ++index;
-    while(versionInfo[index] != ' ' ) {
-        lowVersion *= 10;
-        lowVersion += versionInfo[index]-'0';
-        ++index;
-    }
-    return (highVersion << 16) | lowVersion;
-}
-
-static cl_uint getPlatformVersion(cl_platform_id platform)
-{
-    ::size_t size = 0;
-    clGetPlatformInfo(platform, CL_PLATFORM_VERSION, 0, NULL, &size);
-    char *versionInfo = (char *) alloca(size);
-    clGetPlatformInfo(platform, CL_PLATFORM_VERSION, size, &versionInfo[0], &size);
-    return getVersion(versionInfo);
-}
-
-static cl_uint getDevicePlatformVersion(cl_device_id device)
-{
-    cl_platform_id platform;
-    clGetDeviceInfo(device, CL_DEVICE_PLATFORM, sizeof(platform), &platform, NULL);
-    return getPlatformVersion(platform);
-}
-
-#if defined(CL_VERSION_1_2) && defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-static cl_uint getContextPlatformVersion(cl_context context)
-{
-    // The platform cannot be queried directly, so we first have to grab a
-    // device and obtain its platform
-    ::size_t size = 0;
-    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &size);
-    if (size == 0)
-        return 0;
-    cl_device_id *devices = (cl_device_id *) alloca(size);
-    clGetContextInfo(context, CL_CONTEXT_DEVICES, size, devices, NULL);
-    return getDevicePlatformVersion(devices[0]);
-}
-#endif // #if defined(CL_VERSION_1_2) && defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-
-template <typename T>
-class Wrapper
-{
-public:
-    typedef T cl_type;
-
-protected:
-    cl_type object_;
-
-public:
-    Wrapper() : object_(NULL) { }
-
-    Wrapper(const cl_type &obj) : object_(obj) { }
-
-    ~Wrapper()
-    {
-        if (object_ != NULL) { release(); }
-    }
-
-    Wrapper(const Wrapper<cl_type>& rhs)
-    {
-        object_ = rhs.object_;
-        if (object_ != NULL) { detail::errHandler(retain(), __RETAIN_ERR); }
-    }
-
-    Wrapper<cl_type>& operator = (const Wrapper<cl_type>& rhs)
-    {
-        if (object_ != NULL) { detail::errHandler(release(), __RELEASE_ERR); }
-        object_ = rhs.object_;
-        if (object_ != NULL) { detail::errHandler(retain(), __RETAIN_ERR); }
-        return *this;
-    }
-
-    Wrapper<cl_type>& operator = (const cl_type &rhs)
-    {
-        if (object_ != NULL) { detail::errHandler(release(), __RELEASE_ERR); }
-        object_ = rhs;
-        return *this;
-    }
-
-    cl_type operator ()() const { return object_; }
-
-    cl_type& operator ()() { return object_; }
-
-protected:
-    template<typename Func, typename U>
-    friend inline cl_int getInfoHelper(Func, cl_uint, U*, int, typename U::cl_type);
-
-    cl_int retain() const
-    {
-        return ReferenceHandler<cl_type>::retain(object_);
-    }
-
-    cl_int release() const
-    {
-        return ReferenceHandler<cl_type>::release(object_);
-    }
-};
-
-template <>
-class Wrapper<cl_device_id>
-{
-public:
-    typedef cl_device_id cl_type;
-
-protected:
-    cl_type object_;
-    bool referenceCountable_;
-
-    static bool isReferenceCountable(cl_device_id device)
-    {
-        bool retVal = false;
-        if (device != NULL) {
-            int version = getDevicePlatformVersion(device);
-            if(version > ((1 << 16) + 1)) {
-                retVal = true;
-            }
-        }
-        return retVal;
-    }
-
-public:
-    Wrapper() : object_(NULL), referenceCountable_(false) 
-    { 
-    }
-    
-    Wrapper(const cl_type &obj) : object_(obj), referenceCountable_(false) 
-    {
-        referenceCountable_ = isReferenceCountable(obj); 
-    }
-
-    ~Wrapper()
-    {
-        if (object_ != NULL) { release(); }
-    }
-    
-    Wrapper(const Wrapper<cl_type>& rhs)
-    {
-        object_ = rhs.object_;
-        referenceCountable_ = isReferenceCountable(object_); 
-        if (object_ != NULL) { detail::errHandler(retain(), __RETAIN_ERR); }
-    }
-
-    Wrapper<cl_type>& operator = (const Wrapper<cl_type>& rhs)
-    {
-        if (object_ != NULL) { detail::errHandler(release(), __RELEASE_ERR); }
-        object_ = rhs.object_;
-        referenceCountable_ = rhs.referenceCountable_;
-        if (object_ != NULL) { detail::errHandler(retain(), __RETAIN_ERR); }
-        return *this;
-    }
-
-    Wrapper<cl_type>& operator = (const cl_type &rhs)
-    {
-        if (object_ != NULL) { detail::errHandler(release(), __RELEASE_ERR); }
-        object_ = rhs;
-        referenceCountable_ = isReferenceCountable(object_); 
-        return *this;
-    }
-
-    cl_type operator ()() const { return object_; }
-
-    cl_type& operator ()() { return object_; }
-
-protected:
-    template<typename Func, typename U>
-    friend inline cl_int getInfoHelper(Func, cl_uint, U*, int, typename U::cl_type);
-
-    template<typename Func, typename U>
-    friend inline cl_int getInfoHelper(Func, cl_uint, VECTOR_CLASS<U>*, int, typename U::cl_type);
-
-    cl_int retain() const
-    {
-        if( referenceCountable_ ) {
-            return ReferenceHandler<cl_type>::retain(object_);
-        }
-        else {
-            return CL_SUCCESS;
-        }
-    }
-
-    cl_int release() const
-    {
-        if( referenceCountable_ ) {
-            return ReferenceHandler<cl_type>::release(object_);
-        }
-        else {
-            return CL_SUCCESS;
-        }
-    }
-};
-
-} // namespace detail
-//! \endcond
-
-/*! \struct ImageFormat
- *  \brief Adds constructors and member functions for cl_image_format.
- *
- *  \see cl_image_format
- */
-struct ImageFormat : public cl_image_format
-{
-    //! \brief Default constructor - performs no initialization.
-    ImageFormat(){}
-
-    //! \brief Initializing constructor.
-    ImageFormat(cl_channel_order order, cl_channel_type type)
-    {
-        image_channel_order = order;
-        image_channel_data_type = type;
-    }
-
-    //! \brief Assignment operator.
-    ImageFormat& operator = (const ImageFormat& rhs)
-    {
-        if (this != &rhs) {
-            this->image_channel_data_type = rhs.image_channel_data_type;
-            this->image_channel_order     = rhs.image_channel_order;
-        }
-        return *this;
-    }
-};
-
-/*! \brief Class interface for cl_device_id.
- *
- *  \note Copies of these objects are inexpensive, since they don't 'own'
- *        any underlying resources or data structures.
- *
- *  \see cl_device_id
- */
-class Device : public detail::Wrapper<cl_device_id>
-{
-public:
-    //! \brief Default constructor - initializes to NULL.
-    Device() : detail::Wrapper<cl_type>() { }
-
-    /*! \brief Copy constructor.
-     * 
-     *  This simply copies the device ID value, which is an inexpensive operation.
-     */
-    Device(const Device& device) : detail::Wrapper<cl_type>(device) { }
-
-    /*! \brief Constructor from cl_device_id.
-     * 
-     *  This simply copies the device ID value, which is an inexpensive operation.
-     */
-    Device(const cl_device_id &device) : detail::Wrapper<cl_type>(device) { }
-
-    /*! \brief Returns the first device on the default context.
-     *
-     *  \see Context::getDefault()
-     */
-    static Device getDefault(cl_int * err = NULL);
-
-    /*! \brief Assignment operator from Device.
-     * 
-     *  This simply copies the device ID value, which is an inexpensive operation.
-     */
-    Device& operator = (const Device& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_device_id.
-     * 
-     *  This simply copies the device ID value, which is an inexpensive operation.
-     */
-    Device& operator = (const cl_device_id& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetDeviceInfo().
-    template <typename T>
-    cl_int getInfo(cl_device_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetDeviceInfo, object_, name, param),
-            __GET_DEVICE_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetDeviceInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_device_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_device_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    /**
-     * CL 1.2 version
-     */
-#if defined(CL_VERSION_1_2)
-    //! \brief Wrapper for clCreateSubDevices().
-    cl_int createSubDevices(
-        const cl_device_partition_property * properties,
-        VECTOR_CLASS<Device>* devices)
-    {
-        cl_uint n = 0;
-        cl_int err = clCreateSubDevices(object_, properties, 0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_SUB_DEVICES);
-        }
-
-        cl_device_id* ids = (cl_device_id*) alloca(n * sizeof(cl_device_id));
-        err = clCreateSubDevices(object_, properties, n, ids, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_SUB_DEVICES);
-        }
-
-        devices->assign(&ids[0], &ids[n]);
-        return CL_SUCCESS;
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-/**
- * CL 1.1 version that uses device fission.
- */
-#if defined(CL_VERSION_1_1)
-#if defined(USE_CL_DEVICE_FISSION)
-    cl_int createSubDevices(
-        const cl_device_partition_property_ext * properties,
-        VECTOR_CLASS<Device>* devices)
-    {
-        typedef CL_API_ENTRY cl_int 
-            ( CL_API_CALL * PFN_clCreateSubDevicesEXT)(
-                cl_device_id /*in_device*/,
-                const cl_device_partition_property_ext * /* properties */,
-                cl_uint /*num_entries*/,
-                cl_device_id * /*out_devices*/,
-                cl_uint * /*num_devices*/ ) CL_EXT_SUFFIX__VERSION_1_1;
-
-        static PFN_clCreateSubDevicesEXT pfn_clCreateSubDevicesEXT = NULL;
-        __INIT_CL_EXT_FCN_PTR(clCreateSubDevicesEXT);
-
-        cl_uint n = 0;
-        cl_int err = pfn_clCreateSubDevicesEXT(object_, properties, 0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_SUB_DEVICES);
-        }
-
-        cl_device_id* ids = (cl_device_id*) alloca(n * sizeof(cl_device_id));
-        err = pfn_clCreateSubDevicesEXT(object_, properties, n, ids, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_SUB_DEVICES);
-        }
-
-        devices->assign(&ids[0], &ids[n]);
-        return CL_SUCCESS;
-    }
-#endif // #if defined(USE_CL_DEVICE_FISSION)
-#endif // #if defined(CL_VERSION_1_1)
-};
-
-/*! \brief Class interface for cl_platform_id.
- *
- *  \note Copies of these objects are inexpensive, since they don't 'own'
- *        any underlying resources or data structures.
- *
- *  \see cl_platform_id
- */
-class Platform : public detail::Wrapper<cl_platform_id>
-{
-public:
-    //! \brief Default constructor - initializes to NULL.
-    Platform() : detail::Wrapper<cl_type>()  { }
-
-    /*! \brief Copy constructor.
-     * 
-     *  This simply copies the platform ID value, which is an inexpensive operation.
-     */
-    Platform(const Platform& platform) : detail::Wrapper<cl_type>(platform) { }
-
-    /*! \brief Constructor from cl_platform_id.
-     * 
-     *  This simply copies the platform ID value, which is an inexpensive operation.
-     */
-    Platform(const cl_platform_id &platform) : detail::Wrapper<cl_type>(platform) { }
-
-    /*! \brief Assignment operator from Platform.
-     * 
-     *  This simply copies the platform ID value, which is an inexpensive operation.
-     */
-    Platform& operator = (const Platform& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_platform_id.
-     * 
-     *  This simply copies the platform ID value, which is an inexpensive operation.
-     */
-    Platform& operator = (const cl_platform_id& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetPlatformInfo().
-    cl_int getInfo(cl_platform_info name, STRING_CLASS* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetPlatformInfo, object_, name, param),
-            __GET_PLATFORM_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetPlatformInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_platform_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_platform_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    /*! \brief Gets a list of devices for this platform.
-     * 
-     *  Wraps clGetDeviceIDs().
-     */
-    cl_int getDevices(
-        cl_device_type type,
-        VECTOR_CLASS<Device>* devices) const
-    {
-        cl_uint n = 0;
-        if( devices == NULL ) {
-            return detail::errHandler(CL_INVALID_ARG_VALUE, __GET_DEVICE_IDS_ERR);
-        }
-        cl_int err = ::clGetDeviceIDs(object_, type, 0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_DEVICE_IDS_ERR);
-        }
-
-        cl_device_id* ids = (cl_device_id*) alloca(n * sizeof(cl_device_id));
-        err = ::clGetDeviceIDs(object_, type, n, ids, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_DEVICE_IDS_ERR);
-        }
-
-        devices->assign(&ids[0], &ids[n]);
-        return CL_SUCCESS;
-    }
-
-#if defined(USE_DX_INTEROP)
-   /*! \brief Get the list of available D3D10 devices.
-     *
-     *  \param d3d_device_source.
-     *
-     *  \param d3d_object.
-     *
-     *  \param d3d_device_set.
-     *
-     *  \param devices returns a vector of OpenCL D3D10 devices found. The cl::Device
-     *  values returned in devices can be used to identify a specific OpenCL
-     *  device. If the \a devices argument is NULL, CL_INVALID_ARG_VALUE is reported.
-     *
-     *  \return One of the following values:
-     *    - CL_SUCCESS if the function is executed successfully.
-     *
-     *  The application can query specific capabilities of the OpenCL device(s)
-     *  returned by cl::getDevices. This can be used by the application to
-     *  determine which device(s) to use.
-     *
-     * \note In the case that exceptions are enabled and a return value
-     * other than CL_SUCCESS is generated, then cl::Error exception is
-     * generated.
-     */
-    cl_int getDevices(
-        cl_d3d10_device_source_khr d3d_device_source,
-        void *                     d3d_object,
-        cl_d3d10_device_set_khr    d3d_device_set,
-        VECTOR_CLASS<Device>* devices) const
-    {
-        typedef CL_API_ENTRY cl_int (CL_API_CALL *PFN_clGetDeviceIDsFromD3D10KHR)(
-            cl_platform_id platform, 
-            cl_d3d10_device_source_khr d3d_device_source, 
-            void * d3d_object,
-            cl_d3d10_device_set_khr d3d_device_set,
-            cl_uint num_entries,
-            cl_device_id * devices,
-            cl_uint* num_devices);
-
-        if( devices == NULL ) {
-            return detail::errHandler(CL_INVALID_ARG_VALUE, __GET_DEVICE_IDS_ERR);
-        }
-
-        static PFN_clGetDeviceIDsFromD3D10KHR pfn_clGetDeviceIDsFromD3D10KHR = NULL;
-        __INIT_CL_EXT_FCN_PTR_PLATFORM(object_, clGetDeviceIDsFromD3D10KHR);
-
-        cl_uint n = 0;
-        cl_int err = pfn_clGetDeviceIDsFromD3D10KHR(
-            object_, 
-            d3d_device_source, 
-            d3d_object,
-            d3d_device_set, 
-            0, 
-            NULL, 
-            &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_DEVICE_IDS_ERR);
-        }
-
-        cl_device_id* ids = (cl_device_id*) alloca(n * sizeof(cl_device_id));
-        err = pfn_clGetDeviceIDsFromD3D10KHR(
-            object_, 
-            d3d_device_source, 
-            d3d_object,
-            d3d_device_set,
-            n, 
-            ids, 
-            NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_DEVICE_IDS_ERR);
-        }
-
-        devices->assign(&ids[0], &ids[n]);
-        return CL_SUCCESS;
-    }
-#endif
-
-    /*! \brief Gets a list of available platforms.
-     * 
-     *  Wraps clGetPlatformIDs().
-     */
-    static cl_int get(
-        VECTOR_CLASS<Platform>* platforms)
-    {
-        cl_uint n = 0;
-
-        if( platforms == NULL ) {
-            return detail::errHandler(CL_INVALID_ARG_VALUE, __GET_PLATFORM_IDS_ERR);
-        }
-
-        cl_int err = ::clGetPlatformIDs(0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-        }
-
-        cl_platform_id* ids = (cl_platform_id*) alloca(
-            n * sizeof(cl_platform_id));
-        err = ::clGetPlatformIDs(n, ids, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-        }
-
-        platforms->assign(&ids[0], &ids[n]);
-        return CL_SUCCESS;
-    }
-
-    /*! \brief Gets the first available platform.
-     * 
-     *  Wraps clGetPlatformIDs(), returning the first result.
-     */
-    static cl_int get(
-        Platform * platform)
-    {
-        cl_uint n = 0;
-
-        if( platform == NULL ) {
-            return detail::errHandler(CL_INVALID_ARG_VALUE, __GET_PLATFORM_IDS_ERR);
-        }
-
-        cl_int err = ::clGetPlatformIDs(0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-        }
-
-        cl_platform_id* ids = (cl_platform_id*) alloca(
-            n * sizeof(cl_platform_id));
-        err = ::clGetPlatformIDs(n, ids, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-        }
-
-        *platform = ids[0];
-        return CL_SUCCESS;
-    }
-
-    /*! \brief Gets the first available platform, returning it by value.
-     * 
-     *  Wraps clGetPlatformIDs(), returning the first result.
-     */
-    static Platform get(
-        cl_int * errResult = NULL)
-    {
-        Platform platform;
-        cl_uint n = 0;
-        cl_int err = ::clGetPlatformIDs(0, NULL, &n);
-        if (err != CL_SUCCESS) {
-            detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-            if (errResult != NULL) { *errResult = err; }
-            // Bail out: no platform IDs can be read after a failed query.
-            return platform;
-        }
-
-        cl_platform_id* ids = (cl_platform_id*) alloca(
-            n * sizeof(cl_platform_id));
-        err = ::clGetPlatformIDs(n, ids, NULL);
-
-        if (err != CL_SUCCESS) {
-            detail::errHandler(err, __GET_PLATFORM_IDS_ERR);
-        }
-
-        if (errResult != NULL) {
-            *errResult = err;
-        }
-        
-        return ids[0];
-    }
-
-    static Platform getDefault( 
-        cl_int *errResult = NULL )
-    {
-        return get(errResult);
-    }
-
-    
-#if defined(CL_VERSION_1_2)
-    //! \brief Wrapper for clUnloadPlatformCompiler().
-    cl_int
-    unloadCompiler()
-    {
-        return ::clUnloadPlatformCompiler(object_);
-    }
-#endif // #if defined(CL_VERSION_1_2)
-}; // class Platform
-
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2))
-/**
- * Unload the OpenCL compiler.
- * \note Deprecated for OpenCL 1.2. Use Platform::unloadCompiler instead.
- */
-inline CL_EXT_PREFIX__VERSION_1_1_DEPRECATED cl_int
-UnloadCompiler() CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
-inline cl_int
-UnloadCompiler()
-{
-    return ::clUnloadCompiler();
-}
-#endif // #if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2))
-
-/*! \brief Class interface for cl_context.
- *
- *  \note Copies of these objects are shallow, meaning that the copy will refer
- *        to the same underlying cl_context as the original.  For details, see
- *        clRetainContext() and clReleaseContext().
- *
- *  \see cl_context
- */
-class Context 
-    : public detail::Wrapper<cl_context>
-{
-private:
-    static volatile int default_initialized_;
-    static Context default_;
-    static volatile cl_int default_error_;
-public:
-    /*! \brief Destructor.
-     *
-     *  This calls clReleaseContext() on the value held by this instance.
-     */
-    ~Context() { }
-
-    /*! \brief Constructs a context including a list of specified devices.
-     *
-     *  Wraps clCreateContext().
-     */
-    Context(
-        const VECTOR_CLASS<Device>& devices,
-        cl_context_properties* properties = NULL,
-        void (CL_CALLBACK * notifyFptr)(
-            const char *,
-            const void *,
-            ::size_t,
-            void *) = NULL,
-        void* data = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        ::size_t numDevices = devices.size();
-        cl_device_id* deviceIDs = (cl_device_id*) alloca(numDevices * sizeof(cl_device_id));
-        for( ::size_t deviceIndex = 0; deviceIndex < numDevices; ++deviceIndex ) {
-            deviceIDs[deviceIndex] = (devices[deviceIndex])();
-        }
-
-        object_ = ::clCreateContext(
-            properties, (cl_uint) numDevices,
-            deviceIDs,
-            notifyFptr, data, &error);
-
-        detail::errHandler(error, __CREATE_CONTEXT_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Context(
-        const Device& device,
-        cl_context_properties* properties = NULL,
-        void (CL_CALLBACK * notifyFptr)(
-            const char *,
-            const void *,
-            ::size_t,
-            void *) = NULL,
-        void* data = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        cl_device_id deviceID = device();
-
-        object_ = ::clCreateContext(
-            properties, 1,
-            &deviceID,
-            notifyFptr, data, &error);
-
-        detail::errHandler(error, __CREATE_CONTEXT_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /*! \brief Constructs a context including all or a subset of devices of a specified type.
-     *
-     *  Wraps clCreateContextFromType().
-     */
-    Context(
-        cl_device_type type,
-        cl_context_properties* properties = NULL,
-        void (CL_CALLBACK * notifyFptr)(
-            const char *,
-            const void *,
-            ::size_t,
-            void *) = NULL,
-        void* data = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-#if !defined(__APPLE__) && !defined(__MACOS)
-        cl_context_properties prop[4] = {CL_CONTEXT_PLATFORM, 0, 0, 0 };
-
-        if (properties == NULL) {
-            // Get a valid platform ID as we cannot send in a blank one
-            VECTOR_CLASS<Platform> platforms;
-            error = Platform::get(&platforms);
-            if (error != CL_SUCCESS) {
-                detail::errHandler(error, __CREATE_CONTEXT_FROM_TYPE_ERR);
-                if (err != NULL) {
-                    *err = error;
-                }
-                return;
-            }
-
-            // Check the platforms we found for a device of our specified type
-            cl_context_properties platform_id = 0;
-            for (unsigned int i = 0; i < platforms.size(); i++) {
-
-                VECTOR_CLASS<Device> devices;
-
-#if defined(__CL_ENABLE_EXCEPTIONS)
-                try {
-#endif
-
-                    error = platforms[i].getDevices(type, &devices);
-
-#if defined(__CL_ENABLE_EXCEPTIONS)
-                } catch (Error) {}
-                // Swallow the exception here: a platform with no devices of this type must
-                // not abort the search. The error code is checked below and reported if needed.
-#endif
-
-                // Only squash CL_SUCCESS and CL_DEVICE_NOT_FOUND
-                if (error != CL_SUCCESS && error != CL_DEVICE_NOT_FOUND) {
-                    detail::errHandler(error, __CREATE_CONTEXT_FROM_TYPE_ERR);
-                    if (err != NULL) {
-                        *err = error;
-                    }
-                }
-
-                if (devices.size() > 0) {
-                    platform_id = (cl_context_properties)platforms[i]();
-                    break;
-                }
-            }
-
-            if (platform_id == 0) {
-                detail::errHandler(CL_DEVICE_NOT_FOUND, __CREATE_CONTEXT_FROM_TYPE_ERR);
-                if (err != NULL) {
-                    *err = CL_DEVICE_NOT_FOUND;
-                }
-                return;
-            }
-
-            prop[1] = platform_id;
-            properties = &prop[0];
-        }
-#endif
-        object_ = ::clCreateContextFromType(
-            properties, type, notifyFptr, data, &error);
-
-        detail::errHandler(error, __CREATE_CONTEXT_FROM_TYPE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /*! \brief Returns a singleton context including all devices of CL_DEVICE_TYPE_DEFAULT.
-     *
-     *  \note All calls to this function return the same cl_context as the first.
-     */
-    static Context getDefault(cl_int * err = NULL) 
-    {
-        int state = detail::compare_exchange(
-            &default_initialized_, 
-            __DEFAULT_BEING_INITIALIZED, __DEFAULT_NOT_INITIALIZED);
-        
-        if (state & __DEFAULT_INITIALIZED) {
-            if (err != NULL) {
-                *err = default_error_;
-            }
-            return default_;
-        }
-
-        if (state & __DEFAULT_BEING_INITIALIZED) {
-            // Assume writes will propagate eventually...
-            while(default_initialized_ != __DEFAULT_INITIALIZED) {
-                detail::fence();
-            }
-
-            if (err != NULL) {
-                *err = default_error_;
-            }
-            return default_;
-        }
-
-        cl_int error;
-        default_ = Context(
-            CL_DEVICE_TYPE_DEFAULT,
-            NULL,
-            NULL,
-            NULL,
-            &error);
-
-        detail::fence();
-
-        default_error_ = error;
-        // Assume writes will propagate eventually...
-        default_initialized_ = __DEFAULT_INITIALIZED;
-
-        detail::fence();
-
-        if (err != NULL) {
-            *err = default_error_;
-        }
-        return default_;
-
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    Context() : detail::Wrapper<cl_type>() { }
-
-    /*! \brief Copy constructor.
-     * 
-     *  This calls clRetainContext() on the parameter's cl_context.
-     */
-    Context(const Context& context) : detail::Wrapper<cl_type>(context) { }
-
-    /*! \brief Constructor from cl_context - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the cl_context
-     *  into the new Context object.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Context(const cl_context& context) : detail::Wrapper<cl_type>(context) { }
-
-    /*! \brief Assignment operator from Context.
-     * 
-     *  This calls clRetainContext() on the parameter and clReleaseContext() on
-     *  the previous value held by this instance.
-     */
-    Context& operator = (const Context& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_context - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the rhs and calls
-     *  clReleaseContext() on the value previously held by this instance.
-     */
-    Context& operator = (const cl_context& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetContextInfo().
-    template <typename T>
-    cl_int getInfo(cl_context_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetContextInfo, object_, name, param),
-            __GET_CONTEXT_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetContextInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_context_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_context_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    /*! \brief Gets a list of supported image formats.
-     *  
-     *  Wraps clGetSupportedImageFormats().
-     */
-    cl_int getSupportedImageFormats(
-        cl_mem_flags flags,
-        cl_mem_object_type type,
-        VECTOR_CLASS<ImageFormat>* formats) const
-    {
-        cl_uint numEntries;
-        cl_int err = ::clGetSupportedImageFormats(
-           object_, 
-           flags,
-           type, 
-           0, 
-           NULL, 
-           &numEntries);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_SUPPORTED_IMAGE_FORMATS_ERR);
-        }
-
-        ImageFormat* value = (ImageFormat*)
-            alloca(numEntries * sizeof(ImageFormat));
-        err = ::clGetSupportedImageFormats(
-            object_, 
-            flags, 
-            type, 
-            numEntries,
-            (cl_image_format*) value, 
-            NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __GET_SUPPORTED_IMAGE_FORMATS_ERR);
-        }
-
-        formats->assign(&value[0], &value[numEntries]);
-        return CL_SUCCESS;
-    }
-};
-
-inline Device Device::getDefault(cl_int * err)
-{
-    cl_int error;
-    Device device;
-
-    Context context = Context::getDefault(&error);
-    detail::errHandler(error, __CREATE_CONTEXT_ERR);
-
-    if (error != CL_SUCCESS) {
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-    else {
-        device = context.getInfo<CL_CONTEXT_DEVICES>()[0];
-        if (err != NULL) {
-            *err = CL_SUCCESS;
-        }
-    }
-
-    return device;
-}
-
-
-#ifdef _WIN32
-__declspec(selectany) volatile int Context::default_initialized_ = __DEFAULT_NOT_INITIALIZED;
-__declspec(selectany) Context Context::default_;
-__declspec(selectany) volatile cl_int Context::default_error_ = CL_SUCCESS;
-#else
-__attribute__((weak)) volatile int Context::default_initialized_ = __DEFAULT_NOT_INITIALIZED;
-__attribute__((weak)) Context Context::default_;
-__attribute__((weak)) volatile cl_int Context::default_error_ = CL_SUCCESS;
-#endif
-
-/*! \brief Class interface for cl_event.
- *
- *  \note Copies of these objects are shallow, meaning that the copy will refer
- *        to the same underlying cl_event as the original.  For details, see
- *        clRetainEvent() and clReleaseEvent().
- *
- *  \see cl_event
- */
-class Event : public detail::Wrapper<cl_event>
-{
-public:
-    /*! \brief Destructor.
-     *
-     *  This calls clReleaseEvent() on the value held by this instance.
-     */
-    ~Event() { }
- 
-    //! \brief Default constructor - initializes to NULL.
-    Event() : detail::Wrapper<cl_type>() { }
-
-    /*! \brief Copy constructor.
-     * 
-     *  This calls clRetainEvent() on the parameter's cl_event.
-     */
-    Event(const Event& event) : detail::Wrapper<cl_type>(event) { }
-
-    /*! \brief Constructor from cl_event - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the cl_event
-     *  into the new Event object.
-     */
-    Event(const cl_event& event) : detail::Wrapper<cl_type>(event) { }
-
-    /*! \brief Assignment operator from Event.
-     *
-     *  This calls clRetainEvent() on the parameter and clReleaseEvent() on
-     *  the previous value held by this instance.
-     */
-    Event& operator = (const Event& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_event - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the rhs and calls
-     *  clReleaseEvent() on the value previously held by this instance.
-     */
-    Event& operator = (const cl_event& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetEventInfo().
-    template <typename T>
-    cl_int getInfo(cl_event_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetEventInfo, object_, name, param),
-            __GET_EVENT_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetEventInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_event_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_event_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    //! \brief Wrapper for clGetEventProfilingInfo().
-    template <typename T>
-    cl_int getProfilingInfo(cl_profiling_info name, T* param) const
-    {
-        return detail::errHandler(detail::getInfo(
-            &::clGetEventProfilingInfo, object_, name, param),
-            __GET_EVENT_PROFILE_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetEventProfilingInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_profiling_info, name>::param_type
-    getProfilingInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_profiling_info, name>::param_type param;
-        cl_int result = getProfilingInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    /*! \brief Blocks the calling thread until this event completes.
-     * 
-     *  Wraps clWaitForEvents().
-     */
-    cl_int wait() const
-    {
-        return detail::errHandler(
-            ::clWaitForEvents(1, &object_),
-            __WAIT_FOR_EVENTS_ERR);
-    }
-
-#if defined(CL_VERSION_1_1)
-    /*! \brief Registers a user callback function for a specific command execution status.
-     *
-     *  Wraps clSetEventCallback().
-     */
-    cl_int setCallback(
-        cl_int type,
-        void (CL_CALLBACK * pfn_notify)(cl_event, cl_int, void *),        
-        void * user_data = NULL)
-    {
-        return detail::errHandler(
-            ::clSetEventCallback(
-                object_,
-                type,
-                pfn_notify,
-                user_data), 
-            __SET_EVENT_CALLBACK_ERR);
-    }
-#endif
-
-    /*! \brief Blocks the calling thread until every event specified is complete.
-     * 
-     *  Wraps clWaitForEvents().
-     */
-    static cl_int
-    waitForEvents(const VECTOR_CLASS<Event>& events)
-    {
-        return detail::errHandler(
-            ::clWaitForEvents(
-                (cl_uint) events.size(), (cl_event*)&events.front()),
-            __WAIT_FOR_EVENTS_ERR);
-    }
-};
-
-#if defined(CL_VERSION_1_1)
-/*! \brief Class interface for user events (a subset of cl_event's).
- * 
- *  See Event for details about copy semantics, etc.
- */
-class UserEvent : public Event
-{
-public:
-    /*! \brief Constructs a user event on a given context.
-     *
-     *  Wraps clCreateUserEvent().
-     */
-    UserEvent(
-        const Context& context,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateUserEvent(
-            context(),
-            &error);
-
-        detail::errHandler(error, __CREATE_USER_EVENT_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    UserEvent() : Event() { }
-
-    //! \brief Copy constructor - performs shallow copy.
-    UserEvent(const UserEvent& event) : Event(event) { }
-
-    //! \brief Assignment Operator - performs shallow copy.
-    UserEvent& operator = (const UserEvent& rhs)
-    {
-        if (this != &rhs) {
-            Event::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Sets the execution status of a user event object.
-     *
-     *  Wraps clSetUserEventStatus().
-     */
-    cl_int setStatus(cl_int status)
-    {
-        return detail::errHandler(
-            ::clSetUserEventStatus(object_,status), 
-            __SET_USER_EVENT_STATUS_ERR);
-    }
-};
-#endif
-
-/*! \brief Blocks the calling thread until every event specified is complete.
- * 
- *  Wraps clWaitForEvents().
- */
-inline static cl_int
-WaitForEvents(const VECTOR_CLASS<Event>& events)
-{
-    return detail::errHandler(
-        ::clWaitForEvents(
-            (cl_uint) events.size(), (cl_event*)&events.front()),
-        __WAIT_FOR_EVENTS_ERR);
-}
-
-/*! \brief Class interface for cl_mem.
- *
- *  \note Copies of these objects are shallow, meaning that the copy will refer
- *        to the same underlying cl_mem as the original.  For details, see
- *        clRetainMemObject() and clReleaseMemObject().
- *
- *  \see cl_mem
- */
-class Memory : public detail::Wrapper<cl_mem>
-{
-public:
- 
-    /*! \brief Destructor.
-     *
-     *  This calls clReleaseMemObject() on the value held by this instance.
-     */
-    ~Memory() {}
-
-    //! \brief Default constructor - initializes to NULL.
-    Memory() : detail::Wrapper<cl_type>() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     * 
-     *  This calls clRetainMemObject() on the parameter's cl_mem.
-     */
-    Memory(const Memory& memory) : detail::Wrapper<cl_type>(memory) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the cl_mem
-     *  into the new Memory object.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Memory(const cl_mem& memory) : detail::Wrapper<cl_type>(memory) { }
-
-    /*! \brief Assignment operator from Memory.
-     * 
-     *  This calls clRetainMemObject() on the parameter and clReleaseMemObject()
-     *  on the previous value held by this instance.
-     */
-    Memory& operator = (const Memory& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_mem - takes ownership.
-     *
-     *  This effectively transfers ownership of a refcount on the rhs and calls
-     *  clReleaseMemObject() on the value previously held by this instance.
-     */
-    Memory& operator = (const cl_mem& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetMemObjectInfo().
-    template <typename T>
-    cl_int getInfo(cl_mem_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetMemObjectInfo, object_, name, param),
-            __GET_MEM_OBJECT_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetMemObjectInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_mem_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_mem_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-#if defined(CL_VERSION_1_1)
-    /*! \brief Registers a callback function to be called when the memory object
-     *         is no longer needed.
-     *
-     *  Wraps clSetMemObjectDestructorCallback().
-     *
-     *  Repeated calls to this function, for a given cl_mem value, will append
-     *  to the list of functions called (in reverse order) when the memory
-     *  object's resources are freed and the memory object is deleted.
-     *
-     *  \note
-     *  The registered callbacks are associated with the underlying cl_mem
-     *  value - not the Memory class instance.
-     */
-    cl_int setDestructorCallback(
-        void (CL_CALLBACK * pfn_notify)(cl_mem, void *),        
-        void * user_data = NULL)
-    {
-        return detail::errHandler(
-            ::clSetMemObjectDestructorCallback(
-                object_,
-                pfn_notify,
-                user_data), 
-            __SET_MEM_OBJECT_DESTRUCTOR_CALLBACK_ERR);
-    }
-#endif
-
-};
-
-// Pre-declare copy functions
-class Buffer;
-template< typename IteratorType >
-cl_int copy( IteratorType startIterator, IteratorType endIterator, cl::Buffer &buffer );
-template< typename IteratorType >
-cl_int copy( const cl::Buffer &buffer, IteratorType startIterator, IteratorType endIterator );
-template< typename IteratorType >
-cl_int copy( const CommandQueue &queue, IteratorType startIterator, IteratorType endIterator, cl::Buffer &buffer );
-template< typename IteratorType >
-cl_int copy( const CommandQueue &queue, const cl::Buffer &buffer, IteratorType startIterator, IteratorType endIterator );
-
-
-/*! \brief Class interface for Buffer Memory Objects.
- * 
- *  See Memory for details about copy semantics, etc.
- *
- *  \see Memory
- */
-class Buffer : public Memory
-{
-public:
-
-    /*! \brief Constructs a Buffer in a specified context.
-     *
-     *  Wraps clCreateBuffer().
-     *
-     *  \param host_ptr Storage to be used if the CL_MEM_USE_HOST_PTR flag was
-     *                  specified.  Note alignment & exclusivity requirements.
-     */
-    Buffer(
-        const Context& context,
-        cl_mem_flags flags,
-        ::size_t size,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateBuffer(context(), flags, size, host_ptr, &error);
-
-        detail::errHandler(error, __CREATE_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /*! \brief Constructs a Buffer in the default context.
-     *
-     *  Wraps clCreateBuffer().
-     *
-     *  \param host_ptr Storage to be used if the CL_MEM_USE_HOST_PTR flag was
-     *                  specified.  Note alignment & exclusivity requirements.
-     *
-     *  \see Context::getDefault()
-     */
-    Buffer(
-        cl_mem_flags flags,
-        ::size_t size,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        Context context = Context::getDefault(err);
-
-        object_ = ::clCreateBuffer(context(), flags, size, host_ptr, &error);
-
-        detail::errHandler(error, __CREATE_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /*!
-     * \brief Constructs a Buffer from a host container via iterators.
-     * IteratorType must be random access.
-     * If useHostPtr is true, the iterators must refer to contiguous data.
-     */
-    template< typename IteratorType >
-    Buffer(
-        IteratorType startIterator,
-        IteratorType endIterator,
-        bool readOnly,
-        bool useHostPtr = false,
-        cl_int* err = NULL)
-    {
-        typedef typename std::iterator_traits<IteratorType>::value_type DataType;
-        cl_int error;
-
-        cl_mem_flags flags = 0;
-        if( readOnly ) {
-            flags |= CL_MEM_READ_ONLY;
-        }
-        else {
-            flags |= CL_MEM_READ_WRITE;
-        }
-        if( useHostPtr ) {
-            flags |= CL_MEM_USE_HOST_PTR;
-        }
-        
-        ::size_t size = sizeof(DataType)*(endIterator - startIterator);
-
-        Context context = Context::getDefault(err);
-
-        if( useHostPtr ) {
-            object_ = ::clCreateBuffer(context(), flags, size, static_cast<DataType*>(&*startIterator), &error);
-        } else {
-            object_ = ::clCreateBuffer(context(), flags, size, 0, &error);
-        }
-
-        detail::errHandler(error, __CREATE_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-
-        if( !useHostPtr ) {
-            error = cl::copy(startIterator, endIterator, *this);
-            detail::errHandler(error, __CREATE_BUFFER_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-    }
-
-    /*!
-     * \brief Constructs a Buffer from a host container via iterators using a specified context.
-     * IteratorType must be random access.
-     * If useHostPtr is true, the iterators must refer to contiguous data.
-     */
-    template< typename IteratorType >
-    Buffer(const Context &context, IteratorType startIterator, IteratorType endIterator,
-        bool readOnly, bool useHostPtr = false, cl_int* err = NULL);
-
-    //! \brief Default constructor - initializes to NULL.
-    Buffer() : Memory() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Buffer(const Buffer& buffer) : Memory(buffer) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Buffer(const cl_mem& buffer) : Memory(buffer) { }
-
-    /*! \brief Assignment from Buffer - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Buffer& operator = (const Buffer& rhs)
-    {
-        if (this != &rhs) {
-            Memory::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Buffer& operator = (const cl_mem& rhs)
-    {
-        Memory::operator=(rhs);
-        return *this;
-    }
-
-#if defined(CL_VERSION_1_1)
-    /*! \brief Creates a new buffer object from this.
-     *
-     *  Wraps clCreateSubBuffer().
-     */
-    Buffer createSubBuffer(
-        cl_mem_flags flags,
-        cl_buffer_create_type buffer_create_type,
-        const void * buffer_create_info,
-        cl_int * err = NULL)
-    {
-        Buffer result;
-        cl_int error;
-        result.object_ = ::clCreateSubBuffer(
-            object_, 
-            flags, 
-            buffer_create_type, 
-            buffer_create_info, 
-            &error);
-
-        detail::errHandler(error, __CREATE_SUBBUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-
-        return result;
-    }        
-#endif
-};
-
-#if defined (USE_DX_INTEROP)
-/*! \brief Class interface for creating OpenCL buffers from ID3D10Buffer's.
- *
- *  This is provided to facilitate interoperability with Direct3D.
- * 
- *  See Memory for details about copy semantics, etc.
- *
- *  \see Memory
- */
-class BufferD3D10 : public Buffer
-{
-public:
-    typedef CL_API_ENTRY cl_mem (CL_API_CALL *PFN_clCreateFromD3D10BufferKHR)(
-    cl_context context, cl_mem_flags flags, ID3D10Buffer*  buffer,
-    cl_int* errcode_ret);
-
-    /*! \brief Constructs a BufferD3D10, in a specified context, from a
-     *         given ID3D10Buffer.
-     *
-     *  Wraps clCreateFromD3D10BufferKHR().
-     */
-    BufferD3D10(
-        const Context& context,
-        cl_mem_flags flags,
-        ID3D10Buffer* bufobj,
-        cl_int * err = NULL)
-    {
-        static PFN_clCreateFromD3D10BufferKHR pfn_clCreateFromD3D10BufferKHR = NULL;
-
-#if defined(CL_VERSION_1_2)
-        vector<cl_context_properties> props = context.getInfo<CL_CONTEXT_PROPERTIES>();
-        cl_platform platform = -1;
-        for( int i = 0; i < props.size(); ++i ) {
-            if( props[i] == CL_CONTEXT_PLATFORM ) {
-                platform = props[i+1];
-            }
-        }
-        __INIT_CL_EXT_FCN_PTR_PLATFORM(platform, clCreateFromD3D10BufferKHR);
-#endif
-#if defined(CL_VERSION_1_1)
-        __INIT_CL_EXT_FCN_PTR(clCreateFromD3D10BufferKHR);
-#endif
-
-        cl_int error;
-        object_ = pfn_clCreateFromD3D10BufferKHR(
-            context(),
-            flags,
-            bufobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    BufferD3D10() : Buffer() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferD3D10(const BufferD3D10& buffer) : Buffer(buffer) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS BufferD3D10(const cl_mem& buffer) : Buffer(buffer) { }
-
-    /*! \brief Assignment from BufferD3D10 - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferD3D10& operator = (const BufferD3D10& rhs)
-    {
-        if (this != &rhs) {
-            Buffer::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferD3D10& operator = (const cl_mem& rhs)
-    {
-        Buffer::operator=(rhs);
-        return *this;
-    }
-};
-#endif
-
-/*! \brief Class interface for GL Buffer Memory Objects.
- *
- *  This is provided to facilitate interoperability with OpenGL.
- * 
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class BufferGL : public Buffer
-{
-public:
-    /*! \brief Constructs a BufferGL in a specified context, from a given
-     *         GL buffer.
-     *
-     *  Wraps clCreateFromGLBuffer().
-     */
-    BufferGL(
-        const Context& context,
-        cl_mem_flags flags,
-        GLuint bufobj,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateFromGLBuffer(
-            context(),
-            flags,
-            bufobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    BufferGL() : Buffer() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferGL(const BufferGL& buffer) : Buffer(buffer) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS BufferGL(const cl_mem& buffer) : Buffer(buffer) { }
-
-    /*! \brief Assignment from BufferGL - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferGL& operator = (const BufferGL& rhs)
-    {
-        if (this != &rhs) {
-            Buffer::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferGL& operator = (const cl_mem& rhs)
-    {
-        Buffer::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetGLObjectInfo().
-    cl_int getObjectInfo(
-        cl_gl_object_type *type,
-        GLuint * gl_object_name)
-    {
-        return detail::errHandler(
-            ::clGetGLObjectInfo(object_,type,gl_object_name),
-            __GET_GL_OBJECT_INFO_ERR);
-    }
-};
-
-/*! \brief Class interface for GL Render Buffer Memory Objects.
- *
- *  This is provided to facilitate interoperability with OpenGL.
- * 
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class BufferRenderGL : public Buffer
-{
-public:
-    /*! \brief Constructs a BufferRenderGL in a specified context, from a given
-     *         GL Renderbuffer.
-     *
-     *  Wraps clCreateFromGLRenderbuffer().
-     */
-    BufferRenderGL(
-        const Context& context,
-        cl_mem_flags flags,
-        GLuint bufobj,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateFromGLRenderbuffer(
-            context(),
-            flags,
-            bufobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_RENDER_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    BufferRenderGL() : Buffer() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferRenderGL(const BufferRenderGL& buffer) : Buffer(buffer) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS BufferRenderGL(const cl_mem& buffer) : Buffer(buffer) { }
-
-    /*! \brief Assignment from BufferRenderGL - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferRenderGL& operator = (const BufferRenderGL& rhs)
-    {
-        if (this != &rhs) {
-            Buffer::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    BufferRenderGL& operator = (const cl_mem& rhs)
-    {
-        Buffer::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetGLObjectInfo().
-    cl_int getObjectInfo(
-        cl_gl_object_type *type,
-        GLuint * gl_object_name)
-    {
-        return detail::errHandler(
-            ::clGetGLObjectInfo(object_,type,gl_object_name),
-            __GET_GL_OBJECT_INFO_ERR);
-    }
-};
-
-/*! \brief C++ base class for Image Memory objects.
- *
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class Image : public Memory
-{
-protected:
-    //! \brief Default constructor - initializes to NULL.
-    Image() : Memory() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image(const Image& image) : Memory(image) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image(const cl_mem& image) : Memory(image) { }
-
-    /*! \brief Assignment from Image - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image& operator = (const Image& rhs)
-    {
-        if (this != &rhs) {
-            Memory::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image& operator = (const cl_mem& rhs)
-    {
-        Memory::operator=(rhs);
-        return *this;
-    }
-
-public:
-    //! \brief Wrapper for clGetImageInfo().
-    template <typename T>
-    cl_int getImageInfo(cl_image_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetImageInfo, object_, name, param),
-            __GET_IMAGE_INFO_ERR);
-    }
-    
-    //! \brief Wrapper for clGetImageInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_image_info, name>::param_type
-    getImageInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_image_info, name>::param_type param;
-        cl_int result = getImageInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-};
-
-#if defined(CL_VERSION_1_2)
-/*! \brief Class interface for 1D Image Memory objects.
- *
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class Image1D : public Image
-{
-public:
-    /*! \brief Constructs a 1D Image in a specified context.
-     *
-     *  Wraps clCreateImage().
-     */
-    Image1D(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t width,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        cl_image_desc desc =
-        {
-            CL_MEM_OBJECT_IMAGE1D,
-            width,
-            0, 0, 0, 0, 0, 0, 0, 0
-        };
-        object_ = ::clCreateImage(
-            context(), 
-            flags, 
-            &format, 
-            &desc, 
-            host_ptr, 
-            &error);
-
-        detail::errHandler(error, __CREATE_IMAGE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    Image1D() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image1D(const Image1D& image1D) : Image(image1D) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image1D(const cl_mem& image1D) : Image(image1D) { }
-
-    /*! \brief Assignment from Image1D - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image1D& operator = (const Image1D& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image1D& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-
-/*! \class Image1DBuffer
- * \brief Image interface for 1D buffer images.
- */
-class Image1DBuffer : public Image
-{
-public:
-    Image1DBuffer(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t width,
-        const Buffer &buffer,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        cl_image_desc desc =
-        {
-            CL_MEM_OBJECT_IMAGE1D_BUFFER,
-            width,
-            0, 0, 0, 0, 0, 0, 0,
-            buffer()
-        };
-        object_ = ::clCreateImage(
-            context(), 
-            flags, 
-            &format, 
-            &desc, 
-            NULL, 
-            &error);
-
-        detail::errHandler(error, __CREATE_IMAGE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Image1DBuffer() { }
-
-    Image1DBuffer(const Image1DBuffer& image1D) : Image(image1D) { }
-
-    __CL_EXPLICIT_CONSTRUCTORS Image1DBuffer(const cl_mem& image1D) : Image(image1D) { }
-
-    Image1DBuffer& operator = (const Image1DBuffer& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    Image1DBuffer& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-
-/*! \class Image1DArray
- * \brief Image interface for arrays of 1D images.
- */
-class Image1DArray : public Image
-{
-public:
-    Image1DArray(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t arraySize,
-        ::size_t width,
-        ::size_t rowPitch,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        cl_image_desc desc =
-        {
-            CL_MEM_OBJECT_IMAGE1D_ARRAY,
-            width,
-            0, 0,  // height, depth (unused)
-            arraySize,
-            rowPitch,
-            0, 0, 0, 0
-        };
-        object_ = ::clCreateImage(
-            context(), 
-            flags, 
-            &format, 
-            &desc, 
-            host_ptr, 
-            &error);
-
-        detail::errHandler(error, __CREATE_IMAGE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Image1DArray() { }
-
-    Image1DArray(const Image1DArray& imageArray) : Image(imageArray) { }
-
-    __CL_EXPLICIT_CONSTRUCTORS Image1DArray(const cl_mem& imageArray) : Image(imageArray) { }
-
-    Image1DArray& operator = (const Image1DArray& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    Image1DArray& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-#endif // #if defined(CL_VERSION_1_2)
-
-
-/*! \brief Class interface for 2D Image Memory objects.
- *
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class Image2D : public Image
-{
-public:
-    /*! \brief Constructs a 2D Image in a specified context.
-     *
-     *  Wraps clCreateImage(), falling back to clCreateImage2D() for pre-1.2 OpenCL.
-     */
-    Image2D(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t width,
-        ::size_t height,
-        ::size_t row_pitch = 0,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        bool useCreateImage;
-
-#if defined(CL_VERSION_1_2) && defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-        // Run-time decision based on the actual platform
-        {
-            cl_uint version = detail::getContextPlatformVersion(context());
-            useCreateImage = (version >= 0x10002); // OpenCL 1.2 or above
-        }
-#elif defined(CL_VERSION_1_2)
-        useCreateImage = true;
-#else
-        useCreateImage = false;
-#endif
-
-#if defined(CL_VERSION_1_2)
-        if (useCreateImage)
-        {
-            cl_image_desc desc =
-            {
-                CL_MEM_OBJECT_IMAGE2D,
-                width,
-                height,
-                0, 0, // depth, array size (unused)
-                row_pitch,
-                0, 0, 0, 0
-            };
-            object_ = ::clCreateImage(
-                context(),
-                flags,
-                &format,
-                &desc,
-                host_ptr,
-                &error);
-
-            detail::errHandler(error, __CREATE_IMAGE_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-#endif // #if defined(CL_VERSION_1_2)
-#if !defined(CL_VERSION_1_2) || defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-        if (!useCreateImage)
-        {
-            object_ = ::clCreateImage2D(
-                context(), flags, &format, width, height, row_pitch, host_ptr, &error);
-
-            detail::errHandler(error, __CREATE_IMAGE2D_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-#endif // #if !defined(CL_VERSION_1_2) || defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    Image2D() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image2D(const Image2D& image2D) : Image(image2D) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image2D(const cl_mem& image2D) : Image(image2D) { }
-
-    /*! \brief Assignment from Image2D - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image2D& operator = (const Image2D& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image2D& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-
-
-#if !defined(CL_VERSION_1_2)
-/*! \brief Class interface for GL 2D Image Memory objects.
- *
- *  This is provided to facilitate interoperability with OpenGL.
- * 
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- *  \note Deprecated for OpenCL 1.2. Please use ImageGL instead.
- */
-class CL_EXT_PREFIX__VERSION_1_1_DEPRECATED Image2DGL CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED : public Image2D
-{
-public:
-    /*! \brief Constructs an Image2DGL in a specified context, from a given
-     *         GL Texture.
-     *
-     *  Wraps clCreateFromGLTexture2D().
-     */
-    Image2DGL(
-        const Context& context,
-        cl_mem_flags flags,
-        GLenum target,
-        GLint  miplevel,
-        GLuint texobj,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateFromGLTexture2D(
-            context(),
-            flags,
-            target,
-            miplevel,
-            texobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_TEXTURE_2D_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-
-    }
-    
-    //! \brief Default constructor - initializes to NULL.
-    Image2DGL() : Image2D() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image2DGL(const Image2DGL& image) : Image2D(image) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image2DGL(const cl_mem& image) : Image2D(image) { }
-
-    /*! \brief Assignment from Image2DGL - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image2DGL& operator = (const Image2DGL& rhs)
-    {
-        if (this != &rhs) {
-            Image2D::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image2DGL& operator = (const cl_mem& rhs)
-    {
-        Image2D::operator=(rhs);
-        return *this;
-    }
-};
-#endif // #if !defined(CL_VERSION_1_2)
-
-#if defined(CL_VERSION_1_2)
-/*! \class Image2DArray
- * \brief Image interface for arrays of 2D images.
- */
-class Image2DArray : public Image
-{
-public:
-    Image2DArray(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t arraySize,
-        ::size_t width,
-        ::size_t height,
-        ::size_t rowPitch,
-        ::size_t slicePitch,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        cl_image_desc desc =
-        {
-            CL_MEM_OBJECT_IMAGE2D_ARRAY,
-            width,
-            height,
-            0,       // depth (unused)
-            arraySize,
-            rowPitch,
-            slicePitch,
-            0, 0, 0
-        };
-        object_ = ::clCreateImage(
-            context(), 
-            flags, 
-            &format, 
-            &desc, 
-            host_ptr, 
-            &error);
-
-        detail::errHandler(error, __CREATE_IMAGE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Image2DArray() { }
-
-    Image2DArray(const Image2DArray& imageArray) : Image(imageArray) { }
-
-    __CL_EXPLICIT_CONSTRUCTORS Image2DArray(const cl_mem& imageArray) : Image(imageArray) { }
-
-    Image2DArray& operator = (const Image2DArray& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    Image2DArray& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-#endif // #if defined(CL_VERSION_1_2)
-
-/*! \brief Class interface for 3D Image Memory objects.
- *
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class Image3D : public Image
-{
-public:
-    /*! \brief Constructs a 3D Image in a specified context.
-     *
-     *  Wraps clCreateImage(), falling back to clCreateImage3D() for pre-1.2 OpenCL.
-     */
-    Image3D(
-        const Context& context,
-        cl_mem_flags flags,
-        ImageFormat format,
-        ::size_t width,
-        ::size_t height,
-        ::size_t depth,
-        ::size_t row_pitch = 0,
-        ::size_t slice_pitch = 0,
-        void* host_ptr = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        bool useCreateImage;
-
-#if defined(CL_VERSION_1_2) && defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-        // Run-time decision based on the actual platform
-        {
-            cl_uint version = detail::getContextPlatformVersion(context());
-            useCreateImage = (version >= 0x10002); // OpenCL 1.2 or above
-        }
-#elif defined(CL_VERSION_1_2)
-        useCreateImage = true;
-#else
-        useCreateImage = false;
-#endif
-
-#if defined(CL_VERSION_1_2)
-        if (useCreateImage)
-        {
-            cl_image_desc desc =
-            {
-                CL_MEM_OBJECT_IMAGE3D,
-                width,
-                height,
-                depth,
-                0,      // array size (unused)
-                row_pitch,
-                slice_pitch,
-                0, 0, 0
-            };
-            object_ = ::clCreateImage(
-                context(), 
-                flags, 
-                &format, 
-                &desc, 
-                host_ptr, 
-                &error);
-
-            detail::errHandler(error, __CREATE_IMAGE_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-#endif  // #if defined(CL_VERSION_1_2)
-#if !defined(CL_VERSION_1_2) || defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-        if (!useCreateImage)
-        {
-            object_ = ::clCreateImage3D(
-                context(), flags, &format, width, height, depth, row_pitch,
-                slice_pitch, host_ptr, &error);
-
-            detail::errHandler(error, __CREATE_IMAGE3D_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-#endif // #if !defined(CL_VERSION_1_2) || defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS)
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    Image3D() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image3D(const Image3D& image3D) : Image(image3D) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image3D(const cl_mem& image3D) : Image(image3D) { }
-
-    /*! \brief Assignment from Image3D - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image3D& operator = (const Image3D& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image3D& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-
-#if !defined(CL_VERSION_1_2)
-/*! \brief Class interface for GL 3D Image Memory objects.
- *
- *  This is provided to facilitate interoperability with OpenGL.
- * 
- *  See Memory for details about copy semantics, etc.
- * 
- *  \see Memory
- */
-class Image3DGL : public Image3D
-{
-public:
-    /*! \brief Constructs an Image3DGL in a specified context, from a given
-     *         GL Texture.
-     *
-     *  Wraps clCreateFromGLTexture3D().
-     */
-    Image3DGL(
-        const Context& context,
-        cl_mem_flags flags,
-        GLenum target,
-        GLint  miplevel,
-        GLuint texobj,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateFromGLTexture3D(
-            context(),
-            flags,
-            target,
-            miplevel,
-            texobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_TEXTURE_3D_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    //! \brief Default constructor - initializes to NULL.
-    Image3DGL() : Image3D() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image3DGL(const Image3DGL& image) : Image3D(image) { }
-
-    /*! \brief Constructor from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Image3DGL(const cl_mem& image) : Image3D(image) { }
-
-    /*! \brief Assignment from Image3DGL - performs shallow copy.
-     *
-     *  See Memory for further details.
-     */
-    Image3DGL& operator = (const Image3DGL& rhs)
-    {
-        if (this != &rhs) {
-            Image3D::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment from cl_mem - takes ownership.
-     *
-     *  See Memory for further details.
-     */
-    Image3DGL& operator = (const cl_mem& rhs)
-    {
-        Image3D::operator=(rhs);
-        return *this;
-    }
-};
-#endif // #if !defined(CL_VERSION_1_2)
-
-#if defined(CL_VERSION_1_2)
-/*! \class ImageGL
- * \brief general image interface for GL interop.
- * We abstract the 2D and 3D GL images into a single instance here
- * that wraps all GL sourced images on the grounds that setup information
- * was performed by OpenCL anyway.
- */
-class ImageGL : public Image
-{
-public:
-    ImageGL(
-        const Context& context,
-        cl_mem_flags flags,
-        GLenum target,
-        GLint  miplevel,
-        GLuint texobj,
-        cl_int * err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateFromGLTexture(
-            context(), 
-            flags, 
-            target,
-            miplevel,
-            texobj,
-            &error);
-
-        detail::errHandler(error, __CREATE_GL_TEXTURE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    ImageGL() : Image() { }
-
-    ImageGL(const ImageGL& image) : Image(image) { }
-
-    __CL_EXPLICIT_CONSTRUCTORS ImageGL(const cl_mem& image) : Image(image) { }
-
-    ImageGL& operator = (const ImageGL& rhs)
-    {
-        if (this != &rhs) {
-            Image::operator=(rhs);
-        }
-        return *this;
-    }
-
-    ImageGL& operator = (const cl_mem& rhs)
-    {
-        Image::operator=(rhs);
-        return *this;
-    }
-};
-#endif // #if defined(CL_VERSION_1_2)
-
-/*! \brief Class interface for cl_sampler.
- *
- *  \note Copies of these objects are shallow, meaning that the copy will refer
- *        to the same underlying cl_sampler as the original.  For details, see
- *        clRetainSampler() and clReleaseSampler().
- *
- *  \see cl_sampler 
- */
-class Sampler : public detail::Wrapper<cl_sampler>
-{
-public:
-    /*! \brief Destructor.
-     *
-     *  This calls clReleaseSampler() on the value held by this instance.
-     */
-    ~Sampler() { }
-
-    //! \brief Default constructor - initializes to NULL.
-    Sampler() { }
-
-    /*! \brief Constructs a Sampler in a specified context.
-     *
-     *  Wraps clCreateSampler().
-     */
-    Sampler(
-        const Context& context,
-        cl_bool normalized_coords,
-        cl_addressing_mode addressing_mode,
-        cl_filter_mode filter_mode,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateSampler(
-            context(), 
-            normalized_coords,
-            addressing_mode,
-            filter_mode,
-            &error);
-
-        detail::errHandler(error, __CREATE_SAMPLER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     * 
-     *  This calls clRetainSampler() on the parameter's cl_sampler.
-     */
-    Sampler(const Sampler& sampler) : detail::Wrapper<cl_type>(sampler) { }
-
-    /*! \brief Constructor from cl_sampler - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the cl_sampler
-     *  into the new Sampler object.
-     */
-    Sampler(const cl_sampler& sampler) : detail::Wrapper<cl_type>(sampler) { }
-
-    /*! \brief Assignment operator from Sampler.
-     * 
-     *  This calls clRetainSampler() on the parameter and clReleaseSampler()
-     *  on the previous value held by this instance.
-     */
-    Sampler& operator = (const Sampler& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_sampler - takes ownership.
-     *
-     *  This effectively transfers ownership of a refcount on the rhs and calls
-     *  clReleaseSampler() on the value previously held by this instance.
-     */
-    Sampler& operator = (const cl_sampler& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    //! \brief Wrapper for clGetSamplerInfo().
-    template <typename T>
-    cl_int getInfo(cl_sampler_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetSamplerInfo, object_, name, param),
-            __GET_SAMPLER_INFO_ERR);
-    }
-
-    //! \brief Wrapper for clGetSamplerInfo() that returns by value.
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_sampler_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_sampler_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-};
-
-class Program;
-class CommandQueue;
-class Kernel;
-
-//! \brief Class interface for specifying NDRange values.
-class NDRange
-{
-private:
-    size_t<3> sizes_;
-    cl_uint dimensions_;
-
-public:
-    //! \brief Default constructor - resulting range has zero dimensions.
-    NDRange()
-        : dimensions_(0)
-    { }
-
-    //! \brief Constructs one-dimensional range.
-    NDRange(::size_t size0)
-        : dimensions_(1)
-    {
-        sizes_[0] = size0;
-    }
-
-    //! \brief Constructs two-dimensional range.
-    NDRange(::size_t size0, ::size_t size1)
-        : dimensions_(2)
-    {
-        sizes_[0] = size0;
-        sizes_[1] = size1;
-    }
-
-    //! \brief Constructs three-dimensional range.
-    NDRange(::size_t size0, ::size_t size1, ::size_t size2)
-        : dimensions_(3)
-    {
-        sizes_[0] = size0;
-        sizes_[1] = size1;
-        sizes_[2] = size2;
-    }
-
-    /*! \brief Conversion operator to const ::size_t *.
-     *  
-     *  \returns a pointer to the size of the first dimension.
-     */
-    operator const ::size_t*() const { 
-        return (const ::size_t*) sizes_; 
-    }
-
-    //! \brief Queries the number of dimensions in the range.
-    ::size_t dimensions() const { return dimensions_; }
-};
-
-//! \brief A zero-dimensional range.
-static const NDRange NullRange;
-
-//! \brief Local address wrapper for use with Kernel::setArg
-struct LocalSpaceArg
-{
-    ::size_t size_;
-};
-
-namespace detail {
-
-template <typename T>
-struct KernelArgumentHandler
-{
-    static ::size_t size(const T&) { return sizeof(T); }
-    static T* ptr(T& value) { return &value; }
-};
-
-template <>
-struct KernelArgumentHandler<LocalSpaceArg>
-{
-    static ::size_t size(const LocalSpaceArg& value) { return value.size_; }
-    static void* ptr(LocalSpaceArg&) { return NULL; }
-};
-
-} 
-//! \endcond
-
-/*! __local
- * \brief Helper function for generating LocalSpaceArg objects.
- * Deprecated. Replaced with Local.
- */
-inline CL_EXT_PREFIX__VERSION_1_1_DEPRECATED LocalSpaceArg
-__local(::size_t size) CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED;
-inline LocalSpaceArg
-__local(::size_t size)
-{
-    LocalSpaceArg ret = { size };
-    return ret;
-}
-
-/*! Local
- * \brief Helper function for generating LocalSpaceArg objects.
- */
-inline LocalSpaceArg
-Local(::size_t size)
-{
-    LocalSpaceArg ret = { size };
-    return ret;
-}
-
-//class KernelFunctor;
-
-/*! \brief Class interface for cl_kernel.
- *
- *  \note Copies of these objects are shallow, meaning that the copy will refer
- *        to the same underlying cl_kernel as the original.  For details, see
- *        clRetainKernel() and clReleaseKernel().
- *
- *  \see cl_kernel
- */
-class Kernel : public detail::Wrapper<cl_kernel>
-{
-public:
-    inline Kernel(const Program& program, const char* name, cl_int* err = NULL);
-
-    /*! \brief Destructor.
-     *
-     *  This calls clReleaseKernel() on the value held by this instance.
-     */
-    ~Kernel() { }
-
-    //! \brief Default constructor - initializes to NULL.
-    Kernel() { }
-
-    /*! \brief Copy constructor - performs shallow copy.
-     * 
-     *  This calls clRetainKernel() on the parameter's cl_kernel.
-     */
-    Kernel(const Kernel& kernel) : detail::Wrapper<cl_type>(kernel) { }
-
-    /*! \brief Constructor from cl_kernel - takes ownership.
-     * 
-     *  This effectively transfers ownership of a refcount on the cl_kernel
-     *  into the new Kernel object.
-     */
-    __CL_EXPLICIT_CONSTRUCTORS Kernel(const cl_kernel& kernel) : detail::Wrapper<cl_type>(kernel) { }
-
-    /*! \brief Assignment operator from Kernel.
-     * 
-     *  This calls clRetainKernel() on the parameter and clReleaseKernel()
-     *  on the previous value held by this instance.
-     */
-    Kernel& operator = (const Kernel& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    /*! \brief Assignment operator from cl_kernel - takes ownership.
-     *
-     *  This effectively transfers ownership of a refcount on the rhs and calls
-     *  clReleaseKernel() on the value previously held by this instance.
-     */
-    Kernel& operator = (const cl_kernel& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    template <typename T>
-    cl_int getInfo(cl_kernel_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetKernelInfo, object_, name, param),
-            __GET_KERNEL_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_kernel_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_kernel_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-#if defined(CL_VERSION_1_2)
-    template <typename T>
-    cl_int getArgInfo(cl_uint argIndex, cl_kernel_arg_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetKernelArgInfo, object_, argIndex, name, param),
-            __GET_KERNEL_ARG_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_kernel_arg_info, name>::param_type
-    getArgInfo(cl_uint argIndex, cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_kernel_arg_info, name>::param_type param;
-        cl_int result = getArgInfo(argIndex, name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-    template <typename T>
-    cl_int getWorkGroupInfo(
-        const Device& device, cl_kernel_work_group_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(
-                &::clGetKernelWorkGroupInfo, object_, device(), name, param),
-            __GET_KERNEL_WORK_GROUP_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_kernel_work_group_info, name>::param_type
-        getWorkGroupInfo(const Device& device, cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-        detail::cl_kernel_work_group_info, name>::param_type param;
-        cl_int result = getWorkGroupInfo(device, name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    template <typename T>
-    cl_int setArg(cl_uint index, T value)
-    {
-        return detail::errHandler(
-            ::clSetKernelArg(
-                object_,
-                index,
-                detail::KernelArgumentHandler<T>::size(value),
-                detail::KernelArgumentHandler<T>::ptr(value)),
-            __SET_KERNEL_ARGS_ERR);
-    }
-
-    cl_int setArg(cl_uint index, ::size_t size, void* argPtr)
-    {
-        return detail::errHandler(
-            ::clSetKernelArg(object_, index, size, argPtr),
-            __SET_KERNEL_ARGS_ERR);
-    }
-};
-
-/*! \class Program
- * \brief Program interface that implements cl_program.
- */
-class Program : public detail::Wrapper<cl_program>
-{
-public:
-    typedef VECTOR_CLASS<std::pair<const void*, ::size_t> > Binaries;
-    typedef VECTOR_CLASS<std::pair<const char*, ::size_t> > Sources;
-
-    Program(
-        const STRING_CLASS& source,
-        bool build = false,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        const char * strings = source.c_str();
-        const ::size_t length  = source.size();
-
-        Context context = Context::getDefault(err);
-
-        object_ = ::clCreateProgramWithSource(
-            context(), (cl_uint)1, &strings, &length, &error);
-
-        detail::errHandler(error, __CREATE_PROGRAM_WITH_SOURCE_ERR);
-
-        if (error == CL_SUCCESS && build) {
-
-            error = ::clBuildProgram(
-                object_,
-                0,
-                NULL,
-                "",
-                NULL,
-                NULL);
-
-            detail::errHandler(error, __BUILD_PROGRAM_ERR);
-        }
-
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Program(
-        const Context& context,
-        const STRING_CLASS& source,
-        bool build = false,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        const char * strings = source.c_str();
-        const ::size_t length  = source.size();
-
-        object_ = ::clCreateProgramWithSource(
-            context(), (cl_uint)1, &strings, &length, &error);
-
-        detail::errHandler(error, __CREATE_PROGRAM_WITH_SOURCE_ERR);
-
-        if (error == CL_SUCCESS && build) {
-
-            error = ::clBuildProgram(
-                object_,
-                0,
-                NULL,
-                "",
-                NULL,
-                NULL);
-
-            detail::errHandler(error, __BUILD_PROGRAM_ERR);
-        }
-
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    Program(
-        const Context& context,
-        const Sources& sources,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        const ::size_t n = (::size_t)sources.size();
-        ::size_t* lengths = (::size_t*) alloca(n * sizeof(::size_t));
-        const char** strings = (const char**) alloca(n * sizeof(const char*));
-
-        for (::size_t i = 0; i < n; ++i) {
-            strings[i] = sources[(int)i].first;
-            lengths[i] = sources[(int)i].second;
-        }
-
-        object_ = ::clCreateProgramWithSource(
-            context(), (cl_uint)n, strings, lengths, &error);
-
-        detail::errHandler(error, __CREATE_PROGRAM_WITH_SOURCE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    /**
-     * Construct a program object from a list of devices and a per-device list of binaries.
-     * \param context A valid OpenCL context in which to construct the program.
-     * \param devices A vector of OpenCL device objects for which the program will be created.
-     * \param binaries A vector of pairs of a pointer to a binary object and its length.
-     * \param binaryStatus An optional vector that on completion will be resized to
-     *   match the size of binaries and filled with values to specify if each binary
-     *   was successfully loaded.
-     *   Set to CL_SUCCESS if the binary was successfully loaded.
-     *   Set to CL_INVALID_VALUE if the length is 0 or the binary pointer is NULL.
-     *   Set to CL_INVALID_BINARY if the binary provided is not valid for the matching device.
-     * \param err if non-NULL will be set to CL_SUCCESS on successful operation or one of the following errors:
-     *   CL_INVALID_CONTEXT if context is not a valid context.
-     *   CL_INVALID_VALUE if the length of devices is zero; or if the length of binaries does not match the length of devices; 
-     *     or if any entry in binaries is NULL or has length 0.
-     *   CL_INVALID_DEVICE if OpenCL devices listed in devices are not in the list of devices associated with context.
-     *   CL_INVALID_BINARY if an invalid program binary was encountered for any device. binaryStatus will return specific status for each device.
-     *   CL_OUT_OF_HOST_MEMORY if there is a failure to allocate resources required by the OpenCL implementation on the host.
-     */
-    Program(
-        const Context& context,
-        const VECTOR_CLASS<Device>& devices,
-        const Binaries& binaries,
-        VECTOR_CLASS<cl_int>* binaryStatus = NULL,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        
-        const ::size_t numDevices = devices.size();
-        
-        // Catch size mismatch early and return
-        if(binaries.size() != numDevices) {
-            error = CL_INVALID_VALUE;
-            detail::errHandler(error, __CREATE_PROGRAM_WITH_BINARY_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-            return;
-        }
-
-        ::size_t* lengths = (::size_t*) alloca(numDevices * sizeof(::size_t));
-        const unsigned char** images = (const unsigned char**) alloca(numDevices * sizeof(const unsigned char*));
-
-        for (::size_t i = 0; i < numDevices; ++i) {
-            images[i] = (const unsigned char*)binaries[i].first;
-            lengths[i] = binaries[(int)i].second;
-        }
-
-        cl_device_id* deviceIDs = (cl_device_id*) alloca(numDevices * sizeof(cl_device_id));
-        for( ::size_t deviceIndex = 0; deviceIndex < numDevices; ++deviceIndex ) {
-            deviceIDs[deviceIndex] = (devices[deviceIndex])();
-        }
-
-        if(binaryStatus) {
-            binaryStatus->resize(numDevices);
-        }
-        
-        object_ = ::clCreateProgramWithBinary(
-            context(), (cl_uint) devices.size(),
-            deviceIDs,
-            lengths, images, binaryStatus != NULL
-               ? &binaryStatus->front()
-               : NULL, &error);
-
-        detail::errHandler(error, __CREATE_PROGRAM_WITH_BINARY_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    
-#if defined(CL_VERSION_1_2)
-    /**
-     * Create program using builtin kernels.
-     * \param kernelNames Semicolon-separated list of builtin kernel names
-     */
-    Program(
-        const Context& context,
-        const VECTOR_CLASS<Device>& devices,
-        const STRING_CLASS& kernelNames,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-
-        ::size_t numDevices = devices.size();
-        cl_device_id* deviceIDs = (cl_device_id*) alloca(numDevices * sizeof(cl_device_id));
-        for( ::size_t deviceIndex = 0; deviceIndex < numDevices; ++deviceIndex ) {
-            deviceIDs[deviceIndex] = (devices[deviceIndex])();
-        }
-        
-        object_ = ::clCreateProgramWithBuiltInKernels(
-            context(), 
-            (cl_uint) devices.size(),
-            deviceIDs,
-            kernelNames.c_str(), 
-            &error);
-
-        detail::errHandler(error, __CREATE_PROGRAM_WITH_BUILT_IN_KERNELS_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-    Program() { }
-
-    Program(const Program& program) : detail::Wrapper<cl_type>(program) { }
-
-    __CL_EXPLICIT_CONSTRUCTORS Program(const cl_program& program) : detail::Wrapper<cl_type>(program) { }
-
-    Program& operator = (const Program& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    Program& operator = (const cl_program& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    cl_int build(
-        const VECTOR_CLASS<Device>& devices,
-        const char* options = NULL,
-        void (CL_CALLBACK * notifyFptr)(cl_program, void *) = NULL,
-        void* data = NULL) const
-    {
-        ::size_t numDevices = devices.size();
-        cl_device_id* deviceIDs = (cl_device_id*) alloca(numDevices * sizeof(cl_device_id));
-        for( ::size_t deviceIndex = 0; deviceIndex < numDevices; ++deviceIndex ) {
-            deviceIDs[deviceIndex] = (devices[deviceIndex])();
-        }
-
-        return detail::errHandler(
-            ::clBuildProgram(
-                object_,
-                (cl_uint)
-                devices.size(),
-                deviceIDs,
-                options,
-                notifyFptr,
-                data),
-                __BUILD_PROGRAM_ERR);
-    }
-
-    cl_int build(
-        const char* options = NULL,
-        void (CL_CALLBACK * notifyFptr)(cl_program, void *) = NULL,
-        void* data = NULL) const
-    {
-        return detail::errHandler(
-            ::clBuildProgram(
-                object_,
-                0,
-                NULL,
-                options,
-                notifyFptr,
-                data),
-                __BUILD_PROGRAM_ERR);
-    }
-
-#if defined(CL_VERSION_1_2)
-    cl_int compile(
-        const char* options = NULL,
-        void (CL_CALLBACK * notifyFptr)(cl_program, void *) = NULL,
-        void* data = NULL) const
-    {
-        return detail::errHandler(
-            ::clCompileProgram(
-                object_,
-                0,
-                NULL,
-                options,
-                0,
-                NULL,
-                NULL,
-                notifyFptr,
-                data),
-                __COMPILE_PROGRAM_ERR);
-    }
-#endif
-
-    template <typename T>
-    cl_int getInfo(cl_program_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(&::clGetProgramInfo, object_, name, param),
-            __GET_PROGRAM_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_program_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_program_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    template <typename T>
-    cl_int getBuildInfo(
-        const Device& device, cl_program_build_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(
-                &::clGetProgramBuildInfo, object_, device(), name, param),
-                __GET_PROGRAM_BUILD_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_program_build_info, name>::param_type
-    getBuildInfo(const Device& device, cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_program_build_info, name>::param_type param;
-        cl_int result = getBuildInfo(device, name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    cl_int createKernels(VECTOR_CLASS<Kernel>* kernels)
-    {
-        cl_uint numKernels;
-        cl_int err = ::clCreateKernelsInProgram(object_, 0, NULL, &numKernels);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_KERNELS_IN_PROGRAM_ERR);
-        }
-
-        Kernel* value = (Kernel*) alloca(numKernels * sizeof(Kernel));
-        err = ::clCreateKernelsInProgram(
-            object_, numKernels, (cl_kernel*) value, NULL);
-        if (err != CL_SUCCESS) {
-            return detail::errHandler(err, __CREATE_KERNELS_IN_PROGRAM_ERR);
-        }
-
-        kernels->assign(&value[0], &value[numKernels]);
-        return CL_SUCCESS;
-    }
-};
-
-#if defined(CL_VERSION_1_2)
-inline Program linkProgram(
-    Program input1,
-    Program input2,
-    const char* options = NULL,
-    void (CL_CALLBACK * notifyFptr)(cl_program, void *) = NULL,
-    void* data = NULL,
-    cl_int* err = NULL) 
-{
-    cl_int err_local = CL_SUCCESS;
-
-    cl_program programs[2] = { input1(), input2() };
-
-    Context ctx = input1.getInfo<CL_PROGRAM_CONTEXT>();
-
-    cl_program prog = ::clLinkProgram(
-        ctx(),
-        0,
-        NULL,
-        options,
-        2,
-        programs,
-        notifyFptr,
-        data,
-        &err_local);
-
-    detail::errHandler(err_local,__COMPILE_PROGRAM_ERR);
-    if (err != NULL) {
-        *err = err_local;
-    }
-
-    return Program(prog);
-}
-
-inline Program linkProgram(
-    VECTOR_CLASS<Program> inputPrograms,
-    const char* options = NULL,
-    void (CL_CALLBACK * notifyFptr)(cl_program, void *) = NULL,
-    void* data = NULL,
-    cl_int* err = NULL) 
-{
-    cl_int err_local = CL_SUCCESS;
-
-    cl_program * programs = (cl_program*) alloca(inputPrograms.size() * sizeof(cl_program));
-
-    if (programs != NULL) {
-        for (unsigned int i = 0; i < inputPrograms.size(); i++) {
-          programs[i] = inputPrograms[i]();
-        }
-    } 
-
-    cl_program prog = ::clLinkProgram(
-        Context::getDefault()(),
-        0,
-        NULL,
-        options,
-        (cl_uint)inputPrograms.size(),
-        programs,
-        notifyFptr,
-        data,
-        &err_local);
-
-    detail::errHandler(err_local,__COMPILE_PROGRAM_ERR);
-    if (err != NULL) {
-        *err = err_local;
-    }
-
-    return Program(prog);
-}
-#endif
-
-template<>
-inline VECTOR_CLASS<char *> cl::Program::getInfo<CL_PROGRAM_BINARIES>(cl_int* err) const
-{
-    VECTOR_CLASS< ::size_t> sizes = getInfo<CL_PROGRAM_BINARY_SIZES>();
-    VECTOR_CLASS<char *> binaries;
-    for (VECTOR_CLASS< ::size_t>::iterator s = sizes.begin(); s != sizes.end(); ++s) 
-    {
-        char *ptr = NULL;
-        if (*s != 0) 
-            ptr = new char[*s];
-        binaries.push_back(ptr);
-    }
-    
-    cl_int result = getInfo(CL_PROGRAM_BINARIES, &binaries);
-    if (err != NULL) {
-        *err = result;
-    }
-    return binaries;
-}
-
-inline Kernel::Kernel(const Program& program, const char* name, cl_int* err)
-{
-    cl_int error;
-
-    object_ = ::clCreateKernel(program(), name, &error);
-    detail::errHandler(error, __CREATE_KERNEL_ERR);
-
-    if (err != NULL) {
-        *err = error;
-    }
-
-}
-
-/*! \class CommandQueue
- * \brief CommandQueue interface for cl_command_queue.
- */
-class CommandQueue : public detail::Wrapper<cl_command_queue>
-{
-private:
-    static volatile int default_initialized_;
-    static CommandQueue default_;
-    static volatile cl_int default_error_;
-public:
-   CommandQueue(
-        cl_command_queue_properties properties,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-
-        Context context = Context::getDefault(&error);
-        detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-
-        if (error != CL_SUCCESS) {
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-        else {
-            Device device = context.getInfo<CL_CONTEXT_DEVICES>()[0];
-
-            object_ = ::clCreateCommandQueue(
-                context(), device(), properties, &error);
-
-            detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-    }
-    /*!
-    * \brief Constructs a CommandQueue for an implementation defined device in the given context
-    */
-    explicit CommandQueue(
-        const Context& context,
-        cl_command_queue_properties properties = 0,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        VECTOR_CLASS<cl::Device> devices;
-        error = context.getInfo(CL_CONTEXT_DEVICES, &devices);
-
-        detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-
-        if (error != CL_SUCCESS)
-        {
-            if (err != NULL) {
-                *err = error;
-            }
-            return;
-        }
-
-        object_ = ::clCreateCommandQueue(context(), devices[0](), properties, &error);
-
-        detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-
-        if (err != NULL) {
-            *err = error;
-        }
-
-    }
-
-    CommandQueue(
-        const Context& context,
-        const Device& device,
-        cl_command_queue_properties properties = 0,
-        cl_int* err = NULL)
-    {
-        cl_int error;
-        object_ = ::clCreateCommandQueue(
-            context(), device(), properties, &error);
-
-        detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-
-    static CommandQueue getDefault(cl_int * err = NULL) 
-    {
-        int state = detail::compare_exchange(
-            &default_initialized_, 
-            __DEFAULT_BEING_INITIALIZED, __DEFAULT_NOT_INITIALIZED);
-        
-        if (state & __DEFAULT_INITIALIZED) {
-            if (err != NULL) {
-                *err = default_error_;
-            }
-            return default_;
-        }
-
-        if (state & __DEFAULT_BEING_INITIALIZED) {
-              // Assume writes will propagate eventually...
-              while(default_initialized_ != __DEFAULT_INITIALIZED) {
-                  detail::fence();
-              }
-
-            if (err != NULL) {
-                *err = default_error_;
-            }
-            return default_;
-        }
-
-        cl_int error;
-
-        Context context = Context::getDefault(&error);
-        detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-
-        if (error != CL_SUCCESS) {
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-        else {
-            Device device = context.getInfo<CL_CONTEXT_DEVICES>()[0];
-
-            default_ = CommandQueue(context, device, 0, &error);
-
-            detail::errHandler(error, __CREATE_COMMAND_QUEUE_ERR);
-            if (err != NULL) {
-                *err = error;
-            }
-        }
-
-        detail::fence();
-
-        default_error_ = error;
-        // Assume writes will propagate eventually...
-        default_initialized_ = __DEFAULT_INITIALIZED;
-
-        detail::fence();
-
-        if (err != NULL) {
-            *err = default_error_;
-        }
-        return default_;
-
-    }
-
-    CommandQueue() { }
-
-    CommandQueue(const CommandQueue& commandQueue) : detail::Wrapper<cl_type>(commandQueue) { }
-
-    CommandQueue(const cl_command_queue& commandQueue) : detail::Wrapper<cl_type>(commandQueue) { }
-
-    CommandQueue& operator = (const CommandQueue& rhs)
-    {
-        if (this != &rhs) {
-            detail::Wrapper<cl_type>::operator=(rhs);
-        }
-        return *this;
-    }
-
-    CommandQueue& operator = (const cl_command_queue& rhs)
-    {
-        detail::Wrapper<cl_type>::operator=(rhs);
-        return *this;
-    }
-
-    template <typename T>
-    cl_int getInfo(cl_command_queue_info name, T* param) const
-    {
-        return detail::errHandler(
-            detail::getInfo(
-                &::clGetCommandQueueInfo, object_, name, param),
-                __GET_COMMAND_QUEUE_INFO_ERR);
-    }
-
-    template <cl_int name> typename
-    detail::param_traits<detail::cl_command_queue_info, name>::param_type
-    getInfo(cl_int* err = NULL) const
-    {
-        typename detail::param_traits<
-            detail::cl_command_queue_info, name>::param_type param;
-        cl_int result = getInfo(name, &param);
-        if (err != NULL) {
-            *err = result;
-        }
-        return param;
-    }
-
-    cl_int enqueueReadBuffer(
-        const Buffer& buffer,
-        cl_bool blocking,
-        ::size_t offset,
-        ::size_t size,
-        void* ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueReadBuffer(
-                object_, buffer(), blocking, offset, size,
-                ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_READ_BUFFER_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueWriteBuffer(
-        const Buffer& buffer,
-        cl_bool blocking,
-        ::size_t offset,
-        ::size_t size,
-        const void* ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueWriteBuffer(
-                object_, buffer(), blocking, offset, size,
-                ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_WRITE_BUFFER_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueCopyBuffer(
-        const Buffer& src,
-        const Buffer& dst,
-        ::size_t src_offset,
-        ::size_t dst_offset,
-        ::size_t size,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueCopyBuffer(
-                object_, src(), dst(), src_offset, dst_offset, size,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQEUE_COPY_BUFFER_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueReadBufferRect(
-        const Buffer& buffer,
-        cl_bool blocking,
-        const size_t<3>& buffer_offset,
-        const size_t<3>& host_offset,
-        const size_t<3>& region,
-        ::size_t buffer_row_pitch,
-        ::size_t buffer_slice_pitch,
-        ::size_t host_row_pitch,
-        ::size_t host_slice_pitch,
-        void *ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueReadBufferRect(
-                object_, 
-                buffer(), 
-                blocking, 
-                (const ::size_t *)buffer_offset,
-                (const ::size_t *)host_offset,
-                (const ::size_t *)region,
-                buffer_row_pitch,
-                buffer_slice_pitch,
-                host_row_pitch,
-                host_slice_pitch,
-                ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_READ_BUFFER_RECT_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueWriteBufferRect(
-        const Buffer& buffer,
-        cl_bool blocking,
-        const size_t<3>& buffer_offset,
-        const size_t<3>& host_offset,
-        const size_t<3>& region,
-        ::size_t buffer_row_pitch,
-        ::size_t buffer_slice_pitch,
-        ::size_t host_row_pitch,
-        ::size_t host_slice_pitch,
-        void *ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueWriteBufferRect(
-                object_, 
-                buffer(), 
-                blocking, 
-                (const ::size_t *)buffer_offset,
-                (const ::size_t *)host_offset,
-                (const ::size_t *)region,
-                buffer_row_pitch,
-                buffer_slice_pitch,
-                host_row_pitch,
-                host_slice_pitch,
-                ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_WRITE_BUFFER_RECT_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueCopyBufferRect(
-        const Buffer& src,
-        const Buffer& dst,
-        const size_t<3>& src_origin,
-        const size_t<3>& dst_origin,
-        const size_t<3>& region,
-        ::size_t src_row_pitch,
-        ::size_t src_slice_pitch,
-        ::size_t dst_row_pitch,
-        ::size_t dst_slice_pitch,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueCopyBufferRect(
-                object_, 
-                src(), 
-                dst(), 
-                (const ::size_t *)src_origin, 
-                (const ::size_t *)dst_origin, 
-                (const ::size_t *)region,
-                src_row_pitch,
-                src_slice_pitch,
-                dst_row_pitch,
-                dst_slice_pitch,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQEUE_COPY_BUFFER_RECT_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-#if defined(CL_VERSION_1_2)
-    /**
-     * Enqueue a command to fill a buffer object with a pattern
-     * of a given size. The pattern is specified as a vector.
-     * \tparam PatternType The datatype of the pattern field. 
-     *     The pattern type must be an accepted OpenCL data type.
-     */
-    template<typename PatternType>
-    cl_int enqueueFillBuffer(
-        const Buffer& buffer,
-        PatternType pattern,
-        ::size_t offset,
-        ::size_t size,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueFillBuffer(
-                object_, 
-                buffer(),
-                static_cast<void*>(&pattern),
-                sizeof(PatternType), 
-                offset, 
-                size,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_FILL_BUFFER_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-    cl_int enqueueReadImage(
-        const Image& image,
-        cl_bool blocking,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        ::size_t row_pitch,
-        ::size_t slice_pitch,
-        void* ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueReadImage(
-                object_, image(), blocking, (const ::size_t *) origin,
-                (const ::size_t *) region, row_pitch, slice_pitch, ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_READ_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueWriteImage(
-        const Image& image,
-        cl_bool blocking,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        ::size_t row_pitch,
-        ::size_t slice_pitch,
-        void* ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueWriteImage(
-                object_, image(), blocking, (const ::size_t *) origin,
-                (const ::size_t *) region, row_pitch, slice_pitch, ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_WRITE_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueCopyImage(
-        const Image& src,
-        const Image& dst,
-        const size_t<3>& src_origin,
-        const size_t<3>& dst_origin,
-        const size_t<3>& region,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueCopyImage(
-                object_, src(), dst(), (const ::size_t *) src_origin,
-                (const ::size_t *)dst_origin, (const ::size_t *) region,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_COPY_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-#if defined(CL_VERSION_1_2)
-    /**
-     * Enqueue a command to fill an image object with a specified color.
-     * \param fillColor is the color to use to fill the image.
-     *     This is a four component RGBA floating-point color value if
-     *     the image channel data type is not an unnormalized signed or
-     *     unsigned data type.
-     */
-    cl_int enqueueFillImage(
-        const Image& image,
-        cl_float4 fillColor,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueFillImage(
-                object_, 
-                image(),
-                static_cast<void*>(&fillColor), 
-                (const ::size_t *) origin, 
-                (const ::size_t *) region,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_FILL_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    /**
-     * Enqueue a command to fill an image object with a specified color.
-     * \param fillColor is the color to use to fill the image.
-     *     This is a four component RGBA signed integer color value if
-     *     the image channel data type is an unnormalized signed integer
-     *     type.
-     */
-    cl_int enqueueFillImage(
-        const Image& image,
-        cl_int4 fillColor,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueFillImage(
-                object_, 
-                image(),
-                static_cast<void*>(&fillColor), 
-                (const ::size_t *) origin, 
-                (const ::size_t *) region,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_FILL_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    /**
-     * Enqueue a command to fill an image object with a specified color.
-     * \param fillColor is the color to use to fill the image.
-     *     This is a four component RGBA unsigned integer color value if
-     *     the image channel data type is an unnormalized unsigned integer
-     *     type.
-     */
-    cl_int enqueueFillImage(
-        const Image& image,
-        cl_uint4 fillColor,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueFillImage(
-                object_, 
-                image(),
-                static_cast<void*>(&fillColor), 
-                (const ::size_t *) origin, 
-                (const ::size_t *) region,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-                __ENQUEUE_FILL_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-    cl_int enqueueCopyImageToBuffer(
-        const Image& src,
-        const Buffer& dst,
-        const size_t<3>& src_origin,
-        const size_t<3>& region,
-        ::size_t dst_offset,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueCopyImageToBuffer(
-                object_, src(), dst(), (const ::size_t *) src_origin,
-                (const ::size_t *) region, dst_offset,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_COPY_IMAGE_TO_BUFFER_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueCopyBufferToImage(
-        const Buffer& src,
-        const Image& dst,
-        ::size_t src_offset,
-        const size_t<3>& dst_origin,
-        const size_t<3>& region,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueCopyBufferToImage(
-                object_, src(), dst(), src_offset,
-                (const ::size_t *) dst_origin, (const ::size_t *) region,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_COPY_BUFFER_TO_IMAGE_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    void* enqueueMapBuffer(
-        const Buffer& buffer,
-        cl_bool blocking,
-        cl_map_flags flags,
-        ::size_t offset,
-        ::size_t size,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL,
-        cl_int* err = NULL) const
-    {
-        cl_int error;
-        void * result = ::clEnqueueMapBuffer(
-            object_, buffer(), blocking, flags, offset, size,
-            (events != NULL) ? (cl_uint) events->size() : 0,
-            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-            (cl_event*) event,
-            &error);
-
-        detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-        return result;
-    }
-
-    void* enqueueMapImage(
-        const Image& buffer,
-        cl_bool blocking,
-        cl_map_flags flags,
-        const size_t<3>& origin,
-        const size_t<3>& region,
-        ::size_t * row_pitch,
-        ::size_t * slice_pitch,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL,
-        cl_int* err = NULL) const
-    {
-        cl_int error;
-        void * result = ::clEnqueueMapImage(
-            object_, buffer(), blocking, flags,
-            (const ::size_t *) origin, (const ::size_t *) region,
-            row_pitch, slice_pitch,
-            (events != NULL) ? (cl_uint) events->size() : 0,
-            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-            (cl_event*) event,
-            &error);
-
-        detail::errHandler(error, __ENQUEUE_MAP_IMAGE_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-        return result;
-    }
-
-    cl_int enqueueUnmapMemObject(
-        const Memory& memory,
-        void* mapped_ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueUnmapMemObject(
-                object_, memory(), mapped_ptr,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_UNMAP_MEM_OBJECT_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-#if defined(CL_VERSION_1_2)
-    /**
-     * Enqueues a marker command which waits for either a list of events to complete, 
-     * or all previously enqueued commands to complete.
-     *
-     * Enqueues a marker command which waits for either a list of events to complete, 
-     * or if the list is empty it waits for all commands previously enqueued in command_queue 
-     * to complete before it completes. This command returns an event that can be waited on,
-     * i.e. this event can be waited on to ensure that either all events in the event_wait_list
-     * or all commands enqueued to command_queue before this command
-     * have completed.
-     */
-    cl_int enqueueMarkerWithWaitList(
-        const VECTOR_CLASS<Event> *events = 0,
-        Event *event = 0)
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueMarkerWithWaitList(
-                object_,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_MARKER_WAIT_LIST_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    /**
-     * A synchronization point that enqueues a barrier operation.
-     *
-     * Enqueues a barrier command which waits for either a list of events to complete, 
-     * or if the list is empty it waits for all commands previously enqueued in command_queue 
-     * to complete before it completes. This command blocks command execution, that is, any
-     * commands enqueued after it do not execute until it completes. This command
-     * returns an event that can be waited on to ensure that either all events in the
-     * event_wait_list or all commands enqueued to command_queue before this command
-     * have completed.
-     */
-    cl_int enqueueBarrierWithWaitList(
-        const VECTOR_CLASS<Event> *events = 0,
-        Event *event = 0)
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueBarrierWithWaitList(
-                object_,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_BARRIER_WAIT_LIST_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-    
-    /**
-     * Enqueues a command to indicate with which device a set of memory objects
-     * should be associated.
-     */
-    cl_int enqueueMigrateMemObjects(
-        const VECTOR_CLASS<Memory> &memObjects,
-        cl_mem_migration_flags flags,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL
-        )
-    {
-        cl_event tmp;
-        
-        cl_mem* localMemObjects = static_cast<cl_mem*>(alloca(memObjects.size() * sizeof(cl_mem)));
-        for( int i = 0; i < (int)memObjects.size(); ++i ) {
-            localMemObjects[i] = memObjects[i]();
-        }
-
-
-        cl_int err = detail::errHandler(
-            ::clEnqueueMigrateMemObjects(
-                object_, 
-                (cl_uint)memObjects.size(), 
-                static_cast<const cl_mem*>(localMemObjects),
-                flags,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_UNMAP_MEM_OBJECT_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-#endif // #if defined(CL_VERSION_1_2)
-
-    cl_int enqueueNDRangeKernel(
-        const Kernel& kernel,
-        const NDRange& offset,
-        const NDRange& global,
-        const NDRange& local = NullRange,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueNDRangeKernel(
-                object_, kernel(), (cl_uint) global.dimensions(),
-                offset.dimensions() != 0 ? (const ::size_t*) offset : NULL,
-                (const ::size_t*) global,
-                local.dimensions() != 0 ? (const ::size_t*) local : NULL,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_NDRANGE_KERNEL_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueTask(
-        const Kernel& kernel,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueTask(
-                object_, kernel(),
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_TASK_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-    cl_int enqueueNativeKernel(
-        void (CL_CALLBACK *userFptr)(void *),
-        std::pair<void*, ::size_t> args,
-        const VECTOR_CLASS<Memory>* mem_objects = NULL,
-        const VECTOR_CLASS<const void*>* mem_locs = NULL,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL) const
-    {
-        cl_mem * mems = (mem_objects != NULL && mem_objects->size() > 0) 
-            ? (cl_mem*) alloca(mem_objects->size() * sizeof(cl_mem))
-            : NULL;
-
-        if (mems != NULL) {
-            for (unsigned int i = 0; i < mem_objects->size(); i++) {
-                mems[i] = ((*mem_objects)[i])();
-            }
-        }
-
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            ::clEnqueueNativeKernel(
-                object_, userFptr, args.first, args.second,
-                (mem_objects != NULL) ? (cl_uint) mem_objects->size() : 0,
-                mems,
-                (mem_locs != NULL) ? (const void **) &mem_locs->front() : NULL,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_NATIVE_KERNEL);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2)) 
-    CL_EXT_PREFIX__VERSION_1_1_DEPRECATED 
-    cl_int enqueueMarker(Event* event = NULL) const CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-    {
-        return detail::errHandler(
-            ::clEnqueueMarker(object_, (cl_event*) event),
-            __ENQUEUE_MARKER_ERR);
-    }
-
-    CL_EXT_PREFIX__VERSION_1_1_DEPRECATED
-    cl_int enqueueWaitForEvents(const VECTOR_CLASS<Event>& events) const CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-    {
-        return detail::errHandler(
-            ::clEnqueueWaitForEvents(
-                object_,
-                (cl_uint) events.size(),
-                (const cl_event*) &events.front()),
-            __ENQUEUE_WAIT_FOR_EVENTS_ERR);
-    }
-#endif // #if defined(CL_VERSION_1_1)
-
-    cl_int enqueueAcquireGLObjects(
-         const VECTOR_CLASS<Memory>* mem_objects = NULL,
-         const VECTOR_CLASS<Event>* events = NULL,
-         Event* event = NULL) const
-     {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-             ::clEnqueueAcquireGLObjects(
-                 object_,
-                 (mem_objects != NULL) ? (cl_uint) mem_objects->size() : 0,
-                 (mem_objects != NULL) ? (const cl_mem *) &mem_objects->front(): NULL,
-                 (events != NULL) ? (cl_uint) events->size() : 0,
-                 (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                 (event != NULL) ? &tmp : NULL),
-             __ENQUEUE_ACQUIRE_GL_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-     }
-
-    cl_int enqueueReleaseGLObjects(
-         const VECTOR_CLASS<Memory>* mem_objects = NULL,
-         const VECTOR_CLASS<Event>* events = NULL,
-         Event* event = NULL) const
-     {
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-             ::clEnqueueReleaseGLObjects(
-                 object_,
-                 (mem_objects != NULL) ? (cl_uint) mem_objects->size() : 0,
-                 (mem_objects != NULL) ? (const cl_mem *) &mem_objects->front(): NULL,
-                 (events != NULL) ? (cl_uint) events->size() : 0,
-                 (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                 (event != NULL) ? &tmp : NULL),
-             __ENQUEUE_RELEASE_GL_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-     }
-
-#if defined (USE_DX_INTEROP)
-typedef CL_API_ENTRY cl_int (CL_API_CALL *PFN_clEnqueueAcquireD3D10ObjectsKHR)(
-    cl_command_queue command_queue, cl_uint num_objects,
-    const cl_mem* mem_objects, cl_uint num_events_in_wait_list,
-    const cl_event* event_wait_list, cl_event* event);
-typedef CL_API_ENTRY cl_int (CL_API_CALL *PFN_clEnqueueReleaseD3D10ObjectsKHR)(
-    cl_command_queue command_queue, cl_uint num_objects,
-    const cl_mem* mem_objects,  cl_uint num_events_in_wait_list,
-    const cl_event* event_wait_list, cl_event* event);
-
-    cl_int enqueueAcquireD3D10Objects(
-         const VECTOR_CLASS<Memory>* mem_objects = NULL,
-         const VECTOR_CLASS<Event>* events = NULL,
-         Event* event = NULL) const
-    {
-        static PFN_clEnqueueAcquireD3D10ObjectsKHR pfn_clEnqueueAcquireD3D10ObjectsKHR = NULL;
-#if defined(CL_VERSION_1_2)
-        cl_context context = getInfo<CL_QUEUE_CONTEXT>();
-        cl::Device device(getInfo<CL_QUEUE_DEVICE>());
-        cl_platform_id platform = device.getInfo<CL_DEVICE_PLATFORM>();
-        __INIT_CL_EXT_FCN_PTR_PLATFORM(platform, clEnqueueAcquireD3D10ObjectsKHR);
-#endif
-#if defined(CL_VERSION_1_1)
-        __INIT_CL_EXT_FCN_PTR(clEnqueueAcquireD3D10ObjectsKHR);
-#endif
-        
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-             pfn_clEnqueueAcquireD3D10ObjectsKHR(
-                 object_,
-                 (mem_objects != NULL) ? (cl_uint) mem_objects->size() : 0,
-                 (mem_objects != NULL) ? (const cl_mem *) &mem_objects->front(): NULL,
-                 (events != NULL) ? (cl_uint) events->size() : 0,
-                 (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                 (event != NULL) ? &tmp : NULL),
-             __ENQUEUE_ACQUIRE_GL_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-     }
-
-    cl_int enqueueReleaseD3D10Objects(
-         const VECTOR_CLASS<Memory>* mem_objects = NULL,
-         const VECTOR_CLASS<Event>* events = NULL,
-         Event* event = NULL) const
-    {
-        static PFN_clEnqueueReleaseD3D10ObjectsKHR pfn_clEnqueueReleaseD3D10ObjectsKHR = NULL;
-#if defined(CL_VERSION_1_2)
-        cl_context context = getInfo<CL_QUEUE_CONTEXT>();
-        cl::Device device(getInfo<CL_QUEUE_DEVICE>());
-        cl_platform_id platform = device.getInfo<CL_DEVICE_PLATFORM>();
-        __INIT_CL_EXT_FCN_PTR_PLATFORM(platform, clEnqueueReleaseD3D10ObjectsKHR);
-#endif // #if defined(CL_VERSION_1_2)
-#if defined(CL_VERSION_1_1)
-        __INIT_CL_EXT_FCN_PTR(clEnqueueReleaseD3D10ObjectsKHR);
-#endif // #if defined(CL_VERSION_1_1)
-
-        cl_event tmp;
-        cl_int err = detail::errHandler(
-            pfn_clEnqueueReleaseD3D10ObjectsKHR(
-                object_,
-                (mem_objects != NULL) ? (cl_uint) mem_objects->size() : 0,
-                (mem_objects != NULL) ? (const cl_mem *) &mem_objects->front(): NULL,
-                (events != NULL) ? (cl_uint) events->size() : 0,
-                (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-                (event != NULL) ? &tmp : NULL),
-            __ENQUEUE_RELEASE_GL_ERR);
-
-        if (event != NULL && err == CL_SUCCESS)
-            *event = tmp;
-
-        return err;
-    }
-#endif
-
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_USE_DEPRECATED_OPENCL_1_1_APIS) || (defined(CL_VERSION_1_1) && !defined(CL_VERSION_1_2)) 
-    CL_EXT_PREFIX__VERSION_1_1_DEPRECATED
-    cl_int enqueueBarrier() const CL_EXT_SUFFIX__VERSION_1_1_DEPRECATED
-    {
-        return detail::errHandler(
-            ::clEnqueueBarrier(object_),
-            __ENQUEUE_BARRIER_ERR);
-    }
-#endif // #if defined(CL_VERSION_1_1)
-
-    cl_int flush() const
-    {
-        return detail::errHandler(::clFlush(object_), __FLUSH_ERR);
-    }
-
-    cl_int finish() const
-    {
-        return detail::errHandler(::clFinish(object_), __FINISH_ERR);
-    }
-};
-
-#ifdef _WIN32
-__declspec(selectany) volatile int CommandQueue::default_initialized_ = __DEFAULT_NOT_INITIALIZED;
-__declspec(selectany) CommandQueue CommandQueue::default_;
-__declspec(selectany) volatile cl_int CommandQueue::default_error_ = CL_SUCCESS;
-#else
-__attribute__((weak)) volatile int CommandQueue::default_initialized_ = __DEFAULT_NOT_INITIALIZED;
-__attribute__((weak)) CommandQueue CommandQueue::default_;
-__attribute__((weak)) volatile cl_int CommandQueue::default_error_ = CL_SUCCESS;
-#endif
-
-template< typename IteratorType >
-Buffer::Buffer(
-    const Context &context,
-    IteratorType startIterator,
-    IteratorType endIterator,
-    bool readOnly,
-    bool useHostPtr,
-    cl_int* err)
-{
-    typedef typename std::iterator_traits<IteratorType>::value_type DataType;
-    cl_int error;
-
-    cl_mem_flags flags = 0;
-    if( readOnly ) {
-        flags |= CL_MEM_READ_ONLY;
-    }
-    else {
-        flags |= CL_MEM_READ_WRITE;
-    }
-    if( useHostPtr ) {
-        flags |= CL_MEM_USE_HOST_PTR;
-    }
-    
-    ::size_t size = sizeof(DataType)*(endIterator - startIterator);
-
-    if( useHostPtr ) {
-        object_ = ::clCreateBuffer(context(), flags, size, static_cast<DataType*>(&*startIterator), &error);
-    } else {
-        object_ = ::clCreateBuffer(context(), flags, size, 0, &error);
-    }
-
-    detail::errHandler(error, __CREATE_BUFFER_ERR);
-    if (err != NULL) {
-        *err = error;
-    }
-
-    if( !useHostPtr ) {
-        CommandQueue queue(context, 0, &error);
-        detail::errHandler(error, __CREATE_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-
-        error = cl::copy(queue, startIterator, endIterator, *this);
-        detail::errHandler(error, __CREATE_BUFFER_ERR);
-        if (err != NULL) {
-            *err = error;
-        }
-    }
-}
-
-inline cl_int enqueueReadBuffer(
-    const Buffer& buffer,
-    cl_bool blocking,
-    ::size_t offset,
-    ::size_t size,
-    void* ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueReadBuffer(buffer, blocking, offset, size, ptr, events, event);
-}
-
-inline cl_int enqueueWriteBuffer(
-        const Buffer& buffer,
-        cl_bool blocking,
-        ::size_t offset,
-        ::size_t size,
-        const void* ptr,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueWriteBuffer(buffer, blocking, offset, size, ptr, events, event);
-}
-
-inline void* enqueueMapBuffer(
-        const Buffer& buffer,
-        cl_bool blocking,
-        cl_map_flags flags,
-        ::size_t offset,
-        ::size_t size,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL,
-        cl_int* err = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-    detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
-    if (err != NULL) {
-        *err = error;
-    }
-
-    void * result = ::clEnqueueMapBuffer(
-            queue(), buffer(), blocking, flags, offset, size,
-            (events != NULL) ? (cl_uint) events->size() : 0,
-            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-            (cl_event*) event,
-            &error);
-
-    detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
-    if (err != NULL) {
-        *err = error;
-    }
-    return result;
-}
-
-inline cl_int enqueueUnmapMemObject(
-    const Memory& memory,
-    void* mapped_ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-    detail::errHandler(error, __ENQUEUE_MAP_BUFFER_ERR);
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    cl_event tmp;
-    cl_int err = detail::errHandler(
-        ::clEnqueueUnmapMemObject(
-            queue(), memory(), mapped_ptr,
-            (events != NULL) ? (cl_uint) events->size() : 0,
-            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
-            (event != NULL) ? &tmp : NULL),
-        __ENQUEUE_UNMAP_MEM_OBJECT_ERR);
-
-    if (event != NULL && err == CL_SUCCESS)
-        *event = tmp;
-
-    return err;
-}
-
-inline cl_int enqueueCopyBuffer(
-        const Buffer& src,
-        const Buffer& dst,
-        ::size_t src_offset,
-        ::size_t dst_offset,
-        ::size_t size,
-        const VECTOR_CLASS<Event>* events = NULL,
-        Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueCopyBuffer(src, dst, src_offset, dst_offset, size, events, event);
-}
-
-/**
- * Blocking copy operation between iterators and a buffer.
- * Host to Device.
- * Uses default command queue.
- */
-template< typename IteratorType >
-inline cl_int copy( IteratorType startIterator, IteratorType endIterator, cl::Buffer &buffer )
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-    if (error != CL_SUCCESS)
-        return error;
-
-    return cl::copy(queue, startIterator, endIterator, buffer);
-}
-
-/**
- * Blocking copy operation between iterators and a buffer.
- * Device to Host.
- * Uses default command queue.
- */
-template< typename IteratorType >
-inline cl_int copy( const cl::Buffer &buffer, IteratorType startIterator, IteratorType endIterator )
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-    if (error != CL_SUCCESS)
-        return error;
-
-    return cl::copy(queue, buffer, startIterator, endIterator);
-}
-
-/**
- * Blocking copy operation between iterators and a buffer.
- * Host to Device.
- * Uses specified queue.
- */
-template< typename IteratorType >
-inline cl_int copy( const CommandQueue &queue, IteratorType startIterator, IteratorType endIterator, cl::Buffer &buffer )
-{
-    typedef typename std::iterator_traits<IteratorType>::value_type DataType;
-    cl_int error;
-    
-    ::size_t length = endIterator-startIterator;
-    ::size_t byteLength = length*sizeof(DataType);
-
-    DataType *pointer = 
-        static_cast<DataType*>(queue.enqueueMapBuffer(buffer, CL_TRUE, CL_MAP_WRITE, 0, byteLength, 0, 0, &error));
-    // if exceptions enabled, enqueueMapBuffer will throw
-    if( error != CL_SUCCESS ) {
-        return error;
-    }
-#if defined(_MSC_VER)
-    std::copy(
-        startIterator, 
-        endIterator, 
-        stdext::checked_array_iterator<DataType*>(
-            pointer, length));
-#else
-    std::copy(startIterator, endIterator, pointer);
-#endif
-    Event endEvent;
-    error = queue.enqueueUnmapMemObject(buffer, pointer, 0, &endEvent);
-    // if exceptions enabled, enqueueUnmapMemObject will throw
-    if( error != CL_SUCCESS ) { 
-        return error;
-    }
-    endEvent.wait();
-    return CL_SUCCESS;
-}
-
-/**
- * Blocking copy operation between iterators and a buffer.
- * Device to Host.
- * Uses specified queue.
- */
-template< typename IteratorType >
-inline cl_int copy( const CommandQueue &queue, const cl::Buffer &buffer, IteratorType startIterator, IteratorType endIterator )
-{
-    typedef typename std::iterator_traits<IteratorType>::value_type DataType;
-    cl_int error;
-        
-    ::size_t length = endIterator-startIterator;
-    ::size_t byteLength = length*sizeof(DataType);
-
-    DataType *pointer = 
-        static_cast<DataType*>(queue.enqueueMapBuffer(buffer, CL_TRUE, CL_MAP_READ, 0, byteLength, 0, 0, &error));
-    // if exceptions enabled, enqueueMapBuffer will throw
-    if( error != CL_SUCCESS ) {
-        return error;
-    }
-    std::copy(pointer, pointer + length, startIterator);
-    Event endEvent;
-    error = queue.enqueueUnmapMemObject(buffer, pointer, 0, &endEvent);
-    // if exceptions enabled, enqueueUnmapMemObject will throw
-    if( error != CL_SUCCESS ) { 
-        return error;
-    }
-    endEvent.wait();
-    return CL_SUCCESS;
-}
-
-#if defined(CL_VERSION_1_1)
-inline cl_int enqueueReadBufferRect(
-    const Buffer& buffer,
-    cl_bool blocking,
-    const size_t<3>& buffer_offset,
-    const size_t<3>& host_offset,
-    const size_t<3>& region,
-    ::size_t buffer_row_pitch,
-    ::size_t buffer_slice_pitch,
-    ::size_t host_row_pitch,
-    ::size_t host_slice_pitch,
-    void *ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueReadBufferRect(
-        buffer, 
-        blocking, 
-        buffer_offset, 
-        host_offset,
-        region,
-        buffer_row_pitch,
-        buffer_slice_pitch,
-        host_row_pitch,
-        host_slice_pitch,
-        ptr, 
-        events, 
-        event);
-}
-
-inline cl_int enqueueWriteBufferRect(
-    const Buffer& buffer,
-    cl_bool blocking,
-    const size_t<3>& buffer_offset,
-    const size_t<3>& host_offset,
-    const size_t<3>& region,
-    ::size_t buffer_row_pitch,
-    ::size_t buffer_slice_pitch,
-    ::size_t host_row_pitch,
-    ::size_t host_slice_pitch,
-    void *ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueWriteBufferRect(
-        buffer, 
-        blocking, 
-        buffer_offset, 
-        host_offset,
-        region,
-        buffer_row_pitch,
-        buffer_slice_pitch,
-        host_row_pitch,
-        host_slice_pitch,
-        ptr, 
-        events, 
-        event);
-}
-
-inline cl_int enqueueCopyBufferRect(
-    const Buffer& src,
-    const Buffer& dst,
-    const size_t<3>& src_origin,
-    const size_t<3>& dst_origin,
-    const size_t<3>& region,
-    ::size_t src_row_pitch,
-    ::size_t src_slice_pitch,
-    ::size_t dst_row_pitch,
-    ::size_t dst_slice_pitch,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueCopyBufferRect(
-        src,
-        dst,
-        src_origin,
-        dst_origin,
-        region,
-        src_row_pitch,
-        src_slice_pitch,
-        dst_row_pitch,
-        dst_slice_pitch,
-        events, 
-        event);
-}
-#endif
-
-inline cl_int enqueueReadImage(
-    const Image& image,
-    cl_bool blocking,
-    const size_t<3>& origin,
-    const size_t<3>& region,
-    ::size_t row_pitch,
-    ::size_t slice_pitch,
-    void* ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL) 
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueReadImage(
-        image,
-        blocking,
-        origin,
-        region,
-        row_pitch,
-        slice_pitch,
-        ptr,
-        events, 
-        event);
-}
-
-inline cl_int enqueueWriteImage(
-    const Image& image,
-    cl_bool blocking,
-    const size_t<3>& origin,
-    const size_t<3>& region,
-    ::size_t row_pitch,
-    ::size_t slice_pitch,
-    void* ptr,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueWriteImage(
-        image,
-        blocking,
-        origin,
-        region,
-        row_pitch,
-        slice_pitch,
-        ptr,
-        events, 
-        event);
-}
-
-inline cl_int enqueueCopyImage(
-    const Image& src,
-    const Image& dst,
-    const size_t<3>& src_origin,
-    const size_t<3>& dst_origin,
-    const size_t<3>& region,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueCopyImage(
-        src,
-        dst,
-        src_origin,
-        dst_origin,
-        region,
-        events,
-        event);
-}
-
-inline cl_int enqueueCopyImageToBuffer(
-    const Image& src,
-    const Buffer& dst,
-    const size_t<3>& src_origin,
-    const size_t<3>& region,
-    ::size_t dst_offset,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueCopyImageToBuffer(
-        src,
-        dst,
-        src_origin,
-        region,
-        dst_offset,
-        events,
-        event);
-}
-
-inline cl_int enqueueCopyBufferToImage(
-    const Buffer& src,
-    const Image& dst,
-    ::size_t src_offset,
-    const size_t<3>& dst_origin,
-    const size_t<3>& region,
-    const VECTOR_CLASS<Event>* events = NULL,
-    Event* event = NULL)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.enqueueCopyBufferToImage(
-        src,
-        dst,
-        src_offset,
-        dst_origin,
-        region,
-        events,
-        event);
-}
-
-
-inline cl_int flush(void)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    }
-
-    return queue.flush();
-}
-
-inline cl_int finish(void)
-{
-    cl_int error;
-    CommandQueue queue = CommandQueue::getDefault(&error);
-
-    if (error != CL_SUCCESS) {
-        return error;
-    } 
-
-
-    return queue.finish();
-}
-
-// Kernel Functor support
-// New interface as of September 2011
-// Requires C++11 std::function (note: TR1's std::tr1::function is not supported)
-// Visual Studio 2010 and GCC 4.2
-
-struct EnqueueArgs
-{
-    CommandQueue queue_;
-    const NDRange offset_;
-    const NDRange global_;
-    const NDRange local_;
-    VECTOR_CLASS<Event> events_;
-
-    EnqueueArgs(NDRange global) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange)
-    {
-
-    }
-
-    EnqueueArgs(NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(local)
-    {
-
-    }
-
-    EnqueueArgs(NDRange offset, NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(offset), 
-      global_(global),
-      local_(local)
-    {
-
-    }
-
-    EnqueueArgs(Event e, NDRange global) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(Event e, NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(local)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(Event e, NDRange offset, NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(offset), 
-      global_(global),
-      local_(local)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(const VECTOR_CLASS<Event> &events, NDRange global) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange),
-      events_(events)
-    {
-
-    }
-
-    EnqueueArgs(const VECTOR_CLASS<Event> &events, NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(NullRange), 
-      global_(global),
-      local_(local),
-      events_(events)
-    {
-
-    }
-
-    EnqueueArgs(const VECTOR_CLASS<Event> &events, NDRange offset, NDRange global, NDRange local) : 
-      queue_(CommandQueue::getDefault()),
-      offset_(offset), 
-      global_(global),
-      local_(local),
-      events_(events)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, NDRange global) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(local)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, NDRange offset, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(offset), 
-      global_(global),
-      local_(local)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, Event e, NDRange global) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(CommandQueue &queue, Event e, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(local)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(CommandQueue &queue, Event e, NDRange offset, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(offset), 
-      global_(global),
-      local_(local)
-    {
-        events_.push_back(e);
-    }
-
-    EnqueueArgs(CommandQueue &queue, const VECTOR_CLASS<Event> &events, NDRange global) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(NullRange),
-      events_(events)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, const VECTOR_CLASS<Event> &events, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(NullRange), 
-      global_(global),
-      local_(local),
-      events_(events)
-    {
-
-    }
-
-    EnqueueArgs(CommandQueue &queue, const VECTOR_CLASS<Event> &events, NDRange offset, NDRange global, NDRange local) : 
-      queue_(queue),
-      offset_(offset), 
-      global_(global),
-      local_(local),
-      events_(events)
-    {
-
-    }
-};
-
-namespace detail {
-
-class NullType {};
-
-template<int index, typename T0>
-struct SetArg
-{
-    static void set (Kernel kernel, T0 arg)
-    {
-        kernel.setArg(index, arg);
-    }
-};  
-
-template<int index>
-struct SetArg<index, NullType>
-{
-    static void set (Kernel, NullType)
-    { 
-    }
-};
-
-template <
-   typename T0,   typename T1,   typename T2,   typename T3,
-   typename T4,   typename T5,   typename T6,   typename T7,
-   typename T8,   typename T9,   typename T10,   typename T11,
-   typename T12,   typename T13,   typename T14,   typename T15,
-   typename T16,   typename T17,   typename T18,   typename T19,
-   typename T20,   typename T21,   typename T22,   typename T23,
-   typename T24,   typename T25,   typename T26,   typename T27,
-   typename T28,   typename T29,   typename T30,   typename T31
->
-class KernelFunctorGlobal
-{
-private:
-    Kernel kernel_;
-
-public:
-   KernelFunctorGlobal(
-        Kernel kernel) :
-            kernel_(kernel)
-    {}
-
-   KernelFunctorGlobal(
-        const Program& program,
-        const STRING_CLASS name,
-        cl_int * err = NULL) :
-            kernel_(program, name.c_str(), err)
-    {}
-
-    Event operator() (
-        const EnqueueArgs& args,
-        T0 t0,
-        T1 t1 = NullType(),
-        T2 t2 = NullType(),
-        T3 t3 = NullType(),
-        T4 t4 = NullType(),
-        T5 t5 = NullType(),
-        T6 t6 = NullType(),
-        T7 t7 = NullType(),
-        T8 t8 = NullType(),
-        T9 t9 = NullType(),
-        T10 t10 = NullType(),
-        T11 t11 = NullType(),
-        T12 t12 = NullType(),
-        T13 t13 = NullType(),
-        T14 t14 = NullType(),
-        T15 t15 = NullType(),
-        T16 t16 = NullType(),
-        T17 t17 = NullType(),
-        T18 t18 = NullType(),
-        T19 t19 = NullType(),
-        T20 t20 = NullType(),
-        T21 t21 = NullType(),
-        T22 t22 = NullType(),
-        T23 t23 = NullType(),
-        T24 t24 = NullType(),
-        T25 t25 = NullType(),
-        T26 t26 = NullType(),
-        T27 t27 = NullType(),
-        T28 t28 = NullType(),
-        T29 t29 = NullType(),
-        T30 t30 = NullType(),
-        T31 t31 = NullType()
-        )
-    {
-        Event event;
-        SetArg<0, T0>::set(kernel_, t0);
-        SetArg<1, T1>::set(kernel_, t1);
-        SetArg<2, T2>::set(kernel_, t2);
-        SetArg<3, T3>::set(kernel_, t3);
-        SetArg<4, T4>::set(kernel_, t4);
-        SetArg<5, T5>::set(kernel_, t5);
-        SetArg<6, T6>::set(kernel_, t6);
-        SetArg<7, T7>::set(kernel_, t7);
-        SetArg<8, T8>::set(kernel_, t8);
-        SetArg<9, T9>::set(kernel_, t9);
-        SetArg<10, T10>::set(kernel_, t10);
-        SetArg<11, T11>::set(kernel_, t11);
-        SetArg<12, T12>::set(kernel_, t12);
-        SetArg<13, T13>::set(kernel_, t13);
-        SetArg<14, T14>::set(kernel_, t14);
-        SetArg<15, T15>::set(kernel_, t15);
-        SetArg<16, T16>::set(kernel_, t16);
-        SetArg<17, T17>::set(kernel_, t17);
-        SetArg<18, T18>::set(kernel_, t18);
-        SetArg<19, T19>::set(kernel_, t19);
-        SetArg<20, T20>::set(kernel_, t20);
-        SetArg<21, T21>::set(kernel_, t21);
-        SetArg<22, T22>::set(kernel_, t22);
-        SetArg<23, T23>::set(kernel_, t23);
-        SetArg<24, T24>::set(kernel_, t24);
-        SetArg<25, T25>::set(kernel_, t25);
-        SetArg<26, T26>::set(kernel_, t26);
-        SetArg<27, T27>::set(kernel_, t27);
-        SetArg<28, T28>::set(kernel_, t28);
-        SetArg<29, T29>::set(kernel_, t29);
-        SetArg<30, T30>::set(kernel_, t30);
-        SetArg<31, T31>::set(kernel_, t31);
-        
-        args.queue_.enqueueNDRangeKernel(
-            kernel_,
-            args.offset_,
-            args.global_,
-            args.local_,
-            &args.events_,
-            &event);
-        
-        return event;
-    }
-
-};
-
-//------------------------------------------------------------------------------------------------------
-
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26,
-    typename T27,
-    typename T28,
-    typename T29,
-    typename T30,
-    typename T31>
-struct functionImplementation_
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29,
-        T30,
-        T31> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 32))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29,
-        T30,
-        T31);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26,
-        T27 arg27,
-        T28 arg28,
-        T29 arg29,
-        T30 arg30,
-        T31 arg31)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26,
-            arg27,
-            arg28,
-            arg29,
-            arg30,
-            arg31);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26,
-    typename T27,
-    typename T28,
-    typename T29,
-    typename T30>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    T26,
-    T27,
-    T28,
-    T29,
-    T30,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29,
-        T30,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 31))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29,
-        T30);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26,
-        T27 arg27,
-        T28 arg28,
-        T29 arg29,
-        T30 arg30)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26,
-            arg27,
-            arg28,
-            arg29,
-            arg30);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26,
-    typename T27,
-    typename T28,
-    typename T29>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    T26,
-    T27,
-    T28,
-    T29,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 30))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        T29);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26,
-        T27 arg27,
-        T28 arg28,
-        T29 arg29)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26,
-            arg27,
-            arg28,
-            arg29);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26,
-    typename T27,
-    typename T28>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    T26,
-    T27,
-    T28,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 29))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        T28);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26,
-        T27 arg27,
-        T28 arg28)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26,
-            arg27,
-            arg28);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26,
-    typename T27>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    T26,
-    T27,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 28))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        T27);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26,
-        T27 arg27)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26,
-            arg27);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25,
-    typename T26>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    T26,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 27))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        T26);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25,
-        T26 arg26)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25,
-            arg26);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24,
-    typename T25>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    T25,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 26))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        T25);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24,
-        T25 arg25)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24,
-            arg25);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23,
-    typename T24>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    T24,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 25))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        T24);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23,
-        T24 arg24)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23,
-            arg24);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22,
-    typename T23>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    T23,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 24))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        T23);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22,
-        T23 arg23)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22,
-            arg23);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21,
-    typename T22>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    T22,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 23))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        T22);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21,
-        T22 arg22)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21,
-            arg22);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20,
-    typename T21>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    T21,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 22))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        T21);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20,
-        T21 arg21)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20,
-            arg21);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19,
-    typename T20>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    T20,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 21))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        T20);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19,
-        T20 arg20)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19,
-            arg20);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18,
-    typename T19>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    T19,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 20))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        T19);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18,
-        T19 arg19)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18,
-            arg19);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17,
-    typename T18>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    T18,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 19))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        T18);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17,
-        T18 arg18)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17,
-            arg18);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16,
-    typename T17>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    T17,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 18))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        T17);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16,
-        T17 arg17)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16,
-            arg17);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15,
-    typename T16>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    T16,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 17))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        T16);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15,
-        T16 arg16)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15,
-            arg16);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14,
-    typename T15>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    T15,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 16))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        T15);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14,
-        T15 arg15)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14,
-            arg15);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13,
-    typename T14>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    T14,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 15))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        T14);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13,
-        T14 arg14)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13,
-            arg14);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12,
-    typename T13>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    T13,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 14))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        T13);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12,
-        T13 arg13)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12,
-            arg13);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11,
-    typename T12>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    T12,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 13))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        T12);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11,
-        T12 arg12)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11,
-            arg12);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10,
-    typename T11>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    T11,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 12))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        T11);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10,
-        T11 arg11)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10,
-            arg11);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9,
-    typename T10>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    T10,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 11))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        T10);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9,
-        T10 arg10)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9,
-            arg10);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8,
-    typename T9>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    T9,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 10))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        T9);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8,
-        T9 arg9)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8,
-            arg9);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7,
-    typename T8>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    T8,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 9))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        T8);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7,
-        T8 arg8)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7,
-            arg8);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6,
-    typename T7>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    T7,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 8))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        T7);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6,
-        T7 arg7)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6,
-            arg7);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5,
-    typename T6>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    T6,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 7))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        T6);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5,
-        T6 arg6)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5,
-            arg6);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4,
-    typename T5>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    T5,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 6))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        T5);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4,
-        T5 arg5)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4,
-            arg5);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3,
-    typename T4>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    T4,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        T4,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 5))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3,
-        T4);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3,
-        T4 arg4)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3,
-            arg4);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2,
-    typename T3>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    T3,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        T3,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 4))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2,
-        T3);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2,
-        T3 arg3)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2,
-            arg3);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1,
-    typename T2>
-struct functionImplementation_
-<    T0,
-    T1,
-    T2,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        T2,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 3))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1,
-        T2);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1,
-        T2 arg2)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1,
-            arg2);
-    }
-
-
-};
-
-template<
-    typename T0,
-    typename T1>
-struct functionImplementation_
-<    T0,
-    T1,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        T1,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 2))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0,
-        T1);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0,
-        T1 arg1)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0,
-            arg1);
-    }
-
-
-};
-
-template<
-    typename T0>
-struct functionImplementation_
-<    T0,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType,
-    NullType>
-{
-    typedef detail::KernelFunctorGlobal<
-        T0,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType,
-        NullType> FunctorType;
-
-    FunctorType functor_;
-
-    functionImplementation_(const FunctorType &functor) :
-        functor_(functor)
-    {
-    
-        #if (defined(_WIN32) && defined(_VARIADIC_MAX) && (_VARIADIC_MAX < 1))
-        // Fail variadic expansion for dev11
-        static_assert(0, "Visual Studio has a hard limit of argument count for a std::function expansion. Please define _VARIADIC_MAX to be 10. If you need more arguments than that VC12 and below cannot support it.");
-        #endif
-            
-    }
-
-    //! \brief Return type of the functor
-    typedef Event result_type;
-
-    //! \brief Function signature of kernel functor with no event dependency.
-    typedef Event type_(
-        const EnqueueArgs&,
-        T0);
-
-    Event operator()(
-        const EnqueueArgs& enqueueArgs,
-        T0 arg0)
-    {
-        return functor_(
-            enqueueArgs,
-            arg0);
-    }
-
-
-};
-
-
-
-
-
-} // namespace detail
-
-//----------------------------------------------------------------------------------------------
-
-template <
-   typename T0,   typename T1 = detail::NullType,   typename T2 = detail::NullType,
-   typename T3 = detail::NullType,   typename T4 = detail::NullType,
-   typename T5 = detail::NullType,   typename T6 = detail::NullType,
-   typename T7 = detail::NullType,   typename T8 = detail::NullType,
-   typename T9 = detail::NullType,   typename T10 = detail::NullType,
-   typename T11 = detail::NullType,   typename T12 = detail::NullType,
-   typename T13 = detail::NullType,   typename T14 = detail::NullType,
-   typename T15 = detail::NullType,   typename T16 = detail::NullType,
-   typename T17 = detail::NullType,   typename T18 = detail::NullType,
-   typename T19 = detail::NullType,   typename T20 = detail::NullType,
-   typename T21 = detail::NullType,   typename T22 = detail::NullType,
-   typename T23 = detail::NullType,   typename T24 = detail::NullType,
-   typename T25 = detail::NullType,   typename T26 = detail::NullType,
-   typename T27 = detail::NullType,   typename T28 = detail::NullType,
-   typename T29 = detail::NullType,   typename T30 = detail::NullType,
-   typename T31 = detail::NullType
->
-struct make_kernel :
-    public detail::functionImplementation_<
-               T0,   T1,   T2,   T3,
-               T4,   T5,   T6,   T7,
-               T8,   T9,   T10,   T11,
-               T12,   T13,   T14,   T15,
-               T16,   T17,   T18,   T19,
-               T20,   T21,   T22,   T23,
-               T24,   T25,   T26,   T27,
-               T28,   T29,   T30,   T31
-    >
-{
-public:
-    typedef detail::KernelFunctorGlobal<             
-               T0,   T1,   T2,   T3,
-               T4,   T5,   T6,   T7,
-               T8,   T9,   T10,   T11,
-               T12,   T13,   T14,   T15,
-               T16,   T17,   T18,   T19,
-               T20,   T21,   T22,   T23,
-               T24,   T25,   T26,   T27,
-               T28,   T29,   T30,   T31
-    > FunctorType;
-
-    make_kernel(
-        const Program& program,
-        const STRING_CLASS name,
-        cl_int * err = NULL) :
-           detail::functionImplementation_<
-                    T0,   T1,   T2,   T3,
-                       T4,   T5,   T6,   T7,
-                       T8,   T9,   T10,   T11,
-                       T12,   T13,   T14,   T15,
-                       T16,   T17,   T18,   T19,
-                       T20,   T21,   T22,   T23,
-                       T24,   T25,   T26,   T27,
-                       T28,   T29,   T30,   T31
-           >(
-            FunctorType(program, name, err)) 
-    {}
-
-    make_kernel(
-        const Kernel kernel) :
-           detail::functionImplementation_<
-                    T0,   T1,   T2,   T3,
-                       T4,   T5,   T6,   T7,
-                       T8,   T9,   T10,   T11,
-                       T12,   T13,   T14,   T15,
-                       T16,   T17,   T18,   T19,
-                       T20,   T21,   T22,   T23,
-                       T24,   T25,   T26,   T27,
-                       T28,   T29,   T30,   T31
-           >(
-            FunctorType(kernel)) 
-    {}    
-};
-
-
-//----------------------------------------------------------------------------------------------------------------------
-
-#undef __ERR_STR
-#if !defined(__CL_USER_OVERRIDE_ERROR_STRINGS)
-#undef __GET_DEVICE_INFO_ERR
-#undef __GET_PLATFORM_INFO_ERR
-#undef __GET_DEVICE_IDS_ERR
-#undef __GET_CONTEXT_INFO_ERR
-#undef __GET_EVENT_INFO_ERR
-#undef __GET_EVENT_PROFILE_INFO_ERR
-#undef __GET_MEM_OBJECT_INFO_ERR
-#undef __GET_IMAGE_INFO_ERR
-#undef __GET_SAMPLER_INFO_ERR
-#undef __GET_KERNEL_INFO_ERR
-#undef __GET_KERNEL_ARG_INFO_ERR
-#undef __GET_KERNEL_WORK_GROUP_INFO_ERR
-#undef __GET_PROGRAM_INFO_ERR
-#undef __GET_PROGRAM_BUILD_INFO_ERR
-#undef __GET_COMMAND_QUEUE_INFO_ERR
-
-#undef __CREATE_CONTEXT_ERR
-#undef __CREATE_CONTEXT_FROM_TYPE_ERR
-#undef __GET_SUPPORTED_IMAGE_FORMATS_ERR
-
-#undef __CREATE_BUFFER_ERR
-#undef __CREATE_SUBBUFFER_ERR
-#undef __CREATE_IMAGE2D_ERR
-#undef __CREATE_IMAGE3D_ERR
-#undef __CREATE_SAMPLER_ERR
-#undef __SET_MEM_OBJECT_DESTRUCTOR_CALLBACK_ERR
-
-#undef __CREATE_USER_EVENT_ERR
-#undef __SET_USER_EVENT_STATUS_ERR
-#undef __SET_EVENT_CALLBACK_ERR
-#undef __SET_PRINTF_CALLBACK_ERR
-
-#undef __WAIT_FOR_EVENTS_ERR
-
-#undef __CREATE_KERNEL_ERR
-#undef __SET_KERNEL_ARGS_ERR
-#undef __CREATE_PROGRAM_WITH_SOURCE_ERR
-#undef __CREATE_PROGRAM_WITH_BINARY_ERR
-#undef __CREATE_PROGRAM_WITH_BUILT_IN_KERNELS_ERR
-#undef __BUILD_PROGRAM_ERR
-#undef __CREATE_KERNELS_IN_PROGRAM_ERR
-
-#undef __CREATE_COMMAND_QUEUE_ERR
-#undef __SET_COMMAND_QUEUE_PROPERTY_ERR
-#undef __ENQUEUE_READ_BUFFER_ERR
-#undef __ENQUEUE_WRITE_BUFFER_ERR
-#undef __ENQUEUE_READ_BUFFER_RECT_ERR
-#undef __ENQUEUE_WRITE_BUFFER_RECT_ERR
-#undef __ENQEUE_COPY_BUFFER_ERR
-#undef __ENQEUE_COPY_BUFFER_RECT_ERR
-#undef __ENQUEUE_READ_IMAGE_ERR
-#undef __ENQUEUE_WRITE_IMAGE_ERR
-#undef __ENQUEUE_COPY_IMAGE_ERR
-#undef __ENQUEUE_COPY_IMAGE_TO_BUFFER_ERR
-#undef __ENQUEUE_COPY_BUFFER_TO_IMAGE_ERR
-#undef __ENQUEUE_MAP_BUFFER_ERR
-#undef __ENQUEUE_MAP_IMAGE_ERR
-#undef __ENQUEUE_UNMAP_MEM_OBJECT_ERR
-#undef __ENQUEUE_NDRANGE_KERNEL_ERR
-#undef __ENQUEUE_TASK_ERR
-#undef __ENQUEUE_NATIVE_KERNEL
-
-#undef __CL_EXPLICIT_CONSTRUCTORS
-
-#undef __UNLOAD_COMPILER_ERR
-#endif //__CL_USER_OVERRIDE_ERROR_STRINGS
-
-#undef __CL_FUNCTION_TYPE
-
-// Extensions
-/**
- * Deprecated APIs for 1.2
- */
-#if defined(CL_VERSION_1_1)
-#undef __INIT_CL_EXT_FCN_PTR
-#endif // #if defined(CL_VERSION_1_1)
-#undef __CREATE_SUB_DEVICES
-
-#if defined(USE_CL_DEVICE_FISSION)
-#undef __PARAM_NAME_DEVICE_FISSION
-#endif // USE_CL_DEVICE_FISSION
-
-#undef __DEFAULT_NOT_INITIALIZED 
-#undef __DEFAULT_BEING_INITIALIZED 
-#undef __DEFAULT_INITIALIZED
-
-} // namespace cl
-
-#ifdef _WIN32
-#pragma pop_macro("max")
-#endif // _WIN32
-
-#endif // CL_HPP_
diff --git a/src/backend/opencl/cl2hpp.hpp b/src/backend/opencl/cl2hpp.hpp
new file mode 100644
index 0000000000..729710d420
--- /dev/null
+++ b/src/backend/opencl/cl2hpp.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/deprecated.hpp>
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#pragma GCC diagnostic ignored "-Wunused-parameter"
+#pragma GCC diagnostic ignored "-Wignored-qualifiers"
+AF_DEPRECATED_WARNINGS_OFF
+#if __GNUC__ >= 8
+#pragma GCC diagnostic ignored "-Wcatch-value="
+#endif
+#ifdef __has_include
+#if __has_include(<CL/opencl.hpp>)
+#include <CL/opencl.hpp>
+#else
+#include <CL/cl2.hpp>
+#endif
+#else
+#include <CL/cl2.hpp>
+#endif
+AF_DEPRECATED_WARNINGS_ON
+#pragma GCC diagnostic pop
diff --git a/src/backend/opencl/clfft.cpp b/src/backend/opencl/clfft.cpp
new file mode 100644
index 0000000000..68a17cbd50
--- /dev/null
+++ b/src/backend/opencl/clfft.cpp
@@ -0,0 +1,182 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <clfft.hpp>
+#include <common/err_common.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+
+#include <memory>
+#include <string>
+
+using std::make_unique;
+using std::string;
+
+namespace arrayfire {
+namespace opencl {
+const char *_clfftGetResultString(clfftStatus st) {
+    switch (st) {
+        case CLFFT_SUCCESS: return "Success";
+        case CLFFT_DEVICE_NOT_FOUND: return "Device Not Found";
+        case CLFFT_DEVICE_NOT_AVAILABLE: return "Device Not Available";
+        case CLFFT_COMPILER_NOT_AVAILABLE: return "Compiler Not Available";
+        case CLFFT_MEM_OBJECT_ALLOCATION_FAILURE:
+            return "Memory Object Allocation Failure";
+        case CLFFT_OUT_OF_RESOURCES: return "Out of Resources";
+        case CLFFT_OUT_OF_HOST_MEMORY: return "Out of Host Memory";
+        case CLFFT_PROFILING_INFO_NOT_AVAILABLE:
+            return "Profiling Information Not Available";
+        case CLFFT_MEM_COPY_OVERLAP: return "Memory Copy Overlap";
+        case CLFFT_IMAGE_FORMAT_MISMATCH: return "Image Format Mismatch";
+        case CLFFT_IMAGE_FORMAT_NOT_SUPPORTED:
+            return "Image Format Not Supported";
+        case CLFFT_BUILD_PROGRAM_FAILURE: return "Build Program Failure";
+        case CLFFT_MAP_FAILURE: return "Map Failure";
+        case CLFFT_INVALID_VALUE: return "Invalid Value";
+        case CLFFT_INVALID_DEVICE_TYPE: return "Invalid Device Type";
+        case CLFFT_INVALID_PLATFORM: return "Invalid Platform";
+        case CLFFT_INVALID_DEVICE: return "Invalid Device";
+        case CLFFT_INVALID_CONTEXT: return "Invalid Context";
+        case CLFFT_INVALID_QUEUE_PROPERTIES: return "Invalid Queue Properties";
+        case CLFFT_INVALID_COMMAND_QUEUE: return "Invalid Command Queue";
+        case CLFFT_INVALID_HOST_PTR: return "Invalid Host Pointer";
+        case CLFFT_INVALID_MEM_OBJECT: return "Invalid Memory Object";
+        case CLFFT_INVALID_IMAGE_FORMAT_DESCRIPTOR:
+            return "Invalid Image Format Descriptor";
+        case CLFFT_INVALID_IMAGE_SIZE: return "Invalid Image Size";
+        case CLFFT_INVALID_SAMPLER: return "Invalid Sampler";
+        case CLFFT_INVALID_BINARY: return "Invalid Binary";
+        case CLFFT_INVALID_BUILD_OPTIONS: return "Invalid Build Options";
+        case CLFFT_INVALID_PROGRAM: return "Invalid Program";
+        case CLFFT_INVALID_PROGRAM_EXECUTABLE:
+            return "Invalid Program Executable";
+        case CLFFT_INVALID_KERNEL_NAME: return "Invalid Kernel Name";
+        case CLFFT_INVALID_KERNEL_DEFINITION:
+            return "Invalid Kernel Definition";
+        case CLFFT_INVALID_KERNEL: return "Invalid Kernel";
+        case CLFFT_INVALID_ARG_INDEX: return "Invalid Argument Index";
+        case CLFFT_INVALID_ARG_VALUE: return "Invalid Argument Value";
+        case CLFFT_INVALID_ARG_SIZE: return "Invalid Argument Size";
+        case CLFFT_INVALID_KERNEL_ARGS: return "Invalid Kernel Arguments";
+        case CLFFT_INVALID_WORK_DIMENSION: return "Invalid Work Dimension";
+        case CLFFT_INVALID_WORK_GROUP_SIZE: return "Invalid Work Group Size";
+        case CLFFT_INVALID_WORK_ITEM_SIZE: return "Invalid Work Item Size";
+        case CLFFT_INVALID_GLOBAL_OFFSET: return "Invalid Global Offset";
+        case CLFFT_INVALID_EVENT_WAIT_LIST: return "Invalid Event Wait List";
+        case CLFFT_INVALID_EVENT: return "Invalid Event";
+        case CLFFT_INVALID_OPERATION: return "Invalid Operation";
+        case CLFFT_INVALID_GL_OBJECT: return "Invalid GL Object";
+        case CLFFT_INVALID_BUFFER_SIZE: return "Invalid Buffer Size";
+        case CLFFT_INVALID_MIP_LEVEL: return "Invalid MIP Level";
+        case CLFFT_INVALID_GLOBAL_WORK_SIZE: return "Invalid Global Work Size";
+        case CLFFT_BUGCHECK: return "Bugcheck";
+        case CLFFT_NOTIMPLEMENTED: return "Not implemented";
+        case CLFFT_TRANSPOSED_NOTIMPLEMENTED:
+            return "Transpose not implemented for this transformation";
+        case CLFFT_FILE_NOT_FOUND: return "File not found";
+        case CLFFT_FILE_CREATE_FAILURE: return "File creation failed";
+        case CLFFT_VERSION_MISMATCH: return "Version mismatch";
+        case CLFFT_INVALID_PLAN: return "Invalid plan";
+        case CLFFT_DEVICE_NO_DOUBLE:
+            return "Device does not support double precision";
+        case CLFFT_DEVICE_MISMATCH: return "Plan device mismatch";
+        case CLFFT_ENDSTATUS: return "End status";
+    }
+
+    return "Unknown error";
+}
+
+SharedPlan findPlan(clfftLayout iLayout, clfftLayout oLayout, clfftDim rank,
+                    size_t *clLengths, size_t *istrides, size_t idist,
+                    size_t *ostrides, size_t odist, clfftPrecision precision,
+                    size_t batch) {
+    // create the key string
+    char key_str_temp[64];
+    sprintf(key_str_temp, "%d:%d:%d:", iLayout, oLayout, rank);
+
+    string key_string(key_str_temp);
+
+    /* WARNING: DO NOT CHANGE sprintf format specifier */
+    for (int r = 0; r < rank; ++r) {
+        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", clLengths[r]);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (istrides != NULL) {
+        for (int r = 0; r < rank; ++r) {
+            sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", istrides[r]);
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", idist);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    if (ostrides != NULL) {
+        for (int r = 0; r < rank; ++r) {
+            sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", ostrides[r]);
+            key_string.append(std::string(key_str_temp));
+        }
+        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", odist);
+        key_string.append(std::string(key_str_temp));
+    }
+
+    sprintf(key_str_temp, "%d:" SIZE_T_FRMT_SPECIFIER,
+            static_cast<int>(precision), batch);
+    key_string.append(std::string(key_str_temp));
+
+    PlanCache &planner = opencl::fftManager();
+    SharedPlan retVal  = planner.find(key_string);
+
+    if (retVal) { return retVal; }
+
+    auto temp = make_unique<PlanType>();
+
+    // getContext() returns a Context object; its operator() yields
+    // the underlying cl_context handle
+    CLFFT_CHECK(clfftCreateDefaultPlan(temp.get(), opencl::getContext()(), rank,
+                                       clLengths));
+
+    // matching layouts (complex to complex) are transformed in place
+    if (iLayout == oLayout) {
+        CLFFT_CHECK(clfftSetResultLocation(*temp, CLFFT_INPLACE));
+    } else {
+        CLFFT_CHECK(clfftSetResultLocation(*temp, CLFFT_OUTOFPLACE));
+    }
+
+    CLFFT_CHECK(clfftSetLayout(*temp, iLayout, oLayout));
+    CLFFT_CHECK(clfftSetPlanBatchSize(*temp, batch));
+    CLFFT_CHECK(clfftSetPlanDistance(*temp, idist, odist));
+    CLFFT_CHECK(clfftSetPlanInStride(*temp, rank, istrides));
+    CLFFT_CHECK(clfftSetPlanOutStride(*temp, rank, ostrides));
+    CLFFT_CHECK(clfftSetPlanPrecision(*temp, precision));
+    CLFFT_CHECK(clfftSetPlanScale(*temp, CLFFT_BACKWARD, 1.0));
+
+    // getQueue() returns a CommandQueue object; its operator() yields
+    // the underlying cl_command_queue handle
+    CLFFT_CHECK(clfftBakePlan(*temp, 1, &(opencl::getQueue()()), NULL, NULL));
+
+    retVal.reset(temp.release(), [](PlanType *p) {
+#ifndef OS_WIN
+        // On Windows, resources released after the main function has exited
+        // cause "Pure Virtual Function Called" errors. It seems that Windows
+        // releases all resources when exiting main without calling their
+        // destructors; when the destructors are called later, this error is
+        // raised. The plan is therefore left undestroyed on Windows. See
+        // https://github.com/arrayfire/arrayfire/pull/1899
+        CLFFT_CHECK(clfftDestroyPlan(p));
+        delete p;
+#endif
+    });
+    // push the plan into plan cache
+    planner.push(key_string, retVal);
+
+    return retVal;
+}
+}  // namespace opencl
+}  // namespace arrayfire
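Reviewer note: `findPlan` above serializes the plan parameters into a colon-separated key, looks the key up in a cache, and on a miss wraps a newly created plan in a `shared_ptr` with a custom deleter. A minimal, self-contained sketch of that pattern follows. It makes simplifying assumptions: `FakePlan`, `makeKey`, `findOrCreate`, and `g_destroyed` are hypothetical names for illustration only, a `std::map` stands in for the real `PlanCache`, and the strides, distances, and precision are left out of the key.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for clfftPlanHandle; this is not the clFFT type.
using FakePlan       = std::size_t;
using SharedFakePlan = std::shared_ptr<FakePlan>;

// Counts how many plans the custom deleter has destroyed (illustration only).
int g_destroyed = 0;

// Builds a colon-separated key such as "1:1:2:8:4:3" from the parameters.
std::string makeKey(int iLayout, int oLayout, int rank,
                    const std::size_t *lengths, std::size_t batch) {
    assert(rank >= 0 && lengths != nullptr);
    std::string key = std::to_string(iLayout) + ":" +
                      std::to_string(oLayout) + ":" +
                      std::to_string(rank) + ":";
    for (int r = 0; r < rank; ++r) {
        key += std::to_string(lengths[r]) + ":";
    }
    key += std::to_string(batch);
    return key;
}

// Returns the cached plan for `key`, creating and caching one on a miss.
SharedFakePlan findOrCreate(std::map<std::string, SharedFakePlan> &cache,
                            const std::string &key) {
    auto it = cache.find(key);
    if (it != cache.end()) { return it->second; }

    auto temp = std::make_unique<FakePlan>(cache.size() + 1);

    // Hand ownership to a shared_ptr whose deleter runs the library-specific
    // teardown (here only a counter) before freeing the handle storage.
    SharedFakePlan plan(temp.release(), [](FakePlan *p) {
        ++g_destroyed;
        delete p;
    });
    cache.emplace(key, plan);
    return plan;
}
```

Two lookups with the same key return the same handle, and the deleter runs once, when the last owner (the cache included) lets go.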
diff --git a/src/backend/opencl/clfft.hpp b/src/backend/opencl/clfft.hpp
new file mode 100644
index 0000000000..c7b9d9949f
--- /dev/null
+++ b/src/backend/opencl/clfft.hpp
@@ -0,0 +1,55 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <clFFT.h>
+#include <common/FFTPlanCache.hpp>
+#include <memory.hpp>
+
+#include <cstdio>
+
+namespace arrayfire {
+namespace opencl {
+typedef clfftPlanHandle PlanType;
+typedef std::shared_ptr<PlanType> SharedPlan;
+
+const char *_clfftGetResultString(clfftStatus st);
+
+SharedPlan findPlan(clfftLayout iLayout, clfftLayout oLayout, clfftDim rank,
+                    size_t *clLengths, size_t *istrides, size_t idist,
+                    size_t *ostrides, size_t odist, clfftPrecision precision,
+                    size_t batch);
+
+class PlanCache : public common::FFTPlanCache<PlanCache, PlanType> {
+    friend SharedPlan findPlan(clfftLayout iLayout, clfftLayout oLayout,
+                               clfftDim rank, size_t *clLengths,
+                               size_t *istrides, size_t idist, size_t *ostrides,
+                               size_t odist, clfftPrecision precision,
+                               size_t batch);
+};
+}  // namespace opencl
+}  // namespace arrayfire
+
+#define CLFFT_CHECK(fn)                                          \
+    do {                                                         \
+        clfftStatus _clfft_st = (fn);                            \
+        if (_clfft_st != CLFFT_SUCCESS) {                        \
+            opencl::signalMemoryCleanup();                       \
+            _clfft_st = (fn);                                    \
+        }                                                        \
+        if (_clfft_st != CLFFT_SUCCESS) {                        \
+            char clfft_st_msg[1024];                             \
+            snprintf(clfft_st_msg, sizeof(clfft_st_msg),         \
+                     "clFFT Error (%d): %s\n", (int)(_clfft_st), \
+                     opencl::_clfftGetResultString(_clfft_st));  \
+                                                                 \
+            AF_ERROR(clfft_st_msg, AF_ERR_INTERNAL);             \
+        }                                                        \
+    } while (0)
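Reviewer note: `CLFFT_CHECK` above evaluates its argument a second time after asking the memory manager to clean up, so an expression with side effects runs twice when the first attempt fails. The sketch below reproduces only that retry-once control flow. `Status`, `releaseCachedMemory`, `CHECK_WITH_RETRY`, `failsOnce`, and `alwaysFails` are hypothetical names, and a plain `std::runtime_error` stands in for `AF_ERROR`.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical status codes; not the clfftStatus enumeration.
enum Status { OK = 0, OUT_OF_RESOURCES = 1 };

// Stand-in for the memory cleanup hook; it only counts invocations here.
int g_cleanups = 0;
void releaseCachedMemory() { ++g_cleanups; }

// Evaluates `expr`; on failure, frees memory and evaluates `expr` once more.
// A second failure is reported by throwing.
#define CHECK_WITH_RETRY(expr)                                   \
    do {                                                         \
        Status st_ = (expr);                                     \
        if (st_ != OK) {                                         \
            releaseCachedMemory();                               \
            st_ = (expr);                                        \
        }                                                        \
        if (st_ != OK) {                                         \
            throw std::runtime_error(                            \
                "call failed with status " +                     \
                std::to_string(static_cast<int>(st_)));          \
        }                                                        \
    } while (0)

// Test helpers: one call that succeeds on its second attempt, one that never
// succeeds.
int g_calls = 0;
Status failsOnce() { return (++g_calls == 1) ? OUT_OF_RESOURCES : OK; }
Status alwaysFails() { return OUT_OF_RESOURCES; }
```

`CHECK_WITH_RETRY(failsOnce())` calls `failsOnce` twice and triggers one cleanup. `CHECK_WITH_RETRY(alwaysFails())` triggers a cleanup and then throws.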
diff --git a/src/backend/opencl/compile_module.cpp b/src/backend/opencl/compile_module.cpp
new file mode 100644
index 0000000000..f0244b3b0d
--- /dev/null
+++ b/src/backend/opencl/compile_module.cpp
@@ -0,0 +1,290 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/compile_module.hpp>  //compileModule & loadModuleFromDisk
+#include <common/kernel_cache.hpp>    //getKernel(Module&, ...)
+
+#include <cl2hpp.hpp>
+#include <common/Logger.hpp>
+#include <common/defines.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/util.hpp>
+#include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <kernel_headers/KParam.hpp>
+#include <nonstd/span.hpp>
+#include <platform.hpp>
+#include <traits.hpp>
+
+#include <algorithm>
+#include <cctype>
+#include <cstdio>
+#include <fstream>
+#include <sstream>
+#include <string>
+#include <vector>
+
+using arrayfire::common::getEnvVar;
+using arrayfire::common::loggerFactory;
+using arrayfire::opencl::getActiveDeviceId;
+using arrayfire::opencl::getDevice;
+using arrayfire::opencl::Kernel;
+using arrayfire::opencl::Module;
+using cl::Error;
+using cl::Program;
+using fmt::format;
+using nonstd::span;
+using spdlog::logger;
+
+using std::begin;
+using std::end;
+using std::ofstream;
+using std::ostringstream;
+using std::shared_ptr;
+using std::string;
+using std::to_string;
+using std::transform;
+using std::vector;
+using std::chrono::duration_cast;
+using std::chrono::high_resolution_clock;
+using std::chrono::milliseconds;
+
+logger *getLogger() {
+    static shared_ptr<logger> logger(loggerFactory("jit"));
+    return logger.get();
+}
+
+#define THROW_BUILD_LOG_EXCEPTION(PROG)                              \
+    do {                                                             \
+        string build_error = getProgramBuildLog(PROG);               \
+        string info        = getEnvVar("AF_OPENCL_SHOW_BUILD_INFO"); \
+        if (!info.empty() && info != "0") puts(build_error.c_str()); \
+        AF_ERROR(build_error, AF_ERR_INTERNAL);                      \
+    } while (0)
+
+namespace arrayfire {
+namespace opencl {
+
+const static string DEFAULT_MACROS_STR(
+    "\n\
+                                           #ifdef USE_DOUBLE\n\
+                                           #pragma OPENCL EXTENSION cl_khr_fp64 : enable\n\
+                                           #endif\n                     \
+                                           #ifdef USE_HALF\n\
+                                           #pragma OPENCL EXTENSION cl_khr_fp16 : enable\n\
+                                           #else\n                     \
+                                           #define half short\n          \
+                                           #endif\n                      \
+                                           #ifndef schar\n              \
+                                           #define schar char\n         \
+                                           #endif\n                     \
+                                           #ifndef M_PI\n               \
+                                           #define M_PI 3.1415926535897932384626433832795028841971693993751058209749445923078164\n \
+                                           #endif\n                     \
+                                           ");
+
+Program buildProgram(span<const string> kernelSources,
+                     span<const string> compileOpts) {
+    Program retVal;
+    try {
+        auto device = getDevice();
+        Program::Sources sources;
+        sources.emplace_back(DEFAULT_MACROS_STR);
+        sources.emplace_back(KParam_hpp, KParam_hpp_len);
+        sources.insert(end(sources), begin(kernelSources), end(kernelSources));
+
+        retVal = Program(getContext(), sources);
+
+        ostringstream options;
+        for (auto &opt : compileOpts) { options << opt; }
+        options << getActiveDeviceBaseBuildFlags();
+        retVal.build({device}, (options.str()).c_str());
+    } catch (Error &err) {
+        if (err.err() == CL_BUILD_PROGRAM_FAILURE) {
+            THROW_BUILD_LOG_EXCEPTION(retVal);
+        }
+        throw;
+    }
+    return retVal;
+}
+
+string getProgramBuildLog(const Program &prog) {
+    string build_error("");
+    try {
+        build_error.reserve(4096);
+        auto devices = prog.getInfo<CL_PROGRAM_DEVICES>();
+        for (auto &device : devices) {
+            build_error +=
+                format("OpenCL Device: {}\n\tOptions: {}\n\tLog:\n{}\n",
+                       device.getInfo<CL_DEVICE_NAME>(),
+                       prog.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(device),
+                       prog.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device));
+        }
+    } catch (const cl::Error &e) {
+        build_error = format("Failed to fetch build log: {}", e.what());
+    }
+    return build_error;
+}
+
+string getKernelCacheFilename(const int device, const string &key) {
+    auto &dev = arrayfire::opencl::getDevice(device);
+
+    unsigned vendorId = dev.getInfo<CL_DEVICE_VENDOR_ID>();
+    auto devName      = dev.getInfo<CL_DEVICE_NAME>();
+    string infix      = to_string(vendorId) + "_" + devName;
+
+    transform(infix.begin(), infix.end(), infix.begin(),
+              [](unsigned char c) { return std::toupper(c); });
+    std::replace(infix.begin(), infix.end(), ' ', '_');
+
+    return "KER" + key + "_CL_" + infix + "_AF_" +
+           to_string(AF_API_VERSION_CURRENT) + ".bin";
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+namespace arrayfire {
+namespace common {
+
+Module compileModule(const string &moduleKey, span<const string> sources,
+                     span<const string> options, span<const string> kInstances,
+                     const bool isJIT) {
+    UNUSED(kInstances);
+    UNUSED(isJIT);
+
+    auto compileBegin = high_resolution_clock::now();
+    auto program      = arrayfire::opencl::buildProgram(sources, options);
+    auto compileEnd   = high_resolution_clock::now();
+
+#ifdef AF_CACHE_KERNELS_TO_DISK
+    const int device             = arrayfire::opencl::getActiveDeviceId();
+    const string &cacheDirectory = getCacheDirectory();
+    if (!cacheDirectory.empty()) {
+        const string cacheFile =
+            cacheDirectory + AF_PATH_SEPARATOR +
+            opencl::getKernelCacheFilename(device, moduleKey);
+        const string tempFile =
+            cacheDirectory + AF_PATH_SEPARATOR + makeTempFilename();
+        try {
+            auto binaries = program.getInfo<CL_PROGRAM_BINARIES>();
+
+            // TODO Handle cases where program objects are created from contexts
+            // having multiple devices
+            const size_t clbinSize = binaries[0].size();
+            const char *clbin =
+                reinterpret_cast<const char *>(binaries[0].data());
+            const size_t clbinHash = deterministicHash(clbin, clbinSize);
+
+            // write module hash and binary data to file
+            ofstream out(tempFile, std::ios::binary);
+
+            out.write(reinterpret_cast<const char *>(&clbinHash),
+                      sizeof(clbinHash));
+            out.write(reinterpret_cast<const char *>(&clbinSize),
+                      sizeof(clbinSize));
+            out.write(static_cast<const char *>(clbin), clbinSize);
+            out.close();
+
+            // Try to rename the temporary file to the final cache file. If
+            // this fails, another thread has finished compiling this kernel
+            // before the current thread.
+            if (!renameFile(tempFile, cacheFile)) { removeFile(tempFile); }
+        } catch (const cl::Error &e) {
+            AF_TRACE(
+                "{{{:<20} : Failed to fetch opencl binary for {}, {}}}",
+                moduleKey,
+                arrayfire::opencl::getDevice(device).getInfo<CL_DEVICE_NAME>(),
+                e.what());
+        } catch (const std::ios_base::failure &e) {
+            AF_TRACE(
+                "{{{:<20} : Failed writing binary to {} for {}, {}}}",
+                moduleKey, cacheFile,
+                arrayfire::opencl::getDevice(device).getInfo<CL_DEVICE_NAME>(),
+                e.what());
+        }
+    }
+#endif
+
+    AF_TRACE("{{ {:<20} : {{ compile:{:>5} ms, {{ {} }}, {} }} }}", moduleKey,
+             duration_cast<milliseconds>(compileEnd - compileBegin).count(),
+             fmt::join(options, " "),
+             getDevice(getActiveDeviceId()).getInfo<CL_DEVICE_NAME>());
+
+    return {program};
+}
+
+Module loadModuleFromDisk(const int device, const string &moduleKey,
+                          const bool isJIT) {
+    const string &cacheDirectory = getCacheDirectory();
+    if (cacheDirectory.empty()) return Module{};
+
+    auto &dev              = arrayfire::opencl::getDevice(device);
+    const string cacheFile = cacheDirectory + AF_PATH_SEPARATOR +
+                             opencl::getKernelCacheFilename(device, moduleKey);
+    Program program;
+    Module retVal{};
+    try {
+        std::ifstream in(cacheFile, std::ios::binary);
+        if (!in.is_open()) {
+            AF_TRACE("{{{:<20} : Unable to open {} for {}}}", moduleKey,
+                     cacheFile, dev.getInfo<CL_DEVICE_NAME>());
+            removeFile(cacheFile);
+            return retVal;
+        }
+        in.exceptions(std::ios::failbit | std::ios::badbit);
+
+        // TODO Handle cases where program objects are created from contexts
+        // having multiple devices
+        size_t clbinHash = 0;
+        in.read(reinterpret_cast<char *>(&clbinHash), sizeof(clbinHash));
+        size_t clbinSize = 0;
+        in.read(reinterpret_cast<char *>(&clbinSize), sizeof(clbinSize));
+        vector<unsigned char> clbin(clbinSize);
+        in.read(reinterpret_cast<char *>(clbin.data()), clbinSize);
+        in.close();
+
+        const size_t recomputedHash =
+            deterministicHash(clbin.data(), clbinSize);
+        if (recomputedHash != clbinHash) {
+            AF_TRACE(
+                "{{{:<20} : Corrupt binary({}) found on disk for {}, removed}}",
+                moduleKey, cacheFile, dev.getInfo<CL_DEVICE_NAME>());
+            removeFile(cacheFile);
+            return retVal;
+        }
+        program = Program(arrayfire::opencl::getContext(), {dev}, {clbin});
+        program.build();
+
+        AF_TRACE("{{{:<20} : loaded from {} for {} }}", moduleKey, cacheFile,
+                 dev.getInfo<CL_DEVICE_NAME>());
+        retVal.set(program);
+    } catch (const std::ios_base::failure &e) {
+        AF_TRACE("{{{:<20} : IO failure while loading {} for {}; {}}}",
+                 moduleKey, cacheFile, dev.getInfo<CL_DEVICE_NAME>(), e.what());
+        removeFile(cacheFile);
+    } catch (const cl::Error &e) {
+        AF_TRACE(
+            "{{{:<20} : Loading OpenCL binary({}) failed for {}; {}, Build "
+            "Log: {}}}",
+            moduleKey, cacheFile, dev.getInfo<CL_DEVICE_NAME>(), e.what(),
+            opencl::getProgramBuildLog(program));
+        removeFile(cacheFile);
+    }
+    return retVal;
+}
+
+Kernel getKernel(const Module &mod, const string &nameExpr,
+                 const bool sourceWasJIT) {
+    UNUSED(sourceWasJIT);
+    return {nameExpr, &mod.get(), cl::Kernel(mod.get(), nameExpr.c_str())};
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/opencl/complex.hpp b/src/backend/opencl/complex.hpp
index bdd42ba553..a4306c7be3 100644
--- a/src/backend/opencl/complex.hpp
+++ b/src/backend/opencl/complex.hpp
@@ -7,75 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <optypes.hpp>
 #include <binary.hpp>
-#include <JIT/UnaryNode.hpp>
+#include <common/jit/BinaryNode.hpp>
+#include <common/jit/UnaryNode.hpp>
+#include <optypes.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
-    template<typename To, typename Ti>
-    Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<To, Ti, af_cplx2_t>(lhs, rhs, odims);
-    }
+namespace arrayfire {
+namespace opencl {
+template<typename To, typename Ti>
+Array<To> cplx(const Array<Ti> &lhs, const Array<Ti> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<To, Ti, af_cplx2_t>(lhs, rhs, odims);
+}
 
-    template<typename To, typename Ti>
-    Array<To> real(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<To>::getName(),
-                                                  shortname<To>(true),
-                                                  "__creal",
-                                                  in_node, af_real_t);
+template<typename To, typename Ti>
+Array<To> real(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__creal", in_node, af_real_t);
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-    template<typename To, typename Ti>
-    Array<To> imag(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<To>::getName(),
-                                                  shortname<To>(true),
-                                                  "__cimag",
-                                                  in_node, af_imag_t);
+template<typename To, typename Ti>
+Array<To> imag(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              "__cimag", in_node, af_imag_t);
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-    template<typename T> static const char *abs_name() { return "fabs"; }
-    template<> STATIC_ const char *abs_name<cfloat>() { return "__cabsf"; }
-    template<> STATIC_ const char *abs_name<cdouble>() { return "__cabs"; }
+template<typename T>
+static const char *abs_name() {
+    return "fabs";
+}
+template<>
+inline const char *abs_name<cfloat>() {
+    return "__cabsf";
+}
+template<>
+inline const char *abs_name<cdouble>() {
+    return "__cabs";
+}
 
-    template<typename To, typename Ti>
-    Array<To> abs(const Array<Ti> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<To>::getName(),
-                                                  shortname<To>(true),
-                                                  abs_name<Ti>(),
-                                                  in_node, af_abs_t);
+template<typename To, typename Ti>
+Array<To> abs(const Array<Ti> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<To>::af_type),
+                              abs_name<Ti>(), in_node, af_abs_t);
 
-        return createNodeArray<To>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<To>(in.dims(), common::Node_ptr(node));
+}
 
-    template<typename T> static const char *conj_name() { return "__noop"; }
-    template<> STATIC_ const char *conj_name<cfloat>() { return "__cconjf"; }
-    template<> STATIC_ const char *conj_name<cdouble>() { return "__cconj"; }
+template<typename T>
+static const char *conj_name() {
+    return "__noop";
+}
+template<>
+inline const char *conj_name<cfloat>() {
+    return "__cconjf";
+}
+template<>
+inline const char *conj_name<cdouble>() {
+    return "__cconj";
+}
 
-    template<typename T>
-    Array<T> conj(const Array<T> &in)
-    {
-        JIT::Node_ptr in_node = in.getNode();
-        JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<T>::getName(),
-                                                  shortname<T>(true),
-                                                  conj_name<T>(),
-                                                  in_node, af_conj_t);
+template<typename T>
+Array<T> conj(const Array<T> &in) {
+    common::Node_ptr in_node = in.getNode();
+    common::UnaryNode *node =
+        new common::UnaryNode(static_cast<af::dtype>(dtype_traits<T>::af_type),
+                              conj_name<T>(), in_node, af_conj_t);
 
-        return createNodeArray<T>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
-    }
+    return createNodeArray<T>(in.dims(), common::Node_ptr(node));
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/convolve.cpp b/src/backend/opencl/convolve.cpp
index ab66773210..34aa93b642 100644
--- a/src/backend/opencl/convolve.cpp
+++ b/src/backend/opencl/convolve.cpp
@@ -7,75 +7,248 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <blas.hpp>
+#include <common/half.hpp>
+#include <common/indexing_helpers.hpp>
+#include <common/moddims.hpp>
 #include <convolve.hpp>
-#include <kernel/convolve.hpp>
 #include <err_opencl.hpp>
+#include <kernel/convolve.hpp>
+#include <reorder.hpp>
+#include <transpose.hpp>
+#include <unwrap.hpp>
+#include <wrap.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <vector>
 
 using af::dim4;
+using arrayfire::common::flip;
+using arrayfire::common::half;
+using arrayfire::common::modDims;
+using std::vector;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind)
-{
-    const dim4 sDims    = signal.dims();
-    const dim4 fDims    = filter.dims();
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand) {
+    const dim4 &sDims = signal.dims();
+    const dim4 &fDims = filter.dims();
 
     dim4 oDims(1);
     if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sDims[d]+fDims[d]-1;
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
             } else {
-                oDims[d] = (d<baseDim ? sDims[d]+fDims[d]-1 : sDims[d]);
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
             }
         }
     } else {
         oDims = sDims;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fDims[i];
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
         }
     }
 
-    Array<T> out   = createEmptyArray<T>(oDims);
+    Array<T> out    = createEmptyArray<T>(oDims);
     bool callKernel = true;
 
     dim_t MCFL2 = kernel::MAX_CONV2_FILTER_LEN;
     dim_t MCFL3 = kernel::MAX_CONV3_FILTER_LEN;
-    switch(baseDim) {
-        case 1: if (fDims[0]>kernel::MAX_CONV1_FILTER_LEN) callKernel = false; break;
-        case 2: if ((fDims[0]*fDims[1]) > (MCFL2 * MCFL2)) callKernel = false; break;
-        case 3: if ((fDims[0]*fDims[1]*fDims[2]) > (MCFL3 * MCFL3 * MCFL3)) callKernel = false; break;
+    switch (rank) {
+        case 1:
+            if (fDims[0] > kernel::MAX_CONV1_FILTER_LEN) { callKernel = false; }
+            break;
+        case 2:
+            if ((fDims[0] * fDims[1]) > (MCFL2 * MCFL2)) { callKernel = false; }
+            break;
+        case 3:
+            if ((fDims[0] * fDims[1] * fDims[2]) > (MCFL3 * MCFL3 * MCFL3)) {
+                callKernel = false;
+            }
+            break;
+        default: AF_ERROR("rank only supports values 1-3.", AF_ERR_UNKNOWN);
     }
 
-    if(!callKernel) { OPENCL_NOT_SUPPORTED(); }
+    if (!callKernel) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nOpenCL N Dimensional Convolution doesn't support "
+                 "%llux%llux%llu kernel\n",
+                 fDims[0], fDims[1], fDims[2]);
+        OPENCL_NOT_SUPPORTED(errMessage);
+    }
 
-    kernel::convolve_nd<T, accT, baseDim, expand>(out, signal, filter, kind);
+    kernel::convolve_nd<T, accT>(out, signal, filter, kind, rank, expand);
 
     return out;
 }
 
-#define INSTANTIATE(T, accT)                                            \
-    template Array<T> convolve <T, accT, 1, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 1, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 2, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, true >(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
-    template Array<T> convolve <T, accT, 3, false>(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind); \
+#define INSTANTIATE(T, accT)                                                   \
+    template Array<T> convolve<T, accT>(Array<T> const &, Array<accT> const &, \
+                                        AF_BATCH_KIND, const int, const bool);
 
 INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> convolve2_unwrap(const Array<T> &signal, const Array<T> &filter,
+                          const dim4 &stride, const dim4 &padding,
+                          const dim4 &dilation) {
+    dim4 sDims = signal.dims();
+    dim4 fDims = filter.dims();
+
+    dim_t outputWidth =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t outputHeight =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(signal, fDims[0], fDims[1], stride[0], stride[1], padding[0],
+               padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsedFilter = filter;
+
+    collapsedFilter = flip(collapsedFilter, {1, 1, 0, 0});
+    collapsedFilter = modDims(collapsedFilter,
+                              dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsedFilter, AF_MAT_TRANS, AF_MAT_NONE);
+    res = modDims(res, dim4(outputWidth, outputHeight, signal.dims()[3],
+                            collapsedFilter.dims()[1]));
+    Array<T> out = reorder(res, dim4(0, 1, 3, 2));
+
+    return out;
+}
+
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation) {
+    Array<T> out =
+        convolve2_unwrap<T>(signal, filter, stride, padding, dilation);
+
+    return out;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> convolve2<T>(Array<T> const &signal,                    \
+                                   Array<T> const &filter, const dim4 stride, \
+                                   const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> & /*convolved_output*/,
+                           af::dim4 stride, af::dim4 padding,
+                           af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &sDims = original_signal.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    Array<T> collapsed_filter = original_filter;
 
+    collapsed_filter = flip(collapsed_filter, {1, 1, 0, 0});
+    collapsed_filter = modDims(collapsed_filter,
+                               dim4(fDims[0] * fDims[1] * fDims[2], fDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(collapsed_gradient, collapsed_filter, AF_MAT_NONE, AF_MAT_TRANS);
+    res = modDims(res, dim4(res.dims()[0] / sDims[3], sDims[3],
+                            fDims[0] * fDims[1], sDims[2]));
+    res = reorder(res, dim4(0, 2, 3, 1));
+
+    const bool retCols = false;
+    res = wrap_dilated(res, sDims[0], sDims[1], fDims[0], fDims[1], stride[0],
+                       stride[1], padding[0], padding[1], dilation[0],
+                       dilation[1], retCols);
+
+    return res;
 }
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> & /*convolved_output*/,
+                             af::dim4 stride, af::dim4 padding,
+                             af::dim4 dilation) {
+    const dim4 &cDims = incoming_gradient.dims();
+    const dim4 &fDims = original_filter.dims();
+
+    const bool retCols = false;
+    Array<T> unwrapped =
+        unwrap(original_signal, fDims[0], fDims[1], stride[0], stride[1],
+               padding[0], padding[1], dilation[0], dilation[1], retCols);
+
+    unwrapped  = reorder(unwrapped, dim4(1, 2, 0, 3));
+    dim4 uDims = unwrapped.dims();
+    unwrapped =
+        modDims(unwrapped, dim4(uDims[0] * uDims[1], uDims[2] * uDims[3]));
+
+    Array<T> collapsed_gradient = incoming_gradient;
+    collapsed_gradient          = reorder(collapsed_gradient, dim4(0, 1, 3, 2));
+    collapsed_gradient          = modDims(
+        collapsed_gradient, dim4(cDims[0] * cDims[1] * cDims[3], cDims[2]));
+
+    Array<T> res =
+        matmul(unwrapped, collapsed_gradient, AF_MAT_NONE, AF_MAT_NONE);
+    res = modDims(res, dim4(fDims[0], fDims[1], fDims[2], fDims[3]));
+
+    auto out = flip(res, {1, 1, 0, 0});
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> conv2DataGradient<T>(                                 \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);        \
+    template Array<T> conv2FilterGradient<T>(                               \
+        Array<T> const &incoming_gradient, Array<T> const &original_signal, \
+        Array<T> const &original_filter, Array<T> const &convolved_output,  \
+        const dim4 stride, const dim4 padding, const dim4 dilation);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/convolve.hpp b/src/backend/opencl/convolve.hpp
index a4eace6c5f..0cf040c417 100644
--- a/src/backend/opencl/convolve.hpp
+++ b/src/backend/opencl/convolve.hpp
@@ -8,15 +8,34 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, typename accT, dim_t baseDim, bool expand>
-Array<T> convolve(Array<T> const& signal, Array<accT> const& filter, ConvolveBatchKind kind);
+template<typename T, typename accT>
+Array<T> convolve(Array<T> const &signal, Array<accT> const &filter,
+                  AF_BATCH_KIND kind, const int rank, const bool expand);
 
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const &signal, Array<accT> const &c_filter,
+                   Array<accT> const &r_filter, const bool expand);
 
-}
+template<typename T>
+Array<T> convolve2(Array<T> const &signal, Array<T> const &filter,
+                   const dim4 stride, const dim4 padding, const dim4 dilation);
+
+template<typename T>
+Array<T> conv2DataGradient(const Array<T> &incoming_gradient,
+                           const Array<T> &original_signal,
+                           const Array<T> &original_filter,
+                           const Array<T> &convolved_output, af::dim4 stride,
+                           af::dim4 padding, af::dim4 dilation);
+
+template<typename T>
+Array<T> conv2FilterGradient(const Array<T> &incoming_gradient,
+                             const Array<T> &original_signal,
+                             const Array<T> &original_filter,
+                             const Array<T> &convolved_output, af::dim4 stride,
+                             af::dim4 padding, af::dim4 dilation);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/convolve_separable.cpp b/src/backend/opencl/convolve_separable.cpp
index fa2e8f0c9d..41b88b6ba8 100644
--- a/src/backend/opencl/convolve_separable.cpp
+++ b/src/backend/opencl/convolve_separable.cpp
@@ -7,72 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <convolve.hpp>
-#include <kernel/convolve_separable.hpp>
+
+#include <Array.hpp>
 #include <err_opencl.hpp>
+#include <kernel/convolve_separable.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, typename accT, dim_t cDim, bool expand>
-void conv2Helper(Array<T>& out, const Array<T>& sig, const Array<accT>& filt, dim_t f)
-{
-    switch(f) {
-        case  2: kernel::convolve2<T, accT, cDim, expand,  2>(out, sig, filt); break;
-        case  3: kernel::convolve2<T, accT, cDim, expand,  3>(out, sig, filt); break;
-        case  4: kernel::convolve2<T, accT, cDim, expand,  4>(out, sig, filt); break;
-        case  5: kernel::convolve2<T, accT, cDim, expand,  5>(out, sig, filt); break;
-        case  6: kernel::convolve2<T, accT, cDim, expand,  6>(out, sig, filt); break;
-        case  7: kernel::convolve2<T, accT, cDim, expand,  7>(out, sig, filt); break;
-        case  8: kernel::convolve2<T, accT, cDim, expand,  8>(out, sig, filt); break;
-        case  9: kernel::convolve2<T, accT, cDim, expand,  9>(out, sig, filt); break;
-        case 10: kernel::convolve2<T, accT, cDim, expand, 10>(out, sig, filt); break;
-        case 11: kernel::convolve2<T, accT, cDim, expand, 11>(out, sig, filt); break;
-        case 12: kernel::convolve2<T, accT, cDim, expand, 12>(out, sig, filt); break;
-        case 13: kernel::convolve2<T, accT, cDim, expand, 13>(out, sig, filt); break;
-        case 14: kernel::convolve2<T, accT, cDim, expand, 14>(out, sig, filt); break;
-        case 15: kernel::convolve2<T, accT, cDim, expand, 15>(out, sig, filt); break;
-        case 16: kernel::convolve2<T, accT, cDim, expand, 16>(out, sig, filt); break;
-        case 17: kernel::convolve2<T, accT, cDim, expand, 17>(out, sig, filt); break;
-        case 18: kernel::convolve2<T, accT, cDim, expand, 18>(out, sig, filt); break;
-        case 19: kernel::convolve2<T, accT, cDim, expand, 19>(out, sig, filt); break;
-        case 20: kernel::convolve2<T, accT, cDim, expand, 20>(out, sig, filt); break;
-        case 21: kernel::convolve2<T, accT, cDim, expand, 21>(out, sig, filt); break;
-        case 22: kernel::convolve2<T, accT, cDim, expand, 22>(out, sig, filt); break;
-        case 23: kernel::convolve2<T, accT, cDim, expand, 23>(out, sig, filt); break;
-        case 24: kernel::convolve2<T, accT, cDim, expand, 24>(out, sig, filt); break;
-        case 25: kernel::convolve2<T, accT, cDim, expand, 25>(out, sig, filt); break;
-        case 26: kernel::convolve2<T, accT, cDim, expand, 26>(out, sig, filt); break;
-        case 27: kernel::convolve2<T, accT, cDim, expand, 27>(out, sig, filt); break;
-        case 28: kernel::convolve2<T, accT, cDim, expand, 28>(out, sig, filt); break;
-        case 29: kernel::convolve2<T, accT, cDim, expand, 29>(out, sig, filt); break;
-        case 30: kernel::convolve2<T, accT, cDim, expand, 30>(out, sig, filt); break;
-        case 31: kernel::convolve2<T, accT, cDim, expand, 31>(out, sig, filt); break;
-        default: OPENCL_NOT_SUPPORTED();
-    }
-}
-
-template<typename T, typename accT, bool expand>
-Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter)
-{
-    const dim_t cflen = (dim_t)c_filter.elements();
-    const dim_t rflen = (dim_t)r_filter.elements();
+template<typename T, typename accT>
+Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter,
+                   Array<accT> const& r_filter, const bool expand) {
+    const auto cflen = c_filter.elements();
+    const auto rflen = r_filter.elements();
 
     if ((cflen > kernel::MAX_SCONV_FILTER_LEN) ||
-            (rflen > kernel::MAX_SCONV_FILTER_LEN)) {
-        // call upon fft
-        OPENCL_NOT_SUPPORTED();
+        (rflen > kernel::MAX_SCONV_FILTER_LEN)) {
+        // TODO call upon fft
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nOpenCL Separable convolution doesn't support %llu(column) "
+                 "%llu(row) filters\n",
+                 cflen, rflen);
+        OPENCL_NOT_SUPPORTED(errMessage);
     }
 
-    const dim4 sDims = signal.dims();
-    dim4 tDims = sDims;
-    dim4 oDims = sDims;
+    const dim4& sDims = signal.dims();
+    dim4 tDims        = sDims;
+    dim4 oDims        = sDims;
 
     if (expand) {
         tDims[0] += cflen - 1;
@@ -80,26 +46,32 @@ Array<T> convolve2(Array<T> const& signal, Array<accT> const& c_filter, Array<ac
         oDims[1] += rflen - 1;
     }
 
-    Array<T> temp= createEmptyArray<T>(tDims);
-    Array<T> out = createEmptyArray<T>(oDims);
+    Array<T> temp = createEmptyArray<T>(tDims);
+    Array<T> out  = createEmptyArray<T>(oDims);
 
-    conv2Helper<T, accT, 0, expand>(temp, signal, c_filter, cflen);
-    conv2Helper<T, accT, 1, expand>( out,   temp, r_filter, rflen);
+    kernel::convSep<T, accT>(temp, signal, c_filter, 0, expand);
+    kernel::convSep<T, accT>(out, temp, r_filter, 1, expand);
 
     return out;
 }
 
-#define INSTANTIATE(T, accT)  \
-    template Array<T> convolve2<T, accT, true >(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);  \
-    template Array<T> convolve2<T, accT, false>(Array<T> const& signal, Array<accT> const& c_filter, Array<accT> const& r_filter);
+#define INSTANTIATE(T, accT)                                                  \
+    template Array<T> convolve2<T, accT>(Array<T> const&, Array<accT> const&, \
+                                         Array<accT> const&, const bool);
 
 INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(intl, float)
+INSTANTIATE(uintl, float)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
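Note on the size arithmetic in `convolve2` above: in expand mode the intermediate array grows along dimension 0 by `cflen - 1`, and the final output additionally grows along dimension 1 by `rflen - 1`. The following is a minimal host-side sketch of that bookkeeping, not the backend's code. `sepConvDims` and `SepConvDims` are hypothetical names, and the growth of the output along dimension 0 is an assumption, because that line falls outside the hunks shown.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Dims = std::array<std::int64_t, 4>;

// Hypothetical container for the two shapes used by a separable convolution.
struct SepConvDims {
    Dims temp;  // shape after the column-filter pass
    Dims out;   // shape after the row-filter pass
};

// Hypothetical helper mirroring the dimension updates in convolve2.
SepConvDims sepConvDims(const Dims& signal, std::int64_t cflen,
                        std::int64_t rflen, bool expand) {
    assert(cflen > 0 && rflen > 0);
    SepConvDims d{signal, signal};
    if (expand) {
        d.temp[0] += cflen - 1;  // shown in the patch
        d.out[0] += cflen - 1;   // assumption: not visible in the hunk
        d.out[1] += rflen - 1;   // shown in the patch
    }
    return d;
}
```

With `expand` set to false both shapes equal the signal shape, which matches the unconditional initialisation of `tDims` and `oDims` from `sDims`.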
diff --git a/src/backend/opencl/copy.cpp b/src/backend/opencl/copy.cpp
index 37b33df7c9..97d54d432c 100644
--- a/src/backend/opencl/copy.cpp
+++ b/src/backend/opencl/copy.cpp
@@ -6,196 +6,213 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <af/array.h>
-#include <af/defines.h>
-#include <Array.hpp>
 #include <copy.hpp>
 #include <kernel/memcopy.hpp>
+
+#include <Array.hpp>
+#include <common/complex.hpp>
+#include <common/half.hpp>
 #include <err_opencl.hpp>
 #include <math.hpp>
 
-namespace opencl
-{
-
-    template<typename T>
-    void copyData(T *data, const Array<T> &A)
-    {
-
-        // FIXME: Merge this with copyArray
-        A.eval();
+using arrayfire::common::half;
+using arrayfire::common::is_complex;
 
-        dim_t offset = 0;
-        cl::Buffer buf;
-        Array<T> out = A;
+namespace arrayfire {
+namespace opencl {
 
-        if (A.isOwner() || // No offsets, No strides
-            A.ndims() == 1 // Simple offset, no strides.
-            ) {
-            buf = *A.get();
-            offset = A.getOffset();
-        } else {
-            //FIXME: Think about implementing eval
-            out = copyArray(A);
-            buf = *out.get();
-            offset = 0;
-        }
-
-        //FIXME: Add checks
-        getQueue().enqueueReadBuffer(buf, CL_TRUE,
-                                     sizeof(T) * offset,
-                                     sizeof(T) * A.elements(),
-                                     data);
-        return;
+template<typename T>
+void copyData(T *data, const Array<T> &src) {
+    if (src.elements() > 0) {
+        Array<T> out = src.isReady() && src.isLinear() ? src : copyArray(src);
+        // out is now guaranteed linear
+        getQueue().enqueueReadBuffer(*out.get(), CL_TRUE,
+                                     sizeof(T) * out.getOffset(),
+                                     sizeof(T) * out.elements(), data);
     }
+}
 
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A)
-    {
-        Array<T> out = createEmptyArray<T>(A.dims());
-        dim_t offset = A.getOffset();
-
-        if (A.isLinear()) {
-            // FIXME: Add checks
-            getQueue().enqueueCopyBuffer(*A.get(), *out.get(),
-                                         sizeof(T) * offset, 0,
-                                         A.elements() * sizeof(T));
+template<typename T>
+Array<T> copyArray(const Array<T> &src) {
+    Array<T> out = createEmptyArray<T>(src.dims());
+    if (src.elements() > 0) {
+        if (src.isReady()) {
+            if (src.isLinear()) {
+                getQueue().enqueueCopyBuffer(
+                    *src.get(), *out.get(), src.getOffset() * sizeof(T), 0,
+                    src.elements() * sizeof(T), nullptr, nullptr);
+            } else {
+                kernel::memcopy<T>(*out.get(), out.strides(), *src.get(),
+                                   src.dims(), src.strides(), src.getOffset(),
+                                   src.ndims());
+            }
         } else {
-            kernel::memcopy<T>(*out.get(), out.strides().get(), *A.get(), A.dims().get(),
-                               A.strides().get(), offset, (uint)A.ndims());
+            Param info = {out.get(),
+                          {{src.dims().dims[0], src.dims().dims[1],
+                            src.dims().dims[2], src.dims().dims[3]},
+                           {out.strides().dims[0], out.strides().dims[1],
+                            out.strides().dims[2], out.strides().dims[3]},
+                           0}};
+            evalNodes(info, src.getNode().get());
         }
-        return out;
     }
+    return out;
+}
 
-    template<typename inType, typename outType>
-    Array<outType> padArray(Array<inType> const &in, dim4 const &dims, outType default_value, double factor)
-    {
-        Array<outType> ret = createEmptyArray<outType>(dims);
-
-        if (in.dims() == dims)
-            kernel::copy<inType, outType, true >(ret, in, in.ndims(), default_value, factor);
-        else
-            kernel::copy<inType, outType, false>(ret, in, in.ndims(), default_value, factor);
-        return ret;
+template<typename T>
+void multiply_inplace(Array<T> &src, double norm) {
+    if (src.elements() > 0) {
+        kernel::copy<T, T>(src, src, src.ndims(), scalar<T>(0), norm);
     }
+}
 
-    template<typename inType, typename outType>
-    struct copyWrapper {
-        void operator()(Array<outType> &out, Array<inType> const &in)
-        {
-            if (in.dims() == out.dims())
-                kernel::copy<inType, outType, true >(out, in, in.ndims(), scalar<outType>(0), 1);
-            else
-                kernel::copy<inType, outType, false>(out, in, in.ndims(), scalar<outType>(0), 1);
-        }
-    };
-
-    template<typename T>
-    struct copyWrapper<T, T> {
-        void operator()(Array<T> &out, Array<T> const &in)
-        {
-            if (out.isLinear() &&
-                in.isLinear() &&
-                out.elements() == in.elements())
-            {
-                dim_t in_offset = in.getOffset() * sizeof(T);
-                dim_t out_offset = out.getOffset() * sizeof(T);
-
-                getQueue().enqueueCopyBuffer(*in.get(), *out.get(),
-                                             in_offset, out_offset,
-                                             in.elements() * sizeof(T));
+template<typename inType, typename outType>
+struct copyWrapper {
+    void operator()(Array<outType> &dst, Array<inType> const &src) {
+        kernel::copy<inType, outType>(dst, src, dst.ndims(), scalar<outType>(0),
+                                      1.0);
+    }
+};
+
+template<typename T>
+struct copyWrapper<T, T> {
+    void operator()(Array<T> &dst, Array<T> const &src) {
+        if (src.elements() > 0) {
+            if (dst.dims() == src.dims()) {
+                if (src.isReady()) {
+                    if (dst.isLinear() && src.isLinear()) {
+                        getQueue().enqueueCopyBuffer(
+                            *src.get(), *dst.get(), src.getOffset() * sizeof(T),
+                            dst.getOffset() * sizeof(T),
+                            src.elements() * sizeof(T), nullptr, nullptr);
+                    } else {
+                        kernel::memcopy<T>(*dst.get(), dst.strides(),
+                                           *src.get(), src.dims(),
+                                           src.strides(), src.getOffset(),
+                                           src.ndims(), dst.getOffset());
+                    }
+                } else {
+                    Param info = {
+                        dst.get(),
+                        {{src.dims().dims[0], src.dims().dims[1],
+                          src.dims().dims[2], src.dims().dims[3]},
+                         {dst.strides().dims[0], dst.strides().dims[1],
+                          dst.strides().dims[2], dst.strides().dims[3]},
+                         dst.getOffset()}};
+                    evalNodes(info, src.getNode().get());
+                }
             } else {
-                if (in.dims() == out.dims())
-                    kernel::copy<T, T, true >(out, in, in.ndims(), scalar<T>(0), 1);
-                else
-                    kernel::copy<T, T, false>(out, in, in.ndims(), scalar<T>(0), 1);
+                // dst has more elements than src, so default has to be applied
+                kernel::copy<T, T>(dst, src, dst.ndims(), scalar<T>(0), 1.0);
             }
         }
-    };
-
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, Array<inType> const &in)
-    {
-        copyWrapper<inType, outType> copyFn;
-        copyFn(out, in);
     }
+};
+
+template<typename inType, typename outType>
+void copyArray(Array<outType> &dst, Array<inType> const &src) {
+    static_assert(!(is_complex<inType>::value && !is_complex<outType>::value),
+                  "Cannot copy from complex value to a non complex value");
+    copyWrapper<inType, outType> copyFn;
+    copyFn(dst, src);
+}
 
-#define INSTANTIATE(T)                                              \
-    template void      copyData<T> (T *data, const Array<T> &from); \
-    template Array<T>  copyArray<T>(const Array<T> &A);             \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-
-    #define INSTANTIATE_PAD_ARRAY(SRC_T)                                    \
-    template Array<float  > padArray<SRC_T, float  >(Array<SRC_T> const &src, dim4 const &dims, float   default_value, double factor); \
-    template Array<double > padArray<SRC_T, double >(Array<SRC_T> const &src, dim4 const &dims, double  default_value, double factor); \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template Array<int    > padArray<SRC_T, int    >(Array<SRC_T> const &src, dim4 const &dims, int     default_value, double factor); \
-    template Array<uint   > padArray<SRC_T, uint   >(Array<SRC_T> const &src, dim4 const &dims, uint    default_value, double factor); \
-    template Array<intl    > padArray<SRC_T, intl    >(Array<SRC_T> const &src, dim4 const &dims, intl     default_value, double factor); \
-    template Array<uintl   > padArray<SRC_T, uintl   >(Array<SRC_T> const &src, dim4 const &dims, uintl    default_value, double factor); \
-    template Array<uchar  > padArray<SRC_T, uchar  >(Array<SRC_T> const &src, dim4 const &dims, uchar   default_value, double factor); \
-    template Array<char   > padArray<SRC_T, char   >(Array<SRC_T> const &src, dim4 const &dims, char    default_value, double factor); \
-    template void copyArray<SRC_T, float  >(Array<float  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, double >(Array<double > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cfloat >(Array<cfloat > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cdouble>(Array<cdouble> &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, int    >(Array<int    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uint   >(Array<uint   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, intl    >(Array<intl    > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uintl   >(Array<uintl   > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, uchar  >(Array<uchar  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, char   >(Array<char   > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY(float )
-    INSTANTIATE_PAD_ARRAY(double)
-    INSTANTIATE_PAD_ARRAY(int   )
-    INSTANTIATE_PAD_ARRAY(uint  )
-    INSTANTIATE_PAD_ARRAY(intl   )
-    INSTANTIATE_PAD_ARRAY(uintl  )
-    INSTANTIATE_PAD_ARRAY(uchar )
-    INSTANTIATE_PAD_ARRAY(char  )
-
-#define INSTANTIATE_PAD_ARRAY_COMPLEX(SRC_T)                            \
-    template Array<cfloat > padArray<SRC_T, cfloat >(Array<SRC_T> const &src, dim4 const &dims, cfloat  default_value, double factor); \
-    template Array<cdouble> padArray<SRC_T, cdouble>(Array<SRC_T> const &src, dim4 const &dims, cdouble default_value, double factor); \
-    template void copyArray<SRC_T, cfloat  >(Array<cfloat  > &dst, Array<SRC_T> const &src); \
-    template void copyArray<SRC_T, cdouble   >(Array<cdouble > &dst, Array<SRC_T> const &src);
-
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cfloat )
-    INSTANTIATE_PAD_ARRAY_COMPLEX(cdouble)
-
-#define SPECILIAZE_UNUSED_COPYARRAY(SRC_T, DST_T) \
-    template<> void copyArray<SRC_T, DST_T>(Array<DST_T> &out, Array<SRC_T> const &in) \
-    {\
-        OPENCL_NOT_SUPPORTED();\
-    }
-
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cfloat, uintl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, double)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, float)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uchar)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, char)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uint)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, int)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, intl)
-    SPECILIAZE_UNUSED_COPYARRAY(cdouble, uintl)
-
+#define INSTANTIATE(T)                                        \
+    template void copyData<T>(T * data, const Array<T> &src); \
+    template Array<T> copyArray<T>(const Array<T> &src);      \
+    template void multiply_inplace<T>(Array<T> & src, double norm);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COPY_ARRAY(SRC_T)                                 \
+    template void copyArray<SRC_T, float>(Array<float> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, double>(Array<double> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,     \
+                                            Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, int>(Array<int> & dst,             \
+                                        Array<SRC_T> const &src);     \
+    template void copyArray<SRC_T, uint>(Array<uint> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, intl>(Array<intl> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, uintl>(Array<uintl> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, short>(Array<short> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, ushort>(Array<ushort> & dst,       \
+                                           Array<SRC_T> const &src);  \
+    template void copyArray<SRC_T, schar>(Array<schar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, uchar>(Array<uchar> & dst,         \
+                                          Array<SRC_T> const &src);   \
+    template void copyArray<SRC_T, char>(Array<char> & dst,           \
+                                         Array<SRC_T> const &src);    \
+    template void copyArray<SRC_T, half>(Array<half> & dst,           \
+                                         Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY(float)
+INSTANTIATE_COPY_ARRAY(double)
+INSTANTIATE_COPY_ARRAY(int)
+INSTANTIATE_COPY_ARRAY(uint)
+INSTANTIATE_COPY_ARRAY(intl)
+INSTANTIATE_COPY_ARRAY(uintl)
+INSTANTIATE_COPY_ARRAY(schar)
+INSTANTIATE_COPY_ARRAY(uchar)
+INSTANTIATE_COPY_ARRAY(char)
+INSTANTIATE_COPY_ARRAY(short)
+INSTANTIATE_COPY_ARRAY(ushort)
+INSTANTIATE_COPY_ARRAY(half)
+
+#define INSTANTIATE_COPY_ARRAY_COMPLEX(SRC_T)                        \
+    template void copyArray<SRC_T, cfloat>(Array<cfloat> & dst,      \
+                                           Array<SRC_T> const &src); \
+    template void copyArray<SRC_T, cdouble>(Array<cdouble> & dst,    \
+                                            Array<SRC_T> const &src);
+
+INSTANTIATE_COPY_ARRAY_COMPLEX(cfloat)
+INSTANTIATE_COPY_ARRAY_COMPLEX(cdouble)
+
+template<typename T>
+T getScalar(const Array<T> &src) {
+    T retVal{};
+    getQueue().enqueueReadBuffer(
+        *src.get(), CL_TRUE, sizeof(T) * src.getOffset(), sizeof(T), &retVal);
+    return retVal;
 }
+
+#define INSTANTIATE_GETSCALAR(T) template T getScalar(const Array<T> &in);
+
+INSTANTIATE_GETSCALAR(float)
+INSTANTIATE_GETSCALAR(double)
+INSTANTIATE_GETSCALAR(cfloat)
+INSTANTIATE_GETSCALAR(cdouble)
+INSTANTIATE_GETSCALAR(int)
+INSTANTIATE_GETSCALAR(uint)
+INSTANTIATE_GETSCALAR(schar)
+INSTANTIATE_GETSCALAR(uchar)
+INSTANTIATE_GETSCALAR(char)
+INSTANTIATE_GETSCALAR(intl)
+INSTANTIATE_GETSCALAR(uintl)
+INSTANTIATE_GETSCALAR(short)
+INSTANTIATE_GETSCALAR(ushort)
+INSTANTIATE_GETSCALAR(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
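Note on the two ready-array paths in `copyArray` above: a linear source is copied with a single bulk buffer copy, while a non-linear source is gathered element by element using its strides and offset. The sketch below illustrates the same split on plain host memory. It is only an illustration under stated assumptions: `isLinear` and `copyToLinear` are hypothetical helpers, the linearity test is a simplified column-major check rather than the library's own, and nothing here touches OpenCL.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Dims = std::array<std::size_t, 4>;

// Simplified check: strides describe a densely packed column-major layout.
bool isLinear(const Dims& dims, const Dims& strides) {
    std::size_t expected = 1;
    for (int i = 0; i < 4; ++i) {
        if (strides[i] != expected) { return false; }
        expected *= dims[i];
    }
    return true;
}

// Copies a (possibly strided) view of buf into a densely packed vector.
template<typename T>
std::vector<T> copyToLinear(const std::vector<T>& buf, const Dims& dims,
                            const Dims& strides, std::size_t offset) {
    const std::size_t n = dims[0] * dims[1] * dims[2] * dims[3];
    std::vector<T> out(n);
    if (n == 0) { return out; }

    if (isLinear(dims, strides)) {
        // Bulk path: one contiguous copy starting at the view's offset.
        assert(offset + n <= buf.size());
        const auto first = buf.begin() + static_cast<std::ptrdiff_t>(offset);
        std::copy(first, first + static_cast<std::ptrdiff_t>(n), out.begin());
        return out;
    }

    // Strided path: gather each element through the source strides.
    std::size_t o = 0;
    for (std::size_t l = 0; l < dims[3]; ++l)
        for (std::size_t k = 0; k < dims[2]; ++k)
            for (std::size_t j = 0; j < dims[1]; ++j)
                for (std::size_t i = 0; i < dims[0]; ++i) {
                    const std::size_t idx = offset + i * strides[0] +
                                            j * strides[1] + k * strides[2] +
                                            l * strides[3];
                    assert(idx < buf.size());
                    out[o++] = buf[idx];
                }
    return out;
}
```

For example, a 2x2 view starting at element 5 of a 4x4 column-major buffer has strides `{1, 4, 16, 16}` and takes the strided path, whereas a single full column takes the bulk path.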
diff --git a/src/backend/opencl/copy.hpp b/src/backend/opencl/copy.hpp
index 3ad0d0aa98..1b8576a5d9 100644
--- a/src/backend/opencl/copy.hpp
+++ b/src/backend/opencl/copy.hpp
@@ -8,22 +8,62 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
+#include <kernel/pad_array_borders.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void copyData(T *data, const Array<T> &A);
 
-    template<typename T>
-    void copyData(T *data, const Array<T> &A);
+template<typename T>
+Array<T> copyArray(const Array<T> &A);
 
-    template<typename T>
-    Array<T> copyArray(const Array<T> &A);
+template<typename inType, typename outType>
+void copyArray(Array<outType> &out, const Array<inType> &in);
 
-    template<typename inType, typename outType>
-    void copyArray(Array<outType> &out, const Array<inType> &in);
+// Resize Array to target dimensions and convert type
+//
+// Depending on the \p outDims, the output Array can be either truncated
+// or padded (towards end of respective dimensions).
+//
+// While copying during the resize, if output dimensions are larger than the
+// input, elements beyond the input dimensions are set to the \p defaultValue.
+//
+// \param[in] in is input Array
+// \param[in] outDims is the target output dimensions
+// \param[in] defaultValue is the value to which padded locations are set.
+// \param[in] scale is the value by which all output elements are scaled.
+//
+// \returns Array<outType>
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue = outType(0), double scale = 1.0);
 
-    template<typename inType, typename outType>
-    Array<outType> padArray(Array<inType> const &in, dim4 const &dims,
-                            outType default_value, double factor=1.0);
+template<typename T>
+Array<T> padArrayBorders(Array<T> const &in, dim4 const &lowerBoundPadding,
+                         dim4 const &upperBoundPadding,
+                         const af::borderType btype) {
+    auto iDims = in.dims();
+
+    dim4 oDims(lowerBoundPadding[0] + iDims[0] + upperBoundPadding[0],
+               lowerBoundPadding[1] + iDims[1] + upperBoundPadding[1],
+               lowerBoundPadding[2] + iDims[2] + upperBoundPadding[2],
+               lowerBoundPadding[3] + iDims[3] + upperBoundPadding[3]);
+
+    if (oDims == iDims) { return in; }
+
+    auto ret = createEmptyArray<T>(oDims);
+
+    kernel::padBorders<T>(ret, in, lowerBoundPadding, btype);
+
+    return ret;
 }
+
+template<typename T>
+void multiply_inplace(Array<T> &in, double val);
+
+template<typename T>
+T getScalar(const Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
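Note on `padArrayBorders` above: each output dimension is the sum of the lower padding, the input extent, and the upper padding, and the input is placed at the lower-padding offset. The sketch below shows this for a 2D column-major matrix on the host. It is an illustration only: `padZero2D` is a hypothetical helper, and it assumes a zero-fill border, which is just one of the behaviours the `btype` argument may select.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: zero-pads a column-major rows x cols matrix.
// lo0/hi0 pad dimension 0 (rows); lo1/hi1 pad dimension 1 (columns).
std::vector<int> padZero2D(const std::vector<int>& in, std::size_t rows,
                           std::size_t cols, std::size_t lo0, std::size_t lo1,
                           std::size_t hi0, std::size_t hi1) {
    assert(in.size() == rows * cols);
    const std::size_t oRows = lo0 + rows + hi0;
    const std::size_t oCols = lo1 + cols + hi1;

    // Every padded location starts at zero (assumed border behaviour).
    std::vector<int> out(oRows * oCols, 0);

    // Copy the input into the interior, shifted by the lower padding.
    for (std::size_t j = 0; j < cols; ++j) {
        for (std::size_t i = 0; i < rows; ++i) {
            out[(j + lo1) * oRows + (i + lo0)] = in[j * rows + i];
        }
    }
    return out;
}
```

When all four padding amounts are zero the output has the same shape and contents as the input, which corresponds to the early `return in;` in the patched function.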
diff --git a/src/backend/opencl/count.cpp b/src/backend/opencl/count.cpp
index f8ef0b853d..fe1b588f89 100644
--- a/src/backend/opencl/count.cpp
+++ b/src/backend/opencl/count.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    // count
-    INSTANTIATE(af_notzero_t, float  , uint)
-    INSTANTIATE(af_notzero_t, double , uint)
-    INSTANTIATE(af_notzero_t, cfloat , uint)
-    INSTANTIATE(af_notzero_t, cdouble, uint)
-    INSTANTIATE(af_notzero_t, int    , uint)
-    INSTANTIATE(af_notzero_t, uint   , uint)
-    INSTANTIATE(af_notzero_t, char   , uint)
-    INSTANTIATE(af_notzero_t, uchar  , uint)
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// count
+INSTANTIATE(af_notzero_t, float, uint)
+INSTANTIATE(af_notzero_t, double, uint)
+INSTANTIATE(af_notzero_t, cfloat, uint)
+INSTANTIATE(af_notzero_t, cdouble, uint)
+INSTANTIATE(af_notzero_t, int, uint)
+INSTANTIATE(af_notzero_t, uint, uint)
+INSTANTIATE(af_notzero_t, intl, uint)
+INSTANTIATE(af_notzero_t, uintl, uint)
+INSTANTIATE(af_notzero_t, char, uint)
+INSTANTIATE(af_notzero_t, schar, uint)
+INSTANTIATE(af_notzero_t, uchar, uint)
+INSTANTIATE(af_notzero_t, short, uint)
+INSTANTIATE(af_notzero_t, ushort, uint)
+INSTANTIATE(af_notzero_t, half, uint)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_blas.cpp b/src/backend/opencl/cpu/cpu_blas.cpp
new file mode 100644
index 0000000000..8fbef46443
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_blas.cpp
@@ -0,0 +1,251 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <common/blas_headers.hpp>
+#include <common/complex.hpp>
+#include <common/err_common.hpp>
+#include <cpu/cpu_blas.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+using arrayfire::common::is_complex;
+
+using std::add_const;
+using std::add_pointer;
+using std::conditional;
+using std::enable_if;
+using std::is_floating_point;
+using std::remove_const;
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+// Some implementations of BLAS require void* for complex pointers while others
+// use float*/double*
+//
+// Sample cgemm API
+// OpenBLAS
+// void cblas_cgemm(OPENBLAS_CONST enum CBLAS_ORDER Order, OPENBLAS_CONST enum
+// CBLAS_TRANSPOSE TransA, OPENBLAS_CONST enum CBLAS_TRANSPOSE TransB,
+//                  OPENBLAS_CONST blasint M, OPENBLAS_CONST blasint N,
+//                  OPENBLAS_CONST blasint K, OPENBLAS_CONST float *alpha,
+//                  OPENBLAS_CONST float *A, OPENBLAS_CONST blasint lda,
+//                  OPENBLAS_CONST float *B, OPENBLAS_CONST blasint ldb,
+//                  OPENBLAS_CONST float *beta, float *C, OPENBLAS_CONST blasint
+//                  ldc);
+//
+// MKL
+// void cblas_cgemm(const  CBLAS_LAYOUT Layout, const CBLAS_TRANSPOSE TransA,
+// const  CBLAS_TRANSPOSE TransB,
+//                  const MKL_INT M, const MKL_INT N, const MKL_INT K,
+//                  const void *alpha, const void *A, const MKL_INT lda,
+//                  const void *B, const MKL_INT ldb, const void *beta,
+//                  void *C, const MKL_INT ldc);
+// atlas cblas
+// void cblas_cgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE
+// TransA,
+//                  const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
+//                  const int K, const void *alpha, const void *A, const int
+//                  lda, const void *B, const int ldb, const void *beta, void
+//                  *C, const int ldc);
+//
+// LAPACKE
+// void cblas_cgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE
+// TransA,
+//                  const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
+//                  const int K, const void *alpha, const void *A, const int
+//                  lda, const void *B, const int ldb, const void *beta, void
+//                  *C, const int ldc);
+#if defined(IS_OPENBLAS)
+static const bool cplx_void_ptr = false;
+#else
+static const bool cplx_void_ptr = true;
+#endif
+
+template<typename T, class Enable = void>
+struct blas_base {
+    using type = typename dtype_traits<T>::base_type;
+};
+
+template<typename T>
+struct blas_base<
+    T, typename enable_if<is_complex<T>::value && cplx_void_ptr>::type> {
+    using type = void;
+};
+
+template<typename T>
+using cptr_type =
+    typename conditional<is_complex<T>::value,
+                         const typename blas_base<T>::type *, const T *>::type;
+template<typename T>
+using ptr_type = typename conditional<is_complex<T>::value,
+                                      typename blas_base<T>::type *, T *>::type;
+template<typename T>
+using scale_type =
+    typename conditional<is_complex<T>::value,
+                         const typename blas_base<T>::type *, const T>::type;
+
+template<typename T>
+scale_type<T> getOneScalar(const T *const vals) {
+    return vals[0];
+}
+
+template<>
+scale_type<cfloat> getOneScalar(const cfloat *const vals) {
+    return reinterpret_cast<scale_type<cfloat>>(vals);
+}
+
+template<>
+scale_type<cdouble> getOneScalar(const cdouble *const vals) {
+    return reinterpret_cast<scale_type<cdouble>>(vals);
+}
+
+template<typename T>
+using gemm_func_def = void (*)(const CBLAS_ORDER, const CBLAS_TRANSPOSE,
+                               const CBLAS_TRANSPOSE, const blasint,
+                               const blasint, const blasint, scale_type<T>,
+                               cptr_type<T>, const blasint, cptr_type<T>,
+                               const blasint, scale_type<T>, ptr_type<T>,
+                               const blasint);
+
+template<typename T>
+using gemv_func_def = void (*)(const CBLAS_ORDER, const CBLAS_TRANSPOSE,
+                               const blasint, const blasint, scale_type<T>,
+                               cptr_type<T>, const blasint, cptr_type<T>,
+                               const blasint, scale_type<T>, ptr_type<T>,
+                               const blasint);
+
+#define BLAS_FUNC_DEF(FUNC) \
+    template<typename T>    \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define BLAS_FUNC(FUNC, TYPE, PREFIX)                        \
+    template<>                                               \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() {              \
+        return (FUNC##_func_def<TYPE>)&cblas_##PREFIX##FUNC; \
+    }
+
+BLAS_FUNC_DEF(gemm)
+BLAS_FUNC(gemm, float, s)
+BLAS_FUNC(gemm, double, d)
+BLAS_FUNC(gemm, cfloat, c)
+BLAS_FUNC(gemm, cdouble, z)
+
+BLAS_FUNC_DEF(gemv)
+BLAS_FUNC(gemv, float, s)
+BLAS_FUNC(gemv, double, d)
+BLAS_FUNC(gemv, cfloat, c)
+BLAS_FUNC(gemv, cdouble, z)
+
+template<typename T, int value>
+typename enable_if<is_floating_point<T>::value, scale_type<T>>::type
+getScale() {
+    return T(value);
+}
+
+template<typename T, int value>
+typename enable_if<is_complex<T>::value, scale_type<T>>::type getScale() {
+    thread_local T val = scalar<T>(value);
+    return (const typename blas_base<T>::type *)&val;
+}
+
+CBLAS_TRANSPOSE
+toCblasTranspose(af_mat_prop opt) {
+    CBLAS_TRANSPOSE out = CblasNoTrans;
+    switch (opt) {
+        case AF_MAT_NONE: out = CblasNoTrans; break;
+        case AF_MAT_TRANS: out = CblasTrans; break;
+        case AF_MAT_CTRANS: out = CblasConjTrans; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+    return out;
+}
+
+template<typename T>
+void gemm(Array<T> &out, af_mat_prop optLhs, af_mat_prop optRhs, const T *alpha,
+          const Array<T> &lhs, const Array<T> &rhs, const T *beta) {
+    using BT  = typename blas_base<T>::type;
+    using CBT = const typename blas_base<T>::type;
+
+    const CBLAS_TRANSPOSE lOpts = toCblasTranspose(optLhs);
+    const CBLAS_TRANSPOSE rOpts = toCblasTranspose(optRhs);
+
+    const int aRowDim = (lOpts == CblasNoTrans) ? 0 : 1;
+    const int aColDim = (lOpts == CblasNoTrans) ? 1 : 0;
+    const int bColDim = (rOpts == CblasNoTrans) ? 1 : 0;
+
+    const dim4 &lDims = lhs.dims();
+    const dim4 &rDims = rhs.dims();
+    const int M       = lDims[aRowDim];
+    const int N       = rDims[bColDim];
+    const int K       = lDims[aColDim];
+    const dim4 &oDims = out.dims();
+
+    dim4 lStrides = lhs.strides();
+    dim4 rStrides = rhs.strides();
+    dim4 oStrides = out.strides();
+
+    int batchSize = oDims[2] * oDims[3];
+
+    bool is_l_d2_batched = (oDims[2] == lDims[2]);
+    bool is_l_d3_batched = (oDims[3] == lDims[3]);
+    bool is_r_d2_batched = (oDims[2] == rDims[2]);
+    bool is_r_d3_batched = (oDims[3] == rDims[3]);
+
+    // get host pointers from mapped memory
+    mapped_ptr<T> lPtr = lhs.getMappedPtr(CL_MAP_READ);
+    mapped_ptr<T> rPtr = rhs.getMappedPtr(CL_MAP_READ);
+    mapped_ptr<T> oPtr = out.getMappedPtr(CL_MAP_READ | CL_MAP_WRITE);
+
+    for (int n = 0; n < batchSize; ++n) {
+        int w = n / oDims[2];
+        int z = n - w * oDims[2];
+
+        int loff = z * (is_l_d2_batched * lStrides[2]) +
+                   w * (is_l_d3_batched * lStrides[3]);
+        int roff = z * (is_r_d2_batched * rStrides[2]) +
+                   w * (is_r_d3_batched * rStrides[3]);
+
+        CBT *lptr = reinterpret_cast<CBT *>(lPtr.get() + loff);
+        CBT *rptr = reinterpret_cast<CBT *>(rPtr.get() + roff);
+        BT *optr  = reinterpret_cast<BT *>(oPtr.get() + z * oStrides[2] +
+                                          w * oStrides[3]);
+
+        if (rDims[bColDim] == 1) {
+            dim_t incr = (rOpts == CblasNoTrans) ? rStrides[0] : rStrides[1];
+            gemv_func<T>()(CblasColMajor, lOpts, lDims[0], lDims[1],
+                           getOneScalar<T>(alpha), lptr, lStrides[1], rptr,
+                           incr, getOneScalar<T>(beta), optr, 1);
+        } else {
+            gemm_func<T>()(CblasColMajor, lOpts, rOpts, M, N, K,
+                           getOneScalar<T>(alpha), lptr, lStrides[1], rptr,
+                           rStrides[1], getOneScalar<T>(beta), optr,
+                           oStrides[1]);
+        }
+    }
+}
+
+#define INSTANTIATE_GEMM(TYPE)                                               \
+    template void gemm<TYPE>(Array<TYPE> & out, af_mat_prop optLhs,          \
+                             af_mat_prop optRhs, const TYPE *alpha,          \
+                             const Array<TYPE> &lhs, const Array<TYPE> &rhs, \
+                             const TYPE *beta);
+
+INSTANTIATE_GEMM(float)
+INSTANTIATE_GEMM(cfloat)
+INSTANTIATE_GEMM(double)
+INSTANTIATE_GEMM(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
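Note on the batch loop in `gemm` above: the flat batch index `n` is split into a dimension-2 index `z` and a dimension-3 index `w`, and an operand's stride along a batch dimension is only applied when its extent there matches the output's. Otherwise the same matrix is reused for every batch. The sketch below reproduces that offset arithmetic in isolation. `batchOffsets` and `BatchOffsets` are hypothetical names introduced for illustration, and no BLAS call is made.

```cpp
#include <array>
#include <cassert>

using Dims = std::array<long long, 4>;

// Element offsets of the matrices used by one batch iteration.
struct BatchOffsets {
    long long lhs;
    long long rhs;
    long long out;
};

// Hypothetical helper mirroring the index arithmetic of the batch loop.
BatchOffsets batchOffsets(long long n, const Dims& oDims, const Dims& oStrides,
                          const Dims& lDims, const Dims& lStrides,
                          const Dims& rDims, const Dims& rStrides) {
    assert(n >= 0 && n < oDims[2] * oDims[3]);

    const long long w = n / oDims[2];      // index along dimension 3
    const long long z = n - w * oDims[2];  // index along dimension 2

    // An operand is batched along a dimension only if its extent matches
    // the output's; otherwise it is broadcast with an effective stride of 0.
    const bool lD2 = (oDims[2] == lDims[2]);
    const bool lD3 = (oDims[3] == lDims[3]);
    const bool rD2 = (oDims[2] == rDims[2]);
    const bool rD3 = (oDims[3] == rDims[3]);

    BatchOffsets off{};
    off.lhs = z * (lD2 ? lStrides[2] : 0) + w * (lD3 ? lStrides[3] : 0);
    off.rhs = z * (rD2 ? rStrides[2] : 0) + w * (rD3 ? rStrides[3] : 0);
    off.out = z * oStrides[2] + w * oStrides[3];
    return off;
}
```

For a 2x2x3x2 left operand multiplied by a single 2x2 right operand, the right-hand offset stays at zero for every batch while the left-hand and output offsets advance together.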
diff --git a/src/backend/opencl/cpu/cpu_blas.hpp b/src/backend/opencl/cpu/cpu_blas.hpp
new file mode 100644
index 0000000000..ae44d0ea91
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_blas.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+void gemm(Array<T> &out, af_mat_prop optLhs, af_mat_prop optRhs, const T *alpha,
+          const Array<T> &lhs, const Array<T> &rhs, const T *beta);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_cholesky.cpp b/src/backend/opencl/cpu/cpu_cholesky.cpp
new file mode 100644
index 0000000000..8878c8adf2
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_cholesky.cpp
@@ -0,0 +1,86 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_cholesky.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_triangle.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+using potrf_func_def = int (*)(ORDER_TYPE, char, int, T *, int);
+
+#define CH_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define CH_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+CH_FUNC_DEF(potrf)
+CH_FUNC(potrf, float, s)
+CH_FUNC(potrf, double, d)
+CH_FUNC(potrf, cfloat, c)
+CH_FUNC(potrf, cdouble, z)
+
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper) {
+    Array<T> out = copyArray<T>(in);
+    *info        = cholesky_inplace(out, is_upper);
+
+    mapped_ptr<T> oPtr = out.getMappedPtr();
+
+    if (is_upper) {
+        triangle<T, true, false>(oPtr.get(), oPtr.get(), out.dims(),
+                                 out.strides(), out.strides());
+    } else {
+        triangle<T, false, false>(oPtr.get(), oPtr.get(), out.dims(),
+                                  out.strides(), out.strides());
+    }
+
+    return out;
+}
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper) {
+    dim4 iDims = in.dims();
+    int N      = iDims[0];
+
+    char uplo = 'L';
+    if (is_upper) { uplo = 'U'; }
+
+    mapped_ptr<T> inPtr = in.getMappedPtr();
+
+    int info = potrf_func<T>()(AF_LAPACK_COL_MAJOR, uplo, N, inPtr.get(),
+                               in.strides()[1]);
+
+    return info;
+}
+
+#define INSTANTIATE_CH(T)                                                 \
+    template int cholesky_inplace<T>(Array<T> & in, const bool is_upper); \
+    template Array<T> cholesky<T>(int *info, const Array<T> &in,          \
+                                  const bool is_upper);
+
+INSTANTIATE_CH(float)
+INSTANTIATE_CH(cfloat)
+INSTANTIATE_CH(double)
+INSTANTIATE_CH(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_cholesky.hpp b/src/backend/opencl/cpu/cpu_cholesky.hpp
new file mode 100644
index 0000000000..489221304c
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_cholesky.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T>
+Array<T> cholesky(int *info, const Array<T> &in, const bool is_upper);
+
+template<typename T>
+int cholesky_inplace(Array<T> &in, const bool is_upper);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_helper.hpp b/src/backend/opencl/cpu/cpu_helper.hpp
new file mode 100644
index 0000000000..0f979d1f90
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_helper.hpp
@@ -0,0 +1,46 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifndef AF_OPENCL_CPU
+#define AF_OPENCL_CPU
+
+#include <Array.hpp>
+#include <memory.hpp>
+#include <platform.hpp>
+#include <types.hpp>
+
+//********************************************************/
+// LAPACK
+//********************************************************/
+#if defined(WITH_LINEAR_ALGEBRA)
+
+#define lapack_complex_float arrayfire::opencl::cfloat
+#define lapack_complex_double arrayfire::opencl::cdouble
+#define LAPACK_PREFIX LAPACKE_
+#define ORDER_TYPE int
+#define AF_LAPACK_COL_MAJOR LAPACK_COL_MAJOR
+#define LAPACK_NAME(fn) LAPACKE_##fn
+
+#ifdef USE_MKL
+#include <mkl_lapack.h>
+#include <mkl_lapacke.h>
+#else
+#ifdef __APPLE__
+#include <Accelerate/Accelerate.h>
+#include <common/lapacke.hpp>
+#undef AF_LAPACK_COL_MAJOR
+#define AF_LAPACK_COL_MAJOR 0
+#else  // NETLIB LAPACKE
+#include <lapacke.h>
+#endif
+#endif
+
+#endif  // WITH_LINEAR_ALGEBRA
+
+#endif  // AF_OPENCL_CPU
diff --git a/src/backend/opencl/cpu/cpu_inverse.cpp b/src/backend/opencl/cpu/cpu_inverse.cpp
new file mode 100644
index 0000000000..b31e70b857
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_inverse.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_inverse.hpp>
+#include <cpu/cpu_lu.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+using getri_func_def = int (*)(ORDER_TYPE, int, T *, int, const int *);
+
+#define INV_FUNC_DEF(FUNC) \
+    template<typename T>   \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define INV_FUNC(FUNC, TYPE, PREFIX)            \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+INV_FUNC_DEF(getri)
+INV_FUNC(getri, float, s)
+INV_FUNC(getri, double, d)
+INV_FUNC(getri, cfloat, c)
+INV_FUNC(getri, cdouble, z)
+
+template<typename T>
+Array<T> inverse(const Array<T> &in) {
+    int M = in.dims()[0];
+    // int N = in.dims()[1];
+
+    // This condition is already handled in opencl/inverse.cpp
+    // if (M != N) {
+    // Array<T> I = identity<T>(in.dims());
+    // return solve(in, I);
+    //}
+
+    Array<T> A = copyArray<T>(in);
+
+    Array<int> pivot = cpu::lu_inplace<T>(A, false);
+
+    mapped_ptr<T> aPtr   = A.getMappedPtr();
+    mapped_ptr<int> pPtr = pivot.getMappedPtr();
+
+    getri_func<T>()(AF_LAPACK_COL_MAJOR, M, aPtr.get(), A.strides()[1],
+                    pPtr.get());
+
+    return A;
+}
+
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_inverse.hpp b/src/backend/opencl/cpu/cpu_inverse.hpp
new file mode 100644
index 0000000000..04ed32b7d4
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_inverse.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T>
+Array<T> inverse(const Array<T> &in);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_lu.cpp b/src/backend/opencl/cpu/cpu_lu.cpp
new file mode 100644
index 0000000000..a754535025
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_lu.cpp
@@ -0,0 +1,161 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_lu.hpp>
+#include <math.hpp>
+#include <range.hpp>
+
+#include <numeric>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+using getrf_func_def = int (*)(ORDER_TYPE, int, int, T *, int, int *);
+
+#define LU_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define LU_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+LU_FUNC_DEF(getrf)
+LU_FUNC(getrf, float, s)
+LU_FUNC(getrf, double, d)
+LU_FUNC(getrf, cfloat, c)
+LU_FUNC(getrf, cdouble, z)
+
+template<typename T>
+void lu_split(Array<T> &lower, Array<T> &upper, const Array<T> &in) {
+    mapped_ptr<T> ls = lower.getMappedPtr();
+    mapped_ptr<T> us = upper.getMappedPtr();
+    mapped_ptr<T> is = in.getMappedPtr(CL_MAP_READ);
+
+    T *l = ls.get();
+    T *u = us.get();
+    T *i = is.get();
+
+    dim4 ldm = lower.dims();
+    dim4 udm = upper.dims();
+    dim4 idm = in.dims();
+
+    dim4 lst = lower.strides();
+    dim4 ust = upper.strides();
+    dim4 ist = in.strides();
+
+    for (dim_t ow = 0; ow < idm[3]; ow++) {
+        const dim_t lW = ow * lst[3];
+        const dim_t uW = ow * ust[3];
+        const dim_t iW = ow * ist[3];
+
+        for (dim_t oz = 0; oz < idm[2]; oz++) {
+            const dim_t lZW = lW + oz * lst[2];
+            const dim_t uZW = uW + oz * ust[2];
+            const dim_t iZW = iW + oz * ist[2];
+
+            for (dim_t oy = 0; oy < idm[1]; oy++) {
+                const dim_t lYZW = lZW + oy * lst[1];
+                const dim_t uYZW = uZW + oy * ust[1];
+                const dim_t iYZW = iZW + oy * ist[1];
+
+                for (dim_t ox = 0; ox < idm[0]; ox++) {
+                    const dim_t lMem = lYZW + ox;
+                    const dim_t uMem = uYZW + ox;
+                    const dim_t iMem = iYZW + ox;
+                    if (ox > oy) {
+                        if (oy < ldm[1]) { l[lMem] = i[iMem]; }
+                        if (ox < udm[0]) { u[uMem] = scalar<T>(0); }
+                    } else if (oy > ox) {
+                        if (oy < ldm[1]) { l[lMem] = scalar<T>(0); }
+                        if (ox < udm[0]) { u[uMem] = i[iMem]; }
+                    } else if (ox == oy) {
+                        if (oy < ldm[1]) { l[lMem] = scalar<T>(1.0); }
+                        if (ox < udm[0]) { u[uMem] = i[iMem]; }
+                    }
+                }
+            }
+        }
+    }
+}
+
+void convertPivot(int *pivot, int out_sz, size_t pivot_dim) {
+    std::vector<int> p(out_sz);
+    std::iota(std::begin(p), std::end(p), 0);
+
+    for (int j = 0; j < static_cast<int>(pivot_dim); j++) {
+        // LAPACK pivot indices are 1-based
+        std::swap(p[j], p[pivot[j] - 1]);
+    }
+
+    std::copy(std::begin(p), std::end(p), pivot);
+}
+
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> in_copy = copyArray<T>(in);
+    pivot            = lu_inplace(in_copy);
+
+    // Split the packed LU result into separate lower and upper factors
+    dim4 ldims(M, min(M, N));
+    dim4 udims(min(M, N), N);
+    lower = createEmptyArray<T>(ldims);
+    upper = createEmptyArray<T>(udims);
+
+    lu_split<T>(lower, upper, in_copy);
+}
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    int pivot_dim    = min(M, N);
+    Array<int> pivot = createEmptyArray<int>(af::dim4(pivot_dim, 1, 1, 1));
+    if (convert_pivot) { pivot = range<int>(af::dim4(M, 1, 1, 1)); }
+
+    mapped_ptr<T> inPtr   = in.getMappedPtr();
+    mapped_ptr<int> piPtr = pivot.getMappedPtr();
+
+    getrf_func<T>()(AF_LAPACK_COL_MAJOR, M, N, inPtr.get(), in.strides()[1],
+                    piPtr.get());
+
+    if (convert_pivot) { convertPivot(piPtr.get(), M, min(M, N)); }
+
+    return pivot;
+}
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> & in,             \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> & lower, Array<T> & upper,      \
+                        Array<int> & pivot, const Array<T> &in);
+
+INSTANTIATE_LU(float)
+INSTANTIATE_LU(cfloat)
+INSTANTIATE_LU(double)
+INSTANTIATE_LU(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_lu.hpp b/src/backend/opencl/cpu/cpu_lu.hpp
new file mode 100644
index 0000000000..936add16e3
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_lu.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in);
+
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_qr.cpp b/src/backend/opencl/cpu/cpu_qr.cpp
new file mode 100644
index 0000000000..1e1b926d0f
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_qr.cpp
@@ -0,0 +1,120 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_qr.hpp>
+#include <cpu/cpu_triangle.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+using geqrf_func_def = int (*)(ORDER_TYPE, int, int, T *, int, T *);
+
+template<typename T>
+using gqr_func_def = int (*)(ORDER_TYPE, int, int, int, T *, int, const T *);
+
+#define QR_FUNC_DEF(FUNC) \
+    template<typename T>  \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define QR_FUNC(FUNC, TYPE, PREFIX)             \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+QR_FUNC_DEF(geqrf)
+QR_FUNC(geqrf, float, s)
+QR_FUNC(geqrf, double, d)
+QR_FUNC(geqrf, cfloat, c)
+QR_FUNC(geqrf, cdouble, z)
+
+#define GQR_FUNC_DEF(FUNC) \
+    template<typename T>   \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define GQR_FUNC(FUNC, TYPE, PREFIX)            \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX);            \
+    }
+
+GQR_FUNC_DEF(gqr)
+GQR_FUNC(gqr, float, sorgqr)
+GQR_FUNC(gqr, double, dorgqr)
+GQR_FUNC(gqr, cfloat, cungqr)
+GQR_FUNC(gqr, cdouble, zungqr)
+
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    const dim4 NullShape(0, 0, 0, 0);
+
+    dim4 endPadding(M - iDims[0], max(M, N) - iDims[1], 0, 0);
+    q = (endPadding == NullShape
+             ? copyArray(in)
+             : padArrayBorders(in, NullShape, endPadding, AF_PAD_ZERO));
+    q.resetDims(iDims);
+    t = qr_inplace(q);
+
+    // Copy the upper triangle of the factored matrix into r
+    dim4 rdims(M, N);
+    r = createEmptyArray<T>(rdims);
+
+    mapped_ptr<T> qPtr = q.getMappedPtr();
+    mapped_ptr<T> rPtr = r.getMappedPtr();
+    mapped_ptr<T> tPtr = t.getMappedPtr();
+
+    triangle<T, true, false>(rPtr.get(), qPtr.get(), rdims, r.strides(),
+                             q.strides());
+
+    gqr_func<T>()(AF_LAPACK_COL_MAJOR, M, M, min(M, N), qPtr.get(),
+                  q.strides()[1], tPtr.get());
+
+    q.resetDims(dim4(M, M));
+}
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    Array<T> t = createEmptyArray<T>(af::dim4(min(M, N), 1, 1, 1));
+
+    mapped_ptr<T> iPtr = in.getMappedPtr();
+    mapped_ptr<T> tPtr = t.getMappedPtr();
+
+    geqrf_func<T>()(AF_LAPACK_COL_MAJOR, M, N, iPtr.get(), in.strides()[1],
+                    tPtr.get());
+
+    return t;
+}
+
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
+
+INSTANTIATE_QR(float)
+INSTANTIATE_QR(cfloat)
+INSTANTIATE_QR(double)
+INSTANTIATE_QR(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_qr.hpp b/src/backend/opencl/cpu/cpu_qr.hpp
new file mode 100644
index 0000000000..d9c9345115
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_qr.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+
+template<typename T>
+Array<T> qr_inplace(Array<T> &in);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_solve.cpp b/src/backend/opencl/cpu/cpu_solve.cpp
new file mode 100644
index 0000000000..4e0349d2dc
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_solve.cpp
@@ -0,0 +1,318 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_solve.hpp>
+#include <math.hpp>
+#ifdef USE_MKL
+#include <mkl_version.h>
+#endif
+#include <algorithm>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+using gesv_func_def = int (*)(ORDER_TYPE, int, int, T *, int, int *, T *, int);
+
+template<typename T>
+using gels_func_def = int (*)(ORDER_TYPE, char, int, int, int, T *, int, T *,
+                              int);
+
+#ifdef AF_USE_MKL_BATCH
+template<typename T>
+using getrf_batch_strided_func_def =
+    void (*)(const MKL_INT *m, const MKL_INT *n, T *a, const MKL_INT *lda,
+             const MKL_INT *stride_a, MKL_INT *ipiv, const MKL_INT *stride_ipiv,
+             const MKL_INT *batch_size, MKL_INT *info);
+
+#if INTEL_MKL_VERSION >= 20210004
+template<typename T>
+using getrs_batch_strided_func_def = void (*)(
+    const char *trans, const MKL_INT *n, const MKL_INT *nrhs, const T *a,
+    const MKL_INT *lda, const MKL_INT *stride_a, const MKL_INT *ipiv,
+    const MKL_INT *stride_ipiv, T *b, const MKL_INT *ldb,
+    const MKL_INT *stride_b, const MKL_INT *batch_size, MKL_INT *info);
+#else
+template<typename T>
+using getrs_batch_strided_func_def =
+    void (*)(const char *trans, const MKL_INT *n, const MKL_INT *nrhs, T *a,
+             const MKL_INT *lda, const MKL_INT *stride_a, MKL_INT *ipiv,
+             const MKL_INT *stride_ipiv, T *b, const MKL_INT *ldb,
+             const MKL_INT *stride_b, const MKL_INT *batch_size, MKL_INT *info);
+#endif
+#endif
+
+template<typename T>
+using getrs_func_def = int (*)(ORDER_TYPE, char, int, int, const T *, int,
+                               const int *, T *, int);
+
+template<typename T>
+using trtrs_func_def = int (*)(ORDER_TYPE, char, char, char, int, int,
+                               const T *, int, T *, int);
+
+#define SOLVE_FUNC_DEF(FUNC) \
+    template<typename T>     \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define SOLVE_FUNC(FUNC, TYPE, PREFIX)          \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);      \
+    }
+
+SOLVE_FUNC_DEF(gesv)
+SOLVE_FUNC(gesv, float, s)
+SOLVE_FUNC(gesv, double, d)
+SOLVE_FUNC(gesv, cfloat, c)
+SOLVE_FUNC(gesv, cdouble, z)
+
+SOLVE_FUNC_DEF(gels)
+SOLVE_FUNC(gels, float, s)
+SOLVE_FUNC(gels, double, d)
+SOLVE_FUNC(gels, cfloat, c)
+SOLVE_FUNC(gels, cdouble, z)
+
+#ifdef AF_USE_MKL_BATCH
+
+template<typename T>
+struct mkl_type {
+    using type = T;
+};
+template<>
+struct mkl_type<cl_float2> {
+    using type = MKL_Complex8;
+};
+template<>
+struct mkl_type<cl_double2> {
+    using type = MKL_Complex16;
+};
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wnoexcept-type"
+template<typename T>
+getrf_batch_strided_func_def<T> getrf_batch_strided_func();
+
+template<>
+getrf_batch_strided_func_def<float> getrf_batch_strided_func<float>() {
+    return &sgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<double> getrf_batch_strided_func<double>() {
+    return &dgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<MKL_Complex8>
+getrf_batch_strided_func<MKL_Complex8>() {
+    return &cgetrf_batch_strided;
+}
+template<>
+getrf_batch_strided_func_def<MKL_Complex16>
+getrf_batch_strided_func<MKL_Complex16>() {
+    return &zgetrf_batch_strided;
+}
+
+template<typename T>
+getrs_batch_strided_func_def<T> getrs_batch_strided_func();
+
+template<>
+getrs_batch_strided_func_def<float> getrs_batch_strided_func<float>() {
+    return &sgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<double> getrs_batch_strided_func<double>() {
+    return &dgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<MKL_Complex8>
+getrs_batch_strided_func<MKL_Complex8>() {
+    return &cgetrs_batch_strided;
+}
+template<>
+getrs_batch_strided_func_def<MKL_Complex16>
+getrs_batch_strided_func<MKL_Complex16>() {
+    return &zgetrs_batch_strided;
+}
+
+#pragma GCC diagnostic pop
+#endif
+
+SOLVE_FUNC_DEF(getrs)
+SOLVE_FUNC(getrs, float, s)
+SOLVE_FUNC(getrs, double, d)
+SOLVE_FUNC(getrs, cfloat, c)
+SOLVE_FUNC(getrs, cdouble, z)
+
+SOLVE_FUNC_DEF(trtrs)
+SOLVE_FUNC(trtrs, float, s)
+SOLVE_FUNC(trtrs, double, d)
+SOLVE_FUNC(trtrs, cfloat, c)
+SOLVE_FUNC(trtrs, cdouble, z)
+
+template<typename T>
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    UNUSED(options);
+    int N    = A.dims()[0];
+    int NRHS = b.dims()[1];
+
+    Array<T> B = copyArray<T>(b);
+
+    mapped_ptr<T> aPtr   = A.getMappedPtr();
+    mapped_ptr<T> bPtr   = B.getMappedPtr();
+    mapped_ptr<int> pPtr = pivot.getMappedPtr();
+
+    getrs_func<T>()(AF_LAPACK_COL_MAJOR, 'N', N, NRHS, aPtr.get(),
+                    A.strides()[1], pPtr.get(), bPtr.get(), B.strides()[1]);
+
+    return B;
+}
+
+template<typename T>
+Array<T> triangleSolve(const Array<T> &A, const Array<T> &b,
+                       const af_mat_prop options) {
+    Array<T> B = copyArray<T>(b);
+    int N      = B.dims()[0];
+    int NRHS   = B.dims()[1];
+
+    mapped_ptr<T> aPtr = A.getMappedPtr();
+    mapped_ptr<T> bPtr = B.getMappedPtr();
+
+    trtrs_func<T>()(AF_LAPACK_COL_MAJOR, options & AF_MAT_UPPER ? 'U' : 'L',
+                    'N',  // transpose flag
+                    options & AF_MAT_DIAG_UNIT ? 'U' : 'N', N, NRHS, aPtr.get(),
+                    A.strides()[1], bPtr.get(), B.strides()[1]);
+
+    return B;
+}
+
+#ifdef AF_USE_MKL_BATCH
+
+template<typename T>
+Array<T> generalSolveBatched(const Array<T> &a, const Array<T> &b,
+                             const af_mat_prop options) {
+    using std::vector;
+    int batches = a.dims()[2] * a.dims()[3];
+
+    dim4 aDims = a.dims();
+    dim4 bDims = b.dims();
+    int M      = aDims[0];
+    int N      = aDims[1];
+    int K      = bDims[1];
+    int MN     = std::min(M, N);
+
+    int lda     = a.strides()[1];
+    int astride = a.strides()[2];
+
+    vector<int> ipiv(MN * batches);
+    int ipivstride = MN;
+
+    int ldb     = b.strides()[1];
+    int bstride = b.strides()[2];
+
+    vector<int> info(batches, 0);
+
+    char trans = 'N';
+
+    Array<T> A = copyArray<T>(a);
+    Array<T> B = copyArray<T>(b);
+
+    mapped_ptr<T> aPtr = A.getMappedPtr();
+    mapped_ptr<T> bPtr = B.getMappedPtr();
+
+    getrf_batch_strided_func<typename mkl_type<T>::type>()(
+        &M, &N, reinterpret_cast<typename mkl_type<T>::type *>(aPtr.get()),
+        &lda, &astride, ipiv.data(), &ipivstride, &batches, info.data());
+
+    getrs_batch_strided_func<typename mkl_type<T>::type>()(
+        &trans, &M, &K,
+        reinterpret_cast<typename mkl_type<T>::type *>(aPtr.get()), &lda,
+        &astride, ipiv.data(), &ipivstride,
+        reinterpret_cast<typename mkl_type<T>::type *>(bPtr.get()), &ldb,
+        &bstride, &batches, info.data());
+
+    return B;
+}
+#endif
+
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    if (options & AF_MAT_UPPER || options & AF_MAT_LOWER) {
+        return triangleSolve<T>(a, b, options);
+    }
+
+#ifdef AF_USE_MKL_BATCH
+    if (a.dims()[2] > 1 || a.dims()[3] > 1) {
+        return generalSolveBatched(a, b, options);
+    }
+#endif
+
+    const dim4 NullShape(0, 0, 0, 0);
+
+    dim4 aDims = a.dims();
+    int batchz = aDims[2];
+    int batchw = aDims[3];
+
+    int M = a.dims()[0];
+    int N = a.dims()[1];
+    int K = b.dims()[1];
+
+    Array<T> A = copyArray<T>(a);
+    dim4 endPadding(max(M, N) - b.dims()[0], K - b.dims()[1], 0, 0);
+    Array<T> B = (endPadding == NullShape
+                      ? copyArray(b)
+                      : padArrayBorders(b, NullShape, endPadding, AF_PAD_ZERO));
+
+    mapped_ptr<T> aPtr = A.getMappedPtr();
+    mapped_ptr<T> bPtr = B.getMappedPtr();
+
+    for (int i = 0; i < batchw; i++) {
+        for (int j = 0; j < batchz; j++) {
+            auto pA = aPtr.get() + A.strides()[2] * j + A.strides()[3] * i;
+            auto pB = bPtr.get() + B.strides()[2] * j + B.strides()[3] * i;
+
+            if (M == N) {
+                std::vector<int> pivot(N);
+                gesv_func<T>()(AF_LAPACK_COL_MAJOR, N, K, pA, A.strides()[1],
+                               &pivot.front(), pB, B.strides()[1]);
+            } else {
+                int sM = a.strides()[1];
+                int sN = a.strides()[2] / sM;
+
+                gels_func<T>()(AF_LAPACK_COL_MAJOR, 'N', M, N, K, pA,
+                               A.strides()[1], pB, max(sM, sN));
+            }
+        }
+    }
+    if (M != N) { B.resetDims(dim4(N, K, B.dims()[2], B.dims()[3])); }
+
+    return B;
+}
+
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
+    template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
+
+INSTANTIATE_SOLVE(float)
+INSTANTIATE_SOLVE(cfloat)
+INSTANTIATE_SOLVE(double)
+INSTANTIATE_SOLVE(cdouble)
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_solve.hpp b/src/backend/opencl/cpu/cpu_solve.hpp
new file mode 100644
index 0000000000..1223a96531
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_solve.hpp
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options = AF_MAT_NONE);
+
+template<typename T>
+Array<T> solveLU(const Array<T> &a, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options = AF_MAT_NONE);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_sparse_blas.cpp b/src/backend/opencl/cpu/cpu_sparse_blas.cpp
new file mode 100644
index 0000000000..66fba7cdbe
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_sparse_blas.cpp
@@ -0,0 +1,492 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <cpu/cpu_sparse_blas.hpp>
+
+#include <common/complex.hpp>
+#include <complex.hpp>
+#include <err_opencl.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <af/dim4.hpp>
+
+#include <stdexcept>
+#include <string>
+
+using arrayfire::common::is_complex;
+
+using std::add_const;
+using std::add_pointer;
+using std::conditional;
+using std::enable_if;
+using std::is_floating_point;
+using std::is_same;
+using std::remove_const;
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T, class Enable = void>
+struct blas_base {
+    using type = T;
+};
+
+template<typename T>
+struct blas_base<T, typename enable_if<is_complex<T>::value>::type> {
+    using type = typename conditional<is_same<T, cdouble>::value, sp_cdouble,
+                                      sp_cfloat>::type;
+};
+
+template<typename T>
+using cptr_type =
+    typename conditional<is_complex<T>::value,
+                         const typename blas_base<T>::type *, const T *>::type;
+template<typename T>
+using ptr_type = typename conditional<is_complex<T>::value,
+                                      typename blas_base<T>::type *, T *>::type;
+template<typename T>
+using scale_type =
+    typename conditional<is_complex<T>::value,
+                         const typename blas_base<T>::type, const T>::type;
+
+template<typename To, typename Ti>
+auto getScaleValue(Ti val) -> std::remove_cv_t<To> {
+    return static_cast<std::remove_cv_t<To>>(val);
+}
+
+#ifdef USE_MKL
+
+// MKL
+// sparse_status_t mkl_sparse_z_create_csr (
+//                 sparse_matrix_t *A,
+//                 sparse_index_base_t indexing,
+//                 MKL_INT rows, MKL_INT cols,
+//                 MKL_INT *rows_start, MKL_INT *rows_end,
+//                 MKL_INT *col_indx,
+//                 MKL_Complex16 *values);
+//
+// sparse_status_t mkl_sparse_z_mv (
+//                 sparse_operation_t operation,
+//                 MKL_Complex16 alpha,
+//                 const sparse_matrix_t A,
+//                 struct matrix_descr descr,
+//                 const MKL_Complex16 *x,
+//                 MKL_Complex16 beta,
+//                 MKL_Complex16 *y);
+//
+// sparse_status_t mkl_sparse_z_mm (
+//                 sparse_operation_t operation,
+//                 MKL_Complex16 alpha,
+//                 const sparse_matrix_t A,
+//                 struct matrix_descr descr,
+//                 sparse_layout_t layout,
+//                 const MKL_Complex16 *x,
+//                 MKL_INT columns, MKL_INT ldx,
+//                 MKL_Complex16 beta,
+//                 MKL_Complex16 *y,
+//                 MKL_INT ldy);
+
+template<typename T>
+using create_csr_func_def = sparse_status_t (*)(sparse_matrix_t *,
+                                                sparse_index_base_t, int, int,
+                                                int *, int *, int *,
+                                                ptr_type<T>);
+
+template<typename T>
+using mv_func_def = sparse_status_t (*)(sparse_operation_t, scale_type<T>,
+                                        const sparse_matrix_t, matrix_descr,
+                                        cptr_type<T>, scale_type<T>,
+                                        ptr_type<T>);
+
+template<typename T>
+using mm_func_def = sparse_status_t (*)(sparse_operation_t, scale_type<T>,
+                                        const sparse_matrix_t, matrix_descr,
+                                        sparse_layout_t, cptr_type<T>, int, int,
+                                        scale_type<T>, ptr_type<T>, int);
+
+#define SPARSE_FUNC_DEF(FUNC) \
+    template<typename T>      \
+    FUNC##_func_def<T> FUNC##_func();
+
+#define SPARSE_FUNC(FUNC, TYPE, PREFIX)         \
+    template<>                                  \
+    FUNC##_func_def<TYPE> FUNC##_func<TYPE>() { \
+        return &mkl_sparse_##PREFIX##_##FUNC;   \
+    }
+
+SPARSE_FUNC_DEF(create_csr)
+SPARSE_FUNC(create_csr, float, s)
+SPARSE_FUNC(create_csr, double, d)
+SPARSE_FUNC(create_csr, cfloat, c)
+SPARSE_FUNC(create_csr, cdouble, z)
+
+SPARSE_FUNC_DEF(mv)
+SPARSE_FUNC(mv, float, s)
+SPARSE_FUNC(mv, double, d)
+SPARSE_FUNC(mv, cfloat, c)
+SPARSE_FUNC(mv, cdouble, z)
+
+SPARSE_FUNC_DEF(mm)
+SPARSE_FUNC(mm, float, s)
+SPARSE_FUNC(mm, double, d)
+SPARSE_FUNC(mm, cfloat, c)
+SPARSE_FUNC(mm, cdouble, z)
+
+#undef SPARSE_FUNC
+#undef SPARSE_FUNC_DEF
+
+template<>
+sp_cfloat getScaleValue<const sp_cfloat, cfloat>(cfloat val) {
+    sp_cfloat ret;
+    ret.real = val.s[0];
+    ret.imag = val.s[1];
+    return ret;
+}
+
+template<>
+sp_cdouble getScaleValue<const sp_cdouble, cdouble>(cdouble val) {
+    sp_cdouble ret;
+    ret.real = val.s[0];
+    ret.imag = val.s[1];
+    return ret;
+}
+
+#else  // USE_MKL
+
+// From mkl_spblas.h
+typedef enum {
+    SPARSE_OPERATION_NON_TRANSPOSE       = 10,
+    SPARSE_OPERATION_TRANSPOSE           = 11,
+    SPARSE_OPERATION_CONJUGATE_TRANSPOSE = 12,
+} sparse_operation_t;
+
+#endif  // USE_MKL
+
+sparse_operation_t toSparseTranspose(af_mat_prop opt) {
+    sparse_operation_t out = SPARSE_OPERATION_NON_TRANSPOSE;
+    switch (opt) {
+        case AF_MAT_NONE: out = SPARSE_OPERATION_NON_TRANSPOSE; break;
+        case AF_MAT_TRANS: out = SPARSE_OPERATION_TRANSPOSE; break;
+        case AF_MAT_CTRANS: out = SPARSE_OPERATION_CONJUGATE_TRANSPOSE; break;
+        default: AF_ERROR("INVALID af_mat_prop", AF_ERR_ARG);
+    }
+    return out;
+}
+
+template<typename T, int value>
+scale_type<T> getScale() {  // NOLINT(readability-const-return-type)
+    thread_local T val = scalar<T>(value);
+    return getScaleValue<scale_type<T>, T>(val);
+}
+
+////////////////////////////////////////////////////////////////////////////////
+#ifdef USE_MKL  // Implementation using MKL
+////////////////////////////////////////////////////////////////////////////////
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> lhs, const Array<T> rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    // MKL: CSRMM Does not support optRhs
+    UNUSED(optRhs);
+
+    lhs.eval();
+    rhs.eval();
+
+    // Similar Operations to GEMM
+    sparse_operation_t lOpts = toSparseTranspose(optLhs);
+
+    int lRowDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 0 : 1;
+    // int lColDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 1 : 0;
+
+    // rColDim is fixed at 1 because transposing rhs (optRhs) is unsupported
+    static const int rColDim = 1;
+
+    dim4 lDims = lhs.dims();
+    dim4 rDims = rhs.dims();
+    int M      = lDims[lRowDim];
+    int N      = rDims[rColDim];
+    // int K = lDims[lColDim];
+
+    Array<T> out = createValueArray<T>(af::dim4(M, N, 1, 1), scalar<T>(0));
+    out.eval();
+
+    auto alpha = getScale<T, 1>();
+    auto beta  = getScale<T, 0>();
+
+    int ldb = rhs.strides()[1];
+    int ldc = out.strides()[1];
+
+    // get host pointers from mapped memory
+    mapped_ptr<T> rhsPtr = rhs.getMappedPtr(CL_MAP_READ);
+    mapped_ptr<T> outPtr = out.getMappedPtr();
+
+    Array<T> values   = lhs.getValues();
+    Array<int> rowIdx = lhs.getRowIdx();
+    Array<int> colIdx = lhs.getColIdx();
+
+    mapped_ptr<T> vPtr   = values.getMappedPtr();
+    mapped_ptr<int> rPtr = rowIdx.getMappedPtr();
+    mapped_ptr<int> cPtr = colIdx.getMappedPtr();
+    int *pB              = rPtr.get();
+    int *pE              = rPtr.get() + 1;
+
+    sparse_matrix_t csrLhs;
+    create_csr_func<T>()(&csrLhs, SPARSE_INDEX_BASE_ZERO, lhs.dims()[0],
+                         lhs.dims()[1], pB, pE, cPtr.get(),
+                         reinterpret_cast<ptr_type<T>>(vPtr.get()));
+
+    struct matrix_descr descrLhs {};
+    descrLhs.type = SPARSE_MATRIX_TYPE_GENERAL;
+
+    mkl_sparse_optimize(csrLhs);
+
+    if (rDims[rColDim] == 1) {
+        mkl_sparse_set_mv_hint(csrLhs, lOpts, descrLhs, 1);
+        mv_func<T>()(lOpts, alpha, csrLhs, descrLhs,
+                     reinterpret_cast<cptr_type<T>>(rhsPtr.get()), beta,
+                     reinterpret_cast<ptr_type<T>>(outPtr.get()));
+    } else {
+        mkl_sparse_set_mm_hint(csrLhs, lOpts, descrLhs,
+                               SPARSE_LAYOUT_COLUMN_MAJOR, N, 1);
+        mm_func<T>()(lOpts, alpha, csrLhs, descrLhs, SPARSE_LAYOUT_COLUMN_MAJOR,
+                     reinterpret_cast<cptr_type<T>>(rhsPtr.get()), N, ldb, beta,
+                     reinterpret_cast<ptr_type<T>>(outPtr.get()), ldc);
+    }
+    mkl_sparse_destroy(csrLhs);
+
+    return out;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+#else  // Implementation without using MKL
+////////////////////////////////////////////////////////////////////////////////
+
+template<typename T>
+T getConjugate(const T &in) {
+    // For non-complex types return same
+    return in;
+}
+
+template<>
+cfloat getConjugate(const cfloat &in) {
+    cfloat val;
+    val.s[0] = in.s[0];
+    val.s[1] = -in.s[1];
+    return val;
+}
+
+template<>
+cdouble getConjugate(const cdouble &in) {
+    cdouble val;
+    val.s[0] = in.s[0];
+    val.s[1] = -in.s[1];
+    return val;
+}
+
+template<typename T, bool conjugate>
+void mv(Array<T> output, const Array<T> values, const Array<int> rowIdx,
+        const Array<int> colIdx, const Array<T> right, int M) {
+    UNUSED(M);
+    mapped_ptr<T> oPtr   = output.getMappedPtr();
+    mapped_ptr<T> rhtPtr = right.getMappedPtr();
+    mapped_ptr<T> vPtr   = values.getMappedPtr();
+    mapped_ptr<int> rPtr = rowIdx.getMappedPtr();
+    mapped_ptr<int> cPtr = colIdx.getMappedPtr();
+
+    T const *const valPtr   = vPtr.get();
+    int const *const rowPtr = rPtr.get();
+    int const *const colPtr = cPtr.get();
+    T const *const rhsPtr   = rhtPtr.get();
+    T *const outPtr         = oPtr.get();
+
+    for (int i = 0; i < rowIdx.dims()[0] - 1; ++i) {
+        outPtr[i] = scalar<T>(0);
+        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+            // If stride[0] of right is not 1 then rhsPtr[colPtr[j]*stride]
+            if (conjugate) {
+                outPtr[i] =
+                    outPtr[i] + getConjugate(valPtr[j]) * rhsPtr[colPtr[j]];
+            } else {
+                outPtr[i] = outPtr[i] + valPtr[j] * rhsPtr[colPtr[j]];
+            }
+        }
+    }
+}
+
+template<typename T, bool conjugate>
+void mtv(Array<T> output, const Array<T> values, const Array<int> rowIdx,
+         const Array<int> colIdx, const Array<T> right, int M) {
+    mapped_ptr<T> oPtr   = output.getMappedPtr();
+    mapped_ptr<T> rhtPtr = right.getMappedPtr();
+    mapped_ptr<T> vPtr   = values.getMappedPtr();
+    mapped_ptr<int> rPtr = rowIdx.getMappedPtr();
+    mapped_ptr<int> cPtr = colIdx.getMappedPtr();
+
+    T const *const valPtr   = vPtr.get();
+    int const *const rowPtr = rPtr.get();
+    int const *const colPtr = cPtr.get();
+    T const *const rhsPtr   = rhtPtr.get();
+    T *const outPtr         = oPtr.get();
+
+    for (int i = 0; i < M; ++i) { outPtr[i] = scalar<T>(0); }
+
+    for (int i = 0; i < rowIdx.dims()[0] - 1; ++i) {
+        for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+            // If stride[0] of right is not 1 then rhsPtr[i*stride]
+            if (conjugate) {
+                outPtr[colPtr[j]] =
+                    outPtr[colPtr[j]] + getConjugate(valPtr[j]) * rhsPtr[i];
+            } else {
+                outPtr[colPtr[j]] = outPtr[colPtr[j]] + valPtr[j] * rhsPtr[i];
+            }
+        }
+    }
+}
+
+template<typename T, bool conjugate>
+void mm(Array<T> output, const Array<T> values, const Array<int> rowIdx,
+        const Array<int> colIdx, const Array<T> right, int M, int N, int ldb,
+        int ldc) {
+    UNUSED(M);
+    mapped_ptr<T> oPtr   = output.getMappedPtr();
+    mapped_ptr<T> rhtPtr = right.getMappedPtr();
+    mapped_ptr<T> vPtr   = values.getMappedPtr();
+    mapped_ptr<int> rPtr = rowIdx.getMappedPtr();
+    mapped_ptr<int> cPtr = colIdx.getMappedPtr();
+
+    T const *const valPtr   = vPtr.get();
+    int const *const rowPtr = rPtr.get();
+    int const *const colPtr = cPtr.get();
+    T const *rhsPtr         = rhtPtr.get();
+    T *outPtr               = oPtr.get();
+
+    for (int o = 0; o < N; ++o) {
+        for (int i = 0; i < rowIdx.dims()[0] - 1; ++i) {
+            outPtr[i] = scalar<T>(0);
+            for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+                // If stride[0] of right is not 1 then rhsPtr[colPtr[j]*stride]
+                if (conjugate) {
+                    outPtr[i] =
+                        outPtr[i] + getConjugate(valPtr[j]) * rhsPtr[colPtr[j]];
+                } else {
+                    outPtr[i] = outPtr[i] + valPtr[j] * rhsPtr[colPtr[j]];
+                }
+            }
+        }
+        rhsPtr += ldb;
+        outPtr += ldc;
+    }
+}
+
+template<typename T, bool conjugate>
+void mtm(Array<T> output, const Array<T> values, const Array<int> rowIdx,
+         const Array<int> colIdx, const Array<T> right, int M, int N, int ldb,
+         int ldc) {
+    mapped_ptr<T> oPtr   = output.getMappedPtr();
+    mapped_ptr<T> rhtPtr = right.getMappedPtr();
+    mapped_ptr<T> vPtr   = values.getMappedPtr();
+    mapped_ptr<int> rPtr = rowIdx.getMappedPtr();
+    mapped_ptr<int> cPtr = colIdx.getMappedPtr();
+
+    T const *const valPtr   = vPtr.get();
+    int const *const rowPtr = rPtr.get();
+    int const *const colPtr = cPtr.get();
+    T const *rhsPtr         = rhtPtr.get();
+    T *outPtr               = oPtr.get();
+
+    for (int o = 0; o < N; ++o) {
+        for (int i = 0; i < M; ++i) { outPtr[i] = scalar<T>(0); }
+
+        for (int i = 0; i < rowIdx.dims()[0] - 1; ++i) {
+            for (int j = rowPtr[i]; j < rowPtr[i + 1]; ++j) {
+                // If stride[0] of right is not 1 then rhsPtr[i*stride]
+                if (conjugate) {
+                    outPtr[colPtr[j]] =
+                        outPtr[colPtr[j]] + getConjugate(valPtr[j]) * rhsPtr[i];
+                } else {
+                    outPtr[colPtr[j]] =
+                        outPtr[colPtr[j]] + valPtr[j] * rhsPtr[i];
+                }
+            }
+        }
+        rhsPtr += ldb;
+        outPtr += ldc;
+    }
+}
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> lhs, const Array<T> rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+    UNUSED(optRhs);
+    lhs.eval();
+    rhs.eval();
+
+    // Similar Operations to GEMM
+    sparse_operation_t lOpts = toSparseTranspose(optLhs);
+
+    int lRowDim = (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) ? 0 : 1;
+
+    static const int rColDim = 1;
+
+    dim4 lDims = lhs.dims();
+    dim4 rDims = rhs.dims();
+    int M      = lDims[lRowDim];
+    int N      = rDims[rColDim];
+
+    Array<T> out = createValueArray<T>(af::dim4(M, N, 1, 1), scalar<T>(0));
+    out.eval();
+
+    int ldb = rhs.strides()[1];
+    int ldc = out.strides()[1];
+
+    Array<T> values   = lhs.getValues();
+    Array<int> rowIdx = lhs.getRowIdx();
+    Array<int> colIdx = lhs.getColIdx();
+
+    if (rDims[rColDim] == 1) {
+        if (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) {
+            mv<T, false>(out, values, rowIdx, colIdx, rhs, M);
+        } else if (lOpts == SPARSE_OPERATION_TRANSPOSE) {
+            mtv<T, false>(out, values, rowIdx, colIdx, rhs, M);
+        } else if (lOpts == SPARSE_OPERATION_CONJUGATE_TRANSPOSE) {
+            mtv<T, true>(out, values, rowIdx, colIdx, rhs, M);
+        }
+    } else {
+        if (lOpts == SPARSE_OPERATION_NON_TRANSPOSE) {
+            mm<T, false>(out, values, rowIdx, colIdx, rhs, M, N, ldb, ldc);
+        } else if (lOpts == SPARSE_OPERATION_TRANSPOSE) {
+            mtm<T, false>(out, values, rowIdx, colIdx, rhs, M, N, ldb, ldc);
+        } else if (lOpts == SPARSE_OPERATION_CONJUGATE_TRANSPOSE) {
+            mtm<T, true>(out, values, rowIdx, colIdx, rhs, M, N, ldb, ldc);
+        }
+    }
+
+    return out;
+}
+
+////////////////////////////////////////////////////////////////////////////////
+#endif
+////////////////////////////////////////////////////////////////////////////////
+
+#define INSTANTIATE_SPARSE(T)                                           \
+    template Array<T> matmul<T>(const common::SparseArray<T> lhs,       \
+                                const Array<T> rhs, af_mat_prop optLhs, \
+                                af_mat_prop optRhs);
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+#undef INSTANTIATE_SPARSE
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
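Note on the non-MKL fallback in cpu_sparse_blas.cpp above: the `mv` and `mtv` loops are the usual CSR gather and scatter products. The sketch below restates them as standalone functions, assuming a hypothetical `CsrMatrix` struct with `double` values and unit-stride vectors. `CsrMatrix`, `csrMultiply`, and `csrTransposeMultiply` are illustrative names that do not exist in ArrayFire, and no mapped OpenCL buffers are involved.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone CSR container (not ArrayFire's SparseArray).
struct CsrMatrix {
    int rows;
    int cols;
    std::vector<int> rowPtr;     // rows + 1 offsets into colIdx/values
    std::vector<int> colIdx;     // column index of each stored entry
    std::vector<double> values;  // value of each stored entry
};

// y = A * x: every output row gathers the entries of x it needs.
std::vector<double> csrMultiply(const CsrMatrix &a,
                                const std::vector<double> &x) {
    assert(x.size() == static_cast<std::size_t>(a.cols));
    std::vector<double> y(a.rows, 0.0);
    for (int i = 0; i < a.rows; ++i) {
        for (int j = a.rowPtr[i]; j < a.rowPtr[i + 1]; ++j) {
            y[i] += a.values[j] * x[a.colIdx[j]];
        }
    }
    return y;
}

// y = A^T * x: every stored entry scatters into y[colIdx[j]], so the
// output has one element per column of A and must be zeroed up front.
std::vector<double> csrTransposeMultiply(const CsrMatrix &a,
                                         const std::vector<double> &x) {
    assert(x.size() == static_cast<std::size_t>(a.rows));
    std::vector<double> y(a.cols, 0.0);
    for (int i = 0; i < a.rows; ++i) {
        for (int j = a.rowPtr[i]; j < a.rowPtr[i + 1]; ++j) {
            y[a.colIdx[j]] += a.values[j] * x[i];
        }
    }
    return y;
}
```

For the 2x3 matrix [[1, 0, 2], [0, 3, 0]], stored as rowPtr = {0, 2, 3}, colIdx = {0, 2, 1}, values = {1, 2, 3}, `csrMultiply` with x = (1, 1, 1) yields (3, 3), and `csrTransposeMultiply` with x = (1, 2) yields (1, 6, 2). This is the same reason `mtv` zeroes `M` output elements before its main loop while `mv` can zero each row as it goes.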
diff --git a/src/backend/opencl/cpu/cpu_sparse_blas.hpp b/src/backend/opencl/cpu/cpu_sparse_blas.hpp
new file mode 100644
index 0000000000..dee21c7c01
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_sparse_blas.hpp
@@ -0,0 +1,35 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+#ifdef USE_MKL
+#include <mkl_spblas.h>
+#endif
+
+#ifdef USE_MKL
+using sp_cfloat  = MKL_Complex8;
+using sp_cdouble = MKL_Complex16;
+#else
+using sp_cfloat  = arrayfire::opencl::cfloat;
+using sp_cdouble = arrayfire::opencl::cdouble;
+#endif
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T> lhs, const Array<T> rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs);
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_svd.cpp b/src/backend/opencl/cpu/cpu_svd.cpp
new file mode 100644
index 0000000000..6d865e8520
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_svd.cpp
@@ -0,0 +1,98 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <copy.hpp>
+#include <cpu/cpu_helper.hpp>
+#include <cpu/cpu_svd.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+#define SVD_FUNC_DEF(FUNC)            \
+    template<typename T, typename Tr> \
+    svd_func_def<T, Tr> svd_func();
+
+#define SVD_FUNC(FUNC, T, Tr, PREFIX)       \
+    template<>                              \
+    svd_func_def<T, Tr> svd_func<T, Tr>() { \
+        return &LAPACK_NAME(PREFIX##FUNC);  \
+    }
+
+#if defined(USE_MKL) || defined(__APPLE__)
+
+template<typename T, typename Tr>
+using svd_func_def = int (*)(ORDER_TYPE, char jobz, int m, int n, T *in,
+                             int ldin, Tr *s, T *u, int ldu, T *vt, int ldvt);
+
+SVD_FUNC_DEF(gesdd)
+SVD_FUNC(gesdd, float, float, s)
+SVD_FUNC(gesdd, double, double, d)
+SVD_FUNC(gesdd, cfloat, float, c)
+SVD_FUNC(gesdd, cdouble, double, z)
+
+#else  // ATLAS has memory-freeing issues with gesdd, so gesvd is used
+
+template<typename T, typename Tr>
+using svd_func_def = int (*)(ORDER_TYPE, char jobu, char jobvt, int m, int n,
+                             T *in, int ldin, Tr *s, T *u, int ldu, T *vt,
+                             int ldvt, Tr *superb);
+
+SVD_FUNC_DEF(gesvd)
+SVD_FUNC(gesvd, float, float, s)
+SVD_FUNC(gesvd, double, double, d)
+SVD_FUNC(gesvd, cfloat, float, c)
+SVD_FUNC(gesvd, cdouble, double, z)
+
+#endif
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    mapped_ptr<Tr> sPtr = s.getMappedPtr();
+    mapped_ptr<T> uPtr  = u.getMappedPtr();
+    mapped_ptr<T> vPtr  = vt.getMappedPtr();
+    mapped_ptr<T> iPtr  = in.getMappedPtr();
+
+#if defined(USE_MKL) || defined(__APPLE__)
+    svd_func<T, Tr>()(AF_LAPACK_COL_MAJOR, 'A', M, N, iPtr.get(),
+                      in.strides()[1], sPtr.get(), uPtr.get(), u.strides()[1],
+                      vPtr.get(), vt.strides()[1]);
+#else
+    std::vector<Tr> superb(std::min(M, N));
+    svd_func<T, Tr>()(AF_LAPACK_COL_MAJOR, 'A', 'A', M, N, iPtr.get(),
+                      in.strides()[1], sPtr.get(), uPtr.get(), u.strides()[1],
+                      vPtr.get(), vt.strides()[1], &superb[0]);
+#endif
+}
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    Array<T> in_copy = copyArray<T>(in);
+    svdInPlace(s, u, vt, in_copy);
+}
+
+#define INSTANTIATE_SVD(T, Tr)                                           \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE_SVD(float, float)
+INSTANTIATE_SVD(double, double)
+INSTANTIATE_SVD(cfloat, float)
+INSTANTIATE_SVD(cdouble, double)
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/cpu/cpu_svd.hpp b/src/backend/opencl/cpu/cpu_svd.hpp
new file mode 100644
index 0000000000..2cb163de43
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_svd.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in);
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/cpu/cpu_triangle.hpp b/src/backend/opencl/cpu/cpu_triangle.hpp
new file mode 100644
index 0000000000..6bf2a4ceda
--- /dev/null
+++ b/src/backend/opencl/cpu/cpu_triangle.hpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#ifndef CPU_LAPACK_TRIANGLE
+#define CPU_LAPACK_TRIANGLE
+
+#include <math.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace cpu {
+
+template<typename T, bool is_upper, bool is_unit_diag>
+void triangle(T *o, const T *i, const dim4 odm, const dim4 ost,
+              const dim4 ist) {
+    for (dim_t ow = 0; ow < odm[3]; ow++) {
+        const dim_t oW = ow * ost[3];
+        const dim_t iW = ow * ist[3];
+
+        for (dim_t oz = 0; oz < odm[2]; oz++) {
+            const dim_t oZW = oW + oz * ost[2];
+            const dim_t iZW = iW + oz * ist[2];
+
+            for (dim_t oy = 0; oy < odm[1]; oy++) {
+                const dim_t oYZW = oZW + oy * ost[1];
+                const dim_t iYZW = iZW + oy * ist[1];
+
+                for (dim_t ox = 0; ox < odm[0]; ox++) {
+                    const dim_t oMem = oYZW + ox;
+                    const dim_t iMem = iYZW + ox;
+
+                    bool cond         = is_upper ? (oy >= ox) : (oy <= ox);
+                    bool do_unit_diag = (is_unit_diag && ox == oy);
+                    if (cond) {
+                        o[oMem] = do_unit_diag ? scalar<T>(1) : i[iMem];
+                    } else {
+                        o[oMem] = scalar<T>(0);
+                    }
+                }
+            }
+        }
+    }
+}
+
+}  // namespace cpu
+}  // namespace opencl
+}  // namespace arrayfire
+
+#endif
+#endif  // WITH_LINEAR_ALGEBRA
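Note on `triangle` in cpu_triangle.hpp above: index 0 (`ox`) is the row and index 1 (`oy`) is the column, so the upper triangle keeps elements whose column is at least the row. A minimal 2D sketch of the same rule follows, assuming a dense column-major buffer of `double`. `extractTriangle` is a hypothetical helper, not an ArrayFire function, and it ignores the batch dimensions and strides that the real template handles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies the upper or lower triangle of a column-major rows x cols matrix
// and zeroes everything else. With unitDiag set, the diagonal becomes 1.
std::vector<double> extractTriangle(const std::vector<double> &in, int rows,
                                    int cols, bool isUpper, bool unitDiag) {
    assert(in.size() == static_cast<std::size_t>(rows) * cols);
    std::vector<double> out(in.size(), 0.0);
    for (int col = 0; col < cols; ++col) {
        for (int row = 0; row < rows; ++row) {
            // Same predicate as the patch: upper keeps col >= row.
            const bool keep = isUpper ? (col >= row) : (col <= row);
            if (!keep) { continue; }
            const std::size_t idx = static_cast<std::size_t>(col) * rows + row;
            out[idx] = (unitDiag && row == col) ? 1.0 : in[idx];
        }
    }
    return out;
}
```

With the column-major buffer {1, 2, 3, 4}, which is the matrix [[1, 3], [2, 4]], the upper triangle without a unit diagonal is {1, 0, 3, 4} and the lower triangle with a unit diagonal is {1, 2, 0, 1}.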
diff --git a/src/backend/opencl/debug_opencl.hpp b/src/backend/opencl/debug_opencl.hpp
index 74b3f7cf59..81bc51dce0 100644
--- a/src/backend/opencl/debug_opencl.hpp
+++ b/src/backend/opencl/debug_opencl.hpp
@@ -8,13 +8,18 @@
  ********************************************************/
 
 #pragma once
-#include <err_opencl.hpp>
-#include <stdio.h>
-#include <errorcodes.hpp>
+
+#include <platform.hpp>
 
 #ifndef NDEBUG
-#include <iostream>
+
 #define CL_DEBUG_FINISH(Q) Q.finish()
+
 #else
-#define CL_DEBUG_FINISH(Q)
+
+#define CL_DEBUG_FINISH(Q)                               \
+    do {                                                 \
+        if (opencl::synchronize_calls()) { Q.finish(); } \
+    } while (false)
+
 #endif
diff --git a/src/backend/opencl/device_manager.cpp b/src/backend/opencl/device_manager.cpp
new file mode 100644
index 0000000000..62c06a21a5
--- /dev/null
+++ b/src/backend/opencl/device_manager.cpp
@@ -0,0 +1,577 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Include this before af/opencl.h; otherwise the system cl.hpp
+// conflicts with opencl/cl.hpp
+#include <common/graphics_common.hpp>
+
+#include <GraphicsResourceManager.hpp>
+#include <blas.hpp>
+#include <build_version.hpp>
+#include <clfft.hpp>
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/Version.hpp>
+#include <common/defines.hpp>
+#include <common/host_memory.hpp>
+#include <common/util.hpp>
+#include <device_manager.hpp>
+#include <err_opencl.hpp>
+#include <errorcodes.hpp>
+#include <af/opencl.h>
+#include <af/version.h>
+#include <memory>
+
+#ifdef OS_MAC
+#include <OpenGL/CGLCurrent.h>
+#endif
+
+#include <boost/compute/context.hpp>
+#include <boost/compute/utility/program_cache.hpp>
+
+#include <algorithm>
+#include <iterator>
+#include <sstream>
+#include <string>
+#include <vector>
+
+using arrayfire::common::getEnvVar;
+using cl::CommandQueue;
+using cl::Context;
+using cl::Device;
+using cl::Platform;
+using std::begin;
+using std::end;
+using std::find;
+using std::make_unique;
+using std::ostringstream;
+using std::sort;
+using std::string;
+using std::stringstream;
+using std::unique_ptr;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+
+#if defined(OS_MAC)
+static const char* CL_GL_SHARING_EXT = "cl_APPLE_gl_sharing";
+#else
+static const char* CL_GL_SHARING_EXT = "cl_khr_gl_sharing";
+#endif
+
+bool checkExtnAvailability(const Device& pDevice, const string& pName) {
+    bool ret_val = false;
+    // find the extension required
+    string exts = pDevice.getInfo<CL_DEVICE_EXTENSIONS>();
+    stringstream ss(exts);
+    string item;
+    while (getline(ss, item, ' ')) {
+        if (item == pName) {
+            ret_val = true;
+            break;
+        }
+    }
+    return ret_val;
+}
+
+static afcl::deviceType getDeviceTypeEnum(const Device& dev) {
+    return static_cast<afcl::deviceType>(dev.getInfo<CL_DEVICE_TYPE>());
+}
+
+static inline bool compare_default(const unique_ptr<Device>& ldev,
+                                   const unique_ptr<Device>& rdev) {
+    const cl_device_type device_types[] = {CL_DEVICE_TYPE_GPU,
+                                           CL_DEVICE_TYPE_ACCELERATOR};
+
+    auto l_dev_type = ldev->getInfo<CL_DEVICE_TYPE>();
+    auto r_dev_type = rdev->getInfo<CL_DEVICE_TYPE>();
+
+    // This ensures GPU > ACCELERATOR > CPU
+    for (auto current_type : device_types) {
+        auto is_l_curr_type = l_dev_type == current_type;
+        auto is_r_curr_type = r_dev_type == current_type;
+
+        if (is_l_curr_type && !is_r_curr_type) { return true; }
+        if (!is_l_curr_type && is_r_curr_type) { return false; }
+    }
+
+    // At this point, the devices are of same type.
+    // Sort based on empirical evidence of preferred platforms
+
+    // Prefer AMD first
+    string lPlatName = getPlatformName(*ldev);
+    string rPlatName = getPlatformName(*rdev);
+
+    if (l_dev_type == CL_DEVICE_TYPE_GPU && r_dev_type == CL_DEVICE_TYPE_GPU) {
+        // If GPU, prefer AMD > NVIDIA > APPLE > INTEL > BEIGNET
+        const char* platforms[] = {"AMD", "NVIDIA", "APPLE", "INTEL",
+                                   "BEIGNET"};
+
+        for (auto ref_name : platforms) {
+            if (verify_present(lPlatName, ref_name) &&
+                !verify_present(rPlatName, ref_name)) {
+                return true;
+            }
+
+            if (!verify_present(lPlatName, ref_name) &&
+                verify_present(rPlatName, ref_name)) {
+                return false;
+            }
+        }
+
+        // Devices that tie on platform fall through to the checks below
+    } else {
+        // If CPU, prefer Intel > AMD > POCL > APPLE
+        const char* platforms[] = {"INTEL", "AMD", "POCL", "APPLE"};
+
+        for (auto ref_name : platforms) {
+            if (verify_present(lPlatName, ref_name) &&
+                !verify_present(rPlatName, ref_name)) {
+                return true;
+            }
+
+            if (!verify_present(lPlatName, ref_name) &&
+                verify_present(rPlatName, ref_name)) {
+                return false;
+            }
+        }
+    }
+
+    // Compare device compute versions
+
+    {
+        // Check Device OpenCL Version
+        auto lversion = ldev->getInfo<CL_DEVICE_VERSION>();
+        auto rversion = rdev->getInfo<CL_DEVICE_VERSION>();
+
+        bool lres =
+            (lversion[7] > rversion[7]) ||
+            ((lversion[7] == rversion[7]) && (lversion[9] > rversion[9]));
+
+        bool rres =
+            (lversion[7] < rversion[7]) ||
+            ((lversion[7] == rversion[7]) && (lversion[9] < rversion[9]));
+
+        if (lres) { return true; }
+        if (rres) { return false; }
+    }
+
+    // Default criterion when all of the checks above tie:
+    // prefer the device with more global memory
+    auto l_mem = ldev->getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
+    auto r_mem = rdev->getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
+    return l_mem > r_mem;
+}
+
+/// Class to compare two devices for sorting in a map
+class deviceLess {
+   public:
+    bool operator()(const cl::Device& lhs, const cl::Device& rhs) const {
+        return lhs() < rhs();
+    }
+};
+
+DeviceManager::DeviceManager()
+    : logger(common::loggerFactory("platform"))
+    , mUserDeviceOffset(0)
+    , fgMngr(nullptr)
+    , mFFTSetup(new clfftSetupData) {
+    vector<Platform> platforms;
+    try {
+        Platform::get(&platforms);
+    } catch (const cl::Error& err) {
+#if !defined(OS_MAC)
+        // CL_PLATFORM_NOT_FOUND_KHR is not defined in Apple's OpenCL
+        // implementation. Thus, it requires this ugly check.
+        if (err.err() == CL_PLATFORM_NOT_FOUND_KHR) {
+#endif
+            AF_ERROR(
+                "No OpenCL platforms found on this system. Ensure you have "
+                "installed the device driver as well as the OpenCL runtime and "
+                "ICD from your device vendor. You can use the clinfo utility "
+                "to debug OpenCL installation issues.",
+                AF_ERR_RUNTIME);
+#if !defined(OS_MAC)
+        }
+#endif
+    }
+    fgMngr = std::make_unique<arrayfire::common::ForgeManager>();
+
+    // This is all we need because the sort takes care of the order of devices
+#ifdef OS_MAC
+    cl_device_type DEVICE_TYPES = CL_DEVICE_TYPE_GPU;
+#else
+    cl_device_type DEVICE_TYPES = CL_DEVICE_TYPE_ALL;
+#endif
+
+    string deviceENV = getEnvVar("AF_OPENCL_DEVICE_TYPE");
+
+    if (deviceENV == "GPU") {
+        DEVICE_TYPES = CL_DEVICE_TYPE_GPU;
+    } else if (deviceENV == "CPU") {
+        DEVICE_TYPES = CL_DEVICE_TYPE_CPU;
+    } else if (deviceENV.compare(0, 3, "ACC") == 0) {
+        DEVICE_TYPES = CL_DEVICE_TYPE_ACCELERATOR;
+    }
+
+    AF_TRACE("Found {} OpenCL platforms", platforms.size());
+
+    std::map<cl::Device, cl::Context, deviceLess> mDeviceContextMap;
+    // Iterate through platforms, get all available devices and store them
+    for (auto& platform : platforms) {
+        vector<Device> current_devices;
+
+        try {
+            platform.getDevices(DEVICE_TYPES, &current_devices);
+        } catch (const cl::Error& err) {
+            if (err.err() != CL_DEVICE_NOT_FOUND) { throw; }
+        }
+        AF_TRACE("Found {} devices on platform {}", current_devices.size(),
+                 platform.getInfo<CL_PLATFORM_NAME>());
+        if (!current_devices.empty()) {
+            cl::Context ctx(current_devices);
+            for (auto& dev : current_devices) {
+                mDeviceContextMap[dev] = ctx;
+                mDevices.emplace_back(make_unique<Device>(dev));
+                AF_TRACE("Found device {} on platform {}",
+                         dev.getInfo<CL_DEVICE_NAME>(),
+                         platform.getInfo<CL_PLATFORM_NAME>());
+            }
+        }
+    }
+
+    int nDevices = mDevices.size();
+    AF_TRACE("Found {} OpenCL devices", nDevices);
+
+    if (nDevices == 0) { AF_ERROR("No OpenCL devices found", AF_ERR_RUNTIME); }
+
+    // Sort OpenCL devices based on default criteria
+    stable_sort(mDevices.begin(), mDevices.end(), compare_default);
+
+    auto devices = move(mDevices);
+    mDevices.clear();
+
+    // Create contexts and queues once the sort is done
+    for (int i = 0; i < nDevices; i++) {
+        // For OpenCL-HPP >= v2023.12.14 type is cl::Platform instead of
+        // cl_platform_id
+        cl::Platform device_platform;
+        device_platform = devices[i]->getInfo<CL_DEVICE_PLATFORM>();
+
+        try {
+            mContexts.emplace_back(
+                make_unique<cl::Context>(mDeviceContextMap[*devices[i]]));
+            mQueues.push_back(make_unique<CommandQueue>(
+                *mContexts.back(), *devices[i], cl::QueueProperties::None));
+            mIsGLSharingOn.push_back(false);
+            mDeviceTypes.push_back(getDeviceTypeEnum(*devices[i]));
+            mPlatforms.push_back(
+                std::make_pair<std::unique_ptr<cl::Platform>, afcl_platform>(
+                    make_unique<cl::Platform>(device_platform(), true),
+                    getPlatformEnum(*devices[i])));
+            mDevices.emplace_back(std::move(devices[i]));
+
+            auto platform_version =
+                mPlatforms.back().first->getInfo<CL_PLATFORM_VERSION>();
+            string options;
+            common::Version version =
+                getOpenCLCDeviceVersion(*mDevices.back()).back();
+#ifdef AF_WITH_FAST_MATH
+            options = fmt::format(
+                " -cl-std=CL{:Mm} -D dim_t={} -cl-fast-relaxed-math", version,
+                dtype_traits<dim_t>::getName());
+#else
+            options = fmt::format(" -cl-std=CL{:Mm} -D dim_t={}", version,
+                                  dtype_traits<dim_t>::getName());
+#endif
+            mBaseBuildFlags.push_back(options);
+        } catch (const cl::Error& err) {
+            AF_TRACE("Error creating context for device {} with error {}\n",
+                     devices[i]->getInfo<CL_DEVICE_NAME>(), err.what());
+        }
+    }
+    nDevices = mDevices.size();
+
+    bool default_device_set = false;
+    deviceENV               = getEnvVar("AF_OPENCL_DEFAULT_DEVICE");
+    if (!deviceENV.empty()) {
+        stringstream s(deviceENV);
+        int def_device = -1;
+        s >> def_device;
+        if (def_device < 0 || def_device >= nDevices) {
+            AF_TRACE(
+                "AF_OPENCL_DEFAULT_DEVICE ({}) "
+                "is out of range, setting default device to 0",
+                def_device);
+        } else {
+            setActiveContext(def_device);
+            default_device_set = true;
+        }
+    }
+
+    deviceENV = getEnvVar("AF_OPENCL_DEFAULT_DEVICE_TYPE");
+    if (!default_device_set && !deviceENV.empty()) {
+        cl_device_type default_device_type = CL_DEVICE_TYPE_GPU;
+        if (deviceENV == "CPU") {
+            default_device_type = CL_DEVICE_TYPE_CPU;
+        } else if (deviceENV.compare(0, 3, "ACC") == 0) {
+            default_device_type = CL_DEVICE_TYPE_ACCELERATOR;
+        }
+
+        // default_device_set is known to be false here; reuse it below.
+        for (int i = 0; i < nDevices; i++) {
+            if (mDevices[i]->getInfo<CL_DEVICE_TYPE>() == default_device_type) {
+                default_device_set = true;
+                setActiveContext(i);
+                break;
+            }
+        }
+        if (!default_device_set) {
+            AF_TRACE(
+                "AF_OPENCL_DEFAULT_DEVICE_TYPE={} "
+                "is not available, using device 0 as the default",
+                deviceENV);
+        }
+    }
+
+    // Define AF_DISABLE_GRAPHICS with any value to disable initialization
+    string noGraphicsENV = getEnvVar("AF_DISABLE_GRAPHICS");
+    if (fgMngr->plugin().isLoaded() && noGraphicsENV.empty()) {
+        // If forge library was successfully loaded and
+        // AF_DISABLE_GRAPHICS is not defined
+        try {
+            /* loop over devices and replace contexts with
+             * OpenGL shared contexts wherever applicable */
+            int devCount      = mDevices.size();
+            fg_window wHandle = fgMngr->getMainWindow();
+            for (int i = 0; i < devCount; ++i) {
+                markDeviceForInterop(i, wHandle);
+            }
+        } catch (...) {}
+    }
+
+    mUserDeviceOffset = mDevices.size();
+    // Initialize FFT setup data structure
+    CLFFT_CHECK(clfftInitSetupData(mFFTSetup.get()));
+    CLFFT_CHECK(clfftSetup(mFFTSetup.get()));
+
+    // Initialize clBlas library
+    initBlas();
+
+    // Cache Boost program_cache
+    namespace compute = boost::compute;
+    for (auto& ctx : mContexts) {
+        compute::context c(ctx->get());
+        BoostProgCache currCache = compute::program_cache::get_global_cache(c);
+        mBoostProgCacheVector.emplace_back(new BoostProgCache(currCache));
+    }
+    AF_TRACE("Default device: {}", getActiveDeviceId());
+}
+
+spdlog::logger* DeviceManager::getLogger() { return logger.get(); }
+
+DeviceManager& DeviceManager::getInstance() {
+    static auto* my_instance = new DeviceManager();
+    return *my_instance;
+}
+
+void DeviceManager::setMemoryManager(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a memory manager and the default memory
+    // manager still hasn't been initialized, so initialize it anyway so we
+    // don't inadvertently reset to it when we first call memoryManager()
+    memoryManager();
+    // Shut down the allocator of the existing memory manager.
+    if (memManager) { memManager->shutdownAllocator(); }
+    memManager = std::move(newMgr);
+    // Set the backend memory manager for this new manager to register native
+    // functions correctly.
+    std::unique_ptr<opencl::Allocator> deviceMemoryManager(
+        new opencl::Allocator());
+    memManager->setAllocator(std::move(deviceMemoryManager));
+    memManager->initialize();
+}
+
+void DeviceManager::resetMemoryManager() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_OPENCL_MEM_DEBUG));
+    setMemoryManager(std::move(mgr));
+}
+
+void DeviceManager::setMemoryManagerPinned(
+    std::unique_ptr<MemoryManagerBase> newMgr) {
+    std::lock_guard<std::mutex> l(mutex);
+    // It's possible we're setting a pinned memory manager and the default
+    // memory manager still hasn't been initialized, so initialize it anyway so
+    // we don't inadvertently reset to it when we first call
+    // pinnedMemoryManager()
+    pinnedMemoryManager();
+    // Shut down the allocator of the existing pinned memory manager.
+    if (pinnedMemManager) { pinnedMemManager->shutdownAllocator(); }
+    pinnedMemManager = std::move(newMgr);
+    // Set the backend pinned memory manager for this new manager to register
+    // native functions correctly.
+    std::unique_ptr<opencl::AllocatorPinned> deviceMemoryManager(
+        new opencl::AllocatorPinned());
+    pinnedMemManager->setAllocator(std::move(deviceMemoryManager));
+    pinnedMemManager->initialize();
+}
+
+void DeviceManager::resetMemoryManagerPinned() {
+    // Replace with default memory manager
+    std::unique_ptr<MemoryManagerBase> mgr(
+        new common::DefaultMemoryManager(getDeviceCount(), common::MAX_BUFFERS,
+                                         AF_MEM_DEBUG || AF_OPENCL_MEM_DEBUG));
+    setMemoryManagerPinned(std::move(mgr));
+}
+
+DeviceManager::~DeviceManager() {
+    for (int i = 0; i < getDeviceCount(); ++i) { gfxManagers[i] = nullptr; }
+#ifndef OS_WIN
+    // TODO: FIXME:
+    // clfftTeardown() causes a "Pure Virtual Function Called" crash on
+    // Windows for Intel devices. This causes tests to fail.
+    clfftTeardown();
+#endif
+
+    deInitBlas();
+
+    // deCache Boost program_cache
+#ifndef OS_WIN
+    for (auto bCache : mBoostProgCacheVector) { delete bCache; }
+#endif
+
+    memManager       = nullptr;
+    pinnedMemManager = nullptr;
+
+    // TODO: FIXME:
+    // OpenCL libraries on Windows crash the application at program
+    // exit, most probably because of a reference counting issue (based
+    // on the investigation done so far). The problem does not seem to
+    // occur on Linux or macOS. So on Windows, release() the unique_ptrs
+    // to deliberately leak the OpenCL objects; other platforms clean
+    // them up normally when the vectors are destroyed.
+#ifdef OS_WIN
+    for (auto& q : mQueues) { q.release(); }
+    for (auto& c : mContexts) { c.release(); }
+    for (auto& d : mDevices) { d.release(); }
+#endif
+}
+
+void DeviceManager::markDeviceForInterop(const int device,
+                                         const void* wHandle) {
+    try {
+        if (device >= static_cast<int>(mQueues.size()) ||
+            device >= static_cast<int>(DeviceManager::MAX_DEVICES)) {
+            AF_TRACE("Invalid device ({}) passed for CL-GL Interop", device);
+            throw cl::Error(CL_INVALID_DEVICE,
+                            "Invalid device passed for CL-GL Interop");
+        } else {
+            mQueues[device]->finish();
+
+            // check if the device has CL_GL sharing extension enabled
+            bool temp =
+                checkExtnAvailability(*mDevices[device], CL_GL_SHARING_EXT);
+            if (!temp) {
+                /* return silently if the given device does not have the OpenGL
+                 * sharing extension enabled, so its regular queue is kept */
+                return;
+            }
+
+            // call forge to get OpenGL sharing context and details
+            Platform plat(mDevices[device]->getInfo<CL_DEVICE_PLATFORM>());
+
+            long long wnd_ctx, wnd_dsp;
+            fgMngr->plugin().fg_get_window_context_handle(
+                &wnd_ctx, const_cast<fg_window>(wHandle));
+            fgMngr->plugin().fg_get_window_display_handle(
+                &wnd_dsp, const_cast<fg_window>(wHandle));
+#ifdef OS_MAC
+            CGLContextObj cgl_current_ctx = CGLGetCurrentContext();
+            CGLShareGroupObj cgl_share_group =
+                CGLGetShareGroup(cgl_current_ctx);
+
+            cl_context_properties cps[] = {
+                CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE,
+                (cl_context_properties)cgl_share_group, 0};
+#else
+            cl_context_properties cps[] = {
+                CL_GL_CONTEXT_KHR,
+                static_cast<cl_context_properties>(wnd_ctx),
+#if defined(_WIN32) || defined(_MSC_VER)
+                CL_WGL_HDC_KHR,
+                (cl_context_properties)wnd_dsp,
+#else
+                CL_GLX_DISPLAY_KHR,
+                static_cast<cl_context_properties>(wnd_dsp),
+#endif
+                CL_CONTEXT_PLATFORM,
+                (cl_context_properties)plat(),
+                0
+            };
+
+            // Check if the current OpenCL device belongs to the OpenGL context
+            {
+                cl_context_properties test_cps[] = {
+                    CL_GL_CONTEXT_KHR,
+                    static_cast<cl_context_properties>(wnd_ctx),
+                    CL_CONTEXT_PLATFORM, (cl_context_properties)plat(), 0};
+
+                // Load the extension
+                // If cl_khr_gl_sharing is available, this function should be
+                // present. Availability was checked earlier, so execution
+                // reaches this point only if the extension was found
+                auto func = reinterpret_cast<clGetGLContextInfoKHR_fn>(
+                    clGetExtensionFunctionAddressForPlatform(
+                        plat(), "clGetGLContextInfoKHR"));
+
+                // If the function doesn't load, bail early
+                if (!func) { return; }
+
+                // Get all devices associated with opengl context
+                vector<cl_device_id> devices(16);
+                size_t ret = 0;
+                cl_int err = func(test_cps, CL_DEVICES_FOR_GL_CONTEXT_KHR,
+                                  devices.size() * sizeof(cl_device_id),
+                                  &devices[0], &ret);
+                if (err != CL_SUCCESS) { return; }
+                size_t num = ret / sizeof(cl_device_id);
+                devices.resize(num);
+
+                // Check if current device is present in the associated devices
+                cl_device_id current_device = (*mDevices[device])();
+                auto res = find(begin(devices), end(devices), current_device);
+
+                if (res == end(devices)) { return; }
+            }
+#endif
+
+            // Change current device to use GL sharing
+            auto ctx = make_unique<Context>(*mDevices[device], cps);
+            auto cq  = make_unique<CommandQueue>(*ctx, *mDevices[device]);
+
+            mQueues[device]        = move(cq);
+            mContexts[device]      = move(ctx);
+            mIsGLSharingOn[device] = true;
+        }
+    } catch (const cl::Error& ex) {
+        /* If replacing the original context with GL shared context
+         * fails, don't throw an error and instead fall back to
+         * original context and use copy via host to support graphics
+         * on that particular OpenCL device. So mark it as no GL sharing */
+    }
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/device_manager.hpp b/src/backend/opencl/device_manager.hpp
new file mode 100644
index 0000000000..4b27a8f885
--- /dev/null
+++ b/src/backend/opencl/device_manager.hpp
@@ -0,0 +1,192 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <af/opencl.h>
+
+#include <memory>
+#include <mutex>
+#include <string>
+#include <vector>
+
+#ifndef AF_OPENCL_MEM_DEBUG
+#define AF_OPENCL_MEM_DEBUG 0
+#endif
+
+// Forward declarations
+struct clfftSetupData_;
+
+namespace cl {
+class CommandQueue;
+class Context;
+class Device;
+}  // namespace cl
+
+namespace boost {
+template<typename T>
+class shared_ptr;
+
+namespace compute {
+class program_cache;
+}
+}  // namespace boost
+
+namespace spdlog {
+class logger;
+}
+
+namespace arrayfire {
+namespace common {
+class ForgeManager;
+class MemoryManagerBase;
+}  // namespace common
+}  // namespace arrayfire
+
+using arrayfire::common::MemoryManagerBase;
+
+namespace arrayfire {
+namespace opencl {
+
+// opencl namespace forward declarations
+class GraphicsResourceManager;
+struct kc_entry_t;  // kernel cache entry
+class PlanCache;    // clfft
+
+class DeviceManager {
+    friend arrayfire::common::MemoryManagerBase& memoryManager();
+
+    friend void setMemoryManager(
+        std::unique_ptr<arrayfire::common::MemoryManagerBase> mgr);
+
+    void setMemoryManager(
+        std::unique_ptr<arrayfire::common::MemoryManagerBase> mgr);
+
+    friend void resetMemoryManager();
+
+    void resetMemoryManager();
+
+    friend arrayfire::common::MemoryManagerBase& pinnedMemoryManager();
+
+    friend void setMemoryManagerPinned(
+        std::unique_ptr<arrayfire::common::MemoryManagerBase> mgr);
+
+    void setMemoryManagerPinned(
+        std::unique_ptr<arrayfire::common::MemoryManagerBase> mgr);
+
+    friend void resetMemoryManagerPinned();
+
+    void resetMemoryManagerPinned();
+
+    friend arrayfire::common::ForgeManager& forgeManager();
+
+    friend GraphicsResourceManager& interopManager();
+
+    friend PlanCache& fftManager();
+
+    friend void addKernelToCache(int device, const std::string& key,
+                                 const kc_entry_t entry);
+
+    friend void removeKernelFromCache(int device, const std::string& key);
+
+    friend kc_entry_t kernelCache(int device, const std::string& key);
+
+    friend std::string getDeviceInfo() noexcept;
+
+    friend int getDeviceCount() noexcept;
+
+    friend int getDeviceIdFromNativeId(cl_device_id id);
+
+    friend const cl::Context& getContext();
+
+    friend cl::CommandQueue& getQueue(int device_id);
+
+    friend cl_command_queue getQueueHandle(int device_id);
+
+    friend const cl::Device& getDevice(int id);
+
+    friend const std::string& getActiveDeviceBaseBuildFlags();
+
+    friend size_t getDeviceMemorySize(int device);
+
+    friend bool isGLSharingSupported();
+
+    friend bool isDoubleSupported(unsigned device);
+
+    friend bool isHalfSupported(unsigned device);
+
+    friend void devprop(char* d_name, char* d_platform, char* d_toolkit,
+                        char* d_compute);
+
+    friend int setDevice(int device);
+
+    friend void addDeviceContext(cl_device_id dev, cl_context ctx,
+                                 cl_command_queue que);
+
+    friend void setDeviceContext(cl_device_id dev, cl_context ctx);
+
+    friend void removeDeviceContext(cl_device_id dev, cl_context ctx);
+
+    friend int getActiveDeviceType();
+
+    friend cl::Platform& getActivePlatform();
+
+    friend afcl::platform getActivePlatformVendor();
+
+    friend bool isDeviceBufferAccessible(int buf_device_id, int execution_id);
+
+   public:
+    static const int MAX_DEVICES = 32;
+
+    static DeviceManager& getInstance();
+
+    ~DeviceManager();
+
+    spdlog::logger* getLogger();
+
+   protected:
+    using clfftSetupData = clfftSetupData_;
+
+    DeviceManager();
+
+    // The following two declarations are required to
+    // avoid accidental copy/assignment of the
+    // instance returned by getInstance to other
+    // variables
+    DeviceManager(DeviceManager const&);
+    void operator=(DeviceManager const&);
+    void markDeviceForInterop(const int device, const void* wHandle);
+
+   private:
+    // Attributes
+    std::shared_ptr<spdlog::logger> logger;
+    std::mutex deviceMutex;
+    std::vector<std::unique_ptr<cl::Device>> mDevices;
+    std::vector<std::unique_ptr<cl::Context>> mContexts;
+    std::vector<std::unique_ptr<cl::CommandQueue>> mQueues;
+    std::vector<bool> mIsGLSharingOn;
+    std::vector<std::string> mBaseBuildFlags;
+    std::vector<int> mDeviceTypes;
+    std::vector<std::pair<std::unique_ptr<cl::Platform>, afcl::platform>>
+        mPlatforms;
+    unsigned mUserDeviceOffset;
+
+    std::unique_ptr<arrayfire::common::ForgeManager> fgMngr;
+    std::unique_ptr<MemoryManagerBase> memManager;
+    std::unique_ptr<MemoryManagerBase> pinnedMemManager;
+    std::unique_ptr<GraphicsResourceManager> gfxManagers[MAX_DEVICES];
+    std::unique_ptr<clfftSetupData> mFFTSetup;
+    std::mutex mutex;
+
+    using BoostProgCache = boost::shared_ptr<boost::compute::program_cache>;
+    std::vector<BoostProgCache*> mBoostProgCacheVector;
+};
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/diagonal.cpp b/src/backend/opencl/diagonal.cpp
index a6d3e2c2dd..2d21b5f461 100644
--- a/src/backend/opencl/diagonal.cpp
+++ b/src/backend/opencl/diagonal.cpp
@@ -7,55 +7,58 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
 #include <Array.hpp>
+#include <common/half.hpp>
 #include <diagonal.hpp>
-#include <math.hpp>
 #include <err_opencl.hpp>
 #include <kernel/diagonal.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num)
-    {
-        int size = in.dims()[0] + std::abs(num);
-        int batch = in.dims()[1];
-        Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
+using arrayfire::common::half;
 
-        kernel::diagCreate<T>(out, in, num);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num) {
+    int size     = in.dims()[0] + std::abs(num);
+    int batch    = in.dims()[1];
+    Array<T> out = createEmptyArray<T>(dim4(size, size, batch));
 
-        return out;
-    }
+    kernel::diagCreate<T>(out, in, num);
 
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num)
-    {
-        const dim_t *idims = in.dims().get();
-        dim_t size = std::max(idims[0], idims[1]) - std::abs(num);
-        Array<T> out = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
+    return out;
+}
 
-        kernel::diagExtract<T>(out, in, num);
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num) {
+    const dim_t *idims = in.dims().get();
+    dim_t size         = std::min(idims[0], idims[1]) - std::abs(num);
+    Array<T> out       = createEmptyArray<T>(dim4(size, 1, idims[2], idims[3]));
 
-        return out;
+    kernel::diagExtract<T>(out, in, num);
 
-    }
+    return out;
+}
 
 #define INSTANTIATE_DIAGONAL(T)                                          \
-    template Array<T>  diagExtract<T>    (const Array<T> &in, const int num); \
-    template Array<T>  diagCreate <T>    (const Array<T> &in, const int num);
-
-    INSTANTIATE_DIAGONAL(float)
-    INSTANTIATE_DIAGONAL(double)
-    INSTANTIATE_DIAGONAL(cfloat)
-    INSTANTIATE_DIAGONAL(cdouble)
-    INSTANTIATE_DIAGONAL(int)
-    INSTANTIATE_DIAGONAL(uint)
-    INSTANTIATE_DIAGONAL(intl)
-    INSTANTIATE_DIAGONAL(uintl)
-    INSTANTIATE_DIAGONAL(char)
-    INSTANTIATE_DIAGONAL(uchar)
+    template Array<T> diagExtract<T>(const Array<T> &in, const int num); \
+    template Array<T> diagCreate<T>(const Array<T> &in, const int num);
 
-}
+INSTANTIATE_DIAGONAL(float)
+INSTANTIATE_DIAGONAL(double)
+INSTANTIATE_DIAGONAL(cfloat)
+INSTANTIATE_DIAGONAL(cdouble)
+INSTANTIATE_DIAGONAL(int)
+INSTANTIATE_DIAGONAL(uint)
+INSTANTIATE_DIAGONAL(intl)
+INSTANTIATE_DIAGONAL(uintl)
+INSTANTIATE_DIAGONAL(char)
+INSTANTIATE_DIAGONAL(schar)
+INSTANTIATE_DIAGONAL(uchar)
+INSTANTIATE_DIAGONAL(short)
+INSTANTIATE_DIAGONAL(ushort)
+INSTANTIATE_DIAGONAL(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/diagonal.hpp b/src/backend/opencl/diagonal.hpp
index cd6e9e0ab0..5ba6daed79 100644
--- a/src/backend/opencl/diagonal.hpp
+++ b/src/backend/opencl/diagonal.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> diagCreate(const Array<T> &in, const int num);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> diagCreate(const Array<T> &in, const int num);
 
-    template<typename T>
-    Array<T> diagExtract(const Array<T> &in, const int num);
-}
+template<typename T>
+Array<T> diagExtract(const Array<T> &in, const int num);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/diff.cpp b/src/backend/opencl/diff.cpp
index 981062a2d4..e152301f0d 100644
--- a/src/backend/opencl/diff.cpp
+++ b/src/backend/opencl/diff.cpp
@@ -7,69 +7,55 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
 #include <diff.hpp>
 #include <kernel/diff.hpp>
+#include <af/dim4.hpp>
 #include <stdexcept>
 
-namespace opencl
-{
-    template<typename T, bool isDiff2>
-    static Array<T> diff(const Array<T> &in, const int dim)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-        oDims[dim] -= (isDiff2 + 1);
-
-        if(iDims.elements() == 0 || oDims.elements() == 0) {
-            throw std::runtime_error("Elements are 0");
-        }
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        switch (dim) {
-
-            case (0):    kernel::diff<T, 0, isDiff2>(out, in, in.ndims());
-                         break;
-
-            case (1):    kernel::diff<T, 1, isDiff2>(out, in, in.ndims());
-                         break;
-
-            case (2):    kernel::diff<T, 2, isDiff2>(out, in, in.ndims());
-                         break;
+namespace arrayfire {
+namespace opencl {
 
-            case (3):    kernel::diff<T, 3, isDiff2>(out, in, in.ndims());
-                         break;
-        }
+template<typename T>
+Array<T> diff(const Array<T> &in, const int dim, const bool isDiff2) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims[dim] -= (isDiff2 + 1);
 
-        return out;
+    if (iDims.elements() == 0 || oDims.elements() == 0) {
+        throw std::runtime_error("Elements are 0");
     }
+    Array<T> out = createEmptyArray<T>(oDims);
+    kernel::diff<T>(out, in, in.ndims(), dim, isDiff2);
+    return out;
+}
 
-    template<typename T>
-    Array<T> diff1(const Array<T> &in, const int dim)
-    {
-        return diff<T, false>(in, dim);
-    }
-
-    template<typename T>
-    Array<T> diff2(const Array<T> &in, const int dim)
-    {
-        return diff<T, true>(in, dim);
-    }
-
-#define INSTANTIATE(T)                                                 \
-    template Array<T> diff1<T>  (const Array<T> &in, const int dim);   \
-    template Array<T> diff2<T>  (const Array<T> &in, const int dim);   \
-
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, false);
+}
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim) {
+    return diff<T>(in, dim, true);
 }
+
+#define INSTANTIATE(T)                                             \
+    template Array<T> diff1<T>(const Array<T> &in, const int dim); \
+    template Array<T> diff2<T>(const Array<T> &in, const int dim);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(char)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/diff.hpp b/src/backend/opencl/diff.hpp
index 5298d8bf62..ff60455fe8 100644
--- a/src/backend/opencl/diff.hpp
+++ b/src/backend/opencl/diff.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> diff1(const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> diff1(const Array<T> &in, const int dim);
 
-    template<typename T>
-    Array<T> diff2(const Array<T> &in, const int dim);
-}
+template<typename T>
+Array<T> diff2(const Array<T> &in, const int dim);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/dilate.cpp b/src/backend/opencl/dilate.cpp
deleted file mode 100644
index fbc5b2881d..0000000000
--- a/src/backend/opencl/dilate.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph_impl.hpp"
-
-namespace opencl
-{
-
-INSTANTIATE(float , true)
-INSTANTIATE(double, true)
-INSTANTIATE(char  , true)
-INSTANTIATE(int   , true)
-INSTANTIATE(uint  , true)
-INSTANTIATE(uchar , true)
-
-}
diff --git a/src/backend/opencl/dilate3d.cpp b/src/backend/opencl/dilate3d.cpp
deleted file mode 100644
index 7c8898f175..0000000000
--- a/src/backend/opencl/dilate3d.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph3d_impl.hpp"
-
-namespace opencl
-{
-
-INSTANTIATE(float , true)
-INSTANTIATE(double, true)
-INSTANTIATE(char  , true)
-INSTANTIATE(int   , true)
-INSTANTIATE(uint  , true)
-INSTANTIATE(uchar , true)
-
-}
diff --git a/src/backend/opencl/erode.cpp b/src/backend/opencl/erode.cpp
deleted file mode 100644
index bcb1579291..0000000000
--- a/src/backend/opencl/erode.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph_impl.hpp"
-
-namespace opencl
-{
-
-INSTANTIATE(float , false)
-INSTANTIATE(double, false)
-INSTANTIATE(char  , false)
-INSTANTIATE(int   , false)
-INSTANTIATE(uint  , false)
-INSTANTIATE(uchar , false)
-
-}
diff --git a/src/backend/opencl/erode3d.cpp b/src/backend/opencl/erode3d.cpp
deleted file mode 100644
index 71ee3fd504..0000000000
--- a/src/backend/opencl/erode3d.cpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include "morph3d_impl.hpp"
-
-namespace opencl
-{
-
-INSTANTIATE(float , false)
-INSTANTIATE(double, false)
-INSTANTIATE(char  , false)
-INSTANTIATE(int   , false)
-INSTANTIATE(uint  , false)
-INSTANTIATE(uchar , false)
-
-}
diff --git a/src/backend/opencl/err_clblas.hpp b/src/backend/opencl/err_clblas.hpp
deleted file mode 100644
index b5a9733622..0000000000
--- a/src/backend/opencl/err_clblas.hpp
+++ /dev/null
@@ -1,68 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <stdio.h>
-#include <err_common.hpp>
-#include <clBLAS.h>
-
-static const char * _clblasGetResultString(clblasStatus st)
-{
-    switch (st)
-    {
-    case clblasSuccess:              return "Success";
-    case clblasInvalidValue:         return "Invalid value";
-    case clblasInvalidCommandQueue:  return "Invalid queue";
-    case clblasInvalidContext:       return "Invalid context";
-    case clblasInvalidMemObject:     return "Invalid memory object";
-    case clblasInvalidDevice:        return "Invalid device";
-    case clblasInvalidEventWaitList: return "Invalid event list";
-    case clblasOutOfResources:       return "Out of resources";
-    case clblasOutOfHostMemory:      return "Out of host memory";
-    case clblasInvalidOperation:     return "Invalid operation";
-    case clblasCompilerNotAvailable: return "Compiler not available";
-    case clblasBuildProgramFailure:  return "Build program failure";
-    case clblasNotImplemented:       return "Not implemented";
-    case clblasNotInitialized:       return "CLBLAS Not initialized";
-    case clblasInvalidMatA:          return "Invalid matrix A";
-    case clblasInvalidMatB:          return "Invalid matrix B";
-    case clblasInvalidMatC:          return "Invalid matrix C";
-    case clblasInvalidVecX:          return "Invalid vector X";
-    case clblasInvalidVecY:          return "Invalid vector Y";
-    case clblasInvalidDim:           return "Invalid dimension";
-    case clblasInvalidLeadDimA:      return "Invalid lda";
-    case clblasInvalidLeadDimB:      return "Invalid ldb";
-    case clblasInvalidLeadDimC:      return "Invalid ldc";
-    case clblasInvalidIncX:          return "Invalid incx";
-    case clblasInvalidIncY:          return "Invalid incy";
-    case clblasInsufficientMemMatA:  return  "Insufficient Memory for Matrix A";
-    case clblasInsufficientMemMatB:  return  "Insufficient Memory for Matrix B";
-    case clblasInsufficientMemMatC:  return  "Insufficient Memory for Matrix C";
-    case clblasInsufficientMemVecX:  return  "Insufficient Memory for Vector X";
-    case clblasInsufficientMemVecY:  return  "Insufficient Memory for Vector Y";
-    }
-
-    return "Unknown error";
-}
-
-#define CLBLAS_CHECK(fn) do {                   \
-        clblasStatus _clblas_st = fn;           \
-        if (_clblas_st != clblasSuccess) {     \
-            char clblas_st_msg[1024];           \
-            snprintf(clblas_st_msg,             \
-                     sizeof(clblas_st_msg),     \
-                     "clblas Error (%d): %s\n", \
-                     (int)(_clblas_st),         \
-                     _clblasGetResultString(    \
-                         _clblas_st));          \
-                                                \
-            AF_ERROR(clblas_st_msg,             \
-                     AF_ERR_INTERNAL);          \
-        }                                       \
-    } while(0)
diff --git a/src/backend/opencl/err_clblast.hpp b/src/backend/opencl/err_clblast.hpp
new file mode 100644
index 0000000000..c4201d175f
--- /dev/null
+++ b/src/backend/opencl/err_clblast.hpp
@@ -0,0 +1,153 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <clblast.h>
+#include <common/err_common.hpp>
+#include <stdio.h>
+#include <mutex>
+
+static const char* _clblastGetResultString(clblast::StatusCode st) {
+    switch (st) {
+        // Status codes in common with the OpenCL standard
+        case clblast::StatusCode::kSuccess: return "CL_SUCCESS";
+        case clblast::StatusCode::kOpenCLCompilerNotAvailable:
+            return "CL_COMPILER_NOT_AVAILABLE";
+        case clblast::StatusCode::kTempBufferAllocFailure:
+            return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
+        case clblast::StatusCode::kOpenCLOutOfResources:
+            return "CL_OUT_OF_RESOURCES";
+        case clblast::StatusCode::kOpenCLOutOfHostMemory:
+            return "CL_OUT_OF_HOST_MEMORY";
+        case clblast::StatusCode::kOpenCLBuildProgramFailure:
+            return "CL_BUILD_PROGRAM_FAILURE: OpenCL compilation error";
+        case clblast::StatusCode::kInvalidValue: return "CL_INVALID_VALUE";
+        case clblast::StatusCode::kInvalidCommandQueue:
+            return "CL_INVALID_COMMAND_QUEUE";
+        case clblast::StatusCode::kInvalidMemObject:
+            return "CL_INVALID_MEM_OBJECT";
+        case clblast::StatusCode::kInvalidBinary: return "CL_INVALID_BINARY";
+        case clblast::StatusCode::kInvalidBuildOptions:
+            return "CL_INVALID_BUILD_OPTIONS";
+        case clblast::StatusCode::kInvalidProgram: return "CL_INVALID_PROGRAM";
+        case clblast::StatusCode::kInvalidProgramExecutable:
+            return "CL_INVALID_PROGRAM_EXECUTABLE";
+        case clblast::StatusCode::kInvalidKernelName:
+            return "CL_INVALID_KERNEL_NAME";
+        case clblast::StatusCode::kInvalidKernelDefinition:
+            return "CL_INVALID_KERNEL_DEFINITION";
+        case clblast::StatusCode::kInvalidKernel: return "CL_INVALID_KERNEL";
+        case clblast::StatusCode::kInvalidArgIndex:
+            return "CL_INVALID_ARG_INDEX";
+        case clblast::StatusCode::kInvalidArgValue:
+            return "CL_INVALID_ARG_VALUE";
+        case clblast::StatusCode::kInvalidArgSize: return "CL_INVALID_ARG_SIZE";
+        case clblast::StatusCode::kInvalidKernelArgs:
+            return "CL_INVALID_KERNEL_ARGS";
+        case clblast::StatusCode::kInvalidLocalNumDimensions:
+            return "CL_INVALID_WORK_DIMENSION: Too many thread dimensions";
+        case clblast::StatusCode::kInvalidLocalThreadsTotal:
+            return "CL_INVALID_WORK_GROUP_SIZE: Too many threads in total";
+        case clblast::StatusCode::kInvalidLocalThreadsDim:
+            return "CL_INVALID_WORK_ITEM_SIZE: Too many threads in a dimension";
+        case clblast::StatusCode::kInvalidGlobalOffset:
+            return "CL_INVALID_GLOBAL_OFFSET";
+        case clblast::StatusCode::kInvalidEventWaitList:
+            return "CL_INVALID_EVENT_WAIT_LIST";
+        case clblast::StatusCode::kInvalidEvent: return "CL_INVALID_EVENT";
+        case clblast::StatusCode::kInvalidOperation:
+            return "CL_INVALID_OPERATION";
+        case clblast::StatusCode::kInvalidBufferSize:
+            return "CL_INVALID_BUFFER_SIZE";
+        case clblast::StatusCode::kInvalidGlobalWorkSize:
+            return "CL_INVALID_GLOBAL_WORK_SIZE";
+
+        // Status codes in common with the clBLAS library
+        case clblast::StatusCode::kNotImplemented:
+            return "Routine or functionality not implemented yet";
+        case clblast::StatusCode::kInvalidMatrixA:
+            return "Matrix A is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInvalidMatrixB:
+            return "Matrix B is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInvalidMatrixC:
+            return "Matrix C is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInvalidVectorX:
+            return "Vector X is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInvalidVectorY:
+            return "Vector Y is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInvalidDimension:
+            return "Dimensions M, N, and K have to be larger than zero";
+        case clblast::StatusCode::kInvalidLeadDimA:
+            return "LD of A is smaller than the matrix's first dimension";
+        case clblast::StatusCode::kInvalidLeadDimB:
+            return "LD of B is smaller than the matrix's first dimension";
+        case clblast::StatusCode::kInvalidLeadDimC:
+            return "LD of C is smaller than the matrix's first dimension";
+        case clblast::StatusCode::kInvalidIncrementX:
+            return "Increment of vector X cannot be zero";
+        case clblast::StatusCode::kInvalidIncrementY:
+            return "Increment of vector Y cannot be zero";
+        case clblast::StatusCode::kInsufficientMemoryA:
+            return "Matrix A's OpenCL buffer is too small";
+        case clblast::StatusCode::kInsufficientMemoryB:
+            return "Matrix B's OpenCL buffer is too small";
+        case clblast::StatusCode::kInsufficientMemoryC:
+            return "Matrix C's OpenCL buffer is too small";
+        case clblast::StatusCode::kInsufficientMemoryX:
+            return "Vector X's OpenCL buffer is too small";
+        case clblast::StatusCode::kInsufficientMemoryY:
+            return "Vector Y's OpenCL buffer is too small";
+
+        // Custom additional status codes for CLBlast
+        case clblast::StatusCode::kInsufficientMemoryTemp:
+            return "Temporary buffer provided to GEMM routine is too small";
+        case clblast::StatusCode::kInvalidBatchCount:
+            return "The batch count needs to be positive";
+        case clblast::StatusCode::kInvalidOverrideKernel:
+            return "Trying to override parameters for an invalid kernel";
+        case clblast::StatusCode::kMissingOverrideParameter:
+            return "Missing override parameter(s) for the target kernel";
+        case clblast::StatusCode::kInvalidLocalMemUsage:
+            return "Not enough local memory available on this device";
+        case clblast::StatusCode::kNoHalfPrecision:
+            return "Half precision (16-bit) not supported by the device";
+        case clblast::StatusCode::kNoDoublePrecision:
+            return "Double precision (64-bit) not supported by the device";
+        case clblast::StatusCode::kInvalidVectorScalar:
+            return "The unit-sized vector is not a valid OpenCL buffer";
+        case clblast::StatusCode::kInsufficientMemoryScalar:
+            return "The unit-sized vector's OpenCL buffer is too small";
+        case clblast::StatusCode::kDatabaseError:
+            return "Entry for the device was not found in the database";
+        case clblast::StatusCode::kUnknownError:
+            return "A catch-all error code representing an unspecified error";
+        case clblast::StatusCode::kUnexpectedError:
+            return "A catch-all error code representing an unexpected "
+                   "exception";
+    }
+
+    return "Unknown error";
+}
+
+static std::recursive_mutex gCLBlastMutex;
+
+#define CLBLAST_CHECK(fn)                                            \
+    do {                                                             \
+        gCLBlastMutex.lock();                                        \
+        clblast::StatusCode _clblast_st = (fn);                      \
+        gCLBlastMutex.unlock();                                      \
+        if (_clblast_st != clblast::StatusCode::kSuccess) {          \
+            char clblast_st_msg[1024];                               \
+            snprintf(clblast_st_msg, sizeof(clblast_st_msg),         \
+                     "CLBlast Error (%d): %s\n", (int)(_clblast_st), \
+                     _clblastGetResultString(_clblast_st));          \
+                                                                     \
+            AF_ERROR(clblast_st_msg, AF_ERR_INTERNAL);               \
+        }                                                            \
+    } while (0)
diff --git a/src/backend/opencl/err_clfft.hpp b/src/backend/opencl/err_clfft.hpp
deleted file mode 100644
index 8d74bb9f02..0000000000
--- a/src/backend/opencl/err_clfft.hpp
+++ /dev/null
@@ -1,100 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <stdio.h>
-#include <err_common.hpp>
-#include <clFFT.h>
-
-static const char * _clfftGetResultString(clfftStatus st)
-{
-    switch (st)
-    {
-        case CLFFT_SUCCESS: return "Success";
-        case CLFFT_DEVICE_NOT_FOUND: return "Device Not Found";
-        case CLFFT_DEVICE_NOT_AVAILABLE: return "Device Not Available";
-        case CLFFT_COMPILER_NOT_AVAILABLE: return "Compiler Not Available";
-        case CLFFT_MEM_OBJECT_ALLOCATION_FAILURE: return "Memory Object Allocation Failure";
-        case CLFFT_OUT_OF_RESOURCES: return "Out of Resources";
-        case CLFFT_OUT_OF_HOST_MEMORY: return "Out of Host Memory";
-        case CLFFT_PROFILING_INFO_NOT_AVAILABLE: return "Profiling Information Not Available";
-        case CLFFT_MEM_COPY_OVERLAP: return "Memory Copy Overlap";
-        case CLFFT_IMAGE_FORMAT_MISMATCH: return "Image Format Mismatch";
-        case CLFFT_IMAGE_FORMAT_NOT_SUPPORTED: return "Image Format Not Supported";
-        case CLFFT_BUILD_PROGRAM_FAILURE: return "Build Program Failure";
-        case CLFFT_MAP_FAILURE: return "Map Failure";
-        case CLFFT_INVALID_VALUE: return "Invalid Value";
-        case CLFFT_INVALID_DEVICE_TYPE: return "Invalid Device Type";
-        case CLFFT_INVALID_PLATFORM: return "Invalid Platform";
-        case CLFFT_INVALID_DEVICE: return "Invalid Device";
-        case CLFFT_INVALID_CONTEXT: return "Invalid Context";
-        case CLFFT_INVALID_QUEUE_PROPERTIES: return "Invalid Queue Properties";
-        case CLFFT_INVALID_COMMAND_QUEUE: return "Invalid Command Queue";
-        case CLFFT_INVALID_HOST_PTR: return "Invalid Host Pointer";
-        case CLFFT_INVALID_MEM_OBJECT: return "Invalid Memory Object";
-        case CLFFT_INVALID_IMAGE_FORMAT_DESCRIPTOR: return "Invalid Image Format Descriptor";
-        case CLFFT_INVALID_IMAGE_SIZE: return "Invalid Image Size";
-        case CLFFT_INVALID_SAMPLER: return "Invalid Sampler";
-        case CLFFT_INVALID_BINARY: return "Invalid Binary";
-        case CLFFT_INVALID_BUILD_OPTIONS: return "Invalid Build Options";
-        case CLFFT_INVALID_PROGRAM: return "Invalid Program";
-        case CLFFT_INVALID_PROGRAM_EXECUTABLE: return "Invalid Program Executable";
-        case CLFFT_INVALID_KERNEL_NAME: return "Invalid Kernel Name";
-        case CLFFT_INVALID_KERNEL_DEFINITION: return "Invalid Kernel Definition";
-        case CLFFT_INVALID_KERNEL: return "Invalid Kernel";
-        case CLFFT_INVALID_ARG_INDEX: return "Invalid Argument Index";
-        case CLFFT_INVALID_ARG_VALUE: return "Invalid Argument Value";
-        case CLFFT_INVALID_ARG_SIZE: return "Invalid Argument Size";
-        case CLFFT_INVALID_KERNEL_ARGS: return "Invalid Kernel Arguments";
-        case CLFFT_INVALID_WORK_DIMENSION: return "Invalid Work Dimension";
-        case CLFFT_INVALID_WORK_GROUP_SIZE: return "Invalid Work Group Size";
-        case CLFFT_INVALID_WORK_ITEM_SIZE: return "Invalid Work Item Size";
-        case CLFFT_INVALID_GLOBAL_OFFSET: return "Invalid Global Offset";
-        case CLFFT_INVALID_EVENT_WAIT_LIST: return "Invalid Event Wait List";
-        case CLFFT_INVALID_EVENT: return "Invalid Event";
-        case CLFFT_INVALID_OPERATION: return "Invalid Operation";
-        case CLFFT_INVALID_GL_OBJECT: return "Invalid GL Object";
-        case CLFFT_INVALID_BUFFER_SIZE: return "Invalid Buffer Size";
-        case CLFFT_INVALID_MIP_LEVEL: return "Invalid MIP Level";
-        case CLFFT_INVALID_GLOBAL_WORK_SIZE: return "Invalid Global Work Size";
-        case CLFFT_BUGCHECK: return "Bugcheck";
-        case CLFFT_NOTIMPLEMENTED: return "Not implemented";
-        case CLFFT_TRANSPOSED_NOTIMPLEMENTED: return "Transpose not implemented for this transformation";
-        case CLFFT_FILE_NOT_FOUND: return "File not found";
-        case CLFFT_FILE_CREATE_FAILURE: return "File creation failed";
-        case CLFFT_VERSION_MISMATCH: return "Version mismatch";
-        case CLFFT_INVALID_PLAN: return "Invalid plan";
-        case CLFFT_DEVICE_NO_DOUBLE: return "Device does not support double precision";
-        case CLFFT_DEVICE_MISMATCH: return "Plan device mismatch";
-        case CLFFT_ENDSTATUS: return "End status";
-    }
-
-    return "Unknown error";
-}
-
-#define CLFFT_CHECK(fn) do {                    \
-        clfftStatus _clfft_st = fn;             \
-        if (_clfft_st != CLFFT_SUCCESS) {       \
-            garbageCollect();                   \
-            _clfft_st = (fn);                   \
-        }                                       \
-        if (_clfft_st != CLFFT_SUCCESS) {       \
-            char clfft_st_msg[1024];            \
-            snprintf(clfft_st_msg,              \
-                     sizeof(clfft_st_msg),      \
-                     "clFFT Error (%d): %s\n",  \
-                     (int)(_clfft_st),          \
-                     _clfftGetResultString(     \
-                         _clfft_st));           \
-                                                \
-            AF_ERROR(clfft_st_msg,              \
-                     AF_ERR_INTERNAL);          \
-        }                                       \
-    } while(0)
-
diff --git a/src/backend/opencl/err_opencl.hpp b/src/backend/opencl/err_opencl.hpp
index 7369cfd339..9a24bc2789 100644
--- a/src/backend/opencl/err_opencl.hpp
+++ b/src/backend/opencl/err_opencl.hpp
@@ -8,36 +8,25 @@
  ********************************************************/
 
 #pragma once
-#include <stdio.h>
-#include <errorcodes.hpp>
-#include <err_common.hpp>
-#include <platform.hpp>
-#include <types.hpp>
-
-#define OPENCL_NOT_SUPPORTED() do {                         \
-        throw SupportError(__FILE__, __LINE__, "OPENCL");   \
-    } while(0)
-
-#define CL_TO_AF_ERROR(ERR) do {                        \
-        char opencl_err_msg[1024];                      \
-        snprintf(opencl_err_msg,                        \
-                 sizeof(opencl_err_msg),                \
-                 "OpenCL Error: %s when calling %s",    \
-                 getErrorMessage(ERR.err()).c_str(),    \
-                 ERR.what());                           \
-        AF_ERROR(opencl_err_msg,                        \
-                 AF_ERR_INTERNAL);                      \
-    } while(0)
-
-namespace opencl
-{
-    template <typename T>
-    void verifyDoubleSupport()
-    {
-        if ((std::is_same<T, double>::value ||
-             std::is_same<T, cdouble>::value) &&
-            !isDoubleSupported(getActiveDeviceId())) {
-            AF_ERROR("Double precision not supported", AF_ERR_NO_DBL);
-        }
-    }
+
+#include <common/err_common.hpp>
+
+#include <string>
+
+namespace cl {
+class Program;
 }
+
+namespace arrayfire {
+namespace opencl {
+
+std::string getProgramBuildLog(const cl::Program &prog);
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+#define OPENCL_NOT_SUPPORTED(message)                                       \
+    do {                                                                    \
+        throw SupportError(__AF_FUNC__, __AF_FILENAME__, __LINE__, "OpenCL",\
+                           message, boost::stacktrace::stacktrace());       \
+    } while (0)
diff --git a/src/backend/opencl/errorcodes.cpp b/src/backend/opencl/errorcodes.cpp
index eb12c0e4c2..31ae6f6de0 100644
--- a/src/backend/opencl/errorcodes.cpp
+++ b/src/backend/opencl/errorcodes.cpp
@@ -11,7 +11,6 @@
 
 #include <boost/compute/exception/opencl_error.hpp>
 
-std::string getErrorMessage(int error_code)
-{
+std::string getErrorMessage(int error_code) {
     return boost::compute::opencl_error::to_string(error_code);
 }
diff --git a/src/backend/opencl/exampleFunction.cpp b/src/backend/opencl/exampleFunction.cpp
index 082c223f30..87306e329c 100644
--- a/src/backend/opencl/exampleFunction.cpp
+++ b/src/backend/opencl/exampleFunction.cpp
@@ -7,48 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>                    // header with opencl backend specific
-                                        // Array class implementation that inherits
-                                        // ArrayInfo base class
+#include <Array.hpp>  // header with opencl backend specific
+                      // Array class implementation that inherits
+                      // ArrayInfo base class
 
-#include <exampleFunction.hpp>          // opencl backend function header
+#include <exampleFunction.hpp>  // opencl backend function header
 
-#include <err_opencl.hpp>               // error check functions and Macros
-                                        // specific to opencl backend
+#include <err_opencl.hpp>  // error check functions and Macros
+                           // specific to opencl backend
 
-#include <kernel/exampleFunction.hpp>   // this header under the folder src/opencl/kernel
-                                        // defines the OpenCL kernel wrapper
-                                        // function to which the main computation of your
-                                        // algorithm should be relayed to
+#include <kernel/exampleFunction.hpp>  // this header under the folder src/opencl/kernel
+                                       // defines the OpenCL kernel wrapper
+                                       // function to which the main computation of your
+                                       // algorithm should be relayed
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method)
-{
-    dim4 outputDims;                    // this should be '= in.dims();' in most cases
-                                        // but would definitely depend on the type of
-                                        // algorithm you are implementing.
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method) {
+    dim4 outputDims;  // this should be '= a.dims();' in most cases
+                      // but would definitely depend on the type of
+                      // algorithm you are implementing.
 
     Array<T> out = createEmptyArray<T>(outputDims);
-                                        // Please use the create***Array<T> helper
-                                        // functions defined in Array.hpp to create
-                                        // different types of Arrays. Please check the
-                                        // file to know what are the different types you
-                                        // can create.
+    // Please use the create***Array<T> helper
+    // functions defined in Array.hpp to create
+    // different types of Arrays. Please check
+    // that file to see which types of Arrays
+    // you can create.
 
     // Relay the actual computation to OpenCL kernel wrapper
-    kernel::exampleFunc<T>(out, in, method);
+    kernel::exampleFunc<T>(out, a, b, method);
 
-    return out;                         // return the result
+    return out;  // return the result
 }
 
-
-#define INSTANTIATE(T)  \
-    template Array<T> exampleFunction<T>(const Array<T> &in, const af_someenum_t method);
+#define INSTANTIATE(T)                                                         \
+    template Array<T> exampleFunction<T>(const Array<T> &a, const Array<T> &b, \
+                                         const af_someenum_t method);
 
 // INSTANTIATIONS for all the types which
 // are present in the switch case statement
@@ -57,9 +57,11 @@ INSTANTIATE(float)
 INSTANTIATE(double)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
 INSTANTIATE(char)
 INSTANTIATE(cfloat)
 INSTANTIATE(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/exampleFunction.hpp b/src/backend/opencl/exampleFunction.hpp
index 442cc09574..35f844dc4e 100644
--- a/src/backend/opencl/exampleFunction.hpp
+++ b/src/backend/opencl/exampleFunction.hpp
@@ -9,9 +9,10 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> exampleFunction(const Array<T> &in, const af_someenum_t method);
-}
-
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> exampleFunction(const Array<T> &a, const Array<T> &b,
+                         const af_someenum_t method);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/fast.cpp b/src/backend/opencl/fast.cpp
index 5af04a8425..4198cf82ba 100644
--- a/src/backend/opencl/fast.cpp
+++ b/src/backend/opencl/fast.cpp
@@ -7,55 +7,56 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/features.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
 #include <err_opencl.hpp>
-#include <handle.hpp>
 #include <kernel/fast.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
 
 using af::dim4;
 using af::features;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
-              const bool non_max, const float feature_ratio, const unsigned edge)
-{
+              const bool non_max, const float feature_ratio,
+              const unsigned edge) {
     unsigned nfeat;
 
     Param x;
     Param y;
     Param score;
 
-    kernel::fast_dispatch<T>(arc_length, non_max,
-                             &nfeat, x, y, score, in,
-                             thr, feature_ratio, edge);
+    kernel::fast<T>(arc_length, &nfeat, x, y, score, in, thr, feature_ratio,
+                    edge, non_max);
 
     if (nfeat > 0) {
-        x_out = createParamArray<float>(x);
-        y_out = createParamArray<float>(y);
-        score_out = createParamArray<float>(score);
+        x_out     = createParamArray<float>(x, true);
+        y_out     = createParamArray<float>(y, true);
+        score_out = createParamArray<float>(score, true);
     }
 
     return nfeat;
 }
 
-#define INSTANTIATE(T)                                                                              \
-    template unsigned fast<T>(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,    \
-                              const Array<T> &in, const float thr, const unsigned arc_length,       \
-                              const bool nonmax, const float feature_ratio, const unsigned edge);
+#define INSTANTIATE(T)                                                        \
+    template unsigned fast<T>(                                                \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const float thr, const unsigned arc_length,       \
+        const bool nonmax, const float feature_ratio, const unsigned edge);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/fast.hpp b/src/backend/opencl/fast.hpp
index 5d11e8fdcb..4a1d7cc3cd 100644
--- a/src/backend/opencl/fast.hpp
+++ b/src/backend/opencl/fast.hpp
@@ -7,17 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <Array.hpp>
+#include <af/features.h>
 
 using af::features;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 unsigned fast(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
               const Array<T> &in, const float thr, const unsigned arc_length,
-              const bool non_max, const float feature_ratio, const unsigned edge);
+              const bool non_max, const float feature_ratio,
+              const unsigned edge);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/fft.cpp b/src/backend/opencl/fft.cpp
index 80a2a75349..36ebd70a63 100644
--- a/src/backend/opencl/fft.cpp
+++ b/src/backend/opencl/fft.cpp
@@ -7,294 +7,165 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <copy.hpp>
 #include <fft.hpp>
+
+#include <clfft.hpp>
+#include <copy.hpp>
 #include <err_opencl.hpp>
-#include <err_clfft.hpp>
-#include <clFFT.h>
 #include <math.hpp>
-#include <string>
-#include <cstdio>
 #include <memory.hpp>
-#include <iostream>
-#include <handle.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
-using std::string;
-
-namespace opencl
-{
-
-// clFFTPlanner will do very basic plan caching.
-// it looks for required candidate in mHandles array and returns if found one.
-// otherwise, it will create a plan and set it at the mAvailSlotIndex and increment
-// the slot index variable in ciruclar fashion 0 to MAX_PLAN_CACHE, then back to zero and repeat.
-class clFFTPlanner
-{
-    friend void find_clfft_plan(clfftPlanHandle &plan,
-                                clfftDim rank, size_t *clLengths,
-                                size_t *istrides, size_t idist,
-                                size_t *ostrides, size_t odist,
-                                clfftPrecision precision, size_t batch);
-
-    public:
-        static clFFTPlanner& getInstance() {
-            static clFFTPlanner single_instance;
-            return single_instance;
-        }
-
-        ~clFFTPlanner() {
-            CLFFT_CHECK(clfftTeardown());
-        }
 
-    private:
-        clFFTPlanner() : mAvailSlotIndex(0) {
-            CLFFT_CHECK(clfftInitSetupData(&fftSetup));
-            CLFFT_CHECK(clfftSetup(&fftSetup));
-            for(int p=0; p<MAX_PLAN_CACHE; ++p)
-                mHandles[p] = 0;
-        }
-        clFFTPlanner(clFFTPlanner const&);
-        void operator=(clFFTPlanner const&);
+namespace arrayfire {
+namespace opencl {
 
-        static const int MAX_PLAN_CACHE = 5;
-
-        int          mAvailSlotIndex;
-        string mKeys[MAX_PLAN_CACHE];
-
-        clfftPlanHandle mHandles[MAX_PLAN_CACHE];
+void setFFTPlanCacheSize(size_t numPlans) {
+    fftManager().setMaxCacheSize(numPlans);
+}
 
-        clfftSetupData  fftSetup;
+template<typename T>
+struct Precision;
+template<>
+struct Precision<cfloat> {
+    enum { type = CLFFT_SINGLE };
+};
+template<>
+struct Precision<cdouble> {
+    enum { type = CLFFT_DOUBLE };
 };
 
-void find_clfft_plan(clfftPlanHandle &plan,
-                     clfftDim rank, size_t *clLengths,
-                     size_t *istrides, size_t idist,
-                     size_t *ostrides, size_t odist,
-                     clfftPrecision precision, size_t batch)
-{
-    clFFTPlanner &planner = clFFTPlanner::getInstance();
-
-    // create the key string
-    char key_str_temp[64];
-    sprintf(key_str_temp, "%d:", rank);
-
-    string key_string(key_str_temp);
-
-    /* WARNING: DO NOT CHANGE sprintf format specifier */
-    for(int r=0; r<rank; ++r) {
-        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", clLengths[r]);
-        key_string.append(std::string(key_str_temp));
-    }
-
-    if(istrides!=NULL) {
-        for(int r=0; r<rank; ++r) {
-            sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", istrides[r]);
-            key_string.append(std::string(key_str_temp));
-        }
-        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", idist);
-        key_string.append(std::string(key_str_temp));
-    }
-
-    if (ostrides!=NULL) {
-        for(int r=0; r<rank; ++r) {
-            sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", ostrides[r]);
-            key_string.append(std::string(key_str_temp));
-        }
-        sprintf(key_str_temp, SIZE_T_FRMT_SPECIFIER ":", odist);
-        key_string.append(std::string(key_str_temp));
+void computeDims(size_t rdims[AF_MAX_DIMS], const dim4 &idims) {
+    for (int i = 0; i < AF_MAX_DIMS; i++) {
+        rdims[i] = static_cast<size_t>(idims[i]);
     }
+}
 
-    sprintf(key_str_temp, "%d:" SIZE_T_FRMT_SPECIFIER, (int)precision, batch);
-    key_string.append(std::string(key_str_temp));
-
-    // find the matching plan_index in the array clFFTPlanner::mKeys
-    int plan_index = -1;
-    for (int i=0; i<clFFTPlanner::MAX_PLAN_CACHE; ++i) {
-        if (key_string==planner.mKeys[i]) {
-            plan_index = i;
-            break;
+// true if length has no prime factors other than 2, 3, 5, 7, 11, 13 (clFFT)
+inline bool isSupLen(dim_t length) {
+    while (length > 1) {
+        if (length % 2 == 0) {
+            length /= 2;
+        } else if (length % 3 == 0) {
+            length /= 3;
+        } else if (length % 5 == 0) {
+            length /= 5;
+        } else if (length % 7 == 0) {
+            length /= 7;
+        } else if (length % 11 == 0) {
+            length /= 11;
+        } else if (length % 13 == 0) {
+            length /= 13;
+        } else {
+            return false;
         }
     }
+    return true;
+}
 
-    // return mHandles[plan_index] if plan_index valid
-    if (plan_index!=-1) {
-        plan = planner.mHandles[plan_index];
-        return;
-    }
-
-    // otherwise create a new plan and set it at mAvailSlotIndex
-    // and finally set it to output plan variable
-    int slot_index = planner.mAvailSlotIndex;
+void verifySupported(const int rank, const dim4 &dims) {
+    for (int i = 0; i < rank; i++) { ARG_ASSERT(1, isSupLen(dims[i])); }
+}
 
-    if (planner.mHandles[slot_index]) {
-        CLFFT_CHECK(clfftDestroyPlan(&planner.mHandles[slot_index]));
-        planner.mHandles[slot_index] = 0;
-    }
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction) {
+    verifySupported(rank, in.dims());
+    size_t tdims[AF_MAX_DIMS], istrides[AF_MAX_DIMS];
 
-    clfftPlanHandle temp;
+    computeDims(tdims, in.dims());
+    computeDims(istrides, in.strides());
 
-    // getContext() returns object of type Context
-    // Context() returns the actual cl_context handle
-    CLFFT_CHECK(clfftCreateDefaultPlan(&temp, getContext()(), rank, clLengths));
+    int batch = 1;
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= tdims[i]; }
 
-    CLFFT_CHECK(clfftSetLayout(temp, CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED));
-    CLFFT_CHECK(clfftSetPlanBatchSize(temp, batch));
-    CLFFT_CHECK(clfftSetPlanDistance(temp, idist, odist));
-    CLFFT_CHECK(clfftSetPlanInStride(temp, rank, istrides));
-    CLFFT_CHECK(clfftSetPlanOutStride(temp, rank, ostrides));
-    CLFFT_CHECK(clfftSetPlanPrecision(temp, precision));
-    CLFFT_CHECK(clfftSetResultLocation(temp, CLFFT_INPLACE));
+    SharedPlan plan = findPlan(
+        CLFFT_COMPLEX_INTERLEAVED, CLFFT_COMPLEX_INTERLEAVED,
+        static_cast<clfftDim>(rank), tdims, istrides, istrides[rank], istrides,
+        istrides[rank], static_cast<clfftPrecision>(Precision<T>::type), batch);
 
-    // getQueue() returns object of type CommandQueue
-    // CommandQueue() returns the actual cl_command_queue handle
-    CLFFT_CHECK(clfftBakePlan(temp, 1, &(getQueue()()), NULL, NULL));
+    cl_mem imem            = (*in.get())();
+    cl_command_queue queue = getQueue()();
 
-    plan = temp;
-    planner.mHandles[slot_index] = temp;
-    planner.mKeys[slot_index] = key_string;
-    planner.mAvailSlotIndex = (slot_index + 1)%clFFTPlanner::MAX_PLAN_CACHE;
+    CLFFT_CHECK(clfftEnqueueTransform(
+        *plan.get(), direction ? CLFFT_FORWARD : CLFFT_BACKWARD, 1, &queue, 0,
+        NULL, NULL, &imem, &imem, NULL));
 }
 
-template<typename T> struct Precision;
-template<> struct Precision<cfloat > { enum {type = CLFFT_SINGLE}; };
-template<> struct Precision<cdouble> { enum {type = CLFFT_DOUBLE}; };
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank) {
+    dim4 odims = in.dims();
 
-void computeDims(size_t rdims[4], const dim4 &idims)
-{
-    for (int i = 0; i < 4; i++) {
-        rdims[i] = (size_t)idims[i];
-    }
-}
+    odims[0] = odims[0] / 2 + 1;
 
-template<typename T, int rank, bool direction>
-void fft_common(Array<T> &out, const Array<T> &in)
-{
-    size_t idims[4], istrides[4], iembed[4];
-    size_t odims[4], ostrides[4], oembed[4];
+    Array<Tc> out = createEmptyArray<Tc>(odims);
 
-    computeDims(idims   , in.dims());
-    computeDims(iembed  , in.getDataDims());
-    computeDims(istrides, in.strides());
+    verifySupported(rank, in.dims());
+    size_t tdims[AF_MAX_DIMS], istrides[AF_MAX_DIMS], ostrides[AF_MAX_DIMS];
 
-    computeDims(odims   , out.dims());
-    computeDims(oembed  , out.getDataDims());
+    computeDims(tdims, in.dims());
+    computeDims(istrides, in.strides());
     computeDims(ostrides, out.strides());
 
-    clfftPlanHandle plan;
-
     int batch = 1;
-    for (int i = rank; i < 4; i++) {
-        batch *= idims[i];
-    }
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= tdims[i]; }
 
-    find_clfft_plan(plan, (clfftDim)rank, idims,
-                    istrides, istrides[rank],
-                    ostrides, ostrides[rank],
-                    (clfftPrecision)Precision<T>::type,
-                    batch);
+    SharedPlan plan = findPlan(
+        CLFFT_REAL, CLFFT_HERMITIAN_INTERLEAVED, static_cast<clfftDim>(rank),
+        tdims, istrides, istrides[rank], ostrides, ostrides[rank],
+        static_cast<clfftPrecision>(Precision<Tc>::type), batch);
 
-    cl_mem imem = (*in.get())();
-    cl_mem omem = (*out.get())();
+    cl_mem imem            = (*in.get())();
+    cl_mem omem            = (*out.get())();
     cl_command_queue queue = getQueue()();
 
-    CLFFT_CHECK(clfftEnqueueTransform(plan,
-                                      direction ? CLFFT_FORWARD : CLFFT_BACKWARD,
-                                      1, &queue, 0, NULL, NULL,
-                                      &imem, &omem, NULL));
-}
-
-void computePaddedDims(dim4 &pdims,
-                       const dim4 &idims,
-                       const dim_t npad,
-                       dim_t const * const pad)
-{
-    for (int i = 0; i < 4; i++) {
-        pdims[i] = (i < (int)npad) ? pad[i] : idims[i];
-    }
-}
-
-//(currently) true is in clFFT if length is a power of 2,3,5
-inline bool isSupLen(dim_t length)
-{
-    while( length > 1 )
-    {
-        if( length % 2 == 0 )
-            length /= 2;
-        else if( length % 3 == 0 )
-            length /= 3;
-        else if( length % 5 == 0 )
-            length /= 5;
-        else
-            return false;
-    }
-    return true;
-}
+    CLFFT_CHECK(clfftEnqueueTransform(*plan.get(), CLFFT_FORWARD, 1, &queue, 0,
+                                      NULL, NULL, &imem, &omem, NULL));
 
-template<int rank>
-void verifySupported(const dim4 dims)
-{
-    for (int i = 0; i < rank; i++) {
-        ARG_ASSERT(1, isSupLen(dims[i]));
-    }
+    return out;
 }
 
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, (rank>=1 && rank<=3));
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank) {
+    Array<Tr> out = createEmptyArray<Tr>(odims);
 
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
-    verifySupported<rank>(pdims);
+    verifySupported(rank, odims);
+    size_t tdims[AF_MAX_DIMS], istrides[AF_MAX_DIMS], ostrides[AF_MAX_DIMS];
 
-    Array<outType> ret = padArray<inType, outType>(in, pdims, scalar<outType>(0), norm_factor);
-    fft_common<outType, rank, true>(ret, ret);
+    computeDims(tdims, odims);
+    computeDims(istrides, in.strides());
+    computeDims(ostrides, out.strides());
 
-    return ret;
-}
+    int batch = 1;
+    for (int i = rank; i < AF_MAX_DIMS; i++) { batch *= tdims[i]; }
 
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad)
-{
-    ARG_ASSERT(1, (rank>=1 && rank<=3));
+    SharedPlan plan = findPlan(
+        CLFFT_HERMITIAN_INTERLEAVED, CLFFT_REAL, static_cast<clfftDim>(rank),
+        tdims, istrides, istrides[rank], ostrides, ostrides[rank],
+        static_cast<clfftPrecision>(Precision<Tc>::type), batch);
 
-    dim4 pdims(1);
-    computePaddedDims(pdims, in.dims(), npad, pad);
-    verifySupported<rank>(pdims);
+    cl_mem imem            = (*in.get())();
+    cl_mem omem            = (*out.get())();
+    cl_command_queue queue = getQueue()();
 
-    // the input norm_factor is further scaled
-    // based on the input dimensions to match
-    // cuFFT behavior
-    for (int i=0; i<rank; i++)
-        norm_factor *= pdims[i];
+    CLFFT_CHECK(clfftEnqueueTransform(*plan.get(), CLFFT_BACKWARD, 1, &queue, 0,
+                                      NULL, NULL, &imem, &omem, NULL));
 
-    Array<T> ret = padArray<T, T>(in, pdims, scalar<T>(0), norm_factor);
-    fft_common<T, rank, false>(ret, ret);
-    return ret;
+    return out;
 }
 
-#define INSTANTIATE1(T1, T2)\
-    template Array<T2> fft <T1, T2, 1, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 2, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T2> fft <T1, T2, 3, true >(const Array<T1> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+#define INSTANTIATE(T) \
+    template void fft_inplace<T>(Array<T> &, const int, const bool);
 
-INSTANTIATE1(float  , cfloat )
-INSTANTIATE1(double , cdouble)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
 
-#define INSTANTIATE2(T)\
-    template Array<T> fft <T, T, 1, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 2, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> fft <T, T, 3, false>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 1>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 2>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad); \
-    template Array<T> ifft<T, 3>(const Array<T> &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+#define INSTANTIATE_REAL(Tr, Tc)                                        \
+    template Array<Tc> fft_r2c<Tc, Tr>(const Array<Tr> &, const int);   \
+    template Array<Tr> fft_c2r<Tr, Tc>(const Array<Tc> &, const dim4 &, \
+                                       const int);
 
-INSTANTIATE2(cfloat )
-INSTANTIATE2(cdouble)
-
-}
+INSTANTIATE_REAL(float, cfloat)
+INSTANTIATE_REAL(double, cdouble)
+}  // namespace opencl
+}  // namespace arrayfire
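Note on the length check in fft.cpp above: isSupLen accepts a length only when repeated division by 2, 3, 5, 7, 11 and 13 reduces it to 1. The following is a minimal standalone sketch of that check, not the ArrayFire implementation. The names is_supported_length and self_check are hypothetical, and std::int64_t stands in for dim_t.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical standalone version of the check performed by isSupLen.
// Like the loop in the diff, values <= 1 pass trivially.
inline bool is_supported_length(std::int64_t length) {
    if (length <= 1) { return true; }
    const int radices[] = {2, 3, 5, 7, 11, 13};
    for (int r : radices) {
        // Strip every factor r; order does not matter for the final result.
        while (length % r == 0) { length /= r; }
    }
    // Anything left over is a prime factor outside the radix set.
    return length == 1;
}

inline void self_check() {
    assert(is_supported_length(1024));   // 2^10
    assert(is_supported_length(30030));  // 2 * 3 * 5 * 7 * 11 * 13
    assert(!is_supported_length(34));    // 2 * 17
}
```

In the diff, verifySupported applies this test to the first rank dimensions only; the batch dimensions are not checked.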
diff --git a/src/backend/opencl/fft.hpp b/src/backend/opencl/fft.hpp
index e1ccb9741e..f071b9a8c5 100644
--- a/src/backend/opencl/fft.hpp
+++ b/src/backend/opencl/fft.hpp
@@ -9,16 +9,19 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename inType, typename outType, int rank, bool isR2C>
-Array<outType> fft(Array<inType> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+void setFFTPlanCacheSize(size_t numPlans);
 
-template<typename T, int rank>
-Array<T> ifft(Array<T> const &in, double norm_factor, dim_t const npad, dim_t const * const pad);
+template<typename T>
+void fft_inplace(Array<T> &in, const int rank, const bool direction);
 
-template<typename T, int rank, bool direction>
-void fft_common(Array<T> &out, const Array<T> &in);
+template<typename Tc, typename Tr>
+Array<Tc> fft_r2c(const Array<Tr> &in, const int rank);
 
-}
+template<typename Tr, typename Tc>
+Array<Tr> fft_c2r(const Array<Tc> &in, const dim4 &odims, const int rank);
+
+}  // namespace opencl
+}  // namespace arrayfire
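Note on the fft_r2c declaration above: in fft.cpp the output of a real-to-complex transform keeps only dims[0] / 2 + 1 elements along the first dimension, and every dimension at or beyond rank is folded into the batch count. A small sketch of that arithmetic, assuming four dimensions as in dim4. Dims, r2c_output_dims and batch_count are hypothetical names used only for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Dims = std::array<std::int64_t, 4>;  // stand-in for af::dim4

// First dimension shrinks to n / 2 + 1; the other dimensions are unchanged.
inline Dims r2c_output_dims(Dims real_dims) {
    real_dims[0] = real_dims[0] / 2 + 1;
    return real_dims;
}

// Dimensions from `rank` upward are treated as independent batches.
inline std::int64_t batch_count(const Dims& dims, int rank) {
    std::int64_t batch = 1;
    for (int i = rank; i < 4; i++) { batch *= dims[i]; }
    return batch;
}

inline void self_check() {
    assert((r2c_output_dims(Dims{8, 4, 3, 1}) == Dims{5, 4, 3, 1}));
    assert(batch_count(Dims{8, 4, 3, 1}, 2) == 3);
}
```

Because the first dimension uses integer division, the original length cannot be recovered from the complex array alone, which is presumably why fft_c2r takes odims as an explicit argument.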
diff --git a/src/backend/opencl/fftconvolve.cpp b/src/backend/opencl/fftconvolve.cpp
index cb4ef6cfd1..f5a875f41c 100644
--- a/src/backend/opencl/fftconvolve.cpp
+++ b/src/backend/opencl/fftconvolve.cpp
@@ -7,131 +7,143 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <fftconvolve.hpp>
-#include <kernel/fftconvolve.hpp>
-#include <err_opencl.hpp>
+
+#include <Array.hpp>
 #include <fft.hpp>
+#include <kernel/fftconvolve.hpp>
+#include <af/dim4.hpp>
+
+#include <cmath>
+#include <type_traits>
+#include <vector>
 
 using af::dim4;
+using std::ceil;
+using std::conditional;
+using std::is_integral;
+using std::is_same;
+using std::vector;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-static const dim4 calcPackedSize(Array<T> const& i1,
-                                 Array<T> const& i2,
-                                 const dim_t baseDim)
-{
-    const dim4 i1d = i1.dims();
-    const dim4 i2d = i2.dims();
+dim4 calcPackedSize(Array<T> const& i1, Array<T> const& i2, const dim_t rank) {
+    const dim4& i1d = i1.dims();
+    const dim4& i2d = i2.dims();
 
-    dim_t pd[4];
+    dim_t pd[4] = {1, 1, 1, 1};
 
     // Pack both signal and filter on same memory array, this will ensure
     // better use of batched cuFFT capabilities
-    for (dim_t k = 0; k < 4; k++) {
-        if (k == 0)
-            pd[k] = nextpow2((unsigned)(i1d[k] + i2d[k] - 1)) / 2;
-        else if (k < baseDim)
-            pd[k] = nextpow2((unsigned)(i1d[k] + i2d[k] - 1));
-        else if (k == baseDim)
-            pd[k] = i1d[k] + i2d[k];
-        else
-            pd[k] = 1;
+    pd[0] = nextpow2(static_cast<unsigned>(
+        static_cast<int>(ceil(i1d[0] / 2.f)) + i2d[0] - 1));
+
+    for (dim_t k = 1; k < rank; k++) {
+        pd[k] = nextpow2(static_cast<unsigned>(i1d[k] + i2d[k] - 1));
     }
 
+    dim_t i1batch = 1;
+    dim_t i2batch = 1;
+    for (int k = rank; k < 4; k++) {
+        i1batch *= i1d[k];
+        i2batch *= i2d[k];
+    }
+    pd[rank] = (i1batch + i2batch);
+
     return dim4(pd[0], pd[1], pd[2], pd[3]);
 }
 
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind)
-{
-    const dim4 sDims = signal.dims();
-    const dim4 fDims = filter.dims();
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank) {
+    using convT = typename conditional<is_integral<T>::value ||
+                                           is_same<T, float>::value ||
+                                           is_same<T, cfloat>::value,
+                                       float, double>::type;
+    using cT    = typename conditional<is_same<convT, float>::value, cfloat,
+                                    cdouble>::type;
+
+    const dim4& sDims = signal.dims();
+    const dim4& fDims = filter.dims();
 
     dim4 oDims(1);
     if (expand) {
-        for(dim_t d=0; d<4; ++d) {
-            if (kind==ONE2ONE || kind==ONE2MANY) {
-                oDims[d] = sDims[d]+fDims[d]-1;
+        for (int d = 0; d < AF_MAX_DIMS; ++d) {
+            if (kind == AF_BATCH_NONE || kind == AF_BATCH_RHS) {
+                oDims[d] = sDims[d] + fDims[d] - 1;
             } else {
-                oDims[d] = (d<baseDim ? sDims[d]+fDims[d]-1 : sDims[d]);
+                oDims[d] = (d < rank ? sDims[d] + fDims[d] - 1 : sDims[d]);
             }
         }
     } else {
         oDims = sDims;
-        if (kind==ONE2MANY) {
-            for (dim_t i=baseDim; i<4; ++i)
-                oDims[i] = fDims[i];
+        if (kind == AF_BATCH_RHS) {
+            for (int i = rank; i < AF_MAX_DIMS; ++i) { oDims[i] = fDims[i]; }
         }
     }
 
-    const dim4 pDims = calcPackedSize<T>(signal, filter, baseDim);
+    const dim4 pDims = calcPackedSize<T>(signal, filter, rank);
     Array<cT> packed = createEmptyArray<cT>(pDims);
 
-    kernel::packDataHelper<cT, T, isDouble, convT>(packed, signal, filter, baseDim, kind);
-
-    fft_common<cT, baseDim, true>(packed, packed);
-
-    kernel::complexMultiplyHelper<cT, T, isDouble, convT>(packed, signal, filter, baseDim, kind);
+    kernel::packDataHelper<cT, T>(packed, signal, filter, rank, kind);
+    fft_inplace<cT>(packed, rank, true);
+    kernel::complexMultiplyHelper<cT, T>(packed, signal, filter, rank, kind);
 
     // Compute inverse FFT only on complex-multiplied data
-    if (kind == ONE2MANY) {
-        std::vector<af_seq> seqs;
-        for (dim_t k = 0; k < 4; k++) {
-            if (k < baseDim)
-                seqs.push_back(af_make_seq(0, pDims[k]-1, 1));
-            else if (k == baseDim)
-                seqs.push_back(af_make_seq(1, pDims[k]-1, 1));
-            else
-                seqs.push_back(af_make_seq(0, 0, 1));
+    if (kind == AF_BATCH_RHS) {
+        vector<af_seq> seqs;
+        for (int k = 0; k < AF_MAX_DIMS; k++) {
+            if (k < rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k] - 1), 1.});
+            } else if (k == rank) {
+                seqs.push_back({1., static_cast<double>(pDims[k] - 1), 1.});
+            } else {
+                seqs.push_back({0., 0., 1.});
+            }
         }
 
         Array<cT> subPacked = createSubArray<cT>(packed, seqs);
-        fft_common<cT, baseDim, false>(subPacked, subPacked);
-    }
-    else {
-        std::vector<af_seq> seqs;
-        for (dim_t k = 0; k < 4; k++) {
-            if (k < baseDim)
-                seqs.push_back(af_make_seq(0, pDims[k]-1, 1));
-            else if (k == baseDim)
-                seqs.push_back(af_make_seq(0, sDims[k]-1, 1));
-            else
-                seqs.push_back(af_make_seq(0, 0, 1));
+        fft_inplace<cT>(subPacked, rank, false);
+    } else {
+        vector<af_seq> seqs;
+        for (int k = 0; k < AF_MAX_DIMS; k++) {
+            if (k < rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k]) - 1, 1.});
+            } else if (k == rank) {
+                seqs.push_back({0., static_cast<double>(pDims[k] - 2), 1.});
+            } else {
+                seqs.push_back({0., 0., 1.});
+            }
         }
 
         Array<cT> subPacked = createSubArray<cT>(packed, seqs);
-        fft_common<cT, baseDim, false>(subPacked, subPacked);
+        fft_inplace<cT>(subPacked, rank, false);
     }
 
     Array<T> out = createEmptyArray<T>(oDims);
 
-    if (expand)
-        kernel::reorderOutputHelper<T, cT, isDouble, roundOut, true , convT>(out, packed, signal, filter, baseDim, kind);
-    else
-        kernel::reorderOutputHelper<T, cT, isDouble, roundOut, false, convT>(out, packed, signal, filter, baseDim, kind);
-
+    kernel::reorderOutputHelper<T, cT>(out, packed, signal, filter, rank, kind,
+                                       expand);
     return out;
 }
 
-#define INSTANTIATE(T, convT, cT, isDouble, roundOut)                                                   \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 1>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 2>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);    \
-    template Array<T> fftconvolve <T, convT, cT, isDouble, roundOut, 3>                                 \
-        (Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-INSTANTIATE(double, double, cdouble, true , false)
-INSTANTIATE(float , float,  cfloat,  false, false)
-INSTANTIATE(uint  , float,  cfloat,  false, true)
-INSTANTIATE(int   , float,  cfloat,  false, true)
-INSTANTIATE(uchar , float,  cfloat,  false, true)
-INSTANTIATE(char  , float,  cfloat,  false, true)
-
-}
+#define INSTANTIATE(T)                                                 \
+    template Array<T> fftconvolve<T>(Array<T> const&, Array<T> const&, \
+                                     const bool, AF_BATCH_KIND, const int);
+
+INSTANTIATE(double)
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(int)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(uintl)
+INSTANTIATE(intl)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace opencl
+}  // namespace arrayfire
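Note on calcPackedSize in fftconvolve.cpp above: the packed buffer pads each transformed dimension to a power of two, halves the signal length along the first dimension before padding, and stores the signal and filter batches side by side along dimension rank. The sketch below reproduces that arithmetic under stated assumptions. next_pow2 is a hypothetical helper assumed to behave like ArrayFire's nextpow2 (smallest power of two not below its argument), and Dims and packed_size are illustrative names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Dims = std::array<std::int64_t, 4>;  // stand-in for af::dim4

// Assumed behavior of nextpow2: smallest power of two >= x.
inline std::int64_t next_pow2(std::int64_t x) {
    std::int64_t p = 1;
    while (p < x) { p *= 2; }
    return p;
}

// Mirrors the size computation in the diff; rank must be 1, 2 or 3.
inline Dims packed_size(const Dims& sig, const Dims& flt, int rank) {
    Dims pd = {1, 1, 1, 1};
    // (sig[0] + 1) / 2 equals ceil(sig[0] / 2) for non-negative lengths.
    pd[0] = next_pow2((sig[0] + 1) / 2 + flt[0] - 1);
    for (int k = 1; k < rank; k++) { pd[k] = next_pow2(sig[k] + flt[k] - 1); }
    std::int64_t sig_batch = 1, flt_batch = 1;
    for (int k = rank; k < 4; k++) {
        sig_batch *= sig[k];
        flt_batch *= flt[k];
    }
    pd[rank] = sig_batch + flt_batch;  // signal batches, then filter batches
    return pd;
}

inline void self_check() {
    // 100x50 signal with 3 batches, 5x5 filter, 2D convolution.
    assert((packed_size(Dims{100, 50, 3, 1}, Dims{5, 5, 1, 1}, 2) ==
            Dims{64, 64, 4, 1}));
}
```

In the worked case, the first dimension becomes next_pow2(50 + 5 - 1) = 64, the second becomes next_pow2(50 + 5 - 1) = 64, and dimension 2 holds 3 signal batches plus 1 filter batch.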
diff --git a/src/backend/opencl/fftconvolve.hpp b/src/backend/opencl/fftconvolve.hpp
index ebaa504443..a00f978adc 100644
--- a/src/backend/opencl/fftconvolve.hpp
+++ b/src/backend/opencl/fftconvolve.hpp
@@ -8,12 +8,11 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <convolve_common.hpp>
 
-namespace opencl
-{
-
-template<typename T, typename convT, typename cT, bool isDouble, bool roundOut, dim_t baseDim>
-Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter, const bool expand, ConvolveBatchKind kind);
-
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> fftconvolve(Array<T> const& signal, Array<T> const& filter,
+                     const bool expand, AF_BATCH_KIND kind, const int rank);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/flood_fill.cpp b/src/backend/opencl/flood_fill.cpp
new file mode 100644
index 0000000000..4a759e095d
--- /dev/null
+++ b/src/backend/opencl/flood_fill.cpp
@@ -0,0 +1,41 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <flood_fill.hpp>
+
+#include <err_opencl.hpp>
+#include <kernel/flood_fill.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup) {
+    auto out = createValueArray(image.dims(), T(0));
+    kernel::floodFill<T>(out, image, seedsX, seedsY, newValue, lowValue,
+                         highValue, nlookup);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> floodFill(const Array<T>&, const Array<uint>&,           \
+                                const Array<uint>&, const T, const T, const T, \
+                                const af::connectivity);
+
+INSTANTIATE(float)
+INSTANTIATE(uint)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/flood_fill.hpp b/src/backend/opencl/flood_fill.hpp
new file mode 100644
index 0000000000..b4210c2d57
--- /dev/null
+++ b/src/backend/opencl/flood_fill.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> floodFill(const Array<T>& image, const Array<uint>& seedsX,
+                   const Array<uint>& seedsY, const T newValue,
+                   const T lowValue, const T highValue,
+                   const af::connectivity nlookup = AF_CONNECTIVITY_8);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/gradient.cpp b/src/backend/opencl/gradient.cpp
index a4ef700be9..711e579295 100644
--- a/src/backend/opencl/gradient.cpp
+++ b/src/backend/opencl/gradient.cpp
@@ -9,23 +9,24 @@
 
 #include <Array.hpp>
 #include <gradient.hpp>
-#include <math.hpp>
 #include <kernel/gradient.hpp>
+#include <math.hpp>
 #include <stdexcept>
 
-namespace opencl
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in)
-    {
-        kernel::gradient<T>(grad0, grad1, in);
-    }
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in) {
+    kernel::gradient<T>(grad0, grad1, in);
+}
 
-#define INSTANTIATE(T)                                                                  \
-    template void gradient<T>(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);    \
+#define INSTANTIATE(T)                                            \
+    template void gradient<T>(Array<T> & grad0, Array<T> & grad1, \
+                              const Array<T> &in);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-}
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/gradient.hpp b/src/backend/opencl/gradient.hpp
index f8b229cea2..88d663f436 100644
--- a/src/backend/opencl/gradient.hpp
+++ b/src/backend/opencl/gradient.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void gradient(Array<T> &grad0, Array<T> &grad1, const Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/hamming.cpp b/src/backend/opencl/hamming.cpp
deleted file mode 100644
index 5d4bd59919..0000000000
--- a/src/backend/opencl/hamming.cpp
+++ /dev/null
@@ -1,143 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_opencl.hpp>
-#include <handle.hpp>
-#include <kernel/hamming.hpp>
-#include <kernel/transpose.hpp>
-
-using af::dim4;
-using cl::Device;
-
-namespace opencl
-{
-
-static const unsigned THREADS = 256;
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist)
-{
-    uint sample_dim = (dist_dim == 0) ? 1 : 0;
-    const dim4 qDims = query.dims();
-    const dim4 tDims = train.dims();
-
-    const dim4 outDims(n_dist, qDims[sample_dim]);
-
-    idx  = createEmptyArray<uint>(outDims);
-    dist = createEmptyArray<uint>(outDims);
-
-    const unsigned feat_len = qDims[dist_dim];
-
-    // Determine maximum feat_len capable of using shared memory (faster)
-    cl_ulong avail_lmem = getDevice().getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
-    size_t lmem_predef = 2 * THREADS * sizeof(unsigned) + feat_len * sizeof(T);
-    size_t ltrain_sz = THREADS * feat_len * sizeof(T);
-    bool use_lmem = (avail_lmem >= (lmem_predef + ltrain_sz)) ? true : false;
-    size_t lmem_sz = (use_lmem) ? lmem_predef + ltrain_sz : lmem_predef;
-
-    Array<T> queryT = query;
-    Array<T> trainT = train;
-
-    if (dist_dim == 0) {
-        const dim4 queryTDims = dim4(qDims[1], qDims[0], qDims[2], qDims[3]);
-        const dim4 trainTDims = dim4(tDims[1], tDims[0], tDims[2], tDims[3]);
-        queryT = createEmptyArray<T>(queryTDims);
-        trainT = createEmptyArray<T>(trainTDims);
-
-        bool queryIs32Multiple = false;
-        if (qDims[0] % 32 == 0 && qDims[1] % 32 == 0)
-            queryIs32Multiple = true;
-
-        bool trainIs32Multiple = false;
-        if (tDims[0] % 32 == 0 && tDims[1] % 32 == 0)
-        trainIs32Multiple = true;
-
-        if (queryIs32Multiple)
-            kernel::transpose<T, false, true >(queryT, query);
-        else
-            kernel::transpose<T, false, false>(queryT, query);
-
-        if (trainIs32Multiple)
-            kernel::transpose<T, false, true >(trainT, train);
-        else
-            kernel::transpose<T, false, false>(trainT, train);
-    }
-
-    if (use_lmem) {
-        switch (feat_len) {
-        case 1:
-            kernel::hamming_matcher<T, true , 1 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 2:
-            kernel::hamming_matcher<T, true , 2 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 4:
-            kernel::hamming_matcher<T, true , 4 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 8:
-            kernel::hamming_matcher<T, true , 8 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 16:
-            kernel::hamming_matcher<T, true , 16>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 32:
-            kernel::hamming_matcher<T, true , 32>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 64:
-            kernel::hamming_matcher<T, true , 64>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        default:
-            kernel::hamming_matcher<T, true , 0 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        }
-    } else {
-        switch (feat_len) {
-        case 1:
-            kernel::hamming_matcher<T, false, 1 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 2:
-            kernel::hamming_matcher<T, false, 2 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 4:
-            kernel::hamming_matcher<T, false, 4 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 8:
-            kernel::hamming_matcher<T, false, 8 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 16:
-            kernel::hamming_matcher<T, false, 16>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 32:
-            kernel::hamming_matcher<T, false, 32>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        case 64:
-            kernel::hamming_matcher<T, false, 64>(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        default:
-            kernel::hamming_matcher<T, false, 0 >(idx, dist, queryT, trainT, 1, n_dist, lmem_sz);
-            break;
-        }
-    }
-}
-
-#define INSTANTIATE(T)\
-    template void hamming_matcher<T>(Array<uint>& idx, Array<uint>& dist,           \
-                                     const Array<T>& query, const Array<T>& train,  \
-                                     const uint dist_dim, const uint n_dist);
-
-INSTANTIATE(uchar)
-INSTANTIATE(uint)
-
-}
diff --git a/src/backend/opencl/hamming.hpp b/src/backend/opencl/hamming.hpp
deleted file mode 100644
index 18689efa8d..0000000000
--- a/src/backend/opencl/hamming.hpp
+++ /dev/null
@@ -1,22 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <Array.hpp>
-
-using af::features;
-
-namespace opencl
-{
-
-template<typename T>
-void hamming_matcher(Array<uint>& idx, Array<uint>& dist,
-                     const Array<T>& query, const Array<T>& train,
-                     const uint dist_dim, const uint n_dist);
-
-}
diff --git a/src/backend/opencl/harris.cpp b/src/backend/opencl/harris.cpp
new file mode 100644
index 0000000000..ce2f21fced
--- /dev/null
+++ b/src/backend/opencl/harris.cpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_opencl.hpp>
+#include <kernel/harris.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr) {
+    unsigned nfeat;
+
+    Param x;
+    Param y;
+    Param score;
+
+    kernel::harris<T, convAccT>(&nfeat, x, y, score, in, max_corners,
+                                min_response, sigma, filter_len, k_thr);
+
+    if (nfeat > 0) {
+        x_out     = createParamArray<float>(x, true);
+        y_out     = createParamArray<float>(y, true);
+        score_out = createParamArray<float>(score, true);
+    }
+
+    return nfeat;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned harris<T, convAccT>(                                    \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned max_corners,                       \
+        const float min_response, const float sigma,                          \
+        const unsigned filter_len, const float k_thr);
+
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/harris.hpp b/src/backend/opencl/harris.hpp
new file mode 100644
index 0000000000..73ac64bbfd
--- /dev/null
+++ b/src/backend/opencl/harris.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename convAccT>
+unsigned harris(Array<float> &x_out, Array<float> &y_out,
+                Array<float> &score_out, const Array<T> &in,
+                const unsigned max_corners, const float min_response,
+                const float sigma, const unsigned filter_len,
+                const float k_thr);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/hist_graphics.cpp b/src/backend/opencl/hist_graphics.cpp
index cde15a1799..a20daeb700 100644
--- a/src/backend/opencl/hist_graphics.cpp
+++ b/src/backend/opencl/hist_graphics.cpp
@@ -7,46 +7,58 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
-#include <interopManager.hpp>
 #include <Array.hpp>
-#include <hist_graphics.hpp>
-#include <err_opencl.hpp>
+#include <GraphicsResourceManager.hpp>
 #include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <hist_graphics.hpp>
+
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist)
-{
+void copy_histogram(const Array<T> &data, fg_histogram hist) {
+    ForgeModule &_ = forgePlugin();
     if (isGLSharingSupported()) {
         CheckGL("Begin OpenCL resource copy");
         const cl::Buffer *d_P = data.get();
-        size_t bytes = hist->size();
+        unsigned bytes        = 0;
+        FG_CHECK(_.fg_get_histogram_vertex_buffer_size(&bytes, hist));
 
-        InteropManager& intrpMngr = InteropManager::getInstance();
-
-        cl::Buffer *clPBOResource = intrpMngr.getBufferResource(hist);
+        auto res = interopManager().getHistogramResources(hist);
 
         std::vector<cl::Memory> shared_objects;
-        shared_objects.push_back(*clPBOResource);
+        shared_objects.push_back(*(res[0].get()));
 
         glFinish();
-        getQueue().enqueueAcquireGLObjects(&shared_objects);
-        getQueue().enqueueCopyBuffer(*d_P, *clPBOResource, 0, 0, bytes, NULL, NULL);
-        getQueue().finish();
-        getQueue().enqueueReleaseGLObjects(&shared_objects);
+
+        // Use of events:
+        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+        cl::Event event;
+
+        getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+        getQueue().enqueueCopyBuffer(*d_P, *(res[0].get()), 0, 0, bytes, NULL,
+                                     &event);
+        getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+        event.wait();
 
         CL_DEBUG_FINISH(getQueue());
         CheckGL("End OpenCL resource copy");
     } else {
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_histogram_vertex_buffer(&buffer, hist));
+        FG_CHECK(_.fg_get_histogram_vertex_buffer_size(&bytes, hist));
+
         CheckGL("Begin OpenCL fallback-resource copy");
-        glBindBuffer(GL_ARRAY_BUFFER, hist->vbo());
-        GLubyte* ptr = (GLubyte*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
         if (ptr) {
-            getQueue().enqueueReadBuffer(*data.get(), CL_TRUE, 0, hist->size(), ptr);
+            getQueue().enqueueReadBuffer(*data.get(), CL_TRUE, 0, bytes, ptr);
             glUnmapBuffer(GL_ARRAY_BUFFER);
         }
         glBindBuffer(GL_ARRAY_BUFFER, 0);
@@ -54,14 +66,16 @@ void copy_histogram(const Array<T> &data, const fg::Histogram* hist)
     }
 }
 
-#define INSTANTIATE(T)  \
-    template void copy_histogram<T>(const Array<T> &data, const fg::Histogram* hist);
+#define INSTANTIATE(T) \
+    template void copy_histogram<T>(const Array<T> &, fg_histogram);
 
 INSTANTIATE(float)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
 
-}
-
-#endif  // WITH_GRAPHICS
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/hist_graphics.hpp b/src/backend/opencl/hist_graphics.hpp
index 6bb3a9eb5b..40dd57e5e9 100644
--- a/src/backend/opencl/hist_graphics.hpp
+++ b/src/backend/opencl/hist_graphics.hpp
@@ -7,18 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
-#include <graphics_common.hpp>
 #include <Array.hpp>
+#include <common/graphics_common.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void copy_histogram(const Array<T> &data, const fg::Histogram* hist);
-
-}
-
-#endif
+void copy_histogram(const Array<T> &data, fg_histogram hist);
 
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/histogram.cpp b/src/backend/opencl/histogram.cpp
index fbae44fc3b..bbf7e9082e 100644
--- a/src/backend/opencl/histogram.cpp
+++ b/src/backend/opencl/histogram.cpp
@@ -7,55 +7,47 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
 #include <histogram.hpp>
 #include <kernel/histogram.hpp>
-#include <err_opencl.hpp>
-#include <vector>
+#include <af/dim4.hpp>
 
 using af::dim4;
-using std::vector;
-
-namespace opencl
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval)
-{
-    ARG_ASSERT(1, (nbins<=kernel::MAX_BINS));
-
-    const dim4 dims     = in.dims();
-    dim4 outDims        = dim4(nbins, 1, dims[2], dims[3]);
-    Array<outType> out = createValueArray<outType>(outDims, outType(0));
-
-    // create an array to hold min and max values for
-    // batch operation handling, this will reduce
-    // number of concurrent reads to one single memory location
-    dim_t mmNElems= dims[2] * dims[3];
-    cfloat init;
-    init.s[0] = minval;
-    init.s[1] = maxval;
-    vector<cfloat> h_minmax(mmNElems, init);
-
-    dim4 minmax_dims(mmNElems*2);
-    Array<cfloat> minmax = createHostDataArray<cfloat>(minmax_dims, h_minmax.data());
-
-    kernel::histogram<inType, outType>(out, in, minmax, nbins);
-
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear) {
+    const dim4 &dims = in.dims();
+    dim4 outDims     = dim4(nbins, 1, dims[2], dims[3]);
+    Array<uint> out  = createValueArray<uint>(outDims, uint(0));
+    kernel::histogram<T>(out, in, nbins, minval, maxval, isLinear);
     return out;
 }
 
-#define INSTANTIATE(in_t,out_t)\
-    template Array<out_t> histogram(const Array<in_t> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-INSTANTIATE(float , uint)
-INSTANTIATE(double, uint)
-INSTANTIATE(char  , uint)
-INSTANTIATE(int   , uint)
-INSTANTIATE(uint  , uint)
-INSTANTIATE(uchar , uint)
-
-}
+#define INSTANTIATE(T)                                                    \
+    template Array<uint> histogram<T>(const Array<T> &, const unsigned &, \
+                                      const double &, const double &,     \
+                                      const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/histogram.hpp b/src/backend/opencl/histogram.hpp
index 94701f4fb8..5b0c21e970 100644
--- a/src/backend/opencl/histogram.hpp
+++ b/src/backend/opencl/histogram.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
-
-template<typename inType, typename outType>
-Array<outType> histogram(const Array<inType> &in, const unsigned &nbins, const double &minval, const double &maxval);
-
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<uint> histogram(const Array<T> &in, const unsigned &nbins,
+                      const double &minval, const double &maxval,
+                      const bool isLinear);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/homography.cpp b/src/backend/opencl/homography.cpp
new file mode 100644
index 0000000000..1bd958de55
--- /dev/null
+++ b/src/backend/opencl/homography.cpp
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <homography.hpp>
+
+#include <arith.hpp>
+#include <kernel/homography.hpp>
+#include <af/dim4.hpp>
+
+#include <algorithm>
+#include <limits>
+
+using af::dim4;
+using std::numeric_limits;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+int homography(Array<T> &bestH, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations) {
+    // constexpr float RANSACConfidence  = 0.99f;
+    constexpr float LMEDSConfidence   = 0.99f;
+    constexpr float LMEDSOutlierRatio = 0.4f;
+
+    const af::dim4 &idims   = x_src.dims();
+    const unsigned nsamples = idims[0];
+
+    unsigned iter    = iterations;
+    Array<float> err = createEmptyArray<float>(af::dim4());
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        iter =
+            ::std::min(iter, static_cast<unsigned>(
+                                 log(1.f - LMEDSConfidence) /
+                                 log(1.f - pow(1.f - LMEDSOutlierRatio, 4.f))));
+        err = createValueArray<float>(af::dim4(nsamples, iter),
+                                      numeric_limits<float>::max());
+    } else {
+        // Avoid passing "null" cl_mem object to kernels
+        err = createEmptyArray<float>(af::dim4(1));
+    }
+
+    const size_t iter_sz = divup(iter, 256) * 256;
+
+    af::dim4 rdims(4, iter_sz);
+    Array<float> fctr =
+        createValueArray<float>(rdims, static_cast<float>(nsamples));
+    Array<float> rnd = arithOp<float, af_mul_t>(initial, fctr, rdims);
+
+    Array<T> tmpH =
+        createValueArray<T>(af::dim4(9, iter_sz), static_cast<T>(0));
+
+    bestH = createValueArray<T>(af::dim4(3, 3), static_cast<T>(0));
+    return kernel::computeH<T>(bestH, tmpH, err, x_src, y_src, x_dst, y_dst,
+                               rnd, iter, nsamples, inlier_thr, htype);
+}
+
+#define INSTANTIATE(T)                                                     \
+    template int homography(                                               \
+        Array<T> &H, const Array<float> &x_src, const Array<float> &y_src, \
+        const Array<float> &x_dst, const Array<float> &y_dst,              \
+        const Array<float> &initial, const af_homography_type htype,       \
+        const float inlier_thr, const unsigned iterations);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/homography.hpp b/src/backend/opencl/homography.hpp
new file mode 100644
index 0000000000..2fa7c76690
--- /dev/null
+++ b/src/backend/opencl/homography.hpp
@@ -0,0 +1,23 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+int homography(Array<T> &H, const Array<float> &x_src,
+               const Array<float> &y_src, const Array<float> &x_dst,
+               const Array<float> &y_dst, const Array<float> &initial,
+               const af_homography_type htype, const float inlier_thr,
+               const unsigned iterations);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/hsv_rgb.cpp b/src/backend/opencl/hsv_rgb.cpp
index 1c840a4a64..06ab6b9856 100644
--- a/src/backend/opencl/hsv_rgb.cpp
+++ b/src/backend/opencl/hsv_rgb.cpp
@@ -1,49 +1,39 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <ArrayInfo.hpp>
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
 #include <hsv_rgb.hpp>
-#include <kernel/hsv_rgb.hpp>
-#include <err_opencl.hpp>
 
-using af::dim4;
+#include <kernel/hsv_rgb.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> hsv2rgb(const Array<T>& in)
-{
-    Array<T> out   = createEmptyArray<T>(in.dims());
-
-    kernel::hsv2rgb_convert<T, true>(out, in);
-
+Array<T> hsv2rgb(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::hsv2rgb_convert<T>(out, in, true);
     return out;
 }
 
 template<typename T>
-Array<T> rgb2hsv(const Array<T>& in)
-{
-    Array<T> out   = createEmptyArray<T>(in.dims());
-
-    kernel::hsv2rgb_convert<T, false>(out, in);
-
+Array<T> rgb2hsv(const Array<T>& in) {
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::hsv2rgb_convert<T>(out, in, false);
     return out;
 }
 
-#define INSTANTIATE(T)  \
+#define INSTANTIATE(T)                                \
     template Array<T> hsv2rgb<T>(const Array<T>& in); \
-    template Array<T> rgb2hsv<T>(const Array<T>& in); \
+    template Array<T> rgb2hsv<T>(const Array<T>& in);
 
 INSTANTIATE(double)
-INSTANTIATE(float )
+INSTANTIATE(float)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/hsv_rgb.hpp b/src/backend/opencl/hsv_rgb.hpp
index d08490993e..4c87fa9479 100644
--- a/src/backend/opencl/hsv_rgb.hpp
+++ b/src/backend/opencl/hsv_rgb.hpp
@@ -1,16 +1,16 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 Array<T> hsv2rgb(const Array<T>& in);
@@ -18,4 +18,5 @@ Array<T> hsv2rgb(const Array<T>& in);
 template<typename T>
 Array<T> rgb2hsv(const Array<T>& in);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/identity.cpp b/src/backend/opencl/identity.cpp
index dd6414027c..9aa72fc433 100644
--- a/src/backend/opencl/identity.cpp
+++ b/src/backend/opencl/identity.cpp
@@ -6,37 +6,42 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <identity.hpp>
+#include <kernel/identity.hpp>
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <af/defines.h>
 #include <Array.hpp>
-#include <identity.hpp>
+#include <common/half.hpp>
 #include <debug_opencl.hpp>
-#include <kernel/identity.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> identity(const dim4& dims)
-    {
-        Array<T> out  = createEmptyArray<T>(dims);
-        kernel::identity<T>(out);
-        return out;
-    }
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> identity(const dim4& dims) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::identity<T>(out);
+    return out;
+}
 
-#define INSTANTIATE_IDENTITY(T)                              \
-    template Array<T>  identity<T>    (const af::dim4 &dims);
+#define INSTANTIATE_IDENTITY(T) \
+    template Array<T> identity<T>(const af::dim4& dims);
 
-    INSTANTIATE_IDENTITY(float)
-    INSTANTIATE_IDENTITY(double)
-    INSTANTIATE_IDENTITY(cfloat)
-    INSTANTIATE_IDENTITY(cdouble)
-    INSTANTIATE_IDENTITY(int)
-    INSTANTIATE_IDENTITY(uint)
-    INSTANTIATE_IDENTITY(intl)
-    INSTANTIATE_IDENTITY(uintl)
-    INSTANTIATE_IDENTITY(char)
-    INSTANTIATE_IDENTITY(uchar)
+INSTANTIATE_IDENTITY(float)
+INSTANTIATE_IDENTITY(double)
+INSTANTIATE_IDENTITY(cfloat)
+INSTANTIATE_IDENTITY(cdouble)
+INSTANTIATE_IDENTITY(int)
+INSTANTIATE_IDENTITY(uint)
+INSTANTIATE_IDENTITY(intl)
+INSTANTIATE_IDENTITY(uintl)
+INSTANTIATE_IDENTITY(char)
+INSTANTIATE_IDENTITY(schar)
+INSTANTIATE_IDENTITY(uchar)
+INSTANTIATE_IDENTITY(short)
+INSTANTIATE_IDENTITY(ushort)
+INSTANTIATE_IDENTITY(half)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/identity.hpp b/src/backend/opencl/identity.hpp
index 3a56e182db..0a401099b8 100644
--- a/src/backend/opencl/identity.hpp
+++ b/src/backend/opencl/identity.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> identity(const dim4& dim);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> identity(const dim4& dim);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/iir.cpp b/src/backend/opencl/iir.cpp
index eb7e99666e..9b53708212 100644
--- a/src/backend/opencl/iir.cpp
+++ b/src/backend/opencl/iir.cpp
@@ -7,63 +7,55 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <iir.hpp>
-#include <err_opencl.hpp>
-#include <math.hpp>
 #include <arith.hpp>
 #include <convolve.hpp>
+#include <err_opencl.hpp>
+#include <iir.hpp>
 #include <kernel/iir.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x)
-    {
-        try {
-
-            ConvolveBatchKind type = x.ndims() == 1 ? ONE2ONE : MANY2MANY;
-            if (x.ndims() != b.ndims()) {
-                type = (x.ndims() < b.ndims()) ? ONE2MANY : MANY2ONE;
-            }
-
-            // Extract the first N elements
-            Array<T> c = convolve<T, T, 1, true>(x, b, type);
-            dim4 cdims = c.dims();
-            cdims[0] = x.dims()[0];
-            c.resetDims(cdims);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x) {
+    AF_BATCH_KIND type = x.ndims() == 1 ? AF_BATCH_NONE : AF_BATCH_SAME;
+    if (x.ndims() != b.ndims()) {
+        type = (x.ndims() < b.ndims()) ? AF_BATCH_RHS : AF_BATCH_LHS;
+    }
 
-            int num_a = a.dims()[0];
+    // Extract the first N elements
+    Array<T> c = convolve<T, T>(x, b, type, 1, true);
+    dim4 cdims = c.dims();
+    cdims[0]   = x.dims()[0];
+    c.resetDims(cdims);
 
-            if (num_a == 1) return c;
+    int num_a = a.dims()[0];
 
-            dim4 ydims = c.dims();
-            Array<T> y = createEmptyArray<T>(ydims);
+    if (num_a == 1) { return c; }
 
-            if (a.ndims() > 1) {
-                kernel::iir<T,  true>(y, c, a);
-            } else {
-                kernel::iir<T, false>(y, c, a);
-            }
+    dim4 ydims = c.dims();
+    Array<T> y = createEmptyArray<T>(ydims);
 
-            return y;
-        } catch (cl::Error &err) {
-            CL_TO_AF_ERROR(err);
-        }
+    if (a.ndims() > 1) {
+        kernel::iir<T, true>(y, c, a);
+    } else {
+        kernel::iir<T, false>(y, c, a);
     }
 
-#define INSTANTIATE(T)                          \
-    template Array<T> iir(const Array<T> &b,    \
-                          const Array<T> &a,    \
-                          const Array<T> &x);   \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
+    return y;
 }
+
+#define INSTANTIATE(T)                                          \
+    template Array<T> iir(const Array<T> &b, const Array<T> &a, \
+                          const Array<T> &x);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/iir.hpp b/src/backend/opencl/iir.hpp
index 6fb69459c4..0b939ab3fe 100644
--- a/src/backend/opencl/iir.hpp
+++ b/src/backend/opencl/iir.hpp
@@ -9,9 +9,10 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 Array<T> iir(const Array<T> &b, const Array<T> &a, const Array<T> &x);
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/image.cpp b/src/backend/opencl/image.cpp
index 1ee886b8e2..663fc63c24 100644
--- a/src/backend/opencl/image.cpp
+++ b/src/backend/opencl/image.cpp
@@ -7,69 +7,82 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined(WITH_GRAPHICS)
-
-#include <interopManager.hpp>
 #include <Array.hpp>
-#include <image.hpp>
-#include <err_opencl.hpp>
+#include <GraphicsResourceManager.hpp>
 #include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <image.hpp>
+
 #include <stdexcept>
 #include <vector>
 
-namespace opencl
-{
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void copy_image(const Array<T> &in, const fg::Image* image)
-{
+void copy_image(const Array<T> &in, fg_image image) {
+    ForgeModule &_ = forgePlugin();
     if (isGLSharingSupported()) {
         CheckGL("Begin opencl resource copy");
-        InteropManager& intrpMngr = InteropManager::getInstance();
 
-        cl::Buffer *clPBOResource = intrpMngr.getBufferResource(image);
+        auto res = interopManager().getImageResources(image);
+
         const cl::Buffer *d_X = in.get();
-        size_t num_bytes = image->size();
+
+        unsigned bytes = 0;
+        FG_CHECK(_.fg_get_image_size(&bytes, image));
 
         std::vector<cl::Memory> shared_objects;
-        shared_objects.push_back(*clPBOResource);
+        shared_objects.push_back(*(res[0].get()));
 
         glFinish();
-        getQueue().enqueueAcquireGLObjects(&shared_objects);
-        getQueue().enqueueCopyBuffer(*d_X, *clPBOResource, 0, 0, num_bytes, NULL, NULL);
-        getQueue().finish();
-        getQueue().enqueueReleaseGLObjects(&shared_objects);
+
+        // Use of events:
+        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+        cl::Event event;
+
+        getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+        getQueue().enqueueCopyBuffer(*d_X, *(res[0].get()), 0, 0, bytes, NULL,
+                                     &event);
+        getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+        event.wait();
 
         CL_DEBUG_FINISH(getQueue());
         CheckGL("End opencl resource copy");
     } else {
         CheckGL("Begin OpenCL fallback-resource copy");
-        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, image->pbo());
-        CheckGL("1Begin OpenCL fallback-resource copy");
-        glBufferData(GL_PIXEL_UNPACK_BUFFER, image->size(), 0, GL_STREAM_DRAW);
-        CheckGL("2Begin OpenCL fallback-resource copy");
-        GLubyte* ptr = (GLubyte*)glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
-        CheckGL("3Begin OpenCL fallback-resource copy");
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_image_size(&bytes, image));
+        FG_CHECK(_.fg_get_pixel_buffer(&buffer, image));
+
+        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, buffer);
+        glBufferData(GL_PIXEL_UNPACK_BUFFER, bytes, 0, GL_STREAM_DRAW);
+        auto *ptr = static_cast<GLubyte *>(
+            glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY));
         if (ptr) {
-            getQueue().enqueueReadBuffer(*in.get(), CL_TRUE, 0, image->size(), ptr);
+            getQueue().enqueueReadBuffer(*in.get(), CL_TRUE, 0, bytes, ptr);
             glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
         }
-        CheckGL("4Begin OpenCL fallback-resource copy");
         glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
         CheckGL("End OpenCL fallback-resource copy");
     }
 }
 
-#define INSTANTIATE(T)      \
-    template void copy_image<T>(const Array<T> &in, const fg::Image* image);
+#define INSTANTIATE(T) template void copy_image<T>(const Array<T> &, fg_image);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
 INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
 
-}
-
-#endif
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/image.hpp b/src/backend/opencl/image.hpp
index 778fde0949..f9ee5db1eb 100644
--- a/src/backend/opencl/image.hpp
+++ b/src/backend/opencl/image.hpp
@@ -7,15 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace opencl {
 
-namespace opencl
-{
-    template<typename T>
-    void copy_image(const Array<T> &in, const fg::Image* image);
-}
+template<typename T>
+void copy_image(const Array<T> &in, fg_image image);
 
-#endif
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/index.cpp b/src/backend/opencl/index.cpp
index 33dc559b8f..b1cb238968 100644
--- a/src/backend/opencl/index.cpp
+++ b/src/backend/opencl/index.cpp
@@ -7,71 +7,76 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <handle.hpp>
 #include <index.hpp>
 #include <kernel/index.hpp>
+
+#include <Array.hpp>
 #include <err_opencl.hpp>
+#include <handle.hpp>
 #include <memory.hpp>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> index(const Array<T>& in, const af_index_t idxrs[])
-{
+Array<T> index(const Array<T>& in, const af_index_t idxrs[]) {
     kernel::IndexKernelParam_t p;
     std::vector<af_seq> seqs(4, af_span);
     // create seq vector to retrieve output
     // dimensions, offsets & offsets
-    for (dim_t x=0; x<4; ++x) {
-        if (idxrs[x].isSeq) {
-            seqs[x] = idxrs[x].idx.seq;
-        }
+    for (dim_t x = 0; x < 4; ++x) {
+        if (idxrs[x].isSeq) { seqs[x] = idxrs[x].idx.seq; }
     }
 
     // retrieve dimensions, strides and offsets
-    dim4 iDims = in.dims();
-    dim4 dDims = in.getDataDims();
-    dim4 oDims = toDims  (seqs, iDims);
-    dim4 iOffs = toOffset(seqs, dDims);
-    dim4 iStrds= toStride(seqs, dDims);
+    const dim4& iDims = in.dims();
+    dim4 dDims        = in.getDataDims();
+    dim4 oDims        = toDims(seqs, iDims);
+    dim4 iOffs        = toOffset(seqs, dDims);
+    dim4 iStrds       = in.strides();
 
-    for (dim_t i=0; i<4; ++i) {
-        p.isSeq[i] = idxrs[i].isSeq;
+    for (dim_t i = 0; i < 4; ++i) {
+        p.isSeq[i] = idxrs[i].isSeq ? 1 : 0;
         p.offs[i]  = iOffs[i];
         p.strds[i] = iStrds[i];
+        p.steps[i] = 0;
+        if (idxrs[i].isSeq) {
+            af_seq seq = idxrs[i].idx.seq;
+            // The step for af_span used in the kernel must be 1
+            if (seq.begin == af_span.begin && seq.end == af_span.end &&
+                seq.step == af_span.step)
+                p.steps[i] = 1;
+            else
+                p.steps[i] = seq.step;
+        }
     }
 
-    Buffer* bPtrs[4];
+    cl::Buffer* bPtrs[4];
 
-    std::vector< Array<uint> > idxArrs(4, createEmptyArray<uint>(dim4()));
+    auto buf = cl::Buffer();
+    std::vector<Array<uint>> idxArrs(4, createEmptyArray<uint>(dim4()));
     // look through indexs to read af_array indexs
-    for (dim_t x=0; x<4; ++x) {
+    for (dim_t x = 0; x < 4; ++x) {
         // set index pointers were applicable
         if (!p.isSeq[x]) {
             idxArrs[x] = castArray<uint>(idxrs[x].idx.arr);
-            bPtrs[x] = idxArrs[x].get();
+            bPtrs[x]   = idxArrs[x].get();
             // set output array ith dimension value
             oDims[x] = idxArrs[x].elements();
-        }
-        else {
+        } else {
             // alloc an 1-element buffer to avoid OpenCL from failing
-            bPtrs[x] = bufferAlloc(sizeof(uint));
+            bPtrs[x] = &buf;
         }
     }
 
     Array<T> out = createEmptyArray<T>(oDims);
+    if (oDims.elements() == 0) { return out; }
 
     kernel::index<T>(out, in, p, bPtrs);
 
-    for (dim_t x=0; x<4; ++x) {
-        if (p.isSeq[x]) bufferFree(bPtrs[x]);
-    }
-
     return out;
 }
 
@@ -79,14 +84,19 @@ Array<T> index(const Array<T>& in, const af_index_t idxrs[])
     template Array<T> index<T>(const Array<T>& in, const af_index_t idxrs[]);
 
 INSTANTIATE(cdouble)
-INSTANTIATE(double )
-INSTANTIATE(cfloat )
-INSTANTIATE(float  )
-INSTANTIATE(uintl  )
-INSTANTIATE(uint   )
-INSTANTIATE(intl   )
-INSTANTIATE(int    )
-INSTANTIATE(uchar  )
-INSTANTIATE(char   )
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(float)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
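Review note (not part of the patch): the new loop in `index()` fills `p.steps[i]` with 0 for array-indexed dimensions, 1 for a span sequence, and the sequence's own step otherwise. Below is a minimal standalone sketch of that rule. `Seq`, `Index`, `kSpan`, and `kernelStep` are hypothetical stand-ins for `af_seq`, `af_index_t`, `af_span`, and the loop body; the sentinel value `{1, 1, 0}` is an assumption made for illustration, and the narrowing to `int` is likewise assumed.

```cpp
// Hypothetical stand-in for af_seq; field names mirror those used in the patch.
struct Seq {
    double begin, end, step;
};

// Assumed sentinel meaning "take the whole dimension" (stand-in for af_span).
constexpr Seq kSpan{1.0, 1.0, 0.0};

// Hypothetical stand-in for the parts of af_index_t that the loop reads.
struct Index {
    bool isSeq;
    Seq seq;
};

// Step handed to the kernel for one dimension:
//   0          -> dimension is indexed by an array, step unused
//   1          -> sequence equals the span sentinel (walk every element)
//   seq.step   -> any other sequence keeps its own step
constexpr int kernelStep(const Index& idx) {
    return !idx.isSeq ? 0
           : (idx.seq.begin == kSpan.begin && idx.seq.end == kSpan.end &&
              idx.seq.step == kSpan.step)
               ? 1
               : static_cast<int>(idx.seq.step);
}

// Compile-time checks of the three cases.
static_assert(kernelStep(Index{false, kSpan}) == 0, "array index");
static_assert(kernelStep(Index{true, kSpan}) == 1, "span sequence");
static_assert(kernelStep(Index{true, Seq{0.0, 9.0, 2.0}}) == 2, "strided");
```

The special case matters because the sentinel's own step is 0 in this sketch: passing it through unchanged would make a kernel that multiplies by the step read the same element repeatedly.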
diff --git a/src/backend/opencl/index.hpp b/src/backend/opencl/index.hpp
index 82aef094c6..2164305a62 100644
--- a/src/backend/opencl/index.hpp
+++ b/src/backend/opencl/index.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/index.h>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 Array<T> index(const Array<T>& in, const af_index_t idxrs[]);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/interopManager.cpp b/src/backend/opencl/interopManager.cpp
deleted file mode 100644
index 2b6fda0cc7..0000000000
--- a/src/backend/opencl/interopManager.cpp
+++ /dev/null
@@ -1,77 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined(WITH_GRAPHICS)
-
-#include <interopManager.hpp>
-
-namespace opencl
-{
-
-void InteropManager::destroyResources()
-{
-    int n = getActiveDeviceId();
-    for(iter_t iter = interop_maps[n].begin(); iter != interop_maps[n].end(); iter++)
-        delete iter->second;
-}
-
-InteropManager::~InteropManager()
-{
-    for(int i = 0; i < getDeviceCount(); i++) {
-        setDevice(i);
-        destroyResources();
-    }
-}
-
-InteropManager& InteropManager::getInstance()
-{
-    static InteropManager my_instance;
-    return my_instance;
-}
-
-cl::Buffer* InteropManager::getBufferResource(const fg::Image* image)
-{
-    void * key = (void*)image;
-    int device = getActiveDeviceId();
-    iter_t iter = interop_maps[device].find(key);
-
-    if (iter == interop_maps[device].end())
-        interop_maps[device][key] = new cl::BufferGL(getContext(), CL_MEM_WRITE_ONLY, image->pbo(), NULL);
-
-    return interop_maps[device][key];
-}
-
-cl::Buffer* InteropManager::getBufferResource(const fg::Plot* plot)
-{
-    void * key = (void*)plot;
-    int device = getActiveDeviceId();
-    iter_t iter = interop_maps[device].find(key);
-
-    if (iter == interop_maps[device].end())
-        interop_maps[device][key] = new cl::BufferGL(getContext(), CL_MEM_WRITE_ONLY, plot->vbo(), NULL);
-
-    return interop_maps[device][key];
-}
-
-cl::Buffer* InteropManager::getBufferResource(const fg::Histogram* hist)
-{
-    void * key = (void*)hist;
-    int device = getActiveDeviceId();
-    iter_t iter = interop_maps[device].find(key);
-
-    if (iter == interop_maps[device].end())
-        interop_maps[device][key] = new cl::BufferGL(getContext(), CL_MEM_WRITE_ONLY, hist->vbo(), NULL);
-
-    return interop_maps[device][key];
-}
-
-}
-
-#endif
-
diff --git a/src/backend/opencl/interopManager.hpp b/src/backend/opencl/interopManager.hpp
deleted file mode 100644
index 6af6d17ed7..0000000000
--- a/src/backend/opencl/interopManager.hpp
+++ /dev/null
@@ -1,44 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if defined(WITH_GRAPHICS)
-
-#include <GL/glew.h>
-#include <graphics_common.hpp>
-#include <platform.hpp>
-#include <map>
-
-namespace opencl
-{
-
-typedef std::map<void *, cl::Buffer*> interop_t;
-typedef interop_t::iterator iter_t;
-
-class InteropManager
-{
-    private:
-        interop_t interop_maps[DeviceManager::MAX_DEVICES];
-
-    public:
-        static InteropManager& getInstance();
-        ~InteropManager();
-        cl::Buffer* getBufferResource(const fg::Image* image);
-        cl::Buffer* getBufferResource(const fg::Plot* plot);
-        cl::Buffer* getBufferResource(const fg::Histogram* hist);
-
-    protected:
-        InteropManager() {}
-        InteropManager(InteropManager const&);
-        void operator=(InteropManager const&);
-        void destroyResources();
-};
-
-}
-
-#endif
diff --git a/src/backend/opencl/inverse.cpp b/src/backend/opencl/inverse.cpp
index eb8348edd4..860c449c3c 100644
--- a/src/backend/opencl/inverse.cpp
+++ b/src/backend/opencl/inverse.cpp
@@ -7,51 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <err_common.hpp>
-#include <solve.hpp>
+#include <err_opencl.hpp>
 #include <identity.hpp>
+#include <solve.hpp>
 
-#if defined(WITH_OPENCL_LINEAR_ALGEBRA)
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <cpu/cpu_inverse.hpp>
+#include <platform.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
+Array<T> inverse(const Array<T> &in) {
+    if (OpenCLCPUOffload()) {
+        if (in.dims()[0] == in.dims()[1]) { return cpu::inverse(in); }
+    }
     Array<T> I = identity<T>(in.dims());
     return solve<T>(in, I);
 }
 
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
 
 INSTANTIATE(float)
 INSTANTIATE(cfloat)
 INSTANTIATE(double)
 INSTANTIATE(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> inverse(const Array<T> &in)
-{
+Array<T> inverse(const Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE(T)                                                                   \
-    template Array<T> inverse<T> (const Array<T> &in);
+#define INSTANTIATE(T) template Array<T> inverse<T>(const Array<T> &in);
 
 INSTANTIATE(float)
 INSTANTIATE(cfloat)
 INSTANTIATE(double)
 INSTANTIATE(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
 #endif
diff --git a/src/backend/opencl/inverse.hpp b/src/backend/opencl/inverse.hpp
index b28c3a4180..1695798720 100644
--- a/src/backend/opencl/inverse.hpp
+++ b/src/backend/opencl/inverse.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> inverse(const Array<T> &in);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> inverse(const Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/iota.cpp b/src/backend/opencl/iota.cpp
index 7a1d369b42..87c840b419 100644
--- a/src/backend/opencl/iota.cpp
+++ b/src/backend/opencl/iota.cpp
@@ -6,33 +6,43 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <Array.hpp>
 #include <iota.hpp>
 #include <kernel/iota.hpp>
-#include <math.hpp>
-#include <stdexcept>
+
+#include <Array.hpp>
+#include <common/half.hpp>
 #include <err_opencl.hpp>
+#include <math.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> iota(const dim4 &dims, const dim4 &tile_dims)
-    {
-        dim4 outdims = dims * tile_dims;
+#include <stdexcept>
 
-        Array<T> out = createEmptyArray<T>(outdims);
-        kernel::iota<T>(out, dims, tile_dims);
+using arrayfire::common::half;
 
-        return out;
-    }
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> iota(const dim4 &dims, const dim4 &tile_dims) {
+    dim4 outdims = dims * tile_dims;
 
-#define INSTANTIATE(T)                                                          \
-    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims); \
+    Array<T> out = createEmptyArray<T>(outdims);
+    kernel::iota<T>(out, dims);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> iota<T>(const af::dim4 &dims, const af::dim4 &tile_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/iota.hpp b/src/backend/opencl/iota.hpp
index 192c09d9f3..26869554b8 100644
--- a/src/backend/opencl/iota.hpp
+++ b/src/backend/opencl/iota.hpp
@@ -8,13 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
-}
-
-
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> iota(const dim4 &dim, const dim4 &tile_dims = dim4(1));
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/ireduce.cpp b/src/backend/opencl/ireduce.cpp
index 2b735c85da..d4b080389c 100644
--- a/src/backend/opencl/ireduce.cpp
+++ b/src/backend/opencl/ireduce.cpp
@@ -6,57 +6,78 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <ireduce.hpp>
-#include <ops.hpp>
 #include <kernel/ireduce.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
 #include <err_opencl.hpp>
+#include <optypes.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace opencl
-{
-
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim)
-    {
-        kernel::ireduce<T, op>(out, loc.get(), in, dim);
-    }
-
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in)
-    {
-        return kernel::ireduce_all<T, op>(loc, in);
-    }
-
-#define INSTANTIATE(ROp, T)                                             \
-    template void ireduce<ROp, T>(Array<T> &out, Array<uint> &loc,      \
-                                       const Array<T> &in, const int dim); \
-    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);  \
-
-    //min
-    INSTANTIATE(af_min_t, float  )
-    INSTANTIATE(af_min_t, double )
-    INSTANTIATE(af_min_t, cfloat )
-    INSTANTIATE(af_min_t, cdouble)
-    INSTANTIATE(af_min_t, int    )
-    INSTANTIATE(af_min_t, uint   )
-    INSTANTIATE(af_min_t, char   )
-    INSTANTIATE(af_min_t, uchar  )
-
-    //max
-    INSTANTIATE(af_max_t, float  )
-    INSTANTIATE(af_max_t, double )
-    INSTANTIATE(af_max_t, cfloat )
-    INSTANTIATE(af_max_t, cdouble)
-    INSTANTIATE(af_max_t, int    )
-    INSTANTIATE(af_max_t, uint   )
-    INSTANTIATE(af_max_t, char   )
-    INSTANTIATE(af_max_t, uchar  )
+namespace arrayfire {
+namespace opencl {
+
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim) {
+    Array<uint> rlen = createEmptyArray<uint>(af::dim4(0));
+    kernel::ireduce<T, op>(out, loc.get(), in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen) {
+    kernel::ireduce<T, op>(out, loc.get(), in, dim, rlen);
+}
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in) {
+    return kernel::ireduceAll<T, op>(loc, in);
 }
+
+#define INSTANTIATE(ROp, T)                                           \
+    template void ireduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim); \
+    template void rreduce<ROp, T>(Array<T> & out, Array<uint> & loc,  \
+                                  const Array<T> &in, const int dim,  \
+                                  const Array<uint> &rlen);           \
+    template T ireduce_all<ROp, T>(unsigned *loc, const Array<T> &in);
+
+// min
+INSTANTIATE(af_min_t, float)
+INSTANTIATE(af_min_t, double)
+INSTANTIATE(af_min_t, cfloat)
+INSTANTIATE(af_min_t, cdouble)
+INSTANTIATE(af_min_t, int)
+INSTANTIATE(af_min_t, uint)
+INSTANTIATE(af_min_t, intl)
+INSTANTIATE(af_min_t, uintl)
+INSTANTIATE(af_min_t, char)
+INSTANTIATE(af_min_t, schar)
+INSTANTIATE(af_min_t, uchar)
+INSTANTIATE(af_min_t, short)
+INSTANTIATE(af_min_t, ushort)
+INSTANTIATE(af_min_t, half)
+
+// max
+INSTANTIATE(af_max_t, float)
+INSTANTIATE(af_max_t, double)
+INSTANTIATE(af_max_t, cfloat)
+INSTANTIATE(af_max_t, cdouble)
+INSTANTIATE(af_max_t, int)
+INSTANTIATE(af_max_t, uint)
+INSTANTIATE(af_max_t, intl)
+INSTANTIATE(af_max_t, uintl)
+INSTANTIATE(af_max_t, char)
+INSTANTIATE(af_max_t, schar)
+INSTANTIATE(af_max_t, uchar)
+INSTANTIATE(af_max_t, short)
+INSTANTIATE(af_max_t, ushort)
+INSTANTIATE(af_max_t, half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/ireduce.hpp b/src/backend/opencl/ireduce.hpp
index 2a1059aaad..1b60a7a745 100644
--- a/src/backend/opencl/ireduce.hpp
+++ b/src/backend/opencl/ireduce.hpp
@@ -7,16 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace opencl
-{
-    template<af_op_t op, typename T>
-    void ireduce(Array<T> &out, Array<uint> &loc,
-                 const Array<T> &in, const int dim);
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename T>
+void ireduce(Array<T> &out, Array<uint> &loc, const Array<T> &in,
+             const int dim);
 
-    template<af_op_t op, typename T>
-    T ireduce_all(unsigned *loc, const Array<T> &in);
-}
+template<af_op_t op, typename T>
+void rreduce(Array<T> &out, Array<uint> &loc, const Array<T> &in, const int dim,
+             const Array<uint> &rlen);
+
+template<af_op_t op, typename T>
+T ireduce_all(unsigned *loc, const Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/jit.cpp b/src/backend/opencl/jit.cpp
index a00bfba378..c0858c3cc5 100644
--- a/src/backend/opencl/jit.cpp
+++ b/src/backend/opencl/jit.cpp
@@ -7,205 +7,493 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <map>
-#include <stdexcept>
+#include <common/compile_module.hpp>
+#include <common/deterministicHash.hpp>
+#include <common/jit/ModdimNode.hpp>
+#include <common/jit/Node.hpp>
+#include <common/jit/NodeIterator.hpp>
+#include <common/kernel_cache.hpp>
+#include <common/util.hpp>
 #include <copy.hpp>
-#include <JIT/Node.hpp>
-#include <kernel_headers/jit.hpp>
-#include <program.hpp>
-#include <dispatch.hpp>
+#include <device_manager.hpp>
 #include <err_opencl.hpp>
-#include <functional>
-
-namespace opencl
-{
+#include <jit/BufferNode.hpp>
+#include <jit/ShiftNode.hpp>
+#include <kernel_headers/jit.hpp>
+#include <threadsMgt.hpp>
+#include <type_util.hpp>
+#include <af/dim4.hpp>
+#include <af/opencl.h>
 
-using JIT::Node;
+#include <algorithm>
+#include <cstdio>
+#include <functional>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+using arrayfire::common::findModule;
+using arrayfire::common::getFuncName;
+using arrayfire::common::ModdimNode;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ids;
+using arrayfire::common::Node_map_t;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::NodeIterator;
+using arrayfire::common::saveKernel;
+using arrayfire::opencl::jit::ShiftNode;
 
-using cl::Buffer;
-using cl::Program;
 using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
 using cl::NDRange;
+using cl::NullRange;
+
+using std::equal;
+using std::find_if;
+using std::for_each;
+using std::shared_ptr;
 using std::string;
 using std::stringstream;
+using std::to_string;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+using jit::BufferNode;
+
+string getKernelString(const string& funcName, const vector<Node*>& full_nodes,
+                       const vector<Node_ids>& full_ids,
+                       const vector<int>& output_ids, const bool is_linear,
+                       const bool loop0, const bool loop1, const bool loop3) {
+    // Common OpenCL code
+    // This part of the code does not change with the kernel.
+
+    static const char* kernelVoid = R"JIT(
+__kernel void )JIT";
+    static const char* dimParams  = "KParam oInfo";
+    static const char* blockStart = "{";
+    static const char* blockEnd   = "\n}\n";
+
+    static const char* linearInit = R"JIT(
+   int idx = get_global_id(0);
+   const int idxEnd = oInfo.dims[0];
+   if (idx < idxEnd) {
+)JIT";
+    static const char* linearEnd  = R"JIT(
+   })JIT";
+
+    static const char* linearLoop0Start = R"JIT(
+        const int idxID0Inc = get_global_size(0);
+        do {)JIT";
+    static const char* linearLoop0End   = R"JIT(
+            idx += idxID0Inc;
+            if (idx >= idxEnd) break;
+        } while (true);)JIT";
+
+    // ///////////////////////////////////////////////
+    // oInfo = output optimized information (dims, strides, offset).
+    //         oInfo has removed dimensions, to optimize block scheduling
+    // iInfo = input internal information (dims, strides, offset)
+    //         iInfo has the original dimensions, auto generated code
+    //
+    // Loop3 is fastest and becomes the inner loop, since
+    //      - #of loops is known upfront
+    // Loop1 is used for extra dynamic looping (writing into cache)
+    // All loops are conditional and independent
+    // Format Loop1 & Loop3
+    // ////////////////////////////
+    //  *stridedLoopNInit               // Always
+    //  *stridedLoop1Init               // Conditional
+    //  *stridedLoop2Init               // Conditional
+    //  *stridedLoop3Init               // Conditional
+    //  *stridedLoop1Start              // Conditional
+    //      *stridedLoop3Start          // Conditional
+    //          auto generated code     // Always
+    //      *stridedLoop3End            // Conditional
+    //  *stridedLoop1End                // Conditional
+    //  *stridedEnd                     // Always
+    //
+    // format loop0 (Vector only)
+    // //////////////////////////
+    // *stridedLoop0Init                // Always
+    // *stridedLoop0Start               // Always
+    //      auto generated code         // Always
+    // *stridedLoop0End                 // Always
+    // *stridedEnd                      // Always
+
+    static const char* stridedLoop0Init  = R"JIT(
+    int id0 = get_global_id(0);
+    const int id0End = oInfo.dims[0];
+    if (id0 < id0End) {
+#define id1 0
+#define id2 0
+#define id3 0
+        const int ostrides0 = oInfo.strides[0];
+        int idx = ostrides0*id0;)JIT";
+    static const char* stridedLoop0Start = R"JIT(
+        const int id0Inc = get_global_size(0);
+        const int idxID0Inc = ostrides0*id0Inc;
+        do {)JIT";
+    static const char* stridedLoop0End   = R"JIT(
+            id0 += id0Inc;
+            if (id0 >= id0End) break;
+            idx += idxID0Inc;
+        } while (true);)JIT";
+
+    // -------------
+    static const char* stridedLoopNInit = R"JIT(
+    int id0 = get_global_id(0);
+    int id1 = get_global_id(1);
+    const int id0End = oInfo.dims[0];
+    const int id1End = oInfo.dims[1];
+    if ((id0 < id0End) & (id1 < id1End)) {
+        const int id2 = get_global_id(2);
+#define id3 0
+        const int ostrides1 = oInfo.strides[1];
+        int idx = (int)oInfo.strides[0]*id0 + ostrides1*id1 + (int)oInfo.strides[2]*id2;)JIT";
+    static const char* stridedEnd       = R"JIT(
+    })JIT";
+
+    static const char* stridedLoop3Init  = R"JIT(
+#undef id3
+        int id3 = 0;
+        const int id3End = oInfo.dims[3];
+        const int idxID3Inc = oInfo.strides[3];)JIT";
+    static const char* stridedLoop3Start = R"JIT(
+                const int idxBaseID3 = idx;
+                do {)JIT";
+    static const char* stridedLoop3End   = R"JIT(
+                    ++id3;
+                    if (id3 == id3End) break;
+                    idx += idxID3Inc;
+                } while (true);
+                id3 = 0;
+                idx = idxBaseID3;)JIT";
+
+    static const char* stridedLoop1Init  = R"JIT(
+        const int id1Inc = get_global_size(1);
+        const int idxID1Inc = id1Inc * ostrides1;)JIT";
+    static const char* stridedLoop1Start = R"JIT(
+        do {)JIT";
+    static const char* stridedLoop1End   = R"JIT(
+            id1 += id1Inc;
+            if (id1 >= id1End) break;
+            idx += idxID1Inc;
+        } while (true);)JIT";
+
+    // Reuse stringstreams, because they are very costly during initialization
+    thread_local stringstream inParamStream;
+    thread_local stringstream outParamStream;
+    thread_local stringstream outOffsetStream;
+    thread_local stringstream inOffsetsStream;
+    thread_local stringstream opsStream;
+    thread_local stringstream kerStream;
+
+    string ret;
+    try {
+        int oid{0};
+        for (size_t i{0}; i < full_nodes.size(); i++) {
+            const auto& node{full_nodes[i]};
+            const auto& ids_curr{full_ids[i]};
+            // Generate input parameters, only needs current id
+            node->genParams(inParamStream, ids_curr.id, is_linear);
+            // Generate input offsets, only needs current id
+            node->genOffsets(inOffsetsStream, ids_curr.id, is_linear);
+            // Generate the core function body, needs children ids as well
+            node->genFuncs(opsStream, ids_curr);
+            for (size_t output_idx{0}; output_idx < output_ids.size();
+                 ++output_idx) {
+                if (output_ids[output_idx] == ids_curr.id) {
+                    outParamStream
+                        << "__global " << full_nodes[ids_curr.id]->getTypeStr()
+                        << " *out" << oid << ", int offset" << oid << ",\n";
+                    // Apply output offset
+                    outOffsetStream << "\nout" << oid << " += offset" << oid
+                                    << ';';
+                    // Generate code to write the output
+                    opsStream << "out" << output_idx << "[idx] = val"
+                              << ids_curr.id << ";\n";
+                    ++oid;
+                }
+            }
+        }
 
-static string getFuncName(Node *node, bool is_linear, bool *is_double)
-{
-    node->setId(0);
-
-    stringstream hashName;
-    stringstream funcName;
-
-    if (is_linear) {
-        funcName << "L_";
-    } else {
-        funcName << "G_";
+        kerStream << kernelVoid << funcName << "(\n"
+                  << inParamStream.str() << outParamStream.str() << dimParams
+                  << ")" << blockStart;
+        if (is_linear) {
+            kerStream << linearInit << inOffsetsStream.str()
+                      << outOffsetStream.str() << '\n';
+            if (loop0) kerStream << linearLoop0Start;
+            kerStream << "\n\n" << opsStream.str();
+            if (loop0) kerStream << linearLoop0End;
+            kerStream << linearEnd;
+        } else {
+            if (loop0) {
+                kerStream << stridedLoop0Init << outOffsetStream.str() << '\n'
+                          << stridedLoop0Start;
+            } else {
+                kerStream << stridedLoopNInit << outOffsetStream.str() << '\n';
+                if (loop3) kerStream << stridedLoop3Init;
+                if (loop1) kerStream << stridedLoop1Init << stridedLoop1Start;
+                if (loop3) kerStream << stridedLoop3Start;
+            }
+            kerStream << "\n\n" << inOffsetsStream.str() << opsStream.str();
+            if (loop3) kerStream << stridedLoop3End;
+            if (loop1) kerStream << stridedLoop1End;
+            if (loop0) kerStream << stridedLoop0End;
+            kerStream << stridedEnd;
+        }
+        kerStream << blockEnd;
+        ret = kerStream.str();
+    } catch (...) {
+        // Prepare for next round
+        inParamStream.str("");
+        outParamStream.str("");
+        inOffsetsStream.str("");
+        outOffsetStream.str("");
+        opsStream.str("");
+        kerStream.str("");
+        throw;
     }
 
-    std::string outName = node->getNameStr();
-    funcName << outName;
-
-    node->genKerName(funcName);
-
-    string nameStr = funcName.str();
-    funcName << nameStr;
-
-    nameStr = nameStr + outName;
-    string dblChars = "dDzZ";
-    size_t loc = nameStr.find_first_of(dblChars);
-    *is_double = (loc != std::string::npos);
+    // Prepare for next round
+    inParamStream.str("");
+    outParamStream.str("");
+    inOffsetsStream.str("");
+    outOffsetStream.str("");
+    opsStream.str("");
+    kerStream.str("");
 
-    std::hash<std::string> hash_fn;
-    hashName << "KER" << hash_fn(funcName.str());
-    return hashName.str();
+    return ret;
 }
 
-static string getKernelString(string funcName, Node *node, bool is_linear)
-{
-    stringstream kerStream;
-    int id = node->getId();
-
-    kerStream << "__kernel void" << "\n";
-
-    kerStream << funcName;
-    kerStream << "(" << "\n";
-
-    node->genParams(kerStream);
-    kerStream << "__global " << node->getTypeStr() << " *out, KParam oInfo," << "\n";
-    kerStream << "uint groups_0, uint groups_1)" << "\n";
-
-    kerStream << "{" << "\n" << "\n";
-
-    if (!is_linear) {
-
-        kerStream << "uint id2 = get_group_id(0) / groups_0;" << "\n";
-        kerStream << "uint id3 = get_group_id(1) / groups_1;" << "\n";
-        kerStream << "uint groupId_0 = get_group_id(0) - id2 * groups_0;" << "\n";
-        kerStream << "uint groupId_1 = get_group_id(1) - id3 * groups_1;" << "\n";
-        kerStream << "uint id1 = get_local_id(1) + groupId_1 * get_local_size(1);" << "\n";
-        kerStream << "uint id0 = get_local_id(0) + groupId_0 * get_local_size(0);" << "\n";
-        kerStream << "\n";
-
-        kerStream << "bool cond = " << "\n";
-        kerStream << "id0 < oInfo.dims[0] && " << "\n";
-        kerStream << "id1 < oInfo.dims[1] && " << "\n";
-        kerStream << "id2 < oInfo.dims[2] && " << "\n";
-        kerStream << "id3 < oInfo.dims[3];" << "\n" << "\n";
-
-        kerStream << "if (!cond) return;" << "\n" << "\n";
-
-        kerStream << "int idx = ";
-        kerStream << "oInfo.strides[3] * id3 + oInfo.strides[2] * id2 + ";
-        kerStream << "oInfo.strides[1] * id1 + id0 + oInfo.offset;" << "\n" << "\n";
-
-    } else {
-
-        kerStream << "uint groupId  = get_group_id(1) * get_num_groups(0) + get_group_id(0);" << "\n";
-        kerStream << "uint threadId = get_local_id(0);" << "\n";
-        kerStream << "int idx = groupId * get_local_size(0) * get_local_size(1) + threadId;" << "\n";
-        kerStream << "if (idx >= oInfo.dims[3] * oInfo.strides[3]) return;" << "\n";
+cl::Kernel getKernel(const vector<Node*>& output_nodes,
+                     const vector<int>& output_ids,
+                     const vector<Node*>& full_nodes,
+                     const vector<Node_ids>& full_ids, const bool is_linear,
+                     const bool loop0, const bool loop1, const bool loop3) {
+    const string funcName{getFuncName(output_nodes, output_ids, full_nodes,
+                                      full_ids, is_linear, loop0, loop1, false,
+                                      loop3)};
+    // A forward lookup in module cache helps avoid recompiling the JIT
+    // source generated from identical JIT-trees.
+    const auto entry{
+        findModule(getActiveDeviceId(), deterministicHash(funcName))};
+
+    if (!entry) {
+        const string jitKer{getKernelString(funcName, full_nodes, full_ids,
+                                            output_ids, is_linear, loop0, loop1,
+                                            loop3)};
+        saveKernel(funcName, jitKer, ".cl");
+
+        const common::Source jitKer_cl_src{
+            jitKer.data(), jitKer.size(),
+            deterministicHash(jitKer.data(), jitKer.size())};
+        const cl::Device device{getDevice()};
+        vector<string> options;
+        if (isDoubleSupported(device)) {
+            options.emplace_back(DefineKey(USE_DOUBLE));
+        }
+        if (isHalfSupported(device)) {
+            options.emplace_back(DefineKey(USE_HALF));
+        }
+        return common::getKernel(funcName, {{jit_cl_src, jitKer_cl_src}}, {},
+                                 options, true)
+            .get();
     }
-
-    node->genOffsets(kerStream, is_linear);
-    node->genFuncs(kerStream);
-    kerStream << "\n";
-
-    kerStream << "out[idx] = val"
-           << id << ";"  << "\n";
-
-    kerStream << "}" << "\n";
-
-    return kerStream.str();
+    return common::getKernel(entry, funcName, true).get();
 }
 
-static Kernel getKernel(Node *node, bool is_linear)
-{
-
-    bool is_dbl = false;
-    string funcName = getFuncName(node, is_linear, &is_dbl);
-
-    typedef struct {
-        Program* prog;
-        Kernel* ker;
-    } kc_entry_t;
-
-    typedef std::map<string, kc_entry_t> kc_t;
-    static kc_t kernelCaches[DeviceManager::MAX_DEVICES];
-    int device = getActiveDeviceId();
-
-    kc_t::iterator idx = kernelCaches[device].find(funcName);
-    kc_entry_t entry;
+void evalNodes(vector<Param>& outputs, const vector<Node*>& output_nodes) {
+    const unsigned nrOutputs{static_cast<unsigned>(outputs.size())};
+    if (nrOutputs == 0) { return; }
+    assert(outputs.size() == output_nodes.size());
+    KParam& out_info{outputs[0].info};
+    dim_t* outDims{out_info.dims};
+    dim_t* outStrides{out_info.strides};
+#ifndef NDEBUG
+    for_each(std::next(begin(outputs)), end(outputs),
+             [outDims, outStrides](Param& output) {
+                 assert(equal(output.info.dims, output.info.dims + AF_MAX_DIMS,
+                              outDims) &&
+                        equal(output.info.strides,
+                              output.info.strides + AF_MAX_DIMS, outStrides));
+             });
+#endif
+
+    dim_t ndims{outDims[3] > 1   ? 4
+                : outDims[2] > 1 ? 3
+                : outDims[1] > 1 ? 2
+                : outDims[0] > 0 ? 1
+                                 : 0};
+    bool is_linear{true};
+    dim_t numOutElems{1};
+    for (dim_t dim{0}; dim < ndims; ++dim) {
+        is_linear &= (numOutElems == outStrides[dim]);
+        numOutElems *= outDims[dim];
+    }
+    if (numOutElems == 0) { return; }
+
+    // Use thread_local storage so the allocations are reused on every call.
+    thread_local Node_map_t nodes;
+    thread_local vector<Node*> full_nodes;
+    thread_local vector<Node_ids> full_ids;
+    thread_local vector<int> output_ids;
+
+    // Reserve some space to improve performance at smaller sizes
+    constexpr size_t CAP{1024};
+    if (full_nodes.capacity() < CAP) {
+        nodes.reserve(CAP);
+        output_ids.reserve(10);
+        full_nodes.reserve(CAP);
+        full_ids.reserve(CAP);
+    }
 
-    if (idx == kernelCaches[device].end()) {
-        string jit_ker = getKernelString(funcName, node, is_linear);
+    const af::dtype outputType{output_nodes[0]->getType()};
+    const size_t outputSizeofType{size_of(outputType)};
+    for (Node* node : output_nodes) {
+        assert(node->getType() == outputType);
+        const int id{node->getNodesMap(nodes, full_nodes, full_ids)};
+        output_ids.push_back(id);
+    }
 
-        const char *ker_strs[] = {jit_cl, jit_ker.c_str()};
-        const int ker_lens[] = {jit_cl_len, (int)jit_ker.size()};
-        cl::Program prog;
-        buildProgram(prog, 2, ker_strs, ker_lens, is_dbl ? string(" -D USE_DOUBLE") :  string(""));
-        entry.prog = new cl::Program(prog);
-        entry.ker = new Kernel(*entry.prog, funcName.c_str());
+    const size_t outputSize{numOutElems * outputSizeofType * nrOutputs};
+    size_t inputSize{0};
+    unsigned nrInputs{0};
+    bool moddimsFound{false};
+    for (const Node* node : full_nodes) {
+        is_linear &= node->isLinear(outDims);
+        moddimsFound |= (node->getOp() == af_moddims_t);
+        if (node->isBuffer()) {
+            ++nrInputs;
+            inputSize += node->getBytes();
+        }
+    }
+    const size_t totalSize{inputSize + outputSize};
 
-        kernelCaches[device][funcName] = entry;
+    bool emptyColumnsFound{false};
+    if (is_linear) {
+        outDims[0]    = numOutElems;
+        outDims[1]    = 1;
+        outDims[2]    = 1;
+        outDims[3]    = 1;
+        outStrides[0] = 1;
+        outStrides[1] = numOutElems;
+        outStrides[2] = numOutElems;
+        outStrides[3] = numOutElems;
+        ndims         = 1;
     } else {
-        entry = idx->second;
+        emptyColumnsFound = ndims > (outDims[0] == 1   ? 1
+                                     : outDims[1] == 1 ? 2
+                                     : outDims[2] == 1 ? 3
+                                                       : 4);
     }
 
-    return *entry.ker;
-}
-
-void evalNodes(Param &out, Node *node)
-{
-    try {
-
-        bool is_linear = node->isLinear(out.info.dims);
-        Kernel ker = getKernel(node, is_linear);
-
-        uint local_0 = 1;
-        uint local_1 = 1;
-        uint global_0 = 1;
-        uint global_1 = 1;
-        uint groups_0 = 1;
-        uint groups_1 = 1;
-
-        if (is_linear) {
-            local_0 = 256;
-            uint out_elements = out.info.dims[3] * out.info.strides[3];
-            uint groups = divup(out_elements, local_0);
-
-            global_1 = divup(groups,     1000) * local_1;
-            global_0 = divup(groups, global_1) * local_0;
-
-        } else {
-            local_0 = 32;
-            local_1 = 8;
+    // Keep at function scope, so that the cloned nodes stay alive for later
+    // reference when moddims operations or column elimination take place
+    vector<Node_ptr> node_clones;
+    // Avoid cloning/copying unless moddims or empty columns are present
+    if (moddimsFound | emptyColumnsFound) {
+        node_clones.reserve(full_nodes.size());
+        for (Node* node : full_nodes) {
+            node_clones.emplace_back(node->clone());
+        }
 
-            groups_0 = divup(out.info.dims[0], local_0);
-            groups_1 = divup(out.info.dims[1], local_1);
+        for (const Node_ids& ids : full_ids) {
+            auto& children{node_clones[ids.id]->m_children};
+            for (int i{0}; i < Node::kMaxChildren && children[i] != nullptr;
+                 i++) {
+                children[i] = node_clones[ids.child_ids[i]];
+            }
+        }
 
-            global_0 = groups_0 * local_0 * out.info.dims[2];
-            global_1 = groups_1 * local_1 * out.info.dims[3];
+        if (moddimsFound) {
+            const auto isModdim{[](const Node_ptr& ptr) {
+                return ptr->getOp() == af_moddims_t;
+            }};
+            for (auto nodeIt{begin(node_clones)}, endIt{end(node_clones)};
+                 (nodeIt = find_if(nodeIt, endIt, isModdim)) != endIt;
+                 ++nodeIt) {
+                const ModdimNode* mn{static_cast<ModdimNode*>(nodeIt->get())};
+
+                const auto new_strides{calcStrides(mn->m_new_shape)};
+                const auto isBuffer{
+                    [](const Node& node) { return node.isBuffer(); }};
+                for (NodeIterator<> it{nodeIt->get()}, end{NodeIterator<>()};
+                     (it = find_if(it, end, isBuffer)) != end; ++it) {
+                    BufferNode* buf{static_cast<BufferNode*>(&(*it))};
+                    buf->m_param.dims[0]    = mn->m_new_shape[0];
+                    buf->m_param.dims[1]    = mn->m_new_shape[1];
+                    buf->m_param.dims[2]    = mn->m_new_shape[2];
+                    buf->m_param.dims[3]    = mn->m_new_shape[3];
+                    buf->m_param.strides[0] = new_strides[0];
+                    buf->m_param.strides[1] = new_strides[1];
+                    buf->m_param.strides[2] = new_strides[2];
+                    buf->m_param.strides[3] = new_strides[3];
+                }
+            }
+        }
+        if (emptyColumnsFound) {
+            common::removeEmptyDimensions<Param, BufferNode, ShiftNode>(
+                outputs, node_clones);
         }
 
-        NDRange local(local_0, local_1);
-        NDRange global(global_0, global_1);
+        full_nodes.clear();
+        for (Node_ptr& node : node_clones) { full_nodes.push_back(node.get()); }
+    }
 
-        int args = node->setArgs(ker, 0);
-        ker.setArg(args + 0, *out.data);
-        ker.setArg(args + 1,  out.info);
-        ker.setArg(args + 2,  groups_0);
-        ker.setArg(args + 3,  groups_1);
+    threadsMgt<dim_t> th(outDims, ndims, nrInputs, nrOutputs, totalSize,
+                         outputSizeofType);
+    auto ker = getKernel(output_nodes, output_ids, full_nodes, full_ids,
+                         is_linear, th.loop0, th.loop1, th.loop3);
+    const cl::NDRange local{th.genLocal(ker)};
+    const cl::NDRange global{th.genGlobal(local)};
+
+    int nargs{0};
+    for (const Node* node : full_nodes) {
+        nargs = node->setArgs(
+            nargs, is_linear,
+            [&ker](int id, const void* ptr, size_t arg_size, bool is_buffer) {
+                ker.setArg(id, arg_size, ptr);
+            });
+    }
 
-        getQueue().enqueueNDRangeKernel(ker, cl::NullRange, global, local);
+    // Set output parameters
+    for (const auto& output : outputs) {
+        ker.setArg(nargs++, *(output.data));
+        ker.setArg(nargs++, static_cast<int>(output.info.offset));
+    }
 
-    } catch (const cl::Error &ex) {
-        CL_TO_AF_ERROR(ex);
+    // Set dimensions
+    // All outputs are asserted to be of same size
+    // Just use the size from the first output
+    ker.setArg(nargs++, out_info);
+
+    {
+        using namespace opencl::kernel_logger;
+        AF_TRACE(
+            "Launching : Dims: [{},{},{},{}] Global: [{},{},{}] Local: "
+            "[{},{},{}] threads: {}",
+            outDims[0], outDims[1], outDims[2], outDims[3], global[0],
+            global[1], global[2], local[0], local[1], local[2],
+            global[0] * global[1] * global[2]);
     }
+    getQueue().enqueueNDRangeKernel(ker, NullRange, global, local);
 
+    // Reset the thread local vectors
+    nodes.clear();
+    output_ids.clear();
+    full_nodes.clear();
+    full_ids.clear();
 }
 
+void evalNodes(Param& out, Node* node) {
+    vector<Param> outputs{out};
+    vector<Node*> nodes{node};
+    return evalNodes(outputs, nodes);
 }
+
+}  // namespace opencl
+}  // namespace arrayfire
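
The dimension and stride walk at the top of `evalNodes` can be sketched in isolation. The block below is a minimal host-side sketch, not the ArrayFire implementation: `Shape`, `LayoutInfo` and `analyzeLayout` are hypothetical names, and `dim_t` is assumed to be a 64-bit signed integer. It reproduces the `ndims` ternary chain and the check that each stride equals the number of elements spanned by the lower dimensions.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ArrayFire's dim_t and 4-D shape arrays.
using dim_t = std::int64_t;
using Shape = std::array<dim_t, 4>;

struct LayoutInfo {
    int ndims;       // number of significant dimensions (0..4)
    bool is_linear;  // true when the strides describe a dense, packed layout
    dim_t numElems;  // product of the significant dimensions
};

// Walks the significant dimensions and checks that every stride equals the
// element count of all lower dimensions, as the loop in evalNodes does.
inline LayoutInfo analyzeLayout(const Shape& dims, const Shape& strides) {
    const int ndims = dims[3] > 1   ? 4
                      : dims[2] > 1 ? 3
                      : dims[1] > 1 ? 2
                      : dims[0] > 0 ? 1
                                    : 0;
    bool is_linear = true;
    dim_t numElems = 1;
    for (int d = 0; d < ndims; ++d) {
        assert(dims[d] >= 0);
        is_linear = is_linear && (numElems == strides[d]);
        numElems *= dims[d];
    }
    return {ndims, is_linear, numElems};
}
```

For a packed 4x3 array with strides `{1, 4, 12, 12}` the layout is reported as linear; padding the first dimension to a stride of 8 makes the second stride check fail, so the non-linear kernel variant would be selected.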
diff --git a/src/backend/opencl/jit/BufferNode.hpp b/src/backend/opencl/jit/BufferNode.hpp
new file mode 100644
index 0000000000..14521030f7
--- /dev/null
+++ b/src/backend/opencl/jit/BufferNode.hpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <common/jit/BufferNodeBase.hpp>
+#include "../kernel/KParam.hpp"
+
+#include <memory>
+
+namespace arrayfire {
+namespace opencl {
+namespace jit {
+using BufferNode = common::BufferNodeBase<std::shared_ptr<cl::Buffer>, KParam>;
+}  // namespace jit
+}  // namespace opencl
+
+namespace common {
+
+template<typename DataType, typename ParamType>
+bool BufferNodeBase<DataType, ParamType>::operator==(
+    const BufferNodeBase<DataType, ParamType> &other) const noexcept {
+    // clang-format off
+    return m_data.get() == other.m_data.get() &&
+           m_bytes == other.m_bytes &&
+           m_param.offset == other.m_param.offset &&
+           m_linear_buffer == other.m_linear_buffer &&
+           m_param.dims[0] == other.m_param.dims[0] &&
+           m_param.dims[1] == other.m_param.dims[1] &&
+           m_param.dims[2] == other.m_param.dims[2] &&
+           m_param.dims[3] == other.m_param.dims[3] &&
+           m_param.strides[0] == other.m_param.strides[0] &&
+           m_param.strides[1] == other.m_param.strides[1] &&
+           m_param.strides[2] == other.m_param.strides[2] &&
+           m_param.strides[3] == other.m_param.strides[3];
+    // clang-format on
+}
+
+}  // namespace common
+}  // namespace arrayfire
diff --git a/src/backend/opencl/jit/ShiftNode.hpp b/src/backend/opencl/jit/ShiftNode.hpp
new file mode 100644
index 0000000000..8132105faf
--- /dev/null
+++ b/src/backend/opencl/jit/ShiftNode.hpp
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <common/jit/ShiftNodeBase.hpp>
+#include <jit/BufferNode.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace jit {
+
+using ShiftNode = common::ShiftNodeBase<BufferNode>;
+
+}  // namespace jit
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/jit/kernel_generators.hpp b/src/backend/opencl/jit/kernel_generators.hpp
new file mode 100644
index 0000000000..0228e7173f
--- /dev/null
+++ b/src/backend/opencl/jit/kernel_generators.hpp
@@ -0,0 +1,112 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <sstream>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+
+namespace {
+
+/// Writes the kernel parameter declaration for buffer `id` to the stream
+inline void generateParamDeclaration(std::stringstream& kerStream, int id,
+                                     bool is_linear,
+                                     const std::string& m_type_str) {
+    if (is_linear) {
+        kerStream << "__global " << m_type_str << " *in" << id
+                  << ", dim_t iInfo" << id << "_offset, \n";
+    } else {
+        kerStream << "__global " << m_type_str << " *in" << id
+                  << ", KParam iInfo" << id << ", \n";
+    }
+}
+
+/// Calls the setArg function to set the arguments for a kernel call
+inline int setBufferKernelArguments(
+    int start_id, bool is_linear,
+    std::function<void(int id, const void* ptr, size_t arg_size,
+                       bool is_buffer)>& setArg,
+    const std::shared_ptr<cl::Buffer>& ptr, const KParam& info) {
+    setArg(start_id + 0, static_cast<const void*>(&ptr.get()->operator()()),
+           sizeof(cl_mem), true);
+    if (is_linear) {
+        setArg(start_id + 1, static_cast<const void*>(&info.offset),
+               sizeof(dim_t), true);
+    } else {
+        setArg(start_id + 1, static_cast<const void*>(&info), sizeof(KParam),
+               true);
+    }
+    return start_id + 2;
+}
+
+/// Generates the code to calculate the offsets for a buffer
+inline void generateBufferOffsets(std::stringstream& kerStream, int id,
+                                  bool is_linear, const std::string& type_str) {
+    UNUSED(type_str);
+    const std::string idx_str  = std::string("idx") + std::to_string(id);
+    const std::string info_str = std::string("iInfo") + std::to_string(id);
+    const std::string in_str   = std::string("in") + std::to_string(id);
+
+    if (is_linear) {
+        kerStream << in_str << " += " << info_str << "_offset;\n"
+                  << "#define " << idx_str << " idx\n";
+    } else {
+        kerStream << "int " << idx_str << " = id0*(id0<" << info_str
+                  << ".dims[0])*" << info_str << ".strides[0] + id1*(id1<"
+                  << info_str << ".dims[1])*" << info_str
+                  << ".strides[1] + id2*(id2<" << info_str << ".dims[2])*"
+                  << info_str << ".strides[2] + id3*(id3<" << info_str
+                  << ".dims[3])*" << info_str << ".strides[3] + " << info_str
+                  << ".offset;\n";
+    }
+}
+
+/// Generates the code to read a buffer and store it in a local variable
+inline void generateBufferRead(std::stringstream& kerStream, int id,
+                               const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "[idx" << id
+              << "];\n";
+}
+
+inline void generateShiftNodeOffsets(std::stringstream& kerStream, int id,
+                                     bool is_linear,
+                                     const std::string& type_str) {
+    UNUSED(is_linear);
+    UNUSED(type_str);
+    const std::string idx_str  = std::string("idx") + std::to_string(id);
+    const std::string info_str = std::string("iInfo") + std::to_string(id);
+    const std::string id_str = std::string("sh_id_") + std::to_string(id) + '_';
+    const std::string shift_str =
+        std::string("shift") + std::to_string(id) + '_';
+
+    for (int i = 0; i < 4; i++) {
+        kerStream << "int " << id_str << i << " = __circular_mod(id" << i
+                  << " + " << shift_str << i << ", " << info_str << ".dims["
+                  << i << "]);\n";
+    }
+    kerStream << "int " << idx_str << " = " << id_str << "0*(" << id_str << "0<"
+              << info_str << ".dims[0])*" << info_str << ".strides[0] + "
+              << id_str << "1*(" << id_str << "1<" << info_str << ".dims[1])*"
+              << info_str << ".strides[1] + " << id_str << "2*(" << id_str
+              << "2<" << info_str << ".dims[2])*" << info_str
+              << ".strides[2] + " << id_str << "3*(" << id_str << "3<"
+              << info_str << ".dims[3])*" << info_str << ".strides[3] + "
+              << info_str << ".offset;\n";
+}
+
+inline void generateShiftNodeRead(std::stringstream& kerStream, int id,
+                                  const std::string& type_str) {
+    kerStream << type_str << " val" << id << " = in" << id << "[idx" << id
+              << "];\n";
+}
+}  // namespace
+}  // namespace opencl
+}  // namespace arrayfire
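
The index expression emitted by the non-linear branch of `generateBufferOffsets` can be emulated on the host to see what it computes. This is a sketch under assumptions: `broadcastIndex` and `Shape` are hypothetical names, and `dim_t` is taken to be a 64-bit signed integer. Each term `idN*(idN<dims[N])*strides[N]` contributes zero whenever the thread index lies outside the buffer's extent in that dimension, which appears to be how a buffer with a size-1 dimension is reused across a larger output.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for ArrayFire's dim_t and 4-D shape arrays.
using dim_t = std::int64_t;
using Shape = std::array<dim_t, 4>;

// Host-side emulation of the generated expression
//   idx = sum_d id[d] * (id[d] < dims[d]) * strides[d] + offset
inline dim_t broadcastIndex(const Shape& id, const Shape& dims,
                            const Shape& strides, dim_t offset) {
    dim_t idx = offset;
    for (int d = 0; d < 4; ++d) {
        assert(id[d] >= 0);
        // The comparison yields 0 or 1 and masks out-of-extent dimensions.
        const dim_t inRange = (id[d] < dims[d]) ? 1 : 0;
        idx += id[d] * inRange * strides[d];
    }
    return idx;
}
```

With a column vector of dims `{4, 1, 1, 1}`, the thread at `{2, 3, 0, 0}` reads element 2: the second coordinate is masked out because `3 < 1` is false.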
diff --git a/src/backend/opencl/join.cpp b/src/backend/opencl/join.cpp
index a02fb2fd6a..7975ecfb5a 100644
--- a/src/backend/opencl/join.cpp
+++ b/src/backend/opencl/join.cpp
@@ -8,195 +8,251 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
 #include <join.hpp>
-#include <kernel/join.hpp>
+#include <kernel/memcopy.hpp>
+
+#include <algorithm>
+#include <map>
 #include <stdexcept>
-#include <err_opencl.hpp>
+#include <vector>
 
-namespace opencl
-{
-    template<int dim>
-    af::dim4 calcOffset(const af::dim4 dims)
-    {
-        af::dim4 offset;
-        offset[0] = (dim == 0) ? dims[0] : 0;
-        offset[1] = (dim == 1) ? dims[1] : 0;
-        offset[2] = (dim == 2) ? dims[2] : 0;
-        offset[3] = (dim == 3) ? dims[3] : 0;
-        return offset;
-    }
+using af::dim4;
+using arrayfire::common::half;
+using arrayfire::common::Node;
+using arrayfire::common::Node_ptr;
+using std::vector;
 
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second)
-    {
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        af::dim4 fdims = first.dims();
-        af::dim4 sdims = second.dims();
-
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = fdims[i] + sdims[i];
-            } else {
-                odims[i] = fdims[i];
-            }
-        }
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> join(const int jdim, const Array<T> &first, const Array<T> &second) {
+    // All dimensions except join dimension must be equal
+    const dim4 &fdims{first.dims()};
+    const dim4 &sdims{second.dims()};
+    // Compute output dims
+    dim4 odims(fdims);
+    odims.dims[jdim] += sdims.dims[jdim];
+    Array<T> out = createEmptyArray<T>(odims);
 
-        Array<Tx> out = createEmptyArray<Tx>(odims);
-
-        af::dim4 zero(0,0,0,0);
-
-        switch(dim) {
-            case 0:
-                kernel::join<Tx, Tx, 0>(out, first,  zero);
-                kernel::join<Tx, Ty, 0>(out, second, calcOffset<0>(fdims));
-                break;
-            case 1:
-                kernel::join<Tx, Tx, 1>(out, first,  zero);
-                kernel::join<Tx, Ty, 1>(out, second, calcOffset<1>(fdims));
-                break;
-            case 2:
-                kernel::join<Tx, Tx, 2>(out, first,  zero);
-                kernel::join<Tx, Ty, 2>(out, second, calcOffset<2>(fdims));
-                break;
-            case 3:
-                kernel::join<Tx, Tx, 3>(out, first,  zero);
-                kernel::join<Tx, Ty, 3>(out, second, calcOffset<3>(fdims));
-                break;
+    // Top speed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  Top speed
+    //      --> size(in) <= L2CacheSize/2
+    // 2 arrays: top speeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - size(in) < L2CacheSize/2
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    //      - size(in) >= L2CacheSize/2
+    //          --> memcpy will only achieve very-large-array speed.  The
+    //              kernel will be called twice
+    if (fdims.dims[jdim] == sdims.dims[jdim]) {
+        const size_t L2CacheSize{getL2CacheSize(opencl::getDevice())};
+        if (!(first.isReady() || second.isReady()) ||
+            (fdims.elements() * sizeof(T) * 2 * 2 < L2CacheSize)) {
+            // Both arrays have the same size & everything fits into the
+            // cache, so copy both in 1 JIT kernel instead of issuing
+            // individual copies, which are always slower
+            const dim_t *outStrides{out.strides().dims};
+            vector<Param> outputs{
+                {out.get(),
+                 {{fdims.dims[0], fdims.dims[1], fdims.dims[2], fdims.dims[3]},
+                  {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+                  0}},
+                {out.get(),
+                 {{sdims.dims[0], sdims.dims[1], sdims.dims[2], sdims.dims[3]},
+                  {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+                  fdims.dims[jdim] * outStrides[jdim]}}};
+            // Extend the life of the returned node by saving the
+            // corresponding shared_ptr
+            const Node_ptr fNode{first.getNode()};
+            const Node_ptr sNode{second.getNode()};
+            vector<Node *> nodes{fNode.get(), sNode.get()};
+            evalNodes(outputs, nodes);
+            return out;
         }
+        // Fall through, because processing the arrays individually is faster
+    }
 
-        return out;
+    // Handle each array individually
+    if (first.isReady()) {
+        if (1LL + jdim >= first.ndims() && first.isLinear()) {
+            // first & out are linear
+            getQueue().enqueueCopyBuffer(
+                *first.get(), *out.get(), first.getOffset() * sizeof(T), 0,
+                first.elements() * sizeof(T), nullptr, nullptr);
+        } else {
+            kernel::memcopy<T>(*out.get(), out.strides(), *first.get(), fdims,
+                               first.strides(), first.getOffset(),
+                               first.ndims(), 0);
+        }
+    } else {
+        // Write the result directly in the out array
+        const dim_t *outStrides{out.strides().dims};
+        Param output{
+            out.get(),
+            {{fdims.dims[0], fdims.dims[1], fdims.dims[2], fdims.dims[3]},
+             {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+             0}};
+        evalNodes(output, first.getNode().get());
     }
 
-    template<typename T, int n_arrays>
-    void join_wrapper(const int dim, Array<T> &out, const std::vector<Array<T> > &inputs)
-    {
-        af::dim4 zero(0,0,0,0);
-        af::dim4 d = zero;
-
-        switch(dim) {
-            case 0:
-                kernel::join<T, T, 0>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 0>(out, inputs[i], calcOffset<0>(d));
-                }
-                break;
-            case 1:
-                kernel::join<T, T, 1>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 1>(out, inputs[i], calcOffset<1>(d));
-                }
-                break;
-            case 2:
-                kernel::join<T, T, 1>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 2>(out, inputs[i], calcOffset<2>(d));
-                }
-                break;
-            case 3:
-                kernel::join<T, T, 3>(out, inputs[0], zero);
-                for(int i = 1; i < n_arrays; i++) {
-                    d += inputs[i - 1].dims();
-                    kernel::join<T, T, 3>(out, inputs[i], calcOffset<3>(d));
-                }
-                break;
+    if (second.isReady()) {
+        if (1LL + jdim >= second.ndims() && second.isLinear()) {
+            // second & out are linear
+            getQueue().enqueueCopyBuffer(
+                *second.get(), *out.get(), second.getOffset() * sizeof(T),
+                (fdims.dims[jdim] * out.strides().dims[jdim]) * sizeof(T),
+                second.elements() * sizeof(T), nullptr, nullptr);
+        } else {
+            kernel::memcopy<T>(*out.get(), out.strides(), *second.get(), sdims,
+                               second.strides(), second.getOffset(),
+                               second.ndims(),
+                               fdims.dims[jdim] * out.strides().dims[jdim]);
         }
+    } else {
+        // Write the result directly in the out array
+        const dim_t *outStrides{out.strides().dims};
+        Param output{
+            out.get(),
+            {{sdims.dims[0], sdims.dims[1], sdims.dims[2], sdims.dims[3]},
+             {outStrides[0], outStrides[1], outStrides[2], outStrides[3]},
+             fdims.dims[jdim] * outStrides[jdim]}};
+        evalNodes(output, second.getNode().get());
     }
 
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T> > &inputs)
-    {
+    return out;
+}
 
-        // All dimensions except join dimension must be equal
-        // Compute output dims
-        af::dim4 odims;
-        const dim_t n_arrays = inputs.size();
-        std::vector<af::dim4> idims(n_arrays);
+template<typename T>
+void join(Array<T> &out, const int jdim, const vector<Array<T>> &inputs) {
+    class eval {
+       public:
+        vector<Param> outputs;
+        vector<Node_ptr> nodePtrs;
+        vector<Node *> nodes;
+        vector<const Array<T> *> ins;
+    };
+    std::map<dim_t, eval> evals;
+    const dim_t *ostrides{out.strides().dims};
+    const size_t L2CacheSize{getL2CacheSize(opencl::getDevice())};
 
-        dim_t dim_size = 0;
-        for(int i = 0; i < (int)idims.size(); i++) {
-            idims[i] = inputs[i].dims();
-            dim_size += idims[i][dim];
-        }
+    // Top speed is achieved when byte size(in+out) ~= L2CacheSize
+    //
+    // 1 array: memcpy always copies 1 array.  Top speed
+    //      --> size(in) <= L2CacheSize/2
+    // 2 arrays: top speeds
+    //      - size(in) < L2CacheSize/2/2
+    //          --> JIT can copy 2 arrays in parallel and is fastest
+    //              (condition: array sizes have to be identical)
+    //      - size(in) < L2CacheSize/2
+    //          --> memcpy will achieve highest speed, although the kernel
+    //              has to be called twice
+    //      - size(in) >= L2CacheSize/2
+    //          --> memcpy will only achieve very-large-array speed.  The
+    //              kernel will be called twice
 
-        for(int i = 0; i < 4; i++) {
-            if(i == dim) {
-                odims[i] = dim_size;
-            } else {
-                odims[i] = idims[0][i];
-            }
-        }
+    // Group all arrays according to size
+    dim_t outOffset{0};
+    for (const Array<T> &iArray : inputs) {
+        const dim_t *idims{iArray.dims().dims};
+        eval &e{evals[idims[jdim]]};
+        const Param output{
+            out.get(),
+            {{idims[0], idims[1], idims[2], idims[3]},
+             {ostrides[0], ostrides[1], ostrides[2], ostrides[3]},
+             outOffset}};
+        e.outputs.push_back(output);
+        // Extend life of the returned node by saving the corresponding
+        // shared_ptr
+        e.nodePtrs.emplace_back(iArray.getNode());
+        e.nodes.push_back(e.nodePtrs.back().get());
+        e.ins.push_back(&iArray);
+        outOffset += idims[jdim] * ostrides[jdim];
+    }
 
-        Array<T> out = createEmptyArray<T>(odims);
-
-        switch(n_arrays) {
-            case 1:
-                join_wrapper<T, 1>(dim, out, inputs);
-                break;
-            case 2:
-                join_wrapper<T, 2>(dim, out, inputs);
-                break;
-            case 3:
-                join_wrapper<T, 3>(dim, out, inputs);
-                break;
-            case 4:
-                join_wrapper<T, 4>(dim, out, inputs);
-                break;
-            case 5:
-                join_wrapper<T, 5>(dim, out, inputs);
-                break;
-            case 6:
-                join_wrapper<T, 6>(dim, out, inputs);
-                break;
-            case 7:
-                join_wrapper<T, 7>(dim, out, inputs);
-                break;
-            case 8:
-                join_wrapper<T, 8>(dim, out, inputs);
-                break;
-            case 9:
-                join_wrapper<T, 9>(dim, out, inputs);
-                break;
-            case 10:
-                join_wrapper<T,10>(dim, out, inputs);
-                break;
+    for (auto &eval : evals) {
+        auto &s{eval.second};
+        if (s.ins.size() == 1 ||
+            s.ins[0]->elements() * sizeof(T) * 2 * 2 > L2CacheSize) {
+            // Process (evaluate arrays) individually for
+            //  - single small array
+            //  - very large arrays
+            auto nodeIt{begin(s.nodes)};
+            auto outputIt{begin(s.outputs)};
+            for (const Array<T> *in : s.ins) {
+                if (in->isReady()) {
+                    if (1LL + jdim >= in->ndims() && in->isLinear()) {
+                        getQueue().enqueueCopyBuffer(
+                            *in->get(), *outputIt->data,
+                            in->getOffset() * sizeof(T),
+                            outputIt->info.offset * sizeof(T),
+                            in->elements() * sizeof(T), nullptr, nullptr);
+                    } else {
+                        kernel::memcopy<T>(*outputIt->data,
+                                           af::dim4(4, outputIt->info.strides),
+                                           *in->get(), in->dims(),
+                                           in->strides(), in->getOffset(),
+                                           in->ndims(), outputIt->info.offset);
+                    }
+                    // eliminate this array from the list, so that it will
+                    // not be processed in bulk via JIT
+                    outputIt = s.outputs.erase(outputIt);
+                    nodeIt   = s.nodes.erase(nodeIt);
+                } else {
+                    ++outputIt;
+                    ++nodeIt;
+                }
+            }
         }
-        return out;
+        evalNodes(s.outputs, s.nodes);
     }
+}
 
-#define INSTANTIATE(Tx, Ty)                                                                             \
-    template Array<Tx> join<Tx, Ty>(const int dim, const Array<Tx> &first, const Array<Ty> &second);   \
+#define INSTANTIATE(T)                                               \
+    template Array<T> join<T>(const int jdim, const Array<T> &first, \
+                              const Array<T> &second);
 
-    INSTANTIATE(float,   float)
-    INSTANTIATE(double,  double)
-    INSTANTIATE(cfloat,  cfloat)
-    INSTANTIATE(cdouble, cdouble)
-    INSTANTIATE(int,     int)
-    INSTANTIATE(uint,    uint)
-    INSTANTIATE(intl,    intl)
-    INSTANTIATE(uintl,   uintl)
-    INSTANTIATE(uchar,   uchar)
-    INSTANTIATE(char,    char)
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
 
 #undef INSTANTIATE
 
-#define INSTANTIATE(T)                                                                              \
-    template Array<T> join<T>(const int dim, const std::vector<Array<T> > &inputs);
+#define INSTANTIATE(T)                                    \
+    template void join<T>(Array<T> & out, const int jdim, \
+                          const vector<Array<T>> &inputs);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
 
 #undef INSTANTIATE
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/join.hpp b/src/backend/opencl/join.hpp
index 9068d756d0..9caf52d863 100644
--- a/src/backend/opencl/join.hpp
+++ b/src/backend/opencl/join.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename Tx, typename Ty>
-    Array<Tx> join(const int dim, const Array<Tx> &first, const Array<Ty> &second);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> join(const int dim, const Array<T> &first, const Array<T> &second);
 
-    template<typename T>
-    Array<T> join(const int dim, const std::vector<Array<T>> &inputs);
-}
+template<typename T>
+void join(Array<T> &out, const int dim, const std::vector<Array<T>> &inputs);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/KParam.hpp b/src/backend/opencl/kernel/KParam.hpp
index 820db70887..1f4f1d5ba4 100644
--- a/src/backend/opencl/kernel/KParam.hpp
+++ b/src/backend/opencl/kernel/KParam.hpp
@@ -9,10 +9,24 @@
 
 #ifndef __KPARAM_H
 #define __KPARAM_H
-typedef struct
-{
+
+#ifndef __OPENCL_VERSION__
+// Include af/defines.h for dim_t in host code only. In device (OpenCL C)
+// code, dim_t is defined through the program options set in program.cpp
+#include <af/defines.h>
+#endif
+
+// Defines the size and shape of the data in the OpenCL buffer
+typedef struct KParam_t {
     dim_t dims[4];
     dim_t strides[4];
     dim_t offset;
+
+#ifndef __OPENCL_VERSION__
+    dim_t *dims_ptr() { return dims; }
+    dim_t *strides_ptr() { return strides; }
+#endif
+
 } KParam;
+
 #endif
diff --git a/src/backend/opencl/kernel/anisotropic_diffusion.cl b/src/backend/opencl/kernel/anisotropic_diffusion.cl
new file mode 100644
index 0000000000..82077791f6
--- /dev/null
+++ b/src/backend/opencl/kernel/anisotropic_diffusion.cl
@@ -0,0 +1,175 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+int lIndex(const int j, const int i) { return j * SHRD_MEM_WIDTH + i; }
+
+int gIndex(const int x, const int y, const int dim0, const int dim1,
+           const int stride0, const int stride1) {
+    return clamp(x, 0, dim0 - 1) * stride0 + clamp(y, 0, dim1 - 1) * stride1;
+}
+
+float quadratic(const float value) { return 1.0f / (1.0f + value); }
+
+float gradientUpdate(const float mct, const float C, const float S,
+                     const float N, const float W, const float E,
+                     const float SE, const float SW, const float NE,
+                     const float NW) {
+    float delta = 0;
+
+    float dx, dy, df, db, cx, cxd;
+
+    // central-difference derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // forward/backward differences and conductance along first dimension
+    df = E - C;
+    db = C - W;
+
+    if (FLUX_FN == 2) {
+        cx  = exp((df * df + 0.25f * pow(dy + 0.5f * (SE - NE), 2)) * mct);
+        cxd = exp((db * db + 0.25f * pow(dy + 0.5f * (SW - NW), 2)) * mct);
+    } else {
+        cx = quadratic((df * df + 0.25f * pow(dy + 0.5f * (SE - NE), 2)) * mct);
+        cxd =
+            quadratic((db * db + 0.25f * pow(dy + 0.5f * (SW - NW), 2)) * mct);
+    }
+
+    delta += (cx * df - cxd * db);
+
+    // forward/backward differences and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    if (FLUX_FN == 2) {
+        cx  = exp((df * df + 0.25f * pow(dx + 0.5f * (SE - SW), 2)) * mct);
+        cxd = exp((db * db + 0.25f * pow(dx + 0.5f * (NE - NW), 2)) * mct);
+    } else {
+        cx = quadratic((df * df + 0.25f * pow(dx + 0.5f * (SE - SW), 2)) * mct);
+        cxd =
+            quadratic((db * db + 0.25f * pow(dx + 0.5f * (NE - NW), 2)) * mct);
+    }
+
+    delta += (cx * df - cxd * db);
+
+    return delta;
+}
+
+float curvatureUpdate(const float mct, const float C, const float S,
+                      const float N, const float W, const float E,
+                      const float SE, const float SW, const float NE,
+                      const float NW) {
+    float delta     = 0;
+    float prop_grad = 0;
+
+    float df0, db0;
+    float dx, dy, df, db, cx, cxd, gmf, gmb, gmsqf, gmsqb;
+
+    // central-difference derivatives
+    dx = (E - W) * 0.5f;
+    dy = (S - N) * 0.5f;
+
+    // forward/backward differences and conductance along first dimension
+    df  = E - C;
+    db  = C - W;
+    df0 = df;
+    db0 = db;
+
+    gmsqf = (df * df + 0.25f * pow(dy + 0.5f * (SE - NE), 2));
+    gmsqb = (db * db + 0.25f * pow(dy + 0.5f * (SW - NW), 2));
+
+    gmf = sqrt(1.0e-10f + gmsqf);
+    gmb = sqrt(1.0e-10f + gmsqb);
+
+    cx  = exp(gmsqf * mct);
+    cxd = exp(gmsqb * mct);
+
+    delta += ((df / gmf) * cx - (db / gmb) * cxd);
+
+    // forward/backward differences and conductance along second dimension
+    df = S - C;
+    db = C - N;
+
+    gmsqf = (df * df + 0.25f * pow(dx + 0.5f * (SE - SW), 2));
+    gmsqb = (db * db + 0.25f * pow(dx + 0.5f * (NE - NW), 2));
+
+    gmf = sqrt(1.0e-10f + gmsqf);
+    gmb = sqrt(1.0e-10f + gmsqb);
+
+    cx  = exp(gmsqf * mct);
+    cxd = exp(gmsqb * mct);
+
+    delta += ((df / gmf) * cx - (db / gmb) * cxd);
+
+    if (delta > 0) {
+        prop_grad += (pow(min(db0, 0.0f), 2.0f) + pow(max(df0, 0.0f), 2.0f));
+        prop_grad += (pow(min(db, 0.0f), 2.0f) + pow(max(df, 0.0f), 2.0f));
+    } else {
+        prop_grad += (pow(max(db0, 0.0f), 2.0f) + pow(min(df0, 0.0f), 2.0f));
+        prop_grad += (pow(max(db, 0.0f), 2.0f) + pow(min(df, 0.0f), 2.0f));
+    }
+
+    return sqrt(prop_grad) * delta;
+}
+
+kernel void aisoDiffUpdate(global T* inout, KParam info, const float dt,
+                           const float mct, unsigned blkX, unsigned blkY) {
+    local T localMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+
+    const int l0 = info.dims[0];
+    const int l1 = info.dims[1];
+    const int s0 = info.strides[0];
+    const int s1 = info.strides[1];
+
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+
+    const int b2 = get_group_id(0) / blkX;
+    const int b3 = get_group_id(1) / blkY;
+
+    const int gx = get_local_size(0) * (get_group_id(0) - b2 * blkX) + lx;
+    int gy       = get_local_size(1) * (get_group_id(1) - b3 * blkY) + ly;
+
+    global T* img =
+        inout + (b3 * info.strides[3] + b2 * info.strides[2]) + info.offset;
+
+    for (int b = ly, gy2 = gy - 1; b < SHRD_MEM_HEIGHT;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        for (int a = lx, gx2 = gx - 1; a < SHRD_MEM_WIDTH;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            localMem[b][a] = img[gIndex(gx2, gy2, l0, l1, s0, s1)];
+        }
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    int i = lx + 1;
+    int j = ly + 1;
+
+#pragma unroll
+    for (int ld = 0; ld < YDIM_LOAD;
+         ++ld, j += get_local_size(1), gy += get_local_size(1)) {
+        float C     = localMem[j][i];
+        float delta = 0;
+#if IS_MCDE == 1
+        delta = curvatureUpdate(mct, C, localMem[j][i + 1], localMem[j][i - 1],
+                                localMem[j - 1][i], localMem[j + 1][i],
+                                localMem[j + 1][i + 1], localMem[j - 1][i + 1],
+                                localMem[j + 1][i - 1], localMem[j - 1][i - 1]);
+#else
+        delta = gradientUpdate(mct, C, localMem[j][i + 1], localMem[j][i - 1],
+                               localMem[j - 1][i], localMem[j + 1][i],
+                               localMem[j + 1][i + 1], localMem[j - 1][i + 1],
+                               localMem[j + 1][i - 1], localMem[j - 1][i - 1]);
+#endif
+        if (gx < l0 && gy < l1) {
+            img[gx * s0 + gy * s1] = (T)(C + delta * dt);
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/anisotropic_diffusion.hpp b/src/backend/opencl/kernel/anisotropic_diffusion.hpp
new file mode 100644
index 0000000000..a8655be95e
--- /dev/null
+++ b/src/backend/opencl/kernel/anisotropic_diffusion.hpp
@@ -0,0 +1,72 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/anisotropic_diffusion.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, bool isMCDE>
+void anisotropicDiffusion(Param inout, const float dt, const float mct,
+                          const int fluxFnCode) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
+    constexpr int YDIM_LOAD = 2 * THREADS_X / THREADS_Y;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(isMCDE),
+        TemplateArg(fluxFnCode),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(SHRD_MEM_HEIGHT, (THREADS_Y * YDIM_LOAD + 2)),
+        DefineKeyValue(SHRD_MEM_WIDTH, (THREADS_X + 2)),
+        DefineKeyValue(IS_MCDE, isMCDE),
+        DefineKeyValue(FLUX_FN, fluxFnCode),
+        DefineValue(YDIM_LOAD),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto diffUpdate =
+        common::getKernel("aisoDiffUpdate", {{anisotropic_diffusion_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    NDRange local(THREADS_X, THREADS_Y, 1);
+
+    int blkX = divup(inout.info.dims[0], local[0]);
+    int blkY = divup(inout.info.dims[1], local[1] * YDIM_LOAD);
+
+    NDRange global(local[0] * blkX * inout.info.dims[2],
+                   local[1] * blkY * inout.info.dims[3], 1);
+
+    diffUpdate(EnqueueArgs(getQueue(), global, local), *inout.data, inout.info,
+               dt, mct, blkX, blkY);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/approx.hpp b/src/backend/opencl/kernel/approx.hpp
index 05abe13c96..d23a590e7f 100644
--- a/src/backend/opencl/kernel/approx.hpp
+++ b/src/backend/opencl/kernel/approx.hpp
@@ -8,165 +8,128 @@
  ********************************************************/
 
 #pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/interp.hpp>
 #include <kernel_headers/approx1.hpp>
 #include <kernel_headers/approx2.hpp>
-#include <program.hpp>
+#include <kernel_headers/interp.hpp>
+#include <math.hpp>
 #include <traits.hpp>
+
 #include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename Ty, typename Tp>
+auto genCompileOptions(const int order, const int xdim, const int ydim = -1) {
+    constexpr bool isComplex =
+        static_cast<af_dtype>(dtype_traits<Ty>::af_type) == c32 ||
+        static_cast<af_dtype>(dtype_traits<Ty>::af_type) == c64;
+
+    ToNumStr<Ty> toNumStr;
+
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Ty, dtype_traits<Ty>::getName()),
+        DefineKeyValue(Tp, dtype_traits<Tp>::getName()),
+        DefineKeyValue(InterpInTy, dtype_traits<Ty>::getName()),
+        DefineKeyValue(InterpValTy, dtype_traits<Ty>::getName()),
+        DefineKeyValue(InterpPosTy, dtype_traits<Tp>::getName()),
+        DefineKeyValue(ZERO, toNumStr(scalar<Ty>(0))),
+        DefineKeyValue(XDIM, xdim),
+        DefineKeyValue(INTERP_ORDER, order),
+        DefineKeyValue(IS_CPLX, (isComplex ? 1 : 0)),
+    };
+    if (ydim != -1) { compileOpts.emplace_back(DefineKeyValue(YDIM, ydim)); }
+    compileOpts.emplace_back(getTypeBuildDefinition<Ty>());
+    addInterpEnumOptions(compileOpts);
+
+    return compileOpts;
+}
+
+template<typename Ty, typename Tp>
+void approx1(Param yo, const Param yi, const Param xo, const int xdim,
+             const Tp xi_beg, const Tp xi_step, const float offGrid,
+             const af_interp_type method, const int order) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr int THREADS = 256;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ty>(),
+        TemplateTypename<Tp>(),
+        TemplateArg(xdim),
+        TemplateArg(order),
+    };
+    auto compileOpts = genCompileOptions<Ty, Tp>(order, xdim);
+
+    auto approx1 = common::getKernel(
+        "approx1", {{interp_cl_src, approx1_cl_src}}, tmpltArgs, compileOpts);
+
+    NDRange local(THREADS, 1, 1);
+    dim_t blocksPerMat = divup(yo.info.dims[0], local[0]);
+    NDRange global(blocksPerMat * local[0] * yo.info.dims[1],
+                   yo.info.dims[2] * yo.info.dims[3] * local[1]);
+
+    // OpenCL kernels cannot take bool arguments; batch is passed as an int
+    bool batch =
+        !(xo.info.dims[1] == 1 && xo.info.dims[2] == 1 && xo.info.dims[3] == 1);
+
+    approx1(EnqueueArgs(getQueue(), global, local), *yo.data, yo.info, *yi.data,
+            yi.info, *xo.data, xo.info, xi_beg, Tp(1) / xi_step,
+            scalar<Ty>(offGrid), (int)blocksPerMat, (int)batch, (int)method);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        static const int TX = 16;
-        static const int TY = 16;
-
-        static const int THREADS = 256;
-
-        ///////////////////////////////////////////////////////////////////////////
-        // Wrapper functions
-        ///////////////////////////////////////////////////////////////////////////
-        template <typename Ty, typename Tp, af_interp_type method>
-        void approx1(Param out, const Param in, const Param pos, const float offGrid)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>  approxProgs;
-                static std::map<int, Kernel*> approxKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D Ty="        << dtype_traits<Ty>::getName()
-                            << " -D Tp="        << dtype_traits<Tp>::getName();
-
-                    if((af_dtype) dtype_traits<Ty>::af_type == c32 ||
-                       (af_dtype) dtype_traits<Ty>::af_type == c64) {
-                        options << " -D CPLX=1";
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-                    if (std::is_same<Ty, double>::value ||
-                        std::is_same<Ty, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    switch(method) {
-                        case AF_INTERP_NEAREST: options << " -D INTERP=NEAREST";
-                            break;
-                        case AF_INTERP_LINEAR:  options << " -D INTERP=LINEAR";
-                            break;
-                        default:
-                            break;
-                    }
-                    Program prog;
-                    buildProgram(prog, approx1_cl, approx1_cl_len, options.str());
-                    approxProgs[device] = new Program(prog);
-
-                    approxKernels[device] = new Kernel(*approxProgs[device], "approx1_kernel");
-                });
-
-
-                auto approx1Op = make_kernel<Buffer, const KParam, const Buffer, const KParam,
-                                       const Buffer, const KParam, const float, const int>
-                                      (*approxKernels[device]);
-
-                NDRange local(THREADS, 1, 1);
-                int blocksPerMat = divup(out.info.dims[0], local[0]);
-                NDRange global(blocksPerMat * local[0] * out.info.dims[1],
-                               out.info.dims[2] * out.info.dims[3] * local[0],
-                               1);
-
-                approx1Op(EnqueueArgs(getQueue(), global, local),
-                          *out.data, out.info, *in.data, in.info,
-                          *pos.data, pos.info, offGrid, blocksPerMat);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-
-        template <typename Ty, typename Tp, af_interp_type method>
-        void approx2(Param out, const Param in, const Param pos, const Param qos, const float offGrid)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>       approxProgs;
-                static std::map<int, Kernel*>      approxKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D Ty="        << dtype_traits<Ty>::getName()
-                            << " -D Tp="        << dtype_traits<Tp>::getName();
-
-                    if((af_dtype) dtype_traits<Ty>::af_type == c32 ||
-                       (af_dtype) dtype_traits<Ty>::af_type == c64) {
-                        options << " -D CPLX=1";
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-                    if (std::is_same<Ty, double>::value ||
-                        std::is_same<Ty, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    switch(method) {
-                        case AF_INTERP_NEAREST: options << " -D INTERP=NEAREST";
-                            break;
-                        case AF_INTERP_LINEAR:  options << " -D INTERP=LINEAR";
-                            break;
-                        default:
-                            break;
-                    }
-                    Program prog;
-                    buildProgram(prog, approx2_cl, approx2_cl_len, options.str());
-                    approxProgs[device] = new Program(prog);
-
-                    approxKernels[device] = new Kernel(*approxProgs[device], "approx2_kernel");
-                });
-
-                auto approx2Op = make_kernel<Buffer, const KParam, const Buffer, const KParam,
-                                       const Buffer, const KParam, const Buffer, const KParam,
-                                       const float, const int, const int>
-                                       (*approxKernels[device]);
-
-                NDRange local(TX, TY, 1);
-                int blocksPerMatX = divup(out.info.dims[0], local[0]);
-                int blocksPerMatY = divup(out.info.dims[1], local[1]);
-                NDRange global(blocksPerMatX * local[0] * out.info.dims[2],
-                               blocksPerMatY * local[1] * out.info.dims[3],
-                               1);
-
-
-                approx2Op(EnqueueArgs(getQueue(), global, local),
-                          *out.data, out.info,
-                          *in.data, in.info,
-                          *pos.data, pos.info,
-                          *qos.data, qos.info,
-                          offGrid, blocksPerMatX, blocksPerMatY);
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+template<typename Ty, typename Tp>
+void approx2(Param zo, const Param zi, const Param xo, const int xdim,
+             const Tp &xi_beg, const Tp &xi_step, const Param yo,
+             const int ydim, const Tp &yi_beg, const Tp &yi_step,
+             const float offGrid, const af_interp_type method,
+             const int order) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ty>(), TemplateTypename<Tp>(), TemplateArg(xdim),
+        TemplateArg(ydim),      TemplateArg(order),
+    };
+    auto compileOpts = genCompileOptions<Ty, Tp>(order, xdim, ydim);
+
+    auto approx2 = common::getKernel(
+        "approx2", {{interp_cl_src, approx2_cl_src}}, tmpltArgs, compileOpts);
+
+    NDRange local(TX, TY, 1);
+    dim_t blocksPerMatX = divup(zo.info.dims[0], local[0]);
+    dim_t blocksPerMatY = divup(zo.info.dims[1], local[1]);
+    NDRange global(blocksPerMatX * local[0] * zo.info.dims[2],
+                   blocksPerMatY * local[1] * zo.info.dims[3], 1);
+
+    // OpenCL kernels cannot take bool arguments; batch is passed as an int
+    bool batch = !(xo.info.dims[2] == 1 && xo.info.dims[3] == 1);
+
+    approx2(EnqueueArgs(getQueue(), global, local), *zo.data, zo.info, *zi.data,
+            zi.info, *xo.data, xo.info, *yo.data, yo.info, xi_beg,
+            Tp(1) / xi_step, yi_beg, Tp(1) / yi_step, scalar<Ty>(offGrid),
+            static_cast<int>(blocksPerMatX), static_cast<int>(blocksPerMatY),
+            static_cast<int>(batch), static_cast<int>(method));
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/approx1.cl b/src/backend/opencl/kernel/approx1.cl
index bae2722ced..60d9ebbae3 100644
--- a/src/backend/opencl/kernel/approx1.cl
+++ b/src/backend/opencl/kernel/approx1.cl
@@ -7,116 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#define NEAREST core_nearest1
-#define LINEAR core_linear1
-
-#if CPLX
-#define set(a, b) a = b
-#define set_scalar(a, b) do {                   \
-        a.x = b;                                \
-        a.y = 0;                                \
-    } while(0)
-
-Ty mul(Ty a, Tp b) { a.x = a.x * b; a.y = a.y * b; return a; }
-Ty div(Ty a, Tp b) { a.x = a.x / b; a.y = a.y / b; return a; }
+kernel void approx1(global Ty *d_yo, const KParam yo, global const Ty *d_yi,
+                    const KParam yi, global const Tp *d_xo, const KParam xo,
+                    const Tp xi_beg, const Tp xi_step_reproc, const Ty offGrid,
+                    const int blocksMatX, const int batch, const int method) {
+    const int idw = get_group_id(1) / yo.dims[2];
+    const int idz = get_group_id(1) - idw * yo.dims[2];
+
+    const int idy        = get_group_id(0) / blocksMatX;
+    const int blockIdx_x = get_group_id(0) - idy * blocksMatX;
+    const int idx        = get_local_id(0) + blockIdx_x * get_local_size(0);
 
-#else
+    if (idx >= yo.dims[0] || idy >= yo.dims[1] || idz >= yo.dims[2] ||
+        idw >= yo.dims[3])
+        return;
 
-#define set(a, b) a = b
-#define set_scalar(a, b) a = b
-#define mul(a, b) ((a) * (b))
-#define div(a, b) ((a) / (b))
+    // FIXME: Only cubic interpolation clamps out-of-range positions.
+    // This should be made consistent across all methods.
+    // The behavior is left unchanged for now because tests would fail.
+    const bool doclamp = INTERP_ORDER == 3;
 
-#endif
+    bool is_off[] = {xo.dims[0] > 1, xo.dims[1] > 1, xo.dims[2] > 1,
+                     xo.dims[3] > 1};
 
-///////////////////////////////////////////////////////////////////////////
-// nearest-neighbor resampling
-///////////////////////////////////////////////////////////////////////////
-void core_nearest1(const int idx, const int idy, const int idz, const int idw,
-                   __global       Ty *d_out, const KParam out,
-                   __global const Ty *d_in,  const KParam in,
-                   __global const Tp *d_pos, const KParam pos,
-                   const float offGrid)
-{
-    const int omId = idw * out.strides[3] + idz * out.strides[2]
-                        + idy * out.strides[1] + idx;
-    const int pmId = idx;
+    const int yo_idx = idw * yo.strides[3] + idz * yo.strides[2] +
+                       idy * yo.strides[1] + idx + yo.offset;
 
-    const Tp pVal = d_pos[pmId];
-    if (pVal < 0 || in.dims[0] < pVal+1) {
-        set_scalar(d_out[omId], offGrid);
-        return;
+    int xo_idx = idx * is_off[0] + xo.offset;
+    if (batch) {
+        xo_idx += idw * xo.strides[3] * is_off[3];
+        xo_idx += idz * xo.strides[2] * is_off[2];
+        xo_idx += idy * xo.strides[1] * is_off[1];
     }
 
-    int ioff = idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1];
-    const int imId = round(pVal) + ioff;
+    const Tp x = (d_xo[xo_idx] - xi_beg) * xi_step_reproc;
 
-    Ty y;
-    set(y, d_in[imId]);
-    set(d_out[omId], y);
-}
-
-///////////////////////////////////////////////////////////////////////////
-// linear resampling
-///////////////////////////////////////////////////////////////////////////
-void core_linear1(const int idx, const int idy, const int idz, const int idw,
-                   __global       Ty *d_out, const KParam out,
-                   __global const Ty *d_in,  const KParam in,
-                   __global const Tp *d_pos, const KParam pos,
-                   const float offGrid)
-{
-    const int omId = idw * out.strides[3] + idz * out.strides[2]
-                        + idy * out.strides[1] + idx;
-    const int pmId = idx;
+#pragma unroll
+    for (int flagIdx = 0; flagIdx < 4; ++flagIdx) { is_off[flagIdx] = true; }
+    is_off[XDIM] = false;
 
-    const Tp pVal = d_pos[pmId];
-    if (pVal < 0 || in.dims[0] < pVal+1) {
-        set_scalar(d_out[omId], offGrid);
+    if (x < 0 || yi.dims[XDIM] < x + 1) {
+        d_yo[yo_idx] = offGrid;
         return;
     }
 
-    const Tp grid_x = floor(pVal);  // nearest grid
-    const Tp off_x = pVal - grid_x; // fractional offset
-
-    int ioff = idw * in.strides[3] + idz * in.strides[2] + idy * in.strides[1] + grid_x;
-
-    // Check if pVal and pVal + 1 are both valid indices
-    bool cond = (pVal < in.dims[0] - 1);
-    Ty zero; set_scalar(zero, 0);
-
-    // Compute Left and Right Weighted Values
-    Ty yl; set(yl, mul(d_in[ioff] , (1 - off_x)));
-    Ty yr; set(yr, cond ? mul(d_in[ioff + 1], off_x) : zero);
-    Ty yo = yl + yr;
-
-    // Compute Weight used
-    Tp wt = cond ? 1 : (1 - off_x);
-
-    // Write final value
-    set(d_out[omId], div(yo, wt));
-}
-
-////////////////////////////////////////////////////////////////////////////////////
-// Wrapper Kernel
-////////////////////////////////////////////////////////////////////////////////////
-__kernel
-void approx1_kernel(__global       Ty *d_out, const KParam out,
-                    __global const Ty *d_in,  const KParam in,
-                    __global const Tp *d_pos, const KParam pos,
-                    const float offGrid, const int blocksMatX)
-{
-    const int idw = get_group_id(1) / out.dims[2];
-    const int idz = get_group_id(1)  - idw * out.dims[2];
-
-    const int idy = get_group_id(0) / blocksMatX;
-    const int blockIdx_x = get_group_id(0) - idy * blocksMatX;
-    const int idx = get_local_id(0) + blockIdx_x * get_local_size(0);
-
-    if(idx >= out.dims[0] ||
-       idy >= out.dims[1] ||
-       idz >= out.dims[2] ||
-       idw >= out.dims[3])
-        return;
+    int yi_idx = idx * is_off[0] + yi.offset;
+    yi_idx += idw * yi.strides[3] * is_off[3];
+    yi_idx += idz * yi.strides[2] * is_off[2];
+    yi_idx += idy * yi.strides[1] * is_off[1];
 
-    INTERP(idx, idy, idz, idw, d_out, out, d_in + in.offset, in, d_pos + pos.offset, pos, offGrid);
+    interp1(d_yo, yo, yo_idx, d_yi, yi, yi_idx, x, method, 1, doclamp, 1);
 }
diff --git a/src/backend/opencl/kernel/approx2.cl b/src/backend/opencl/kernel/approx2.cl
index 2d2d7e805f..6df3f0a381 100644
--- a/src/backend/opencl/kernel/approx2.cl
+++ b/src/backend/opencl/kernel/approx2.cl
@@ -7,134 +7,60 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#define NEAREST core_nearest2
-#define LINEAR core_linear2
-
-#if CPLX
-#define set(a, b) a = b
-#define set_scalar(a, b) do {                   \
-        a.x = b;                                \
-        a.y = 0;                                \
-    } while(0)
-
-Ty mul(Ty a, Tp b) { a.x = a.x * b; a.y = a.y * b; return a; }
-Ty div(Ty a, Tp b) { a.x = a.x / b; a.y = a.y / b; return a; }
-
-#else
-
-#define set(a, b) a = b
-#define set_scalar(a, b) a = b
-#define mul(a, b) ((a) * (b))
-#define div(a, b) ((a) / (b))
+kernel void approx2(global Ty *d_zo, const KParam zo, global const Ty *d_zi,
+                    const KParam zi, global const Tp *d_xo, const KParam xo,
+                    global const Tp *d_yo, const KParam yo, const Tp xi_beg,
+                    const Tp xi_step_reproc, const Tp yi_beg,
+                    const Tp yi_step_reproc, const Ty offGrid,
+                    const int blocksMatX, const int blocksMatY, const int batch,
+                    int method) {
+    const int idz = get_group_id(0) / blocksMatX;
+    const int idw = get_group_id(1) / blocksMatY;
 
-#endif
+    const int blockIdx_x = get_group_id(0) - idz * blocksMatX;
+    const int blockIdx_y = get_group_id(1) - idw * blocksMatY;
 
-///////////////////////////////////////////////////////////////////////////
-// nearest-neighbor resampling
-///////////////////////////////////////////////////////////////////////////
-void core_nearest2(const int idx, const int idy, const int idz, const int idw,
-                   __global       Ty *d_out, const KParam out,
-                   __global const Ty *d_in,  const KParam in,
-                   __global const Tp *d_pos, const KParam pos,
-                   __global const Tp *d_qos, const KParam qos,
-                   const float offGrid)
-{
-    const int omId = idw * out.strides[3] + idz * out.strides[2]
-                        + idy * out.strides[1] + idx;
-    const int pmId = idy * pos.strides[1] + idx;
-    const int qmId = idy * qos.strides[1] + idx;
+    const int idx = get_local_id(0) + blockIdx_x * get_local_size(0);
+    const int idy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    const Tp x = d_pos[pmId], y = d_qos[qmId];
-    if (x < 0 || y < 0 || in.dims[0] < x+1 || in.dims[1] < y+1) {
-        set_scalar(d_out[omId], offGrid);
+    if (idx >= zo.dims[0] || idy >= zo.dims[1] || idz >= zo.dims[2] ||
+        idw >= zo.dims[3])
         return;
-    }
 
-    const int grid_x = round(x), grid_y = round(y); // nearest grid
-    const int imId = idw * in.strides[3] + idz * in.strides[2]
-                     + grid_y * in.strides[1] + grid_x;
+    // FIXME: Only cubic interpolation is doing clamping
+    // We need to make it consistent across all methods
+    // Not changing the behavior because tests will fail
+    const bool doclamp = INTERP_ORDER == 3;
+
+    bool is_off[] = {xo.dims[0] > 1, xo.dims[1] > 1, xo.dims[2] > 1,
+                     xo.dims[3] > 1};
+
+    const int zo_idx = idw * zo.strides[3] + idz * zo.strides[2] +
+                       idy * zo.strides[1] + idx + zo.offset;
+    int xo_idx = idy * xo.strides[1] * is_off[1] + idx * is_off[0] + xo.offset;
+    int yo_idx = idy * yo.strides[1] * is_off[1] + idx * is_off[0] + yo.offset;
+    if (batch) {
+        xo_idx +=
+            idw * xo.strides[3] * is_off[3] + idz * xo.strides[2] * is_off[2];
+        yo_idx +=
+            idw * yo.strides[3] * is_off[3] + idz * yo.strides[2] * is_off[2];
+    }
 
-    Ty z;
-    set(z, d_in[imId]);
-    set(d_out[omId], z);
-}
+#pragma unroll
+    for (int flagIdx = 0; flagIdx < 4; ++flagIdx) { is_off[flagIdx] = true; }
+    is_off[XDIM] = false;
+    is_off[YDIM] = false;
 
-///////////////////////////////////////////////////////////////////////////
-// linear resampling
-///////////////////////////////////////////////////////////////////////////
-void core_linear2(const int idx, const int idy, const int idz, const int idw,
-                  __global       Ty *d_out, const KParam out,
-                  __global const Ty *d_in,  const KParam in,
-                  __global const Tp *d_pos, const KParam pos,
-                  __global const Tp *d_qos, const KParam qos,
-                  const float offGrid)
-{
-    const int omId = idw * out.strides[3] + idz * out.strides[2]
-                        + idy * out.strides[1] + idx;
-    const int pmId = idy * pos.strides[1] + idx;
-    const int qmId = idy * qos.strides[1] + idx;
+    const Tp x = (d_xo[xo_idx] - xi_beg) * xi_step_reproc;
+    const Tp y = (d_yo[yo_idx] - yi_beg) * yi_step_reproc;
 
-    const Tp x = d_pos[pmId], y = d_qos[qmId];
-    if (x < 0 || y < 0 || in.dims[0] < x+1 || in.dims[1] < y+1) {
-        set_scalar(d_out[omId], offGrid);
+    if (x < 0 || y < 0 || zi.dims[XDIM] < x + 1 || zi.dims[YDIM] < y + 1) {
+        d_zo[zo_idx] = offGrid;
         return;
     }
 
-    const Tp grid_x = floor(x),   grid_y = floor(y);   // nearest grid
-    const Tp off_x  = x - grid_x, off_y  = y - grid_y; // fractional offset
-
-    int ioff = idw * in.strides[3] + idz * in.strides[2] + grid_y * in.strides[1] + grid_x;
-
-    // Check if pVal and pVal + 1 are both valid indices
-    bool condY = (y < in.dims[1] - 1);
-    bool condX = (x < in.dims[0] - 1);
-
-    // Compute wieghts used
-    Tp wt00 = ((Tp)1.0 - off_x) * ((Tp)1.0 - off_y);
-    Tp wt10 = (condY) ? ((Tp)1.0 - off_x) * (off_y) : 0;
-    Tp wt01 = (condX) ? (off_x) * ((Tp)1.0 - off_y) : 0;
-    Tp wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-    Tp wt = wt00 + wt10 + wt01 + wt11;
-
-    // Compute Weighted Values
-    Ty zero; set_scalar(zero, 0);
-    Ty y00; set(y00,                    mul(d_in[ioff],                     wt00)       );
-    Ty y10; set(y10, (condY) ?          mul(d_in[ioff + in.strides[1]],     wt10) : zero);
-    Ty y01; set(y01, (condX) ?          mul(d_in[ioff + 1],                 wt01) : zero);
-    Ty y11; set(y11, (condX && condY) ? mul(d_in[ioff + in.strides[1] + 1], wt11) : zero);
-
-    Ty yo = y00 + y10 + y01 + y11;
-
-    // Write Final Value
-    set(d_out[omId], div(yo, wt));
-}
-
-////////////////////////////////////////////////////////////////////////////////////
-// Wrapper Kernel
-////////////////////////////////////////////////////////////////////////////////////
-__kernel
-void approx2_kernel(__global       Ty *d_out, const KParam out,
-                    __global const Ty *d_in,  const KParam in,
-                    __global const Tp *d_pos, const KParam pos,
-                    __global const Tp *d_qos, const KParam qos,
-                    const float offGrid, const int blocksMatX, const int blocksMatY)
-{
-    const int idz = get_group_id(0) / blocksMatX;
-    const int idw = get_group_id(1) / blocksMatY;
-
-    const int blockIdx_x = get_group_id(0) - idz * blocksMatX;
-    const int blockIdx_y = get_group_id(1) - idw * blocksMatY;
-
-    const int idx = get_local_id(0) + blockIdx_x * get_local_size(0);
-    const int idy = get_local_id(1) + blockIdx_y * get_local_size(1);
-
-    if(idx >= out.dims[0] ||
-       idy >= out.dims[1] ||
-       idz >= out.dims[2] ||
-       idw >= out.dims[3])
-        return;
+    int zi_idx = idy * zi.strides[1] * is_off[1] + idx * is_off[0] + zi.offset;
+    zi_idx += idw * zi.strides[3] * is_off[3] + idz * zi.strides[2] * is_off[2];
 
-    INTERP(idx, idy, idz, idw, d_out, out, d_in + in.offset, in,
-           d_pos + pos.offset, pos, d_qos + qos.offset, qos, offGrid);
+    interp2(d_zo, zo, zo_idx, d_zi, zi, zi_idx, x, y, method, 1, doclamp, 2);
 }
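Review note on the index arithmetic at the top of `approx2`: each launch dimension packs a batch index and a block index into one work-group id. The sketch below reproduces that decomposition on the host in plain C++. The struct and function names are hypothetical and exist only for this illustration; they are not part of ArrayFire.

```cpp
// Host-side mirror of the per-dimension index arithmetic in approx2.
// All names here are illustrative assumptions, not ArrayFire identifiers.
struct Decomposed {
    int batch;  // idz (or idw): batch slice this work-group belongs to
    int block;  // blockIdx_x (or blockIdx_y): block index inside the slice
    int elem;   // idx (or idy): element index inside the slice
};

Decomposed decompose(int groupId, int blocksPerSlice, int localId,
                     int localSize) {
    Decomposed d;
    d.batch = groupId / blocksPerSlice;             // get_group_id / blocksMat
    d.block = groupId - d.batch * blocksPerSlice;   // remainder within slice
    d.elem  = localId + d.block * localSize;        // position along the axis
    return d;
}
```

For example, with two blocks per slice and a local size of 16, work-group 5 and local id 3 map to batch slice 2, block 1, element 19. The kernel then discards work-items whose element index falls outside `zo.dims`.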
diff --git a/src/backend/opencl/kernel/assign.cl b/src/backend/opencl/kernel/assign.cl
index 167031b76a..90bb5fd789 100644
--- a/src/backend/opencl/kernel/assign.cl
+++ b/src/backend/opencl/kernel/assign.cl
@@ -8,52 +8,58 @@
  ********************************************************/
 
 typedef struct {
-    int  offs[4];
+    int offs[4];
     int strds[4];
-    char     isSeq[4];
+    char isSeq[4];
 } AssignKernelParam_t;
 
-int trimIndex(int idx, const int len)
-{
+int trimIndex(int idx, const int len) {
     int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
     }
     return ret_val;
 }
 
-kernel
-void assignKernel(global T * optr, KParam oInfo, global const T * iptr, KParam iInfo,
-                 const AssignKernelParam_t p, global const uint* ptr0,
-                 global const uint* ptr1, global const uint* ptr2,
-                 global const uint* ptr3, const int nBBS0, const int nBBS1)
-{
+kernel void assignKernel(global T* optr, KParam oInfo, global const T* iptr,
+                         KParam iInfo, const AssignKernelParam_t p,
+                         global const uint* ptr0, global const uint* ptr1,
+                         global const uint* ptr2, global const uint* ptr3,
+                         const int nBBS0, const int nBBS1) {
     // retrive booleans that tell us which index to use
     const bool s0 = p.isSeq[0];
     const bool s1 = p.isSeq[1];
     const bool s2 = p.isSeq[2];
     const bool s3 = p.isSeq[3];
 
-    const int gz = get_group_id(0)/nBBS0;
-    const int gw = get_group_id(1)/nBBS1;
-    const int gx = get_local_size(0) * (get_group_id(0) - gz*nBBS0) + get_local_id(0);
-    const int gy = get_local_size(1) * (get_group_id(1) - gw*nBBS1) + get_local_id(1);
+    const int gz = get_group_id(0) / nBBS0;
+    const int gw = get_group_id(1) / nBBS1;
+    const int gx =
+        get_local_size(0) * (get_group_id(0) - gz * nBBS0) + get_local_id(0);
+    const int gy =
+        get_local_size(1) * (get_group_id(1) - gw * nBBS1) + get_local_id(1);
 
-    if (gx<iInfo.dims[0] && gy<iInfo.dims[1] && gz<iInfo.dims[2] && gw<iInfo.dims[3]) {
+    if (gx < iInfo.dims[0] && gy < iInfo.dims[1] && gz < iInfo.dims[2] &&
+        gw < iInfo.dims[3]) {
         // calculate pointer offsets for input
-        int i = p.strds[0] * trimIndex(s0 ? gx+p.offs[0] : ptr0[gx], oInfo.dims[0]);
-        int j = p.strds[1] * trimIndex(s1 ? gy+p.offs[1] : ptr1[gy], oInfo.dims[1]);
-        int k = p.strds[2] * trimIndex(s2 ? gz+p.offs[2] : ptr2[gz], oInfo.dims[2]);
-        int l = p.strds[3] * trimIndex(s3 ? gw+p.offs[3] : ptr3[gw], oInfo.dims[3]);
+        int i = p.strds[0] *
+                trimIndex(s0 ? gx + p.offs[0] : ptr0[gx], oInfo.dims[0]);
+        int j = p.strds[1] *
+                trimIndex(s1 ? gy + p.offs[1] : ptr1[gy], oInfo.dims[1]);
+        int k = p.strds[2] *
+                trimIndex(s2 ? gz + p.offs[2] : ptr2[gz], oInfo.dims[2]);
+        int l = p.strds[3] *
+                trimIndex(s3 ? gw + p.offs[3] : ptr3[gw], oInfo.dims[3]);
         // offset input and output pointers
-        global const T *src = iptr + (gx*iInfo.strides[0]+
-                                      gy*iInfo.strides[1]+
-                                      gz*iInfo.strides[2]+
-                                      gw*iInfo.strides[3]);
-        global T *dst = optr + (i+j+k+l);
+        global const T* src =
+            iptr +
+            (gx * iInfo.strides[0] + gy * iInfo.strides[1] +
+             gz * iInfo.strides[2] + gw * iInfo.strides[3] + iInfo.offset);
+        global T* dst = optr + (i + j + k + l) + oInfo.offset;
         // set the output
         dst[0] = src[0];
     }
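Review note on the `trimIndex` change in `assign.cl`: the pre-patch version computed `abs(idx) % len` up front and returned `offset - 1` for negative indices, which evaluates to `-1` whenever `idx` is a negative multiple of `len`. The sketch below ports both versions to host-side C++ so the difference can be checked directly. The names `trimIndexOld` and `trimIndexNew` are made up for this illustration.

```cpp
#include <cstdlib>

// Host-side port of the pre-patch trimIndex (hypothetical name).
int trimIndexOld(int idx, const int len) {
    int ret_val = idx;
    int offset  = std::abs(ret_val) % len;
    if (ret_val < 0) {
        ret_val = offset - 1;  // yields -1 when idx is a multiple of -len
    } else if (ret_val >= len) {
        ret_val = len - offset - 1;
    }
    return ret_val;
}

// Host-side port of the post-patch trimIndex (hypothetical name).
int trimIndexNew(int idx, const int len) {
    int ret_val = idx;
    if (ret_val < 0) {
        int offset = (std::abs(ret_val) - 1) % len;
        ret_val    = offset;  // always in [0, len)
    } else if (ret_val >= len) {
        int offset = std::abs(ret_val) % len;
        ret_val    = len - offset - 1;
    }
    return ret_val;
}
```

With `len = 4`, the old version maps `idx = -4` to `-1`, which is still out of range, while the new version maps it to `3`. In-range indices pass through both versions unchanged.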
diff --git a/src/backend/opencl/kernel/assign.hpp b/src/backend/opencl/kernel/assign.hpp
index 2b6b517799..b7cd779027 100644
--- a/src/backend/opencl/kernel/assign.hpp
+++ b/src/backend/opencl/kernel/assign.hpp
@@ -8,87 +8,56 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/assign.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/assign.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
+#include <string>
+#include <vector>
 
-static const int THREADS_X = 32;
-static const int THREADS_Y =  8;
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 typedef struct {
-    int  offs[4];
+    int offs[4];
     int strds[4];
-    char     isSeq[4];
+    char isSeq[4];
 } AssignKernelParam_t;
 
 template<typename T>
-void assign(Param out, const Param in, const AssignKernelParam_t& p, Buffer *bPtr[4])
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  agnProgs;
-        static std::map<int, Kernel*> agnKernels;
+void assign(Param out, const Param in, const AssignKernelParam_t& p,
+            cl::Buffer* bPtr[4]) {
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
 
-        int device = getActiveDeviceId();
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
 
-        std::call_once( compileFlags[device], [device] () {
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
+    auto assign =
+        common::getKernel("assignKernel", {{assign_cl_src}}, targs, options);
 
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-                }
+    cl::NDRange local(THREADS_X, THREADS_Y);
 
-                Program prog;
-                buildProgram(prog, assign_cl, assign_cl_len, options.str());
-                agnProgs[device]   = new Program(prog);
-                agnKernels[device] = new Kernel(*agnProgs[device], "assignKernel");
-                });
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
 
-        NDRange local(THREADS_X, THREADS_Y);
+    cl::NDRange global(blk_x * in.info.dims[2] * THREADS_X,
+                       blk_y * in.info.dims[3] * THREADS_Y);
 
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * in.info.dims[2] * THREADS_X,
-                blk_y * in.info.dims[3] * THREADS_Y);
-
-        auto assignOp = make_kernel<Buffer, KParam, Buffer, KParam, AssignKernelParam_t,
-             Buffer, Buffer, Buffer, Buffer, int, int>(*agnKernels[device]);
-
-        assignOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info, p,
-                *bPtr[0], *bPtr[1], *bPtr[2], *bPtr[3], blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
+    assign(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+           *in.data, in.info, p, *bPtr[0], *bPtr[1], *bPtr[2], *bPtr[3], blk_x,
+           blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
-
-}
-
-}
-
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/bilateral.cl b/src/backend/opencl/kernel/bilateral.cl
index b54f80961e..af6bb11143 100644
--- a/src/backend/opencl/kernel/bilateral.cl
+++ b/src/backend/opencl/kernel/bilateral.cl
@@ -7,110 +7,102 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-int lIdx(int x, int y,
-        int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
+#ifdef USE_NATIVE_EXP
+#define EXP native_exp
+#else
+#define EXP exp
+#endif
 
-void load2LocalMem(__local outType *  shrd,
-        __global const inType *      in,
-        int lx, int ly, int shrdStride,
-        int dim0, int dim1,
-        int gx, int gy,
-        int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-    shrd[ lIdx(lx, ly, shrdStride, 1) ] = (outType)in[ lIdx(gx_, gy_, inStride1, inStride0) ];
+int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
 }
 
-float gaussian1d(float x, float variance)
-{
-    return exp((x * x) / (-2.f * variance));
+void load2LocalMem(local outType* shrd, global const inType* in, int lx,
+                   int ly, int shrdStride, int dim0, int dim1, int gx, int gy,
+                   int inStride1, int inStride0) {
+    int gx_ = clamp(gx, 0, dim0 - 1);
+    int gy_ = clamp(gy, 0, dim1 - 1);
+    shrd[lIdx(lx, ly, shrdStride, 1)] =
+        (outType)in[lIdx(gx_, gy_, inStride1, inStride0)];
 }
 
-__kernel
-void bilateral(__global outType *        d_dst,
-               KParam                    oInfo,
-               __global const inType *   d_src,
-               KParam                    iInfo,
-               __local outType *         localMem,
-               __local outType *         gauss2d,
-               float sigma_space, float sigma_color,
-               int gaussOff, int nBBS0, int nBBS1)
-{
-    const int radius      = max((int)(sigma_space * 1.5f), 1);
-    const int padding     = 2 * radius;
-    const int window_size = padding + 1;
-    const int shrdLen     = get_local_size(0) + padding;
-    const float variance_range = sigma_color * sigma_color;
-    const float variance_space = sigma_space * sigma_space;
+kernel void bilateral(global outType* d_dst, KParam oInfo,
+                      global const inType* d_src, KParam iInfo,
+                      local outType* localMem, local outType* gauss2d,
+                      float sigma_space, float sigma_color, int gaussOff,
+                      int nBBS0, int nBBS1) {
+    const int radius                    = max((int)(sigma_space * 1.5f), 1);
+    const int padding                   = 2 * radius;
+    const int window_size               = padding + 1;
+    const int shrdLen                   = get_local_size(0) + padding;
+    const float variance_range          = sigma_color * sigma_color;
+    const float variance_space          = sigma_space * sigma_space;
+    const float variance_space_neg2     = -2.0f * variance_space;
+    const float inv_variance_range_neg2 = -0.5f / (variance_range);
 
     // gfor batch offsets
     unsigned b2 = get_group_id(0) / nBBS0;
     unsigned b3 = get_group_id(1) / nBBS1;
-    __global const inType* in = d_src + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
-    __global outType* out     = d_dst + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
+    global const inType* in =
+        d_src + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
+    global outType* out =
+        d_dst + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
 
     int lx = get_local_id(0);
     int ly = get_local_id(1);
 
-    const int gx = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    const int gy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
+    const int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    const int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
 
     // generate gauss2d spatial variance values for block
-    if (lx<window_size && ly<window_size) {
+    if (lx < window_size && ly < window_size) {
         int x = lx - radius;
         int y = ly - radius;
-        gauss2d[ly*window_size+lx] = exp( ((x*x) + (y*y)) / (-2.f * variance_space));
+        gauss2d[ly * window_size + lx] =
+            EXP(((x * x) + (y * y)) / variance_space_neg2);
     }
 
-    int lx2 = lx + get_local_size(0);
-    int ly2 = ly + get_local_size(1);
-    int gx2 = gx + get_local_size(0);
-    int gy2 = gy + get_local_size(1);
-
+    int s0 = iInfo.strides[0];
+    int s1 = iInfo.strides[1];
+    int d0 = iInfo.dims[0];
+    int d1 = iInfo.dims[1];
     // pull image to local memory
-    load2LocalMem(localMem, in, lx, ly, shrdLen,
-                 iInfo.dims[0], iInfo.dims[1], gx-radius,
-                 gy-radius, iInfo.strides[1], iInfo.strides[0]);
-    if (lx<padding) {
-        load2LocalMem(localMem, in, lx2, ly, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1], gx2-radius,
-                     gy-radius, iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding) {
-        load2LocalMem(localMem, in, lx, ly2, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1], gx-radius,
-                     gy2-radius, iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2LocalMem(localMem, in, lx2, ly2, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1], gx2-radius,
-                     gy2-radius, iInfo.strides[1], iInfo.strides[0]);
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        // move row_set get_local_size(1) along columns
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            load2LocalMem(localMem, in, a, b, shrdLen, d0, d1, gx2 - radius,
+                          gy2 - radius, s1, s0);
+        }
     }
+
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (gx<iInfo.dims[0] && gy<iInfo.dims[1]) {
+    if (gx < iInfo.dims[0] && gy < iInfo.dims[1]) {
         lx += radius;
         ly += radius;
-        outType center_color = localMem[ly*shrdLen+lx];
-        outType res  = 0;
-        outType norm = 0;
+        outType center_color = localMem[ly * shrdLen + lx];
+        outType res          = 0;
+        outType norm         = 0;
+
+        int joff = (ly - radius) * shrdLen + (lx - radius);
+        int goff = 0;
+
 #pragma unroll
-        for(int wj=0; wj<window_size; ++wj) {
-            int joff = (ly+wj-radius)*shrdLen + (lx-radius);
-            int goff = wj*window_size;
+        for (int wj = 0; wj < window_size; ++wj) {
 #pragma unroll
-            for(int wi=0; wi<window_size; ++wi) {
-                outType tmp_color   = localMem[joff+wi];
-                outType gauss_range = gaussian1d(center_color - tmp_color, variance_range);
-                outType weight      = gauss2d[goff+wi] * gauss_range;
+            for (int wi = 0; wi < window_size; ++wi) {
+                outType tmp_color   = localMem[joff + wi];
+                const outType c     = center_color - tmp_color;
+                outType gauss_range = EXP(c * c * inv_variance_range_neg2);
+                outType weight      = gauss2d[goff + wi] * gauss_range;
                 norm += weight;
-                res  += tmp_color * weight;
+                res += tmp_color * weight;
             }
+            joff += shrdLen;
+            goff += window_size;
         }
-        out[gy*oInfo.strides[1] + gx] = res / norm;
+        out[gy * oInfo.strides[1] + gx] = res / norm;
     }
 }
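Review note on the local-memory load in `bilateral.cl`: the pre-patch code issued at most four loads per work-item, so it could fill the padded tile only while `padding` did not exceed the work-group edge. The strided loops in the patch keep stepping by the local size until the whole `shrdLen` by `shrdLen` tile is covered. The host-side C++ sketch below simulates both schemes for a square work-group and counts the tile cells each one writes. The function names are hypothetical.

```cpp
#include <vector>

// Sums the 0/1 marks in a simulated tile.
static int countMarked(const std::vector<char>& tile) {
    int count = 0;
    for (char c : tile) count += c;
    return count;
}

// Pre-patch scheme: each work-item issues at most four loads.
int cellsLoadedOld(int localSize, int padding) {
    const int shrdLen = localSize + padding;
    std::vector<char> tile(shrdLen * shrdLen, 0);
    for (int ly = 0; ly < localSize; ++ly) {
        for (int lx = 0; lx < localSize; ++lx) {
            const int lx2 = lx + localSize;
            const int ly2 = ly + localSize;
            tile[ly * shrdLen + lx] = 1;
            if (lx < padding) tile[ly * shrdLen + lx2] = 1;
            if (ly < padding) tile[ly2 * shrdLen + lx] = 1;
            if (lx < padding && ly < padding) tile[ly2 * shrdLen + lx2] = 1;
        }
    }
    return countMarked(tile);
}

// Post-patch scheme: each work-item strides across the whole tile.
int cellsLoadedNew(int localSize, int padding) {
    const int shrdLen = localSize + padding;
    std::vector<char> tile(shrdLen * shrdLen, 0);
    for (int ly = 0; ly < localSize; ++ly) {
        for (int lx = 0; lx < localSize; ++lx) {
            for (int b = ly; b < shrdLen; b += localSize) {
                for (int a = lx; a < shrdLen; a += localSize) {
                    tile[b * shrdLen + a] = 1;
                }
            }
        }
    }
    return countMarked(tile);
}
```

For a 16 by 16 work-group, both schemes fill all 400 cells when `padding` is 4. When `padding` is 40, the old scheme fills only 1024 of the 3136 cells, while the strided scheme fills all of them.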
diff --git a/src/backend/opencl/kernel/bilateral.hpp b/src/backend/opencl/kernel/bilateral.hpp
index 76533ae894..eba0f2bb10 100644
--- a/src/backend/opencl/kernel/bilateral.hpp
+++ b/src/backend/opencl/kernel/bilateral.hpp
@@ -8,95 +8,74 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/bilateral.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <algorithm>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/bilateral.hpp>
+#include <traits.hpp>
+#include <af/opencl.h>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename inType, typename outType, bool isColor>
-void bilateral(Param out, const Param in, float s_sigma, float c_sigma)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  bilProgs;
-        static std::map<int, Kernel*> bilKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D inType=" << dtype_traits<inType>::getName()
-                            << " -D outType=" << dtype_traits<outType>::getName();
-                    if (std::is_same<inType, double>::value ||
-                        std::is_same<inType, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    Program prog;
-                    buildProgram(prog, bilateral_cl, bilateral_cl_len, options.str());
-                    bilProgs[device] = new Program(prog);
-
-                    bilKernels[device] = new Kernel(*bilProgs[device], "bilateral");
-                });
-
-        auto bilateralOp = make_kernel<Buffer, KParam,
-                                       Buffer, KParam,
-                                       LocalSpaceArg,
-                                       LocalSpaceArg,
-                                       float, float,
-                                       int, int, int
-                                      >(*bilKernels[device]);
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x*in.info.dims[2]*THREADS_X,
-                       blk_y*in.info.dims[3]*THREADS_Y);
-
-        // calculate local memory size
-        int radius = (int)std::max(s_sigma * 1.5f, 1.f);
-        int num_shrd_elems    = (THREADS_X + 2 * radius) * (THREADS_Y + 2 * radius);
-        int num_gauss_elems   = (2*radius+1)*(2*radius+1);
-
-        bilateralOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info,
-                    cl::Local(num_shrd_elems*sizeof(outType)),
-                    cl::Local(num_gauss_elems*sizeof(outType)),
-                    s_sigma, c_sigma, num_shrd_elems, blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+#include <algorithm>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename inType, typename outType>
+void bilateral(Param out, const Param in, const float s_sigma,
+               const float c_sigma) {
+    constexpr int THREADS_X     = 16;
+    constexpr int THREADS_Y     = 16;
+    constexpr bool UseNativeExp = !(std::is_same<inType, double>::value ||
+                                    std::is_same<inType, cdouble>::value);
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<inType>(),
+        TemplateTypename<outType>(),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(inType, dtype_traits<inType>::getName()),
+        DefineKeyValue(outType, dtype_traits<outType>::getName()),
+    };
+    if (UseNativeExp) { options.emplace_back(DefineKey(USE_NATIVE_EXP)); }
+    options.emplace_back(getTypeBuildDefinition<inType>());
+
+    auto bilateralOp =
+        common::getKernel("bilateral", {{bilateral_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * in.info.dims[2] * THREADS_X,
+                       blk_y * in.info.dims[3] * THREADS_Y);
+
+    // calculate local memory size
+    int radius          = (int)std::max(s_sigma * 1.5f, 1.f);
+    int num_shrd_elems  = (THREADS_X + 2 * radius) * (THREADS_Y + 2 * radius);
+    int num_gauss_elems = (2 * radius + 1) * (2 * radius + 1);
+    size_t localMemSize = (num_shrd_elems + num_gauss_elems) * sizeof(outType);
+    size_t MaxLocalSize =
+        getDevice(getActiveDeviceId()).getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
+    if (localMemSize > MaxLocalSize) {
+        char errMessage[256];
+        snprintf(errMessage, sizeof(errMessage),
+                 "\nOpenCL Bilateral filter doesn't support %f spatial sigma\n",
+                 s_sigma);
+        OPENCL_NOT_SUPPORTED(errMessage);
     }
-}
-
-}
 
+    bilateralOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *in.data, in.info, cl::Local(num_shrd_elems * sizeof(outType)),
+                cl::Local(num_gauss_elems * sizeof(outType)), s_sigma, c_sigma,
+                num_shrd_elems, blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
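Review note on the new local-memory guard in `bilateral.hpp`: the required size grows quadratically with the spatial sigma, because both the padded image tile and the Gaussian table scale with `radius`. The sketch below repeats the same arithmetic as a standalone C++ function so the numbers can be checked without a device. The function name and the defaulted thread-count parameters are assumptions made for this illustration; the actual limit comes from `CL_DEVICE_LOCAL_MEM_SIZE` at run time.

```cpp
#include <algorithm>
#include <cstddef>

// Bytes of local memory the bilateral launch requests, following the
// arithmetic in the patch. Name and signature are illustrative only.
std::size_t bilateralLocalBytes(float s_sigma, std::size_t elemSize,
                                int threadsX = 16, int threadsY = 16) {
    const int radius = static_cast<int>(std::max(s_sigma * 1.5f, 1.f));
    // Image tile padded by `radius` on every side.
    const int numShrdElems =
        (threadsX + 2 * radius) * (threadsY + 2 * radius);
    // Spatial Gaussian table covering the full window.
    const int numGaussElems = (2 * radius + 1) * (2 * radius + 1);
    return static_cast<std::size_t>(numShrdElems + numGaussElems) * elemSize;
}
```

With 4-byte elements, `s_sigma = 1` needs 1332 bytes, while `s_sigma = 40` needs 132548 bytes. A device that reports, say, 32 KiB of local memory (an illustrative figure, not one taken from the patch) would therefore hit the `OPENCL_NOT_SUPPORTED` path for the larger sigma.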
diff --git a/src/backend/opencl/kernel/canny.hpp b/src/backend/opencl/kernel/canny.hpp
new file mode 100644
index 0000000000..bcc850e6ba
--- /dev/null
+++ b/src/backend/opencl/kernel/canny.hpp
@@ -0,0 +1,178 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/nonmax_suppression.hpp>
+#include <kernel_headers/trace_edge.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
+
+template<typename T>
+void nonMaxSuppression(Param output, const Param magnitude, const Param dx,
+                       const Param dy) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(SHRD_MEM_HEIGHT, THREADS_X + 2),
+        DefineKeyValue(SHRD_MEM_WIDTH, THREADS_Y + 2),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto nonMaxOp = common::getKernel(
+        "nonMaxSuppressionKernel", {{nonmax_suppression_cl_src}},
+        TemplateArgs(TemplateTypename<T>()), options);
+
+    NDRange threads(kernel::THREADS_X, kernel::THREADS_Y, 1);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(magnitude.info.dims[0] - 2, threads[0]);
+    int blk_y = divup(magnitude.info.dims[1] - 2, threads[1]);
+
+    // launch batch * blk_x blocks along x dimension
+    NDRange global(blk_x * magnitude.info.dims[2] * threads[0],
+                   blk_y * magnitude.info.dims[3] * threads[1], 1);
+
+    nonMaxOp(EnqueueArgs(getQueue(), global, threads), *output.data,
+             output.info, *magnitude.data, magnitude.info, *dx.data, dx.info,
+             *dy.data, dy.info, blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void initEdgeOut(Param output, const Param strong, const Param weak) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKey(INIT_EDGE_OUT),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto initOp =
+        common::getKernel("initEdgeOutKernel", {{trace_edge_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+
+    NDRange threads(kernel::THREADS_X, kernel::THREADS_Y, 1);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(strong.info.dims[0] - 2, threads[0]);
+    int blk_y = divup(strong.info.dims[1] - 2, threads[1]);
+
+    // launch batch * blk_x blocks along x dimension
+    NDRange global(blk_x * strong.info.dims[2] * threads[0],
+                   blk_y * strong.info.dims[3] * threads[1], 1);
+
+    initOp(EnqueueArgs(getQueue(), global, threads), *output.data, output.info,
+           *strong.data, strong.info, *weak.data, weak.info, blk_x, blk_y);
+
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void suppressLeftOver(Param output) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKey(SUPPRESS_LEFT_OVER),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto finalOp =
+        common::getKernel("suppressLeftOverKernel", {{trace_edge_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+
+    NDRange threads(kernel::THREADS_X, kernel::THREADS_Y, 1);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(output.info.dims[0] - 2, threads[0]);
+    int blk_y = divup(output.info.dims[1] - 2, threads[1]);
+
+    // launch batch * blk_x blocks along x dimension
+    NDRange global(blk_x * output.info.dims[2] * threads[0],
+                   blk_y * output.info.dims[3] * threads[1], 1);
+
+    finalOp(EnqueueArgs(getQueue(), global, threads), *output.data, output.info,
+            blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void edgeTrackingHysteresis(Param output, const Param strong,
+                            const Param weak) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKey(EDGE_TRACER),
+        DefineKeyValue(SHRD_MEM_HEIGHT, THREADS_X + 2),
+        DefineKeyValue(SHRD_MEM_WIDTH, THREADS_Y + 2),
+        DefineKeyValue(TOTAL_NUM_THREADS, THREADS_X * THREADS_Y),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto edgeTraceOp =
+        common::getKernel("edgeTrackKernel", {{trace_edge_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+
+    NDRange threads(kernel::THREADS_X, kernel::THREADS_Y);
+
+    // Launch only threads to process non-border pixels
+    int blk_x = divup(weak.info.dims[0] - 2, threads[0]);
+    int blk_y = divup(weak.info.dims[1] - 2, threads[1]);
+
+    // Launch dims[2] * blk_x work-groups along x and dims[3] * blk_y along y
+    NDRange global(blk_x * weak.info.dims[2] * threads[0],
+                   blk_y * weak.info.dims[3] * threads[1], 1);
+
+    initEdgeOut<T>(output, strong, weak);
+
+    int notFinished = 1;
+    auto dContinue  = memAlloc<int>(1);
+
+    while (notFinished > 0) {
+        notFinished = 0;
+        edgeTraceOp.setFlag(dContinue.get(), &notFinished);
+        edgeTraceOp(EnqueueArgs(getQueue(), global, threads), *output.data,
+                    output.info, blk_x, blk_y, *dContinue);
+        CL_DEBUG_FINISH(getQueue());
+        notFinished = edgeTraceOp.getFlag(dContinue.get());
+    }
+    suppressLeftOver<T>(output);
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/config.cpp b/src/backend/opencl/kernel/config.cpp
index 322f514443..363a876d95 100644
--- a/src/backend/opencl/kernel/config.cpp
+++ b/src/backend/opencl/kernel/config.cpp
@@ -8,23 +8,19 @@
  ********************************************************/
 
 #include "config.hpp"
-namespace opencl
-{
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-    std::ostream&
-    operator<<(std::ostream &out, const cfloat& var)
-    {
-        out << "{" << var.s[0] << "," << var.s[1] << "}";
-        return out;
-    }
-
-    std::ostream&
-    operator<<(std::ostream &out, const cdouble& var)
-    {
-        out << "{" << var.s[0] << "," << var.s[1] << "}";
-        return out;
-    }
+std::ostream& operator<<(std::ostream& out, const cfloat& var) {
+    out << "{" << var.s[0] << "," << var.s[1] << "}";
+    return out;
 }
+
+std::ostream& operator<<(std::ostream& out, const cdouble& var) {
+    out << "{" << var.s[0] << "," << var.s[1] << "}";
+    return out;
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/config.hpp b/src/backend/opencl/kernel/config.hpp
index cbe92c4023..9e3d07868a 100644
--- a/src/backend/opencl/kernel/config.hpp
+++ b/src/backend/opencl/kernel/config.hpp
@@ -8,23 +8,21 @@
  ********************************************************/
 
 #pragma once
-#include <ostream>
 #include <types.hpp>
+#include <ostream>
 
-namespace opencl
-{
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-    std::ostream&
-    operator<<(std::ostream &out, const cfloat& var);
+std::ostream& operator<<(std::ostream& out, const cfloat& var);
 
-    std::ostream&
-    operator<<(std::ostream &out, const cdouble& var);
+std::ostream& operator<<(std::ostream& out, const cdouble& var);
 
-    static const uint THREADS_PER_GROUP = 256;
-    static const uint THREADS_X = 32;
-    static const uint THREADS_Y = THREADS_PER_GROUP / THREADS_X;
-    static const uint REPEAT    = 32;
-}
-}
+static const uint THREADS_PER_GROUP = 256;
+static const uint THREADS_X         = 32;
+static const uint THREADS_Y         = THREADS_PER_GROUP / THREADS_X;
+static const uint REPEAT            = 32;
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve.cl b/src/backend/opencl/kernel/convolve.cl
index 7839e2ad29..cf1205dac1 100644
--- a/src/backend/opencl/kernel/convolve.cl
+++ b/src/backend/opencl/kernel/convolve.cl
@@ -7,93 +7,101 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-int index(int i, int j, int k, int jstride, int kstride)
-{
-    return i+j*jstride+k*kstride;
+int index(int i, int j, int k, int jstride, int kstride) {
+    return i + j * jstride + k * kstride;
 }
 
-#if BASE_DIM==1
-kernel
-void convolve(global T *out, KParam oInfo, global T const *signal, KParam sInfo,
-              local T *localMem, constant accType const *impulse, KParam fInfo,
-              int nBBS0, int nBBS1, int ostep1, int ostep2,
-              int ostep3, int sstep1, int sstep2, int sstep3)
-{
-    int fLen     = fInfo.dims[0];
-    int padding  = fLen-1;
-    int shrdLen  = get_local_size(0) + 2*padding;
-    const unsigned b1 = get_group_id(0)/nBBS0;
-    const unsigned b0 = get_group_id(0)-nBBS0*b1;
-    const unsigned b3 = get_group_id(1)/nBBS1;
-    const unsigned b2 = get_group_id(1)-nBBS1*b3;
-
-    global T *dst = out + (b1 * oInfo.strides[1] +  /* activated with batched input signal */
-                       ostep1 * oInfo.strides[1] +  /* activated with batched input filter */
-                           b2 * oInfo.strides[2] +  /* activated with batched input signal */
-                       ostep2 * oInfo.strides[2] +  /* activated with batched input filter */
-                           b3 * oInfo.strides[3] +  /* activated with batched input signal */
-                       ostep3 * oInfo.strides[3]);  /* activated with batched input filter */
-
-    global T const *src = signal + sInfo.offset + (b1 * sInfo.strides[1] + /* activated with batched input signal */
-                                               sstep1 * sInfo.strides[1] + /* activated with batched input filter */
-                                                   b2 * sInfo.strides[2] + /* activated with batched input signal */
-                                               sstep2 * sInfo.strides[2] + /* activated with batched input filter */
-                                                   b3 * sInfo.strides[3] + /* activated with batched input signal */
-                                               sstep3 * sInfo.strides[3]); /* activated with batched input filter */
-
-    int gx  = get_local_size(0)*b0;
-
-    for (int i=get_local_id(0); i<shrdLen; i+=get_local_size(0)) {
-        int idx = gx-padding + i;
-        localMem[i]  = (idx>=0 && idx<sInfo.dims[0]) ? src[idx*sInfo.strides[0]] : (T)(0);
+#if RANK == 1
+kernel void convolve(global T *out, KParam oInfo, global T const *signal,
+                     KParam sInfo, local T *localMem,
+                     constant accType const *impulse, KParam fInfo, int nBBS0,
+                     int nBBS1, int ostep1, int ostep2, int ostep3, int sstep1,
+                     int sstep2, int sstep3) {
+    int fLen          = fInfo.dims[0];
+    int padding       = fLen - 1;
+    int shrdLen       = get_local_size(0) + 2 * padding;
+    const unsigned b1 = get_group_id(0) / nBBS0;
+    const unsigned b0 = get_group_id(0) - nBBS0 * b1;
+    const unsigned b3 = get_group_id(1) / nBBS1;
+    const unsigned b2 = get_group_id(1) - nBBS1 * b3;
+
+    global T *dst =
+        out +
+        (b1 * oInfo.strides[1] +     /* activated with batched input signal */
+         ostep1 * oInfo.strides[1] + /* activated with batched input filter */
+         b2 * oInfo.strides[2] +     /* activated with batched input signal */
+         ostep2 * oInfo.strides[2] + /* activated with batched input filter */
+         b3 * oInfo.strides[3] +     /* activated with batched input signal */
+         ostep3 * oInfo.strides[3]); /* activated with batched input filter */
+
+    global T const *src =
+        signal + sInfo.offset +
+        (b1 * sInfo.strides[1] +     /* activated with batched input signal */
+         sstep1 * sInfo.strides[1] + /* activated with batched input filter */
+         b2 * sInfo.strides[2] +     /* activated with batched input signal */
+         sstep2 * sInfo.strides[2] + /* activated with batched input filter */
+         b3 * sInfo.strides[3] +     /* activated with batched input signal */
+         sstep3 * sInfo.strides[3]); /* activated with batched input filter */
+
+    int gx = get_local_size(0) * b0;
+
+    for (int i = get_local_id(0); i < shrdLen; i += get_local_size(0)) {
+        int idx     = gx - padding + i;
+        localMem[i] = (idx >= 0 && idx < sInfo.dims[0])
+                          ? src[idx * sInfo.strides[0]]
+                          : (T)(0);
     }
     barrier(CLK_LOCAL_MEM_FENCE);
     gx += get_local_id(0);
 
-    if (gx>=0 && gx<oInfo.dims[0]) {
-        int lx   = get_local_id(0) + padding + (EXPAND ? 0 : fLen>>1);
+    if (gx >= 0 && gx < oInfo.dims[0]) {
+        int lx        = get_local_id(0) + padding + (EXPAND ? 0 : fLen >> 1);
         accType accum = (accType)(0);
-        for(int f=0; f<fLen; ++f) {
-            accum = accum + ((accType)localMem[lx-f] * (accType)impulse[f]);
+        for (int f = 0; f < fLen; ++f) {
+            // binOp multiplies (MUL_OP) when used for convolution
+            accum =
+                accum + binOp((accType)localMem[lx - f], (accType)impulse[f]);
         }
         dst[gx] = (T)accum;
     }
 }
 #endif
 
-#if BASE_DIM==2
-kernel
-void convolve(global T *out, KParam oInfo, global T const *signal, KParam sInfo,
-              constant accType const *impulse, KParam fInfo,
-              int nBBS0, int nBBS1, int ostep2,
-              int ostep3, int sstep2, int sstep3)
-{
+#if RANK == 2
+kernel void convolve(global T *out, KParam oInfo, global T const *signal,
+                     KParam sInfo, constant accType const *impulse,
+                     KParam fInfo, int nBBS0, int nBBS1, int ostep2, int ostep3,
+                     int sstep2, int sstep3) {
     local T localMem[C_SIZE];
 
-    int radius0  = FLEN0-1;
-    int radius1  = FLEN1-1;
-    int padding0 = 2*radius0;
-    int padding1 = 2*radius1;
+    int radius0  = FLEN0 - 1;
+    int radius1  = FLEN1 - 1;
+    int padding0 = 2 * radius0;
+    int padding1 = 2 * radius1;
     int shrdLen0 = get_local_size(0) + padding0;
     int shrdLen1 = get_local_size(1) + padding1;
 
-    unsigned b0  = get_group_id(0)/nBBS0;
-    unsigned b1  = get_group_id(1)/nBBS1;
+    unsigned b0 = get_group_id(0) / nBBS0;
+    unsigned b1 = get_group_id(1) / nBBS1;
 
-    global T *dst = out + (b0 * oInfo.strides[2] + /* activated with batched input signal */
-                       ostep2 * oInfo.strides[2] + /* activated with batched input filter */
-                           b1 * oInfo.strides[3] + /* activated with batched input signal */
-                       ostep3 * oInfo.strides[3]); /* activated with batched input filter */
+    global T *dst =
+        out +
+        (b0 * oInfo.strides[2] +     /* activated with batched input signal */
+         ostep2 * oInfo.strides[2] + /* activated with batched input filter */
+         b1 * oInfo.strides[3] +     /* activated with batched input signal */
+         ostep3 * oInfo.strides[3]); /* activated with batched input filter */
 
-    global const T *src = signal + sInfo.offset + (b0 * sInfo.strides[2] + /* activated with batched input signal */
-                                               sstep2 * sInfo.strides[2] + /* activated with batched input filter */
-                                                   b1 * sInfo.strides[3] + /* activated with batched input signal */
-                                               sstep3 * sInfo.strides[3]); /* activated with batched input filter */
+    global const T *src =
+        signal + sInfo.offset +
+        (b0 * sInfo.strides[2] +     /* activated with batched input signal */
+         sstep2 * sInfo.strides[2] + /* activated with batched input filter */
+         b1 * sInfo.strides[3] +     /* activated with batched input signal */
+         sstep3 * sInfo.strides[3]); /* activated with batched input filter */
 
     int lx = get_local_id(0);
     int ly = get_local_id(1);
-    int gx = get_local_size(0) * (get_group_id(0)-b0*nBBS0) + lx;
-    int gy = get_local_size(1) * (get_group_id(1)-b1*nBBS1) + ly;
+    int gx = get_local_size(0) * (get_group_id(0) - b0 * nBBS0) + lx;
+    int gy = get_local_size(1) * (get_group_id(1) - b1 * nBBS1) + ly;
 
     // below loops are traditional loops, they only run multiple
     // times filter length is more than launch size
@@ -101,65 +109,73 @@ void convolve(global T *out, KParam oInfo, global T const *signal, KParam sInfo,
     int s1 = sInfo.strides[1];
     int d0 = sInfo.dims[0];
     int d1 = sInfo.dims[1];
-    for (int b=ly, gy2=gy; b<shrdLen1; b+=get_local_size(1), gy2+=get_local_size(1)) {
-        int j = gy2-radius1;
-        bool is_j  = j>=0 && j<d1;
+    for (int b = ly, gy2 = gy; b < shrdLen1;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        int j     = gy2 - radius1;
+        bool is_j = j >= 0 && j < d1;
         // move row_set get_local_size(1) along coloumns
-        for (int a=lx, gx2=gx; a<shrdLen0; a+=get_local_size(0), gx2+=get_local_size(0)) {
-            int i = gx2-radius0;
-            bool is_i  = i>=0 && i<d0;
-            localMem[b*shrdLen0+a] = (is_i && is_j ? src[i*s0+j*s1] : (T)(0));
+        for (int a = lx, gx2 = gx; a < shrdLen0;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            int i     = gx2 - radius0;
+            bool is_i = i >= 0 && i < d0;
+            localMem[b * shrdLen0 + a] =
+                (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
         }
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1]) {
-        int ci = lx + radius0 + (EXPAND ? 0 : FLEN0>>1);
-        int cj = ly + radius1 + (EXPAND ? 0 : FLEN1>>1);
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1]) {
+        int ci = lx + radius0 + (EXPAND ? 0 : FLEN0 >> 1);
+        int cj = ly + radius1 + (EXPAND ? 0 : FLEN1 >> 1);
 
         accType accum = (accType)(0);
-        for(int fj=0; fj<FLEN1; ++fj) {
-            for(int fi=0; fi<FLEN0; ++fi) {
-                accType f_val = impulse[fj*FLEN0+fi];
-                T s_val = localMem[(cj-fj)*shrdLen0+(ci-fi)];
-                accum   = accum + ((accType)s_val*(accType)f_val);
+        for (int fj = 0; fj < FLEN1; ++fj) {
+            for (int fi = 0; fi < FLEN0; ++fi) {
+                accType f_val = impulse[fj * FLEN0 + fi];
+                T s_val       = localMem[(cj - fj) * shrdLen0 + (ci - fi)];
+
+                // binOp multiplies (MUL_OP) when used for convolution
+                accum = accum + binOp((accType)s_val, (accType)f_val);
             }
         }
-        dst[gy*oInfo.strides[1]+gx] = (T)accum;
+        dst[gy * oInfo.strides[1] + gx] = (T)accum;
     }
 }
 #endif
 
-#if BASE_DIM==3
-kernel
-void convolve(global T *out, KParam oInfo, global T const *signal, KParam sInfo,
-              local T *localMem, constant accType const *impulse, KParam fInfo,
-              int nBBS0, int nBBS1, int o1, int ostep2,
-              int ostep3, int sstep1, int sstep2, int sstep3)
-{
+#if RANK == 3
+kernel void convolve(global T *out, KParam oInfo, global T const *signal,
+                     KParam sInfo, local T *localMem,
+                     constant accType const *impulse, KParam fInfo, int nBBS0,
+                     int nBBS1, int o1, int ostep2, int ostep3, int sstep1,
+                     int sstep2, int sstep3) {
     int fLen0    = fInfo.dims[0];
     int fLen1    = fInfo.dims[1];
     int fLen2    = fInfo.dims[2];
-    int radius0  = fLen0-1;
-    int radius1  = fLen1-1;
-    int radius2  = fLen2-1;
-    int shrdLen0 = get_local_size(0) + 2*radius0;
-    int shrdLen1 = get_local_size(1) + 2*radius1;
-    int shrdLen2 = get_local_size(2) + 2*radius2;
+    int radius0  = fLen0 - 1;
+    int radius1  = fLen1 - 1;
+    int radius2  = fLen2 - 1;
+    int shrdLen0 = get_local_size(0) + 2 * radius0;
+    int shrdLen1 = get_local_size(1) + 2 * radius1;
+    int shrdLen2 = get_local_size(2) + 2 * radius2;
     int skStride = shrdLen0 * shrdLen1;
     int fStride  = fLen0 * fLen1;
-    unsigned b2  = get_group_id(0)/nBBS0;
+    unsigned b2  = get_group_id(0) / nBBS0;
 
-    global T *dst = out + (b2 * oInfo.strides[3] + /* activated with batched input signal */
-                       ostep3 * oInfo.strides[3]); /* activated with batched input filter */
+    global T *dst =
+        out +
+        (b2 * oInfo.strides[3] +     /* activated with batched input signal */
+         ostep3 * oInfo.strides[3]); /* activated with batched input filter */
 
-    global const T *src = signal + sInfo.offset + (b2 * sInfo.strides[3] + /* activated with batched input signal */
-                                               sstep3 * sInfo.strides[3]); /* activated with batched input filter */
+    global const T *src =
+        signal + sInfo.offset +
+        (b2 * sInfo.strides[3] +     /* activated with batched input signal */
+         sstep3 * sInfo.strides[3]); /* activated with batched input filter */
 
     int lx  = get_local_id(0);
     int ly  = get_local_id(1);
     int lz  = get_local_id(2);
-    int gx  = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
+    int gx  = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
     int gy  = get_local_size(1) * get_group_id(1) + ly;
     int gz  = get_local_size(2) * get_group_id(2) + lz;
     int lx2 = lx + get_local_size(0);
@@ -176,33 +192,41 @@ void convolve(global T *out, KParam oInfo, global T const *signal, KParam sInfo,
     int d1 = sInfo.dims[1];
     int d2 = sInfo.dims[2];
 
-    for (int c=lz, gz2=gz; c<shrdLen2; c+=get_local_size(2), gz2+=get_local_size(2)) {
-        int k = gz2-radius2;
-        bool is_k  = k>=0 && k<d2;
-        for (int b=ly, gy2=gy; b<shrdLen1; b+=get_local_size(1), gy2+=get_local_size(1)) {
-            int j = gy2-radius1;
-            bool is_j  = j>=0 && j<d1;
-            for (int a=lx, gx2=gx; a<shrdLen0; a+=get_local_size(0), gx2+=get_local_size(0)) {
-                int i = gx2-radius0;
-                bool is_i  = i>=0 && i<d0;
-                localMem[c*skStride+b*shrdLen0+a] = (is_i && is_j && is_k ? src[i*s0+j*s1+k*s2] : (T)(0));
+    for (int c = lz, gz2 = gz; c < shrdLen2;
+         c += get_local_size(2), gz2 += get_local_size(2)) {
+        int k     = gz2 - radius2;
+        bool is_k = k >= 0 && k < d2;
+        for (int b = ly, gy2 = gy; b < shrdLen1;
+             b += get_local_size(1), gy2 += get_local_size(1)) {
+            int j     = gy2 - radius1;
+            bool is_j = j >= 0 && j < d1;
+            for (int a = lx, gx2 = gx; a < shrdLen0;
+                 a += get_local_size(0), gx2 += get_local_size(0)) {
+                int i     = gx2 - radius0;
+                bool is_i = i >= 0 && i < d0;
+                localMem[c * skStride + b * shrdLen0 + a] =
+                    (is_i && is_j && is_k ? src[i * s0 + j * s1 + k * s2]
+                                          : (T)(0));
             }
         }
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1] && gz<oInfo.dims[2]) {
-        int ci = lx + radius0 + (EXPAND ? 0 : fLen0>>1);
-        int cj = ly + radius1 + (EXPAND ? 0 : fLen1>>1);
-        int ck = lz + radius2 + (EXPAND ? 0 : fLen2>>1);
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1] && gz < oInfo.dims[2]) {
+        int ci = lx + radius0 + (EXPAND ? 0 : fLen0 >> 1);
+        int cj = ly + radius1 + (EXPAND ? 0 : fLen1 >> 1);
+        int ck = lz + radius2 + (EXPAND ? 0 : fLen2 >> 1);
 
         accType accum = (accType)(0);
-        for(int fk=0; fk<fLen2; ++fk) {
-            for(int fj=0; fj<fLen1; ++fj) {
-                for(int fi=0; fi<fLen0; ++fi) {
+        for (int fk = 0; fk < fLen2; ++fk) {
+            for (int fj = 0; fj < fLen1; ++fj) {
+                for (int fi = 0; fi < fLen0; ++fi) {
                     accType f_val = impulse[index(fi, fj, fk, fLen0, fStride)];
-                    T s_val = localMem[index(ci-fi, cj-fj, ck-fk, shrdLen0, skStride)];
-                    accum   = accum + ((accType)s_val*(accType)f_val);
+                    T s_val       = localMem[index(ci - fi, cj - fj, ck - fk,
+                                             shrdLen0, skStride)];
+
+                    // binOp multiplies (MUL_OP) when used for convolution
+                    accum = accum + binOp((accType)s_val, (accType)f_val);
                 }
             }
         }
diff --git a/src/backend/opencl/kernel/convolve.hpp b/src/backend/opencl/kernel/convolve.hpp
index 89ac252d12..39d2c77564 100644
--- a/src/backend/opencl/kernel/convolve.hpp
+++ b/src/backend/opencl/kernel/convolve.hpp
@@ -10,18 +10,17 @@
 #pragma once
 #include <kernel/convolve/conv_common.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-namespace kernel
-{
+namespace kernel {
 
 // below shared MAX_*_LEN's are calculated based on
 // a maximum shared memory configuration of 48KB per block
 // considering complex types as well
-static const int MAX_CONV1_FILTER_LEN = 129;
-static const int MAX_CONV2_FILTER_LEN = 17;
-static const int MAX_CONV3_FILTER_LEN = 5;
+constexpr int MAX_CONV1_FILTER_LEN = 129;
+constexpr int MAX_CONV2_FILTER_LEN = 17;
+constexpr int MAX_CONV3_FILTER_LEN = 5;
 
 /*
  * convolution kernel wrappers are split to multiple files to
@@ -31,30 +30,32 @@ static const int MAX_CONV3_FILTER_LEN = 5;
  * file under the folder 'kernel/convovel' with their implementations
  * written in corresponding conv[1|2|3].cpp files under the same folder.
  */
-template<typename T, typename accType, int baseDim, bool expand>
-void convolve_nd(Param out, const Param signal, const Param filter, ConvolveBatchKind kind)
-{
+template<typename T, typename accType>
+void convolve_nd(Param out, const Param signal, const Param filter,
+                 AF_BATCH_KIND kind, const int rank, const bool expand) {
     conv_kparam_t param;
 
-    for (int i=0; i<3; ++i) {
+    for (int i = 0; i < 3; ++i) {
         param.o[i] = 0;
         param.s[i] = 0;
     }
-    param.launchMoreBlocks = kind==MANY2MANY || kind==ONE2MANY;
-    param.outHasNoOffset = kind==MANY2ONE || kind==ONE2ONE;
-    param.inHasNoOffset  = kind!=MANY2MANY;
+    param.launchMoreBlocks = kind == AF_BATCH_SAME || kind == AF_BATCH_RHS;
+    param.outHasNoOffset   = kind == AF_BATCH_LHS || kind == AF_BATCH_NONE;
+    param.inHasNoOffset    = kind != AF_BATCH_SAME;
 
-    prepareKernelArgs<T>(param, out.info.dims, filter.info.dims, baseDim);
+    prepareKernelArgs<T>(param, out.info.dims, filter.info.dims, rank);
 
-    switch(baseDim) {
-        case 1: conv1<T, accType, expand>(param, out, signal, filter); break;
-        case 2: conv2<T, accType, expand>(param, out, signal, filter); break;
-        case 3: conv3<T, accType, expand>(param, out, signal, filter); break;
+    switch (rank) {
+        case 1: conv1<T, accType>(param, out, signal, filter, expand); break;
+        case 2: conv2<T, accType>(param, out, signal, filter, expand); break;
+        case 3: conv3<T, accType>(param, out, signal, filter, expand); break;
     }
 
+    CL_DEBUG_FINISH(getQueue());
     bufferFree(param.impulse);
 }
 
-}
+}  // namespace kernel
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv1.cpp b/src/backend/opencl/kernel/convolve/conv1.cpp
index 7ac1123ee6..5bfa9668d6 100644
--- a/src/backend/opencl/kernel/convolve/conv1.cpp
+++ b/src/backend/opencl/kernel/convolve/conv1.cpp
@@ -9,33 +9,31 @@
 
 #include <kernel/convolve/conv_common.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-namespace kernel
-{
-
-template<typename T, typename aT, bool expand>
-void conv1(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
-{
+template<typename T, typename aT>
+void conv1(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand) {
     size_t se_size = filt.info.dims[0] * sizeof(aT);
-    p.impulse = bufferAlloc(se_size);
-    int f0Off = filt.info.offset;
+    p.impulse      = bufferAlloc(se_size);
+    int f0Off      = filt.info.offset;
 
-    for (int b3=0; b3<filt.info.dims[3]; ++b3) {
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
         int f3Off = b3 * filt.info.strides[3];
 
-        for (int b2=0; b2<filt.info.dims[2]; ++b2) {
+        for (int b2 = 0; b2 < filt.info.dims[2]; ++b2) {
             int f2Off = b2 * filt.info.strides[2];
 
-            for (int b1=0; b1<filt.info.dims[1]; ++b1) {
+            for (int b1 = 0; b1 < filt.info.dims[1]; ++b1) {
                 int f1Off = b1 * filt.info.strides[1];
 
                 // FIXME: if the filter array is strided, direct copy of symbols
                 // might cause issues
-                getQueue().enqueueCopyBuffer(*filt.data, *p.impulse,
-                                            (f0Off+f1Off+f2Off+f3Off)*sizeof(aT)
-                                             , 0, se_size);
+                getQueue().enqueueCopyBuffer(
+                    *filt.data, *p.impulse,
+                    (f0Off + f1Off + f2Off + f3Off) * sizeof(aT), 0, se_size);
 
                 p.o[0] = (p.outHasNoOffset ? 0 : b1);
                 p.o[1] = (p.outHasNoOffset ? 0 : b2);
@@ -44,25 +42,30 @@ void conv1(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
                 p.s[1] = (p.inHasNoOffset ? 0 : b2);
                 p.s[2] = (p.inHasNoOffset ? 0 : b3);
 
-                convNHelper<T, aT, 1, expand>(p, out, sig, filt);
+                convNHelper<T, aT>(p, out, sig, filt, 1, expand);
             }
         }
     }
 }
 
-#define INSTANTIATE(T, accT)  \
-    template void conv1<T, accT, true >(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
-    template void conv1<T, accT, false>(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
+#define INSTANTIATE(T, accT)                                           \
+    template void conv1<T, accT>(conv_kparam_t&, Param&, const Param&, \
+                                 const Param&, const bool);
 
 INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
-
-}
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_b8.cpp b/src/backend/opencl/kernel/convolve/conv2_b8.cpp
index 2a7252294b..18c41628a6 100644
--- a/src/backend/opencl/kernel/convolve/conv2_b8.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_b8.cpp
@@ -9,15 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(char, float)
 
-}
-
-}
-
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_c32.cpp b/src/backend/opencl/kernel/convolve/conv2_c32.cpp
index 91ad98862d..5be66c8040 100644
--- a/src/backend/opencl/kernel/convolve/conv2_c32.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_c32.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(cfloat, cfloat)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_c64.cpp b/src/backend/opencl/kernel/convolve/conv2_c64.cpp
index 80754c4e08..87e787ceed 100644
--- a/src/backend/opencl/kernel/convolve/conv2_c64.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_c64.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(cdouble, cdouble)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_f32.cpp b/src/backend/opencl/kernel/convolve/conv2_f32.cpp
index 887bb2a299..89dc63dd6d 100644
--- a/src/backend/opencl/kernel/convolve/conv2_f32.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_f32.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(float, float)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_f64.cpp b/src/backend/opencl/kernel/convolve/conv2_f64.cpp
index 3482c4dde4..97a8044cdd 100644
--- a/src/backend/opencl/kernel/convolve/conv2_f64.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_f64.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(double, double)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_impl.hpp b/src/backend/opencl/kernel/convolve/conv2_impl.hpp
index 4c1ac5fd4f..9798714750 100644
--- a/src/backend/opencl/kernel/convolve/conv2_impl.hpp
+++ b/src/backend/opencl/kernel/convolve/conv2_impl.hpp
@@ -7,126 +7,76 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <kernel/convolve/conv_common.hpp>
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-template<typename T, typename aT, bool expand, int f0, int f1>
-void conv2Helper(const conv_kparam_t& param, Param out, const Param signal, const Param filter)
-{
-    try {
-        static std::once_flag  compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> convProgs;
-        static std::map<int, Kernel*>  convKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    size_t LOC_SIZE = (THREADS_X+2*(f0-1))*(THREADS_Y+2*(f1-1));
-
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D accType="<< dtype_traits<aT>::getName()
-                            << " -D BASE_DIM="<< 2 /* hard constant specific to this convolution type */
-                            << " -D FLEN0=" << f0
-                            << " -D FLEN1=" << f1
-                            << " -D EXPAND="<< expand
-                            << " -D C_SIZE="<< LOC_SIZE;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, convolve_cl, convolve_cl_len, options.str());
-                    convProgs[device]   = new Program(prog);
-                    convKernels[device] = new Kernel(*convProgs[device], "convolve");
-                });
-
-        auto convOp = make_kernel<Buffer, KParam, Buffer, KParam,
-                                  Buffer, KParam, int, int,
-                                  int, int,
-                                  int, int
-                                 >(*convKernels[device]);
-
-        convOp(EnqueueArgs(getQueue(), param.global, param.local),
-                *out.data, out.info, *signal.data, signal.info,
-                *param.impulse, filter.info, param.nBBS0, param.nBBS1,
-                param.o[1], param.o[2], param.s[1], param.s[2]);
-
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+#pragma once
 
-template<typename T, typename aT, bool expand, int f>
-void conv2Helper(const conv_kparam_t& p, Param out, const Param sig, const Param filt)
-{
-    switch(filt.info.dims[1]) {
-        case  1: conv2Helper<T, aT, expand, f,  1>(p, out, sig, filt); break;
-        case  2: conv2Helper<T, aT, expand, f,  2>(p, out, sig, filt); break;
-        case  3: conv2Helper<T, aT, expand, f,  3>(p, out, sig, filt); break;
-        case  4: conv2Helper<T, aT, expand, f,  4>(p, out, sig, filt); break;
-        case  5: conv2Helper<T, aT, expand, f,  5>(p, out, sig, filt); break;
-        default: OPENCL_NOT_SUPPORTED();
-    }
-}
+#include <common/kernel_cache.hpp>
+#include <kernel/convolve/conv_common.hpp>
 
-template<typename T, typename aT, bool expand>
-void conv2Helper(const conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
-{
-    int f0 = filt.info.dims[0];
-    int f1 = filt.info.dims[1];
-    switch(f0) {
-        case  1: conv2Helper<T, aT, expand,  1>(p, out, sig, filt); break;
-        case  2: conv2Helper<T, aT, expand,  2>(p, out, sig, filt); break;
-        case  3: conv2Helper<T, aT, expand,  3>(p, out, sig, filt); break;
-        case  4: conv2Helper<T, aT, expand,  4>(p, out, sig, filt); break;
-        case  5: conv2Helper<T, aT, expand,  5>(p, out, sig, filt); break;
-        default: {
-                     if (f0==f1) {
-                         switch(f1) {
-                             case  6: conv2Helper<T, aT, expand,  6,  6>(p, out, sig, filt); break;
-                             case  7: conv2Helper<T, aT, expand,  7,  7>(p, out, sig, filt); break;
-                             case  8: conv2Helper<T, aT, expand,  8,  8>(p, out, sig, filt); break;
-                             case  9: conv2Helper<T, aT, expand,  9,  9>(p, out, sig, filt); break;
-                             case 10: conv2Helper<T, aT, expand, 10, 10>(p, out, sig, filt); break;
-                             case 11: conv2Helper<T, aT, expand, 11, 11>(p, out, sig, filt); break;
-                             case 12: conv2Helper<T, aT, expand, 12, 12>(p, out, sig, filt); break;
-                             case 13: conv2Helper<T, aT, expand, 13, 13>(p, out, sig, filt); break;
-                             case 14: conv2Helper<T, aT, expand, 14, 14>(p, out, sig, filt); break;
-                             case 15: conv2Helper<T, aT, expand, 15, 15>(p, out, sig, filt); break;
-                             case 16: conv2Helper<T, aT, expand, 16, 16>(p, out, sig, filt); break;
-                             case 17: conv2Helper<T, aT, expand, 17, 17>(p, out, sig, filt); break;
-                             default: OPENCL_NOT_SUPPORTED();
-                         }
-                     } else
-                         OPENCL_NOT_SUPPORTED();
-                 } break;
-    }
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, typename aT>
+void conv2Helper(const conv_kparam_t& param, Param out, const Param signal,
+                 const Param filter, const bool expand) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    const int f0 = filter.info.dims[0];
+    const int f1 = filter.info.dims[1];
+    const size_t LOC_SIZE =
+        (THREADS_X + 2 * (f0 - 1)) * (THREADS_Y + 2 * (f1 - 1));
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(), TemplateTypename<aT>(), TemplateArg(expand),
+        TemplateArg(f0),       TemplateArg(f1),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(Ti, dtype_traits<T>::getName()),
+        DefineKeyValue(To, dtype_traits<aT>::getName()),
+        DefineKeyValue(accType, dtype_traits<aT>::getName()),
+        DefineKeyValue(RANK, 2),
+        DefineKeyValue(FLEN0, f0),
+        DefineKeyValue(FLEN1, f1),
+        DefineKeyValue(EXPAND, (expand ? 1 : 0)),
+        DefineKeyValue(C_SIZE, LOC_SIZE),
+        DefineKeyFromStr(binOpName<af_mul_t>()),
+        DefineKeyValue(CPLX, (IsComplex ? 1 : 0)),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto convolve = common::getKernel(
+        "convolve", {{ops_cl_src, convolve_cl_src}}, tmpltArgs, compileOpts);
+
+    convolve(EnqueueArgs(getQueue(), param.global, param.local), *out.data,
+             out.info, *signal.data, signal.info, *param.impulse, filter.info,
+             param.nBBS0, param.nBBS1, param.o[1], param.o[2], param.s[1],
+             param.s[2]);
 }
 
-template<typename T, typename aT, bool expand>
-void conv2(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
-{
+template<typename T, typename aT>
+void conv2(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand) {
     size_t se_size = filt.info.dims[0] * filt.info.dims[1] * sizeof(aT);
-    p.impulse = bufferAlloc(se_size);
-    int f0Off = filt.info.offset;
+    p.impulse      = bufferAlloc(se_size);
+    int f0Off      = filt.info.offset;
 
-    for (int b3=0; b3<filt.info.dims[3]; ++b3) {
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
         int f3Off = b3 * filt.info.strides[3];
 
-        for (int b2=0; b2<filt.info.dims[2]; ++b2) {
+        for (int b2 = 0; b2 < filt.info.dims[2]; ++b2) {
             int f2Off = b2 * filt.info.strides[2];
 
             // FIXME: if the filter array is strided, direct copy of symbols
             // might cause issues
             getQueue().enqueueCopyBuffer(*filt.data, *p.impulse,
-                                         (f2Off+f3Off+f0Off)*sizeof(aT),
+                                         (f2Off + f3Off + f0Off) * sizeof(aT),
                                          0, se_size);
 
             p.o[1] = (p.outHasNoOffset ? 0 : b2);
@@ -134,15 +84,15 @@ void conv2(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
             p.s[1] = (p.inHasNoOffset ? 0 : b2);
             p.s[2] = (p.inHasNoOffset ? 0 : b3);
 
-            conv2Helper<T, aT, expand>(p, out, sig, filt);
+            conv2Helper<T, aT>(p, out, sig, filt, expand);
         }
     }
 }
 
-#define INSTANTIATE(T, accT)  \
-    template void conv2<T, accT, true >(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
-    template void conv2<T, accT, false>(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
+#define INSTANTIATE(T, accT)                                           \
+    template void conv2<T, accT>(conv_kparam_t&, Param&, const Param&, \
+                                 const Param&, const bool);
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_s16.cpp b/src/backend/opencl/kernel/convolve/conv2_s16.cpp
new file mode 100644
index 0000000000..d5c1e5cc3d
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve/conv2_s16.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/convolve/conv2_impl.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+INSTANTIATE(short, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_s32.cpp b/src/backend/opencl/kernel/convolve/conv2_s32.cpp
index 431c85d838..dc621d45f5 100644
--- a/src/backend/opencl/kernel/convolve/conv2_s32.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_s32.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(int, float)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_s64.cpp b/src/backend/opencl/kernel/convolve/conv2_s64.cpp
new file mode 100644
index 0000000000..cdfde44ab1
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve/conv2_s64.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/convolve/conv2_impl.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+INSTANTIATE(intl, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_s8.cpp b/src/backend/opencl/kernel/convolve/conv2_s8.cpp
new file mode 100644
index 0000000000..b4b39b3f28
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve/conv2_s8.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2023, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/convolve/conv2_impl.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+INSTANTIATE(schar, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_u16.cpp b/src/backend/opencl/kernel/convolve/conv2_u16.cpp
new file mode 100644
index 0000000000..05b525ea5c
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve/conv2_u16.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/convolve/conv2_impl.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+INSTANTIATE(ushort, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_u32.cpp b/src/backend/opencl/kernel/convolve/conv2_u32.cpp
index 332b3fe70e..c4b6667c32 100644
--- a/src/backend/opencl/kernel/convolve/conv2_u32.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_u32.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(uint, float)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_u64.cpp b/src/backend/opencl/kernel/convolve/conv2_u64.cpp
new file mode 100644
index 0000000000..b7f410bc9c
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve/conv2_u64.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/convolve/conv2_impl.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+INSTANTIATE(uintl, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv2_u8.cpp b/src/backend/opencl/kernel/convolve/conv2_u8.cpp
index 39cedd4acf..bfe74b4c6b 100644
--- a/src/backend/opencl/kernel/convolve/conv2_u8.cpp
+++ b/src/backend/opencl/kernel/convolve/conv2_u8.cpp
@@ -9,14 +9,12 @@
 
 #include <kernel/convolve/conv2_impl.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 INSTANTIATE(uchar, float)
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv3.cpp b/src/backend/opencl/kernel/convolve/conv3.cpp
index 844a79f65b..1383e8f443 100644
--- a/src/backend/opencl/kernel/convolve/conv3.cpp
+++ b/src/backend/opencl/kernel/convolve/conv3.cpp
@@ -9,45 +9,50 @@
 
 #include <kernel/convolve/conv_common.hpp>
 
-namespace opencl
-{
-
-namespace kernel
-{
-
-template<typename T, typename aT, bool expand>
-void conv3(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt)
-{
-    size_t se_size = filt.info.dims[0] * filt.info.dims[1] * filt.info.dims[2] * sizeof(aT);
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, typename aT>
+void conv3(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand) {
+    size_t se_size =
+        filt.info.dims[0] * filt.info.dims[1] * filt.info.dims[2] * sizeof(aT);
     p.impulse = bufferAlloc(se_size);
     int f0Off = filt.info.offset;
 
-    for (int b3=0; b3<filt.info.dims[3]; ++b3) {
+    for (int b3 = 0; b3 < filt.info.dims[3]; ++b3) {
         int f3Off = b3 * filt.info.strides[3];
         // FIXME: if the filter array is strided, direct copy of symbols
         // might cause issues
-        getQueue().enqueueCopyBuffer(*filt.data, *p.impulse, (f0Off + f3Off)*sizeof(aT), 0, se_size);
+        getQueue().enqueueCopyBuffer(*filt.data, *p.impulse,
+                                     (f0Off + f3Off) * sizeof(aT), 0, se_size);
 
         p.o[2] = (p.outHasNoOffset ? 0 : b3);
         p.s[2] = (p.inHasNoOffset ? 0 : b3);
 
-        convNHelper<T, aT, 3, expand>(p, out, sig, filt);
+        convNHelper<T, aT>(p, out, sig, filt, 3, expand);
     }
 }
 
-#define INSTANTIATE(T, accT)  \
-    template void conv3<T, accT, true >(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
-    template void conv3<T, accT, false>(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt); \
+#define INSTANTIATE(T, accT)                                           \
+    template void conv3<T, accT>(conv_kparam_t&, Param&, const Param&, \
+                                 const Param&, const bool);
 
 INSTANTIATE(cdouble, cdouble)
-INSTANTIATE(cfloat ,  cfloat)
-INSTANTIATE(double ,  double)
-INSTANTIATE(float  ,   float)
-INSTANTIATE(uint   ,   float)
-INSTANTIATE(int    ,   float)
-INSTANTIATE(uchar  ,   float)
-INSTANTIATE(char   ,   float)
-
-}
-
-}
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve/conv_common.hpp b/src/backend/opencl/kernel/convolve/conv_common.hpp
index 36bda38335..bd93419c7c 100644
--- a/src/backend/opencl/kernel/convolve/conv_common.hpp
+++ b/src/backend/opencl/kernel/convolve/conv_common.hpp
@@ -8,147 +8,132 @@
  ********************************************************/
 
 #pragma once
-#include <af/defines.h>
-
-#include <convolve_common.hpp>
-#include <kernel_headers/convolve.hpp>
 
-#include <map>
-#include <mutex>
-#include <string>
 #include <Param.hpp>
-#include <types.hpp>
-#include <traits.hpp>
-#include <memory.hpp>
-#include <program.hpp>
-#include <dispatch.hpp>
-#include <platform.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/convolve.hpp>
+#include <kernel_headers/ops.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+#include <types.hpp>
+#include <af/defines.h>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS   = 256;
+#include <string>
+#include <vector>
 
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-static const int CUBE_X    =  8;
-static const int CUBE_Y    =  8;
-static const int CUBE_Z    =  4;
+constexpr int THREADS   = 256;
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
+constexpr int CUBE_X    = 8;
+constexpr int CUBE_Y    = 8;
+constexpr int CUBE_Z    = 4;
 
 struct conv_kparam_t {
-    NDRange         global;
-    NDRange          local;
-    size_t        loc_size;
-    int         nBBS0;
-    int         nBBS1;
-    bool    outHasNoOffset;
-    bool     inHasNoOffset;
-    bool  launchMoreBlocks;
-    int          o[3];
-    int          s[3];
-    cl::Buffer*    impulse;
+    cl::NDRange global;
+    cl::NDRange local;
+    size_t loc_size;
+    int nBBS0;
+    int nBBS1;
+    bool outHasNoOffset;
+    bool inHasNoOffset;
+    bool launchMoreBlocks;
+    int o[3];
+    int s[3];
+    cl::Buffer* impulse;
 };
 
 template<typename T>
-void prepareKernelArgs(conv_kparam_t& param, dim_t *oDims,
-                       const dim_t *fDims, int baseDim)
-{
+void prepareKernelArgs(conv_kparam_t& param, dim_t* oDims, const dim_t* fDims,
+                       const int rank) {
+    using cl::NDRange;
+
     int batchDims[4] = {1, 1, 1, 1};
-    for(int i=baseDim; i<4; ++i) {
+    for (int i = rank; i < 4; ++i) {
         batchDims[i] = (param.launchMoreBlocks ? 1 : oDims[i]);
     }
 
-    if (baseDim==1) {
+    if (rank == 1) {
         param.local    = NDRange(THREADS, 1);
         param.nBBS0    = divup(oDims[0], THREADS);
         param.nBBS1    = batchDims[2];
-        param.global   = NDRange(param.nBBS0 * THREADS * batchDims[1], param.nBBS1 * batchDims[3]);
-        param.loc_size = (THREADS+2*(fDims[0]-1)) * sizeof(T);
-    } else if (baseDim==2) {
-        param.local    = NDRange(THREADS_X, THREADS_Y);
-        param.nBBS0    = divup(oDims[0], THREADS_X);
-        param.nBBS1    = divup(oDims[1], THREADS_Y);
-        param.global   = NDRange(param.nBBS0*THREADS_X*batchDims[2],
-                                 param.nBBS1*THREADS_Y*batchDims[3]);
-    } else if (baseDim==3) {
+        param.global   = NDRange(param.nBBS0 * THREADS * batchDims[1],
+                                 param.nBBS1 * batchDims[3]);
+        param.loc_size = (THREADS + 2 * (fDims[0] - 1)) * sizeof(T);
+    } else if (rank == 2) {
+        param.local  = NDRange(THREADS_X, THREADS_Y);
+        param.nBBS0  = divup(oDims[0], THREADS_X);
+        param.nBBS1  = divup(oDims[1], THREADS_Y);
+        param.global = NDRange(param.nBBS0 * THREADS_X * batchDims[2],
+                               param.nBBS1 * THREADS_Y * batchDims[3]);
+    } else if (rank == 3) {
         param.local    = NDRange(CUBE_X, CUBE_Y, CUBE_Z);
         param.nBBS0    = divup(oDims[0], CUBE_X);
         param.nBBS1    = divup(oDims[1], CUBE_Y);
-        int blk_z = divup(oDims[2], CUBE_Z);
+        int blk_z      = divup(oDims[2], CUBE_Z);
         param.global   = NDRange(param.nBBS0 * CUBE_X * batchDims[3],
-                                 param.nBBS1 * CUBE_Y,
-                                 blk_z * CUBE_Z);
-        param.loc_size = (CUBE_X+2*(fDims[0]-1)) * (CUBE_Y+2*(fDims[1]-1)) *
-                         (CUBE_Z+2*(fDims[2]-1)) * sizeof(T);
+                                 param.nBBS1 * CUBE_Y, blk_z * CUBE_Z);
+        param.loc_size = (CUBE_X + 2 * (fDims[0] - 1)) *
+                         (CUBE_Y + 2 * (fDims[1] - 1)) *
+                         (CUBE_Z + 2 * (fDims[2] - 1)) * sizeof(T);
     }
 }
 
-template<typename T, typename aT, int bDim, bool expand>
-void convNHelper(const conv_kparam_t& param, Param& out, const Param& signal, const Param& filter)
-{
-    try {
-        static std::once_flag  compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> convProgs;
-        static std::map<int, Kernel*>  convKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D accType="<< dtype_traits<aT>::getName()
-                            << " -D BASE_DIM="<< bDim
-                            << " -D EXPAND=" << expand;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, convolve_cl, convolve_cl_len, options.str());
-                    convProgs[device]   = new Program(prog);
-                    convKernels[device] = new Kernel(*convProgs[device], "convolve");
-                });
-
-        auto convOp = make_kernel<Buffer, KParam, Buffer, KParam,
-                                  cl::LocalSpaceArg, Buffer, KParam,
-                                  int, int,
-                                  int, int, int,
-                                  int, int, int
-                                 >(*convKernels[device]);
-
-        convOp(EnqueueArgs(getQueue(), param.global, param.local),
-                *out.data, out.info, *signal.data, signal.info, cl::Local(param.loc_size),
-                *param.impulse, filter.info, param.nBBS0, param.nBBS1,
-                param.o[0], param.o[1], param.o[2], param.s[0], param.s[1], param.s[2]);
-
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
+template<typename T, typename aT>
+void convNHelper(const conv_kparam_t& param, Param& out, const Param& signal,
+                 const Param& filter, const int rank, const bool expand) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateTypename<aT>(),
+        TemplateArg(rank),
+        TemplateArg(expand),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(Ti, dtype_traits<T>::getName()),
+        DefineKeyValue(To, dtype_traits<aT>::getName()),
+        DefineKeyValue(accType, dtype_traits<aT>::getName()),
+        DefineKeyValue(RANK, rank),
+        DefineKeyValue(EXPAND, (expand ? 1 : 0)),
+        DefineKeyFromStr(binOpName<af_mul_t>()),
+        DefineKeyValue(CPLX, (IsComplex ? 1 : 0)),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto convolve = common::getKernel(
+        "convolve", {{ops_cl_src, convolve_cl_src}}, tmpltArgs, compileOpts);
+
+    convolve(EnqueueArgs(getQueue(), param.global, param.local), *out.data,
+             out.info, *signal.data, signal.info, cl::Local(param.loc_size),
+             *param.impulse, filter.info, param.nBBS0, param.nBBS1, param.o[0],
+             param.o[1], param.o[2], param.s[0], param.s[1], param.s[2]);
 }
 
-template<typename T, typename aT, bool expand>
-void conv1(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt);
-
-template<typename T, typename aT, bool expand>
-void conv2(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt);
+template<typename T, typename aT>
+void conv1(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand);
 
-template<typename T, typename aT, bool expand>
-void conv3(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt);
+template<typename T, typename aT>
+void conv2(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand);
 
-}
-
-}
+template<typename T, typename aT>
+void conv3(conv_kparam_t& p, Param& out, const Param& sig, const Param& filt,
+           const bool expand);
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve_separable.cl b/src/backend/opencl/kernel/convolve_separable.cl
index 763fed1701..02e5d53d41 100644
--- a/src/backend/opencl/kernel/convolve_separable.cl
+++ b/src/backend/opencl/kernel/convolve_separable.cl
@@ -7,72 +7,81 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-kernel
-void convolve(global T *out, KParam oInfo, global T const *signal,
-              KParam sInfo, constant accType const *impulse,
-              int nBBS0, int nBBS1)
-{
+kernel void convolve(global T *out, KParam oInfo, global T const *signal,
+                     KParam sInfo, constant accType const *impulse, int nBBS0,
+                     int nBBS1) {
     local T localMem[LOCAL_MEM_SIZE];
 
-    const int radius  = FLEN-1;
-    const int padding = 2*radius;
+    const int radius  = FLEN - 1;
+    const int padding = 2 * radius;
     const int s0      = sInfo.strides[0];
     const int s1      = sInfo.strides[1];
     const int d0      = sInfo.dims[0];
     const int d1      = sInfo.dims[1];
-    const int shrdLen = get_local_size(0) + (CONV_DIM==0 ? padding : 0);
+    const int shrdLen = get_local_size(0) + (CONV_DIM == 0 ? padding : 0);
 
-    unsigned b2  = get_group_id(0)/nBBS0;
-    unsigned b3  = get_group_id(1)/nBBS1;
-    global T *dst = out + (b2*oInfo.strides[2] + b3*oInfo.strides[3]);
-    global const T *src = signal + (b2*sInfo.strides[2] + b3*sInfo.strides[3]) + sInfo.offset;
+    unsigned b2   = get_group_id(0) / nBBS0;
+    unsigned b3   = get_group_id(1) / nBBS1;
+    global T *dst = out + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
+    global const T *src =
+        signal + (b2 * sInfo.strides[2] + b3 * sInfo.strides[3]) + sInfo.offset;
 
     int lx = get_local_id(0);
     int ly = get_local_id(1);
-    int ox = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    int oy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
+    int ox = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    int oy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
     int gx = ox;
     int gy = oy;
 
-    // below if-else statement is based on MACRO value passed while kernel compilation
-    if (CONV_DIM==0) {
-        gx += (EXPAND ? 0 : FLEN>>1);
-        int endX = ((FLEN-1)<<1) + get_local_size(0);
+    // The if-else below branches on the CONV_DIM macro, which is set at
+    // kernel compile time
+    if (CONV_DIM == 0) {
+        gx += (EXPAND ? 0 : FLEN >> 1);
+        int endX = ((FLEN - 1) << 1) + get_local_size(0);
 #pragma unroll
-        for(int lx = get_local_id(0), glb_x = gx; lx<endX; lx += get_local_size(0), glb_x += get_local_size(0)) {
-            int i = glb_x - radius;
-            int j = gy;
-            bool is_i  = i>=0 && i<d0;
-            bool is_j  = j>=0 && j<d1;
-            localMem[ly*shrdLen+lx] = (is_i && is_j ? src[i*s0 + j*s1] : (T)(0));
+        for (int lx = get_local_id(0), glb_x = gx; lx < endX;
+             lx += get_local_size(0), glb_x += get_local_size(0)) {
+            int i     = glb_x - radius;
+            int j     = gy;
+            bool is_i = i >= 0 && i < d0;
+            bool is_j = j >= 0 && j < d1;
+            localMem[ly * shrdLen + lx] =
+                (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
         }
 
-    } else if (CONV_DIM==1) {
-        gy += (EXPAND ? 0 : FLEN>>1);
-        int endY = ((FLEN-1)<<1) + get_local_size(1);
+    } else if (CONV_DIM == 1) {
+        gy += (EXPAND ? 0 : FLEN >> 1);
+        int endY = ((FLEN - 1) << 1) + get_local_size(1);
 #pragma unroll
-        for(int ly = get_local_id(1), glb_y = gy; ly<endY; ly += get_local_size(1), glb_y += get_local_size(1)) {
-            int i = gx;
-            int j = glb_y - radius;
-            bool is_i  = i>=0 && i<d0;
-            bool is_j  = j>=0 && j<d1;
-            localMem[ly*shrdLen+lx] = (is_i && is_j ? src[i*s0 + j*s1] : (T)(0));
+        for (int ly = get_local_id(1), glb_y = gy; ly < endY;
+             ly += get_local_size(1), glb_y += get_local_size(1)) {
+            int i     = gx;
+            int j     = glb_y - radius;
+            bool is_i = i >= 0 && i < d0;
+            bool is_j = j >= 0 && j < d1;
+            localMem[ly * shrdLen + lx] =
+                (is_i && is_j ? src[i * s0 + j * s1] : (T)(0));
         }
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (ox<oInfo.dims[0] && oy<oInfo.dims[1]) {
-        // below conditional statement is based on MACRO value passed while kernel compilation
-        int i  = (CONV_DIM==0 ? lx : ly) + radius;
+    if (ox < oInfo.dims[0] && oy < oInfo.dims[1]) {
+        // The conditional below depends on the CONV_DIM macro, which is set
+        // at kernel compile time
+        int i         = (CONV_DIM == 0 ? lx : ly) + radius;
         accType accum = (accType)(0);
 #pragma unroll
-        for(int f=0; f<FLEN; ++f) {
+        for (int f = 0; f < FLEN; ++f) {
             accType f_val = impulse[f];
-            // below conditional statement is based on MACRO value passed while kernel compilation
-            int s_idx = (CONV_DIM==0 ? (ly*shrdLen+(i-f)) : ((i-f)*shrdLen+lx));
-            T s_val = localMem[s_idx];
-            accum   = accum + ((accType)s_val*(accType)f_val);
+            // The conditional below depends on the CONV_DIM macro, which is
+            // set at kernel compile time
+            int s_idx = (CONV_DIM == 0 ? (ly * shrdLen + (i - f))
+                                       : ((i - f) * shrdLen + lx));
+            T s_val   = localMem[s_idx];
+
+            // binOp performs multiplication (MUL_OP) for convolution
+            accum = accum + binOp((accType)s_val, (accType)f_val);
         }
-        dst[oy*oInfo.strides[1]+ox] = (T)accum;
+        dst[oy * oInfo.strides[1] + ox] = (T)accum;
     }
 }
diff --git a/src/backend/opencl/kernel/convolve_separable.cpp b/src/backend/opencl/kernel/convolve_separable.cpp
new file mode 100644
index 0000000000..83a9116d72
--- /dev/null
+++ b/src/backend/opencl/kernel/convolve_separable.cpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel_headers/convolve_separable.hpp>
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/err_common.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, typename accType>
+void convSep(Param out, const Param signal, const Param filter,
+             const int conv_dim, const bool expand) {
+    if (!(conv_dim == 0 || conv_dim == 1)) {
+        AF_ERROR(
+            "Separable convolution accepts only 0 or 1 as convolution "
+            "dimension",
+            AF_ERR_NOT_SUPPORTED);
+    }
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    const int fLen       = filter.info.dims[0] * filter.info.dims[1];
+    const size_t C0_SIZE = (THREADS_X + 2 * (fLen - 1)) * THREADS_Y;
+    const size_t C1_SIZE = (THREADS_Y + 2 * (fLen - 1)) * THREADS_X;
+    size_t locSize       = (conv_dim == 0 ? C0_SIZE : C1_SIZE);
+
+    std::array<TemplateArg, 5> tmpltArgs = {
+        TemplateTypename<T>(), TemplateTypename<accType>(),
+        TemplateArg(conv_dim), TemplateArg(expand),
+        TemplateArg(fLen),
+    };
+    std::array<std::string, 11> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(Ti, dtype_traits<T>::getName()),
+        DefineKeyValue(To, dtype_traits<accType>::getName()),
+        DefineKeyValue(accType, dtype_traits<accType>::getName()),
+        DefineKeyValue(CONV_DIM, conv_dim),
+        DefineKeyValue(EXPAND, (expand ? 1 : 0)),
+        DefineKeyValue(FLEN, fLen),
+        DefineKeyFromStr(binOpName<af_mul_t>()),
+        DefineKeyValue(IS_CPLX, (IsComplex ? 1 : 0)),
+        DefineKeyValue(LOCAL_MEM_SIZE, locSize),
+        getTypeBuildDefinition<T>()};
+
+    auto conv =
+        common::getKernel("convolve", {{ops_cl_src, convolve_separable_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(out.info.dims[0], THREADS_X);
+    int blk_y = divup(out.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * signal.info.dims[2] * THREADS_X,
+                       blk_y * signal.info.dims[3] * THREADS_Y);
+
+    cl::Buffer *mBuff = bufferAlloc(fLen * sizeof(accType));
+    // FIXME: if the filter array is strided, a direct copy might cause issues
+    getQueue().enqueueCopyBuffer(*filter.data, *mBuff, 0, 0,
+                                 fLen * sizeof(accType));
+
+    conv(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+         *signal.data, signal.info, *mBuff, blk_x, blk_y);
+    bufferFree(mBuff);
+}
+
+#define INSTANTIATE(T, accT)                                             \
+    template void convSep<T, accT>(Param, const Param, const Param filt, \
+                                   const int, const bool);
+
+INSTANTIATE(cdouble, cdouble)
+INSTANTIATE(cfloat, cfloat)
+INSTANTIATE(double, double)
+INSTANTIATE(float, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(int, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(char, float)
+INSTANTIATE(ushort, float)
+INSTANTIATE(short, float)
+INSTANTIATE(uintl, float)
+INSTANTIATE(intl, float)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/convolve_separable.hpp b/src/backend/opencl/kernel/convolve_separable.hpp
index 1a9edc3c1f..2651856c92 100644
--- a/src/backend/opencl/kernel/convolve_separable.hpp
+++ b/src/backend/opencl/kernel/convolve_separable.hpp
@@ -8,97 +8,22 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/convolve_separable.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
-#include <memory.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
 
-namespace opencl
-{
-
-namespace kernel
-{
+#include <Param.hpp>
 
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 // below shared MAX_*_LEN's are calculated based on
 // a maximum shared memory configuration of 48KB per block
 // considering complex types as well
-static const int MAX_SCONV_FILTER_LEN = 31;
-
-template<typename T, typename accType, int conv_dim, bool expand, int fLen>
-void convolve2(Param out, const Param signal, const Param filter)
-{
-    try {
-        static std::once_flag  compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>   convProgs;
-        static std::map<int, Kernel*>  convKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                const size_t C0_SIZE  = (THREADS_X+2*(fLen-1))* THREADS_Y;
-                const size_t C1_SIZE  = (THREADS_Y+2*(fLen-1))* THREADS_X;
-
-                size_t locSize = (conv_dim==0 ? C0_SIZE : C1_SIZE);
-
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D accType="<< dtype_traits<accType>::getName()
-                            << " -D CONV_DIM="<< conv_dim
-                            << " -D EXPAND="<< expand
-                            << " -D FLEN="<< fLen
-                            << " -D LOCAL_MEM_SIZE="<<locSize;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, convolve_separable_cl, convolve_separable_cl_len, options.str());
-                    convProgs[device]   = new Program(prog);
-                    convKernels[device] = new Kernel(*convProgs[device], "convolve");
-                });
-
-        auto convOp = make_kernel<Buffer, KParam, Buffer, KParam, Buffer,
-                                  int, int>(*convKernels[device]);
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(out.info.dims[0], THREADS_X);
-        int blk_y = divup(out.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x*signal.info.dims[2]*THREADS_X,
-                       blk_y*signal.info.dims[3]*THREADS_Y);
-
-        cl::Buffer *mBuff = bufferAlloc(fLen*sizeof(accType));
-        // FIX ME: if the filter array is strided, direct might cause issues
-        getQueue().enqueueCopyBuffer(*filter.data, *mBuff, 0, 0, fLen*sizeof(accType));
-
-        convOp(EnqueueArgs(getQueue(), global, local),
-               *out.data, out.info, *signal.data, signal.info, *mBuff, blk_x, blk_y);
-
-        bufferFree(mBuff);
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+constexpr int MAX_SCONV_FILTER_LEN = 31;
 
-}
+template<typename T, typename accT>
+void convSep(Param out, const Param sig, const Param filt, const int cDim,
+             const bool expand);
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/coo2dense.cl b/src/backend/opencl/kernel/coo2dense.cl
new file mode 100644
index 0000000000..85afbfcd4b
--- /dev/null
+++ b/src/backend/opencl/kernel/coo2dense.cl
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void coo2Dense(global T *oPtr, const KParam output, global const T *vPtr,
+                      const KParam values, global const int *rPtr,
+                      const KParam rowIdx, global const int *cPtr,
+                      const KParam colIdx) {
+    const int dimSize = get_local_size(0);
+
+    for (int i = get_local_id(0); i < reps * dimSize; i += dimSize) {
+        const int id = i + get_group_id(0) * dimSize * reps;
+        if (id >= values.dims[0]) return;
+
+        T v   = vPtr[id + values.offset];
+        int r = rPtr[id + rowIdx.offset];
+        int c = cPtr[id + colIdx.offset];
+
+        int offset = r + c * output.strides[1];
+
+        oPtr[offset] = v;
+    }
+}
diff --git a/src/backend/opencl/kernel/copy.cl b/src/backend/opencl/kernel/copy.cl
index 84f77e0b61..8cbe2cbf93 100644
--- a/src/backend/opencl/kernel/copy.cl
+++ b/src/backend/opencl/kernel/copy.cl
@@ -8,17 +8,14 @@
  ********************************************************/
 
 typedef struct {
-    dim_t dim[4];
-} dims_t;
+    int dims[4];
+} dims_type;
 
-inType scale(inType value, float factor)
-{
-#ifdef inType_float2
-    return (inType)(value.s0*factor, value.s1*factor);
+#ifdef FACTOR
+#define SCALE(value, factor) (value * factor)
 #else
-    return (inType)(value*factor);
+#define SCALE(value, factor) (value)
 #endif
-}
 
 #if defined(outType_double2)
 
@@ -48,44 +45,185 @@ inType scale(inType value, float factor)
 
 #endif
 
-__kernel
-void copy(__global outType * dst,
-          KParam oInfo,
-          __global const inType * src,
-          KParam iInfo,
-          outType default_value,
-          float factor, dims_t trgt,
-          int blk_x, int blk_y)
-{
-    uint lx = get_local_id(0);
-    uint ly = get_local_id(1);
-
-    uint gz = get_group_id(0) / blk_x;
-    uint gw = get_group_id(1) / blk_y;
-    uint blockIdx_x = get_group_id(0) - (blk_x) * gz;
-    uint blockIdx_y = get_group_id(1) - (blk_y) * gw;
-    uint gx = blockIdx_x * get_local_size(0) + lx;
-    uint gy = blockIdx_y * get_local_size(1) + ly;
-
-    __global const inType *in = src + (gw * iInfo.strides[3] + gz * iInfo.strides[2] + gy * iInfo.strides[1] + iInfo.offset);
-    __global outType *out     = dst + (gw * oInfo.strides[3] + gz * oInfo.strides[2] + gy * oInfo.strides[1] + oInfo.offset);
-
-    uint istride0 = iInfo.strides[0];
-    uint ostride0 = oInfo.strides[0];
-
-    if (gy < oInfo.dims[1] && gz < oInfo.dims[2] && gw < oInfo.dims[3]) {
-        int loop_offset = get_local_size(0) * blk_x;
-        bool cond = gy < trgt.dim[1] && gz < trgt.dim[2] && gw < trgt.dim[3];
-        for(int rep=gx; rep<oInfo.dims[0]; rep+=loop_offset) {
-            outType temp  = default_value;
-#if SAME_DIMS
-            temp = CONVERT(scale(in[rep*istride0], factor));
-#else
-            if (rep < trgt.dim[0] && cond) {
-                temp = CONVERT(scale(in[rep*istride0], factor));
+// scaledCopy without looping, so dims[3] has to be 1.
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] >= dims[1]
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+kernel void scaledCopy(global outType *out, const dims_type odims,
+                       const dims_type ostrides, const int ooffset,
+                       global const inType *in, const dims_type idims,
+                       const dims_type istrides, const int ioffset,
+                       const outType default_value, const factorType factor) {
+    const int g0 = get_global_id(0);
+    const int g1 = get_global_id(1);
+    if ((g0 < (int)odims.dims[0]) & (g1 < (int)odims.dims[1])) {
+        const int g2 = get_global_id(2);
+
+        int idx_in = g0 * (int)istrides.dims[0] + g1 * (int)istrides.dims[1] +
+                     g2 * (int)istrides.dims[2] + ioffset;
+        int idx_out = g0 * (int)ostrides.dims[0] + g1 * (int)ostrides.dims[1] +
+                      g2 * (int)ostrides.dims[2] + ooffset;
+
+        if (SAME_DIMS | ((g0 < (int)idims.dims[0]) & (g1 < (int)idims.dims[1]) &
+                         (g2 < (int)idims.dims[2]))) {
+            out[idx_out] = CONVERT(SCALE(in[idx_in], factor));
+        } else {
+            out[idx_out] = default_value;
+        }
+    }
+}
+
+// scaledCopy with looping over dims[0] -- VECTOR ONLY
+// Conditions:
+//      global dims[0] has no restrictions
+//      only dims[1] == 1 will be processed!!
+//      only dims[2] == 1 will be processed!!
+//      only dims[3] == 1 will be processed!!
+kernel void scaledCopyLoop0(global outType *out, const dims_type odims,
+                            const dims_type ostrides, const int ooffset,
+                            global const inType *in, const dims_type idims,
+                            const dims_type istrides, const int ioffset,
+                            const outType default_value,
+                            const factorType factor) {
+    int id0              = get_global_id(0);
+    const int id0End_out = odims.dims[0];
+    if (id0 < id0End_out) {
+        const int ostrides0     = ostrides.dims[0];
+        const int id0Inc        = get_global_size(0);
+        int idx_out             = id0 * ostrides0 + ooffset;
+        const int idxID0Inc_out = id0Inc * ostrides0;
+        const int id0End_in     = idims.dims[0];
+        const int istrides0     = istrides.dims[0];
+        int idx_in              = id0 * istrides0 + ioffset;
+        const int idxID0Inc_in  = id0Inc * istrides0;
+
+        while (id0 < id0End_in) {
+            // inside input array, so convert
+            out[idx_out] = CONVERT(SCALE(in[idx_in], factor));
+            id0 += id0Inc;
+            idx_in += idxID0Inc_in;
+            idx_out += idxID0Inc_out;
+        }
+        if (!SAME_DIMS) {
+            while (id0 < id0End_out) {
+                // outside the input array, so copy default value
+                out[idx_out] = default_value;
+                id0 += id0Inc;
+                idx_out += idxID0Inc_out;
             }
-#endif
-            out[rep*ostride0] = temp;
         }
     }
 }
+
+// scaledCopy with looping over dims[1]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+kernel void scaledCopyLoop1(global outType *out, const dims_type odims,
+                            const dims_type ostrides, const int ooffset,
+                            global const inType *in, const dims_type idims,
+                            const dims_type istrides, const int ioffset,
+                            const outType default_value,
+                            const factorType factor) {
+    const int id0        = get_global_id(0);
+    int id1              = get_global_id(1);
+    const int id1End_out = odims.dims[1];
+    if ((id0 < (int)odims.dims[0]) & (id1 < id1End_out)) {
+        const int id2       = get_global_id(2);
+        const int ostrides1 = ostrides.dims[1];
+        const int id1Inc    = get_global_size(1);
+        int idx_out         = id0 * (int)ostrides.dims[0] + id1 * ostrides1 +
+                      id2 * (int)ostrides.dims[2] + ooffset;
+        const int idxID1Inc_out = id1Inc * ostrides1;
+        const int id1End_in     = idims.dims[1];
+        const int istrides1     = istrides.dims[1];
+        int idx_in = id0 * (int)istrides.dims[0] + id1 * istrides1 +
+                     id2 * (int)istrides.dims[2] + ioffset;
+        const int idxID1Inc_in = id1Inc * istrides1;
+
+        if (SAME_DIMS | ((id0 < idims.dims[0]) & (id2 < idims.dims[2]))) {
+            while (id1 < id1End_in) {
+                // inside input array, so convert
+                out[idx_out] = CONVERT(SCALE(in[idx_in], factor));
+                id1 += id1Inc;
+                idx_in += idxID1Inc_in;
+                idx_out += idxID1Inc_out;
+            }
+        }
+        if (!SAME_DIMS) {
+            while (id1 < id1End_out) {
+                // outside the input array, so copy default value
+                out[idx_out] = default_value;
+                id1 += id1Inc;
+                idx_out += idxID1Inc_out;
+            }
+        }
+    }
+}
+
+// scaledCopy with looping over dims[1] and dims[3]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] == dims[2]
+kernel void scaledCopyLoop13(global outType *out, const dims_type odims,
+                             const dims_type ostrides, const int ooffset,
+                             global const inType *in, const dims_type idims,
+                             const dims_type istrides, const int ioffset,
+                             const outType default_value,
+                             const factorType factor) {
+    const int id0        = get_global_id(0);
+    int id1              = get_global_id(1);
+    const int id1End_out = odims.dims[1];
+    if ((id0 < (int)odims.dims[0]) & (id1 < id1End_out)) {
+        const int id2               = get_global_id(2);
+        const int id1Inc            = get_global_size(1);
+        const int ostrides1         = ostrides.dims[1];
+        const int idxIncID3_out     = ostrides.dims[3];
+        const int idxBaseIncID1_out = id1Inc * ostrides1;
+        int idxBase_out             = id0 * ostrides.dims[0] + id1 * ostrides1 +
+                          id2 * ostrides.dims[2] + ooffset;
+        int idxEndID3_out = odims.dims[3] * idxIncID3_out + idxBase_out;
+
+        const int id0End_in        = idims.dims[0];
+        const int id1End_in        = idims.dims[1];
+        const int id2End_in        = idims.dims[2];
+        const int istrides1        = istrides.dims[1];
+        const int idxIncID3_in     = istrides.dims[3];
+        const int idxBaseIncID1_in = id1Inc * istrides1;
+        int idxBase_in             = id0 * istrides.dims[0] + id1 * istrides1 +
+                         id2 * istrides.dims[2] + ioffset;
+        int idxEndID3_in = idims.dims[3] * idxIncID3_in + idxBase_in;
+
+        do {
+            int idx_in  = idxBase_in;
+            int idx_out = idxBase_out;
+            if (SAME_DIMS |
+                ((id0 < id0End_in) & (id1 < id1End_in) & (id2 < id2End_in))) {
+                // inside input array, so convert
+                do {
+                    out[idx_out] = CONVERT(SCALE(in[idx_in], factor));
+                    idx_in += idxIncID3_in;
+                    idx_out += idxIncID3_out;
+                } while (idx_in != idxEndID3_in);
+            }
+            if (!SAME_DIMS) {
+                while (idx_out != idxEndID3_out) {
+                    // outside the input array, so copy default value
+                    out[idx_out] = default_value;
+                    idx_out += idxIncID3_out;
+                }
+            }
+            id1 += id1Inc;
+            if (id1 >= id1End_out) break;
+            idxBase_in += idxBaseIncID1_in;
+            idxEndID3_in += idxBaseIncID1_in;
+            idxBase_out += idxBaseIncID1_out;
+            idxEndID3_out += idxBaseIncID1_out;
+        } while (true);
+    }
+}
\ No newline at end of file
diff --git a/src/backend/opencl/kernel/cscmm.cl b/src/backend/opencl/kernel/cscmm.cl
new file mode 100644
index 0000000000..4dd7a47514
--- /dev/null
+++ b/src/backend/opencl/kernel/cscmm.cl
@@ -0,0 +1,149 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_CPLX
+T __cmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+T __ccmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x + lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y - lhs.y * rhs.x;
+    return out;
+}
+
+#define MUL(a, b) __cmul(a, b)
+
+#if IS_CONJ
+#define CMUL(a, b) __ccmul(a, b)
+#else
+#define CMUL(a, b) __cmul(a, b)
+#endif
+
+#else
+#define MUL(a, b) ((a) * (b))
+#define CMUL(a, b) ((a) * (b))
+#endif
+
+int binary_search(global const int *ptr, int len, int val) {
+    int start = 0;
+    int end   = len;
+    while (end > start) {
+        int mid = start + (end - start) / 2;
+        if (val < ptr[mid]) {
+            end = mid;
+        } else if (val > ptr[mid]) {
+            start = mid + 1;
+        } else {
+            return mid;
+        }
+    }
+    return start;
+}
+
+// Each group computes an output of size ROWS_PER_GROUP x COLS_PER_GROUP.
+// Each thread in a group maintains partial outputs of size ROWS_PER_GROUP x
+// COLS_PER_GROUP. The outputs from each thread are added up to generate the
+// final result.
+kernel void cscmm_nn(
+    global T *output, global const T *values,
+    global const int *colidx,  // rowidx from csr is colidx in csc
+    global const int *rowidx,  // colidx from csr is rowidx in csc
+    const int M,                 // K from csr is M in csc
+    const int K,                 // M from csr is K in csc
+    const int N,                 // N is number of columns in dense matrix
+    global const T *rhs, const KParam rinfo, const T alpha, const T beta) {
+    int lid = get_local_id(0);
+
+    // Get the row offset for the current group in the uncompressed matrix
+    int rowOff = get_group_id(0) * ROWS_PER_GROUP;
+    int colOff = get_group_id(1) * COLS_PER_GROUP;
+
+    // Ensure you are not going out of bounds
+    int rowLim = min(ROWS_PER_GROUP, M - rowOff);
+    int colLim = min(COLS_PER_GROUP, N - colOff);
+
+    rhs += colOff * rinfo.strides[1] + rinfo.offset;
+    output += colOff * M;
+
+    // Initialize partial output to 0
+    T l_outvals[COLS_PER_GROUP][ROWS_PER_GROUP];
+    for (int j = 0; j < colLim; j++) {
+        for (int i = 0; i < rowLim; i++) { l_outvals[j][i] = 0; }
+    }
+
+    // Dot requires you to traverse the entire inner dimension
+    for (int colId = lid; colId < K; colId += THREADS) {
+        int rowStart     = colidx[colId];
+        int rowEnd       = colidx[colId + 1];
+        int nonZeroCount = rowEnd - rowStart;
+
+        // Find the position of the first nonzero element at or after rowOff
+        int rowPos = binary_search(rowidx + rowStart, nonZeroCount, rowOff);
+
+        // Read the rhs values from all the columns as they can be reused for
+        // all rows
+        T rhsvals[COLS_PER_GROUP];
+        for (int j = 0; j < colLim; j++) {
+            rhsvals[j] = rhs[colId + j * rinfo.strides[1]];
+        }
+
+        // Traversing through nonzero elements in the current chunk
+        for (int id = rowPos + rowStart; id < rowEnd; id++) {
+            int rowId = rowidx[id];
+
+            // Exit if going past current chunk
+            if (rowId >= rowOff + ROWS_PER_GROUP) break;
+
+            // Read the row value and multiply with all columns
+            T lhsval = values[id];
+            for (int j = 0; j < colLim; j++) {
+                l_outvals[j][rowId - rowOff] += CMUL(lhsval, rhsvals[j]);
+            }
+        }
+    }
+
+    local T s_outvals[THREADS];
+
+    // For each row and col of output, copy registers to local memory, add
+    // results, write to output.
+    for (int j = 0; j < colLim; j++) {
+        for (int i = 0; i < rowLim; i++) {
+            // Copying to local memory
+            s_outvals[lid] = l_outvals[j][i];
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            // Adding the results through reduction
+            for (int n = THREADS / 2; n > 0; n /= 2) {
+                if (lid < n) s_outvals[lid] += s_outvals[lid + n];
+                barrier(CLK_LOCAL_MEM_FENCE);
+            }
+
+            // Writing to output
+            if (lid == 0) {
+                T outval = s_outvals[0];
+
+#if USE_ALPHA
+                outval = MUL(alpha, outval);
+#endif
+
+#if USE_BETA
+                output[j * M + rowOff + i] =
+                    outval + MUL(beta, output[j * M + rowOff + i]);
+#else
+                output[j * M + rowOff + i] = outval;
+#endif
+            }
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/cscmm.hpp b/src/backend/opencl/kernel/cscmm.hpp
new file mode 100644
index 0000000000..a668025726
--- /dev/null
+++ b/src/backend/opencl/kernel/cscmm.hpp
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel_headers/cscmm.hpp>
+#include <traits.hpp>
+#include <af/opencl.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void cscmm_nn(Param out, const Param &values, const Param &colIdx,
+              const Param &rowIdx, const Param &rhs, const T alpha,
+              const T beta, bool is_conj) {
+    constexpr int threads = 256;
+    // TODO: Find a better way to tune these parameters
+    constexpr int rows_per_group = 8;
+    constexpr int cols_per_group = 8;
+
+    const bool use_alpha = (alpha != scalar<T>(1.0));
+    const bool use_beta  = (beta != scalar<T>(0.0));
+
+    std::array<TemplateArg, 7> targs = {
+        TemplateTypename<T>(),       TemplateArg(use_alpha),
+        TemplateArg(use_beta),       TemplateArg(is_conj),
+        TemplateArg(rows_per_group), TemplateArg(cols_per_group),
+        TemplateArg(threads),
+    };
+    std::array<std::string, 9> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(USE_ALPHA, use_alpha),
+        DefineKeyValue(USE_BETA, use_beta),
+        DefineKeyValue(IS_CONJ, is_conj),
+        DefineKeyValue(THREADS, threads),
+        DefineKeyValue(ROWS_PER_GROUP, rows_per_group),
+        DefineKeyValue(COLS_PER_GROUP, cols_per_group),
+        DefineKeyValue(IS_CPLX, (iscplx<T>() ? 1 : 0)),
+        getTypeBuildDefinition<T>()};
+
+    auto cscmmNN =
+        common::getKernel("cscmm_nn", {{cscmm_cl_src}}, targs, options);
+
+    cl::NDRange local(threads, 1);
+    int M = out.info.dims[0];
+    int N = out.info.dims[1];
+    int K = colIdx.info.dims[0] - 1;
+
+    int groups_x = divup(M, rows_per_group);
+    int groups_y = divup(N, cols_per_group);
+    cl::NDRange global(local[0] * groups_x, local[1] * groups_y);
+
+    cscmmNN(cl::EnqueueArgs(getQueue(), global, local), *out.data, *values.data,
+            *colIdx.data, *rowIdx.data, M, K, N, *rhs.data, rhs.info, alpha,
+            beta);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/cscmv.cl b/src/backend/opencl/kernel/cscmv.cl
new file mode 100644
index 0000000000..bc56f57e46
--- /dev/null
+++ b/src/backend/opencl/kernel/cscmv.cl
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_DBL || IS_LONG
+#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
+#endif
+
+#if IS_CPLX
+T __cmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+T __ccmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x + lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y - lhs.y * rhs.x;
+    return out;
+}
+
+#define MUL(a, b) __cmul(a, b)
+
+#if IS_CONJ
+#define CMUL(a, b) __ccmul(a, b)
+#else
+#define CMUL(a, b) __cmul(a, b)
+#endif
+
+#else
+#define MUL(a, b) ((a) * (b))
+#define CMUL(a, b) ((a) * (b))
+#endif
+
+#if IS_DBL || IS_LONG
+#define U ulong
+#define ATOMIC_FN atom_cmpxchg
+#else
+#define U unsigned
+#define ATOMIC_FN atomic_cmpxchg
+#endif
+
+#if IS_CPLX
+inline void atomicAdd(volatile __global T *ptr, T val) {
+    union {
+        U u[2];
+        T t;
+    } next, expected, current;
+    current.t = *ptr;
+
+    do {
+        expected.t.x = current.t.x;
+        next.t.x = expected.t.x + val.x;
+        current.u[0] = ATOMIC_FN((volatile __global U *) ptr, expected.u[0], next.u[0]);
+    } while(current.u[0] != expected.u[0]);
+    do {
+        expected.t.y = current.t.y;
+        next.t.y = expected.t.y + val.y;
+        current.u[1] = ATOMIC_FN(((volatile __global U *) ptr) + 1, expected.u[1], next.u[1]);
+    } while(current.u[1] != expected.u[1]);
+}
+#else
+inline void atomicAdd(volatile __global T *ptr, T val) {
+    union {
+        U u;
+        T t;
+    } next, expected, current;
+    current.t = *ptr;
+
+    do {
+        expected.t = current.t;
+        next.t = expected.t + val;
+        current.u = ATOMIC_FN((volatile __global U *) ptr, expected.u, next.u);
+    } while(current.u != expected.u);
+}
+#endif
+
+kernel void cscmv_beta(global T *output, const int M, const T beta) {
+    for (unsigned j = get_global_id(0); j < M; j += THREADS * get_num_groups(0))
+        output[j] *= beta;
+}
+
+kernel void cscmv_atomic(
+    global T *output, global T *values,
+    global int *colidx,  // rowidx from csr is colidx in csc
+    global int *rowidx,  // colidx from csr is rowidx in csc
+    const int K,                 // M from csr is K in csc
+    global const T *rhs, const KParam rinfo, const T alpha) {
+
+    rhs += rinfo.offset;
+
+    for(unsigned j = get_group_id(0); j < K; j += get_num_groups(0)) {
+        for(unsigned i = get_local_id(0) + colidx[j]; i < colidx[j + 1]; i += THREADS) {
+            T outval = CMUL(values[i], rhs[j]);
+#if USE_ALPHA
+            outval = MUL(alpha, outval);
+#endif
+            atomicAdd(output + rowidx[i], outval);
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/cscmv.hpp b/src/backend/opencl/kernel/cscmv.hpp
new file mode 100644
index 0000000000..2ab88b202c
--- /dev/null
+++ b/src/backend/opencl/kernel/cscmv.hpp
@@ -0,0 +1,97 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel_headers/cscmv.hpp>
+#include <traits.hpp>
+#include <af/opencl.h>
+
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void cscmv(Param out, const Param &values, const Param &colIdx,
+           const Param &rowIdx, const Param &rhs, const T alpha, const T beta,
+           bool is_conj) {
+    // TODO: rows_per_group limited by register pressure. Find better way to
+    // handle this.
+    constexpr int threads_per_g = 64;
+    constexpr int rows_per_group = 64;
+
+    const bool use_alpha = (alpha != scalar<T>(1.0));
+    const bool use_beta  = (beta != scalar<T>(0.0));
+
+    cl::NDRange local(threads_per_g);
+
+    int K        = colIdx.info.dims[0] - 1;
+    int M        = out.info.dims[0];
+
+    std::array<TemplateArg, 5> targs = {
+        TemplateTypename<T>(),       TemplateArg(use_alpha),
+        TemplateArg(is_conj), TemplateArg(rows_per_group),
+        TemplateArg(local[0]),
+    };
+    std::array<std::string, 9> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(USE_ALPHA, use_alpha),
+        DefineKeyValue(IS_CONJ, is_conj),
+        DefineKeyValue(THREADS, local[0]),
+        DefineKeyValue(ROWS_PER_GROUP, rows_per_group),
+        DefineKeyValue(IS_CPLX, (iscplx<T>() ? 1 : 0)),
+        DefineKeyValue(IS_DBL, (isdbl<T>() ? 1 : 0)),
+        DefineKeyValue(IS_LONG, (islong<T>() ? 1 : 0)),
+        getTypeBuildDefinition<T>()};
+
+    if (use_beta) {
+        std::array<TemplateArg, 4> targs_beta = {
+            TemplateTypename<T>(), TemplateArg(is_conj),
+            TemplateArg(rows_per_group), TemplateArg(local[0])};
+        std::array<std::string, 8> options_beta = {
+            DefineKeyValue(T, dtype_traits<T>::getName()),
+            DefineKeyValue(IS_CONJ, is_conj),
+            DefineKeyValue(THREADS, local[0]),
+            DefineKeyValue(ROWS_PER_GROUP, rows_per_group),
+            DefineKeyValue(IS_CPLX, (iscplx<T>() ? 1 : 0)),
+            DefineKeyValue(IS_DBL, (isdbl<T>() ? 1 : 0)),
+            DefineKeyValue(IS_LONG, (islong<T>() ? 1 : 0)),
+            getTypeBuildDefinition<T>()};
+
+        int groups_x = divup(M, rows_per_group * threads_per_g);
+        cl::NDRange global(local[0] * groups_x, 1);
+        auto cscmvBeta = common::getKernel("cscmv_beta", {{cscmv_cl_src}}, targs_beta, options_beta);
+        cscmvBeta(cl::EnqueueArgs(getQueue(), global, local), *out.data, M, beta);
+
+    } else {
+        getQueue().enqueueFillBuffer(*out.data, 0, 0, M * sizeof(T));
+    }
+
+    int groups_x = divup(M, rows_per_group);
+    cl::NDRange global(local[0] * groups_x, 1);
+
+    auto cscmvAtomic =
+        common::getKernel("cscmv_atomic", {{cscmv_cl_src}}, targs, options);
+    cscmvAtomic(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                *values.data, *colIdx.data, *rowIdx.data, K, *rhs.data,
+                rhs.info, alpha);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
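Reviewer note on cscmv.hpp: the host code above first scales (or clears) the output and then launches an accumulating kernel. A minimal sequential sketch of that two-step structure may make the intent easier to follow. It assumes real-valued data and ignores conjugation; `cscmvReference` and its parameter names are hypothetical and are not part of ArrayFire.

```cpp
#include <vector>

// Hypothetical CPU reference (not ArrayFire code):
// computes y = alpha * A * x + beta * y for an M x K matrix A in CSC form.
// colPtr has K + 1 entries; rowIdx and values have colPtr[K] entries.
std::vector<double> cscmvReference(const std::vector<double> &values,
                                   const std::vector<int> &colPtr,
                                   const std::vector<int> &rowIdx,
                                   const std::vector<double> &x,
                                   std::vector<double> y, double alpha,
                                   double beta) {
    // Step 1: scale the output by beta, or clear it when beta is zero.
    for (double &v : y) v = (beta == 0.0) ? 0.0 : beta * v;

    // Step 2: scatter each column's contribution into the output rows.
    // Different columns can hit the same row, which is why a parallel
    // version needs atomic accumulation.
    const int K = static_cast<int>(colPtr.size()) - 1;
    for (int col = 0; col < K; ++col) {
        const double xv = alpha * x[col];
        for (int p = colPtr[col]; p < colPtr[col + 1]; ++p) {
            y[rowIdx[p]] += values[p] * xv;
        }
    }
    return y;
}
```

For the 2 x 3 matrix [[1, 0, 2], [0, 3, 0]] and x = (1, 2, 3), the product A * x is (7, 6).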
diff --git a/src/backend/opencl/kernel/csr2coo.cl b/src/backend/opencl/kernel/csr2coo.cl
new file mode 100644
index 0000000000..d60766f96a
--- /dev/null
+++ b/src/backend/opencl/kernel/csr2coo.cl
@@ -0,0 +1,65 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void csr2Coo(global int *orowidx, global int *ocolidx,
+                    global const int *irowidx, global const int *icolidx,
+                    const int M) {
+    int lid = get_local_id(0);
+    for (int rowId = get_group_id(0); rowId < M; rowId += get_num_groups(0)) {
+        int colStart = irowidx[rowId];
+        int colEnd   = irowidx[rowId + 1];
+        for (int colId = colStart + lid; colId < colEnd;
+             colId += get_local_size(0)) {
+            orowidx[colId] = rowId;
+            ocolidx[colId] = icolidx[colId];
+        }
+    }
+}
+
+kernel void swapIndex(global T *ovalues, global int *oindex,
+                      global const T *ivalues, global const int *iindex,
+                      global const int *swapIdx, const int nNZ) {
+    int id = get_global_id(0);
+    if (id >= nNZ) return;
+
+    int idx = swapIdx[id];
+
+    ovalues[id] = ivalues[idx];
+    oindex[id]  = iindex[idx];
+}
+
+kernel void csrReduce(global int *orowIdx, global const int *irowIdx,
+                      const int M, const int nNZ) {
+    int id = get_global_id(0);
+
+    if (id >= nNZ) return;
+
+    // Read COO row indices
+    int iRId  = irowIdx[id];
+    int iRId1 = 0;
+    if (id > 0) iRId1 = irowIdx[id - 1];
+
+    // If id is 0, then mark the edge cases of csrRow[0] and csrRow[M]
+    if (id == 0) {
+        orowIdx[id] = 0;
+        orowIdx[M]  = nNZ;
+    } else if (iRId1 != iRId) {
+        // If iRId1 and iRId are not the same, the row index has advanced.
+        // For example, if iRId is 5 and iRId1 is 4, then row 4 has
+        // ended and row 5 begins at index id.
+        // We use the for-loop because there can be any number of empty rows
+        // between iRId1 and iRId, all of which should be marked with id
+        for (int i = iRId1 + 1; i <= iRId; i++) orowIdx[i] = id;
+    }
+
+    // The last X rows are corner cases if they don't have any values
+    if (id < M) {
+        if (id > irowIdx[nNZ - 1] && orowIdx[id] == 0) { orowIdx[id] = nNZ; }
+    }
+}
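Reviewer note on csrReduce: the kernel above turns sorted COO row indices into CSR row offsets by detecting where the row index changes. The sketch below computes the same kind of row-offset array sequentially, using a histogram followed by a prefix sum rather than the kernel's adjacent-difference approach. It assumes valid row indices in [0, M); `cooRowsToCsrRowPtr` is a hypothetical name.

```cpp
#include <vector>

// Hypothetical CPU reference (not ArrayFire code): builds CSR row offsets
// from COO row indices for a matrix with M rows.
std::vector<int> cooRowsToCsrRowPtr(const std::vector<int> &cooRows, int M) {
    std::vector<int> rowPtr(M + 1, 0);
    // Count the nonzeros in each row, shifted by one slot.
    for (int r : cooRows) rowPtr[r + 1]++;
    // An inclusive prefix sum turns counts into start offsets.
    for (int i = 0; i < M; ++i) rowPtr[i + 1] += rowPtr[i];
    return rowPtr;
}
```

Empty rows, including leading and trailing ones, fall out naturally: their start offset equals the start offset of the next row.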
diff --git a/src/backend/opencl/kernel/csr2dense.cl b/src/backend/opencl/kernel/csr2dense.cl
new file mode 100644
index 0000000000..e15ef014f3
--- /dev/null
+++ b/src/backend/opencl/kernel/csr2dense.cl
@@ -0,0 +1,24 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void csr2Dense(global T *output, global const T *values,
+                      global const int *rowidx, global const int *colidx,
+                      const int M, const int v_off, const int r_off, const int c_off) {
+    global const T *v   = values + v_off;
+    global const int *r = rowidx + r_off;
+    global const int *c = colidx + c_off;
+    int lid = get_local_id(0);
+    for (int rowId = get_group_id(0); rowId < M; rowId += get_num_groups(0)) {
+        int colStart = r[rowId];
+        int colEnd   = r[rowId + 1];
+        for (int colId = colStart + lid; colId < colEnd; colId += THREADS) {
+            output[rowId + c[colId] * M] = v[colId];
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/csrmm.cl b/src/backend/opencl/kernel/csrmm.cl
new file mode 100644
index 0000000000..750c97f8b5
--- /dev/null
+++ b/src/backend/opencl/kernel/csrmm.cl
@@ -0,0 +1,117 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_CPLX
+T __cmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+T __ccmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x + lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y - lhs.y * rhs.x;
+    return out;
+}
+
+#define MUL(a, b) __cmul(a, b)
+
+#if IS_CONJ
+#define CMUL(a, b) __ccmul(a, b)
+#else
+#define CMUL(a, b) __cmul(a, b)
+#endif
+
+#else
+#define MUL(a, b) (a) * (b)
+#define CMUL(a, b) (a) * (b)
+#endif
+
+// This kernel expects the dense matrix to be the transpose of a column-major
+// matrix (i.e. a non-transposed row-major matrix).
+
+// In this kernel, each block performs multiple "dot" operations.
+// In each outer iteration, the group performs a "dot" on (one sparse
+// row, `THREADS_PER_GROUP` dense columns). The threads in the block load the
+// sparse row into local memory and then perform individual "dot" operations.
+
+kernel void csrmm_nt(global T *output, global const T *values,
+                     global const int *rowidx, global const int *colidx,
+                     const int M, const int N, global const T *rhs,
+                     const KParam rinfo, const T alpha, const T beta,
+                     global int *counter) {
+    int gidx = get_global_id(0);
+    int lid  = get_local_id(0);
+
+    rhs += gidx + rinfo.offset;
+    output += gidx * M;
+
+    bool within_N = (gidx < N);
+
+    local T s_values[THREADS_PER_GROUP];
+    local int s_colidx[THREADS_PER_GROUP];
+
+    int rowNext = get_group_id(1);
+    local int s_rowId;
+
+    // Each iteration writes `THREADS_PER_GROUP` columns from one row of the
+    // output
+    while (true) {
+#if USE_GREEDY
+        // If the hardware has decent atomic operation support, greedily get
+        // the next available row
+        if (lid == 0) { s_rowId = atomic_inc(counter + get_group_id(0)); }
+        barrier(CLK_LOCAL_MEM_FENCE);
+        int rowId = s_rowId;
+#else
+        // Fall back to the naive distribution of rows otherwise
+        int rowId = rowNext;
+        rowNext += get_num_groups(1);
+#endif
+        if (rowId >= M) return;
+
+        // Load the nonzero column offsets for current row
+        const int colStart = rowidx[rowId];
+        const int colEnd   = rowidx[rowId + 1];
+
+        T outval = 0;
+        // Since the number of nonzero elements might be greater than local
+        // memory available, load only part of the row into local memory,
+        // perform a partial dot, and repeat until done.
+        for (int id = colStart; id < colEnd; id += THREADS_PER_GROUP) {
+            // Load the current chunk of the row into local memory
+            int lim       = min(colEnd - id, THREADS_PER_GROUP);
+            s_values[lid] = lid < lim ? values[id + lid] : 0;
+            s_colidx[lid] = lid < lim ? colidx[id + lid] : -1;
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            // Perform partial "dot" operation for each thread
+            for (int idy = 0; within_N && idy < lim; idy++) {
+                outval +=
+                    CMUL(s_values[idy], rhs[rinfo.strides[1] * s_colidx[idy]]);
+            }
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        if (within_N) {
+            // Each thread writes the output for one column in the current row
+#if USE_ALPHA
+            outval = MUL(alpha, outval);
+#endif
+
+#if USE_BETA
+            output[rowId] = outval + MUL(beta, output[rowId]);
+#else
+            output[rowId] = outval;
+#endif
+        }
+    }
+}
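Reviewer note on csrmm_nt: the kernel reads the dense operand as `rhs[gidx + strides[1] * col]` and writes `output[rowId]` after offsetting by `gidx * M`, so the operand is effectively stored transposed while the result is column-major. The sequential sketch below mirrors that indexing for real-valued data. `csrmmNtReference` and `ldRhs` (standing in for `rinfo.strides[1]`) are hypothetical names.

```cpp
#include <vector>

// Hypothetical CPU reference (not ArrayFire code):
// out (M x N, column-major) = alpha * A * D + beta * out, with A in CSR form.
// Element (k, n) of the dense operand D is read from rhs[n + ldRhs * k].
std::vector<double> csrmmNtReference(const std::vector<double> &values,
                                     const std::vector<int> &rowPtr,
                                     const std::vector<int> &colIdx,
                                     const std::vector<double> &rhs, int ldRhs,
                                     int N, std::vector<double> out,
                                     double alpha, double beta) {
    const int M = static_cast<int>(rowPtr.size()) - 1;
    for (int n = 0; n < N; ++n) {
        for (int row = 0; row < M; ++row) {
            double acc = 0.0;
            // "dot" of one sparse row with one dense column.
            for (int p = rowPtr[row]; p < rowPtr[row + 1]; ++p) {
                acc += values[p] * rhs[n + ldRhs * colIdx[p]];
            }
            out[row + n * M] = alpha * acc + beta * out[row + n * M];
        }
    }
    return out;
}
```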
diff --git a/src/backend/opencl/kernel/csrmm.hpp b/src/backend/opencl/kernel/csrmm.hpp
new file mode 100644
index 0000000000..60499bf877
--- /dev/null
+++ b/src/backend/opencl/kernel/csrmm.hpp
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel_headers/csrmm.hpp>
+#include <traits.hpp>
+#include <af/opencl.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void csrmm_nt(Param out, const Param &values, const Param &rowIdx,
+              const Param &colIdx, const Param &rhs, const T alpha,
+              const T beta) {
+    constexpr int MAX_CSRMM_GROUPS = 4096;
+    // Using greedy indexing is causing performance issues on many platforms
+    // FIXME: Figure out why
+    constexpr bool use_greedy = false;
+
+    const bool use_alpha = (alpha != scalar<T>(1.0));
+    const bool use_beta  = (beta != scalar<T>(0.0));
+
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(use_alpha),
+        TemplateArg(use_beta),
+        TemplateArg(use_greedy),
+    };
+    std::array<std::string, 7> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(USE_ALPHA, use_alpha),
+        DefineKeyValue(USE_BETA, use_beta),
+        DefineKeyValue(USE_GREEDY, use_greedy),
+        DefineValue(THREADS_PER_GROUP),
+        DefineKeyValue(IS_CPLX, (iscplx<T>() ? 1 : 0)),
+        getTypeBuildDefinition<T>()};
+
+    // FIXME: Switch to perf (thread vs block) based kernel
+    auto csrmm_nt_func =
+        common::getKernel("csrmm_nt", {{csrmm_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_PER_GROUP, 1);
+    int M = rowIdx.info.dims[0] - 1;
+    int N = rhs.info.dims[0];
+
+    int groups_x = divup(N, local[0]);
+    int groups_y = divup(M, REPEAT);
+    groups_y     = std::min(groups_y, MAX_CSRMM_GROUPS);
+    cl::NDRange global(local[0] * groups_x, local[1] * groups_y);
+
+    cl::Buffer *counter = bufferAlloc(groups_x * sizeof(int));
+    getQueue().enqueueFillBuffer(*counter, 0, 0, groups_x * sizeof(int));
+
+    csrmm_nt_func(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                  *values.data, *rowIdx.data, *colIdx.data, M, N, *rhs.data,
+                  rhs.info, alpha, beta, *counter);
+    bufferFree(counter);
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/csrmv.cl b/src/backend/opencl/kernel/csrmv.cl
new file mode 100644
index 0000000000..4ac7e04881
--- /dev/null
+++ b/src/backend/opencl/kernel/csrmv.cl
@@ -0,0 +1,168 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_CPLX
+T __cmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x - lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y + lhs.y * rhs.x;
+    return out;
+}
+
+T __ccmul(T lhs, T rhs) {
+    T out;
+    out.x = lhs.x * rhs.x + lhs.y * rhs.y;
+    out.y = lhs.x * rhs.y - lhs.y * rhs.x;
+    return out;
+}
+
+#define MUL(a, b) __cmul(a, b)
+
+#if IS_CONJ
+#define CMUL(a, b) __ccmul(a, b)
+#else
+#define CMUL(a, b) __cmul(a, b)
+#endif
+
+#else
+#define MUL(a, b) (a) * (b)
+#define CMUL(a, b) (a) * (b)
+#endif
+
+// In this kernel, each thread performs one "dot" operation by reading nonzero
+// elements from one row and multiplying with the corresponding elements from
+// the dense vector to produce a single output value. This kernel should be used
+// when the number of nonzero elements per row is fairly small
+kernel void csrmv_thread(global T *output, global const T *values,
+                         global const int *rowidx,
+                         global const int *colidx, const int M,
+                         global const T *rhs, const KParam rinfo,
+                         const T alpha, const T beta
+#if USE_GREEDY
+                         , global int *counter
+#endif
+                         ) {
+    rhs += rinfo.offset;
+    int rowNext = get_global_id(0);
+
+    while (true) {
+        // Each thread performs multiple "dot" operations
+#if USE_GREEDY
+        // Since the number of nonzero elements per row can be uneven, a
+        // greedy approach may be useful. This is achieved by getting the
+        // next available row to perform the "dot" operation on.
+        int rowId = atomic_inc(counter);
+#else
+        // Unfortunately atomic operations are costly on some architectures.
+        // The fallback is to use same number of rows on all threads.
+        int rowId = rowNext;
+        rowNext += get_global_size(0);
+#endif
+        if (rowId >= M) return;
+
+        // Find the column offsets for the current row
+        int colStart = rowidx[rowId];
+        int colEnd   = rowidx[rowId + 1];
+
+        T outval = 0;
+        // Performing the "dot" operation
+        for (int id = colStart; id < colEnd; id++) {
+            int cid = colidx[id];
+            outval += CMUL(values[id], rhs[cid]);
+        }
+
+        // Writing out a single output
+#if USE_ALPHA
+        outval = MUL(alpha, outval);
+#endif
+
+#if USE_BETA
+        output[rowId] = outval + MUL(beta, output[rowId]);
+#else
+        output[rowId] = outval;
+#endif
+    }
+}
+
+// In this kernel, each block performs one "dot" operation by having each thread
+// read nonzero elements from a row and multiply them with the corresponding
+// elements from the dense vector to produce a local output value. The block
+// then performs a reduction to produce a single output value. This kernel
+// should be used when the number of nonzero elements per row is large
+kernel void csrmv_block(global T *output, global const T *values,
+                        global const int *rowidx,
+                        global const int *colidx, const int M,
+                        global const T *rhs, const KParam rinfo,
+                        const T alpha, const T beta
+#if USE_GREEDY
+                        , global int *counter
+#endif
+                        ) {
+    rhs += rinfo.offset;
+    int lid     = get_local_id(0);
+    int rowNext = get_group_id(0);
+    local int s_rowId;
+
+    // Each thread stores part of the output result
+    local T s_outval[THREADS];
+
+    // Each group performs multiple "dot" operations
+    while (true) {
+#if USE_GREEDY
+        // Since the number of nonzero elements per row can be uneven, a
+        // greedy approach may be useful. This is achieved by getting the
+        // next available row to perform the "dot" operation on. Since the
+        // rowId needs to be the same across the block, only one thread needs
+        // to increment the counter.
+        if (lid == 0) { s_rowId = atomic_inc(counter); }
+        barrier(CLK_LOCAL_MEM_FENCE);
+        int rowId = s_rowId;
+#else
+        // Unfortunately atomic operations are costly on some architectures.
+        // The fallback is to use same number of rows on all blocks.
+        int rowId = rowNext;
+        rowNext += get_num_groups(0);
+#endif
+        if (rowId >= M) return;
+
+        int colStart = rowidx[rowId];
+        int colEnd   = rowidx[rowId + 1];
+        T outval     = 0;
+
+        // Each thread performs "dot" on num_nonzero_elements / THREADS for a
+        // given row
+        for (int id = colStart + lid; id < colEnd; id += THREADS) {
+            int cid = colidx[id];
+            outval += CMUL(values[id], rhs[cid]);
+        }
+        s_outval[lid] = outval;
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Perform a block reduce operation to get the single output value
+        for (int n = THREADS / 2; n > 0; n /= 2) {
+            if (lid < n) s_outval[lid] += s_outval[lid + n];
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        // A single thread writes the output value
+        if (lid == 0) {
+#if USE_ALPHA
+            outval = MUL(alpha, s_outval[0]);
+#else
+            outval        = s_outval[0];
+#endif
+
+#if USE_BETA
+            output[rowId] = outval + MUL(beta, output[rowId]);
+#else
+            output[rowId] = outval;
+#endif
+        }
+    }
+}
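Reviewer note on csrmv_block: each work-item accumulates a strided partial sum over the row, and the group then combines the partials with a tree reduction. The sketch below emulates that pattern sequentially for real-valued data. It assumes `threads` is a power of two, as the halving loop in the kernel requires; `rowDotBlockStyle` and `csrmvReference` are hypothetical names.

```cpp
#include <vector>

// Hypothetical CPU emulation (not ArrayFire code) of one work-group
// computing the "dot" of a sparse row with the dense vector x.
double rowDotBlockStyle(const std::vector<double> &values,
                        const std::vector<int> &colIdx,
                        const std::vector<double> &x, int colStart, int colEnd,
                        int threads) {
    std::vector<double> partial(threads, 0.0);
    // Each lane handles elements colStart + lid, colStart + lid + threads, ...
    for (int lid = 0; lid < threads; ++lid) {
        for (int id = colStart + lid; id < colEnd; id += threads) {
            partial[lid] += values[id] * x[colIdx[id]];
        }
    }
    // Tree reduction: halve the number of active lanes at every step.
    for (int n = threads / 2; n > 0; n /= 2) {
        for (int lid = 0; lid < n; ++lid) partial[lid] += partial[lid + n];
    }
    return partial[0];
}

// y = alpha * A * x + beta * y, with A in CSR form.
std::vector<double> csrmvReference(const std::vector<double> &values,
                                   const std::vector<int> &rowPtr,
                                   const std::vector<int> &colIdx,
                                   const std::vector<double> &x,
                                   std::vector<double> y, double alpha,
                                   double beta, int threads) {
    const int M = static_cast<int>(rowPtr.size()) - 1;
    for (int row = 0; row < M; ++row) {
        const double dot = rowDotBlockStyle(values, colIdx, x, rowPtr[row],
                                            rowPtr[row + 1], threads);
        y[row] = alpha * dot + beta * y[row];
    }
    return y;
}
```

With one lane (`threads` equal to 1) the helper degenerates to the per-row loop used by csrmv_thread.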
diff --git a/src/backend/opencl/kernel/csrmv.hpp b/src/backend/opencl/kernel/csrmv.hpp
new file mode 100644
index 0000000000..ca39ae4d32
--- /dev/null
+++ b/src/backend/opencl/kernel/csrmv.hpp
@@ -0,0 +1,90 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel_headers/csrmv.hpp>
+#include <traits.hpp>
+#include <af/opencl.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void csrmv(Param out, const Param &values, const Param &rowIdx,
+           const Param &colIdx, const Param &rhs, const T alpha, const T beta) {
+    constexpr int MAX_CSRMV_GROUPS = 4096;
+    // Using greedy indexing is causing performance issues on many platforms
+    // FIXME: Figure out why
+    constexpr bool use_greedy = false;
+
+    // TODO: Figure out the proper way to choose either csrmv_thread or
+    // csrmv_block
+    bool is_csrmv_block = true;
+
+    const bool use_alpha = (alpha != scalar<T>(1.0));
+    const bool use_beta  = (beta != scalar<T>(0.0));
+
+    cl::NDRange local(THREADS_PER_GROUP);
+
+    std::array<TemplateArg, 5> targs = {
+        TemplateTypename<T>(),   TemplateArg(use_alpha), TemplateArg(use_beta),
+        TemplateArg(use_greedy), TemplateArg(local[0]),
+    };
+    std::array<std::string, 7> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(USE_ALPHA, use_alpha),
+        DefineKeyValue(USE_BETA, use_beta),
+        DefineKeyValue(USE_GREEDY, use_greedy),
+        DefineKeyValue(THREADS, local[0]),
+        DefineKeyValue(IS_CPLX, (iscplx<T>() ? 1 : 0)),
+        getTypeBuildDefinition<T>()};
+
+    auto csrmv =
+        (is_csrmv_block ? common::getKernel("csrmv_block", {{csrmv_cl_src}},
+                                            targs, options)
+                        : common::getKernel("csrmv_thread", {{csrmv_cl_src}},
+                                            targs, options));
+
+    int M = rowIdx.info.dims[0] - 1;
+
+    int groups_x =
+        is_csrmv_block ? divup(M, REPEAT) : divup(M, REPEAT * local[0]);
+    groups_x = std::min(groups_x, MAX_CSRMV_GROUPS);
+    cl::NDRange global(local[0] * groups_x, 1);
+
+    if (use_greedy) {
+        cl::Buffer *counter = bufferAlloc(sizeof(int));
+        getQueue().enqueueFillBuffer(*counter, 0, 0, sizeof(int));
+        csrmv(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+              *values.data, *rowIdx.data, *colIdx.data, M, *rhs.data, rhs.info,
+              alpha, beta, *counter);
+        CL_DEBUG_FINISH(getQueue());
+        bufferFree(counter);
+    } else {
+        csrmv(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+              *values.data, *rowIdx.data, *colIdx.data, M, *rhs.data, rhs.info,
+              alpha, beta);
+        CL_DEBUG_FINISH(getQueue());
+    }
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/dense2csr.cl b/src/backend/opencl/kernel/dense2csr.cl
new file mode 100644
index 0000000000..7f10d2e022
--- /dev/null
+++ b/src/backend/opencl/kernel/dense2csr.cl
@@ -0,0 +1,40 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_CPLX
+#define IS_ZERO(val) ((val.x == 0) && (val.y == 0))
+#else
+#define IS_ZERO(val) (val == 0)
+#endif
+
+kernel void dense2Csr(global T *svalptr, global int *scolptr,
+                      global const T *dvalptr, const KParam valinfo,
+                      global const int *dcolptr, const KParam colinfo,
+                      global const int *rowptr) {
+    int gidx = get_global_id(0);
+    int gidy = get_global_id(1);
+
+    if (gidx >= valinfo.dims[0]) return;
+    if (gidy >= valinfo.dims[1]) return;
+
+    int rowoff = rowptr[gidx];
+    svalptr += rowoff;
+    scolptr += rowoff;
+
+    dvalptr += valinfo.offset;
+    dcolptr += colinfo.offset;
+
+    int idx = gidx + gidy * valinfo.strides[1];
+    T val   = dvalptr[idx];
+    if (IS_ZERO(val)) return;
+
+    int oloc          = dcolptr[gidx + gidy * colinfo.strides[1]];
+    svalptr[oloc - 1] = val;
+    scolptr[oloc - 1] = gidy;
+}
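Reviewer note on dense2Csr: the kernel relies on precomputed row offsets and per-row positions to place each nonzero. A sequential sketch of the end result it is meant to produce, assuming a column-major, real-valued dense input; the `Csr` struct and `denseToCsr` are hypothetical names.

```cpp
#include <vector>

// Hypothetical container (not an ArrayFire type) for a CSR matrix.
struct Csr {
    std::vector<double> values;
    std::vector<int> rowPtr;
    std::vector<int> colIdx;
};

// Hypothetical CPU reference: converts an M x N column-major dense matrix
// into CSR form, keeping only the nonzero entries.
Csr denseToCsr(const std::vector<double> &dense, int M, int N) {
    Csr out;
    out.rowPtr.assign(M + 1, 0);
    for (int row = 0; row < M; ++row) {
        for (int col = 0; col < N; ++col) {
            const double v = dense[row + col * M];
            if (v != 0.0) {
                out.values.push_back(v);
                out.colIdx.push_back(col);
            }
        }
        // The next row starts after everything emitted so far.
        out.rowPtr[row + 1] = static_cast<int>(out.values.size());
    }
    return out;
}
```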
diff --git a/src/backend/opencl/kernel/diag_create.cl b/src/backend/opencl/kernel/diag_create.cl
index 179c59c455..9087133612 100644
--- a/src/backend/opencl/kernel/diag_create.cl
+++ b/src/backend/opencl/kernel/diag_create.cl
@@ -7,25 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void diagCreateKernel(__global T *oData, KParam oInfo,
-                 const __global T *iData, KParam iInfo,
-                 int num, int groups_x)
-{
-    unsigned idz = get_group_id(0) / groups_x;
+kernel void diagCreateKernel(global T *oData, KParam oInfo,
+                               const global T *iData, KParam iInfo, int num,
+                               int groups_x) {
+    unsigned idz       = get_group_id(0) / groups_x;
     unsigned groupId_x = get_group_id(0) - idz * groups_x;
 
     unsigned idx = get_local_id(0) + groupId_x * get_local_size(0);
     unsigned idy = get_global_id(1);
 
-    if (idx >= oInfo.dims[0] ||
-        idy >= oInfo.dims[1] ||
-        idz >= oInfo.dims[2]) return;
+    if (idx >= oInfo.dims[0] || idy >= oInfo.dims[1] || idz >= oInfo.dims[2])
+        return;
 
+    global T *optr =
+        oData + idz * oInfo.strides[2] + idy * oInfo.strides[1] + idx;
+    const global T *iptr =
+        iData + idz * iInfo.strides[1] + ((num > 0) ? idx : idy) + iInfo.offset;
 
-    __global T *optr = oData + idz * oInfo.strides[2] + idy * oInfo.strides[1] + idx;
-    const __global T *iptr = iData  + idz *  iInfo.strides[1] + ((num > 0) ? idx : idy) + iInfo.offset;
-
-    T val = (idx == (idy - num)) ? *iptr : ZERO;
+    T val = (idx == (idy - num)) ? *iptr : (T)(ZERO);
     *optr = val;
 }
diff --git a/src/backend/opencl/kernel/diag_extract.cl b/src/backend/opencl/kernel/diag_extract.cl
index 2c7b2561d2..f873de5897 100644
--- a/src/backend/opencl/kernel/diag_extract.cl
+++ b/src/backend/opencl/kernel/diag_extract.cl
@@ -7,32 +7,30 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void diagExtractKernel(__global T *oData, KParam oInfo,
-                  const __global T *iData, KParam iInfo,
-                  int num, int groups_z)
-{
+kernel void diagExtractKernel(global T *oData, KParam oInfo,
+                                const global T *iData, KParam iInfo, int num,
+                                int groups_z) {
     unsigned idw = get_group_id(1) / groups_z;
     unsigned idz = get_group_id(1) - idw * groups_z;
 
     unsigned idx = get_global_id(0);
 
-    if (idx >= oInfo.dims[0] ||
-        idz >= oInfo.dims[2] ||
-        idw >= oInfo.dims[3]) return;
+    if (idx >= oInfo.dims[0] || idz >= oInfo.dims[2] || idw >= oInfo.dims[3])
+        return;
 
-    __global T *optr = oData + idz * oInfo.strides[2] + idw * oInfo.strides[3] + idx;
+    global T *optr =
+        oData + idz * oInfo.strides[2] + idw * oInfo.strides[3] + idx;
 
     if (idx >= iInfo.dims[0] || idx >= iInfo.dims[1]) {
-        *optr = ZERO;
+        *optr = (T)(ZERO);
         return;
     }
 
-    int i_off = (num > 0) ? (num * iInfo.strides[1] + idx) : (idx - num) + iInfo.offset;
+    int i_off =
+        (num > 0) ? (num * iInfo.strides[1] + idx) : (idx - num) + iInfo.offset;
 
-    const __global T *iptr = iData  +
-        idz *  iInfo.strides[2] +
-        idw *  iInfo.strides[3] + i_off;
+    const global T *iptr =
+        iData + idz * iInfo.strides[2] + idw * iInfo.strides[3] + i_off;
 
     *optr = iptr[idx * iInfo.strides[1]];
 }
diff --git a/src/backend/opencl/kernel/diagonal.hpp b/src/backend/opencl/kernel/diagonal.hpp
index ab5da11b76..e8340fba03 100644
--- a/src/backend/opencl/kernel/diagonal.hpp
+++ b/src/backend/opencl/kernel/diagonal.hpp
@@ -7,121 +7,73 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <kernel_headers/diag_create.hpp>
-#include <kernel_headers/diag_extract.hpp>
-#include <program.hpp>
-#include "../traits.hpp"
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <map>
-#include <mutex>
+#include <kernel/config.hpp>
+#include <kernel_headers/diag_create.hpp>
+#include <kernel_headers/diag_extract.hpp>
 #include <math.hpp>
-#include "config.hpp"
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using af::scalar_to_option;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-    template<typename T>
-    static void diagCreate(Param out, Param in, int num)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>   diagCreateProgs;
-            static std::map<int, Kernel*>  diagCreateKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="    << dtype_traits<T>::getName()
-                            << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")";
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, diag_create_cl, diag_create_cl_len, options.str());
-                    diagCreateProgs[device]   = new Program(prog);
-                    diagCreateKernels[device] = new Kernel(*diagCreateProgs[device],
-                                                           "diagCreateKernel");
-                });
-
-            NDRange local(32, 8);
-            int groups_x = divup(out.info.dims[0], local[0]);
-            int groups_y = divup(out.info.dims[1], local[1]);
-            NDRange global(groups_x * local[0] * out.info.dims[2],
-                           groups_y * local[1]);
-
-            auto diagCreateOp = make_kernel<Buffer, const KParam,
-                                            Buffer, const KParam,
-                                            int, int> (*diagCreateKernels[device]);
-
-            diagCreateOp(EnqueueArgs(getQueue(), global, local),
-                         *(out.data), out.info, *(in.data), in.info, num, groups_x);
-            CL_DEBUG_FINISH(getQueue());
-
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-        }
-    }
-
-    template<typename T>
-    static void diagExtract(Param out, Param in, int num)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>   diagExtractProgs;
-            static std::map<int, Kernel*>  diagExtractKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="    << dtype_traits<T>::getName()
-                            << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")";
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, diag_extract_cl, diag_extract_cl_len, options.str());
-                    diagExtractProgs[device]   = new Program(prog);
-                    diagExtractKernels[device] = new Kernel(*diagExtractProgs[device],
-                                                           "diagExtractKernel");
-                });
-
-            NDRange local(256, 1);
-            int groups_x = divup(out.info.dims[0], local[0]);
-            int groups_z = out.info.dims[2];
-            NDRange global(groups_x * local[0],
-                           groups_z * local[1] * out.info.dims[3]);
-
-            auto diagExtractOp = make_kernel<Buffer, const KParam,
-                                             Buffer, const KParam,
-                                             int, int> (*diagExtractKernels[device]);
-
-            diagExtractOp(EnqueueArgs(getQueue(), global, local),
-                          *(out.data), out.info, *(in.data), in.info, num, groups_z);
-            CL_DEBUG_FINISH(getQueue());
-
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-        }
-    }
-
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+static void diagCreate(Param out, Param in, int num) {
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        getTypeBuildDefinition<T>()};
+
+    auto diagCreate = common::getKernel("diagCreateKernel",
+                                        {{diag_create_cl_src}}, targs, options);
+
+    cl::NDRange local(32, 8);
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_y = divup(out.info.dims[1], local[1]);
+    cl::NDRange global(groups_x * local[0] * out.info.dims[2],
+                       groups_y * local[1]);
+
+    diagCreate(cl::EnqueueArgs(getQueue(), global, local), *(out.data),
+               out.info, *(in.data), in.info, num, groups_x);
+    CL_DEBUG_FINISH(getQueue());
 }
 
+template<typename T>
+static void diagExtract(Param out, Param in, int num) {
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        getTypeBuildDefinition<T>()};
+
+    auto diagExtract = common::getKernel(
+        "diagExtractKernel", {{diag_extract_cl_src}}, targs, options);
+
+    cl::NDRange local(256, 1);
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_z = out.info.dims[2];
+    cl::NDRange global(groups_x * local[0],
+                       groups_z * local[1] * out.info.dims[3]);
+
+    diagExtract(cl::EnqueueArgs(getQueue(), global, local), *(out.data),
+                out.info, *(in.data), in.info, num, groups_z);
+    CL_DEBUG_FINISH(getQueue());
 }
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
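
Review note on the launch geometry in `diagExtract` above: the wrapper rounds the first output dimension up to a whole number of 256-wide work-groups and spreads `dims[2] * dims[3]` over the second axis. The sketch below reproduces only that host-side arithmetic. The `divup` shown here is a local re-implementation that is assumed to behave like the helper from `common/dispatch.hpp`; `LaunchSize` and `diagExtractLaunch` are hypothetical names used for illustration, not ArrayFire API.

```cpp
#include <cstddef>

// Ceiling division. Assumption: behaves like the divup helper used above.
inline std::size_t divup(std::size_t n, std::size_t d) {
    return (n + d - 1) / d;
}

// Hypothetical container for the two global work sizes.
struct LaunchSize {
    std::size_t global_x;
    std::size_t global_y;
};

// Mirrors the arithmetic of diagExtract: local size is (256, 1), the x axis
// covers out dims[0] rounded up to a multiple of 256, and the y axis covers
// dims[2] * dims[3].
inline LaunchSize diagExtractLaunch(const std::size_t (&out_dims)[4]) {
    const std::size_t local_x  = 256;
    const std::size_t local_y  = 1;
    const std::size_t groups_x = divup(out_dims[0], local_x);
    const std::size_t groups_z = out_dims[2];
    return {groups_x * local_x, groups_z * local_y * out_dims[3]};
}
```

Because the x axis is padded to a multiple of the local size, the kernel has to bounds-check its global index against the real dimension before writing.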
diff --git a/src/backend/opencl/kernel/diff.cl b/src/backend/opencl/kernel/diff.cl
index 0d00a77bee..aef7c0e86f 100644
--- a/src/backend/opencl/kernel/diff.cl
+++ b/src/backend/opencl/kernel/diff.cl
@@ -7,21 +7,18 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-void diff_this(__global T* out, __global const T* in, const int oMem,
-               const int iMem0, const int iMem1, const int iMem2)
-{
-    if(isDiff2 == 0) {
+void diff_this(global T* out, global const T* in, const int oMem,
+               const int iMem0, const int iMem1, const int iMem2) {
+    if (isDiff2 == 0) {
         out[oMem] = in[iMem1] - in[iMem0];
     } else {
         out[oMem] = in[iMem2] - in[iMem1] - in[iMem1] + in[iMem0];
     }
 }
 
-__kernel
-void diff_kernel(__global T *out, __global const T *in,
-                 const KParam op, const KParam ip, const int oElem,
-                 const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void diff_kernel(global T* out, global const T* in,
+                          const KParam op, const KParam ip, const int oElem,
+                          const int blocksPerMatX, const int blocksPerMatY) {
     const int idz = get_group_id(0) / blocksPerMatX;
     const int idw = get_group_id(1) / blocksPerMatY;
 
@@ -31,17 +28,17 @@ void diff_kernel(__global T *out, __global const T *in,
     const int idx = get_local_id(0) + blockIdx_x * get_local_size(0);
     const int idy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    if(idx >= op.dims[0] ||
-       idy >= op.dims[1] ||
-       idz >= op.dims[2] ||
-       idw >= op.dims[3])
+    if (idx >= op.dims[0] || idy >= op.dims[1] || idz >= op.dims[2] ||
+        idw >= op.dims[3])
         return;
 
-    int iMem0 = idw * ip.strides[3] + idz * ip.strides[2] + idy * ip.strides[1] + idx;
+    int iMem0 =
+        idw * ip.strides[3] + idz * ip.strides[2] + idy * ip.strides[1] + idx;
     int iMem1 = iMem0 + ip.strides[DIM];
     int iMem2 = iMem1 + ip.strides[DIM];
 
-    int oMem = idw * op.strides[3] + idz * op.strides[2] + idy * op.strides[1] + idx;
+    int oMem =
+        idw * op.strides[3] + idz * op.strides[2] + idy * op.strides[1] + idx;
 
     iMem2 *= isDiff2;
 
diff --git a/src/backend/opencl/kernel/diff.hpp b/src/backend/opencl/kernel/diff.hpp
index a1f9318540..33ccbbfca8 100644
--- a/src/backend/opencl/kernel/diff.hpp
+++ b/src/backend/opencl/kernel/diff.hpp
@@ -8,83 +8,55 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/diff.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/diff.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        static const int TX = 16;
-        static const int TY = 16;
-
-        template<typename T, unsigned dim, bool isDiff2>
-        void diff(Param out, const Param in, const unsigned indims)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   diffProgs;
-                static std::map<int, Kernel*>  diffKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="        << dtype_traits<T>::getName()
-                            << " -D DIM="      << dim
-                            << " -D isDiff2=" << isDiff2;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, diff_cl, diff_cl_len, options.str());
-                    diffProgs[device]   = new Program(prog);
-                    diffKernels[device] = new Kernel(*diffProgs[device], "diff_kernel");
-                });
-
-                auto diffOp = make_kernel<Buffer, const Buffer, const KParam, const KParam,
-                                          const int, const int, const int>
-                                          (*diffKernels[device]);
-
-                NDRange local(TX, TY, 1);
-                if(dim == 0 && indims == 1) {
-                    local = NDRange(TX * TY, 1, 1);
-                }
-
-                int blocksPerMatX = divup(out.info.dims[0], local[0]);
-                int blocksPerMatY = divup(out.info.dims[1], local[1]);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
-
-                const int oElem = out.info.dims[0] * out.info.dims[1]
-                                     * out.info.dims[2] * out.info.dims[3];
-
-                diffOp(EnqueueArgs(getQueue(), global, local),
-                       *out.data, *in.data, out.info, in.info,
-                       oElem, blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void diff(Param out, const Param in, const unsigned indims, const unsigned dim,
+          const bool isDiff2) {
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+
+    std::array<TemplateArg, 3> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(dim),
+        TemplateArg(isDiff2),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineKeyValue(DIM, dim),
+        DefineKeyValue(isDiff2, (isDiff2 ? 1 : 0)),
+        getTypeBuildDefinition<T>()};
+
+    auto diffOp =
+        common::getKernel("diff_kernel", {{diff_cl_src}}, targs, options);
+
+    cl::NDRange local(TX, TY, 1);
+    if (dim == 0 && indims == 1) { local = cl::NDRange(TX * TY, 1, 1); }
+
+    int blocksPerMatX = divup(out.info.dims[0], local[0]);
+    int blocksPerMatY = divup(out.info.dims[1], local[1]);
+    cl::NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
+                       local[1] * blocksPerMatY * out.info.dims[3], 1);
+
+    const int oElem = out.info.dims[0] * out.info.dims[1] * out.info.dims[2] *
+                      out.info.dims[3];
+
+    diffOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, *in.data,
+           out.info, in.info, oElem, blocksPerMatX, blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
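
Review note on what `diff_kernel` computes: for each output element, `diff_this` takes either the first difference `in[i + 1] - in[i]` or, when `isDiff2` is set, the second difference `in[i + 2] - 2 * in[i + 1] + in[i]` along the chosen dimension. A host-side reference can help when checking the kernel output. The function below is a minimal one-dimensional sketch with unit stride; `diffReference` is a hypothetical name and is not part of ArrayFire.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a contiguous 1D input.
// First difference yields n - 1 elements, second difference yields n - 2.
template<typename T>
std::vector<T> diffReference(const std::vector<T>& in, bool isDiff2) {
    const std::size_t order = isDiff2 ? 2 : 1;
    if (in.size() <= order) { return {}; }

    std::vector<T> out(in.size() - order);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (isDiff2) {
            // Same term order as diff_this in the kernel source.
            out[i] = in[i + 2] - in[i + 1] - in[i + 1] + in[i];
        } else {
            out[i] = in[i + 1] - in[i];
        }
    }
    return out;
}
```

For the input `{1, 4, 9, 16}` this gives `{3, 5, 7}` for the first difference and `{2, 2}` for the second.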
diff --git a/src/backend/opencl/kernel/example.cl b/src/backend/opencl/kernel/example.cl
index c8b8328580..e946106326 100644
--- a/src/backend/opencl/kernel/example.cl
+++ b/src/backend/opencl/kernel/example.cl
@@ -7,12 +7,21 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void example(__global T *       d_dst,
-             KParam             oInfo,
-             __global const T * d_src,
-             KParam             iInfo,
-             int                method);
+kernel void example(global T* d_dst, KParam oInfo, global const T* d_src1,
+                      KParam iInfo1, global const T* d_src2, KParam iInfo2,
+                      int method)
 {
-    // kernel algorithm goes here
+    // get current thread global identifiers along required dimensions
+    int i = get_global_id(0);
+    int j = get_global_id(1);
+
+    if (i < iInfo1.dims[0] && j < iInfo1.dims[1]) {
+        // if needed use strides array to compute linear index of arrays
+        int src1Idx = i + j * iInfo1.strides[1];
+        int src2Idx = i + j * iInfo2.strides[1];
+        int dstIdx  = i + j * oInfo.strides[1];
+
+        // kernel algorithm goes here
+        d_dst[dstIdx] = d_src1[src1Idx] + d_src2[src2Idx];
+    }
 }
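
Review note on the indexing in the example kernel above: each array carries its own `strides`, so the same logical element `(i, j)` can map to a different linear offset in `d_src1`, `d_src2` and `d_dst`. The host-side sketch below walks the same index computation sequentially. `Info2D` is a hypothetical stand-in for the `dims` and `strides` fields of `KParam`, and `addStrided` is an illustrative name, not ArrayFire code.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical two-dimensional subset of the KParam fields used above.
struct Info2D {
    int dims[2];
    int strides[2];
};

// Sequential equivalent of the example kernel body: one loop iteration per
// work-item, with the same bounds check and linear index computation.
inline void addStrided(std::vector<float>& dst, const Info2D& oInfo,
                       const std::vector<float>& src1, const Info2D& iInfo1,
                       const std::vector<float>& src2, const Info2D& iInfo2) {
    for (int j = 0; j < iInfo1.dims[1]; ++j) {
        for (int i = 0; i < iInfo1.dims[0]; ++i) {
            const int src1Idx = i + j * iInfo1.strides[1];
            const int src2Idx = i + j * iInfo2.strides[1];
            const int dstIdx  = i + j * oInfo.strides[1];
            dst[static_cast<std::size_t>(dstIdx)] =
                src1[static_cast<std::size_t>(src1Idx)] +
                src2[static_cast<std::size_t>(src2Idx)];
        }
    }
}
```

A padded second input (column stride 3 for a 2 x 2 array) is a quick way to confirm that the per-array strides are honoured.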
diff --git a/src/backend/opencl/kernel/exampleFunction.hpp b/src/backend/opencl/kernel/exampleFunction.hpp
index e2e953c9c8..794c34670c 100644
--- a/src/backend/opencl/kernel/exampleFunction.hpp
+++ b/src/backend/opencl/kernel/exampleFunction.hpp
@@ -8,119 +8,79 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/example.hpp>   // This is the header that gets auto-generated
-                                        // from the .cl file you will create. We pre-process
-                                        // cl files to obfuscate code.
 
-#include <program.hpp>
-#include <traits.hpp>
-
-// Following c++ standard library headers are needed to maintain
-// OpenCL cl::Kernel & cl::Program objects
-#include <string>
-#include <mutex>
-#include <map>
+#include <Param.hpp>  // This header declares the structures that are
+                      // passed to the kernel. Conversion from
+                      // opencl::Array<T> to Param happens automatically
+                      // through operator overloads, so no special work is
+                      // needed. Hence, the OpenCL kernel wrapper function
+                      // takes in Param instead of opencl::Array<T>.
 
-#include <dispatch.hpp>                 // common utility header for CUDA & OpenCL backends
-                                        // has the divup macro
+// The header below is auto-generated from the .cl file you will create.
+// We pre-process cl files to obfuscate code.
+#include <kernel_headers/example.hpp>
 
-#include <Param.hpp>                    // This header has the declaration of structures
-                                        // that are passed onto kernel. Operator overloads
-                                        // for creating Param objects from opencl::Array<T>
-                                        // objects is automatic, no special work is needed.
-                                        // Hence, the OpenCL kernel wrapper function takes in
-                                        // Param instead of opencl::Array<T>
+#include <traits.hpp>
 
-#include <debug_opencl.hpp>             // For Debug only related OpenCL validations
+#include <common/dispatch.hpp>      // common utility header for CUDA & OpenCL
+                                    // backends; has the divup macro
+#include <common/kernel_cache.hpp>  // Has getKernel
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
+#include <debug_opencl.hpp>  // For debug-only OpenCL validations
 
-namespace opencl
-{
+// The following C++ standard library headers are needed to create
+// the lists of parameters for the common::getKernel function call
+#include <string>
+#include <vector>
 
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
 
 template<typename T>
-void exampleFunc(Param out, const Param in, const af_someenum_t p)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  egProgs;
-        static std::map<int, Kernel*> egKernels;
-
-        int device = getActiveDeviceId();
-
-        // std::call_once is used to ensure OpenCL kernels
-        // are compiled only once for any given device and combination
-        // of template parameters to this kernel wrapper function 'exampleFunc<T>'
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
-                // You can pass any template parameters as compile options
-                // to kernel the compilation step. This is equivalent of
-                // having templated kernels in CUDA
-
-                // The following option is passed to kernel compilation
-                // if template parameter T is double or complex double
-                // to enable FP64 extension
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                Program prog;
-                // below helper function 'buildProgram' uses the option string
-                // we just created and compiles the kernel string
-                // 'example_cl' which was created by our opencl kernel code obfuscation
-                // stage
-                buildProgram(prog, example_cl, example_cl_len, options.str());
-
-                // create a cl::Program object on heap
-                egProgs[device]   = new Program(prog);
-
-                // create a cl::Kernel object on heap
-                egKernels[device] = new Kernel(*egProgs[device], "example");
-            });
-
-        // configure work group parameters
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(out.info.dims[0], THREADS_X);
-        int blk_y = divup(out.info.dims[1], THREADS_Y);
-
-        // configure global launch parameters
-        NDRange global(blk_x * THREADS_X, blk_y * THREADS_Y);
-
-        // create a kernel functor from the cl::Kernel object
-        // corresponding to the device on which current execution
-        // is happending.
-        auto exampleFuncOp = make_kernel<Buffer, KParam,
-                                     Buffer, KParam, int>(*egKernels[device]);
-
-        // launch the kernel
-        exampleFuncOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info, (int)p);
-
-        // Below Macro activates validations ONLY in DEBUG
-        // mode as its name indicates
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) { // Catch all cl::Errors and convert them
-                              // to appropriate ArrayFire error codes
-        CL_TO_AF_ERROR(err);
-    }
-}
-
+void exampleFunc(Param c, const Param a, const Param b, const af_someenum_t p) {
+    // Template arguments of the OpenCL kernel.
+    // Go to common/kernel_cache.hpp to find details on this.
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+
+    // Compilation options for compiling OpenCL kernel.
+    // Go to common/kernel_cache.hpp to find details on this.
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+
+        // The following templated function can take a variable
+        // number of template parameters. If one of them is double
+        // precision, it enables the necessary constants, flags and ops
+        // in the OpenCL kernel compilation stage
+        getTypeBuildDefinition<T>()};
+
+    // Fetch the Kernel functor, go to common/kernel_cache.hpp
+    // to find details of this function
+    auto exOp =
+        common::getKernel("example", {{example_cl_src}}, targs, options);
+
+    // configure work group parameters
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(c.info.dims[0], THREADS_X);
+    int blk_y = divup(c.info.dims[1], THREADS_Y);
+
+    // configure global launch parameters
+    cl::NDRange global(blk_x * THREADS_X, blk_y * THREADS_Y);
+
+    // launch the kernel
+    exOp(cl::EnqueueArgs(getQueue(), global, local), *c.data, c.info, *a.data,
+         a.info, *b.data, b.info, (int)p);
+    // Below Macro activates validations ONLY in DEBUG
+    // mode as its name indicates
+    CL_DEBUG_FINISH(getQueue());
 }
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/fast.cl b/src/backend/opencl/kernel/fast.cl
index 1c9fdf36a3..ef80350f01 100644
--- a/src/backend/opencl/kernel/fast.cl
+++ b/src/backend/opencl/kernel/fast.cl
@@ -7,78 +7,73 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#define MAX_VAL(A,B) (A<B) ? (B) : (A)
+#define MAX_VAL(A, B) (A < B) ? (B) : (A)
 
-inline int idx_y(const int i)
-{
+inline int idx_y(const int i) {
     int j = i - 4;
     int k = min(j, 8 - j);
     return clamp(k, -3, 3);
 }
 
-inline int idx_x(const int i)
-{
-    return idx_y((i + 4) & 15);
-}
+inline int idx_x(const int i) { return idx_y((i + 4) & 15); }
 
-inline int idx(const int x, const int y)
-{
-    return ((get_local_id(0) + 3 + x) + (get_local_size(0) + 6) * (get_local_id(1) + 3 + y));
+inline int idx(const int x, const int y) {
+    return ((get_local_id(0) + 3 + x) +
+            (get_local_size(0) + 6) * (get_local_id(1) + 3 + y));
 }
 
 // test_greater()
 // Tests if a pixel x > p + thr
-inline int test_greater(const float x, const float p, const float thr)
-{
-    return (x >= p + thr);
+inline int test_greater(const float x, const float p, const float thr) {
+    return (x > p + thr);
 }
 
 // test_smaller()
 // Tests if a pixel x < p - thr
-inline int test_smaller(const float x, const float p, const float thr)
-{
-    return (x <= p - thr);
+inline int test_smaller(const float x, const float p, const float thr) {
+    return (x < p - thr);
 }
 
 // test_pixel()
 // Returns -1 when x < p - thr
 // Returns  0 when x >= p - thr && x <= p + thr
 // Returns  1 when x > p + thr
-inline int test_pixel(__local T* local_image, const float p, const float thr, const int x, const int y)
-{
-    return -test_smaller((float)local_image[idx(x,y)], p, thr) | test_greater((float)local_image[idx(x,y)], p, thr);
+inline int test_pixel(local T* local_image, const float p, const float thr,
+                      const int x, const int y) {
+    return -test_smaller((float)local_image[idx(x, y)], p, thr) +
+           test_greater((float)local_image[idx(x, y)], p, thr);
 }
 
-void locate_features_core(
-    __local T* local_image,
-    __global float* score,
-    KParam iInfo,
-    const float thr,
-    int x, int y,
-    const unsigned edge)
-{
+void locate_features_core(local T* local_image, global float* score,
+                          KParam iInfo, const float thr, int x, int y,
+                          const unsigned edge) {
     if (x >= iInfo.dims[0] - edge || y >= iInfo.dims[1] - edge) return;
 
-    float p = local_image[idx( 0, 0)];
+    float p = local_image[idx(0, 0)];
 
     // Start by testing opposite pixels of the circle that will result in
     // a non-kepoint
-    int d = test_pixel(local_image, p, thr, -3,  0) | test_pixel(local_image, p, thr, 3,  0);
-    if (d == 0)
-        return;
-
-    d &= test_pixel(local_image, p, thr, -2,  2) | test_pixel(local_image, p, thr,  2, -2);
-    d &= test_pixel(local_image, p, thr,  0,  3) | test_pixel(local_image, p, thr,  0, -3);
-    d &= test_pixel(local_image, p, thr,  2,  2) | test_pixel(local_image, p, thr, -2, -2);
-    if (d == 0)
-        return;
-
-    d &= test_pixel(local_image, p, thr, -3,  1) | test_pixel(local_image, p, thr,  3, -1);
-    d &= test_pixel(local_image, p, thr, -1,  3) | test_pixel(local_image, p, thr,  1, -3);
-    d &= test_pixel(local_image, p, thr,  1,  3) | test_pixel(local_image, p, thr, -1, -3);
-    d &= test_pixel(local_image, p, thr,  3,  1) | test_pixel(local_image, p, thr, -3, -1);
-    if (d == 0)
-        return;
+    int d = test_pixel(local_image, p, thr, -3, 0) |
+            test_pixel(local_image, p, thr, 3, 0);
+    if (d == 0) return;
+
+    d &= test_pixel(local_image, p, thr, -2, 2) |
+         test_pixel(local_image, p, thr, 2, -2);
+    d &= test_pixel(local_image, p, thr, 0, 3) |
+         test_pixel(local_image, p, thr, 0, -3);
+    d &= test_pixel(local_image, p, thr, 2, 2) |
+         test_pixel(local_image, p, thr, -2, -2);
+    if (d == 0) return;
+
+    d &= test_pixel(local_image, p, thr, -3, 1) |
+         test_pixel(local_image, p, thr, 3, -1);
+    d &= test_pixel(local_image, p, thr, -1, 3) |
+         test_pixel(local_image, p, thr, 1, -3);
+    d &= test_pixel(local_image, p, thr, 1, 3) |
+         test_pixel(local_image, p, thr, -1, -3);
+    d &= test_pixel(local_image, p, thr, 3, 1) |
+         test_pixel(local_image, p, thr, -3, -1);
+    if (d == 0) return;
 
     int sum = 0;
 
@@ -94,7 +89,8 @@ void locate_features_core(
 
     // Sum responses and test the remaining 16-ARC_LENGTH pixels of the circle
     for (int i = ARC_LENGTH; i < 16; i++) {
-        sum -= test_pixel(local_image, p, thr, idx_x(i-ARC_LENGTH), idx_y(i-ARC_LENGTH));
+        sum -= test_pixel(local_image, p, thr, idx_x(i - ARC_LENGTH),
+                          idx_y(i - ARC_LENGTH));
         sum += test_pixel(local_image, p, thr, idx_x(i), idx_y(i));
         max_sum = max(max_sum, sum);
         min_sum = min(min_sum, sum);
@@ -102,8 +98,9 @@ void locate_features_core(
 
     // To completely test all possible segments, it's necessary to test
     // segments that include the top junction of the circle
-    for (int i = 0; i < ARC_LENGTH-1; i++) {
-        sum -= test_pixel(local_image, p, thr, idx_x(16-ARC_LENGTH+i), idx_y(16-ARC_LENGTH+i));
+    for (int i = 0; i < ARC_LENGTH - 1; i++) {
+        sum -= test_pixel(local_image, p, thr, idx_x(16 - ARC_LENGTH + i),
+                          idx_y(16 - ARC_LENGTH + i));
         sum += test_pixel(local_image, p, thr, idx_x(i), idx_y(i));
         max_sum = max(max_sum, sum);
         min_sum = min(min_sum, sum);
@@ -119,71 +116,59 @@ void locate_features_core(
             float p_x    = local_image[idx(idx_x(i), idx_y(i))];
             float weight = fabs((float)p_x - (float)p) - thr;
             s_bright += test_greater(p_x, p, thr) * weight;
-            s_dark   += test_smaller(p_x, p, thr) * weight;
+            s_dark += test_smaller(p_x, p, thr) * weight;
         }
 
         score[x + iInfo.dims[0] * y] = MAX_VAL(s_bright, s_dark);
     }
 }
 
-void load_shared_image(
-    __global const T *in,
-    KParam iInfo,
-    __local T *local_image,
-    unsigned ix, unsigned iy,
-    unsigned bx, unsigned by,
-    unsigned  x, unsigned  y,
-    unsigned lx, unsigned ly)
-{
+void load_shared_image(global const T* in, KParam iInfo,
+                       local T* local_image, unsigned ix, unsigned iy,
+                       unsigned bx, unsigned by, unsigned x, unsigned y,
+                       unsigned lx, unsigned ly) {
     // Copy an image patch to shared memory, with a 3-pixel edge
     if (ix < lx && iy < ly && x - 3 < iInfo.dims[0] && y - 3 < iInfo.dims[1]) {
-        local_image[(ix)      + (bx+6) * (iy)]    = in[(x-3)    + iInfo.dims[0] * (y-3)];
+        local_image[(ix) + (bx + 6) * (iy)] =
+            in[(x - 3) + iInfo.dims[0] * (y - 3)];
         if (x + lx - 3 < iInfo.dims[0])
-            local_image[(ix + lx) + (bx+6) * (iy)]    = in[(x+lx-3) + iInfo.dims[0] * (y-3)];
+            local_image[(ix + lx) + (bx + 6) * (iy)] =
+                in[(x + lx - 3) + iInfo.dims[0] * (y - 3)];
         if (y + ly - 3 < iInfo.dims[1])
-            local_image[(ix)      + (bx+6) * (iy+ly)] = in[(x-3)    + iInfo.dims[0] * (y+ly-3)];
+            local_image[(ix) + (bx + 6) * (iy + ly)] =
+                in[(x - 3) + iInfo.dims[0] * (y + ly - 3)];
         if (x + lx - 3 < iInfo.dims[0] && y + ly - 3 < iInfo.dims[1])
-            local_image[(ix + lx) + (bx+6) * (iy+ly)] = in[(x+lx-3) + iInfo.dims[0] * (y+ly-3)];
+            local_image[(ix + lx) + (bx + 6) * (iy + ly)] =
+                in[(x + lx - 3) + iInfo.dims[0] * (y + ly - 3)];
     }
 }
 
-__kernel
-void locate_features(
-    __global const T* in,
-    KParam          iInfo,
-    __global float* score,
-    const float thr,
-    const unsigned edge,
-    __local T* local_image)
-{
+kernel void locate_features(global const T* in, KParam iInfo,
+                              global float* score, const float thr,
+                              const unsigned edge, local T* local_image) {
     unsigned ix = get_local_id(0);
     unsigned iy = get_local_id(1);
     unsigned bx = get_local_size(0);
     unsigned by = get_local_size(1);
-    unsigned x = bx * get_group_id(0) + ix + edge;
-    unsigned y = by * get_group_id(1) + iy + edge;
+    unsigned x  = bx * get_group_id(0) + ix + edge;
+    unsigned y  = by * get_group_id(1) + iy + edge;
     unsigned lx = bx / 2 + 3;
     unsigned ly = by / 2 + 3;
 
-    load_shared_image(in, iInfo, local_image, ix, iy, bx, by, x, y, lx, ly);
+    load_shared_image(in + iInfo.offset, iInfo, local_image, ix, iy, bx, by, x,
+                      y, lx, ly);
     barrier(CLK_LOCAL_MEM_FENCE);
-    locate_features_core(local_image, score,
-                         iInfo, thr, x, y, edge);
+    locate_features_core(local_image, score, iInfo, thr, x, y, edge);
 }
 
-__kernel
-void non_max_counts(
-    __global unsigned *d_counts,
-    __global unsigned *d_offsets,
-    __global unsigned *d_total,
-    __global float *flags,
-    __global const float* score,
-    KParam iInfo,
-    const unsigned edge)
-{
-    __local unsigned s_counts[256];
+kernel void non_max_counts(global unsigned* d_counts,
+                             global unsigned* d_offsets,
+                             global unsigned* d_total, global float* flags,
+                             global const float* score, KParam iInfo,
+                             const unsigned edge) {
+    local unsigned s_counts[256];
 
-    const int yid = get_group_id(1) * get_local_size(1) * 8 + get_local_id(1);
+    const int yid  = get_group_id(1) * get_local_size(1) * 8 + get_local_id(1);
     const int yend = (get_group_id(1) + 1) * get_local_size(1) * 8;
     const int yoff = get_local_size(1);
 
@@ -191,14 +176,15 @@ void non_max_counts(
 
     const int max1 = (int)iInfo.dims[1] - edge - 1;
     for (int y = yid; y < yend; y += yoff) {
-        if (y >= max1 || y <= (int)(edge+1)) continue;
+        if (y >= max1 || y <= (int)(edge + 1)) continue;
 
-        const int xid = get_group_id(0) * get_local_size(0) * 2 + get_local_id(0);
+        const int xid =
+            get_group_id(0) * get_local_size(0) * 2 + get_local_id(0);
         const int xend = (get_group_id(0) + 1) * get_local_size(0) * 2;
 
         const int max0 = (int)iInfo.dims[0] - edge - 1;
         for (int x = xid; x < xend; x += get_local_size(0)) {
-            if (x >= max0 || x <= (int)(edge+1)) continue;
+            if (x >= max0 || x <= (int)(edge + 1)) continue;
 
             float v = score[y * iInfo.dims[0] + x];
             if (v == 0) {
@@ -209,18 +195,19 @@ void non_max_counts(
             }
 
 #if NONMAX
-                float max_v = v;
-                max_v = MAX_VAL(score[x-1 + iInfo.dims[0] * (y-1)], score[x-1 + iInfo.dims[0] * y]);
-                max_v = MAX_VAL(max_v, score[x-1 + iInfo.dims[0] * (y+1)]);
-                max_v = MAX_VAL(max_v, score[x   + iInfo.dims[0] * (y-1)]);
-                max_v = MAX_VAL(max_v, score[x   + iInfo.dims[0] * (y+1)]);
-                max_v = MAX_VAL(max_v, score[x+1 + iInfo.dims[0] * (y-1)]);
-                max_v = MAX_VAL(max_v, score[x+1 + iInfo.dims[0] * (y)  ]);
-                max_v = MAX_VAL(max_v, score[x+1 + iInfo.dims[0] * (y+1)]);
-
-                v = (v > max_v) ? v : 0;
-                flags[y * iInfo.dims[0] + x] = v;
-                if (v == 0) continue;
+            float max_v = v;
+            max_v       = MAX_VAL(score[x - 1 + iInfo.dims[0] * (y - 1)],
+                            score[x - 1 + iInfo.dims[0] * y]);
+            max_v = MAX_VAL(max_v, score[x - 1 + iInfo.dims[0] * (y + 1)]);
+            max_v = MAX_VAL(max_v, score[x + iInfo.dims[0] * (y - 1)]);
+            max_v = MAX_VAL(max_v, score[x + iInfo.dims[0] * (y + 1)]);
+            max_v = MAX_VAL(max_v, score[x + 1 + iInfo.dims[0] * (y - 1)]);
+            max_v = MAX_VAL(max_v, score[x + 1 + iInfo.dims[0] * (y)]);
+            max_v = MAX_VAL(max_v, score[x + 1 + iInfo.dims[0] * (y + 1)]);
+
+            v                            = (v > max_v) ? v : 0;
+            flags[y * iInfo.dims[0] + x] = v;
+            if (v == 0) continue;
 #endif
 
             count++;
@@ -232,34 +219,37 @@ void non_max_counts(
     s_counts[tid] = count;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (tid < 128) s_counts[tid] += s_counts[tid + 128]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <  64) s_counts[tid] += s_counts[tid +  64]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <  32) s_counts[tid] += s_counts[tid +  32]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <  16) s_counts[tid] += s_counts[tid +  16]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <   8) s_counts[tid] += s_counts[tid +   8]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <   4) s_counts[tid] += s_counts[tid +   4]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <   2) s_counts[tid] += s_counts[tid +   2]; barrier(CLK_LOCAL_MEM_FENCE);
-    if (tid <   1) s_counts[tid] += s_counts[tid +   1]; barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 128) s_counts[tid] += s_counts[tid + 128];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 64) s_counts[tid] += s_counts[tid + 64];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 32) s_counts[tid] += s_counts[tid + 32];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 16) s_counts[tid] += s_counts[tid + 16];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 8) s_counts[tid] += s_counts[tid + 8];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 4) s_counts[tid] += s_counts[tid + 4];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 2) s_counts[tid] += s_counts[tid + 2];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (tid < 1) s_counts[tid] += s_counts[tid + 1];
+    barrier(CLK_LOCAL_MEM_FENCE);
 
     if (tid == 0) {
-        const int bid = get_group_id(1) * get_num_groups(0) + get_group_id(0);
+        const int bid  = get_group_id(1) * get_num_groups(0) + get_group_id(0);
         unsigned total = s_counts[0] ? atomic_add(d_total, s_counts[0]) : 0;
-        d_counts [bid] = s_counts[0];
+        d_counts[bid]  = s_counts[0];
         d_offsets[bid] = total;
     }
 }
 
-__kernel void get_features(
-    __global float* x_out,
-    __global float* y_out,
-    __global float* score_out,
-    __global const float* flags,
-    __global const unsigned* d_counts,
-    __global const unsigned* d_offsets,
-    KParam iInfo,
-    const unsigned total,
-    const unsigned edge)
-{
+kernel void get_features(global float* x_out, global float* y_out,
+                           global float* score_out,
+                           global const float* flags,
+                           global const unsigned* d_counts,
+                           global const unsigned* d_offsets, KParam iInfo,
+                           const unsigned total, const unsigned edge) {
     const int xid = get_group_id(0) * get_local_size(0) * 2 + get_local_id(0);
     const int yid = get_group_id(1) * get_local_size(1) * 8 + get_local_id(1);
     const int tid = get_local_size(0) * get_local_id(1) + get_local_id(0);
@@ -272,29 +262,29 @@ __kernel void get_features(
 
     const int bid = get_group_id(1) * get_num_groups(0) + get_group_id(0);
 
-    __local unsigned s_count;
-    __local unsigned s_idx;
+    local unsigned s_count;
+    local unsigned s_idx;
 
     if (tid == 0) {
-        s_count  = d_counts [bid];
-        s_idx    = d_offsets[bid];
+        s_count = d_counts[bid];
+        s_idx   = d_offsets[bid];
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
     // Blocks that are empty, please bail
     if (s_count == 0) return;
     for (int y = yid; y < yend; y += yoff) {
-        if (y >= iInfo.dims[1] - edge - 1 || y <= edge+1) continue;
+        if (y >= iInfo.dims[1] - edge - 1 || y <= edge + 1) continue;
         for (int x = xid; x < xend; x += xoff) {
-            if (x >= iInfo.dims[0] - edge - 1 || x <= edge+1) continue;
+            if (x >= iInfo.dims[0] - edge - 1 || x <= edge + 1) continue;
 
             float v = flags[y * iInfo.dims[0] + x];
             if (v == 0) continue;
 
             unsigned id = atomic_inc(&s_idx);
             if (id < total) {
-                y_out[id] = x;
-                x_out[id] = y;
+                y_out[id]     = x;
+                x_out[id]     = y;
                 score_out[id] = v;
             }
         }
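Review note on the fast.cl changes above: the eight `if (tid < N) ... barrier(...)` lines in `non_max_counts` form a tree reduction over the 256 per-work-item counts, and the `atomic_add` on `d_total` then reserves a contiguous output range for each work-group. The host-side C++ sketch below models only that logic so it can be checked without a device. `reduce_counts` and `reserve_offset` are hypothetical names introduced for this note and are not part of ArrayFire. The sequential inner loop stands in for work-items that run in parallel on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model of the local-memory tree reduction in non_max_counts.
// Each pass of the outer loop corresponds to one `if (tid < stride)` line
// followed by barrier(CLK_LOCAL_MEM_FENCE) in the kernel.
// Hypothetical helper for illustration; not an ArrayFire function.
unsigned reduce_counts(std::vector<unsigned> s_counts) {
    const std::size_t n = s_counts.size();
    // The halving scheme assumes a power-of-two size (256 in the kernel).
    assert(n != 0 && (n & (n - 1)) == 0);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        // On the device, every work-item with tid < stride does this at once.
        for (std::size_t tid = 0; tid < stride; ++tid) {
            s_counts[tid] += s_counts[tid + stride];
        }
        // A barrier here makes all partial sums visible before the next pass.
    }
    return s_counts[0];
}

// Models `s_counts[0] ? atomic_add(d_total, s_counts[0]) : 0`:
// returns the running total before the add, which becomes the offset
// stored in d_offsets[bid]. Empty groups do not touch the total.
unsigned reserve_offset(unsigned &d_total, unsigned count) {
    if (count == 0) return 0;
    const unsigned old_total = d_total;
    d_total += count;
    return old_total;
}
```

Calling `reduce_counts` on a 256-element vector yields the same value as a plain sum of its entries. Calling `reserve_offset` once per group hands out non-overlapping ranges, which is what lets `get_features` write with `atomic_inc(&s_idx)` starting from its group's offset.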
diff --git a/src/backend/opencl/kernel/fast.hpp b/src/backend/opencl/kernel/fast.hpp
index 68cb767248..73351803b6 100644
--- a/src/backend/opencl/kernel/fast.hpp
+++ b/src/backend/opencl/kernel/fast.hpp
@@ -7,242 +7,143 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <program.hpp>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
 #include <kernel_headers/fast.hpp>
 #include <memory.hpp>
-#include <map>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int FAST_THREADS_X = 16;
-static const int FAST_THREADS_Y = 16;
-static const int FAST_THREADS_NONMAX_X = 32;
-static const int FAST_THREADS_NONMAX_Y = 8;
-
-template<typename T, const unsigned arc_length, const bool nonmax>
-void fast(unsigned* out_feat,
-          Param &x_out,
-          Param &y_out,
-          Param &score_out,
-          Param in,
-          const float thr,
-          const float feature_ratio,
-          const unsigned edge)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> fastProgs;
-        static std::map<int, Kernel*>  lfKernel;
-        static std::map<int, Kernel*>  nmKernel;
-        static std::map<int, Kernel*>  gfKernel;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D ARC_LENGTH=" << arc_length
-                        << " -D NONMAX=" << static_cast<unsigned>(nonmax);
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, fast_cl, fast_cl_len, options.str());
-                fastProgs[device] = new Program(prog);
-
-                lfKernel[device] = new Kernel(*fastProgs[device], "locate_features");
-                nmKernel[device] = new Kernel(*fastProgs[device], "non_max_counts");
-                gfKernel[device] = new Kernel(*fastProgs[device], "get_features");
-            });
-
-        const unsigned max_feat = ceil(in.info.dims[0] * in.info.dims[1] * feature_ratio);
-
-        // Matrix containing scores for detected features, scores are stored in the
-        // same coordinates as features, dimensions should be equal to in.
-        cl::Buffer *d_score = bufferAlloc(in.info.dims[0] * in.info.dims[1] * sizeof(float));
-        std::vector<float> score_init(in.info.dims[0] * in.info.dims[1], (float)0);
-        getQueue().enqueueWriteBuffer(*d_score, CL_TRUE, 0, in.info.dims[0] * in.info.dims[1] * sizeof(float), &score_init[0]);
-
-        cl::Buffer *d_flags = d_score;
-        if (nonmax) {
-            d_flags = bufferAlloc(in.info.dims[0] * in.info.dims[1] * sizeof(T));
-        }
-
-        const int blk_x = divup(in.info.dims[0]-edge*2, FAST_THREADS_X);
-        const int blk_y = divup(in.info.dims[1]-edge*2, FAST_THREADS_Y);
-
-        // Locate features kernel sizes
-        const NDRange local(FAST_THREADS_X, FAST_THREADS_Y);
-        const NDRange global(blk_x * FAST_THREADS_X, blk_y * FAST_THREADS_Y);
-
-        auto lfOp = make_kernel<Buffer, KParam,
-                                Buffer, const float, const unsigned,
-                                LocalSpaceArg> (*lfKernel[device]);
-
-        lfOp(EnqueueArgs(getQueue(), global, local),
-             *in.data, in.info, *d_score, thr, edge,
-             cl::Local((FAST_THREADS_X + 6) * (FAST_THREADS_Y + 6) * sizeof(T)));
-        CL_DEBUG_FINISH(getQueue());
-
-        const int blk_nonmax_x = divup(in.info.dims[0], 64);
-        const int blk_nonmax_y = divup(in.info.dims[1], 64);
+#include <traits.hpp>
+#include <af/defines.h>
 
-        // Nonmax kernel sizes
-        const NDRange local_nonmax(FAST_THREADS_NONMAX_X, FAST_THREADS_NONMAX_Y);
-        const NDRange global_nonmax(blk_nonmax_x * FAST_THREADS_NONMAX_X, blk_nonmax_y * FAST_THREADS_NONMAX_Y);
+#include <string>
+#include <vector>
 
-        unsigned count_init = 0;
-        cl::Buffer *d_total = bufferAlloc(sizeof(unsigned));
-        getQueue().enqueueWriteBuffer(*d_total, CL_TRUE, 0, sizeof(unsigned), &count_init);
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-        //size_t *global_nonmax_dims = global_nonmax();
-        size_t blocks_sz = blk_nonmax_x * FAST_THREADS_NONMAX_X * blk_nonmax_y * FAST_THREADS_NONMAX_Y * sizeof(unsigned);
-        cl::Buffer *d_counts  = bufferAlloc(blocks_sz);
-        cl::Buffer *d_offsets = bufferAlloc(blocks_sz);
+template<typename T>
+void fast(const unsigned arc_length, unsigned *out_feat, Param &x_out,
+          Param &y_out, Param &score_out, Param in, const float thr,
+          const float feature_ratio, const unsigned edge, const bool nonmax) {
+    constexpr int FAST_THREADS_X        = 16;
+    constexpr int FAST_THREADS_Y        = 16;
+    constexpr int FAST_THREADS_NONMAX_X = 32;
+    constexpr int FAST_THREADS_NONMAX_Y = 8;
+
+    std::array<TemplateArg, 3> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(arc_length),
+        TemplateArg(nonmax),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ARC_LENGTH, arc_length),
+        DefineKeyValue(NONMAX, static_cast<unsigned>(nonmax)),
+        getTypeBuildDefinition<T>()};
+
+    auto locate =
+        common::getKernel("locate_features", {{fast_cl_src}}, targs, options);
+    auto nonMax =
+        common::getKernel("non_max_counts", {{fast_cl_src}}, targs, options);
+    auto getFeat =
+        common::getKernel("get_features", {{fast_cl_src}}, targs, options);
+
+    const unsigned max_feat =
+        ceil(in.info.dims[0] * in.info.dims[1] * feature_ratio);
+
+    // Matrix containing scores for detected features, scores are stored in the
+    // same coordinates as features, dimensions should be equal to in.
+    cl::Buffer *d_score =
+        bufferAlloc(in.info.dims[0] * in.info.dims[1] * sizeof(float));
+    getQueue().enqueueFillBuffer(
+        *d_score, 0.0F, 0, in.info.dims[0] * in.info.dims[1] * sizeof(float));
+
+    cl::Buffer *d_flags = d_score;
+    if (nonmax) {
+        d_flags =
+            bufferAlloc(in.info.dims[0] * in.info.dims[1] * sizeof(float));
+    }
 
-        auto nmOp = make_kernel<Buffer, Buffer, Buffer,
-                                Buffer, Buffer,
-                                KParam, const unsigned> (*nmKernel[device]);
-        nmOp(EnqueueArgs(getQueue(), global_nonmax, local_nonmax),
-                         *d_counts, *d_offsets, *d_total, *d_flags, *d_score, in.info, edge);
+    const int blk_x = divup(in.info.dims[0] - edge * 2, FAST_THREADS_X);
+    const int blk_y = divup(in.info.dims[1] - edge * 2, FAST_THREADS_Y);
+
+    // Locate features kernel sizes
+    const cl::NDRange local(FAST_THREADS_X, FAST_THREADS_Y);
+    const cl::NDRange global(blk_x * FAST_THREADS_X, blk_y * FAST_THREADS_Y);
+
+    locate(cl::EnqueueArgs(getQueue(), global, local), *in.data, in.info,
+           *d_score, thr, edge,
+           cl::Local((FAST_THREADS_X + 6) * (FAST_THREADS_Y + 6) * sizeof(T)));
+    CL_DEBUG_FINISH(getQueue());
+
+    const int blk_nonmax_x = divup(in.info.dims[0], 64);
+    const int blk_nonmax_y = divup(in.info.dims[1], 64);
+
+    // Nonmax kernel sizes
+    const cl::NDRange local_nonmax(FAST_THREADS_NONMAX_X,
+                                   FAST_THREADS_NONMAX_Y);
+    const cl::NDRange global_nonmax(blk_nonmax_x * FAST_THREADS_NONMAX_X,
+                                    blk_nonmax_y * FAST_THREADS_NONMAX_Y);
+
+    cl::Buffer *d_total = bufferAlloc(sizeof(unsigned));
+    getQueue().enqueueFillBuffer(*d_total, 0U, 0, sizeof(unsigned));
+
+    // size_t *global_nonmax_dims = global_nonmax();
+    size_t blocks_sz = blk_nonmax_x * FAST_THREADS_NONMAX_X * blk_nonmax_y *
+                       FAST_THREADS_NONMAX_Y * sizeof(unsigned);
+    cl::Buffer *d_counts  = bufferAlloc(blocks_sz);
+    cl::Buffer *d_offsets = bufferAlloc(blocks_sz);
+
+    nonMax(cl::EnqueueArgs(getQueue(), global_nonmax, local_nonmax), *d_counts,
+           *d_offsets, *d_total, *d_flags, *d_score, in.info, edge);
+    CL_DEBUG_FINISH(getQueue());
+
+    unsigned total;
+    getQueue().enqueueReadBuffer(*d_total, CL_TRUE, 0, sizeof(unsigned),
+                                 &total);
+    total = total < max_feat ? total : max_feat;
+
+    if (total > 0) {
+        size_t out_sz  = total * sizeof(float);
+        x_out.data     = bufferAlloc(out_sz);
+        y_out.data     = bufferAlloc(out_sz);
+        score_out.data = bufferAlloc(out_sz);
+
+        getFeat(cl::EnqueueArgs(getQueue(), global_nonmax, local_nonmax),
+                *x_out.data, *y_out.data, *score_out.data, *d_flags, *d_counts,
+                *d_offsets, in.info, total, edge);
         CL_DEBUG_FINISH(getQueue());
-
-        unsigned total;
-        getQueue().enqueueReadBuffer(*d_total, CL_TRUE, 0, sizeof(unsigned), &total);
-        total = total < max_feat ? total : max_feat;
-
-        if (total > 0) {
-            size_t out_sz = total * sizeof(float);
-            x_out.data = bufferAlloc(out_sz);
-            y_out.data = bufferAlloc(out_sz);
-            score_out.data = bufferAlloc(out_sz);
-
-            auto gfOp = make_kernel<Buffer, Buffer, Buffer,
-                                    Buffer, Buffer, Buffer,
-                                    KParam, const unsigned,
-                                    const unsigned> (*gfKernel[device]);
-            gfOp(EnqueueArgs(getQueue(), global_nonmax, local_nonmax),
-                             *x_out.data, *y_out.data, *score_out.data,
-                             *d_flags, *d_counts, *d_offsets,
-                             in.info, total, edge);
-            CL_DEBUG_FINISH(getQueue());
-        }
-
-        *out_feat = total;
-
-        x_out.info.dims[0] = total;
-        x_out.info.strides[0] = 1;
-        y_out.info.dims[0] = total;
-        y_out.info.strides[0] = 1;
-        score_out.info.dims[0] = total;
-        score_out.info.strides[0] = 1;
-
-        for (int k = 1; k < 4; k++) {
-            x_out.info.dims[k] = 1;
-            x_out.info.strides[k] = total;
-            y_out.info.dims[k] = 1;
-            y_out.info.strides[k] = total;
-            score_out.info.dims[k] = 1;
-            score_out.info.strides[k] = total;
-        }
-
-        bufferFree(d_score);
-        if (nonmax) bufferFree(d_flags);
-        bufferFree(d_total);
-        bufferFree(d_counts);
-        bufferFree(d_offsets);
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
     }
-}
 
-template<typename T, bool nonmax>
-void fast_dispatch_nonmax(const unsigned arc_length,
-                          unsigned* out_feat,
-                          Param &x_out,
-                          Param &y_out,
-                          Param &score_out,
-                          Param in,
-                          const float thr,
-                          const float feature_ratio,
-                          const unsigned edge)
-{
-    switch (arc_length) {
-    case 9:
-        fast<T,  9, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 10:
-        fast<T, 10, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 11:
-        fast<T, 11, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 12:
-        fast<T, 12, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 13:
-        fast<T, 13, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 14:
-        fast<T, 14, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 15:
-        fast<T, 15, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
-    case 16:
-        fast<T, 16, nonmax>(out_feat, x_out, y_out, score_out, in,
-                            thr, feature_ratio, edge);
-        break;
+    *out_feat = total;
+
+    x_out.info.dims[0]        = total;
+    x_out.info.strides[0]     = 1;
+    y_out.info.dims[0]        = total;
+    y_out.info.strides[0]     = 1;
+    score_out.info.dims[0]    = total;
+    score_out.info.strides[0] = 1;
+
+    for (int k = 1; k < 4; k++) {
+        x_out.info.dims[k]        = 1;
+        x_out.info.strides[k]     = total;
+        y_out.info.dims[k]        = 1;
+        y_out.info.strides[k]     = total;
+        score_out.info.dims[k]    = 1;
+        score_out.info.strides[k] = total;
     }
-}
 
-template<typename T>
-void fast_dispatch(const unsigned arc_length, const bool nonmax,
-                   unsigned* out_feat,
-                   Param &x_out,
-                   Param &y_out,
-                   Param &score_out,
-                   Param in,
-                   const float thr,
-                   const float feature_ratio,
-                   const unsigned edge)
-{
-    if (!nonmax) {
-        fast_dispatch_nonmax<T, 0>(arc_length, out_feat, x_out, y_out, score_out, in,
-                                   thr, feature_ratio, edge);
-    } else {
-        fast_dispatch_nonmax<T, 1>(arc_length, out_feat, x_out, y_out, score_out, in,
-                                   thr, feature_ratio, edge);
-    }
+    bufferFree(d_score);
+    if (nonmax) bufferFree(d_flags);
+    bufferFree(d_total);
+    bufferFree(d_counts);
+    bufferFree(d_offsets);
 }
 
-} //namespace kernel
-
-} //namespace opencl
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
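Review note on the launch geometry in `fast()` above: the two kernel families use different grids. `locate_features` uses 16x16 work-groups over the image minus the border, while `non_max_counts` and `get_features` use 32x8 work-groups with one group per 64x64 tile. The tile size is consistent with the `* 2` and `* 8` factors in the `xid`/`yid` computation in fast.cl. The sketch below reproduces only the arithmetic. The `divup` here is a stand-in written for this note, assumed to match ArrayFire's rounding-up division helper. `LaunchDims`, `locate_dims`, and `nonmax_dims` are hypothetical names, and plain `int` replaces the mixed `dim_t`/`unsigned` types of the real code.

```cpp
#include <cassert>

// Stand-in for a rounding-up integer division helper (assumption: same
// semantics as the divup used in fast.hpp).
constexpr int divup(int a, int b) { return (a + b - 1) / b; }

// Global NDRange sizes in work-items. Hypothetical type for illustration.
struct LaunchDims {
    int global_x;
    int global_y;
};

// Mirrors the `blk_x`/`blk_y` computation for locate_features:
// 16x16 work-groups covering the image with `edge` pixels trimmed per side.
LaunchDims locate_dims(int dim0, int dim1, int edge) {
    constexpr int threads_x = 16;
    constexpr int threads_y = 16;
    assert(dim0 > 2 * edge && dim1 > 2 * edge);
    const int blk_x = divup(dim0 - edge * 2, threads_x);
    const int blk_y = divup(dim1 - edge * 2, threads_y);
    return {blk_x * threads_x, blk_y * threads_y};
}

// Mirrors the `blk_nonmax_x`/`blk_nonmax_y` computation:
// one 32x8 work-group per 64x64 tile of the full image.
LaunchDims nonmax_dims(int dim0, int dim1) {
    constexpr int threads_x = 32;
    constexpr int threads_y = 8;
    constexpr int tile      = 64;
    return {divup(dim0, tile) * threads_x, divup(dim1, tile) * threads_y};
}
```

For a 640x480 input with `edge = 3`, this gives a 640x480 global range for `locate_features` (40x30 groups) and a 320x64 global range for the non-max kernels (10x8 groups).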
diff --git a/src/backend/opencl/kernel/fftconvolve.hpp b/src/backend/opencl/kernel/fftconvolve.hpp
index 347c4f1864..ab6fc944e7 100644
--- a/src/backend/opencl/kernel/fftconvolve.hpp
+++ b/src/backend/opencl/kernel/fftconvolve.hpp
@@ -7,299 +7,223 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <backend.hpp>
-#include <program.hpp>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <kernel_headers/fftconvolve_pack.hpp>
 #include <kernel_headers/fftconvolve_multiply.hpp>
+#include <kernel_headers/fftconvolve_pack.hpp>
 #include <kernel_headers/fftconvolve_reorder.hpp>
-#include <memory.hpp>
-#include <map>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS = 256;
-
-void calcParamSizes(Param& sig_tmp,
-                    Param& filter_tmp,
-                    Param& packed,
-                    Param& sig,
-                    Param& filter,
-                    const int baseDim,
-                    ConvolveBatchKind kind)
-{
+#include <traits.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int THREADS = 256;
+
+void calcParamSizes(Param& sig_tmp, Param& filter_tmp, Param& packed,
+                    Param& sig, Param& filter, const int rank,
+                    AF_BATCH_KIND kind) {
     sig_tmp.info.dims[0] = filter_tmp.info.dims[0] = packed.info.dims[0];
     sig_tmp.info.strides[0] = filter_tmp.info.strides[0] = 1;
 
     for (int k = 1; k < 4; k++) {
-        if (k < baseDim) {
+        if (k < rank) {
             sig_tmp.info.dims[k]    = packed.info.dims[k];
             filter_tmp.info.dims[k] = packed.info.dims[k];
-        }
-        else {
+        } else {
             sig_tmp.info.dims[k]    = sig.info.dims[k];
             filter_tmp.info.dims[k] = filter.info.dims[k];
         }
 
-        sig_tmp.info.strides[k]    = sig_tmp.info.strides[k - 1] * sig_tmp.info.dims[k - 1];
-        filter_tmp.info.strides[k] = filter_tmp.info.strides[k - 1] * filter_tmp.info.dims[k - 1];
+        sig_tmp.info.strides[k] =
+            sig_tmp.info.strides[k - 1] * sig_tmp.info.dims[k - 1];
+        filter_tmp.info.strides[k] =
+            filter_tmp.info.strides[k - 1] * filter_tmp.info.dims[k - 1];
     }
 
     // Calculate memory offsets for packed signal and filter
-    sig_tmp.data = packed.data;
+    sig_tmp.data    = packed.data;
     filter_tmp.data = packed.data;
 
-    if (kind == ONE2MANY) {
+    if (kind == AF_BATCH_RHS) {
         filter_tmp.info.offset = 0;
-        sig_tmp.info.offset = filter_tmp.info.strides[3] * filter_tmp.info.dims[3] * 2;
-    }
-    else {
+        sig_tmp.info.offset =
+            filter_tmp.info.strides[3] * filter_tmp.info.dims[3] * 2;
+    } else {
         sig_tmp.info.offset = 0;
-        filter_tmp.info.offset = sig_tmp.info.strides[3] * sig_tmp.info.dims[3] * 2;
+        filter_tmp.info.offset =
+            sig_tmp.info.strides[3] * sig_tmp.info.dims[3] * 2;
     }
 }
 
-template<typename convT, typename T, bool isDouble, typename printT>
-void packDataHelper(Param packed,
-                    Param sig,
-                    Param filter,
-                    const int baseDim,
-                    ConvolveBatchKind kind)
-{
-    try {
-        static std::once_flag     compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  fftconvolveProgs;
-        static std::map<int, Kernel*>   pdKernel;
-        static std::map<int, Kernel*>   paKernel;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
-
-                if ((af_dtype) dtype_traits<convT>::af_type == c32) {
-                    options << " -D CONVT=float";
-                }
-                else if ((af_dtype) dtype_traits<convT>::af_type == c64 && isDouble) {
-                    options << " -D CONVT=double"
-                            << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, fftconvolve_pack_cl, fftconvolve_pack_cl_len, options.str());
-                fftconvolveProgs[device] = new Program(prog);
-
-                pdKernel[device] = new Kernel(*fftconvolveProgs[device], "pack_data");
-                paKernel[device] = new Kernel(*fftconvolveProgs[device], "pad_array");
-            });
-
-        Param sig_tmp, filter_tmp;
-        calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, baseDim, kind);
-
-        int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
-        int filter_packed_elem = filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
-
-        // Number of packed complex elements in dimension 0
-        int sig_half_d0 = divup(sig.info.dims[0], 2);
-        int sig_half_d0_odd = sig.info.dims[0] % 2;
-
-        int blocks = divup(sig_packed_elem, THREADS);
-
-        // Locate features kernel sizes
-        NDRange local(THREADS);
-        NDRange global(blocks * THREADS);
-
-        // Pack signal in a complex matrix where first dimension is half the input
-        // (allows faster FFT computation) and pad array to a power of 2 with 0s
-        auto pdOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam,
-                                const int, const int> (*pdKernel[device]);
-
-        pdOp(EnqueueArgs(getQueue(), global, local),
-             *sig_tmp.data, sig_tmp.info, *sig.data, sig.info,
-             sig_half_d0, sig_half_d0_odd);
-        CL_DEBUG_FINISH(getQueue());
-
-        blocks = divup(filter_packed_elem, THREADS);
-        global = NDRange(blocks * THREADS);
-
-        // Pad filter array with 0s
-        auto paOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam> (*paKernel[device]);
-
-        paOp(EnqueueArgs(getQueue(), global, local),
-             *filter_tmp.data, filter_tmp.info,
-             *filter.data, filter.info);
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+template<typename convT, typename T>
+void packDataHelper(Param packed, Param sig, Param filter, const int rank,
+                    AF_BATCH_KIND kind) {
+    constexpr bool IsTypeDouble = std::is_same<T, double>::value;
+    constexpr auto ctDType =
+        static_cast<af_dtype>(dtype_traits<convT>::af_type);
+
+    std::array<TemplateArg, 3> targs = {
+        TemplateTypename<T>(),
+        TemplateTypename<convT>(),
+        TemplateArg(IsTypeDouble),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T, convT>()};
+    if (ctDType == c32) {
+        options.emplace_back(DefineKeyValue(CONVT, "float"));
+    } else if (ctDType == c64 && IsTypeDouble) {
+        options.emplace_back(DefineKeyValue(CONVT, "double"));
     }
+
+    auto packData = common::getKernel("pack_data", {{fftconvolve_pack_cl_src}},
+                                      targs, options);
+    auto padArray = common::getKernel("pad_array", {{fftconvolve_pack_cl_src}},
+                                      targs, options);
+
+    Param sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
+    int filter_packed_elem =
+        filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
+
+    // Number of packed complex elements in dimension 0
+    int sig_half_d0     = divup(sig.info.dims[0], 2);
+    int sig_half_d0_odd = sig.info.dims[0] % 2;
+
+    int blocks = divup(sig_packed_elem, THREADS);
+
+    // Pack data / pad array kernel sizes
+    cl::NDRange local(THREADS);
+    cl::NDRange global(blocks * THREADS);
+
+    // Pack signal in a complex matrix where first dimension is half the input
+    // (allows faster FFT computation) and pad array to a power of 2 with 0s
+    packData(cl::EnqueueArgs(getQueue(), global, local), *sig_tmp.data,
+             sig_tmp.info, *sig.data, sig.info, sig_half_d0, sig_half_d0_odd);
+    CL_DEBUG_FINISH(getQueue());
+
+    blocks = divup(filter_packed_elem, THREADS);
+    global = cl::NDRange(blocks * THREADS);
+
+    // Pad filter array with 0s
+    padArray(cl::EnqueueArgs(getQueue(), global, local), *filter_tmp.data,
+             filter_tmp.info, *filter.data, filter.info);
+    CL_DEBUG_FINISH(getQueue());
 }
 
-template<typename convT, typename T, bool isDouble, typename printT>
-void complexMultiplyHelper(Param packed,
-                           Param sig,
-                           Param filter,
-                           const int baseDim,
-                           ConvolveBatchKind kind)
-{
-    try {
-        static std::once_flag     compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> fftconvolveProgs;
-        static std::map<int, Kernel*>  cmKernel;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D ONE2ONE=" << (int)ONE2ONE
-                        << " -D MANY2ONE=" << (int)MANY2ONE
-                        << " -D ONE2MANY=" << (int)ONE2MANY
-                        << " -D MANY2MANY=" << (int)MANY2MANY;
-
-                if ((af_dtype) dtype_traits<convT>::af_type == c32) {
-                    options << " -D CONVT=float";
-                }
-                else if ((af_dtype) dtype_traits<convT>::af_type == c64 && isDouble) {
-                    options << " -D CONVT=double"
-                            << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog,
-                             fftconvolve_multiply_cl,
-                             fftconvolve_multiply_cl_len,
-                             options.str());
-                fftconvolveProgs[device] = new Program(prog);
-
-                cmKernel[device] = new Kernel(*fftconvolveProgs[device], "complex_multiply");
-            });
-
-        Param sig_tmp, filter_tmp;
-        calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, baseDim, kind);
-
-        int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
-        int filter_packed_elem = filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
-        int mul_elem = (sig_packed_elem < filter_packed_elem) ?
-                            filter_packed_elem : sig_packed_elem;
-
-        int blocks = divup(mul_elem, THREADS);
-
-        NDRange local(THREADS);
-        NDRange global(blocks * THREADS);
-
-        // Multiply filter and signal FFT arrays
-        auto cmOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam,
-                                Buffer, KParam,
-                                const int, const int> (*cmKernel[device]);
-
-        cmOp(EnqueueArgs(getQueue(), global, local),
-             *packed.data, packed.info,
-             *sig_tmp.data, sig_tmp.info,
-             *filter_tmp.data, filter_tmp.info,
-             mul_elem, (int)kind);
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+template<typename convT, typename T>
+void complexMultiplyHelper(Param packed, Param sig, Param filter,
+                           const int rank, AF_BATCH_KIND kind) {
+    constexpr bool IsTypeDouble = std::is_same<T, double>::value;
+    constexpr auto ctDType =
+        static_cast<af_dtype>(dtype_traits<convT>::af_type);
+
+    std::array<TemplateArg, 3> targs = {
+        TemplateTypename<T>(),
+        TemplateTypename<convT>(),
+        TemplateArg(IsTypeDouble),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(AF_BATCH_NONE, static_cast<int>(AF_BATCH_NONE)),
+        DefineKeyValue(AF_BATCH_LHS, static_cast<int>(AF_BATCH_LHS)),
+        DefineKeyValue(AF_BATCH_RHS, static_cast<int>(AF_BATCH_RHS)),
+        DefineKeyValue(AF_BATCH_SAME, static_cast<int>(AF_BATCH_SAME)),
+        getTypeBuildDefinition<T, convT>()};
+    if (ctDType == c32) {
+        options.emplace_back(DefineKeyValue(CONVT, "float"));
+    } else if (ctDType == c64 && IsTypeDouble) {
+        options.emplace_back(DefineKeyValue(CONVT, "double"));
     }
+
+    auto cplxMul = common::getKernel(
+        "complex_multiply", {{fftconvolve_multiply_cl_src}}, targs, options);
+
+    Param sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    int sig_packed_elem = sig_tmp.info.strides[3] * sig_tmp.info.dims[3];
+    int filter_packed_elem =
+        filter_tmp.info.strides[3] * filter_tmp.info.dims[3];
+    int mul_elem = (sig_packed_elem < filter_packed_elem) ? filter_packed_elem
+                                                          : sig_packed_elem;
+    int blocks   = divup(mul_elem, THREADS);
+
+    cl::NDRange local(THREADS);
+    cl::NDRange global(blocks * THREADS);
+
+    // Multiply filter and signal FFT arrays
+    cplxMul(cl::EnqueueArgs(getQueue(), global, local), *packed.data,
+            packed.info, *sig_tmp.data, sig_tmp.info, *filter_tmp.data,
+            filter_tmp.info, mul_elem, static_cast<int>(kind));
+    CL_DEBUG_FINISH(getQueue());
 }
 
-template<typename T, typename convT, bool isDouble, bool roundOut, bool expand, typename printT>
-void reorderOutputHelper(Param out,
-                         Param packed,
-                         Param sig,
-                         Param filter,
-                         const int baseDim,
-                         ConvolveBatchKind kind)
-{
-    try {
-        static std::once_flag     compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> fftconvolveProgs;
-        static std::map<int, Kernel*>  roKernel;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D ROUND_OUT=" << (int)roundOut
-                        << " -D EXPAND=" << (int)expand;
-
-                if ((af_dtype) dtype_traits<convT>::af_type == c32) {
-                    options << " -D CONVT=float";
-                }
-                else if ((af_dtype) dtype_traits<convT>::af_type == c64 && isDouble) {
-                    options << " -D CONVT=double"
-                            << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog,
-                             fftconvolve_reorder_cl,
-                             fftconvolve_reorder_cl_len,
-                             options.str());
-                fftconvolveProgs[device] = new Program(prog);
-
-                roKernel[device] = new Kernel(*fftconvolveProgs[device], "reorder_output");
-            });
-
-        Param sig_tmp, filter_tmp;
-        calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, baseDim, kind);
-
-        // Number of packed complex elements in dimension 0
-        int sig_half_d0 = divup(sig.info.dims[0], 2);
-
-        int blocks = divup(out.info.strides[3] * out.info.dims[3], THREADS);
-
-        NDRange local(THREADS);
-        NDRange global(blocks * THREADS);
-
-        auto roOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam,
-                                KParam, const int,
-                                const int> (*roKernel[device]);
-
-        if (kind == ONE2MANY) {
-            roOp(EnqueueArgs(getQueue(), global, local),
-                 *out.data, out.info,
-                 *filter_tmp.data, filter_tmp.info,
-                 filter.info, sig_half_d0, baseDim);
-        }
-        else {
-            roOp(EnqueueArgs(getQueue(), global, local),
-                 *out.data, out.info,
-                 *sig_tmp.data, sig_tmp.info,
-                 filter.info, sig_half_d0, baseDim);
-        }
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+template<typename T, typename convT>
+void reorderOutputHelper(Param out, Param packed, Param sig, Param filter,
+                         const int rank, AF_BATCH_KIND kind, bool expand) {
+    constexpr bool IsTypeDouble = std::is_same<T, double>::value;
+    constexpr auto ctDType =
+        static_cast<af_dtype>(dtype_traits<convT>::af_type);
+    constexpr bool RoundResult = std::is_integral<T>::value;
+
+    std::array<TemplateArg, 5> targs = {
+        TemplateTypename<T>(),     TemplateTypename<convT>(),
+        TemplateArg(IsTypeDouble), TemplateArg(RoundResult),
+        TemplateArg(expand),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ROUND_OUT, static_cast<int>(RoundResult)),
+        DefineKeyValue(EXPAND, static_cast<int>(expand)),
+        getTypeBuildDefinition<T, convT>()};
+    if (ctDType == c32) {
+        options.emplace_back(DefineKeyValue(CONVT, "float"));
+    } else if (ctDType == c64 && IsTypeDouble) {
+        options.emplace_back(DefineKeyValue(CONVT, "double"));
     }
-}
 
-} // namespace kernel
+    auto reorder = common::getKernel(
+        "reorder_output", {{fftconvolve_reorder_cl_src}}, targs, options);
+
+    int fftScale = 1;
+
+    // Calculate the scale by which to divide clFFT results
+    for (int k = 0; k < rank; k++) fftScale *= packed.info.dims[k];
+
+    Param sig_tmp, filter_tmp;
+    calcParamSizes(sig_tmp, filter_tmp, packed, sig, filter, rank, kind);
+
+    // Number of packed complex elements in dimension 0
+    int sig_half_d0 = divup(sig.info.dims[0], 2);
 
-} // namespace opencl
+    int blocks = divup(out.info.strides[3] * out.info.dims[3], THREADS);
+
+    cl::NDRange local(THREADS);
+    cl::NDRange global(blocks * THREADS);
+
+    if (kind == AF_BATCH_RHS) {
+        reorder(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *filter_tmp.data, filter_tmp.info, filter.info, sig_half_d0,
+                rank, fftScale);
+    } else {
+        reorder(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *sig_tmp.data, sig_tmp.info, filter.info, sig_half_d0, rank,
+                fftScale);
+    }
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/fftconvolve_multiply.cl b/src/backend/opencl/kernel/fftconvolve_multiply.cl
index 9729a7267e..e0bd2ea6d9 100644
--- a/src/backend/opencl/kernel/fftconvolve_multiply.cl
+++ b/src/backend/opencl/kernel/fftconvolve_multiply.cl
@@ -7,23 +7,15 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void complex_multiply(
-    __global CONVT       *d_out,
-    KParam                oInfo,
-    __global const CONVT *d_in1,
-    KParam                i1Info,
-    __global const CONVT *d_in2,
-    KParam                i2Info,
-    const int        nelem,
-    const int             kind)
-{
+kernel void complex_multiply(global CONVT *d_out, KParam oInfo,
+                             global const CONVT *d_in1, KParam i1Info,
+                             global const CONVT *d_in2, KParam i2Info,
+                             const int nelem, const int kind) {
     const int t = get_global_id(0);
 
-    if (t >= nelem)
-        return;
+    if (t >= nelem) return;
 
-    if (kind == ONE2ONE || kind == MANY2MANY) {
+    if (kind == AF_BATCH_NONE || kind == AF_BATCH_SAME) {
         // Complex multiply each signal to equivalent filter
         const int ridx = t * 2;
         const int iidx = t * 2 + 1;
@@ -33,13 +25,9 @@ void complex_multiply(
         CONVT c = d_in2[i2Info.offset + ridx];
         CONVT d = d_in2[i2Info.offset + iidx];
 
-        CONVT ac = a*c;
-        CONVT bd = b*d;
-
-        d_out[oInfo.offset + ridx] = ac - bd;
-        d_out[oInfo.offset + iidx] = (a+b) * (c+d) - ac - bd;
-    }
-    else if (kind == MANY2ONE) {
+        d_out[oInfo.offset + ridx] = a * c - b * d;
+        d_out[oInfo.offset + iidx] = a * d + b * c;
+    } else if (kind == AF_BATCH_LHS) {
         // Complex multiply all signals to filter
         const int ridx1 = t * 2;
         const int iidx1 = t * 2 + 1;
@@ -54,13 +42,9 @@ void complex_multiply(
         CONVT c = d_in2[i2Info.offset + ridx2];
         CONVT d = d_in2[i2Info.offset + iidx2];
 
-        CONVT ac = a*c;
-        CONVT bd = b*d;
-
-        d_out[oInfo.offset + ridx1] = ac - bd;
-        d_out[oInfo.offset + iidx1] = (a+b) * (c+d) - ac - bd;
-    }
-    else if (kind == ONE2MANY) {
+        d_out[oInfo.offset + ridx1] = a * c - b * d;
+        d_out[oInfo.offset + iidx1] = a * d + b * c;
+    } else if (kind == AF_BATCH_RHS) {
         // Complex multiply signal to all filters
         const int ridx2 = t * 2;
         const int iidx2 = t * 2 + 1;
@@ -75,10 +59,7 @@ void complex_multiply(
         CONVT c = d_in2[i2Info.offset + ridx2];
         CONVT d = d_in2[i2Info.offset + iidx2];
 
-        CONVT ac = a*c;
-        CONVT bd = b*d;
-
-        d_out[oInfo.offset + ridx2] = ac - bd;
-        d_out[oInfo.offset + iidx2] = (a+b) * (c+d) - ac - bd;
+        d_out[oInfo.offset + ridx2] = a * c - b * d;
+        d_out[oInfo.offset + iidx2] = a * d + b * c;
     }
 }
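Review note on the hunk above: the removed lines computed the imaginary part with three multiplications, `(a+b)*(c+d) - ac - bd`, while the added lines use the direct form `a*d + b*c`. The two are algebraically identical; the diff does not state why the form was changed. The sketch below is a host-side C++ illustration, not ArrayFire code: the function names are hypothetical, and it assumes the same interleaved (real, imaginary) layout that the kernel indexes with `t * 2` and `t * 2 + 1`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Direct four-multiply complex product over interleaved (re, im) pairs,
// mirroring the expressions on the added lines of the kernel.
std::vector<float> multiplyDirect(const std::vector<float>& x,
                                  const std::vector<float>& y) {
    assert(x.size() == y.size() && x.size() % 2 == 0);
    std::vector<float> out(x.size());
    for (std::size_t t = 0; t < x.size() / 2; ++t) {
        const float a = x[2 * t], b = x[2 * t + 1];
        const float c = y[2 * t], d = y[2 * t + 1];
        out[2 * t]     = a * c - b * d;  // real part
        out[2 * t + 1] = a * d + b * c;  // imaginary part
    }
    return out;
}

// Three-multiply variant, mirroring the expressions on the removed lines.
std::vector<float> multiplyThreeMul(const std::vector<float>& x,
                                    const std::vector<float>& y) {
    assert(x.size() == y.size() && x.size() % 2 == 0);
    std::vector<float> out(x.size());
    for (std::size_t t = 0; t < x.size() / 2; ++t) {
        const float a = x[2 * t], b = x[2 * t + 1];
        const float c = y[2 * t], d = y[2 * t + 1];
        const float ac = a * c;
        const float bd = b * d;
        out[2 * t]     = ac - bd;                      // real part
        out[2 * t + 1] = (a + b) * (c + d) - ac - bd;  // imaginary part
    }
    return out;
}
```

For small integer-valued inputs both functions return the same values exactly. With general floating-point inputs the three-multiply form subtracts `ac` and `bd` from a larger intermediate, so the two results can differ in the last bits.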
diff --git a/src/backend/opencl/kernel/fftconvolve_pack.cl b/src/backend/opencl/kernel/fftconvolve_pack.cl
index 981f4b8709..cc72bc8495 100644
--- a/src/backend/opencl/kernel/fftconvolve_pack.cl
+++ b/src/backend/opencl/kernel/fftconvolve_pack.cl
@@ -7,21 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void pack_data(
-    __global CONVT   *d_out,
-    KParam            oInfo,
-    __global const T *d_in,
-    KParam            iInfo,
-    const int    di0_half,
-    const int         odd_di0)
-{
+kernel void pack_data(global CONVT *d_out, KParam oInfo,
+                      global const T *d_in, KParam iInfo,
+                      const int di0_half, const int odd_di0) {
     const int t = get_global_id(0);
 
     const int tMax = oInfo.strides[3] * oInfo.dims[3];
 
-    if (t >= tMax)
-        return;
+    if (t >= tMax) return;
 
     const int do0 = oInfo.dims[0];
     const int do1 = oInfo.dims[1];
@@ -54,36 +47,30 @@ void pack_data(
 
     // Treating complex output array as real-only array,
     // thus, multiply strides by 2
-    const int oidx1 = oInfo.offset + to3*so3*2 + to2*so2*2 + to1*so1*2 + to0*2;
+    const int oidx1 =
+        oInfo.offset + to3 * so3 * 2 + to2 * so2 * 2 + to1 * so1 * 2 + to0 * 2;
     const int oidx2 = oidx1 + 1;
 
     if (to0 < di0_half && to1 < di1 && to2 < di2) {
         d_out[oidx1] = (CONVT)d_in[iidx1];
-        if (ti0 == di0_half-1 && odd_di0 == 1)
+        if (ti0 == di0_half - 1 && odd_di0 == 1)
             d_out[oidx2] = (CONVT)0;
         else
             d_out[oidx2] = (CONVT)d_in[iidx2];
-    }
-    else {
+    } else {
         // Pad remaining elements with 0s
         d_out[oidx1] = (CONVT)0;
         d_out[oidx2] = (CONVT)0;
     }
 }
 
-__kernel
-void pad_array(
-    __global CONVT   *d_out,
-    KParam            oInfo,
-    __global const T *d_in,
-    KParam            iInfo)
-{
+kernel void pad_array(global CONVT *d_out, KParam oInfo,
+                      global const T *d_in, KParam iInfo) {
     const int t = get_global_id(0);
 
     const int tMax = oInfo.strides[3] * oInfo.dims[3];
 
-    if (t >= tMax)
-        return;
+    if (t >= tMax) return;
 
     const int do0 = oInfo.dims[0];
     const int do1 = oInfo.dims[1];
@@ -114,16 +101,15 @@ void pad_array(
 
     const int iidx = iInfo.offset + ti3 + ti2 + ti1 + ti0;
 
-    const int oidx = oInfo.offset + t*2;
+    const int oidx = oInfo.offset + t * 2;
 
     if (to0 < di0 && to1 < di1 && to2 < di2 && to3 < di3) {
         // Copy input elements to real elements, set imaginary elements to 0
-        d_out[oidx]   = (CONVT)d_in[iidx];
-        d_out[oidx+1] = (CONVT)0;
-    }
-    else {
+        d_out[oidx]     = (CONVT)d_in[iidx];
+        d_out[oidx + 1] = (CONVT)0;
+    } else {
         // Pad remaining of the matrix to 0s
-        d_out[oidx]   = (CONVT)0;
-        d_out[oidx+1] = (CONVT)0;
+        d_out[oidx]     = (CONVT)0;
+        d_out[oidx + 1] = (CONVT)0;
     }
 }
diff --git a/src/backend/opencl/kernel/fftconvolve_reorder.cl b/src/backend/opencl/kernel/fftconvolve_reorder.cl
index 6f8269a82c..f0064392f0 100644
--- a/src/backend/opencl/kernel/fftconvolve_reorder.cl
+++ b/src/backend/opencl/kernel/fftconvolve_reorder.cl
@@ -7,22 +7,15 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void reorder_output(
-    __global T           *d_out,
-    KParam                oInfo,
-    __global const CONVT *d_in,
-    KParam                iInfo,
-    KParam                fInfo,
-    const int        half_di0,
-    const int             baseDim)
-{
+kernel void reorder_output(global T *d_out, KParam oInfo,
+                           global const CONVT *d_in, KParam iInfo,
+                           KParam fInfo, const int half_di0,
+                           const int baseDim, const int fftScale) {
     const int t = get_global_id(0);
 
     const int tMax = oInfo.strides[3] * oInfo.dims[3];
 
-    if (t >= tMax)
-        return;
+    if (t >= tMax) return;
 
     const int do0 = oInfo.dims[0];
     const int do1 = oInfo.dims[1];
@@ -47,7 +40,7 @@ void reorder_output(
     const int to2 = (t / so2) % do2;
     const int to3 = (t / so3);
 
-    int oidx = to3*so3 + to2*so2 + to1*so1 + to0;
+    int oidx = to3 * so3 + to2 * so2 + to1 * so1 + to0;
 
     int ti0, ti1, ti2, ti3;
 #if EXPAND == 1
@@ -56,9 +49,9 @@ void reorder_output(
     ti2 = to2 * si2;
     ti3 = to3 * si3;
 #else
-    ti0 = to0 + fInfo.dims[0]/2;
-    ti1 = (to1 + (baseDim > 1)*(fInfo.dims[1]/2)) * si1;
-    ti2 = (to2 + (baseDim > 2)*(fInfo.dims[2]/2)) * si2;
+    ti0 = to0 + fInfo.dims[0] / 2;
+    ti1 = (to1 + (baseDim > 1) * (fInfo.dims[1] / 2)) * si1;
+    ti2 = (to2 + (baseDim > 2) * (fInfo.dims[2] / 2)) * si2;
     ti3 = to3 * si3;
 #endif
 
@@ -68,28 +61,27 @@ void reorder_output(
         // Copy top elements
         int iidx = iInfo.offset + ti3 + ti2 + ti1 + ti0 * 2;
 #if ROUND_OUT == 1
-            d_out[oidx] = (T)round(d_in[iidx]);
+        d_out[oidx] = (T)round(d_in[iidx] / fftScale);
 #else
-            d_out[oidx] = (T)(d_in[iidx]);
+        d_out[oidx] = (T)(d_in[iidx] / fftScale);
 #endif
-    }
-    else if (ti0 < half_di0 + fInfo.dims[0] - 1) {
+    } else if (ti0 < half_di0 + fInfo.dims[0] - 1) {
         // Add central elements
         int iidx1 = iInfo.offset + ti3 + ti2 + ti1 + ti0 * 2;
         int iidx2 = iInfo.offset + ti3 + ti2 + ti1 + (ti0 - half_di0) * 2 + 1;
 #if ROUND_OUT == 1
-            d_out[oidx] = (T)round((d_in[iidx1] + d_in[iidx2]));
+        d_out[oidx] = (T)round((d_in[iidx1] + d_in[iidx2]) / fftScale);
 #else
-            d_out[oidx] = (T)((d_in[iidx1] + d_in[iidx2]));
+        d_out[oidx] = (T)((d_in[iidx1] + d_in[iidx2]) / fftScale);
 #endif
-    }
-    else {
+    } else {
         // Copy bottom elements
-        const int iidx = iInfo.offset + ti3 + ti2 + ti1 + (ti0 - half_di0) * 2 + 1;
+        const int iidx =
+            iInfo.offset + ti3 + ti2 + ti1 + (ti0 - half_di0) * 2 + 1;
 #if ROUND_OUT == 1
-            d_out[oidx] = (T)round(d_in[iidx]);
+        d_out[oidx] = (T)round(d_in[iidx] / fftScale);
 #else
-            d_out[oidx] = (T)(d_in[iidx]);
+        d_out[oidx] = (T)(d_in[iidx] / fftScale);
 #endif
     }
 }
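Review note on the hunk above: `reorder_output` now divides every value it reads by `fftScale`, which the host code computes as the product of the first `rank` dimensions of the packed array ("the scale by which to divide clFFT results"). The sketch below is a host-side C++ illustration of that arithmetic only. The helper names are hypothetical, and the `roundOut` flag stands in for the `ROUND_OUT` build option that the host sets for integral output types.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Product of the first `rank` dimensions, as in the host-side loop
// `for (int k = 0; k < rank; k++) fftScale *= packed.info.dims[k];`.
int computeFftScale(const std::array<int, 4>& packedDims, int rank) {
    assert(rank >= 0 && rank <= 4);
    int scale = 1;
    for (int k = 0; k < rank; ++k) scale *= packedDims[k];
    return scale;
}

// Scale one raw transform value and convert it to the output type.
// When roundOut is true the value is rounded to nearest before the
// conversion; otherwise the conversion truncates for integral T.
template<typename T>
T scaleOutput(double raw, int fftScale, bool roundOut) {
    const double v = raw / fftScale;
    return static_cast<T>(roundOut ? std::round(v) : v);
}
```

Rounding matters for integral outputs: a scaled value of 2.96875 becomes 3 with rounding but 2 with a plain cast.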
diff --git a/src/backend/opencl/kernel/flood_fill.cl b/src/backend/opencl/kernel/flood_fill.cl
new file mode 100644
index 0000000000..58d03b52e8
--- /dev/null
+++ b/src/backend/opencl/kernel/flood_fill.cl
@@ -0,0 +1,136 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/// Output array is set to the following values during the progression
+/// of the algorithm.
+///
+/// 0 - not processed
+/// 1 - not valid
+/// 2 - valid (candidate for neighborhood walk, pushed onto the queue)
+///
+/// Once the algorithm has finished, every valid pixel in the output is
+/// set to \p newValue and all other pixels are reset to zero.
+
+#if defined(INIT_SEEDS)
+kernel void init_seeds(global T *out, KParam oInfo, global const uint *seedsx,
+                       KParam sxInfo, global const uint *seedsy,
+                       KParam syInfo) {
+    uint tid = get_global_id(0);
+    if (tid < sxInfo.dims[0]) {
+        uint x = seedsx[tid + sxInfo.offset];
+        uint y = seedsy[tid + syInfo.offset];
+        out[(x * oInfo.strides[0] + y * oInfo.strides[1])] = VALID;
+    }
+}
+#endif
+
+#if defined(FLOOD_FILL_STEP)
+
+int barrierOR(local int *predicates) {
+    int tid = get_local_id(0) + get_local_size(0) * get_local_id(1);
+    barrier(CLK_LOCAL_MEM_FENCE);
+    for (int nt = GROUP_SIZE / 2; nt > 0; nt >>= 1) {
+        if (tid < nt) {
+            predicates[tid] = (predicates[tid] | predicates[tid + nt]);
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+    int retVal = predicates[0];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    return retVal;
+}
+
+kernel void flood_step(global T *out, KParam oInfo, global const T *img,
+                       KParam iInfo, T lowValue, T highValue,
+                       global volatile int *notFinished) {
+    local T lmem[LMEM_HEIGHT][LMEM_WIDTH];
+    local int predicates[GROUP_SIZE];
+
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+    const int gx = get_global_id(0);
+    const int gy = get_global_id(1);
+    const int d0 = oInfo.dims[0];
+    const int d1 = oInfo.dims[1];
+    const int s0 = oInfo.strides[0];
+    const int s1 = oInfo.strides[1];
+
+    for (int b = ly, gy2 = gy; b < LMEM_HEIGHT;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        for (int a = lx, gx2 = gx; a < LMEM_WIDTH;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            int x      = gx2 - RADIUS;
+            int y      = gy2 - RADIUS;
+            bool inROI = (x >= 0 && x < d0 && y >= 0 && y < d1);
+            lmem[b][a] = (inROI ? out[x * s0 + y * s1] : INVALID);
+        }
+    }
+    int i = lx + RADIUS;
+    int j = ly + RADIUS;
+
+    T tImgVal =
+        img[(clamp(gx, 0, (int)(iInfo.dims[0] - 1)) * iInfo.strides[0] +
+             clamp(gy, 0, (int)(iInfo.dims[1] - 1)) * iInfo.strides[1]) +
+             iInfo.offset];
+    const int isPxBtwnThresholds =
+        (tImgVal >= lowValue && tImgVal <= highValue);
+
+    int tid = lx + get_local_size(0) * ly;
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    T origOutVal     = lmem[j][i];
+    bool isBorderPxl = (lx == 0 || ly == 0 || lx == (get_local_size(0) - 1) ||
+                        ly == (get_local_size(1) - 1));
+
+    for (bool blkChngd = true; blkChngd; blkChngd = barrierOR(predicates)) {
+        int validNeighbors = 0;
+        for (int no_j = -RADIUS; no_j <= RADIUS; ++no_j) {
+            for (int no_i = -RADIUS; no_i <= RADIUS; ++no_i) {
+                T currVal = lmem[j + no_j][i + no_i];
+                validNeighbors += (currVal == VALID);
+            }
+        }
+        bool outChanged = (lmem[j][i] == ZERO && (validNeighbors > 0));
+        predicates[tid] = outChanged;
+        barrier(CLK_LOCAL_MEM_FENCE);
+        if (outChanged) { lmem[j][i] = (T)(isPxBtwnThresholds + INVALID); }
+    }
+
+    T newOutVal = lmem[j][i];
+
+    bool brdrChngd =
+        (isBorderPxl && newOutVal != origOutVal && newOutVal == VALID);
+    predicates[tid] = brdrChngd;
+
+    brdrChngd = barrierOR(predicates) > 0;
+
+    if (gx < d0 && gy < d1) {
+        if (brdrChngd && lx == 0 && ly == 0) {
+            // At least one border pixel changed. Therefore, mark for
+            // another kernel launch to propagate changes beyond the border
+            // of this block
+            atomic_inc(notFinished);
+        }
+        out[(gx * s0 + gy * s1)] = lmem[j][i];
+    }
+}
+#endif
+
+#if defined(FINALIZE_OUTPUT)
+kernel void finalize_output(global T *out, KParam oInfo, T newValue) {
+    uint gx = get_global_id(0);
+    uint gy = get_global_id(1);
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1]) {
+        uint idx = gx * oInfo.strides[0] + gy * oInfo.strides[1];
+        T val    = out[idx];
+        out[idx] = (val == VALID ? newValue : ZERO);
+    }
+}
+#endif
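Review note on the new file above: the three kernels implement a small state machine over the output array (0 = not processed, 1 = not valid, 2 = valid). The sketch below is a sequential, single-seed CPU illustration of that state machine, not the kernel's tiled algorithm. The function name and the row-major `int` image are assumptions made for the example. Like `flood_step`, it looks at the full 3x3 neighbourhood (RADIUS = 1), and like `init_seeds` it marks the seed as valid without testing it against the thresholds.

```cpp
#include <cassert>
#include <vector>

constexpr int ZERO = 0, INVALID = 1, VALID = 2;

// Sequential sketch of init_seeds -> repeated flood_step -> finalize_output
// for a row-major w x h image and a single seed pixel.
std::vector<int> floodFillSketch(const std::vector<int>& img, int w, int h,
                                 int seedX, int seedY, int low, int high,
                                 int newValue) {
    assert(static_cast<int>(img.size()) == w * h);
    assert(seedX >= 0 && seedX < w && seedY >= 0 && seedY < h);

    std::vector<int> out(img.size(), ZERO);
    out[seedY * w + seedX] = VALID;  // init_seeds

    bool changed = true;
    while (changed) {  // one iteration plays the role of a flood_step pass
        changed = false;
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                const int idx = y * w + x;
                if (out[idx] != ZERO) continue;  // already classified
                bool hasValidNeighbor = false;
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        const int nx = x + dx, ny = y + dy;
                        if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                        if (out[ny * w + nx] == VALID) hasValidNeighbor = true;
                    }
                }
                if (!hasValidNeighbor) continue;
                const bool inRange = img[idx] >= low && img[idx] <= high;
                out[idx] = inRange ? VALID : INVALID;
                changed  = true;
            }
        }
    }

    // finalize_output: valid pixels take newValue, everything else is zero
    for (int& v : out) v = (v == VALID) ? newValue : ZERO;
    return out;
}
```

Because a pixel is classified only once and its class depends only on the image value, the sequential scan reaches the same fixed point as a parallel update would, just in a different number of passes.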
diff --git a/src/backend/opencl/kernel/flood_fill.hpp b/src/backend/opencl/kernel/flood_fill.hpp
new file mode 100644
index 0000000000..8035a61fd6
--- /dev/null
+++ b/src/backend/opencl/kernel/flood_fill.hpp
@@ -0,0 +1,116 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/flood_fill.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int THREADS   = 256;
+constexpr int TILE_DIM  = 16;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = THREADS / TILE_DIM;
+constexpr int VALID     = 2;
+constexpr int INVALID   = 1;
+constexpr int ZERO      = 0;
+
+template<typename T>
+void initSeeds(Param out, const Param seedsx, const Param seedsy) {
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(VALID),
+        DefineKey(INIT_SEEDS), getTypeBuildDefinition<T>()};
+
+    auto initSeeds =
+        common::getKernel("init_seeds", {{flood_fill_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+    cl::NDRange local(kernel::THREADS, 1, 1);
+    cl::NDRange global(divup(seedsx.info.dims[0], local[0]) * local[0], 1, 1);
+
+    initSeeds(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *seedsx.data, seedsx.info, *seedsy.data, seedsy.info);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void finalizeOutput(Param out, const T newValue) {
+    std::array<std::string, 5> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(VALID),
+        DefineValue(ZERO), DefineKey(FINALIZE_OUTPUT),
+        getTypeBuildDefinition<T>()};
+
+    auto finalizeOut =
+        common::getKernel("finalize_output", {{flood_fill_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+    cl::NDRange local(kernel::THREADS_X, kernel::THREADS_Y, 1);
+    cl::NDRange global(divup(out.info.dims[0], local[0]) * local[0],
+                       divup(out.info.dims[1], local[1]) * local[1], 1);
+    finalizeOut(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                newValue);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void floodFill(Param out, const Param image, const Param seedsx,
+               const Param seedsy, const T newValue, const T lowValue,
+               const T highValue, const af::connectivity nlookup) {
+    constexpr int RADIUS = 1;
+
+    UNUSED(nlookup);
+    std::array<std::string, 10> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(RADIUS),
+        DefineValue(VALID),
+        DefineValue(INVALID),
+        DefineValue(ZERO),
+        DefineKey(FLOOD_FILL_STEP),
+        DefineKeyValue(LMEM_WIDTH, (THREADS_X + 2 * RADIUS)),
+        DefineKeyValue(LMEM_HEIGHT, (THREADS_Y + 2 * RADIUS)),
+        DefineKeyValue(GROUP_SIZE, (THREADS_Y * THREADS_X)),
+        getTypeBuildDefinition<T>()};
+
+    auto floodStep =
+        common::getKernel("flood_step", {{flood_fill_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+    cl::NDRange local(kernel::THREADS_X, kernel::THREADS_Y, 1);
+    cl::NDRange global(divup(out.info.dims[0], local[0]) * local[0],
+                       divup(out.info.dims[1], local[1]) * local[1], 1);
+
+    initSeeds<T>(out, seedsx, seedsy);
+
+    int notFinished       = 1;
+    cl::Buffer* dContinue = bufferAlloc(sizeof(int));
+
+    while (notFinished) {
+        notFinished = 0;
+        floodStep.setFlag(dContinue, &notFinished);
+        floodStep(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                  out.info, *image.data, image.info, lowValue, highValue,
+                  *dContinue);
+        CL_DEBUG_FINISH(getQueue());
+        notFinished = floodStep.getFlag(dContinue);
+    }
+    bufferFree(dContinue);
+    finalizeOutput<T>(out, newValue);
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
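Review note on the new file above: `floodFill` derives its launch geometry from `THREADS`, `TILE_DIM` and `RADIUS`, rounding each global extent up to a multiple of the work-group size and sizing the local-memory tile with a halo of `RADIUS` pixels on every side. The sketch below reproduces that arithmetic in plain C++. The `Launch2D` struct and `floodLaunch` function are hypothetical, and `divup` is written out here under the assumption that it is the usual round-up integer division.

```cpp
#include <cassert>

// Assumed definition: integer division rounded up.
constexpr int divup(int a, int b) { return (a + b - 1) / b; }

struct Launch2D {
    int localX, localY;    // work-group size
    int globalX, globalY;  // global size, padded to a multiple of local
    int lmemW, lmemH;      // local tile size including the halo
};

// Mirrors the constants in the header: THREADS = 256, TILE_DIM = 16,
// RADIUS = 1, with THREADS_X = TILE_DIM and THREADS_Y = THREADS / TILE_DIM.
Launch2D floodLaunch(int dim0, int dim1, int threads = 256, int tileDim = 16,
                     int radius = 1) {
    assert(dim0 > 0 && dim1 > 0 && tileDim > 0 && threads % tileDim == 0);
    Launch2D cfg{};
    cfg.localX  = tileDim;
    cfg.localY  = threads / tileDim;
    cfg.globalX = divup(dim0, cfg.localX) * cfg.localX;
    cfg.globalY = divup(dim1, cfg.localY) * cfg.localY;
    cfg.lmemW   = cfg.localX + 2 * radius;
    cfg.lmemH   = cfg.localY + 2 * radius;
    return cfg;
}
```

For a 100 x 50 image this gives a 16 x 16 work-group, a 112 x 64 global range and an 18 x 18 local tile; the `gx < d0 && gy < d1` guard in the kernel covers the padded work-items.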
diff --git a/src/backend/opencl/kernel/gradient.cl b/src/backend/opencl/kernel/gradient.cl
index cb0ff68a64..e3698ee9b8 100644
--- a/src/backend/opencl/kernel/gradient.cl
+++ b/src/backend/opencl/kernel/gradient.cl
@@ -9,10 +9,11 @@
 
 #if CPLX
 #define set(a, b) a = b
-#define set_scalar(a, b) do {                   \
-        a.x = b;                                \
-        a.y = 0;                                \
-    } while(0)
+#define set_scalar(a, b) \
+    do {                 \
+        a.x = b;         \
+        a.y = 0;         \
+    } while (0)
 
 #else
 
@@ -23,12 +24,9 @@
 
 #define sidx(y, x) scratch[((y + 1) * (TX + 2)) + (x + 1)]
 
-__kernel
-void gradient_kernel(__global T *d_grad0, const KParam grad0,
-                     __global T *d_grad1, const KParam grad1,
-                     __global const T* d_in, const KParam in,
-                     const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void gradient(global T *d_grad0, const KParam grad0, global T *d_grad1,
+                     const KParam grad1, global const T *d_in, const KParam in,
+                     const int blocksPerMatX, const int blocksPerMatY) {
     const int idz = get_group_id(0) / blocksPerMatX;
     const int idw = get_group_id(1) / blocksPerMatY;
 
@@ -50,24 +48,25 @@ void gradient_kernel(__global T *d_grad0, const KParam grad0,
     int xmax = (TX > (in.dims[0] - xB)) ? (in.dims[0] - xB) : TX;
     int ymax = (TY > (in.dims[1] - yB)) ? (in.dims[1] - yB) : TY;
 
-    int iIdx = in.offset + idw * in.strides[3] + idz * in.strides[2]
-                              + idy * in.strides[1] + idx;
+    int iIdx = in.offset + idw * in.strides[3] + idz * in.strides[2] +
+               idy * in.strides[1] + idx;
 
-    int g0dx = idw * grad0.strides[3] + idz * grad0.strides[2]
-                  + idy * grad0.strides[1] + idx;
+    int g0dx = idw * grad0.strides[3] + idz * grad0.strides[2] +
+               idy * grad0.strides[1] + idx;
 
-    int g1dx = idw * grad1.strides[3] + idz * grad1.strides[2]
-                  + idy * grad1.strides[1] + idx;
+    int g1dx = idw * grad1.strides[3] + idz * grad1.strides[2] +
+               idy * grad1.strides[1] + idx;
 
-    __local T scratch[(TY + 2) * (TX + 2)];
+    local T scratch[(TY + 2) * (TX + 2)];
 
     // Multipliers - 0.5 for interior, 1 for edge cases
     float xf = 0.5 * (1 + (idx == 0 || idx >= (in.dims[0] - 1)));
     float yf = 0.5 * (1 + (idy == 0 || idy >= (in.dims[1] - 1)));
 
     // Copy data to scratch space
-    if(cond) {
-        set_scalar(sidx(ty, tx), 0);
+    T zero = (T)(ZERO);
+    if (cond) {
+        sidx(ty, tx) = zero;
     } else {
         sidx(ty, tx) = d_in[iIdx];
     }
@@ -76,26 +75,27 @@ void gradient_kernel(__global T *d_grad0, const KParam grad0,
 
     // Copy buffer zone data. Corner (0,0) etc, are not used.
     // Cols
-    if(ty == 0) {
+    if (ty == 0) {
         // Y-1
-        sidx(-1, tx) = (cond || idy == 0) ?
-                        sidx(0, tx) : d_in[iIdx - in.strides[1]];
-        sidx(ymax, tx) = (cond || idy + ymax >= in.dims[1] - 1) ?
-                          sidx(ymax - 1, tx) : d_in[iIdx + ymax * in.strides[1]];
+        sidx(-1, tx) =
+            (cond || idy == 0) ? sidx(0, tx) : d_in[iIdx - in.strides[1]];
+        sidx(ymax, tx) = (cond || (idy + ymax) >= in.dims[1])
+                             ? sidx(ymax - 1, tx)
+                             : d_in[iIdx + ymax * in.strides[1]];
     }
     // Rows
-    if(tx == 0) {
-        sidx(ty, -1) = (cond || idx == 0) ?
-                        sidx(ty, 0) : d_in[iIdx - 1];
-        sidx(ty, xmax) = (cond || idx + xmax >= in.dims[0] - 1) ?
-                          sidx(ty, xmax - 1) : d_in[iIdx + xmax];
+    if (tx == 0) {
+        sidx(ty, -1)   = (cond || idx == 0) ? sidx(ty, 0) : d_in[iIdx - 1];
+        sidx(ty, xmax) = (cond || (idx + xmax) >= in.dims[0])
+                             ? sidx(ty, xmax - 1)
+                             : d_in[iIdx + xmax];
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (cond) return;
 
-    //set_scalar(d_grad0[iIdx], sidx(ty, tx));
-    d_grad0[g0dx] = xf * (sidx(ty, tx + 1) -  sidx(ty, tx - 1));
-    d_grad1[g1dx] = yf * (sidx(ty + 1, tx) -  sidx(ty - 1, tx));
+    // set_scalar(d_grad0[iIdx], sidx(ty, tx));
+    d_grad0[g0dx] = xf * (sidx(ty, tx + 1) - sidx(ty, tx - 1));
+    d_grad1[g1dx] = yf * (sidx(ty + 1, tx) - sidx(ty - 1, tx));
 }
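Review note on the hunk above: the kernel uses central differences with a multiplier of 0.5 for interior elements and 1 at the edges, and the halo cells outside the array replicate the nearest in-bounds value. At an edge this reduces to a one-sided difference. The sketch below shows the same rule along a single dimension in plain C++; the function name is hypothetical and the tiling, scratch buffer and complex-type handling of the kernel are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// 1-D version of the kernel's rule: 0.5 * (next - prev) in the interior,
// 1.0 * (next - prev) at the edges, with out-of-range neighbours replaced
// by the nearest in-bounds element.
std::vector<float> gradient1D(const std::vector<float>& in) {
    assert(!in.empty());
    const int n = static_cast<int>(in.size());
    std::vector<float> out(in.size());
    for (int i = 0; i < n; ++i) {
        const float prev   = in[std::max(i - 1, 0)];
        const float next   = in[std::min(i + 1, n - 1)];
        const bool  isEdge = (i == 0 || i >= n - 1);
        const float factor = 0.5f * (1 + (isEdge ? 1 : 0));
        out[i] = factor * (next - prev);
    }
    return out;
}
```

For the input {1, 4, 9, 16} the edge elements use the one-sided differences 4 - 1 and 16 - 9, and the interior elements use half of 9 - 1 and half of 16 - 4.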
diff --git a/src/backend/opencl/kernel/gradient.hpp b/src/backend/opencl/kernel/gradient.hpp
index b1b7cca2ce..6809f10c19 100644
--- a/src/backend/opencl/kernel/gradient.hpp
+++ b/src/backend/opencl/kernel/gradient.hpp
@@ -8,85 +8,54 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/gradient.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/gradient.hpp>
+#include <math.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-
-        template<typename T>
-        void gradient(Param grad0, Param grad1, const Param in)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>  gradProgs;
-                static std::map<int, Kernel*> gradKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D TX=" << TX
-                            << " -D TY=" << TY;
-
-                    if((af_dtype) dtype_traits<T>::af_type == c32 ||
-                       (af_dtype) dtype_traits<T>::af_type == c64) {
-                        options << " -D CPLX=1";
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, gradient_cl, gradient_cl_len, options.str());
-                    gradProgs[device]   = new Program(prog);
-                    gradKernels[device] = new Kernel(*gradProgs[device], "gradient_kernel");
-                });
-
-                auto gradOp = make_kernel<Buffer, const KParam, Buffer, const KParam,
-                                    const Buffer, const KParam, const int, const int>
-                                        (*gradKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(in.info.dims[0], TX);
-                int blocksPerMatY = divup(in.info.dims[1], TY);
-                NDRange global(local[0] * blocksPerMatX * in.info.dims[2],
-                               local[1] * blocksPerMatY * in.info.dims[3],
-                               1);
-
-                gradOp(EnqueueArgs(getQueue(), global, local),
-                        *grad0.data, grad0.info, *grad1.data, grad1.info,
-                        *in.data, in.info, blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void gradient(Param grad0, Param grad1, const Param in) {
+    constexpr int TX = 32;
+    constexpr int TY = 8;
+
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 6> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(TX),
+        DefineValue(TY),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        DefineKeyValue(CPLX, static_cast<int>(iscplx<T>())),
+        getTypeBuildDefinition<T>()};
+
+    auto gradOp =
+        common::getKernel("gradient", {{gradient_cl_src}}, targs, options);
+
+    cl::NDRange local(TX, TY, 1);
+
+    int blocksPerMatX = divup(in.info.dims[0], TX);
+    int blocksPerMatY = divup(in.info.dims[1], TY);
+    cl::NDRange global(local[0] * blocksPerMatX * in.info.dims[2],
+                       local[1] * blocksPerMatY * in.info.dims[3], 1);
+
+    gradOp(cl::EnqueueArgs(getQueue(), global, local), *grad0.data, grad0.info,
+           *grad1.data, grad1.info, *in.data, in.info, blocksPerMatX,
+           blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
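Review note: the launch sizes in the new `gradient` wrapper are easier to check in isolation. The following is a minimal host-side sketch, not part of the patch. It assumes `divup` is a plain ceiling division, and `LaunchGeometry` and `gradientGeometry` are hypothetical names introduced here for illustration only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Ceiling division; assumed to match what the backend's divup helper does.
inline std::size_t divup(std::size_t a, std::size_t b) {
    assert(b != 0);
    return (a + b - 1) / b;
}

// Hypothetical container for the sizes the wrapper passes to the kernel.
struct LaunchGeometry {
    std::array<std::size_t, 2> local;   // work-group size (TX, TY)
    std::array<std::size_t, 2> global;  // total work-items per dimension
    std::size_t blocksPerMatX;          // work-groups covering dims[0]
    std::size_t blocksPerMatY;          // work-groups covering dims[1]
};

// Hypothetical helper: reproduces the arithmetic used for a 4D input shape.
// Batches along dims[2] and dims[3] are folded into the X and Y extents.
inline LaunchGeometry gradientGeometry(const std::array<std::size_t, 4>& dims) {
    constexpr std::size_t TX = 32;
    constexpr std::size_t TY = 8;

    LaunchGeometry g{};
    g.blocksPerMatX = divup(dims[0], TX);
    g.blocksPerMatY = divup(dims[1], TY);
    g.local         = {TX, TY};
    g.global        = {TX * g.blocksPerMatX * dims[2],
                       TY * g.blocksPerMatY * dims[3]};
    return g;
}
```

For a 100x50x3x2 input this gives 4 by 7 work-groups per matrix and a global range of 384 by 112, which is why the kernel also receives `blocksPerMatX` and `blocksPerMatY`: it needs them to recover the batch index from the group id.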
diff --git a/src/backend/opencl/kernel/hamming.cl b/src/backend/opencl/kernel/hamming.cl
deleted file mode 100644
index 68ce6fd21f..0000000000
--- a/src/backend/opencl/kernel/hamming.cl
+++ /dev/null
@@ -1,360 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-// OpenCL < 1.2 compatibility
-#if !defined(__OPENCL_VERSION__) || __OPENCL_VERSION__ < 120
-__inline unsigned popcount(unsigned x)
-{
-    x = x - ((x >> 1) & 0x55555555);
-    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
-    x = (x + (x >> 4)) & 0x0F0F0F0F;
-    x = x + (x >> 8);
-    x = x + (x >> 16);
-    return x & 0x0000003F;
-}
-#endif
-
-__kernel
-void hamming_matcher_unroll(
-    __global unsigned* out_idx,
-    __global unsigned* out_dist,
-    __global const T* query,
-    KParam qInfo,
-    __global const T* train,
-    KParam tInfo,
-    const unsigned max_dist,
-    __local T* lmem)
-{
-    unsigned nquery = qInfo.dims[0];
-    unsigned ntrain = tInfo.dims[0];
-
-    unsigned f = get_global_id(0);
-    unsigned tid = get_local_id(0);
-
-    __local unsigned l_dist[THREADS];
-    __local unsigned l_idx[THREADS];
-
-    __local T* l_query = lmem;
-    __local T* l_train = lmem + FEAT_LEN;
-
-    l_dist[tid] = max_dist;
-    l_idx[tid]  = 0xffffffff;
-
-    bool valid_feat = (f < ntrain);
-
-#ifdef USE_LOCAL_MEM
-    if (valid_feat) {
-        // Copy local_size(0) training features to shared memory
-        #pragma unroll
-        for (unsigned i = 0; i < FEAT_LEN; i++) {
-            l_train[i * get_local_size(0) + tid] = train[i * ntrain + f];
-        }
-    }
-    barrier(CLK_LOCAL_MEM_FENCE);
-#endif
-
-    for (int j = 0; j < (int)nquery; j++) {
-        l_dist[tid] = max_dist;
-
-        // Load one query feature that will be tested against all training
-        // features in current block
-        if (tid < FEAT_LEN && valid_feat) {
-            l_query[tid] = query[tid * nquery + j];
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        unsigned dist = 0;
-        if (valid_feat) {
-            #pragma unroll
-            for (int k = 0; k < (int)FEAT_LEN; k++) {
-                // Calculate Hamming distance for 32-bits of descriptor and
-                // accumulates to dist
-#ifdef USE_LOCAL_MEM
-                dist += popcount(l_train[k * get_local_size(0) + tid] ^ l_query[k]);
-#else
-                dist += popcount(train[k * ntrain + f] ^ l_query[k]);
-#endif
-            }
-        }
-
-        // Only stores the feature index and distance if it's smaller
-        // than the best match found so far
-        if (valid_feat) {
-            l_dist[tid] = dist;
-            l_idx[tid]  = f;
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        // Find best match in training features from block to the current
-        // query feature
-        if (tid < 128) {
-            if (l_dist[tid + 128] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 128];
-                l_idx[tid]  = l_idx[tid + 128];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 64) {
-            if (l_dist[tid + 64] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 64];
-                l_idx[tid]  = l_idx[tid + 64];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 32) {
-            if (l_dist[tid + 32] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 32];
-                l_idx[tid]  = l_idx[tid + 32];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 16) {
-            if (l_dist[tid + 16] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 16];
-                l_idx[tid]  = l_idx[tid + 16];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 8) {
-            if (l_dist[tid + 8] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 8];
-                l_idx[tid]  = l_idx[tid + 8];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 4) {
-            if (l_dist[tid + 4] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 4];
-                l_idx[tid]  = l_idx[tid + 4];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 2) {
-            if (l_dist[tid + 2] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 2];
-                l_idx[tid]  = l_idx[tid + 2];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 1) {
-            if (l_dist[tid + 1] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 1];
-                l_idx[tid]  = l_idx[tid + 1];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        // Store best match in training features from block to the current
-        // query feature
-        if (valid_feat) {
-            out_dist[j * get_num_groups(0) + get_group_id(0)] = l_dist[0];
-            out_idx[j * get_num_groups(0) + get_group_id(0)]  = l_idx[0];
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-    }
-}
-
-__kernel
-void hamming_matcher(
-    __global unsigned* out_idx,
-    __global unsigned* out_dist,
-    __global const T* query,
-    KParam qInfo,
-    __global const T* train,
-    KParam tInfo,
-    const unsigned max_dist,
-    const unsigned feat_len,
-    __local T* lmem)
-{
-    unsigned nquery = qInfo.dims[0];
-    unsigned ntrain = tInfo.dims[0];
-
-    unsigned f = get_global_id(0);
-    unsigned tid = get_local_id(0);
-
-    __local unsigned l_dist[THREADS];
-    __local unsigned l_idx[THREADS];
-
-    __local T* l_query = lmem;
-    __local T* l_train = lmem + feat_len;
-
-    l_dist[tid] = max_dist;
-    l_idx[tid]  = 0xffffffff;
-
-    bool valid_feat = (f < ntrain);
-
-#ifdef USE_LOCAL_MEM
-    if (valid_feat) {
-        // Copy local_size(0) training features to shared memory
-        for (unsigned i = 0; i < feat_len; i++) {
-            l_train[i * get_local_size(0) + tid] = train[i * ntrain + f];
-        }
-    }
-    barrier(CLK_LOCAL_MEM_FENCE);
-#endif
-
-    for (int j = 0; j < (int)nquery; j++) {
-        l_dist[tid] = max_dist;
-
-        // Load one query feature that will be tested against all training
-        // features in current block
-        if (tid < feat_len && valid_feat) {
-            l_query[tid] = query[tid * nquery + j];
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        unsigned dist = 0;
-        if (valid_feat) {
-            for (int k = 0; k < (int)feat_len; k++) {
-                // Calculate Hamming distance for 32-bits of descriptor and
-                // accumulates to dist
-#ifdef USE_LOCAL_MEM
-                dist += popcount(l_train[k * get_local_size(0) + tid] ^ l_query[k]);
-#else
-                dist += popcount(train[k * ntrain + f] ^ l_query[k]);
-#endif
-            }
-        }
-
-        // Only stores the feature index and distance if it's smaller
-        // than the best match found so far
-        if (valid_feat) {
-            l_dist[tid] = dist;
-            l_idx[tid]  = f;
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        // Find best match in training features from block to the current
-        // query feature
-        if (tid < 128) {
-            if (l_dist[tid + 128] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 128];
-                l_idx[tid]  = l_idx[tid + 128];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 64) {
-            if (l_dist[tid + 64] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 64];
-                l_idx[tid]  = l_idx[tid + 64];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 32) {
-            if (l_dist[tid + 32] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 32];
-                l_idx[tid]  = l_idx[tid + 32];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 16) {
-            if (l_dist[tid + 16] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 16];
-                l_idx[tid]  = l_idx[tid + 16];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 8) {
-            if (l_dist[tid + 8] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 8];
-                l_idx[tid]  = l_idx[tid + 8];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 4) {
-            if (l_dist[tid + 4] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 4];
-                l_idx[tid]  = l_idx[tid + 4];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 2) {
-            if (l_dist[tid + 2] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 2];
-                l_idx[tid]  = l_idx[tid + 2];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-        if (tid < 1) {
-            if (l_dist[tid + 1] < l_dist[tid]) {
-                l_dist[tid] = l_dist[tid + 1];
-                l_idx[tid]  = l_idx[tid + 1];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-
-        // Store best match in training features from block to the current
-        // query feature
-        if (valid_feat) {
-            out_dist[j * get_num_groups(0) + get_group_id(0)] = l_dist[0];
-            out_idx[j * get_num_groups(0) + get_group_id(0)]  = l_idx[0];
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-    }
-}
-
-__kernel
-void select_matches(
-    __global unsigned* idx,
-    __global unsigned* dist,
-    __global const unsigned* in_idx,
-    __global const unsigned* in_dist,
-    const unsigned nfeat,
-    const unsigned nelem,
-    const unsigned max_dist)
-{
-    unsigned f = get_global_id(0);
-    unsigned lsz1 = get_local_size(1);
-    unsigned sid = get_local_id(0) * lsz1 + get_local_id(1);
-
-    __local unsigned l_dist[THREADS];
-    __local unsigned l_idx[THREADS];
-
-    bool valid_feat = (f < nfeat);
-
-    if (valid_feat)
-        l_dist[sid] = max_dist;
-    barrier(CLK_LOCAL_MEM_FENCE);
-
-    unsigned nelem_max = (nelem / lsz1) * lsz1;
-    nelem_max = (nelem % lsz1 == 0) ? nelem_max : nelem_max + lsz1;
-
-    for (unsigned i = get_local_id(1); i < nelem_max; i += get_local_size(1)) {
-        if (valid_feat && i < nelem) {
-            unsigned dist = in_dist[f * nelem + i];
-
-            // Copy all best matches previously found in hamming_matcher() to
-            // shared memory
-            if (dist < l_dist[sid]) {
-                l_dist[sid] = dist;
-                l_idx[sid]  = in_idx[f * nelem + i];
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-    }
-
-    for (unsigned i = get_local_size(1) / 2; i > 0; i >>= 1) {
-        if (get_local_id(1) < i) {
-            if (valid_feat) {
-                unsigned dist = l_dist[sid + i];
-                if (dist < l_dist[sid]) {
-                    l_dist[sid] = dist;
-                    l_idx[sid]  = l_idx[sid + i];
-                }
-            }
-        }
-        barrier(CLK_LOCAL_MEM_FENCE);
-    }
-
-    // Store best matches and indexes to training dataset
-    if (get_local_id(1) == 0 && valid_feat) {
-        dist[f] = l_dist[get_local_id(0) * get_local_size(1)];
-        idx[f]  = l_idx[get_local_id(0) * get_local_size(1)];
-    }
-}
diff --git a/src/backend/opencl/kernel/hamming.hpp b/src/backend/opencl/kernel/hamming.hpp
deleted file mode 100644
index b89ee15e77..0000000000
--- a/src/backend/opencl/kernel/hamming.hpp
+++ /dev/null
@@ -1,138 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/defines.h>
-#include <program.hpp>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
-#include <debug_opencl.hpp>
-#include <kernel_headers/hamming.hpp>
-#include <memory.hpp>
-
-using cl::LocalSpaceArg;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const unsigned THREADS = 256;
-
-template<typename T, bool use_lmem, unsigned unroll_len>
-void hamming_matcher(Param idx,
-                     Param dist,
-                     Param query,
-                     Param train,
-                     const dim_t dist_dim,
-                     const unsigned n_dist,
-                     const size_t lmem_sz)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static Program        hammingProgs[DeviceManager::MAX_DEVICES];
-        static Kernel             huKernel[DeviceManager::MAX_DEVICES];
-        static Kernel             hmKernel[DeviceManager::MAX_DEVICES];
-        static Kernel             smKernel[DeviceManager::MAX_DEVICES];
-
-        int device = getActiveDeviceId();
-
-        const unsigned feat_len = query.info.dims[dist_dim];
-        const unsigned max_dist = feat_len * 8 * sizeof(T);
-
-        if (feat_len > THREADS) {
-            OPENCL_NOT_SUPPORTED();
-        }
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D THREADS=" << THREADS
-                        << " -D FEAT_LEN=" << unroll_len;
-
-                if (use_lmem)
-                    options << " -D USE_LOCAL_MEM";
-
-                buildProgram(hammingProgs[device],
-                             hamming_cl,
-                             hamming_cl_len,
-                             options.str());
-
-                huKernel[device] = Kernel(hammingProgs[device], "hamming_matcher_unroll");
-                hmKernel[device] = Kernel(hammingProgs[device], "hamming_matcher");
-                smKernel[device] = Kernel(hammingProgs[device], "select_matches");
-            });
-
-        const dim_t sample_dim = (dist_dim == 0) ? 1 : 0;
-
-        const unsigned nquery = query.info.dims[sample_dim];
-        const unsigned ntrain = train.info.dims[sample_dim];
-
-        unsigned nblk = divup(ntrain, THREADS);
-        const NDRange local(THREADS, 1);
-        const NDRange global(nblk * THREADS, 1);
-
-        cl::Buffer *d_blk_idx  = bufferAlloc(nblk * nquery * sizeof(unsigned));
-        cl::Buffer *d_blk_dist = bufferAlloc(nblk * nquery * sizeof(unsigned));
-
-        // For each query vector, find training vector with smallest Hamming
-        // distance per CUDA block
-        if (unroll_len > 0) {
-            auto huOp = make_kernel<Buffer, Buffer,
-                                    Buffer, KParam,
-                                    Buffer, KParam,
-                                    const unsigned,
-                                    LocalSpaceArg> (huKernel[device]);
-
-            huOp(EnqueueArgs(getQueue(), global, local),
-                 *d_blk_idx, *d_blk_dist,
-                 *query.data, query.info, *train.data, train.info,
-                 max_dist, cl::Local(lmem_sz));
-        }
-        else {
-            auto hmOp = make_kernel<Buffer, Buffer,
-                                    Buffer, KParam,
-                                    Buffer, KParam,
-                                    const unsigned, const unsigned,
-                                    LocalSpaceArg> (hmKernel[device]);
-
-            hmOp(EnqueueArgs(getQueue(), global, local),
-                 *d_blk_idx, *d_blk_dist,
-                 *query.data, query.info, *train.data, train.info,
-                 max_dist, feat_len, cl::Local(lmem_sz));
-        }
-        CL_DEBUG_FINISH(getQueue());
-
-        const NDRange local_sm(32, 8);
-        const NDRange global_sm(divup(nquery, 32) * 32, 8);
-
-        // Reduce all smallest Hamming distances from each block and store final
-        // best match
-        auto smOp = make_kernel<Buffer, Buffer, Buffer, Buffer,
-                                const unsigned, const unsigned,
-                                const unsigned> (smKernel[device]);
-
-        smOp(EnqueueArgs(getQueue(), global_sm, local_sm),
-             *idx.data, *dist.data,
-             *d_blk_idx, *d_blk_dist,
-             nquery, nblk, max_dist);
-        CL_DEBUG_FINISH(getQueue());
-
-        bufferFree(d_blk_idx);
-        bufferFree(d_blk_dist);
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-} // namespace kernel
-
-} // namespace opencl
diff --git a/src/backend/opencl/kernel/harris.cl b/src/backend/opencl/kernel/harris.cl
new file mode 100644
index 0000000000..a849145a51
--- /dev/null
+++ b/src/backend/opencl/kernel/harris.cl
@@ -0,0 +1,95 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#define MAX_VAL(A, B) ((A) < (B) ? (B) : (A))
+
+kernel void second_order_deriv(global T* ixx_out, global T* ixy_out,
+                               global T* iyy_out, const dim_t in_len,
+                               global const T* ix_in, global const T* iy_in) {
+    const unsigned x = get_global_id(0);
+
+    if (x < in_len) {
+        ixx_out[x] = ix_in[x] * ix_in[x];
+        ixy_out[x] = ix_in[x] * iy_in[x];
+        iyy_out[x] = iy_in[x] * iy_in[x];
+    }
+}
+
+kernel void harris_responses(global T* resp_out, const unsigned idim0,
+                             const unsigned idim1, global const T* ixx_in,
+                             global const T* ixy_in, global const T* iyy_in,
+                             const float k_thr, const unsigned border_len) {
+    const unsigned r = border_len;
+
+    const unsigned x = get_global_id(0) + r;
+    const unsigned y = get_global_id(1) + r;
+
+    if (x < idim1 - r && y < idim0 - r) {
+        const unsigned idx = x * idim0 + y;
+
+        // Calculates matrix trace and determinant
+        T tr  = ixx_in[idx] + iyy_in[idx];
+        T det = ixx_in[idx] * iyy_in[idx] - ixy_in[idx] * ixy_in[idx];
+
+        // Calculates local Harris response
+        resp_out[idx] = det - k_thr * (tr * tr);
+    }
+}
+
+kernel void non_maximal(global float* x_out, global float* y_out,
+                        global float* resp_out, global unsigned* count,
+                        global const T* resp_in, const unsigned idim0,
+                        const unsigned idim1, const float min_resp,
+                        const unsigned border_len, const unsigned max_corners) {
+    // Responses on the border have no 8-neighborhood to compare; discard them
+    const unsigned r = border_len + 1;
+
+    const unsigned x = get_global_id(0) + r;
+    const unsigned y = get_global_id(1) + r;
+
+    if (x < idim1 - r && y < idim0 - r) {
+        const T v = resp_in[x * idim0 + y];
+
+        // Find maximum neighborhood response
+        T max_v;
+        max_v = MAX_VAL(resp_in[(x - 1) * idim0 + y - 1],
+                        resp_in[x * idim0 + y - 1]);
+        max_v = MAX_VAL(max_v, resp_in[(x + 1) * idim0 + y - 1]);
+        max_v = MAX_VAL(max_v, resp_in[(x - 1) * idim0 + y]);
+        max_v = MAX_VAL(max_v, resp_in[(x + 1) * idim0 + y]);
+        max_v = MAX_VAL(max_v, resp_in[(x - 1) * idim0 + y + 1]);
+        max_v = MAX_VAL(max_v, resp_in[(x)*idim0 + y + 1]);
+        max_v = MAX_VAL(max_v, resp_in[(x + 1) * idim0 + y + 1]);
+
+        // Store corner in {x,y,resp}_out if its response exceeds all of its
+        // 8 neighbors and is at least the minimum response
+        if (v > max_v && v >= min_resp) {
+            const unsigned idx = atomic_inc(count);
+            if (idx < max_corners) {
+                x_out[idx]    = (float)x;
+                y_out[idx]    = (float)y;
+                resp_out[idx] = (float)v;
+            }
+        }
+    }
+}
+
+kernel void keep_corners(global float* x_out, global float* y_out,
+                         global float* score_out, global const float* x_in,
+                         global const float* y_in, global const float* score_in,
+                         global const unsigned* score_idx,
+                         const unsigned n_feat) {
+    unsigned f = get_global_id(0);
+
+    if (f < n_feat) {
+        x_out[f]     = x_in[score_idx[f]];
+        y_out[f]     = y_in[score_idx[f]];
+        score_out[f] = score_in[f];
+    }
+}
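Review note: a scalar host-side reference can help when validating `harris_responses` and `non_maximal`. The sketch below is an illustration under assumptions, not code from the patch. The function names are hypothetical, and it only mirrors the per-pixel arithmetic and the `x * idim0 + y` indexing visible in the kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Harris response for one pixel from the smoothed products of derivatives:
// det(M) - k * trace(M)^2, with M = [[ixx, ixy], [ixy, iyy]].
inline float harrisResponse(float ixx, float ixy, float iyy, float k) {
    const float tr  = ixx + iyy;
    const float det = ixx * iyy - ixy * ixy;
    return det - k * (tr * tr);
}

// True when resp(x, y) is strictly greater than all 8 neighbors.
// The layout matches the kernel: element (x, y) lives at x * idim0 + y.
inline bool isStrictLocalMax(const std::vector<float>& resp, std::size_t idim0,
                             std::size_t x, std::size_t y) {
    assert(x >= 1 && y >= 1);                // needs a full neighborhood
    assert(y + 1 < idim0);
    assert((x + 2) * idim0 <= resp.size());

    const float v = resp[x * idim0 + y];
    for (std::size_t nx = x - 1; nx <= x + 1; ++nx) {
        for (std::size_t ny = y - 1; ny <= y + 1; ++ny) {
            if (nx == x && ny == y) continue;
            if (resp[nx * idim0 + ny] >= v) return false;
        }
    }
    return true;
}
```

A flat region or an edge yields a non-positive response (for example `ixx = 4, ixy = 0, iyy = 0, k = 0.25` gives -4), while two strong orthogonal gradients yield a positive one, which is what the suppression step then thins out.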
diff --git a/src/backend/opencl/kernel/harris.hpp b/src/backend/opencl/kernel/harris.hpp
new file mode 100644
index 0000000000..835c20c745
--- /dev/null
+++ b/src/backend/opencl/kernel/harris.hpp
@@ -0,0 +1,276 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/convolve_separable.hpp>
+#include <kernel/gradient.hpp>
+#include <kernel/range.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <kernel_headers/harris.hpp>
+#include <memory.hpp>
+#include <af/constants.h>
+#include <af/defines.h>
+
+#include <array>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void gaussian1D(T *out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * af::Pi * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+
+template<typename T, typename convAccT>
+void conv_helper(Array<T> &ixx, Array<T> &ixy, Array<T> &iyy,
+                 Array<convAccT> &filter) {
+    Array<convAccT> ixx_tmp = createEmptyArray<convAccT>(ixx.dims());
+    Array<convAccT> ixy_tmp = createEmptyArray<convAccT>(ixy.dims());
+    Array<convAccT> iyy_tmp = createEmptyArray<convAccT>(iyy.dims());
+
+    convSep<T, convAccT>(ixx_tmp, ixx, filter, 0, false);
+    convSep<T, convAccT>(ixx, ixx_tmp, filter, 1, false);
+    convSep<T, convAccT>(ixy_tmp, ixy, filter, 0, false);
+    convSep<T, convAccT>(ixy, ixy_tmp, filter, 1, false);
+    convSep<T, convAccT>(iyy_tmp, iyy, filter, 0, false);
+    convSep<T, convAccT>(iyy, iyy_tmp, filter, 1, false);
+}
+
+template<typename T>
+std::array<Kernel, 4> getHarrisKernels() {
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
+
+    return {
+        common::getKernel("second_order_deriv", {{harris_cl_src}}, targs,
+                          options),
+        common::getKernel("keep_corners", {{harris_cl_src}}, targs, options),
+        common::getKernel("harris_responses", {{harris_cl_src}}, targs,
+                          options),
+        common::getKernel("non_maximal", {{harris_cl_src}}, targs, options),
+    };
+}
+
+template<typename T, typename convAccT>
+void harris(unsigned *corners_out, Param &x_out, Param &y_out, Param &resp_out,
+            Param in, const unsigned max_corners, const float min_response,
+            const float sigma, const unsigned filter_len, const float k_thr) {
+    constexpr unsigned HARRIS_THREADS_PER_GROUP = 256;
+    constexpr unsigned HARRIS_THREADS_X         = 16;
+    constexpr unsigned HARRIS_THREADS_Y =
+        HARRIS_THREADS_PER_GROUP / HARRIS_THREADS_X;
+
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto kernels = getHarrisKernels<T>();
+    auto soOp    = kernels[0];
+    auto kcOp    = kernels[1];
+    auto hrOp    = kernels[2];
+    auto nmOp    = kernels[3];
+
+    // Window filter
+    std::vector<convAccT> h_filter(filter_len);
+    // Use a uniform (box) window when sigma < 0.5, otherwise a Gaussian one
+    if (sigma < 0.5f) {
+        for (unsigned i = 0; i < filter_len; i++)
+            h_filter[i] = (T)1.f / (filter_len);
+    } else {
+        gaussian1D<convAccT>(h_filter.data(), (int)filter_len, sigma);
+    }
+
+    const unsigned border_len = filter_len / 2 + 1;
+
+    // Copy filter to device object
+    Array<convAccT> filter =
+        createHostDataArray<convAccT>(filter_len, h_filter.data());
+    Array<T> ix = createEmptyArray<T>(dim4(4, in.info.dims));
+    Array<T> iy = createEmptyArray<T>(dim4(4, in.info.dims));
+
+    // Compute first-order derivatives as gradients
+    gradient<T>(iy, ix, in);
+
+    Array<T> ixx = createEmptyArray<T>(dim4(4, in.info.dims));
+    Array<T> ixy = createEmptyArray<T>(dim4(4, in.info.dims));
+    Array<T> iyy = createEmptyArray<T>(dim4(4, in.info.dims));
+
+    // Second-order derivatives kernel sizes
+    const unsigned blk_x_so =
+        divup(in.info.dims[3] * in.info.strides[3], HARRIS_THREADS_PER_GROUP);
+    const NDRange local_so(HARRIS_THREADS_PER_GROUP, 1);
+    const NDRange global_so(blk_x_so * HARRIS_THREADS_PER_GROUP, 1);
+
+    // Compute second-order derivatives
+    soOp(EnqueueArgs(getQueue(), global_so, local_so), *ixx.get(), *ixy.get(),
+         *iyy.get(), in.info.dims[3] * in.info.strides[3], *ix.get(),
+         *iy.get());
+    CL_DEBUG_FINISH(getQueue());
+
+    // Convolve second order derivatives with proper window filter
+    conv_helper<T, convAccT>(ixx, ixy, iyy, filter);
+
+    cl::Buffer *d_responses =
+        bufferAlloc(in.info.dims[3] * in.info.strides[3] * sizeof(T));
+
+    // Harris responses kernel sizes
+    unsigned blk_x_hr =
+        divup(in.info.dims[0] - border_len * 2, HARRIS_THREADS_X);
+    unsigned blk_y_hr =
+        divup(in.info.dims[1] - border_len * 2, HARRIS_THREADS_Y);
+    const NDRange local_hr(HARRIS_THREADS_X, HARRIS_THREADS_Y);
+    const NDRange global_hr(blk_x_hr * HARRIS_THREADS_X,
+                            blk_y_hr * HARRIS_THREADS_Y);
+
+    // Calculate Harris responses for all pixels
+    hrOp(EnqueueArgs(getQueue(), global_hr, local_hr), *d_responses,
+         static_cast<uint>(in.info.dims[0]), static_cast<uint>(in.info.dims[1]),
+         *ixx.get(), *ixy.get(), *iyy.get(), k_thr, border_len);
+    CL_DEBUG_FINISH(getQueue());
+
+    // Number of corners is not known a priori; cap it at 20% of the number of
+    // input elements
+    unsigned corner_lim = in.info.dims[3] * in.info.strides[3] * 0.2f;
+
+    unsigned corners_found      = 0;
+    cl::Buffer *d_corners_found = bufferAlloc(sizeof(unsigned));
+    getQueue().enqueueFillBuffer(*d_corners_found, corners_found, 0,
+                                 sizeof(unsigned));
+
+    cl::Buffer *d_x_corners    = bufferAlloc(corner_lim * sizeof(float));
+    cl::Buffer *d_y_corners    = bufferAlloc(corner_lim * sizeof(float));
+    cl::Buffer *d_resp_corners = bufferAlloc(corner_lim * sizeof(float));
+
+    const float min_r = (max_corners > 0) ? 0.f : min_response;
+
+    // Perform non-maximal suppression
+    nmOp(EnqueueArgs(getQueue(), global_hr, local_hr), *d_x_corners,
+         *d_y_corners, *d_resp_corners, *d_corners_found, *d_responses,
+         static_cast<uint>(in.info.dims[0]), static_cast<uint>(in.info.dims[1]),
+         min_r, border_len, corner_lim);
+    CL_DEBUG_FINISH(getQueue());
+
+    getQueue().enqueueReadBuffer(*d_corners_found, CL_TRUE, 0, sizeof(unsigned),
+                                 &corners_found);
+
+    bufferFree(d_responses);
+    bufferFree(d_corners_found);
+
+    *corners_out =
+        min(corners_found, (max_corners > 0) ? max_corners : corner_lim);
+    if (*corners_out == 0) return;
+
+    // Set output Param info
+    x_out.info.dims[0] = y_out.info.dims[0] = resp_out.info.dims[0] =
+        *corners_out;
+    x_out.info.strides[0] = y_out.info.strides[0] = resp_out.info.strides[0] =
+        1;
+    x_out.info.offset = y_out.info.offset = resp_out.info.offset = 0;
+    for (int k = 1; k < 4; k++) {
+        x_out.info.dims[k] = y_out.info.dims[k] = resp_out.info.dims[k] = 1;
+        x_out.info.strides[k] =
+            x_out.info.dims[k - 1] * x_out.info.strides[k - 1];
+        y_out.info.strides[k] =
+            y_out.info.dims[k - 1] * y_out.info.strides[k - 1];
+        resp_out.info.strides[k] =
+            resp_out.info.dims[k - 1] * resp_out.info.strides[k - 1];
+    }
+
+    if (max_corners > 0 && corners_found > *corners_out) {
+        Param harris_resp;
+        Param harris_idx;
+
+        harris_resp.info.dims[0] = harris_idx.info.dims[0] = corners_found;
+        harris_resp.info.strides[0] = harris_idx.info.strides[0] = 1;
+
+        for (int k = 1; k < 4; k++) {
+            harris_resp.info.dims[k] = 1;
+            harris_resp.info.strides[k] =
+                harris_resp.info.dims[k - 1] * harris_resp.info.strides[k - 1];
+            harris_idx.info.dims[k] = 1;
+            harris_idx.info.strides[k] =
+                harris_idx.info.dims[k - 1] * harris_idx.info.strides[k - 1];
+        }
+
+        int sort_elem = harris_resp.info.strides[3] * harris_resp.info.dims[3];
+        harris_resp.data = d_resp_corners;
+        // Create indices using range
+        harris_idx.data = bufferAlloc(sort_elem * sizeof(unsigned));
+        kernel::range<uint>(harris_idx, 0);
+
+        // Sort Harris responses
+        kernel::sort0ByKey<float, uint>(harris_resp, harris_idx, false);
+
+        x_out.data    = bufferAlloc(*corners_out * sizeof(float));
+        y_out.data    = bufferAlloc(*corners_out * sizeof(float));
+        resp_out.data = bufferAlloc(*corners_out * sizeof(float));
+
+        // Keep corners kernel sizes
+        const unsigned blk_x_kc = divup(*corners_out, HARRIS_THREADS_PER_GROUP);
+        const NDRange local_kc(HARRIS_THREADS_PER_GROUP, 1);
+        const NDRange global_kc(blk_x_kc * HARRIS_THREADS_PER_GROUP, 1);
+
+        // Keep only the *corners_out corners with the highest Harris
+        // responses
+        kcOp(EnqueueArgs(getQueue(), global_kc, local_kc), *x_out.data,
+             *y_out.data, *resp_out.data, *d_x_corners, *d_y_corners,
+             *harris_resp.data, *harris_idx.data, *corners_out);
+        CL_DEBUG_FINISH(getQueue());
+
+        bufferFree(d_x_corners);
+        bufferFree(d_y_corners);
+        bufferFree(harris_resp.data);
+        bufferFree(harris_idx.data);
+    } else if (max_corners == 0 && corners_found < corner_lim) {
+        x_out.data    = bufferAlloc(*corners_out * sizeof(float));
+        y_out.data    = bufferAlloc(*corners_out * sizeof(float));
+        resp_out.data = bufferAlloc(*corners_out * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_x_corners, *x_out.data, 0, 0,
+                                     *corners_out * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_y_corners, *y_out.data, 0, 0,
+                                     *corners_out * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_resp_corners, *resp_out.data, 0, 0,
+                                     *corners_out * sizeof(float));
+
+        bufferFree(d_x_corners);
+        bufferFree(d_y_corners);
+        bufferFree(d_resp_corners);
+    } else {
+        x_out.data    = d_x_corners;
+        y_out.data    = d_y_corners;
+        resp_out.data = d_resp_corners;
+    }
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/histogram.cl b/src/backend/opencl/kernel/histogram.cl
index e6b7eba4e7..857ead231d 100644
--- a/src/backend/opencl/kernel/histogram.cl
+++ b/src/backend/opencl/kernel/histogram.cl
@@ -7,48 +7,53 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void histogram(__global outType *         d_dst,
-               KParam                     oInfo,
-               __global const inType *    d_src,
-               KParam                     iInfo,
-               __global const float2 *    d_minmax,
-               __local outType *          localMem,
-               int len, int nbins, int nBBS)
-{
-    unsigned b2    = get_group_id(0)/nBBS;
-    int start = (get_group_id(0)-b2*nBBS) * THRD_LOAD * get_local_size(0) + get_local_id(0);
-    int end   = min((int)(start + THRD_LOAD * get_local_size(0)), len);
+kernel void histogram(global uint *d_dst, KParam oInfo, global const T *d_src,
+                      KParam iInfo, local uint *localMem, int len, int nbins,
+                      float minval, float maxval, int nBBS) {
+    unsigned b2 = get_group_id(0) / nBBS;
+    int start = (get_group_id(0) - b2 * nBBS) * THRD_LOAD * get_local_size(0) +
+                get_local_id(0);
+    int end = min((int)(start + THRD_LOAD * get_local_size(0)), len);
 
     // offset input and output to account for batch ops
-    __global const inType *in = d_src + b2 * iInfo.strides[2] + get_group_id(1) * iInfo.strides[3] + iInfo.offset;
-    __global outType * out    = d_dst + b2 * oInfo.strides[2] + get_group_id(1) * oInfo.strides[3];
+    global const T *in = d_src + b2 * iInfo.strides[2] +
+                         get_group_id(1) * iInfo.strides[3] + iInfo.offset;
+    global uint *out =
+        d_dst + b2 * oInfo.strides[2] + get_group_id(1) * oInfo.strides[3];
 
-    __local float minval;
-    __local float dx;
+    float dx = (maxval - minval) / (float)nbins;
 
-    // offset minmax array to account for batch ops
-    __global const float2 * d_mnmx = d_minmax + (b2 * get_group_id(0) + get_group_id(1));
+    bool use_global = nbins > MAX_BINS;
 
-    if (get_local_id(0) == 0) {
-        float2 minmax = *d_mnmx;
-        minval = minmax.s0;
-        dx     = (minmax.s1-minmax.s0) / (float)nbins;
+    if (!use_global) {
+        for (int i = get_local_id(0); i < nbins; i += get_local_size(0))
+            localMem[i] = 0;
+        barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    for (int i = get_local_id(0); i < nbins; i += get_local_size(0))
-        localMem[i] = 0;
-    barrier(CLK_LOCAL_MEM_FENCE);
-
     for (int row = start; row < end; row += get_local_size(0)) {
-        int bin = (int)(((float)in[row] - minval) / dx);
+#if defined(IS_LINEAR)
+        int idx = row;
+#else
+        int i0  = row % iInfo.dims[0];
+        int i1  = row / iInfo.dims[0];
+        int idx = i0 + i1 * iInfo.strides[1];
+#endif
+        int bin = (int)(((float)in[idx] - minval) / dx);
         bin     = max(bin, 0);
-        bin     = min(bin, (int)nbins-1);
-        atomic_inc((localMem + bin));
+        bin     = min(bin, (int)nbins - 1);
+
+        if (use_global) {
+            atomic_inc((out + bin));
+        } else {
+            atomic_inc((localMem + bin));
+        }
     }
-    barrier(CLK_LOCAL_MEM_FENCE);
 
-    for (int i = get_local_id(0); i < nbins; i += get_local_size(0)) {
-        atomic_add((out + i), localMem[i]);
+    if (!use_global) {
+        barrier(CLK_LOCAL_MEM_FENCE);
+        for (int i = get_local_id(0); i < nbins; i += get_local_size(0)) {
+            atomic_add((out + i), localMem[i]);
+        }
     }
 }
diff --git a/src/backend/opencl/kernel/histogram.hpp b/src/backend/opencl/kernel/histogram.hpp
index 51b8462410..d138202240 100644
--- a/src/backend/opencl/kernel/histogram.hpp
+++ b/src/backend/opencl/kernel/histogram.hpp
@@ -8,78 +8,55 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/histogram.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/histogram.hpp>
+#include <traits.hpp>
 
-using cl::Kernel;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const unsigned MAX_BINS  = 4000;
-static const int THREADS_X =  256;
-static const int THRD_LOAD =   16;
-
-template<typename inType, typename outType>
-void histogram(Param out, const Param in, const Param minmax, int nbins)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> histProgs;
-        static std::map<int, Kernel *> histKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D inType=" << dtype_traits<inType>::getName()
-                            << " -D outType=" << dtype_traits<outType>::getName()
-                            << " -D THRD_LOAD=" << THRD_LOAD;
-
-                    if (std::is_same<inType, double>::value ||
-                        std::is_same<inType, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    Program prog;
-                    buildProgram(prog, histogram_cl, histogram_cl_len, options.str());
-                    histProgs[device]   = new Program(prog);
-                    histKernels[device] = new Kernel(*histProgs[device], "histogram");
-                });
-
-        auto histogramOp = make_kernel<Buffer, KParam, Buffer, KParam,
-                                       Buffer, cl::LocalSpaceArg,
-                                       int, int, int
-                                      >(*histKernels[device]);
-
-        int nElems = in.info.dims[0]*in.info.dims[1];
-        int blk_x  = divup(nElems, THRD_LOAD*THREADS_X);
-        int locSize = nbins * sizeof(outType);
-
-        NDRange local(THREADS_X, 1);
-        NDRange global(blk_x*in.info.dims[2]*THREADS_X, in.info.dims[3]);
-
-        histogramOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info, *minmax.data,
-                cl::Local(locSize), nElems, nbins, blk_x);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-}
-
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void histogram(Param out, const Param in, int nbins, float minval, float maxval,
+               bool isLinear) {
+    constexpr int MAX_BINS  = 4000;
+    constexpr int THREADS_X = 256;
+    constexpr int THRD_LOAD = 16;
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(isLinear),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(THRD_LOAD),
+        DefineValue(MAX_BINS),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+    if (isLinear) { options.emplace_back(DefineKey(IS_LINEAR)); }
+
+    auto histogram =
+        common::getKernel("histogram", {{histogram_cl_src}}, targs, options);
+
+    int nElems  = in.info.dims[0] * in.info.dims[1];
+    int blk_x   = divup(nElems, THRD_LOAD * THREADS_X);
+    int locSize = nbins <= MAX_BINS ? (nbins * sizeof(uint)) : 1;
+
+    cl::NDRange local(THREADS_X, 1);
+    cl::NDRange global(blk_x * in.info.dims[2] * THREADS_X, in.info.dims[3]);
+
+    histogram(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, cl::Local(locSize), nElems, nbins, minval,
+              maxval, blk_x);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/homography.cl b/src/backend/opencl/kernel/homography.cl
new file mode 100644
index 0000000000..07f9724147
--- /dev/null
+++ b/src/backend/opencl/kernel/homography.cl
@@ -0,0 +1,472 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+inline T sq(T a) { return a * a; }
+
+inline void jacobi_svd(local T* l_V, local T* l_S, local T* l_d,
+                       local T* l_acc1, local T* l_acc2, int m, int n) {
+    const int iterations = 30;
+
+    int tid_x = get_local_id(0);
+    int bsz_x = get_local_size(0);
+    int tid_y = get_local_id(1);
+    int gid_y = get_global_id(1);
+
+    int doff = tid_y * n;
+    int soff = tid_y * 81;
+
+    if (tid_x < n) {
+        T acc1 = 0;
+        for (int i = 0; i < m; i++) {
+            int stid = soff + tid_x * m + i;
+            T t      = l_S[stid];
+            acc1 += t * t;
+            l_V[stid] = (tid_x == i) ? 1 : 0;
+        }
+        l_d[doff + tid_x] = acc1;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+#if defined(IS_CPU)
+    // A single work-item (tid_x == 0) does all the work serially
+    // FIXME: Figure out why the code below doesn't work
+    int tst = 0, toff = 1, tcond = tid_x == 0;
+#define BARRIER  // nothing
+#else
+    // Split work across subgroup
+    int tst = tid_x, toff = bsz_x, tcond = 1;
+#define BARRIER barrier(CLK_LOCAL_MEM_FENCE)
+#endif
+
+    for (int it = 0; tcond && it < iterations; it++) {
+        for (int i = 0; i < n - 1; i++) {
+            for (int j = i + 1; j < n; j++) {
+                local T* Si = l_S + soff + i * m;
+                local T* Sj = l_S + soff + j * m;
+
+                local T* Vi = l_V + soff + i * n;
+                local T* Vj = l_V + soff + j * n;
+
+                T p = (T)0;
+                for (int k = 0; k < m; k++) p += Si[k] * Sj[k];
+
+                T di = l_d[doff + i];
+                T dj = l_d[doff + j];
+                BARRIER;
+
+                T c = 0, s = 0;
+                T t0 = 0, t1 = 0;
+                int cond = (fabs(p) > m * EPS * sqrt(di * dj));
+                T a = 0, b = 0;
+
+                if (cond) {
+                    T y  = di - dj;
+                    T r  = hypot(p * 2, y);
+                    T r2 = r * 2;
+                    if (y >= 0) {
+                        c = sqrt((r + y) / r2);
+                        s = p / (r2 * c);
+                    } else {
+                        s = sqrt((r - y) / r2);
+                        c = p / (r2 * s);
+                    }
+
+                    for (int k = tst; k < m; k += toff) {
+                        t0                        = c * Si[k] + s * Sj[k];
+                        t1                        = c * Sj[k] - s * Si[k];
+                        Si[k]                     = t0;
+                        Sj[k]                     = t1;
+                        l_acc1[tid_y * bsz_x + k] = t0 * t0;
+                        l_acc2[tid_y * bsz_x + k] = t1 * t1;
+                    }
+                }
+                BARRIER;
+
+                if (cond) {
+                    a = 0;
+                    b = 0;
+                    for (int k = 0; k < m; k++) {
+                        a += l_acc1[tid_y * bsz_x + k];
+                        b += l_acc2[tid_y * bsz_x + k];
+                    }
+                    l_d[doff + i] = a;
+                    l_d[doff + j] = b;
+                }
+                BARRIER;
+
+                if (cond) {
+                    for (int l = tst; l < n; l += toff) {
+                        T t0 = Vi[l] * c + Vj[l] * s;
+                        T t1 = Vj[l] * c - Vi[l] * s;
+
+                        Vi[l] = t0;
+                        Vj[l] = t1;
+                    }
+                }
+                BARRIER;
+            }
+        }
+    }
+}
+
+inline int compute_mean_scale(float* x_src_mean, float* y_src_mean,
+                              float* x_dst_mean, float* y_dst_mean,
+                              float* src_scale, float* dst_scale,
+                              float* src_pt_x, float* src_pt_y, float* dst_pt_x,
+                              float* dst_pt_y, global const float* x_src,
+                              global const float* y_src,
+                              global const float* x_dst,
+                              global const float* y_dst,
+                              global const float* rnd, KParam rInfo, int i) {
+    const unsigned ridx = rInfo.dims[0] * i;
+    unsigned r[4]       = {(unsigned)rnd[ridx], (unsigned)rnd[ridx + 1],
+                     (unsigned)rnd[ridx + 2], (unsigned)rnd[ridx + 3]};
+
+    // If one of the points is repeated, it's a bad sample; still compute
+    // the homography to ensure all threads pass barrier()
+    int bad = (r[0] == r[1] || r[0] == r[2] || r[0] == r[3] || r[1] == r[2] ||
+               r[1] == r[3] || r[2] == r[3]);
+
+    for (unsigned j = 0; j < 4; j++) {
+        src_pt_x[j] = x_src[r[j]];
+        src_pt_y[j] = y_src[r[j]];
+        dst_pt_x[j] = x_dst[r[j]];
+        dst_pt_y[j] = y_dst[r[j]];
+    }
+
+    *x_src_mean = (src_pt_x[0] + src_pt_x[1] + src_pt_x[2] + src_pt_x[3]) / 4.f;
+    *y_src_mean = (src_pt_y[0] + src_pt_y[1] + src_pt_y[2] + src_pt_y[3]) / 4.f;
+    *x_dst_mean = (dst_pt_x[0] + dst_pt_x[1] + dst_pt_x[2] + dst_pt_x[3]) / 4.f;
+    *y_dst_mean = (dst_pt_y[0] + dst_pt_y[1] + dst_pt_y[2] + dst_pt_y[3]) / 4.f;
+
+    float src_var = 0.0f, dst_var = 0.0f;
+    for (unsigned j = 0; j < 4; j++) {
+        src_var +=
+            sq(src_pt_x[j] - *x_src_mean) + sq(src_pt_y[j] - *y_src_mean);
+        dst_var +=
+            sq(dst_pt_x[j] - *x_dst_mean) + sq(dst_pt_y[j] - *y_dst_mean);
+    }
+
+    src_var /= 4.f;
+    dst_var /= 4.f;
+
+    *src_scale = sqrt(2.0f) / sqrt(src_var);
+    *dst_scale = sqrt(2.0f) / sqrt(dst_var);
+
+    return bad;
+}
+
+#define LSPTR(Z, Y, X) (l_S[(Z)*81 + (Y)*9 + (X)])
+
+kernel void compute_homography(global T* H, KParam HInfo,
+                               global const float* x_src,
+                               global const float* y_src,
+                               global const float* x_dst,
+                               global const float* y_dst,
+                               global const float* rnd, KParam rInfo,
+                               const unsigned iterations) {
+    unsigned i     = get_global_id(1);
+    unsigned tid_y = get_local_id(1);
+    unsigned tid_x = get_local_id(0);
+
+    float x_src_mean, y_src_mean;
+    float x_dst_mean, y_dst_mean;
+    float src_scale, dst_scale;
+    float src_pt_x[4], src_pt_y[4], dst_pt_x[4], dst_pt_y[4];
+
+    int bad =
+        compute_mean_scale(&x_src_mean, &y_src_mean, &x_dst_mean, &y_dst_mean,
+                           &src_scale, &dst_scale, src_pt_x, src_pt_y, dst_pt_x,
+                           dst_pt_y, x_src, y_src, x_dst, y_dst, rnd, rInfo, i);
+
+    local T l_acc1[256];
+    local T l_acc2[256];
+
+    local T l_S[16 * 81];
+    local T l_V[16 * 81];
+    local T l_d[16 * 9];
+
+    // Compute input matrix
+    if (tid_x < 4) {
+        float srcx = (src_pt_x[tid_x] - x_src_mean) * src_scale;
+        float srcy = (src_pt_y[tid_x] - y_src_mean) * src_scale;
+        float dstx = (dst_pt_x[tid_x] - x_dst_mean) * dst_scale;
+        float dsty = (dst_pt_y[tid_x] - y_dst_mean) * dst_scale;
+
+        LSPTR(tid_y, 0, tid_x * 2) = 0.0f;
+        LSPTR(tid_y, 1, tid_x * 2) = 0.0f;
+        LSPTR(tid_y, 2, tid_x * 2) = 0.0f;
+        LSPTR(tid_y, 3, tid_x * 2) = -srcx;
+        LSPTR(tid_y, 4, tid_x * 2) = -srcy;
+        LSPTR(tid_y, 5, tid_x * 2) = -1.0f;
+        LSPTR(tid_y, 6, tid_x * 2) = dsty * srcx;
+        LSPTR(tid_y, 7, tid_x * 2) = dsty * srcy;
+        LSPTR(tid_y, 8, tid_x * 2) = dsty;
+
+        LSPTR(tid_y, 0, tid_x * 2 + 1) = srcx;
+        LSPTR(tid_y, 1, tid_x * 2 + 1) = srcy;
+        LSPTR(tid_y, 2, tid_x * 2 + 1) = 1.0f;
+        LSPTR(tid_y, 3, tid_x * 2 + 1) = 0.0f;
+        LSPTR(tid_y, 4, tid_x * 2 + 1) = 0.0f;
+        LSPTR(tid_y, 5, tid_x * 2 + 1) = 0.0f;
+        LSPTR(tid_y, 6, tid_x * 2 + 1) = -dstx * srcx;
+        LSPTR(tid_y, 7, tid_x * 2 + 1) = -dstx * srcy;
+        LSPTR(tid_y, 8, tid_x * 2 + 1) = -dstx;
+
+        if (tid_x == 0) {
+            LSPTR(tid_y, 0, 8) = 0.0f;
+            LSPTR(tid_y, 1, 8) = 0.0f;
+            LSPTR(tid_y, 2, 8) = 0.0f;
+            LSPTR(tid_y, 3, 8) = 0.0f;
+            LSPTR(tid_y, 4, 8) = 0.0f;
+            LSPTR(tid_y, 5, 8) = 0.0f;
+            LSPTR(tid_y, 6, 8) = 0.0f;
+            LSPTR(tid_y, 7, 8) = 0.0f;
+            LSPTR(tid_y, 8, 8) = 0.0f;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    jacobi_svd(l_V, l_S, l_d, l_acc1, l_acc2, 9, 9);
+
+    if (i < HInfo.dims[1] && tid_x == 0) {
+        T vH[9], H_tmp[9];
+        for (unsigned j = 0; j < 9; j++) vH[j] = l_V[tid_y * 81 + 8 * 9 + j];
+
+        H_tmp[0] =
+            src_scale * x_dst_mean * vH[6] + src_scale * vH[0] / dst_scale;
+        H_tmp[1] =
+            src_scale * x_dst_mean * vH[7] + src_scale * vH[1] / dst_scale;
+        H_tmp[2] = x_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                                 src_scale * x_src_mean * vH[6]) +
+                   (vH[2] - src_scale * y_src_mean * vH[1] -
+                    src_scale * x_src_mean * vH[0]) /
+                       dst_scale;
+
+        H_tmp[3] =
+            src_scale * y_dst_mean * vH[6] + src_scale * vH[3] / dst_scale;
+        H_tmp[4] =
+            src_scale * y_dst_mean * vH[7] + src_scale * vH[4] / dst_scale;
+        H_tmp[5] = y_dst_mean * (vH[8] - src_scale * y_src_mean * vH[7] -
+                                 src_scale * x_src_mean * vH[6]) +
+                   (vH[5] - src_scale * y_src_mean * vH[4] -
+                    src_scale * x_src_mean * vH[3]) /
+                       dst_scale;
+
+        H_tmp[6] = src_scale * vH[6];
+        H_tmp[7] = src_scale * vH[7];
+        H_tmp[8] = vH[8] - src_scale * y_src_mean * vH[7] -
+                   src_scale * x_src_mean * vH[6];
+
+        const unsigned Hidx = HInfo.dims[0] * i;
+        global T* H_ptr   = H + Hidx;
+        for (int h = 0; h < 9; h++) H_ptr[h] = bad ? 0 : H_tmp[h];
+    }
+}
+
+#undef LSPTR
+
+// LMedS:
+// http://research.microsoft.com/en-us/um/people/zhang/INRIA/Publis/Tutorial-Estim/node25.html
+kernel void eval_homography(
+    global unsigned* inliers, global unsigned* idx, global T* H,
+    KParam HInfo, global float* err, KParam eInfo,
+    global const float* x_src, global const float* y_src,
+    global const float* x_dst, global const float* y_dst,
+    global const float* rnd, const unsigned iterations,
+    const unsigned nsamples, const float inlier_thr) {
+    unsigned tid_x = get_local_id(0);
+    unsigned i     = get_global_id(0);
+
+    local unsigned l_inliers[256];
+    local unsigned l_idx[256];
+
+    l_inliers[tid_x] = 0;
+    l_idx[tid_x]     = 0;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (i < iterations) {
+        const unsigned Hidx = HInfo.dims[0] * i;
+        global T* H_ptr   = H + Hidx;
+        T H_tmp[9];
+        for (int h = 0; h < 9; h++) H_tmp[h] = H_ptr[h];
+
+#ifdef RANSAC
+        // Compute inliers
+        unsigned inliers_count = 0;
+        for (unsigned j = 0; j < nsamples; j++) {
+            float z = H_tmp[6] * x_src[j] + H_tmp[7] * y_src[j] + H_tmp[8];
+            float x =
+                (H_tmp[0] * x_src[j] + H_tmp[1] * y_src[j] + H_tmp[2]) / z;
+            float y =
+                (H_tmp[3] * x_src[j] + H_tmp[4] * y_src[j] + H_tmp[5]) / z;
+
+            float dist = sq(x_dst[j] - x) + sq(y_dst[j] - y);
+            if (dist < inlier_thr * inlier_thr) inliers_count++;
+        }
+
+        l_inliers[tid_x] = inliers_count;
+        l_idx[tid_x]     = i;
+#endif
+#ifdef LMEDS
+        // Compute error
+        for (unsigned j = 0; j < nsamples; j++) {
+            float z = H_tmp[6] * x_src[j] + H_tmp[7] * y_src[j] + H_tmp[8];
+            float x =
+                (H_tmp[0] * x_src[j] + H_tmp[1] * y_src[j] + H_tmp[2]) / z;
+            float y =
+                (H_tmp[3] * x_src[j] + H_tmp[4] * y_src[j] + H_tmp[5]) / z;
+
+            float dist                 = sq(x_dst[j] - x) + sq(y_dst[j] - y);
+            err[i * eInfo.dims[0] + j] = sqrt(dist);
+        }
+#endif
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+#ifdef RANSAC
+    unsigned bid_x = get_group_id(0);
+
+    // Find sample with most inliers
+    for (unsigned tx = 128; tx > 0; tx >>= 1) {
+        if (tid_x < tx) {
+            if (l_inliers[tid_x + tx] > l_inliers[tid_x]) {
+                l_inliers[tid_x] = l_inliers[tid_x + tx];
+                l_idx[tid_x]     = l_idx[tid_x + tx];
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (tid_x == 0) {
+        inliers[bid_x] = l_inliers[0];
+        idx[bid_x]     = l_idx[0];
+    }
+#endif
+}
+
+kernel void compute_median(global float* median, global unsigned* idx,
+                           global const float* err, KParam eInfo,
+                           const unsigned iterations) {
+    const unsigned tid = get_local_id(0);
+    const unsigned bid = get_group_id(0);
+    const unsigned i   = get_global_id(0);
+
+    local float l_median[256];
+    local unsigned l_idx[256];
+
+    l_median[tid] = FLT_MAX;
+    l_idx[tid]    = 0;
+
+    if (i < iterations) {
+        const int nsamples = eInfo.dims[0];
+        float m            = err[i * nsamples + nsamples / 2];
+        if (nsamples % 2 == 0)
+            m = (m + err[i * nsamples + nsamples / 2 - 1]) * 0.5f;
+
+        l_idx[tid]    = i;
+        l_median[tid] = m;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) {
+            if (l_median[tid + t] < l_median[tid]) {
+                l_median[tid] = l_median[tid + t];
+                l_idx[tid]    = l_idx[tid + t];
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    median[bid] = l_median[0];
+    idx[bid]    = l_idx[0];
+}
+
+#define DIVUP(A, B) (((A) + (B)-1) / (B))
+
+kernel void find_min_median(global float* minMedian,
+                            global unsigned* minIdx,
+                            global const float* median, KParam mInfo,
+                            global const unsigned* idx) {
+    const unsigned tid = get_local_id(0);
+
+    local float l_minMedian[256];
+    local unsigned l_minIdx[256];
+
+    l_minMedian[tid] = FLT_MAX;
+    l_minIdx[tid]    = 0;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    const int loop = DIVUP(mInfo.dims[0], get_local_size(0));
+
+    for (int i = 0; i < loop; i++) {
+        int j = i * get_local_size(0) + tid;
+        if (j < mInfo.dims[0] && median[j] < l_minMedian[tid]) {
+            l_minMedian[tid] = median[j];
+            l_minIdx[tid]    = idx[j];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) {
+            if (l_minMedian[tid + t] < l_minMedian[tid]) {
+                l_minMedian[tid] = l_minMedian[tid + t];
+                l_minIdx[tid]    = l_minIdx[tid + t];
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    *minMedian = l_minMedian[0];
+    *minIdx    = l_minIdx[0];
+}
+
+#undef DIVUP
+
+kernel void compute_lmeds_inliers(
+    global unsigned* inliers, global const T* H,
+    global const float* x_src, global const float* y_src,
+    global const float* x_dst, global const float* y_dst,
+    const float minMedian, const unsigned nsamples) {
+    unsigned tid = get_local_id(0);
+    unsigned bid = get_group_id(0);
+    unsigned i   = get_global_id(0);
+
+    local T l_H[9];
+    local unsigned l_inliers[256];
+
+    l_inliers[tid] = 0;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid < 9) l_H[tid] = H[tid];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float sigma = fmax(
+        1.4826f * (1 + 5.f / (nsamples - 4)) * (float)sqrt(minMedian), 1e-6f);
+    float dist_thr = sq(2.5f * sigma);
+
+    if (i < nsamples) {
+        float z = l_H[6] * x_src[i] + l_H[7] * y_src[i] + l_H[8];
+        float x = (l_H[0] * x_src[i] + l_H[1] * y_src[i] + l_H[2]) / z;
+        float y = (l_H[3] * x_src[i] + l_H[4] * y_src[i] + l_H[5]) / z;
+
+        float dist = sq(x_dst[i] - x) + sq(y_dst[i] - y);
+        if (dist <= dist_thr) l_inliers[tid] = 1;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    for (unsigned t = 128; t > 0; t >>= 1) {
+        if (tid < t) l_inliers[tid] += l_inliers[tid + t];
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    inliers[bid] = l_inliers[0];
+}
diff --git a/src/backend/opencl/kernel/homography.hpp b/src/backend/opencl/kernel/homography.hpp
new file mode 100644
index 0000000000..4c785b57a1
--- /dev/null
+++ b/src/backend/opencl/kernel/homography.hpp
@@ -0,0 +1,217 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/ireduce.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/sort.hpp>
+#include <kernel_headers/homography.hpp>
+#include <memory.hpp>
+#include <af/defines.h>
+
+#include <limits>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+constexpr int HG_THREADS_X = 16;
+constexpr int HG_THREADS_Y = 16;
+constexpr int HG_THREADS   = 256;
+
+template<typename T>
+std::array<Kernel, 5> getHomographyKernels(const af_homography_type htype) {
+    std::array<TemplateArg, 2> targs = {TemplateTypename<T>(),
+                                        TemplateArg(htype)};
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>(),
+        DefineKeyValue(EPS, (std::is_same<T, double>::value
+                                 ? std::numeric_limits<double>::epsilon()
+                                 : std::numeric_limits<float>::epsilon()))};
+    if (htype == AF_HOMOGRAPHY_RANSAC) {
+        options.emplace_back(DefineKey(RANSAC));
+    }
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        options.emplace_back(DefineKey(LMEDS));
+    }
+    if (getActiveDeviceType() == CL_DEVICE_TYPE_CPU) {
+        options.emplace_back(DefineKey(IS_CPU));
+    }
+    return {
+        common::getKernel("compute_homography", {{homography_cl_src}}, targs,
+                          options),
+        common::getKernel("eval_homography", {{homography_cl_src}}, targs,
+                          options),
+        common::getKernel("compute_median", {{homography_cl_src}}, targs,
+                          options),
+        common::getKernel("find_min_median", {{homography_cl_src}}, targs,
+                          options),
+        common::getKernel("compute_lmeds_inliers", {{homography_cl_src}}, targs,
+                          options),
+    };
+}
+
+template<typename T>
+int computeH(Param bestH, Param H, Param err, Param x_src, Param y_src,
+             Param x_dst, Param y_dst, Param rnd, const unsigned iterations,
+             const unsigned nsamples, const float inlier_thr,
+             const af_homography_type htype) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto kernels = getHomographyKernels<T>(htype);
+    auto chOp    = kernels[0];
+    auto ehOp    = kernels[1];
+    auto cmOp    = kernels[2];
+    auto fmOp    = kernels[3];
+    auto clOp    = kernels[4];
+
+    const int blk_x_ch = 1;
+    const int blk_y_ch = divup(iterations, HG_THREADS_Y);
+    const NDRange local_ch(HG_THREADS_X, HG_THREADS_Y);
+    const NDRange global_ch(blk_x_ch * HG_THREADS_X, blk_y_ch * HG_THREADS_Y);
+
+    // Build linear system and solve SVD
+    chOp(EnqueueArgs(getQueue(), global_ch, local_ch), *H.data, H.info,
+         *x_src.data, *y_src.data, *x_dst.data, *y_dst.data, *rnd.data,
+         rnd.info, iterations);
+    CL_DEBUG_FINISH(getQueue());
+
+    const int blk_x_eh = divup(iterations, HG_THREADS);
+    const NDRange local_eh(HG_THREADS);
+    const NDRange global_eh(blk_x_eh * HG_THREADS);
+
+    // Allocate some temporary buffers
+    Param inliers, idx, median;
+    inliers.info.offset = idx.info.offset = median.info.offset = 0;
+    inliers.info.dims[0]    = (htype == AF_HOMOGRAPHY_RANSAC)
+                                  ? blk_x_eh
+                                  : divup(nsamples, HG_THREADS);
+    inliers.info.strides[0] = 1;
+    idx.info.dims[0] = median.info.dims[0] = blk_x_eh;
+    idx.info.strides[0] = median.info.strides[0] = 1;
+    for (int k = 1; k < 4; k++) {
+        inliers.info.dims[k] = 1;
+        inliers.info.strides[k] =
+            inliers.info.dims[k - 1] * inliers.info.strides[k - 1];
+        idx.info.dims[k] = median.info.dims[k] = 1;
+        idx.info.strides[k]                    = median.info.strides[k] =
+            idx.info.dims[k - 1] * idx.info.strides[k - 1];
+    }
+    idx.data =
+        bufferAlloc(idx.info.dims[3] * idx.info.strides[3] * sizeof(unsigned));
+    inliers.data = bufferAlloc(inliers.info.dims[3] * inliers.info.strides[3] *
+                               sizeof(unsigned));
+    if (htype == AF_HOMOGRAPHY_LMEDS)
+        median.data = bufferAlloc(median.info.dims[3] * median.info.strides[3] *
+                                  sizeof(float));
+    else
+        median.data = bufferAlloc(sizeof(float));
+
+    // Evaluate homographies (RANSAC: count inliers; LMedS: per-sample error)
+    ehOp(EnqueueArgs(getQueue(), global_eh, local_eh), *inliers.data, *idx.data,
+         *H.data, H.info, *err.data, err.info, *x_src.data, *y_src.data,
+         *x_dst.data, *y_dst.data, *rnd.data, iterations, nsamples, inlier_thr);
+    CL_DEBUG_FINISH(getQueue());
+
+    unsigned inliersH, idxH;
+    if (htype == AF_HOMOGRAPHY_LMEDS) {
+        // TODO: Improve this sorting; if the number of iterations is
+        // sufficiently large, this can be *very* slow
+        kernel::sort0<float>(err, true);
+
+        unsigned minIdx;
+        float minMedian;
+
+        // Compute median of every iteration
+        cmOp(EnqueueArgs(getQueue(), global_eh, local_eh), *median.data,
+             *idx.data, *err.data, err.info, iterations);
+        CL_DEBUG_FINISH(getQueue());
+
+        // Reduce medians, only in case iterations > 256
+        if (blk_x_eh > 1) {
+            const NDRange local_fm(HG_THREADS);
+            const NDRange global_fm(HG_THREADS);
+
+            Buffer* finalMedian = bufferAlloc(sizeof(float));
+            Buffer* finalIdx    = bufferAlloc(sizeof(unsigned));
+
+            fmOp(EnqueueArgs(getQueue(), global_fm, local_fm), *finalMedian,
+                 *finalIdx, *median.data, median.info, *idx.data);
+            CL_DEBUG_FINISH(getQueue());
+
+            getQueue().enqueueReadBuffer(*finalMedian, CL_TRUE, 0,
+                                         sizeof(float), &minMedian);
+            getQueue().enqueueReadBuffer(*finalIdx, CL_TRUE, 0,
+                                         sizeof(unsigned), &minIdx);
+
+            bufferFree(finalMedian);
+            bufferFree(finalIdx);
+        } else {
+            getQueue().enqueueReadBuffer(*median.data, CL_TRUE, 0,
+                                         sizeof(float), &minMedian);
+            getQueue().enqueueReadBuffer(*idx.data, CL_TRUE, 0,
+                                         sizeof(unsigned), &minIdx);
+        }
+
+        // Copy best homography to output
+        getQueue().enqueueCopyBuffer(*H.data, *bestH.data,
+                                     minIdx * 9 * sizeof(T), 0, 9 * sizeof(T));
+
+        const int blk_x_cl = divup(nsamples, HG_THREADS);
+        const NDRange local_cl(HG_THREADS);
+        const NDRange global_cl(blk_x_cl * HG_THREADS);
+
+        clOp(EnqueueArgs(getQueue(), global_cl, local_cl), *inliers.data,
+             *bestH.data, *x_src.data, *y_src.data, *x_dst.data, *y_dst.data,
+             minMedian, nsamples);
+        CL_DEBUG_FINISH(getQueue());
+
+        // Adds up the total number of inliers
+        Param totalInliers;
+        totalInliers.info.offset = 0;
+        for (int k = 0; k < 4; k++)
+            totalInliers.info.dims[k] = totalInliers.info.strides[k] = 1;
+        totalInliers.data = bufferAlloc(sizeof(unsigned));
+
+        kernel::reduce<unsigned, unsigned, af_add_t>(totalInliers, inliers, 0,
+                                                     false, 0.0);
+
+        getQueue().enqueueReadBuffer(*totalInliers.data, CL_TRUE, 0,
+                                     sizeof(unsigned), &inliersH);
+
+        bufferFree(totalInliers.data);
+    } else /* if (htype == AF_HOMOGRAPHY_RANSAC) */ {
+        unsigned blockIdx;
+        inliersH = kernel::ireduceAll<unsigned, af_max_t>(&blockIdx, inliers);
+
+        // Copies back index and number of inliers of best homography estimation
+        getQueue().enqueueReadBuffer(*idx.data, CL_TRUE,
+                                     blockIdx * sizeof(unsigned),
+                                     sizeof(unsigned), &idxH);
+        getQueue().enqueueCopyBuffer(*H.data, *bestH.data, idxH * 9 * sizeof(T),
+                                     0, 9 * sizeof(T));
+    }
+
+    bufferFree(inliers.data);
+    bufferFree(idx.data);
+    bufferFree(median.data);
+
+    return (int)inliersH;
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/hsv_rgb.cl b/src/backend/opencl/kernel/hsv_rgb.cl
index a62ab6aea2..5fd7a060b4 100644
--- a/src/backend/opencl/kernel/hsv_rgb.cl
+++ b/src/backend/opencl/kernel/hsv_rgb.cl
@@ -7,24 +7,24 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-kernel
-void convert(global T * out, KParam oInfo, global const T * in, KParam iInfo, int nBBS)
-{
+kernel void hsvrgbConvert(global T* out, KParam oInfo, global const T* in,
+                          KParam iInfo, int nBBS) {
     // batch offsets
-    unsigned batchId = get_group_id(0) / nBBS;
-    global const T* src =  in + (batchId * iInfo.strides[3]);
-    global T*       dst = out + (batchId * oInfo.strides[3]);
+    unsigned batchId    = get_group_id(0) / nBBS;
+    global const T* src = in + (batchId * iInfo.strides[3]);
+    global T* dst       = out + (batchId * oInfo.strides[3]);
     // global indices
-    int gx = get_local_size(0) * (get_group_id(0)-batchId*nBBS) + get_local_id(0);
+    int gx = get_local_size(0) * (get_group_id(0) - batchId * nBBS) +
+             get_local_id(0);
     int gy = get_local_size(1) * get_group_id(1) + get_local_id(1);
 
     if (gx < oInfo.dims[0] && gy < oInfo.dims[1]) {
-
         int oIdx0 = gx + gy * oInfo.strides[1];
         int oIdx1 = oIdx0 + oInfo.strides[2];
         int oIdx2 = oIdx1 + oInfo.strides[2];
 
-        int iIdx0 = gx * iInfo.strides[0] + gy * iInfo.strides[1];
+        int iIdx0 =
+            gx * iInfo.strides[0] + gy * iInfo.strides[1] + iInfo.offset;
         int iIdx1 = iIdx0 + iInfo.strides[2];
         int iIdx2 = iIdx1 + iInfo.strides[2];
 
@@ -36,11 +36,11 @@ void convert(global T * out, KParam oInfo, global const T * in, KParam iInfo, in
         T R, G, B;
         R = G = B = 0;
 
-        int   i = (int)(H * 6);
-        T f = H * 6 - i;
-        T p = V * (1 - S);
-        T q = V * (1 - f * S);
-        T t = V * (1 - (1 - f) * S);
+        int i = (int)(H * 6);
+        T f   = H * 6 - i;
+        T p   = V * (1 - S);
+        T q   = V * (1 - f * S);
+        T t   = V * (1 - (1 - f) * S);
 
         switch (i % 6) {
             case 0: R = V, G = t, B = p; break;
@@ -55,24 +55,24 @@ void convert(global T * out, KParam oInfo, global const T * in, KParam iInfo, in
         dst[oIdx1] = G;
         dst[oIdx2] = B;
 #else
-        T R = src[iIdx0];
-        T G = src[iIdx1];
-        T B = src[iIdx2];
-        T Cmax = fmax(fmax(R, G), B);
-        T Cmin = fmin(fmin(R, G), B);
-        T delta= Cmax-Cmin;
+        T R     = src[iIdx0];
+        T G     = src[iIdx1];
+        T B     = src[iIdx2];
+        T Cmax  = fmax(fmax(R, G), B);
+        T Cmin  = fmin(fmin(R, G), B);
+        T delta = Cmax - Cmin;
 
         T H = 0;
 
-        if (Cmax!=Cmin) {
-            if (Cmax==R) H = (G-B)/delta + (G<B ? 6 : 0);
-            if (Cmax==G) H = (B-R)/delta + 2;
-            if (Cmax==B) H = (R-G)/delta + 4;
+        if (Cmax != Cmin) {
+            if (Cmax == R) H = (G - B) / delta + (G < B ? 6 : 0);
+            if (Cmax == G) H = (B - R) / delta + 2;
+            if (Cmax == B) H = (R - G) / delta + 4;
             H = H / 6.0f;
         }
 
         dst[oIdx0] = H;
-        dst[oIdx1] = Cmax==0.0f ? 0 : delta/Cmax;
+        dst[oIdx1] = Cmax == 0.0f ? 0 : delta / Cmax;
         dst[oIdx2] = Cmax;
 #endif
     }
diff --git a/src/backend/opencl/kernel/hsv_rgb.hpp b/src/backend/opencl/kernel/hsv_rgb.hpp
index 7f28fb0a8f..4ca85a4f74 100644
--- a/src/backend/opencl/kernel/hsv_rgb.hpp
+++ b/src/backend/opencl/kernel/hsv_rgb.hpp
@@ -1,87 +1,58 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
 #pragma once
-#include <kernel_headers/hsv_rgb.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/hsv_rgb.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename T, bool isHSV2RGB>
-void hsv2rgb_convert(Param out, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  hrProgs;
-        static std::map<int, Kernel*> hrKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
-
-                if(isHSV2RGB) options << " -D isHSV2RGB";
-
-                if (std::is_same<T, double>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-                Program prog;
-                buildProgram(prog, hsv_rgb_cl, hsv_rgb_cl_len, options.str());
-                hrProgs[device]   = new Program(prog);
-                hrKernels[device] = new Kernel(*hrProgs[device], "convert");
-            });
+#include <string>
+#include <vector>
 
-        NDRange local(THREADS_X, THREADS_Y);
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
+template<typename T>
+void hsv2rgb_convert(Param out, const Param in, bool isHSV2RGB) {
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
 
-        // all images are three channels, so batch
-        // parameter would be along 4th dimension
-        NDRange global(blk_x * in.info.dims[3] * THREADS_X, blk_y * THREADS_Y);
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(isHSV2RGB),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
+    if (isHSV2RGB) { options.emplace_back(DefineKey(isHSV2RGB)); }
 
-        auto hsvrgbOp = make_kernel<Buffer, KParam, Buffer, KParam, int> (*hrKernels[device]);
+    auto convert =
+        common::getKernel("hsvrgbConvert", {{hsv_rgb_cl_src}}, targs, options);
 
-        hsvrgbOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info, blk_x);
+    cl::NDRange local(THREADS_X, THREADS_Y);
 
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
 
-}
+    // all images are three channels, so batch
+    // parameter would be along 4th dimension
+    cl::NDRange global(blk_x * in.info.dims[3] * THREADS_X, blk_y * THREADS_Y);
 
+    convert(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+            *in.data, in.info, blk_x);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/identity.cl b/src/backend/opencl/kernel/identity.cl
index 0c71099a9b..383aee601b 100644
--- a/src/backend/opencl/kernel/identity.cl
+++ b/src/backend/opencl/kernel/identity.cl
@@ -7,10 +7,8 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void identity_kernel(__global T *oData, KParam oInfo, int groups_x, int groups_y)
-{
-
+kernel void identity_kernel(global T *oData, KParam oInfo, int groups_x,
+                            int groups_y) {
     unsigned idz = get_group_id(0) / groups_x;
     unsigned idw = get_group_id(1) / groups_y;
 
@@ -20,13 +18,11 @@ void identity_kernel(__global T *oData, KParam oInfo, int groups_x, int groups_y
     unsigned idx = get_local_id(0) + groupId_x * get_local_size(0);
     unsigned idy = get_local_id(1) + groupId_y * get_local_size(1);
 
-    if(idx >= oInfo.dims[0] ||
-       idy >= oInfo.dims[1] ||
-       idz >= oInfo.dims[2] ||
-       idw >= oInfo.dims[3])
+    if (idx >= oInfo.dims[0] || idy >= oInfo.dims[1] || idz >= oInfo.dims[2] ||
+        idw >= oInfo.dims[3])
         return;
 
-    __global T *ptr = oData + idz * oInfo.strides[2] + idw * oInfo.strides[3];
-    T val = (idx == idy) ? ONE : ZERO;
+    global T *ptr = oData + idz * oInfo.strides[2] + idw * oInfo.strides[3];
+    T val         = (idx == idy) ? (T)(ONE) : (T)(ZERO);
     ptr[idx + idy * oInfo.strides[1]] = val;
 }
diff --git a/src/backend/opencl/kernel/identity.hpp b/src/backend/opencl/kernel/identity.hpp
index 8991eb22ec..32186164ef 100644
--- a/src/backend/opencl/kernel/identity.hpp
+++ b/src/backend/opencl/kernel/identity.hpp
@@ -7,76 +7,50 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <kernel_headers/identity.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+#pragma once
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <map>
-#include <mutex>
+#include <kernel/config.hpp>
+#include <kernel_headers/identity.hpp>
 #include <math.hpp>
-#include "config.hpp"
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using std::ostringstream;
-using af::scalar_to_option;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-    template<typename T>
-    static void identity(Param out)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>   identityProgs;
-            static std::map<int, Kernel*>  identityKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once( compileFlags[device], [device] () {
-                    ostringstream options;
-                    options << " -D T="    << dtype_traits<T>::getName()
-                            << " -D ONE=(T)("  << scalar_to_option(scalar<T>(1)) << ")"
-                            << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")";
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, identity_cl, identity_cl_len, options.str());
-                    identityProgs[device]   = new Program(prog);
-                    identityKernels[device] = new Kernel(*identityProgs[device], "identity_kernel");
-                });
-
-            NDRange local(32, 8);
-            int groups_x = divup(out.info.dims[0], local[0]);
-            int groups_y = divup(out.info.dims[1], local[1]);
-            NDRange global(groups_x * out.info.dims[2] * local[0],
-                           groups_y * out.info.dims[3] * local[1]);
-
-            auto identityOp = make_kernel<Buffer, const KParam,
-                                          int, int> (*identityKernels[device]);
-
-            identityOp(EnqueueArgs(getQueue(), global, local),
-                       *(out.data), out.info, groups_x, groups_y);
-            CL_DEBUG_FINISH(getQueue());
-
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-        }
-    }
+#include <traits.hpp>
 
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+static void identity(Param out) {
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ONE, scalar_to_option(scalar<T>(1))),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        getTypeBuildDefinition<T>()};
+
+    auto identityOp = common::getKernel("identity_kernel", {{identity_cl_src}},
+                                        targs, options);
+
+    cl::NDRange local(32, 8);
+    int groups_x = divup(out.info.dims[0], local[0]);
+    int groups_y = divup(out.info.dims[1], local[1]);
+    cl::NDRange global(groups_x * out.info.dims[2] * local[0],
+                       groups_y * out.info.dims[3] * local[1]);
+
+    identityOp(cl::EnqueueArgs(getQueue(), global, local), *(out.data),
+               out.info, groups_x, groups_y);
+    CL_DEBUG_FINISH(getQueue());
 }
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/iir.cl b/src/backend/opencl/kernel/iir.cl
index ce970d705d..0292c6ba36 100644
--- a/src/backend/opencl/kernel/iir.cl
+++ b/src/backend/opencl/kernel/iir.cl
@@ -14,26 +14,23 @@
 #endif
 
 #if CPLX
-T __mul(T lhs, T rhs)
-{
+T __mul(T lhs, T rhs) {
     T out;
     out.x = lhs.x * rhs.x - lhs.y * rhs.y;
     out.y = lhs.x * rhs.y + lhs.y * rhs.x;
     return out;
 }
 
-T __cconjf(T in)
-{
+T __cconjf(T in) {
     T out = {in.x, -in.y};
     return out;
 }
 
 // FIXME: overflow / underflow issues
-T __div(T lhs, T rhs)
-{
+T __div(T lhs, T rhs) {
     T out;
     TR den = (rhs.x * rhs.x + rhs.y * rhs.y);
-    T num = __mul(lhs, __cconjf(rhs));
+    T num  = __mul(lhs, __cconjf(rhs));
 
     out.x = num.x / den;
     out.y = num.y / den;
@@ -41,52 +38,52 @@ T __div(T lhs, T rhs)
     return out;
 }
 #else
-#define __mul(lhs, rhs) ((lhs)*(rhs))
-#define __div(lhs, rhs) ((lhs)/(rhs))
+#define __mul(lhs, rhs) ((lhs) * (rhs))
+#define __div(lhs, rhs) ((lhs) / (rhs))
 #endif
 
-__kernel
-void iir_kernel(      __global T *yptr, const KParam yinfo,
-                const __global T *cptr, const KParam cinfo,
-                const __global T *aptr, const KParam ainfo,
-                const int groups_y)
-{
-    __local T s_z[MAX_A_SIZE];
-    __local T s_a[MAX_A_SIZE];
-    __local T s_y;
+kernel void iir_kernel(global T *yptr, const KParam yinfo,
+                         const global T *cptr, const KParam cinfo,
+                         const global T *aptr, const KParam ainfo,
+                         const int groups_y) {
+    local T s_z[MAX_A_SIZE];
+    local T s_a[MAX_A_SIZE];
+    local T s_y;
 
     const int idz = get_group_id(0);
     const int idw = get_group_id(1) / groups_y;
     const int idy = get_group_id(1) - idw * groups_y;
 
-    const int tx = get_local_id(0);
+    const int tx    = get_local_id(0);
     const int num_a = ainfo.dims[0];
 
-    int y_off = idw * yinfo.strides[3] + idz * yinfo.strides[2] + idy * yinfo.strides[1];
-    int c_off = idw * cinfo.strides[3] + idz * cinfo.strides[2] + idy * cinfo.strides[1];
+    int y_off = idw * yinfo.strides[3] + idz * yinfo.strides[2] +
+                idy * yinfo.strides[1];
+    int c_off = idw * cinfo.strides[3] + idz * cinfo.strides[2] +
+                idy * cinfo.strides[1];
 
 #if BATCH_A
-    int a_off = idw * ainfo.strides[3] + idz * ainfo.strides[2] + idy * ainfo.strides[1];
+    int a_off = idw * ainfo.strides[3] + idz * ainfo.strides[2] +
+                idy * ainfo.strides[1];
 #else
     int a_off = 0;
 #endif
 
-    __global T *d_y = yptr + y_off;
-    const __global T *d_c = cptr + c_off;
-    const __global T *d_a = aptr + a_off;
-    const int repeat = (num_a + get_local_size(0) - 1) / get_local_size(0);
+    global T *d_y       = yptr + y_off;
+    const global T *d_c = cptr + c_off + cinfo.offset;
+    const global T *d_a = aptr + a_off + ainfo.offset;
+    const int repeat      = (num_a + get_local_size(0) - 1) / get_local_size(0);
 
     for (int ii = 0; ii < MAX_A_SIZE / get_local_size(0); ii++) {
-        int id = ii * get_local_size(0) + tx;
+        int id  = ii * get_local_size(0) + tx;
         s_z[id] = ZERO;
         s_a[id] = (id < num_a) ? d_a[id] : ZERO;
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
-
     for (int i = 0; i < yinfo.dims[0]; i++) {
         if (tx == 0) {
-            s_y = __div((d_c[i] + s_z[0]), s_a[0]);
+            s_y    = __div((d_c[i] + s_z[0]), s_a[0]);
             d_y[i] = s_y;
         }
         barrier(CLK_LOCAL_MEM_FENCE);
@@ -94,7 +91,7 @@ void iir_kernel(      __global T *yptr, const KParam yinfo,
         for (int ii = 0; ii < repeat; ii++) {
             int id = ii * get_local_size(0) + tx + 1;
 
-            T z = s_z[id] - __mul(s_a[id],  s_y);
+            T z = s_z[id] - __mul(s_a[id], s_y);
             barrier(CLK_LOCAL_MEM_FENCE);
 
             s_z[id - 1] = z;
diff --git a/src/backend/opencl/kernel/iir.hpp b/src/backend/opencl/kernel/iir.hpp
index cbf1768935..34f9d2c0bf 100644
--- a/src/backend/opencl/kernel/iir.hpp
+++ b/src/backend/opencl/kernel/iir.hpp
@@ -8,91 +8,59 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/iir.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using af::scalar_to_option;
-
-namespace opencl
-{
-
-    namespace kernel
-    {
-        template<typename T, bool batch_a>
-        void iir(Param y, Param c, Param a)
-        {
-
-            //FIXME: This is a temporary fix. Ideally the local memory should be allocted outside
-            static const int MAX_A_SIZE = (1024 * sizeof(double)) / sizeof(T);
-
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>  iirProgs;
-            static std::map<int, Kernel*> iirKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once(compileFlags[device], [device] () {
-
-                    std::ostringstream options;
-                    options << " -D MAX_A_SIZE=" << MAX_A_SIZE
-                            << " -D BATCH_A=" << batch_a
-                            << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")"
-                            << " -D T=" << dtype_traits<T>::getName();
-
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    cl::Program prog;
-                    buildProgram(prog, iir_cl, iir_cl_len, options.str());
-                    iirProgs[device] = new Program(prog);
-
-                    iirKernels[device] = new Kernel(*iirProgs[device], "iir_kernel");
-                });
-
-
-            const int groups_y = y.info.dims[1];
-            const int groups_x = y.info.dims[2];
-
-            int threads = 256;
-            while (threads > (int)y.info.dims[0] && threads > 32) threads /= 2;
-
-
-            NDRange local(threads, 1);
-            NDRange global(groups_x * local[0],
-                           groups_y * y.info.dims[3] * local[1]);
-
-            auto iirOp = make_kernel<Buffer, KParam,
-                                     Buffer, KParam,
-                                     Buffer, KParam,
-                                     int>(*iirKernels[device]);
-
-            try {
-                iirOp(EnqueueArgs(getQueue(), global, local),
-                      *y.data, y.info, *c.data, c.info, *a.data, a.info, groups_y);
-            } catch(cl::Error &clerr) {
-                AF_ERROR("Size of a too big for this datatype",
-                         AF_ERR_SIZE);
-            }
-
-            CL_DEBUG_FINISH(getQueue());
-        }
+#include <kernel_headers/iir.hpp>
+#include <math.hpp>
+#include <traits.hpp>
 
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, bool batch_a>
+void iir(Param y, Param c, Param a) {
+    // FIXME: This is a temporary fix. Ideally the local memory should be
+    // allocated outside
+    constexpr int MAX_A_SIZE = (1024 * sizeof(double)) / sizeof(T);
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(batch_a),
+    };
+    std::array<std::string, 5> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(MAX_A_SIZE),
+        DefineKeyValue(BATCH_A, batch_a),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        getTypeBuildDefinition<T>()};
+
+    auto iir = common::getKernel("iir_kernel", {{iir_cl_src}}, targs, options);
+
+    const int groups_y = y.info.dims[1];
+    const int groups_x = y.info.dims[2];
+
+    int threads = 256;
+    while (threads > (int)y.info.dims[0] && threads > 32) threads /= 2;
+
+    cl::NDRange local(threads, 1);
+    cl::NDRange global(groups_x * local[0],
+                       groups_y * y.info.dims[3] * local[1]);
+
+    try {
+        iir(cl::EnqueueArgs(getQueue(), global, local), *y.data, y.info,
+            *c.data, c.info, *a.data, a.info, groups_y);
+    } catch (cl::Error& clerr) {
+        AF_ERROR("Size of a too big for this datatype", AF_ERR_SIZE);
     }
+    CL_DEBUG_FINISH(getQueue());
 }
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/index.cl b/src/backend/opencl/kernel/index.cl
index 2fe9287a5d..2cc3cb57fe 100644
--- a/src/backend/opencl/kernel/index.cl
+++ b/src/backend/opencl/kernel/index.cl
@@ -8,52 +8,63 @@
  ********************************************************/
 
 typedef struct {
-    int  offs[4];
+    int offs[4];
     int strds[4];
-    char     isSeq[4];
+    int steps[4];
+    char isSeq[4];
 } IndexKernelParam_t;
 
-int trimIndex(int idx, const int len)
-{
+int trimIndex(int idx, const int len) {
     int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
+    int offset  = abs(ret_val) % len;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
     }
     return ret_val;
 }
 
-kernel
-void indexKernel(global T * optr, KParam oInfo, global const T * iptr, KParam iInfo,
-                 const IndexKernelParam_t p, global const uint* ptr0,
-                 global const uint* ptr1, global const uint* ptr2,
-                 global const uint* ptr3, const int nBBS0, const int nBBS1)
-{
+kernel void indexKernel(global T* optr, KParam oInfo, global const T* iptr,
+                        KParam iInfo, const IndexKernelParam_t p,
+                        global const uint* ptr0, global const uint* ptr1,
+                        global const uint* ptr2, global const uint* ptr3,
+                        const int nBBS0, const int nBBS1) {
     // retrive booleans that tell us which index to use
     const bool s0 = p.isSeq[0];
     const bool s1 = p.isSeq[1];
     const bool s2 = p.isSeq[2];
     const bool s3 = p.isSeq[3];
 
-    const int gz = get_group_id(0)/nBBS0;
-    const int gw = get_group_id(1)/nBBS1;
-    const int gx = get_local_size(0) * (get_group_id(0) - gz*nBBS0) + get_local_id(0);
-    const int gy = get_local_size(1) * (get_group_id(1) - gw*nBBS1) + get_local_id(1);
+    const int gz = get_group_id(0) / nBBS0;
+    const int gw = get_group_id(1) / nBBS1;
+    const int gx =
+        get_local_size(0) * (get_group_id(0) - gz * nBBS0) + get_local_id(0);
+    const int gy =
+        get_local_size(1) * (get_group_id(1) - gw * nBBS1) + get_local_id(1);
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1] && gz<oInfo.dims[2] && gw<oInfo.dims[3]) {
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1] && gz < oInfo.dims[2] &&
+        gw < oInfo.dims[3]) {
         // calculate pointer offsets for input
-        int i = p.strds[0] * trimIndex(s0 ? gx+p.offs[0] : ptr0[gx], iInfo.dims[0]);
-        int j = p.strds[1] * trimIndex(s1 ? gy+p.offs[1] : ptr1[gy], iInfo.dims[1]);
-        int k = p.strds[2] * trimIndex(s2 ? gz+p.offs[2] : ptr2[gz], iInfo.dims[2]);
-        int l = p.strds[3] * trimIndex(s3 ? gw+p.offs[3] : ptr3[gw], iInfo.dims[3]);
+        int i =
+            p.strds[0] * trimIndex(s0 ? gx * p.steps[0] + p.offs[0] : ptr0[gx],
+                                   iInfo.dims[0]);
+        int j =
+            p.strds[1] * trimIndex(s1 ? gy * p.steps[1] + p.offs[1] : ptr1[gy],
+                                   iInfo.dims[1]);
+        int k =
+            p.strds[2] * trimIndex(s2 ? gz * p.steps[2] + p.offs[2] : ptr2[gz],
+                                   iInfo.dims[2]);
+        int l =
+            p.strds[3] * trimIndex(s3 ? gw * p.steps[3] + p.offs[3] : ptr3[gw],
+                                   iInfo.dims[3]);
         // offset input and output pointers
-        global const T *src = iptr + (i+j+k+l);
-        global T *dst = optr + (gx*oInfo.strides[0]+
-                                gy*oInfo.strides[1]+
-                                gz*oInfo.strides[2]+
-                                gw*oInfo.strides[3]);
+        global const T* src = iptr + (i + j + k + l) + iInfo.offset;
+        global T* dst = optr + (gx * oInfo.strides[0] + gy * oInfo.strides[1] +
+                                gz * oInfo.strides[2] + gw * oInfo.strides[3] +
+                                oInfo.offset);
         // set the output
         dst[0] = src[0];
     }
diff --git a/src/backend/opencl/kernel/index.hpp b/src/backend/opencl/kernel/index.hpp
index f87960ea8a..5362a8e78b 100644
--- a/src/backend/opencl/kernel/index.hpp
+++ b/src/backend/opencl/kernel/index.hpp
@@ -8,86 +8,61 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/index.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/index.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
+#include <array>
+#include <string>
 
-static const int THREADS_X = 32;
-static const int THREADS_Y =  8;
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 typedef struct {
-    int  offs[4];
+    int offs[4];
     int strds[4];
-    char     isSeq[4];
+    int steps[4];
+    char isSeq[4];
 } IndexKernelParam_t;
 
 template<typename T>
-void index(Param out, const Param in, const IndexKernelParam_t& p, Buffer *bPtr[4])
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  idxProgs;
-        static std::map<int, Kernel*> idxKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-                }
-
-                Program prog;
-                buildProgram(prog, index_cl, index_cl_len, options.str());
-                idxProgs[device]   = new Program(prog);
-                idxKernels[device] = new Kernel(*idxProgs[device], "indexKernel");
-                });
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(out.info.dims[0], THREADS_X);
-        int blk_y = divup(out.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * out.info.dims[2] * THREADS_X,
-                blk_y * out.info.dims[3] * THREADS_Y);
-
-        auto indexOp = make_kernel<Buffer, KParam, Buffer, KParam, IndexKernelParam_t,
-             Buffer, Buffer, Buffer, Buffer, int, int>(*idxKernels[device]);
-
-        indexOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info, p,
-                *bPtr[0], *bPtr[1], *bPtr[2], *bPtr[3], blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+void index(Param out, const Param in, const IndexKernelParam_t& p,
+           cl::Buffer* bPtr[4]) {
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
+
+    auto index =
+        common::getKernel("indexKernel", {{index_cl_src}},
+                          TemplateArgs(TemplateTypename<T>()), options);
+    int threads_x = 256;
+    int threads_y = 1;
+    switch (out.info.dims[1]) {
+        case 1: threads_y = 1; break;
+        case 2: threads_y = 2; break;
+        case 3:
+        case 4: threads_y = 4; break;
+        default: threads_y = 8; break;
+    }
-}
+    threads_x = static_cast<unsigned>(256.f / threads_y);
+    cl::NDRange local(threads_x, threads_y);
 
-}
+    int blk_x = divup(out.info.dims[0], local[0]);
+    int blk_y = divup(out.info.dims[1], local[1]);
+
+    cl::NDRange global(blk_x * out.info.dims[2] * local[0],
+                       blk_y * out.info.dims[3] * local[1]);
 
+    index(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+          *in.data, in.info, p, *bPtr[0], *bPtr[1], *bPtr[2], *bPtr[3], blk_x,
+          blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/interp.cl b/src/backend/opencl/kernel/interp.cl
new file mode 100644
index 0000000000..8d7b8d8a82
--- /dev/null
+++ b/src/backend/opencl/kernel/interp.cl
@@ -0,0 +1,281 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if IS_CPLX
+#if USE_DOUBLE
+typedef double ScalarTy;
+#else
+typedef float ScalarTy;
+#endif
+InterpInTy __mulrc(ScalarTy s, InterpInTy v) {
+    InterpInTy out = {s * v.x, s * v.y};
+    return out;
+}
+#define MULRC(a, b) __mulrc(a, b)
+#define MULCR(a, b) __mulrc(b, a)
+#else
+#define MULRC(a, b) (a) * (b)
+#define MULCR(a, b) (a) * (b)
+#endif
+
+InterpValTy linearInterpFunc(InterpValTy val[2], InterpPosTy ratio) {
+    return MULRC((1 - ratio), val[0]) + MULRC(ratio, val[1]);
+}
+
+InterpValTy bilinearInterpFunc(InterpValTy val[2][2], InterpPosTy xratio,
+                               InterpPosTy yratio) {
+    InterpValTy res[2];
+    res[0] = linearInterpFunc(val[0], xratio);
+    res[1] = linearInterpFunc(val[1], xratio);
+    return linearInterpFunc(res, yratio);
+}
+
+InterpValTy cubicInterpFunc(InterpValTy val[4], InterpPosTy xratio,
+                            bool spline) {
+    InterpValTy a0, a1, a2, a3;
+    if (spline) {
+        a0 = MULRC((InterpPosTy)-0.5, val[0]) +
+             MULRC((InterpPosTy)1.5, val[1]) +
+             MULRC((InterpPosTy)-1.5, val[2]) + MULRC((InterpPosTy)0.5, val[3]);
+
+        a1 = MULRC((InterpPosTy)1.0, val[0]) +
+             MULRC((InterpPosTy)-2.5, val[1]) +
+             MULRC((InterpPosTy)2.0, val[2]) + MULRC((InterpPosTy)-0.5, val[3]);
+
+        a2 = MULRC((InterpPosTy)-0.5, val[0]) + MULRC((InterpPosTy)0.5, val[2]);
+
+        a3 = val[1];
+    } else {
+        a0 = val[3] - val[2] - val[0] + val[1];
+        a1 = val[0] - val[1] - a0;
+        a2 = val[2] - val[0];
+        a3 = val[1];
+    }
+
+    InterpPosTy xratio2 = xratio * xratio;
+    InterpPosTy xratio3 = xratio2 * xratio;
+
+    return MULCR(a0, xratio3) + MULCR(a1, xratio2) + MULCR(a2, xratio) + a3;
+}
+
+InterpValTy bicubicInterpFunc(InterpValTy val[4][4], InterpPosTy xratio,
+                              InterpPosTy yratio, bool spline) {
+    InterpValTy res[4];
+    res[0] = cubicInterpFunc(val[0], xratio, spline);
+    res[1] = cubicInterpFunc(val[1], xratio, spline);
+    res[2] = cubicInterpFunc(val[2], xratio, spline);
+    res[3] = cubicInterpFunc(val[3], xratio, spline);
+    return cubicInterpFunc(res, yratio, spline);
+}
+
+#if INTERP_ORDER == 1
+void interp1(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             int method, int batch, bool doclamp, int batch_dim) {
+    InterpInTy zero = ZERO;
+
+    const int x_lim    = in.dims[XDIM];
+    const int x_stride = in.strides[XDIM];
+
+    int xid   = (method == AF_INTERP_LOWER ? floor(x) : round(x));
+    bool cond = xid >= 0 && xid < x_lim;
+    if (doclamp) xid = max(0, min(xid, x_lim - 1));
+
+    const int idx = ioff + xid * x_stride;
+
+    for (int n = 0; n < batch; n++) {
+        int idx_n = idx + n * in.strides[batch_dim];
+        d_out[ooff + n * out.strides[batch_dim]] =
+            (doclamp || cond) ? d_in[idx_n] : zero;
+    }
+}
+#elif INTERP_ORDER == 2
+void interp1(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             int method, int batch, bool doclamp, int batch_dim) {
+    const int grid_x        = floor(x);    // nearest grid
+    const InterpPosTy off_x = x - grid_x;  // fractional offset
+
+    const int x_lim    = in.dims[XDIM];
+    const int x_stride = in.strides[XDIM];
+    const int idx      = ioff + grid_x * x_stride;
+
+    InterpValTy zero  = ZERO;
+    bool cond[2]      = {true, grid_x + 1 < x_lim};
+    int offx[2]       = {0, cond[1] ? 1 : 0};
+    InterpPosTy ratio = off_x;
+    if (method == AF_INTERP_LINEAR_COSINE) {
+        ratio = (1 - cos(ratio * (InterpPosTy)M_PI)) / 2;
+    }
+
+    for (int n = 0; n < batch; n++) {
+        int idx_n          = idx + n * in.strides[batch_dim];
+        InterpValTy val[2] = {
+            (doclamp || cond[0]) ? d_in[idx_n + offx[0] * x_stride] : zero,
+            (doclamp || cond[1]) ? d_in[idx_n + offx[1] * x_stride] : zero};
+
+        d_out[ooff + n * out.strides[batch_dim]] = linearInterpFunc(val, ratio);
+    }
+}
+#elif INTERP_ORDER == 3
+void interp1(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             int method, int batch, bool doclamp, int batch_dim) {
+    const int grid_x        = floor(x);    // nearest grid
+    const InterpPosTy off_x = x - grid_x;  // fractional offset
+
+    const int x_lim    = in.dims[XDIM];
+    const int x_stride = in.strides[XDIM];
+    const int idx      = ioff + grid_x * x_stride;
+
+    bool cond[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                    grid_x + 2 < x_lim};
+    int off[4]   = {cond[0] ? -1 : 0, 0, cond[2] ? 1 : 0,
+                  cond[3] ? 2 : (cond[2] ? 1 : 0)};
+
+    InterpValTy zero = ZERO;
+
+    for (int n = 0; n < batch; n++) {
+        InterpValTy val[4];
+        int idx_n = idx + n * in.strides[batch_dim];
+        for (int i = 0; i < 4; i++) {
+            val[i] =
+                (doclamp || cond[i]) ? d_in[idx_n + off[i] * x_stride] : zero;
+        }
+        // Spline coefficients are used only for AF_INTERP_CUBIC_SPLINE.
+        bool spline = method == AF_INTERP_CUBIC_SPLINE;
+        d_out[ooff + n * out.strides[batch_dim]] =
+            cubicInterpFunc(val, off_x, spline);
+    }
+}
+#endif
+
+#if defined(YDIM)  // If 2d interpolation is being used
+#if INTERP_ORDER == 1
+void interp2(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             InterpPosTy y, int method, int batch, bool doclamp,
+             int batch_dim) {
+    int xid = (method == AF_INTERP_LOWER ? floor(x) : round(x));
+    int yid = (method == AF_INTERP_LOWER ? floor(y) : round(y));
+
+    const int x_lim    = in.dims[XDIM];
+    const int y_lim    = in.dims[YDIM];
+    const int x_stride = in.strides[XDIM];
+    const int y_stride = in.strides[YDIM];
+
+    if (doclamp) {
+        xid = max(0, min(xid, x_lim - 1));
+        yid = max(0, min(yid, y_lim - 1));
+    }
+    const int idx = ioff + yid * y_stride + xid * x_stride;
+
+    bool condX = xid >= 0 && xid < x_lim;
+    bool condY = yid >= 0 && yid < y_lim;
+
+    InterpInTy zero = ZERO;
+    bool cond       = condX && condY;
+    for (int n = 0; n < batch; n++) {
+        int idx_n = idx + n * in.strides[batch_dim];
+        d_out[ooff + n * out.strides[batch_dim]] =
+            (doclamp || cond) ? d_in[idx_n] : zero;
+    }
+}
+#elif INTERP_ORDER == 2
+void interp2(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             InterpPosTy y, int method, int batch, bool doclamp,
+             int batch_dim) {
+    const int grid_x        = floor(x);
+    const InterpPosTy off_x = x - grid_x;
+
+    const int grid_y        = floor(y);
+    const InterpPosTy off_y = y - grid_y;
+
+    const int x_lim    = in.dims[XDIM];
+    const int y_lim    = in.dims[YDIM];
+    const int x_stride = in.strides[XDIM];
+    const int y_stride = in.strides[YDIM];
+    const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+    bool condX[2] = {true, x + 1 < x_lim};
+    bool condY[2] = {true, y + 1 < y_lim};
+    int offx[2]   = {0, condX[1] ? 1 : 0};
+    int offy[2]   = {0, condY[1] ? 1 : 0};
+
+    InterpPosTy xratio = off_x, yratio = off_y;
+    if (method == AF_INTERP_LINEAR_COSINE) {
+        xratio = (1 - cos(xratio * (InterpPosTy)M_PI)) / 2;
+        yratio = (1 - cos(yratio * (InterpPosTy)M_PI)) / 2;
+    }
+
+    InterpValTy zero = ZERO;
+    for (int n = 0; n < batch; n++) {
+        int idx_n = idx + n * in.strides[batch_dim];
+        InterpValTy val[2][2];
+        for (int j = 0; j < 2; j++) {
+            int ioff_j = idx_n + offy[j] * y_stride;
+            for (int i = 0; i < 2; i++) {
+                bool cond = (doclamp || (condX[i] && condY[j]));
+                val[j][i] = cond ? d_in[ioff_j + offx[i] * x_stride] : zero;
+            }
+        }
+        d_out[ooff + n * out.strides[batch_dim]] =
+            bilinearInterpFunc(val, xratio, yratio);
+    }
+}
+#elif INTERP_ORDER == 3
+void interp2(global InterpInTy *d_out, KParam out, int ooff,
+             global const InterpInTy *d_in, KParam in, int ioff, InterpPosTy x,
+             InterpPosTy y, int method, int batch, bool doclamp,
+             int batch_dim) {
+    const int grid_x        = floor(x);
+    const InterpPosTy off_x = x - grid_x;
+
+    const int grid_y        = floor(y);
+    const InterpPosTy off_y = y - grid_y;
+
+    const int x_lim    = in.dims[XDIM];
+    const int y_lim    = in.dims[YDIM];
+    const int x_stride = in.strides[XDIM];
+    const int y_stride = in.strides[YDIM];
+    const int idx      = ioff + grid_y * y_stride + grid_x * x_stride;
+
+    // used for setting values at boundaries
+    bool condX[4] = {grid_x - 1 >= 0, true, grid_x + 1 < x_lim,
+                     grid_x + 2 < x_lim};
+    bool condY[4] = {grid_y - 1 >= 0, true, grid_y + 1 < y_lim,
+                     grid_y + 2 < y_lim};
+    int offX[4]   = {condX[0] ? -1 : 0, 0, condX[2] ? 1 : 0,
+                   condX[3] ? 2 : (condX[2] ? 1 : 0)};
+    int offY[4]   = {condY[0] ? -1 : 0, 0, condY[2] ? 1 : 0,
+                   condY[3] ? 2 : (condY[2] ? 1 : 0)};
+
+    InterpValTy zero = ZERO;
+    for (int n = 0; n < batch; n++) {
+        int idx_n = idx + n * in.strides[batch_dim];
+        // for bicubic interpolation, work with 4x4 val at a time
+        InterpValTy val[4][4];
+#pragma unroll
+        for (int j = 0; j < 4; j++) {
+            int ioff_j = idx_n + offY[j] * y_stride;
+#pragma unroll
+            for (int i = 0; i < 4; i++) {
+                bool cond = (doclamp || (condX[i] && condY[j]));
+                val[j][i] = cond ? d_in[ioff_j + offX[i] * x_stride] : zero;
+            }
+        }
+        bool spline = method == AF_INTERP_CUBIC_SPLINE ||
+                      method == AF_INTERP_BICUBIC_SPLINE;
+        d_out[ooff + n * out.strides[batch_dim]] =
+            bicubicInterpFunc(val, off_x, off_y, spline);
+    }
+}
+#endif
+#endif
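Reviewer note on `cubicInterpFunc` in interp.cl: the sketch below is a host-side C++ mirror of that function for real scalars, intended only as a way to reason about the two coefficient sets. It assumes `float` for both `InterpValTy` and `InterpPosTy` (in the kernel these are supplied at build time), and the name `cubicInterpHost` is hypothetical, not an ArrayFire API. Both branches return `val[1]` at ratio 0 and `val[2]` at ratio 1, so the curve passes through the two middle samples.

```cpp
#include <array>

// Host-side mirror of the kernel's cubicInterpFunc for real scalars.
// Assumption: float stands in for InterpValTy and InterpPosTy.
inline float cubicInterpHost(const std::array<float, 4>& val, float ratio,
                             bool spline) {
    float a0, a1, a2, a3;
    if (spline) {
        // Catmull-Rom coefficients, matching the kernel's spline branch.
        a0 = -0.5f * val[0] + 1.5f * val[1] - 1.5f * val[2] + 0.5f * val[3];
        a1 = 1.0f * val[0] - 2.5f * val[1] + 2.0f * val[2] - 0.5f * val[3];
        a2 = -0.5f * val[0] + 0.5f * val[2];
        a3 = val[1];
    } else {
        // Plain cubic coefficients, matching the kernel's non-spline branch.
        a0 = val[3] - val[2] - val[0] + val[1];
        a1 = val[0] - val[1] - a0;
        a2 = val[2] - val[0];
        a3 = val[1];
    }
    const float r2 = ratio * ratio;
    const float r3 = r2 * ratio;
    return a0 * r3 + a1 * r2 + a2 * ratio + a3;
}
```

On evenly spaced linear samples such as {0, 1, 2, 3}, both branches reduce to linear interpolation between the middle pair.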
diff --git a/src/backend/opencl/kernel/interp.hpp b/src/backend/opencl/kernel/interp.hpp
new file mode 100644
index 0000000000..d827bedc5a
--- /dev/null
+++ b/src/backend/opencl/kernel/interp.hpp
@@ -0,0 +1,45 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/TemplateArg.hpp>
+#include <af/defines.h>
+
+#include <array>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+static void addInterpEnumOptions(std::vector<std::string>& options) {
+    static std::array<std::string, 10> enOpts = {
+        DefineKeyValue(AF_INTERP_NEAREST, static_cast<int>(AF_INTERP_NEAREST)),
+        DefineKeyValue(AF_INTERP_LINEAR, static_cast<int>(AF_INTERP_LINEAR)),
+        DefineKeyValue(AF_INTERP_BILINEAR,
+                       static_cast<int>(AF_INTERP_BILINEAR)),
+        DefineKeyValue(AF_INTERP_CUBIC, static_cast<int>(AF_INTERP_CUBIC)),
+        DefineKeyValue(AF_INTERP_LOWER, static_cast<int>(AF_INTERP_LOWER)),
+        DefineKeyValue(AF_INTERP_LINEAR_COSINE,
+                       static_cast<int>(AF_INTERP_LINEAR_COSINE)),
+        DefineKeyValue(AF_INTERP_BILINEAR_COSINE,
+                       static_cast<int>(AF_INTERP_BILINEAR_COSINE)),
+        DefineKeyValue(AF_INTERP_BICUBIC, static_cast<int>(AF_INTERP_BICUBIC)),
+        DefineKeyValue(AF_INTERP_CUBIC_SPLINE,
+                       static_cast<int>(AF_INTERP_CUBIC_SPLINE)),
+        DefineKeyValue(AF_INTERP_BICUBIC_SPLINE,
+                       static_cast<int>(AF_INTERP_BICUBIC_SPLINE)),
+    };
+    options.insert(std::end(options), std::begin(enOpts), std::end(enOpts));
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/iops.cl b/src/backend/opencl/kernel/iops.cl
index af7c1189d0..06a606abbe 100644
--- a/src/backend/opencl/kernel/iops.cl
+++ b/src/backend/opencl/kernel/iops.cl
@@ -8,29 +8,48 @@
  ********************************************************/
 
 #if CPLX
-#define sabs(in) ((in.x)*(in.x) + (in.y)*(in.y))
+inline bool is_nan(T in) { return (in.x != in.x) || (in.y != in.y); }
 #else
-#define sabs(in) in
+inline bool is_nan(T in) { return (in != in); }
 #endif
 
+#if CPLX
+#define sabs(in) ((in.x) * (in.x) + (in.y) * (in.y))
+#ifdef MIN_OP
+void binOp(T *lhs, uint *lidx, T rhs, uint ridx) {
+    if (((sabs(lhs[0]) > sabs(rhs)) ||
+         (sabs(lhs[0]) == sabs(rhs) && *lidx < ridx))) {
+        *lhs  = rhs;
+        *lidx = ridx;
+    }
+}
+#endif
+
+#ifdef MAX_OP
+void binOp(T *lhs, uint *lidx, T rhs, uint ridx) {
+    if (((sabs(lhs[0]) < sabs(rhs)) ||
+         (sabs(lhs[0]) == sabs(rhs) && *lidx > ridx))) {
+        *lhs  = rhs;
+        *lidx = ridx;
+    }
+}
+#endif
+#else
 #ifdef MIN_OP
-void binOp(T *lhs, uint *lidx, T rhs, uint ridx)
-{
-    if ((*lhs > rhs) ||
-        (*lhs == rhs && *lidx < ridx)) {
-        *lhs = rhs;
+void binOp(T *lhs, uint *lidx, T rhs, uint ridx) {
+    if (((*lhs > rhs) || (*lhs == rhs && *lidx < ridx))) {
+        *lhs  = rhs;
         *lidx = ridx;
     }
 }
 #endif
 
 #ifdef MAX_OP
-void binOp(T *lhs, uint *lidx, T rhs, uint ridx)
-{
-    if ((*lhs < rhs) ||
-        (*lhs == rhs && *lidx > ridx)) {
-        *lhs = rhs;
+void binOp(T *lhs, uint *lidx, T rhs, uint ridx) {
+    if (((*lhs < rhs) || (*lhs == rhs && *lidx > ridx))) {
+        *lhs  = rhs;
         *lidx = ridx;
     }
 }
 #endif
+#endif
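Reviewer note on iops.cl: the sketch below mirrors the real-valued `MIN_OP` branch of `binOp` and the self-comparison `is_nan` helper on the host, to make the tie handling explicit. It assumes `T` is `float` (the kernel receives `T` as a build option), and the names `binOpMin` and `isNanHost` are hypothetical. As written in the diff, a tie in value replaces the stored index when the incoming index is larger.

```cpp
#include <cstdint>

// Hypothetical host-side mirror of the real-valued MIN_OP binOp.
// Assumption: T is float; the kernel gets T from a build-time definition.
inline void binOpMin(float* lhs, std::uint32_t* lidx, float rhs,
                     std::uint32_t ridx) {
    // Take rhs when it is strictly smaller, or on a tie when ridx is larger.
    if ((*lhs > rhs) || (*lhs == rhs && *lidx < ridx)) {
        *lhs  = rhs;
        *lidx = ridx;
    }
}

// NaN is the only value that compares unequal to itself.
inline bool isNanHost(float in) { return in != in; }
```

The `MAX_OP` branch is the same shape with both comparisons reversed.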
diff --git a/src/backend/opencl/kernel/iota.cl b/src/backend/opencl/kernel/iota.cl
index 75d055d1a1..e7e5dccac4 100644
--- a/src/backend/opencl/kernel/iota.cl
+++ b/src/backend/opencl/kernel/iota.cl
@@ -7,12 +7,9 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void iota_kernel(__global T *out, const KParam op,
-                 const int s0, const int s1, const int s2, const int s3,
-                 const int t0, const int t1, const int t2, const int t3,
-                 const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void iota_kernel(global T *out, const KParam op, const int s0,
+                        const int s1, const int s2, const int s3,
+                        const int blocksPerMatX, const int blocksPerMatY) {
     const int oz = get_group_id(0) / blocksPerMatX;
     const int ow = get_group_id(1) / blocksPerMatY;
 
@@ -22,24 +19,22 @@ void iota_kernel(__global T *out, const KParam op,
     const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
     const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    if(xx >= op.dims[0] ||
-       yy >= op.dims[1] ||
-       oz >= op.dims[2] ||
-       ow >= op.dims[3])
+    if (xx >= op.dims[0] || yy >= op.dims[1] || oz >= op.dims[2] ||
+        ow >= op.dims[3])
         return;
 
     const int ozw = ow * op.strides[3] + oz * op.strides[2];
 
     T val = (ow % s3) * s2 * s1 * s0;
-    val  += (oz % s2) * s1 * s0;
+    val += (oz % s2) * s1 * s0;
 
     const int incy = blocksPerMatY * get_local_size(1);
     const int incx = blocksPerMatX * get_local_size(0);
 
-    for(int oy = yy; oy < op.dims[1]; oy += incy) {
-        T valY = val + (oy % s1) * s0;
+    for (int oy = yy; oy < op.dims[1]; oy += incy) {
+        T valY   = val + (oy % s1) * s0;
         int oyzw = ozw + oy * op.strides[1];
-        for(int ox = xx; ox < op.dims[0]; ox += incx) {
+        for (int ox = xx; ox < op.dims[0]; ox += incx) {
             int oidx = oyzw + ox;
 
             out[oidx] = valY + (ox % s0);
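Reviewer note on `iota_kernel`: the value written at each output coordinate depends only on the coordinate taken modulo the sequence dims `s0..s3`. The host sketch below reproduces that arithmetic for checking expected output. It assumes a contiguous column-major output (so strides follow from the dims) and `int` elements; `iotaHost` is a hypothetical helper, not part of ArrayFire.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical host reference for the values iota_kernel writes.
// Assumption: contiguous column-major output, dim 0 varying fastest.
inline std::vector<int> iotaHost(const std::array<int, 4>& dims,
                                 const std::array<int, 4>& s) {
    const std::size_t total = static_cast<std::size_t>(dims[0]) * dims[1] *
                              dims[2] * dims[3];
    std::vector<int> out(total);
    std::size_t idx = 0;
    for (int ow = 0; ow < dims[3]; ow++) {
        for (int oz = 0; oz < dims[2]; oz++) {
            // Same accumulation as `val` in the kernel.
            const int val = (ow % s[3]) * s[2] * s[1] * s[0] +
                            (oz % s[2]) * s[1] * s[0];
            for (int oy = 0; oy < dims[1]; oy++) {
                const int valY = val + (oy % s[1]) * s[0];
                for (int ox = 0; ox < dims[0]; ox++) {
                    out[idx++] = valY + (ox % s[0]);
                }
            }
        }
    }
    return out;
}
```

With output dims {2, 6, 1, 1} and sequence dims {2, 3, 1, 1}, the sequence 0..5 is written twice along dimension 1.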
diff --git a/src/backend/opencl/kernel/iota.hpp b/src/backend/opencl/kernel/iota.hpp
index bad486abd2..24d5ad7924 100644
--- a/src/backend/opencl/kernel/iota.hpp
+++ b/src/backend/opencl/kernel/iota.hpp
@@ -8,79 +8,49 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/iota.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/iota.hpp>
+#include <traits.hpp>
+#include <af/dim4.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 512;
-        static const int TILEY = 32;
-
-        template<typename T>
-        void iota(Param out, const dim4 &sdims, const dim4 &tdims)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>  iotaProgs;
-                static std::map<int, Kernel*> iotaKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, iota_cl, iota_cl_len, options.str());
-                    iotaProgs[device]   = new Program(prog);
-                    iotaKernels[device] = new Kernel(*iotaProgs[device], "iota_kernel");
-                });
-
-                auto iotaOp = make_kernel<Buffer, const KParam,
-                                          const int, const int, const int, const int,
-                                          const int, const int, const int, const int,
-                                          const int, const int> (*iotaKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(out.info.dims[0], TILEX);
-                int blocksPerMatY = divup(out.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
-
-                iotaOp(EnqueueArgs(getQueue(), global, local),
-                       *out.data, out.info, sdims[0], sdims[1], sdims[2], sdims[3],
-                       tdims[0], tdims[1], tdims[2], tdims[3], blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void iota(Param out, const af::dim4& sdims) {
+    constexpr int IOTA_TX = 32;
+    constexpr int IOTA_TY = 8;
+    constexpr int TILEX   = 512;
+    constexpr int TILEY   = 32;
+
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
+
+    auto iota = common::getKernel("iota_kernel", {{iota_cl_src}},
+                                  TemplateArgs(TemplateTypename<T>()), options);
+    cl::NDRange local(IOTA_TX, IOTA_TY, 1);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    cl::NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
+                       local[1] * blocksPerMatY * out.info.dims[3], 1);
+
+    iota(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+         static_cast<int>(sdims[0]), static_cast<int>(sdims[1]),
+         static_cast<int>(sdims[2]), static_cast<int>(sdims[3]), blocksPerMatX,
+         blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
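Reviewer note on the `iota` launch configuration in iota.hpp: the sketch below recomputes the work-group and global sizes on the host for a given output shape. It assumes `divup` is a ceiling division, which `divUp` stands in for here; `LaunchDims` and `iotaLaunchDims` are hypothetical names used only for illustration. The constants are the ones declared in the diff.

```cpp
#include <array>

// Ceiling division; stands in for ArrayFire's divup (assumed semantics).
constexpr int divUp(int a, int b) { return (a + b - 1) / b; }

// Hypothetical container for the computed launch geometry.
struct LaunchDims {
    std::array<int, 2> local;
    std::array<int, 2> global;
    int blocksPerMatX;
    int blocksPerMatY;
};

inline LaunchDims iotaLaunchDims(const std::array<int, 4>& dims) {
    constexpr int IOTA_TX = 32;
    constexpr int IOTA_TY = 8;
    constexpr int TILEX   = 512;
    constexpr int TILEY   = 32;

    LaunchDims d{};
    d.local         = {IOTA_TX, IOTA_TY};
    d.blocksPerMatX = divUp(dims[0], TILEX);
    d.blocksPerMatY = divUp(dims[1], TILEY);
    // Dims 2 and 3 are folded into the first two global dimensions.
    d.global = {d.local[0] * d.blocksPerMatX * dims[2],
                d.local[1] * d.blocksPerMatY * dims[3]};
    return d;
}
```

Because the tile (512 x 32) is larger than the work-group (32 x 8), each work-item covers several elements through the strided `ox`/`oy` loops in the kernel.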
diff --git a/src/backend/opencl/kernel/ireduce.hpp b/src/backend/opencl/kernel/ireduce.hpp
index 122e664b6c..d056fb8fea 100644
--- a/src/backend/opencl/kernel/ireduce.hpp
+++ b/src/backend/opencl/kernel/ireduce.hpp
@@ -8,417 +8,331 @@
  ********************************************************/
 
 #pragma once
-#include <string>
-#include <mutex>
-#include <map>
-#include <memory>
-#include <kernel_headers/ireduce_first.hpp>
-#include <kernel_headers/ireduce_dim.hpp>
-#include <kernel_headers/iops.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <type_util.hpp>
-#include "names.hpp"
-#include "config.hpp"
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/iops.hpp>
+#include <kernel_headers/ireduce_dim.hpp>
+#include <kernel_headers/ireduce_first.hpp>
 #include <memory.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using std::unique_ptr;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-    template<typename T, af_op_t op, int dim, bool is_first, int threads_y>
-    void ireduce_dim_launcher(Param out, cl::Buffer *oidx,
-                              Param in, cl::Buffer *iidx,
-                              const uint groups_all[4])
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> ireduceProgs;
-        static std::map<int, Kernel*> ireduceKerns;
-
-        int device= getActiveDeviceId();
-        std::call_once(compileFlags[device], [device] () {
-
-                Binary<T, op> ireduce;
-                ToNum<T> toNum;
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D dim=" << dim
-                        << " -D DIMY=" << threads_y
-                        << " -D THREADS_X=" << THREADS_X
-                        << " -D init=" << toNum(ireduce.init())
-                        << " -D " << binOpName<op>()
-                        << " -D CPLX=" << af::iscplx<T>()
-                        << " -D IS_FIRST=" << is_first;
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                const char *ker_strs[] = {iops_cl, ireduce_dim_cl};
-                const int   ker_lens[] = {iops_cl_len, ireduce_dim_cl_len};
-                Program prog;
-                buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                ireduceProgs[device] = new Program(prog);
-
-                ireduceKerns[device] = new Kernel(*ireduceProgs[device], "ireduce_dim_kernel");
-            });
-
-        NDRange local(THREADS_X, threads_y);
-        NDRange global(groups_all[0] * groups_all[2] * local[0],
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, af_op_t op>
+void ireduceDimLauncher(Param out, cl::Buffer *oidx, Param in, cl::Buffer *iidx,
+                        const int dim, const int threads_y, const bool is_first,
+                        const uint groups_all[4], Param rlen) {
+    ToNumStr<T> toNumStr;
+    std::array<TemplateArg, 5> targs = {
+        TemplateTypename<T>(), TemplateArg(dim),       TemplateArg(op),
+        TemplateArg(is_first), TemplateArg(threads_y),
+    };
+    std::array<std::string, 9> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(kDim, dim),
+        DefineKeyValue(DIMY, threads_y),
+        DefineValue(THREADS_X),
+        DefineKeyValue(init, toNumStr(common::Binary<T, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<T>()),
+        DefineKeyValue(IS_FIRST, is_first),
+        getTypeBuildDefinition<T>()};
+
+    auto ireduceDim =
+        common::getKernel("ireduce_dim_kernel",
+                          {{iops_cl_src, ireduce_dim_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, threads_y);
+    cl::NDRange global(groups_all[0] * groups_all[2] * local[0],
                        groups_all[1] * groups_all[3] * local[1]);
 
-        auto ireduceOp = make_kernel<Buffer, KParam, Buffer,
-                                     Buffer, KParam, Buffer,
-                                     uint, uint, uint>(*ireduceKerns[device]);
-
-        ireduceOp(EnqueueArgs(getQueue(), global, local),
-                  *out.data, out.info, *oidx,
-                  *in.data, in.info, *iidx,
-                  groups_all[0],
-                  groups_all[1],
-                  groups_all[dim]);
-
-        CL_DEBUG_FINISH(getQueue());
-    }
-
-    template<typename T, af_op_t op, int dim, bool is_first>
-    void ireduce_dim_fn(Param out, cl::Buffer *oidx, Param in, cl::Buffer *iidx,
-                        const uint threads_y, const uint groups_all[4])
-    {
-        switch(threads_y) {
-        case  8: return ireduce_dim_launcher<T, op, dim, is_first,  8>(out, oidx, in, iidx, groups_all);
-        case  4: return ireduce_dim_launcher<T, op, dim, is_first,  4>(out, oidx, in, iidx, groups_all);
-        case  2: return ireduce_dim_launcher<T, op, dim, is_first,  2>(out, oidx, in, iidx, groups_all);
-        case  1: return ireduce_dim_launcher<T, op, dim, is_first,  1>(out, oidx, in, iidx, groups_all);
-        case 16: return ireduce_dim_launcher<T, op, dim, is_first, 16>(out, oidx, in, iidx, groups_all);
-        case 32: return ireduce_dim_launcher<T, op, dim, is_first, 32>(out, oidx, in, iidx, groups_all);
-        }
-    }
+    ireduceDim(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *oidx, *in.data, in.info, *iidx, groups_all[0], groups_all[1],
+               groups_all[dim], *rlen.data, rlen.info);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename T, af_op_t op, int dim>
-    void ireduce_dim(Param out, cl::Buffer *oidx, Param in)
-    {
-        uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
-        uint threads_x = THREADS_X;
+template<typename T, af_op_t op>
+void ireduceDim(Param out, cl::Buffer *oidx, Param in, int dim, Param rlen) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = THREADS_X;
 
-        uint groups_all[] = {(uint)divup(in.info.dims[0], threads_x),
-                             (uint)in.info.dims[1],
-                             (uint)in.info.dims[2],
-                             (uint)in.info.dims[3]};
+    uint groups_all[] = {(uint)divup(in.info.dims[0], threads_x),
+                         (uint)in.info.dims[1], (uint)in.info.dims[2],
+                         (uint)in.info.dims[3]};
 
-        groups_all[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
+    groups_all[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
 
-        Param tmp = out;
-        cl::Buffer *tidx = oidx;
+    Param tmp        = out;
+    cl::Buffer *tidx = oidx;
 
-        int tmp_elements = 1;
-        if (groups_all[dim] > 1) {
-            tmp.info.dims[dim] = groups_all[dim];
+    int tmp_elements = 1;
+    if (groups_all[dim] > 1) {
+        tmp.info.dims[dim] = groups_all[dim];
 
-            for (int k = 0; k < 4; k++) tmp_elements *= tmp.info.dims[k];
+        for (int k = 0; k < 4; k++) tmp_elements *= tmp.info.dims[k];
 
-            tmp.data = bufferAlloc(tmp_elements * sizeof(T));
-            tidx = bufferAlloc(tmp_elements * sizeof(uint));
+        tmp.data = bufferAlloc(tmp_elements * sizeof(T));
+        tidx     = bufferAlloc(tmp_elements * sizeof(uint));
 
-            for (int k = dim + 1; k < 4; k++) tmp.info.strides[k] *= groups_all[dim];
-        }
-
-        ireduce_dim_fn<T, op, dim, true>(tmp, tidx, in, tidx, threads_y, groups_all);
+        for (int k = dim + 1; k < 4; k++)
+            tmp.info.strides[k] *= groups_all[dim];
+    }
 
-        if (groups_all[dim] > 1) {
-            groups_all[dim] = 1;
+    ireduceDimLauncher<T, op>(tmp, tidx, in, tidx, dim, threads_y, true,
+                              groups_all, rlen);
 
-            ireduce_dim_fn<T, op, dim, false>(out, oidx, tmp, tidx, threads_y, groups_all);
-            bufferFree(tmp.data);
-            bufferFree(tidx);
-        }
+    if (groups_all[dim] > 1) {
+        groups_all[dim] = 1;
 
+        ireduceDimLauncher<T, op>(out, oidx, tmp, tidx, dim, threads_y, false,
+                                  groups_all, rlen);
+        bufferFree(tmp.data);
+        bufferFree(tidx);
     }
+}
 
-    template<typename T, af_op_t op, bool is_first, int threads_x>
-    void ireduce_first_launcher(Param out, cl::Buffer *oidx,
-                                Param in, cl::Buffer *iidx,
-                                const uint groups_x,
-                                const uint groups_y)
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> ireduceProgs;
-        static std::map<int, Kernel*>  ireduceKerns;
-
-        int device= getActiveDeviceId();
-        std::call_once(compileFlags[device], [device] () {
-
-                Binary<T, op> ireduce;
-                ToNum<T> toNum;
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D DIMX=" << threads_x
-                        << " -D THREADS_PER_GROUP=" << THREADS_PER_GROUP
-                        << " -D init=" << toNum(ireduce.init())
-                        << " -D " << binOpName<op>()
-                        << " -D CPLX=" << af::iscplx<T>()
-                        << " -D IS_FIRST=" << is_first;
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                const char *ker_strs[] = {iops_cl, ireduce_first_cl};
-                const int   ker_lens[] = {iops_cl_len, ireduce_first_cl_len};
-                Program prog;
-                buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                ireduceProgs[device] = new Program(prog);
-
-                ireduceKerns[device] = new Kernel(*ireduceProgs[device], "ireduce_first_kernel");
-            });
-
-        NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
-        NDRange global(groups_x * in.info.dims[2] * local[0],
+template<typename T, af_op_t op>
+void ireduceFirstLauncher(Param out, cl::Buffer *oidx, Param in,
+                          cl::Buffer *iidx, const int threads_x,
+                          const bool is_first, const uint groups_x,
+                          const uint groups_y, Param rlen) {
+    ToNumStr<T> toNumStr;
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(op),
+        TemplateArg(is_first),
+        TemplateArg(threads_x),
+    };
+    std::array<std::string, 8> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(DIMX, threads_x),
+        DefineValue(THREADS_PER_GROUP),
+        DefineKeyValue(init, toNumStr(common::Binary<T, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<T>()),
+        DefineKeyValue(IS_FIRST, is_first),
+        getTypeBuildDefinition<T>()};
+
+    auto ireduceFirst = common::getKernel("ireduce_first_kernel",
+                                          {{iops_cl_src, ireduce_first_cl_src}},
+                                          targs, options);
+
+    cl::NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    cl::NDRange global(groups_x * in.info.dims[2] * local[0],
                        groups_y * in.info.dims[3] * local[1]);
 
-        uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
+    uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
 
-        auto ireduceOp = make_kernel<Buffer, KParam, Buffer,
-                                     Buffer, KParam, Buffer,
-                                     uint, uint, uint>(*ireduceKerns[device]);
-
-        ireduceOp(EnqueueArgs(getQueue(), global, local),
-                  *out.data, out.info, *oidx,
-                  *in.data, in.info, *iidx,
-                  groups_x, groups_y, repeat);
-
-        CL_DEBUG_FINISH(getQueue());
-    }
-
-    template<typename T, af_op_t op, bool is_first>
-    void ireduce_first_fn(Param out, cl::Buffer *oidx,
-                          Param in, cl::Buffer *iidx,
-                          const uint groups_x,
-                          const uint groups_y,
-                          const uint threads_x)
-    {
-        switch(threads_x) {
-        case  32: return ireduce_first_launcher<T, op, is_first,  32>(out, oidx, in, iidx, groups_x,
-                                                            groups_y);
-        case  64: return ireduce_first_launcher<T, op, is_first,  64>(out, oidx, in, iidx, groups_x,
-                                                            groups_y);
-        case 128: return ireduce_first_launcher<T, op, is_first, 128>(out, oidx, in, iidx, groups_x,
-                                                            groups_y);
-        case 256: return ireduce_first_launcher<T, op, is_first, 256>(out, oidx, in, iidx, groups_x,
-                                                            groups_y);
-        case 512: return ireduce_first_launcher<T, op, is_first, 512>(out, oidx, in, iidx, groups_x,
-                                                                      groups_y);
-        }
-    }
+    ireduceFirst(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                 out.info, *oidx, *in.data, in.info, *iidx, groups_x, groups_y,
+                 repeat, *rlen.data, rlen.info);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename T, af_op_t op>
-    void ireduce_first(Param out, cl::Buffer *oidx, Param in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_GROUP);
-        uint threads_y = THREADS_PER_GROUP / threads_x;
+template<typename T, af_op_t op>
+void ireduceFirst(Param out, cl::Buffer *oidx, Param in, Param rlen) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
 
-        uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
-        uint groups_y = divup(in.info.dims[1], threads_y);
-
-        Param tmp = out;
-        cl::Buffer *tidx = oidx;
+    uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
 
-        if (groups_x > 1) {
+    Param tmp        = out;
+    cl::Buffer *tidx = oidx;
 
-            tmp.data = bufferAlloc(groups_x *
-                                in.info.dims[1] *
-                                in.info.dims[2] *
-                                in.info.dims[3] *
-                                sizeof(T));
+    if (groups_x > 1) {
+        tmp.data = bufferAlloc(groups_x * in.info.dims[1] * in.info.dims[2] *
+                               in.info.dims[3] * sizeof(T));
 
-            tidx = bufferAlloc(groups_x *
-                               in.info.dims[1] *
-                               in.info.dims[2] *
-                               in.info.dims[3] *
-                               sizeof(uint));
+        tidx = bufferAlloc(groups_x * in.info.dims[1] * in.info.dims[2] *
+                           in.info.dims[3] * sizeof(uint));
 
+        tmp.info.dims[0] = groups_x;
+        for (int k = 1; k < 4; k++) tmp.info.strides[k] *= groups_x;
+    }
 
-            tmp.info.dims[0] = groups_x;
-            for (int k = 1; k < 4; k++) tmp.info.strides[k] *= groups_x;
-        }
+    ireduceFirstLauncher<T, op>(tmp, tidx, in, tidx, threads_x, true, groups_x,
+                                groups_y, rlen);
 
-        ireduce_first_fn<T, op, true>(tmp, tidx, in, tidx, groups_x, groups_y, threads_x);
+    if (groups_x > 1) {
+        ireduceFirstLauncher<T, op>(out, oidx, tmp, tidx, threads_x, false, 1,
+                                    groups_y, rlen);
 
-        if (groups_x > 1) {
-            ireduce_first_fn<T, op, false>(out, oidx, tmp, tidx, 1, groups_y, threads_x);
+        bufferFree(tmp.data);
+        bufferFree(tidx);
+    }
+}
 
-            bufferFree(tmp.data);
-            bufferFree(tidx);
-        }
+template<typename T, af_op_t op>
+void ireduce(Param out, cl::Buffer *oidx, Param in, int dim, Param rlen) {
+    cl::Buffer buf;
+    if (rlen.info.dims[0] * rlen.info.dims[1] * rlen.info.dims[2] *
+            rlen.info.dims[3] ==
+        0) {
+        // An empty opencl::Param() does not hold a null buffer by default.
+        // Point it at a default-constructed cl::Buffer so that subsequent
+        // kernel calls receive a null pointer.
+        rlen.data = &buf;
     }
+    if (dim == 0) {
+        ireduceFirst<T, op>(out, oidx, in, rlen);
+    } else {
+        ireduceDim<T, op>(out, oidx, in, dim, rlen);
+    }
+}
 
-    template<typename T, af_op_t op>
-    void ireduce(Param out, cl::Buffer *oidx, Param in, int dim)
-    {
-        try {
-            switch (dim) {
-            case 0: return ireduce_first<T, op   >(out, oidx, in);
-            case 1: return ireduce_dim  <T, op, 1>(out, oidx, in);
-            case 2: return ireduce_dim  <T, op, 2>(out, oidx, in);
-            case 3: return ireduce_dim  <T, op, 3>(out, oidx, in);
-            }
-        } catch(cl::Error ex) {
-            CL_TO_AF_ERROR(ex);
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#else
+/* Other */
+#endif
+
+template<typename T>
+double cabs(const T in) {
+    return (double)in;
+}
+static double cabs(const cfloat in) { return (double)abs(in); }
+static double cabs(const cdouble in) { return (double)abs(in); }
+
+template<af_op_t op, typename T>
+struct MinMaxOp {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {}
+
+    void operator()(T val, uint idx) {
+        if (cabs(val) < cabs(m_val) ||
+            (cabs(val) == cabs(m_val) && idx > m_idx)) {
+            m_val = val;
+            m_idx = idx;
         }
     }
-
-    template<typename T> double cabs(const T in) { return (double)in; }
-    static double cabs(const cfloat in) { return (double)abs(in); }
-    static double cabs(const cdouble in) { return (double)abs(in); }
-
-    template<af_op_t op, typename T>
-    struct MinMaxOp
-    {
-        T m_val;
-        uint m_idx;
-        MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
+};
+
+template<typename T>
+struct MinMaxOp<af_max_t, T> {
+    T m_val;
+    uint m_idx;
+    MinMaxOp(T val, uint idx) : m_val(val), m_idx(idx) {}
+
+    void operator()(T val, uint idx) {
+        if (cabs(val) > cabs(m_val) ||
+            (cabs(val) == cabs(m_val) && idx <= m_idx)) {
+            m_val = val;
+            m_idx = idx;
         }
+    }
+};
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic pop
+#else
+/* Other */
+#endif
+
+template<typename T, af_op_t op>
+T ireduceAll(uint *loc, Param in) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
 
-        void operator()(T val, uint idx)
-        {
-            if (cabs(val) < cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx > m_idx)) {
-                m_val = val;
-                m_idx = idx;
+    // FIXME: Use better heuristics to get to the optimum number
+    if (!is_linear || in_elements > 4096) {
+        if (is_linear) {
+            in.info.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.info.dims[k]    = 1;
+                in.info.strides[k] = in_elements;
             }
         }
-    };
 
-    template<typename T>
-    struct MinMaxOp<af_max_t, T>
-    {
-        T m_val;
-        uint m_idx;
-        MinMaxOp(T val, uint idx) :
-            m_val(val), m_idx(idx)
-        {
-        }
+        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+        uint threads_y = THREADS_PER_GROUP / threads_x;
 
-        void operator()(T val, uint idx)
-        {
-            if (cabs(val) > cabs(m_val) ||
-                (cabs(val) == cabs(m_val) &&
-                 idx <= m_idx)) {
-                m_val = val;
-                m_idx = idx;
+        uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+        uint groups_y = divup(in.info.dims[1], threads_y);
+        Array<T> tmp  = createEmptyArray<T>(
+            {groups_x, in.info.dims[1], in.info.dims[2], in.info.dims[3]});
+
+        int tmp_elements = tmp.elements();
+        cl::Buffer *tidx = bufferAlloc(tmp_elements * sizeof(uint));
+
+        Param rlen;
+        auto buff = std::make_unique<cl::Buffer>();
+        rlen.data = buff.get();
+        ireduceFirstLauncher<T, op>(tmp, tidx, in, tidx, threads_x, true,
+                                    groups_x, groups_y, rlen);
+
+        std::vector<T> h_ptr(tmp_elements);
+        std::vector<uint> h_iptr(tmp_elements);
+
+        getQueue().enqueueReadBuffer(*tmp.get(), CL_TRUE, 0,
+                                     sizeof(T) * tmp_elements, h_ptr.data());
+        getQueue().enqueueReadBuffer(
+            *tidx, CL_TRUE, 0, sizeof(uint) * tmp_elements, h_iptr.data());
+
+        T *h_ptr_raw     = h_ptr.data();
+        uint *h_iptr_raw = h_iptr.data();
+
+        if (!is_linear) {
+            // Converting n-d index into a linear index
+            // in is of size   [   dims0, dims1, dims2, dims3]
+            // tidx is of size [groups_x, dims1, dims2, dims3]
+            // i / groups_x gives the batch number "N"
+            // "N * dims0 + h_iptr_raw[i]" gives the linear index
+            for (int i = 0; i < tmp_elements; i++) {
+                h_iptr_raw[i] += (i / groups_x) * in.info.dims[0];
             }
         }
-    };
-
-
-    template<typename T, af_op_t op>
-    T ireduce_all(uint *loc, Param in)
-    {
-        try {
-            int in_elements = in.info.dims[3] * in.info.strides[3];
 
-            // FIXME: Use better heuristics to get to the optimum number
-            if (in_elements > 4096) {
-
-                bool is_linear = (in.info.strides[0] == 1);
-                for (int k = 1; k < 4; k++) {
-                    is_linear &= (in.info.strides[k] == (in.info.strides[k - 1] * in.info.dims[k - 1]));
-                }
-
-                if (is_linear) {
-                    in.info.dims[0] = in_elements;
-                    for (int k = 1; k < 4; k++) {
-                        in.info.dims[k] = 1;
-                        in.info.strides[k] = in_elements;
-                    }
-                }
-
-                uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
-                threads_x = std::min(threads_x, THREADS_PER_GROUP);
-                uint threads_y = THREADS_PER_GROUP / threads_x;
-
-                Param tmp;
-                uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
-                uint groups_y = divup(in.info.dims[1], threads_y);
-
-                tmp.info.offset = 0;
-                tmp.info.dims[0] = groups_x;
-                tmp.info.strides[0] = 1;
-
-                for (int k = 1; k < 4; k++) {
-                    tmp.info.dims[k] = in.info.dims[k];
-                    tmp.info.strides[k] = tmp.info.dims[k - 1] * tmp.info.strides[k - 1];
-                }
-
-                int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
-                tmp.data = bufferAlloc(tmp_elements * sizeof(T));
-                cl::Buffer *tidx = bufferAlloc(tmp_elements * sizeof(uint));
-
-                ireduce_first_fn<T, op, true>(tmp, tidx, in, tidx, groups_x, groups_y, threads_x);
-
-                unique_ptr<T> h_ptr(new T[tmp_elements]);
-                unique_ptr<uint> h_iptr(new uint[tmp_elements]);
-
-                getQueue().enqueueReadBuffer(*tmp.data, CL_TRUE, 0, sizeof(T) * tmp_elements, h_ptr.get());
-                getQueue().enqueueReadBuffer(*tidx, CL_TRUE, 0, sizeof(uint) * tmp_elements, h_iptr.get());
-
-                T* h_ptr_raw = h_ptr.get();
-                uint* h_iptr_raw = h_iptr.get();
-                MinMaxOp<op, T> Op(h_ptr_raw[0], h_iptr_raw[0]);
-
-                for (int i = 1; i < (int)tmp_elements; i++) {
-                    Op(h_ptr_raw[i], h_iptr_raw[i]);
-                }
-
-                bufferFree(tmp.data);
-                bufferFree(tidx);
+        MinMaxOp<op, T> Op(h_ptr_raw[0], h_iptr_raw[0]);
+        for (int i = 1; i < (int)tmp_elements; i++) {
+            Op(h_ptr_raw[i], h_iptr_raw[i]);
+        }
 
-                *loc = Op.m_idx;
-                return Op.m_val;
+        bufferFree(tidx);
 
-            } else {
+        *loc = Op.m_idx;
+        return Op.m_val;
 
-                unique_ptr<T> h_ptr(new T[in_elements]);
-                T* h_ptr_raw = h_ptr.get();
-                getQueue().enqueueReadBuffer(*in.data, CL_TRUE, 0, sizeof(T) * in_elements, h_ptr_raw);
+    } else {
+        std::unique_ptr<T[]> h_ptr(new T[in_elements]);
+        T *h_ptr_raw = h_ptr.get();
 
+        getQueue().enqueueReadBuffer(*in.data, CL_TRUE,
+                                     sizeof(T) * in.info.offset,
+                                     sizeof(T) * in_elements, h_ptr_raw);
 
-                MinMaxOp<op, T> Op(h_ptr_raw[0], 0);
-                for (int i = 1; i < (int)in_elements; i++) {
-                    Op(h_ptr_raw[i], i);
-                }
+        MinMaxOp<op, T> Op(h_ptr_raw[0], 0);
+        for (int i = 1; i < (int)in_elements; i++) { Op(h_ptr_raw[i], i); }
 
-                *loc = Op.m_idx;
-                return Op.m_val;
-            }
-        } catch(cl::Error ex) {
-            CL_TO_AF_ERROR(ex);
-        }
+        *loc = Op.m_idx;
+        return Op.m_val;
     }
-
-
 }
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
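The host-side tail of `ireduceAll` above can be illustrated in isolation. The following is a minimal sketch, not ArrayFire code: `Partial`, `foldMax`, and `finalizeMax` are hypothetical names, values are plain `double` (the real code goes through `cabs` for complex types), and the index fixup is applied unconditionally, whereas the diff applies it only when the input is not linear. It assumes each batch contributes `groups_x` partial results whose indices are offsets within a batch of length `dims0`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one (value, index) partial result.
struct Partial {
    double val;
    unsigned idx;
};

// Max fold using the tie rule of MinMaxOp<af_max_t, T> in the diff:
// on equal values, the candidate wins when its index is <= the current one.
inline void foldMax(Partial &acc, double val, unsigned idx) {
    if (val > acc.val || (val == acc.val && idx <= acc.idx)) {
        acc.val = val;
        acc.idx = idx;
    }
}

// vals/idxs hold groups_x partial results per batch, batch after batch.
// idxs is taken by value because it is rewritten in place.
inline Partial finalizeMax(const std::vector<double> &vals,
                           std::vector<unsigned> idxs, unsigned groups_x,
                           unsigned dims0) {
    assert(!vals.empty() && vals.size() == idxs.size() && groups_x > 0);
    // i / groups_x is the batch number N; N * dims0 + idxs[i] is the
    // linear index into the full input.
    for (std::size_t i = 0; i < idxs.size(); i++) {
        idxs[i] += static_cast<unsigned>(i / groups_x) * dims0;
    }
    Partial acc{vals[0], idxs[0]};
    for (std::size_t i = 1; i < vals.size(); i++) {
        foldMax(acc, vals[i], idxs[i]);
    }
    return acc;
}
```

With three batches of length 10 and two groups per batch, a partial at position 2 with in-batch index 4 maps to linear index 14, and a later partial with the same value but a larger linear index does not replace it.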
diff --git a/src/backend/opencl/kernel/ireduce_dim.cl b/src/backend/opencl/kernel/ireduce_dim.cl
index 333af887fc..bf94c9c9a3 100644
--- a/src/backend/opencl/kernel/ireduce_dim.cl
+++ b/src/backend/opencl/kernel/ireduce_dim.cl
@@ -7,72 +7,78 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void ireduce_dim_kernel(__global T *oData,
-                        KParam oInfo,
-                        __global uint *olData,
-                        const __global T *iData,
-                        KParam iInfo,
-                        const __global uint *ilData,
-                        uint groups_x, uint groups_y, uint group_dim)
-{
+kernel void ireduce_dim_kernel(global T *oData, KParam oInfo,
+                                 global uint *olData, const global T *iData,
+                                 KParam iInfo, const global uint *ilData,
+                                 uint groups_x, uint groups_y, uint group_dim,
+                                 global uint *rlenptr, KParam rlen) {
     const uint lidx = get_local_id(0);
     const uint lidy = get_local_id(1);
     const uint lid  = lidy * THREADS_X + lidx;
 
-    const uint zid = get_group_id(0) / groups_x;
-    const uint wid = get_group_id(1) / groups_y;
-    const uint groupId_x = get_group_id(0) - (groups_x) * zid;
-    const uint groupId_y = get_group_id(1) - (groups_y) * wid;
-    const uint xid = groupId_x * get_local_size(0) + lidx;
-    const uint yid = groupId_y;
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) + lidx;
+    const uint yid       = groupId_y;
 
     uint ids[4] = {xid, yid, zid, wid};
 
     // There is only one element per group for out
     // There are get_local_size(1) elements per group for in
-    // Hence increment ids[dim] just after offseting out and before offsetting in
+    // Hence increment ids[kDim] just after offsetting out and before
+    // offsetting in
+    bool rlen_valid = (ids[0] < rlen.dims[0]) && (ids[1] < rlen.dims[1]) &&
+                      (ids[2] < rlen.dims[2]) && (ids[3] < rlen.dims[3]);
+    rlenptr += (rlenptr && rlen_valid)
+                   ? ids[3] * rlen.strides[3] + ids[2] * rlen.strides[2] +
+                         ids[1] * rlen.strides[1] + ids[0] + rlen.offset
+                   : 0;
+
     oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
-        ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
+             ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
     olData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
-        ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
-    const uint id_dim_out = ids[dim];
+              ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
+
+    const uint id_dim_out = ids[kDim];
 
-    ids[dim] = ids[dim] * get_local_size(1) + lidy;
+    ids[kDim] = ids[kDim] * get_local_size(1) + lidy;
 
-    iData  += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
-        ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
 
     if (!IS_FIRST) {
-        ilData  += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
-            ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+        ilData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+                  ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
     }
 
-    const uint id_dim_in = ids[dim];
-    const uint istride_dim = iInfo.strides[dim];
+    const uint id_dim_in   = ids[kDim];
+    const uint istride_dim = iInfo.strides[kDim];
 
-    bool is_valid =
-        (ids[0] < iInfo.dims[0]) &&
-        (ids[1] < iInfo.dims[1]) &&
-        (ids[2] < iInfo.dims[2]) &&
-        (ids[3] < iInfo.dims[3]);
+    bool is_valid = (ids[0] < iInfo.dims[0]) && (ids[1] < iInfo.dims[1]) &&
+                    (ids[2] < iInfo.dims[2]) && (ids[3] < iInfo.dims[3]);
 
-    __local T s_val[THREADS_X * DIMY];
-    __local uint s_idx[THREADS_X * DIMY];
+    local T s_val[THREADS_X * DIMY];
+    local uint s_idx[THREADS_X * DIMY];
 
-    T out_val = init;
+    T out_val    = init;
     uint out_idx = id_dim_in;
 
-    if (is_valid && id_dim_in < iInfo.dims[dim]) {
+    uint lim = rlenptr ? *rlenptr : iInfo.dims[kDim];
+    lim      = (IS_FIRST) ? min((uint)iInfo.dims[kDim], lim) : lim;
+    bool within_ragged_bounds =
+        (IS_FIRST) ? (out_idx < lim)
+                   : ((rlenptr) ? (is_valid) && (*ilData < lim) : true);
+    if (is_valid && id_dim_in < iInfo.dims[kDim] && within_ragged_bounds) {
         out_val = *iData;
         if (!IS_FIRST) out_idx = *ilData;
     }
 
     const uint id_dim_in_start = id_dim_in + group_dim * get_local_size(1);
 
-    for (int id = id_dim_in_start; is_valid && (id < iInfo.dims[dim]);
+    for (int id = id_dim_in_start; is_valid && (id < lim);
          id += group_dim * get_local_size(1)) {
-
         iData = iData + group_dim * get_local_size(1) * istride_dim;
 
 #if IS_FIRST
@@ -86,14 +92,14 @@ void ireduce_dim_kernel(__global T *oData,
     s_val[lid] = out_val;
     s_idx[lid] = out_idx;
 
-    __local T *s_vptr = s_val + lid;
-    __local uint *s_iptr = s_idx + lid;
+    local T *s_vptr    = s_val + lid;
+    local uint *s_iptr = s_idx + lid;
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (DIMY == 8) {
         if (lidy < 4) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[THREADS_X * 4], s_iptr[THREADS_X * 4]);
+            binOp(&out_val, &out_idx, s_vptr[THREADS_X * 4],
+                  s_iptr[THREADS_X * 4]);
             *s_vptr = out_val;
             *s_iptr = out_idx;
         }
@@ -102,8 +108,8 @@ void ireduce_dim_kernel(__global T *oData,
 
     if (DIMY >= 4) {
         if (lidy < 2) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[THREADS_X * 2], s_iptr[THREADS_X * 2]);
+            binOp(&out_val, &out_idx, s_vptr[THREADS_X * 2],
+                  s_iptr[THREADS_X * 2]);
             *s_vptr = out_val;
             *s_iptr = out_idx;
         }
@@ -112,18 +118,16 @@ void ireduce_dim_kernel(__global T *oData,
 
     if (DIMY >= 2) {
         if (lidy < 1) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[THREADS_X * 1], s_iptr[THREADS_X * 1]);
+            binOp(&out_val, &out_idx, s_vptr[THREADS_X * 1],
+                  s_iptr[THREADS_X * 1]);
             *s_vptr = out_val;
             *s_iptr = out_idx;
         }
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (lidy == 0 && is_valid &&
-        (id_dim_out < oInfo.dims[dim])) {
-        *oData = *s_vptr;
+    if (lidy == 0 && is_valid && (id_dim_out < oInfo.dims[kDim])) {
+        *oData  = *s_vptr;
         *olData = *s_iptr;
     }
-
 }
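The `rlenptr`/`lim` logic added to `ireduce_dim_kernel` above restricts the first pass to a per-slice "ragged" length, clamped to the extent of the reduced dimension. A host-side reference for that first-pass behaviour might look like the sketch below. It is an assumption-laden illustration rather than the kernel's code: `ColMax` and `raggedMax` are hypothetical names, the initial value is taken to be negative infinity, an empty `rlen` vector plays the role of the null `rlenptr`, and ties simply keep the first index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical result type: maximum value and its position in the column.
struct ColMax {
    double val;
    unsigned idx;
};

// Reference max-with-index over each column, looking only at the first
// min(rlen[j], column length) elements. An empty rlen means "no ragged
// lengths", analogous to the kernel seeing a null rlenptr.
inline std::vector<ColMax> raggedMax(
    const std::vector<std::vector<double>> &cols,
    const std::vector<unsigned> &rlen) {
    assert(rlen.empty() || rlen.size() == cols.size());
    std::vector<ColMax> out;
    out.reserve(cols.size());
    for (std::size_t j = 0; j < cols.size(); j++) {
        const unsigned len = static_cast<unsigned>(cols[j].size());
        const unsigned lim = rlen.empty() ? len : std::min(rlen[j], len);
        // Assumed identity element for max; stays in place when lim == 0.
        ColMax best{-std::numeric_limits<double>::infinity(), 0u};
        for (unsigned i = 0; i < lim; i++) {
            if (cols[j][i] > best.val) best = ColMax{cols[j][i], i};
        }
        out.push_back(best);
    }
    return out;
}
```

A ragged length larger than the column is clamped, so passing `{3, 10}` for two columns of five elements scans three elements of the first column and all five of the second.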
diff --git a/src/backend/opencl/kernel/ireduce_first.cl b/src/backend/opencl/kernel/ireduce_first.cl
index a135260205..428cc73b99 100644
--- a/src/backend/opencl/kernel/ireduce_first.cl
+++ b/src/backend/opencl/kernel/ireduce_first.cl
@@ -7,51 +7,56 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void ireduce_first_kernel(__global T *oData,
-                          KParam oInfo,
-                          __global uint *olData,
-                          const __global T *iData,
-                          KParam iInfo,
-                          const __global uint *ilData,
-                          uint groups_x, uint groups_y, uint repeat)
-{
+kernel void ireduce_first_kernel(global T *oData, KParam oInfo,
+                                   global uint *olData,
+                                   const global T *iData, KParam iInfo,
+                                   const global uint *ilData, uint groups_x,
+                                   uint groups_y, uint repeat,
+                                   global uint *rlenptr, KParam rlen) {
     const uint lidx = get_local_id(0);
     const uint lidy = get_local_id(1);
     const uint lid  = lidy * get_local_size(0) + lidx;
 
-    const uint zid = get_group_id(0) / groups_x;
-    const uint wid = get_group_id(1) / groups_y;
-    const uint groupId_x = get_group_id(0) - (groups_x) * zid;
-    const uint groupId_y = get_group_id(1) - (groups_y) * wid;
-    const uint xid = groupId_x * get_local_size(0) * repeat + lidx;
-    const uint yid = groupId_y * get_local_size(1) + lidy;
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) * repeat + lidx;
+    const uint yid       = groupId_y * get_local_size(1) + lidy;
 
     iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
-        yid * iInfo.strides[1] + iInfo.offset;
+             yid * iInfo.strides[1] + iInfo.offset;
 
     if (!IS_FIRST) {
         ilData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
-            yid * iInfo.strides[1] + iInfo.offset;
+                  yid * iInfo.strides[1] + iInfo.offset;
     }
 
     oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
-        yid * oInfo.strides[1] + oInfo.offset;
+             yid * oInfo.strides[1] + oInfo.offset;
 
     olData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
-        yid * oInfo.strides[1] + oInfo.offset;
+              yid * oInfo.strides[1] + oInfo.offset;
 
-    bool cond = (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
+    rlenptr += (rlenptr) ? wid * rlen.strides[3] + zid * rlen.strides[2] +
+                               yid * rlen.strides[1] + rlen.offset
+                         : 0;
 
-    __local T s_val[THREADS_PER_GROUP];
-    __local uint s_idx[THREADS_PER_GROUP];
+    bool cond =
+        (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
+
+    local T s_val[THREADS_PER_GROUP];
+    local uint s_idx[THREADS_PER_GROUP];
 
     int last = (xid + repeat * DIMX);
-    int lim = last > iInfo.dims[0] ? iInfo.dims[0] : last;
-    T out_val = init;
+
+    int minlen = rlenptr ? min(*rlenptr, (uint)iInfo.dims[0]) : iInfo.dims[0];
+
+    int lim      = last > minlen ? minlen : last;
+    T out_val    = init;
     uint out_idx = xid;
 
-    if (cond && xid < lim) {
+    if (cond && xid < lim && !is_nan(iData[xid])) {
         out_val = iData[xid];
         if (!IS_FIRST) out_idx = ilData[xid];
     }
@@ -68,13 +73,12 @@ void ireduce_first_kernel(__global T *oData,
     s_idx[lid] = out_idx;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    __local T *s_vptr = s_val + lidy * DIMX;
-    __local uint *s_iptr = s_idx + lidy * DIMX;
+    local T *s_vptr    = s_val + lidy * DIMX;
+    local uint *s_iptr = s_idx + lidy * DIMX;
 
     if (DIMX == 256) {
         if (lidx < 128) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[lidx + 128], s_iptr[lidx + 128]);
+            binOp(&out_val, &out_idx, s_vptr[lidx + 128], s_iptr[lidx + 128]);
             s_vptr[lidx] = out_val;
             s_iptr[lidx] = out_idx;
         }
@@ -82,64 +86,57 @@ void ireduce_first_kernel(__global T *oData,
     }
 
     if (DIMX >= 128) {
-        if (lidx <  64) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[lidx +  64], s_iptr[lidx +  64]);
+        if (lidx < 64) {
+            binOp(&out_val, &out_idx, s_vptr[lidx + 64], s_iptr[lidx + 64]);
             s_vptr[lidx] = out_val;
             s_iptr[lidx] = out_idx;
         }
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (DIMX >=  64) {
-        if (lidx <  32) {
-            binOp(&out_val, &out_idx,
-                  s_vptr[lidx +  32], s_iptr[lidx +  32]);
+    if (DIMX >= 64) {
+        if (lidx < 32) {
+            binOp(&out_val, &out_idx, s_vptr[lidx + 32], s_iptr[lidx + 32]);
             s_vptr[lidx] = out_val;
             s_iptr[lidx] = out_idx;
         }
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (lidx <  16) {
-        binOp(&out_val, &out_idx,
-              s_vptr[lidx +  16], s_iptr[lidx +  16]);
+    if (lidx < 16) {
+        binOp(&out_val, &out_idx, s_vptr[lidx + 16], s_iptr[lidx + 16]);
         s_vptr[lidx] = out_val;
         s_iptr[lidx] = out_idx;
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <   8) {
-        binOp(&out_val, &out_idx,
-              s_vptr[lidx +   8], s_iptr[lidx +   8]);
+    if (lidx < 8) {
+        binOp(&out_val, &out_idx, s_vptr[lidx + 8], s_iptr[lidx + 8]);
         s_vptr[lidx] = out_val;
         s_iptr[lidx] = out_idx;
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <   4) {
-        binOp(&out_val, &out_idx,
-              s_vptr[lidx +   4], s_iptr[lidx +   4]);
+    if (lidx < 4) {
+        binOp(&out_val, &out_idx, s_vptr[lidx + 4], s_iptr[lidx + 4]);
         s_vptr[lidx] = out_val;
         s_iptr[lidx] = out_idx;
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <   2) {
-        binOp(&out_val, &out_idx,
-              s_vptr[lidx +   2], s_iptr[lidx +   2]);
+    if (lidx < 2) {
+        binOp(&out_val, &out_idx, s_vptr[lidx + 2], s_iptr[lidx + 2]);
         s_vptr[lidx] = out_val;
         s_iptr[lidx] = out_idx;
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <   1) {
-        binOp(&out_val, &out_idx,
-              s_vptr[lidx +   1], s_iptr[lidx +   1]);
+    if (lidx < 1) {
+        binOp(&out_val, &out_idx, s_vptr[lidx + 1], s_iptr[lidx + 1]);
         s_vptr[lidx] = out_val;
         s_iptr[lidx] = out_idx;
     }
@@ -147,7 +144,7 @@ void ireduce_first_kernel(__global T *oData,
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (cond && lidx == 0) {
-        oData[groupId_x] = s_vptr[0];
+        oData[groupId_x]  = s_vptr[0];
         olData[groupId_x] = s_iptr[0];
     }
 }
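The unrolled `DIMX` stages in `ireduce_first_kernel` above follow the usual local-memory tree pattern: at each step, lane `lidx` combines its slot with the one at `lidx + half`, then all lanes synchronize and `half` is halved. The sketch below emulates that pattern sequentially on the host. It is only an illustration: `Slot`, `binOpMax`, and `treeReduceMax` are hypothetical names, and the tie rule in `binOpMax` (smaller index wins) is an assumption, since the kernel's actual `binOp` lives in a file not shown in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical (value, index) pair, one per work-item slot.
struct Slot {
    double val;
    unsigned idx;
};

// Assumed combine step for a max reduction: keep the larger value and,
// on ties, the smaller index.
inline void binOpMax(Slot &lhs, const Slot &rhs) {
    if (rhs.val > lhs.val || (rhs.val == lhs.val && rhs.idx < lhs.idx)) {
        lhs = rhs;
    }
}

// Sequential emulation of the tree; s.size() must be a power of two.
inline Slot treeReduceMax(std::vector<Slot> s) {
    assert(!s.empty() && (s.size() & (s.size() - 1)) == 0);
    for (std::size_t half = s.size() / 2; half >= 1; half /= 2) {
        // Each "lane" below half reads only slots at or above half, which
        // no lane writes during this step, so running the lanes in order
        // matches the parallel result.
        for (std::size_t lidx = 0; lidx < half; lidx++) {
            binOpMax(s[lidx], s[lidx + half]);
        }
        // The kernel places barrier(CLK_LOCAL_MEM_FENCE) between steps.
    }
    return s[0];
}
```

Because every step only writes the lower half and reads the upper half, the per-step barrier is what makes the parallel version safe; the emulation gets the same ordering for free.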
diff --git a/src/backend/opencl/kernel/jit.cl b/src/backend/opencl/kernel/jit.cl
index 74b776ec0e..a0486106e2 100644
--- a/src/backend/opencl/kernel/jit.cl
+++ b/src/backend/opencl/kernel/jit.cl
@@ -7,7 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#define sign(in) signbit((in))
+#define __select(cond, a, b) (cond) ? (a) : (b)
+#define __not_select(cond, a, b) (cond) ? (b) : (a)
+#define __circular_mod(a, b) ((a) < (b)) ? (a) : (a - b)
+
+#define __noop(a) (a)
 #define __add(lhs, rhs) (lhs) + (rhs)
 #define __sub(lhs, rhs) (lhs) - (rhs)
 #define __mul(lhs, rhs) (lhs) * (rhs)
@@ -26,38 +30,32 @@
 #define __real(in) (in)
 #define __imag(in) (0)
 #define __abs(in) abs(in)
-#define __abs2(in) (in) * (in)
 
 #define __crealf(in) ((in).x)
 #define __cimagf(in) ((in).y)
-#define __cabsf2(in) ((in).x * (in).x + (in).y * (in).y)
-#define __cabsf(in) sqrt(__cabsf2(in))
+#define __cabsf(in) hypot((in).x, (in).y)
 
 #define __creal(in) ((in).x)
 #define __cimag(in) ((in).y)
-#define __cabs2(in) ((in).x * (in).x + (in).y * (in).y)
-#define __cabs(in) sqrt(__cabs2(in))
+#define __cabs(in) hypot((in).x, (in).y)
+#define __sigmoid(in) (1.0 / (1 + exp(-(in))))
 
-float2 __cconjf(float2 in)
-{
+float2 __cconjf(float2 in) {
     float2 out = {in.x, -in.y};
     return out;
 }
 
-float2 __caddf(float2 lhs, float2 rhs)
-{
+float2 __caddf(float2 lhs, float2 rhs) {
     float2 out = {lhs.x + rhs.x, lhs.y + rhs.y};
     return out;
 }
 
-float2 __csubf(float2 lhs, float2 rhs)
-{
+float2 __csubf(float2 lhs, float2 rhs) {
     float2 out = {lhs.x - rhs.x, lhs.y - rhs.y};
     return out;
 }
 
-float2 __cmulf(float2 lhs, float2 rhs)
-{
+float2 __cmulf(float2 lhs, float2 rhs) {
     float2 out;
     out.x = lhs.x * rhs.x - lhs.y * rhs.y;
     out.y = lhs.x * rhs.y + lhs.y * rhs.x;
@@ -65,38 +63,39 @@ float2 __cmulf(float2 lhs, float2 rhs)
 }
 
 // FIXME: overflow / underflow issues
-float2 __cdivf(float2 lhs, float2 rhs)
-{
-    float2 out;
-    float den = (rhs.x * rhs.x + rhs.y * rhs.y);
-    float2 num = __cmulf(lhs, __cconjf(rhs));
-
-    out.x = num.x / den;
-    out.y = num.y / den;
-
+float2 __cdivf(float2 lhs, float2 rhs) {
+    // Normalize by absolute value and multiply
+    float rhs_abs     = __cabsf(rhs);
+    float inv_rhs_abs = 1.0f / rhs_abs;
+    float rhs_x       = inv_rhs_abs * rhs.x;
+    float rhs_y       = inv_rhs_abs * rhs.y;
+    float2 out = {lhs.x * rhs_x + lhs.y * rhs_y, lhs.y * rhs_x - lhs.x * rhs_y};
+    out.x *= inv_rhs_abs;
+    out.y *= inv_rhs_abs;
     return out;
 }
 
-#define __candf(lhs, rhs) __cabsf2(lhs) && __cabsf2(rhs)
-#define __cand(lhs, rhs) __cabs2(lhs) && __cabs2(rhs)
+#define __candf(lhs, rhs) __cabsf(lhs) && __cabsf(rhs)
+#define __cand(lhs, rhs) __cabs(lhs) && __cabs(rhs)
 
-#define __corf(lhs, rhs) __cabsf2(lhs) || __cabsf2(rhs)
-#define __cor(lhs, rhs) __cabs2(lhs) || __cabs2(rhs)
+#define __corf(lhs, rhs) __cabsf(lhs) || __cabsf(rhs)
+#define __cor(lhs, rhs) __cabs(lhs) || __cabs(rhs)
 
 #define __ceqf(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
 #define __cneqf(lhs, rhs) !__ceqf((lhs), (rhs))
-#define __cltf(lhs, rhs) (__cabsf2(lhs) < __cabsf2(rhs))
-#define __clef(lhs, rhs) (__cabsf2(lhs) <= __cabsf2(rhs))
-#define __cgtf(lhs, rhs) (__cabsf2(lhs) > __cabsf2(rhs))
-#define __cgef(lhs, rhs) (__cabsf2(lhs) >= __cabsf2(rhs))
+#define __cltf(lhs, rhs) (__cabsf(lhs) < __cabsf(rhs))
+#define __clef(lhs, rhs) (__cabsf(lhs) <= __cabsf(rhs))
+#define __cgtf(lhs, rhs) (__cabsf(lhs) > __cabsf(rhs))
+#define __cgef(lhs, rhs) (__cabsf(lhs) >= __cabsf(rhs))
 
 #define __ceq(lhs, rhs) (((lhs).x == (rhs).x) && ((lhs).y == (rhs).y))
 #define __cneq(lhs, rhs) !__ceq((lhs), (rhs))
-#define __clt(lhs, rhs) (__cabs2(lhs) < __cabs2(rhs))
-#define __cle(lhs, rhs) (__cabs2(lhs) <= __cabs2(rhs))
-#define __cgt(lhs, rhs) (__cabs2(lhs) > __cabs2(rhs))
-#define __cge(lhs, rhs) (__cabs2(lhs) >= __cabs2(rhs))
+#define __clt(lhs, rhs) (__cabs(lhs) < __cabs(rhs))
+#define __cle(lhs, rhs) (__cabs(lhs) <= __cabs(rhs))
+#define __cgt(lhs, rhs) (__cabs(lhs) > __cabs(rhs))
+#define __cge(lhs, rhs) (__cabs(lhs) >= __cabs(rhs))
 
+#define __bitnot(in) (~(in))
 #define __bitor(lhs, rhs) ((lhs) | (rhs))
 #define __bitand(lhs, rhs) ((lhs) & (rhs))
 #define __bitxor(lhs, rhs) ((lhs) ^ (rhs))
@@ -107,105 +106,124 @@ float2 __cdivf(float2 lhs, float2 rhs)
 #define __max(lhs, rhs) ((lhs) > (rhs)) ? (lhs) : (rhs)
 #define __rem(lhs, rhs) ((lhs) % (rhs))
 #define __mod(lhs, rhs) ((lhs) % (rhs))
-#define __pow(lhs, rhs) fpow((float)lhs, (float)rhs)
 
-float2 __cminf(float2 lhs, float2 rhs)
-{
-    return __abs2(lhs) < __abs2(rhs) ? lhs : rhs;
+#define __pow(lhs, rhs)                                                 \
+    convert_int_rte(pow(convert_float_rte(lhs), convert_float_rte(rhs)))
+#ifdef USE_DOUBLE
+#define __powll(lhs, rhs) \
+    convert_long_rte(pow(convert_double_rte(lhs), convert_double_rte(rhs)))
+#define __powul(lhs, rhs) \
+    convert_ulong_rte(pow(convert_double_rte(lhs), convert_double_rte(rhs)))
+#else
+#define __powll(lhs, rhs) \
+    convert_long_rte(pow(convert_float_rte(lhs), convert_float_rte(rhs)))
+#define __powul(lhs, rhs) \
+    convert_ulong_rte(pow(convert_float_rte(lhs), convert_float_rte(rhs)))
+#endif
+
+#ifdef USE_DOUBLE
+#define __powui(lhs, rhs) \
+    convert_uint_rte(pow(convert_double_rte(lhs), convert_double_rte(rhs)))
+#define __powsi(lhs, rhs) \
+    convert_int_rte(pow(convert_double_rte(lhs), convert_double_rte(rhs)))
+#else
+#define __powui(lhs, rhs) \
+    convert_uint_rte(pow(convert_float_rte(lhs), convert_float_rte(rhs)))
+#define __powsi(lhs, rhs) \
+    convert_int_rte(pow(convert_float_rte(lhs), convert_float_rte(rhs)))
+#endif
+
+float2 __cminf(float2 lhs, float2 rhs) {
+    return __cabsf(lhs) < __cabsf(rhs) ? lhs : rhs;
 }
 
-float2 __cmaxf(float2 lhs, float2 rhs)
-{
-    return __abs2(lhs) > __abs2(rhs) ? lhs : rhs;
+float2 __cmaxf(float2 lhs, float2 rhs) {
+    return __cabsf(lhs) > __cabsf(rhs) ? lhs : rhs;
 }
 
-float2 __cplx2f(float lhs, float rhs)
-{
+float2 __cplx2f(float lhs, float rhs) {
     float2 out = {lhs, rhs};
     return out;
 }
 
-float2 __convert_cfloat(float in)
-{
+float2 __convert_cfloat(float in) {
     float2 out = {in, 0};
     return out;
 }
 
 #define __convert_char(val) (char)(convert_char((val)) != 0)
 
-#define fpow(lhs, rhs) pow((lhs), (rhs))
-
 #define frem(lhs, rhs) remainder((lhs), (rhs))
 
 #define iszero(a) ((a) == 0)
 
-float2  __convert_c2c(float2 in) { return in; }
+float2 __convert_c2c(float2 in) { return in; }
 
 #ifdef USE_DOUBLE
 
-float2  __convert_z2c(double2 in) { float2  out = {in.x, in.y}; return out; }
+float2 __convert_z2c(double2 in) {
+    float2 out = {in.x, in.y};
+    return out;
+}
 
-double2 __cconj(double2 in)
-{
+double2 __cconj(double2 in) {
     double2 out = {in.x, -in.y};
     return out;
 }
 
-double2 __cadd(double2 lhs, double2 rhs)
-{
+double2 __cadd(double2 lhs, double2 rhs) {
     double2 out = {lhs.x + rhs.x, lhs.y + rhs.y};
     return out;
 }
 
-double2 __csub(double2 lhs, double2 rhs)
-{
+double2 __csub(double2 lhs, double2 rhs) {
     double2 out = {lhs.x - rhs.x, lhs.y - rhs.y};
     return out;
 }
 
-double2 __cmul(double2 lhs, double2 rhs)
-{
+double2 __cmul(double2 lhs, double2 rhs) {
     double2 out;
     out.x = lhs.x * rhs.x - lhs.y * rhs.y;
     out.y = lhs.x * rhs.y + lhs.y * rhs.x;
     return out;
 }
 
-double2 __cdiv(double2 lhs, double2 rhs)
-{
-    double2 out;
-    double den = (rhs.x * rhs.x + rhs.y * rhs.y);
-    double2 num = __cmul(lhs, __cconj(rhs));
-
-    out.x = num.x / den;
-    out.y = num.y / den;
+double2 __cdiv(double2 lhs, double2 rhs) {
+    // Normalize by absolute value and multiply
+    double rhs_abs     = __cabs(rhs);
+    double inv_rhs_abs = 1.0 / rhs_abs;
+    double rhs_x       = inv_rhs_abs * rhs.x;
+    double rhs_y       = inv_rhs_abs * rhs.y;
+    double2 out        = {lhs.x * rhs_x + lhs.y * rhs_y,
+                   lhs.y * rhs_x - lhs.x * rhs_y};
+    out.x *= inv_rhs_abs;
+    out.y *= inv_rhs_abs;
     return out;
 }
 
-double2 __cmin(double2 lhs, double2 rhs)
-{
-    return __abs2(lhs) < __abs2(rhs) ? lhs : rhs;
+double2 __cmin(double2 lhs, double2 rhs) {
+    return __cabs(lhs) < __cabs(rhs) ? lhs : rhs;
 }
 
-double2 __cmax(double2 lhs, double2 rhs)
-{
-    return __abs2(lhs) > __abs2(rhs) ? lhs : rhs;
+double2 __cmax(double2 lhs, double2 rhs) {
+    return __cabs(lhs) > __cabs(rhs) ? lhs : rhs;
 }
 
-double2 __cplx2(double lhs, double rhs)
-{
+double2 __cplx2(double lhs, double rhs) {
     double2 out = {lhs, rhs};
     return out;
 }
 
-double2 __convert_cdouble(double in)
-{
+double2 __convert_cdouble(double in) {
     double2 out = {in, 0};
     return out;
 }
 
-double2 __convert_c2z(float2  in) { double2 out = {in.x, in.y}; return out; }
+double2 __convert_c2z(float2 in) {
+    double2 out = {in.x, in.y};
+    return out;
+}
 
 double2 __convert_z2z(double2 in) { return in; }
 
-#endif // USE_DOUBLE
+#endif  // USE_DOUBLE
diff --git a/src/backend/opencl/kernel/join.cl b/src/backend/opencl/kernel/join.cl
deleted file mode 100644
index 6145574fa1..0000000000
--- a/src/backend/opencl/kernel/join.cl
+++ /dev/null
@@ -1,43 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-__kernel
-void join_kernel(__global To *d_out, const KParam out,
-                 __global const Ti *d_in, const KParam in,
-                 const int o0, const int o1, const int o2, const int o3,
-                 const int blocksPerMatX, const int blocksPerMatY)
-{
-    const int iz = get_group_id(0) / blocksPerMatX;
-    const int iw = get_group_id(1) / blocksPerMatY;
-
-    const int blockIdx_x = get_group_id(0) - iz * blocksPerMatX;
-    const int blockIdx_y = get_group_id(1) - iw * blocksPerMatY;
-
-    const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
-    const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
-
-    const int incy = blocksPerMatY * get_local_size(1);
-    const int incx = blocksPerMatX * get_local_size(0);
-
-    d_in = d_in + in.offset;
-
-    if (iz < in.dims[2] && iw < in.dims[3]) {
-        d_out = d_out + (iz + o2) * out.strides[2] + (iw + o3) * out.strides[3];
-        d_in = d_in + iz * in.strides[2] + iw * in.strides[3];
-
-        for (int iy = yy; iy < in.dims[1]; iy += incy) {
-            __global Ti *d_in_ = d_in + iy * in.strides[1];
-            __global To *d_out_ = d_out + (iy + o1) * out.strides[1];
-
-            for (int ix = xx; ix < in.dims[0]; ix += incx) {
-                d_out_[ix + o0] = d_in_[ix];
-            }
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/join.hpp b/src/backend/opencl/kernel/join.hpp
deleted file mode 100644
index 8e3e95a788..0000000000
--- a/src/backend/opencl/kernel/join.hpp
+++ /dev/null
@@ -1,91 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <kernel_headers/join.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 256;
-        static const int TILEY = 32;
-
-        template<typename To, typename Ti, int dim>
-        void join(Param out, const Param in, const af::dim4 offset)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   joinProgs;
-                static std::map<int, Kernel *> joinKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D To=" << dtype_traits<To>::getName()
-                            << " -D Ti=" << dtype_traits<Ti>::getName()
-                            << " -D dim=" << dim;
-
-                    if (std::is_same<To, double>::value ||
-                        std::is_same<To, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    } else if (std::is_same<Ti, double>::value ||
-                               std::is_same<Ti, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    Program prog;
-                    buildProgram(prog, join_cl, join_cl_len, options.str());
-                    joinProgs[device] = new Program(prog);
-                    joinKernels[device] = new Kernel(*joinProgs[device], "join_kernel");
-                });
-
-                auto joinOp = make_kernel<Buffer, const KParam, const Buffer, const KParam,
-                              const int, const int, const int, const int,
-                              const int, const int> (*joinKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(in.info.dims[0], TILEX);
-                int blocksPerMatY = divup(in.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * in.info.dims[2],
-                               local[1] * blocksPerMatY * in.info.dims[3],
-                               1);
-
-                joinOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data, in.info,
-                        offset[0], offset[1], offset[2], offset[3], blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/laset.cl b/src/backend/opencl/kernel/laset.cl
index 91ab50a408..4efdbca814 100644
--- a/src/backend/opencl/kernel/laset.cl
+++ b/src/backend/opencl/kernel/laset.cl
@@ -35,22 +35,22 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
@@ -69,117 +69,97 @@
 #define IS_EQUAL(lhs, rhs) ((rhs == lhs))
 #endif
 
-__kernel
-void laset_full(
-    int m, int n,
-    T offdiag, T diag,
-    __global T *A, unsigned long A_offset, int lda )
-{
+kernel void laset_full(int m, int n, T offdiag, T diag, global T *A,
+                         unsigned long A_offset, int lda) {
     A += A_offset;
 
-    int ind = get_group_id(0)*BLK_X + get_local_id(0);
-    int iby = get_group_id(1)*BLK_Y;
-    /* check if full block-column && (below diag || above diag || offdiag == diag) */
-    bool full = (iby + BLK_Y <= n && (ind >= iby + BLK_Y || ind + BLK_X <= iby || IS_EQUAL(offdiag, diag)));
+    int ind = get_group_id(0) * BLK_X + get_local_id(0);
+    int iby = get_group_id(1) * BLK_Y;
+    /* check if full block-column && (below diag || above diag || offdiag ==
+     * diag) */
+    bool full =
+        (iby + BLK_Y <= n &&
+         (ind >= iby + BLK_Y || ind + BLK_X <= iby || IS_EQUAL(offdiag, diag)));
     /* do only rows inside matrix */
-    if ( ind < m ) {
-        A += ind + iby*lda;
-        if ( full ) {
-            // full block-column, off-diagonal block or offdiag == diag
-            #pragma unroll
-            for( int j=0; j < BLK_Y; ++j ) {
-                A[j*lda] = offdiag;
-            }
-        }
-        else {
+    if (ind < m) {
+        A += ind + iby * lda;
+        if (full) {
+// full block-column, off-diagonal block or offdiag == diag
+#pragma unroll
+            for (int j = 0; j < BLK_Y; ++j) { A[j * lda] = offdiag; }
+        } else {
             // either partial block-column or diagonal block
-            for( int j=0; j < BLK_Y && iby+j < n; ++j ) {
-                if ( iby+j == ind )
-                    A[j*lda] = diag;
+            for (int j = 0; j < BLK_Y && iby + j < n; ++j) {
+                if (iby + j == ind)
+                    A[j * lda] = diag;
                 else
-                    A[j*lda] = offdiag;
+                    A[j * lda] = offdiag;
             }
         }
     }
 }
 
-
 /*
     Similar to zlaset_full, but updates only the diagonal and below.
     Blocks that are fully above the diagonal exit immediately.
 
     Code similar to zlacpy, zlat2c, clat2z.
 */
-__kernel
-void laset_lower(
-    int m, int n,
-    T offdiag, T diag,
-    __global T *A, unsigned long A_offset, int lda )
-{
+kernel void laset_lower(int m, int n, T offdiag, T diag, global T *A,
+                          unsigned long A_offset, int lda) {
     A += A_offset;
 
-    int ind = get_group_id(0)*BLK_X + get_local_id(0);
-    int iby = get_group_id(1)*BLK_Y;
+    int ind = get_group_id(0) * BLK_X + get_local_id(0);
+    int iby = get_group_id(1) * BLK_Y;
     /* check if full block-column && (below diag) */
     bool full = (iby + BLK_Y <= n && (ind >= iby + BLK_Y));
     /* do only rows inside matrix, and blocks not above diag */
-    if ( ind < m && ind + BLK_X > iby ) {
-        A += ind + iby*lda;
-        if ( full ) {
-            // full block-column, off-diagonal block
-            #pragma unroll
-            for( int j=0; j < BLK_Y; ++j ) {
-                A[j*lda] = offdiag;
-            }
-        }
-        else {
+    if (ind < m && ind + BLK_X > iby) {
+        A += ind + iby * lda;
+        if (full) {
+// full block-column, off-diagonal block
+#pragma unroll
+            for (int j = 0; j < BLK_Y; ++j) { A[j * lda] = offdiag; }
+        } else {
             // either partial block-column or diagonal block
-            for( int j=0; j < BLK_Y && iby+j < n; ++j ) {
-                if ( iby+j == ind )
-                    A[j*lda] = diag;
-                else if ( ind > iby+j )
-                    A[j*lda] = offdiag;
+            for (int j = 0; j < BLK_Y && iby + j < n; ++j) {
+                if (iby + j == ind)
+                    A[j * lda] = diag;
+                else if (ind > iby + j)
+                    A[j * lda] = offdiag;
             }
         }
     }
 }
 
-
 /*
     Similar to zlaset_full, but updates only the diagonal and above.
     Blocks that are fully below the diagonal exit immediately.
 
     Code similar to zlacpy, zlat2c, clat2z.
 */
-__kernel
-void laset_upper(
-    int m, int n,
-    T offdiag, T diag,
-    __global T *A, unsigned long A_offset, int lda )
-{
+kernel void laset_upper(int m, int n, T offdiag, T diag, global T *A,
+                          unsigned long A_offset, int lda) {
     A += A_offset;
 
-    int ind = get_group_id(0)*BLK_X + get_local_id(0);
-    int iby = get_group_id(1)*BLK_Y;
+    int ind = get_group_id(0) * BLK_X + get_local_id(0);
+    int iby = get_group_id(1) * BLK_Y;
     /* check if full block-column && (above diag) */
     bool full = (iby + BLK_Y <= n && (ind + BLK_X <= iby));
     /* do only rows inside matrix, and blocks not below diag */
-    if ( ind < m && ind < iby + BLK_Y ) {
-        A += ind + iby*lda;
-        if ( full ) {
-            // full block-column, off-diagonal block
-            #pragma unroll
-            for( int j=0; j < BLK_Y; ++j ) {
-                A[j*lda] = offdiag;
-            }
-        }
-        else {
+    if (ind < m && ind < iby + BLK_Y) {
+        A += ind + iby * lda;
+        if (full) {
+// full block-column, off-diagonal block
+#pragma unroll
+            for (int j = 0; j < BLK_Y; ++j) { A[j * lda] = offdiag; }
+        } else {
             // either partial block-column or diagonal block
-            for( int j=0; j < BLK_Y && iby+j < n; ++j ) {
-                if ( iby+j == ind )
-                    A[j*lda] = diag;
-                else if ( ind < iby+j )
-                    A[j*lda] = offdiag;
+            for (int j = 0; j < BLK_Y && iby + j < n; ++j) {
+                if (iby + j == ind)
+                    A[j * lda] = diag;
+                else if (ind < iby + j)
+                    A[j * lda] = offdiag;
             }
         }
     }
diff --git a/src/backend/opencl/kernel/laset.hpp b/src/backend/opencl/kernel/laset.hpp
index 9982285391..5e4588c41f 100644
--- a/src/backend/opencl/kernel/laset.hpp
+++ b/src/backend/opencl/kernel/laset.hpp
@@ -8,84 +8,71 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/laset.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
+#include <kernel_headers/laset.hpp>
+#include <magma_types.h>
 #include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-
-namespace opencl
-{
-
-namespace kernel
-{
+#include <string>
+#include <vector>
 
-static const int BLK_X = 64;
-static const int BLK_Y = 32;
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
 template<int num>
-const char *laset_name() { return "laset_none"; }
-template<> const char *laset_name<0>() { return "laset_full"; }
-template<> const char *laset_name<1>() { return "laset_lower"; }
-template<> const char *laset_name<2>() { return "laset_upper"; }
+const char *laset_name() {
+    return "laset_none";
+}
+template<>
+const char *laset_name<0>() {
+    return "laset_full";
+}
+template<>
+const char *laset_name<1>() {
+    return "laset_lower";
+}
+template<>
+const char *laset_name<2>() {
+    return "laset_upper";
+}
 
 template<typename T, int uplo>
-void laset(int m, int  n,
-           T offdiag, T diag,
-           cl_mem dA, size_t dA_offset, magma_int_t ldda)
-{
-
-    static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-    static std::map<int, Program*>  setProgs;
-    static std::map<int, Kernel*> setKernels;
-
-    int device = getActiveDeviceId();
-
-    std::call_once(compileFlags[device], [device] () {
-
-            std::ostringstream options;
-            options << " -D T=" << dtype_traits<T>::getName()
-                    << " -D BLK_X=" << BLK_X
-                    << " -D BLK_Y=" << BLK_Y
-                    << " -D IS_CPLX=" << af::iscplx<T>();
-
-            if (std::is_same<T, double>::value ||
-                std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-            }
-
-            cl::Program prog;
-            buildProgram(prog, laset_cl, laset_cl_len, options.str());
-            setProgs[device] = new Program(prog);
-            setKernels[device] = new Kernel(*setProgs[device], laset_name<uplo>());
-        });
+void laset(int m, int n, T offdiag, T diag, cl_mem dA, size_t dA_offset,
+           magma_int_t ldda, cl_command_queue queue) {
+    constexpr int BLK_X = 64;
+    constexpr int BLK_Y = 32;
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(uplo),
+    };
+    std::array<std::string, 5> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(BLK_X),
+        DefineValue(BLK_Y),
+        DefineKeyValue(IS_CPLX, static_cast<int>(iscplx<T>())),
+        getTypeBuildDefinition<T>()};
+
+    auto lasetOp =
+        common::getKernel(laset_name<uplo>(), {{laset_cl_src}}, targs, options);
 
     int groups_x = (m - 1) / BLK_X + 1;
     int groups_y = (n - 1) / BLK_Y + 1;
 
-    NDRange local(BLK_X, 1);
-    NDRange global(groups_x * local[0],
-                   groups_y * local[1]);
+    cl::NDRange local(BLK_X, 1);
+    cl::NDRange global(groups_x * local[0], groups_y * local[1]);
 
-    auto lasetOp = make_kernel<int, int, T, T, cl_mem, unsigned long long, int>(*setKernels[device]);
-    lasetOp(EnqueueArgs(getQueue(), global, local),
-            m, n, offdiag, diag, dA, dA_offset, ldda);
-}
+    // retain the cl_mem object during cl::Buffer creation
+    cl::Buffer dAObj(dA, true);
 
+    cl::CommandQueue q(queue, true);
+    lasetOp(cl::EnqueueArgs(q, global, local), m, n, offdiag, diag, dAObj,
+            dA_offset, ldda);
 }
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/laset_band.cl b/src/backend/opencl/kernel/laset_band.cl
index b69d9484ce..d3f0ddb683 100644
--- a/src/backend/opencl/kernel/laset_band.cl
+++ b/src/backend/opencl/kernel/laset_band.cl
@@ -37,28 +37,24 @@
                       |     3 2   => skip below matrix
                               3   => skip below matrix
 
-    Thread assignment for m=10, n=12, k=4, nb=8. Each column is done in parallel.
+    Thread assignment for m=10, n=12, k=4, nb=8. Each column is done in
+ parallel.
 */
-__kernel
-void laset_band_upper(
-    int m, int n,
-    T offdiag, T diag,
-    __global T *A, unsigned long off, int lda)
-{
+kernel void laset_band_upper(int m, int n, T offdiag, T diag, global T *A,
+                               unsigned long off, int lda) {
     int k   = get_local_size(0);
     int ibx = get_group_id(0) * NB;
     int ind = ibx + get_local_id(0) - k + 1;
 
-    A += ind + ibx*lda + off;
+    A += ind + ibx * lda + off;
 
     T value = offdiag;
-    if (get_local_id(0) == k-1)
-        value = diag;
+    if (get_local_id(0) == k - 1) value = diag;
 
-    #pragma unroll
-    for (int j=0; j < NB; j++) {
+#pragma unroll
+    for (int j = 0; j < NB; j++) {
         if (ibx + j < n && ind + j >= 0 && ind + j < m) {
-            A[j*(lda+1)] = value;
+            A[j * (lda + 1)] = value;
         }
     }
 }
@@ -88,29 +84,23 @@ void laset_band_upper(
                             3 2   => skip below matrix
                               3   => skip below matrix
 
-    Thread assignment for m=13, n=12, k=4, nb=8. Each column is done in parallel.
+    Thread assignment for m=13, n=12, k=4, nb=8. Each column is done in
+ parallel.
 */
 
-__kernel
-void laset_band_lower(
-    int m, int n,
-    T offdiag, T diag,
-    __global T *A, unsigned long off, int lda)
-{
-    //int k   = get_local_size(0);
+kernel void laset_band_lower(int m, int n, T offdiag, T diag, global T *A,
+                               unsigned long off, int lda) {
+    // int k   = get_local_size(0);
     int ibx = get_group_id(0) * NB;
     int ind = ibx + get_local_id(0);
 
-    A += ind + ibx*lda + off;
+    A += ind + ibx * lda + off;
 
     T value = offdiag;
-    if (get_local_id(0) == 0)
-        value = diag;
+    if (get_local_id(0) == 0) value = diag;
 
-    #pragma unroll
-    for (int j=0; j < NB; j++) {
-        if (ibx + j < n && ind + j < m) {
-            A[j*(lda+1)] = value;
-        }
+#pragma unroll
+    for (int j = 0; j < NB; j++) {
+        if (ibx + j < n && ind + j < m) { A[j * (lda + 1)] = value; }
     }
 }
diff --git a/src/backend/opencl/kernel/laset_band.hpp b/src/backend/opencl/kernel/laset_band.hpp
index b6ba692776..daa1f73b0c 100644
--- a/src/backend/opencl/kernel/laset_band.hpp
+++ b/src/backend/opencl/kernel/laset_band.hpp
@@ -8,34 +8,22 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/laset_band.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
+#include <kernel_headers/laset_band.hpp>
 #include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
+#include <string>
+#include <vector>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-namespace kernel
-{
-
-#if 0 // Needs to be enabled when unmqr2 is enabled
+#if 0  // Needs to be enabled when unmqr2 is enabled
 static const int NB = 64;
 template<int num>
 const char *laset_band_name() { return "laset_none"; }
@@ -47,30 +35,19 @@ void laset_band(int m, int  n, int k,
                 T offdiag, T diag,
                 cl_mem dA, size_t dA_offset, magma_int_t ldda)
 {
+    static const std::string src(laset_band_cl, laset_band_cl_len);
 
-    static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-    static std::map<int, Program*>  setProgs;
-    static std::map<int, Kernel*> setKernels;
-
-    int device = getActiveDeviceId();
-
-    std::call_once(compileFlags[device], [device] () {
-
-            std::ostringstream options;
-            options << " -D T=" << dtype_traits<T>::getName()
-                    << " -D NB=" << NB
-                    << " -D IS_CPLX=" << af::iscplx<T>();
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(), TemplateArg(uplo),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(NB),
+        DefineKeyValue(IS_CPLX, static_cast<int>(iscplx<T>())),
+        getTypeBuildDefinition<T>()
+    };
 
-            if (std::is_same<T, double>::value ||
-                std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-            }
-
-            cl::Program prog;
-            buildProgram(prog, laset_band_cl, laset_band_cl_len, options.str());
-            setProgs[device] = new Program(prog);
-            setKernels[device] = new Kernel(*setProgs[device], laset_band_name<uplo>());
-        });
+    auto lasetBandOp = common::getKernel(laset_band_name<uplo>(), {src}, targs, options);
 
     int threads = 1;
     int groups = 1;
@@ -83,15 +60,13 @@ void laset_band(int m, int  n, int k,
         groups = (std::min(m+k-1, n) - 1) / NB + 1;
     }
 
-    NDRange local(threads, 1);
-    NDRange global(threads * groups, 1);
-
-    auto lasetBandOp = make_kernel<int, int, T, T, cl_mem, unsigned long long, int>(*setKernels[device]);
+    cl::NDRange local(threads, 1);
+    cl::NDRange global(threads * groups, 1);
 
-    lasetBandOp(EnqueueArgs(getQueue(), global, local),
-                m, n, offdiag, diag, dA, dA_offset, ldda);
+    lasetBandOp(cl::EnqueueArgs(getQueue(), global, local), m, n, offdiag, diag, dA, dA_offset, ldda);
 }
-#endif
+#endif
 
-}
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/laswp.cl b/src/backend/opencl/kernel/laswp.cl
index e052e24001..168ce52404 100644
--- a/src/backend/opencl/kernel/laswp.cl
+++ b/src/backend/opencl/kernel/laswp.cl
@@ -38,22 +38,22 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
@@ -69,22 +69,21 @@ typedef struct {
 // Each GPU block processes one block-column of A.
 // Each thread goes down a column of A,
 // swapping rows according to pivots stored in params.
-__kernel void laswp(int n, __global T *dAT, unsigned long dAT_offset,
-                    int ldda, zlaswp_params_t params )
-{
+kernel void laswp(int n, global T *dAT, unsigned long dAT_offset, int ldda,
+                  zlaswp_params_t params) {
     dAT += dAT_offset;
 
-    int tid = get_local_id(0) + get_local_size(0)*get_group_id(0);
-    if ( tid < n ) {
+    int tid = get_local_id(0) + get_local_size(0) * get_group_id(0);
+    if (tid < n) {
         dAT += tid;
-        __global T *A1  = dAT;
+        global T *A1 = dAT;
 
-        for( int i1 = 0; i1 < params.npivots; ++i1 ) {
-            int i2 = params.ipiv[i1];
-            __global T *A2 = dAT + i2*ldda;
-            T temp = *A1;
-            *A1 = *A2;
-            *A2 = temp;
+        for (int i1 = 0; i1 < params.npivots; ++i1) {
+            int i2       = params.ipiv[i1];
+            global T *A2 = dAT + i2 * ldda;
+            T temp       = *A1;
+            *A1          = *A2;
+            *A2          = temp;
             A1 += ldda;  // A1 = dA + i1*ldx
         }
     }
diff --git a/src/backend/opencl/kernel/laswp.hpp b/src/backend/opencl/kernel/laswp.hpp
index 970572e01e..7439f3680e 100644
--- a/src/backend/opencl/kernel/laswp.hpp
+++ b/src/backend/opencl/kernel/laswp.hpp
@@ -8,95 +8,65 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/laswp.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
+#include <kernel_headers/laswp.hpp>
+#include <traits.hpp>
 
-namespace opencl
-{
+#include <string>
+#include <vector>
 
-namespace kernel
-{
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-static const int NTHREADS = 256;
-static const int MAX_PIVOTS =  32;
+constexpr int MAX_PIVOTS = 32;
 
 typedef struct {
     int npivots;
     int ipiv[MAX_PIVOTS];
 } zlaswp_params_t;
 
-
 template<typename T>
-void laswp(int n, cl_mem in, size_t offset, int ldda,
-           int k1, int k2, const int *ipiv, int inci)
-{
+void laswp(int n, cl_mem in, size_t offset, int ldda, int k1, int k2,
+           const int *ipiv, int inci, cl::CommandQueue &queue) {
+    constexpr int NTHREADS = 256;
 
-    static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-    static std::map<int, Program*>  swpProgs;
-    static std::map<int, Kernel*> swpKernels;
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(MAX_PIVOTS),
+        getTypeBuildDefinition<T>()};
 
-    int device = getActiveDeviceId();
-
-    std::call_once(compileFlags[device], [device] () {
-
-            std::ostringstream options;
-            options << " -D T=" << dtype_traits<T>::getName()
-                    << " -D MAX_PIVOTS=" << MAX_PIVOTS;
-
-            if (std::is_same<T, double>::value ||
-                std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-            }
-
-            cl::Program prog;
-            buildProgram(prog, laswp_cl, laswp_cl_len, options.str());
-            swpProgs[device] = new Program(prog);
-
-            swpKernels[device] = new Kernel(*swpProgs[device], "laswp");
-        });
+    auto laswpOp = common::getKernel("laswp", {{laswp_cl_src}}, targs, options);
 
     int groups = divup(n, NTHREADS);
-    NDRange local(NTHREADS);
-    NDRange global(groups * local[0]);
+    cl::NDRange local(NTHREADS);
+    cl::NDRange global(groups * local[0]);
     zlaswp_params_t params;
 
-    auto laswpOp = make_kernel<int, cl_mem, unsigned long long,
-                               int, zlaswp_params_t>(*swpKernels[device]);
+    // retain the cl_mem object during cl::Buffer creation
+    cl::Buffer inObj(in, true);
 
-    for( int k = k1-1; k < k2; k += MAX_PIVOTS ) {
-
-        int pivots_left = k2-k;
+    for (int k = k1 - 1; k < k2; k += MAX_PIVOTS) {
+        int pivots_left = k2 - k;
 
         params.npivots = pivots_left > MAX_PIVOTS ? MAX_PIVOTS : pivots_left;
 
-        for( int j = 0; j < params.npivots; ++j ) {
-            params.ipiv[j] = ipiv[(k+j)*inci] - k - 1;
-        }
+        for (int j = 0; j < params.npivots; ++j)
+            params.ipiv[j] = ipiv[(k + j) * inci] - k - 1;
 
-        unsigned long long k_offset = offset + k*ldda;
+        unsigned long long k_offset = offset + k * ldda;
 
-        laswpOp(EnqueueArgs(getQueue(), global, local),
-                n, in, k_offset, ldda, params);
+        laswpOp(cl::EnqueueArgs(queue, global, local), n, inObj, k_offset, ldda,
+                params);
     }
-
 }
 
-}
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
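Note on the `laswp` host loop above: it hands the kernel at most `MAX_PIVOTS` pivots per launch and rebases each 1-based pivot so it is relative to row `k`. The following is a minimal CPU sketch of that batching, assuming a plain `std::vector` stands in for the device buffer. `PivotBatch`, `laswpBatchCpu`, and `laswpCpu` are hypothetical names used only for illustration; they are not part of ArrayFire.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

constexpr int MAX_PIVOTS = 32;

// Hypothetical stand-in for zlaswp_params_t.
struct PivotBatch {
    int npivots;
    int ipiv[MAX_PIVOTS];
};

// Emulates one kernel launch on the CPU: each "thread" tid walks down one
// column and swaps row i1 with row ipiv[i1], both relative to the first row
// of the batch.
template<typename T>
void laswpBatchCpu(int n, std::vector<T> &a, std::size_t offset, int ldda,
                   const PivotBatch &params) {
    for (int tid = 0; tid < n; ++tid) {
        T *base = a.data() + offset + tid;
        T *a1   = base;
        for (int i1 = 0; i1 < params.npivots; ++i1) {
            T *a2 = base + static_cast<std::ptrdiff_t>(params.ipiv[i1]) * ldda;
            std::swap(*a1, *a2);
            a1 += ldda;
        }
    }
}

// Mirrors the host loop: split rows k1..k2 (1-based) into batches of at most
// MAX_PIVOTS and convert each pivot into an offset relative to row k.
template<typename T>
void laswpCpu(int n, std::vector<T> &a, std::size_t offset, int ldda, int k1,
              int k2, const int *ipiv, int inci) {
    for (int k = k1 - 1; k < k2; k += MAX_PIVOTS) {
        PivotBatch params;
        params.npivots = std::min(k2 - k, MAX_PIVOTS);
        for (int j = 0; j < params.npivots; ++j)
            params.ipiv[j] = ipiv[(k + j) * inci] - k - 1;
        laswpBatchCpu(n, a, offset + static_cast<std::size_t>(k) * ldda, ldda,
                      params);
    }
}
```

Because every batch starts at row `k` and its pivots are stored relative to that row, the batched form should perform the same sequence of row swaps as a single pass over all pivots, only split across several launches.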
diff --git a/src/backend/opencl/kernel/lookup.cl b/src/backend/opencl/kernel/lookup.cl
index b57342934c..7ed4bc1cfa 100644
--- a/src/backend/opencl/kernel/lookup.cl
+++ b/src/backend/opencl/kernel/lookup.cl
@@ -7,49 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-int trimIndex(int idx, const int len)
-{
+int trimIndex(int idx, const int len) {
     int ret_val = idx;
-    int offset  = abs(ret_val)%len;
-    if (ret_val<0) {
-        ret_val = offset-1;
-    } else if (ret_val>=len) {
-        ret_val = len-offset-1;
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
     }
     return ret_val;
 }
 
-kernel
-void lookupND(global in_t * out,
-                  KParam oInfo,
-                  global const in_t * in,
-                  KParam iInfo,
-                  global const idx_t * indices,
-                  KParam idxInfo,
-                  int nBBS0,
-                  int nBBS1)
-{
+kernel void lookupND(global in_t *out, KParam oInfo, global const in_t *in,
+                     KParam iInfo, global const idx_t *indices, KParam idxInfo,
+                     int nBBS0, int nBBS1) {
     int lx = get_local_id(0);
     int ly = get_local_id(1);
 
-    int gz = get_group_id(0)/nBBS0;
-    int gw = get_group_id(1)/nBBS1;
+    int gz = get_group_id(0) / nBBS0;
+    int gw = get_group_id(1) / nBBS1;
 
-    int gx = get_local_size(0) * (get_group_id(0) - gz*nBBS0) + lx;
-    int gy = get_local_size(1) * (get_group_id(1) - gw*nBBS1) + ly;
+    int gx = get_local_size(0) * (get_group_id(0) - gz * nBBS0) + lx;
+    int gy = get_local_size(1) * (get_group_id(1) - gw * nBBS1) + ly;
 
-    global const idx_t *idxPtr = indices;
+    global const idx_t *idxPtr = indices + idxInfo.offset;
 
-    int i = iInfo.strides[0]*(DIM==0 ? trimIndex((int)idxPtr[gx], iInfo.dims[0]): gx);
-    int j = iInfo.strides[1]*(DIM==1 ? trimIndex((int)idxPtr[gy], iInfo.dims[1]): gy);
-    int k = iInfo.strides[2]*(DIM==2 ? trimIndex((int)idxPtr[gz], iInfo.dims[2]): gz);
-    int l = iInfo.strides[3]*(DIM==3 ? trimIndex((int)idxPtr[gw], iInfo.dims[3]): gw);
+    int i = iInfo.strides[0] *
+            (DIM == 0 ? trimIndex((int)idxPtr[gx], iInfo.dims[0]) : gx);
+    int j = iInfo.strides[1] *
+            (DIM == 1 ? trimIndex((int)idxPtr[gy], iInfo.dims[1]) : gy);
+    int k = iInfo.strides[2] *
+            (DIM == 2 ? trimIndex((int)idxPtr[gz], iInfo.dims[2]) : gz);
+    int l = iInfo.strides[3] *
+            (DIM == 3 ? trimIndex((int)idxPtr[gw], iInfo.dims[3]) : gw);
 
-    global const in_t *inPtr = in + (i+j+k+l);
-    global in_t *outPtr = out + (gx*oInfo.strides[0]+gy*oInfo.strides[1]+
-                                 gz*oInfo.strides[2]+gw*oInfo.strides[3]);
+    global const in_t *inPtr = in + (i + j + k + l) + iInfo.offset;
+    global in_t *outPtr =
+        out + (gx * oInfo.strides[0] + gy * oInfo.strides[1] +
+               gz * oInfo.strides[2] + gw * oInfo.strides[3] + oInfo.offset);
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1] && gz<oInfo.dims[2] && gw<oInfo.dims[3]) {
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1] && gz < oInfo.dims[2] &&
+        gw < oInfo.dims[3]) {
         outPtr[0] = inPtr[0];
     }
 }
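Note on the `trimIndex` change above: the old code computed `offset - 1` for negative indices, which yields `-1` whenever `abs(idx)` is a multiple of `len`. The sketch below ports both versions to host C++ for comparison. The names `trimIndexOld` and `trimIndexNew` are hypothetical, and `std::abs` stands in for the OpenCL built-in.

```cpp
#include <cassert>
#include <cstdlib>

// Pre-change logic, reproduced for comparison only.
int trimIndexOld(int idx, const int len) {
    int ret_val = idx;
    int offset  = std::abs(ret_val) % len;
    if (ret_val < 0) {
        ret_val = offset - 1;
    } else if (ret_val >= len) {
        ret_val = len - offset - 1;
    }
    return ret_val;
}

// Post-change logic: negative indices are reflected so that -1 maps to 0,
// -2 maps to 1, and so on, wrapping modulo len.
int trimIndexNew(int idx, const int len) {
    int ret_val = idx;
    if (ret_val < 0) {
        int offset = (std::abs(ret_val) - 1) % len;
        ret_val    = offset;
    } else if (ret_val >= len) {
        int offset = std::abs(ret_val) % len;
        ret_val    = len - offset - 1;
    }
    return ret_val;
}
```

With `len = 4`, `trimIndexOld(-4, 4)` returns `-1`, which is out of bounds, while `trimIndexNew(-4, 4)` returns `3`. In-range indices pass through both versions unchanged.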
diff --git a/src/backend/opencl/kernel/lookup.hpp b/src/backend/opencl/kernel/lookup.hpp
index 756c0ea9d0..3410c65266 100644
--- a/src/backend/opencl/kernel/lookup.hpp
+++ b/src/backend/opencl/kernel/lookup.hpp
@@ -8,84 +8,54 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/lookup.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/lookup.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 32;
-static const int THREADS_Y = 8;
-
-template<typename in_t, typename idx_t, unsigned dim>
-void lookup(Param out, const Param in, const Param indices, int nDims)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  aiProgs;
-        static std::map<int, Kernel*> aiKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D in_t=" << dtype_traits<in_t>::getName()
-                            << " -D idx_t=" << dtype_traits<idx_t>::getName()
-                            << " -D DIM=" <<dim;
-
-                    if (std::is_same<in_t, double>::value ||
-                        std::is_same<in_t, cdouble>::value ||
-                        std::is_same<idx_t, double>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    Program prog;
-                    buildProgram(prog, lookup_cl, lookup_cl_len, options.str());
-                    aiProgs[device]   = new Program(prog);
-                    aiKernels[device] = new Kernel(*aiProgs[device], "lookupND");
-                });
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(out.info.dims[0], THREADS_X);
-        int blk_y = divup(out.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * out.info.dims[2] * THREADS_X,
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename in_t, typename idx_t>
+void lookup(Param out, const Param in, const Param indices,
+            const unsigned dim) {
+    constexpr int THREADS_X = 32;
+    constexpr int THREADS_Y = 8;
+
+    std::array<TemplateArg, 3> targs = {
+        TemplateTypename<in_t>(),
+        TemplateTypename<idx_t>(),
+        TemplateArg(dim),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(in_t, dtype_traits<in_t>::getName()),
+        DefineKeyValue(idx_t, dtype_traits<idx_t>::getName()),
+        DefineKeyValue(DIM, dim), getTypeBuildDefinition<in_t, idx_t>()};
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(out.info.dims[0], THREADS_X);
+    int blk_y = divup(out.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * out.info.dims[2] * THREADS_X,
                        blk_y * out.info.dims[3] * THREADS_Y);
 
-        auto arrIdxOp = make_kernel<Buffer, KParam,
-                                    Buffer, KParam,
-                                    Buffer, KParam,
-                                    int, int>(*aiKernels[device]);
-
-        arrIdxOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info, *indices.data, indices.info, blk_x, blk_y);
+    auto arrIdxOp =
+        common::getKernel("lookupND", {{lookup_cl_src}}, targs, options);
 
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
+    arrIdxOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *in.data, in.info, *indices.data, indices.info, blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
 
-}
-
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/lu_split.cl b/src/backend/opencl/kernel/lu_split.cl
index 856a9e2ad7..1b6986d4cf 100644
--- a/src/backend/opencl/kernel/lu_split.cl
+++ b/src/backend/opencl/kernel/lu_split.cl
@@ -7,12 +7,9 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void lu_split_kernel(__global T *lptr, KParam linfo,
-                     __global T *uptr, KParam uinfo,
-                     const __global T *iptr, KParam iinfo,
-                     const int groups_x, const int groups_y)
-{
+kernel void luSplit(global T *lptr, KParam linfo, global T *uptr, KParam uinfo,
+                    const global T *iptr, KParam iinfo, const int groups_x,
+                    const int groups_y) {
     const int oz = get_group_id(0) / groups_x;
     const int ow = get_group_id(1) / groups_y;
 
@@ -25,35 +22,29 @@ void lu_split_kernel(__global T *lptr, KParam linfo,
     const int incy = groups_y * get_local_size(1);
     const int incx = groups_x * get_local_size(0);
 
-    __global T *d_l = lptr;
-    __global T *d_u = uptr;
-    __global T *d_i = iptr;
+    global T *d_l = lptr;
+    global T *d_u = uptr;
+    global T *d_i = iptr;
 
-    if(oz < iinfo.dims[2] && ow < iinfo.dims[3]) {
-        d_i = d_i + oz * iinfo.strides[2]    + ow * iinfo.strides[3];
+    if (oz < iinfo.dims[2] && ow < iinfo.dims[3]) {
+        d_i = d_i + oz * iinfo.strides[2] + ow * iinfo.strides[3];
         d_l = d_l + oz * linfo.strides[2] + ow * linfo.strides[3];
         d_u = d_u + oz * uinfo.strides[2] + ow * uinfo.strides[3];
 
         for (int oy = yy; oy < iinfo.dims[1]; oy += incy) {
-            __global T *Yd_i = d_i + oy * iinfo.strides[1];
-            __global T *Yd_l = d_l +  oy * linfo.strides[1];
-            __global T *Yd_u = d_u +  oy * uinfo.strides[1];
+            global T *Yd_i = d_i + oy * iinfo.strides[1];
+            global T *Yd_l = d_l + oy * linfo.strides[1];
+            global T *Yd_u = d_u + oy * uinfo.strides[1];
             for (int ox = xx; ox < iinfo.dims[0]; ox += incx) {
-                if(ox > oy) {
-                    if(same_dims || oy < linfo.dims[1])
-                        Yd_l[ox] = Yd_i[ox];
-                    if(!same_dims || ox < uinfo.dims[0])
-                        Yd_u[ox] = ZERO;
+                if (ox > oy) {
+                    if (same_dims || oy < linfo.dims[1]) Yd_l[ox] = Yd_i[ox];
+                    if (!same_dims || ox < uinfo.dims[0]) Yd_u[ox] = (T)(ZERO);
                 } else if (oy > ox) {
-                    if(same_dims || oy < linfo.dims[1])
-                        Yd_l[ox] = ZERO;
-                    if(!same_dims || ox < uinfo.dims[0])
-                        Yd_u[ox] = Yd_i[ox];
-                } else if(ox == oy) {
-                    if(same_dims || oy < linfo.dims[1])
-                        Yd_l[ox] = ONE;
-                    if(!same_dims || ox < uinfo.dims[0])
-                        Yd_u[ox] = Yd_i[ox];
+                    if (same_dims || oy < linfo.dims[1]) Yd_l[ox] = (T)(ZERO);
+                    if (!same_dims || ox < uinfo.dims[0]) Yd_u[ox] = Yd_i[ox];
+                } else if (ox == oy) {
+                    if (same_dims || oy < linfo.dims[1]) Yd_l[ox] = (T)(ONE);
+                    if (!same_dims || ox < uinfo.dims[0]) Yd_u[ox] = Yd_i[ox];
                 }
             }
         }
diff --git a/src/backend/opencl/kernel/lu_split.hpp b/src/backend/opencl/kernel/lu_split.hpp
index 5cf210365c..019e02528b 100644
--- a/src/backend/opencl/kernel/lu_split.hpp
+++ b/src/backend/opencl/kernel/lu_split.hpp
@@ -8,110 +8,61 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/lu_split.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
+#include <kernel_headers/lu_split.hpp>
 #include <math.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using af::scalar_to_option;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-// Kernel Launch Config Values
-static const unsigned TX = 32;
-static const unsigned TY = 8;
-static const unsigned TILEX = 128;
-static const unsigned TILEY = 32;
-
-template<typename T, bool same_dims>
-void lu_split_launcher(Param lower, Param upper, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  splitProgs;
-        static std::map<int, Kernel*> splitKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D same_dims=" << same_dims
-                        << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")"
-                        << " -D ONE=(T)(" << scalar_to_option(scalar<T>(1)) << ")";
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, lu_split_cl, lu_split_cl_len, options.str());
-                splitProgs[device] = new Program(prog);
-
-                splitKernels[device] = new Kernel(*splitProgs[device], "lu_split_kernel");
-            });
-
-
-        NDRange local(TX, TY);
+#include <string>
+#include <vector>
 
-        int groups_x = divup(in.info.dims[0], TILEX);
-        int groups_y = divup(in.info.dims[1], TILEY);
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-        NDRange global(groups_x * local[0] * in.info.dims[2],
+template<typename T>
+void luSplitLauncher(Param lower, Param upper, const Param in, bool same_dims) {
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(same_dims),
+    };
+    std::array<std::string, 5> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(same_dims),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        DefineKeyValue(ONE, scalar_to_option(scalar<T>(1))),
+        getTypeBuildDefinition<T>()};
+
+    auto luSplit =
+        common::getKernel("luSplit", {{lu_split_cl_src}}, targs, options);
+
+    cl::NDRange local(TX, TY);
+
+    int groups_x = divup(in.info.dims[0], TILEX);
+    int groups_y = divup(in.info.dims[1], TILEY);
+
+    cl::NDRange global(groups_x * local[0] * in.info.dims[2],
                        groups_y * local[1] * in.info.dims[3]);
 
-        auto lu_split_op = make_kernel<Buffer, const KParam,
-                                       Buffer, const KParam,
-                                       const Buffer, const KParam,
-                                       const int, const int> (*splitKernels[device]);
-
-        lu_split_op(EnqueueArgs(getQueue(), global, local),
-                    *lower.data, lower.info,
-                    *upper.data, upper.info,
-                    *in.data, in.info,
-                    groups_x, groups_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
+    luSplit(cl::EnqueueArgs(getQueue(), global, local), *lower.data, lower.info,
+            *upper.data, upper.info, *in.data, in.info, groups_x, groups_y);
+    CL_DEBUG_FINISH(getQueue());
 }
 
 template<typename T>
-void lu_split(Param lower, Param upper, const Param in)
-{
-    bool same_dims =
-        (lower.info.dims[0] == in.info.dims[0]) &&
-        (lower.info.dims[1] == in.info.dims[1]);
-
-    if (same_dims) {
-        lu_split_launcher<T, true >(lower, upper, in);
-    } else {
-        lu_split_launcher<T, false>(lower, upper, in);
-    }
-}
-
-}
-
+void luSplit(Param lower, Param upper, const Param in) {
+    bool same_dims = (lower.info.dims[0] == in.info.dims[0]) &&
+                     (lower.info.dims[1] == in.info.dims[1]);
+    luSplitLauncher<T>(lower, upper, in, same_dims);
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/matchTemplate.cl b/src/backend/opencl/kernel/matchTemplate.cl
index fae3244beb..2a42a77619 100644
--- a/src/backend/opencl/kernel/matchTemplate.cl
+++ b/src/backend/opencl/kernel/matchTemplate.cl
@@ -7,56 +7,56 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-kernel
-void matchTemplate(global outType * out,
-                    KParam          oInfo,
-                    global const inType * srch,
-                    KParam          sInfo,
-                    global const inType * tmplt,
-                    KParam          tInfo,
-                    int        nBBS0,
-                    int        nBBS1)
-{
+kernel void matchTemplate(global outType* out, KParam oInfo,
+                          global const inType* srch, KParam sInfo,
+                          global const inType* tmplt, KParam tInfo, int nBBS0,
+                          int nBBS1) {
     unsigned b2 = get_group_id(0) / nBBS0;
     unsigned b3 = get_group_id(1) / nBBS1;
 
-    int gx = get_local_id(0) + (get_group_id(0) - b2*nBBS0) * get_local_size(0);
-    int gy = get_local_id(1) + (get_group_id(1) - b3*nBBS1)* get_local_size(1);
+    int gx =
+        get_local_id(0) + (get_group_id(0) - b2 * nBBS0) * get_local_size(0);
+    int gy =
+        get_local_id(1) + (get_group_id(1) - b3 * nBBS1) * get_local_size(1);
 
     if (gx < sInfo.dims[0] && gy < sInfo.dims[1]) {
-
         const int tDim0 = tInfo.dims[0];
         const int tDim1 = tInfo.dims[1];
         const int sDim0 = sInfo.dims[0];
         const int sDim1 = sInfo.dims[1];
-        int winNumElems = tDim0*tDim1;
+        int winNumElems = tDim0 * tDim1;
 
-        global const inType* tptr = tmplt;
+        global const inType* tptr = tmplt + tInfo.offset;
 
         outType tImgMean = (outType)0;
         if (NEEDMEAN) {
-            for(int tj=0; tj<tDim1; tj++) {
-                int tjStride = tj*tInfo.strides[1];
+            for (int tj = 0; tj < tDim1; tj++) {
+                int tjStride = tj * tInfo.strides[1];
 
-                for(int ti=0; ti<tDim0; ti++) {
-                    tImgMean += (outType)tptr[ tjStride + ti*tInfo.strides[0] ];
+                for (int ti = 0; ti < tDim0; ti++) {
+                    tImgMean += (outType)tptr[tjStride + ti * tInfo.strides[0]];
                 }
             }
             tImgMean /= winNumElems;
         }
 
-        global const inType* sptr  = srch + (b2 * sInfo.strides[2] + b3 * sInfo.strides[3] + sInfo.offset);
-        global outType* optr       = out  + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
+        global const inType* sptr =
+            srch +
+            (b2 * sInfo.strides[2] + b3 * sInfo.strides[3] + sInfo.offset);
+        global outType* optr =
+            out + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
 
         // mean for window
         // this variable will be used based on MATCH_T value
         outType wImgMean = (outType)0;
         if (NEEDMEAN) {
-            for(int tj=0,j=gy; tj<tDim1; tj++, j++) {
-                int jStride = j*sInfo.strides[1];
+            for (int tj = 0, j = gy; tj < tDim1; tj++, j++) {
+                int jStride = j * sInfo.strides[1];
 
-                for(int ti=0, i=gx; ti<tDim0; ti++, i++) {
-                    inType sVal = ((j<sDim1 && i<sDim0) ? sptr[jStride + i*sInfo.strides[0]] : (inType)0);
+                for (int ti = 0, i = gx; ti < tDim0; ti++, i++) {
+                    inType sVal = ((j < sDim1 && i < sDim0)
+                                       ? sptr[jStride + i * sInfo.strides[0]]
+                                       : (inType)0);
                     wImgMean += (outType)sVal;
                 }
             }
@@ -66,52 +66,55 @@ void matchTemplate(global outType * out,
         // run the window match metric
         outType disparity = (outType)0;
 
-        for(int tj=0,j=gy; tj<tDim1; tj++, j++) {
-
-            int jStride  = j*sInfo.strides[1];
-            int tjStride = tj*tInfo.strides[1];
-
-            for(int ti=0, i=gx; ti<tDim0; ti++, i++) {
+        for (int tj = 0, j = gy; tj < tDim1; tj++, j++) {
+            int jStride  = j * sInfo.strides[1];
+            int tjStride = tj * tInfo.strides[1];
 
-                inType sVal = ((j<sDim1 && i<sDim0) ? sptr[jStride + i*sInfo.strides[0]] : (inType)0);
-                inType tVal = tptr[ tjStride + ti*tInfo.strides[0] ];
+            for (int ti = 0, i = gx; ti < tDim0; ti++, i++) {
+                inType sVal = ((j < sDim1 && i < sDim0)
+                                   ? sptr[jStride + i * sInfo.strides[0]]
+                                   : (inType)0);
+                inType tVal = tptr[tjStride + ti * tInfo.strides[0]];
 
                 outType temp;
-                switch(MATCH_T) {
+                switch (MATCH_T) {
                     case AF_SAD:
-                        disparity += fabs((outType)sVal-(outType)tVal);
+                        disparity += fabs((outType)sVal - (outType)tVal);
                         break;
                     case AF_ZSAD:
                         disparity += fabs((outType)sVal - wImgMean -
-                                (outType)tVal + tImgMean);
+                                          (outType)tVal + tImgMean);
                         break;
                     case AF_LSAD:
-                        disparity += fabs((outType)sVal-(wImgMean/tImgMean)*tVal);
+                        disparity +=
+                            fabs((outType)sVal - (wImgMean / tImgMean) * tVal);
                         break;
                     case AF_SSD:
-                        disparity += ((outType)sVal-(outType)tVal)*((outType)sVal-(outType)tVal);
+                        disparity += ((outType)sVal - (outType)tVal) *
+                                     ((outType)sVal - (outType)tVal);
                         break;
                     case AF_ZSSD:
-                        temp = ((outType)sVal - wImgMean - (outType)tVal + tImgMean);
-                        disparity += temp*temp;
+                        temp = ((outType)sVal - wImgMean - (outType)tVal +
+                                tImgMean);
+                        disparity += temp * temp;
                         break;
                     case AF_LSSD:
-                        temp = ((outType)sVal-(wImgMean/tImgMean)*tVal);
-                        disparity += temp*temp;
+                        temp = ((outType)sVal - (wImgMean / tImgMean) * tVal);
+                        disparity += temp * temp;
                         break;
                     case AF_NCC:
-                        //TODO: furture implementation
+                        // TODO: future implementation
                         break;
                     case AF_ZNCC:
-                        //TODO: furture implementation
+                        // TODO: future implementation
                         break;
                     case AF_SHD:
-                        //TODO: furture implementation
+                        // TODO: future implementation
                         break;
                 }
             }
         }
 
-        optr[gy*oInfo.strides[1]+gx] = disparity;
+        optr[gy * oInfo.strides[1] + gx] = disparity;
     }
 }
diff --git a/src/backend/opencl/kernel/match_template.hpp b/src/backend/opencl/kernel/match_template.hpp
index bfbe4b0178..8f43c99174 100644
--- a/src/backend/opencl/kernel/match_template.hpp
+++ b/src/backend/opencl/kernel/match_template.hpp
@@ -8,90 +8,64 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/matchTemplate.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/matchTemplate.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename inType, typename outType, af_match_type mType, bool needMean>
-void matchTemplate(Param out, const Param srch, const Param tmplt)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  mtProgs;
-        static std::map<int, Kernel*> mtKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D inType="  << dtype_traits<inType>::getName()
-                        << " -D outType=" << dtype_traits<outType>::getName()
-                        << " -D MATCH_T=" << mType
-                        << " -D NEEDMEAN="<< needMean
-                        << " -D AF_SAD="  << AF_SAD
-                        << " -D AF_ZSAD=" << AF_ZSAD
-                        << " -D AF_LSAD=" << AF_LSAD
-                        << " -D AF_SSD="  << AF_SSD
-                        << " -D AF_ZSSD=" << AF_ZSSD
-                        << " -D AF_LSSD=" << AF_LSSD
-                        << " -D AF_NCC="  << AF_NCC
-                        << " -D AF_ZNCC=" << AF_ZNCC
-                        << " -D AF_SHD="  << AF_SHD;
-                if (std::is_same<outType, double>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-                Program prog;
-                buildProgram(prog, matchTemplate_cl, matchTemplate_cl_len, options.str());
-                mtProgs[device]   = new Program(prog);
-                mtKernels[device] = new Kernel(*mtProgs[device], "matchTemplate");
-            });
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(srch.info.dims[0], THREADS_X);
-        int blk_y = divup(srch.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * srch.info.dims[2] * THREADS_X, blk_y * srch.info.dims[3] * THREADS_Y);
-
-        auto matchImgOp = make_kernel<Buffer, KParam,
-                                       Buffer, KParam,
-                                       Buffer, KParam,
-                                       int, int> (*mtKernels[device]);
-
-        matchImgOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *srch.data, srch.info, *tmplt.data, tmplt.info, blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-}
-
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename inType, typename outType>
+void matchTemplate(Param out, const Param srch, const Param tmplt,
+                   const af_match_type mType, const bool needMean) {
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<inType>(),
+        TemplateTypename<outType>(),
+        TemplateArg(mType),
+        TemplateArg(needMean),
+    };
+    std::array<std::string, 14> options = {
+        DefineKeyValue(inType, dtype_traits<inType>::getName()),
+        DefineKeyValue(outType, dtype_traits<outType>::getName()),
+        DefineKeyValue(MATCH_T, static_cast<int>(mType)),
+        DefineKeyValue(NEEDMEAN, static_cast<int>(needMean)),
+        DefineKeyValue(AF_SAD, static_cast<int>(AF_SAD)),
+        DefineKeyValue(AF_ZSAD, static_cast<int>(AF_ZSAD)),
+        DefineKeyValue(AF_LSAD, static_cast<int>(AF_LSAD)),
+        DefineKeyValue(AF_SSD, static_cast<int>(AF_SSD)),
+        DefineKeyValue(AF_ZSSD, static_cast<int>(AF_ZSSD)),
+        DefineKeyValue(AF_LSSD, static_cast<int>(AF_LSSD)),
+        DefineKeyValue(AF_NCC, static_cast<int>(AF_NCC)),
+        DefineKeyValue(AF_ZNCC, static_cast<int>(AF_ZNCC)),
+        DefineKeyValue(AF_SHD, static_cast<int>(AF_SHD)),
+        getTypeBuildDefinition<outType>()};
+
+    auto matchImgOp = common::getKernel(
+        "matchTemplate", {{matchTemplate_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(srch.info.dims[0], THREADS_X);
+    int blk_y = divup(srch.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * srch.info.dims[2] * THREADS_X,
+                       blk_y * srch.info.dims[3] * THREADS_Y);
+
+    matchImgOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *srch.data, srch.info, *tmplt.data, tmplt.info, blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
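Reviewer note on the launch geometry used by matchTemplate above. This is a minimal host-only sketch, assuming `divup` is plain integer ceiling division (its definition is not part of this diff). `LaunchDims` and `launchDims` are hypothetical names used only for illustration; they are not ArrayFire API.

```cpp
// Hypothetical stand-in for the divup helper: integer ceiling division.
constexpr int divup(int a, int b) { return (a + b - 1) / b; }

struct LaunchDims {
    int blk_x, blk_y;        // work-groups needed to tile one 2D slice
    int global_x, global_y;  // total work-items along each axis
};

// dims[0] and dims[1] are the slice extents. The batch extents dims[2] and
// dims[3] are folded into the x and y axes of the global range, mirroring
// the arithmetic in the diff above.
inline LaunchDims launchDims(const int dims[4], int threads_x = 16,
                             int threads_y = 16) {
    LaunchDims d{};
    d.blk_x    = divup(dims[0], threads_x);
    d.blk_y    = divup(dims[1], threads_y);
    d.global_x = d.blk_x * dims[2] * threads_x;
    d.global_y = d.blk_y * dims[3] * threads_y;
    return d;
}
```

For dims {100, 50, 3, 2} this yields 7 x 4 work-groups per slice and a 336 x 128 global range. Passing blk_x and blk_y to the kernel is what lets it recover the batch index from the group id by division.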
diff --git a/src/backend/opencl/kernel/mean.hpp b/src/backend/opencl/kernel/mean.hpp
new file mode 100644
index 0000000000..bc80a23be9
--- /dev/null
+++ b/src/backend/opencl/kernel/mean.hpp
@@ -0,0 +1,469 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/mean_dim.hpp>
+#include <kernel_headers/mean_first.hpp>
+#include <kernel_headers/mean_ops.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, typename Tw>
+struct MeanOp {
+    T runningMean;
+    Tw runningCount;
+    MeanOp(T mean, Tw count) : runningMean(mean), runningCount(count) {}
+
+    void operator()(T newMean, Tw newCount) {
+        if ((newCount != 0) || (runningCount != 0)) {
+            Tw runningScale = runningCount;
+            Tw newScale     = newCount;
+            runningCount += newCount;
+            runningScale = runningScale / runningCount;
+            newScale     = newScale / (Tw)runningCount;
+            runningMean  = (runningScale * runningMean) + (newScale * newMean);
+        }
+    }
+};
+
+template<>
+struct MeanOp<cfloat, float> {
+    cfloat runningMean;
+    float runningCount;
+    MeanOp(cfloat mean, float count) : runningMean(mean), runningCount(count) {}
+
+    void operator()(cfloat newMean, float newCount) {
+        if ((newCount != 0) || (runningCount != 0)) {
+            float runningScale = runningCount;
+            float newScale     = newCount;
+            runningCount += newCount;
+            runningScale = runningScale / runningCount;
+            newScale     = newScale / (float)runningCount;
+            runningMean.s[0] =
+                (runningScale * runningMean.s[0]) + (newScale * newMean.s[0]);
+            runningMean.s[1] =
+                (runningScale * runningMean.s[1]) + (newScale * newMean.s[1]);
+        }
+    }
+};
+
+template<>
+struct MeanOp<cdouble, double> {
+    cdouble runningMean;
+    double runningCount;
+    MeanOp(cdouble mean, double count)
+        : runningMean(mean), runningCount(count) {}
+
+    void operator()(cdouble newMean, double newCount) {
+        if ((newCount != 0) || (runningCount != 0)) {
+            double runningScale = runningCount;
+            double newScale     = newCount;
+            runningCount += newCount;
+            runningScale = runningScale / runningCount;
+            newScale     = newScale / (double)runningCount;
+            runningMean.s[0] =
+                (runningScale * runningMean.s[0]) + (newScale * newMean.s[0]);
+            runningMean.s[1] =
+                (runningScale * runningMean.s[1]) + (newScale * newMean.s[1]);
+        }
+    }
+};
+
+template<typename Ti, typename Tw, typename To>
+void meanDimLauncher(Param out, Param owt, Param in, Param inWeight,
+                     const int dim, const int threads_y,
+                     const uint groups_all[4]) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    bool input_weight = ((inWeight.info.dims[0] * inWeight.info.dims[1] *
+                          inWeight.info.dims[2] * inWeight.info.dims[3]) != 0);
+
+    bool output_weight = ((owt.info.dims[0] * owt.info.dims[1] *
+                           owt.info.dims[2] * owt.info.dims[3]) != 0);
+
+    ToNumStr<To> toNumStr;
+    ToNumStr<Tw> twNumStr;
+    common::Transform<uint, Tw, af_add_t> transform_weight;
+
+    std::array<TemplateArg, 7> targs = {
+        TemplateTypename<Ti>(),     TemplateTypename<To>(),
+        TemplateTypename<Tw>(),     TemplateArg(dim),
+        TemplateArg(threads_y),     TemplateArg(input_weight),
+        TemplateArg(output_weight),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(Tw, dtype_traits<Tw>::getName()),
+        DefineKeyValue(kDim, dim),
+        DefineKeyValue(DIMY, threads_y),
+        DefineValue(THREADS_X),
+        DefineKeyValue(init_To, toNumStr(common::Binary<To, af_add_t>::init())),
+        DefineKeyValue(init_Tw, twNumStr(transform_weight(0))),
+        DefineKeyValue(one_Tw, twNumStr(transform_weight(1))),
+        getTypeBuildDefinition<Ti, To>()};
+    if (input_weight) { options.emplace_back(DefineKey(INPUT_WEIGHT)); }
+    if (output_weight) { options.emplace_back(DefineKey(OUTPUT_WEIGHT)); }
+
+    auto meanOp = common::getKernel(
+        "meanDim", {{mean_ops_cl_src, mean_dim_cl_src}}, targs, options);
+
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
+
+    if (input_weight && output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *owt.data, owt.info, *in.data, in.info, *inWeight.data,
+               inWeight.info, groups_all[0], groups_all[1], groups_all[dim]);
+    } else if (!input_weight && !output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *in.data, in.info, groups_all[0], groups_all[1],
+               groups_all[dim]);
+    } else if (input_weight && !output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *in.data, in.info, *inWeight.data, inWeight.info, groups_all[0],
+               groups_all[1], groups_all[dim]);
+    } else if (!input_weight && output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *owt.data, owt.info, *in.data, in.info, groups_all[0],
+               groups_all[1], groups_all[dim]);
+    }
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tw, typename To>
+void meanDim(Param out, Param in, Param inWeight, int dim) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    uint groups_all[] = {(uint)divup(in.info.dims[0], threads_x),
+                         (uint)in.info.dims[1], (uint)in.info.dims[2],
+                         (uint)in.info.dims[3]};
+
+    groups_all[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
+
+    if (groups_all[dim] > 1) {
+        dim4 d(4, out.info.dims);
+        d[dim]              = groups_all[dim];
+        Array<To> tmpOut    = createEmptyArray<To>(d);
+        Array<Tw> tmpWeight = createEmptyArray<Tw>(d);
+        meanDimLauncher<Ti, Tw, To>(tmpOut, tmpWeight, in, inWeight, dim,
+                                    threads_y, groups_all);
+
+        Param owt;
+        groups_all[dim] = 1;
+        meanDimLauncher<Ti, Tw, To>(out, owt, tmpOut, tmpWeight, dim, threads_y,
+                                    groups_all);
+    } else {
+        Param tmpWeight;
+        meanDimLauncher<Ti, Tw, To>(out, tmpWeight, in, inWeight, dim,
+                                    threads_y, groups_all);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void meanFirstLauncher(Param out, Param owt, Param in, Param inWeight,
+                       const int threads_x, const uint groups_x,
+                       const uint groups_y) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    bool input_weight = ((inWeight.info.dims[0] * inWeight.info.dims[1] *
+                          inWeight.info.dims[2] * inWeight.info.dims[3]) != 0);
+
+    bool output_weight = ((owt.info.dims[0] * owt.info.dims[1] *
+                           owt.info.dims[2] * owt.info.dims[3]) != 0);
+    ToNumStr<To> toNumStr;
+    ToNumStr<Tw> twNumStr;
+    common::Transform<uint, Tw, af_add_t> transform_weight;
+
+    std::array<TemplateArg, 6> targs = {
+        TemplateTypename<Ti>(),    TemplateTypename<To>(),
+        TemplateTypename<Tw>(),    TemplateArg(threads_x),
+        TemplateArg(input_weight), TemplateArg(output_weight),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(Tw, dtype_traits<Tw>::getName()),
+        DefineKeyValue(DIMX, threads_x),
+        DefineValue(THREADS_PER_GROUP),
+        DefineKeyValue(init_To, toNumStr(common::Binary<To, af_add_t>::init())),
+        DefineKeyValue(init_Tw, twNumStr(transform_weight(0))),
+        DefineKeyValue(one_Tw, twNumStr(transform_weight(1))),
+    };
+    options.emplace_back(getTypeBuildDefinition<Ti, To>());
+    if (input_weight) { options.emplace_back(DefineKey(INPUT_WEIGHT)); }
+    if (output_weight) { options.emplace_back(DefineKey(OUTPUT_WEIGHT)); }
+
+    auto meanOp = common::getKernel(
+        "meanFirst", {{mean_ops_cl_src, mean_first_cl_src}}, targs, options);
+
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * in.info.dims[2] * local[0],
+                   groups_y * in.info.dims[3] * local[1]);
+
+    uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
+
+    if (input_weight && output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *owt.data, owt.info, *in.data, in.info, *inWeight.data,
+               inWeight.info, groups_x, groups_y, repeat);
+    } else if (!input_weight && !output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *in.data, in.info, groups_x, groups_y, repeat);
+    } else if (input_weight && !output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *in.data, in.info, *inWeight.data, inWeight.info, groups_x,
+               groups_y, repeat);
+    } else if (!input_weight && output_weight) {
+        meanOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+               *owt.data, owt.info, *in.data, in.info, groups_x, groups_y,
+               repeat);
+    }
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tw, typename To>
+void meanFirst(Param out, Param in, Param inWeight) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
+
+    uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
+
+    Param tmpOut = out;
+    Param noWeight;
+    noWeight.info.offset = 0;
+    for (int k = 0; k < 4; ++k) {
+        noWeight.info.dims[k]    = 0;
+        noWeight.info.strides[k] = 0;
+    }
+    // The value does not matter since it is never used; it just needs to be
+    // valid.
+    noWeight.data = inWeight.data;
+
+    Param tmpWeight = noWeight;
+
+    if (groups_x > 1) {
+        tmpOut.data = bufferAlloc(groups_x * in.info.dims[1] * in.info.dims[2] *
+                                  in.info.dims[3] * sizeof(To));
+
+        tmpWeight.data =
+            bufferAlloc(groups_x * in.info.dims[1] * in.info.dims[2] *
+                        in.info.dims[3] * sizeof(Tw));
+
+        tmpOut.info.dims[0] = groups_x;
+        for (int k = 1; k < 4; k++) tmpOut.info.strides[k] *= groups_x;
+        tmpWeight.info = tmpOut.info;
+    }
+
+    meanFirstLauncher<Ti, Tw, To>(tmpOut, tmpWeight, in, inWeight, threads_x,
+                                  groups_x, groups_y);
+
+    if (groups_x > 1) {
+        // No Weight is needed when writing out the output.
+        meanFirstLauncher<Ti, Tw, To>(out, noWeight, tmpOut, tmpWeight,
+                                      threads_x, 1, groups_y);
+
+        bufferFree(tmpOut.data);
+        bufferFree(tmpWeight.data);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+void meanWeighted(Param out, Param in, Param inWeight, int dim) {
+    if (dim == 0)
+        return meanFirst<Ti, Tw, To>(out, in, inWeight);
+    else
+        return meanDim<Ti, Tw, To>(out, in, inWeight, dim);
+}
+
+template<typename Ti, typename Tw, typename To>
+void mean(Param out, Param in, int dim) {
+    Param noWeight;
+    meanWeighted<Ti, Tw, To>(out, in, noWeight, dim);
+}
+
+template<typename T, typename Tw>
+T meanAllWeighted(Param in, Param inWeight) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+
+    // FIXME: Use better heuristics to get to the optimum number
+    if (in_elements > 4096) {
+        bool in_is_linear = (in.info.strides[0] == 1);
+        bool wt_is_linear = (inWeight.info.strides[0] == 1);
+        for (int k = 1; k < 4; k++) {
+            in_is_linear &= (in.info.strides[k] ==
+                             (in.info.strides[k - 1] * in.info.dims[k - 1]));
+            wt_is_linear &=
+                (inWeight.info.strides[k] ==
+                 (inWeight.info.strides[k - 1] * inWeight.info.dims[k - 1]));
+        }
+
+        if (in_is_linear && wt_is_linear) {
+            in.info.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.info.dims[k]    = 1;
+                in.info.strides[k] = in_elements;
+            }
+            inWeight.info = in.info;
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+        uint threads_y = THREADS_PER_GROUP / threads_x;
+
+        uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+        uint groups_y = divup(in.info.dims[1], threads_y);
+
+        Array<T> tmpOut     = createEmptyArray<T>(groups_x);
+        Array<Tw> tmpWeight = createEmptyArray<Tw>(groups_x);
+
+        meanFirstLauncher<T, Tw, T>(tmpOut, tmpWeight, in, inWeight, threads_x,
+                                    groups_x, groups_y);
+
+        std::vector<T> h_ptr(tmpOut.elements());
+        std::vector<Tw> h_wptr(tmpWeight.elements());
+
+        getQueue().enqueueReadBuffer(*tmpOut.get(), CL_TRUE, 0,
+                                     sizeof(T) * tmpOut.elements(),
+                                     h_ptr.data());
+        getQueue().enqueueReadBuffer(*tmpWeight.get(), CL_TRUE, 0,
+                                     sizeof(Tw) * tmpWeight.elements(),
+                                     h_wptr.data());
+
+        compute_t<T> initial = static_cast<compute_t<T>>(h_ptr[0]);
+        compute_t<Tw> w      = static_cast<compute_t<Tw>>(h_wptr[0]);
+        MeanOp<compute_t<T>, compute_t<Tw>> Op(initial, w);
+        for (int i = 1; i < (int)tmpOut.elements(); i++) {
+            Op(compute_t<T>(h_ptr[i]), compute_t<Tw>(h_wptr[i]));
+        }
+
+        return static_cast<T>(Op.runningMean);
+    } else {
+        std::vector<T> h_ptr(in_elements);
+        std::vector<Tw> h_wptr(in_elements);
+
+        getQueue().enqueueReadBuffer(*in.data, CL_TRUE,
+                                     sizeof(T) * in.info.offset,
+                                     sizeof(T) * in_elements, h_ptr.data());
+        getQueue().enqueueReadBuffer(*inWeight.data, CL_TRUE,
+                                     sizeof(Tw) * inWeight.info.offset,
+                                     sizeof(Tw) * in_elements, h_wptr.data());
+
+        compute_t<T> initial = static_cast<compute_t<T>>(h_ptr[0]);
+        compute_t<Tw> w      = static_cast<compute_t<Tw>>(h_wptr[0]);
+        MeanOp<compute_t<T>, compute_t<Tw>> Op(initial, w);
+        for (int i = 1; i < (int)in_elements; i++) {
+            Op(compute_t<T>(h_ptr[i]), compute_t<Tw>(h_wptr[i]));
+        }
+
+        return static_cast<T>(Op.runningMean);
+    }
+}
+
+template<typename Ti, typename Tw, typename To>
+To meanAll(Param in) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
+
+    // FIXME: Use better heuristics to get to the optimum number
+    if (in_elements > 4096 || !is_linear) {
+        if (is_linear) {
+            in.info.dims[0] = in_elements;
+            for (int k = 1; k < 4; k++) {
+                in.info.dims[k]    = 1;
+                in.info.strides[k] = in_elements;
+            }
+        }
+
+        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+        threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+        uint threads_y = THREADS_PER_GROUP / threads_x;
+
+        uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+        uint groups_y = divup(in.info.dims[1], threads_y);
+
+        dim4 outDims(groups_x, in.info.dims[1], in.info.dims[2],
+                     in.info.dims[3]);
+        Array<To> tmpOut = createEmptyArray<To>(outDims);
+        Array<Tw> tmpCt  = createEmptyArray<Tw>(outDims);
+
+        Param iWt;
+        meanFirstLauncher<Ti, Tw, To>(tmpOut, tmpCt, in, iWt, threads_x,
+                                      groups_x, groups_y);
+
+        std::vector<To> h_ptr(tmpOut.elements());
+        std::vector<Tw> h_cptr(tmpOut.elements());
+
+        getQueue().enqueueReadBuffer(*tmpOut.get(), CL_TRUE, 0,
+                                     sizeof(To) * tmpOut.elements(),
+                                     h_ptr.data());
+        getQueue().enqueueReadBuffer(*tmpCt.get(), CL_TRUE, 0,
+                                     sizeof(Tw) * tmpCt.elements(),
+                                     h_cptr.data());
+
+        compute_t<To> initial = static_cast<compute_t<To>>(h_ptr[0]);
+        compute_t<Tw> w       = static_cast<compute_t<Tw>>(h_cptr[0]);
+        MeanOp<compute_t<To>, compute_t<Tw>> Op(initial, w);
+        for (int i = 1; i < (int)h_ptr.size(); i++) {
+            Op(compute_t<To>(h_ptr[i]), compute_t<Tw>(h_cptr[i]));
+        }
+
+        return static_cast<To>(Op.runningMean);
+    } else {
+        std::vector<Ti> h_ptr(in_elements);
+
+        getQueue().enqueueReadBuffer(*in.data, CL_TRUE,
+                                     sizeof(Ti) * in.info.offset,
+                                     sizeof(Ti) * in_elements, h_ptr.data());
+
+        // TODO : MeanOp with (Tw)1
+        common::Transform<Ti, compute_t<To>, af_add_t> transform;
+        common::Transform<uint, compute_t<Tw>, af_add_t> transform_weight;
+        MeanOp<compute_t<To>, compute_t<Tw>> Op(transform(h_ptr[0]),
+                                                transform_weight(1));
+        for (int i = 1; i < (int)in_elements; i++) {
+            Op(transform(h_ptr[i]), transform_weight(1));
+        }
+
+        return static_cast<To>(Op.runningMean);
+    }
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
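Reviewer note on how meanFirst above decides between one and two kernel passes. This is a rough host-only sketch under stated assumptions: `kThreadsPerGroup = 256` and `kRepeat = 32` are assumed values standing in for THREADS_PER_GROUP and REPEAT (those come from kernel/config.hpp, which is not part of this section), and `nextpow2`/`divup` are hypothetical stand-ins for the helpers of the same name. `FirstDimConfig` and `firstDimConfig` are illustration-only names.

```cpp
#include <algorithm>

constexpr unsigned kThreadsPerGroup = 256;  // assumption, see note above
constexpr unsigned kRepeat          = 32;   // assumption, see note above

// Stand-in helpers: smallest power of two >= x, and ceiling division.
inline unsigned nextpow2(unsigned x) {
    unsigned p = 1;
    while (p < x) p <<= 1;
    return p;
}
inline unsigned divup(unsigned a, unsigned b) { return (a + b - 1) / b; }

struct FirstDimConfig {
    unsigned threads_x, threads_y;  // work-group shape
    unsigned groups_x, groups_y;    // work-groups per 2D slice
    bool needs_second_pass;         // true when partial results must be merged
};

// Mirrors the arithmetic at the top of meanFirst for a dim0 x dim1 slice.
inline FirstDimConfig firstDimConfig(unsigned dim0, unsigned dim1) {
    FirstDimConfig c{};
    c.threads_x = std::min(nextpow2(std::max(32u, dim0)), kThreadsPerGroup);
    c.threads_y = kThreadsPerGroup / c.threads_x;
    c.groups_x  = divup(dim0, c.threads_x * kRepeat);
    c.groups_y  = divup(dim1, c.threads_y);
    // More than one group along x means each group emits a partial mean and
    // weight, which a second launch then reduces to a single value.
    c.needs_second_pass = c.groups_x > 1;
    return c;
}
```

Under these assumed constants, a 100 x 10 slice fits in one pass (threads 128 x 2, groups 1 x 5), while a 10000 x 1 slice needs two x-groups and therefore the second launch.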
diff --git a/src/backend/opencl/kernel/mean_dim.cl b/src/backend/opencl/kernel/mean_dim.cl
new file mode 100644
index 0000000000..9448486391
--- /dev/null
+++ b/src/backend/opencl/kernel/mean_dim.cl
@@ -0,0 +1,136 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void meanDim(global To *oData, KParam oInfo,
+#ifdef OUTPUT_WEIGHT
+                    global Tw *owData, KParam owInfo,
+#endif
+                    const global Ti *iData, KParam iInfo,
+#ifdef INPUT_WEIGHT
+                    const global Tw *iwData, KParam iwInfo,
+#endif
+                    uint groups_x, uint groups_y, uint group_dim) {
+    const uint lidx = get_local_id(0);
+    const uint lidy = get_local_id(1);
+    const uint lid  = lidy * THREADS_X + lidx;
+
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) + lidx;
+    const uint yid       = groupId_y;
+
+    uint ids[4] = {xid, yid, zid, wid};
+
+    // There is only one element per group for out
+    // There are get_local_size(1) elements per group for in
+    // Hence increment ids[kDim] just after offsetting out and before offsetting
+    // in
+    oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+             ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
+
+#ifdef OUTPUT_WEIGHT
+    owData += ids[3] * owInfo.strides[3] + ids[2] * owInfo.strides[2] +
+              ids[1] * owInfo.strides[1] + ids[0] + owInfo.offset;
+#endif
+    const uint id_dim_out = ids[kDim];
+
+    ids[kDim] = ids[kDim] * get_local_size(1) + lidy;
+
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+
+#ifdef INPUT_WEIGHT
+    iwData += ids[3] * iwInfo.strides[3] + ids[2] * iwInfo.strides[2] +
+              ids[1] * iwInfo.strides[1] + ids[0] + iwInfo.offset;
+#endif
+
+    const uint id_dim_in   = ids[kDim];
+    const uint istride_dim = iInfo.strides[kDim];
+
+    bool is_valid = (ids[0] < iInfo.dims[0]) && (ids[1] < iInfo.dims[1]) &&
+                    (ids[2] < iInfo.dims[2]) && (ids[3] < iInfo.dims[3]);
+
+    local To s_val[THREADS_X * DIMY];
+    local Tw s_wt[THREADS_X * DIMY];
+
+    To out_val = init_To;
+    Tw out_wt  = init_Tw;
+
+    if (is_valid && id_dim_in < iInfo.dims[kDim]) {
+        out_val = transform(*iData);
+#ifdef INPUT_WEIGHT
+        out_wt = *iwData;
+#else
+        out_wt = one_Tw;
+#endif
+    }
+
+    const uint id_dim_in_start = id_dim_in + group_dim * get_local_size(1);
+
+#ifdef INPUT_WEIGHT
+    for (int id = id_dim_in_start; is_valid && (id < iInfo.dims[kDim]);
+         id += group_dim * get_local_size(1)) {
+        iData  = iData + group_dim * get_local_size(1) * istride_dim;
+        iwData = iwData + group_dim * get_local_size(1) * iwInfo.strides[kDim];
+        binOp(&out_val, &out_wt, transform(*iData), *iwData);
+    }
+#else
+    for (int id = id_dim_in_start; is_valid && (id < iInfo.dims[kDim]);
+         id += group_dim * get_local_size(1)) {
+        iData = iData + group_dim * get_local_size(1) * istride_dim;
+        binOp(&out_val, &out_wt, transform(*iData), one_Tw);
+    }
+#endif
+
+    s_val[lid] = out_val;
+    s_wt[lid]  = out_wt;
+
+    local To *s_vptr = s_val + lid;
+    local Tw *s_wptr = s_wt + lid;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (DIMY == 8) {
+        if (lidy < 4) {
+            binOp(&out_val, &out_wt, s_vptr[THREADS_X * 4],
+                  s_wptr[THREADS_X * 4]);
+            *s_vptr = out_val;
+            *s_wptr = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (DIMY >= 4) {
+        if (lidy < 2) {
+            binOp(&out_val, &out_wt, s_vptr[THREADS_X * 2],
+                  s_wptr[THREADS_X * 2]);
+            *s_vptr = out_val;
+            *s_wptr = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (DIMY >= 2) {
+        if (lidy < 1) {
+            binOp(&out_val, &out_wt, s_vptr[THREADS_X * 1],
+                  s_wptr[THREADS_X * 1]);
+            *s_vptr = out_val;
+            *s_wptr = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (lidy == 0 && is_valid && (id_dim_out < oInfo.dims[kDim])) {
+        *oData = *s_vptr;
+#ifdef OUTPUT_WEIGHT
+        *owData = *s_wptr;
+#endif
+    }
+}
diff --git a/src/backend/opencl/kernel/mean_first.cl b/src/backend/opencl/kernel/mean_first.cl
new file mode 100644
index 0000000000..14b19827c9
--- /dev/null
+++ b/src/backend/opencl/kernel/mean_first.cl
@@ -0,0 +1,156 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void meanFirst(global To *oData, KParam oInfo,
+#ifdef OUTPUT_WEIGHT
+                      global Tw *owData, KParam owInfo,
+#endif
+                      const global Ti *iData, KParam iInfo,
+#ifdef INPUT_WEIGHT
+                      const global Tw *iwData, KParam iwInfo,
+#endif
+                      uint groups_x, uint groups_y, uint repeat) {
+    const uint lidx = get_local_id(0);
+    const uint lidy = get_local_id(1);
+    const uint lid  = lidy * get_local_size(0) + lidx;
+
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) * repeat + lidx;
+    const uint yid       = groupId_y * get_local_size(1) + lidy;
+
+    iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
+             yid * iInfo.strides[1] + iInfo.offset;
+
+#ifdef INPUT_WEIGHT
+    iwData += wid * iwInfo.strides[3] + zid * iwInfo.strides[2] +
+              yid * iwInfo.strides[1] + iwInfo.offset;
+#endif
+
+    oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
+             yid * oInfo.strides[1] + oInfo.offset;
+
+#ifdef OUTPUT_WEIGHT
+    owData += wid * owInfo.strides[3] + zid * owInfo.strides[2] +
+              yid * owInfo.strides[1] + owInfo.offset;
+#endif
+
+    bool cond =
+        (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
+
+    local To s_val[THREADS_PER_GROUP];
+    local Tw s_wt[THREADS_PER_GROUP];
+
+    int last   = (xid + repeat * DIMX);
+    int lim    = last > iInfo.dims[0] ? iInfo.dims[0] : last;
+    To out_val = init_To;
+    Tw out_wt  = init_Tw;
+
+    if (cond && xid < lim) {
+        out_val = transform(iData[xid]);
+#ifdef INPUT_WEIGHT
+        out_wt = iwData[xid];
+#else
+        out_wt = one_Tw;
+#endif
+    }
+
+#ifdef INPUT_WEIGHT
+    for (int id = xid + DIMX; cond && id < lim; id += DIMX) {
+        binOp(&out_val, &out_wt, transform(iData[id]), iwData[id]);
+    }
+#else
+    for (int id = xid + DIMX; cond && id < lim; id += DIMX) {
+        binOp(&out_val, &out_wt, transform(iData[id]), one_Tw);
+    }
+#endif
+
+    s_val[lid] = out_val;
+    s_wt[lid]  = out_wt;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    local To *s_vptr = s_val + lidy * DIMX;
+    local Tw *s_wptr = s_wt + lidy * DIMX;
+
+    if (DIMX == 256) {
+        if (lidx < 128) {
+            binOp(&out_val, &out_wt, s_vptr[lidx + 128], s_wptr[lidx + 128]);
+            s_vptr[lidx] = out_val;
+            s_wptr[lidx] = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (DIMX >= 128) {
+        if (lidx < 64) {
+            binOp(&out_val, &out_wt, s_vptr[lidx + 64], s_wptr[lidx + 64]);
+            s_vptr[lidx] = out_val;
+            s_wptr[lidx] = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (DIMX >= 64) {
+        if (lidx < 32) {
+            binOp(&out_val, &out_wt, s_vptr[lidx + 32], s_wptr[lidx + 32]);
+            s_vptr[lidx] = out_val;
+            s_wptr[lidx] = out_wt;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (lidx < 16) {
+        binOp(&out_val, &out_wt, s_vptr[lidx + 16], s_wptr[lidx + 16]);
+        s_vptr[lidx] = out_val;
+        s_wptr[lidx] = out_wt;
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lidx < 8) {
+        binOp(&out_val, &out_wt, s_vptr[lidx + 8], s_wptr[lidx + 8]);
+        s_vptr[lidx] = out_val;
+        s_wptr[lidx] = out_wt;
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lidx < 4) {
+        binOp(&out_val, &out_wt, s_vptr[lidx + 4], s_wptr[lidx + 4]);
+        s_vptr[lidx] = out_val;
+        s_wptr[lidx] = out_wt;
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lidx < 2) {
+        binOp(&out_val, &out_wt, s_vptr[lidx + 2], s_wptr[lidx + 2]);
+        s_vptr[lidx] = out_val;
+        s_wptr[lidx] = out_wt;
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lidx < 1) {
+        binOp(&out_val, &out_wt, s_vptr[lidx + 1], s_wptr[lidx + 1]);
+        s_vptr[lidx] = out_val;
+        s_wptr[lidx] = out_wt;
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (cond && lidx == 0) {
+        oData[groupId_x] = s_vptr[0];
+#ifdef OUTPUT_WEIGHT
+        owData[groupId_x] = s_wptr[0];
+#endif
+    }
+}
diff --git a/src/backend/opencl/kernel/mean_ops.cl b/src/backend/opencl/kernel/mean_ops.cl
new file mode 100644
index 0000000000..b4f104f6ba
--- /dev/null
+++ b/src/backend/opencl/kernel/mean_ops.cl
@@ -0,0 +1,21 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+To transform(Ti in) { return (To)(in); }
+
+void binOp(To *lhs, Tw *l_wt, To rhs, Tw r_wt) {
+    if (((*l_wt) != 0) || (r_wt != 0)) {
+        Tw l_scale = (*l_wt);
+        (*l_wt) += r_wt;
+        l_scale = l_scale / (*l_wt);
+
+        Tw r_scale = r_wt / (*l_wt);
+        (*lhs)     = (l_scale * (*lhs)) + (r_scale * rhs);
+    }
+}
diff --git a/src/backend/opencl/kernel/meanshift.cl b/src/backend/opencl/kernel/meanshift.cl
index ada45708c7..e80da6985a 100644
--- a/src/backend/opencl/kernel/meanshift.cl
+++ b/src/backend/opencl/kernel/meanshift.cl
@@ -7,159 +7,121 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-int lIdx(int x, int y,
-        int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
-
-void load2LocalMem(__local T *  shrd,
-        __global const T *      in, int lx, int ly,
-        int shrdStride, int schStride, int channels,
-        int dim0, int dim1, int gx, int gy,
-        int ichStride, int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-#pragma unroll
-    for(int ch=0; ch<channels; ++ch)
-        shrd[ lIdx(lx, ly, shrdStride, 1)+ch*schStride] = in[ lIdx(gx_, gy_, inStride1, inStride0)+ch*ichStride];
-}
-
-__kernel
-void meanshift(__global T *       d_dst,
-               KParam             oInfo,
-               __global const T * d_src,
-               KParam             iInfo,
-               __local T *        localMem,
-               int channels, float space_, int radius,
-               float cvar, unsigned iter, int nBBS0, int nBBS1)
-{
-    // calculate necessary offset and window parameters
-    const int padding     = 2*radius + 1;
-    const int wind_len    = padding - 1;
-    const int shrdLen     = get_local_size(0) + padding;
-    const int schStride   = shrdLen*(get_local_size(1) + padding);
-    // the variable ichStride will only effect when we have >1
-    // channels. in the other cases, the expression in question
-    // will not use the variable
-    const int ichStride   = iInfo.strides[2];
-
-    // gfor batch offsets
+kernel void meanshift(global T* d_dst, KParam oInfo,
+                        global const T* d_src, KParam iInfo, int radius,
+                        float cvar, unsigned numIters, int nBBS0, int nBBS1) {
     unsigned b2 = get_group_id(0) / nBBS0;
     unsigned b3 = get_group_id(1) / nBBS1;
-    __global const T* iptr = d_src + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
-    __global T*       optr = d_dst + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
-
-    const int lx = get_local_id(0);
-    const int ly = get_local_id(1);
-
-    const int gx = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    const int gy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
-
-    int gx2 = gx + get_local_size(0);
-    int gy2 = gy + get_local_size(1);
-    int lx2 = lx + get_local_size(0);
-    int ly2 = ly + get_local_size(1);
-    int i   = lx + radius;
-    int j   = ly + radius;
-
-    // pull image to local memory
-    load2LocalMem(localMem, iptr, lx, ly, shrdLen, schStride, channels,
-            iInfo.dims[0], iInfo.dims[1], gx-radius,
-            gy-radius, ichStride, iInfo.strides[1], iInfo.strides[0]);
-    if (lx<wind_len) {
-        load2LocalMem(localMem, iptr, lx2, ly, shrdLen, schStride, channels,
-                iInfo.dims[0], iInfo.dims[1], gx2-radius,
-                gy-radius, ichStride, iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<wind_len) {
-        load2LocalMem(localMem, iptr, lx, ly2, shrdLen, schStride, channels,
-                iInfo.dims[0], iInfo.dims[1], gx-radius,
-                gy2-radius, ichStride, iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<wind_len && ly<wind_len) {
-        load2LocalMem(localMem, iptr, lx2, ly2, shrdLen, schStride, channels,
-                iInfo.dims[0], iInfo.dims[1], gx2-radius,
-                gy2-radius, ichStride, iInfo.strides[1], iInfo.strides[0]);
-    }
-    barrier(CLK_LOCAL_MEM_FENCE);
+    const int gx =
+        get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + get_local_id(0);
+    const int gy =
+        get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + get_local_id(1);
+
+    if (gx < iInfo.dims[0] && gy < iInfo.dims[1]) {
+        global const T* iptr = d_src + (b2 * iInfo.strides[2] +
+                                          b3 * iInfo.strides[3] + iInfo.offset);
+        global T* optr =
+            d_dst + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
 
-    if (gx<iInfo.dims[0] && gy<iInfo.dims[1])
-    {
-        float means[MAX_CHANNELS];
-        float centers[MAX_CHANNELS];
-        float tmpclrs[MAX_CHANNELS];
+        int meanPosI = gx;
+        int meanPosJ = gy;
+
+        T currentCenterColors[MAX_CHANNELS];
+        T tempColors[MAX_CHANNELS];
+
+        AccType currentMeanColors[MAX_CHANNELS];
 
-        // clear means and centers for this pixel
 #pragma unroll
-        for(int ch=0; ch<channels; ++ch) {
-            means[ch] = 0.0f;
-            centers[ch] = localMem[lIdx(i, j, shrdLen, 1)+ch*schStride];
-        }
+        for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+            currentCenterColors[ch] =
+                iptr[gx * iInfo.strides[0] + gy * iInfo.strides[1] +
+                     ch * iInfo.strides[2]];
+
+        const int dim0LenLmt = iInfo.dims[0] - 1;
+        const int dim1LenLmt = iInfo.dims[1] - 1;
 
         // scope of meanshift iterationd begin
-        for(uint it=0; it<iter; ++it) {
+        for (uint it = 0; it < numIters; ++it) {
+            int oldMeanPosJ = meanPosJ;
+            int oldMeanPosI = meanPosI;
+            unsigned count  = 0;
 
-            int count   = 0;
             int shift_x = 0;
             int shift_y = 0;
 
-            for(int wj=-radius; wj<=radius; ++wj) {
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch) currentMeanColors[ch] = 0;
+
+            for (int wj = -radius; wj <= radius; ++wj) {
                 int hit_count = 0;
+                int tj        = meanPosJ + wj;
+
+                if (tj < 0 || tj > dim1LenLmt) continue;
 
-                for(int wi=-radius; wi<=radius; ++wi) {
+                for (int wi = -radius; wi <= radius; ++wi) {
+                    int ti = meanPosI + wi;
 
-                    int tj = j + wj;
-                    int ti = i + wi;
+                    if (ti < 0 || ti > dim0LenLmt) continue;
 
-                    // proceed
-                    float norm = 0.0f;
+                    AccType norm = 0;
 #pragma unroll
-                    for(int ch=0; ch<channels; ++ch) {
-                        tmpclrs[ch] = localMem[lIdx(ti, tj, shrdLen, 1)+ch*schStride];
-                        norm += (centers[ch]-tmpclrs[ch]) * (centers[ch]-tmpclrs[ch]);
+                    for (int ch = 0; ch < MAX_CHANNELS; ++ch) {
+                        unsigned idx = ti * iInfo.strides[0] +
+                                       tj * iInfo.strides[1] +
+                                       ch * iInfo.strides[2];
+                        tempColors[ch] = iptr[idx];
+                        AccType diff   = (AccType)currentCenterColors[ch] -
+                                       (AccType)tempColors[ch];
+                        norm += (diff * diff);
                     }
 
-                    if (norm<= cvar) {
+                    if (norm <= cvar) {
 #pragma unroll
-                        for(int ch=0; ch<channels; ++ch)
-                            means[ch] += tmpclrs[ch];
+                        for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                            currentMeanColors[ch] += (AccType)tempColors[ch];
 
-                        shift_x += wi;
+                        shift_x += ti;
                         ++hit_count;
                     }
                 }
-                count+= hit_count;
-                shift_y += wj*hit_count;
+                count += hit_count;
+                shift_y += tj * hit_count;
             }
 
-            if (count==0) { break; }
+            if (count == 0) break;
+
+            const AccType fcount = 1 / (AccType)count;
+
+            meanPosI = convert_int_rtz(shift_x * fcount);
+            meanPosJ = convert_int_rtz(shift_y * fcount);
 
-            const float fcount = 1.f/count;
-            const int mean_x = (int)(shift_x*fcount+0.5f);
-            const int mean_y = (int)(shift_y*fcount+0.5f);
 #pragma unroll
-            for(int ch=0; ch<channels; ++ch)
-                means[ch] *= fcount;
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                currentMeanColors[ch] =
+                    convert_int_rtz(currentMeanColors[ch] * fcount);
 
-            float norm = 0.f;
+            AccType norm = 0;
 #pragma unroll
-            for(int ch=0; ch<channels; ++ch)
-                norm += ((means[ch]-centers[ch])*(means[ch]-centers[ch]));
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch) {
+                AccType diff =
+                    (AccType)currentCenterColors[ch] - currentMeanColors[ch];
+                norm += (diff * diff);
+            }
 
-            bool stop = ((abs(shift_y-mean_y)+abs(shift_x-mean_x)) + norm) <= 1;
-            shift_x = mean_x;
-            shift_y = mean_y;
+            bool stop =
+                (meanPosJ == oldMeanPosJ && meanPosI == oldMeanPosI) ||
+                ((abs(oldMeanPosJ - meanPosJ) + abs(oldMeanPosI - meanPosI)) +
+                 norm) <= 1;
 
 #pragma unroll
-            for(int ch=0; ch<channels; ++ch)
-                centers[ch] = means[ch];
-            if (stop) { break; }
-        } // scope of meanshift iterations end
+            for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+                currentCenterColors[ch] = (T)(currentMeanColors[ch]);
+
+            if (stop) break;
+        }  // scope of meanshift iterations end
 
 #pragma unroll
-        for(int ch=0; ch<channels; ++ch)
-            optr[lIdx(gx, gy, oInfo.strides[1], oInfo.strides[0])+ch*ichStride] = centers[ch];
+        for (int ch = 0; ch < MAX_CHANNELS; ++ch)
+            optr[gx * oInfo.strides[0] + gy * oInfo.strides[1] +
+                 ch * oInfo.strides[2]] = currentCenterColors[ch];
     }
 }
diff --git a/src/backend/opencl/kernel/meanshift.hpp b/src/backend/opencl/kernel/meanshift.hpp
index c3ea8bc4b4..752e507262 100644
--- a/src/backend/opencl/kernel/meanshift.hpp
+++ b/src/backend/opencl/kernel/meanshift.hpp
@@ -8,96 +8,63 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/meanshift.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <algorithm>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/meanshift.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename T, bool is_color>
-void meanshift(Param out, const Param in, float s_sigma, float c_sigma, uint iter)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> msProgs;
-        static std::map<int, Kernel*> msKernels;
-
-        int device = getActiveDeviceId();
+#include <algorithm>
+#include <string>
+#include <vector>
 
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D MAX_CHANNELS=" << (is_color ? 3 : 1);
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, meanshift_cl, meanshift_cl_len, options.str());
-                    msProgs[device]   = new Program(prog);
-                    msKernels[device] = new Kernel(*msProgs[device], "meanshift");
-                });
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-        auto meanshiftOp = make_kernel<Buffer, KParam,
-                                       Buffer, KParam,
-                                       LocalSpaceArg,
-                                       int, float,
-                                       int, float,
-                                       unsigned, int, int
-                                      >(*msKernels[device]);
+template<typename T>
+void meanshift(Param out, const Param in, const float spatialSigma,
+               const float chromaticSigma, const uint numIters,
+               const bool is_color) {
+    using AccType = typename std::conditional<std::is_same<T, double>::value,
+                                              double, float>::type;
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
 
-        NDRange local(THREADS_X, THREADS_Y);
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_color),
+    };
+    std::array<std::string, 4> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(AccType, dtype_traits<AccType>::getName()),
+        DefineKeyValue(MAX_CHANNELS, (is_color ? 3 : 1)),
+        getTypeBuildDefinition<T>()};
 
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
+    auto meanshiftOp =
+        common::getKernel("meanshift", {{meanshift_cl_src}}, targs, options);
 
-        const int bCount   = (is_color ? 1 : in.info.dims[2]);
-        const int channels = (is_color ? in.info.dims[2] : 1);
+    cl::NDRange local(THREADS_X, THREADS_Y);
 
-        NDRange global(bCount*blk_x*THREADS_X, in.info.dims[3]*blk_y*THREADS_Y);
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
 
-        // clamp spatical and chromatic sigma's
-        float space_ = std::min(11.5f, s_sigma);
-        int radius   = std::max((int)(space_ * 1.5f), 1);
-        int padding  = 2*radius+1;
-        const float cvar = c_sigma*c_sigma;
-        size_t loc_size  = channels*(local[0]+padding)*(local[1]+padding)*sizeof(T);
+    const int bCount = (is_color ? 1 : in.info.dims[2]);
 
-        meanshiftOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info,
-                    cl::Local(loc_size), channels,
-                    space_, radius, cvar, iter, blk_x, blk_y);
+    cl::NDRange global(bCount * blk_x * THREADS_X,
+                       in.info.dims[3] * blk_y * THREADS_Y);
 
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+    // window radius from the spatial sigma, clamped to at least 1
+    int radius = std::max((int)(spatialSigma * 1.5f), 1);
 
-}
+    const float cvar = chromaticSigma * chromaticSigma;
 
+    meanshiftOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *in.data, in.info, radius, cvar, numIters, blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/medfilt.cl b/src/backend/opencl/kernel/medfilt.cl
deleted file mode 100644
index de541c47b5..0000000000
--- a/src/backend/opencl/kernel/medfilt.cl
+++ /dev/null
@@ -1,176 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-// Exchange trick: Morgan McGuire, ShaderX 2008
-#define swap(a,b)    { T tmp = a; a = min(a,b); b = max(tmp,b); }
-
-int lIdx(int x, int y, int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
-}
-
-void load2ShrdMem(__local T *           shrd,
-                  __global const T *    in,
-                  int lx, int ly, int shrdStride,
-                  int dim0, int dim1,
-                  int gx, int gy,
-                  int inStride1, int inStride0)
-{
-    if (pad==AF_PAD_ZERO) {
-        if (gx<0 || gx>=dim0 || gy<0 || gy>=dim1)
-            shrd[lIdx(lx, ly, shrdStride, 1)] = (T)0;
-        else
-            shrd[lIdx(lx, ly, shrdStride, 1)] = in[lIdx(gx, gy, inStride1, inStride0)];
-    } else if (pad==AF_PAD_SYM) {
-        if (gx<0) gx *= -1;
-        if (gy<0) gy *= -1;
-        if (gx>=dim0) gx = 2*(dim0-1) - gx;
-        if (gy>=dim1) gy = 2*(dim1-1) - gy;
-        shrd[lIdx(lx, ly, shrdStride, 1)] = in[lIdx(gx, gy, inStride1, inStride0)];
-    }
-}
-
-__kernel
-void medfilt(__global T *       out,
-             KParam             oInfo,
-             __global const T * in,
-             KParam             iInfo,
-             __local T *        localMem,
-             int           nBBS0,
-             int           nBBS1)
-{
-    // calculate necessary offset and window parameters
-    const int padding = w_len-1;
-    const int halo    = padding/2;
-    const int shrdLen = get_local_size(0) + padding;
-
-    // batch offsets
-    unsigned b2 = get_group_id(0) / nBBS0;
-    unsigned b3 = get_group_id(1) / nBBS1;
-    __global const T* iptr =  in + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
-    __global T*       optr = out + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
-
-    // local neighborhood indices
-    int lx = get_local_id(0);
-    int ly = get_local_id(1);
-
-    // global indices
-    int gx = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    int gy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
-
-    // offset values for pulling image to local memory
-    int lx2 = lx + get_local_size(0);
-    int ly2 = ly + get_local_size(1);
-    int gx2 = gx + get_local_size(0);
-    int gy2 = gy + get_local_size(1);
-
-    // pull image to local memory
-    load2ShrdMem(localMem, iptr, lx, ly, shrdLen,
-                 iInfo.dims[0], iInfo.dims[1],
-                 gx-halo, gy-halo,
-                 iInfo.strides[1], iInfo.strides[0]);
-    if (lx<padding) {
-        load2ShrdMem(localMem, iptr, lx2, ly, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1],
-                     gx2-halo, gy-halo,
-                     iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding) {
-        load2ShrdMem(localMem, iptr, lx, ly2, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1],
-                     gx-halo, gy2-halo,
-                     iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2ShrdMem(localMem, iptr, lx2, ly2, shrdLen,
-                     iInfo.dims[0], iInfo.dims[1],
-                     gx2-halo, gy2-halo,
-                     iInfo.strides[1], iInfo.strides[0]);
-    }
-    barrier(CLK_LOCAL_MEM_FENCE);
-
-    // Only continue if we're at a valid location
-    if (gx < iInfo.dims[0] && gy < iInfo.dims[1]) {
-
-        // pull top half from shared memory into local memory
-        T v[ARR_SIZE];
-#pragma unroll
-        for(int k = 0; k <= w_wid/2; k++) {
-#pragma unroll
-            for(int i = 0; i < w_len; i++) {
-                v[w_len*k + i] = localMem[lIdx(lx+i,ly+k,shrdLen,1)];
-            }
-        }
-
-        // with each pass, remove min and max values and add new value
-        // initial sort
-        // ensure min in first half, max in second half
-#pragma unroll
-        for(int i = 0; i < ARR_SIZE/2; i++) {
-            swap(v[i], v[ARR_SIZE-1-i]);
-        }
-        // move min in first half to first pos
-#pragma unroll
-        for(int i = 1; i < (ARR_SIZE+1)/2; i++) {
-            swap(v[0], v[i]);
-        }
-        // move max in second half to last pos
-#pragma unroll
-        for(int i = ARR_SIZE-2; i >= ARR_SIZE/2; i--) {
-            swap(v[i], v[ARR_SIZE-1]);
-        }
-
-        int last = ARR_SIZE-1;
-
-        for(int k = 1+w_wid/2; k < w_wid; k++) {
-
-            for(int j = 0; j < w_len; j++) {
-
-                // add new contestant to first position in array
-                v[0] = localMem[lIdx(lx+j, ly+k, shrdLen, 1)];
-
-                last--;
-
-                // place max in last half, min in first half
-                for(int i = 0; i < (last+1)/2; i++) {
-                    swap(v[i], v[last-i]);
-                }
-                // now perform swaps on each half such that
-                // max is in last pos, min is in first pos
-                for(int i = 1; i <= last/2; i++) {
-                    swap(v[0], v[i]);
-                }
-                for(int i = last-1; i >= (last+1)/2; i--) {
-                    swap(v[i], v[last]);
-                }
-            }
-        }
-
-        // no more new contestants
-        // may still have to sort the last row
-        // each outer loop drops the min and max
-        for(int k = 1; k < w_len/2; k++) {
-            // move max/min into respective halves
-            for(int i = k; i < w_len/2; i++) {
-                swap(v[i], v[w_len-1-i]);
-            }
-            // move min into first pos
-            for(int i = k+1; i <= w_len/2; i++) {
-                swap(v[k], v[i]);
-            }
-            // move max into last pos
-            for(int i = w_len-k-2; i >= w_len/2; i--) {
-                swap(v[i], v[w_len-1-k]);
-            }
-        }
-
-        // pick the middle element of the first row
-        optr[gy*oInfo.strides[1]+gx*oInfo.strides[0]] = v[w_len/2];
-    }
-}
diff --git a/src/backend/opencl/kernel/medfilt.hpp b/src/backend/opencl/kernel/medfilt.hpp
index e4e75df767..abbd0ea5c7 100644
--- a/src/backend/opencl/kernel/medfilt.hpp
+++ b/src/backend/opencl/kernel/medfilt.hpp
@@ -8,92 +8,100 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/medfilt.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/medfilt1.hpp>
+#include <kernel_headers/medfilt2.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int MAX_MEDFILTER_LEN = 15;
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename T, af_border_type pad, unsigned w_len, unsigned w_wid>
-void medfilt(Param out, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  mfProgs;
-        static std::map<int, Kernel*> mfKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                const int ARR_SIZE = w_len * (w_wid-w_wid/2);
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D pad="<< pad
-                        << " -D AF_PAD_ZERO="<< AF_PAD_ZERO
-                        << " -D AF_PAD_SYM="<< AF_PAD_SYM
-                        << " -D ARR_SIZE="<< ARR_SIZE
-                        << " -D w_len="<< w_len
-                        << " -D w_wid=" << w_wid;
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-                Program prog;
-                buildProgram(prog, medfilt_cl, medfilt_cl_len, options.str());
-                mfProgs[device]   = new Program(prog);
-                mfKernels[device] = new Kernel(*mfProgs[device], "medfilt");
-            });
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * in.info.dims[2] * THREADS_X,
-                       blk_y * in.info.dims[3] * THREADS_Y);
-
-        auto medfiltOp = make_kernel<Buffer, KParam,
-                                     Buffer, KParam,
-                                     cl::LocalSpaceArg,
-                                     int, int> (*mfKernels[device]);
-
-        size_t loc_size = (THREADS_X+w_len-1)*(THREADS_Y+w_wid-1)*sizeof(T);
-
-        medfiltOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info, cl::Local(loc_size), blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int MAX_MEDFILTER2_LEN = 15;
+constexpr int MAX_MEDFILTER1_LEN = 121;
+
+constexpr int THREADS_X = 16;
+constexpr int THREADS_Y = 16;
+
+template<typename T>
+void medfilt1(Param out, const Param in, const unsigned w_wid,
+              const af_border_type pad) {
+    const int ARR_SIZE = (w_wid - w_wid / 2) + 1;
+    size_t loc_size    = (THREADS_X + w_wid - 1) * sizeof(T);
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(pad),
+    };
+    std::array<std::string, 7> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(pad, static_cast<int>(pad)),
+        DefineKeyValue(AF_PAD_ZERO, static_cast<int>(AF_PAD_ZERO)),
+        DefineKeyValue(AF_PAD_SYM, static_cast<int>(AF_PAD_SYM)),
+        DefineValue(ARR_SIZE),
+        DefineValue(w_wid),
+        getTypeBuildDefinition<T>()};
+
+    auto medfiltOp =
+        common::getKernel("medfilt1", {{medfilt1_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, 1, 1);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+
+    cl::NDRange global(blk_x * in.info.dims[1] * THREADS_X, in.info.dims[2],
+                       in.info.dims[3]);
+
+    medfiltOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, cl::Local(loc_size), blk_x);
+    CL_DEBUG_FINISH(getQueue());
 }
 
-}
+template<typename T>
+void medfilt2(Param out, const Param in, const af_border_type pad,
+              const unsigned w_len, const unsigned w_wid) {
+    const int ARR_SIZE = w_len * (w_wid - w_wid / 2);
+    const size_t loc_size =
+        (THREADS_X + w_len - 1) * (THREADS_Y + w_wid - 1) * sizeof(T);
+
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(pad),
+        TemplateArg(w_len),
+        TemplateArg(w_wid),
+    };
+    std::array<std::string, 8> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(pad, static_cast<int>(pad)),
+        DefineKeyValue(AF_PAD_ZERO, static_cast<int>(AF_PAD_ZERO)),
+        DefineKeyValue(AF_PAD_SYM, static_cast<int>(AF_PAD_SYM)),
+        DefineValue(ARR_SIZE),
+        DefineValue(w_wid),
+        DefineValue(w_len),
+        getTypeBuildDefinition<T>()};
+
+    auto medfiltOp =
+        common::getKernel("medfilt2", {{medfilt2_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * in.info.dims[2] * THREADS_X,
+                       blk_y * in.info.dims[3] * THREADS_Y);
 
+    medfiltOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, cl::Local(loc_size), blk_x, blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/medfilt1.cl b/src/backend/opencl/kernel/medfilt1.cl
new file mode 100644
index 0000000000..c547c60c3e
--- /dev/null
+++ b/src/backend/opencl/kernel/medfilt1.cl
@@ -0,0 +1,131 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Exchange trick: Morgan McGuire, ShaderX 2008
+#define swap(a, b)           \
+    {                        \
+        T tmp = a;           \
+        a     = min(a, b);   \
+        b     = max(tmp, b); \
+    }
+
+void load2ShrdMem_1d(local T* shrd, global const T* in, int lx, int dim0,
+                     int gx, int inStride0) {
+    if (pad == AF_PAD_ZERO) {
+        if (gx < 0 || gx >= dim0)
+            shrd[lx] = (T)0;
+        else
+            shrd[lx] = in[gx];
+    } else if (pad == AF_PAD_SYM) {
+        if (gx < 0) gx *= -1;
+        if (gx >= dim0) gx = 2 * (dim0 - 1) - gx;
+        shrd[lx] = in[gx];
+    }
+}
+
+kernel void medfilt1(global T* out, KParam oInfo, global const T* in,
+                       KParam iInfo, local T* localMem, int nBBS0) {
+    // calculate necessary offset and window parameters
+    const int padding = w_wid - 1;
+    const int halo    = padding / 2;
+    const int shrdLen = get_local_size(0) + padding;
+
+    // batch offsets
+    unsigned b1            = get_group_id(0) / nBBS0;
+    unsigned b0            = get_group_id(0) - b1 * nBBS0;
+    unsigned b2            = get_group_id(1);
+    unsigned b3            = get_group_id(2);
+    global const T* iptr = in +
+                             (b1 * iInfo.strides[1] + b2 * iInfo.strides[2] +
+                              b3 * iInfo.strides[3]) +
+                             iInfo.offset;
+    global T* optr = out +
+                       (b1 * oInfo.strides[1] + b2 * oInfo.strides[2] +
+                        b3 * oInfo.strides[3]) +
+                       oInfo.offset;
+
+    // local neighborhood indices
+    int lx = get_local_id(0);
+
+    // global indices
+    int gx = get_local_size(0) * b0 + lx;
+
+    int s0 = iInfo.strides[0];
+    int d0 = iInfo.dims[0];
+    for (int a = lx, gx2 = gx; a < shrdLen;
+         a += get_local_size(0), gx2 += get_local_size(0)) {
+        load2ShrdMem_1d(localMem, iptr, a, d0, gx2 - halo, s0);
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // Only continue if we're at a valid location
+    if (gx < iInfo.dims[0]) {
+        // pull top half from local (shared) memory into private memory
+        T v[ARR_SIZE];
+
+#pragma unroll
+        for (int k = 0; k <= w_wid / 2 + 1; k++) { v[k] = localMem[lx + k]; }
+
+        // with each pass, remove min and max values and add new value
+        // initial sort
+        // ensure min in first half, max in second half
+#pragma unroll
+        for (int i = 0; i < ARR_SIZE / 2; i++) {
+            swap(v[i], v[ARR_SIZE - 1 - i]);
+        }
+        // move min in first half to first pos
+#pragma unroll
+        for (int i = 1; i < (ARR_SIZE + 1) / 2; i++) { swap(v[0], v[i]); }
+        // move max in second half to last pos
+#pragma unroll
+        for (int i = ARR_SIZE - 2; i >= ARR_SIZE / 2; i--) {
+            swap(v[i], v[ARR_SIZE - 1]);
+        }
+
+        int last = ARR_SIZE - 1;
+
+        for (int k = w_wid / 2 + 2; k < w_wid; k++) {
+            // add new contestant to first position in array
+            v[0] = localMem[lx + k];
+
+            last--;
+
+            // place max in last half, min in first half
+            for (int i = 0; i < (last + 1) / 2; i++) {
+                swap(v[i], v[last - i]);
+            }
+            // now perform swaps on each half such that
+            // max is in last pos, min is in first pos
+            for (int i = 1; i <= last / 2; i++) { swap(v[0], v[i]); }
+            for (int i = last - 1; i >= (last + 1) / 2; i--) {
+                swap(v[i], v[last]);
+            }
+        }
+
+        // no more new contestants
+        // may still have to sort the last row
+        // each outer loop drops the min and max
+        for (int k = 0; k < last; k++) {
+            // move max/min into respective halves
+            for (int i = k; i < ARR_SIZE / 2; i++) {
+                swap(v[i], v[ARR_SIZE - 1 - i]);
+            }
+            // move min into first pos
+            for (int i = k + 1; i <= ARR_SIZE / 2; i++) { swap(v[k], v[i]); }
+            // move max into last pos
+            for (int i = ARR_SIZE - k - 2; i >= ARR_SIZE / 2; i--) {
+                swap(v[i], v[ARR_SIZE - 1 - k]);
+            }
+        }
+
+        // pick the middle element of the first row
+        optr[gx * oInfo.strides[0]] = v[last / 2];
+    }
+}
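
Review note on medfilt1.cl (commentary, not part of the patch): the kernel finds the median with min/max exchanges instead of a full sort. It keeps a small working set, repeatedly pushes the minimum to the front and the maximum to the back, drops both, and pulls in the next window element. The host-side C++ sketch below illustrates that idea under the assumption of an odd window length. The names `exchange` and `forgetfulMedian` are illustrative only and are not ArrayFire functions, and the loops are simplified compared with the unrolled kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Same effect as the kernel's swap macro: afterwards a <= b.
template <typename T>
inline void exchange(T& a, T& b) {
    const T tmp = a;
    a = std::min(a, b);
    b = std::max(tmp, b);
}

// Median of an odd-length window. The working set starts with n/2 + 2
// elements; its minimum and maximum can never be the median, so both are
// discarded each round and one unseen element (if any) takes a free slot.
template <typename T>
T forgetfulMedian(const std::vector<T>& window) {
    assert(window.size() % 2 == 1);  // assumption: odd window length
    const std::size_t n = window.size();
    const std::size_t initial = std::min(n, n / 2 + 2);
    std::vector<T> v(window.begin(), window.begin() + initial);

    std::size_t live = initial;  // v[0..live) is the working set
    std::size_t next = initial;  // index of the next unseen element
    while (live > 1) {
        // move the minimum to the first position
        for (std::size_t i = 1; i < live; ++i) { exchange(v[0], v[i]); }
        // move the maximum to the last position
        for (std::size_t i = 1; i + 1 < live; ++i) {
            exchange(v[i], v[live - 1]);
        }
        live -= 1;  // drop the maximum
        if (next < n) {
            v[0] = window[next++];  // new contestant overwrites the minimum
        } else {
            v[0] = v[live - 1];  // nothing left to add: drop the minimum too
            live -= 1;
        }
    }
    return v[0];
}
```

For example, `forgetfulMedian<int>({5, 1, 4, 2, 3})` evaluates to 3. The working set never holds more than n/2 + 2 values, which matches the kernel's use of a small private array instead of a copy of the whole window.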
diff --git a/src/backend/opencl/kernel/medfilt2.cl b/src/backend/opencl/kernel/medfilt2.cl
new file mode 100644
index 0000000000..bfb7109f7c
--- /dev/null
+++ b/src/backend/opencl/kernel/medfilt2.cl
@@ -0,0 +1,150 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Exchange trick: Morgan McGuire, ShaderX 2008
+#define swap(a, b)           \
+    {                        \
+        T tmp = a;           \
+        a     = min(a, b);   \
+        b     = max(tmp, b); \
+    }
+
+int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
+}
+
+void load2ShrdMem(local T* shrd, global const T* in, int lx, int ly,
+                  int shrdStride, int dim0, int dim1, int gx, int gy,
+                  int inStride1, int inStride0) {
+    if (pad == AF_PAD_ZERO) {
+        if (gx < 0 || gx >= dim0 || gy < 0 || gy >= dim1)
+            shrd[lIdx(lx, ly, shrdStride, 1)] = (T)0;
+        else
+            shrd[lIdx(lx, ly, shrdStride, 1)] =
+                in[lIdx(gx, gy, inStride1, inStride0)];
+    } else if (pad == AF_PAD_SYM) {
+        if (gx < 0) gx *= -1;
+        if (gy < 0) gy *= -1;
+        if (gx >= dim0) gx = 2 * (dim0 - 1) - gx;
+        if (gy >= dim1) gy = 2 * (dim1 - 1) - gy;
+        shrd[lIdx(lx, ly, shrdStride, 1)] =
+            in[lIdx(gx, gy, inStride1, inStride0)];
+    }
+}
+
+kernel void medfilt2(global T* out, KParam oInfo, global const T* in,
+                       KParam iInfo, local T* localMem, int nBBS0,
+                       int nBBS1) {
+    // calculate necessary offset and window parameters
+    const int padding = w_len - 1;
+    const int halo    = padding / 2;
+    const int shrdLen = get_local_size(0) + padding;
+
+    // batch offsets
+    unsigned b2 = get_group_id(0) / nBBS0;
+    unsigned b3 = get_group_id(1) / nBBS1;
+    global const T* iptr =
+        in + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
+    global T* optr = out + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
+
+    // local neighborhood indices
+    int lx = get_local_id(0);
+    int ly = get_local_id(1);
+
+    // global indices
+    int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
+
+    int s0 = iInfo.strides[0];
+    int s1 = iInfo.strides[1];
+    int d0 = iInfo.dims[0];
+    int d1 = iInfo.dims[1];
+    // pull image to local memory
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        // move row_set get_local_size(1) along columns
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            load2ShrdMem(localMem, iptr, a, b, shrdLen, d0, d1, gx2 - halo,
+                         gy2 - halo, s1, s0);
+        }
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // Only continue if we're at a valid location
+    if (gx < iInfo.dims[0] && gy < iInfo.dims[1]) {
+        // pull top half from local (shared) memory into private memory
+        T v[ARR_SIZE];
+#pragma unroll
+        for (int k = 0; k <= w_wid / 2; k++) {
+#pragma unroll
+            for (int i = 0; i < w_len; i++) {
+                v[w_len * k + i] = localMem[lIdx(lx + i, ly + k, shrdLen, 1)];
+            }
+        }
+
+        // with each pass, remove min and max values and add new value
+        // initial sort
+        // ensure min in first half, max in second half
+#pragma unroll
+        for (int i = 0; i < ARR_SIZE / 2; i++) {
+            swap(v[i], v[ARR_SIZE - 1 - i]);
+        }
+        // move min in first half to first pos
+#pragma unroll
+        for (int i = 1; i < (ARR_SIZE + 1) / 2; i++) { swap(v[0], v[i]); }
+        // move max in second half to last pos
+#pragma unroll
+        for (int i = ARR_SIZE - 2; i >= ARR_SIZE / 2; i--) {
+            swap(v[i], v[ARR_SIZE - 1]);
+        }
+
+        int last = ARR_SIZE - 1;
+
+        for (int k = 1 + w_wid / 2; k < w_wid; k++) {
+            for (int j = 0; j < w_len; j++) {
+                // add new contestant to first position in array
+                v[0] = localMem[lIdx(lx + j, ly + k, shrdLen, 1)];
+
+                last--;
+
+                // place max in last half, min in first half
+                for (int i = 0; i < (last + 1) / 2; i++) {
+                    swap(v[i], v[last - i]);
+                }
+                // now perform swaps on each half such that
+                // max is in last pos, min is in first pos
+                for (int i = 1; i <= last / 2; i++) { swap(v[0], v[i]); }
+                for (int i = last - 1; i >= (last + 1) / 2; i--) {
+                    swap(v[i], v[last]);
+                }
+            }
+        }
+
+        // no more new contestants
+        // may still have to sort the last row
+        // each outer loop drops the min and max
+        for (int k = 1; k < w_len / 2; k++) {
+            // move max/min into respective halves
+            for (int i = k; i < w_len / 2; i++) {
+                swap(v[i], v[w_len - 1 - i]);
+            }
+            // move min into first pos
+            for (int i = k + 1; i <= w_len / 2; i++) { swap(v[k], v[i]); }
+            // move max into last pos
+            for (int i = w_len - k - 2; i >= w_len / 2; i--) {
+                swap(v[i], v[w_len - 1 - k]);
+            }
+        }
+
+        // pick the middle element of the first row
+        optr[gy * oInfo.strides[1] + gx * oInfo.strides[0]] = v[w_len / 2];
+    }
+}
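
Review note on medfilt2.cl (commentary, not part of the patch): `load2ShrdMem` handles out-of-range halo reads in two ways, writing zero for `AF_PAD_ZERO` and mirroring the index back into the image for `AF_PAD_SYM`. The C++ sketch below reproduces that index logic on the host for a dense image whose dim0 index varies fastest. The `Pad` enum, `reflectIndex`, and `loadPadded` are hypothetical names introduced for illustration; the kernel itself uses the compile-time `pad` define and explicit strides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel's AF_PAD_ZERO / AF_PAD_SYM modes.
enum class Pad { Zero, Sym };

// Mirror an out-of-range index back into [0, dim), as in the AF_PAD_SYM
// branch: negative indices flip sign, indices past the end reflect around
// the last element.
inline int reflectIndex(int g, int dim) {
    assert(dim > 0);
    if (g < 0) g = -g;
    if (g >= dim) g = 2 * (dim - 1) - g;
    return g;
}

// Read element (gx, gy) of a dense dim0 x dim1 image with padding applied.
// Assumes stride0 == 1 and stride1 == dim0.
template <typename T>
T loadPadded(const std::vector<T>& in, int dim0, int dim1, int gx, int gy,
             Pad pad) {
    const bool outside = gx < 0 || gx >= dim0 || gy < 0 || gy >= dim1;
    if (pad == Pad::Zero) {
        return outside ? T(0) : in[static_cast<std::size_t>(gy * dim0 + gx)];
    }
    gx = reflectIndex(gx, dim0);
    gy = reflectIndex(gy, dim1);
    return in[static_cast<std::size_t>(gy * dim0 + gx)];
}
```

A single reflection is guaranteed to land in range only when the index is at most `dim - 1` past either edge, so this sketch, like the kernel, relies on the halo being smaller than the image dimension.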
diff --git a/src/backend/opencl/kernel/memcopy.cl b/src/backend/opencl/kernel/memcopy.cl
index 942c127ef2..984ecf25f0 100644
--- a/src/backend/opencl/kernel/memcopy.cl
+++ b/src/backend/opencl/kernel/memcopy.cl
@@ -8,36 +8,168 @@
  ********************************************************/
 
 typedef struct {
-    dim_t dim[4];
+    int dims[4];
 } dims_t;
 
-__kernel
-void memcopy_kernel(__global T *out, dims_t ostrides,
-                    __global const T *in, dims_t idims,
-                    dims_t istrides, int offset,
-                    int groups_0, int groups_1)
-{
-    const int lid0 = get_local_id(0);
-    const int lid1 = get_local_id(1);
-
-    const int id2 = get_group_id(0) / groups_0;
-    const int id3 = get_group_id(1) / groups_1;
-    const int group_id_0 = get_group_id(0) - groups_0 * id2;
-    const int group_id_1 = get_group_id(1) - groups_1 * id3;
-    const int id0 = group_id_0 * get_local_size(0) + lid0;
-    const int id1 = group_id_1 * get_local_size(1) + lid1;
-
-    in += offset;
-
-    // FIXME: Do more work per work group
-    out += id3 * ostrides.dim[3] + id2 * ostrides.dim[2] + id1 * ostrides.dim[1];
-    in  += id3 * istrides.dim[3] + id2 * istrides.dim[2] + id1 * istrides.dim[1];
-
-    int istride0 = istrides.dim[0];
-    if (id0 < idims.dim[0] &&
-        id1 < idims.dim[1] &&
-        id2 < idims.dim[2] &&
-        id3 < idims.dim[3]) {
-        out[id0] = in[id0 * istride0];
+// memcopy without looping, so dim3 has to be 1.
+// conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] >= dims[1]
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+kernel void memCopy(global T *d_out, const dims_t ostrides, const int ooffset,
+                    global const T *d_in, const dims_t idims,
+                    const dims_t istrides, const int ioffset) {
+    const int id0 = get_global_id(0);  // dim[0]
+    const int id1 = get_global_id(1);  // dim[1]
+    if ((id0 < idims.dims[0]) & (id1 < idims.dims[1])) {
+        const int id2 = get_global_id(2);  // dim[2] never overflows
+                                           // dim[3] is not processed
+        d_out[id0 * ostrides.dims[0] + id1 * ostrides.dims[1] +
+              id2 * ostrides.dims[2] + ooffset] =
+            d_in[id0 * istrides.dims[0] + id1 * istrides.dims[1] +
+                 id2 * istrides.dims[2] + ioffset];
+    }
+}
+
+// memcopy with looping over dims[0] -- VECTOR ONLY
+// Conditions:
+//      global dims[0] has no restrictions
+//      only dims[1] == 1 will be processed!!
+//      only dims[2] == 1 will be processed!!
+//      only dims[3] == 1 will be processed!!
+kernel void memCopyLoop0(global T *d_out, const dims_t ostrides,
+                         const int ooffset, global const T *d_in,
+                         const dims_t idims, const dims_t istrides,
+                         const int ioffset) {
+    int id0          = get_global_id(0);  // dim[0]
+    const int idims0 = idims.dims[0];
+    if (id0 < idims0) {
+        const int incID0        = get_global_size(0);
+        const int istrides0     = istrides.dims[0];
+        int idx_in              = id0 * istrides0 + ioffset;
+        const int idxIncID0_in  = incID0 * istrides0;
+        const int ostrides0     = ostrides.dims[0];
+        int idx_out             = id0 * ostrides0 + ooffset;
+        const int idxIncID0_out = incID0 * ostrides0;
+
+        do {
+            d_out[idx_out] = d_in[idx_in];
+            id0 += incID0;
+            if (id0 >= idims0) break;
+            idx_in += idxIncID0_in;
+            idx_out += idxIncID0_out;
+        } while (true);
+    }
+}
+
+// memcopy with looping over dims[1]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] == dims[2]
+//      only dims[3] == 1 will be processed!!
+kernel void memCopyLoop1(global T *d_out, const dims_t ostrides,
+                         const int ooffset, global const T *d_in,
+                         const dims_t idims, const dims_t istrides,
+                         const int ioffset) {
+    const int id0    = get_global_id(0);  // dim[0]
+    int id1          = get_global_id(1);  // dim[1]
+    const int idims1 = idims.dims[1];
+    if ((id0 < idims.dims[0]) & (id1 < idims1)) {
+        const int id2 = get_global_id(2);  // dim[2] never overflows
+                                           // dim[3] is not processed
+        const int istrides1 = istrides.dims[1];
+        int idx_in          = id0 * istrides.dims[0] + id1 * istrides1 +
+                     id2 * istrides.dims[2] + ioffset;
+        const int incID1       = get_global_size(1);
+        const int idxIncID1_in = incID1 * istrides1;
+        const int ostrides1    = ostrides.dims[1];
+        int idx_out            = id0 * ostrides.dims[0] + id1 * ostrides1 +
+                      id2 * ostrides.dims[2] + ooffset;
+        const int idxIncID1_out = incID1 * ostrides1;
+
+        do {
+            d_out[idx_out] = d_in[idx_in];
+            id1 += incID1;
+            if (id1 >= idims1) break;
+            idx_in += idxIncID1_in;
+            idx_out += idxIncID1_out;
+        } while (true);
+    }
+}
+
+// memcopy with looping over dims[3]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] >= dims[1]
+//      global dims[2] == dims[2]
+kernel void memCopyLoop3(global T *d_out, const dims_t ostrides,
+                         const int ooffset, global const T *d_in,
+                         const dims_t idims, const dims_t istrides,
+                         const int ioffset) {
+    const int id0 = get_global_id(0);  // dim[0]
+    const int id1 = get_global_id(1);  // dim[1]
+    if ((id0 < idims.dims[0]) & (id1 < idims.dims[1])) {
+        const int id2 = get_global_id(2);  // dim[2] never overflows
+                                           // dim[3] is looped over below
+        int idx_in = id0 * istrides.dims[0] + id1 * istrides.dims[1] +
+                     id2 * istrides.dims[2] + ioffset;
+        const int idxIncID3_in = istrides.dims[3];
+        const int idxEnd_in    = idims.dims[3] * idxIncID3_in + idx_in;
+        int idx_out = id0 * ostrides.dims[0] + id1 * ostrides.dims[1] +
+                      id2 * ostrides.dims[2] + ooffset;
+        const int idxIncID3_out = ostrides.dims[3];
+
+        do {
+            d_out[idx_out] = d_in[idx_in];
+            idx_in += idxIncID3_in;
+            if (idx_in == idxEnd_in) break;
+            idx_out += idxIncID3_out;
+        } while (true);
+    }
+}
+
+// memcopy with looping over dims[1] and dims[3]
+// Conditions:
+//      global dims[0] >= dims[0]
+//      global dims[1] has no restrictions
+//      global dims[2] == dims[2]
+kernel void memCopyLoop13(global T *d_out, const dims_t ostrides,
+                          const int ooffset, global const T *d_in,
+                          const dims_t idims, const dims_t istrides,
+                          const int ioffset) {
+    const int id0    = get_global_id(0);  // dim[0]
+    int id1          = get_global_id(1);  // dim[1]
+    const int idims1 = idims.dims[1];
+    if ((id0 < idims.dims[0]) & (id1 < idims1)) {
+        const int id2       = get_global_id(2);  // dim[2] never overflows
+        const int istrides1 = istrides.dims[1];
+        int idxBase_in      = id0 * istrides.dims[0] + id1 * istrides1 +
+                         id2 * istrides.dims[2] + ioffset;
+        const int incID1           = get_global_size(1);
+        const int idxBaseIncID1_in = incID1 * istrides1;
+        const int idxIncID3_in     = istrides.dims[3];
+        int idxEndID3_in           = idims.dims[3] * idxIncID3_in + idxBase_in;
+        int idxBase_out = id0 * ostrides.dims[0] + id1 * ostrides.dims[1] +
+                          id2 * ostrides.dims[2] + ooffset;
+        const int idxBaseIncID1_out = incID1 * ostrides.dims[1];
+        const int idxIncID3_out     = ostrides.dims[3];
+
+        do {
+            int idx_in  = idxBase_in;
+            int idx_out = idxBase_out;
+            while (true) {
+                d_out[idx_out] = d_in[idx_in];
+                idx_in += idxIncID3_in;
+                if (idx_in == idxEndID3_in) break;
+                idx_out += idxIncID3_out;
+            }
+            id1 += incID1;
+            if (id1 >= idims1) break;
+            idxBase_in += idxBaseIncID1_in;
+            idxEndID3_in += idxBaseIncID1_in;
+            idxBase_out += idxBaseIncID1_out;
+        } while (true);
     }
 }
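
Review note on memcopy.cl (commentary, not part of the patch): the `memCopyLoop*` kernels let each work-item copy more than one element by advancing its index by the global size until it runs past the dimension. The C++ sketch below emulates the `memCopyLoop0` traversal on the host by iterating over the work-items sequentially. `stridedCopyLoop0` and its parameter names are illustrative assumptions, and plain `for` loops replace the kernel's `do { ... } while (true)` form, which visits the same elements.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side emulation of a 1-D strided copy with a grid-stride loop.
// globalSize plays the role of get_global_size(0); gid plays the role of
// get_global_id(0) for one work-item.
inline void stridedCopyLoop0(std::vector<int>& out, int ostride, int ooffset,
                             const std::vector<int>& in, int n, int istride,
                             int ioffset, int globalSize) {
    assert(globalSize > 0);
    for (int gid = 0; gid < globalSize; ++gid) {  // one pass per work-item
        // each work-item handles gid, gid + globalSize, gid + 2*globalSize...
        for (int id0 = gid; id0 < n; id0 += globalSize) {
            const std::size_t o =
                static_cast<std::size_t>(id0 * ostride + ooffset);
            const std::size_t i =
                static_cast<std::size_t>(id0 * istride + ioffset);
            out[o] = in[i];
        }
    }
}
```

With a global size of 2, copying 5 elements from an input of stride 2 means work-item 0 handles logical indices 0, 2, 4 and work-item 1 handles indices 1, 3. The launch size can therefore stay fixed while the array length grows.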
diff --git a/src/backend/opencl/kernel/memcopy.hpp b/src/backend/opencl/kernel/memcopy.hpp
index 95f61c8869..c27d8c39b6 100644
--- a/src/backend/opencl/kernel/memcopy.hpp
+++ b/src/backend/opencl/kernel/memcopy.hpp
@@ -8,167 +8,245 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/memcopy.hpp>
-#include <kernel_headers/copy.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <sstream>
-#include <string>
-#include <map>
-#include <algorithm>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/kernel_cache.hpp>
+#include <common/traits.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/copy.hpp>
+#include <kernel_headers/memcopy.hpp>
+#include <threadsMgt.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-    typedef struct
-    {
-        dim_t dim[4];
-    } dims_t;
-
-    static const uint DIM0 = 32;
-    static const uint DIM1 =  8;
-
-    template<typename T>
-    void memcopy(cl::Buffer out, const dim_t *ostrides,
-                 const cl::Buffer in, const dim_t *idims,
-                 const dim_t *istrides, int offset, uint ndims)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>    cpyProgs;
-            static std::map<int, Kernel*>   cpyKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once(compileFlags[device], [&]() {
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName();
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
+#include <algorithm>
+#include <iostream>
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+typedef struct {
+    int dims[4];
+} dims_type;
+
+// Increase vectorization by increasing the used type up to maxVectorWidth.
+// Example:
+//  for an input array<int>, a return value of 4 means that the array is
+//  treated as array<int4>.
+//
+// Parameters
+//  - IN     maxVectorWidth: maximum vectorization desired
+//  - IN/OUT dims[4]: dimensions of the array
+//  - IN/OUT istrides[4]: strides of the input array
+//  - IN/OUT indims: ndims of the input array.  Updates when dim[0] becomes 1
+//  - IN/OUT ioffset: offset of the input array
+//  - IN/OUT ostrides[4]: strides of the output array
+//  - IN/OUT ooffset: offset of the output array
+//
+// Returns
+//  - maximum obtained vectorization.
+//  - All the parameters are updated accordingly
+//
+static inline unsigned vectorizeShape(const unsigned maxVectorWidth,
+                                      int dims[4], int istrides[4], int& indims,
+                                      dim_t& ioffset, int ostrides[4],
+                                      dim_t& ooffset) {
+    unsigned vectorWidth{1};
+    if ((maxVectorWidth != 1) & (istrides[0] == 1) & (ostrides[0] == 1)) {
+        // - Only adjacent items can be vectorized into a base vector type
+        // - global is the OR of the values to be checked.  When global is
+        // divisible by 2, then all source values are as well
+        // - The buffers are always aligned at 128 bytes, so the alignment
+        // depends only on the offsets
+        dim_t global{dims[0] | ioffset | ooffset};
+        for (int i{1}; i < indims; ++i) { global |= istrides[i] | ostrides[i]; }
+
+        // Determine the maximum vectorization possible
+        unsigned count{0};
+        while (((global & 1) == 0) & (vectorWidth < maxVectorWidth)) {
+            ++count;
+            vectorWidth <<= 1;
+            global >>= 1;
+        }
+        if (count != 0) {
+            // update the dimensions, to correspond with the new vectorization
+            dims[0] >>= count;
+            ioffset >>= count;
+            ooffset >>= count;
+            for (int i{1}; i < indims; ++i) {
+                istrides[i] >>= count;
+                ostrides[i] >>= count;
+            }
+            if (dims[0] == 1) {
+                // Vectorization has absorbed the full dim0, so eliminate
+                // the 1st dimension
+                --indims;
+                for (int i{0}; i < indims; ++i) {
+                    dims[i]     = dims[i + 1];
+                    istrides[i] = istrides[i + 1];
+                    ostrides[i] = ostrides[i + 1];
                 }
-                Program prog;
-                buildProgram(prog, memcopy_cl, memcopy_cl_len, options.str());
-                cpyProgs[device]   = new Program(prog);
-                cpyKernels[device] = new Kernel(*cpyProgs[device], "memcopy_kernel");
-            });
-
-            dims_t _ostrides = {{ostrides[0], ostrides[1], ostrides[2], ostrides[3]}};
-            dims_t _istrides = {{istrides[0], istrides[1], istrides[2], istrides[3]}};
-            dims_t _idims = {{idims[0], idims[1], idims[2], idims[3]}};
-
-            size_t local_size[2] = {DIM0, DIM1};
-            if (ndims == 1) {
-                local_size[0] *= local_size[1];
-                local_size[1]  = 1;
+                dims[indims] = 1;
             }
-
-            int groups_0 = divup(idims[0], local_size[0]);
-            int groups_1 = divup(idims[1], local_size[1]);
-
-            NDRange local(local_size[0], local_size[1]);
-            NDRange global(groups_0 * idims[2] * local_size[0],
-                           groups_1 * idims[3] * local_size[1]);
-
-            auto memcopy_kernel = make_kernel< Buffer, dims_t,
-                                               Buffer, dims_t,
-                                               dims_t, int,
-                                               int, int >(*cpyKernels[device]);
-
-            memcopy_kernel(EnqueueArgs(getQueue(), global, local),
-                out, _ostrides, in, _idims, _istrides, offset, groups_0, groups_1);
-            CL_DEBUG_FINISH(getQueue());
-        }
-        catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
         }
     }
+    return vectorWidth;
+}
 
-    template<typename inType, typename outType, bool same_dims>
-    void copy(Param dst, const Param src, int ndims, outType default_value, double factor)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*>    cpyProgs;
-            static std::map<int, Kernel*>   cpyKernels;
-
-            int device = getActiveDeviceId();
-
-            std::call_once(compileFlags[device], [&]() {
-
-                        std::ostringstream options;
-                        options << " -D inType=" << dtype_traits<inType>::getName()
-                            << " -D outType=" << dtype_traits<outType>::getName()
-                            << " -D inType_" << dtype_traits<inType>::getName()
-                            << " -D outType_" << dtype_traits<outType>::getName()
-                            << " -D SAME_DIMS=" << same_dims;
-                        if (std::is_same<inType, double>::value  ||
-                            std::is_same<inType, cdouble>::value ||
-                            std::is_same<outType, double>::value ||
-                            std::is_same<outType, cdouble>::value) {
-                            options << " -D USE_DOUBLE";
-                        }
-
-                        Program prog;
-                        buildProgram(prog, copy_cl, copy_cl_len, options.str());
-                        cpyProgs[device]   = new Program(prog);
-                        cpyKernels[device] = new Kernel(*cpyProgs[device], "copy");
-                    });
-
-            NDRange local(DIM0, DIM1);
-            size_t local_size[] = {DIM0, DIM1};
-
-            local_size[0] *= local_size[1];
-            if (ndims == 1) {
-                local_size[1] = 1;
-            }
-
-            int blk_x = divup(dst.info.dims[0], local_size[0]);
-            int blk_y = divup(dst.info.dims[1], local_size[1]);
-
-            NDRange global(blk_x * dst.info.dims[2] * DIM0,
-                    blk_y * dst.info.dims[3] * DIM1);
-
-            dims_t trgt_dims;
-            if (same_dims) {
-                trgt_dims= {{dst.info.dims[0], dst.info.dims[1], dst.info.dims[2], dst.info.dims[3]}};
-            } else {
-                dim_t trgt_l = std::min(dst.info.dims[3], src.info.dims[3]);
-                dim_t trgt_k = std::min(dst.info.dims[2], src.info.dims[2]);
-                dim_t trgt_j = std::min(dst.info.dims[1], src.info.dims[1]);
-                dim_t trgt_i = std::min(dst.info.dims[0], src.info.dims[0]);
-                trgt_dims= {{trgt_i, trgt_j, trgt_k, trgt_l}};
-            }
+template<typename T>
+void memcopy(const cl::Buffer& b_out, const dim4& ostrides,
+             const cl::Buffer& b_in, const dim4& idims, const dim4& istrides,
+             dim_t ioffset, const dim_t indims, dim_t ooffset = 0) {
+    dims_type idims_{
+        static_cast<int>(idims.dims[0]), static_cast<int>(idims.dims[1]),
+        static_cast<int>(idims.dims[2]), static_cast<int>(idims.dims[3])};
+    dims_type istrides_{
+        static_cast<int>(istrides.dims[0]), static_cast<int>(istrides.dims[1]),
+        static_cast<int>(istrides.dims[2]), static_cast<int>(istrides.dims[3])};
+    dims_type ostrides_{
+        static_cast<int>(ostrides.dims[0]), static_cast<int>(ostrides.dims[1]),
+        static_cast<int>(ostrides.dims[2]), static_cast<int>(ostrides.dims[3])};
+    int indims_{static_cast<int>(indims)};
+
+    const size_t totalSize{idims.elements() * sizeof(T) * 2};
+    removeEmptyColumns(idims_.dims, indims_, ostrides_.dims);
+    indims_ =
+        removeEmptyColumns(idims_.dims, indims_, idims_.dims, istrides_.dims);
+    indims_ =
+        combineColumns(idims_.dims, istrides_.dims, indims_, ostrides_.dims);
+
+    // Optimize memory access and caching.
+    // Best performance is achieved with the highest vectorization
+    // (<int> --> <int2>,<int4>, ...), since more data is processed per IO.
+    const cl::Device dev{opencl::getDevice()};
+    const unsigned DevicePreferredVectorWidthChar{
+        dev.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_CHAR>()};
+    // When the architecture prefers certain widths, it certainly does so
+    // for char.  No preference means a vector width of 1 is returned.
+    const bool DevicePreferredVectorWidth{DevicePreferredVectorWidthChar != 1};
+    size_t maxVectorWidth{
+        DevicePreferredVectorWidth
+            ? sizeof(T) == 1 ? DevicePreferredVectorWidthChar
+              : sizeof(T) == 2
+                  ? dev.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_SHORT>()
+              : sizeof(T) == 4
+                  ? dev.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_INT>()
+              : sizeof(T) == 8
+                  ? dev.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE>()
+                  : 1
+        : sizeof(T) > 8 ? 1
+                        : 16 / sizeof(T)};
+    const size_t vectorWidth{vectorizeShape(maxVectorWidth, idims_.dims,
+                                            istrides_.dims, indims_, ioffset,
+                                            ostrides_.dims, ooffset)};
+    const size_t sizeofNewT{sizeof(T) * vectorWidth};
+
+    threadsMgt<int> th(idims_.dims, indims_, 1, 1, totalSize, sizeofNewT);
+    const char* kernelName{
+        th.loop0   ? "memCopyLoop0"
+        : th.loop1 ? th.loop3 ? "memCopyLoop13" : "memCopyLoop1"
+        : th.loop3 ? "memCopyLoop3"
+                   : "memCopy"};  // Conversion to base vector types.
+    TemplateArg tArg{
+        sizeofNewT == 1   ? "char"
+        : sizeofNewT == 2 ? "short"
+        : sizeofNewT == 4 ? "float"
+        : sizeofNewT == 8 ? "float2"
+        : sizeofNewT == 16
+            ? "float4"
+            : "type is larger than 16 bytes, which is unsupported"};
+    auto memCopy{common::getKernel(kernelName, {{memcopy_cl_src}}, {{tArg}},
+                                   {{DefineKeyValue(T, tArg)}})};
+    const cl::NDRange local{th.genLocal(memCopy.get())};
+    const cl::NDRange global{th.genGlobal(local)};
+
+    memCopy(cl::EnqueueArgs(getQueue(), global, local), b_out, ostrides_,
+            static_cast<int>(ooffset), b_in, idims_, istrides_,
+            static_cast<int>(ioffset));
+    CL_DEBUG_FINISH(getQueue());
+}
 
-            auto copyOp = make_kernel<Buffer, KParam, Buffer, KParam,
-                                      outType, float, dims_t,
-                                      int, int
-                                     >(*cpyKernels[device]);
-
-            copyOp(EnqueueArgs(getQueue(), global, local),
-                   *dst.data, dst.info, *src.data, src.info,
-                   default_value, (float)factor, trgt_dims, blk_x, blk_y);
-            CL_DEBUG_FINISH(getQueue());
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
+template<typename inType, typename outType>
+void copy(const Param out, const Param in, dim_t ondims,
+          const outType default_value, const double factor) {
+    dims_type idims_{
+        static_cast<int>(in.info.dims[0]), static_cast<int>(in.info.dims[1]),
+        static_cast<int>(in.info.dims[2]), static_cast<int>(in.info.dims[3])};
+    dims_type istrides_{static_cast<int>(in.info.strides[0]),
+                        static_cast<int>(in.info.strides[1]),
+                        static_cast<int>(in.info.strides[2]),
+                        static_cast<int>(in.info.strides[3])};
+    dims_type odims_{
+        static_cast<int>(out.info.dims[0]), static_cast<int>(out.info.dims[1]),
+        static_cast<int>(out.info.dims[2]), static_cast<int>(out.info.dims[3])};
+    dims_type ostrides_{static_cast<int>(out.info.strides[0]),
+                        static_cast<int>(out.info.strides[1]),
+                        static_cast<int>(out.info.strides[2]),
+                        static_cast<int>(out.info.strides[3])};
+    int ondims_{static_cast<int>(ondims)};
+    const size_t totalSize{odims_.dims[0] * odims_.dims[1] * odims_.dims[2] *
+                               odims_.dims[3] * sizeof(outType) +
+                           idims_.dims[0] * idims_.dims[1] * idims_.dims[2] *
+                               idims_.dims[3] * sizeof(inType)};
+    bool same_dims{true};
+    for (int i{0}; i < ondims_; ++i) {
+        if (idims_.dims[i] > odims_.dims[i]) {
+            idims_.dims[i] = odims_.dims[i];
+        } else if (idims_.dims[i] != odims_.dims[i]) {
+            same_dims = false;
         }
     }
 
-}
+    removeEmptyColumns(odims_.dims, ondims_, idims_.dims, istrides_.dims);
+    ondims_ =
+        removeEmptyColumns(odims_.dims, ondims_, odims_.dims, ostrides_.dims);
+    ondims_ = combineColumns(odims_.dims, ostrides_.dims, ondims_, idims_.dims,
+                             istrides_.dims);
+
+    constexpr int factorTypeIdx{std::is_same<inType, double>::value ||
+                                std::is_same<inType, cdouble>::value};
+    const char* factorType[]{"float", "double"};
+
+    const std::array<TemplateArg, 5> targs{
+        TemplateTypename<inType>(), TemplateTypename<outType>(),
+        TemplateArg(same_dims),     TemplateArg(factorType[factorTypeIdx]),
+        TemplateArg(factor != 1.0),
+    };
+    const std::array<std::string, 8> options{
+        DefineKeyValue(inType, dtype_traits<inType>::getName()),
+        DefineKeyValue(outType, dtype_traits<outType>::getName()),
+        std::string(" -D inType_") + dtype_traits<inType>::getName(),
+        std::string(" -D outType_") + dtype_traits<outType>::getName(),
+        DefineKeyValue(SAME_DIMS, static_cast<int>(same_dims)),
+        std::string(" -D factorType=") + factorType[factorTypeIdx],
+        std::string((factor != 1.0) ? " -D FACTOR" : " -D NOFACTOR"),
+        getTypeBuildDefinition<inType, outType>(),
+    };
+
+    threadsMgt<int> th(odims_.dims, ondims_, 1, 1, totalSize, sizeof(outType));
+    auto copy = common::getKernel(th.loop0   ? "scaledCopyLoop0"
+                                  : th.loop3 ? "scaledCopyLoop13"
+                                  : th.loop1 ? "scaledCopyLoop1"
+                                             : "scaledCopy",
+                                  {{copy_cl_src}}, targs, options);
+    const cl::NDRange local{th.genLocal(copy.get())};
+    const cl::NDRange global{th.genGlobal(local)};
+
+    if (factorTypeIdx == 0) {
+        copy(cl::EnqueueArgs(getQueue(), global, local), *out.data, odims_,
+             ostrides_, static_cast<uint>(out.info.offset), *in.data, idims_,
+             istrides_, static_cast<uint>(in.info.offset), default_value,
+             static_cast<float>(factor));
+    } else {
+        copy(cl::EnqueueArgs(getQueue(), global, local), *out.data, odims_,
+             ostrides_, static_cast<uint>(out.info.offset), *in.data, idims_,
+             istrides_, static_cast<uint>(in.info.offset), default_value,
+             static_cast<double>(factor));
+    }
 
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
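Review note on the `copy` wrapper above: before a kernel variant is selected, the loop over `ondims_` clamps each input extent to the matching output extent and records in `same_dims` whether the two shapes still agree. The following is a minimal standalone sketch of that step only. It assumes a plain `std::array<int, 4>` in place of the patch's `dims_type`, and the name `clampToOutput` is hypothetical, not an ArrayFire function.

```cpp
#include <array>
#include <cassert>

// Hypothetical helper mirroring the clamp loop in copy(): shrink any input
// extent that exceeds the output extent, and report whether each of the
// first `ndims` extents ends up equal to the output's extent.
inline bool clampToOutput(std::array<int, 4>& idims,
                          const std::array<int, 4>& odims, int ndims) {
    assert(ndims >= 0 && ndims <= 4);
    bool sameDims = true;
    for (int i = 0; i < ndims; ++i) {
        if (idims[i] > odims[i]) {
            idims[i] = odims[i];  // larger input is cropped to the output size
        } else if (idims[i] != odims[i]) {
            sameDims = false;  // smaller input: some output cells have no source
        }
    }
    return sameDims;
}
```

Note that a cropped dimension does not clear the flag: after clamping, the input and output extents are equal, which appears to be what lets the patch pass `same_dims` as a compile-time kernel option.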
diff --git a/src/backend/opencl/kernel/moments.cl b/src/backend/opencl/kernel/moments.cl
new file mode 100644
index 0000000000..f9c8dc5031
--- /dev/null
+++ b/src/backend/opencl/kernel/moments.cl
@@ -0,0 +1,87 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#define AF_MOMENT_M00 1
+#define AF_MOMENT_M01 2
+#define AF_MOMENT_M10 4
+#define AF_MOMENT_M11 8
+
+inline void fatomic_add_l(volatile local float *source, const float operand) {
+    union {
+        unsigned int intVal;
+        float floatVal;
+    } newVal, prevVal, expVal;
+
+    prevVal.floatVal = *source;
+    do {
+        expVal.floatVal = prevVal.floatVal;
+        newVal.floatVal = expVal.floatVal + operand;
+        prevVal.intVal  = atomic_cmpxchg((volatile local unsigned int *)source,
+                                        expVal.intVal, newVal.intVal);
+    } while (expVal.intVal != prevVal.intVal);
+}
+
+inline void fatomic_add_g(volatile global float *source, const float operand) {
+    union {
+        unsigned int intVal;
+        float floatVal;
+    } newVal, prevVal, expVal;
+
+    prevVal.floatVal = *source;
+    do {
+        expVal.floatVal = prevVal.floatVal;
+        newVal.floatVal = expVal.floatVal + operand;
+        prevVal.intVal  = atomic_cmpxchg((volatile global unsigned int *)source,
+                                        expVal.intVal, newVal.intVal);
+    } while (expVal.intVal != prevVal.intVal);
+}
+
+kernel void moments(global float *d_out, const KParam out, global const T *d_in,
+                    const KParam in, const int moment, const int pBatch) {
+    const dim_t idw = get_group_id(1) / in.dims[2];
+    const dim_t idz = get_group_id(1) - idw * in.dims[2];
+
+    const dim_t idy = get_group_id(0);
+    dim_t idx       = get_local_id(0);
+
+    if (idy >= in.dims[1] || idz >= in.dims[2] || idw >= in.dims[3]) return;
+
+    local float wkg_moment_sum[MOMENTS_SZ];
+    if (get_local_id(0) < MOMENTS_SZ) { wkg_moment_sum[get_local_id(0)] = 0.f; }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    int mId = idy * in.strides[1] + idx;
+    if (pBatch) { mId += idw * in.strides[3] + idz * in.strides[2]; }
+
+    for (; idx < in.dims[0]; idx += get_local_size(0)) {
+        dim_t m_off = 0;
+        float val   = d_in[mId];
+        mId += get_local_size(0);
+
+        if ((moment & AF_MOMENT_M00) > 0) {
+            fatomic_add_l(wkg_moment_sum + m_off++, val);
+        }
+        if ((moment & AF_MOMENT_M01) > 0) {
+            fatomic_add_l(wkg_moment_sum + m_off++, idx * val);
+        }
+        if ((moment & AF_MOMENT_M10) > 0) {
+            fatomic_add_l(wkg_moment_sum + m_off++, idy * val);
+        }
+        if ((moment & AF_MOMENT_M11) > 0) {
+            fatomic_add_l(wkg_moment_sum + m_off, idx * idy * val);
+        }
+    }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (get_local_id(0) < out.dims[0])
+        fatomic_add_g(d_out + (idw * out.strides[3] + idz * out.strides[2]) +
+                          get_local_id(0),
+                      wkg_moment_sum[get_local_id(0)]);
+}
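Review note on `fatomic_add_l` and `fatomic_add_g` above: core OpenCL C provides `atomic_add` only for integer types, so both helpers emulate a float add with a compare-and-swap loop over the value's bit pattern. The same technique can be sketched on the host with `std::atomic`. This is an illustrative analogue under stated assumptions, not code from the patch: the names `atomicAddFloat` and `loadFloat` are hypothetical, and the float is stored as its 32-bit pattern.

```cpp
#include <atomic>
#include <cstdint>
#include <cstring>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");

// Hypothetical host-side analogue of the kernel helpers: add `operand` to a
// float stored as raw bits, retrying until no other thread changed the value
// between our read and our write.
inline void atomicAddFloat(std::atomic<std::uint32_t>& target, float operand) {
    std::uint32_t expected = target.load();
    std::uint32_t desired  = 0;
    do {
        float current;
        std::memcpy(&current, &expected, sizeof(current));
        const float sum = current + operand;
        std::memcpy(&desired, &sum, sizeof(desired));
        // On failure compare_exchange_weak reloads `expected`, so the next
        // iteration recomputes the sum from the fresh value.
    } while (!target.compare_exchange_weak(expected, desired));
}

// Read the stored bit pattern back as a float.
inline float loadFloat(const std::atomic<std::uint32_t>& target) {
    const std::uint32_t bits = target.load();
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

The kernel version uses a union for the reinterpretation, which OpenCL C permits; the sketch uses `std::memcpy` because reading an inactive union member is undefined behaviour in C++.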
diff --git a/src/backend/opencl/kernel/moments.hpp b/src/backend/opencl/kernel/moments.hpp
new file mode 100644
index 0000000000..2ab1185516
--- /dev/null
+++ b/src/backend/opencl/kernel/moments.hpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/moments.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void moments(Param out, const Param in, af_moment_type moment) {
+    constexpr int THREADS = 128;
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(out.info.dims[0]),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(MOMENTS_SZ, out.info.dims[0]),
+        getTypeBuildDefinition<T>()};
+
+    auto momentsOp =
+        common::getKernel("moments", {{moments_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS, 1, 1);
+    cl::NDRange global(in.info.dims[1] * local[0],
+                       in.info.dims[2] * in.info.dims[3] * local[1]);
+
+    bool pBatch = !(in.info.dims[2] == 1 && in.info.dims[3] == 1);
+
+    momentsOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, (int)moment, (int)pBatch);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
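Review note on the `moments` kernel and wrapper above: each requested moment bit adds one slot to the output, in the order M00, M01, M10, M11, with `idx` running along dimension 0 and `idy` along dimension 1. A serial reference for a single 2D image can be useful when checking the kernel's results. The sketch below assumes a densely packed image with dimension 0 contiguous and assumes the output length equals the number of set bits; `momentsReference` and the `MomentBit` enum are hypothetical names that only mirror the `AF_MOMENT_*` defines in the kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bit values copied from the AF_MOMENT_* defines in moments.cl.
enum MomentBit { M00 = 1, M01 = 2, M10 = 4, M11 = 8 };

// Hypothetical serial reference: accumulate the requested raw moments of a
// dim0 x dim1 image, packing results in the order M00, M01, M10, M11.
inline std::vector<float> momentsReference(const std::vector<float>& img,
                                           int dim0, int dim1, int moment) {
    assert(dim0 >= 0 && dim1 >= 0);
    assert(img.size() == static_cast<std::size_t>(dim0) * dim1);

    std::size_t count = 0;
    for (int bit : {M00, M01, M10, M11}) {
        if (moment & bit) { ++count; }
    }

    std::vector<float> out(count, 0.f);
    for (int y = 0; y < dim1; ++y) {
        for (int x = 0; x < dim0; ++x) {
            const float val = img[static_cast<std::size_t>(y) * dim0 + x];
            std::size_t off = 0;
            if (moment & M00) { out[off++] += val; }
            if (moment & M01) { out[off++] += x * val; }
            if (moment & M10) { out[off++] += y * val; }
            if (moment & M11) { out[off] += x * y * val; }
        }
    }
    return out;
}
```

Because the kernel accumulates with floating-point atomics in a non-deterministic order, a comparison against a reference like this one would normally use a tolerance rather than exact equality.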
diff --git a/src/backend/opencl/kernel/morph.cl b/src/backend/opencl/kernel/morph.cl
index c92d06f48e..993913628b 100644
--- a/src/backend/opencl/kernel/morph.cl
+++ b/src/backend/opencl/kernel/morph.cl
@@ -7,43 +7,36 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-int lIdx(int x, int y,
-        int stride1, int stride0)
-{
-    return (y*stride1 + x*stride0);
+int lIdx(int x, int y, int stride1, int stride0) {
+    return (y * stride1 + x * stride0);
 }
 
-void load2LocalMem(__local T *  shrd,
-        __global const T *      in,
-        int lx, int ly, int shrdStride,
-        int dim0, int dim1,
-        int gx, int gy,
-        int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-    shrd[ lIdx(lx, ly, shrdStride, 1) ] = in[ lIdx(gx_, gy_, inStride1, inStride0) ];
+void load2LocalMem(local T* shrd, global const T* in, int lx, int ly,
+                   int shrdStride, int dim0, int dim1, int gx, int gy,
+                   int inStride1, int inStride0) {
+    T val = gx >= 0 && gx < dim0 && gy >= 0 && gy < dim1
+                ? in[lIdx(gx, gy, inStride1, inStride0)]
+                : init;
+    shrd[lIdx(lx, ly, shrdStride, 1)] = val;
 }
 
-//kernel assumes four dimensions
-//doing this to reduce one uneccesary parameter
-__kernel
-void morph(__global T *              out,
-           KParam                  oInfo,
-           __global const T *         in,
-           KParam                  iInfo,
-           __constant const T *   d_filt,
-           __local T *          localMem,
-           int nBBS0, int nBBS1)
-{
-    const int halo   = windLen/2;
-    const int padding= 2*halo;
-    const int shrdLen= get_local_size(0) + padding + 1;
+// kernel assumes four dimensions
+// doing this to reduce one unnecessary parameter
+kernel void morph(global T* out, KParam oInfo, __global const T* in,
+                    KParam iInfo, __constant const T* d_filt,
+                    local T* localMem, int nBBS0, int nBBS1, int windLen) {
+    if (SeLength > 0) windLen = SeLength;
+
+    const int halo = windLen / 2;
+    const int padding =
+        (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+    const int shrdLen  = get_local_size(0) + padding + 1;
+    const int shrdLen1 = get_local_size(1) + padding;
 
     // gfor batch offsets
     int b2 = get_group_id(0) / nBBS0;
     int b3 = get_group_id(1) / nBBS1;
-    in  += (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
+    in += (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
     out += (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
 
     // local neighborhood indices
@@ -51,52 +44,35 @@ void morph(__global T *              out,
     const int ly = get_local_id(1);
 
     // global indices
-    int gx = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    int gy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
-
-    // offset values for pulling image to local memory
-    int lx2      = lx + get_local_size(0);
-    int ly2      = ly + get_local_size(1);
-    int gx2      = gx + get_local_size(0);
-    int gy2      = gy + get_local_size(1);
-
-    // pull image to local memory
-    load2LocalMem(localMem, in, lx, ly, shrdLen,
-                  iInfo.dims[0], iInfo.dims[1],
-                  gx-halo, gy-halo,
-                  iInfo.strides[1], iInfo.strides[0]);
-    if (lx<padding) {
-        load2LocalMem(localMem, in, lx2, ly, shrdLen,
-                      iInfo.dims[0], iInfo.dims[1],
-                      gx2-halo, gy-halo,
-                      iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding) {
-        load2LocalMem(localMem, in, lx, ly2, shrdLen,
-                      iInfo.dims[0], iInfo.dims[1],
-                      gx-halo, gy2-halo,
-                      iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2LocalMem(localMem, in, lx2, ly2, shrdLen,
-                      iInfo.dims[0], iInfo.dims[1],
-                      gx2-halo, gy2-halo,
-                      iInfo.strides[1], iInfo.strides[0]);
+    int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
+
+    int s0 = iInfo.strides[0];
+    int s1 = iInfo.strides[1];
+    int d0 = iInfo.dims[0];
+    int d1 = iInfo.dims[1];
+    for (int b = ly, gy2 = gy; b < shrdLen1;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            load2LocalMem(localMem, in, a, b, shrdLen, d0, d1, gx2 - halo,
+                          gy2 - halo, s1, s0);
+        }
     }
 
     int i = lx + halo;
     int j = ly + halo;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    T acc = localMem[ lIdx(i, j, shrdLen, 1) ];
+    T acc = init;
 #pragma unroll
-    for(int wj=0; wj<windLen; ++wj) {
-        int joff   = wj*windLen;
-        int w_joff = (j+wj-halo)*shrdLen;
+    for (int wj = 0; wj < windLen; ++wj) {
+        int joff   = wj * windLen;
+        int w_joff = (j + wj - halo) * shrdLen;
 #pragma unroll
-        for(int wi=0; wi<windLen; ++wi) {
-            T cur  = localMem[w_joff+i+wi-halo];
-            if (d_filt[joff+wi]) {
+        for (int wi = 0; wi < windLen; ++wi) {
+            T cur = localMem[w_joff + i + wi - halo];
+            if (d_filt[joff + wi]) {
                 if (isDilation)
                     acc = max(acc, cur);
                 else
@@ -105,141 +81,93 @@ void morph(__global T *              out,
         }
     }
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1]) {
-        int outIdx = lIdx(gx, gy, oInfo.strides[1], oInfo.strides[0]);
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1]) {
+        int outIdx  = lIdx(gx, gy, oInfo.strides[1], oInfo.strides[0]);
         out[outIdx] = acc;
     }
 }
 
-
-
-int lIdx3D(int x, int y, int z,
-        int stride2, int stride1, int stride0)
-{
-    return (z*stride2 + y*stride1 + x*stride0);
+int lIdx3D(int x, int y, int z, int stride2, int stride1, int stride0) {
+    return (z * stride2 + y * stride1 + x * stride0);
 }
 
-void load2LocVolume(__local T * shrd,
-        __global const T * in,
-        int lx, int ly, int lz,
-        int shrdStride1, int shrdStride2,
-        int dim0, int dim1, int dim2,
-        int gx, int gy, int gz,
-        int inStride2, int inStride1, int inStride0)
-{
-    int gx_  = clamp(gx, 0, dim0-1);
-    int gy_  = clamp(gy, 0, dim1-1);
-    int gz_  = clamp(gz, 0, dim2-1);
-    int shrdIdx = lx + ly*shrdStride1 + lz*shrdStride2;
-    int inIdx   = gx_*inStride0 + gy_*inStride1 + gz_*inStride2;
-    shrd[ shrdIdx ] = in[ inIdx ];
+void load2LocVolume(local T* shrd, global const T* in, int lx, int ly,
+                    int lz, int shrdStride1, int shrdStride2, int dim0,
+                    int dim1, int dim2, int gx, int gy, int gz, int inStride2,
+                    int inStride1, int inStride0) {
+    T val = (T)0;
+    if (gx >= 0 && gx < dim0 && gy >= 0 && gy < dim1 && gz >= 0 && gz < dim2)
+        val = in[gx * inStride0 + gy * inStride1 + gz * inStride2];
+    else
+        val = init;
+
+    shrd[lx + ly * shrdStride1 + lz * shrdStride2] = val;
 }
 
-__kernel
-void morph3d(__global T *         out,
-             KParam               oInfo,
-             __global const T *   in,
-             KParam               iInfo,
-             __constant const T * d_filt,
-             __local T *          localMem,
-             int             nBBS)
-{
-    const int halo   = windLen/2;
-    const int padding= 2*halo;
-
-    const int se_area   = windLen*windLen;
-    const int shrdLen   = get_local_size(0) + padding + 1;
-    const int shrdArea  = shrdLen * (get_local_size(1)+padding);
+kernel void morph3d(global T* out, KParam oInfo, __global const T* in,
+                      KParam iInfo, __constant const T* d_filt,
+                      local T* localMem, int nBBS) {
+    const int halo = SeLength / 2;
+    const int padding =
+        (SeLength % 2 == 0 ? (SeLength - 1) : (2 * (SeLength / 2)));
+    const int se_area  = SeLength * SeLength;
+    const int shrdLen  = get_local_size(0) + padding + 1;
+    const int shrdLen1 = get_local_size(1) + padding;
+    const int shrdLen2 = get_local_size(2) + padding;
+    const int shrdArea = shrdLen * shrdLen1;
 
     // gfor batch offsets
-    int batchId    = get_group_id(0) / nBBS;
-    in  += (batchId * iInfo.strides[3] + iInfo.offset);
+    int batchId = get_group_id(0) / nBBS;
+    in += (batchId * iInfo.strides[3] + iInfo.offset);
     out += (batchId * oInfo.strides[3]);
 
-    int gx, gy, gz, i, j, k;
-    { // scoping out unnecessary variables
     const int lx = get_local_id(0);
     const int ly = get_local_id(1);
     const int lz = get_local_id(2);
 
-    gx = get_local_size(0) * (get_group_id(0)-batchId*nBBS) + lx;
-    gy = get_local_size(1) * get_group_id(1) + ly;
-    gz = get_local_size(2) * get_group_id(2) + lz;
-
-    const int gx2 = gx + get_local_size(0);
-    const int gy2 = gy + get_local_size(1);
-    const int gz2 = gz + get_local_size(2);
-    const int lx2 = lx + get_local_size(0);
-    const int ly2 = ly + get_local_size(1);
-    const int lz2 = lz + get_local_size(2);
-
-    // pull volume to shared memory
-    load2LocVolume(localMem, in, lx, ly, lz, shrdLen, shrdArea,
-                    iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                    gx-halo, gy-halo, gz-halo,
-                    iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    if (lx<padding) {
-        load2LocVolume(localMem, in, lx2, ly, lz, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx2-halo, gy-halo, gz-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding) {
-        load2LocVolume(localMem, in, lx, ly2, lz, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx-halo, gy2-halo, gz-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lz<padding) {
-        load2LocVolume(localMem, in, lx, ly, lz2, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx-halo, gy-halo, gz2-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        load2LocVolume(localMem, in, lx2, ly2, lz, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx2-halo, gy2-halo, gz-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding && lz<padding) {
-        load2LocVolume(localMem, in, lx, ly2, lz2, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx-halo, gy2-halo, gz2-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lz<padding && lx<padding) {
-        load2LocVolume(localMem, in, lx2, ly, lz2, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx2-halo, gy-halo, gz2-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding && lz<padding) {
-        load2LocVolume(localMem, in, lx2, ly2, lz2, shrdLen, shrdArea,
-                       iInfo.dims[0], iInfo.dims[1], iInfo.dims[2],
-                       gx2-halo, gy2-halo, gz2-halo,
-                       iInfo.strides[2], iInfo.strides[1], iInfo.strides[0]);
+    const int gx = get_local_size(0) * (get_group_id(0) - batchId * nBBS) + lx;
+    const int gy = get_local_size(1) * get_group_id(1) + ly;
+    const int gz = get_local_size(2) * get_group_id(2) + lz;
+
+    int s0 = iInfo.strides[0];
+    int s1 = iInfo.strides[1];
+    int s2 = iInfo.strides[2];
+    int d0 = iInfo.dims[0];
+    int d1 = iInfo.dims[1];
+    int d2 = iInfo.dims[2];
+
+    for (int c = lz, gz2 = gz; c < shrdLen2;
+         c += get_local_size(2), gz2 += get_local_size(2)) {
+        for (int b = ly, gy2 = gy; b < shrdLen1;
+             b += get_local_size(1), gy2 += get_local_size(1)) {
+            for (int a = lx, gx2 = gx; a < shrdLen;
+                 a += get_local_size(0), gx2 += get_local_size(0)) {
+                load2LocVolume(localMem, in, a, b, c, shrdLen, shrdArea, d0, d1,
+                               d2, gx2 - halo, gy2 - halo, gz2 - halo, s2, s1,
+                               s0);
+            }
+        }
     }
+
     barrier(CLK_LOCAL_MEM_FENCE);
-    // indices of voxel owned by current thread
-    i  = lx + halo;
-    j  = ly + halo;
-    k  = lz + halo;
-    }
 
-    T acc = localMem[ lIdx3D(i, j, k, shrdArea, shrdLen, 1) ];
+    int i = lx + halo;
+    int j = ly + halo;
+    int k = lz + halo;
+
+    T acc = init;
 #pragma unroll
-    for(int wk=0; wk<windLen; ++wk) {
-        int koff   = wk*se_area;
-        int w_koff = (k+wk-halo)*shrdArea;
+    for (int wk = 0; wk < SeLength; ++wk) {
+        int koff   = wk * se_area;
+        int w_koff = (k + wk - halo) * shrdArea;
 #pragma unroll
-        for(int wj=0; wj<windLen; ++wj) {
-        int joff   = wj*windLen;
-        int w_joff = (j+wj-halo)*shrdLen;
+        for (int wj = 0; wj < SeLength; ++wj) {
+            int joff   = wj * SeLength;
+            int w_joff = (j + wj - halo) * shrdLen;
 #pragma unroll
-            for(int wi=0; wi<windLen; ++wi) {
-                T cur  = localMem[w_koff+w_joff + i+wi-halo];
-                if (d_filt[koff+joff+wi]) {
+            for (int wi = 0; wi < SeLength; ++wi) {
+                T cur = localMem[w_koff + w_joff + i + wi - halo];
+                if (d_filt[koff + joff + wi]) {
                     if (isDilation)
                         acc = max(acc, cur);
                     else
@@ -249,10 +177,9 @@ void morph3d(__global T *         out,
         }
     }
 
-    if (gx<oInfo.dims[0] && gy<oInfo.dims[1] && gz<oInfo.dims[2]) {
-        int outIdx = gz * oInfo.strides[2] +
-            gy * oInfo.strides[1] +
-            gx * oInfo.strides[0];
+    if (gx < oInfo.dims[0] && gy < oInfo.dims[1] && gz < oInfo.dims[2]) {
+        int outIdx = gz * oInfo.strides[2] + gy * oInfo.strides[1] +
+                     gx * oInfo.strides[0];
         out[outIdx] = acc;
     }
 }
diff --git a/src/backend/opencl/kernel/morph.hpp b/src/backend/opencl/kernel/morph.hpp
index 1a93826581..473de659f2 100644
--- a/src/backend/opencl/kernel/morph.hpp
+++ b/src/backend/opencl/kernel/morph.hpp
@@ -8,170 +8,141 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/morph.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/morph.hpp>
 #include <memory.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-static const int CUBE_X    =  8;
-static const int CUBE_Y    =  8;
-static const int CUBE_Z    =  4;
-
-template<typename T, bool isDilation, int windLen>
-void morph(Param         out,
-        const Param      in,
-        const Param      mask)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> morProgs;
-        static std::map<int, Kernel*> morKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D isDilation="<< isDilation
-                            << " -D windLen=" << windLen;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, morph_cl, morph_cl_len, options.str());
-                    morProgs[device]   = new Program(prog);
-                    morKernels[device] = new Kernel(*morProgs[device], "morph");
-                });
-
-        auto morphOp = make_kernel<Buffer, KParam,
-                                   Buffer, KParam,
-                                   Buffer, cl::LocalSpaceArg,
-                                   int, int
-                                  >(*morKernels[device]);
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
-        // launch batch * blk_x blocks along x dimension
-        NDRange global(blk_x * THREADS_X * in.info.dims[2],
-                       blk_y * THREADS_Y * in.info.dims[3]);
-
-        // copy mask/filter to constant memory
-        cl_int se_size   = sizeof(T)*windLen*windLen;
-        cl::Buffer *mBuff = bufferAlloc(se_size);
-        getQueue().enqueueCopyBuffer(*mask.data, *mBuff, 0, 0, se_size);
-
-        // calculate shared memory size
-        const int halo    = windLen/2;
-        const int padding = 2*halo;
-        const int locLen  = THREADS_X + padding + 1;
-        const int locSize = locLen * (THREADS_Y+padding);
-
-        morphOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info, *mBuff,
-                cl::Local(locSize*sizeof(T)), blk_x, blk_y);
-
-        bufferFree(mBuff);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-template<typename T, bool isDilation, int windLen>
-void morph3d(Param       out,
-        const Param      in,
-        const Param      mask)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  morProgs;
-        static std::map<int, Kernel*> morKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D isDilation="<< isDilation
-                            << " -D windLen=" << windLen;
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, morph_cl, morph_cl_len, options.str());
-                    morProgs[device]   = new Program(prog);
-                    morKernels[device] = new Kernel(*morProgs[device], "morph3d");
-                });
-
-        auto morphOp = make_kernel<Buffer, KParam,
-                                   Buffer, KParam,
-                                   Buffer, cl::LocalSpaceArg, int
-                                  >(*morKernels[device]);
-
-        NDRange local(CUBE_X, CUBE_Y, CUBE_Z);
-
-        int blk_x = divup(in.info.dims[0], CUBE_X);
-        int blk_y = divup(in.info.dims[1], CUBE_Y);
-        int blk_z = divup(in.info.dims[2], CUBE_Z);
-        // launch batch * blk_x blocks along x dimension
-        NDRange global(blk_x * CUBE_X * in.info.dims[3],
-                       blk_y * CUBE_Y,
-                       blk_z * CUBE_Z);
-
-        // copy mask/filter to constant memory
-        cl_int se_size   = sizeof(T)*windLen*windLen*windLen;
-        cl::Buffer *mBuff = bufferAlloc(se_size);
-        getQueue().enqueueCopyBuffer(*mask.data, *mBuff, 0, 0, se_size);
-
-        // calculate shared memory size
-        const int halo    = windLen/2;
-        const int padding = 2*halo;
-        const int locLen  = CUBE_X+padding+1;
-        const int locArea = locLen *(CUBE_Y+padding);
-        const int locSize = locArea*(CUBE_Z+padding);
-
-        morphOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *in.data, in.info,
-                *mBuff, cl::Local(locSize*sizeof(T)), blk_x);
-
-        bufferFree(mBuff);
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void morph(Param out, const Param in, const Param mask, bool isDilation) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::make_unique;
+    using std::string;
+    using std::vector;
+
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+
+    ToNumStr<T> toNumStr;
+    const T DefaultVal = isDilation ? common::Binary<T, af_max_t>::init()
+                                    : common::Binary<T, af_min_t>::init();
+    const int windLen  = mask.info.dims[0];
+    const int SeLength = (windLen <= 10 ? windLen : 0);
+
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(isDilation),
+        TemplateArg(SeLength),
+    };
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(isDilation),
+        DefineValue(SeLength),
+        DefineKeyValue(init, toNumStr(DefaultVal)),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto morphOp = common::getKernel("morph", {{morph_cl_src}}, targs, options);
+
+    NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    NDRange global(blk_x * THREADS_X * in.info.dims[2],
+                   blk_y * THREADS_Y * in.info.dims[3]);
+
+    // copy mask/filter to read-only memory
+    auto seBytes = windLen * windLen * sizeof(T);
+    auto mBuff =
+        make_unique<cl::Buffer>(getContext(), CL_MEM_READ_ONLY, seBytes);
+    morphOp.copyToReadOnly(mBuff.get(), mask.data, seBytes);
+
+    // calculate shared memory size
+    const int padding =
+        (windLen % 2 == 0 ? (windLen - 1) : (2 * (windLen / 2)));
+    const int locLen  = THREADS_X + padding + 1;
+    const int locSize = locLen * (THREADS_Y + padding);
+
+    morphOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+            *in.data, in.info, *mBuff, cl::Local(locSize * sizeof(T)), blk_x,
+            blk_y, windLen);
+    CL_DEBUG_FINISH(getQueue());
 }
 
+template<typename T>
+void morph3d(Param out, const Param in, const Param mask, bool isDilation) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::make_unique;
+    using std::string;
+    using std::vector;
+
+    constexpr int CUBE_X = 8;
+    constexpr int CUBE_Y = 8;
+    constexpr int CUBE_Z = 4;
+
+    ToNumStr<T> toNumStr;
+    const T DefaultVal = isDilation ? common::Binary<T, af_max_t>::init()
+                                    : common::Binary<T, af_min_t>::init();
+    const int SeLength = mask.info.dims[0];
+
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(isDilation),
+        TemplateArg(SeLength),
+    };
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(isDilation),
+        DefineValue(SeLength),
+        DefineKeyValue(init, toNumStr(DefaultVal)),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    auto morphOp =
+        common::getKernel("morph3d", {{morph_cl_src}}, targs, options);
+
+    NDRange local(CUBE_X, CUBE_Y, CUBE_Z);
+
+    int blk_x = divup(in.info.dims[0], CUBE_X);
+    int blk_y = divup(in.info.dims[1], CUBE_Y);
+    int blk_z = divup(in.info.dims[2], CUBE_Z);
+
+    NDRange global(blk_x * CUBE_X * in.info.dims[3], blk_y * CUBE_Y,
+                   blk_z * CUBE_Z);
+
+    cl_int seBytes = sizeof(T) * SeLength * SeLength * SeLength;
+    auto mBuff =
+        make_unique<cl::Buffer>(getContext(), CL_MEM_READ_ONLY, seBytes);
+    morphOp.copyToReadOnly(mBuff.get(), mask.data, seBytes);
+
+    // calculate local memory size
+    const int padding =
+        (SeLength % 2 == 0 ? (SeLength - 1) : (2 * (SeLength / 2)));
+    const int locLen  = CUBE_X + padding + 1;
+    const int locArea = locLen * (CUBE_Y + padding);
+    const int locSize = locArea * (CUBE_Z + padding);
+
+    morphOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+            *in.data, in.info, *mBuff, cl::Local(locSize * sizeof(T)), blk_x);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/names.hpp b/src/backend/opencl/kernel/names.hpp
index 1602f79482..2dc4e63254 100644
--- a/src/backend/opencl/kernel/names.hpp
+++ b/src/backend/opencl/kernel/names.hpp
@@ -8,13 +8,39 @@
  ********************************************************/
 
 #pragma once
-#include <ops.hpp>
-template<af_op_t T> static const char *binOpName() { return "ADD_OP"; }
+#include <common/defines.hpp>
+#include <optypes.hpp>
 
-template<> STATIC_ const char *binOpName<af_add_t>() { return "ADD_OP"; }
-template<> STATIC_ const char *binOpName<af_mul_t>() { return "MUL_OP"; }
-template<> STATIC_ const char *binOpName<af_and_t>() { return "AND_OP"; }
-template<> STATIC_ const char *binOpName<af_or_t >() { return "OR_OP" ; }
-template<> STATIC_ const char *binOpName<af_min_t>() { return "MIN_OP"; }
-template<> STATIC_ const char *binOpName<af_max_t>() { return "MAX_OP"; }
-template<> STATIC_ const char *binOpName<af_notzero_t>() { return "NOTZERO_OP"; }
+template<af_op_t T>
+static const char *binOpName() {
+    return "ADD_OP";
+}
+
+template<>
+inline const char *binOpName<af_add_t>() {
+    return "ADD_OP";
+}
+template<>
+inline const char *binOpName<af_mul_t>() {
+    return "MUL_OP";
+}
+template<>
+inline const char *binOpName<af_and_t>() {
+    return "AND_OP";
+}
+template<>
+inline const char *binOpName<af_or_t>() {
+    return "OR_OP";
+}
+template<>
+inline const char *binOpName<af_min_t>() {
+    return "MIN_OP";
+}
+template<>
+inline const char *binOpName<af_max_t>() {
+    return "MAX_OP";
+}
+template<>
+inline const char *binOpName<af_notzero_t>() {
+    return "NOTZERO_OP";
+}
diff --git a/src/backend/opencl/kernel/nearest_neighbour.cl b/src/backend/opencl/kernel/nearest_neighbour.cl
new file mode 100644
index 0000000000..2c54b8d8af
--- /dev/null
+++ b/src/backend/opencl/kernel/nearest_neighbour.cl
@@ -0,0 +1,109 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// OpenCL < 1.2 compatibility
+#if !defined(__OPENCL_VERSION__) || __OPENCL_VERSION__ < 120
+__inline unsigned popcount(unsigned x) {
+    x = x - ((x >> 1) & 0x55555555);
+    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
+    x = (x + (x >> 4)) & 0x0F0F0F0F;
+    x = x + (x >> 8);
+    x = x + (x >> 16);
+    return x & 0x0000003F;
+}
+#endif
+
+#ifdef USE_DOUBLE
+To _sad_(T v1, T v2) { return fabs(v1 - v2); }
+#else
+To _sad_(T v1, T v2) { return fabs((float)v1 - (float)v2); }
+#endif
+
+To _ssd_(T v1, T v2) { return (v1 - v2) * (v1 - v2); }
+
+#ifdef __SHD__
+unsigned _shd_(T v1, T v2) { return popcount(v1 ^ v2); }
+#endif
+
+kernel void knnAllDistances(global To* out_dist, global const T* query,
+                            KParam qInfo, global const T* train, KParam tInfo,
+                            const To max_dist, const unsigned feat_len,
+                            const unsigned max_feat_len,
+                            const unsigned feat_offset, local T* lmem) {
+    unsigned nquery = qInfo.dims[0];
+    unsigned ntrain = tInfo.dims[0];
+
+    unsigned f   = get_global_id(0);
+    unsigned tid = get_local_id(0);
+
+    local To l_dist[THREADS];
+
+    local T* l_query = lmem;
+    local T* l_train = lmem + max_feat_len;
+
+    l_dist[tid] = max_dist;
+
+    bool valid_feat = (f < ntrain);
+
+#ifdef USE_LOCAL_MEM
+    if (valid_feat) {
+        // Copy local_size(0) training features to local memory
+        unsigned end_feat = min(feat_offset + max_feat_len, feat_len);
+        for (unsigned i = feat_offset; i < end_feat; i++) {
+            l_train[(i - feat_offset) * get_local_size(0) + tid] =
+                train[i * ntrain + f + tInfo.offset];
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+#endif
+
+    for (int j = 0; j < (int)nquery; j++) {
+        l_dist[tid] = max_dist;
+
+        // Load one query feature that will be tested against all training
+        // features in current block
+        if (tid < max_feat_len) {
+            l_query[tid] =
+                query[(tid + feat_offset) * nquery + j + qInfo.offset];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        To dist = 0;
+        if (valid_feat) {
+            unsigned feat_end = min(feat_offset + max_feat_len, feat_len);
+            for (unsigned k = feat_offset; k < feat_end; k++) {
+                // Apply the selected distance operator (DISTOP) to one
+                // feature element and accumulate the result into dist
+#ifdef USE_LOCAL_MEM
+                dist +=
+                    DISTOP(l_train[(k - feat_offset) * get_local_size(0) + tid],
+                           l_query[k - feat_offset]);
+#else
+                dist += DISTOP(train[k * ntrain + f + tInfo.offset],
+                               l_query[k - feat_offset]);
+#endif
+            }
+        }
+
+        // Stage this training feature's accumulated distance in local
+        // memory (invalid threads keep max_dist)
+        if (valid_feat) { l_dist[tid] = dist; }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Write the distance to the output on the first feature chunk and
+        // accumulate into it on subsequent chunks
+        if (valid_feat) {
+            if (feat_offset == 0)
+                out_dist[j * ntrain + f] = l_dist[tid];
+            else
+                out_dist[j * ntrain + f] += l_dist[tid];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+}
diff --git a/src/backend/opencl/kernel/nearest_neighbour.hpp b/src/backend/opencl/kernel/nearest_neighbour.hpp
new file mode 100644
index 0000000000..cac36cab33
--- /dev/null
+++ b/src/backend/opencl/kernel/nearest_neighbour.hpp
@@ -0,0 +1,97 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/nearest_neighbour.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T, typename To>
+void allDistances(Param dist, Param query, Param train, const dim_t dist_dim,
+                  af_match_type dist_type) {
+    constexpr unsigned THREADS = 256;
+
+    const unsigned feat_len = static_cast<uint>(query.info.dims[dist_dim]);
+    const unsigned max_kern_feat_len =
+        min(THREADS, static_cast<unsigned>(feat_len));
+    const To max_dist = maxval<To>();
+
+    // Use local memory for training features (faster) if enough is available
+    cl_ulong avail_lmem = getDevice().getInfo<CL_DEVICE_LOCAL_MEM_SIZE>();
+    size_t lmem_predef =
+        2 * THREADS * sizeof(unsigned) + max_kern_feat_len * sizeof(T);
+    size_t ltrain_sz = THREADS * max_kern_feat_len * sizeof(T);
+    bool use_lmem    = avail_lmem >= (lmem_predef + ltrain_sz);
+    size_t lmem_sz   = (use_lmem) ? lmem_predef + ltrain_sz : lmem_predef;
+
+    unsigned unroll_len = nextpow2(feat_len);
+    if (unroll_len != feat_len) unroll_len = 0;
+
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(dist_type),
+        TemplateArg(use_lmem),
+        TemplateArg(unroll_len),
+    };
+
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineValue(THREADS),
+        DefineKeyValue(FEAT_LEN, unroll_len),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+    if (use_lmem) { options.emplace_back(DefineKey(USE_LOCAL_MEM)); }
+    if (dist_type == AF_SAD) {
+        options.emplace_back(DefineKeyValue(DISTOP, "_sad_"));
+    }
+    if (dist_type == AF_SSD) {
+        options.emplace_back(DefineKeyValue(DISTOP, "_ssd_"));
+    }
+    if (dist_type == AF_SHD) {
+        options.emplace_back(DefineKeyValue(DISTOP, "_shd_"));
+        options.emplace_back(DefineKey(__SHD__));
+    }
+    auto hmOp = common::getKernel("knnAllDistances",
+                                  {{nearest_neighbour_cl_src}}, targs, options);
+
+    const dim_t sample_dim = (dist_dim == 0) ? 1 : 0;
+
+    const unsigned ntrain = train.info.dims[sample_dim];
+
+    unsigned nblk = divup(ntrain, THREADS);
+    const cl::NDRange local(THREADS, 1);
+    const cl::NDRange global(nblk * THREADS, 1);
+
+    // Compute the distance between every query and training vector,
+    // processing the feature dimension in chunks of THREADS elements
+    for (uint feat_offset = 0; feat_offset < feat_len; feat_offset += THREADS) {
+        hmOp(cl::EnqueueArgs(getQueue(), global, local), *dist.data,
+             *query.data, query.info, *train.data, train.info, max_dist,
+             feat_len, max_kern_feat_len, feat_offset, cl::Local(lmem_sz));
+        CL_DEBUG_FINISH(getQueue());
+    }
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/nonmax_suppression.cl b/src/backend/opencl/kernel/nonmax_suppression.cl
new file mode 100644
index 0000000000..e1c93f6add
--- /dev/null
+++ b/src/backend/opencl/kernel/nonmax_suppression.cl
@@ -0,0 +1,123 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void nonMaxSuppressionKernel(global T* output, KParam oInfo,
+                                    global const T* in, KParam inInfo,
+                                    global const T* dx, KParam dxInfo,
+                                    global const T* dy, KParam dyInfo,
+                                    unsigned nBBS0, unsigned nBBS1) {
+    // local thread indices
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = get_group_id(0) / nBBS0;
+    const unsigned b3 = get_group_id(1) / nBBS1;
+
+    // global indices
+    const int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    const int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
+
+    local T localMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+
+    global const T* mag =
+        in + (b2 * inInfo.strides[2] + b3 * inInfo.strides[3] + inInfo.offset);
+    global const T* dX =
+        dx + (b2 * dxInfo.strides[2] + b3 * dxInfo.strides[3] + dxInfo.offset) +
+        dxInfo.strides[1] + 1;
+    global const T* dY =
+        dy + (b2 * dyInfo.strides[2] + b3 * dyInfo.strides[3] + dyInfo.offset) +
+        dyInfo.strides[1] + 1;
+    global T* out = output + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]) +
+                      oInfo.strides[1] + 1;
+
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < SHRD_MEM_HEIGHT && gy2 < inInfo.dims[1];
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < SHRD_MEM_WIDTH && gx2 < inInfo.dims[0];
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            localMem[b][a] =
+                mag[(gx2)*inInfo.strides[0] + (gy2)*inInfo.strides[1]];
+        }
+    }
+    int i = lx + 1;
+    int j = ly + 1;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (gx < inInfo.dims[0] - 2 && gy < inInfo.dims[1] - 2) {
+        int idx = gx * inInfo.strides[0] + gy * inInfo.strides[1];
+
+        const float cmag = localMem[j][i];
+
+        if (cmag == 0.0f)
+            out[idx] = (T)0;
+        else {
+            const float dx = dX[idx];
+            const float dy = dY[idx];
+            const float se = localMem[j + 1][i + 1];
+            const float nw = localMem[j - 1][i - 1];
+            const float ea = localMem[j][i + 1];
+            const float we = localMem[j][i - 1];
+            const float ne = localMem[j - 1][i + 1];
+            const float sw = localMem[j + 1][i - 1];
+            const float no = localMem[j - 1][i];
+            const float so = localMem[j + 1][i];
+
+            float a1, a2, b1, b2, alpha;
+
+            if (dx >= 0) {
+                if (dy >= 0) {
+                    const bool isDxMagGreater = (dx - dy) >= 0;
+
+                    a1    = isDxMagGreater ? ea : so;
+                    a2    = isDxMagGreater ? we : no;
+                    b1    = se;
+                    b2    = nw;
+                    alpha = isDxMagGreater ? dy / dx : dx / dy;
+                } else {
+                    const bool isDyMagGreater = (dx + dy) >= 0;
+
+                    a1    = isDyMagGreater ? ea : no;
+                    a2    = isDyMagGreater ? we : so;
+                    b1    = ne;
+                    b2    = sw;
+                    alpha = isDyMagGreater ? -dy / dx : dx / -dy;
+                }
+            } else {
+                if (dy >= 0) {
+                    const bool isDxMagGreater = (dx + dy) >= 0;
+
+                    a1    = isDxMagGreater ? so : we;
+                    a2    = isDxMagGreater ? no : ea;
+                    b1    = sw;
+                    b2    = ne;
+                    alpha = isDxMagGreater ? -dx / dy : dy / -dx;
+                } else {
+                    const bool isDyMagGreater = (-dx + dy) >= 0;
+
+                    a1    = isDyMagGreater ? we : no;
+                    a2    = isDyMagGreater ? ea : so;
+                    b1    = nw;
+                    b2    = se;
+                    alpha = isDyMagGreater ? dy / dx : dx / dy;
+                }
+            }
+
+            float mag1 = (1 - alpha) * a1 + alpha * b1;
+            float mag2 = (1 - alpha) * a2 + alpha * b2;
+
+            if (cmag > mag1 && cmag > mag2) {
+                out[idx] = cmag;
+            } else {
+                out[idx] = (T)0;
+            }
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/ops.cl b/src/backend/opencl/kernel/ops.cl
index f81195a135..e383a871b2 100644
--- a/src/backend/opencl/kernel/ops.cl
+++ b/src/backend/opencl/kernel/ops.cl
@@ -7,131 +7,97 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#define IS_NAN(in) !((in) == (in))
+
 #ifdef ADD_OP
-T binOp(T lhs, T rhs)
-{
-    return lhs + rhs;
-}
+T binOp(T lhs, T rhs) { return lhs + rhs; }
 
-To transform(Ti in)
-{
-    return(To)(in);
-}
+To transform(Ti in) { return (To)(in); }
 #endif
 
 #ifdef MUL_OP
 #if CPLX
-T binOp(T lhs, T rhs)
-{
+T binOp(T lhs, T rhs) {
     T out;
     out.x = lhs.x * rhs.x - lhs.y * rhs.y;
     out.y = lhs.x * rhs.y + lhs.y * rhs.x;
     return out;
 }
 #else
-T binOp(T lhs, T rhs)
-{
-    return lhs * rhs;
-}
+T binOp(T lhs, T rhs) { return lhs * rhs; }
 #endif
 
-To transform(Ti in)
-{
-    return(To)(in);
-}
+To transform(Ti in) { return (To)(in); }
 #endif
 
 #ifdef OR_OP
-uchar binOp(uchar lhs, uchar rhs)
-{
-    return lhs || rhs;
-}
+uchar binOp(uchar lhs, uchar rhs) { return lhs || rhs; }
 
 #if CPLX
-uchar transform(Ti in)
-{
-    return (in.x != 0) || (in.y != 0);
-}
+uchar transform(Ti in) { return (in.x != 0) || (in.y != 0); }
 #else
-uchar transform(Ti in)
-{
-    return (in != 0);
-}
+uchar transform(Ti in) { return (in != 0); }
 #endif
 #endif
 
 #ifdef AND_OP
-uchar binOp(uchar lhs, uchar rhs)
-{
-    return lhs && rhs;
-}
+uchar binOp(uchar lhs, uchar rhs) { return lhs && rhs; }
 
 #if CPLX
-uchar transform(Ti in)
-{
-    return (in.x != 0) || (in.y != 0);
-}
+uchar transform(Ti in) { return (in.x != 0) || (in.y != 0); }
 #else
-uchar transform(Ti in)
-{
-    return (in != 0);
-}
+uchar transform(Ti in) { return (in != 0); }
 #endif
 #endif
 
 #ifdef NOTZERO_OP
-uint binOp(uint lhs, uint rhs)
-{
-    return lhs + rhs;
-}
+uint binOp(uint lhs, uint rhs) { return lhs + rhs; }
 
 #if CPLX
-uint transform(Ti in)
-{
-    return (in.x != 0) || (in.y != 0);
-}
+uint transform(Ti in) { return (in.x != 0) || (in.y != 0); }
 #else
-uint transform(Ti in)
-{
-    return (in != 0);
-}
+uint transform(Ti in) { return (in != 0); }
 #endif
 #endif
 
 #ifdef MIN_OP
 
-T transform(T in)
-{
-    return in;
+#if CPLX
+#undef IS_NAN
+#define IS_NAN(in) (!((in).x == (in).x) || !((in).y == (in).y))
+#endif
+
+T transform(T in) {
+    T val = init;
+    return IS_NAN(in) ? (val) : (in);
 }
 
 #if CPLX
-#define sabs(in) ((in.x)*(in.x) + (in.y)*(in.y))
+#define sabs(in) ((in.x) * (in.x) + (in.y) * (in.y))
 #else
 #define sabs(in) in
 #endif
 
-T binOp(T lhs, T rhs)
-{
-    return sabs(lhs) < sabs(rhs) ? lhs : rhs;
-}
+T binOp(T lhs, T rhs) { return sabs(lhs) < sabs(rhs) ? lhs : rhs; }
 #endif
 
 #ifdef MAX_OP
 
-T transform(T in)
-{
-    return in;
+#if CPLX
+#undef IS_NAN
+#define IS_NAN(in) (!((in).x == (in).x) || !((in).y == (in).y))
+#endif
+
+T transform(T in) {
+    T val = init;
+    return IS_NAN(in) ? (val) : (in);
 }
 
 #if CPLX
-#define sabs(in) ((in.x)*(in.x) + (in.y)*(in.y))
+#define sabs(in) ((in.x) * (in.x) + (in.y) * (in.y))
 #else
 #define sabs(in) in
 #endif
 
-T binOp(T lhs, T rhs)
-{
-    return sabs(lhs) > sabs(rhs) ? lhs : rhs;
-}
+T binOp(T lhs, T rhs) { return sabs(lhs) > sabs(rhs) ? lhs : rhs; }
 #endif
diff --git a/src/backend/opencl/kernel/orb.cl b/src/backend/opencl/kernel/orb.cl
index 5c5739b383..d8a31c81ec 100644
--- a/src/backend/opencl/kernel/orb.cl
+++ b/src/backend/opencl/kernel/orb.cl
@@ -11,284 +11,91 @@
 // original ORB paper
 #define REF_PAT_SAMPLES 256
 #define REF_PAT_COORDS 4
-#define REF_PAT_LENGTH (REF_PAT_SAMPLES*REF_PAT_COORDS)
+#define REF_PAT_LENGTH (REF_PAT_SAMPLES * REF_PAT_COORDS)
 
 // Current reference pattern was borrowed from OpenCV, a randomly generated
 // pattern will not achieve same quality as it must be trained like described
 // in sections 4.2 and 4.3 of the original ORB paper.
 __constant int ref_pat[] = {
-    8,-3, 9,5,
-    4,2, 7,-12,
-    -11,9, -8,2,
-    7,-12, 12,-13,
-    2,-13, 2,12,
-    1,-7, 1,6,
-    -2,-10, -2,-4,
-    -13,-13, -11,-8,
-    -13,-3, -12,-9,
-    10,4, 11,9,
-    -13,-8, -8,-9,
-    -11,7, -9,12,
-    7,7, 12,6,
-    -4,-5, -3,0,
-    -13,2, -12,-3,
-    -9,0, -7,5,
-    12,-6, 12,-1,
-    -3,6, -2,12,
-    -6,-13, -4,-8,
-    11,-13, 12,-8,
-    4,7, 5,1,
-    5,-3, 10,-3,
-    3,-7, 6,12,
-    -8,-7, -6,-2,
-    -2,11, -1,-10,
-    -13,12, -8,10,
-    -7,3, -5,-3,
-    -4,2, -3,7,
-    -10,-12, -6,11,
-    5,-12, 6,-7,
-    5,-6, 7,-1,
-    1,0, 4,-5,
-    9,11, 11,-13,
-    4,7, 4,12,
-    2,-1, 4,4,
-    -4,-12, -2,7,
-    -8,-5, -7,-10,
-    4,11, 9,12,
-    0,-8, 1,-13,
-    -13,-2, -8,2,
-    -3,-2, -2,3,
-    -6,9, -4,-9,
-    8,12, 10,7,
-    0,9, 1,3,
-    7,-5, 11,-10,
-    -13,-6, -11,0,
-    10,7, 12,1,
-    -6,-3, -6,12,
-    10,-9, 12,-4,
-    -13,8, -8,-12,
-    -13,0, -8,-4,
-    3,3, 7,8,
-    5,7, 10,-7,
-    -1,7, 1,-12,
-    3,-10, 5,6,
-    2,-4, 3,-10,
-    -13,0, -13,5,
-    -13,-7, -12,12,
-    -13,3, -11,8,
-    -7,12, -4,7,
-    6,-10, 12,8,
-    -9,-1, -7,-6,
-    -2,-5, 0,12,
-    -12,5, -7,5,
-    3,-10, 8,-13,
-    -7,-7, -4,5,
-    -3,-2, -1,-7,
-    2,9, 5,-11,
-    -11,-13, -5,-13,
-    -1,6, 0,-1,
-    5,-3, 5,2,
-    -4,-13, -4,12,
-    -9,-6, -9,6,
-    -12,-10, -8,-4,
-    10,2, 12,-3,
-    7,12, 12,12,
-    -7,-13, -6,5,
-    -4,9, -3,4,
-    7,-1, 12,2,
-    -7,6, -5,1,
-    -13,11, -12,5,
-    -3,7, -2,-6,
-    7,-8, 12,-7,
-    -13,-7, -11,-12,
-    1,-3, 12,12,
-    2,-6, 3,0,
-    -4,3, -2,-13,
-    -1,-13, 1,9,
-    7,1, 8,-6,
-    1,-1, 3,12,
-    9,1, 12,6,
-    -1,-9, -1,3,
-    -13,-13, -10,5,
-    7,7, 10,12,
-    12,-5, 12,9,
-    6,3, 7,11,
-    5,-13, 6,10,
-    2,-12, 2,3,
-    3,8, 4,-6,
-    2,6, 12,-13,
-    9,-12, 10,3,
-    -8,4, -7,9,
-    -11,12, -4,-6,
-    1,12, 2,-8,
-    6,-9, 7,-4,
-    2,3, 3,-2,
-    6,3, 11,0,
-    3,-3, 8,-8,
-    7,8, 9,3,
-    -11,-5, -6,-4,
-    -10,11, -5,10,
-    -5,-8, -3,12,
-    -10,5, -9,0,
-    8,-1, 12,-6,
-    4,-6, 6,-11,
-    -10,12, -8,7,
-    4,-2, 6,7,
-    -2,0, -2,12,
-    -5,-8, -5,2,
-    7,-6, 10,12,
-    -9,-13, -8,-8,
-    -5,-13, -5,-2,
-    8,-8, 9,-13,
-    -9,-11, -9,0,
-    1,-8, 1,-2,
-    7,-4, 9,1,
-    -2,1, -1,-4,
-    11,-6, 12,-11,
-    -12,-9, -6,4,
-    3,7, 7,12,
-    5,5, 10,8,
-    0,-4, 2,8,
-    -9,12, -5,-13,
-    0,7, 2,12,
-    -1,2, 1,7,
-    5,11, 7,-9,
-    3,5, 6,-8,
-    -13,-4, -8,9,
-    -5,9, -3,-3,
-    -4,-7, -3,-12,
-    6,5, 8,0,
-    -7,6, -6,12,
-    -13,6, -5,-2,
-    1,-10, 3,10,
-    4,1, 8,-4,
-    -2,-2, 2,-13,
-    2,-12, 12,12,
-    -2,-13, 0,-6,
-    4,1, 9,3,
-    -6,-10, -3,-5,
-    -3,-13, -1,1,
-    7,5, 12,-11,
-    4,-2, 5,-7,
-    -13,9, -9,-5,
-    7,1, 8,6,
-    7,-8, 7,6,
-    -7,-4, -7,1,
-    -8,11, -7,-8,
-    -13,6, -12,-8,
-    2,4, 3,9,
-    10,-5, 12,3,
-    -6,-5, -6,7,
-    8,-3, 9,-8,
-    2,-12, 2,8,
-    -11,-2, -10,3,
-    -12,-13, -7,-9,
-    -11,0, -10,-5,
-    5,-3, 11,8,
-    -2,-13, -1,12,
-    -1,-8, 0,9,
-    -13,-11, -12,-5,
-    -10,-2, -10,11,
-    -3,9, -2,-13,
-    2,-3, 3,2,
-    -9,-13, -4,0,
-    -4,6, -3,-10,
-    -4,12, -2,-7,
-    -6,-11, -4,9,
-    6,-3, 6,11,
-    -13,11, -5,5,
-    11,11, 12,6,
-    7,-5, 12,-2,
-    -1,12, 0,7,
-    -4,-8, -3,-2,
-    -7,1, -6,7,
-    -13,-12, -8,-13,
-    -7,-2, -6,-8,
-    -8,5, -6,-9,
-    -5,-1, -4,5,
-    -13,7, -8,10,
-    1,5, 5,-13,
-    1,0, 10,-13,
-    9,12, 10,-1,
-    5,-8, 10,-9,
-    -1,11, 1,-13,
-    -9,-3, -6,2,
-    -1,-10, 1,12,
-    -13,1, -8,-10,
-    8,-11, 10,-6,
-    2,-13, 3,-6,
-    7,-13, 12,-9,
-    -10,-10, -5,-7,
-    -10,-8, -8,-13,
-    4,-6, 8,5,
-    3,12, 8,-13,
-    -4,2, -3,-3,
-    5,-13, 10,-12,
-    4,-13, 5,-1,
-    -9,9, -4,3,
-    0,3, 3,-9,
-    -12,1, -6,1,
-    3,2, 4,-8,
-    -10,-10, -10,9,
-    8,-13, 12,12,
-    -8,-12, -6,-5,
-    2,2, 3,7,
-    10,6, 11,-8,
-    6,8, 8,-12,
-    -7,10, -6,5,
-    -3,-9, -3,9,
-    -1,-13, -1,5,
-    -3,-7, -3,4,
-    -8,-2, -8,3,
-    4,2, 12,12,
-    2,-5, 3,11,
-    6,-9, 11,-13,
-    3,-1, 7,12,
-    11,-1, 12,4,
-    -3,0, -3,6,
-    4,-11, 4,12,
-    2,-4, 2,1,
-    -10,-6, -8,1,
-    -13,7, -11,1,
-    -13,12, -11,-13,
-    6,0, 11,-13,
-    0,-1, 1,4,
-    -13,3, -9,-2,
-    -9,8, -6,-3,
-    -13,-6, -8,-2,
-    5,-9, 8,10,
-    2,7, 3,-9,
-    -1,-6, -1,-1,
-    9,5, 11,-2,
-    11,-3, 12,-8,
-    3,0, 3,5,
-    -1,4, 0,10,
-    3,-6, 4,5,
-    -13,0, -10,5,
-    5,8, 12,11,
-    8,9, 9,-6,
-    7,-4, 8,-12,
-    -10,4, -10,9,
-    7,3, 12,4,
-    9,-7, 10,-2,
-    7,0, 12,-2,
-    -1,-6, 0,-11,
+    8,   -3,  9,   5,   4,   2,   7,   -12, -11, 9,   -8,  2,   7,   -12, 12,
+    -13, 2,   -13, 2,   12,  1,   -7,  1,   6,   -2,  -10, -2,  -4,  -13, -13,
+    -11, -8,  -13, -3,  -12, -9,  10,  4,   11,  9,   -13, -8,  -8,  -9,  -11,
+    7,   -9,  12,  7,   7,   12,  6,   -4,  -5,  -3,  0,   -13, 2,   -12, -3,
+    -9,  0,   -7,  5,   12,  -6,  12,  -1,  -3,  6,   -2,  12,  -6,  -13, -4,
+    -8,  11,  -13, 12,  -8,  4,   7,   5,   1,   5,   -3,  10,  -3,  3,   -7,
+    6,   12,  -8,  -7,  -6,  -2,  -2,  11,  -1,  -10, -13, 12,  -8,  10,  -7,
+    3,   -5,  -3,  -4,  2,   -3,  7,   -10, -12, -6,  11,  5,   -12, 6,   -7,
+    5,   -6,  7,   -1,  1,   0,   4,   -5,  9,   11,  11,  -13, 4,   7,   4,
+    12,  2,   -1,  4,   4,   -4,  -12, -2,  7,   -8,  -5,  -7,  -10, 4,   11,
+    9,   12,  0,   -8,  1,   -13, -13, -2,  -8,  2,   -3,  -2,  -2,  3,   -6,
+    9,   -4,  -9,  8,   12,  10,  7,   0,   9,   1,   3,   7,   -5,  11,  -10,
+    -13, -6,  -11, 0,   10,  7,   12,  1,   -6,  -3,  -6,  12,  10,  -9,  12,
+    -4,  -13, 8,   -8,  -12, -13, 0,   -8,  -4,  3,   3,   7,   8,   5,   7,
+    10,  -7,  -1,  7,   1,   -12, 3,   -10, 5,   6,   2,   -4,  3,   -10, -13,
+    0,   -13, 5,   -13, -7,  -12, 12,  -13, 3,   -11, 8,   -7,  12,  -4,  7,
+    6,   -10, 12,  8,   -9,  -1,  -7,  -6,  -2,  -5,  0,   12,  -12, 5,   -7,
+    5,   3,   -10, 8,   -13, -7,  -7,  -4,  5,   -3,  -2,  -1,  -7,  2,   9,
+    5,   -11, -11, -13, -5,  -13, -1,  6,   0,   -1,  5,   -3,  5,   2,   -4,
+    -13, -4,  12,  -9,  -6,  -9,  6,   -12, -10, -8,  -4,  10,  2,   12,  -3,
+    7,   12,  12,  12,  -7,  -13, -6,  5,   -4,  9,   -3,  4,   7,   -1,  12,
+    2,   -7,  6,   -5,  1,   -13, 11,  -12, 5,   -3,  7,   -2,  -6,  7,   -8,
+    12,  -7,  -13, -7,  -11, -12, 1,   -3,  12,  12,  2,   -6,  3,   0,   -4,
+    3,   -2,  -13, -1,  -13, 1,   9,   7,   1,   8,   -6,  1,   -1,  3,   12,
+    9,   1,   12,  6,   -1,  -9,  -1,  3,   -13, -13, -10, 5,   7,   7,   10,
+    12,  12,  -5,  12,  9,   6,   3,   7,   11,  5,   -13, 6,   10,  2,   -12,
+    2,   3,   3,   8,   4,   -6,  2,   6,   12,  -13, 9,   -12, 10,  3,   -8,
+    4,   -7,  9,   -11, 12,  -4,  -6,  1,   12,  2,   -8,  6,   -9,  7,   -4,
+    2,   3,   3,   -2,  6,   3,   11,  0,   3,   -3,  8,   -8,  7,   8,   9,
+    3,   -11, -5,  -6,  -4,  -10, 11,  -5,  10,  -5,  -8,  -3,  12,  -10, 5,
+    -9,  0,   8,   -1,  12,  -6,  4,   -6,  6,   -11, -10, 12,  -8,  7,   4,
+    -2,  6,   7,   -2,  0,   -2,  12,  -5,  -8,  -5,  2,   7,   -6,  10,  12,
+    -9,  -13, -8,  -8,  -5,  -13, -5,  -2,  8,   -8,  9,   -13, -9,  -11, -9,
+    0,   1,   -8,  1,   -2,  7,   -4,  9,   1,   -2,  1,   -1,  -4,  11,  -6,
+    12,  -11, -12, -9,  -6,  4,   3,   7,   7,   12,  5,   5,   10,  8,   0,
+    -4,  2,   8,   -9,  12,  -5,  -13, 0,   7,   2,   12,  -1,  2,   1,   7,
+    5,   11,  7,   -9,  3,   5,   6,   -8,  -13, -4,  -8,  9,   -5,  9,   -3,
+    -3,  -4,  -7,  -3,  -12, 6,   5,   8,   0,   -7,  6,   -6,  12,  -13, 6,
+    -5,  -2,  1,   -10, 3,   10,  4,   1,   8,   -4,  -2,  -2,  2,   -13, 2,
+    -12, 12,  12,  -2,  -13, 0,   -6,  4,   1,   9,   3,   -6,  -10, -3,  -5,
+    -3,  -13, -1,  1,   7,   5,   12,  -11, 4,   -2,  5,   -7,  -13, 9,   -9,
+    -5,  7,   1,   8,   6,   7,   -8,  7,   6,   -7,  -4,  -7,  1,   -8,  11,
+    -7,  -8,  -13, 6,   -12, -8,  2,   4,   3,   9,   10,  -5,  12,  3,   -6,
+    -5,  -6,  7,   8,   -3,  9,   -8,  2,   -12, 2,   8,   -11, -2,  -10, 3,
+    -12, -13, -7,  -9,  -11, 0,   -10, -5,  5,   -3,  11,  8,   -2,  -13, -1,
+    12,  -1,  -8,  0,   9,   -13, -11, -12, -5,  -10, -2,  -10, 11,  -3,  9,
+    -2,  -13, 2,   -3,  3,   2,   -9,  -13, -4,  0,   -4,  6,   -3,  -10, -4,
+    12,  -2,  -7,  -6,  -11, -4,  9,   6,   -3,  6,   11,  -13, 11,  -5,  5,
+    11,  11,  12,  6,   7,   -5,  12,  -2,  -1,  12,  0,   7,   -4,  -8,  -3,
+    -2,  -7,  1,   -6,  7,   -13, -12, -8,  -13, -7,  -2,  -6,  -8,  -8,  5,
+    -6,  -9,  -5,  -1,  -4,  5,   -13, 7,   -8,  10,  1,   5,   5,   -13, 1,
+    0,   10,  -13, 9,   12,  10,  -1,  5,   -8,  10,  -9,  -1,  11,  1,   -13,
+    -9,  -3,  -6,  2,   -1,  -10, 1,   12,  -13, 1,   -8,  -10, 8,   -11, 10,
+    -6,  2,   -13, 3,   -6,  7,   -13, 12,  -9,  -10, -10, -5,  -7,  -10, -8,
+    -8,  -13, 4,   -6,  8,   5,   3,   12,  8,   -13, -4,  2,   -3,  -3,  5,
+    -13, 10,  -12, 4,   -13, 5,   -1,  -9,  9,   -4,  3,   0,   3,   3,   -9,
+    -12, 1,   -6,  1,   3,   2,   4,   -8,  -10, -10, -10, 9,   8,   -13, 12,
+    12,  -8,  -12, -6,  -5,  2,   2,   3,   7,   10,  6,   11,  -8,  6,   8,
+    8,   -12, -7,  10,  -6,  5,   -3,  -9,  -3,  9,   -1,  -13, -1,  5,   -3,
+    -7,  -3,  4,   -8,  -2,  -8,  3,   4,   2,   12,  12,  2,   -5,  3,   11,
+    6,   -9,  11,  -13, 3,   -1,  7,   12,  11,  -1,  12,  4,   -3,  0,   -3,
+    6,   4,   -11, 4,   12,  2,   -4,  2,   1,   -10, -6,  -8,  1,   -13, 7,
+    -11, 1,   -13, 12,  -11, -13, 6,   0,   11,  -13, 0,   -1,  1,   4,   -13,
+    3,   -9,  -2,  -9,  8,   -6,  -3,  -13, -6,  -8,  -2,  5,   -9,  8,   10,
+    2,   7,   3,   -9,  -1,  -6,  -1,  -1,  9,   5,   11,  -2,  11,  -3,  12,
+    -8,  3,   0,   3,   5,   -1,  4,   0,   10,  3,   -6,  4,   5,   -13, 0,
+    -10, 5,   5,   8,   12,  11,  8,   9,   9,   -6,  7,   -4,  8,   -12, -10,
+    4,   -10, 9,   7,   3,   12,  4,   9,   -7,  10,  -2,  7,   0,   12,  -2,
+    -1,  -6,  0,   -11,
 };
 
-
-float block_reduce_sum(float val, __local float *data)
-{
+float block_reduce_sum(float val, local float* data) {
     unsigned idx = get_local_id(0) * get_local_size(0) + get_local_id(1);
 
     data[idx] = val;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    for (unsigned i = get_local_size(1) / 2; i > 0; i >>= 1)
-    {
-        if (get_local_id(1) < i)
-        {
-            data[idx] += data[idx + i];
-        }
+    for (unsigned i = get_local_size(1) / 2; i > 0; i >>= 1) {
+        if (get_local_id(1) < i) { data[idx] += data[idx + i]; }
 
         barrier(CLK_LOCAL_MEM_FENCE);
     }
@@ -296,40 +103,29 @@ float block_reduce_sum(float val, __local float *data)
     return data[get_local_id(0) * get_local_size(0)];
 }
 
-__kernel void keep_features(
-    __global float* x_out,
-    __global float* y_out,
-    __global float* score_out,
-    __global const float* x_in,
-    __global const float* y_in,
-    __global const float* score_in,
-    __global const unsigned* score_idx,
-    const unsigned n_feat)
-{
+kernel void keep_features(global float* x_out, global float* y_out,
+                          global float* score_out,
+                          global const float* x_in,
+                          global const float* y_in,
+                          global const float* score_in,
+                          global const unsigned* score_idx,
+                          const unsigned n_feat) {
     unsigned f = get_global_id(0);
 
     if (f < n_feat) {
-        x_out[f] = x_in[score_idx[f]];
-        y_out[f] = y_in[score_idx[f]];
+        x_out[f]     = x_in[score_idx[f]];
+        y_out[f]     = y_in[score_idx[f]];
         score_out[f] = score_in[f];
     }
 }
 
-__kernel void harris_response(
-    __global float* x_out,
-    __global float* y_out,
-    __global float* score_out,
-    __global const float* x_in,
-    __global const float* y_in,
-    const unsigned total_feat,
-    __global unsigned* usable_feat,
-    __global const T* image,
-    KParam            iInfo,
-    const unsigned block_size,
-    const float k_thr,
-    const unsigned patch_size)
-{
-    __local float data[BLOCK_SIZE*BLOCK_SIZE];
+kernel void harris_response(
+    global float* x_out, global float* y_out, global float* score_out,
+    global const float* x_in, global const float* y_in,
+    const unsigned total_feat, global unsigned* usable_feat,
+    global const T* image, KParam iInfo, const unsigned block_size,
+    const float k_thr, const unsigned patch_size) {
+    local float data[BLOCK_SIZE * BLOCK_SIZE];
 
     unsigned f = get_global_id(0);
 
@@ -348,22 +144,26 @@ __kernel void harris_response(
         // represents widest case possible
         unsigned patch_r = ceil(size * sqrt(2.f) / 2.f);
 
-        if (x >= patch_r && y >= patch_r && x < iInfo.dims[1] - patch_r && y < iInfo.dims[0] - patch_r) {
+        if (x >= patch_r && y >= patch_r && x < iInfo.dims[1] - patch_r &&
+            y < iInfo.dims[0] - patch_r) {
             unsigned r = block_size / 2;
 
             unsigned block_size_sq = block_size * block_size;
-            for (unsigned k = get_local_id(1); k < block_size_sq; k += get_local_size(1)) {
+            for (unsigned k = get_local_id(1); k < block_size_sq;
+                 k += get_local_size(1)) {
                 int i = k / block_size - r;
                 int j = k % block_size - r;
 
                 // Calculate local x and y derivatives
-                float ix = image[(x+i+1) * iInfo.dims[0] + y+j] - image[(x+i-1) * iInfo.dims[0] + y+j];
-                float iy = image[(x+i) * iInfo.dims[0] + y+j+1] - image[(x+i) * iInfo.dims[0] + y+j-1];
+                float ix = image[(x + i + 1) * iInfo.dims[0] + y + j] -
+                           image[(x + i - 1) * iInfo.dims[0] + y + j];
+                float iy = image[(x + i) * iInfo.dims[0] + y + j + 1] -
+                           image[(x + i) * iInfo.dims[0] + y + j - 1];
 
                 // Accumulate second order derivatives
-                ixx += ix*ix;
-                iyy += iy*iy;
-                ixy += ix*iy;
+                ixx += ix * ix;
+                iyy += iy * iy;
+                ixy += ix * iy;
             }
         }
     }
@@ -376,34 +176,30 @@ __kernel void harris_response(
     if (f < total_feat && get_local_id(1) == 0) {
         unsigned idx = atomic_inc(usable_feat);
         if (idx < total_feat) {
-            float tr = ixx + iyy;
-            float det = ixx*iyy - ixy*ixy;
+            float tr  = ixx + iyy;
+            float det = ixx * iyy - ixy * ixy;
 
             // Calculate Harris responses
-            float resp = det - k_thr * (tr*tr);
+            float resp = det - k_thr * (tr * tr);
 
             // Scale factor
             // TODO: improve scaling for responses
             float rscale = 0.001f;
-            rscale = rscale * rscale * rscale * rscale;
+            rscale       = rscale * rscale * rscale * rscale;
 
-            x_out[idx] = x;
-            y_out[idx] = y;
+            x_out[idx]     = x;
+            y_out[idx]     = y;
             score_out[idx] = resp * rscale;
         }
     }
 }
 
-__kernel void centroid_angle(
-    __global const float* x_in,
-    __global const float* y_in,
-    __global float* orientation_out,
-    const unsigned total_feat,
-    __global const T* image,
-    KParam            iInfo,
-    const unsigned patch_size)
-{
-    __local float data[BLOCK_SIZE*BLOCK_SIZE];
+kernel void centroid_angle(global const float* x_in,
+                           global const float* y_in,
+                           global float* orientation_out,
+                           const unsigned total_feat, global const T* image,
+                           KParam iInfo, const unsigned patch_size) {
+    local float data[BLOCK_SIZE * BLOCK_SIZE];
     unsigned f = get_global_id(0);
 
     T m01 = (T)0, m10 = (T)0;
@@ -414,14 +210,16 @@ __kernel void centroid_angle(
 
         unsigned r = patch_size / 2;
 
-        if (x >= r && y >= r && x <= iInfo.dims[1] - r && y <= iInfo.dims[0] - r) {
+        if (x >= r && y >= r && x <= iInfo.dims[1] - r &&
+            y <= iInfo.dims[0] - r) {
             unsigned patch_size_sq = patch_size * patch_size;
-            for (unsigned k = get_local_id(1); k < patch_size_sq; k += get_local_size(1)) {
+            for (unsigned k = get_local_id(1); k < patch_size_sq;
+                 k += get_local_size(1)) {
                 int i = k / patch_size - r;
                 int j = k % patch_size - r;
 
                 // Calculate first order moments
-                T p = image[(x+i) * iInfo.dims[0] + y+j];
+                T p = image[(x + i) * iInfo.dims[0] + y + j];
                 m01 += j * p;
                 m10 += i * p;
             }
@@ -433,24 +231,16 @@ __kernel void centroid_angle(
     m10 = block_reduce_sum(m10, data);
 
     if (f < total_feat && get_local_id(1) == 0) {
-        float angle = atan2(m01, m10);
+        float angle        = atan2(m01, m10);
         orientation_out[f] = angle;
     }
 }
 
-inline T get_pixel(
-    unsigned x,
-    unsigned y,
-    const float ori,
-    const unsigned size,
-    const int dist_x,
-    const int dist_y,
-    __global const T* image,
-    KParam            iInfo,
-    const unsigned patch_size)
-{
-    float ori_sin = sin(ori);
-    float ori_cos = cos(ori);
+inline T get_pixel(unsigned x, unsigned y, const float ori, const unsigned size,
+                   const int dist_x, const int dist_y, global const T* image,
+                   KParam iInfo, const unsigned patch_size) {
+    float ori_sin   = sin(ori);
+    float ori_cos   = cos(ori);
     float patch_scl = (float)size / (float)patch_size;
 
     x += round(dist_x * patch_scl * ori_cos - dist_y * patch_scl * ori_sin);
@@ -459,57 +249,54 @@ inline T get_pixel(
     return image[x * iInfo.dims[0] + y];
 }
 
-__kernel void extract_orb(
-    __global unsigned* desc_out,
-    const unsigned n_feat,
-    __global float* x_in,
-    __global float* y_in,
-    __global float* ori_in,
-    __global float* size_out,
-    __global const T* image,
-    KParam            iInfo,
-    const float scl,
-    const unsigned patch_size)
-{
+kernel void extract_orb(global unsigned* desc_out, const unsigned n_feat,
+                        global float* x_in, global float* y_in,
+                        global float* ori_in, global float* size_out,
+                        global const T* image, KParam iInfo,
+                        const float scl, const unsigned patch_size) {
     unsigned f = get_global_id(0);
 
     unsigned x, y;
 
     if (f < n_feat) {
-        x = (unsigned)round(x_in[f]);
-        y = (unsigned)round(y_in[f]);
-        float ori = ori_in[f];
+        x             = (unsigned)round(x_in[f]);
+        y             = (unsigned)round(y_in[f]);
+        float ori     = ori_in[f];
         unsigned size = patch_size;
 
         unsigned r = ceil(patch_size * sqrt(2.f) / 2.f);
 
-        if (x >= r && y >= r && x < iInfo.dims[1] - r && y < iInfo.dims[0] - r) {
-           // Descriptor fixed at 256 bits for now
+        if (x >= r && y >= r && x < iInfo.dims[1] - r &&
+            y < iInfo.dims[0] - r) {
+            // Descriptor fixed at 256 bits for now
             for (unsigned i = get_local_id(1); i < 16; i += get_local_size(1)) {
                 unsigned v = 0;
 
                 for (unsigned j = 0; j < 16; j++) {
-                    int dist_x = ref_pat[i*16*4 + j*4];
-                    int dist_y = ref_pat[i*16*4 + j*4+1];
-                    T p1 = get_pixel(x, y, ori, size, dist_x, dist_y, image, iInfo, patch_size);
-
-                    dist_x = ref_pat[i*16*4 + j*4+2];
-                    dist_y = ref_pat[i*16*4 + j*4+3];
-                    T p2 = get_pixel(x, y, ori, size, dist_x, dist_y, image, iInfo, patch_size);
-
-                    // Calculate bit based on p1 and p2 and shifts it to correct position
-                    v |= (p1 < p2) << (j + 16*(i % 2));
+                    int dist_x = ref_pat[i * 16 * 4 + j * 4];
+                    int dist_y = ref_pat[i * 16 * 4 + j * 4 + 1];
+                    T p1 = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                     iInfo, patch_size);
+
+                    dist_x = ref_pat[i * 16 * 4 + j * 4 + 2];
+                    dist_y = ref_pat[i * 16 * 4 + j * 4 + 3];
+                    T p2   = get_pixel(x, y, ori, size, dist_x, dist_y, image,
+                                     iInfo, patch_size);
+
+                    // Calculate bit based on p1 and p2 and shift it to the
+                    // correct position
+                    v |= (p1 < p2) << (j + 16 * (i % 2));
                 }
                 // Store 16 bits of descriptor
-                atomic_add(&desc_out[f * 8 + i/2], v);
+                atomic_add(&desc_out[f * 8 + i / 2], v);
             }
         }
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (f < n_feat && get_local_id(1) == 0) {
-        x_in[f] = round(x * scl);
-        y_in[f] = round(y * scl);
+        x_in[f]     = round(x * scl);
+        y_in[f]     = round(y * scl);
         size_out[f] = patch_size * scl;
     }
 }
diff --git a/src/backend/opencl/kernel/orb.hpp b/src/backend/opencl/kernel/orb.hpp
index 36a16c354c..5d4f523f16 100644
--- a/src/backend/opencl/kernel/orb.hpp
+++ b/src/backend/opencl/kernel/orb.hpp
@@ -7,499 +7,509 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <program.hpp>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <convolve_common.hpp>
 #include <kernel/convolve_separable.hpp>
 #include <kernel/fast.hpp>
+#include <kernel/range.hpp>
 #include <kernel/resize.hpp>
-#include <kernel/sort_index.hpp>
+#include <kernel/sort_by_key.hpp>
 #include <kernel_headers/orb.hpp>
 #include <memory.hpp>
-#include <vector>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::EnqueueArgs;
-using cl::LocalSpaceArg;
-using cl::NDRange;
-using std::vector;
-
-namespace opencl
-{
-
-namespace kernel
-{
+#include <af/defines.h>
 
-static const int ORB_THREADS   = 256;
-static const int ORB_THREADS_X = 16;
-static const int ORB_THREADS_Y = 16;
+#include <string>
+#include <vector>
 
-static const float PI_VAL = 3.14159265358979323846f;
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic push
+#pragma clang diagnostic ignored "-Wsometimes-uninitialized"
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(__GNUC__) || defined(__GNUG__)
+/* GNU GCC/G++ */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wmaybe-uninitialized"
+#elif defined(_MSC_VER)
+/* Microsoft Visual Studio */
+#pragma warning(push)
+#pragma warning(disable : 4700)
+#else
+/* Other */
+#endif
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int ORB_THREADS   = 256;
+constexpr int ORB_THREADS_X = 16;
+constexpr int ORB_THREADS_Y = 16;
+constexpr float PI_VAL      = 3.14159265358979323846f;
 
 // Reference pattern, generated for a patch size of 31x31, as suggested by
 // original ORB paper
 #define REF_PAT_SIZE 31
 #define REF_PAT_SAMPLES 256
 #define REF_PAT_COORDS 4
-#define REF_PAT_LENGTH (REF_PAT_SAMPLES*REF_PAT_COORDS)
-
+#define REF_PAT_LENGTH (REF_PAT_SAMPLES * REF_PAT_COORDS)
 
 template<typename T>
-void gaussian1D(T* out, const int dim, double sigma=0.0)
-{
-    if(!(sigma>0)) sigma = 0.25*dim;
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
 
     T sum = (T)0;
-    for(int i=0;i<dim;i++)
-    {
-        int x = i-(dim-1)/2;
-        T el = 1. / sqrt(2 * PI_VAL * sigma*sigma) * exp(-((x*x)/(2*(sigma*sigma))));
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * PI_VAL * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
         out[i] = el;
-        sum   += el;
+        sum += el;
     }
 
-    for(int k=0;k<dim;k++)
-        out[k] /= sum;
+    for (int k = 0; k < dim; k++) out[k] /= sum;
 }
 
-template<typename T, typename convAccT>
-void orb(unsigned* out_feat,
-         Param& x_out,
-         Param& y_out,
-         Param& score_out,
-         Param& ori_out,
-         Param& size_out,
-         Param& desc_out,
-         Param image,
-         const float fast_thr,
-         const unsigned max_feat,
-         const float scl_fctr,
-         const unsigned levels,
-         const bool blur_img)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> orbProgs;
-        static std::map<int, Kernel*>  hrKernel;
-        static std::map<int, Kernel*>  kfKernel;
-        static std::map<int, Kernel*>  caKernel;
-        static std::map<int, Kernel*>  eoKernel;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D BLOCK_SIZE=" << ORB_THREADS_X;
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, orb_cl, orb_cl_len, options.str());
-                orbProgs[device] = new Program(prog);
-
-                hrKernel[device] = new Kernel(*orbProgs[device], "harris_response");
-                kfKernel[device] = new Kernel(*orbProgs[device], "keep_features");
-                caKernel[device] = new Kernel(*orbProgs[device], "centroid_angle");
-                eoKernel[device] = new Kernel(*orbProgs[device], "extract_orb");
-            });
-
-        unsigned patch_size = REF_PAT_SIZE;
-
-        unsigned min_side = std::min(image.info.dims[0], image.info.dims[1]);
-        unsigned max_levels = 0;
-        float scl_sum = 0.f;
-        for (unsigned i = 0; i < levels; i++) {
-            min_side /= scl_fctr;
-
-            // Minimum image side for a descriptor to be computed
-            if (min_side < patch_size || max_levels == levels) break;
-
-            max_levels++;
-            scl_sum += 1.f / (float)pow(scl_fctr,(float)i);
-        }
-
-        vector<cl::Buffer*> d_x_pyr(max_levels);
-        vector<cl::Buffer*> d_y_pyr(max_levels);
-        vector<cl::Buffer*> d_score_pyr(max_levels);
-        vector<cl::Buffer*> d_ori_pyr(max_levels);
-        vector<cl::Buffer*> d_size_pyr(max_levels);
-        vector<cl::Buffer*> d_desc_pyr(max_levels);
-
-        vector<unsigned> feat_pyr(max_levels);
-        unsigned total_feat = 0;
-
-        // Compute number of features to keep for each level
-        vector<unsigned> lvl_best(max_levels);
-        unsigned feat_sum = 0;
-        for (unsigned i = 0; i < max_levels-1; i++) {
-            float lvl_scl = (float)pow(scl_fctr,(float)i);
-            lvl_best[i] = ceil((max_feat / scl_sum) / lvl_scl);
-            feat_sum += lvl_best[i];
-        }
-        lvl_best[max_levels-1] = max_feat - feat_sum;
-
-        // Maintain a reference to previous level image
-        Param prev_img;
-        Param lvl_img;
-
-        const unsigned gauss_len = 9;
-        T* h_gauss = nullptr;
-        Param gauss_filter;
-        gauss_filter.data = nullptr;
-
-        for (unsigned i = 0; i < max_levels; i++) {
-            const float lvl_scl = (float)pow(scl_fctr,(float)i);
-
-            if (i == 0) {
-                // First level is used in its original size
-                lvl_img = image;
-
-                prev_img = image;
-            }
-            else if (i > 0) {
-                // Resize previous level image to current level dimensions
-                lvl_img.info.dims[0] = round(image.info.dims[0] / lvl_scl);
-                lvl_img.info.dims[1] = round(image.info.dims[1] / lvl_scl);
-
-                lvl_img.info.strides[0] = 1;
-                lvl_img.info.strides[1] = lvl_img.info.dims[0];
-
-                for (int k = 2; k < 4; k++) {
-                    lvl_img.info.dims[k] = 1;
-                    lvl_img.info.strides[k] = lvl_img.info.dims[k - 1] * lvl_img.info.strides[k - 1];
-                }
-
-                lvl_img.info.offset = 0;
-                lvl_img.data = bufferAlloc(lvl_img.info.dims[3] * lvl_img.info.strides[3] * sizeof(T));
+template<typename T>
+std::array<Kernel, 4> getOrbKernels() {
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(BLOCK_SIZE, ORB_THREADS_X),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    return {
+        common::getKernel("harris_response", {{orb_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("keep_features", {{orb_cl_src}}, targs, compileOpts),
+        common::getKernel("centroid_angle", {{orb_cl_src}}, targs, compileOpts),
+        common::getKernel("extract_orb", {{orb_cl_src}}, targs, compileOpts),
+    };
+}
 
-                resize<T, AF_INTERP_BILINEAR>(lvl_img, prev_img);
+template<typename T, typename convAccT>
+void orb(unsigned* out_feat, Param& x_out, Param& y_out, Param& score_out,
+         Param& ori_out, Param& size_out, Param& desc_out, Param image,
+         const float fast_thr, const unsigned max_feat, const float scl_fctr,
+         const unsigned levels, const bool blur_img) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::vector;
+
+    auto kernels = getOrbKernels<T>();
+
+    unsigned patch_size = REF_PAT_SIZE;
+
+    unsigned min_side   = std::min(image.info.dims[0], image.info.dims[1]);
+    unsigned max_levels = 0;
+    float scl_sum       = 0.f;
+    for (unsigned i = 0; i < levels; i++) {
+        min_side /= scl_fctr;
+
+        // Minimum image side for a descriptor to be computed
+        if (min_side < patch_size || max_levels == levels) break;
+
+        max_levels++;
+        scl_sum += 1.f / (float)pow(scl_fctr, (float)i);
+    }
 
-                if (i > 1)
-                   bufferFree(prev_img.data);
-                prev_img = lvl_img;
-            }
+    vector<Buffer*> d_x_pyr(max_levels);
+    vector<Buffer*> d_y_pyr(max_levels);
+    vector<Buffer*> d_score_pyr(max_levels);
+    vector<Buffer*> d_ori_pyr(max_levels);
+    vector<Buffer*> d_size_pyr(max_levels);
+    vector<Buffer*> d_desc_pyr(max_levels);
+
+    vector<unsigned> feat_pyr(max_levels);
+    unsigned total_feat = 0;
+
+    // Compute number of features to keep for each level
+    vector<unsigned> lvl_best(max_levels);
+    unsigned feat_sum = 0;
+    for (unsigned i = 0; i < max_levels - 1; i++) {
+        float lvl_scl = (float)pow(scl_fctr, (float)i);
+        lvl_best[i]   = ceil((max_feat / scl_sum) / lvl_scl);
+        feat_sum += lvl_best[i];
+    }
+    lvl_best[max_levels - 1] = max_feat - feat_sum;
 
-            unsigned lvl_feat = 0;
-            Param d_x_feat, d_y_feat, d_score_feat;
+    // Maintain a reference to previous level image
+    Param prev_img;
+    Param lvl_img;
 
-            // Round feature size to nearest odd integer
-            float size = 2.f * floor(patch_size / 2.f) + 1.f;
+    const unsigned gauss_len = 9;
+    T* h_gauss               = nullptr;
+    Param gauss_filter;
+    gauss_filter.data = nullptr;
 
-            // Avoid keeping features that might be too wide and might not fit on
-            // the image, sqrt(2.f) is the radius when angle is 45 degrees and
-            // represents widest case possible
-            unsigned edge = ceil(size * sqrt(2.f) / 2.f);
+    for (unsigned i = 0; i < max_levels; i++) {
+        const float lvl_scl = (float)pow(scl_fctr, (float)i);
 
-            // Detect FAST features
-            fast<T, 9, true>(&lvl_feat, d_x_feat, d_y_feat, d_score_feat,
-                             lvl_img, fast_thr, 0.15f, edge);
+        if (i == 0) {
+            // First level is used in its original size
+            lvl_img = image;
 
-            if (lvl_feat == 0) {
-                feat_pyr[i] = 0;
+            prev_img = image;
+        } else if (i > 0) {
+            // Resize previous level image to current level dimensions
+            lvl_img.info.dims[0] = round(image.info.dims[0] / lvl_scl);
+            lvl_img.info.dims[1] = round(image.info.dims[1] / lvl_scl);
 
-                if (i > 0 && i == max_levels-1)
-                    bufferFree(lvl_img.data);
+            lvl_img.info.strides[0] = 1;
+            lvl_img.info.strides[1] = lvl_img.info.dims[0];
 
-                continue;
+            for (int k = 2; k < 4; k++) {
+                lvl_img.info.dims[k] = 1;
+                lvl_img.info.strides[k] =
+                    lvl_img.info.dims[k - 1] * lvl_img.info.strides[k - 1];
             }
 
-            bufferFree(d_score_feat.data);
+            lvl_img.info.offset = 0;
+            lvl_img.data        = bufferAlloc(lvl_img.info.dims[3] *
+                                              lvl_img.info.strides[3] * sizeof(T));
 
-            unsigned usable_feat = 0;
-            cl::Buffer* d_usable_feat = bufferAlloc(sizeof(unsigned));
-            getQueue().enqueueWriteBuffer(*d_usable_feat, CL_TRUE, 0, sizeof(unsigned), &usable_feat);
+            resize<T>(lvl_img, prev_img, AF_INTERP_BILINEAR);
 
-            cl::Buffer* d_x_harris = bufferAlloc(lvl_feat * sizeof(float));
-            cl::Buffer* d_y_harris = bufferAlloc(lvl_feat * sizeof(float));
-            cl::Buffer* d_score_harris = bufferAlloc(lvl_feat * sizeof(float));
-
-            // Calculate Harris responses
-            // Good block_size >= 7 (must be an odd number)
-            const int blk_x = divup(lvl_feat, ORB_THREADS_X);
-            const NDRange local(ORB_THREADS_X, ORB_THREADS_Y);
-            const NDRange global(blk_x * ORB_THREADS_X, ORB_THREADS_Y);
-
-            unsigned block_size = 7;
-            float k_thr = 0.04f;
-
-            auto hrOp = make_kernel<Buffer, Buffer, Buffer,
-                                    Buffer, Buffer, const unsigned,
-                                    Buffer, Buffer, KParam,
-                                    const unsigned, const float, const unsigned> (*hrKernel[device]);
-
-            hrOp(EnqueueArgs(getQueue(), global, local),
-                 *d_x_harris, *d_y_harris, *d_score_harris,
-                 *d_x_feat.data, *d_y_feat.data, lvl_feat,
-                 *d_usable_feat, *lvl_img.data, lvl_img.info,
-                 block_size, k_thr, patch_size);
-            CL_DEBUG_FINISH(getQueue());
+            if (i > 1) bufferFree(prev_img.data);
+            prev_img = lvl_img;
+        }
 
-            getQueue().enqueueReadBuffer(*d_usable_feat, CL_TRUE, 0, sizeof(unsigned), &usable_feat);
+        unsigned lvl_feat = 0;
+        Param d_x_feat, d_y_feat, d_score_feat;
 
-            if (lvl_feat > 0) { //This is just to supress warnings
-                bufferFree(d_x_feat.data);
-                bufferFree(d_y_feat.data);
-                bufferFree(d_usable_feat);
-            }
+        // Round feature size to nearest odd integer
+        float size = 2.f * floor(patch_size / 2.f) + 1.f;
 
-            if (usable_feat == 0) {
-                feat_pyr[i] = 0;
+        // Avoid keeping features that might be too wide to fit on the image:
+        // size * sqrt(2.f) / 2.f is the patch radius at a 45-degree rotation,
+        // the widest case possible
+        unsigned edge = ceil(size * sqrt(2.f) / 2.f);
 
-                bufferFree(d_x_harris);
-                bufferFree(d_y_harris);
-                bufferFree(d_score_harris);
+        // Detect FAST features
+        fast<T>(9, &lvl_feat, d_x_feat, d_y_feat, d_score_feat, lvl_img,
+                fast_thr, 0.15f, edge, true);
+        if (lvl_feat == 0) {
+            feat_pyr[i] = 0;
 
-                if (i > 0 && i == max_levels-1)
-                    bufferFree(lvl_img.data);
+            if (i > 0 && i == max_levels - 1) bufferFree(lvl_img.data);
 
-                continue;
-            }
+            continue;
+        }
 
-            // Sort features according to Harris responses
-            Param d_harris_sorted;
-            Param d_harris_idx;
+        bufferFree(d_score_feat.data);
 
-            d_harris_sorted.info.dims[0] = usable_feat;
-            d_harris_idx.info.dims[0] = usable_feat;
-            d_harris_sorted.info.strides[0] = 1;
-            d_harris_idx.info.strides[0] = 1;
+        unsigned usable_feat  = 0;
+        Buffer* d_usable_feat = bufferAlloc(sizeof(unsigned));
+        getQueue().enqueueFillBuffer(*d_usable_feat, usable_feat, 0,
+                                     sizeof(unsigned));
 
-            for (int k = 1; k < 4; k++) {
-                d_harris_sorted.info.dims[k] = 1;
-                d_harris_idx.info.dims[k] = 1;
-                d_harris_sorted.info.strides[k] = d_harris_sorted.info.dims[k - 1] * d_harris_sorted.info.strides[k - 1];
-                d_harris_idx.info.strides[k] = d_harris_idx.info.dims[k - 1] * d_harris_idx.info.strides[k - 1];
-            }
+        Buffer* d_x_harris     = bufferAlloc(lvl_feat * sizeof(float));
+        Buffer* d_y_harris     = bufferAlloc(lvl_feat * sizeof(float));
+        Buffer* d_score_harris = bufferAlloc(lvl_feat * sizeof(float));
 
-            d_harris_sorted.info.offset = 0;
-            d_harris_idx.info.offset = 0;
-            d_harris_sorted.data = d_score_harris;
-            d_harris_idx.data = bufferAlloc((d_harris_idx.info.dims[0]) * sizeof(unsigned));
+        // Calculate Harris responses
+        // Good block_size >= 7 (must be an odd number)
+        const int blk_x = divup(lvl_feat, ORB_THREADS_X);
+        const NDRange local(ORB_THREADS_X, ORB_THREADS_Y);
+        const NDRange global(blk_x * ORB_THREADS_X, ORB_THREADS_Y);
 
-            sort0_index<float, false>(d_harris_sorted, d_harris_idx);
+        unsigned block_size = 7;
+        float k_thr         = 0.04f;
 
-            cl::Buffer* d_x_lvl = bufferAlloc(usable_feat * sizeof(float));
-            cl::Buffer* d_y_lvl = bufferAlloc(usable_feat * sizeof(float));
-            cl::Buffer* d_score_lvl = bufferAlloc(usable_feat * sizeof(float));
+        auto hrOp = kernels[0];
 
-            usable_feat = min(usable_feat, lvl_best[i]);
+        hrOp(EnqueueArgs(getQueue(), global, local), *d_x_harris, *d_y_harris,
+             *d_score_harris, *d_x_feat.data, *d_y_feat.data, lvl_feat,
+             *d_usable_feat, *lvl_img.data, lvl_img.info, block_size, k_thr,
+             patch_size);
+        CL_DEBUG_FINISH(getQueue());
 
-            // Keep only features with higher Harris responses
-            const int keep_blk = divup(usable_feat, ORB_THREADS);
-            const NDRange local_keep(ORB_THREADS, 1);
-            const NDRange global_keep(keep_blk * ORB_THREADS, 1);
+        getQueue().enqueueReadBuffer(*d_usable_feat, CL_TRUE, 0,
+                                     sizeof(unsigned), &usable_feat);
 
-            auto kfOp = make_kernel<Buffer, Buffer, Buffer,
-                                    Buffer, Buffer, Buffer, Buffer,
-                                    const unsigned> (*kfKernel[device]);
+        if (lvl_feat > 0) {  // This is just to suppress warnings
+            bufferFree(d_x_feat.data);
+            bufferFree(d_y_feat.data);
+            bufferFree(d_usable_feat);
+        }
 
-            kfOp(EnqueueArgs(getQueue(), global_keep, local_keep),
-                 *d_x_lvl, *d_y_lvl, *d_score_lvl,
-                 *d_x_harris, *d_y_harris, *d_harris_sorted.data, *d_harris_idx.data,
-                 usable_feat);
-            CL_DEBUG_FINISH(getQueue());
+        if (usable_feat == 0) {
+            feat_pyr[i] = 0;
 
             bufferFree(d_x_harris);
             bufferFree(d_y_harris);
-            bufferFree(d_harris_sorted.data);
-            bufferFree(d_harris_idx.data);
-
-            cl::Buffer* d_ori_lvl = bufferAlloc(usable_feat * sizeof(float));
-            cl::Buffer* d_size_lvl = bufferAlloc(usable_feat * sizeof(float));
-
-            // Compute orientation of features
-            const int centroid_blk_x = divup(usable_feat, ORB_THREADS_X);
-            const NDRange local_centroid(ORB_THREADS_X, ORB_THREADS_Y);
-            const NDRange global_centroid(centroid_blk_x * ORB_THREADS_X, ORB_THREADS_Y);
-
-            auto caOp = make_kernel<Buffer, Buffer, Buffer,
-                                    const unsigned, Buffer, KParam,
-                                    const unsigned> (*caKernel[device]);
+            bufferFree(d_score_harris);
 
-            caOp(EnqueueArgs(getQueue(), global_centroid, local_centroid),
-                 *d_x_lvl, *d_y_lvl, *d_ori_lvl,
-                 usable_feat, *lvl_img.data, lvl_img.info,
-                 patch_size);
-            CL_DEBUG_FINISH(getQueue());
-
-            Param lvl_filt;
-            Param lvl_tmp;
+            if (i > 0 && i == max_levels - 1) bufferFree(lvl_img.data);
 
-            if (blur_img) {
-                lvl_filt = lvl_img;
-                lvl_tmp = lvl_img;
+            continue;
+        }
 
-                lvl_filt.data = bufferAlloc(lvl_filt.info.dims[0] * lvl_filt.info.dims[1] * sizeof(T));
-                lvl_tmp.data = bufferAlloc(lvl_tmp.info.dims[0] * lvl_tmp.info.dims[1] * sizeof(T));
+        // Sort features according to Harris responses
+        Param d_harris_sorted;
+        Param d_harris_idx;
 
-                // Calculate a separable Gaussian kernel
-                if (h_gauss == nullptr) {
-                    h_gauss = new T[gauss_len];
-                    gaussian1D(h_gauss, gauss_len, 2.f);
-                    gauss_filter.info.dims[0] = gauss_len;
-                    gauss_filter.info.strides[0] = 1;
+        d_harris_sorted.info.dims[0]    = usable_feat;
+        d_harris_idx.info.dims[0]       = usable_feat;
+        d_harris_sorted.info.strides[0] = 1;
+        d_harris_idx.info.strides[0]    = 1;
 
-                    for (int k = 1; k < 4; k++) {
-                        gauss_filter.info.dims[k] = 1;
-                        gauss_filter.info.strides[k] = gauss_filter.info.dims[k - 1] * gauss_filter.info.strides[k - 1];
-                    }
+        for (int k = 1; k < 4; k++) {
+            d_harris_sorted.info.dims[k] = 1;
+            d_harris_idx.info.dims[k]    = 1;
+            d_harris_sorted.info.strides[k] =
+                d_harris_sorted.info.dims[k - 1] *
+                d_harris_sorted.info.strides[k - 1];
+            d_harris_idx.info.strides[k] = d_harris_idx.info.dims[k - 1] *
+                                           d_harris_idx.info.strides[k - 1];
+        }
 
-                    int gauss_elem = gauss_filter.info.strides[3] * gauss_filter.info.dims[3];
-                    gauss_filter.data = bufferAlloc(gauss_elem * sizeof(T));
-                    getQueue().enqueueWriteBuffer(*gauss_filter.data, CL_TRUE, 0, gauss_elem * sizeof(T), h_gauss);
+        d_harris_sorted.info.offset = 0;
+        d_harris_idx.info.offset    = 0;
+        d_harris_sorted.data        = d_score_harris;
+        // Create indices using range
+        d_harris_idx.data =
+            bufferAlloc((d_harris_idx.info.dims[0]) * sizeof(unsigned));
+        kernel::range<uint>(d_harris_idx, 0);
+
+        kernel::sort0ByKey<float, uint>(d_harris_sorted, d_harris_idx, false);
+
+        Buffer* d_x_lvl     = bufferAlloc(usable_feat * sizeof(float));
+        Buffer* d_y_lvl     = bufferAlloc(usable_feat * sizeof(float));
+        Buffer* d_score_lvl = bufferAlloc(usable_feat * sizeof(float));
+
+        usable_feat = std::min(usable_feat, lvl_best[i]);
+
+        // Keep only features with higher Harris responses
+        const int keep_blk = divup(usable_feat, ORB_THREADS);
+        const NDRange local_keep(ORB_THREADS, 1);
+        const NDRange global_keep(keep_blk * ORB_THREADS, 1);
+
+        auto kfOp = kernels[1];
+
+        kfOp(EnqueueArgs(getQueue(), global_keep, local_keep), *d_x_lvl,
+             *d_y_lvl, *d_score_lvl, *d_x_harris, *d_y_harris,
+             *d_harris_sorted.data, *d_harris_idx.data, usable_feat);
+        CL_DEBUG_FINISH(getQueue());
+
+        bufferFree(d_x_harris);
+        bufferFree(d_y_harris);
+        bufferFree(d_harris_sorted.data);
+        bufferFree(d_harris_idx.data);
+
+        Buffer* d_ori_lvl  = bufferAlloc(usable_feat * sizeof(float));
+        Buffer* d_size_lvl = bufferAlloc(usable_feat * sizeof(float));
+
+        // Compute orientation of features
+        const int centroid_blk_x = divup(usable_feat, ORB_THREADS_X);
+        const NDRange local_centroid(ORB_THREADS_X, ORB_THREADS_Y);
+        const NDRange global_centroid(centroid_blk_x * ORB_THREADS_X,
+                                      ORB_THREADS_Y);
+
+        auto caOp = kernels[2];
+
+        caOp(EnqueueArgs(getQueue(), global_centroid, local_centroid), *d_x_lvl,
+             *d_y_lvl, *d_ori_lvl, usable_feat, *lvl_img.data, lvl_img.info,
+             patch_size);
+        CL_DEBUG_FINISH(getQueue());
+
+        Param lvl_filt;
+        Param lvl_tmp;
+
+        if (blur_img) {
+            lvl_filt = lvl_img;
+            lvl_tmp  = lvl_img;
+
+            lvl_filt.data = bufferAlloc(lvl_filt.info.dims[0] *
+                                        lvl_filt.info.dims[1] * sizeof(T));
+            lvl_tmp.data  = bufferAlloc(lvl_tmp.info.dims[0] *
+                                        lvl_tmp.info.dims[1] * sizeof(T));
+
+            // Calculate a separable Gaussian kernel
+            if (h_gauss == nullptr) {
+                h_gauss = new T[gauss_len];
+                gaussian1D(h_gauss, gauss_len, 2.f);
+                gauss_filter.info.dims[0]    = gauss_len;
+                gauss_filter.info.strides[0] = 1;
+
+                for (int k = 1; k < 4; k++) {
+                    gauss_filter.info.dims[k] = 1;
+                    gauss_filter.info.strides[k] =
+                        gauss_filter.info.dims[k - 1] *
+                        gauss_filter.info.strides[k - 1];
                 }
 
-                // Filter level image with Gaussian kernel to reduce noise sensitivity
-                convolve2<T, convAccT, 0, false, gauss_len>(lvl_tmp, lvl_img, gauss_filter);
-                convolve2<T, convAccT, 1, false, gauss_len>(lvl_filt, lvl_tmp, gauss_filter);
-
-                bufferFree(lvl_tmp.data);
+                int gauss_elem =
+                    gauss_filter.info.strides[3] * gauss_filter.info.dims[3];
+                gauss_filter.data = bufferAlloc(gauss_elem * sizeof(T));
+                getQueue().enqueueWriteBuffer(*gauss_filter.data, CL_TRUE, 0,
+                                              gauss_elem * sizeof(T), h_gauss);
             }
 
-            // Compute ORB descriptors
-            cl::Buffer* d_desc_lvl = bufferAlloc(usable_feat * 8 * sizeof(unsigned));
-            {
-                vector<unsigned> h_desc_lvl(usable_feat * 8);
-                getQueue().enqueueWriteBuffer(*d_desc_lvl, CL_TRUE, 0, usable_feat * 8 * sizeof(unsigned), h_desc_lvl.data());
-            }
-
-            auto eoOp = make_kernel<Buffer, const unsigned,
-                                    Buffer, Buffer, Buffer, Buffer,
-                                    Buffer, KParam,
-                                    const float, const unsigned> (*eoKernel[device]);
-
-            if (blur_img) {
-                eoOp(EnqueueArgs(getQueue(), global_centroid, local_centroid),
-                     *d_desc_lvl, usable_feat,
-                     *d_x_lvl, *d_y_lvl, *d_ori_lvl, *d_size_lvl,
-                     *lvl_filt.data, lvl_filt.info,
-                     lvl_scl, patch_size);
-                CL_DEBUG_FINISH(getQueue());
+            // Filter level image with Gaussian kernel to reduce noise
+            // sensitivity
+            convSep<T, convAccT>(lvl_tmp, lvl_img, gauss_filter, 0, false);
+            convSep<T, convAccT>(lvl_filt, lvl_tmp, gauss_filter, 1, false);
 
-                bufferFree(lvl_filt.data);
-            }
-            else {
-                eoOp(EnqueueArgs(getQueue(), global_centroid, local_centroid),
-                     *d_desc_lvl, usable_feat,
-                     *d_x_lvl, *d_y_lvl, *d_ori_lvl, *d_size_lvl,
-                     *lvl_img.data, lvl_img.info,
-                     lvl_scl, patch_size);
-                CL_DEBUG_FINISH(getQueue());
-            }
-
-            // Store results to pyramids
-            total_feat += usable_feat;
-            feat_pyr[i] = usable_feat;
-            d_x_pyr[i] = d_x_lvl;
-            d_y_pyr[i] = d_y_lvl;
-            d_score_pyr[i] = d_score_lvl;
-            d_ori_pyr[i] = d_ori_lvl;
-            d_size_pyr[i] = d_size_lvl;
-            d_desc_pyr[i] = d_desc_lvl;
-
-            if (i > 0 && i == max_levels-1)
-                bufferFree(lvl_img.data);
+            bufferFree(lvl_tmp.data);
         }
 
-        if (gauss_filter.data != nullptr)
-            bufferFree(gauss_filter.data);
-        if (h_gauss != nullptr)
-            delete[] h_gauss;
+        // Compute ORB descriptors
+        Buffer* d_desc_lvl = bufferAlloc(usable_feat * 8 * sizeof(unsigned));
+        getQueue().enqueueFillBuffer(*d_desc_lvl, 0U, 0,
+                                     usable_feat * 8 * sizeof(unsigned));
+        auto eoOp = kernels[3];
+        if (blur_img) {
+            eoOp(EnqueueArgs(getQueue(), global_centroid, local_centroid),
+                 *d_desc_lvl, usable_feat, *d_x_lvl, *d_y_lvl, *d_ori_lvl,
+                 *d_size_lvl, *lvl_filt.data, lvl_filt.info, lvl_scl,
+                 patch_size);
+            CL_DEBUG_FINISH(getQueue());
 
-        // If no features are found, set found features to 0 and return
-        if (total_feat == 0) {
-            *out_feat = 0;
-            return;
+            bufferFree(lvl_filt.data);
+        } else {
+            eoOp(EnqueueArgs(getQueue(), global_centroid, local_centroid),
+                 *d_desc_lvl, usable_feat, *d_x_lvl, *d_y_lvl, *d_ori_lvl,
+                 *d_size_lvl, *lvl_img.data, lvl_img.info, lvl_scl, patch_size);
+            CL_DEBUG_FINISH(getQueue());
         }
 
-        // Allocate output memory
-        x_out.info.dims[0] = total_feat;
-        x_out.info.strides[0] = 1;
-        y_out.info.dims[0] = total_feat;
-        y_out.info.strides[0] = 1;
-        score_out.info.dims[0] = total_feat;
-        score_out.info.strides[0] = 1;
-        ori_out.info.dims[0] = total_feat;
-        ori_out.info.strides[0] = 1;
-        size_out.info.dims[0] = total_feat;
-        size_out.info.strides[0] = 1;
-
-        desc_out.info.dims[0] = 8;
-        desc_out.info.strides[0] = 1;
-        desc_out.info.dims[1] = total_feat;
-        desc_out.info.strides[1] = desc_out.info.dims[0];
+        // Store results to pyramids
+        total_feat += usable_feat;
+        feat_pyr[i]    = usable_feat;
+        d_x_pyr[i]     = d_x_lvl;
+        d_y_pyr[i]     = d_y_lvl;
+        d_score_pyr[i] = d_score_lvl;
+        d_ori_pyr[i]   = d_ori_lvl;
+        d_size_pyr[i]  = d_size_lvl;
+        d_desc_pyr[i]  = d_desc_lvl;
+
+        if (i > 0 && i == max_levels - 1) bufferFree(lvl_img.data);
+    }
 
-        for (int k = 1; k < 4; k++) {
-            x_out.info.dims[k] = 1;
-            x_out.info.strides[k] = x_out.info.dims[k - 1] * x_out.info.strides[k - 1];
-            y_out.info.dims[k] = 1;
-            y_out.info.strides[k] = y_out.info.dims[k - 1] * y_out.info.strides[k - 1];
-            score_out.info.dims[k] = 1;
-            score_out.info.strides[k] = score_out.info.dims[k - 1] * score_out.info.strides[k - 1];
-            ori_out.info.dims[k] = 1;
-            ori_out.info.strides[k] = ori_out.info.dims[k - 1] * ori_out.info.strides[k - 1];
-            size_out.info.dims[k] = 1;
-            size_out.info.strides[k] = size_out.info.dims[k - 1] * size_out.info.strides[k - 1];
-            if (k > 1) {
-                desc_out.info.dims[k] = 1;
-                desc_out.info.strides[k] = desc_out.info.dims[k - 1] * desc_out.info.strides[k - 1];
-            }
-        }
+    if (gauss_filter.data != nullptr) bufferFree(gauss_filter.data);
+    if (h_gauss != nullptr) delete[] h_gauss;
 
-        if (total_feat > 0) {
-            size_t out_sz  = total_feat * sizeof(float);
-            x_out.data     = bufferAlloc(out_sz);
-            y_out.data     = bufferAlloc(out_sz);
-            score_out.data = bufferAlloc(out_sz);
-            ori_out.data   = bufferAlloc(out_sz);
-            size_out.data  = bufferAlloc(out_sz);
+    // If no features are found, set found features to 0 and return
+    if (total_feat == 0) {
+        *out_feat = 0;
+        return;
+    }
 
-            size_t desc_sz = total_feat * 8 * sizeof(unsigned);
-            desc_out.data  = bufferAlloc(desc_sz);
+    // Allocate output memory
+    x_out.info.dims[0]        = total_feat;
+    x_out.info.strides[0]     = 1;
+    y_out.info.dims[0]        = total_feat;
+    y_out.info.strides[0]     = 1;
+    score_out.info.dims[0]    = total_feat;
+    score_out.info.strides[0] = 1;
+    ori_out.info.dims[0]      = total_feat;
+    ori_out.info.strides[0]   = 1;
+    size_out.info.dims[0]     = total_feat;
+    size_out.info.strides[0]  = 1;
+
+    desc_out.info.dims[0]    = 8;
+    desc_out.info.strides[0] = 1;
+    desc_out.info.dims[1]    = total_feat;
+    desc_out.info.strides[1] = desc_out.info.dims[0];
+
+    for (int k = 1; k < 4; k++) {
+        x_out.info.dims[k] = 1;
+        x_out.info.strides[k] =
+            x_out.info.dims[k - 1] * x_out.info.strides[k - 1];
+        y_out.info.dims[k] = 1;
+        y_out.info.strides[k] =
+            y_out.info.dims[k - 1] * y_out.info.strides[k - 1];
+        score_out.info.dims[k] = 1;
+        score_out.info.strides[k] =
+            score_out.info.dims[k - 1] * score_out.info.strides[k - 1];
+        ori_out.info.dims[k] = 1;
+        ori_out.info.strides[k] =
+            ori_out.info.dims[k - 1] * ori_out.info.strides[k - 1];
+        size_out.info.dims[k] = 1;
+        size_out.info.strides[k] =
+            size_out.info.dims[k - 1] * size_out.info.strides[k - 1];
+        if (k > 1) {
+            desc_out.info.dims[k] = 1;
+            desc_out.info.strides[k] =
+                desc_out.info.dims[k - 1] * desc_out.info.strides[k - 1];
         }
+    }
 
-        unsigned offset = 0;
-        for (unsigned i = 0; i < max_levels; i++) {
-            if (feat_pyr[i] == 0)
-                continue;
-
-            if (i > 0)
-                offset += feat_pyr[i-1];
-
-            getQueue().enqueueCopyBuffer(*d_x_pyr[i], *x_out.data, 0, offset*sizeof(float), feat_pyr[i] * sizeof(float));
-            getQueue().enqueueCopyBuffer(*d_y_pyr[i], *y_out.data, 0, offset*sizeof(float), feat_pyr[i] * sizeof(float));
-            getQueue().enqueueCopyBuffer(*d_score_pyr[i], *score_out.data, 0, offset*sizeof(float), feat_pyr[i] * sizeof(float));
-            getQueue().enqueueCopyBuffer(*d_ori_pyr[i], *ori_out.data, 0, offset*sizeof(float), feat_pyr[i] * sizeof(float));
-            getQueue().enqueueCopyBuffer(*d_size_pyr[i], *size_out.data, 0, offset*sizeof(float), feat_pyr[i] * sizeof(float));
-
-            getQueue().enqueueCopyBuffer(*d_desc_pyr[i], *desc_out.data, 0, offset*8*sizeof(unsigned), feat_pyr[i] * 8 * sizeof(unsigned));
-
-            bufferFree(d_x_pyr[i]);
-            bufferFree(d_y_pyr[i]);
-            bufferFree(d_score_pyr[i]);
-            bufferFree(d_ori_pyr[i]);
-            bufferFree(d_size_pyr[i]);
-            bufferFree(d_desc_pyr[i]);
-        }
+    if (total_feat > 0) {
+        size_t out_sz  = total_feat * sizeof(float);
+        x_out.data     = bufferAlloc(out_sz);
+        y_out.data     = bufferAlloc(out_sz);
+        score_out.data = bufferAlloc(out_sz);
+        ori_out.data   = bufferAlloc(out_sz);
+        size_out.data  = bufferAlloc(out_sz);
 
-        // Sets number of output features
-        *out_feat = total_feat;
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+        size_t desc_sz = total_feat * 8 * sizeof(unsigned);
+        desc_out.data  = bufferAlloc(desc_sz);
     }
-}
 
-} //namespace kernel
+    unsigned offset = 0;
+    for (unsigned i = 0; i < max_levels; i++) {
+        if (i > 0) offset += feat_pyr[i - 1];
+
+        if (feat_pyr[i] == 0) continue;
+
+        getQueue().enqueueCopyBuffer(*d_x_pyr[i], *x_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_y_pyr[i], *y_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_score_pyr[i], *score_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_ori_pyr[i], *ori_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_size_pyr[i], *size_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_desc_pyr[i], *desc_out.data, 0,
+                                     offset * 8 * sizeof(unsigned),
+                                     feat_pyr[i] * 8 * sizeof(unsigned));
+
+        bufferFree(d_x_pyr[i]);
+        bufferFree(d_y_pyr[i]);
+        bufferFree(d_score_pyr[i]);
+        bufferFree(d_ori_pyr[i]);
+        bufferFree(d_size_pyr[i]);
+        bufferFree(d_desc_pyr[i]);
+    }
 
-} //namespace opencl
+    // Sets number of output features
+    *out_feat = total_feat;
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
+
+#if defined(__clang__)
+/* Clang/LLVM */
+#pragma clang diagnostic pop
+#elif defined(__ICC) || defined(__INTEL_COMPILER)
+/* Intel ICC/ICPC */
+// Fix the warning code here, if any
+#elif defined(__GNUC__) || defined(__GNUG__)
+/* GNU GCC/G++ */
+#pragma GCC diagnostic pop
+#elif defined(_MSC_VER)
+/* Microsoft Visual Studio */
+#pragma warning(pop)
+#else
+/* Other */
+#endif
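The final loop of the ORB routine above concatenates the per-level feature buffers into single output arrays, so each level has to start where all earlier levels end. A minimal host-side sketch of that bookkeeping follows, using plain `std::vector` in place of OpenCL buffers. The names `levelOffsets` and `concatLevels` are hypothetical and are not part of ArrayFire. It is meant only to illustrate the running-sum offsets, including levels that contribute no features.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// For each pyramid level, return the element offset at which its features
// begin in the concatenated output (an exclusive running sum of the counts).
std::vector<unsigned> levelOffsets(const std::vector<unsigned>& featPerLevel) {
    std::vector<unsigned> offsets(featPerLevel.size(), 0);
    unsigned running = 0;
    for (std::size_t i = 0; i < featPerLevel.size(); ++i) {
        offsets[i] = running;
        running += featPerLevel[i];
    }
    return offsets;
}

// Host-side stand-in for the enqueueCopyBuffer calls: copy every non-empty
// level into one contiguous output vector at its computed offset.
std::vector<float> concatLevels(const std::vector<std::vector<float>>& levels) {
    std::vector<unsigned> counts;
    for (const auto& lvl : levels) {
        counts.push_back(static_cast<unsigned>(lvl.size()));
    }
    const std::vector<unsigned> offsets = levelOffsets(counts);

    std::size_t total = 0;
    for (unsigned c : counts) total += c;

    std::vector<float> out(total);
    for (std::size_t i = 0; i < levels.size(); ++i) {
        if (levels[i].empty()) continue;  // nothing to copy for this level
        assert(offsets[i] + levels[i].size() <= out.size());
        std::copy(levels[i].begin(), levels[i].end(), out.begin() + offsets[i]);
    }
    return out;
}
```

Computing every offset before any level is skipped keeps an empty level from shifting the levels that follow it.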
diff --git a/src/backend/opencl/kernel/pad_array_borders.cl b/src/backend/opencl/kernel/pad_array_borders.cl
new file mode 100644
index 0000000000..f62111fb9d
--- /dev/null
+++ b/src/backend/opencl/kernel/pad_array_borders.cl
@@ -0,0 +1,105 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#if AF_BORDER_TYPE == AF_PAD_SYM
+
+int trimIndex(int idx, const int len) {
+    int ret_val = idx;
+    // Map an out-of-range index back into [0, len) by reflecting at the edge
+    if (ret_val < 0) {
+        int offset = (abs(ret_val) - 1) % len;
+        ret_val    = offset;
+    } else if (ret_val >= len) {
+        int offset = abs(ret_val) % len;
+        ret_val    = len - offset - 1;
+    }
+    return ret_val;
+}
+
+// TODO(Pradeep) move trimIndex from all locations into
+//              a single header after the OpenCL cache is cleaned up
+int idxByndEdge(const int i, const int lb, const int len) {
+    return trimIndex(i - lb, len);
+}
+
+#elif AF_BORDER_TYPE == AF_PAD_CLAMP_TO_EDGE
+
+int idxByndEdge(const int i, const int lb, const int len) {
+    return clamp(i - lb, 0, len - 1);
+}
+
+#elif AF_BORDER_TYPE == AF_PAD_PERIODIC
+
+int idxByndEdge(const int i, const int lb, const int len) {
+    int rem  = (i - lb) % len;
+    int cond = rem < 0;
+    return cond * (rem + len) + (1 - cond) * rem;
+}
+
+#else
+
+#define DEFAULT_BORDER
+
+#endif
+
+kernel void padBorders(global T* out, KParam oInfo, global const T* in,
+                       KParam iInfo, int l0, int l1, int l2, int l3,
+                       unsigned blk_x, unsigned blk_y) {
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+    const int k  = get_group_id(0) / blk_x;
+    const int l  = get_group_id(1) / blk_y;
+
+    const int blockIdx_x = get_group_id(0) - (blk_x)*k;
+    const int blockIdx_y = get_group_id(1) - (blk_y)*l;
+    const int i          = blockIdx_x * get_local_size(0) + lx;
+    const int j          = blockIdx_y * get_local_size(1) + ly;
+
+    const int d0 = iInfo.dims[0];
+    const int d1 = iInfo.dims[1];
+    const int d2 = iInfo.dims[2];
+    const int d3 = iInfo.dims[3];
+    const int s0 = iInfo.strides[0];
+    const int s1 = iInfo.strides[1];
+    const int s2 = iInfo.strides[2];
+    const int s3 = iInfo.strides[3];
+
+    global const T* src = in + iInfo.offset;
+    global T* dst       = out;
+
+    bool isNotPadding =
+        (l >= l3 && l < (d3 + l3)) && (k >= l2 && k < (d2 + l2)) &&
+        (j >= l1 && j < (d1 + l1)) && (i >= l0 && i < (d0 + l0));
+    T value = (T)0;
+
+    if (isNotPadding) {
+        unsigned iLOff = (l - l3) * s3;
+        unsigned iKOff = (k - l2) * s2;
+        unsigned iJOff = (j - l1) * s1;
+        unsigned iIOff = (i - l0) * s0;
+
+        value = src[iLOff + iKOff + iJOff + iIOff];
+    } else {
+#if !defined(DEFAULT_BORDER)
+        unsigned iLOff = idxByndEdge(l, l3, d3) * s3;
+        unsigned iKOff = idxByndEdge(k, l2, d2) * s2;
+        unsigned iJOff = idxByndEdge(j, l1, d1) * s1;
+        unsigned iIOff = idxByndEdge(i, l0, d0) * s0;
+
+        value = src[iLOff + iKOff + iJOff + iIOff];
+#endif
+    }
+
+    if (i < oInfo.dims[0] && j < oInfo.dims[1] && k < oInfo.dims[2] &&
+        l < oInfo.dims[3]) {
+        unsigned offset = l * oInfo.strides[3] + k * oInfo.strides[2] +
+                          j * oInfo.strides[1] + i * oInfo.strides[0];
+        dst[offset] = value;
+    }
+}
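The three `idxByndEdge` variants in the kernel above each map an output coordinate that falls in the padding region back to a valid input index. The sketch below mirrors them on the host so the mappings can be checked with small numbers. The function names `symIndex`, `clampIndex` and `periodicIndex` are hypothetical. The arithmetic follows the kernel source as written, under the assumption that `len > 0`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>

// AF_PAD_SYM: reflect about the nearest edge, repeating the edge element.
// Mirrors trimIndex(i - lb, len) from the kernel.
int symIndex(int i, int lb, int len) {
    assert(len > 0);
    const int idx = i - lb;
    if (idx < 0) return (std::abs(idx) - 1) % len;
    if (idx >= len) return len - (idx % len) - 1;
    return idx;
}

// AF_PAD_CLAMP_TO_EDGE: pin the index to the first or last valid element.
int clampIndex(int i, int lb, int len) {
    assert(len > 0);
    return std::min(std::max(i - lb, 0), len - 1);
}

// AF_PAD_PERIODIC: wrap around. The % operator truncates toward zero, so a
// negative remainder is shifted up by len.
int periodicIndex(int i, int lb, int len) {
    assert(len > 0);
    const int rem = (i - lb) % len;
    return rem < 0 ? rem + len : rem;
}
```

For example, with a left padding of `lb = 2` and an input length of `len = 4`, output coordinate `1` (one element left of the input) reads input index `0` under symmetric and clamp padding and index `3` under periodic padding.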
diff --git a/src/backend/opencl/kernel/pad_array_borders.hpp b/src/backend/opencl/kernel/pad_array_borders.hpp
new file mode 100644
index 0000000000..53ee36d8d8
--- /dev/null
+++ b/src/backend/opencl/kernel/pad_array_borders.hpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/pad_array_borders.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+static const int PADB_THREADS_X = 16;
+static const int PADB_THREADS_Y = 16;
+
+template<typename T>
+void padBorders(Param out, const Param in, dim4 const& lBPadding,
+                const af_border_type borderType) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(borderType),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(AF_BORDER_TYPE, (int)borderType),
+        DefineKeyValue(AF_PAD_SYM, (int)AF_PAD_SYM),
+        DefineKeyValue(AF_PAD_PERIODIC, (int)AF_PAD_PERIODIC),
+        DefineKeyValue(AF_PAD_CLAMP_TO_EDGE, (int)AF_PAD_CLAMP_TO_EDGE),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto pad = common::getKernel("padBorders", {{pad_array_borders_cl_src}},
+                                 tmpltArgs, compileOpts);
+
+    NDRange local(PADB_THREADS_X, PADB_THREADS_Y);
+
+    unsigned blk_x = divup(out.info.dims[0], local[0]);
+    unsigned blk_y = divup(out.info.dims[1], local[1]);
+
+    NDRange global(blk_x * out.info.dims[2] * local[0],
+                   blk_y * out.info.dims[3] * local[1]);
+
+    pad(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+        in.info, static_cast<int>(lBPadding[0]), static_cast<int>(lBPadding[1]),
+        static_cast<int>(lBPadding[2]), static_cast<int>(lBPadding[3]), blk_x,
+        blk_y);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
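The host wrapper above folds the third and fourth output dimensions into a 2D launch grid, and the kernel recovers them from the work-group id. The following sketch reproduces that arithmetic on the host so it can be traced by hand. `LaunchDims`, `Coord`, `padLaunchDims` and `decodeWorkItem` are hypothetical names, and the default 16x16 work-group size is taken from `PADB_THREADS_X` and `PADB_THREADS_Y`.

```cpp
#include <cassert>

struct LaunchDims {
    unsigned blkX, blkY;        // work-groups needed to cover dims[0] and dims[1]
    unsigned globalX, globalY;  // total work-items per launch dimension
};

struct Coord {
    int i, j, k, l;  // output coordinates along dims 0..3
};

constexpr unsigned divUp(unsigned a, unsigned b) { return (a + b - 1) / b; }

// Mirrors the NDRange computation in the host wrapper.
LaunchDims padLaunchDims(const unsigned dims[4], unsigned tx = 16,
                         unsigned ty = 16) {
    assert(tx > 0 && ty > 0);
    const unsigned blkX = divUp(dims[0], tx);
    const unsigned blkY = divUp(dims[1], ty);
    return {blkX, blkY, blkX * dims[2] * tx, blkY * dims[3] * ty};
}

// Mirrors how the kernel turns group and local ids into (i, j, k, l).
Coord decodeWorkItem(unsigned groupX, unsigned groupY, unsigned localX,
                     unsigned localY, const LaunchDims& d, unsigned tx = 16,
                     unsigned ty = 16) {
    const unsigned k  = groupX / d.blkX;      // slice along dims[2]
    const unsigned l  = groupY / d.blkY;      // slice along dims[3]
    const unsigned bx = groupX - d.blkX * k;  // block index within the slice
    const unsigned by = groupY - d.blkY * l;
    return {static_cast<int>(bx * tx + localX),
            static_cast<int>(by * ty + localY), static_cast<int>(k),
            static_cast<int>(l)};
}
```

For a 40x20x3x2 output this gives 3x2 blocks per slice and a 144x64 global range. Work-items whose `i` or `j` land past the output extent are masked by the bounds check at the end of the kernel.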
diff --git a/src/backend/opencl/kernel/random.cl b/src/backend/opencl/kernel/random.cl
deleted file mode 100644
index 5debf5259b..0000000000
--- a/src/backend/opencl/kernel/random.cl
+++ /dev/null
@@ -1,290 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- *
- ********************************************************/
-
-typedef ulong uint64_t;
-typedef uint  uint32_t;
-
-#define PI_VAL 3.1415926535897932384626433832795028841971693993751058209749445923078164
-#define R123_STATIC_INLINE inline
-#define R123_0x1p_32f (1.f/4294967296.f)
-#define R123_0x1p_64 (1./(4294967296.*4294967296.))
-
-#ifdef IS_64
-#define SKEIN_KS_PARITY SKEIN_KS_PARITY64
-#define RotL RotL_64
-#define uint_t uint64_t
-
-#define result(a) (a)*R123_0x1p_64
-
-enum r123_enum_threefry64x2 {
-    /*
-    // Output from skein_rot_search: (srs64_B64-X1000)
-    // Random seed = 1. BlockSize = 128 bits. sampleCnt =  1024. rounds =  8, minHW_or=57
-    // Start: Tue Mar  1 10:07:48 2011
-    // rMin = 0.136. #0325[*15] [CRC=455A682F. hw_OR=64. cnt=16384. blkSize= 128].format
-    */
-    R_2_0_0=16,
-    R_2_1_0=42,
-    R_2_2_0=12,
-    R_2_3_0=31,
-    R_2_4_0=16,
-    R_2_5_0=32,
-    R_2_6_0=24,
-    R_2_7_0=21
-    /* 4 rounds: minHW =  4  [  4  4  4  4 ]
-    // 5 rounds: minHW =  8  [  8  8  8  8 ]
-    // 6 rounds: minHW = 16  [ 16 16 16 16 ]
-    // 7 rounds: minHW = 32  [ 32 32 32 32 ]
-    // 8 rounds: minHW = 64  [ 64 64 64 64 ]
-    // 9 rounds: minHW = 64  [ 64 64 64 64 ]
-    //10 rounds: minHW = 64  [ 64 64 64 64 ]
-    //11 rounds: minHW = 64  [ 64 64 64 64 ] */
-};
-#else
-#define SKEIN_KS_PARITY SKEIN_KS_PARITY32
-#define RotL RotL_32
-#define uint_t uint32_t
-
-#ifdef IS_BOOL
-#define result(a) ((a)*R123_0x1p_32f) > 0.5
-#else
-#define result(a) (a)*R123_0x1p_32f
-#endif
-
-enum r123_enum_threefry32x2 {
-    /* Output from skein_rot_search (srs2-X5000.out)
-    // Random seed = 1. BlockSize = 64 bits. sampleCnt =  1024. rounds =  8, minHW_or=28
-    // Start: Tue Jul 12 11:11:33 2011
-    // rMin = 0.334. #0206[*07] [CRC=1D9765C0. hw_OR=32. cnt=16384. blkSize=  64].format   */
-    R_2_0_0=13,
-    R_2_1_0=15,
-    R_2_2_0=26,
-    R_2_3_0= 6,
-    R_2_4_0=17,
-    R_2_5_0=29,
-    R_2_6_0=16,
-    R_2_7_0=24
-
-    /* 4 rounds: minHW =  4  [  4  4  4  4 ]
-    // 5 rounds: minHW =  6  [  6  8  6  8 ]
-    // 6 rounds: minHW =  9  [  9 12  9 12 ]
-    // 7 rounds: minHW = 16  [ 16 24 16 24 ]
-    // 8 rounds: minHW = 32  [ 32 32 32 32 ]
-    // 9 rounds: minHW = 32  [ 32 32 32 32 ]
-    //10 rounds: minHW = 32  [ 32 32 32 32 ]
-    //11 rounds: minHW = 32  [ 32 32 32 32 ] */
-};
-#endif
-
-enum r123_enum_threefry_wcnt {
-    WCNT2=2,
-    WCNT4=4
-};
-
-R123_STATIC_INLINE uint64_t RotL_64(uint64_t x, uint64_t N)
-{
-    return (x << (N & 63)) | (x >> ((64-N) & 63));
-}
-
-R123_STATIC_INLINE uint32_t RotL_32(uint32_t x, uint32_t N)
-{
-    return (x << (N & 31)) | (x >> ((32-N) & 31));
-}
-
-#define SKEIN_MK_64(hi32,lo32)  ((lo32) + (((uint64_t) (hi32)) << 32))
-#define SKEIN_KS_PARITY64         SKEIN_MK_64(0x1BD11BDA,0xA9FC1A22)
-#define SKEIN_KS_PARITY32         0x1BD11BDA
-
-
-// http://www.thesalmons.org/john/random123/releases/1.06/docs/structr123_1_1Threefry2x32__R.html#af5be46f8426cfcd86e75327e4b3750b0
-#define Nrounds 16
-
-struct r123array2
-{
-    uint_t v[2];
-};
-
-typedef struct r123array2 threefry2_ctr_t;
-typedef struct r123array2 threefry2_key_t;
-typedef struct r123array2 threefry2_ukey_t;
-
-R123_STATIC_INLINE
-threefry2_key_t threefry2keyinit(threefry2_ukey_t uk) { return uk; }
-
-R123_STATIC_INLINE
-threefry2_ctr_t threefry2_R(threefry2_ctr_t in, threefry2_key_t k)
-{
-    threefry2_ctr_t X;
-    uint_t ks[2+1];
-    int  i; /* avoid size_t to avoid need for stddef.h */
-    ks[2] =  SKEIN_KS_PARITY;
-    for (i=0;i < 2; i++)
-    {
-        ks[i] = k.v[i];
-        X.v[i]  = in.v[i];
-        ks[2] ^= k.v[i];
-    }
-
-    /* Insert initial key before round 0 */
-    X.v[0] += ks[0]; X.v[1] += ks[1];
-
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_0_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_1_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_2_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_3_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=1) */
-    X.v[0] += ks[1]; X.v[1] += ks[2];
-    X.v[1] += 1;     /* X.v[2-1] += r  */
-
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_4_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_5_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_6_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_7_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=2) */
-    X.v[0] += ks[2]; X.v[1] += ks[0];
-    X.v[1] += 2;
-
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_0_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_1_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_2_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_3_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=3) */
-    X.v[0] += ks[0]; X.v[1] += ks[1];
-    X.v[1] += 3;
-
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_4_0); X.v[1] ^= X.v[0];
-
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_5_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_6_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_7_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=4) */
-    X.v[0] += ks[1]; X.v[1] += ks[2];
-    X.v[1] += 4;
-
-#if Nrounds > 16
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_0_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_1_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_2_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_3_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=4) */
-    X.v[0] += ks[2]; X.v[1] += ks[0];
-    X.v[1] += 5;
-#endif
-
-#if Nrounds > 20
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_0_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_1_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_2_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_3_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=3) */
-    X.v[0] += ks[0]; X.v[1] += ks[1];
-    X.v[1] += 6;
-#endif
-
-#if Nrounds > 24
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_4_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_5_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_6_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_7_0); X.v[1] ^= X.v[0];
-
-    /* InjectKey(r=4) */
-    X.v[0] += ks[1]; X.v[1] += ks[2];
-    X.v[1] += 7;
-#endif
-
-#if Nrounds > 28
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_0_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_1_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_2_0); X.v[1] ^= X.v[0];
-    X.v[0] += X.v[1]; X.v[1] = RotL(X.v[1],R_2_3_0); X.v[1] ^= X.v[0];
-
-        /* InjectKey(r=4) */
-    X.v[0] += ks[2]; X.v[1] += ks[0];
-    X.v[1] += 8;
-#endif
-
-    return X;
-}
-
-#define threefry2(c,k) threefry2_R(c, k)
-
-#ifdef randu
-void generate(T *one, T *two, threefry2_ctr_t *c, threefry2_key_t k)
-{
-    threefry2_ctr_t r = threefry2(*c, k);
-    c->v[0] = c->v[0] + 1;
-
-    *one = result(r.v[0]);
-    *two = result(r.v[1]);
-}
-#endif
-
-#ifdef randn
-void generate(T *one, T *two, threefry2_ctr_t *c, threefry2_key_t k)
-{
-    threefry2_ctr_t r = threefry2(*c, k);
-    c->v[0] = c->v[0] + 1;
-
-    T u1 = result(r.v[0]);
-    T u2 = result(r.v[1]);
-
-    T R     = sqrt(-2*log(u1));
-    T Theta = 2 * PI_VAL * u2;
-
-    *one = R * sin(Theta);
-    *two = R * cos(Theta);
-}
-#endif
-
-#ifdef randi
-void generate(T *one, T *two, threefry2_ctr_t *c, threefry2_key_t k)
-{
-    threefry2_ctr_t r = threefry2(*c, k);
-    c->v[0] = c->v[0] + 1;
-
-    *one = (T)r.v[0];
-    *two = (T)r.v[1];
-}
-#endif
-
-__kernel void random(__global T *output, unsigned numel,
-                    unsigned counter, unsigned lo, unsigned hi)
-{
-    unsigned gid = get_group_id(0);
-    unsigned off = get_local_size(0);
-    unsigned tid =  off * gid * repeat + get_local_id(0);
-
-    threefry2_key_t k = {{tid, lo}};
-    threefry2_ctr_t c = {{counter, hi}};
-
-    T one, two;
-
-    if (gid < get_num_groups(0) - 1) {
-        for(int i = 0; i < repeat; i+=2) {
-            generate(&one, &two, &c, k);
-            output[tid      ] = one;
-            output[tid + off] = two;
-            tid += 2 * off;
-        }
-    } else {
-        for(int i = 0; i < repeat; i+=2) {
-            generate(&one, &two, &c, k);
-            if (tid       < numel) output[tid      ] = one;
-            if (tid + off < numel) output[tid + off] = two;
-            tid += 2 * off;
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/random.hpp b/src/backend/opencl/kernel/random.hpp
deleted file mode 100644
index a903e39a01..0000000000
--- a/src/backend/opencl/kernel/random.hpp
+++ /dev/null
@@ -1,142 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-
-#include <platform.hpp>
-#include <af/defines.h>
-#include <kernel_headers/random.hpp>
-#include <traits.hpp>
-#include <sstream>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
-#include <debug_opencl.hpp>
-#include <program.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        static const uint REPEAT  = 32;
-        static const uint THREADS = 256;
-
-        static uint random_seed[2] = {0, 0};
-        static unsigned counter;
-
-        template<typename T, bool isRandu>
-        struct random_name
-        {
-            const char *name()
-            {
-                return "randi";
-            }
-        };
-
-        template<typename T>
-        struct random_name<T, false>
-        {
-            const char *name()
-            {
-                return "randn";
-            }
-        };
-
-        template<>
-        struct random_name<float, true>
-        {
-            const char *name()
-            {
-                return "randu";
-            }
-        };
-
-        template<>
-        struct random_name<char, true>
-        {
-            const char *name()
-            {
-                return "randu";
-            }
-        };
-
-        template<>
-        struct random_name<double, true>
-        {
-            const char *name()
-            {
-                return "randu";
-            }
-        };
-
-        template<typename> static bool isDouble() { return false; }
-        template<> STATIC_ bool isDouble<double>() { return true; }
-        template<> STATIC_ bool isDouble<cdouble>() { return true; }
-
-        template<typename T, bool isRandu>
-        void random(cl::Buffer out, int elements)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>  ranProgs;
-                static std::map<int, Kernel*>   ranKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                        Program::Sources setSrc;
-                        setSrc.emplace_back(random_cl, random_cl_len);
-
-                        std::ostringstream options;
-                        options << " -D T=" << dtype_traits<T>::getName()
-                                << " -D repeat="<< REPEAT
-                                << " -D " << random_name<T, isRandu>().name();
-
-                        if (std::is_same<T, double>::value) {
-                            options << " -D USE_DOUBLE";
-                            options << " -D IS_64";
-                        }
-
-                        if (std::is_same<T, char>::value) {
-                            options << " -D IS_BOOL";
-                        }
-
-                        cl::Program prog;
-                        buildProgram(prog, random_cl, random_cl_len, options.str());
-                        ranProgs[device] = new Program(prog);
-                        ranKernels[device] = new Kernel(*ranProgs[device], "random");
-                    });
-
-                auto randomOp = make_kernel<cl::Buffer, uint, uint, uint, uint>(*ranKernels[device]);
-
-                uint groups = divup(elements, THREADS * REPEAT);
-                counter += divup(elements, THREADS * groups);
-
-                NDRange local(THREADS, 1);
-                NDRange global(THREADS * groups, 1);
-
-                randomOp(EnqueueArgs(getQueue(), global, local),
-                         out, elements, counter, random_seed[0], random_seed[1]);
-                CL_DEBUG_FINISH(getQueue());
-            } catch(cl::Error ex) {
-                CL_TO_AF_ERROR(ex);
-            }
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/random_engine.hpp b/src/backend/opencl/kernel/random_engine.hpp
new file mode 100644
index 0000000000..390be184eb
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine.hpp
@@ -0,0 +1,173 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/random_engine_mersenne.hpp>
+#include <kernel_headers/random_engine_mersenne_init.hpp>
+#include <kernel_headers/random_engine_philox.hpp>
+#include <kernel_headers/random_engine_threefry.hpp>
+#include <kernel_headers/random_engine_write.hpp>
+#include <random_engine.hpp>
+#include <traits.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+static const int N          = 351;
+static const int TABLE_SIZE = 16;
+static const int MAX_BLOCKS = 32;
+static const int STATE_SIZE = (256 * 3);
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+static const uint THREADS = 256;
+
+template<typename T>
+static Kernel getRandomEngineKernel(const af_random_engine_type type,
+                                    const int kerIdx,
+                                    const uint elementsPerBlock) {
+    std::string key;
+    std::vector<common::Source> sources{random_engine_write_cl_src};
+    switch (type) {
+        case AF_RANDOM_ENGINE_PHILOX_4X32_10:
+            key = "philoxGenerator";
+            sources.emplace_back(random_engine_philox_cl_src);
+            break;
+        case AF_RANDOM_ENGINE_THREEFRY_2X32_16:
+            key = "threefryGenerator";
+            sources.emplace_back(random_engine_threefry_cl_src);
+            break;
+        case AF_RANDOM_ENGINE_MERSENNE_GP11213:
+            key = "mersenneGenerator";
+            sources.emplace_back(random_engine_mersenne_cl_src);
+            break;
+        default:
+            AF_ERROR("Random Engine Type Not Supported", AF_ERR_NOT_SUPPORTED);
+    }
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(kerIdx),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(THREADS),
+        DefineKeyValue(RAND_DIST, kerIdx),
+    };
+    if (type != AF_RANDOM_ENGINE_MERSENNE_GP11213) {
+        options.emplace_back(
+            DefineKeyValue(ELEMENTS_PER_BLOCK, elementsPerBlock));
+    }
+#if defined(OS_MAC)  // macOS kernels need IS_APPLE and a precomputed ln(10)
+    options.emplace_back(DefineKey(IS_APPLE));
+    options.emplace_back(DefineKeyValue(log10_val, std::log(10.0)));
+#endif
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    return common::getKernel(key, sources, targs, options);
+}
+
+template<typename T>
+static void randomDistribution(cl::Buffer out, const size_t elements,
+                               const af_random_engine_type type,
+                               const uintl &seed, uintl &counter, int kerIdx) {
+    uint elementsPerBlock = THREADS * 4 * sizeof(uint) / sizeof(T);
+    uint groups           = divup(elements, elementsPerBlock);
+
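+    uint hi  = seed >> 32;     // upper 32 bits of the 64-bit seed
+    uint lo  = seed;           // lower 32 bits of the seed (truncating)
+    uint hic = counter >> 32;  // upper 32 bits of the 64-bit counter
+    uint loc = counter;        // lower 32 bits of the counter (truncating)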
+
+    cl::NDRange local(THREADS, 1);
+    cl::NDRange global(THREADS * groups, 1);
+
+    if ((type == AF_RANDOM_ENGINE_PHILOX_4X32_10) ||
+        (type == AF_RANDOM_ENGINE_THREEFRY_2X32_16)) {
+        auto randomEngineOp =
+            getRandomEngineKernel<T>(type, kerIdx, elementsPerBlock);
+        randomEngineOp(cl::EnqueueArgs(getQueue(), global, local), out,
+                       static_cast<unsigned>(elements), hic, loc, hi, lo);
+    }
+    counter += elements;
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void randomDistribution(cl::Buffer out, const size_t elements, cl::Buffer state,
+                        cl::Buffer pos, cl::Buffer sh1, cl::Buffer sh2,
+                        const uint mask, cl::Buffer recursion_table,
+                        cl::Buffer temper_table, int kerIdx) {
+    int threads                = THREADS;
+    int min_elements_per_block = 32 * THREADS * 4 * sizeof(uint) / sizeof(T);
+    int blocks                 = divup(elements, min_elements_per_block);
+    blocks                     = (blocks > MAX_BLOCKS) ? MAX_BLOCKS : blocks;
+    uint elementsPerBlock      = divup(elements, blocks);
+
+    cl::NDRange local(threads, 1);
+    cl::NDRange global(threads * blocks, 1);
+    auto randomEngineOp = getRandomEngineKernel<T>(
+        AF_RANDOM_ENGINE_MERSENNE_GP11213, kerIdx, elementsPerBlock);
+    randomEngineOp(cl::EnqueueArgs(getQueue(), global, local), out, state, pos,
+                   sh1, sh2, mask, recursion_table, temper_table,
+                   elementsPerBlock, static_cast<uint>(elements));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void uniformDistributionCBRNG(cl::Buffer out, const size_t elements,
+                              const af_random_engine_type type,
+                              const uintl &seed, uintl &counter) {
+    randomDistribution<T>(out, elements, type, seed, counter, 0);
+}
+
+template<typename T>
+void normalDistributionCBRNG(cl::Buffer out, const size_t elements,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    randomDistribution<T>(out, elements, type, seed, counter, 1);
+}
+
+template<typename T>
+void uniformDistributionMT(cl::Buffer out, const size_t elements,
+                           cl::Buffer state, cl::Buffer pos, cl::Buffer sh1,
+                           cl::Buffer sh2, const uint mask,
+                           cl::Buffer recursion_table,
+                           cl::Buffer temper_table) {
+    randomDistribution<T>(out, elements, state, pos, sh1, sh2, mask,
+                          recursion_table, temper_table, 0);
+}
+
+template<typename T>
+void normalDistributionMT(cl::Buffer out, const size_t elements,
+                          cl::Buffer state, cl::Buffer pos, cl::Buffer sh1,
+                          cl::Buffer sh2, const uint mask,
+                          cl::Buffer recursion_table, cl::Buffer temper_table) {
+    randomDistribution<T>(out, elements, state, pos, sh1, sh2, mask,
+                          recursion_table, temper_table, 1);
+}
+
+void initMersenneState(cl::Buffer state, cl::Buffer table, const uintl &seed) {
+    cl::NDRange local(THREADS_PER_GROUP, 1);
+    cl::NDRange global(local[0] * MAX_BLOCKS, 1);
+
+    auto initOp = common::getKernel("mersenneInitState",
+                                    {{random_engine_mersenne_init_cl_src}}, {});
+    initOp(cl::EnqueueArgs(getQueue(), global, local), state, table, seed);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/random_engine_mersenne.cl b/src/backend/opencl/kernel/random_engine_mersenne.cl
new file mode 100644
index 0000000000..ebb5a92120
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine_mersenne.cl
@@ -0,0 +1,155 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+
+#define N 351
+#define TABLE_SIZE 16
+#define STATE_SIZE (256 * 3)
+
+#define divup(NUM, DEN) (((NUM) + (DEN)-1) / (DEN))
+
+void read_table(local uint *const localTable, global const uint *const table) {
+    global const uint *const t = table + (get_group_id(0) * TABLE_SIZE);
+    if (get_local_id(0) < TABLE_SIZE) {
+        localTable[get_local_id(0)] = t[get_local_id(0)];
+    }
+}
+
+void state_read(local uint *const localState, global const uint *const state) {
+    global const uint *const g = state + (get_group_id(0) * N);
+    localState[STATE_SIZE - N + get_local_id(0)] = g[get_local_id(0)];
+    if (get_local_id(0) < N - THREADS) {
+        localState[STATE_SIZE - N + THREADS + get_local_id(0)] =
+            g[THREADS + get_local_id(0)];
+    }
+}
+
+void state_write(global uint *const state, local const uint *const localState) {
+    global uint *const g = state + (get_group_id(0) * N);
+    g[get_local_id(0)]   = localState[STATE_SIZE - N + get_local_id(0)];
+    if (get_local_id(0) < N - THREADS) {
+        g[THREADS + get_local_id(0)] =
+            localState[STATE_SIZE - N + THREADS + get_local_id(0)];
+    }
+}
+
+uint recursion(local const uint *const recursion_table, const uint mask,
+               const uint sh1, const uint sh2, const uint x1, const uint x2,
+               uint y) {
+    uint x = (x1 & mask) ^ x2;
+    x ^= x << sh1;
+    y        = x ^ (y >> sh2);
+    uint mat = recursion_table[y & 0x0f];
+    return y ^ mat;
+}
+
+uint temper(local const uint *const temper_table, const uint v, uint t) {
+    t ^= t >> 16;
+    t ^= t >> 8;
+    uint mat = temper_table[t & 0x0f];
+    return v ^ mat;
+}
+
+kernel void mersenneGenerator(global T *output, global uint *const state,
+                              global const uint *const pos_tbl,
+                              global const uint *const sh1_tbl,
+                              global const uint *const sh2_tbl, uint mask,
+                              global const uint *const recursion_table,
+                              global const uint *const temper_table,
+                              uint elements_per_block, uint elements) {
+    local uint l_state[STATE_SIZE];
+    local uint l_recursion_table[TABLE_SIZE];
+    local uint l_temper_table[TABLE_SIZE];
+    uint start = get_group_id(0) * elements_per_block;
+    uint end   = start + elements_per_block;
+    end        = (end > elements) ? elements : end;
+    int iter   = divup((end - start) * sizeof(T), THREADS * 4 * sizeof(uint));
+    uint pos   = pos_tbl[get_group_id(0)];
+    uint sh1   = sh1_tbl[get_group_id(0)];
+    uint sh2   = sh2_tbl[get_group_id(0)];
+
+    state_read(l_state, state);
+    read_table(l_recursion_table, recursion_table);
+    read_table(l_temper_table, temper_table);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    uint index                    = start;
+    int elementsPerBlockIteration = THREADS * 4 * sizeof(uint) / sizeof(T);
+    uint o[4];
+    int offsetX1 = (STATE_SIZE - N + get_local_id(0)) % STATE_SIZE;
+    int offsetX2 = (STATE_SIZE - N + get_local_id(0) + 1) % STATE_SIZE;
+    int offsetY  = (STATE_SIZE - N + get_local_id(0) + pos) % STATE_SIZE;
+    int offsetT  = (STATE_SIZE - N + get_local_id(0) + pos - 1) % STATE_SIZE;
+    int offsetO  = get_local_id(0);
+
+    for (int i = 0; i < iter; ++i) {
+        for (int ii = 0; ii < 4; ++ii) {
+            uint r =
+                recursion(l_recursion_table, mask, sh1, sh2, l_state[offsetX1],
+                          l_state[offsetX2], l_state[offsetY]);
+            l_state[offsetO] = r;
+            o[ii]            = temper(l_temper_table, r, l_state[offsetT]);
+            offsetX1 += THREADS;
+            offsetX2 += THREADS;
+            offsetY += THREADS;
+            offsetT += THREADS;
+            offsetO += THREADS;
+            offsetX1 =
+                (offsetX1 >= STATE_SIZE) ? offsetX1 - STATE_SIZE : offsetX1;
+            offsetX2 =
+                (offsetX2 >= STATE_SIZE) ? offsetX2 - STATE_SIZE : offsetX2;
+            offsetY = (offsetY >= STATE_SIZE) ? offsetY - STATE_SIZE : offsetY;
+            offsetT = (offsetT >= STATE_SIZE) ? offsetT - STATE_SIZE : offsetT;
+            offsetO = (offsetO >= STATE_SIZE) ? offsetO - STATE_SIZE : offsetO;
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+        uint writeIndex = index + get_local_id(0);
+        if (i == iter - 1) {
+            PARTIAL_WRITE(output, writeIndex, o[0], o[1], o[2], o[3], elements);
+        } else {
+            WRITE(output, writeIndex, o[0], o[1], o[2], o[3]);
+        }
+        index += elementsPerBlockIteration;
+    }
+    state_write(state, l_state);
+}
diff --git a/src/backend/opencl/kernel/random_engine_mersenne_init.cl b/src/backend/opencl/kernel/random_engine_mersenne_init.cl
new file mode 100644
index 0000000000..af8435356a
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine_mersenne_init.cl
@@ -0,0 +1,80 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/********************************************************
+ * Copyright (c) 2009, 2010 Mutsuo Saito, Makoto Matsumoto and Hiroshima
+ * University.
+ * Copyright (c) 2011, 2012 Mutsuo Saito, Makoto Matsumoto, Hiroshima
+ * University and University of Tokyo.
+ * All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above
+ *       copyright notice, this list of conditions and the following
+ *       disclaimer in the documentation and/or other materials provided
+ *       with the distribution.
+ *     * Neither the name of the Hiroshima University, The University
+ *       of Tokyo nor the names of its contributors may be used to
+ *       endorse or promote products derived from this software without
+ *       specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *******************************************************/
+
+#define N 351
+#define TABLE_SIZE 16
+
+kernel void mersenneInitState(global uint *state, global uint *tbl,
+                              ulong seed) {
+    int tid      = get_local_id(0);
+    int nthreads = get_local_size(0);
+    int gid      = get_group_id(0);
+    local uint lstate[N];
+    const global uint *ltbl = tbl + (TABLE_SIZE * gid);
+    uint hidden_seed        = ltbl[4] ^ (ltbl[8] << 16);
+    uint tmp                = hidden_seed;
+    tmp += tmp >> 16;
+    tmp += tmp >> 8;
+    tmp &= 0xff;
+    tmp |= tmp << 8;
+    tmp |= tmp << 16;
+
+    for (int id = tid; id < N; id += nthreads) { lstate[id] = tmp; }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid == 0) {
+        lstate[0] = seed;  // implicit ulong -> uint: keeps the low 32 bits
+        lstate[1] = hidden_seed;
+        for (int i = 1; i < N; ++i) {
+            lstate[i] ^=
+                (uint)(1812433253) * (lstate[i - 1] ^ (lstate[i - 1] >> 30)) +
+                i;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    for (int id = tid; id < N; id += nthreads) {
+        state[N * gid + id] = lstate[id];
+    }
+}
diff --git a/src/backend/opencl/kernel/random_engine_philox.cl b/src/backend/opencl/kernel/random_engine_philox.cl
new file mode 100644
index 0000000000..ccc6bb455d
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine_philox.cl
@@ -0,0 +1,118 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ *
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/philox.hpp
+
+#define m4x32_0 0xD2511F53
+#define m4x32_1 0xCD9E8D57
+#define w32_0 0x9E3779B9
+#define w32_1 0xBB67AE85
+
+void mulhilo(const uint a, const uint b, uint *const hi, uint *const lo) {
+    *hi = mul_hi(a, b);
+    *lo = a * b;
+}
+
+void philoxBump(uint k[2]) {
+    k[0] += w32_0;
+    k[1] += w32_1;
+}
+
+void philoxRound(const uint k[2], uint c[4]) {
+    uint hi0, lo0, hi1, lo1;
+    mulhilo(m4x32_0, c[0], &hi0, &lo0);
+    mulhilo(m4x32_1, c[2], &hi1, &lo1);
+    c[0] = hi1 ^ c[1] ^ k[0];
+    c[1] = lo1;
+    c[2] = hi0 ^ c[3] ^ k[1];
+    c[3] = lo0;
+}
+
+void philox(uint key[2], uint ctr[4]) {
+    // 10 rounds; the key is bumped between consecutive rounds (9 bumps)
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+    philoxBump(key);
+    philoxRound(key, ctr);
+}
+
+kernel void philoxGenerator(global T *output, unsigned elements, unsigned hic,
+                            unsigned loc, unsigned hi, unsigned lo) {
+    unsigned gid   = get_group_id(0);
+    unsigned index = gid * ELEMENTS_PER_BLOCK + get_local_id(0);
+
+    uint key[2] = {lo, hi};
+    uint ctr[4] = {loc, hic, 0, 0};
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < loc);
+    ctr[2] += (ctr[1] < hic);
+
+    philox(key, ctr);
+
+    if (gid != get_num_groups(0) - 1) {
+        WRITE(output, index, ctr[0], ctr[1], ctr[2], ctr[3]);
+    } else {
+        PARTIAL_WRITE(output, index, ctr[0], ctr[1], ctr[2], ctr[3], elements);
+    }
+}
diff --git a/src/backend/opencl/kernel/random_engine_threefry.cl b/src/backend/opencl/kernel/random_engine_threefry.cl
new file mode 100644
index 0000000000..7fdb2bcd07
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine_threefry.cl
@@ -0,0 +1,178 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ *
+ ********************************************************/
+
+/*******************************************************
+ * Modified version of Random123 library:
+ * https://www.deshawresearch.com/downloads/download_random123.cgi/
+ * The original copyright can be seen here:
+ *
+ * RANDOM123 LICENSE AGREEMENT
+ *
+ * Copyright 2010-2011, D. E. Shaw Research. All rights reserved.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * * Redistributions of source code must retain the above copyright notice,
+ *   this list of conditions, and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ *   notice, this list of conditions, and the following disclaimer in the
+ *   documentation and/or other materials provided with the distribution.
+ *
+ * Neither the name of D. E. Shaw Research nor the names of its contributors
+ * may be used to endorse or promote products derived from this software
+ * without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED
+ * TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+ * PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+ * LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+ * NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+ * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *********************************************************/
+
+// Utils
+// Source of these constants:
+// github.com/DEShawResearch/Random123-Boost/blob/master/boost/random/threefry.hpp
+
+#define SKEIN_KS_PARITY 0x1BD11BDA
+
+#define R0 13
+#define R1 15
+#define R2 26
+#define R3 6
+#define R4 17
+#define R5 29
+#define R6 16
+#define R7 24
+
+inline uint rotL(uint x, uint N) {
+    return (x << (N & 31)) | (x >> ((32 - N) & 31));
+}
+
+inline void threefry(uint k[2], uint c[2], uint X[2]) {
+    uint ks[3];
+
+    ks[2] = SKEIN_KS_PARITY;
+    ks[0] = k[0];
+    X[0]  = c[0];
+    ks[2] ^= k[0];
+    ks[1] = k[1];
+    X[1]  = c[1];
+    ks[2] ^= k[1];
+
+    X[0] += ks[0];
+    X[1] += ks[1];
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=1) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 1; /* X[2-1] += r */
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=2) */
+    X[0] += ks[2];
+    X[1] += ks[0];
+    X[1] += 2;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R0);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R1);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R2);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R3);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=3) */
+    X[0] += ks[0];
+    X[1] += ks[1];
+    X[1] += 3;
+
+    X[0] += X[1];
+    X[1] = rotL(X[1], R4);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R5);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R6);
+    X[1] ^= X[0];
+    X[0] += X[1];
+    X[1] = rotL(X[1], R7);
+    X[1] ^= X[0];
+
+    /* InjectKey(r=4) */
+    X[0] += ks[1];
+    X[1] += ks[2];
+    X[1] += 4;
+}
+
+kernel void threefryGenerator(global T *output, unsigned elements, unsigned hic,
+                              unsigned loc, unsigned hi, unsigned lo) {
+    unsigned gid   = get_group_id(0);
+    unsigned off   = get_local_size(0);
+    unsigned index = gid * ELEMENTS_PER_BLOCK + get_local_id(0);
+
+    uint key[2] = {lo, hi};
+    uint ctr[2] = {loc, hic};
+    uint o[4];
+
+    ctr[0] += index;
+    ctr[1] += (ctr[0] < index);
+
+    threefry(key, ctr, o);
+    uint step = ELEMENTS_PER_BLOCK / 2;
+    ctr[0] += step;
+    ctr[1] += (ctr[0] < step);
+    threefry(key, ctr, o + 2);
+
+    if (gid != get_num_groups(0) - 1) {
+        WRITE(output, index, o[0], o[1], o[2], o[3]);
+    } else {
+        PARTIAL_WRITE(output, index, o[0], o[1], o[2], o[3], elements);
+    }
+}
diff --git a/src/backend/opencl/kernel/random_engine_write.cl b/src/backend/opencl/kernel/random_engine_write.cl
new file mode 100644
index 0000000000..c36c5f1d6d
--- /dev/null
+++ b/src/backend/opencl/kernel/random_engine_write.cl
@@ -0,0 +1,558 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Conversion to floats adapted from Random123
+#define FLT_FACTOR ((1.0f) / ((float)UINT_MAX + 1.0f))
+#define HALF_FLT_FACTOR ((0.5f) * FLT_FACTOR)
+
+// Conversion to floats adapted from Random123
+#define SIGNED_FLT_FACTOR ((1.0f) / ((float)INT_MAX + 1.0f))
+#define SIGNED_HALF_FLT_FACTOR (0.5f * SIGNED_FLT_FACTOR)
+
+// Generates rationals in (0, 1]
+float getFloat01(uint num) {
+    return fma((float)num, FLT_FACTOR, HALF_FLT_FACTOR);
+}
+
+// Generates rationals in (-1, 1]
+float getFloatNegative11(uint num) {
+    return fma((float)num, SIGNED_FLT_FACTOR, SIGNED_HALF_FLT_FACTOR);
+}
+
+// Writes without boundary checking
+
+void writeOut128Bytes_schar(global char *out, uint index, uint r1, uint r2,
+                            uint r3, uint r4) {
+    out[index]                = r1;
+    out[index + THREADS]      = r1 >> 8;
+    out[index + 2 * THREADS]  = r1 >> 16;
+    out[index + 3 * THREADS]  = r1 >> 24;
+    out[index + 4 * THREADS]  = r2;
+    out[index + 5 * THREADS]  = r2 >> 8;
+    out[index + 6 * THREADS]  = r2 >> 16;
+    out[index + 7 * THREADS]  = r2 >> 24;
+    out[index + 8 * THREADS]  = r3;
+    out[index + 9 * THREADS]  = r3 >> 8;
+    out[index + 10 * THREADS] = r3 >> 16;
+    out[index + 11 * THREADS] = r3 >> 24;
+    out[index + 12 * THREADS] = r4;
+    out[index + 13 * THREADS] = r4 >> 8;
+    out[index + 14 * THREADS] = r4 >> 16;
+    out[index + 15 * THREADS] = r4 >> 24;
+}
+
+void writeOut128Bytes_uchar(global uchar *out, uint index, uint r1, uint r2,
+                            uint r3, uint r4) {
+    out[index]                = r1;
+    out[index + THREADS]      = r1 >> 8;
+    out[index + 2 * THREADS]  = r1 >> 16;
+    out[index + 3 * THREADS]  = r1 >> 24;
+    out[index + 4 * THREADS]  = r2;
+    out[index + 5 * THREADS]  = r2 >> 8;
+    out[index + 6 * THREADS]  = r2 >> 16;
+    out[index + 7 * THREADS]  = r2 >> 24;
+    out[index + 8 * THREADS]  = r3;
+    out[index + 9 * THREADS]  = r3 >> 8;
+    out[index + 10 * THREADS] = r3 >> 16;
+    out[index + 11 * THREADS] = r3 >> 24;
+    out[index + 12 * THREADS] = r4;
+    out[index + 13 * THREADS] = r4 >> 8;
+    out[index + 14 * THREADS] = r4 >> 16;
+    out[index + 15 * THREADS] = r4 >> 24;
+}
+
+void writeOut128Bytes_char(global char *out, uint index, uint r1, uint r2,
+                           uint r3, uint r4) {
+    out[index]                = (r1)&0x1;
+    out[index + THREADS]      = (r1 >> 8) & 0x1;
+    out[index + 2 * THREADS]  = (r1 >> 16) & 0x1;
+    out[index + 3 * THREADS]  = (r1 >> 24) & 0x1;
+    out[index + 4 * THREADS]  = (r2)&0x1;
+    out[index + 5 * THREADS]  = (r2 >> 8) & 0x1;
+    out[index + 6 * THREADS]  = (r2 >> 16) & 0x1;
+    out[index + 7 * THREADS]  = (r2 >> 24) & 0x1;
+    out[index + 8 * THREADS]  = (r3)&0x1;
+    out[index + 9 * THREADS]  = (r3 >> 8) & 0x1;
+    out[index + 10 * THREADS] = (r3 >> 16) & 0x1;
+    out[index + 11 * THREADS] = (r3 >> 24) & 0x1;
+    out[index + 12 * THREADS] = (r4)&0x1;
+    out[index + 13 * THREADS] = (r4 >> 8) & 0x1;
+    out[index + 14 * THREADS] = (r4 >> 16) & 0x1;
+    out[index + 15 * THREADS] = (r4 >> 24) & 0x1;
+}
+
+void writeOut128Bytes_short(global short *out, uint index, uint r1, uint r2,
+                            uint r3, uint r4) {
+    out[index]               = r1;
+    out[index + THREADS]     = r1 >> 16;
+    out[index + 2 * THREADS] = r2;
+    out[index + 3 * THREADS] = r2 >> 16;
+    out[index + 4 * THREADS] = r3;
+    out[index + 5 * THREADS] = r3 >> 16;
+    out[index + 6 * THREADS] = r4;
+    out[index + 7 * THREADS] = r4 >> 16;
+}
+
+void writeOut128Bytes_ushort(global ushort *out, uint index, uint r1, uint r2,
+                             uint r3, uint r4) {
+    out[index]               = r1;
+    out[index + THREADS]     = r1 >> 16;
+    out[index + 2 * THREADS] = r2;
+    out[index + 3 * THREADS] = r2 >> 16;
+    out[index + 4 * THREADS] = r3;
+    out[index + 5 * THREADS] = r3 >> 16;
+    out[index + 6 * THREADS] = r4;
+    out[index + 7 * THREADS] = r4 >> 16;
+}
+
+void writeOut128Bytes_int(global int *out, uint index, uint r1, uint r2,
+                          uint r3, uint r4) {
+    out[index]               = r1;
+    out[index + THREADS]     = r2;
+    out[index + 2 * THREADS] = r3;
+    out[index + 3 * THREADS] = r4;
+}
+
+void writeOut128Bytes_uint(global uint *out, uint index, uint r1, uint r2,
+                           uint r3, uint r4) {
+    out[index]               = r1;
+    out[index + THREADS]     = r2;
+    out[index + 2 * THREADS] = r3;
+    out[index + 3 * THREADS] = r4;
+}
+
+void writeOut128Bytes_long(global long *out, uint index, uint r1, uint r2,
+                           uint r3, uint r4) {
+    long c1              = r2;
+    c1                   = (c1 << 32) | r1;
+    long c2              = r4;
+    c2                   = (c2 << 32) | r3;
+    out[index]           = c1;
+    out[index + THREADS] = c2;
+}
+
+void writeOut128Bytes_ulong(global ulong *out, uint index, uint r1, uint r2,
+                            uint r3, uint r4) {
+    ulong c1             = r2;
+    c1                   = (c1 << 32) | r1;
+    ulong c2             = r4;
+    c2                   = (c2 << 32) | r3;
+    out[index]           = c1;
+    out[index + THREADS] = c2;
+}
+
+void writeOut128Bytes_float(global float *out, uint index, uint r1, uint r2,
+                            uint r3, uint r4) {
+    out[index]               = 1.f - getFloat01(r1);
+    out[index + THREADS]     = 1.f - getFloat01(r2);
+    out[index + 2 * THREADS] = 1.f - getFloat01(r3);
+    out[index + 3 * THREADS] = 1.f - getFloat01(r4);
+}
+
+#if RAND_DIST == 1
+void boxMullerTransform(T *const out1, T *const out2, T r1, T r2) {
+    /*
+     * The log of a real value x where 0 < x < 1 is negative.
+     */
+#if defined(IS_APPLE)  // Because Apple is.. "special"
+    T r = sqrt((T)(-2.0) * log10(r2) * (T)log10_val);
+#else
+    T r = sqrt((T)(-2.0) * log(r2));
+#endif
+    T c   = cospi(r1);
+    T s   = sinpi(r1);
+    *out1 = r * s;
+    *out2 = r * c;
+}
+#endif
+
+// Writes with boundary checking
+
+void partialWriteOut128Bytes_schar(global char *out, uint index, uint r1,
+                                   uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r1 >> 8; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r1 >> 16; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r1 >> 24; }
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = r2; }
+    if (index + 5 * THREADS < elements) { out[index + 5 * THREADS] = r2 >> 8; }
+    if (index + 6 * THREADS < elements) { out[index + 6 * THREADS] = r2 >> 16; }
+    if (index + 7 * THREADS < elements) { out[index + 7 * THREADS] = r2 >> 24; }
+    if (index + 8 * THREADS < elements) { out[index + 8 * THREADS] = r3; }
+    if (index + 9 * THREADS < elements) { out[index + 9 * THREADS] = r3 >> 8; }
+    if (index + 10 * THREADS < elements) {
+        out[index + 10 * THREADS] = r3 >> 16;
+    }
+    if (index + 11 * THREADS < elements) {
+        out[index + 11 * THREADS] = r3 >> 24;
+    }
+    if (index + 12 * THREADS < elements) { out[index + 12 * THREADS] = r4; }
+    if (index + 13 * THREADS < elements) {
+        out[index + 13 * THREADS] = r4 >> 8;
+    }
+    if (index + 14 * THREADS < elements) {
+        out[index + 14 * THREADS] = r4 >> 16;
+    }
+    if (index + 15 * THREADS < elements) {
+        out[index + 15 * THREADS] = r4 >> 24;
+    }
+}
+
+void partialWriteOut128Bytes_uchar(global uchar *out, uint index, uint r1,
+                                   uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r1 >> 8; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r1 >> 16; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r1 >> 24; }
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = r2; }
+    if (index + 5 * THREADS < elements) { out[index + 5 * THREADS] = r2 >> 8; }
+    if (index + 6 * THREADS < elements) { out[index + 6 * THREADS] = r2 >> 16; }
+    if (index + 7 * THREADS < elements) { out[index + 7 * THREADS] = r2 >> 24; }
+    if (index + 8 * THREADS < elements) { out[index + 8 * THREADS] = r3; }
+    if (index + 9 * THREADS < elements) { out[index + 9 * THREADS] = r3 >> 8; }
+    if (index + 10 * THREADS < elements) {
+        out[index + 10 * THREADS] = r3 >> 16;
+    }
+    if (index + 11 * THREADS < elements) {
+        out[index + 11 * THREADS] = r3 >> 24;
+    }
+    if (index + 12 * THREADS < elements) { out[index + 12 * THREADS] = r4; }
+    if (index + 13 * THREADS < elements) {
+        out[index + 13 * THREADS] = r4 >> 8;
+    }
+    if (index + 14 * THREADS < elements) {
+        out[index + 14 * THREADS] = r4 >> 16;
+    }
+    if (index + 15 * THREADS < elements) {
+        out[index + 15 * THREADS] = r4 >> 24;
+    }
+}
+
+void partialWriteOut128Bytes_char(global char *out, uint index, uint r1,
+                                  uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = (r1)&0x1; }
+    if (index + THREADS < elements) { out[index + THREADS] = (r1 >> 8) & 0x1; }
+    if (index + 2 * THREADS < elements) {
+        out[index + 2 * THREADS] = (r1 >> 16) & 0x1;
+    }
+    if (index + 3 * THREADS < elements) {
+        out[index + 3 * THREADS] = (r1 >> 24) & 0x1;
+    }
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = (r2)&0x1; }
+    if (index + 5 * THREADS < elements) {
+        out[index + 5 * THREADS] = (r2 >> 8) & 0x1;
+    }
+    if (index + 6 * THREADS < elements) {
+        out[index + 6 * THREADS] = (r2 >> 16) & 0x1;
+    }
+    if (index + 7 * THREADS < elements) {
+        out[index + 7 * THREADS] = (r2 >> 24) & 0x1;
+    }
+    if (index + 8 * THREADS < elements) { out[index + 8 * THREADS] = (r3)&0x1; }
+    if (index + 9 * THREADS < elements) {
+        out[index + 9 * THREADS] = (r3 >> 8) & 0x1;
+    }
+    if (index + 10 * THREADS < elements) {
+        out[index + 10 * THREADS] = (r3 >> 16) & 0x1;
+    }
+    if (index + 11 * THREADS < elements) {
+        out[index + 11 * THREADS] = (r3 >> 24) & 0x1;
+    }
+    if (index + 12 * THREADS < elements) {
+        out[index + 12 * THREADS] = (r4)&0x1;
+    }
+    if (index + 13 * THREADS < elements) {
+        out[index + 13 * THREADS] = (r4 >> 8) & 0x1;
+    }
+    if (index + 14 * THREADS < elements) {
+        out[index + 14 * THREADS] = (r4 >> 16) & 0x1;
+    }
+    if (index + 15 * THREADS < elements) {
+        out[index + 15 * THREADS] = (r4 >> 24) & 0x1;
+    }
+}
+
+void partialWriteOut128Bytes_short(global short *out, uint index, uint r1,
+                                   uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r1 >> 16; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r2; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r2 >> 16; }
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = r3; }
+    if (index + 5 * THREADS < elements) { out[index + 5 * THREADS] = r3 >> 16; }
+    if (index + 6 * THREADS < elements) { out[index + 6 * THREADS] = r4; }
+    if (index + 7 * THREADS < elements) { out[index + 7 * THREADS] = r4 >> 16; }
+}
+
+void partialWriteOut128Bytes_ushort(global ushort *out, uint index, uint r1,
+                                    uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r1 >> 16; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r2; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r2 >> 16; }
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = r3; }
+    if (index + 5 * THREADS < elements) { out[index + 5 * THREADS] = r3 >> 16; }
+    if (index + 6 * THREADS < elements) { out[index + 6 * THREADS] = r4; }
+    if (index + 7 * THREADS < elements) { out[index + 7 * THREADS] = r4 >> 16; }
+}
+
+void partialWriteOut128Bytes_int(global int *out, uint index, uint r1, uint r2,
+                                 uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r2; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r3; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r4; }
+}
+
+void partialWriteOut128Bytes_uint(global uint *out, uint index, uint r1,
+                                  uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = r1; }
+    if (index + THREADS < elements) { out[index + THREADS] = r2; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = r3; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = r4; }
+}
+
+void partialWriteOut128Bytes_long(global long *out, uint index, uint r1,
+                                  uint r2, uint r3, uint r4, uint elements) {
+    long c1 = r2;
+    c1      = (c1 << 32) | r1;
+    long c2 = r4;
+    c2      = (c2 << 32) | r3;
+    if (index < elements) { out[index] = c1; }
+    if (index + THREADS < elements) { out[index + THREADS] = c2; }
+}
+
+void partialWriteOut128Bytes_ulong(global ulong *out, uint index, uint r1,
+                                   uint r2, uint r3, uint r4, uint elements) {
+    ulong c1 = r2;
+    c1       = (c1 << 32) | r1;
+    ulong c2 = r4;
+    c2       = (c2 << 32) | r3;
+    if (index < elements) { out[index] = c1; }
+    if (index + THREADS < elements) { out[index + THREADS] = c2; }
+}
+
+void partialWriteOut128Bytes_float(global float *out, uint index, uint r1,
+                                   uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = 1.f - getFloat01(r1); }
+    if (index + THREADS < elements) {
+        out[index + THREADS] = 1.f - getFloat01(r2);
+    }
+    if (index + 2 * THREADS < elements) {
+        out[index + 2 * THREADS] = 1.f - getFloat01(r3);
+    }
+    if (index + 3 * THREADS < elements) {
+        out[index + 3 * THREADS] = 1.f - getFloat01(r4);
+    }
+}
+
+#if RAND_DIST == 1
+// BoxMuller writes without boundary checking
+void boxMullerWriteOut128Bytes_float(global float *out, uint index, uint r1,
+                                     uint r2, uint r3, uint r4) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    out[index]               = n1;
+    out[index + THREADS]     = n2;
+    out[index + 2 * THREADS] = n3;
+    out[index + 3 * THREADS] = n4;
+}
+
+// BoxMuller writes with boundary checking
+void partialBoxMullerWriteOut128Bytes_float(global float *out, uint index,
+                                            uint r1, uint r2, uint r3, uint r4,
+                                            uint elements) {
+    float n1, n2, n3, n4;
+    boxMullerTransform(&n1, &n2, getFloatNegative11(r1), getFloat01(r2));
+    boxMullerTransform(&n3, &n4, getFloatNegative11(r3), getFloat01(r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + THREADS < elements) { out[index + THREADS] = n2; }
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = n3; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = n4; }
+}
+#endif
+
+#ifdef USE_DOUBLE
+
+// Conversion to doubles adapted from Random123
+#define DBL_FACTOR ((1.0) / (ULONG_MAX + (1.0)))
+#define HALF_DBL_FACTOR ((0.5) * DBL_FACTOR)
+
+#define SIGNED_DBL_FACTOR ((1.0) / (LONG_MAX + (1.0)))
+#define SIGNED_HALF_DBL_FACTOR ((0.5) * SIGNED_DBL_FACTOR)
+
+// Generates rationals in (0, 1]
+double getDouble01(uint num1, uint num2) {
+    ulong num = (((ulong)num1) << 32) | ((ulong)num2);
+    return fma(num, DBL_FACTOR, HALF_DBL_FACTOR);
+}
+
+// Generates rationals in (0, 2] (num is unsigned); same angles as (-1, 1] under sinpi/cospi
+double getDoubleNegative11(uint num1, uint num2) {
+    ulong num = (((ulong)num1) << 32) | ((ulong)num2);
+    return fma(num, SIGNED_DBL_FACTOR, SIGNED_HALF_DBL_FACTOR);
+}
+
+void writeOut128Bytes_double(global double *out, uint index, uint r1, uint r2,
+                             uint r3, uint r4) {
+    out[index]           = 1.0 - getDouble01(r1, r2);
+    out[index + THREADS] = 1.0 - getDouble01(r3, r4);
+}
+
+void partialWriteOut128Bytes_double(global double *out, uint index, uint r1,
+                                    uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = 1.0 - getDouble01(r1, r2); }
+    if (index + THREADS < elements) {
+        out[index + THREADS] = 1.0 - getDouble01(r3, r4);
+    }
+}
+
+#if RAND_DIST == 1
+void boxMullerWriteOut128Bytes_double(global double *out, uint index, uint r1,
+                                      uint r2, uint r3, uint r4) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    out[index]           = n1;
+    out[index + THREADS] = n2;
+}
+
+void partialBoxMullerWriteOut128Bytes_double(global double *out, uint index,
+                                             uint r1, uint r2, uint r3, uint r4,
+                                             uint elements) {
+    double n1, n2;
+    boxMullerTransform(&n1, &n2, getDoubleNegative11(r1, r2),
+                       getDouble01(r3, r4));
+    if (index < elements) { out[index] = n1; }
+    if (index + THREADS < elements) { out[index + THREADS] = n2; }
+}
+#endif
+#endif
+
+#ifdef USE_HALF
+
+// Conversion to halves adapted from Random123
+
+// NOTE HALF_FACTOR is calculated in float to avoid conversion of 65535 to +inf
+// because of the limited range of half.
+#define HALF_FACTOR ((half)((1.f) / ((USHRT_MAX) + (1.f))))
+#define HALF_HALF_FACTOR ((0.5h) * (HALF_FACTOR))
+
+#define SIGNED_HALF_FACTOR ((1.h) / (SHRT_MAX + (1.h)))
+#define SIGNED_HALF_HALF_FACTOR ((0.5h) * SIGNED_HALF_FACTOR)
+
+/// This is the largest integer representable by fp16. We need to
+/// make sure that the value converted from ushort is smaller than this
+/// value to avoid generating infinity
+#define MAX_INT_BEFORE_INFINITY (ushort)65504u
+
+// Generates rationals in (0, 1]
+half getHalf01(uint num, uint index) {
+    half v = (half)min(MAX_INT_BEFORE_INFINITY,
+                       (ushort)(num >> (16U * (index & 1U)) & 0x0000ffff));
+    return fma(v, HALF_FACTOR, HALF_HALF_FACTOR);
+}
+
+// Generates rationals in (-1, 1]
+half getHalfNegative11(uint num, uint index) {
+    half v = (half)min(MAX_INT_BEFORE_INFINITY,
+                       (ushort)(num >> (16U * (index & 1U)) & 0x0000ffff));
+    return fma(v, SIGNED_HALF_FACTOR, SIGNED_HALF_HALF_FACTOR);
+}
+
+void writeOut128Bytes_half(global half *out, uint index, uint r1, uint r2,
+                           uint r3, uint r4) {
+    out[index]               = 1.h - getHalf01(r1, 0);
+    out[index + THREADS]     = 1.h - getHalf01(r1, 1);
+    out[index + 2 * THREADS] = 1.h - getHalf01(r2, 0);
+    out[index + 3 * THREADS] = 1.h - getHalf01(r2, 1);
+    out[index + 4 * THREADS] = 1.h - getHalf01(r3, 0);
+    out[index + 5 * THREADS] = 1.h - getHalf01(r3, 1);
+    out[index + 6 * THREADS] = 1.h - getHalf01(r4, 0);
+    out[index + 7 * THREADS] = 1.h - getHalf01(r4, 1);
+}
+
+void partialWriteOut128Bytes_half(global half *out, uint index, uint r1,
+                                  uint r2, uint r3, uint r4, uint elements) {
+    if (index < elements) { out[index] = 1.h - getHalf01(r1, 0); }
+    if (index + THREADS < elements) {
+        out[index + THREADS] = 1.h - getHalf01(r1, 1);
+    }
+    if (index + 2 * THREADS < elements) {
+        out[index + 2 * THREADS] = 1.h - getHalf01(r2, 0);
+    }
+    if (index + 3 * THREADS < elements) {
+        out[index + 3 * THREADS] = 1.h - getHalf01(r2, 1);
+    }
+    if (index + 4 * THREADS < elements) {
+        out[index + 4 * THREADS] = 1.h - getHalf01(r3, 0);
+    }
+    if (index + 5 * THREADS < elements) {
+        out[index + 5 * THREADS] = 1.h - getHalf01(r3, 1);
+    }
+    if (index + 6 * THREADS < elements) {
+        out[index + 6 * THREADS] = 1.h - getHalf01(r4, 0);
+    }
+    if (index + 7 * THREADS < elements) {
+        out[index + 7 * THREADS] = 1.h - getHalf01(r4, 1);
+    }
+}
+
+#if RAND_DIST == 1
+void boxMullerWriteOut128Bytes_half(global half *out, uint index, uint r1,
+                                    uint r2, uint r3, uint r4) {
+    boxMullerTransform(&out[index], &out[index + THREADS],
+                       getHalfNegative11(r1, 0), getHalf01(r1, 1));
+    boxMullerTransform(&out[index + 2 * THREADS], &out[index + 3 * THREADS],
+                       getHalfNegative11(r2, 0), getHalf01(r2, 1));
+    boxMullerTransform(&out[index + 4 * THREADS], &out[index + 5 * THREADS],
+                       getHalfNegative11(r3, 0), getHalf01(r3, 1));
+    boxMullerTransform(&out[index + 6 * THREADS], &out[index + 7 * THREADS],
+                       getHalfNegative11(r4, 0), getHalf01(r4, 1));
+}
+
+void partialBoxMullerWriteOut128Bytes_half(global half *out, uint index,
+                                           uint r1, uint r2, uint r3, uint r4,
+                                           uint elements) {
+    half n1, n2;
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r1, 0), getHalf01(r1, 1));
+    if (index < elements) { out[index] = n1; }
+    if (index + THREADS < elements) { out[index + THREADS] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r2, 0), getHalf01(r2, 1));
+    if (index + 2 * THREADS < elements) { out[index + 2 * THREADS] = n1; }
+    if (index + 3 * THREADS < elements) { out[index + 3 * THREADS] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r3, 0), getHalf01(r3, 1));
+    if (index + 4 * THREADS < elements) { out[index + 4 * THREADS] = n1; }
+    if (index + 5 * THREADS < elements) { out[index + 5 * THREADS] = n2; }
+
+    boxMullerTransform(&n1, &n2, getHalfNegative11(r4, 0), getHalf01(r4, 1));
+    if (index + 6 * THREADS < elements) { out[index + 6 * THREADS] = n1; }
+    if (index + 7 * THREADS < elements) { out[index + 7 * THREADS] = n2; }
+}
+#endif
+#endif
+
+#define PASTER(x, y) x##_##y
+#define EVALUATOR(x, y) PASTER(x, y)
+#define EVALUATE_T(function) EVALUATOR(function, T)
+#define UNIFORM_WRITE EVALUATE_T(writeOut128Bytes)
+#define UNIFORM_PARTIAL_WRITE EVALUATE_T(partialWriteOut128Bytes)
+#define NORMAL_WRITE EVALUATE_T(boxMullerWriteOut128Bytes)
+#define NORMAL_PARTIAL_WRITE EVALUATE_T(partialBoxMullerWriteOut128Bytes)
+
+#if RAND_DIST == 0
+#define WRITE UNIFORM_WRITE
+#define PARTIAL_WRITE UNIFORM_PARTIAL_WRITE
+#elif RAND_DIST == 1
+#define WRITE NORMAL_WRITE
+#define PARTIAL_WRITE NORMAL_PARTIAL_WRITE
+#endif
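Note on the uniform float path in random_engine_write.cl: the endpoint behaviour of `getFloat01` and the byte extraction in the `*_uchar` writers can be checked on the host with a small C sketch. It assumes that `FLT_FACTOR` evaluates to exactly 2^-32 (because `(float)UINT_MAX` rounds up to 2^32) and that `HALF_FLT_FACTOR` is 2^-33, and it emulates the single-rounded `fma` in double precision. The names `get_float01_host` and `byte_lane` are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side mirror of getFloat01 (hypothetical helper, not in the patch).
 * Assumes FLT_FACTOR == 2^-32 and HALF_FLT_FACTOR == 2^-33. */
static float get_float01_host(uint32_t num) {
    float f = (float)num; /* same uint -> float rounding step as the kernel */
    /* The product and the sum are exact in double, so the final cast
     * rounds once, which is what a fused multiply-add would do. */
    return (float)((double)f * 0x1p-32 + 0x1p-33);
}

/* Byte `lane` (0..3) of a 32-bit word, as the *_uchar writers extract it
 * with r, r >> 8, r >> 16, r >> 24 followed by truncation to 8 bits. */
static uint8_t byte_lane(uint32_t word, unsigned lane) {
    assert(lane < 4);
    return (uint8_t)(word >> (8u * lane));
}
```

Under these assumptions `get_float01_host(0)` is 2^-33 and `get_float01_host(UINT32_MAX)` rounds to 1.0f, which matches the "(0, 1]" comment on `getFloat01`.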
diff --git a/src/backend/opencl/kernel/range.cl b/src/backend/opencl/kernel/range.cl
index b3ba67762c..80fbdda90f 100644
--- a/src/backend/opencl/kernel/range.cl
+++ b/src/backend/opencl/kernel/range.cl
@@ -7,10 +7,8 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void range_kernel(__global T *out, const KParam op, const int dim,
-                 const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void range_kernel(global T *out, const KParam op, const int dim,
+                           const int blocksPerMatX, const int blocksPerMatY) {
     const int mul0 = (dim == 0);
     const int mul1 = (dim == 1);
     const int mul2 = (dim == 2);
@@ -25,10 +23,8 @@ void range_kernel(__global T *out, const KParam op, const int dim,
     const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
     const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    if(xx >= op.dims[0] ||
-       yy >= op.dims[1] ||
-       oz >= op.dims[2] ||
-       ow >= op.dims[3])
+    if (xx >= op.dims[0] || yy >= op.dims[1] || oz >= op.dims[2] ||
+        ow >= op.dims[3])
         return;
 
     const int ozw = ow * op.strides[3] + oz * op.strides[2];
@@ -38,12 +34,12 @@ void range_kernel(__global T *out, const KParam op, const int dim,
 
     T valZW = (mul3 * ow) + (mul2 * oz);
 
-    for(int oy = yy; oy < op.dims[1]; oy += incy) {
+    for (int oy = yy; oy < op.dims[1]; oy += incy) {
         T valYZW = valZW + (mul1 * oy);
         int oyzw = ozw + oy * op.strides[1];
-        for(int ox = xx; ox < op.dims[0]; ox += incx) {
+        for (int ox = xx; ox < op.dims[0]; ox += incx) {
             int oidx = oyzw + ox;
-            T val = valYZW + (mul0 * ox);
+            T val    = valYZW + (mul0 * ox);
 
             out[oidx] = val;
         }
diff --git a/src/backend/opencl/kernel/range.hpp b/src/backend/opencl/kernel/range.hpp
index 2f8be8cd4a..3fb58a65ce 100644
--- a/src/backend/opencl/kernel/range.hpp
+++ b/src/backend/opencl/kernel/range.hpp
@@ -8,76 +8,48 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/range.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/range.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 512;
-        static const int TILEY = 32;
-
-        template<typename T>
-        void range(Param out, const int dim)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>  rangeProgs;
-                static std::map<int, Kernel*> rangeKernels;
+#include <string>
+#include <vector>
 
-                int device = getActiveDeviceId();
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, range_cl, range_cl_len, options.str());
-                    rangeProgs[device]   = new Program(prog);
-                    rangeKernels[device] = new Kernel(*rangeProgs[device], "range_kernel");
-                });
+template<typename T>
+void range(Param out, const int dim) {
+    constexpr int RANGE_TX    = 32;
+    constexpr int RANGE_TY    = 8;
+    constexpr int RANGE_TILEX = 512;
+    constexpr int RANGE_TILEY = 32;
 
-                auto rangeOp = make_kernel<Buffer, const KParam, const int,
-                                           const int, const int> (*rangeKernels[device]);
+    std::array<TemplateArg, 1> targs   = {TemplateTypename<T>()};
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
 
-                NDRange local(TX, TY, 1);
+    auto rangeOp =
+        common::getKernel("range_kernel", {{range_cl_src}}, targs, options);
 
-                int blocksPerMatX = divup(out.info.dims[0], TILEX);
-                int blocksPerMatY = divup(out.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
+    cl::NDRange local(RANGE_TX, RANGE_TY, 1);
 
-                rangeOp(EnqueueArgs(getQueue(), global, local),
-                       *out.data, out.info, dim, blocksPerMatX, blocksPerMatY);
+    int blocksPerMatX = divup(out.info.dims[0], RANGE_TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], RANGE_TILEY);
+    cl::NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
+                       local[1] * blocksPerMatY * out.info.dims[3], 1);
 
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+    rangeOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+            dim, blocksPerMatX, blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
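Note on the launch geometry in range.hpp: the global work size is the work-group size multiplied by the number of tiles along dims 0 and 1, with dims 2 and 3 folded into the x and y extents. A minimal C sketch of that arithmetic follows. `divup_host` is assumed to behave like ArrayFire's `divup` (ceiling division); `grid2d` and `range_global_size` are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Ceiling division; assumed to match what ArrayFire's divup helper does. */
static int divup_host(int n, int tile) {
    assert(n >= 0 && tile > 0);
    return (n + tile - 1) / tile;
}

typedef struct {
    int x;
    int y;
} grid2d;

/* Global work size for range(), following the constants in the patched
 * host code: 32x8 work-groups, with one group per 512x32 tile of the
 * first two dimensions, repeated for every index of dims 2 and 3. */
static grid2d range_global_size(int d0, int d1, int d2, int d3) {
    const int tx = 32, ty = 8, tilex = 512, tiley = 32;
    grid2d g;
    g.x = tx * divup_host(d0, tilex) * d2; /* local[0] * blocksPerMatX * dims[2] */
    g.y = ty * divup_host(d1, tiley) * d3; /* local[1] * blocksPerMatY * dims[3] */
    return g;
}
```

For an output of shape 1000 x 100 x 3 x 2 this gives 2 tiles in x and 4 in y, so the global range is 192 x 64 work-items.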
diff --git a/src/backend/opencl/kernel/reduce.hpp b/src/backend/opencl/kernel/reduce.hpp
index 0d8be355dc..98982fe8f3 100644
--- a/src/backend/opencl/kernel/reduce.hpp
+++ b/src/backend/opencl/kernel/reduce.hpp
@@ -8,354 +8,266 @@
  ********************************************************/
 
 #pragma once
-#include <string>
-#include <mutex>
-#include <map>
-#include <kernel_headers/reduce_first.hpp>
-#include <kernel_headers/reduce_dim.hpp>
-#include <kernel_headers/ops.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+
+#include <Array.hpp>
 #include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/Transform.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <type_util.hpp>
-#include "names.hpp"
-#include "config.hpp"
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/reduce_all.hpp>
+#include <kernel_headers/reduce_dim.hpp>
+#include <kernel_headers/reduce_first.hpp>
+#include <math.hpp>
 #include <memory.hpp>
-#include <memory>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using std::unique_ptr;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-    template<typename Ti, typename To, af_op_t op, int dim, int threads_y>
-    void reduce_dim_launcher(Param out, Param in,
-                       const uint groups_all[4])
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> reduceProgs;
-        static std::map<int, Kernel*> reduceKerns;
-
-        int device= getActiveDeviceId();
-        std::call_once(compileFlags[device], [device] () {
-
-                Binary<To, op> reduce;
-                ToNum<To> toNum;
-
-                std::ostringstream options;
-                options << " -D To=" << dtype_traits<To>::getName()
-                        << " -D Ti=" << dtype_traits<Ti>::getName()
-                        << " -D T=To"
-                        << " -D dim=" << dim
-                        << " -D DIMY=" << threads_y
-                        << " -D THREADS_X=" << THREADS_X
-                        << " -D init=" << toNum(reduce.init())
-                        << " -D " << binOpName<op>()
-                        << " -D CPLX=" << af::iscplx<Ti>();
-                if (std::is_same<Ti, double>::value ||
-                    std::is_same<Ti, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                const char *ker_strs[] = {ops_cl, reduce_dim_cl};
-                const int   ker_lens[] = {ops_cl_len, reduce_dim_cl_len};
-                Program prog;
-                buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                reduceProgs[device] = new Program(prog);
-
-                reduceKerns[device] = new Kernel(*reduceProgs[device], "reduce_dim_kernel");
-            });
-
-        NDRange local(THREADS_X, threads_y);
-        NDRange global(groups_all[0] * groups_all[2] * local[0],
-                       groups_all[1] * groups_all[3] * local[1]);
-
-        auto reduceOp = make_kernel<Buffer, KParam,
-                                    Buffer, KParam,
-                                    uint, uint, uint>(*reduceKerns[device]);
-
-        reduceOp(EnqueueArgs(getQueue(), global, local),
-                 *out.data, out.info,
-                 *in.data, in.info,
-                 groups_all[0],
-                 groups_all[1],
-                 groups_all[dim]);
+#include <traits.hpp>
 
-        CL_DEBUG_FINISH(getQueue());
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op>
+void reduceDimLauncher(Param out, Param in, const int dim, const uint threads_y,
+                       const uint groups_all[4], int change_nan,
+                       double nanval) {
+    ToNumStr<To> toNumStr;
+    std::array<TemplateArg, 5> targs = {
+        TemplateTypename<Ti>(), TemplateTypename<To>(), TemplateArg(dim),
+        TemplateArg(op),        TemplateArg(threads_y),
+    };
+    std::array<std::string, 10> options = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(kDim, dim),
+        DefineKeyValue(DIMY, threads_y),
+        DefineValue(THREADS_X),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        getTypeBuildDefinition<Ti, To>()};
+
+    auto reduceDim = common::getKernel(
+        "reduce_dim_kernel", {{ops_cl_src, reduce_dim_cl_src}}, targs, options);
+
+    cl::NDRange local(THREADS_X, threads_y);
+    cl::NDRange global(groups_all[0] * groups_all[2] * local[0],
+                       groups_all[1] * groups_all[3] * local[1]);
 
-    template<typename Ti, typename To, af_op_t op, int dim>
-    void reduce_dim_fn(Param out, Param in,
-                       const uint threads_y, const uint groups_all[4])
-    {
-        switch(threads_y) {
-        case 8: return reduce_dim_launcher<Ti, To, op, dim, 8>(out, in, groups_all);
-        case 4: return reduce_dim_launcher<Ti, To, op, dim, 4>(out, in, groups_all);
-        case 2: return reduce_dim_launcher<Ti, To, op, dim, 2>(out, in, groups_all);
-        case 1: return reduce_dim_launcher<Ti, To, op, dim, 1>(out, in, groups_all);
-        case 16: return reduce_dim_launcher<Ti, To, op, dim, 16>(out, in, groups_all);
-        case 32: return reduce_dim_launcher<Ti, To, op, dim, 32>(out, in, groups_all);
-        }
-    }
+    reduceDim(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, groups_all[0], groups_all[1], groups_all[dim],
+              change_nan, scalar<To>(nanval));
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename Ti, typename To, af_op_t op, int dim>
-    void reduce_dim(Param out, Param in)
-    {
-        uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
-        uint threads_x = THREADS_X;
+template<typename Ti, typename To, af_op_t op>
+void reduceDim(Param out, Param in, int change_nan, double nanval, int dim) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(in.info.dims[dim]));
+    uint threads_x = THREADS_X;
 
-        uint groups_all[] = {(uint)divup(in.info.dims[0], threads_x),
-                             (uint)in.info.dims[1],
-                             (uint)in.info.dims[2],
-                             (uint)in.info.dims[3]};
+    uint groups_all[] = {(uint)divup(in.info.dims[0], threads_x),
+                         (uint)in.info.dims[1], (uint)in.info.dims[2],
+                         (uint)in.info.dims[3]};
 
-        groups_all[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
+    groups_all[dim] = divup(in.info.dims[dim], threads_y * REPEAT);
 
-        Param tmp = out;
+    Param tmp = out;
 
-        int tmp_elements = 1;
-        if (groups_all[dim] > 1) {
-            tmp.info.dims[dim] = groups_all[dim];
+    int tmp_elements = 1;
+    if (groups_all[dim] > 1) {
+        tmp.info.dims[dim] = groups_all[dim];
 
-            for (int k = 0; k < 4; k++) tmp_elements *= tmp.info.dims[k];
+        for (int k = 0; k < 4; k++) tmp_elements *= tmp.info.dims[k];
 
-            tmp.data = bufferAlloc(tmp_elements * sizeof(To));
+        tmp.data = bufferAlloc(tmp_elements * sizeof(To));
 
-            for (int k = dim + 1; k < 4; k++) tmp.info.strides[k] *= groups_all[dim];
-        }
+        for (int k = dim + 1; k < 4; k++)
+            tmp.info.strides[k] *= groups_all[dim];
+    }
 
-        reduce_dim_fn<Ti, To, op, dim>(tmp, in, threads_y, groups_all);
+    reduceDimLauncher<Ti, To, op>(tmp, in, dim, threads_y, groups_all,
+                                  change_nan, nanval);
 
-        if (groups_all[dim] > 1) {
-            groups_all[dim] = 1;
+    if (groups_all[dim] > 1) {
+        groups_all[dim] = 1;
 
-            if (op == af_notzero_t) {
-                reduce_dim_fn<To, To, af_add_t, dim>(out, tmp, threads_y, groups_all);
-            } else {
-                reduce_dim_fn<To, To,       op, dim>(out, tmp, threads_y, groups_all);
-            }
-            bufferFree(tmp.data);
+        if (op == af_notzero_t) {
+            reduceDimLauncher<To, To, af_add_t>(out, tmp, dim, threads_y,
+                                                groups_all, change_nan, nanval);
+        } else {
+            reduceDimLauncher<To, To, op>(out, tmp, dim, threads_y, groups_all,
+                                          change_nan, nanval);
         }
-
+        bufferFree(tmp.data);
     }
+}
 
-    template<typename Ti, typename To, af_op_t op, int threads_x>
-    void reduce_first_launcher(Param out, Param in,
-                               const uint groups_x,
-                               const uint groups_y)
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> reduceProgs;
-        static std::map<int, Kernel*>  reduceKerns;
-
-        int device= getActiveDeviceId();
-        std::call_once(compileFlags[device], [device] () {
-
-                Binary<To, op> reduce;
-                ToNum<To> toNum;
-
-                std::ostringstream options;
-                options << " -D To=" << dtype_traits<To>::getName()
-                        << " -D Ti=" << dtype_traits<Ti>::getName()
-                        << " -D T=To"
-                        << " -D DIMX=" << threads_x
-                        << " -D THREADS_PER_GROUP=" << THREADS_PER_GROUP
-                        << " -D init=" << toNum(reduce.init())
-                        << " -D " << binOpName<op>()
-                        << " -D CPLX=" << af::iscplx<Ti>();
-                if (std::is_same<Ti, double>::value ||
-                    std::is_same<Ti, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                const char *ker_strs[] = {ops_cl, reduce_first_cl};
-                const int   ker_lens[] = {ops_cl_len, reduce_first_cl_len};
-                Program prog;
-                buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                reduceProgs[device] = new Program(prog);
-
-                reduceKerns[device] = new Kernel(*reduceProgs[device], "reduce_first_kernel");
-            });
-
-        NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
-        NDRange global(groups_x * in.info.dims[2] * local[0],
+template<typename Ti, typename To, af_op_t op>
+void reduceAllLauncher(Param out, Param in, const uint groups_x,
+                       const uint groups_y, const uint threads_x,
+                       int change_nan, double nanval) {
+    ToNumStr<To> toNumStr;
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<Ti>(),
+        TemplateTypename<To>(),
+        TemplateArg(op),
+        TemplateArg(threads_x),
+    };
+    std::array<std::string, 9> options = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineValue(THREADS_PER_GROUP),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        getTypeBuildDefinition<Ti, To>()};
+
+    auto reduceAll = common::getKernel(
+        "reduce_all_kernel", {{ops_cl_src, reduce_all_cl_src}}, targs, options);
+
+    cl::NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    cl::NDRange global(groups_x * in.info.dims[2] * local[0],
                        groups_y * in.info.dims[3] * local[1]);
 
-        uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
-
-        auto reduceOp = make_kernel<Buffer, KParam,
-                                    Buffer, KParam,
-                                    uint, uint, uint>(*reduceKerns[device]);
+    uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
 
-        reduceOp(EnqueueArgs(getQueue(), global, local),
-                 *out.data, out.info,
-                 *in.data, in.info, groups_x, groups_y, repeat);
-
-        CL_DEBUG_FINISH(getQueue());
+    long tmp_elements = groups_x * in.info.dims[2] * groups_y * in.info.dims[3];
+    if (tmp_elements > UINT_MAX) {
+        AF_ERROR("Too many blocks requested (retirementCount == unsigned)",
+                 AF_ERR_RUNTIME);
     }
+    Array<To> tmp                   = createEmptyArray<To>(tmp_elements);
+    Array<unsigned> retirementCount = createValueArray<unsigned>(1, 0);
+    Param p_tmp(tmp);
+    Param p_Count(retirementCount);
+
+    reduceAll(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *p_Count.data, *p_tmp.data, p_tmp.info, *in.data, in.info,
+              groups_x, groups_y, repeat, change_nan, scalar<To>(nanval));
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename Ti, typename To, af_op_t op>
-    void reduce_first_fn(Param out, Param in,
-                         const uint groups_x,
-                         const uint groups_y,
-                         const uint threads_x)
-    {
-        switch(threads_x) {
-        case  32: return reduce_first_launcher<Ti, To, op,  32>(out, in, groups_x,
-                                                                groups_y);
-        case  64: return reduce_first_launcher<Ti, To, op,  64>(out, in, groups_x,
-                                                                groups_y);
-        case 128: return reduce_first_launcher<Ti, To, op, 128>(out, in, groups_x,
-                                                                groups_y);
-        case 256: return reduce_first_launcher<Ti, To, op, 256>(out, in, groups_x,
-                                                                groups_y);
-        case 512: return reduce_first_launcher<Ti, To, op, 512>(out, in, groups_x,
-                                                                groups_y);
-        }
-    }
-
-    template<typename Ti, typename To, af_op_t op>
-    void reduce_first(Param out, Param in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_GROUP);
-        uint threads_y = THREADS_PER_GROUP / threads_x;
-
-        uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
-        uint groups_y = divup(in.info.dims[1], threads_y);
+template<typename Ti, typename To, af_op_t op>
+void reduceFirstLauncher(Param out, Param in, const uint groups_x,
+                         const uint groups_y, const uint threads_x,
+                         int change_nan, double nanval) {
+    ToNumStr<To> toNumStr;
+    std::array<TemplateArg, 4> targs = {
+        TemplateTypename<Ti>(),
+        TemplateTypename<To>(),
+        TemplateArg(op),
+        TemplateArg(threads_x),
+    };
+    std::array<std::string, 9> options = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineValue(THREADS_PER_GROUP),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        getTypeBuildDefinition<Ti, To>()};
+
+    auto reduceFirst =
+        common::getKernel("reduce_first_kernel",
+                          {{ops_cl_src, reduce_first_cl_src}}, targs, options);
+
+    cl::NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    cl::NDRange global(groups_x * in.info.dims[2] * local[0],
+                       groups_y * in.info.dims[3] * local[1]);
 
-        Param tmp = out;
+    uint repeat = divup(in.info.dims[0], (local[0] * groups_x));
 
-        if (groups_x > 1) {
-            tmp.data = bufferAlloc(groups_x *
-                                in.info.dims[1] *
-                                in.info.dims[2] *
-                                in.info.dims[3] *
-                                sizeof(To));
+    reduceFirst(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *in.data, in.info, groups_x, groups_y, repeat, change_nan,
+                scalar<To>(nanval));
+    CL_DEBUG_FINISH(getQueue());
+}
 
-            tmp.info.dims[0] = groups_x;
-            for (int k = 1; k < 4; k++) tmp.info.strides[k] *= groups_x;
-        }
+template<typename Ti, typename To, af_op_t op>
+void reduceFirst(Param out, Param in, int change_nan, double nanval) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
 
-        reduce_first_fn<Ti, To, op>(tmp, in, groups_x, groups_y, threads_x);
+    uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
 
-        if (groups_x > 1) {
+    Param tmp = out;
 
-            //FIXME: Is there an alternative to the if condition ?
-            if (op == af_notzero_t) {
-                reduce_first_fn<To, To, af_add_t>(out, tmp, 1, groups_y, threads_x);
-            } else {
-                reduce_first_fn<To, To,       op>(out, tmp, 1, groups_y, threads_x);
-            }
+    if (groups_x > 1) {
+        tmp.data = bufferAlloc(groups_x * in.info.dims[1] * in.info.dims[2] *
+                               in.info.dims[3] * sizeof(To));
 
-            bufferFree(tmp.data);
-        }
+        tmp.info.dims[0] = groups_x;
+        for (int k = 1; k < 4; k++) tmp.info.strides[k] *= groups_x;
     }
 
-    template<typename Ti, typename To, af_op_t op>
-    void reduce(Param out, Param in, int dim)
-    {
-        try {
-            switch (dim) {
-            case 0: return reduce_first<Ti, To, op   >(out, in);
-            case 1: return reduce_dim  <Ti, To, op, 1>(out, in);
-            case 2: return reduce_dim  <Ti, To, op, 2>(out, in);
-            case 3: return reduce_dim  <Ti, To, op, 3>(out, in);
-            }
-        } catch(cl::Error ex) {
-            CL_TO_AF_ERROR(ex);
+    reduceFirstLauncher<Ti, To, op>(tmp, in, groups_x, groups_y, threads_x,
+                                    change_nan, nanval);
+
+    if (groups_x > 1) {
+        // FIXME: Is there an alternative to the if condition ?
+        if (op == af_notzero_t) {
+            reduceFirstLauncher<To, To, af_add_t>(
+                out, tmp, 1, groups_y, threads_x, change_nan, nanval);
+        } else {
+            reduceFirstLauncher<To, To, op>(out, tmp, 1, groups_y, threads_x,
+                                            change_nan, nanval);
         }
+        bufferFree(tmp.data);
     }
+}
 
-    template<typename Ti, typename To, af_op_t op>
-    To reduce_all(Param in)
-    {
-        try {
-            int in_elements = in.info.dims[3] * in.info.strides[3];
-
-            // FIXME: Use better heuristics to get to the optimum number
-            if (in_elements > 4096) {
-
-                bool is_linear = (in.info.strides[0] == 1);
-                for (int k = 1; k < 4; k++) {
-                    is_linear &= (in.info.strides[k] == (in.info.strides[k - 1] * in.info.dims[k - 1]));
-                }
-
-                if (is_linear) {
-                    in.info.dims[0] = in_elements;
-                    for (int k = 1; k < 4; k++) {
-                        in.info.dims[k] = 1;
-                        in.info.strides[k] = in_elements;
-                    }
-                }
-
-                uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
-                threads_x = std::min(threads_x, THREADS_PER_GROUP);
-                uint threads_y = THREADS_PER_GROUP / threads_x;
-
-                Param tmp;
-                uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
-                uint groups_y = divup(in.info.dims[1], threads_y);
-
-                tmp.info.offset = 0;
-                tmp.info.dims[0] = groups_x;
-                tmp.info.strides[0] = 1;
-
-                for (int k = 1; k < 4; k++) {
-                    tmp.info.dims[k] = in.info.dims[k];
-                    tmp.info.strides[k] = tmp.info.dims[k - 1] * tmp.info.strides[k - 1];
-                }
-
-                int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
-                tmp.data = bufferAlloc(tmp_elements * sizeof(To));
-
-                reduce_first_fn<Ti, To, op>(tmp, in, groups_x, groups_y, threads_x);
-
-                unique_ptr<To> h_ptr(new To[tmp_elements]);
-                getQueue().enqueueReadBuffer(*tmp.data, CL_TRUE, 0, sizeof(To) * tmp_elements, h_ptr.get());
-
-                Binary<To, op> reduce;
-                To out = reduce.init();
-                for (int i = 0; i < (int)tmp_elements; i++) {
-                    out = reduce(out, h_ptr.get()[i]);
-                }
-
-                bufferFree(tmp.data);
-                return out;
-
-            } else {
-
-                unique_ptr<Ti> h_ptr(new Ti[in_elements]);
-                getQueue().enqueueReadBuffer(*in.data, CL_TRUE, 0, sizeof(Ti) * in_elements, h_ptr.get());
+template<typename Ti, typename To, af_op_t op>
+void reduce(Param out, Param in, int dim, int change_nan, double nanval) {
+    if (dim == 0)
+        return reduceFirst<Ti, To, op>(out, in, change_nan, nanval);
+    else
+        return reduceDim<Ti, To, op>(out, in, change_nan, nanval, dim);
+}
 
-                Transform<Ti, To, op> transform;
-                Binary<To, op> reduce;
-                To out = reduce.init();
+template<typename Ti, typename To, af_op_t op>
+void reduceAll(Param out, Param in, int change_nan, double nanval) {
+    int in_elements =
+        in.info.dims[0] * in.info.dims[1] * in.info.dims[2] * in.info.dims[3];
 
-                for (int i = 0; i < (int)in_elements; i++) {
-                    out = reduce(out, transform(h_ptr.get()[i]));
-                }
+    bool is_linear = (in.info.strides[0] == 1);
+    for (int k = 1; k < 4; k++) {
+        is_linear &= (in.info.strides[k] ==
+                      (in.info.strides[k - 1] * in.info.dims[k - 1]));
+    }
 
-                return out;
-            }
-        } catch(cl::Error ex) {
-            CL_TO_AF_ERROR(ex);
+    if (is_linear) {
+        in.info.dims[0] = in_elements;
+        for (int k = 1; k < 4; k++) {
+            in.info.dims[k]    = 1;
+            in.info.strides[k] = in_elements;
         }
     }
 
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
 
+    uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
+    reduceAllLauncher<Ti, To, op>(out, in, groups_x, groups_y, threads_x,
+                                  change_nan, nanval);
 }
 
-}
+}  // namespace kernel
+
+}  // namespace opencl
+}  // namespace arrayfire
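Review note on the launch geometry in `reduceFirst` and `reduceAll`: the sizing arithmetic is easier to check in isolation. The sketch below reproduces it on the host under stated assumptions. `kThreadsPerGroup = 256` and `kRepeat = 32` are assumed values standing in for `THREADS_PER_GROUP` and `REPEAT`, which are defined outside this diff. `nextPow2`, `divUp`, `LaunchShape`, `firstDimShape` and `isLinear` are hypothetical names for illustration, not the backend's helpers.

```cpp
#include <algorithm>
#include <cassert>

// Assumed values; the real THREADS_PER_GROUP and REPEAT are defined elsewhere.
constexpr unsigned kThreadsPerGroup = 256;
constexpr unsigned kRepeat          = 32;

// Smallest power of two that is >= x (for x >= 1).
inline unsigned nextPow2(unsigned x) {
    unsigned p = 1;
    while (p < x) p <<= 1;
    return p;
}

// Ceiling division for positive integers.
inline unsigned divUp(unsigned a, unsigned b) { return (a + b - 1) / b; }

struct LaunchShape {
    unsigned threads_x, threads_y, groups_x, groups_y;
};

// Same arithmetic as the diff: at least 32 work-items along x, capped at the
// group size, with each work-item covering up to kRepeat elements.
inline LaunchShape firstDimShape(unsigned dim0, unsigned dim1) {
    assert(dim0 > 0 && dim1 > 0);
    LaunchShape s{};
    s.threads_x = std::min(nextPow2(std::max(32u, dim0)), kThreadsPerGroup);
    s.threads_y = kThreadsPerGroup / s.threads_x;
    s.groups_x  = divUp(dim0, s.threads_x * kRepeat);
    s.groups_y  = divUp(dim1, s.threads_y);
    return s;
}

// True when the strides describe a dense layout with no padding, which is
// the condition under which reduceAll flattens the input to one dimension.
inline bool isLinear(const long long dims[4], const long long strides[4]) {
    bool linear = (strides[0] == 1);
    for (int k = 1; k < 4; ++k) {
        linear = linear && (strides[k] == strides[k - 1] * dims[k - 1]);
    }
    return linear;
}
```

With these assumed constants, a 100000-element first dimension needs more than one group along x, which is the case where `reduceFirst` allocates a temporary buffer and launches a second pass.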
diff --git a/src/backend/opencl/kernel/reduce_all.cl b/src/backend/opencl/kernel/reduce_all.cl
new file mode 100644
index 0000000000..dccb0f1c69
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_all.cl
@@ -0,0 +1,160 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+// Careful when substituting for CUDA's __threadfence(); see:
+// http://www.whatmannerofburgeristhis.com/blog/opencl-vs-cuda-gpu-memory-fences/
+
+kernel void reduce_all_kernel(global To *oData, KParam oInfo,
+                              global int* retirementCount, global To *tmp, KParam tmpInfo,
+                              const global Ti *iData, KParam iInfo,
+                              uint groups_x, uint groups_y, uint repeat,
+                              int change_nan, To nanval) {
+
+    const uint tidx = get_local_id(0);
+    const uint tidy = get_local_id(1);
+    const uint tid  = tidy * DIMX + tidx;
+
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint xid       = groupId_x * get_local_size(0) * repeat + tidx;
+
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint yid       = groupId_y * get_local_size(1) + tidy;
+
+    local To s_val[THREADS_PER_GROUP];
+    local bool amLast;
+
+    iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
+             yid * iInfo.strides[1] + iInfo.offset;
+
+    bool cond =
+        (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
+
+
+    int last   = (xid + repeat * DIMX);
+    int lim    = last > iInfo.dims[0] ? iInfo.dims[0] : last;
+
+    To out_val = init;
+    for (int id = xid; cond && id < lim; id += DIMX) {
+        To in_val = transform(iData[id]);
+        if (change_nan) in_val = !IS_NAN(in_val) ? in_val : nanval;
+        out_val = binOp(in_val, out_val);
+    }
+
+    s_val[tid] = out_val;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (THREADS_PER_GROUP == 256) {
+        if (tid < 128) s_val[tid] = binOp(s_val[tid], s_val[tid + 128]);
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (THREADS_PER_GROUP >= 128) {
+        if (tid < 64) s_val[tid] = binOp(s_val[tid], s_val[tid + 64]);
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (THREADS_PER_GROUP >= 64) {
+        if (tid < 32) s_val[tid] = binOp(s_val[tid], s_val[tid + 32]);
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (tid < 16) s_val[tid] = binOp(s_val[tid], s_val[tid + 16]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid < 8) s_val[tid] = binOp(s_val[tid], s_val[tid + 8]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid < 4) s_val[tid] = binOp(s_val[tid], s_val[tid + 4]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid < 2) s_val[tid] = binOp(s_val[tid], s_val[tid + 2]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (tid < 1) s_val[tid] = binOp(s_val[tid], s_val[tid + 1]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+
+    const unsigned total_blocks = (get_num_groups(0) * get_num_groups(1) * get_num_groups(2));
+    const int uubidx = (get_num_groups(0) * get_num_groups(1)) * get_group_id(2)
+                       + (get_num_groups(0) * get_group_id(1)) + get_group_id(0);
+    if (cond && tid == 0) {
+        if(total_blocks != 1) {
+            tmp[uubidx] = s_val[0];
+        } else {
+            oData[0] = s_val[0];
+        }
+    }
+
+    // Last block to perform final reduction
+    if (total_blocks > 1) {
+
+        mem_fence(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
+
+        // Thread 0 takes a ticket
+        if (tid == 0) {
+            unsigned int ticket = atomic_inc(retirementCount);
+            // A ticket of total_blocks - 1 means we are the last block
+            amLast = (ticket == (total_blocks - 1));
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        if (amLast) {
+            int i = tid;
+            To fout_val = init;
+
+            while (i < total_blocks) {
+                To in_val = tmp[i];
+                fout_val = binOp(in_val, fout_val);
+                i += THREADS_PER_GROUP;
+            }
+
+            s_val[tid] = fout_val;
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            // reduce final block
+            if (THREADS_PER_GROUP == 256) {
+                if (tid < 128) s_val[tid] = binOp(s_val[tid], s_val[tid + 128]);
+                barrier(CLK_LOCAL_MEM_FENCE);
+            }
+
+            if (THREADS_PER_GROUP >= 128) {
+                if (tid < 64) s_val[tid] = binOp(s_val[tid], s_val[tid + 64]);
+                barrier(CLK_LOCAL_MEM_FENCE);
+            }
+
+            if (THREADS_PER_GROUP >= 64) {
+                if (tid < 32) s_val[tid] = binOp(s_val[tid], s_val[tid + 32]);
+                barrier(CLK_LOCAL_MEM_FENCE);
+            }
+
+            if (tid < 16) s_val[tid] = binOp(s_val[tid], s_val[tid + 16]);
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            if (tid < 8) s_val[tid] = binOp(s_val[tid], s_val[tid + 8]);
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            if (tid < 4) s_val[tid] = binOp(s_val[tid], s_val[tid + 4]);
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            if (tid < 2) s_val[tid] = binOp(s_val[tid], s_val[tid + 2]);
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            if (tid < 1) s_val[tid] = binOp(s_val[tid], s_val[tid + 1]);
+            barrier(CLK_LOCAL_MEM_FENCE);
+
+            if (tid == 0) {
+                oData[0] = s_val[0];
+
+                // reset retirement count so that next run succeeds
+                retirementCount[0] = 0;
+            }
+        }
+    }
+}
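Review note on the retirement counter in `reduce_all_kernel`: each work-group stores its partial result, takes a ticket with `atomic_inc`, and the group that draws ticket `total_blocks - 1` folds all partials and resets the counter. The sketch below is a single-threaded host simulation of that control flow, not the kernel itself. The work-groups "finish" in a caller-supplied order, and the reduction is hard-coded to addition. `RetireResult`, `lastBlockSum` and `finish_order` are hypothetical names for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct RetireResult {
    long long sum;      // final reduced value
    unsigned finisher;  // id of the block that performed the final fold
};

// finish_order must be a permutation of 0 .. num_blocks - 1; it fixes the
// order in which the simulated blocks complete.
inline RetireResult lastBlockSum(const std::vector<int>& data,
                                 const std::vector<unsigned>& finish_order) {
    const unsigned num_blocks = static_cast<unsigned>(finish_order.size());
    assert(num_blocks > 0);

    std::vector<long long> partial(num_blocks, 0);  // stands in for `tmp`
    std::atomic<unsigned> retirement{0};            // stands in for `retirementCount`
    RetireResult result{0, 0};

    for (unsigned b : finish_order) {
        assert(b < num_blocks);

        // Per-block partial reduction over a strided slice of the input.
        long long acc = 0;
        for (std::size_t i = b; i < data.size(); i += num_blocks) acc += data[i];
        partial[b] = acc;

        // fetch_add returns the value before the increment, so the last
        // block to arrive observes num_blocks - 1.
        const unsigned ticket = retirement.fetch_add(1);
        if (ticket == num_blocks - 1) {
            long long total = 0;
            for (long long p : partial) total += p;
            result = {total, b};
            retirement.store(0);  // reset so the counter can be reused
        }
    }
    return result;
}
```

Whatever order is supplied, the final fold is done by the last block in `finish_order`, which is the property the kernel relies on. On a real device the partial results must also be visible to that last block, which is what the fence before `atomic_inc` is there for.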
diff --git a/src/backend/opencl/kernel/reduce_blocks_by_key_dim.cl b/src/backend/opencl/kernel/reduce_blocks_by_key_dim.cl
new file mode 100644
index 0000000000..76941ebbd7
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_blocks_by_key_dim.cl
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// OpenCL C 2.x provides built-in work-group scan functions (an optional
+// feature in OpenCL C 3.0), so skip defining a custom one
+#if __OPENCL_C_VERSION__ == 200 || __OPENCL_C_VERSION__ == 210 || \
+    __OPENCL_C_VERSION__ == 220 || __opencl_c_work_group_collective_functions
+#define BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+#endif
+
+#ifndef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+int work_group_scan_inclusive_add(local int *wg_temp, __local int *arr) {
+    local int *active_buf;
+
+    const int lid = get_local_id(0);
+    int val       = arr[lid];
+    active_buf    = arr;
+
+    bool swap_buffer = false;
+    for (int off = 1; off <= DIMX; off *= 2) {
+        barrier(CLK_LOCAL_MEM_FENCE);
+        if (lid >= off) { val = val + active_buf[lid - off]; }
+        swap_buffer     = !swap_buffer;
+        active_buf      = swap_buffer ? wg_temp : arr;
+        active_buf[lid] = val;
+    }
+
+    int res = active_buf[lid];
+    return res;
+}
+#endif
+
+kernel void reduce_blocks_by_key_dim(global int *reduced_block_sizes,
+                                     global Tk *oKeys, KParam oKInfo,
+                                     global To *oVals, KParam oVInfo,
+                                     const global Tk *iKeys, KParam iKInfo,
+                                     const global Ti *iVals, KParam iVInfo,
+                                     int change_nan, To nanval, int n,
+                                     const int nBlocksZ) {
+    const uint lid  = get_local_id(0);
+    const uint gidx = get_global_id(0);
+
+    const int bidy = get_group_id(1);
+    const int bidz = get_group_id(2) % nBlocksZ;
+    const int bidw = get_group_id(2) / nBlocksZ;
+
+    local Tk keys[DIMX];
+    local To vals[DIMX];
+    local Tk reduced_keys[DIMX];
+    local To reduced_vals[DIMX];
+    local int unique_ids[DIMX];
+#ifndef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+    local int wg_temp[DIMX];
+    local int unique_flags[DIMX];
+#endif
+
+    const To init_val = init;
+
+    // Shared across the work-group: holds the final number of reduced
+    // elements in this block
+    local int reducedBlockSize;
+
+    local int dims_ordering[4];
+    if (lid == 0) {
+        reducedBlockSize = 0;
+
+        int d            = 1;
+        dims_ordering[0] = DIM;
+        for (int i = 0; i < 4; ++i) {
+            if (i != DIM) dims_ordering[d++] = i;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // load keys and values to threads
+    Tk k;
+    To v;
+    if (gidx < n) {
+        k             = iKeys[gidx + iKInfo.offset];
+        const int gid = bidw * iVInfo.strides[dims_ordering[3]] +
+                        bidz * iVInfo.strides[dims_ordering[2]] +
+                        bidy * iVInfo.strides[dims_ordering[1]] +
+                        gidx * iVInfo.strides[DIM];
+        v = transform(iVals[gid + iVInfo.offset]);
+        if (change_nan) v = IS_NAN(v) ? nanval : v;
+    } else {
+        v = init_val;
+    }
+
+    keys[lid] = k;
+    vals[lid] = v;
+
+    reduced_keys[lid] = k;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // mark threads containing unique keys
+    int eq_check    = (lid > 0) ? (k != reduced_keys[lid - 1]) : 0;
+    int unique_flag = (eq_check || (lid == 0)) && (gidx < n);
+
+#ifdef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+    int unique_id = work_group_scan_inclusive_add(unique_flag);
+#else
+    unique_flags[lid] = unique_flag;
+    int unique_id     = work_group_scan_inclusive_add(wg_temp, unique_flags);
+#endif
+    unique_ids[lid] = unique_id;
+
+    if (lid == DIMX - 1) reducedBlockSize = unique_id;
+
+    for (int off = 1; off < DIMX; off *= 2) {
+        barrier(CLK_LOCAL_MEM_FENCE);
+        int test_unique_id =
+            (lid + off < DIMX) ? unique_ids[lid + off] : ~unique_id;
+        eq_check = (unique_id == test_unique_id);
+        int update_key =
+            eq_check && (lid < (DIMX - off)) &&
+            ((gidx + off) <
+             n);  // checks if this thread should perform a reduction
+        To uval = (update_key) ? vals[lid + off] : init_val;
+        barrier(CLK_LOCAL_MEM_FENCE);
+        vals[lid] = binOp(vals[lid], uval);  // update if thread requires it
+    }
+
+    if (unique_flag) {
+        reduced_keys[unique_id - 1] = k;
+        reduced_vals[unique_id - 1] = vals[lid];
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    const int bid = get_group_id(0);
+    if (lid < reducedBlockSize) {
+        const int bOffset = bidw * oVInfo.strides[dims_ordering[3]] +
+                            bidz * oVInfo.strides[dims_ordering[2]] +
+                            bidy * oVInfo.strides[dims_ordering[1]];
+        oKeys[gidx]                                 = reduced_keys[lid];
+        oVals[bOffset + (gidx)*oVInfo.strides[DIM]] = reduced_vals[lid];
+    }
+
+    reduced_block_sizes[bid] = reducedBlockSize;
+}
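Review note on the per-block reduce-by-key logic: the kernel flags the first element of each run of equal keys, turns the flags into 1-based run ids with an inclusive scan, and writes each run's key and reduced value to slot `id - 1`. The sketch below walks through the same three steps serially on the host. It accumulates values with a plain loop instead of the kernel's strided tree reduction, and it assumes addition as the binary operation. `reduceRunsByKey` is a hypothetical name for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Reduces each run of equal adjacent keys to one (key, sum) pair.
inline std::pair<std::vector<int>, std::vector<int>> reduceRunsByKey(
    const std::vector<int>& keys, const std::vector<int>& vals) {
    assert(keys.size() == vals.size());
    const std::size_t n = keys.size();

    // Step 1: flag the first element of every run (the kernel's unique_flag).
    std::vector<int> flag(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        flag[i] = (i == 0 || keys[i] != keys[i - 1]) ? 1 : 0;
    }

    // Step 2: an inclusive scan of the flags gives each element a 1-based
    // run id (the kernel's unique_id).
    std::vector<int> id(n, 0);
    int running = 0;
    for (std::size_t i = 0; i < n; ++i) {
        running += flag[i];
        id[i] = running;
    }

    // Step 3: the last id is the number of runs (reducedBlockSize); write
    // each run's key once and accumulate its values into slot id - 1.
    const std::size_t runs = (n == 0) ? 0 : static_cast<std::size_t>(id[n - 1]);
    std::vector<int> out_keys(runs, 0);
    std::vector<int> out_vals(runs, 0);
    for (std::size_t i = 0; i < n; ++i) {
        const std::size_t slot = static_cast<std::size_t>(id[i] - 1);
        if (flag[i]) out_keys[slot] = keys[i];
        out_vals[slot] += vals[i];
    }
    return {out_keys, out_vals};
}
```

Only adjacent equal keys are merged, so a key that reappears after a different key starts a new run. That matches the kernel, which compares each key only with its left neighbour.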
diff --git a/src/backend/opencl/kernel/reduce_blocks_by_key_first.cl b/src/backend/opencl/kernel/reduce_blocks_by_key_first.cl
new file mode 100644
index 0000000000..c01d3c250d
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_blocks_by_key_first.cl
@@ -0,0 +1,133 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// OpenCL C 2.x provides built-in work-group scan functions (an optional
+// feature in OpenCL C 3.0), so skip defining a custom one
+#if __OPENCL_C_VERSION__ == 200 || __OPENCL_C_VERSION__ == 210 || \
+    __OPENCL_C_VERSION__ == 220 || __opencl_c_work_group_collective_functions
+#define BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+#endif
+
+#ifndef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+int work_group_scan_inclusive_add(local int *wg_temp, __local int *arr) {
+    local int *active_buf;
+
+    const int lid = get_local_id(0);
+    int val       = arr[lid];
+    active_buf    = arr;
+
+    bool swap_buffer = false;
+    for (int off = 1; off <= DIMX; off *= 2) {
+        barrier(CLK_LOCAL_MEM_FENCE);
+        if (lid >= off) { val = val + active_buf[lid - off]; }
+        swap_buffer     = !swap_buffer;
+        active_buf      = swap_buffer ? wg_temp : arr;
+        active_buf[lid] = val;
+    }
+
+    int res = active_buf[lid];
+    return res;
+}
+#endif
+
+kernel void reduce_blocks_by_key_first(global int *reduced_block_sizes,
+                                       __global Tk *oKeys, KParam oKInfo,
+                                       global To *oVals, KParam oVInfo,
+                                       const __global Tk *iKeys, KParam iKInfo,
+                                       const global Ti *iVals, KParam iVInfo,
+                                       int change_nan, To nanval, int n,
+                                       const int nBlocksZ) {
+    const uint lid = get_local_id(0);
+    const uint gid = get_global_id(0);
+
+    const int bidy = get_group_id(1);
+    const int bidz = get_group_id(2) % nBlocksZ;
+    const int bidw = get_group_id(2) / nBlocksZ;
+
+    local Tk keys[DIMX];
+    local To vals[DIMX];
+    local Tk reduced_keys[DIMX];
+    local To reduced_vals[DIMX];
+    local int unique_ids[DIMX];
+#ifndef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+    local int wg_temp[DIMX];
+    local int unique_flags[DIMX];
+#endif
+
+    const To init_val = init;
+
+    // Final number of reduced elements in this block, written by the last
+    // work-item after the scan of unique flags.
+    local int reducedBlockSize;
+
+    if (lid == 0) { reducedBlockSize = 0; }
+
+    // load keys and values to threads
+    Tk k;
+    To v;
+    if (gid < n) {
+        k                 = iKeys[gid + iKInfo.offset];
+        const int bOffset = bidw * iVInfo.strides[3] +
+                            bidz * iVInfo.strides[2] + bidy * iVInfo.strides[1];
+        v = transform(iVals[bOffset + gid + iVInfo.offset]);
+        if (change_nan) v = IS_NAN(v) ? nanval : v;
+    } else {
+        v = init_val;
+    }
+
+    keys[lid] = k;
+    vals[lid] = v;
+
+    reduced_keys[lid] = k;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // mark work-items holding the first key of each run of equal keys
+    int eq_check    = (lid > 0) ? (k != reduced_keys[lid - 1]) : 0;
+    int unique_flag = (eq_check || (lid == 0)) && (gid < n);
+
+#ifdef BUILTIN_WORK_GROUP_COLLECTIVE_FUNCTIONS
+    int unique_id = work_group_scan_inclusive_add(unique_flag);
+#else
+    unique_flags[lid] = unique_flag;
+    int unique_id     = work_group_scan_inclusive_add(wg_temp, unique_flags);
+#endif
+    unique_ids[lid] = unique_id;
+
+    if (lid == DIMX - 1) reducedBlockSize = unique_id;
+
+    for (int off = 1; off < DIMX; off *= 2) {
+        barrier(CLK_LOCAL_MEM_FENCE);
+        int test_unique_id =
+            (lid + off < DIMX) ? unique_ids[lid + off] : ~unique_id;
+        eq_check = (unique_id == test_unique_id);
+        int update_key =
+            eq_check && (lid < (DIMX - off)) &&
+            ((gid + off) <
+             n);  // checks if this thread should perform a reduction
+        To uval = (update_key) ? vals[lid + off] : init_val;
+        barrier(CLK_LOCAL_MEM_FENCE);
+        vals[lid] = binOp(vals[lid], uval);  // update if thread requires it
+    }
+
+    if (unique_flag) {
+        reduced_keys[unique_id - 1] = k;
+        reduced_vals[unique_id - 1] = vals[lid];
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    const int bid = get_group_id(0);
+    if (lid < reducedBlockSize) {
+        const int bOffset = bidw * oVInfo.strides[3] +
+                            bidz * oVInfo.strides[2] + bidy * oVInfo.strides[1];
+        oKeys[bid * DIMX + lid]               = reduced_keys[lid];
+        oVals[bOffset + ((bid * DIMX) + lid)] = reduced_vals[lid];
+    }
+
+    reduced_block_sizes[bid] = reducedBlockSize;
+}
diff --git a/src/backend/opencl/kernel/reduce_by_key.hpp b/src/backend/opencl/kernel/reduce_by_key.hpp
new file mode 100644
index 0000000000..e80e3603c6
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key.hpp
@@ -0,0 +1,575 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/reduce_blocks_by_key_dim.hpp>
+#include <kernel_headers/reduce_blocks_by_key_first.hpp>
+#include <kernel_headers/reduce_by_key_boundary.hpp>
+#include <kernel_headers/reduce_by_key_boundary_dim.hpp>
+#include <kernel_headers/reduce_by_key_compact.hpp>
+#include <kernel_headers/reduce_by_key_compact_dim.hpp>
+#include <kernel_headers/reduce_by_key_needs_reduction.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <boost/compute/algorithm/inclusive_scan.hpp>
+#include <boost/compute/core.hpp>
+#include <boost/compute/functional/operator.hpp>
+#include <boost/compute/iterator/buffer_iterator.hpp>
+
+#include <string>
+#include <vector>
+
+namespace compute = boost::compute;
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void reduceBlocksByKeyDim(cl::Buffer *reduced_block_sizes, Param keys_out,
+                          Param vals_out, const Param keys, const Param vals,
+                          int change_nan, double nanval, const int n,
+                          const uint threads_x, const int dim,
+                          std::vector<int> dim_ordering) {
+    ToNumStr<To> toNumStr;
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(), TemplateTypename<To>(), TemplateTypename<Tk>(),
+        TemplateArg(op),        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(DIM, dim),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    auto reduceBlocksByKeyDim =
+        common::getKernel("reduce_blocks_by_key_dim",
+                          {{ops_cl_src, reduce_blocks_by_key_dim_cl_src}},
+                          tmpltArgs, compileOpts);
+    int numBlocks = divup(n, threads_x);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks,
+                       vals_out.info.dims[dim_ordering[1]],
+                       vals_out.info.dims[dim_ordering[2]] *
+                           vals_out.info.dims[dim_ordering[3]]);
+
+    reduceBlocksByKeyDim(cl::EnqueueArgs(getQueue(), global, local),
+                         *reduced_block_sizes, *keys_out.data, keys_out.info,
+                         *vals_out.data, vals_out.info, *keys.data, keys.info,
+                         *vals.data, vals.info, change_nan, scalar<To>(nanval),
+                         n,
+                         static_cast<int>(vals_out.info.dims[dim_ordering[2]]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void reduceBlocksByKey(cl::Buffer *reduced_block_sizes, Param keys_out,
+                       Param vals_out, const Param keys, const Param vals,
+                       int change_nan, double nanval, const int n,
+                       const uint threads_x) {
+    ToNumStr<To> toNumStr;
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(), TemplateTypename<To>(), TemplateTypename<Tk>(),
+        TemplateArg(op),        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    auto reduceBlocksByKeyFirst =
+        common::getKernel("reduce_blocks_by_key_first",
+                          {{ops_cl_src, reduce_blocks_by_key_first_cl_src}},
+                          tmpltArgs, compileOpts);
+    int numBlocks = divup(n, threads_x);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks, vals_out.info.dims[1],
+                       vals_out.info.dims[2] * vals_out.info.dims[3]);
+
+    reduceBlocksByKeyFirst(
+        cl::EnqueueArgs(getQueue(), global, local), *reduced_block_sizes,
+        *keys_out.data, keys_out.info, *vals_out.data, vals_out.info,
+        *keys.data, keys.info, *vals.data, vals.info, change_nan,
+        scalar<To>(nanval), n, static_cast<int>(vals_out.info.dims[2]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To, af_op_t op>
+void finalBoundaryReduce(cl::Buffer *reduced_block_sizes, Param keys_out,
+                         Param vals_out, const int n, const int numBlocks,
+                         const int threads_x) {
+    ToNumStr<To> toNumStr;
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<To>(),
+        TemplateTypename<Tk>(),
+        TemplateArg(op),
+        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<To>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<To>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<To>());
+
+    auto finalBoundaryReduce = common::getKernel(
+        "final_boundary_reduce", {{ops_cl_src, reduce_by_key_boundary_cl_src}},
+        tmpltArgs, compileOpts);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks);
+
+    finalBoundaryReduce(cl::EnqueueArgs(getQueue(), global, local),
+                        *reduced_block_sizes, *keys_out.data, keys_out.info,
+                        *vals_out.data, vals_out.info, n);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To, af_op_t op>
+void finalBoundaryReduceDim(cl::Buffer *reduced_block_sizes, Param keys_out,
+                            Param vals_out, const int n, const int numBlocks,
+                            const int threads_x, const int dim,
+                            std::vector<int> dim_ordering) {
+    ToNumStr<To> toNumStr;
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<To>(),
+        TemplateTypename<Tk>(),
+        TemplateArg(op),
+        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<To>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(DIM, dim),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<To>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<To>());
+
+    auto finalBoundaryReduceDim =
+        common::getKernel("final_boundary_reduce_dim",
+                          {{ops_cl_src, reduce_by_key_boundary_dim_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks,
+                       vals_out.info.dims[dim_ordering[1]],
+                       vals_out.info.dims[dim_ordering[2]] *
+                           vals_out.info.dims[dim_ordering[3]]);
+
+    finalBoundaryReduceDim(
+        cl::EnqueueArgs(getQueue(), global, local), *reduced_block_sizes,
+        *keys_out.data, keys_out.info, *vals_out.data, vals_out.info, n,
+        static_cast<int>(vals_out.info.dims[dim_ordering[2]]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To>
+void compact(cl::Buffer *reduced_block_sizes, Param keys_out, Param vals_out,
+             const Param keys, const Param vals, const int numBlocks,
+             const int threads_x) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<To>(),
+        TemplateTypename<Tk>(),
+        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(CPLX, iscplx<To>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<To>());
+
+    auto compact = common::getKernel(
+        "compact", {{ops_cl_src, reduce_by_key_compact_cl_src}}, tmpltArgs,
+        compileOpts);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks, vals_out.info.dims[1],
+                       vals_out.info.dims[2] * vals_out.info.dims[3]);
+
+    compact(cl::EnqueueArgs(getQueue(), global, local), *reduced_block_sizes,
+            *keys_out.data, keys_out.info, *vals_out.data, vals_out.info,
+            *keys.data, keys.info, *vals.data, vals.info,
+            static_cast<int>(vals_out.info.dims[2]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk, typename To>
+void compactDim(cl::Buffer *reduced_block_sizes, Param keys_out, Param vals_out,
+                const Param keys, const Param vals, const int numBlocks,
+                const int threads_x, const int dim,
+                std::vector<int> dim_ordering) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<To>(),
+        TemplateTypename<Tk>(),
+        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(DIM, dim),
+        DefineKeyValue(CPLX, iscplx<To>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<To>());
+
+    auto compactDim = common::getKernel(
+        "compact_dim", {{ops_cl_src, reduce_by_key_compact_dim_cl_src}},
+        tmpltArgs, compileOpts);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks,
+                       vals_out.info.dims[dim_ordering[1]],
+                       vals_out.info.dims[dim_ordering[2]] *
+                           vals_out.info.dims[dim_ordering[3]]);
+
+    compactDim(cl::EnqueueArgs(getQueue(), global, local), *reduced_block_sizes,
+               *keys_out.data, keys_out.info, *vals_out.data, vals_out.info,
+               *keys.data, keys.info, *vals.data, vals.info,
+               static_cast<int>(vals_out.info.dims[dim_ordering[2]]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk>
+void testNeedsReduction(cl::Buffer needs_reduction, cl::Buffer needs_boundary,
+                        const Param keys, const int n, const int numBlocks,
+                        const int threads_x) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Tk>(),
+        TemplateArg(threads_x),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(DIMX, threads_x),
+    };
+
+    auto testIfNeedsReduction =
+        common::getKernel("test_needs_reduction",
+                          {{ops_cl_src, reduce_by_key_needs_reduction_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    cl::NDRange local(threads_x);
+    cl::NDRange global(threads_x * numBlocks);
+
+    testIfNeedsReduction(cl::EnqueueArgs(getQueue(), global, local),
+                         needs_reduction, needs_boundary, *keys.data, keys.info,
+                         n);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+int reduceByKeyFirst(Array<Tk> &keys_out, Array<To> &vals_out, const Param keys,
+                     const Param vals, bool change_nan, double nanval) {
+    dim4 kdims(4, keys.info.dims);
+    dim4 odims(4, vals.info.dims);
+
+    auto reduced_keys   = createEmptyArray<Tk>(kdims);
+    auto reduced_vals   = createEmptyArray<To>(odims);
+    auto t_reduced_keys = createEmptyArray<Tk>(kdims);
+    auto t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether further reduction passes are necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    int nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+
+    auto reduced_block_sizes = memAlloc<int>(numBlocksD0);
+
+    compute::command_queue c_queue(getQueue()());
+    compute::buffer val_buf((*reduced_block_sizes.get())());
+
+    int n_reduced_host = nelems;
+    int needs_another_reduction_host;
+
+    int needs_block_boundary_reduction_host;
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        if (first_pass) {
+            reduceBlocksByKey<Ti, Tk, To, op>(
+                reduced_block_sizes.get(), reduced_keys, reduced_vals, keys,
+                vals, change_nan, nanval, n_reduced_host, numThreads);
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = op == af_notzero_t ? af_add_t : op;
+            reduceBlocksByKey<To, Tk, To, op2>(
+                reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                t_reduced_keys, t_reduced_vals, change_nan, nanval,
+                n_reduced_host, numThreads);
+        }
+
+        compute::inclusive_scan(
+            compute::make_buffer_iterator<int>(val_buf),
+            compute::make_buffer_iterator<int>(val_buf, numBlocksD0),
+            compute::make_buffer_iterator<int>(val_buf), c_queue);
+
+        compact<Tk, To>(reduced_block_sizes.get(), t_reduced_keys,
+                        t_reduced_vals, reduced_keys, reduced_vals, numBlocksD0,
+                        numThreads);
+
+        getQueue().enqueueReadBuffer(*reduced_block_sizes.get(), CL_TRUE,
+                                     (numBlocksD0 - 1) * sizeof(int),
+                                     sizeof(int), &n_reduced_host);
+
+        // reset flags
+        needs_block_boundary_reduction_host = 0;
+        needs_another_reduction_host        = 0;
+
+        getQueue().enqueueWriteBuffer(*needs_another_reduction.get(), CL_FALSE,
+                                      0, sizeof(int),
+                                      &needs_another_reduction_host);
+        getQueue().enqueueWriteBuffer(*needs_block_boundary_reduction.get(),
+                                      CL_FALSE, 0, sizeof(int),
+                                      &needs_block_boundary_reduction_host);
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        testNeedsReduction<Tk>(*needs_another_reduction.get(),
+                               *needs_block_boundary_reduction.get(),
+                               t_reduced_keys, n_reduced_host, numBlocksD0,
+                               numThreads);
+
+        getQueue().enqueueReadBuffer(*needs_another_reduction.get(), CL_FALSE,
+                                     0, sizeof(int),
+                                     &needs_another_reduction_host);
+        getQueue().enqueueReadBuffer(*needs_block_boundary_reduction.get(),
+                                     CL_TRUE, 0, sizeof(int),
+                                     &needs_block_boundary_reduction_host);
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            finalBoundaryReduce<Tk, To, op>(
+                reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                n_reduced_host, numBlocksD0, numThreads);
+
+            compute::inclusive_scan(
+                compute::make_buffer_iterator<int>(val_buf),
+                compute::make_buffer_iterator<int>(val_buf, numBlocksD0),
+                compute::make_buffer_iterator<int>(val_buf), c_queue);
+
+            getQueue().enqueueReadBuffer(*reduced_block_sizes.get(), CL_TRUE,
+                                         (numBlocksD0 - 1) * sizeof(int),
+                                         sizeof(int), &n_reduced_host);
+
+            compact<Tk, To>(reduced_block_sizes.get(), reduced_keys,
+                            reduced_vals, t_reduced_keys, t_reduced_vals,
+                            numBlocksD0, numThreads);
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    keys_out = t_reduced_keys;
+    vals_out = t_reduced_vals;
+
+    return n_reduced_host;
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+int reduceByKeyDim(Array<Tk> &keys_out, Array<To> &vals_out, const Param keys,
+                   const Param vals, bool change_nan, double nanval,
+                   const int dim) {
+    std::vector<int> dim_ordering = {dim};
+    for (int i = 0; i < 4; ++i) {
+        if (i != dim) { dim_ordering.push_back(i); }
+    }
+
+    dim4 kdims(4, keys.info.dims);
+    dim4 odims(4, vals.info.dims);
+
+    auto reduced_keys   = createEmptyArray<Tk>(kdims);
+    auto reduced_vals   = createEmptyArray<To>(odims);
+    auto t_reduced_keys = createEmptyArray<Tk>(kdims);
+    auto t_reduced_vals = createEmptyArray<To>(odims);
+
+    // flags indicating whether further reduction passes are necessary
+    auto needs_another_reduction        = memAlloc<int>(1);
+    auto needs_block_boundary_reduction = memAlloc<int>(1);
+
+    int nelems = kdims[0];
+
+    const unsigned int numThreads = 128;
+    int numBlocksD0               = divup(nelems, numThreads);
+
+    auto reduced_block_sizes = memAlloc<int>(numBlocksD0);
+
+    compute::command_queue c_queue(getQueue()());
+    compute::buffer val_buf((*reduced_block_sizes.get())());
+
+    int n_reduced_host = nelems;
+    int needs_another_reduction_host;
+    int needs_block_boundary_reduction_host;
+
+    bool first_pass = true;
+    do {
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        if (first_pass) {
+            reduceBlocksByKeyDim<Ti, Tk, To, op>(
+                reduced_block_sizes.get(), reduced_keys, reduced_vals, keys,
+                vals, change_nan, nanval, n_reduced_host, numThreads, dim,
+                dim_ordering);
+            first_pass = false;
+        } else {
+            constexpr af_op_t op2 = op == af_notzero_t ? af_add_t : op;
+            reduceBlocksByKeyDim<To, Tk, To, op2>(
+                reduced_block_sizes.get(), reduced_keys, reduced_vals,
+                t_reduced_keys, t_reduced_vals, change_nan, nanval,
+                n_reduced_host, numThreads, dim, dim_ordering);
+        }
+
+        compute::inclusive_scan(
+            compute::make_buffer_iterator<int>(val_buf),
+            compute::make_buffer_iterator<int>(val_buf, numBlocksD0),
+            compute::make_buffer_iterator<int>(val_buf), c_queue);
+
+        compactDim<Tk, To>(reduced_block_sizes.get(), t_reduced_keys,
+                           t_reduced_vals, reduced_keys, reduced_vals,
+                           numBlocksD0, numThreads, dim, dim_ordering);
+
+        getQueue().enqueueReadBuffer(*reduced_block_sizes.get(), CL_TRUE,
+                                     (numBlocksD0 - 1) * sizeof(int),
+                                     sizeof(int), &n_reduced_host);
+
+        // reset flags
+        needs_block_boundary_reduction_host = 0;
+        needs_another_reduction_host        = 0;
+
+        getQueue().enqueueWriteBuffer(*needs_another_reduction.get(), CL_FALSE,
+                                      0, sizeof(int),
+                                      &needs_another_reduction_host);
+        getQueue().enqueueWriteBuffer(*needs_block_boundary_reduction.get(),
+                                      CL_FALSE, 0, sizeof(int),
+                                      &needs_block_boundary_reduction_host);
+
+        numBlocksD0 = divup(n_reduced_host, numThreads);
+
+        testNeedsReduction<Tk>(*needs_another_reduction.get(),
+                               *needs_block_boundary_reduction.get(),
+                               t_reduced_keys, n_reduced_host, numBlocksD0,
+                               numThreads);
+
+        getQueue().enqueueReadBuffer(*needs_another_reduction.get(), CL_FALSE,
+                                     0, sizeof(int),
+                                     &needs_another_reduction_host);
+        getQueue().enqueueReadBuffer(*needs_block_boundary_reduction.get(),
+                                     CL_TRUE, 0, sizeof(int),
+                                     &needs_block_boundary_reduction_host);
+
+        if (needs_block_boundary_reduction_host &&
+            !needs_another_reduction_host) {
+            finalBoundaryReduceDim<Tk, To, op>(
+                reduced_block_sizes.get(), t_reduced_keys, t_reduced_vals,
+                n_reduced_host, numBlocksD0, numThreads, dim, dim_ordering);
+
+            compute::inclusive_scan(
+                compute::make_buffer_iterator<int>(val_buf),
+                compute::make_buffer_iterator<int>(val_buf, numBlocksD0),
+                compute::make_buffer_iterator<int>(val_buf), c_queue);
+
+            getQueue().enqueueReadBuffer(*reduced_block_sizes.get(), CL_TRUE,
+                                         (numBlocksD0 - 1) * sizeof(int),
+                                         sizeof(int), &n_reduced_host);
+
+            compactDim<Tk, To>(reduced_block_sizes.get(), reduced_keys,
+                               reduced_vals, t_reduced_keys, t_reduced_vals,
+                               numBlocksD0, numThreads, dim, dim_ordering);
+
+            std::swap(t_reduced_keys, reduced_keys);
+            std::swap(t_reduced_vals, reduced_vals);
+        }
+    } while (needs_another_reduction_host ||
+             needs_block_boundary_reduction_host);
+
+    keys_out = t_reduced_keys;
+    vals_out = t_reduced_vals;
+
+    return n_reduced_host;
+}
+
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduceByKey(Array<Tk> &keys_out, Array<To> &vals_out,
+                 const Array<Tk> &keys, const Array<Ti> &vals, int dim,
+                 bool change_nan, double nanval) {
+    dim4 kdims = keys.dims();
+    dim4 odims = vals.dims();
+
+    // allocate space for output arrays
+    Array<Tk> reduced_keys = createEmptyArray<Tk>(dim4());
+    Array<To> reduced_vals = createEmptyArray<To>(dim4());
+
+    int n_reduced = 0;
+    if (dim == 0) {
+        n_reduced = reduceByKeyFirst<Ti, Tk, To, op>(
+            reduced_keys, reduced_vals, keys, vals, change_nan, nanval);
+    } else {
+        n_reduced = reduceByKeyDim<Ti, Tk, To, op>(
+            reduced_keys, reduced_vals, keys, vals, change_nan, nanval, dim);
+    }
+
+    kdims[0]   = n_reduced;
+    odims[dim] = n_reduced;
+    std::vector<af_seq> kindex, vindex;
+    for (int i = 0; i < odims.ndims(); ++i) {
+        af_seq sk = {0.0, (double)kdims[i] - 1, 1.0};
+        af_seq sv = {0.0, (double)odims[i] - 1, 1.0};
+        kindex.push_back(sk);
+        vindex.push_back(sv);
+    }
+
+    keys_out = createSubArray<Tk>(reduced_keys, kindex, true);
+    vals_out = createSubArray<To>(reduced_vals, vindex, true);
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/reduce_by_key_boundary.cl b/src/backend/opencl/kernel/reduce_by_key_boundary.cl
new file mode 100644
index 0000000000..300e95de54
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key_boundary.cl
@@ -0,0 +1,36 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void final_boundary_reduce(global int *reduced_block_sizes,
+                                  global Tk *oKeys, KParam oKInfo,
+                                  global To *oVals, KParam oVInfo,
+                                  const int n) {
+    const uint lid = get_local_id(0);
+    const uint bid = get_group_id(0);
+    const uint gid = get_global_id(0);
+
+    if (gid == ((bid + 1) * get_local_size(0)) - 1 &&
+        bid < get_num_groups(0) - 1) {
+        Tk k0 = oKeys[gid];
+        Tk k1 = oKeys[gid + 1];
+        if (k0 == k1) {
+            To v0                    = oVals[gid];
+            To v1                    = oVals[gid + 1];
+            oVals[gid + 1]           = binOp(v0, v1);
+            reduced_block_sizes[bid] = get_local_size(0) - 1;
+        } else {
+            reduced_block_sizes[bid] = get_local_size(0);
+        }
+    }
+
+    // if last block, set block size to difference between n and block boundary
+    if (lid == 0 && bid == get_num_groups(0) - 1) {
+        reduced_block_sizes[bid] = n - (bid * get_local_size(0));
+    }
+}
diff --git a/src/backend/opencl/kernel/reduce_by_key_boundary_dim.cl b/src/backend/opencl/kernel/reduce_by_key_boundary_dim.cl
new file mode 100644
index 0000000000..c8d56ce6be
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key_boundary_dim.cl
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void final_boundary_reduce_dim(global int *reduced_block_sizes,
+                                      global Tk *oKeys, KParam oKInfo,
+                                      global To *oVals, KParam oVInfo,
+                                      const int n, const int nBlocksZ) {
+    local int dim_ordering[4];
+
+    const uint lid = get_local_id(0);
+    const uint bid = get_group_id(0);
+    const uint gid = get_global_id(0);
+
+    const int bidy = get_group_id(1);
+    const int bidz = get_group_id(2) % nBlocksZ;
+    const int bidw = get_group_id(2) / nBlocksZ;
+
+    if (lid == 0) {
+        int d           = 1;
+        dim_ordering[0] = DIM;
+        for (int i = 0; i < 4; ++i) {
+            if (i != DIM) dim_ordering[d++] = i;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (gid == ((bid + 1) * get_local_size(0)) - 1 &&
+        bid < get_num_groups(0) - 1) {
+        Tk k0 = oKeys[gid];
+        Tk k1 = oKeys[gid + 1];
+        if (k0 == k1) {
+            To v0                    = oVals[gid];
+            To v1                    = oVals[gid + 1];
+            oVals[gid + 1]           = binOp(v0, v1);
+            reduced_block_sizes[bid] = get_local_size(0) - 1;
+        } else {
+            reduced_block_sizes[bid] = get_local_size(0);
+        }
+    }
+
+    // if last block, set block size to difference between n and block boundary
+    if (lid == 0 && bid == get_num_groups(0) - 1) {
+        reduced_block_sizes[bid] = n - (bid * get_local_size(0));
+    }
+}
diff --git a/src/backend/opencl/kernel/reduce_by_key_compact.cl b/src/backend/opencl/kernel/reduce_by_key_compact.cl
new file mode 100644
index 0000000000..58b78cd894
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key_compact.cl
@@ -0,0 +1,41 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void compact(global int *reduced_block_sizes, global Tk *oKeys,
+                    KParam oKInfo, global To *oVals, KParam oVInfo,
+                    const global Tk *iKeys, KParam iKInfo,
+                    const global To *iVals, KParam iVInfo, const int nBlocksZ) {
+    const uint lid = get_local_id(0);
+    const uint bid = get_group_id(0);
+    const uint gid = get_global_id(0);
+
+    const int bidy = get_group_id(1);
+    const int bidz = get_group_id(2) % nBlocksZ;
+    const int bidw = get_group_id(2) / nBlocksZ;
+
+    Tk k;
+    To v;
+
+    const int bOffset = bidw * oVInfo.strides[3] + bidz * oVInfo.strides[2] +
+                        bidy * oVInfo.strides[1];
+
+    // reduced_block_sizes must hold the inclusive prefix sum of block sizes
+    int nwrite =
+        (bid == 0) ? reduced_block_sizes[0]
+                   : (reduced_block_sizes[bid] - reduced_block_sizes[bid - 1]);
+    int writeloc = (bid == 0) ? 0 : reduced_block_sizes[bid - 1];
+
+    k = iKeys[gid + iKInfo.offset];
+    v = iVals[bOffset + gid + iVInfo.offset];
+
+    if (lid < nwrite) {
+        oKeys[writeloc + lid]           = k;
+        oVals[bOffset + writeloc + lid] = v;
+    }
+}
diff --git a/src/backend/opencl/kernel/reduce_by_key_compact_dim.cl b/src/backend/opencl/kernel/reduce_by_key_compact_dim.cl
new file mode 100644
index 0000000000..3d07a63eb7
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key_compact_dim.cl
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void compact_dim(global int *reduced_block_sizes, global Tk *oKeys,
+                        KParam oKInfo, global To *oVals, KParam oVInfo,
+                        const global Tk *iKeys, KParam iKInfo,
+                        const global To *iVals, KParam iVInfo,
+                        const int nBlocksZ) {
+    local int dim_ordering[4];
+    const uint lid  = get_local_id(0);
+    const uint bid  = get_group_id(0);
+    const uint gidx = get_global_id(0);
+
+    const int bidy = get_group_id(1);
+    const int bidz = get_group_id(2) % nBlocksZ;
+    const int bidw = get_group_id(2) / nBlocksZ;
+
+    if (lid == 0) {
+        int d           = 1;
+        dim_ordering[0] = DIM;
+        for (int i = 0; i < 4; ++i) {
+            if (i != DIM) dim_ordering[d++] = i;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    Tk k;
+    To v;
+
+    // reduced_block_sizes must hold the inclusive prefix sum of block sizes
+    int nwrite =
+        (bid == 0) ? reduced_block_sizes[0]
+                   : (reduced_block_sizes[bid] - reduced_block_sizes[bid - 1]);
+    int writeloc = (bid == 0) ? 0 : reduced_block_sizes[bid - 1];
+
+    const int tid = bidw * iVInfo.strides[dim_ordering[3]] +
+                    bidz * iVInfo.strides[dim_ordering[2]] +
+                    bidy * iVInfo.strides[dim_ordering[1]] +
+                    gidx * iVInfo.strides[DIM];
+    k = iKeys[gidx + iKInfo.offset];
+    v = iVals[tid + iVInfo.offset];
+
+    if (lid < nwrite) {
+        oKeys[writeloc + lid] = k;
+        const int bOffset     = bidw * oVInfo.strides[dim_ordering[3]] +
+                            bidz * oVInfo.strides[dim_ordering[2]] +
+                            bidy * oVInfo.strides[dim_ordering[1]];
+        oVals[bOffset + (writeloc + lid) * oVInfo.strides[DIM]] = v;
+    }
+}
diff --git a/src/backend/opencl/kernel/reduce_by_key_needs_reduction.cl b/src/backend/opencl/kernel/reduce_by_key_needs_reduction.cl
new file mode 100644
index 0000000000..c505689bff
--- /dev/null
+++ b/src/backend/opencl/kernel/reduce_by_key_needs_reduction.cl
@@ -0,0 +1,39 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void test_needs_reduction(global int *needs_another_reduction,
+                                   global int *needs_block_boundary_reduced,
+                                   const global Tk *iKeys, KParam iKInfo,
+                                   int n) {
+    const uint lid = get_local_id(0);
+    const uint bid = get_group_id(0);
+    const uint gid = get_global_id(0);
+
+    Tk k;
+    if (gid < n) { k = iKeys[gid + iKInfo.offset]; }
+
+    local Tk keys[DIMX];
+    keys[lid] = k;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    int update_key =
+        (lid < DIMX - 2) && (k == keys[lid + 1]) && (gid < (n - 1));
+
+    if (update_key) { atomic_or(needs_another_reduction, update_key); }
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // last thread in each block checks if any inter-block keys need further
+    // reduction
+    if (gid == ((bid + 1) * DIMX) - 1 && bid < get_num_groups(0) - 1) {
+        Tk k0 = iKeys[gid + iKInfo.offset];
+        Tk k1 = iKeys[gid + 1 + iKInfo.offset];
+        if (k0 == k1) { atomic_or(needs_block_boundary_reduced, 1); }
+    }
+}
diff --git a/src/backend/opencl/kernel/reduce_dim.cl b/src/backend/opencl/kernel/reduce_dim.cl
index 5ac9f22274..7b1397ce87 100644
--- a/src/backend/opencl/kernel/reduce_dim.cl
+++ b/src/backend/opencl/kernel/reduce_dim.cl
@@ -7,60 +7,55 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void reduce_dim_kernel(__global To *oData,
-                       KParam oInfo,
-                       const __global Ti *iData,
-                       KParam iInfo,
-                       uint groups_x, uint groups_y, uint group_dim)
-{
+kernel void reduce_dim_kernel(global To *oData, KParam oInfo,
+                                const global Ti *iData, KParam iInfo,
+                                uint groups_x, uint groups_y, uint group_dim,
+                                int change_nan, To nanval) {
     const uint lidx = get_local_id(0);
     const uint lidy = get_local_id(1);
     const uint lid  = lidy * THREADS_X + lidx;
 
-    const uint zid = get_group_id(0) / groups_x;
-    const uint wid = get_group_id(1) / groups_y;
-    const uint groupId_x = get_group_id(0) - (groups_x) * zid;
-    const uint groupId_y = get_group_id(1) - (groups_y) * wid;
-    const uint xid = groupId_x * get_local_size(0) + lidx;
-    const uint yid = groupId_y;
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) + lidx;
+    const uint yid       = groupId_y;
 
     uint ids[4] = {xid, yid, zid, wid};
 
     // There is only one element per group for out
     // There are get_local_size(1) elements per group for in
-    // Hence increment ids[dim] just after offseting out and before offsetting in
+    // Hence increment ids[kDim] just after offsetting out and before offsetting
+    // in
     oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
-        ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
-    const uint id_dim_out = ids[dim];
+             ids[1] * oInfo.strides[1] + ids[0] + oInfo.offset;
+    const uint id_dim_out = ids[kDim];
 
-    ids[dim] = ids[dim] * get_local_size(1) + lidy;
-    iData  += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
-        ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
-    const uint id_dim_in = ids[dim];
+    ids[kDim] = ids[kDim] * get_local_size(1) + lidy;
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+    const uint id_dim_in = ids[kDim];
 
-    const uint istride_dim = iInfo.strides[dim];
+    const uint istride_dim = iInfo.strides[kDim];
 
-    bool is_valid =
-        (ids[0] < iInfo.dims[0]) &&
-        (ids[1] < iInfo.dims[1]) &&
-        (ids[2] < iInfo.dims[2]) &&
-        (ids[3] < iInfo.dims[3]);
+    bool is_valid = (ids[0] < iInfo.dims[0]) && (ids[1] < iInfo.dims[1]) &&
+                    (ids[2] < iInfo.dims[2]) && (ids[3] < iInfo.dims[3]);
 
-    __local To s_val[THREADS_X * DIMY];
+    local To s_val[THREADS_X * DIMY];
 
     To out_val = init;
-    for (int id = id_dim_in; is_valid && (id < iInfo.dims[dim]);
+    for (int id = id_dim_in; is_valid && (id < iInfo.dims[kDim]);
          id += group_dim * get_local_size(1)) {
-
         To in_val = transform(*iData);
+        if (change_nan) in_val = !IS_NAN(in_val) ? in_val : nanval;
         out_val = binOp(in_val, out_val);
-        iData = iData + group_dim * get_local_size(1) * istride_dim;
+        iData   = iData + group_dim * get_local_size(1) * istride_dim;
     }
 
     s_val[lid] = out_val;
 
-    __local To *s_ptr = s_val + lid;
+    local To *s_ptr = s_val + lid;
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (DIMY == 8) {
@@ -78,9 +73,7 @@ void reduce_dim_kernel(__global To *oData,
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (lidy == 0 && is_valid &&
-        (id_dim_out < oInfo.dims[dim])) {
+    if (lidy == 0 && is_valid && (id_dim_out < oInfo.dims[kDim])) {
         *oData = *s_ptr;
     }
-
 }
diff --git a/src/backend/opencl/kernel/reduce_first.cl b/src/backend/opencl/kernel/reduce_first.cl
index 8349048ca5..1dcf8ba91a 100644
--- a/src/backend/opencl/kernel/reduce_first.cl
+++ b/src/backend/opencl/kernel/reduce_first.cl
@@ -7,45 +7,44 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void reduce_first_kernel(__global To *oData,
-                         KParam oInfo,
-                         const __global Ti *iData,
-                         KParam iInfo,
-                         uint groups_x, uint groups_y, uint repeat)
-{
+kernel void reduce_first_kernel(global To *oData, KParam oInfo,
+                                  const global Ti *iData, KParam iInfo,
+                                  uint groups_x, uint groups_y, uint repeat,
+                                  int change_nan, To nanval) {
     const uint lidx = get_local_id(0);
     const uint lidy = get_local_id(1);
     const uint lid  = lidy * get_local_size(0) + lidx;
 
-    const uint zid = get_group_id(0) / groups_x;
-    const uint wid = get_group_id(1) / groups_y;
-    const uint groupId_x = get_group_id(0) - (groups_x) * zid;
-    const uint groupId_y = get_group_id(1) - (groups_y) * wid;
-    const uint xid = groupId_x * get_local_size(0) * repeat + lidx;
-    const uint yid = groupId_y * get_local_size(1) + lidy;
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) * repeat + lidx;
+    const uint yid       = groupId_y * get_local_size(1) + lidy;
 
     iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
-        yid * iInfo.strides[1] + iInfo.offset;
+             yid * iInfo.strides[1] + iInfo.offset;
     oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
-        yid * oInfo.strides[1] + oInfo.offset;
+             yid * oInfo.strides[1] + oInfo.offset;
 
-    bool cond = (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
+    bool cond =
+        (yid < iInfo.dims[1]) && (zid < iInfo.dims[2]) && (wid < iInfo.dims[3]);
 
-    __local To s_val[THREADS_PER_GROUP];
+    local To s_val[THREADS_PER_GROUP];
 
-    int last = (xid + repeat * DIMX);
-    int lim = last > iInfo.dims[0] ? iInfo.dims[0] : last;
+    int last   = (xid + repeat * DIMX);
+    int lim    = last > iInfo.dims[0] ? iInfo.dims[0] : last;
     To out_val = init;
 
     for (int id = xid; cond && id < lim; id += DIMX) {
         To in_val = transform(iData[id]);
+        if (change_nan) in_val = !IS_NAN(in_val) ? in_val : nanval;
         out_val = binOp(in_val, out_val);
     }
 
     s_val[lid] = out_val;
     barrier(CLK_LOCAL_MEM_FENCE);
-    __local To *s_ptr = s_val + lidy * DIMX;
+    local To *s_ptr = s_val + lidy * DIMX;
 
     if (DIMX == 256) {
         if (lidx < 128) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 128]);
@@ -53,31 +52,29 @@ void reduce_first_kernel(__global To *oData,
     }
 
     if (DIMX >= 128) {
-        if (lidx <  64) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  64]);
+        if (lidx < 64) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 64]);
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (DIMX >=  64) {
-        if (lidx <  32) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  32]);
+    if (DIMX >= 64) {
+        if (lidx < 32) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 32]);
         barrier(CLK_LOCAL_MEM_FENCE);
     }
 
     if (lidx < 16) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 16]);
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <  8) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  8]);
+    if (lidx < 8) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 8]);
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <  4) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  4]);
+    if (lidx < 4) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 4]);
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <  2) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  2]);
+    if (lidx < 2) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 2]);
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (lidx <  1) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx +  1]);
+    if (lidx < 1) s_ptr[lidx] = binOp(s_ptr[lidx], s_ptr[lidx + 1]);
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    if (cond && lidx == 0) {
-        oData[groupId_x] = s_ptr[0];
-    }
+    if (cond && lidx == 0) { oData[groupId_x] = s_ptr[0]; }
 }
diff --git a/src/backend/opencl/kernel/regions.cl b/src/backend/opencl/kernel/regions.cl
index 38bc1590b7..0a6235935e 100644
--- a/src/backend/opencl/kernel/regions.cl
+++ b/src/backend/opencl/kernel/regions.cl
@@ -9,19 +9,18 @@
 
 // The initial label kernel distinguishes between valid (nonzero)
 // pixels and "background" (zero) pixels.
-__kernel
-void initial_label(global    T * equiv_map,
-                   KParam        eInfo,
-                   global char * bin,
-                   KParam        bInfo)
-{
-    const int base_x = (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
-    const int base_y = (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
-
-    // If in bounds and a valid pixel, set the initial label.
-    #pragma unroll
+kernel void initial_label(global T* equiv_map, KParam eInfo,
+                            global char* bin_, KParam bInfo) {
+    global char* bin = bin_ + bInfo.offset;
+    const int base_x =
+        (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
+    const int base_y =
+        (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
+
+// If in bounds and a valid pixel, set the initial label.
+#pragma unroll
     for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < N_PER_THREAD; ++yb) {
             const int x = base_x + (xb * get_local_size(0));
             const int y = base_y + (yb * get_local_size(1));
@@ -33,38 +32,38 @@ void initial_label(global    T * equiv_map,
     }
 }
 
-__kernel
-void final_relabel(global       T    * equiv_map,
-                   KParam              eInfo,
-                   global       char * bin,
-                   KParam              bInfo,
-                   global const T    * d_tmp)
-{
-    const int base_x = (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
-    const int base_y = (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
-
-    // If in bounds and a valid pixel, set the initial label.
-    #pragma unroll
+kernel void final_relabel(global T* equiv_map, KParam eInfo,
+                            global char* bin_, KParam bInfo,
+                            global const T* d_tmp) {
+    global char* bin = bin_ + bInfo.offset;
+    const int base_x =
+        (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
+    const int base_y =
+        (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
+
+// If in bounds, map each valid pixel to its final label; background gets 0.
+#pragma unroll
     for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < N_PER_THREAD; ++yb) {
             const int x = base_x + (xb * get_local_size(0));
             const int y = base_y + (yb * get_local_size(1));
             const int n = y * bInfo.dims[0] + x;
             if (x < bInfo.dims[0] && y < bInfo.dims[1]) {
-                equiv_map[n] = (bin[n] > (char)0) ? d_tmp[(int)equiv_map[n]] : (T)0;
+                equiv_map[n] =
+                    (bin[n] > (char)0) ? d_tmp[(int)equiv_map[n]] : (T)0;
             }
         }
     }
 }
 
-#define MIN(A,B) ((A < B) ? (A) : (B))
-
 // When two labels are equivalent, choose the lower label, but
 // do not choose zero, which indicates invalid.
 //#if T == double
 static inline T relabel(const T a, const T b) {
-    return MIN((a + (LIMIT_MAX * (a == 0))),(b + (LIMIT_MAX * (b == 0))));
+    T aa = (a == 0) ? LIMIT_MAX : a;
+    T bb = (b == 0) ? LIMIT_MAX : b;
+    return min(aa, bb);
 }
 
 // The following kernel updates the equivalency map.  This kernel
@@ -76,61 +75,59 @@ static inline T relabel(const T a, const T b) {
 // NUM_WARPS = 8; // (Could compute this from block dim)
 // Number of elements to handle per thread in each dimension
 // N_PER_THREAD = 2; // 2x2 per thread = 4 total elems per thread
-__kernel
-void update_equiv(global T*   equiv_map,
-                  KParam      eInfo,
-                  global int* continue_flag)
-{
+kernel void update_equiv(global T* equiv_map, KParam eInfo,
+                           global int* continue_flag) {
     // Basic coordinates
-    const int base_x = (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
-    const int base_y = (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
+    const int base_x =
+        (get_group_id(0) * get_local_size(0) * N_PER_THREAD) + get_local_id(0);
+    const int base_y =
+        (get_group_id(1) * get_local_size(1) * N_PER_THREAD) + get_local_id(1);
 
     const int width  = eInfo.dims[0];
     const int height = eInfo.dims[1];
 
     // Per element write flags and label, initially 0
-    char      write[N_PER_THREAD * N_PER_THREAD];
-    T    best_label[N_PER_THREAD * N_PER_THREAD];
+    char write[N_PER_THREAD * N_PER_THREAD];
+    T best_label[N_PER_THREAD * N_PER_THREAD];
 
-    #pragma unroll
+#pragma unroll
     for (int i = 0; i < N_PER_THREAD * N_PER_THREAD; ++i) {
         write[i]      = (char)0;
         best_label[i] = (T)0;
     }
 
     // Cached tile of the equivalency map
-    __local T s_tile[N_PER_THREAD*BLOCK_DIM][(N_PER_THREAD*BLOCK_DIM)];
+    local T s_tile[N_PER_THREAD * BLOCK_DIM][(N_PER_THREAD * BLOCK_DIM)];
 
     // Space to track ballot funcs to track convergence
-    __local int s_changed[NUM_WARPS];
+    local int s_changed[NUM_WARPS];
 
     const int tn = (get_local_id(1) * get_local_size(0)) + get_local_id(0);
 
     const int warpSize = 32;
-    const int warpIdx = tn / warpSize;
+    const int warpIdx  = tn / warpSize;
     s_changed[warpIdx] = 0;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    __local int tid_changed[NUM_WARPS];
+    local int tid_changed[NUM_WARPS];
     tid_changed[warpIdx] = 0;
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    #pragma unroll
+#pragma unroll
     for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < N_PER_THREAD; ++yb) {
-
             // Indexing variables
-            const int x = base_x + (xb * get_local_size(0));
-            const int y = base_y + (yb * get_local_size(1));
-            const int tx = get_local_id(0) + (xb * get_local_size(0));
-            const int ty = get_local_id(1) + (yb * get_local_size(1));
+            const int x     = base_x + (xb * get_local_size(0));
+            const int y     = base_y + (yb * get_local_size(1));
+            const int tx    = get_local_id(0) + (xb * get_local_size(0));
+            const int ty    = get_local_id(1) + (yb * get_local_size(1));
             const int tid_i = xb * N_PER_THREAD + yb;
-            const int n = y * width + x;
+            const int n     = y * width + x;
 
             // Get the label for this pixel if we're  in bounds
-            const T orig_label = (x < width && y < height) ?
-                equiv_map[n] : (T)0;
+            const T orig_label =
+                (x < width && y < height) ? equiv_map[n] : (T)0;
             s_tile[ty][tx] = orig_label;
 
             // Find the lowest label of the nearest valid pixel
@@ -138,46 +135,45 @@ void update_equiv(global T*   equiv_map,
             best_label[tid_i] = orig_label;
 
             if (orig_label != (T)0) {
-
-                const int south_y = min(y, height-2) + 1;
+                const int south_y = min(y, height - 2) + 1;
                 const int north_y = max(y, 1) - 1;
-                const int east_x = min(x, width-2) + 1;
-                const int west_x = max(x, 1) - 1;
+                const int east_x  = min(x, width - 2) + 1;
+                const int west_x  = max(x, 1) - 1;
 
                 // Check bottom
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(south_y) * width + x]);
+                best_label[tid_i] =
+                    relabel(best_label[tid_i], equiv_map[(south_y)*width + x]);
 
                 // Check right neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[y * width + east_x]);
+                best_label[tid_i] =
+                    relabel(best_label[tid_i], equiv_map[y * width + east_x]);
 
                 // Check left neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[y * width + west_x]);
+                best_label[tid_i] =
+                    relabel(best_label[tid_i], equiv_map[y * width + west_x]);
 
                 // Check top neighbor
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(north_y) * width + x]);
+                best_label[tid_i] =
+                    relabel(best_label[tid_i], equiv_map[(north_y)*width + x]);
 
 #ifdef FULL_CONN
                 // Check NW corner
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(north_y) * width + west_x]);
+                best_label[tid_i] = relabel(
+                    best_label[tid_i], equiv_map[(north_y)*width + west_x]);
 
                 // Check NE corner
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(north_y) * width + east_x]);
+                best_label[tid_i] = relabel(
+                    best_label[tid_i], equiv_map[(north_y)*width + east_x]);
 
                 // Check SW corner
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(south_y) * width + west_x]);
+                best_label[tid_i] = relabel(
+                    best_label[tid_i], equiv_map[(south_y)*width + west_x]);
 
                 // Check SE corner
-                best_label[tid_i] = relabel(best_label[tid_i],
-                        equiv_map[(south_y) * width + east_x]);
-#endif // if connectivity == 8
-            } // if orig_label != 0
+                best_label[tid_i] = relabel(
+                    best_label[tid_i], equiv_map[(south_y)*width + east_x]);
+#endif         // if connectivity == 8
+            }  // if orig_label != 0
 
             // Process the equivalency list.
             T last_label = orig_label;
@@ -185,13 +181,13 @@ void update_equiv(global T*   equiv_map,
 
             while (best_label[tid_i] != (T)0 && new_label < last_label) {
                 last_label = new_label;
-                new_label = equiv_map[(int)new_label - 1];
+                new_label  = equiv_map[(int)new_label - 1];
             }
 
             if (orig_label != new_label) {
                 tid_changed[warpIdx] = 1;
-                s_tile[ty][tx] = new_label;
-                write[tid_i] = (char)1;
+                s_tile[ty][tx]       = new_label;
+                write[tid_i]         = (char)1;
             }
             best_label[tid_i] = new_label;
         }
@@ -200,70 +196,69 @@ void update_equiv(global T*   equiv_map,
 
     // Determine if any pixel changed
     unsigned int continue_iter = 0;
-    s_changed[warpIdx] = tid_changed[warpIdx];
+    s_changed[warpIdx]         = tid_changed[warpIdx];
     barrier(CLK_LOCAL_MEM_FENCE);
 
-    #pragma unroll
+#pragma unroll
     for (int i = 0; i < NUM_WARPS; i++)
         continue_iter = continue_iter || (s_changed[i] != 0);
 
     // Iterate until no pixel in the tile changes
     while (continue_iter != 0) {
-
         // Reset whether or not this thread's pixels have changed.
         tid_changed[warpIdx] = 0;
 
-        #pragma unroll
+#pragma unroll
         for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-            #pragma unroll
+#pragma unroll
             for (int yb = 0; yb < N_PER_THREAD; ++yb) {
-
                 // Indexing
-                const int tx = get_local_id(0) + (xb * get_local_size(0));
-                const int ty = get_local_id(1) + (yb * get_local_size(1));
+                const int tx    = get_local_id(0) + (xb * get_local_size(0));
+                const int ty    = get_local_id(1) + (yb * get_local_size(1));
                 const int tid_i = xb * N_PER_THREAD + yb;
 
                 T last_label = best_label[tid_i];
 
                 if (best_label[tid_i] != 0) {
-
-                    const int north_y   = max(ty, 1) - 1;
-                    const int south_y   = min(ty, N_PER_THREAD*BLOCK_DIM - 2) + 1;
-                    const int east_x    = min(tx, N_PER_THREAD*BLOCK_DIM - 2) + 1;
-                    const int west_x    = max(tx, 1) - 1;
+                    const int north_y = max(ty, 1) - 1;
+                    const int south_y =
+                        min(ty, N_PER_THREAD * BLOCK_DIM - 2) + 1;
+                    const int east_x =
+                        min(tx, N_PER_THREAD * BLOCK_DIM - 2) + 1;
+                    const int west_x = max(tx, 1) - 1;
 
                     // Check bottom
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[south_y][tx]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[south_y][tx]);
 
                     // Check right neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[ty][east_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[ty][east_x]);
 
                     // Check left neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[ty][west_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[ty][west_x]);
 
                     // Check top neighbor
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[north_y][tx]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[north_y][tx]);
 
 #ifdef FULL_CONN
                     // Check NW corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[north_y][west_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[north_y][west_x]);
 
                     // Check NE corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[north_y][east_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[north_y][east_x]);
 
                     // Check SW corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[south_y][west_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[south_y][west_x]);
 
                     // Check SE corner
-                    best_label[tid_i] = relabel(best_label[tid_i],
-                                                s_tile[south_y][east_x]);
+                    best_label[tid_i] =
+                        relabel(best_label[tid_i], s_tile[south_y][east_x]);
 #endif
                 }
                 // This thread's value changed  this iteration if the
@@ -280,16 +275,16 @@ void update_equiv(global T*   equiv_map,
         s_changed[warpIdx] = tid_changed[warpIdx];
         barrier(CLK_LOCAL_MEM_FENCE);
         continue_iter = 0;
-        #pragma unroll
+#pragma unroll
         for (int i = 0; i < NUM_WARPS; i++)
             continue_iter |= (s_changed[i] != 0);
 
         // If we have to continue iterating, update the tile of the
         // equiv map in shared memory
         if (continue_iter != 0) {
-            #pragma unroll
+#pragma unroll
             for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-                #pragma unroll
+#pragma unroll
                 for (int yb = 0; yb < N_PER_THREAD; ++yb) {
                     const int tx = get_local_id(0) + (xb * get_local_size(0));
                     const int ty = get_local_id(1) + (yb * get_local_size(1));
@@ -300,19 +295,19 @@ void update_equiv(global T*   equiv_map,
             }
             barrier(CLK_LOCAL_MEM_FENCE);
         }
-    } // while (continue_iter)
+    }  // while (continue_iter)
 
-    // Write out equiv_map
-    #pragma unroll
+// Write out equiv_map
+#pragma unroll
     for (int xb = 0; xb < N_PER_THREAD; ++xb) {
-        #pragma unroll
+#pragma unroll
         for (int yb = 0; yb < N_PER_THREAD; ++yb) {
-            const int x = base_x + (xb * get_local_size(0));
-            const int y = base_y + (yb * get_local_size(1));
-            const int n = y * width + x;
+            const int x     = base_x + (xb * get_local_size(0));
+            const int y     = base_y + (yb * get_local_size(1));
+            const int n     = y * width + x;
             const int tid_i = xb * N_PER_THREAD + yb;
             if (x < width && y < height && write[tid_i]) {
-                equiv_map[n]  = best_label[tid_i];
+                equiv_map[n]   = best_label[tid_i];
                 *continue_flag = 1;
             }
         }
diff --git a/src/backend/opencl/kernel/regions.hpp b/src/backend/opencl/kernel/regions.hpp
index ca8288f798..a082d165af 100644
--- a/src/backend/opencl/kernel/regions.hpp
+++ b/src/backend/opencl/kernel/regions.hpp
@@ -8,211 +8,189 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/regions.hpp>
-#include <program.hpp>
-#include <af/defines.h>
-#include <dispatch.hpp>
-#include <err_opencl.hpp>
+
+#include <Param.hpp>
+#include <common/deprecated.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/regions.hpp>
 #include <math.hpp>
-#include <stdio.h>
-#include <map>
 #include <memory.hpp>
+#include <traits.hpp>
+#include <af/defines.h>
 
-#include <boost/compute/container/vector.hpp>
+AF_DEPRECATED_WARNINGS_OFF
 #include <boost/compute/algorithm/adjacent_difference.hpp>
-#include <boost/compute/algorithm/sort.hpp>
-#include <boost/compute/iterator/counting_iterator.hpp>
 #include <boost/compute/algorithm/exclusive_scan.hpp>
+#include <boost/compute/algorithm/sort.hpp>
 #include <boost/compute/algorithm/transform.hpp>
-#include <boost/compute/lambda/placeholders.hpp>
+#include <boost/compute/container/vector.hpp>
+#include <boost/compute/iterator/counting_iterator.hpp>
 #include <boost/compute/lambda.hpp>
+#include <boost/compute/lambda/placeholders.hpp>
+AF_DEPRECATED_WARNINGS_ON
+
+#include <array>
+#include <string>
+#include <vector>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
 namespace compute = boost::compute;
 
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
-
-template<typename T, bool full_conn, int n_per_thread>
-void regions(Param out, Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> regionsProgs;
-        static std::map<int, Kernel *>     ilKernel;
-        static std::map<int, Kernel *>     frKernel;
-        static std::map<int, Kernel *>     ueKernel;
-
-        int device = getActiveDeviceId();
-
-        static const int block_dim = 16;
-        static const int num_warps = 8;
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                if (full_conn) {
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D BLOCK_DIM=" << block_dim
-                            << " -D NUM_WARPS=" << num_warps
-                            << " -D N_PER_THREAD=" << n_per_thread
-                            << " -D LIMIT_MAX=" << limit_max<T>()
-                            << " -D FULL_CONN";
-                }
-                else {
-                    options << " -D T=" << dtype_traits<T>::getName()
-                            << " -D BLOCK_DIM=" << block_dim
-                            << " -D NUM_WARPS=" << num_warps
-                            << " -D N_PER_THREAD=" << n_per_thread
-                            << " -D LIMIT_MAX=" << limit_max<T>();
-                }
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                Program prog;
-                buildProgram(prog, regions_cl, regions_cl_len, options.str());
-                regionsProgs[device] = new Program(prog);
-
-                ilKernel[device] = new Kernel(*regionsProgs[device], "initial_label");
-                frKernel[device] = new Kernel(*regionsProgs[device], "final_relabel");
-                ueKernel[device] = new Kernel(*regionsProgs[device], "update_equiv");
-            });
-
-        const NDRange local(THREADS_X, THREADS_Y);
-
-        const int blk_x = divup(in.info.dims[0], THREADS_X*2);
-        const int blk_y = divup(in.info.dims[1], THREADS_Y*2);
-
-        const NDRange global(blk_x * THREADS_X, blk_y * THREADS_Y);
-
-        auto ilOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam> (*ilKernel[device]);
-
-        ilOp(EnqueueArgs(getQueue(), global, local),
-             *out.data, out.info, *in.data, in.info);
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+std::array<Kernel, 3> getRegionsKernels(const bool full_conn,
+                                        const int n_per_thread) {
+    using std::string;
+    using std::vector;
+
+    constexpr int block_dim = 16;
+    constexpr int num_warps = 8;
+
+    ToNumStr<T> toNumStr;
+    vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(full_conn),
+        TemplateArg(n_per_thread),
+    };
+    vector<string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(BLOCK_DIM, block_dim),
+        DefineKeyValue(NUM_WARPS, num_warps),
+        DefineKeyValue(N_PER_THREAD, n_per_thread),
+        DefineKeyValue(LIMIT_MAX, toNumStr(maxval<T>())),
+    };
+    if (full_conn) { options.emplace_back(DefineKey(FULL_CONN)); }
+    options.emplace_back(getTypeBuildDefinition<T>());
+
+    return {
+        common::getKernel("initial_label", {{regions_cl_src}}, targs, options),
+        common::getKernel("final_relabel", {{regions_cl_src}}, targs, options),
+        common::getKernel("update_equiv", {{regions_cl_src}}, targs, options),
+    };
+}
 
-        CL_DEBUG_FINISH(getQueue());
+template<typename T>
+void regions(Param out, Param in, const bool full_conn,
+             const int n_per_thread) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
 
-        int h_continue = 1;
-        cl::Buffer *d_continue = bufferAlloc(sizeof(int));
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
 
-        while (h_continue) {
-            h_continue = 0;
-            getQueue().enqueueWriteBuffer(*d_continue, CL_TRUE, 0, sizeof(int), &h_continue);
+    auto kernels = getRegionsKernels<T>(full_conn, n_per_thread);
 
-            auto ueOp = make_kernel<Buffer, KParam,
-                                    Buffer> (*ueKernel[device]);
+    const NDRange local(THREADS_X, THREADS_Y);
 
-            ueOp(EnqueueArgs(getQueue(), global, local),
-                 *out.data, out.info, *d_continue);
-            CL_DEBUG_FINISH(getQueue());
+    const int blk_x = divup(in.info.dims[0], THREADS_X * 2);
+    const int blk_y = divup(in.info.dims[1], THREADS_Y * 2);
 
-            getQueue().enqueueReadBuffer(*d_continue, CL_TRUE, 0, sizeof(int), &h_continue);
-        }
+    const NDRange global(blk_x * THREADS_X, blk_y * THREADS_Y);
 
-        bufferFree(d_continue);
-
-        // Now, perform the final relabeling.  This converts the equivalency
-        // map from having unique labels based on the lowest pixel in the
-        // component to being sequentially numbered components starting at
-        // 1.
-        int size = in.info.dims[0] * in.info.dims[1];
-
-        compute::command_queue c_queue(getQueue()());
-
-        // Wrap raw device ptr
-        compute::context context(getContext()());
-        compute::vector<T> tmp(size, context);
-        clEnqueueCopyBuffer(getQueue()(), (*out.data)(), tmp.get_buffer().get(), 0, 0, size * sizeof(T), 0, NULL, NULL);
-
-        // Sort the copy
-        compute::sort(tmp.begin(), tmp.end(), c_queue);
-
-        // Take the max element, this is the number of label assignments to
-        // compute.
-        //int num_bins = tmp[size - 1] + 1;
-        T last_label;
-        clEnqueueReadBuffer(getQueue()(), tmp.get_buffer().get(), CL_TRUE, (size - 1) * sizeof(T), sizeof(T), &last_label, 0, NULL, NULL);
-        int num_bins = (int)last_label + 1;
-
-        Buffer labels(getContext(), CL_MEM_READ_WRITE, num_bins * sizeof(T));
-        compute::buffer c_labels(labels());
-        compute::buffer_iterator<T> labels_begin = compute::make_buffer_iterator<T>(c_labels, 0);
-        compute::buffer_iterator<T> labels_end   = compute::make_buffer_iterator<T>(c_labels, num_bins);
-
-        // Find the end of each section of values
-        compute::counting_iterator<T> search_begin(0);
-
-        int tmp_size = size;
-        BOOST_COMPUTE_CLOSURE(int, upper_bound_closure, (int v), (tmp, tmp_size),
-        {
-            int start = 0, n = tmp_size, i;
-            while(start < n)
-            {
-                i = (start + n) / 2;
-                if(v < tmp[i])
-                {
-                    n = i;
-                }
-                else
-                {
-                    start = i + 1;
-                }
-            }
+    auto ilOp = kernels[0];
+    auto ueOp = kernels[2];
+    auto frOp = kernels[1];
 
-            return start;
-        });
-
-        BOOST_COMPUTE_FUNCTION(int, clamp_to_one, (int i),
-        {
-            return (i >= 1) ? 1 : i;
-        });
-
-        compute::transform(search_begin, search_begin + num_bins,
-                           labels_begin,
-                           upper_bound_closure,
-                           c_queue);
-        compute::adjacent_difference(labels_begin, labels_end, labels_begin, c_queue);
-
-        // Perform the scan -- this can computes the correct labels for each
-        // component
-        compute::transform(labels_begin, labels_end,
-                           labels_begin,
-                           clamp_to_one,
-                           c_queue);
-        compute::exclusive_scan(labels_begin,
-                                labels_end,
-                                labels_begin,
-                                c_queue);
-
-        // Apply the correct labels to the equivalency map
-        auto frOp = make_kernel<Buffer, KParam,
-                                Buffer, KParam,
-                                Buffer> (*frKernel[device]);
-
-        //Buffer labels_buf(tmp.get_buffer().get());
-        frOp(EnqueueArgs(getQueue(), global, local),
-             *out.data, out.info, *in.data, in.info, labels);
+    ilOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+         in.info);
+    CL_DEBUG_FINISH(getQueue());
+
+    int h_continue     = 1;
+    Buffer* d_continue = bufferAlloc(sizeof(int));
+
+    while (h_continue) {
+        h_continue = 0;
+        getQueue().enqueueFillBuffer(*d_continue, h_continue, 0, sizeof(int));
+        ueOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *d_continue);
         CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
+        getQueue().enqueueReadBuffer(*d_continue, CL_TRUE, 0, sizeof(int),
+                                     &h_continue);
     }
-}
+    bufferFree(d_continue);
+
+    // Now, perform the final relabeling.  This converts the equivalency
+    // map from having unique labels based on the lowest pixel in the
+    // component to being sequentially numbered components starting at
+    // 1.
+    int size = in.info.dims[0] * in.info.dims[1];
+
+    compute::command_queue c_queue(getQueue()());
+
+    // Wrap raw device ptr
+    compute::context context(getContext()());
+    compute::vector<T> tmp(size, context);
+    clEnqueueCopyBuffer(getQueue()(), (*out.data)(), tmp.get_buffer().get(), 0,
+                        0, size * sizeof(T), 0, NULL, NULL);
+
+    // Sort the copy
+    compute::sort(tmp.begin(), tmp.end(), c_queue);
+
+    // Take the max element (the last one after sorting); adding
+    // one gives the number of label assignments to compute.
+    T last_label;
+    clEnqueueReadBuffer(getQueue()(), tmp.get_buffer().get(), CL_TRUE,
+                        (size - 1) * sizeof(T), sizeof(T), &last_label, 0, NULL,
+                        NULL);
+    const int num_bins = (int)last_label + 1;
+
+    // If the number of label assignments is at most two,
+    // then the input image is all background (0's), one big
+    // component (1's), or has only one component other than
+    // background (0's). In every case, no further
+    // post-processing of labels is required.
+    if (num_bins <= 2) return;
+
+    Buffer labels(getContext(), CL_MEM_READ_WRITE, num_bins * sizeof(T));
+    compute::buffer c_labels(labels());
+    compute::buffer_iterator<T> labels_begin =
+        compute::make_buffer_iterator<T>(c_labels, 0);
+    compute::buffer_iterator<T> labels_end =
+        compute::make_buffer_iterator<T>(c_labels, num_bins);
+
+    // Find the end of each section of values
+    compute::counting_iterator<T> search_begin(0);
+
+    int tmp_size = size;
+    BOOST_COMPUTE_CLOSURE(int, upper_bound_closure, (int v), (tmp, tmp_size), {
+        int start = 0, n = tmp_size, i;
+        while (start < n) {
+            i = (start + n) / 2;
+            if (v < tmp[i]) {
+                n = i;
+            } else {
+                start = i + 1;
+            }
+        }
+        return start;
+    });
+
+    BOOST_COMPUTE_FUNCTION(int, clamp_to_one, (int i),
+                           { return (i >= 1) ? 1 : i; });
 
-} //namespace kernel
+    compute::transform(search_begin, search_begin + num_bins, labels_begin,
+                       upper_bound_closure, c_queue);
+    compute::adjacent_difference(labels_begin, labels_end, labels_begin,
+                                 c_queue);
 
-} //namespace opencl
+    // Perform the scan -- this computes the correct labels for each
+    // component
+    compute::transform(labels_begin, labels_end, labels_begin, clamp_to_one,
+                       c_queue);
+
+    compute::exclusive_scan(labels_begin, labels_end, labels_begin, c_queue);
+
+    // Apply the correct labels to the equivalency map
+    // Buffer labels_buf(tmp.get_buffer().get());
+    frOp(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+         in.info, labels);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/reorder.cl b/src/backend/opencl/kernel/reorder.cl
index 8153341e94..07b99a123b 100644
--- a/src/backend/opencl/kernel/reorder.cl
+++ b/src/backend/opencl/kernel/reorder.cl
@@ -7,11 +7,10 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void reorder_kernel(__global T *out, __global const T *in, const KParam op, const KParam ip,
-                    const int d0, const int d1, const int d2, const int d3,
-                    const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void reorder_kernel(global T *out, __global const T *in,
+                             const KParam op, const KParam ip, const int d0,
+                             const int d1, const int d2, const int d3,
+                             const int blocksPerMatX, const int blocksPerMatY) {
     const int oz = get_group_id(0) / blocksPerMatX;
     const int ow = get_group_id(1) / blocksPerMatY;
 
@@ -21,10 +20,8 @@ void reorder_kernel(__global T *out, __global const T *in, const KParam op, cons
     const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
     const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    if(xx >= op.dims[0] ||
-       yy >= op.dims[1] ||
-       oz >= op.dims[2] ||
-       ow >= op.dims[3])
+    if (xx >= op.dims[0] || yy >= op.dims[1] || oz >= op.dims[2] ||
+        ow >= op.dims[3])
         return;
 
     const int incy = blocksPerMatY * get_local_size(1);
@@ -32,21 +29,21 @@ void reorder_kernel(__global T *out, __global const T *in, const KParam op, cons
 
     const int o_off   = ow * op.strides[3] + oz * op.strides[2];
     const int rdims[] = {d0, d1, d2, d3};
-          int ods[]   = {xx, yy, oz, ow};
-          int ids[4]  = {0};
+    int ods[]         = {xx, yy, oz, ow};
+    int ids[4]        = {0};
 
     ids[rdims[3]] = ow;
     ids[rdims[2]] = oz;
 
-    for(int oy = yy; oy < op.dims[1]; oy += incy) {
+    for (int oy = yy; oy < op.dims[1]; oy += incy) {
         ids[rdims[1]] = oy;
-        for(int ox = xx; ox < op.dims[0]; ox += incx) {
+        for (int ox = xx; ox < op.dims[0]; ox += incx) {
             ids[rdims[0]] = ox;
 
             const int oIdx = o_off + oy * op.strides[1] + ox;
 
             const int iIdx = ids[3] * ip.strides[3] + ids[2] * ip.strides[2] +
-                                  ids[1] * ip.strides[1] + ids[0];
+                             ids[1] * ip.strides[1] + ids[0];
 
             out[oIdx] = in[ip.offset + iIdx];
         }
diff --git a/src/backend/opencl/kernel/reorder.hpp b/src/backend/opencl/kernel/reorder.hpp
index 0ec436302e..469e8b77c3 100644
--- a/src/backend/opencl/kernel/reorder.hpp
+++ b/src/backend/opencl/kernel/reorder.hpp
@@ -8,79 +8,50 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/reorder.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/reorder.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 512;
-        static const int TILEY = 32;
-
-        template<typename T>
-        void reorder(Param out, const Param in, const dim_t *rdims)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   reorderProgs;
-                static std::map<int, Kernel *> reorderKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, reorder_cl, reorder_cl_len, options.str());
-                    reorderProgs[device] = new Program(prog);
-                    reorderKernels[device] = new Kernel(*reorderProgs[device], "reorder_kernel");
-                });
-
-                auto reorderOp = make_kernel<Buffer, const Buffer, const KParam, const KParam,
-                                          const int, const int, const int, const int,
-                                          const int, const int> (*reorderKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(out.info.dims[0], TILEX);
-                int blocksPerMatY = divup(out.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
-
-                reorderOp(EnqueueArgs(getQueue(), global, local),
-                          *out.data, *in.data, out.info, in.info,
-                          rdims[0], rdims[1], rdims[2], rdims[3],
-                          blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void reorder(Param out, const Param in, const dim_t* rdims) {
+    constexpr int TX    = 32;
+    constexpr int TY    = 8;
+    constexpr int TILEX = 512;
+    constexpr int TILEY = 32;
+
+    std::array<TemplateArg, 1> targs = {
+        TemplateTypename<T>(),
+    };
+    std::array<std::string, 2> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        getTypeBuildDefinition<T>()};
+
+    auto reorderOp =
+        common::getKernel("reorder_kernel", {{reorder_cl_src}}, targs, options);
+
+    cl::NDRange local(TX, TY, 1);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    cl::NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
+                       local[1] * blocksPerMatY * out.info.dims[3], 1);
+
+    reorderOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, *in.data,
+              out.info, in.info, static_cast<int>(rdims[0]),
+              static_cast<int>(rdims[1]), static_cast<int>(rdims[2]),
+              static_cast<int>(rdims[3]), blocksPerMatX, blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/resize.cl b/src/backend/opencl/kernel/resize.cl
index db886bda93..ab2d7a1d3f 100644
--- a/src/backend/opencl/kernel/resize.cl
+++ b/src/backend/opencl/kernel/resize.cl
@@ -9,10 +9,11 @@
 
 #if CPLX
 #define set(a, b) a = b
-#define set_scalar(a, b) do {                   \
-        a.x = b;                                \
-        a.y = 0;                                \
-    } while(0)
+#define set_scalar(a, b) \
+    do {                 \
+        a.x = b;         \
+        a.y = 0;         \
+    } while (0)
 
 #else
 
@@ -23,48 +24,45 @@
 
 #define NEAREST resize_n_
 #define BILINEAR resize_b_
+#define LOWER resize_l_
 
 ////////////////////////////////////////////////////////////////////////////////////
 // nearest-neighbor resampling
-void resize_n_(__global T* d_out, const KParam out,
-               __global const T* d_in, const KParam in,
-               const int blockIdx_x, const int blockIdx_y,
-               const float xf, const float yf)
-{
+void resize_n_(global T* d_out, const KParam out, __global const T* d_in,
+               const KParam in, const int blockIdx_x, const int blockIdx_y,
+               const float xf, const float yf) {
     int const ox = get_local_id(0) + blockIdx_x * get_local_size(0);
     int const oy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    //int ix = convert_int_rtp(ox * xf);
-    //int iy = convert_int_rtp(oy * yf);
+    // int ix = convert_int_rtp(ox * xf);
+    // int iy = convert_int_rtp(oy * yf);
     int ix = round(ox * xf);
     int iy = round(oy * yf);
 
     if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
-    if (ix >=  in.dims[0]) { ix = in.dims[0] - 1; }
-    if (iy >=  in.dims[1]) { iy = in.dims[1] - 1; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
 
     d_out[ox + oy * out.strides[1]] = d_in[ix + iy * in.strides[1]];
 }
 
 ////////////////////////////////////////////////////////////////////////////////////
 // bilinear resampling
-void resize_b_(__global T* d_out, const KParam out,
-               __global const T* d_in, const KParam in,
-               const int blockIdx_x, const int blockIdx_y,
-               const float xf_, const float yf_)
-{
+void resize_b_(global T* d_out, const KParam out, __global const T* d_in,
+               const KParam in, const int blockIdx_x, const int blockIdx_y,
+               const float xf_, const float yf_) {
     int const ox = get_local_id(0) + blockIdx_x * get_local_size(0);
     int const oy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
     float xf = ox * xf_;
     float yf = oy * yf_;
 
-    int ix   = floor(xf);
-    int iy   = floor(yf);
+    int ix = floor(xf);
+    int iy = floor(yf);
 
     if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
-    if (ix >=  in.dims[0]) { ix = in.dims[0] - 1; }
-    if (iy >=  in.dims[1]) { iy = in.dims[1] - 1; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
 
     float b = xf - ix;
     float a = yf - iy;
@@ -72,33 +70,48 @@ void resize_b_(__global T* d_out, const KParam out,
     const int ix2 = (ix + 1) < in.dims[0] ? (ix + 1) : ix;
     const int iy2 = (iy + 1) < in.dims[1] ? (iy + 1) : iy;
 
-    const VT p1 = d_in[ix  + in.strides[1] * iy ];
-    const VT p2 = d_in[ix  + in.strides[1] * iy2];
-    const VT p3 = d_in[ix2 + in.strides[1] * iy ];
+    const VT p1 = d_in[ix + in.strides[1] * iy];
+    const VT p2 = d_in[ix + in.strides[1] * iy2];
+    const VT p3 = d_in[ix2 + in.strides[1] * iy];
     const VT p4 = d_in[ix2 + in.strides[1] * iy2];
 
     d_out[ox + oy * out.strides[1]] =
-             (((1.0f-a) * (1.0f-b)) * p1) +
-             (((a)      * (1.0f-b)) * p2) +
-             (((1.0f-a) * (b)     ) * p3) +
-             (((a)      * (b)     ) * p4);
+        (((1.0f - a) * (1.0f - b)) * p1) + (((a) * (1.0f - b)) * p2) +
+        (((1.0f - a) * (b)) * p3) + (((a) * (b)) * p4);
+}
+
+////////////////////////////////////////////////////////////////////////////////////
+// lower resampling
+void resize_l_(global T* d_out, const KParam out, __global const T* d_in,
+               const KParam in, const int blockIdx_x, const int blockIdx_y,
+               const float xf, const float yf) {
+    int const ox = get_local_id(0) + blockIdx_x * get_local_size(0);
+    int const oy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
+    int ix = (ox * xf);
+    int iy = (oy * yf);
+
+    if (ox >= out.dims[0] || oy >= out.dims[1]) { return; }
+    if (ix >= in.dims[0]) { ix = in.dims[0] - 1; }
+    if (iy >= in.dims[1]) { iy = in.dims[1] - 1; }
+
+    d_out[ox + oy * out.strides[1]] = d_in[ix + iy * in.strides[1]];
 }
 
 ////////////////////////////////////////////////////////////////////////////////////
 // Wrapper Kernel
-__kernel
-void resize_kernel(__global T *d_out, const KParam out,
-                   __global const T *d_in, const KParam in,
-                   const int b0, const int b1, const float xf, const float yf)
-{
+kernel void resize_kernel(global T* d_out, const KParam out,
+                            global const T* d_in, const KParam in,
+                            const int b0, const int b1, const float xf,
+                            const float yf) {
     int bIdx = get_group_id(0) / b0;
     int bIdy = get_group_id(1) / b1;
     // batch adjustment
-    int i_off = bIdy *  in.strides[3] + bIdx *  in.strides[2] + in.offset;
-    int o_off = bIdy * out.strides[3] + bIdx * out.strides[2];
-    int blockIdx_x =  get_group_id(0) - bIdx * b0;
-    int blockIdx_y =  get_group_id(1) - bIdy * b1;
+    int i_off      = bIdy * in.strides[3] + bIdx * in.strides[2] + in.offset;
+    int o_off      = bIdy * out.strides[3] + bIdx * out.strides[2];
+    int blockIdx_x = get_group_id(0) - bIdx * b0;
+    int blockIdx_y = get_group_id(1) - bIdy * b1;
 
-    INTERP(d_out + o_off, out, d_in + i_off, in, blockIdx_x, blockIdx_y, xf, yf);
+    INTERP(d_out + o_off, out, d_in + i_off, in, blockIdx_x, blockIdx_y, xf,
+           yf);
 }
diff --git a/src/backend/opencl/kernel/resize.hpp b/src/backend/opencl/kernel/resize.hpp
index 5c485ca4c3..f201427ddf 100644
--- a/src/backend/opencl/kernel/resize.hpp
+++ b/src/backend/opencl/kernel/resize.hpp
@@ -8,110 +8,84 @@
  ********************************************************/
 
 #pragma once
+
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
 #include <kernel_headers/resize.hpp>
-#include <program.hpp>
 #include <traits.hpp>
+
 #include <string>
-#include <map>
-#include <mutex>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+template<typename T>
+void resize(Param out, const Param in, const af_interp_type method) {
+    using BT = typename dtype_traits<T>::base_type;
+
+    constexpr int RESIZE_TX = 16;
+    constexpr int RESIZE_TY = 16;
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(method),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(VT, dtype_traits<vtype_t<T>>::getName()),
+        DefineKeyValue(WT, dtype_traits<wtype_t<BT>>::getName()),
+        DefineKeyValue(CPLX, (IsComplex ? 1 : 0)), getTypeBuildDefinition<T>()};
+    if (IsComplex) {
+        options.emplace_back(DefineKeyValue(TB, dtype_traits<BT>::getName()));
+    }
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        static const int RESIZE_TX = 16;
-        static const int RESIZE_TY = 16;
-
-        using std::conditional;
-        using std::is_same;
-        template<typename T>
-        using wtype_t = typename conditional<is_same<T, double>::value, double, float>::type;
-
-        template<typename T>
-        using vtype_t = typename conditional<is_complex<T>::value,
-                                             T, wtype_t<T>
-                                            >::type;
-
-        template<typename T, af_interp_type method>
-        void resize(Param out, const Param in)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   resizeProgs;
-                static std::map<int, Kernel *> resizeKernels;
-
-                int device = getActiveDeviceId();
-
-                typedef typename dtype_traits<T>::base_type BT;
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="        << dtype_traits<T>::getName();
-                    options << " -D VT="       << dtype_traits<vtype_t<T>>::getName();
-                    options << " -D WT="       << dtype_traits<wtype_t<BT>>::getName();
-
-                    switch(method) {
-                        case AF_INTERP_NEAREST:  options <<" -D INTERP=NEAREST" ;  break;
-                        case AF_INTERP_BILINEAR: options <<" -D INTERP=BILINEAR"; break;
-                        default: break;
-                    }
-
-                    if((af_dtype) dtype_traits<T>::af_type == c32 ||
-                       (af_dtype) dtype_traits<T>::af_type == c64) {
-                        options << " -D CPLX=1";
-                        options << " -D TB=" << dtype_traits<BT>::getName();
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    Program prog;
-                    buildProgram(prog, resize_cl, resize_cl_len, options.str());
-                    resizeProgs[device] = new Program(prog);
-                    resizeKernels[device] = new Kernel(*resizeProgs[device], "resize_kernel");
-                });
-
-                auto resizeOp = make_kernel<Buffer, const KParam,
-                                      const Buffer, const KParam,
-                                      const int, const int, const float, const float>
-                                      (*resizeKernels[device]);
-
-                NDRange local(RESIZE_TX, RESIZE_TY, 1);
-
-                int blocksPerMatX = divup(out.info.dims[0], local[0]);
-                int blocksPerMatY = divup(out.info.dims[1], local[1]);
-                NDRange global(local[0] * blocksPerMatX * in.info.dims[2],
-                               local[1] * blocksPerMatY * in.info.dims[3],
-                               1);
-
-                double xd = (double)in.info.dims[0] / (double)out.info.dims[0];
-                double yd = (double)in.info.dims[1] / (double)out.info.dims[1];
-
-                float xf = (float)xd, yf = (float)yd;
-
-                resizeOp(EnqueueArgs(getQueue(), global, local),
-                         *out.data, out.info, *in.data, in.info, blocksPerMatX, blocksPerMatY, xf, yf);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
+    switch (method) {
+        case AF_INTERP_NEAREST:
+            options.emplace_back(DefineKeyValue(INTERP, "NEAREST"));
+            break;
+        case AF_INTERP_BILINEAR:
+            options.emplace_back(DefineKeyValue(INTERP, "BILINEAR"));
+            break;
+        case AF_INTERP_LOWER:
+            options.emplace_back(DefineKeyValue(INTERP, "LOWER"));
+            break;
+        default: break;
     }
+
+    auto resizeOp =
+        common::getKernel("resize_kernel", {{resize_cl_src}}, targs, options);
+
+    cl::NDRange local(RESIZE_TX, RESIZE_TY, 1);
+
+    int blocksPerMatX = divup(out.info.dims[0], local[0]);
+    int blocksPerMatY = divup(out.info.dims[1], local[1]);
+    cl::NDRange global(local[0] * blocksPerMatX * in.info.dims[2],
+                       local[1] * blocksPerMatY * in.info.dims[3], 1);
+
+    double xd = (double)in.info.dims[0] / (double)out.info.dims[0];
+    double yd = (double)in.info.dims[1] / (double)out.info.dims[1];
+
+    float xf = (float)xd, yf = (float)yd;
+
+    resizeOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *in.data, in.info, blocksPerMatX, blocksPerMatY, xf, yf);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/rotate.cl b/src/backend/opencl/kernel/rotate.cl
index 0e1bbe475c..da530e66d3 100644
--- a/src/backend/opencl/kernel/rotate.cl
+++ b/src/backend/opencl/kernel/rotate.cl
@@ -9,34 +9,60 @@
 
 #define NEAREST transform_n
 #define BILINEAR transform_b
+#define LOWER transform_l
 
 typedef struct {
     float tmat[6];
 } tmat_t;
 
-__kernel
-void rotate_kernel(__global T *d_out, const KParam out,
-                   __global const T *d_in, const KParam in,
-                   const tmat_t t, const int nimages, const int batches,
-                   const int blocksXPerImage, const int blocksYPerImage)
-{
+kernel void rotateKernel(global T *d_out, const KParam out,
+                         global const T *d_in, const KParam in,
+                         const tmat_t t, const int nimages, const int batches,
+                         const int blocksXPerImage, const int blocksYPerImage,
+                         int method) {
     // Compute which image set
-    const int setId = get_group_id(0) / blocksXPerImage;
+    const int setId      = get_group_id(0) / blocksXPerImage;
     const int blockIdx_x = get_group_id(0) - setId * blocksXPerImage;
 
-    const int batch = get_group_id(1) / blocksYPerImage;
+    const int batch      = get_group_id(1) / blocksYPerImage;
     const int blockIdx_y = get_group_id(1) - batch * blocksYPerImage;
 
     // Get thread indices
-    const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
-    const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
+    const int xido = get_local_id(0) + blockIdx_x * get_local_size(0);
+    const int yido = get_local_id(1) + blockIdx_y * get_local_size(1);
 
     const int limages = min((int)out.dims[2] - setId * nimages, nimages);
 
-    if(xx >= out.dims[0] || yy >= out.dims[1])
-        return;
+    if (xido >= out.dims[0] || yido >= out.dims[1]) return;
 
-    __global T *optr = d_out + setId * nimages * out.strides[2] + batch * out.strides[3];
-    __global const T *iptr = d_in + in.offset + setId * nimages * in.strides[2] + batch * in.strides[3];
-    INTERP(optr, out, iptr, in, t.tmat, xx, yy, limages);
+    InterpPosTy xidi = xido * t.tmat[0] + yido * t.tmat[1] + t.tmat[2];
+    InterpPosTy yidi = xido * t.tmat[3] + yido * t.tmat[4] + t.tmat[5];
+
+    int outoff =
+        out.offset + setId * nimages * out.strides[2] + batch * out.strides[3];
+    int inoff =
+        in.offset + setId * nimages * in.strides[2] + batch * in.strides[3];
+
+    const int loco = outoff + (yido * out.strides[1] + xido);
+
+    InterpInTy zero = ZERO;
+    if (INTERP_ORDER > 1) {
+        // Special conditions to deal with boundaries for bilinear and bicubic.
+        // FIXME: Ideally this condition should be removed or be present for
+        // all methods, but tests expect different behavior for bilinear and
+        // nearest.
+        if (xidi < (InterpPosTy)-0.0001 || yidi < (InterpPosTy)-0.0001 ||
+            in.dims[0] <= xidi || in.dims[1] <= yidi) {
+            for (int i = 0; i < nimages; i++) {
+                d_out[loco + i * out.strides[2]] = zero;
+            }
+            return;
+        }
+    }
+
+    // FIXME: Nearest and lower do not do clamping, but other methods do.
+    // Make it consistent.
+    const bool doclamp = INTERP_ORDER != 1;
+    interp2(d_out, out, loco, d_in, in, inoff, xidi, yidi, method, limages,
+            doclamp, 2);
 }
diff --git a/src/backend/opencl/kernel/rotate.hpp b/src/backend/opencl/kernel/rotate.hpp
index b24d972688..a3d3f41cba 100644
--- a/src/backend/opencl/kernel/rotate.hpp
+++ b/src/backend/opencl/kernel/rotate.hpp
@@ -8,148 +8,128 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/transform_interp.hpp>
+
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/interp.hpp>
+#include <kernel_headers/interp.hpp>
 #include <kernel_headers/rotate.hpp>
-#include <program.hpp>
+#include <math.hpp>
 #include <traits.hpp>
+
 #include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+typedef struct {
+    float tmat[6];
+} tmat_t;
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+template<typename T>
+void rotate(Param out, const Param in, const float theta, af_interp_type method,
+            int order) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+    using BT = typename dtype_traits<T>::base_type;
+
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+    // Used for batching images
+    constexpr int TI = 4;
+    constexpr bool isComplex =
+        static_cast<af_dtype>(dtype_traits<T>::af_type) == c32 ||
+        static_cast<af_dtype>(dtype_traits<T>::af_type) == c64;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(order),
+    };
+    ToNumStr<T> toNumStr;
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+        DefineKeyValue(InterpInTy, dtype_traits<T>::getName()),
+        DefineKeyValue(InterpValTy, dtype_traits<vtype_t<T>>::getName()),
+        DefineKeyValue(InterpPosTy, dtype_traits<wtype_t<BT>>::getName()),
+        DefineKeyValue(XDIM, 0),
+        DefineKeyValue(YDIM, 1),
+        DefineKeyValue(INTERP_ORDER, order),
+        DefineKeyValue(IS_CPLX, (isComplex ? 1 : 0)),
+    };
+    if (isComplex) {
+        compileOpts.emplace_back(
+            DefineKeyValue(TB, dtype_traits<BT>::getName()));
+    }
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+    addInterpEnumOptions(compileOpts);
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
+    auto rotate =
+        common::getKernel("rotateKernel", {{interp_cl_src, rotate_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    const float c = cos(-theta), s = sin(-theta);
+    float tx, ty;
     {
-        static const int TX = 16;
-        static const int TY = 16;
-        // Used for batching images
-        static const int TI = 4;
-
-        typedef struct {
-            float tmat[6];
-        } tmat_t;
-
-        using std::conditional;
-        using std::is_same;
-        template<typename T>
-        using wtype_t = typename conditional<is_same<T, double>::value, double, float>::type;
-
-        template<typename T>
-        using vtype_t = typename conditional<is_complex<T>::value,
-                                             T, wtype_t<T>
-                                            >::type;
-
-        template<typename T, af_interp_type method>
-        void rotate(Param out, const Param in, const float theta)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   rotateProgs;
-                static std::map<int, Kernel *> rotateKernels;
-
-                int device = getActiveDeviceId();
-                typedef typename dtype_traits<T>::base_type BT;
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="        << dtype_traits<T>::getName();
-                    options << " -D VT="       << dtype_traits<vtype_t<T>>::getName();
-                    options << " -D WT="       << dtype_traits<wtype_t<BT>>::getName();
-
-                    if((af_dtype) dtype_traits<T>::af_type == c32 ||
-                       (af_dtype) dtype_traits<T>::af_type == c64) {
-                        options << " -D CPLX=1";
-                        options << " -D TB=" << dtype_traits<BT>::getName();
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    switch(method) {
-                        case AF_INTERP_NEAREST: options << " -D INTERP=NEAREST";
-                            break;
-                        case AF_INTERP_BILINEAR:  options << " -D INTERP=BILINEAR";
-                            break;
-                        default:
-                            break;
-                    }
-
-                    const char *ker_strs[] = {transform_interp_cl, rotate_cl};
-                    const int   ker_lens[] = {transform_interp_cl_len, rotate_cl_len};
-                    Program prog;
-                    buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                    rotateProgs[device] = new Program(prog);
-                    rotateKernels[device] = new Kernel(*rotateProgs[device], "rotate_kernel");
-                });
-
-                auto rotateOp = make_kernel<Buffer, const KParam, const Buffer, const KParam, const tmat_t,
-                                            const int, const int, const int, const int>
-                                           (*rotateKernels[device]);
-
-                const float c = cos(-theta), s = sin(-theta);
-                float tx, ty;
-                {
-                    const float nx = 0.5 * (in.info.dims[0] - 1);
-                    const float ny = 0.5 * (in.info.dims[1] - 1);
-                    const float mx = 0.5 * (out.info.dims[0] - 1);
-                    const float my = 0.5 * (out.info.dims[1] - 1);
-                    const float sx = (mx * c + my *-s);
-                    const float sy = (mx * s + my * c);
-                    tx = -(sx - nx);
-                    ty = -(sy - ny);
-                }
-
-                // Rounding error. Anything more than 3 decimal points wont make a diff
-                tmat_t t;
-                t.tmat[0] = round( c * 1000) / 1000.0f;
-                t.tmat[1] = round(-s * 1000) / 1000.0f;
-                t.tmat[2] = round(tx * 1000) / 1000.0f;
-                t.tmat[3] = round( s * 1000) / 1000.0f;
-                t.tmat[4] = round( c * 1000) / 1000.0f;
-                t.tmat[5] = round(ty * 1000) / 1000.0f;
-
-
-                NDRange local(TX, TY, 1);
-
-                int nimages  = in.info.dims[2];
-                int nbatches = in.info.dims[3];
-                int global_x = local[0] * divup(out.info.dims[0], local[0]);
-                int global_y = local[1] * divup(out.info.dims[1], local[1]);
-                const int blocksXPerImage = global_x / local[0];
-                const int blocksYPerImage = global_y / local[1];
-
-                if(nimages > TI) {
-                    int tile_images = divup(nimages, TI);
-                    nimages = TI;
-                    global_x = global_x * tile_images;
-                }
-                global_y *= nbatches;
-
-                NDRange global(global_x, global_y, 1);
-
-                rotateOp(EnqueueArgs(getQueue(), global, local),
-                         *out.data, out.info, *in.data, in.info, t, nimages, nbatches,
-                         blocksXPerImage, blocksYPerImage);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
+        const float nx = 0.5 * (in.info.dims[0] - 1);
+        const float ny = 0.5 * (in.info.dims[1] - 1);
+        const float mx = 0.5 * (out.info.dims[0] - 1);
+        const float my = 0.5 * (out.info.dims[1] - 1);
+        const float sx = (mx * c + my * -s);
+        const float sy = (mx * s + my * c);
+        tx             = -(sx - nx);
+        ty             = -(sy - ny);
+    }
+
+    // Rounding error: more than 3 decimal places won't make a difference
+    tmat_t t;
+    t.tmat[0] = round(c * 1000) / 1000.0f;
+    t.tmat[1] = round(-s * 1000) / 1000.0f;
+    t.tmat[2] = round(tx * 1000) / 1000.0f;
+    t.tmat[3] = round(s * 1000) / 1000.0f;
+    t.tmat[4] = round(c * 1000) / 1000.0f;
+    t.tmat[5] = round(ty * 1000) / 1000.0f;
+
+    NDRange local(TX, TY, 1);
+
+    int nimages               = in.info.dims[2];
+    int nbatches              = in.info.dims[3];
+    int global_x              = local[0] * divup(out.info.dims[0], local[0]);
+    int global_y              = local[1] * divup(out.info.dims[1], local[1]);
+    const int blocksXPerImage = global_x / local[0];
+    const int blocksYPerImage = global_y / local[1];
+
+    if (nimages > TI) {
+        int tile_images = divup(nimages, TI);
+        nimages         = TI;
+        global_x        = global_x * tile_images;
     }
+    global_y *= nbatches;
+
+    NDRange global(global_x, global_y, 1);
+
+    rotate(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+           *in.data, in.info, t, nimages, nbatches, blocksXPerImage,
+           blocksYPerImage, (int)method);
+
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_by_key/CMakeLists.txt b/src/backend/opencl/kernel/scan_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..316e946a31
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_by_key/CMakeLists.txt
@@ -0,0 +1,88 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/scan_by_key_impl.cpp" FILESTRINGS)
+find_package(OpenCL REQUIRED)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_BINARY_OPS")
+        string(REPLACE "// SBK_BINARY_OPS:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_BINARY_OPS ${TEMP})
+    endif()
+endforeach()
+
+add_library(opencl_scan_by_key INTERFACE)
+
+add_dependencies(opencl_scan_by_key ${cl_kernel_targets} cl2hpp)
+foreach(SBK_BINARY_OP ${SBK_BINARY_OPS})
+    add_library(opencl_scan_by_key_${SBK_BINARY_OP} OBJECT
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_by_key/scan_by_key_impl.cpp"
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_first_by_key_impl.hpp"
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/scan_dim_by_key_impl.hpp")
+
+    add_dependencies(opencl_scan_by_key_${SBK_BINARY_OP}
+                        ${cl_kernel_targets} OpenCL::cl2hpp Boost::boost)
+
+    target_include_directories(opencl_scan_by_key_${SBK_BINARY_OP}
+      PRIVATE
+        .
+        ..
+        magma
+        ../../api/c
+        ../common
+        ../../../include
+        ${CMAKE_CURRENT_BINARY_DIR}
+        $<TARGET_PROPERTY:af_spdlog,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:OpenCL::OpenCL,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:OpenCL::cl2hpp,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:Boost::boost,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:nonstd::span-lite,INTERFACE_SYSTEM_INCLUDE_DIRECTORIES>
+        ${ArrayFire_BINARY_DIR}/include
+      )
+    if(TARGET Forge::forge)
+      target_include_directories(opencl_scan_by_key_${SBK_BINARY_OP}
+        SYSTEM INTERFACE
+        $<TARGET_PROPERTY:Forge::forge,INCLUDE_DIRECTORIES>
+      )
+    else()
+      target_include_directories(opencl_scan_by_key_${SBK_BINARY_OP}
+        SYSTEM INTERFACE
+        ${${forge_prefix}_SOURCE_DIR}/include
+        ${${forge_prefix}_BINARY_DIR}/include
+      )
+    endif()
+    if(TARGET glad::glad)
+      target_include_directories(opencl_scan_by_key_${SBK_BINARY_OP}
+        SYSTEM INTERFACE
+        $<TARGET_PROPERTY:glad::glad,INTERFACE_INCLUDE_DIRECTORIES>
+      )
+    else()
+      target_include_directories(opencl_scan_by_key_${SBK_BINARY_OP}
+        SYSTEM INTERFACE
+        $<TARGET_PROPERTY:af_glad,INTERFACE_INCLUDE_DIRECTORIES>
+      )
+    endif()
+
+    set_target_properties(opencl_scan_by_key_${SBK_BINARY_OP}
+      PROPERTIES
+        CXX_STANDARD 17
+        CXX_EXTENSIONS False
+        CXX_VISIBILITY_PRESET hidden
+        POSITION_INDEPENDENT_CODE ON
+        FOLDER "Generated Targets")
+
+    arrayfire_set_default_cxx_flags(opencl_scan_by_key_${SBK_BINARY_OP})
+    target_compile_definitions(opencl_scan_by_key_${SBK_BINARY_OP}
+      PRIVATE
+        ${opencl_compile_definitions}
+        $<TARGET_PROPERTY:Boost::boost,INTERFACE_COMPILE_DEFINITIONS>
+        $<TARGET_PROPERTY:af_spdlog,INTERFACE_COMPILE_DEFINITIONS>
+        $<TARGET_PROPERTY:nonstd::span-lite,INTERFACE_COMPILE_DEFINITIONS>
+        TYPE=${SBK_BINARY_OP} AFDLL)
+    target_sources(opencl_scan_by_key
+      INTERFACE $<TARGET_OBJECTS:opencl_scan_by_key_${SBK_BINARY_OP}>)
+endforeach(SBK_BINARY_OP ${SBK_BINARY_OPS})
diff --git a/src/backend/opencl/kernel/scan_by_key/scan_by_key_impl.cpp b/src/backend/opencl/kernel/scan_by_key/scan_by_key_impl.cpp
new file mode 100644
index 0000000000..46cac6723d
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_by_key/scan_by_key_impl.cpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <backend.hpp>
+#include <kernel/scan_dim_by_key_impl.hpp>
+#include <kernel/scan_first_by_key_impl.hpp>
+
+// This file instantiates scan_dim_by_key as separate object files from CMake
+// The line below is read by CMake to determine the instantiations
+// SBK_BINARY_OPS:af_add_t af_mul_t af_max_t af_min_t
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+INSTANTIATE_SCAN_FIRST_BY_KEY_OP(TYPE)
+INSTANTIATE_SCAN_DIM_BY_KEY_OP(TYPE)
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_dim.cl b/src/backend/opencl/kernel/scan_dim.cl
index 543418911e..f6e86081e4 100644
--- a/src/backend/opencl/kernel/scan_dim.cl
+++ b/src/backend/opencl/kernel/scan_dim.cl
@@ -7,148 +7,144 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void scan_dim_kernel(__global To *oData, KParam oInfo,
-                     __global To *tData, KParam tInfo,
-                     const __global Ti *iData, KParam iInfo,
-                     uint groups_x,
-                     uint groups_y,
-                     uint groups_dim,
-                     uint lim)
-{
+kernel void scanDim(global To *oData, KParam oInfo, global To *tData,
+                    KParam tInfo, const global Ti *iData, KParam iInfo,
+                    uint groups_x, uint groups_y, uint groups_dim, uint lim) {
     const int lidx = get_local_id(0);
     const int lidy = get_local_id(1);
     const int lid  = lidy * THREADS_X + lidx;
 
-    const int zid = get_group_id(0) / groups_x;
-    const int wid = get_group_id(1) / groups_y;
-    const int groupId_x = get_group_id(0) - (groups_x) * zid;
-    const int groupId_y = get_group_id(1) - (groups_y) * wid;
-    const int xid = groupId_x * get_local_size(0) + lidx;
-    const int yid = groupId_y;
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) + lidx;
+    const int yid       = groupId_y;
 
     int ids[4] = {xid, yid, zid, wid};
 
     // There is only one element per group for out
     // There are DIMY elements per group for in
-    // Hence increment ids[dim] just after offseting out and before offsetting in
-    tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] + ids[1] * tInfo.strides[1] + ids[0];
-    const int groupId_dim = ids[dim];
-
-    ids[dim] = ids[dim] * DIMY * lim + lidy;
-    oData  += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] + ids[1] * oInfo.strides[1] + ids[0];
-    iData  += ids[3] *  iInfo.strides[3] + ids[2] *  iInfo.strides[2] + ids[1] *  iInfo.strides[1] + ids[0];
-    int id_dim = ids[dim];
-    const int out_dim = oInfo.dims[dim];
-
-    bool is_valid =
-        (ids[0] < oInfo.dims[0]) &&
-        (ids[1] < oInfo.dims[1]) &&
-        (ids[2] < oInfo.dims[2]) &&
-        (ids[3] < oInfo.dims[3]);
-
-    const int ostride_dim = oInfo.strides[dim];
-    const int istride_dim =  iInfo.strides[dim];
-
-    __local To l_val0[THREADS_X * DIMY];
-    __local To l_val1[THREADS_X * DIMY];
-    __local To *l_val = l_val0;
-    __local To l_tmp[THREADS_X];
-
-    bool flip = 0;
-    const To init_val  = init;
-    To val = init_val;
+    // Hence increment ids[kDim] just after offsetting out and before
+    // offsetting in
+    tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] +
+             ids[1] * tInfo.strides[1] + ids[0];
+    const int groupId_dim = ids[kDim];
+
+    ids[kDim] = ids[kDim] * DIMY * lim + lidy;
+    oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+             ids[1] * oInfo.strides[1] + ids[0];
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0];
+    iData += iInfo.offset;
+
+    int id_dim        = ids[kDim];
+    const int out_dim = oInfo.dims[kDim];
+
+    bool is_valid = (ids[0] < oInfo.dims[0]) && (ids[1] < oInfo.dims[1]) &&
+                    (ids[2] < oInfo.dims[2]) && (ids[3] < oInfo.dims[3]);
+
+    const int ostride_dim = oInfo.strides[kDim];
+    const int istride_dim = iInfo.strides[kDim];
+
+    local To l_val0[THREADS_X * DIMY];
+    local To l_val1[THREADS_X * DIMY];
+    local To *l_val = l_val0;
+    local To l_tmp[THREADS_X];
+
+    bool flip         = 0;
+    const To init_val = init;
+    To val            = init_val;
     const bool isLast = (lidy == (DIMY - 1));
 
     for (int k = 0; k < lim; k++) {
-
         if (isLast) l_tmp[lidx] = val;
 
-        bool cond = (is_valid) && (id_dim < out_dim);
-        val = cond ? transform(*iData) : init_val;
+        bool cond  = (is_valid) && (id_dim < out_dim);
+        val        = cond ? transform(*iData) : init_val;
         l_val[lid] = val;
         barrier(CLK_LOCAL_MEM_FENCE);
 
-        int start = 0;
         for (int off = 1; off < DIMY; off *= 2) {
-
             if (lidy >= off) val = binOp(val, l_val[lid - off * THREADS_X]);
 
-            flip = 1 - flip;
-            l_val = flip ? l_val1 : l_val0;
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
             l_val[lid] = val;
 
             barrier(CLK_LOCAL_MEM_FENCE);
         }
 
         val = binOp(val, l_tmp[lidx]);
-        if (cond) *oData = val;
-        barrier(CLK_LOCAL_MEM_FENCE);
+
+        if (INCLUSIVE_SCAN != 0) {
+            if (cond) { *oData = val; }
+        } else if (is_valid) {
+            if (id_dim == (out_dim - 1)) {
+                *(oData - (id_dim * ostride_dim)) = init_val;
+            } else if (id_dim < (out_dim - 1)) {
+                *(oData + ostride_dim) = val;
+            }
+        }
 
         id_dim += DIMY;
         iData += DIMY * istride_dim;
         oData += DIMY * ostride_dim;
+        barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (!isFinalPass &&
-        is_valid &&
-        (groupId_dim < tInfo.dims[dim]) &&
+    if (!IS_FINAL_PASS && is_valid && (groupId_dim < tInfo.dims[kDim]) &&
         isLast) {
         *tData = val;
     }
 }
 
-__kernel
-void bcast_dim_kernel(__global To *oData, KParam oInfo,
-                      const __global To *tData, KParam tInfo,
-                      uint groups_x,
-                      uint groups_y,
-                      uint groups_dim,
-                      uint lim)
-{
+kernel void bcastDim(global To *oData, KParam oInfo, const global To *tData,
+                     KParam tInfo, uint groups_x, uint groups_y,
+                     uint groups_dim, uint lim) {
     const int lidx = get_local_id(0);
     const int lidy = get_local_id(1);
     const int lid  = lidy * THREADS_X + lidx;
 
-    const int zid = get_group_id(0) / groups_x;
-    const int wid = get_group_id(1) / groups_y;
-    const int groupId_x = get_group_id(0) - (groups_x) * zid;
-    const int groupId_y = get_group_id(1) - (groups_y) * wid;
-    const int xid = groupId_x * get_local_size(0) + lidx;
-    const int yid = groupId_y;
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) + lidx;
+    const int yid       = groupId_y;
 
-    int ids[4] = {xid, yid, zid, wid};
-    const int groupId_dim = ids[dim];
+    int ids[4]            = {xid, yid, zid, wid};
+    const int groupId_dim = ids[kDim];
 
     if (groupId_dim != 0) {
-
         // There is only one element per group for out
         // There are DIMY elements per group for in
-        // Hence increment ids[dim] just after offseting out and before offsetting in
-        tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] + ids[1] * tInfo.strides[1] + ids[0];
+        // Hence increment ids[kDim] just after offsetting out and before
+        // offsetting in
+        tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] +
+                 ids[1] * tInfo.strides[1] + ids[0];
 
-        ids[dim] = ids[dim] * DIMY * lim + lidy;
-        oData  += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] + ids[1] * oInfo.strides[1] + ids[0];
+        ids[kDim] = ids[kDim] * DIMY * lim + lidy;
+        oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+                 ids[1] * oInfo.strides[1] + ids[0];
 
-        const int id_dim = ids[dim];
-        const int out_dim = oInfo.dims[dim];
+        // Shift broadcast one step to the right for exclusive scan (#2366)
+        int offset = INCLUSIVE_SCAN ? 0 : oInfo.strides[kDim];
+        oData += offset;
 
-        bool is_valid =
-            (ids[0] < oInfo.dims[0]) &&
-            (ids[1] < oInfo.dims[1]) &&
-            (ids[2] < oInfo.dims[2]) &&
-            (ids[3] < oInfo.dims[3]);
+        const int id_dim  = ids[kDim];
+        const int out_dim = oInfo.dims[kDim];
 
-        if (is_valid) {
+        bool is_valid = (ids[0] < oInfo.dims[0]) && (ids[1] < oInfo.dims[1]) &&
+                        (ids[2] < oInfo.dims[2]) && (ids[3] < oInfo.dims[3]);
 
-            To accum = *(tData - tInfo.strides[dim]);
+        if (is_valid) {
+            To accum = *(tData - tInfo.strides[kDim]);
 
-            const int ostride_dim = oInfo.strides[dim];
+            const int ostride_dim = oInfo.strides[kDim];
 
-            for (int k = 0, id = id_dim;
-                 is_valid && k < lim && (id < out_dim);
+            for (int k = 0, id = id_dim; is_valid && k < lim && (id < out_dim);
                  k++, id += DIMY) {
-
                 *oData = binOp(*oData, accum);
                 oData += DIMY * ostride_dim;
             }
diff --git a/src/backend/opencl/kernel/scan_dim.hpp b/src/backend/opencl/kernel/scan_dim.hpp
index 6bb8cdd256..f9820f47cf 100644
--- a/src/backend/opencl/kernel/scan_dim.hpp
+++ b/src/backend/opencl/kernel/scan_dim.hpp
@@ -8,254 +8,153 @@
  ********************************************************/
 
 #pragma once
-#include <string>
-#include <mutex>
-#include <map>
-#include <kernel_headers/scan_dim.hpp>
-#include <kernel_headers/ops.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <type_util.hpp>
-#include "names.hpp"
-#include "config.hpp"
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-namespace kernel
-{
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass, uint threads_y>
-    static Kernel* get_scan_dim_kernels(int kerIdx)
-    {
-        try {
-            static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-            static std::map<int, Program*> scanProgs;
-            static std::map<int, Kernel*>  scanKerns;
-            static std::map<int, Kernel*>  bcastKerns;
-
-            int device= getActiveDeviceId();
-
-            std::call_once(compileFlags[device], [device] () {
-
-                    Binary<To, op> scan;
-                    ToNum<To> toNum;
-
-                    std::ostringstream options;
-                    options << " -D To=" << dtype_traits<To>::getName()
-                            << " -D Ti=" << dtype_traits<Ti>::getName()
-                            << " -D T=To"
-                            << " -D dim=" << dim
-                            << " -D DIMY=" << threads_y
-                            << " -D THREADS_X=" << THREADS_X
-                            << " -D init=" << toNum(scan.init())
-                            << " -D " << binOpName<op>()
-                            << " -D CPLX=" << af::iscplx<Ti>()
-                            << " -D isFinalPass=" << (int)(isFinalPass);
-                    if (std::is_same<Ti, double>::value ||
-                        std::is_same<Ti, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    const char *ker_strs[] = {ops_cl, scan_dim_cl};
-                    const int   ker_lens[] = {ops_cl_len, scan_dim_cl_len};
-                    cl::Program prog;
-                    buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                    scanProgs[device] = new Program(prog);
-
-                    scanKerns[device] = new Kernel(*scanProgs[device],  "scan_dim_kernel");
-                    bcastKerns[device] = new Kernel(*scanProgs[device],  "bcast_dim_kernel");
-
-                });
-
-            return (kerIdx == 0) ? scanKerns[device] : bcastKerns[device];
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
-        }
-    }
-
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass, uint threads_y>
-    static void scan_dim_launcher(Param &out,
-                                  Param &tmp,
-                                  const Param &in,
-                                  const uint groups_all[4])
-    {
-        try {
-            Kernel* ker = get_scan_dim_kernels<Ti, To, op, dim, isFinalPass, threads_y>(0);
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/scan_dim.hpp>
+#include <traits.hpp>
 
-            NDRange local(THREADS_X, threads_y);
-            NDRange global(groups_all[0] * groups_all[2] * local[0],
-                           groups_all[1] * groups_all[3] * local[1]);
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename Ti, typename To, af_op_t op>
+static opencl::Kernel getScanDimKernel(const std::string key, int dim,
+                                       bool isFinalPass, uint threads_y,
+                                       bool inclusiveScan) {
+    using std::string;
+    using std::vector;
+
+    ToNumStr<To> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(),
+        TemplateTypename<To>(),
+        TemplateArg(dim),
+        TemplateArg(isFinalPass),
+        TemplateArg(op),
+        TemplateArg(threads_y),
+        TemplateArg(inclusiveScan),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(kDim, dim),
+        DefineKeyValue(DIMY, threads_y),
+        DefineValue(THREADS_X),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        DefineKeyValue(IS_FINAL_PASS, (isFinalPass ? 1 : 0)),
+        DefineKeyValue(INCLUSIVE_SCAN, inclusiveScan),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    return common::getKernel(key, {{ops_cl_src, scan_dim_cl_src}}, tmpltArgs,
+                             compileOpts);
+}
 
-            uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
+template<typename Ti, typename To, af_op_t op>
+static void scanDimLauncher(Param out, Param tmp, const Param in, int dim,
+                            bool isFinalPass, uint threads_y,
+                            const uint groups_all[4], bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
 
-            auto scanOp = make_kernel<Buffer, KParam,
-                                      Buffer, KParam,
-                                      Buffer, KParam,
-                                      uint, uint,
-                                      uint, uint>(*ker);
+    auto scan = getScanDimKernel<Ti, To, op>("scanDim", dim, isFinalPass,
+                                             threads_y, inclusiveScan);
 
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
 
-            scanOp(EnqueueArgs(getQueue(), global, local),
-                   *out.data, out.info, *tmp.data, tmp.info, *in.data, in.info,
-                   groups_all[0], groups_all[1], groups_all[dim], lim);
+    uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
 
-            CL_DEBUG_FINISH(getQueue());
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
-        }
-    }
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *tmp.data,
+         tmp.info, *in.data, in.info, groups_all[0], groups_all[1],
+         groups_all[dim], lim);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass, uint threads_y>
-    static void bcast_dim_launcher(Param &out,
-                                   Param &tmp,
-                                   const uint groups_all[4])
-    {
-        try {
-            Kernel* ker = get_scan_dim_kernels<Ti, To, op, dim, isFinalPass, threads_y>(1);
+template<typename Ti, typename To, af_op_t op>
+static void bcastDimLauncher(Param out, Param tmp, int dim, bool isFinalPass,
+                             uint threads_y, const uint groups_all[4],
+                             const bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
 
-            NDRange local(THREADS_X, threads_y);
-            NDRange global(groups_all[0] * groups_all[2] * local[0],
-                           groups_all[1] * groups_all[3] * local[1]);
+    auto bcast = getScanDimKernel<Ti, To, op>("bcastDim", dim, isFinalPass,
+                                              threads_y, inclusiveScan);
 
-            uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
 
-            auto bcastOp = make_kernel<Buffer, KParam,
-                                       Buffer, KParam,
-                                       uint, uint,
-                                       uint, uint>(*ker);
+    uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
 
-            bcastOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *tmp.data, tmp.info,
-                    groups_all[0], groups_all[1], groups_all[dim], lim);
+    bcast(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+          *tmp.data, tmp.info, groups_all[0], groups_all[1], groups_all[dim],
+          lim);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-            CL_DEBUG_FINISH(getQueue());
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
+template<typename Ti, typename To, af_op_t op>
+static void scanDim(Param out, const Param in, const int dim,
+                    const bool inclusiveScan = true) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(out.info.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    uint groups_all[] = {divup((uint)out.info.dims[0], threads_x),
+                         (uint)out.info.dims[1], (uint)out.info.dims[2],
+                         (uint)out.info.dims[3]};
+
+    groups_all[dim] = divup(out.info.dims[dim], threads_y * REPEAT);
+
+    if (groups_all[dim] == 1) {
+        scanDimLauncher<Ti, To, op>(out, out, in, dim, true, threads_y,
+                                    groups_all, inclusiveScan);
+    } else {
+        Param tmp = out;
+
+        tmp.info.dims[dim]  = groups_all[dim];
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++) {
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
         }
-    }
-
 
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass>
-    static void scan_dim_fn(Param &out,
-                            Param &tmp,
-                            const Param &in,
-                            const uint threads_y,
-                            const uint groups_all[4])
-    {
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
+        // Temporary buffer; released by bufferFree(tmp.data) below
+        tmp.data = bufferAlloc(tmp_elements * sizeof(To));
 
-        switch (threads_y) {
-        case 8:
-            (scan_dim_launcher<Ti, To, op, dim, isFinalPass, 8>)(
-                out, tmp, in, groups_all); break;
-        case 4:
-            (scan_dim_launcher<Ti, To, op, dim, isFinalPass, 4>)(
-                out, tmp, in, groups_all); break;
-        case 2:
-            (scan_dim_launcher<Ti, To, op, dim, isFinalPass, 2>)(
-                out, tmp, in, groups_all); break;
-        case 1:
-            (scan_dim_launcher<Ti, To, op, dim, isFinalPass, 1>)(
-                out, tmp, in, groups_all); break;
-        }
-
-    }
+        scanDimLauncher<Ti, To, op>(out, tmp, in, dim, false, threads_y,
+                                    groups_all, inclusiveScan);
 
-    template<typename Ti, typename To, af_op_t op, int dim, bool isFinalPass>
-    static void bcast_dim_fn(Param &out,
-                             Param &tmp,
-                             const uint threads_y,
-                             const uint groups_all[4])
-    {
+        int gdim        = groups_all[dim];
+        groups_all[dim] = 1;
 
-        switch (threads_y) {
-        case 8:
-            (bcast_dim_launcher<Ti, To, op, dim, isFinalPass, 8>)(
-                out, tmp, groups_all); break;
-        case 4:
-            (bcast_dim_launcher<Ti, To, op, dim, isFinalPass, 4>)(
-                out, tmp, groups_all); break;
-        case 2:
-            (bcast_dim_launcher<Ti, To, op, dim, isFinalPass, 2>)(
-                out, tmp, groups_all); break;
-        case 1:
-            (bcast_dim_launcher<Ti, To, op, dim, isFinalPass, 1>)(
-                out, tmp, groups_all); break;
+        if (op == af_notzero_t) {
+            scanDimLauncher<To, To, af_add_t>(tmp, tmp, tmp, dim, true,
+                                              threads_y, groups_all, true);
+        } else {
+            scanDimLauncher<To, To, op>(tmp, tmp, tmp, dim, true, threads_y,
+                                        groups_all, true);
         }
-    }
-
-    template<typename Ti, typename To, af_op_t op, int dim>
-    static void scan_dim(Param &out, const Param &in)
-    {
-        try {
-            uint threads_y = std::min(THREADS_Y, nextpow2(out.info.dims[dim]));
-            uint threads_x = THREADS_X;
-
-            uint groups_all[] = {divup((uint)out.info.dims[0], threads_x),
-                                 (uint)out.info.dims[1],
-                                 (uint)out.info.dims[2],
-                                 (uint)out.info.dims[3]};
 
-            groups_all[dim] = divup(out.info.dims[dim], threads_y * REPEAT);
-
-            if (groups_all[dim] == 1) {
-
-                scan_dim_fn<Ti, To, op, dim, true>(out, out, in,
-                                                   threads_y,
-                                                   groups_all);
-            } else {
-
-                Param tmp = out;
-
-                tmp.info.dims[dim] = groups_all[dim];
-                tmp.info.strides[0] = 1;
-                for (int k = 1; k < 4; k++) {
-                    tmp.info.strides[k] = tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
-                }
-
-                int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
-                // FIXME: Do I need to free this ?
-                tmp.data = bufferAlloc(tmp_elements * sizeof(To));
-
-                scan_dim_fn<Ti, To, op, dim, false>(out, tmp, in,
-                                                    threads_y,
-                                                    groups_all);
-
-                int gdim = groups_all[dim];
-                groups_all[dim] = 1;
-
-                if (op == af_notzero_t) {
-                    scan_dim_fn<To, To, af_add_t, dim, true>(tmp, tmp, tmp,
-                                                             threads_y,
-                                                             groups_all);
-                } else {
-                    scan_dim_fn<To, To,       op, dim, true>(tmp, tmp, tmp,
-                                                             threads_y,
-                                                             groups_all);
-                }
-
-                groups_all[dim] = gdim;
-                bcast_dim_fn<To, To, op, dim, true>(out, tmp,
-                                                    threads_y,
-                                                    groups_all);
-                bufferFree(tmp.data);
-            }
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-            throw;
-        }
+        groups_all[dim] = gdim;
+        bcastDimLauncher<To, To, op>(out, tmp, dim, true, threads_y, groups_all,
+                                     inclusiveScan);
+        bufferFree(tmp.data);
     }
 }
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_dim_by_key.cl b/src/backend/opencl/kernel/scan_dim_by_key.cl
new file mode 100644
index 0000000000..eacd7f9283
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_dim_by_key.cl
@@ -0,0 +1,347 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+char calculate_head_flags_dim(const global Tk *kptr, int id, int stride) {
+    return (id == 0) ? 1 : ((*kptr) != (*(kptr - stride)));
+}
+
+kernel void scanDimByKeyNonfinal(
+    global To *oData, KParam oInfo, global To *tData, KParam tInfo,
+    global char *tfData, KParam tfInfo, global int *tiData, KParam tiInfo,
+    const global Ti *iData, KParam iInfo, const global Tk *kData, KParam kInfo,
+    uint groups_x, uint groups_y, uint groups_dim, uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+    const int lid  = lidy * THREADS_X + lidx;
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) + lidx;
+    const int yid       = groupId_y;
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    // There is only one element per group for out
+    // There are DIMY elements per group for in
+    // Hence increment ids[kDim] just after offsetting out and before offsetting
+    // in
+    tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] +
+             ids[1] * tInfo.strides[1] + ids[0];
+    tfData += ids[3] * tfInfo.strides[3] + ids[2] * tfInfo.strides[2] +
+              ids[1] * tfInfo.strides[1] + ids[0];
+    tiData += ids[3] * tiInfo.strides[3] + ids[2] * tiInfo.strides[2] +
+              ids[1] * tiInfo.strides[1] + ids[0];
+    const int groupId_dim = ids[kDim];
+
+    ids[kDim] = ids[kDim] * DIMY * lim + lidy;
+    oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+             ids[1] * oInfo.strides[1] + ids[0];
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+    kData += ids[3] * kInfo.strides[3] + ids[2] * kInfo.strides[2] +
+             ids[1] * kInfo.strides[1] + ids[0] + kInfo.offset;
+
+    int id_dim        = ids[kDim];
+    const int out_dim = oInfo.dims[kDim];
+
+    bool is_valid = (ids[0] < oInfo.dims[0]) && (ids[1] < oInfo.dims[1]) &&
+                    (ids[2] < oInfo.dims[2]) && (ids[3] < oInfo.dims[3]);
+
+    const int ostride_dim = oInfo.strides[kDim];
+    const int istride_dim = iInfo.strides[kDim];
+
+    local To l_val0[THREADS_X * DIMY];
+    local To l_val1[THREADS_X * DIMY];
+    local char l_flg0[THREADS_X * DIMY];
+    local char l_flg1[THREADS_X * DIMY];
+    local To *l_val   = l_val0;
+    local char *l_flg = l_flg0;
+    local To l_tmp[THREADS_X];
+    local char l_ftmp[THREADS_X];
+    local int boundaryid[THREADS_X];
+
+    bool flip         = 0;
+    const To init_val = init;
+    To val            = init_val;
+    const bool isLast = (lidy == (DIMY - 1));
+
+    if (isLast) {
+        l_tmp[lidx]      = val;
+        l_ftmp[lidx]     = 0;
+        boundaryid[lidx] = -1;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        bool cond = (is_valid) && (id_dim < out_dim);
+
+        if (cond) {
+            flag = calculate_head_flags_dim(kData, id_dim, kInfo.strides[kDim]);
+        } else {
+            flag = 0;
+        }
+
+        // Load val from global in
+        if (INCLUSIVE_SCAN) {
+            if (!cond) {
+                val = init_val;
+            } else {
+                val = transform(*iData);
+            }
+        } else {
+            if ((id_dim == 0) || (!cond) || flag) {
+                val = init_val;
+            } else {
+                val = transform(*(iData - iInfo.strides[kDim]));
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((lidy == 0) && (flag == 0)) {
+            val  = binOp(val, l_tmp[lidx]);
+            flag = l_ftmp[lidx];
+        }
+
+        // Write to shared memory
+        l_val[lid] = val;
+        l_flg[lid] = flag;
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Segmented Scan
+        for (int off = 1; off < DIMY; off *= 2) {
+            if (lidy >= off) {
+                val =
+                    l_flg[lid] ? val : binOp(val, l_val[lid - off * THREADS_X]);
+                flag = l_flg[lid] | l_flg[lid - off * THREADS_X];
+            }
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
+            l_flg      = flip ? l_flg1 : l_flg0;
+            l_val[lid] = val;
+            l_flg[lid] = flag;
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        // Identify segment boundary
+        if (lidy == 0) {
+            if ((l_ftmp[lidx] == 0) && (l_flg[lid] == 1)) {
+                boundaryid[lidx] = id_dim;
+            }
+        } else {
+            if ((l_flg[lid - THREADS_X] == 0) && (l_flg[lid] == 1)) {
+                boundaryid[lidx] = id_dim;
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        if (cond) *oData = val;
+        if (isLast) {
+            l_tmp[lidx]  = val;
+            l_ftmp[lidx] = flag;
+        }
+        id_dim += DIMY;
+        kData += DIMY * kInfo.strides[kDim];
+        iData += DIMY * istride_dim;
+        oData += DIMY * ostride_dim;
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (is_valid && (groupId_dim < tInfo.dims[kDim]) && isLast) {
+        *tData       = val;
+        *tfData      = flag;
+        int boundary = boundaryid[lidx];
+        *tiData      = (boundary == -1) ? id_dim : boundary;
+    }
+}
+
+kernel void scanDimByKeyFinal(global To *oData, KParam oInfo,
+                              const global Ti *iData, KParam iInfo,
+                              const global Tk *kData, KParam kInfo,
+                              uint groups_x, uint groups_y, uint groups_dim,
+                              uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+    const int lid  = lidy * THREADS_X + lidx;
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) + lidx;
+    const int yid       = groupId_y;
+
+    int ids[4] = {xid, yid, zid, wid};
+
+    // There is only one element per group for out
+    // There are DIMY elements per group for in
+    // Hence increment ids[kDim] just after offsetting out and before offsetting
+    // in
+    const int groupId_dim = ids[kDim];
+
+    ids[kDim] = ids[kDim] * DIMY * lim + lidy;
+    oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+             ids[1] * oInfo.strides[1] + ids[0];
+    iData += ids[3] * iInfo.strides[3] + ids[2] * iInfo.strides[2] +
+             ids[1] * iInfo.strides[1] + ids[0] + iInfo.offset;
+    kData += ids[3] * kInfo.strides[3] + ids[2] * kInfo.strides[2] +
+             ids[1] * kInfo.strides[1] + ids[0] + kInfo.offset;
+
+    int id_dim        = ids[kDim];
+    const int out_dim = oInfo.dims[kDim];
+
+    bool is_valid = (ids[0] < oInfo.dims[0]) && (ids[1] < oInfo.dims[1]) &&
+                    (ids[2] < oInfo.dims[2]) && (ids[3] < oInfo.dims[3]);
+
+    const int ostride_dim = oInfo.strides[kDim];
+    const int istride_dim = iInfo.strides[kDim];
+
+    local To l_val0[THREADS_X * DIMY];
+    local To l_val1[THREADS_X * DIMY];
+    local char l_flg0[THREADS_X * DIMY];
+    local char l_flg1[THREADS_X * DIMY];
+    local To *l_val   = l_val0;
+    local char *l_flg = l_flg0;
+    local To l_tmp[THREADS_X];
+    local char l_ftmp[THREADS_X];
+
+    bool flip         = 0;
+    const To init_val = init;
+    To val            = init_val;
+    const bool isLast = (lidy == (DIMY - 1));
+
+    if (isLast) {
+        l_tmp[lidx]  = val;
+        l_ftmp[lidx] = 0;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        bool cond = (is_valid) && (id_dim < out_dim);
+
+        if (calculateFlags) {
+            if (cond) {
+                flag = calculate_head_flags_dim(kData, id_dim,
+                                                kInfo.strides[kDim]);
+            } else {
+                flag = 0;
+            }
+        } else {
+            flag = *kData;
+        }
+
+        // Load val from global in
+        if (INCLUSIVE_SCAN) {
+            if (!cond) {
+                val = init_val;
+            } else {
+                val = transform(*iData);
+            }
+        } else {
+            if ((id_dim == 0) || (!cond) || flag) {
+                val = init_val;
+            } else {
+                val = transform(*(iData - iInfo.strides[kDim]));
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((lidy == 0) && (flag == 0)) {
+            val  = binOp(val, l_tmp[lidx]);
+            flag = l_ftmp[lidx];
+        }
+
+        // Write to shared memory
+        l_val[lid] = val;
+        l_flg[lid] = flag;
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Segmented Scan
+        for (int off = 1; off < DIMY; off *= 2) {
+            if (lidy >= off) {
+                val =
+                    l_flg[lid] ? val : binOp(val, l_val[lid - off * THREADS_X]);
+                flag = l_flg[lid] | l_flg[lid - off * THREADS_X];
+            }
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
+            l_flg      = flip ? l_flg1 : l_flg0;
+            l_val[lid] = val;
+            l_flg[lid] = flag;
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        if (cond) *oData = val;
+        if (isLast) {
+            l_tmp[lidx]  = val;
+            l_ftmp[lidx] = flag;
+        }
+        id_dim += DIMY;
+        kData += DIMY * kInfo.strides[kDim];
+        iData += DIMY * istride_dim;
+        oData += DIMY * ostride_dim;
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+}
+
+kernel void bcastDimByKey(global To *oData, KParam oInfo,
+                          const global To *tData, KParam tInfo,
+                          const global int *tiData, KParam tiInfo,
+                          uint groups_x, uint groups_y, uint groups_dim,
+                          uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+    const int lid  = lidy * THREADS_X + lidx;
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) + lidx;
+    const int yid       = groupId_y;
+
+    int ids[4]            = {xid, yid, zid, wid};
+    const int groupId_dim = ids[kDim];
+
+    if (groupId_dim != 0) {
+        // There is only one element per group for out
+        // There are DIMY elements per group for in
+        // Hence increment ids[kDim] just after offsetting out and before
+        // offsetting in
+        tiData += ids[3] * tiInfo.strides[3] + ids[2] * tiInfo.strides[2] +
+                  ids[1] * tiInfo.strides[1] + ids[0];
+        tData += ids[3] * tInfo.strides[3] + ids[2] * tInfo.strides[2] +
+                 ids[1] * tInfo.strides[1] + ids[0];
+
+        ids[kDim] = ids[kDim] * DIMY * lim + lidy;
+        oData += ids[3] * oInfo.strides[3] + ids[2] * oInfo.strides[2] +
+                 ids[1] * oInfo.strides[1] + ids[0];
+
+        const int id_dim = ids[kDim];
+
+        bool is_valid = (ids[0] < oInfo.dims[0]) && (ids[1] < oInfo.dims[1]) &&
+                        (ids[2] < oInfo.dims[2]) && (ids[3] < oInfo.dims[3]);
+
+        if (is_valid) {
+            int boundary = *tiData;
+            To accum     = *(tData - tInfo.strides[kDim]);
+
+            const int ostride_dim = oInfo.strides[kDim];
+
+            for (int k = 0, id = id_dim; is_valid && k < lim && (id < boundary);
+                 k++, id += DIMY) {
+                *oData = binOp(*oData, accum);
+                oData += DIMY * ostride_dim;
+            }
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/scan_dim_by_key.hpp b/src/backend/opencl/kernel/scan_dim_by_key.hpp
new file mode 100644
index 0000000000..f698c4176d
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_dim_by_key.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scanDimByKey(Param out, const Param in, const Param key, int dim,
+                  const bool inclusiveScan);
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_dim_by_key_impl.hpp b/src/backend/opencl/kernel/scan_dim_by_key_impl.hpp
new file mode 100644
index 0000000000..c4cc7959ff
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_dim_by_key_impl.hpp
@@ -0,0 +1,213 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/scan_dim_by_key.hpp>
+#include <memory.hpp>
+#include <optypes.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static opencl::Kernel getScanDimKernel(const std::string key, int dim,
+                                       bool calculateFlags, uint threads_y,
+                                       bool inclusiveScan) {
+    using std::string;
+    using std::vector;
+
+    ToNumStr<To> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(),      TemplateTypename<To>(),
+        TemplateTypename<Tk>(),      TemplateArg(dim),
+        TemplateArg(calculateFlags), TemplateArg(op),
+        TemplateArg(threads_y),      TemplateArg(inclusiveScan),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(kDim, dim),
+        DefineKeyValue(DIMY, threads_y),
+        DefineValue(THREADS_X),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        DefineKeyValue(calculateFlags, (calculateFlags ? 1 : 0)),
+        DefineKeyValue(INCLUSIVE_SCAN, inclusiveScan),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    return common::getKernel(key, {{ops_cl_src, scan_dim_by_key_cl_src}},
+                             tmpltArgs, compileOpts);
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scanDimNonfinalLauncher(Param out, Param tmp, Param tmpflg,
+                                    Param tmpid, const Param in,
+                                    const Param key, int dim, uint threads_y,
+                                    const uint groups_all[4],
+                                    bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto scan = getScanDimKernel<Ti, Tk, To, op>(
+        "scanDimByKeyNonfinal", dim, false, threads_y, inclusiveScan);
+
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
+
+    uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
+
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *tmp.data,
+         tmp.info, *tmpflg.data, tmpflg.info, *tmpid.data, tmpid.info, *in.data,
+         in.info, *key.data, key.info, groups_all[0], groups_all[1],
+         groups_all[dim], lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scanDimFinalLauncher(Param out, const Param in, const Param key,
+                                 int dim, const bool calculateFlags,
+                                 uint threads_y, const uint groups_all[4],
+                                 bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto scan = getScanDimKernel<Ti, Tk, To, op>(
+        "scanDimByKeyFinal", dim, calculateFlags, threads_y, inclusiveScan);
+
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
+
+    uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
+
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+         in.info, *key.data, key.info, groups_all[0], groups_all[1],
+         groups_all[dim], lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void bcastDimLauncher(Param out, Param tmp, Param tmpid, int dim,
+                             uint threads_y, const uint groups_all[4],
+                             bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto bcast = getScanDimKernel<Ti, Tk, To, op>("bcastDimByKey", dim, false,
+                                                  threads_y, inclusiveScan);
+
+    NDRange local(THREADS_X, threads_y);
+    NDRange global(groups_all[0] * groups_all[2] * local[0],
+                   groups_all[1] * groups_all[3] * local[1]);
+
+    uint lim = divup(out.info.dims[dim], (threads_y * groups_all[dim]));
+
+    bcast(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+          *tmp.data, tmp.info, *tmpid.data, tmpid.info, groups_all[0],
+          groups_all[1], groups_all[dim], lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scanDimByKey(Param out, const Param in, const Param key, int dim,
+                  const bool inclusiveScan) {
+    uint threads_y = std::min(THREADS_Y, nextpow2(out.info.dims[dim]));
+    uint threads_x = THREADS_X;
+
+    uint groups_all[] = {divup((uint)out.info.dims[0], threads_x),
+                         (uint)out.info.dims[1], (uint)out.info.dims[2],
+                         (uint)out.info.dims[3]};
+
+    groups_all[dim] = divup(out.info.dims[dim], threads_y * REPEAT);
+
+    if (groups_all[dim] == 1) {
+        scanDimFinalLauncher<Ti, Tk, To, op>(out, in, key, dim, true, threads_y,
+                                             groups_all, inclusiveScan);
+    } else {
+        Param tmp = out;
+
+        tmp.info.dims[dim]  = groups_all[dim];
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++) {
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
+        }
+        Param tmpflg = tmp;
+        Param tmpid  = tmp;
+
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
+        // Temporary buffers; freed via bufferFree() after the broadcast.
+        tmp.data    = bufferAlloc(tmp_elements * sizeof(To));
+        tmpflg.data = bufferAlloc(tmp_elements * sizeof(char));
+        tmpid.data  = bufferAlloc(tmp_elements * sizeof(int));
+
+        scanDimNonfinalLauncher<Ti, Tk, To, op>(out, tmp, tmpflg, tmpid, in,
+                                                key, dim, threads_y, groups_all,
+                                                inclusiveScan);
+
+        int gdim        = groups_all[dim];
+        groups_all[dim] = 1;
+
+        if (op == af_notzero_t) {
+            scanDimFinalLauncher<To, char, To, af_add_t>(
+                tmp, tmp, tmpflg, dim, false, threads_y, groups_all, true);
+        } else {
+            scanDimFinalLauncher<To, char, To, op>(tmp, tmp, tmpflg, dim, false,
+                                                   threads_y, groups_all, true);
+        }
+
+        groups_all[dim] = gdim;
+        bcastDimLauncher<To, Tk, To, op>(out, tmp, tmpid, dim, threads_y,
+                                         groups_all, inclusiveScan);
+        bufferFree(tmp.data);
+        bufferFree(tmpflg.data);
+        bufferFree(tmpid.data);
+    }
+}
+}  // namespace kernel
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY(ROp, Ti, Tk, To) \
+    template void scanDimByKey<Ti, Tk, To, ROp>(     \
+        Param out, const Param in, const Param key, int dim, const bool);
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, Tk)         \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_DIM_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_DIM_BY_KEY_OP(ROp)      \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, int)  \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, uint) \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, intl) \
+    INSTANTIATE_SCAN_DIM_BY_KEY_TYPES(ROp, uintl)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_first.cl b/src/backend/opencl/kernel/scan_first.cl
index d8b08a9ea1..f84dfc6294 100644
--- a/src/backend/opencl/kernel/scan_first.cl
+++ b/src/backend/opencl/kernel/scan_first.cl
@@ -7,110 +7,109 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void scan_first_kernel(__global To *oData, KParam oInfo,
-                       __global To *tData, KParam tInfo,
-                       const __global Ti *iData, KParam iInfo,
-                       uint groups_x, uint groups_y,
-                       uint lim)
-{
+kernel void scanFirst(global To *oData, KParam oInfo, global To *tData,
+                      KParam tInfo, const global Ti *iData, KParam iInfo,
+                      uint groups_x, uint groups_y, uint lim) {
     const int lidx = get_local_id(0);
     const int lidy = get_local_id(1);
     const int lid  = lidy * get_local_size(0) + lidx;
 
-    const int zid = get_group_id(0) / groups_x;
-    const int wid = get_group_id(1) / groups_y;
-    const int groupId_x = get_group_id(0) - (groups_x) * zid;
-    const int groupId_y = get_group_id(1) - (groups_y) * wid;
-    const int xid = groupId_x * get_local_size(0) * lim + lidx;
-    const int yid = groupId_y * get_local_size(1) + lidy;
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const int yid       = groupId_y * get_local_size(1) + lidy;
 
-    bool cond_yzw = (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) && (wid < oInfo.dims[3]);
+    bool cond_yzw =
+        (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) && (wid < oInfo.dims[3]);
 
     iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
-        yid * iInfo.strides[1] + iInfo.offset;
+             yid * iInfo.strides[1] + iInfo.offset;
 
     tData += wid * tInfo.strides[3] + zid * tInfo.strides[2] +
-        yid * tInfo.strides[1] + tInfo.offset;
+             yid * tInfo.strides[1] + tInfo.offset;
 
     oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
-        yid * oInfo.strides[1] + oInfo.offset;
+             yid * oInfo.strides[1] + oInfo.offset;
 
-    __local To l_val0[SHARED_MEM_SIZE];
-    __local To l_val1[SHARED_MEM_SIZE];
-    __local To *l_val = l_val0;
-    __local To l_tmp[DIMY];
+    local To l_val0[SHARED_MEM_SIZE];
+    local To l_val1[SHARED_MEM_SIZE];
+    local To *l_val = l_val0;
+    local To l_tmp[DIMY];
 
     bool flip = 0;
 
     const To init_val = init;
-    int id = xid;
-    To val = init_val;
+    int id            = xid;
+    To val            = init_val;
 
     const bool isLast = (lidx == (DIMX - 1));
 
     for (int k = 0; k < lim; k++) {
-
         if (isLast) l_tmp[lidy] = val;
 
-        bool cond = ((id < oInfo.dims[0]) && cond_yzw);
-        val = cond ? transform(iData[id]) : init_val;
+        bool cond  = ((id < iInfo.dims[0]) && cond_yzw);
+        val        = cond ? transform(iData[id]) : init_val;
         l_val[lid] = val;
         barrier(CLK_LOCAL_MEM_FENCE);
 
         for (int off = 1; off < DIMX; off *= 2) {
             if (lidx >= off) val = binOp(val, l_val[lid - off]);
 
-            flip = 1 - flip;
-            l_val = flip ? l_val1 : l_val0;
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
             l_val[lid] = val;
             barrier(CLK_LOCAL_MEM_FENCE);
         }
 
         val = binOp(val, l_tmp[lidy]);
-        if (cond) oData[id] = val;
+        if (INCLUSIVE_SCAN != 0) {
+            if (cond) { oData[id] = val; }
+        } else {
+            if (id == (oInfo.dims[0] - 1)) {
+                oData[0] = init_val;
+            } else if (id < (oInfo.dims[0] - 1)) {
+                oData[id + 1] = val;
+            }
+        }
         id += DIMX;
-        barrier(CLK_LOCAL_MEM_FENCE); //FIXME: May be needed only for non nvidia gpus
+        barrier(CLK_LOCAL_MEM_FENCE);
     }
 
-    if (!isFinalPass && isLast && cond_yzw) {
-        tData[groupId_x] = val;
-    }
+    if (!IS_FINAL_PASS && isLast && cond_yzw) { tData[groupId_x] = val; }
 }
 
-__kernel
-void bcast_first_kernel(__global To *oData, KParam oInfo,
-                        const __global To *tData, KParam tInfo,
-                        uint groups_x, uint groups_y, uint lim)
-{
+kernel void bcastFirst(global To *oData, KParam oInfo, const global To *tData,
+                       KParam tInfo, uint groups_x, uint groups_y, uint lim) {
     const int lidx = get_local_id(0);
     const int lidy = get_local_id(1);
     const int lid  = lidy * get_local_size(0) + lidx;
 
-    const int zid = get_group_id(0) / groups_x;
-    const int wid = get_group_id(1) / groups_y;
-    const int groupId_x = get_group_id(0) - (groups_x) * zid;
-    const int groupId_y = get_group_id(1) - (groups_y) * wid;
-    const int xid = groupId_x * get_local_size(0) * lim + lidx;
-    const int yid = groupId_y * get_local_size(1) + lidy;
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const int yid       = groupId_y * get_local_size(1) + lidy;
 
     if (groupId_x != 0) {
-        bool cond = (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) && (wid < oInfo.dims[3]);
+        bool cond = (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) &&
+                    (wid < oInfo.dims[3]);
 
         if (cond) {
-
             tData += wid * tInfo.strides[3] + zid * tInfo.strides[2] +
-                yid * tInfo.strides[1] + tInfo.offset;
+                     yid * tInfo.strides[1] + tInfo.offset;
 
             oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
-                yid * oInfo.strides[1] + oInfo.offset;
+                     yid * oInfo.strides[1] + oInfo.offset;
 
             To accum = tData[groupId_x - 1];
 
-            for (int k = 0, id = xid;
-                 k < lim && id < oInfo.dims[0];
+            // Shift broadcast one step to the right for exclusive scan (#2366)
+            int offset = !INCLUSIVE_SCAN;
+            for (int k = 0, id = xid + offset; k < lim && id < oInfo.dims[0];
                  k++, id += DIMX) {
-
                 oData[id] = binOp(accum, oData[id]);
             }
         }
diff --git a/src/backend/opencl/kernel/scan_first.hpp b/src/backend/opencl/kernel/scan_first.hpp
index b9521fd1be..569c361ef8 100644
--- a/src/backend/opencl/kernel/scan_first.hpp
+++ b/src/backend/opencl/kernel/scan_first.hpp
@@ -8,237 +8,149 @@
  ********************************************************/
 
 #pragma once
-#include <string>
-#include <mutex>
-#include <map>
-#include <kernel_headers/scan_first.hpp>
-#include <kernel_headers/ops.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/Binary.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <type_util.hpp>
-#include "names.hpp"
-#include "config.hpp"
-#include <memory.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-namespace kernel
-{
-
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass, uint threads_x>
-    static Kernel* get_scan_first_kernels(int kerIdx)
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> scanProgs;
-        static std::map<int, Kernel* > scanKerns;
-        static std::map<int, Kernel* > bcastKerns;
-
-        int device= getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                const uint threads_y = THREADS_PER_GROUP / threads_x;
-                const uint SHARED_MEM_SIZE = THREADS_PER_GROUP;
-
-                Binary<To, op> scan;
-                ToNum<To> toNum;
-
-                std::ostringstream options;
-                options << " -D To=" << dtype_traits<To>::getName()
-                        << " -D Ti=" << dtype_traits<Ti>::getName()
-                        << " -D T=To"
-                        << " -D DIMX=" << threads_x
-                        << " -D DIMY=" << threads_y
-                        << " -D SHARED_MEM_SIZE=" << SHARED_MEM_SIZE
-                        << " -D init=" << toNum(scan.init())
-                        << " -D " << binOpName<op>()
-                        << " -D CPLX=" << af::iscplx<Ti>()
-                        << " -D isFinalPass=" << (int)(isFinalPass);
-                if (std::is_same<Ti, double>::value ||
-                    std::is_same<Ti, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                const char *ker_strs[] = {ops_cl, scan_first_cl};
-                const int   ker_lens[] = {ops_cl_len, scan_first_cl_len};
-                cl::Program prog;
-                buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                scanProgs[device] = new Program(prog);
-
-                scanKerns[device] = new Kernel(*scanProgs[device],  "scan_first_kernel");
-                bcastKerns[device] = new Kernel(*scanProgs[device],  "bcast_first_kernel");
-
-            });
-
-        return (kerIdx == 0) ? scanKerns[device] : bcastKerns[device];
-    }
-
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass, uint threads_x>
-    static void scan_first_launcher(Param &out,
-                                    Param &tmp,
-                                    const Param &in,
-                                    const uint groups_x,
-                                    const uint groups_y)
-    {
-        Kernel* ker = get_scan_first_kernels<Ti, To, op, isFinalPass, threads_x>(0);
-
-        NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
-        NDRange global(groups_x * out.info.dims[2] * local[0],
-                       groups_y * out.info.dims[3] * local[1]);
-
-        uint lim = divup(out.info.dims[0], (threads_x * groups_x));
-
-        auto scanOp = make_kernel<Buffer, KParam,
-                                  Buffer, KParam,
-                                  Buffer, KParam,
-                                  uint, uint, uint>(*ker);
-
-        scanOp(EnqueueArgs(getQueue(), global, local),
-               *out.data, out.info, *tmp.data, tmp.info, *in.data, in.info,
-               groups_x, groups_y, lim);
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/scan_first.hpp>
+#include <traits.hpp>
 
-        CL_DEBUG_FINISH(getQueue());
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename Ti, typename To, af_op_t op>
+static opencl::Kernel getScanFirstKernel(const std::string key,
+                                         const bool isFinalPass,
+                                         const uint threads_x,
+                                         const bool inclusiveScan) {
+    using std::string;
+    using std::vector;
+
+    const uint threads_y       = THREADS_PER_GROUP / threads_x;
+    const uint SHARED_MEM_SIZE = THREADS_PER_GROUP;
+    ToNumStr<To> toNumStr;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(),   TemplateTypename<To>(),
+        TemplateArg(isFinalPass), TemplateArg(op),
+        TemplateArg(threads_x),   TemplateArg(inclusiveScan),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(DIMY, threads_y),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineValue(SHARED_MEM_SIZE),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        DefineKeyValue(IS_FINAL_PASS, (isFinalPass ? 1 : 0)),
+        DefineKeyValue(INCLUSIVE_SCAN, inclusiveScan),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    return common::getKernel(key, {{ops_cl_src, scan_first_cl_src}}, tmpltArgs,
+                             compileOpts);
+}
 
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass, uint threads_x>
-    static void bcast_first_launcher(Param &out,
-                                     Param &tmp,
-                                     const uint groups_x,
-                                     const uint groups_y)
-    {
+template<typename Ti, typename To, af_op_t op>
+static void scanFirstLauncher(Param &out, Param &tmp, const Param &in,
+                              const bool isFinalPass, const uint groups_x,
+                              const uint groups_y, const uint threads_x,
+                              const bool inclusiveScan = true) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
 
-        Kernel* ker = get_scan_first_kernels<Ti, To, op, isFinalPass, threads_x>(1);
+    auto scan = getScanFirstKernel<Ti, To, op>("scanFirst", isFinalPass,
+                                               threads_x, inclusiveScan);
 
-        NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
-        NDRange global(groups_x * out.info.dims[2] * local[0],
-                       groups_y * out.info.dims[3] * local[1]);
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
 
-        uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
 
-        auto bcastOp = make_kernel<Buffer, KParam,
-                                   Buffer, KParam,
-                                   uint, uint, uint>(*ker);
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *tmp.data,
+         tmp.info, *in.data, in.info, groups_x, groups_y, lim);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-        bcastOp(EnqueueArgs(getQueue(), global, local),
-                *out.data, out.info, *tmp.data, tmp.info,
-                groups_x, groups_y, lim);
+template<typename Ti, typename To, af_op_t op>
+static void bcastFirstLauncher(Param &out, Param &tmp, const bool isFinalPass,
+                               const uint groups_x, const uint groups_y,
+                               const uint threads_x, const bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
 
-        CL_DEBUG_FINISH(getQueue());
-    }
+    auto bcast = getScanFirstKernel<Ti, To, op>("bcastFirst", isFinalPass,
+                                                threads_x, inclusiveScan);
 
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
 
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass>
-    static void scan_first_fn(Param &out,
-                              Param &tmp,
-                              const Param &in,
-                              const uint groups_x,
-                              const uint groups_y,
-                              const uint threads_x)
-    {
-
-        switch (threads_x) {
-        case 32:
-            (scan_first_launcher<Ti, To, op, isFinalPass,  32>)(
-                out, tmp, in, groups_x, groups_y); break;
-        case 64:
-            (scan_first_launcher<Ti, To, op, isFinalPass,  64>)(
-                out, tmp, in, groups_x, groups_y); break;
-        case 128:
-            (scan_first_launcher<Ti, To, op, isFinalPass, 128>)(
-                out, tmp, in, groups_x, groups_y); break;
-        case 256:
-            (scan_first_launcher<Ti, To, op, isFinalPass, 256>)(
-                out, tmp, in, groups_x, groups_y); break;
-        }
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
 
-    }
+    bcast(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+          *tmp.data, tmp.info, groups_x, groups_y, lim);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-    template<typename Ti, typename To, af_op_t op, bool isFinalPass>
-    static void bcast_first_fn(Param &out,
-                               Param &tmp,
-                               const uint groups_x,
-                               const uint groups_y,
-                               const uint threads_x)
-    {
-
-        switch (threads_x) {
-        case 32:
-            (bcast_first_launcher<Ti, To, op, isFinalPass,  32>)(
-                out, tmp, groups_x, groups_y); break;
-        case 64:
-            (bcast_first_launcher<Ti, To, op, isFinalPass,  64>)(
-                out, tmp, groups_x, groups_y); break;
-        case 128:
-            (bcast_first_launcher<Ti, To, op, isFinalPass, 128>)(
-                out, tmp, groups_x, groups_y); break;
-        case 256:
-            (bcast_first_launcher<Ti, To, op, isFinalPass, 256>)(
-                out, tmp, groups_x, groups_y); break;
+template<typename Ti, typename To, af_op_t op>
+static void scanFirst(Param &out, const Param &in,
+                      const bool inclusiveScan = true) {
+    uint threads_x = nextpow2(std::max(32u, (uint)out.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
+
+    uint groups_x = divup(out.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(out.info.dims[1], threads_y);
+
+    if (groups_x == 1) {
+        scanFirstLauncher<Ti, To, op>(out, out, in, true, groups_x, groups_y,
+                                      threads_x, inclusiveScan);
+
+    } else {
+        Param tmp           = out;
+        tmp.info.dims[0]    = groups_x;
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++) {
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
         }
-    }
 
-    template<typename Ti, typename To, af_op_t op>
-    static void scan_first(Param &out, const Param &in)
-    {
-        uint threads_x = nextpow2(std::max(32u, (uint)out.info.dims[0]));
-        threads_x = std::min(threads_x, THREADS_PER_GROUP);
-        uint threads_y = THREADS_PER_GROUP / threads_x;
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
 
-        uint groups_x = divup(out.info.dims[0], threads_x * REPEAT);
-        uint groups_y = divup(out.info.dims[1], threads_y);
+        tmp.data = bufferAlloc(tmp_elements * sizeof(To));
 
-        if (groups_x == 1) {
-            scan_first_fn<Ti, To, op, true>(out, out, in,
-                                            groups_x, groups_y,
-                                            threads_x);
+        scanFirstLauncher<Ti, To, op>(out, tmp, in, false, groups_x, groups_y,
+                                      threads_x, inclusiveScan);
 
+        if (op == af_notzero_t) {
+            scanFirstLauncher<To, To, af_add_t>(tmp, tmp, tmp, true, 1,
+                                                groups_y, threads_x, true);
         } else {
+            scanFirstLauncher<To, To, op>(tmp, tmp, tmp, true, 1, groups_y,
+                                          threads_x, true);
+        }
 
-            Param tmp = out;
-            tmp.info.dims[0] = groups_x;
-            tmp.info.strides[0] = 1;
-            for (int k = 1; k < 4; k++) {
-                tmp.info.strides[k] = tmp.info.strides[k - 1] * tmp.info.dims   [k - 1];
-            }
-
-            int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
-
-            tmp.data = bufferAlloc(tmp_elements * sizeof(To));
-
-            scan_first_fn<Ti, To, op, false>(out, tmp, in,
-                                             groups_x, groups_y,
-                                             threads_x);
-
-            if (op == af_notzero_t) {
-                scan_first_fn<To, To, af_add_t, true>(tmp, tmp, tmp,
-                                                      1, groups_y,
-                                                      threads_x);
-            } else {
-                scan_first_fn<To, To,       op, true>(tmp, tmp, tmp,
-                                                      1, groups_y,
-                                                      threads_x);
-            }
-
-            bcast_first_fn<To, To, op, true>(out, tmp,
-                                             groups_x,
-                                             groups_y,
-                                             threads_x);
-
-            bufferFree(tmp.data);
+        bcastFirstLauncher<To, To, op>(out, tmp, true, groups_x, groups_y,
+                                       threads_x, inclusiveScan);
 
-        }
+        bufferFree(tmp.data);
     }
-
-}
 }
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_first_by_key.cl b/src/backend/opencl/kernel/scan_first_by_key.cl
new file mode 100644
index 0000000000..1793f0b293
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_first_by_key.cl
@@ -0,0 +1,304 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+char calculate_head_flags(const global Tk *kptr, int id, int previd) {
+    return (id == 0) ? 1 : (kptr[id] != kptr[previd]);
+}
+
+kernel void scanFirstByKeyNonfinal(global To *oData, KParam oInfo,
+                                   global To *tData, KParam tInfo,
+                                   global char *tfData, KParam tfInfo,
+                                   global int *tiData, KParam tiInfo,
+                                   const global Ti *iData, KParam iInfo,
+                                   const global Tk *kData, KParam kInfo,
+                                   uint groups_x, uint groups_y, uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+    const int lid  = lidy * get_local_size(0) + lidx;
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const int yid       = groupId_y * get_local_size(1) + lidy;
+
+    bool cond_yzw =
+        (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) && (wid < oInfo.dims[3]);
+
+    iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
+             yid * iInfo.strides[1] + iInfo.offset;
+
+    kData += wid * kInfo.strides[3] + zid * kInfo.strides[2] +
+             yid * kInfo.strides[1] + kInfo.offset;
+
+    tData += wid * tInfo.strides[3] + zid * tInfo.strides[2] +
+             yid * tInfo.strides[1];
+
+    tfData += wid * tfInfo.strides[3] + zid * tfInfo.strides[2] +
+              yid * tfInfo.strides[1];
+
+    tiData += wid * tiInfo.strides[3] + zid * tiInfo.strides[2] +
+              yid * tiInfo.strides[1];
+
+    oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
+             yid * oInfo.strides[1] + oInfo.offset;
+
+    local To l_val0[SHARED_MEM_SIZE];
+    local To l_val1[SHARED_MEM_SIZE];
+    local char l_flg0[SHARED_MEM_SIZE];
+    local char l_flg1[SHARED_MEM_SIZE];
+    local To *l_val   = l_val0;
+    local char *l_flg = l_flg0;
+    local To l_tmp[DIMY];
+    local char l_ftmp[DIMY];
+    local int boundaryid[DIMY];
+
+    bool flip = 0;
+
+    const To init_val = init;
+    int id            = xid;
+    To val            = init_val;
+
+    const bool isLast = (lidx == (DIMX - 1));
+
+    if (isLast) {
+        l_tmp[lidy]      = val;
+        l_ftmp[lidy]     = 0;
+        boundaryid[lidy] = -1;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    char flag = 0;
+    for (int k = 0; k < lim; k++) {
+        bool cond = ((id < oInfo.dims[0]) && cond_yzw);
+
+        if (cond) {
+            flag = calculate_head_flags(kData, id, id - 1);
+        } else {
+            flag = 0;
+        }
+
+        // Load val from global input
+        if (INCLUSIVE_SCAN) {
+            if (!cond) {
+                val = init_val;
+            } else {
+                val = transform(iData[id]);
+            }
+        } else {
+            if ((id == 0) || (!cond) || flag) {
+                val = init_val;
+            } else {
+                val = transform(iData[id - 1]);
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((lidx == 0) && (flag == 0)) {
+            val  = binOp(val, l_tmp[lidy]);
+            flag = l_ftmp[lidy];
+        }
+
+        // Write to shared memory
+        l_val[lid] = val;
+        l_flg[lid] = flag;
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Segmented Scan
+        for (int off = 1; off < DIMX; off *= 2) {
+            if (lidx >= off) {
+                val  = l_flg[lid] ? val : binOp(val, l_val[lid - off]);
+                flag = l_flg[lid] | l_flg[lid - off];
+            }
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
+            l_flg      = flip ? l_flg1 : l_flg0;
+            l_val[lid] = val;
+            l_flg[lid] = flag;
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        // Identify segment boundary
+        if (lidx == 0) {
+            if ((l_ftmp[lidy] == 0) && (l_flg[lid] == 1)) {
+                boundaryid[lidy] = id;
+            }
+        } else {
+            if ((l_flg[lid - 1] == 0) && (l_flg[lid] == 1)) {
+                boundaryid[lidy] = id;
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        if (cond) oData[id] = val;
+        if (isLast) {
+            l_tmp[lidy]  = val;
+            l_ftmp[lidy] = flag;
+        }
+        id += DIMX;
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (isLast && cond_yzw) {
+        tData[groupId_x]  = val;
+        tfData[groupId_x] = flag;
+        int boundary      = boundaryid[lidy];
+        tiData[groupId_x] = (boundary == -1) ? id : boundary;
+    }
+}
+
+kernel void scanFirstByKeyFinal(global To *oData, KParam oInfo,
+                                const global Ti *iData, KParam iInfo,
+                                const global Tk *kData, KParam kInfo,
+                                uint groups_x, uint groups_y, uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+    const int lid  = lidy * get_local_size(0) + lidx;
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const int yid       = groupId_y * get_local_size(1) + lidy;
+
+    bool cond_yzw =
+        (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) && (wid < oInfo.dims[3]);
+
+    iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
+             yid * iInfo.strides[1] + iInfo.offset;
+
+    kData += wid * kInfo.strides[3] + zid * kInfo.strides[2] +
+             yid * kInfo.strides[1] + kInfo.offset;
+
+    oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
+             yid * oInfo.strides[1];
+
+    local To l_val0[SHARED_MEM_SIZE];
+    local To l_val1[SHARED_MEM_SIZE];
+    local char l_flg0[SHARED_MEM_SIZE];
+    local char l_flg1[SHARED_MEM_SIZE];
+    local To *l_val   = l_val0;
+    local char *l_flg = l_flg0;
+    local To l_tmp[DIMY];
+    local char l_ftmp[DIMY];
+
+    bool flip = 0;
+
+    const To init_val = init;
+    int id            = xid;
+    To val            = init_val;
+
+    const bool isLast = (lidx == (DIMX - 1));
+
+    for (int k = 0; k < lim; k++) {
+        char flag = 0;
+
+        bool cond = ((id < oInfo.dims[0]) && cond_yzw);
+
+        if (calculateFlags) {
+            if (cond) {
+                flag = calculate_head_flags(kData, id, id - 1);
+            } else {
+                flag = 0;
+            }
+        } else {
+            flag = kData[id];
+        }
+
+        // Load val from the global input buffer
+        if (INCLUSIVE_SCAN) {
+            if (!cond) {
+                val = init_val;
+            } else {
+                val = transform(iData[id]);
+            }
+        } else {
+            if ((id == 0) || (!cond) || flag) {
+                val = init_val;
+            } else {
+                val = transform(iData[id - 1]);
+            }
+        }
+
+        // Add partial result from last iteration before scan operation
+        if ((lidx == 0) && (flag == 0)) {
+            val  = binOp(val, l_tmp[lidy]);
+            flag = flag | l_ftmp[lidy];
+        }
+
+        // Write to shared memory
+        l_val[lid] = val;
+        l_flg[lid] = flag;
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // Segmented scan across the row using double-buffered shared memory
+        for (int off = 1; off < DIMX; off *= 2) {
+            if (lidx >= off) {
+                val  = l_flg[lid] ? val : binOp(val, l_val[lid - off]);
+                flag = l_flg[lid] | l_flg[lid - off];
+            }
+            flip       = 1 - flip;
+            l_val      = flip ? l_val1 : l_val0;
+            l_flg      = flip ? l_flg1 : l_flg0;
+            l_val[lid] = val;
+            l_flg[lid] = flag;
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        if (cond) oData[id] = val;
+        if (isLast) {
+            l_tmp[lidy]  = val;
+            l_ftmp[lidy] = flag;
+        }
+        id += DIMX;
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+}
+
+kernel void bcastFirstByKey(global To *oData, KParam oInfo,
+                            const global To *tData, KParam tInfo,
+                            const global int *tiData, KParam tiInfo,
+                            uint groups_x, uint groups_y, uint lim) {
+    const int lidx = get_local_id(0);
+    const int lidy = get_local_id(1);
+
+    const int zid       = get_group_id(0) / groups_x;
+    const int wid       = get_group_id(1) / groups_y;
+    const int groupId_x = get_group_id(0) - (groups_x)*zid;
+    const int groupId_y = get_group_id(1) - (groups_y)*wid;
+    const int xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const int yid       = groupId_y * get_local_size(1) + lidy;
+
+    if (groupId_x != 0) {
+        bool cond = (yid < oInfo.dims[1]) && (zid < oInfo.dims[2]) &&
+                    (wid < oInfo.dims[3]);
+
+        if (cond) {
+            tiData += wid * tiInfo.strides[3] + zid * tiInfo.strides[2] +
+                      yid * tiInfo.strides[1];
+
+            tData += wid * tInfo.strides[3] + zid * tInfo.strides[2] +
+                     yid * tInfo.strides[1];
+
+            oData += wid * oInfo.strides[3] + zid * oInfo.strides[2] +
+                     yid * oInfo.strides[1];
+
+            int boundary = tiData[groupId_x];
+            To accum     = tData[groupId_x - 1];
+
+            for (int k = 0, id = xid;
+                 k < lim && id < oInfo.dims[0] && id < boundary;
+                 k++, id += DIMX) {
+                oData[id] = binOp(accum, oData[id]);
+            }
+        }
+    }
+}
diff --git a/src/backend/opencl/kernel/scan_first_by_key.hpp b/src/backend/opencl/kernel/scan_first_by_key.hpp
new file mode 100644
index 0000000000..1e520bcebb
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_first_by_key.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scanFirstByKey(Param &out, const Param &in, const Param &key,
+                    const bool inclusive_scan);
+}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/scan_first_by_key_impl.hpp b/src/backend/opencl/kernel/scan_first_by_key_impl.hpp
new file mode 100644
index 0000000000..82674db44d
--- /dev/null
+++ b/src/backend/opencl/kernel/scan_first_by_key_impl.hpp
@@ -0,0 +1,209 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel_headers/ops.hpp>
+#include <kernel_headers/scan_first_by_key.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static opencl::Kernel getScanFirstKernel(const std::string key,
+                                         bool calculateFlags, uint threads_x,
+                                         const bool inclusiveScan) {
+    using std::string;
+    using std::vector;
+
+    const uint threads_y       = THREADS_PER_GROUP / threads_x;
+    const uint SHARED_MEM_SIZE = THREADS_PER_GROUP;
+    ToNumStr<To> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<Ti>(),
+        TemplateTypename<To>(),
+        TemplateTypename<Tk>(),
+        TemplateArg(calculateFlags),
+        TemplateArg(op),
+        TemplateArg(threads_x),
+        TemplateArg(inclusiveScan),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(Tk, dtype_traits<Tk>::getName()),
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(T, "To"),
+        DefineKeyValue(DIMX, threads_x),
+        DefineKeyValue(DIMY, threads_y),
+        DefineKeyValue(init, toNumStr(common::Binary<To, op>::init())),
+        DefineValue(SHARED_MEM_SIZE),
+        DefineKeyFromStr(binOpName<op>()),
+        DefineKeyValue(CPLX, iscplx<Ti>()),
+        DefineKeyValue(calculateFlags, (calculateFlags ? 1 : 0)),
+        DefineKeyValue(INCLUSIVE_SCAN, inclusiveScan),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    return common::getKernel(key, {{ops_cl_src, scan_first_by_key_cl_src}},
+                             tmpltArgs, compileOpts);
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scanFirstByKeyNonfinalLauncher(
+    Param &out, Param &tmp, Param &tmpflg, Param &tmpid, const Param &in,
+    const Param &key, const uint groups_x, const uint groups_y,
+    const uint threads_x, const bool inclusiveScan = true) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto scan = getScanFirstKernel<Ti, Tk, To, op>(
+        "scanFirstByKeyNonfinal", false, threads_x, inclusiveScan);
+
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
+
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *tmp.data,
+         tmp.info, *tmpflg.data, tmpflg.info, *tmpid.data, tmpid.info, *in.data,
+         in.info, *key.data, key.info, groups_x, groups_y, lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void scanFirstByKeyFinalLauncher(
+    Param &out, const Param &in, const Param &key, const bool calculateFlags,
+    const uint groups_x, const uint groups_y, const uint threads_x,
+    const bool inclusiveScan = true) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto scan = getScanFirstKernel<Ti, Tk, To, op>(
+        "scanFirstByKeyFinal", calculateFlags, threads_x, inclusiveScan);
+
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
+
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+
+    scan(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+         in.info, *key.data, key.info, groups_x, groups_y, lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+static void bcastFirstByKeyLauncher(Param &out, Param &tmp, Param &tmpid,
+                                    const uint groups_x, const uint groups_y,
+                                    const uint threads_x, bool inclusiveScan) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+
+    auto bcast = getScanFirstKernel<Ti, Tk, To, op>("bcastFirstByKey", false,
+                                                    threads_x, inclusiveScan);
+
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
+
+    uint lim = divup(out.info.dims[0], (threads_x * groups_x));
+
+    bcast(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+          *tmp.data, tmp.info, *tmpid.data, tmpid.info, groups_x, groups_y,
+          lim);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Ti, typename Tk, typename To, af_op_t op>
+void scanFirstByKey(Param &out, const Param &in, const Param &key,
+                    const bool inclusiveScan) {
+    uint threads_x = nextpow2(std::max(32u, (uint)out.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
+
+    uint groups_x = divup(out.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(out.info.dims[1], threads_y);
+
+    if (groups_x == 1) {
+        scanFirstByKeyFinalLauncher<Ti, Tk, To, op>(
+            out, in, key, true, groups_x, groups_y, threads_x, inclusiveScan);
+
+    } else {
+        Param tmp           = out;
+        tmp.info.dims[0]    = groups_x;
+        tmp.info.strides[0] = 1;
+        for (int k = 1; k < 4; k++) {
+            tmp.info.strides[k] =
+                tmp.info.strides[k - 1] * tmp.info.dims[k - 1];
+        }
+        Param tmpflg = tmp;
+        Param tmpid  = tmp;
+
+        int tmp_elements = tmp.info.strides[3] * tmp.info.dims[3];
+
+        tmp.data    = bufferAlloc(tmp_elements * sizeof(To));
+        tmpflg.data = bufferAlloc(tmp_elements * sizeof(char));
+        tmpid.data  = bufferAlloc(tmp_elements * sizeof(int));
+
+        scanFirstByKeyNonfinalLauncher<Ti, Tk, To, op>(
+            out, tmp, tmpflg, tmpid, in, key, groups_x, groups_y, threads_x,
+            inclusiveScan);
+
+        if (op == af_notzero_t) {
+            scanFirstByKeyFinalLauncher<To, char, To, af_add_t>(
+                tmp, tmp, tmpflg, false, 1, groups_y, threads_x, true);
+        } else {
+            scanFirstByKeyFinalLauncher<To, char, To, op>(
+                tmp, tmp, tmpflg, false, 1, groups_y, threads_x, true);
+        }
+
+        bcastFirstByKeyLauncher<To, Tk, To, op>(
+            out, tmp, tmpid, groups_x, groups_y, threads_x, inclusiveScan);
+
+        bufferFree(tmp.data);
+        bufferFree(tmpflg.data);
+        bufferFree(tmpid.data);
+    }
+}
+}  // namespace kernel
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, Ti, Tk, To) \
+    template void scanFirstByKey<Ti, Tk, To, ROp>(     \
+        Param & out, const Param &in, const Param &key, const bool);
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, Tk)         \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_FIRST_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_FIRST_BY_KEY_OP(ROp)      \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, int)  \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, uint) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, intl) \
+    INSTANTIATE_SCAN_FIRST_BY_KEY_TYPES(ROp, uintl)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/select.cl b/src/backend/opencl/kernel/select.cl
new file mode 100644
index 0000000000..02d113f3f8
--- /dev/null
+++ b/src/backend/opencl/kernel/select.cl
@@ -0,0 +1,106 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifndef flip
+#define flip 0
+#endif
+
+#ifndef is_same
+#define is_same 0
+#endif
+
+int getOffset(dim_t *dims, dim_t *strides, dim_t *refdims, int ids[4]) {
+    int off = 0;
+    off += ids[3] * (dims[3] == refdims[3]) * strides[3];
+    off += ids[2] * (dims[2] == refdims[2]) * strides[2];
+    off += ids[1] * (dims[1] == refdims[1]) * strides[1];
+    return off;
+}
+
+kernel void select_kernel(global T *optr, KParam oinfo,
+                            global char *cptr_, KParam cinfo,
+                            global T *aptr_, KParam ainfo, global T *bptr_,
+                            KParam binfo, int groups_0, int groups_1) {
+    global char *cptr = cptr_ + cinfo.offset;
+    global T *aptr    = aptr_ + ainfo.offset;
+    global T *bptr    = bptr_ + binfo.offset;
+
+    const int idz = get_group_id(0) / groups_0;
+    const int idw = get_group_id(1) / groups_1;
+
+    const int group_id_0 = get_group_id(0) - idz * groups_0;
+    const int group_id_1 = get_group_id(1) - idw * groups_1;
+
+    const int idx0 = group_id_0 * get_local_size(0) + get_local_id(0);
+    const int idy  = group_id_1 * get_local_size(1) + get_local_id(1);
+
+    const int off = idw * oinfo.strides[3] + idz * oinfo.strides[2] +
+                    idy * oinfo.strides[1];
+
+    if (idw >= oinfo.dims[3] || idz >= oinfo.dims[2] || idy >= oinfo.dims[1]) {
+        return;
+    }
+
+    int ids[] = {idx0, idy, idz, idw};
+
+    optr += off;
+    aptr += getOffset(ainfo.dims, ainfo.strides, oinfo.dims, ids);
+    bptr += getOffset(binfo.dims, binfo.strides, oinfo.dims, ids);
+    cptr += getOffset(cinfo.dims, cinfo.strides, oinfo.dims, ids);
+
+    if (is_same) {
+        for (int idx = idx0; idx < oinfo.dims[0];
+             idx += get_local_size(0) * groups_0) {
+            optr[idx] = (cptr[idx]) ? aptr[idx] : bptr[idx];
+        }
+    } else {
+        bool csame = cinfo.dims[0] == oinfo.dims[0];
+        bool asame = ainfo.dims[0] == oinfo.dims[0];
+        bool bsame = binfo.dims[0] == oinfo.dims[0];
+        for (int idx = idx0; idx < oinfo.dims[0];
+             idx += get_local_size(0) * groups_0) {
+            optr[idx] =
+                (cptr[csame * idx]) ? aptr[asame * idx] : bptr[bsame * idx];
+        }
+    }
+}
+
+kernel void select_scalar_kernel(global T *optr, KParam oinfo,
+                                   global char *cptr_, KParam cinfo,
+                                   global T *aptr_, KParam ainfo, T b,
+                                   int groups_0, int groups_1) {
+    global char *cptr = cptr_ + cinfo.offset;
+    global T *aptr    = aptr_ + ainfo.offset;
+
+    const int idz = get_group_id(0) / groups_0;
+    const int idw = get_group_id(1) / groups_1;
+
+    const int group_id_0 = get_group_id(0) - idz * groups_0;
+    const int group_id_1 = get_group_id(1) - idw * groups_1;
+
+    const int idx0 = group_id_0 * get_local_size(0) + get_local_id(0);
+    const int idy  = group_id_1 * get_local_size(1) + get_local_id(1);
+
+    const int off = idw * oinfo.strides[3] + idz * oinfo.strides[2] +
+                    idy * oinfo.strides[1];
+
+    int ids[] = {idx0, idy, idz, idw};
+    optr += off;
+    aptr += getOffset(ainfo.dims, ainfo.strides, oinfo.dims, ids);
+    cptr += getOffset(cinfo.dims, cinfo.strides, oinfo.dims, ids);
+
+    if (idw >= oinfo.dims[3] || idz >= oinfo.dims[2] || idy >= oinfo.dims[1]) {
+        return;
+    }
+
+    for (int idx = idx0; idx < oinfo.dims[0];
+         idx += get_local_size(0) * groups_0) {
+        optr[idx] = (cptr[idx] ^ flip) ? aptr[idx] : b;
+    }
+}
diff --git a/src/backend/opencl/kernel/select.hpp b/src/backend/opencl/kernel/select.hpp
new file mode 100644
index 0000000000..6de96e2cd6
--- /dev/null
+++ b/src/backend/opencl/kernel/select.hpp
@@ -0,0 +1,107 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/select.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <array>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+constexpr uint DIMX  = 32;
+constexpr uint DIMY  = 8;
+constexpr int REPEAT = 64;
+
+template<typename T>
+void selectLauncher(Param out, Param cond, Param a, Param b, const int ndims,
+                    const bool is_same) {
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_same),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(is_same),
+        getTypeBuildDefinition<T>()};
+
+    auto selectOp =
+        common::getKernel("select_kernel", {{select_cl_src}}, targs, options);
+
+    int threads[] = {DIMX, DIMY};
+
+    if (ndims == 1) {
+        threads[0] *= threads[1];
+        threads[1] = 1;
+    }
+
+    cl::NDRange local(threads[0], threads[1]);
+
+    int groups_0 = divup(out.info.dims[0], REPEAT * local[0]);
+    int groups_1 = divup(out.info.dims[1], local[1]);
+
+    cl::NDRange global(groups_0 * out.info.dims[2] * local[0],
+                       groups_1 * out.info.dims[3] * local[1]);
+
+    selectOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *cond.data, cond.info, *a.data, a.info, *b.data, b.info, groups_0,
+             groups_1);
+}
+
+template<typename T>
+void select(Param out, Param cond, Param a, Param b, int ndims) {
+    bool is_same = true;
+    for (int i = 0; i < 4; i++) {
+        is_same &= (a.info.dims[i] == b.info.dims[i]);
+    }
+    selectLauncher<T>(out, cond, a, b, ndims, is_same);
+}
+
+template<typename T>
+void select_scalar(Param out, Param cond, Param a, const T b, const int ndims,
+                   const bool flip) {
+    std::array<TemplateArg, 2> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(flip),
+    };
+    std::array<std::string, 3> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()), DefineValue(flip),
+        getTypeBuildDefinition<T>()};
+
+    auto selectOp = common::getKernel("select_scalar_kernel", {{select_cl_src}},
+                                      targs, options);
+
+    int threads[] = {DIMX, DIMY};
+
+    if (ndims == 1) {
+        threads[0] *= threads[1];
+        threads[1] = 1;
+    }
+
+    cl::NDRange local(threads[0], threads[1]);
+
+    int groups_0 = divup(out.info.dims[0], REPEAT * local[0]);
+    int groups_1 = divup(out.info.dims[1], local[1]);
+
+    cl::NDRange global(groups_0 * out.info.dims[2] * local[0],
+                       groups_1 * out.info.dims[3] * local[1]);
+
+    selectOp(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *cond.data, cond.info, *a.data, a.info, b, groups_0, groups_1);
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/set.cl b/src/backend/opencl/kernel/set.cl
deleted file mode 100644
index c5e43e5dc5..0000000000
--- a/src/backend/opencl/kernel/set.cl
+++ /dev/null
@@ -1,20 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-kernel
-void
-set(    global  T*      ptr,
-                T       val,
-        const   unsigned long  elements)
-{
-    if(get_global_id(0) < elements) {
-        ptr[get_global_id(0)] = val;
-    }
-}
-
diff --git a/src/backend/opencl/kernel/set.hpp b/src/backend/opencl/kernel/set.hpp
deleted file mode 100644
index c5a8b5035d..0000000000
--- a/src/backend/opencl/kernel/set.hpp
+++ /dev/null
@@ -1,67 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <platform.hpp>
-#include <kernel_headers/set.hpp>
-#include <traits.hpp>
-#include <mutex>
-#include <map>
-#include <backend.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl {
-namespace kernel {
-
-//TODO: Build only once for each instance of a kernel.  NOTE: Static objects in
-//      different instances of templates are the same.
-template<typename T>
-void
-set(Buffer &ptr, T val, const size_t &elements)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>   setProgs;
-        static std::map<int, Kernel *> setKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-                    Program::Sources setSrc;
-                    setSrc.emplace_back(set_cl, set_cl_len);
-
-                    setProgs[device] = new Program(getContext(), setSrc);
-
-                    string opt = string("-D T=") + dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        opt << " -D USE_DOUBLE";
-                    }
-                    setProgs[device]->build(opt.c_str());
-
-                    setKernels[device] = new Kernel(*setProgs[device], "set");
-                });
-
-        auto setKern = make_kernel<Buffer, T, const unsigned long>(setKernels[device]);
-        setKern(EnqueueArgs(getQueue(), NDRange(elements)), ptr, val, elements);
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-}
-}
diff --git a/src/backend/opencl/kernel/shift.cl b/src/backend/opencl/kernel/shift.cl
deleted file mode 100644
index 7c487ec718..0000000000
--- a/src/backend/opencl/kernel/shift.cl
+++ /dev/null
@@ -1,55 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-static inline int simple_mod(const int i, const int dim)
-{
-    return (i < dim) ? i : (i - dim);
-}
-
-__kernel
-void shift_kernel(__global T *out, __global const T *in, const KParam op, const KParam ip,
-                  const int d0, const int d1, const int d2, const int d3,
-                  const int blocksPerMatX, const int blocksPerMatY)
-{
-    const int oz = get_group_id(0) / blocksPerMatX;
-    const int ow = get_group_id(1) / blocksPerMatY;
-
-    const int blockIdx_x = get_group_id(0) - oz * blocksPerMatX;
-    const int blockIdx_y = get_group_id(1) - ow * blocksPerMatY;
-
-    const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
-    const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
-
-    if(xx >= op.dims[0] ||
-       yy >= op.dims[1] ||
-       oz >= op.dims[2] ||
-       ow >= op.dims[3])
-        return;
-
-    const int incy = blocksPerMatY * get_local_size(1);
-    const int incx = blocksPerMatX * get_local_size(0);
-
-    const int iw = simple_mod((ow + d3), op.dims[3]);
-    const int iz = simple_mod((oz + d2), op.dims[2]);
-
-    const int o_off   = ow * op.strides[3] + oz * op.strides[2];
-    const int i_off   = iw * ip.strides[3] + iz * ip.strides[2] + ip.offset;
-
-    for(int oy = yy; oy < op.dims[1]; oy += incy) {
-        const int iy = simple_mod((oy + d1), op.dims[1]);
-        for(int ox = xx; ox < op.dims[0]; ox += incx) {
-            const int ix = simple_mod((ox + d0), op.dims[0]);
-
-            const int oIdx = o_off + oy * op.strides[1] + ox;
-            const int iIdx = i_off + iy * ip.strides[1] + ix;
-
-            out[oIdx] = in[iIdx];
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/shift.hpp b/src/backend/opencl/kernel/shift.hpp
deleted file mode 100644
index 3ff63ee19d..0000000000
--- a/src/backend/opencl/kernel/shift.hpp
+++ /dev/null
@@ -1,96 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <kernel_headers/shift.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
-#include <cassert>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 128;
-        static const int TILEY = 32;
-
-        template<typename T>
-        void shift(Param out, const Param in, const int *sdims)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   shiftProgs;
-                static std::map<int, Kernel *> shiftKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, shift_cl, shift_cl_len, options.str());
-                    shiftProgs[device] = new Program(prog);
-                    shiftKernels[device] = new Kernel(*shiftProgs[device], "shift_kernel");
-                });
-
-                auto shiftOp = make_kernel<Buffer, const Buffer, const KParam, const KParam,
-                                          const int, const int, const int, const int,
-                                          const int, const int> (*shiftKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(out.info.dims[0], TILEX);
-                int blocksPerMatY = divup(out.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
-
-                int sdims_[4];
-                // Need to do this because we are mapping output to input in the kernel
-                for(int i = 0; i < 4; i++) {
-                    // sdims_[i] will always be positive and always [0, oDims[i]].
-                    // Negative shifts are converted to position by going the other way round
-                    sdims_[i] = -(sdims[i] % (int)out.info.dims[i]) + out.info.dims[i] * (sdims[i] > 0);
-                    assert(sdims_[i] >= 0 && sdims_[i] <= out.info.dims[i]);
-                }
-
-                shiftOp(EnqueueArgs(getQueue(), global, local),
-                        *out.data, *in.data, out.info, in.info,
-                        sdims_[0], sdims_[1], sdims_[2], sdims_[3],
-                        blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/sift.hpp b/src/backend/opencl/kernel/sift.hpp
new file mode 100644
index 0000000000..01bfaa3926
--- /dev/null
+++ b/src/backend/opencl/kernel/sift.hpp
@@ -0,0 +1,732 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// The source code contained in this file is based on the original code by
+// Rob Hess. Note that SIFT was patented under US law; as of 29-Dec-2020 the
+// patent has expired. It can be looked up at
+// https://patents.google.com/patent/US6711293B1/en
+
+#pragma once
+
+#include <common/deprecated.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/convolve_separable.hpp>
+#include <kernel/fast.hpp>
+#include <kernel/resize.hpp>
+#include <kernel_headers/sift_nonfree.hpp>
+#include <memory.hpp>
+#include <af/defines.h>
+
+AF_DEPRECATED_WARNINGS_OFF
+#include <boost/compute/algorithm/gather.hpp>
+#include <boost/compute/algorithm/iota.hpp>
+#include <boost/compute/algorithm/sort_by_key.hpp>
+#include <boost/compute/core.hpp>
+#include <boost/compute/iterator/buffer_iterator.hpp>
+AF_DEPRECATED_WARNINGS_ON
+
+#include <algorithm>
+#include <array>
+#include <cmath>
+#include <string>
+#include <vector>
+
+namespace compute = boost::compute;
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int SIFT_THREADS   = 256;
+constexpr int SIFT_THREADS_X = 32;
+constexpr int SIFT_THREADS_Y = 8;
+
+// assumed gaussian blur for input image
+constexpr float InitSigma = 0.5f;
+
+// width of border in which to ignore keypoints
+constexpr int ImgBorder = 5;
+
+// default width of descriptor histogram array
+constexpr int DescrWidth = 4;
+
+// default number of bins per histogram in descriptor array
+constexpr int DescrHistBins = 8;
+
+// default number of bins in histogram for orientation assignment
+constexpr int OriHistBins = 36;
+
+// Number of GLOH bins in radial direction
+constexpr unsigned GLOHRadialBins = 3;
+
+// Number of GLOH angular bins (excluding the inner-most radial section)
+constexpr unsigned GLOHAngularBins = 8;
+
+// Number of GLOH bins per histogram in descriptor
+constexpr unsigned GLOHHistBins = 16;
+
+constexpr float PI_VAL = 3.14159265358979323846f;
+
+template<typename T>
+void gaussian1D(T* out, const int dim, double sigma = 0.0) {
+    if (!(sigma > 0)) sigma = 0.25 * dim;
+
+    T sum = (T)0;
+    for (int i = 0; i < dim; i++) {
+        int x = i - (dim - 1) / 2;
+        T el  = 1. / sqrt(2 * PI_VAL * sigma * sigma) *
+               exp(-((x * x) / (2 * (sigma * sigma))));
+        out[i] = el;
+        sum += el;
+    }
+
+    for (int k = 0; k < dim; k++) out[k] /= sum;
+}
+
+template<typename T>
+Param gaussFilter(float sigma) {
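+    // Using 6-sigma rule: the filter spans about 6 * sigma taps, `| 1` forces
+    // an odd length so the kernel has a center tap, and the length is capped
+    // at 31 (e.g. sigma = 1.6 gives round(10.6) = 11 taps)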
+    unsigned gauss_len = std::min((unsigned)round(sigma * 6 + 1) | 1, 31u);
+
+    T* h_gauss = new T[gauss_len];
+    gaussian1D(h_gauss, gauss_len, sigma);
+
+    Param gauss_filter;
+    gauss_filter.info.offset     = 0;
+    gauss_filter.info.dims[0]    = gauss_len;
+    gauss_filter.info.strides[0] = 1;
+
+    for (int k = 1; k < 4; k++) {
+        gauss_filter.info.dims[k] = 1;
+        gauss_filter.info.strides[k] =
+            gauss_filter.info.dims[k - 1] * gauss_filter.info.strides[k - 1];
+    }
+
+    dim_t gauss_elem = gauss_filter.info.strides[3] * gauss_filter.info.dims[3];
+    gauss_filter.data = bufferAlloc(gauss_elem * sizeof(T));
+    getQueue().enqueueWriteBuffer(*gauss_filter.data, CL_TRUE, 0,
+                                  gauss_elem * sizeof(T), h_gauss);
+
+    delete[] h_gauss;
+
+    return gauss_filter;
+}
+
+template<typename T, typename convAccT>
+void convSepFull(Param& dst, Param src, Param filter) {
+    Param tmp;
+    tmp.info.offset = 0;
+    for (int k = 0; k < 4; k++) {
+        tmp.info.dims[k]    = src.info.dims[k];
+        tmp.info.strides[k] = src.info.strides[k];
+    }
+
+    const dim_t src_el = src.info.dims[3] * src.info.strides[3];
+    tmp.data           = bufferAlloc(src_el * sizeof(T));
+
+    convSep<T, convAccT>(tmp, src, filter, 0, false);
+    convSep<T, convAccT>(dst, tmp, filter, 1, false);
+
+    bufferFree(tmp.data);
+}
+
+template<typename T, typename convAccT>
+Param createInitialImage(Param img, const float init_sigma,
+                         const bool double_input) {
+    Param init_img;
+    init_img.info.offset = 0;
+    init_img.info.dims[0] =
+        (double_input) ? img.info.dims[0] * 2 : img.info.dims[0];
+    init_img.info.dims[1] =
+        (double_input) ? img.info.dims[1] * 2 : img.info.dims[1];
+    init_img.info.strides[0] = 1;
+    init_img.info.strides[1] = init_img.info.dims[0];
+
+    for (int k = 2; k < 4; k++) {
+        init_img.info.dims[k] = 1;
+        init_img.info.strides[k] =
+            init_img.info.dims[k - 1] * init_img.info.strides[k - 1];
+    }
+
+    dim_t init_img_el = init_img.info.strides[3] * init_img.info.dims[3];
+    init_img.data     = bufferAlloc(init_img_el * sizeof(T));
+
+    float s = (double_input)
+                  ? std::max((float)sqrt(init_sigma * init_sigma -
+                                         InitSigma * InitSigma * 4.f),
+                             0.1f)
+                  : std::max((float)sqrt(init_sigma * init_sigma -
+                                         InitSigma * InitSigma),
+                             0.1f);
+
+    const Param filter = gaussFilter<convAccT>(s);
+
+    if (double_input) resize<T>(init_img, img, AF_INTERP_BILINEAR);
+
+    convSepFull<T, convAccT>(init_img, (double_input) ? init_img : img, filter);
+
+    bufferFree(filter.data);
+
+    return init_img;
+}
+
+template<typename T, typename convAccT>
+std::vector<Param> buildGaussPyr(Param init_img, const unsigned n_octaves,
+                                 const unsigned n_layers,
+                                 const float init_sigma) {
+    // Precompute Gaussian sigmas using the following formula:
+    // \sigma_{total}^2 = \sigma_{i}^2 + \sigma_{i-1}^2
+    std::vector<float> sig_layers(n_layers + 3);
+    sig_layers[0] = init_sigma;
+    float k       = std::pow(2.0f, 1.0f / n_layers);
+    for (unsigned i = 1; i < n_layers + 3; i++) {
+        float sig_prev  = std::pow(k, i - 1) * init_sigma;
+        float sig_total = sig_prev * k;
+        sig_layers[i] = std::sqrt(sig_total * sig_total - sig_prev * sig_prev);
+    }
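+    // Worked example, assuming n_layers = 3 and init_sigma = 1.6 (values
+    // chosen here only for illustration): k = 2^(1/3) ~= 1.26, so
+    // sig_layers[1] = sqrt((1.6 * 1.26)^2 - 1.6^2) ~= 1.23, the extra blur
+    // applied to layer 0 to reach a total sigma of about 2.02 in layer 1.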
+
+    // Gaussian Pyramid
+    std::vector<Param> gauss_pyr(n_octaves);
+    std::vector<Param> tmp_pyr(n_octaves * (n_layers + 3));
+    for (unsigned o = 0; o < n_octaves; o++) {
+        gauss_pyr[o].info.offset  = 0;
+        gauss_pyr[o].info.dims[0] = (o == 0)
+                                        ? init_img.info.dims[0]
+                                        : gauss_pyr[o - 1].info.dims[0] / 2;
+        gauss_pyr[o].info.dims[1] = (o == 0)
+                                        ? init_img.info.dims[1]
+                                        : gauss_pyr[o - 1].info.dims[1] / 2;
+        gauss_pyr[o].info.dims[2] = n_layers + 3;
+        gauss_pyr[o].info.dims[3] = 1;
+
+        gauss_pyr[o].info.strides[0] = 1;
+        gauss_pyr[o].info.strides[1] =
+            gauss_pyr[o].info.dims[0] * gauss_pyr[o].info.strides[0];
+        gauss_pyr[o].info.strides[2] =
+            gauss_pyr[o].info.dims[1] * gauss_pyr[o].info.strides[1];
+        gauss_pyr[o].info.strides[3] =
+            gauss_pyr[o].info.dims[2] * gauss_pyr[o].info.strides[2];
+
+        const unsigned nel =
+            gauss_pyr[o].info.dims[3] * gauss_pyr[o].info.strides[3];
+        gauss_pyr[o].data = bufferAlloc(nel * sizeof(T));
+
+        for (unsigned l = 0; l < n_layers + 3; l++) {
+            unsigned src_idx = (l == 0) ? (o - 1) * (n_layers + 3) + n_layers
+                                        : o * (n_layers + 3) + l - 1;
+            unsigned idx     = o * (n_layers + 3) + l;
+
+            tmp_pyr[idx].info.offset = 0;
+            if (o == 0 && l == 0) {
+                for (int k = 0; k < 4; k++) {
+                    tmp_pyr[idx].info.dims[k]    = init_img.info.dims[k];
+                    tmp_pyr[idx].info.strides[k] = init_img.info.strides[k];
+                }
+                tmp_pyr[idx].data = init_img.data;
+            } else if (l == 0) {
+                tmp_pyr[idx].info.dims[0] = tmp_pyr[src_idx].info.dims[0] / 2;
+                tmp_pyr[idx].info.dims[1] = tmp_pyr[src_idx].info.dims[1] / 2;
+                tmp_pyr[idx].info.strides[0] = 1;
+                tmp_pyr[idx].info.strides[1] = tmp_pyr[idx].info.dims[0];
+
+                for (int k = 2; k < 4; k++) {
+                    tmp_pyr[idx].info.dims[k] = 1;
+                    tmp_pyr[idx].info.strides[k] =
+                        tmp_pyr[idx].info.dims[k - 1] *
+                        tmp_pyr[idx].info.strides[k - 1];
+                }
+
+                dim_t lvl_el =
+                    tmp_pyr[idx].info.strides[3] * tmp_pyr[idx].info.dims[3];
+                tmp_pyr[idx].data = bufferAlloc(lvl_el * sizeof(T));
+
+                resize<T>(tmp_pyr[idx], tmp_pyr[src_idx], AF_INTERP_BILINEAR);
+            } else {
+                for (int k = 0; k < 4; k++) {
+                    tmp_pyr[idx].info.dims[k] = tmp_pyr[src_idx].info.dims[k];
+                    tmp_pyr[idx].info.strides[k] =
+                        tmp_pyr[src_idx].info.strides[k];
+                }
+                dim_t lvl_el =
+                    tmp_pyr[idx].info.strides[3] * tmp_pyr[idx].info.dims[3];
+                tmp_pyr[idx].data = bufferAlloc(lvl_el * sizeof(T));
+
+                Param filter = gaussFilter<convAccT>(sig_layers[l]);
+
+                convSepFull<T, convAccT>(tmp_pyr[idx], tmp_pyr[src_idx],
+                                         filter);
+
+                bufferFree(filter.data);
+            }
+
+            const unsigned imel =
+                tmp_pyr[idx].info.dims[3] * tmp_pyr[idx].info.strides[3];
+            const unsigned offset = imel * l;
+
+            getQueue().enqueueCopyBuffer(*tmp_pyr[idx].data, *gauss_pyr[o].data,
+                                         0, offset * sizeof(T),
+                                         imel * sizeof(T));
+        }
+    }
+
+    for (unsigned o = 0; o < n_octaves; o++) {
+        for (unsigned l = 0; l < n_layers + 3; l++) {
+            unsigned idx = o * (n_layers + 3) + l;
+            bufferFree(tmp_pyr[idx].data);
+        }
+    }
+
+    return gauss_pyr;
+}
+
+template<typename T>
+std::vector<Param> buildDoGPyr(std::vector<Param> gauss_pyr,
+                               const unsigned n_octaves,
+                               const unsigned n_layers, Kernel suOp) {
+    // DoG Pyramid
+    std::vector<Param> dog_pyr(n_octaves);
+    for (unsigned o = 0; o < n_octaves; o++) {
+        for (int k = 0; k < 4; k++) {
+            dog_pyr[o].info.dims[k] = (k == 2) ? gauss_pyr[o].info.dims[k] - 1
+                                               : gauss_pyr[o].info.dims[k];
+            dog_pyr[o].info.strides[k] =
+                (k == 0) ? 1
+                         : dog_pyr[o].info.dims[k - 1] *
+                               dog_pyr[o].info.strides[k - 1];
+        }
+        dog_pyr[o].info.offset = 0;
+
+        dog_pyr[o].data = bufferAlloc(dog_pyr[o].info.dims[3] *
+                                      dog_pyr[o].info.strides[3] * sizeof(T));
+        const unsigned nel =
+            dog_pyr[o].info.dims[1] * dog_pyr[o].info.strides[1];
+        const unsigned dog_layers = n_layers + 2;
+
+        const int blk_x = divup(nel, SIFT_THREADS);
+        const cl::NDRange local(SIFT_THREADS, 1);
+        const cl::NDRange global(blk_x * SIFT_THREADS, 1);
+
+        suOp(cl::EnqueueArgs(getQueue(), global, local), *dog_pyr[o].data,
+             *gauss_pyr[o].data, nel, dog_layers);
+        CL_DEBUG_FINISH(getQueue());
+    }
+    return dog_pyr;
+}
+
+template<typename T>
+void update_permutation(compute::buffer_iterator<T>& keys,
+                        compute::vector<int>& permutation,
+                        compute::command_queue& queue) {
+    // temporary storage for keys
+    compute::vector<T> temp(permutation.size(), 0, queue);
+
+    // permute the keys with the current reordering
+    compute::gather(permutation.begin(), permutation.end(), keys, temp.begin(),
+                    queue);
+
+    // sort the permuted keys and update the permutation; chaining several
+    // keys this way relies on the sort preserving the order of equal keys
+    compute::sort_by_key(temp.begin(), temp.end(), permutation.begin(), queue);
+}
+
+template<typename T>
+void apply_permutation(compute::buffer_iterator<T>& keys,
+                       compute::vector<int>& permutation,
+                       compute::command_queue& queue) {
+    // copy keys to temporary vector
+    compute::vector<T> temp(keys, keys + permutation.size(), queue);
+
+    // permute the keys
+    compute::gather(permutation.begin(), permutation.end(), temp.begin(), keys,
+                    queue);
+}
+
+template<typename T>
+std::array<Kernel, 7> getSiftKernels() {
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    return {
+        common::getKernel("sub", {{sift_nonfree_cl_src}}, targs, compileOpts),
+        common::getKernel("detectExtrema", {{sift_nonfree_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("interpolateExtrema", {{sift_nonfree_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("calcOrientation", {{sift_nonfree_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("removeDuplicates", {{sift_nonfree_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("computeDescriptor", {{sift_nonfree_cl_src}}, targs,
+                          compileOpts),
+        common::getKernel("computeGLOHDescriptor", {{sift_nonfree_cl_src}},
+                          targs, compileOpts),
+    };
+}
+
+template<typename T, typename convAccT>
+void sift(unsigned* out_feat, unsigned* out_dlen, Param& x_out, Param& y_out,
+          Param& score_out, Param& ori_out, Param& size_out, Param& desc_out,
+          Param img, const unsigned n_layers, const float contrast_thr,
+          const float edge_thr, const float init_sigma, const bool double_input,
+          const float img_scale, const float feature_ratio,
+          const bool compute_GLOH) {
+    using cl::Buffer;
+    using cl::EnqueueArgs;
+    using cl::Local;
+    using cl::NDRange;
+    using std::vector;
+
+    auto kernels = getSiftKernels<T>();
+
+    unsigned min_dim = std::min(img.info.dims[0], img.info.dims[1]);
+    if (double_input) min_dim *= 2;
+
+    const unsigned n_octaves = floor(log(min_dim) / log(2)) - 2;
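+    // For example, a 512 x 512 input without doubling nominally gives
+    // floor(log2(512)) - 2 = 7 octaves; with double_input, min_dim becomes
+    // 1024 and the result is 8.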
+
+    Param init_img =
+        createInitialImage<T, convAccT>(img, init_sigma, double_input);
+
+    vector<Param> gauss_pyr =
+        buildGaussPyr<T, convAccT>(init_img, n_octaves, n_layers, init_sigma);
+
+    vector<Param> dog_pyr =
+        buildDoGPyr<T>(gauss_pyr, n_octaves, n_layers, kernels[0]);
+
+    vector<bufptr> d_x_pyr;
+    vector<bufptr> d_y_pyr;
+    vector<bufptr> d_response_pyr;
+    vector<bufptr> d_size_pyr;
+    vector<bufptr> d_ori_pyr;
+    vector<bufptr> d_desc_pyr;
+    vector<unsigned> feat_pyr(n_octaves, 0);
+
+    d_x_pyr.reserve(n_octaves);
+    d_y_pyr.reserve(n_octaves);
+    d_response_pyr.reserve(n_octaves);
+    d_size_pyr.reserve(n_octaves);
+    d_ori_pyr.reserve(n_octaves);
+    d_desc_pyr.reserve(n_octaves);
+    unsigned total_feat = 0;
+
+    const unsigned d  = DescrWidth;
+    const unsigned n  = DescrHistBins;
+    const unsigned rb = GLOHRadialBins;
+    const unsigned ab = GLOHAngularBins;
+    const unsigned hb = GLOHHistBins;
+    const unsigned desc_len =
+        (compute_GLOH) ? (1 + (rb - 1) * ab) * hb : d * d * n;
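+    // With the constants defined at the top of this file, SIFT yields
+    // 4 * 4 * 8 = 128 values per feature, while GLOH yields
+    // (1 + (3 - 1) * 8) * 16 = 272.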
+
+    auto d_count = memAlloc<unsigned>(1);
+
+    for (unsigned o = 0; o < n_octaves; o++) {
+        if (dog_pyr[o].info.dims[0] - 2 * ImgBorder < 1 ||
+            dog_pyr[o].info.dims[1] - 2 * ImgBorder < 1)
+            continue;
+
+        const unsigned imel = dog_pyr[o].info.dims[0] * dog_pyr[o].info.dims[1];
+        const unsigned max_feat = ceil(imel * feature_ratio);
+
+        auto d_extrema_x     = memAlloc<float>(max_feat);
+        auto d_extrema_y     = memAlloc<float>(max_feat);
+        auto d_extrema_layer = memAlloc<unsigned>(max_feat);
+
+        unsigned extrema_feat = 0;
+        getQueue().enqueueWriteBuffer(*d_count, CL_FALSE, 0, sizeof(unsigned),
+                                      &extrema_feat);
+
+        int dim0 = dog_pyr[o].info.dims[0];
+        int dim1 = dog_pyr[o].info.dims[1];
+
+        const int blk_x = divup(dim0 - 2 * ImgBorder, SIFT_THREADS_X);
+        const int blk_y = divup(dim1 - 2 * ImgBorder, SIFT_THREADS_Y);
+        const NDRange local(SIFT_THREADS_X, SIFT_THREADS_Y);
+        const NDRange global(blk_x * SIFT_THREADS_X, blk_y * SIFT_THREADS_Y);
+
+        float extrema_thr = 0.5f * contrast_thr / n_layers;
+
+        auto deOp = kernels[1];
+
+        deOp(EnqueueArgs(getQueue(), global, local), *d_extrema_x, *d_extrema_y,
+             *d_extrema_layer, *d_count, *dog_pyr[o].data, dog_pyr[o].info,
+             max_feat, extrema_thr,
+             Local((SIFT_THREADS_X + 2) * (SIFT_THREADS_Y + 2) * 3 *
+                   sizeof(float)));
+        CL_DEBUG_FINISH(getQueue());
+
+        getQueue().enqueueReadBuffer(*d_count, CL_TRUE, 0, sizeof(unsigned),
+                                     &extrema_feat);
+        extrema_feat = std::min(extrema_feat, max_feat);
+
+        if (extrema_feat == 0) { continue; }
+
+        unsigned interp_feat = 0;
+        getQueue().enqueueWriteBuffer(*d_count, CL_FALSE, 0, sizeof(unsigned),
+                                      &interp_feat);
+
+        auto d_interp_x        = memAlloc<float>(extrema_feat);
+        auto d_interp_y        = memAlloc<float>(extrema_feat);
+        auto d_interp_layer    = memAlloc<unsigned>(extrema_feat);
+        auto d_interp_response = memAlloc<float>(extrema_feat);
+        auto d_interp_size     = memAlloc<float>(extrema_feat);
+
+        const int blk_x_interp = divup(extrema_feat, SIFT_THREADS);
+        const NDRange local_interp(SIFT_THREADS, 1);
+        const NDRange global_interp(blk_x_interp * SIFT_THREADS, 1);
+
+        auto ieOp = kernels[2];
+
+        ieOp(EnqueueArgs(getQueue(), global_interp, local_interp), *d_interp_x,
+             *d_interp_y, *d_interp_layer, *d_interp_response, *d_interp_size,
+             *d_count, *d_extrema_x, *d_extrema_y, *d_extrema_layer,
+             extrema_feat, *dog_pyr[o].data, dog_pyr[o].info, extrema_feat, o,
+             n_layers, contrast_thr, edge_thr, init_sigma, img_scale);
+        CL_DEBUG_FINISH(getQueue());
+
+        getQueue().enqueueReadBuffer(*d_count, CL_TRUE, 0, sizeof(unsigned),
+                                     &interp_feat);
+        interp_feat = std::min(interp_feat, extrema_feat);
+
+        if (interp_feat == 0) { continue; }
+
+        compute::command_queue queue(getQueue()());
+        compute::context context(getContext()());
+
+        compute::buffer buf_interp_x((*d_interp_x)(), true);
+        compute::buffer buf_interp_y((*d_interp_y)(), true);
+        compute::buffer buf_interp_layer((*d_interp_layer)(), true);
+        compute::buffer buf_interp_response((*d_interp_response)(), true);
+        compute::buffer buf_interp_size((*d_interp_size)(), true);
+
+        compute::buffer_iterator<float> interp_x_begin =
+            compute::make_buffer_iterator<float>(buf_interp_x, 0);
+        compute::buffer_iterator<float> interp_y_begin =
+            compute::make_buffer_iterator<float>(buf_interp_y, 0);
+        compute::buffer_iterator<unsigned> interp_layer_begin =
+            compute::make_buffer_iterator<unsigned>(buf_interp_layer, 0);
+        compute::buffer_iterator<float> interp_response_begin =
+            compute::make_buffer_iterator<float>(buf_interp_response, 0);
+        compute::buffer_iterator<float> interp_size_begin =
+            compute::make_buffer_iterator<float>(buf_interp_size, 0);
+
+        compute::vector<int> permutation(interp_feat, context);
+        compute::iota(permutation.begin(), permutation.end(), 0, queue);
+
+        update_permutation<float>(interp_x_begin, permutation, queue);
+        update_permutation<float>(interp_y_begin, permutation, queue);
+        update_permutation<unsigned>(interp_layer_begin, permutation, queue);
+        update_permutation<float>(interp_response_begin, permutation, queue);
+        update_permutation<float>(interp_size_begin, permutation, queue);
+
+        apply_permutation<float>(interp_x_begin, permutation, queue);
+        apply_permutation<float>(interp_y_begin, permutation, queue);
+        apply_permutation<unsigned>(interp_layer_begin, permutation, queue);
+        apply_permutation<float>(interp_response_begin, permutation, queue);
+        apply_permutation<float>(interp_size_begin, permutation, queue);
+
+        unsigned nodup_feat = 0;
+        getQueue().enqueueWriteBuffer(*d_count, CL_FALSE, 0, sizeof(unsigned),
+                                      &nodup_feat);
+
+        auto d_nodup_x        = memAlloc<float>(interp_feat);
+        auto d_nodup_y        = memAlloc<float>(interp_feat);
+        auto d_nodup_layer    = memAlloc<unsigned>(interp_feat);
+        auto d_nodup_response = memAlloc<float>(interp_feat);
+        auto d_nodup_size     = memAlloc<float>(interp_feat);
+
+        const int blk_x_nodup = divup(interp_feat, SIFT_THREADS);
+        const NDRange local_nodup(SIFT_THREADS, 1);
+        const NDRange global_nodup(blk_x_nodup * SIFT_THREADS, 1);
+
+        auto rdOp = kernels[4];
+
+        rdOp(EnqueueArgs(getQueue(), global_nodup, local_nodup), *d_nodup_x,
+             *d_nodup_y, *d_nodup_layer, *d_nodup_response, *d_nodup_size,
+             *d_count, *d_interp_x, *d_interp_y, *d_interp_layer,
+             *d_interp_response, *d_interp_size, interp_feat);
+        CL_DEBUG_FINISH(getQueue());
+
+        getQueue().enqueueReadBuffer(*d_count, CL_TRUE, 0, sizeof(unsigned),
+                                     &nodup_feat);
+        nodup_feat = std::min(nodup_feat, interp_feat);
+
+        unsigned oriented_feat = 0;
+        getQueue().enqueueWriteBuffer(*d_count, CL_FALSE, 0, sizeof(unsigned),
+                                      &oriented_feat);
+        const unsigned max_oriented_feat = nodup_feat * 3;
+
+        auto d_oriented_x        = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_y        = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_layer    = memAlloc<unsigned>(max_oriented_feat);
+        auto d_oriented_response = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_size     = memAlloc<float>(max_oriented_feat);
+        auto d_oriented_ori      = memAlloc<float>(max_oriented_feat);
+
+        const int blk_x_ori = divup(nodup_feat, SIFT_THREADS_Y);
+        const NDRange local_ori(SIFT_THREADS_X, SIFT_THREADS_Y);
+        const NDRange global_ori(SIFT_THREADS_X, blk_x_ori * SIFT_THREADS_Y);
+
+        auto coOp = kernels[3];
+
+        coOp(EnqueueArgs(getQueue(), global_ori, local_ori), *d_oriented_x,
+             *d_oriented_y, *d_oriented_layer, *d_oriented_response,
+             *d_oriented_size, *d_oriented_ori, *d_count, *d_nodup_x,
+             *d_nodup_y, *d_nodup_layer, *d_nodup_response, *d_nodup_size,
+             nodup_feat, *gauss_pyr[o].data, gauss_pyr[o].info,
+             max_oriented_feat, o, (int)double_input,
+             Local(OriHistBins * SIFT_THREADS_Y * 2 * sizeof(float)));
+        CL_DEBUG_FINISH(getQueue());
+
+        getQueue().enqueueReadBuffer(*d_count, CL_TRUE, 0, sizeof(unsigned),
+                                     &oriented_feat);
+        oriented_feat = std::min(oriented_feat, max_oriented_feat);
+
+        if (oriented_feat == 0) { continue; }
+
+        auto d_desc = memAlloc<float>(oriented_feat * desc_len);
+
+        float scale = 1.f / (1 << o);
+        if (double_input) scale *= 2.f;
+
+        const int blk_x_desc = divup(oriented_feat, 1);
+        const NDRange local_desc(SIFT_THREADS, 1);
+        const NDRange global_desc(SIFT_THREADS, blk_x_desc);
+
+        const unsigned histsz = 8;
+
+        if (compute_GLOH) {
+            auto cgOp = kernels[6];
+
+            cgOp(EnqueueArgs(getQueue(), global_desc, local_desc), *d_desc,
+                 desc_len, histsz, *d_oriented_x, *d_oriented_y,
+                 *d_oriented_layer, *d_oriented_response, *d_oriented_size,
+                 *d_oriented_ori, oriented_feat, *gauss_pyr[o].data,
+                 gauss_pyr[o].info, d, rb, ab, hb, scale, n_layers,
+                 Local(desc_len * (histsz + 1) * sizeof(float)));
+        } else {
+            auto cdOp = kernels[5];
+
+            cdOp(EnqueueArgs(getQueue(), global_desc, local_desc), *d_desc,
+                 desc_len, histsz, *d_oriented_x, *d_oriented_y,
+                 *d_oriented_layer, *d_oriented_response, *d_oriented_size,
+                 *d_oriented_ori, oriented_feat, *gauss_pyr[o].data,
+                 gauss_pyr[o].info, d, n, scale, n_layers,
+                 Local(desc_len * (histsz + 1) * sizeof(float)));
+        }
+        CL_DEBUG_FINISH(getQueue());
+
+        total_feat += oriented_feat;
+        feat_pyr[o] = oriented_feat;
+
+        if (oriented_feat > 0) {
+            d_x_pyr.emplace_back(std::move(d_oriented_x));
+            d_y_pyr.emplace_back(std::move(d_oriented_y));
+            d_response_pyr.emplace_back(std::move(d_oriented_response));
+            d_ori_pyr.emplace_back(std::move(d_oriented_ori));
+            d_size_pyr.emplace_back(std::move(d_oriented_size));
+            d_desc_pyr.emplace_back(std::move(d_desc));
+        }
+    }
+
+    for (size_t i = 0; i < gauss_pyr.size(); i++) bufferFree(gauss_pyr[i].data);
+    for (size_t i = 0; i < dog_pyr.size(); i++) bufferFree(dog_pyr[i].data);
+
+    // If no features are found, set found features to 0 and return
+    if (total_feat == 0) {
+        *out_feat = 0;
+        return;
+    }
+
+    // Allocate output memory
+    x_out.info.dims[0]        = total_feat;
+    x_out.info.strides[0]     = 1;
+    y_out.info.dims[0]        = total_feat;
+    y_out.info.strides[0]     = 1;
+    score_out.info.dims[0]    = total_feat;
+    score_out.info.strides[0] = 1;
+    ori_out.info.dims[0]      = total_feat;
+    ori_out.info.strides[0]   = 1;
+    size_out.info.dims[0]     = total_feat;
+    size_out.info.strides[0]  = 1;
+
+    desc_out.info.dims[0]    = desc_len;
+    desc_out.info.strides[0] = 1;
+    desc_out.info.dims[1]    = total_feat;
+    desc_out.info.strides[1] = desc_out.info.dims[0];
+
+    for (int k = 1; k < 4; k++) {
+        x_out.info.dims[k] = 1;
+        x_out.info.strides[k] =
+            x_out.info.dims[k - 1] * x_out.info.strides[k - 1];
+        y_out.info.dims[k] = 1;
+        y_out.info.strides[k] =
+            y_out.info.dims[k - 1] * y_out.info.strides[k - 1];
+        score_out.info.dims[k] = 1;
+        score_out.info.strides[k] =
+            score_out.info.dims[k - 1] * score_out.info.strides[k - 1];
+        ori_out.info.dims[k] = 1;
+        ori_out.info.strides[k] =
+            ori_out.info.dims[k - 1] * ori_out.info.strides[k - 1];
+        size_out.info.dims[k] = 1;
+        size_out.info.strides[k] =
+            size_out.info.dims[k - 1] * size_out.info.strides[k - 1];
+        if (k > 1) {
+            desc_out.info.dims[k] = 1;
+            desc_out.info.strides[k] =
+                desc_out.info.dims[k - 1] * desc_out.info.strides[k - 1];
+        }
+    }
+
+    if (total_feat > 0) {
+        size_t out_sz  = total_feat * sizeof(float);
+        x_out.data     = bufferAlloc(out_sz);
+        y_out.data     = bufferAlloc(out_sz);
+        score_out.data = bufferAlloc(out_sz);
+        ori_out.data   = bufferAlloc(out_sz);
+        size_out.data  = bufferAlloc(out_sz);
+
+        size_t desc_sz = total_feat * desc_len * sizeof(float);
+        desc_out.data  = bufferAlloc(desc_sz);
+    }
+
+    unsigned offset = 0;
+    // The d_*_pyr vectors only hold entries for octaves that produced
+    // features, so they are indexed by b rather than by the octave index i
+    for (unsigned i = 0, b = 0; i < n_octaves; i++) {
+        if (feat_pyr[i] == 0) continue;
+
+        getQueue().enqueueCopyBuffer(*d_x_pyr[b], *x_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_y_pyr[b], *y_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_response_pyr[b], *score_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_ori_pyr[b], *ori_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_size_pyr[b], *size_out.data, 0,
+                                     offset * sizeof(float),
+                                     feat_pyr[i] * sizeof(float));
+        getQueue().enqueueCopyBuffer(*d_desc_pyr[b], *desc_out.data, 0,
+                                     offset * desc_len * sizeof(float),
+                                     feat_pyr[i] * desc_len * sizeof(float));
+
+        offset += feat_pyr[i];
+        b++;
+    }
+
+    // Sets number of output features and descriptor length
+    *out_feat = total_feat;
+    *out_dlen = desc_len;
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sift_nonfree.cl b/src/backend/opencl/kernel/sift_nonfree.cl
new file mode 100644
index 0000000000..e17403ed53
--- /dev/null
+++ b/src/backend/opencl/kernel/sift_nonfree.cl
@@ -0,0 +1,976 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// The source code contained in this file is based on the original code by
+// Rob Hess. Please note that SIFT is an algorithm patented and protected
+// by US law. Before using this code or any binary forms generated from it,
+// verify that you have permission to do so. The original license by Rob Hess
+// can be read below:
+//
+// Copyright (c) 2006-2012, Rob Hess <rob@iqengines.com>
+// All rights reserved.
+//
+// The following patent has been issued for methods embodied in this
+// software: "Method and apparatus for identifying scale invariant features
+// in an image and use of same for locating an object in an image," David
+// G. Lowe, US Patent 6,711,293 (March 23, 2004). Provisional application
+// filed March 8, 1999. Assignee: The University of British Columbia. For
+// further details, contact David Lowe (lowe@cs.ubc.ca) or the
+// University-Industry Liaison Office of the University of British
+// Columbia.
+//
+// Note that restrictions imposed by this patent (and possibly others)
+// exist independently of and may be in conflict with the freedoms granted
+// in this license, which refers to copyright of the program, not patents
+// for any methods that it implements.  Both copyright and patent law must
+// be obeyed to legally use and redistribute this program and it is not the
+// purpose of this license to induce you to infringe any patents or other
+// property right claims or to contest validity of any such claims.  If you
+// redistribute or use the program, then this license merely protects you
+// from committing copyright infringement.  It does not protect you from
+// committing patent infringement.  So, before you do anything with this
+// program, make sure that you have permission to do so not merely in terms
+// of copyright, but also in terms of patent law.
+//
+// Please note that this license is not to be understood as a guarantee
+// either.  If you use the program according to this license, but in
+// conflict with patent law, it does not mean that the licensor will refund
+// you for any losses that you incur if you are sued for your patent
+// infringement.
+//
+// Redistribution and use in source and binary forms, with or without
+// modification, are permitted provided that the following conditions are
+// met:
+//     * Redistributions of source code must retain the above copyright and
+//       patent notices, this list of conditions and the following
+//       disclaimer.
+//     * Redistributions in binary form must reproduce the above copyright
+//       notice, this list of conditions and the following disclaimer in
+//       the documentation and/or other materials provided with the
+//       distribution.
+//     * Neither the name of Oregon State University nor the names of its
+//       contributors may be used to endorse or promote products derived
+//       from this software without specific prior written permission.
+//
+// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS
+// IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+// TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+// PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+// HOLDER BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
+// EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
+// PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
+// PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
+// LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
+// NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+// SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+// width of border in which to ignore keypoints
+#define IMG_BORDER 5
+
+// maximum steps of keypoint interpolation before failure
+#define MAX_INTERP_STEPS 5
+
+// default number of bins in histogram for orientation assignment
+#define ORI_HIST_BINS 36
+
+// determines Gaussian sigma for orientation assignment
+#define ORI_SIG_FCTR 1.5f
+
+// determines the radius of the region used in orientation assignment
+#define ORI_RADIUS (3 * ORI_SIG_FCTR)
+
+// number of passes of orientation histogram smoothing
+#define SMOOTH_ORI_PASSES 2
+
+// orientation magnitude relative to max that results in new feature
+#define ORI_PEAK_RATIO 0.8f
+
+// determines the size of a single descriptor orientation histogram
+#define DESCR_SCL_FCTR 3.f
+
+// threshold on magnitude of elements of descriptor vector
+#define DESCR_MAG_THR 0.2f
+
+// factor used to convert floating-point descriptor to unsigned char
+#define INT_DESCR_FCTR 512.f
+
+__constant float GLOHRadii[3] = {6.f, 11.f, 15.f};
+
+#define PI_VAL 3.14159265358979323846f
+
+void gaussianElimination(float* A, float* b, float* x, const int n) {
+    // forward elimination
+    for (int i = 0; i < n - 1; i++) {
+        for (int j = i + 1; j < n; j++) {
+            float s = A[j * n + i] / A[i * n + i];
+
+            // for (int k = i+1; k < n; k++)
+            for (int k = i; k < n; k++) A[j * n + k] -= s * A[i * n + k];
+
+            b[j] -= s * b[i];
+        }
+    }
+
+    for (int i = 0; i < n; i++) x[i] = 0;
+
+    // backward substitution: solve from the last row up to the first
+    float sum = 0;
+    for (int i = n - 1; i >= 0; i--) {
+        sum = b[i];
+        for (int j = i + 1; j < n; j++) sum -= A[i * n + j] * x[j];
+        x[i] = sum / A[i * n + i];
+    }
+}
+
+inline void fatomic_add(volatile local float* source, const float operand) {
+    union {
+        unsigned int intVal;
+        float floatVal;
+    } newVal;
+    union {
+        unsigned int intVal;
+        float floatVal;
+    } prevVal;
+    do {
+        prevVal.floatVal = *source;
+        newVal.floatVal  = prevVal.floatVal + operand;
+    } while (atomic_cmpxchg((volatile local unsigned int*)source,
+                            prevVal.intVal, newVal.intVal) != prevVal.intVal);
+}
+
+inline void normalizeDesc(local float* desc, __local float* accum,
+                          const int histlen, int lid_x, int lid_y, int lsz_x) {
+    for (int i = lid_x; i < histlen; i += lsz_x)
+        accum[i] = desc[lid_y * histlen + i] * desc[lid_y * histlen + i];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float sum = 0.0f;
+    for (int i = 0; i < histlen; i++)
+        sum += desc[lid_y * histlen + i] * desc[lid_y * histlen + i];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lid_x < 64) accum[lid_x] += accum[lid_x + 64];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 32) accum[lid_x] += accum[lid_x + 32];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 16) accum[lid_x] += accum[lid_x + 16];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 8) accum[lid_x] += accum[lid_x + 8];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 4) accum[lid_x] += accum[lid_x + 4];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 2) accum[lid_x] += accum[lid_x + 2];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 1) accum[lid_x] += accum[lid_x + 1];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float len_sq  = accum[0];
+    float len_inv = 1.0f / sqrt(len_sq);
+
+    for (int i = lid_x; i < histlen; i += lsz_x) {
+        desc[lid_y * histlen + i] *= len_inv;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+}
+
+inline void normalizeGLOHDesc(local float* desc, __local float* accum,
+                              const int histlen, int lid_x, int lid_y,
+                              int lsz_x) {
+    for (int i = lid_x; i < histlen; i += lsz_x)
+        accum[i] = desc[lid_y * histlen + i] * desc[lid_y * histlen + i];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float sum = 0.0f;
+    for (int i = 0; i < histlen; i++)
+        sum += desc[lid_y * histlen + i] * desc[lid_y * histlen + i];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lid_x < 128) accum[lid_x] += accum[lid_x + 128];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 64) accum[lid_x] += accum[lid_x + 64];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 32) accum[lid_x] += accum[lid_x + 32];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 16)
+        // GLOH is 272-dimensional; also fold in the last 16 elements
+        accum[lid_x] += accum[lid_x + 16] + accum[lid_x + 256];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 8) accum[lid_x] += accum[lid_x + 8];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 4) accum[lid_x] += accum[lid_x + 4];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 2) accum[lid_x] += accum[lid_x + 2];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 1) accum[lid_x] += accum[lid_x + 1];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float len_sq  = accum[0];
+    float len_inv = 1.0f / sqrt(len_sq);
+
+    for (int i = lid_x; i < histlen; i += lsz_x) {
+        desc[lid_y * histlen + i] *= len_inv;
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+}
+
+kernel void sub(global T* out, global const T* in, unsigned nel,
+                unsigned n_layers) {
+    unsigned i = get_global_id(0);
+
+    if (i < nel) {
+        for (unsigned l = 0; l < n_layers; l++)
+            out[l * nel + i] = in[l * nel + i] - in[(l + 1) * nel + i];
+    }
+}
+
+#define LCPTR(Y, X) (l_center[(Y)*l_i + (X)])
+#define LPPTR(Y, X) (l_prev[(Y)*l_i + (X)])
+#define LNPTR(Y, X) (l_next[(Y)*l_i + (X)])
+
+// Determines whether a pixel is a scale-space extremum by comparing it to its
+// 3x3x3 pixel neighborhood.
+kernel void detectExtrema(global float* x_out, global float* y_out,
+                          global unsigned* layer_out,
+                          global unsigned* counter, global const T* dog,
+                          KParam iDoG, const unsigned max_feat,
+                          const float threshold, local float* l_mem) {
+    const int dim0 = iDoG.dims[0];
+    const int dim1 = iDoG.dims[1];
+    const int imel = iDoG.dims[0] * iDoG.dims[1];
+
+    const int lid_i = get_local_id(0);
+    const int lid_j = get_local_id(1);
+    const int lsz_i = get_local_size(0);
+    const int lsz_j = get_local_size(1);
+    const int i     = get_group_id(0) * lsz_i + lid_i + IMG_BORDER;
+    const int j     = get_group_id(1) * lsz_j + lid_j + IMG_BORDER;
+
+    // One pixel border for each side
+    const int l_i = lsz_i + 2;
+    const int l_j = lsz_j + 2;
+
+    local float* l_prev   = l_mem;
+    local float* l_center = l_mem + l_i * l_j;
+    local float* l_next   = l_mem + l_i * l_j * 2;
+
+    const int x = lid_i + 1;
+    const int y = lid_j + 1;
+
+    for (int l = 1; l < iDoG.dims[2] - 1; l++) {
+        const int l_i_half = l_i / 2;
+        const int l_j_half = l_j / 2;
+        if (lid_i < l_i_half && lid_j < l_j_half && i < dim0 - IMG_BORDER + 1 &&
+            j < dim1 - IMG_BORDER + 1) {
+            l_next[lid_j * l_i + lid_i] =
+                (float)dog[(l + 1) * imel + (j - 1) * dim0 + i - 1];
+            l_center[lid_j * l_i + lid_i] =
+                (float)dog[(l)*imel + (j - 1) * dim0 + i - 1];
+            l_prev[lid_j * l_i + lid_i] =
+                (float)dog[(l - 1) * imel + (j - 1) * dim0 + i - 1];
+
+            l_next[lid_j * l_i + lid_i + l_i_half] =
+                (float)dog[(l + 1) * imel + (j - 1) * dim0 + i - 1 + l_i_half];
+            l_center[lid_j * l_i + lid_i + l_i_half] =
+                (float)dog[(l)*imel + (j - 1) * dim0 + i - 1 + l_i_half];
+            l_prev[lid_j * l_i + lid_i + l_i_half] =
+                (float)dog[(l - 1) * imel + (j - 1) * dim0 + i - 1 + l_i_half];
+
+            l_next[(lid_j + l_j_half) * l_i + lid_i] =
+                (float)dog[(l + 1) * imel + (j - 1 + l_j_half) * dim0 + i - 1];
+            l_center[(lid_j + l_j_half) * l_i + lid_i] =
+                (float)dog[(l)*imel + (j - 1 + l_j_half) * dim0 + i - 1];
+            l_prev[(lid_j + l_j_half) * l_i + lid_i] =
+                (float)dog[(l - 1) * imel + (j - 1 + l_j_half) * dim0 + i - 1];
+
+            l_next[(lid_j + l_j_half) * l_i + lid_i + l_i_half] =
+                (float)dog[(l + 1) * imel + (j - 1 + l_j_half) * dim0 + i - 1 +
+                           l_i_half];
+            l_center[(lid_j + l_j_half) * l_i + lid_i + l_i_half] = (float)
+                dog[(l)*imel + (j - 1 + l_j_half) * dim0 + i - 1 + l_i_half];
+            l_prev[(lid_j + l_j_half) * l_i + lid_i + l_i_half] =
+                (float)dog[(l - 1) * imel + (j - 1 + l_j_half) * dim0 + i - 1 +
+                           l_i_half];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        if (i < dim0 - IMG_BORDER && j < dim1 - IMG_BORDER) {
+            const int l_i_half = l_i / 2;
+            float p            = l_center[y * l_i + x];
+
+            if (fabs((float)p) > threshold &&
+                ((p > 0 && p > LCPTR(y - 1, x - 1) && p > LCPTR(y - 1, x) &&
+                  p > LCPTR(y - 1, x + 1) && p > LCPTR(y, x - 1) &&
+                  p > LCPTR(y, x + 1) && p > LCPTR(y + 1, x - 1) &&
+                  p > LCPTR(y + 1, x) && p > LCPTR(y + 1, x + 1) &&
+                  p > LPPTR(y - 1, x - 1) && p > LPPTR(y - 1, x) &&
+                  p > LPPTR(y - 1, x + 1) && p > LPPTR(y, x - 1) &&
+                  p > LPPTR(y, x) && p > LPPTR(y, x + 1) &&
+                  p > LPPTR(y + 1, x - 1) && p > LPPTR(y + 1, x) &&
+                  p > LPPTR(y + 1, x + 1) && p > LNPTR(y - 1, x - 1) &&
+                  p > LNPTR(y - 1, x) && p > LNPTR(y - 1, x + 1) &&
+                  p > LNPTR(y, x - 1) && p > LNPTR(y, x) &&
+                  p > LNPTR(y, x + 1) && p > LNPTR(y + 1, x - 1) &&
+                  p > LNPTR(y + 1, x) && p > LNPTR(y + 1, x + 1)) ||
+                 (p < 0 && p < LCPTR(y - 1, x - 1) && p < LCPTR(y - 1, x) &&
+                  p < LCPTR(y - 1, x + 1) && p < LCPTR(y, x - 1) &&
+                  p < LCPTR(y, x + 1) && p < LCPTR(y + 1, x - 1) &&
+                  p < LCPTR(y + 1, x) && p < LCPTR(y + 1, x + 1) &&
+                  p < LPPTR(y - 1, x - 1) && p < LPPTR(y - 1, x) &&
+                  p < LPPTR(y - 1, x + 1) && p < LPPTR(y, x - 1) &&
+                  p < LPPTR(y, x) && p < LPPTR(y, x + 1) &&
+                  p < LPPTR(y + 1, x - 1) && p < LPPTR(y + 1, x) &&
+                  p < LPPTR(y + 1, x + 1) && p < LNPTR(y - 1, x - 1) &&
+                  p < LNPTR(y - 1, x) && p < LNPTR(y - 1, x + 1) &&
+                  p < LNPTR(y, x - 1) && p < LNPTR(y, x) &&
+                  p < LNPTR(y, x + 1) && p < LNPTR(y + 1, x - 1) &&
+                  p < LNPTR(y + 1, x) && p < LNPTR(y + 1, x + 1)))) {
+                unsigned idx = atomic_inc(counter);
+                if (idx < max_feat) {
+                    x_out[idx]     = (float)j;
+                    y_out[idx]     = (float)i;
+                    layer_out[idx] = l;
+                }
+            }
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+}
+
+#undef LCPTR
+#undef LPPTR
+#undef LNPTR
+#define CPTR(Y, X) (center[(Y)*dim0 + (X)])
+#define PPTR(Y, X) (prev[(Y)*dim0 + (X)])
+#define NPTR(Y, X) (next[(Y)*dim0 + (X)])
+
+// Interpolates a scale-space extremum's location and scale to subpixel
+// accuracy to form an image feature. Rejects features with low contrast.
+// Based on Section 4 of Lowe's paper.
+kernel void interpolateExtrema(
+    global float* x_out, __global float* y_out, __global unsigned* layer_out,
+    global float* response_out, __global float* size_out,
+    global unsigned* counter, __global const float* x_in,
+    global const float* y_in, __global const unsigned* layer_in,
+    const unsigned extrema_feat, global const T* dog_octave, KParam iDoG,
+    const unsigned max_feat, const unsigned octave, const unsigned n_layers,
+    const float contrast_thr, const float edge_thr, const float sigma,
+    const float img_scale) {
+    const unsigned f = get_global_id(0);
+
+    if (f < extrema_feat) {
+        const float first_deriv_scale  = img_scale * 0.5f;
+        const float second_deriv_scale = img_scale;
+        const float cross_deriv_scale  = img_scale * 0.25f;
+
+        float xl = 0, xy = 0, xx = 0, contr = 0;
+        int i = 0;
+
+        unsigned x     = x_in[f];
+        unsigned y     = y_in[f];
+        unsigned layer = layer_in[f];
+
+        const int dim0 = iDoG.dims[0];
+        const int dim1 = iDoG.dims[1];
+        const int imel = dim0 * dim1;
+
+        global const T* prev   = dog_octave + (int)((layer - 1) * imel);
+        global const T* center = dog_octave + (int)((layer)*imel);
+        global const T* next   = dog_octave + (int)((layer + 1) * imel);
+
+        for (i = 0; i < MAX_INTERP_STEPS; i++) {
+            float dD[3] = {
+                (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+                (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+                (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+
+            float d2 = CPTR(x, y) * 2.f;
+            float dxx =
+                (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+            float dyy =
+                (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+            float dss = (NPTR(x, y) + PPTR(x, y) - d2) * second_deriv_scale;
+            float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                         CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                        cross_deriv_scale;
+            float dxs = (NPTR(x + 1, y) - NPTR(x - 1, y) - PPTR(x + 1, y) +
+                         PPTR(x - 1, y)) *
+                        cross_deriv_scale;
+            float dys = (NPTR(x, y + 1) - NPTR(x, y - 1) - PPTR(x, y + 1) +
+                         PPTR(x, y - 1)) *
+                        cross_deriv_scale;
+
+            float H[9] = {dxx, dxy, dxs, dxy, dyy, dys, dxs, dys, dss};
+
+            float X[3];
+            gaussianElimination(H, dD, X, 3);
+
+            xl = -X[2];
+            xy = -X[1];
+            xx = -X[0];
+
+            if (fabs(xl) < 0.5f && fabs(xy) < 0.5f && fabs(xx) < 0.5f) break;
+
+            x += round(xx);
+            y += round(xy);
+            layer += round(xl);
+
+            if (layer < 1 || layer > n_layers || x < IMG_BORDER ||
+                x >= dim1 - IMG_BORDER || y < IMG_BORDER ||
+                y >= dim0 - IMG_BORDER)
+                return;
+        }
+
+        // ensure convergence of interpolation
+        if (i >= MAX_INTERP_STEPS) return;
+
+        float dD[3] = {
+            (float)(CPTR(x + 1, y) - CPTR(x - 1, y)) * first_deriv_scale,
+            (float)(CPTR(x, y + 1) - CPTR(x, y - 1)) * first_deriv_scale,
+            (float)(NPTR(x, y) - PPTR(x, y)) * first_deriv_scale};
+        float X[3] = {xx, xy, xl};
+
+        float P = dD[0] * X[0] + dD[1] * X[1] + dD[2] * X[2];
+
+        contr = center[x * dim0 + y] * img_scale + P * 0.5f;
+        if (fabs(contr) < (contrast_thr / n_layers)) return;
+
+        // principal curvatures are computed using the trace and det of Hessian
+        float d2  = CPTR(x, y) * 2.f;
+        float dxx = (CPTR(x + 1, y) + CPTR(x - 1, y) - d2) * second_deriv_scale;
+        float dyy = (CPTR(x, y + 1) + CPTR(x, y - 1) - d2) * second_deriv_scale;
+        float dxy = (CPTR(x + 1, y + 1) - CPTR(x - 1, y + 1) -
+                     CPTR(x + 1, y - 1) + CPTR(x - 1, y - 1)) *
+                    cross_deriv_scale;
+
+        float tr  = dxx + dyy;
+        float det = dxx * dyy - dxy * dxy;
+
+        // add FLT_EPSILON for double-precision compatibility
+        if (det <= 0 || tr * tr * edge_thr >=
+                            (edge_thr + 1) * (edge_thr + 1) * det + FLT_EPSILON)
+            return;
+
+        unsigned ridx = atomic_inc(counter);
+
+        if (ridx < max_feat) {
+            x_out[ridx]        = (x + xx) * (1 << octave);
+            y_out[ridx]        = (y + xy) * (1 << octave);
+            layer_out[ridx]    = layer;
+            response_out[ridx] = fabs(contr);
+            size_out[ridx] =
+                sigma * pow(2.f, octave + (layer + xl) / n_layers) * 2.f;
+        }
+    }
+}
+
+#undef CPTR
+#undef PPTR
+#undef NPTR
+
+// Remove duplicate keypoints
+kernel void removeDuplicates(
+    global float* x_out, __global float* y_out, __global unsigned* layer_out,
+    global float* response_out, __global float* size_out,
+    global unsigned* counter, __global const float* x_in,
+    global const float* y_in, __global const unsigned* layer_in,
+    global const float* response_in, __global const float* size_in,
+    const unsigned total_feat) {
+    const unsigned f = get_global_id(0);
+
+    if (f < total_feat) {
+        const float prec_fctr = 1e4f;
+
+        bool cond = (f < total_feat - 1)
+                        ? !(round(x_in[f] * prec_fctr) ==
+                                round(x_in[f + 1] * prec_fctr) &&
+                            round(y_in[f] * prec_fctr) ==
+                                round(y_in[f + 1] * prec_fctr) &&
+                            layer_in[f] == layer_in[f + 1] &&
+                            round(response_in[f] * prec_fctr) ==
+                                round(response_in[f + 1] * prec_fctr) &&
+                            round(size_in[f] * prec_fctr) ==
+                                round(size_in[f + 1] * prec_fctr))
+                        : true;
+
+        if (cond) {
+            unsigned idx = atomic_inc(counter);
+
+            x_out[idx]        = x_in[f];
+            y_out[idx]        = y_in[f];
+            layer_out[idx]    = layer_in[f];
+            response_out[idx] = response_in[f];
+            size_out[idx]     = size_in[f];
+        }
+    }
+}
+
+#define IPTR(Y, X) (img[(Y)*dim0 + (X)])
+
+// Computes a canonical orientation for each image feature in an array.  Based
+// on Section 5 of Lowe's paper.  This function adds features to the array when
+// there is more than one dominant orientation at a given feature location.
+kernel void calcOrientation(
+    global float* x_out, __global float* y_out, __global unsigned* layer_out,
+    global float* response_out, __global float* size_out,
+    global float* ori_out, __global unsigned* counter,
+    global const float* x_in, __global const float* y_in,
+    global const unsigned* layer_in, __global const float* response_in,
+    global const float* size_in, const unsigned total_feat,
+    global const T* gauss_octave, KParam iGauss, const unsigned max_feat,
+    const unsigned octave, const int double_input, local float* l_mem) {
+    const int lid_x = get_local_id(0);
+    const int lid_y = get_local_id(1);
+    const int lsz_x = get_local_size(0);
+
+    const unsigned f = get_global_id(1);
+
+    const int n = ORI_HIST_BINS;
+
+    local float* hist     = l_mem;
+    local float* temphist = l_mem + n * 8;
+
+    // Zero the orientation histogram
+    for (int i = lid_x; i < n; i += lsz_x) { hist[lid_y * n + i] = 0.f; }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    float real_x, real_y, response, size;
+    unsigned layer;
+
+    if (f < total_feat) {
+        // Load keypoint information
+        real_x   = x_in[f];
+        real_y   = y_in[f];
+        layer    = layer_in[f];
+        response = response_in[f];
+        size     = size_in[f];
+
+        const int pt_x = (int)round(real_x / (1 << octave));
+        const int pt_y = (int)round(real_y / (1 << octave));
+
+        // Calculate auxiliary parameters
+        const float scl_octv  = size * 0.5f / (1 << octave);
+        const int radius      = (int)round(ORI_RADIUS * scl_octv);
+        const float sigma     = ORI_SIG_FCTR * scl_octv;
+        const int len         = (radius * 2 + 1);
+        const float exp_denom = 2.f * sigma * sigma;
+
+        const int dim0 = iGauss.dims[0];
+        const int dim1 = iGauss.dims[1];
+
+        // Calculate layer offset
+        const int layer_offset = layer * dim0 * dim1;
+        global const T* img  = gauss_octave + layer_offset;
+
+        // Calculate orientation histogram
+        for (int l = lid_x; l < len * len; l += lsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = pt_y + i;
+            int x = pt_x + j;
+            if (y < 1 || y >= dim0 - 1 || x < 1 || x >= dim1 - 1) continue;
+
+            float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+            float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+            float mag = sqrt(dx * dx + dy * dy);
+            float ori = atan2(dy, dx);
+            float w   = exp(-(i * i + j * j) / exp_denom);
+
+            int bin = round(n * (ori + PI_VAL) / (2.f * PI_VAL));
+            bin     = bin < n ? bin : 0;
+            bin     = (bin < 0) ? 0 : (bin >= n) ? n - 1 : bin;
+
+            fatomic_add(&hist[lid_y * n + bin], w * mag);
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    for (int i = 0; i < SMOOTH_ORI_PASSES; i++) {
+        for (int j = lid_x; j < n; j += lsz_x) {
+            temphist[lid_y * n + j] = hist[lid_y * n + j];
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+        for (int j = lid_x; j < n; j += lsz_x) {
+            float prev = (j == 0) ? temphist[lid_y * n + n - 1]
+                                  : temphist[lid_y * n + j - 1];
+            float next = (j + 1 == n) ? temphist[lid_y * n]
+                                      : temphist[lid_y * n + j + 1];
+            hist[lid_y * n + j] =
+                0.25f * prev + 0.5f * temphist[lid_y * n + j] + 0.25f * next;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    for (int i = lid_x; i < n; i += lsz_x)
+        temphist[lid_y * n + i] = hist[lid_y * n + i];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (lid_x < 16)
+        temphist[lid_y * n + lid_x] =
+            fmax(hist[lid_y * n + lid_x], hist[lid_y * n + lid_x + 16]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 8)
+        temphist[lid_y * n + lid_x] =
+            fmax(temphist[lid_y * n + lid_x], temphist[lid_y * n + lid_x + 8]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 4) {
+        temphist[lid_y * n + lid_x] =
+            fmax(temphist[lid_y * n + lid_x], hist[lid_y * n + lid_x + 32]);
+        temphist[lid_y * n + lid_x] =
+            fmax(temphist[lid_y * n + lid_x], temphist[lid_y * n + lid_x + 4]);
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 2)
+        temphist[lid_y * n + lid_x] =
+            fmax(temphist[lid_y * n + lid_x], temphist[lid_y * n + lid_x + 2]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+    if (lid_x < 1)
+        temphist[lid_y * n + lid_x] =
+            fmax(temphist[lid_y * n + lid_x], temphist[lid_y * n + lid_x + 1]);
+    barrier(CLK_LOCAL_MEM_FENCE);
+    float omax = temphist[lid_y * n];
+
+    if (f < total_feat) {
+        float mag_thr = (float)(omax * ORI_PEAK_RATIO);
+        int l, r;
+        float bin;
+        for (int j = lid_x; j < n; j += lsz_x) {
+            l = (j == 0) ? n - 1 : j - 1;
+            r = (j + 1) % n;
+            if (hist[lid_y * n + j] > hist[lid_y * n + l] &&
+                hist[lid_y * n + j] > hist[lid_y * n + r] &&
+                hist[lid_y * n + j] >= mag_thr) {
+                unsigned idx = atomic_inc(counter);
+
+                if (idx < max_feat) {
+                    float bin =
+                        j +
+                        0.5f * (hist[lid_y * n + l] - hist[lid_y * n + r]) /
+                            (hist[lid_y * n + l] - 2.0f * hist[lid_y * n + j] +
+                             hist[lid_y * n + r]);
+                    bin = (bin < 0.0f) ? bin + n : (bin >= n) ? bin - n : bin;
+                    float ori = 360.f - ((360.f / n) * bin);
+
+                    float new_real_x = real_x;
+                    float new_real_y = real_y;
+                    float new_size   = size;
+
+                    if (double_input != 0) {
+                        float scale = 0.5f;
+                        new_real_x *= scale;
+                        new_real_y *= scale;
+                        new_size *= scale;
+                    }
+
+                    x_out[idx]        = new_real_x;
+                    y_out[idx]        = new_real_y;
+                    layer_out[idx]    = layer;
+                    response_out[idx] = response;
+                    size_out[idx]     = new_size;
+                    ori_out[idx]      = ori;
+                }
+            }
+        }
+    }
+}
+
+// Computes feature descriptors for features in an array.  Based on Section 6
+// of Lowe's paper.
+kernel void computeDescriptor(
+    global float* desc_out, const unsigned desc_len, const unsigned histsz,
+    global const float* x_in, __global const float* y_in,
+    global const unsigned* layer_in, __global const float* response_in,
+    global const float* size_in, __global const float* ori_in,
+    const unsigned total_feat, global const T* gauss_octave, KParam iGauss,
+    const int d, const int n, const float scale, const int n_layers,
+    local float* l_mem) {
+    const int lid_x = get_local_id(0);
+    const int lid_y = get_local_id(1);
+    const int lsz_x = get_local_size(0);
+
+    const int f = get_global_id(1);
+
+    local float* desc  = l_mem;
+    local float* accum = l_mem + desc_len * histsz;
+
+    for (int i = lid_x; i < desc_len * histsz; i += lsz_x)
+        desc[lid_y * desc_len + i] = 0.f;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (f < total_feat) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        // Points img to correct Gaussian pyramid layer
+        const int dim0        = iGauss.dims[0];
+        const int dim1        = iGauss.dims[1];
+        global const T* img = gauss_octave + (layer * dim0 * dim1);
+
+        float cos_t        = cos(ori);
+        float sin_t        = sin(ori);
+        float bins_per_rad = n / (PI_VAL * 2.f);
+        float exp_denom    = d * d * 0.5f;
+        float hist_width   = DESCR_SCL_FCTR * size * scale * 0.5f;
+        int radius         = hist_width * sqrt(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        int len            = radius * 2 + 1;
+        const int hist_off = (lid_x % histsz) * desc_len;
+
+        // Calculate orientation histogram
+        for (int l = lid_x; l < len * len; l += lsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t) / hist_width;
+            float y_rot = (j * sin_t + i * cos_t) / hist_width;
+            float xbin  = x_rot + d / 2 - 0.5f;
+            float ybin  = y_rot + d / 2 - 0.5f;
+
+            if (ybin > -1.0f && ybin < d && xbin > -1.0f && xbin < d && y > 0 &&
+                y < dim0 - 1 && x > 0 && x < dim1 - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrt(dx * dx + dy * dy);
+                float grad_ori = atan2(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-(x_rot * x_rot + y_rot * y_rot) / exp_denom);
+                float obin = grad_ori * bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int x0 = floor(xbin);
+                int y0 = floor(ybin);
+                int o0 = floor(obin);
+                xbin -= x0;
+                ybin -= y0;
+                obin -= o0;
+
+                for (int yl = 0; yl <= 1; yl++) {
+                    int yb = y0 + yl;
+                    if (yb >= 0 && yb < d) {
+                        float v_y = mag * ((yl == 0) ? 1.0f - ybin : ybin);
+                        for (int xl = 0; xl <= 1; xl++) {
+                            int xb = x0 + xl;
+                            if (xb >= 0 && xb < d) {
+                                float v_x =
+                                    v_y * ((xl == 0) ? 1.0f - xbin : xbin);
+                                for (int ol = 0; ol <= 1; ol++) {
+                                    int ob = (o0 + ol) % n;
+                                    float v_o =
+                                        v_x * ((ol == 0) ? 1.0f - obin : obin);
+                                    fatomic_add(
+                                        &desc[hist_off + lid_y * desc_len +
+                                              (yb * d + xb) * n + ob],
+                                        v_o);
+                                }
+                            }
+                        }
+                    }
+                }
+            }
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // Combine partial histograms (kept separate to cut atomic contention)
+    for (int l = lid_x; l < desc_len * 4; l += lsz_x)
+        desc[l] += desc[l + 4 * desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    for (int l = lid_x; l < desc_len * 2; l += lsz_x)
+        desc[l] += desc[l + 2 * desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    for (int l = lid_x; l < desc_len; l += lsz_x) desc[l] += desc[l + desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    normalizeDesc(desc, accum, desc_len, lid_x, lid_y, lsz_x);
+
+    for (int i = lid_x; i < d * d * n; i += lsz_x)
+        desc[lid_y * desc_len + i] =
+            min(desc[lid_y * desc_len + i], DESCR_MAG_THR);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    normalizeDesc(desc, accum, desc_len, lid_x, lid_y, lsz_x);
+
+    if (f < total_feat) {
+        // Calculate final descriptor values
+        for (int k = lid_x; k < d * d * n; k += lsz_x)
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[lid_y * desc_len + k] * INT_DESCR_FCTR));
+    }
+}
+
+kernel void computeGLOHDescriptor(
+    global float* desc_out, const unsigned desc_len, const unsigned histsz,
+    global const float* x_in, __global const float* y_in,
+    global const unsigned* layer_in, __global const float* response_in,
+    global const float* size_in, __global const float* ori_in,
+    const unsigned total_feat, global const T* gauss_octave, KParam iGauss,
+    const int d, const unsigned rb, const unsigned ab, const unsigned hb,
+    const float scale, const int n_layers, local float* l_mem) {
+    const int lid_x = get_local_id(0);
+    const int lid_y = get_local_id(1);
+    const int lsz_x = get_local_size(0);
+
+    const int f = get_global_id(1);
+
+    local float* desc  = l_mem;
+    local float* accum = l_mem + desc_len * histsz;
+
+    for (int i = lid_x; i < desc_len * histsz; i += lsz_x)
+        desc[lid_y * desc_len + i] = 0.f;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (f < total_feat) {
+        const unsigned layer = layer_in[f];
+        float ori            = (360.f - ori_in[f]) * PI_VAL / 180.f;
+        ori                  = (ori > PI_VAL) ? ori - PI_VAL * 2 : ori;
+        const float size     = size_in[f];
+        const int fx         = round(x_in[f] * scale);
+        const int fy         = round(y_in[f] * scale);
+
+        // Points img to correct Gaussian pyramid layer
+        const int dim0        = iGauss.dims[0];
+        const int dim1        = iGauss.dims[1];
+        global const T* img = gauss_octave + (layer * dim0 * dim1);
+
+        float cos_t              = cos(ori);
+        float sin_t              = sin(ori);
+        float hist_bins_per_rad  = hb / (PI_VAL * 2.f);
+        float polar_bins_per_rad = ab / (PI_VAL * 2.f);
+        float exp_denom          = GLOHRadii[rb - 1] * 0.5f;
+
+        float hist_width = DESCR_SCL_FCTR * size * scale * 0.5f;
+
+        // Keep same descriptor radius used for SIFT
+        int radius = hist_width * sqrt(2.f) * (d + 1.f) * 0.5f + 0.5f;
+
+        // Alternative radius calculation: varying the radius weight (rw)
+        // within 0.25f-0.75f changes the results. Increasing rw tends to
+        // give a better recall rate, but with a smaller number of correct
+        // matches.
+        // float rw = 0.5f;
+        // int radius = hist_width * GLOHRadii[rb-1] * rw + 0.5f;
+
+        int len            = radius * 2 + 1;
+        const int hist_off = (lid_x % histsz) * desc_len;
+
+        // Calculate orientation histogram
+        for (int l = lid_x; l < len * len; l += lsz_x) {
+            int i = l / len - radius;
+            int j = l % len - radius;
+
+            int y = fy + i;
+            int x = fx + j;
+
+            float x_rot = (j * cos_t - i * sin_t);
+            float y_rot = (j * sin_t + i * cos_t);
+
+            float r = sqrt(x_rot * x_rot + y_rot * y_rot) / radius *
+                      GLOHRadii[rb - 1];
+            float theta = atan2(y_rot, x_rot);
+            while (theta < 0.0f) theta += PI_VAL * 2;
+            while (theta >= PI_VAL * 2) theta -= PI_VAL * 2;
+
+            float tbin = theta * polar_bins_per_rad;
+            float rbin =
+                (r < GLOHRadii[0])
+                    ? r / GLOHRadii[0]
+                    : ((r < GLOHRadii[1])
+                           ? 1 + (r - GLOHRadii[0]) /
+                                     (float)(GLOHRadii[1] - GLOHRadii[0])
+                           : min(2 + (r - GLOHRadii[1]) /
+                                         (float)(GLOHRadii[2] - GLOHRadii[1]),
+                                 3.f - FLT_EPSILON));
+
+            if (r <= GLOHRadii[rb - 1] && y > 0 && y < dim0 - 1 && x > 0 &&
+                x < dim1 - 1) {
+                float dx = (float)(IPTR(x + 1, y) - IPTR(x - 1, y));
+                float dy = (float)(IPTR(x, y - 1) - IPTR(x, y + 1));
+
+                float grad_mag = sqrt(dx * dx + dy * dy);
+                float grad_ori = atan2(dy, dx) - ori;
+                while (grad_ori < 0.0f) grad_ori += PI_VAL * 2;
+                while (grad_ori >= PI_VAL * 2) grad_ori -= PI_VAL * 2;
+
+                float w    = exp(-r / exp_denom);
+                float obin = grad_ori * hist_bins_per_rad;
+                float mag  = grad_mag * w;
+
+                int t0 = floor(tbin);
+                int r0 = floor(rbin);
+                int o0 = floor(obin);
+                tbin -= t0;
+                rbin -= r0;
+                obin -= o0;
+
+                for (int rl = 0; rl <= 1; rl++) {
+                    int rb    = (rbin > 0.5f) ? (r0 + rl) : (r0 - rl);
+                    float v_r = mag * ((rl == 0) ? 1.0f - rbin : rbin);
+                    if (rb >= 0 && rb <= 2) {
+                        for (int tl = 0; tl <= 1; tl++) {
+                            int tb    = (t0 + tl) % ab;
+                            float v_t = v_r * ((tl == 0) ? 1.0f - tbin : tbin);
+                            for (int ol = 0; ol <= 1; ol++) {
+                                int ob = (o0 + ol) % hb;
+                                float v_o =
+                                    v_t * ((ol == 0) ? 1.0f - obin : obin);
+                                unsigned idx =
+                                    (rb > 0) *
+                                        (hb + ((rb - 1) * ab + tb) * hb) +
+                                    ob;
+                                fatomic_add(
+                                    &desc[hist_off + lid_y * desc_len + idx],
+                                    v_o);
+                            }
+                        }
+                    }
+                }
+            }
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    // Combine histograms (reduces previous atomicAdd overhead)
+    for (int l = lid_x; l < desc_len * 4; l += lsz_x)
+        desc[l] += desc[l + 4 * desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    for (int l = lid_x; l < desc_len * 2; l += lsz_x)
+        desc[l] += desc[l + 2 * desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+    for (int l = lid_x; l < desc_len; l += lsz_x) desc[l] += desc[l + desc_len];
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    normalizeGLOHDesc(desc, accum, desc_len, lid_x, lid_y, lsz_x);
+
+    for (int i = lid_x; i < desc_len; i += lsz_x)
+        desc[lid_y * desc_len + i] =
+            min(desc[lid_y * desc_len + i], DESCR_MAG_THR);
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    normalizeGLOHDesc(desc, accum, desc_len, lid_x, lid_y, lsz_x);
+
+    if (f < total_feat) {
+        // Calculate final descriptor values
+        for (int k = lid_x; k < desc_len; k += lsz_x)
+            desc_out[f * desc_len + k] =
+                round(min(255.f, desc[lid_y * desc_len + k] * INT_DESCR_FCTR));
+    }
+}
+
+#undef IPTR
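Note on the histogram layout in computeGLOHDescriptor: the index expression in the innermost loop suggests that the central ring is a single block of hb orientation bins, while each outer ring is split into ab angular sectors of hb bins each. The host-side C++ sketch below mirrors that expression so the layout can be checked in isolation. The function names glohBinIndex and glohDescLen are hypothetical and not part of ArrayFire, and the descriptor length formula is inferred from the index expression rather than stated in the patch.

```cpp
#include <cassert>

// Hypothetical helper: descriptor length implied by the kernel's bin layout,
// i.e. one central block of hb bins plus (rings - 1) * ab sectors of hb bins.
unsigned glohDescLen(unsigned rings, unsigned ab, unsigned hb) {
    return hb * (1u + (rings - 1u) * ab);
}

// Hypothetical helper: flat bin index for radial bin `ring`, angular bin `tb`
// and orientation bin `ob`. Mirrors
//   (rb > 0) * (hb + ((rb - 1) * ab + tb) * hb) + ob
// from the kernel; the angular bin is ignored for the central ring.
unsigned glohBinIndex(unsigned ring, unsigned tb, unsigned ob, unsigned ab,
                      unsigned hb) {
    assert(tb < ab && ob < hb);
    if (ring == 0u) return ob;
    return hb + ((ring - 1u) * ab + tb) * hb + ob;
}
```

With the illustrative values rings = 3, ab = 8 and hb = 16 (assumed here, not taken from the patch), the last bin of the outermost ring lands on index glohDescLen(3, 8, 16) - 1, so the indices fill the descriptor without gaps.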
diff --git a/src/backend/opencl/kernel/sobel.cl b/src/backend/opencl/kernel/sobel.cl
index fbb5671dff..04bc2565f0 100644
--- a/src/backend/opencl/kernel/sobel.cl
+++ b/src/backend/opencl/kernel/sobel.cl
@@ -7,86 +7,73 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-Ti load2LocalMem(global const Ti * in,
-               int dim0, int dim1,
-               int gx, int gy,
-               int inStride1, int inStride0)
-{
-    if (gx<0 || gx>=dim0 || gy<0 || gy>=dim1)
-        return (Ti)0;
-    else
-        return in[gx*inStride0+gy*inStride1];
+int reflect101(int index, int endIndex) {
+    return abs(endIndex - (int)abs(endIndex - index));
 }
 
-kernel
-void sobel3x3(global To * dx, KParam dxInfo,
-              global To * dy, KParam dyInfo,
-              global const Ti * in, KParam iInfo,
-              local        Ti * localMem,
-              int nBBS0, int nBBS1)
-{
+Ti load2LocalMem(global const Ti* in, int d0, int d1, int gx, int gy,
+                 int inStride1, int inStride0) {
+    int idx =
+        reflect101(gx, d0 - 1) * inStride0 + reflect101(gy, d1 - 1) * inStride1;
+    return in[idx];
+}
+
+kernel void sobel3x3(global To* dx, KParam dxInfo, global To* dy, KParam dyInfo,
+                     global const Ti* in, KParam iInfo, local Ti* localMem,
+                     int nBBS0, int nBBS1) {
     const int radius  = 1;
-    const int padding = 2*radius;
+    const int padding = 2 * radius;
     const int shrdLen = get_local_size(0) + padding;
 
     unsigned b2 = get_group_id(0) / nBBS0;
     unsigned b3 = get_group_id(1) / nBBS1;
-    global const Ti* iptr = in + (b2 * iInfo.strides[2]  + b3 * iInfo.strides[3] + iInfo.offset);
-    global To*      dxptr = dx + (b2 * dxInfo.strides[2] + b3 * dxInfo.strides[3]);
-    global To*      dyptr = dy + (b2 * dyInfo.strides[2] + b3 * dyInfo.strides[3]);
+    global const Ti* iptr =
+        in + (b2 * iInfo.strides[2] + b3 * iInfo.strides[3] + iInfo.offset);
+    global To* dxptr = dx + (b2 * dxInfo.strides[2] + b3 * dxInfo.strides[3]);
+    global To* dyptr = dy + (b2 * dyInfo.strides[2] + b3 * dyInfo.strides[3]);
 
     int lx = get_local_id(0);
     int ly = get_local_id(1);
 
-    int gx = get_local_size(0) * (get_group_id(0)-b2*nBBS0) + lx;
-    int gy = get_local_size(1) * (get_group_id(1)-b3*nBBS1) + ly;
-
-    int lx2 = lx + get_local_size(0);
-    int ly2 = ly + get_local_size(1);
-    int gx2 = gx + get_local_size(0);
-    int gy2 = gy + get_local_size(1);
+    int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
 
-    localMem[lx+shrdLen*ly] = load2LocalMem(iptr, iInfo.dims[0], iInfo.dims[1],
-                                   gx-radius, gy-radius,
-                                   iInfo.strides[1], iInfo.strides[0]);
-    if (lx<padding) {
-        localMem[lx2+shrdLen*ly] = load2LocalMem(iptr, iInfo.dims[0], iInfo.dims[1],
-                                        gx2-radius, gy-radius,
-                                        iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (ly<padding) {
-        localMem[lx+shrdLen*ly2] = load2LocalMem(iptr, iInfo.dims[0], iInfo.dims[1],
-                                        gx-radius, gy2-radius,
-                                        iInfo.strides[1], iInfo.strides[0]);
-    }
-    if (lx<padding && ly<padding) {
-        localMem[lx2+shrdLen*ly2] = load2LocalMem(iptr, iInfo.dims[0], iInfo.dims[1],
-                                         gx2-radius, gy2-radius,
-                                         iInfo.strides[1], iInfo.strides[0]);
+    int s0 = iInfo.strides[0];
+    int s1 = iInfo.strides[1];
+    int d0 = iInfo.dims[0];
+    int d1 = iInfo.dims[1];
+    for (int b = ly, gy2 = gy; b < shrdLen;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+        for (int a = lx, gx2 = gx; a < shrdLen;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            localMem[a + shrdLen * b] =
+                load2LocalMem(iptr, d0, d1, gx2 - radius, gy2 - radius, s1, s0);
+        }
     }
 
     barrier(CLK_LOCAL_MEM_FENCE);
 
     if (gx < iInfo.dims[0] && gy < iInfo.dims[1]) {
-        int i = lx + radius;
-        int j = ly + radius;
-        int _i = i-1;
-        int i_ = i+1;
-        int _j = j-1;
-        int j_ = j+1;
-
-        float NW = localMem[_i+shrdLen*_j];
-        float SW = localMem[i_+shrdLen*_j];
-        float NE = localMem[_i+shrdLen*j_];
-        float SE = localMem[i_+shrdLen*j_];
+        int i  = lx + radius;
+        int j  = ly + radius;
+        int _i = i - 1;
+        int i_ = i + 1;
+        int _j = j - 1;
+        int j_ = j + 1;
 
-        float t1 = localMem[i+shrdLen*_j];
-        float t2 = localMem[i+shrdLen*j_];
-        dxptr[gy*dxInfo.strides[1]+gx] = (NW+SW - (NE+SE) + 2*(t1-t2));
+        float NW = localMem[_i + shrdLen * _j];
+        float SW = localMem[i_ + shrdLen * _j];
+        float NE = localMem[_i + shrdLen * j_];
+        float SE = localMem[i_ + shrdLen * j_];
 
-        t1 = localMem[_i+shrdLen*j];
-        t2 = localMem[i_+shrdLen*j];
-        dyptr[gy*dyInfo.strides[1]+gx] = (NW+NE - (SW+SE) + 2*(t1-t2));
+        float t1 = localMem[_i + shrdLen * j];
+        float t2 = localMem[i_ + shrdLen * j];
+        dxptr[gy * dxInfo.strides[1] + gx] =
+            (SW + SE - (NW + NE) + 2 * (t2 - t1));
 
+        t1 = localMem[i + shrdLen * _j];
+        t2 = localMem[i + shrdLen * j_];
+        dyptr[gy * dyInfo.strides[1] + gx] =
+            (NE + SE - (NW + SW) + 2 * (t2 - t1));
     }
 }
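Note on reflect101 in sobel.cl: the helper replaces the old zero padding with a mirrored border that does not repeat the edge pixel, so index -1 maps to 1 and index endIndex + 1 maps to endIndex - 1. A minimal host-side C++ port is sketched below to make the mapping easy to test. The helper reflect101InRange is hypothetical and only documents the input range for which the formula stays inside [0, endIndex].

```cpp
#include <cassert>
#include <cstdlib>

// Host-side port of the kernel helper: mirror `index` around 0 and around
// `endIndex` without duplicating the border element.
int reflect101(int index, int endIndex) {
    return std::abs(endIndex - std::abs(endIndex - index));
}

// Hypothetical helper: the formula yields a value in [0, endIndex] only for
// inputs in [-endIndex, 2 * endIndex]. Outside that interval a single
// reflection is not enough.
bool reflect101InRange(int index, int endIndex) {
    return index >= -endIndex && index <= 2 * endIndex;
}
```

For a dimension of length 5 (endIndex = 4), reflect101(-1, 4) gives 1 and reflect101(5, 4) gives 3, while in-range indices are returned unchanged.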
diff --git a/src/backend/opencl/kernel/sobel.hpp b/src/backend/opencl/kernel/sobel.hpp
index 522dc4969a..9e7138f69d 100644
--- a/src/backend/opencl/kernel/sobel.hpp
+++ b/src/backend/opencl/kernel/sobel.hpp
@@ -8,85 +8,55 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/sobel.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/sobel.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int THREADS_X = 16;
-static const int THREADS_Y = 16;
+#include <string>
+#include <vector>
 
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 template<typename Ti, typename To, unsigned ker_size>
-void sobel(Param dx, Param dy, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  sobProgs;
-        static std::map<int, Kernel*> sobKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once( compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D Ti=" << dtype_traits<Ti>::getName()
-                        << " -D To=" << dtype_traits<To>::getName()
-                        << " -D KER_SIZE="<< ker_size;
-                if (std::is_same<Ti, double>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-                Program prog;
-                buildProgram(prog, sobel_cl, sobel_cl_len, options.str());
-                sobProgs[device]   = new Program(prog);
-                sobKernels[device] = new Kernel(*sobProgs[device], "sobel3x3");
-            });
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], THREADS_X);
-        int blk_y = divup(in.info.dims[1], THREADS_Y);
-
-        NDRange global(blk_x * in.info.dims[2] * THREADS_X,
+void sobel(Param dx, Param dy, const Param in) {
+    constexpr int THREADS_X = 16;
+    constexpr int THREADS_Y = 16;
+
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<Ti>(),
+        TemplateTypename<To>(),
+        TemplateArg(ker_size),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(Ti, dtype_traits<Ti>::getName()),
+        DefineKeyValue(To, dtype_traits<To>::getName()),
+        DefineKeyValue(KER_SIZE, ker_size),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<Ti>());
+
+    auto sobel =
+        common::getKernel("sobel3x3", {{sobel_cl_src}}, targs, compileOpts);
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], THREADS_X);
+    int blk_y = divup(in.info.dims[1], THREADS_Y);
+
+    cl::NDRange global(blk_x * in.info.dims[2] * THREADS_X,
                        blk_y * in.info.dims[3] * THREADS_Y);
+    size_t loc_size =
+        (THREADS_X + ker_size - 1) * (THREADS_Y + ker_size - 1) * sizeof(Ti);
 
-        auto sobelOp = make_kernel<Buffer, KParam,
-                                   Buffer, KParam,
-                                   Buffer, KParam,
-                                   cl::LocalSpaceArg,
-                                   int, int> (*sobKernels[device]);
-
-        size_t loc_size = (THREADS_X+ker_size-1)*(THREADS_Y+ker_size-1)*sizeof(Ti);
-
-        sobelOp(EnqueueArgs(getQueue(), global, local),
-                    *dx.data, dx.info, *dy.data, dy.info,
-                    *in.data, in.info, cl::Local(loc_size), blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-}
-
+    sobel(cl::EnqueueArgs(getQueue(), global, local), *dx.data, dx.info,
+          *dy.data, dy.info, *in.data, in.info, cl::Local(loc_size), blk_x,
+          blk_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sort.hpp b/src/backend/opencl/kernel/sort.hpp
index c3976a032b..dd8bbe1390 100644
--- a/src/backend/opencl/kernel/sort.hpp
+++ b/src/backend/opencl/kernel/sort.hpp
@@ -8,72 +8,125 @@
  ********************************************************/
 
 #pragma once
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
 #include <debug_opencl.hpp>
+#include <iota.hpp>
+#include <kernel/sort_helper.hpp>
+#include <traits.hpp>
+
+AF_DEPRECATED_WARNINGS_OFF
+#include <boost/compute/algorithm/sort.hpp>
+#include <boost/compute/algorithm/sort_by_key.hpp>
 #include <boost/compute/core.hpp>
-#include <boost/compute/algorithm/stable_sort.hpp>
 #include <boost/compute/functional/operator.hpp>
 #include <boost/compute/iterator/buffer_iterator.hpp>
+AF_DEPRECATED_WARNINGS_ON
 
 namespace compute = boost::compute;
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-
-        template<typename T, bool isAscending>
-        void sort0(Param val)
-        {
-            try {
-                compute::command_queue c_queue(getQueue()());
-
-                compute::buffer val_buf((*val.data)());
-
-                for(int w = 0; w < val.info.dims[3]; w++) {
-                    int valW = w * val.info.strides[3];
-                    for(int z = 0; z < val.info.dims[2]; z++) {
-                        int valWZ = valW + z * val.info.strides[2];
-                        for(int y = 0; y < val.info.dims[1]; y++) {
-
-                            int valOffset = valWZ + y * val.info.strides[1];
-
-                            if(isAscending) {
-                                compute::stable_sort(
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset),
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset + val.info.dims[0]),
-                                        compute::less<T>(), c_queue);
-                            } else {
-                                compute::stable_sort(
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset),
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset + val.info.dims[0]),
-                                        compute::greater<T>(), c_queue);
-                            }
-                        }
-                    }
-                }
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void sort0Iterative(Param val, bool isAscending) {
+    compute::command_queue c_queue(getQueue()());
+
+    compute::buffer val_buf((*val.data)());
 
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
+    for (int w = 0; w < val.info.dims[3]; w++) {
+        int valW = w * val.info.strides[3];
+        for (int z = 0; z < val.info.dims[2]; z++) {
+            int valWZ = valW + z * val.info.strides[2];
+            for (int y = 0; y < val.info.dims[1]; y++) {
+                int valOffset = valWZ + y * val.info.strides[1];
+
+                if (isAscending) {
+                    compute::sort(compute::make_buffer_iterator<type_t<T>>(
+                                      val_buf, valOffset),
+                                  compute::make_buffer_iterator<type_t<T>>(
+                                      val_buf, valOffset + val.info.dims[0]),
+                                  compute::less<type_t<T>>(), c_queue);
+                } else {
+                    compute::sort(compute::make_buffer_iterator<type_t<T>>(
+                                      val_buf, valOffset),
+                                  compute::make_buffer_iterator<type_t<T>>(
+                                      val_buf, valOffset + val.info.dims[0]),
+                                  compute::greater<type_t<T>>(), c_queue);
+                }
             }
         }
     }
+
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void sortBatched(Param pVal, int dim, bool isAscending) {
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pVal.info.dims[i];
+
+    // Sort dimension
+    // tileDims * seqDims = inDims
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    // Array<uint> pKey = createEmptyArray<uint>(inDims);
+    Array<uint> pKey = iota<uint>(seqDims, tileDims);
+
+    pKey.setDataDims(inDims.elements());
+
+    // Flatten to 1D
+    pVal.info.dims[0]    = inDims.elements();
+    pVal.info.strides[0] = 1;
+    for (int i = 1; i < 4; i++) {
+        pVal.info.dims[i]    = 1;
+        pVal.info.strides[i] = pVal.info.strides[i - 1] * pVal.info.dims[i - 1];
+    }
+
+    // Sort indices
+    // sort_by_key<T, uint, isAscending>(*resVal, *resKey, val, key, 0);
+    // kernel::sort0_by_key<T, uint, isAscending>(pVal, pKey);
+    compute::command_queue c_queue(getQueue()());
+
+    compute::buffer pKey_buf((*pKey.get())());
+    compute::buffer pVal_buf((*pVal.data)());
+
+    compute::buffer_iterator<type_t<T>> val0 =
+        compute::make_buffer_iterator<type_t<T>>(pVal_buf, 0);
+    compute::buffer_iterator<type_t<T>> valN =
+        compute::make_buffer_iterator<type_t<T>>(pVal_buf, +pVal.info.dims[0]);
+    compute::buffer_iterator<uint> key0 =
+        compute::make_buffer_iterator<uint>(pKey_buf, 0);
+    compute::buffer_iterator<uint> keyN =
+        compute::make_buffer_iterator<uint>(pKey_buf, pKey.dims()[0]);
+    if (isAscending) {
+        compute::sort_by_key(val0, valN, key0, c_queue);
+    } else {
+        compute::sort_by_key(val0, valN, key0, compute::greater<type_t<T>>(),
+                             c_queue);
+    }
+
+    // Needs to be ascending (true) in order to maintain the indices properly
+    // kernel::sort0_by_key<uint, T, true>(pKey, pVal);
+    compute::sort_by_key(key0, keyN, val0, c_queue);
+
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void sort0(Param val, bool isAscending) {
+    int higherDims = val.info.dims[1] * val.info.dims[2] * val.info.dims[3];
+    // TODO Make a better heuristic
+    if (higherDims > 10)
+        sortBatched<T>(val, 0, isAscending);
+    else
+        kernel::sort0Iterative<T>(val, isAscending);
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
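Note on sortBatched in sort.hpp: instead of launching one sort per column, the function tags every element with the index of its column, sorts the whole flattened buffer by value, and then sorts again by the column index to regroup the columns. The CPU sketch below illustrates that two-pass idea with the standard library. It is not the Boost.Compute code from the patch, the name sortSegments is hypothetical, and it assumes the second pass is stable so that the value order from the first pass survives inside each segment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical CPU illustration: sort every contiguous segment of length
// `segLen` in `data` using two global sorts instead of a per-segment loop.
std::vector<float> sortSegments(const std::vector<float>& data,
                                std::size_t segLen, bool ascending) {
    assert(segLen > 0 && data.size() % segLen == 0);
    using Item = std::pair<float, unsigned>;  // (value, segment id)

    // Tag each value with the segment it belongs to (the role of the iota key).
    std::vector<Item> items(data.size());
    for (std::size_t i = 0; i < data.size(); ++i)
        items[i] = Item(data[i], static_cast<unsigned>(i / segLen));

    // Pass 1: order all elements by value, ignoring segment boundaries.
    if (ascending) {
        std::sort(items.begin(), items.end(),
                  [](const Item& a, const Item& b) { return a.first < b.first; });
    } else {
        std::sort(items.begin(), items.end(),
                  [](const Item& a, const Item& b) { return a.first > b.first; });
    }

    // Pass 2: regroup by segment id. Always ascending, and stable so the
    // value order established in pass 1 is kept within each segment.
    std::stable_sort(items.begin(), items.end(),
                     [](const Item& a, const Item& b) { return a.second < b.second; });

    std::vector<float> out(data.size());
    for (std::size_t i = 0; i < items.size(); ++i) out[i] = items[i].first;
    return out;
}
```

Calling sortSegments({5, 1, 0, 4}, 2, true) sorts the two segments {5, 1} and {0, 4} independently. The trade-off is two full-length sorts against many small ones, which is why sort0 only takes this path when the number of columns is large.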
diff --git a/src/backend/opencl/kernel/sort_by_key.hpp b/src/backend/opencl/kernel/sort_by_key.hpp
index 07086ca348..4333a7830c 100644
--- a/src/backend/opencl/kernel/sort_by_key.hpp
+++ b/src/backend/opencl/kernel/sort_by_key.hpp
@@ -8,74 +8,22 @@
  ********************************************************/
 
 #pragma once
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <dispatch.hpp>
 #include <Param.hpp>
+#include <common/dispatch.hpp>
 #include <debug_opencl.hpp>
-#include <boost/compute/core.hpp>
-#include <boost/compute/algorithm/sort_by_key.hpp>
-#include <boost/compute/functional/operator.hpp>
-#include <boost/compute/iterator/buffer_iterator.hpp>
-
-namespace compute = boost::compute;
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-
-        template<typename Tk, typename Tv, bool isAscending>
-        void sort0_by_key(Param okey, Param oval)
-        {
-            try {
-                compute::command_queue c_queue(getQueue()());
-
-                compute::buffer okey_buf((*okey.data)());
-                compute::buffer oval_buf((*oval.data)());
-
-                for(int w = 0; w < okey.info.dims[3]; w++) {
-                    int okeyW = w * okey.info.strides[3];
-                    int ovalW = w * oval.info.strides[3];
-                    for(int z = 0; z < okey.info.dims[2]; z++) {
-                        int okeyWZ = okeyW + z * okey.info.strides[2];
-                        int ovalWZ = ovalW + z * oval.info.strides[2];
-                        for(int y = 0; y < okey.info.dims[1]; y++) {
+#include <traits.hpp>
 
-                            int okeyOffset = okeyWZ + y * okey.info.strides[1];
-                            int ovalOffset = ovalWZ + y * oval.info.strides[1];
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param pKey, Param pVal, bool isAscending);
 
-                            compute::buffer_iterator<Tk> start= compute::make_buffer_iterator<Tk>(okey_buf, okeyOffset);
-                            compute::buffer_iterator<Tk> end = compute::make_buffer_iterator<Tk>(okey_buf, okeyOffset + okey.info.dims[0]);
-                            compute::buffer_iterator<Tv> vals = compute::make_buffer_iterator<Tv>(oval_buf, ovalOffset);
-                            if(isAscending) {
-                                compute::sort_by_key(start, end, vals, c_queue);
-                            } else {
-                                compute::sort_by_key(start, end, vals,
-                                                     compute::greater<Tk>(), c_queue);
-                            }
-                        }
-                    }
-                }
+template<typename Tk_, typename Tv_>
+void sortByKeyBatched(Param pKey, Param pVal, const int dim, bool isAscending);
 
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
-}
+template<typename Tk, typename Tv>
+void sort0ByKey(Param pKey, Param pVal, bool isAscending);
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sort_by_key/CMakeLists.txt b/src/backend/opencl/kernel/sort_by_key/CMakeLists.txt
new file mode 100644
index 0000000000..e2ad168138
--- /dev/null
+++ b/src/backend/opencl/kernel/sort_by_key/CMakeLists.txt
@@ -0,0 +1,85 @@
+# Copyright (c) 2017, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+file(STRINGS "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp" FILESTRINGS)
+
+foreach(STR ${FILESTRINGS})
+    if(${STR} MATCHES "// SBK_TYPES")
+        string(REPLACE "// SBK_TYPES:" "" TEMP ${STR})
+        string(REPLACE " " ";" SBK_TYPES ${TEMP})
+    endif()
+endforeach()
+
+add_library(opencl_sort_by_key INTERFACE)
+foreach(SBK_TYPE ${SBK_TYPES})
+    add_library(opencl_sort_by_key_${SBK_TYPE} OBJECT
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key/sort_by_key_impl.cpp"
+        "${CMAKE_CURRENT_SOURCE_DIR}/kernel/sort_by_key_impl.hpp")
+    add_dependencies(opencl_sort_by_key_${SBK_TYPE}
+                        ${cl_kernel_targets} OpenCL::cl2hpp Boost::boost)
+
+    target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+      SYSTEM PRIVATE
+        ${span-lite_SOURCE_DIR}/include
+        $<TARGET_PROPERTY:OpenCL::OpenCL,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:OpenCL::cl2hpp,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:Boost::boost,INTERFACE_INCLUDE_DIRECTORIES>
+        $<TARGET_PROPERTY:af_spdlog,INTERFACE_INCLUDE_DIRECTORIES>)
+
+    target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+      PRIVATE
+        .
+        ..
+        ../../api/c
+        ../common
+        ../../../include
+        magma
+        ${ArrayFire_BINARY_DIR}/include
+        ${CMAKE_CURRENT_BINARY_DIR})
+
+    if(TARGET Forge::forge)
+      target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+        SYSTEM INTERFACE
+          $<TARGET_PROPERTY:Forge::forge,INCLUDE_DIRECTORIES>
+      )
+    else()
+      target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+        SYSTEM INTERFACE
+          ${${forge_prefix}_SOURCE_DIR}/include
+          ${${forge_prefix}_BINARY_DIR}/include
+      )
+    endif()
+    if(TARGET glad::glad)
+      target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+        SYSTEM INTERFACE
+          $<TARGET_PROPERTY:glad::glad,INTERFACE_INCLUDE_DIRECTORIES>
+      )
+    else()
+      target_include_directories(opencl_sort_by_key_${SBK_TYPE}
+        SYSTEM INTERFACE
+          $<TARGET_PROPERTY:af_glad,INTERFACE_INCLUDE_DIRECTORIES>
+      )
+    endif()
+
+    set_target_properties(opencl_sort_by_key_${SBK_TYPE}
+      PROPERTIES
+        CXX_STANDARD 17
+        CXX_EXTENSIONS False
+        CXX_VISIBILITY_PRESET hidden
+        POSITION_INDEPENDENT_CODE ON
+        FOLDER "Generated Targets")
+
+    arrayfire_set_default_cxx_flags(opencl_sort_by_key_${SBK_TYPE})
+
+    target_compile_definitions(opencl_sort_by_key_${SBK_TYPE}
+      PRIVATE
+        ${opencl_compile_definitions}
+        $<TARGET_PROPERTY:Boost::boost,INTERFACE_COMPILE_DEFINITIONS>
+        TYPE=${SBK_TYPE} AFDLL)
+    target_sources(opencl_sort_by_key
+      INTERFACE $<TARGET_OBJECTS:opencl_sort_by_key_${SBK_TYPE}>)
+endforeach(SBK_TYPE ${SBK_TYPES})
diff --git a/src/backend/opencl/kernel/sort_by_key/sort_by_key_impl.cpp b/src/backend/opencl/kernel/sort_by_key/sort_by_key_impl.cpp
new file mode 100644
index 0000000000..dd14eee6c5
--- /dev/null
+++ b/src/backend/opencl/kernel/sort_by_key/sort_by_key_impl.cpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sort_by_key_impl.hpp>
+
+// SBK_TYPES:float double int uint intl uintl short ushort char schar uchar half
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+INSTANTIATE1(TYPE)
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sort_by_key_impl.hpp b/src/backend/opencl/kernel/sort_by_key_impl.hpp
new file mode 100644
index 0000000000..f03721d01e
--- /dev/null
+++ b/src/backend/opencl/kernel/sort_by_key_impl.hpp
@@ -0,0 +1,259 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_opencl.hpp>
+#include <iota.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <kernel/sort_helper.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+AF_DEPRECATED_WARNINGS_OFF
+#include <boost/compute/algorithm/copy.hpp>
+#include <boost/compute/algorithm/sort_by_key.hpp>
+#include <boost/compute/algorithm/transform.hpp>
+#include <boost/compute/container/vector.hpp>
+#include <boost/compute/core.hpp>
+#include <boost/compute/functional/field.hpp>
+#include <boost/compute/functional/get.hpp>
+#include <boost/compute/functional/operator.hpp>
+#include <boost/compute/iterator/buffer_iterator.hpp>
+#include <boost/compute/types/pair.hpp>
+AF_DEPRECATED_WARNINGS_ON
+
+namespace compute = boost::compute;
+
+using arrayfire::common::half;
+
+template<typename Tk, typename Tv, bool isAscending>
+inline boost::compute::function<bool(const std::pair<Tk, Tv>,
+                                     const std::pair<Tk, Tv>)>
+makeCompareFunction() {
+    // Cannot use isAscending in BOOST_COMPUTE_FUNCTION
+    if (isAscending) {
+        BOOST_COMPUTE_FUNCTION(bool, IPCompare,
+                               (std::pair<Tk, Tv> lhs, std::pair<Tk, Tv> rhs),
+                               { return lhs.first < rhs.first; });
+        return IPCompare;
+    } else {
+        BOOST_COMPUTE_FUNCTION(bool, IPCompare,
+                               (std::pair<Tk, Tv> lhs, std::pair<Tk, Tv> rhs),
+                               { return lhs.first > rhs.first; });
+        return IPCompare;
+    }
+}
+
+template<typename Tk>
+inline boost::compute::function<Tk(Tk)> flipFunction() {
+    BOOST_COMPUTE_FUNCTION(Tk, negateFn, (const Tk x), { return -x; });
+
+    return negateFn;
+}
+
+#define INSTANTIATE_FLIP(TY, XMAX)                               \
+    template<>                                                   \
+    inline boost::compute::function<TY(TY)> flipFunction<TY>() { \
+        BOOST_COMPUTE_FUNCTION(TY, negateFn, (const TY x),       \
+                               { return XMAX - x; });            \
+                                                                 \
+        return negateFn;                                         \
+    }
+
+INSTANTIATE_FLIP(unsigned, UINT_MAX)
+INSTANTIATE_FLIP(unsigned short, USHRT_MAX)
+INSTANTIATE_FLIP(unsigned char, UCHAR_MAX)
+INSTANTIATE_FLIP(cl_ulong, ULONG_MAX)
+
+#undef INSTANTIATE_FLIP
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+static const int copyPairIter = 4;
+
+template<typename Tk, typename Tv>
+void sort0ByKeyIterative(Param pKey, Param pVal, bool isAscending) {
+    compute::command_queue c_queue(getQueue()());
+
+    compute::buffer pKey_buf((*pKey.data)());
+    compute::buffer pVal_buf((*pVal.data)());
+
+    for (int w = 0; w < pKey.info.dims[3]; w++) {
+        int pKeyW = w * pKey.info.strides[3];
+        int pValW = w * pVal.info.strides[3];
+        for (int z = 0; z < pKey.info.dims[2]; z++) {
+            int pKeyWZ = pKeyW + z * pKey.info.strides[2];
+            int pValWZ = pValW + z * pVal.info.strides[2];
+            for (int y = 0; y < pKey.info.dims[1]; y++) {
+                int pKeyOffset = pKeyWZ + y * pKey.info.strides[1];
+                int pValOffset = pValWZ + y * pVal.info.strides[1];
+
+                compute::buffer_iterator<type_t<Tk>> start =
+                    compute::make_buffer_iterator<type_t<Tk>>(pKey_buf,
+                                                              pKeyOffset);
+                compute::buffer_iterator<type_t<Tk>> end =
+                    compute::make_buffer_iterator<type_t<Tk>>(
+                        pKey_buf, pKeyOffset + pKey.info.dims[0]);
+                compute::buffer_iterator<type_t<Tv>> vals =
+                    compute::make_buffer_iterator<type_t<Tv>>(pVal_buf,
+                                                              pValOffset);
+                if (isAscending) {
+                    compute::sort_by_key(start, end, vals, c_queue);
+                } else {
+                    compute::sort_by_key(start, end, vals,
+                                         compute::greater<type_t<Tk>>(),
+                                         c_queue);
+                }
+            }
+        }
+    }
+
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename Tk_, typename Tv_>
+void sortByKeyBatched(Param pKey, Param pVal, const int dim, bool isAscending) {
+    typedef type_t<Tk_> Tk;
+    typedef type_t<Tv_> Tv;
+
+    af::dim4 inDims;
+    for (int i = 0; i < 4; i++) inDims[i] = pKey.info.dims[i];
+
+    // Sort dimension
+    // tileDims * seqDims = inDims
+    af::dim4 tileDims(1);
+    af::dim4 seqDims = inDims;
+    tileDims[dim]    = inDims[dim];
+    seqDims[dim]     = 1;
+
+    // Create/call iota
+    Array<unsigned> pSeq = iota<unsigned>(seqDims, tileDims);
+
+    int elements = inDims.elements();
+
+    // Flat - Not required since in-place and both are contiguous
+    // val.modDims(inDims.elements());
+    // key.modDims(inDims.elements());
+
+    // Sort indices
+    // sort_by_key<T, Tv, isAscending>(*resVal, *resKey, val, key, 0);
+    // kernel::sort0_by_key<T, Tv, isAscending>(pVal, pKey);
+    compute::command_queue c_queue(getQueue()());
+    compute::context c_context(getContext()());
+
+    // Create buffer iterators for seq
+    compute::buffer pSeq_buf((*pSeq.get())());
+    compute::buffer_iterator<unsigned> seq0 =
+        compute::make_buffer_iterator<unsigned>(pSeq_buf, 0);
+    compute::buffer_iterator<unsigned> seqN =
+        compute::make_buffer_iterator<unsigned>(pSeq_buf, elements);
+    // Create buffer iterators for key and val
+    compute::buffer pKey_buf((*pKey.data)());
+    compute::buffer pVal_buf((*pVal.data)());
+    compute::buffer_iterator<Tk> key0 =
+        compute::make_buffer_iterator<Tk>(pKey_buf, 0);
+    compute::buffer_iterator<Tk> keyN =
+        compute::make_buffer_iterator<Tk>(pKey_buf, elements);
+    compute::buffer_iterator<Tv> val0 =
+        compute::make_buffer_iterator<Tv>(pVal_buf, 0);
+    compute::buffer_iterator<Tv> valN =
+        compute::make_buffer_iterator<Tv>(pVal_buf, elements);
+
+    // A descending sort_by_key is stable with respect to the reversed
+    // (greater) order. Sorting flipped keys in ascending order instead
+    // gives the right result
+    if (!isAscending)
+        compute::transform(key0, keyN, key0, flipFunction<Tk>(), c_queue);
+
+    // Create a copy of the pKey buffer
+    cl::Buffer* cKey = bufferAlloc(elements * sizeof(Tk));
+    compute::buffer cKey_buf((*cKey)());
+    compute::buffer_iterator<Tk> cKey0 =
+        compute::make_buffer_iterator<Tk>(cKey_buf, 0);
+    compute::buffer_iterator<Tk> cKeyN =
+        compute::make_buffer_iterator<Tk>(cKey_buf, elements);
+    compute::copy(key0, keyN, cKey0, c_queue);
+
+    // FIRST SORT
+    compute::sort_by_key(key0, keyN, seq0, c_queue);
+    compute::sort_by_key(cKey0, cKeyN, val0, c_queue);
+
+    // Create a copy of the seq buffer after first sort
+    cl::Buffer* cSeq = bufferAlloc(elements * sizeof(unsigned));
+    compute::buffer cSeq_buf((*cSeq)());
+    compute::buffer_iterator<unsigned> cSeq0 =
+        compute::make_buffer_iterator<unsigned>(cSeq_buf, 0);
+    compute::buffer_iterator<unsigned> cSeqN =
+        compute::make_buffer_iterator<unsigned>(cSeq_buf, elements);
+    compute::copy(seq0, seqN, cSeq0, c_queue);
+
+    // SECOND SORT
+    // First call will sort key, second sort will sort val
+    // Needs to be ascending (true) in order to maintain the indices properly
+    // kernel::sort0_by_key<Tv, T, true>(pKey, pVal);
+    compute::sort_by_key(seq0, seqN, key0, c_queue);
+    compute::sort_by_key(cSeq0, cSeqN, val0, c_queue);
+
+    // If descending, flip it back
+    if (!isAscending)
+        compute::transform(key0, keyN, key0, flipFunction<Tk>(), c_queue);
+
+    CL_DEBUG_FINISH(getQueue());
+    bufferFree(cSeq);
+    bufferFree(cKey);
+}
+
+template<typename Tk, typename Tv>
+void sort0ByKey(Param pKey, Param pVal, bool isAscending) {
+    int higherDims = pKey.info.dims[1] * pKey.info.dims[2] * pKey.info.dims[3];
+    // Batched sort performs 4x sort by keys
+    // But this is only useful before the GPU is saturated
+    // The GPU is saturated at around 1,000,000 integers
+    // Call batched sort only if both conditions are met
+    if (higherDims > 4 && pKey.info.dims[0] < 1000000) {
+        kernel::sortByKeyBatched<Tk, Tv>(pKey, pVal, 0, isAscending);
+    } else {
+        kernel::sort0ByKeyIterative<Tk, Tv>(pKey, pVal, isAscending);
+    }
+}
+
+#define INSTANTIATE(Tk, Tv)                                           \
+    template void sort0ByKey<Tk, Tv>(Param okey, Param oval,          \
+                                     bool isAscending);               \
+    template void sort0ByKeyIterative<Tk, Tv>(Param okey, Param oval, \
+                                              bool isAscending);      \
+    template void sortByKeyBatched<Tk, Tv>(Param okey, Param oval,    \
+                                           const int dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)   \
+    INSTANTIATE(Tk, half)
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sort_helper.hpp b/src/backend/opencl/kernel/sort_helper.hpp
new file mode 100644
index 0000000000..971b4077e9
--- /dev/null
+++ b/src/backend/opencl/kernel/sort_helper.hpp
@@ -0,0 +1,48 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <debug_opencl.hpp>
+#include <type_traits>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+using htype_t = typename std::conditional<std::is_same<T, common::half>::value,
+                                          cl_half, T>::type;
+
+// If type is cdouble, return std::complex<double>, else return htype_t
+template<typename T>
+using ztype_t =
+    typename std::conditional<std::is_same<T, cdouble>::value,
+                              std::complex<double>, htype_t<T>>::type;
+
+// If type is cfloat, return std::complex<float>, else return ztype_t
+template<typename T>
+using ctype_t =
+    typename std::conditional<std::is_same<T, cfloat>::value,
+                              std::complex<float>, ztype_t<T>>::type;
+
+// If type is intl, return cl_long, else return ctype_t
+template<typename T>
+using ltype_t = typename std::conditional<std::is_same<T, intl>::value, cl_long,
+                                          ctype_t<T>>::type;
+
+// If type is uintl, return cl_ulong, else return ltype_t
+template<typename T>
+using type_t = typename std::conditional<std::is_same<T, uintl>::value,
+                                         cl_ulong, ltype_t<T>>::type;
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sort_index.hpp b/src/backend/opencl/kernel/sort_index.hpp
deleted file mode 100644
index 59770185e5..0000000000
--- a/src/backend/opencl/kernel/sort_index.hpp
+++ /dev/null
@@ -1,87 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <dispatch.hpp>
-#include <Param.hpp>
-#include <debug_opencl.hpp>
-#include <boost/compute/core.hpp>
-#include <boost/compute/algorithm/iota.hpp>
-#include <boost/compute/algorithm/sort_by_key.hpp>
-#include <boost/compute/functional/operator.hpp>
-#include <boost/compute/iterator/buffer_iterator.hpp>
-
-namespace compute = boost::compute;
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-
-        template<typename T, bool isAscending>
-        void sort0_index(Param val, Param idx)
-        {
-            try {
-                compute::command_queue c_queue(getQueue()());
-
-                compute::buffer val_buf((*val.data)());
-                compute::buffer idx_buf((*idx.data)());
-
-                for(int w = 0; w < (int)val.info.dims[3]; w++) {
-                    int valW = w * (int)val.info.strides[3];
-                    int idxW = w * idx.info.strides[3];
-                    for(int z = 0; z < (int)val.info.dims[2]; z++) {
-                        int valWZ = valW + z * (int)val.info.strides[2];
-                        int idxWZ = idxW + z * idx.info.strides[2];
-                        for(int y = 0; y < (int)val.info.dims[1]; y++) {
-
-                            int valOffset = valWZ + y * val.info.strides[1];
-                            int idxOffset = idxWZ + y * idx.info.strides[1];
-
-                            compute::buffer_iterator<unsigned> idx_begin(idx_buf, idxOffset);
-                            compute::iota(idx_begin, idx_begin + val.info.dims[0], 0, c_queue);
-
-                            if(isAscending) {
-                                compute::sort_by_key(
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset),
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset + val.info.dims[0]),
-                                        idx_begin, compute::less<T>(), c_queue);
-                            } else {
-                                compute::sort_by_key(
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset),
-                                        compute::make_buffer_iterator<T>(val_buf, valOffset + val.info.dims[0]),
-                                        idx_begin, compute::greater<T>(), c_queue);
-                            }
-                        }
-                    }
-                }
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
-}
diff --git a/src/backend/opencl/kernel/sp_sp_arith_csr.cl b/src/backend/opencl/kernel/sp_sp_arith_csr.cl
new file mode 100644
index 0000000000..e9b54a755c
--- /dev/null
+++ b/src/backend/opencl/kernel/sp_sp_arith_csr.cl
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// TODO_PERF(pradeep) More performance improvements are possible
+__attribute__((reqd_work_group_size(256, 1, 1))) kernel void ssarith_csr(
+    global T *oVals, global int *oColIdx, global const int *oRowIdx, uint M,
+    uint N, uint nnza, global const T *lVals, global const int *lRowIdx,
+    global const int *lColIdx, uint nnzb, global const T *rVals,
+    global const int *rRowIdx, global const int *rColIdx) {
+    const uint row = get_global_id(0);
+
+    const bool valid = row < M;
+
+    const uint lEnd   = (valid ? lRowIdx[row + 1] : 0);
+    const uint rEnd   = (valid ? rRowIdx[row + 1] : 0);
+    const uint offset = (valid ? oRowIdx[row] : 0);
+
+    global T *ovPtr   = oVals + offset;
+    global int *ocPtr = oColIdx + offset;
+
+    uint l = (valid ? lRowIdx[row] : 0);
+    uint r = (valid ? rRowIdx[row] : 0);
+
+    uint nnz = 0;
+    while (l < lEnd && r < rEnd) {
+        uint lci = lColIdx[l];
+        uint rci = rColIdx[r];
+
+        T lhs = (lci <= rci ? lVals[l] : (T)(IDENTITY_VALUE));
+        T rhs = (lci >= rci ? rVals[r] : (T)(IDENTITY_VALUE));
+
+        ovPtr[nnz] = OP(lhs, rhs);
+        ocPtr[nnz] = (lci <= rci) ? lci : rci;
+
+        l += (lci <= rci);
+        r += (lci >= rci);
+        nnz++;
+    }
+    while (l < lEnd) {
+        ovPtr[nnz] = OP(lVals[l], (T)(IDENTITY_VALUE));
+        ocPtr[nnz] = lColIdx[l];
+        l++;
+        nnz++;
+    }
+    while (r < rEnd) {
+        ovPtr[nnz] = OP((T)(IDENTITY_VALUE), rVals[r]);
+        ocPtr[nnz] = rColIdx[r];
+        r++;
+        nnz++;
+    }
+}
diff --git a/src/backend/opencl/kernel/sparse.hpp b/src/backend/opencl/kernel/sparse.hpp
new file mode 100644
index 0000000000..4d3a33d14a
--- /dev/null
+++ b/src/backend/opencl/kernel/sparse.hpp
@@ -0,0 +1,235 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel/sort_by_key.hpp>
+#include <kernel_headers/coo2dense.hpp>
+#include <kernel_headers/csr2coo.hpp>
+#include <kernel_headers/csr2dense.hpp>
+#include <kernel_headers/dense2csr.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void coo2dense(Param out, const Param values, const Param rowIdx,
+               const Param colIdx) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(REPEAT),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(reps, REPEAT),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto coo2dense = common::getKernel("coo2Dense", {{coo2dense_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    cl::NDRange local(THREADS_PER_GROUP, 1, 1);
+
+    cl::NDRange global(
+        divup(values.info.dims[0], local[0] * REPEAT) * THREADS_PER_GROUP, 1,
+        1);
+
+    coo2dense(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *values.data, values.info, *rowIdx.data, rowIdx.info,
+              *colIdx.data, colIdx.info);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void csr2dense(Param output, const Param values, const Param rowIdx,
+               const Param colIdx) {
+    constexpr int MAX_GROUPS = 4096;
+    // FIXME: This needs to be based on nonzeros per row
+    constexpr int threads = 64;
+
+    const int M = rowIdx.info.dims[0] - 1;
+
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(threads),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(THREADS, threads),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto csr2dense = common::getKernel("csr2Dense", {{csr2dense_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    cl::NDRange local(threads, 1);
+    int groups_x = std::min((int)(divup(M, local[0])), MAX_GROUPS);
+    cl::NDRange global(local[0] * groups_x, 1);
+
+    csr2dense(cl::EnqueueArgs(getQueue(), global, local), *output.data,
+              *values.data, *rowIdx.data, *colIdx.data, M,
+              static_cast<int>(values.info.offset),
+              static_cast<int>(rowIdx.info.offset),
+              static_cast<int>(colIdx.info.offset));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void dense2csr(Param values, Param rowIdx, Param colIdx, const Param dense) {
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(IS_CPLX, (IsComplex ? 1 : 0)),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto dense2Csr = common::getKernel("dense2Csr", {{dense2csr_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    int num_rows = dense.info.dims[0];
+    int num_cols = dense.info.dims[1];
+
+    // sd1 contains output of scan along dim 1 of dense
+    Array<int> sd1 = createEmptyArray<int>(dim4(num_rows, num_cols));
+    // rd1 contains output of nonzero count along dim 1 of dense
+    Array<int> rd1 = createEmptyArray<int>(num_rows);
+
+    scanDim<T, int, af_notzero_t>(sd1, dense, 1, true);
+    reduceDim<T, int, af_notzero_t>(rd1, dense, 0, 0, 1);
+    scanFirst<int, int, af_add_t>(rowIdx, rd1, false);
+
+    int nnz = values.info.dims[0];
+    getQueue().enqueueFillBuffer(
+        *rowIdx.data, nnz,
+        rowIdx.info.offset + (rowIdx.info.dims[0] - 1) * sizeof(int),
+        sizeof(int));
+
+    cl::NDRange local(THREADS_X, THREADS_Y);
+    int groups_x = divup(dense.info.dims[0], local[0]);
+    int groups_y = divup(dense.info.dims[1], local[1]);
+    cl::NDRange global(groups_x * local[0], groups_y * local[1]);
+
+    const Param sdParam = sd1;
+
+    dense2Csr(cl::EnqueueArgs(getQueue(), global, local), *values.data,
+              *colIdx.data, *dense.data, dense.info, *sdParam.data,
+              sdParam.info, *rowIdx.data);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void swapIndex(Param ovalues, Param oindex, const Param ivalues,
+               const cl::Buffer *iindex, const Param swapIdx) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto swapIndex = common::getKernel("swapIndex", {{csr2coo_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    cl::NDRange global(ovalues.info.dims[0], 1, 1);
+
+    swapIndex(cl::EnqueueArgs(getQueue(), global), *ovalues.data, *oindex.data,
+              *ivalues.data, *iindex, *swapIdx.data,
+              static_cast<int>(ovalues.info.dims[0]));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void csr2coo(Param ovalues, Param orowIdx, Param ocolIdx, const Param ivalues,
+             const Param irowIdx, const Param icolIdx, Param index) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto csr2coo = common::getKernel("csr2Coo", {{csr2coo_cl_src}}, tmpltArgs,
+                                     compileOpts);
+
+    const int MAX_GROUPS = 4096;
+    int M                = irowIdx.info.dims[0] - 1;
+    // FIXME: This needs to be based on nonzeros per row
+    int threads = 64;
+
+    cl::Buffer *scratch = bufferAlloc(orowIdx.info.dims[0] * sizeof(int));
+
+    cl::NDRange local(threads, 1);
+    int groups_x = std::min((int)(divup(M, local[0])), MAX_GROUPS);
+    cl::NDRange global(local[0] * groups_x, 1);
+
+    csr2coo(cl::EnqueueArgs(getQueue(), global, local), *scratch, *ocolIdx.data,
+            *irowIdx.data, *icolIdx.data, M);
+
+    // Now we need to sort this into column major
+    kernel::sort0ByKeyIterative<int, int>(ocolIdx, index, true);
+
+    // Now use index to sort values and rows
+    kernel::swapIndex<T>(ovalues, orowIdx, ivalues, scratch, index);
+
+    CL_DEBUG_FINISH(getQueue());
+
+    bufferFree(scratch);
+}
+
+template<typename T>
+void coo2csr(Param ovalues, Param orowIdx, Param ocolIdx, const Param ivalues,
+             const Param irowIdx, const Param icolIdx, Param index,
+             Param rowCopy, const int M) {
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto csrReduce = common::getKernel("csrReduce", {{csr2coo_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    // Now we need to sort this into row major
+    kernel::sort0ByKeyIterative<int, int>(rowCopy, index, true);
+
+    // Now use index to sort values and columns
+    kernel::swapIndex<T>(ovalues, ocolIdx, ivalues, icolIdx.data, index);
+
+    CL_DEBUG_FINISH(getQueue());
+
+    cl::NDRange global(irowIdx.info.dims[0], 1, 1);
+
+    csrReduce(cl::EnqueueArgs(getQueue(), global), *orowIdx.data, *rowCopy.data,
+              M, static_cast<int>(ovalues.info.dims[0]));
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sparse_arith.hpp b/src/backend/opencl/kernel/sparse_arith.hpp
new file mode 100644
index 0000000000..17cd67ca8a
--- /dev/null
+++ b/src/backend/opencl/kernel/sparse_arith.hpp
@@ -0,0 +1,187 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel_headers/sp_sp_arith_csr.hpp>
+#include <kernel_headers/sparse_arith_common.hpp>
+#include <kernel_headers/sparse_arith_coo.hpp>
+#include <kernel_headers/sparse_arith_csr.hpp>
+#include <kernel_headers/ssarith_calc_out_nnz.hpp>
+#include <math.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr unsigned TX      = 32;
+constexpr unsigned TY      = 8;
+constexpr unsigned THREADS = TX * TY;
+
+template<af_op_t op>
+AF_CONSTEXPR const char *getOpString() {
+    switch (op) {
+        case af_add_t: return "ADD";
+        case af_sub_t: return "SUB";
+        case af_mul_t: return "MUL";
+        case af_div_t: return "DIV";
+        default: return "";  // kernel will fail to compile
+    }
+    return "";
+}
+
+template<typename T, af_op_t op>
+auto fetchKernel(const std::string key, const common::Source &additionalSrc,
+                 const std::vector<std::string> additionalOptions = {}) {
+    constexpr bool IsComplex =
+        std::is_same<T, cfloat>::value || std::is_same<T, cdouble>::value;
+
+    std::array<TemplateArg, 2> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(op),
+    };
+    std::vector<std::string> options = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(OP, getOpString<op>()),
+        DefineKeyValue(IS_CPLX, (IsComplex ? 1 : 0)),
+    };
+    options.emplace_back(getTypeBuildDefinition<T>());
+    options.insert(std::end(options), std::begin(additionalOptions),
+                   std::end(additionalOptions));
+    return common::getKernel(key, {{sparse_arith_common_cl_src, additionalSrc}},
+                             tmpltArgs, options);
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param out, const Param values, const Param rowIdx,
+                      const Param colIdx, const Param rhs, const bool reverse) {
+    auto sparseArithCSR =
+        fetchKernel<T, op>("sparseArithCSR", sparse_arith_csr_cl_src);
+
+    cl::NDRange local(TX, TY, 1);
+    cl::NDRange global(divup(out.info.dims[0], TY) * TX, TY, 1);
+
+    sparseArithCSR(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                   out.info, *values.data, *rowIdx.data, *colIdx.data,
+                   static_cast<int>(values.info.dims[0]), *rhs.data, rhs.info,
+                   static_cast<int>(reverse));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param out, const Param values, const Param rowIdx,
+                      const Param colIdx, const Param rhs, const bool reverse) {
+    auto sparseArithCOO =
+        fetchKernel<T, op>("sparseArithCOO", sparse_arith_coo_cl_src);
+
+    cl::NDRange local(THREADS, 1, 1);
+    cl::NDRange global(divup(values.info.dims[0], THREADS) * THREADS, 1, 1);
+
+    sparseArithCOO(cl::EnqueueArgs(getQueue(), global, local), *out.data,
+                   out.info, *values.data, *rowIdx.data, *colIdx.data,
+                   static_cast<int>(values.info.dims[0]), *rhs.data, rhs.info,
+                   static_cast<int>(reverse));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCSR(Param values, Param rowIdx, Param colIdx, const Param rhs,
+                      const bool reverse) {
+    auto sparseArithCSR =
+        fetchKernel<T, op>("sparseArithCSR2", sparse_arith_csr_cl_src);
+
+    cl::NDRange local(TX, TY, 1);
+    cl::NDRange global(divup(rhs.info.dims[0], TY) * TX, TY, 1);
+
+    sparseArithCSR(cl::EnqueueArgs(getQueue(), global, local), *values.data,
+                   *rowIdx.data, *colIdx.data,
+                   static_cast<int>(values.info.dims[0]), *rhs.data, rhs.info,
+                   static_cast<int>(reverse));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+void sparseArithOpCOO(Param values, Param rowIdx, Param colIdx, const Param rhs,
+                      const bool reverse) {
+    auto sparseArithCOO =
+        fetchKernel<T, op>("sparseArithCOO2", sparse_arith_coo_cl_src);
+
+    cl::NDRange local(THREADS, 1, 1);
+    cl::NDRange global(divup(values.info.dims[0], THREADS) * THREADS, 1, 1);
+
+    sparseArithCOO(cl::EnqueueArgs(getQueue(), global, local), *values.data,
+                   *rowIdx.data, *colIdx.data,
+                   static_cast<int>(values.info.dims[0]), *rhs.data, rhs.info,
+                   static_cast<int>(reverse));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+static void csrCalcOutNNZ(Param outRowIdx, unsigned &nnzC, const uint M,
+                          const uint N, uint nnzA, const Param lrowIdx,
+                          const Param lcolIdx, uint nnzB, const Param rrowIdx,
+                          const Param rcolIdx) {
+    UNUSED(N);
+    UNUSED(nnzA);
+    UNUSED(nnzB);
+
+    std::vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<uint>(),
+    };
+
+    auto calcNNZ = common::getKernel(
+        "csr_calc_out_nnz", {{ssarith_calc_out_nnz_cl_src}}, tmpltArgs, {});
+
+    cl::NDRange local(256, 1);
+    cl::NDRange global(divup(M, local[0]) * local[0], 1, 1);
+
+    nnzC     = 0;
+    auto out = memAlloc<unsigned>(1);
+    getQueue().enqueueFillBuffer(*out, nnzC, 0, sizeof(unsigned));
+
+    calcNNZ(cl::EnqueueArgs(getQueue(), global, local), *out, *outRowIdx.data,
+            M, *lrowIdx.data, *lcolIdx.data, *rrowIdx.data, *rcolIdx.data,
+            cl::Local(local[0] * sizeof(unsigned int)));
+    getQueue().enqueueReadBuffer(*out, CL_TRUE, 0, sizeof(unsigned), &nnzC);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T, af_op_t op>
+void ssArithCSR(Param oVals, Param oColIdx, const Param oRowIdx, const uint M,
+                const uint N, unsigned nnzA, const Param lVals,
+                const Param lRowIdx, const Param lColIdx, unsigned nnzB,
+                const Param rVals, const Param rRowIdx, const Param rColIdx) {
+    const T iden_val =
+        (op == af_mul_t || op == af_div_t ? scalar<T>(1) : scalar<T>(0));
+
+    auto arithOp = fetchKernel<T, op>(
+        "ssarith_csr", sp_sp_arith_csr_cl_src,
+        {DefineKeyValue(IDENTITY_VALUE, scalar_to_option(iden_val))});
+
+    cl::NDRange local(256, 1);
+    cl::NDRange global(divup(M, local[0]) * local[0], 1, 1);
+
+    arithOp(cl::EnqueueArgs(getQueue(), global, local), *oVals.data,
+            *oColIdx.data, *oRowIdx.data, M, N, nnzA, *lVals.data,
+            *lRowIdx.data, *lColIdx.data, nnzB, *rVals.data, *rRowIdx.data,
+            *rColIdx.data);
+    CL_DEBUG_FINISH(getQueue());
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/sparse_arith_common.cl b/src/backend/opencl/kernel/sparse_arith_common.cl
new file mode 100644
index 0000000000..e89f223b4a
--- /dev/null
+++ b/src/backend/opencl/kernel/sparse_arith_common.cl
@@ -0,0 +1,37 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+T _add_(T v1, T v2) { return v1 + v2; }
+
+T _sub_(T v1, T v2) { return v1 - v2; }
+
+#if IS_CPLX
+T _mul_(T v1, T v2) {
+    T out;
+    out.x = v1.x * v2.x - v1.y * v2.y;
+    out.y = v1.x * v2.y + v1.y * v2.x;
+    return out;
+}
+
+T _div_(T v1, T v2) {
+    T out;
+    out.x = (v1.x * v2.x + v1.y * v2.y) / (v2.x * v2.x + v2.y * v2.y);
+    out.y = (v1.y * v2.x - v1.x * v2.y) / (v2.x * v2.x + v2.y * v2.y);
+    return out;
+}
+#else
+T _mul_(T v1, T v2) { return v1 * v2; }
+
+T _div_(T v1, T v2) { return v1 / v2; }
+#endif
+
+#define ADD _add_
+#define SUB _sub_
+#define MUL _mul_
+#define DIV _div_
diff --git a/src/backend/opencl/kernel/sparse_arith_coo.cl b/src/backend/opencl/kernel/sparse_arith_coo.cl
new file mode 100644
index 0000000000..07186f7a68
--- /dev/null
+++ b/src/backend/opencl/kernel/sparse_arith_coo.cl
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void sparseArithCOO(global T *oPtr, const KParam out,
+                           global const T *values, global const int *rowIdx,
+                           global const int *colIdx, const int nNZ,
+                           global const T *rPtr, const KParam rhs,
+                           const int reverse) {
+    const int idx = get_global_id(0);
+
+    if (idx >= nNZ) return;
+
+    const int row = rowIdx[idx];
+    const int col = colIdx[idx];
+
+    if (row >= out.dims[0] || col >= out.dims[1]) return;  // Bad indices
+
+    // Get Values
+    const T val  = values[idx];
+    const T rval = rPtr[col * rhs.strides[1] + row];
+
+    const int offset = col * out.strides[1] + row;
+    if (reverse)
+        oPtr[offset] = OP(rval, val);
+    else
+        oPtr[offset] = OP(val, rval);
+}
+
+kernel void sparseArithCOO2(global T *values, global int *rowIdx,
+                            global int *colIdx, const int nNZ,
+                            global const T *rPtr, const KParam rhs,
+                            const int reverse) {
+    const int idx = get_global_id(0);
+
+    if (idx >= nNZ) return;
+
+    const int row = rowIdx[idx];
+    const int col = colIdx[idx];
+
+    if (row >= rhs.dims[0] || col >= rhs.dims[1]) return;  // Bad indices
+
+    // Get Values
+    const T val  = values[idx];
+    const T rval = rPtr[col * rhs.strides[1] + row];
+
+    if (reverse)
+        values[idx] = OP(rval, val);
+    else
+        values[idx] = OP(val, rval);
+}
diff --git a/src/backend/opencl/kernel/sparse_arith_csr.cl b/src/backend/opencl/kernel/sparse_arith_csr.cl
new file mode 100644
index 0000000000..165db256a4
--- /dev/null
+++ b/src/backend/opencl/kernel/sparse_arith_csr.cl
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void sparseArithCSR(global T *oPtr, const KParam out,
+                           global const T *values, global const int *rowIdx,
+                           global const int *colIdx, const int nNZ,
+                           global const T *rPtr, const KParam rhs,
+                           const int reverse) {
+    const int row = get_group_id(0) * get_local_size(1) + get_local_id(1);
+
+    if (row >= out.dims[0]) return;
+
+    const int rowStartIdx = rowIdx[row];
+    const int rowEndIdx   = rowIdx[row + 1];
+
+    // Repeat loop until all values in the row are computed
+    for (int idx = rowStartIdx + get_local_id(0); idx < rowEndIdx;
+         idx += get_local_size(0)) {
+        const int col = colIdx[idx];
+
+        if (row >= out.dims[0] || col >= out.dims[1]) continue;  // Bad indices
+
+        // Get Values
+        const T val  = values[idx];
+        const T rval = rPtr[col * rhs.strides[1] + row];
+
+        const int offset = col * out.strides[1] + row;
+        if (reverse)
+            oPtr[offset] = OP(rval, val);
+        else
+            oPtr[offset] = OP(val, rval);
+    }
+}
+
+kernel void sparseArithCSR2(global T *values, global int *rowIdx,
+                            global int *colIdx, const int nNZ,
+                            global const T *rPtr, const KParam rhs,
+                            const int reverse) {
+    const int row = get_group_id(0) * get_local_size(1) + get_local_id(1);
+
+    if (row >= rhs.dims[0]) return;
+
+    const int rowStartIdx = rowIdx[row];
+    const int rowEndIdx   = rowIdx[row + 1];
+
+    // Repeat loop until all values in the row are computed
+    for (int idx = rowStartIdx + get_local_id(0); idx < rowEndIdx;
+         idx += get_local_size(0)) {
+        const int col = colIdx[idx];
+
+        if (row >= rhs.dims[0] || col >= rhs.dims[1]) continue;  // Bad indices
+
+        // Get Values
+        const T val  = values[idx];
+        const T rval = rPtr[col * rhs.strides[1] + row];
+
+        if (reverse)
+            values[idx] = OP(rval, val);
+        else
+            values[idx] = OP(val, rval);
+    }
+}
diff --git a/src/backend/opencl/kernel/ssarith_calc_out_nnz.cl b/src/backend/opencl/kernel/ssarith_calc_out_nnz.cl
new file mode 100644
index 0000000000..f395a1807d
--- /dev/null
+++ b/src/backend/opencl/kernel/ssarith_calc_out_nnz.cl
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void csr_calc_out_nnz(global unsigned *nnzc, global int *oRowIdx, uint M,
+                             global const int *lRowIdx,
+                             global const int *lColIdx,
+                             global const int *rRowIdx,
+                             global const int *rColIdx, local uint *blkNnz) {
+    const uint row = get_global_id(0);
+    const uint tid = get_local_id(0);
+
+    const bool valid = row < M;
+
+    const uint lEnd = (valid ? lRowIdx[row + 1] : 0);
+    const uint rEnd = (valid ? rRowIdx[row + 1] : 0);
+
+    blkNnz[tid] = 0;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    uint l   = (valid ? lRowIdx[row] : 0);
+    uint r   = (valid ? rRowIdx[row] : 0);
+    uint nnz = 0;
+    while (l < lEnd && r < rEnd) {
+        uint lci = lColIdx[l];
+        uint rci = rColIdx[r];
+        l += (lci <= rci);
+        r += (lci >= rci);
+        nnz++;
+    }
+    nnz += (lEnd - l);
+    nnz += (rEnd - r);
+
+    blkNnz[tid] = nnz;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (valid) oRowIdx[row + 1] = nnz;
+
+    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
+        if (tid < s) { blkNnz[tid] += blkNnz[tid + s]; }
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    if (tid == 0) {
+        nnz = blkNnz[0];
+        atomic_add(nnzc, nnz);
+    }
+}
diff --git a/src/backend/opencl/kernel/susan.cl b/src/backend/opencl/kernel/susan.cl
new file mode 100644
index 0000000000..fa4e9f892d
--- /dev/null
+++ b/src/backend/opencl/kernel/susan.cl
@@ -0,0 +1,113 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#define MAX_VAL(A, B) ((A) < (B) ? (B) : (A))
+
+#ifdef RESPONSE
+kernel void susan_responses(global T* out, global const T* in_,
+                            const unsigned in_off, const unsigned idim0,
+                            const unsigned idim1, const float t, const float g,
+                            const unsigned edge) {
+    global const T* in = in_ + in_off;
+
+    const int rSqrd   = RADIUS * RADIUS;
+    const int windLen = 2 * RADIUS + 1;
+    const int shrdLen = BLOCK_X + windLen - 1;
+    local T localMem[LOCAL_MEM_SIZE];
+
+    const unsigned lx = get_local_id(0);
+    const unsigned ly = get_local_id(1);
+    const unsigned gx = get_global_id(0) + edge;
+    const unsigned gy = get_global_id(1) + edge;
+
+    const unsigned nucleusIdx = (ly + RADIUS) * shrdLen + lx + RADIUS;
+    if (gx < idim0 && gy < idim1)
+        localMem[nucleusIdx] = in[gy * idim0 + gx];
+    else
+        localMem[nucleusIdx] = 0;
+    T m_0 = localMem[nucleusIdx];
+
+#pragma unroll
+    for (int b = ly, gy2 = gy; b < shrdLen; b += BLOCK_Y, gy2 += BLOCK_Y) {
+        int j = gy2 - RADIUS;
+#pragma unroll
+        for (int a = lx, gx2 = gx; a < shrdLen; a += BLOCK_X, gx2 += BLOCK_X) {
+            int i = gx2 - RADIUS;
+            if (i < idim0 && j < idim1)
+                localMem[b * shrdLen + a] = in[i + idim0 * j];
+            else
+                localMem[b * shrdLen + a] = m_0;
+        }
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    if (gx < idim0 - edge && gy < idim1 - edge) {
+        unsigned idx = gx + idim0 * gy;
+        float nM     = 0.0f;
+#pragma unroll
+        for (int p = 0; p < windLen; ++p) {
+#pragma unroll
+            for (int q = 0; q < windLen; ++q) {
+                int i = p - RADIUS;
+                int j = q - RADIUS;
+                int a = lx + RADIUS + i;
+                int b = ly + RADIUS + j;
+                if (i * i + j * j < rSqrd) {
+                    float c       = m_0;
+                    float m       = localMem[b * shrdLen + a];
+                    float exp_pow = pow((m - c) / t, 6.0f);
+                    float cM      = exp(-exp_pow);
+                    nM += cM;
+                }
+            }
+        }
+        out[idx] = nM < g ? g - nM : (T)0;
+    }
+}
+#endif
+
+#ifdef NONMAX
+kernel void non_maximal(global float* x_out, global float* y_out,
+                        global float* resp_out, global unsigned* count,
+                        const unsigned idim0, const unsigned idim1,
+                        global const T* resp_in, const unsigned edge,
+                        const unsigned max_corners) {
+    // Responses on the border don't have 8-neighbors to compare, discard them
+    const unsigned r = edge + 1;
+
+    const unsigned gx = get_global_id(0) + r;
+    const unsigned gy = get_global_id(1) + r;
+
+    if (gx < idim0 - r && gy < idim1 - r) {
+        const T v = resp_in[gy * idim0 + gx];
+
+        // Find maximum neighborhood response
+        T max_v;
+        max_v = MAX_VAL(resp_in[(gy - 1) * idim0 + gx - 1],
+                        resp_in[gy * idim0 + gx - 1]);
+        max_v = MAX_VAL(max_v, resp_in[(gy + 1) * idim0 + gx - 1]);
+        max_v = MAX_VAL(max_v, resp_in[(gy - 1) * idim0 + gx]);
+        max_v = MAX_VAL(max_v, resp_in[(gy + 1) * idim0 + gx]);
+        max_v = MAX_VAL(max_v, resp_in[(gy - 1) * idim0 + gx + 1]);
+        max_v = MAX_VAL(max_v, resp_in[(gy)*idim0 + gx + 1]);
+        max_v = MAX_VAL(max_v, resp_in[(gy + 1) * idim0 + gx + 1]);
+
+        // Stores the corner in {x,y,resp}_out if its response is the maximum
+        // of its 8-neighborhood and at least the minimum response
+        if (v > max_v) {
+            const unsigned idx = atomic_inc(count);
+            if (idx < max_corners) {
+                x_out[idx]    = (float)gx;
+                y_out[idx]    = (float)gy;
+                resp_out[idx] = (float)v;
+            }
+        }
+    }
+}
+#endif
diff --git a/src/backend/opencl/kernel/susan.hpp b/src/backend/opencl/kernel/susan.hpp
new file mode 100644
index 0000000000..4b87b43a85
--- /dev/null
+++ b/src/backend/opencl/kernel/susan.hpp
@@ -0,0 +1,99 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/susan.hpp>
+#include <memory.hpp>
+#include <traits.hpp>
+#include <af/defines.h>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+constexpr unsigned SUSAN_THREADS_X = 16;
+constexpr unsigned SUSAN_THREADS_Y = 16;
+
+template<typename T>
+void susan(cl::Buffer* out, const cl::Buffer* in, const unsigned in_off,
+           const unsigned idim0, const unsigned idim1, const float t,
+           const float g, const unsigned edge, const unsigned radius) {
+    const size_t LOCAL_MEM_SIZE =
+        (SUSAN_THREADS_X + 2 * radius) * (SUSAN_THREADS_Y + 2 * radius);
+
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+        TemplateArg(radius),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineValue(LOCAL_MEM_SIZE),
+        DefineKeyValue(BLOCK_X, SUSAN_THREADS_X),
+        DefineKeyValue(BLOCK_Y, SUSAN_THREADS_Y),
+        DefineKeyValue(RADIUS, radius),
+        DefineKey(RESPONSE),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto susan = common::getKernel("susan_responses", {{susan_cl_src}}, targs,
+                                   compileOpts);
+
+    cl::NDRange local(SUSAN_THREADS_X, SUSAN_THREADS_Y);
+    cl::NDRange global(divup(idim0 - 2 * edge, local[0]) * local[0],
+                       divup(idim1 - 2 * edge, local[1]) * local[1]);
+
+    susan(cl::EnqueueArgs(getQueue(), global, local), *out, *in, in_off, idim0,
+          idim1, t, g, edge);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+unsigned nonMaximal(cl::Buffer* x_out, cl::Buffer* y_out, cl::Buffer* resp_out,
+                    const unsigned idim0, const unsigned idim1,
+                    const cl::Buffer* resp_in, const unsigned edge,
+                    const unsigned max_corners) {
+    std::vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+    };
+    std::vector<std::string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKey(NONMAX),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto nonMax =
+        common::getKernel("non_maximal", {{susan_cl_src}}, targs, compileOpts);
+
+    unsigned corners_found = 0;
+    auto d_corners_found   = memAlloc<unsigned>(1);
+    getQueue().enqueueFillBuffer(*d_corners_found, corners_found, 0,
+                                 sizeof(unsigned));
+
+    cl::NDRange local(SUSAN_THREADS_X, SUSAN_THREADS_Y);
+    cl::NDRange global(divup(idim0 - 2 * edge, local[0]) * local[0],
+                       divup(idim1 - 2 * edge, local[1]) * local[1]);
+
+    nonMax(cl::EnqueueArgs(getQueue(), global, local), *x_out, *y_out,
+           *resp_out, *d_corners_found, idim0, idim1, *resp_in, edge,
+           max_corners);
+    getQueue().enqueueReadBuffer(*d_corners_found, CL_TRUE, 0, sizeof(unsigned),
+                                 &corners_found);
+    return corners_found;
+}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/swapdblk.cl b/src/backend/opencl/kernel/swapdblk.cl
index c1bbf87dd8..35c61c8889 100644
--- a/src/backend/opencl/kernel/swapdblk.cl
+++ b/src/backend/opencl/kernel/swapdblk.cl
@@ -29,31 +29,29 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
 
-__kernel void
-swapdblk(int nb,
-         __global T *dA, unsigned long dA_offset, int ldda, int inca,
-         __global T *dB, unsigned long dB_offset, int lddb, int incb)
-{
+kernel void swapdblk(int nb, global T *dA, unsigned long dA_offset,
+                     int ldda, int inca, global T *dB,
+                     unsigned long dB_offset, int lddb, int incb) {
     const int tx = get_local_id(0);
     const int bx = get_group_id(0);
 
@@ -62,10 +60,10 @@ swapdblk(int nb,
 
     T tmp;
 
-    #pragma unroll
-    for( int i = 0; i < nb; i++ ){
-        tmp        = dA[i*ldda];
-        dA[i*ldda] = dB[i*lddb];
-        dB[i*lddb] = tmp;
+#pragma unroll
+    for (int i = 0; i < nb; i++) {
+        tmp          = dA[i * ldda];
+        dA[i * ldda] = dB[i * lddb];
+        dB[i * lddb] = tmp;
     }
 }
diff --git a/src/backend/opencl/kernel/swapdblk.hpp b/src/backend/opencl/kernel/swapdblk.hpp
index 4577bdd259..a6c96ea940 100644
--- a/src/backend/opencl/kernel/swapdblk.hpp
+++ b/src/backend/opencl/kernel/swapdblk.hpp
@@ -8,76 +8,56 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/swapdblk.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/err_common.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-
-namespace opencl
-{
+#include <kernel_headers/swapdblk.hpp>
+#include <traits.hpp>
 
-namespace kernel
-{
+#include <string>
+#include <vector>
 
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
 template<typename T>
-void swapdblk(int n, int nb,
-              cl_mem dA, size_t dA_offset, int ldda, int inca,
-              cl_mem dB, size_t dB_offset, int lddb, int incb)
-{
-
-    static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-    static std::map<int, Program*>  swpProgs;
-    static std::map<int, Kernel*> swpKernels;
-
-    int device = getActiveDeviceId();
-
-    std::call_once(compileFlags[device], [device] () {
-
-            std::ostringstream options;
-            options << " -D T=" << dtype_traits<T>::getName();
-
-            if (std::is_same<T, double>::value ||
-                std::is_same<T, cdouble>::value) {
-                options << " -D USE_DOUBLE";
-            }
-
-            cl::Program prog;
-            buildProgram(prog, swapdblk_cl, swapdblk_cl_len, options.str());
-            swpProgs[device] = new Program(prog);
-
-            swpKernels[device] = new Kernel(*swpProgs[device], "swapdblk");
-        });
+void swapdblk(int n, int nb, cl_mem dA, size_t dA_offset, int ldda, int inca,
+              cl_mem dB, size_t dB_offset, int lddb, int incb,
+              cl_command_queue queue) {
+    using cl::Buffer;
+    using cl::CommandQueue;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
 
     int nblocks = n / nb;
+    if (nblocks == 0) return;
 
-    if(nblocks == 0)
-        return;
+    vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto swapdblk =
+        common::getKernel("swapdblk", {{swapdblk_cl_src}}, targs, compileOpts);
 
     int info = 0;
     if (n < 0) {
         info = -1;
     } else if (nb < 1 || nb > 1024) {
         info = -2;
-    } else if (ldda < (nblocks-1)*nb*inca + nb) {
+    } else if (ldda < (nblocks - 1) * nb * inca + nb) {
         info = -4;
     } else if (inca < 0) {
         info = -5;
-    } else if (lddb < (nblocks-1)*nb*incb + nb) {
+    } else if (lddb < (nblocks - 1) * nb * incb + nb) {
         info = -7;
     } else if (incb < 0) {
         info = -8;
@@ -90,16 +70,15 @@ void swapdblk(int n, int nb,
 
     NDRange local(nb);
     NDRange global(nblocks * nb);
-    auto swapdOp = make_kernel<int,
-                               cl_mem, unsigned long long, int, int,
-                               cl_mem, unsigned long long, int, int>(*swpKernels[device]);
 
-    swapdOp(EnqueueArgs(getQueue(), global, local),
-            nb,
-            dA, dA_offset, ldda, inca,
-            dB, dB_offset, lddb, incb);
+    Buffer dAObj(dA, true);
+    Buffer dBObj(dB, true);
 
+    CommandQueue q(queue, true);
+    swapdblk(EnqueueArgs(q, global, local), nb, dAObj, dA_offset, ldda, inca,
+             dBObj, dB_offset, lddb, incb);
+    CL_DEBUG_FINISH(getQueue());
 }
-
-}
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/tile.cl b/src/backend/opencl/kernel/tile.cl
index 37fbe63de2..89323294db 100644
--- a/src/backend/opencl/kernel/tile.cl
+++ b/src/backend/opencl/kernel/tile.cl
@@ -7,10 +7,9 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void tile_kernel(__global T *out, __global const T *in, const KParam op, const KParam ip,
-                 const int blocksPerMatX, const int blocksPerMatY)
-{
+kernel void tile(global T *out, global const T *in, const KParam op,
+                 const KParam ip, const int blocksPerMatX,
+                 const int blocksPerMatY) {
     const int oz = get_group_id(0) / blocksPerMatX;
     const int ow = get_group_id(1) / blocksPerMatY;
 
@@ -20,23 +19,21 @@ void tile_kernel(__global T *out, __global const T *in, const KParam op, const K
     const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
     const int yy = get_local_id(1) + blockIdx_y * get_local_size(1);
 
-    if(xx >= op.dims[0] ||
-       yy >= op.dims[1] ||
-       oz >= op.dims[2] ||
-       ow >= op.dims[3])
+    if (xx >= op.dims[0] || yy >= op.dims[1] || oz >= op.dims[2] ||
+        ow >= op.dims[3])
         return;
 
-    const int iz = oz % ip.dims[2];
-    const int iw = ow % ip.dims[3];
+    const int iz  = oz % ip.dims[2];
+    const int iw  = ow % ip.dims[3];
     const int izw = iw * ip.strides[3] + iz * ip.strides[2];
     const int ozw = ow * op.strides[3] + oz * op.strides[2];
 
     const int incy = blocksPerMatY * get_local_size(1);
     const int incx = blocksPerMatX * get_local_size(0);
 
-    for(int oy = yy; oy < op.dims[1]; oy += incy) {
+    for (int oy = yy; oy < op.dims[1]; oy += incy) {
         const int iy = oy % ip.dims[1];
-        for(int ox = xx; ox < op.dims[0]; ox += incx) {
+        for (int ox = xx; ox < op.dims[0]; ox += incx) {
             const int ix = ox % ip.dims[0];
 
             int iMem = izw + iy * ip.strides[1] + ix;
diff --git a/src/backend/opencl/kernel/tile.hpp b/src/backend/opencl/kernel/tile.hpp
index 710aefc0b1..7c9b042372 100644
--- a/src/backend/opencl/kernel/tile.hpp
+++ b/src/backend/opencl/kernel/tile.hpp
@@ -8,77 +8,53 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/tile.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel_headers/tile.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        // Kernel Launch Config Values
-        static const int TX = 32;
-        static const int TY = 8;
-        static const int TILEX = 512;
-        static const int TILEY = 32;
-
-        template<typename T>
-        void tile(Param out, const Param in)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   tileProgs;
-                static std::map<int, Kernel *> tileKernels;
-
-                int device = getActiveDeviceId();
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T=" << dtype_traits<T>::getName();
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-                    Program prog;
-                    buildProgram(prog, tile_cl, tile_cl_len, options.str());
-                    tileProgs[device] = new Program(prog);
-                    tileKernels[device] = new Kernel(*tileProgs[device], "tile_kernel");
-                });
-
-                auto tileOp = make_kernel<Buffer, const Buffer, const KParam, const KParam,
-                                          const int, const int> (*tileKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int blocksPerMatX = divup(out.info.dims[0], TILEX);
-                int blocksPerMatY = divup(out.info.dims[1], TILEY);
-                NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
-                               local[1] * blocksPerMatY * out.info.dims[3],
-                               1);
-
-                tileOp(EnqueueArgs(getQueue(), global, local),
-                       *out.data, *in.data, out.info, in.info,
-                       blocksPerMatX, blocksPerMatY);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
-    }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+void tile(Param out, const Param in) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr int TX    = 32;
+    constexpr int TY    = 8;
+    constexpr int TILEX = 512;
+    constexpr int TILEY = 32;
+
+    vector<TemplateArg> targs = {
+        TemplateTypename<T>(),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto tile = common::getKernel("tile", {{tile_cl_src}}, targs, compileOpts);
+
+    NDRange local(TX, TY, 1);
+
+    int blocksPerMatX = divup(out.info.dims[0], TILEX);
+    int blocksPerMatY = divup(out.info.dims[1], TILEY);
+    NDRange global(local[0] * blocksPerMatX * out.info.dims[2],
+                   local[1] * blocksPerMatY * out.info.dims[3], 1);
+
+    tile(EnqueueArgs(getQueue(), global, local), *out.data, *in.data, out.info,
+         in.info, blocksPerMatX, blocksPerMatY);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/trace_edge.cl b/src/backend/opencl/kernel/trace_edge.cl
new file mode 100644
index 0000000000..5291b0158c
--- /dev/null
+++ b/src/backend/opencl/kernel/trace_edge.cl
@@ -0,0 +1,223 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#define STRONG 1
+#define WEAK 2
+#define NOEDGE 0
+
+#if defined(INIT_EDGE_OUT)
+kernel void initEdgeOutKernel(global T* output, KParam oInfo,
+                              global const T* strong, KParam sInfo,
+                              global const T* weak, KParam wInfo,
+                              unsigned nBBS0, unsigned nBBS1) {
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = get_group_id(0) / nBBS0;
+    const unsigned b3 = get_group_id(1) / nBBS1;
+
+    // global indices
+    const int gx =
+        get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + get_local_id(0);
+    const int gy =
+        get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + get_local_id(1);
+
+    // Offset input and output pointers to second pixel of second column/row
+    // to skip the border
+    global const T* wPtr =
+        weak + (b2 * wInfo.strides[2] + b3 * wInfo.strides[3] + wInfo.offset) +
+        wInfo.strides[1] + 1;
+
+    global const T* sPtr =
+        strong +
+        (b2 * sInfo.strides[2] + b3 * sInfo.strides[3] + sInfo.offset) +
+        sInfo.strides[1] + 1;
+
+    global T* oPtr =
+        output +
+        (b2 * oInfo.strides[2] + b3 * oInfo.strides[3] + oInfo.offset) +
+        oInfo.strides[1] + 1;
+
+    if (gx < (oInfo.dims[0] - 2) && gy < (oInfo.dims[1] - 2)) {
+        int idx   = gx * oInfo.strides[0] + gy * oInfo.strides[1];
+        oPtr[idx] = (sPtr[idx] > 0 ? STRONG : (wPtr[idx] > 0 ? WEAK : NOEDGE));
+    }
+}
+#endif
+
+#define VALID_BLOCK_IDX(j, i)                             \
+    ((j) > 0 && (j) < (SHRD_MEM_HEIGHT - 1) && (i) > 0 && \
+     (i) < (SHRD_MEM_WIDTH - 1))
+
+#if defined(EDGE_TRACER)
+kernel void edgeTrackKernel(global T* output, KParam oInfo, unsigned nBBS0,
+                            unsigned nBBS1, global volatile int* hasChanged) {
+    // local memory tile with a 1 pixel border
+    // labels are stored as int, so a (16+2)*(16+2) tile holds
+    // 324 elements (324 * sizeof(int) bytes) of local memory
+    local int outMem[SHRD_MEM_HEIGHT][SHRD_MEM_WIDTH];
+    local bool predicates[TOTAL_NUM_THREADS];
+
+    // local thread indices
+    const int lx = get_local_id(0);
+    const int ly = get_local_id(1);
+
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = get_group_id(0) / nBBS0;
+    const unsigned b3 = get_group_id(1) / nBBS1;
+
+    // global indices
+    const int gx = get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + lx;
+    const int gy = get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + ly;
+
+    // Offset output pointer to the current batch; the 1 pixel border is
+    // handled while loading the tile into local memory below
+    global T* oPtr = output + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]);
+
+    // pull image to local memory
+#pragma unroll
+    for (int b = ly, gy2 = gy - 1; b < SHRD_MEM_HEIGHT;
+         b += get_local_size(1), gy2 += get_local_size(1)) {
+#pragma unroll
+        for (int a = lx, gx2 = gx - 1; a < SHRD_MEM_WIDTH;
+             a += get_local_size(0), gx2 += get_local_size(0)) {
+            if (gx2 >= 0 && gx2 < oInfo.dims[0] && gy2 >= 0 &&
+                gy2 < oInfo.dims[1])
+                outMem[b][a] =
+                    oPtr[gx2 * oInfo.strides[0] + gy2 * oInfo.strides[1]];
+            else
+                outMem[b][a] = NOEDGE;
+        }
+    }
+
+    int i = lx + 1;
+    int j = ly + 1;
+
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    int tid = lx + get_local_size(0) * ly;
+
+    bool continueIter = true;
+
+    while (continueIter) {
+        if (outMem[j][i] == WEAK) {
+            int nw, no, ne, we, ea, sw, so, se;
+            nw = outMem[j - 1][i - 1];
+            no = outMem[j - 1][i];
+            ne = outMem[j - 1][i + 1];
+            we = outMem[j][i - 1];
+            ea = outMem[j][i + 1];
+            sw = outMem[j + 1][i - 1];
+            so = outMem[j + 1][i];
+            se = outMem[j + 1][i + 1];
+
+            bool hasStrongNeighbour =
+                nw == STRONG || no == STRONG || ne == STRONG || ea == STRONG ||
+                se == STRONG || so == STRONG || sw == STRONG || we == STRONG;
+
+            if (hasStrongNeighbour) outMem[j][i] = STRONG;
+        }
+
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        predicates[tid] = false;
+        if (outMem[j][i] == STRONG) {
+            bool nw, no, ne, we, ea, sw, so, se;
+            // clang-format off
+            nw = outMem[j - 1][i - 1] == WEAK && VALID_BLOCK_IDX(j - 1, i - 1);
+            no = outMem[j - 1][i]     == WEAK && VALID_BLOCK_IDX(j - 1, i);
+            ne = outMem[j - 1][i + 1] == WEAK && VALID_BLOCK_IDX(j - 1, i + 1);
+            we = outMem[j][i - 1]     == WEAK && VALID_BLOCK_IDX(j, i - 1);
+            ea = outMem[j][i + 1]     == WEAK && VALID_BLOCK_IDX(j, i + 1);
+            sw = outMem[j + 1][i - 1] == WEAK && VALID_BLOCK_IDX(j + 1, i - 1);
+            so = outMem[j + 1][i]     == WEAK && VALID_BLOCK_IDX(j + 1, i);
+            se = outMem[j + 1][i + 1] == WEAK && VALID_BLOCK_IDX(j + 1, i + 1);
+            // clang-format on
+
+            bool hasWeakNeighbour =
+                nw || no || ne || ea || se || so || sw || we;
+
+            predicates[tid] = hasWeakNeighbour;
+        }
+        barrier(CLK_LOCAL_MEM_FENCE);
+
+        // The following block is equivalent to __syncthreads_or in CUDA
+        for (int nt = TOTAL_NUM_THREADS >> 1; nt > 0; nt >>= 1) {
+            if (tid < nt) {
+                predicates[tid] = predicates[tid] || predicates[tid + nt];
+            }
+            barrier(CLK_LOCAL_MEM_FENCE);
+        }
+
+        continueIter = predicates[0];
+
+        // Needed for Intel OpenCL implementation targeting CPUs
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+
+    // Check whether any strong pixel in this tile still has a weak
+    // neighbour, e.g. in the 1-pixel border ring around the main region;
+    // if so, increment hasChanged.
+    int cu = outMem[j][i];
+    int nw = outMem[j - 1][i - 1];
+    int no = outMem[j - 1][i];
+    int ne = outMem[j - 1][i + 1];
+    int ea = outMem[j][i + 1];
+    int se = outMem[j + 1][i + 1];
+    int so = outMem[j + 1][i];
+    int sw = outMem[j + 1][i - 1];
+    int we = outMem[j][i - 1];
+
+    bool hasWeakNeighbour = nw == WEAK || no == WEAK || ne == WEAK ||
+                            ea == WEAK || se == WEAK || so == WEAK ||
+                            sw == WEAK || we == WEAK;
+
+    // The following block is equivalent to __syncthreads_or in CUDA
+    predicates[tid] = cu == STRONG && hasWeakNeighbour;
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    for (int nt = TOTAL_NUM_THREADS / 2; nt > 0; nt >>= 1) {
+        if (tid < nt) predicates[tid] = predicates[tid] || predicates[tid + nt];
+        barrier(CLK_LOCAL_MEM_FENCE);
+    }
+    barrier(CLK_LOCAL_MEM_FENCE);
+
+    continueIter = predicates[0];
+
+    if (continueIter && lx == 0 && ly == 0) atomic_inc(hasChanged);
+
+    // Update output with shared memory result
+    if (gx < (oInfo.dims[0] - 1) && gy < (oInfo.dims[1] - 1))
+        oPtr[(gx * oInfo.strides[0] + gy * oInfo.strides[1])] = outMem[j][i];
+}
+#endif
+
+#if defined(SUPPRESS_LEFT_OVER)
+kernel void suppressLeftOverKernel(global T* output, KParam oInfo,
+                                   unsigned nBBS0, unsigned nBBS1) {
+    // batch offsets for 3rd and 4th dimension
+    const unsigned b2 = get_group_id(0) / nBBS0;
+    const unsigned b3 = get_group_id(1) / nBBS1;
+
+    // global indices
+    const int gx =
+        get_local_size(0) * (get_group_id(0) - b2 * nBBS0) + get_local_id(0);
+    const int gy =
+        get_local_size(1) * (get_group_id(1) - b3 * nBBS1) + get_local_id(1);
+
+    // Offset output pointer to second pixel of second column/row
+    // to skip the border
+    global T* oPtr = output + (b2 * oInfo.strides[2] + b3 * oInfo.strides[3]) +
+                     oInfo.strides[1] + 1;
+
+    if (gx < (oInfo.dims[0] - 2) && gy < (oInfo.dims[1] - 2)) {
+        int idx = gx * oInfo.strides[0] + gy * oInfo.strides[1];
+        T val   = oPtr[idx];
+        if (val == WEAK) oPtr[idx] = NOEDGE;
+    }
+}
+#endif
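
Review note on trace_edge.cl: the three kernels together implement hysteresis edge tracking. Labels are initialised from the strong/weak images, weak pixels touching a strong pixel are promoted until nothing changes, and leftover weak pixels are cleared. A minimal CPU sketch of that flow is given below for checking kernel output on small inputs. It is not part of ArrayFire: `traceEdgesReference` and the row-major layout are assumptions made for illustration, and it runs sequentially rather than per tile.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Labels mirror the NOEDGE / STRONG / WEAK defines in trace_edge.cl.
constexpr int NOEDGE = 0;
constexpr int STRONG = 1;
constexpr int WEAK   = 2;

// Hypothetical CPU reference (not an ArrayFire function).
// `img` is a row-major height x width label image. Only interior pixels
// are visited, like the `dims - 2` guards in initEdgeOutKernel and
// suppressLeftOverKernel.
inline void traceEdgesReference(std::vector<int>& img, int width, int height) {
    assert(width >= 0 && height >= 0);
    assert(img.size() ==
           static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    // Promote WEAK pixels with a STRONG 8-neighbour until a full pass
    // over the image changes nothing.
    bool changed = true;
    while (changed) {
        changed = false;
        for (int y = 1; y < height - 1; ++y) {
            for (int x = 1; x < width - 1; ++x) {
                if (img[y * width + x] != WEAK) continue;
                bool hasStrongNeighbour = false;
                for (int dy = -1; dy <= 1; ++dy) {
                    for (int dx = -1; dx <= 1; ++dx) {
                        if (img[(y + dy) * width + (x + dx)] == STRONG)
                            hasStrongNeighbour = true;
                    }
                }
                if (hasStrongNeighbour) {
                    img[y * width + x] = STRONG;
                    changed            = true;
                }
            }
        }
    }

    // Clear the WEAK pixels that were never connected to a STRONG one.
    for (int y = 1; y < height - 1; ++y) {
        for (int x = 1; x < width - 1; ++x) {
            if (img[y * width + x] == WEAK) img[y * width + x] = NOEDGE;
        }
    }
}
```

On a 5x5 image with a strong pixel followed by two weak pixels in one row and an isolated weak pixel elsewhere, the chain ends up strong and the isolated pixel is cleared.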
diff --git a/src/backend/opencl/kernel/transform.cl b/src/backend/opencl/kernel/transform.cl
index 9c7e06f29b..4fae1c05f8 100644
--- a/src/backend/opencl/kernel/transform.cl
+++ b/src/backend/opencl/kernel/transform.cl
@@ -9,69 +9,163 @@
 
 #define NEAREST transform_n
 #define BILINEAR transform_b
+#define LOWER transform_l
 
-void calc_affine_inverse(float* txo, __global const float* txi)
-{
-    float det = txi[0]*txi[4] - txi[1]*txi[3];
+void calc_transf_inverse(float *txo, global const float *txi) {
+#if PERSPECTIVE
+    txo[0] = txi[4] * txi[8] - txi[5] * txi[7];
+    txo[1] = -(txi[1] * txi[8] - txi[2] * txi[7]);
+    txo[2] = txi[1] * txi[5] - txi[2] * txi[4];
+
+    txo[3] = -(txi[3] * txi[8] - txi[5] * txi[6]);
+    txo[4] = txi[0] * txi[8] - txi[2] * txi[6];
+    txo[5] = -(txi[0] * txi[5] - txi[2] * txi[3]);
+
+    txo[6] = txi[3] * txi[7] - txi[4] * txi[6];
+    txo[7] = -(txi[0] * txi[7] - txi[1] * txi[6]);
+    txo[8] = txi[0] * txi[4] - txi[1] * txi[3];
+
+    float det = txi[0] * txo[0] + txi[1] * txo[3] + txi[2] * txo[6];
+
+    txo[0] /= det;
+    txo[1] /= det;
+    txo[2] /= det;
+    txo[3] /= det;
+    txo[4] /= det;
+    txo[5] /= det;
+    txo[6] /= det;
+    txo[7] /= det;
+    txo[8] /= det;
+#else
+    float det = txi[0] * txi[4] - txi[1] * txi[3];
 
     txo[0] = txi[4] / det;
     txo[1] = txi[3] / det;
     txo[3] = txi[1] / det;
     txo[4] = txi[0] / det;
 
-    txo[2] = txi[2] * -txo[0] + txi[5] * -txo[1];
-    txo[5] = txi[2] * -txo[3] + txi[5] * -txo[4];
+    txo[2]               = txi[2] * -txo[0] + txi[5] * -txo[1];
+    txo[5]               = txi[2] * -txo[3] + txi[5] * -txo[4];
+#endif
 }
 
-__kernel
-void transform_kernel(__global T *d_out, const KParam out,
-                      __global const T *d_in, const KParam in,
-                      __global const float *c_tmat, const KParam tf,
-                      const int nimages, const int ntransforms,
-                      const int blocksXPerImage)
-{
-    // Compute which image set
-    const int setId = get_group_id(0) / blocksXPerImage;
-    const int blockIdx_x = get_group_id(0) - setId * blocksXPerImage;
-
-    // Get thread indices
-    const int xx = get_local_id(0) + blockIdx_x * get_local_size(0);
-    const int yy = get_global_id(1);
-
-    if(xx >= out.dims[0] * nimages || yy >= out.dims[1] * ntransforms)
-        return;
+kernel void transformKernel(global T *d_out, const KParam out,
+                            global const T *d_in, const KParam in,
+                            global const float *c_tmat, const KParam tf,
+                            const int nImg2, const int nImg3, const int nTfs2,
+                            const int nTfs3, const int batchImg2,
+                            const int blocksXPerImage,
+                            const int blocksYPerImage, const int method) {
+    // Image Ids
+    const int imgId2 = get_group_id(0) / blocksXPerImage;
+    const int imgId3 = get_group_id(1) / blocksYPerImage;
+
+    // Block in local image
+    const int blockIdx_x = get_group_id(0) - imgId2 * blocksXPerImage;
+    const int blockIdx_y = get_group_id(1) - imgId3 * blocksYPerImage;
 
-    // Index of channel of images and transform
-    //int i_idx = xx / out.dims[0];
-    const int t_idx = yy / out.dims[1];
+    // Get thread indices in local image
+    const int xido = blockIdx_x * get_local_size(0) + get_local_id(0);
+    const int yido = blockIdx_y * get_local_size(1) + get_local_id(1);
 
-    const int limages = min((int)out.dims[2] - setId * nimages, nimages);
+    // Image iteration loop count for image batching
+    int limages = min(max((int)(out.dims[2] - imgId2 * nImg2), 1), batchImg2);
 
-    // Index in local channel -> This is output index
-    const int xido = xx; // - i_idx * out.dims[0];
-    const int yido = yy - t_idx * out.dims[1];
+    if (xido >= out.dims[0] || yido >= out.dims[1]) return;
 
-    // Global offset
-    //          Offset for transform channel + Offset for image channel.
-    d_out += t_idx * nimages * out.strides[2] + setId * nimages * out.strides[2];
-    d_in  += setId * nimages * in.strides[2] + in.offset;
+    // Index of transform
+    const int eTfs2 = max((nTfs2 / nImg2), 1);
+    const int eTfs3 = max((nTfs3 / nImg3), 1);
+
+    int t_idx3        = -1;  // init
+    int t_idx2        = -1;  // init
+    int t_idx2_offset = 0;
+
+    const int blockIdx_z = get_group_id(2);
+
+    if (nTfs3 == 1) {
+        t_idx3 = 0;  // Always 0 as only 1 transform defined
+    } else {
+        if (nTfs3 == nImg3) {
+            t_idx3 = imgId3;  // One to one batch with all transforms defined
+        } else {
+            t_idx3        = blockIdx_z / eTfs2;  // Transform batched, calculate
+            t_idx2_offset = t_idx3 * nTfs2;
+        }
+    }
+
+    if (nTfs2 == 1) {
+        t_idx2 = 0;  // Always 0 as only 1 transform defined
+    } else {
+        if (nTfs2 == nImg2) {
+            t_idx2 = imgId2;  // One to one batch with all transforms defined
+        } else {
+            t_idx2 =
+                blockIdx_z - t_idx2_offset;  // Transform batched, calculate
+        }
+    }
+
+    // Linear transform index
+    const int t_idx = t_idx2 + t_idx3 * nTfs2;
+
+    // Global offset
+    int outoff = out.offset;
+    int inoff =
+        imgId2 * batchImg2 * in.strides[2] + imgId3 * in.strides[3] + in.offset;
+    if (nImg2 == nTfs2 || nImg2 > 1) {  // One-to-One or Image on dim2
+        outoff += imgId2 * batchImg2 * out.strides[2];
+    } else {  // Transform batched on dim2
+        outoff += t_idx2 * out.strides[2];
+    }
+
+    if (nImg3 == nTfs3 || nImg3 > 1) {  // One-to-One or Image on dim3
+        outoff += imgId3 * out.strides[3];
+    } else {  // Transform batched on dim3
+        outoff += t_idx3 * out.strides[3];
+    }
 
     // Transform is in global memory.
-    // Needs offset to correct transform being processed.
-    __global const float *tmat_ptr = c_tmat + t_idx * 6;
+    // Needs offset to the correct transform being processed.
+#if PERSPECTIVE
+    const int transf_len = 9;
+    float tmat[9];
+#else
+    const int transf_len = 6;
     float tmat[6];
+#endif
+    global const float *tmat_ptr = c_tmat + tf.offset + t_idx * transf_len;
 
     // We expect a inverse transform matrix by default
     // If it is an forward transform, then we need its inverse
-    if(INVERSE == 1) {
-        #pragma unroll
-        for(int i = 0; i < 6; i++)
-            tmat[i] = tmat_ptr[i];
+    if (INVERSE == 1) {
+#pragma unroll 3
+        for (int i = 0; i < transf_len; i++) tmat[i] = tmat_ptr[i];
     } else {
-        calc_affine_inverse(tmat, tmat_ptr);
+        calc_transf_inverse(tmat, tmat_ptr);
     }
 
-    if (xido >= out.dims[0] && yido >= out.dims[1]) return;
+    InterpPosTy xidi = xido * tmat[0] + yido * tmat[1] + tmat[2];
+    InterpPosTy yidi = xido * tmat[3] + yido * tmat[4] + tmat[5];
+
+#if PERSPECTIVE
+    const InterpPosTy W = xido * tmat[6] + yido * tmat[7] + tmat[8];
+    xidi /= W;
+    yidi /= W;
+#endif
+    const int loco = outoff + (yido * out.strides[1] + xido);
+    // FIXME: Nearest and lower do not do clamping, but other methods do
+    // Make it consistent
+    const bool doclamp = INTERP_ORDER != 1;
+
+    T zero = ZERO;
+    if (xidi < (InterpPosTy)-0.0001 || yidi < (InterpPosTy)-0.0001 ||
+        in.dims[0] <= xidi || in.dims[1] <= yidi) {
+        for (int n = 0; n < limages; n++) {
+            d_out[loco + n * out.strides[2]] = zero;
+        }
+        return;
+    }
 
-    INTERP(d_out, out, d_in, in, tmat, xido, yido, limages);
+    interp2(d_out, out, loco, d_in, in, inoff, xidi, yidi, method, limages,
+            doclamp, 2);
 }
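
Review note on transform.cl: the `PERSPECTIVE` branch of `calc_transf_inverse` inverts the 3x3 matrix through its adjugate (transposed cofactors) divided by the determinant. The host-side sketch below repeats the same arithmetic so it can be checked against an identity product. `Mat3`, `inverse3x3` and `multiply3x3` are hypothetical names, not ArrayFire APIs, and the sketch assumes a row-major, non-singular matrix.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<float, 9>;  // row-major 3x3 (assumed layout)

// Adjugate-based inverse, term for term the same as the PERSPECTIVE
// branch of calc_transf_inverse. Hypothetical helper for illustration.
inline Mat3 inverse3x3(const Mat3& m) {
    Mat3 inv{};
    inv[0] = m[4] * m[8] - m[5] * m[7];
    inv[1] = -(m[1] * m[8] - m[2] * m[7]);
    inv[2] = m[1] * m[5] - m[2] * m[4];

    inv[3] = -(m[3] * m[8] - m[5] * m[6]);
    inv[4] = m[0] * m[8] - m[2] * m[6];
    inv[5] = -(m[0] * m[5] - m[2] * m[3]);

    inv[6] = m[3] * m[7] - m[4] * m[6];
    inv[7] = -(m[0] * m[7] - m[1] * m[6]);
    inv[8] = m[0] * m[4] - m[1] * m[3];

    // Determinant by expansion along the first row, reusing the cofactors.
    const float det = m[0] * inv[0] + m[1] * inv[3] + m[2] * inv[6];
    assert(det != 0.0f);  // the kernel does not guard against this case

    for (float& v : inv) v /= det;
    return inv;
}

// Plain 3x3 matrix product, used only to verify the inverse.
inline Mat3 multiply3x3(const Mat3& a, const Mat3& b) {
    Mat3 c{};
    for (int r = 0; r < 3; ++r)
        for (int col = 0; col < 3; ++col)
            for (int k = 0; k < 3; ++k)
                c[r * 3 + col] += a[r * 3 + k] * b[k * 3 + col];
    return c;
}
```

Multiplying a matrix by the result of `inverse3x3` should give the identity up to rounding, which is a cheap sanity check for any change to the cofactor signs.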
diff --git a/src/backend/opencl/kernel/transform.hpp b/src/backend/opencl/kernel/transform.hpp
index dfaa18a153..76a2dafa43 100644
--- a/src/backend/opencl/kernel/transform.hpp
+++ b/src/backend/opencl/kernel/transform.hpp
@@ -8,125 +8,106 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/transform_interp.hpp>
-#include <kernel_headers/transform.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/complex.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel/interp.hpp>
+#include <kernel_headers/interp.hpp>
+#include <kernel_headers/transform.hpp>
+#include <math.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    namespace kernel
-    {
-        static const int TX = 16;
-        static const int TY = 16;
-        // Used for batching images
-        static const int TI = 4;
-
-        using std::conditional;
-        using std::is_same;
-        template<typename T>
-        using wtype_t = typename conditional<is_same<T, double>::value, double, float>::type;
-
-        template<typename T>
-        using vtype_t = typename conditional<is_complex<T>::value,
-                                             T, wtype_t<T>
-                                            >::type;
-
-
-        template<typename T, bool isInverse, af_interp_type method>
-        void transform(Param out, const Param in, const Param tf)
-        {
-            try {
-                static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-                static std::map<int, Program*>   transformProgs;
-                static std::map<int, Kernel *> transformKernels;
-
-                int device = getActiveDeviceId();
-                typedef typename dtype_traits<T>::base_type BT;
-
-                std::call_once( compileFlags[device], [device] () {
-                    std::ostringstream options;
-                    options << " -D T="        << dtype_traits<T>::getName()
-                            << " -D INVERSE="  << (isInverse ? 1 : 0);
-                    options << " -D VT="       << dtype_traits<vtype_t<T>>::getName();
-                    options << " -D WT="       << dtype_traits<wtype_t<BT>>::getName();
-
-                    if((af_dtype) dtype_traits<T>::af_type == c32 ||
-                       (af_dtype) dtype_traits<T>::af_type == c64) {
-                        options << " -D CPLX=1";
-                        options << " -D TB=" << dtype_traits<BT>::getName();
-                    } else {
-                        options << " -D CPLX=0";
-                    }
-                    if (std::is_same<T, double>::value ||
-                        std::is_same<T, cdouble>::value) {
-                        options << " -D USE_DOUBLE";
-                    }
-
-                    switch(method) {
-                        case AF_INTERP_NEAREST: options << " -D INTERP=NEAREST";
-                            break;
-                        case AF_INTERP_BILINEAR:  options << " -D INTERP=BILINEAR";
-                            break;
-                        default:
-                            break;
-                    }
-
-                    const char *ker_strs[] = {transform_interp_cl, transform_cl};
-                    const int   ker_lens[] = {transform_interp_cl_len, transform_cl_len};
-                    Program prog;
-                    buildProgram(prog, 2, ker_strs, ker_lens, options.str());
-                    transformProgs[device] = new Program(prog);
-                    transformKernels[device] = new Kernel(*transformProgs[device], "transform_kernel");
-                });
-
-                auto transformOp = make_kernel<Buffer, const KParam,
-                                         const Buffer, const KParam, const Buffer, const KParam,
-                                         const int, const int, const int>
-                                         (*transformKernels[device]);
-
-                NDRange local(TX, TY, 1);
-
-                int nimages = in.info.dims[2];
-                int global_x = local[0] * divup(out.info.dims[0], local[0]);
-                const int blocksXPerImage = global_x / local[0];
-
-                if(nimages > TI) {
-                    int tile_images = divup(nimages, TI);
-                    nimages = TI;
-                    global_x = global_x * tile_images;
-                }
-
-                // Multiplied in src/backend/transform.cpp
-                const int ntransforms = out.info.dims[2] / in.info.dims[2];
-
-                NDRange global(global_x,
-                               local[1] * divup(out.info.dims[1], local[1]) * ntransforms,
-                               1);
-
-                transformOp(EnqueueArgs(getQueue(), global, local),
-                            *out.data, out.info, *in.data, in.info,
-                            *tf.data, tf.info, nimages, ntransforms, blocksXPerImage);
-
-                CL_DEBUG_FINISH(getQueue());
-            } catch (cl::Error err) {
-                CL_TO_AF_ERROR(err);
-                throw;
-            }
-        }
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+using wtype_t = typename std::conditional<std::is_same<T, double>::value,
+                                          double, float>::type;
+
+template<typename T>
+using vtype_t = typename std::conditional<common::is_complex<T>::value, T,
+                                          wtype_t<T>>::type;
+
+template<typename T>
+void transform(Param out, const Param in, const Param tf, bool isInverse,
+               bool isPerspective, af_interp_type method, int order) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+    using BT = typename dtype_traits<T>::base_type;
+
+    constexpr int TX = 16;
+    constexpr int TY = 16;
+    // Used for batching images
+    constexpr int TI = 4;
+    constexpr bool isComplex =
+        static_cast<af_dtype>(dtype_traits<T>::af_type) == c32 ||
+        static_cast<af_dtype>(dtype_traits<T>::af_type) == c64;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(isInverse),
+        TemplateArg(isPerspective),
+        TemplateArg(order),
+    };
+    ToNumStr<T> toNumStr;
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(INVERSE, (isInverse ? 1 : 0)),
+        DefineKeyValue(PERSPECTIVE, (isPerspective ? 1 : 0)),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+        DefineKeyValue(InterpInTy, dtype_traits<T>::getName()),
+        DefineKeyValue(InterpValTy, dtype_traits<vtype_t<T>>::getName()),
+        DefineKeyValue(InterpPosTy, dtype_traits<wtype_t<BT>>::getName()),
+        DefineKeyValue(XDIM, 0),
+        DefineKeyValue(YDIM, 1),
+        DefineKeyValue(INTERP_ORDER, order),
+        DefineKeyValue(IS_CPLX, (isComplex ? 1 : 0)),
+    };
+    if (isComplex) {
+        compileOpts.emplace_back(
+            DefineKeyValue(TB, dtype_traits<BT>::getName()));
     }
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+    addInterpEnumOptions(compileOpts);
+
+    auto transform = common::getKernel("transformKernel",
+                                       {{interp_cl_src, transform_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    const int nImg2 = in.info.dims[2];
+    const int nImg3 = in.info.dims[3];
+    const int nTfs2 = tf.info.dims[2];
+    const int nTfs3 = tf.info.dims[3];
+
+    NDRange local(TX, TY, 1);
+
+    int batchImg2 = 1;
+    if (nImg2 != nTfs2) batchImg2 = min(nImg2, TI);
+
+    const int blocksXPerImage = divup(out.info.dims[0], local[0]);
+    const int blocksYPerImage = divup(out.info.dims[1], local[1]);
+
+    int global_x = local[0] * blocksXPerImage * (nImg2 / batchImg2);
+    int global_y = local[1] * blocksYPerImage * nImg3;
+    int global_z = local[2] * max((nTfs2 / nImg2), 1) * max((nTfs3 / nImg3), 1);
+
+    NDRange global(global_x, global_y, global_z);
+
+    transform(cl::EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+              *in.data, in.info, *tf.data, tf.info, nImg2, nImg3, nTfs2, nTfs3,
+              batchImg2, blocksXPerImage, blocksYPerImage, (int)method);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/transform_interp.cl b/src/backend/opencl/kernel/transform_interp.cl
deleted file mode 100644
index 9bd744366a..0000000000
--- a/src/backend/opencl/kernel/transform_interp.cl
+++ /dev/null
@@ -1,101 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#if CPLX
-#define set(a, b) a = b
-#define set_scalar(a, b) do {                   \
-        a.x = b;                                \
-        a.y = 0;                                \
-    } while(0)
-
-#else
-
-#define set(a, b) a = b
-#define set_scalar(a, b) a = b
-
-#endif
-
-void transform_n(__global T *d_out, const KParam out, __global const T *d_in, const KParam in,
-                 const float *tmat, const int xido, const int yido, const int nimages)
-{
-    // Compute input index
-    const int xidi = round(xido * tmat[0]
-                              + yido * tmat[1]
-                                     + tmat[2]);
-    const int yidi = round(xido * tmat[3]
-                              + yido * tmat[4]
-                                     + tmat[5]);
-
-    // Compute memory location of indices
-    const int loci = yidi * in.strides[1]  + xidi;
-    const int loco = yido * out.strides[1] + xido;
-
-    for(int i = 0; i < nimages; i++) {
-        // Compute memory location of indices
-        int ioff = loci + i * in.strides[2];
-        int ooff = loco + i * out.strides[2];
-
-        T val; set_scalar(val, 0);
-        if (xidi < in.dims[0] && yidi < in.dims[1] && xidi >= 0 && yidi >= 0) val = d_in[ioff];
-
-        d_out[ooff] = val;
-    }
-}
-
-void transform_b(__global T *d_out, const KParam out, __global const T *d_in, const KParam in,
-                 const float *tmat, const int xido, const int yido, const int nimages)
-{
-    const int loco = (yido * out.strides[1] + xido);
-
-    // Compute input index
-    const float xid = xido * tmat[0]
-                    + yido * tmat[1]
-                           + tmat[2];
-    const float yid = xido * tmat[3]
-                    + yido * tmat[4]
-                           + tmat[5];
-
-    T zero; set_scalar(zero, 0);
-    if (xid < -0.001 || yid < -0.001 || in.dims[0] < xid || in.dims[1] < yid) {
-        for(int i = 0; i < nimages; i++) {
-            set(d_out[loco + i * out.strides[2]], zero);
-        }
-        return;
-    }
-
-    const WT grd_x = floor(xid),  grd_y = floor(yid);
-    const WT off_x = xid - grd_x, off_y = yid - grd_y;
-
-    // Check if pVal and pVal + 1 are both valid indices
-    const bool condY = (yid < in.dims[1] - 1);
-    const bool condX = (xid < in.dims[0] - 1);
-
-    // Compute weights used
-    const WT wt00 = (1.0 - off_x) * (1.0 - off_y);
-    const WT wt10 = (condY) ? (1.0 - off_x) * (off_y)     : 0;
-    const WT wt01 = (condX) ? (off_x) * (1.0 - off_y)     : 0;
-    const WT wt11 = (condX && condY) ? (off_x) * (off_y)  : 0;
-
-    const WT wt = wt00 + wt10 + wt01 + wt11;
-
-    const int loci = grd_y * in.strides[1] + grd_x;
-    for(int i = 0; i < nimages; i++) {
-        const int ioff = loci + (i * in.strides[2]);
-        const int ooff = loco + (i * out.strides[2]);
-
-        // Compute Weighted Values
-        VT v00 =                    (wt00 * d_in[ioff]);
-        VT v10 = (condY) ?          (wt10 * d_in[ioff + in.strides[1]])     : zero;
-        VT v01 = (condX) ?          (wt01 * d_in[ioff + 1])                 : zero;
-        VT v11 = (condX && condY) ? (wt11 * d_in[ioff + in.strides[1] + 1]) : zero;
-        VT vo  = v00 + v10 + v01 + v11;
-
-        d_out[ooff] = (T)(vo / wt);
-    }
-}
diff --git a/src/backend/opencl/kernel/transpose.cl b/src/backend/opencl/kernel/transpose.cl
index 0a6554e4c5..ea3075f3fd 100644
--- a/src/backend/opencl/kernel/transpose.cl
+++ b/src/backend/opencl/kernel/transpose.cl
@@ -7,8 +7,7 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 #if DOCONJUGATE
-T doOp(T in)
-{
+T doOp(T in) {
     T out = {in.x, -in.y};
     return out;
 }
@@ -16,14 +15,12 @@ T doOp(T in)
 #define doOp(in) in
 #endif
 
-__kernel
-void transpose(__global T *oData, const KParam out,
-               const __global T *iData, const KParam in,
-               const int blocksPerMatX, const int blocksPerMatY)
-{
-    __local T shrdMem[TILE_DIM*(TILE_DIM+1)];
+kernel void transpose(global T *oData, const KParam out,
+                        const global T *iData, const KParam in,
+                        const int blocksPerMatX, const int blocksPerMatY) {
+    local T shrdMem[TILE_DIM * (TILE_DIM + 1)];
 
-    const int shrdStride = TILE_DIM+1;
+    const int shrdStride = TILE_DIM + 1;
     // create variables to hold output dimensions
     const int oDim0 = out.dims[0];
     const int oDim1 = out.dims[1];
@@ -53,13 +50,15 @@ void transpose(__global T *oData, const KParam out,
 
     // offset in and out based on batch id
     // also add the subBuffer offsets
-    iData += batchId_x *  in.strides[2] + batchId_y *  in.strides[3] +  in.offset;
-    oData += batchId_x * out.strides[2] + batchId_y * out.strides[3] + out.offset;
+    iData += batchId_x * in.strides[2] + batchId_y * in.strides[3] + in.offset;
+    oData +=
+        batchId_x * out.strides[2] + batchId_y * out.strides[3] + out.offset;
 
     for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
         int gy_ = gy + repeat;
-        if (IS32MULTIPLE || (gx < iDim0 && gy_< iDim1))
-            shrdMem[(ly + repeat) * shrdStride + lx] = iData[gy_ * iStride1 + gx];
+        if (IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1))
+            shrdMem[(ly + repeat) * shrdStride + lx] =
+                iData[gy_ * iStride1 + gx];
     }
     barrier(CLK_LOCAL_MEM_FENCE);
 
@@ -69,7 +68,8 @@ void transpose(__global T *oData, const KParam out,
     for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
         int gy_ = gy + repeat;
         if (IS32MULTIPLE || (gx < oDim0 && gy_ < oDim1)) {
-            oData[gy_ * oStride1 + gx] = doOp(shrdMem[lx * shrdStride + ly + repeat]);
+            oData[gy_ * oStride1 + gx] =
+                doOp(shrdMem[lx * shrdStride + ly + repeat]);
         }
     }
 }
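
Review note on transpose.cl: the kernel stages each `TILE_DIM` x `TILE_DIM` block in local memory with a row stride of `TILE_DIM + 1`, then writes it back with the local indices swapped. Padding the stride by one is a common way to avoid local-memory bank conflicts on GPUs; the diff itself does not state the reason, so treat that as an assumption. A sequential sketch of the same indexing follows. `transposeTiled` is a hypothetical helper, and it assumes a dense array with dimension 0 as the fastest-varying one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int TILE_DIM = 32;

// Hypothetical CPU sketch of the tiled transpose (not an ArrayFire API).
// `in` has dims (dim0, dim1) with element (x, y) at in[y * dim0 + x];
// the result has dims (dim1, dim0).
inline std::vector<float> transposeTiled(const std::vector<float>& in,
                                         int dim0, int dim1) {
    assert(dim0 >= 0 && dim1 >= 0);
    assert(in.size() ==
           static_cast<std::size_t>(dim0) * static_cast<std::size_t>(dim1));

    std::vector<float> out(in.size());
    constexpr int shrdStride = TILE_DIM + 1;  // padded row stride
    std::vector<float> tile(TILE_DIM * shrdStride);

    for (int ty = 0; ty < dim1; ty += TILE_DIM) {
        for (int tx = 0; tx < dim0; tx += TILE_DIM) {
            // Load one tile, guarding the ragged edges.
            for (int ly = 0; ly < TILE_DIM; ++ly) {
                for (int lx = 0; lx < TILE_DIM; ++lx) {
                    const int gx = tx + lx;
                    const int gy = ty + ly;
                    if (gx < dim0 && gy < dim1)
                        tile[ly * shrdStride + lx] = in[gy * dim0 + gx];
                }
            }
            // Store with the local indices swapped.
            for (int ly = 0; ly < TILE_DIM; ++ly) {
                for (int lx = 0; lx < TILE_DIM; ++lx) {
                    const int ox = ty + lx;  // index along output dim 0
                    const int oy = tx + ly;  // index along output dim 1
                    if (ox < dim1 && oy < dim0)
                        out[oy * dim1 + ox] = tile[lx * shrdStride + ly];
                }
            }
        }
    }
    return out;
}
```

Transposing twice with the dimensions swapped on the second call should return the original data, including for sizes that are not multiples of `TILE_DIM`.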
diff --git a/src/backend/opencl/kernel/transpose.hpp b/src/backend/opencl/kernel/transpose.hpp
index 7975e67e6f..b6979cf6d5 100644
--- a/src/backend/opencl/kernel/transpose.hpp
+++ b/src/backend/opencl/kernel/transpose.hpp
@@ -8,90 +8,63 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/transpose.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int TILE_DIM  = 32;
-static const int THREADS_X = TILE_DIM;
-static const int THREADS_Y = 256 / TILE_DIM;
-
-template<typename T, bool conjugate, bool IS32MULTIPLE>
-void transpose(Param out, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  trsProgs;
-        static std::map<int, Kernel*> trsKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D TILE_DIM=" << TILE_DIM
-                        << " -D THREADS_Y=" << THREADS_Y
-                        << " -D IS32MULTIPLE=" << IS32MULTIPLE
-                        << " -D DOCONJUGATE=" << (conjugate && af::iscplx<T>())
-                        << " -D T=" << dtype_traits<T>::getName();
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, transpose_cl, transpose_cl_len, options.str());
-                trsProgs[device] = new Program(prog);
-
-                trsKernels[device] = new Kernel(*trsProgs[device], "transpose");
-            });
-
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], TILE_DIM);
-        int blk_y = divup(in.info.dims[1], TILE_DIM);
-
-        // launch batch * blk_x blocks along x dimension
-        NDRange global(blk_x * local[0] * in.info.dims[2],
-                       blk_y * local[1] * in.info.dims[3]);
-
-        auto transposeOp = make_kernel<Buffer, const KParam,
-                                       const Buffer, const KParam,
-                                       const int, const int> (*trsKernels[device]);
-
-        transposeOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info, blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+#include <kernel_headers/transpose.hpp>
+#include <traits.hpp>
 
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int TILE_DIM  = 32;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = 256 / TILE_DIM;
+
+template<typename T>
+void transpose(Param out, const Param in, cl::CommandQueue queue,
+               const bool conjugate, const bool IS32MULTIPLE) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(conjugate),
+        TemplateArg(IS32MULTIPLE),
+    };
+    vector<string> compileOpts = {
+        DefineValue(TILE_DIM),
+        DefineValue(THREADS_Y),
+        DefineValue(IS32MULTIPLE),
+        DefineKeyValue(DOCONJUGATE, (conjugate && iscplx<T>())),
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto transpose = common::getKernel("transpose", {{transpose_cl_src}},
+                                       tmpltArgs, compileOpts);
+
+    NDRange local(THREADS_X, THREADS_Y);
+
+    const int blk_x = divup(in.info.dims[0], TILE_DIM);
+    const int blk_y = divup(in.info.dims[1], TILE_DIM);
+
+    NDRange global(blk_x * local[0] * in.info.dims[2],
+                   blk_y * local[1] * in.info.dims[3]);
+
+    transpose(EnqueueArgs(queue, global, local), *out.data, out.info, *in.data,
+              in.info, blk_x, blk_y);
+    CL_DEBUG_FINISH(queue);
 }
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
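Reviewer note on the launch geometry in `transpose.hpp`: the work-group grid is one 32x32 tile per group, with the batch dimensions (`dims[2]`, `dims[3]`) folded into the x and y extents of the global range. The following is a minimal host-only sketch of that arithmetic, not the library's code. `LaunchDims` and `transposeLaunchDims` are hypothetical names, and `divup` is assumed to be plain ceiling division.

```cpp
#include <cassert>
#include <cstddef>

constexpr int TILE_DIM  = 32;
constexpr int THREADS_X = TILE_DIM;
constexpr int THREADS_Y = 256 / TILE_DIM;  // 8 rows of threads per group

// Assumed stand-in for the divup helper: ceiling division for positive ints.
constexpr int divup(int a, int b) { return (a + b - 1) / b; }

// Hypothetical container for the values the wrapper computes before launch.
struct LaunchDims {
    int blk_x;             // tiles along dims[0]
    int blk_y;             // tiles along dims[1]
    std::size_t global_x;  // global work size in x, batches included
    std::size_t global_y;  // global work size in y, batches included
};

// dims mirrors in.info.dims: {rows, cols, batch along dim 2, batch along dim 3}.
LaunchDims transposeLaunchDims(const int (&dims)[4]) {
    for (int d : dims) assert(d > 0);

    LaunchDims out{};
    out.blk_x = divup(dims[0], TILE_DIM);
    out.blk_y = divup(dims[1], TILE_DIM);
    // Each group covers a full tile in x; in y it has only THREADS_Y threads,
    // so the kernel loops TILE_DIM / THREADS_Y times to cover the tile.
    out.global_x = static_cast<std::size_t>(out.blk_x) * THREADS_X * dims[2];
    out.global_y = static_cast<std::size_t>(out.blk_y) * THREADS_Y * dims[3];
    return out;
}
```

Passing `blk_x` and `blk_y` to the kernel lets it recover the batch index by dividing the group id by the per-matrix block count.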
diff --git a/src/backend/opencl/kernel/transpose_inplace.cl b/src/backend/opencl/kernel/transpose_inplace.cl
index 074f242351..db444b8bc4 100644
--- a/src/backend/opencl/kernel/transpose_inplace.cl
+++ b/src/backend/opencl/kernel/transpose_inplace.cl
@@ -7,8 +7,7 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 #if DOCONJUGATE
-T doOp(T in)
-{
+T doOp(T in) {
     T out = {in.x, -in.y};
     return out;
 }
@@ -16,14 +15,13 @@ T doOp(T in)
 #define doOp(in) in
 #endif
 
-__kernel
-void transpose_inplace(__global T *iData, const KParam in,
-                       const int blocksPerMatX, const int blocksPerMatY)
-{
-    __local T shrdMem_s[TILE_DIM*(TILE_DIM+1)];
-    __local T shrdMem_d[TILE_DIM*(TILE_DIM+1)];
+kernel void transpose_inplace(global T *iData, const KParam in,
+                                const int blocksPerMatX,
+                                const int blocksPerMatY) {
+    local T shrdMem_s[TILE_DIM * (TILE_DIM + 1)];
+    local T shrdMem_d[TILE_DIM * (TILE_DIM + 1)];
 
-    const int shrdStride = TILE_DIM+1;
+    const int shrdStride = TILE_DIM + 1;
 
     // create variables to hold output dimensions
     const int iDim0 = in.dims[0];
@@ -45,9 +43,10 @@ void transpose_inplace(__global T *iData, const KParam in,
     const int x0 = TILE_DIM * blockIdx_x;
     const int y0 = TILE_DIM * blockIdx_y;
 
-    __global T *iptr = iData + batchId_x * in.strides[2] + batchId_y * in.strides[3] + in.offset;
+    global T *iptr = iData + batchId_x * in.strides[2] +
+                       batchId_y * in.strides[3] + in.offset;
 
-    if(blockIdx_y > blockIdx_x) {
+    if (blockIdx_y > blockIdx_x) {
         // calculate global indices
         int gx = lx + x0;
         int gy = ly + y0;
@@ -56,28 +55,30 @@ void transpose_inplace(__global T *iData, const KParam in,
 
         // Copy to shared memory
         for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-
             int gy_ = gy + repeat;
-            if (IS32MULTIPLE || (gx < iDim0 && gy_< iDim1))
-                shrdMem_s[(ly + repeat) * shrdStride + lx] = iptr[gy_ * iStride1 + gx];
+            if (IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1))
+                shrdMem_s[(ly + repeat) * shrdStride + lx] =
+                    iptr[gy_ * iStride1 + gx];
 
             int dy_ = dy + repeat;
-            if (IS32MULTIPLE || (dx < iDim0 && dy_< iDim1))
-                shrdMem_d[(ly + repeat) * shrdStride + lx] = iptr[dy_ * iStride1 + dx];
+            if (IS32MULTIPLE || (dx < iDim0 && dy_ < iDim1))
+                shrdMem_d[(ly + repeat) * shrdStride + lx] =
+                    iptr[dy_ * iStride1 + dx];
         }
 
         barrier(CLK_LOCAL_MEM_FENCE);
 
         // Copy from shared memory to global memory
         for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-
             int dy_ = dy + repeat;
-            if (IS32MULTIPLE || (dx < iDim0 && dy_< iDim1))
-                iptr[dy_ * iStride1 + dx] = doOp(shrdMem_s[(ly + repeat) + (shrdStride * lx)]);
+            if (IS32MULTIPLE || (dx < iDim0 && dy_ < iDim1))
+                iptr[dy_ * iStride1 + dx] =
+                    doOp(shrdMem_s[(ly + repeat) + (shrdStride * lx)]);
 
             int gy_ = gy + repeat;
-            if (IS32MULTIPLE || (gx < iDim0 && gy_< iDim1))
-                iptr[gy_ * iStride1 + gx] = doOp(shrdMem_d[(ly + repeat) + (shrdStride * lx)]);
+            if (IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1))
+                iptr[gy_ * iStride1 + gx] =
+                    doOp(shrdMem_d[(ly + repeat) + (shrdStride * lx)]);
         }
 
     } else if (blockIdx_y == blockIdx_x) {
@@ -87,21 +88,20 @@ void transpose_inplace(__global T *iData, const KParam in,
 
         // Copy to shared memory
         for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-
             int gy_ = gy + repeat;
-            if (IS32MULTIPLE || (gx < iDim0 && gy_< iDim1))
-                shrdMem_s[(ly + repeat) * shrdStride + lx] = iptr[gy_ * iStride1 + gx];
-
+            if (IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1))
+                shrdMem_s[(ly + repeat) * shrdStride + lx] =
+                    iptr[gy_ * iStride1 + gx];
         }
 
         barrier(CLK_LOCAL_MEM_FENCE);
 
         // Copy from shared memory to global memory
         for (int repeat = 0; repeat < TILE_DIM; repeat += THREADS_Y) {
-
             int gy_ = gy + repeat;
-            if (IS32MULTIPLE || (gx < iDim0 && gy_< iDim1))
-                iptr[gy_ * iStride1 + gx] = doOp(shrdMem_s[(ly + repeat) + (shrdStride * lx)]);
+            if (IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1))
+                iptr[gy_ * iStride1 + gx] =
+                    doOp(shrdMem_s[(ly + repeat) + (shrdStride * lx)]);
         }
     }
 }
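Reviewer note on the in-place kernel: only groups with `blockIdx_y > blockIdx_x` exchange a tile with its mirror tile, and groups on the diagonal transpose their own tile, so every element pair is swapped exactly once. A rough CPU sketch of that pairing for a square matrix follows. `transposeInplaceTiled` is a hypothetical name, and the sketch swaps elements directly instead of staging them through local memory.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// In-place transpose of an n x n matrix stored with stride n,
// visiting tiles the way the kernel pairs work-groups.
void transposeInplaceTiled(std::vector<float>& a, int n, int tile) {
    assert(n >= 0 && tile > 0);
    assert(a.size() == static_cast<std::size_t>(n) * n);

    for (int by = 0; by < n; by += tile) {
        // Tiles with bx > by are skipped: their mirror tile handles them.
        for (int bx = 0; bx <= by; bx += tile) {
            const int yEnd = std::min(by + tile, n);
            const int xEnd = std::min(bx + tile, n);
            for (int y = by; y < yEnd; ++y) {
                for (int x = bx; x < xEnd; ++x) {
                    // Diagonal tile: swap each pair only once and leave the
                    // diagonal itself alone.
                    if (bx == by && x >= y) continue;
                    std::swap(a[y * n + x], a[x * n + y]);
                }
            }
        }
    }
}
```

The bounds clamps with `std::min` play the role of the `IS32MULTIPLE || (gx < iDim0 && gy_ < iDim1)` guard in the kernel.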
diff --git a/src/backend/opencl/kernel/transpose_inplace.hpp b/src/backend/opencl/kernel/transpose_inplace.hpp
index 4206e2bb34..6ed5c1e5c4 100644
--- a/src/backend/opencl/kernel/transpose_inplace.hpp
+++ b/src/backend/opencl/kernel/transpose_inplace.hpp
@@ -8,88 +8,66 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/transpose_inplace.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-static const int TILE_DIM  = 16;
-static const int THREADS_X = TILE_DIM;
-static const int THREADS_Y = 256 / TILE_DIM;
-
-template<typename T, bool conjugate, bool IS32MULTIPLE>
-void transpose_inplace(Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  transposeProgs;
-        static std::map<int, Kernel*> transposeKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D TILE_DIM=" << TILE_DIM
-                        << " -D THREADS_Y=" << THREADS_Y
-                        << " -D IS32MULTIPLE=" << IS32MULTIPLE
-                        << " -D DOCONJUGATE=" << (conjugate && af::iscplx<T>())
-                        << " -D T=" << dtype_traits<T>::getName();
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, transpose_inplace_cl, transpose_inplace_cl_len, options.str());
-                transposeProgs[device] = new Program(prog);
-
-                transposeKernels[device] = new Kernel(*transposeProgs[device], "transpose_inplace");
-            });
-
-
-        NDRange local(THREADS_X, THREADS_Y);
-
-        int blk_x = divup(in.info.dims[0], TILE_DIM);
-        int blk_y = divup(in.info.dims[1], TILE_DIM);
-
-        // launch batch * blk_x blocks along x dimension
-        NDRange global(blk_x * local[0] * in.info.dims[2],
-                       blk_y * local[1] * in.info.dims[3]);
-
-        auto transposeOp = make_kernel<Buffer, const KParam,
-                                       const int, const int> (*transposeKernels[device]);
-
-        transposeOp(EnqueueArgs(getQueue(), global, local), *in.data, in.info, blk_x, blk_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
+#include <kernel_headers/transpose_inplace.hpp>
+#include <traits.hpp>
 
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+constexpr int TILE_DIM  = 16;
+constexpr int THREADS_X = TILE_DIM;
+constexpr int THREADS_Y = 256 / TILE_DIM;
+
+template<typename T>
+void transpose_inplace(Param in, cl::CommandQueue& queue, const bool conjugate,
+                       const bool IS32MULTIPLE) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(conjugate),
+        TemplateArg(IS32MULTIPLE),
+    };
+    vector<string> compileOpts = {
+        DefineValue(TILE_DIM),
+        DefineValue(THREADS_Y),
+        DefineValue(IS32MULTIPLE),
+        DefineKeyValue(DOCONJUGATE, (conjugate && iscplx<T>())),
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto transpose =
+        common::getKernel("transpose_inplace", {{transpose_inplace_cl_src}},
+                          tmpltArgs, compileOpts);
+
+    NDRange local(THREADS_X, THREADS_Y);
+
+    int blk_x = divup(in.info.dims[0], TILE_DIM);
+    int blk_y = divup(in.info.dims[1], TILE_DIM);
+
+    // launch batch * blk_x blocks along x dimension
+    NDRange global(blk_x * local[0] * in.info.dims[2],
+                   blk_y * local[1] * in.info.dims[3]);
+
+    transpose(EnqueueArgs(queue, global, local), *in.data, in.info, blk_x,
+              blk_y);
+
+    CL_DEBUG_FINISH(queue);
 }
 
-}
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/triangle.cl b/src/backend/opencl/kernel/triangle.cl
index cb0d2ce84d..536e074f2b 100644
--- a/src/backend/opencl/kernel/triangle.cl
+++ b/src/backend/opencl/kernel/triangle.cl
@@ -7,11 +7,8 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-__kernel
-void triangle_kernel(__global T *rptr, KParam rinfo,
-                     const __global T *iptr, KParam iinfo,
-                     const int groups_x, const int groups_y)
-{
+kernel void triangle(global T *rptr, KParam rinfo, const global T *iptr,
+                     KParam iinfo, const int groups_x, const int groups_y) {
     const int oz = get_group_id(0) / groups_x;
     const int ow = get_group_id(1) / groups_y;
 
@@ -24,25 +21,24 @@ void triangle_kernel(__global T *rptr, KParam rinfo,
     const int incy = groups_y * get_local_size(1);
     const int incx = groups_x * get_local_size(0);
 
-    __global T *d_r = rptr;
-    const __global T *d_i = iptr + iinfo.offset;
+    global T *d_r       = rptr;
+    const global T *d_i = iptr + iinfo.offset;
 
-    if(oz < rinfo.dims[2] && ow < rinfo.dims[3]) {
+    if (oz < rinfo.dims[2] && ow < rinfo.dims[3]) {
         d_i = d_i + oz * iinfo.strides[2] + ow * iinfo.strides[3];
         d_r = d_r + oz * rinfo.strides[2] + ow * rinfo.strides[3];
 
         for (int oy = yy; oy < rinfo.dims[1]; oy += incy) {
-            const __global T *Yd_i = d_i + oy * iinfo.strides[1];
-            __global T *Yd_r = d_r +  oy * rinfo.strides[1];
+            const global T *Yd_i = d_i + oy * iinfo.strides[1];
+            global T *Yd_r       = d_r + oy * rinfo.strides[1];
 
             for (int ox = xx; ox < rinfo.dims[0]; ox += incx) {
-
-                bool cond = is_upper ? (oy >= ox) : (oy <= ox);
+                bool cond         = is_upper ? (oy >= ox) : (oy <= ox);
                 bool do_unit_diag = is_unit_diag && (oy == ox);
-                if(cond) {
-                    Yd_r[ox] = do_unit_diag ? ONE : Yd_i[ox];
+                if (cond) {
+                    Yd_r[ox] = do_unit_diag ? (T)(ONE) : Yd_i[ox];
                 } else {
-                    Yd_r[ox] = ZERO;
+                    Yd_r[ox] = (T)(ZERO);
                 }
             }
         }
diff --git a/src/backend/opencl/kernel/triangle.hpp b/src/backend/opencl/kernel/triangle.hpp
index 2b00117ebf..888ac21909 100644
--- a/src/backend/opencl/kernel/triangle.hpp
+++ b/src/backend/opencl/kernel/triangle.hpp
@@ -8,92 +8,65 @@
  ********************************************************/
 
 #pragma once
-#include <kernel_headers/triangle.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <string>
-#include <mutex>
-#include <map>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <types.hpp>
+#include <kernel_headers/triangle.hpp>
 #include <math.hpp>
+#include <traits.hpp>
 
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-using af::scalar_to_option;
-
-namespace opencl
-{
-
-namespace kernel
-{
-
-// Kernel Launch Config Values
-static const unsigned TX = 32;
-static const unsigned TY = 8;
-static const unsigned TILEX = 128;
-static const unsigned TILEY = 32;
-
-template<typename T, bool is_upper, bool is_unit_diag>
-void triangle(Param out, const Param in)
-{
-    try {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*>  trgProgs;
-        static std::map<int, Kernel*> trgKernels;
-
-        int device = getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D is_upper=" << is_upper
-                        << " -D is_unit_diag=" << is_unit_diag
-                        << " -D ZERO=(T)(" << scalar_to_option(scalar<T>(0)) << ")"
-                        << " -D ONE=(T)(" << scalar_to_option(scalar<T>(1)) << ")";
-
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-
-                cl::Program prog;
-                buildProgram(prog, triangle_cl, triangle_cl_len, options.str());
-                trgProgs[device] = new Program(prog);
-
-                trgKernels[device] = new Kernel(*trgProgs[device], "triangle_kernel");
-            });
-
-        NDRange local(TX, TY);
-
-        int groups_x = divup(out.info.dims[0], TILEX);
-        int groups_y = divup(out.info.dims[1], TILEY);
-
-        NDRange global(groups_x * out.info.dims[2] * local[0],
-                       groups_y * out.info.dims[3] * local[1]);
-
-        auto triangleOp = make_kernel<Buffer, KParam,
-                                      const Buffer, KParam,
-                                      const int, const int> (*trgKernels[device]);
-
-        triangleOp(EnqueueArgs(getQueue(), global, local),
-                    *out.data, out.info, *in.data, in.info, groups_x, groups_y);
-
-        CL_DEBUG_FINISH(getQueue());
-    } catch (cl::Error err) {
-        CL_TO_AF_ERROR(err);
-        throw;
-    }
-}
-
-}
-
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void triangle(Param out, const Param in, bool is_upper, bool is_unit_diag) {
+    using arrayfire::opencl::scalar_to_option;
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    constexpr unsigned TX    = 32;
+    constexpr unsigned TY    = 8;
+    constexpr unsigned TILEX = 128;
+    constexpr unsigned TILEY = 32;
+
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_upper),
+        TemplateArg(is_unit_diag),
+    };
+    vector<string> compileOpts = {
+        DefineValue(is_upper),
+        DefineValue(is_unit_diag),
+        DefineKeyValue(ZERO, scalar_to_option(scalar<T>(0))),
+        DefineKeyValue(ONE, scalar_to_option(scalar<T>(1))),
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto triangle = common::getKernel("triangle", {{triangle_cl_src}},
+                                      tmpltArgs, compileOpts);
+
+    NDRange local(TX, TY);
+
+    int groups_x = divup(out.info.dims[0], TILEX);
+    int groups_y = divup(out.info.dims[1], TILEY);
+
+    NDRange global(groups_x * out.info.dims[2] * local[0],
+                   groups_y * out.info.dims[3] * local[1]);
+
+    triangle(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+             *in.data, in.info, groups_x, groups_y);
+    CL_DEBUG_FINISH(getQueue());
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
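Reviewer note on `triangle`: the kernel's per-element rule is small enough to check against a host reference. The sketch below applies the same `cond` and `do_unit_diag` logic to a single 2D slice. `triangleRef` is a hypothetical helper, the type is fixed to `float`, and the layout is assumed contiguous with `strides[1] == dims[0]`.

```cpp
#include <cassert>
#include <vector>

// Host reference for one d0 x d1 slice, indexed as in[oy * d0 + ox].
std::vector<float> triangleRef(const std::vector<float>& in, int d0, int d1,
                               bool is_upper, bool is_unit_diag) {
    assert(in.size() == static_cast<std::size_t>(d0) * d1);

    std::vector<float> out(in.size(), 0.0f);  // ZERO outside the triangle
    for (int oy = 0; oy < d1; ++oy) {
        for (int ox = 0; ox < d0; ++ox) {
            const bool cond         = is_upper ? (oy >= ox) : (oy <= ox);
            const bool do_unit_diag = is_unit_diag && (oy == ox);
            if (cond) {
                out[oy * d0 + ox] = do_unit_diag ? 1.0f : in[oy * d0 + ox];
            }
        }
    }
    return out;
}
```

Since `is_upper` and `is_unit_diag` are now runtime arguments on the host side but still compile-time defines in the kernel, a reference like this can be run over all four flag combinations to confirm each cached kernel variant.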
diff --git a/src/backend/opencl/kernel/unwrap.cl b/src/backend/opencl/kernel/unwrap.cl
new file mode 100644
index 0000000000..2d67fb68ac
--- /dev/null
+++ b/src/backend/opencl/kernel/unwrap.cl
@@ -0,0 +1,76 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void unwrap(global T *d_out, const KParam out, global const T *d_in,
+                   const KParam in, const int wx, const int wy, const int sx,
+                   const int sy, const int px, const int py, const int dx,
+                   const int dy, const int nx, const int reps) {
+    // Compute channel and volume
+    const int w = get_group_id(1) / in.dims[2];
+    const int z =
+        get_group_id(1) - w * in.dims[2];  // get_group_id(1) % in.dims[2];
+
+    if (w >= in.dims[3] || z >= in.dims[2]) return;
+
+    // Compute offset for channel and volume
+    const int cOut = w * out.strides[3] + z * out.strides[2];
+    const int cIn  = w * in.strides[3] + z * in.strides[2];
+
+    // Compute the output column index
+    const int id = IS_COLUMN
+                       ? (get_group_id(0) * get_local_size(1) + get_local_id(1))
+                       : get_global_id(0);
+
+    if (id >= (IS_COLUMN ? out.dims[1] : out.dims[0])) return;
+
+    // Compute the starting index of window in x and y of input
+    const int startx = (id % nx) * sx;
+    const int starty = (id / nx) * sy;
+
+    const int spx = startx - px;
+    const int spy = starty - py;
+
+    // Offset the global pointers to the respective starting indices
+    global T *optr       = d_out + cOut + id * (IS_COLUMN ? out.strides[1] : 1);
+    global const T *iptr = d_in + cIn + in.offset;
+
+    bool cond = (spx >= 0 && spx + (wx * dx) < in.dims[0] && spy >= 0 &&
+                 spy + (wy * dy) < in.dims[1]);
+
+    // Compute output index local to column
+    int outIdx        = IS_COLUMN ? get_local_id(0) : get_local_id(1);
+    const int oStride = IS_COLUMN ? get_local_size(0) : get_local_size(1);
+
+    for (int i = 0; i < reps; i++) {
+        if (outIdx >= (IS_COLUMN ? out.dims[0] : out.dims[1])) return;
+
+        // Compute input index local to window
+        const int y = outIdx / wx;
+        const int x = outIdx % wx;
+
+        const int xpad = spx + x * dx;
+        const int ypad = spy + y * dy;
+
+        // Copy
+        T val = ZERO;
+        if (cond || (xpad >= 0 && xpad < in.dims[0] && ypad >= 0 &&
+                     ypad < in.dims[1])) {
+            const int inIdx = ypad * in.strides[1] + xpad * in.strides[0];
+            val             = iptr[inIdx];
+        }
+
+        if (IS_COLUMN) {
+            optr[outIdx] = val;
+        } else {
+            optr[outIdx * out.strides[1]] = val;
+        }
+
+        outIdx += oStride;
+    }
+}
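Reviewer note on `unwrap.cl`: the index math (window start from `id`, dilation via `dx`/`dy`, zero fill outside the padded input) can be mirrored on the host for testing. This is a sketch for one 2D slice in the `IS_COLUMN` layout only. `unwrapRef` is a hypothetical name, the type is fixed to `float`, the input is assumed contiguous, and `nx`/`ny` (window counts per axis) are taken as inputs instead of being derived.

```cpp
#include <cassert>
#include <vector>

// Host reference: each window becomes one output column of length wx * wy.
// Input is indexed as in[y * d0 + x]; output as out[id * (wx * wy) + outIdx].
std::vector<float> unwrapRef(const std::vector<float>& in, int d0, int d1,
                             int wx, int wy, int sx, int sy, int px, int py,
                             int dx, int dy, int nx, int ny) {
    assert(in.size() == static_cast<std::size_t>(d0) * d1);

    const int winSize = wx * wy;
    std::vector<float> out(static_cast<std::size_t>(winSize) * nx * ny, 0.0f);

    for (int id = 0; id < nx * ny; ++id) {
        // Top-left corner of the window in padded input coordinates
        const int spx = (id % nx) * sx - px;
        const int spy = (id / nx) * sy - py;

        for (int outIdx = 0; outIdx < winSize; ++outIdx) {
            const int xpad = spx + (outIdx % wx) * dx;
            const int ypad = spy + (outIdx / wx) * dy;
            // Samples that land in the padding keep the zero fill
            if (xpad >= 0 && xpad < d0 && ypad >= 0 && ypad < d1) {
                out[id * winSize + outIdx] = in[ypad * d0 + xpad];
            }
        }
    }
    return out;
}
```

The kernel's `cond` flag is only a fast path for windows that lie fully inside the input; the per-sample bounds test used here gives the same result.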
diff --git a/src/backend/opencl/kernel/unwrap.hpp b/src/backend/opencl/kernel/unwrap.hpp
new file mode 100644
index 0000000000..7c3d71bb37
--- /dev/null
+++ b/src/backend/opencl/kernel/unwrap.hpp
@@ -0,0 +1,83 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/unwrap.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void unwrap(Param out, const Param in, const dim_t wx, const dim_t wy,
+            const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+            const dim_t dx, const dim_t dy, const dim_t nx,
+            const bool is_column) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    ToNumStr<T> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_column),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(IS_COLUMN, is_column),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto unwrap =
+        common::getKernel("unwrap", {{unwrap_cl_src}}, tmpltArgs, compileOpts);
+
+    dim_t TX = 1, TY = 1;
+    dim_t BX       = 1;
+    const dim_t BY = out.info.dims[2] * out.info.dims[3];
+    int reps       = 1;
+
+    if (is_column) {
+        TX   = std::min(THREADS_PER_GROUP, nextpow2(out.info.dims[0]));
+        TY   = THREADS_PER_GROUP / TX;
+        BX   = divup(out.info.dims[1], TY);
+        reps = divup((wx * wy), TX);
+    } else {
+        TX   = THREADS_X;
+        TY   = THREADS_Y;
+        BX   = divup(out.info.dims[0], TX);
+        reps = divup((wx * wy), TY);
+    }
+
+    NDRange local(TX, TY);
+    NDRange global(local[0] * BX, local[1] * BY);
+
+    unwrap(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+           *in.data, in.info, static_cast<int>(wx), static_cast<int>(wy),
+           static_cast<int>(sx), static_cast<int>(sy), static_cast<int>(px),
+           static_cast<int>(py), static_cast<int>(dx), static_cast<int>(dy),
+           static_cast<int>(nx), reps);
+    CL_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/where.cl b/src/backend/opencl/kernel/where.cl
index ab4b31e649..4e5298012e 100644
--- a/src/backend/opencl/kernel/where.cl
+++ b/src/backend/opencl/kernel/where.cl
@@ -8,53 +8,46 @@
  ********************************************************/
 
 #if CPLX
-#define isZero(val) ((val.x ==0) && (val.y == 0))
+#define isZero(val) ((val.x == 0) && (val.y == 0))
 #else
 #define isZero(val) ((val == 0))
 #endif
 
-__kernel
-void get_out_idx_kernel(__global uint *oData,
-                        __global uint *otData,
-                        KParam otInfo,
-                        __global uint *rtData,
-                        KParam rtInfo,
-                        __global T *iData,
-                        KParam iInfo,
-                        uint groups_x,
-                        uint groups_y,
-                        uint lim)
-{
-
-    T Zero = zero;
+kernel void get_out_idx(global uint *oData, global uint *otData, KParam otInfo,
+                        global uint *rtData, KParam rtInfo, global T *iData,
+                        KParam iInfo, uint groups_x, uint groups_y, uint lim) {
+    T Zero = ZERO;
 
     const uint lidx = get_local_id(0);
     const uint lidy = get_local_id(1);
 
-    const uint zid = get_group_id(0) / groups_x;
-    const uint wid = get_group_id(1) / groups_y;
-    const uint groupId_x = get_group_id(0) - (groups_x) * zid;
-    const uint groupId_y = get_group_id(1) - (groups_y) * wid;
-    const uint xid = groupId_x * get_local_size(0) * lim + lidx;
-    const uint yid = groupId_y * get_local_size(1) + lidy;
-
-    const uint off = wid * otInfo.strides[3] + zid * otInfo.strides[2] + yid * otInfo.strides[1];
-    const uint gid = wid * rtInfo.strides[3] + zid * rtInfo.strides[2] + yid * rtInfo.strides[1] + groupId_x;
-
-    otData += wid * otInfo.strides[3] + zid * otInfo.strides[2] + yid * otInfo.strides[1];
-    iData  += wid *  iInfo.strides[3] + zid *  iInfo.strides[2] + yid *  iInfo.strides[1];
-
-    bool cond = (yid < otInfo.dims[1]) && (zid < otInfo.dims[2]) && (wid < otInfo.dims[3]);
+    const uint zid       = get_group_id(0) / groups_x;
+    const uint wid       = get_group_id(1) / groups_y;
+    const uint groupId_x = get_group_id(0) - (groups_x)*zid;
+    const uint groupId_y = get_group_id(1) - (groups_y)*wid;
+    const uint xid       = groupId_x * get_local_size(0) * lim + lidx;
+    const uint yid       = groupId_y * get_local_size(1) + lidy;
+
+    const uint off = wid * otInfo.strides[3] + zid * otInfo.strides[2] +
+                     yid * otInfo.strides[1];
+    const uint gid = wid * rtInfo.strides[3] + zid * rtInfo.strides[2] +
+                     yid * rtInfo.strides[1] + groupId_x;
+
+    otData += wid * otInfo.strides[3] + zid * otInfo.strides[2] +
+              yid * otInfo.strides[1];
+    iData += wid * iInfo.strides[3] + zid * iInfo.strides[2] +
+             yid * iInfo.strides[1] + iInfo.offset;
+
+    bool cond = (yid < otInfo.dims[1]) && (zid < otInfo.dims[2]) &&
+                (wid < otInfo.dims[3]);
     if (!cond) return;
 
     uint accum = (gid == 0) ? 0 : rtData[gid - 1];
 
-    for (uint k = 0, id = xid;
-         k < lim && id < otInfo.dims[0];
+    for (uint k = 0, id = xid; k < lim && id < otInfo.dims[0];
          k++, id += get_local_size(0)) {
-
         uint idx = otData[id] + accum;
-        T ival = iData[id];
+        T ival   = iData[id];
         if (!isZero(ival)) oData[idx - 1] = (off + id);
     }
 }
diff --git a/src/backend/opencl/kernel/where.hpp b/src/backend/opencl/kernel/where.hpp
index 4b6d7e74ee..980cdfe13f 100644
--- a/src/backend/opencl/kernel/where.hpp
+++ b/src/backend/opencl/kernel/where.hpp
@@ -8,163 +8,129 @@
  ********************************************************/
 
 #pragma once
-#include <string>
-#include <mutex>
-#include <map>
-#include <kernel_headers/where.hpp>
-#include <program.hpp>
-#include <traits.hpp>
-#include <dispatch.hpp>
+
 #include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
 #include <debug_opencl.hpp>
-#include <type_util.hpp>
-#include "names.hpp"
-#include "config.hpp"
-#include "scan_first.hpp"
-#include <memory.hpp>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-namespace kernel
-{
-
-    template<typename T>
-    static void get_out_idx(Buffer *out_data,
-                            Param &otmp, Param &rtmp,
-                            Param &in, uint threads_x,
-                            uint groups_x, uint groups_y)
-    {
-        static std::once_flag compileFlags[DeviceManager::MAX_DEVICES];
-        static std::map<int, Program*> whereProgs;
-        static std::map<int, Kernel *> whereKerns;
-
-        int device= getActiveDeviceId();
-
-        std::call_once(compileFlags[device], [device] () {
-
-                ToNum<T> toNum;
-
-                std::ostringstream options;
-                options << " -D T=" << dtype_traits<T>::getName()
-                        << " -D zero=" << toNum(scalar<T>(0))
-                        << " -D CPLX=" << af::iscplx<T>();
-                if (std::is_same<T, double>::value ||
-                    std::is_same<T, cdouble>::value) {
-                    options << " -D USE_DOUBLE";
-                }
-                Program prog;
-                buildProgram(prog, where_cl, where_cl_len, options.str());
-                whereProgs[device] = new Program(prog);
-                whereKerns[device] = new Kernel(*whereProgs[device], "get_out_idx_kernel");
-            });
-
-        NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
-        NDRange global(local[0] * groups_x * in.info.dims[2],
-                       local[1] * groups_y * in.info.dims[3]);
-
-        uint lim = divup(otmp.info.dims[0], (threads_x * groups_x));
-
-        auto whereOp = make_kernel<Buffer,
-                                   Buffer, KParam,
-                                   Buffer, KParam,
-                                   Buffer, KParam,
-                                   uint, uint, uint>(*whereKerns[device]);
-
-        whereOp(EnqueueArgs(getQueue(), global, local),
-                *out_data,
-                *otmp.data, otmp.info,
-                *rtmp.data, rtmp.info,
-                *in.data, in.info,
-                groups_x, groups_y, lim);
-
-        CL_DEBUG_FINISH(getQueue());
-
-    }
+#include <kernel/config.hpp>
+#include <kernel/names.hpp>
+#include <kernel/scan_first.hpp>
+#include <kernel_headers/where.hpp>
+#include <math.hpp>
+#include <traits.hpp>
 
-    template<typename T>
-    static void where(Param &out, Param &in)
-    {
-        try {
-            uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
-            threads_x = std::min(threads_x, THREADS_PER_GROUP);
-            uint threads_y = THREADS_PER_GROUP / threads_x;
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+template<typename T>
+static void get_out_idx(cl::Buffer *out_data, Param &otmp, Param &rtmp,
+                        Param &in, uint threads_x, uint groups_x,
+                        uint groups_y) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    ToNumStr<T> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+    };
+    vector<string> compileOpts = {
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+        DefineKeyValue(CPLX, iscplx<T>()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto getIdx = common::getKernel("get_out_idx", {{where_cl_src}}, tmpltArgs,
+                                    compileOpts);
+
+    NDRange local(threads_x, THREADS_PER_GROUP / threads_x);
+    NDRange global(local[0] * groups_x * in.info.dims[2],
+                   local[1] * groups_y * in.info.dims[3]);
+
+    uint lim = divup(otmp.info.dims[0], (threads_x * groups_x));
+
+    getIdx(EnqueueArgs(getQueue(), global, local), *out_data, *otmp.data,
+           otmp.info, *rtmp.data, rtmp.info, *in.data, in.info, groups_x,
+           groups_y, lim);
+    CL_DEBUG_FINISH(getQueue());
+}
 
-            uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
-            uint groups_y = divup(in.info.dims[1], threads_y);
+template<typename T>
+static void where(Param &out, Param &in) {
+    uint threads_x = nextpow2(std::max(32u, (uint)in.info.dims[0]));
+    threads_x      = std::min(threads_x, THREADS_PER_GROUP);
+    uint threads_y = THREADS_PER_GROUP / threads_x;
 
-            Param rtmp;
-            Param otmp;
+    uint groups_x = divup(in.info.dims[0], threads_x * REPEAT);
+    uint groups_y = divup(in.info.dims[1], threads_y);
 
-            rtmp.info.dims[0] = groups_x;
-            otmp.info.dims[0] = in.info.dims[0];
+    Param rtmp;
+    Param otmp;
 
-            rtmp.info.strides[0] = 1;
-            otmp.info.strides[0] = 1;
+    rtmp.info.dims[0] = groups_x;
+    otmp.info.dims[0] = in.info.dims[0];
 
-            rtmp.info.offset = 0;
-            otmp.info.offset = 0;
+    rtmp.info.strides[0] = 1;
+    otmp.info.strides[0] = 1;
 
-            for (int k = 1; k < 4; k++) {
-                rtmp.info.dims[k] = in.info.dims[k];
-                rtmp.info.strides[k] = rtmp.info.strides[k - 1] * rtmp.info.dims[k - 1];
+    rtmp.info.offset = 0;
+    otmp.info.offset = 0;
 
-                otmp.info.dims[k] = in.info.dims[k];
-                otmp.info.strides[k] = otmp.info.strides[k - 1] * otmp.info.dims[k - 1];
-            }
+    for (int k = 1; k < 4; k++) {
+        rtmp.info.dims[k]    = in.info.dims[k];
+        rtmp.info.strides[k] = rtmp.info.strides[k - 1] * rtmp.info.dims[k - 1];
 
-            int rtmp_elements = rtmp.info.strides[3] * rtmp.info.dims[3];
-            rtmp.data = bufferAlloc(rtmp_elements * sizeof(uint));
+        otmp.info.dims[k]    = in.info.dims[k];
+        otmp.info.strides[k] = otmp.info.strides[k - 1] * otmp.info.dims[k - 1];
+    }
 
-            int otmp_elements = otmp.info.strides[3] * otmp.info.dims[3];
-            otmp.data = bufferAlloc(otmp_elements * sizeof(uint));
+    int rtmp_elements = rtmp.info.strides[3] * rtmp.info.dims[3];
+    rtmp.data         = bufferAlloc(rtmp_elements * sizeof(uint));
 
-            scan_first_fn<T, uint, af_notzero_t, false>(otmp, rtmp, in,
-                                                        groups_x, groups_y,
-                                                        threads_x);
+    int otmp_elements = otmp.info.strides[3] * otmp.info.dims[3];
+    otmp.data         = bufferAlloc(otmp_elements * sizeof(uint));
 
-            // Linearize the dimensions and perform scan
-            Param ltmp = rtmp;
-            ltmp.info.offset = 0;
-            ltmp.info.dims[0] = rtmp_elements;
-            for (int k = 1; k < 4; k++) {
-                ltmp.info.dims[k] = 1;
-                ltmp.info.strides[k] = rtmp_elements;
-            }
+    scanFirstLauncher<T, uint, af_notzero_t>(otmp, rtmp, in, false, groups_x,
+                                             groups_y, threads_x);
 
-            scan_first<uint, uint, af_add_t>(ltmp, ltmp);
+    // Linearize the dimensions and perform scan
+    Param ltmp        = rtmp;
+    ltmp.info.offset  = 0;
+    ltmp.info.dims[0] = rtmp_elements;
+    for (int k = 1; k < 4; k++) {
+        ltmp.info.dims[k]    = 1;
+        ltmp.info.strides[k] = rtmp_elements;
+    }
 
-            // Get output size and allocate output
-            uint total;
-            getQueue().enqueueReadBuffer(*rtmp.data, CL_TRUE,
-                                         sizeof(uint) * (rtmp_elements - 1),
-                                         sizeof(uint),
-                                         &total);
+    scanFirst<uint, uint, af_add_t>(ltmp, ltmp);
 
+    // Get output size and allocate output
+    uint total;
+    getQueue().enqueueReadBuffer(*rtmp.data, CL_TRUE,
+                                 sizeof(uint) * (rtmp_elements - 1),
+                                 sizeof(uint), &total);
 
-            out.data = bufferAlloc(total * sizeof(uint));
+    out.data = bufferAlloc(total * sizeof(uint));
 
-            out.info.dims[0] = total;
-            out.info.strides[0] = 1;
-            for (int k = 1; k < 4; k++) {
-                out.info.dims[k] = 1;
-                out.info.strides[k] = total;
-            }
+    out.info.dims[0]    = total;
+    out.info.strides[0] = 1;
+    for (int k = 1; k < 4; k++) {
+        out.info.dims[k]    = 1;
+        out.info.strides[k] = total;
+    }
 
-            get_out_idx<T>(out.data, otmp, rtmp, in, threads_x, groups_x, groups_y);
+    if (total > 0)
+        get_out_idx<T>(out.data, otmp, rtmp, in, threads_x, groups_x, groups_y);
 
-            bufferFree(rtmp.data);
-            bufferFree(otmp.data);
-        } catch (cl::Error err) {
-            CL_TO_AF_ERROR(err);
-        }
-    }
-}
+    bufferFree(rtmp.data);
+    bufferFree(otmp.data);
 }
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/wrap.cl b/src/backend/opencl/kernel/wrap.cl
new file mode 100644
index 0000000000..3b2b1faf38
--- /dev/null
+++ b/src/backend/opencl/kernel/wrap.cl
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void wrap(global T *optr, KParam out, global T *iptr, KParam in,
+                 const int wx, const int wy, const int sx, const int sy,
+                 const int px, const int py, const int nx, const int ny,
+                 int groups_x, int groups_y) {
+    int idx2 = get_group_id(0) / groups_x;
+    int idx3 = get_group_id(1) / groups_y;
+
+    int groupId_x = get_group_id(0) - idx2 * groups_x;
+    int groupId_y = get_group_id(1) - idx3 * groups_y;
+
+    int oidx0 = get_local_id(0) + get_local_size(0) * groupId_x;
+    int oidx1 = get_local_id(1) + get_local_size(1) * groupId_y;
+
+    optr += idx2 * out.strides[2] + idx3 * out.strides[3] + out.offset;
+    iptr += idx2 * in.strides[2] + idx3 * in.strides[3] + in.offset;
+
+    if (oidx0 >= out.dims[0] || oidx1 >= out.dims[1]) return;
+
+    int pidx0 = oidx0 + px;
+    int pidx1 = oidx1 + py;
+
+    // The last time a value appears in the unwrapped index is
+    // padded_index / stride. Each previous index has the value appear
+    // "stride" locations earlier. We work our way back from the last index.
+
+    const int x_end = min(pidx0 / sx, nx - 1);
+    const int y_end = min(pidx1 / sy, ny - 1);
+
+    const int x_off = pidx0 - sx * x_end;
+    const int y_off = pidx1 - sy * y_end;
+
+    T val   = ZERO;
+    int idx = 1;
+
+    for (int y = y_end, yo = y_off; y >= 0 && yo < wy; yo += sy, y--) {
+        int win_end_y = yo * wx;
+        int dim_end_y = y * nx;
+
+        for (int x = x_end, xo = x_off; x >= 0 && xo < wx; xo += sx, x--) {
+            int win_end = win_end_y + xo;
+            int dim_end = dim_end_y + x;
+
+            if (is_column) {
+                idx = dim_end * in.strides[1] + win_end;
+            } else {
+                idx = dim_end + win_end * in.strides[1];
+            }
+
+            // No need to include anything special for complex
+            // Add for complex numbers is just vector add of reals
+            // Might need to change if we generalize add to more binary ops
+            val = val + iptr[idx];
+        }
+    }
+
+    optr[oidx1 * out.strides[1] + oidx0] = val;
+}
diff --git a/src/backend/opencl/kernel/wrap.hpp b/src/backend/opencl/kernel/wrap.hpp
new file mode 100644
index 0000000000..e664c7b472
--- /dev/null
+++ b/src/backend/opencl/kernel/wrap.hpp
@@ -0,0 +1,121 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Param.hpp>
+#include <common/dispatch.hpp>
+#include <common/kernel_cache.hpp>
+#include <debug_opencl.hpp>
+#include <kernel/config.hpp>
+#include <kernel_headers/wrap.hpp>
+#include <kernel_headers/wrap_dilated.hpp>
+#include <math.hpp>
+#include <traits.hpp>
+
+#include <string>
+#include <vector>
+
+namespace arrayfire {
+namespace opencl {
+namespace kernel {
+
+template<typename T>
+void wrap(Param out, const Param in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    ToNumStr<T> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_column),
+    };
+    vector<string> compileOpts = {
+        DefineValue(is_column),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto wrap =
+        common::getKernel("wrap", {{wrap_cl_src}}, tmpltArgs, compileOpts);
+
+    dim_t nx = (out.info.dims[0] + 2 * px - wx) / sx + 1;
+    dim_t ny = (out.info.dims[1] + 2 * py - wy) / sy + 1;
+
+    NDRange local(THREADS_X, THREADS_Y);
+
+    dim_t groups_x = divup(out.info.dims[0], local[0]);
+    dim_t groups_y = divup(out.info.dims[1], local[1]);
+
+    NDRange global(local[0] * groups_x * out.info.dims[2],
+                   local[1] * groups_y * out.info.dims[3]);
+
+    wrap(EnqueueArgs(getQueue(), global, local), *out.data, out.info, *in.data,
+         in.info, static_cast<int>(wx), static_cast<int>(wy),
+         static_cast<int>(sx), static_cast<int>(sy), static_cast<int>(px),
+         static_cast<int>(py), static_cast<int>(nx), static_cast<int>(ny),
+         static_cast<int>(groups_x), static_cast<int>(groups_y));
+
+    CL_DEBUG_FINISH(getQueue());
+}
+
+template<typename T>
+void wrap_dilated(Param out, const Param in, const dim_t wx, const dim_t wy,
+                  const dim_t sx, const dim_t sy, const dim_t px,
+                  const dim_t py, const dim_t dx, const dim_t dy,
+                  const bool is_column) {
+    using cl::EnqueueArgs;
+    using cl::NDRange;
+    using std::string;
+    using std::vector;
+
+    ToNumStr<T> toNumStr;
+    vector<TemplateArg> tmpltArgs = {
+        TemplateTypename<T>(),
+        TemplateArg(is_column),
+    };
+    vector<string> compileOpts = {
+        DefineValue(is_column),
+        DefineKeyValue(ZERO, toNumStr(scalar<T>(0))),
+        DefineKeyValue(T, dtype_traits<T>::getName()),
+    };
+    compileOpts.emplace_back(getTypeBuildDefinition<T>());
+
+    auto dilatedWrap = common::getKernel(
+        "wrap_dilated", {{wrap_dilated_cl_src}}, tmpltArgs, compileOpts);
+
+    dim_t nx = 1 + (out.info.dims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (out.info.dims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    NDRange local(THREADS_X, THREADS_Y);
+
+    dim_t groups_x = divup(out.info.dims[0], local[0]);
+    dim_t groups_y = divup(out.info.dims[1], local[1]);
+
+    NDRange global(local[0] * groups_x * out.info.dims[2],
+                   local[1] * groups_y * out.info.dims[3]);
+
+    dilatedWrap(EnqueueArgs(getQueue(), global, local), *out.data, out.info,
+                *in.data, in.info, static_cast<int>(wx), static_cast<int>(wy),
+                static_cast<int>(sx), static_cast<int>(sy),
+                static_cast<int>(px), static_cast<int>(py),
+                static_cast<int>(dx), static_cast<int>(dy),
+                static_cast<int>(nx), static_cast<int>(ny),
+                static_cast<int>(groups_x), static_cast<int>(groups_y));
+    CL_DEBUG_FINISH(getQueue());
+}
+
+}  // namespace kernel
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/kernel/wrap_dilated.cl b/src/backend/opencl/kernel/wrap_dilated.cl
new file mode 100644
index 0000000000..fee950eb24
--- /dev/null
+++ b/src/backend/opencl/kernel/wrap_dilated.cl
@@ -0,0 +1,78 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+kernel void wrap_dilated(global T *optr, KParam out, global T *iptr, KParam in,
+                         const int wx, const int wy, const int sx, const int sy,
+                         const int px, const int py, const int dx, const int dy,
+                         const int nx, const int ny, int groups_x,
+                         int groups_y) {
+    int idx2 = get_group_id(0) / groups_x;
+    int idx3 = get_group_id(1) / groups_y;
+
+    int groupId_x = get_group_id(0) - idx2 * groups_x;
+    int groupId_y = get_group_id(1) - idx3 * groups_y;
+
+    int oidx0 = get_local_id(0) + get_local_size(0) * groupId_x;
+    int oidx1 = get_local_id(1) + get_local_size(1) * groupId_y;
+
+    optr += idx2 * out.strides[2] + idx3 * out.strides[3] + out.offset;
+    iptr += idx2 * in.strides[2] + idx3 * in.strides[3] + in.offset;
+
+    if (oidx0 >= out.dims[0] || oidx1 >= out.dims[1]) return;
+
+    int eff_wx = wx + (wx - 1) * (dx - 1);
+    int eff_wy = wy + (wy - 1) * (dy - 1);
+
+    int pidx0 = oidx0 + px;
+    int pidx1 = oidx1 + py;
+
+    // A value at a padded index can appear in every window that covers it.
+    // [x_start, x_end) and [y_start, y_end) bound those windows; the loops
+    // below walk them front to back and skip positions that fall between
+    // dilated taps.
+
+    const int y_start = (pidx1 < eff_wy) ? 0 : (pidx1 - eff_wy) / sy + 1;
+    const int y_end   = min(pidx1 / sy + 1, ny);
+
+    const int x_start = (pidx0 < eff_wx) ? 0 : (pidx0 - eff_wx) / sx + 1;
+    const int x_end   = min(pidx0 / sx + 1, nx);
+
+    T val   = ZERO;
+    int idx = 1;
+
+    for (int y = y_start; y < y_end; y++) {
+        int fy      = (pidx1 - y * sy);
+        bool yvalid = (fy % dy == 0) && (y < ny);
+        fy /= dy;
+
+        int win_end_y = fy * wx;
+        int dim_end_y = y * nx;
+
+        for (int x = x_start; x < x_end; x++) {
+            int fx      = (pidx0 - x * sx);
+            bool xvalid = (fx % dx == 0) && (x < nx);
+            fx /= dx;
+
+            int win_end = win_end_y + fx;
+            int dim_end = dim_end_y + x;
+
+            if (is_column) {
+                idx = dim_end * in.strides[1] + win_end;
+            } else {
+                idx = dim_end + win_end * in.strides[1];
+            }
+
+            T ival;
+            ival = (yvalid && xvalid) ? iptr[idx] : ZERO;
+            val  = val + ival;
+        }
+    }
+
+    optr[oidx1 * out.strides[1] + oidx0] = val;
+}
diff --git a/src/backend/opencl/logic.hpp b/src/backend/opencl/logic.hpp
index 949fa4d2a6..78efdcadd3 100644
--- a/src/backend/opencl/logic.hpp
+++ b/src/backend/opencl/logic.hpp
@@ -7,25 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/defines.h>
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <optypes.hpp>
 #include <binary.hpp>
+#include <common/jit/BinaryNode.hpp>
 #include <err_opencl.hpp>
+#include <optypes.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
-    template<typename T, af_op_t op>
-    Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<char, T, op>(lhs, rhs, odims);
-    }
+namespace arrayfire {
+namespace opencl {
+template<typename T, af_op_t op>
+Array<char> logicOp(const Array<T> &lhs, const Array<T> &rhs,
+                    const af::dim4 &odims) {
+    return common::createBinaryNode<char, T, op>(lhs, rhs, odims);
+}
 
-    template<typename T, af_op_t op>
-    Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs, const af::dim4 &odims)
-    {
-        return createBinaryNode<T, T, op>(lhs, rhs, odims);
-    }
+template<typename T, af_op_t op>
+Array<T> bitOp(const Array<T> &lhs, const Array<T> &rhs,
+               const af::dim4 &odims) {
+    return common::createBinaryNode<T, T, op>(lhs, rhs, odims);
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/lookup.cpp b/src/backend/opencl/lookup.cpp
index e9dc4a3f8c..83bca0ac44 100644
--- a/src/backend/opencl/lookup.cpp
+++ b/src/backend/opencl/lookup.cpp
@@ -7,54 +7,72 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <Array.hpp>
 #include <kernel/lookup.hpp>
 #include <lookup.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
 #include <err_opencl.hpp>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
+using arrayfire::common::half;
 
+namespace arrayfire {
+namespace opencl {
 template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim)
-{
-    const dim4 iDims = input.dims();
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim) {
+    const dim4 &iDims = input.dims();
 
     dim4 oDims(1);
-    for (int d=0; d<4; ++d)
-        oDims[d] = (d==int(dim) ? indices.elements() : iDims[d]);
+    for (dim_t d = 0; d < 4; ++d) {
+        oDims[d] = (d == dim ? indices.elements() : iDims[d]);
+    }
 
     Array<in_t> out = createEmptyArray<in_t>(oDims);
 
-    dim_t nDims = iDims.ndims();
-
-    switch(dim) {
-        case 0: kernel::lookup<in_t, idx_t, 0>(out, input, indices, nDims); break;
-        case 1: kernel::lookup<in_t, idx_t, 1>(out, input, indices, nDims); break;
-        case 2: kernel::lookup<in_t, idx_t, 2>(out, input, indices, nDims); break;
-        case 3: kernel::lookup<in_t, idx_t, 3>(out, input, indices, nDims); break;
-    }
+    kernel::lookup<in_t, idx_t>(out, input, indices, dim);
 
     return out;
 }
 
-#define INSTANTIATE(T)  \
-    template Array<T> lookup<T, float   >(const Array<T> &input, const Array<float   > &indices, const unsigned dim); \
-    template Array<T> lookup<T, double  >(const Array<T> &input, const Array<double  > &indices, const unsigned dim); \
-    template Array<T> lookup<T, int     >(const Array<T> &input, const Array<int     > &indices, const unsigned dim); \
-    template Array<T> lookup<T, unsigned>(const Array<T> &input, const Array<unsigned> &indices, const unsigned dim); \
-    template Array<T> lookup<T, uchar   >(const Array<T> &input, const Array<uchar   > &indices, const unsigned dim);
-
-INSTANTIATE(float   );
-INSTANTIATE(cfloat  );
-INSTANTIATE(double  );
-INSTANTIATE(cdouble );
-INSTANTIATE(int     );
-INSTANTIATE(unsigned);
-INSTANTIATE(intl    );
-INSTANTIATE(uintl   );
-INSTANTIATE(uchar   );
-INSTANTIATE(char    );
+#define INSTANTIATE(T)                                                         \
+    template Array<T> lookup<T, float>(const Array<T> &, const Array<float> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, double>(                                       \
+        const Array<T> &, const Array<double> &, const unsigned);              \
+    template Array<T> lookup<T, int>(const Array<T> &, const Array<int> &,     \
+                                     const unsigned);                          \
+    template Array<T> lookup<T, unsigned>(                                     \
+        const Array<T> &, const Array<unsigned> &, const unsigned);            \
+    template Array<T> lookup<T, short>(const Array<T> &, const Array<short> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, ushort>(                                       \
+        const Array<T> &, const Array<ushort> &, const unsigned);              \
+    template Array<T> lookup<T, intl>(const Array<T> &, const Array<intl> &,   \
+                                      const unsigned);                         \
+    template Array<T> lookup<T, uintl>(const Array<T> &, const Array<uintl> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, schar>(const Array<T> &, const Array<schar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, uchar>(const Array<T> &, const Array<uchar> &, \
+                                       const unsigned);                        \
+    template Array<T> lookup<T, half>(const Array<T> &, const Array<half> &,   \
+                                      const unsigned)
 
-}
+INSTANTIATE(float);
+INSTANTIATE(cfloat);
+INSTANTIATE(double);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(unsigned);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(ushort);
+INSTANTIATE(short);
+INSTANTIATE(half);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/lookup.hpp b/src/backend/opencl/lookup.hpp
index 59c5f21a6d..abf10d5902 100644
--- a/src/backend/opencl/lookup.hpp
+++ b/src/backend/opencl/lookup.hpp
@@ -9,10 +9,10 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
-
+namespace arrayfire {
+namespace opencl {
 template<typename in_t, typename idx_t>
-Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices, const unsigned dim);
-
-}
+Array<in_t> lookup(const Array<in_t> &input, const Array<idx_t> &indices,
+                   const unsigned dim);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/lu.cpp b/src/backend/opencl/lu.cpp
index ee76f47201..ff6f54d0d9 100644
--- a/src/backend/opencl/lu.cpp
+++ b/src/backend/opencl/lu.cpp
@@ -7,28 +7,26 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <err_opencl.hpp>
 #include <lu.hpp>
-#include <err_common.hpp>
 
-#if defined(WITH_OPENCL_LINEAR_ALGEBRA)
-#include <kernel/lu_split.hpp>
-#include <copy.hpp>
+#if defined(WITH_LINEAR_ALGEBRA)
 #include <blas.hpp>
+#include <copy.hpp>
+#include <cpu/cpu_lu.hpp>
+#include <kernel/lu_split.hpp>
 #include <magma/magma.h>
+#include <platform.hpp>
 
-namespace opencl
-{
-
-Array<int> convertPivot(int *ipiv, int in_sz, int out_sz)
-{
+namespace arrayfire {
+namespace opencl {
 
+Array<int> convertPivot(int *ipiv, int in_sz, int out_sz) {
     std::vector<int> out(out_sz);
 
-    for (int i = 0; i < out_sz; i++) {
-        out[i] = i;
-    }
+    for (int i = 0; i < out_sz; i++) { out[i] = i; }
 
-    for(int j = 0; j < in_sz; j++) {
+    for (int j = 0; j < in_sz; j++) {
         // 1 indexed in pivot
         std::swap(out[j], out[ipiv[j] - 1]);
     }
@@ -39,92 +37,93 @@ Array<int> convertPivot(int *ipiv, int in_sz, int out_sz)
 }
 
 template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
-
-    try {
-        dim4 iDims = in.dims();
-        int M = iDims[0];
-        int N = iDims[1];
-        int MN = std::min(M, N);
-
-        Array<T> in_copy = copyArray<T>(in);
-        pivot = lu_inplace(in_copy);
-
-        // SPLIT into lower and upper
-        dim4 ldims(M, MN);
-        dim4 udims(MN, N);
-        lower = createEmptyArray<T>(ldims);
-        upper = createEmptyArray<T>(udims);
-        kernel::lu_split<T>(lower, upper, in_copy);
-
-    } catch (cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
+    if (OpenCLCPUOffload()) { return cpu::lu(lower, upper, pivot, in); }
+
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+    int MN     = std::min(M, N);
+
+    Array<T> in_copy = copyArray<T>(in);
+    pivot            = lu_inplace(in_copy);
+
+    // SPLIT into lower and upper
+    dim4 ldims(M, MN);
+    dim4 udims(MN, N);
+    lower = createEmptyArray<T>(ldims);
+    upper = createEmptyArray<T>(udims);
+    kernel::luSplit<T>(lower, upper, in_copy);
 }
 
 template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
-    try {
-        initBlas();
-        dim4 iDims = in.dims();
-        int M = iDims[0];
-        int N = iDims[1];
-        int MN = std::min(M, N);
-        std::vector<int> ipiv(MN);
-
-        cl::Buffer *in_buf = in.get();
-        int info = 0;
-        magma_getrf_gpu<T>(M, N, (*in_buf)(), in.getOffset(), in.strides()[1],
-                           &ipiv[0], getQueue()(), &info);
-
-        if (!convert_pivot) return createHostDataArray<int>(dim4(MN), &ipiv[0]);
-
-        Array<int> pivot = convertPivot(&ipiv[0], MN, M);
-        return pivot;
-    } catch(cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
+    if (OpenCLCPUOffload()) { return cpu::lu_inplace(in, convert_pivot); }
+
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+    int MN     = std::min(M, N);
+    std::vector<int> ipiv(MN);
+
+    cl::Buffer *in_buf = in.get();
+    int info           = 0;
+    magma_getrf_gpu<T>(M, N, (*in_buf)(), in.getOffset(), in.strides()[1],
+                       &ipiv[0], getQueue()(), &info);
+
+    if (!convert_pivot) { return createHostDataArray<int>(dim4(MN), &ipiv[0]); }
+
+    Array<int> pivot = convertPivot(&ipiv[0], MN, M);
+    return pivot;
 }
 
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+bool isLAPACKAvailable() { return true; }
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> &in,              \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> &lower, Array<T> &upper,        \
+                        Array<int> &pivot, const Array<T> &in);
 
 INSTANTIATE_LU(float)
 INSTANTIATE_LU(cfloat)
 INSTANTIATE_LU(double)
 INSTANTIATE_LU(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in)
-{
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<int> lu_inplace(Array<T> &in, const bool convert_pivot)
-{
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_LU(T)                                                                           \
-    template Array<int> lu_inplace<T>(Array<T> &in, const bool convert_pivot);                      \
-    template void lu<T>(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+bool isLAPACKAvailable() { return false; }
+
+#define INSTANTIATE_LU(T)                                        \
+    template Array<int> lu_inplace<T>(Array<T> &in,              \
+                                      const bool convert_pivot); \
+    template void lu<T>(Array<T> &lower, Array<T> &upper,        \
+                        Array<int> &pivot, const Array<T> &in);
 
 INSTANTIATE_LU(float)
 INSTANTIATE_LU(cfloat)
 INSTANTIATE_LU(double)
 INSTANTIATE_LU(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/lu.hpp b/src/backend/opencl/lu.hpp
index af43f24614..2186aef62e 100644
--- a/src/backend/opencl/lu.hpp
+++ b/src/backend/opencl/lu.hpp
@@ -7,14 +7,17 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot, const Array<T> &in);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void lu(Array<T> &lower, Array<T> &upper, Array<int> &pivot,
+        const Array<T> &in);
 
-    template<typename T>
-    Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
-}
+template<typename T>
+Array<int> lu_inplace(Array<T> &in, const bool convert_pivot = true);
+
+bool isLAPACKAvailable();
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/magma/gebrd.cpp b/src/backend/opencl/magma/gebrd.cpp
new file mode 100644
index 0000000000..c63be4a5bb
--- /dev/null
+++ b/src/backend/opencl/magma/gebrd.cpp
@@ -0,0 +1,356 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
+#include <traits.hpp>
+#include <types.hpp>
+#include "magma.h"
+#include "magma_blas.h"
+#include "magma_cpu_blas.h"
+#include "magma_cpu_lapack.h"
+#include "magma_data.h"
+#include "magma_helper.h"
+#include "magma_sync.h"
+
+#include <algorithm>
+
+// produces pointer and offset as two args to magmaBLAS routines
+#define dA(i, j) da, ((da_offset) + (i) + (j)*ldda)
+// produces pointer as single arg to BLAS routines
+#define A(i, j) &a[(i) + (j)*lda]
+
+template<typename Ty>
+magma_int_t magma_gebrd_hybrid(magma_int_t m, magma_int_t n, Ty *a,
+                               magma_int_t lda, cl_mem da, size_t da_offset,
+                               magma_int_t ldda, void *_d, void *_e, Ty *tauq,
+                               Ty *taup, Ty *work, magma_int_t lwork,
+                               magma_queue_t queue, magma_int_t *info,
+                               bool copy) {
+    /*  -- MAGMA (version 1.1) --
+        Univ. of Tennessee, Knoxville
+        Univ. of California, Berkeley
+        Univ. of Colorado, Denver
+        @date
+
+        Purpose
+        =======
+        ZGEBRD reduces a general complex M-by-N matrix A to upper or lower
+        bidiagonal form B by an orthogonal transformation: Q**H * A * P = B.
+
+        If m >= n, B is upper bidiagonal; if m < n, B is lower bidiagonal.
+
+        Arguments
+        =========
+        M       (input) INTEGER
+        The number of rows in the matrix A.  M >= 0.
+
+        N       (input) INTEGER
+        The number of columns in the matrix A.  N >= 0.
+
+        A       (input/output) COMPLEX_16 array, dimension (LDA,N)
+        On entry, the M-by-N general matrix to be reduced.
+        On exit,
+        if m >= n, the diagonal and the first superdiagonal are
+        overwritten with the upper bidiagonal matrix B; the
+        elements below the diagonal, with the array TAUQ, represent
+        the orthogonal matrix Q as a product of elementary
+        reflectors, and the elements above the first superdiagonal,
+        with the array TAUP, represent the orthogonal matrix P as
+        a product of elementary reflectors;
+        if m < n, the diagonal and the first subdiagonal are
+        overwritten with the lower bidiagonal matrix B; the
+        elements below the first subdiagonal, with the array TAUQ,
+        represent the orthogonal matrix Q as a product of
+        elementary reflectors, and the elements above the diagonal,
+        with the array TAUP, represent the orthogonal matrix P as
+        a product of elementary reflectors.
+        See Further Details.
+
+        LDA     (input) INTEGER
+        The leading dimension of the array A.  LDA >= max(1,M).
+
+        D       (output) double precision array, dimension (min(M,N))
+        The diagonal elements of the bidiagonal matrix B:
+        D(i) = A(i,i).
+
+        E       (output) double precision array, dimension (min(M,N)-1)
+        The off-diagonal elements of the bidiagonal matrix B:
+        if m >= n, E(i) = A(i,i+1) for i = 1,2,...,n-1;
+        if m < n, E(i) = A(i+1,i) for i = 1,2,...,m-1.
+
+        TAUQ    (output) COMPLEX_16 array, dimension (min(M,N))
+        The scalar factors of the elementary reflectors which
+        represent the orthogonal matrix Q. See Further Details.
+
+        TAUP    (output) COMPLEX_16 array, dimension (min(M,N))
+        The scalar factors of the elementary reflectors which
+        represent the orthogonal matrix P. See Further Details.
+
+        WORK    (workspace/output) COMPLEX_16 array, dimension (MAX(1,LWORK))
+        On exit, if INFO = 0, WORK[0] returns the optimal LWORK.
+
+        LWORK   (input) INTEGER
+        The length of the array WORK. LWORK >= (M+N)*NB, where NB
+        is the optimal blocksize.
+
+        If LWORK = -1, then a workspace query is assumed; the routine
+        only calculates the optimal size of the WORK array, returns
+        this value as the first entry of the WORK array, and no error
+        message related to LWORK is issued by XERBLA.
+
+        INFO    (output) INTEGER
+        = 0:  successful exit
+        < 0:  if INFO = -i, the i-th argument had an illegal value.
+
+        Further Details
+        ===============
+        The matrices Q and P are represented as products of elementary
+        reflectors:
+
+        If m >= n,
+        Q = H(1) H(2) . . . H(n)  and  P = G(1) G(2) . . . G(n-1)
+        Each H(i) and G(i) has the form:
+        H(i) = I - tauq * v * v'  and G(i) = I - taup * u * u'
+        where tauq and taup are complex scalars, and v and u are complex
+        vectors; v(1:i-1) = 0, v(i) = 1, and v(i+1:m) is stored on exit in
+        A(i+1:m,i); u(1:i) = 0, u(i+1) = 1, and u(i+2:n) is stored on exit in
+        A(i,i+2:n); tauq is stored in TAUQ(i) and taup in TAUP(i).
+
+        If m < n,
+        Q = H(1) H(2) . . . H(m-1)  and  P = G(1) G(2) . . . G(m)
+        Each H(i) and G(i) has the form:
+        H(i) = I - tauq * v * v'  and G(i) = I - taup * u * u'
+        where tauq and taup are complex scalars, and v and u are complex
+        vectors; v(1:i) = 0, v(i+1) = 1, and v(i+2:m) is stored on exit in
+        A(i+2:m,i); u(1:i-1) = 0, u(i) = 1, and u(i+1:n) is stored on exit in
+        A(i,i+1:n); tauq is stored in TAUQ(i) and taup in TAUP(i).
+
+        The contents of A on exit are illustrated by the following examples:
+
+        m = 6 and n = 5 (m > n):          m = 5 and n = 6 (m < n):
+
+        ( d   e   u1  u1  u1)           ( d   u1  u1  u1  u1  u1)
+        ( v1  d   e   u2  u2)           ( e   d   u2  u2  u2  u2)
+        ( v1  v2  d   e   u3)           ( v1  e   d   u3  u3  u3)
+        ( v1  v2  v3  d   e )           ( v1  v2  e   d   u4  u4)
+        ( v1  v2  v3  v4  d )           ( v1  v2  v3  e   d   u5)
+        ( v1  v2  v3  v4  v5)
+
+        where d and e denote diagonal and off-diagonal elements of B, vi
+        denotes an element of the vector defining H(i), and ui an element of
+        the vector defining G(i).
+        ===================================================================== */
+
+    using Tr = typename af::dtype_traits<Ty>::base_type;
+
+    Tr *d = (Tr *)_d;
+    Tr *e = (Tr *)_e;
+
+    Ty c_neg_one = magma_neg_one<Ty>();
+    Ty c_one     = magma_one<Ty>();
+    cl_mem dwork;
+
+    magma_int_t ncol, nrow, jmax, nb;
+
+    magma_int_t i, j, nx;
+    // magma_int_t iinfo;
+
+    magma_int_t minmn;
+    magma_int_t ldwrkx, ldwrky, lwkopt;
+    magma_int_t lquery;
+
+    nb = magma_get_gebrd_nb<Ty>(n);
+
+    lwkopt  = (m + n) * nb;
+    work[0] = magma_make<Ty>(lwkopt, 0.);
+    lquery  = (lwork == -1);
+
+    /* Check arguments */
+    *info = 0;
+    if (m < 0) {
+        *info = -1;
+    } else if (n < 0) {
+        *info = -2;
+    } else if (lda < std::max(1, m)) {
+        *info = -4;
+    } else if (lwork < lwkopt && (!lquery)) {
+        *info = -10;
+    }
+    if (*info < 0) {
+        // magma_xerbla(__func__, -(*info));
+        return *info;
+    } else if (lquery) {
+        return *info;
+    }
+
+    /* Quick return if possible */
+    minmn = std::min(m, n);
+    if (minmn == 0) {
+        work[0] = c_one;
+        return *info;
+    }
+
+    const size_t size = (m + n) * nb;
+    if (MAGMA_SUCCESS != magma_malloc<Ty>(&dwork, size)) {
+        *info = MAGMA_ERR_DEVICE_ALLOC;
+        return *info;
+    }
+    size_t dwork_offset = 0;
+    // initialize dwork to 0.0
+    const float dfill = 0.0;
+    cl_int err = clEnqueueFillBuffer(queue, dwork, &dfill, sizeof(dfill), 0,
+                                     size * sizeof(Ty), 0, nullptr, nullptr);
+    check_error(err);
+
+    cl_event event = 0;
+
+    ldwrkx = m;
+    ldwrky = n;
+
+    /* Set the blocked/unblocked crossover point NX. */
+    nx = 128;
+
+    /* Copy the matrix to the GPU */
+    if (copy && minmn - nx >= 1) {
+        magma_setmatrix<Ty>(m, n, a, lda, da, da_offset, ldda, queue);
+    }
+
+    gpu_blas_gemm_func<Ty> gpu_blas_gemm;
+    cpu_lapack_gebrd_work_func<Ty> cpu_lapack_gebrd_work;
+
+    for (i = 0; i < (minmn - nx); i += nb) {
+        /*  Reduce rows and columns i:i+nb-1 to bidiagonal form and return
+            the matrices X and Y which are needed to update the unreduced
+            part of the matrix */
+        nrow = m - i;
+        ncol = n - i;
+
+        /*   Get the current panel (no need for the 1st iteration) */
+        if (i > 0) {
+            magma_getmatrix<Ty>(nrow, nb, dA(i, i), ldda, A(i, i), lda, queue);
+            magma_getmatrix<Ty>(nb, ncol - nb, dA(i, i + nb), ldda,
+                                A(i, i + nb), lda, queue);
+        }
+
+        magma_labrd_gpu<Ty>(nrow, ncol, nb, A(i, i), lda, dA(i, i), ldda, d + i,
+                            e + i, tauq + i, taup + i, work, ldwrkx, dwork,
+                            dwork_offset, ldwrkx,  // x, dx
+                            work + (ldwrkx * nb), ldwrky, dwork,
+                            dwork_offset + (ldwrkx * nb), ldwrky,  // y, dy
+                            queue);
+
+        /*  Update the trailing submatrix A(i+nb:m,i+nb:n), using an update
+            of the form  A := A - V*Y' - X*U' */
+        nrow = m - i - nb;
+        ncol = n - i - nb;
+
+        // Send X and Y back to the GPU
+        magma_setmatrix<Ty>(nrow, nb, work + nb, ldwrkx, dwork,
+                            dwork_offset + nb, ldwrkx, queue);
+        magma_setmatrix<Ty>(ncol, nb, work + (ldwrkx + 1) * nb, ldwrky, dwork,
+                            dwork_offset + (ldwrkx + 1) * nb, ldwrky, queue);
+
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(
+            OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_CONJ_TRANS, nrow, ncol, nb,
+            c_neg_one, dA(i + nb, i), ldda, dwork,
+            dwork_offset + (ldwrkx + 1) * nb, ldwrky, c_one, dA(i + nb, i + nb),
+            ldda, 1, &queue, 0, nullptr, &event));
+
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(
+            OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NO_TRANS, nrow, ncol, nb,
+            c_neg_one, dwork, dwork_offset + nb, ldwrkx, dA(i, i + nb), ldda,
+            c_one, dA(i + nb, i + nb), ldda, 1, &queue, 0, nullptr, &event));
+
+        /* Copy diagonal and off-diagonal elements of B back into A */
+        if (m >= n) {
+            jmax = i + nb;
+            for (j = i; j < jmax; ++j) {
+                *A(j, j)     = magma_make<Ty>(d[j], 0.);
+                *A(j, j + 1) = magma_make<Ty>(e[j], 0.);
+            }
+        } else {
+            jmax = i + nb;
+            for (j = i; j < jmax; ++j) {
+                *A(j, j)     = magma_make<Ty>(d[j], 0.);
+                *A(j + 1, j) = magma_make<Ty>(e[j], 0.);
+            }
+        }
+    }
+
+    /* Use unblocked code to reduce the remainder of the matrix */
+    nrow = m - i;
+    ncol = n - i;
+
+    if (0 < minmn - nx) {
+        magma_getmatrix<Ty>(nrow, ncol, dA(i, i), ldda, A(i, i), lda, queue);
+    }
+
+    LAPACKE_CHECK(cpu_lapack_gebrd_work(nrow, ncol, A(i, i), lda, d + i, e + i,
+                                        tauq + i, taup + i, work, lwork));
+    work[0] = magma_make<Ty>(lwkopt, 0.);
+
+    magma_free(dwork);
+    *info = 0;
+    return 0;
+} /* magma_gebrd_hybrid */
+
+#define INSTANTIATE(Ty)                                                   \
+    template magma_int_t magma_gebrd_hybrid<Ty>(                          \
+        magma_int_t m, magma_int_t n, Ty * a, magma_int_t lda, cl_mem da, \
+        size_t da_offset, magma_int_t ldda, void *_d, void *_e, Ty *tauq, \
+        Ty *taup, Ty *work, magma_int_t lwork, magma_queue_t queue,       \
+        magma_int_t *info, bool copy);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(magmaFloatComplex)
+INSTANTIATE(magmaDoubleComplex)
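Reviewer note on the argument checks in `magma_gebrd_hybrid`: the routine follows the LAPACK workspace-query convention, where `lwork == -1` only reports the optimal size in `work[0]`. The sketch below restates that logic in isolation. `gebrd_check_args` and `allocate_gebrd_workspace` are hypothetical names, `double` stands in for `Ty`, and `nb` is passed in directly instead of coming from `magma_get_gebrd_nb<Ty>(n)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the argument checks in magma_gebrd_hybrid.
// Returns 0 on success or -i when the i-th argument of the real routine
// would be rejected (m = 1, n = 2, lda = 4, lwork = 10).
int gebrd_check_args(int m, int n, int lda, int lwork, int nb, double *work) {
    const int lwkopt  = (m + n) * nb;
    work[0]           = static_cast<double>(lwkopt);  // always reported
    const bool lquery = (lwork == -1);

    if (m < 0) { return -1; }
    if (n < 0) { return -2; }
    if (lda < std::max(1, m)) { return -4; }
    if (lwork < lwkopt && !lquery) { return -10; }
    return 0;
}

// Two-phase use: query the size with lwork = -1, then allocate.
std::vector<double> allocate_gebrd_workspace(int m, int n, int lda, int nb) {
    double query = 0.0;
    if (gebrd_check_args(m, n, lda, -1, nb, &query) != 0) { return {}; }
    assert(query >= 0.0);  // m, n >= 0 passed the checks above
    return std::vector<double>(static_cast<std::size_t>(query));
}
```

A caller would typically run the query once per problem size and reuse the buffer across calls.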
diff --git a/src/backend/opencl/magma/geqrf2.cpp b/src/backend/opencl/magma/geqrf2.cpp
index 1336b5dfa3..daba1f4328 100644
--- a/src/backend/opencl/magma/geqrf2.cpp
+++ b/src/backend/opencl/magma/geqrf2.cpp
@@ -8,7 +8,7 @@
  ********************************************************/
 
 /***********************************************************************
- * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Based on clMAGMA library http://icl.cs.utk.edu/magma/
  * Below is the original copyright.
  *
  *   -- MAGMA (version 0.1) --
@@ -51,75 +51,71 @@
  *
  **********************************************************************/
 
+#include "../platform.hpp"
 #include "magma.h"
-#include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
-#include "../platform.hpp"
 
 #include <algorithm>
 
 template<typename Ty>
-void panel_to_q(magma_uplo_t uplo, magma_int_t ib, Ty *A, magma_int_t lda, Ty *work)
-{
+void panel_to_q(magma_uplo_t uplo, magma_int_t ib, Ty *A, magma_int_t lda,
+                Ty *work) {
     magma_int_t i, j, k = 0;
     Ty *col;
     static const Ty c_zero = magma_zero<Ty>();
     static const Ty c_one  = magma_one<Ty>();
 
     if (uplo == MagmaUpper) {
-        for(i = 0; i < ib; ++i) {
-            col = A + i*lda;
-            for(j = 0; j < i; ++j) {
+        for (i = 0; i < ib; ++i) {
+            col = A + i * lda;
+            for (j = 0; j < i; ++j) {
                 work[k] = col[j];
-                col [j] = c_zero;
+                col[j]  = c_zero;
                 ++k;
             }
 
             work[k] = col[i];
-            col [j] = c_one;
+            col[j]  = c_one;
             ++k;
         }
-    }
-    else {
-        for(i=0; i<ib; ++i) {
-            col = A + i*lda;
+    } else {
+        for (i = 0; i < ib; ++i) {
+            col     = A + i * lda;
             work[k] = col[i];
-            col [i] = c_one;
+            col[i]  = c_one;
             ++k;
-            for(j=i+1; j<ib; ++j) {
+            for (j = i + 1; j < ib; ++j) {
                 work[k] = col[j];
-                col [j] = c_zero;
+                col[j]  = c_zero;
                 ++k;
             }
         }
     }
 }
 
-
 // -------------------------
 // Restores a panel, after call to panel_to_q.
 template<typename Ty>
-void q_to_panel(magma_uplo_t uplo, magma_int_t ib, Ty *A, magma_int_t lda, Ty *work)
-{
+void q_to_panel(magma_uplo_t uplo, magma_int_t ib, Ty *A, magma_int_t lda,
+                Ty *work) {
     magma_int_t i, j, k = 0;
     Ty *col;
 
     if (uplo == MagmaUpper) {
-        for(i = 0; i < ib; ++i) {
-            col = A + i*lda;
-            for(j = 0; j <= i; ++j) {
+        for (i = 0; i < ib; ++i) {
+            col = A + i * lda;
+            for (j = 0; j <= i; ++j) {
                 col[j] = work[k];
                 ++k;
             }
         }
-    }
-    else {
-        for(i = 0; i < ib; ++i) {
-            col = A + i*lda;
-            for(j = i; j < ib; ++j) {
+    } else {
+        for (i = 0; i < ib; ++i) {
+            col = A + i * lda;
+            for (j = i; j < ib; ++j) {
                 col[j] = work[k];
                 ++k;
             }
@@ -127,77 +123,74 @@ void q_to_panel(magma_uplo_t uplo, magma_int_t ib, Ty *A, magma_int_t lda, Ty *w
     }
 }
 
-template<typename Ty> magma_int_t
-magma_geqrf2_gpu(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    magma_queue_t* queue,
-    magma_int_t *info)
-{
-/*  -- clMAGMA (version 0.1) --
-       Univ. of Tennessee, Knoxville
-       Univ. of California, Berkeley
-       Univ. of Colorado, Denver
-       @date
-
-    Purpose
-    =======
-    ZGEQRF computes a QR factorization of a complex M-by-N matrix A:
-    A = Q * R.
-
-    Arguments
-    =========
-    M       (input) INTEGER
-            The number of rows of the matrix A.  M >= 0.
-
-    N       (input) INTEGER
-            The number of columns of the matrix A.  N >= 0.
-
-    dA      (input/output) COMPLEX_16 array on the GPU, dimension (LDDA,N)
-            On entry, the M-by-N matrix A.
-            On exit, the elements on and above the diagonal of the array
-            contain the min(M,N)-by-N upper trapezoidal matrix R (R is
-            upper triangular if m >= n); the elements below the diagonal,
-            with the array TAU, represent the orthogonal matrix Q as a
-            product of min(m,n) elementary reflectors (see Further
-            Details).
-
-    LDDA    (input) INTEGER
-            The leading dimension of the array dA.  LDDA >= max(1,M).
-            To benefit from coalescent memory accesses LDDA must be
-            divisible by 16.
-
-    TAU     (output) COMPLEX_16 array, dimension (min(M,N))
-            The scalar factors of the elementary reflectors (see Further
-            Details).
-
-    INFO    (output) INTEGER
-            = 0:  successful exit
-            < 0:  if INFO = -i, the i-th argument had an illegal value
-                  or another error occured, such as memory allocation failed.
-
-    Further Details
-    ===============
-    The matrix Q is represented as a product of elementary reflectors
-
-        Q = H(1) H(2) . . . H(k), where k = min(m,n).
-
-    Each H(i) has the form
-
-        H(i) = I - tau * v * v'
-
-    where tau is a complex scalar, and v is a complex vector with
-    v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(i+1:m,i),
-    and tau in TAU(i).
-    =====================================================================    */
-
-    #define dA(a_1,a_2)    dA, (dA_offset + (a_1) + (a_2)*(ldda))
-    #define work(a_1)      ( work + (a_1))
-    #define hwork          ( work + (nb)*(m))
+template<typename Ty>
+magma_int_t magma_geqrf2_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                             size_t dA_offset, magma_int_t ldda, Ty *tau,
+                             magma_queue_t *queue, magma_int_t *info) {
+    /*  -- clMAGMA (version 0.1) --
+           Univ. of Tennessee, Knoxville
+           Univ. of California, Berkeley
+           Univ. of Colorado, Denver
+           @date
+
+        Purpose
+        =======
+        ZGEQRF computes a QR factorization of a complex M-by-N matrix A:
+        A = Q * R.
+
+        Arguments
+        =========
+        M       (input) INTEGER
+                The number of rows of the matrix A.  M >= 0.
+
+        N       (input) INTEGER
+                The number of columns of the matrix A.  N >= 0.
+
+        dA      (input/output) COMPLEX_16 array on the GPU, dimension (LDDA,N)
+                On entry, the M-by-N matrix A.
+                On exit, the elements on and above the diagonal of the array
+                contain the min(M,N)-by-N upper trapezoidal matrix R (R is
+                upper triangular if m >= n); the elements below the diagonal,
+                with the array TAU, represent the orthogonal matrix Q as a
+                product of min(m,n) elementary reflectors (see Further
+                Details).
+
+        LDDA    (input) INTEGER
+                The leading dimension of the array dA.  LDDA >= max(1,M).
+                To benefit from coalesced memory accesses, LDDA must be
+                divisible by 16.
+
+        TAU     (output) COMPLEX_16 array, dimension (min(M,N))
+                The scalar factors of the elementary reflectors (see Further
+                Details).
+
+        INFO    (output) INTEGER
+                = 0:  successful exit
+                < 0:  if INFO = -i, the i-th argument had an illegal value
+                      or another error occurred, such as a failed memory
+                      allocation.
+
+        Further Details
+        ===============
+        The matrix Q is represented as a product of elementary reflectors
+
+            Q = H(1) H(2) . . . H(k), where k = min(m,n).
+
+        Each H(i) has the form
+
+            H(i) = I - tau * v * v'
+
+        where tau is a complex scalar, and v is a complex vector with
+        v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(i+1:m,i),
+        and tau in TAU(i).
+        ===================================================================== */
+
+#define dA(a_1, a_2) dA, (dA_offset + (a_1) + (a_2) * (ldda))
+#define work(a_1) (work + (a_1))
+#define hwork (work + (nb) * (m))
 
     cl_mem dwork;
-    Ty  *work;
+    Ty *work;
 
     magma_int_t i, k, ldwork, lddwork, old_i, old_ib, rows;
     magma_int_t nbmin, nx, ib, nb;
@@ -208,24 +201,23 @@ magma_geqrf2_gpu(
         *info = -1;
     } else if (n < 0) {
         *info = -2;
-    } else if (ldda < std::max(1,m)) {
+    } else if (ldda < std::max(1, m)) {
         *info = -4;
     }
     if (*info != 0) {
-        //magma_xerbla( __func__, -(*info) );
+        // magma_xerbla( __func__, -(*info) );
         return *info;
     }
 
-    k = std::min(m,n);
-    if (k == 0)
-        return *info;
+    k = std::min(m, n);
+    if (k == 0) { return *info; }
 
     nb = magma_get_geqrf_nb<Ty>(m);
 
-    lwork  = (m+n) * nb;
+    lwork  = (m + n) * nb;
     lhwork = lwork - (m)*nb;
 
-    if ( MAGMA_SUCCESS != magma_malloc<Ty>( &dwork, n*nb )) {
+    if (MAGMA_SUCCESS != magma_malloc<Ty>(&dwork, n * nb)) {
         *info = MAGMA_ERR_DEVICE_ALLOC;
         return *info;
     }
@@ -238,78 +230,87 @@ magma_geqrf2_gpu(
     }
     */
 
-    cl_mem buffer = clCreateBuffer(opencl::getContext()(), CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
-                                   sizeof(Ty)*lwork, NULL, NULL);
-    work = (Ty*)clEnqueueMapBuffer(queue[0], buffer, CL_TRUE,
-                                   CL_MAP_READ | CL_MAP_WRITE,
-                                   0, lwork*sizeof(Ty),
-                                   0, NULL, NULL, NULL);
+    cl_mem buffer = clCreateBuffer(arrayfire::opencl::getContext()(),
+                                   CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
+                                   sizeof(Ty) * lwork, NULL, NULL);
+    work          = (Ty *)clEnqueueMapBuffer(queue[0], buffer, CL_TRUE,
+                                             CL_MAP_READ | CL_MAP_WRITE, 0,
+                                             lwork * sizeof(Ty), 0, NULL, NULL, NULL);
 
-    geqrf_work_func<Ty> cpu_geqrf;
-    larft_func<Ty> cpu_larft;
+    cpu_lapack_geqrf_work_func<Ty> cpu_lapack_geqrf;
+    cpu_lapack_larft_func<Ty> cpu_lapack_larft;
 
-    nbmin = 2;
-    nx    = nb;
-    ldwork = m;
-    lddwork= n;
+    nbmin   = 2;
+    nx      = nb;
+    ldwork  = m;
+    lddwork = n;
 
     if (nb >= nbmin && nb < k && nx < k) {
         /* Use blocked code initially */
-        old_i = 0; old_ib = nb;
-        for (i = 0; i < k-nx; i += nb) {
-            ib = std::min(k-i, nb);
-            rows = m -i;
+        old_i  = 0;
+        old_ib = nb;
+        for (i = 0; i < k - nx; i += nb) {
+            ib   = std::min(k - i, nb);
+            rows = m - i;
 
-            magma_queue_sync( queue[1] );
-            magma_getmatrix_async<Ty>(rows, ib, dA(i, i), ldda, work(i), ldwork, queue[0], NULL);
+            magma_queue_sync(queue[1]);
+            magma_getmatrix_async<Ty>(rows, ib, dA(i, i), ldda, work(i), ldwork,
+                                      queue[0], NULL);
 
             if (i > 0) {
                 /* Apply H' to A(i:m,i+2*ib:n) from the left */
-                magma_larfb_gpu<Ty>( MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                     m-old_i, n-old_i-2*old_ib, old_ib,
-                                     dA(old_i, old_i         ), ldda, dwork,0,      lddwork,
-                                     dA(old_i, old_i+2*old_ib), ldda, dwork,old_ib, lddwork, queue[1]);
-
-                magma_setmatrix_async<Ty>( old_ib, old_ib, work(old_i), ldwork,
-                                           dA(old_i, old_i), ldda, queue[1], NULL);
+                magma_larfb_gpu<Ty>(
+                    MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
+                    m - old_i, n - old_i - 2 * old_ib, old_ib, dA(old_i, old_i),
+                    ldda, dwork, 0, lddwork, dA(old_i, old_i + 2 * old_ib),
+                    ldda, dwork, old_ib, lddwork, queue[1]);
+
+                magma_setmatrix_async<Ty>(old_ib, old_ib, work(old_i), ldwork,
+                                          dA(old_i, old_i), ldda, queue[1],
+                                          NULL);
             }
 
             magma_queue_sync(queue[0]);
-            *info = cpu_geqrf(LAPACK_COL_MAJOR, rows, ib, work(i), ldwork, tau+i, hwork, lhwork);
+            LAPACKE_CHECK(cpu_lapack_geqrf(rows, ib, work(i), ldwork, tau + i,
+                                           hwork, lhwork));
 
             /* Form the triangular factor of the block reflector
                H = H(i) H(i+1) . . . H(i+ib-1) */
-            cpu_larft(LAPACK_COL_MAJOR,
-                      *MagmaForwardStr, *MagmaColumnwiseStr,
-                      rows, ib,
-                      work(i), ldwork, tau+i, hwork, ib);
+            LAPACKE_CHECK(
+                cpu_lapack_larft(*MagmaForwardStr, *MagmaColumnwiseStr, rows,
+                                 ib, work(i), ldwork, tau + i, hwork, ib));
 
-            panel_to_q<Ty>( MagmaUpper, ib, work(i), ldwork, hwork+ib*ib );
+            panel_to_q<Ty>(MagmaUpper, ib, work(i), ldwork, hwork + ib * ib);
 
             /* download the i-th V matrix */
-            magma_setmatrix_async<Ty>(rows, ib, work(i), ldwork, dA(i,i), ldda, queue[0], NULL);
+            magma_setmatrix_async<Ty>(rows, ib, work(i), ldwork, dA(i, i), ldda,
+                                      queue[0], NULL);
 
             /* download the T matrix */
-            magma_queue_sync( queue[1] );
-            magma_setmatrix_async<Ty>( ib, ib, hwork, ib, dwork, 0, lddwork, queue[0], NULL);
-            magma_queue_sync( queue[0] );
+            magma_queue_sync(queue[1]);
+            magma_setmatrix_async<Ty>(ib, ib, hwork, ib, dwork, 0, lddwork,
+                                      queue[0], NULL);
+            magma_queue_sync(queue[0]);
 
             if (i + ib < n) {
-                if (i+nb < k-nx) {
+                if (i + nb < k - nx) {
                     /* Apply H' to A(i:m,i+ib:i+2*ib) from the left */
-                    magma_larfb_gpu<Ty>( MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                         rows, ib, ib,
-                                         dA(i, i   ), ldda, dwork,0,  lddwork,
-                                         dA(i, i+ib), ldda, dwork,ib, lddwork, queue[1]);
-                    q_to_panel<Ty>( MagmaUpper, ib, work(i), ldwork, hwork+ib*ib );
-                }
-                else {
-                    magma_larfb_gpu<Ty>( MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                         rows, n-i-ib, ib,
-                                         dA(i, i   ), ldda, dwork,0,  lddwork,
-                                         dA(i, i+ib), ldda, dwork,ib, lddwork, queue[1]);
-                    q_to_panel<Ty>( MagmaUpper, ib, work(i), ldwork, hwork+ib*ib );
-                    magma_setmatrix_async<Ty>(ib, ib, work(i), ldwork, dA(i,i), ldda, queue[1], NULL);
+                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward,
+                                        MagmaColumnwise, rows, ib, ib, dA(i, i),
+                                        ldda, dwork, 0, lddwork, dA(i, i + ib),
+                                        ldda, dwork, ib, lddwork, queue[1]);
+                    q_to_panel<Ty>(MagmaUpper, ib, work(i), ldwork,
+                                   hwork + ib * ib);
+                } else {
+                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward,
+                                        MagmaColumnwise, rows, n - i - ib, ib,
+                                        dA(i, i), ldda, dwork, 0, lddwork,
+                                        dA(i, i + ib), ldda, dwork, ib, lddwork,
+                                        queue[1]);
+                    q_to_panel<Ty>(MagmaUpper, ib, work(i), ldwork,
+                                   hwork + ib * ib);
+                    magma_setmatrix_async<Ty>(ib, ib, work(i), ldwork, dA(i, i),
+                                              ldda, queue[1], NULL);
                 }
                 old_i  = i;
                 old_ib = ib;
@@ -323,15 +324,18 @@ magma_geqrf2_gpu(
 
     /* Use unblocked code to factor the last or only block. */
     if (i < k) {
-        ib   = n-i;
-        rows = m-i;
-        magma_getmatrix_async<Ty>(rows, ib, dA(i, i), ldda, work, rows, queue[1], NULL);
+        ib   = n - i;
+        rows = m - i;
+        magma_getmatrix_async<Ty>(rows, ib, dA(i, i), ldda, work, rows,
+                                  queue[1], NULL);
         magma_queue_sync(queue[1]);
 
-        lhwork = lwork - rows*ib;
-        *info = cpu_geqrf(LAPACK_COL_MAJOR, rows, ib, work, rows, tau+i, work+ib*rows, lhwork);
+        lhwork = lwork - rows * ib;
+        LAPACKE_CHECK(cpu_lapack_geqrf(rows, ib, work, rows, tau + i,
+                                       work + ib * rows, lhwork));
 
-        magma_setmatrix_async<Ty>(rows, ib, work, rows, dA(i, i), ldda, queue[1], NULL);
+        magma_setmatrix_async<Ty>(rows, ib, work, rows, dA(i, i), ldda,
+                                  queue[1], NULL);
     }
 
     magma_queue_sync(queue[0]);
@@ -344,14 +348,11 @@ magma_geqrf2_gpu(
     return *info;
 } /* magma_zgeqrf2_gpu */
 
-#define INSTANTIATE(Ty)                                 \
-    template magma_int_t                                \
-    magma_geqrf2_gpu<Ty>(                               \
-        magma_int_t m, magma_int_t n,                   \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        Ty *tau,                                        \
-        magma_queue_t* queue,                           \
-        magma_int_t *info);                             \
+#define INSTANTIATE(Ty)                                            \
+    template magma_int_t magma_geqrf2_gpu<Ty>(                     \
+        magma_int_t m, magma_int_t n, cl_mem dA, size_t dA_offset, \
+        magma_int_t ldda, Ty * tau, magma_queue_t * queue,         \
+        magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
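Reviewer note on `panel_to_q` / `q_to_panel`: in the `MagmaUpper` case the pair saves the upper triangle (diagonal included) of an `ib`-by-`ib` block into `work`, overwrites it with zeros above a unit diagonal so the block can be used as V, and later restores it. The sketch below is a simplified, `double`-only restatement of that branch for illustration. The function names `save_upper_and_make_unit` and `restore_upper` are hypothetical and are not part of the patch.

```cpp
#include <vector>

// Save the upper triangle (with diagonal) of a column-major ib x ib block
// into work, then overwrite it with zeros above a unit diagonal.
void save_upper_and_make_unit(int ib, double *A, int lda, double *work) {
    int k = 0;
    for (int i = 0; i < ib; ++i) {
        double *col = A + i * lda;
        for (int j = 0; j < i; ++j) {
            work[k++] = col[j];
            col[j]    = 0.0;
        }
        work[k++] = col[i];
        col[i]    = 1.0;
    }
}

// Restore the upper triangle (with diagonal) saved by the function above.
void restore_upper(int ib, double *A, int lda, const double *work) {
    int k = 0;
    for (int i = 0; i < ib; ++i) {
        double *col = A + i * lda;
        for (int j = 0; j <= i; ++j) { col[j] = work[k++]; }
    }
}
```

The scratch buffer needs `ib * (ib + 1) / 2` entries, which is why the patch can place it at `hwork + ib * ib`, directly after the `ib`-by-`ib` T factor.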
diff --git a/src/backend/opencl/magma/geqrf3.cpp b/src/backend/opencl/magma/geqrf3.cpp
index ce7a1c9c22..ced1e01f4a 100644
--- a/src/backend/opencl/magma/geqrf3.cpp
+++ b/src/backend/opencl/magma/geqrf3.cpp
@@ -52,9 +52,8 @@
  **********************************************************************/
 
 #include "magma.h"
-#include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
@@ -71,17 +70,16 @@
  */
 
 template<typename Ty>
-void split_diag_block(magma_int_t ib, Ty *a, magma_int_t lda, Ty *work)
-{
+void split_diag_block(magma_int_t ib, Ty *a, magma_int_t lda, Ty *work) {
     magma_int_t i, j;
     Ty *cola, *colw;
     static const Ty c_zero = magma_zero<Ty>();
     static const Ty c_one  = magma_one<Ty>();
 
-    for(i=0; i<ib; i++){
-        cola = a    + i*lda;
-        colw = work + i*ib;
-        for(j=0; j<i; j++){
+    for (i = 0; i < ib; i++) {
+        cola = a + i * lda;
+        colw = work + i * ib;
+        for (j = 0; j < i; j++) {
             colw[j] = cola[j];
             cola[j] = c_zero;
         }
@@ -90,91 +88,90 @@ void split_diag_block(magma_int_t ib, Ty *a, magma_int_t lda, Ty *work)
     }
 }
 
-template<typename Ty> magma_int_t
-magma_geqrf3_gpu(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA, size_t dA_offset,  magma_int_t ldda,
-    Ty *tau, cl_mem dT, size_t dT_offset,
-    magma_queue_t queue,
-    magma_int_t *info)
-{
-/*  -- clMAGMA (version 0.1) --
-       Univ. of Tennessee, Knoxville
-       Univ. of California, Berkeley
-       Univ. of Colorado, Denver
-       @date
-
-    Purpose
-    =======
-    ZGEQRF computes a QR factorization of a complex M-by-N matrix A:
-    A = Q * R.
-
-    This version stores the triangular dT matrices used in
-    the block QR factorization so that they can be applied directly (i.e.,
-    without being recomputed) later. As a result, the application
-    of Q is much faster. Also, the upper triangular matrices for V have 0s
-    in them. The corresponding parts of the upper triangular R are inverted
-    and stored separately in dT.
-
-    Arguments
-    =========
-    M       (input) INTEGER
-            The number of rows of the matrix A.  M >= 0.
-
-    N       (input) INTEGER
-            The number of columns of the matrix A.  N >= 0.
-
-    dA      (input/output) COMPLEX_16 array on the GPU, dimension (LDDA,N)
-            On entry, the M-by-N matrix A.
-            On exit, the elements on and above the diagonal of the array
-            contain the min(M,N)-by-N upper trapezoidal matrix R (R is
-            upper triangular if m >= n); the elements below the diagonal,
-            with the array TAU, represent the orthogonal matrix Q as a
-            product of min(m,n) elementary reflectors (see Further
-            Details).
-
-    LDDA     (input) INTEGER
-            The leading dimension of the array dA.  LDDA >= max(1,M).
-            To benefit from coalescent memory accesses LDDA must be
-            divisible by 16.
-
-    TAU     (output) COMPLEX_16 array, dimension (min(M,N))
-            The scalar factors of the elementary reflectors (see Further
-            Details).
-
-    dT      (workspace/output)  COMPLEX_16 array on the GPU,
-            dimension (2*MIN(M, N) + (N+31)/32*32 )*NB,
-            where NB can be obtained through magma_get_zgeqrf_nb(M).
-            It starts with MIN(M,N)*NB block that store the triangular T
-            matrices, followed by the MIN(M,N)*NB block of the diagonal
-            inverses for the R matrix. The rest of the array is used as workspace.
-
-    INFO    (output) INTEGER
-            = 0:  successful exit
-            < 0:  if INFO = -i, the i-th argument had an illegal value
-                  or another error occured, such as memory allocation failed.
-
-    Further Details
-    ===============
-    The matrix Q is represented as a product of elementary reflectors
-
-       Q = H(1) H(2) . . . H(k), where k = min(m,n).
-
-    Each H(i) has the form
-
-       H(i) = I - tau * v * v'
-
-    where tau is a complex scalar, and v is a complex vector with
-    v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(i+1:m,i),
-    and tau in TAU(i).
-    =====================================================================    */
-
-    #define a_ref(a_1,a_2) dA, (dA_offset + (a_1) + (a_2)*(ldda))
-    #define t_ref(a_1)     dT, (dT_offset + (a_1)*nb)
-    #define d_ref(a_1)     dT, (dT_offset + (minmn + (a_1))*nb)
-    #define dd_ref(a_1)    dT, (dT_offset + (2*minmn+(a_1))*nb)
-    #define work_ref(a_1)  ( work + (a_1))
-    #define hwork          ( work + (nb)*(m))
+template<typename Ty>
+magma_int_t magma_geqrf3_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                             size_t dA_offset, magma_int_t ldda, Ty *tau,
+                             cl_mem dT, size_t dT_offset, magma_queue_t queue,
+                             magma_int_t *info) {
+    /*  -- clMAGMA (version 0.1) --
+           Univ. of Tennessee, Knoxville
+           Univ. of California, Berkeley
+           Univ. of Colorado, Denver
+           @date
+
+        Purpose
+        =======
+        ZGEQRF computes a QR factorization of a complex M-by-N matrix A:
+        A = Q * R.
+
+        This version stores the triangular dT matrices used in
+        the block QR factorization so that they can be applied directly (i.e.,
+        without being recomputed) later. As a result, the application
+        of Q is much faster. Also, the upper triangular matrices for V have 0s
+        in them. The corresponding parts of the upper triangular R are inverted
+        and stored separately in dT.
+
+        Arguments
+        =========
+        M       (input) INTEGER
+                The number of rows of the matrix A.  M >= 0.
+
+        N       (input) INTEGER
+                The number of columns of the matrix A.  N >= 0.
+
+        dA      (input/output) COMPLEX_16 array on the GPU, dimension (LDDA,N)
+                On entry, the M-by-N matrix A.
+                On exit, the elements on and above the diagonal of the array
+                contain the min(M,N)-by-N upper trapezoidal matrix R (R is
+                upper triangular if m >= n); the elements below the diagonal,
+                with the array TAU, represent the orthogonal matrix Q as a
+                product of min(m,n) elementary reflectors (see Further
+                Details).
+
+        LDDA     (input) INTEGER
+                The leading dimension of the array dA.  LDDA >= max(1,M).
+                To benefit from coalesced memory accesses LDDA must be
+                divisible by 16.
+
+        TAU     (output) COMPLEX_16 array, dimension (min(M,N))
+                The scalar factors of the elementary reflectors (see Further
+                Details).
+
+        dT      (workspace/output)  COMPLEX_16 array on the GPU,
+                dimension (2*MIN(M, N) + (N+31)/32*32 )*NB,
+                where NB can be obtained through magma_get_zgeqrf_nb(M).
+                It starts with a MIN(M,N)*NB block that stores the triangular T
+                matrices, followed by the MIN(M,N)*NB block of the diagonal
+                inverses for the R matrix. The rest of the array is used as
+                workspace.
+
+        INFO    (output) INTEGER
+                = 0:  successful exit
+                < 0:  if INFO = -i, the i-th argument had an illegal value
+                      or another error occurred, such as a failed memory
+                      allocation.
+
+        Further Details
+        ===============
+        The matrix Q is represented as a product of elementary reflectors
+
+           Q = H(1) H(2) . . . H(k), where k = min(m,n).
+
+        Each H(i) has the form
+
+           H(i) = I - tau * v * v'
+
+        where tau is a complex scalar, and v is a complex vector with
+        v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in A(i+1:m,i),
+        and tau in TAU(i).
+        ===================================================================== */
+
+#define a_ref(a_1, a_2) dA, (dA_offset + (a_1) + (a_2) * (ldda))
+#define t_ref(a_1) dT, (dT_offset + (a_1)*nb)
+#define d_ref(a_1) dT, (dT_offset + (minmn + (a_1)) * nb)
+#define dd_ref(a_1) dT, (dT_offset + (2 * minmn + (a_1)) * nb)
+#define work_ref(a_1) (work + (a_1))
+#define hwork (work + (nb) * (m))
 
     magma_int_t i, k, minmn, old_i, old_ib, rows, cols;
     magma_int_t ib, nb;
@@ -187,99 +184,100 @@ magma_geqrf3_gpu(
         *info = -1;
     } else if (n < 0) {
         *info = -2;
-    } else if (ldda < std::max(1,m)) {
+    } else if (ldda < std::max(1, m)) {
         *info = -4;
     }
     if (*info != 0) {
-        //magma_xerbla( __func__, -(*info) );
+        // magma_xerbla( __func__, -(*info) );
         return *info;
     }
 
-    k = minmn = std::min(m,n);
-    if (k == 0)
-        return *info;
+    k = minmn = std::min(m, n);
+    if (k == 0) { return *info; }
 
     nb = magma_get_geqrf_nb<Ty>(m);
 
-    lwork  = (m + n + nb)*nb;
-    lhwork = lwork - m*nb;
+    lwork  = (m + n + nb) * nb;
+    lhwork = lwork - m * nb;
 
-    if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>( &work, lwork )) {
+    if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>(&work, lwork)) {
         *info = MAGMA_ERR_HOST_ALLOC;
         return *info;
     }
 
-    ut = hwork+nb*(n);
-    memset(ut, 0, nb*nb*sizeof(Ty));
+    ut = hwork + nb * (n);
+    memset(ut, 0, nb * nb * sizeof(Ty));
 
     magma_event_t event[2] = {NULL, NULL};
 
-    ldwork = m;
-    lddwork= n;
+    ldwork  = m;
+    lddwork = n;
 
-    geqrf_work_func<Ty> cpu_geqrf;
-    larft_func<Ty> cpu_larft;
+    cpu_lapack_geqrf_work_func<Ty> cpu_lapack_geqrf;
+    cpu_lapack_larft_func<Ty> cpu_lapack_larft;
 
-    if ( (nb > 1) && (nb < k) ) {
+    if ((nb > 1) && (nb < k)) {
         /* Use blocked code initially */
-        old_i = 0; old_ib = nb;
-        for (i = 0; i < k-nb; i += nb) {
-            ib = std::min(k-i, nb);
-            rows = m -i;
-            magma_getmatrix_async<Ty>(rows, ib,
-                                     a_ref(i,i),  ldda,
-                                     work_ref(i), ldwork, queue, &event[1]);
-            if (i>0){
+        old_i  = 0;
+        old_ib = nb;
+        for (i = 0; i < k - nb; i += nb) {
+            ib   = std::min(k - i, nb);
+            rows = m - i;
+            magma_getmatrix_async<Ty>(rows, ib, a_ref(i, i), ldda, work_ref(i),
+                                      ldwork, queue, &event[1]);
+            if (i > 0) {
                 /* Apply H' to A(i:m,i+2*ib:n) from the left */
-                cols = n-old_i-2*old_ib;
-                magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                   m-old_i, cols, old_ib,
-                                   a_ref(old_i, old_i         ), ldda, t_ref(old_i), nb,
-                                   a_ref(old_i, old_i+2*old_ib), ldda, dd_ref(0),    lddwork, queue);
+                cols = n - old_i - 2 * old_ib;
+                magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward,
+                                    MagmaColumnwise, m - old_i, cols, old_ib,
+                                    a_ref(old_i, old_i), ldda, t_ref(old_i), nb,
+                                    a_ref(old_i, old_i + 2 * old_ib), ldda,
+                                    dd_ref(0), lddwork, queue);
 
                 /* store the diagonal */
-                magma_setmatrix_async<Ty>(old_ib, old_ib,
-                                         ut, old_ib,
-                                         d_ref(old_i), old_ib, queue, &event[0]);
+                magma_setmatrix_async<Ty>(old_ib, old_ib, ut, old_ib,
+                                          d_ref(old_i), old_ib, queue,
+                                          &event[0]);
             }
 
             magma_event_sync(event[1]);
-            *info = cpu_geqrf(LAPACK_COL_MAJOR, rows, ib, work_ref(i), ldwork, tau+i, hwork, lhwork);
+            LAPACKE_CHECK(cpu_lapack_geqrf(rows, ib, work_ref(i), ldwork,
+                                           tau + i, hwork, lhwork));
 
             /* Form the triangular factor of the block reflector
                H = H(i) H(i+1) . . . H(i+ib-1) */
-            cpu_larft(LAPACK_COL_MAJOR,
-                      *MagmaForwardStr, *MagmaColumnwiseStr,
-                      rows, ib,
-                      work_ref(i), ldwork,
-                      tau+i, hwork, ib);
+            LAPACKE_CHECK(
+                cpu_lapack_larft(*MagmaForwardStr, *MagmaColumnwiseStr, rows,
+                                 ib, work_ref(i), ldwork, tau + i, hwork, ib));
 
             /* Put 0s in the upper triangular part of a panel (and 1s on the
                diagonal); copy the upper triangular in ut and invert it. */
-            if (i > 0) magma_event_sync(event[0]);
-            //Change me
+            if (i > 0) { magma_event_sync(event[0]); }
+            // Change me
             split_diag_block<Ty>(ib, work_ref(i), ldwork, ut);
-            magma_setmatrix<Ty>(rows, ib, work_ref(i), ldwork, a_ref(i,i), ldda, queue);
+            magma_setmatrix<Ty>(rows, ib, work_ref(i), ldwork, a_ref(i, i),
+                                ldda, queue);
 
             if (i + ib < n) {
                 /* Send the triangular factor T to the GPU */
                 magma_setmatrix<Ty>(ib, ib, hwork, ib, t_ref(i), nb, queue);
 
-                if (i+nb < k-nb){
+                if (i + nb < k - nb) {
                     /* Apply H' to A(i:m,i+ib:i+2*ib) from the left */
-                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                       rows, ib, ib,
-                                       a_ref(i, i   ), ldda, t_ref(i),  nb,
-                                       a_ref(i, i+ib), ldda, dd_ref(0), lddwork, queue);
-                }
-                else {
-                    cols = n-i-ib;
-                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward, MagmaColumnwise,
-                                       rows, cols, ib,
-                                       a_ref(i, i   ), ldda, t_ref(i),  nb,
-                                       a_ref(i, i+ib), ldda, dd_ref(0), lddwork, queue);
+                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward,
+                                        MagmaColumnwise, rows, ib, ib,
+                                        a_ref(i, i), ldda, t_ref(i), nb,
+                                        a_ref(i, i + ib), ldda, dd_ref(0),
+                                        lddwork, queue);
+                } else {
+                    cols = n - i - ib;
+                    magma_larfb_gpu<Ty>(MagmaLeft, MagmaConjTrans, MagmaForward,
+                                        MagmaColumnwise, rows, cols, ib,
+                                        a_ref(i, i), ldda, t_ref(i), nb,
+                                        a_ref(i, i + ib), ldda, dd_ref(0),
+                                        lddwork, queue);
                     /* Fix the diagonal block */
-                    magma_setmatrix<Ty>( ib, ib, ut, ib, d_ref(i), ib , queue);
+                    magma_setmatrix<Ty>(ib, ib, ut, ib, d_ref(i), ib, queue);
                 }
                 old_i  = i;
                 old_ib = ib;
@@ -291,17 +289,18 @@ magma_geqrf3_gpu(
 
     /* Use unblocked code to factor the last or only block. */
     if (i < k) {
-        ib   = n-i;
-        rows = m-i;
-        magma_getmatrix<Ty>( rows, ib, a_ref(i, i), ldda, work, rows, queue );
+        ib   = n - i;
+        rows = m - i;
+        magma_getmatrix<Ty>(rows, ib, a_ref(i, i), ldda, work, rows, queue);
 
-        lhwork = lwork - rows*ib;
-        *info = cpu_geqrf(LAPACK_COL_MAJOR, rows, ib, work, rows, tau+i, work+ib*rows, lhwork);
+        lhwork = lwork - rows * ib;
+        LAPACKE_CHECK(cpu_lapack_geqrf(rows, ib, work, rows, tau + i,
+                                       work + ib * rows, lhwork));
 
-        magma_setmatrix<Ty>( rows, ib, work, rows, a_ref(i, i), ldda, queue );
+        magma_setmatrix<Ty>(rows, ib, work, rows, a_ref(i, i), ldda, queue);
     }
 
-    magma_free_cpu( work );
+    magma_free_cpu(work);
     return *info;
 } /* magma_zgeqrf_gpu */
 
@@ -310,14 +309,11 @@ magma_geqrf3_gpu(
 #undef d_ref
 #undef work_ref
 
-#define INSTANTIATE(T)                                  \
-    template magma_int_t                                \
-    magma_geqrf3_gpu<T>(                                 \
-        magma_int_t m, magma_int_t n,                   \
-        cl_mem dA, size_t dA_offset,  magma_int_t ldda, \
-        T *tau, cl_mem dT, size_t dT_offset,            \
-        magma_queue_t queue,                            \
-        magma_int_t *info);                             \
+#define INSTANTIATE(T)                                             \
+    template magma_int_t magma_geqrf3_gpu<T>(                      \
+        magma_int_t m, magma_int_t n, cl_mem dA, size_t dA_offset, \
+        magma_int_t ldda, T * tau, cl_mem dT, size_t dT_offset,    \
+        magma_queue_t queue, magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/getrf.cpp b/src/backend/opencl/magma/getrf.cpp
index a79bd7c17e..4fa3960791 100644
--- a/src/backend/opencl/magma/getrf.cpp
+++ b/src/backend/opencl/magma/getrf.cpp
@@ -31,95 +31,92 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
 
 #include "magma.h"
 #include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 
 #include <algorithm>
 
 template<typename Ty>
-magma_int_t magma_getrf_gpu(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    magma_int_t *ipiv,
-    magma_queue_t queue,
-    magma_int_t *info)
-{
-/*  -- clMAGMA (version 0.1) --
-    Univ. of Tennessee, Knoxville
-    Univ. of California, Berkeley
-    Univ. of Colorado, Denver
-    @date
-
-    Purpose
-    =======
-    GETRF computes an LU factorization of a general M-by-N matrix A
-    using partial pivoting with row interchanges.
-
-    The factorization has the form
-    A = P * L * U
-    where P is a permutation matrix, L is lower triangular with unit
-    diagonal elements (lower trapezoidal if m > n), and U is upper
-    triangular (upper trapezoidal if m < n).
-
-    This is the right-looking Level 3 BLAS version of the algorithm.
-
-    Arguments
-    =========
-    M       (input) INTEGER
-    The number of rows of the matrix A.  M >= 0.
-
-    N       (input) INTEGER
-    The number of columns of the matrix A.  N >= 0.
-
-    A       (input/output) an array on the GPU, dimension (LDDA,N).
-    On entry, the M-by-N matrix to be factored.
-    On exit, the factors L and U from the factorization
-    A = P*L*U; the unit diagonal elements of L are not stored.
-
-    LDDA     (input) INTEGER
-    The leading dimension of the array A.  LDDA >= max(1,M).
-
-    IPIV    (output) INTEGER array, dimension (min(M,N))
-    The pivot indices; for 1 <= i <= min(M,N), row i of the
-    matrix was interchanged with row IPIV(i).
-
-    INFO    (output) INTEGER
-    = 0:  successful exit
-    < 0:  if INFO = -i, the i-th argument had an illegal value
-    or another error occured, such as memory allocation failed.
-    > 0:  if INFO = i, U(i,i) is exactly zero. The factorization
-    has been completed, but the factor U is exactly
-    singular, and division by zero will occur if it is used
-    to solve a system of equations.
-    =====================================================================    */
-
-#define  dA(i_, j_) dA,   dA_offset  + (i_)*nb       + (j_)*nb*ldda
-#define dAT(i_, j_) dAT,  dAT_offset + (i_)*nb*lddat + (j_)*nb
-#define dAP(i_, j_) dAP,               (i_)          + (j_)*maxm
-#define work(i_)   (work + (i_))
+magma_int_t magma_getrf_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                            size_t dA_offset, magma_int_t ldda,
+                            magma_int_t *ipiv, magma_queue_t queue,
+                            magma_int_t *info) {
+    /*  -- clMAGMA (version 0.1) --
+        Univ. of Tennessee, Knoxville
+        Univ. of California, Berkeley
+        Univ. of Colorado, Denver
+        @date
+
+        Purpose
+        =======
+        GETRF computes an LU factorization of a general M-by-N matrix A
+        using partial pivoting with row interchanges.
+
+        The factorization has the form
+        A = P * L * U
+        where P is a permutation matrix, L is lower triangular with unit
+        diagonal elements (lower trapezoidal if m > n), and U is upper
+        triangular (upper trapezoidal if m < n).
+
+        This is the right-looking Level 3 BLAS version of the algorithm.
+
+        Arguments
+        =========
+        M       (input) INTEGER
+        The number of rows of the matrix A.  M >= 0.
+
+        N       (input) INTEGER
+        The number of columns of the matrix A.  N >= 0.
+
+        A       (input/output) an array on the GPU, dimension (LDDA,N).
+        On entry, the M-by-N matrix to be factored.
+        On exit, the factors L and U from the factorization
+        A = P*L*U; the unit diagonal elements of L are not stored.
+
+        LDDA     (input) INTEGER
+        The leading dimension of the array A.  LDDA >= max(1,M).
+
+        IPIV    (output) INTEGER array, dimension (min(M,N))
+        The pivot indices; for 1 <= i <= min(M,N), row i of the
+        matrix was interchanged with row IPIV(i).
+
+        INFO    (output) INTEGER
+        = 0:  successful exit
+        < 0:  if INFO = -i, the i-th argument had an illegal value
+        or another error occurred, such as a failed memory allocation.
+        > 0:  if INFO = i, U(i,i) is exactly zero. The factorization
+        has been completed, but the factor U is exactly
+        singular, and division by zero will occur if it is used
+        to solve a system of equations.
+        ===================================================================== */
+
+#define dA(i_, j_) dA, dA_offset + (i_)*nb + (j_)*nb *ldda
+#define dAT(i_, j_) dAT, dAT_offset + (i_)*nb *lddat + (j_)*nb
+#define dAP(i_, j_) dAP, (i_) + (j_)*maxm
+#define work(i_) (work + (i_))
 
     static const Ty c_one     = magma_one<Ty>();
     static const Ty c_neg_one = magma_neg_one<Ty>();
@@ -133,48 +130,47 @@ magma_int_t magma_getrf_gpu(
 
     /* Check arguments */
     *info = 0;
-    if (m < 0)
+    if (m < 0) {
         *info = -1;
-    else if (n < 0)
+    } else if (n < 0) {
         *info = -2;
-    else if (ldda < std::max(1,m))
+    } else if (ldda < std::max(1, m)) {
         *info = -4;
+    }
 
     if (*info != 0) {
-        //magma_xerbla(__func__, -(*info));
+        // magma_xerbla(__func__, -(*info));
         return *info;
     }
 
     /* Quick return if possible */
-    if (m == 0 || n == 0)
-        return *info;
+    if (m == 0 || n == 0) { return *info; }
 
-    gemm_func<Ty> gpu_gemm;
-    trsm_func<Ty> gpu_trsm;
-    getrf_func<Ty> cpu_getrf;
+    gpu_blas_gemm_func<Ty> gpu_blas_gemm;
+    gpu_blas_trsm_func<Ty> gpu_blas_trsm;
+    cpu_lapack_getrf_func<Ty> cpu_lapack_getrf;
 
     /* Function Body */
     mindim = std::min(m, n);
     nb     = magma_get_getrf_nb<Ty>(m);
     s      = mindim / nb;
 
-    if (nb <= 1 || nb >= std::min(m,n)) {
+    if (nb <= 1 || nb >= std::min(m, n)) {
         /* Use CPU code. */
-        if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>(&work, m*n)) {
+        if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>(&work, m * n)) {
             *info = MAGMA_ERR_HOST_ALLOC;
             return *info;
         }
-        magma_getmatrix<Ty>(m, n, dA(0,0), ldda, work(0), m, queue);
-        cpu_getrf(LAPACK_COL_MAJOR, m, n, work, m, ipiv);
-        magma_setmatrix<Ty>(m, n, work(0), m, dA(0,0), ldda, queue);
+        magma_getmatrix<Ty>(m, n, dA(0, 0), ldda, work(0), m, queue);
+        LAPACKE_CHECK(cpu_lapack_getrf(m, n, work, m, ipiv));
+        magma_setmatrix<Ty>(m, n, work(0), m, dA(0, 0), ldda, queue);
         magma_free_cpu(work);
-    }
-    else {
+    } else {
         /* Use hybrid blocked code. */
-        maxm = ((m + 31)/32)*32;
-        maxn = ((n + 31)/32)*32;
+        maxm = ((m + 31) / 32) * 32;
+        maxn = ((n + 31) / 32) * 32;
 
-        if (MAGMA_SUCCESS != magma_malloc<Ty>(&dAP, nb*maxm)) {
+        if (MAGMA_SUCCESS != magma_malloc<Ty>(&dAP, nb * maxm)) {
             *info = MAGMA_ERR_DEVICE_ALLOC;
             return *info;
         }
@@ -182,27 +178,26 @@ magma_int_t magma_getrf_gpu(
         // square matrices can be done in place;
         // rectangular requires copy to transpose
         if (m == n) {
-            dAT = dA;
+            dAT        = dA;
             dAT_offset = dA_offset;
-            lddat = ldda;
-            magmablas_transpose_inplace<Ty>(m, dAT(0,0), lddat, queue);
-        }
-        else {
-            lddat = maxn;  // N-by-M
+            lddat      = ldda;
+            magmablas_transpose_inplace<Ty>(m, dAT(0, 0), lddat, queue);
+        } else {
+            lddat      = maxn;  // N-by-M
             dAT_offset = 0;
-            if (MAGMA_SUCCESS != magma_malloc<Ty>(&dAT, lddat*maxm)) {
+            if (MAGMA_SUCCESS != magma_malloc<Ty>(&dAT, lddat * maxm)) {
                 magma_free(dAP);
                 *info = MAGMA_ERR_DEVICE_ALLOC;
                 return *info;
             }
-            magmablas_transpose<Ty>(m, n, dA(0,0), ldda, dAT(0,0), lddat, queue);
+            magmablas_transpose<Ty>(m, n, dA(0, 0), ldda, dAT(0, 0), lddat,
+                                    queue);
         }
 
         ldwork = maxm;
-        if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>(&work, ldwork*nb)) {
+        if (MAGMA_SUCCESS != magma_malloc_cpu<Ty>(&work, ldwork * nb)) {
             magma_free(dAP);
-            if (dA != dAT)
-                magma_free(dAT);
+            if (dA != dAT) { magma_free(dAT); }
 
             *info = MAGMA_ERR_HOST_ALLOC;
             return *info;
@@ -210,132 +205,120 @@ magma_int_t magma_getrf_gpu(
 
         cl_event event = 0;
 
-
-        for(j=0; j < s; j++) {
-
+        for (j = 0; j < s; j++) {
             // download j-th panel
-            magmablas_transpose<Ty>(nb, m-j*nb, dAT(j,j), lddat, dAP(0,0), maxm, queue);
+            magmablas_transpose<Ty>(nb, m - j * nb, dAT(j, j), lddat, dAP(0, 0),
+                                    maxm, queue);
 
-            magma_getmatrix<Ty>(m-j*nb, nb, dAP(0,0), maxm, work(0), ldwork, queue);
+            magma_getmatrix<Ty>(m - j * nb, nb, dAP(0, 0), maxm, work(0),
+                                ldwork, queue);
 
             if (j > 0 && n > (j + 1) * nb) {
-                gpu_trsm(clblasColumnMajor,
-                         clblasRight, clblasUpper, clblasNoTrans, clblasUnit,
-                         n - (j+1)*nb, nb,
-                         c_one,
-                         dAT(j-1,j-1), lddat,
-                         dAT(j-1,j+1), lddat,
-                         1, &queue, 0, nullptr, &event);
-
-                if (m > j * nb)  {
-                    gpu_gemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,
-                         n-(j+1)*nb, m-j*nb, nb,
-                         c_neg_one,
-                         dAT(j-1,j+1), lddat,
-                         dAT(j,  j-1), lddat,
-                         c_one,
-                         dAT(j,  j+1), lddat,
-                         1, &queue, 0, nullptr, &event);
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_RIGHT, OPENCL_BLAS_TRIANGLE_UPPER,
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_UNIT_DIAGONAL,
+                    n - (j + 1) * nb, nb, c_one, dAT(j - 1, j - 1), lddat,
+                    dAT(j - 1, j + 1), lddat, 1, &queue, 0, nullptr, &event));
+
+                if (m > j * nb) {
+                    OPENCL_BLAS_CHECK(gpu_blas_gemm(
+                        OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NO_TRANS,
+                        n - (j + 1) * nb, m - j * nb, nb, c_neg_one,
+                        dAT(j - 1, j + 1), lddat, dAT(j, j - 1), lddat, c_one,
+                        dAT(j, j + 1), lddat, 1, &queue, 0, nullptr, &event));
                 }
             }
 
             // do the cpu part
-            rows = m - j*nb;
-            cpu_getrf(LAPACK_COL_MAJOR, rows, nb, work, ldwork, ipiv+j*nb);
-            if (*info == 0 && iinfo > 0)
-                *info = iinfo + j*nb;
+            rows = m - j * nb;
+            LAPACKE_CHECK(
+                cpu_lapack_getrf(rows, nb, work, ldwork, ipiv + j * nb));
+            if (*info == 0 && iinfo > 0) { *info = iinfo + j * nb; }
 
-            for(i=j*nb; i < j*nb + nb; ++i) {
-                ipiv[i] += j*nb;
-            }
-            magmablas_laswp<Ty>(n, dAT(0,0), lddat, j*nb + 1, j*nb + nb, ipiv, 1, queue);
+            for (i = j * nb; i < j * nb + nb; ++i) { ipiv[i] += j * nb; }
+            magmablas_laswp<Ty>(n, dAT(0, 0), lddat, j * nb + 1, j * nb + nb,
+                                ipiv, 1, queue);
 
             // upload j-th panel
-            magma_setmatrix<Ty>(m-j*nb, nb, work(0), ldwork, dAP(0,0), maxm, queue);
+            magma_setmatrix<Ty>(m - j * nb, nb, work(0), ldwork, dAP(0, 0),
+                                maxm, queue);
 
-            magmablas_transpose<Ty>(m-j*nb, nb, dAP(0,0), maxm, dAT(j,j), lddat, queue);
+            magmablas_transpose<Ty>(m - j * nb, nb, dAP(0, 0), maxm, dAT(j, j),
+                                    lddat, queue);
 
             // do the small non-parallel computations (next panel update)
-            if (s > (j+1)) {
-                gpu_trsm(clblasColumnMajor,
-                         clblasRight, clblasUpper, clblasNoTrans, clblasUnit,
-                         nb, nb,
-                         c_one,
-                         dAT(j, j  ), lddat,
-                         dAT(j, j+1), lddat,
-                         1, &queue, 0, nullptr, &event);
-
-
-                gpu_gemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,
-                         nb, m-(j+1)*nb, nb,
-                         c_neg_one,
-                         dAT(j,   j+1), lddat,
-                         dAT(j+1, j  ), lddat,
-                         c_one,
-                         dAT(j+1, j+1), lddat,
-                         1, &queue, 0, nullptr, &event);
-            }
-            else {
+            if (s > (j + 1)) {
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_RIGHT, OPENCL_BLAS_TRIANGLE_UPPER,
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_UNIT_DIAGONAL, nb, nb,
+                    c_one, dAT(j, j), lddat, dAT(j, j + 1), lddat, 1, &queue, 0,
+                    nullptr, &event));
+
+                OPENCL_BLAS_CHECK(gpu_blas_gemm(
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NO_TRANS, nb,
+                    m - (j + 1) * nb, nb, c_neg_one, dAT(j, j + 1), lddat,
+                    dAT(j + 1, j), lddat, c_one, dAT(j + 1, j + 1), lddat, 1,
+                    &queue, 0, nullptr, &event));
+            } else {
                 if (n > s * nb) {
-                    gpu_trsm(clblasColumnMajor,
-                             clblasRight, clblasUpper, clblasNoTrans, clblasUnit,
-                             n-s*nb, nb,
-                             c_one,
-                             dAT(j, j  ), lddat,
-                             dAT(j, j+1), lddat,
-                             1, &queue, 0, nullptr, &event);
+                    OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                        OPENCL_BLAS_SIDE_RIGHT, OPENCL_BLAS_TRIANGLE_UPPER,
+                        OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_UNIT_DIAGONAL,
+                        n - s * nb, nb, c_one, dAT(j, j), lddat, dAT(j, j + 1),
+                        lddat, 1, &queue, 0, nullptr, &event));
                 }
 
-                if ((n > (j+1) * nb) && (m > (j+1) * nb)) {
-                    gpu_gemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,
-                             n-(j+1)*nb, m-(j+1)*nb, nb,
-                             c_neg_one,
-                             dAT(j,   j+1), lddat,
-                             dAT(j+1, j  ), lddat,
-                             c_one,
-                             dAT(j+1, j+1), lddat,
-                             1, &queue, 0, nullptr, &event);
+                if ((n > (j + 1) * nb) && (m > (j + 1) * nb)) {
+                    OPENCL_BLAS_CHECK(gpu_blas_gemm(
+                        OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NO_TRANS,
+                        n - (j + 1) * nb, m - (j + 1) * nb, nb, c_neg_one,
+                        dAT(j, j + 1), lddat, dAT(j + 1, j), lddat, c_one,
+                        dAT(j + 1, j + 1), lddat, 1, &queue, 0, nullptr,
+                        &event));
                 }
             }
         }
 
-        magma_int_t nb0 = std::min(m - s*nb, n - s*nb);
+        magma_int_t nb0 = std::min(m - s * nb, n - s * nb);
 
         if (nb0 > 0 && m > s * nb) {
-            rows = m - s*nb;
+            rows = m - s * nb;
 
-            magmablas_transpose<Ty>(nb0, rows, dAT(s,s), lddat, dAP(0,0), maxm, queue);
-            magma_getmatrix<Ty>(rows, nb0, dAP(0,0), maxm, work(0), ldwork, queue);
+            magmablas_transpose<Ty>(nb0, rows, dAT(s, s), lddat, dAP(0, 0),
+                                    maxm, queue);
+            magma_getmatrix<Ty>(rows, nb0, dAP(0, 0), maxm, work(0), ldwork,
+                                queue);
 
             // do the cpu part
-            cpu_getrf(LAPACK_COL_MAJOR, rows, nb0, work, ldwork, ipiv+s*nb);
-            if (*info == 0 && iinfo > 0)
-                *info = iinfo + s*nb;
+            LAPACKE_CHECK(
+                cpu_lapack_getrf(rows, nb0, work, ldwork, ipiv + s * nb));
+            if (*info == 0 && iinfo > 0) { *info = iinfo + s * nb; }
 
-            for(i=s*nb; i < s*nb + nb0; ++i) {
-                ipiv[i] += s*nb;
-            }
-            magmablas_laswp<Ty>(n, dAT(0,0), lddat, s*nb + 1, s*nb + nb0, ipiv, 1, queue);
+            for (i = s * nb; i < s * nb + nb0; ++i) { ipiv[i] += s * nb; }
+            magmablas_laswp<Ty>(n, dAT(0, 0), lddat, s * nb + 1, s * nb + nb0,
+                                ipiv, 1, queue);
 
             // upload j-th panel
-            magma_setmatrix<Ty>(rows, nb0, work(0), ldwork, dAP(0,0), maxm, queue);
-            magmablas_transpose<Ty>(rows, nb0, dAP(0,0), maxm, dAT(s,s), lddat, queue);
+            magma_setmatrix<Ty>(rows, nb0, work(0), ldwork, dAP(0, 0), maxm,
+                                queue);
+            magmablas_transpose<Ty>(rows, nb0, dAP(0, 0), maxm, dAT(s, s),
+                                    lddat, queue);
 
             if (n > s * nb + nb0) {
-                gpu_trsm(clblasColumnMajor,
-                         clblasRight, clblasUpper, clblasNoTrans, clblasUnit,
-                         n-s*nb-nb0, nb0,
-                         c_one, dAT(s,s),     lddat,
-                         dAT(s,s)+nb0, lddat, 1, &queue, 0, nullptr, &event);
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_RIGHT, OPENCL_BLAS_TRIANGLE_UPPER,
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_UNIT_DIAGONAL,
+                    n - s * nb - nb0, nb0, c_one, dAT(s, s), lddat,
+                    dAT(s, s) + nb0, lddat, 1, &queue, 0, nullptr, &event));
             }
         }
 
         // undo transpose
         if (dA == dAT) {
-            magmablas_transpose_inplace<Ty>(m, dAT(0,0), lddat, queue);
-        }
-        else {
-            magmablas_transpose<Ty>(n, m, dAT(0,0), lddat, dA(0,0), ldda, queue);
+            magmablas_transpose_inplace<Ty>(m, dAT(0, 0), lddat, queue);
+        } else {
+            magmablas_transpose<Ty>(n, m, dAT(0, 0), lddat, dA(0, 0), ldda,
+                                    queue);
             magma_free(dAT);
         }
 
@@ -348,13 +331,11 @@ magma_int_t magma_getrf_gpu(
 
 #undef dAT
 
-#define INSTANTIATE(T)                                  \
-    template magma_int_t magma_getrf_gpu<T>(            \
-        magma_int_t m, magma_int_t n,                   \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        magma_int_t *ipiv,                              \
-        magma_queue_t queue,                            \
-        magma_int_t *info);                             \
+#define INSTANTIATE(T)                                             \
+    template magma_int_t magma_getrf_gpu<T>(                       \
+        magma_int_t m, magma_int_t n, cl_mem dA, size_t dA_offset, \
+        magma_int_t ldda, magma_int_t * ipiv, magma_queue_t queue, \
+        magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/getrs.cpp b/src/backend/opencl/magma/getrs.cpp
index 3b83179dcc..d945fa9def 100644
--- a/src/backend/opencl/magma/getrs.cpp
+++ b/src/backend/opencl/magma/getrs.cpp
@@ -53,105 +53,99 @@
 
 #include "magma.h"
 #include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
 #include <platform.hpp>
+#include <af/opencl.h>
 #include <algorithm>
 #include <string>
 
-template<typename Ty>  magma_int_t
-magma_getrs_gpu(magma_trans_t trans, magma_int_t n, magma_int_t nrhs,
-                cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                magma_int_t *ipiv,
-                cl_mem dB, size_t dB_offset, magma_int_t lddb,
-                magma_queue_t queue,
-                magma_int_t *info)
-{
-/*  -- clMagma (version 0.1) --
-       Univ. of Tennessee, Knoxville
-       Univ. of California, Berkeley
-       Univ. of Colorado, Denver
-       @date
-
-    Purpose
-    =======
-    Solves a system of linear equations
-      A * X = B  or  A' * X = B
-    with a general N-by-N matrix A using the LU factorization computed by ZGETRF_GPU.
-
-    Arguments
-    =========
-    TRANS   (input) CHARACTER*1
-            Specifies the form of the system of equations:
-            = 'N':  A * X = B  (No transpose)
-            = 'T':  A'* X = B  (Transpose)
-            = 'C':  A'* X = B  (Conjugate transpose = Transpose)
-
-    N       (input) INTEGER
-            The order of the matrix A.  N >= 0.
-
-    NRHS    (input) INTEGER
-            The number of right hand sides, i.e., the number of columns
-            of the matrix B.  NRHS >= 0.
-
-    A       (input) COMPLEX_16 array on the GPU, dimension (LDA,N)
-            The factors L and U from the factorization A = P*L*U as computed
-            by ZGETRF_GPU.
-
-    LDA     (input) INTEGER
-            The leading dimension of the array A.  LDA >= max(1,N).
-
-    IPIV    (input) INTEGER array, dimension (N)
-            The pivot indices from ZGETRF; for 1<=i<=N, row i of the
-            matrix was interchanged with row IPIV(i).
-
-    B       (input/output) COMPLEX_16 array on the GPU, dimension (LDB,NRHS)
-            On entry, the right hand side matrix B.
-            On exit, the solution matrix X.
-
-    LDB     (input) INTEGER
-            The leading dimension of the array B.  LDB >= max(1,N).
-
-    INFO    (output) INTEGER
-            = 0:  successful exit
-            < 0:  if INFO = -i, the i-th argument had an illegal value
-
-    HWORK   (workspace) COMPLEX_16 array, dimension N*NRHS
-    =====================================================================    */
+template<typename Ty>
+magma_int_t magma_getrs_gpu(magma_trans_t trans, magma_int_t n,
+                            magma_int_t nrhs, cl_mem dA, size_t dA_offset,
+                            magma_int_t ldda, magma_int_t *ipiv, cl_mem dB,
+                            size_t dB_offset, magma_int_t lddb,
+                            magma_queue_t queue, magma_int_t *info) {
+    /*  -- clMagma (version 0.1) --
+           Univ. of Tennessee, Knoxville
+           Univ. of California, Berkeley
+           Univ. of Colorado, Denver
+           @date
+
+        Purpose
+        =======
+        Solves a system of linear equations
+          A * X = B  or  A' * X = B
+        with a general N-by-N matrix A using the LU factorization computed by
+        ZGETRF_GPU.
+
+        Arguments
+        =========
+        TRANS   (input) CHARACTER*1
+                Specifies the form of the system of equations:
+                = 'N':  A * X = B  (No transpose)
+                = 'T':  A'* X = B  (Transpose)
+                = 'C':  A'* X = B  (Conjugate transpose = Transpose)
+
+        N       (input) INTEGER
+                The order of the matrix A.  N >= 0.
+
+        NRHS    (input) INTEGER
+                The number of right hand sides, i.e., the number of columns
+                of the matrix B.  NRHS >= 0.
+
+        A       (input) COMPLEX_16 array on the GPU, dimension (LDA,N)
+                The factors L and U from the factorization A = P*L*U as computed
+                by ZGETRF_GPU.
+
+        LDA     (input) INTEGER
+                The leading dimension of the array A.  LDA >= max(1,N).
+
+        IPIV    (input) INTEGER array, dimension (N)
+                The pivot indices from ZGETRF; for 1<=i<=N, row i of the
+                matrix was interchanged with row IPIV(i).
+
+        B       (input/output) COMPLEX_16 array on the GPU, dimension (LDB,NRHS)
+                On entry, the right hand side matrix B.
+                On exit, the solution matrix X.
+
+        LDB     (input) INTEGER
+                The leading dimension of the array B.  LDB >= max(1,N).
+
+        INFO    (output) INTEGER
+                = 0:  successful exit
+                < 0:  if INFO = -i, the i-th argument had an illegal value
+
+        HWORK   (workspace) COMPLEX_16 array, dimension N*NRHS
+        ===================================================================== */
 
     static const Ty c_one = magma_one<Ty>();
-    Ty *work = NULL;
-    int notran = (trans == MagmaNoTrans);
+    Ty *work              = NULL;
+    int notran            = (trans == MagmaNoTrans);
     magma_int_t i1, i2, inc;
 
     *info = 0;
-    if ( (! notran) &&
-         (trans != MagmaTrans) &&
-         (trans != MagmaConjTrans) ) {
+    if ((!notran) && (trans != MagmaTrans) && (trans != MagmaConjTrans)) {
         *info = -1;
     } else if (n < 0) {
         *info = -2;
     } else if (nrhs < 0) {
         *info = -3;
-    } else if (ldda < std::max(1,n)) {
+    } else if (ldda < std::max(1, n)) {
         *info = -5;
-    } else if (lddb < std::max(1,n)) {
+    } else if (lddb < std::max(1, n)) {
         *info = -8;
     }
-    if (*info != 0) {
-        return *info;
-    }
+    if (*info != 0) { return *info; }
 
     /* Quick return if possible */
-    if (n == 0 || nrhs == 0) {
-        return *info;
-    }
+    if (n == 0 || nrhs == 0) { return *info; }
 
-    magma_malloc_cpu<Ty>( &work, n*nrhs );
-    if ( work == NULL ) {
+    magma_malloc_cpu<Ty>(&work, n * nrhs);
+    if (work == NULL) {
         *info = MAGMA_ERR_HOST_ALLOC;
         return *info;
     }
@@ -159,17 +153,20 @@ magma_getrs_gpu(magma_trans_t trans, magma_int_t n, magma_int_t nrhs,
     i1 = 1;
     i2 = n;
 
-    laswp_func<Ty> cpu_laswp;
-    trsm_func<Ty> gpu_trsm;
-    trsv_func<Ty> gpu_trsv;
+    cpu_lapack_laswp_func<Ty> cpu_lapack_laswp;
+    gpu_blas_trsm_func<Ty> gpu_blas_trsm;
+    gpu_blas_trsv_func<Ty> gpu_blas_trsv;
 
     cl_event event = NULL;
 
-    clblasTranspose cltrans =(trans == MagmaNoTrans) ? clblasNoTrans :
-        (trans == MagmaTrans ? clblasTrans : clblasConjTrans);
+    OPENCL_BLAS_TRANS_T cltrans =
+        (trans == MagmaNoTrans)
+            ? OPENCL_BLAS_NO_TRANS
+            : (trans == MagmaTrans ? OPENCL_BLAS_TRANS
+                                   : OPENCL_BLAS_CONJ_TRANS);
 
-    std::string pName = opencl::getPlatformName(opencl::getDevice());
-    bool cond = pName.find("NVIDIA") != std::string::npos;
+    bool cond =
+        arrayfire::opencl::getActivePlatformVendor() == AFCL_PLATFORM_NVIDIA;
     cl_mem dAT = 0;
     if (nrhs > 1 && cond) {
         magma_malloc<Ty>(&dAT, n * n);
@@ -179,54 +176,87 @@ magma_getrs_gpu(magma_trans_t trans, magma_int_t n, magma_int_t nrhs,
         inc = 1;
 
         /* Solve A * X = B. */
-        magma_getmatrix<Ty>( n, nrhs, dB, dB_offset, lddb, work, n, queue );
-        cpu_laswp(LAPACK_COL_MAJOR, nrhs, work, n, i1, i2, ipiv, inc);
-        magma_setmatrix<Ty>( n, nrhs, work, n, dB, dB_offset, lddb, queue );
-        if ( nrhs == 1) {
-            gpu_trsv(clblasColumnMajor, clblasLower, clblasNoTrans, clblasUnit, n, dA, dA_offset, ldda, dB, dB_offset, 1, 1, &queue, 0, nullptr, &event);
-            gpu_trsv(clblasColumnMajor, clblasUpper, clblasNoTrans, clblasNonUnit, n, dA, dA_offset, ldda, dB, dB_offset, 1, 1, &queue, 0, nullptr, &event);
+        magma_getmatrix<Ty>(n, nrhs, dB, dB_offset, lddb, work, n, queue);
+        LAPACKE_CHECK(cpu_lapack_laswp(nrhs, work, n, i1, i2, ipiv, inc));
+        magma_setmatrix<Ty>(n, nrhs, work, n, dB, dB_offset, lddb, queue);
+        if (nrhs == 1) {
+            OPENCL_BLAS_CHECK(
+                gpu_blas_trsv(OPENCL_BLAS_TRIANGLE_LOWER, OPENCL_BLAS_NO_TRANS,
+                              OPENCL_BLAS_UNIT_DIAGONAL, n, dA, dA_offset, ldda,
+                              dB, dB_offset, 1, 1, &queue, 0, nullptr, &event));
+            OPENCL_BLAS_CHECK(gpu_blas_trsv(
+                OPENCL_BLAS_TRIANGLE_UPPER, OPENCL_BLAS_NO_TRANS,
+                OPENCL_BLAS_NON_UNIT_DIAGONAL, n, dA, dA_offset, ldda, dB,
+                dB_offset, 1, 1, &queue, 0, nullptr, &event));
         } else {
-            gpu_trsm(clblasColumnMajor, clblasLeft, clblasLower, clblasNoTrans, clblasUnit, n, nrhs, c_one, dA, dA_offset, ldda, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+            OPENCL_BLAS_CHECK(
+                gpu_blas_trsm(OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER,
+                              OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_UNIT_DIAGONAL,
+                              n, nrhs, c_one, dA, dA_offset, ldda, dB,
+                              dB_offset, lddb, 1, &queue, 0, nullptr, &event));
 
-            if(cond) {
-                gpu_trsm(clblasColumnMajor, clblasLeft, clblasLower, clblasTrans, clblasNonUnit, n, nrhs, c_one, dAT, 0, n, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+            if (cond) {
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER,
+                    OPENCL_BLAS_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL, n, nrhs,
+                    c_one, dAT, 0, n, dB, dB_offset, lddb, 1, &queue, 0,
+                    nullptr, &event));
             } else {
-                gpu_trsm(clblasColumnMajor, clblasLeft, clblasUpper, clblasNoTrans, clblasNonUnit, n, nrhs, c_one, dA, dA_offset, ldda, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_UPPER,
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL, n,
+                    nrhs, c_one, dA, dA_offset, ldda, dB, dB_offset, lddb, 1,
+                    &queue, 0, nullptr, &event));
             }
         }
     } else {
         inc = -1;
 
         /* Solve A' * X = B. */
-        if ( nrhs == 1) {
-            gpu_trsv(clblasColumnMajor, clblasUpper, cltrans, clblasNonUnit, n, dA, dA_offset, ldda, dB, dB_offset, 1, 1, &queue, 0, nullptr, &event);
-            gpu_trsv(clblasColumnMajor, clblasLower, cltrans, clblasUnit, n, dA, dA_offset, ldda, dB, dB_offset, 1, 1, &queue, 0, nullptr, &event);
+        if (nrhs == 1) {
+            OPENCL_BLAS_CHECK(gpu_blas_trsv(OPENCL_BLAS_TRIANGLE_UPPER, cltrans,
+                                            OPENCL_BLAS_NON_UNIT_DIAGONAL, n,
+                                            dA, dA_offset, ldda, dB, dB_offset,
+                                            1, 1, &queue, 0, nullptr, &event));
+            OPENCL_BLAS_CHECK(gpu_blas_trsv(OPENCL_BLAS_TRIANGLE_LOWER, cltrans,
+                                            OPENCL_BLAS_UNIT_DIAGONAL, n, dA,
+                                            dA_offset, ldda, dB, dB_offset, 1,
+                                            1, &queue, 0, nullptr, &event));
         } else {
-            if(cond) {
-                gpu_trsm(clblasColumnMajor, clblasLeft, clblasLower, clblasNoTrans, clblasNonUnit, n, nrhs, c_one, dAT, 0, n, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+            if (cond) {
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER,
+                    OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL, n,
+                    nrhs, c_one, dAT, 0, n, dB, dB_offset, lddb, 1, &queue, 0,
+                    nullptr, &event));
             } else {
-                gpu_trsm(clblasColumnMajor, clblasLeft, clblasUpper, cltrans, clblasNonUnit, n, nrhs, c_one, dA, dA_offset, ldda, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+                OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                    OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_UPPER, cltrans,
+                    OPENCL_BLAS_NON_UNIT_DIAGONAL, n, nrhs, c_one, dA,
+                    dA_offset, ldda, dB, dB_offset, lddb, 1, &queue, 0, nullptr,
+                    &event));
             }
-            gpu_trsm(clblasColumnMajor, clblasLeft, clblasLower, cltrans, clblasUnit, n, nrhs, c_one, dA, dA_offset, ldda, dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event);
+            OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER, cltrans,
+                OPENCL_BLAS_UNIT_DIAGONAL, n, nrhs, c_one, dA, dA_offset, ldda,
+                dB, dB_offset, lddb, 1, &queue, 0, nullptr, &event));
         }
-        magma_getmatrix<Ty>( n, nrhs, dB, dB_offset, lddb, work, n, queue );
-        cpu_laswp(LAPACK_COL_MAJOR, nrhs, work, n, i1, i2, ipiv, inc);
-        magma_setmatrix<Ty>( n, nrhs, work, n, dB, dB_offset, lddb, queue );
+        magma_getmatrix<Ty>(n, nrhs, dB, dB_offset, lddb, work, n, queue);
+        LAPACKE_CHECK(cpu_lapack_laswp(nrhs, work, n, i1, i2, ipiv, inc));
+        magma_setmatrix<Ty>(n, nrhs, work, n, dB, dB_offset, lddb, queue);
     }
 
-    if (nrhs > 1 && dAT != 0) magma_free(dAT);
+    if (nrhs > 1 && dAT != 0) { magma_free(dAT); }
     magma_free_cpu(work);
     return *info;
 }
 
-#define INSTANTIATE(T)                                                  \
-    template  magma_int_t                                               \
-    magma_getrs_gpu<T>(magma_trans_t trans, magma_int_t n, magma_int_t nrhs, \
-                       cl_mem dA, size_t dA_offset, magma_int_t ldda,   \
-                       magma_int_t *ipiv,                               \
-                       cl_mem dB, size_t dB_offset, magma_int_t lddb,   \
-                       magma_queue_t queue,                             \
-                       magma_int_t *info);                              \
+#define INSTANTIATE(T)                                                     \
+    template magma_int_t magma_getrs_gpu<T>(                               \
+        magma_trans_t trans, magma_int_t n, magma_int_t nrhs, cl_mem dA,   \
+        size_t dA_offset, magma_int_t ldda, magma_int_t * ipiv, cl_mem dB, \
+        size_t dB_offset, magma_int_t lddb, magma_queue_t queue,           \
+        magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/labrd.cpp b/src/backend/opencl/magma/labrd.cpp
new file mode 100644
index 0000000000..c2f5fd0698
--- /dev/null
+++ b/src/backend/opencl/magma/labrd.cpp
@@ -0,0 +1,702 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
+#include <common/complex.hpp>
+#include <traits.hpp>
+#include "magma.h"
+#include "magma_blas.h"
+#include "magma_cpu_blas.h"
+#include "magma_cpu_lapack.h"
+#include "magma_data.h"
+#include "magma_helper.h"
+#include "magma_sync.h"
+
+#include <algorithm>
+
+#define cpu_blas_gemv_macro(_trans, _m, _n, _alpha, _aptr, _lda, _xptr, _incx, \
+                            _beta, _yptr, _incy)                               \
+    cpu_blas_gemv(_trans, _m, _n, cblas_scalar(_alpha), cblas_ptr(_aptr),      \
+                  _lda, cblas_ptr(_xptr), _incx, cblas_scalar(_beta),          \
+                  cblas_ptr(_yptr), _incy)
+
+template<typename Ty>
+magma_int_t magma_labrd_gpu(magma_int_t m, magma_int_t n, magma_int_t nb, Ty *a,
+                            magma_int_t lda, cl_mem da, size_t da_offset,
+                            magma_int_t ldda, void *_d, void *_e, Ty *tauq,
+                            Ty *taup, Ty *x, magma_int_t ldx, cl_mem dx,
+                            size_t dx_offset, magma_int_t lddx, Ty *y,
+                            magma_int_t ldy, cl_mem dy, size_t dy_offset,
+                            magma_int_t lddy, magma_queue_t queue) {
+    /*  -- MAGMA (version 1.1) --
+        Univ. of Tennessee, Knoxville
+        Univ. of California, Berkeley
+        Univ. of Colorado, Denver
+        @date
+
+        Purpose
+        =======
+        ZLABRD reduces the first NB rows and columns of a complex general
+        m by n matrix A to upper or lower bidiagonal form by an orthogonal
+        transformation Q' * A * P, and returns the matrices X and Y which
+        are needed to apply the transformation to the unreduced part of A.
+
+        If m >= n, A is reduced to upper bidiagonal form; if m < n, to lower
+        bidiagonal form.
+
+        This is an auxiliary routine called by ZGEBRD.
+
+        Arguments
+        =========
+        M       (input) INTEGER
+        The number of rows in the matrix A.
+
+        N       (input) INTEGER
+        The number of columns in the matrix A.
+
+        NB      (input) INTEGER
+        The number of leading rows and columns of A to be reduced.
+
+        A       (input/output) COMPLEX_16 array, dimension (LDA,N)
+        On entry, the m by n general matrix to be reduced.
+        On exit, the first NB rows and columns of the matrix are
+        overwritten; the rest of the array is unchanged.
+        If m >= n, elements on and below the diagonal in the first NB
+        columns, with the array TAUQ, represent the orthogonal
+        matrix Q as a product of elementary reflectors; and
+        elements above the diagonal in the first NB rows, with the
+        array TAUP, represent the orthogonal matrix P as a product
+        of elementary reflectors.
+        If m < n, elements below the diagonal in the first NB
+        columns, with the array TAUQ, represent the orthogonal
+        matrix Q as a product of elementary reflectors, and
+        elements on and above the diagonal in the first NB rows,
+        with the array TAUP, represent the orthogonal matrix P as
+        a product of elementary reflectors.
+        See Further Details.
+
+        LDA     (input) INTEGER
+        The leading dimension of the array A.  LDA >= max(1,M).
+
+        D       (output) COMPLEX_16 array, dimension (NB)
+        The diagonal elements of the first NB rows and columns of
+        the reduced matrix.  D(i) = A(i,i).
+
+        E       (output) COMPLEX_16 array, dimension (NB)
+        The off-diagonal elements of the first NB rows and columns of
+        the reduced matrix.
+
+        TAUQ    (output) COMPLEX_16 array, dimension (NB)
+        The scalar factors of the elementary reflectors which
+        represent the orthogonal matrix Q. See Further Details.
+
+        TAUP    (output) COMPLEX_16 array, dimension (NB)
+        The scalar factors of the elementary reflectors which
+        represent the orthogonal matrix P. See Further Details.
+
+        X       (output) COMPLEX_16 array, dimension (LDX,NB)
+        The m-by-nb matrix X required to update the unreduced part
+        of A.
+
+        LDX     (input) INTEGER
+        The leading dimension of the array X. LDX >= M.
+
+        Y       (output) COMPLEX_16 array, dimension (LDY,NB)
+        The n-by-nb matrix Y required to update the unreduced part
+        of A.
+
+        LDY     (input) INTEGER
+        The leading dimension of the array Y. LDY >= N.
+
+        Further Details
+        ===============
+        The matrices Q and P are represented as products of elementary
+        reflectors:
+
+        Q = H(1) H(2) . . . H(nb)  and  P = G(1) G(2) . . . G(nb)
+
+        Each H(i) and G(i) has the form:
+
+        H(i) = I - tauq * v * v'  and G(i) = I - taup * u * u'
+
+        where tauq and taup are complex scalars, and v and u are complex
+        vectors.
+
+        If m >= n, v(1:i-1) = 0, v(i) = 1, and v(i:m) is stored on exit in
+        A(i:m,i); u(1:i) = 0, u(i+1) = 1, and u(i+1:n) is stored on exit in
+        A(i,i+1:n); tauq is stored in TAUQ(i) and taup in TAUP(i).
+
+        If m < n, v(1:i) = 0, v(i+1) = 1, and v(i+1:m) is stored on exit in
+        A(i+2:m,i); u(1:i-1) = 0, u(i) = 1, and u(i:n) is stored on exit in
+        A(i,i+1:n); tauq is stored in TAUQ(i) and taup in TAUP(i).
+
+        The elements of the vectors v and u together form the m-by-nb matrix
+        V and the nb-by-n matrix U' which are needed, with X and Y, to apply
+        the transformation to the unreduced part of the matrix, using a block
+        update of the form:  A := A - V*Y' - X*U'.
+
+        The contents of A on exit are illustrated by the following examples
+        with nb = 2:
+
+        m = 6 and n = 5 (m > n):          m = 5 and n = 6 (m < n):
+
+        ( 1   1   u1  u1  u1)           ( 1   u1  u1  u1  u1  u1)
+        ( v1  1   1   u2  u2)           ( 1   1   u2  u2  u2  u2)
+        ( v1  v2  a   a   a )           ( v1  1   a   a   a   a )
+        ( v1  v2  a   a   a )           ( v1  v2  a   a   a   a )
+        ( v1  v2  a   a   a )           ( v1  v2  a   a   a   a )
+        ( v1  v2  a   a   a )
+
+        where a denotes an element of the original matrix which is unchanged,
+        vi denotes an element of the vector defining H(i), and ui an element
+        of the vector defining G(i).
+        ===================================================================== */
+
+    using Tr = typename af::dtype_traits<Ty>::base_type;
+
+    constexpr bool is_cplx = arrayfire::common::is_complex<Ty>::value;
+
+    Tr *d = (Tr *)_d;
+    Tr *e = (Tr *)_e;
+
+    Ty c_neg_one     = magma_neg_one<Ty>();
+    Ty c_one         = magma_one<Ty>();
+    Ty c_zero        = magma_zero<Ty>();
+    magma_int_t c__1 = 1;
+
+    magma_int_t a_dim1, a_offset, x_dim1, x_offset, y_dim1, y_offset, i__2,
+        i__3;
+    magma_int_t i__;
+    Ty alpha{};
+
+    a_dim1   = lda;
+    a_offset = 1 + a_dim1;
+    a -= a_offset;
+    --d;
+    --e;
+    --tauq;
+    --taup;
+
+    x_dim1   = ldx;
+    x_offset = 1 + x_dim1;
+    x -= x_offset;
+    dx_offset -= 1 + lddx;
+
+    y_dim1   = ldy;
+    y_offset = 1 + y_dim1;
+    y -= y_offset;
+    dy_offset -= 1 + lddy;
+
+    /* Quick return if possible */
+    if (m <= 0 || n <= 0) { return 0; }
+
+    Ty *f;
+    magma_malloc_cpu<Ty>(&f, std::max(n, m));
+    assert(f != NULL);  // TODO return error, or allocate outside zlabrd
+
+    magma_event_t event = NULL;
+
+    gpu_blas_gemv_func<Ty> gpu_blas_gemv;
+    cpu_blas_gemv_func<Ty> cpu_blas_gemv;
+    cpu_blas_scal_func<Ty> cpu_blas_scal;
+    cpu_blas_axpy_func<Ty> cpu_blas_axpy;
+    cpu_lapack_larfg_work_func<Ty> cpu_lapack_larfg;
+    cpu_lapack_lacgv_work_func<Ty> cpu_lapack_lacgv;
+
+    CBLAS_TRANSPOSE CblasTransParam = is_cplx ? CblasConjTrans : CblasTrans;
+
+    if (m >= n) {
+        /* Reduce to upper bidiagonal form */
+        for (i__ = 1; i__ <= nb; ++i__) {
+            /*  Update A(i:m,i) */
+            i__2 = m - i__ + 1;
+            i__3 = i__ - 1;
+
+            if (is_cplx) {
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__3, &y[i__ + y_dim1], ldy));
+            }
+
+            cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                &a[i__ + a_dim1], lda, &y[i__ + y_dim1], ldy,
+                                (&c_one), &a[i__ + i__ * a_dim1], c__1);
+
+            if (is_cplx) {
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__3, &y[i__ + y_dim1], ldy));
+            }
+
+            cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                &x[i__ + x_dim1], ldx, &a[i__ * a_dim1 + 1],
+                                c__1, (&c_one), &a[i__ + i__ * a_dim1], c__1);
+
+            /* Generate reflection Q(i) to annihilate A(i+1:m,i) */
+            alpha = a[i__ + i__ * a_dim1];
+            i__2  = m - i__ + 1;
+            i__3  = i__ + 1;
+
+            LAPACKE_CHECK(cpu_lapack_larfg(i__2, &alpha,
+                                           &a[std::min(i__3, m) + i__ * a_dim1],
+                                           c__1, &tauq[i__]));
+
+            d[i__] = magma_real<Ty>(alpha);
+            if (i__ < n) {
+                a[i__ + i__ * a_dim1] = c_one;
+
+                /* Compute Y(i+1:n,i) */
+                i__2 = m - i__ + 1;
+                i__3 = n - i__;
+
+                // 1. Send the block reflector  A(i+1:m,i) to the GPU ------
+                magma_setvector<Ty>(i__2, a + i__ + i__ * a_dim1, 1, da,
+                                    da_offset + (i__ - 1) + (i__ - 1) * (ldda),
+                                    1, queue);
+                // 2. Multiply ---------------------------------------------
+                OPENCL_BLAS_CHECK(gpu_blas_gemv(
+                    OPENCL_BLAS_CONJ_TRANS, i__2, i__3, c_one, da,
+                    da_offset + (i__ - 1) + ((i__ - 1) + 1) * (ldda), ldda, da,
+                    da_offset + (i__ - 1) + (i__ - 1) * (ldda), c__1, c_zero,
+                    dy, dy_offset + i__ + 1 + i__ * y_dim1, c__1, 1, &queue, 0,
+                    nullptr, &event));
+
+                // 3. Put the result back ----------------------------------
+                magma_getmatrix_async<Ty>(
+                    i__3, 1, dy, dy_offset + i__ + 1 + i__ * y_dim1, y_dim1,
+                    y + i__ + 1 + i__ * y_dim1, y_dim1, queue, &event);
+                i__2 = m - i__ + 1;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_one),
+                                    &a[i__ + a_dim1], lda,
+                                    &a[i__ + i__ * a_dim1], c__1, (&c_zero),
+                                    &y[i__ * y_dim1 + 1], c__1);
+
+                i__2 = n - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &y[i__ + 1 + y_dim1], ldy,
+                                    &y[i__ * y_dim1 + 1], c__1, (&c_zero), f,
+                                    c__1);
+                i__2 = m - i__ + 1;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_one),
+                                    &x[i__ + x_dim1], ldx,
+                                    &a[i__ + i__ * a_dim1], c__1, (&c_zero),
+                                    &y[i__ * y_dim1 + 1], c__1);
+
+                // 4. Synch to make sure the result is back ----------------
+                magma_event_sync(event);
+
+                if (i__3 != 0) {
+                    i__2 = n - i__;
+                    cpu_blas_axpy(i__2, cblas_scalar(&c_one), cblas_ptr(f),
+                                  c__1, cblas_ptr(&y[i__ + 1 + i__ * y_dim1]),
+                                  c__1);
+                }
+
+                i__2 = i__ - 1;
+                i__3 = n - i__;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_neg_one),
+                                    &a[(i__ + 1) * a_dim1 + 1], lda,
+                                    &y[i__ * y_dim1 + 1], c__1, (&c_one),
+                                    &y[i__ + 1 + i__ * y_dim1], c__1);
+                i__2 = n - i__;
+                cpu_blas_scal(i__2, cblas_scalar(&tauq[i__]),
+                              cblas_ptr(&y[i__ + 1 + i__ * y_dim1]), c__1);
+
+                /* Update A(i,i+1:n) */
+                i__2 = n - i__;
+                if (is_cplx) {
+                    LAPACKE_CHECK(cpu_lapack_lacgv(
+                        i__2, &a[i__ + (i__ + 1) * a_dim1], lda));
+                    LAPACKE_CHECK(cpu_lapack_lacgv(i__, &a[i__ + a_dim1], lda));
+                }
+
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__, (&c_neg_one),
+                                    &y[i__ + 1 + y_dim1], ldy, &a[i__ + a_dim1],
+                                    lda, (&c_one), &a[i__ + (i__ + 1) * a_dim1],
+                                    lda);
+                i__2 = i__ - 1;
+                i__3 = n - i__;
+
+                if (is_cplx) {
+                    LAPACKE_CHECK(cpu_lapack_lacgv(i__, &a[i__ + a_dim1], lda));
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__2, &x[i__ + x_dim1], ldx));
+                }
+
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_neg_one),
+                                    &a[(i__ + 1) * a_dim1 + 1], lda,
+                                    &x[i__ + x_dim1], ldx, (&c_one),
+                                    &a[i__ + (i__ + 1) * a_dim1], lda);
+                if (is_cplx) {
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__2, &x[i__ + x_dim1], ldx));
+                }
+
+                /* Generate reflection P(i) to annihilate A(i,i+2:n) */
+                i__2 = n - i__;
+                /* Computing MIN */
+                i__3  = i__ + 2;
+                alpha = a[i__ + (i__ + 1) * a_dim1];
+                LAPACKE_CHECK(cpu_lapack_larfg(
+                    i__2, &alpha, &a[i__ + std::min(i__3, n) * a_dim1], lda,
+                    &taup[i__]));
+                e[i__]                      = magma_real<Ty>(alpha);
+                a[i__ + (i__ + 1) * a_dim1] = c_one;
+
+                /* Compute X(i+1:m,i) */
+                i__2 = m - i__;
+                i__3 = n - i__;
+                // 1. Send the block reflector  A(i,i+1:n) to the GPU ------
+                magma_setvector<Ty>(
+                    i__3, a + i__ + (i__ + 1) * a_dim1, lda, da,
+                    da_offset + (i__ - 1) + ((i__ - 1) + 1) * (ldda), ldda,
+                    queue);
+                // 2. Multiply ---------------------------------------------
+                // magma_zcopy(i__3, da+(i__-1)+((i__-1)+1)*(ldda), ldda,
+                //            dy + 1 + lddy, 1);
+                OPENCL_BLAS_CHECK(gpu_blas_gemv(
+                    OPENCL_BLAS_NO_TRANS, i__2, i__3, c_one, da,
+                    da_offset + (i__ - 1) + 1 + ((i__ - 1) + 1) * (ldda), ldda,
+                    da, da_offset + (i__ - 1) + ((i__ - 1) + 1) * (ldda), ldda,
+                    // dy + 1 + lddy, 1,
+                    c_zero, dx, dx_offset + i__ + 1 + i__ * x_dim1, c__1, 1,
+                    &queue, 0, nullptr, &event));
+
+                // 3. Put the result back ----------------------------------
+                magma_getmatrix_async<Ty>(
+                    i__2, 1, dx, dx_offset + i__ + 1 + i__ * x_dim1, x_dim1,
+                    x + i__ + 1 + i__ * x_dim1, x_dim1, queue, &event);
+
+                i__2 = n - i__;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__, (&c_one),
+                                    &y[i__ + 1 + y_dim1], ldy,
+                                    &a[i__ + (i__ + 1) * a_dim1], lda,
+                                    (&c_zero), &x[i__ * x_dim1 + 1], c__1);
+
+                i__2 = m - i__;
+                cpu_blas_gemv_macro(
+                    CblasNoTrans, i__2, i__, (&c_neg_one), &a[i__ + 1 + a_dim1],
+                    lda, &x[i__ * x_dim1 + 1], c__1, (&c_zero), f, c__1);
+                i__2 = i__ - 1;
+                i__3 = n - i__;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_one),
+                                    &a[(i__ + 1) * a_dim1 + 1], lda,
+                                    &a[i__ + (i__ + 1) * a_dim1], lda,
+                                    (&c_zero), &x[i__ * x_dim1 + 1], c__1);
+
+                // 4. Synch to make sure the result is back ----------------
+                magma_event_sync(event);
+
+                if (i__ != 0) {
+                    i__2 = m - i__;
+                    cpu_blas_axpy(i__2, cblas_scalar(&c_one), cblas_ptr(f),
+                                  c__1, cblas_ptr(&x[i__ + 1 + i__ * x_dim1]),
+                                  c__1);
+                }
+
+                i__2 = m - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &x[i__ + 1 + x_dim1], ldx,
+                                    &x[i__ * x_dim1 + 1], c__1, (&c_one),
+                                    &x[i__ + 1 + i__ * x_dim1], c__1);
+                i__2 = m - i__;
+                cpu_blas_scal(i__2, cblas_scalar(&taup[i__]),
+                              cblas_ptr(&x[i__ + 1 + i__ * x_dim1]), c__1);
+
+                if (is_cplx) {
+                    i__2 = n - i__;
+                    LAPACKE_CHECK(cpu_lapack_lacgv(
+                        i__2, &a[i__ + (i__ + 1) * a_dim1], lda));
+                    // 4. Send the block reflector  A(i,i+1:n) to the GPU after
+                    // ZLACGV()
+                    magma_setvector<Ty>(
+                        i__2, a + i__ + (i__ + 1) * a_dim1, lda, da,
+                        da_offset + (i__ - 1) + ((i__ - 1) + 1) * (ldda), ldda,
+                        queue);
+                }
+            }
+        }
+    } else {
+        /* Reduce to lower bidiagonal form */
+        for (i__ = 1; i__ <= nb; ++i__) {
+            /* Update A(i,i:n) */
+            i__2 = n - i__ + 1;
+            i__3 = i__ - 1;
+            if (is_cplx) {
+                LAPACKE_CHECK(
+                    cpu_lapack_lacgv(i__2, &a[i__ + i__ * a_dim1], lda));
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__3, &a[i__ + a_dim1], lda));
+            }
+            cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                &y[i__ + y_dim1], ldy, &a[i__ + a_dim1], lda,
+                                (&c_one), &a[i__ + i__ * a_dim1], lda);
+            i__2 = i__ - 1;
+            if (is_cplx) {
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__3, &a[i__ + a_dim1], lda));
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__3, &x[i__ + x_dim1], ldx));
+            }
+            i__3 = n - i__ + 1;
+            cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_neg_one),
+                                &a[i__ * a_dim1 + 1], lda, &x[i__ + x_dim1],
+                                ldx, (&c_one), &a[i__ + i__ * a_dim1], lda);
+            if (is_cplx) {
+                LAPACKE_CHECK(cpu_lapack_lacgv(i__2, &x[i__ + x_dim1], ldx));
+            }
+
+            /* Generate reflection P(i) to annihilate A(i,i+1:n) */
+            i__2 = n - i__ + 1;
+            /* Computing MIN */
+            i__3  = i__ + 1;
+            alpha = a[i__ + i__ * a_dim1];
+            LAPACKE_CHECK(cpu_lapack_larfg(i__2, &alpha,
+                                           &a[i__ + std::min(i__3, n) * a_dim1],
+                                           lda, &taup[i__]));
+            d[i__] = magma_real<Ty>(alpha);
+            if (i__ < m) {
+                a[i__ + i__ * a_dim1] = c_one;
+
+                /* Compute X(i+1:m,i) */
+                i__2 = m - i__;
+                i__3 = n - i__ + 1;
+
+                // 1. Send the block reflector  A(i,i:n) to the GPU --------
+                magma_setvector<Ty>(i__3, a + i__ + i__ * a_dim1, lda, da,
+                                    da_offset + (i__ - 1) + (i__ - 1) * (ldda),
+                                    ldda, queue);
+
+                // 2. Multiply ---------------------------------------------
+                // magma_zcopy(i__3, da+(i__-1)+(i__-1)*(ldda), ldda,
+                //            dy + 1 + lddy, 1);
+                OPENCL_BLAS_CHECK(gpu_blas_gemv(
+                    OPENCL_BLAS_NO_TRANS, i__2, i__3, c_one, da,
+                    da_offset + (i__ - 1) + 1 + (i__ - 1) * ldda, ldda, da,
+                    da_offset + (i__ - 1) + (i__ - 1) * ldda, ldda,
+                    // dy + 1 + lddy, 1,
+                    c_zero, dx, dx_offset + i__ + 1 + i__ * x_dim1, c__1, 1,
+                    &queue, 0, nullptr, &event));
+
+                // 3. Put the result back ----------------------------------
+                magma_getmatrix_async<Ty>(
+                    i__2, 1, dx, dx_offset + i__ + 1 + i__ * x_dim1, x_dim1,
+                    x + i__ + 1 + i__ * x_dim1, x_dim1, queue, &event);
+
+                i__2 = n - i__ + 1;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_one),
+                                    &y[i__ + y_dim1], ldy,
+                                    &a[i__ + i__ * a_dim1], lda, (&c_zero),
+                                    &x[i__ * x_dim1 + 1], c__1);
+                i__2 = m - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &a[i__ + 1 + a_dim1], lda,
+                                    &x[i__ * x_dim1 + 1], c__1, (&c_zero), f,
+                                    c__1);
+
+                i__2 = i__ - 1;
+                i__3 = n - i__ + 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_one),
+                                    &a[i__ * a_dim1 + 1], lda,
+                                    &a[i__ + i__ * a_dim1], lda, (&c_zero),
+                                    &x[i__ * x_dim1 + 1], c__1);
+
+                // 4. Synch to make sure the result is back ----------------
+                magma_event_sync(event);
+                if (i__2 != 0) {
+                    i__3 = m - i__;
+                    cpu_blas_axpy(i__3, cblas_scalar(&c_one), cblas_ptr(f),
+                                  c__1, cblas_ptr(&x[i__ + 1 + i__ * x_dim1]),
+                                  c__1);
+                }
+
+                i__2 = m - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &x[i__ + 1 + x_dim1], ldx,
+                                    &x[i__ * x_dim1 + 1], c__1, (&c_one),
+                                    &x[i__ + 1 + i__ * x_dim1], c__1);
+                i__2 = m - i__;
+                cpu_blas_scal(i__2, cblas_scalar(&taup[i__]),
+                              cblas_ptr(&x[i__ + 1 + i__ * x_dim1]), c__1);
+                i__2 = n - i__ + 1;
+
+                if (is_cplx) {
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__2, &a[i__ + i__ * a_dim1], lda));
+                    magma_setvector<Ty>(
+                        i__2, a + i__ + (i__)*a_dim1, lda, da,
+                        da_offset + (i__ - 1) + (i__ - 1) * (ldda), ldda,
+                        queue);
+                }
+
+                /* Update A(i+1:m,i) */
+                i__2 = m - i__;
+                i__3 = i__ - 1;
+
+                if (is_cplx) {
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__3, &y[i__ + y_dim1], ldy));
+                }
+
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &a[i__ + 1 + a_dim1], lda, &y[i__ + y_dim1],
+                                    ldy, (&c_one), &a[i__ + 1 + i__ * a_dim1],
+                                    c__1);
+                i__2 = m - i__;
+                if (is_cplx) {
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__3, &y[i__ + y_dim1], ldy));
+                }
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__, (&c_neg_one),
+                                    &x[i__ + 1 + x_dim1], ldx,
+                                    &a[i__ * a_dim1 + 1], c__1, (&c_one),
+                                    &a[i__ + 1 + i__ * a_dim1], c__1);
+
+                /* Generate reflection Q(i) to annihilate A(i+2:m,i) */
+                i__2  = m - i__;
+                i__3  = i__ + 2;
+                alpha = a[i__ + 1 + i__ * a_dim1];
+                LAPACKE_CHECK(cpu_lapack_larfg(
+                    i__2, &alpha, &a[std::min(i__3, m) + i__ * a_dim1], c__1,
+                    &tauq[i__]));
+                e[i__]                    = magma_real<Ty>(alpha);
+                a[i__ + 1 + i__ * a_dim1] = c_one;
+
+                /* Compute Y(i+1:n,i) */
+                i__2 = m - i__;
+                i__3 = n - i__;
+
+                // 1. Send the block reflector  A(i+1:m,i) to the GPU ------
+                magma_setvector<Ty>(
+                    i__2, a + i__ + 1 + i__ * a_dim1, 1, da,
+                    da_offset + (i__ - 1) + 1 + (i__ - 1) * (ldda), 1, queue);
+                // 2. Multiply ---------------------------------------------
+                OPENCL_BLAS_CHECK(gpu_blas_gemv(
+                    OPENCL_BLAS_CONJ_TRANS, i__2, i__3, c_one, da,
+                    da_offset + (i__ - 1) + 1 + ((i__ - 1) + 1) * ldda, ldda,
+                    da, da_offset + (i__ - 1) + 1 + (i__ - 1) * ldda, c__1,
+                    c_zero, dy, dy_offset + i__ + 1 + i__ * y_dim1, c__1, 1,
+                    &queue, 0, nullptr, &event));
+
+                // 3. Put the result back ----------------------------------
+                magma_getmatrix_async<Ty>(
+                    i__3, 1, dy, dy_offset + i__ + 1 + i__ * y_dim1, y_dim1,
+                    y + i__ + 1 + i__ * y_dim1, y_dim1, queue, &event);
+
+                i__2 = m - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__3, (&c_one),
+                                    &a[i__ + 1 + a_dim1], lda,
+                                    &a[i__ + 1 + i__ * a_dim1], c__1, (&c_zero),
+                                    &y[i__ * y_dim1 + 1], c__1);
+                i__2 = n - i__;
+                i__3 = i__ - 1;
+                cpu_blas_gemv_macro(CblasNoTrans, i__2, i__3, (&c_neg_one),
+                                    &y[i__ + 1 + y_dim1], ldy,
+                                    &y[i__ * y_dim1 + 1], c__1, (&c_zero), f,
+                                    c__1);
+
+                i__2 = m - i__;
+                cpu_blas_gemv_macro(CblasTransParam, i__2, i__, (&c_one),
+                                    &x[i__ + 1 + x_dim1], ldx,
+                                    &a[i__ + 1 + i__ * a_dim1], c__1, (&c_zero),
+                                    &y[i__ * y_dim1 + 1], c__1);
+
+                // 4. Synch to make sure the result is back ----------------
+                magma_event_sync(event);
+                if (i__3 != 0) {
+                    i__2 = n - i__;
+                    cpu_blas_axpy(i__2, cblas_scalar(&c_one), cblas_ptr(f),
+                                  c__1, cblas_ptr(&y[i__ + 1 + i__ * y_dim1]),
+                                  c__1);
+                }
+
+                i__2 = n - i__;
+                cpu_blas_gemv_macro(CblasTransParam, i__, i__2, (&c_neg_one),
+                                    &a[(i__ + 1) * a_dim1 + 1], lda,
+                                    &y[i__ * y_dim1 + 1], c__1, (&c_one),
+                                    &y[i__ + 1 + i__ * y_dim1], c__1);
+                i__2 = n - i__;
+                cpu_blas_scal(i__2, cblas_scalar(&tauq[i__]),
+                              cblas_ptr(&y[i__ + 1 + i__ * y_dim1]), c__1);
+            } else {
+                if (is_cplx) {
+                    i__2 = n - i__ + 1;
+                    LAPACKE_CHECK(
+                        cpu_lapack_lacgv(i__2, &a[i__ + i__ * a_dim1], lda));
+                    magma_setvector<Ty>(
+                        i__2, a + i__ + (i__)*a_dim1, lda, da,
+                        da_offset + (i__ - 1) + (i__ - 1) * (ldda), ldda,
+                        queue);
+                }
+            }
+        }
+    }
+
+    magma_queue_sync(queue);
+    magma_free_cpu(f);
+
+    return MAGMA_SUCCESS;
+}
+
+#define INSTANTIATE(Ty)                                                        \
+    template magma_int_t magma_labrd_gpu<Ty>(                                  \
+        magma_int_t m, magma_int_t n, magma_int_t nb, Ty * a, magma_int_t lda, \
+        cl_mem da, size_t da_offset, magma_int_t ldda, void *_d, void *_e,     \
+        Ty *tauq, Ty *taup, Ty *x, magma_int_t ldx, cl_mem dx,                 \
+        size_t dx_offset, magma_int_t lddx, Ty *y, magma_int_t ldy, cl_mem dy, \
+        size_t dy_offset, magma_int_t lddy, magma_queue_t queue);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(magmaFloatComplex)
+INSTANTIATE(magmaDoubleComplex)
diff --git a/src/backend/opencl/magma/larfb.cpp b/src/backend/opencl/magma/larfb.cpp
index 747e16a0ea..b7513bd971 100644
--- a/src/backend/opencl/magma/larfb.cpp
+++ b/src/backend/opencl/magma/larfb.cpp
@@ -33,22 +33,22 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
@@ -56,7 +56,6 @@
 #include "magma.h"
 #include "magma_blas.h"
 #include "magma_data.h"
-#include "magma_cpu_lapack.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
@@ -175,25 +174,25 @@
 
     @ingroup magma_zaux3
     ********************************************************************/
-template<typename Ty> magma_int_t
-magma_larfb_gpu(
-    magma_side_t side, magma_trans_t trans, magma_direct_t direct, magma_storev_t storev,
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dV   , size_t dV_offset,    magma_int_t lddv,
-    cl_mem dT   , size_t dT_offset,    magma_int_t lddt,
-    cl_mem dC   , size_t dC_offset,    magma_int_t lddc,
-    cl_mem dwork, size_t dwork_offset, magma_int_t ldwork,
-    magma_queue_t queue )
-{
-    #define dV(i_,j_)  dV,     (dV_offset    + (i_) + (j_)*lddv)
-    #define dT(i_,j_)  dT,     (dT_offset    + (i_) + (j_)*lddt)
-    #define dC(i_,j_)  dC,     (dC_offset    + (i_) + (j_)*lddc)
-    #define dwork(i_)  dwork,  (dwork_offset + (i_))
+template<typename Ty>
+magma_int_t magma_larfb_gpu(magma_side_t side, magma_trans_t trans,
+                            magma_direct_t direct, magma_storev_t storev,
+                            magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dV, size_t dV_offset, magma_int_t lddv,
+                            cl_mem dT, size_t dT_offset, magma_int_t lddt,
+                            cl_mem dC, size_t dC_offset, magma_int_t lddc,
+                            cl_mem dwork, size_t dwork_offset,
+                            magma_int_t ldwork, magma_queue_t queue) {
+#define dV(i_, j_) dV, (dV_offset + (i_) + (j_)*lddv)
+#define dT(i_, j_) dT, (dT_offset + (i_) + (j_)*lddt)
+#define dC(i_, j_) dC, (dC_offset + (i_) + (j_)*lddc)
+#define dwork(i_) dwork, (dwork_offset + (i_))
 
     static const Ty c_zero    = magma_zero<Ty>();
     static const Ty c_one     = magma_one<Ty>();
     static const Ty c_neg_one = magma_neg_one<Ty>();
-    static const clblasTranspose transType = magma_is_real<Ty>() ? clblasTrans : clblasConjTrans;
+    static const OPENCL_BLAS_TRANS_T transType =
+        magma_is_real<Ty>() ? OPENCL_BLAS_TRANS : OPENCL_BLAS_CONJ_TRANS;
 
     /* Check input arguments */
     magma_int_t info = 0;
@@ -203,151 +202,115 @@ magma_larfb_gpu(
         info = -6;
     } else if (k < 0) {
         info = -7;
-    } else if ( ((storev == MagmaColumnwise) && (side == MagmaLeft) && lddv < std::max(1,m)) ||
-                ((storev == MagmaColumnwise) && (side == MagmaRight) && lddv < std::max(1,n)) ||
-                ((storev == MagmaRowwise) && lddv < k) ) {
+    } else if (((storev == MagmaColumnwise) && (side == MagmaLeft) &&
+                lddv < std::max(1, m)) ||
+               ((storev == MagmaColumnwise) && (side == MagmaRight) &&
+                lddv < std::max(1, n)) ||
+               ((storev == MagmaRowwise) && lddv < k)) {
         info = -9;
     } else if (lddt < k) {
         info = -11;
-    } else if (lddc < std::max(1,m)) {
+    } else if (lddc < std::max(1, m)) {
         info = -13;
-    } else if ( ((side == MagmaLeft) && ldwork < std::max(1,n)) ||
-                ((side == MagmaRight) && ldwork < std::max(1,m)) ) {
+    } else if (((side == MagmaLeft) && ldwork < std::max(1, n)) ||
+               ((side == MagmaRight) && ldwork < std::max(1, m))) {
         info = -15;
     }
     if (info != 0) {
-        //magma_xerbla( __func__, -(info) );
+        // magma_xerbla( __func__, -(info) );
         return info;
     }
 
     /* Function Body */
-    if (m <= 0 || n <= 0) {
-        return info;
-    }
+    if (m <= 0 || n <= 0) { return info; }
 
     // opposite of trans
-    clblasTranspose transt;
-    clblasTranspose cltrans;
+    OPENCL_BLAS_TRANS_T transt;
+    OPENCL_BLAS_TRANS_T cltrans;
     if (trans == MagmaNoTrans) {
-        transt = transType;
-        cltrans = clblasNoTrans;
-    }
-    else {
-        transt = clblasNoTrans;
+        transt  = transType;
+        cltrans = OPENCL_BLAS_NO_TRANS;
+    } else {
+        transt  = OPENCL_BLAS_NO_TRANS;
         cltrans = transType;
     }
 
     // whether T is upper or lower triangular
-    clblasUplo uplo;
-    if (direct == MagmaForward)
-        uplo = clblasUpper;
-    else
-        uplo = clblasLower;
+    OPENCL_BLAS_TRIANGLE_T uplo;
+    if (direct == MagmaForward) {
+        uplo = OPENCL_BLAS_TRIANGLE_UPPER;
+    } else {
+        uplo = OPENCL_BLAS_TRIANGLE_LOWER;
+    }
 
     // whether V is stored transposed or not
-    clblasTranspose notransV, transV;
+    OPENCL_BLAS_TRANS_T notransV, transV;
     if (storev == MagmaColumnwise) {
-        notransV = clblasNoTrans;
+        notransV = OPENCL_BLAS_NO_TRANS;
         transV   = transType;
-    }
-    else {
+    } else {
         notransV = transType;
-        transV   = clblasNoTrans;
+        transV   = OPENCL_BLAS_NO_TRANS;
     }
 
-    gemm_func<Ty> gpu_gemm;
-    trmm_func<Ty> gpu_trmm;
+    gpu_blas_gemm_func<Ty> gpu_blas_gemm;
+    gpu_blas_trmm_func<Ty> gpu_blas_trmm;
 
     cl_event event = NULL;
 
-    if ( side == MagmaLeft ) {
+    if (side == MagmaLeft) {
         // Form H C or H^H C
-        // Comments assume H C. When forming H^H C, T gets transposed via transt.
+        // Comments assume H C. When forming H^H C, T gets transposed via
+        // transt.
 
         // W = C^H V
-        gpu_gemm(clblasColumnMajor,
-                 transType, notransV,
-                 n, k, m,
-                 c_one,
-                 dC(0,0),  lddc,
-                 dV(0,0),  lddv,
-                 c_zero,
-                 dwork(0), ldwork,
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(
+            transType, notransV, n, k, m, c_one, dC(0, 0), lddc, dV(0, 0), lddv,
+            c_zero, dwork(0), ldwork, 1, &queue, 0, nullptr, &event));
 
         // W = W T^H = C^H V T^H
-        gpu_trmm(clblasColumnMajor,
-                 clblasRight,
-                 uplo, transt, clblasNonUnit,
-                 n, k,
-                 c_one,
-                 dT(0,0) , lddt,
-                 dwork(0), ldwork,
-                 1, &queue, 0, nullptr, &event);
-
+        OPENCL_BLAS_CHECK(gpu_blas_trmm(OPENCL_BLAS_SIDE_RIGHT, uplo, transt,
+                                        OPENCL_BLAS_NON_UNIT_DIAGONAL, n, k,
+                                        c_one, dT(0, 0), lddt, dwork(0), ldwork,
+                                        1, &queue, 0, nullptr, &event));
         // C = C - V W^H = C - V T V^H C = (I - V T V^H) C = H C
-        gpu_gemm(clblasColumnMajor,
-                 notransV, transType,
-                 m, n, k,
-                 c_neg_one,
-                 dV(0,0),  lddv,
-                 dwork(0), ldwork,
-                 c_one,
-                 dC(0,0),  lddc,
-                 1, &queue, 0, nullptr, &event);
-    }
-    else {
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(
+            notransV, transType, m, n, k, c_neg_one, dV(0, 0), lddv, dwork(0),
+            ldwork, c_one, dC(0, 0), lddc, 1, &queue, 0, nullptr, &event));
+    } else {
         // Form C H or C H^H
         // Comments assume C H. When forming C H^H, T gets transposed via trans.
 
         // W = C V
-        gpu_gemm(clblasColumnMajor,
-                 clblasNoTrans, notransV,
-                 m, k, n,
-                 c_one,
-                 dC(0,0),  lddc,
-                 dV(0,0),  lddv,
-                 c_zero,
-                 dwork(0), ldwork,
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(OPENCL_BLAS_NO_TRANS, notransV, m, k, n,
+                                        c_one, dC(0, 0), lddc, dV(0, 0), lddv,
+                                        c_zero, dwork(0), ldwork, 1, &queue, 0,
+                                        nullptr, &event));
 
         // W = W T = C V T
-        gpu_trmm(clblasColumnMajor,
-                 clblasRight, uplo,
-                 cltrans,
-                 clblasNonUnit,
-                 m, k,
-                 c_one,
-                 dT(0,0),  lddt,
-                 dwork(0), ldwork,
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(gpu_blas_trmm(OPENCL_BLAS_SIDE_RIGHT, uplo, cltrans,
+                                        OPENCL_BLAS_NON_UNIT_DIAGONAL, m, k,
+                                        c_one, dT(0, 0), lddt, dwork(0), ldwork,
+                                        1, &queue, 0, nullptr, &event));
 
         // C = C - W V^H = C - C V T V^H = C (I - V T V^H) = C H
-        gpu_gemm(clblasColumnMajor,
-                 clblasNoTrans, transV,
-                 m, n, k,
-                 c_neg_one,
-                 dwork(0), ldwork,
-                 dV(0,0),  lddv,
-                 c_one,
-                 dC(0,0),  lddc,
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(gpu_blas_gemm(OPENCL_BLAS_NO_TRANS, transV, m, n, k,
+                                        c_neg_one, dwork(0), ldwork, dV(0, 0),
+                                        lddv, c_one, dC(0, 0), lddc, 1, &queue,
+                                        0, nullptr, &event));
     }
 
     return info;
 } /* magma_zlarfb */
 
-#define INSTANTIATE(T)                                          \
-    template magma_int_t                                        \
-    magma_larfb_gpu<T>(                                         \
-        magma_side_t side, magma_trans_t trans,                 \
-        magma_direct_t direct, magma_storev_t storev,           \
-        magma_int_t m, magma_int_t n, magma_int_t k,            \
-        cl_mem dV   , size_t dV_offset,    magma_int_t lddv,    \
-        cl_mem dT   , size_t dT_offset,    magma_int_t lddt,    \
-        cl_mem dC   , size_t dC_offset,    magma_int_t lddc,    \
-        cl_mem dwork, size_t dwork_offset, magma_int_t ldwork,  \
-        magma_queue_t queue );                                  \
+#define INSTANTIATE(T)                                                      \
+    template magma_int_t magma_larfb_gpu<T>(                                \
+        magma_side_t side, magma_trans_t trans, magma_direct_t direct,      \
+        magma_storev_t storev, magma_int_t m, magma_int_t n, magma_int_t k, \
+        cl_mem dV, size_t dV_offset, magma_int_t lddv, cl_mem dT,           \
+        size_t dT_offset, magma_int_t lddt, cl_mem dC, size_t dC_offset,    \
+        magma_int_t lddc, cl_mem dwork, size_t dwork_offset,                \
+        magma_int_t ldwork, magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/laset.cpp b/src/backend/opencl/magma/laset.cpp
index c35eb3a91c..520bdea59e 100644
--- a/src/backend/opencl/magma/laset.cpp
+++ b/src/backend/opencl/magma/laset.cpp
@@ -7,51 +7,92 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include "magma_data.h"
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
 #include "kernel/laset.hpp"
+#include "magma_data.h"
 
 #include <algorithm>
 
-template<typename T> void
-magmablas_laset(magma_uplo_t uplo, magma_int_t m, magma_int_t n,
-                T offdiag, T diag,
-                cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                magma_queue_t queue)
-{
+template<typename T>
+void magmablas_laset(magma_uplo_t uplo, magma_int_t m, magma_int_t n, T offdiag,
+                     T diag, cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                     magma_queue_t queue) {
+    using arrayfire::opencl::kernel::laset;
     magma_int_t info = 0;
-    if ( uplo != MagmaLower && uplo != MagmaUpper && uplo != MagmaFull )
+    if (uplo != MagmaLower && uplo != MagmaUpper && uplo != MagmaFull) {
         info = -1;
-    else if ( m < 0 )
+    } else if (m < 0) {
         info = -2;
-    else if ( n < 0 )
+    } else if (n < 0) {
         info = -3;
-    else if ( ldda < std::max(1,m) )
+    } else if (ldda < std::max(1, m)) {
         info = -7;
-
-    if (info != 0) {
-        return;  //info;
     }
 
-    if ( m == 0 || n == 0 ) {
-        return;
+    if (info != 0) {
+        return;  // info;
     }
 
+    if (m == 0 || n == 0) { return; }
 
     switch (uplo) {
-    case MagmaFull : return opencl::kernel::laset<T, 0>(m, n, offdiag, diag, dA, dA_offset, ldda);
-    case MagmaLower: return opencl::kernel::laset<T, 1>(m, n, offdiag, diag, dA, dA_offset, ldda);
-    case MagmaUpper: return opencl::kernel::laset<T, 2>(m, n, offdiag, diag, dA, dA_offset, ldda);
-    default: return;
+        case MagmaFull:
+            return laset<T, 0>(m, n, offdiag, diag, dA, dA_offset, ldda, queue);
+        case MagmaLower:
+            return laset<T, 1>(m, n, offdiag, diag, dA, dA_offset, ldda, queue);
+        case MagmaUpper:
+            return laset<T, 2>(m, n, offdiag, diag, dA, dA_offset, ldda, queue);
+        default: return;
     }
-
 }
 
-#define INSTANTIATE(T)                                      \
-    template void magmablas_laset<T>(                       \
-        magma_uplo_t uplo, magma_int_t m, magma_int_t n,    \
-        T offdiag, T diag,                                  \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,      \
-        magma_queue_t queue);                               \
+#define INSTANTIATE(T)                                                      \
+    template void magmablas_laset<T>(                                       \
+        magma_uplo_t uplo, magma_int_t m, magma_int_t n, T offdiag, T diag, \
+        cl_mem dA, size_t dA_offset, magma_int_t ldda, magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/laset_band.cpp b/src/backend/opencl/magma/laset_band.cpp
index f802c4ae6b..ba4e35360e 100644
--- a/src/backend/opencl/magma/laset_band.cpp
+++ b/src/backend/opencl/magma/laset_band.cpp
@@ -7,9 +7,53 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
 #if 0  // Needs to be enabled when unmqr2 is enabled
-#include "magma_data.h"
 #include "kernel/laset_band.hpp"
+#include "magma_data.h"
 
 #include <algorithm>
 
@@ -52,12 +96,10 @@ magmablas_laset_band(magma_uplo_t uplo, magma_int_t m, magma_int_t n, magma_int_
 
 }
 
-#define INSTANTIATE(T)                                  \
-    template void magmablas_laset_band<T>(              \
-        magma_uplo_t uplo,                              \
-        magma_int_t m, magma_int_t n, magma_int_t k,    \
-        T offdiag, T diag,                              \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
+#define INSTANTIATE(T)                                                    \
+    template void magmablas_laset_band<T>(                                \
+        magma_uplo_t uplo, magma_int_t m, magma_int_t n, magma_int_t k,   \
+        T offdiag, T diag, cl_mem dA, size_t dA_offset, magma_int_t ldda, \
         magma_queue_t queue);                           \
 
 INSTANTIATE(float)
diff --git a/src/backend/opencl/magma/laswp.cpp b/src/backend/opencl/magma/laswp.cpp
index 2d4bca5a59..14d24e61c7 100644
--- a/src/backend/opencl/magma/laswp.cpp
+++ b/src/backend/opencl/magma/laswp.cpp
@@ -7,46 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include "magma_data.h"
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
 #include "kernel/laswp.hpp"
+#include "magma_data.h"
 
 #include <algorithm>
 
-template<typename T> void
-magmablas_laswp(
-    magma_int_t n,
-    cl_mem dAT, size_t dAT_offset, magma_int_t ldda,
-    magma_int_t k1, magma_int_t k2,
-    const magma_int_t *ipiv, magma_int_t inci,
-    magma_queue_t queue)
-{
+template<typename T>
+void magmablas_laswp(magma_int_t n, cl_mem dAT, size_t dAT_offset,
+                     magma_int_t ldda, magma_int_t k1, magma_int_t k2,
+                     const magma_int_t *ipiv, magma_int_t inci,
+                     magma_queue_t queue) {
     magma_int_t info = 0;
-    if ( n < 0 )
+    if (n < 0) {
         info = -1;
-    else if ( k1 < 1 )
+    } else if (k1 < 1) {
         info = -4;
-    else if ( k2 < 1 )
+    } else if (k2 < 1) {
         info = -5;
-    else if ( inci <= 0 )
+    } else if (inci <= 0) {
         info = -7;
+    }
 
     if (info != 0) {
-        //magma_xerbla( __func__, -(info) );
-        return;  //info;
+        // magma_xerbla( __func__, -(info) );
+        return;  // info;
     }
 
-    opencl::kernel::laswp<T>(n, dAT, dAT_offset, ldda, k1, k2, ipiv, inci);
+    cl::CommandQueue q(queue, true);
+    arrayfire::opencl::kernel::laswp<T>(n, dAT, dAT_offset, ldda, k1, k2, ipiv,
+                                        inci, q);
 }
 
-
-#define INSTANTIATE(T)                                      \
-    template void magmablas_laswp<T>(                       \
-        magma_int_t n,                                      \
-        cl_mem dAT, size_t dAT_offset, magma_int_t ldda,    \
-        magma_int_t k1, magma_int_t k2,                     \
-        const magma_int_t *ipiv, magma_int_t inci,          \
-        magma_queue_t queue);                               \
-
+#define INSTANTIATE(T)                                                  \
+    template void magmablas_laswp<T>(                                   \
+        magma_int_t n, cl_mem dAT, size_t dAT_offset, magma_int_t ldda, \
+        magma_int_t k1, magma_int_t k2, const magma_int_t *ipiv,        \
+        magma_int_t inci, magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/magma.h b/src/backend/opencl/magma/magma.h
index 5584c20307..df1923b746 100644
--- a/src/backend/opencl/magma/magma.h
+++ b/src/backend/opencl/magma/magma.h
@@ -13,55 +13,45 @@
 #include "magma_common.h"
 
 template<typename Ty>
-magma_int_t magma_getrf_gpu(magma_int_t m, magma_int_t n,
-                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                            magma_int_t *ipiv,
-                            magma_queue_t queue,
+magma_int_t magma_getrf_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                            size_t dA_offset, magma_int_t ldda,
+                            magma_int_t *ipiv, magma_queue_t queue,
                             magma_int_t *info);
 
 template<typename Ty>
-magma_int_t magma_potrf_gpu(magma_uplo_t   uplo, magma_int_t    n,
-                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                            magma_queue_t queue,
-                            magma_int_t*   info);
+magma_int_t magma_potrf_gpu(magma_uplo_t uplo, magma_int_t n, cl_mem dA,
+                            size_t dA_offset, magma_int_t ldda,
+                            magma_queue_t queue, magma_int_t *info);
 
-template<typename Ty> magma_int_t
-magma_larfb_gpu(
-    magma_side_t side, magma_trans_t trans, magma_direct_t direct, magma_storev_t storev,
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dV   , size_t dV_offset,    magma_int_t lddv,
-    cl_mem dT   , size_t dT_offset,    magma_int_t lddt,
-    cl_mem dC   , size_t dC_offset,    magma_int_t lddc,
-    cl_mem dwork, size_t dwork_offset, magma_int_t ldwork,
-    magma_queue_t queue);
+template<typename Ty>
+magma_int_t magma_larfb_gpu(magma_side_t side, magma_trans_t trans,
+                            magma_direct_t direct, magma_storev_t storev,
+                            magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dV, size_t dV_offset, magma_int_t lddv,
+                            cl_mem dT, size_t dT_offset, magma_int_t lddt,
+                            cl_mem dC, size_t dC_offset, magma_int_t lddc,
+                            cl_mem dwork, size_t dwork_offset,
+                            magma_int_t ldwork, magma_queue_t queue);
 
-template<typename Ty> magma_int_t
-magma_geqrf2_gpu(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    magma_queue_t* queue,
-    magma_int_t *info);
+template<typename Ty>
+magma_int_t magma_geqrf2_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                             size_t dA_offset, magma_int_t ldda, Ty *tau,
+                             magma_queue_t *queue, magma_int_t *info);
 
-template<typename Ty> magma_int_t
-magma_geqrf3_gpu(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA, size_t dA_offset,  magma_int_t ldda,
-    Ty *tau, cl_mem dT, size_t dT_offset,
-    magma_queue_t queue,
-    magma_int_t *info);
+template<typename Ty>
+magma_int_t magma_geqrf3_gpu(magma_int_t m, magma_int_t n, cl_mem dA,
+                             size_t dA_offset, magma_int_t ldda, Ty *tau,
+                             cl_mem dT, size_t dT_offset, magma_queue_t queue,
+                             magma_int_t *info);
 
-template<typename Ty>  magma_int_t
-magma_unmqr_gpu(
-    magma_side_t side, magma_trans_t trans,
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    cl_mem dC, size_t dC_offset, magma_int_t lddc,
-    Ty *hwork, magma_int_t lwork,
-    cl_mem dT, size_t dT_offset, magma_int_t nb,
-    magma_queue_t queue,
-    magma_int_t *info);
+template<typename Ty>
+magma_int_t magma_unmqr_gpu(magma_side_t side, magma_trans_t trans,
+                            magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                            Ty *tau, cl_mem dC, size_t dC_offset,
+                            magma_int_t lddc, Ty *hwork, magma_int_t lwork,
+                            cl_mem dT, size_t dT_offset, magma_int_t nb,
+                            magma_queue_t queue, magma_int_t *info);
 
 #if 0  // Needs to be enabled when unmqr2 is enabled
 template<typename Ty> magma_int_t
@@ -76,21 +66,35 @@ magma_unmqr2_gpu(
     magma_int_t *info);
 #endif
 
-template<typename Ty>  magma_int_t
-magma_ungqr_gpu(
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    cl_mem dT, size_t dT_offset, magma_int_t nb,
-    magma_queue_t queue,
-    magma_int_t *info);
+template<typename Ty>
+magma_int_t magma_ungqr_gpu(magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                            Ty *tau, cl_mem dT, size_t dT_offset,
+                            magma_int_t nb, magma_queue_t queue,
+                            magma_int_t *info);
+
+template<typename Ty>
+magma_int_t magma_getrs_gpu(magma_trans_t trans, magma_int_t n,
+                            magma_int_t nrhs, cl_mem dA, size_t dA_offset,
+                            magma_int_t ldda, magma_int_t *ipiv, cl_mem dB,
+                            size_t dB_offset, magma_int_t lddb,
+                            magma_queue_t queue, magma_int_t *info);
+
+template<typename Ty>
+magma_int_t magma_labrd_gpu(magma_int_t m, magma_int_t n, magma_int_t nb, Ty *a,
+                            magma_int_t lda, cl_mem da, size_t da_offset,
+                            magma_int_t ldda, void *_d, void *_e, Ty *tauq,
+                            Ty *taup, Ty *x, magma_int_t ldx, cl_mem dx,
+                            size_t dx_offset, magma_int_t lddx, Ty *y,
+                            magma_int_t ldy, cl_mem dy, size_t dy_offset,
+                            magma_int_t lddy, magma_queue_t queue);
 
-template<typename Ty>  magma_int_t
-magma_getrs_gpu(magma_trans_t trans, magma_int_t n, magma_int_t nrhs,
-                cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                magma_int_t *ipiv,
-                cl_mem dB, size_t dB_offset, magma_int_t lddb,
-                magma_queue_t queue,
-                magma_int_t *info);
+template<typename Ty>
+magma_int_t magma_gebrd_hybrid(magma_int_t m, magma_int_t n, Ty *a,
+                               magma_int_t lda, cl_mem da, size_t da_offset,
+                               magma_int_t ldda, void *_d, void *_e, Ty *tauq,
+                               Ty *taup, Ty *work, magma_int_t lwork,
+                               magma_queue_t queue, magma_int_t *info,
+                               bool copy);
 
 #endif
diff --git a/src/backend/opencl/magma/magma_blas.h b/src/backend/opencl/magma/magma_blas.h
index 4dc349ce27..62f3290121 100644
--- a/src/backend/opencl/magma/magma_blas.h
+++ b/src/backend/opencl/magma/magma_blas.h
@@ -10,61 +10,29 @@
 #ifndef __MAGMA_BLAS_H
 #define __MAGMA_BLAS_H
 
-#include "magma_common.h"
-#include <types.hpp>
-#include <clBLAS.h>
-#include <err_clblas.hpp>
-
-using opencl::cfloat;
-using opencl::cdouble;
-
-#define BLAS_FUNC_DEF(NAME)                     \
-    template<typename T>                        \
-    struct NAME##_func;
-
-#define BLAS_FUNC(NAME, TYPE, PREFIX)                       \
-    template<>                                              \
-    struct NAME##_func<TYPE>                                \
-    {                                                       \
-        template<typename... Args>                          \
-            void                                            \
-            operator() (Args... args)                       \
-        {                                                   \
-            CLBLAS_CHECK(clblas##PREFIX##NAME(args...));    \
-        }                                                   \
-    };
-
-BLAS_FUNC_DEF(gemm)
-BLAS_FUNC(gemm, float,      S)
-BLAS_FUNC(gemm, double,     D)
-BLAS_FUNC(gemm, cfloat,     C)
-BLAS_FUNC(gemm, cdouble,    Z)
+// This file contains the common interface for Magma OpenCL BLAS
+// functions. They can be implemented in different back-ends,
+// such as CLBlast or clBLAS.
 
-BLAS_FUNC_DEF(trmm)
-BLAS_FUNC(trmm, float,      S)
-BLAS_FUNC(trmm, double,     D)
-BLAS_FUNC(trmm, cfloat,     C)
-BLAS_FUNC(trmm, cdouble,    Z)
-
-BLAS_FUNC_DEF(trsm)
-BLAS_FUNC(trsm, float,      S)
-BLAS_FUNC(trsm, double,     D)
-BLAS_FUNC(trsm, cfloat,     C)
-BLAS_FUNC(trsm, cdouble,    Z)
-
-BLAS_FUNC_DEF(trsv)
-BLAS_FUNC(trsv, float,      S)
-BLAS_FUNC(trsv, double,     D)
-BLAS_FUNC(trsv, cfloat,     C)
-BLAS_FUNC(trsv, cdouble,    Z)
-
-#define clblasSherk(...) clblasSsyrk(__VA_ARGS__)
-#define clblasDherk(...) clblasDsyrk(__VA_ARGS__)
-
-BLAS_FUNC_DEF(herk)
-BLAS_FUNC(herk, float,      S)
-BLAS_FUNC(herk, double,     D)
-BLAS_FUNC(herk, cfloat,     C)
-BLAS_FUNC(herk, cdouble,    Z)
+#include <types.hpp>
+#include "magma_common.h"
 
-#endif
+using arrayfire::opencl::cdouble;
+using arrayfire::opencl::cfloat;
+
+template<typename T>
+struct gpu_blas_gemm_func;
+template<typename T>
+struct gpu_blas_gemv_func;
+template<typename T>
+struct gpu_blas_trmm_func;
+template<typename T>
+struct gpu_blas_trsm_func;
+template<typename T>
+struct gpu_blas_trsv_func;
+template<typename T>
+struct gpu_blas_herk_func;
+
+#include "magma_blas_clblast.h"
+
+#endif  // __MAGMA_BLAS_H
diff --git a/src/backend/opencl/magma/magma_blas_clblast.h b/src/backend/opencl/magma/magma_blas_clblast.h
new file mode 100644
index 0000000000..bb2bfbeee5
--- /dev/null
+++ b/src/backend/opencl/magma/magma_blas_clblast.h
@@ -0,0 +1,326 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <complex>
+
+#include <common/defines.hpp>
+#include <common/half.hpp>
+
+#include <clblast.h>
+#include <err_clblast.hpp>
+
+// Convert MAGMA constants to CLBlast constants
+clblast::Layout clblast_order_const(magma_order_t order);
+clblast::Transpose clblast_trans_const(magma_trans_t trans);
+clblast::Triangle clblast_uplo_const(magma_uplo_t uplo);
+clblast::Diagonal clblast_diag_const(magma_diag_t diag);
+clblast::Side clblast_side_const(magma_side_t side);
+
+// Error checking
+#define OPENCL_BLAS_CHECK CLBLAST_CHECK
+
+// Transposing
+#define OPENCL_BLAS_TRANS_T clblast::Transpose  // the type
+#define OPENCL_BLAS_NO_TRANS clblast::Transpose::kNo
+#define OPENCL_BLAS_TRANS clblast::Transpose::kYes
+#define OPENCL_BLAS_CONJ_TRANS clblast::Transpose::kConjugate
+
+// Triangles
+#define OPENCL_BLAS_TRIANGLE_T clblast::Triangle  // the type
+#define OPENCL_BLAS_TRIANGLE_UPPER clblast::Triangle::kUpper
+#define OPENCL_BLAS_TRIANGLE_LOWER clblast::Triangle::kLower
+
+// Sides
+#define OPENCL_BLAS_SIDE_RIGHT clblast::Side::kRight
+#define OPENCL_BLAS_SIDE_LEFT clblast::Side::kLeft
+
+// Unit or non-unit diagonal
+#define OPENCL_BLAS_UNIT_DIAGONAL clblast::Diagonal::kUnit
+#define OPENCL_BLAS_NON_UNIT_DIAGONAL clblast::Diagonal::kNonUnit
+
+// Defines type conversions from ArrayFire (OpenCL) to CLBlast (C++ std)
+template<typename T>
+struct CLBlastType {
+    using Type = T;
+};
+template<>
+struct CLBlastType<cfloat> {
+    using Type = std::complex<float>;
+};
+template<>
+struct CLBlastType<cdouble> {
+    using Type = std::complex<double>;
+};
+template<>
+struct CLBlastType<arrayfire::common::half> {
+    using Type = cl_half;
+};
+
+// Converts a constant from ArrayFire types (OpenCL) to CLBlast types (C++ std)
+template<typename T>
+typename CLBlastType<T>::Type inline toCLBlastConstant(const T val);
+
+// Specializations of the above function
+template<>
+float inline toCLBlastConstant(const float val) {
+    return val;
+}
+template<>
+double inline toCLBlastConstant(const double val) {
+    return val;
+}
+template<>
+cl_half inline toCLBlastConstant(const arrayfire::common::half val) {
+    cl_half out;
+    memcpy(&out, &val, sizeof(cl_half));
+    return out;
+}
+template<>
+std::complex<float> inline toCLBlastConstant(cfloat val) {
+    return {val.s[0], val.s[1]};
+}
+template<>
+std::complex<double> inline toCLBlastConstant(cdouble val) {
+    return {val.s[0], val.s[1]};
+}
+
+// Conversions to CLBlast basic types
+template<typename T>
+struct CLBlastBasicType {
+    using Type = T;
+};
+template<>
+struct CLBlastBasicType<arrayfire::common::half> {
+    using Type = cl_half;
+};
+template<>
+struct CLBlastBasicType<cfloat> {
+    using Type = float;
+};
+template<>
+struct CLBlastBasicType<cdouble> {
+    using Type = double;
+};
+
+// Initialization of the OpenCL BLAS library
+// Only meant to be called once, from the constructor
+// of the DeviceManager singleton
+// DON'T CALL FROM ANY OTHER LOCATION
+inline void gpu_blas_init() {
+    // Nothing to do here for CLBlast
+}
+
+// Tear-down of the OpenCL BLAS library
+// Only meant to be called from the destructor
+// of the DeviceManager singleton
+// DON'T CALL FROM ANY OTHER LOCATION
+inline void gpu_blas_deinit() {
+    // Nothing to do here for CLBlast
+}
+
+template<typename T>
+struct gpu_blas_gemm_func {
+    clblast::StatusCode operator()(
+        const clblast::Transpose a_transpose,
+        const clblast::Transpose b_transpose, const size_t m, const size_t n,
+        const size_t k, const T alpha, const cl_mem a_buffer,
+        const size_t a_offset, const size_t a_ld, const cl_mem b_buffer,
+        const size_t b_offset, const size_t b_ld, const T beta, cl_mem c_buffer,
+        const size_t c_offset, const size_t c_ld, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Gemm(
+            clblast::Layout::kColMajor, a_transpose, b_transpose, m, n, k,
+            alpha_clblast, a_buffer, a_offset, a_ld, b_buffer, b_offset, b_ld,
+            beta_clblast, c_buffer, c_offset, c_ld, queues, events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_gemv_func {
+    clblast::StatusCode operator()(
+        const clblast::Transpose a_transpose, const size_t m, const size_t n,
+        const T alpha, const cl_mem a_buffer, const size_t a_offset,
+        const size_t a_ld, const cl_mem x_buffer, const size_t x_offset,
+        const size_t x_inc, const T beta, cl_mem y_buffer,
+        const size_t y_offset, const size_t y_inc, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Gemv(clblast::Layout::kColMajor, a_transpose, m, n,
+                             alpha_clblast, a_buffer, a_offset, a_ld, x_buffer,
+                             x_offset, x_inc, beta_clblast, y_buffer, y_offset,
+                             y_inc, queues, events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_trmm_func {
+    clblast::StatusCode operator()(
+        const clblast::Side side, const clblast::Triangle triangle,
+        const clblast::Transpose a_transpose, const clblast::Diagonal diagonal,
+        const size_t m, const size_t n, const T alpha, const cl_mem a_buffer,
+        const size_t a_offset, const size_t a_ld, cl_mem b_buffer,
+        const size_t b_offset, const size_t b_ld, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        return clblast::Trmm(clblast::Layout::kColMajor, side, triangle,
+                             a_transpose, diagonal, m, n, alpha_clblast,
+                             a_buffer, a_offset, a_ld, b_buffer, b_offset, b_ld,
+                             queues, events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_trsm_func {
+    clblast::StatusCode operator()(
+        const clblast::Side side, const clblast::Triangle triangle,
+        const clblast::Transpose a_transpose, const clblast::Diagonal diagonal,
+        const size_t m, const size_t n, const T alpha, const cl_mem a_buffer,
+        const size_t a_offset, const size_t a_ld, cl_mem b_buffer,
+        const size_t b_offset, const size_t b_ld, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        return clblast::Trsm(clblast::Layout::kColMajor, side, triangle,
+                             a_transpose, diagonal, m, n, alpha_clblast,
+                             a_buffer, a_offset, a_ld, b_buffer, b_offset, b_ld,
+                             queues, events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_trsv_func {
+    clblast::StatusCode operator()(
+        const clblast::Triangle triangle, const clblast::Transpose a_transpose,
+        const clblast::Diagonal diagonal, const size_t n, const cl_mem a_buffer,
+        const size_t a_offset, const size_t a_ld, cl_mem x_buffer,
+        const size_t x_offset, const size_t x_inc, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        return clblast::Trsv<typename CLBlastType<T>::Type>(
+            clblast::Layout::kColMajor, triangle, a_transpose, diagonal, n,
+            a_buffer, a_offset, a_ld, x_buffer, x_offset, x_inc, queues,
+            events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_herk_func {
+    using BasicType = typename CLBlastBasicType<T>::Type;
+
+    clblast::StatusCode operator()(
+        const clblast::Triangle triangle, const clblast::Transpose a_transpose,
+        const size_t n, const size_t k, const BasicType alpha,
+        const cl_mem a_buffer, const size_t a_offset, const size_t a_ld,
+        const BasicType beta, cl_mem c_buffer, const size_t c_offset,
+        const size_t c_ld, cl_uint num_queues, cl_command_queue *queues,
+        cl_uint num_wait_events, const cl_event *wait_events,
+        cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Herk(clblast::Layout::kColMajor, triangle, a_transpose,
+                             n, k, alpha_clblast, a_buffer, a_offset, a_ld,
+                             beta_clblast, c_buffer, c_offset, c_ld, queues,
+                             events);
+    }
+};
+
+// Run syrk when calling non-complex herk function (specialisation of the above
+// for 'float')
+template<>
+struct gpu_blas_herk_func<float> {
+    clblast::StatusCode operator()(
+        const clblast::Triangle triangle, const clblast::Transpose a_transpose,
+        const size_t n, const size_t k, const float alpha,
+        const cl_mem a_buffer, const size_t a_offset, const size_t a_ld,
+        const float beta, cl_mem c_buffer, const size_t c_offset,
+        const size_t c_ld, cl_uint num_queues, cl_command_queue *queues,
+        cl_uint num_wait_events, const cl_event *wait_events,
+        cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Syrk(clblast::Layout::kColMajor, triangle, a_transpose,
+                             n, k, alpha_clblast, a_buffer, a_offset, a_ld,
+                             beta_clblast, c_buffer, c_offset, c_ld, queues,
+                             events);
+    }
+};
+
+// Run syrk when calling non-complex herk function (specialisation of the above
+// for 'double')
+template<>
+struct gpu_blas_herk_func<double> {
+    clblast::StatusCode operator()(
+        const clblast::Triangle triangle, const clblast::Transpose a_transpose,
+        const size_t n, const size_t k, const double alpha,
+        const cl_mem a_buffer, const size_t a_offset, const size_t a_ld,
+        const double beta, cl_mem c_buffer, const size_t c_offset,
+        const size_t c_ld, cl_uint num_queues, cl_command_queue *queues,
+        cl_uint num_wait_events, const cl_event *wait_events,
+        cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Syrk(clblast::Layout::kColMajor, triangle, a_transpose,
+                             n, k, alpha_clblast, a_buffer, a_offset, a_ld,
+                             beta_clblast, c_buffer, c_offset, c_ld, queues,
+                             events);
+    }
+};
+
+template<typename T>
+struct gpu_blas_syrk_func {
+    clblast::StatusCode operator()(
+        const clblast::Triangle triangle, const clblast::Transpose a_transpose,
+        const size_t n, const size_t k, const T alpha, const cl_mem a_buffer,
+        const size_t a_offset, const size_t a_ld, const T beta, cl_mem c_buffer,
+        const size_t c_offset, const size_t c_ld, cl_uint num_queues,
+        cl_command_queue *queues, cl_uint num_wait_events,
+        const cl_event *wait_events, cl_event *events) {
+        UNUSED(wait_events);
+        assert(num_queues == 1);
+        assert(num_wait_events == 0);
+        const auto alpha_clblast = toCLBlastConstant(alpha);
+        const auto beta_clblast  = toCLBlastConstant(beta);
+        return clblast::Syrk(clblast::Layout::kColMajor, triangle, a_transpose,
+                             n, k, alpha_clblast, a_buffer, a_offset, a_ld,
+                             beta_clblast, c_buffer, c_offset, c_ld, queues,
+                             events);
+    }
+};
diff --git a/src/backend/opencl/magma/magma_common.h b/src/backend/opencl/magma/magma_common.h
index 0a84147db1..82365cadc5 100644
--- a/src/backend/opencl/magma/magma_common.h
+++ b/src/backend/opencl/magma/magma_common.h
@@ -10,14 +10,7 @@
 #ifndef __MAGMA_COMMON_H
 #define __MAGMA_COMMON_H
 
-#ifdef __APPLE__
-#include <OpenCL/cl.h>
-#else
-#include <CL/cl.h>
-#endif
-
-#define HAVE_clBLAS
-#include <clBLAS.h>
+#include <cl2hpp.hpp>
 
 #include "magma_types.h"
 
diff --git a/src/backend/opencl/magma/magma_cpu_blas.h b/src/backend/opencl/magma/magma_cpu_blas.h
new file mode 100644
index 0000000000..608ddc29aa
--- /dev/null
+++ b/src/backend/opencl/magma/magma_cpu_blas.h
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#ifndef MAGMA_CPU_BLAS
+#define MAGMA_CPU_BLAS
+#include <common/blas_headers.hpp>
+#include <common/defines.hpp>
+#include <err_opencl.hpp>
+#include "magma_types.h"
+
+#define CPU_BLAS_FUNC_DEF(NAME) \
+    template<typename T>        \
+    struct cpu_blas_##NAME##_func;
+
+#define CPU_BLAS_FUNC1(NAME, TYPE, X)                \
+    template<>                                       \
+    struct cpu_blas_##NAME##_func<TYPE> {            \
+        template<typename... Args>                   \
+        void operator()(Args... args) {              \
+            cblas_##X##NAME(CblasColMajor, args...); \
+        }                                            \
+    };
+
+#define CPU_BLAS_FUNC2(NAME, TYPE, X)     \
+    template<>                            \
+    struct cpu_blas_##NAME##_func<TYPE> { \
+        template<typename... Args>        \
+        void operator()(Args... args) {   \
+            cblas_##X##NAME(args...);     \
+        }                                 \
+    };
+
+#define CPU_BLAS_DECL1(NAME)                   \
+    CPU_BLAS_FUNC_DEF(NAME)                    \
+    CPU_BLAS_FUNC1(NAME, float, s)             \
+    CPU_BLAS_FUNC1(NAME, double, d)            \
+    CPU_BLAS_FUNC1(NAME, magmaFloatComplex, c) \
+    CPU_BLAS_FUNC1(NAME, magmaDoubleComplex, z)
+
+#define CPU_BLAS_DECL2(NAME)                   \
+    CPU_BLAS_FUNC_DEF(NAME)                    \
+    CPU_BLAS_FUNC2(NAME, float, s)             \
+    CPU_BLAS_FUNC2(NAME, double, d)            \
+    CPU_BLAS_FUNC2(NAME, magmaFloatComplex, c) \
+    CPU_BLAS_FUNC2(NAME, magmaDoubleComplex, z)
+
+CPU_BLAS_DECL1(gemv)
+CPU_BLAS_DECL2(scal)
+CPU_BLAS_DECL2(axpy)
+
+inline float *cblas_ptr(float *in) { return in; }
+inline double *cblas_ptr(double *in) { return in; }
+
+#if defined(IS_OPENBLAS)
+inline float *cblas_ptr(magmaFloatComplex *in) { return (float *)in; }
+inline double *cblas_ptr(magmaDoubleComplex *in) { return (double *)in; }
+#else
+inline void *cblas_ptr(magmaFloatComplex *in) { return (void *)in; }
+inline void *cblas_ptr(magmaDoubleComplex *in) { return (void *)in; }
+#endif
+
+inline float cblas_scalar(float *in) { return *in; }
+inline double cblas_scalar(double *in) { return *in; }
+
+#if defined(IS_OPENBLAS)
+inline float *cblas_scalar(magmaFloatComplex *in) { return (float *)in; }
+inline double *cblas_scalar(magmaDoubleComplex *in) { return (double *)in; }
+#else
+inline void *cblas_scalar(magmaFloatComplex *in) { return (void *)in; }
+inline void *cblas_scalar(magmaDoubleComplex *in) { return (void *)in; }
+#endif
+
+#endif
diff --git a/src/backend/opencl/magma/magma_cpu_lapack.h b/src/backend/opencl/magma/magma_cpu_lapack.h
index 4410e1b4e3..5bba77d0cb 100644
--- a/src/backend/opencl/magma/magma_cpu_lapack.h
+++ b/src/backend/opencl/magma/magma_cpu_lapack.h
@@ -10,12 +10,36 @@
 #ifndef MAGMA_CPU_LAPACK
 #define MAGMA_CPU_LAPACK
 
+#include <common/defines.hpp>
+#include <err_opencl.hpp>
 #include "magma_types.h"
 
 #define LAPACKE_sunmqr_work(...) LAPACKE_sormqr_work(__VA_ARGS__)
 #define LAPACKE_dunmqr_work(...) LAPACKE_dormqr_work(__VA_ARGS__)
 #define LAPACKE_sungqr_work(...) LAPACKE_sorgqr_work(__VA_ARGS__)
 #define LAPACKE_dungqr_work(...) LAPACKE_dorgqr_work(__VA_ARGS__)
+#define LAPACKE_sungbr_work(...) LAPACKE_sorgbr_work(__VA_ARGS__)
+#define LAPACKE_dungbr_work(...) LAPACKE_dorgbr_work(__VA_ARGS__)
+
+template<typename... Args>
+int LAPACKE_slacgv(Args... /*args*/) {
+    return 0;
+}
+
+template<typename... Args>
+int LAPACKE_dlacgv(Args... /*args*/) {
+    return 0;
+}
+
+template<typename... Args>
+int LAPACKE_slacgv_work(Args... /*args*/) {
+    return 0;
+}
+
+template<typename... Args>
+int LAPACKE_dlacgv_work(Args... /*args*/) {
+    return 0;
+}
 
 #define lapack_complex_float magmaFloatComplex
 #define lapack_complex_double magmaDoubleComplex
@@ -23,77 +47,98 @@
 #define ORDER_TYPE int
 #define LAPACK_NAME(fn) LAPACKE_##fn
 
-#if defined(__APPLE__)
-    #define LAPACK_COL_MAJOR 102
-    #include "../../lapacke.hpp"
+#ifdef USE_MKL
+#include <mkl_lapacke.h>
 #else
-    #ifdef USE_MKL
-        #include<mkl_lapacke.h>
-    #else // NETLIB LAPACKE
-        #include<lapacke.h>
-    #endif  // MKL/NETLIB
-#endif  //APPLE
-
-#define CPU_LAPACK_FUNC_DEF(NAME)               \
-    template<typename T>                        \
-    struct NAME##_func;
-
-#define CPU_LAPACK_FUNC(NAME, TYPE, X)              \
-    template<>                                      \
-    struct NAME##_func<TYPE>                        \
-    {                                               \
-        template<typename... Args>                  \
-            int                                     \
-            operator() (Args... args)               \
-        { return LAPACK_NAME(X##NAME)(args...); }   \
+#ifdef __APPLE__
+#include <Accelerate/Accelerate.h>
+#include <common/lapacke.hpp>
+#undef LAPACK_COL_MAJOR
+#define LAPACK_COL_MAJOR 102
+#undef AF_LAPACK_COL_MAJOR
+#define AF_LAPACK_COL_MAJOR 0
+#else  // NETLIB LAPACKE
+#include <lapacke.h>
+#endif
+#endif
+
+#define LAPACKE_CHECK(fn)                                    \
+    do {                                                     \
+        int __info = fn;                                     \
+        if (__info != 0) {                                   \
+            char lapacke_st_msg[32];                         \
+            snprintf(lapacke_st_msg, sizeof(lapacke_st_msg), \
+                     "LAPACKE Error (%d)", (int)(__info));   \
+            AF_ERROR(lapacke_st_msg, AF_ERR_INTERNAL);       \
+        }                                                    \
+    } while (0)
+
+#define CPU_LAPACK_FUNC_DEF(NAME) \
+    template<typename T>          \
+    struct cpu_lapack_##NAME##_func;
+
+#define CPU_LAPACK_FUNC1(NAME, TYPE, X)                             \
+    template<>                                                      \
+    struct cpu_lapack_##NAME##_func<TYPE> {                         \
+        template<typename... Args>                                  \
+        int operator()(Args... args) {                              \
+            return LAPACK_NAME(X##NAME)(LAPACK_COL_MAJOR, args...); \
+        }                                                           \
+    };
+
+#define CPU_LAPACK_FUNC2(NAME, TYPE, X)           \
+    template<>                                    \
+    struct cpu_lapack_##NAME##_func<TYPE> {       \
+        template<typename... Args>                \
+        int operator()(Args... args) {            \
+            return LAPACK_NAME(X##NAME)(args...); \
+        }                                         \
+    };
+
+#define CPU_LAPACK_FUNC3(NAME, TYPE, X)           \
+    template<>                                    \
+    struct cpu_lapack_##NAME##_func<TYPE> {       \
+        template<typename... Args>                \
+        double operator()(Args... args) {         \
+            return LAPACK_NAME(X##NAME)(args...); \
+        }                                         \
     };
 
-CPU_LAPACK_FUNC_DEF(getrf)
-CPU_LAPACK_FUNC(getrf, float,      s)
-CPU_LAPACK_FUNC(getrf, double,     d)
-CPU_LAPACK_FUNC(getrf, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(getrf, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(potrf)
-CPU_LAPACK_FUNC(potrf, float,      s)
-CPU_LAPACK_FUNC(potrf, double,     d)
-CPU_LAPACK_FUNC(potrf, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(potrf, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(trtri)
-CPU_LAPACK_FUNC(trtri, float,      s)
-CPU_LAPACK_FUNC(trtri, double,     d)
-CPU_LAPACK_FUNC(trtri, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(trtri, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(geqrf_work)
-CPU_LAPACK_FUNC(geqrf_work, float,      s)
-CPU_LAPACK_FUNC(geqrf_work, double,     d)
-CPU_LAPACK_FUNC(geqrf_work, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(geqrf_work, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(larft)
-CPU_LAPACK_FUNC(larft, float,      s)
-CPU_LAPACK_FUNC(larft, double,     d)
-CPU_LAPACK_FUNC(larft, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(larft, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(unmqr_work)
-CPU_LAPACK_FUNC(unmqr_work, float,      s)
-CPU_LAPACK_FUNC(unmqr_work, double,     d)
-CPU_LAPACK_FUNC(unmqr_work, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(unmqr_work, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(ungqr_work)
-CPU_LAPACK_FUNC(ungqr_work, float,      s)
-CPU_LAPACK_FUNC(ungqr_work, double,     d)
-CPU_LAPACK_FUNC(ungqr_work, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(ungqr_work, magmaDoubleComplex,    z)
-
-CPU_LAPACK_FUNC_DEF(laswp)
-CPU_LAPACK_FUNC(laswp, float,      s)
-CPU_LAPACK_FUNC(laswp, double,     d)
-CPU_LAPACK_FUNC(laswp, magmaFloatComplex,     c)
-CPU_LAPACK_FUNC(laswp, magmaDoubleComplex,    z)
+#define CPU_LAPACK_DECL1(NAME)                   \
+    CPU_LAPACK_FUNC_DEF(NAME)                    \
+    CPU_LAPACK_FUNC1(NAME, float, s)             \
+    CPU_LAPACK_FUNC1(NAME, double, d)            \
+    CPU_LAPACK_FUNC1(NAME, magmaFloatComplex, c) \
+    CPU_LAPACK_FUNC1(NAME, magmaDoubleComplex, z)
+
+#define CPU_LAPACK_DECL2(NAME)                   \
+    CPU_LAPACK_FUNC_DEF(NAME)                    \
+    CPU_LAPACK_FUNC2(NAME, float, s)             \
+    CPU_LAPACK_FUNC2(NAME, double, d)            \
+    CPU_LAPACK_FUNC2(NAME, magmaFloatComplex, c) \
+    CPU_LAPACK_FUNC2(NAME, magmaDoubleComplex, z)
+
+#define CPU_LAPACK_DECL3(NAME)       \
+    CPU_LAPACK_FUNC_DEF(NAME)        \
+    CPU_LAPACK_FUNC3(NAME, float, s) \
+    CPU_LAPACK_FUNC3(NAME, double, d)
+
+CPU_LAPACK_DECL1(getrf)
+CPU_LAPACK_DECL1(gebrd_work)
+CPU_LAPACK_DECL1(potrf)
+CPU_LAPACK_DECL1(trtri)
+CPU_LAPACK_DECL1(geqrf_work)
+CPU_LAPACK_DECL1(larft)
+CPU_LAPACK_DECL1(unmqr_work)
+CPU_LAPACK_DECL1(ungqr_work)
+CPU_LAPACK_DECL1(ungbr_work)
+CPU_LAPACK_DECL1(bdsqr_work)
+CPU_LAPACK_DECL1(laswp)
+CPU_LAPACK_DECL1(laset)
+
+CPU_LAPACK_DECL2(lacgv_work)
+CPU_LAPACK_DECL2(larfg_work)
+CPU_LAPACK_DECL1(lacpy)
+CPU_LAPACK_DECL3(lamch)
 
 #endif
diff --git a/src/backend/opencl/magma/magma_data.h b/src/backend/opencl/magma/magma_data.h
index 34b0a5397f..04a1e5261c 100644
--- a/src/backend/opencl/magma/magma_data.h
+++ b/src/backend/opencl/magma/magma_data.h
@@ -52,43 +52,37 @@
  *
  **********************************************************************/
 
-
 #ifndef MAGMA_DATA_H
 #define MAGMA_DATA_H
-#include <iostream>
 
+#include <memory.hpp>
 #include <platform.hpp>
 #include "magma_types.h"
 
-#define check_error( err ) if (err != CL_SUCCESS) throw cl::Error(err);
+#define check_error(err) \
+    if (err != CL_SUCCESS) throw cl::Error(err);
 
 // ========================================
 // memory allocation
 // Allocate size bytes on GPU, returning pointer in ptrPtr.
-template<typename T> static magma_int_t
-magma_malloc( magma_ptr* ptrPtr, int num)
-{
+template<typename T>
+static magma_int_t magma_malloc(magma_ptr* ptrPtr, int num) {
     size_t size = num * sizeof(T);
-    // malloc and free sometimes don't work for size=0, so allocate some minimal size
-    if ( size == 0 )
-        size = sizeof(T);
-    cl_int err;
-    *ptrPtr = clCreateBuffer(opencl::getContext()(), CL_MEM_READ_WRITE, size, NULL, &err );
-    if ( err != clblasSuccess ) {
-        return MAGMA_ERR_DEVICE_ALLOC;
-    }
+    // malloc and free sometimes don't work for size=0, so allocate some minimal
+    // size
+    if (size == 0) size = sizeof(T);
+    cl::Buffer* buf = arrayfire::opencl::bufferAlloc(size);
+    *ptrPtr         = static_cast<magma_ptr>(buf->get());
+    delete (buf);
+
+    if (*ptrPtr == nullptr) { return MAGMA_ERR_DEVICE_ALLOC; }
     return MAGMA_SUCCESS;
 }
 
 // --------------------
 // Free GPU memory allocated by magma_malloc.
-static inline magma_int_t
-magma_free(cl_mem ptr)
-{
-    cl_int err = clReleaseMemObject( ptr );
-    if ( err != clblasSuccess ) {
-        return MAGMA_ERR_INVALID_PTR;
-    }
+static inline magma_int_t magma_free(magma_ptr ptr) {
+    arrayfire::opencl::memFree(ptr);
     return MAGMA_SUCCESS;
 }
 
@@ -100,395 +94,304 @@ magma_free(cl_mem ptr)
 // to align memory to a 32 byte boundary.
 // Use magma_free_cpu() to free this memory.
 
-template<typename T> static magma_int_t
-magma_malloc_cpu(T** ptrPtr, int num)
-{
+template<typename T>
+static magma_int_t magma_malloc_cpu(T** ptrPtr, int num) {
     size_t size = num * sizeof(T);
-    // malloc and free sometimes don't work for size=0, so allocate some minimal size
-    if ( size == 0 )
-        size = sizeof(T);
+    // malloc and free sometimes don't work for size=0, so allocate some minimal
+    // size
+    if (size == 0) size = sizeof(T);
 #if 1
-    #if defined( _WIN32 ) || defined( _WIN64 )
-    *ptrPtr = (T *)_aligned_malloc( size, 32 );
-    if ( *ptrPtr == NULL ) {
-        return MAGMA_ERR_HOST_ALLOC;
-    }
-    #else
-    int err = posix_memalign((void **)ptrPtr, 32, size );
-    if ( err != 0 ) {
+#if defined(_WIN32) || defined(_WIN64)
+    *ptrPtr = (T*)_aligned_malloc(size, 32);
+    if (*ptrPtr == NULL) { return MAGMA_ERR_HOST_ALLOC; }
+#else
+    int err = posix_memalign((void**)ptrPtr, 32, size);
+    if (err != 0) {
         *ptrPtr = NULL;
         return MAGMA_ERR_HOST_ALLOC;
     }
-    #endif
+#endif
 #else
-    *ptrPtr = malloc( size );
-    if ( *ptrPtr == NULL ) {
-        return MAGMA_ERR_HOST_ALLOC;
-    }
+    *ptrPtr = malloc(size);
+    if (*ptrPtr == NULL) { return MAGMA_ERR_HOST_ALLOC; }
 #endif
     return MAGMA_SUCCESS;
 }
 
 // --------------------
 // Free CPU pinned memory previously allocated by magma_malloc_pinned.
-// The default implementation uses free(), which works for both malloc and posix_memalign.
-// For Windows, _aligned_free() is used.
-template<typename T> static magma_int_t
-magma_free_cpu(T* ptr )
-{
-#if defined( _WIN32 ) || defined( _WIN64 )
-    _aligned_free( ptr );
+// The default implementation uses free(), which works for both malloc and
+// posix_memalign. For Windows, _aligned_free() is used.
+template<typename T>
+static magma_int_t magma_free_cpu(T* ptr) {
+#if defined(_WIN32) || defined(_WIN64)
+    _aligned_free(ptr);
 #else
-    free( ptr );
+    free(ptr);
 #endif
     return MAGMA_SUCCESS;
 }
 
 // ========================================
 // copying vectors
-template<typename T> static void
-magma_setvector(
-    magma_int_t n,
-    T const* hx_src,                   magma_int_t incx,
-    cl_mem   dy_dst, size_t dy_offset, magma_int_t incy,
-    magma_queue_t queue )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_setvector(magma_int_t n, T const* hx_src, magma_int_t incx,
+                            cl_mem dy_dst, size_t dy_offset, magma_int_t incy,
+                            magma_queue_t queue) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
-        cl_int err = clEnqueueWriteBuffer(
-            queue, dy_dst, CL_TRUE,
-            dy_offset*sizeof(T), n*sizeof(T),
-            hx_src, 0, NULL, NULL);
-        check_error( err );
-    }
-    else {
+        cl_int err =
+            clEnqueueWriteBuffer(queue, dy_dst, CL_TRUE, dy_offset * sizeof(T),
+                                 n * sizeof(T), hx_src, 0, NULL, NULL);
+        check_error(err);
+    } else {
         magma_int_t ldha = incx;
         magma_int_t lddb = incy;
-        magma_setmatrix( 1, n,
-            hx_src,            ldha,
-            dy_dst, dy_offset, lddb,
-            queue);
+        magma_setmatrix(1, n, hx_src, ldha, dy_dst, dy_offset, lddb, queue);
     }
 }
 
 // --------------------
-template<typename T> static void
-magma_setvector_async(
-    magma_int_t n,
-    T const* hx_src,                   magma_int_t incx,
-    cl_mem   dy_dst, size_t dy_offset, magma_int_t incy,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_setvector_async(magma_int_t n, T const* hx_src,
+                                  magma_int_t incx, cl_mem dy_dst,
+                                  size_t dy_offset, magma_int_t incy,
+                                  magma_queue_t queue, magma_event_t* event) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
-        cl_int err = clEnqueueWriteBuffer(
-            queue, dy_dst, CL_FALSE,
-            dy_offset*sizeof(T), n*sizeof(T),
-            hx_src, 0, NULL, event);
-        check_error( err );
-    }
-    else {
+        cl_int err =
+            clEnqueueWriteBuffer(queue, dy_dst, CL_FALSE, dy_offset * sizeof(T),
+                                 n * sizeof(T), hx_src, 0, NULL, event);
+        check_error(err);
+    } else {
         magma_int_t ldha = incx;
         magma_int_t lddb = incy;
-        magma_setmatrix_async( 1, n,
-            hx_src,            ldha,
-            dy_dst, dy_offset, lddb,
-            queue, event);
+        magma_setmatrix_async(1, n, hx_src, ldha, dy_dst, dy_offset, lddb,
+                              queue, event);
     }
 }
 
 // --------------------
-template<typename T> static void
-magma_getvector(
-    magma_int_t n,
-    cl_mem dx_src, size_t dx_offset, magma_int_t incx,
-    T*     hy_dst,                   magma_int_t incy,
-    magma_queue_t queue )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_getvector(magma_int_t n, cl_mem dx_src, size_t dx_offset,
+                            magma_int_t incx, T* hy_dst, magma_int_t incy,
+                            magma_queue_t queue) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
-        cl_int err = clEnqueueReadBuffer(
-            queue, dx_src, CL_TRUE,
-            dx_offset*sizeof(T), n*sizeof(T),
-            hy_dst, 0, NULL, NULL);
-        check_error( err );
-    }
-    else {
+        cl_int err =
+            clEnqueueReadBuffer(queue, dx_src, CL_TRUE, dx_offset * sizeof(T),
+                                n * sizeof(T), hy_dst, 0, NULL, NULL);
+        check_error(err);
+    } else {
         magma_int_t ldda = incx;
         magma_int_t ldhb = incy;
-        magma_getmatrix( 1, n,
-            dx_src, dx_offset, ldda,
-            hy_dst,            ldhb,
-            queue);
+        magma_getmatrix(1, n, dx_src, dx_offset, ldda, hy_dst, ldhb, queue);
     }
 }
 
 // --------------------
-template<typename T> static void
-magma_getvector_async(
-    magma_int_t n,
-    cl_mem dx_src, size_t dx_offset, magma_int_t incx,
-    T*     hy_dst,                   magma_int_t incy,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_getvector_async(magma_int_t n, cl_mem dx_src,
+                                  size_t dx_offset, magma_int_t incx, T* hy_dst,
+                                  magma_int_t incy, magma_queue_t queue,
+                                  magma_event_t* event) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
-        cl_int err = clEnqueueReadBuffer(
-            queue, dx_src, CL_FALSE,
-            dx_offset*sizeof(T), n*sizeof(T),
-            hy_dst, 0, NULL, event);
-        check_error( err );
-    }
-    else {
+        cl_int err =
+            clEnqueueReadBuffer(queue, dx_src, CL_FALSE, dx_offset * sizeof(T),
+                                n * sizeof(T), hy_dst, 0, NULL, event);
+        check_error(err);
+    } else {
         magma_int_t ldda = incx;
         magma_int_t ldhb = incy;
-        magma_getmatrix_async( 1, n,
-            dx_src, dx_offset, ldda,
-            hy_dst,            ldhb,
-            queue, event);
+        magma_getmatrix_async(1, n, dx_src, dx_offset, ldda, hy_dst, ldhb,
+                              queue, event);
     }
 }
 
 // --------------------
-template<typename T> static void
-magma_copymatrix(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA_src, size_t dA_offset, magma_int_t ldda,
-    cl_mem dB_dst, size_t dB_offset, magma_int_t lddb,
-    magma_queue_t queue )
-{
-    if (m <= 0 || n <= 0)
-        return;
-
-    size_t src_origin[3] = { dA_offset*sizeof(T), 0, 0 };
-    size_t dst_orig[3]   = { dB_offset*sizeof(T), 0, 0 };
-    size_t region[3]     = { m*sizeof(T), static_cast<size_t>(n), 1 };
-    cl_int err = clEnqueueCopyBufferRect(
-        queue, dA_src, dB_dst,
-        src_origin, dst_orig, region,
-        ldda*sizeof(T), 0,
-        lddb*sizeof(T), 0,
-        0, NULL, NULL );
-    check_error( err );
+template<typename T>
+static void magma_copymatrix(magma_int_t m, magma_int_t n, cl_mem dA_src,
+                             size_t dA_offset, magma_int_t ldda, cl_mem dB_dst,
+                             size_t dB_offset, magma_int_t lddb,
+                             magma_queue_t queue) {
+    if (m <= 0 || n <= 0) return;
+
+    size_t src_origin[3] = {dA_offset * sizeof(T), 0, 0};
+    size_t dst_orig[3]   = {dB_offset * sizeof(T), 0, 0};
+    size_t region[3]     = {m * sizeof(T), static_cast<size_t>(n), 1};
+    cl_int err = clEnqueueCopyBufferRect(queue, dA_src, dB_dst, src_origin,
+                                         dst_orig, region, ldda * sizeof(T), 0,
+                                         lddb * sizeof(T), 0, 0, NULL, NULL);
+    check_error(err);
 }
 
 // --------------------
-template<typename T> static void
-magma_copymatrix_async(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA_src, size_t dA_offset, magma_int_t ldda,
-    cl_mem dB_dst, size_t dB_offset, magma_int_t lddb,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (m <= 0 || n <= 0)
-        return;
+template<typename T>
+static void magma_copymatrix_async(magma_int_t m, magma_int_t n, cl_mem dA_src,
+                                   size_t dA_offset, magma_int_t ldda,
+                                   cl_mem dB_dst, size_t dB_offset,
+                                   magma_int_t lddb, magma_queue_t queue,
+                                   magma_event_t* event) {
+    if (m <= 0 || n <= 0) return;
 
     // TODO how to make non-blocking?
-    size_t src_origin[3] = { dA_offset*sizeof(T), 0, 0 };
-    size_t dst_orig[3]   = { dB_offset*sizeof(T), 0, 0 };
-    size_t region[3]     = { m*sizeof(T), static_cast<size_t>(n), 1 };
-    cl_int err = clEnqueueCopyBufferRect(
-        queue, dA_src, dB_dst,
-        src_origin, dst_orig, region,
-        ldda*sizeof(T), 0,
-        lddb*sizeof(T), 0,
-        0, NULL, event );
-    check_error( err );
+    size_t src_origin[3] = {dA_offset * sizeof(T), 0, 0};
+    size_t dst_orig[3]   = {dB_offset * sizeof(T), 0, 0};
+    size_t region[3]     = {m * sizeof(T), static_cast<size_t>(n), 1};
+    cl_int err = clEnqueueCopyBufferRect(queue, dA_src, dB_dst, src_origin,
+                                         dst_orig, region, ldda * sizeof(T), 0,
+                                         lddb * sizeof(T), 0, 0, NULL, event);
+    check_error(err);
 }
 
 // --------------------
-template<typename T> static void
-magma_copyvector(
-    magma_int_t n,
-    cl_mem dx_src, size_t dx_offset, magma_int_t incx,
-    cl_mem dy_dst, size_t dy_offset, magma_int_t incy,
-    magma_queue_t queue )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_copyvector(magma_int_t n, cl_mem dx_src, size_t dx_offset,
+                             magma_int_t incx, cl_mem dy_dst, size_t dy_offset,
+                             magma_int_t incy, magma_queue_t queue) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
         cl_int err = clEnqueueReadBuffer(
-            queue, dx_src, CL_TRUE,
-            dx_offset*sizeof(T), n*sizeof(T),
-            dy_dst, dy_offset*sizeof(T), NULL, NULL);
-        check_error( err );
-    }
-    else {
+            queue, dx_src, CL_TRUE, dx_offset * sizeof(T), n * sizeof(T),
+            dy_dst, dy_offset * sizeof(T), NULL, NULL);
+        check_error(err);
+    } else {
         magma_int_t ldda = incx;
         magma_int_t lddb = incy;
-        magma_copymatrix<T>( 1, n,
-            dx_src, dx_offset, ldda,
-            dy_dst, dy_offset, lddb,
-            queue);
+        magma_copymatrix<T>(1, n, dx_src, dx_offset, ldda, dy_dst, dy_offset,
+                            lddb, queue);
     }
 }
 
 // --------------------
-template<typename T> static void
-magma_copyvector_async(
-    magma_int_t n,
-    cl_mem dx_src, size_t dx_offset, magma_int_t incx,
-    cl_mem dy_dst, size_t dy_offset, magma_int_t incy,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (n <= 0)
-        return;
+template<typename T>
+static void magma_copyvector_async(magma_int_t n, cl_mem dx_src,
+                                   size_t dx_offset, magma_int_t incx,
+                                   cl_mem dy_dst, size_t dy_offset,
+                                   magma_int_t incy, magma_queue_t queue,
+                                   magma_event_t* event) {
+    if (n <= 0) return;
 
     if (incx == 1 && incy == 1) {
         cl_int err = clEnqueueReadBuffer(
-            queue, dx_src, CL_FALSE,
-            dx_offset*sizeof(T), n*sizeof(T),
-            dy_dst, dy_offset*sizeof(T), NULL, event);
-        check_error( err );
-    }
-    else {
+            queue, dx_src, CL_FALSE, dx_offset * sizeof(T), n * sizeof(T),
+            dy_dst, dy_offset * sizeof(T), NULL, event);
+        check_error(err);
+    } else {
         magma_int_t ldda = incx;
         magma_int_t lddb = incy;
-        magma_copymatrix_async<T>( 1, n,
-            dx_src, dx_offset, ldda,
-            dy_dst, dy_offset, lddb,
-            queue, event);
+        magma_copymatrix_async<T>(1, n, dx_src, dx_offset, ldda, dy_dst,
+                                  dy_offset, lddb, queue, event);
     }
 }
 
-
 // ========================================
 // copying sub-matrices (contiguous columns)
 // OpenCL takes queue even for blocking transfers, oddly.
-template<typename T> static void
-magma_setmatrix(
-    magma_int_t m, magma_int_t n,
-    T const* hA_src,                   magma_int_t ldha,
-    cl_mem   dB_dst, size_t dB_offset, magma_int_t lddb,
-    magma_queue_t queue )
-{
-    if (m <= 0 || n <= 0)
-        return;
-
-    size_t buffer_origin[3] = { dB_offset*sizeof(T), 0, 0 };
-    size_t host_orig[3]     = { 0, 0, 0 };
-    size_t region[3]        = { m*sizeof(T), (size_t)n, 1 };
-    cl_int err = clEnqueueWriteBufferRect(
-        queue, dB_dst, CL_TRUE,  // blocking
-        buffer_origin, host_orig, region,
-        lddb*sizeof(T), 0,
-        ldha*sizeof(T), 0,
-        hA_src, 0, NULL, NULL );
-    check_error( err );
+template<typename T>
+static void magma_setmatrix(magma_int_t m, magma_int_t n, T const* hA_src,
+                            magma_int_t ldha, cl_mem dB_dst, size_t dB_offset,
+                            magma_int_t lddb, magma_queue_t queue) {
+    if (m <= 0 || n <= 0) return;
+
+    size_t buffer_origin[3] = {dB_offset * sizeof(T), 0, 0};
+    size_t host_orig[3]     = {0, 0, 0};
+    size_t region[3]        = {m * sizeof(T), (size_t)n, 1};
+    cl_int err = clEnqueueWriteBufferRect(queue, dB_dst, CL_TRUE,  // blocking
+                                          buffer_origin, host_orig, region,
+                                          lddb * sizeof(T), 0, ldha * sizeof(T),
+                                          0, hA_src, 0, NULL, NULL);
+    check_error(err);
 }
 
 // --------------------
-template<typename T> static void
-magma_setmatrix_async(
-    magma_int_t m, magma_int_t n,
-    T const* hA_src,                   magma_int_t ldha,
-    cl_mem   dB_dst, size_t dB_offset, magma_int_t lddb,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (m <= 0 || n <= 0)
-        return;
-
-    size_t buffer_origin[3] = { dB_offset*sizeof(T), 0, 0 };
-    size_t host_orig[3]     = { 0, 0, 0 };
-    size_t region[3]        = { m*sizeof(T), (size_t)n, 1 };
-    cl_int err = clEnqueueWriteBufferRect(
+template<typename T>
+static void magma_setmatrix_async(magma_int_t m, magma_int_t n, T const* hA_src,
+                                  magma_int_t ldha, cl_mem dB_dst,
+                                  size_t dB_offset, magma_int_t lddb,
+                                  magma_queue_t queue, magma_event_t* event) {
+    if (m <= 0 || n <= 0) return;
+
+    size_t buffer_origin[3] = {dB_offset * sizeof(T), 0, 0};
+    size_t host_orig[3]     = {0, 0, 0};
+    size_t region[3]        = {m * sizeof(T), (size_t)n, 1};
+    cl_int err              = clEnqueueWriteBufferRect(
         queue, dB_dst, CL_FALSE,  // non-blocking
-        buffer_origin, host_orig, region,
-        lddb*sizeof(T), 0,
-        ldha*sizeof(T), 0,
-        hA_src, 0, NULL, event );
+        buffer_origin, host_orig, region, lddb * sizeof(T), 0, ldha * sizeof(T),
+        0, hA_src, 0, NULL, event);
     clFlush(queue);
-    check_error( err );
+    check_error(err);
 }
 
 // --------------------
-template<typename T> static void
-magma_getmatrix(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA_src, size_t dA_offset, magma_int_t ldda,
-    T*     hB_dst,                   magma_int_t ldhb,
-    magma_queue_t queue )
-{
-    if (m <= 0 || n <= 0)
-       return;
-
-    size_t buffer_origin[3] = { dA_offset*sizeof(T), 0, 0 };
-    size_t host_orig[3]     = { 0, 0, 0 };
-    size_t region[3]        = { m*sizeof(T), (size_t)n, 1 };
-    cl_int err = clEnqueueReadBufferRect(
-        queue, dA_src, CL_TRUE,  // blocking
-        buffer_origin, host_orig, region,
-        ldda*sizeof(T), 0,
-        ldhb*sizeof(T), 0,
-        hB_dst, 0, NULL, NULL );
-    check_error( err );
+template<typename T>
+static void magma_getmatrix(magma_int_t m, magma_int_t n, cl_mem dA_src,
+                            size_t dA_offset, magma_int_t ldda, T* hB_dst,
+                            magma_int_t ldhb, magma_queue_t queue) {
+    if (m <= 0 || n <= 0) return;
+
+    size_t buffer_origin[3] = {dA_offset * sizeof(T), 0, 0};
+    size_t host_orig[3]     = {0, 0, 0};
+    size_t region[3]        = {m * sizeof(T), (size_t)n, 1};
+    cl_int err = clEnqueueReadBufferRect(queue, dA_src, CL_TRUE,  // blocking
+                                         buffer_origin, host_orig, region,
+                                         ldda * sizeof(T), 0, ldhb * sizeof(T),
+                                         0, hB_dst, 0, NULL, NULL);
+    check_error(err);
 }
 
 // --------------------
-template<typename T> static void
-magma_getmatrix_async(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA_src, size_t dA_offset, magma_int_t ldda,
-    T*     hB_dst,                   magma_int_t ldhb,
-    magma_queue_t queue, magma_event_t *event )
-{
-    if (m <= 0 || n <= 0)
-        return;
-
-    size_t buffer_origin[3] = { dA_offset*sizeof(T), 0, 0 };
-    size_t host_orig[3]     = { 0, 0, 0 };
-    size_t region[3]        = { m*sizeof(T), (size_t)n, 1 };
-    cl_int err = clEnqueueReadBufferRect(
+template<typename T>
+static void magma_getmatrix_async(magma_int_t m, magma_int_t n, cl_mem dA_src,
+                                  size_t dA_offset, magma_int_t ldda, T* hB_dst,
+                                  magma_int_t ldhb, magma_queue_t queue,
+                                  magma_event_t* event) {
+    if (m <= 0 || n <= 0) return;
+
+    size_t buffer_origin[3] = {dA_offset * sizeof(T), 0, 0};
+    size_t host_orig[3]     = {0, 0, 0};
+    size_t region[3]        = {m * sizeof(T), (size_t)n, 1};
+    cl_int err              = clEnqueueReadBufferRect(
         queue, dA_src, CL_FALSE,  // non-blocking
-        buffer_origin, host_orig, region,
-        ldda*sizeof(T), 0,
-        ldhb*sizeof(T), 0,
-        hB_dst, 0, NULL, event );
+        buffer_origin, host_orig, region, ldda * sizeof(T), 0, ldhb * sizeof(T),
+        0, hB_dst, 0, NULL, event);
     clFlush(queue);
-    check_error( err );
+    check_error(err);
 }
 
-template<typename T> void
-magmablas_transpose_inplace(
-    magma_int_t n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    magma_queue_t queue);
+template<typename T>
+void magmablas_transpose_inplace(magma_int_t n, cl_mem dA, size_t dA_offset,
+                                 magma_int_t ldda, magma_queue_t queue);
 
-template<typename T> void
-magmablas_transpose(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA,  size_t dA_offset,  magma_int_t ldda,
-    cl_mem dAT, size_t dAT_offset, magma_int_t lddat,
-    magma_queue_t queue);
+template<typename T>
+void magmablas_transpose(magma_int_t m, magma_int_t n, cl_mem dA,
+                         size_t dA_offset, magma_int_t ldda, cl_mem dAT,
+                         size_t dAT_offset, magma_int_t lddat,
+                         magma_queue_t queue);
 
-template<typename T> void
-magmablas_laswp(
-    magma_int_t n,
-    cl_mem dAT, size_t dAT_offset, magma_int_t ldda,
-    magma_int_t k1, magma_int_t k2,
-    const magma_int_t *ipiv, magma_int_t inci,
-    magma_queue_t queue);
+template<typename T>
+void magmablas_laswp(magma_int_t n, cl_mem dAT, size_t dAT_offset,
+                     magma_int_t ldda, magma_int_t k1, magma_int_t k2,
+                     const magma_int_t* ipiv, magma_int_t inci,
+                     magma_queue_t queue);
 
-template<typename T> void
-magmablas_swapdblk(magma_int_t n, magma_int_t nb,
-                   cl_mem dA, magma_int_t dA_offset, magma_int_t ldda, magma_int_t inca,
-                   cl_mem dB, magma_int_t dB_offset, magma_int_t lddb, magma_int_t incb,
-                   magma_queue_t queue);
+template<typename T>
+void magmablas_swapdblk(magma_int_t n, magma_int_t nb, cl_mem dA,
+                        magma_int_t dA_offset, magma_int_t ldda,
+                        magma_int_t inca, cl_mem dB, magma_int_t dB_offset,
+                        magma_int_t lddb, magma_int_t incb,
+                        magma_queue_t queue);
 
-template<typename T> void
-magmablas_laset(magma_uplo_t uplo, magma_int_t m, magma_int_t n,
-                T offdiag, T diag,
-                cl_mem dA, size_t dA_offset, magma_int_t ldda,
-                magma_queue_t queue);
+template<typename T>
+void magmablas_laset(magma_uplo_t uplo, magma_int_t m, magma_int_t n, T offdiag,
+                     T diag, cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                     magma_queue_t queue);
 
 #if 0  // Needs to be enabled when unmqr2 is enabled
 template<typename T> void
diff --git a/src/backend/opencl/magma/magma_helper.cpp b/src/backend/opencl/magma/magma_helper.cpp
index b38045a188..19467d2277 100644
--- a/src/backend/opencl/magma/magma_helper.cpp
+++ b/src/backend/opencl/magma/magma_helper.cpp
@@ -7,148 +7,237 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include "common/defines.hpp"
 #include "magma_common.h"
 
-template<typename T> T magma_one() { return (T)1.0; }
-template<typename T> T magma_neg_one() { return (T)-1.0; }
-template<typename T> T magma_zero() { return (T)0; }
+template<typename T>
+T magma_one() {
+    return (T)1.0;
+}
+template<typename T>
+T magma_neg_one() {
+    return (T)-1.0;
+}
+template<typename T>
+T magma_zero() {
+    return (T)0;
+}
 
-#define INSTANTIATE_REAL(func, T)          \
-    template T func<T>();
+#define INSTANTIATE_REAL(func, T) template T func<T>();
 
-INSTANTIATE_REAL(magma_one    , float )
-INSTANTIATE_REAL(magma_neg_one, float )
-INSTANTIATE_REAL(magma_zero   , float )
-INSTANTIATE_REAL(magma_one    , double)
+INSTANTIATE_REAL(magma_one, float)
+INSTANTIATE_REAL(magma_neg_one, float)
+INSTANTIATE_REAL(magma_zero, float)
+INSTANTIATE_REAL(magma_one, double)
 INSTANTIATE_REAL(magma_neg_one, double)
-INSTANTIATE_REAL(magma_zero   , double)
-
-#define INSTANTIATE_CPLX(func, T, val)          \
-    template<> T func<T>()                      \
-    {                                           \
-        T res;                                  \
-        res.s[0] = val;                         \
-        res.s[1] = 0;                           \
-        return res;                             \
-    }                                           \
-
-INSTANTIATE_CPLX(magma_one    , magmaFloatComplex ,  1.0)
-INSTANTIATE_CPLX(magma_neg_one, magmaFloatComplex , -1.0)
-INSTANTIATE_CPLX(magma_zero   , magmaFloatComplex ,  0.0)
-INSTANTIATE_CPLX(magma_one    , magmaDoubleComplex,  1.0)
+INSTANTIATE_REAL(magma_zero, double)
+
+#define INSTANTIATE_CPLX(func, T, val) \
+    template<>                         \
+    T func<T>() {                      \
+        T res;                         \
+        res.s[0] = val;                \
+        res.s[1] = 0;                  \
+        return res;                    \
+    }
+
+INSTANTIATE_CPLX(magma_one, magmaFloatComplex, 1.0)
+INSTANTIATE_CPLX(magma_neg_one, magmaFloatComplex, -1.0)
+INSTANTIATE_CPLX(magma_zero, magmaFloatComplex, 0.0)
+INSTANTIATE_CPLX(magma_one, magmaDoubleComplex, 1.0)
 INSTANTIATE_CPLX(magma_neg_one, magmaDoubleComplex, -1.0)
-INSTANTIATE_CPLX(magma_zero   , magmaDoubleComplex,  0.0)
+INSTANTIATE_CPLX(magma_zero, magmaDoubleComplex, 0.0)
 
-template<typename T> T magma_scalar(double val) { return (T)val; }
+template<typename T>
+T magma_scalar(double val) {
+    return (T)val;
+}
 template float magma_scalar<float>(double val);
 template double magma_scalar<double>(double val);
 
-#define INSTANTIATE_CPLX_SCALAR(T)              \
-    template<> T magma_scalar<T>(double val)    \
-    {                                           \
-        T res;                                  \
-        res.s[0] = val;                         \
-        res.s[1] = 0;                           \
-        return res;                             \
-    }                                           \
+template<typename T>
+double magma_real(T val) {
+    return (double)val;
+}
+template double magma_real<float>(float val);
+template double magma_real<double>(double val);
+template<>
+double magma_real<magmaFloatComplex>(magmaFloatComplex val) {
+    return static_cast<double>(val.s[0]);
+}
+template<>
+double magma_real<magmaDoubleComplex>(magmaDoubleComplex val) {
+    return static_cast<double>(val.s[0]);
+}
+
+#define INSTANTIATE_CPLX_SCALAR(T)  \
+    template<>                      \
+    T magma_scalar<T>(double val) { \
+        T res;                      \
+        res.s[0] = val;             \
+        res.s[1] = 0;               \
+        return res;                 \
+    }
 
 INSTANTIATE_CPLX_SCALAR(magmaFloatComplex);
 INSTANTIATE_CPLX_SCALAR(magmaDoubleComplex);
 
-template<typename T> bool magma_is_real() { return true; }
+template<typename T>
+bool magma_is_real() {
+    return true;
+}
 template bool magma_is_real<float>();
 template bool magma_is_real<double>();
-template<> bool magma_is_real<magmaFloatComplex>() { return false; }
-template<> bool magma_is_real<magmaDoubleComplex>() { return false; }
+template<>
+bool magma_is_real<magmaFloatComplex>() {
+    return false;
+}
+template<>
+bool magma_is_real<magmaDoubleComplex>() {
+    return false;
+}
 
 template<typename T>
-magma_int_t magma_get_getrf_nb(magma_int_t m )
-{
-    if      (m <= 3200) return 128;
-    else if (m <  9000) return 256;
-    else                return 320;
+magma_int_t magma_get_getrf_nb(magma_int_t m) {
+    if (m <= 3200) {
+        return 128;
+    } else if (m < 9000) {
+        return 256;
+    } else {
+        return 320;
+    }
 }
 
 template magma_int_t magma_get_getrf_nb<float>(magma_int_t m);
 
 template<>
-magma_int_t magma_get_getrf_nb<double>( magma_int_t m )
-{
-    if      (m <= 2048) return 64;
-    else if (m <  7200) return 192;
-    else                return 256;
+magma_int_t magma_get_getrf_nb<double>(magma_int_t m) {
+    if (m <= 2048) {
+        return 64;
+    } else if (m < 7200) {
+        return 192;
+    } else {
+        return 256;
+    }
 }
 
 template<>
-magma_int_t magma_get_getrf_nb<magmaFloatComplex>( magma_int_t m )
-{
-    if      (m <= 2048) return 64;
-    else                return 128;
+magma_int_t magma_get_getrf_nb<magmaFloatComplex>(magma_int_t m) {
+    if (m <= 2048) {
+        return 64;
+    } else {
+        return 128;
+    }
 }
 
 template<>
-magma_int_t magma_get_getrf_nb<magmaDoubleComplex>( magma_int_t m )
-{
-    if      (m <= 3072) return 32;
-    else if (m <= 9024) return 64;
-    else                return 128;
+magma_int_t magma_get_getrf_nb<magmaDoubleComplex>(magma_int_t m) {
+    if (m <= 3072) {
+        return 32;
+    } else if (m <= 9024) {
+        return 64;
+    } else {
+        return 128;
+    }
 }
 
 template<typename T>
-magma_int_t magma_get_potrf_nb(magma_int_t m )
-{
-    if      (m <= 1024) return 128;
-    else                return 320;
+magma_int_t magma_get_potrf_nb(magma_int_t m) {
+    if (m <= 1024) {
+        return 128;
+    } else {
+        return 320;
+    }
 }
 
 template magma_int_t magma_get_potrf_nb<float>(magma_int_t m);
 
 template<>
-magma_int_t magma_get_potrf_nb<double>(magma_int_t m)
-{
-    if      (m <= 4256) return 128;
-    else                return 256;
+magma_int_t magma_get_potrf_nb<double>(magma_int_t m) {
+    if (m <= 4256) {
+        return 128;
+    } else {
+        return 256;
+    }
 }
 
 template<>
-magma_int_t magma_get_potrf_nb<magmaFloatComplex>(magma_int_t m)
-{
+magma_int_t magma_get_potrf_nb<magmaFloatComplex>(magma_int_t m) {
+    UNUSED(m);
     return 128;
 }
 
 template<>
-magma_int_t magma_get_potrf_nb<magmaDoubleComplex>(magma_int_t m)
-{
-    return  64;
+magma_int_t magma_get_potrf_nb<magmaDoubleComplex>(magma_int_t m) {
+    UNUSED(m);
+    return 64;
 }
 
 template<typename T>
-magma_int_t magma_get_geqrf_nb(magma_int_t m )
-{
+magma_int_t magma_get_geqrf_nb(magma_int_t m) {
+    UNUSED(m);
     return 128;
 }
 
 template magma_int_t magma_get_geqrf_nb<float>(magma_int_t m);
 
 template<>
-magma_int_t magma_get_geqrf_nb<double>( magma_int_t m )
-{
-    if      (m <= 2048) return 64;
+magma_int_t magma_get_geqrf_nb<double>(magma_int_t m) {
+    if (m <= 2048) { return 64; }
     return 128;
 }
 
 template<>
-magma_int_t magma_get_geqrf_nb<magmaFloatComplex>( magma_int_t m )
-{
-    if      (m <= 2048) return 32;
-    else if (m <= 4032) return 64;
-    else                return 128;
+magma_int_t magma_get_geqrf_nb<magmaFloatComplex>(magma_int_t m) {
+    if (m <= 2048) {
+        return 32;
+    } else if (m <= 4032) {
+        return 64;
+    } else {
+        return 128;
+    }
 }
 
 template<>
-magma_int_t magma_get_geqrf_nb<magmaDoubleComplex>( magma_int_t m )
-{
-    if      (m <= 2048) return 32;
-    else if (m <= 4032) return 64;
-    else                return 128;
+magma_int_t magma_get_geqrf_nb<magmaDoubleComplex>(magma_int_t m) {
+    if (m <= 2048) {
+        return 32;
+    } else if (m <= 4032) {
+        return 64;
+    } else {
+        return 128;
+    }
 }
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wmissing-braces"
+#else
+/* Other */
+#endif
+
+template<typename T>
+T magma_make(double r, double i) {
+    UNUSED(i);
+    return (T)r;
+}
+template float magma_make<float>(double r, double i);
+template double magma_make<double>(double r, double i);
+template<>
+magmaFloatComplex magma_make<magmaFloatComplex>(double r, double i) {
+    magmaFloatComplex tmp = {static_cast<float>(r), static_cast<float>(i)};
+    return tmp;
+}
+template<>
+magmaDoubleComplex magma_make<magmaDoubleComplex>(double r, double i) {
+    magmaDoubleComplex tmp = {r, i};
+    return tmp;
+}
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic pop
+#else
+/* Other */
+#endif
diff --git a/src/backend/opencl/magma/magma_helper.h b/src/backend/opencl/magma/magma_helper.h
index 32fd065800..6278761877 100644
--- a/src/backend/opencl/magma/magma_helper.h
+++ b/src/backend/opencl/magma/magma_helper.h
@@ -10,15 +10,31 @@
 #ifndef __MAGMA_HELPER_H
 #define __MAGMA_HELPER_H
 
-template<typename T> T magma_zero();
-template<typename T> T magma_one();
-template<typename T> T magma_neg_one();
-template<typename T> T magma_scalar(double val);
+template<typename T>
+T magma_zero();
+template<typename T>
+T magma_one();
+template<typename T>
+T magma_neg_one();
+template<typename T>
+T magma_scalar(double val);
+template<typename T>
+double magma_real(T val);
+template<typename T>
+T magma_make(double r, double i);
 
-template<typename T> bool magma_is_real();
+template<typename T>
+bool magma_is_real();
 
-template<typename T> magma_int_t magma_get_getrf_nb(int num);
-template<typename T> magma_int_t magma_get_potrf_nb(int num);
-template<typename T> magma_int_t magma_get_geqrf_nb(int num);
+template<typename T>
+magma_int_t magma_get_getrf_nb(int num);
+template<typename T>
+magma_int_t magma_get_potrf_nb(int num);
+template<typename T>
+magma_int_t magma_get_geqrf_nb(int num);
+template<typename T>
+magma_int_t magma_get_gebrd_nb(int /*num*/) {
+    return 32;
+}
 
 #endif
diff --git a/src/backend/opencl/magma/magma_sync.h b/src/backend/opencl/magma/magma_sync.h
index 6221cb0b63..220af8acc4 100644
--- a/src/backend/opencl/magma/magma_sync.h
+++ b/src/backend/opencl/magma/magma_sync.h
@@ -11,22 +11,22 @@
 #define MAGMA_SYNC_H
 
 #ifndef check_error
-#define check_error( err ) if (err != CL_SUCCESS) { printf ("OpenCL err: %d\n", err); throw cl::Error(err); }
+#define check_error(err)                 \
+    if (err != CL_SUCCESS) {             \
+        printf("OpenCL err: %d\n", err); \
+        throw cl::Error(err);            \
+    }
 #endif
 
-static inline void
-magma_event_sync( magma_event_t event )
-{
+static inline void magma_event_sync(magma_event_t event) {
     cl_int err = clWaitForEvents(1, &event);
     check_error(err);
 }
 
-static inline void
-magma_queue_sync( magma_queue_t queue )
-{
-    cl_int err = clFinish( queue );
+static inline void magma_queue_sync(magma_queue_t queue) {
+    cl_int err = clFinish(queue);
     check_error(err);
-    err = clFlush( queue );
+    err = clFlush(queue);
     check_error(err)
 }
 
diff --git a/src/backend/opencl/magma/magma_types.h b/src/backend/opencl/magma/magma_types.h
index 33e6e667af..90dcc6ab8d 100644
--- a/src/backend/opencl/magma/magma_types.h
+++ b/src/backend/opencl/magma/magma_types.h
@@ -29,22 +29,22 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
@@ -52,101 +52,101 @@
 #ifndef MAGMA_TYPES_H
 #define MAGMA_TYPES_H
 
-#include <stdint.h>
 #include <assert.h>
+#include <stdint.h>
 typedef int magma_int_t;
 typedef int magma_index_t;
 
 // Define new type that the precision generator will not change (matches PLASMA)
 typedef double real_Double_t;
 
-#include <clBLAS.h>
-
-typedef cl_command_queue  magma_queue_t;
-typedef cl_event          magma_event_t;
-typedef cl_device_id      magma_device_t;
+typedef cl_command_queue magma_queue_t;
+typedef cl_event magma_event_t;
+typedef cl_device_id magma_device_t;
 
 typedef cl_double2 magmaDoubleComplex;
-typedef cl_float2  magmaFloatComplex;
-
-#define MAGMA_Z_MAKE(r,i)     doubleComplex(r,i)
-#define MAGMA_Z_REAL(a)       (a).s[0]
-#define MAGMA_Z_IMAG(a)       (a).s[1]
-#define MAGMA_Z_ADD(a, b)     MAGMA_Z_MAKE((a).s[0]+(b).s[0], (a).s[1]+(b).s[1])
-#define MAGMA_Z_SUB(a, b)     MAGMA_Z_MAKE((a).s[0]-(b).s[0], (a).s[1]-(b).s[1])
-#define MAGMA_Z_DIV(a, b)     ((a)/(b))
-#define MAGMA_Z_ABS(a)        magma_cabs(a)
-#define MAGMA_Z_ABS1(a)       (fabs((a).s[0]) + fabs((a).s[1]))
-#define MAGMA_Z_CNJG(a)       MAGMA_Z_MAKE((a).s[0], -(a).s[1])
-
-#define MAGMA_C_MAKE(r,i)     floatComplex(r,i)
-#define MAGMA_C_REAL(a)       (a).s[0]
-#define MAGMA_C_IMAG(a)       (a).s[1]
-#define MAGMA_C_ADD(a, b)     MAGMA_C_MAKE((a).s[0]+(b).s[0], (a).s[1]+(b).s[1])
-#define MAGMA_C_SUB(a, b)     MAGMA_C_MAKE((a).s[0]-(b).s[0], (a).s[1]-(b).s[1])
-#define MAGMA_C_DIV(a, b)     ((a)/(b))
-#define MAGMA_C_ABS(a)        magma_cabsf(a)
-#define MAGMA_C_ABS1(a)       (fabsf((a).s[0]) + fabsf((a).s[1]))
-#define MAGMA_C_CNJG(a)       MAGMA_C_MAKE((a).s[0], -(a).s[1])
-
-#define MAGMA_Z_EQUAL(a,b)        (MAGMA_Z_REAL(a)==MAGMA_Z_REAL(b) && MAGMA_Z_IMAG(a)==MAGMA_Z_IMAG(b))
-#define MAGMA_Z_NEGATE(a)         MAGMA_Z_MAKE( -MAGMA_Z_REAL(a), -MAGMA_Z_IMAG(a))
-
-#define MAGMA_C_EQUAL(a,b)        (MAGMA_C_REAL(a)==MAGMA_C_REAL(b) && MAGMA_C_IMAG(a)==MAGMA_C_IMAG(b))
-#define MAGMA_C_NEGATE(a)         MAGMA_C_MAKE( -MAGMA_C_REAL(a), -MAGMA_C_IMAG(a))
-
-#define MAGMA_D_MAKE(r,i)         (r)
-#define MAGMA_D_REAL(x)           (x)
-#define MAGMA_D_IMAG(x)           (0.0)
-#define MAGMA_D_ADD(a, b)         ((a) + (b))
-#define MAGMA_D_SUB(a, b)         ((a) - (b))
-#define MAGMA_D_MUL(a, b)         ((a) * (b))
-#define MAGMA_D_DIV(a, b)         ((a) / (b))
-#define MAGMA_D_ABS(a)            ((a)>0 ? (a) : -(a))
-#define MAGMA_D_ABS1(a)           ((a)>0 ? (a) : -(a))
-#define MAGMA_D_CNJG(a)           (a)
-#define MAGMA_D_EQUAL(a,b)        ((a) == (b))
-#define MAGMA_D_NEGATE(a)         (-a)
-
-#define MAGMA_S_MAKE(r,i)         (r)
-#define MAGMA_S_REAL(x)           (x)
-#define MAGMA_S_IMAG(x)           (0.0)
-#define MAGMA_S_ADD(a, b)         ((a) + (b))
-#define MAGMA_S_SUB(a, b)         ((a) - (b))
-#define MAGMA_S_MUL(a, b)         ((a) * (b))
-#define MAGMA_S_DIV(a, b)         ((a) / (b))
-#define MAGMA_S_ABS(a)            ((a)>0 ? (a) : -(a))
-#define MAGMA_S_ABS1(a)           ((a)>0 ? (a) : -(a))
-#define MAGMA_S_CNJG(a)           (a)
-#define MAGMA_S_EQUAL(a,b)        ((a) == (b))
-#define MAGMA_S_NEGATE(a)         (-a)
-
-#define MAGMA_Z_ZERO              MAGMA_Z_MAKE( 0.0, 0.0)
-#define MAGMA_Z_ONE               MAGMA_Z_MAKE( 1.0, 0.0)
-#define MAGMA_Z_HALF              MAGMA_Z_MAKE( 0.5, 0.0)
-#define MAGMA_Z_NEG_ONE           MAGMA_Z_MAKE(-1.0, 0.0)
-#define MAGMA_Z_NEG_HALF          MAGMA_Z_MAKE(-0.5, 0.0)
-
-#define MAGMA_C_ZERO              MAGMA_C_MAKE( 0.0, 0.0)
-#define MAGMA_C_ONE               MAGMA_C_MAKE( 1.0, 0.0)
-#define MAGMA_C_HALF              MAGMA_C_MAKE( 0.5, 0.0)
-#define MAGMA_C_NEG_ONE           MAGMA_C_MAKE(-1.0, 0.0)
-#define MAGMA_C_NEG_HALF          MAGMA_C_MAKE(-0.5, 0.0)
-
-#define MAGMA_D_ZERO              ( 0.0)
-#define MAGMA_D_ONE               ( 1.0)
-#define MAGMA_D_HALF              ( 0.5)
-#define MAGMA_D_NEG_ONE           (-1.0)
-#define MAGMA_D_NEG_HALF          (-0.5)
-
-#define MAGMA_S_ZERO              ( 0.0)
-#define MAGMA_S_ONE               ( 1.0)
-#define MAGMA_S_HALF              ( 0.5)
-#define MAGMA_S_NEG_ONE           (-1.0)
-#define MAGMA_S_NEG_HALF          (-0.5)
+typedef cl_float2 magmaFloatComplex;
+
+#define MAGMA_Z_MAKE(r, i) doubleComplex(r, i)
+#define MAGMA_Z_REAL(a) (a).s[0]
+#define MAGMA_Z_IMAG(a) (a).s[1]
+#define MAGMA_Z_ADD(a, b) MAGMA_Z_MAKE((a).s[0] + (b).s[0], (a).s[1] + (b).s[1])
+#define MAGMA_Z_SUB(a, b) MAGMA_Z_MAKE((a).s[0] - (b).s[0], (a).s[1] - (b).s[1])
+#define MAGMA_Z_DIV(a, b) ((a) / (b))
+#define MAGMA_Z_ABS(a) magma_cabs(a)
+#define MAGMA_Z_ABS1(a) (fabs((a).s[0]) + fabs((a).s[1]))
+#define MAGMA_Z_CNJG(a) MAGMA_Z_MAKE((a).s[0], -(a).s[1])
+
+#define MAGMA_C_MAKE(r, i) floatComplex(r, i)
+#define MAGMA_C_REAL(a) (a).s[0]
+#define MAGMA_C_IMAG(a) (a).s[1]
+#define MAGMA_C_ADD(a, b) MAGMA_C_MAKE((a).s[0] + (b).s[0], (a).s[1] + (b).s[1])
+#define MAGMA_C_SUB(a, b) MAGMA_C_MAKE((a).s[0] - (b).s[0], (a).s[1] - (b).s[1])
+#define MAGMA_C_DIV(a, b) ((a) / (b))
+#define MAGMA_C_ABS(a) magma_cabsf(a)
+#define MAGMA_C_ABS1(a) (fabsf((a).s[0]) + fabsf((a).s[1]))
+#define MAGMA_C_CNJG(a) MAGMA_C_MAKE((a).s[0], -(a).s[1])
+
+#define MAGMA_Z_EQUAL(a, b) \
+    (MAGMA_Z_REAL(a) == MAGMA_Z_REAL(b) && MAGMA_Z_IMAG(a) == MAGMA_Z_IMAG(b))
+#define MAGMA_Z_NEGATE(a) MAGMA_Z_MAKE(-MAGMA_Z_REAL(a), -MAGMA_Z_IMAG(a))
+
+#define MAGMA_C_EQUAL(a, b) \
+    (MAGMA_C_REAL(a) == MAGMA_C_REAL(b) && MAGMA_C_IMAG(a) == MAGMA_C_IMAG(b))
+#define MAGMA_C_NEGATE(a) MAGMA_C_MAKE(-MAGMA_C_REAL(a), -MAGMA_C_IMAG(a))
+
+#define MAGMA_D_MAKE(r, i) (r)
+#define MAGMA_D_REAL(x) (x)
+#define MAGMA_D_IMAG(x) (0.0)
+#define MAGMA_D_ADD(a, b) ((a) + (b))
+#define MAGMA_D_SUB(a, b) ((a) - (b))
+#define MAGMA_D_MUL(a, b) ((a) * (b))
+#define MAGMA_D_DIV(a, b) ((a) / (b))
+#define MAGMA_D_ABS(a) ((a) > 0 ? (a) : -(a))
+#define MAGMA_D_ABS1(a) ((a) > 0 ? (a) : -(a))
+#define MAGMA_D_CNJG(a) (a)
+#define MAGMA_D_EQUAL(a, b) ((a) == (b))
+#define MAGMA_D_NEGATE(a) (-(a))
+
+#define MAGMA_S_MAKE(r, i) (r)
+#define MAGMA_S_REAL(x) (x)
+#define MAGMA_S_IMAG(x) (0.0)
+#define MAGMA_S_ADD(a, b) ((a) + (b))
+#define MAGMA_S_SUB(a, b) ((a) - (b))
+#define MAGMA_S_MUL(a, b) ((a) * (b))
+#define MAGMA_S_DIV(a, b) ((a) / (b))
+#define MAGMA_S_ABS(a) ((a) > 0 ? (a) : -(a))
+#define MAGMA_S_ABS1(a) ((a) > 0 ? (a) : -(a))
+#define MAGMA_S_CNJG(a) (a)
+#define MAGMA_S_EQUAL(a, b) ((a) == (b))
+#define MAGMA_S_NEGATE(a) (-(a))
+
+#define MAGMA_Z_ZERO MAGMA_Z_MAKE(0.0, 0.0)
+#define MAGMA_Z_ONE MAGMA_Z_MAKE(1.0, 0.0)
+#define MAGMA_Z_HALF MAGMA_Z_MAKE(0.5, 0.0)
+#define MAGMA_Z_NEG_ONE MAGMA_Z_MAKE(-1.0, 0.0)
+#define MAGMA_Z_NEG_HALF MAGMA_Z_MAKE(-0.5, 0.0)
+
+#define MAGMA_C_ZERO MAGMA_C_MAKE(0.0, 0.0)
+#define MAGMA_C_ONE MAGMA_C_MAKE(1.0, 0.0)
+#define MAGMA_C_HALF MAGMA_C_MAKE(0.5, 0.0)
+#define MAGMA_C_NEG_ONE MAGMA_C_MAKE(-1.0, 0.0)
+#define MAGMA_C_NEG_HALF MAGMA_C_MAKE(-0.5, 0.0)
+
+#define MAGMA_D_ZERO (0.0)
+#define MAGMA_D_ONE (1.0)
+#define MAGMA_D_HALF (0.5)
+#define MAGMA_D_NEG_ONE (-1.0)
+#define MAGMA_D_NEG_HALF (-0.5)
+
+#define MAGMA_S_ZERO (0.0)
+#define MAGMA_S_ONE (1.0)
+#define MAGMA_S_HALF (0.5)
+#define MAGMA_S_NEG_ONE (-1.0)
+#define MAGMA_S_NEG_HALF (-0.5)
 
 #ifndef CBLAS_SADDR
-#define CBLAS_SADDR(a)  &(a)
+#define CBLAS_SADDR(a) &(a)
 #endif
 
 // OpenCL uses opaque memory references on GPU
@@ -166,7 +166,6 @@ typedef cl_mem magmaDouble_const_ptr;
 typedef cl_mem magmaFloatComplex_const_ptr;
 typedef cl_mem magmaDoubleComplex_const_ptr;
 
-
 // ========================================
 // MAGMA constants
 
@@ -175,83 +174,74 @@ typedef cl_mem magmaDoubleComplex_const_ptr;
 #define MAGMA_VERSION_MINOR 0
 #define MAGMA_VERSION_MICRO 0
 
-// stage is "svn", "beta#", "rc#" (release candidate), or blank ("") for final release
+// stage is "svn", "beta#", "rc#" (release candidate), or blank ("") for final
+// release
 #define MAGMA_VERSION_STAGE "svn"
 
 #define MagmaMaxGPUs 8
 #define MagmaMaxSubs 16
 
-
 // ----------------------------------------
 // Return codes
 // LAPACK argument errors are < 0 but > MAGMA_ERR.
 // MAGMA errors are < MAGMA_ERR.
-#define MAGMA_SUCCESS               0
-#define MAGMA_ERR                  -100
-#define MAGMA_ERR_NOT_INITIALIZED  -101
-#define MAGMA_ERR_REINITIALIZED    -102
-#define MAGMA_ERR_NOT_SUPPORTED    -103
-#define MAGMA_ERR_ILLEGAL_VALUE    -104
-#define MAGMA_ERR_NOT_FOUND        -105
-#define MAGMA_ERR_ALLOCATION       -106
-#define MAGMA_ERR_INTERNAL_LIMIT   -107
-#define MAGMA_ERR_UNALLOCATED      -108
-#define MAGMA_ERR_FILESYSTEM       -109
-#define MAGMA_ERR_UNEXPECTED       -110
+#define MAGMA_SUCCESS 0
+#define MAGMA_ERR -100
+#define MAGMA_ERR_NOT_INITIALIZED -101
+#define MAGMA_ERR_REINITIALIZED -102
+#define MAGMA_ERR_NOT_SUPPORTED -103
+#define MAGMA_ERR_ILLEGAL_VALUE -104
+#define MAGMA_ERR_NOT_FOUND -105
+#define MAGMA_ERR_ALLOCATION -106
+#define MAGMA_ERR_INTERNAL_LIMIT -107
+#define MAGMA_ERR_UNALLOCATED -108
+#define MAGMA_ERR_FILESYSTEM -109
+#define MAGMA_ERR_UNEXPECTED -110
 #define MAGMA_ERR_SEQUENCE_FLUSHED -111
-#define MAGMA_ERR_HOST_ALLOC       -112
-#define MAGMA_ERR_DEVICE_ALLOC     -113
-#define MAGMA_ERR_CUDASTREAM       -114
-#define MAGMA_ERR_INVALID_PTR      -115
-#define MAGMA_ERR_UNKNOWN          -116
-#define MAGMA_ERR_NOT_IMPLEMENTED  -117
-
+#define MAGMA_ERR_HOST_ALLOC -112
+#define MAGMA_ERR_DEVICE_ALLOC -113
+#define MAGMA_ERR_CUDASTREAM -114
+#define MAGMA_ERR_INVALID_PTR -115
+#define MAGMA_ERR_UNKNOWN -116
+#define MAGMA_ERR_NOT_IMPLEMENTED -117
 
 // ----------------------------------------
 // parameter constants
 // numbering is consistent with CBLAS and PLASMA; see plasma/include/plasma.h
 // also with lapack_cwrapper/include/lapack_enum.h
-typedef enum {
-    MagmaFalse         = 0,
-    MagmaTrue          = 1
-} magma_bool_t;
+typedef enum { MagmaFalse = 0, MagmaTrue = 1 } magma_bool_t;
 
-typedef enum {
-    MagmaRowMajor      = 101,
-    MagmaColMajor      = 102
-} magma_order_t;
+typedef enum { MagmaRowMajor = 101, MagmaColMajor = 102 } magma_order_t;
 
 // Magma_ConjTrans is an alias for those rare occasions (zlarfb, zun*, zher*k)
-// where we want Magma_ConjTrans to convert to MagmaTrans in precision generation.
+// where we want Magma_ConjTrans to convert to MagmaTrans in precision
+// generation.
 typedef enum {
-    MagmaNoTrans       = 111,
-    MagmaTrans         = 112,
-    MagmaConjTrans     = 113,
-    Magma_ConjTrans    = MagmaConjTrans
+    MagmaNoTrans    = 111,
+    MagmaTrans      = 112,
+    MagmaConjTrans  = 113,
+    Magma_ConjTrans = MagmaConjTrans
 } magma_trans_t;
 
 typedef enum {
-    MagmaUpper         = 121,
-    MagmaLower         = 122,
-    MagmaUpperLower    = 123,
-    MagmaFull          = 123   /* lascl, laset */
+    MagmaUpper      = 121,
+    MagmaLower      = 122,
+    MagmaUpperLower = 123,
+    MagmaFull       = 123 /* lascl, laset */
 } magma_uplo_t;
 
-typedef magma_uplo_t magma_type_t;  /* lascl */
+typedef magma_uplo_t magma_type_t; /* lascl */
 
-typedef enum {
-    MagmaNonUnit       = 131,
-    MagmaUnit          = 132
-} magma_diag_t;
+typedef enum { MagmaNonUnit = 131, MagmaUnit = 132 } magma_diag_t;
 
 typedef enum {
-    MagmaLeft          = 141,
-    MagmaRight         = 142,
-    MagmaBothSides     = 143   /* trevc */
+    MagmaLeft      = 141,
+    MagmaRight     = 142,
+    MagmaBothSides = 143 /* trevc */
 } magma_side_t;
 
 typedef enum {
-    MagmaOneNorm       = 171,  /* lange, lanhe */
+    MagmaOneNorm       = 171, /* lange, lanhe */
     MagmaRealOneNorm   = 172,
     MagmaTwoNorm       = 173,
     MagmaFrobeniusNorm = 174,
@@ -262,20 +252,20 @@ typedef enum {
 } magma_norm_t;
 
 typedef enum {
-    MagmaDistUniform   = 201,  /* latms */
+    MagmaDistUniform   = 201, /* latms */
     MagmaDistSymmetric = 202,
     MagmaDistNormal    = 203
 } magma_dist_t;
 
 typedef enum {
-    MagmaHermGeev      = 241,  /* latms */
-    MagmaHermPoev      = 242,
-    MagmaNonsymPosv    = 243,
-    MagmaSymPosv       = 244
+    MagmaHermGeev   = 241, /* latms */
+    MagmaHermPoev   = 242,
+    MagmaNonsymPosv = 243,
+    MagmaSymPosv    = 244
 } magma_sym_t;
 
 typedef enum {
-    MagmaNoPacking     = 291,  /* latms */
+    MagmaNoPacking     = 291, /* latms */
     MagmaPackSubdiag   = 292,
     MagmaPackSupdiag   = 293,
     MagmaPackColumn    = 294,
@@ -286,170 +276,161 @@ typedef enum {
 } magma_pack_t;
 
 typedef enum {
-    MagmaNoVec         = 301,  /* geev, syev, gesvd */
-    MagmaVec           = 302,  /* geev, syev */
-    MagmaIVec          = 303,  /* stedc */
-    MagmaAllVec        = 304,  /* gesvd, trevc */
-    MagmaSomeVec       = 305,  /* gesvd, trevc */
-    MagmaOverwriteVec  = 306,  /* gesvd */
-    MagmaBacktransVec  = 307   /* trevc */
+    MagmaNoVec        = 301, /* geev, syev, gesvd */
+    MagmaVec          = 302, /* geev, syev */
+    MagmaIVec         = 303, /* stedc */
+    MagmaAllVec       = 304, /* gesvd, trevc */
+    MagmaSomeVec      = 305, /* gesvd, trevc */
+    MagmaOverwriteVec = 306, /* gesvd */
+    MagmaBacktransVec = 307  /* trevc */
 } magma_vec_t;
 
 typedef enum {
-    MagmaRangeAll      = 311,  /* syevx, etc. */
-    MagmaRangeV        = 312,
-    MagmaRangeI        = 313
+    MagmaRangeAll = 311, /* syevx, etc. */
+    MagmaRangeV   = 312,
+    MagmaRangeI   = 313
 } magma_range_t;
 
 typedef enum {
-    MagmaQ             = 322,  /* unmbr, ungbr */
-    MagmaP             = 323
+    MagmaQ = 322, /* unmbr, ungbr */
+    MagmaP = 323
 } magma_vect_t;
 
 typedef enum {
-    MagmaForward       = 391,  /* larfb */
-    MagmaBackward      = 392
+    MagmaForward  = 391, /* larfb */
+    MagmaBackward = 392
 } magma_direct_t;
 
 typedef enum {
-    MagmaColumnwise    = 401,  /* larfb */
-    MagmaRowwise       = 402
+    MagmaColumnwise = 401, /* larfb */
+    MagmaRowwise    = 402
 } magma_storev_t;
 
 // --------------------
 // sparse
 typedef enum {
-    Magma_CSR          = 411,
-    Magma_ELLPACK      = 412,
-    Magma_ELL          = 413,
-    Magma_DENSE        = 414,
-    Magma_BCSR         = 415,
-    Magma_CSC          = 416,
-    Magma_HYB          = 417,
-    Magma_COO          = 418,
-    Magma_ELLRT        = 419,
-    Magma_SELLC        = 420,
-    Magma_SELLP        = 421,
-    Magma_ELLD         = 422,
-    Magma_ELLDD        = 423,
-    Magma_CSRD         = 424,
-    Magma_CSRL         = 427,
-    Magma_CSRU         = 428,
-    Magma_CSRCOO       = 429
+    Magma_CSR     = 411,
+    Magma_ELLPACK = 412,
+    Magma_ELL     = 413,
+    Magma_DENSE   = 414,
+    Magma_BCSR    = 415,
+    Magma_CSC     = 416,
+    Magma_HYB     = 417,
+    Magma_COO     = 418,
+    Magma_ELLRT   = 419,
+    Magma_SELLC   = 420,
+    Magma_SELLP   = 421,
+    Magma_ELLD    = 422,
+    Magma_ELLDD   = 423,
+    Magma_CSRD    = 424,
+    Magma_CSRL    = 427,
+    Magma_CSRU    = 428,
+    Magma_CSRCOO  = 429
 } magma_storage_t;
 
-
 typedef enum {
-    Magma_CG           = 431,
-    Magma_CGMERGE      = 432,
-    Magma_GMRES        = 433,
-    Magma_BICGSTAB     = 434,
-  Magma_BICGSTABMERGE  = 435,
-  Magma_BICGSTABMERGE2 = 436,
-    Magma_JACOBI       = 437,
-    Magma_GS           = 438,
-    Magma_ITERREF      = 439,
-    Magma_BCSRLU       = 440,
-    Magma_PCG          = 441,
-    Magma_PGMRES       = 442,
-    Magma_PBICGSTAB    = 443,
-    Magma_PASTIX       = 444,
-    Magma_ILU          = 445,
-    Magma_ICC          = 446,
-    Magma_AILU         = 447,
-    Magma_AICC         = 448,
-    Magma_BAITER       = 449,
-    Magma_LOBPCG       = 450,
-    Magma_NONE         = 451
+    Magma_CG             = 431,
+    Magma_CGMERGE        = 432,
+    Magma_GMRES          = 433,
+    Magma_BICGSTAB       = 434,
+    Magma_BICGSTABMERGE  = 435,
+    Magma_BICGSTABMERGE2 = 436,
+    Magma_JACOBI         = 437,
+    Magma_GS             = 438,
+    Magma_ITERREF        = 439,
+    Magma_BCSRLU         = 440,
+    Magma_PCG            = 441,
+    Magma_PGMRES         = 442,
+    Magma_PBICGSTAB      = 443,
+    Magma_PASTIX         = 444,
+    Magma_ILU            = 445,
+    Magma_ICC            = 446,
+    Magma_AILU           = 447,
+    Magma_AICC           = 448,
+    Magma_BAITER         = 449,
+    Magma_LOBPCG         = 450,
+    Magma_NONE           = 451
 } magma_solver_type;
 
 typedef enum {
-    Magma_CGS          = 461,
-    Magma_FUSED_CGS    = 462,
-    Magma_MGS          = 463
+    Magma_CGS       = 461,
+    Magma_FUSED_CGS = 462,
+    Magma_MGS       = 463
 } magma_ortho_t;
 
-typedef enum {
-    Magma_CPU          = 471,
-    Magma_DEV          = 472
-} magma_location_t;
+typedef enum { Magma_CPU = 471, Magma_DEV = 472 } magma_location_t;
 
-typedef enum {
-    Magma_GENERAL      = 481,
-    Magma_SYMMETRIC    = 482
-} magma_symmetry_t;
+typedef enum { Magma_GENERAL = 481, Magma_SYMMETRIC = 482 } magma_symmetry_t;
 
 typedef enum {
-    Magma_ORDERED      = 491,
-    Magma_DIAGFIRST    = 492,
-    Magma_UNITY        = 493,
-    Magma_VALUE        = 494
+    Magma_ORDERED   = 491,
+    Magma_DIAGFIRST = 492,
+    Magma_UNITY     = 493,
+    Magma_VALUE     = 494
 } magma_diagorder_t;
 
 typedef enum {
-    Magma_DCOMPLEX     = 501,
-    Magma_FCOMPLEX     = 502,
-    Magma_DOUBLE       = 503,
-    Magma_FLOAT        = 504
+    Magma_DCOMPLEX = 501,
+    Magma_FCOMPLEX = 502,
+    Magma_DOUBLE   = 503,
+    Magma_FLOAT    = 504
 } magma_precision;
 
 typedef enum {
-    Magma_NOSCALE      = 511,
-    Magma_UNITROW      = 512,
-    Magma_UNITDIAG     = 513
+    Magma_NOSCALE  = 511,
+    Magma_UNITROW  = 512,
+    Magma_UNITDIAG = 513
 } magma_scale_t;
 
-
 // When adding constants, remember to do these steps as appropriate:
 // 1)  add magma_xxxx_const()  converter below and in control/constants.cpp
 // 2a) add to magma2lapack_constants[] in control/constants.cpp
-// 2b) update min & max here, which are used to check bounds for magma2lapack_constants[]
-// 2c) add lapack_xxxx_const() converter below and in control/constants.cpp
-#define Magma2lapack_Min  MagmaFalse     // 0
-#define Magma2lapack_Max  MagmaRowwise   // 402
-
+// 2b) update min & max here, which are used to check bounds for
+//     magma2lapack_constants[]
+// 2c) add lapack_xxxx_const() converter below and in control/constants.cpp
+#define Magma2lapack_Min MagmaFalse    // 0
+#define Magma2lapack_Max MagmaRowwise  // 402
 
 // ----------------------------------------
 // string constants for calling Fortran BLAS and LAPACK
 // todo: use translators instead? lapack_const( MagmaUpper )
-#define MagmaRowMajorStr      "Row"
-#define MagmaColMajorStr      "Col"
+#define MagmaRowMajorStr "Row"
+#define MagmaColMajorStr "Col"
 
-#define MagmaNoTransStr       "NoTrans"
-#define MagmaTransStr         "Trans"
-#define MagmaConjTransStr     "ConjTrans"
+#define MagmaNoTransStr "NoTrans"
+#define MagmaTransStr "Trans"
+#define MagmaConjTransStr "ConjTrans"
 
-#define MagmaUpperStr         "Upper"
-#define MagmaLowerStr         "Lower"
-#define MagmaUpperLowerStr    "Full"
-#define MagmaFullStr          "Full"
+#define MagmaUpperStr "Upper"
+#define MagmaLowerStr "Lower"
+#define MagmaUpperLowerStr "Full"
+#define MagmaFullStr "Full"
 
-#define MagmaNonUnitStr       "NonUnit"
-#define MagmaUnitStr          "Unit"
+#define MagmaNonUnitStr "NonUnit"
+#define MagmaUnitStr "Unit"
 
-#define MagmaLeftStr          "Left"
-#define MagmaRightStr         "Right"
-#define MagmaBothSidesStr     "Both"
+#define MagmaLeftStr "Left"
+#define MagmaRightStr "Right"
+#define MagmaBothSidesStr "Both"
 
-#define MagmaOneNormStr       "1"
-#define MagmaTwoNormStr       "2"
+#define MagmaOneNormStr "1"
+#define MagmaTwoNormStr "2"
 #define MagmaFrobeniusNormStr "Fro"
-#define MagmaInfNormStr       "Inf"
-#define MagmaMaxNormStr       "Max"
-
-#define MagmaForwardStr       "Forward"
-#define MagmaBackwardStr      "Backward"
+#define MagmaInfNormStr "Inf"
+#define MagmaMaxNormStr "Max"
 
-#define MagmaColumnwiseStr    "Columnwise"
-#define MagmaRowwiseStr       "Rowwise"
+#define MagmaForwardStr "Forward"
+#define MagmaBackwardStr "Backward"
 
-#define MagmaNoVecStr         "NoVec"
-#define MagmaVecStr           "Vec"
-#define MagmaIVecStr          "IVec"
-#define MagmaAllVecStr        "All"
-#define MagmaSomeVecStr       "Some"
-#define MagmaOverwriteVecStr  "Overwrite"
+#define MagmaColumnwiseStr "Columnwise"
+#define MagmaRowwiseStr "Rowwise"
 
+#define MagmaNoVecStr "NoVec"
+#define MagmaVecStr "Vec"
+#define MagmaIVecStr "IVec"
+#define MagmaAllVecStr "All"
+#define MagmaSomeVecStr "Some"
+#define MagmaOverwriteVecStr "Overwrite"
 
 #ifdef __cplusplus
 extern "C" {
@@ -459,97 +440,114 @@ extern "C" {
 // Convert LAPACK character constants to MAGMA constants.
 // This is a one-to-many mapping, requiring multiple translators
 // (e.g., "N" can be NoTrans or NonUnit or NoVec).
-magma_bool_t   magma_bool_const  ( char lapack_char );
-magma_order_t  magma_order_const ( char lapack_char );
-magma_trans_t  magma_trans_const ( char lapack_char );
-magma_uplo_t   magma_uplo_const  ( char lapack_char );
-magma_diag_t   magma_diag_const  ( char lapack_char );
-magma_side_t   magma_side_const  ( char lapack_char );
-magma_norm_t   magma_norm_const  ( char lapack_char );
-magma_dist_t   magma_dist_const  ( char lapack_char );
-magma_sym_t    magma_sym_const   ( char lapack_char );
-magma_pack_t   magma_pack_const  ( char lapack_char );
-magma_vec_t    magma_vec_const   ( char lapack_char );
-magma_range_t  magma_range_const ( char lapack_char );
-magma_vect_t   magma_vect_const  ( char lapack_char );
-magma_direct_t magma_direct_const( char lapack_char );
-magma_storev_t magma_storev_const( char lapack_char );
-
+magma_bool_t magma_bool_const(char lapack_char);
+magma_order_t magma_order_const(char lapack_char);
+magma_trans_t magma_trans_const(char lapack_char);
+magma_uplo_t magma_uplo_const(char lapack_char);
+magma_diag_t magma_diag_const(char lapack_char);
+magma_side_t magma_side_const(char lapack_char);
+magma_norm_t magma_norm_const(char lapack_char);
+magma_dist_t magma_dist_const(char lapack_char);
+magma_sym_t magma_sym_const(char lapack_char);
+magma_pack_t magma_pack_const(char lapack_char);
+magma_vec_t magma_vec_const(char lapack_char);
+magma_range_t magma_range_const(char lapack_char);
+magma_vect_t magma_vect_const(char lapack_char);
+magma_direct_t magma_direct_const(char lapack_char);
+magma_storev_t magma_storev_const(char lapack_char);
 
 // --------------------
 // Convert MAGMA constants to LAPACK(E) constants.
 // The generic lapack_const works for all cases, but the specific routines
 // (e.g., lapack_trans_const) do better error checking.
-const char* lapack_const       ( int            magma_const );
-const char* lapack_bool_const  ( magma_bool_t   magma_const );
-const char* lapack_order_const ( magma_order_t  magma_const );
-const char* lapack_trans_const ( magma_trans_t  magma_const );
-const char* lapack_uplo_const  ( magma_uplo_t   magma_const );
-const char* lapack_diag_const  ( magma_diag_t   magma_const );
-const char* lapack_side_const  ( magma_side_t   magma_const );
-const char* lapack_norm_const  ( magma_norm_t   magma_const );
-const char* lapack_dist_const  ( magma_dist_t   magma_const );
-const char* lapack_sym_const   ( magma_sym_t    magma_const );
-const char* lapack_pack_const  ( magma_pack_t   magma_const );
-const char* lapack_vec_const   ( magma_vec_t    magma_const );
-const char* lapack_range_const ( magma_range_t  magma_const );
-const char* lapack_vect_const  ( magma_vect_t   magma_const );
-const char* lapack_direct_const( magma_direct_t magma_const );
-const char* lapack_storev_const( magma_storev_t magma_const );
-
-static inline char lapacke_const       ( int magma_const            ) { return *lapack_const       ( magma_const ); }
-static inline char lapacke_bool_const  ( magma_bool_t   magma_const ) { return *lapack_bool_const  ( magma_const ); }
-static inline char lapacke_order_const ( magma_order_t  magma_const ) { return *lapack_order_const ( magma_const ); }
-static inline char lapacke_trans_const ( magma_trans_t  magma_const ) { return *lapack_trans_const ( magma_const ); }
-static inline char lapacke_uplo_const  ( magma_uplo_t   magma_const ) { return *lapack_uplo_const  ( magma_const ); }
-static inline char lapacke_diag_const  ( magma_diag_t   magma_const ) { return *lapack_diag_const  ( magma_const ); }
-static inline char lapacke_side_const  ( magma_side_t   magma_const ) { return *lapack_side_const  ( magma_const ); }
-static inline char lapacke_norm_const  ( magma_norm_t   magma_const ) { return *lapack_norm_const  ( magma_const ); }
-static inline char lapacke_dist_const  ( magma_dist_t   magma_const ) { return *lapack_dist_const  ( magma_const ); }
-static inline char lapacke_sym_const   ( magma_sym_t    magma_const ) { return *lapack_sym_const   ( magma_const ); }
-static inline char lapacke_pack_const  ( magma_pack_t   magma_const ) { return *lapack_pack_const  ( magma_const ); }
-static inline char lapacke_vec_const   ( magma_vec_t    magma_const ) { return *lapack_vec_const   ( magma_const ); }
-static inline char lapacke_range_const ( magma_range_t  magma_const ) { return *lapack_range_const ( magma_const ); }
-static inline char lapacke_vect_const  ( magma_vect_t   magma_const ) { return *lapack_vect_const  ( magma_const ); }
-static inline char lapacke_direct_const( magma_direct_t magma_const ) { return *lapack_direct_const( magma_const ); }
-static inline char lapacke_storev_const( magma_storev_t magma_const ) { return *lapack_storev_const( magma_const ); }
-
-
-// --------------------
-// Convert MAGMA constants to clBLAS constants.
-#if defined(HAVE_clBLAS)
-clblasOrder          clblas_order_const( magma_order_t order );
-clblasTranspose      clblas_trans_const( magma_trans_t trans );
-clblasUplo           clblas_uplo_const ( magma_uplo_t  uplo  );
-clblasDiag           clblas_diag_const ( magma_diag_t  diag  );
-clblasSide           clblas_side_const ( magma_side_t  side  );
-#endif
-
+const char* lapack_const(int magma_const);
+const char* lapack_bool_const(magma_bool_t magma_const);
+const char* lapack_order_const(magma_order_t magma_const);
+const char* lapack_trans_const(magma_trans_t magma_const);
+const char* lapack_uplo_const(magma_uplo_t magma_const);
+const char* lapack_diag_const(magma_diag_t magma_const);
+const char* lapack_side_const(magma_side_t magma_const);
+const char* lapack_norm_const(magma_norm_t magma_const);
+const char* lapack_dist_const(magma_dist_t magma_const);
+const char* lapack_sym_const(magma_sym_t magma_const);
+const char* lapack_pack_const(magma_pack_t magma_const);
+const char* lapack_vec_const(magma_vec_t magma_const);
+const char* lapack_range_const(magma_range_t magma_const);
+const char* lapack_vect_const(magma_vect_t magma_const);
+const char* lapack_direct_const(magma_direct_t magma_const);
+const char* lapack_storev_const(magma_storev_t magma_const);
+
+static inline char lapacke_const(int magma_const) {
+    return *lapack_const(magma_const);
+}
+static inline char lapacke_bool_const(magma_bool_t magma_const) {
+    return *lapack_bool_const(magma_const);
+}
+static inline char lapacke_order_const(magma_order_t magma_const) {
+    return *lapack_order_const(magma_const);
+}
+static inline char lapacke_trans_const(magma_trans_t magma_const) {
+    return *lapack_trans_const(magma_const);
+}
+static inline char lapacke_uplo_const(magma_uplo_t magma_const) {
+    return *lapack_uplo_const(magma_const);
+}
+static inline char lapacke_diag_const(magma_diag_t magma_const) {
+    return *lapack_diag_const(magma_const);
+}
+static inline char lapacke_side_const(magma_side_t magma_const) {
+    return *lapack_side_const(magma_const);
+}
+static inline char lapacke_norm_const(magma_norm_t magma_const) {
+    return *lapack_norm_const(magma_const);
+}
+static inline char lapacke_dist_const(magma_dist_t magma_const) {
+    return *lapack_dist_const(magma_const);
+}
+static inline char lapacke_sym_const(magma_sym_t magma_const) {
+    return *lapack_sym_const(magma_const);
+}
+static inline char lapacke_pack_const(magma_pack_t magma_const) {
+    return *lapack_pack_const(magma_const);
+}
+static inline char lapacke_vec_const(magma_vec_t magma_const) {
+    return *lapack_vec_const(magma_const);
+}
+static inline char lapacke_range_const(magma_range_t magma_const) {
+    return *lapack_range_const(magma_const);
+}
+static inline char lapacke_vect_const(magma_vect_t magma_const) {
+    return *lapack_vect_const(magma_const);
+}
+static inline char lapacke_direct_const(magma_direct_t magma_const) {
+    return *lapack_direct_const(magma_const);
+}
+static inline char lapacke_storev_const(magma_storev_t magma_const) {
+    return *lapack_storev_const(magma_const);
+}
 
 // --------------------
 // Convert MAGMA constants to CUBLAS constants.
 #if defined(CUBLAS_V2_H_)
-cublasOperation_t    cublas_trans_const ( magma_trans_t trans );
-cublasFillMode_t     cublas_uplo_const  ( magma_uplo_t  uplo  );
-cublasDiagType_t     cublas_diag_const  ( magma_diag_t  diag  );
-cublasSideMode_t     cublas_side_const  ( magma_side_t  side  );
+cublasOperation_t cublas_trans_const(magma_trans_t trans);
+cublasFillMode_t cublas_uplo_const(magma_uplo_t uplo);
+cublasDiagType_t cublas_diag_const(magma_diag_t diag);
+cublasSideMode_t cublas_side_const(magma_side_t side);
 #endif
 
-
 // --------------------
 // Convert MAGMA constants to CBLAS constants.
 #if defined(HAVE_CBLAS)
 #include <cblas.h>
-enum CBLAS_ORDER     cblas_order_const  ( magma_order_t order );
-enum CBLAS_TRANSPOSE cblas_trans_const  ( magma_trans_t trans );
-enum CBLAS_UPLO      cblas_uplo_const   ( magma_uplo_t  uplo  );
-enum CBLAS_DIAG      cblas_diag_const   ( magma_diag_t  diag  );
-enum CBLAS_SIDE      cblas_side_const   ( magma_side_t  side  );
+enum CBLAS_ORDER cblas_order_const(magma_order_t order);
+enum CBLAS_TRANSPOSE cblas_trans_const(magma_trans_t trans);
+enum CBLAS_UPLO cblas_uplo_const(magma_uplo_t uplo);
+enum CBLAS_DIAG cblas_diag_const(magma_diag_t diag);
+enum CBLAS_SIDE cblas_side_const(magma_side_t side);
 #endif
 
-
 #ifdef __cplusplus
 }
 #endif
 
-#endif        //  #ifndef MAGMA_TYPES_H
+#endif  //  #ifndef MAGMA_TYPES_H
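Reviewer note on the converters declared in the header diff above: each `lapacke_*_const` wrapper returns the first character of the string produced by the matching `lapack_*_const`, and the reverse direction needs one translator per enum because a single LAPACK character maps to several MAGMA constants. The sketch below illustrates that pattern on a two-enum subset. The `sketch_*` functions are hypothetical stand-ins written for this note; they are not the implementations in control/constants.cpp, which the comments in the header describe as table-driven via `magma2lapack_constants[]`.

```cpp
#include <cassert>

// Subset of the enums from the header, copied with the same values.
typedef enum { MagmaNoTrans = 111, MagmaTrans = 112, MagmaConjTrans = 113 } magma_trans_t;
typedef enum { MagmaUpper = 121, MagmaLower = 122 } magma_uplo_t;

// Hypothetical MAGMA -> LAPACK string converters (assumed behaviour:
// return the same strings as the Magma*Str macros in the header).
static const char* sketch_lapack_trans_const(magma_trans_t c) {
    switch (c) {
        case MagmaNoTrans: return "NoTrans";
        case MagmaTrans: return "Trans";
        case MagmaConjTrans: return "ConjTrans";
    }
    return "?";  // out-of-range input; the real code may report an error
}

static const char* sketch_lapack_uplo_const(magma_uplo_t c) {
    switch (c) {
        case MagmaUpper: return "Upper";
        case MagmaLower: return "Lower";
    }
    return "?";
}

// Same shape as the lapacke_*_const wrappers in the header:
// dereference the string to get the single LAPACKE character.
static inline char sketch_lapacke_trans_const(magma_trans_t c) {
    return *sketch_lapack_trans_const(c);
}
static inline char sketch_lapacke_uplo_const(magma_uplo_t c) {
    return *sketch_lapack_uplo_const(c);
}

// Hypothetical LAPACK char -> MAGMA converter. 'N' could also mean
// NonUnit or NoVec, which is why each enum gets its own translator.
static magma_trans_t sketch_magma_trans_const(char lapack_char) {
    switch (lapack_char) {
        case 'N': case 'n': return MagmaNoTrans;
        case 'T': case 't': return MagmaTrans;
        case 'C': case 'c': return MagmaConjTrans;
    }
    return MagmaNoTrans;  // fallback chosen for the sketch only
}
```

Under these assumptions, `sketch_lapacke_uplo_const(MagmaUpper)` yields `'U'`, which matches the `*MagmaUpperStr` expression that potrf.cpp passes to the CPU LAPACK call.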
diff --git a/src/backend/opencl/magma/potrf.cpp b/src/backend/opencl/magma/potrf.cpp
index f09cd2299a..68909fe8f9 100644
--- a/src/backend/opencl/magma/potrf.cpp
+++ b/src/backend/opencl/magma/potrf.cpp
@@ -31,42 +31,39 @@
  * * Redistributions  of  source  code  must  retain  the above copyright
  *   notice,  this  list  of  conditions  and  the  following  disclaimer.
  * * Redistributions  in  binary  form must reproduce the above copyright
- *   notice,  this list of conditions and the following disclaimer in the 
+ *   notice,  this list of conditions and the following disclaimer in the
  *   documentation  and/or other materials provided with the distribution.
- * * Neither  the  name of the University of Tennessee, Knoxville nor the 
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
  *   names of its contributors may be used to endorse or promote products
  *   derived from this software without specific prior written permission.
  *
  * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
- * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT 
- * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR 
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
  * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
  * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT 
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
  * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
- * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY 
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
  * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
- * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE 
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
  * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  *
  **********************************************************************/
 
 #include "magma.h"
 #include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
 #include <algorithm>
 
 template<typename Ty>
-magma_int_t magma_potrf_gpu(
-    magma_uplo_t   uplo, magma_int_t    n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    magma_queue_t queue,
-    magma_int_t*   info)
-{
+magma_int_t magma_potrf_gpu(magma_uplo_t uplo, magma_int_t n, cl_mem dA,
+                            size_t dA_offset, magma_int_t ldda,
+                            magma_queue_t queue, magma_int_t* info) {
 /*  -- clMAGMA (version 0.1) --
     Univ. of Tennessee, Knoxville
     Univ. of California, Berkeley
@@ -120,18 +117,19 @@ magma_int_t magma_potrf_gpu(
     =====================================================================   */
 
 // produces pointer and offset as two args to magmaBLAS routines
-#define dA(i,j)  dA, ((dA_offset) + (i) + (j)*ldda)
+#define dA(i, j) dA, ((dA_offset) + (i) + (j)*ldda)
 
 // produces pointer as single arg to BLAS routines
-#define A(i,j)  &A[ (i) + (j)*lda ]
+#define A(i, j) &A[(i) + (j)*lda]
 
     magma_int_t j, jb, nb;
-    static const Ty  z_one = magma_one<Ty>();
-    static const Ty mz_one = magma_neg_one<Ty>();
-    static const double    one =  1.0;
-    static const double  m_one = -1.0;
+    static const Ty z_one     = magma_one<Ty>();
+    static const Ty mz_one    = magma_neg_one<Ty>();
+    static const double one   = 1.0;
+    static const double m_one = -1.0;
 
-    static const clblasTranspose transType = magma_is_real<Ty>() ? clblasTrans : clblasConjTrans;
+    static const OPENCL_BLAS_TRANS_T transType =
+        magma_is_real<Ty>() ? OPENCL_BLAS_TRANS : OPENCL_BLAS_CONJ_TRANS;
 
     Ty* work;
     magma_int_t err;
@@ -141,23 +139,22 @@ magma_int_t magma_potrf_gpu(
         *info = -1;
     } else if (n < 0) {
         *info = -2;
-    } else if (ldda < std::max(1,n)) {
+    } else if (ldda < std::max(1, n)) {
         *info = -4;
     }
     if (*info != 0) {
-        //magma_xerbla(__func__, -(*info));
+        // magma_xerbla(__func__, -(*info));
         return *info;
     }
 
     nb = magma_get_potrf_nb<Ty>(n);
 
-    gemm_func<Ty> gpu_gemm;
-    trsm_func<Ty> gpu_trsm;
-    herk_func<Ty> gpu_herk;
-    potrf_func<Ty> cpu_potrf;
-
+    gpu_blas_gemm_func<Ty> gpu_blas_gemm;
+    gpu_blas_trsm_func<Ty> gpu_blas_trsm;
+    gpu_blas_herk_func<Ty> gpu_blas_herk;
+    cpu_lapack_potrf_func<Ty> cpu_lapack_potrf;
 
-    err = magma_malloc_cpu<Ty>( &work, nb*nb);
+    err = magma_malloc_cpu<Ty>(&work, nb * nb);
     if (err != MAGMA_SUCCESS) {
         *info = MAGMA_ERR_HOST_ALLOC;
         return *info;
@@ -170,51 +167,43 @@ magma_int_t magma_potrf_gpu(
         // use unblocked code
         magma_getmatrix<Ty>(n, n, dA, dA_offset, ldda, work, n, queue);
 
-        cpu_potrf(LAPACK_COL_MAJOR,
-                  uplo == MagmaUpper ? *MagmaUpperStr : *MagmaLowerStr,
-                  n, work, n);
+        LAPACKE_CHECK(cpu_lapack_potrf(
+            uplo == MagmaUpper ? *MagmaUpperStr : *MagmaLowerStr, n, work, n));
 
         magma_setmatrix<Ty>(n, n, work, n, dA, dA_offset, ldda, queue);
-    }
-    else {
+    } else {
         if (uplo == MagmaUpper) {
             // --------------------
             // compute Cholesky factorization A = U'*U
             // using the left looking algorithm
-            for(j = 0; j < n; j += nb) {
+            for (j = 0; j < n; j += nb) {
                 // apply all previous updates to diagonal block
-                jb = std::min(nb, n-j);
+                jb = std::min(nb, n - j);
                 if (j > 0) {
-                    gpu_herk(clblasColumnMajor,
-                             clblasUpper, transType,
-                             jb, j,
-                             m_one,
-                             dA(0,j), ldda,
-                             one,
-                             dA(j,j), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                    OPENCL_BLAS_CHECK(gpu_blas_herk(
+                        OPENCL_BLAS_TRIANGLE_UPPER, transType, jb, j, m_one,
+                        dA(0, j), ldda, one, dA(j, j), ldda, 1, &queue, 0,
+                        nullptr, &blas_event));
                 }
 
                 // start asynchronous data transfer
-                magma_getmatrix_async<Ty>(jb, jb, dA(j,j), ldda, work, jb, queue, &event);
-
-                // apply all previous updates to block row right of diagonal block
-                if (j+jb < n) {
-                    gpu_gemm(clblasColumnMajor,
-                             transType, clblasNoTrans,
-                             jb, n-j-jb, j,
-                             mz_one,
-                             dA(0, j   ), ldda,
-                             dA(0, j+jb), ldda,
-                             z_one,
-                             dA(j, j+jb), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                magma_getmatrix_async<Ty>(jb, jb, dA(j, j), ldda, work, jb,
+                                          queue, &event);
+
+                // apply all previous updates to block row right of diagonal
+                // block
+                if (j + jb < n && j > 0) {
+                    OPENCL_BLAS_CHECK(gpu_blas_gemm(
+                        transType, OPENCL_BLAS_NO_TRANS, jb, n - j - jb, j,
+                        mz_one, dA(0, j), ldda, dA(0, j + jb), ldda, z_one,
+                        dA(j, j + jb), ldda, 1, &queue, 0, nullptr,
+                        &blas_event));
                 }
 
                 // simultaneous with above zgemm, transfer data, factor
                 // diagonal block on CPU, and test for positive definiteness
                 magma_event_sync(event);
-                *info =cpu_potrf(LAPACK_COL_MAJOR, *MagmaUpperStr, jb, work, jb);
+                LAPACKE_CHECK(cpu_lapack_potrf(*MagmaUpperStr, jb, work, jb));
 
                 if (*info != 0) {
                     assert(*info > 0);
@@ -222,77 +211,67 @@ magma_int_t magma_potrf_gpu(
                     break;
                 }
 
-                magma_setmatrix_async<Ty>(jb, jb, work, jb, dA(j,j), ldda, queue, &event);
+                magma_setmatrix_async<Ty>(jb, jb, work, jb, dA(j, j), ldda,
+                                          queue, &event);
 
                 // apply diagonal block to block row right of diagonal block
-                if (j+jb < n) {
+                if (j + jb < n) {
                     magma_event_sync(event);
-                    gpu_trsm(clblasColumnMajor,
-                             clblasLeft, clblasUpper,
-                             transType, clblasNonUnit,
-                             jb, n-j-jb,
-                             z_one,
-                             dA(j, j   ), ldda,
-                             dA(j, j+jb), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                    OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                        OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_UPPER,
+                        transType, OPENCL_BLAS_NON_UNIT_DIAGONAL, jb,
+                        n - j - jb, z_one, dA(j, j), ldda, dA(j, j + jb), ldda,
+                        1, &queue, 0, nullptr, &blas_event));
                 }
             }
-        }
-        else {
+        } else {
             // --------------------
             // compute Cholesky factorization A = L*L'
             // using the left looking algorithm
-            for(j = 0; j < n; j += nb) {
+            for (j = 0; j < n; j += nb) {
                 // apply all previous updates to diagonal block
-                jb = std::min(nb, n-j);
-                if (j>0) {
-                    gpu_herk(clblasColumnMajor,
-                             clblasLower, clblasNoTrans, jb, j,
-                             m_one,
-                             dA(j, 0), ldda,
-                             one,
-                             dA(j, j), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                jb = std::min(nb, n - j);
+                if (j > 0) {
+                    OPENCL_BLAS_CHECK(gpu_blas_herk(
+                        OPENCL_BLAS_TRIANGLE_LOWER, OPENCL_BLAS_NO_TRANS, jb, j,
+                        m_one, dA(j, 0), ldda, one, dA(j, j), ldda, 1, &queue,
+                        0, nullptr, &blas_event));
                 }
 
                 // start asynchronous data transfer
-                magma_getmatrix_async<Ty>(jb, jb, dA(j,j), ldda, work, jb, queue, &event);
-
-                // apply all previous updates to block column below diagonal block
-                if (j+jb < n) {
-                    gpu_gemm(clblasColumnMajor,
-                             clblasNoTrans, transType,
-                             n-j-jb, jb, j,
-                             mz_one,
-                             dA(j+jb, 0), ldda,
-                             dA(j,    0), ldda,
-                             z_one,
-                             dA(j+jb, j), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                magma_getmatrix_async<Ty>(jb, jb, dA(j, j), ldda, work, jb,
+                                          queue, &event);
+
+                // apply all previous updates to block column below diagonal
+                // block
+                if (j + jb < n && j > 0) {
+                    OPENCL_BLAS_CHECK(gpu_blas_gemm(
+                        OPENCL_BLAS_NO_TRANS, transType, n - j - jb, jb, j,
+                        mz_one, dA(j + jb, 0), ldda, dA(j, 0), ldda, z_one,
+                        dA(j + jb, j), ldda, 1, &queue, 0, nullptr,
+                        &blas_event));
                 }
 
                 // simultaneous with above zgemm, transfer data, factor
                 // diagonal block on CPU, and test for positive definiteness
                 magma_event_sync(event);
-                *info = cpu_potrf(LAPACK_COL_MAJOR,
-                                  *MagmaLowerStr, jb, work, jb);
+                LAPACKE_CHECK(cpu_lapack_potrf(*MagmaLowerStr, jb, work, jb));
                 if (*info != 0) {
                     assert(*info > 0);
                     *info += j;
                     break;
                 }
-                magma_setmatrix_async<Ty>(jb, jb, work, jb, dA(j,j), ldda, queue, &event);
+                magma_setmatrix_async<Ty>(jb, jb, work, jb, dA(j, j), ldda,
+                                          queue, &event);
 
                 // apply diagonal block to block column below diagonal
-                if (j+jb < n) {
+                if (j + jb < n) {
                     magma_event_sync(event);
-                    gpu_trsm(clblasColumnMajor,
-                             clblasRight, clblasLower, transType, clblasNonUnit,
-                             n-j-jb, jb,
-                             z_one,
-                             dA(j   , j), ldda,
-                             dA(j+jb, j), ldda,
-                             1, &queue, 0, nullptr, &blas_event);
+                    OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                        OPENCL_BLAS_SIDE_RIGHT, OPENCL_BLAS_TRIANGLE_LOWER,
+                        transType, OPENCL_BLAS_NON_UNIT_DIAGONAL, n - j - jb,
+                        jb, z_one, dA(j, j), ldda, dA(j + jb, j), ldda, 1,
+                        &queue, 0, nullptr, &blas_event));
                 }
             }
         }
@@ -304,12 +283,10 @@ magma_int_t magma_potrf_gpu(
     return *info;
 }
 
-#define INSTANTIATE(T)                                  \
-    template magma_int_t magma_potrf_gpu<T>(            \
-        magma_uplo_t   uplo, magma_int_t    n,          \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        magma_queue_t queue,                            \
-        magma_int_t*   info);                           \
+#define INSTANTIATE(T)                                                 \
+    template magma_int_t magma_potrf_gpu<T>(                           \
+        magma_uplo_t uplo, magma_int_t n, cl_mem dA, size_t dA_offset, \
+        magma_int_t ldda, magma_queue_t queue, magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/swapdblk.cpp b/src/backend/opencl/magma/swapdblk.cpp
index 412138727e..6a669a54ce 100644
--- a/src/backend/opencl/magma/swapdblk.cpp
+++ b/src/backend/opencl/magma/swapdblk.cpp
@@ -7,28 +7,24 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include "magma_data.h"
 #include "kernel/swapdblk.hpp"
+#include "magma_data.h"
 
-template<typename T> void
-magmablas_swapdblk(magma_int_t n, magma_int_t nb,
-                   cl_mem dA, magma_int_t dA_offset, magma_int_t ldda, magma_int_t inca,
-                   cl_mem dB, magma_int_t dB_offset, magma_int_t lddb, magma_int_t incb,
-                   magma_queue_t queue)
-{
-    opencl::kernel::swapdblk<T>(n, nb,
-                                dA, dA_offset, ldda, inca,
-                                dB, dB_offset, lddb, incb);
+template<typename T>
+void magmablas_swapdblk(magma_int_t n, magma_int_t nb, cl_mem dA,
+                        magma_int_t dA_offset, magma_int_t ldda,
+                        magma_int_t inca, cl_mem dB, magma_int_t dB_offset,
+                        magma_int_t lddb, magma_int_t incb,
+                        magma_queue_t queue) {
+    arrayfire::opencl::kernel::swapdblk<T>(n, nb, dA, dA_offset, ldda, inca, dB,
+                                           dB_offset, lddb, incb, queue);
 }
 
-
-#define INSTANTIATE(T)                                                  \
-    template void magmablas_swapdblk<T>(magma_int_t n, magma_int_t nb,  \
-                                        cl_mem dA, magma_int_t dA_offset, \
-                                        magma_int_t ldda, magma_int_t inca, \
-                                        cl_mem dB, magma_int_t dB_offset, \
-                                        magma_int_t lddb, magma_int_t incb, \
-                                        magma_queue_t queue);           \
+#define INSTANTIATE(T)                                                        \
+    template void magmablas_swapdblk<T>(                                      \
+        magma_int_t n, magma_int_t nb, cl_mem dA, magma_int_t dA_offset,      \
+        magma_int_t ldda, magma_int_t inca, cl_mem dB, magma_int_t dB_offset, \
+        magma_int_t lddb, magma_int_t incb, magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/transpose.cpp b/src/backend/opencl/magma/transpose.cpp
index 763cef37fd..a33d440f95 100644
--- a/src/backend/opencl/magma/transpose.cpp
+++ b/src/backend/opencl/magma/transpose.cpp
@@ -7,57 +7,101 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include "magma_data.h"
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
 #include "kernel/transpose.hpp"
+#include "magma_data.h"
 
-template<typename T> void
-magmablas_transpose(
-    magma_int_t m, magma_int_t n,
-    cl_mem dA,  size_t dA_offset,  magma_int_t ldda,
-    cl_mem dAT, size_t dAT_offset, magma_int_t lddat,
-    magma_queue_t queue)
-{
+using arrayfire::opencl::makeParam;
+using arrayfire::opencl::kernel::transpose;
+using cl::Buffer;
+using cl::CommandQueue;
+
+template<typename T>
+void magmablas_transpose(magma_int_t m, magma_int_t n, cl_mem dA,
+                         size_t dA_offset, magma_int_t ldda, cl_mem dAT,
+                         size_t dAT_offset, magma_int_t lddat,
+                         magma_queue_t queue) {
     magma_int_t info = 0;
-    if ( m < 0 )
+    if (m < 0) {
         info = -1;
-    else if ( n < 0 )
+    } else if (n < 0) {
         info = -2;
-    else if ( ldda < m )
+    } else if (ldda < m) {
         info = -4;
-    else if ( lddat < n )
+    } else if (lddat < n) {
         info = -6;
+    }
 
-    if ( info != 0 ) {
-        //magma_xerbla( __func__, -(info) );
-        return;  //info;
+    if (info != 0) {
+        // magma_xerbla( __func__, -(info) );
+        return;  // info;
     }
 
     /* Quick return */
-    if ( (m == 0) || (n == 0) )
-        return;
+    if ((m == 0) || (n == 0)) { return; }
 
-    int idims[] = {m, n, 1, 1};
-    int odims[] = {n, m, 1, 1};
+    int idims[]    = {m, n, 1, 1};
+    int odims[]    = {n, m, 1, 1};
     int istrides[] = {1, ldda, ldda * n, ldda * n};
     int ostrides[] = {1, lddat, lddat * m, lddat * m};
 
-    using namespace opencl;
+    Buffer dATBuf(dAT, true);
+    Buffer dABuf(dA, true);
 
-    if (m % 32 == 0 && n % 32 == 0) {
-        kernel::transpose<T, false, true >(makeParam(dAT, dAT_offset, odims, ostrides),
-                                           makeParam(dA , dA_offset , idims, istrides));
-    } else {
-        kernel::transpose<T, false, false>(makeParam(dAT, dAT_offset, odims, ostrides),
-                                           makeParam(dA , dA_offset , idims, istrides));
-    }
+    CommandQueue q(queue, true);
+    transpose<T>(makeParam(dATBuf, dAT_offset, odims, ostrides),
+                 makeParam(dABuf, dA_offset, idims, istrides), q, false,
+                 m % 32 == 0 && n % 32 == 0);
 }
 
-#define INSTANTIATE(T)                                      \
-    template void magmablas_transpose<T>(                   \
-        magma_int_t m, magma_int_t n,                       \
-        cl_mem dA,  size_t dA_offset,  magma_int_t ldda,    \
-        cl_mem dAT, size_t dAT_offset, magma_int_t lddat,   \
-        magma_queue_t queue);                               \
+#define INSTANTIATE(T)                                                      \
+    template void magmablas_transpose<T>(                                   \
+        magma_int_t m, magma_int_t n, cl_mem dA, size_t dA_offset,          \
+        magma_int_t ldda, cl_mem dAT, size_t dAT_offset, magma_int_t lddat, \
+        magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/transpose_inplace.cpp b/src/backend/opencl/magma/transpose_inplace.cpp
index 5855a0ddbb..7705edb7b3 100644
--- a/src/backend/opencl/magma/transpose_inplace.cpp
+++ b/src/backend/opencl/magma/transpose_inplace.cpp
@@ -7,46 +7,89 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include "magma_data.h"
+/***********************************************************************
+ * Based on MAGMA library http://icl.cs.utk.edu/magma/
+ * Below is the original copyright.
+ *
+ *   -- MAGMA (version 0.1) --
+ *      Univ. of Tennessee, Knoxville
+ *      Univ. of California, Berkeley
+ *      Univ. of Colorado, Denver
+ *      @date
+ *
+ *      @precisions normal z -> s d c
+ *
+ * -- Innovative Computing Laboratory
+ * -- Electrical Engineering and Computer Science Department
+ * -- University of Tennessee
+ * -- (C) Copyright 2009-2013
+ *
+ * Redistribution  and  use  in  source and binary forms, with or without
+ * modification,  are  permitted  provided  that the following conditions
+ * are met:
+ *
+ * * Redistributions  of  source  code  must  retain  the above copyright
+ *   notice,  this  list  of  conditions  and  the  following  disclaimer.
+ * * Redistributions  in  binary  form must reproduce the above copyright
+ *   notice,  this list of conditions and the following disclaimer in the
+ *   documentation  and/or other materials provided with the distribution.
+ * * Neither  the  name of the University of Tennessee, Knoxville nor the
+ *   names of its contributors may be used to endorse or promote products
+ *   derived from this software without specific prior written permission.
+ *
+ * THIS  SOFTWARE  IS  PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * ``AS IS''  AND  ANY  EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED  TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A  PARTICULAR  PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * HOLDERS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL,  EXEMPLARY,  OR  CONSEQUENTIAL  DAMAGES  (INCLUDING,  BUT NOT
+ * LIMITED  TO,  PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA,  OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY  OF  LIABILITY,  WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING  NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF  THIS  SOFTWARE,  EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ **********************************************************************/
+
 #include "kernel/transpose_inplace.hpp"
+#include "magma_data.h"
+
+using arrayfire::opencl::makeParam;
+using arrayfire::opencl::kernel::transpose_inplace;
+using cl::Buffer;
+using cl::CommandQueue;
 
-template<typename T> void
-magmablas_transpose_inplace(
-    magma_int_t n,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    magma_queue_t queue)
-{
+template<typename T>
+void magmablas_transpose_inplace(magma_int_t n, cl_mem dA, size_t dA_offset,
+                                 magma_int_t ldda, magma_queue_t queue) {
     magma_int_t info = 0;
-    if ( n < 0 )
+    if (n < 0) {
         info = -1;
-    else if ( ldda < n )
+    } else if (ldda < n) {
         info = -3;
+    }
 
-    if ( info != 0 ) {
-        //magma_xerbla( __func__, -(info) );
-        return;  //info;
+    if (info != 0) {
+        // magma_xerbla( __func__, -(info) );
+        return;  // info;
     }
 
-    if (n == 0) return;
+    if (n == 0) { return; }
 
-    int dims[] = {n, n, 1, 1};
+    int dims[]    = {n, n, 1, 1};
     int strides[] = {1, ldda, ldda * n, ldda * n};
 
-    using namespace opencl;
+    Buffer dABuf(dA, true);
 
-    if (n % 32 == 0) {
-        kernel::transpose_inplace<T, false, true >(makeParam(dA , dA_offset , dims, strides));
-    } else {
-        kernel::transpose_inplace<T, false, false>(makeParam(dA , dA_offset , dims, strides));
-    }
+    CommandQueue q(queue, true);
+    transpose_inplace<T>(makeParam(dABuf, dA_offset, dims, strides), q, false,
+                         n % 32 == 0);
 }
 
-#define INSTANTIATE(T)                                  \
-    template void magmablas_transpose_inplace<T>(       \
-        magma_int_t n,                                  \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        magma_queue_t queue);                           \
-
+#define INSTANTIATE(T)                                                \
+    template void magmablas_transpose_inplace<T>(                     \
+        magma_int_t n, cl_mem dA, size_t dA_offset, magma_int_t ldda, \
+        magma_queue_t queue);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/ungqr.cpp b/src/backend/opencl/magma/ungqr.cpp
index c0dd47bb4c..3f0ef001d2 100644
--- a/src/backend/opencl/magma/ungqr.cpp
+++ b/src/backend/opencl/magma/ungqr.cpp
@@ -52,25 +52,21 @@
  **********************************************************************/
 
 #include "magma.h"
-#include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
 #include <algorithm>
 
-template<typename Ty>  magma_int_t
-magma_ungqr_gpu(
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    cl_mem dT, size_t dT_offset, magma_int_t nb,
-    magma_queue_t queue,
-    magma_int_t *info)
-{
-#define dA(i,j) (dA),  ((i) + (j)*ldda)
-#define dT(j)   (dT),  ((j)*nb)
+template<typename Ty>
+magma_int_t magma_ungqr_gpu(magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                            Ty *tau, cl_mem dT, size_t dT_offset,
+                            magma_int_t nb, magma_queue_t queue,
+                            magma_int_t *info) {
+#define dA(i, j) (dA), (dA_offset + ((i) + (j)*ldda))
+#define dT(j) (dT), (dT_offset + ((j)*nb))
 
     static const Ty c_zero = magma_zero<Ty>();
     static const Ty c_one  = magma_one<Ty>();
@@ -89,23 +85,21 @@ magma_ungqr_gpu(
         *info = -2;
     } else if ((k < 0) || (k > n)) {
         *info = -3;
-    } else if (ldda < std::max(1,m)) {
+    } else if (ldda < std::max(1, m)) {
         *info = -5;
     }
     if (*info != 0) {
-        //magma_xerbla( __func__, -(*info));
+        // magma_xerbla( __func__, -(*info));
         return *info;
     }
 
-    if (n <= 0) {
-        return *info;
-    }
+    if (n <= 0) { return *info; }
 
     // first kk columns are handled by blocked method.
     // ki is start of 2nd-to-last block
     if ((nb > 1) && (nb < k)) {
         ki = (k - nb - 1) / nb * nb;
-        kk = std::min(k, ki+nb);
+        kk = std::min(k, ki + nb);
     } else {
         ki = 0;
         kk = 0;
@@ -114,8 +108,8 @@ magma_ungqr_gpu(
     // Allocate CPU work space
     // n*nb for zungqr workspace
     // (m - kk)*(n - kk) for last block's panel
-    lwork = n*nb;
-    lpanel = (m - kk)*(n - kk);
+    lwork  = n * nb;
+    lpanel = (m - kk) * (n - kk);
     magma_malloc_cpu<Ty>(&work, lwork + lpanel);
     if (work == NULL) {
         *info = MAGMA_ERR_HOST_ALLOC;
@@ -124,7 +118,7 @@ magma_ungqr_gpu(
     panel = work + lwork;
 
     // Allocate work space on GPU
-    if (MAGMA_SUCCESS != magma_malloc<Ty>(&dV, ldda*nb)) {
+    if (MAGMA_SUCCESS != magma_malloc<Ty>(&dV, ldda * nb)) {
         magma_free_cpu(work);
         *info = MAGMA_ERR_DEVICE_ALLOC;
         return *info;
@@ -133,30 +127,32 @@ magma_ungqr_gpu(
     // dT workspace has:
     // 2*std::min(m,n)*nb      for T and R^{-1} matrices from geqrf
     // ((n+31)/32*32)*nb for dW larfb workspace.
-    lddwork = std::min(m,n);
+    lddwork = std::min(m, n);
     cl_mem dW;
-    magma_malloc<Ty>(&dW, (((n+31)/32)*32)*nb);
+    if (MAGMA_SUCCESS != magma_malloc<Ty>(&dW, (((n + 31) / 32) * 32) * nb)) {
+        magma_free_cpu(work);
+        magma_free(dV);
+        *info = MAGMA_ERR_DEVICE_ALLOC;
+        return *info;
+    }
 
-    ungqr_work_func<Ty> cpu_ungqr;
+    cpu_lapack_ungqr_work_func<Ty> cpu_lapack_ungqr;
 
     // Use unblocked code for the last or only block.
     if (kk < n) {
         m_kk = m - kk;
         n_kk = n - kk;
         k_kk = k - kk;
-        magma_getmatrix<Ty>(m_kk, k_kk,
-                            dA(kk, kk), ldda, panel, m_kk, queue);
+        magma_getmatrix<Ty>(m_kk, k_kk, dA(kk, kk), ldda, panel, m_kk, queue);
 
-        cpu_ungqr(LAPACK_COL_MAJOR,
-                  m_kk, n_kk, k_kk,
-                  panel, m_kk,
-                  &tau[kk], work, lwork);
+        LAPACKE_CHECK(cpu_lapack_ungqr(m_kk, n_kk, k_kk, panel, m_kk, &tau[kk],
+                                       work, lwork));
 
-        magma_setmatrix<Ty>(m_kk, n_kk,
-                            panel, m_kk, dA(kk, kk), ldda, queue);
+        magma_setmatrix<Ty>(m_kk, n_kk, panel, m_kk, dA(kk, kk), ldda, queue);
 
         // Set A(1:kk,kk+1:n) to zero.
-        magmablas_laset<Ty>(MagmaFull, kk, n - kk, c_zero, c_zero, dA(0, kk), ldda, queue);
+        magmablas_laset<Ty>(MagmaFull, kk, n - kk, c_zero, c_zero, dA(0, kk),
+                            ldda, queue);
     }
 
     if (kk > 0) {
@@ -165,25 +161,24 @@ magma_ungqr_gpu(
         // CPU has no computation
 
         for (i = ki; i >= 0; i -= nb) {
-            ib = std::min(nb, k-i);
+            ib = std::min(nb, k - i);
             mi = m - i;
 
             // Copy current panel on the GPU from dA to dV
-            magma_copymatrix<Ty>(mi, ib,
-                                 dA(i,i), ldda,
-                                 dV, 0,   ldda, queue);
+            magma_copymatrix<Ty>(mi, ib, dA(i, i), ldda, dV, 0, ldda, queue);
 
             // set panel to identity
-            magmablas_laset<Ty>(MagmaFull, i,  ib, c_zero, c_zero, dA(0, i), ldda, queue);
-            magmablas_laset<Ty>(MagmaFull, mi, ib, c_zero, c_one,  dA(i, i), ldda, queue);
+            magmablas_laset<Ty>(MagmaFull, i, ib, c_zero, c_zero, dA(0, i),
+                                ldda, queue);
+            magmablas_laset<Ty>(MagmaFull, mi, ib, c_zero, c_one, dA(i, i),
+                                ldda, queue);
 
             if (i < n) {
-
                 // Apply H to A(i:m,i:n) from the left
-                magma_larfb_gpu<Ty>(MagmaLeft, MagmaNoTrans, MagmaForward, MagmaColumnwise,
-                                    mi, n-i, ib,
-                                    dV, 0,    ldda, dT(i), nb,
-                                    dA(i, i), ldda, dW, 0, lddwork, queue);
+                magma_larfb_gpu<Ty>(MagmaLeft, MagmaNoTrans, MagmaForward,
+                                    MagmaColumnwise, mi, n - i, ib, dV, 0, ldda,
+                                    dT(i), nb, dA(i, i), ldda, dW, 0, lddwork,
+                                    queue);
             }
         }
     }
@@ -192,17 +187,14 @@ magma_ungqr_gpu(
     magma_free(dW);
     magma_free_cpu(work);
     return *info;
-
 }
 
-#define INSTANTIATE(T)                                                  \
-    template  magma_int_t                                               \
-    magma_ungqr_gpu<T>(magma_int_t m, magma_int_t n, magma_int_t k,     \
-                       cl_mem dA, size_t dA_offset, magma_int_t ldda,   \
-                       T *tau,                                          \
-                       cl_mem dT, size_t dT_offset, magma_int_t nb,     \
-                       magma_queue_t queue,                             \
-                       magma_int_t *info);                              \
+#define INSTANTIATE(T)                                          \
+    template magma_int_t magma_ungqr_gpu<T>(                    \
+        magma_int_t m, magma_int_t n, magma_int_t k, cl_mem dA, \
+        size_t dA_offset, magma_int_t ldda, T * tau, cl_mem dT, \
+        size_t dT_offset, magma_int_t nb, magma_queue_t queue,  \
+        magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/magma/unmqr.cpp b/src/backend/opencl/magma/unmqr.cpp
index b740a87257..81dae4a340 100644
--- a/src/backend/opencl/magma/unmqr.cpp
+++ b/src/backend/opencl/magma/unmqr.cpp
@@ -52,121 +52,116 @@
  **********************************************************************/
 
 #include "magma.h"
-#include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
 #include <algorithm>
 
-template<typename Ty>  magma_int_t
-magma_unmqr_gpu(
-    magma_side_t side, magma_trans_t trans,
-    magma_int_t m, magma_int_t n, magma_int_t k,
-    cl_mem dA, size_t dA_offset, magma_int_t ldda,
-    Ty *tau,
-    cl_mem dC, size_t dC_offset, magma_int_t lddc,
-    Ty *hwork, magma_int_t lwork,
-    cl_mem dT, size_t dT_offset, magma_int_t nb,
-    magma_queue_t queue,
-    magma_int_t *info)
-{
-/*  -- clMAGMA (version 0.1) --
-       Univ. of Tennessee, Knoxville
-       Univ. of California, Berkeley
-       Univ. of Colorado, Denver
-       @date
-
-    Purpose
-    =======
-    ZUNMQR_GPU overwrites the general complex M-by-N matrix C with
-
-                    SIDE = 'L'     SIDE = 'R'
-    TRANS = 'N':      Q * C          C * Q
-    TRANS = 'T':      Q**H * C       C * Q**H
-
-    where Q is a complex orthogonal matrix defined as the product of k
-    elementary reflectors
-
-          Q = H(1) H(2) . . . H(k)
-
-    as returned by ZGEQRF. Q is of order M if SIDE = 'L' and of order N
-    if SIDE = 'R'.
-
-    Arguments
-    =========
-    SIDE    (input) CHARACTER*1
-            = 'L': apply Q or Q**H from the Left;
-            = 'R': apply Q or Q**H from the Right.
-
-    TRANS   (input) CHARACTER*1
-            = 'N':  No transpose, apply Q;
-            = 'T':  Transpose, apply Q**H.
-
-    M       (input) INTEGER
-            The number of rows of the matrix C. M >= 0.
-
-    N       (input) INTEGER
-            The number of columns of the matrix C. N >= 0.
-
-    K       (input) INTEGER
-            The number of elementary reflectors whose product defines
-            the matrix Q.
-            If SIDE = 'L', M >= K >= 0;
-            if SIDE = 'R', N >= K >= 0.
-
-    DA      (input) COMPLEX_16 array on the GPU, dimension (LDDA,K)
-            The i-th column must contain the vector which defines the
-            elementary reflector H(i), for i = 1,2,...,k, as returned by
-            ZGEQRF in the first k columns of its array argument DA.
-            DA is modified by the routine but restored on exit.
-
-    LDDA    (input) INTEGER
-            The leading dimension of the array DA.
-            If SIDE = 'L', LDDA >= max(1,M);
-            if SIDE = 'R', LDDA >= max(1,N).
-
-    TAU     (input) COMPLEX_16 array, dimension (K)
-            TAU(i) must contain the scalar factor of the elementary
-            reflector H(i), as returned by ZGEQRF.
-
-    DC      (input/output) COMPLEX_16 array on the GPU, dimension (LDDC,N)
-            On entry, the M-by-N matrix C.
-            On exit, C is overwritten by Q*C or Q**H * C or C * Q**H or C*Q.
-
-    LDDC     (input) INTEGER
-            The leading dimension of the array DC. LDDC >= max(1,M).
-
-    HWORK    (workspace/output) COMPLEX_16 array, dimension (MAX(1,LWORK))
-            On exit, if INFO = 0, HWORK(1) returns the optimal LWORK.
-
-    LWORK   (input) INTEGER
-            The dimension of the array HWORK.
-            LWORK >= (M-K+NB)*(N+2*NB) if SIDE = 'L',
-            and LWORK >= (N-K+NB)*(M+2*NB) if SIDE = 'R', where NB is the
-            optimal blocksize.
-
-            If LWORK = -1, then a workspace query is assumed; the routine
-            only calculates the optimal size of the HWORK array, returns
-            this value as the first entry of the HWORK array, and no error
-            message related to LWORK is issued by XERBLA.
-
-    DT      (input) COMPLEX_16 array on the GPU that is the output
-            (the 9th argument) of magma_zgeqrf_gpu.
-
-    NB      (input) INTEGER
-            This is the blocking size that was used in pre-computing DT, e.g.,
-            the blocking size used in magma_zgeqrf_gpu.
-
-    INFO    (output) INTEGER
-            = 0:  successful exit
-            < 0:  if INFO = -i, the i-th argument had an illegal value
-    =====================================================================   */
-
-    #define a_ref(a_1,a_2) dA, (dA_offset+(a_1)+(a_2)*(ldda))
-    #define c_ref(a_1,a_2) dC, (dC_offset+(a_1)+(a_2)*(lddc))
-    #define t_ref(a_1)     dT, (dT_offset+(a_1)*nb)
+template<typename Ty>
+magma_int_t magma_unmqr_gpu(magma_side_t side, magma_trans_t trans,
+                            magma_int_t m, magma_int_t n, magma_int_t k,
+                            cl_mem dA, size_t dA_offset, magma_int_t ldda,
+                            Ty* tau, cl_mem dC, size_t dC_offset,
+                            magma_int_t lddc, Ty* hwork, magma_int_t lwork,
+                            cl_mem dT, size_t dT_offset, magma_int_t nb,
+                            magma_queue_t queue, magma_int_t* info) {
+    /*  -- clMAGMA (version 0.1) --
+           Univ. of Tennessee, Knoxville
+           Univ. of California, Berkeley
+           Univ. of Colorado, Denver
+           @date
+
+        Purpose
+        =======
+        ZUNMQR_GPU overwrites the general complex M-by-N matrix C with
+
+                        SIDE = 'L'     SIDE = 'R'
+        TRANS = 'N':      Q * C          C * Q
+        TRANS = 'C':      Q**H * C       C * Q**H
+
+        where Q is a complex unitary matrix defined as the product of k
+        elementary reflectors
+
+              Q = H(1) H(2) . . . H(k)
+
+        as returned by ZGEQRF. Q is of order M if SIDE = 'L' and of order N
+        if SIDE = 'R'.
+
+        Arguments
+        =========
+        SIDE    (input) CHARACTER*1
+                = 'L': apply Q or Q**H from the Left;
+                = 'R': apply Q or Q**H from the Right.
+
+        TRANS   (input) CHARACTER*1
+                = 'N':  No transpose, apply Q;
+                = 'C':  Conjugate transpose, apply Q**H.
+
+        M       (input) INTEGER
+                The number of rows of the matrix C. M >= 0.
+
+        N       (input) INTEGER
+                The number of columns of the matrix C. N >= 0.
+
+        K       (input) INTEGER
+                The number of elementary reflectors whose product defines
+                the matrix Q.
+                If SIDE = 'L', M >= K >= 0;
+                if SIDE = 'R', N >= K >= 0.
+
+        DA      (input) COMPLEX_16 array on the GPU, dimension (LDDA,K)
+                The i-th column must contain the vector which defines the
+                elementary reflector H(i), for i = 1,2,...,k, as returned by
+                ZGEQRF in the first k columns of its array argument DA.
+                DA is modified by the routine but restored on exit.
+
+        LDDA    (input) INTEGER
+                The leading dimension of the array DA.
+                If SIDE = 'L', LDDA >= max(1,M);
+                if SIDE = 'R', LDDA >= max(1,N).
+
+        TAU     (input) COMPLEX_16 array, dimension (K)
+                TAU(i) must contain the scalar factor of the elementary
+                reflector H(i), as returned by ZGEQRF.
+
+        DC      (input/output) COMPLEX_16 array on the GPU, dimension (LDDC,N)
+                On entry, the M-by-N matrix C.
+                On exit, C is overwritten by Q*C or Q**H * C or C * Q**H or C*Q.
+
+        LDDC     (input) INTEGER
+                The leading dimension of the array DC. LDDC >= max(1,M).
+
+        HWORK    (workspace/output) COMPLEX_16 array, dimension (MAX(1,LWORK))
+                On exit, if INFO = 0, HWORK(1) returns the optimal LWORK.
+
+        LWORK   (input) INTEGER
+                The dimension of the array HWORK.
+                LWORK >= (M-K+NB)*(N+2*NB) if SIDE = 'L',
+                and LWORK >= (N-K+NB)*(M+2*NB) if SIDE = 'R', where NB is the
+                optimal blocksize.
+
+                If LWORK = -1, then a workspace query is assumed; the routine
+                only calculates the optimal size of the HWORK array, returns
+                this value as the first entry of the HWORK array, and no error
+                message related to LWORK is issued by XERBLA.
+
+        DT      (input) COMPLEX_16 array on the GPU that is the output
+                (the 9th argument) of magma_zgeqrf_gpu.
+
+        NB      (input) INTEGER
+                This is the blocking size that was used in pre-computing DT,
+                e.g., the blocking size used in magma_zgeqrf_gpu.
+
+        INFO    (output) INTEGER
+                = 0:  successful exit
+                < 0:  if INFO = -i, the i-th argument had an illegal value
+        ===================================================================== */
+
+#define a_ref(a_1, a_2) dA, (dA_offset + (a_1) + (a_2) * (ldda))
+#define c_ref(a_1, a_2) dC, (dC_offset + (a_1) + (a_2) * (lddc))
+#define t_ref(a_1) dT, (dT_offset + (a_1)*nb)
 
     static const Ty c_one = magma_one<Ty>();
 
@@ -177,7 +172,7 @@ magma_unmqr_gpu(
     int left, notran, lquery;
     magma_int_t lwkopt;
 
-    *info = 0;
+    *info  = 0;
     left   = (side == MagmaLeft);
     notran = (trans == MagmaNoTrans);
     lquery = (lwork == -1);
@@ -190,9 +185,9 @@ magma_unmqr_gpu(
         nq = n;
         nw = m;
     }
-    if ( (!left) && (side != MagmaRight) ) {
+    if ((!left) && (side != MagmaRight)) {
         *info = -1;
-    } else if ( (!notran) && (trans != MagmaConjTrans) ) {
+    } else if ((!notran) && (trans != MagmaConjTrans)) {
         *info = -2;
     } else if (m < 0) {
         *info = -3;
@@ -200,22 +195,21 @@ magma_unmqr_gpu(
         *info = -4;
     } else if (k < 0 || k > nq) {
         *info = -5;
-    } else if (ldda < std::max(1,nq)) {
+    } else if (ldda < std::max(1, nq)) {
         *info = -7;
-    } else if (lddc < std::max(1,m)) {
+    } else if (lddc < std::max(1, m)) {
         *info = -10;
-    } else if (lwork < std::max(1,nw) && ! lquery) {
+    } else if (lwork < std::max(1, nw) && !lquery) {
         *info = -12;
     }
 
-    lwkopt = (m-k+nb)*(n+2*nb);
+    lwkopt   = (m - k + nb) * (n + 2 * nb);
     hwork[0] = magma_scalar<Ty>(lwkopt);
 
     if (*info != 0) {
-        //magma_xerbla( __func__, -(*info) );
+        // magma_xerbla( __func__, -(*info) );
         return *info;
-    }
-    else if (lquery) {
+    } else if (lquery) {
         return *info;
     }
 
@@ -225,17 +219,17 @@ magma_unmqr_gpu(
         return *info;
     }
 
-    magma_malloc<Ty>(&dwork, (((n+31)/32)*32)*nb);
+    magma_malloc<Ty>(&dwork, (((n + 31) / 32) * 32) * nb);
 
-    unmqr_work_func<Ty> cpu_unmqr;
+    cpu_lapack_unmqr_work_func<Ty> cpu_lapack_unmqr;
 
-    if ( (left && (! notran)) || ( (!left) && notran ) ) {
-        i1 = 0;
-        i2 = k-nb;
+    if ((left && (!notran)) || ((!left) && notran)) {
+        i1   = 0;
+        i2   = k - nb;
         step = nb;
     } else {
-        i1 = (k - 1 - nb) / nb * nb;
-        i2 = 0;
+        i1   = (k - 1 - nb) / nb * nb;
+        i2   = 0;
         step = -nb;
     }
 
@@ -252,112 +246,96 @@ magma_unmqr_gpu(
 
     static const bool is_real = magma_is_real<Ty>();
 
-    /* Use unblocked code to multiply last or only block (cases Q*C or C*Q^T). */
-    // workspace left:  A(mi*nb) + C(mi*ni) + work(ni*nb_la) = (m-k-nb)*nb + (m-k-nb)*n + n*nb
-    // workspace right: A(ni*nb) + C(mi*ni) + work(mi*nb_la) = (n-k-nb)*nb + m*(n-k-nb) + m*nb
-    if ( step < 0 ) {
+    /* Use unblocked code to multiply last or only block (cases Q*C or C*Q^T).
+     */
+    // Host workspace needed by the unblocked call:
+    // workspace left:  A(mi*nb) + C(mi*ni) + work(ni*nb_la) = (m-k-nb)*nb + (m-k-nb)*n + n*nb
+    // workspace right: A(ni*nb) + C(mi*ni) + work(mi*nb_la) = (n-k-nb)*nb + m*(n-k-nb) + m*nb
+    if (step < 0) {
         // i is beginning of last block
         i = i1 - step;
-        if ( i >= k ) {
-            i = i1;
-        }
+        if (i >= k) { i = i1; }
         ib = k - i;
         if (left) {
             // ni=n, jc=0, H or H^T is applied to C(i:m-1,0:n-1)
             mi = m - i;
             ma = mi;
             ic = i;
-        }
-        else {
+        } else {
             // mi=m, ic=0, H or H^T is applied to C(0:m-1,i:n-1)
             ni = n - i;
             ma = ni;
             jc = i;
         }
 
-        Ty* hA = hwork;
-        Ty* hC = hwork + ma*ib;
-        Ty* hW = hwork + ma*ib + mi*ni;
-        magma_int_t lhwork = lwork - (ma*ib + mi*ni);
+        Ty* hA             = hwork;
+        Ty* hC             = hwork + ma * ib;
+        Ty* hW             = hwork + ma * ib + mi * ni;
+        magma_int_t lhwork = lwork - (ma * ib + mi * ni);
 
-        magma_getmatrix<Ty>(ma, ib, a_ref(i,  i ), ldda, hA, ma, queue);
+        magma_getmatrix<Ty>(ma, ib, a_ref(i, i), ldda, hA, ma, queue);
         magma_getmatrix<Ty>(mi, ni, c_ref(ic, jc), lddc, hC, mi, queue);
 
-        *info = cpu_unmqr(LAPACK_COL_MAJOR,
-                          side == MagmaRight ? 'R' : 'L',
-                          notran ? 'N' : (is_real ? 'T' : 'C'),
-                          mi, ni, ib,
-                          hA, ma, tau+i,
-                          hC, mi,
-                          hW, lhwork);
+        LAPACKE_CHECK(cpu_lapack_unmqr(side == MagmaRight ? 'R' : 'L',
+                                       notran ? 'N' : (is_real ? 'T' : 'C'), mi,
+                                       ni, ib, hA, ma, tau + i, hC, mi, hW,
+                                       lhwork));
 
         // send the updated part of C back to the GPU
-        magma_setmatrix<Ty>( mi, ni, hC, mi, c_ref(ic, jc), lddc, queue);
+        magma_setmatrix<Ty>(mi, ni, hC, mi, c_ref(ic, jc), lddc, queue);
     }
 
-
-    if (nb < k)
-    {
-        for (i=i1; step<0 ? i>i2 : i<i2; i+=step)
-        {
+    if (nb < k) {
+        for (i = i1; step < 0 ? i > i2 : i < i2; i += step) {
             ib = std::min(nb, k - i);
-            if (left){
+            if (left) {
                 mi = m - i;
                 ic = i;
-            }
-            else {
+            } else {
                 ni = n - i;
                 jc = i;
             }
 
-            if (mi == 0 || ni == 0) break;
+            if (mi == 0 || ni == 0) { break; }
 
-            ret = magma_larfb_gpu<Ty>(MagmaLeft,
-                                      is_real ? MagmaTrans : MagmaConjTrans,
-                                      MagmaForward, MagmaColumnwise,
-                                      mi, ni, ib,
-                                      a_ref(i,  i ), ldda, t_ref(i), nb,
-                                      c_ref(ic, jc), lddc, dwork, 0, nw, queue);
-            if ( ret != MAGMA_SUCCESS )
-              return ret;
+            ret = magma_larfb_gpu<Ty>(
+                MagmaLeft, is_real ? MagmaTrans : MagmaConjTrans, MagmaForward,
+                MagmaColumnwise, mi, ni, ib, a_ref(i, i), ldda, t_ref(i), nb,
+                c_ref(ic, jc), lddc, dwork, 0, nw, queue);
+            if (ret != MAGMA_SUCCESS) { return ret; }
         }
-    }
-    else
-    {
+    } else {
         i = i1;
     }
 
-    /* Use unblocked code to multiply the last or only block (cases Q^T*C or C*Q). */
-    if ( step > 0 ) {
-        ib = k-i;
+    /* Use unblocked code to multiply the last or only block (cases Q^T*C or
+     * C*Q). */
+    if (step > 0) {
+        ib = k - i;
         if (left) {
             // ni=n, jc=0, H or H^T is applied to C(i:m-1,0:n-1)
             mi = m - i;
             ma = mi;
             ic = i;
-        }
-        else {
+        } else {
             // mi=m, ic=0, H or H^T is applied to C(0:m-1,i:n-1)
             ni = n - i;
             ma = ni;
             jc = i;
         }
 
-        Ty* hA = hwork;
-        Ty* hC = hwork + ma*ib;
-        Ty* hW = hwork + ma*ib + mi*ni;
-        magma_int_t lhwork = lwork - (ma*ib + mi*ni);
+        Ty* hA             = hwork;
+        Ty* hC             = hwork + ma * ib;
+        Ty* hW             = hwork + ma * ib + mi * ni;
+        magma_int_t lhwork = lwork - (ma * ib + mi * ni);
 
-        magma_getmatrix<Ty>(ma, ib, a_ref(i,  i ), ldda, hA, ma, queue);
+        magma_getmatrix<Ty>(ma, ib, a_ref(i, i), ldda, hA, ma, queue);
         magma_getmatrix<Ty>(mi, ni, c_ref(ic, jc), lddc, hC, mi, queue);
 
-        *info = cpu_unmqr(LAPACK_COL_MAJOR,
-                          side == MagmaRight ? 'R' : 'L',
-                          notran ? 'N' : (is_real ? 'T' : 'C'),
-                          mi, ni, ib,
-                          hA, ma, tau+i,
-                          hC, mi,
-                          hW, lhwork);
+        LAPACKE_CHECK(cpu_lapack_unmqr(side == MagmaRight ? 'R' : 'L',
+                                       notran ? 'N' : (is_real ? 'T' : 'C'), mi,
+                                       ni, ib, hA, ma, tau + i, hC, mi, hW,
+                                       lhwork));
 
         // send the updated part of C back to the GPU
         magma_setmatrix<Ty>(mi, ni, hC, mi, c_ref(ic, jc), lddc, queue);
@@ -369,18 +347,13 @@ magma_unmqr_gpu(
     /* End of MAGMA_ZUNMQR_GPU */
 }
 
-#define INSTANTIATE(T)                                  \
-    template  magma_int_t                               \
-    magma_unmqr_gpu<T>(                                 \
-        magma_side_t side, magma_trans_t trans,         \
-        magma_int_t m, magma_int_t n, magma_int_t k,    \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        T *tau,                                         \
-        cl_mem dC, size_t dC_offset, magma_int_t lddc,  \
-        T *hwork, magma_int_t lwork,                    \
-        cl_mem dT, size_t dT_offset, magma_int_t nb,    \
-        magma_queue_t queue,                            \
-        magma_int_t *info);                             \
+#define INSTANTIATE(T)                                                         \
+    template magma_int_t magma_unmqr_gpu<T>(                                   \
+        magma_side_t side, magma_trans_t trans, magma_int_t m, magma_int_t n,  \
+        magma_int_t k, cl_mem dA, size_t dA_offset, magma_int_t ldda, T * tau, \
+        cl_mem dC, size_t dC_offset, magma_int_t lddc, T * hwork,              \
+        magma_int_t lwork, cl_mem dT, size_t dT_offset, magma_int_t nb,        \
+        magma_queue_t queue, magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
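Note on the LWORK bounds documented for magma_unmqr_gpu in the hunk above: the header comment gives one bound per SIDE, while the patched body sets lwkopt from the SIDE = 'L' expression. A minimal sketch of both documented bounds follows. `Side` and `unmqr_min_lwork` are hypothetical names used only for illustration and are not part of MAGMA or ArrayFire.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of MAGMA or ArrayFire.
enum class Side { Left, Right };

// Minimum HWORK length following the documented LWORK bounds:
//   SIDE = 'L': (M-K+NB)*(N+2*NB)
//   SIDE = 'R': (N-K+NB)*(M+2*NB)
// nb is the block size that was used to build DT.
constexpr std::int64_t unmqr_min_lwork(Side side, std::int64_t m,
                                       std::int64_t n, std::int64_t k,
                                       std::int64_t nb) {
    return side == Side::Left ? (m - k + nb) * (n + 2 * nb)
                              : (n - k + nb) * (m + 2 * nb);
}

// Example: size the host buffer before making the real call.
inline void example_usage() {
    const std::int64_t lwork = unmqr_min_lwork(Side::Left, 100, 50, 20, 32);
    assert(lwork == 112 * 114);
}
```

A caller that prefers to ask the routine itself can pass LWORK = -1 and read the size back from HWORK(1), as the argument description states.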
diff --git a/src/backend/opencl/magma/unmqr2.cpp b/src/backend/opencl/magma/unmqr2.cpp
index 6de4cafc90..11d753fb80 100644
--- a/src/backend/opencl/magma/unmqr2.cpp
+++ b/src/backend/opencl/magma/unmqr2.cpp
@@ -54,8 +54,8 @@
 #if 0  // Needs hetrd to be enabled
 #include "magma.h"
 #include "magma_blas.h"
-#include "magma_data.h"
 #include "magma_cpu_lapack.h"
+#include "magma_data.h"
 #include "magma_helper.h"
 #include "magma_sync.h"
 
@@ -164,9 +164,9 @@ magma_unmqr2_gpu(
     magma_queue_t queue,
     magma_int_t *info)
 {
-    #define dA(i_,j_) (dA) , ((i_) + (j_)*ldda) + dA_offset
-    #define dC(i_,j_) (dC) , ((i_) + (j_)*lddc) + dC_offset
-    #define wA(i_,j_) (wA + (i_) + (j_)*ldwa)
+#define dA(i_, j_) (dA), ((i_) + (j_)*ldda) + dA_offset
+#define dC(i_, j_) (dC), ((i_) + (j_)*lddc) + dC_offset
+#define wA(i_, j_) (wA + (i_) + (j_)*ldwa)
 
     /* Allocate work space on the GPU */
     cl_mem dwork;
@@ -251,7 +251,7 @@ magma_unmqr2_gpu(
         ic = 1;
     }
 
-    larft_func<Ty> cpu_larft;
+    cpu_lapack_larft_func<Ty> cpu_lapack_larft;
 
     // set nb-1 super-diagonals to 0, and diagonal to 1.
     // This way we can copy V directly to the GPU,
@@ -265,10 +265,10 @@ magma_unmqr2_gpu(
         /* Form the triangular factor of the block reflector
            H = H(i) H(i+1) . . . H(i+ib-1) */
         i__4 = nq - i + 1;
-        cpu_larft(LAPACK_COL_MAJOR,
-                  *MagmaForwardStr, *MagmaColumnwiseStr,
-                  i__4, ib,
-                  wA(i,i), ldwa, &tau[i], T, ib);
+        LAPACKE_CHECK(cpu_lapack_larft(
+                          *MagmaForwardStr, *MagmaColumnwiseStr,
+                          i__4, ib,
+                          wA(i,i), ldwa, &tau[i], T, ib));
 
         if (left) {
             /* H or H' is applied to C(i:m,1:n) */
@@ -301,18 +301,12 @@ magma_unmqr2_gpu(
     return *info;
 } /* magma_zunmqr */
 
-
-#define INSTANTIATE(Ty)                                 \
-    template magma_int_t                                \
-    magma_unmqr2_gpu<Ty>(                               \
-        magma_side_t side, magma_trans_t trans,         \
-        magma_int_t m, magma_int_t n, magma_int_t k,    \
-        cl_mem dA, size_t dA_offset, magma_int_t ldda,  \
-        Ty    *tau,                                     \
-        cl_mem dC, size_t dC_offset, magma_int_t lddc,  \
-        Ty    *wA, magma_int_t ldwa,                    \
-        magma_queue_t queue,                            \
-        magma_int_t *info);                             \
+#define INSTANTIATE(Ty)                                                       \
+    template magma_int_t magma_unmqr2_gpu<Ty>(                                \
+        magma_side_t side, magma_trans_t trans, magma_int_t m, magma_int_t n, \
+        magma_int_t k, cl_mem dA, size_t dA_offset, magma_int_t ldda,         \
+        Ty * tau, cl_mem dC, size_t dC_offset, magma_int_t lddc, Ty * wA,     \
+        magma_int_t ldwa, magma_queue_t queue, magma_int_t * info);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
diff --git a/src/backend/opencl/match_template.cpp b/src/backend/opencl/match_template.cpp
index c6e82de681..7f02d886b3 100644
--- a/src/backend/opencl/match_template.cpp
+++ b/src/backend/opencl/match_template.cpp
@@ -7,52 +7,40 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <err_opencl.hpp>
 #include <match_template.hpp>
-#include <kernel/match_template.hpp>
 
-using af::dim4;
+#include <kernel/match_template.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg)
-{
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType) {
     Array<outType> out = createEmptyArray<outType>(sImg.dims());
 
-    bool needMean = mType==AF_ZSAD || mType==AF_LSAD ||
-                    mType==AF_ZSSD || mType==AF_LSSD ||
-                    mType==AF_ZNCC;
+    bool needMean = mType == AF_ZSAD || mType == AF_LSAD || mType == AF_ZSSD ||
+                    mType == AF_LSSD || mType == AF_ZNCC;
 
-    if (needMean)
-        kernel::matchTemplate<inType, outType, mType, true >(out, sImg, tImg);
-    else
-        kernel::matchTemplate<inType, outType, mType, false>(out, sImg, tImg);
+    kernel::matchTemplate<inType, outType>(out, sImg, tImg, mType, needMean);
 
     return out;
 }
 
-#define INSTANTIATE(in_t, out_t)\
-    template Array<out_t> match_template<in_t, out_t, AF_SAD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSAD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SSD >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_LSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZSSD>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_NCC >(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_ZNCC>(const Array<in_t> &sImg, const Array<in_t> &tImg); \
-    template Array<out_t> match_template<in_t, out_t, AF_SHD >(const Array<in_t> &sImg, const Array<in_t> &tImg);
+#define INSTANTIATE(in_t, out_t)                       \
+    template Array<out_t> match_template<in_t, out_t>( \
+        const Array<in_t> &, const Array<in_t> &, const af::matchType);
 
 INSTANTIATE(double, double)
-INSTANTIATE(float ,  float)
-INSTANTIATE(char  ,  float)
-INSTANTIATE(int   ,  float)
-INSTANTIATE(uint  ,  float)
-INSTANTIATE(uchar ,  float)
-
-}
+INSTANTIATE(float, float)
+INSTANTIATE(char, float)
+INSTANTIATE(int, float)
+INSTANTIATE(uint, float)
+INSTANTIATE(schar, float)
+INSTANTIATE(uchar, float)
+INSTANTIATE(short, float)
+INSTANTIATE(ushort, float)
+
+}  // namespace opencl
+}  // namespace arrayfire
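Note on the match_template change above: the match type moves from a template parameter to a runtime argument, so each (input, output) type pair needs one instantiation instead of nine, and needMean becomes an ordinary boolean passed to the kernel wrapper. A minimal sketch of that runtime decision follows. `MatchType` and `needs_mean` are hypothetical stand-ins for the AF_* constants and the inline expression in the patch.

```cpp
#include <cassert>

// Hypothetical stand-in for the AF_SAD ... AF_SHD match-type constants.
enum class MatchType { SAD, ZSAD, LSAD, SSD, ZSSD, LSSD, NCC, ZNCC, SHD };

// Mirrors the needMean expression in the patch: the zero-mean (Z*) and
// locally scaled (L*) variants need the mean, the plain ones do not.
constexpr bool needs_mean(MatchType t) {
    return t == MatchType::ZSAD || t == MatchType::LSAD ||
           t == MatchType::ZSSD || t == MatchType::LSSD ||
           t == MatchType::ZNCC;
}

// Example: the flag is now computed once per call at run time.
inline void example_usage() {
    assert(needs_mean(MatchType::ZNCC));
    assert(!needs_mean(MatchType::SAD));
}
```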
diff --git a/src/backend/opencl/match_template.hpp b/src/backend/opencl/match_template.hpp
index 3d599f2d91..7b493d2ca0 100644
--- a/src/backend/opencl/match_template.hpp
+++ b/src/backend/opencl/match_template.hpp
@@ -8,11 +8,13 @@
  ********************************************************/
 
 #include <Array.hpp>
+#include <af/defines.h>
 
-namespace opencl
-{
-
-template<typename inType, typename outType, af_match_type mType>
-Array<outType> match_template(const Array<inType> &sImg, const Array<inType> &tImg);
-
-}
+namespace arrayfire {
+namespace opencl {
+template<typename inType, typename outType>
+Array<outType> match_template(const Array<inType> &sImg,
+                              const Array<inType> &tImg,
+                              const af::matchType mType);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/math.cpp b/src/backend/opencl/math.cpp
index 8aeb5c49aa..bbe78dfc94 100644
--- a/src/backend/opencl/math.cpp
+++ b/src/backend/opencl/math.cpp
@@ -8,55 +8,50 @@
  ********************************************************/
 
 #include "math.hpp"
+#include <common/half.hpp>
 
-namespace opencl
-{
-    bool operator ==(cfloat a, cfloat b) { return (a.s[0] == b.s[0]) && (a.s[1] == b.s[1]); }
-    bool operator !=(cfloat a, cfloat b) { return !(a == b); }
-    bool operator ==(cdouble a, cdouble b) { return (a.s[0] == b.s[0]) && (a.s[1] == b.s[1]); }
-    bool operator !=(cdouble a, cdouble b) { return !(a == b); }
-
-    cfloat operator +(cfloat a, cfloat b)
-    {
-        cfloat res = {{a.s[0] + b.s[0], a.s[1] + b.s[1]}};
-        return res;
-    }
-
-    cdouble operator +(cdouble a, cdouble b)
-    {
-        cdouble res = {{a.s[0] + b.s[0], a.s[1] + b.s[1]}};
-        return res;
-    }
-
-    cfloat operator *(cfloat lhs, cfloat rhs)
-    {
-        cfloat out;
-        out.s[0] = lhs.s[0] * rhs.s[0] - lhs.s[1] * rhs.s[1];
-        out.s[1] = lhs.s[0] * rhs.s[1] + lhs.s[1] * rhs.s[0];
-        return out;
-    }
-
-    cdouble operator *(cdouble lhs, cdouble rhs)
-    {
-        cdouble out;
-        out.s[0] = lhs.s[0] * rhs.s[0] - lhs.s[1] * rhs.s[1];
-        out.s[1] = lhs.s[0] * rhs.s[1] + lhs.s[1] * rhs.s[0];
-        return out;
-    }
-
-    cfloat division(cfloat lhs, double rhs)
-    {
-        cfloat retVal;
-        retVal.s[0] = real(lhs) / rhs;
-        retVal.s[1] = imag(lhs) / rhs;
-        return retVal;
-    }
-
-    cdouble division(cdouble lhs, double rhs)
-    {
-        cdouble retVal;
-        retVal.s[0] = real(lhs) / rhs;
-        retVal.s[1] = imag(lhs) / rhs;
-        return retVal;
-    }
+namespace arrayfire {
+namespace opencl {
+cfloat operator+(cfloat lhs, cfloat rhs) {
+    cfloat res = {{lhs.s[0] + rhs.s[0], lhs.s[1] + rhs.s[1]}};
+    return res;
 }
+
+common::half operator+(common::half lhs, common::half rhs) noexcept {
+    return common::half(static_cast<float>(lhs) + static_cast<float>(rhs));
+}
+
+cdouble operator+(cdouble lhs, cdouble rhs) {
+    cdouble res = {{lhs.s[0] + rhs.s[0], lhs.s[1] + rhs.s[1]}};
+    return res;
+}
+
+cfloat operator*(cfloat lhs, cfloat rhs) {
+    cfloat out;
+    out.s[0] = lhs.s[0] * rhs.s[0] - lhs.s[1] * rhs.s[1];
+    out.s[1] = lhs.s[0] * rhs.s[1] + lhs.s[1] * rhs.s[0];
+    return out;
+}
+
+cdouble operator*(cdouble lhs, cdouble rhs) {
+    cdouble out;
+    out.s[0] = lhs.s[0] * rhs.s[0] - lhs.s[1] * rhs.s[1];
+    out.s[1] = lhs.s[0] * rhs.s[1] + lhs.s[1] * rhs.s[0];
+    return out;
+}
+
+cfloat division(cfloat lhs, double rhs) {
+    cfloat retVal;
+    retVal.s[0] = real(lhs) / rhs;
+    retVal.s[1] = imag(lhs) / rhs;
+    return retVal;
+}
+
+cdouble division(cdouble lhs, double rhs) {
+    cdouble retVal;
+    retVal.s[0] = real(lhs) / rhs;
+    retVal.s[1] = imag(lhs) / rhs;
+    return retVal;
+}
+}  // namespace opencl
+}  // namespace arrayfire
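Note on the complex helpers in math.cpp above: they work on a two-element array, with s[0] holding the real part and s[1] the imaginary part. A minimal sketch of the same arithmetic follows. `Complex2f`, `multiply`, and `divide` are hypothetical names standing in for cfloat, operator*, and division so the example compiles without the OpenCL headers.

```cpp
#include <cassert>

// Hypothetical stand-in for the two-component type used as cfloat.
struct Complex2f {
    float s[2];  // s[0] = real part, s[1] = imaginary part
};

// (a + bi)(c + di) = (ac - bd) + (ad + bc)i
inline Complex2f multiply(Complex2f lhs, Complex2f rhs) {
    Complex2f out;
    out.s[0] = lhs.s[0] * rhs.s[0] - lhs.s[1] * rhs.s[1];
    out.s[1] = lhs.s[0] * rhs.s[1] + lhs.s[1] * rhs.s[0];
    return out;
}

// Division by a real scalar scales both components.
inline Complex2f divide(Complex2f lhs, double rhs) {
    Complex2f out;
    out.s[0] = static_cast<float>(lhs.s[0] / rhs);
    out.s[1] = static_cast<float>(lhs.s[1] / rhs);
    return out;
}

// Example: (1 + 2i)(3 + 4i) = -5 + 10i
inline void example_usage() {
    const Complex2f p = multiply(Complex2f{{1.0f, 2.0f}}, Complex2f{{3.0f, 4.0f}});
    assert(p.s[0] == -5.0f && p.s[1] == 10.0f);
}
```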
diff --git a/src/backend/opencl/math.hpp b/src/backend/opencl/math.hpp
index 9292d398a0..f164c3002c 100644
--- a/src/backend/opencl/math.hpp
+++ b/src/backend/opencl/math.hpp
@@ -9,117 +9,163 @@
 
 #pragma once
 
+#include <common/defines.hpp>
+#include <common/half.hpp>
 #include <af/defines.h>
 
+#include <backend.hpp>
+#include <types.hpp>
+
+#include <algorithm>
 #include <complex>
+#include <climits>
 #include <limits>
-#include <algorithm>
-#include "backend.hpp"
-#include "types.hpp"
 
-namespace opencl
-{
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#else
+/* Other */
+#endif
 
-    template<typename T> static inline T abs(T val)  { return std::abs(val); }
-    template<typename T> static inline T min(T lhs, T rhs) { return std::min(lhs, rhs); }
-    template<typename T> static inline T max(T lhs, T rhs) { return std::max(lhs, rhs); }
+namespace arrayfire {
+namespace opencl {
 
-    static inline float  abs(cfloat  cval) { return std::sqrt(cval.s[0]*cval.s[0] + cval.s[1]*cval.s[1]); }
-    static inline double abs(cdouble cval) { return std::sqrt(cval.s[0]*cval.s[0] + cval.s[1]*cval.s[1]); }
+template<typename T>
+static inline T abs(T val) {
+    return std::abs(val);
+}
+template<typename T>
+static inline T min(T lhs, T rhs) {
+    return std::min(lhs, rhs);
+}
+template<typename T>
+static inline T max(T lhs, T rhs) {
+    return std::max(lhs, rhs);
+}
 
-    template<typename T> static inline T division(T lhs, double rhs) { return lhs / rhs; }
-    cfloat division(cfloat lhs, double rhs);
-    cdouble division(cdouble lhs, double rhs);
+static inline float abs(cfloat cval) {
+    return std::sqrt(cval.s[0] * cval.s[0] + cval.s[1] * cval.s[1]);
+}
+static inline double abs(cdouble cval) {
+    return std::sqrt(cval.s[0] * cval.s[0] + cval.s[1] * cval.s[1]);
+}
 
-#ifndef STATIC_
-#define STATIC_
-#endif
+template<typename T>
+static inline T division(T lhs, double rhs) {
+    return lhs / rhs;
+}
+cfloat division(cfloat lhs, double rhs);
+cdouble division(cdouble lhs, double rhs);
 
-    template<> STATIC_
-    cfloat max<cfloat>(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
+template<>
+inline cfloat max<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
 
-    template<> STATIC_
-    cdouble max<cdouble>(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) > abs(rhs) ? lhs : rhs;
-    }
+template<>
+inline cdouble max<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) > abs(rhs) ? lhs : rhs;
+}
 
-    template<> STATIC_
-    cfloat min<cfloat>(cfloat lhs, cfloat rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs :  rhs;
-    }
+template<>
+inline cfloat min<cfloat>(cfloat lhs, cfloat rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
 
-    template<> STATIC_
-    cdouble min<cdouble>(cdouble lhs, cdouble rhs)
-    {
-        return abs(lhs) < abs(rhs) ? lhs :  rhs;
-    }
+template<>
+inline cdouble min<cdouble>(cdouble lhs, cdouble rhs) {
+    return abs(lhs) < abs(rhs) ? lhs : rhs;
+}
 
-    template<typename T>
-    static T scalar(double val)
-    {
-        return (T)(val);
-    }
+template<typename T>
+static T scalar(double val) {
+    return (T)(val);
+}
 
-    template<> STATIC_
-    cfloat  scalar<cfloat >(double val)
-    {
-        cfloat  cval;
-        cval.s[0]= (float)val;
-        cval.s[1] = 0;
-        return cval;
-    }
+template<>
+inline cfloat scalar<cfloat>(double val) {
+    cfloat cval;
+    cval.s[0] = (float)val;
+    cval.s[1] = 0;
+    return cval;
+}
 
-    template<> STATIC_
-    cdouble scalar<cdouble >(double val)
-    {
-        cdouble cval;
-        cval.s[0]= val;
-        cval.s[1] = 0;
-        return cval;
-    }
+template<>
+inline cdouble scalar<cdouble>(double val) {
+    cdouble cval;
+    cval.s[0] = val;
+    cval.s[1] = 0;
+    return cval;
+}
 
-    template<typename To, typename Ti>
-    static To scalar(Ti real, Ti imag)
-    {
-        To  cval;
-        cval.s[0] = real;
-        cval.s[1] = imag;
-        return cval;
-    }
+template<typename To, typename Ti>
+static To scalar(Ti real, Ti imag) {
+    To cval;
+    cval.s[0] = real;
+    cval.s[1] = imag;
+    return cval;
+}
 
-    template <typename T> T limit_max() { return std::numeric_limits<T>::max(); }
-    template <typename T> T limit_min() { return std::numeric_limits<T>::min(); }
+#ifdef AF_WITH_FAST_MATH
+constexpr bool fast_math = true;
+#else
+constexpr bool fast_math = false;
+#endif
 
-    static inline double real(cdouble in)
-    {
-        return in.s[0];
+template<typename T>
+inline T maxval() {
+    if constexpr (std::is_floating_point_v<T> && !fast_math) {
+        return std::numeric_limits<T>::infinity();
+    } else {
+        return std::numeric_limits<T>::max();
     }
-    static inline float real(cfloat in)
-    {
-        return in.s[0];
-    }
-    static inline double imag(cdouble in)
-    {
-        return in.s[1];
-    }
-    static inline float imag(cfloat in)
-    {
-        return in.s[1];
+}
+template<typename T>
+inline T minval() {
+    if constexpr (std::is_floating_point_v<T> && !fast_math) {
+        return -std::numeric_limits<T>::infinity();
+    } else {
+        return std::numeric_limits<T>::lowest();
     }
+}
 
-    bool operator ==(cfloat a, cfloat b);
-    bool operator !=(cfloat a, cfloat b);
-    bool operator ==(cdouble a, cdouble b);
-    bool operator !=(cdouble a, cdouble b);
-    cfloat operator +(cfloat a, cfloat b);
-    cfloat operator +(cfloat a);
-    cdouble operator +(cdouble a, cdouble b);
-    cdouble operator +(cdouble a);
-    cfloat operator *(cfloat a, cfloat b);
-    cdouble operator *(cdouble a, cdouble b);
+static inline double real(cdouble in) { return in.s[0]; }
+static inline float real(cfloat in) { return in.s[0]; }
+static inline double imag(cdouble in) { return in.s[1]; }
+static inline float imag(cfloat in) { return in.s[1]; }
+
+cfloat operator+(cfloat lhs, cfloat rhs);
+cfloat operator+(cfloat lhs);
+cdouble operator+(cdouble lhs, cdouble rhs);
+cdouble operator+(cdouble lhs);
+cfloat operator*(cfloat lhs, cfloat rhs);
+cdouble operator*(cdouble lhs, cdouble rhs);
+common::half operator+(common::half lhs, common::half rhs) noexcept;
+}  // namespace opencl
+}  // namespace arrayfire
+
+static inline bool operator==(arrayfire::opencl::cfloat lhs,
+                              arrayfire::opencl::cfloat rhs) noexcept {
+    return (lhs.s[0] == rhs.s[0]) && (lhs.s[1] == rhs.s[1]);
+}
+static inline bool operator!=(arrayfire::opencl::cfloat lhs,
+                              arrayfire::opencl::cfloat rhs) noexcept {
+    return !(lhs == rhs);
 }
+static inline bool operator==(arrayfire::opencl::cdouble lhs,
+                              arrayfire::opencl::cdouble rhs) noexcept {
+    return (lhs.s[0] == rhs.s[0]) && (lhs.s[1] == rhs.s[1]);
+}
+static inline bool operator!=(arrayfire::opencl::cdouble lhs,
+                              arrayfire::opencl::cdouble rhs) noexcept {
+    return !(lhs == rhs);
+}
+
+#if defined(__GNUC__) || defined(__GNUG__)
+/* GCC/G++, Clang/LLVM, Intel ICC */
+#pragma GCC diagnostic pop
+#else
+/* Other */
+#endif
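Note on maxval/minval in math.hpp above: the removed limit_min returned std::numeric_limits<T>::min(), which for floating-point types is the smallest positive normal value, not the most negative one. The replacement uses lowest() or negative infinity. A minimal C++17 sketch of the same selection follows. Here fast_math is hard-coded to false, which assumes AF_WITH_FAST_MATH is not defined.

```cpp
#include <cassert>
#include <limits>
#include <type_traits>

// Assumption for this sketch: AF_WITH_FAST_MATH is not defined.
constexpr bool fast_math = false;

// Identity element for a min-reduction: +inf for floating point when
// infinities are usable, otherwise the largest finite value.
template<typename T>
constexpr T maxval() {
    if constexpr (std::is_floating_point_v<T> && !fast_math) {
        return std::numeric_limits<T>::infinity();
    } else {
        return std::numeric_limits<T>::max();
    }
}

// Identity element for a max-reduction: -inf or the lowest finite value.
template<typename T>
constexpr T minval() {
    if constexpr (std::is_floating_point_v<T> && !fast_math) {
        return -std::numeric_limits<T>::infinity();
    } else {
        return std::numeric_limits<T>::lowest();
    }
}

static_assert(maxval<int>() == std::numeric_limits<int>::max(), "int max");
static_assert(minval<int>() == std::numeric_limits<int>::lowest(), "int lowest");

// Example: floating-point types get infinities when fast math is off.
inline void example_usage() {
    assert(maxval<float>() == std::numeric_limits<float>::infinity());
    assert(minval<double>() == -std::numeric_limits<double>::infinity());
}
```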
diff --git a/src/backend/opencl/max.cpp b/src/backend/opencl/max.cpp
index f4615c72b5..695415517d 100644
--- a/src/backend/opencl/max.cpp
+++ b/src/backend/opencl/max.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //max
-    INSTANTIATE(af_max_t, float  , float  )
-    INSTANTIATE(af_max_t, double , double )
-    INSTANTIATE(af_max_t, cfloat , cfloat )
-    INSTANTIATE(af_max_t, cdouble, cdouble)
-    INSTANTIATE(af_max_t, int    , int    )
-    INSTANTIATE(af_max_t, uint   , uint   )
-    INSTANTIATE(af_max_t, char   , char   )
-    INSTANTIATE(af_max_t, uchar  , uchar  )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// max
+INSTANTIATE(af_max_t, float, float)
+INSTANTIATE(af_max_t, double, double)
+INSTANTIATE(af_max_t, cfloat, cfloat)
+INSTANTIATE(af_max_t, cdouble, cdouble)
+INSTANTIATE(af_max_t, int, int)
+INSTANTIATE(af_max_t, uint, uint)
+INSTANTIATE(af_max_t, intl, intl)
+INSTANTIATE(af_max_t, uintl, uintl)
+INSTANTIATE(af_max_t, char, char)
+INSTANTIATE(af_max_t, schar, schar)
+INSTANTIATE(af_max_t, uchar, uchar)
+INSTANTIATE(af_max_t, short, short)
+INSTANTIATE(af_max_t, ushort, ushort)
+INSTANTIATE(af_max_t, half, half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/mean.cpp b/src/backend/opencl/mean.cpp
new file mode 100644
index 0000000000..428c2812c3
--- /dev/null
+++ b/src/backend/opencl/mean.cpp
@@ -0,0 +1,82 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <mean.hpp>
+
+#include <common/half.hpp>
+#include <kernel/mean.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using arrayfire::common::half;
+using std::swap;
+
+namespace arrayfire {
+namespace opencl {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in) {
+    return kernel::meanAll<Ti, Tw, To>(in);
+}
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts) {
+    return kernel::meanAllWeighted<T, Tw>(in, wts);
+}
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::mean<Ti, Tw, To>(out, in, dim);
+    return out;
+}
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim) {
+    dim4 odims   = in.dims();
+    odims[dim]   = 1;
+    Array<T> out = createEmptyArray<T>(odims);
+    kernel::meanWeighted<T, Tw, T>(out, in, wts, dim);
+    return out;
+}
+
+#define INSTANTIATE(Ti, Tw, To)                        \
+    template To mean<Ti, Tw, To>(const Array<Ti>& in); \
+    template Array<To> mean<Ti, Tw, To>(const Array<Ti>& in, const int dim);
+
+INSTANTIATE(double, double, double);
+INSTANTIATE(float, float, float);
+INSTANTIATE(int, float, float);
+INSTANTIATE(unsigned, float, float);
+INSTANTIATE(intl, double, double);
+INSTANTIATE(uintl, double, double);
+INSTANTIATE(short, float, float);
+INSTANTIATE(ushort, float, float);
+INSTANTIATE(schar, float, float);
+INSTANTIATE(uchar, float, float);
+INSTANTIATE(char, float, float);
+INSTANTIATE(cfloat, float, cfloat);
+INSTANTIATE(cdouble, double, cdouble);
+INSTANTIATE(half, float, half);
+INSTANTIATE(half, float, float);
+
+#define INSTANTIATE_WGT(T, Tw)                                              \
+    template T mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts);       \
+    template Array<T> mean<T, Tw>(const Array<T>& in, const Array<Tw>& wts, \
+                                  const int dim);
+
+INSTANTIATE_WGT(double, double);
+INSTANTIATE_WGT(float, float);
+INSTANTIATE_WGT(cfloat, float);
+INSTANTIATE_WGT(cdouble, double);
+INSTANTIATE_WGT(half, float);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/mean.hpp b/src/backend/opencl/mean.hpp
new file mode 100644
index 0000000000..61f44aa86a
--- /dev/null
+++ b/src/backend/opencl/mean.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename Ti, typename Tw, typename To>
+To mean(const Array<Ti>& in);
+
+template<typename T, typename Tw>
+T mean(const Array<T>& in, const Array<Tw>& wts);
+
+template<typename Ti, typename Tw, typename To>
+Array<To> mean(const Array<Ti>& in, const int dim);
+
+template<typename T, typename Tw>
+Array<T> mean(const Array<T>& in, const Array<Tw>& wts, const int dim);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/meanshift.cpp b/src/backend/opencl/meanshift.cpp
index ea1b3bea54..9eaec9db9d 100644
--- a/src/backend/opencl/meanshift.cpp
+++ b/src/backend/opencl/meanshift.cpp
@@ -7,37 +7,42 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <meanshift.hpp>
-#include <kernel/meanshift.hpp>
 #include <err_opencl.hpp>
+#include <kernel/meanshift.hpp>
+#include <meanshift.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
-
-template<typename T, bool is_color>
-Array<T> meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter)
-{
-    const dim4 dims = in.dims();
-    Array<T> out   = createEmptyArray<T>(dims);
-    kernel::meanshift<T, is_color>(out, in, s_sigma, c_sigma, iter);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor) {
+    const dim4 &dims = in.dims();
+    Array<T> out     = createEmptyArray<T>(dims);
+    kernel::meanshift<T>(out, in, spatialSigma, chromaticSigma, numIterations,
+                         isColor);
     return out;
 }
 
-#define INSTANTIATE(T) \
-    template Array<T> meanshift<T, true >(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter); \
-    template Array<T> meanshift<T, false>(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
+#define INSTANTIATE(T)                                              \
+    template Array<T> meanshift<T>(const Array<T> &, const float &, \
+                                   const float &, const unsigned &, \
+                                   const bool &);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/meanshift.hpp b/src/backend/opencl/meanshift.hpp
index 3349e37802..54e8dd588f 100644
--- a/src/backend/opencl/meanshift.hpp
+++ b/src/backend/opencl/meanshift.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
-
-template<typename T, bool is_color>
-Array<T> meanshift(const Array<T> &in, const float &s_sigma, const float &c_sigma, const unsigned iter);
-
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> meanshift(const Array<T> &in, const float &spatialSigma,
+                   const float &chromaticSigma, const unsigned &numIterations,
+                   const bool &isColor);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/medfilt.cpp b/src/backend/opencl/medfilt.cpp
index 76fde1a34b..d3025a50b9 100644
--- a/src/backend/opencl/medfilt.cpp
+++ b/src/backend/opencl/medfilt.cpp
@@ -7,49 +7,58 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <medfilt.hpp>
-#include <kernel/medfilt.hpp>
 #include <err_opencl.hpp>
+#include <kernel/medfilt.hpp>
+#include <medfilt.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, af_border_type pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid)
-{
-    ARG_ASSERT(2, (w_len<=kernel::MAX_MEDFILTER_LEN));
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType pad) {
+    ARG_ASSERT(2, (w_wid <= kernel::MAX_MEDFILTER1_LEN));
+    ARG_ASSERT(2, (w_wid % 2 != 0));
 
-    const dim4 dims     = in.dims();
+    const dim4 &dims = in.dims();
 
-    Array<T> out      = createEmptyArray<T>(dims);
+    Array<T> out = createEmptyArray<T>(dims);
 
-    switch(w_len) {
-        case  3: kernel::medfilt<T, pad,  3,  3>(out, in); break;
-        case  5: kernel::medfilt<T, pad,  5,  5>(out, in); break;
-        case  7: kernel::medfilt<T, pad,  7,  7>(out, in); break;
-        case  9: kernel::medfilt<T, pad,  9,  9>(out, in); break;
-        case 11: kernel::medfilt<T, pad, 11, 11>(out, in); break;
-        case 13: kernel::medfilt<T, pad, 13, 13>(out, in); break;
-        case 15: kernel::medfilt<T, pad, 15, 15>(out, in); break;
-    }
+    kernel::medfilt1<T>(out, in, w_wid, pad);
+
+    return out;
+}
+
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType pad) {
+    ARG_ASSERT(2, (w_len % 2 != 0));
+    ARG_ASSERT(2, (w_len <= kernel::MAX_MEDFILTER2_LEN));
+
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::medfilt2<T>(out, in, pad, w_len, w_wid);
     return out;
 }
 
-#define INSTANTIATE(T)\
-    template Array<T> medfilt<T, AF_PAD_ZERO     >(const Array<T> &in, dim_t w_len, dim_t w_wid); \
-    template Array<T> medfilt<T, AF_PAD_SYM>(const Array<T> &in, dim_t w_len, dim_t w_wid);
+#define INSTANTIATE(T)                                                 \
+    template Array<T> medfilt1<T>(const Array<T> &in, const int w_wid, \
+                                  const af::borderType);               \
+    template Array<T> medfilt2<T>(const Array<T> &in, const int w_len, \
+                                  const int w_wid, const af::borderType);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(char  )
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
-INSTANTIATE(uchar )
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/medfilt.hpp b/src/backend/opencl/medfilt.hpp
index d1ba7d388f..439282b1f1 100644
--- a/src/backend/opencl/medfilt.hpp
+++ b/src/backend/opencl/medfilt.hpp
@@ -9,10 +9,16 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, af_border_type edge_pad>
-Array<T> medfilt(const Array<T> &in, dim_t w_len, dim_t w_wid);
+template<typename T>
+Array<T> medfilt1(const Array<T> &in, const int w_wid,
+                  const af::borderType edge_pad);
 
-}
+template<typename T>
+Array<T> medfilt2(const Array<T> &in, const int w_len, const int w_wid,
+                  const af::borderType edge_pad);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/memory.cpp b/src/backend/opencl/memory.cpp
index 9d14c916a9..7c69b33e24 100644
--- a/src/backend/opencl/memory.cpp
+++ b/src/backend/opencl/memory.cpp
@@ -7,344 +7,286 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/Logger.hpp>
+#include <common/MemoryManagerBase.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
+#include <errorcodes.hpp>
 #include <memory.hpp>
-#include <dispatch.hpp>
-#include <map>
+#include <platform.hpp>
+#include <spdlog/spdlog.h>
 #include <types.hpp>
+#include <af/dim4.hpp>
 
-namespace opencl
-{
-    static size_t memory_resolution = 1024; //1KB
+#include <utility>
 
-    void setMemStepSize(size_t step_bytes)
-    {
-        memory_resolution = step_bytes;
-    }
+using arrayfire::common::bytesToString;
 
-    size_t getMemStepSize(void)
-    {
-        return memory_resolution;
-    }
+using af::dim4;
+using std::function;
+using std::move;
+using std::unique_ptr;
 
-    // Manager Class
-    // Dummy used to call garbage collection at the end of the program
-    class Manager
-    {
-        public:
-        static bool initialized;
-        Manager()
-        {
-            initialized = true;
-        }
+namespace arrayfire {
+namespace opencl {
+float getMemoryPressure() { return memoryManager().getMemoryPressure(); }
+float getMemoryPressureThreshold() {
+    return memoryManager().getMemoryPressureThreshold();
+}
 
-        ~Manager()
-        {
-            for(int i = 0; i < (int)getDeviceCount(); i++) {
-                setDevice(i);
-                garbageCollect();
-                pinnedGarbageCollect();
-            }
-        }
-    };
+bool jitTreeExceedsMemoryPressure(size_t bytes) {
+    return memoryManager().jitTreeExceedsMemoryPressure(bytes);
+}
 
-    bool Manager::initialized = false;
+void setMemStepSize(size_t step_bytes) {
+    memoryManager().setMemStepSize(step_bytes);
+}
 
-    static void managerInit()
-    {
-        if(Manager::initialized == false)
-            static Manager pm = Manager();
-    }
+size_t getMemStepSize() { return memoryManager().getMemStepSize(); }
+
+void signalMemoryCleanup() { memoryManager().signalMemoryCleanup(); }
 
-    typedef struct
-    {
-        bool is_free;
-        bool is_unlinked;
-        size_t bytes;
-    } mem_info;
+void shutdownMemoryManager() { memoryManager().shutdown(); }
 
-    static size_t used_bytes[DeviceManager::MAX_DEVICES] = {0};
-    static size_t used_buffers[DeviceManager::MAX_DEVICES] = {0};
-    static size_t total_bytes[DeviceManager::MAX_DEVICES] = {0};
+void shutdownPinnedMemoryManager() { pinnedMemoryManager().shutdown(); }
 
-    typedef std::map<cl::Buffer *, mem_info> mem_t;
-    typedef mem_t::iterator mem_iter;
-    mem_t memory_maps[DeviceManager::MAX_DEVICES];
+void printMemInfo(const char *msg, const int device) {
+    memoryManager().printInfo(msg, device);
+}
 
-    static void destroy(cl::Buffer *ptr)
-    {
-        delete ptr;
+template<typename T>
+unique_ptr<cl::Buffer, function<void(cl::Buffer *)>> memAlloc(
+    const size_t &elements) {
+    // TODO: make memAlloc aware of array shapes
+    if (elements) {
+        dim4 dims(elements);
+        void *ptr = memoryManager().alloc(false, 1, dims.get(), sizeof(T));
+        auto buf  = static_cast<cl_mem>(ptr);
+        cl::Buffer *bptr = new cl::Buffer(buf, true);
+        return unique_ptr<cl::Buffer, function<void(cl::Buffer *)>>(bptr,
+                                                                    bufferFree);
+    } else {
+        return unique_ptr<cl::Buffer, function<void(cl::Buffer *)>>(nullptr,
+                                                                    bufferFree);
     }
+}
 
-    void garbageCollect()
-    {
-        int n = getActiveDeviceId();
-        for(mem_iter iter = memory_maps[n].begin();
-            iter != memory_maps[n].end(); ++iter) {
+void *memAllocUser(const size_t &bytes) {
+    dim4 dims(bytes);
+    void *ptr = memoryManager().alloc(true, 1, dims.get(), 1);
+    auto buf  = static_cast<cl_mem>(ptr);
+    return new cl::Buffer(buf, true);
+}
 
-            if ((iter->second).is_free) {
+void memFree(cl::Buffer *ptr) {
+    cl::Buffer *buf = reinterpret_cast<cl::Buffer *>(ptr);
+    cl_mem mem      = static_cast<cl_mem>((*buf)());
+    delete buf;
+    return memoryManager().unlock(static_cast<void *>(mem), false);
+}
 
-                if (!(iter->second).is_unlinked) {
-                    destroy(iter->first);
-                }
-                total_bytes[n] -= iter->second.bytes;
-            }
-        }
+void memFree(cl_mem ptr) {
+    return memoryManager().unlock(static_cast<void *>(ptr), false);
+}
 
-        mem_iter memory_curr = memory_maps[n].begin();
-        mem_iter memory_end  = memory_maps[n].end();
+void memFreeUser(void *ptr) {
+    cl::Buffer *buf = static_cast<cl::Buffer *>(ptr);
+    cl_mem mem      = (*buf)();
+    delete buf;
+    memoryManager().unlock(mem, true);
+}
 
-        while(memory_curr != memory_end) {
-            if (memory_curr->second.is_free) {
-                memory_curr = memory_maps[n].erase(memory_curr);
-            } else {
-                ++memory_curr;
-            }
-        }
+cl::Buffer *bufferAlloc(const size_t &bytes) {
+    dim4 dims(bytes);
+    if (bytes) {
+        void *ptr       = memoryManager().alloc(false, 1, dims.get(), 1);
+        cl_mem mem      = static_cast<cl_mem>(ptr);
+        cl::Buffer *buf = new cl::Buffer(mem, true);
+        return buf;
+    } else {
+        return nullptr;
     }
+}
 
-    cl::Buffer *bufferAlloc(const size_t &bytes)
-    {
-        int n = getActiveDeviceId();
-        cl::Buffer *ptr = NULL;
-        size_t alloc_bytes = divup(bytes, memory_resolution) * memory_resolution;
-
-        if (bytes > 0) {
-
-            // FIXME: Add better checks for garbage collection
-            // Perhaps look at total memory available as a metric
-            if (memory_maps[n].size() >= MAX_BUFFERS || used_bytes[n] >= MAX_BYTES) {
-                garbageCollect();
-            }
-
-            for(mem_iter iter = memory_maps[n].begin();
-                iter != memory_maps[n].end(); ++iter) {
-
-                mem_info info = iter->second;
-
-                if ( info.is_free &&
-                    !info.is_unlinked &&
-                     info.bytes == alloc_bytes) {
-
-                    iter->second.is_free = false;
-                    used_bytes[n] += alloc_bytes;
-                    used_buffers[n]++;
-                    return iter->first;
-                }
-            }
-
-            try {
-                ptr = new cl::Buffer(getContext(), CL_MEM_READ_WRITE, alloc_bytes);
-            } catch(...) {
-                garbageCollect();
-                ptr = new cl::Buffer(getContext(), CL_MEM_READ_WRITE, alloc_bytes);
-            }
-
-            mem_info info = {false, false, alloc_bytes};
-            memory_maps[n][ptr] = info;
-            used_bytes[n] += alloc_bytes;
-            used_buffers[n]++;
-            total_bytes[n] += alloc_bytes;
-        }
-        return ptr;
+void bufferFree(cl::Buffer *buf) {
+    if (buf) {
+        cl_mem mem = (*buf)();
+        delete buf;
+        memoryManager().unlock(static_cast<void *>(mem), false);
     }
+}
 
-    void bufferFree(cl::Buffer *ptr)
-    {
-        int n = getActiveDeviceId();
-        mem_iter iter = memory_maps[n].find(ptr);
+void memLock(const cl::Buffer *ptr) {
+    cl_mem mem = static_cast<cl_mem>((*ptr)());
+    memoryManager().userLock(static_cast<void *>(mem));
+}
 
-        if (iter != memory_maps[n].end()) {
+void memUnlock(const cl::Buffer *ptr) {
+    cl_mem mem = static_cast<cl_mem>((*ptr)());
+    memoryManager().userUnlock(static_cast<void *>(mem));
+}
 
-            if ((iter->second).is_unlinked) return;
+bool isLocked(const void *ptr) {
+    return memoryManager().isUserLocked(const_cast<void *>(ptr));
+}
 
-            iter->second.is_free = true;
-            used_bytes[n] -= iter->second.bytes;
-            used_buffers[n]--;
-        } else {
-            destroy(ptr); // Free it because we are not sure what the size is
-        }
-    }
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers) {
+    memoryManager().usageInfo(alloc_bytes, alloc_buffers, lock_bytes,
+                              lock_buffers);
+}
 
-    void bufferUnlink(cl::Buffer *ptr)
-    {
-        int n = getActiveDeviceId();
-        mem_iter iter = memory_maps[n].find(ptr);
+template<typename T>
+T *pinnedAlloc(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), sizeof(T));
+    return static_cast<T *>(ptr);
+}
 
-        if (iter != memory_maps[n].end()) {
+template<typename T>
+void pinnedFree(T *ptr) {
+    pinnedMemoryManager().unlock(static_cast<void *>(ptr), false);
+}
 
-            iter->second.is_unlinked = true;
-            iter->second.is_free = true;
-            used_bytes[n] -= iter->second.bytes;
-            used_buffers[n]--;
+#define INSTANTIATE(T)                                                         \
+    template unique_ptr<cl::Buffer, function<void(cl::Buffer *)>> memAlloc<T>( \
+        const size_t &elements);                                               \
+    template T *pinnedAlloc(const size_t &elements);                           \
+    template void pinnedFree(T *ptr);
+
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(common::half)
+
+template<>
+void *pinnedAlloc<void>(const size_t &elements) {
+    // TODO: make pinnedAlloc aware of array shapes
+    dim4 dims(elements);
+    void *ptr = pinnedMemoryManager().alloc(false, 1, dims.get(), 1);
+    return ptr;
+}
 
-        } else {
+template<>
+void pinnedFree(void *ptr) {
+    pinnedMemoryManager().unlock(ptr, false);
+}
 
-            mem_info info = { false,
-                              false,
-                              100 }; //This number is not relevant
+Allocator::Allocator() { logger = common::loggerFactory("mem"); }
 
-            memory_maps[n][ptr] = info;
+void Allocator::shutdown() {
+    for (int n = 0; n < opencl::getDeviceCount(); n++) {
+        try {
+            opencl::setDevice(n);
+            shutdownMemoryManager();
+        } catch (const AfError &err) {
+            continue;  // Do not throw any errors while shutting down
         }
     }
+}
 
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers)
-    {
-        int n = getActiveDeviceId();
-        if (alloc_bytes   ) *alloc_bytes   = total_bytes[n];
-        if (alloc_buffers ) *alloc_buffers = memory_maps[n].size();
-        if (lock_bytes    ) *lock_bytes    = used_bytes[n];
-        if (lock_buffers  ) *lock_buffers  = used_buffers[n];
-    }
+int Allocator::getActiveDeviceId() { return opencl::getActiveDeviceId(); }
 
-    template<typename T>
-    T *memAlloc(const size_t &elements)
-    {
-        managerInit();
-        return (T *)bufferAlloc(elements * sizeof(T));
-    }
+size_t Allocator::getMaxMemorySize(int id) {
+    return opencl::getDeviceMemorySize(id);
+}
 
-    template<typename T>
-    void memFree(T *ptr)
-    {
-        return bufferFree((cl::Buffer *)ptr);
-    }
+void *Allocator::nativeAlloc(const size_t bytes) {
+    cl_int err = CL_SUCCESS;
+    auto ptr   = static_cast<void *>(clCreateBuffer(
+        getContext()(), CL_MEM_READ_WRITE,  // NOLINT(hicpp-signed-bitwise)
+        bytes, nullptr, &err));
 
-    template<typename T>
-    void memUnlink(T *ptr)
-    {
-        return bufferUnlink((cl::Buffer *)ptr);
+    if (err != CL_SUCCESS) {
+        auto str = fmt::format("Failed to allocate device memory of size {}",
+                               bytesToString(bytes));
+        AF_ERROR(str, AF_ERR_NO_MEM);
     }
 
-    // pinned memory manager
-    typedef struct {
-        cl::Buffer *buf;
-        mem_info info;
-    } pinned_info;
-
-    typedef std::map<void*, pinned_info> pinned_t;
-    typedef pinned_t::iterator pinned_iter;
-    pinned_t pinned_maps[DeviceManager::MAX_DEVICES];
-    static size_t pinned_used_bytes = 0;
-
-    static void pinnedDestroy(cl::Buffer *buf, void *ptr)
-    {
-        getQueue().enqueueUnmapMemObject(*buf, (void *)ptr);
-        destroy(buf);
-    }
+    AF_TRACE("nativeAlloc: {} {}", bytesToString(bytes), ptr);
+    return ptr;
+}
 
-    void pinnedGarbageCollect()
-    {
-        int n = getActiveDeviceId();
-        for(auto &iter : pinned_maps[n]) {
-            if ((iter.second).info.is_free) {
-                pinnedDestroy(iter.second.buf, iter.first);
-            }
-        }
+void Allocator::nativeFree(void *ptr) {
+    cl_mem buffer = static_cast<cl_mem>(ptr);
+    AF_TRACE("nativeFree:          {}", ptr);
+    cl_int err = clReleaseMemObject(buffer);
+    if (err != CL_SUCCESS) {
+        AF_ERROR("Failed to release device memory.", AF_ERR_RUNTIME);
+    }
+}
 
-        pinned_iter memory_curr = pinned_maps[n].begin();
-        pinned_iter memory_end  = pinned_maps[n].end();
+AllocatorPinned::AllocatorPinned() : pinnedMaps(opencl::getDeviceCount()) {
+    logger = common::loggerFactory("mem");
+}
 
-        while(memory_curr != memory_end) {
-            if (memory_curr->second.info.is_free) {
-                memory_curr = pinned_maps[n].erase(memory_curr);
-            } else {
-                ++memory_curr;
-            }
+void AllocatorPinned::shutdown() {
+    for (int n = 0; n < opencl::getDeviceCount(); n++) {
+        opencl::setDevice(n);
+        shutdownPinnedMemoryManager();
+        auto currIterator = pinnedMaps[n].begin();
+        auto endIterator  = pinnedMaps[n].end();
+        while (currIterator != endIterator) {
+            pinnedMaps[n].erase(currIterator++);
         }
-
     }
+}
 
-    void *pinnedBufferAlloc(const size_t &bytes)
-    {
-        void *ptr = NULL;
-        int n = getActiveDeviceId();
-        // Allocate the higher megabyte. Overhead of creating pinned memory is
-        // more so we want more resuable memory.
-        size_t alloc_bytes = divup(bytes, 1048576) * 1048576;
-
-        if (bytes > 0) {
-            cl::Buffer *buf = NULL;
-
-            // FIXME: Add better checks for garbage collection
-            // Perhaps look at total memory available as a metric
-            if (pinned_maps[n].size() >= MAX_BUFFERS || pinned_used_bytes >= MAX_BYTES) {
-                pinnedGarbageCollect();
-            }
-
-            for(pinned_iter iter = pinned_maps[n].begin();
-                iter != pinned_maps[n].end(); ++iter) {
-
-                mem_info info = iter->second.info;
-                if (info.is_free && info.bytes == alloc_bytes) {
-                    iter->second.info.is_free = false;
-                    pinned_used_bytes += alloc_bytes;
-                    return iter->first;
-                }
-            }
-
-            try {
-                buf = new cl::Buffer(getContext(), CL_MEM_ALLOC_HOST_PTR, alloc_bytes);
-
-                ptr = getQueue().enqueueMapBuffer(*buf, true, CL_MAP_READ|CL_MAP_WRITE,
-                                                  0, alloc_bytes);
-            } catch(...) {
-                pinnedGarbageCollect();
-                buf = new cl::Buffer(getContext(), CL_MEM_ALLOC_HOST_PTR, alloc_bytes);
-
-                ptr = getQueue().enqueueMapBuffer(*buf, true, CL_MAP_READ|CL_MAP_WRITE,
-                                                  0, alloc_bytes);
-            }
-            mem_info info = {false, false, alloc_bytes};
-            pinned_info pt = {buf, info};
-            pinned_maps[n][ptr] = pt;
-            pinned_used_bytes += alloc_bytes;
-        }
-        return ptr;
-    }
+int AllocatorPinned::getActiveDeviceId() { return opencl::getActiveDeviceId(); }
 
-    void pinnedBufferFree(void *ptr)
-    {
-        int n = getActiveDeviceId();
-        pinned_iter iter = pinned_maps[n].find(ptr);
-
-        if (iter != pinned_maps[n].end()) {
-            iter->second.info.is_free = true;
-            pinned_used_bytes -= iter->second.info.bytes;
-        } else {
-            pinnedDestroy(iter->second.buf, ptr); // Free it because we are not sure what the size is
-            pinned_maps[n].erase(iter);
-        }
-    }
+size_t AllocatorPinned::getMaxMemorySize(int id) {
+    return opencl::getDeviceMemorySize(id);
+}
+
+void *AllocatorPinned::nativeAlloc(const size_t bytes) {
+    void *ptr = nullptr;
 
-    template<typename T>
-    T* pinnedAlloc(const size_t &elements)
-    {
-        managerInit();
-        return (T *)pinnedBufferAlloc(elements * sizeof(T));
+    cl_int err = CL_SUCCESS;
+    auto buf   = clCreateBuffer(getContext()(), CL_MEM_ALLOC_HOST_PTR, bytes,
+                                nullptr, &err);
+    if (err != CL_SUCCESS) {
+        AF_ERROR("Failed to allocate pinned memory.", AF_ERR_NO_MEM);
     }
 
-    template<typename T>
-    void pinnedFree(T* ptr)
-    {
-        return pinnedBufferFree((void *) ptr);
+    ptr = clEnqueueMapBuffer(getQueue()(), buf, CL_TRUE,
+                             CL_MAP_READ | CL_MAP_WRITE, 0, bytes, 0, nullptr,
+                             nullptr, &err);
+    if (err != CL_SUCCESS) {
+        AF_ERROR("Failed to map pinned memory", AF_ERR_RUNTIME);
     }
+    AF_TRACE("Pinned::nativeAlloc: {:>7} {}", bytesToString(bytes), ptr);
+    pinnedMaps[opencl::getActiveDeviceId()].emplace(ptr, new cl::Buffer(buf));
+    return ptr;
+}
 
-#define INSTANTIATE(T)                              \
-    template T* memAlloc(const size_t &elements);   \
-    template void memFree(T* ptr);                  \
-    template void memUnlink(T* ptr);                \
-    template T* pinnedAlloc(const size_t &elements);\
-    template void pinnedFree(T* ptr);               \
-
-    INSTANTIATE(float)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(double)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+void AllocatorPinned::nativeFree(void *ptr) {
+    AF_TRACE("Pinned::nativeFree:          {}", ptr);
+    int n     = opencl::getActiveDeviceId();
+    auto &map = pinnedMaps[n];
+    auto iter = map.find(ptr);
+
+    if (iter != map.end()) {
+        cl::Buffer *buf = map[ptr];
+        if (cl_int err = getQueue().enqueueUnmapMemObject(*buf, ptr)) {
+            getLogger()->warn(
+                "Pinned::nativeFree: Error unmapping pinned memory({}:{}). "
+                "Ignoring",
+                err, getErrorMessage(err));
+        }
+        delete buf;
+        map.erase(iter);
+    }
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/memory.hpp b/src/backend/opencl/memory.hpp
index f9cf1833ce..447f80bb83 100644
--- a/src/backend/opencl/memory.hpp
+++ b/src/backend/opencl/memory.hpp
@@ -8,31 +8,83 @@
  ********************************************************/
 #pragma once
 
-#include <platform.hpp>
-#include <af/defines.h>
+#include <common/AllocatorInterface.hpp>
 
-namespace opencl
-{
+#include <cstdlib>
+#include <functional>
+#include <map>
+#include <memory>
+#include <vector>
 
-    cl::Buffer *bufferAlloc(const size_t &bytes);
-    void bufferFree(cl::Buffer *buf);
-    void bufferUnlink(cl::Buffer *ptr);
+namespace cl {
+class Buffer;  // Forward declaration of cl::Buffer from CL/cl2.hpp
+}
 
-    template<typename T> T *memAlloc(const size_t &elements);
-    template<typename T> void memFree(T *ptr);
-    template<typename T> void memUnlink(T *ptr);
+namespace arrayfire {
+namespace opencl {
+cl::Buffer *bufferAlloc(const size_t &bytes);
+void bufferFree(cl::Buffer *buf);
 
-    template<typename T> T* pinnedAlloc(const size_t &elements);
-    template<typename T> void pinnedFree(T* ptr);
+using bufptr = std::unique_ptr<cl::Buffer, std::function<void(cl::Buffer *)>>;
 
-    static const unsigned MAX_BUFFERS   = 100;
-    static const unsigned MAX_BYTES     = (1 << 30);
+template<typename T>
+bufptr memAlloc(const size_t &elements);
+void *memAllocUser(const size_t &bytes);
 
-    void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
-                          size_t *lock_bytes,  size_t *lock_buffers);
-    void garbageCollect();
-    void pinnedGarbageCollect();
+// These must be two separate functions rather than one function with a
+// default argument, because memFree is used as the deleter in a shared
+// pointer, which cannot support default arguments.
+void memFree(cl::Buffer *ptr);
+void memFree(cl_mem ptr);
+void memFreeUser(void *ptr);
 
-    void setMemStepSize(size_t step_bytes);
-    size_t getMemStepSize(void);
-}
+void memLock(const cl::Buffer *ptr);
+void memUnlock(const cl::Buffer *ptr);
+bool isLocked(const void *ptr);
+
+template<typename T>
+T *pinnedAlloc(const size_t &elements);
+template<typename T>
+void pinnedFree(T *ptr);
+
+void deviceMemoryInfo(size_t *alloc_bytes, size_t *alloc_buffers,
+                      size_t *lock_bytes, size_t *lock_buffers);
+void signalMemoryCleanup();
+void shutdownMemoryManager();
+void pinnedGarbageCollect();
+
+void printMemInfo(const char *msg, const int device);
+
+float getMemoryPressure();
+float getMemoryPressureThreshold();
+bool jitTreeExceedsMemoryPressure(size_t bytes);
+void setMemStepSize(size_t step_bytes);
+size_t getMemStepSize(void);
+
+class Allocator final : public common::AllocatorInterface {
+   public:
+    Allocator();
+    ~Allocator() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+};
+
+class AllocatorPinned final : public common::AllocatorInterface {
+   public:
+    AllocatorPinned();
+    ~AllocatorPinned() = default;
+    void shutdown() override;
+    int getActiveDeviceId() override;
+    size_t getMaxMemorySize(int id) override;
+    void *nativeAlloc(const size_t bytes) override;
+    void nativeFree(void *ptr) override;
+
+   private:
+    std::vector<std::map<void *, cl::Buffer *>> pinnedMaps;
+};
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/min.cpp b/src/backend/opencl/min.cpp
index 442b0d6187..75c117caa8 100644
--- a/src/backend/opencl/min.cpp
+++ b/src/backend/opencl/min.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //min
-    INSTANTIATE(af_min_t, float  , float  )
-    INSTANTIATE(af_min_t, double , double )
-    INSTANTIATE(af_min_t, cfloat , cfloat )
-    INSTANTIATE(af_min_t, cdouble, cdouble)
-    INSTANTIATE(af_min_t, int    , int    )
-    INSTANTIATE(af_min_t, uint   , uint   )
-    INSTANTIATE(af_min_t, char   , char   )
-    INSTANTIATE(af_min_t, uchar  , uchar  )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// min
+INSTANTIATE(af_min_t, float, float)
+INSTANTIATE(af_min_t, double, double)
+INSTANTIATE(af_min_t, cfloat, cfloat)
+INSTANTIATE(af_min_t, cdouble, cdouble)
+INSTANTIATE(af_min_t, int, int)
+INSTANTIATE(af_min_t, uint, uint)
+INSTANTIATE(af_min_t, intl, intl)
+INSTANTIATE(af_min_t, uintl, uintl)
+INSTANTIATE(af_min_t, char, char)
+INSTANTIATE(af_min_t, schar, schar)
+INSTANTIATE(af_min_t, uchar, uchar)
+INSTANTIATE(af_min_t, short, short)
+INSTANTIATE(af_min_t, ushort, ushort)
+INSTANTIATE(af_min_t, half, half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/moments.cpp b/src/backend/opencl/moments.cpp
new file mode 100644
index 0000000000..80afc2ece1
--- /dev/null
+++ b/src/backend/opencl/moments.cpp
@@ -0,0 +1,57 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <kernel/moments.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+static inline unsigned bitCount(unsigned v) {
+    v = v - ((v >> 1U) & 0x55555555U);
+    v = (v & 0x33333333U) + ((v >> 2U) & 0x33333333U);
+    return (((v + (v >> 4U)) & 0xF0F0F0FU) * 0x1010101U) >> 24U;
+}
+
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment) {
+    in.eval();
+    dim4 odims, idims = in.dims();
+    dim_t moments_dim = bitCount(moment);
+
+    odims[0] = moments_dim;
+    odims[1] = 1;
+    odims[2] = idims[2];
+    odims[3] = idims[3];
+
+    Array<float> out = createValueArray<float>(odims, 0.f);
+    out.eval();
+
+    kernel::moments<T>(out, in, moment);
+    return out;
+}
+
+#define INSTANTIATE(T)                                   \
+    template Array<float> moments<T>(const Array<T> &in, \
+                                     const af_moment_type moment);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(ushort)
+INSTANTIATE(short)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/moments.hpp b/src/backend/opencl/moments.hpp
new file mode 100644
index 0000000000..c0e3cb4058
--- /dev/null
+++ b/src/backend/opencl/moments.hpp
@@ -0,0 +1,17 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<float> moments(const Array<T> &in, const af_moment_type moment);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/morph.cpp b/src/backend/opencl/morph.cpp
new file mode 100644
index 0000000000..a1cb86aa03
--- /dev/null
+++ b/src/backend/opencl/morph.cpp
@@ -0,0 +1,66 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_opencl.hpp>
+#include <kernel/morph.hpp>
+#include <math.hpp>
+#include <morph.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    const dim4 mdims = mask.dims();
+    if (mdims[0] != mdims[1]) {
+        OPENCL_NOT_SUPPORTED("Rectangular masks are not supported");
+    }
+    if (mdims[0] > 19) {
+        OPENCL_NOT_SUPPORTED("Kernels > 19x19 are not supported");
+    }
+    const dim4 dims = in.dims();
+    Array<T> out    = createEmptyArray<T>(dims);
+    kernel::morph<T>(out, in, mask, isDilation);
+    return out;
+}
+
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation) {
+    const dim4 mdims = mask.dims();
+    if (mdims[0] != mdims[1] || mdims[0] != mdims[2]) {
+        OPENCL_NOT_SUPPORTED("Only cubic masks are supported");
+    }
+    if (mdims[0] > 7) {
+        OPENCL_NOT_SUPPORTED("Kernels > 7x7x7 are not supported");
+    }
+    Array<T> out = createEmptyArray<T>(in.dims());
+    kernel::morph3d<T>(out, in, mask, isDilation);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                    \
+    template Array<T> morph<T>(const Array<T> &, const Array<T> &, bool); \
+    template Array<T> morph3d<T>(const Array<T> &, const Array<T> &, bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/morph.hpp b/src/backend/opencl/morph.hpp
index f16c63c86e..aee753c8d7 100644
--- a/src/backend/opencl/morph.hpp
+++ b/src/backend/opencl/morph.hpp
@@ -9,13 +9,12 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> morph(const Array<T> &in, const Array<T> &mask, bool isDilation);
 
-template<typename T, bool isDilation>
-Array<T> morph(const Array<T> &in, const Array<T> &mask);
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask);
-
-}
+template<typename T>
+Array<T> morph3d(const Array<T> &in, const Array<T> &mask, bool isDilation);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/morph3d_impl.hpp b/src/backend/opencl/morph3d_impl.hpp
deleted file mode 100644
index 8cb7e2f5a2..0000000000
--- a/src/backend/opencl/morph3d_impl.hpp
+++ /dev/null
@@ -1,50 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <math.hpp>
-#include <morph.hpp>
-#include <kernel/morph.hpp>
-#include <err_opencl.hpp>
-
-using af::dim4;
-
-namespace opencl
-{
-
-template<typename T, bool isDilation>
-Array<T> morph3d(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 mdims    = mask.dims();
-
-    if (mdims[0]!=mdims[1] || mdims[0]!=mdims[2])
-        AF_ERROR("Only cube masks are supported in opencl morph currently", AF_ERR_SIZE);
-    if (mdims[0]>7)
-        AF_ERROR("Upto 7x7x7 kernels are only supported in opencl currently", AF_ERR_SIZE);
-
-    const dim4 dims= in.dims();
-    Array<T> out   = createEmptyArray<T>(dims);
-
-    switch(mdims[0]) {
-        case  3: kernel::morph3d<T, isDilation,  3>(out, in, mask); break;
-        case  5: kernel::morph3d<T, isDilation,  5>(out, in, mask); break;
-        case  7: kernel::morph3d<T, isDilation,  7>(out, in, mask); break;
-        default: kernel::morph3d<T, isDilation,  3>(out, in, mask); break;
-    }
-
-    return out;
-}
-
-}
-
-#define INSTANTIATE(T, ISDILATE)                                                 \
-    template Array<T> morph3d<T, ISDILATE>(const Array<T> &in, const Array<T> &mask);
diff --git a/src/backend/opencl/morph_impl.hpp b/src/backend/opencl/morph_impl.hpp
deleted file mode 100644
index b4c57d5d4e..0000000000
--- a/src/backend/opencl/morph_impl.hpp
+++ /dev/null
@@ -1,56 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
-#include <math.hpp>
-#include <morph.hpp>
-#include <kernel/morph.hpp>
-#include <err_opencl.hpp>
-
-using af::dim4;
-
-namespace opencl
-{
-
-template<typename T, bool isDilation>
-Array<T> morph(const Array<T> &in, const Array<T> &mask)
-{
-    const dim4 mdims    = mask.dims();
-
-    if (mdims[0]!=mdims[1])
-        AF_ERROR("Only square masks are supported in opencl morph currently", AF_ERR_SIZE);
-    if (mdims[0]>19)
-        AF_ERROR("Upto 19x19 square kernels are only supported in opencl currently", AF_ERR_SIZE);
-
-    const dim4 dims = in.dims();
-    Array<T> out   = createEmptyArray<T>(dims);
-
-    switch(mdims[0]) {
-        case  3: kernel::morph<T, isDilation,  3>(out, in, mask); break;
-        case  5: kernel::morph<T, isDilation,  5>(out, in, mask); break;
-        case  7: kernel::morph<T, isDilation,  7>(out, in, mask); break;
-        case  9: kernel::morph<T, isDilation,  9>(out, in, mask); break;
-        case 11: kernel::morph<T, isDilation, 11>(out, in, mask); break;
-        case 13: kernel::morph<T, isDilation, 13>(out, in, mask); break;
-        case 15: kernel::morph<T, isDilation, 15>(out, in, mask); break;
-        case 17: kernel::morph<T, isDilation, 17>(out, in, mask); break;
-        case 19: kernel::morph<T, isDilation, 19>(out, in, mask); break;
-        default: kernel::morph<T, isDilation,  3>(out, in, mask); break;
-    }
-
-    return out;
-}
-
-}
-
-#define INSTANTIATE(T, ISDILATE)                                                 \
-    template Array<T> morph  <T, ISDILATE>(const Array<T> &in, const Array<T> &mask);
diff --git a/src/backend/opencl/nearest_neighbour.cpp b/src/backend/opencl/nearest_neighbour.cpp
new file mode 100644
index 0000000000..615165a8e5
--- /dev/null
+++ b/src/backend/opencl/nearest_neighbour.cpp
@@ -0,0 +1,89 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_opencl.hpp>
+#include <kernel/nearest_neighbour.hpp>
+#include <math.hpp>
+#include <topk.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+using af::dim4;
+using cl::Device;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename To, af_match_type dist_type>
+void nearest_neighbour_(Array<uint>& idx, Array<To>& dist,
+                        const Array<T>& query, const Array<T>& train,
+                        const uint dist_dim, const uint n_dist) {
+    uint sample_dim   = (dist_dim == 0) ? 1 : 0;
+    const dim4& qDims = query.dims();
+    const dim4& tDims = train.dims();
+
+    const dim4 outDims(n_dist, qDims[sample_dim]);
+    const dim4 distDims(tDims[sample_dim], qDims[sample_dim]);
+
+    Array<To> tmp_dists = createEmptyArray<To>(distDims);
+
+    idx  = createEmptyArray<uint>(outDims);
+    dist = createEmptyArray<To>(outDims);
+
+    Array<T> queryT = dist_dim == 0 ? transpose(query, false) : query;
+    Array<T> trainT = dist_dim == 0 ? transpose(train, false) : train;
+
+    kernel::allDistances<T, To>(tmp_dists, queryT, trainT, 1, dist_type);
+
+    topk(dist, idx, tmp_dists, n_dist, 0, AF_TOPK_MIN);
+}
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist, const af_match_type dist_type) {
+    switch (dist_type) {
+        case AF_SAD:
+            nearest_neighbour_<T, To, AF_SAD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        case AF_SSD:
+            nearest_neighbour_<T, To, AF_SSD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        case AF_SHD:
+            nearest_neighbour_<T, To, AF_SHD>(idx, dist, query, train, dist_dim,
+                                              n_dist);
+            break;
+        default: AF_ERROR("Unsupported dist_type", AF_ERR_NOT_CONFIGURED);
+    }
+}
+
+#define INSTANTIATE(T, To)                                             \
+    template void nearest_neighbour<T, To>(                            \
+        Array<uint> & idx, Array<To> & dist, const Array<T>& query,    \
+        const Array<T>& train, const uint dist_dim, const uint n_dist, \
+        const af_match_type dist_type);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(int, int)
+INSTANTIATE(uint, uint)
+INSTANTIATE(intl, intl)
+INSTANTIATE(uintl, uintl)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, uint)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, uint)
+
+INSTANTIATE(uintl, uint)  // For Hamming
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/nearest_neighbour.hpp b/src/backend/opencl/nearest_neighbour.hpp
new file mode 100644
index 0000000000..65a7a3d1c5
--- /dev/null
+++ b/src/backend/opencl/nearest_neighbour.hpp
@@ -0,0 +1,25 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename To>
+void nearest_neighbour(Array<uint>& idx, Array<To>& dist, const Array<T>& query,
+                       const Array<T>& train, const uint dist_dim,
+                       const uint n_dist,
+                       const af_match_type dist_type = AF_SSD);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/orb.cpp b/src/backend/opencl/orb.cpp
index 67aae9ec83..5e1d2b42d0 100644
--- a/src/backend/opencl/orb.cpp
+++ b/src/backend/opencl/orb.cpp
@@ -7,30 +7,25 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <af/features.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
 #include <err_opencl.hpp>
-#include <handle.hpp>
 #include <kernel/orb.hpp>
+#include <math.hpp>
+#include <af/dim4.hpp>
+#include <af/features.h>
 
 using af::dim4;
 using af::features;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T, typename convAccT>
-unsigned orb(Array<float> &x_out, Array<float> &y_out,
-             Array<float> &score_out, Array<float> &ori_out,
-             Array<float> &size_out, Array<uint> &desc_out,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img)
-{
+unsigned orb(Array<float> &x_out, Array<float> &y_out, Array<float> &score_out,
+             Array<float> &ori_out, Array<float> &size_out,
+             Array<uint> &desc_out, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img) {
     unsigned nfeat;
 
     Param x;
@@ -40,36 +35,33 @@ unsigned orb(Array<float> &x_out, Array<float> &y_out,
     Param size;
     Param desc;
 
-    kernel::orb<T,convAccT>(&nfeat, x, y, score, ori, size, desc,
-                            image, fast_thr, max_feat, scl_fctr,
-                            levels, blur_img);
+    kernel::orb<T, convAccT>(&nfeat, x, y, score, ori, size, desc, image,
+                             fast_thr, max_feat, scl_fctr, levels, blur_img);
 
     if (nfeat > 0) {
         const dim4 out_dims(nfeat);
         const dim4 desc_dims(8, nfeat);
 
-        x_out     = createParamArray<float>(x);
-        y_out     = createParamArray<float>(y);
-        score_out = createParamArray<float>(score);
-        ori_out   = createParamArray<float>(ori);
-        size_out  = createParamArray<float>(size);
-        desc_out  = createParamArray<unsigned>(desc);
+        x_out     = createParamArray<float>(x, true);
+        y_out     = createParamArray<float>(y, true);
+        score_out = createParamArray<float>(score, true);
+        ori_out   = createParamArray<float>(ori, true);
+        size_out  = createParamArray<float>(size, true);
+        desc_out  = createParamArray<unsigned>(desc, true);
     }
 
     return nfeat;
 }
 
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned orb<T, convAccT>(                                       \
+        Array<float> & x, Array<float> & y, Array<float> & score,             \
+        Array<float> & ori, Array<float> & size, Array<uint> & desc,          \
+        const Array<T> &image, const float fast_thr, const unsigned max_feat, \
+        const float scl_fctr, const unsigned levels, const bool blur_img);
 
-#define INSTANTIATE(T, convAccT)                                                        \
-    template unsigned orb<T, convAccT>(Array<float> &x, Array<float> &y,                \
-                                       Array<float> &score, Array<float> &ori,          \
-                                       Array<float> &size, Array<uint> &desc,           \
-                                       const Array<T>& image,                           \
-                                       const float fast_thr, const unsigned max_feat,   \
-                                       const float scl_fctr, const unsigned levels,     \
-                                       const bool blur_img);
-
-INSTANTIATE(float , float )
+INSTANTIATE(float, float)
 INSTANTIATE(double, double)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/orb.hpp b/src/backend/opencl/orb.hpp
index 47c477428a..012113886e 100644
--- a/src/backend/opencl/orb.hpp
+++ b/src/backend/opencl/orb.hpp
@@ -7,21 +7,20 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/features.h>
 #include <Array.hpp>
+#include <af/features.h>
 
 using af::features;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T, typename convAccT>
 unsigned orb(Array<float> &x, Array<float> &y, Array<float> &score,
              Array<float> &orientation, Array<float> &size,
-             Array<unsigned> &desc,
-             const Array<T>& image,
-             const float fast_thr, const unsigned max_feat,
-             const float scl_fctr, const unsigned levels,
-             const bool blur_img);
+             Array<unsigned> &desc, const Array<T> &image, const float fast_thr,
+             const unsigned max_feat, const float scl_fctr,
+             const unsigned levels, const bool blur_img);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/platform.cpp b/src/backend/opencl/platform.cpp
index b47db4151f..b6886c97bb 100644
--- a/src/backend/opencl/platform.cpp
+++ b/src/backend/opencl/platform.cpp
@@ -9,463 +9,840 @@
 
 // Include this before af/opencl.h
 // Causes conflict between system cl.hpp and opencl/cl.hpp
-#if defined(WITH_GRAPHICS)
-#include <graphics_common.hpp>
+#include <common/graphics_common.hpp>
+
+#include <GraphicsResourceManager.hpp>
+#include <blas.hpp>
+#include <build_version.hpp>
+#include <clfft.hpp>
+#include <common/ArrayFireTypesIO.hpp>
+#include <common/DefaultMemoryManager.hpp>
+#include <common/Logger.hpp>
+#include <common/Version.hpp>
+#include <common/host_memory.hpp>
+#include <common/util.hpp>
+#include <device_manager.hpp>
+#include <err_opencl.hpp>
+#include <errorcodes.hpp>
+#include <platform.hpp>
+#include <af/version.h>
+
+#ifdef OS_MAC
+#include <OpenGL/CGLCurrent.h>
 #endif
-#include <cl.hpp>
 
-#include <af/version.h>
-#include <af/opencl.h>
-#include <platform.hpp>
-#include <functional>
-#include <algorithm>
+#include <boost/compute/context.hpp>
+#include <boost/compute/utility/program_cache.hpp>
+
 #include <cctype>
-#include <vector>
-#include <string>
-#include <sstream>
-#include <stdexcept>
-#include <cstring>
-#include <algorithm>
+#include <cstdlib>
+#include <functional>
 #include <map>
-#include <errorcodes.hpp>
-#include <err_opencl.hpp>
-
-using std::string;
-using std::vector;
-using std::ostringstream;
-using std::runtime_error;
+#include <memory>
+#include <mutex>
+#include <sstream>
+#include <string>
+#include <thread>
+#include <utility>
+#include <vector>
 
-using cl::Platform;
-using cl::Context;
 using cl::CommandQueue;
+using cl::Context;
 using cl::Device;
+using cl::Platform;
+using std::begin;
+using std::call_once;
+using std::end;
+using std::endl;
+using std::find_if;
+using std::get;
+using std::make_pair;
+using std::make_unique;
+using std::map;
+using std::move;
+using std::once_flag;
+using std::ostringstream;
+using std::pair;
+using std::string;
+using std::to_string;
+using std::unique_ptr;
+using std::vector;
 
-namespace opencl
-{
+using arrayfire::common::getEnvVar;
+using arrayfire::common::ltrim;
+using arrayfire::common::MemoryManagerBase;
+using arrayfire::common::Version;
+using arrayfire::opencl::Allocator;
+using arrayfire::opencl::AllocatorPinned;
 
-#if defined (OS_MAC)
-static const std::string CL_GL_SHARING_EXT = "cl_APPLE_gl_sharing";
-#else
-static const std::string CL_GL_SHARING_EXT = "cl_khr_gl_sharing";
-#endif
+namespace arrayfire {
+namespace opencl {
 
-static const char *get_system(void)
-{
-    return
-#if defined(ARCH_32)
-    "32-bit "
-#elif defined(ARCH_64)
-    "64-bit "
-#endif
+static string get_system() {
+    string arch = (sizeof(void*) == 4) ? "32-bit " : "64-bit ";
 
+    return arch +
 #if defined(OS_LNX)
-    "Linux";
+           "Linux";
 #elif defined(OS_WIN)
-    "Windows";
+           "Windows";
 #elif defined(OS_MAC)
-    "Mac OSX";
+           "Mac OSX";
 #endif
 }
 
-DeviceManager& DeviceManager::getInstance()
-{
-    static DeviceManager my_instance;
-    return my_instance;
-}
-
-DeviceManager::~DeviceManager()
-{
-    //TODO: FIXME:
-    // OpenCL libs on Windows platforms
-    // are crashing the application at program exit
-    // most probably a reference counting issue based
-    // on the investigation done so far. This problem
-    // doesn't seem to happen on Linux or MacOSX.
-    // So, clean up OpenCL resources on non-Windows platforms
-#ifndef OS_WIN
-    for (auto q: mQueues) delete q;
-    for (auto d : mDevices) delete d;
-    for (auto c : mContexts) delete c;
-    for (auto p : mPlatforms) delete p;
-#endif
+int getBackend() { return AF_BACKEND_OPENCL; }
+
+bool verify_present(const string& pname, const string ref) {
+    auto iter =
+        search(begin(pname), end(pname), begin(ref), end(ref),
+               [](const string::value_type& l, const string::value_type& r) {
+                   return tolower(l) == tolower(r);
+               });
+
+    return iter != end(pname);
+}
+
+static string platformMap(string& platStr) {
+    using strmap_t                = map<string, string>;
+    static const strmap_t platMap = {
+        make_pair("NVIDIA CUDA", "NVIDIA"),
+        make_pair("Intel(R) OpenCL", "INTEL"),
+        make_pair("AMD Accelerated Parallel Processing", "AMD"),
+        make_pair("Intel Gen OCL Driver", "BEIGNET"),
+        make_pair("Intel(R) OpenCL HD Graphics", "INTEL"),
+        make_pair("Apple", "APPLE"),
+        make_pair("Portable Computing Language", "POCL"),
+    };
+
+    auto idx = platMap.find(platStr);
+
+    if (idx == platMap.end()) {
+        return platStr;
+    } else {
+        return idx->second;
+    }
 }
 
-void DeviceManager::setContext(int device)
-{
-    mActiveQId = device;
-    mActiveCtxId = device;
+afcl::platform getPlatformEnum(Device dev) {
+    string pname = getPlatformName(dev);
+    if (verify_present(pname, "AMD"))
+        return AFCL_PLATFORM_AMD;
+    else if (verify_present(pname, "NVIDIA"))
+        return AFCL_PLATFORM_NVIDIA;
+    else if (verify_present(pname, "INTEL"))
+        return AFCL_PLATFORM_INTEL;
+    else if (verify_present(pname, "APPLE"))
+        return AFCL_PLATFORM_APPLE;
+    else if (verify_present(pname, "BEIGNET"))
+        return AFCL_PLATFORM_BEIGNET;
+    else if (verify_present(pname, "POCL"))
+        return AFCL_PLATFORM_POCL;
+    return AFCL_PLATFORM_UNKNOWN;
 }
 
-DeviceManager::DeviceManager()
-    : mActiveCtxId(0), mActiveQId(0)
-{
-    try {
-        std::vector<cl::Platform>   platforms;
-        Platform::get(&platforms);
-
-        cl_device_type DEVC_TYPES[] = {
-            CL_DEVICE_TYPE_GPU,
-#ifndef OS_MAC
-            CL_DEVICE_TYPE_ACCELERATOR,
-            CL_DEVICE_TYPE_CPU
-#endif
-        };
+string getDeviceInfo() noexcept {
+    ostringstream info;
+    info << "ArrayFire v" << AF_VERSION << " (OpenCL, " << get_system()
+         << ", build " << AF_REVISION << ")\n";
 
-        for (auto &platform : platforms)
-            mPlatforms.push_back(new Platform(platform));
+    vector<cl::Device*> devices;
+    try {
+        DeviceManager& devMngr = DeviceManager::getInstance();
 
+        common::lock_guard_t lock(devMngr.deviceMutex);
         unsigned nDevices = 0;
-        for (auto devType : DEVC_TYPES) {
-            for (auto &platform : platforms) {
-
-                cl_context_properties cps[3] = {CL_CONTEXT_PLATFORM,
-                    (cl_context_properties)(platform()),
-                    0};
-
-                std::vector<Device> devs;
-                try {
-                    platform.getDevices(devType, &devs);
-                } catch(const cl::Error &err) {
-                    if (err.err() != CL_DEVICE_NOT_FOUND) {
-                        throw;
-                    }
-                }
+        for (auto& device : devMngr.mDevices) {
+            const Platform platform(device->getInfo<CL_DEVICE_PLATFORM>());
 
-                for (auto dev : devs) {
-                    nDevices++;
-                    Context *ctx = new Context(dev, cps);
-                    CommandQueue *cq = new CommandQueue(*ctx, dev);
-                    mDevices.push_back(new Device(dev));
-                    mContexts.push_back(ctx);
-                    mQueues.push_back(cq);
-                    mCtxOffsets.push_back(nDevices);
-                    mIsGLSharingOn.push_back(false);
-                }
-            }
-        }
+            string dstr = device->getInfo<CL_DEVICE_NAME>();
+            bool show_braces =
+                (static_cast<unsigned>(getActiveDeviceId()) == nDevices);
 
-        const char* deviceENV = getenv("AF_OPENCL_DEFAULT_DEVICE");
-        if(deviceENV) {
-            std::stringstream s(deviceENV);
-            int def_device = -1;
-            s >> def_device;
-            if(def_device < 0 || def_device >= (int)nDevices) {
-                printf("WARNING: AF_OPENCL_DEFAULT_DEVICE is out of range\n");
-                printf("Setting default device as 0\n");
-            } else {
-                setContext(def_device);
-            }
-        }
-    } catch (const cl::Error &error) {
-            CL_TO_AF_ERROR(error);
-    }
-    /* loop over devices and replace contexts with
-     * OpenGL shared contexts whereever applicable */
-#if defined(WITH_GRAPHICS)
-    // Define AF_DISABLE_GRAPHICS with any value to disable initialization
-    const char* noGraphicsENV = getenv("AF_DISABLE_GRAPHICS");
-    if(!noGraphicsENV) { // If AF_DISABLE_GRAPHICS is not defined
-        try {
-            int devCount = mDevices.size();
-            fg::Window* wHandle = graphics::ForgeManager::getInstance().getMainWindow();
-            for(int i=0; i<devCount; ++i)
-                markDeviceForInterop(i, wHandle);
-        } catch (...) {
+            string id = (show_braces ? string("[") : "-") +
+                        to_string(nDevices) + (show_braces ? string("]") : "-");
+
+            size_t msize = device->getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
+            info << id << " " << getPlatformName(*device) << ": " << ltrim(dstr)
+                 << ", " << msize / 1048576 << " MB";
+#ifndef NDEBUG
+            info << " -- ";
+            string devVersion = device->getInfo<CL_DEVICE_VERSION>();
+            string driVersion = device->getInfo<CL_DRIVER_VERSION>();
+            info << devVersion;
+            info << " -- Device driver " << driVersion;
+            info
+                << " -- FP64 Support: "
+                << (device->getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE>() >
+                            0
+                        ? "True"
+                        : "False");
+#endif
+            info << endl;
+
+            nDevices++;
         }
+    } catch (const AfError& err) {
+        UNUSED(err);
+        info << "No platforms found.\n";
+        // Don't throw an exception here. Info should pass even if the system
+        // doesn't have the correct drivers installed.
     }
-#endif
+    return info.str();
 }
 
+string getPlatformName(const Device& device) {
+    const Platform platform(device.getInfo<CL_DEVICE_PLATFORM>());
+    string platStr = platform.getInfo<CL_PLATFORM_NAME>();
+    return platformMap(platStr);
+}
+
+typedef pair<unsigned, unsigned> device_id_t;
 
-// http://stackoverflow.com/questions/216823/whats-the-best-way-to-trim-stdstring/217605#217605
-// trim from start
-static inline std::string &ltrim(std::string &s)
-{
-    s.erase(s.begin(), std::find_if(s.begin(), s.end(),
-                                    std::not1(std::ptr_fun<int, int>(std::isspace))));
-    return s;
+pair<unsigned, unsigned>& tlocalActiveDeviceId() {
+    // First element is active context id
+    // Second element is active queue id
+    thread_local device_id_t activeDeviceId(0, 0);
+
+    return activeDeviceId;
 }
 
-static std::string platformMap(std::string &platStr)
-{
-    static bool isFirst = true;
+void setActiveContext(int device) {
+    tlocalActiveDeviceId() = make_pair(device, device);
+}
 
-    typedef std::map<std::string, std::string> strmap_t;
-    static strmap_t platMap;
-    if (isFirst) {
-        platMap["NVIDIA CUDA"] = "NVIDIA  ";
-        platMap["Intel(R) OpenCL"] = "INTEL   ";
-        platMap["AMD Accelerated Parallel Processing"] = "AMD     ";
-        platMap["Intel Gen OCL Driver"] = "BEIGNET ";
-        platMap["Apple"] = "APPLE   ";
-        isFirst = false;
-    }
+int getDeviceCount() noexcept try {
+    DeviceManager& devMngr = DeviceManager::getInstance();
 
-    strmap_t::iterator idx = platMap.find(platStr);
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return static_cast<int>(devMngr.mQueues.size());
+} catch (const AfError& err) {
+    UNUSED(err);
+    // If device manager threw an error then return 0 because no platforms
+    // were found
+    return 0;
+}
 
-    if (idx == platMap.end()) {
-        return platStr;
-    } else {
-        return idx->second;
-    }
+void init() {
+    thread_local const DeviceManager& devMngr = DeviceManager::getInstance();
+    UNUSED(devMngr);
 }
 
-std::string getInfo()
-{
-    ostringstream info;
-    info << "ArrayFire v" << AF_VERSION
-         << " (OpenCL, " << get_system() << ", build " << AF_REVISION << ")" << std::endl;
-
-    unsigned nDevices = 0;
-    for (auto context : DeviceManager::getInstance().mContexts) {
-        vector<Device> devices = context->getInfo<CL_CONTEXT_DEVICES>();
-
-        for(auto &device:devices) {
-            const Platform &platform = device.getInfo<CL_DEVICE_PLATFORM>();
-            string platStr = platform.getInfo<CL_PLATFORM_NAME>();
-            bool show_braces = ((unsigned)getActiveDeviceId() == nDevices);
-            string dstr = device.getInfo<CL_DEVICE_NAME>();
-
-            string id = (show_braces ? string("[") : "-") + std::to_string(nDevices) +
-                        (show_braces ? string("]") : "-");
-            info << id << " " << platformMap(platStr) << ": " << ltrim(dstr) << " ";
-#ifndef NDEBUG
-            info << device.getInfo<CL_DEVICE_VERSION>();
-            info << " Device driver " << device.getInfo<CL_DRIVER_VERSION>();
-            info << " FP64 Support("
-                 << (device.getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE>()>0 ? "True" : "False")
-                 << ")";
-#endif
-            info << std::endl;
+int getActiveDeviceId() {
+    // Second element is the queue id, which is
+    // what we mean by active device id in opencl backend
+    return get<1>(tlocalActiveDeviceId());
+}
 
-            nDevices++;
-        }
+int getDeviceIdFromNativeId(cl_device_id id) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    int nDevices = static_cast<int>(devMngr.mDevices.size());
+    int devId    = 0;
+    for (devId = 0; devId < nDevices; ++devId) {
+        if (id == devMngr.mDevices[devId]->operator()()) { break; }
     }
-    return info.str();
+
+    return devId;
 }
 
-std::string getPlatformName(const cl::Device &device)
-{
-    const Platform &platform = device.getInfo<CL_DEVICE_PLATFORM>();
-    std::string platStr = platform.getInfo<CL_PLATFORM_NAME>();
-    return platformMap(platStr);
+int getActiveDeviceType() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mDeviceTypes[get<1>(devId)];
+}
+
+cl::Platform& getActivePlatform() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return *devMngr.mPlatforms[get<1>(devId)].first;
+}
+
+afcl::platform getActivePlatformVendor() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mPlatforms[get<1>(devId)].second;
+}
+
+bool isDeviceBufferAccessible(int buf_device_id, int execution_id) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return buf_device_id == execution_id ||
+           *devMngr.mContexts[buf_device_id] ==
+               *devMngr.mContexts[execution_id];
+}
+
+const Context& getContext() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return *(devMngr.mContexts[get<0>(devId)]);
+}
+
+cl_command_queue getQueueHandle(int device_id) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return (*(devMngr.mQueues[device_id]))();
+}
+
+CommandQueue& getQueue(int device_id) {
+    device_id_t devId =
+        (device_id == -1) ? tlocalActiveDeviceId()
+                         : make_pair<unsigned, unsigned>(device_id, device_id);
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return *(devMngr.mQueues[get<1>(devId)]);
+}
+
+const Device& getDevice(int id) {
+    device_id_t& devId = tlocalActiveDeviceId();
+
+    if (id == -1) { id = get<1>(devId); }
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return *(devMngr.mDevices[id]);
 }
 
-int getDeviceCount()
-{
-    return DeviceManager::getInstance().mQueues.size();
+const std::string& getActiveDeviceBaseBuildFlags() {
+    device_id_t& devId     = tlocalActiveDeviceId();
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+    return devMngr.mBaseBuildFlags[get<1>(devId)];
 }
 
-int getActiveDeviceId()
-{
-    return DeviceManager::getInstance().mActiveQId;
+vector<Version> getOpenCLCDeviceVersion(const Device& device) {
+    // For OpenCL-HPP >= v2023.12.14 type is cl::Platform instead of
+    // cl_platform_id
+    Platform device_platform;
+    device_platform = device.getInfo<CL_DEVICE_PLATFORM>();
+
+    auto platform_version = device_platform.getInfo<CL_PLATFORM_VERSION>();
+    vector<Version> out;
+
+    /// The ifdef allows us to support BUILDING ArrayFire with older
+    /// versions of OpenCL, whereas the if condition inside the ifdef allows
+    /// us to support older versions of OpenCL at runtime
+#ifdef CL_DEVICE_OPENCL_C_ALL_VERSIONS
+    if (platform_version.substr(7).c_str()[0] >= '3') {
+        vector<cl_name_version> device_versions =
+            device.getInfo<CL_DEVICE_OPENCL_C_ALL_VERSIONS>();
+        sort(begin(device_versions), end(device_versions),
+             [](const auto& lhs, const auto& rhs) {
+                 return lhs.version < rhs.version;
+             });
+        transform(begin(device_versions), end(device_versions),
+                  std::back_inserter(out), [](const cl_name_version& version) {
+                      return Version(CL_VERSION_MAJOR(version.version),
+                                     CL_VERSION_MINOR(version.version),
+                                     CL_VERSION_PATCH(version.version));
+                  });
+    } else {
+#endif
+        auto device_version = device.getInfo<CL_DEVICE_OPENCL_C_VERSION>();
+        int major           = atoi(device_version.substr(9, 1).c_str());
+        int minor           = atoi(device_version.substr(11, 1).c_str());
+        out.emplace_back(major, minor);
+#ifdef CL_DEVICE_OPENCL_C_ALL_VERSIONS
+    }
+#endif
+    return out;
 }
 
-const Context& getContext()
-{
+size_t getDeviceMemorySize(int device) {
     DeviceManager& devMngr = DeviceManager::getInstance();
-    return *(devMngr.mContexts[devMngr.mActiveCtxId]);
+
+    cl::Device dev;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        // Assumes devices are not deallocated or invalidated during execution
+        dev = *devMngr.mDevices[device];
+    }
+    size_t msize = dev.getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>();
+    return msize;
+}
+
+size_t getHostMemorySize() { return common::getHostMemorySize(); }
+
+cl_device_type getDeviceType() {
+    const cl::Device& device = getDevice();
+    cl_device_type type      = device.getInfo<CL_DEVICE_TYPE>();
+    return type;
 }
 
-CommandQueue& getQueue()
-{
+bool OpenCLCPUOffload(bool forceOffloadOSX) {
+    static const bool offloadEnv = getEnvVar("AF_OPENCL_CPU_OFFLOAD") != "0";
+    bool offload                 = false;
+    if (offloadEnv) { offload = getDeviceType() == CL_DEVICE_TYPE_CPU; }
+#if OS_MAC
+    // FORCED OFFLOAD FOR LAPACK FUNCTIONS ON OSX UNIFIED MEMORY DEVICES
+    //
+    // On OSX Unified Memory devices (Intel), always offload LAPACK but not GEMM
+    // irrespective of the AF_OPENCL_CPU_OFFLOAD value
+    // From GEMM, OpenCLCPUOffload(false) is called which will render the
+    // variable inconsequential to the returned result.
+    //
+    // Issue https://github.com/arrayfire/arrayfire/issues/662
+    // Force condition
+    bool osx_offload = getDeviceType() == CL_DEVICE_TYPE_CPU;
+    offload          = osx_offload && (offload || forceOffloadOSX);
+#else
+    UNUSED(forceOffloadOSX);
+#endif
+    return offload;
+}
+
+bool isGLSharingSupported() {
+    device_id_t& devId = tlocalActiveDeviceId();
+
     DeviceManager& devMngr = DeviceManager::getInstance();
-    return *(devMngr.mQueues[devMngr.mActiveQId]);
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    return devMngr.mIsGLSharingOn[get<1>(devId)];
 }
 
-const cl::Device& getDevice()
-{
+bool isDoubleSupported(unsigned device) {
     DeviceManager& devMngr = DeviceManager::getInstance();
-    return *(devMngr.mDevices[devMngr.mActiveQId]);
+
+    cl::Device dev;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        dev = *devMngr.mDevices[device];
+    }
+    return isDoubleSupported(dev);
 }
 
-bool isGLSharingSupported()
-{
+bool isHalfSupported(unsigned device) {
     DeviceManager& devMngr = DeviceManager::getInstance();
-    return devMngr.mIsGLSharingOn[devMngr.mActiveQId];
+
+    cl::Device dev;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        dev = *devMngr.mDevices[device];
+    }
+    return isHalfSupported(dev);
 }
 
-bool isDoubleSupported(int device)
-{
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute) {
+    unsigned nDevices    = 0;
+    auto currActiveDevId = static_cast<unsigned>(getActiveDeviceId());
+    bool devset          = false;
+
     DeviceManager& devMngr = DeviceManager::getInstance();
-    return (devMngr.mDevices[device]->getInfo<CL_DEVICE_PREFERRED_VECTOR_WIDTH_DOUBLE>()>0);
-}
-
-void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute)
-{
-    unsigned nDevices = 0;
-    unsigned currActiveDevId = (unsigned)getActiveDeviceId();
-    bool devset = false;
-
-    for (auto context : DeviceManager::getInstance().mContexts) {
-        vector<Device> devices = context->getInfo<CL_CONTEXT_DEVICES>();
-
-        for (auto &device : devices) {
-            const Platform &platform = device.getInfo<CL_DEVICE_PLATFORM>();
-            string platStr = platform.getInfo<CL_PLATFORM_NAME>();
-
-            if (currActiveDevId == nDevices) {
-                string dev_str;
-                device.getInfo(CL_DEVICE_NAME, &dev_str);
-                string com_str = device.getInfo<CL_DEVICE_VERSION>();
-                com_str = com_str.substr(7, 3);
-
-                // strip out whitespace from the device string:
-                const std::string& whitespace = " \t";
-                const auto strBegin = dev_str.find_first_not_of(whitespace);
-                const auto strEnd = dev_str.find_last_not_of(whitespace);
-                const auto strRange = strEnd - strBegin + 1;
-                dev_str = dev_str.substr(strBegin, strRange);
-
-                // copy to output
-                snprintf(d_name, 64, "%s", dev_str.c_str());
-                snprintf(d_platform, 10, "OpenCL");
-                snprintf(d_toolkit, 64, "%s", platStr.c_str());
-                snprintf(d_compute, 10, "%s", com_str.c_str());
-                devset = true;
+
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+
+        for (auto& context : devMngr.mContexts) {
+            vector<Device> devices = context->getInfo<CL_CONTEXT_DEVICES>();
+
+            for (auto& device : devices) {
+                const Platform platform(device.getInfo<CL_DEVICE_PLATFORM>());
+                string platStr = platform.getInfo<CL_PLATFORM_NAME>();
+
+                if (currActiveDevId == nDevices) {
+                    string dev_str;
+                    device.getInfo(CL_DEVICE_NAME, &dev_str);
+                    string com_str = device.getInfo<CL_DEVICE_VERSION>();
+                    com_str        = com_str.substr(7, 3);
+
+                    // strip out whitespace from the device string:
+                    const string& whitespace = " \t";
+                    const auto strBegin = dev_str.find_first_not_of(whitespace);
+                    const auto strEnd   = dev_str.find_last_not_of(whitespace);
+                    const auto strRange = strEnd - strBegin + 1;
+                    dev_str             = dev_str.substr(strBegin, strRange);
+
+                    // copy to output
+                    snprintf(d_name, 64, "%s", dev_str.c_str());
+                    snprintf(d_platform, 10, "OpenCL");
+                    snprintf(d_toolkit, 64, "%s", platStr.c_str());
+                    snprintf(d_compute, 10, "%s", com_str.c_str());
+                    devset = true;
+                }
+                if (devset) { break; }
+                nDevices++;
             }
-            if(devset) break;
-            nDevices++;
+            if (devset) { break; }
         }
-        if(devset) break;
     }
 
     // Sanitize input
     for (int i = 0; i < 31; i++) {
         if (d_name[i] == ' ') {
-            if (d_name[i + 1] == 0 || d_name[i + 1] == ' ') d_name[i] = 0;
-            else d_name[i] = '_';
+            if (d_name[i + 1] == 0 || d_name[i + 1] == ' ') {
+                d_name[i] = 0;
+            } else {
+                d_name[i] = '_';
+            }
         }
     }
 }
 
-int setDevice(int device)
-{
+int setDevice(int device) {
     DeviceManager& devMngr = DeviceManager::getInstance();
 
-    if (device >= (int)devMngr.mQueues.size() ||
-            device>= (int)DeviceManager::MAX_DEVICES) {
-        //throw runtime_error("@setDevice: invalid device index");
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    if (device >= static_cast<int>(devMngr.mQueues.size()) ||
+        device >= static_cast<int>(DeviceManager::MAX_DEVICES)) {
         return -1;
-    }
-    else {
-        int old = devMngr.mActiveQId;
-        devMngr.setContext(device);
+    } else {
+        int old = getActiveDeviceId();
+        setActiveContext(device);
         return old;
     }
 }
 
-void sync(int device)
-{
-    try {
-        int currDevice = getActiveDeviceId();
-        setDevice(device);
-        getQueue().finish();
-        setDevice(currDevice);
-    } catch (const cl::Error &ex) {
-        CL_TO_AF_ERROR(ex);
+void sync(int device) {
+    int currDevice = getActiveDeviceId();
+    setDevice(device);
+    getQueue().finish();
+    setDevice(currDevice);
+}
+
+void addDeviceContext(cl_device_id dev, cl_context ctx, cl_command_queue que) {
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    int nDevices = 0;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+
+        cl::Device tDevice(dev, true);
+        cl::Context tContext(ctx, true);
+        auto tQueue =
+            (que == NULL ? make_unique<cl::CommandQueue>(tContext, tDevice)
+                         : make_unique<cl::CommandQueue>(que, true));
+        // FIXME: add OpenGL Interop for user provided contexts later
+        devMngr.mIsGLSharingOn.push_back(false);
+        devMngr.mDeviceTypes.push_back(
+            static_cast<int>(tDevice.getInfo<CL_DEVICE_TYPE>()));
+
+        // For OpenCL-HPP >= v2023.12.14 type is cl::Platform instead of
+        // cl_platform_id
+        cl::Platform device_platform;
+        device_platform = tDevice.getInfo<CL_DEVICE_PLATFORM>();
+        devMngr.mPlatforms.push_back(
+            std::make_pair<std::unique_ptr<cl::Platform>, afcl_platform>(
+                make_unique<cl::Platform>(device_platform(), true),
+                getPlatformEnum(tDevice)));
+
+        devMngr.mDevices.emplace_back(make_unique<cl::Device>(move(tDevice)));
+        devMngr.mContexts.emplace_back(
+            make_unique<cl::Context>(move(tContext)));
+        devMngr.mQueues.push_back(move(tQueue));
+        nDevices = static_cast<int>(devMngr.mDevices.size()) - 1;
+
+        auto versions = getOpenCLCDeviceVersion(*(devMngr.mDevices.back()));
+#ifdef AF_WITH_FAST_MATH
+        std::string options =
+            fmt::format(" -cl-std=CL{:Mm} -D dim_t={} -cl-fast-relaxed-math",
+                        versions.back(), dtype_traits<dim_t>::getName());
+#else
+        std::string options =
+            fmt::format(" -cl-std=CL{:Mm} -D dim_t={}", versions.back(),
+                        dtype_traits<dim_t>::getName());
+#endif
+        devMngr.mBaseBuildFlags.push_back(options);
+
+        // Cache the boost program_cache object; cleanup is done on program
+        // exit, not during removeDeviceContext
+        namespace compute = boost::compute;
+        using BPCache     = DeviceManager::BoostProgCache;
+        compute::context c(ctx);
+        BPCache currCache = compute::program_cache::get_global_cache(c);
+        devMngr.mBoostProgCacheVector.emplace_back(new BPCache(currCache));
     }
+
+    // Last/newly added device needs memory management
+    memoryManager().addMemoryManagement(nDevices);
 }
 
-bool checkExtnAvailability(const Device &pDevice, std::string pName)
-{
-    bool ret_val = false;
-    // find the extension required
-    std::string exts = pDevice.getInfo<CL_DEVICE_EXTENSIONS>();
-    std::stringstream ss(exts);
-    std::string item;
-    while (std::getline(ss,item,' ')) {
-        if (item==pName) {
-            ret_val = true;
-            break;
+void setDeviceContext(cl_device_id dev, cl_context ctx) {
+    // FIXME: add OpenGL Interop for user provided contexts later
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    common::lock_guard_t lock(devMngr.deviceMutex);
+
+    const int dCount = static_cast<int>(devMngr.mDevices.size());
+    for (int i = 0; i < dCount; ++i) {
+        if (devMngr.mDevices[i]->operator()() == dev &&
+            devMngr.mContexts[i]->operator()() == ctx) {
+            setActiveContext(i);
+            return;
         }
     }
-    return ret_val;
+    AF_ERROR("No matching device found", AF_ERR_ARG);
 }
 
-#if defined(WITH_GRAPHICS)
-void DeviceManager::markDeviceForInterop(const int device, const fg::Window* wHandle)
-{
-    try {
-        if (device >= (int)mQueues.size() ||
-                device>= (int)DeviceManager::MAX_DEVICES) {
-            throw cl::Error(CL_INVALID_DEVICE, "Invalid device passed for CL-GL Interop");
-        }
-        else {
-            mQueues[device]->finish();
-
-            // check if the device has CL_GL sharing extension enabled
-            bool temp = checkExtnAvailability(*mDevices[device], CL_GL_SHARING_EXT);
-            if (!temp) {
-                printf("Device[%d] has no support for OpenGL Interoperation\n",device);
-                /* return silently if given device has not OpenGL sharing extension
-                 * enabled so that regular queue is used for it */
-                return;
+void removeDeviceContext(cl_device_id dev, cl_context ctx) {
+    if (getDevice()() == dev && getContext()() == ctx) {
+        AF_ERROR("Cannot pop the device currently in use", AF_ERR_ARG);
+    }
+
+    DeviceManager& devMngr = DeviceManager::getInstance();
+
+    int deleteIdx = -1;
+    {
+        common::lock_guard_t lock(devMngr.deviceMutex);
+
+        const int dCount = static_cast<int>(devMngr.mDevices.size());
+        for (int i = static_cast<int>(devMngr.mUserDeviceOffset); i < dCount;
+             ++i) {
+            if (devMngr.mDevices[i]->operator()() == dev &&
+                devMngr.mContexts[i]->operator()() == ctx) {
+                deleteIdx = i;
+                break;
             }
+        }
+    }
 
-            // call forge to get OpenGL sharing context and details
-            cl::Platform plat = mDevices[device]->getInfo<CL_DEVICE_PLATFORM>();
-#ifdef OS_MAC
-            CGLContextObj cgl_current_ctx = CGLGetCurrentContext();
-            CGLShareGroupObj cgl_share_group = CGLGetShareGroup(cgl_current_ctx);
-            printf("current opengl context is -------- %p \n", cgl_current_ctx);
-
-            cl_context_properties cps[] = {
-                CL_CONTEXT_PROPERTY_USE_CGL_SHAREGROUP_APPLE, (cl_context_properties)cgl_share_group,
-                0
-            };
-#else
-            cl_context_properties cps[] = {
-                CL_GL_CONTEXT_KHR, (cl_context_properties)wHandle->context(),
-#if defined(_WIN32) || defined(_MSC_VER)
-                CL_WGL_HDC_KHR, (cl_context_properties)wHandle->display(),
+    if (deleteIdx == -1) {
+        AF_ERROR("No matching device found", AF_ERR_ARG);
+    } else if (deleteIdx < static_cast<int>(devMngr.mUserDeviceOffset)) {
+        AF_ERROR("Cannot pop ArrayFire internal devices", AF_ERR_ARG);
+    } else {
+        // remove memory management for device added by user outside of the lock
+        memoryManager().removeMemoryManagement(deleteIdx);
+
+        common::lock_guard_t lock(devMngr.deviceMutex);
+        // FIXME: this case can potentially cause issues due to the
+        // modification of the device pool stl containers.
+
+        // If the currently active device is enumerated at a position
+        // before the device that has been requested to be removed, we
+        // can simply pop the entries from the pool, since doing so has
+        // no side effects.
+        devMngr.mDevices.erase(devMngr.mDevices.begin() + deleteIdx);
+        devMngr.mContexts.erase(devMngr.mContexts.begin() + deleteIdx);
+        devMngr.mQueues.erase(devMngr.mQueues.begin() + deleteIdx);
+        devMngr.mPlatforms.erase(devMngr.mPlatforms.begin() + deleteIdx);
+
+        // FIXME: add OpenGL Interop for user provided contexts later
+        devMngr.mIsGLSharingOn.erase(devMngr.mIsGLSharingOn.begin() +
+                                     deleteIdx);
+
+        // Otherwise, update (decrement) the thread-local active device ids
+        device_id_t& devId = tlocalActiveDeviceId();
+
+        if (deleteIdx < static_cast<int>(devId.first)) {
+            device_id_t newVals = make_pair(devId.first - 1, devId.second - 1);
+            devId               = newVals;
+        }
+    }
+}
+
+bool synchronize_calls() {
+    static const bool sync = getEnvVar("AF_SYNCHRONOUS_CALLS") == "1";
+    return sync;
+}
+
+int& getMaxJitSize() {
+#if defined(OS_MAC)
+    constexpr int MAX_JIT_LEN = 50;
 #else
-                CL_GLX_DISPLAY_KHR, (cl_context_properties)wHandle->display(),
-#endif
-                CL_CONTEXT_PLATFORM, (cl_context_properties)plat(),
-                0
-            };
+    constexpr int MAX_JIT_LEN = 100;
 #endif
-            Context * ctx = new Context(*mDevices[device], cps);
-            CommandQueue * cq = new CommandQueue(*ctx, *mDevices[device]);
+    thread_local int length = 0;
+    if (length <= 0) {
+        string env_var = getEnvVar("AF_OPENCL_MAX_JIT_LEN");
+        if (!env_var.empty()) {
+            int input_len = stoi(env_var);
+            length        = input_len > 0 ? input_len : MAX_JIT_LEN;
+        } else {
+            length = MAX_JIT_LEN;
+        }
+    }
+    return length;
+}
+
+bool& evalFlag() {
+    thread_local bool flag = true;
+    return flag;
+}
 
-            delete mContexts[device];
-            delete mQueues[device];
+MemoryManagerBase& memoryManager() {
+    static once_flag flag;
 
-            mContexts[device] = ctx;
-            mQueues[device] = cq;
-        }
-        mIsGLSharingOn[device] = true;
-    } catch (const cl::Error &ex) {
-        /* If replacing the original context with GL shared context
-         * failes, don't throw an error and instead fall back to
-         * original context and use copy via host to support graphics
-         * on that particular OpenCL device. So mark it as no GL sharing */
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.memManager = make_unique<common::DefaultMemoryManager>(
+            getDeviceCount(), common::MAX_BUFFERS,
+            AF_MEM_DEBUG || AF_OPENCL_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<Allocator> deviceMemoryManager;
+        deviceMemoryManager = make_unique<Allocator>();
+        inst.memManager->setAllocator(move(deviceMemoryManager));
+        inst.memManager->initialize();
+    });
+
+    return *(inst.memManager.get());
+}
+
+MemoryManagerBase& pinnedMemoryManager() {
+    static once_flag flag;
+
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(flag, [&]() {
+        // By default, create an instance of the default memory manager
+        inst.pinnedMemManager = make_unique<common::DefaultMemoryManager>(
+            getDeviceCount(), common::MAX_BUFFERS,
+            AF_MEM_DEBUG || AF_OPENCL_MEM_DEBUG);
+        // Set the memory manager's device memory manager
+        unique_ptr<AllocatorPinned> deviceMemoryManager;
+        deviceMemoryManager = make_unique<AllocatorPinned>();
+        inst.pinnedMemManager->setAllocator(move(deviceMemoryManager));
+        inst.pinnedMemManager->initialize();
+    });
+
+    return *(inst.pinnedMemManager.get());
+}
+
+void setMemoryManager(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManager(move(mgr));
+}
+
+void resetMemoryManager() {
+    return DeviceManager::getInstance().resetMemoryManager();
+}
+
+void setMemoryManagerPinned(unique_ptr<MemoryManagerBase> mgr) {
+    return DeviceManager::getInstance().setMemoryManagerPinned(move(mgr));
+}
+
+void resetMemoryManagerPinned() {
+    return DeviceManager::getInstance().resetMemoryManagerPinned();
+}
+
+arrayfire::common::ForgeManager& forgeManager() {
+    return *(DeviceManager::getInstance().fgMngr);
+}
+
+GraphicsResourceManager& interopManager() {
+    static once_flag initFlags[DeviceManager::MAX_DEVICES];
+
+    int id = getActiveDeviceId();
+
+    DeviceManager& inst = DeviceManager::getInstance();
+
+    call_once(initFlags[id], [&] {
+        inst.gfxManagers[id] = make_unique<GraphicsResourceManager>();
+    });
+
+    return *(inst.gfxManagers[id].get());
+}
+
+PlanCache& fftManager() {
+    thread_local PlanCache clfftManagers[DeviceManager::MAX_DEVICES];
+
+    return clfftManagers[getActiveDeviceId()];
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+using namespace arrayfire::opencl;
+
+af_err afcl_get_device_type(afcl_device_type* res) {
+    try {
+        *res = static_cast<afcl_device_type>(getActiveDeviceType());
     }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_platform(afcl_platform* res) {
+    try {
+        *res = static_cast<afcl_platform>(getActivePlatformVendor());
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
+
+af_err afcl_get_context(cl_context* ctx, const bool retain) {
+    try {
+        *ctx = getContext()();
+        if (retain) { clRetainContext(*ctx); }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
 }
-#endif
 
+af_err afcl_get_queue(cl_command_queue* queue, const bool retain) {
+    try {
+        *queue = getQueue()();
+        if (retain) { clRetainCommandQueue(*queue); }
+    }
+    CATCHALL;
+    return AF_SUCCESS;
 }
 
-using namespace opencl;
+af_err afcl_get_device_id(cl_device_id* id) {
+    try {
+        *id = getDevice()();
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-af_err afcl_get_context(cl_context *ctx, const bool retain)
-{
-    *ctx = getContext()();
-    if (retain) clRetainContext(*ctx);
+af_err afcl_set_device_id(cl_device_id id) {
+    try {
+        setDevice(getDeviceIdFromNativeId(id));
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
+af_err afcl_add_device_context(cl_device_id dev, cl_context ctx,
+                               cl_command_queue que) {
+    try {
+        addDeviceContext(dev, ctx, que);
+    }
+    CATCHALL;
+    return AF_SUCCESS;
+}
 
-af_err afcl_get_queue(cl_command_queue *queue, const bool retain)
-{
-    *queue = getQueue()();
-    if (retain) clRetainCommandQueue(*queue);
+af_err afcl_set_device_context(cl_device_id dev, cl_context ctx) {
+    try {
+        setDeviceContext(dev, ctx);
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
 
-af_err afcl_get_device_id(cl_device_id *id)
-{
-    *id = getDevice()();
+af_err afcl_delete_device_context(cl_device_id dev, cl_context ctx) {
+    try {
+        removeDeviceContext(dev, ctx);
+    }
+    CATCHALL;
     return AF_SUCCESS;
 }
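
Note on the active-device bookkeeping in platform.cpp above: the patch keeps the
active {context id, queue id} pair in thread-local storage and lets `-1` stand
for "use the active device". The following is only a minimal, standalone sketch
of that pattern. The names `activeDeviceId`, `resolveQueueId` and
`setActiveDevice` are hypothetical stand-ins, not ArrayFire functions, and the
lower-bound check in `setActiveDevice` is an assumption added for the sketch.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-in for the backend's id pair: {context id, queue id}.
using device_id_t = std::pair<unsigned, unsigned>;

// Each thread owns its own active-device pair, so switching devices on one
// thread does not change what another thread sees.
inline device_id_t& activeDeviceId() {
    thread_local device_id_t id(0, 0);
    return id;
}

// Resolve the "-1 means active device" sentinel to a concrete queue index.
// The sentinel test must be a comparison (==); an assignment in its place
// would overwrite the argument and always select the active device.
inline unsigned resolveQueueId(int device_id) {
    if (device_id == -1) { return activeDeviceId().second; }
    assert(device_id >= 0);
    return static_cast<unsigned>(device_id);
}

// Switch the calling thread's active device. Returns the previous id, or -1
// if the request is out of range (the < 0 check is an addition in this sketch).
inline int setActiveDevice(int device, int device_count) {
    if (device < 0 || device >= device_count) { return -1; }
    const int old    = static_cast<int>(activeDeviceId().second);
    const auto id    = static_cast<unsigned>(device);
    activeDeviceId() = std::make_pair(id, id);
    return old;
}
```

Because the pair is `thread_local`, a worker thread that never calls
`setActiveDevice` keeps seeing device 0 regardless of what other threads select.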
diff --git a/src/backend/opencl/platform.hpp b/src/backend/opencl/platform.hpp
index 76768718ec..30124d9aa2 100644
--- a/src/backend/opencl/platform.hpp
+++ b/src/backend/opencl/platform.hpp
@@ -8,95 +8,191 @@
  ********************************************************/
 
 #pragma once
-#if defined(WITH_GRAPHICS)
-#include <fg/window.h>
-#endif
-#include <cl.hpp>
-#include <vector>
+
+#include <cl2hpp.hpp>
+#include <af/opencl.h>
+
+#include <memory>
 #include <string>
 
-namespace opencl
-{
+// Forward declarations
+namespace boost {
+template<typename T>
+class shared_ptr;
 
-class DeviceManager
-{
-    friend std::string getInfo();
+namespace compute {
+class program_cache;
+}
+}  // namespace boost
 
-    friend int getDeviceCount();
+namespace spdlog {
+class logger;
+}
 
-    friend int getActiveDeviceId();
+namespace arrayfire {
+namespace common {
 
-    friend const cl::Context& getContext();
+class ForgeManager;
 
-    friend cl::CommandQueue& getQueue();
+class MemoryManagerBase;
 
-    friend const cl::Device& getDevice();
+class Version;
+}  // namespace common
+}  // namespace arrayfire
 
-    friend bool isGLSharingSupported();
+using arrayfire::common::MemoryManagerBase;
 
-    friend bool isDoubleSupported(int device);
+namespace arrayfire {
+namespace opencl {
 
-    friend void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+// Forward declarations
+class GraphicsResourceManager;
+class PlanCache;  // clfft
 
-    friend int setDevice(int device);
+bool verify_present(const std::string& pname, const std::string ref);
 
-    public:
-        static const unsigned MAX_DEVICES = 16;
+int getBackend();
 
-        static DeviceManager& getInstance();
+std::string getDeviceInfo() noexcept;
 
-        ~DeviceManager();
+int getDeviceCount() noexcept;
 
-    protected:
-        void setContext(int device);
+void init();
 
-        DeviceManager();
+int getActiveDeviceId();
 
-        // Following two declarations are required to
-        // avoid copying accidental copy/assignment
-        // of instance returned by getInstance to other
-        // variables
-        DeviceManager(DeviceManager const&);
-        void operator=(DeviceManager const&);
-#if defined(WITH_GRAPHICS)
-        void markDeviceForInterop(const int device, const fg::Window* wHandle);
-#endif
+int& getMaxJitSize();
 
-    private:
-        // Attributes
-        std::vector<cl::CommandQueue*>  mQueues;
-        std::vector<cl::Device*>       mDevices;
-        std::vector<cl::Context*>     mContexts;
-        std::vector<cl::Platform*>   mPlatforms;
-        std::vector<unsigned>       mCtxOffsets;
-        std::vector<bool>        mIsGLSharingOn;
+const cl::Context& getContext();
 
-        unsigned mActiveCtxId;
-        unsigned mActiveQId;
-};
+cl::CommandQueue& getQueue(int device_id = -1);
 
-std::string getInfo();
+/// Return a cl_command_queue handle to the queue for the device.
+///
+/// \param[in] device_id The id of the device whose queue is returned
+/// \returns The cl_command_queue handle to the queue
+cl_command_queue getQueueHandle(int device_id);
 
-int getDeviceCount();
+const cl::Device& getDevice(int id = -1);
 
-int getActiveDeviceId();
+const std::string& getActiveDeviceBaseBuildFlags();
 
-const cl::Context& getContext();
+/// Returns the set of all OpenCL C Versions the device supports. The values
+/// are sorted from oldest to latest.
+std::vector<common::Version> getOpenCLCDeviceVersion(const cl::Device& device);
 
-cl::CommandQueue& getQueue();
+size_t getDeviceMemorySize(int device);
 
-const cl::Device& getDevice();
+size_t getHostMemorySize();
+
+inline unsigned getMemoryBusWidth(const cl::Device& device) {
+    return device.getInfo<CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE>();
+}
+
+// OCL only reports on L1 cache, so we have to estimate the L2 Cache
+// size. From studying many GPU cards, it was observed that there is a
+// direct correlation between Cache line and L2 Cache size:
+//      - 16KB L2 Cache for each bit in Cache line.
+//        Example: RTX3070 (4096KB of L2 Cache, 256Bit of Cache
+//        line)
+//                   --> 256*16KB = 4096KB
+//      - This is also valid for all AMD GPUs
+//      - Exceptions
+//          * GTX10XX series have 8KB per bit of cache line
+//          * iGPU (64bit cacheline) have 5KB per bit of cache line
+inline size_t getL2CacheSize(const cl::Device& device) {
+    const unsigned cacheLine{getMemoryBusWidth(device)};
+    return cacheLine * 1024ULL *
+           (cacheLine == 64 ? 5
+            : device.getInfo<CL_DEVICE_NAME>().find("GTX 10") ==
+                    std::string::npos
+                ? 16
+                : 8);
+}
+
+inline unsigned getComputeUnits(const cl::Device& device) {
+    return device.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>();
+}
+
+// Maximum number of threads the device can actually run in parallel,
+// without scheduling
+inline unsigned getMaxParallelThreads(const cl::Device& device) {
+    return getComputeUnits(device) * 2048;
+}
+
+cl_device_type getDeviceType();
+
+bool OpenCLCPUOffload(bool forceOffloadOSX = true);
 
 bool isGLSharingSupported();
 
-bool isDoubleSupported(int device);
+bool isDoubleSupported(unsigned device);
+inline bool isDoubleSupported(const cl::Device& device) {
+    // 64bit fp is an optional extension
+    return (device.getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp64") !=
+            std::string::npos);
+}
+
+// Returns true if 16-bit precision floats are supported by the device
+bool isHalfSupported(unsigned device);
+inline bool isHalfSupported(const cl::Device& device) {
+    // 16bit fp is an optional extension
+    return (device.getInfo<CL_DEVICE_EXTENSIONS>().find("cl_khr_fp16") !=
+            std::string::npos);
+}
 
-void devprop(char* d_name, char* d_platform, char *d_toolkit, char* d_compute);
+void devprop(char* d_name, char* d_platform, char* d_toolkit, char* d_compute);
 
-std::string getPlatformName(const cl::Device &device);
+std::string getPlatformName(const cl::Device& device);
 
 int setDevice(int device);
 
+void addDeviceContext(cl_device_id dev, cl_context ctx, cl_command_queue que);
+
+void setDeviceContext(cl_device_id dev, cl_context ctx);
+
+void removeDeviceContext(cl_device_id dev, cl_context ctx);
+
 void sync(int device);
 
-}
+bool synchronize_calls();
+
+int getActiveDeviceType();
+
+cl::Platform& getActivePlatform();
+
+afcl::platform getActivePlatformVendor();
+
+bool& evalFlag();
+
+MemoryManagerBase& memoryManager();
+
+void setMemoryManager(std::unique_ptr<MemoryManagerBase> mgr);
+
+void resetMemoryManager();
+
+MemoryManagerBase& pinnedMemoryManager();
+
+void setMemoryManagerPinned(std::unique_ptr<MemoryManagerBase> mgr);
+
+void resetMemoryManagerPinned();
+
+arrayfire::common::ForgeManager& forgeManager();
+
+GraphicsResourceManager& interopManager();
+
+PlanCache& fftManager();
+
+afcl::platform getPlatformEnum(cl::Device dev);
+
+void setActiveContext(int device);
+
+/// Returns true if the buffer on device buf_device_id can be accessed by
+/// kernels on device execution_id
+///
+/// \param[in] buf_device_id The device id of the buffer
+/// \param[in] execution_id The device where the buffer will be accessed.
+bool isDeviceBufferAccessible(int buf_device_id, int execution_id);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/plot.cpp b/src/backend/opencl/plot.cpp
index 5a5712b86a..5b7dfa69cb 100644
--- a/src/backend/opencl/plot.cpp
+++ b/src/backend/opencl/plot.cpp
@@ -7,51 +7,59 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
-#include <interopManager.hpp>
 #include <Array.hpp>
-#include <plot.hpp>
-#include <err_opencl.hpp>
+#include <GraphicsResourceManager.hpp>
 #include <debug_opencl.hpp>
-#include <join.hpp>
-#include <reduce.hpp>
-#include <reorder.hpp>
+#include <err_opencl.hpp>
+#include <plot.hpp>
 
 using af::dim4;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void copy_plot(const Array<T> &P, fg::Plot* plot)
-{
+void copy_plot(const Array<T> &P, fg_plot plot) {
+    ForgeModule &_ = forgePlugin();
     if (isGLSharingSupported()) {
         CheckGL("Begin OpenCL resource copy");
         const cl::Buffer *d_P = P.get();
-        size_t bytes = plot->size();
-
-        InteropManager& intrpMngr = InteropManager::getInstance();
+        unsigned bytes        = 0;
+        FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
 
-        cl::Buffer *clPBOResource = intrpMngr.getBufferResource(plot);
+        auto res = interopManager().getPlotResources(plot);
 
         std::vector<cl::Memory> shared_objects;
-        shared_objects.push_back(*clPBOResource);
+        shared_objects.push_back(*(res[0].get()));
 
         glFinish();
-        getQueue().enqueueAcquireGLObjects(&shared_objects);
-        getQueue().enqueueCopyBuffer(*d_P, *clPBOResource, 0, 0, bytes, NULL, NULL);
-        getQueue().finish();
-        getQueue().enqueueReleaseGLObjects(&shared_objects);
+
+        // Use of events:
+        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+        cl::Event event;
+
+        getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+        getQueue().enqueueCopyBuffer(*d_P, *(res[0].get()), 0, 0, bytes, NULL,
+                                     &event);
+        getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+        event.wait();
 
         CL_DEBUG_FINISH(getQueue());
         CheckGL("End OpenCL resource copy");
     } else {
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_plot_vertex_buffer(&buffer, plot));
+        FG_CHECK(_.fg_get_plot_vertex_buffer_size(&bytes, plot));
+
         CheckGL("Begin OpenCL fallback-resource copy");
-        glBindBuffer(GL_ARRAY_BUFFER, plot->vbo());
-        GLubyte* ptr = (GLubyte*)glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY);
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
         if (ptr) {
-            getQueue().enqueueReadBuffer(*P.get(), CL_TRUE, 0, plot->size(), ptr);
+            getQueue().enqueueReadBuffer(*P.get(), CL_TRUE, 0, bytes, ptr);
             glUnmapBuffer(GL_ARRAY_BUFFER);
         }
         glBindBuffer(GL_ARRAY_BUFFER, 0);
@@ -59,15 +67,16 @@ void copy_plot(const Array<T> &P, fg::Plot* plot)
     }
 }
 
-#define INSTANTIATE(T)  \
-    template void copy_plot<T>(const Array<T> &P, fg::Plot* plot);
+#define INSTANTIATE(T) template void copy_plot<T>(const Array<T> &, fg_plot);
 
 INSTANTIATE(float)
 INSTANTIATE(double)
 INSTANTIATE(int)
 INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
 INSTANTIATE(uchar)
 
-}
-
-#endif  // WITH_GRAPHICS
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/plot.hpp b/src/backend/opencl/plot.hpp
index 582d02e046..4a6849e01a 100644
--- a/src/backend/opencl/plot.hpp
+++ b/src/backend/opencl/plot.hpp
@@ -7,17 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#if defined (WITH_GRAPHICS)
-
 #include <Array.hpp>
-#include <graphics_common.hpp>
-
-namespace opencl
-{
-    template<typename T>
-    void copy_plot(const Array<T> &P, fg::Plot* plot);
-}
+#include <common/graphics_common.hpp>
 
-#endif
+namespace arrayfire {
+namespace opencl {
 
+template<typename T>
+void copy_plot(const Array<T> &P, fg_plot plot);
 
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/print.hpp b/src/backend/opencl/print.hpp
index 8525c5f67f..40919135a7 100644
--- a/src/backend/opencl/print.hpp
+++ b/src/backend/opencl/print.hpp
@@ -11,19 +11,16 @@
 #include <backend.hpp>
 #include <ostream>
 
-namespace opencl
-{
-    static std::ostream&
-    operator<<(std::ostream &out, const cfloat& var)
-    {
-        out << "(" << var.s[0] << "," << var.s[1] << ")";
-        return out;
-    }
+namespace arrayfire {
+namespace opencl {
+static std::ostream& operator<<(std::ostream& out, const cfloat& var) {
+    out << "(" << var.s[0] << "," << var.s[1] << ")";
+    return out;
+}
 
-    static std::ostream&
-    operator<<(std::ostream &out, const cdouble& var)
-    {
-        out << "(" << var.s[0] << "," << var.s[1] << ")";
-        return out;
-    }
+static std::ostream& operator<<(std::ostream& out, const cdouble& var) {
+    out << "(" << var.s[0] << "," << var.s[1] << ")";
+    return out;
 }
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/product.cpp b/src/backend/opencl/product.cpp
index b3a98256e0..a949f87345 100644
--- a/src/backend/opencl/product.cpp
+++ b/src/backend/opencl/product.cpp
@@ -7,17 +7,27 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //sum
-    INSTANTIATE(af_mul_t, float  , float  )
-    INSTANTIATE(af_mul_t, double , double )
-    INSTANTIATE(af_mul_t, cfloat , cfloat )
-    INSTANTIATE(af_mul_t, cdouble, cdouble)
-    INSTANTIATE(af_mul_t, int    , int    )
-    INSTANTIATE(af_mul_t, uint   , uint   )
-    INSTANTIATE(af_mul_t, char   , int    )
-    INSTANTIATE(af_mul_t, uchar  , uint   )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// product
+INSTANTIATE(af_mul_t, float, float)
+INSTANTIATE(af_mul_t, double, double)
+INSTANTIATE(af_mul_t, cfloat, cfloat)
+INSTANTIATE(af_mul_t, cdouble, cdouble)
+INSTANTIATE(af_mul_t, int, int)
+INSTANTIATE(af_mul_t, uint, uint)
+INSTANTIATE(af_mul_t, intl, intl)
+INSTANTIATE(af_mul_t, uintl, uintl)
+INSTANTIATE(af_mul_t, char, int)
+INSTANTIATE(af_mul_t, schar, int)
+INSTANTIATE(af_mul_t, uchar, uint)
+INSTANTIATE(af_mul_t, short, int)
+INSTANTIATE(af_mul_t, ushort, uint)
+INSTANTIATE(af_mul_t, half, float)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/program.cpp b/src/backend/opencl/program.cpp
deleted file mode 100644
index 36a8972f80..0000000000
--- a/src/backend/opencl/program.cpp
+++ /dev/null
@@ -1,62 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <program.hpp>
-#include <traits.hpp>
-#include <kernel_headers/KParam.hpp>
-#include <debug_opencl.hpp>
-#include <iostream>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-namespace opencl
-{
-    const static std::string USE_DBL_SRC_STR("\n\
-                                           #ifdef USE_DOUBLE\n\
-                                           #pragma OPENCL EXTENSION cl_khr_fp64 : enable\n\
-                                           #endif\n");
-    void buildProgram(cl::Program &prog,
-                      const char *ker_str, const int ker_len, std::string options)
-    {
-        buildProgram(prog, 1, &ker_str, &ker_len, options);
-    }
-
-    void buildProgram(cl::Program &prog, const int num_files,
-                      const char **ker_strs, const int *ker_lens, std::string options)
-    {
-        try {
-            Program::Sources setSrc;
-            setSrc.emplace_back(USE_DBL_SRC_STR.c_str(), USE_DBL_SRC_STR.length());
-            setSrc.emplace_back(KParam_hpp, KParam_hpp_len);
-
-            for (int i = 0; i < num_files; i++) {
-                setSrc.emplace_back(ker_strs[i], ker_lens[i]);
-            }
-
-            static std::string defaults =
-                std::string(" -D dim_t=") +
-                std::string(dtype_traits<dim_t>::getName());
-
-            prog = cl::Program(getContext(), setSrc);
-            std::vector<cl::Device> targetDevices;
-            targetDevices.push_back(getDevice());
-            prog.build(targetDevices, (defaults + options).c_str());
-
-        } catch (...) {
-            SHOW_BUILD_INFO(prog);
-            throw;
-        }
-    }
-}
diff --git a/src/backend/opencl/program.hpp b/src/backend/opencl/program.hpp
deleted file mode 100644
index 1b76a75ce8..0000000000
--- a/src/backend/opencl/program.hpp
+++ /dev/null
@@ -1,56 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#pragma once
-#include <platform.hpp>
-#include <string>
-#include <mutex>
-
-using cl::Buffer;
-using cl::Program;
-using cl::Kernel;
-using cl::make_kernel;
-using cl::EnqueueArgs;
-using cl::NDRange;
-using std::string;
-
-#define SHOW_DEBUG_BUILD_INFO(PROG) do {                                \
-        cl_uint numDevices = PROG.getInfo<CL_PROGRAM_NUM_DEVICES>();    \
-        for (unsigned int i = 0; i<numDevices; ++i) {                   \
-            std::cout << PROG.getBuildInfo<CL_PROGRAM_BUILD_LOG>(       \
-                PROG.getInfo<CL_PROGRAM_DEVICES>()[i]) << std::endl;    \
-                                                                        \
-            std::cout << PROG.getBuildInfo<CL_PROGRAM_BUILD_OPTIONS>(   \
-                PROG.getInfo<CL_PROGRAM_DEVICES>()[i]) << std::endl;    \
-        }                                                               \
-    } while(0)                                                          \
-
-
-#if defined(NDEBUG)
-
-#define SHOW_BUILD_INFO(PROG) do {                                  \
-        const char *info = getenv("AF_OPENCL_SHOW_BUILD_INFO");     \
-        if (info != nullptr && std::strncmp(info,"0", 1) != 0) {    \
-            SHOW_DEBUG_BUILD_INFO(prog);                            \
-        }                                                           \
-    } while(0)
-
-#else
-#define SHOW_BUILD_INFO(PROG) SHOW_DEBUG_BUILD_INFO(PROG)
-#endif
-
-namespace opencl
-{
-    void buildProgram(cl::Program &prog,
-                      const char *ker_str, const int ker_len, std::string options);
-
-    void buildProgram(cl::Program &prog,
-                      const int num_files,
-                      const char **ker_str, const int *ker_len, std::string options);
-}
diff --git a/src/backend/opencl/qr.cpp b/src/backend/opencl/qr.cpp
index 9e30b43435..bb8d5c1205 100644
--- a/src/backend/opencl/qr.cpp
+++ b/src/backend/opencl/qr.cpp
@@ -8,143 +8,139 @@
  ********************************************************/
 
 #include <qr.hpp>
-#include <err_common.hpp>
+
+#include <err_opencl.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+
 #include <blas.hpp>
 #include <copy.hpp>
+#include <cpu/cpu_qr.hpp>
 #include <identity.hpp>
-#include <err_opencl.hpp>
+#include <kernel/triangle.hpp>
 #include <magma/magma.h>
-#include <magma/magma_helper.h>
 #include <magma/magma_data.h>
-#include <kernel/triangle.hpp>
-
-#if defined(WITH_OPENCL_LINEAR_ALGEBRA)
+#include <magma/magma_helper.h>
+#include <platform.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &orig)
-{
-    try {
-        initBlas();
-        dim4 iDims = orig.dims();
-        int M = iDims[0];
-        int N = iDims[1];
-
-        dim4 pDims(M, std::max(M, N));
-        Array<T> in = padArray<T, T>(orig, pDims, scalar<T>(0));  //copyArray<T>(orig);
-        in.resetDims(iDims);
-
-        int MN = std::min(M, N);
-        int NB = magma_get_geqrf_nb<T>(M);
-
-        int NUM = (2*MN + ((N+31)/32)*32)*NB;
-        Array<T> tmp = createEmptyArray<T>(dim4(NUM));
-
-        std::vector<T> h_tau(MN);
-
-        int info = 0;
-        cl::Buffer *in_buf = in.get();
-        cl::Buffer *dT = tmp.get();
-
-        magma_geqrf3_gpu<T>(M, N,
-                           (*in_buf)(), in.getOffset(), in.strides()[1],
-                           &h_tau[0], (*dT)(), tmp.getOffset(), getQueue()(), &info);
-
-        r = createEmptyArray<T>(in.dims());
-        kernel::triangle<T, true, false>(r, in);
-
-        cl::Buffer *r_buf = r.get();
-        magmablas_swapdblk<T>(MN - 1, NB,
-                              ( *r_buf)(), r.getOffset(),
-                              r.strides()[1], 1,
-                              (*dT)(), tmp.getOffset() + MN * NB,
-                              NB, 0, getQueue()());
-
-        q = in; // No need to copy
-        q.resetDims(dim4(M, M));
-        cl::Buffer *q_buf = q.get();
-
-        magma_ungqr_gpu<T>(q.dims()[0], q.dims()[1], std::min(M, N),
-                           (*q_buf)(), q.getOffset(), q.strides()[1],
-                           &h_tau[0],
-                           (*dT)(), tmp.getOffset(), NB, getQueue()(), &info);
-
-        t = createHostDataArray(dim4(MN), &h_tau[0]);
-    } catch(cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &orig) {
+    if (OpenCLCPUOffload()) { return cpu::qr(q, r, t, orig); }
+
+    const dim4 NullShape(0, 0, 0, 0);
+
+    dim4 iDims = orig.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    dim4 endPadding(M - iDims[0], max(M, N) - iDims[1], 0, 0);
+    Array<T> in =
+        (endPadding == NullShape
+             ? copyArray(orig)
+             : padArrayBorders(orig, NullShape, endPadding, AF_PAD_ZERO));
+    in.resetDims(iDims);
+
+    int MN = std::min(M, N);
+    int NB = magma_get_geqrf_nb<T>(M);
+
+    int NUM      = (2 * MN + ((N + 31) / 32) * 32) * NB;
+    Array<T> tmp = createEmptyArray<T>(dim4(NUM));
+
+    std::vector<T> h_tau(MN);
+
+    int info           = 0;
+    cl::Buffer *in_buf = in.get();
+    cl::Buffer *dT     = tmp.get();
+
+    magma_geqrf3_gpu<T>(M, N, (*in_buf)(), in.getOffset(), in.strides()[1],
+                        &h_tau[0], (*dT)(), tmp.getOffset(), getQueue()(),
+                        &info);
+
+    r = createEmptyArray<T>(in.dims());
+    kernel::triangle<T>(r, in, true, false);
+
+    cl::Buffer *r_buf = r.get();
+    magmablas_swapdblk<T>(MN - 1, NB, (*r_buf)(), r.getOffset(), r.strides()[1],
+                          1, (*dT)(), tmp.getOffset() + MN * NB, NB, 0,
+                          getQueue()());
+
+    q = in;  // No need to copy
+    q.resetDims(dim4(M, M));
+    cl::Buffer *q_buf = q.get();
+
+    magma_ungqr_gpu<T>(q.dims()[0], q.dims()[1], std::min(M, N), (*q_buf)(),
+                       q.getOffset(), q.strides()[1], &h_tau[0], (*dT)(),
+                       tmp.getOffset(), NB, getQueue()(), &info);
+
+    t = createHostDataArray(dim4(MN), &h_tau[0]);
 }
 
 template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
-    try {
-        initBlas();
-        dim4 iDims = in.dims();
-        int M = iDims[0];
-        int N = iDims[1];
-        int MN = std::min(M, N);
-
-        getQueue().finish(); // FIXME: Does this need to be here?
-        cl::CommandQueue Queue2(getContext(), getDevice());
-        cl_command_queue queues[] = {getQueue()(), Queue2()};
-
-
-        std::vector<T> h_tau(MN);
-        cl::Buffer *in_buf = in.get();
-
-        int info = 0;
-        magma_geqrf2_gpu<T>(M, N, (*in_buf)(),
-                            in.getOffset(), in.strides()[1],
-                            &h_tau[0], queues, &info);
-
-        Array<T> t = createHostDataArray(dim4(MN), &h_tau[0]);
-        return t;
-
-    } catch(cl::Error &err) {
-        CL_TO_AF_ERROR(err);
-    }
+Array<T> qr_inplace(Array<T> &in) {
+    if (OpenCLCPUOffload()) { return cpu::qr_inplace(in); }
+
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+    int MN     = std::min(M, N);
+
+    getQueue().finish();  // FIXME: Does this need to be here?
+    cl::CommandQueue Queue2(getContext(), getDevice());
+    cl_command_queue queues[] = {getQueue()(), Queue2()};
+
+    std::vector<T> h_tau(MN);
+    cl::Buffer *in_buf = in.get();
+
+    int info = 0;
+    magma_geqrf2_gpu<T>(M, N, (*in_buf)(), in.getOffset(), in.strides()[1],
+                        &h_tau[0], queues, &info);
+
+    Array<T> t = createHostDataArray(dim4(MN), &h_tau[0]);
+    return t;
 }
 
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
 
 INSTANTIATE_QR(float)
 INSTANTIATE_QR(cfloat)
 INSTANTIATE_QR(double)
 INSTANTIATE_QR(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in)
-{
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<T> qr_inplace(Array<T> &in)
-{
+Array<T> qr_inplace(Array<T> &in) {
     AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_QR(T)                                                                           \
-    template Array<T> qr_inplace<T>(Array<T> &in);                                                \
-    template void qr<T>(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+#define INSTANTIATE_QR(T)                                         \
+    template Array<T> qr_inplace<T>(Array<T> & in);               \
+    template void qr<T>(Array<T> & q, Array<T> & r, Array<T> & t, \
+                        const Array<T> &in);
 
 INSTANTIATE_QR(float)
 INSTANTIATE_QR(cfloat)
 INSTANTIATE_QR(double)
 INSTANTIATE_QR(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/qr.hpp b/src/backend/opencl/qr.hpp
index aa70199f3e..6c7b564ebc 100644
--- a/src/backend/opencl/qr.hpp
+++ b/src/backend/opencl/qr.hpp
@@ -7,14 +7,14 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &in);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void qr(Array<T> &q, Array<T> &r, Array<T> &t, const Array<T> &orig);
 
-    template<typename T>
-    Array<T> qr_inplace(Array<T> &in);
-}
+template<typename T>
+Array<T> qr_inplace(Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/random.cpp b/src/backend/opencl/random.cpp
deleted file mode 100644
index d67f323957..0000000000
--- a/src/backend/opencl/random.cpp
+++ /dev/null
@@ -1,78 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
-#include <random.hpp>
-#include <cassert>
-#include <kernel/random.hpp>
-#include <err_opencl.hpp>
-
-namespace opencl
-{
-    template<typename T>
-    Array<T> randu(const af::dim4 &dims)
-    {
-        verifyDoubleSupport<T>();
-        Array<T> out = createEmptyArray<T>(dims);
-        kernel::random<T, true>(*out.get(), out.elements());
-        return out;
-    }
-
-    template<typename T>
-    Array<T> randn(const af::dim4 &dims)
-    {
-        verifyDoubleSupport<T>();
-        Array<T> out = createEmptyArray<T>(dims);
-        kernel::random<T, false>(*out.get(), out.elements());
-        return out;
-    }
-
-    template Array<float>  randu<float>   (const af::dim4 &dims);
-    template Array<double> randu<double>  (const af::dim4 &dims);
-    template Array<int>    randu<int>     (const af::dim4 &dims);
-    template Array<uint>   randu<uint>    (const af::dim4 &dims);
-    template Array<char>   randu<char>    (const af::dim4 &dims);
-    template Array<uchar>  randu<uchar>   (const af::dim4 &dims);
-
-    template Array<float>  randn<float>   (const af::dim4 &dims);
-    template Array<double> randn<double>  (const af::dim4 &dims);
-
-#define COMPLEX_RANDOM(fn, T, TR, is_randu)                 \
-    template<> Array<T> fn<T>(const af::dim4 &dims)         \
-    {                                                       \
-        Array<T> out = createEmptyArray<T>(dims);           \
-        dim_t elements = out.elements() * 2;             \
-        kernel::random<TR, is_randu>(*out.get(), elements); \
-        return out;                                         \
-    }                                                       \
-
-    COMPLEX_RANDOM(randu, cfloat, float, true)
-    COMPLEX_RANDOM(randu, cdouble, double, true)
-    COMPLEX_RANDOM(randn, cfloat, float, false)
-    COMPLEX_RANDOM(randn, cdouble, double, false)
-
-
-    void setSeed(const uintl seed)
-    {
-        uintl hi = (seed & 0xffffffff00000000) >> 32;
-        uintl lo = (seed & 0x00000000ffffffff);
-        kernel::random_seed[0] = (unsigned)hi;
-        kernel::random_seed[1] = (unsigned)lo;
-        kernel::counter = 0;
-    }
-
-    uintl getSeed()
-    {
-        uintl hi = kernel::random_seed[0];
-        uintl lo = kernel::random_seed[1];
-        return hi << 32 | lo;
-    }
-}
diff --git a/src/backend/opencl/random.hpp b/src/backend/opencl/random.hpp
deleted file mode 100644
index c07332eb4b..0000000000
--- a/src/backend/opencl/random.hpp
+++ /dev/null
@@ -1,23 +0,0 @@
-/*******************************************************
- * Copyright (c) 2014, ArrayFire
- * All rights reserved.
- *
- * This file is distributed under 3-clause BSD license.
- * The complete license agreement can be obtained at:
- * http://arrayfire.com/licenses/BSD-3-Clause
- ********************************************************/
-
-#include <af/array.h>
-#include <Array.hpp>
-
-namespace opencl
-{
-    template<typename T>
-    Array<T> randu(const af::dim4 &dims);
-
-    template<typename T>
-    Array<T> randn(const af::dim4 &dims);
-
-    void setSeed(const uintl seed);
-    uintl getSeed();
-}
diff --git a/src/backend/opencl/random_engine.cpp b/src/backend/opencl/random_engine.cpp
new file mode 100644
index 0000000000..d307e54c2b
--- /dev/null
+++ b/src/backend/opencl/random_engine.cpp
@@ -0,0 +1,158 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <kernel/random_engine.hpp>
+#include <af/dim4.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl) {
+    kernel::initMersenneState(*state.get(), *tbl.get(), seed);
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionCBRNG<T>(*out.get(), out.elements(), type, seed,
+                                        counter);
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionCBRNG<T>(*out.get(), out.elements(), type, seed,
+                                       counter);
+    return out;
+}
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::uniformDistributionMT<T>(
+        *out.get(), out.elements(), *state.get(), *pos.get(), *sh1.get(),
+        *sh2.get(), mask, *recursion_table.get(), *temper_table.get());
+    return out;
+}
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state) {
+    Array<T> out = createEmptyArray<T>(dims);
+    kernel::normalDistributionMT<T>(
+        *out.get(), out.elements(), *state.get(), *pos.get(), *sh1.get(),
+        *sh2.get(), mask, *recursion_table.get(), *temper_table.get());
+    return out;
+}
+
+#define INSTANTIATE_UNIFORM(T)                                   \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> uniformDistribution<T>(                    \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define INSTANTIATE_NORMAL(T)                                    \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, const af_random_engine_type type,  \
+        const uintl &seed, uintl &counter);                      \
+    template Array<T> normalDistribution<T>(                     \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,  \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table, \
+        Array<uint> temper_table, Array<uint> state);
+
+#define COMPLEX_UNIFORM_DISTRIBUTION(T, TR)                                    \
+    template<>                                                                 \
+    Array<T> uniformDistribution<T>(const af::dim4 &dims,                      \
+                                    const af_random_engine_type type,          \
+                                    const uintl &seed, uintl &counter) {       \
+        Array<T> out    = createEmptyArray<T>(dims);                           \
+        size_t elements = out.elements() * 2;                                  \
+        kernel::uniformDistributionCBRNG<TR>(*out.get(), elements, type, seed, \
+                                             counter);                         \
+        return out;                                                            \
+    }                                                                          \
+    template<>                                                                 \
+    Array<T> uniformDistribution<T>(                                           \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,                \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,               \
+        Array<uint> temper_table, Array<uint> state) {                         \
+        Array<T> out    = createEmptyArray<T>(dims);                           \
+        size_t elements = out.elements() * 2;                                  \
+        kernel::uniformDistributionMT<TR>(                                     \
+            *out.get(), elements, *state.get(), *pos.get(), *sh1.get(),        \
+            *sh2.get(), mask, *recursion_table.get(), *temper_table.get());    \
+        return out;                                                            \
+    }
+
+#define COMPLEX_NORMAL_DISTRIBUTION(T, TR)                                    \
+    template<>                                                                \
+    Array<T> normalDistribution<T>(const af::dim4 &dims,                      \
+                                   const af_random_engine_type type,          \
+                                   const uintl &seed, uintl &counter) {       \
+        Array<T> out    = createEmptyArray<T>(dims);                          \
+        size_t elements = out.elements() * 2;                                 \
+        kernel::normalDistributionCBRNG<TR>(*out.get(), elements, type, seed, \
+                                            counter);                         \
+        return out;                                                           \
+    }                                                                         \
+    template<>                                                                \
+    Array<T> normalDistribution<T>(                                           \
+        const af::dim4 &dims, Array<uint> pos, Array<uint> sh1,               \
+        Array<uint> sh2, uint mask, Array<uint> recursion_table,              \
+        Array<uint> temper_table, Array<uint> state) {                        \
+        Array<T> out    = createEmptyArray<T>(dims);                          \
+        size_t elements = out.elements() * 2;                                 \
+        kernel::normalDistributionMT<TR>(                                     \
+            *out.get(), elements, *state.get(), *pos.get(), *sh1.get(),       \
+            *sh2.get(), mask, *recursion_table.get(), *temper_table.get());   \
+        return out;                                                           \
+    }
+
+INSTANTIATE_UNIFORM(float)
+INSTANTIATE_UNIFORM(double)
+INSTANTIATE_UNIFORM(int)
+INSTANTIATE_UNIFORM(uint)
+INSTANTIATE_UNIFORM(intl)
+INSTANTIATE_UNIFORM(uintl)
+INSTANTIATE_UNIFORM(char)
+INSTANTIATE_UNIFORM(schar)
+INSTANTIATE_UNIFORM(uchar)
+INSTANTIATE_UNIFORM(short)
+INSTANTIATE_UNIFORM(ushort)
+INSTANTIATE_UNIFORM(half)
+
+INSTANTIATE_NORMAL(float)
+INSTANTIATE_NORMAL(double)
+INSTANTIATE_NORMAL(half)
+
+COMPLEX_UNIFORM_DISTRIBUTION(cdouble, double)
+COMPLEX_UNIFORM_DISTRIBUTION(cfloat, float)
+
+COMPLEX_NORMAL_DISTRIBUTION(cdouble, double)
+COMPLEX_NORMAL_DISTRIBUTION(cfloat, float)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/random_engine.hpp b/src/backend/opencl/random_engine.hpp
new file mode 100644
index 0000000000..93c190942e
--- /dev/null
+++ b/src/backend/opencl/random_engine.hpp
@@ -0,0 +1,43 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <backend.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace opencl {
+void initMersenneState(Array<uint> &state, const uintl seed,
+                       const Array<uint> &tbl);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims,
+                             const af_random_engine_type type,
+                             const uintl &seed, uintl &counter);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims,
+                            const af_random_engine_type type, const uintl &seed,
+                            uintl &counter);
+
+template<typename T>
+Array<T> uniformDistribution(const af::dim4 &dims, Array<uint> pos,
+                             Array<uint> sh1, Array<uint> sh2, uint mask,
+                             Array<uint> recursion_table,
+                             Array<uint> temper_table, Array<uint> state);
+
+template<typename T>
+Array<T> normalDistribution(const af::dim4 &dims, Array<uint> pos,
+                            Array<uint> sh1, Array<uint> sh2, uint mask,
+                            Array<uint> recursion_table,
+                            Array<uint> temper_table, Array<uint> state);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/range.cpp b/src/backend/opencl/range.cpp
index 521a4fbeb7..a49ba931c8 100644
--- a/src/backend/opencl/range.cpp
+++ b/src/backend/opencl/range.cpp
@@ -6,41 +6,51 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/range.hpp>
+#include <range.hpp>
 
 #include <Array.hpp>
-#include <range.hpp>
-#include <kernel/range.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
 #include <math.hpp>
 #include <stdexcept>
-#include <err_opencl.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> range(const dim4& dim, const int seq_dim)
-    {
-        // Set dimension along which the sequence should be
-        // Other dimensions are simply tiled
-        int _seq_dim = seq_dim;
-        if(seq_dim < 0) {
-            _seq_dim = 0;   // column wise sequence
-        }
-
-        if(_seq_dim < 0 || _seq_dim > 3)
-            AF_ERROR("Invalid rep selection", AF_ERR_ARG);
-
-        Array<T> out = createEmptyArray<T>(dim);
-        kernel::range<T>(out, _seq_dim);
-
-        return out;
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim) {
+    // Select the dimension along which the sequence is generated;
+    // the other dimensions are simply tiled.
+    int _seq_dim = seq_dim;
+    if (seq_dim < 0) {
+        _seq_dim = 0;  // column-wise sequence
+    }
+
+    if (_seq_dim < 0 || _seq_dim > 3) {
+        AF_ERROR("Invalid rep selection", AF_ERR_ARG);
     }
 
-#define INSTANTIATE(T)                                                      \
-    template Array<T> range<T>(const af::dim4 &dims, const int seq_dims);   \
+    Array<T> out = createEmptyArray<T>(dim);
+    kernel::range<T>(out, _seq_dim);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> range<T>(const af::dim4& dims, const int seq_dims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/range.hpp b/src/backend/opencl/range.hpp
index 81e75b44b8..e34f302536 100644
--- a/src/backend/opencl/range.hpp
+++ b/src/backend/opencl/range.hpp
@@ -8,11 +8,11 @@
  ********************************************************/
 #pragma once
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> range(const dim4& dim, const int seq_dim = -1);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> range(const dim4& dim, const int seq_dim = -1);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/reduce.hpp b/src/backend/opencl/reduce.hpp
index 6caa7249da..8660f9f1d8 100644
--- a/src/backend/opencl/reduce.hpp
+++ b/src/backend/opencl/reduce.hpp
@@ -7,15 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
+#pragma once
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace opencl
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim);
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan = false,
+                 double nanval = 0);
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in);
-}
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan = false, double nanval = 0);
+
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan = false,
+                     double nanval = 0);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/reduce_impl.hpp b/src/backend/opencl/reduce_impl.hpp
index a0310398c9..7b68187e4e 100644
--- a/src/backend/opencl/reduce_impl.hpp
+++ b/src/backend/opencl/reduce_impl.hpp
@@ -7,36 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <complex>
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <reduce.hpp>
-#include <kernel/reduce.hpp>
 #include <err_opencl.hpp>
+#include <kernel/reduce.hpp>
+#include <kernel/reduce_by_key.hpp>
+#include <reduce.hpp>
+#include <af/dim4.hpp>
+#include <complex>
 
-using std::swap;
 using af::dim4;
-namespace opencl
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> reduce(const Array<Ti> &in, const int dim)
-    {
-        dim4 odims = in.dims();
-        odims[dim] = 1;
-        Array<To> out = createEmptyArray<To>(odims);
-        kernel::reduce<Ti, To, op>(out, in, dim);
-        return out;
-    }
+using std::swap;
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce(const Array<Ti> &in, const int dim, bool change_nan,
+                 double nanval) {
+    dim4 odims    = in.dims();
+    odims[dim]    = 1;
+    Array<To> out = createEmptyArray<To>(odims);
+    kernel::reduce<Ti, To, op>(out, in, dim, change_nan, nanval);
+    return out;
+}
 
-    template<af_op_t op, typename Ti, typename To>
-    To reduce_all(const Array<Ti> &in)
-    {
-        return kernel::reduce_all<Ti, To, op>(in);
-    }
+template<af_op_t op, typename Ti, typename Tk, typename To>
+void reduce_by_key(Array<Tk> &keys_out, Array<To> &vals_out,
+                   const Array<Tk> &keys, const Array<Ti> &vals, const int dim,
+                   bool change_nan, double nanval) {
+    kernel::reduceByKey<op, Ti, Tk, To>(keys_out, vals_out, keys, vals, dim,
+                                        change_nan, nanval);
 }
 
-#define INSTANTIATE(Op, Ti, To)                                         \
-    template Array<To> reduce<Op, Ti, To>(const Array<Ti> &in, const int dim); \
-    template To reduce_all<Op, Ti, To>(const Array<Ti> &in);
+template<af_op_t op, typename Ti, typename To>
+Array<To> reduce_all(const Array<Ti> &in, bool change_nan, double nanval) {
+    Array<To> out = createEmptyArray<To>(1);
+    kernel::reduceAll<Ti, To, op>(out, in, change_nan, nanval);
+    return out;
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+#define INSTANTIATE(Op, Ti, To)                                                \
+    template Array<To> reduce<Op, Ti, To>(const Array<Ti> &in, const int dim,  \
+                                          bool change_nan, double nanval);     \
+    template void reduce_by_key<Op, Ti, int, To>(                              \
+        Array<int> & keys_out, Array<To> & vals_out, const Array<int> &keys,   \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template void reduce_by_key<Op, Ti, uint, To>(                             \
+        Array<uint> & keys_out, Array<To> & vals_out, const Array<uint> &keys, \
+        const Array<Ti> &vals, const int dim, bool change_nan, double nanval); \
+    template Array<To> reduce_all<Op, Ti, To>(const Array<Ti> &in,             \
+                                              bool change_nan, double nanval);
diff --git a/src/backend/opencl/regions.cpp b/src/backend/opencl/regions.cpp
index 0ca6a083c9..06df18dd4c 100644
--- a/src/backend/opencl/regions.cpp
+++ b/src/backend/opencl/regions.cpp
@@ -7,46 +7,35 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <regions.hpp>
-#include <kernel/regions.hpp>
 #include <err_opencl.hpp>
+#include <kernel/regions.hpp>
+#include <regions.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> regions(const Array<char> &in, af_connectivity connectivity)
-{
-    ARG_ASSERT(2, (connectivity==AF_CONNECTIVITY_4 || connectivity==AF_CONNECTIVITY_8));
-
-    const af::dim4 dims = in.dims();
-
-    Array<T> out  = createEmptyArray<T>(dims);
-
-    switch(connectivity) {
-        case AF_CONNECTIVITY_4:
-            kernel::regions<T, false, 2>(out, in);
-            break;
-        case AF_CONNECTIVITY_8:
-            kernel::regions<T, true,  2>(out, in);
-            break;
-    }
-
+Array<T> regions(const Array<char> &in, af_connectivity connectivity) {
+    const af::dim4 &dims = in.dims();
+    Array<T> out         = createEmptyArray<T>(dims);
+    kernel::regions<T>(out, in, connectivity == AF_CONNECTIVITY_8, 2);
     return out;
 }
 
-#define INSTANTIATE(T)                                                                  \
-    template Array<T> regions<T>(const Array<char> &in, af_connectivity connectivity);
+#define INSTANTIATE(T)                                  \
+    template Array<T> regions<T>(const Array<char> &in, \
+                                 af_connectivity connectivity);
 
-INSTANTIATE(float )
+INSTANTIATE(float)
 INSTANTIATE(double)
-INSTANTIATE(int   )
-INSTANTIATE(uint  )
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/regions.hpp b/src/backend/opencl/regions.hpp
index a645f69f9c..1c4d26f6c0 100644
--- a/src/backend/opencl/regions.hpp
+++ b/src/backend/opencl/regions.hpp
@@ -9,10 +9,11 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
 Array<T> regions(const Array<char> &in, af_connectivity connectivity);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/reorder.cpp b/src/backend/opencl/reorder.cpp
index 403f612910..ecacccd677 100644
--- a/src/backend/opencl/reorder.cpp
+++ b/src/backend/opencl/reorder.cpp
@@ -8,39 +8,45 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <reorder.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
 #include <kernel/reorder.hpp>
+#include <reorder.hpp>
 #include <stdexcept>
-#include <err_opencl.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims(0);
-        for(int i = 0; i < 4; i++)
-            oDims[i] = iDims[rdims[i]];
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        kernel::reorder<T>(out, in, rdims.get());
-
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                         \
-    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(0);
+    for (int i = 0; i < 4; i++) { oDims[i] = iDims[rdims[i]]; }
+
+    Array<T> out = createEmptyArray<T>(oDims);
+
+    kernel::reorder<T>(out, in, rdims.get());
+
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> reorder<T>(const Array<T> &in, const af::dim4 &rdims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/reorder.hpp b/src/backend/opencl/reorder.hpp
index ad06dafa8e..6aa860c769 100644
--- a/src/backend/opencl/reorder.hpp
+++ b/src/backend/opencl/reorder.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> reorder(const Array<T> &in, const af::dim4 &rdims);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/reshape.cpp b/src/backend/opencl/reshape.cpp
new file mode 100644
index 0000000000..78c83cc086
--- /dev/null
+++ b/src/backend/opencl/reshape.cpp
@@ -0,0 +1,81 @@
+
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <copy.hpp>
+
+#include <common/half.hpp>
+#include <kernel/memcopy.hpp>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename inType, typename outType>
+Array<outType> reshape(const Array<inType> &in, const dim4 &outDims,
+                       outType defaultValue, double scale) {
+    Array<outType> out = createEmptyArray<outType>(outDims);
+    if (out.elements() > 0) {
+        kernel::copy<inType, outType>(out, in, in.ndims(), defaultValue, scale);
+    }
+    return out;
+}
+
+#define INSTANTIATE(SRC_T)                                                    \
+    template Array<float> reshape<SRC_T, float>(Array<SRC_T> const &,         \
+                                                dim4 const &, float, double); \
+    template Array<double> reshape<SRC_T, double>(                            \
+        Array<SRC_T> const &, dim4 const &, double, double);                  \
+    template Array<cfloat> reshape<SRC_T, cfloat>(                            \
+        Array<SRC_T> const &, dim4 const &, cfloat, double);                  \
+    template Array<cdouble> reshape<SRC_T, cdouble>(                          \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);                 \
+    template Array<int> reshape<SRC_T, int>(Array<SRC_T> const &,             \
+                                            dim4 const &, int, double);       \
+    template Array<uint> reshape<SRC_T, uint>(Array<SRC_T> const &,           \
+                                              dim4 const &, uint, double);    \
+    template Array<intl> reshape<SRC_T, intl>(Array<SRC_T> const &,           \
+                                              dim4 const &, intl, double);    \
+    template Array<uintl> reshape<SRC_T, uintl>(Array<SRC_T> const &,         \
+                                                dim4 const &, uintl, double); \
+    template Array<short> reshape<SRC_T, short>(Array<SRC_T> const &,         \
+                                                dim4 const &, short, double); \
+    template Array<ushort> reshape<SRC_T, ushort>(                            \
+        Array<SRC_T> const &, dim4 const &, ushort, double);                  \
+    template Array<uchar> reshape<SRC_T, uchar>(Array<SRC_T> const &,         \
+                                                dim4 const &, uchar, double); \
+    template Array<char> reshape<SRC_T, char>(Array<SRC_T> const &,           \
+                                              dim4 const &, char, double);    \
+    template Array<half> reshape<SRC_T, half>(Array<SRC_T> const &,           \
+                                              dim4 const &, half, double);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(half)
+
+#define INSTANTIATE_COMPLEX(SRC_T)                           \
+    template Array<cfloat> reshape<SRC_T, cfloat>(           \
+        Array<SRC_T> const &, dim4 const &, cfloat, double); \
+    template Array<cdouble> reshape<SRC_T, cdouble>(         \
+        Array<SRC_T> const &, dim4 const &, cdouble, double);
+
+INSTANTIATE_COMPLEX(cfloat)
+INSTANTIATE_COMPLEX(cdouble)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/resize.cpp b/src/backend/opencl/resize.cpp
index d5f358fc69..bf3a8497b2 100644
--- a/src/backend/opencl/resize.cpp
+++ b/src/backend/opencl/resize.cpp
@@ -7,52 +7,41 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <resize.hpp>
 #include <kernel/resize.hpp>
+#include <resize.hpp>
+#include <af/dim4.hpp>
 #include <stdexcept>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims(odim0, odim1, iDims[2], iDims[3]);
-
-        Array<T> out = createEmptyArray<T>(oDims);
-
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::resize<T, AF_INTERP_NEAREST> (out, in);
-                break;
-            case AF_INTERP_BILINEAR:
-                kernel::resize<T, AF_INTERP_BILINEAR>(out, in);
-                break;
-            default:
-                break;
-        }
-        return out;
-    }
-
-
-#define INSTANTIATE(T)                                                  \
-    template Array<T> resize<T> (const Array<T> &in,                    \
-                                 const dim_t odim0, const dim_t odim1, \
-                                 const af_interp_type method);
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims(odim0, odim1, iDims[2], iDims[3]);
+    Array<T> out = createEmptyArray<T>(oDims);
+    kernel::resize<T>(out, in, method);
+    return out;
 }
+
+#define INSTANTIATE(T)                                                 \
+    template Array<T> resize<T>(const Array<T> &in, const dim_t odim0, \
+                                const dim_t odim1,                     \
+                                const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/resize.hpp b/src/backend/opencl/resize.hpp
index 04a6b937ee..bec5bc8ce3 100644
--- a/src/backend/opencl/resize.hpp
+++ b/src/backend/opencl/resize.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> resize(const Array<T> &in, const dim_t odim0, const dim_t odim1,
+                const af_interp_type method);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/rotate.cpp b/src/backend/opencl/rotate.cpp
index 9fca25a280..eab0c1da26 100644
--- a/src/backend/opencl/rotate.cpp
+++ b/src/backend/opencl/rotate.cpp
@@ -7,49 +7,52 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
 #include <rotate.hpp>
-#include <kernel/rotate.hpp>
-#include <stdexcept>
-#include <err_opencl.hpp>
-
-namespace opencl
-{
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                     const af_interp_type method)
-    {
-        Array<T> out = createEmptyArray<T>(odims);
 
-        switch(method) {
-            case AF_INTERP_NEAREST:
-                kernel::rotate<T, AF_INTERP_NEAREST> (out, in, theta);
-                break;
-            case AF_INTERP_BILINEAR:
-                kernel::rotate<T, AF_INTERP_BILINEAR> (out, in, theta);
-                break;
-            default:
-                AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                break;
-        }
+#include <kernel/rotate.hpp>
 
-        return out;
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method) {
+    Array<T> out = createEmptyArray<T>(odims);
+
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::rotate<T>(out, in, theta, method, 1);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::rotate<T>(out, in, theta, method, 2);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::rotate<T>(out, in, theta, method, 3);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
     }
-
-
-#define INSTANTIATE(T)                                                  \
-    template Array<T> rotate(const Array<T> &in, const float theta,     \
-                             const af::dim4 &odims, const af_interp_type method); \
-
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    return out;
 }
+
+#define INSTANTIATE(T)                                              \
+    template Array<T> rotate(const Array<T> &in, const float theta, \
+                             const af::dim4 &odims,                 \
+                             const af_interp_type method);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/rotate.hpp b/src/backend/opencl/rotate.hpp
index ea75f585ae..dddc164718 100644
--- a/src/backend/opencl/rotate.hpp
+++ b/src/backend/opencl/rotate.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
-                    const af_interp_type method);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> rotate(const Array<T> &in, const float theta, const af::dim4 &odims,
+                const af_interp_type method);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/scalar.hpp b/src/backend/opencl/scalar.hpp
index b6abf47ff9..1e497af867 100644
--- a/src/backend/opencl/scalar.hpp
+++ b/src/backend/opencl/scalar.hpp
@@ -8,18 +8,18 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <optypes.hpp>
+#include <common/jit/ScalarNode.hpp>
 #include <math.hpp>
-#include <JIT/ScalarNode.hpp>
+#include <optypes.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> createScalarNode(const dim4 &size, const T val)
-{
-    JIT::ScalarNode<T> *node = new JIT::ScalarNode<T>(val);
-    return createNodeArray<T>(size, JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
+Array<T> createScalarNode(const dim4 &size, const T val) {
+    return createNodeArray<T>(size,
+                              std::make_shared<common::ScalarNode<T>>(val));
 }
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/scan.cpp b/src/backend/opencl/scan.cpp
index 52b8c9a1a7..649789ef91 100644
--- a/src/backend/opencl/scan.cpp
+++ b/src/backend/opencl/scan.cpp
@@ -7,52 +7,51 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
-#include <Array.hpp>
 #include <scan.hpp>
-#include <complex>
-#include <err_opencl.hpp>
 
-#include <kernel/scan_first.hpp>
 #include <kernel/scan_dim.hpp>
+#include <kernel/scan_first.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusiveScan) {
+    Array<To> out = createEmptyArray<To>(in.dims());
 
-namespace opencl
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti>& in, const int dim)
-    {
-        Array<To> out = createEmptyArray<To>(in.dims());
-
-        try {
-            Param Out = out;
-            Param In  =   in;
-            switch (dim) {
-            case 0: kernel::scan_first<Ti, To, op   >(Out, In); break;
-            case 1: kernel::scan_dim  <Ti, To, op, 1>(Out, In); break;
-            case 2: kernel::scan_dim  <Ti, To, op, 2>(Out, In); break;
-            case 3: kernel::scan_dim  <Ti, To, op, 3>(Out, In); break;
-            }
-        } catch (cl::Error &ex) {
-
-            CL_TO_AF_ERROR(ex);
-        }
-
-        return out;
+    Param Out = out;
+    Param In  = in;
+
+    if (dim == 0) {
+        kernel::scanFirst<Ti, To, op>(Out, In, inclusiveScan);
+    } else {
+        kernel::scanDim<Ti, To, op>(Out, In, dim, inclusiveScan);
     }
 
-#define INSTANTIATE(ROp, Ti, To)                                        \
-    template Array<To> scan<ROp, Ti, To>(const Array<Ti>& in, const int dim); \
-
-    //accum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-    INSTANTIATE(af_notzero_t, char  , uint)
+    return out;
 }
+
+#define INSTANTIATE_SCAN(ROp, Ti, To) \
+    template Array<To> scan<ROp, Ti, To>(const Array<Ti>&, const int, bool);
+
+#define INSTANTIATE_SCAN_ALL(ROp)           \
+    INSTANTIATE_SCAN(ROp, float, float)     \
+    INSTANTIATE_SCAN(ROp, double, double)   \
+    INSTANTIATE_SCAN(ROp, cfloat, cfloat)   \
+    INSTANTIATE_SCAN(ROp, cdouble, cdouble) \
+    INSTANTIATE_SCAN(ROp, int, int)         \
+    INSTANTIATE_SCAN(ROp, uint, uint)       \
+    INSTANTIATE_SCAN(ROp, intl, intl)       \
+    INSTANTIATE_SCAN(ROp, uintl, uintl)     \
+    INSTANTIATE_SCAN(ROp, char, uint)       \
+    INSTANTIATE_SCAN(ROp, schar, int)       \
+    INSTANTIATE_SCAN(ROp, uchar, uint)      \
+    INSTANTIATE_SCAN(ROp, short, int)       \
+    INSTANTIATE_SCAN(ROp, ushort, uint)
+
+INSTANTIATE_SCAN(af_notzero_t, char, uint)
+INSTANTIATE_SCAN_ALL(af_add_t)
+INSTANTIATE_SCAN_ALL(af_mul_t)
+INSTANTIATE_SCAN_ALL(af_min_t)
+INSTANTIATE_SCAN_ALL(af_max_t)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/scan.hpp b/src/backend/opencl/scan.hpp
index df03d8282f..77fef74c02 100644
--- a/src/backend/opencl/scan.hpp
+++ b/src/backend/opencl/scan.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
-#include <ops.hpp>
+#include <optypes.hpp>
 
-namespace opencl
-{
-    template<af_op_t op, typename Ti, typename To>
-    Array<To> scan(const Array<Ti>& in, const int dim);
-}
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename To>
+Array<To> scan(const Array<Ti>& in, const int dim, bool inclusive_scan = true);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/scan_by_key.cpp b/src/backend/opencl/scan_by_key.cpp
new file mode 100644
index 0000000000..8af8d2a31b
--- /dev/null
+++ b/src/backend/opencl/scan_by_key.cpp
@@ -0,0 +1,64 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_opencl.hpp>
+#include <scan.hpp>
+#include <af/dim4.hpp>
+#include <complex>
+
+#include <kernel/scan_dim_by_key.hpp>
+#include <kernel/scan_first_by_key.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan) {
+    Array<To> out = createEmptyArray<To>(in.dims());
+
+    Param Out = out;
+    Param Key = key;
+    Param In  = in;
+
+    if (dim == 0) {
+        kernel::scanFirstByKey<Ti, Tk, To, op>(Out, In, Key, inclusive_scan);
+    } else {
+        kernel::scanDimByKey<Ti, Tk, To, op>(Out, In, Key, dim, inclusive_scan);
+    }
+    return out;
+}
+
+#define INSTANTIATE_SCAN_BY_KEY(ROp, Ti, Tk, To)                  \
+    template Array<To> scan<ROp, Ti, Tk, To>(                     \
+        const Array<Tk>& key, const Array<Ti>& in, const int dim, \
+        bool inclusive_scan);
+
+#define INSTANTIATE_SCAN_BY_KEY_ALL(ROp, Tk)           \
+    INSTANTIATE_SCAN_BY_KEY(ROp, float, Tk, float)     \
+    INSTANTIATE_SCAN_BY_KEY(ROp, double, Tk, double)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cfloat, Tk, cfloat)   \
+    INSTANTIATE_SCAN_BY_KEY(ROp, cdouble, Tk, cdouble) \
+    INSTANTIATE_SCAN_BY_KEY(ROp, int, Tk, int)         \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uint, Tk, uint)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, intl, Tk, intl)       \
+    INSTANTIATE_SCAN_BY_KEY(ROp, uintl, Tk, uintl)
+
+#define INSTANTIATE_SCAN_BY_KEY_OP(ROp)    \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, int)  \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uint) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, intl) \
+    INSTANTIATE_SCAN_BY_KEY_ALL(ROp, uintl)
+
+INSTANTIATE_SCAN_BY_KEY_OP(af_add_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_mul_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_min_t)
+INSTANTIATE_SCAN_BY_KEY_OP(af_max_t)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/scan_by_key.hpp b/src/backend/opencl/scan_by_key.hpp
new file mode 100644
index 0000000000..f2ad2b2fc7
--- /dev/null
+++ b/src/backend/opencl/scan_by_key.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <optypes.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<af_op_t op, typename Ti, typename Tk, typename To>
+Array<To> scan(const Array<Tk>& key, const Array<Ti>& in, const int dim,
+               bool inclusive_scan = true);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/select.cpp b/src/backend/opencl/select.cpp
new file mode 100644
index 0000000000..20c900007a
--- /dev/null
+++ b/src/backend/opencl/select.cpp
@@ -0,0 +1,138 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#include <kernel/select.hpp>
+#include <select.hpp>
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <common/jit/NaryNode.hpp>
+#include <err_opencl.hpp>
+#include <scalar.hpp>
+
+#include <nonstd/span.hpp>
+#include <memory>
+
+using af::dim4;
+
+using arrayfire::common::half;
+using arrayfire::common::NaryNode;
+
+using std::make_shared;
+using std::max;
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(
+        NaryNode(static_cast<af::dtype>(dtype_traits<T>::af_type), "__select",
+                 3, {{cond_node, a_node, b_node}}, af_select_t, height));
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T>(cond, a, b, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const dim4 &odims) {
+    auto cond_node   = cond.getNode();
+    auto a_node      = a.getNode();
+    Array<T> b       = createScalarNode<T>(odims, b_val);
+    auto b_node      = b.getNode();
+    auto a_height    = a_node->getHeight();
+    auto b_height    = b_node->getHeight();
+    auto cond_height = cond_node->getHeight();
+    const int height = max(max(a_height, b_height), cond_height) + 1;
+
+    auto node = make_shared<NaryNode>(NaryNode(
+        static_cast<af::dtype>(dtype_traits<T>::af_type),
+        (flip ? "__not_select" : "__select"), 3, {{cond_node, a_node, b_node}},
+        (flip ? af_not_select_t : af_select_t), height));
+
+    std::array<common::Node *, 1> nodes{node.get()};
+    if (detail::passesJitHeuristics<T>(nodes) != kJITHeuristics::Pass) {
+        if (a_height > max(b_height, cond_height)) {
+            a.eval();
+        } else if (b_height > cond_height) {
+            b.eval();
+        } else {
+            cond.eval();
+        }
+        return createSelectNode<T, flip>(cond, a, b_val, odims);
+    }
+    return createNodeArray<T>(odims, node);
+}
+
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b) {
+    kernel::select<T>(out, cond, a, b, out.ndims());
+}
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b) {
+    kernel::select_scalar<T>(out, cond, a, b, out.ndims(), flip);
+}
+
+#define INSTANTIATE(T)                                                   \
+    template Array<T> createSelectNode<T>(                               \
+        const Array<char> &cond, const Array<T> &a, const Array<T> &b,   \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, true>(                         \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template Array<T> createSelectNode<T, false>(                        \
+        const Array<char> &cond, const Array<T> &a, const T &b_val,      \
+        const af::dim4 &odims);                                          \
+    template void select<T>(Array<T> & out, const Array<char> &cond,     \
+                            const Array<T> &a, const Array<T> &b);       \
+    template void select_scalar<T, true>(Array<T> & out,                 \
+                                         const Array<char> &cond,        \
+                                         const Array<T> &a, const T &b); \
+    template void select_scalar<T, false>(Array<T> & out,                \
+                                          const Array<char> &cond,       \
+                                          const Array<T> &a, const T &b)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+INSTANTIATE(int);
+INSTANTIATE(uint);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(char);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(half);
+
+#undef INSTANTIATE
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/select.hpp b/src/backend/opencl/select.hpp
new file mode 100644
index 0000000000..a026f9c04d
--- /dev/null
+++ b/src/backend/opencl/select.hpp
@@ -0,0 +1,31 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+#pragma once
+#include <Array.hpp>
+#include <af/dim4.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void select(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+            const Array<T> &b);
+
+template<typename T, bool flip>
+void select_scalar(Array<T> &out, const Array<char> &cond, const Array<T> &a,
+                   const T &b);
+
+template<typename T>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const Array<T> &b, const af::dim4 &odims);
+
+template<typename T, bool flip>
+Array<T> createSelectNode(const Array<char> &cond, const Array<T> &a,
+                          const T &b_val, const af::dim4 &odims);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/set.cpp b/src/backend/opencl/set.cpp
index f725a23393..1c1b74396c 100644
--- a/src/backend/opencl/set.cpp
+++ b/src/backend/opencl/set.cpp
@@ -7,144 +7,151 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <set.hpp>
 #include <copy.hpp>
-#include <sort.hpp>
 #include <err_opencl.hpp>
+#include <set.hpp>
+#include <sort.hpp>
+#include <af/dim4.hpp>
+
+AF_DEPRECATED_WARNINGS_OFF
 #include <boost/compute/algorithm/set_intersection.hpp>
 #include <boost/compute/algorithm/set_union.hpp>
 #include <boost/compute/algorithm/sort.hpp>
 #include <boost/compute/algorithm/unique.hpp>
 #include <boost/compute/iterator/buffer_iterator.hpp>
+AF_DEPRECATED_WARNINGS_ON
 
 namespace compute = boost::compute;
 
-namespace opencl
-{
-    using af::dim4;
+namespace arrayfire {
+namespace opencl {
+using af::dim4;
 
-    template<typename T>
-    Array<T> setUnique(const Array<T> &in,
-                       const bool is_sorted)
-    {
-        try {
-            Array<T> out = copyArray<T>(in);
+using std::conditional;
+using std::is_same;
+template<typename T>
+using ltype_t = typename conditional<is_same<T, intl>::value, cl_long, T>::type;
 
-            compute::command_queue queue(getQueue()());
+template<typename T>
+using type_t =
+    typename conditional<is_same<T, uintl>::value, cl_ulong, ltype_t<T>>::type;
 
-            compute::buffer out_data((*out.get())());
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted) {
+    try {
+        Array<T> out = copyArray<T>(in);
 
-            compute::buffer_iterator<T> begin(out_data, 0);
-            compute::buffer_iterator<T> end(out_data, out.dims()[0]);
+        compute::command_queue queue(getQueue()());
 
-            if (!is_sorted) {
-                compute::sort(begin, end, queue);
-            }
+        compute::buffer out_data((*out.get())());
 
-            end = compute::unique(begin, end, queue);
+        compute::buffer_iterator<type_t<T>> begin(out_data, 0);
+        compute::buffer_iterator<type_t<T>> end(out_data, out.elements());
 
-            out.resetDims(dim4(std::distance(begin, end), 1, 1, 1));
+        if (!is_sorted) { compute::sort(begin, end, queue); }
 
-            return out;
-        } catch (std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
-        }
-    }
-
-    template<typename T>
-    Array<T> setUnion(const Array<T> &first,
-                      const Array<T> &second,
-                      const bool is_unique)
-    {
-        try {
-            Array<T> unique_first = first;
-            Array<T> unique_second = second;
-
-            if (!is_unique) {
-                unique_first  = setUnique(first, false);
-                unique_second = setUnique(second, false);
-            }
-
-            size_t out_size = unique_first.dims()[0] + unique_second.dims()[0];
-            Array<T> out = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
-
-            compute::command_queue queue(getQueue()());
-
-            compute::buffer first_data((*unique_first.get())());
-            compute::buffer second_data((*unique_second.get())());
-            compute::buffer out_data((*out.get())());
-
-            compute::buffer_iterator<T> first_begin(first_data, 0);
-            compute::buffer_iterator<T> first_end(first_data, unique_first.dims()[0]);
-            compute::buffer_iterator<T> second_begin(second_data, 0);
-            compute::buffer_iterator<T> second_end(second_data, unique_second.dims()[0]);
-            compute::buffer_iterator<T> out_begin(out_data, 0);
-
-            compute::buffer_iterator<T> out_end = compute::set_union(
-                first_begin, first_end, second_begin, second_end, out_begin, queue
-                );
-
-            out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
-            return out;
-
-        } catch (std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
+        end = compute::unique(begin, end, queue);
+
+        out.resetDims(dim4(std::distance(begin, end), 1, 1, 1));
+
+        return out;
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
+}
+
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique) {
+    try {
+        Array<T> unique_first  = first;
+        Array<T> unique_second = second;
+
+        if (!is_unique) {
+            unique_first  = setUnique(first, false);
+            unique_second = setUnique(second, false);
         }
-    }
-
-    template<typename T>
-    Array<T> setIntersect(const Array<T> &first,
-                          const Array<T> &second,
-                          const bool is_unique)
-    {
-        try {
-            Array<T> unique_first = first;
-            Array<T> unique_second = second;
-
-            if (!is_unique) {
-                unique_first  = setUnique(first, false);
-                unique_second = setUnique(second, false);
-            }
-
-            size_t out_size = std::max(unique_first.dims()[0], unique_second.dims()[0]);
-            Array<T> out = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
-
-            compute::command_queue queue(getQueue()());
-
-            compute::buffer first_data((*unique_first.get())());
-            compute::buffer second_data((*unique_second.get())());
-            compute::buffer out_data((*out.get())());
-
-            compute::buffer_iterator<T> first_begin(first_data, 0);
-            compute::buffer_iterator<T> first_end(first_data, unique_first.dims()[0]);
-            compute::buffer_iterator<T> second_begin(second_data, 0);
-            compute::buffer_iterator<T> second_end(second_data, unique_second.dims()[0]);
-            compute::buffer_iterator<T> out_begin(out_data, 0);
-
-            compute::buffer_iterator<T> out_end = compute::set_intersection(
-                first_begin, first_end, second_begin, second_end, out_begin, queue
-                );
-
-            out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
-            return out;
-        } catch (std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
+
+        size_t out_size = unique_first.elements() + unique_second.elements();
+        Array<T> out    = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
+
+        compute::command_queue queue(getQueue()());
+
+        compute::buffer first_data((*unique_first.get())());
+        compute::buffer second_data((*unique_second.get())());
+        compute::buffer out_data((*out.get())());
+
+        compute::buffer_iterator<type_t<T>> first_begin(first_data, 0);
+        compute::buffer_iterator<type_t<T>> first_end(first_data,
+                                                      unique_first.elements());
+        compute::buffer_iterator<type_t<T>> second_begin(second_data, 0);
+        compute::buffer_iterator<type_t<T>> second_end(
+            second_data, unique_second.elements());
+        compute::buffer_iterator<type_t<T>> out_begin(out_data, 0);
+
+        compute::buffer_iterator<type_t<T>> out_end = compute::set_union(
+            first_begin, first_end, second_begin, second_end, out_begin, queue);
+
+        out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
+        return out;
+
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
+}
+
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique) {
+    try {
+        Array<T> unique_first  = first;
+        Array<T> unique_second = second;
+
+        if (!is_unique) {
+            unique_first  = setUnique(first, false);
+            unique_second = setUnique(second, false);
         }
-    }
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
-    template Array<T> setUnion<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-    template Array<T> setIntersect<T>(const Array<T> &first, const Array<T> &second, const bool is_unique); \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+        size_t out_size =
+            std::max(unique_first.elements(), unique_second.elements());
+        Array<T> out = createEmptyArray<T>(dim4(out_size, 1, 1, 1));
+
+        compute::command_queue queue(getQueue()());
+
+        compute::buffer first_data((*unique_first.get())());
+        compute::buffer second_data((*unique_second.get())());
+        compute::buffer out_data((*out.get())());
+
+        compute::buffer_iterator<type_t<T>> first_begin(first_data, 0);
+        compute::buffer_iterator<type_t<T>> first_end(first_data,
+                                                      unique_first.elements());
+        compute::buffer_iterator<type_t<T>> second_begin(second_data, 0);
+        compute::buffer_iterator<type_t<T>> second_end(
+            second_data, unique_second.elements());
+        compute::buffer_iterator<type_t<T>> out_begin(out_data, 0);
+
+        compute::buffer_iterator<type_t<T>> out_end = compute::set_intersection(
+            first_begin, first_end, second_begin, second_end, out_begin, queue);
+
+        out.resetDims(dim4(std::distance(out_begin, out_end), 1, 1, 1));
+        return out;
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
 }
+
+#define INSTANTIATE(T)                                                        \
+    template Array<T> setUnique<T>(const Array<T> &in, const bool is_sorted); \
+    template Array<T> setUnion<T>(                                            \
+        const Array<T> &first, const Array<T> &second, const bool is_unique); \
+    template Array<T> setIntersect<T>(                                        \
+        const Array<T> &first, const Array<T> &second, const bool is_unique);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/set.hpp b/src/backend/opencl/set.hpp
index d27dd3b86d..2a3ea83594 100644
--- a/src/backend/opencl/set.hpp
+++ b/src/backend/opencl/set.hpp
@@ -7,19 +7,19 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T> Array<T> setUnique(const Array<T> &in,
-                                            const bool is_sorted);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> setUnique(const Array<T> &in, const bool is_sorted);
 
-    template<typename T> Array<T> setUnion(const Array<T> &first,
-                                           const Array<T> &second,
-                                           const bool is_unique);
+template<typename T>
+Array<T> setUnion(const Array<T> &first, const Array<T> &second,
+                  const bool is_unique);
 
-    template<typename T> Array<T> setIntersect(const Array<T> &first,
-                                               const Array<T> &second,
-                                               const bool is_unique);
-}
+template<typename T>
+Array<T> setIntersect(const Array<T> &first, const Array<T> &second,
+                      const bool is_unique);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/shift.cpp b/src/backend/opencl/shift.cpp
index c00033e08f..19e37286d3 100644
--- a/src/backend/opencl/shift.cpp
+++ b/src/backend/opencl/shift.cpp
@@ -7,36 +7,67 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <Array.hpp>
 #include <shift.hpp>
-#include <kernel/shift.hpp>
-#include <stdexcept>
+
 #include <err_opencl.hpp>
+#include <jit/ShiftNode.hpp>
+#include <traits.hpp>
+
+using af::dim4;
+using arrayfire::common::Node_ptr;
+using arrayfire::common::ShiftNodeBase;
+using arrayfire::opencl::jit::BufferNode;
+using arrayfire::opencl::jit::ShiftNode;
+using std::array;
+using std::make_shared;
+using std::static_pointer_cast;
+using std::string;
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4])
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
+namespace arrayfire {
+namespace opencl {
 
-        Array<T> out = createEmptyArray<T>(oDims);
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]) {
+    // Shift should only be the first node in the JIT tree.
+    // Force input to be evaluated so that in is always a buffer.
+    in.eval();
 
-        kernel::shift<T>(out, in, sdims);
+    string name_str("Sh");
+    name_str += shortname<T>(true);
+    const dim4 &iDims = in.dims();
+    dim4 oDims        = iDims;
 
-        return out;
+    array<int, 4> shifts{};
+    for (int i = 0; i < 4; i++) {
+        // shifts[i] is always non-negative and lies in [0, oDims[i]].
+        // Negative shifts are converted to positive ones by going the other
+        // way round.
+        shifts[i] = -(sdims[i] % static_cast<int>(oDims[i])) +
+                    oDims[i] * (sdims[i] > 0);
+        assert(shifts[i] >= 0 && shifts[i] <= oDims[i]);
     }
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]);     \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    auto node = make_shared<ShiftNode>(
+        static_cast<af::dtype>(dtype_traits<T>::af_type),
+        static_pointer_cast<BufferNode>(in.getNode()), shifts);
+    return createNodeArray<T>(oDims, common::Node_ptr(node));
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> shift<T>(const Array<T> &in, const int sdims[4]);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/shift.hpp b/src/backend/opencl/shift.hpp
index 26603362eb..1797d6d1a7 100644
--- a/src/backend/opencl/shift.hpp
+++ b/src/backend/opencl/shift.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> shift(const Array<T> &in, const int sdims[4]);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> shift(const Array<T> &in, const int sdims[4]);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sift.cpp b/src/backend/opencl/sift.cpp
new file mode 100644
index 0000000000..d4b32c3820
--- /dev/null
+++ b/src/backend/opencl/sift.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sift.hpp>
+
+#include <kernel/sift.hpp>
+#include <math.hpp>
+
+using af::dim4;
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x_out, Array<float>& y_out, Array<float>& score_out,
+              Array<float>& ori_out, Array<float>& size_out,
+              Array<float>& desc_out, const Array<T>& in,
+              const unsigned n_layers, const float contrast_thr,
+              const float edge_thr, const float init_sigma,
+              const bool double_input, const float img_scale,
+              const float feature_ratio, const bool compute_GLOH) {
+    unsigned nfeat_out;
+    unsigned desc_len;
+
+    Param x;
+    Param y;
+    Param score;
+    Param ori;
+    Param size;
+    Param desc;
+
+    kernel::sift<T, convAccT>(&nfeat_out, &desc_len, x, y, score, ori, size,
+                              desc, in, n_layers, contrast_thr, edge_thr,
+                              init_sigma, double_input, img_scale,
+                              feature_ratio, compute_GLOH);
+
+    if (nfeat_out > 0) {
+        const dim4 out_dims(nfeat_out);
+        const dim4 desc_dims(desc_len, nfeat_out);
+
+        x_out     = createParamArray<float>(x, true);
+        y_out     = createParamArray<float>(y, true);
+        score_out = createParamArray<float>(score, true);
+        ori_out   = createParamArray<float>(ori, true);
+        size_out  = createParamArray<float>(size, true);
+        desc_out  = createParamArray<float>(desc, true);
+    }
+
+    return nfeat_out;
+}
+
+#define INSTANTIATE(T, convAccT)                                              \
+    template unsigned sift<T, convAccT>(                                      \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        Array<float> & ori_out, Array<float> & size_out,                      \
+        Array<float> & desc_out, const Array<T>& in, const unsigned n_layers, \
+        const float contrast_thr, const float edge_thr,                       \
+        const float init_sigma, const bool double_input,                      \
+        const float img_scale, const float feature_ratio,                     \
+        const bool compute_GLOH);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sift.hpp b/src/backend/opencl/sift.hpp
new file mode 100644
index 0000000000..078841bf69
--- /dev/null
+++ b/src/backend/opencl/sift.hpp
@@ -0,0 +1,28 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename convAccT>
+unsigned sift(Array<float>& x, Array<float>& y, Array<float>& score,
+              Array<float>& ori, Array<float>& size, Array<float>& desc,
+              const Array<T>& in, const unsigned n_layers,
+              const float contrast_thr, const float edge_thr,
+              const float init_sigma, const bool double_input,
+              const float img_scale, const float feature_ratio,
+              const bool compute_GLOH);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sobel.cpp b/src/backend/opencl/sobel.cpp
index a8c76f9636..a7651de07d 100644
--- a/src/backend/opencl/sobel.cpp
+++ b/src/backend/opencl/sobel.cpp
@@ -7,42 +7,43 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
-#include <sobel.hpp>
-#include <kernel/sobel.hpp>
 #include <err_opencl.hpp>
+#include <kernel/sobel.hpp>
+#include <sobel.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size)
-{
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size) {
     Array<To> dx = createEmptyArray<To>(img.dims());
     Array<To> dy = createEmptyArray<To>(img.dims());
 
-    switch(ker_size) {
+    switch (ker_size) {
         case 3: kernel::sobel<Ti, To, 3>(dx, dy, img); break;
     }
 
     return std::make_pair(dx, dy);
 }
 
-#define INSTANTIATE(Ti, To)                                             \
-    template std::pair< Array<To>, Array<To> >                            \
-    sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
+#define INSTANTIATE(Ti, To)                                    \
+    template std::pair<Array<To>, Array<To>> sobelDerivatives( \
+        const Array<Ti> &img, const unsigned &ker_size);
 
-INSTANTIATE(float , float)
+INSTANTIATE(float, float)
 INSTANTIATE(double, double)
-INSTANTIATE(int   , int)
-INSTANTIATE(uint  , int)
-INSTANTIATE(char  , int)
-INSTANTIATE(uchar , int)
-
-}
+INSTANTIATE(int, int)
+INSTANTIATE(uint, int)
+INSTANTIATE(char, int)
+INSTANTIATE(schar, int)
+INSTANTIATE(uchar, int)
+INSTANTIATE(short, int)
+INSTANTIATE(ushort, int)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sobel.hpp b/src/backend/opencl/sobel.hpp
index 1145d9b9a8..74ccb2ebcf 100644
--- a/src/backend/opencl/sobel.hpp
+++ b/src/backend/opencl/sobel.hpp
@@ -10,11 +10,12 @@
 #include <Array.hpp>
 #include <utility>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename Ti, typename To>
-std::pair< Array<To>, Array<To> >
-sobelDerivatives(const Array<Ti> &img, const unsigned &ker_size);
+std::pair<Array<To>, Array<To>> sobelDerivatives(const Array<Ti> &img,
+                                                 const unsigned &ker_size);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/solve.cpp b/src/backend/opencl/solve.cpp
index 34a357eda9..e6e7aa99ea 100644
--- a/src/backend/opencl/solve.cpp
+++ b/src/backend/opencl/solve.cpp
@@ -7,96 +7,109 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <err_common.hpp>
 #include <solve.hpp>
 
-#if defined(WITH_OPENCL_LINEAR_ALGEBRA)
+#include <err_opencl.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <blas.hpp>
+#include <copy.hpp>
+#include <cpu/cpu_solve.hpp>
+#include <lu.hpp>
 #include <magma/magma.h>
 #include <magma/magma_blas.h>
 #include <magma/magma_data.h>
 #include <magma/magma_helper.h>
-#include <lu.hpp>
-#include <copy.hpp>
-#include <err_opencl.hpp>
-#include <blas.hpp>
-#include <transpose.hpp>
 #include <math.hpp>
+#include <platform.hpp>
+#include <transpose.hpp>
+#include <af/opencl.h>
 
 #include <algorithm>
-#include <string>
+#include <vector>
+
+using cl::Buffer;
+using std::min;
+using std::vector;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    int N = A.dims()[0];
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    if (OpenCLCPUOffload()) { return cpu::solveLU(A, pivot, b, options); }
+
+    int N    = A.dims()[0];
     int NRHS = b.dims()[1];
 
-    std::vector<int> ipiv(N);
+    vector<int> ipiv(N);
     copyData(&ipiv[0], pivot);
 
-    Array< T > B = copyArray<T>(b);
+    Array<T> B = copyArray<T>(b);
 
-    const cl::Buffer *A_buf = A.get();
-    cl::Buffer *B_buf = B.get();
+    const Buffer *A_buf = A.get();
+    Buffer *B_buf       = B.get();
 
     int info = 0;
-    magma_getrs_gpu<T>(MagmaNoTrans, N, NRHS,
-                       (*A_buf)(), A.getOffset(), A.strides()[1],
-                       &ipiv[0],
-                       (*B_buf)(), B.getOffset(), B.strides()[1],
-                       getQueue()(), &info);
+    magma_getrs_gpu<T>(MagmaNoTrans, N, NRHS, (*A_buf)(), A.getOffset(),
+                       A.strides()[1], &ipiv[0], (*B_buf)(), B.getOffset(),
+                       B.strides()[1], getQueue()(), &info);
     return B;
 }
 
 template<typename T>
-Array<T> generalSolve(const Array<T> &a, const Array<T> &b)
-{
-
-    dim4 iDims = a.dims();
-    int M = iDims[0];
-    int N = iDims[1];
-    int MN = std::min(M, N);
-    std::vector<int> ipiv(MN);
+Array<T> generalSolve(const Array<T> &a, const Array<T> &b) {
+    dim4 aDims = a.dims();
+    int batchz = aDims[2];
+    int batchw = aDims[3];
 
     Array<T> A = copyArray<T>(a);
     Array<T> B = copyArray<T>(b);
 
-    cl::Buffer *A_buf = A.get();
-    int info = 0;
-    magma_getrf_gpu<T>(M, N, (*A_buf)(), A.getOffset(), A.strides()[1],
-                       &ipiv[0], getQueue()(), &info);
-
-    cl::Buffer *B_buf = B.get();
-    int K = B.dims()[1];
-    magma_getrs_gpu<T>(MagmaNoTrans, M, K,
-                       (*A_buf)(), A.getOffset(), A.strides()[1],
-                       &ipiv[0],
-                       (*B_buf)(), B.getOffset(), B.strides()[1],
-                       getQueue()(), &info);
+    for (int i = 0; i < batchw; i++) {
+        for (int j = 0; j < batchz; j++) {
+            int M  = aDims[0];
+            int N  = aDims[1];
+            int MN = min(M, N);
+            vector<int> ipiv(MN);
+
+            Buffer *A_buf      = A.get();
+            int info           = 0;
+            cl_command_queue q = getQueue()();
+            auto aoffset =
+                A.getOffset() + j * A.strides()[2] + i * A.strides()[3];
+            magma_getrf_gpu<T>(M, N, (*A_buf)(), aoffset, A.strides()[1],
+                               &ipiv[0], q, &info);
+
+            Buffer *B_buf = B.get();
+            int K         = B.dims()[1];
+
+            auto boffset =
+                B.getOffset() + j * B.strides()[2] + i * B.strides()[3];
+            magma_getrs_gpu<T>(MagmaNoTrans, M, K, (*A_buf)(), aoffset,
+                               A.strides()[1], &ipiv[0], (*B_buf)(), boffset,
+                               B.strides()[1], q, &info);
+        }
+    }
     return B;
 }
 
 template<typename T>
-Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
-{
-    int M = a.dims()[0];
-    int N = a.dims()[1];
-    int K = b.dims()[1];
-    int MN = std::min(M, N);
+Array<T> leastSquares(const Array<T> &a, const Array<T> &b) {
+    int M  = a.dims()[0];
+    int N  = a.dims()[1];
+    int K  = b.dims()[1];
+    int MN = min(M, N);
 
     Array<T> B = createEmptyArray<T>(dim4());
-    trsm_func<T> gpu_trsm;
+    gpu_blas_trsm_func<T> gpu_blas_trsm;
 
     cl_event event;
     cl_command_queue queue = getQueue()();
 
     if (M < N) {
-
-#define UNMQR 0 // FIXME: UNMQR == 1 should be faster but does not work
+#define UNMQR 0  // FIXME: UNMQR == 1 should be faster but does not work
 
         // Least squares for this case is solved using the following
         // solve(A, B) == matmul(Q, Xpad);
@@ -110,65 +123,68 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
         Array<T> A = transpose<T>(a, true);
 
 #if UNMQR
-        B = padArray<T, T>(b, dim4(N, K), scalar<T>(0));
+        const dim4 NullShape(0, 0, 0, 0);
+        dim4 endPadding(N - b.dims()[0], K - b.dims()[1], 0, 0);
+        B = (endPadding == NullShape
+                 ? copyArray(b)
+                 : padArrayBorders(b, NullShape, endPadding, AF_PAD_ZERO));
         B.resetDims(dim4(M, K));
 #else
         B = copyArray<T>(b);
 #endif
 
-        int NB = magma_get_geqrf_nb<T>(A.dims()[1]);
-        int NUM = (2*MN + ((M+31)/32)*32)*NB;
+        int NB       = magma_get_geqrf_nb<T>(A.dims()[1]);
+        int NUM      = (2 * MN + ((M + 31) / 32) * 32) * NB;
         Array<T> tmp = createEmptyArray<T>(dim4(NUM));
 
-        std::vector<T> h_tau(MN);
+        vector<T> h_tau(MN);
 
-        int info = 0;
-        cl::Buffer *dA = A.get();
-        cl::Buffer *dT = tmp.get();
-        cl::Buffer *dB = B.get();
+        int info   = 0;
+        Buffer *dA = A.get();
+        Buffer *dT = tmp.get();
+        Buffer *dB = B.get();
 
-        magma_geqrf3_gpu<T>(A.dims()[0], A.dims()[1],
-                           (*dA)(), A.getOffset(), A.strides()[1],
-                           &h_tau[0], (*dT)(), tmp.getOffset(), getQueue()(), &info);
+        magma_geqrf3_gpu<T>(A.dims()[0], A.dims()[1], (*dA)(), A.getOffset(),
+                            A.strides()[1], &h_tau[0], (*dT)(), tmp.getOffset(),
+                            getQueue()(), &info);
 
         A.resetDims(dim4(M, M));
 
-        magmablas_swapdblk<T>(MN-1, NB,
-                              (*dA)(), A.getOffset(), A.strides()[1], 1,
-                              (*dT)(), tmp.getOffset() + MN * NB, NB, 0, queue);
+        magmablas_swapdblk<T>(MN - 1, NB, (*dA)(), A.getOffset(),
+                              A.strides()[1], 1, (*dT)(),
+                              tmp.getOffset() + MN * NB, NB, 0, queue);
 
-        gpu_trsm(clblasColumnMajor,
-                 clblasLeft, clblasUpper,
-                 clblasConjTrans, clblasNonUnit,
-                 B.dims()[0], B.dims()[1],
-                 scalar<T>(1),
-                 (*dA)(), A.getOffset(), A.strides()[1],
-                 (*dB)(), B.getOffset(), B.strides()[1],
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(
+            gpu_blas_trsm(OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_UPPER,
+                          OPENCL_BLAS_CONJ_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL,
+                          B.dims()[0], B.dims()[1], scalar<T>(1), (*dA)(),
+                          A.getOffset(), A.strides()[1], (*dB)(), B.getOffset(),
+                          B.strides()[1], 1, &queue, 0, nullptr, &event));
 
-        magmablas_swapdblk<T>(MN - 1, NB,
-                              (*dT)(), tmp.getOffset() + MN * NB, NB, 0,
-                              (*dA)(), A.getOffset(), A.strides()[1], 1, queue);
+        magmablas_swapdblk<T>(MN - 1, NB, (*dT)(), tmp.getOffset() + MN * NB,
+                              NB, 0, (*dA)(), A.getOffset(), A.strides()[1], 1,
+                              queue);
 
 #if UNMQR
-        int lwork = (B.dims()[0]-A.dims()[0]+NB)*(B.dims()[1]+2*NB);
-        std::vector<T> h_work(lwork);
+        int lwork = (B.dims()[0] - A.dims()[0] + NB) * (B.dims()[1] + 2 * NB);
+        vector<T> h_work(lwork);
         B.resetDims(dim4(N, K));
-        magma_unmqr_gpu<T>(MagmaLeft, MagmaNoTrans,
-                           B.dims()[0], B.dims()[1], A.dims()[0],
-                           (*dA)(), A.getOffset(), A.strides()[1],
-                           &h_tau[0],
-                           (*dB)(), B.getOffset(), B.strides()[1],
-                           &h_work[0], lwork,
-                           (*dT)(), tmp.getOffset(), NB, queue, &info);
+        magma_unmqr_gpu<T>(MagmaLeft, MagmaNoTrans, B.dims()[0], B.dims()[1],
+                           A.dims()[0], (*dA)(), A.getOffset(), A.strides()[1],
+                           &h_tau[0], (*dB)(), B.getOffset(), B.strides()[1],
+                           &h_work[0], lwork, (*dT)(), tmp.getOffset(), NB,
+                           queue, &info);
 #else
         A.resetDims(dim4(N, M));
-        magma_ungqr_gpu<T>(A.dims()[0], A.dims()[1], std::min(M, N),
-                           (*dA)(), A.getOffset(), A.strides()[1],
-                           &h_tau[0],
-                           (*dT)(), tmp.getOffset(), NB, queue, &info);
-
-        B = matmul(A, B, AF_MAT_NONE, AF_MAT_NONE);
+        magma_ungqr_gpu<T>(A.dims()[0], A.dims()[1], min(M, N), (*dA)(),
+                           A.getOffset(), A.strides()[1], &h_tau[0], (*dT)(),
+                           tmp.getOffset(), NB, queue, &info);
+
+        Array<T> B_new = createEmptyArray<T>(dim4(A.dims()[0], B.dims()[1]));
+        T alpha        = scalar<T>(1.0);
+        T beta         = scalar<T>(0.0);
+        gemm<T>(B_new, AF_MAT_NONE, AF_MAT_NONE, &alpha, A, B, &beta);
+        B = B_new;
 #endif
     } else if (M > N) {
         // Least squares for this case is solved using the following
@@ -180,64 +196,56 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
         // A  == matmul(Q, R);
 
         Array<T> A = copyArray<T>(a);
-        B = copyArray(b);
+        B          = copyArray(b);
 
-        int MN = std::min(M, N);
+        int MN = min(M, N);
         int NB = magma_get_geqrf_nb<T>(M);
 
-        int NUM = (2*MN + ((N+31)/32)*32)*NB;
+        int NUM      = (2 * MN + ((N + 31) / 32) * 32) * NB;
         Array<T> tmp = createEmptyArray<T>(dim4(NUM));
 
-        std::vector<T> h_tau(NUM);
+        vector<T> h_tau(NUM);
 
-        int info = 0;
-        cl::Buffer *A_buf = A.get();
-        cl::Buffer *B_buf = B.get();
-        cl::Buffer *dT = tmp.get();
+        int info      = 0;
+        Buffer *A_buf = A.get();
+        Buffer *B_buf = B.get();
+        Buffer *dT    = tmp.get();
 
-        magma_geqrf3_gpu<T>(M, N,
-                           (*A_buf)(), A.getOffset(), A.strides()[1],
-                           &h_tau[0], (*dT)(), tmp.getOffset(), getQueue()(), &info);
+        magma_geqrf3_gpu<T>(M, N, (*A_buf)(), A.getOffset(), A.strides()[1],
+                            &h_tau[0], (*dT)(), tmp.getOffset(), getQueue()(),
+                            &info);
 
-        int NRHS = B.dims()[1];
+        int NRHS   = B.dims()[1];
         int lhwork = (M - N + NB) * (NRHS + NB) + NRHS * NB;
 
-        std::vector<T> h_work(lhwork);
+        vector<T> h_work(lhwork);
         h_work[0] = scalar<T>(lhwork);
 
-        magma_unmqr_gpu<T>(MagmaLeft, MagmaConjTrans,
-                           M, NRHS, N,
-                           (*A_buf)(), A.getOffset(), A.strides()[1],
-                           &h_tau[0],
-                           (*B_buf)(), B.getOffset(), B.strides()[1],
-                           &h_work[0], lhwork,
-                           (*dT)(), tmp.getOffset(), NB,
-                           queue, &info);
+        magma_unmqr_gpu<T>(MagmaLeft, MagmaConjTrans, M, NRHS, N, (*A_buf)(),
+                           A.getOffset(), A.strides()[1], &h_tau[0], (*B_buf)(),
+                           B.getOffset(), B.strides()[1], &h_work[0], lhwork,
+                           (*dT)(), tmp.getOffset(), NB, queue, &info);
 
-        magmablas_swapdblk<T>(MN - 1, NB,
-                              (*A_buf)(), A.getOffset(), A.strides()[1], 1,
-                              (*dT)(), tmp.getOffset() + NB * MN,
-                              NB, 0, queue);
-
-
-        std::string pName = getPlatformName(getDevice());
-        if(pName.find("NVIDIA") != std::string::npos)
-        {
-            Array<T> AT = transpose<T>(A, true);
-            cl::Buffer* AT_buf = AT.get();
-            gpu_trsm(clblasColumnMajor,
-                     clblasLeft, clblasLower, clblasConjTrans, clblasNonUnit,
-                     N, NRHS, scalar<T>(1),
-                     (*AT_buf)(), AT.getOffset(), AT.strides()[1],
-                     (*B_buf)(), B.getOffset(), B.strides()[1],
-                     1, &queue, 0, nullptr, &event);
+        magmablas_swapdblk<T>(MN - 1, NB, (*A_buf)(), A.getOffset(),
+                              A.strides()[1], 1, (*dT)(),
+                              tmp.getOffset() + NB * MN, NB, 0, queue);
+
+        if (getActivePlatformVendor() == AFCL_PLATFORM_NVIDIA) {
+            Array<T> AT    = transpose<T>(A, true);
+            Buffer *AT_buf = AT.get();
+            OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER,
+                OPENCL_BLAS_CONJ_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL, N, NRHS,
+                scalar<T>(1), (*AT_buf)(), AT.getOffset(), AT.strides()[1],
+                (*B_buf)(), B.getOffset(), B.strides()[1], 1, &queue, 0,
+                nullptr, &event));
         } else {
-            gpu_trsm(clblasColumnMajor,
-                     clblasLeft, clblasUpper, clblasNoTrans, clblasNonUnit,
-                     N, NRHS, scalar<T>(1),
-                     (*A_buf)(), A.getOffset(), A.strides()[1],
-                     (*B_buf)(), B.getOffset(), B.strides()[1],
-                     1, &queue, 0, nullptr, &event);
+            OPENCL_BLAS_CHECK(gpu_blas_trsm(
+                OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_UPPER,
+                OPENCL_BLAS_NO_TRANS, OPENCL_BLAS_NON_UNIT_DIAGONAL, N, NRHS,
+                scalar<T>(1), (*A_buf)(), A.getOffset(), A.strides()[1],
+                (*B_buf)(), B.getOffset(), B.strides()[1], 1, &queue, 0,
+                nullptr, &event));
         }
         B.resetDims(dim4(N, K));
     }
@@ -246,116 +254,110 @@ Array<T> leastSquares(const Array<T> &a, const Array<T> &b)
 }
 
 template<typename T>
-Array<T> triangleSolve(const Array<T> &A, const Array<T> &b, const af_mat_prop options)
-{
-    trsm_func<T> gpu_trsm;
+Array<T> triangleSolve(const Array<T> &A, const Array<T> &b,
+                       const af_mat_prop options) {
+    gpu_blas_trsm_func<T> gpu_blas_trsm;
 
     Array<T> B = copyArray<T>(b);
 
-    int N = B.dims()[0];
+    int N    = B.dims()[0];
     int NRHS = B.dims()[1];
 
-    const cl::Buffer* A_buf = A.get();
-    cl::Buffer* B_buf = B.get();
+    const Buffer *A_buf = A.get();
+    Buffer *B_buf       = B.get();
 
-    cl_event event = 0;
+    cl_event event         = 0;
     cl_command_queue queue = getQueue()();
 
-    std::string pName = getPlatformName(getDevice());
-    if(pName.find("NVIDIA") != std::string::npos && (options & AF_MAT_UPPER))
-    {
+    if (getActivePlatformVendor() == AFCL_PLATFORM_NVIDIA &&
+        (options & AF_MAT_UPPER)) {
         Array<T> AT = transpose<T>(A, true);
 
-        cl::Buffer* AT_buf = AT.get();
-        gpu_trsm(clblasColumnMajor,
-                 clblasLeft,
-                 clblasLower,
-                 clblasConjTrans,
-                 options & AF_MAT_DIAG_UNIT ? clblasUnit : clblasNonUnit,
-                 N, NRHS, scalar<T>(1),
-                 (*AT_buf)(), AT.getOffset(), AT.strides()[1],
-                 (*B_buf)(), B.getOffset(), B.strides()[1],
-                 1, &queue, 0, nullptr, &event);
+        Buffer *AT_buf = AT.get();
+        OPENCL_BLAS_CHECK(gpu_blas_trsm(
+            OPENCL_BLAS_SIDE_LEFT, OPENCL_BLAS_TRIANGLE_LOWER,
+            OPENCL_BLAS_CONJ_TRANS,
+            options & AF_MAT_DIAG_UNIT ? OPENCL_BLAS_UNIT_DIAGONAL
+                                       : OPENCL_BLAS_NON_UNIT_DIAGONAL,
+            N, NRHS, scalar<T>(1), (*AT_buf)(), AT.getOffset(), AT.strides()[1],
+            (*B_buf)(), B.getOffset(), B.strides()[1], 1, &queue, 0, nullptr,
+            &event));
     } else {
-        gpu_trsm(clblasColumnMajor,
-                 clblasLeft,
-                 options & AF_MAT_LOWER ? clblasLower : clblasUpper,
-                 clblasNoTrans,
-                 options & AF_MAT_DIAG_UNIT ? clblasUnit : clblasNonUnit,
-                 N, NRHS, scalar<T>(1),
-                 (*A_buf)(), A.getOffset(), A.strides()[1],
-                 (*B_buf)(), B.getOffset(), B.strides()[1],
-                 1, &queue, 0, nullptr, &event);
+        OPENCL_BLAS_CHECK(gpu_blas_trsm(
+            OPENCL_BLAS_SIDE_LEFT,
+            options & AF_MAT_LOWER ? OPENCL_BLAS_TRIANGLE_LOWER
+                                   : OPENCL_BLAS_TRIANGLE_UPPER,
+            OPENCL_BLAS_NO_TRANS,
+            options & AF_MAT_DIAG_UNIT ? OPENCL_BLAS_UNIT_DIAGONAL
+                                       : OPENCL_BLAS_NON_UNIT_DIAGONAL,
+            N, NRHS, scalar<T>(1), (*A_buf)(), A.getOffset(), A.strides()[1],
+            (*B_buf)(), B.getOffset(), B.strides()[1], 1, &queue, 0, nullptr,
+            &event));
     }
 
     return B;
 }
 
-
 template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
-    try {
-        initBlas();
-
-        if (options & AF_MAT_UPPER ||
-            options & AF_MAT_LOWER) {
-            return triangleSolve<T>(a, b, options);
-        }
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    if (OpenCLCPUOffload()) { return cpu::solve(a, b, options); }
 
-        if(a.dims()[0] == a.dims()[1]) {
-            return generalSolve<T>(a, b);
-        } else {
-            return leastSquares<T>(a, b);
-        }
-    } catch(cl::Error &err) {
-        CL_TO_AF_ERROR(err);
+    if (options & AF_MAT_UPPER || options & AF_MAT_LOWER) {
+        return triangleSolve<T>(a, b, options);
+    }
+
+    if (a.dims()[0] == a.dims()[1]) {
+        return generalSolve<T>(a, b);
+    } else {
+        return leastSquares<T>(a, b);
     }
 }
 
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
     template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
 
 INSTANTIATE_SOLVE(float)
 INSTANTIATE_SOLVE(cfloat)
 INSTANTIATE_SOLVE(double)
 INSTANTIATE_SOLVE(cdouble)
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#else
+#else  // WITH_LINEAR_ALGEBRA
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> solveLU(const Array<T> &A, const Array<int> &pivot,
-                 const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on OpenCL",
-             AF_ERR_NOT_CONFIGURED);
+Array<T> solveLU(const Array<T> &A, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options) {
+    AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
 template<typename T>
-Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options)
-{
-    AF_ERROR("Linear Algebra is diabled on OpenCL",
-              AF_ERR_NOT_CONFIGURED);
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options) {
+    AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
 }
 
-#define INSTANTIATE_SOLVE(T)                                            \
-    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,    \
-                               const af_mat_prop options);              \
+#define INSTANTIATE_SOLVE(T)                                                 \
+    template Array<T> solve<T>(const Array<T> &a, const Array<T> &b,         \
+                               const af_mat_prop options);                   \
     template Array<T> solveLU<T>(const Array<T> &A, const Array<int> &pivot, \
-                                 const Array<T> &b, const af_mat_prop options); \
+                                 const Array<T> &b,                          \
+                                 const af_mat_prop options);
 
 INSTANTIATE_SOLVE(float)
 INSTANTIATE_SOLVE(cfloat)
 INSTANTIATE_SOLVE(double)
 INSTANTIATE_SOLVE(cdouble)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
 
-#endif
+#endif  // WITH_LINEAR_ALGEBRA
diff --git a/src/backend/opencl/solve.hpp b/src/backend/opencl/solve.hpp
index f3d234bbf3..390871856c 100644
--- a/src/backend/opencl/solve.hpp
+++ b/src/backend/opencl/solve.hpp
@@ -7,15 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> solve(const Array<T> &a, const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> solve(const Array<T> &a, const Array<T> &b,
+               const af_mat_prop options = AF_MAT_NONE);
 
-    template<typename T>
-    Array<T> solveLU(const Array<T> &a, const Array<int> &pivot,
-                     const Array<T> &b, const af_mat_prop options = AF_MAT_NONE);
-}
+template<typename T>
+Array<T> solveLU(const Array<T> &a, const Array<int> &pivot, const Array<T> &b,
+                 const af_mat_prop options = AF_MAT_NONE);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort.cpp b/src/backend/opencl/sort.cpp
index 33c4f83257..e2bfcaa057 100644
--- a/src/backend/opencl/sort.cpp
+++ b/src/backend/opencl/sort.cpp
@@ -8,40 +8,60 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <sort.hpp>
 #include <copy.hpp>
+#include <err_opencl.hpp>
 #include <kernel/sort.hpp>
 #include <math.hpp>
+#include <reorder.hpp>
+#include <sort.hpp>
 #include <stdexcept>
-#include <err_opencl.hpp>
 
-namespace opencl
-{
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim)
-    {
-        try {
-            Array<T> out = copyArray<T>(in);
-            switch(dim) {
-            case 0: kernel::sort0<T, isAscending>(out);
-                break;
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending) {
+    try {
+        Array<T> out = copyArray<T>(in);
+        switch (dim) {
+            case 0: kernel::sort0<T>(out, isAscending); break;
+            case 1: kernel::sortBatched<T>(out, 1, isAscending); break;
+            case 2: kernel::sortBatched<T>(out, 2, isAscending); break;
+            case 3: kernel::sortBatched<T>(out, 3, isAscending); break;
             default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+        }
+
+        if (dim != 0) {
+            af::dim4 preorderDims = out.dims();
+            af::dim4 reorderDims(0, 1, 2, 3);
+            reorderDims[dim] = 0;
+            preorderDims[0]  = out.dims()[dim];
+            for (int i = 1; i <= static_cast<int>(dim); i++) {
+                reorderDims[i - 1] = i;
+                preorderDims[i]    = out.dims()[i - 1];
             }
-            return out;
-        } catch (std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
+
+            out.setDataDims(preorderDims);
+            out = reorder<T>(out, reorderDims);
         }
-    }
+        return out;
+    } catch (std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
+}
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> sort<T, true>(const Array<T> &in, const unsigned dim); \
-    template Array<T> sort<T,false>(const Array<T> &in, const unsigned dim); \
+#define INSTANTIATE(T)                                                \
+    template Array<T> sort<T>(const Array<T> &in, const unsigned dim, \
+                              bool isAscending);
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort.hpp b/src/backend/opencl/sort.hpp
index a63dc38495..092995aeec 100644
--- a/src/backend/opencl/sort.hpp
+++ b/src/backend/opencl/sort.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T, bool isAscending>
-    Array<T> sort(const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> sort(const Array<T> &in, const unsigned dim, bool isAscending);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort_by_key.cpp b/src/backend/opencl/sort_by_key.cpp
index 90db7f1fd6..f1a89aef4d 100644
--- a/src/backend/opencl/sort_by_key.cpp
+++ b/src/backend/opencl/sort_by_key.cpp
@@ -9,53 +9,81 @@
 
 #include <Array.hpp>
 #include <copy.hpp>
-#include <sort_by_key.hpp>
+#include <err_opencl.hpp>
 #include <kernel/sort_by_key.hpp>
 #include <math.hpp>
+#include <reorder.hpp>
+#include <sort_by_key.hpp>
 #include <stdexcept>
-#include <err_opencl.hpp>
 
-namespace opencl
-{
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const unsigned dim)
-    {
-        try {
-            okey = copyArray<Tk>(ikey);
-            oval = copyArray<Tv>(ival);
-            switch(dim) {
-            case 0: kernel::sort0_by_key<Tk, Tv, isAscending>(okey, oval);
+namespace arrayfire {
+namespace opencl {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending) {
+    try {
+        okey = copyArray<Tk>(ikey);
+        oval = copyArray<Tv>(ival);
+
+        switch (dim) {
+            case 0: kernel::sort0ByKey<Tk, Tv>(okey, oval, isAscending); break;
+            case 1:
+            case 2:
+            case 3:
+                kernel::sortByKeyBatched<Tk, Tv>(okey, oval, dim, isAscending);
                 break;
             default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
+        }
+
+        if (dim != 0) {
+            af::dim4 preorderDims = okey.dims();
+            af::dim4 reorderDims(0, 1, 2, 3);
+            reorderDims[dim] = 0;
+            preorderDims[0]  = okey.dims()[dim];
+            for (unsigned i = 1; i <= dim; i++) {
+                reorderDims[i - 1] = i;
+                preorderDims[i]    = okey.dims()[i - 1];
             }
-        }catch(std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
+
+            okey.setDataDims(preorderDims);
+            oval.setDataDims(preorderDims);
+
+            okey = reorder<Tk>(okey, reorderDims);
+            oval = reorder<Tv>(oval, reorderDims);
         }
-    }
-
-#define INSTANTIATE(Tk, Tv)                                             \
-    template void                                                       \
-    sort_by_key<Tk, Tv, true>(Array<Tk> &okey, Array<Tv> &oval,         \
-                              const Array<Tk> &ikey, const Array<Tv> &ival, \
-                              const unsigned dim);                      \
-    template void                                                       \
-    sort_by_key<Tk, Tv,false>(Array<Tk> &okey, Array<Tv> &oval,         \
-                              const Array<Tk> &ikey, const Array<Tv> &ival, \
-                              const unsigned dim);                      \
-
-#define INSTANTIATE1(Tk)       \
-    INSTANTIATE(Tk, float)     \
-    INSTANTIATE(Tk, double)    \
-    INSTANTIATE(Tk, int)       \
-    INSTANTIATE(Tk, uint)      \
-    INSTANTIATE(Tk, char)      \
-    INSTANTIATE(Tk, uchar)     \
-
-    INSTANTIATE1(float)
-    INSTANTIATE1(double)
-    INSTANTIATE1(int)
-    INSTANTIATE1(uint)
-    INSTANTIATE1(char)
-    INSTANTIATE1(uchar)
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
 }
+
+#define INSTANTIATE(Tk, Tv)                                        \
+    template void sort_by_key<Tk, Tv>(                             \
+        Array<Tk> & okey, Array<Tv> & oval, const Array<Tk> &ikey, \
+        const Array<Tv> &ival, const uint dim, bool isAscending);
+
+#define INSTANTIATE1(Tk)     \
+    INSTANTIATE(Tk, float)   \
+    INSTANTIATE(Tk, double)  \
+    INSTANTIATE(Tk, cfloat)  \
+    INSTANTIATE(Tk, cdouble) \
+    INSTANTIATE(Tk, int)     \
+    INSTANTIATE(Tk, uint)    \
+    INSTANTIATE(Tk, short)   \
+    INSTANTIATE(Tk, ushort)  \
+    INSTANTIATE(Tk, char)    \
+    INSTANTIATE(Tk, schar)   \
+    INSTANTIATE(Tk, uchar)   \
+    INSTANTIATE(Tk, intl)    \
+    INSTANTIATE(Tk, uintl)
+
+INSTANTIATE1(float)
+INSTANTIATE1(double)
+INSTANTIATE1(int)
+INSTANTIATE1(uint)
+INSTANTIATE1(short)
+INSTANTIATE1(ushort)
+INSTANTIATE1(char)
+INSTANTIATE1(schar)
+INSTANTIATE1(uchar)
+INSTANTIATE1(intl)
+INSTANTIATE1(uintl)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort_by_key.hpp b/src/backend/opencl/sort_by_key.hpp
index a3380daf55..78223de9be 100644
--- a/src/backend/opencl/sort_by_key.hpp
+++ b/src/backend/opencl/sort_by_key.hpp
@@ -7,12 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename Tk, typename Tv, bool isAscending>
-    void sort_by_key(Array<Tk> &okey, Array<Tv> &oval,
-               const Array<Tk> &ikey, const Array<Tv> &ival, const unsigned dim);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename Tk, typename Tv>
+void sort_by_key(Array<Tk> &okey, Array<Tv> &oval, const Array<Tk> &ikey,
+                 const Array<Tv> &ival, const unsigned dim, bool isAscending);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort_index.cpp b/src/backend/opencl/sort_index.cpp
index ebbd9f543c..afd8bf8413 100644
--- a/src/backend/opencl/sort_index.cpp
+++ b/src/backend/opencl/sort_index.cpp
@@ -8,42 +8,81 @@
  ********************************************************/
 
 #include <Array.hpp>
-#include <sort_index.hpp>
+#include <common/half.hpp>
 #include <copy.hpp>
-#include <kernel/sort_index.hpp>
+#include <err_opencl.hpp>
+#include <kernel/sort_by_key.hpp>
 #include <math.hpp>
+#include <range.hpp>
+#include <reorder.hpp>
+#include <sort_index.hpp>
 #include <stdexcept>
-#include <err_opencl.hpp>
 
-namespace opencl
-{
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<uint> &idx, const Array<T> &in, const uint dim)
-    {
-        try {
-            val = copyArray<T>(in);
-            idx = createEmptyArray<uint>(in.dims());
-
-            switch(dim) {
-            case 0: kernel::sort0_index<T, isAscending>(val, idx);
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void sort_index(Array<T> &okey, Array<uint> &oval, const Array<T> &in,
+                const uint dim, bool isAscending) {
+
+    // TODO: fix the half implementation of sort0ByKey to support this
+    if (std::is_same_v<T, half>) {
+        OPENCL_NOT_SUPPORTED("sort_index with half");
+    }
+
+    try {
+        // okey contains values, oval contains indices
+        okey = copyArray<T>(in);
+        oval = range<uint>(in.dims(), dim);
+        oval.eval();
+
+        switch (dim) {
+            case 0: kernel::sort0ByKey<T, uint>(okey, oval, isAscending); break;
+            case 1:
+            case 2:
+            case 3:
+                kernel::sortByKeyBatched<T, uint>(okey, oval, dim, isAscending);
                 break;
             default: AF_ERROR("Not Supported", AF_ERR_NOT_SUPPORTED);
-            }
-        }         catch (std::exception &ex) {
-            AF_ERROR(ex.what(), AF_ERR_INTERNAL);
         }
-    }
-#define INSTANTIATE(T)                                                  \
-    template void sort_index<T, true>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
-    template void sort_index<T,false>(Array<T> &val, Array<uint> &idx, const Array<T> &in, \
-                                      const uint dim);                  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
 
+        if (dim != 0) {
+            af::dim4 preorderDims = okey.dims();
+            af::dim4 reorderDims(0, 1, 2, 3);
+            reorderDims[dim] = 0;
+            preorderDims[0]  = okey.dims()[dim];
+            for (uint i = 1; i <= dim; i++) {
+                reorderDims[i - 1] = i;
+                preorderDims[i]    = okey.dims()[i - 1];
+            }
+
+            okey.setDataDims(preorderDims);
+            oval.setDataDims(preorderDims);
+
+            okey = reorder<T>(okey, reorderDims);
+            oval = reorder<uint>(oval, reorderDims);
+        }
+    } catch (const std::exception &ex) { AF_ERROR(ex.what(), AF_ERR_INTERNAL); }
 }
+
+#define INSTANTIATE(T)                                              \
+    template void sort_index<T>(Array<T> & val, Array<uint> & idx,  \
+                                const Array<T> &in, const uint dim, \
+                                bool isAscending);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sort_index.hpp b/src/backend/opencl/sort_index.hpp
index 48a7acb74c..0979a1aa37 100644
--- a/src/backend/opencl/sort_index.hpp
+++ b/src/backend/opencl/sort_index.hpp
@@ -7,11 +7,12 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T, bool isAscending>
-    void sort_index(Array<T> &val, Array<unsigned> &idx, const Array<T> &in, const unsigned dim);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void sort_index(Array<T> &okey, Array<unsigned> &oval, const Array<T> &in,
+                const unsigned dim, bool isAscending);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse.cpp b/src/backend/opencl/sparse.cpp
new file mode 100644
index 0000000000..de220563f7
--- /dev/null
+++ b/src/backend/opencl/sparse.cpp
@@ -0,0 +1,221 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sparse.hpp>
+#include <sparse.hpp>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/moddims.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <err_opencl.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <range.hpp>
+#include <reduce.hpp>
+#include <where.hpp>
+
+#include <stdexcept>
+#include <string>
+
+namespace arrayfire {
+namespace opencl {
+
+using namespace common;
+
+// Stands in for a partial specialization of sparseConvertDenseToStorage for
+// COO, since function templates cannot be partially specialized
+template<typename T>
+SparseArray<T> sparseConvertDenseToCOO(const Array<T> &in) {
+    in.eval();
+
+    Array<uint> nonZeroIdx_ = where<T>(in);
+    Array<int> nonZeroIdx   = cast<int, uint>(nonZeroIdx_);
+
+    dim_t nNZ = nonZeroIdx.elements();
+
+    Array<int> constDim = createValueArray<int>(dim4(nNZ), in.dims()[0]);
+    constDim.eval();
+
+    Array<int> rowIdx =
+        arithOp<int, af_mod_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+    Array<int> colIdx =
+        arithOp<int, af_div_t>(nonZeroIdx, constDim, nonZeroIdx.dims());
+
+    Array<T> values = copyArray<T>(in);
+    values          = modDims(values, dim4(values.elements()));
+    values          = lookup<T, int>(values, nonZeroIdx, 0);
+
+    return createArrayDataSparseArray<T>(in.dims(), values, rowIdx, colIdx,
+                                         AF_STORAGE_COO);
+}
+
+template<typename T, af_storage stype>
+SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in_) {
+    in_.eval();
+
+    uint nNZ = getScalar<uint>(reduce_all<af_notzero_t, T, uint>(in_));
+
+    SparseArray<T> sparse_ = createEmptySparseArray<T>(in_.dims(), nNZ, stype);
+    sparse_.eval();
+
+    Array<T> &values   = sparse_.getValues();
+    Array<int> &rowIdx = sparse_.getRowIdx();
+    Array<int> &colIdx = sparse_.getColIdx();
+
+    kernel::dense2csr<T>(values, rowIdx, colIdx, in_);
+
+    return sparse_;
+}
+
+// Stands in for a partial specialization of sparseConvertStorageToDense for
+// COO, since function templates cannot be partially specialized
+template<typename T>
+Array<T> sparseConvertCOOToDense(const SparseArray<T> &in) {
+    in.eval();
+
+    Array<T> dense = createValueArray<T>(in.dims(), scalar<T>(0));
+    dense.eval();
+
+    const Array<T> values   = in.getValues();
+    const Array<int> rowIdx = in.getRowIdx();
+    const Array<int> colIdx = in.getColIdx();
+
+    kernel::coo2dense<T>(dense, values, rowIdx, colIdx);
+
+    return dense;
+}
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const SparseArray<T> &in_) {
+    if (stype != AF_STORAGE_CSR) {
+        AF_ERROR("OpenCL Backend only supports CSR or COO to Dense",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    in_.eval();
+
+    Array<T> dense_ = createValueArray<T>(in_.dims(), scalar<T>(0));
+    dense_.eval();
+
+    const Array<T> &values   = in_.getValues();
+    const Array<int> &rowIdx = in_.getRowIdx();
+    const Array<int> &colIdx = in_.getColIdx();
+
+    if (stype == AF_STORAGE_CSR) {
+        kernel::csr2dense<T>(dense_, values, rowIdx, colIdx);
+    } else {
+        AF_ERROR("OpenCL Backend only supports CSR or COO to Dense",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return dense_;
+}
+
+template<typename T, af_storage dest, af_storage src>
+SparseArray<T> sparseConvertStorageToStorage(const SparseArray<T> &in) {
+    in.eval();
+
+    SparseArray<T> converted = createEmptySparseArray<T>(
+        in.dims(), static_cast<int>(in.getNNZ()), dest);
+    converted.eval();
+
+    if (src == AF_STORAGE_CSR && dest == AF_STORAGE_COO) {
+        Array<int> index = range<int>(in.getNNZ(), 0);
+        index.eval();
+
+        Array<T> &ovalues         = converted.getValues();
+        Array<int> &orowIdx       = converted.getRowIdx();
+        Array<int> &ocolIdx       = converted.getColIdx();
+        const Array<T> &ivalues   = in.getValues();
+        const Array<int> &irowIdx = in.getRowIdx();
+        const Array<int> &icolIdx = in.getColIdx();
+
+        kernel::csr2coo<T>(ovalues, orowIdx, ocolIdx, ivalues, irowIdx, icolIdx,
+                           index);
+
+    } else if (src == AF_STORAGE_COO && dest == AF_STORAGE_CSR) {
+        Array<int> index = range<int>(in.getNNZ(), 0);
+        index.eval();
+
+        Array<T> &ovalues         = converted.getValues();
+        Array<int> &orowIdx       = converted.getRowIdx();
+        Array<int> &ocolIdx       = converted.getColIdx();
+        const Array<T> &ivalues   = in.getValues();
+        const Array<int> &irowIdx = in.getRowIdx();
+        const Array<int> &icolIdx = in.getColIdx();
+
+        Array<int> rowCopy = copyArray<int>(irowIdx);
+        rowCopy.eval();
+
+        kernel::coo2csr<T>(ovalues, orowIdx, ocolIdx, ivalues, irowIdx, icolIdx,
+                           index, rowCopy, in.dims()[0]);
+
+    } else {
+        // Should never reach here
+        AF_ERROR("OpenCL Backend invalid conversion combination",
+                 AF_ERR_NOT_SUPPORTED);
+    }
+
+    return converted;
+}
+
+#define INSTANTIATE_TO_STORAGE(T, S)                     \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSR>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_CSC>( \
+        const SparseArray<T> &in);                       \
+    template SparseArray<T>                              \
+    sparseConvertStorageToStorage<T, S, AF_STORAGE_COO>( \
+        const SparseArray<T> &in);
+
+#define INSTANTIATE_COO_SPECIAL(T)                                 \
+    template<>                                                     \
+    SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_COO>( \
+        const Array<T> &in) {                                      \
+        return sparseConvertDenseToCOO<T>(in);                     \
+    }                                                              \
+    template<>                                                     \
+    Array<T> sparseConvertStorageToDense<T, AF_STORAGE_COO>(       \
+        const SparseArray<T> &in) {                                \
+        return sparseConvertCOOToDense<T>(in);                     \
+    }
+
+#define INSTANTIATE_SPARSE(T)                                               \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSR>( \
+        const Array<T> &in);                                                \
+    template SparseArray<T> sparseConvertDenseToStorage<T, AF_STORAGE_CSC>( \
+        const Array<T> &in);                                                \
+                                                                            \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSR>(       \
+        const SparseArray<T> &in);                                          \
+    template Array<T> sparseConvertStorageToDense<T, AF_STORAGE_CSC>(       \
+        const SparseArray<T> &in);                                          \
+                                                                            \
+    INSTANTIATE_COO_SPECIAL(T)                                              \
+                                                                            \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSR)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_CSC)                               \
+    INSTANTIATE_TO_STORAGE(T, AF_STORAGE_COO)
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+#undef INSTANTIATE_TO_STORAGE
+#undef INSTANTIATE_COO_SPECIAL
+#undef INSTANTIATE_SPARSE
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse.hpp b/src/backend/opencl/sparse.hpp
new file mode 100644
index 0000000000..32a118df0e
--- /dev/null
+++ b/src/backend/opencl/sparse.hpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, af_storage stype>
+common::SparseArray<T> sparseConvertDenseToStorage(const Array<T> &in);
+
+template<typename T, af_storage stype>
+Array<T> sparseConvertStorageToDense(const common::SparseArray<T> &in);
+
+template<typename T, af_storage dest, af_storage src>
+common::SparseArray<T> sparseConvertStorageToStorage(
+    const common::SparseArray<T> &in);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse_arith.cpp b/src/backend/opencl/sparse_arith.cpp
new file mode 100644
index 0000000000..cfc868b0a6
--- /dev/null
+++ b/src/backend/opencl/sparse_arith.cpp
@@ -0,0 +1,178 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <kernel/sparse_arith.hpp>
+#include <sparse.hpp>
+
+#include <stdexcept>
+#include <string>
+
+#include <arith.hpp>
+#include <common/cast.hpp>
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <copy.hpp>
+#include <lookup.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <scan.hpp>
+#include <where.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+using namespace common;
+using std::numeric_limits;
+
+template<typename T>
+T getInf() {
+    return scalar<T>(numeric_limits<T>::infinity());
+}
+
+template<>
+cfloat getInf() {
+    return scalar<cfloat, float>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in OpenCL
+}
+
+template<>
+cdouble getInf() {
+    return scalar<cdouble, double>(
+        NAN, NAN);  // Matches behavior of complex division by 0 in OpenCL
+}
+
+template<typename T, af_op_t op>
+Array<T> arithOpD(const SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    Array<T> out  = createEmptyArray<T>(dim4(0));
+    Array<T> zero = createValueArray<T>(rhs.dims(), scalar<T>(0));
+    switch (op) {
+        case af_add_t: out = copyArray<T>(rhs); break;
+        case af_sub_t:
+            out = reverse ? copyArray<T>(rhs)
+                          : arithOp<T, af_sub_t>(zero, rhs, rhs.dims());
+            break;
+        default: out = copyArray<T>(rhs);
+    }
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out, lhs.getValues(),
+                                            lhs.getRowIdx(), lhs.getColIdx(),
+                                            rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const Array<T> &rhs,
+                       const bool reverse) {
+    lhs.eval();
+    rhs.eval();
+
+    SparseArray<T> out = createArrayDataSparseArray<T>(
+        lhs.dims(), lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+        lhs.getStorage(), true);
+    out.eval();
+    switch (lhs.getStorage()) {
+        case AF_STORAGE_CSR:
+            kernel::sparseArithOpCSR<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        case AF_STORAGE_COO:
+            kernel::sparseArithOpCOO<T, op>(out.getValues(), out.getRowIdx(),
+                                            out.getColIdx(), rhs, reverse);
+            break;
+        default:
+            AF_ERROR("Sparse Arithmetic only supported for CSR or COO",
+                     AF_ERR_NOT_SUPPORTED);
+    }
+
+    return out;
+}
+
+template<typename T, af_op_t op>
+SparseArray<T> arithOp(const SparseArray<T> &lhs, const SparseArray<T> &rhs) {
+    lhs.eval();
+    rhs.eval();
+    af::storage sfmt = lhs.getStorage();
+
+    const dim4 &ldims = lhs.dims();
+
+    const uint M = ldims[0];
+    const uint N = ldims[1];
+
+    const dim_t nnzA = lhs.getNNZ();
+    const dim_t nnzB = rhs.getNNZ();
+
+    auto temp = createValueArray<int>(dim4(M + 1), scalar<int>(0));
+    temp.eval();
+
+    unsigned nnzC = 0;
+    kernel::csrCalcOutNNZ(temp, nnzC, M, N, nnzA, lhs.getRowIdx(),
+                          lhs.getColIdx(), nnzB, rhs.getRowIdx(),
+                          rhs.getColIdx());
+
+    auto outRowIdx = scan<af_add_t, int, int>(temp, 0);
+
+    auto outColIdx = createEmptyArray<int>(dim4(nnzC));
+    auto outValues = createEmptyArray<T>(dim4(nnzC));
+
+    kernel::ssArithCSR<T, op>(outValues, outColIdx, outRowIdx, M, N, nnzA,
+                              lhs.getValues(), lhs.getRowIdx(), lhs.getColIdx(),
+                              nnzB, rhs.getValues(), rhs.getRowIdx(),
+                              rhs.getColIdx());
+
+    SparseArray<T> retVal = createArrayDataSparseArray(
+        ldims, outValues, outRowIdx, outColIdx, sfmt);
+    return retVal;
+}
+
+#define INSTANTIATE(T)                                                         \
+    template Array<T> arithOpD<T, af_add_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_sub_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_mul_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template Array<T> arithOpD<T, af_div_t>(                                   \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_mul_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_div_t>(                              \
+        const SparseArray<T> &lhs, const Array<T> &rhs, const bool reverse);   \
+    template SparseArray<T> arithOp<T, af_add_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs); \
+    template SparseArray<T> arithOp<T, af_sub_t>(                              \
+        const common::SparseArray<T> &lhs, const common::SparseArray<T> &rhs);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse_arith.hpp b/src/backend/opencl/sparse_arith.hpp
new file mode 100644
index 0000000000..3d45738c76
--- /dev/null
+++ b/src/backend/opencl/sparse_arith.hpp
@@ -0,0 +1,32 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <optypes.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+// These two functions differ only in return type, which C++ cannot overload
+// on, so they are given separate names.
+template<typename T, af_op_t op>
+Array<T> arithOpD(const common::SparseArray<T> &lhs, const Array<T> &rhs,
+                  const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const Array<T> &rhs, const bool reverse = false);
+
+template<typename T, af_op_t op>
+common::SparseArray<T> arithOp(const common::SparseArray<T> &lhs,
+                               const common::SparseArray<T> &rhs);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse_blas.cpp b/src/backend/opencl/sparse_blas.cpp
new file mode 100644
index 0000000000..42b6547127
--- /dev/null
+++ b/src/backend/opencl/sparse_blas.cpp
@@ -0,0 +1,100 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <sparse_blas.hpp>
+
+#include <kernel/cscmm.hpp>
+#include <kernel/cscmv.hpp>
+#include <kernel/csrmm.hpp>
+#include <kernel/csrmv.hpp>
+
+#include <cassert>
+#include <stdexcept>
+#include <string>
+
+#include <common/err_common.hpp>
+#include <complex.hpp>
+#include <err_opencl.hpp>
+#include <math.hpp>
+#include <platform.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+#include <cpu/cpu_sparse_blas.hpp>
+#endif  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace opencl {
+
+using namespace common;
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhsIn,
+                af_mat_prop optLhs, af_mat_prop optRhs) {
+#if defined(WITH_LINEAR_ALGEBRA)
+    if (OpenCLCPUOffload(
+            false)) {  // Do not force offload gemm on OSX Intel devices
+        return cpu::matmul(lhs, rhsIn, optLhs, optRhs);
+    }
+#endif
+
+    int lRowDim = (optLhs == AF_MAT_NONE) ? 0 : 1;
+    // int lColDim = (optLhs == AF_MAT_NONE) ? 1 : 0;
+    static const int rColDim =
+        1;  // Unsupported : (optRhs == AF_MAT_NONE) ? 1 : 0;
+
+    dim4 lDims = lhs.dims();
+    dim4 rDims = rhsIn.dims();
+    int M      = lDims[lRowDim];
+    int N      = rDims[rColDim];
+    // int K = lDims[lColDim];
+
+    const Array<T> rhs =
+        (N != 1 && optLhs == AF_MAT_NONE) ? transpose(rhsIn, false) : rhsIn;
+    Array<T> out = createEmptyArray<T>(af::dim4(M, N, 1, 1));
+
+    static const T alpha = scalar<T>(1.0);
+    static const T beta  = scalar<T>(0.0);
+
+    const Array<T>& values   = lhs.getValues();
+    const Array<int>& rowIdx = lhs.getRowIdx();
+    const Array<int>& colIdx = lhs.getColIdx();
+
+    if (optLhs == AF_MAT_NONE) {
+        if (N == 1) {
+            kernel::csrmv(out, values, rowIdx, colIdx, rhs, alpha, beta);
+        } else {
+            kernel::csrmm_nt(out, values, rowIdx, colIdx, rhs, alpha, beta);
+        }
+    } else {
+        // CSR transpose is a CSC matrix
+        if (N == 1) {
+            kernel::cscmv(out, values, rowIdx, colIdx, rhs, alpha, beta,
+                          optLhs == AF_MAT_CTRANS);
+        } else {
+            kernel::cscmm_nn(out, values, rowIdx, colIdx, rhs, alpha, beta,
+                             optLhs == AF_MAT_CTRANS);
+        }
+    }
+    return out;
+}
+
+#define INSTANTIATE_SPARSE(T)                                            \
+    template Array<T> matmul<T>(const common::SparseArray<T>& lhs,       \
+                                const Array<T>& rhs, af_mat_prop optLhs, \
+                                af_mat_prop optRhs);
+
+INSTANTIATE_SPARSE(float)
+INSTANTIATE_SPARSE(double)
+INSTANTIATE_SPARSE(cfloat)
+INSTANTIATE_SPARSE(cdouble)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sparse_blas.hpp b/src/backend/opencl/sparse_blas.hpp
new file mode 100644
index 0000000000..f51eeac9b4
--- /dev/null
+++ b/src/backend/opencl/sparse_blas.hpp
@@ -0,0 +1,22 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/SparseArray.hpp>
+#include <sparse.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+Array<T> matmul(const common::SparseArray<T>& lhs, const Array<T>& rhs,
+                af_mat_prop optLhs, af_mat_prop optRhs);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/sum.cpp b/src/backend/opencl/sum.cpp
index b2e3d4748a..1ef26bdb89 100644
--- a/src/backend/opencl/sum.cpp
+++ b/src/backend/opencl/sum.cpp
@@ -7,17 +7,37 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#include <common/half.hpp>
 #include "reduce_impl.hpp"
 
-namespace opencl
-{
-    //sum
-    INSTANTIATE(af_add_t, float  , float  )
-    INSTANTIATE(af_add_t, double , double )
-    INSTANTIATE(af_add_t, cfloat , cfloat )
-    INSTANTIATE(af_add_t, cdouble, cdouble)
-    INSTANTIATE(af_add_t, int    , int    )
-    INSTANTIATE(af_add_t, uint   , uint   )
-    INSTANTIATE(af_add_t, char   , int    )
-    INSTANTIATE(af_add_t, uchar  , uint   )
-}
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+// sum
+INSTANTIATE(af_add_t, float, float)
+INSTANTIATE(af_add_t, double, double)
+INSTANTIATE(af_add_t, cfloat, cfloat)
+INSTANTIATE(af_add_t, cdouble, cdouble)
+INSTANTIATE(af_add_t, int, int)
+INSTANTIATE(af_add_t, int, float)
+INSTANTIATE(af_add_t, uint, uint)
+INSTANTIATE(af_add_t, uint, float)
+INSTANTIATE(af_add_t, intl, intl)
+INSTANTIATE(af_add_t, intl, double)
+INSTANTIATE(af_add_t, uintl, uintl)
+INSTANTIATE(af_add_t, uintl, double)
+INSTANTIATE(af_add_t, char, int)
+INSTANTIATE(af_add_t, char, float)
+INSTANTIATE(af_add_t, schar, int)
+INSTANTIATE(af_add_t, schar, float)
+INSTANTIATE(af_add_t, uchar, uint)
+INSTANTIATE(af_add_t, uchar, float)
+INSTANTIATE(af_add_t, short, int)
+INSTANTIATE(af_add_t, short, float)
+INSTANTIATE(af_add_t, ushort, uint)
+INSTANTIATE(af_add_t, ushort, float)
+INSTANTIATE(af_add_t, half, half)
+INSTANTIATE(af_add_t, half, float)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/surface.cpp b/src/backend/opencl/surface.cpp
new file mode 100644
index 0000000000..7a2e15276b
--- /dev/null
+++ b/src/backend/opencl/surface.cpp
@@ -0,0 +1,85 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <surface.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+using cl::Memory;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface) {
+    ForgeModule &_ = forgePlugin();
+    if (isGLSharingSupported()) {
+        CheckGL("Begin OpenCL resource copy");
+        const cl::Buffer *d_P = P.get();
+        unsigned bytes        = 0;
+        FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+        auto res = interopManager().getSurfaceResources(surface);
+
+        vector<Memory> shared_objects;
+        shared_objects.push_back(*(res[0].get()));
+
+        glFinish();
+
+        // Use of events:
+        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+        cl::Event event;
+
+        getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+        getQueue().enqueueCopyBuffer(*d_P, *(res[0].get()), 0, 0, bytes, NULL,
+                                     &event);
+        getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+
+        CL_DEBUG_FINISH(getQueue());
+        CheckGL("End OpenCL resource copy");
+    } else {
+        unsigned bytes = 0, buffer = 0;
+        FG_CHECK(_.fg_get_surface_vertex_buffer(&buffer, surface));
+        FG_CHECK(_.fg_get_surface_vertex_buffer_size(&bytes, surface));
+
+        CheckGL("Begin OpenCL fallback-resource copy");
+        glBindBuffer(GL_ARRAY_BUFFER, buffer);
+        auto *ptr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (ptr) {
+            getQueue().enqueueReadBuffer(*P.get(), CL_TRUE, 0, bytes, ptr);
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+        CheckGL("End OpenCL fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T) \
+    template void copy_surface<T>(const Array<T> &, fg_surface);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/surface.hpp b/src/backend/opencl/surface.hpp
new file mode 100644
index 0000000000..62a1095a84
--- /dev/null
+++ b/src/backend/opencl/surface.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void copy_surface(const Array<T> &P, fg_surface surface);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/susan.cpp b/src/backend/opencl/susan.cpp
new file mode 100644
index 0000000000..91b011120b
--- /dev/null
+++ b/src/backend/opencl/susan.cpp
@@ -0,0 +1,75 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <err_opencl.hpp>
+#include <kernel/susan.hpp>
+#include <af/features.h>
+#include <algorithm>
+#include <cmath>
+
+using af::features;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out, Array<float> &resp_out,
+               const Array<T> &in, const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge) {
+    dim4 idims = in.dims();
+
+    const unsigned corner_lim = in.elements() * feature_ratio;
+    Array<float> x_corners    = createEmptyArray<float>({corner_lim});
+    Array<float> y_corners    = createEmptyArray<float>({corner_lim});
+    Array<float> resp_corners = createEmptyArray<float>({corner_lim});
+
+    auto resp = memAlloc<float>(in.elements());
+
+    kernel::susan<T>(resp.get(), in.get(), in.getOffset(), idims[0], idims[1],
+                     diff_thr, geom_thr, edge, radius);
+
+    unsigned corners_found = kernel::nonMaximal<T>(
+        x_corners.get(), y_corners.get(), resp_corners.get(), idims[0],
+        idims[1], resp.get(), edge, corner_lim);
+
+    const unsigned corners_out = std::min(corners_found, corner_lim);
+    if (corners_out == 0) {
+        x_out    = createEmptyArray<float>(dim4());
+        y_out    = createEmptyArray<float>(dim4());
+        resp_out = createEmptyArray<float>(dim4());
+    } else {
+        vector<af_seq> idx{{0., static_cast<double>(corners_out - 1.0), 1.}};
+        x_out    = createSubArray(x_corners, idx);
+        y_out    = createSubArray(y_corners, idx);
+        resp_out = createSubArray(resp_corners, idx);
+    }
+    return corners_out;
+}
+
+#define INSTANTIATE(T)                                                        \
+    template unsigned susan<T>(                                               \
+        Array<float> & x_out, Array<float> & y_out, Array<float> & score_out, \
+        const Array<T> &in, const unsigned radius, const float diff_thr,      \
+        const float geom_thr, const float feature_ratio, const unsigned edge);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/susan.hpp b/src/backend/opencl/susan.hpp
new file mode 100644
index 0000000000..ca6c779c8a
--- /dev/null
+++ b/src/backend/opencl/susan.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <af/features.h>
+
+using af::features;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+unsigned susan(Array<float> &x_out, Array<float> &y_out,
+               Array<float> &score_out, const Array<T> &in,
+               const unsigned radius, const float diff_thr,
+               const float geom_thr, const float feature_ratio,
+               const unsigned edge);
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/svd.cpp b/src/backend/opencl/svd.cpp
new file mode 100644
index 0000000000..b8bea727d0
--- /dev/null
+++ b/src/backend/opencl/svd.cpp
@@ -0,0 +1,266 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <blas.hpp>
+#include <copy.hpp>
+#include <err_opencl.hpp>  // error check functions and Macros
+#include <math.hpp>
+#include <reduce.hpp>
+#include <svd.hpp>  // opencl backend function header
+#include <transpose.hpp>
+
+#if defined(WITH_LINEAR_ALGEBRA)
+
+#include <cpu/cpu_svd.hpp>
+#include <magma/magma.h>
+#include <magma/magma_cpu_lapack.h>
+#include <magma/magma_helper.h>
+#include <platform.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename Tr>
+Tr calc_scale(Tr From, Tr To) {
+    // FIXME: I am not sure this is correct, removing this for now
+#if 0
+    //http://www.netlib.org/lapack/explore-3.1.1-html/dlascl.f.html
+    cpu_lapack_lamch_func<Tr> cpu_lapack_lamch;
+
+    Tr S = cpu_lapack_lamch('S');
+    Tr B = 1.0 / S;
+
+    Tr FromCopy = From, ToCopy = To;
+
+    Tr Mul = 1;
+
+    while (true) {
+        Tr From1 = FromCopy * S, To1 = ToCopy / B;
+        if (std::abs(From1) > std::abs(ToCopy) && ToCopy != 0) {
+            Mul *= S;
+            FromCopy = From1;
+        } else if (std::abs(To1) > std::abs(FromCopy)) {
+            Mul *= B;
+            ToCopy = To1;
+        } else {
+            Mul *= (ToCopy) / (FromCopy);
+            break;
+        }
+    }
+
+    return Mul;
+#else
+    return To / From;
+#endif
+}
+
+template<typename T, typename Tr>
+void svd(Array<T> &arrU, Array<Tr> &arrS, Array<T> &arrVT, Array<T> &arrA,
+         bool want_vectors = true) {
+    dim4 idims    = arrA.dims();
+    dim4 istrides = arrA.strides();
+
+    const int m      = static_cast<int>(idims[0]);
+    const int n      = static_cast<int>(idims[1]);
+    const int ldda   = static_cast<int>(istrides[1]);
+    const int lda    = m;
+    const int min_mn = std::min(m, n);
+    const int ldu    = m;
+    const int ldvt   = n;
+
+    const int nb    = magma_get_gebrd_nb<T>(n);
+    const int lwork = (m + n) * nb;
+
+    cpu_lapack_lacpy_func<T> cpu_lapack_lacpy;
+    cpu_lapack_bdsqr_work_func<T> cpu_lapack_bdsqr_work;
+    cpu_lapack_ungbr_work_func<T> cpu_lapack_ungbr_work;
+    cpu_lapack_lamch_func<Tr> cpu_lapack_lamch;
+
+    // Get machine constants
+    static const double eps    = cpu_lapack_lamch('P');
+    static const double smlnum = std::sqrt(cpu_lapack_lamch('S')) / eps;
+    static const double bignum = 1. / smlnum;
+
+    Tr anrm = abs(getScalar<T>(reduce_all<af_max_t, T, T>(arrA)));
+
+    T scale                = scalar<T>(1);
+    static const int ione  = 1;
+    static const int izero = 0;
+
+    bool iscl = false;
+    if (anrm > 0. && anrm < smlnum) {
+        iscl  = true;
+        scale = scalar<T>(calc_scale<Tr>(anrm, smlnum));
+    } else if (anrm > bignum) {
+        iscl  = true;
+        scale = scalar<T>(calc_scale<Tr>(anrm, bignum));
+    }
+
+    if (iscl) { multiply_inplace(arrA, abs(scale)); }
+
+    int nru  = 0;
+    int ncvt = 0;
+
+    // Instead of copying U, S, VT, and A to the host and copying the results
+    // back to the device, create a pointer that's mapped to device memory where
+    // the computation can directly happen
+    T *mappedA = static_cast<T *>(getQueue().enqueueMapBuffer(
+        *arrA.get(), CL_FALSE, CL_MAP_READ, sizeof(T) * arrA.getOffset(),
+        sizeof(T) * arrA.elements()));
+    std::vector<T> tauq(min_mn), taup(min_mn);
+    std::vector<T> work(lwork);
+    Tr *mappedS0 = (Tr *)getQueue().enqueueMapBuffer(
+        *arrS.get(), CL_TRUE, CL_MAP_WRITE, sizeof(Tr) * arrS.getOffset(),
+        sizeof(Tr) * arrS.elements());
+    std::vector<Tr> s1(min_mn - 1);
+    std::vector<Tr> rwork(5 * min_mn);
+
+    int info = 0;
+
+    // Bidiagonalize A
+    // (CWorkspace: need 2*N + M, prefer 2*N + (M + N)*NB)
+    // (RWorkspace: need N)
+    magma_gebrd_hybrid<T>(m, n, mappedA, lda, (*arrA.get())(), arrA.getOffset(),
+                          ldda, (void *)mappedS0, static_cast<void *>(&s1[0]),
+                          &tauq[0], &taup[0], &work[0], lwork, getQueue()(),
+                          &info, false);
+
+    T *mappedU = nullptr, *mappedVT = nullptr;
+    std::vector<T> cdummy(1);
+
+    if (want_vectors) {
+        mappedU  = static_cast<T *>(getQueue().enqueueMapBuffer(
+            *arrU.get(), CL_FALSE, CL_MAP_WRITE, sizeof(T) * arrU.getOffset(),
+            sizeof(T) * arrU.elements()));
+        mappedVT = static_cast<T *>(getQueue().enqueueMapBuffer(
+            *arrVT.get(), CL_TRUE, CL_MAP_WRITE, sizeof(T) * arrVT.getOffset(),
+            sizeof(T) * arrVT.elements()));
+
+        // If left singular vectors desired in U, copy result to U
+        // and generate left bidiagonalizing vectors in U
+        // (CWorkspace: need 2*N + NCU, prefer 2*N + NCU*NB)
+        // (RWorkspace: 0)
+        LAPACKE_CHECK(cpu_lapack_lacpy('L', m, n, mappedA, lda, mappedU, ldu));
+
+        int ncu = m;
+        LAPACKE_CHECK(cpu_lapack_ungbr_work('Q', m, ncu, n, mappedU, ldu,
+                                            &tauq[0], &work[0], lwork));
+
+        // If right singular vectors desired in VT, copy result to
+        // VT and generate right bidiagonalizing vectors in VT
+        // (CWorkspace: need 3*N-1, prefer 2*N + (N-1)*NB)
+        // (RWorkspace: 0)
+        LAPACKE_CHECK(
+            cpu_lapack_lacpy('U', n, n, mappedA, lda, mappedVT, ldvt));
+        LAPACKE_CHECK(cpu_lapack_ungbr_work('P', n, n, n, mappedVT, ldvt,
+                                            &taup[0], &work[0], lwork));
+
+        nru  = m;
+        ncvt = n;
+    }
+    getQueue().enqueueUnmapMemObject(*arrA.get(), mappedA);
+
+    // Perform bidiagonal QR iteration, if desired, computing
+    // left singular vectors in U and computing right singular
+    // vectors in VT
+    // (CWorkspace: need 0)
+    // (RWorkspace: need BDSPAC)
+    LAPACKE_CHECK(cpu_lapack_bdsqr_work('U', n, ncvt, nru, izero, mappedS0,
+                                        &s1[0], mappedVT, ldvt, mappedU, ldu,
+                                        &cdummy[0], ione, &rwork[0]));
+
+    if (want_vectors) {
+        getQueue().enqueueUnmapMemObject(*arrU.get(), mappedU);
+        getQueue().enqueueUnmapMemObject(*arrVT.get(), mappedVT);
+    }
+
+    getQueue().enqueueUnmapMemObject(*arrS.get(), mappedS0);
+
+    if (iscl) {
+        Tr rscale = scalar<Tr>(1);
+        if (anrm > bignum) {
+            rscale = calc_scale<Tr>(bignum, anrm);
+        } else if (anrm < smlnum) {
+            rscale = calc_scale<Tr>(smlnum, anrm);
+        }
+        multiply_inplace(arrS, rscale);
+    }
+}
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    if (OpenCLCPUOffload()) { return cpu::svdInPlace(s, u, vt, in); }
+
+    svd<T, Tr>(u, s, vt, in, true);
+}
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    if (OpenCLCPUOffload()) { return cpu::svd(s, u, vt, in); }
+
+    dim4 iDims = in.dims();
+    int M      = iDims[0];
+    int N      = iDims[1];
+
+    if (M >= N) {
+        Array<T> in_copy = copyArray(in);
+        svdInPlace(s, u, vt, in_copy);
+    } else {
+        Array<T> in_trans = transpose(in, true);
+        svdInPlace(s, vt, u, in_trans);
+        transpose_inplace(u, true);
+        transpose_inplace(vt, true);
+    }
+}
+
+#define INSTANTIATE(T, Tr)                                               \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+#else  // WITH_LINEAR_ALGEBRA
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
+}
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in) {
+    AF_ERROR("Linear Algebra is disabled on OpenCL", AF_ERR_NOT_CONFIGURED);
+}
+
+#define INSTANTIATE(T, Tr)                                               \
+    template void svd<T, Tr>(Array<Tr> & s, Array<T> & u, Array<T> & vt, \
+                             const Array<T> &in);                        \
+    template void svdInPlace<T, Tr>(Array<Tr> & s, Array<T> & u,         \
+                                    Array<T> & vt, Array<T> & in);
+
+INSTANTIATE(float, float)
+INSTANTIATE(double, double)
+INSTANTIATE(cfloat, float)
+INSTANTIATE(cdouble, double)
+
+}  // namespace opencl
+}  // namespace arrayfire
+
+#endif  // WITH_LINEAR_ALGEBRA
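Review note on the scaling guard in `svd()` above: the routine rescales `arrA` when its largest-magnitude entry lies outside `[smlnum, bignum]`, then multiplies the singular values by the reciprocal factor afterwards. The following is a minimal sketch of that decision only, not the backend code. The names `pre_scale` and `post_scale` are hypothetical, and `std::numeric_limits<double>` is assumed here as a stand-in for the LAPACK machine constants returned by `lamch('P')` and `lamch('S')`.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper, not part of ArrayFire: returns the factor by which a
// matrix whose largest magnitude is `anrm` would be multiplied before the
// factorization, or 1.0 when no rescaling is needed.
// Assumption: numeric_limits<double>::epsilon() and ::min() stand in for
// the LAPACK machine constants lamch('P') and lamch('S').
inline double pre_scale(double anrm) {
    assert(anrm >= 0.0);
    const double eps    = std::numeric_limits<double>::epsilon();
    const double smlnum = std::sqrt(std::numeric_limits<double>::min()) / eps;
    const double bignum = 1.0 / smlnum;
    if (anrm > 0.0 && anrm < smlnum) { return smlnum / anrm; }
    if (anrm > bignum) { return bignum / anrm; }
    return 1.0;
}

// The singular values of (c * A) are c times those of A for c > 0, so the
// computed values are multiplied by the reciprocal factor afterwards.
inline double post_scale(double anrm) { return 1.0 / pre_scale(anrm); }
```

For an input in the normal range, for example `pre_scale(1.0)`, the sketch returns 1.0 and no rescaling takes place. This matches the `iscl == false` path in the diff.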
diff --git a/src/backend/opencl/svd.hpp b/src/backend/opencl/svd.hpp
new file mode 100644
index 0000000000..ddf3f4a1bb
--- /dev/null
+++ b/src/backend/opencl/svd.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T, typename Tr>
+void svd(Array<Tr> &s, Array<T> &u, Array<T> &vt, const Array<T> &in);
+
+template<typename T, typename Tr>
+void svdInPlace(Array<Tr> &s, Array<T> &u, Array<T> &vt, Array<T> &in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/threadsMgt.hpp b/src/backend/opencl/threadsMgt.hpp
new file mode 100644
index 0000000000..1fdc136613
--- /dev/null
+++ b/src/backend/opencl/threadsMgt.hpp
@@ -0,0 +1,330 @@
+/*******************************************************
+ * Copyright (c) 2022, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <common/dispatch.hpp>
+#include <platform.hpp>
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace opencl {
+// OVERALL USAGE (With looping):
+// ...                                                      // OWN CODE
+// threadsMgt<T> th(...);                                   // backend.hpp
+// cl::Kernel KER{GETKERNEL(..., th.loop0, th.loop1,
+//                               th.loop3)};                // OWN CODE
+// const cl::NDRange local{th.genLocal(KER)};               // backend.hpp
+// const cl::NDRange global{th.genGlobal(local)};           // backend.hpp
+// KER(local,global,...);                                   // OWN CODE
+// ...                                                      // OWN CODE
+//
+// OVERALL USAGE (without looping):
+// ...                                                      // OWN CODE
+// threadsMgt<T> th(...);                                   // backend.hpp
+// cl::Kernel KER{GETKERNEL(...)};                          // OWN CODE
+// const cl::NDRange local{th.genLocal(KER)};               // backend.hpp
+// const cl::NDRange global{th.genGlobalFull(local)};       // backend.hpp
+// KER(local,global,...);                                   // OWN CODE
+// ...                                                      // OWN CODE
+template<typename T>
+class threadsMgt {
+   public:
+    bool loop0, loop1, loop3;
+
+   private:
+    const unsigned d0, d1, d2, d3;
+    const T ndims;
+    const size_t totalSize;
+    const cl::Device dev;
+    const unsigned maxParallelThreads;
+    const unsigned maxThreads;
+    unsigned largeVolDivider;
+
+   public:
+    // INPUT dims = dims of output array
+    // INPUT ndims = ndims of output array
+    // INPUT nrInputs = number of buffers read by kernel in parallel
+    // INPUT nrOutputs = number of buffers written by kernel in parallel
+    // INPUT totalSize = size of all input & output arrays
+    // INPUT sizeofT = size of 1 element to be written
+    // OUTPUT this.loop0, this.loop1, this.loop3 are ready to create the kernel
+    threadsMgt(const T dims[4], const T ndims, const unsigned nrInputs,
+               const unsigned nrOutputs, const size_t totalSize,
+               const size_t sizeofT);
+
+    // The generated local is only best for independent element operations,
+    // such as copying, scaling, and math on independent elements.
+    // Since vector dimensions can be returned, it is NOT USABLE FOR
+    // BLOCK OPERATIONS such as matmul.
+    inline cl::NDRange genLocal(const cl::Kernel& ker) const;
+
+    // INPUT local generated by genLocal()
+    // OUTPUT global, supposing that each element results in 1 thread
+    inline cl::NDRange genGlobalFull(const cl::NDRange& local) const;
+
+    // INPUT local generated by genLocal()
+    // OUTPUT global, assuming that the previously calculated looping will
+    // be executed in the kernel
+    inline cl::NDRange genGlobal(const cl::NDRange& local) const;
+};
+
+// INPUT dims = dims of output array
+// INPUT ndims = ndims of output array
+// INPUT nrInputs = number of buffers read by kernel in parallel
+// INPUT nrOutputs = number of buffers written by kernel in parallel
+// INPUT totalSize = size of all input & output arrays
+// INPUT sizeofT = size of 1 element to be written
+// OUTPUT this.loop0, this.loop1, this.loop3 are ready to create the kernel
+template<typename T>
+threadsMgt<T>::threadsMgt(const T dims[4], const T ndims,
+                          const unsigned nrInputs, const unsigned nrOutputs,
+                          const size_t totalSize, const size_t sizeofT)
+    : loop0(false)
+    , loop1(false)
+    , loop3(false)
+    , d0(static_cast<unsigned>(dims[0]))
+    , d1(static_cast<unsigned>(dims[1]))
+    , d2(static_cast<unsigned>(dims[2]))
+    , d3(static_cast<unsigned>(dims[3]))
+    , ndims(ndims)
+    , totalSize(totalSize)
+    , dev(opencl::getDevice())
+    , maxParallelThreads(getMaxParallelThreads(dev))
+    , maxThreads(maxParallelThreads *
+                 (sizeofT * nrInputs * nrInputs > 8 ? 1 : 2))
+    , largeVolDivider(1) {
+    const unsigned cacheLine{getMemoryBusWidth(dev)};
+    const size_t L2CacheSize{getL2CacheSize(dev)};
+    // The bottleneck of any kernel depends on the type of memory
+    // used.
+    // a) For very small arrays (elements < maxParallelThreads), each
+    //  element receives its own thread.
+    // b) For arrays (in+out) smaller
+    //  than 3/2 L2cache, memory access is no longer the bottleneck,
+    //  because enough L2cache is available at any time. Threads are
+    //  limited to reduce scheduling overhead.
+    // c) For very large arrays and type sizes
+    //  (<long double), 1 thread will not generate enough data to keep
+    //  the memory sync mechanism saturated, so we start looping inside
+    //  each thread.
+    //
+    if (ndims == 1) {
+        if (d0 > maxThreads) {
+            loop0 = true;
+            if (totalSize * 2 > L2CacheSize * 3) {
+                // General formula to calculate best #loops
+                // Dedicated GPUs:
+                //  32/sizeof(T)**2/#outBuffers*(3/4)**(#inBuffers-1)
+                // Integrated GPUs:
+                //  4/sizeof(T)/#outBuffers*(3/4)**(#inBuffers-1)
+                largeVolDivider = cacheLine == 64 ? sizeofT == 1   ? 4
+                                                    : sizeofT == 2 ? 2
+                                                                   : 1
+                                                  : (sizeofT == 1   ? 32
+                                                     : sizeofT == 2 ? 8
+                                                                    : 1) /
+                                                        nrOutputs;
+                for (unsigned i = 1; i < nrInputs; ++i)
+                    largeVolDivider = largeVolDivider * 3 / 4;
+                loop0 = largeVolDivider > 1;
+            }
+        }
+    } else {
+        loop3 = d3 != 1;
+        if ((d1 > 1) & (d0 * d1 * d2 > maxThreads)) {
+            loop1 = true;
+            if ((d0 * sizeofT * 8 > cacheLine * getComputeUnits(dev)) &
+                (totalSize * 2 > L2CacheSize * 3)) {
+                // General formula to calculate best #loops
+                // Dedicated GPUs:
+                //  32/sizeof(T)**2/#outBuffers*(3/4)**(#inBuffers-1)
+                // Integrated GPUs:
+                //  4/sizeof(T)/#outBuffers*(3/4)**(#inBuffers-1)
+                //
+                // dims[3] already loops, so the remaining #loops needs
+                // to be divided
+                largeVolDivider = cacheLine == 64 ? sizeofT == 1   ? 4
+                                                    : sizeofT == 2 ? 2
+                                                                   : 1
+                                                  : (sizeofT == 1   ? 32
+                                                     : sizeofT == 2 ? 8
+                                                     : sizeofT == 4 ? 2
+                                                                    : 1) /
+                                                        (d3 * nrOutputs);
+                for (unsigned i{1}; i < nrInputs; ++i)
+                    largeVolDivider = largeVolDivider * 3 / 4;
+                loop1 = largeVolDivider > 1;
+            }
+        }
+    }
+}
+
+// The generated local is only best for independent element operations,
+// such as copying, scaling, and math on independent elements.
+// Since vector dimensions can be returned, it is NOT USABLE FOR
+// BLOCK OPERATIONS such as matmul.
+template<typename T>
+inline cl::NDRange threadsMgt<T>::genLocal(const cl::Kernel& ker) const {
+    // Performance is mainly dependent on:
+    //    - reducing memory latency, by preferring a sequential read of
+    //    cachelines (principally dim0)
+    //    - more parallel threads --> higher occupation of available
+    //    threads
+    //    - more I/O operations per thread --> dims[3] indicates the #
+    //    of I/Os handled by the kernel inside each thread, and outside
+    //    the scope of the block scheduler
+    // High performance is achievable with occupation rates as low as
+    // 30%. Here we aim at 50%, to also cover older hardware with slower
+    // cores.
+    // https://stackoverflow.com/questions/7737772/improving-kernel-performance-by-increasing-occupancy
+    // http://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf
+    // https://www.cvg.ethz.ch/teaching/2011spring/gpgpu/GPU-Optimization.pdf
+    // https://en.wikipedia.org/wiki/Graphics_Core_Next#SIMD_Vector_Unit
+
+    // The performance for vectors is independent from array sizes.
+    if ((d1 == 1) & (d2 == 1)) return cl::NDRange{128ULL};
+
+    // TOTAL OCCUPATION = occup(dim0) * occup(dim1) * occup(dim2).
+    // For linearized arrays, each linear block is allocated to a dim,
+    // resulting in large numbers for dim0 & dim1.
+    // - For dim2, we only return exact dividers of the array dim[3], so
+    // occup(dim2)=100%
+    // - For dim0 & dim1, we aim somewhere between 30% and 50%
+    //      * Having 2 blocks filled + 1 thread in block 3 --> occup >
+    //      2/3=66%
+    //      * Having 3 blocks filled + 1 thread in block 4 --> occup >
+    //      3/4=75%
+    //      * Having 4 blocks filled + 1 thread in block 5 --> occup >
+    //      4/5=80%
+    constexpr unsigned OCCUPANCY_FACTOR{2U};  // at least 2 blocks filled
+
+    // NVIDIA:
+    //  WG multiple      = 32
+    //  possible blocks  = [32, 64, 96, 128, 160, 192, 224, 256, .. 1024]
+    //  best performance = [32, 64, 96, 128]
+    //  optimal perf     = 128; any combination
+    //   NVIDIA always processes full wavefronts.  Allocating partial WG
+    //   (<32) reduces throughput.  Performance reaches a plateau from
+    //   128 with a slight slowdown for very large sizes.
+    // AMD:
+    //  WG multiple      = 64
+    //  possible block   = [16, 32, 48, 64, 128, 192, 256]
+    //  best performance = [(32, low #threads) 64, 128, 256]
+    //  optimal perf     = (128,2,1); max 128 for 1 dimension
+    //   AMD can process partial wavefronts (multiple of 16). Although
+    //   all threads of a full WG are allocated, only the active ones
+    //   are executed, so the same number of WGs will fit a CU. When we
+    //   have insufficient threads to occupy all the CUs, partial
+    //   wavefronts (<64) are useful to distribute all threads over the
+    //   available CUs instead of concentrating them all on the 1st CU.
+    // For algorithm below:
+    //  parallelThreads  = [32, 64, (96 for NVIDIA), 128, (256 for AMD)]
+    constexpr unsigned minThreads{32};
+    const unsigned relevantElements{d0 * d1 * d2};
+    const unsigned WG{static_cast<unsigned>(
+        ker.getWorkGroupInfo<CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE>(
+            dev))};
+
+    // For small arrays, we reduce the maximum threads in 1 block to
+    // improve parallelism.  In the worst case the scheduler can have 1
+    // block per CU, even when only partly loaded. Range for block is:
+    //   [minThreads ... 4 * WG multiple]
+    //   * NVIDIA: [4*32=128 threads]
+    //   * AMD:    [4*64=256 threads]
+    // At 4 * WG multiple, full wavefronts (queue of 4 partial
+    // wavefronts) are all occupied.
+
+    // We need at least maxParallelThreads to occupy all the CU's.
+    const unsigned parallelThreads{
+        relevantElements <= maxParallelThreads
+            ? minThreads
+            : std::min(4U, relevantElements / maxParallelThreads) * WG};
+
+    // Priority 1: keep cachelines filled.  Apparently sharing
+    // cachelines between CUs has a cost. Testing confirmed that the
+    // occupation is mostly > 50%
+    const unsigned threads0{d0 == 1 ? 1
+                            : d0 <= minThreads
+                                ? minThreads  // better distribution
+                                : std::min(128U, (divup(d0, WG) * WG))};
+
+    // Priority 2: Fill the block, while respecting the occupation limit
+    // (>66%) (through parallelThreads limit)
+    const unsigned threads1{
+        (threads0 * 64U <= parallelThreads) &&
+                (!(d1 & (64U - 1U)) || (d1 > OCCUPANCY_FACTOR * 64U))
+            ? 64U
+        : (threads0 * 32U <= parallelThreads) &&
+                (!(d1 & (32U - 1U)) || (d1 > OCCUPANCY_FACTOR * 32U))
+            ? 32U
+        : (threads0 * 16U <= parallelThreads) &&
+                (!(d1 & (16U - 1U)) || (d1 > OCCUPANCY_FACTOR * 16U))
+            ? 16U
+        : (threads0 * 8U <= parallelThreads) &&
+                (!(d1 & (8U - 1U)) || (d1 > OCCUPANCY_FACTOR * 8U))
+            ? 8U
+        : (threads0 * 4U <= parallelThreads) &&
+                (!(d1 & (4U - 1U)) || (d1 > OCCUPANCY_FACTOR * 4U))
+            ? 4U
+        : (threads0 * 2U <= parallelThreads) &&
+                (!(d1 & (2U - 1U)) || (d1 > OCCUPANCY_FACTOR * 2U))
+            ? 2U
+            : 1U};
+
+    const unsigned threads01{threads0 * threads1};
+    if ((d2 == 1) | (threads01 * 2 > parallelThreads))
+        return cl::NDRange(threads0, threads1);
+
+    // Priority 3: Only exact dividers are used, so that
+    //  - overflow checking is not needed in the kernel.
+    //  - occupation rate never is reduced
+    // Chances are low that threads2 will be different from 1.
+    const unsigned threads2{
+        (threads01 * 8 <= parallelThreads) && !(d2 & (8U - 1U))   ? 8U
+        : (threads01 * 4 <= parallelThreads) && !(d2 & (4U - 1U)) ? 4U
+        : (threads01 * 2 <= parallelThreads) && !(d2 & (2U - 1U)) ? 2U
+                                                                  : 1U};
+    return cl::NDRange(threads0, threads1, threads2);
+}
+
+// INPUT local generated by genLocal()
+// OUTPUT global, supposing that each element results in 1 thread
+template<typename T>
+inline cl::NDRange threadsMgt<T>::genGlobalFull(
+    const cl::NDRange& local) const {
+    return cl::NDRange(divup(d0, local[0]) * local[0],
+                       divup(d1, local[1]) * local[1],
+                       divup(d2, local[2]) * local[2]);
+}
+
+// INPUT local generated by genLocal()
+// OUTPUT global, assuming that the previously calculated looping will be
+// executed in the kernel
+template<typename T>
+inline cl::NDRange threadsMgt<T>::genGlobal(const cl::NDRange& local) const {
+    if (loop0) {
+        const size_t blocks0{largeVolDivider > 1
+                                 ? d0 / (largeVolDivider * local[0])
+                                 : maxThreads / local[0]};
+        return cl::NDRange(blocks0 == 0 ? local[0] : blocks0 * local[0]);
+    } else if (loop1) {
+        const size_t global0{divup(d0, local[0]) * local[0]};
+        const size_t global2{divup(d2, local[2]) * local[2]};
+        const size_t blocks1{largeVolDivider > 1
+                                 ? d1 / (largeVolDivider * local[1])
+                                 : maxThreads / (global0 * local[1] * global2)};
+        return cl::NDRange(
+            global0, blocks1 == 0 ? local[1] : blocks1 * local[1], global2);
+    } else {
+        return genGlobalFull(local);
+    }
+}
+}  // namespace opencl
+}  // namespace arrayfire
\ No newline at end of file
diff --git a/src/backend/opencl/tile.cpp b/src/backend/opencl/tile.cpp
index 104f4ac53a..98c7eb2bfb 100644
--- a/src/backend/opencl/tile.cpp
+++ b/src/backend/opencl/tile.cpp
@@ -6,38 +6,47 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/tile.hpp>
+#include <tile.hpp>
 
 #include <Array.hpp>
-#include <tile.hpp>
-#include <kernel/tile.hpp>
+#include <common/half.hpp>
 #include <stdexcept>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims)
-    {
-        const af::dim4 iDims = in.dims();
-        af::dim4 oDims = iDims;
-        oDims *= tileDims;
+using arrayfire::common::half;
 
-        Array<T> out = createEmptyArray<T>(oDims);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims) {
+    const af::dim4 &iDims = in.dims();
+    af::dim4 oDims        = iDims;
+    oDims *= tileDims;
 
-        kernel::tile<T>(out, in);
+    Array<T> out = createEmptyArray<T>(oDims);
 
-        return out;
-    }
-
-#define INSTANTIATE(T)                                                         \
-    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);  \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
+    kernel::tile<T>(out, in);
 
+    return out;
 }
+
+#define INSTANTIATE(T) \
+    template Array<T> tile<T>(const Array<T> &in, const af::dim4 &tileDims);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/tile.hpp b/src/backend/opencl/tile.hpp
index 4547bf1cb0..172cbadbed 100644
--- a/src/backend/opencl/tile.hpp
+++ b/src/backend/opencl/tile.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> tile(const Array<T> &in, const af::dim4 &tileDims);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/topk.cpp b/src/backend/opencl/topk.cpp
new file mode 100644
index 0000000000..201ec06197
--- /dev/null
+++ b/src/backend/opencl/topk.cpp
@@ -0,0 +1,213 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/cast.hpp>
+#include <common/half.hpp>
+#include <common/moddims.hpp>
+#include <err_opencl.hpp>
+#include <index.hpp>
+#include <sort.hpp>
+#include <sort_index.hpp>
+#include <types.hpp>
+#include <handle.hpp>
+#include <arith.hpp>
+#include <range.hpp>
+
+#include <algorithm>
+#include <cmath>
+#include <numeric>
+#include <vector>
+
+using arrayfire::common::half;
+using cl::Buffer;
+using cl::Event;
+
+using std::iota;
+using std::min;
+using std::partial_sort_copy;
+using std::transform;
+using std::vector;
+
+namespace arrayfire {
+namespace opencl {
+vector<af_index_t> indexForTopK(const int k) {
+    af_index_t idx;
+    idx.idx.seq = af_seq{0.0, static_cast<double>(k) - 1.0, 1.0};
+    idx.isSeq   = true;
+    idx.isBatch = false;
+
+    af_index_t sp;
+    sp.idx.seq = af_span;
+    sp.isSeq   = true;
+    sp.isBatch = false;
+
+    return vector<af_index_t>({idx, sp, sp, sp});
+}
+
+template<typename T>
+void topk(Array<T>& vals, Array<unsigned>& idxs, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order) {
+    if (getDeviceType() == CL_DEVICE_TYPE_CPU) {
+        // This branch optimizes for CPU devices by first mapping the buffer
+        // and calling partial sort on the buffer
+
+        // TODO(umar): implement this in the kernel namespace
+
+        // The out_dims is of size k along the dimension of the topk operation
+        // and the same as the input dimension otherwise.
+        dim4 out_dims(1);
+        int ndims = in.dims().ndims();
+        for (int i = 0; i < ndims; i++) {
+            if (i == dim) {
+                out_dims[i] = min(k, (int)in.dims()[i]);
+            } else {
+                out_dims[i] = in.dims()[i];
+            }
+        }
+
+        auto values          = createEmptyArray<T>(out_dims);
+        auto indices         = createEmptyArray<unsigned>(out_dims);
+        const Buffer* in_buf = in.get();
+        Buffer* ibuf         = indices.get();
+        Buffer* vbuf         = values.get();
+
+        cl::Event ev_in, ev_val, ev_ind;
+
+        T* ptr     = static_cast<T*>(getQueue().enqueueMapBuffer(
+            *in_buf, CL_FALSE, CL_MAP_READ, 0, in.elements() * sizeof(T),
+            nullptr, &ev_in));
+        uint* iptr = static_cast<uint*>(getQueue().enqueueMapBuffer(
+            *ibuf, CL_FALSE, CL_MAP_READ | CL_MAP_WRITE, 0, k * sizeof(uint),
+            nullptr, &ev_ind));
+        T* vptr    = static_cast<T*>(getQueue().enqueueMapBuffer(
+            *vbuf, CL_FALSE, CL_MAP_WRITE, 0, k * sizeof(T), nullptr, &ev_val));
+
+        vector<uint> idx(in.elements());
+
+        // Create a linear index
+        iota(begin(idx), end(idx), 0);
+        cl::Event::waitForEvents({ev_in, ev_ind});
+
+        int iter = in.dims()[1] * in.dims()[2] * in.dims()[3];
+        for (int i = 0; i < iter; i++) {
+            auto idx_itr = begin(idx) + i * in.strides()[1];
+            auto kiptr   = iptr + k * i;
+
+            if (order & AF_TOPK_MIN) {
+                if (order & AF_TOPK_STABLE) {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return (compute_t<T>(ptr[lhs]) <
+                                    compute_t<T>(ptr[rhs]))
+                                       ? true
+                                   : compute_t<T>(ptr[lhs]) ==
+                                           compute_t<T>(ptr[rhs])
+                                       ? (lhs < rhs)
+                                       : false;
+                        });
+                } else {
+                    // Sort the top k values in each column
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) <
+                                   compute_t<T>(ptr[rhs]);
+                        });
+                }
+            } else {
+                if (order & AF_TOPK_STABLE) {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return (compute_t<T>(ptr[lhs]) >
+                                    compute_t<T>(ptr[rhs]))
+                                       ? true
+                                   : compute_t<T>(ptr[lhs]) ==
+                                           compute_t<T>(ptr[rhs])
+                                       ? (lhs < rhs)
+                                       : false;
+                        });
+                } else {
+                    partial_sort_copy(
+                        idx_itr, idx_itr + in.strides()[1], kiptr, kiptr + k,
+                        [ptr](const uint lhs, const uint rhs) -> bool {
+                            return compute_t<T>(ptr[lhs]) >
+                                   compute_t<T>(ptr[rhs]);
+                        });
+                }
+            }
+            ev_val.wait();
+
+            auto kvptr = vptr + k * i;
+            for (int j = 0; j < k; j++) {
+                // Update the value arrays with the original values
+                kvptr[j] = ptr[kiptr[j]];
+                // Convert linear indices back to column indices
+                kiptr[j] -= i * in.strides()[1];
+            }
+        }
+
+        getQueue().enqueueUnmapMemObject(*ibuf, iptr);
+        getQueue().enqueueUnmapMemObject(*vbuf, vptr);
+        getQueue().enqueueUnmapMemObject(*in_buf, ptr);
+
+        vals = values;
+        idxs = indices;
+    } else {
+
+        if (!std::is_same_v<T, half>) {
+            auto values  = createEmptyArray<T>(in.dims());
+            auto indices = createEmptyArray<unsigned>(in.dims());
+            sort_index(values, indices, in, dim, order & AF_TOPK_MIN);
+            auto indVec = indexForTopK(k);
+            idxs        = index<unsigned>(indices, indVec.data());
+            vals        = index<T>(values, indVec.data());
+        } else {
+            // Temporary implementation for topk due to half not being supported in sort_index
+            // TODO: Fix sort_index and remove this
+
+            auto values  = createEmptyArray<float>(in.dims());
+            auto indices = createEmptyArray<unsigned>(in.dims());
+            sort_index(values, indices, common::cast<float>(in), dim, order & AF_TOPK_MIN);
+
+            auto indVec = indexForTopK(k);
+            idxs        = index<unsigned>(indices, indVec.data());
+
+            // Index values from original array by using the indices from the previous result
+            auto len = in.elements() / in.dims()[dim];
+            auto index_dims = dim4(k, len);
+            auto new_indices = common::flat(arithOp<unsigned, af_add_t>(arithOp<unsigned, af_mul_t>(range<unsigned>(index_dims, 1), createValueArray<unsigned>(index_dims, in.dims()[dim]), index_dims), idxs, index_dims));
+            auto indVecVals = indexForTopK(k);
+            indVecVals[0].idx.arr = getHandle(new_indices);
+            indVecVals[0].isSeq = false;
+            indVecVals[0].isBatch = false;
+
+            vals = common::modDims(index<T>(common::flat(in), indVecVals.data()), idxs.dims());
+            vals.eval();
+
+            releaseHandle<unsigned>(indVecVals[0].idx.arr);
+        }
+    }
+}
+
+#define INSTANTIATE(T)                                                  \
+    template void topk<T>(Array<T>&, Array<unsigned>&, const Array<T>&, \
+                          const int, const int, const af::topkFunction);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(long long)
+INSTANTIATE(unsigned long long)
+INSTANTIATE(half)
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/topk.hpp b/src/backend/opencl/topk.hpp
new file mode 100644
index 0000000000..d4c67878e7
--- /dev/null
+++ b/src/backend/opencl/topk.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+#include <af/defines.h>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void topk(Array<T>& vals, Array<unsigned>& idxs, const Array<T>& in,
+          const int k, const int dim, const af::topkFunction order);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/traits.hpp b/src/backend/opencl/traits.hpp
index 71a51c9bc8..2af7257b76 100644
--- a/src/backend/opencl/traits.hpp
+++ b/src/backend/opencl/traits.hpp
@@ -9,63 +9,97 @@
 
 #pragma once
 
-#include <af/traits.hpp>
-#include <string>
+#include <common/defines.hpp>
+#include <common/traits.hpp>
+#include <types.hpp>
+
 #include <sstream>
+#include <string>
 
-namespace af
-{
+namespace af {
 
 template<>
-struct dtype_traits<cl_float2> {
+struct dtype_traits<arrayfire::opencl::cfloat> {
     enum { af_type = c32 };
     typedef float base_type;
-    static const char* getName() { return "float2"; }
+    static const char *getName() { return "float2"; }
 };
 
 template<>
-struct dtype_traits<cl_double2> {
+struct dtype_traits<arrayfire::opencl::cdouble> {
     enum { af_type = c64 };
     typedef double base_type;
-    static const char* getName() { return "double2"; }
+    static const char *getName() { return "double2"; }
 };
+}  // namespace af
+
+namespace arrayfire {
+namespace opencl {
 
-#if !defined(OS_WIN)        // Windows defines size_t as ulong
+template<typename T>
+static bool iscplx() {
+    return false;
+}
 template<>
-struct dtype_traits<size_t> {
-    static const char* getName()
-    {
-        return (sizeof(size_t) == 8)  ? "ulong" : "uint";
-    }
-};
-#endif
+inline bool iscplx<cfloat>() {
+    return true;
+}
+template<>
+inline bool iscplx<cdouble>() {
+    return true;
+}
+
+template<typename T>
+static bool isdbl() {
+    return false;
+}
+
+template<>
+inline bool isdbl<double>() {
+    return true;
+}
+
+template<>
+inline bool isdbl<cdouble>() {
+    return true;
+}
+
+template<typename T>
+static bool islong() {
+    return false;
+}
 
-template<typename T> static bool iscplx() { return false; }
-template<> STATIC_ bool iscplx<cl_float2>() { return true; }
-template<> STATIC_ bool iscplx<cl_double2>() { return true; }
+template<>
+inline bool islong<long>() {
+    return true;
+}
+
+template<>
+inline bool islong<unsigned long>() {
+    return true;
+}
 
 template<typename T>
-STATIC_
-std::string scalar_to_option(const T &val)
-{
-    return std::to_string(+val);
+inline std::string scalar_to_option(const T &val) {
+    using namespace arrayfire::common;
+    using std::to_string;
+    return to_string(+val);
 }
 
 template<>
-STATIC_
-std::string scalar_to_option<cl_float2>(const cl_float2 &val) {
+inline std::string scalar_to_option<cl_float2>(const cl_float2 &val) {
     std::ostringstream ss;
     ss << val.s[0] << "," << val.s[1];
     return ss.str();
 }
 
 template<>
-STATIC_
-std::string scalar_to_option<cl_double2>(const cl_double2 &val) {
+inline std::string scalar_to_option<cl_double2>(const cl_double2 &val) {
     std::ostringstream ss;
     ss << val.s[0] << "," << val.s[1];
     return ss.str();
 }
-}
 
 using af::dtype_traits;
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/transform.cpp b/src/backend/opencl/transform.cpp
index 22cc88ae51..de99f48a60 100644
--- a/src/backend/opencl/transform.cpp
+++ b/src/backend/opencl/transform.cpp
@@ -7,70 +7,61 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
-#include <af/dim4.hpp>
-#include <Array.hpp>
 #include <transform.hpp>
+
+#include <copy.hpp>
 #include <kernel/transform.hpp>
-#include <stdexcept>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &transform,
-                       const af::dim4 &odims,
-                       const af_interp_type method, const bool inverse)
-    {
-        Array<T> out = createEmptyArray<T>(odims);
+namespace arrayfire {
+namespace opencl {
 
-        if(inverse) {
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    kernel::transform<T, true, AF_INTERP_NEAREST>
-                                     (out, in, transform);
-                    break;
-                case AF_INTERP_BILINEAR:
-                    kernel::transform<T, true, AF_INTERP_BILINEAR>
-                                     (out, in, transform);
-                    break;
-                default:
-                    AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                    break;
-            }
-        } else {
-            switch(method) {
-                case AF_INTERP_NEAREST:
-                    kernel::transform<T, false, AF_INTERP_NEAREST>
-                                     (out, in, transform);
-                    break;
-                case AF_INTERP_BILINEAR:
-                    kernel::transform<T, false, AF_INTERP_BILINEAR>
-                                     (out, in, transform);
-                    break;
-                default:
-                    AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
-                    break;
-            }
-        }
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective) {
+    // TODO: Temporary Fix, must fix handling subarrays upstream
+    // tf has to be linear, although offset is allowed.
+    const Array<float> tf_Lin = tf.isLinear() ? tf : copyArray(tf);
 
-        return out;
+    switch (method) {
+        case AF_INTERP_NEAREST:
+        case AF_INTERP_LOWER:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 1);
+            break;
+        case AF_INTERP_BILINEAR:
+        case AF_INTERP_BILINEAR_COSINE:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 2);
+            break;
+        case AF_INTERP_BICUBIC:
+        case AF_INTERP_BICUBIC_SPLINE:
+            kernel::transform<T>(out, in, tf_Lin, inverse, perspective, method,
+                                 3);
+            break;
+        default: AF_ERROR("Unsupported interpolation type", AF_ERR_ARG);
     }
+}
 
+#define INSTANTIATE(T)                                                       \
+    template void transform(Array<T> &out, const Array<T> &in,               \
+                            const Array<float> &tf,                          \
+                            const af_interp_type method, const bool inverse, \
+                            const bool perspective);
 
-#define INSTANTIATE(T)                                                  \
-    template Array<T> transform(const Array<T> &in, const Array<float> &transform, \
-                                const af::dim4 &odims, const af_interp_type method, \
-                                const bool inverse);                    \
-
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(uchar)
-    INSTANTIATE(char)
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/transform.hpp b/src/backend/opencl/transform.hpp
index f0b4d4c955..50c1455be0 100644
--- a/src/backend/opencl/transform.hpp
+++ b/src/backend/opencl/transform.hpp
@@ -7,12 +7,13 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/image.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<T> transform(const Array<T> &in, const Array<float> &tf, const af::dim4 &odims,
-                        const af_interp_type method, const bool inverse);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void transform(Array<T> &out, const Array<T> &in, const Array<float> &tf,
+               const af_interp_type method, const bool inverse,
+               const bool perspective);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/transpose.cpp b/src/backend/opencl/transpose.cpp
index 43a1da9df3..248de43017 100644
--- a/src/backend/opencl/transpose.cpp
+++ b/src/backend/opencl/transpose.cpp
@@ -6,51 +6,50 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/transpose.hpp>
+#include <transpose.hpp>
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <transpose.hpp>
-#include <kernel/transpose.hpp>
+#include <common/half.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T> transpose(const Array<T> &in, const bool conjugate)
-{
-    const dim4 inDims   = in.dims();
-    dim4 outDims  = dim4(inDims[1],inDims[0],inDims[2],inDims[3]);
-    Array<T> out  = createEmptyArray<T>(outDims);
-
-    if(conjugate) {
-        if(inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0)
-            kernel::transpose<T, true, true>(out, in);
-        else
-            kernel::transpose<T, true, false>(out, in);
-    } else {
-        if(inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0)
-            kernel::transpose<T, false, true>(out, in);
-        else
-            kernel::transpose<T, false, false>(out, in);
-    }
+Array<T> transpose(const Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+    dim4 outDims       = dim4(inDims[1], inDims[0], inDims[2], inDims[3]);
+    Array<T> out       = createEmptyArray<T>(outDims);
+
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+
+    kernel::transpose<T>(out, in, getQueue(), conjugate, is32multiple);
 
     return out;
 }
 
-#define INSTANTIATE(T)                                                          \
+#define INSTANTIATE(T) \
     template Array<T> transpose(const Array<T> &in, const bool conjugate);
 
-INSTANTIATE(float  )
-INSTANTIATE(cfloat )
-INSTANTIATE(double )
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
 INSTANTIATE(cdouble)
-INSTANTIATE(char   )
-INSTANTIATE(int    )
-INSTANTIATE(uint   )
-INSTANTIATE(uchar  )
-INSTANTIATE(intl   )
-INSTANTIATE(uintl  )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/transpose.hpp b/src/backend/opencl/transpose.hpp
index 58014bdc20..7bb1f66bbf 100644
--- a/src/backend/opencl/transpose.hpp
+++ b/src/backend/opencl/transpose.hpp
@@ -9,13 +9,14 @@
 
 #include <Array.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-Array<T>  transpose(const Array<T> &in, const bool conjugate);
+Array<T> transpose(const Array<T> &in, const bool conjugate);
 
 template<typename T>
 void transpose_inplace(Array<T> &in, const bool conjugate);
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/transpose_inplace.cpp b/src/backend/opencl/transpose_inplace.cpp
index c30ff2e058..d6b783e5b2 100644
--- a/src/backend/opencl/transpose_inplace.cpp
+++ b/src/backend/opencl/transpose_inplace.cpp
@@ -7,46 +7,45 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <transpose.hpp>
+#include <common/half.hpp>
 #include <kernel/transpose_inplace.hpp>
+#include <transpose.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<typename T>
-void transpose_inplace(Array<T> &in, const bool conjugate)
-{
-    dim4 iDims = in.dims();
-
-    if(conjugate) {
-        if(iDims[0] % kernel::TILE_DIM == 0 && iDims[1] % kernel::TILE_DIM == 0)
-            kernel::transpose_inplace<T, true, true>(in);
-        else
-            kernel::transpose_inplace<T, true, false>(in);
-    } else {
-        if(iDims[0] % kernel::TILE_DIM == 0 && iDims[1] % kernel::TILE_DIM == 0)
-            kernel::transpose_inplace<T, false, true>(in);
-        else
-            kernel::transpose_inplace<T, false, false>(in);
-    }
+void transpose_inplace(Array<T> &in, const bool conjugate) {
+    const dim4 &inDims = in.dims();
+
+    const bool is32multiple =
+        inDims[0] % kernel::TILE_DIM == 0 && inDims[1] % kernel::TILE_DIM == 0;
+
+    kernel::transpose_inplace<T>(in, getQueue(), conjugate, is32multiple);
 }
 
-#define INSTANTIATE(T)                                                          \
+#define INSTANTIATE(T) \
     template void transpose_inplace(Array<T> &in, const bool conjugate);
 
-INSTANTIATE(float  )
-INSTANTIATE(cfloat )
-INSTANTIATE(double )
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
 INSTANTIATE(cdouble)
-INSTANTIATE(char   )
-INSTANTIATE(int    )
-INSTANTIATE(uint   )
-INSTANTIATE(uchar  )
-INSTANTIATE(intl   )
-INSTANTIATE(uintl  )
-
-}
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/triangle.cpp b/src/backend/opencl/triangle.cpp
index 371aead83c..346f8d1af7 100644
--- a/src/backend/opencl/triangle.cpp
+++ b/src/backend/opencl/triangle.cpp
@@ -6,52 +6,52 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#include <kernel/triangle.hpp>
+#include <triangle.hpp>
 
-#include <af/dim4.hpp>
 #include <Array.hpp>
-#include <triangle.hpp>
-#include <kernel/triangle.hpp>
+#include <common/half.hpp>
+#include <af/dim4.hpp>
 
 using af::dim4;
+using arrayfire::common::half;
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
-template<typename T, bool is_upper, bool is_unit_diag>
-void triangle(Array<T> &out, const Array<T> &in)
-{
-    kernel::triangle<T, is_upper, is_unit_diag>(out, in);
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag) {
+    kernel::triangle<T>(out, in, is_upper, is_unit_diag);
 }
 
-
-template<typename T, bool is_upper, bool is_unit_diag>
-Array<T> triangle(const Array<T> &in)
-{
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag) {
     Array<T> out = createEmptyArray<T>(in.dims());
-    triangle<T, is_upper, is_unit_diag>(out, in);
+    triangle<T>(out, in, is_upper, is_unit_diag);
     return out;
 }
 
-
 #define INSTANTIATE(T)                                                  \
-    template void triangle<T, true ,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false,  true>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, true , false>(Array<T> &out, const Array<T> &in); \
-    template void triangle<T, false, false>(Array<T> &out, const Array<T> &in); \
-    template Array<T> triangle<T, true ,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, false,  true>(const Array<T> &in);    \
-    template Array<T> triangle<T, true , false>(const Array<T> &in);    \
-    template Array<T> triangle<T, false, false>(const Array<T> &in);    \
-
-    INSTANTIATE(float)
-    INSTANTIATE(double)
-    INSTANTIATE(cfloat)
-    INSTANTIATE(cdouble)
-    INSTANTIATE(int)
-    INSTANTIATE(uint)
-    INSTANTIATE(intl)
-    INSTANTIATE(uintl)
-    INSTANTIATE(char)
-    INSTANTIATE(uchar)
-
-}
+    template void triangle<T>(Array<T> &, const Array<T> &, const bool, \
+                              const bool);                              \
+    template Array<T> triangle<T>(const Array<T> &, const bool, const bool);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(char)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/triangle.hpp b/src/backend/opencl/triangle.hpp
index f54acfebd8..51061d51b8 100644
--- a/src/backend/opencl/triangle.hpp
+++ b/src/backend/opencl/triangle.hpp
@@ -7,14 +7,16 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T, bool is_upper, bool is_unit_diag>
-    void triangle(Array<T> &out, const Array<T> &in);
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+void triangle(Array<T> &out, const Array<T> &in, const bool is_upper,
+              const bool is_unit_diag);
 
-    template<typename T, bool is_upper, bool is_unit_diag>
-    Array<T> triangle(const Array<T> &in);
-}
+template<typename T>
+Array<T> triangle(const Array<T> &in, const bool is_upper,
+                  const bool is_unit_diag);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/types.cpp b/src/backend/opencl/types.cpp
index df8e76a78c..90393de3f9 100644
--- a/src/backend/opencl/types.cpp
+++ b/src/backend/opencl/types.cpp
@@ -1,29 +1,106 @@
 /*******************************************************
-* Copyright (c) 2014, ArrayFire
-* All rights reserved.
-*
-* This file is distributed under 3-clause BSD license.
-* The complete license agreement can be obtained at:
-* http://arrayfire.com/licenses/BSD-3-Clause
-********************************************************/
-
-#include <af/defines.h>
-#include "types.hpp"
-
-namespace opencl
-{
-
-    template<typename T > const char *shortname(bool caps) { return caps ? "X" : "x"; }
-
-    template<> const char *shortname<float   >(bool caps) { return caps ? "S" : "s"; }
-    template<> const char *shortname<double  >(bool caps) { return caps ? "D" : "d"; }
-    template<> const char *shortname<cfloat  >(bool caps) { return caps ? "C" : "c"; }
-    template<> const char *shortname<cdouble >(bool caps) { return caps ? "Z" : "z"; }
-    template<> const char *shortname<int     >(bool caps) { return caps ? "I" : "i"; }
-    template<> const char *shortname<uint    >(bool caps) { return caps ? "U" : "u"; }
-    template<> const char *shortname<char    >(bool caps) { return caps ? "J" : "j"; }
-    template<> const char *shortname<uchar   >(bool caps) { return caps ? "V" : "v"; }
-    template<> const char *shortname<intl    >(bool caps) { return caps ? "L" : "l"; }
-    template<> const char *shortname<uintl   >(bool caps) { return caps ? "K" : "k"; }
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
+#include <types.hpp>
+
+#include <common/half.hpp>
+#include <common/util.hpp>
+#include <type_util.hpp>
+
+#include <cmath>
+#include <sstream>
+#include <string>
+
+using arrayfire::common::half;
+using arrayfire::common::toString;
+
+using std::isinf;
+using std::stringstream;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+inline std::string ToNumStr<T>::operator()(T val) {
+    ToNum<T> toNum;
+    return toString(toNum(val));
+}
+
+template<>
+std::string ToNumStr<float>::operator()(float val) {
+    static const char *PINF = "+INFINITY";
+    static const char *NINF = "-INFINITY";
+    if (isinf(val)) { return val < 0.f ? NINF : PINF; }
+    return toString(val);
+}
+
+template<>
+std::string ToNumStr<double>::operator()(double val) {
+    static const char *PINF = "+INFINITY";
+    static const char *NINF = "-INFINITY";
+    if (isinf(val)) { return val < 0. ? NINF : PINF; }
+    return toString(val);
+}
+
+template<>
+std::string ToNumStr<cfloat>::operator()(cfloat val) {
+    ToNumStr<float> realStr;
+    stringstream s;
+    s << "{" << realStr(val.s[0]) << "," << realStr(val.s[1]) << "}";
+    return s.str();
+}
+
+template<>
+std::string ToNumStr<cdouble>::operator()(cdouble val) {
+    ToNumStr<double> realStr;
+    stringstream s;
+    s << "{" << realStr(val.s[0]) << "," << realStr(val.s[1]) << "}";
+    return s.str();
 }
+
+template<>
+std::string ToNumStr<half>::operator()(half val) {
+    using namespace std;
+    using namespace common;
+    static const char *PINF = "+INFINITY";
+    static const char *NINF = "-INFINITY";
+    if (isinf(val)) { return val < 0.f ? NINF : PINF; }
+    return toString(val);
+}
+
+template<>
+template<>
+std::string ToNumStr<half>::operator()<float>(float val) {
+    static const char *PINF = "+INFINITY";
+    static const char *NINF = "-INFINITY";
+    if (isinf(half(val))) { return val < 0.f ? NINF : PINF; }
+    return toString(val);
+}
+
+#define INSTANTIATE(TYPE) template struct ToNumStr<TYPE>
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(cfloat);
+INSTANTIATE(cdouble);
+INSTANTIATE(short);
+INSTANTIATE(ushort);
+INSTANTIATE(int);
+INSTANTIATE(uint);
+INSTANTIATE(intl);
+INSTANTIATE(uintl);
+INSTANTIATE(schar);
+INSTANTIATE(uchar);
+INSTANTIATE(char);
+INSTANTIATE(half);
+
+#undef INSTANTIATE
+
+}  // namespace opencl
+}  // namespace arrayfire
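Note on `ToNumStr` in the file above: the `float`, `double` and `half` specializations spell infinities as `+INFINITY` / `-INFINITY` instead of leaving them to stream formatting, presumably so that the text can be embedded in generated kernel source. The complex specializations wrap the two components as `{re,im}`. A minimal standalone sketch of that behavior is shown below. It assumes `std::ostringstream` as a stand-in for `common::toString`, and the names `to_num_str` and `to_complex_str` are hypothetical, not part of ArrayFire.

```cpp
#include <cmath>
#include <sstream>
#include <string>

// Hypothetical stand-in for ToNumStr<float>: infinities get a fixed spelling,
// every other value goes through ordinary stream formatting (assumed here in
// place of common::toString).
inline std::string to_num_str(float val) {
    if (std::isinf(val)) { return val < 0.f ? "-INFINITY" : "+INFINITY"; }
    std::ostringstream out;
    out << val;
    return out.str();
}

// Hypothetical stand-in for ToNumStr<cfloat>: formats the pair as "{re,im}",
// reusing the scalar formatter for each component.
inline std::string to_complex_str(float re, float im) {
    return "{" + to_num_str(re) + "," + to_num_str(im) + "}";
}
```

Routing each complex component through the scalar formatter means an infinite real or imaginary part gets the same spelling as an infinite scalar.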
diff --git a/src/backend/opencl/types.hpp b/src/backend/opencl/types.hpp
index 69f5030646..48985ab837 100644
--- a/src/backend/opencl/types.hpp
+++ b/src/backend/opencl/types.hpp
@@ -8,24 +8,167 @@
  ********************************************************/
 
 #pragma once
-#if __APPLE__
-#include <OpenCL/cl.h>
-#else
-#include <CL/cl.h>
-#endif
 
-namespace opencl
-{
+#include <cl2hpp.hpp>
+#include <common/kernel_type.hpp>
+#include <common/traits.hpp>
+#include <af/compilers.h>
+#include <af/traits.hpp>
 
-    typedef cl_float2   cfloat;
-    typedef cl_double2 cdouble;
-    typedef cl_uchar     uchar;
-    typedef cl_uint       uint;
+#include <algorithm>
+#include <array>
+#include <string>
 
-    template<typename T> struct is_complex          { static const bool value = false;  };
-    template<> struct           is_complex<cfloat>  { static const bool value = true;   };
-    template<> struct           is_complex<cdouble> { static const bool value = true;   };
+namespace arrayfire {
+namespace common {
+/// This is a CPU-based half type that needs to be converted to float before it
+/// is used
+template<>
+struct kernel_type<common::half> {
+    using data = common::half;
 
-    template<typename T > const char *shortname(bool caps=false);
+    // These are the types within a kernel
+    using native = float;
 
+    using compute = float;
+};
+}  // namespace common
+}  // namespace arrayfire
+
+namespace arrayfire {
+namespace opencl {
+using cdouble = cl_double2;
+using cfloat  = cl_float2;
+using intl    = long long;
+using schar   = cl_char;
+using uchar   = cl_uchar;
+using uint    = cl_uint;
+using uintl   = unsigned long long;
+using ushort  = cl_ushort;
+
+template<typename T>
+using compute_t = typename common::kernel_type<T>::compute;
+
+template<typename T>
+using data_t = typename common::kernel_type<T>::data;
+
+template<typename T>
+struct ToNumStr {
+    std::string operator()(T val);
+    template<typename CONVERSION_TYPE>
+    std::string operator()(CONVERSION_TYPE val);
+};
+
+namespace {
+template<typename T>
+inline const char *shortname(bool caps = false) {
+    return caps ? "X" : "x";
+}
+
+template<>
+inline const char *shortname<float>(bool caps) {
+    return caps ? "S" : "s";
+}
+template<>
+inline const char *shortname<double>(bool caps) {
+    return caps ? "D" : "d";
+}
+template<>
+inline const char *shortname<cfloat>(bool caps) {
+    return caps ? "C" : "c";
+}
+template<>
+inline const char *shortname<cdouble>(bool caps) {
+    return caps ? "Z" : "z";
+}
+template<>
+inline const char *shortname<int>(bool caps) {
+    return caps ? "I" : "i";
+}
+template<>
+inline const char *shortname<uint>(bool caps) {
+    return caps ? "U" : "u";
+}
+template<>
+inline const char *shortname<char>(bool caps) {
+    return caps ? "J" : "j";
+}
+template<>
+inline const char *shortname<schar>(bool caps) {
+    return caps ? "A" : "a"; // TODO
+}
+template<>
+inline const char *shortname<uchar>(bool caps) {
+    return caps ? "V" : "v";
+}
+template<>
+inline const char *shortname<intl>(bool caps) {
+    return caps ? "L" : "l";
 }
+template<>
+inline const char *shortname<uintl>(bool caps) {
+    return caps ? "K" : "k";
+}
+template<>
+inline const char *shortname<short>(bool caps) {
+    return caps ? "P" : "p";
+}
+template<>
+inline const char *shortname<ushort>(bool caps) {
+    return caps ? "Q" : "q";
+}
+
+template<typename T>
+inline const char *getFullName() {
+    return af::dtype_traits<T>::getName();
+}
+
+template<>
+inline const char *getFullName<schar>() {
+    return "char";
+}
+
+template<>
+inline const char *getFullName<cfloat>() {
+    return "float2";
+}
+
+template<>
+inline const char *getFullName<cdouble>() {
+    return "double2";
+}
+}  // namespace
+
+template<typename... ARGS>
+AF_CONSTEXPR const char *getTypeBuildDefinition() {
+    using arrayfire::common::half;
+    using std::any_of;
+    using std::array;
+    using std::begin;
+    using std::end;
+    using std::is_same;
+    array<bool, sizeof...(ARGS)> is_half    = {is_same<ARGS, half>::value...};
+    array<bool, sizeof...(ARGS)> is_double  = {is_same<ARGS, double>::value...};
+    array<bool, sizeof...(ARGS)> is_cdouble = {
+        is_same<ARGS, cdouble>::value...};
+
+    bool half_def =
+        any_of(begin(is_half), end(is_half), [](bool val) { return val; });
+    bool double_def =
+        any_of(begin(is_double), end(is_double), [](bool val) { return val; });
+    bool cdouble_def = any_of(begin(is_cdouble), end(is_cdouble),
+                              [](bool val) { return val; });
+
+    if (half_def && (double_def || cdouble_def)) {
+        return " -D USE_HALF -D USE_DOUBLE";
+    } else if (half_def) {
+        return " -D USE_HALF";
+    } else if (double_def || cdouble_def) {
+        return " -D USE_DOUBLE";
+    } else {
+        return "";
+    }
+}
+
+}  // namespace opencl
+}  // namespace arrayfire
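Note on `getTypeBuildDefinition` in the file above: it inspects the template parameter pack and returns the `-D USE_HALF` / `-D USE_DOUBLE` flags that the kernel build needs, treating `cdouble` the same as `double`. The selection logic can be sketched without any OpenCL headers as follows. The tag types `half_tag` and `cdouble_tag` and the function names are hypothetical stand-ins for `common::half`, `cl_double2` and the real function. The sketch returns `std::string` rather than a `const char *` for brevity.

```cpp
#include <initializer_list>
#include <string>
#include <type_traits>

// Hypothetical placeholder types standing in for common::half and cdouble.
struct half_tag {};
struct cdouble_tag {};

// True when Needle appears anywhere in the pack Ts. The leading `false`
// keeps the braced list non-empty when the pack itself is empty.
template<typename Needle, typename... Ts>
bool contains_type() {
    bool found = false;
    for (bool match : {false, std::is_same<Ts, Needle>::value...}) {
        found = found || match;
    }
    return found;
}

// Builds the same flag strings as the four-way branch in the diff.
template<typename... Ts>
std::string build_definition() {
    const bool use_half   = contains_type<half_tag, Ts...>();
    const bool use_double = contains_type<double, Ts...>() ||
                            contains_type<cdouble_tag, Ts...>();
    std::string flags;
    if (use_half) { flags += " -D USE_HALF"; }
    if (use_double) { flags += " -D USE_DOUBLE"; }
    return flags;
}
```

Appending the two flags independently produces the same four outcomes as the `if`/`else if` chain in the diff, including the combined `" -D USE_HALF -D USE_DOUBLE"` case.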
diff --git a/src/backend/opencl/unary.hpp b/src/backend/opencl/unary.hpp
index 4d6796e0d7..9ff2fea8c6 100644
--- a/src/backend/opencl/unary.hpp
+++ b/src/backend/opencl/unary.hpp
@@ -7,23 +7,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+#pragma once
 #include <Array.hpp>
-#include <optypes.hpp>
+#include <common/jit/UnaryNode.hpp>
 #include <math.hpp>
-#include <JIT/UnaryNode.hpp>
+#include <optypes.hpp>
 
-namespace opencl
-{
+namespace arrayfire {
+namespace opencl {
 
 template<af_op_t op>
-static const char *unaryName() { return "noop"; }
+static const char *unaryName();
 
-#define UNARY_DECL(OP, FNAME)                   \
-    template<> STATIC_                          \
-    const char *unaryName<af_##OP##_t>()        \
-    {                                           \
-        return FNAME;                           \
-    }                                           \
+#define UNARY_DECL(OP, FNAME)                     \
+    template<>                                    \
+    inline const char *unaryName<af_##OP##_t>() { \
+        return FNAME;                             \
+    }
 
 #define UNARY_FN(OP) UNARY_DECL(OP, #OP)
 
@@ -44,6 +44,7 @@ UNARY_FN(acosh)
 UNARY_FN(atanh)
 
 UNARY_FN(exp)
+UNARY_DECL(sigmoid, "__sigmoid")
 UNARY_FN(expm1)
 UNARY_FN(erf)
 UNARY_FN(erfc)
@@ -57,42 +58,55 @@ UNARY_FN(log10)
 UNARY_FN(log2)
 
 UNARY_FN(sqrt)
+UNARY_FN(rsqrt)
 UNARY_FN(cbrt)
 
 UNARY_FN(trunc)
 UNARY_FN(round)
-UNARY_FN(sign)
+UNARY_FN(signbit)
 UNARY_FN(ceil)
 UNARY_FN(floor)
 
 UNARY_FN(isinf)
 UNARY_FN(isnan)
 UNARY_FN(iszero)
+UNARY_DECL(noop, "__noop")
 
-template<typename T, af_op_t op>
-Array<T> unaryOp(const Array<T> &in)
-{
-    JIT::Node_ptr in_node = in.getNode();
+UNARY_DECL(bitnot, "__bitnot")
 
-    JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<T>::getName(),
-                                              shortname<T>(true),
-                                              unaryName<op>(),
-                                              in_node, op);
+#undef UNARY_FN
 
-    return createNodeArray<T>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
+template<typename T, af_op_t op>
+Array<T> unaryOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node;
+    using arrayfire::common::Node_ptr;
+    using std::array;
+
+    auto createUnary = [](array<Node_ptr, 1> &operands) {
+        return common::Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(dtype_traits<T>::af_type), unaryName<op>(),
+            operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<T>(outDim, out);
 }
 
 template<typename T, af_op_t op>
-Array<char> checkOp(const Array<T> &in)
-{
-    JIT::Node_ptr in_node = in.getNode();
-
-    JIT::UnaryNode *node = new JIT::UnaryNode(dtype_traits<char>::getName(),
-                                              shortname<char>(true),
-                                              unaryName<op>(),
-                                              in_node, op);
-
-    return createNodeArray<char>(in.dims(), JIT::Node_ptr(reinterpret_cast<JIT::Node *>(node)));
+Array<char> checkOp(const Array<T> &in, dim4 outDim = dim4(-1, -1, -1, -1)) {
+    using arrayfire::common::Node_ptr;
+
+    auto createUnary = [](std::array<Node_ptr, 1> &operands) {
+        return Node_ptr(new common::UnaryNode(
+            static_cast<af::dtype>(dtype_traits<char>::af_type),
+            unaryName<op>(), operands[0], op));
+    };
+
+    if (outDim == dim4(-1, -1, -1, -1)) { outDim = in.dims(); }
+    Node_ptr out = common::createNaryNode<T, 1>(outDim, createUnary, {&in});
+    return createNodeArray<char>(outDim, out);
 }
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/unwrap.cpp b/src/backend/opencl/unwrap.cpp
new file mode 100644
index 0000000000..3fb0d9a14c
--- /dev/null
+++ b/src/backend/opencl/unwrap.cpp
@@ -0,0 +1,65 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
+#include <kernel/unwrap.hpp>
+#include <unwrap.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+
+    dim_t nx = 1 + (idims[0] + 2 * px - (((wx - 1) * dx) + 1)) / sx;
+    dim_t ny = 1 + (idims[1] + 2 * py - (((wy - 1) * dy) + 1)) / sy;
+
+    af::dim4 odims(wx * wy, nx * ny, idims[2], idims[3]);
+
+    if (!is_column) { std::swap(odims[0], odims[1]); }
+
+    Array<T> outArray = createEmptyArray<T>(odims);
+    kernel::unwrap<T>(outArray, in, wx, wy, sx, sy, px, py, dx, dy, nx,
+                      is_column);
+
+    return outArray;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> unwrap<T>(                                            \
+        const Array<T> &in, const dim_t wx, const dim_t wy, const dim_t sx, \
+        const dim_t sy, const dim_t px, const dim_t py, const dim_t dx,     \
+        const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace opencl
+}  // namespace arrayfire
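Note on the output size computed in `unwrap` above: `nx` and `ny` count how many window positions fit along each axis once padding, stride and dilation are taken into account. A dilated window of `w` taps covers `(w - 1) * d + 1` input elements. The per-axis arithmetic can be sketched in isolation as below. The function name `window_count` is hypothetical, and the sketch assumes a positive stride and a window that fits inside the padded input.

```cpp
// Number of positions a sliding window can take along one axis.
// dim: input extent, window: taps per window, stride: step between windows,
// pad: padding added on each side, dilation: spacing between taps.
inline long long window_count(long long dim, long long window,
                              long long stride, long long pad,
                              long long dilation) {
    // Input elements covered by one dilated window.
    const long long span = (window - 1) * dilation + 1;
    return 1 + (dim + 2 * pad - span) / stride;
}
```

For example, a 3-tap window with dilation 2 spans 5 elements, so an unpadded axis of length 7 with stride 1 yields 3 positions. The output array in the diff then has `wx * wy` rows and `nx * ny` columns (swapped when `is_column` is false).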
diff --git a/src/backend/opencl/unwrap.hpp b/src/backend/opencl/unwrap.hpp
new file mode 100644
index 0000000000..f65e324c67
--- /dev/null
+++ b/src/backend/opencl/unwrap.hpp
@@ -0,0 +1,19 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<T> unwrap(const Array<T> &in, const dim_t wx, const dim_t wy,
+                const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+                const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/vector_field.cpp b/src/backend/opencl/vector_field.cpp
new file mode 100644
index 0000000000..4d85032602
--- /dev/null
+++ b/src/backend/opencl/vector_field.cpp
@@ -0,0 +1,108 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <GraphicsResourceManager.hpp>
+#include <debug_opencl.hpp>
+#include <err_opencl.hpp>
+#include <vector_field.hpp>
+
+using af::dim4;
+using arrayfire::common::ForgeModule;
+using arrayfire::common::forgePlugin;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield) {
+    ForgeModule &_ = common::forgePlugin();
+    if (isGLSharingSupported()) {
+        CheckGL("Begin OpenCL resource copy");
+        const cl::Buffer *d_points     = points.get();
+        const cl::Buffer *d_directions = directions.get();
+        unsigned pBytes                = 0;
+        unsigned dBytes                = 0;
+        FG_CHECK(_.fg_get_vector_field_vertex_buffer_size(&pBytes, vfield));
+        FG_CHECK(_.fg_get_vector_field_direction_buffer_size(&dBytes, vfield));
+
+        auto res = interopManager().getVectorFieldResources(vfield);
+
+        std::vector<cl::Memory> shared_objects;
+        shared_objects.push_back(*(res[0].get()));
+        shared_objects.push_back(*(res[1].get()));
+
+        glFinish();
+
+        // Use of events:
+        // https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/clEnqueueReleaseGLObjects.html
+        cl::Event event;
+
+        getQueue().enqueueAcquireGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+        getQueue().enqueueCopyBuffer(*d_points, *(res[0].get()), 0, 0, pBytes,
+                                     NULL, &event);
+        getQueue().enqueueCopyBuffer(*d_directions, *(res[1].get()), 0, 0,
+                                     dBytes, NULL, &event);
+        getQueue().enqueueReleaseGLObjects(&shared_objects, NULL, &event);
+        event.wait();
+
+        CL_DEBUG_FINISH(getQueue());
+        CheckGL("End OpenCL resource copy");
+    } else {
+        unsigned size1 = 0, size2 = 0;
+        unsigned buff1 = 0, buff2 = 0;
+        FG_CHECK(_.fg_get_vector_field_vertex_buffer_size(&size1, vfield));
+        FG_CHECK(_.fg_get_vector_field_direction_buffer_size(&size2, vfield));
+        FG_CHECK(_.fg_get_vector_field_vertex_buffer(&buff1, vfield));
+        FG_CHECK(_.fg_get_vector_field_direction_buffer(&buff2, vfield));
+
+        CheckGL("Begin OpenCL fallback-resource copy");
+
+        // Points
+        glBindBuffer(GL_ARRAY_BUFFER, buff1);
+        auto *pPtr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (pPtr) {
+            getQueue().enqueueReadBuffer(*points.get(), CL_TRUE, 0, size1,
+                                         pPtr);
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+
+        // Directions
+        glBindBuffer(GL_ARRAY_BUFFER, buff2);
+        auto *dPtr =
+            static_cast<GLubyte *>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
+        if (dPtr) {
+            getQueue().enqueueReadBuffer(*directions.get(), CL_TRUE, 0, size2,
+                                         dPtr);
+            glUnmapBuffer(GL_ARRAY_BUFFER);
+        }
+        glBindBuffer(GL_ARRAY_BUFFER, 0);
+        CheckGL("End OpenCL fallback-resource copy");
+    }
+}
+
+#define INSTANTIATE(T)                                                     \
+    template void copy_vector_field<T>(const Array<T> &, const Array<T> &, \
+                                       fg_vector_field);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/vector_field.hpp b/src/backend/opencl/vector_field.hpp
new file mode 100644
index 0000000000..33d4d61dff
--- /dev/null
+++ b/src/backend/opencl/vector_field.hpp
@@ -0,0 +1,20 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/graphics_common.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void copy_vector_field(const Array<T> &points, const Array<T> &directions,
+                       fg_vector_field vfield);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/where.cpp b/src/backend/opencl/where.cpp
index 1ce82bf717..ae86cd8521 100644
--- a/src/backend/opencl/where.cpp
+++ b/src/backend/opencl/where.cpp
@@ -7,39 +7,38 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/dim4.hpp>
-#include <af/defines.h>
-#include <ArrayInfo.hpp>
 #include <Array.hpp>
 #include <err_opencl.hpp>
+#include <kernel/where.hpp>
 #include <where.hpp>
+#include <af/dim4.hpp>
 #include <complex>
-#include <kernel/where.hpp>
-
-namespace opencl
-{
-    template<typename T>
-    Array<uint> where(const Array<T> &in)
-    {
-        Param Out;
-        Param In = in;
-        kernel::where<T>(Out, In);
-        return createParamArray<uint>(Out);
-    }
 
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<uint> where(const Array<T> &in) {
+    Param Out;
+    Param In = in;
+    kernel::where<T>(Out, In);
+    return createParamArray<uint>(Out, true);
+}
 
-#define INSTANTIATE(T)                                  \
-    template Array<uint> where<T>(const Array<T> &in);  \
+#define INSTANTIATE(T) template Array<uint> where<T>(const Array<T> &in);
 
-    INSTANTIATE(float  )
-    INSTANTIATE(cfloat )
-    INSTANTIATE(double )
-    INSTANTIATE(cdouble)
-    INSTANTIATE(char   )
-    INSTANTIATE(int    )
-    INSTANTIATE(uint   )
-    INSTANTIATE(intl   )
-    INSTANTIATE(uintl  )
-    INSTANTIATE(uchar  )
+INSTANTIATE(float)
+INSTANTIATE(cfloat)
+INSTANTIATE(double)
+INSTANTIATE(cdouble)
+INSTANTIATE(char)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
 
-}
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/where.hpp b/src/backend/opencl/where.hpp
index 481d34f6f1..a5ee5feca4 100644
--- a/src/backend/opencl/where.hpp
+++ b/src/backend/opencl/where.hpp
@@ -7,11 +7,11 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <af/array.h>
 #include <Array.hpp>
 
-namespace opencl
-{
-    template<typename T>
-    Array<uint> where(const Array<T>& in);
-}
+namespace arrayfire {
+namespace opencl {
+template<typename T>
+Array<uint> where(const Array<T>& in);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/wrap.cpp b/src/backend/opencl/wrap.cpp
new file mode 100644
index 0000000000..418dc9bc1f
--- /dev/null
+++ b/src/backend/opencl/wrap.cpp
@@ -0,0 +1,77 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+#include <common/dispatch.hpp>
+#include <common/half.hpp>
+#include <err_opencl.hpp>
+#include <kernel/wrap.hpp>
+#include <math.hpp>
+#include <wrap.hpp>
+#include <stdexcept>
+
+using arrayfire::common::half;
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column) {
+    kernel::wrap<T>(out, in, wx, wy, sx, sy, px, py, is_column);
+}
+
+#define INSTANTIATE(T)                                                        \
+    template void wrap<T>(Array<T> & out, const Array<T> &in, const dim_t wx, \
+                          const dim_t wy, const dim_t sx, const dim_t sy,     \
+                          const dim_t px, const dim_t py,                     \
+                          const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(cfloat)
+INSTANTIATE(cdouble)
+INSTANTIATE(int)
+INSTANTIATE(uint)
+INSTANTIATE(intl)
+INSTANTIATE(uintl)
+INSTANTIATE(schar)
+INSTANTIATE(uchar)
+INSTANTIATE(char)
+INSTANTIATE(short)
+INSTANTIATE(ushort)
+#undef INSTANTIATE
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column) {
+    af::dim4 idims = in.dims();
+    af::dim4 odims(ox, oy, idims[2], idims[3]);
+    Array<T> out = createValueArray<T>(odims, scalar<T>(0));
+
+    kernel::wrap_dilated<T>(out, in, wx, wy, sx, sy, px, py, dx, dy, is_column);
+    return out;
+}
+
+#define INSTANTIATE(T)                                                      \
+    template Array<T> wrap_dilated<T>(                                      \
+        const Array<T> &in, const dim_t ox, const dim_t oy, const dim_t wx, \
+        const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,     \
+        const dim_t py, const dim_t dx, const dim_t dy, const bool is_column);
+
+INSTANTIATE(float)
+INSTANTIATE(double)
+INSTANTIATE(half)
+#undef INSTANTIATE
+
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/opencl/wrap.hpp b/src/backend/opencl/wrap.hpp
new file mode 100644
index 0000000000..cceb47ee43
--- /dev/null
+++ b/src/backend/opencl/wrap.hpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <Array.hpp>
+
+namespace arrayfire {
+namespace opencl {
+
+template<typename T>
+void wrap(Array<T> &out, const Array<T> &in, const dim_t wx, const dim_t wy,
+          const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+          const bool is_column);
+
+template<typename T>
+Array<T> wrap_dilated(const Array<T> &in, const dim_t ox, const dim_t oy,
+                      const dim_t wx, const dim_t wy, const dim_t sx,
+                      const dim_t sy, const dim_t px, const dim_t py,
+                      const dim_t dx, const dim_t dy, const bool is_column);
+}  // namespace opencl
+}  // namespace arrayfire
diff --git a/src/backend/template/.gitignore b/src/backend/template/.gitignore
deleted file mode 100644
index e69de29bb2..0000000000
diff --git a/test/.clang-format b/test/.clang-format
new file mode 100644
index 0000000000..47afdf3208
--- /dev/null
+++ b/test/.clang-format
@@ -0,0 +1,144 @@
+---
+Language:        Cpp
+# BasedOnStyle:  Google
+AccessModifierOffset: -1
+AlignAfterOpenBracket: Align
+AlignConsecutiveAssignments: true
+AlignConsecutiveDeclarations: false
+AlignEscapedNewlines: Left
+AlignOperands:   true
+AlignTrailingComments: true
+AllowAllParametersOfDeclarationOnNextLine: true
+AllowShortBlocksOnASingleLine: true
+AllowShortCaseLabelsOnASingleLine: true
+AllowShortFunctionsOnASingleLine: All
+AllowShortIfStatementsOnASingleLine: true
+AllowShortLoopsOnASingleLine: true
+AlwaysBreakAfterReturnType: None
+AlwaysBreakBeforeMultilineStrings: true
+AlwaysBreakTemplateDeclarations: Yes
+BinPackArguments: true
+BinPackParameters: true
+BraceWrapping:   
+  AfterClass:      false
+  AfterControlStatement: false
+  AfterEnum:       false
+  AfterFunction:   false
+  AfterNamespace:  false
+  AfterObjCDeclaration: false
+  AfterStruct:     false
+  AfterUnion:      false
+  AfterExternBlock: false
+  BeforeCatch:     false
+  BeforeElse:      false
+  IndentBraces:    false
+  SplitEmptyFunction: false
+  SplitEmptyRecord: false
+  SplitEmptyNamespace: false
+BreakBeforeBinaryOperators: None
+BreakBeforeBraces: Custom
+BreakInheritanceList: BeforeComma
+BreakBeforeTernaryOperators: true
+BreakConstructorInitializers: BeforeComma
+BreakStringLiterals: true
+ColumnLimit:     80
+CommentPragmas:  '^ IWYU pragma:'
+CompactNamespaces: false
+ConstructorInitializerAllOnOneLineOrOnePerLine: true
+ConstructorInitializerIndentWidth: 4
+ContinuationIndentWidth: 4
+Cpp11BracedListStyle: true
+DerivePointerAlignment: true
+DisableFormat:   false
+ExperimentalAutoDetectBinPacking: false
+FixNamespaceComments: true
+ForEachMacros:
+  - foreach
+  - Q_FOREACH
+  - BOOST_FOREACH
+IncludeBlocks:   Preserve
+IncludeCategories: 
+  - Regex:           '^<af/.*\.h.*>'
+    Priority:        2
+  - Regex:           '^<.*\.h.*>'
+    Priority:        1
+  - Regex:           '^<.*'
+    Priority:        3
+  - Regex:           '.*'
+    Priority:        4
+IncludeIsMainRegex: '([-_](test|unittest))?$'
+IndentCaseLabels: true
+IndentPPDirectives: None
+IndentWidth:     4
+IndentWrappedFunctionNames: false
+JavaScriptQuotes: Leave
+JavaScriptWrapImports: true
+KeepEmptyLinesAtTheStartOfBlocks: false
+MacroBlockBegin: ''
+MacroBlockEnd:   ''
+MaxEmptyLinesToKeep: 1
+NamespaceIndentation: None
+ObjCBinPackProtocolList: Never
+ObjCBlockIndentWidth: 2
+ObjCSpaceAfterProperty: false
+ObjCSpaceBeforeProtocolList: true
+PenaltyBreakAssignment: 2
+PenaltyBreakBeforeFirstCallParameter: 1
+PenaltyBreakComment: 300
+PenaltyBreakFirstLessLess: 120
+PenaltyBreakString: 1000
+PenaltyBreakTemplateDeclaration: 10
+PenaltyExcessCharacter: 1000000
+PenaltyReturnTypeOnItsOwnLine: 200
+PointerAlignment: Right
+RawStringFormats: 
+  - Language:        Cpp
+    Delimiters:      
+      - cc
+      - CC
+      - cpp
+      - Cpp
+      - CPP
+      - 'c++'
+      - 'C++'
+      - R
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+  - Language:        TextProto
+    Delimiters:      
+      - pb
+      - PB
+      - proto
+      - PROTO
+    EnclosingFunctions: 
+      - EqualsProto
+      - EquivToProto
+      - PARSE_PARTIAL_TEXT_PROTO
+      - PARSE_TEST_PROTO
+      - PARSE_TEXT_PROTO
+      - ParseTextOrDie
+      - ParseTextProtoOrDie
+    CanonicalDelimiter: ''
+    BasedOnStyle:    google
+ReflowComments:  true
+SortIncludes:    true
+SortUsingDeclarations: true
+SpaceAfterCStyleCast: false
+SpaceAfterTemplateKeyword: false
+SpaceBeforeAssignmentOperators: true
+SpaceBeforeCpp11BracedList: false
+SpaceBeforeCtorInitializerColon: true
+SpaceBeforeInheritanceColon: true
+SpaceBeforeParens: ControlStatements
+SpaceBeforeRangeBasedForLoopColon: true
+SpaceInEmptyParentheses: false
+SpacesBeforeTrailingComments: 2
+SpacesInAngles:  false
+SpacesInContainerLiterals: true
+SpacesInCStyleCastParentheses: false
+SpacesInParentheses: false
+SpacesInSquareBrackets: false
+Standard:        Cpp11
+TabWidth:        4
+UseTab:          Never
+
diff --git a/test/CMakeLists.txt b/test/CMakeLists.txt
index 4f2c07016e..64e1feb777 100644
--- a/test/CMakeLists.txt
+++ b/test/CMakeLists.txt
@@ -1,97 +1,514 @@
-CMAKE_MINIMUM_REQUIRED(VERSION 2.8)
-
-REMOVE_DEFINITIONS(-std=c++11)
-
-MACRO(CREATE_TESTS BACKEND GTEST_LIBS)
-    STRING(TOUPPER ${BACKEND} DEF_NAME)
-
-    FOREACH(FILE ${FILES})
-        GET_FILENAME_COMPONENT(FNAME ${FILE} NAME_WE)
-        SET(TEST_NAME ${FNAME}_${BACKEND})
-
-        ADD_TEST(Test_${TEST_NAME} ${TEST_NAME})
-        ADD_EXECUTABLE(${TEST_NAME} ${FNAME}.cpp)
-        TARGET_LINK_LIBRARIES(${TEST_NAME}  af${BACKEND}
-                      ${THREAD_LIB_FLAG}
-                      ${GTEST_LIBS})
-        SET_TARGET_PROPERTIES(${TEST_NAME}
-                      PROPERTIES
-                      COMPILE_FLAGS -DAF_${DEF_NAME}
-                      FOLDER "Tests/${BACKEND}")
-    ENDFOREACH()
-
-ENDMACRO(CREATE_TESTS)
-
-FIND_PACKAGE(Threads REQUIRED)
-IF(CMAKE_USE_PTHREADS_INIT AND NOT "${APPLE}")
-    SET(THREAD_LIB_FLAG "-pthread")
-ELSE()
-    SET(THREAD_LIB_FLAG ${CMAKE_THREAD_LIBS_INIT})
-ENDIF()
-
-OPTION(USE_SYSTEM_GTEST "Use GTEST from system libraries" OFF)
-IF(USE_SYSTEM_GTEST)
-    FIND_PACKAGE(GTest REQUIRED)
-ELSE(USE_SYSTEM_GTEST)
-    INCLUDE("${CMAKE_MODULE_PATH}/build_gtest.cmake")
-ENDIF(USE_SYSTEM_GTEST)
-
-INCLUDE_DIRECTORIES(${GTEST_INCLUDE_DIRS})
-
-OPTION(USE_RELATIVE_TEST_DIR "Use relative paths for the test data directory(For continious integration(CI) purposes only)" OFF)
-
-IF(${USE_RELATIVE_TEST_DIR})
-    SET(RELATIVE_TEST_DATA_DIR "./data" CACHE STRING "Relative Test Data Directory")
-    SET(TESTDATA_SOURCE_DIR ${RELATIVE_TEST_DATA_DIR})
-ELSE(${USE_RELATIVE_TEST_DIR})
-    SET(TESTDATA_SOURCE_DIR "${CMAKE_SOURCE_DIR}/test/data")
-ENDIF(${USE_RELATIVE_TEST_DIR})
-
-IF (${CMAKE_GENERATOR} STREQUAL "Xcode")
-    ADD_DEFINITIONS("-D TEST_DIR=\"\\\\\"${TESTDATA_SOURCE_DIR}\\\\\"\"")
-ELSE (${CMAKE_GENERATOR} STREQUAL "Xcode")
-    ADD_DEFINITIONS("-D TEST_DIR=\"\\\"${TESTDATA_SOURCE_DIR}\\\"\"")
-ENDIF (${CMAKE_GENERATOR} STREQUAL "Xcode")
-
-IF(NOT ${USE_RELATIVE_TEST_DIR})
-# Check if data exists
-IF (EXISTS "${TESTDATA_SOURCE_DIR}" AND IS_DIRECTORY "${TESTDATA_SOURCE_DIR}"
-    AND EXISTS "${TESTDATA_SOURCE_DIR}/README.md")
-    # Test data is available
-    # Do Nothing
-ELSE (EXISTS "${TESTDATA_SOURCE_DIR}" AND IS_DIRECTORY "${TESTDATA_SOURCE_DIR}"
-    AND EXISTS "${TESTDATA_SOURCE_DIR}/README.md")
-    MESSAGE(WARNING "Test Data is not available. Tests will build but fail when run.")
-    MESSAGE("Did you miss the --recursive option when cloning?")
-    MESSAGE("Run the following commands to correct this:")
-    MESSAGE("git submodule init")
-    MESSAGE("git submodule update")
-    MESSAGE("git submodule foreach git pull origin master")
-ENDIF()
-ENDIF(NOT ${USE_RELATIVE_TEST_DIR})
-
-INCLUDE_DIRECTORIES(${CMAKE_CURRENT_SOURCE_DIR})
-FILE(GLOB FILES "*.cpp")
-
-IF(${BUILD_CPU})
-    CREATE_TESTS(cpu "${GTEST_LIBRARIES}")
-ENDIF()
-
-IF(${BUILD_CUDA})
-    FIND_PACKAGE(CUDA REQUIRED)
-    IF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang" AND ${CUDA_VERSION_MAJOR} VERSION_LESS 7)
-        CREATE_TESTS(cuda "${GTEST_LIBRARIES_STDLIB}")
-        FOREACH(FILE ${FILES})
-            GET_FILENAME_COMPONENT(FNAME ${FILE} NAME_WE)
-            SET(TEST_NAME ${FNAME}_cuda)
-            SET_TARGET_PROPERTIES(${TEST_NAME} PROPERTIES COMPILE_FLAGS -stdlib=libstdc++)
-            SET_TARGET_PROPERTIES(${TEST_NAME} PROPERTIES LINK_FLAGS -stdlib=libstdc++)
-        ENDFOREACH()
-    ELSE("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang" AND ${CUDA_VERSION_MAJOR} VERSION_LESS 7)
-        CREATE_TESTS(cuda "${GTEST_LIBRARIES}")
-    ENDIF("${APPLE}" AND ${CMAKE_CXX_COMPILER_ID} STREQUAL "Clang" AND ${CUDA_VERSION_MAJOR} VERSION_LESS 7)
-ENDIF()
-
-IF(${BUILD_OPENCL})
-    CREATE_TESTS(opencl "${GTEST_LIBRARIES}")
-ENDIF()
+# Copyright (c) 2025, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(AF_TEST_WITH_MTX_FILES
+    ON CACHE BOOL
+    "Download and run tests on large matrices from sparse.tamu.edu")
+
+set(CMAKE_MODULE_PATH ${CMAKE_MODULE_PATH} "${CMAKE_CURRENT_SOURCE_DIR}/CMakeModules")
+
+if(AF_CTEST_SEPARATED)
+  include(GoogleTest)
+endif()
+
+if(AF_TEST_WITH_MTX_FILES)
+  include(download_sparse_datasets)
+endif()
+
+if(AF_WITH_EXTERNAL_PACKAGES_ONLY)
+  dependency_check(GTest_FOUND "Google Tests not found.")
+elseif(NOT TARGET GTest::gtest)
+  af_dep_check_and_populate(${gtest_prefix}
+    URI https://github.com/google/googletest.git
+    REF v1.16.0
+  )
+  if(WIN32)
+    set(gtest_force_shared_crt ON
+        CACHE INTERNAL "Required so that the libs Runtime is not set to MT DLL")
+    set(BUILD_SHARED_LIBS OFF)
+  endif()
+
+  add_subdirectory(${${gtest_prefix}_SOURCE_DIR} ${${gtest_prefix}_BINARY_DIR} EXCLUDE_FROM_ALL)
+  target_compile_definitions(gtest PRIVATE GTEST_HAS_SEH=OFF)
+  set_target_properties(gtest
+    PROPERTIES
+      FOLDER "ExternalProjectTargets/gtest")
+  target_compile_options(gtest
+    PRIVATE
+      $<$<BOOL:${has_cxx_fp_model}>:-fp-model precise>)
+  if(NOT TARGET GTest::gtest)
+    add_library(GTest::gtest ALIAS gtest)
+  endif()
+  # Hide gtest project variables
+  mark_as_advanced(
+    BUILD_SHARED_LIBS
+    BUILD_GMOCK
+    INSTALL_GTEST
+    gmock_build_tests
+    gtest_build_samples
+    gtest_build_tests
+    gtest_disable_pthreads
+    gtest_force_shared_crt
+    gtest_hide_internal_symbols
+  )
+endif()
+
+if(NOT TARGET mmio)
+  add_subdirectory(mmio)
+endif()
+
+
+# Registers test with ctest
+#
+# Parameters
+#  target: The target associated with this test
+#  backend: The backend associated with this test
+#  is_serial: If true the test will be serialized
+function(af_add_test target backend is_serial)
+  if(AF_CTEST_SEPARATED)
+    gtest_discover_tests(${target}
+      TEST_PREFIX $<UPPER_CASE:${backend}>.
+      DISCOVERY_TIMEOUT 40)
+  else()
+    add_test(NAME ${target} COMMAND ${target})
+    if(${is_serial})
+      set_tests_properties(${target}
+        PROPERTIES
+          ENVIRONMENT AF_PRINT_ERRORS=1
+          TIMEOUT 900
+          RUN_SERIAL ON)
+    endif(${is_serial})
+  endif()
+endfunction()
+
+# Reset the CXX flags for tests
+set(CMAKE_CXX_STANDARD 11)
+
+# TODO(pradeep) perhaps rename AF_USE_RELATIVE_TEST_DIR to AF_WITH_TEST_DATA_DIR
+#               with empty default value
+if(${AF_USE_RELATIVE_TEST_DIR})
+  # RELATIVE_TEST_DATA_DIR is a User-visible option with default value of test/data directory
+  # This code arm assumes user is responsible for providing the test data path
+  set(RELATIVE_TEST_DATA_DIR "${CMAKE_CURRENT_SOURCE_DIR}/data" CACHE
+      STRING "Relative Test Data Directory")
+  set(TESTDATA_SOURCE_DIR ${RELATIVE_TEST_DATA_DIR})
+else(${AF_USE_RELATIVE_TEST_DIR})
+  af_dep_check_and_populate(${testdata_prefix}
+    URI https://github.com/arrayfire/arrayfire-data.git
+    #Add test file for SSAS_LinearSteps
+    REF 05703a4897c8b89b7a0ece1dbe21ede33d226f44
+  )
+  set(TESTDATA_SOURCE_DIR "${${testdata_prefix}_SOURCE_DIR}")
+endif(${AF_USE_RELATIVE_TEST_DIR})
+
+if(AF_BUILD_CPU)
+  list(APPEND enabled_backends "cpu")
+endif(AF_BUILD_CPU)
+
+if(AF_BUILD_CUDA)
+  list(APPEND enabled_backends "cuda")
+endif(AF_BUILD_CUDA)
+
+if(AF_BUILD_OPENCL)
+  list(APPEND enabled_backends "opencl")
+endif(AF_BUILD_OPENCL)
+
+if(AF_BUILD_ONEAPI)
+  list(APPEND enabled_backends "oneapi")
+endif(AF_BUILD_ONEAPI)
+
+if(AF_BUILD_UNIFIED)
+  list(APPEND enabled_backends "unified")
+endif(AF_BUILD_UNIFIED)
+
+add_library(arrayfire_test STATIC
+  testHelpers.hpp
+  arrayfire_test.cpp)
+
+target_include_directories(arrayfire_test
+  PRIVATE
+    ${CMAKE_CURRENT_LIST_DIR}
+    ${ArrayFire_SOURCE_DIR}/include
+    ${ArrayFire_BINARY_DIR}/include)
+
+target_include_directories(arrayfire_test
+  SYSTEM PRIVATE
+    ${ArrayFire_SOURCE_DIR}/extern/half/include
+  )
+
+# The tautological-constant-compare warning is always thrown for std::nan
+# and std::info calls. It's unnecessarily verbose.
+target_compile_options(arrayfire_test
+  PUBLIC
+    # Intel compilers use fast math by default and ignore special floating point
+    # values like NaN and Infs.
+    $<$<COMPILE_LANGUAGE:CXX>:
+      $<$<BOOL:${has_cxx_fp_model}>:-fp-model precise>
+      $<$<BOOL:${has_cxx_unqualified_std_cast_call}>:-Wno-unqualified-std-cast-call>>
+  PRIVATE
+    $<$<CXX_COMPILER_ID:MSVC>: /bigobj
+                               /EHsc>
+  )
+if(WIN32)
+  target_compile_definitions(arrayfire_test
+    PRIVATE
+      WIN32_LEAN_AND_MEAN
+      NOMINMAX)
+endif()
+
+target_compile_definitions(arrayfire_test
+  PUBLIC
+    $<$<BOOL:${AF_WITH_FAST_MATH}>:AF_WITH_FAST_MATH>
+  PRIVATE
+    TEST_RESULT_IMAGE_DIR="${CMAKE_BINARY_DIR}/test/"
+    USE_MTX)
+
+target_link_libraries(arrayfire_test
+  PRIVATE
+    mmio
+  PUBLIC
+    GTest::gtest
+    Boost::boost
+  )
+
+# Creates tests for all backends
+#
+# Creates a standard test for all backends. Most of the time you only need to
+# specify the name of the source file to create a test.
+#
+# Parameters
+# ----------
+# 'CXX11'       If set the tests will be compiled using c++11. Tests should strive
+#               to be C++98 compliant
+# 'SRC'         The source files for the test
+# 'LIBRARIES'   Libraries other than ArrayFire that need to be linked
+# 'DEFINITIONS' Definitions that need to be defined
+# 'BACKENDS'    Backends to target for this test. If not set then the test will
+#               be compiled against all backends
+function(make_test)
+  set(options CXX11 SERIAL USE_MMIO NO_ARRAYFIRE_TEST)
+  set(single_args SRC)
+  set(multi_args LIBRARIES DEFINITIONS BACKENDS)
+  cmake_parse_arguments(mt_args "${options}" "${single_args}" "${multi_args}" ${ARGN})
+
+  get_filename_component(src_name ${mt_args_SRC} NAME_WE)
+  foreach(backend ${enabled_backends})
+    if(NOT "${mt_args_BACKENDS}" STREQUAL "" AND
+       NOT ${backend} IN_LIST mt_args_BACKENDS)
+      continue()
+    endif()
+    set(target "test_${src_name}_${backend}")
+
+    add_executable(${target} ${mt_args_SRC})
+    target_include_directories(${target}
+      PRIVATE
+        ${CMAKE_SOURCE_DIR}
+        ${CMAKE_CURRENT_SOURCE_DIR})
+    target_include_directories(${target}
+      SYSTEM PRIVATE
+        ${ArrayFire_SOURCE_DIR}/extern/half/include
+      )
+    target_link_libraries(${target}
+      PRIVATE
+        ${mt_args_LIBRARIES}
+        arrayfire_test
+      )
+
+    target_compile_options(${target}
+      PRIVATE
+        $<$<CXX_COMPILER_ID:MSVC>: /bigobj
+                                   /EHsc>
+      )
+
+    if(${backend} STREQUAL "unified")
+      target_link_libraries(${target}
+        PRIVATE
+          af)
+    else()
+      target_link_libraries(${target}
+        PRIVATE
+          af${backend}
+          )
+    endif()
+
+    if(${mt_args_CXX11})
+      set_target_properties(${target}
+        PROPERTIES
+          CXX_STANDARD 11)
+    endif(${mt_args_CXX11})
+
+    set_target_properties(${target}
+      PROPERTIES
+        FOLDER "Tests"
+        OUTPUT_NAME "${src_name}_${backend}")
+
+    target_compile_definitions(${target}
+      PRIVATE
+        TEST_DIR="${TESTDATA_SOURCE_DIR}"
+        AF_$<UPPER_CASE:${backend}>
+        ${mt_args_DEFINITIONS}
+      )
+    target_link_libraries(${target} PRIVATE mmio)
+    if(AF_TEST_WITH_MTX_FILES AND ${mt_args_USE_MMIO})
+      target_compile_definitions(${target}
+        PRIVATE
+        MTX_TEST_DIR="${ArrayFire_BINARY_DIR}/extern/matrixmarket/"
+        )
+    endif()
+    if(AF_SKIP_UNSUPPORTED_TESTS)
+      target_compile_definitions(${target}
+        PRIVATE
+          SKIP_UNSUPPORTED_TESTS)
+    endif()
+    if(WIN32)
+      target_compile_definitions(${target}
+        PRIVATE
+          WIN32_LEAN_AND_MEAN
+          NOMINMAX)
+    endif()
+
+    # TODO(umar): Create this executable separately
+    if(NOT ${backend} STREQUAL "unified" OR ${target} STREQUAL "backend_unified")
+      af_add_test(${target} ${backend} ${mt_args_SERIAL})
+    endif()
+
+  endforeach()
+endfunction(make_test)
+
+make_test(SRC anisotropic_diffusion.cpp)
+make_test(SRC approx1.cpp)
+make_test(SRC approx2.cpp)
+make_test(SRC array.cpp CXX11)
+make_test(SRC array_death_tests.cpp CXX11 SERIAL)
+make_test(SRC arrayio.cpp)
+make_test(SRC assign.cpp CXX11)
+make_test(SRC backend.cpp CXX11)
+make_test(SRC basic.cpp)
+make_test(SRC bilateral.cpp)
+make_test(SRC binary.cpp CXX11)
+make_test(SRC blas.cpp)
+make_test(SRC canny.cpp)
+make_test(SRC cast.cpp)
+make_test(SRC cholesky_dense.cpp SERIAL)
+make_test(SRC clamp.cpp)
+make_test(SRC compare.cpp)
+make_test(SRC complex.cpp)
+make_test(SRC confidence_connected.cpp CXX11)
+make_test(SRC constant.cpp)
+make_test(SRC convolve.cpp CXX11)
+make_test(SRC corrcoef.cpp)
+make_test(SRC covariance.cpp)
+make_test(SRC diagonal.cpp)
+make_test(SRC diff1.cpp)
+make_test(SRC diff2.cpp)
+make_test(SRC dog.cpp)
+make_test(SRC dot.cpp)
+make_test(SRC empty.cpp)
+make_test(SRC event.cpp CXX11)
+make_test(SRC fast.cpp)
+make_test(SRC fft.cpp)
+make_test(SRC fft_large.cpp)
+make_test(SRC fft_real.cpp)
+make_test(SRC fftconvolve.cpp)
+make_test(SRC flat.cpp)
+make_test(SRC flip.cpp)
+make_test(SRC gaussiankernel.cpp)
+make_test(SRC gen_assign.cpp)
+make_test(SRC gen_index.cpp CXX11)
+make_test(SRC getting_started.cpp)
+make_test(SRC gfor.cpp)
+make_test(SRC gradient.cpp)
+make_test(SRC gray_rgb.cpp)
+make_test(SRC half.cpp)
+make_test(SRC hamming.cpp)
+make_test(SRC harris.cpp)
+make_test(SRC histogram.cpp)
+make_test(SRC homography.cpp)
+make_test(SRC hsv_rgb.cpp)
+make_test(SRC iir.cpp)
+make_test(SRC imageio.cpp)
+make_test(SRC index.cpp CXX11)
+make_test(SRC info.cpp)
+make_test(SRC internal.cpp)
+make_test(SRC inverse_deconv.cpp)
+make_test(SRC inverse_dense.cpp SERIAL)
+make_test(SRC iota.cpp)
+make_test(SRC ireduce.cpp)
+make_test(SRC iterative_deconv.cpp)
+make_test(SRC jit.cpp CXX11)
+make_test(SRC join.cpp)
+make_test(SRC lu_dense.cpp SERIAL)
+#make_test(manual_memory_test.cpp)
+make_test(SRC match_template.cpp)
+make_test(SRC math.cpp CXX11)
+make_test(SRC matrix_manipulation.cpp)
+make_test(SRC mean.cpp)
+make_test(SRC meanshift.cpp)
+make_test(SRC meanvar.cpp CXX11)
+make_test(SRC medfilt.cpp)
+make_test(SRC median.cpp)
+make_test(SRC memory.cpp CXX11)
+make_test(SRC memory_lock.cpp)
+make_test(SRC missing.cpp)
+make_test(SRC moddims.cpp)
+make_test(SRC moments.cpp)
+make_test(SRC morph.cpp)
+make_test(SRC nearest_neighbour.cpp CXX11)
+make_test(SRC nodevice.cpp CXX11)
+make_test(SRC norm.cpp CXX11)
+
+if(OpenCL_FOUND)
+  make_test(SRC ocl_ext_context.cpp
+            LIBRARIES OpenCL::OpenCL OpenCL::cl2hpp
+            BACKENDS "opencl"
+            CXX11)
+  make_test(SRC interop_opencl_custom_kernel_snippet.cpp
+            LIBRARIES OpenCL::OpenCL
+            BACKENDS "opencl"
+            NO_ARRAYFIRE_TEST
+            CXX11)
+  make_test(SRC interop_opencl_external_context_snippet.cpp
+            LIBRARIES OpenCL::OpenCL OpenCL::cl2hpp
+            BACKENDS "opencl"
+            NO_ARRAYFIRE_TEST
+            CXX11)
+endif()
+
+if(AF_BUILD_CUDA)
+  if(CUDA_FOUND)
+    include(AFcuda_helpers)
+    foreach(backend ${enabled_backends})
+      set(cuda_test_backends "cuda" "unified")
+      if(${backend} IN_LIST cuda_test_backends)
+        set(target test_cuda_${backend})
+        add_executable(${target} cuda.cu)
+        target_include_directories(${target}
+          PRIVATE
+          ${CMAKE_SOURCE_DIR}
+          ${CMAKE_CURRENT_SOURCE_DIR})
+        target_include_directories(${target}
+          SYSTEM PRIVATE
+            ${ArrayFire_SOURCE_DIR}/extern/half/include)
+        if(${backend} STREQUAL "unified")
+          target_link_libraries(${target}
+            ArrayFire::af)
+        else()
+          target_link_libraries(${target}
+            ArrayFire::af${backend})
+        endif()
+        target_link_libraries(${target}
+          mmio
+          arrayfire_test)
+  
+        # Couldn't get Threads::Threads to work with this cuda binary. The import
+        # target would not add the -pthread flag which is required for this
+        # executable (on Ubuntu 18.04 anyway)
+        check_cxx_compiler_flag(-pthread pthread_flag)
+        if(pthread_flag)
+          target_link_libraries(${target} -pthread)
+        endif()
+  
+        af_detect_and_set_cuda_architectures(${target})
+  
+        set_target_properties(${target}
+          PROPERTIES
+            FOLDER "Tests"
+            OUTPUT_NAME "cuda_${backend}")
+  
+        if(NOT ${backend} STREQUAL "unified")
+          af_add_test(${target} ${backend} ON)
+        endif()
+      endif()
+    endforeach()
+  endif()
+endif()
+
+
+make_test(SRC orb.cpp)
+make_test(SRC pad_borders.cpp CXX11)
+make_test(SRC pinverse.cpp SERIAL)
+make_test(SRC qr_dense.cpp SERIAL)
+make_test(SRC random.cpp)
+make_test(SRC rng_quality.cpp BACKENDS "cuda;opencl" SERIAL)
+make_test(SRC range.cpp)
+make_test(SRC rank_dense.cpp SERIAL)
+make_test(SRC reduce.cpp CXX11)
+make_test(SRC regions.cpp)
+make_test(SRC reorder.cpp)
+make_test(SRC replace.cpp CXX11)
+make_test(SRC resize.cpp)
+make_test(SRC rng_match.cpp CXX11 BACKENDS "unified")
+make_test(SRC rotate.cpp)
+make_test(SRC rotate_linear.cpp)
+make_test(SRC sat.cpp)
+make_test(SRC scan.cpp)
+make_test(SRC scan_by_key.cpp)
+make_test(SRC select.cpp CXX11)
+make_test(SRC set.cpp CXX11)
+make_test(SRC shift.cpp)
+make_test(SRC gloh.cpp)
+make_test(SRC sift.cpp)
+make_test(SRC sobel.cpp)
+make_test(SRC solve_dense.cpp       CXX11 SERIAL)
+make_test(SRC sort.cpp)
+make_test(SRC sort_by_key.cpp)
+make_test(SRC sort_index.cpp)
+make_test(SRC sparse.cpp SERIAL)
+make_test(SRC sparse_arith.cpp      USE_MMIO)
+make_test(SRC sparse_convert.cpp)
+make_test(SRC stdev.cpp)
+make_test(SRC susan.cpp)
+make_test(SRC svd_dense.cpp         SERIAL)
+make_test(SRC threading.cpp         CXX11 SERIAL)
+make_test(SRC tile.cpp)
+make_test(SRC topk.cpp              CXX11)
+make_test(SRC transform.cpp)
+make_test(SRC transform_coordinates.cpp)
+make_test(SRC translate.cpp)
+make_test(SRC transpose.cpp)
+make_test(SRC transpose_inplace.cpp)
+make_test(SRC triangle.cpp)
+make_test(SRC unwrap.cpp)
+make_test(SRC var.cpp)
+make_test(SRC where.cpp)
+make_test(SRC wrap.cpp)
+make_test(SRC write.cpp)
+make_test(SRC ycbcr_rgb.cpp)
+
+foreach(backend ${enabled_backends})
+  set(target "basic_c_${backend}")
+  add_executable(${target} basic_c.c)
+  if(${backend} STREQUAL "unified")
+    target_link_libraries(${target}
+      PRIVATE
+      ArrayFire::af)
+  else()
+    target_link_libraries(${target}
+      PRIVATE
+      ArrayFire::af${backend})
+  endif()
+  add_test(NAME ${target} COMMAND ${target})
+endforeach()
+
+if(AF_TEST_WITH_MTX_FILES)
+  make_test(SRC matrixmarket.cpp USE_MMIO)
+endif()
+
+add_executable(print_info print_info.cpp)
+if(AF_BUILD_UNIFIED)
+  target_link_libraries(print_info ArrayFire::af)
+elseif(AF_BUILD_OPENCL)
+  target_link_libraries(print_info ArrayFire::afopencl)
+elseif(AF_BUILD_CUDA)
+  target_link_libraries(print_info ArrayFire::afcuda)
+elseif(AF_BUILD_CPU)
+  target_link_libraries(print_info ArrayFire::afcpu)
+elseif(AF_BUILD_ONEAPI)
+  target_link_libraries(print_info ArrayFire::afoneapi)
+endif()
+
+make_test(SRC jit_test_api.cpp)
diff --git a/test/CMakeModules/download_sparse_datasets.cmake b/test/CMakeModules/download_sparse_datasets.cmake
new file mode 100644
index 0000000000..74b2e8a69a
--- /dev/null
+++ b/test/CMakeModules/download_sparse_datasets.cmake
@@ -0,0 +1,47 @@
+# Copyright (c) 2021, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+set(URL "https://sparse.tamu.edu")
+
+function(mtxDownload name group)
+  set(root_dir ${ArrayFire_BINARY_DIR}/extern/matrixmarket)
+  set(target_dir ${root_dir}/${group}/${name})
+  set(mtx_name mtxDownload_${group}_${name})
+  string(TOLOWER ${mtx_name} mtx_name)
+
+  set_and_mark_depnames_advncd(mtx_prefix ${mtx_name})
+  af_dep_check_and_populate(${mtx_name}
+    URI ${URL}/MM/${group}/${name}.tar.gz
+  )
+
+  if(NOT EXISTS "${target_dir}/${name}.mtx")
+    file(MAKE_DIRECTORY ${target_dir})
+    file(COPY ${${mtx_name}_SOURCE_DIR}/${name}.mtx DESTINATION ${target_dir})
+  endif()
+endfunction()
+
+# Following files are used for testing mtx read fn
+# integer data
+mtxDownload("Trec4" "JGD_Kocay")
+# real data
+mtxDownload("bcsstm02" "HB")
+# complex data
+mtxDownload("young4c" "HB")
+
+#Following files are used for sparse-sparse arith
+# real data
+#linear programming problem
+mtxDownload("lpi_vol1" "LPnetlib")
+mtxDownload("lpi_qual" "LPnetlib")
+#Subsequent Circuit Simulation problem
+mtxDownload("oscil_dcop_12" "Sandia")
+mtxDownload("oscil_dcop_42" "Sandia")
+
+# complex data
+#Quantum Chemistry problem
+mtxDownload("conf6_0-4x4-20" "QCD")
+mtxDownload("conf6_0-4x4-30" "QCD")
diff --git a/test/anisotropic_diffusion.cpp b/test/anisotropic_diffusion.cpp
new file mode 100644
index 0000000000..a498d4cdd8
--- /dev/null
+++ b/test/anisotropic_diffusion.cpp
@@ -0,0 +1,193 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::exception;
+using af::fluxFunction;
+using af::max;
+using af::min;
+using af::randu;
+using std::abs;
+using std::string;
+using std::vector;
+
+template<typename T>
+class AnisotropicDiffusion : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, int, uint, schar, uchar, short, ushort>
+    TestTypes;
+
+TYPED_TEST_SUITE(AnisotropicDiffusion, TestTypes);
+
+template<typename T>
+array normalize(const array &p_in) {
+    T mx = max<T>(p_in);
+    T mn = min<T>(p_in);
+    return (p_in - mn) / (mx - mn);
+}
+
+template<typename T, bool isColor>
+void imageTest(string pTestFile, const float dt, const float K,
+               const uint iters, fluxFunction fluxKind,
+               bool isCurvatureDiffusion = false) {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            OutType;
+
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    using af::dim4;
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        if (isCurvatureDiffusion) {
+            inFiles[testId].insert(0, string(TEST_DIR "/curvature_diffusion/"));
+            outFiles[testId].insert(0,
+                                    string(TEST_DIR "/curvature_diffusion/"));
+        } else {
+            inFiles[testId].insert(0, string(TEST_DIR "/gradient_diffusion/"));
+            outFiles[testId].insert(0, string(TEST_DIR "/gradient_diffusion/"));
+        }
+
+        af_array _inArray   = 0;
+        af_array inArray    = 0;
+        af_array _outArray  = 0;
+        af_array cstArray   = 0;
+        af_array minArray   = 0;
+        af_array numArray   = 0;
+        af_array denArray   = 0;
+        af_array divArray   = 0;
+        af_array outArray   = 0;
+        af_array goldArray  = 0;
+        af_array _goldArray = 0;
+        dim_t nElems        = 0;
+
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, _inArray));
+
+        ASSERT_SUCCESS(
+            af_load_image(&_goldArray, outFiles[testId].c_str(), isColor));
+        // af_load_image always returns float array, so convert to output type
+        ASSERT_SUCCESS(conv_image<OutType>(&goldArray, _goldArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
+
+        if (isCurvatureDiffusion) {
+            ASSERT_SUCCESS(af_anisotropic_diffusion(&_outArray, inArray, dt, K,
+                                                                iters, fluxKind,
+                                                                AF_DIFFUSION_MCDE));
+        } else {
+            ASSERT_SUCCESS(af_anisotropic_diffusion(&_outArray, inArray, dt, K,
+                                                                iters, fluxKind,
+                                                                AF_DIFFUSION_GRAD));
+        }
+
+        double maxima, minima, imag;
+        ASSERT_SUCCESS(af_min_all(&minima, &imag, _outArray));
+        ASSERT_SUCCESS(af_max_all(&maxima, &imag, _outArray));
+
+        unsigned ndims;
+        dim_t dims[4];
+        ASSERT_SUCCESS(af_get_numdims(&ndims, _outArray));
+        ASSERT_SUCCESS(
+            af_get_dims(dims, dims + 1, dims + 2, dims + 3, _outArray));
+
+        af_dtype otype = (af_dtype)af::dtype_traits<OutType>::af_type;
+        ASSERT_SUCCESS(af_constant(&cstArray, 255.0, ndims, dims, otype));
+        ASSERT_SUCCESS(
+            af_constant(&denArray, (maxima - minima), ndims, dims, otype));
+        ASSERT_SUCCESS(af_constant(&minArray, minima, ndims, dims, otype));
+        ASSERT_SUCCESS(af_sub(&numArray, _outArray, minArray, false));
+        ASSERT_SUCCESS(af_div(&divArray, numArray, denArray, false));
+        ASSERT_SUCCESS(af_mul(&outArray, divArray, cstArray, false));
+
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.025);
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(_outArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(cstArray));
+        ASSERT_SUCCESS(af_release_array(minArray));
+        ASSERT_SUCCESS(af_release_array(denArray));
+        ASSERT_SUCCESS(af_release_array(numArray));
+        ASSERT_SUCCESS(af_release_array(divArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(_goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+}
+
+TYPED_TEST(AnisotropicDiffusion, GradientGrayscale) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    // Numeric values separated by underscore are arguments to fn being tested.
+    // Divide first value by 1000 to get time step `dt`
+    // Divide second value by 100 to get the conductance parameter `K`
+    // Third value stays as it is since it is the iteration count
+    // Fourth value is a 4-character string indicating the flux kind
+    imageTest<TypeParam, false>(
+        string(TEST_DIR "/gradient_diffusion/gray_00125_100_2_exp.test"),
+        0.125f, 1.0, 2, AF_FLUX_EXPONENTIAL);
+}
+
+TYPED_TEST(AnisotropicDiffusion, GradientColorImage) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    imageTest<TypeParam, true>(
+        string(TEST_DIR "/gradient_diffusion/color_00125_100_2_exp.test"),
+        0.125f, 1.0, 2, AF_FLUX_EXPONENTIAL);
+}
+
+TEST(AnisotropicDiffusion, GradientInvalidInputArray) {
+    try {
+        array out = anisotropicDiffusion(randu(100), 0.125f, 0.2f, 10,
+                                         AF_FLUX_QUADRATIC);
+    } catch (exception &exp) { ASSERT_EQ(AF_ERR_SIZE, exp.err()); }
+}
+
+TYPED_TEST(AnisotropicDiffusion, CurvatureGrayscale) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    // Numeric values separated by underscore are arguments to fn being tested.
+    // Divide first value by 1000 to get time step `dt`
+    // Divide second value by 100 to get the conductance parameter `K`
+    // Third value stays as it is since it is the iteration count
+    // Fourth value is a 4-character string indicating the flux kind
+    imageTest<TypeParam, false>(
+        string(TEST_DIR "/curvature_diffusion/gray_00125_100_2_mcde.test"),
+        0.125f, 1.0, 2, AF_FLUX_EXPONENTIAL, true);
+}
+
+TYPED_TEST(AnisotropicDiffusion, CurvatureColorImage) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    imageTest<TypeParam, true>(
+        string(TEST_DIR "/curvature_diffusion/color_00125_100_2_mcde.test"),
+        0.125f, 1.0, 2, AF_FLUX_EXPONENTIAL, true);
+}
+
+TEST(AnisotropicDiffusion, CurvatureInvalidInputArray) {
+    try {
+        array out = anisotropicDiffusion(randu(100), 0.125f, 0.2f, 10);
+    } catch (exception &exp) { ASSERT_EQ(AF_ERR_SIZE, exp.err()); }
+}
diff --git a/test/approx1.cpp b/test/approx1.cpp
index ec63557791..af719d8c4d 100644
--- a/test/approx1.cpp
+++ b/test/approx1.cpp
@@ -7,120 +7,238 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
+#include <af/algorithm.h>
+#include <af/arith.h>
 #include <af/array.h>
-#include <af/signal.h>
+#include <af/blas.h>
+#include <af/complex.h>
+#include <af/constants.h>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/exception.h>
+#include <af/gfor.h>
 #include <af/index.h>
+#include <af/random.h>
+#include <af/signal.h>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
 #include <complex>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::abs;
+using af::approx1;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::randu;
+using af::reorder;
+using af::seq;
+using af::span;
+using af::sum;
+
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Approx1 : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Approx1 : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
-// create a list of types to be tested
+// Create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
 
-// register the type list
-TYPED_TEST_CASE(Approx1, TestTypes);
+// Register the type list
+TYPED_TEST_SUITE(Approx1, TestTypes);
 
 template<typename T>
-void approx1Test(string pTestFile, const unsigned resultIdx, const af_interp_type method, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void approx1Test(string pTestFile, const unsigned resultIdx,
+                 const af_interp_type method, bool isSubRef = false,
+                 const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    typedef typename af::dtype_traits<T>::base_type BT;
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<T> > tests;
-    readTests<BT, T, float>(pTestFile,numDims,in,tests);
+    typedef typename dtype_traits<T>::base_type BT;
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<T>> tests;
+    readTests<BT, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
 
-    af_array inArray = 0;
-    af_array posArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array posArray  = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     vector<T> input(in[0].begin(), in[0].end());
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&posArray, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&posArray, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_approx1(&outArray, inArray, posArray, method, 0));
+    ASSERT_SUCCESS(af_approx1(&outArray, inArray, posArray, method, 0));
 
     // Get result
     T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
-    bool ret = true;
+    bool ret      = true;
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ret = (std::abs(tests[resultIdx][elIter] - outData[elIter]) < 0.0005);
-        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t" << outData[elIter] << "at: " << elIter << std::endl;
+        ret = (abs(tests[resultIdx][elIter] - outData[elIter]) < 0.0005);
+        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                             << outData[elIter] << " at: " << elIter << endl;
     }
 
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(posArray  != 0) af_release_array(posArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (posArray != 0) af_release_array(posArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
+}
+
+TYPED_TEST(Approx1, Approx1Nearest) {
+    approx1Test<TypeParam>(string(TEST_DIR "/approx/approx1.test"), 0,
+                           AF_INTERP_NEAREST);
 }
 
-#define APPROX1_INIT(desc, file, resultIdx, method)                               \
-    TYPED_TEST(Approx1, desc)                                                                    \
-    {                                                                                           \
-        approx1Test<TypeParam>(string(TEST_DIR"/approx/"#file".test"), resultIdx, method);\
+TYPED_TEST(Approx1, Approx1Linear) {
+    approx1Test<TypeParam>(string(TEST_DIR "/approx/approx1.test"), 1,
+                           AF_INTERP_LINEAR);
+}
+
+template<typename T>
+void approx1CubicTest(string pTestFile, const unsigned resultIdx,
+                      const af_interp_type method, bool isSubRef = false,
+                      const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    typedef typename dtype_traits<T>::base_type BT;
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<T>> tests;
+    readTests<BT, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+
+    af_array inArray   = 0;
+    af_array posArray  = 0;
+    af_array outArray  = 0;
+    af_array tempArray = 0;
+
+    vector<T> input(in[0].begin(), in[0].end());
+
+    if (isSubRef) {
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+    } else {
+        ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+    }
+
+    ASSERT_SUCCESS(af_create_array(&posArray, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_approx1(&outArray, inArray, posArray, method, 0));
+
+    // Get result
+    T* outData = new T[tests[resultIdx].size()];
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
+
+    // Compare result
+    size_t nElems = tests[resultIdx].size();
+    bool ret      = true;
+
+    float max = real(outData[0]), min = real(outData[0]);
+    for (int i = 1; i < (int)nElems; ++i) {
+        min = (real(outData[i]) < min) ? real(outData[i]) : min;
+        max = (real(outData[i]) > max) ? real(outData[i]) : max;
+    }
+    float range = max - min;
+    ASSERT_GT(range, 0.f);
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        double integral;
+        // Test that control points are exact
+        if ((std::modf(in[1][elIter], &integral) < 0.001) ||
+            (std::modf(in[1][elIter], &integral) > 0.999)) {
+            ret = abs(tests[resultIdx][elIter] - outData[elIter]) < 0.001;
+            ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                                 << outData[elIter] << " at: " << elIter << endl;
+        } else {
+            // Match intermediate values within a threshold
+            ret =
+                abs(tests[resultIdx][elIter] - outData[elIter]) < 0.035 * range;
+            ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                                 << outData[elIter] << " at: " << elIter << endl;
+        }
     }
 
-    APPROX1_INIT(Approx1Nearest, approx1, 0, AF_INTERP_NEAREST);
-    APPROX1_INIT(Approx1Linear, approx1, 1, AF_INTERP_LINEAR);
+    // Delete
+    delete[] outData;
+
+    if (inArray != 0) af_release_array(inArray);
+    if (posArray != 0) af_release_array(posArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
+}
+
+TYPED_TEST(Approx1, Approx1Cubic) {
+    approx1CubicTest<TypeParam>(string(TEST_DIR "/approx/approx1_cubic.test"),
+                                0, AF_INTERP_CUBIC_SPLINE);
+}
 
 ///////////////////////////////////////////////////////////////////////////////
 // Test Argument Failure Cases
 ///////////////////////////////////////////////////////////////////////////////
 template<typename T>
-void approx1ArgsTest(string pTestFile, const unsigned resultIdx, const af_interp_type method, const af_err err)
-{
-    if (noDoubleTests<T>()) return;
-    typedef typename af::dtype_traits<T>::base_type BT;
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<T> > tests;
-    readTests<BT, T, float>(pTestFile,numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
+void approx1ArgsTest(string pTestFile, const af_interp_type method,
+                     const af_err err) {
+    SUPPORTED_TYPE_CHECK(T);
+    typedef typename dtype_traits<T>::base_type BT;
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<T>> tests;
+    readTests<BT, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
 
     af_array inArray  = 0;
     af_array posArray = 0;
@@ -128,39 +246,45 @@ void approx1ArgsTest(string pTestFile, const unsigned resultIdx, const af_interp
 
     vector<T> input(in[0].begin(), in[0].end());
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()), idims.ndims(),
+                                   idims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&posArray, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&posArray, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
 
     ASSERT_EQ(err, af_approx1(&outArray, inArray, posArray, method, 0));
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(posArray  != 0) af_release_array(posArray);
-    if(outArray  != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (posArray != 0) af_release_array(posArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-#define APPROX1_ARGS(desc, file, resultIdx, method, err)                                            \
-    TYPED_TEST(Approx1, desc)                                                                       \
-    {                                                                                               \
-        approx1ArgsTest<TypeParam>(string(TEST_DIR"/approx/"#file".test"), resultIdx, method, err); \
-    }
-
-    APPROX1_ARGS(Approx1NearestArgsPos2D, approx1_pos2d, 0, AF_INTERP_NEAREST, AF_ERR_SIZE);
-    APPROX1_ARGS(Approx1LinearArgsPos2D, approx1_pos2d, 1, AF_INTERP_LINEAR, AF_ERR_SIZE);
-    APPROX1_ARGS(Approx1ArgsInterpBilinear, approx1, 0, AF_INTERP_BILINEAR, AF_ERR_ARG);
-    APPROX1_ARGS(Approx1ArgsInterpCubic, approx1, 0, AF_INTERP_CUBIC, AF_ERR_ARG);
+TYPED_TEST(Approx1, Approx1NearestArgsPos2D) {
+    approx1ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx1_pos2d.test"),
+                               AF_INTERP_NEAREST, AF_ERR_SIZE);
+}
+TYPED_TEST(Approx1, Approx1LinearArgsPos2D) {
+    approx1ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx1_pos2d.test"),
+                               AF_INTERP_LINEAR, AF_ERR_SIZE);
+}
+TYPED_TEST(Approx1, Approx1ArgsInterpBilinear) {
+    approx1ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx1.test"),
+                               AF_INTERP_BILINEAR, AF_ERR_ARG);
+}
 
 template<typename T>
-void approx1ArgsTestPrecision(string pTestFile, const unsigned resultIdx, const af_interp_type method)
-{
-    if (noDoubleTests<T>()) return;
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+void approx1ArgsTestPrecision(string pTestFile, const unsigned,
+                              const af_interp_type method) {
+    SUPPORTED_TYPE_CHECK(T);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
 
     af_array inArray  = 0;
     af_array posArray = 0;
@@ -168,52 +292,61 @@ void approx1ArgsTestPrecision(string pTestFile, const unsigned resultIdx, const
 
     vector<T> input(in[0].begin(), in[0].end());
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()), idims.ndims(),
+                                   idims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&posArray, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&posArray, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    if((af_dtype) af::dtype_traits<T>::af_type == c32 ||
-       (af_dtype) af::dtype_traits<T>::af_type == c64) {
-        ASSERT_EQ(AF_ERR_ARG, af_approx1(&outArray, inArray, posArray, method, 0));
+    if ((af_dtype)dtype_traits<T>::af_type == c32 ||
+        (af_dtype)dtype_traits<T>::af_type == c64) {
+        ASSERT_EQ(AF_ERR_ARG,
+                  af_approx1(&outArray, inArray, posArray, method, 0));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_approx1(&outArray, inArray, posArray, method, 0));
+        ASSERT_SUCCESS(af_approx1(&outArray, inArray, posArray, method, 0));
     }
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(posArray  != 0) af_release_array(posArray);
-    if(outArray  != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (posArray != 0) af_release_array(posArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-#define APPROX1_ARGSP(desc, file, resultIdx, method)                                           \
-    TYPED_TEST(Approx1, desc)                                                                    \
-    {                                                                                           \
-        approx1ArgsTestPrecision<TypeParam>(string(TEST_DIR"/approx/"#file".test"), resultIdx, method);\
-    }
+TYPED_TEST(Approx1, Approx1NearestArgsPrecision) {
+    approx1ArgsTestPrecision<TypeParam>(string(TEST_DIR "/approx/approx1.test"),
+                                        0, AF_INTERP_NEAREST);
+}
 
-    APPROX1_ARGSP(Approx1NearestArgsPrecision, approx1, 0, AF_INTERP_NEAREST);
-    APPROX1_ARGSP(Approx1LinearArgsPrecision, approx1, 1, AF_INTERP_LINEAR);
+TYPED_TEST(Approx1, Approx1LinearArgsPrecision) {
+    approx1ArgsTestPrecision<TypeParam>(string(TEST_DIR "/approx/approx1.test"),
+                                        1, AF_INTERP_LINEAR);
+}
 
+TYPED_TEST(Approx1, Approx1CubicArgsPrecision) {
+    approx1ArgsTestPrecision<TypeParam>(
+        string(TEST_DIR "/approx/approx1_cubic.test"), 2,
+        AF_INTERP_CUBIC_SPLINE);
+}
 
 //////////////////////////////////////// CPP //////////////////////////////////
 //
-TEST(Approx1, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(Approx1, CPP) {
     const unsigned resultIdx = 1;
-    const af_interp_type method = AF_INTERP_LINEAR;
-#define BT af::dtype_traits<float>::base_type
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<float> > tests;
-    readTests<BT, float, float>(string(TEST_DIR"/approx/approx1.test"),numDims,in,tests);
+#define BT dtype_traits<float>::base_type
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<float>> tests;
+    readTests<BT, float, float>(string(TEST_DIR "/approx/approx1.test"),
+                                numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
 
-    af::array input(idims, &(in[0].front()));
-    af::array pos(pdims, &(in[1].front()));
-
-    af::array output = approx1(input, pos, method, 0);
+    array input(idims, &(in[0].front()));
+    array pos(pdims, &(in[1].front()));
+    const af_interp_type method = AF_INTERP_LINEAR;
+    array output                = approx1(input, pos, method, 0);
 
     // Get result
     float* outData = new float[tests[resultIdx].size()];
@@ -221,10 +354,11 @@ TEST(Approx1, CPP)
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
-    bool ret = true;
+    bool ret      = true;
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
         ret = (std::abs(tests[resultIdx][elIter] - outData[elIter]) < 0.0005);
-        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t" << outData[elIter] << "at: " << elIter << std::endl;
+        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                             << outData[elIter] << " at: " << elIter << endl;
     }
 
     // Delete
@@ -232,3 +366,737 @@ TEST(Approx1, CPP)
 
 #undef BT
 }
+
+TEST(Approx1, CPPNearestBatch) {
+    array input = randu(600, 10);
+    array pos   = input.dims(0) * randu(100, 10);
+
+    array outBatch = approx1(input, pos, AF_INTERP_NEAREST);
+
+    array outSerial(pos.dims());
+    for (int i = 0; i < pos.dims(1); i++) {
+        outSerial(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_NEAREST);
+    }
+
+    array outGFOR(pos.dims());
+    gfor(seq i, pos.dims(1)) {
+        outGFOR(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_NEAREST);
+    }
+
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outSerial)), 1e-3);
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outGFOR)), 1e-3);
+}
+
+TEST(Approx1, CPPLinearBatch) {
+    array input = iota(dim4(10000, 20), c32);
+    array pos   = input.dims(0) * randu(10000, 20);
+
+    array outBatch = approx1(input, pos, AF_INTERP_LINEAR);
+
+    array outSerial(pos.dims());
+    for (int i = 0; i < pos.dims(1); i++) {
+        outSerial(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_LINEAR);
+    }
+
+    array outGFOR(pos.dims());
+    gfor(seq i, pos.dims(1)) {
+        outGFOR(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_LINEAR);
+    }
+
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outSerial)), 1e-3);
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outGFOR)), 1e-3);
+}
+
+TEST(Approx1, CPPCubicBatch) {
+    array input = iota(dim4(10000, 20), c32);
+    array pos   = input.dims(0) * randu(10000, 20);
+
+    array outBatch = approx1(input, pos, AF_INTERP_CUBIC_SPLINE);
+
+    array outSerial(pos.dims());
+    for (int i = 0; i < pos.dims(1); i++) {
+        outSerial(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_CUBIC_SPLINE);
+    }
+
+    array outGFOR(pos.dims());
+    gfor(seq i, pos.dims(1)) {
+        outGFOR(span, i) =
+            approx1(input(span, i), pos(span, i), AF_INTERP_CUBIC_SPLINE);
+    }
+
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outSerial)), 1e-3);
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outGFOR)), 1e-3);
+}
+
+TEST(Approx1, CPPNearestMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+    array input           = randu(1, largeDim);
+    array pos             = input.dims(0) * randu(1, largeDim);
+    array out             = approx1(input, pos, AF_INTERP_NEAREST);
+
+    input = randu(1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, largeDim);
+    out   = approx1(input, pos, AF_INTERP_NEAREST);
+
+    input = randu(1, 1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, 1, largeDim);
+    out   = approx1(input, pos, AF_INTERP_NEAREST);
+
+    SUCCEED();
+}
+
+TEST(Approx1, CPPLinearMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+    array input           = iota(dim4(1, largeDim), c32);
+    array pos             = input.dims(0) * randu(1, largeDim);
+    array outBatch        = approx1(input, pos, AF_INTERP_LINEAR);
+
+    input    = iota(dim4(1, 1, largeDim), c32);
+    pos      = input.dims(0) * randu(1, 1, largeDim);
+    outBatch = approx1(input, pos, AF_INTERP_LINEAR);
+
+    input    = iota(dim4(1, 1, 1, largeDim), c32);
+    pos      = input.dims(0) * randu(1, 1, 1, largeDim);
+    outBatch = approx1(input, pos, AF_INTERP_LINEAR);
+
+    SUCCEED();
+}
+
+TEST(Approx1, CPPCubicMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+    array input           = iota(dim4(1, largeDim), c32);
+    array pos             = input.dims(0) * randu(1, largeDim);
+    array outBatch        = approx1(input, pos, AF_INTERP_CUBIC);
+
+    input    = iota(dim4(1, 1, largeDim), c32);
+    pos      = input.dims(0) * randu(1, 1, largeDim);
+    outBatch = approx1(input, pos, AF_INTERP_CUBIC);
+
+    input    = iota(dim4(1, 1, 1, largeDim), c32);
+    pos      = input.dims(0) * randu(1, 1, 1, largeDim);
+    outBatch = approx1(input, pos, AF_INTERP_CUBIC);
+
+    SUCCEED();
+}
+
+TEST(Approx1, OtherDimLinear) {
+    int start = 0;
+    int stop  = 10000;
+    int step  = 100;
+    int num   = 1000;
+    array xi  = af::tile(seq(start, stop, step), 1, 2, 2, 2);
+    array yi  = 4 * xi - 3;
+    array xo  = af::round(step * randu(num, 2, 2, 2));
+    array yo  = 4 * xo - 3;
+    for (int d = 1; d < 4; d++) {
+        dim4 rdims(0, 1, 2, 3);
+        rdims[0] = d;
+        rdims[d] = 0;
+
+        array yi_reordered =
+            reorder(yi, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array xo_reordered =
+            reorder(xo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array yo_reordered = approx1(yi_reordered, xo_reordered, d, start, step,
+                                     AF_INTERP_LINEAR);
+        array res =
+            reorder(yo_reordered, rdims[0], rdims[1], rdims[2], rdims[3]);
+        ASSERT_NEAR(0, af::max<float>(af::abs(res - yo)), 1E-3);
+    }
+}
+
+TEST(Approx1, OtherDimCubic) {
+    float start = 0;
+    float stop  = 100;
+    float step  = 0.01;
+    int num     = 1000;
+    array xi    = af::tile(af::seq(start, stop, step), 1, 2, 2, 2);
+    array yi    = af::sin(xi);
+    array xo    = af::round(step * af::randu(num, 2, 2, 2));
+    array yo    = af::sin(xo);
+    for (int d = 1; d < 4; d++) {
+        dim4 rdims(0, 1, 2, 3);
+        rdims[0] = d;
+        rdims[d] = 0;
+
+        array yi_reordered =
+            reorder(yi, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array xo_reordered =
+            reorder(xo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array yo_reordered = approx1(yi_reordered, xo_reordered, d, start, step,
+                                     AF_INTERP_CUBIC);
+        array res =
+            reorder(yo_reordered, rdims[0], rdims[1], rdims[2], rdims[3]);
+        ASSERT_NEAR(0, af::max<float>(af::abs(res - yo)), 1E-3);
+    }
+}
+
+TEST(Approx1, CPPUsage) {
+    //! [ex_signal_approx1]
+
+    // Input data array.
+    float input_vals[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), input_vals);
+    // [3 1 1 1]
+    //     10.0000
+    //     20.0000
+    //     30.0000
+
+    // Array of positions to be found along the first dimension.
+    float pv[5] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+    array pos(dim4(5, 1), pv);
+    // [5 1 1 1]
+    //     0.0000
+    //     0.5000
+    //     1.0000
+    //     1.5000
+    //     2.0000
+
+    // Perform interpolation across dimension 0.
+    array interp = approx1(in, pos);
+    // [5 1 1 1]
+    //     10.0000
+    //     15.0000
+    //     20.0000
+    //     25.0000
+    //     30.0000
+
+    //! [ex_signal_approx1]
+
+    float civ[5] = {10.0f, 15.0f, 20.0f, 25.0f, 30.0f};
+    array interp_gold(dim4(5, 1), civ);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPUniformUsage) {
+    //! [ex_signal_approx1_uniform]
+
+    float input_vals[9] = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f,
+                           60.0f, 70.0f, 80.0f, 90.0f};
+    array in(dim4(3, 3), input_vals);
+    // [3 3 1 1]
+    //     10.0000    40.0000    70.0000
+    //     20.0000    50.0000    80.0000
+    //     30.0000    60.0000    90.0000
+
+    // Array of positions to be found along the interpolation
+    // dimension, `interp_dim`.
+    float pv[5] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+    array pos(dim4(5, 1), pv);
+    // [5 1 1 1]
+    //     0.0000
+    //     0.5000
+    //     1.0000
+    //     1.5000
+    //     2.0000
+
+    // Define range of indices with which the input values will
+    // correspond along the interpolation dimension.
+    const double idx_start = 0.0;
+    const double idx_step  = 1.0;
+
+    // Perform interpolation across dimension 0.
+    int interp_dim         = 0;
+    array col_major_interp = approx1(in, pos, interp_dim, idx_start, idx_step);
+    // [5 3 1 1]
+    //     10.0000    40.0000    70.0000
+    //     15.0000    45.0000    75.0000
+    //     20.0000    50.0000    80.0000
+    //     25.0000    55.0000    85.0000
+    //     30.0000    60.0000    90.0000
+
+    // Perform interpolation across dimension 1.
+    interp_dim = 1;
+    array row_major_interp =
+        approx1(in, transpose(pos), interp_dim, idx_start, idx_step);
+    // [3 5 1 1]
+    //     10.0000    25.0000    40.0000    55.0000    70.0000
+    //     20.0000    35.0000    50.0000    65.0000    80.0000
+    //     30.0000    45.0000    60.0000    75.0000    90.0000
+
+    //! [ex_signal_approx1_uniform]
+
+    float civ[15] = {10.0f, 15.0f, 20.0f, 25.0f, 30.0f, 40.0f, 45.0f, 50.0f,
+                     55.0f, 60.0f, 70.0f, 75.0f, 80.0f, 85.0f, 90.0f};
+    array interp_gold_col(dim4(5, 3), civ);
+    ASSERT_ARRAYS_EQ(col_major_interp, interp_gold_col);
+
+    float riv[15] = {10.0f, 20.0f, 30.0f, 25.0f, 35.0f, 45.0f, 40.0f, 50.0f,
+                     60.0f, 55.0f, 65.0f, 75.0f, 70.0f, 80.0f, 90.0f};
+    array interp_gold_row(dim4(3, 5), riv);
+    ASSERT_ARRAYS_EQ(row_major_interp, interp_gold_row);
+}
+
+TEST(Approx1, CPPDecimalStepRescaleGrid) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[5] = {0.f, 0.25f, 0.5f, 0.75f, 1.0f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = 0;
+    const double interp_grid_step = 0.5;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[5] = {10.0f, 15.0f, 20.0f, 25.0f, 30.0f};
+    array interp_gold(dim4(5, 1), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPRepeatPos) {
+    float inv[9] = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f,
+                    60.0f, 70.0f, 80.0f, 90.0f};
+    array in(dim4(3, 3), inv);
+    float pv[5] = {0.0f, 0.5f, 0.5f, 1.5f, 1.5f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = 0;
+    const double interp_grid_step = 1.0;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[15] = {10.0f, 15.0f, 15.0f, 25.0f, 25.0f, 40.0f, 45.0f, 45.0f,
+                    55.0f, 55.0f, 70.0f, 75.0f, 75.0f, 85.0f, 85.0f};
+    array interp_gold(dim4(5, 3), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPNonMonotonicPos) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[5] = {0.5f, 1.0f, 1.5f, 0.0f, 2.0f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = 0;
+    const double interp_grid_step = 1.0;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[5] = {15.0f, 20.0f, 25.0f, 10.0f, 30.0f};
+    array interp_gold(dim4(5, 1), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPMismatchingIndexingDim) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[4] = {0.0f, 0.5f, 1.0f, 2.0f};
+    array pos(dim4(1, 4), pv);
+
+    const int interp_grid_start   = 0;
+    const double interp_grid_step = 1.0;
+    const int interp_dim          = 1;
+    const float off_grid          = -1.0;
+    array interp = approx1(in, pos, interp_dim, interp_grid_start,
+                           interp_grid_step, AF_INTERP_LINEAR, off_grid);
+
+    float iv[12] = {10.0f, 20.0f, 30.0f, -1.0f, -1.0f, -1.0f,
+                    -1.0f, -1.0f, -1.0f, -1.0f, -1.0f, -1.0f};
+    array interp_gold(dim4(3, 4), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPNegativeGridStart) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[5] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = -1;
+    const double interp_grid_step = 1;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[5] = {20.0f, 25.0f, 30.0f, 0.0f, 0.0f};
+    array interp_gold(dim4(5, 1), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPInterpolateBackwards) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[5] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = in.elements() - 1;
+    const double interp_grid_step = -1;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[5] = {30.0f, 25.0f, 20.0f, 15.0f, 10.0f};
+    array interp_gold(dim4(5, 1), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPStartOffGridAndNegativeStep) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    float pv[5] = {0.0f, -0.5f, -1.0f, -1.5f, -2.0f};
+    array pos(dim4(5, 1), pv);
+
+    const int interp_grid_start   = -1;
+    const double interp_grid_step = -1;
+    const int interp_dim          = 0;
+    array interp =
+        approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+
+    float iv[5] = {0.0f, 0.0f, 10.0f, 15.0f, 20.0f};
+    array interp_gold(dim4(5, 1), iv);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx1, CPPUniformInvalidStepSize) {
+    try {
+        float inv[3] = {10.0f, 20.0f, 30.0f};
+        array in(dim4(3, 1), inv);
+        float pv[5] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+        array pos(dim4(5, 1), pv);
+
+        const int interp_grid_start   = 0;
+        const double interp_grid_step = 0;
+        const int interp_dim          = 0;
+        array interp =
+            approx1(in, pos, interp_dim, interp_grid_start, interp_grid_step);
+        FAIL() << "Expected af::exception\n";
+    } catch (af::exception& ex) { SUCCEED(); } catch (...) {
+        FAIL() << "Expected af::exception\n";
+    }
+}
+
+// Unless the caller specifies the sampling grid (its start index and
+// step), ArrayFire assumes a regular grid with a starting index of 0
+// and a step value of 1.
+TEST(Approx1, CPPInfCheck) {
+#ifdef __INTEL_LLVM_COMPILER
+    SKIP_IF_FAST_MATH_ENABLED();
+#endif
+    array sampled(seq(0.0, 5.0, 0.5));
+    sampled(0) = af::Inf;
+    seq xo(0.0, 2.0, 0.25);
+    array interp           = approx1(sampled, xo);
+    array interp_augmented = join(1, xo, interp);
+
+    float goldv[9] = {static_cast<float>(af::Inf),
+                      static_cast<float>(af::Inf),
+                      static_cast<float>(af::Inf),
+                      static_cast<float>(af::Inf),
+                      0.5f,
+                      0.625f,
+                      0.75f,
+                      0.875f,
+                      1.0f};
+    array gold(dim4(9, 1), goldv);
+    interp(af::isInf(interp)) = 0;
+    gold(af::isInf(gold))     = 0;
+    ASSERT_ARRAYS_EQ(interp, gold);
+}
+
+TEST(Approx1, CPPUniformInfCheck) {
+#ifdef __INTEL_LLVM_COMPILER
+    SKIP_IF_FAST_MATH_ENABLED();
+#endif
+    array sampled(seq(10.0, 50.0, 10.0));
+    sampled(0) = af::Inf;
+    seq xo(0.0, 8.0, 2.0);
+    array interp   = approx1(sampled, xo, 0, 0, 2);
+    float goldv[5] = {static_cast<float>(af::Inf), 20.0f, 30.0f, 40.0f, 50.0f};
+    array gold(dim4(5, 1), goldv);
+    interp(af::isInf(interp)) = 0;
+    gold(af::isInf(gold))     = 0;
+    ASSERT_ARRAYS_EQ(interp, gold);
+}
+
+TEST(Approx1, CPPEmptyPos) {
+    float inv[3] = {10.0f, 20.0f, 30.0f};
+    array in(dim4(3, 1), inv);
+    array pos;
+    array interp = approx1(in, pos);
+    ASSERT_TRUE(pos.isempty());
+    ASSERT_TRUE(interp.isempty());
+}
+
+TEST(Approx1, CPPEmptyInput) {
+    array in;
+    float pv[3] = {0.0f, 1.0f, 2.0f};
+    array pos(dim4(3, 1), pv);
+
+    array interp = approx1(in, pos);
+    ASSERT_TRUE(in.isempty());
+    ASSERT_TRUE(interp.isempty());
+}
+
+TEST(Approx1, CPPEmptyPosAndInput) {
+    array in;
+    array pos;
+    array interp = approx1(in, pos);
+    ASSERT_TRUE(in.isempty());
+    ASSERT_TRUE(pos.isempty());
+    ASSERT_TRUE(interp.isempty());
+}
+
+template<typename T>
+class Approx1V2 : public ::testing::Test {
+   protected:
+    typedef typename dtype_traits<T>::base_type BT;
+
+    vector<T> h_gold_cast;
+    vector<T> h_in_cast;
+    vector<BT> h_pos_cast;
+
+    dim4 gold_dims;
+    dim4 in_dims;
+    dim4 pos_dims;
+
+    af_array gold;
+    af_array in;
+    af_array pos;
+
+    Approx1V2() : gold(0), in(0), pos(0) {}
+
+    void SetUp() {}
+
+    void releaseArrays() {
+        if (pos != 0) { ASSERT_SUCCESS(af_release_array(pos)); }
+        if (in != 0) { ASSERT_SUCCESS(af_release_array(in)); }
+        if (gold != 0) { ASSERT_SUCCESS(af_release_array(gold)); }
+    }
+
+    void TearDown() { releaseArrays(); }
+
+    void setTestData(float* h_gold, dim4 gold_dims, float* h_in, dim4 in_dims,
+                     float* h_pos, dim4 pos_dims) {
+        releaseArrays();
+
+        gold = 0;
+        in   = 0;
+        pos  = 0;
+
+        this->gold_dims = gold_dims;
+        this->in_dims   = in_dims;
+        this->pos_dims  = pos_dims;
+
+        for (int i = 0; i < gold_dims.elements(); ++i) {
+            h_gold_cast.push_back(static_cast<T>(h_gold[i]));
+        }
+        for (int i = 0; i < in_dims.elements(); ++i) {
+            h_in_cast.push_back(static_cast<T>(h_in[i]));
+        }
+        for (int i = 0; i < pos_dims.elements(); ++i) {
+            h_pos_cast.push_back(static_cast<BT>(h_pos[i]));
+        }
+
+        ASSERT_SUCCESS(af_create_array(&gold, &h_gold_cast.front(),
+                                       gold_dims.ndims(), gold_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&in, &h_in_cast.front(), in_dims.ndims(),
+                                       in_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&pos, &h_pos_cast.front(),
+                                       pos_dims.ndims(), pos_dims.get(),
+                                       (af_dtype)dtype_traits<BT>::af_type));
+    }
+
+    void testSpclOutArray(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        genTestOutputArray(&out, gold_dims.ndims(), gold_dims.get(),
+                           (af_dtype)dtype_traits<T>::af_type, &metadata);
+
+        ASSERT_SUCCESS(af_approx1_v2(&out, in, pos, AF_INTERP_LINEAR, 0));
+        ASSERT_SPECIAL_ARRAYS_EQ(gold, out, &metadata);
+    }
+
+    void testSpclOutArrayUniform(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        genTestOutputArray(&out, gold_dims.ndims(), gold_dims.get(),
+                           (af_dtype)dtype_traits<T>::af_type, &metadata);
+
+        ASSERT_SUCCESS(af_approx1_uniform_v2(&out, in, pos, 0, 0.0, 1.0,
+                                             AF_INTERP_LINEAR, 0.f));
+        ASSERT_SPECIAL_ARRAYS_EQ(gold, out, &metadata);
+    }
+};
+
+TYPED_TEST_SUITE(Approx1V2, TestTypes);
+
+class SimpleTestData {
+   public:
+    static const int h_gold_size = 15;
+    static const int h_in_size   = 9;
+    static const int h_pos_size  = 5;
+
+    vector<float> h_gold;
+    vector<float> h_in;
+    vector<float> h_pos;
+
+    dim4 gold_dims;
+    dim4 in_dims;
+    dim4 pos_dims;
+
+    SimpleTestData() : gold_dims(5, 3), in_dims(3, 3), pos_dims(5) {
+        float gold_arr[h_gold_size] = {10.0f, 15.0f, 20.0f, 25.0f, 30.0f,
+                                       40.0f, 45.0f, 50.0f, 55.0f, 60.0f,
+                                       70.0f, 75.0f, 80.0f, 85.0f, 90.0f};
+
+        float in_arr[h_in_size] = {10.0f, 20.0f, 30.0f, 40.0f, 50.0f,
+                                   60.0f, 70.0f, 80.0f, 90.0f};
+
+        float pos_arr[h_pos_size] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f};
+
+        h_gold.assign(gold_arr, gold_arr + h_gold_size);
+        h_in.assign(in_arr, in_arr + h_in_size);
+        h_pos.assign(pos_arr, pos_arr + h_pos_size);
+    }
+};
+
+template<typename T>
+class Approx1V2Simple : public Approx1V2<T> {
+   protected:
+    void SetUp() {
+        SUPPORTED_TYPE_CHECK(T);
+        SimpleTestData data;
+        this->setTestData(&data.h_gold.front(), data.gold_dims,
+                          &data.h_in.front(), data.in_dims, &data.h_pos.front(),
+                          data.pos_dims);
+    }
+};
+
+TYPED_TEST_SUITE(Approx1V2Simple, TestTypes);
+
+TYPED_TEST(Approx1V2Simple, UseNullOutputArray) {
+    this->testSpclOutArray(NULL_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UseFullExistingOutputArray) {
+    this->testSpclOutArray(FULL_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UseExistingOutputSubArray) {
+    this->testSpclOutArray(SUB_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UseReorderedOutputArray) {
+    this->testSpclOutArray(REORDERED_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UniformUseNullOutputArray) {
+    this->testSpclOutArrayUniform(NULL_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UniformUseFullExistingOutputArray) {
+    this->testSpclOutArrayUniform(FULL_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UniformUseExistingOutputSubArray) {
+    this->testSpclOutArrayUniform(SUB_ARRAY);
+}
+
+TYPED_TEST(Approx1V2Simple, UniformUseReorderedOutputArray) {
+    this->testSpclOutArrayUniform(REORDERED_ARRAY);
+}
+
+class Approx1NullArgs : public ::testing::Test {
+   protected:
+    af_array out;
+    af_array in;
+    af_array pos;
+
+    Approx1NullArgs() : out(0), in(0), pos(0) {}
+
+    void SetUp() {
+        SimpleTestData data;
+
+        ASSERT_SUCCESS(af_create_array(&in, &data.h_in.front(),
+                                       data.in_dims.ndims(), data.in_dims.get(),
+                                       f32));
+        ASSERT_SUCCESS(af_create_array(&pos, &data.h_pos.front(),
+                                       data.pos_dims.ndims(),
+                                       data.pos_dims.get(), f32));
+    }
+
+    void TearDown() {
+        if (pos != 0) { ASSERT_SUCCESS(af_release_array(pos)); }
+        if (in != 0) { ASSERT_SUCCESS(af_release_array(in)); }
+    }
+};
+
+TEST_F(Approx1NullArgs, NullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1(out_ptr, this->in, this->pos, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1(&this->out, 0, this->pos, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, NullPosArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1(&this->out, this->in, 0, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2NullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_approx1_v2(out_ptr, this->in, this->pos,
+                                        AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1_v2(&this->out, 0, this->pos, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2NullPosArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1_v2(&this->out, this->in, 0, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, UniformNullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_approx1_uniform(out_ptr, this->in, this->pos, 0,
+                                             0.0, 1.0, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, UniformNullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx1_uniform(&this->out, 0, this->pos, 0, 0.0,
+                                             1.0, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, UniformNullPosArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx1_uniform(&this->out, this->in, 0, 0, 0.0,
+                                             1.0, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2UniformNullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1_uniform_v2(out_ptr, this->in, this->pos, 0, 0.0, 1.0,
+                                    AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2UniformNullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx1_uniform_v2(&this->out, 0, this->pos, 0, 0.0, 1.0,
+                                    AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx1NullArgs, V2UniformNullPosArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx1_uniform_v2(&this->out, this->in, 0, 0, 0.0,
+                                                1.0, AF_INTERP_LINEAR, 0.f));
+}
diff --git a/test/approx2.cpp b/test/approx2.cpp
index b6563e8351..bec8bd75cf 100644
--- a/test/approx2.cpp
+++ b/test/approx2.cpp
@@ -7,229 +7,277 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
-#include <af/signal.h>
-#include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/exception.h>
+#include <af/gfor.h>
+#include <af/random.h>
+#include <af/signal.h>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
-#include <string>
+
+#include <gtest/gtest.h>
 #include <testHelpers.hpp>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+#include <string>
+#include <vector>
+
+using af::abs;
+using af::approx2;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::randu;
+using af::seq;
+using af::span;
+using af::sum;
+
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Approx2 : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Approx2 : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        SUPPORTED_TYPE_CHECK(T);
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Approx2, TestTypes);
+TYPED_TEST_SUITE(Approx2, TestTypes);
 
 template<typename T>
-void approx2Test(string pTestFile, const unsigned resultIdx, const af_interp_type method, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
-    typedef typename af::dtype_traits<T>::base_type BT;
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<T> > tests;
-    readTests<BT, T, float>(pTestFile,numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
-    af::dim4 qdims = numDims[2];
-
-    af_array inArray = 0;
+void approx2Test(string pTestFile, const unsigned resultIdx,
+                 const af_interp_type method, bool isSubRef = false,
+                 const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
+    typedef typename dtype_traits<T>::base_type BT;
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<T>> tests;
+    readTests<BT, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+    dim4 qdims = numDims[2];
+
+    af_array inArray   = 0;
     af_array pos0Array = 0;
     af_array pos1Array = 0;
-    af_array outArray = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     vector<T> input(in[0].begin(), in[0].end());
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(), qdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(),
+                                   qdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
+    ASSERT_SUCCESS(
+        af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
 
     // Get result
     T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
-    bool ret = true;
+    bool ret      = true;
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ret = (std::abs(tests[resultIdx][elIter] - outData[elIter]) < 0.001);
-        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t" << outData[elIter] << "at: " << elIter << std::endl;
+        ret = (abs(tests[resultIdx][elIter] - outData[elIter]) < 0.001);
+        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                             << outData[elIter] << " at: " << elIter << endl;
     }
 
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(pos0Array != 0) af_release_array(pos0Array);
-    if(pos1Array != 0) af_release_array(pos1Array);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (pos0Array != 0) af_release_array(pos0Array);
+    if (pos1Array != 0) af_release_array(pos1Array);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define APPROX2_INIT(desc, file, resultIdx, method)                               \
-    TYPED_TEST(Approx2, desc)                                                                    \
-    {                                                                                           \
-        approx2Test<TypeParam>(string(TEST_DIR"/approx/"#file".test"), resultIdx, method);\
-    }
+TYPED_TEST(Approx2, Approx2Nearest) {
+    approx2Test<TypeParam>(string(TEST_DIR "/approx/approx2.test"), 0,
+                           AF_INTERP_NEAREST);
+}
 
-    APPROX2_INIT(Approx2Nearest, approx2, 0, AF_INTERP_NEAREST);
-    APPROX2_INIT(Approx2Linear, approx2, 1, AF_INTERP_LINEAR);
-    APPROX2_INIT(Approx2NearestBatch, approx2_batch, 0, AF_INTERP_NEAREST);
-    APPROX2_INIT(Approx2LinearBatch, approx2_batch, 1, AF_INTERP_LINEAR);
+TYPED_TEST(Approx2, Approx2Linear) {
+    approx2Test<TypeParam>(string(TEST_DIR "/approx/approx2.test"), 1,
+                           AF_INTERP_LINEAR);
+}
+TYPED_TEST(Approx2, NearestBatch) {
+    approx2Test<TypeParam>(string(TEST_DIR "/approx/approx2_batch.test"), 0,
+                           AF_INTERP_NEAREST);
+}
+TYPED_TEST(Approx2, LinearBatch) {
+    approx2Test<TypeParam>(string(TEST_DIR "/approx/approx2_batch.test"), 1,
+                           AF_INTERP_LINEAR);
+}
 
-///////////////////////////////////////////////////////////////////////////////
 // Test Argument Failure Cases
-///////////////////////////////////////////////////////////////////////////////
 template<typename T>
-void approx2ArgsTest(string pTestFile, const unsigned resultIdx, const af_interp_type method, const af_err err)
-{
-    if (noDoubleTests<T>()) return;
-    typedef typename af::dtype_traits<T>::base_type BT;
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<T> > tests;
-    readTests<BT, T, float>(pTestFile,numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
-    af::dim4 qdims = numDims[2];
-
-    af_array inArray = 0;
+void approx2ArgsTest(string pTestFile, const af_interp_type method,
+                     const af_err err) {
+    SUPPORTED_TYPE_CHECK(T);
+    typedef typename dtype_traits<T>::base_type BT;
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<T>> tests;
+    readTests<BT, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+    dim4 qdims = numDims[2];
+
+    af_array inArray   = 0;
     af_array pos0Array = 0;
     af_array pos1Array = 0;
-    af_array outArray = 0;
+    af_array outArray  = 0;
 
     vector<T> input(in[0].begin(), in[0].end());
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()), idims.ndims(),
+                                   idims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(), qdims.get(), (af_dtype) af::dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
+    ASSERT_SUCCESS(af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(),
+                                   qdims.get(),
+                                   (af_dtype)dtype_traits<BT>::af_type));
 
-    ASSERT_EQ(err, af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
+    ASSERT_EQ(err,
+              af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(pos0Array != 0) af_release_array(pos0Array);
-    if(pos1Array != 0) af_release_array(pos1Array);
-    if(outArray  != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (pos0Array != 0) af_release_array(pos0Array);
+    if (pos1Array != 0) af_release_array(pos1Array);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-#define APPROX2_ARGS(desc, file, resultIdx, method, err)                                            \
-    TYPED_TEST(Approx2, desc)                                                                       \
-    {                                                                                               \
-        approx2ArgsTest<TypeParam>(string(TEST_DIR"/approx/"#file".test"), resultIdx, method, err); \
-    }
+TYPED_TEST(Approx2, Approx2NearestArgsPos3D) {
+    approx2ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx2_pos3d.test"),
+                               AF_INTERP_NEAREST, AF_ERR_SIZE);
+}
+
+TYPED_TEST(Approx2, Approx2LinearArgsPos3D) {
+    approx2ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx2_pos3d.test"),
+                               AF_INTERP_LINEAR, AF_ERR_SIZE);
+}
 
-    APPROX2_ARGS(Approx2NearestArgsPos3D,      approx2_pos3d,   0, AF_INTERP_NEAREST,  AF_ERR_SIZE);
-    APPROX2_ARGS(Approx2LinearArgsPos3D,       approx2_pos3d,   1, AF_INTERP_LINEAR,   AF_ERR_SIZE);
-    APPROX2_ARGS(Approx2NearestArgsPosUnequal, approx2_unequal, 0, AF_INTERP_NEAREST,  AF_ERR_SIZE);
-    APPROX2_ARGS(Approx2ArgsInterpBilinear,    approx2,         0, AF_INTERP_BILINEAR, AF_ERR_ARG);
-    APPROX2_ARGS(Approx2ArgsInterpCubic,       approx2,         0, AF_INTERP_CUBIC,    AF_ERR_ARG);
+TYPED_TEST(Approx2, Approx2NearestArgsPosUnequal) {
+    approx2ArgsTest<TypeParam>(string(TEST_DIR "/approx/approx2_unequal.test"),
+                               AF_INTERP_NEAREST, AF_ERR_SIZE);
+}
 
 template<typename T>
-void approx2ArgsTestPrecision(string pTestFile, const unsigned resultIdx, const af_interp_type method)
-{
-    if (noDoubleTests<T>()) return;
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
-    af::dim4 qdims = numDims[2];
-
-    af_array inArray = 0;
+void approx2ArgsTestPrecision(string pTestFile, const unsigned resultIdx,
+                              const af_interp_type method) {
+    UNUSED(resultIdx);
+    SUPPORTED_TYPE_CHECK(T);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+    dim4 qdims = numDims[2];
+
+    af_array inArray   = 0;
     af_array pos0Array = 0;
     af_array pos1Array = 0;
-    af_array outArray = 0;
+    af_array outArray  = 0;
 
     vector<T> input(in[0].begin(), in[0].end());
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(), pdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(), qdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(input.front()), idims.ndims(),
+                                   idims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
+    ASSERT_SUCCESS(af_create_array(&pos0Array, &(in[1].front()), pdims.ndims(),
+                                   pdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&pos1Array, &(in[2].front()), qdims.ndims(),
+                                   qdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    if((af_dtype) af::dtype_traits<T>::af_type == c32 ||
-       (af_dtype) af::dtype_traits<T>::af_type == c64) {
-        ASSERT_EQ(AF_ERR_ARG, af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
+    if ((af_dtype)dtype_traits<T>::af_type == c32 ||
+        (af_dtype)dtype_traits<T>::af_type == c64) {
+        ASSERT_EQ(AF_ERR_ARG, af_approx2(&outArray, inArray, pos0Array,
+                                         pos1Array, method, 0));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
+        ASSERT_SUCCESS(
+            af_approx2(&outArray, inArray, pos0Array, pos1Array, method, 0));
     }
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(pos0Array != 0) af_release_array(pos0Array);
-    if(pos1Array != 0) af_release_array(pos1Array);
-    if(outArray  != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (pos0Array != 0) af_release_array(pos0Array);
+    if (pos1Array != 0) af_release_array(pos1Array);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-#define APPROX2_ARGSP(desc, file, resultIdx, method)                                    \
-    TYPED_TEST(Approx2, desc)                                                           \
-    {                                                                                   \
-        approx2ArgsTestPrecision<TypeParam>(string(TEST_DIR"/approx/"#file".test"),     \
-                                            resultIdx, method);                         \
+#define APPROX2_ARGSP(desc, file, resultIdx, method)                       \
+    TYPED_TEST(Approx2, desc) {                                            \
+        approx2ArgsTestPrecision<TypeParam>(                               \
+            string(TEST_DIR "/approx/" #file ".test"), resultIdx, method); \
     }
 
-    APPROX2_ARGSP(Approx2NearestArgsPrecision, approx2, 0, AF_INTERP_NEAREST);
-    APPROX2_ARGSP(Approx2LinearArgsPrecision, approx2, 1, AF_INTERP_LINEAR);
-
+APPROX2_ARGSP(Approx2NearestArgsPrecision, approx2, 0, AF_INTERP_NEAREST);
+APPROX2_ARGSP(Approx2LinearArgsPrecision, approx2, 1, AF_INTERP_LINEAR);
 
 //////////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Approx2, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(Approx2, CPP) {
     const unsigned resultIdx = 1;
-#define BT af::dtype_traits<float>::base_type
-    vector<af::dim4> numDims;
-    vector<vector<BT> > in;
-    vector<vector<float> > tests;
-    readTests<BT, float, float>(string(TEST_DIR"/approx/approx2.test"),numDims,in,tests);
+#define BT dtype_traits<float>::base_type
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<float>> tests;
+    readTests<BT, float, float>(string(TEST_DIR "/approx/approx2.test"),
+                                numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::dim4 pdims = numDims[1];
-    af::dim4 qdims = numDims[2];
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+    dim4 qdims = numDims[2];
 
-    af::array input(idims,&(in[0].front()));
-    af::array pos0(pdims,&(in[1].front()));
-    af::array pos1(qdims,&(in[2].front()));
-    af::array output = af::approx2(input, pos0, pos1, AF_INTERP_LINEAR, 0);
+    array input(idims, &(in[0].front()));
+    array pos0(pdims, &(in[1].front()));
+    array pos1(qdims, &(in[2].front()));
+    array output = approx2(input, pos0, pos1, AF_INTERP_LINEAR, 0);
 
     // Get result
     float* outData = new float[tests[resultIdx].size()];
@@ -237,10 +285,61 @@ TEST(Approx2, CPP)
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
-    bool ret = true;
+    bool ret      = true;
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
         ret = (std::abs(tests[resultIdx][elIter] - outData[elIter]) < 0.001);
-        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t" << outData[elIter] << "at: " << elIter << std::endl;
+        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                             << outData[elIter] << " at: " << elIter << endl;
+    }
+
+    // Delete
+    delete[] outData;
+
+#undef BT
+}
+
+TEST(Approx2Cubic, CPP) {
+    const unsigned resultIdx = 0;
+#define BT dtype_traits<float>::base_type
+    vector<dim4> numDims;
+    vector<vector<BT>> in;
+    vector<vector<float>> tests;
+    readTests<BT, float, float>(string(TEST_DIR "/approx/approx2_cubic.test"),
+                                numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    dim4 pdims = numDims[1];
+    dim4 qdims = numDims[2];
+
+    array input(idims, &(in[0].front()));
+    input = input.T();
+    array pos0(pdims, &(in[1].front()));
+    array pos1(qdims, &(in[2].front()));
+    pos0         = tile(pos0, 1, pos0.dims(0));
+    pos1         = tile(pos1.T(), pos1.dims(0));
+    array output = approx2(input, pos0, pos1, AF_INTERP_BICUBIC_SPLINE, 0).T();
+
+    // Get result
+    float* outData = new float[tests[resultIdx].size()];
+    output.host((void*)outData);
+
+    // Compare result
+    size_t nElems = tests[resultIdx].size();
+    bool ret      = true;
+
+    float max = real(outData[0]), min = real(outData[0]);
+    for (int i = 1; i < (int)nElems; ++i) {
+        min = (real(outData[i]) < min) ? real(outData[i]) : min;
+        max = (real(outData[i]) > max) ? real(outData[i]) : max;
+    }
+    float range = max - min;
+    ASSERT_GT(range, 0.f);
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ret = (std::abs(tests[resultIdx][elIter] - outData[elIter]) <
+               0.01 * range);
+        ASSERT_EQ(true, ret) << tests[resultIdx][elIter] << "\t"
+                             << outData[elIter] << " at: " << elIter << endl;
     }
 
     // Delete
@@ -248,3 +347,723 @@ TEST(Approx2, CPP)
 
 #undef BT
 }
+
+TEST(Approx2, CPPNearestBatch) {
+    array input = randu(200, 100, 10);
+    array pos   = input.dims(0) * randu(100, 100, 10);
+    array qos   = input.dims(1) * randu(100, 100, 10);
+
+    array outBatch = approx2(input, pos, qos, AF_INTERP_NEAREST);
+
+    array outSerial(pos.dims());
+    for (int i = 0; i < pos.dims(2); i++) {
+        outSerial(span, span, i) =
+            approx2(input(span, span, i), pos(span, span, i),
+                    qos(span, span, i), AF_INTERP_NEAREST);
+    }
+
+    array outGFOR(pos.dims());
+    gfor(seq i, pos.dims(2)) {
+        outGFOR(span, span, i) =
+            approx2(input(span, span, i), pos(span, span, i),
+                    qos(span, span, i), AF_INTERP_NEAREST);
+    }
+
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outSerial)), 1e-3);
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outGFOR)), 1e-3);
+}
+
+TEST(Approx2, CPPLinearBatch) {
+    array input = randu(200, 100, 10);
+    array pos   = input.dims(0) * randu(100, 100, 10);
+    array qos   = input.dims(1) * randu(100, 100, 10);
+
+    array outBatch = approx2(input, pos, qos, AF_INTERP_LINEAR);
+
+    array outSerial(pos.dims());
+    for (int i = 0; i < pos.dims(2); i++) {
+        outSerial(span, span, i) =
+            approx2(input(span, span, i), pos(span, span, i),
+                    qos(span, span, i), AF_INTERP_LINEAR);
+    }
+
+    array outGFOR(pos.dims());
+    gfor(seq i, pos.dims(2)) {
+        outGFOR(span, span, i) =
+            approx2(input(span, span, i), pos(span, span, i),
+                    qos(span, span, i), AF_INTERP_LINEAR);
+    }
+
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outSerial)), 1e-3);
+    ASSERT_NEAR(0, sum<float>(abs(outBatch - outGFOR)), 1e-3);
+}
+
+TEST(Approx2, CPPNearestMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    array input = randu(1, largeDim);
+    array pos   = input.dims(0) * randu(1, 10);
+    array qos   = input.dims(1) * randu(1, 10);
+    array out   = approx2(input, pos, qos, AF_INTERP_NEAREST);
+
+    input = randu(1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_NEAREST);
+
+    input = randu(1, 1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_NEAREST);
+
+    SUCCEED();
+}
+
+TEST(Approx2, CPPLinearMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    array input = randu(1, largeDim);
+    array pos   = input.dims(0) * randu(1, 10);
+    array qos   = input.dims(1) * randu(1, 10);
+    array out   = approx2(input, pos, qos, AF_INTERP_LINEAR);
+
+    input = randu(1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_LINEAR);
+
+    input = randu(1, 1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_LINEAR);
+
+    SUCCEED();
+}
+
+TEST(Approx2, CPPCubicMaxDims) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    array input = randu(1, largeDim);
+    array pos   = input.dims(0) * randu(1, 10);
+    array qos   = input.dims(1) * randu(1, 10);
+    array out   = approx2(input, pos, qos, AF_INTERP_BICUBIC);
+
+    input = randu(1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_BICUBIC);
+
+    input = randu(1, 1, 1, largeDim);
+    pos   = input.dims(0) * randu(1, 1, 1, largeDim);
+    qos   = input.dims(1) * randu(1, 1, 1, largeDim);
+    out   = approx2(input, pos, qos, AF_INTERP_BICUBIC);
+
+    SUCCEED();
+}
+
+TEST(Approx2, OtherDimLinear) {
+    int start = 0;
+    int stop  = 10000;
+    int step  = 100;
+    int num   = 1000;
+    array xi  = af::tile(seq(start, stop, step), 1, 2, 2, 2);
+    array yi  = af::tile(seq(start, stop, step), 1, 2, 2, 2);
+    array zi  = 4 * xi * yi - 3 * xi;
+    array xo  = af::round(step * randu(num, 2, 2, 2));
+    array yo  = af::round(step * randu(num, 2, 2, 2));
+    array zo  = 4 * xo * yo - 3 * xo;
+    for (int d = 1; d < 3; d++) {
+        dim4 rdims(0, 1, 2, 3);
+        rdims[0] = d;
+        rdims[d] = 0;
+
+        array zi_reordered =
+            reorder(zi, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array xo_reordered =
+            reorder(xo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array yo_reordered =
+            reorder(yo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array zo_reordered =
+            approx2(zi_reordered, xo_reordered, d, start, step, yo_reordered,
+                    d + 1, start, step, AF_INTERP_LINEAR);
+        // rdims swaps dims 0 and d, so it is its own inverse: applying it
+        // again restores the original dimension order.
+        array res =
+            af::reorder(yo_reordered, rdims[0], rdims[1], rdims[2], rdims[3]);
+        ASSERT_NEAR(0, af::max<float>(af::abs(res - yo)), 1E-3);
+    }
+}
+
+TEST(Approx2, OtherDimCubic) {
+    float start = 0;
+    float stop  = 100;
+    float step  = 0.01;
+    int num     = 1000;
+    array xi    = af::tile(seq(start, stop, step), 1, 2, 2, 2);
+    array yi    = af::tile(seq(start, stop, step), 1, 2, 2, 2);
+    array zi    = 4 * sin(xi) * cos(yi);
+    array xo    = af::round(step * randu(num, 2, 2, 2));
+    array yo    = af::round(step * randu(num, 2, 2, 2));
+    array zo    = 4 * sin(xo) * cos(yo);
+    for (int d = 1; d < 3; d++) {
+        dim4 rdims(0, 1, 2, 3);
+        rdims[0] = d;
+        rdims[d] = 0;
+
+        array zi_reordered =
+            reorder(zi, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array xo_reordered =
+            reorder(xo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array yo_reordered =
+            reorder(yo, rdims[0], rdims[1], rdims[2], rdims[3]);
+        array zo_reordered =
+            approx2(zi_reordered, xo_reordered, d, start, step, yo_reordered,
+                    d + 1, start, step, AF_INTERP_CUBIC);
+        // rdims swaps dims 0 and d, so it is its own inverse: applying it
+        // again restores the original dimension order.
+        array res =
+            reorder(yo_reordered, rdims[0], rdims[1], rdims[2], rdims[3]);
+        ASSERT_NEAR(0, af::max<float>(af::abs(res - yo)), 1E-3);
+    }
+}
+
+TEST(Approx2, CPPUsage) {
+    //! [ex_signal_approx2]
+
+    // Input data array.
+    float input_vals[9] = {1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0};
+    array input(3, 3, input_vals);
+    // [3 3 1 1]
+    //     1.0000     2.0000     3.0000
+    //     1.0000     2.0000     3.0000
+    //     1.0000     2.0000     3.0000
+
+    // First array of positions to be found along the first dimension.
+    float pv0[4] = {0.5, 1.5, 0.5, 1.5};
+    array pos0(2, 2, pv0);
+    // [2 2 1 1]
+    //     0.5000     0.5000
+    //     1.5000     1.5000
+
+    // Second array of positions to be found along the second
+    // dimension.
+    float pv1[4] = {0.5, 0.5, 1.5, 1.5};
+    array pos1(2, 2, pv1);
+    // [2 2 1 1]
+    //     0.5000     1.5000
+    //     0.5000     1.5000
+
+    array interp = approx2(input, pos0, pos1);
+    // [2 2 1 1]
+    //     1.5000     2.5000
+    //     1.5000     2.5000
+
+    //! [ex_signal_approx2]
+
+    float expected_interp[4] = {1.5, 1.5, 2.5, 2.5};
+
+    array interp_gold(2, 2, expected_interp);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx2, CPPUniformUsage) {
+    //! [ex_signal_approx2_uniform]
+
+    // Input data array.
+    float input_vals[9] = {1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0};
+    array input(3, 3, input_vals);
+    // [3 3 1 1]
+    //     1.0000     2.0000     3.0000
+    //     1.0000     2.0000     3.0000
+    //     1.0000     2.0000     3.0000
+
+    // First array of positions to be found along the interpolation
+    // dimension, `interp_dim0`.
+    float pv0[4] = {0.5, 1.5, 0.5, 1.5};
+    array pos0(2, 2, pv0);
+    // [2 2 1 1]
+    //     0.5000     0.5000
+    //     1.5000     1.5000
+
+    // Second array of positions to be found along the interpolation
+    // dimension, `interp_dim1`.
+    float pv1[4] = {0.5, 0.5, 1.5, 1.5};
+    array pos1(2, 2, pv1);
+    // [2 2 1 1]
+    //     0.5000     1.5000
+    //     0.5000     1.5000
+
+    // Uniform index grid (start, step) on which the input values lie.
+    // The same start and step are used for both interpolated dimensions.
+    const double idx_start_dim0 = 0.0;
+    const double idx_step_dim0  = 1.0;
+    const int interp_dim0       = 0;
+    const int interp_dim1       = 1;
+    array interp =
+        approx2(input, pos0, interp_dim0, idx_start_dim0, idx_step_dim0, pos1,
+                interp_dim1, idx_start_dim0, idx_step_dim0);
+    // [2 2 1 1]
+    //     1.5000     2.5000
+    //     1.5000     2.5000
+
+    //! [ex_signal_approx2_uniform]
+
+    float expected_interp[4] = {1.5, 1.5, 2.5, 2.5};
+
+    array interp_gold(2, 2, expected_interp);
+    ASSERT_ARRAYS_EQ(interp, interp_gold);
+}
+
+TEST(Approx2, CPPUniformOneDimIndices) {
+    float inv[9] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0};
+    array input(dim4(3, 3), inv);
+
+    float p0[3] = {0.0, 1.0, 2.0};
+    float p1[3] = {0.0, 1.0, 2.0};
+    array pos0(dim4(3, 1), p0);
+    array pos1(dim4(3, 1), p1);
+
+    const int pos0_interp_grid_start   = 0;
+    const double pos0_interp_grid_step = 1;
+    array interpolated =
+        approx2(input, pos0, 0, pos0_interp_grid_start, pos0_interp_grid_step,
+                pos1, 1, pos0_interp_grid_start, pos0_interp_grid_step);
+
+    float expected_interp[3] = {10.0, 50.0, 90.0};
+
+    array interpolated_gold(dim4(3, 1), expected_interp);
+    ASSERT_ARRAYS_EQ(interpolated, interpolated_gold);
+}
+
+TEST(Approx2, CPPUniformTwoDimIndices) {
+    float inv[9] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0};
+    array input(dim4(3, 3), inv);
+
+    float p0[4] = {0, 2, 0, 2};
+    float p1[4] = {0, 0, 2, 2};
+    array pos0(dim4(2, 2), p0);
+    array pos1(dim4(2, 2), p1);
+    const int pos0_interp_grid_start   = 0;
+    const double pos0_interp_grid_step = 1;
+    const int pos0_interp_dim          = 0;
+    const int pos1_interp_dim          = 1;
+
+    array interpolated =
+        approx2(input, pos0, pos0_interp_dim, pos0_interp_grid_start,
+                pos0_interp_grid_step, pos1, pos1_interp_dim,
+                pos0_interp_grid_start, pos0_interp_grid_step);
+
+    float expected_interp[4] = {10.0, 30.0, 70.0, 90.0};
+    array interpolated_gold(dim4(2, 2), expected_interp);
+    ASSERT_ARRAYS_EQ(interpolated, interpolated_gold);
+}
+
+TEST(Approx2, CPPUniformInvalidStepSize) {
+    try {
+        float inv[9] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0};
+        array in(dim4(3, 3), inv);
+        float pv[3] = {0.0, -1.0, -2.0};
+        array pos(dim4(3, 1), pv);
+        const int pos0_interp_grid_start   = -1;
+        const double pos0_interp_grid_step = 0;
+        const int pos0_interp_dim          = 0;
+        const int pos1_interp_dim          = 1;
+
+        array interpolated =
+            approx2(in, pos, pos0_interp_dim, pos0_interp_grid_start,
+                    pos0_interp_grid_step, pos, pos1_interp_dim,
+                    pos0_interp_grid_start, pos0_interp_grid_step);
+        FAIL() << "Expected af::exception\n";
+    } catch (const af::exception&) { SUCCEED(); } catch (...) {
+        FAIL() << "Expected af::exception, caught a different exception\n";
+    }
+}
+
+TEST(Approx2, CPPUniformColumnMajorInterpolation) {
+    float inv[9] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0};
+    array input(dim4(3, 3), inv);
+
+    float p0[4] = {0, 2, 0, 2};
+    float p1[4] = {0, 0, 2, 2};
+    array pos0(dim4(2, 2), p0);
+    array pos1(dim4(2, 2), p1);
+    const int pos0_interp_dim          = 0;
+    const int pos1_interp_dim          = 1;
+    const int pos0_interp_grid_start   = 0;
+    const double pos0_interp_grid_step = 1;
+
+    array first = approx2(input, pos0, pos0_interp_dim, pos0_interp_grid_start,
+                          pos0_interp_grid_step, pos1, pos1_interp_dim,
+                          pos0_interp_grid_start, pos0_interp_grid_step);
+
+    array second = approx2(input, pos1, pos1_interp_dim, pos0_interp_grid_start,
+                           pos0_interp_grid_step, pos0, pos0_interp_dim,
+                           pos0_interp_grid_start, pos0_interp_grid_step);
+
+    // Verify.
+    float expected_interp[4] = {10.0, 30.0, 70.0, 90.0};
+    array interpolated_gold(dim4(2, 2), expected_interp);
+    ASSERT_ARRAYS_EQ(first, interpolated_gold);
+    ASSERT_ARRAYS_EQ(first, second);
+}
+
+TEST(Approx2, CPPUniformRowMajorInterpolation) {
+    float inv[9] = {10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0};
+    array input(dim4(3, 3), inv);
+
+    float p0[4] = {0, 2, 0, 2};
+    float p1[4] = {0, 0, 2, 2};
+    array pos0(dim4(2, 2), p0);
+    array pos1(dim4(2, 2), p1);
+    const int pos0_interp_grid_start   = 0;
+    const double pos0_interp_grid_step = 1;
+
+    array first =
+        approx2(input, pos0, 1, pos0_interp_grid_start, pos0_interp_grid_step,
+                pos1, 0, pos0_interp_grid_start, pos0_interp_grid_step);
+
+    array second =
+        approx2(input, pos1, 0, pos0_interp_grid_start, pos0_interp_grid_step,
+                pos0, 1, pos0_interp_grid_start, pos0_interp_grid_step);
+
+    // Verify.
+    float expected_interp[4] = {10.0, 70.0, 30.0, 90.0};
+    array interpolated_gold(dim4(2, 2), expected_interp);
+    ASSERT_ARRAYS_EQ(first, interpolated_gold);
+    ASSERT_ARRAYS_EQ(first, second);
+}
+
+TEST(Approx2, CPPEmptyPos) {
+    float inv[3] = {10.0, 20.0, 30.0};
+    array in(dim4(3, 1), inv);
+    array pos;
+    array interpolated = approx2(in, pos, pos);
+    ASSERT_TRUE(pos.isempty());
+    ASSERT_TRUE(interpolated.isempty());
+}
+
+TEST(Approx2, CPPEmptyInput) {
+    array in;
+    float pv[3] = {0.0, 1.0, 2.0};
+    array pos(dim4(3, 1), pv);
+
+    array interpolated = approx2(in, pos, pos);
+    ASSERT_TRUE(in.isempty());
+    ASSERT_TRUE(interpolated.isempty());
+}
+
+TEST(Approx2, CPPEmptyPosAndInput) {
+    array in;
+    array pos;
+    array interpolated = approx2(in, pos, pos);
+    ASSERT_TRUE(in.isempty());
+    ASSERT_TRUE(pos.isempty());
+    ASSERT_TRUE(interpolated.isempty());
+}
+
+template<typename T>
+class Approx2V2 : public ::testing::Test {
+   protected:
+    typedef typename dtype_traits<T>::base_type BT;
+
+    vector<T> h_gold_cast;
+    vector<T> h_in_cast;
+    vector<BT> h_pos1_cast;
+    vector<BT> h_pos2_cast;
+
+    dim4 gold_dims;
+    dim4 in_dims;
+    dim4 pos1_dims;
+    dim4 pos2_dims;
+
+    af_array gold;
+    af_array in;
+    af_array pos1;
+    af_array pos2;
+
+    Approx2V2() : gold(0), in(0), pos1(0), pos2(0) {}
+
+    void SetUp() {}
+
+    void releaseArrays() {
+        if (pos2 != 0) { ASSERT_SUCCESS(af_release_array(pos2)); }
+        if (pos1 != 0) { ASSERT_SUCCESS(af_release_array(pos1)); }
+        if (in != 0) { ASSERT_SUCCESS(af_release_array(in)); }
+        if (gold != 0) { ASSERT_SUCCESS(af_release_array(gold)); }
+    }
+
+    void TearDown() { releaseArrays(); }
+
+    void setTestData(float* h_gold, dim4 gold_dims, float* h_in, dim4 in_dims,
+                     float* h_pos1, dim4 pos1_dims, float* h_pos2,
+                     dim4 pos2_dims) {
+        releaseArrays();
+
+        gold = 0;
+        in   = 0;
+        pos1 = 0;
+        pos2 = 0;
+
+        this->gold_dims = gold_dims;
+        this->in_dims   = in_dims;
+        this->pos1_dims = pos1_dims;
+        this->pos2_dims = pos2_dims;
+
+        for (int i = 0; i < gold_dims.elements(); ++i) {
+            h_gold_cast.push_back(static_cast<T>(h_gold[i]));
+        }
+        for (int i = 0; i < in_dims.elements(); ++i) {
+            h_in_cast.push_back(static_cast<T>(h_in[i]));
+        }
+        for (int i = 0; i < pos1_dims.elements(); ++i) {
+            h_pos1_cast.push_back(static_cast<BT>(h_pos1[i]));
+        }
+        for (int i = 0; i < pos2_dims.elements(); ++i) {
+            h_pos2_cast.push_back(static_cast<BT>(h_pos2[i]));
+        }
+
+        ASSERT_SUCCESS(af_create_array(&gold, &h_gold_cast.front(),
+                                       gold_dims.ndims(), gold_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&in, &h_in_cast.front(), in_dims.ndims(),
+                                       in_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&pos1, &h_pos1_cast.front(),
+                                       pos1_dims.ndims(), pos1_dims.get(),
+                                       (af_dtype)dtype_traits<BT>::af_type));
+        ASSERT_SUCCESS(af_create_array(&pos2, &h_pos2_cast.front(),
+                                       pos2_dims.ndims(), pos2_dims.get(),
+                                       (af_dtype)dtype_traits<BT>::af_type));
+    }
+
+    void testSpclOutArray(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        genTestOutputArray(&out, gold_dims.ndims(), gold_dims.get(),
+                           (af_dtype)dtype_traits<T>::af_type, &metadata);
+
+        ASSERT_SUCCESS(
+            af_approx2_v2(&out, in, pos1, pos2, AF_INTERP_LINEAR, 0));
+        ASSERT_SPECIAL_ARRAYS_EQ(gold, out, &metadata);
+    }
+
+    void testSpclOutArrayUniform(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        genTestOutputArray(&out, gold_dims.ndims(), gold_dims.get(),
+                           (af_dtype)dtype_traits<T>::af_type, &metadata);
+
+        ASSERT_SUCCESS(af_approx2_uniform_v2(&out, in, pos1, 0, 0.0, 1.0, pos2,
+                                             1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+        ASSERT_SPECIAL_ARRAYS_EQ(gold, out, &metadata);
+    }
+};
+
+TYPED_TEST_SUITE(Approx2V2, TestTypes);
+
+class SimpleTestData {
+   public:
+    static const int h_gold_size = 4;
+    static const int h_in_size   = 9;
+    static const int h_pos1_size = 4;
+    static const int h_pos2_size = 4;
+
+    vector<float> h_gold;
+    vector<float> h_in;
+    vector<float> h_pos1;
+    vector<float> h_pos2;
+
+    dim4 gold_dims;
+    dim4 in_dims;
+    dim4 pos1_dims;
+    dim4 pos2_dims;
+
+   public:
+    SimpleTestData()
+        : gold_dims(2, 2), in_dims(3, 3), pos1_dims(2, 2), pos2_dims(2, 2) {
+        float gold_arr[h_gold_size] = {1.5, 1.5, 2.5, 2.5};
+
+        float in_arr[h_in_size] = {1.0, 1.0, 1.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0};
+
+        float pos1_arr[h_pos1_size] = {0.5, 1.5, 0.5, 1.5};
+
+        float pos2_arr[h_pos2_size] = {0.5, 0.5, 1.5, 1.5};
+
+        h_gold.assign(gold_arr, gold_arr + h_gold_size);
+        h_in.assign(in_arr, in_arr + h_in_size);
+        h_pos1.assign(pos1_arr, pos1_arr + h_pos1_size);
+        h_pos2.assign(pos2_arr, pos2_arr + h_pos2_size);
+    }
+};
+
+template<typename T>
+class Approx2V2Simple : public Approx2V2<T> {
+   protected:
+    void SetUp() {
+        SUPPORTED_TYPE_CHECK(T);
+        SimpleTestData data;
+        this->setTestData(&data.h_gold.front(), data.gold_dims,
+                          &data.h_in.front(), data.in_dims,
+                          &data.h_pos1.front(), data.pos1_dims,
+                          &data.h_pos2.front(), data.pos2_dims);
+    }
+};
+
+TYPED_TEST_SUITE(Approx2V2Simple, TestTypes);
+
+TYPED_TEST(Approx2V2Simple, UseNullOutputArray) {
+    this->testSpclOutArray(NULL_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UseFullExistingOutputArray) {
+    this->testSpclOutArray(FULL_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UseExistingOutputSubArray) {
+    this->testSpclOutArray(SUB_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UseReorderedOutputArray) {
+    this->testSpclOutArray(REORDERED_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UniformUseNullOutputArray) {
+    this->testSpclOutArrayUniform(NULL_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UniformUseFullExistingOutputArray) {
+    this->testSpclOutArrayUniform(FULL_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UniformUseExistingOutputSubArray) {
+    this->testSpclOutArrayUniform(SUB_ARRAY);
+}
+
+TYPED_TEST(Approx2V2Simple, UniformUseReorderedOutputArray) {
+    this->testSpclOutArrayUniform(REORDERED_ARRAY);
+}
+
+class Approx2NullArgs : public ::testing::Test {
+   protected:
+    af_array out;
+    af_array in;
+    af_array pos1;
+    af_array pos2;
+
+    Approx2NullArgs() : out(0), in(0), pos1(0), pos2(0) {}
+
+    void SetUp() {
+        SimpleTestData data;
+        ASSERT_SUCCESS(af_create_array(&in, &data.h_in.front(),
+                                       data.in_dims.ndims(), data.in_dims.get(),
+                                       f32));
+        ASSERT_SUCCESS(af_create_array(&pos1, &data.h_pos1.front(),
+                                       data.pos1_dims.ndims(),
+                                       data.pos1_dims.get(), f32));
+        ASSERT_SUCCESS(af_create_array(&pos2, &data.h_pos2.front(),
+                                       data.pos2_dims.ndims(),
+                                       data.pos2_dims.get(), f32));
+    }
+
+    void TearDown() {
+        if (pos2 != 0) { ASSERT_SUCCESS(af_release_array(pos2)); }
+        if (pos1 != 0) { ASSERT_SUCCESS(af_release_array(pos1)); }
+        if (in != 0) { ASSERT_SUCCESS(af_release_array(in)); }
+    }
+};
+
+TEST_F(Approx2NullArgs, NullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_approx2(out_ptr, this->in, this->pos1, this->pos2,
+                                     AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2(&this->out, 0, this->pos1, this->pos2,
+                                     AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, NullPos1Array) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2(&this->out, this->in, 0, this->pos2,
+                                     AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, NullPos2Array) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2(&this->out, this->in, this->pos1, 0,
+                                     AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, V2NullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_v2(out_ptr, this->in, this->pos1,
+                                        this->pos2, AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, V2NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_v2(&this->out, 0, this->pos1, this->pos2,
+                                        AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, V2NullPos1Array) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_v2(&this->out, this->in, 0, this->pos2,
+                                        AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, V2NullPos2Array) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_v2(&this->out, this->in, this->pos1, 0,
+                                        AF_INTERP_LINEAR, 0.f));
+}
+
+TEST_F(Approx2NullArgs, UniformNullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx2_uniform(out_ptr, this->in, this->pos1, 0, 0.0, 1.0,
+                                 this->pos2, 1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, UniformNullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx2_uniform(&this->out, 0, this->pos1, 0, 0.0, 1.0,
+                                 this->pos2, 1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, UniformNullPos1Array) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx2_uniform(&this->out, this->in, 0, 0, 0.0, 1.0,
+                                 this->pos2, 1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, UniformNullPos2Array) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx2_uniform(&this->out, this->in, this->pos1, 0, 0.0, 1.0,
+                                 0, 1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, V2UniformNullOutputPtr) {
+    af_array* out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_uniform_v2(out_ptr, this->in, this->pos1,
+                                                0, 0.0, 1.0, this->pos2, 1, 0.0,
+                                                1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, V2UniformNullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_uniform_v2(&this->out, 0, this->pos1, 0,
+                                                0.0, 1.0, this->pos2, 1, 0.0,
+                                                1.0, AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, V2UniformNullPos1Array) {
+    ASSERT_EQ(AF_ERR_ARG, af_approx2_uniform_v2(&this->out, this->in, 0, 0, 0.0,
+                                                1.0, this->pos2, 1, 0.0, 1.0,
+                                                AF_INTERP_LINEAR, 0));
+}
+
+TEST_F(Approx2NullArgs, V2UniformNullPos2Array) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_approx2_uniform_v2(&this->out, this->in, this->pos1, 0, 0.0,
+                                    1.0, 0, 1, 0.0, 1.0, AF_INTERP_LINEAR, 0));
+}
diff --git a/test/array.cpp b/test/array.cpp
index db2cf5a956..c5befe1fdb 100644
--- a/test/array.cpp
+++ b/test/array.cpp
@@ -1,219 +1,218 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
 #include <testHelpers.hpp>
+#include <cstddef>
+#include <cstdlib>
+#include <initializer_list>
+#include <iomanip>
 
 using namespace af;
-using namespace std;
+using std::vector;
 
 template<typename T>
-class Array : public ::testing::Test
-{
+class Array : public ::testing::Test {};
 
-};
+typedef ::testing::Types<float, double, cfloat, cdouble, char, signed char,
+                         unsigned char, int, uint, intl, uintl, short, ushort,
+                         half_float::half>
+    TestTypes;
 
-typedef ::testing::Types<float, double, af::cfloat, af::cdouble, char, unsigned char, int, uint, intl, uintl> TestTypes;
-TYPED_TEST_CASE(Array, TestTypes);
+TYPED_TEST_SUITE(Array, TestTypes);
 
-TEST(Array, ConstructorDefault)
-{
+TEST(Array, ConstructorDefault) {
     array a;
-    EXPECT_EQ(0u,    a.numdims());
-    EXPECT_EQ(dim_t(0),    a.dims(0));
-    EXPECT_EQ(dim_t(0),    a.dims(1));
-    EXPECT_EQ(dim_t(0),    a.dims(2));
-    EXPECT_EQ(dim_t(0),    a.dims(3));
-    EXPECT_EQ(dim_t(0),    a.elements());
-    EXPECT_EQ(f32,  a.type());
-    EXPECT_EQ(0u,    a.bytes());
-    EXPECT_FALSE(   a.isrow());
-    EXPECT_FALSE(   a.iscomplex());
-    EXPECT_FALSE(   a.isdouble());
-    EXPECT_FALSE(   a.isbool());
-
-    EXPECT_FALSE(    a.isvector());
-    EXPECT_FALSE(    a.iscolumn());
-
-    EXPECT_TRUE(    a.isreal());
-    EXPECT_TRUE(    a.isempty());
-    EXPECT_TRUE(    a.issingle());
-    EXPECT_TRUE(    a.isfloating());
-    EXPECT_TRUE(    a.isrealfloating());
-}
-
-TYPED_TEST(Array, ConstructorEmptyDim4)
-{
-    if (noDoubleTests<TypeParam>()) return;
-
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    EXPECT_EQ(0u, a.numdims());
+    EXPECT_EQ(dim_t(0), a.dims(0));
+    EXPECT_EQ(dim_t(0), a.elements());
+    EXPECT_EQ(f32, a.type());
+    EXPECT_EQ(0u, a.bytes());
+    EXPECT_FALSE(a.isrow());
+    EXPECT_FALSE(a.iscomplex());
+    EXPECT_FALSE(a.isdouble());
+    EXPECT_FALSE(a.isbool());
+
+    EXPECT_FALSE(a.isvector());
+    EXPECT_FALSE(a.iscolumn());
+
+    EXPECT_TRUE(a.isreal());
+    EXPECT_TRUE(a.isempty());
+    EXPECT_TRUE(a.issingle());
+    EXPECT_TRUE(a.isfloating());
+    EXPECT_TRUE(a.isrealfloating());
+}
+
+TYPED_TEST(Array, ConstructorEmptyDim4) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     dim4 dims(3, 3, 3, 3);
     array a(dims, type);
-    EXPECT_EQ(4u,    a.numdims());
-    EXPECT_EQ(dim_t(3),    a.dims(0));
-    EXPECT_EQ(dim_t(3),    a.dims(1));
-    EXPECT_EQ(dim_t(3),    a.dims(2));
-    EXPECT_EQ(dim_t(3),    a.dims(3));
-    EXPECT_EQ(dim_t(81),   a.elements());
-    EXPECT_EQ(type,  a.type());
+    EXPECT_EQ(4u, a.numdims());
+    EXPECT_EQ(dim_t(3), a.dims(0));
+    EXPECT_EQ(dim_t(3), a.dims(1));
+    EXPECT_EQ(dim_t(3), a.dims(2));
+    EXPECT_EQ(dim_t(3), a.dims(3));
+    EXPECT_EQ(dim_t(81), a.elements());
+    EXPECT_EQ(type, a.type());
 }
 
-TYPED_TEST(Array, ConstructorEmpty1D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorEmpty1D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     array a(2, type);
-    EXPECT_EQ(1u,    a.numdims());
-    EXPECT_EQ(dim_t(2),    a.dims(0));
-    EXPECT_EQ(dim_t(1),    a.dims(1));
-    EXPECT_EQ(dim_t(1),    a.dims(2));
-    EXPECT_EQ(dim_t(1),    a.dims(3));
-    EXPECT_EQ(dim_t(2),    a.elements());
-    EXPECT_EQ(type,  a.type());
+    EXPECT_EQ(1u, a.numdims());
+    EXPECT_EQ(dim_t(2), a.dims(0));
+    EXPECT_EQ(dim_t(1), a.dims(1));
+    EXPECT_EQ(dim_t(1), a.dims(2));
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(2), a.elements());
+    EXPECT_EQ(type, a.type());
 }
 
-TYPED_TEST(Array, ConstructorEmpty2D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorEmpty2D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     array a(2, 2, type);
-    EXPECT_EQ(2u,    a.numdims());
-    EXPECT_EQ(dim_t(2),    a.dims(0));
-    EXPECT_EQ(dim_t(2),    a.dims(1));
-    EXPECT_EQ(dim_t(1),    a.dims(2));
-    EXPECT_EQ(dim_t(1),    a.dims(3));
-    EXPECT_EQ(dim_t(4),    a.elements());
-    EXPECT_EQ(type,  a.type());
+    EXPECT_EQ(2u, a.numdims());
+    EXPECT_EQ(dim_t(2), a.dims(0));
+    EXPECT_EQ(dim_t(2), a.dims(1));
+    EXPECT_EQ(dim_t(1), a.dims(2));
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(4), a.elements());
+    EXPECT_EQ(type, a.type());
 }
 
-TYPED_TEST(Array, ConstructorEmpty3D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorEmpty3D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     array a(2, 2, 2, type);
-    EXPECT_EQ(3u,    a.numdims());
-    EXPECT_EQ(dim_t(2),    a.dims(0));
-    EXPECT_EQ(dim_t(2),    a.dims(1));
-    EXPECT_EQ(dim_t(2),    a.dims(2));
-    EXPECT_EQ(dim_t(1),    a.dims(3));
-    EXPECT_EQ(dim_t(8),    a.elements());
-    EXPECT_EQ(type,  a.type());
+    EXPECT_EQ(3u, a.numdims());
+    EXPECT_EQ(dim_t(2), a.dims(0));
+    EXPECT_EQ(dim_t(2), a.dims(1));
+    EXPECT_EQ(dim_t(2), a.dims(2));
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(8), a.elements());
+    EXPECT_EQ(type, a.type());
 }
 
-TYPED_TEST(Array, ConstructorEmpty4D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorEmpty4D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     array a(2, 2, 2, 2, type);
-    EXPECT_EQ(4u,    a.numdims());
-    EXPECT_EQ(dim_t(2),    a.dims(0));
-    EXPECT_EQ(dim_t(2),    a.dims(1));
-    EXPECT_EQ(dim_t(2),    a.dims(2));
-    EXPECT_EQ(dim_t(2),    a.dims(3));
-    EXPECT_EQ(dim_t(16),   a.elements());
+    EXPECT_EQ(4u, a.numdims());
+    EXPECT_EQ(dim_t(2), a.dims(0));
+    EXPECT_EQ(dim_t(2), a.dims(1));
+    EXPECT_EQ(dim_t(2), a.dims(2));
+    EXPECT_EQ(dim_t(2), a.dims(3));
+    EXPECT_EQ(dim_t(16), a.elements());
     EXPECT_EQ(type, a.type());
 }
 
-TYPED_TEST(Array, ConstructorHostPointer1D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorHostPointer1D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type    = (dtype)dtype_traits<TypeParam>::af_type;
     size_t nelems = 10;
-    vector<TypeParam> data(nelems, 4);
+    vector<TypeParam> data(nelems, TypeParam(4));
     array a(nelems, &data.front(), afHost);
-    EXPECT_EQ(1u,        a.numdims());
-    EXPECT_EQ(dim_t(nelems),   a.dims(0));
-    EXPECT_EQ(dim_t(1),        a.dims(1));
-    EXPECT_EQ(dim_t(1),        a.dims(2));
-    EXPECT_EQ(dim_t(1),        a.dims(3));
-    EXPECT_EQ(dim_t(nelems),   a.elements());
-    EXPECT_EQ(type,     a.type());
+    EXPECT_EQ(1u, a.numdims());
+    EXPECT_EQ(dim_t(nelems), a.dims(0));
+    EXPECT_EQ(dim_t(1), a.dims(1));
+    EXPECT_EQ(dim_t(1), a.dims(2));
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(nelems), a.elements());
+    EXPECT_EQ(type, a.type());
 
     vector<TypeParam> out(nelems);
     a.host(&out.front());
     ASSERT_TRUE(std::equal(data.begin(), data.end(), out.begin()));
 }
 
-TYPED_TEST(Array, ConstructorHostPointer2D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorHostPointer2D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type      = (dtype)dtype_traits<TypeParam>::af_type;
     size_t ndims    = 2;
     size_t dim_size = 10;
     size_t nelems   = dim_size * dim_size;
-    vector<TypeParam> data(nelems, 4);
+    vector<TypeParam> data(nelems, TypeParam(4));
     array a(dim_size, dim_size, &data.front(), afHost);
-    EXPECT_EQ(ndims,    a.numdims());
+    EXPECT_EQ(ndims, a.numdims());
     EXPECT_EQ(dim_t(dim_size), a.dims(0));
     EXPECT_EQ(dim_t(dim_size), a.dims(1));
-    EXPECT_EQ(dim_t(1),        a.dims(2));
-    EXPECT_EQ(dim_t(1),        a.dims(3));
-    EXPECT_EQ(dim_t(nelems),   a.elements());
-    EXPECT_EQ(type,     a.type());
+    EXPECT_EQ(dim_t(1), a.dims(2));
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(nelems), a.elements());
+    EXPECT_EQ(type, a.type());
 
     vector<TypeParam> out(nelems);
     a.host(&out.front());
     ASSERT_TRUE(std::equal(data.begin(), data.end(), out.begin()));
 }
 
-TYPED_TEST(Array, ConstructorHostPointer3D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorHostPointer3D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type      = (dtype)dtype_traits<TypeParam>::af_type;
     size_t ndims    = 3;
     size_t dim_size = 10;
     size_t nelems   = dim_size * dim_size * dim_size;
-    vector<TypeParam> data(nelems, 4);
+    vector<TypeParam> data(nelems, TypeParam(4));
     array a(dim_size, dim_size, dim_size, &data.front(), afHost);
-    EXPECT_EQ(ndims,    a.numdims());
+    EXPECT_EQ(ndims, a.numdims());
     EXPECT_EQ(dim_t(dim_size), a.dims(0));
     EXPECT_EQ(dim_t(dim_size), a.dims(1));
     EXPECT_EQ(dim_t(dim_size), a.dims(2));
-    EXPECT_EQ(dim_t(1),        a.dims(3));
-    EXPECT_EQ(dim_t(nelems),   a.elements());
-    EXPECT_EQ(type,     a.type());
+    EXPECT_EQ(dim_t(1), a.dims(3));
+    EXPECT_EQ(dim_t(nelems), a.elements());
+    EXPECT_EQ(type, a.type());
 
     vector<TypeParam> out(nelems);
     a.host(&out.front());
     ASSERT_TRUE(std::equal(data.begin(), data.end(), out.begin()));
 }
 
-TYPED_TEST(Array, ConstructorHostPointer4D)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, ConstructorHostPointer4D) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type      = (dtype)dtype_traits<TypeParam>::af_type;
     size_t ndims    = 4;
     size_t dim_size = 10;
     size_t nelems   = dim_size * dim_size * dim_size * dim_size;
-    vector<TypeParam> data(nelems, 4);
+    vector<TypeParam> data(nelems, TypeParam(4));
     array a(dim_size, dim_size, dim_size, dim_size, &data.front(), afHost);
-    EXPECT_EQ(ndims,    a.numdims());
+    EXPECT_EQ(ndims, a.numdims());
     EXPECT_EQ(dim_t(dim_size), a.dims(0));
     EXPECT_EQ(dim_t(dim_size), a.dims(1));
     EXPECT_EQ(dim_t(dim_size), a.dims(2));
     EXPECT_EQ(dim_t(dim_size), a.dims(3));
-    EXPECT_EQ(dim_t(nelems),   a.elements());
-    EXPECT_EQ(type,     a.type());
+    EXPECT_EQ(dim_t(nelems), a.elements());
+    EXPECT_EQ(type, a.type());
 
     vector<TypeParam> out(nelems);
     a.host(&out.front());
     ASSERT_TRUE(std::equal(data.begin(), data.end(), out.begin()));
 }
 
-TYPED_TEST(Array, TypeAttributes)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Array, TypeAttributes) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    dtype type = (dtype)af::dtype_traits<TypeParam>::af_type;
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
     array one(10, type);
-    switch(type) {
+    switch (type) {
         case f32:
             EXPECT_TRUE(one.isfloating());
             EXPECT_FALSE(one.isdouble());
@@ -223,6 +222,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
 
         case f64:
@@ -234,6 +234,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case c32:
             EXPECT_TRUE(one.isfloating());
@@ -244,6 +245,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_FALSE(one.isreal());
             EXPECT_TRUE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case c64:
             EXPECT_TRUE(one.isfloating());
@@ -254,6 +256,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_FALSE(one.isreal());
             EXPECT_TRUE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case s32:
             EXPECT_FALSE(one.isfloating());
@@ -264,6 +267,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case u32:
             EXPECT_FALSE(one.isfloating());
@@ -274,6 +278,40 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
+            break;
+        case s16:
+            EXPECT_FALSE(one.isfloating());
+            EXPECT_FALSE(one.isdouble());
+            EXPECT_FALSE(one.issingle());
+            EXPECT_FALSE(one.isrealfloating());
+            EXPECT_TRUE(one.isinteger());
+            EXPECT_TRUE(one.isreal());
+            EXPECT_FALSE(one.iscomplex());
+            EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
+            break;
+        case u16:
+            EXPECT_FALSE(one.isfloating());
+            EXPECT_FALSE(one.isdouble());
+            EXPECT_FALSE(one.issingle());
+            EXPECT_FALSE(one.isrealfloating());
+            EXPECT_TRUE(one.isinteger());
+            EXPECT_TRUE(one.isreal());
+            EXPECT_FALSE(one.iscomplex());
+            EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
+            break;
+        case s8:
+            EXPECT_FALSE(one.isfloating());
+            EXPECT_FALSE(one.isdouble());
+            EXPECT_FALSE(one.issingle());
+            EXPECT_FALSE(one.isrealfloating());
+            EXPECT_TRUE(one.isinteger());
+            EXPECT_TRUE(one.isreal());
+            EXPECT_FALSE(one.iscomplex());
+            EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case u8:
             EXPECT_FALSE(one.isfloating());
@@ -284,6 +322,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case b8:
             EXPECT_FALSE(one.isfloating());
@@ -294,6 +333,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_TRUE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case s64:
             EXPECT_FALSE(one.isfloating());
@@ -304,6 +344,7 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
             break;
         case u64:
             EXPECT_FALSE(one.isfloating());
@@ -314,14 +355,23 @@ TYPED_TEST(Array, TypeAttributes)
             EXPECT_TRUE(one.isreal());
             EXPECT_FALSE(one.iscomplex());
             EXPECT_FALSE(one.isbool());
+            EXPECT_FALSE(one.ishalf());
+            break;
+        case f16:
+            EXPECT_TRUE(one.isfloating());
+            EXPECT_FALSE(one.isdouble());
+            EXPECT_FALSE(one.issingle());
+            EXPECT_TRUE(one.isrealfloating());
+            EXPECT_FALSE(one.isinteger());
+            EXPECT_TRUE(one.isreal());
+            EXPECT_FALSE(one.iscomplex());
+            EXPECT_FALSE(one.isbool());
+            EXPECT_TRUE(one.ishalf());
             break;
-
     }
-
 }
 
-TEST(Array, ShapeAttributes)
-{
+TEST(Array, ShapeAttributes) {
     dim_t dim_size = 10;
     array scalar(1);
     array col(dim_size);
@@ -330,38 +380,333 @@ TEST(Array, ShapeAttributes)
     array volume(dim_size, dim_size, dim_size);
     array hypercube(dim_size, dim_size, dim_size, dim_size);
 
-    EXPECT_FALSE(scalar.    isempty());
-    EXPECT_FALSE(col.       isempty());
-    EXPECT_FALSE(row.       isempty());
-    EXPECT_FALSE(matrix.    isempty());
-    EXPECT_FALSE(volume.    isempty());
-    EXPECT_FALSE(hypercube. isempty());
-
-    EXPECT_TRUE(scalar.     isscalar());
-    EXPECT_FALSE(col.       isscalar());
-    EXPECT_FALSE(row.       isscalar());
-    EXPECT_FALSE(matrix.    isscalar());
-    EXPECT_FALSE(volume.    isscalar());
-    EXPECT_FALSE(hypercube. isscalar());
-
-    EXPECT_FALSE(scalar.    isvector());
-    EXPECT_TRUE(col.        isvector());
-    EXPECT_TRUE(row.        isvector());
-    EXPECT_FALSE(matrix.    isvector());
-    EXPECT_FALSE(volume.    isvector());
-    EXPECT_FALSE(hypercube. isvector());
-
-    EXPECT_FALSE(scalar.    isrow());
-    EXPECT_FALSE(col.       isrow());
-    EXPECT_TRUE(row.        isrow());
-    EXPECT_FALSE(matrix.    isrow());
-    EXPECT_FALSE(volume.    isrow());
-    EXPECT_FALSE(hypercube. isrow());
-
-    EXPECT_FALSE(scalar.    iscolumn());
-    EXPECT_TRUE(col.        iscolumn());
-    EXPECT_FALSE(row.       iscolumn());
-    EXPECT_FALSE(matrix.    iscolumn());
-    EXPECT_FALSE(volume.    iscolumn());
-    EXPECT_FALSE(hypercube. iscolumn());
+    EXPECT_FALSE(scalar.isempty());
+    EXPECT_FALSE(col.isempty());
+    EXPECT_FALSE(row.isempty());
+    EXPECT_FALSE(matrix.isempty());
+    EXPECT_FALSE(volume.isempty());
+    EXPECT_FALSE(hypercube.isempty());
+
+    EXPECT_TRUE(scalar.isscalar());
+    EXPECT_FALSE(col.isscalar());
+    EXPECT_FALSE(row.isscalar());
+    EXPECT_FALSE(matrix.isscalar());
+    EXPECT_FALSE(volume.isscalar());
+    EXPECT_FALSE(hypercube.isscalar());
+
+    EXPECT_FALSE(scalar.isvector());
+    EXPECT_TRUE(col.isvector());
+    EXPECT_TRUE(row.isvector());
+    EXPECT_FALSE(matrix.isvector());
+    EXPECT_FALSE(volume.isvector());
+    EXPECT_FALSE(hypercube.isvector());
+
+    EXPECT_FALSE(scalar.isrow());
+    EXPECT_FALSE(col.isrow());
+    EXPECT_TRUE(row.isrow());
+    EXPECT_FALSE(matrix.isrow());
+    EXPECT_FALSE(volume.isrow());
+    EXPECT_FALSE(hypercube.isrow());
+
+    EXPECT_FALSE(scalar.iscolumn());
+    EXPECT_TRUE(col.iscolumn());
+    EXPECT_FALSE(row.iscolumn());
+    EXPECT_FALSE(matrix.iscolumn());
+    EXPECT_FALSE(volume.iscolumn());
+    EXPECT_FALSE(hypercube.iscolumn());
+}
+
+TEST(Array, ISSUE_951) {
+    // This works
+    // const array a(100, 100);
+    // array b = a.cols(0, 20);
+    // b = b.rows(10, 20);
+
+    // This works
+    // array a(100, 100);
+    // array b = a.cols(0, 20).rows(10, 20);
+
+    // This fails with a linking error
+    const array a = randu(100, 100);
+    array b       = a.cols(0, 20).rows(10, 20);
+}
+
+TEST(Array, CreateHandleInvalidNullDimsPointer) {
+    af_array out = 0;
+    EXPECT_EQ(AF_ERR_ARG, af_create_handle(&out, 1, NULL, f32));
+}
+
+TEST(Device, simple) {
+    array a = randu(5, 5);
+    {
+        float *ptr0 = a.device<float>();
+        float *ptr1 = a.device<float>();
+        ASSERT_EQ(ptr0, ptr1);
+    }
+
+    {
+        float *ptr0 = a.device<float>();
+        a.unlock();
+        float *ptr1 = a.device<float>();
+        ASSERT_EQ(ptr0, ptr1);
+    }
+}
+
+TEST(Device, index) {
+    array a = randu(5, 5);
+    array b = a(span, 0);
+
+    ASSERT_NE(a.device<float>(), b.device<float>());
+}
+
+TEST(Device, unequal) {
+    {
+        array a    = randu(5, 5);
+        float *ptr = a.device<float>();
+        array b    = a;
+        ASSERT_NE(ptr, b.device<float>());
+        ASSERT_EQ(ptr, a.device<float>());
+    }
+
+    {
+        array a    = randu(5, 5);
+        float *ptr = a.device<float>();
+        array b    = a;
+        ASSERT_NE(ptr, a.device<float>());
+        ASSERT_EQ(ptr, b.device<float>());
+    }
+}
+
+TEST(DeviceId, Same) {
+    array a = randu(5, 5);
+    ASSERT_EQ(getDevice(), getDeviceId(a));
+}
+
+TEST(DeviceId, Different) {
+    int ndevices = getDeviceCount();
+    if (ndevices < 2) GTEST_SKIP() << "Skipping multi-GPU test";
+    int id0 = getDevice();
+    int id1 = (id0 + 1) % ndevices;
+
+    {
+        array a = randu(5, 5);
+        ASSERT_EQ(getDeviceId(a), id0);
+        setDevice(id1);
+
+        array b = randu(5, 5);
+
+        ASSERT_EQ(getDeviceId(a), id0);
+        ASSERT_EQ(getDeviceId(b), id1);
+        ASSERT_NE(getDevice(), getDeviceId(a));
+        ASSERT_EQ(getDevice(), getDeviceId(b));
+
+        af_array c;
+        af_err err = af_matmul(&c, a.get(), b.get(), AF_MAT_NONE, AF_MAT_NONE);
+        af::sync();
+        ASSERT_EQ(err, AF_SUCCESS);
+    }
+
+    setDevice(id1);
+    deviceGC();
+    setDevice(id0);
+    deviceGC();
+}
+
+TEST(Device, MigrateAllDevicesToAllDevices) {
+    int ndevices = getDeviceCount();
+    if (ndevices < 2) GTEST_SKIP() << "Skipping multi-GPU test";
+
+    for (int i = 0; i < ndevices; i++) {
+        for (int j = 0; j < ndevices; j++) {
+            setDevice(i);
+            array a = constant(i * 255, 10, 10);
+            a.eval();
+
+            setDevice(j);
+            array b = constant(j * 256, 10, 10);
+            b.eval();
+
+            array c = a + b;
+
+            std::vector<float> gold(10 * 10, i * 255 + j * 256);
+
+            ASSERT_VEC_ARRAY_EQ(gold, dim4(10, 10), c);
+        }
+    }
+}
+
+TEST(Device, empty) {
+    array a = array();
+    ASSERT_EQ(a.device<float>(), nullptr);
+}
+
+TEST(Device, JIT) {
+    array a = constant(1, 5, 5);
+    ASSERT_NE(a.device<float>(), nullptr);
+}
+
+TYPED_TEST(Array, Scalar) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    dtype type = (dtype)dtype_traits<TypeParam>::af_type;
+    array a    = randu(dim4(1), type);
+
+    vector<TypeParam> gold(a.elements());
+
+    a.host((void *)gold.data());
+
+    EXPECT_EQ(gold[0], a.scalar<TypeParam>());
+}
+
+TEST(Array, ScalarTypeMismatch) {
+    array a = constant(1.0, dim4(1), f32);
+
+    EXPECT_THROW(a.scalar<int>(), exception);
+}
+
+TEST(Array, CopyListInitializerList) {
+    int h_buffer[] = {23, 34, 18, 99, 34};
+
+    array A(5, h_buffer);
+    array B({23, 34, 18, 99, 34});
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, DirectListInitializerList2) {
+    int h_buffer[] = {23, 34, 18, 99, 34};
+
+    array A(5, h_buffer);
+    array B{23, 34, 18, 99, 34};
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, CopyListInitializerListAndDim4) {
+    int h_buffer[] = {23, 34, 18, 99, 34, 44};
+
+    array A(2, 3, h_buffer);
+    array B(dim4(2, 3), {23, 34, 18, 99, 34, 44});
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, DirectListInitializerListAndDim4) {
+    int h_buffer[] = {23, 34, 18, 99, 34, 44};
+
+    array A(2, 3, h_buffer);
+    array B{dim4(2, 3), {23, 34, 18, 99, 34, 44}};
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, CopyListInitializerListAssignment) {
+    int h_buffer[] = {23, 34, 18, 99, 34};
+
+    array A(5, h_buffer);
+    array B = {23, 34, 18, 99, 34};
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, CopyListInitializerListDim4Assignment) {
+    int h_buffer[] = {23, 34, 18, 99, 34, 44};
+
+    array A(2, 3, h_buffer);
+    array B = {dim4(2, 3), {23, 34, 18, 99, 34, 44}};
+
+    ASSERT_ARRAYS_EQ(A, B);
+}
+
+TEST(Array, EmptyArrayHostCopy) {
+    af::array empty;
+    std::vector<float> hdata(100);
+    empty.host(hdata.data());
+    SUCCEED();
+}
+
+TEST(Array, ReferenceCount1) {
+    int counta = 0, countb = 0, countc = 0;
+    array a = af::randu(10, 10);
+    a.eval();
+    af::sync();
+    {
+        ASSERT_REF(a, 1) << "After a = randu(10, 10);";
+
+        array b = af::randu(10, 10);  //(af::seq(100));
+        ASSERT_REF(b, 1) << "After b = randu(10, 10);";
+
+        array c = a + b;
+        ASSERT_REF(a, 2) << "After c = a + b;";
+        ASSERT_REF(b, 2) << "After c = a + b;";
+        ASSERT_REF(c, 0) << "After c = a + b;";
+
+        c.eval();
+        af::sync();
+        ASSERT_REF(a, 1) << "After c.eval();";
+        ASSERT_REF(b, 1) << "After c.eval();";
+        ASSERT_REF(c, 1) << "After c.eval();";
+    }
+}
+
+TEST(Array, ReferenceCount2) {
+    int counta = 0, countb = 0, countc = 0;
+    array a = af::randu(10, 10);
+    array b = af::randu(10, 10);
+    {
+        ASSERT_REF(a, 1) << "After a = randu(10, 10);";
+        ASSERT_REF(b, 1) << "After b = randu(10, 10);";
+
+        array c = a + b;
+
+        ASSERT_REF(a, 2) << "After c = a + b;";
+        ASSERT_REF(b, 2) << "After c = a + b;";
+        ASSERT_REF(c, 0) << "After c = a + b;";
+
+        array d = c;
+
+        ASSERT_REF(a, 2) << "After d = c;";
+        ASSERT_REF(b, 2) << "After d = c;";
+        ASSERT_REF(c, 0) << "After d = c;";
+        ASSERT_REF(d, 0) << "After d = c;";
+    }
+}
+
+// This tests situations where the compiler incorrectly assumes the
+// initializer list constructor instead of the regular constructor when
+// using the uniform initialization syntax
+TEST(Array, InitializerListFixAFArray) {
+    af::array a = randu(1);
+    af::array b{a};
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+// This tests situations where the compiler incorrectly assumes the
+// initializer list constructor instead of the regular constructor when
+// using the uniform initialization syntax
+TEST(Array, InitializerListFixDim4) {
+    af::array a        = randu(1);
+    vector<float> data = {3.14f, 3.14f, 3.14f, 3.14f, 3.14f,
+                          3.14f, 3.14f, 3.14f, 3.14f};
+    af::array b{dim4(3, 3), data.data()};
+    ASSERT_ARRAYS_EQ(constant(3.14, 3, 3), b);
+}
+
+TEST(Array, OtherDevice) {
+    if (af::getDeviceCount() == 1) GTEST_SKIP() << "Single device. Skipping";
+    af::setDevice(0);
+    af::info();
+    af::array a = constant(3, 5, 5);
+    a.eval();
+    af::setDevice(1);
+    af::info();
+    af::array b = constant(2, 5, 5);
+    b.eval();
+
+    af::array c = a + b;
+    af::eval(c);
+    af::sync();
+    af::setDevice(0);
+    ASSERT_ARRAYS_EQ(constant(5, 5, 5), c);
 }
diff --git a/test/array_death_tests.cpp b/test/array_death_tests.cpp
new file mode 100644
index 0000000000..9c2868da4a
--- /dev/null
+++ b/test/array_death_tests.cpp
@@ -0,0 +1,63 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+
+#include <cstdlib>
+
+using af::array;
+using af::constant;
+using af::dim4;
+using af::end;
+using af::fft;
+using af::info;
+using af::randu;
+using af::scan;
+using af::seq;
+using af::setDevice;
+using af::sin;
+using af::sort;
+
+template<typename T>
+class ArrayDeathTest : public ::testing::Test {};
+
+void deathTest() {
+    info();
+    setDevice(0);
+
+    array A = randu(5, 3, f32);
+
+    array B = sin(A) + 1.5;
+
+    B(seq(0, 2), 1) = B(seq(0, 2), 1) * -1;
+
+    array C = fft(B);
+
+    array c = C.row(end);
+
+    dim4 dims(16, 4, 1, 1);
+    array r = constant(2, dims);
+
+    array S = scan(r, 0, AF_BINARY_MUL);
+
+    float d[] = {1, 2, 3, 4, 5, 6};
+    array D(2, 3, d, afHost);
+
+    D.col(0) = D.col(end);
+
+    array vals, inds;
+    sort(vals, inds, A);
+
+    _exit(0);
+}
+
+TEST(ArrayDeathTest, ProxyMoveAssignmentOperator) {
+    EXPECT_EXIT(deathTest(), ::testing::ExitedWithCode(0), "");
+}
diff --git a/test/arrayfire_test.cpp b/test/arrayfire_test.cpp
new file mode 100644
index 0000000000..687de09aab
--- /dev/null
+++ b/test/arrayfire_test.cpp
@@ -0,0 +1,2227 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#define EXTERN_TEMPLATE
+#include <testHelpers.hpp>
+
+#include <arrayfire.h>
+#include <af/algorithm.h>
+#include <af/compatible.h>
+#include <af/internal.h>
+
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <relative_difference.hpp>
+
+#include <algorithm>
+#include <cfloat>
+#include <cmath>
+#include <complex>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <fstream>
+#include <iomanip>
+#include <iterator>
+#include <limits>
+#include <numeric>
+#include <sstream>
+#include <stdexcept>
+#include <string>
+#include <typeinfo>
+#include <utility>
+#include <vector>
+
+using af::af_cdouble;
+using af::af_cfloat;
+using std::vector;
+
+bool operator==(const af_half &lhs, const af_half &rhs) {
+    return lhs.data_ == rhs.data_;
+}
+
+std::ostream &operator<<(std::ostream &os, const af_half &val) {
+    float out = *reinterpret_cast<const half_float::half *>(&val);
+    os << out;
+    return os;
+}
+
+std::ostream &operator<<(std::ostream &os, af::Backend bk) {
+    switch (bk) {
+        case AF_BACKEND_CPU: os << "AF_BACKEND_CPU"; break;
+        case AF_BACKEND_CUDA: os << "AF_BACKEND_CUDA"; break;
+        case AF_BACKEND_OPENCL: os << "AF_BACKEND_OPENCL"; break;
+        case AF_BACKEND_ONEAPI: os << "AF_BACKEND_ONEAPI"; break;
+        case AF_BACKEND_DEFAULT: os << "AF_BACKEND_DEFAULT"; break;
+    }
+    return os;
+}
+
+std::ostream &operator<<(std::ostream &os, af_err e) {
+    return os << af_err_to_string(e);
+}
+
+std::ostream &operator<<(std::ostream &os, af::dtype type) {
+    std::string name;
+    switch (type) {
+        case f32: name = "f32"; break;
+        case c32: name = "c32"; break;
+        case f64: name = "f64"; break;
+        case c64: name = "c64"; break;
+        case b8: name = "b8"; break;
+        case s32: name = "s32"; break;
+        case u32: name = "u32"; break;
+        case s8: name = "s8"; break;
+        case u8: name = "u8"; break;
+        case s64: name = "s64"; break;
+        case u64: name = "u64"; break;
+        case s16: name = "s16"; break;
+        case u16: name = "u16"; break;
+        case f16: name = "f16"; break;
+        default: assert(false && "Invalid type");
+    }
+    return os << name;
+}
+
+std::string readNextNonEmptyLine(std::ifstream &file) {
+    std::string result = "";
+    // Read lines until the first non-empty one is found
+    for (std::string line; std::getline(file, line);) {
+        result += line;
+        if (result != "") break;
+    }
+    // If no non-empty line has been found, throw an exception
+    if (result == "") {
+        throw std::runtime_error("Non empty lines not found in the file");
+    }
+    return result;
+}
+
+std::string getBackendName(bool lower) {
+    af::Backend backend = af::getActiveBackend();
+    switch (backend) {
+        case AF_BACKEND_CPU:
+            return lower ? std::string("cpu") : std::string("CPU");
+        case AF_BACKEND_CUDA:
+            return lower ? std::string("cuda") : std::string("CUDA");
+        case AF_BACKEND_OPENCL:
+            return lower ? std::string("opencl") : std::string("OpenCL");
+        case AF_BACKEND_ONEAPI:
+            return lower ? std::string("oneapi") : std::string("oneAPI");
+        default: return lower ? std::string("unknown") : std::string("Unknown");
+    }
+}
+
+std::string getTestName() {
+    std::string testname =
+        ::testing::UnitTest::GetInstance()->current_test_info()->name();
+    return testname;
+}
+
+namespace half_float {
+std::ostream &operator<<(std::ostream &os, half_float::half val) {
+    os << (float)val;
+    return os;
+}
+}  // namespace half_float
+
+// Called by ASSERT_ARRAYS_EQ
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         const af::array &a, const af::array &b,
+                                         float maxAbsDiff) {
+    af::dtype aType = a.type();
+    af::dtype bType = b.type();
+    if (aType != bType)
+        return ::testing::AssertionFailure()
+               << "TYPE MISMATCH: \n"
+               << "  Actual: " << bName << "(" << b.type() << ")\n"
+               << "Expected: " << aName << "(" << a.type() << ")";
+
+    af::dtype arrDtype = aType;
+    if (a.dims() != b.dims())
+        return ::testing::AssertionFailure()
+               << "SIZE MISMATCH: \n"
+               << "  Actual: " << bName << "([" << b.dims() << "])\n"
+               << "Expected: " << aName << "([" << a.dims() << "])";
+
+    switch (arrDtype) {
+        case f32:
+            return elemWiseEq<float>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case c32:
+            return elemWiseEq<af::cfloat>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case f64:
+            return elemWiseEq<double>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case c64:
+            return elemWiseEq<af::cdouble>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case b8: return elemWiseEq<char>(aName, bName, a, b, maxAbsDiff); break;
+        case s32: return elemWiseEq<int>(aName, bName, a, b, maxAbsDiff); break;
+        case u32:
+            return elemWiseEq<uint>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case s8:
+            return elemWiseEq<schar>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case u8:
+            return elemWiseEq<uchar>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case s64:
+            return elemWiseEq<long long>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case u64:
+            return elemWiseEq<unsigned long long>(aName, bName, a, b,
+                                                  maxAbsDiff);
+            break;
+        case s16:
+            return elemWiseEq<short>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case u16:
+            return elemWiseEq<unsigned short>(aName, bName, a, b, maxAbsDiff);
+            break;
+        case f16:
+            return elemWiseEq<half_float::half>(aName, bName, a, b, maxAbsDiff);
+            break;
+        default:
+            return ::testing::AssertionFailure()
+                   << "INVALID TYPE, see enum numbers: " << bName << "("
+                   << b.type() << ") and " << aName << "(" << a.type() << ")";
+    }
+
+    return ::testing::AssertionSuccess();
+}
+
+template<typename T>
+::testing::AssertionResult imageEq(std::string aName, std::string bName,
+                                   const af::array &a, const af::array &b,
+                                   float maxAbsDiff) {
+    std::vector<T> avec(a.elements());
+    a.host(avec.data());
+    std::vector<T> bvec(b.elements());
+    b.host(bvec.data());
+    double NRMSD = computeArraysRMSD(a.elements(), avec.data(), bvec.data());
+
+    if (NRMSD < maxAbsDiff) {
+        return ::testing::AssertionSuccess();
+    } else {
+        std::string test_name =
+            ::testing::UnitTest::GetInstance()->current_test_info()->name();
+
+        std::string valid_path =
+            std::string(TEST_RESULT_IMAGE_DIR) + test_name + "ValidImage.png";
+        std::string result_path =
+            std::string(TEST_RESULT_IMAGE_DIR) + test_name + "ResultImage.png";
+        std::string diff_path =
+            std::string(TEST_RESULT_IMAGE_DIR) + test_name + "DiffImage.png";
+
+        // af::array img = af::join(1, a, b);
+        // af::Window win;
+        // while (!win.close()) { win.image(img); }
+        af::saveImage(valid_path.c_str(), a.as(f32));
+        af::saveImage(result_path.c_str(), b.as(f32));
+        af::saveImage(diff_path.c_str(), abs(a.as(f32) - b.as(f32)));
+
+        std::cout << "<DartMeasurementFile type=\"image/png\" "
+                     "name=\"ValidImage\">"
+                  << valid_path << "</DartMeasurementFile>\n";
+        std::cout
+            << "<DartMeasurementFile type=\"image/png\" name=\"TestImage\">"
+            << result_path << "</DartMeasurementFile>\n";
+
+        std::cout << "<DartMeasurementFile "
+                  << "type=\"image/png\" name=\"DifferenceImage2\">"
+                  << diff_path << "</DartMeasurementFile>\n";
+
+        return ::testing::AssertionFailure()
+               << "RMSD Error(" << NRMSD << ") exceeds threshold(" << maxAbsDiff
+               << "): " << bName << "(" << b.type() << ") and " << aName << "("
+               << a.type() << ")";
+    }
+}
+
+// Called by ASSERT_IMAGES_NEAR
+::testing::AssertionResult assertImageEq(std::string aName, std::string bName,
+                                         const af::array &a, const af::array &b,
+                                         float maxAbsDiff) {
+    af::dtype aType = a.type();
+    af::dtype bType = b.type();
+    if (aType != bType)
+        return ::testing::AssertionFailure()
+               << "TYPE MISMATCH: \n"
+               << "  Actual: " << bName << "(" << b.type() << ")\n"
+               << "Expected: " << aName << "(" << a.type() << ")";
+
+    af::dtype arrDtype = aType;
+    if (a.dims() != b.dims())
+        return ::testing::AssertionFailure()
+               << "SIZE MISMATCH: \n"
+               << "  Actual: " << bName << "([" << b.dims() << "])\n"
+               << "Expected: " << aName << "([" << a.dims() << "])";
+
+    switch (arrDtype) {
+        case s8: return imageEq<signed char>(aName, bName, a, b, maxAbsDiff);
+        case u8: return imageEq<unsigned char>(aName, bName, a, b, maxAbsDiff);
+        case b8: return imageEq<char>(aName, bName, a, b, maxAbsDiff);
+        case s32: return imageEq<int>(aName, bName, a, b, maxAbsDiff);
+        case u32: return imageEq<unsigned int>(aName, bName, a, b, maxAbsDiff);
+        case f32: return imageEq<float>(aName, bName, a, b, maxAbsDiff);
+        case f64: return imageEq<double>(aName, bName, a, b, maxAbsDiff);
+        case s16: return imageEq<short>(aName, bName, a, b, maxAbsDiff);
+        case u16:
+            return imageEq<unsigned short>(aName, bName, a, b, maxAbsDiff);
+        case u64:
+            return imageEq<unsigned long long>(aName, bName, a, b, maxAbsDiff);
+        case s64: return imageEq<long long>(aName, bName, a, b, maxAbsDiff);
+        default: throw(AF_ERR_NOT_SUPPORTED);
+    }
+    return ::testing::AssertionSuccess();
+}
+
+template<>
+float convert(af::half in) {
+    return static_cast<float>(half_float::half(in.data_));
+}
+
+template<>
+af_half convert(int in) {
+    half_float::half h = half_float::half(in);
+    af_half out;
+    memcpy(&out, &h, sizeof(af_half));
+    return out;
+}
+
+template<typename inType, typename outType, typename FileElementType>
+void readTests(const std::string &FileName, std::vector<af::dim4> &inputDims,
+               std::vector<std::vector<inType>> &testInputs,
+               std::vector<std::vector<outType>> &testOutputs) {
+    using std::vector;
+
+    std::ifstream testFile(FileName.c_str());
+    if (testFile.good()) {
+        unsigned inputCount;
+        testFile >> inputCount;
+        inputDims.resize(inputCount);
+        for (unsigned i = 0; i < inputCount; i++) { testFile >> inputDims[i]; }
+
+        unsigned testCount;
+        testFile >> testCount;
+        testOutputs.resize(testCount);
+
+        vector<unsigned> testSizes(testCount);
+        for (unsigned i = 0; i < testCount; i++) { testFile >> testSizes[i]; }
+
+        testInputs.resize(inputCount, vector<inType>(0));
+        for (unsigned k = 0; k < inputCount; k++) {
+            dim_t nElems = inputDims[k].elements();
+            testInputs[k].resize(nElems);
+            FileElementType tmp;
+            for (unsigned i = 0; i < nElems; i++) {
+                testFile >> tmp;
+                testInputs[k][i] = convert<inType, FileElementType>(tmp);
+            }
+        }
+
+        testOutputs.resize(testCount, vector<outType>(0));
+        for (unsigned i = 0; i < testCount; i++) {
+            testOutputs[i].resize(testSizes[i]);
+            FileElementType tmp;
+            for (unsigned j = 0; j < testSizes[i]; j++) {
+                testFile >> tmp;
+                testOutputs[i][j] = convert<outType, FileElementType>(tmp);
+            }
+        }
+    } else {
+        FAIL() << "TEST FILE NOT FOUND";
+    }
+}
+
+#define INSTANTIATE(Tin, Tout, Tfile)                                  \
+    template void readTests<Tin, Tout, Tfile>(                         \
+        const std::string &FileName, std::vector<af::dim4> &inputDims, \
+        std::vector<std::vector<Tin>> &testInputs,                     \
+        std::vector<std::vector<Tout>> &testOutputs)
+
+INSTANTIATE(float, float, int);
+INSTANTIATE(double, float, int);
+INSTANTIATE(int, float, int);
+INSTANTIATE(unsigned int, float, int);
+INSTANTIATE(char, float, int);
+INSTANTIATE(signed char, float, int);
+INSTANTIATE(unsigned char, float, int);
+INSTANTIATE(short, float, int);
+INSTANTIATE(unsigned short, float, int);
+INSTANTIATE(long long, float, int);
+INSTANTIATE(unsigned long long, float, int);
+INSTANTIATE(af_cfloat, af_cfloat, int);
+INSTANTIATE(double, double, int);
+INSTANTIATE(af_cdouble, af_cdouble, int);
+INSTANTIATE(int, int, int);
+INSTANTIATE(unsigned int, unsigned int, int);
+INSTANTIATE(unsigned int, unsigned int, unsigned int);
+INSTANTIATE(long long, long long, int);
+INSTANTIATE(unsigned long long, unsigned long long, int);
+INSTANTIATE(char, char, int);
+INSTANTIATE(signed char, signed char, int);
+INSTANTIATE(unsigned char, unsigned char, int);
+INSTANTIATE(short, short, int);
+INSTANTIATE(unsigned short, unsigned short, int);
+INSTANTIATE(half_float::half, half_float::half, int);
+INSTANTIATE(af_half, af_half, int);
+INSTANTIATE(float, int, int);
+INSTANTIATE(unsigned int, int, int);
+INSTANTIATE(char, int, int);
+INSTANTIATE(signed char, int, int);
+INSTANTIATE(unsigned char, int, int);
+INSTANTIATE(short, int, int);
+INSTANTIATE(unsigned short, int, int);
+
+INSTANTIATE(signed char, unsigned short, int);
+INSTANTIATE(signed char, short, int);
+INSTANTIATE(signed char, unsigned char, int);
+INSTANTIATE(signed char, double, int);
+
+INSTANTIATE(unsigned char, unsigned short, int);
+INSTANTIATE(unsigned char, short, int);
+INSTANTIATE(unsigned char, signed char, int);
+INSTANTIATE(unsigned char, double, int);
+
+INSTANTIATE(long long, unsigned int, unsigned int);
+INSTANTIATE(unsigned long long, unsigned int, unsigned int);
+INSTANTIATE(int, unsigned int, unsigned int);
+INSTANTIATE(short, unsigned int, unsigned int);
+INSTANTIATE(unsigned short, unsigned int, unsigned int);
+INSTANTIATE(char, unsigned int, unsigned int);
+INSTANTIATE(signed char, unsigned int, unsigned int);
+INSTANTIATE(unsigned char, unsigned int, unsigned int);
+INSTANTIATE(float, unsigned int, unsigned int);
+INSTANTIATE(double, unsigned int, unsigned int);
+
+INSTANTIATE(float, unsigned int, int);
+INSTANTIATE(double, unsigned int, int);
+INSTANTIATE(int, unsigned int, int);
+INSTANTIATE(long long, unsigned int, int);
+INSTANTIATE(unsigned long long, unsigned int, int);
+INSTANTIATE(char, unsigned int, int);
+INSTANTIATE(signed char, unsigned int, int);
+INSTANTIATE(unsigned char, unsigned int, int);
+INSTANTIATE(short, unsigned int, int);
+INSTANTIATE(unsigned short, unsigned int, int);
+
+INSTANTIATE(float, char, int);
+INSTANTIATE(double, char, int);
+INSTANTIATE(signed char, char, int);
+INSTANTIATE(unsigned char, char, int);
+INSTANTIATE(short, char, int);
+INSTANTIATE(unsigned short, char, int);
+INSTANTIATE(int, char, int);
+INSTANTIATE(unsigned int, char, int);
+
+INSTANTIATE(char, float, float);
+INSTANTIATE(int, float, float);
+INSTANTIATE(unsigned int, float, float);
+INSTANTIATE(short, float, float);
+INSTANTIATE(signed char, float, float);
+INSTANTIATE(unsigned char, float, float);
+INSTANTIATE(unsigned short, float, float);
+INSTANTIATE(double, float, float);
+INSTANTIATE(af::af_cfloat, float, float);
+INSTANTIATE(af::af_cdouble, float, float);
+INSTANTIATE(long long, float, float);
+INSTANTIATE(long long, double, float);
+INSTANTIATE(unsigned long long, double, float);
+INSTANTIATE(float, float, float);
+INSTANTIATE(af_cfloat, af_cfloat, float);
+INSTANTIATE(af_cfloat, af_cfloat, af_cfloat);
+INSTANTIATE(af_cdouble, af_cdouble, af_cdouble);
+INSTANTIATE(double, double, float);
+INSTANTIATE(double, double, double);
+INSTANTIATE(af_cdouble, af_cdouble, float);
+INSTANTIATE(int, int, float);
+INSTANTIATE(unsigned int, unsigned int, float);
+INSTANTIATE(long long, long long, float);
+INSTANTIATE(unsigned long long, unsigned long long, float);
+INSTANTIATE(char, char, float);
+INSTANTIATE(signed char, signed char, float);
+INSTANTIATE(unsigned char, unsigned char, float);
+INSTANTIATE(short, short, float);
+INSTANTIATE(unsigned short, unsigned short, float);
+INSTANTIATE(half_float::half, half_float::half, float);
+INSTANTIATE(half_float::half, half_float::half, double);
+
+INSTANTIATE(af_cdouble, af_cdouble, double);
+INSTANTIATE(double, af_cdouble, float);
+INSTANTIATE(float, af_cfloat, float);
+INSTANTIATE(half_float::half, uint, uint);
+INSTANTIATE(float, float, double);
+INSTANTIATE(int, float, double);
+INSTANTIATE(unsigned int, float, double);
+INSTANTIATE(short, float, double);
+INSTANTIATE(unsigned short, float, double);
+INSTANTIATE(char, float, double);
+INSTANTIATE(signed char, float, double);
+INSTANTIATE(unsigned char, float, double);
+INSTANTIATE(long long, double, double);
+INSTANTIATE(unsigned long long, double, double);
+INSTANTIATE(af_cfloat, af_cfloat, double);
+INSTANTIATE(half_float::half, float, double);
+
+#undef INSTANTIATE
+
+bool noDoubleTests(af::dtype ty) {
+    bool isTypeDouble      = (ty == f64) || (ty == c64);
+    int dev                = af::getDevice();
+    bool isDoubleSupported = af::isDoubleAvailable(dev);
+
+    return isTypeDouble && !isDoubleSupported;
+}
+
+bool noHalfTests(af::dtype ty) {
+    bool isTypeHalf      = (ty == f16);
+    int dev              = af::getDevice();
+    bool isHalfSupported = af::isHalfAvailable(dev);
+
+    return isTypeHalf && !isHalfSupported;
+}
+
+af_half abs(af_half in) {
+    half_float::half in_;
+    // casting to void* to avoid class-memaccess warnings on windows
+    memcpy(static_cast<void *>(&in_), &in, sizeof(af_half));
+    half_float::half out_ = abs(in_);
+    af_half out;
+    memcpy(&out, &out_, sizeof(af_half));
+    return out;
+}
+
+af_half operator-(af_half lhs, af_half rhs) {
+    half_float::half lhs_;
+    half_float::half rhs_;
+
+    // casting to void* to avoid class-memaccess warnings on windows
+    memcpy(static_cast<void *>(&lhs_), &lhs, sizeof(af_half));
+    memcpy(static_cast<void *>(&rhs_), &rhs, sizeof(af_half));
+    half_float::half out = lhs_ - rhs_;
+    af_half o;
+    memcpy(&o, &out, sizeof(af_half));
+    return o;
+}
+
+const af::cfloat &operator+(const af::cfloat &val) { return val; }
+
+const af::cdouble &operator+(const af::cdouble &val) { return val; }
+
+const af_half &operator+(const af_half &val) { return val; }
+
+// Calculate the linearized index of a set of multi-dimensional coordinates
+dim_t ravelIdx(af::dim4 coords, af::dim4 strides) {
+    return std::inner_product(coords.get(), coords.get() + 4, strides.get(),
+                              0LL);
+}
+
+// Calculate the multi-dimensional coordinates of a linearized index in an
+// af::array, given its dimension sizes and strides
+af::dim4 unravelIdx(dim_t idx, af::dim4 dims, af::dim4 strides) {
+    af::dim4 coords;
+    coords[3] = idx / (strides[3]);
+    coords[2] = idx / (strides[2]) % dims[2];
+    coords[1] = idx / (strides[1]) % dims[1];
+    coords[0] = idx % dims[0];
+
+    return coords;
+}
+
+af::dim4 unravelIdx(dim_t idx, af::array arr) {
+    af::dim4 dims = arr.dims();
+    af::dim4 st   = af::getStrides(arr);
+    return unravelIdx(idx, dims, st);
+}
+
+af::dim4 calcStrides(const af::dim4 &parentDim) {
+    af::dim4 out(1, 1, 1, 1);
+    dim_t *out_dims          = out.get();
+    const dim_t *parent_dims = parentDim.get();
+
+    for (dim_t i = 1; i < 4; i++) {
+        out_dims[i] = out_dims[i - 1] * parent_dims[i - 1];
+    }
+
+    return out;
+}
+
+std::string minimalDim4(af::dim4 coords, af::dim4 dims) {
+    std::ostringstream os;
+    os << "(" << coords[0];
+    if (dims[1] > 1 || dims[2] > 1 || dims[3] > 1) { os << ", " << coords[1]; }
+    if (dims[2] > 1 || dims[3] > 1) { os << ", " << coords[2]; }
+    if (dims[3] > 1) { os << ", " << coords[3]; }
+    os << ")";
+
+    return os.str();
+}
+
+// Generates a random array. testWriteToOutputArray expects that it will
+// receive the same af_array that this generates after the af_* function is
+// called
+void genRegularArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                     const dim_t *const dims, const af_dtype ty) {
+    metadata->init(ndims, dims, ty);
+}
+
+void genRegularArray(TestOutputArrayInfo *metadata, double val,
+                     const unsigned ndims, const dim_t *const dims,
+                     const af_dtype ty) {
+    metadata->init(val, ndims, dims, ty);
+}
+
+// Generates a large, random array, and extracts a subarray for the af_*
+// function to use. testWriteToOutputArray expects that the large array that
+// it receives is equal to the same large array with the gold array injected
+// on the same subarray location
+void genSubArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                 const dim_t *const dims, const af_dtype ty) {
+    const dim_t pad_size = 2;
+
+    // The large array is padded on both sides of each dimension.
+    // Padding is only applied to the dimensions in use, i.e. the first ndims
+    dim_t full_arr_dims[4] = {dims[0], dims[1], dims[2], dims[3]};
+    for (uint i = 0; i < ndims; ++i) {
+        full_arr_dims[i] = dims[i] + 2 * pad_size;
+    }
+
+    // Calculate the indices of the sub-array. These are also used by
+    // testWriteToOutputArray so that the gold sub-array is placed in the
+    // same location. Currently, this location is the center of the large
+    // array
+    af_seq subarr_idxs[4] = {af_span, af_span, af_span, af_span};
+    for (uint i = 0; i < ndims; ++i) {
+        af_seq idx     = {pad_size, pad_size + dims[i] - 1.0, 1.0};
+        subarr_idxs[i] = idx;
+    }
+
+    metadata->init(ndims, full_arr_dims, ty, &subarr_idxs[0]);
+}
+
+void genSubArray(TestOutputArrayInfo *metadata, double val,
+                 const unsigned ndims, const dim_t *const dims,
+                 const af_dtype ty) {
+    const dim_t pad_size = 2;
+
+    // The large array is padded on both sides of each dimension.
+    // Padding is only applied to the dimensions in use, i.e. the first ndims
+    dim_t full_arr_dims[4] = {dims[0], dims[1], dims[2], dims[3]};
+    for (uint i = 0; i < ndims; ++i) {
+        full_arr_dims[i] = dims[i] + 2 * pad_size;
+    }
+
+    // Calculate the indices of the sub-array. These are also used by
+    // testWriteToOutputArray so that the gold sub-array is placed in the
+    // same location. Currently, this location is the center of the large
+    // array
+    af_seq subarr_idxs[4] = {af_span, af_span, af_span, af_span};
+    for (uint i = 0; i < ndims; ++i) {
+        af_seq idx     = {pad_size, pad_size + dims[i] - 1.0, 1.0};
+        subarr_idxs[i] = idx;
+    }
+
+    metadata->init(val, ndims, full_arr_dims, ty, &subarr_idxs[0]);
+}
+
+// Generates a reordered array. testWriteToOutputArray expects that this
+// array will still have the correct output values from the af_* function,
+// even though the array was initially reordered.
+void genReorderedArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                       const dim_t *const dims, const af_dtype ty) {
+    // The rest of this function assumes that dims has 4 elements. Just in
+    // case dims has < 4 elements, use another dims array that is filled
+    // with 1s
+    dim_t all_dims[4] = {1, 1, 1, 1};
+    for (uint i = 0; i < ndims; ++i) { all_dims[i] = dims[i]; }
+
+    // This reorder combination will not move data around, but will simply
+    // call modDims and modStrides (see src/api/c/reorder.cpp).
+    // The output will be checked if it is still correct even with the
+    // modified dims and strides "hack" with no data movement
+    uint reorder_idxs[4] = {0, 2, 1, 3};
+
+    // Shape the output array such that the reordered output array will have
+    // the correct dimensions that the test asks for (i.e. must match dims
+    // arg)
+    dim_t init_dims[4] = {all_dims[0], all_dims[1], all_dims[2], all_dims[3]};
+    for (uint i = 0; i < 4; ++i) { init_dims[i] = all_dims[reorder_idxs[i]]; }
+    metadata->init(4, init_dims, ty);
+
+    af_array reordered = 0;
+    ASSERT_SUCCESS(af_reorder(&reordered, metadata->getOutput(),
+                              reorder_idxs[0], reorder_idxs[1], reorder_idxs[2],
+                              reorder_idxs[3]));
+    metadata->setOutput(reordered);
+}
+
+void genReorderedArray(TestOutputArrayInfo *metadata, double val,
+                       const unsigned ndims, const dim_t *const dims,
+                       const af_dtype ty) {
+    // The rest of this function assumes that dims has 4 elements. Just in
+    // case dims has < 4 elements, use another dims array that is filled
+    // with 1s
+    dim_t all_dims[4] = {1, 1, 1, 1};
+    for (uint i = 0; i < ndims; ++i) { all_dims[i] = dims[i]; }
+
+    // This reorder combination will not move data around, but will simply
+    // call modDims and modStrides (see src/api/c/reorder.cpp).
+    // The output will be checked if it is still correct even with the
+    // modified dims and strides "hack" with no data movement
+    uint reorder_idxs[4] = {0, 2, 1, 3};
+
+    // Shape the output array such that the reordered output array will have
+    // the correct dimensions that the test asks for (i.e. must match dims
+    // arg)
+    dim_t init_dims[4] = {all_dims[0], all_dims[1], all_dims[2], all_dims[3]};
+    for (uint i = 0; i < 4; ++i) { init_dims[i] = all_dims[reorder_idxs[i]]; }
+    metadata->init(val, 4, init_dims, ty);
+
+    af_array reordered = 0;
+    ASSERT_SUCCESS(af_reorder(&reordered, metadata->getOutput(),
+                              reorder_idxs[0], reorder_idxs[1], reorder_idxs[2],
+                              reorder_idxs[3]));
+    metadata->setOutput(reordered);
+}
+
+// Partner function of testWriteToOutputArray. This generates the "special"
+// array that testWriteToOutputArray will use to check if the af_* function
+// correctly uses an existing array as its output
+void genTestOutputArray(af_array *out_ptr, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype ty,
+                        TestOutputArrayInfo *metadata) {
+    switch (metadata->getOutputArrayType()) {
+        case FULL_ARRAY: genRegularArray(metadata, ndims, dims, ty); break;
+        case SUB_ARRAY: genSubArray(metadata, ndims, dims, ty); break;
+        case REORDERED_ARRAY:
+            genReorderedArray(metadata, ndims, dims, ty);
+            break;
+        default: break;
+    }
+    *out_ptr = metadata->getOutput();
+}
+
+void genTestOutputArray(af_array *out_ptr, double val, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype ty,
+                        TestOutputArrayInfo *metadata) {
+    switch (metadata->getOutputArrayType()) {
+        case FULL_ARRAY: genRegularArray(metadata, val, ndims, dims, ty); break;
+        case SUB_ARRAY: genSubArray(metadata, val, ndims, dims, ty); break;
+        case REORDERED_ARRAY:
+            genReorderedArray(metadata, val, ndims, dims, ty);
+            break;
+        default: break;
+    }
+    *out_ptr = metadata->getOutput();
+}
+
+// Partner function of genTestOutputArray. This uses the same "special"
+// array that genTestOutputArray generates, and checks whether the
+// af_* function wrote to that array correctly
+::testing::AssertionResult testWriteToOutputArray(
+    std::string gold_name, std::string result_name, const af_array gold,
+    const af_array out, TestOutputArrayInfo *metadata) {
+    // In the case of NULL_ARRAY, the output array starts out as null.
+    // After the af_* function is called, it shouldn't be null anymore
+    if (metadata->getOutputArrayType() == NULL_ARRAY) {
+        if (out == 0) {
+            return ::testing::AssertionFailure()
+                   << "Output af_array " << result_name << " is null";
+        }
+        metadata->setOutput(out);
+    }
+    // For every other case, must check if the af_array generated by
+    // genTestOutputArray was used by the af_* function as its output array
+    else {
+        if (metadata->getOutput() != out) {
+            return ::testing::AssertionFailure()
+                   << "af_array POINTER MISMATCH:\n"
+                   << "  Actual: " << out << "\n"
+                   << "Expected: " << metadata->getOutput();
+        }
+    }
+
+    if (metadata->getOutputArrayType() == SUB_ARRAY) {
+        // There are two full arrays. One will be injected with the gold
+        // subarray, the other should have already been injected with the
+        // af_* function's output. Then we compare the two full arrays
+        af_array gold_full_array = metadata->getFullOutputCopy();
+        af_assign_seq(&gold_full_array, gold_full_array,
+                      metadata->getSubArrayNumDims(),
+                      metadata->getSubArrayIdxs(), gold);
+
+        return assertArrayEq(gold_name, result_name,
+                             metadata->getFullOutputCopy(),
+                             metadata->getFullOutput());
+    } else {
+        return assertArrayEq(gold_name, result_name, gold, out);
+    }
+}
+
+// Called by ASSERT_SPECIAL_ARRAYS_EQ
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         std::string metadataName,
+                                         const af_array a, const af_array b,
+                                         TestOutputArrayInfo *metadata) {
+    UNUSED(metadataName);
+    return testWriteToOutputArray(aName, bName, a, b, metadata);
+}
+
+// To support C API
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         const af_array a, const af_array b) {
+    af_array aa = 0, bb = 0;
+    af_retain_array(&aa, a);
+    af_retain_array(&bb, b);
+    af::array aaa(aa);
+    af::array bbb(bb);
+    return assertArrayEq(aName, bName, aaa, bbb, 0.0f);
+}
+
+// Called by ASSERT_ARRAYS_NEAR
+::testing::AssertionResult assertArrayNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af::array &a,
+                                           const af::array &b,
+                                           float maxAbsDiff) {
+    UNUSED(maxAbsDiffName);
+    return assertArrayEq(aName, bName, a, b, maxAbsDiff);
+}
+
+// Called by ASSERT_IMAGES_NEAR
+::testing::AssertionResult assertImageNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af_array &a, const af_array &b,
+                                           float maxAbsDiff) {
+    UNUSED(maxAbsDiffName);
+    af_array aa = 0, bb = 0;
+    af_retain_array(&aa, a);
+    af_retain_array(&bb, b);
+    af::array aaa(aa);
+    af::array bbb(bb);
+    return assertImageEq(aName, bName, aaa, bbb, maxAbsDiff);
+}
+
+// Called by ASSERT_IMAGES_NEAR
+::testing::AssertionResult assertImageNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af::array &a,
+                                           const af::array &b,
+                                           float maxAbsDiff) {
+    UNUSED(maxAbsDiffName);
+    return assertImageEq(aName, bName, a, b, maxAbsDiff);
+}
+
+// To support C API
+::testing::AssertionResult assertArrayNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af_array a, const af_array b,
+                                           float maxAbsDiff) {
+    af_array aa = 0, bb = 0;
+    af_retain_array(&aa, a);
+    af_retain_array(&bb, b);
+    af::array aaa(aa);
+    af::array bbb(bb);
+    return assertArrayNear(aName, bName, maxAbsDiffName, aaa, bbb, maxAbsDiff);
+}
+
+void cleanSlate() {
+    const size_t step_bytes = 1024;
+
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    af::deviceGC();
+
+    af::deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(0u, alloc_buffers);
+    ASSERT_EQ(0u, lock_buffers);
+    ASSERT_EQ(0u, alloc_bytes);
+    ASSERT_EQ(0u, lock_bytes);
+
+    af::setMemStepSize(step_bytes);
+
+    ASSERT_EQ(af::getMemStepSize(), step_bytes);
+}
+
+template<typename inType, typename outType>
+void readTestsFromFile(const std::string &FileName,
+                       std::vector<af::dim4> &inputDims,
+                       std::vector<std::vector<inType>> &testInputs,
+                       std::vector<std::vector<outType>> &testOutputs) {
+    using std::vector;
+
+    std::ifstream testFile(FileName.c_str());
+    if (testFile.good()) {
+        unsigned inputCount;
+        testFile >> inputCount;
+        for (unsigned i = 0; i < inputCount; i++) {
+            af::dim4 temp(1);
+            testFile >> temp;
+            inputDims.push_back(temp);
+        }
+
+        unsigned testCount;
+        testFile >> testCount;
+        testOutputs.resize(testCount);
+
+        vector<unsigned> testSizes(testCount);
+        for (unsigned i = 0; i < testCount; i++) { testFile >> testSizes[i]; }
+
+        testInputs.resize(inputCount, vector<inType>(0));
+        for (unsigned k = 0; k < inputCount; k++) {
+            dim_t nElems = inputDims[k].elements();
+            testInputs[k].resize(nElems);
+            inType tmp;
+            for (unsigned i = 0; i < nElems; i++) {
+                testFile >> tmp;
+                testInputs[k][i] = tmp;
+            }
+        }
+
+        testOutputs.resize(testCount, vector<outType>(0));
+        for (unsigned i = 0; i < testCount; i++) {
+            testOutputs[i].resize(testSizes[i]);
+            outType tmp;
+            for (unsigned j = 0; j < testSizes[i]; j++) {
+                testFile >> tmp;
+                testOutputs[i][j] = tmp;
+            }
+        }
+    } else {
+        FAIL() << "TEST FILE NOT FOUND";
+    }
+}
+
+#define INSTANTIATE(Ti, To)                                            \
+    template void readTestsFromFile<Ti, To>(                           \
+        const std::string &FileName, std::vector<af::dim4> &inputDims, \
+        std::vector<std::vector<Ti>> &testInputs,                      \
+        std::vector<std::vector<To>> &testOutputs)
+
+INSTANTIATE(float, float);
+INSTANTIATE(float, af_cfloat);
+INSTANTIATE(af_cfloat, af_cfloat);
+INSTANTIATE(double, double);
+INSTANTIATE(double, af_cdouble);
+INSTANTIATE(af_cdouble, af_cdouble);
+INSTANTIATE(int, float);
+
+#undef INSTANTIATE
+
+template<typename outType>
+void readImageTests(const std::string &pFileName,
+                    std::vector<af::dim4> &pInputDims,
+                    std::vector<std::string> &pTestInputs,
+                    std::vector<std::vector<outType>> &pTestOutputs) {
+    using std::vector;
+
+    std::ifstream testFile(pFileName.c_str());
+    if (testFile.good()) {
+        unsigned inputCount;
+        testFile >> inputCount;
+        for (unsigned i = 0; i < inputCount; i++) {
+            af::dim4 temp(1);
+            testFile >> temp;
+            pInputDims.push_back(temp);
+        }
+
+        unsigned testCount;
+        testFile >> testCount;
+        pTestOutputs.resize(testCount);
+
+        vector<unsigned> testSizes(testCount);
+        for (unsigned i = 0; i < testCount; i++) { testFile >> testSizes[i]; }
+
+        pTestInputs.resize(inputCount, "");
+        for (unsigned k = 0; k < inputCount; k++) {
+            pTestInputs[k] = readNextNonEmptyLine(testFile);
+        }
+
+        pTestOutputs.resize(testCount, vector<outType>(0));
+        for (unsigned i = 0; i < testCount; i++) {
+            pTestOutputs[i].resize(testSizes[i]);
+            outType tmp;
+            for (unsigned j = 0; j < testSizes[i]; j++) {
+                testFile >> tmp;
+                pTestOutputs[i][j] = tmp;
+            }
+        }
+    } else {
+        FAIL() << "TEST FILE NOT FOUND";
+    }
+}
+
+#define INSTANTIATE(To)                                                  \
+    template void readImageTests<To>(                                    \
+        const std::string &pFileName, std::vector<af::dim4> &pInputDims, \
+        std::vector<std::string> &pTestInputs,                           \
+        std::vector<std::vector<To>> &pTestOutputs)
+
+INSTANTIATE(float);
+#undef INSTANTIATE
+
+void readImageTests(const std::string &pFileName,
+                    std::vector<af::dim4> &pInputDims,
+                    std::vector<std::string> &pTestInputs,
+                    std::vector<dim_t> &pTestOutSizes,
+                    std::vector<std::string> &pTestOutputs) {
+    using std::vector;
+
+    std::ifstream testFile(pFileName.c_str());
+    if (testFile.good()) {
+        unsigned inputCount;
+        testFile >> inputCount;
+        for (unsigned i = 0; i < inputCount; i++) {
+            af::dim4 temp(1);
+            testFile >> temp;
+            pInputDims.push_back(temp);
+        }
+
+        unsigned testCount;
+        testFile >> testCount;
+        pTestOutputs.resize(testCount);
+
+        pTestOutSizes.resize(testCount);
+        for (unsigned i = 0; i < testCount; i++) {
+            testFile >> pTestOutSizes[i];
+        }
+
+        pTestInputs.resize(inputCount, "");
+        for (unsigned k = 0; k < inputCount; k++) {
+            pTestInputs[k] = readNextNonEmptyLine(testFile);
+        }
+
+        pTestOutputs.resize(testCount, "");
+        for (unsigned i = 0; i < testCount; i++) {
+            pTestOutputs[i] = readNextNonEmptyLine(testFile);
+        }
+    } else {
+        FAIL() << "TEST FILE NOT FOUND";
+    }
+}
+
+template<typename descType>
+void readImageFeaturesDescriptors(
+    const std::string &pFileName, std::vector<af::dim4> &pInputDims,
+    std::vector<std::string> &pTestInputs,
+    std::vector<std::vector<float>> &pTestFeats,
+    std::vector<std::vector<descType>> &pTestDescs) {
+    using std::vector;
+
+    std::ifstream testFile(pFileName.c_str());
+    if (testFile.good()) {
+        unsigned inputCount;
+        testFile >> inputCount;
+        for (unsigned i = 0; i < inputCount; i++) {
+            af::dim4 temp(1);
+            testFile >> temp;
+            pInputDims.push_back(temp);
+        }
+
+        unsigned attrCount, featCount, descLen;
+        testFile >> featCount;
+        testFile >> attrCount;
+        testFile >> descLen;
+        pTestFeats.resize(attrCount);
+
+        pTestInputs.resize(inputCount, "");
+        for (unsigned k = 0; k < inputCount; k++) {
+            pTestInputs[k] = readNextNonEmptyLine(testFile);
+        }
+
+        pTestFeats.resize(attrCount, vector<float>(0));
+        for (unsigned i = 0; i < attrCount; i++) {
+            pTestFeats[i].resize(featCount);
+            float tmp;
+            for (unsigned j = 0; j < featCount; j++) {
+                testFile >> tmp;
+                pTestFeats[i][j] = tmp;
+            }
+        }
+
+        pTestDescs.resize(featCount, vector<descType>(0));
+        for (unsigned i = 0; i < featCount; i++) {
+            pTestDescs[i].resize(descLen);
+            descType tmp;
+            for (unsigned j = 0; j < descLen; j++) {
+                testFile >> tmp;
+                pTestDescs[i][j] = tmp;
+            }
+        }
+    } else {
+        FAIL() << "TEST FILE NOT FOUND";
+    }
+}
+
+#define INSTANTIATE(TYPE)                                                \
+    template void readImageFeaturesDescriptors<TYPE>(                    \
+        const std::string &pFileName, std::vector<af::dim4> &pInputDims, \
+        std::vector<std::string> &pTestInputs,                           \
+        std::vector<std::vector<float>> &pTestFeats,                     \
+        std::vector<std::vector<TYPE>> &pTestDescs)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(unsigned int);
+#undef INSTANTIATE
+
+template<typename T>
+double computeArraysRMSD(dim_t data_size, T *gold, T *data) {
+    double accum  = 0.0;
+    double maxion = -FLT_MAX;  //(double)std::numeric_limits<T>::lowest();
+    double minion = FLT_MAX;   //(double)std::numeric_limits<T>::max();
+
+    for (dim_t i = 0; i < data_size; i++) {
+        double dTemp = (double)data[i];
+        double gTemp = (double)gold[i];
+        double diff  = gTemp - dTemp;
+        if (diff > 1.e-4) {
+            // printf("%d: diff: %f %f %f\n", i, diff, data[i], gold[i]);
+        }
+        double err =
+            (std::isfinite(diff) && (std::abs(diff) > 1.0e-4)) ? diff : 0.0f;
+        accum += std::pow(err, 2.0);
+        maxion = std::max(maxion, dTemp);
+        minion = std::min(minion, dTemp);
+    }
+    accum /= data_size;
+    double NRMSD = std::sqrt(accum) / (maxion - minion);
+
+    return NRMSD;
+}
+
+template<>
+double computeArraysRMSD<unsigned char>(dim_t data_size, unsigned char *gold,
+                                        unsigned char *data) {
+    double accum = 0.0;
+    int maxion   = 0;    //(double)std::numeric_limits<T>::lowest();
+    int minion   = 255;  //(double)std::numeric_limits<T>::max();
+
+    for (dim_t i = 0; i < data_size; i++) {
+        int dTemp  = data[i];
+        int gTemp  = gold[i];
+        int diff   = abs(gTemp - dTemp);
+        double err = (diff > 1) ? diff : 0.0f;
+        accum += std::pow(err, 2.0);
+        maxion = std::max(maxion, dTemp);
+        minion = std::min(minion, dTemp);
+    }
+    accum /= data_size;
+    double NRMSD = std::sqrt(accum) / (maxion - minion);
+
+    return NRMSD;
+}
+
+template<typename T>
+bool compareArraysRMSD(dim_t data_size, T *gold, T *data, double tolerance) {
+    double accum  = 0.0;
+    double maxion = -FLT_MAX;  //(double)std::numeric_limits<T>::lowest();
+    double minion = FLT_MAX;   //(double)std::numeric_limits<T>::max();
+
+    for (dim_t i = 0; i < data_size; i++) {
+        double dTemp = (double)data[i];
+        double gTemp = (double)gold[i];
+        double diff  = gTemp - dTemp;
+        double err =
+            (std::isfinite(diff) && (std::abs(diff) > 1.0e-4)) ? diff : 0.0f;
+        accum += std::pow(err, 2.0);
+        maxion = std::max(maxion, dTemp);
+        minion = std::min(minion, dTemp);
+    }
+    accum /= data_size;
+    double NRMSD = std::sqrt(accum) / (maxion - minion);
+
+    if (std::isnan(NRMSD) || NRMSD > tolerance) {
+#ifndef NDEBUG
+        printf("Comparison failed, NRMSD value: %lf\n", NRMSD);
+#endif
+        return false;
+    }
+
+    return true;
+}
+
+#define INSTANTIATE(TYPE)                                                 \
+    template double computeArraysRMSD<TYPE>(dim_t data_size, TYPE * gold, \
+                                            TYPE * data);                 \
+    template bool compareArraysRMSD<TYPE>(dim_t data_size, TYPE * gold,   \
+                                          TYPE * data, double tolerance)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(char);
+#undef INSTANTIATE
+
+TestOutputArrayInfo::TestOutputArrayInfo()
+    : out_arr(0)
+    , out_arr_cpy(0)
+    , out_subarr(0)
+    , out_subarr_ndims(0)
+    , out_arr_type(NULL_ARRAY) {
+    for (uint i = 0; i < 4; ++i) { out_subarr_idxs[i] = af_span; }
+}
+
+TestOutputArrayInfo::TestOutputArrayInfo(TestOutputArrayType arr_type)
+    : out_arr(0)
+    , out_arr_cpy(0)
+    , out_subarr(0)
+    , out_subarr_ndims(0)
+    , out_arr_type(arr_type) {
+    for (uint i = 0; i < 4; ++i) { out_subarr_idxs[i] = af_span; }
+}
+
+TestOutputArrayInfo::~TestOutputArrayInfo() {
+    if (out_subarr) af_release_array(out_subarr);
+    if (out_arr_cpy) af_release_array(out_arr_cpy);
+    if (out_arr) af_release_array(out_arr);
+}
+
+void TestOutputArrayInfo::init(const unsigned ndims, const dim_t *const dims,
+                               const af_dtype ty) {
+    ASSERT_SUCCESS(af_randu(&out_arr, ndims, dims, ty));
+}
+
+void TestOutputArrayInfo::init(const unsigned ndims, const dim_t *const dims,
+                               const af_dtype ty,
+                               const af_seq *const subarr_idxs) {
+    init(ndims, dims, ty);
+
+    ASSERT_SUCCESS(af_copy_array(&out_arr_cpy, out_arr));
+    for (uint i = 0; i < ndims; ++i) { out_subarr_idxs[i] = subarr_idxs[i]; }
+    out_subarr_ndims = ndims;
+
+    ASSERT_SUCCESS(af_index(&out_subarr, out_arr, ndims, subarr_idxs));
+}
+
+void TestOutputArrayInfo::init(double val, const unsigned ndims,
+                               const dim_t *const dims, const af_dtype ty) {
+    switch (ty) {
+        case c32:
+        case c64:
+            af_constant_complex(&out_arr, val, 0.0, ndims, dims, ty);
+            break;
+        case s64:
+            af_constant_long(&out_arr, static_cast<intl>(val), ndims, dims);
+            break;
+        case u64:
+            af_constant_ulong(&out_arr, static_cast<uintl>(val), ndims, dims);
+            break;
+        default: af_constant(&out_arr, val, ndims, dims, ty); break;
+    }
+}
+
+void TestOutputArrayInfo::init(double val, const unsigned ndims,
+                               const dim_t *const dims, const af_dtype ty,
+                               const af_seq *const subarr_idxs) {
+    init(val, ndims, dims, ty);
+
+    ASSERT_SUCCESS(af_copy_array(&out_arr_cpy, out_arr));
+    for (uint i = 0; i < ndims; ++i) { out_subarr_idxs[i] = subarr_idxs[i]; }
+    out_subarr_ndims = ndims;
+
+    ASSERT_SUCCESS(af_index(&out_subarr, out_arr, ndims, subarr_idxs));
+}
+
+af_array TestOutputArrayInfo::getOutput() {
+    if (out_arr_type == SUB_ARRAY) {
+        return out_subarr;
+    } else {
+        return out_arr;
+    }
+}
+
+void TestOutputArrayInfo::setOutput(af_array array) {
+    if (out_arr != 0) { ASSERT_SUCCESS(af_release_array(out_arr)); }
+    out_arr = array;
+}
+
+af_array TestOutputArrayInfo::getFullOutput() { return out_arr; }
+af_array TestOutputArrayInfo::getFullOutputCopy() { return out_arr_cpy; }
+af_seq *TestOutputArrayInfo::getSubArrayIdxs() { return &out_subarr_idxs[0]; }
+dim_t TestOutputArrayInfo::getSubArrayNumDims() { return out_subarr_ndims; }
+TestOutputArrayType TestOutputArrayInfo::getOutputArrayType() {
+    return out_arr_type;
+}
+
+#if defined(USE_MTX)
+::testing::AssertionResult mtxReadSparseMatrix(af::array &out,
+                                               const char *fileName) {
+    FILE *fileHandle;
+
+    if ((fileHandle = fopen(fileName, "r")) == NULL) {
+        return ::testing::AssertionFailure()
+               << "Failed to open mtx file: " << fileName << "\n";
+    }
+
+    MM_typecode matcode;
+    // Every early return below closes the file so the handle is not leaked
+    if (mm_read_banner(fileHandle, &matcode)) {
+        fclose(fileHandle);
+        return ::testing::AssertionFailure()
+               << "Could not process Matrix Market banner.\n";
+    }
+
+    if (!(mm_is_matrix(matcode) && mm_is_sparse(matcode))) {
+        fclose(fileHandle);
+        return ::testing::AssertionFailure()
+               << "Input mtx file does not contain a sparse matrix.\n";
+    }
+
+    if (mm_is_integer(matcode)) {
+        fclose(fileHandle);
+        return ::testing::AssertionFailure()
+               << "MTX file has integer data. Integer sparse matrices are "
+                  "not supported in ArrayFire yet.\n";
+    }
+
+    int M = 0, N = 0, nz = 0;
+    if (mm_read_mtx_crd_size(fileHandle, &M, &N, &nz)) {
+        fclose(fileHandle);
+        return ::testing::AssertionFailure()
+               << "Failed to read matrix dimensions.\n";
+    }
+
+    if (mm_is_real(matcode)) {
+        std::vector<int> I(nz);
+        std::vector<int> J(nz);
+        std::vector<float> V(nz);
+
+        for (int i = 0; i < nz; ++i) {
+            int c, r;
+            double v;
+            int readCount = fscanf(fileHandle, "%d %d %lg\n", &r, &c, &v);
+            if (readCount != 3) {
+                fclose(fileHandle);
+                return ::testing::AssertionFailure()
+                       << "\nEnd of file reached while more data was "
+                       << "expected. Possible reasons:\n"
+                       << "\t - the template type doesn't match the data "
+                          "type\n"
+                       << "\t - the mtx file itself doesn't have enough "
+                          "data\n";
+            }
+            I[i] = r - 1;
+            J[i] = c - 1;
+            V[i] = (float)v;
+        }
+
+        out = af::sparse(M, N, nz, V.data(), I.data(), J.data(), f32,
+                         AF_STORAGE_COO);
+    } else if (mm_is_complex(matcode)) {
+        std::vector<int> I(nz);
+        std::vector<int> J(nz);
+        std::vector<af::cfloat> V(nz);
+
+        for (int i = 0; i < nz; ++i) {
+            int c, r;
+            double real, imag;
+            int readCount =
+                fscanf(fileHandle, "%d %d %lg %lg\n", &r, &c, &real, &imag);
+            if (readCount != 4) {
+                fclose(fileHandle);
+                return ::testing::AssertionFailure()
+                       << "\nEnd of file reached while more data was "
+                       << "expected. Possible reasons:\n"
+                       << "\t - the template type doesn't match the data "
+                          "type\n"
+                       << "\t - the mtx file itself doesn't have enough "
+                          "data\n";
+            }
+            I[i] = r - 1;
+            J[i] = c - 1;
+            V[i] = af::cfloat(float(real), float(imag));
+        }
+
+        out = af::sparse(M, N, nz, V.data(), I.data(), J.data(), c32,
+                         AF_STORAGE_COO);
+    } else {
+        fclose(fileHandle);
+        return ::testing::AssertionFailure()
+               << "Unknown matcode from MTX FILE\n";
+    }
+
+    fclose(fileHandle);
+    return ::testing::AssertionSuccess();
+}
+#endif  // USE_MTX
+
+// TODO: perform conversion on device for CUDA and OpenCL
+template<typename T>
+af_err conv_image(af_array *out, af_array in) {
+    af_array outArray;
+
+    dim_t d0, d1, d2, d3;
+    af_get_dims(&d0, &d1, &d2, &d3, in);
+    af::dim4 idims(d0, d1, d2, d3);
+
+    dim_t nElems = 0;
+    af_get_elements(&nElems, in);
+
+    // The data is read back into a float buffer, so `in` is expected to be
+    // an f32 array
+    float *in_data = new float[nElems];
+    af_get_data_ptr(in_data, in);
+
+    T *out_data = new T[nElems];
+
+    af_dtype out_type = (af_dtype)af::dtype_traits<T>::af_type;
+    for (int i = 0; i < (int)nElems; i++) {
+        if (out_type == s8) {
+            // Shift down by 128 so that values up to 255 still fit in s8
+            out_data[i] = (T)(std::trunc(in_data[i]) - 128.f);
+        } else {
+            out_data[i] = (T)in_data[i];
+        }
+    }
+
+    af_create_array(&outArray, out_data, idims.ndims(), idims.get(), out_type);
+
+    std::swap(*out, outArray);
+
+    delete[] in_data;
+    delete[] out_data;
+
+    return AF_SUCCESS;
+}
+
+#define INSTANTIATE(To) \
+    template af_err conv_image<To>(af_array * out, af_array in)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(signed char);
+INSTANTIATE(unsigned char);
+INSTANTIATE(half_float::half);
+INSTANTIATE(unsigned int);
+INSTANTIATE(unsigned short);
+INSTANTIATE(int);
+INSTANTIATE(char);
+INSTANTIATE(short);
+INSTANTIATE(af_cdouble);
+INSTANTIATE(af_cfloat);
+INSTANTIATE(long long);
+INSTANTIATE(unsigned long long);
+#undef INSTANTIATE
+
+template<typename T>
+af::array cpu_randu(const af::dim4 dims) {
+    typedef typename af::dtype_traits<T>::base_type BT;
+
+    bool isTypeCplx = is_same_type<T, af::cfloat>::value ||
+                      is_same_type<T, af::cdouble>::value;
+    bool isTypeFloat = is_same_type<BT, float>::value ||
+                       is_same_type<BT, double>::value ||
+                       is_same_type<BT, half_float::half>::value;
+
+    size_t elements = (isTypeCplx ? 2 : 1) * dims.elements();
+
+    std::vector<BT> out(elements);
+    for (size_t i = 0; i < elements; i++) {
+        out[i] = isTypeFloat ? (BT)(rand()) / static_cast<double>(RAND_MAX)
+                             : rand() % 100;
+    }
+
+    return af::array(dims, (T *)&out[0]);
+}
+
+#define INSTANTIATE(To) template af::array cpu_randu<To>(const af::dim4 dims)
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(signed char);
+INSTANTIATE(unsigned char);
+INSTANTIATE(half_float::half);
+INSTANTIATE(unsigned int);
+INSTANTIATE(unsigned short);
+INSTANTIATE(int);
+INSTANTIATE(char);
+INSTANTIATE(short);
+INSTANTIATE(af_cdouble);
+INSTANTIATE(af_cfloat);
+INSTANTIATE(long long);
+INSTANTIATE(unsigned long long);
+#undef INSTANTIATE
+
+template<typename T>
+struct sparseCooValue {
+    int row = 0;
+    int col = 0;
+    T value = 0;
+    sparseCooValue(int r, int c, T v) : row(r), col(c), value(v) {}
+};
+
+template<typename T>
+void swap(sparseCooValue<T> &lhs, sparseCooValue<T> &rhs) {
+    std::swap(lhs.row, rhs.row);
+    std::swap(lhs.col, rhs.col);
+    std::swap(lhs.value, rhs.value);
+}
+
+template<typename T>
+bool operator<(const sparseCooValue<T> &lhs, const sparseCooValue<T> &rhs) {
+    if (lhs.row < rhs.row) {
+        return true;
+    } else if (lhs.row == rhs.row && lhs.col < rhs.col) {
+        return true;
+    } else {
+        return false;
+    }
+}
+
+template<typename T>
+std::ostream &operator<<(std::ostream &os, const sparseCooValue<T> &val) {
+    os << "(" << val.row << ", " << val.col << "): " << val.value;
+    return os;
+}
+
+template<typename T>
+bool isZero(const sparseCooValue<T> &val) {
+    return val.value == 0.;
+}
+
+template<typename T>
+vector<sparseCooValue<T>> toCooVector(const af::array &arr) {
+    vector<sparseCooValue<T>> out;
+    if (arr.issparse()) {
+        switch (sparseGetStorage(arr)) {
+            case AF_STORAGE_COO: {
+                dim_t nnz = sparseGetNNZ(arr);
+                vector<int> row(nnz), col(nnz);
+                vector<T> values(nnz);
+                sparseGetValues(arr).host(values.data());
+                sparseGetRowIdx(arr).host(row.data());
+                sparseGetColIdx(arr).host(col.data());
+                out.reserve(nnz);
+                for (int i = 0; i < nnz; i++) {
+                    out.emplace_back(row[i], col[i], values[i]);
+                }
+            } break;
+            case AF_STORAGE_CSR: {
+                dim_t nnz = sparseGetNNZ(arr);
+                vector<int> row(arr.dims(0) + 1), col(nnz);
+                vector<T> values(nnz);
+                sparseGetValues(arr).host(values.data());
+                sparseGetRowIdx(arr).host(row.data());
+                sparseGetColIdx(arr).host(col.data());
+                out.reserve(nnz);
+                for (int i = 0; i < row.size() - 1; i++) {
+                    for (int r = row[i]; r < row[i + 1]; r++) {
+                        out.emplace_back(i, col[r], values[r]);
+                    }
+                }
+            } break;
+            case AF_STORAGE_CSC: {
+                dim_t nnz = sparseGetNNZ(arr);
+                vector<int> row(nnz), col(arr.dims(1) + 1);
+                vector<T> values(nnz);
+                sparseGetValues(arr).host(values.data());
+                sparseGetRowIdx(arr).host(row.data());
+                sparseGetColIdx(arr).host(col.data());
+                out.reserve(nnz);
+                for (int i = 0; i < col.size() - 1; i++) {
+                    for (int c = col[i]; c < col[i + 1]; c++) {
+                        out.emplace_back(row[c], i, values[c]);
+                    }
+                }
+            } break;
+            default: throw std::logic_error("NOT SUPPORTED");
+        }
+    } else {
+        vector<T> values(arr.elements());
+        arr.host(values.data());
+        int M = arr.dims(0), N = arr.dims(1);
+        // Emit every element of the dense array; zero entries are stripped
+        // by the remove_if below
+        for (int j = 0; j < N; j++) {
+            for (int i = 0; i < M; i++) {
+                out.emplace_back(i, j, values[j * M + i]);
+            }
+        }
+    }
+
+    // Remove zero elements from result to ensure that only non-zero
+    // elements are compared
+    out.erase(std::remove_if(out.begin(), out.end(), isZero<T>), out.end());
+    std::sort(begin(out), end(out));
+    return out;
+}
+
+template<typename T>
+bool operator==(const sparseCooValue<T> &lhs, const sparseCooValue<T> &rhs) {
+    return lhs.row == rhs.row && lhs.col == rhs.col &&
+           cmp(lhs.value, rhs.value);
+}
+
+template<typename T>
+std::string printContext(const std::vector<T> &hGold, std::string goldName,
+                         const std::vector<T> &hOut, std::string outName,
+                         af::dim4 arrDims, af::dim4 arrStrides, dim_t idx) {
+    std::ostringstream os;
+
+    af::dim4 coords = unravelIdx(idx, arrDims, arrStrides);
+    dim_t ctxWidth  = 5;
+
+    // Coordinates that span dim0
+    af::dim4 coordsMinBound = coords;
+    coordsMinBound[0]       = 0;
+    af::dim4 coordsMaxBound = coords;
+    coordsMaxBound[0]       = arrDims[0] - 1;
+
+    // dim0 positions that can be displayed
+    dim_t dim0Start = std::max<dim_t>(0LL, coords[0] - ctxWidth);
+    dim_t dim0End   = std::min<dim_t>(coords[0] + ctxWidth + 1LL, arrDims[0]);
+
+    // Linearized indices of values in vectors that can be displayed
+    dim_t vecStartIdx =
+        std::max<dim_t>(ravelIdx(coordsMinBound, arrStrides), idx - ctxWidth);
+
+    // Display only as many coordinates as needed
+    // First value is the range of dim0 positions that will be displayed
+    os << "Viewing slice (" << dim0Start << ":" << dim0End - 1;
+    if (arrDims[1] > 1 || arrDims[2] > 1 || arrDims[3] > 1)
+        os << ", " << coords[1];
+    if (arrDims[2] > 1 || arrDims[3] > 1) os << ", " << coords[2];
+    if (arrDims[3] > 1) os << ", " << coords[3];
+    os << "), dims are (" << arrDims << ") strides: (" << arrStrides << ")\n";
+
+    dim_t ctxElems = dim0End - dim0Start;
+    std::vector<int> valFieldWidths(ctxElems);
+    std::vector<std::string> ctxDim0(ctxElems);
+    std::vector<std::string> ctxOutVals(ctxElems);
+    std::vector<std::string> ctxGoldVals(ctxElems);
+
+    // Get dim0 positions and out/reference values for the context window
+    //
+    // Also get the max string length between the position and out/ref
+    // values per item so that it can be used later as the field width for
+    // displaying each item in the context window
+    for (dim_t i = 0; i < ctxElems; ++i) {
+        std::ostringstream tmpOs;
+
+        dim_t dim0 = dim0Start + i;
+        if (dim0 == coords[0])
+            tmpOs << "[" << dim0 << "]";
+        else
+            tmpOs << dim0;
+        ctxDim0[i]     = tmpOs.str();
+        size_t dim0Len = tmpOs.str().length();
+        tmpOs.str(std::string());
+
+        dim_t valIdx = vecStartIdx + i;
+
+        if (valIdx == idx) {
+            tmpOs << "[" << +hOut[valIdx] << "]";
+        } else {
+            tmpOs << +hOut[valIdx];
+        }
+        ctxOutVals[i] = tmpOs.str();
+        size_t outLen = tmpOs.str().length();
+        tmpOs.str(std::string());
+
+        if (valIdx == idx) {
+            tmpOs << "[" << +hGold[valIdx] << "]";
+        } else {
+            tmpOs << +hGold[valIdx];
+        }
+        ctxGoldVals[i] = tmpOs.str();
+        size_t goldLen = tmpOs.str().length();
+        tmpOs.str(std::string());
+
+        int maxWidth      = std::max<int>(dim0Len, outLen);
+        maxWidth          = std::max<int>(maxWidth, goldLen);
+        valFieldWidths[i] = maxWidth;
+    }
+
+    size_t varNameWidth = std::max<size_t>(goldName.length(), outName.length());
+
+    // Display dim0 positions, output values, and reference values
+    os << std::right << std::setw(varNameWidth) << ""
+       << "   ";
+    for (uint i = 0; i < (dim0End - dim0Start); ++i) {
+        os << std::setw(valFieldWidths[i] + 1) << std::right << ctxDim0[i];
+    }
+    os << "\n";
+
+    os << std::right << std::setw(varNameWidth) << outName << ": {";
+    for (uint i = 0; i < (dim0End - dim0Start); ++i) {
+        os << std::setw(valFieldWidths[i] + 1) << std::right << ctxOutVals[i];
+    }
+    os << " }\n";
+
+    os << std::right << std::setw(varNameWidth) << goldName << ": {";
+    for (uint i = 0; i < (dim0End - dim0Start); ++i) {
+        os << std::setw(valFieldWidths[i] + 1) << std::right << ctxGoldVals[i];
+    }
+    os << " }";
+
+    return os.str();
+}
+
+template<typename T>
+std::string printContext(const std::vector<sparseCooValue<T>> &hGold,
+                         std::string goldName,
+                         const std::vector<sparseCooValue<T>> &hOut,
+                         std::string outName, af::dim4 arrDims,
+                         af::dim4 arrStrides, dim_t idx) {
+    std::ostringstream os;
+
+    af::dim4 coords = unravelIdx(idx, arrDims, arrStrides);
+    dim_t ctxWidth  = 5;
+
+    // Coordinates that span dim0
+    af::dim4 coordsMinBound = coords;
+    coordsMinBound[0]       = 0;
+    af::dim4 coordsMaxBound = coords;
+    coordsMaxBound[0]       = arrDims[0] - 1;
+
+    // dim0 positions that can be displayed
+    dim_t dim0Start = std::max<dim_t>(0LL, idx - ctxWidth);
+    dim_t dim0End   = std::min<dim_t>(idx + ctxWidth + 1LL, hGold.size());
+
+    int setwval = 9;
+    // Linearized indices of values in vectors that can be displayed
+    dim_t vecStartIdx =
+        std::max<dim_t>(ravelIdx(coordsMinBound, arrStrides), idx - ctxWidth);
+    os << "Idx: ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << elem << "]";
+        } else {
+            os << std::setw(setwval) << elem;
+        }
+    }
+    os << "\nRow: ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hGold[elem].row << "]";
+        } else {
+            os << std::setw(setwval) << hGold[elem].row;
+        }
+    }
+    os << "\n     ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hOut[elem].row << "]";
+        } else {
+            os << std::setw(setwval) << hOut[elem].row;
+        }
+    }
+    os << "\nCol: ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hGold[elem].col << "]";
+        } else {
+            os << std::setw(setwval) << hGold[elem].col;
+        }
+    }
+    os << "\n     ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hOut[elem].col << "]";
+        } else {
+            os << std::setw(setwval) << hOut[elem].col;
+        }
+    }
+
+    os << "\nValue: ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hGold[elem].value << "]";
+        } else {
+            os << std::setw(setwval) << hGold[elem].value;
+        }
+    }
+    os << "\n       ";
+    for (int elem = dim0Start; elem < dim0End; elem++) {
+        if (elem == idx) {
+            os << std::setw(setwval - 2) << "[" << hOut[elem].value << "]";
+        } else {
+            os << std::setw(setwval) << hOut[elem].value;
+        }
+    }
+
+    return os.str();
+}
+
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<T> &a, af::dim4 aDims,
+                                      const std::vector<T> &b, af::dim4 bDims,
+                                      float maxAbsDiff, IntegerTag) {
+    UNUSED(maxAbsDiff);
+    typedef typename std::vector<T>::const_iterator iter;
+
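+    std::pair<iter, iter> mismatches =
+        std::mismatch(a.begin(), a.end(), b.begin());
+    iter aItr = mismatches.first;
+    iter bItr = mismatches.second;
+
+    // std::mismatch stops at a.end(), so test the first iterator, matching
+    // the FloatTag overload
+    if (aItr == a.end()) {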
+        return ::testing::AssertionSuccess();
+    } else {
+        dim_t idx         = std::distance(b.begin(), bItr);
+        af::dim4 aStrides = calcStrides(aDims);
+        af::dim4 bStrides = calcStrides(bDims);
+        af::dim4 coords   = unravelIdx(idx, bDims, bStrides);
+
+        return ::testing::AssertionFailure()
+               << "VALUE DIFFERS at " << minimalDim4(coords, aDims) << ":\n"
+               << printContext(a, aName, b, bName, aDims, aStrides, idx);
+    }
+}
+
+struct absMatch {
+    float diff_;
+    absMatch(float diff) : diff_(diff) {}
+
+    template<typename T>
+    bool operator()(const T &lhs, const T &rhs) const {
+        if (diff_ > 0) {
+            using half_float::abs;
+            using std::abs;
+            return abs(rhs - lhs) <= diff_;
+        } else {
+            return boost::math::epsilon_difference(lhs, rhs) < T(1.f);
+        }
+    }
+};
+
+template<>
+bool absMatch::operator()<af::af_cfloat>(const af::af_cfloat &lhs,
+                                         const af::af_cfloat &rhs) const {
+    return af::abs(rhs - lhs) <= diff_;
+}
+
+template<>
+bool absMatch::operator()<af::af_cdouble>(const af::af_cdouble &lhs,
+                                          const af::af_cdouble &rhs) const {
+    return af::abs(rhs - lhs) <= diff_;
+}
+
+template<>
+bool absMatch::operator()<std::complex<float>>(
+    const std::complex<float> &lhs, const std::complex<float> &rhs) const {
+    return std::abs(rhs - lhs) <= diff_;
+}
+
+template<>
+bool absMatch::operator()<std::complex<double>>(
+    const std::complex<double> &lhs, const std::complex<double> &rhs) const {
+    return std::abs(rhs - lhs) <= diff_;
+}
+
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<T> &a, af::dim4 aDims,
+                                      const std::vector<T> &b, af::dim4 bDims,
+                                      float maxAbsDiff, FloatTag) {
+    typedef typename std::vector<T>::const_iterator iter;
+    // TODO(mark): Modify equality for float
+    std::pair<iter, iter> mismatches =
+        std::mismatch(a.begin(), a.end(), b.begin(), absMatch(maxAbsDiff));
+
+    iter aItr = mismatches.first;
+    iter bItr = mismatches.second;
+
+    if (aItr == a.end()) {
+        return ::testing::AssertionSuccess();
+    } else {
+        dim_t idx       = std::distance(b.begin(), bItr);
+        af::dim4 coords = unravelIdx(idx, bDims, calcStrides(bDims));
+
+        af::dim4 aStrides = calcStrides(aDims);
+
+        ::testing::AssertionResult result =
+            ::testing::AssertionFailure()
+            << "VALUE DIFFERS at " << minimalDim4(coords, aDims) << ":\n"
+            << printContext(a, aName, b, bName, aDims, aStrides, idx);
+
+        if (maxAbsDiff > 0) {
+            using af::abs;
+            using std::abs;
+            double absdiff = abs(*aItr - *bItr);
+            result << "\n  Actual diff: " << absdiff << "\n"
+                   << "Expected diff: " << maxAbsDiff;
+        }
+
+        return result;
+    }
+}
+
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<sparseCooValue<T>> &a,
+                                      af::dim4 aDims,
+                                      const std::vector<sparseCooValue<T>> &b,
+                                      af::dim4 bDims, float maxAbsDiff,
+                                      IntegerTag) {
+    return ::testing::AssertionFailure() << "Unsupported sparse type\n";
+}
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<sparseCooValue<T>> &a,
+                                      af::dim4 aDims,
+                                      const std::vector<sparseCooValue<T>> &b,
+                                      af::dim4 bDims, float maxAbsDiff,
+                                      FloatTag) {
+    typedef typename std::vector<sparseCooValue<T>>::const_iterator iter;
+    // TODO(mark): Modify equality for float
+
+    const absMatch diff(maxAbsDiff);
+    std::pair<iter, iter> mismatches = std::mismatch(
+        a.begin(), a.end(), b.begin(),
+        [&diff](const sparseCooValue<T> &lhs, const sparseCooValue<T> &rhs) {
+            return lhs.row == rhs.row && lhs.col == rhs.col &&
+                   diff(lhs.value, rhs.value);
+        });
+
+    iter aItr = mismatches.first;
+    iter bItr = mismatches.second;
+
+    if (aItr == a.end()) {
+        return ::testing::AssertionSuccess();
+    } else {
+        dim_t idx       = std::distance(b.begin(), bItr);
+        af::dim4 coords = unravelIdx(idx, bDims, calcStrides(bDims));
+
+        af::dim4 aStrides = calcStrides(aDims);
+
+        ::testing::AssertionResult result =
+            ::testing::AssertionFailure()
+            << "VALUE DIFFERS at " << idx << ":\n"
+            << printContext(a, aName, b, bName, aDims, aStrides, idx);
+
+        return result;
+    }
+}
+
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const af::array &a, const af::array &b,
+                                      float maxAbsDiff) {
+    typedef typename cond_type<
+        IsFloatingPoint<typename af::dtype_traits<T>::base_type>::value,
+        FloatTag, IntegerTag>::type TagType;
+    TagType tag;
+
+    if (a.issparse() || b.issparse()) {
+        vector<sparseCooValue<T>> hA = toCooVector<T>(a);
+        vector<sparseCooValue<T>> hB = toCooVector<T>(b);
+
+        return elemWiseEq<T>(aName, bName, hA, a.dims(), hB, b.dims(),
+                             maxAbsDiff, tag);
+    } else {
+        std::vector<T> hA(static_cast<size_t>(a.elements()));
+        a.host(hA.data());
+
+        std::vector<T> hB(static_cast<size_t>(b.elements()));
+        b.host(hB.data());
+        return elemWiseEq<T>(aName, bName, hA, a.dims(), hB, b.dims(),
+                             maxAbsDiff, tag);
+    }
+}
+
+template<typename T>
+::testing::AssertionResult assertArrayEq(std::string aName,
+                                         std::string aDimsName,
+                                         std::string bName,
+                                         const std::vector<T> &hA,
+                                         af::dim4 aDims, const af::array &b,
+                                         float maxAbsDiff) {
+    af::dtype aDtype = (af::dtype)af::dtype_traits<T>::af_type;
+    if (aDtype != b.type()) {
+        return ::testing::AssertionFailure()
+               << "TYPE MISMATCH:\n"
+               << "  Actual: " << bName << "(" << b.type() << ")\n"
+               << "Expected: " << aName << "(" << aDtype << ")";
+    }
+
+    if (aDims != b.dims()) {
+        return ::testing::AssertionFailure()
+               << "SIZE MISMATCH:\n"
+               << "  Actual: " << bName << "([" << b.dims() << "])\n"
+               << "Expected: " << aDimsName << "([" << aDims << "])";
+    }
+
+    // Guard against the host vector size not matching aDims.elements()
+    if (hA.size() != static_cast<size_t>(aDims.elements()))
+        return ::testing::AssertionFailure()
+               << "SIZE MISMATCH:\n"
+               << "  Actual: " << aDimsName << "([" << aDims << "] => "
+               << aDims.elements() << ")\n"
+               << "Expected: " << aName << ".size()(" << hA.size() << ")";
+
+    typedef typename cond_type<
+        IsFloatingPoint<typename af::dtype_traits<T>::base_type>::value,
+        FloatTag, IntegerTag>::type TagType;
+    TagType tag;
+
+    std::vector<T> hB(b.elements());
+    b.host(&hB.front());
+    return elemWiseEq<T>(aName, bName, hA, aDims, hB, b.dims(), maxAbsDiff,
+                         tag);
+}
+
+// To support C API
+template<typename T>
+::testing::AssertionResult assertArrayEq(std::string hA_name,
+                                         std::string aDimsName,
+                                         std::string bName,
+                                         const std::vector<T> &hA,
+                                         af::dim4 aDims, const af_array b) {
+    af_array bb = 0;
+    af_retain_array(&bb, b);
+    af::array bbb(bb);
+    return assertArrayEq(hA_name, aDimsName, bName, hA, aDims, bbb);
+}
+
+// Called by ASSERT_VEC_ARRAY_NEAR
+template<typename T>
+::testing::AssertionResult assertArrayNear(
+    std::string hA_name, std::string aDimsName, std::string bName,
+    std::string maxAbsDiffName, const std::vector<T> &hA, af::dim4 aDims,
+    const af::array &b, float maxAbsDiff) {
+    UNUSED(maxAbsDiffName);
+    return assertArrayEq(hA_name, aDimsName, bName, hA, aDims, b, maxAbsDiff);
+}
+
+// To support C API
+template<typename T>
+::testing::AssertionResult assertArrayNear(
+    std::string hA_name, std::string aDimsName, std::string bName,
+    std::string maxAbsDiffName, const std::vector<T> &hA, af::dim4 aDims,
+    const af_array b, float maxAbsDiff) {
+    af_array bb = 0;
+    af_retain_array(&bb, b);
+    af::array bbb(bb);
+    return assertArrayNear(hA_name, aDimsName, bName, maxAbsDiffName, hA, aDims,
+                           bbb, maxAbsDiff);
+}
+
+::testing::AssertionResult assertRefEq(std::string hA_name,
+                                       std::string expected_name,
+                                       const af::array &a, int expected) {
+    int count = 0;
+    af_get_data_ref_count(&count, a.get());
+    if (count != expected) {
+        std::stringstream ss;
+        ss << "Incorrect reference count:\nExpected: " << expected << "\n"
+           << std::setw(8) << hA_name << ": " << count;
+
+        return ::testing::AssertionFailure() << ss.str();
+
+    } else {
+        return ::testing::AssertionSuccess();
+    }
+}
+
+#define INSTANTIATE(To)                                                        \
+    template std::string printContext(                                         \
+        const std::vector<To> &hGold, std::string goldName,                    \
+        const std::vector<To> &hOut, std::string outName, af::dim4 arrDims,    \
+        af::dim4 arrStrides, dim_t idx);                                       \
+    template ::testing::AssertionResult assertArrayEq<To>(                     \
+        std::string aName, std::string aDimsName, std::string bName,           \
+        const std::vector<To> &hA, af::dim4 aDims, const af::array &b,         \
+        float maxAbsDiff);                                                     \
+    template ::testing::AssertionResult assertArrayEq<To>(                     \
+        std::string hA_name, std::string aDimsName, std::string bName,         \
+        const std::vector<To> &hA, af::dim4 aDims, const af_array b);          \
+    template ::testing::AssertionResult assertArrayNear<To>(                   \
+        std::string hA_name, std::string aDimsName, std::string bName,         \
+        std::string maxAbsDiffName, const std::vector<To> &hA, af::dim4 aDims, \
+        const af_array b, float maxAbsDiff);                                   \
+    template ::testing::AssertionResult assertArrayNear<To>(                   \
+        std::string hA_name, std::string aDimsName, std::string bName,         \
+        std::string maxAbsDiffName, const std::vector<To> &hA, af::dim4 aDims, \
+        const af::array &b, float maxAbsDiff)
+
+INSTANTIATE(float);
+INSTANTIATE(double);
+INSTANTIATE(signed char);
+INSTANTIATE(unsigned char);
+INSTANTIATE(half_float::half);
+INSTANTIATE(unsigned int);
+INSTANTIATE(unsigned short);
+INSTANTIATE(int);
+INSTANTIATE(char);
+INSTANTIATE(short);
+INSTANTIATE(af_cdouble);
+INSTANTIATE(af_cfloat);
+INSTANTIATE(long long);
+INSTANTIATE(unsigned long long);
+INSTANTIATE(std::complex<float>);
+INSTANTIATE(std::complex<double>);
+#undef INSTANTIATE
+
+af::array toTempFormat(tempFormat form, const af::array &in) {
+    af::array ret;
+    const af::dim4 &dims = in.dims();
+    switch (form) {
+        case JIT_FORMAT:
+            switch (in.type()) {
+                case b8: ret = !(in); break;
+                default: ret = in * 2;
+            }
+            // Make sure that the base array differs from the original
+            ret.eval();
+            switch (in.type()) {
+                case b8: ret = !(ret); break;
+                default: ret /= 2;
+            }
+            break;
+        case SUB_FORMAT_dim0: {
+            af::dim4 pdims(dims);
+            pdims[0] *= 2;
+            af::array parent  = af::randu(pdims, in.type());
+            const af::seq dim = af::seq(dims[0]) + static_cast<double>(dims[0]);
+            parent(dim, af::span, af::span, af::span) = in;
+            ret = parent(dim, af::span, af::span, af::span);
+        }; break;
+        case SUB_FORMAT_dim1: {
+            af::dim4 pdims(dims);
+            pdims[1] *= 2;
+            const af::seq dim = af::seq(dims[1]) + static_cast<double>(dims[1]);
+            af::array parent  = af::randu(pdims, in.type());
+            parent(af::span, dim, af::span, af::span) = in;
+            ret = parent(af::span, dim, af::span, af::span);
+        }; break;
+        case SUB_FORMAT_dim2: {
+            af::dim4 pdims(dims);
+            pdims[2] *= 2;
+            const af::seq dim = af::seq(dims[2]) + static_cast<double>(dims[2]);
+            af::array parent  = af::randu(pdims, in.type());
+            parent(af::span, af::span, dim, af::span) = in;
+            ret = parent(af::span, af::span, dim, af::span);
+        }; break;
+        case SUB_FORMAT_dim3: {
+            af::dim4 pdims(dims);
+            pdims[3] *= 2;
+            const af::seq dim = af::seq(dims[3]) + static_cast<double>(dims[3]);
+            af::array parent  = af::randu(pdims, in.type());
+            parent(af::span, af::span, af::span, dim) = in;
+            ret = parent(af::span, af::span, af::span, dim);
+        }; break;
+        case REORDERED_FORMAT: {
+            const dim_t idxs[4] = {0, 3, 1, 2};
+            // idxs[0] has to be 0, to keep the same data in mem
+            dim_t rev_idxs[4];
+            for (dim_t i = 0; i < 4; ++i) { rev_idxs[idxs[i]] = i; };
+            ret = af::reorder(in, idxs[0], idxs[1], idxs[2], idxs[3]);
+            ret = ret.copy();  // make data linear
+            ret = af::reorder(ret, rev_idxs[0], rev_idxs[1], rev_idxs[2],
+                              rev_idxs[3]);
+            // ret has same content as in, although data is stored in
+            // different order
+        }; break;
+        case LINEAR_FORMAT:
+        default: ret = in.copy();
+    };
+    return ret;
+}
+
+void toTempFormat(tempFormat form, af_array *out, const af_array &in) {
+    dim_t dims[4];
+    af_get_dims(dims, dims + 1, dims + 2, dims + 3, in);
+    unsigned numdims;
+    af_get_numdims(&numdims, in);
+    af_dtype ty;
+    af_get_type(&ty, in);
+    switch (form) {
+        case JIT_FORMAT: {
+            // af_array one = nullptr, min_one = nullptr, res = nullptr;
+            af_array res = nullptr, two = nullptr;
+            ASSERT_SUCCESS(af_constant(&two, 2, numdims, dims, ty));
+            switch (ty) {
+                case b8: ASSERT_SUCCESS(af_not(&res, in)); break;
+                default:
+                    // res = in * 2 (undone by the division below)
+                    ASSERT_SUCCESS(af_mul(&res, in, two, false));
+            }
+            // Make sure that the base array differs from the original
+            ASSERT_SUCCESS(af_eval(res));
+            switch (ty) {
+                case b8: ASSERT_SUCCESS(af_not(out, res)); break;
+                default:
+                    ASSERT_SUCCESS(af_div(out, res, two, false));  // NO EVAL!!
+            }
+            ASSERT_SUCCESS(af_release_array(two));
+            two = nullptr;
+            ASSERT_SUCCESS(af_release_array(res));
+            res = nullptr;
+        }; break;
+        case SUB_FORMAT_dim0: {
+            const dim_t pdims[4] = {dims[0] * 2, dims[1], dims[2], dims[3]};
+            af_array parent      = nullptr;
+            ASSERT_SUCCESS(af_randu(&parent, 4, pdims, ty));
+            const af_seq idxs[4] = {af_make_seq(dims[0], 2. * dims[0] - 1., 1.),
+                                    af_span, af_span, af_span};
+            ASSERT_SUCCESS(af_assign_seq(out, parent, numdims, idxs, in));
+            ASSERT_SUCCESS(af_index(out, parent, numdims, idxs));
+            ASSERT_SUCCESS(af_release_array(parent));
+            parent = nullptr;
+        }; break;
+        case SUB_FORMAT_dim1: {
+            const dim_t pdims[4] = {dims[0], dims[1] * 2, dims[2], dims[3]};
+            af_array parent      = nullptr;
+            ASSERT_SUCCESS(af_randu(&parent, 4, pdims, ty));
+            const af_seq idxs[4] = {af_span,
+                                    af_make_seq(dims[1], 2. * dims[1] - 1., 1.),
+                                    af_span, af_span};
+            ASSERT_SUCCESS(af_assign_seq(out, parent, numdims, idxs, in));
+            ASSERT_SUCCESS(af_index(out, parent, numdims, idxs));
+            ASSERT_SUCCESS(af_release_array(parent));
+            parent = nullptr;
+        }; break;
+        case SUB_FORMAT_dim2: {
+            const dim_t pdims[4] = {dims[0], dims[1], dims[2] * 2, dims[3]};
+            af_array parent      = nullptr;
+            ASSERT_SUCCESS(af_randu(&parent, 4, pdims, ty));
+            const af_seq idxs[4] = {af_span, af_span,
+                                    af_make_seq(dims[2], 2. * dims[2] - 1., 1.),
+                                    af_span};
+            ASSERT_SUCCESS(af_assign_seq(out, parent, numdims, idxs, in));
+            ASSERT_SUCCESS(af_index(out, parent, numdims, idxs));
+            ASSERT_SUCCESS(af_release_array(parent));
+            parent = nullptr;
+        }; break;
+        case SUB_FORMAT_dim3: {
+            const dim_t pdims[4] = {dims[0], dims[1], dims[2], dims[3] * 2};
+            af_array parent      = nullptr;
+            ASSERT_SUCCESS(af_randu(&parent, 4, pdims, ty));
+            const af_seq idxs[4] = {
+                af_span, af_span, af_span,
+                af_make_seq(dims[3], 2. * dims[3] - 1., 1.)};
+            ASSERT_SUCCESS(af_assign_seq(out, parent, numdims, idxs, in));
+            ASSERT_SUCCESS(af_index(out, parent, numdims, idxs));
+            ASSERT_SUCCESS(af_release_array(parent));
+            parent = nullptr;
+        }; break;
+        case REORDERED_FORMAT: {
+            const unsigned idxs[4] = {0, 3, 1, 2};
+            // idxs[0] has to be 0, to keep the same data in mem
+            dim_t rev_idxs[4];
+            for (dim_t i = 0; i < 4; ++i) { rev_idxs[idxs[i]] = i; };
+            af_array rev = nullptr, lin = nullptr;
+            ASSERT_SUCCESS(
+                af_reorder(&rev, in, idxs[0], idxs[1], idxs[2], idxs[3]));
+            ASSERT_SUCCESS(af_copy_array(&lin, rev));  // make data linear
+            ASSERT_SUCCESS(af_reorder(out, lin, rev_idxs[0], rev_idxs[1],
+                                      rev_idxs[2], rev_idxs[3]));
+            // out has same content as in, although data is stored in
+            // different order
+            ASSERT_SUCCESS(af_release_array(rev));
+            ASSERT_SUCCESS(af_release_array(lin));
+        }; break;
+        case LINEAR_FORMAT:
+        default: ASSERT_SUCCESS(af_copy_array(out, in));
+    };
+}
+
+int main(int argc, char **argv) {
+    ::testing::InitGoogleTest(&argc, argv);
+    return RUN_ALL_TESTS();
+}
diff --git a/test/arrayio.cpp b/test/arrayio.cpp
new file mode 100644
index 0000000000..ea15165ac4
--- /dev/null
+++ b/test/arrayio.cpp
@@ -0,0 +1,132 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+
+#include <testHelpers.hpp>
+
+#include <complex>
+#include <string>
+#include <vector>
+
+using af::allTrue;
+using af::array;
+using af::constant;
+using af::dim4;
+using af::readArray;
+using af::saveArray;
+using std::complex;
+using std::string;
+using std::vector;
+
+struct type_params {
+    string name;
+    af_dtype type;
+    double real;
+    double imag;
+    type_params(string n, af_dtype t, double r, double i = 0.)
+        : name(n), type(t), real(r), imag(i) {}
+};
+
+class ArrayIOType : public ::testing::TestWithParam<type_params> {};
+
+string getTypeName(
+    const ::testing::TestParamInfo<ArrayIOType::ParamType> info) {
+    return info.param.name;
+}
+
+INSTANTIATE_TEST_SUITE_P(
+    Types, ArrayIOType,
+    ::testing::Values(type_params("f32", f32, 3.14f, 0),
+                      type_params("f64", f64, 3.14, 0),
+                      type_params("c32", c32, 3.0f, 4.5f),
+                      type_params("c64", c64, 3.0, 4.5),
+                      type_params("s32", s32, 11), type_params("u32", u32, 12),
+                      type_params("u8", u8, 13), type_params("b8", b8, 1),
+                      type_params("s64", s64, 15), type_params("u64", u64, 16),
+                      type_params("s16", s16, 17), type_params("u16", u16, 18),
+                      type_params("s8", s8, 19)),
+    getTypeName);
+
+TEST_P(ArrayIOType, ReadType) {
+    type_params p = GetParam();
+    if (noDoubleTests(p.type)) GTEST_SKIP() << "No double support.";
+    array arr =
+        readArray((string(TEST_DIR) + "/arrayio/" + p.name + ".arr").c_str(),
+                  p.name.c_str());
+
+    ASSERT_EQ(arr.type(), p.type);
+}
+
+TEST_P(ArrayIOType, ReadSize) {
+    type_params p = GetParam();
+    if (noDoubleTests(p.type)) GTEST_SKIP() << "No double support.";
+    array arr =
+        readArray((string(TEST_DIR) + "/arrayio/" + p.name + ".arr").c_str(),
+                  p.name.c_str());
+
+    ASSERT_EQ(arr.dims(), dim4(10, 10));
+}
+
+template<typename T>
+void checkVals(array arr, double r, double i, af_dtype t) {
+    vector<T> d(arr.elements());
+    arr.host(d.data());
+    int elements = arr.elements();
+    for (int ii = 0; ii < elements; ii++) {
+        if (t == c32 || t == c64) {
+            ASSERT_EQ(r, real<T>(d[ii])) << "at: " << ii;
+            ASSERT_EQ(i, imag<T>(d[ii])) << "at: " << ii;
+        } else {
+            ASSERT_EQ(real(r), real(d[ii])) << "at: " << ii;
+        }
+    }
+}
+
+TEST_P(ArrayIOType, ReadContent) {
+    type_params p = GetParam();
+    if (noDoubleTests(p.type)) GTEST_SKIP() << "No double support.";
+    array arr =
+        readArray((string(TEST_DIR) + "/arrayio/" + p.name + ".arr").c_str(),
+                  p.name.c_str());
+
+    switch (arr.type()) {
+        case f32: checkVals<float>(arr, p.real, p.imag, p.type); break;
+        case f64: checkVals<double>(arr, p.real, p.imag, p.type); break;
+        case c32: checkVals<af::cfloat>(arr, p.real, p.imag, p.type); break;
+        case c64: checkVals<af::cdouble>(arr, p.real, p.imag, p.type); break;
+        case s32: checkVals<int>(arr, p.real, p.imag, p.type); break;
+        case u32: checkVals<unsigned>(arr, p.real, p.imag, p.type); break;
+        case s8: checkVals<signed char>(arr, p.real, p.imag, p.type); break;
+        case u8: checkVals<unsigned char>(arr, p.real, p.imag, p.type); break;
+        case b8: checkVals<char>(arr, p.real, p.imag, p.type); break;
+        case s64: checkVals<long long>(arr, p.real, p.imag, p.type); break;
+        case u64:
+            checkVals<unsigned long long>(arr, p.real, p.imag, p.type);
+            break;
+        case s16: checkVals<short>(arr, p.real, p.imag, p.type); break;
+        case u16: checkVals<unsigned short>(arr, p.real, p.imag, p.type); break;
+        default: FAIL() << "Invalid type";
+    }
+}
+
+TEST(ArrayIO, Save) {
+    array a = constant(1, 10, 10);
+    array b = constant(2, 10, 10);
+
+    saveArray("a", a, "arr.af");
+    saveArray("b", b, "arr.af", true);
+
+    array aread = readArray("arr.af", "a");
+    array bread = readArray("arr.af", "b");
+
+    ASSERT_ARRAYS_EQ(a, aread);
+    ASSERT_ARRAYS_EQ(b, bread);
+}
diff --git a/test/assign.cpp b/test/assign.cpp
index 37bbac536d..7b94bfa608 100644
--- a/test/assign.cpp
+++ b/test/assign.cpp
@@ -7,151 +7,166 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+
+#include <half.hpp>
+
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype_traits;
+using af::end;
+using af::exception;
+using af::randu;
+using af::seq;
+using af::span;
+using std::cout;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class ArrayAssign : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat1D.push_back(af_make_seq(5,20,1));
-
-            subMat2D.push_back(af_make_seq(1,2,1));
-            subMat2D.push_back(af_make_seq(1,2,1));
-
-            subMat3D.push_back(af_make_seq(3,4,1));
-            subMat3D.push_back(af_make_seq(0,1,1));
-            subMat3D.push_back(af_make_seq(1,2,1));
-
-            subMat4D.push_back(af_make_seq(3,4,1));
-            subMat4D.push_back(af_make_seq(0,1,1));
-            subMat4D.push_back(af_make_seq(0,1,1));
-            subMat4D.push_back(af_make_seq(1,2,1));
-
-            subMat1D_to_2D.push_back(af_make_seq(1,2,1));
-            subMat1D_to_2D.push_back(af_make_seq(1,1,1));
-
-            subMat1D_to_3D.push_back(af_make_seq(5,20,1));
-            subMat1D_to_3D.push_back(af_make_seq(1,1,1));
-            subMat1D_to_3D.push_back(af_make_seq(2,2,1));
-
-            subMat2D_to_3D.push_back(af_make_seq(3,4,1));
-            subMat2D_to_3D.push_back(af_make_seq(0,1,1));
-            subMat2D_to_3D.push_back(af_make_seq(1,1,1));
-
-            subMat1D_to_4D.push_back(af_make_seq(3,4,1));
-            subMat1D_to_4D.push_back(af_make_seq(0,0,1));
-            subMat1D_to_4D.push_back(af_make_seq(0,0,1));
-            subMat1D_to_4D.push_back(af_make_seq(1,1,1));
-
-            subMat2D_to_4D.push_back(af_make_seq(3,4,1));
-            subMat2D_to_4D.push_back(af_make_seq(0,1,1));
-            subMat2D_to_4D.push_back(af_make_seq(0,0,1));
-            subMat2D_to_4D.push_back(af_make_seq(1,1,1));
-
-            subMat3D_to_4D.push_back(af_make_seq(3,4,1));
-            subMat3D_to_4D.push_back(af_make_seq(0,1,1));
-            subMat3D_to_4D.push_back(af_make_seq(0,1,1));
-            subMat3D_to_4D.push_back(af_make_seq(1,1,1));
-        }
-        vector<af_seq> subMat1D;
+class ArrayAssign : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat1D.push_back(af_make_seq(5, 20, 1));
+
+        subMat2D.push_back(af_make_seq(1, 2, 1));
+        subMat2D.push_back(af_make_seq(1, 2, 1));
+
+        subMat3D.push_back(af_make_seq(3, 4, 1));
+        subMat3D.push_back(af_make_seq(0, 1, 1));
+        subMat3D.push_back(af_make_seq(1, 2, 1));
+
+        subMat4D.push_back(af_make_seq(3, 4, 1));
+        subMat4D.push_back(af_make_seq(0, 1, 1));
+        subMat4D.push_back(af_make_seq(0, 1, 1));
+        subMat4D.push_back(af_make_seq(1, 2, 1));
+
+        subMat1D_to_2D.push_back(af_make_seq(1, 2, 1));
+        subMat1D_to_2D.push_back(af_make_seq(1, 1, 1));
+
+        subMat1D_to_3D.push_back(af_make_seq(5, 20, 1));
+        subMat1D_to_3D.push_back(af_make_seq(1, 1, 1));
+        subMat1D_to_3D.push_back(af_make_seq(2, 2, 1));
+
+        subMat2D_to_3D.push_back(af_make_seq(3, 4, 1));
+        subMat2D_to_3D.push_back(af_make_seq(0, 1, 1));
+        subMat2D_to_3D.push_back(af_make_seq(1, 1, 1));
+
+        subMat1D_to_4D.push_back(af_make_seq(3, 4, 1));
+        subMat1D_to_4D.push_back(af_make_seq(0, 0, 1));
+        subMat1D_to_4D.push_back(af_make_seq(0, 0, 1));
+        subMat1D_to_4D.push_back(af_make_seq(1, 1, 1));
+
+        subMat2D_to_4D.push_back(af_make_seq(3, 4, 1));
+        subMat2D_to_4D.push_back(af_make_seq(0, 1, 1));
+        subMat2D_to_4D.push_back(af_make_seq(0, 0, 1));
+        subMat2D_to_4D.push_back(af_make_seq(1, 1, 1));
+
+        subMat3D_to_4D.push_back(af_make_seq(3, 4, 1));
+        subMat3D_to_4D.push_back(af_make_seq(0, 1, 1));
+        subMat3D_to_4D.push_back(af_make_seq(0, 1, 1));
+        subMat3D_to_4D.push_back(af_make_seq(1, 1, 1));
+    }
+    vector<af_seq> subMat1D;
 
-        vector<af_seq> subMat2D;
-        vector<af_seq> subMat1D_to_2D;
+    vector<af_seq> subMat2D;
+    vector<af_seq> subMat1D_to_2D;
 
-        vector<af_seq> subMat3D;
-        vector<af_seq> subMat1D_to_3D;
-        vector<af_seq> subMat2D_to_3D;
+    vector<af_seq> subMat3D;
+    vector<af_seq> subMat1D_to_3D;
+    vector<af_seq> subMat2D_to_3D;
 
-        vector<af_seq> subMat4D;
-        vector<af_seq> subMat1D_to_4D;
-        vector<af_seq> subMat2D_to_4D;
-        vector<af_seq> subMat3D_to_4D;
+    vector<af_seq> subMat4D;
+    vector<af_seq> subMat1D_to_4D;
+    vector<af_seq> subMat2D_to_4D;
+    vector<af_seq> subMat3D_to_4D;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, af::cdouble, af::cfloat, double, int, uint, char, uchar, intl, uintl> TestTypes;
+typedef ::testing::Types<float, cdouble, cfloat, double, int, uint, char, schar,
+                         uchar, intl, uintl, short, ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(ArrayAssign, TestTypes);
+TYPED_TEST_SUITE(ArrayAssign, TestTypes);
 
 template<typename inType, typename outType>
-void assignTest(string pTestFile, const vector<af_seq> *seqv)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void assignTest(string pTestFile, const vector<af_seq> *seqv) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>  numDims;
-    vector<vector<inType> >      in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTests<inType, outType, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af_array lhsArray  = 0;
-    af_array rhsArray  = 0;
-    af_array outArray  = 0;
+    dim4 dims0        = numDims[0];
+    dim4 dims1        = numDims[1];
+    af_array lhsArray = 0;
+    af_array rhsArray = 0;
+    af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&rhsArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<inType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&rhsArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&lhsArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<outType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&lhsArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<outType>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_assign_seq(&outArray, lhsArray, seqv->size(), &seqv->front(), rhsArray));
+    ASSERT_SUCCESS(af_assign_seq(&outArray, lhsArray, seqv->size(),
+                                 &seqv->front(), rhsArray));
 
     outType *outData = new outType[dims1.elements()];
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
     vector<outType> currGoldBar = tests[0];
-    using namespace std;
-    size_t nElems        = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    size_t nElems               = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
     delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(rhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(lhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(rhsArray));
+    ASSERT_SUCCESS(af_release_array(lhsArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
 template<typename T>
-void assignTestCPP(string pTestFile, const vector<af_seq> &seqv)
-{
-    if (noDoubleTests<T>()) return;
+void assignTestCPP(string pTestFile, const vector<af_seq> &seqv) {
+    SUPPORTED_TYPE_CHECK(T);
     try {
-
-        using af::array;
-
-        vector<af::dim4>  numDims;
-        vector<vector<T> >      in;
-        vector<vector<T> >   tests;
+        vector<dim4> numDims;
+        vector<vector<T>> in;
+        vector<vector<T>> tests;
 
         readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-        af::dim4 dims0     = numDims[0];
-        af::dim4 dims1     = numDims[1];
+        dim4 dims0 = numDims[0];
+        dim4 dims1 = numDims[1];
 
         array a(dims0, &(in[0].front()));
         array b(dims1, &(in[1].front()));
 
-        switch(seqv.size()) {
+        switch (seqv.size()) {
             case 1: b(seqv[0]) = a; break;
-            case 2: b(seqv[0],seqv[1]) = a; break;
-            case 3: b(seqv[0],seqv[1], seqv[2]) = a; break;
-            case 4: b(seqv[0],seqv[1], seqv[2], seqv[3]) = a; break;
+            case 2: b(seqv[0], seqv[1]) = a; break;
+            case 3: b(seqv[0], seqv[1], seqv[2]) = a; break;
+            case 4: b(seqv[0], seqv[1], seqv[2], seqv[3]) = a; break;
             default: assert(1 != 1 && "Does not compute");
         }
 
@@ -159,141 +174,137 @@ void assignTestCPP(string pTestFile, const vector<af_seq> &seqv)
         b.host(outData);
 
         vector<T> currGoldBar = tests[0];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            EXPECT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+        size_t nElems         = currGoldBar.size();
+        for (size_t elIter = 0; elIter < nElems; ++elIter) {
+            EXPECT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
         delete[] outData;
-    } catch(const af::exception &ex) {
+    } catch (const exception &ex) {
         FAIL() << "Exception thrown: " << ex.what();
     }
-
 }
 
-TYPED_TEST(ArrayAssign, Vector)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/1d_to_1d.test"), &(this->subMat1D));
+TYPED_TEST(ArrayAssign, Vector) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/1d_to_1d.test"),
+                                     &(this->subMat1D));
 }
 
-TYPED_TEST(ArrayAssign, VectorCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/1d_to_1d.test"), this->subMat1D);
+TYPED_TEST(ArrayAssign, VectorCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/1d_to_1d.test"),
+                             this->subMat1D);
 }
 
-TYPED_TEST(ArrayAssign, Matrix)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/2d_to_2d.test"), &(this->subMat2D));
+TYPED_TEST(ArrayAssign, Matrix) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/2d_to_2d.test"),
+                                     &(this->subMat2D));
 }
 
-TYPED_TEST(ArrayAssign, MatrixCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/2d_to_2d.test"), this->subMat2D);
+TYPED_TEST(ArrayAssign, MatrixCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/2d_to_2d.test"),
+                             this->subMat2D);
 }
 
-TYPED_TEST(ArrayAssign, Cube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/3d_to_3d.test"), &(this->subMat3D));
+TYPED_TEST(ArrayAssign, Cube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/3d_to_3d.test"),
+                                     &(this->subMat3D));
 }
 
-TYPED_TEST(ArrayAssign, CubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/3d_to_3d.test"), this->subMat3D);
+TYPED_TEST(ArrayAssign, CubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/3d_to_3d.test"),
+                             this->subMat3D);
 }
 
-TYPED_TEST(ArrayAssign, HyperCube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/4d_to_4d.test"), &(this->subMat4D));
+TYPED_TEST(ArrayAssign, HyperCube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/4d_to_4d.test"),
+                                     &(this->subMat4D));
 }
 
-TYPED_TEST(ArrayAssign, HyperCubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/4d_to_4d.test"), this->subMat4D);
+TYPED_TEST(ArrayAssign, HyperCubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/4d_to_4d.test"),
+                             this->subMat4D);
 }
 
-TYPED_TEST(ArrayAssign, Vector2Matrix)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/1d_to_2d.test"), &(this->subMat1D_to_2D));
+TYPED_TEST(ArrayAssign, Vector2Matrix) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/1d_to_2d.test"),
+                                     &(this->subMat1D_to_2D));
 }
 
-TYPED_TEST(ArrayAssign, Vector2MatrixCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/1d_to_2d.test"), this->subMat1D_to_2D);
+TYPED_TEST(ArrayAssign, Vector2MatrixCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/1d_to_2d.test"),
+                             this->subMat1D_to_2D);
 }
 
-TYPED_TEST(ArrayAssign, Vector2Cube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/1d_to_3d.test"), &(this->subMat1D_to_3D));
+TYPED_TEST(ArrayAssign, Vector2Cube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/1d_to_3d.test"),
+                                     &(this->subMat1D_to_3D));
 }
 
-TYPED_TEST(ArrayAssign, Vector2CubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/1d_to_3d.test"), this->subMat1D_to_3D);
+TYPED_TEST(ArrayAssign, Vector2CubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/1d_to_3d.test"),
+                             this->subMat1D_to_3D);
 }
 
-TYPED_TEST(ArrayAssign, Matrix2Cube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/2d_to_3d.test"), &(this->subMat2D_to_3D));
+TYPED_TEST(ArrayAssign, Matrix2Cube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/2d_to_3d.test"),
+                                     &(this->subMat2D_to_3D));
 }
 
-TYPED_TEST(ArrayAssign, Matrix2CubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/2d_to_3d.test"), this->subMat2D_to_3D);
+TYPED_TEST(ArrayAssign, Matrix2CubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/2d_to_3d.test"),
+                             this->subMat2D_to_3D);
 }
 
-TYPED_TEST(ArrayAssign, Vector2HyperCube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/1d_to_4d.test"), &(this->subMat1D_to_4D));
+TYPED_TEST(ArrayAssign, Vector2HyperCube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/1d_to_4d.test"),
+                                     &(this->subMat1D_to_4D));
 }
 
-TYPED_TEST(ArrayAssign, Vector2HyperCubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/1d_to_4d.test"), this->subMat1D_to_4D);
+TYPED_TEST(ArrayAssign, Vector2HyperCubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/1d_to_4d.test"),
+                             this->subMat1D_to_4D);
 }
 
-TYPED_TEST(ArrayAssign, Matrix2HyperCube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/2d_to_4d.test"), &(this->subMat2D_to_4D));
+TYPED_TEST(ArrayAssign, Matrix2HyperCube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/2d_to_4d.test"),
+                                     &(this->subMat2D_to_4D));
 }
 
-TYPED_TEST(ArrayAssign, Matrix2HyperCubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/2d_to_4d.test"), this->subMat2D_to_4D);
+TYPED_TEST(ArrayAssign, Matrix2HyperCubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/2d_to_4d.test"),
+                             this->subMat2D_to_4D);
 }
 
-TYPED_TEST(ArrayAssign, Cube2HyperCube)
-{
-    assignTest<TypeParam, TypeParam>(string(TEST_DIR"/assign/3d_to_4d.test"), &(this->subMat3D_to_4D));
+TYPED_TEST(ArrayAssign, Cube2HyperCube) {
+    assignTest<TypeParam, TypeParam>(string(TEST_DIR "/assign/3d_to_4d.test"),
+                                     &(this->subMat3D_to_4D));
 }
 
-TYPED_TEST(ArrayAssign, Cube2HyperCubeCPP)
-{
-    assignTestCPP<TypeParam>(string(TEST_DIR"/assign/3d_to_4d.test"), this->subMat3D_to_4D);
+TYPED_TEST(ArrayAssign, Cube2HyperCubeCPP) {
+    assignTestCPP<TypeParam>(string(TEST_DIR "/assign/3d_to_4d.test"),
+                             this->subMat3D_to_4D);
 }
 
 template<typename T>
-void assignScalarCPP(string pTestFile, const vector<af_seq> &seqv)
-{
-    if (noDoubleTests<T>()) return;
+void assignScalarCPP(string pTestFile, const vector<af_seq> &seqv) {
+    SUPPORTED_TYPE_CHECK(T);
     try {
-
-        using af::array;
-
-        vector<af::dim4>  numDims;
-        vector<vector<T> >      in;
-        vector<vector<T> >   tests;
+        vector<dim4> numDims;
+        vector<vector<T>> in;
+        vector<vector<T>> tests;
 
         readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-        af::dim4 dims1     = numDims[1];
+        dim4 dims1 = numDims[1];
 
         T a = in[0][0];
         array b(dims1, &(in[1].front()));
 
-        switch(seqv.size()) {
+        switch (seqv.size()) {
             case 1: b(seqv[0]) = a; break;
-            case 2: b(seqv[0],seqv[1]) = a; break;
-            case 3: b(seqv[0],seqv[1], seqv[2]) = a; break;
-            case 4: b(seqv[0],seqv[1], seqv[2], seqv[3]) = a; break;
+            case 2: b(seqv[0], seqv[1]) = a; break;
+            case 3: b(seqv[0], seqv[1], seqv[2]) = a; break;
+            case 4: b(seqv[0], seqv[1], seqv[2], seqv[3]) = a; break;
             default: assert(1 != 1 && "Does not compute");
         }
 
@@ -301,306 +312,314 @@ void assignScalarCPP(string pTestFile, const vector<af_seq> &seqv)
         b.host(outData);
 
         vector<T> currGoldBar = tests[0];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            if(currGoldBar[elIter] != outData[elIter]){
-                switch(seqv.size()) {
+        size_t nElems         = currGoldBar.size();
+        for (size_t elIter = 0; elIter < nElems; ++elIter) {
+            if (currGoldBar[elIter] != outData[elIter]) {
+                switch (seqv.size()) {
                     case 1: printf("b(seqv[0]) = a\n"); break;
                     case 2: printf("b(seqv[0],seqv[1]) = a\n"); break;
                     case 3: printf("b(seqv[0],seqv[1], seqv[2]) = a\n"); break;
-                    case 4: printf("b(seqv[0],seqv[1], seqv[2], seqv[3]) = a\n"); break;
+                    case 4:
+                        printf("b(seqv[0],seqv[1], seqv[2], seqv[3]) = a\n");
+                        break;
                     default: assert(1 != 1 && "Does not compute");
                 }
-                std::cout << "a: " << a << std::endl;
+                cout << "a: " << a << endl;
                 af_print(b);
-                ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+                ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                    << "at: " << elIter << endl;
             }
         }
         delete[] outData;
-    } catch(const af::exception &ex) {
+    } catch (const exception &ex) {
         FAIL() << "Exception thrown: " << ex.what();
     }
 }
 
-TYPED_TEST(ArrayAssign, Scalar1DCPP)
-{
-    assignScalarCPP<TypeParam>(string(TEST_DIR"/assign/scalar_to_1d.test"), this->subMat1D);
+TYPED_TEST(ArrayAssign, Scalar1DCPP) {
+    assignScalarCPP<TypeParam>(string(TEST_DIR "/assign/scalar_to_1d.test"),
+                               this->subMat1D);
 }
 
-TYPED_TEST(ArrayAssign, Scalar2DCPP)
-{
-    assignScalarCPP<TypeParam>(string(TEST_DIR"/assign/scalar_to_2d.test"), this->subMat2D);
+TYPED_TEST(ArrayAssign, Scalar2DCPP) {
+    assignScalarCPP<TypeParam>(string(TEST_DIR "/assign/scalar_to_2d.test"),
+                               this->subMat2D);
 }
 
-TYPED_TEST(ArrayAssign, Scalar3DCPP)
-{
-    assignScalarCPP<TypeParam>(string(TEST_DIR"/assign/scalar_to_3d.test"), this->subMat3D);
+TYPED_TEST(ArrayAssign, Scalar3DCPP) {
+    assignScalarCPP<TypeParam>(string(TEST_DIR "/assign/scalar_to_3d.test"),
+                               this->subMat3D);
 }
 
-TYPED_TEST(ArrayAssign, Scalar4DCPP)
-{
-    assignScalarCPP<TypeParam>(string(TEST_DIR"/assign/scalar_to_4d.test"), this->subMat4D);
+TYPED_TEST(ArrayAssign, Scalar4DCPP) {
+    assignScalarCPP<TypeParam>(string(TEST_DIR "/assign/scalar_to_4d.test"),
+                               this->subMat4D);
 }
 
-TYPED_TEST(ArrayAssign, AssignRowCPP)
-{
-    if (noDoubleTests<TypeParam>()) return;
-    using namespace std;
-    using namespace af;
-    int dimsize=10;
-    vector<TypeParam> input(100, 1);
+TYPED_TEST(ArrayAssign, AssignRowCPP) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    const int dimsize = 10;
+    vector<TypeParam> input(100, TypeParam(1.0));
     vector<TypeParam> sq(dimsize);
     vector<int> arIdx(2);
-    for(int i = 0; i < (int)sq.size(); i++) sq[i] = i;
+    for (int i = 0; i < (int)sq.size(); i++) sq[i] = i;
     arIdx[0] = 5;
     arIdx[1] = 7;
 
-    af::array in(dimsize, dimsize, &input.front(), afHost);
-    af::dim4 size(dimsize, 1, 1, 1);
-    af::array sarr(size, &sq.front(), afHost);
-    af::array arrIdx(2, &arIdx.front(), afHost);
+    array in(dimsize, dimsize, &input.front(), afHost);
+    dim4 size(dimsize, 1, 1, 1);
+    array sarr(size, &sq.front(), afHost);
+    array arrIdx(2, &arIdx.front(), afHost);
 
-    in.row(0)       = sarr;
-    in.row(2)       = 2;
-    in(arrIdx, span)= 8;
-    in.row(af::end) = 3;
-    in.rows(3, 4)   = 7;
+    in.row(0)        = sarr;
+    in.row(2)        = 2;
+    in(arrIdx, span) = 8;
+    in.row(end)      = 3;
+    in.rows(3, 4)    = 7;
 
     vector<TypeParam> out(100);
     in.host(&out.front());
 
-    for(int col = 0; col < dimsize; col++) {
-        for(int row = 0; row < dimsize; row++) {
-            if      (row == 0)              ASSERT_EQ(sq[col], out[col * dimsize + row])
-                << "Assigning array to indexed array using col";
-            else if (row == 2)              ASSERT_EQ(TypeParam(2), out[col * dimsize + row])
-                << "Assigning value to indexed array using col";
-            else if (row == dimsize-1)      ASSERT_EQ(TypeParam(3), out[col * dimsize + row])
-                << "Assigning value to array which is indexed using end.";
-            else if (row == 3 || row == 4)  ASSERT_EQ(TypeParam(7), out[col * dimsize + row])
-                << "Assigning value to an array which is indexed using an rows";
-            else if (row == 5 || row == 7)  ASSERT_EQ(TypeParam(8), out[col * dimsize + row])
-                << "Assigning value to an array which is indexed using an array (i.e. in(arrIdx, span) = 8);) using row";
-            else                            ASSERT_EQ(TypeParam(1),  out[col * dimsize + row])
-                << "Values written to incorrect location";
+    for (int col = 0; col < dimsize; col++) {
+        for (int row = 0; row < dimsize; row++) {
+            if (row == 0)
+                ASSERT_EQ(sq[col], out[col * dimsize + row])
+                    << "Assigning array to indexed array using row";
+            else if (row == 2)
+                ASSERT_EQ(TypeParam(2), out[col * dimsize + row])
+                    << "Assigning value to indexed array using row";
+            else if (row == dimsize - 1)
+                ASSERT_EQ(TypeParam(3), out[col * dimsize + row])
+                    << "Assigning value to array which is indexed using end.";
+            else if (row == 3 || row == 4)
+                ASSERT_EQ(TypeParam(7), out[col * dimsize + row])
+                    << "Assigning value to an array which is indexed using "
+                       "rows";
+            else if (row == 5 || row == 7)
+                ASSERT_EQ(TypeParam(8), out[col * dimsize + row])
+                    << "Assigning value to an array which is indexed using an "
+                       "array (i.e. in(arrIdx, span) = 8) using row";
+            else
+                ASSERT_EQ(TypeParam(1), out[col * dimsize + row])
+                    << "Values written to incorrect location";
         }
     }
 }
 
-TYPED_TEST(ArrayAssign, AssignColumnCPP)
-{
-    if (noDoubleTests<TypeParam>()) return;
-    using namespace std;
-    using namespace af;
-    int dimsize=10;
-    vector<TypeParam> input(100, 1);
+TYPED_TEST(ArrayAssign, AssignColumnCPP) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    const int dimsize = 10;
+    vector<TypeParam> input(100, TypeParam(1.0));
     vector<TypeParam> sq(dimsize);
     vector<int> arIdx(2);
-    for(int i = 0; i < (int)sq.size(); i++) sq[i] = i;
+    for (int i = 0; i < (int)sq.size(); i++) sq[i] = i;
     arIdx[0] = 5;
     arIdx[1] = 7;
 
-    af::array in(dimsize, dimsize, &input.front(), afHost);
-    af::dim4 size(dimsize, 1, 1, 1);
-    af::array sarr(size, &sq.front(), afHost);
-    af::array arrIdx(2, &arIdx.front(), afHost);
+    array in(dimsize, dimsize, &input.front(), afHost);
+    dim4 size(dimsize, 1, 1, 1);
+    array sarr(size, &sq.front(), afHost);
+    array arrIdx(2, &arIdx.front(), afHost);
 
-    in.col(0)       = sarr;
-    in.col(2)       = 2;
-    in(span, arrIdx)= 8;
-    in.col(af::end) = 3;
-    in.cols(3, 4)   = 7;
+    in.col(0)        = sarr;
+    in.col(2)        = 2;
+    in(span, arrIdx) = 8;
+    in.col(end)      = 3;
+    in.cols(3, 4)    = 7;
 
     vector<TypeParam> out(100);
     in.host(&out.front());
 
-    for(int col = 0; col < dimsize; col++) {
-        for(int row = 0; row < dimsize; row++) {
-            if      (col == 0)              ASSERT_EQ(sq[row], out[col * dimsize + row])
-                << "Assigning array to indexed array using col";
-            else if (col == 2)              ASSERT_EQ(TypeParam(2), out[col * dimsize + row])
-                << "Assigning value to indexed array using col";
-            else if (col == dimsize-1)      ASSERT_EQ(TypeParam(3), out[col * dimsize + row])
-                << "Assigning value to array which is indexed using end.";
-            else if (col == 3 || col == 4)  ASSERT_EQ(TypeParam(7), out[col * dimsize + row])
-                << "Assigning value to an array which is indexed using an cols";
-            else if (col == 5 || col == 7)  ASSERT_EQ(TypeParam(8), out[col * dimsize + row])
-                << "Assigning value to an array which is indexed using an array (i.e. in(span, arrIdx) = 8);) using col";
-            else                            ASSERT_EQ(TypeParam(1),  out[col * dimsize + row])
-                << "Values written to incorrect location";
+    for (int col = 0; col < dimsize; col++) {
+        for (int row = 0; row < dimsize; row++) {
+            if (col == 0)
+                ASSERT_EQ(sq[row], out[col * dimsize + row])
+                    << "Assigning array to indexed array using col";
+            else if (col == 2)
+                ASSERT_EQ(TypeParam(2), out[col * dimsize + row])
+                    << "Assigning value to indexed array using col";
+            else if (col == dimsize - 1)
+                ASSERT_EQ(TypeParam(3), out[col * dimsize + row])
+                    << "Assigning value to array which is indexed using end.";
+            else if (col == 3 || col == 4)
+                ASSERT_EQ(TypeParam(7), out[col * dimsize + row])
+                    << "Assigning value to an array which is indexed using "
+                       "cols";
+            else if (col == 5 || col == 7)
+                ASSERT_EQ(TypeParam(8), out[col * dimsize + row])
+                    << "Assigning value to an array which is indexed using an "
+                       "array (i.e. in(span, arrIdx) = 8) using col";
+            else
+                ASSERT_EQ(TypeParam(1), out[col * dimsize + row])
+                    << "Values written to incorrect location";
         }
     }
 }
 
-TYPED_TEST(ArrayAssign, AssignSliceCPP)
-{
-    if (noDoubleTests<TypeParam>()) return;
-    using namespace std;
-    using namespace af;
-    int dimsize=10;
-    vector<TypeParam> input(1000, 1);
+TYPED_TEST(ArrayAssign, AssignSliceCPP) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const int dimsize = 10;
+    vector<TypeParam> input(1000, TypeParam(1.0));
     vector<TypeParam> sq(dimsize * dimsize);
     vector<int> arIdx(2);
-    for(int i = 0; i < (int)sq.size(); i++) sq[i] = i;
+    for (int i = 0; i < (int)sq.size(); i++) sq[i] = i;
     arIdx[0] = 5;
     arIdx[1] = 7;
 
-    af::array in(dimsize, dimsize, dimsize, &input.front(), afHost);
-    af::dim4 size(dimsize, dimsize, 1, 1);
-    af::array sarr(size, &sq.front(), afHost);
-    af::array arrIdx(2, &arIdx.front(), afHost);
+    array in(dimsize, dimsize, dimsize, &input.front(), afHost);
+    dim4 size(dimsize, dimsize, 1, 1);
+    array sarr(size, &sq.front(), afHost);
+    array arrIdx(2, &arIdx.front(), afHost);
 
-    in.slice(0)             = sarr;
-    in.slice(2)             = 2;
-    in(span, span, arrIdx)  = 8;
-    in.slice(af::end)       = 3;
-    in.slices(3, 4)         = 7;
+    in.slice(0)            = sarr;
+    in.slice(2)            = 2;
+    in(span, span, arrIdx) = 8;
+    in.slice(end)          = 3;
+    in.slices(3, 4)        = 7;
 
     vector<TypeParam> out(1000);
     in.host(&out.front());
 
-    for(int slice = 0; slice < dimsize; slice++) {
-        for(int col = 0; col < dimsize; col++) {
-            for(int row = 0; row < dimsize; row++) {
+    for (int slice = 0; slice < dimsize; slice++) {
+        for (int col = 0; col < dimsize; col++) {
+            for (int row = 0; row < dimsize; row++) {
                 int idx = slice * dimsize * dimsize + col * dimsize + row;
-                if      (slice == 0)              ASSERT_EQ(sq[col * dimsize + row], out[idx])
-                    << "Assigning array to indexed array using col";
-                else if (slice == 2)              ASSERT_EQ(TypeParam(2), out[idx])
-                    << "Assigning value to indexed array using col";
-                else if (slice == dimsize-1)      ASSERT_EQ(TypeParam(3), out[idx])
-                    << "Assigning value to array which is indexed using end.";
-                else if (slice == 3 || slice == 4)  ASSERT_EQ(TypeParam(7), out[idx])
-                    << "Assigning value to an array which is indexed using an slices";
-                else if (slice == 5 || slice == 7)  ASSERT_EQ(TypeParam(8), out[idx])
-                    << "Assigning value to an array which is indexed using an array (i.e. in(span, span, arrIdx) = 8);) using slice";
-                else                            ASSERT_EQ(TypeParam(1),  out[idx])
-                    << "Values written to incorrect location";
+                if (slice == 0)
+                    ASSERT_EQ(sq[col * dimsize + row], out[idx])
+                        << "Assigning array to indexed array using slice";
+                else if (slice == 2)
+                    ASSERT_EQ(TypeParam(2), out[idx])
+                        << "Assigning value to indexed array using slice";
+                else if (slice == dimsize - 1)
+                    ASSERT_EQ(TypeParam(3), out[idx])
+                        << "Assigning value to array which is indexed using "
+                           "end.";
+                else if (slice == 3 || slice == 4)
+                    ASSERT_EQ(TypeParam(7), out[idx])
+                        << "Assigning value to an array which is indexed using "
+                           "slices";
+                else if (slice == 5 || slice == 7)
+                    ASSERT_EQ(TypeParam(8), out[idx])
+                        << "Assigning value to an array which is indexed using "
+                           "an array (i.e. in(span, span, arrIdx) = 8) using "
+                           "slice";
+                else
+                    ASSERT_EQ(TypeParam(1), out[idx])
+                        << "Values written to incorrect location";
             }
         }
     }
 }
 
-TEST(ArrayAssign, InvalidArgs)
-{
-    vector<af::cfloat> in(100, af::cfloat(0,0));
+TEST(ArrayAssign, InvalidArgs) {
+    vector<cfloat> in(100, cfloat(0, 0));
     vector<float> tests(100, float(1));
 
-    af::dim4 dims0(10, 1, 1, 1);
-    af::dim4 dims1(100, 1, 1, 1);
+    dim4 dims0(10, 1, 1, 1);
+    dim4 dims1(100, 1, 1, 1);
     af_array lhsArray = 0;
     af_array rhsArray = 0;
     af_array outArray = 0;
 
     vector<af_seq> seqv;
-    seqv.push_back(af_make_seq(5,14,1));
+    seqv.push_back(af_make_seq(5, 14, 1));
 
-    ASSERT_EQ(AF_ERR_ARG, af_assign_seq(&outArray,
-                                    lhsArray, seqv.size(), &seqv.front(), rhsArray));
+    ASSERT_EQ(AF_ERR_ARG, af_assign_seq(&outArray, lhsArray, seqv.size(),
+                                        &seqv.front(), rhsArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&rhsArray, &(in.front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<af::cfloat>::af_type));
+    ASSERT_SUCCESS(af_create_array(&rhsArray, &(in.front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<cfloat>::af_type));
 
-    ASSERT_EQ(AF_ERR_ARG, af_assign_seq(&outArray,
-                                    lhsArray, seqv.size(), &seqv.front(), rhsArray));
+    ASSERT_EQ(AF_ERR_ARG, af_assign_seq(&outArray, lhsArray, seqv.size(),
+                                        &seqv.front(), rhsArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&lhsArray, &(in.front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&lhsArray, &(in.front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_ERR_ARG, af_assign_seq(&outArray, lhsArray, 0, &seqv.front(), rhsArray));
+    ASSERT_EQ(AF_ERR_ARG,
+              af_assign_seq(&outArray, lhsArray, 0, &seqv.front(), rhsArray));
 
-    ASSERT_EQ(AF_ERR_TYPE, af_assign_seq(&outArray,
-                                         lhsArray, seqv.size(), &seqv.front(), rhsArray));
+    ASSERT_EQ(AF_ERR_TYPE, af_assign_seq(&outArray, lhsArray, seqv.size(),
+                                         &seqv.front(), rhsArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(rhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(lhsArray));
+    ASSERT_SUCCESS(af_release_array(rhsArray));
+    ASSERT_SUCCESS(af_release_array(lhsArray));
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_TO_INDEXED)
-{
+TEST(ArrayAssign, CPP_ASSIGN_TO_INDEXED) {
     vector<int> in(20);
-    for(int i = 0; i < (int)in.size(); i++) in[i] = i;
+    for (int i = 0; i < (int)in.size(); i++) in[i] = i;
 
-    af::array input(10, 2, &in.front(), afHost);
+    array input(10, 2, &in.front(), afHost);
 
-    input(af::span, 0) = input(af::span, 1);// <-- Tests array_proxy to array_proxy assignment
+    // Tests array_proxy to array_proxy assignment
+    input(span, 0) = input(span, 1);
 
     vector<int> out(20);
     input.host(&out.front());
 
-    for(int i = 0; i < 10; i++)                 ASSERT_EQ(i + 10, out[i]);
-    for(int i = 10; i < (int)in.size(); i++)    ASSERT_EQ(i, out[i]);
+    for (int i = 0; i < 10; i++) ASSERT_EQ(i + 10, out[i]);
+    for (int i = 10; i < (int)in.size(); i++) ASSERT_EQ(i, out[i]);
 }
 
-TEST(ArrayAssign, CPP_END)
-{
-    using af::array;
-
-    const int n = 5;
-    const int m = 5;
+TEST(ArrayAssign, CPP_END) {
+    const int n       = 5;
+    const int m       = 5;
     const int end_off = 2;
 
-    array a = af::randu(n, m);
-    array b = af::randu(1, m);
-    a(af::end - end_off, af::span) = b;
+    array a                = randu(n, m);
+    array b                = randu(1, m);
+    a(end - end_off, span) = b;
 
     float *hA = a.host<float>();
     float *hB = b.host<float>();
 
-    for (int i = 0; i < m; i++) {
-        ASSERT_EQ(hA[i * n + end_off], hB[i]);
-    }
+    for (int i = 0; i < m; i++) { ASSERT_EQ(hA[i * n + end_off], hB[i]); }
 
-
-    delete[] hA;
-    delete[] hB;
+    af_free_host(hA);
+    af_free_host(hB);
 }
 
-TEST(ArrayAssign, CPP_END_SEQ)
-{
-    using af::array;
-
-    const int num = 20;
+TEST(ArrayAssign, CPP_END_SEQ) {
+    const int num       = 20;
     const int end_begin = 10;
-    const int end_end = 0;
-    const int len = end_begin - end_end + 1;
+    const int end_end   = 0;
+    const int len       = end_begin - end_end + 1;
 
-    array a = af::randu(num);
-    array b = af::randu(len);
-    a(af::seq(af::end - end_begin, af::end - end_end)) = b;
+    array a                                = randu(num);
+    array b                                = randu(len);
+    a(seq(end - end_begin, end - end_end)) = b;
 
     float *hA = a.host<float>();
     float *hB = b.host<float>();
 
-    for (int i = 0; i < len; i++) {
-        ASSERT_EQ(hA[i + end_begin - 1], hB[i]);
-    }
+    for (int i = 0; i < len; i++) { ASSERT_EQ(hA[i + end_begin - 1], hB[i]); }
 
-    delete[] hA;
-    delete[] hB;
+    af_free_host(hA);
+    af_free_host(hB);
 }
 
-TEST(ArrayAssign, CPP_COPY_ON_WRITE)
-{
-    using af::array;
-
+TEST(ArrayAssign, CPP_COPY_ON_WRITE) {
     const int num = 20;
     const int len = 10;
 
-    array a = af::randu(num);
+    array a    = randu(num);
     float *hAO = a.host<float>();
 
     array a_copy = a;
-    array b = af::randu(len);
-    a(af::seq(len)) = b;
+    array b      = randu(len);
+    a(seq(len))  = b;
 
-    float *hA = a.host<float>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    float *hB  = b.host<float>();
     float *hAC = a_copy.host<float>();
 
     // first half should be from B
-    for (int i = 0; i < len; i++) {
-        ASSERT_EQ(hA[i], hB[i]);
-    }
+    for (int i = 0; i < len; i++) { ASSERT_EQ(hA[i], hB[i]); }
 
     // Second half should be same as original
     for (int i = 0; i < num - len; i++) {
@@ -608,38 +627,31 @@ TEST(ArrayAssign, CPP_COPY_ON_WRITE)
     }
 
     // hAC should not be modified, i.e. same as original
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(hAO[i], hAC[i]);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hAO[i], hAC[i]); }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hAC;
-    delete[] hAO;
+    af_free_host(hA);
+    af_free_host(hB);
+    af_free_host(hAC);
+    af_free_host(hAO);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_BINOP)
-{
-    using af::array;
-
+TEST(ArrayAssign, CPP_ASSIGN_BINOP) {
     const int num = 20;
     const int len = 10;
 
-    array a = af::randu(num);
+    array a    = randu(num);
     float *hAO = a.host<float>();
 
     array a_copy = a;
-    array b = af::randu(len);
-    a(af::seq(len)) += b;
+    array b      = randu(len);
+    a(seq(len)) += b;
 
-    float *hA = a.host<float>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    float *hB  = b.host<float>();
     float *hAC = a_copy.host<float>();
 
     // first half should be hAO + hB
-    for (int i = 0; i < len; i++) {
-        ASSERT_EQ(hA[i], hAO[i] + hB[i]);
-    }
+    for (int i = 0; i < len; i++) { ASSERT_EQ(hA[i], hAO[i] + hB[i]); }
 
     // Second half should be same as original
     for (int i = 0; i < num - len; i++) {
@@ -647,70 +659,60 @@ TEST(ArrayAssign, CPP_ASSIGN_BINOP)
     }
 
     // hAC should not be modified, i.e. same as original
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(hAO[i], hAC[i]);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hAO[i], hAC[i]); }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hAC;
-    delete[] hAO;
+    af_free_host(hA);
+    af_free_host(hB);
+    af_free_host(hAC);
+    af_free_host(hAO);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_VECTOR)
-{
-    using af::array;
-
+TEST(ArrayAssign, CPP_ASSIGN_VECTOR) {
     const int num = 20;
 
-    array a = af::randu(1, num);
-    array b = af::randu(num);
+    array a = randu(1, num);
+    array b = randu(num);
 
     array c, idx;
     sort(c, idx, b);
 
     a(idx) = c;
 
-    ASSERT_EQ(a.dims(0) , (dim_t)1);
-    ASSERT_EQ(a.dims(1) , (dim_t)num);
-    ASSERT_EQ(c.dims(0) , (dim_t)num);
+    ASSERT_EQ(a.dims(0), (dim_t)1);
+    ASSERT_EQ(a.dims(1), (dim_t)num);
+    ASSERT_EQ(c.dims(0), (dim_t)num);
 
     float *h_a = a.host<float>();
     float *h_b = b.host<float>();
 
-    for (int i =0; i < num; i++) {
-        ASSERT_EQ(h_a[i], h_b[i]) << "at " << i;
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(h_a[i], h_b[i]) << "at " << i; }
 
-    delete[] h_a;
-    delete[] h_b;
+    af_free_host(h_a);
+    af_free_host(h_b);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ)
-{
-    using af::array;
-
+TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ) {
     const int num = 20;
     const int len = 10;
-    const int st = 3;
-    const int en = st + len - 1;
+    const int st  = 3;
+    const int en  = st + len - 1;
 
-    array a = af::randu(1, 1, num);
+    array a  = randu(1, 1, num);
     array a0 = a;
-    array b = af::randu(len);
+    array b  = randu(len);
 
-    array idx = af::seq(st, en);
+    array idx = seq(st, en);
 
-    a(af::seq(st, en)) = b;
+    a(seq(st, en)) = b;
 
-    ASSERT_EQ(a.dims(0) , (dim_t)1);
-    ASSERT_EQ(a.dims(1) , (dim_t)1);
-    ASSERT_EQ(a.dims(2) , (dim_t)num);
-    ASSERT_EQ(b.dims(0) , (dim_t)len);
+    ASSERT_EQ(a.dims(0), (dim_t)1);
+    ASSERT_EQ(a.dims(1), (dim_t)1);
+    ASSERT_EQ(a.dims(2), (dim_t)num);
+    ASSERT_EQ(b.dims(0), (dim_t)len);
 
     float *h_a0 = a0.host<float>();
-    float *h_a  =  a.host<float>();
-    float *h_b  =  b.host<float>();
+    float *h_a  = a.host<float>();
+    float *h_b  = b.host<float>();
 
     for (int i = 0; i < num; i++) {
         if (i >= st && i <= en) {
@@ -720,69 +722,59 @@ TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ)
         }
     }
 
-    delete[] h_a0;
-    delete[] h_a;
-    delete[] h_b;
+    af_free_host(h_a0);
+    af_free_host(h_a);
+    af_free_host(h_b);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_VECTOR_2D)
-{
-    using af::array;
-
-    const int nx = 4;
-    const int ny = 5;
+TEST(ArrayAssign, CPP_ASSIGN_VECTOR_2D) {
+    const int nx  = 4;
+    const int ny  = 5;
     const int num = nx * ny;
 
-    array a = af::randu(nx, ny);
-    array b = af::randu(num);
+    array a = randu(nx, ny);
+    array b = randu(num);
 
     array c, idx;
     sort(c, idx, b);
 
     a(idx) = c;
 
-    ASSERT_EQ(a.dims(0) , (dim_t)nx);
-    ASSERT_EQ(a.dims(1) , (dim_t)ny);
-    ASSERT_EQ(c.dims(0) , (dim_t)num);
+    ASSERT_EQ(a.dims(0), (dim_t)nx);
+    ASSERT_EQ(a.dims(1), (dim_t)ny);
+    ASSERT_EQ(c.dims(0), (dim_t)num);
 
     float *h_a = a.host<float>();
     float *h_b = b.host<float>();
 
-    for (int i =0; i < num; i++) {
-        ASSERT_EQ(h_a[i], h_b[i]) << "at " << i;
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(h_a[i], h_b[i]) << "at " << i; }
 
-    delete[] h_a;
-    delete[] h_b;
+    af_free_host(h_a);
+    af_free_host(h_b);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ_2D)
-{
-    using af::array;
-
-    const int nx = 4;
-    const int nz = 5;
+TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ_2D) {
+    const int nx  = 4;
+    const int nz  = 5;
     const int num = nx * nz;
     const int len = 10;
-    const int st = 3;
-    const int en = st + len - 1;
+    const int st  = 3;
+    const int en  = st + len - 1;
 
-    array a = af::randu(nx, 1, nz);
+    array a  = randu(nx, 1, nz);
     array a0 = a;
-    array b = af::randu(len);
+    array b  = randu(len);
 
-    array idx = af::seq(st, en);
+    a(seq(st, en)) = b;
 
-    a(af::seq(st, en)) = b;
-
-    ASSERT_EQ(a.dims(0) , (dim_t)nx);
-    ASSERT_EQ(a.dims(1) , (dim_t)1);
-    ASSERT_EQ(a.dims(2) , (dim_t)nz);
-    ASSERT_EQ(b.dims(0) , (dim_t)len);
+    ASSERT_EQ(a.dims(0), (dim_t)nx);
+    ASSERT_EQ(a.dims(1), (dim_t)1);
+    ASSERT_EQ(a.dims(2), (dim_t)nz);
+    ASSERT_EQ(b.dims(0), (dim_t)len);
 
     float *h_a0 = a0.host<float>();
-    float *h_a  =  a.host<float>();
-    float *h_b  =  b.host<float>();
+    float *h_a  = a.host<float>();
+    float *h_b  = b.host<float>();
 
     for (int i = 0; i < num; i++) {
         if (i >= st && i <= en) {
@@ -792,7 +784,256 @@ TEST(ArrayAssign, CPP_ASSIGN_VECTOR_SEQ_2D)
         }
     }
 
-    delete[] h_a0;
-    delete[] h_a;
-    delete[] h_b;
+    af_free_host(h_a0);
+    af_free_host(h_a);
+    af_free_host(h_b);
+}
+
+TEST(Assign, Copy) {
+    const int num = 20;
+    const int len = 10;
+    const int st  = 3;
+    const int en  = st + len - 1;
+
+    array a     = randu(num, 1);
+    float *h_a0 = a.host<float>();
+
+    array b = randu(len);
+
+    float *d_ptr = a.device<float>();
+    copy(a, b, seq(st, en));
+
+    // Ensure that `a` still has the same device pointer
+    ASSERT_EQ(d_ptr, a.device<float>());
+
+    float *h_a = a.host<float>();
+    float *h_b = b.host<float>();
+
+    for (int i = 0; i < num; i++) {
+        if (i >= st && i <= en) {
+            ASSERT_EQ(h_a[i], h_b[i - st]);
+        } else {
+            ASSERT_EQ(h_a[i], h_a0[i]);
+        }
+    }
+
+    af_free_host(h_a0);
+    af_free_host(h_a);
+    af_free_host(h_b);
+}
+
+TEST(Assign, LinearCPP) {
+    const int nx    = 5;
+    const int ny    = 4;
+    const float val = 3;
+
+    const int st = nx - 2;
+    const int en = nx * (ny - 1);
+
+    array a       = randu(nx, ny);
+    array a_copy  = a;
+    af::index idx = seq(st, en);
+    a(idx)        = 3;
+
+    ASSERT_EQ(a.dims(0), a_copy.dims(0));
+    ASSERT_EQ(a.dims(1), a_copy.dims(1));
+
+    vector<float> ha(nx * ny);
+    vector<float> ha_copy(nx * ny);
+
+    a.host(&ha[0]);
+    a_copy.host(&ha_copy[0]);
+
+    for (int i = 0; i < nx * ny; i++) {
+        if (i < st || i > en)
+            ASSERT_EQ(ha[i], ha_copy[i]) << "at " << i;
+        else
+            ASSERT_EQ(ha[i], val) << "at " << i;
+    }
+}
+
+TEST(Assign, LinearCPPMaxDim) {
+    const size_t largeDim = 65535 * 32 + 2;
+    const float val       = 3;
+
+    array a       = randu(1, 2 * largeDim);
+    array a_copy  = a.copy();
+    af::index idx = array(seq(10, largeDim + 10));
+    a(span, idx)  = val;
+
+    ASSERT_EQ(a.dims(0), a_copy.dims(0));
+
+    vector<float> ha(2 * largeDim);
+    vector<float> ha_copy(2 * largeDim);
+
+    a.host(&ha[0]);
+    a_copy.host(&ha_copy[0]);
+
+    for (unsigned int i = 0; i < 2 * largeDim; i++) {
+        if (i >= 10 && i <= largeDim + 10) {
+            ASSERT_EQ(ha[i], val) << "at " << i;
+        } else {
+            ASSERT_EQ(ha[i], ha_copy[i]) << "at " << i;
+        }
+    }
+}
+
+TEST(Assign, LinearAssignSeq) {
+    const int nx    = 5;
+    const int ny    = 4;
+    const float val = 3;
+    const array rhs = constant(val, 1, 1);
+
+    const int st = nx - 2;
+    const int en = nx * (ny - 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = seq(st, en);
+
+    af_array in_arr  = a.get();
+    af_index_t ii    = idx.get();
+    af_array rhs_arr = rhs.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_assign_seq(&out_arr, in_arr, 1, &ii.idx.seq, rhs_arr));
+
+    array out(out_arr);
+
+    ASSERT_EQ(a.dims(0), out.dims(0));
+    ASSERT_EQ(a.dims(1), out.dims(1));
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < nx * ny; i++) {
+        if (i < st || i > en)
+            ASSERT_EQ(hout[i], ha[i]) << "at " << i;
+        else
+            ASSERT_EQ(hout[i], val) << "at " << i;
+    }
+}
+
+TEST(Assign, LinearAssignGenSeq) {
+    const int nx    = 5;
+    const int ny    = 4;
+    const float val = 3;
+    const array rhs = constant(val, 1, 1);
+
+    const int st = nx - 2;
+    const int en = nx * (ny - 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = seq(st, en);
+
+    af_array in_arr  = a.get();
+    af_index_t ii    = idx.get();
+    af_array rhs_arr = rhs.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_assign_gen(&out_arr, in_arr, 1, &ii, rhs_arr));
+
+    array out(out_arr);
+
+    ASSERT_EQ(a.dims(0), out.dims(0));
+    ASSERT_EQ(a.dims(1), out.dims(1));
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < nx * ny; i++) {
+        if (i < st || i > en)
+            ASSERT_EQ(hout[i], ha[i]) << "at " << i;
+        else
+            ASSERT_EQ(hout[i], val) << "at " << i;
+    }
+}
+
+TEST(Assign, LinearAssignGenArr) {
+    const int nx    = 5;
+    const int ny    = 4;
+    const float val = 3;
+    const array rhs = constant(val, 1, 1);
+
+    const int st = nx - 2;
+    const int en = nx * (ny - 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = array(seq(st, en));
+
+    af_array in_arr  = a.get();
+    af_index_t ii    = idx.get();
+    af_array rhs_arr = rhs.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_assign_gen(&out_arr, in_arr, 1, &ii, rhs_arr));
+
+    array out(out_arr);
+
+    ASSERT_EQ(a.dims(0), out.dims(0));
+    ASSERT_EQ(a.dims(1), out.dims(1));
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < nx * ny; i++) {
+        if (i < st || i > en)
+            ASSERT_EQ(hout[i], ha[i]) << "at " << i;
+        else
+            ASSERT_EQ(hout[i], val) << "at " << i;
+    }
+}
+
+TEST(Assign, ISSUE_1764) {
+    int x   = 2;
+    int y   = 2;
+    int z   = 2;
+    array a = randu(x, y, z);
+    vector<float> ha0(a.elements());
+    a.host(&ha0[0]);
+    a(0, span, span) = a(1, span, span);
+    vector<float> ha1(a.elements());
+    a.host(&ha1[0]);
+    for (int k = 0; k < z; k++) {
+        for (int j = 0; j < y; j++) {
+            int offset = (j + k * y) * x;
+            ASSERT_EQ(ha0[offset + 1], ha1[offset + 0]);
+            ASSERT_EQ(ha0[offset + 1], ha1[offset + 1]);
+        }
+    }
+}
+
+TEST(Assign, ISSUE_1677) {
+    try {
+        dim_t sz      = 1;
+        array a       = constant(1.0f, 3, sz, f32);
+        array b       = constant(2.0f, 3, sz, f32);
+        array cond    = constant(0, sz, b8);  // all false
+        a(span, cond) = b(span, cond);
+    } catch (exception &ex) {
+        FAIL() << "ArrayFire exception: " << ex.what();
+    } catch (...) { FAIL() << "Unknown exception thrown"; }
+}
+
+TEST(Index, ISSUE_2533) {
+    int elements = 5 * 10;
+    std::vector<float> gold(elements, 0);
+
+    int assigned_elements = 5 * 6;
+    for (int i = 0; i < assigned_elements; i++) { gold[i] = 1; }
+
+    af::array a = constant(0, 5, 10);
+    af::array b = constant(1, 5, 10);
+
+    a(af::span, af::seq(0, 5)) = b(af::span, af::seq(0, 5));
+
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(5, 10), a);
 }
diff --git a/test/backend.cpp b/test/backend.cpp
new file mode 100644
index 0000000000..d6f9529c11
--- /dev/null
+++ b/test/backend.cpp
@@ -0,0 +1,137 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <atomic>
+#include <string>
+#include <thread>
+#include <vector>
+
+#include <af/device.h>
+
+using af::dtype_traits;
+using af::getAvailableBackends;
+using af::setBackend;
+using std::string;
+using std::vector;
+
+const char* getActiveBackendString(af_backend active) {
+    switch (active) {
+        case AF_BACKEND_CPU: return "AF_BACKEND_CPU";
+        case AF_BACKEND_CUDA: return "AF_BACKEND_CUDA";
+        case AF_BACKEND_OPENCL: return "AF_BACKEND_OPENCL";
+        default: return "AF_BACKEND_DEFAULT";
+    }
+}
+
+void testFunction(af_backend expected) {
+    af_backend activeBackend = (af_backend)0;
+    af_get_active_backend(&activeBackend);
+
+    ASSERT_EQ(expected, activeBackend);
+
+    af_array outArray = 0;
+    dim_t dims[]      = {32, 32};
+    EXPECT_EQ(AF_SUCCESS, af_randu(&outArray, 2, dims, f32));
+
+    // Verify backends returned by array and by function are the same
+    af_backend arrayBackend = (af_backend)0;
+    af_get_backend_id(&arrayBackend, outArray);
+    EXPECT_EQ(arrayBackend, activeBackend);
+
+    // cleanup
+    if (outArray != 0) { ASSERT_SUCCESS(af_release_array(outArray)); }
+}
+
+void backendTest() {
+    int backends = getAvailableBackends();
+
+    ASSERT_NE(backends, 0);
+
+    bool cpu    = backends & AF_BACKEND_CPU;
+    bool cuda   = backends & AF_BACKEND_CUDA;
+    bool opencl = backends & AF_BACKEND_OPENCL;
+
+    if (cpu) {
+        setBackend(AF_BACKEND_CPU);
+        testFunction(AF_BACKEND_CPU);
+    }
+
+    if (cuda) {
+        setBackend(AF_BACKEND_CUDA);
+        testFunction(AF_BACKEND_CUDA);
+    }
+
+    if (opencl) {
+        setBackend(AF_BACKEND_OPENCL);
+        testFunction(AF_BACKEND_OPENCL);
+    }
+}
+
+TEST(BACKEND_TEST, Basic) { backendTest(); }
+
+using af::getActiveBackend;
+
+void test_backend(std::atomic<int>& counter, int ntests,
+                  af::Backend default_backend, af::Backend test_backend) {
+    auto ta_backend = getActiveBackend();
+    ASSERT_EQ(default_backend, ta_backend);
+
+    // Wait until all threads reach this point
+    counter++;
+    while (counter < ntests) {}
+
+    setBackend(test_backend);
+
+    // Wait until all threads reach this point
+    counter++;
+    while (counter < 2 * ntests) {}
+
+    ta_backend = getActiveBackend();
+    ASSERT_EQ(test_backend, ta_backend);
+}
+
+TEST(Backend, Threads) {
+    using std::thread;
+    std::atomic<int> count(0);
+
+    setBackend(AF_BACKEND_DEFAULT);
+    auto default_backend = getActiveBackend();
+
+    int numbk = af::getBackendCount();
+
+    thread a, b, c;
+    if (af::getAvailableBackends() & AF_BACKEND_CPU) {
+        a = thread([&]() {
+            test_backend(count, numbk, default_backend, AF_BACKEND_CPU);
+        });
+    }
+
+    if (af::getAvailableBackends() & AF_BACKEND_OPENCL) {
+        b = thread([&]() {
+            test_backend(count, numbk, default_backend, AF_BACKEND_OPENCL);
+        });
+    }
+
+    if (af::getAvailableBackends() & AF_BACKEND_CUDA) {
+        c = thread([&]() {
+            test_backend(count, numbk, default_backend, AF_BACKEND_CUDA);
+        });
+    }
+
+    if (a.joinable()) a.join();
+    if (b.joinable()) b.join();
+    if (c.joinable()) c.join();
+}
diff --git a/test/basic.cpp b/test/basic.cpp
index d57c184be3..ebb211c7b7 100644
--- a/test/basic.cpp
+++ b/test/basic.cpp
@@ -7,140 +7,126 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/data.h>
-#include <vector>
+#include <gtest/gtest.h>
 #include <testHelpers.hpp>
+#include <af/data.h>
 
-using namespace std;
+#include <vector>
 
-TEST(BasicTests, constant1000x1000)
-{
-    if (noDoubleTests<float>()) return;
+using af::array;
+using af::constant;
+using af::dim4;
+using std::vector;
 
-    static const int ndims = 2;
+TEST(BasicTests, constant1000x1000) {
+    static const int ndims    = 2;
     static const int dim_size = 1000;
-    dim_t d[ndims] = {dim_size, dim_size};
+    dim_t d[ndims]            = {dim_size, dim_size};
 
     double valA = 3.9;
     af_array a;
-    ASSERT_EQ(AF_SUCCESS, af_constant(&a, valA, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&a, valA, ndims, d, f32));
 
     vector<float> h_a(dim_size * dim_size, 100);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_a[0], a));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_a[0], a));
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
-        ASSERT_FLOAT_EQ(valA, h_a[i]);
-    }
+    for (size_t i = 0; i < elements; i++) { ASSERT_FLOAT_EQ(valA, h_a[i]); }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(a));
 }
 
-TEST(BasicTests, constant10x10)
-{
-    if (noDoubleTests<float>()) return;
-
-    static const int ndims = 2;
+TEST(BasicTests, constant10x10) {
+    static const int ndims    = 2;
     static const int dim_size = 10;
-    dim_t d[2] = {dim_size, dim_size};
+    dim_t d[2]                = {dim_size, dim_size};
 
     double valA = 3.9;
     af_array a;
-    ASSERT_EQ(AF_SUCCESS, af_constant(&a, valA, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&a, valA, ndims, d, f32));
 
     vector<float> h_a(dim_size * dim_size, 0);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_a[0], a));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_a[0], a));
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
-        ASSERT_FLOAT_EQ(valA, h_a[i]);
-    }
+    for (size_t i = 0; i < elements; i++) { ASSERT_FLOAT_EQ(valA, h_a[i]); }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(a));
 }
 
-TEST(BasicTests, constant100x100)
-{
-    if (noDoubleTests<float>()) return;
-
-    static const int ndims = 2;
+TEST(BasicTests, constant100x100) {
+    static const int ndims    = 2;
     static const int dim_size = 100;
-    dim_t d[2] = {dim_size, dim_size};
+    dim_t d[2]                = {dim_size, dim_size};
 
     double valA = 4.9;
     af_array a;
-    ASSERT_EQ(AF_SUCCESS, af_constant(&a, valA, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&a, valA, ndims, d, f32));
 
     vector<float> h_a(dim_size * dim_size, 0);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_a[0], a));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_a[0], a));
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
-        ASSERT_FLOAT_EQ(valA, h_a[i]);
-    }
+    for (size_t i = 0; i < elements; i++) { ASSERT_FLOAT_EQ(valA, h_a[i]); }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(a));
 }
 
-//TODO: Test All The Types \o/
-TEST(BasicTests, AdditionSameType)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<double>()) return;
+// TODO: Test All The Types \o/
+TEST(BasicTests, AdditionSameType) {
+    SUPPORTED_TYPE_CHECK(double);
 
-    static const int ndims = 2;
+    static const int ndims    = 2;
     static const int dim_size = 100;
-    dim_t d[ndims] = {dim_size, dim_size};
+    dim_t d[ndims]            = {dim_size, dim_size};
 
-    double valA = 3.9;
-    double valB = 5.7;
-    double  valCf = valA + valB;
+    double valA  = 3.9;
+    double valB  = 5.7;
+    double valCf = valA + valB;
 
     af_array af32, bf32, cf32;
     af_array af64, bf64, cf64;
 
-    ASSERT_EQ(AF_SUCCESS, af_constant(&af32, valA, ndims, d, f32));
-    ASSERT_EQ(AF_SUCCESS, af_constant(&af64, valA, ndims, d, f64));
+    ASSERT_SUCCESS(af_constant(&af32, valA, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&af64, valA, ndims, d, f64));
 
-    ASSERT_EQ(AF_SUCCESS, af_constant(&bf32, valB, ndims, d, f32));
-    ASSERT_EQ(AF_SUCCESS, af_constant(&bf64, valB, ndims, d, f64));
+    ASSERT_SUCCESS(af_constant(&bf32, valB, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&bf64, valB, ndims, d, f64));
 
-    ASSERT_EQ(AF_SUCCESS, af_add(&cf32, af32, bf32, false));
-    ASSERT_EQ(AF_SUCCESS, af_add(&cf64, af64, bf64, false));
+    ASSERT_SUCCESS(af_add(&cf32, af32, bf32, false));
+    ASSERT_SUCCESS(af_add(&cf64, af64, bf64, false));
 
-    vector<float>  h_cf32 (dim_size * dim_size);
-    vector<double> h_cf64 (dim_size * dim_size);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_cf32[0], cf32));
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_cf64[0], cf64));
+    vector<float> h_cf32(dim_size * dim_size);
+    vector<double> h_cf64(dim_size * dim_size);
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_cf32[0], cf32));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_cf64[0], cf64));
 
     double err = 0;
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
+    for (size_t i = 0; i < elements; i++) {
         float df = h_cf32[i] - (valCf);
-        ASSERT_FLOAT_EQ(valCf,  h_cf32[i]);
-        ASSERT_FLOAT_EQ(valCf,  h_cf64[i]);
+        ASSERT_FLOAT_EQ(valCf, h_cf32[i]);
+        ASSERT_FLOAT_EQ(valCf, h_cf64[i]);
         err = err + df * df;
     }
     ASSERT_NEAR(0.0f, err, 1e-8);
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(af32));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(af64));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(bf32));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(bf64));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(cf32));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(cf64));
+    ASSERT_SUCCESS(af_release_array(af32));
+    ASSERT_SUCCESS(af_release_array(af64));
+    ASSERT_SUCCESS(af_release_array(bf32));
+    ASSERT_SUCCESS(af_release_array(bf64));
+    ASSERT_SUCCESS(af_release_array(cf32));
+    ASSERT_SUCCESS(af_release_array(cf64));
 }
 
-TEST(BasicTests, Additionf64f64)
-{
-    if (noDoubleTests<double>()) return;
+TEST(BasicTests, Additionf64f64) {
+    SUPPORTED_TYPE_CHECK(double);
 
-    static const int ndims = 2;
+    static const int ndims    = 2;
     static const int dim_size = 100;
-    dim_t d[ndims] = {dim_size, dim_size};
+    dim_t d[ndims]            = {dim_size, dim_size};
 
     double valA = 3.9;
     double valB = 5.7;
@@ -148,37 +134,34 @@ TEST(BasicTests, Additionf64f64)
 
     af_array a, b, c;
 
-    ASSERT_EQ(AF_SUCCESS, af_constant(&a, valA, ndims, d, f64));
-    ASSERT_EQ(AF_SUCCESS, af_constant(&b, valB, ndims, d, f64));
-    ASSERT_EQ(AF_SUCCESS, af_add(&c, a, b, false));
+    ASSERT_SUCCESS(af_constant(&a, valA, ndims, d, f64));
+    ASSERT_SUCCESS(af_constant(&b, valB, ndims, d, f64));
+    ASSERT_SUCCESS(af_add(&c, a, b, false));
 
     vector<double> h_c(dim_size * dim_size, 0);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_c[0], c));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_c[0], c));
 
     double err = 0;
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
+    for (size_t i = 0; i < elements; i++) {
         double df = h_c[i] - (valC);
         ASSERT_FLOAT_EQ(valA + valB, h_c[i]);
         err = err + df * df;
     }
     ASSERT_NEAR(0.0f, err, 1e-8);
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(b));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(c));
-
+    ASSERT_SUCCESS(af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(b));
+    ASSERT_SUCCESS(af_release_array(c));
 }
 
-TEST(BasicTests, Additionf32f64)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<double>()) return;
+TEST(BasicTests, Additionf32f64) {
+    SUPPORTED_TYPE_CHECK(double);
 
-    static const int ndims = 2;
+    static const int ndims    = 2;
     static const int dim_size = 100;
-    dim_t d[ndims] = {dim_size, dim_size};
+    dim_t d[ndims]            = {dim_size, dim_size};
 
     double valA = 3.9;
     double valB = 5.7;
@@ -186,135 +169,343 @@ TEST(BasicTests, Additionf32f64)
 
     af_array a, b, c;
 
-    ASSERT_EQ(AF_SUCCESS, af_constant(&a, valA, ndims, d, f32));
-    ASSERT_EQ(AF_SUCCESS, af_constant(&b, valB, ndims, d, f64));
-    ASSERT_EQ(AF_SUCCESS, af_add(&c, a, b, false));
+    ASSERT_SUCCESS(af_constant(&a, valA, ndims, d, f32));
+    ASSERT_SUCCESS(af_constant(&b, valB, ndims, d, f64));
+    ASSERT_SUCCESS(af_add(&c, a, b, false));
 
     vector<double> h_c(dim_size * dim_size);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void **)&h_c[0], c));
+    ASSERT_SUCCESS(af_get_data_ptr((void **)&h_c[0], c));
 
     double err = 0;
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
+    for (size_t i = 0; i < elements; i++) {
         double df = h_c[i] - (valC);
         ASSERT_FLOAT_EQ(valA + valB, h_c[i]);
         err = err + df * df;
     }
     ASSERT_NEAR(0.0f, err, 1e-8);
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(b));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(c));
+    ASSERT_SUCCESS(af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(b));
+    ASSERT_SUCCESS(af_release_array(c));
 }
 
-TEST(BasicArrayTests, constant10x10)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(BasicArrayTests, constant10x10) {
     dim_t dim_size = 10;
-    double valA = 3.14;
-    af::array a = af::constant(valA, dim_size, dim_size, f32);
+    double valA    = 3.14;
+    array a        = constant(valA, dim_size, dim_size, f32);
 
     vector<float> h_a(dim_size * dim_size, 0);
     a.host(&h_a.front());
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
-        ASSERT_FLOAT_EQ(valA, h_a[i]);
-    }
+    for (size_t i = 0; i < elements; i++) { ASSERT_FLOAT_EQ(valA, h_a[i]); }
 }
 
-////////////////////////////////////// CPP Tests //////////////////////////////////
+////////////////////////////////////// CPP Tests
+/////////////////////////////////////
 using af::dim4;
 
-TEST(BasicTests, constant100x100_CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(BasicTests, constant100x100_CPP) {
     static const int dim_size = 100;
-    dim_t d[2] = {dim_size, dim_size};
+    dim_t d[2]                = {dim_size, dim_size};
 
     double valA = 4.9;
     dim4 dims(d[0], d[1]);
-    af::array a = constant(valA, dims);
+    array a = constant(valA, dims);
 
     vector<float> h_a(dim_size * dim_size, 0);
-    a.host((void**)&h_a[0]);
+    a.host((void **)&h_a[0]);
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
-        ASSERT_FLOAT_EQ(valA, h_a[i]);
-    }
+    for (size_t i = 0; i < elements; i++) { ASSERT_FLOAT_EQ(valA, h_a[i]); }
 }
 
-//TODO: Test All The Types \o/
-TEST(BasicTests, AdditionSameType_CPP)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<double>()) return;
+// TODO: Test All The Types \o/
+TEST(BasicTests, AdditionSameType_CPP) {
+    SUPPORTED_TYPE_CHECK(double);
 
     static const int dim_size = 100;
-    dim_t d[2] = {dim_size, dim_size};
+    dim_t d[2]                = {dim_size, dim_size};
     dim4 dims(d[0], d[1]);
 
-    double valA = 3.9;
-    double valB = 5.7;
-    double  valCf = valA + valB;
+    double valA  = 3.9;
+    double valB  = 5.7;
+    double valCf = valA + valB;
 
-    af::array a32 = constant(valA, dims, f32);
-    af::array b32 = constant(valB, dims, f32);
-    af::array c32 = a32 + b32;
+    array a32 = constant(valA, dims, f32);
+    array b32 = constant(valB, dims, f32);
+    array c32 = a32 + b32;
 
-    af::array a64 = constant(valA, dims, f64);
-    af::array b64 = constant(valB, dims, f64);
-    af::array c64 = a64 + b64;
+    array a64 = constant(valA, dims, f64);
+    array b64 = constant(valB, dims, f64);
+    array c64 = a64 + b64;
 
-    vector<float>  h_cf32 (dim_size * dim_size);
-    vector<double> h_cf64 (dim_size * dim_size);
+    vector<float> h_cf32(dim_size * dim_size);
+    vector<double> h_cf64(dim_size * dim_size);
 
-    c32.host((void**)&h_cf32[0]);
-    c64.host((void**)&h_cf64[0]);
+    c32.host((void **)&h_cf32[0]);
+    c64.host((void **)&h_cf64[0]);
 
     double err = 0;
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
+    for (size_t i = 0; i < elements; i++) {
         float df = h_cf32[i] - (valCf);
-        ASSERT_FLOAT_EQ(valCf,  h_cf32[i]);
-        ASSERT_FLOAT_EQ(valCf,  h_cf64[i]);
+        ASSERT_FLOAT_EQ(valCf, h_cf32[i]);
+        ASSERT_FLOAT_EQ(valCf, h_cf64[i]);
         err = err + df * df;
     }
     ASSERT_NEAR(0.0f, err, 1e-8);
 }
 
-TEST(BasicTests, Additionf32f64_CPP)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<double>()) return;
+TEST(BasicTests, Additionf32f64_CPP) {
+    SUPPORTED_TYPE_CHECK(double);
 
     static const int dim_size = 100;
-    dim_t d[2] = {dim_size, dim_size};
+    dim_t d[2]                = {dim_size, dim_size};
     dim4 dims(d[0], d[1]);
 
     double valA = 3.9;
     double valB = 5.7;
     double valC = valA + valB;
 
-    af::array a = constant(valA, dims);
-    af::array b = constant(valB, dims, f64);
-    af::array c = a + b;
+    array a = constant(valA, dims);
+    array b = constant(valB, dims, f64);
+    array c = a + b;
 
     vector<double> h_c(dim_size * dim_size);
-    c.host((void**)&h_c[0]);
+    c.host((void **)&h_c[0]);
 
     double err = 0;
 
     size_t elements = dim_size * dim_size;
-    for(size_t i = 0; i < elements; i++) {
+    for (size_t i = 0; i < elements; i++) {
         double df = h_c[i] - (valC);
         ASSERT_FLOAT_EQ(valA + valB, h_c[i]);
         err = err + df * df;
     }
     ASSERT_NEAR(0.0f, err, 1e-8);
 }
+
+TEST(Assert, TestEqualsCpp) {
+    array gold = constant(1, 10, 10);
+    array out  = constant(1, 10, 10);
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_TRUE(assertArrayEq("gold", "out", gold, out));
+}
+
+TEST(Assert, TestEqualsC) {
+    af_array gold = 0;
+    af_array out  = 0;
+    dim_t dims[]  = {10, 10, 1, 1};
+    af_constant(&gold, 1.0, 4, dims, f32);
+    af_constant(&out, 1.0, 4, dims, f32);
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_TRUE(assertArrayEq("gold", "out", gold, out));
+
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(gold));
+}
+
+TEST(Assert, TestEqualsDiffTypes) {
+    SUPPORTED_TYPE_CHECK(double);
+    array gold = constant(1, 10, 10, f64);
+    array out  = constant(1, 10, 10);
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_FALSE(assertArrayEq("gold", "out", gold, out));
+}
+
+TEST(Assert, TestEqualsDiffSizes) {
+    array gold = constant(1, 10, 9);
+    array out  = constant(1, 10, 10);
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_FALSE(assertArrayEq("gold", "out", gold, out));
+}
+
+TEST(Assert, TestEqualsDiffValue) {
+    array gold = constant(1, 3, 3);
+    array out  = gold;
+    out(2, 2)  = 2;
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_FALSE(assertArrayEq("gold", "out", gold, out));
+}
+
+TEST(Assert, TestEqualsDiffComplexValue) {
+    array gold = constant(af::cfloat(3.1f, 3.1f), 3, 3, c32);
+    array out  = gold;
+    out(2, 2)  = 2.2;
+
+    // Testing this macro
+    // ASSERT_ARRAYS_EQ(gold, out);
+    ASSERT_FALSE(assertArrayEq("gold", "out", gold, out));
+}
+
+TEST(Assert, TestVectorEquals) {
+    array out = constant(3.1f, 3, 3);
+
+    vector<float> gold(out.elements());
+    dim4 goldDims(3, 3);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_TRUE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestVectorDiffVecType) {
+    array out = constant(3.1f, 3, 3);
+
+    vector<int> gold(out.elements());
+    dim4 goldDims(3, 3);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_FALSE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestVectorDiffGoldSizeDims) {
+    array out = constant(3.1f, 3, 3);
+
+    vector<float> gold(3 * 3);
+    dim4 goldDims(3, 2);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_FALSE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestVectorDiffOutSizeGoldSize) {
+    array out = constant(3.1f, 3, 3);
+
+    vector<float> gold(3 * 2);
+    dim4 goldDims(3, 2);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_FALSE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestVectorDiffDim4) {
+    array out = constant(3.1f, 3, 3);
+    vector<float> gold(out.elements());
+    dim4 goldDims(3, 2);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_FALSE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestVectorDiffVecSize) {
+    array out = constant(3.1f, 3, 3);
+    vector<float> gold(out.elements() - 1);
+    dim4 goldDims(3, 3);
+    fill(gold.begin(), gold.end(), 3.1f);
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_EQ(gold, goldDims, out);
+    ASSERT_FALSE(assertArrayEq("gold", "goldDims", "out", gold, goldDims, out));
+}
+
+TEST(Assert, TestArraysNearC) {
+    af_array gold = 0;
+    af_array out  = 0;
+    dim_t dims[]  = {10, 10, 1, 1};
+    af_constant(&gold, 2.2345f, 4, dims, f32);
+    af_constant(&out, 2.2346f, 4, dims, f32);
+
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_ARRAYS_NEAR(gold, out, maxDiff);
+    ASSERT_TRUE(assertArrayNear("gold", "out", "maxDiff", gold, out, maxDiff));
+
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(gold));
+}
+
+TEST(Assert, TestVecArrayNearC) {
+    vector<float> gold(3 * 3);
+    fill(gold.begin(), gold.end(), 2.2345f);
+    dim4 goldDims(3, 3);
+
+    af_array out = 0;
+    dim_t dims[] = {3, 3, 1, 1};
+    af_constant(&out, 2.2346f, 4, dims, f32);
+
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_NEAR(gold, goldDims, out, maxDiff);
+    ASSERT_TRUE(assertArrayNear("gold", "goldDims", "out", "maxDiff", gold,
+                                goldDims, out, maxDiff));
+
+    ASSERT_SUCCESS(af_release_array(out));
+}
+
+TEST(Assert, TestArraysNearWithinThresh) {
+    array gold = constant(2.2345f, 3, 3);
+    array out  = gold;
+    out(2, 2) += 0.0001f;
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_ARRAYS_NEAR(gold, out, maxDiff);
+    ASSERT_TRUE(assertArrayNear("gold", "out", "maxDiff", gold, out, maxDiff));
+}
+
+TEST(Assert, TestArraysNearExceedThresh) {
+    array gold = constant(2.2345f, 3, 3);
+    array out  = gold;
+    out(2, 2) += 0.002f;
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_ARRAYS_NEAR(gold, out, maxDiff);
+    ASSERT_FALSE(assertArrayNear("gold", "out", "maxDiff", gold, out, maxDiff));
+}
+
+TEST(Assert, TestVecArrayNearWithinThresh) {
+    vector<float> gold(3 * 3);
+    fill(gold.begin(), gold.end(), 2.2345f);
+    dim4 goldDims(3, 3);
+
+    array out = constant(2.2345f, goldDims);
+    out(2, 2) += 0.0001f;
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_NEAR(gold, goldDims, out, maxDiff);
+    ASSERT_TRUE(assertArrayNear("gold", "goldDims", "out", "maxDiff", gold,
+                                goldDims, out, maxDiff));
+}
+
+TEST(Assert, TestVecArrayNearExceedThresh) {
+    vector<float> gold(3 * 3);
+    fill(gold.begin(), gold.end(), 2.2345f);
+    dim4 goldDims(3, 3);
+
+    array out = constant(2.2345f, goldDims);
+    out(2, 2) += 0.002f;
+    float maxDiff = 0.001f;
+
+    // Testing this macro
+    // ASSERT_VEC_ARRAY_NEAR(gold, goldDims, out, maxDiff);
+    ASSERT_FALSE(assertArrayNear("gold", "goldDims", "out", "maxDiff", gold,
+                                 goldDims, out, maxDiff));
+}
diff --git a/test/basic_c.c b/test/basic_c.c
new file mode 100644
index 0000000000..b6f3f39f13
--- /dev/null
+++ b/test/basic_c.c
@@ -0,0 +1,18 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+
+int main() {
+    af_array out = 0;
+    dim_t s[]    = {10, 10, 1, 1};
+    af_err e     = af_randu(&out, 4, s, f32);
+    if (out != 0) af_release_array(out);
+    return (AF_SUCCESS != e);
+}
diff --git a/test/bilateral.cpp b/test/bilateral.cpp
index c80d376b52..12b27fc33f 100644
--- a/test/bilateral.cpp
+++ b/test/bilateral.cpp
@@ -7,195 +7,188 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <cmath>
 #include <string>
 #include <vector>
-#include <cmath>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
 using std::string;
 using std::vector;
-using af::dim4;
 
 template<typename T, bool isColor>
-void bilateralTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void bilateralTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
     readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray   = 0;
+        af_array outArray  = 0;
+        af_array goldArray = 0;
+        dim_t nElems       = 0;
 
-        af_array inArray  = 0;
-        af_array outArray = 0;
-        af_array goldArray= 0;
-        dim_t nElems   = 0;
+        inFiles[testId].insert(0, string(TEST_DIR "/bilateral/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/bilateral/"));
 
-        inFiles[testId].insert(0,string(TEST_DIR"/bilateral/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/bilateral/"));
+        ASSERT_SUCCESS(
+            af_load_image(&inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(
+            af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
 
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray, inFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems, goldArray));
+        ASSERT_SUCCESS(
+            af_bilateral(&outArray, inArray, 2.25f, 25.56f, isColor));
 
-        ASSERT_EQ(AF_SUCCESS, af_bilateral(&outArray, inArray, 2.25f, 25.56f, isColor));
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.02f);
 
-        T * outData = new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
-
-        T * goldData= new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)goldData, goldArray));
-
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.02f));
-
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(goldArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
     }
 }
 
-TEST(BilateralOnImage, Grayscale)
-{
-    bilateralTest<float, false>(string(TEST_DIR"/bilateral/gray.test"));
+TEST(BilateralOnImage, Grayscale) {
+    bilateralTest<float, false>(string(TEST_DIR "/bilateral/gray.test"));
 }
 
-TEST(BilateralOnImage, Color)
-{
-    bilateralTest<float, true>(string(TEST_DIR"/bilateral/color.test"));
+TEST(BilateralOnImage, Color) {
+    bilateralTest<float, true>(string(TEST_DIR "/bilateral/color.test"));
 }
 
-
 template<typename T>
-class BilateralOnData : public ::testing::Test
-{
-};
+class BilateralOnData : public ::testing::Test {};
 
-typedef ::testing::Types<float, double, int, uint, char, uchar> DataTestTypes;
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    DataTestTypes;
 
 // register the type list
-TYPED_TEST_CASE(BilateralOnData, DataTestTypes);
+TYPED_TEST_SUITE(BilateralOnData, DataTestTypes);
 
 template<typename inType>
-void bilateralDataTest(string pTestFile)
-{
-    if (noDoubleTests<inType>()) return;
+void bilateralDataTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(inType);
 
-    typedef typename cond_type<is_same_type<inType, double>::value, double, float>::type outType;
+    typedef typename cond_type<is_same_type<inType, double>::value, double,
+                               float>::type outType;
 
-    vector<af::dim4>        numDims;
-    vector<vector<inType> >       in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTests<inType, outType, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims      = numDims[0];
-    af_array outArray  = 0;
-    af_array inArray   = 0;
-    outType *outData;
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<inType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_bilateral(&outArray, inArray, 2.25f, 25.56f, false));
+    ASSERT_SUCCESS(af_bilateral(&outArray, inArray, 2.25f, 25.56f, false));
 
-    outData = new outType[dims.elements()];
+    vector<outType> outData(dims.elements());
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData.data(), outArray));
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<outType> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        ASSERT_EQ(true, compareArraysRMSD(nElems, &currGoldBar.front(), outData, 0.02f));
+        size_t nElems               = currGoldBar.size();
+        ASSERT_EQ(true, compareArraysRMSD(nElems, &currGoldBar.front(),
+                                          outData.data(), 0.02f));
     }
 
     // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(BilateralOnData, Rectangle)
-{
-    bilateralDataTest<TypeParam>(string(TEST_DIR"/bilateral/rectangle.test"));
+TYPED_TEST(BilateralOnData, Rectangle) {
+    bilateralDataTest<TypeParam>(string(TEST_DIR "/bilateral/rectangle.test"));
 }
 
-TYPED_TEST(BilateralOnData, Rectangle_Batch)
-{
-    bilateralDataTest<TypeParam>(string(TEST_DIR"/bilateral/rectangle_batch.test"));
+TYPED_TEST(BilateralOnData, Rectangle_Batch) {
+    bilateralDataTest<TypeParam>(
+        string(TEST_DIR "/bilateral/rectangle_batch.test"));
 }
 
-TYPED_TEST(BilateralOnData, InvalidArgs)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(BilateralOnData, InvalidArgs) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    vector<TypeParam>   in(100,1);
+    vector<TypeParam> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
     // check for color image bilateral
-    af::dim4 dims = af::dim4(100,1,1,1);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<TypeParam>::af_type));
-    ASSERT_EQ(AF_ERR_SIZE, af_bilateral(&outArray, inArray, 0.12f, 0.34f, true));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    dim4 dims = dim4(100, 1, 1, 1);
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<TypeParam>::af_type));
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_bilateral(&outArray, inArray, 0.12f, 0.34f, true));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
 // C++ unit tests
-TEST(Bilateral, CPP)
-{
-    if (noDoubleTests<float>()) return;
 
-    using af::array;
+using af::array;
+using af::bilateral;
 
-    vector<af::dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+TEST(Bilateral, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTests<float, float, float>(string(TEST_DIR"/bilateral/rectangle.test"), numDims, in, tests);
+    readTests<float, float, float>(string(TEST_DIR "/bilateral/rectangle.test"),
+                                   numDims, in, tests);
 
-    af::dim4 dims      = numDims[0];
+    dim4 dims = numDims[0];
 
     array a(dims, &(in[0].front()));
-    array b = af::bilateral(a, 2.25f, 25.56f, false);
+    array b = bilateral(a, 2.25f, 25.56f, false);
 
-    float *outData = new float[dims.elements()];
-    b.host(outData);
+    vector<float> outData(dims.elements());
+    b.host(outData.data());
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<float> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        ASSERT_EQ(true, compareArraysRMSD(nElems, &currGoldBar.front(), outData, 0.02f));
+        size_t nElems             = currGoldBar.size();
+        ASSERT_EQ(true, compareArraysRMSD(nElems, currGoldBar.data(),
+                                          outData.data(), 0.02f));
     }
-
-    // cleanup
-    delete[] outData;
 }
 
+using af::constant;
+using af::iota;
+using af::max;
+using af::seq;
+using af::span;
 
-TEST(bilateral, GFOR)
-{
-    using namespace af;
-
+TEST(bilateral, GFOR) {
     dim4 dims = dim4(10, 10, 3);
-    array A = iota(dims);
-    array B = constant(0, dims);
+    array A   = iota(dims);
+    array B   = constant(0, dims);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = bilateral(A(span, span, ii), 3, 5);
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = bilateral(A(span, span, ii), 3, 5); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = bilateral(A(span, span, ii), 3, 5);
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
diff --git a/test/binary.cpp b/test/binary.cpp
index f4e432f288..7fd47bcfbd 100644
--- a/test/binary.cpp
+++ b/test/binary.cpp
@@ -1,5 +1,5 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
@@ -8,14 +8,23 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/random.h>
+#include <af/half.h>
+#include "half.hpp"  // note: not common/half.hpp; from extern/half/include/half.hpp
+
+#include <cfenv>
+#include <cmath>
 
 using namespace std;
 using namespace af;
 
+using half_float_half = half_float::half;
+
 const int num = 10000;
 
 #define add(left, right) (left) + (right)
@@ -23,137 +32,136 @@ const int num = 10000;
 #define mul(left, right) (left) * (right)
 #define div(left, right) (left) / (right)
 
-template<typename T> T mod(T a, T b)
-{
+typedef std::complex<float> complex_float;
+typedef std::complex<double> complex_double;
+
+template<typename T>
+T mod(T a, T b) {
     return std::fmod(a, b);
 }
 
-af::array randgen(const int num, af::dtype ty)
-{
-    af::array tmp = af::round(1 + 2 * af::randu(num, f32)).as(ty);
+template<typename T>
+T rem(T x, T y) {
+    return remainder(x, y);
+}
+
+af::array randgen(const int num, dtype ty) {
+    af::array tmp = round(1 + 2 * af::randu(num, f32)).as(ty);
     tmp.eval();
     return tmp;
 }
 
-#define BINARY_TESTS(Ta, Tb, Tc, func)                                  \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb)                        \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randgen(num, ta);                                 \
-        af::array b = randgen(num, tb);                                 \
-        af::array c = func(a, b);                                       \
-        Ta *h_a = a.host<Ta>();                                         \
-        Tb *h_b = b.host<Tb>();                                         \
-        Tc *h_c = c.host<Tc>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], func(h_a[i], h_b[i])) <<                  \
-                "for values: " << h_a[i]  << "," << h_b[i] << std::endl; \
-        delete[] h_a;                                                   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-                                                                        \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_left)                 \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af::array a = randgen(num, ta);                                 \
-        Tb h_b = 3.0;                                                   \
-        af::array c = func(a, h_b);                                     \
-        Ta *h_a = a.host<Ta>();                                         \
-        Tc *h_c = c.host<Tc>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], func(h_a[i], h_b)) <<                     \
-                "for values: " << h_a[i]  << "," << h_b << std::endl;   \
-        delete[] h_a;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-                                                                        \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_right)                \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-                                                                        \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        Ta h_a = 5.0;                                                   \
-        af::array b = randgen(num, tb);                                 \
-        af::array c = func(h_a, b);                                     \
-        Tb *h_b = b.host<Tb>();                                         \
-        Tb *h_c = c.host<Tb>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], func(h_a, h_b[i])) <<                     \
-                "for values: " << h_a  << "," << h_b[i] << std::endl;   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-
-
-#define BINARY_TESTS_NEAR(Ta, Tb, Tc, func, err)                        \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb)                        \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randgen(num, ta);                                 \
-        af::array b = randgen(num, tb);                                 \
-        af::array c = func(a, b);                                       \
-        Ta *h_a = a.host<Ta>();                                         \
-        Tb *h_b = b.host<Tb>();                                         \
-        Tc *h_c = c.host<Tc>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_NEAR(h_c[i], func(h_a[i], h_b[i]), err) <<           \
-                "for values: " << h_a[i]  << "," << h_b[i] << std::endl; \
-        delete[] h_a;                                                   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-                                                                        \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_left)                 \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af::array a = randgen(num, ta);                                 \
-        Tb h_b = 0.3;                                                   \
-        af::array c = func(a, h_b);                                     \
-        Ta *h_a = a.host<Ta>();                                         \
-        Ta *h_c = c.host<Ta>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_NEAR(h_c[i], func(h_a[i], h_b), err) <<              \
-                "for values: " << h_a[i]  << "," << h_b << std::endl;   \
-        delete[] h_a;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-                                                                        \
-    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_right)                \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        Ta h_a = 0.3;                                                   \
-        af::array b = randgen(num, tb);                                 \
-        af::array c = func(h_a, b);                                     \
-        Tb *h_b = b.host<Tb>();                                         \
-        Tb *h_c = c.host<Tb>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_NEAR(h_c[i], func(h_a, h_b[i]), err) <<              \
-                "for values: " << h_a  << "," << h_b[i] << std::endl;   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
+#define MY_ASSERT_NEAR(aa, bb, cc) ASSERT_NEAR(abs(aa), abs(bb), (cc))
+
+#define BINARY_TESTS(Ta, Tb, Tc, func)                                    \
+    TEST(BinaryTests, Test_##func##_##Ta##_##Tb) {                        \
+        SUPPORTED_TYPE_CHECK(Ta);                                         \
+        SUPPORTED_TYPE_CHECK(Tb);                                         \
+        SUPPORTED_TYPE_CHECK(Tc);                                         \
+                                                                          \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;                \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;                \
+        af::array a = randgen(num, ta);                                   \
+        af::array b = randgen(num, tb);                                   \
+        af::array c = func(a, b);                                         \
+        Ta *h_a     = a.host<Ta>();                                       \
+        Tb *h_b     = b.host<Tb>();                                       \
+        vector<Tc> gold(num);                                             \
+        for (int i = 0; i < num; i++) { gold[i] = func(h_a[i], h_b[i]); } \
+        ASSERT_VEC_ARRAY_EQ(gold, dim4(num), c);                          \
+        af_free_host(h_a);                                                \
+        af_free_host(h_b);                                                \
+    }                                                                     \
+                                                                          \
+    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_left) {                 \
+        SUPPORTED_TYPE_CHECK(Ta);                                         \
+        SUPPORTED_TYPE_CHECK(Tb);                                         \
+                                                                          \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;                \
+        af::array a = randgen(num, ta);                                   \
+        Tb h_b      = 3.0;                                                \
+        af::array c = func(a, h_b);                                       \
+        Ta *h_a     = a.host<Ta>();                                       \
+        vector<Tc> gold(num);                                             \
+        for (int i = 0; i < num; i++) { gold[i] = func(h_a[i], h_b); }    \
+        ASSERT_VEC_ARRAY_EQ(gold, dim4(num), c);                          \
+        af_free_host(h_a);                                                \
+    }                                                                     \
+                                                                          \
+    TEST(BinaryTests, Test_##func##_##Ta##_##Tb##_right) {                \
+        SUPPORTED_TYPE_CHECK(Ta);                                         \
+        SUPPORTED_TYPE_CHECK(Tb);                                         \
+                                                                          \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;                \
+        Ta h_a      = 5.0;                                                \
+        af::array b = randgen(num, tb);                                   \
+        af::array c = func(h_a, b);                                       \
+        Tb *h_b     = b.host<Tb>();                                       \
+        vector<Tc> gold(num);                                             \
+        for (int i = 0; i < num; i++) { gold[i] = func(h_a, h_b[i]); }    \
+        ASSERT_VEC_ARRAY_EQ(gold, dim4(num), c);                          \
+        af_free_host(h_b);                                                \
+    }
+
+#define BINARY_TESTS_NEAR_GENERAL(Ta, Tb, Tc, Td, Te, func, err)      \
+    TEST(BinaryTestsFloating, Test_##func##_##Ta##_##Tb) {            \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;            \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;            \
+        af::array a = randgen(num, ta);                               \
+        af::array b = randgen(num, tb);                               \
+        af::array c = func(a, b);                                     \
+        Ta *h_a     = a.host<Ta>();                                   \
+        Tb *h_b     = b.host<Tb>();                                   \
+        Tc *h_c     = c.host<Tc>();                                   \
+        for (int i = 0; i < num; i++)                                 \
+            MY_ASSERT_NEAR(h_c[i], func(h_a[i], h_b[i]), (err))       \
+                << "for values: " << h_a[i] << "," << h_b[i] << endl; \
+        af_free_host(h_a);                                            \
+        af_free_host(h_b);                                            \
+        af_free_host(h_c);                                            \
+    }                                                                 \
+                                                                      \
+    TEST(BinaryTestsFloating, Test_##func##_##Ta##_##Tb##_left) {     \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+                                                                      \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;            \
+        af::array a = randgen(num, ta);                               \
+        Tb h_b      = (Tb)0.3;                                        \
+        af::array c = func(a, h_b);                                   \
+        Ta *h_a     = a.host<Ta>();                                   \
+        Td *h_d     = c.host<Td>();                                   \
+        for (int i = 0; i < num; i++)                                 \
+            MY_ASSERT_NEAR(h_d[i], func(h_a[i], h_b), err)            \
+                << "for values: " << h_a[i] << "," << h_b << endl;    \
+        af_free_host(h_a);                                            \
+        af_free_host(h_d);                                            \
+    }                                                                 \
+                                                                      \
+    TEST(BinaryTestsFloating, Test_##func##_##Ta##_##Tb##_right) {    \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;            \
+        Ta h_a      = (Ta)0.3;                                        \
+        af::array b = randgen(num, tb);                               \
+        af::array c = func(h_a, b);                                   \
+        Tb *h_b     = b.host<Tb>();                                   \
+        Te *h_e     = c.host<Te>();                                   \
+        for (int i = 0; i < num; i++)                                 \
+            MY_ASSERT_NEAR(h_e[i], func(h_a, h_b[i]), err)            \
+                << "for values: " << h_a << "," << h_b[i] << endl;    \
+        af_free_host(h_b);                                            \
+        af_free_host(h_e);                                            \
+    }
+
+#define BINARY_TESTS_NEAR(Ta, Tb, Tc, func, err) \
+    BINARY_TESTS_NEAR_GENERAL(Ta, Tb, Tc, Ta, Tc, func, err)
 
 #define BINARY_TESTS_FLOAT(func) BINARY_TESTS(float, float, float, func)
 #define BINARY_TESTS_DOUBLE(func) BINARY_TESTS(double, double, double, func)
@@ -164,16 +172,21 @@ af::array randgen(const int num, af::dtype ty)
 #define BINARY_TESTS_UINT(func) BINARY_TESTS(uint, uint, uint, func)
 #define BINARY_TESTS_INTL(func) BINARY_TESTS(intl, intl, intl, func)
 #define BINARY_TESTS_UINTL(func) BINARY_TESTS(uintl, uintl, uintl, func)
-#define BINARY_TESTS_NEAR_FLOAT(func) BINARY_TESTS_NEAR(float, float, float, func, 1e-5)
-#define BINARY_TESTS_NEAR_DOUBLE(func) BINARY_TESTS_NEAR(double, double, double, func, 1e-10)
+#define BINARY_TESTS_NEAR_HALF(func) \
+    BINARY_TESTS_NEAR(half_float_half, half_float_half, half_float_half, func, 1e-3)
+#define BINARY_TESTS_NEAR_FLOAT(func) \
+    BINARY_TESTS_NEAR(float, float, float, func, 1e-5)
+#define BINARY_TESTS_NEAR_DOUBLE(func) \
+    BINARY_TESTS_NEAR(double, double, double, func, 1e-10)
 
 BINARY_TESTS_FLOAT(add)
 BINARY_TESTS_FLOAT(sub)
 BINARY_TESTS_FLOAT(mul)
-BINARY_TESTS_NEAR(float, float, float, div, 1e-3) // FIXME
+BINARY_TESTS_NEAR(float, float, float, div, 1e-3)  // FIXME
 BINARY_TESTS_FLOAT(min)
 BINARY_TESTS_FLOAT(max)
-BINARY_TESTS_NEAR(float, float, float, mod, 1e-5) // FIXME
+BINARY_TESTS_NEAR(float, float, float, mod, 1e-5)  // FIXME
+BINARY_TESTS_FLOAT(rem)
 
 BINARY_TESTS_DOUBLE(add)
 BINARY_TESTS_DOUBLE(sub)
@@ -182,11 +195,16 @@ BINARY_TESTS_DOUBLE(div)
 BINARY_TESTS_DOUBLE(min)
 BINARY_TESTS_DOUBLE(max)
 BINARY_TESTS_DOUBLE(mod)
+BINARY_TESTS_DOUBLE(rem)
 
 BINARY_TESTS_NEAR_FLOAT(atan2)
 BINARY_TESTS_NEAR_FLOAT(pow)
 BINARY_TESTS_NEAR_FLOAT(hypot)
 
+BINARY_TESTS_NEAR_HALF(atan2)
+BINARY_TESTS_NEAR_HALF(pow)
+BINARY_TESTS_NEAR_HALF(hypot)
+
 BINARY_TESTS_NEAR_DOUBLE(atan2)
 BINARY_TESTS_NEAR_DOUBLE(pow)
 BINARY_TESTS_NEAR_DOUBLE(hypot)
@@ -194,18 +212,26 @@ BINARY_TESTS_NEAR_DOUBLE(hypot)
 BINARY_TESTS_INT(add)
 BINARY_TESTS_INT(sub)
 BINARY_TESTS_INT(mul)
+BINARY_TESTS_INT(div)
+BINARY_TESTS_INT(pow)
 
 BINARY_TESTS_UINT(add)
 BINARY_TESTS_UINT(sub)
 BINARY_TESTS_UINT(mul)
+BINARY_TESTS_UINT(div)
+BINARY_TESTS_UINT(pow)
 
 BINARY_TESTS_INTL(add)
 BINARY_TESTS_INTL(sub)
 BINARY_TESTS_INTL(mul)
+BINARY_TESTS_INTL(div)
+BINARY_TESTS_INTL(pow)
 
 BINARY_TESTS_UINTL(add)
 BINARY_TESTS_UINTL(sub)
 BINARY_TESTS_UINTL(mul)
+BINARY_TESTS_UINTL(div)
+BINARY_TESTS_UINTL(pow)
 
 BINARY_TESTS_CFLOAT(add)
 BINARY_TESTS_CFLOAT(sub)
@@ -219,28 +245,46 @@ BINARY_TESTS_NEAR(float, double, double, sub, 1e-5)
 BINARY_TESTS_NEAR(float, double, double, mul, 1e-5)
 BINARY_TESTS_NEAR(float, double, double, div, 1e-5)
 
-#define BITOP(func, T, op)                                  \
-    TEST(BinaryTests, Test_##func##_##T)                    \
-    {                                                       \
-        af_dtype ty = (af_dtype)dtype_traits<T>::af_type;   \
-        const T vala = 4095;                                \
-        const T valb = 3;                                   \
-        const T valc = vala op valb;                        \
-        const int num = 10;                                 \
-        af::array a = af::constant(vala, num, ty);          \
-        af::array b = af::constant(valb, num, ty);          \
-        af::array c = a op b;                               \
-        T *h_a = a.host<T>();                               \
-        T *h_b = b.host<T>();                               \
-        T *h_c = c.host<T>();                               \
-        for (int i = 0; i < num; i++)                       \
-            ASSERT_EQ(h_c[i], valc) <<                      \
-                "for values: " << h_a[i]  <<                \
-                "," << h_b[i] << std::endl;                 \
-        delete[] h_a;                                       \
-        delete[] h_b;                                       \
-        delete[] h_c;                                       \
-    }                                                       \
+BINARY_TESTS_NEAR(cfloat, cdouble, cdouble, add, 1e-5)
+BINARY_TESTS_NEAR(cfloat, cdouble, cdouble, sub, 1e-5)
+BINARY_TESTS_NEAR(cfloat, cdouble, cdouble, mul, 1e-5)
+BINARY_TESTS_NEAR(cfloat, cdouble, cdouble, div, 1e-5)
+
+BINARY_TESTS_NEAR_GENERAL(float, cfloat, cfloat, cfloat, cfloat, add, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(float, cfloat, cfloat, cfloat, cfloat, sub, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(float, cfloat, cfloat, cfloat, cfloat, mul, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(float, cfloat, cfloat, cfloat, cfloat, div, 1e-5)
+
+BINARY_TESTS_NEAR_GENERAL(double, cfloat, cdouble, cdouble, cfloat, add, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(double, cfloat, cdouble, cdouble, cfloat, sub, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(double, cfloat, cdouble, cdouble, cfloat, mul, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(double, cfloat, cdouble, cdouble, cfloat, div, 1e-5)
+
+BINARY_TESTS_NEAR_GENERAL(cfloat, double, cdouble, cfloat, cdouble, add, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(cfloat, double, cdouble, cfloat, cdouble, sub, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(cfloat, double, cdouble, cfloat, cdouble, mul, 1e-5)
+BINARY_TESTS_NEAR_GENERAL(cfloat, double, cdouble, cfloat, cdouble, div, 1e-5)
+
+#define BITOP(func, T, op)                                            \
+    TEST(BinaryTests, Test_##func##_##T) {                            \
+        af_dtype ty   = (af_dtype)dtype_traits<T>::af_type;           \
+        const T vala  = 4095;                                         \
+        const T valb  = 3;                                            \
+        const T valc  = vala op valb;                                 \
+        const int num = 10;                                           \
+        af::array a   = af::constant(vala, num, ty);                  \
+        af::array b   = af::constant(valb, num, ty);                  \
+        af::array c   = a op b;                                       \
+        T *h_a        = a.host<T>();                                  \
+        T *h_b        = b.host<T>();                                  \
+        T *h_c        = c.host<T>();                                  \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_c[i], valc)                                   \
+                << "for values: " << h_a[i] << "," << h_b[i] << endl; \
+        af_free_host(h_a);                                            \
+        af_free_host(h_b);                                            \
+        af_free_host(h_c);                                            \
+    }
 
 BITOP(bitor, int, |)
 BITOP(bitand, int, &)
@@ -263,3 +307,637 @@ BITOP(bitand, uintl, &)
 BITOP(bitxor, uintl, ^)
 BITOP(bitshiftl, uintl, <<)
 BITOP(bitshiftr, uintl, >>)
+
+#define UBITOP(func, T)                                     \
+    TEST(BinaryTests, Test_##func##_##T) {                  \
+        af_dtype ty   = (af_dtype)dtype_traits<T>::af_type; \
+        const T vala  = 127u;                               \
+        const T valc  = ~vala;                              \
+        const int num = 10;                                 \
+        af::array a   = af::constant(vala, num, ty);        \
+        af::array b   = af::constant(valc, num, ty);        \
+        af::array c   = ~a;                                 \
+        ASSERT_ARRAYS_EQ(c, b);                             \
+    }
+
+UBITOP(bitnot, int)
+UBITOP(bitnot, uint)
+UBITOP(bitnot, intl)
+UBITOP(bitnot, uintl)
+UBITOP(bitnot, schar)
+UBITOP(bitnot, uchar)
+UBITOP(bitnot, short)
+UBITOP(bitnot, ushort)
+
+TEST(BinaryTests, Test_pow_cfloat_float) {
+    af::array a        = randgen(num, c32);
+    af::array b        = randgen(num, f32);
+    af::array c        = af::pow(a, b);
+    complex_float *h_a = (complex_float *)a.host<cfloat>();
+    float *h_b         = b.host<float>();
+    complex_float *h_c = (complex_float *)c.host<cfloat>();
+    for (int i = 0; i < num; i++) {
+        complex_float res = std::pow(h_a[i], h_b[i]);
+        ASSERT_NEAR(real(h_c[i]), real(res), 1E-5)
+            << "for real values of: " << h_a[i] << "," << h_b[i] << endl;
+        ASSERT_NEAR(imag(h_c[i]), imag(res), 1E-5)
+            << "for imag values of: " << h_a[i] << "," << h_b[i] << endl;
+    }
+    af_free_host(h_a);
+    af_free_host(h_b);
+    af_free_host(h_c);
+}
+
+TEST(BinaryTests, Test_pow_cdouble_cdouble) {
+    SUPPORTED_TYPE_CHECK(cdouble);
+    af::array a         = randgen(num, c64);
+    af::array b         = randgen(num, c64);
+    af::array c         = af::pow(a, b);
+    complex_double *h_a = (complex_double *)a.host<cdouble>();
+    complex_double *h_b = (complex_double *)b.host<cdouble>();
+    complex_double *h_c = (complex_double *)c.host<cdouble>();
+    for (int i = 0; i < num; i++) {
+        complex_double res = std::pow(h_a[i], h_b[i]);
+        ASSERT_NEAR(real(h_c[i]), real(res), 1E-10)
+            << "for real values of: " << h_a[i] << "," << h_b[i] << endl;
+        ASSERT_NEAR(imag(h_c[i]), imag(res), 1E-10)
+            << "for imag values of: " << h_a[i] << "," << h_b[i] << endl;
+    }
+    af_free_host(h_a);
+    af_free_host(h_b);
+    af_free_host(h_c);
+}
+
+TEST(BinaryTests, ISSUE_1762) {
+    af::array zero   = af::constant(0, 5, f32);
+    af::array result = af::pow(zero, 2);
+    vector<complex_float> hres(result.elements());
+    result.host(&hres[0]);
+    for (int i = 0; i < 5; i++) {
+        ASSERT_EQ(real(hres[i]), 0);
+        ASSERT_EQ(imag(hres[i]), 0);
+    }
+}
+
+template<typename T>
+class PowPrecisionTest : public ::testing::TestWithParam<T> {
+    void SetUp() { SUPPORTED_TYPE_CHECK(T); }
+};
+
+#define DEF_TEST(Sx, T)                                                    \
+    using PowPrecisionTest##Sx = PowPrecisionTest<T>;                      \
+    TEST_P(PowPrecisionTest##Sx, Issue2304) {                              \
+        T param    = GetParam();                                           \
+        auto dtype = (af_dtype)dtype_traits<T>::af_type;                   \
+        if (noDoubleTests(dtype)) {                                        \
+            if (std::abs((double)param) > 10000)                           \
+                GTEST_SKIP()                                               \
+                    << "Skip larger values because double not supported."; \
+        }                                                                  \
+        af::array A = af::constant(param, 1, dtype);                       \
+        af::array B = af::pow(A, 2);                                       \
+        vector<T> hres(1, 0);                                              \
+        B.host(&hres[0]);                                                  \
+        std::fesetround(FE_TONEAREST);                                     \
+        T gold;                                                            \
+        if (!af::isDoubleAvailable(af::getDevice())) {                     \
+            gold = (T)std::rint(std::pow((float)param, 2.0f));             \
+        } else {                                                           \
+            gold = (T)std::rint(std::pow((double)param, 2.0));             \
+        }                                                                  \
+        ASSERT_EQ(hres[0], gold);                                          \
+    }
+
+DEF_TEST(ULong, unsigned long long)
+DEF_TEST(Long, long long)
+DEF_TEST(UInt, unsigned int)
+DEF_TEST(Int, int)
+DEF_TEST(UShort, unsigned short)
+DEF_TEST(Short, short)
+DEF_TEST(UChar, unsigned char)
+DEF_TEST(SChar, signed char)
+
+#undef DEF_TEST
+
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestULong,
+                         testing::Range<unsigned long long>(1, 1e7, 1e6));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestLong,
+                         testing::Range<long long>(1, 1e7, 1e6));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestUInt,
+                         testing::Range<unsigned int>(1, 65000, 15e3));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestInt,
+                         testing::Range<int>(1, 46340, 10e3));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestUShort,
+                         testing::Range<unsigned short>(1, 255, 100));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestShort,
+                         testing::Range<short>(1, 180, 50));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestUChar,
+                         testing::Range<unsigned char>(1, 12, 5));
+INSTANTIATE_TEST_SUITE_P(PositiveValues, PowPrecisionTestSChar,
+                         testing::Range<signed char>(1, 9, 3));
+
+INSTANTIATE_TEST_SUITE_P(NegativeValues, PowPrecisionTestLong,
+                         testing::Range<long long>(-1e7, 0, 1e6));
+INSTANTIATE_TEST_SUITE_P(NegativeValues, PowPrecisionTestInt,
+                         testing::Range<int>(-46340, 0, 10e3));
+INSTANTIATE_TEST_SUITE_P(NegativeValues, PowPrecisionTestShort,
+                         testing::Range<short>(-180, 0, 50));
+INSTANTIATE_TEST_SUITE_P(NegativeValues, PowPrecisionTestSChar,
+                         testing::Range<signed char>(-9, 0, 3));
+
+struct result_type_param {
+    af_dtype result_;
+    af_dtype lhs_;
+    af_dtype rhs_;
+
+    result_type_param(af_dtype type) : result_(type), lhs_(type), rhs_(type) {}
+    result_type_param(af_dtype result, af_dtype lhs, af_dtype rhs)
+        : result_(result), lhs_(lhs), rhs_(rhs) {}
+};
+
+ostream &operator<<(ostream &os, const result_type_param &p) {
+    os << "{lhs_ = " << p.lhs_ << " rhs_ = " << p.rhs_
+       << " result_ = " << p.result_ << "}";
+    return os;
+}
+
+class ResultType : public testing::TestWithParam<result_type_param> {
+   protected:
+    af::array lhs;
+    af::array rhs;
+    af_dtype gold;
+
+    void SetUp() override {
+        result_type_param params = GetParam();
+        gold                     = params.result_;
+        if (noHalfTests(params.result_) || noHalfTests(params.lhs_) ||
+            noHalfTests(params.rhs_)) {
+            GTEST_SKIP() << "Half not supported on this device";
+            return;
+        } else if (noDoubleTests(params.result_) ||
+                   noDoubleTests(params.lhs_) || noDoubleTests(params.rhs_)) {
+            GTEST_SKIP() << "Double not supported on this device";
+            return;
+        }
+        lhs = af::array(10, params.lhs_);
+        rhs = af::array(10, params.rhs_);
+    }
+};
+
+std::string print_types(
+    const ::testing::TestParamInfo<ResultType::ParamType> info) {
+    stringstream ss;
+    ss << "lhs_" << info.param.lhs_ << "_rhs_" << info.param.rhs_ << "_result_"
+       << info.param.result_;
+    return ss.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(
+    SameTypes, ResultType,
+    // clang-format off
+    ::testing::Values(result_type_param(f32),
+                      result_type_param(f64),
+                      result_type_param(c32),
+                      result_type_param(c64),
+                      result_type_param(b8),
+                      result_type_param(s32),
+                      result_type_param(u32),
+                      result_type_param(s8),
+                      result_type_param(u8),
+                      result_type_param(s64),
+                      result_type_param(u64),
+                      result_type_param(s16),
+                      result_type_param(u16),
+                      result_type_param(f16)),
+    // clang-format on
+    print_types);
+
+INSTANTIATE_TEST_SUITE_P(
+    Float, ResultType,
+    // clang-format off
+    ::testing::Values(result_type_param(f32),
+                      result_type_param(f64, f64, f32),
+                      result_type_param(c32, c32, f32),
+                      result_type_param(c64, c64, f32),
+                      result_type_param(f32, b8, f32),
+                      result_type_param(f32, s32, f32),
+                      result_type_param(f32, u32, f32),
+                      result_type_param(f32, s8, f32),
+                      result_type_param(f32, u8, f32),
+                      result_type_param(f32, s64, f32),
+                      result_type_param(f32, u64, f32),
+                      result_type_param(f32, s16, f32),
+                      result_type_param(f32, u16, f32),
+                      result_type_param(f32, f16, f32)),
+    // clang-format on
+    print_types);
+
+INSTANTIATE_TEST_SUITE_P(
+    Double, ResultType,
+    ::testing::Values(
+        // clang-format off
+                      result_type_param(f64, f32, f64),
+                      result_type_param(f64, f64, f64),
+                      result_type_param(c64, c32, f64),
+                      result_type_param(c64, c64, f64),
+                      result_type_param(f64, b8,  f64),
+                      result_type_param(f64, s32, f64),
+                      result_type_param(f64, u32, f64),
+                      result_type_param(f64, s8,  f64),
+                      result_type_param(f64, u8,  f64),
+                      result_type_param(f64, s64, f64),
+                      result_type_param(f64, u64, f64),
+                      result_type_param(f64, s16, f64),
+                      result_type_param(f64, u16, f64),
+                      result_type_param(f64, f16, f64)),
+    // clang-format on
+    print_types);
+
+// clang-format off
+TEST_P(ResultType, Addition)       {
+    ASSERT_EQ(gold, (lhs + rhs).type());
+}
+TEST_P(ResultType, Subtraction)    {
+    ASSERT_EQ(gold, (lhs - rhs).type());
+}
+TEST_P(ResultType, Multiplication) {
+    ASSERT_EQ(gold, (lhs * rhs).type());
+}
+TEST_P(ResultType, Division)       {
+    ASSERT_EQ(gold, (lhs / rhs).type());
+}
+// clang-format on
+
+template<typename T>
+class ResultTypeScalar : public ::testing::Test {
+   protected:
+    T scalar;
+    void SetUp() override { scalar = T(1); }
+};
+
+typedef ::testing::Types<float, double, unsigned int, int, short,
+                         unsigned short, char, signed char, unsigned char,
+                         half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(ResultTypeScalar, TestTypes);
+
+TYPED_TEST(ResultTypeScalar, HalfAddition) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    ASSERT_EQ(f16, (af::array(10, f16) + this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, HalfSubtraction) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    ASSERT_EQ(f16, (af::array(10, f16) - this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, HalfMultiplication) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    ASSERT_EQ(f16, (af::array(10, f16) * this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, HalfDivision) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    ASSERT_EQ(f16, (af::array(10, f16) / this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, FloatAddition) {
+    ASSERT_EQ(f32, (af::array(10, f32) + this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, FloatSubtraction) {
+    ASSERT_EQ(f32, (af::array(10, f32) - this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, FloatMultiplication) {
+    ASSERT_EQ(f32, (af::array(10, f32) * this->scalar).type());
+}
+
+TYPED_TEST(ResultTypeScalar, FloatDivision) {
+    ASSERT_EQ(f32, (af::array(10, f32) / this->scalar).type());
+}
+
+class Broadcast : public ::testing::TestWithParam<std::tuple<dim4, dim4>> {
+    void SetUp() override {}
+};
+// clang-format off
+
+INSTANTIATE_TEST_SUITE_P(
+    CorrectCases, Broadcast,
+    ::testing::Combine(
+        ::testing::Values(dim4(1), dim4(10), dim4(1, 10), dim4(1, 1, 10),
+                          dim4(1, 1, 1, 10), dim4(10, 10), dim4(1, 10, 10),
+                          dim4(1, 1, 10, 10), dim4(10, 1, 10),
+                          dim4(1, 10, 1, 10), dim4(10, 1, 1, 10),
+                          dim4(10, 10, 10), dim4(1, 10, 10, 10),
+                          dim4(10, 1, 10, 10), dim4(10, 10, 1, 10),
+                          dim4(10, 10, 10, 10)),
+        ::testing::Values(dim4(1), dim4(10), dim4(1, 10), dim4(1, 1, 10),
+                          dim4(1, 1, 1, 10), dim4(10, 10), dim4(1, 10, 10),
+                          dim4(1, 1, 10, 10), dim4(10, 1, 10),
+                          dim4(1, 10, 1, 10), dim4(10, 1, 1, 10),
+                          dim4(10, 10, 10), dim4(1, 10, 10, 10),
+                          dim4(10, 1, 10, 10), dim4(10, 10, 1, 10),
+                          dim4(10, 10, 10, 10))),
+    [](const ::testing::TestParamInfo<Broadcast::ParamType> info) {
+        stringstream ss;
+        ss << "lhs_" << get<0>(info.param) << "_rhs_" << get<1>(info.param);
+        string s = ss.str();
+        std::replace(std::begin(s), std::end(s), ' ', '_');
+        return s;
+    });
+// clang-format on
+
+af::dim4 broadcastOut(dim4 lhs, dim4 rhs) {
+    dim4 out(1);
+    for (int i = 0; i < AF_MAX_DIMS; i++) {
+        if (lhs[i] == rhs[i])
+            out[i] = lhs[i];
+        else if (lhs[i] == 1 && rhs[i] > 1)
+            out[i] = rhs[i];
+        else if (lhs[i] > 1 && rhs[i] == 1)
+            out[i] = lhs[i];
+        else {
+            std::cout << "incorrect dimension " << lhs << " op " << rhs << '\n';
+            return dim4(0);
+        }
+    }
+    return out;
+}
+
+af::dim4 tileRepeations(dim4 in, dim4 other) {
+    af::dim4 out;
+    for (int i = 0; i < AF_MAX_DIMS; i++) {
+        out[i] = std::max(dim_t(1), other[i] / in[i]);
+    }
+    return out;
+}
+
+TEST_P(Broadcast, Addition) {
+    auto params   = GetParam();
+    af::array lhs = iota(get<0>(params));
+    af::array rhs = constant(1, get<1>(params));
+
+    af::array out = lhs + rhs;
+
+    af::dim4 outdims       = broadcastOut(lhs.dims(), rhs.dims());
+    af::dim4 tilerepetions = tileRepeations(lhs.dims(), rhs.dims());
+    af::array tiledlhs     = tile(lhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out += 1; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, Subtraction) {
+    auto params   = GetParam();
+    af::array lhs = range(get<0>(params));
+    af::array rhs = constant(1, get<1>(params));
+
+    af::array out          = lhs - rhs;
+    af::dim4 outdims       = broadcastOut(lhs.dims(), rhs.dims());
+    af::dim4 tilerepetions = tileRepeations(lhs.dims(), rhs.dims());
+    af::array tiledlhs     = tile(lhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out -= 1; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, Multiplication) {
+    auto params   = GetParam();
+    af::array lhs = range(get<0>(params));
+    af::array rhs = constant(2, get<1>(params));
+
+    af::array out          = lhs * rhs;
+    af::dim4 outdims       = broadcastOut(lhs.dims(), rhs.dims());
+    af::dim4 tilerepetions = tileRepeations(lhs.dims(), rhs.dims());
+    af::array tiledlhs     = tile(lhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out *= 2; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, Division) {
+    auto params   = GetParam();
+    af::array lhs = range(get<0>(params));
+    af::array rhs = constant(2, get<1>(params));
+
+    af::array out          = lhs / rhs;
+    af::dim4 outdims       = broadcastOut(lhs.dims(), rhs.dims());
+    af::dim4 tilerepetions = tileRepeations(lhs.dims(), rhs.dims());
+    af::array tiledlhs     = tile(lhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out /= 2; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, AdditionLHSIndexed) {
+    auto params   = GetParam();
+    af::array lhs = iota(get<0>(params) * 2);
+    af::array rhs = constant(1, get<1>(params));
+
+    dim4 lhs_dims = get<0>(params);
+    af::array out = lhs(seq(lhs_dims[0]), seq(lhs_dims[1]), seq(lhs_dims[2]),
+                        seq(lhs_dims[3])) +
+                    rhs;
+
+    af::dim4 outdims       = broadcastOut(get<0>(params), rhs.dims());
+    af::array indexedlhs   = lhs(seq(lhs_dims[0]), seq(lhs_dims[1]),
+                                 seq(lhs_dims[2]), seq(lhs_dims[3]));
+    af::dim4 tilerepetions = tileRepeations(get<0>(params), rhs.dims());
+    af::array tiledlhs     = tile(indexedlhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out += 1; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, AdditionRHSIndexed) {
+    auto params   = GetParam();
+    af::array lhs = iota(get<0>(params));
+    af::array rhs = constant(1, get<1>(params) * 2);
+
+    dim4 rhs_dims = get<1>(params);
+    af::array out = lhs + rhs(seq(rhs_dims[0]), seq(rhs_dims[1]),
+                              seq(rhs_dims[2]), seq(rhs_dims[3]));
+
+    af::dim4 outdims       = broadcastOut(get<0>(params), get<1>(params));
+    af::dim4 tilerepetions = tileRepeations(get<0>(params), get<1>(params));
+    af::array tiledlhs     = tile(lhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out += 1; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST_P(Broadcast, AdditionBothIndexed) {
+    auto params   = GetParam();
+    af::array lhs = iota(get<0>(params) * 2);
+    af::array rhs = constant(1, get<1>(params) * 2);
+
+    dim4 lhs_dims = get<0>(params);
+    dim4 rhs_dims = get<1>(params);
+    af::array out = lhs(seq(lhs_dims[0]), seq(lhs_dims[1]), seq(lhs_dims[2]),
+                        seq(lhs_dims[3])) +
+                    rhs(seq(rhs_dims[0]), seq(rhs_dims[1]), seq(rhs_dims[2]),
+                        seq(rhs_dims[3]));
+
+    af::dim4 outdims       = broadcastOut(lhs_dims, rhs_dims);
+    af::array indexedlhs   = lhs(seq(lhs_dims[0]), seq(lhs_dims[1]),
+                                 seq(lhs_dims[2]), seq(lhs_dims[3]));
+    af::dim4 tilerepetions = tileRepeations(get<0>(params), get<1>(params));
+    af::array tiledlhs     = tile(indexedlhs, tilerepetions);
+
+    vector<float> outvec(outdims.elements());
+    tiledlhs.host(outvec.data());
+    for (auto &out : outvec) { out += 1; }
+
+    ASSERT_VEC_ARRAY_EQ(outvec, outdims, out);
+}
+
+TEST(Broadcast, VectorMatrix2d) {
+    dim_t s     = 10;
+    af::array A = range(dim4(s, 3), 1);
+    af::array B = -range(dim4(3));
+
+    try {
+        A + B;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+    try {
+        B + A;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+}
+
+TEST(Broadcast, VectorMatrix3d) {
+    dim_t s     = 10;
+    af::array A = range(dim4(s, s, 3), 2);
+    af::array B = -range(dim4(3));
+
+    try {
+        A + B;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+    try {
+        B + A;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+}
+
+TEST(Broadcast, VectorMatrix4d) {
+    dim_t s     = 10;
+    af::array A = range(dim4(s, s, s, 3), 3);
+    af::array B = -range(dim4(3));
+
+    try {
+        A + B;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+    try {
+        B + A;
+        FAIL();
+    } catch (af::exception &e) { ASSERT_EQ(e.err(), AF_ERR_SIZE); }
+}
+
+void testAllBroadcast(dim4 dims) {
+    af::array A = constant(1, dims);
+    for (int k = 0; k < dims.ndims(); ++k) {
+        dim4 rdims  = dims;
+        rdims[k]    = 1;
+        af::array B = constant(-1, rdims);
+        af::array C = A + B;
+        ASSERT_ARRAYS_EQ(C, constant(0, dims));
+
+        C = B + A;
+        ASSERT_ARRAYS_EQ(C, constant(0, dims));
+    }
+}
+
+TEST(Broadcast, MatrixMatrix2d) { testAllBroadcast(dim4(10, 15)); }
+
+TEST(Broadcast, MatrixMatrix3d) { testAllBroadcast(dim4(10, 15, 20)); }
+
+TEST(Broadcast, MatrixMatrix4d) { testAllBroadcast(dim4(10, 15, 20, 25)); }
+
+TEST(Broadcast, MismatchingDim0) {
+    af::array A = range(dim4(2, 3, 5), 1);
+    af::array B = -range(dim4(3, 5), 0);
+
+    af_err err = AF_SUCCESS;
+    try { A + B; } catch (af::exception &e) { err = e.err(); }
+    ASSERT_EQ(AF_ERR_SIZE, err);  // also fails if nothing was thrown
+}
+
+TEST(Broadcast, TestFirstMatchingDim) {
+    af::array A = range(dim4(3, 2, 2, 4), 1);
+    af::array B = -range(dim4(2));
+
+    af_err err = AF_SUCCESS;
+    try { A + B; } catch (af::exception &e) { err = e.err(); }
+    ASSERT_EQ(AF_ERR_SIZE, err);  // also fails if nothing was thrown
+}
+
+TEST(Broadcast, ManySlicesVsOneSlice) {
+    af::array A = constant(1, dim4(3, 3, 2));
+    af::array B = constant(2, dim4(3, 3));
+    af::array C = A + B;
+
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(3, 3, 2)));
+
+    C = B + A;
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(3, 3, 2)));
+}
+
+TEST(Broadcast, SubArray) {
+    dim_t subdim = 5;
+    af::array A  = constant(1, dim4(10, 10, 2));
+    af::array B  = constant(2, dim4(5, 5));
+    af::array C  = A(seq(subdim), seq(subdim), span) + B;
+
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(subdim, subdim, 2)));
+
+    C = B + A(seq(subdim), seq(subdim), span);
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(subdim, subdim, 2)));
+}
+
+TEST(Broadcast, SubArrays) {
+    dim_t subdim = 5;
+    af::array A  = constant(1, dim4(10, 10, 2));
+    af::array B  = constant(2, dim4(15, 15));
+
+    af::array C =
+        A(seq(subdim), seq(subdim), span) + B(seq(subdim), seq(subdim));
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(subdim, subdim, 2)));
+
+    C = B(seq(subdim), seq(subdim)) + A(seq(subdim), seq(subdim), span);
+    ASSERT_ARRAYS_EQ(C, constant(3, dim4(subdim, subdim, 2)));
+}
+
+TEST(Broadcast, IndexedArray) {
+    af::array A = constant(1, dim4(2, 2, 2, 2));
+    af::array B = constant(-1, dim4(1, 5));
+
+    af::array idx = range(dim4(2, 2, 2, 2), 0);
+
+    af::array C = A(idx % 2 == 0) + B;
+    ASSERT_ARRAYS_EQ(C, constant(0, dim4(8, 5)));
+
+    C = B + A(idx % 2 == 0);
+    ASSERT_ARRAYS_EQ(C, constant(0, dim4(8, 5)));
+}
diff --git a/test/binary_ops.hpp b/test/binary_ops.hpp
new file mode 100644
index 0000000000..b498f02094
--- /dev/null
+++ b/test/binary_ops.hpp
@@ -0,0 +1,90 @@
+#include <af/defines.h>
+#include <algorithm>
+#include <complex>
+#include <limits>
+#include <numeric>
+
+template<typename T>
+static inline T min(T lhs, T rhs) {
+    return std::min(lhs, rhs);
+}
+std::complex<float> min(std::complex<float> lhs, std::complex<float> rhs);
+std::complex<double> min(std::complex<double> lhs, std::complex<double> rhs);
+
+template<typename T>
+static inline T max(T lhs, T rhs) {
+    return std::max(lhs, rhs);
+}
+std::complex<float> max(std::complex<float> lhs, std::complex<float> rhs);
+std::complex<double> max(std::complex<double> lhs, std::complex<double> rhs);
+
+template<typename T, af_binary_op op>
+struct Binary {
+    T init() { return (T)(0); }
+
+    T operator()(T lhs, T rhs) { return lhs + rhs; }
+};
+
+template<typename T>
+struct Binary<T, AF_BINARY_ADD> {
+    T init() { return (T)(0); }
+
+    T operator()(T lhs, T rhs) { return lhs + rhs; }
+};
+
+template<typename T>
+struct Binary<T, AF_BINARY_MUL> {
+    T init() { return (T)(1); }
+
+    T operator()(T lhs, T rhs) { return lhs * rhs; }
+};
+
+template<typename T>
+struct Binary<T, AF_BINARY_MIN> {
+    T init() { return std::numeric_limits<T>::max(); }
+
+    T operator()(T lhs, T rhs) { return min(lhs, rhs); }
+};
+
+template<typename T>
+struct Binary<T, AF_BINARY_MAX> {
+    T init() { return std::numeric_limits<T>::lowest(); }
+
+    T operator()(T lhs, T rhs) { return max(lhs, rhs); }
+};
+
+#define SPECIALIZE_COMPLEX_MIN(T, Tr)                            \
+    template<>                                                   \
+    struct Binary<T, AF_BINARY_MIN> {                            \
+        T init() { return (T)(std::numeric_limits<Tr>::max()); } \
+                                                                 \
+        T operator()(T lhs, T rhs) { return min(lhs, rhs); }     \
+    };
+
+SPECIALIZE_COMPLEX_MIN(std::complex<float>, float)
+SPECIALIZE_COMPLEX_MIN(std::complex<double>, double)
+#undef SPECIALIZE_COMPLEX_MIN
+
+#define SPECIALIZE_COMPLEX_MAX(T, Tr)                        \
+    template<>                                               \
+    struct Binary<T, AF_BINARY_MAX> {                        \
+        T init() { return (T)((Tr)(0)); }                    \
+                                                             \
+        T operator()(T lhs, T rhs) { return max(lhs, rhs); } \
+    };
+
+SPECIALIZE_COMPLEX_MAX(std::complex<float>, float)
+SPECIALIZE_COMPLEX_MAX(std::complex<double>, double)
+#undef SPECIALIZE_COMPLEX_MAX
+
+#define SPECIALIZE_FLOATING_MAX(T, Tr)                            \
+    template<>                                                    \
+    struct Binary<T, AF_BINARY_MAX> {                             \
+        T init() { return (T)(-std::numeric_limits<Tr>::max()); } \
+                                                                  \
+        T operator()(T lhs, T rhs) { return max(lhs, rhs); }      \
+    };
+
+SPECIALIZE_FLOATING_MAX(float, float)
+SPECIALIZE_FLOATING_MAX(double, double)
+#undef SPECIALIZE_FLOATING_MAX
diff --git a/test/blas.cpp b/test/blas.cpp
index af73359318..6f77c10160 100644
--- a/test/blas.cpp
+++ b/test/blas.cpp
@@ -7,236 +7,825 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/blas.h>
-#include <af/traits.hpp>
 #include <af/defines.h>
-#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/half.h>
+#include <af/traits.hpp>
+#include <algorithm>
 #include <string>
 
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dot;
+using af::dtype_traits;
+using af::getDevice;
+using af::getDeviceCount;
+using af::matmul;
+using af::max;
+using af::randu;
+using af::setDevice;
+using af::span;
+using af::transpose;
+using std::copy;
 using std::cout;
 using std::endl;
 using std::ostream_iterator;
-using std::copy;
+using std::string;
+using std::stringstream;
 using std::vector;
 
 template<typename T>
-class MatrixMultiply : public ::testing::Test
-{
+class MatrixMultiply : public ::testing::Test {};
 
-};
-
-typedef ::testing::Types<float, af::cfloat, double, af::cdouble> TestTypes;
-TYPED_TEST_CASE(MatrixMultiply, TestTypes);
+typedef ::testing::Types<float, double, cdouble, cfloat> TestTypes;
+TYPED_TEST_SUITE(MatrixMultiply, TestTypes);
 
 template<typename T, bool isBVector>
-void MatMulCheck(string TestFile)
-{
-    if (noDoubleTests<T>()) return;
+void MatMulCheck(string TestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    using std::vector;
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> > hData;
-    vector<vector<T> > tests;
-    readTests<T,T,int>(TestFile, numDims, hData, tests);
+    vector<vector<T>> hData;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(TestFile, numDims, hData, tests);
 
     af_array a, aT, b, bT;
-    ASSERT_EQ(AF_SUCCESS,
-            af_create_array(&a, &hData[0].front(), numDims[0].ndims(), numDims[0].get(), (af_dtype) af::dtype_traits<T>::af_type));
-    af::dim4 atdims = numDims[0];
+    ASSERT_SUCCESS(af_create_array(&a, &hData[0].front(), numDims[0].ndims(),
+                                   numDims[0].get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    dim4 atdims = numDims[0];
     {
-        dim_t f  =    atdims[0];
-        atdims[0]   =    atdims[1];
-        atdims[1]   =    f;
-    }
-    ASSERT_EQ(AF_SUCCESS,
-            af_moddims(&aT, a, atdims.ndims(), atdims.get()));
-    ASSERT_EQ(AF_SUCCESS,
-            af_create_array(&b, &hData[1].front(), numDims[1].ndims(), numDims[1].get(), (af_dtype) af::dtype_traits<T>::af_type));
-    af::dim4 btdims = numDims[1];
+        dim_t f   = atdims[0];
+        atdims[0] = atdims[1];
+        atdims[1] = f;
+    }
+    ASSERT_SUCCESS(af_moddims(&aT, a, atdims.ndims(), atdims.get()));
+    ASSERT_SUCCESS(af_create_array(&b, &hData[1].front(), numDims[1].ndims(),
+                                   numDims[1].get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    dim4 btdims = numDims[1];
     {
-        dim_t f = btdims[0];
+        dim_t f   = btdims[0];
         btdims[0] = btdims[1];
         btdims[1] = f;
     }
-    ASSERT_EQ(AF_SUCCESS,
-            af_moddims(&bT, b, btdims.ndims(), btdims.get()));
+    ASSERT_SUCCESS(af_moddims(&bT, b, btdims.ndims(), btdims.get()));
 
     vector<af_array> out(tests.size(), 0);
-    if(isBVector) {
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[0] , aT, b,    AF_MAT_NONE,    AF_MAT_NONE));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[1] , bT, a,   AF_MAT_NONE,    AF_MAT_NONE));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[2] , b, a,    AF_MAT_TRANS,       AF_MAT_NONE));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[3] , bT, aT,   AF_MAT_NONE,    AF_MAT_TRANS));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[4] , b, aT,    AF_MAT_TRANS,       AF_MAT_TRANS));
-    }
-    else {
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[0] , a, b, AF_MAT_NONE,   AF_MAT_NONE));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[1] , a, bT, AF_MAT_NONE,   AF_MAT_TRANS));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[2] , a, bT, AF_MAT_TRANS,      AF_MAT_NONE));
-        ASSERT_EQ(AF_SUCCESS, af_matmul( &out[3] , aT, bT, AF_MAT_TRANS,      AF_MAT_TRANS));
-    }
-
-    for(size_t i = 0; i < tests.size(); i++) {
-        dim_t elems;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&elems, out[i]));
-        vector<T> h_out(elems);
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void *)&h_out.front(), out[i]));
-
-        if( false == equal(h_out.begin(), h_out.end(), tests[i].begin()) ) {
+    if (isBVector) {
+        ASSERT_SUCCESS(af_matmul(&out[0], aT, b, AF_MAT_NONE, AF_MAT_NONE));
+        ASSERT_SUCCESS(af_matmul(&out[1], bT, a, AF_MAT_NONE, AF_MAT_NONE));
+        ASSERT_SUCCESS(af_matmul(&out[2], b, a, AF_MAT_TRANS, AF_MAT_NONE));
+        ASSERT_SUCCESS(af_matmul(&out[3], bT, aT, AF_MAT_NONE, AF_MAT_TRANS));
+        ASSERT_SUCCESS(af_matmul(&out[4], b, aT, AF_MAT_TRANS, AF_MAT_TRANS));
+    } else {
+        ASSERT_SUCCESS(af_matmul(&out[0], a, b, AF_MAT_NONE, AF_MAT_NONE));
+        ASSERT_SUCCESS(af_matmul(&out[1], a, bT, AF_MAT_NONE, AF_MAT_TRANS));
+        ASSERT_SUCCESS(af_matmul(&out[2], a, bT, AF_MAT_TRANS, AF_MAT_NONE));
+        ASSERT_SUCCESS(af_matmul(&out[3], aT, bT, AF_MAT_TRANS, AF_MAT_TRANS));
+    }
 
-            cout << "Failed test " << i << "\nCalculated: " << endl;
-            copy(h_out.begin(), h_out.end(), ostream_iterator<T>(cout, ", "));
-            cout << "Expected: " << endl;
-            copy(tests[i].begin(), tests[i].end(), ostream_iterator<T>(cout, ", "));
-            FAIL();
-        }
+    for (size_t i = 0; i < tests.size(); i++) {
+        dim4 dd;
+        dim_t *d = dd.get();
+        ASSERT_SUCCESS(af_get_dims(&d[0], &d[1], &d[2], &d[3], out[i]));
+        ASSERT_VEC_ARRAY_NEAR(tests[i], dd, out[i], 1e-3);
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(aT));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(b));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(bT));
+    ASSERT_SUCCESS(af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(aT));
+    ASSERT_SUCCESS(af_release_array(b));
+    ASSERT_SUCCESS(af_release_array(bT));
 
-    for (size_t i = 0; i <  out.size(); i++) {
-        ASSERT_EQ(AF_SUCCESS, af_release_array(out[i]));
+    for (size_t i = 0; i < out.size(); i++) {
+        ASSERT_SUCCESS(af_release_array(out[i]));
     }
 }
 
-TYPED_TEST(MatrixMultiply, Square)
-{
-    MatMulCheck<TypeParam, false>(TEST_DIR"/blas/Basic.test");
+TYPED_TEST(MatrixMultiply, Square) {
+    MatMulCheck<TypeParam, false>(TEST_DIR "/blas/Basic.test");
 }
 
-TYPED_TEST(MatrixMultiply, NonSquare)
-{
-    MatMulCheck<TypeParam, false>(TEST_DIR"/blas/NonSquare.test");
+TYPED_TEST(MatrixMultiply, NonSquare) {
+    MatMulCheck<TypeParam, false>(TEST_DIR "/blas/NonSquare.test");
 }
 
-TYPED_TEST(MatrixMultiply, SquareVector)
-{
-    MatMulCheck<TypeParam, true>(TEST_DIR"/blas/SquareVector.test");
+TYPED_TEST(MatrixMultiply, SquareVector) {
+    MatMulCheck<TypeParam, true>(TEST_DIR "/blas/SquareVector.test");
 }
 
-TYPED_TEST(MatrixMultiply, RectangleVector)
-{
-    MatMulCheck<TypeParam, true>(TEST_DIR"/blas/RectangleVector.test");
+TYPED_TEST(MatrixMultiply, RectangleVector) {
+    MatMulCheck<TypeParam, true>(TEST_DIR "/blas/RectangleVector.test");
 }
 
 template<typename T, bool isBVector>
-void cppMatMulCheck(string TestFile)
-{
-    if (noDoubleTests<T>()) return;
+void cppMatMulCheck(string TestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    using std::vector;
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> > hData;
-    vector<vector<T> > tests;
-    readTests<T,T,int>(TestFile, numDims, hData, tests);
+    vector<vector<T>> hData;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(TestFile, numDims, hData, tests);
 
-    af::array a(numDims[0], &hData[0].front());
-    af::array b(numDims[1], &hData[1].front());
+    array a(numDims[0], &hData[0].front());
+    array b(numDims[1], &hData[1].front());
 
-    af::dim4 atdims = numDims[0];
+    dim4 atdims = numDims[0];
     {
-        dim_t f  =    atdims[0];
-        atdims[0]   =    atdims[1];
-        atdims[1]   =    f;
+        dim_t f   = atdims[0];
+        atdims[0] = atdims[1];
+        atdims[1] = f;
     }
-    af::dim4 btdims = numDims[1];
+    dim4 btdims = numDims[1];
     {
-        dim_t f = btdims[0];
+        dim_t f   = btdims[0];
         btdims[0] = btdims[1];
         btdims[1] = f;
     }
 
-    af::array aT = moddims(a, atdims.ndims(), atdims.get());
-    af::array bT = moddims(b, btdims.ndims(), btdims.get());
-
-    vector<af::array> out(tests.size());
-    if(isBVector) {
-        out[0] = af::matmul(aT, b,    AF_MAT_NONE,    AF_MAT_NONE);
-        out[1] = af::matmul(bT, a,   AF_MAT_NONE,    AF_MAT_NONE);
-        out[2] = af::matmul(b, a,    AF_MAT_TRANS,       AF_MAT_NONE);
-        out[3] = af::matmul(bT, aT,   AF_MAT_NONE,    AF_MAT_TRANS);
-        out[4] = af::matmul(b, aT,    AF_MAT_TRANS,       AF_MAT_TRANS);
-    }
-    else {
-        out[0] = af::matmul(a, b, AF_MAT_NONE,   AF_MAT_NONE);
-        out[1] = af::matmul(a, bT, AF_MAT_NONE,   AF_MAT_TRANS);
-        out[2] = af::matmul(a, bT, AF_MAT_TRANS,      AF_MAT_NONE);
-        out[3] = af::matmul(aT, bT, AF_MAT_TRANS,      AF_MAT_TRANS);
+    array aT = moddims(a, atdims.ndims(), atdims.get());
+    array bT = moddims(b, btdims.ndims(), btdims.get());
+
+    vector<array> out(tests.size());
+    if (isBVector) {
+        out[0] = matmul(aT, b, AF_MAT_NONE, AF_MAT_NONE);
+        out[1] = matmul(bT, a, AF_MAT_NONE, AF_MAT_NONE);
+        out[2] = matmul(b, a, AF_MAT_TRANS, AF_MAT_NONE);
+        out[3] = matmul(bT, aT, AF_MAT_NONE, AF_MAT_TRANS);
+        out[4] = matmul(b, aT, AF_MAT_TRANS, AF_MAT_TRANS);
+    } else {
+        out[0] = matmul(a, b, AF_MAT_NONE, AF_MAT_NONE);
+        out[1] = matmul(a, bT, AF_MAT_NONE, AF_MAT_TRANS);
+        out[2] = matmul(a, bT, AF_MAT_TRANS, AF_MAT_NONE);
+        out[3] = matmul(aT, bT, AF_MAT_TRANS, AF_MAT_TRANS);
     }
 
-    for(size_t i = 0; i < tests.size(); i++) {
+    for (size_t i = 0; i < tests.size(); i++) {
         dim_t elems = out[i].elements();
         vector<T> h_out(elems);
-        out[i].host((void*)&h_out.front());
+        out[i].host((void *)&h_out.front());
 
         if (false == equal(h_out.begin(), h_out.end(), tests[i].begin())) {
-
             cout << "Failed test " << i << "\nCalculated: " << endl;
             copy(h_out.begin(), h_out.end(), ostream_iterator<T>(cout, ", "));
             cout << "Expected: " << endl;
-            copy(tests[i].begin(), tests[i].end(), ostream_iterator<T>(cout, ", "));
+            copy(tests[i].begin(), tests[i].end(),
+                 ostream_iterator<T>(cout, ", "));
             FAIL();
         }
     }
 }
 
-TYPED_TEST(MatrixMultiply, Square_CPP)
-{
-    cppMatMulCheck<TypeParam, false>(TEST_DIR"/blas/Basic.test");
+TYPED_TEST(MatrixMultiply, Square_CPP) {
+    cppMatMulCheck<TypeParam, false>(TEST_DIR "/blas/Basic.test");
 }
 
-TYPED_TEST(MatrixMultiply, NonSquare_CPP)
-{
-    cppMatMulCheck<TypeParam, false>(TEST_DIR"/blas/NonSquare.test");
+TYPED_TEST(MatrixMultiply, NonSquare_CPP) {
+    cppMatMulCheck<TypeParam, false>(TEST_DIR "/blas/NonSquare.test");
 }
 
-TYPED_TEST(MatrixMultiply, SquareVector_CPP)
-{
-    cppMatMulCheck<TypeParam, true>(TEST_DIR"/blas/SquareVector.test");
+TYPED_TEST(MatrixMultiply, SquareVector_CPP) {
+    cppMatMulCheck<TypeParam, true>(TEST_DIR "/blas/SquareVector.test");
 }
 
-TYPED_TEST(MatrixMultiply, RectangleVector_CPP)
-{
-    cppMatMulCheck<TypeParam, true>(TEST_DIR"/blas/RectangleVector.test");
+TYPED_TEST(MatrixMultiply, RectangleVector_CPP) {
+    cppMatMulCheck<TypeParam, true>(TEST_DIR "/blas/RectangleVector.test");
 }
 
-TYPED_TEST(MatrixMultiply, MultiGPUSquare_CPP)
-{
-    for(int i = 0; i < af::getDeviceCount(); i++) {
-        af::setDevice(i);
-        cppMatMulCheck<TypeParam, false>(TEST_DIR"/blas/Basic.test");
+#define DEVICE_ITERATE(func)                             \
+    do {                                                 \
+        const char *ENV = getenv("AF_MULTI_GPU_TESTS");  \
+        if (ENV && ENV[0] == '0') {                      \
+            func;                                        \
+        } else {                                         \
+            int oldDevice = getDevice();                 \
+            for (int i = 0; i < getDeviceCount(); i++) { \
+                setDevice(i);                            \
+                func;                                    \
+            }                                            \
+            setDevice(oldDevice);                        \
+        }                                                \
+    } while (0)
+
+TYPED_TEST(MatrixMultiply, MultiGPUSquare_CPP) {
+    DEVICE_ITERATE(
+        (cppMatMulCheck<TypeParam, false>(TEST_DIR "/blas/Basic.test")));
+}
+
+TYPED_TEST(MatrixMultiply, MultiGPUNonSquare_CPP) {
+    DEVICE_ITERATE(
+        (cppMatMulCheck<TypeParam, false>(TEST_DIR "/blas/NonSquare.test")));
+}
+
+TYPED_TEST(MatrixMultiply, MultiGPUSquareVector_CPP) {
+    DEVICE_ITERATE(
+        (cppMatMulCheck<TypeParam, true>(TEST_DIR "/blas/SquareVector.test")));
+}
+
+TYPED_TEST(MatrixMultiply, MultiGPURectangleVector_CPP) {
+    DEVICE_ITERATE((cppMatMulCheck<TypeParam, true>(
+        TEST_DIR "/blas/RectangleVector.test")));
+}
+
+float batch_tol = 1E-2;
+TEST(MatrixMultiply, Batched) {
+    const int M  = 512;
+    const int K  = 512;
+    const int N  = 10;
+    const int D2 = 2;
+    const int D3 = 3;
+    for (int d3 = 1; d3 <= D3; d3 *= D3) {
+        for (int d2 = 1; d2 <= D2; d2 *= D2) {
+            array a = randu(M, K, d2, d3);
+            array b = randu(K, N, d2, d3);
+            array c = matmul(a, b);
+
+            for (int j = 0; j < d3; j++) {
+                for (int i = 0; i < d2; i++) {
+                    array a_ij = a(span, span, i, j);
+                    array b_ij = b(span, span, i, j);
+                    array c_ij = c(span, span, i, j);
+                    array res  = matmul(a_ij, b_ij);
+                    ASSERT_ARRAYS_NEAR(c_ij, res, batch_tol);
+                }
+            }
+        }
     }
 }
 
-TYPED_TEST(MatrixMultiply, MultiGPUNonSquare_CPP)
-{
-    for(int i = 0; i < af::getDeviceCount(); i++) {
-        af::setDevice(i);
-        cppMatMulCheck<TypeParam, false>(TEST_DIR"/blas/NonSquare.test");
+#undef DEVICE_ITERATE
+
+TEST(MatrixMultiply, ISSUE_1882) {
+    const int m = 2;
+    const int n = 3;
+    array A     = randu(m, n);
+    array BB    = randu(n, m);
+    array B     = BB(0, span);
+
+    array res1 = matmul(A.T(), B.T());
+    array res2 = matmulTT(A, B);
+
+    ASSERT_ARRAYS_NEAR(res1, res2, 1E-5);
+}
+
+struct blas_params {
+    int m, n, k, ld2, ld3, rd2, rd3;
+    af_dtype type;
+    blas_params(int m_, int n_, int k_, int ld2_, int ld3_, int rd2_, int rd3_,
+                af_dtype type_)
+        : m(m_)
+        , n(n_)
+        , k(k_)
+        , ld2(ld2_)
+        , ld3(ld3_)
+        , rd2(rd2_)
+        , rd3(rd3_)
+        , type(type_) {}
+};
+
+class MatrixMultiplyBatch : public ::testing::TestWithParam<blas_params> {
+   public:
+    array lhs, rhs, out, gold;
+    void SetUp() {
+        blas_params params = GetParam();
+        lhs = randu(params.m, params.k, params.ld2, params.ld3, params.type);
+        rhs = randu(params.k, params.n, params.rd2, params.rd3, params.type);
+
+        gold = array(params.m, params.n, std::max(params.ld2, params.rd2),
+                     std::max(params.ld3, params.rd3));
+
+        if (params.ld2 == params.rd2 && params.ld3 == params.rd3) {
+            for (int i = 0; i < params.ld2; i++) {
+                for (int j = 0; j < params.ld3; j++) {
+                    array lhs_sub          = lhs(span, span, i, j);
+                    array rhs_sub          = rhs(span, span, i, j);
+                    gold(span, span, i, j) = matmul(lhs_sub, rhs_sub);
+                }
+            }
+        } else {
+            for (int i = 0; i < params.ld2; i++) {
+                for (int j = 0; j < params.ld3; j++) {
+                    for (int k = 0; k < params.rd2; k++) {
+                        for (int l = 0; l < params.rd3; l++) {
+                            array lhs_sub = lhs(span, span, i, j);
+                            array rhs_sub = rhs(span, span, k, l);
+                            gold(span, span, std::max(i, k), std::max(j, l)) =
+                                matmul(lhs_sub, rhs_sub);
+                        }
+                    }
+                }
+            }
+        }
     }
+};
+
+std::string print_blas_params(
+    const ::testing::TestParamInfo<MatrixMultiplyBatch::ParamType> info) {
+    std::stringstream ss;
+
+    ss << "LHS_" << info.param.m << "x" << info.param.k << "x" << info.param.ld2
+       << "x" << info.param.ld3 << "__RHS_" << info.param.k << "x"
+       << info.param.n << "x" << info.param.rd2 << "x" << info.param.rd3;
+
+    return ss.str();
 }
 
-TYPED_TEST(MatrixMultiply, MultiGPUSquareVector_CPP)
-{
-    for(int i = 0; i < af::getDeviceCount(); i++) {
-        af::setDevice(i);
-        cppMatMulCheck<TypeParam, true>(TEST_DIR"/blas/SquareVector.test");
+INSTANTIATE_TEST_SUITE_P(
+    LHSBroadcast, MatrixMultiplyBatch,
+    ::testing::Values(
+
+        // clang-format off
+            //             M      N     K   ld2  ld3   rd2   rd3  type
+            blas_params( 32,     32,   10,    2,   1,    1,    1,  f32),
+            blas_params( 32,     32,   10,    1,   2,    1,    1,  f32),
+            blas_params( 32,     32,   10,    2,   2,    1,    1,  f32),
+            blas_params( 32,     32,   10,    3,   2,    1,    1,  f32),
+            blas_params( 32,     32,   10,    3,   3,    1,    1,  f32),
+            blas_params( 32,     32,   10,    4,   4,    1,    1,  f32),
+
+            blas_params(512,     32,  512,    4,   4,    1,    1,  f32),
+            blas_params(512,     32,  513,    4,   4,    1,    1,  f32),
+            blas_params(513,     32,  513,    4,   4,    1,    1,  f32),
+            blas_params(513,     33,  513,    4,   4,    1,    1,  f32),
+            blas_params(513,    511,   32,    4,   4,    1,    1,  f32),
+            blas_params(513,    511,   31,    4,   4,    1,    1,  f32),
+            blas_params(513,    511,   33,    4,   4,    1,    1,  f32),
+            blas_params(511,    511,   33,    4,   4,    1,    1,  f32)
+        // clang-format on
+
+        ),
+    print_blas_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    RHSBroadcast, MatrixMultiplyBatch,
+    ::testing::Values(
+        // clang-format off
+            //            M      N     K   ld2  ld3   rd2  rd3  type
+            blas_params( 32 ,    32,  10,    1,   1,    2,   1,  f32),
+            blas_params( 32 ,    32,  10,    1,   1,    1,   2,  f32),
+            blas_params( 32 ,    32,  10,    1,   1,    2,   2,  f32),
+            blas_params( 32 ,    32,  10,    1,   1,    3,   2,  f32),
+            blas_params( 32 ,    32,  10,    1,   1,    3,   3,  f32),
+            blas_params( 32 ,    32,  10,    1,   1,    4,   4,  f32),
+
+            blas_params(512 ,    32,  512,   1,   1,    4,   4,  f32),
+            blas_params(512 ,    32,  513,   1,   1,    4,   4,  f32),
+            blas_params(513 ,    32,  513,   1,   1,    4,   4,  f32),
+            blas_params(513 ,    33,  513,   1,   1,    4,   4,  f32),
+            blas_params(513 ,   511,   32,   1,   1,    4,   4,  f32),
+            blas_params(513 ,   511,   31,   1,   1,    4,   4,  f32),
+            blas_params(513 ,   511,   33,   1,   1,    4,   4,  f32),
+            blas_params(511 ,   511,   33,   1,   1,    4,   4,  f32)
+        // clang-format on
+        ),
+    print_blas_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    SameBatch, MatrixMultiplyBatch,
+    ::testing::Values(
+        // clang-format off
+            //          M      N     K   ld2  ld3   rd2  rd3  type
+            blas_params(32,   32,  10,     2,   1,    2,   1,  f32),
+            blas_params(32,   32,  10,     1,   2,    1,   2,  f32),
+            blas_params(32,   32,  10,     2,   2,    2,   2,  f32),
+            blas_params(32,   32,  10,     3,   2,    3,   2,  f32),
+            blas_params(32,   32,  10,     3,   3,    3,   3,  f32),
+            blas_params(32,   32,  10,     4,   4,    4,   4,  f32),
+
+            blas_params(512,  32, 512,     4,   4,    4,   4,  f32),
+            blas_params(512,  32, 513,     4,   4,    4,   4,  f32),
+            blas_params(513,  32, 513,     4,   4,    4,   4,  f32),
+            blas_params(513,  33, 513,     4,   4,    4,   4,  f32),
+            blas_params(513, 511,  32,     4,   4,    4,   4,  f32),
+            blas_params(513, 511,  31,     4,   4,    4,   4,  f32),
+            blas_params(513, 511,  33,     4,   4,    4,   4,  f32),
+            blas_params(511, 511,  33,     4,   4,    4,   4,  f32),
+
+            blas_params( 32,  32,  10,     1,   1,    1,   1, f32)
+        // clang-format on
+        ),
+    print_blas_params);
+
+TEST_P(MatrixMultiplyBatch, Batched) {
+    array out = matmul(lhs, rhs);
+    ASSERT_ARRAYS_NEAR(gold, out, 1e-3);
+}
+
+float alpha = 1.f;
+float beta  = 0.f;
+
+float h_gold_gemv[4]  = {5, 5, 5, 5};
+float h_half_ones[20] = {1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f,
+                         1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f, 1.f};
+
+float h_lhs[9] = {1.f, 4.f, 7.f, 2.f, 5.f, 8.f, 3.f, 6.f, 9.f};
+
+float h_lhs_tall[6] = {1.f, 3.f, 5.f, 2.f, 4.f, 6.f};
+
+float h_lhs_wide[6] = {1.f, 4.f, 2.f, 5.f, 3.f, 6.f};
+
+float h_lhs_batch[18] = {1.f, 4.f, 7.f, 2.f, 5.f, 8.f, 3.f, 6.f, 9.f,
+
+                         8.f, 2.f, 5.f, 3.f, 4.f, 7.f, 1.f, 0.f, 6.f};
+
+float h_rhs[9] = {9.f, 6.f, 3.f, 8.f, 5.f, 2.f, 7.f, 4.f, 1.f};
+
+float h_rhs_tall[6] = {9.f, 7.f, 5.f, 8.f, 6.f, 4.f};
+
+float h_rhs_wide[6] = {9.f, 6.f, 8.f, 5.f, 7.f, 4.f};
+
+float h_gold[9] = {30.f, 84.f, 138.f, 24.f, 69.f, 114.f, 18.f, 54.f, 90.f};
+
+float h_gold_NN[9] = {21.f, 51.f, 81.f, 18.f, 44.f, 70.f, 15.f, 37.f, 59.f};
+
+float h_gold_NT[9] = {25.f, 59.f, 93.f, 19.f, 45.f, 71.f, 13.f, 31.f, 49.f};
+
+float h_gold_TN[4] = {55.f, 76.f, 46.f, 64.f};
+
+float h_gold_TT[4] = {68.f, 92.f, 41.f, 56.f};
+
+float h_gold_batch[18] = {
+    30.f, 84.f, 138.f, 24.f, 69.f, 114.f, 18.f, 54.f, 90.f,
+
+    93.f, 42.f, 105.f, 81.f, 36.f, 87.f,  69.f, 30.f, 69.f};
+
+TEST(MatrixMultiply, float) {
+    array A32           = array(3, 3, h_lhs);
+    array B32           = array(3, 3, h_rhs);
+    af_array C32        = 0;
+    const float alpha32 = 1.0f;
+    const float beta32  = 0.0f;
+    af_gemm(&C32, AF_MAT_NONE, AF_MAT_NONE, &alpha32, A32.get(), B32.get(),
+            &beta32);
+    array expected32 = array(3, 3, h_gold);
+    ASSERT_ARRAYS_NEAR(expected32, af::array(C32), 0.0001);
+}
+
+TEST(MatrixMultiply, half) {
+    SUPPORTED_TYPE_CHECK(af_half);
+
+    array A16        = array(3, 3, h_lhs).as(f16);
+    array B16        = array(3, 3, h_rhs).as(f16);
+    array expected16 = array(3, 3, h_gold).as(f16);
+
+    {
+        af_array C16 = 0;
+        const half_float::half alpha16(1.0f);
+        const half_float::half beta16(0.0f);
+        ASSERT_SUCCESS(af_gemm(&C16, AF_MAT_NONE, AF_MAT_NONE, &alpha16,
+                               A16.get(), B16.get(), &beta16));
+        af::array C(C16);
+        ASSERT_ARRAYS_NEAR(expected16, C, 0.00001);
+    }
+    {
+        array C16 = matmul(A16, B16);
+        ASSERT_ARRAYS_NEAR(expected16, C16, 0.000001);
+    }
+}
+
+TEST(MatrixMultiply, schar) {
+    array A8         = array(3, 3, h_lhs).as(s8);
+    array B8         = array(3, 3, h_rhs).as(s8);
+    array expected32 = array(3, 3, h_gold).as(f32);
+
+    {
+        af_array C32 = 0;
+        const float alpha32(1.0f);
+        const float beta32(0.0f);
+        af_backend backend;
+        af_get_active_backend(&backend);
+        if (backend == AF_BACKEND_CUDA) {
+            ASSERT_SUCCESS(af_gemm(&C32, AF_MAT_NONE, AF_MAT_NONE, &alpha32,
+                                   A8.get(), B8.get(), &beta32));
+        } else {
+            ASSERT_EQ(AF_ERR_TYPE,
+                      af_gemm(&C32, AF_MAT_NONE, AF_MAT_NONE, &alpha32,
+                              A8.get(), B8.get(), &beta32));
+            SUCCEED();
+            return;
+        }
+        af::array C(C32);
+        ASSERT_ARRAYS_NEAR(expected32, C, 0.00001);
+    }
+    {
+        array C32 = matmul(A8, B8);
+        ASSERT_ARRAYS_NEAR(expected32, C32, 0.00001);
+    }
+}
+
+struct test_params {
+    af_mat_prop opt_lhs;
+    af_mat_prop opt_rhs;
+    float *alpha;
+    float *h_lhs;
+    float *h_rhs;
+    float *h_gold;
+    dim4 lhs_dims;
+    dim4 rhs_dims;
+    dim4 out_dims;
+    float *beta;
+    TestOutputArrayType out_array_type;
+
+    test_params(af_mat_prop optl, af_mat_prop optr, float *a, float *l,
+                float *r, float *g, dim4 ldims, dim4 rdims, dim4 odims,
+                float *b, TestOutputArrayType t)
+        : opt_lhs(optl)
+        , opt_rhs(optr)
+        , alpha(a)
+        , h_lhs(l)
+        , h_rhs(r)
+        , h_gold(g)
+        , lhs_dims(ldims)
+        , rhs_dims(rdims)
+        , out_dims(odims)
+        , beta(b)
+        , out_array_type(t) {}
+};
+
+class Gemm : public ::testing::TestWithParam<test_params> {
+   protected:
+    af_array lhs;
+    af_array rhs;
+    af_array gold;
+    af_array out;
+    TestOutputArrayInfo metadata;
+
+    void SetUp() {
+        test_params params = GetParam();
+
+        lhs  = 0;
+        rhs  = 0;
+        out  = 0;
+        gold = 0;
+
+        ASSERT_SUCCESS(af_create_array(&lhs, params.h_lhs,
+                                       params.lhs_dims.ndims(),
+                                       params.lhs_dims.get(), f32));
+        ASSERT_SUCCESS(af_create_array(&rhs, params.h_rhs,
+                                       params.rhs_dims.ndims(),
+                                       params.rhs_dims.get(), f32));
+
+        dim_t gold_dim0 = params.opt_lhs == AF_MAT_TRANS ? params.lhs_dims[1]
+                                                         : params.lhs_dims[0];
+        dim_t gold_dim1 = params.opt_rhs == AF_MAT_TRANS ? params.rhs_dims[0]
+                                                         : params.rhs_dims[1];
+        dim_t gold_dim2 = std::max(params.lhs_dims[2], params.rhs_dims[2]);
+        dim_t gold_dim3 = std::max(params.lhs_dims[3], params.rhs_dims[3]);
+        dim4 gold_dims(gold_dim0, gold_dim1, gold_dim2, gold_dim3);
+
+        metadata = TestOutputArrayInfo(params.out_array_type);
+        genTestOutputArray(&out, params.out_dims.ndims(), params.out_dims.get(),
+                           f32, &metadata);
+
+        ASSERT_SUCCESS(af_create_array(&gold, params.h_gold, gold_dims.ndims(),
+                                       gold_dims.get(), f32));
+    }
+
+    void TearDown() {
+        if (gold != 0) { ASSERT_SUCCESS(af_release_array(gold)); }
+        if (rhs != 0) { ASSERT_SUCCESS(af_release_array(rhs)); }
+        if (lhs != 0) { ASSERT_SUCCESS(af_release_array(lhs)); }
+    }
+};
+
+void replace_all(std::string &str, const std::string &oldStr,
+                 const std::string &newStr) {
+    std::string::size_type pos = 0u;
+    while ((pos = str.find(oldStr, pos)) != std::string::npos) {
+        str.replace(pos, oldStr.length(), newStr);
+        pos += newStr.length();
+    }
+}
+
+std::string concat_dim4(dim4 d) {
+    std::stringstream ss;
+    ss << d;
+    std::string s = ss.str();
+    replace_all(s, " ", "x");
+    return s;
+}
+
+string out_info(const ::testing::TestParamInfo<Gemm::ParamType> info) {
+    test_params params = info.param;
+
+    stringstream ss;
+    switch (params.out_array_type) {
+        case NULL_ARRAY: ss << "NullOut"; break;
+        case FULL_ARRAY: ss << "FullOut"; break;
+        case SUB_ARRAY: ss << "SubarrayOut"; break;
+        case REORDERED_ARRAY: ss << "ReorderedOut"; break;
+        default: ss << "UnknownOutArrayType"; break;
+    }
+
+    ss << "_" << concat_dim4(params.lhs_dims) << "_"
+       << concat_dim4(params.rhs_dims);
+
+    ss << "_";
+    ss << (params.opt_lhs == AF_MAT_TRANS ? "T" : "N");
+    ss << (params.opt_rhs == AF_MAT_TRANS ? "T" : "N");
+
+    if (params.lhs_dims[2] > 1 || params.rhs_dims[2] > 1) { ss << "_Batched"; }
+
+    return ss.str();
+}
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+    Square, Gemm,
+    ::testing::Values(
+        //          lhs_opts     rhs_opts     alpha  lhs    rhs    gold    lhs_dims    rhs_dims    out_dims    beta  out_array_type
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs, h_rhs, h_gold, dim4(3, 3), dim4(3, 3), dim4(3, 3), &beta, NULL_ARRAY     ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs, h_rhs, h_gold, dim4(3, 3), dim4(3, 3), dim4(3, 3), &beta, FULL_ARRAY     ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs, h_rhs, h_gold, dim4(3, 3), dim4(3, 3), dim4(3, 3), &beta, SUB_ARRAY      ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs, h_rhs, h_gold, dim4(3, 3), dim4(3, 3), dim4(3, 3), &beta, REORDERED_ARRAY)
+        ),
+    out_info
+    );
+// clang-format on
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+    Batched, Gemm,
+    ::testing::Values(
+        //          lhs_opts     rhs_opts     alpha  lhs          rhs    gold          lhs_dims       rhs_dims    out_dims       beta  out_array_type
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs_batch, h_rhs, h_gold_batch, dim4(3, 3, 2), dim4(3, 3), dim4(3, 3, 2), &beta, NULL_ARRAY     ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs_batch, h_rhs, h_gold_batch, dim4(3, 3, 2), dim4(3, 3), dim4(3, 3, 2), &beta, FULL_ARRAY     ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs_batch, h_rhs, h_gold_batch, dim4(3, 3, 2), dim4(3, 3), dim4(3, 3, 2), &beta, SUB_ARRAY      ),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_lhs_batch, h_rhs, h_gold_batch, dim4(3, 3, 2), dim4(3, 3), dim4(3, 3, 2), &beta, REORDERED_ARRAY)
+        ),
+    out_info
+    );
+// clang-format on
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+    NonSquare, Gemm,
+    ::testing::Values(
+        //          lhs_opts      rhs_opts      alpha  lhs         rhs         gold       lhs_dims    rhs_dims    out_dims    beta  out_array_type
+        test_params(AF_MAT_NONE,  AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_wide, h_gold_NN, dim4(3, 2), dim4(2, 3), dim4(3, 3), &beta, NULL_ARRAY),
+        test_params(AF_MAT_NONE,  AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_tall, h_gold_NT, dim4(3, 2), dim4(3, 2), dim4(3, 3), &beta, NULL_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_tall, h_gold_TN, dim4(3, 2), dim4(3, 2), dim4(2, 2), &beta, NULL_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_wide, h_gold_TT, dim4(3, 2), dim4(2, 3), dim4(2, 2), &beta, NULL_ARRAY),
+
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_half_ones, h_half_ones, h_gold_gemv, dim4(4, 5), dim4(5, 1), dim4(4, 1), &beta, NULL_ARRAY),
+        test_params(AF_MAT_NONE, AF_MAT_NONE, &alpha, h_half_ones, h_half_ones, h_gold_gemv, dim4(1, 5), dim4(5, 1), dim4(1, 1), &beta, NULL_ARRAY),
+        test_params(AF_MAT_NONE, AF_MAT_TRANS, &alpha, h_half_ones, h_half_ones, h_gold_gemv, dim4(4, 5), dim4(1, 5), dim4(4, 1), &beta, NULL_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_NONE, &alpha, h_half_ones, h_half_ones, h_gold_gemv, dim4(5, 4), dim4(5, 1), dim4(4, 1), &beta, NULL_ARRAY),
+
+        test_params(AF_MAT_NONE,  AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_wide, h_gold_NN, dim4(3, 2), dim4(2, 3), dim4(3, 3), &beta, FULL_ARRAY),
+        test_params(AF_MAT_NONE,  AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_tall, h_gold_NT, dim4(3, 2), dim4(3, 2), dim4(3, 3), &beta, FULL_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_tall, h_gold_TN, dim4(3, 2), dim4(3, 2), dim4(2, 2), &beta, FULL_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_wide, h_gold_TT, dim4(3, 2), dim4(2, 3), dim4(2, 2), &beta, FULL_ARRAY),
+
+        test_params(AF_MAT_NONE,  AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_wide, h_gold_NN, dim4(3, 2), dim4(2, 3), dim4(3, 3), &beta, SUB_ARRAY),
+        test_params(AF_MAT_NONE,  AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_tall, h_gold_NT, dim4(3, 2), dim4(3, 2), dim4(3, 3), &beta, SUB_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_tall, h_gold_TN, dim4(3, 2), dim4(3, 2), dim4(2, 2), &beta, SUB_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_wide, h_gold_TT, dim4(3, 2), dim4(2, 3), dim4(2, 2), &beta, SUB_ARRAY),
+
+        test_params(AF_MAT_NONE,  AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_wide, h_gold_NN, dim4(3, 2), dim4(2, 3), dim4(3, 3), &beta, REORDERED_ARRAY),
+        test_params(AF_MAT_NONE,  AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_tall, h_gold_NT, dim4(3, 2), dim4(3, 2), dim4(3, 3), &beta, REORDERED_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_NONE,  &alpha, h_lhs_tall, h_rhs_tall, h_gold_TN, dim4(3, 2), dim4(3, 2), dim4(2, 2), &beta, REORDERED_ARRAY),
+        test_params(AF_MAT_TRANS, AF_MAT_TRANS, &alpha, h_lhs_tall, h_rhs_wide, h_gold_TT, dim4(3, 2), dim4(2, 3), dim4(2, 2), &beta, REORDERED_ARRAY)
+        ),
+    out_info
+    );
+// clang-format on
+
+TEST_P(Gemm, UsePreallocatedOutArray) {
+    test_params params = GetParam();
+    ASSERT_SUCCESS(af_gemm(&out, params.opt_lhs, params.opt_rhs, params.alpha,
+                           lhs, rhs, params.beta));
+
+    ASSERT_SPECIAL_ARRAYS_EQ(gold, out, &metadata);
+}
+
+TEST(Gemm, DocSnippet) {
+    //! [ex_af_gemm_alloc]
+    af_array A, B;
+
+    dim_t adims[] = {5, 3, 2};
+    dim_t bdims[] = {3, 5, 2};
+    af_constant(&A, 1, 3, adims, f32);
+    af_constant(&B, 1, 3, bdims, f32);
+
+    float alpha = 1.f;
+    float beta  = 0.f;
+
+    // Undefined behavior!
+    // af_array undef;
+    // af_gemm(&undef, AF_MAT_NONE, AF_MAT_NONE, &alpha, A, B,
+    // &beta);
+
+    af_array C = 0;
+    af_gemm(&C, AF_MAT_NONE, AF_MAT_NONE, &alpha, A, B, &beta);
+    // C =
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+
+    //! [ex_af_gemm_alloc]
+
+    af_array c1_copy = 0;
+    ASSERT_SUCCESS(af_retain_array(&c1_copy, C));
+    af::array c1(c1_copy);
+    af::array gold1 = af::constant(3, 5, 5, 2, f32);
+    ASSERT_ARRAYS_EQ(gold1, c1);
+
+    //! [ex_af_gemm_overwrite]
+    alpha                = 1.f;
+    beta                 = 1.f;
+    af_seq first_slice[] = {af_span, af_span, {0., 0., 1.}};
+    af_array Asub, Bsub, Csub;
+    af_index(&Asub, A, 3, first_slice);
+    af_index(&Bsub, B, 3, first_slice);
+    af_index(&Csub, C, 3, first_slice);
+    af_gemm(&Csub, AF_MAT_NONE, AF_MAT_NONE, &alpha, Asub, Bsub, &beta);
+    // C =
+    //  6.   6.   6.   6.   6.
+    //  6.   6.   6.   6.   6.
+    //  6.   6.   6.   6.   6.
+    //  6.   6.   6.   6.   6.
+    //  6.   6.   6.   6.   6.
+    //
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //  3.   3.   3.   3.   3.
+    //! [ex_af_gemm_overwrite]
+
+    af_array c2_copy = 0;
+    ASSERT_SUCCESS(af_retain_array(&c2_copy, C));
+    af::array c2(c2_copy);
+    vector<float> gold2(5 * 5 * 2, 3);
+    fill(gold2.begin(), gold2.begin() + (5 * 5), 6);
+
+    af_release_array(A);
+    af_release_array(B);
+    af_release_array(C);
+    af_release_array(Asub);
+    af_release_array(Bsub);
+    af_release_array(Csub);
+
+    ASSERT_VEC_ARRAY_EQ(gold2, dim4(5, 5, 2), c2);
+}
+
+TEST(Gemv, HalfScalarProduct) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+
+    const unsigned int sizeValue = 5;
+    array gold                   = constant(sizeValue, 4, 1, f16);
+    {
+        array a     = constant(1, 4, sizeValue, f16);
+        array b     = constant(1, sizeValue, 1, f16);
+        array mmRes = matmul(a, b);
+        ASSERT_ARRAYS_EQ(mmRes, gold);
+    }
+    {
+        array a      = constant(1, 1, sizeValue, f16);
+        array b      = constant(1, sizeValue, 1, f16);
+        array mmRes  = matmul(a, b);
+        array dotRes = dot(transpose(a), b);
+        ASSERT_ARRAYS_EQ(mmRes, dotRes);
     }
 }
 
-TYPED_TEST(MatrixMultiply, MultiGPURectangleVector_CPP)
-{
-    for(int i = 0; i < af::getDeviceCount(); i++) {
-        af::setDevice(i);
-        cppMatMulCheck<TypeParam, true>(TEST_DIR"/blas/RectangleVector.test");
+TEST(MatrixMultiply, SameInput) {
+    // Tests for an error that occurred in the Intel OpenCL GPU implementation
+    // when the same array was passed as both the lhs and the rhs. See #1711
+    // and PR #2774. Caused by mapping the same buffer with CL_MEM_WRITE
+    // access.
+    int dim = 10;
+    array a = randu(dim, dim);
+    vector<float> ha(dim * dim);
+    a.host(&ha.front());
+
+    vector<float> hgold(dim * dim, 0);
+
+    for (int i = 0; i < dim; i++) {
+        for (int j = 0; j < dim; j++) {
+            for (int k = 0; k < dim; k++) {
+                hgold[i * dim + j] += ha[k * dim + j] * ha[i * dim + k];
+            }
+        }
     }
+    array out = matmul(a, a);
+    ASSERT_VEC_ARRAY_NEAR(hgold, dim4(dim, dim), out, 1e-4);
 }
diff --git a/test/canny.cpp b/test/canny.cpp
new file mode 100644
index 0000000000..0a0fdbc08c
--- /dev/null
+++ b/test/canny.cpp
@@ -0,0 +1,269 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class CannyEdgeDetector : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, int, uint, short, ushort, schar, uchar, double>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(CannyEdgeDetector, TestTypes);
+
+template<typename T>
+void cannyTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<char>> tests;
+
+    readTests<T, char, int>(pTestFile, numDims, in, tests);
+
+    dim4 sDims        = numDims[0];
+    af_array outArray = 0;
+    af_array sArray   = 0;
+
+    ASSERT_SUCCESS(af_create_array(&sArray, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_canny(&outArray, sArray, AF_CANNY_THRESHOLD_MANUAL,
+                            0.4147f, 0.8454f, 3, true));
+
+    vector<char> outData(sDims.elements());
+
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData.data(), outArray));
+
+    vector<char> currGoldBar = tests[0];
+    size_t nElems            = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // cleanup
+    ASSERT_SUCCESS(af_release_array(sArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
+}
+
+TYPED_TEST(CannyEdgeDetector, ArraySizeLessThanBlockSize10x10) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    cannyTest<TypeParam>(string(TEST_DIR "/CannyEdgeDetector/fast10x10.test"));
+}
+
+TYPED_TEST(CannyEdgeDetector, ArraySizeEqualBlockSize16x16) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    cannyTest<TypeParam>(string(TEST_DIR "/CannyEdgeDetector/fast16x16.test"));
+}
+
+TEST(Canny, DISABLED_Exact) {
+    using namespace af;
+    array img = loadImage(TEST_DIR "/CannyEdgeDetector/woman.jpg", false);
+
+    array out = canny(img, AF_CANNY_THRESHOLD_AUTO_OTSU, 0.08, 0.32, 3, false);
+    array gold =
+        loadImage(TEST_DIR "/CannyEdgeDetector/woman_edges.jpg", false) > 3;
+
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+template<typename T>
+void cannyImageOtsuTest(string pTestFile, bool isColor) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    using af::dim4;
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array _inArray  = 0;
+        af_array inArray   = 0;
+        af_array _outArray = 0;
+        af_array cstArray  = 0;
+        af_array mulArray  = 0;
+        af_array outArray  = 0;
+        af_array goldArray = 0;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/CannyEdgeDetector/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/CannyEdgeDetector/"));
+
+        af_dtype type = (af_dtype)dtype_traits<T>::af_type;
+
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), isColor));
+
+        ASSERT_SUCCESS(af_cast(&inArray, _inArray, type));
+
+        ASSERT_SUCCESS(
+            af_load_image_native(&goldArray, outFiles[testId].c_str()));
+
+        ASSERT_SUCCESS(af_canny(&_outArray, inArray,
+                                AF_CANNY_THRESHOLD_AUTO_OTSU, 0.08, 0.32, 3,
+                                false));
+
+        unsigned ndims = 0;
+        dim_t dims[4];
+
+        ASSERT_SUCCESS(af_get_numdims(&ndims, _outArray));
+        ASSERT_SUCCESS(
+            af_get_dims(dims, dims + 1, dims + 2, dims + 3, _outArray));
+
+        ASSERT_SUCCESS(af_constant(&cstArray, 255.0, ndims, dims, f32));
+
+        ASSERT_SUCCESS(af_mul(&mulArray, cstArray, _outArray, false));
+        ASSERT_SUCCESS(af_cast(&outArray, mulArray, u8));
+
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 1.0e-3);
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(cstArray));
+        ASSERT_SUCCESS(af_release_array(mulArray));
+        ASSERT_SUCCESS(af_release_array(_outArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+}
+
+TEST(CannyEdgeDetector, OtsuThreshold) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    cannyImageOtsuTest<float>(string(TEST_DIR "/CannyEdgeDetector/gray.test"),
+                              false);
+}
+
+TEST(CannyEdgeDetector, InvalidSizeArray) {
+    af_array inArray  = 0;
+    af_array outArray = 0;
+
+    vector<float> in(100, 1);
+
+    dim4 sDims(100, 1, 1, 1);
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_canny(&outArray, inArray, AF_CANNY_THRESHOLD_MANUAL, 0.24,
+                       0.72, 3, true));
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+}
+
+TEST(CannyEdgeDetector, Array4x4_Invalid) {
+    af_array inArray  = 0;
+    af_array outArray = 0;
+
+    vector<float> in(16, 1);
+
+    dim4 sDims(4, 4, 1, 1);
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_canny(&outArray, inArray, AF_CANNY_THRESHOLD_MANUAL, 0.24,
+                       0.72, 3, true));
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+}
+
+TEST(CannyEdgeDetector, Sobel5x5_Invalid) {
+    af_array inArray  = 0;
+    af_array outArray = 0;
+
+    vector<float> in(25, 1);
+
+    dim4 sDims(5, 5, 1, 1);
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_EQ(AF_ERR_ARG,
+              af_canny(&outArray, inArray, AF_CANNY_THRESHOLD_MANUAL, 0.24,
+                       0.72, 5, true));
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+}
+
+template<typename T>
+void cannyImageOtsuBatchTest(string pTestFile, const dim_t targetBatchCount) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    using af::array;
+    using af::canny;
+    using af::loadImage;
+    using af::loadImageNative;
+    using af::tile;
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/CannyEdgeDetector/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/CannyEdgeDetector/"));
+
+        af_dtype type  = (af_dtype)dtype_traits<T>::af_type;
+        array readGold = loadImageNative(outFiles[testId].c_str());
+        array goldIm   = tile(readGold, 1, 1, targetBatchCount);
+        array readImg  = loadImage(inFiles[testId].c_str(), false).as(type);
+        array inputIm  = tile(readImg, 1, 1, targetBatchCount);
+
+        array outIm =
+            canny(inputIm, AF_CANNY_THRESHOLD_AUTO_OTSU, 0.08, 0.32, 3, false);
+        outIm *= 255.0;
+
+        ASSERT_IMAGES_NEAR(goldIm, outIm.as(u8), 1.0e-3);
+    }
+}
+
+TEST(CannyEdgeDetector, BatchofImagesUsingCPPAPI) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    // Do not increase the batch count beyond 4.
+    // This is a limitation of the test assert macro, which saves images
+    // to disk and cannot handle larger batches of images.
+    cannyImageOtsuBatchTest<float>(
+        string(TEST_DIR "/CannyEdgeDetector/gray.test"), 3);
+}
diff --git a/test/cast.cpp b/test/cast.cpp
new file mode 100644
index 0000000000..d2b4f95250
--- /dev/null
+++ b/test/cast.cpp
@@ -0,0 +1,204 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/random.h>
+#include <algorithm>
+#include <cstdlib>
+#include <vector>
+
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+
+const int num = 10;
+
+template<typename Ti, typename To>
+void cast_test() {
+    SUPPORTED_TYPE_CHECK(Ti);
+    SUPPORTED_TYPE_CHECK(To);
+
+    af_dtype ta = (af_dtype)dtype_traits<Ti>::af_type;
+    af_dtype tb = (af_dtype)dtype_traits<To>::af_type;
+    dim4 dims(num, 1, 1, 1);
+    af_array a = 0, b = 0;
+    ASSERT_SUCCESS(af_randu(&a, dims.ndims(), dims.get(), ta));
+    af_err err = af_cast(&b, a, tb);
+    af_release_array(a);
+    if (b != 0) { af_release_array(b); }
+    ASSERT_SUCCESS(err);
+}
+
+#define REAL_TO_TESTS(Ti, To) \
+    TEST(CAST_TEST, Test_Real_##Ti##_##To) { cast_test<Ti, To>(); }
+
+#define REAL_TEST_INVOKE(Ti)     \
+    REAL_TO_TESTS(Ti, float);    \
+    REAL_TO_TESTS(Ti, cfloat);   \
+    REAL_TO_TESTS(Ti, double);   \
+    REAL_TO_TESTS(Ti, cdouble);  \
+    REAL_TO_TESTS(Ti, char);     \
+    REAL_TO_TESTS(Ti, int);      \
+    REAL_TO_TESTS(Ti, unsigned); \
+    REAL_TO_TESTS(Ti, schar);    \
+    REAL_TO_TESTS(Ti, uchar);    \
+    REAL_TO_TESTS(Ti, intl);     \
+    REAL_TO_TESTS(Ti, uintl);    \
+    REAL_TO_TESTS(Ti, short);    \
+    REAL_TO_TESTS(Ti, ushort);
+
+#define CPLX_TEST_INVOKE(Ti)   \
+    REAL_TO_TESTS(Ti, cfloat); \
+    REAL_TO_TESTS(Ti, cdouble);
+
+REAL_TEST_INVOKE(float)
+REAL_TEST_INVOKE(double)
+REAL_TEST_INVOKE(char)
+REAL_TEST_INVOKE(int)
+REAL_TEST_INVOKE(unsigned)
+REAL_TEST_INVOKE(schar)
+REAL_TEST_INVOKE(uchar)
+REAL_TEST_INVOKE(intl)
+REAL_TEST_INVOKE(uintl)
+REAL_TEST_INVOKE(short)
+REAL_TEST_INVOKE(ushort)
+CPLX_TEST_INVOKE(cfloat)
+CPLX_TEST_INVOKE(cdouble)
+
+// Converting complex to real is expected to fail because this operation is
+// not allowed. Use functions such as abs, real, imag, or arg to make the
+// conversion explicit.
+template<typename Ti, typename To>
+void cast_test_complex_real() {
+    SUPPORTED_TYPE_CHECK(Ti);
+    SUPPORTED_TYPE_CHECK(To);
+
+    af_dtype ta = (af_dtype)dtype_traits<Ti>::af_type;
+    af_dtype tb = (af_dtype)dtype_traits<To>::af_type;
+    dim4 dims(num, 1, 1, 1);
+    af_array a = 0, b = 0;
+    ASSERT_SUCCESS(af_randu(&a, dims.ndims(), dims.get(), ta));
+    af_err err = af_cast(&b, a, tb);
+    ASSERT_EQ(err, AF_ERR_TYPE);
+    ASSERT_SUCCESS(af_release_array(a));
+}
+
+#define COMPLEX_REAL_TESTS(Ti, To)                      \
+    TEST(CAST_TEST, Test_Complex_To_Real_##Ti##_##To) { \
+        SUPPORTED_TYPE_CHECK(Ti);                       \
+        SUPPORTED_TYPE_CHECK(To);                       \
+        cast_test_complex_real<Ti, To>();               \
+    }
+
+COMPLEX_REAL_TESTS(cfloat, float)
+COMPLEX_REAL_TESTS(cfloat, double)
+COMPLEX_REAL_TESTS(cdouble, float)
+COMPLEX_REAL_TESTS(cdouble, double)
+
+TEST(CAST_TEST, Test_JIT_DuplicateCastNoop) {
+    // Does a trivial cast - check JIT kernel trace to ensure a __noop is
+    // generated since we don't have a way to test it directly
+    SUPPORTED_TYPE_CHECK(double);
+    af_dtype ta = (af_dtype)dtype_traits<float>::af_type;
+    af_dtype tb = (af_dtype)dtype_traits<double>::af_type;
+    dim4 dims(num, 1, 1, 1);
+    af_array a = 0, b = 0, c = 0;
+    ASSERT_SUCCESS(af_randu(&a, dims.ndims(), dims.get(), ta));
+
+    ASSERT_SUCCESS(af_cast(&b, a, tb));
+    ASSERT_SUCCESS(af_cast(&c, b, ta));
+
+    std::vector<float> a_vals(num);
+    std::vector<float> c_vals(num);
+    ASSERT_SUCCESS(af_get_data_ptr(a_vals.data(), a));
+    ASSERT_SUCCESS(af_get_data_ptr(c_vals.data(), c));
+
+    for (int i = 0; i < num; ++i) { ASSERT_FLOAT_EQ(a_vals[i], c_vals[i]); }
+
+    af_release_array(a);
+    af_release_array(b);
+    af_release_array(c);
+}
+
+TEST(Cast, ImplicitCast) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a = randu(100, 100, f64);
+    array b = a.as(f32);
+
+    array c = max(abs(a - b));
+    ASSERT_ARRAYS_NEAR(constant(0, 1, 100, f64), c, 1e-7);
+}
+
+TEST(Cast, ConstantCast) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a = constant(1, 100, f64);
+    array b = a.as(f32);
+
+    array c = max(abs(a - b));
+    ASSERT_ARRAYS_NEAR(c, constant(0, 1, f64), 1e-7);
+}
+
+TEST(Cast, OpCast) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a = constant(1, 100, f64);
+    a       = a + a;
+    array b = a.as(f32);
+
+    array c = max(abs(a - b));
+    ASSERT_ARRAYS_NEAR(c, constant(0, 1, f64), 1e-7);
+}
+TEST(Cast, ImplicitCastIndexed) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a = randu(100, 100, f64);
+    array b = a(span, 1).as(f32);
+    array c = max(abs(a(span, 1) - b));
+    ASSERT_ARRAYS_NEAR(constant(0, 1, 1, f64), c, 1e-7);
+}
+
+TEST(Cast, ImplicitCastIndexedNonLinear) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a = randu(100, 100, f64);
+    array b = a(seq(10, 20, 2), 1).as(f32);
+    array c = max(abs(a(seq(10, 20, 2), 1) - b));
+    ASSERT_ARRAYS_NEAR(constant(0, 1, 1, f64), c, 1e-7);
+}
+
+TEST(Cast, ImplicitCastIndexedNonLinearArray) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array a   = randu(100, 100, f64);
+    array idx = seq(10, 20, 2);
+    array b   = a(idx, 1).as(f32);
+    array c   = max(abs(a(idx, 1) - b));
+    ASSERT_ARRAYS_NEAR(constant(0, 1, 1, f64), c, 1e-7);
+}
+
+TEST(Cast, ImplicitCastIndexedAndScoped) {
+    using namespace af;
+    SUPPORTED_TYPE_CHECK(double);
+    array c;
+    {
+        array a = randu(100, 100, f64);
+        array b = a(span, 1).as(f32);
+        c       = abs(a(span, 1) - b);
+    }
+    c = max(c);
+    ASSERT_ARRAYS_NEAR(constant(0, 1, 1, f64), c, 1e-7);
+}
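
The Test_Complex_To_Real cases in test/cast.cpp above expect af_cast from a complex type to a real type to return AF_ERR_TYPE, because the conversion has to be made explicit through abs, real, imag, or arg. The following is a minimal standalone sketch of those four explicit conversions. It uses std::complex from the C++ standard library rather than ArrayFire, and the names RealViews and toRealViews are hypothetical, introduced only for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Hypothetical helper type (not part of ArrayFire): the four common
// real-valued views of a single complex number.
struct RealViews {
    double re;   // real part
    double im;   // imaginary part
    double mag;  // magnitude (absolute value)
    double ang;  // phase angle in radians
};

// Each field is produced by an explicit conversion, so the caller always
// states which real quantity is wanted instead of relying on a cast.
RealViews toRealViews(const std::complex<double> &z) {
    RealViews v;
    v.re  = std::real(z);
    v.im  = std::imag(z);
    v.mag = std::abs(z);
    v.ang = std::arg(z);
    return v;
}
```

For the value 3 + 4i, toRealViews would give a real part of 3, an imaginary part of 4, and a magnitude of 5, which shows why a single implicit complex-to-real cast would be ambiguous.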
diff --git a/test/cholesky_dense.cpp b/test/cholesky_dense.cpp
index 93d13316ef..dea036eca1 100644
--- a/test/cholesky_dense.cpp
+++ b/test/cholesky_dense.cpp
@@ -7,81 +7,130 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::identity;
+using af::matmul;
+using af::max;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-void choleskyTester(const int n, double eps, bool is_upper)
-{
-    if (noDoubleTests<T>()) return;
+void choleskyTester(const int n, double eps, bool is_upper) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
 
-    af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
+    dtype ty = (dtype)dtype_traits<T>::af_type;
 
     // Prepare positive definite matrix
 #if 1
-    af::array a = cpu_randu<T>(af::dim4(n, n));
+    array a = cpu_randu<T>(dim4(n, n));
 #else
-    af::array a = af::randu(n, n, ty);
+    array a = randu(n, n, ty);
 #endif
-    af::array b = 10 * n * af::identity(n, n, ty);
-    af::array in = matmul(a.H(), a) + b;
+    array b  = 10 * n * identity(n, n, ty);
+    array in = matmul(a.H(), a) + b;
 
     //! [ex_chol_reg]
-    af::array out;
+    array out;
     cholesky(out, in, is_upper);
     //! [ex_chol_reg]
 
-    af::array re = is_upper ? matmul(out.H(), out) : matmul(out, out.H());
+    array re = is_upper ? matmul(out.H(), out) : matmul(out, out.H());
 
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(in - re))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(in - re))), eps);
+    ASSERT_NEAR(0, max<typename dtype_traits<T>::base_type>(abs(real(in - re))),
+                eps);
+    ASSERT_NEAR(0, max<typename dtype_traits<T>::base_type>(abs(imag(in - re))),
+                eps);
 
     //! [ex_chol_inplace]
-    af::array in2 = in.copy();
+    array in2 = in.copy();
     choleskyInPlace(in2, is_upper);
     //! [ex_chol_inplace]
 
-    af::array out2 = is_upper ? upper(in2) : lower(in2);
+    array out2 = is_upper ? upper(in2) : lower(in2);
+
+    ASSERT_NEAR(0,
+                max<typename dtype_traits<T>::base_type>(abs(real(out2 - out))),
+                eps);
+    ASSERT_NEAR(0,
+                max<typename dtype_traits<T>::base_type>(abs(imag(out2 - out))),
+                eps);
+}
+
+template<typename T>
+class Cholesky : public ::testing::Test {};
+
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(Cholesky, TestTypes);
+
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 0.05f;
+}
+
+template<>
+double eps<double>() {
+    return 1e-8;
+}
+
+template<>
+double eps<cfloat>() {
+    return 0.05f;
+}
+
+template<>
+double eps<cdouble>() {
+    return 1e-8;
+}
+
+TYPED_TEST(Cholesky, Upper) {
+    choleskyTester<TypeParam>(500, eps<TypeParam>(), true);
+}
 
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(out2 - out))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(out2 - out))), eps);
+TYPED_TEST(Cholesky, UpperLarge) {
+    choleskyTester<TypeParam>(1000, eps<TypeParam>(), true);
 }
 
-#define CHOLESKY_BIG_TESTS(T, eps)              \
-    TEST(Cholesky, T##Upper)                    \
-    {                                           \
-        choleskyTester<T>( 500, eps, true );    \
-    }                                           \
-    TEST(Cholesky, T##Lower)                    \
-    {                                           \
-        choleskyTester<T>(1000, eps, false);    \
-    }                                           \
-    TEST(Cholesky, T##UpperMultiple)            \
-    {                                           \
-        choleskyTester<T>(1024, eps, true );    \
-    }                                           \
-    TEST(Cholesky, T##LowerMultiple)            \
-    {                                           \
-        choleskyTester<T>( 512, eps, false);    \
-    }                                           \
-
-
-CHOLESKY_BIG_TESTS(float, 0.05)
-CHOLESKY_BIG_TESTS(double, 1E-8)
-CHOLESKY_BIG_TESTS(cfloat, 0.05)
-CHOLESKY_BIG_TESTS(cdouble, 1E-8)
+TYPED_TEST(Cholesky, UpperMultipleOfTwo) {
+    choleskyTester<TypeParam>(512, eps<TypeParam>(), true);
+}
+
+TYPED_TEST(Cholesky, UpperMultipleOfTwoLarge) {
+    choleskyTester<TypeParam>(1024, eps<TypeParam>(), true);
+}
+
+TYPED_TEST(Cholesky, Lower) {
+    choleskyTester<TypeParam>(500, eps<TypeParam>(), false);
+}
+
+TYPED_TEST(Cholesky, LowerLarge) {
+    choleskyTester<TypeParam>(1000, eps<TypeParam>(), false);
+}
+
+TYPED_TEST(Cholesky, LowerMultipleOfTwo) {
+    choleskyTester<TypeParam>(512, eps<TypeParam>(), false);
+}
+
+TYPED_TEST(Cholesky, LowerMultipleOfTwoLarge) {
+    choleskyTester<TypeParam>(1024, eps<TypeParam>(), false);
+}
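
choleskyTester in test/cholesky_dense.cpp above builds a positive definite input as matmul(a.H(), a) + 10 * n * identity, factors it, and checks that the reconstructed product matches the input within eps. The sketch below walks through the same three steps for a small real matrix in plain C++ without ArrayFire. It assumes row-major storage and the lower-triangular case only. The names Matrix, makePositiveDefinite, choleskyLower, and maxReconstructionError are hypothetical and do not appear in the test suite.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

typedef std::vector<double> Matrix;  // row-major n x n storage (assumption)

// Build in = a^T a + 10 n I, mirroring how the test conditions its input.
Matrix makePositiveDefinite(const Matrix &a, std::size_t n) {
    assert(a.size() == n * n);
    Matrix in(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                sum += a[k * n + i] * a[k * n + j];
            }
            in[i * n + j] = sum;
        }
        in[i * n + i] += 10.0 * static_cast<double>(n);
    }
    return in;
}

// Lower-triangular factor l with in = l l^T (row-by-row Cholesky).
Matrix choleskyLower(const Matrix &in, std::size_t n) {
    assert(in.size() == n * n);
    Matrix l(n * n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j <= i; ++j) {
            double sum = in[i * n + j];
            for (std::size_t k = 0; k < j; ++k) {
                sum -= l[i * n + k] * l[j * n + k];
            }
            if (i == j) {
                assert(sum > 0.0);  // holds for positive definite input
                l[i * n + i] = std::sqrt(sum);
            } else {
                l[i * n + j] = sum / l[j * n + j];
            }
        }
    }
    return l;
}

// Largest absolute entry of in - l l^T, the quantity compared against eps.
double maxReconstructionError(const Matrix &in, const Matrix &l,
                              std::size_t n) {
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double re = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                re += l[i * n + k] * l[j * n + k];
            }
            const double diff = std::fabs(in[i * n + j] - re);
            if (diff > worst) { worst = diff; }
        }
    }
    return worst;
}
```

Adding 10 n to the diagonal keeps the matrix well conditioned even when a itself is singular, which is presumably why the test does not need a tolerance that grows with n.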
diff --git a/test/clamp.cpp b/test/clamp.cpp
new file mode 100644
index 0000000000..c830b06b2b
--- /dev/null
+++ b/test/clamp.cpp
@@ -0,0 +1,234 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/defines.h>
+#include <af/random.h>
+#include <af/traits.hpp>
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::dtype;
+using af::randu;
+using std::abs;
+using std::string;
+using std::stringstream;
+using std::vector;
+
+const int num = 10000;
+
+struct clamp_params {
+    dim4 size_;
+    dtype in_type_;
+    dtype lo_type_;
+    dtype hi_type_;
+    dtype out_type_;
+
+    clamp_params(dim4 size, dtype itype, dtype ltype, dtype htype, dtype otype)
+        : size_(size)
+        , in_type_(itype)
+        , lo_type_(ltype)
+        , hi_type_(htype)
+        , out_type_(otype) {}
+};
+
+template<typename T>
+class Clamp : public ::testing::TestWithParam<clamp_params> {
+   public:
+    void SetUp() {
+        clamp_params params = GetParam();
+        SUPPORTED_TYPE_CHECK(double);
+        if (noDoubleTests(params.in_type_))
+            GTEST_SKIP() << "Double not supported on this device";
+        if (noHalfTests(params.in_type_))
+            GTEST_SKIP() << "Half not supported on this device";
+        if (noDoubleTests(params.hi_type_))
+            GTEST_SKIP() << "Double not supported on this device";
+        if (noHalfTests(params.hi_type_))
+            GTEST_SKIP() << "Half not supported on this device";
+        if (noDoubleTests(params.lo_type_))
+            GTEST_SKIP() << "Double not supported on this device";
+        if (noHalfTests(params.lo_type_))
+            GTEST_SKIP() << "Half not supported on this device";
+
+        in_ = randu(params.size_, params.in_type_);
+        lo_ = randu(params.size_, params.lo_type_) / T(10);
+        hi_ = T(1) - randu(params.size_, params.hi_type_) / T(10);
+        lo_ = lo_.as(params.lo_type_);
+        hi_ = hi_.as(params.hi_type_);
+
+        size_t num = params.size_.elements();
+        vector<T> hgold(num), hin(num), hlo(num), hhi(num);
+        in_.as((dtype)af::dtype_traits<T>::af_type).host(&hin[0]);
+        lo_.as((dtype)af::dtype_traits<T>::af_type).host(&hlo[0]);
+        hi_.as((dtype)af::dtype_traits<T>::af_type).host(&hhi[0]);
+
+        for (size_t i = 0; i < num; i++) {
+            if (hin[i] < hlo[i])
+                hgold[i] = hlo[i];
+            else if (hin[i] > hhi[i])
+                hgold[i] = hhi[i];
+            else
+                hgold[i] = hin[i];
+        }
+
+        gold_ = array(params.size_, &hgold[0]);
+        gold_ = gold_.as(params.out_type_);
+        gold_.eval();
+    }
+
+    af::array in_;
+    af::array lo_;
+    af::array hi_;
+    af::array gold_;
+};
+
+string pd4(dim4 dims) {
+    string out(32, '\0');
+    int len = snprintf(&out[0], 32, "%lld_%lld_%lld_%lld", dims[0], dims[1],
+                       dims[2], dims[3]);
+    out.resize(len);
+    return out;
+}
+
+string testNameGenerator(const ::testing::TestParamInfo<clamp_params> info) {
+    stringstream ss;
+    ss << "size_" << pd4(info.param.size_) << "_in_" << info.param.in_type_
+       << "_lo_" << info.param.lo_type_ << "_hi_" << info.param.hi_type_;
+    return ss.str();
+}
+
+typedef Clamp<double> ClampFloatingPoint;
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+    SmallDims, ClampFloatingPoint,
+    ::testing::Values(
+                      clamp_params(dim4(10), f32, f32, f32, f32),
+                      clamp_params(dim4(10), f64, f32, f32, f64),
+                      clamp_params(dim4(10), f16, f32, f32, f32),
+                      clamp_params(dim4(10), f64, f64, f64, f64),
+                      clamp_params(dim4(10), f16, f16, f16, f16),
+                      clamp_params(dim4(10), s32, f32, f32, f32),
+                      clamp_params(dim4(10), u32, f32, f32, f32),
+                      clamp_params(dim4(10), s8,  f32, f32, f32),
+                      clamp_params(dim4(10), u8,  f32, f32, f32),
+                      clamp_params(dim4(10), b8,  f32, f32, f32),
+                      clamp_params(dim4(10), s64, f32, f32, f32),
+                      clamp_params(dim4(10), u64, f32, f32, f32),
+                      clamp_params(dim4(10), s16, f32, f32, f32),
+                      clamp_params(dim4(10), u16, f32, f32, f32),
+
+                      clamp_params(dim4(10, 10), f32, f32, f32, f32),
+                      clamp_params(dim4(10, 10), f64, f32, f32, f64),
+                      clamp_params(dim4(10, 10), f16, f32, f32, f32),
+                      clamp_params(dim4(10, 10), f64, f64, f64, f64),
+                      clamp_params(dim4(10, 10), f16, f16, f16, f16),
+
+                      clamp_params(dim4(10, 10, 10), f32, f32, f32, f32),
+                      clamp_params(dim4(10, 10, 10), f64, f32, f32, f64),
+                      clamp_params(dim4(10, 10, 10), f16, f32, f32, f32),
+                      clamp_params(dim4(10, 10, 10), f64, f64, f64, f64),
+                      clamp_params(dim4(10, 10, 10), f16, f16, f16, f16)
+                      ),
+    testNameGenerator);
+// clang-format on
+
+TEST_P(ClampFloatingPoint, Basic) {
+    clamp_params params = GetParam();
+    array out           = clamp(in_, lo_, hi_);
+    ASSERT_ARRAYS_NEAR(gold_, out, 1e-5);
+}
+
+TEST(Clamp, FloatArrayArray) {
+    array in = randu(num, f32);
+    array lo = randu(num, f32) / 10;        // Ensure lo <= 0.1
+    array hi = 1.0 - randu(num, f32) / 10;  // Ensure hi >= 0.9
+    eval(lo, hi);
+
+    vector<float> hout(num), hin(num), hlo(num), hhi(num);
+    array out = clamp(in, lo, hi);
+    out.host(&hout[0]);
+    in.host(&hin[0]);
+    lo.host(&hlo[0]);
+    hi.host(&hhi[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_LE(hout[i], hhi[i]);
+        ASSERT_GE(hout[i], hlo[i]);
+        ASSERT_EQ(true,
+                  hout[i] == hin[i] || hout[i] == hlo[i] || hout[i] == hhi[i]);
+    }
+}
+
+TEST(Clamp, FloatArrayScalar) {
+    array in = randu(num, f32);
+    array lo = randu(num, f32) / 10;  // Ensure lo <= 0.1
+    float hi = 0.9;
+
+    vector<float> hout(num), hin(num), hlo(num);
+    array out = clamp(in, lo, hi);
+
+    out.host(&hout[0]);
+    in.host(&hin[0]);
+    lo.host(&hlo[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_LE(hout[i], hi);
+        ASSERT_GE(hout[i], hlo[i]);
+        ASSERT_EQ(true,
+                  hout[i] == hin[i] || hout[i] == hlo[i] || hout[i] == hi);
+    }
+}
+
+TEST(Clamp, FloatScalarArray) {
+    array in = randu(num, f32);
+    float lo = 0.1;
+    array hi = 1.0 - randu(num, f32) / 10;  // Ensure hi >= 0.9
+
+    vector<float> hout(num), hin(num), hhi(num);
+    array out = clamp(in, lo, hi);
+
+    out.host(&hout[0]);
+    in.host(&hin[0]);
+    hi.host(&hhi[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_LE(hout[i], hhi[i]);
+        ASSERT_GE(hout[i], lo);
+        ASSERT_EQ(true,
+                  hout[i] == hin[i] || hout[i] == lo || hout[i] == hhi[i]);
+    }
+}
+
+TEST(Clamp, FloatScalarScalar) {
+    array in = randu(num, f32);
+    float lo = 0.1;
+    float hi = 0.9;
+
+    vector<float> hout(num), hin(num);
+    array out = clamp(in, lo, hi);
+
+    out.host(&hout[0]);
+    in.host(&hin[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_LE(hout[i], hi);
+        ASSERT_GE(hout[i], lo);
+        ASSERT_EQ(true, hout[i] == hin[i] || hout[i] == lo || hout[i] == hi);
+    }
+}
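
The Clamp fixture in test/clamp.cpp above computes its gold values on the host: each input element is replaced by the lower bound if it falls below it, by the upper bound if it exceeds it, and is otherwise kept. A minimal standalone sketch of that reference computation follows. It works on std::vector<double> instead of af::array, and the function name clampReference is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise reference clamp, following the host-side gold loop in the
// fixture. All three vectors must have the same length.
std::vector<double> clampReference(const std::vector<double> &in,
                                   const std::vector<double> &lo,
                                   const std::vector<double> &hi) {
    assert(in.size() == lo.size() && in.size() == hi.size());
    std::vector<double> gold(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] < lo[i]) {
            gold[i] = lo[i];  // below the lower bound
        } else if (in[i] > hi[i]) {
            gold[i] = hi[i];  // above the upper bound
        } else {
            gold[i] = in[i];  // already inside [lo, hi]
        }
    }
    return gold;
}
```

Every output element equals one of the input, the lower bound, or the upper bound, which is the same property the FloatArrayArray and related tests assert on the device result.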
diff --git a/test/compare.cpp b/test/compare.cpp
new file mode 100644
index 0000000000..877c08275f
--- /dev/null
+++ b/test/compare.cpp
@@ -0,0 +1,56 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/arith.h>
+#include <af/array.h>
+#include <af/data.h>
+#include <af/random.h>
+
+using af::array;
+using af::dtype_traits;
+using af::randu;
+using std::vector;
+
+template<typename T>
+class Compare : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, uint, int, intl, uintl, schar, uchar,
+                         short, ushort, half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Compare, TestTypes);
+
+#define COMPARE(OP, Name)                                   \
+    TYPED_TEST(Compare, Test_##Name) {                      \
+        typedef TypeParam T;                                \
+        SUPPORTED_TYPE_CHECK(T);                            \
+        const int num = 1 << 20;                            \
+        af_dtype ty   = (af_dtype)dtype_traits<T>::af_type; \
+        array a       = randu(num, ty);                     \
+        array b       = randu(num, ty);                     \
+        array c       = a OP b;                             \
+        vector<T> ha(num), hb(num);                         \
+        vector<char> hc(num);                               \
+        a.host(&ha[0]);                                     \
+        b.host(&hb[0]);                                     \
+        c.host(&hc[0]);                                     \
+        for (int i = 0; i < num; i++) {                     \
+            char res = ha[i] OP hb[i];                      \
+            ASSERT_EQ((int)res, (int)hc[i]);                \
+        }                                                   \
+    }
+
+COMPARE(==, eq)
+COMPARE(!=, ne)
+COMPARE(<=, le)
+COMPARE(>=, ge)
+COMPARE(<, lt)
+COMPARE(>, gt)
diff --git a/test/complex.cpp b/test/complex.cpp
index f242fb4974..fe8a60c0f9 100644
--- a/test/complex.cpp
+++ b/test/complex.cpp
@@ -8,133 +8,192 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/random.h>
 
-using namespace std;
+using std::endl;
 using namespace af;
 
 const int num = 10;
 
-#define COMPLEX_TESTS(Ta, Tb, Tc)                                       \
-    TEST(ComplexTests, Test_##Ta##_##Tb)                                \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randu(num, ta);                                   \
-        af::array b = randu(num, tb);                                   \
-        af::array c = af::complex(a, b);                                \
-        Ta *h_a = a.host<Ta>();                                         \
-        Tb *h_b = b.host<Tb>();                                         \
-        std::complex<Tc> *h_c = c.host< std::complex<Tc> >();           \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], std::complex<Tc>(h_a[i], h_b[i])) <<      \
-                "for values: " << h_a[i]  << "," << h_b[i] << std::endl; \
-        delete[] h_a;                                                   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-    TEST(ComplexTests, Test_cplx_##Ta##_##Tb##_left)                    \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af::array a = randu(num, ta);                                   \
-        Tb h_b = 0.3;                                                   \
-        af::array c = af::complex(a, h_b);                              \
-        Ta *h_a = a.host<Ta>();                                         \
-        std::complex<Ta> *h_c = c.host<std::complex<Ta> >();            \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], std::complex<Ta>(h_a[i], h_b)) <<         \
-                "for values: " << h_a[i]  << "," << h_b << std::endl;   \
-        delete[] h_a;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-                                                                        \
-    TEST(ComplexTests, Test_cplx_##Ta##_##Tb##_right)                   \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-                                                                        \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        Ta h_a = 0.3;                                                   \
-        af::array b = randu(num, tb);                                   \
-        af::array c = af::complex(h_a, b);                              \
-        Tb *h_b = b.host<Tb>();                                         \
-        std::complex<Tb> *h_c = c.host<std::complex<Tb> >();            \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_c[i], std::complex<Tb>(h_a, h_b[i])) <<         \
-                "for values: " << h_a  << "," << h_b[i] << std::endl;   \
-        delete[] h_b;                                                   \
-        delete[] h_c;                                                   \
-    }                                                                   \
-    TEST(ComplexTests, Test_##Ta##_##Tb##_Real)                         \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randu(num, ta);                                   \
-        af::array b = randu(num, tb);                                   \
-        af::array c = af::complex(a, b);                                \
-        af::array d = af::real(c);                                      \
-        Ta *h_a = a.host<Ta>();                                         \
-        Tc *h_d = d.host<Tc>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_d[i], h_a[i]) << "at: " << i << std::endl;      \
-        delete[] h_a;                                                   \
-        delete[] h_d;                                                   \
-    }                                                                   \
-    TEST(ComplexTests, Test_##Ta##_##Tb##_Imag)                         \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randu(num, ta);                                   \
-        af::array b = randu(num, tb);                                   \
-        af::array c = af::complex(a, b);                                \
-        af::array d = af::imag(c);                                      \
-        Tb *h_b = b.host<Tb>();                                         \
-        Tc *h_d = d.host<Tc>();                                         \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(h_d[i], h_b[i])  << "at: " << i << std::endl;     \
-        delete[] h_b;                                                   \
-        delete[] h_d;                                                   \
-    }                                                                   \
-    TEST(ComplexTests, Test_##Ta##_##Tb##_Conj)                         \
-    {                                                                   \
-        if (noDoubleTests<Ta>()) return;                                \
-        if (noDoubleTests<Tb>()) return;                                \
-        if (noDoubleTests<Tc>()) return;                                \
-                                                                        \
-        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;              \
-        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;              \
-        af::array a = randu(num, ta);                                   \
-        af::array b = randu(num, tb);                                   \
-        af::array c = af::complex(a, b);                                \
-        af::array d = af::conjg(c);                                     \
-        std::complex<Tc> *h_c = c.host<std::complex<Tc> >();            \
-        std::complex<Tc> *h_d = d.host<std::complex<Tc> >();            \
-        for (int i = 0; i < num; i++)                                   \
-            ASSERT_EQ(std::conj(h_c[i]), h_d[i])                        \
-                << "at: " << i << std::endl;                            \
-        delete[] h_c;                                                   \
-        delete[] h_d;                                                   \
-    }                                                                   \
+#define CPLX(TYPE) af_c##TYPE
 
+#define COMPLEX_TESTS(Ta, Tb, Tc)                                     \
+    TEST(ComplexTests, Test_##Ta##_##Tb) {                            \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype ta   = (af_dtype)dtype_traits<Ta>::af_type;          \
+        af_dtype tb   = (af_dtype)dtype_traits<Tb>::af_type;          \
+        array a       = randu(num, ta);                               \
+        array b       = randu(num, tb);                               \
+        array c       = complex(a, b);                                \
+        Ta *h_a       = a.host<Ta>();                                 \
+        Tb *h_b       = b.host<Tb>();                                 \
+        CPLX(Tc) *h_c = c.host<CPLX(Tc)>();                           \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_c[i], CPLX(Tc)(h_a[i], h_b[i]))               \
+                << "for values: " << h_a[i] << "," << h_b[i] << endl; \
+        freeHost(h_a);                                                \
+        freeHost(h_b);                                                \
+        freeHost(h_c);                                                \
+    }                                                                 \
+    TEST(ComplexTests, Test_cplx_##Ta##_##Tb##_left) {                \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+                                                                      \
+        af_dtype ta   = (af_dtype)dtype_traits<Ta>::af_type;          \
+        array a       = randu(num, ta);                               \
+        Tb h_b        = 0.3;                                          \
+        array c       = complex(a, h_b);                              \
+        Ta *h_a       = a.host<Ta>();                                 \
+        CPLX(Ta) *h_c = c.host<CPLX(Ta)>();                           \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_c[i], CPLX(Ta)(h_a[i], h_b))                  \
+                << "for values: " << h_a[i] << "," << h_b << endl;    \
+        freeHost(h_a);                                                \
+        freeHost(h_c);                                                \
+    }                                                                 \
+                                                                      \
+    TEST(ComplexTests, Test_cplx_##Ta##_##Tb##_right) {               \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+                                                                      \
+        af_dtype tb   = (af_dtype)dtype_traits<Tb>::af_type;          \
+        Ta h_a        = 0.3;                                          \
+        array b       = randu(num, tb);                               \
+        array c       = complex(h_a, b);                              \
+        Tb *h_b       = b.host<Tb>();                                 \
+        CPLX(Tb) *h_c = c.host<CPLX(Tb)>();                           \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_c[i], CPLX(Tb)(h_a, h_b[i]))                  \
+                << "for values: " << h_a << "," << h_b[i] << endl;    \
+        freeHost(h_b);                                                \
+        freeHost(h_c);                                                \
+    }                                                                 \
+    TEST(ComplexTests, Test_##Ta##_##Tb##_Real) {                     \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;            \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;            \
+        array a     = randu(num, ta);                                 \
+        array b     = randu(num, tb);                                 \
+        array c     = complex(a, b);                                  \
+        array d     = real(c);                                        \
+        Ta *h_a     = a.host<Ta>();                                   \
+        Tc *h_d     = d.host<Tc>();                                   \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_d[i], h_a[i]) << "at: " << i << endl;         \
+        freeHost(h_a);                                                \
+        freeHost(h_d);                                                \
+    }                                                                 \
+    TEST(ComplexTests, Test_##Ta##_##Tb##_Imag) {                     \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype ta = (af_dtype)dtype_traits<Ta>::af_type;            \
+        af_dtype tb = (af_dtype)dtype_traits<Tb>::af_type;            \
+        array a     = randu(num, ta);                                 \
+        array b     = randu(num, tb);                                 \
+        array c     = complex(a, b);                                  \
+        array d     = imag(c);                                        \
+        Tb *h_b     = b.host<Tb>();                                   \
+        Tc *h_d     = d.host<Tc>();                                   \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(h_d[i], h_b[i]) << "at: " << i << endl;         \
+        freeHost(h_b);                                                \
+        freeHost(h_d);                                                \
+    }                                                                 \
+    TEST(ComplexTests, Test_##Ta##_##Tb##_Conj) {                     \
+        SUPPORTED_TYPE_CHECK(Ta);                                     \
+        SUPPORTED_TYPE_CHECK(Tb);                                     \
+        SUPPORTED_TYPE_CHECK(Tc);                                     \
+                                                                      \
+        af_dtype ta   = (af_dtype)dtype_traits<Ta>::af_type;          \
+        af_dtype tb   = (af_dtype)dtype_traits<Tb>::af_type;          \
+        array a       = randu(num, ta);                               \
+        array b       = randu(num, tb);                               \
+        array c       = complex(a, b);                                \
+        array d       = conjg(c);                                     \
+        CPLX(Tc) *h_c = c.host<CPLX(Tc)>();                           \
+        CPLX(Tc) *h_d = d.host<CPLX(Tc)>();                           \
+        for (int i = 0; i < num; i++)                                 \
+            ASSERT_EQ(conj(h_c[i]), h_d[i]) << "at: " << i << endl;   \
+        freeHost(h_c);                                                \
+        freeHost(h_d);                                                \
+    }
 
 COMPLEX_TESTS(float, float, float)
 COMPLEX_TESTS(double, double, double)
 COMPLEX_TESTS(float, double, double)
+
+TEST(Complex, SNIPPET_arith_func_complex) {
+    //! [ex_arith_func_complex]
+    //!
+    // Create a 2x3 array named a
+    array a = iota(dim4(2, 3));  // a = [0, 2, 4,
+                                 //      1, 3, 5]
+
+    // Create b from a single real array; the imaginary component is filled
+    // with zeros
+    array b = complex(a);  // b = [(0, 0), (2, 0), (4, 0),
+                           //      (1, 0), (3, 0), (5, 0)]
+
+    // Create c from two real arrays, one for the real component and one for the
+    // imaginary component
+    array c = complex(a, a);  // c = [(0, 0), (2, 2), (4, 4),
+                              //      (1, 1), (3, 3), (5, 5)]
+
+    // Create d from a single real array for the real component and a single
+    // scalar for each imaginary component
+    array d = complex(a, 2);  // d = [(0, 2), (2, 2), (4, 2),
+                              //      (1, 2), (3, 2), (5, 2)]
+
+    // Create e from a single scalar for each real component and a single real
+    // array for the imaginary component
+    array e = complex(2, a);  // e = [(2, 0), (2, 2), (2, 4),
+                              //      (2, 1), (2, 3), (2, 5)]
+
+    //! [ex_arith_func_complex]
+
+    using std::complex;
+    using std::vector;
+    vector<float> ha(a.elements());
+    a.host(ha.data());
+
+    vector<cfloat> gold_b(a.elements());
+    for (int i = 0; i < a.elements(); i++) {
+        gold_b[i].real = ha[i];
+        gold_b[i].imag = 0;
+    }
+    ASSERT_VEC_ARRAY_EQ(gold_b, a.dims(), b);
+
+    vector<cfloat> gold_c(a.elements());
+    for (int i = 0; i < a.elements(); i++) {
+        gold_c[i].real = ha[i];
+        gold_c[i].imag = ha[i];
+    }
+    ASSERT_VEC_ARRAY_EQ(gold_c, a.dims(), c);
+
+    vector<cfloat> gold_d(a.elements());
+    for (int i = 0; i < a.elements(); i++) {
+        gold_d[i].real = ha[i];
+        gold_d[i].imag = 2;
+    }
+    ASSERT_VEC_ARRAY_EQ(gold_d, a.dims(), d);
+
+    vector<cfloat> gold_e(a.elements());
+    for (int i = 0; i < a.elements(); i++) {
+        gold_e[i].real = 2;
+        gold_e[i].imag = ha[i];
+    }
+    ASSERT_VEC_ARRAY_EQ(gold_e, a.dims(), e);
+}
\ No newline at end of file
diff --git a/test/confidence_connected.cpp b/test/confidence_connected.cpp
new file mode 100644
index 0000000000..39c0f8f0ff
--- /dev/null
+++ b/test/confidence_connected.cpp
@@ -0,0 +1,250 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/traits.hpp>
+
+#include <sstream>
+#include <string>
+#include <vector>
+
+using af::dim4;
+using std::abs;
+using std::string;
+using std::stringstream;
+using std::to_string;
+using std::vector;
+
+template<typename T>
+class ConfidenceConnectedImageTest : public testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, uint, ushort, uchar> TestTypes;
+
+TYPED_TEST_SUITE(ConfidenceConnectedImageTest, TestTypes);
+
+struct CCCTestParams {
+    const char *prefix;
+    unsigned int radius;
+    unsigned int multiplier;
+    unsigned int iterations;
+    double replace;
+};
+
+template<typename T>
+void testImage(const std::string pTestFile, const size_t numSeeds,
+               const unsigned *seedx, const unsigned *seedy,
+               const int multiplier, const unsigned neighborhood_radius,
+               const int iter) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<af::dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(std::string(TEST_DIR) + "/confidence_cc/" + pTestFile,
+                   inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    af_array seedxArr = 0, seedyArr = 0;
+    dim4 seedDims(numSeeds);
+    ASSERT_SUCCESS(af_create_array(&seedxArr, seedx, seedDims.ndims(),
+                                   seedDims.get(), u32));
+    ASSERT_SUCCESS(af_create_array(&seedyArr, seedy, seedDims.ndims(),
+                                   seedDims.get(), u32));
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array _inArray   = 0;
+        af_array inArray    = 0;
+        af_array outArray   = 0;
+        af_array _goldArray = 0;
+        af_array goldArray  = 0;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/confidence_cc/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/confidence_cc/"));
+
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), false));
+        ASSERT_SUCCESS(
+            af_load_image(&_goldArray, outFiles[testId].c_str(), false));
+
+        // af_load_image always returns float array, so convert to output type
+        ASSERT_SUCCESS(conv_image<T>(&inArray, _inArray));
+        ASSERT_SUCCESS(conv_image<T>(&goldArray, _goldArray));
+
+        CCCTestParams params;
+        params.prefix     = "Image";
+        params.radius     = neighborhood_radius;
+        params.multiplier = multiplier;
+        params.iterations = iter;
+        params.replace    = 255.0;
+
+        ASSERT_SUCCESS(af_confidence_cc(&outArray, inArray, seedxArr, seedyArr,
+                                        params.radius, params.multiplier,
+                                        params.iterations, params.replace));
+        int device = 0;
+        ASSERT_SUCCESS(af_get_device(&device));
+        ASSERT_SUCCESS(af_sync(device));
+
+        ASSERT_ARRAYS_EQ(outArray, goldArray);
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(_goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+    ASSERT_SUCCESS(af_release_array(seedxArr));
+    ASSERT_SUCCESS(af_release_array(seedyArr));
+}
+
+template<typename T>
+void testData(CCCTestParams params) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    string file = string(TEST_DIR) + "/confidence_cc/" + string(params.prefix) +
+                  "_" + to_string(params.radius) + "_" +
+                  to_string(params.multiplier) + ".test";
+    readTests<T, T, int>(file, numDims, in, tests);
+
+    dim4 dims         = numDims[0];
+    af_array inArray  = 0;
+    af_array seedxArr = 0, seedyArr = 0;
+
+    vector<uint> seedCoords(in[1].begin(), in[1].end());
+    const unsigned *seedxy = seedCoords.data();
+
+    dim4 seedDims(1);
+    ASSERT_SUCCESS(af_create_array(&seedxArr, seedxy + 0, seedDims.ndims(),
+                                   seedDims.get(), u32));
+    ASSERT_SUCCESS(af_create_array(&seedyArr, seedxy + 1, seedDims.ndims(),
+                                   seedDims.get(), u32));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)af::dtype_traits<T>::af_type));
+
+    af_array outArray = 0;
+    ASSERT_SUCCESS(af_confidence_cc(&outArray, inArray, seedxArr, seedyArr,
+                                    params.radius, params.multiplier,
+                                    params.iterations, params.replace));
+    int device = 0;
+    ASSERT_SUCCESS(af_get_device(&device));
+    ASSERT_SUCCESS(af_sync(device));
+
+    ASSERT_VEC_ARRAY_EQ(tests[0], dims, outArray);
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(seedxArr));
+    ASSERT_SUCCESS(af_release_array(seedyArr));
+}
+
+class ConfidenceConnectedDataTest
+    : public testing::TestWithParam<CCCTestParams> {};
+
+TYPED_TEST(ConfidenceConnectedImageTest, DonutBackgroundExtraction) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const unsigned seedx = 10;
+    const unsigned seedy = 10;
+    testImage<TypeParam>(std::string("donut_background.test"), 1, &seedx,
+                         &seedy, 3, 3, 25);
+}
+
+TYPED_TEST(ConfidenceConnectedImageTest, DonutRingExtraction) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const unsigned seedx = 132;
+    const unsigned seedy = 132;
+    testImage<TypeParam>(std::string("donut_ring.test"), 1, &seedx, &seedy, 3,
+                         3, 25);
+}
+
+TYPED_TEST(ConfidenceConnectedImageTest, DonutKernelExtraction) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const unsigned seedx = 150;
+    const unsigned seedy = 150;
+    testImage<TypeParam>(std::string("donut_core.test"), 1, &seedx, &seedy, 3,
+                         3, 25);
+}
+
+TEST_P(ConfidenceConnectedDataTest, SegmentARegion) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    testData<unsigned char>(GetParam());
+}
+
+INSTANTIATE_TEST_SUITE_P(
+    SingleSeed, ConfidenceConnectedDataTest,
+    testing::Values(CCCTestParams{"core", 0u, 1u, 5u, 255.0},
+                    CCCTestParams{"background", 0u, 1u, 5u, 255.0},
+                    CCCTestParams{"ring", 0u, 1u, 5u, 255.0}),
+    [](const ::testing::TestParamInfo<ConfidenceConnectedDataTest::ParamType>
+           info) {
+        stringstream ss;
+        ss << "_prefix_" << info.param.prefix << "_radius_" << info.param.radius
+           << "_multiplier_" << info.param.multiplier << "_iterations_"
+           << info.param.iterations << "_replace_" << info.param.replace;
+        return ss.str();
+    });
+
+#define TEST_FORMATS(form)                                                     \
+    TEST(TEMP_FORMAT, form##_2Dseed) {                                         \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                                \
+        const string filename(string(TEST_DIR) + "/confidence_cc/donut.png");  \
+        const af::array image(af::loadImage(filename.c_str()));                \
+        const af::array seed(dim4(1, 2), {10u, 8u});                           \
+                                                                               \
+        const af::array out =                                                  \
+            af::confidenceCC(toTempFormat(form, image),                        \
+                             toTempFormat(form, seed), 3, 3, 25, 255.0);       \
+        const af::array gold = af::confidenceCC(image, seed, 3, 3, 25, 255.0); \
+                                                                               \
+        EXPECT_ARRAYS_EQ(out, gold);                                           \
+    }                                                                          \
+                                                                               \
+    TEST(TEMP_FORMAT, form##_2xSeed) {                                         \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                                \
+        const string filename(string(TEST_DIR) + "/confidence_cc/donut.png");  \
+        const af::array image(af::loadImage(filename.c_str()));                \
+        const af::array seedx({10u});                                          \
+        const af::array seedy({8u});                                           \
+                                                                               \
+        const af::array out = af::confidenceCC(                                \
+            toTempFormat(form, image), toTempFormat(form, seedx),              \
+            toTempFormat(form, seedy), 3, 3, 25, 255.0);                       \
+        const af::array gold =                                                 \
+            af::confidenceCC(image, seedx, seedy, 3, 3, 25, 255.0);            \
+                                                                               \
+        EXPECT_ARRAYS_EQ(out, gold);                                           \
+    }                                                                          \
+    TEST(TEMP_FORMAT, form##_vectSeed) {                                       \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                                \
+        const string filename(string(TEST_DIR) + "/confidence_cc/donut.png");  \
+        const af::array image(af::loadImage(filename.c_str()));                \
+        const unsigned seedx[1] = {10u};                                       \
+        const unsigned seedy[1] = {8u};                                        \
+                                                                               \
+        const af::array out = af::confidenceCC(toTempFormat(form, image), 1,   \
+                                               seedx, seedy, 3, 3, 25, 255.0); \
+        const af::array gold =                                                 \
+            af::confidenceCC(image, 1, seedx, seedy, 3, 3, 25, 255.0);         \
+                                                                               \
+        EXPECT_ARRAYS_EQ(out, gold);                                           \
+    }
+
+FOREACH_TEMP_FORMAT(TEST_FORMATS)
diff --git a/test/constant.cpp b/test/constant.cpp
index db6e0dbd45..b1d3e0a5af 100644
--- a/test/constant.cpp
+++ b/test/constant.cpp
@@ -8,71 +8,82 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
-
-using namespace std;
-using namespace af;
+#include <af/exception.h>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dtype;
+using af::dtype_traits;
+using af::exception;
+using af::identity;
+using af::sum;
+using std::vector;
 
 template<typename T>
-class Constant : public ::testing::Test { };
+class Constant : public ::testing::Test {};
 
-typedef ::testing::Types<float, af::cfloat, double, af::cdouble, int, unsigned, char, uchar, uintl, intl> TestTypes;
-TYPED_TEST_CASE(Constant, TestTypes);
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char,
+                         schar, uchar, uintl, intl, short, ushort,
+                         half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Constant, TestTypes);
 
 template<typename T>
 void ConstantCPPCheck(T value) {
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
     const int num = 1000;
-    T val = value;
-    dtype dty = (dtype) dtype_traits<T>::af_type;
-    af::array in = constant(val, num, dty);
+    T val         = value;
+    dtype dty     = (dtype)dtype_traits<T>::af_type;
+    array in      = constant(val, num, dty);
 
     vector<T> h_in(num);
     in.host(&h_in.front());
 
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(h_in[i], val);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(h_in[i], val); }
 }
 
 template<typename T>
 void ConstantCCheck(T value) {
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
     const int num = 1000;
-    typedef typename af::dtype_traits<T>::base_type BT;
-    BT val = ::real(value);
-    dtype dty = (dtype) dtype_traits<T>::af_type;
+    typedef typename dtype_traits<T>::base_type BT;
+    BT val(::real(value));
+    dtype dty = (dtype)dtype_traits<T>::af_type;
     af_array out;
     dim_t dim[] = {(dim_t)num};
-    ASSERT_EQ(AF_SUCCESS, af_constant(&out, val, 1, dim, dty));
+    ASSERT_SUCCESS(af_constant(&out, val, 1, dim, dty));
 
     vector<T> h_in(num);
     af_get_data_ptr(&h_in.front(), out);
 
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(h_in[i], val);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(::real(h_in[i]), val); }
+    ASSERT_SUCCESS(af_release_array(out));
 }
 
 template<typename T>
 void IdentityCPPCheck() {
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
-    int num = 1000;
-    dtype dty = (dtype) dtype_traits<T>::af_type;
-    array out = af::identity(num, num, dty);
+    int num   = 1000;
+    dtype dty = (dtype)dtype_traits<T>::af_type;
+    array out = identity(num, num, dty);
 
-    vector<T> h_in(num*num);
+    vector<T> h_in(num * num);
     out.host(&h_in.front());
 
     for (int i = 0; i < num; i++) {
         for (int j = 0; j < num; j++) {
-            if(j == i)
+            if (j == i)
                 ASSERT_EQ(h_in[i * num + j], T(1));
             else
                 ASSERT_EQ(h_in[i * num + j], T(0));
@@ -80,83 +91,90 @@ void IdentityCPPCheck() {
     }
 
     num = 100;
-    out = af::identity(num, num, num, dty);
+    out = identity(num, num, num, dty);
 
-    h_in.resize(num*num*num);
+    h_in.resize(num * num * num);
     out.host(&h_in.front());
 
     for (int h = 0; h < num; h++) {
-       for (int i = 0; i < num; i++) {
-           for (int j = 0; j < num; j++) {
-               if(j == i)
-                   ASSERT_EQ(h_in[i * num + j], T(1));
-               else
-                   ASSERT_EQ(h_in[i * num + j], T(0));
-           }
-       }
+        for (int i = 0; i < num; i++) {
+            for (int j = 0; j < num; j++) {
+                if (j == i)
+                    ASSERT_EQ(h_in[(h * num + i) * num + j], T(1));
+                else
+                    ASSERT_EQ(h_in[(h * num + i) * num + j], T(0));
+            }
+        }
     }
 }
 
+template<typename T>
+void IdentityLargeDimCheck() {
+    SUPPORTED_TYPE_CHECK(T);
+
+    const size_t largeDim = 65535 * 8 + 1;
+
+    dtype dty = (dtype)dtype_traits<T>::af_type;
+    array out = identity(largeDim, dty);
+    ASSERT_EQ(1.f, sum<float>(out));
+
+    out = identity(1, largeDim, dty);
+    ASSERT_EQ(1.f, sum<float>(out));
+
+    out = identity(1, 1, largeDim, dty);
+    ASSERT_EQ(largeDim, sum<float>(out));
+
+    out = identity(1, 1, 1, largeDim, dty);
+    ASSERT_EQ(largeDim, sum<float>(out));
+}
+
 template<typename T>
 void IdentityCCheck() {
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
     static const int num = 1000;
-    dtype dty = (dtype) dtype_traits<T>::af_type;
+    dtype dty            = (dtype)dtype_traits<T>::af_type;
     af_array out;
     dim_t dim[] = {(dim_t)num, (dim_t)num};
-    ASSERT_EQ(AF_SUCCESS, af_identity(&out, 2, dim, dty));
+    ASSERT_SUCCESS(af_identity(&out, 2, dim, dty));
 
-    vector<T> h_in(num*num);
+    vector<T> h_in(num * num);
     af_get_data_ptr(&h_in.front(), out);
 
     for (int i = 0; i < num; i++) {
         for (int j = 0; j < num; j++) {
-            if(j == i)
+            if (j == i)
                 ASSERT_EQ(h_in[i * num + j], T(1));
             else
                 ASSERT_EQ(h_in[i * num + j], T(0));
         }
     }
+    ASSERT_SUCCESS(af_release_array(out));
 }
 
 template<typename T>
 void IdentityCPPError() {
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
     static const int num = 1000;
-    dtype dty = (dtype) dtype_traits<T>::af_type;
+    dtype dty            = (dtype)dtype_traits<T>::af_type;
     try {
-        array out = af::identity(num, 0, 10, dty);
-    }
-    catch(const af::exception &ex) {
-        SUCCEED();
+        array out = identity(num, 0, 10, dty);
+    } catch (const exception &ex) {
+        FAIL() << "Incorrectly thrown 0-length exception";
         return;
     }
-    FAIL() << "Failed to throw an exception";
+    SUCCEED();
 }
 
-TYPED_TEST(Constant, basicCPP)
-{
-    ConstantCPPCheck<TypeParam>(5);
-}
+TYPED_TEST(Constant, basicCPP) { ConstantCPPCheck<TypeParam>(TypeParam(5)); }
 
-TYPED_TEST(Constant, basicC)
-{
-    ConstantCCheck<TypeParam>(5);
-}
+TYPED_TEST(Constant, basicC) { ConstantCCheck<TypeParam>(TypeParam(5)); }
 
-TYPED_TEST(Constant, IdentityC)
-{
-    IdentityCCheck<TypeParam>();
-}
+TYPED_TEST(Constant, IdentityC) { IdentityCCheck<TypeParam>(); }
 
-TYPED_TEST(Constant, IdentityCPP)
-{
-    IdentityCPPCheck<TypeParam>();
-}
+TYPED_TEST(Constant, IdentityCPP) { IdentityCPPCheck<TypeParam>(); }
 
-TYPED_TEST(Constant, IdentityCPPError)
-{
-    IdentityCPPError<TypeParam>();
-}
+TYPED_TEST(Constant, IdentityLargeDim) { IdentityLargeDimCheck<TypeParam>(); }
+
+TYPED_TEST(Constant, IdentityCPPError) { IdentityCPPError<TypeParam>(); }
diff --git a/test/convolve.cpp b/test/convolve.cpp
index 200c66bfd9..5df8961e1b 100644
--- a/test/convolve.cpp
+++ b/test/convolve.cpp
@@ -7,42 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <cmath>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::vector;
-using std::string;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Convolve : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Convolve : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<cdouble, cfloat, float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<cdouble, cfloat, float, double, int, uint, char, schar,
+                         uchar, short, ushort, intl, uintl>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Convolve, TestTypes);
+TYPED_TEST_SUITE(Convolve, TestTypes);
 
 template<typename T>
-void convolveTest(string pTestFile, int baseDim, bool expand)
-{
-    if (noDoubleTests<T>()) return;
-
-    using af::dim4;
+void convolveTest(string pTestFile, int baseDim, bool expand) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<dim4>      numDims;
-    vector<vector<T> >      in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
 
     readTests<T, T, int>(pTestFile, numDims, in, tests);
 
@@ -52,164 +56,170 @@ void convolveTest(string pTestFile, int baseDim, bool expand)
     af_array filter   = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&signal, &(in[0].front()),
-                sDims.ndims(), sDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&filter, &(in[1].front()),
-                fDims.ndims(), fDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&signal, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&filter, &(in[1].front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     af_conv_mode mode = expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT;
-    switch(baseDim) {
-    case 1: ASSERT_EQ(AF_SUCCESS, af_convolve1(&outArray, signal, filter, mode, AF_CONV_AUTO)); break;
-    case 2: ASSERT_EQ(AF_SUCCESS, af_convolve2(&outArray, signal, filter, mode, AF_CONV_AUTO)); break;
-    case 3: ASSERT_EQ(AF_SUCCESS, af_convolve3(&outArray, signal, filter, mode, AF_CONV_AUTO)); break;
+    switch (baseDim) {
+        case 1:
+            ASSERT_SUCCESS(
+                af_convolve1(&outArray, signal, filter, mode, AF_CONV_AUTO));
+            break;
+        case 2:
+            ASSERT_SUCCESS(
+                af_convolve2(&outArray, signal, filter, mode, AF_CONV_AUTO));
+            break;
+        case 3:
+            ASSERT_SUCCESS(
+                af_convolve3(&outArray, signal, filter, mode, AF_CONV_AUTO));
+            break;
     }
 
     vector<T> currGoldBar = tests[0];
     size_t nElems         = currGoldBar.size();
-    T *outData            = new T[nElems];
+    vector<T> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(signal));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(filter));
+    ASSERT_SUCCESS(af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(filter));
 }
 
-TYPED_TEST(Convolve, Vector)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector.test"), 1, true);
+TYPED_TEST(Convolve, Vector) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/vector.test"), 1, true);
 }
 
-TYPED_TEST(Convolve, Rectangle)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle.test"), 2, true);
+TYPED_TEST(Convolve, Rectangle) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/rectangle.test"), 2,
+                            true);
 }
 
-TYPED_TEST(Convolve, Cuboid)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid.test"), 3, true);
+TYPED_TEST(Convolve, Cuboid) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/cuboid.test"), 3, true);
 }
 
-TYPED_TEST(Convolve, Vector_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_many2one.test"), 1, true);
+TYPED_TEST(Convolve, Vector_Many2One) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/vector_many2one.test"),
+                            1, true);
 }
 
-TYPED_TEST(Convolve, Rectangle_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_many2one.test"), 2, true);
+TYPED_TEST(Convolve, Rectangle_Many2One) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_many2one.test"), 2, true);
 }
 
-TYPED_TEST(Convolve, Cuboid_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_many2one.test"), 3, true);
+TYPED_TEST(Convolve, Cuboid_Many2One) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/cuboid_many2one.test"),
+                            3, true);
 }
 
-TYPED_TEST(Convolve, Vector_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_many2many.test"), 1, true);
+TYPED_TEST(Convolve, Vector_Many2Many) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/vector_many2many.test"),
+                            1, true);
 }
 
-TYPED_TEST(Convolve, Rectangle_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_many2many.test"), 2, true);
+TYPED_TEST(Convolve, Rectangle_Many2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_many2many.test"), 2, true);
 }
 
-TYPED_TEST(Convolve, Cuboid_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_many2many.test"), 3, true);
+TYPED_TEST(Convolve, Cuboid_Many2Many) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/cuboid_many2many.test"),
+                            3, true);
 }
 
-TYPED_TEST(Convolve, Vector_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_one2many.test"), 1, true);
+TYPED_TEST(Convolve, Vector_One2Many) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/vector_one2many.test"),
+                            1, true);
 }
 
-TYPED_TEST(Convolve, Rectangle_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_one2many.test"), 2, true);
+TYPED_TEST(Convolve, Rectangle_One2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_one2many.test"), 2, true);
 }
 
-TYPED_TEST(Convolve, Cuboid_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_one2many.test"), 3, true);
+TYPED_TEST(Convolve, Cuboid_One2Many) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/cuboid_one2many.test"),
+                            3, true);
 }
 
-TYPED_TEST(Convolve, Same_Vector)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_same.test"), 1, false);
+TYPED_TEST(Convolve, Same_Vector) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/vector_same.test"), 1,
+                            false);
 }
 
-TYPED_TEST(Convolve, Same_Rectangle)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_same.test"), 2, false);
+TYPED_TEST(Convolve, Same_Rectangle) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/rectangle_same.test"), 2,
+                            false);
 }
 
-TYPED_TEST(Convolve, Same_Cuboid)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_same.test"), 3, false);
+TYPED_TEST(Convolve, Same_Cuboid) {
+    convolveTest<TypeParam>(string(TEST_DIR "/convolve/cuboid_same.test"), 3,
+                            false);
 }
 
-TYPED_TEST(Convolve, Same_Vector_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_same_many2one.test"), 1, false);
+TYPED_TEST(Convolve, Same_Vector_Many2One) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/vector_same_many2one.test"), 1, false);
 }
 
-TYPED_TEST(Convolve, Same_Rectangle_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_same_many2one.test"), 2, false);
+TYPED_TEST(Convolve, Same_Rectangle_Many2One) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_same_many2one.test"), 2, false);
 }
 
-TYPED_TEST(Convolve, Same_Cuboid_Many2One)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_same_many2one.test"), 3, false);
+TYPED_TEST(Convolve, Same_Cuboid_Many2One) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/cuboid_same_many2one.test"), 3, false);
 }
 
-TYPED_TEST(Convolve, Same_Vector_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_same_many2many.test"), 1, false);
+TYPED_TEST(Convolve, Same_Vector_Many2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/vector_same_many2many.test"), 1, false);
 }
 
-TYPED_TEST(Convolve, Same_Rectangle_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_same_many2many.test"), 2, false);
+TYPED_TEST(Convolve, Same_Rectangle_Many2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_same_many2many.test"), 2, false);
 }
 
-TYPED_TEST(Convolve, Same_Cuboid_Many2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_same_many2many.test"), 3, false);
+TYPED_TEST(Convolve, Same_Cuboid_Many2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/cuboid_same_many2many.test"), 3, false);
 }
 
-TYPED_TEST(Convolve, Same_Vector_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/vector_same_one2many.test"), 1, false);
+TYPED_TEST(Convolve, Same_Vector_One2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/vector_same_one2many.test"), 1, false);
 }
 
-TYPED_TEST(Convolve, Same_Rectangle_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/rectangle_same_one2many.test"), 2, false);
+TYPED_TEST(Convolve, Same_Rectangle_One2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/rectangle_same_one2many.test"), 2, false);
 }
 
-TYPED_TEST(Convolve, Same_Cuboid_One2Many)
-{
-    convolveTest<TypeParam>(string(TEST_DIR"/convolve/cuboid_same_one2many.test"), 3, false);
+TYPED_TEST(Convolve, Same_Cuboid_One2Many) {
+    convolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/cuboid_same_one2many.test"), 3, false);
 }
 
 template<typename T>
-void sepConvolveTest(string pTestFile, bool expand)
-{
-    if (noDoubleTests<T>()) return;
-
-    using af::dim4;
+void sepConvolveTest(string pTestFile, bool expand) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<dim4>      numDims;
-    vector<vector<T> >      in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
 
     readTests<T, T, int>(pTestFile, numDims, in, tests);
 
@@ -221,336 +231,330 @@ void sepConvolveTest(string pTestFile, bool expand)
     af_array r_filter = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&signal, &(in[0].front()),
-                sDims.ndims(), sDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&c_filter, &(in[1].front()),
-                cfDims.ndims(), cfDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&r_filter, &(in[2].front()),
-                rfDims.ndims(), rfDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&signal, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&c_filter, &(in[1].front()), cfDims.ndims(),
+                                   cfDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&r_filter, &(in[2].front()), rfDims.ndims(),
+                                   rfDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    af_conv_mode  mode = expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT;
-    ASSERT_EQ(AF_SUCCESS, af_convolve2_sep(&outArray, c_filter, r_filter, signal, mode));
+    af_conv_mode mode = expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT;
+    ASSERT_SUCCESS(
+        af_convolve2_sep(&outArray, c_filter, r_filter, signal, mode));
 
     vector<T> currGoldBar = tests[0];
     size_t nElems         = currGoldBar.size();
-    T *outData            = new T[nElems];
+    vector<T> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(signal));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(c_filter));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(r_filter));
+    ASSERT_SUCCESS(af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(c_filter));
+    ASSERT_SUCCESS(af_release_array(r_filter));
 }
 
-TYPED_TEST(Convolve, Separable2D_Full)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_full.test"), true);
+TYPED_TEST(Convolve, Separable2D_Full) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_full.test"), true);
 }
 
-TYPED_TEST(Convolve, Separable2D_Full_Batch)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_full_batch.test"), true);
+TYPED_TEST(Convolve, Separable2D_Full_Batch) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_full_batch.test"), true);
 }
 
-TYPED_TEST(Convolve, Separable2D_Full_Rectangle)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_full_rectangle.test"), true);
+TYPED_TEST(Convolve, Separable2D_Full_Rectangle) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_full_rectangle.test"),
+        true);
 }
 
-TYPED_TEST(Convolve, Separable2D_Full_Rectangle_Batch)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_full_rectangle_batch.test"), true);
+TYPED_TEST(Convolve, Separable2D_Full_Rectangle_Batch) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_full_rectangle_batch.test"),
+        true);
 }
 
-TYPED_TEST(Convolve, Separable2D_Same)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_same.test"), false);
+TYPED_TEST(Convolve, Separable2D_Same) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_same.test"), false);
 }
 
-TYPED_TEST(Convolve, Separable2D_Same_Batch)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_same_batch.test"), false);
+TYPED_TEST(Convolve, Separable2D_Same_Batch) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_same_batch.test"), false);
 }
 
-TYPED_TEST(Convolve, Separable2D_Same_Rectangle)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_same_rectangle.test"), false);
+TYPED_TEST(Convolve, Separable2D_Same_Rectangle) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_same_rectangle.test"),
+        false);
 }
 
-TYPED_TEST(Convolve, Separable2D_Same_Rectangle_Batch)
-{
-    sepConvolveTest<TypeParam>(string(TEST_DIR"/convolve/separable_conv2d_same_rectangle_batch.test"), false);
+TYPED_TEST(Convolve, Separable2D_Same_Rectangle_Batch) {
+    sepConvolveTest<TypeParam>(
+        string(TEST_DIR "/convolve/separable_conv2d_same_rectangle_batch.test"),
+        false);
 }
 
-TEST(Convolve, Separable_TypeCheck)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<int>()) return;
-    using af::dim4;
-
+TEST(Convolve, Separable_TypeCheck) {
     dim4 sDims(10, 1, 1, 1);
     dim4 fDims(4, 1, 1, 1);
 
-    vector<float> in(10,1);
-    vector<int>   filt(4,1);
+    vector<float> in(10, 1);
+    vector<int> filt(4, 1);
 
     af_array signal   = 0;
     af_array c_filter = 0;
     af_array r_filter = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&signal, &(in.front()),
-                sDims.ndims(), sDims.get(), (af_dtype)af::dtype_traits<float>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&c_filter, &(filt.front()),
-                fDims.ndims(), fDims.get(), (af_dtype)af::dtype_traits<int>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&r_filter, &(filt.front()),
-                fDims.ndims(), fDims.get(), (af_dtype)af::dtype_traits<int>::af_type));
-
-    ASSERT_EQ(AF_ERR_ARG, af_convolve2_sep(&outArray, c_filter, r_filter, signal, AF_CONV_EXPAND));
-
-    ASSERT_EQ(AF_SUCCESS, af_release_array(signal));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(c_filter));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(r_filter));
+    ASSERT_SUCCESS(af_create_array(&signal, &(in.front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&c_filter, &(filt.front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<int>::af_type));
+    ASSERT_SUCCESS(af_create_array(&r_filter, &(filt.front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<int>::af_type));
+
+    ASSERT_EQ(AF_ERR_ARG, af_convolve2_sep(&outArray, c_filter, r_filter,
+                                           signal, AF_CONV_EXPAND));
+
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(c_filter));
+    ASSERT_SUCCESS(af_release_array(r_filter));
 }
 
-TEST(Convolve, Separable_DimCheck)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<int>()) return;
-
-    using af::dim4;
-
+TEST(Convolve, Separable_DimCheck) {
     dim4 sDims(10, 1, 1, 1);
     dim4 fDims(4, 1, 1, 1);
 
-    vector<float> in(10,1);
-    vector<int>   filt(4,1);
+    vector<float> in(10, 1);
+    vector<int> filt(4, 1);
 
     af_array signal   = 0;
     af_array c_filter = 0;
     af_array r_filter = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&signal, &(in.front()),
-                sDims.ndims(), sDims.get(), (af_dtype)af::dtype_traits<float>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&c_filter, &(filt.front()),
-                fDims.ndims(), fDims.get(), (af_dtype)af::dtype_traits<int>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&r_filter, &(filt.front()),
-                fDims.ndims(), fDims.get(), (af_dtype)af::dtype_traits<int>::af_type));
-
-    ASSERT_EQ(AF_ERR_ARG, af_convolve2_sep(&outArray, c_filter, r_filter, signal, AF_CONV_EXPAND));
-
-    ASSERT_EQ(AF_SUCCESS, af_release_array(c_filter));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(r_filter));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(signal));
+    ASSERT_SUCCESS(af_create_array(&signal, &(in.front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&c_filter, &(filt.front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<int>::af_type));
+    ASSERT_SUCCESS(af_create_array(&r_filter, &(filt.front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<int>::af_type));
+
+    ASSERT_EQ(AF_ERR_ARG, af_convolve2_sep(&outArray, c_filter, r_filter,
+                                           signal, AF_CONV_EXPAND));
+
+    ASSERT_SUCCESS(af_release_array(c_filter));
+    ASSERT_SUCCESS(af_release_array(r_filter));
+    ASSERT_SUCCESS(af_release_array(signal));
 }
 
-TEST(Convolve1, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
-
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/vector_same.test"), numDims, in, tests);
+///////////////////////////////////// CPP ////////////////////////////////
+//
+using af::constant;
+using af::max;
+using af::product;
+using af::randu;
+using af::seq;
+using af::span;
+using af::sum;
+
+TEST(Convolve1, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTests<float, float, int>(string(TEST_DIR "/convolve/vector_same.test"),
+                                 numDims, in, tests);
 
     //![ex_image_convolve1]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [32 1 1 1]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [4 1 1 1]
-
-    af::array output = convolve1(signal, filter, AF_CONV_DEFAULT);
-    //output dims = [32 1 1 1] - same as input since expand(3rd argument is false)
-    //None of the dimensions > 1 has lenght > 1, so no batch mode is activated.
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [32 1 1 1]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [4 1 1 1]
+
+    array output = convolve1(signal, filter, AF_CONV_DEFAULT);
+    // output dims = [32 1 1 1] - same as input since expand (3rd argument)
+    // is false. None of the dimensions > 1 has length > 1, so no batch mode
+    // is activated.
     //![ex_image_convolve1]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(Convolve2, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
+TEST(Convolve2, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/rectangle_same_one2many.test"), numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/convolve/rectangle_same_one2many.test"), numDims, in,
+        tests);
 
     //![ex_image_convolve2]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [15 17 1 1]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [5 5 2 1]
-
-    af::array output = convolve2(signal, filter, AF_CONV_DEFAULT);
-    //output dims = [15 17 1 1] - same as input since expand(3rd argument is false)
-    //however, notice that the 3rd dimension of filter is > 1.
-    //So, one to many batch mode will be activated automatically
-    //where the 2d input signal is convolved with each 2d filter
-    //and the result will written corresponding slice in the output 3d array
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [15 17 1 1]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [5 5 2 1]
+
+    array output = convolve2(signal, filter, AF_CONV_DEFAULT);
+    // output dims = [15 17 1 1] - same as input since expand (3rd argument)
+    // is false. However, notice that the 3rd dimension of filter is > 1. So,
+    // one-to-many batch mode will be activated automatically, where the 2d
+    // input signal is convolved with each 2d filter and the result is written
+    // to the corresponding slice in the output 3d array
     //![ex_image_convolve2]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(Convolve3, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
+TEST(Convolve3, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/cuboid_same_many2many.test"), numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/convolve/cuboid_same_many2many.test"), numDims, in,
+        tests);
 
     //![ex_image_convolve3]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [10 11 2 2]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [4 2 3 2]
-
-    af::array output = convolve3(signal, filter, AF_CONV_DEFAULT);
-    //output dims = [10 11 2 2] - same as input since expand(3rd argument is false)
-    //however, notice that the 4th dimension is > 1 for both signal
-    //and the filter, therefore many to many batch mode will be
-    //activated where each 3d signal is convolved with the corresponding 3d filter
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [10 11 2 2]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [4 2 3 2]
+
+    array output = convolve3(signal, filter, AF_CONV_DEFAULT);
+    // output dims = [10 11 2 2] - same as input since expand (3rd argument)
+    // is false. However, notice that the 4th dimension is > 1 for both signal
+    // and the filter, therefore many-to-many batch mode will be activated,
+    // where each 3d signal is convolved with the corresponding 3d filter
     //![ex_image_convolve3]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(Convolve, separable_CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
+TEST(Convolve, separable_CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/separable_conv2d_same_rectangle_batch.test"),
-                                 numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/convolve/separable_conv2d_same_rectangle_batch.test"),
+        numDims, in, tests);
 
     //![ex_image_conv2_sep]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [3 4 2 1]
-    af::array cFilter(numDims[1], &(in[1].front()));
-    //coloumn filter dims = [2 1 1 1]
-    af::array rFilter(numDims[2], &(in[2].front()));
-    //row filter dims = [3 1 1 1]
-
-    af::array output = convolve(cFilter, rFilter, signal, AF_CONV_DEFAULT);
-    //output signal dims = [3 4 2 1] - same as input since 'expand = false'
-    //notice that the input signal is 3d array, therefore
-    //batch mode will be automatically activated.
-    //output will be 3d array with result of each 2d array convolution(with same filter)
-    //stacked along the 3rd dimension
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [3 4 2 1]
+    array cFilter(numDims[1], &(in[1].front()));
+    // column filter dims = [2 1 1 1]
+    array rFilter(numDims[2], &(in[2].front()));
+    // row filter dims = [3 1 1 1]
+
+    array output = convolve(cFilter, rFilter, signal, AF_CONV_DEFAULT);
+    // output signal dims = [3 4 2 1] - same as input since 'expand = false'
+    // notice that the input signal is a 3d array, therefore
+    // batch mode will be automatically activated.
+    // output will be a 3d array with the result of each 2d convolution (with
+    // the same filter) stacked along the 3rd dimension
     //![ex_image_conv2_sep]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
 
-    output.host((void*)outData);
+    output.host((void *)&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(Convolve, Docs_Unified_Wrapper)
-{
+TEST(Convolve, Docs_Unified_Wrapper) {
     // This unit test doesn't necessarily need to function
-    // accuracy as af::convolve is merely a wrapper to
-    // af::convolve[1|2|3]
-    using af::array;
-    using af::dim4;
-    using af::randu;
-    using af::constant;
-    using af::convolve;
+    // accuracy as convolve is merely a wrapper to
+    // convolve[1|2|3]
 
     //![ex_image_convolve_1d]
     array a = randu(10);
-    //af_print(a);
-    //a [10 1 1 1] = 0.0000 0.1315 0.7556 0.4587 0.5328 0.2190 0.0470 0.6789 0.6793 0.9347
+    // af_print(a);
+    // a [10 1 1 1] = 0.0000 0.1315 0.7556 0.4587 0.5328 0.2190 0.0470 0.6789
+    // 0.6793 0.9347
     array b = randu(4);
-    //af_print(b);
-    //b [4 1 1 1]  = 0.3835 0.5194 0.8310 0.0346
+    // af_print(b);
+    // b [4 1 1 1]  = 0.3835 0.5194 0.8310 0.0346
     array c = convolve(a, b);
-    //af_print(c);
-    //c [10 1 1 1] = 0.3581 0.6777 1.0750 0.7679 0.5903 0.4851 0.6598 1.2770 1.0734 0.8002
+    // af_print(c);
+    // c [10 1 1 1] = 0.3581 0.6777 1.0750 0.7679 0.5903 0.4851 0.6598
+    // 1.2770 1.0734 0.8002
     //![ex_image_convolve_1d]
 
     //![ex_image_convolve_2d]
     array d = constant(0.5, 5, 5);
-    //af_print(d);
-    //d [5 5 1 1]
+    // af_print(d);
+    // d [5 5 1 1]
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     array e = constant(1, 2, 2);
-    //af_print(e);
-    //e [2 2 1 1]
+    // af_print(e);
+    // e [2 2 1 1]
     //     1.0000     1.0000
     //     1.0000     1.0000
     array f = convolve(d, e);
-    //af_print(f);
-    //f [5 5 1 1]
+    // af_print(f);
+    // f [5 5 1 1]
     //     2.0000     2.0000     2.0000     2.0000     1.0000
     //     2.0000     2.0000     2.0000     2.0000     1.0000
     //     2.0000     2.0000     2.0000     2.0000     1.0000
@@ -560,8 +564,8 @@ TEST(Convolve, Docs_Unified_Wrapper)
 
     //![ex_image_convolve_3d]
     array g = constant(1, 4, 4, 4);
-    //af_print(g);
-    //g [4 4 4 1]
+    // af_print(g);
+    // g [4 4 4 1]
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
@@ -582,8 +586,8 @@ TEST(Convolve, Docs_Unified_Wrapper)
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
     array h = constant(0.5, 2, 2, 2);
-    //af_print(h);
-    //h [2 2 2 1]
+    // af_print(h);
+    // h [2 2 2 1]
     //    0.5000     0.5000
     //    0.5000     0.5000
 
@@ -591,8 +595,8 @@ TEST(Convolve, Docs_Unified_Wrapper)
     //    0.5000     0.5000
 
     array i = convolve(g, h);
-    //af_print(i);
-    //i [4 4 4 1]
+    // af_print(i);
+    // i [4 4 4 1]
     //    4.0000     4.0000     4.0000     2.0000
     //    4.0000     4.0000     4.0000     2.0000
     //    4.0000     4.0000     4.0000     2.0000
@@ -615,17 +619,12 @@ TEST(Convolve, Docs_Unified_Wrapper)
     //![ex_image_convolve_3d]
 }
 
-using namespace af;
-
-TEST(GFOR, convolve2_MO)
-{
+TEST(GFOR, convolve2_MO) {
     array A = randu(5, 5, 3);
     array B = randu(5, 5, 3);
     array K = randu(3, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = convolve2(A(span, span, ii), K);
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = convolve2(A(span, span, ii), K); }
 
     for (int ii = 0; ii < 3; ii++) {
         array c_ii = convolve2(A(span, span, ii), K);
@@ -634,15 +633,12 @@ TEST(GFOR, convolve2_MO)
     }
 }
 
-TEST(GFOR, convolve2_1M)
-{
+TEST(GFOR, convolve2_OM) {
     array A = randu(5, 5);
     array B = randu(5, 5, 3);
     array K = randu(3, 3, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = convolve2(A, K(span, span, ii));
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = convolve2(A, K(span, span, ii)); }
 
     for (int ii = 0; ii < 3; ii++) {
         array c_ii = convolve2(A, K(span, span, ii));
@@ -651,8 +647,7 @@ TEST(GFOR, convolve2_1M)
     }
 }
 
-TEST(GFOR, convolve2_MM)
-{
+TEST(GFOR, convolve2_MM) {
     array A = randu(5, 5, 3);
     array B = randu(5, 5, 3);
     array K = randu(3, 3, 3);
@@ -667,3 +662,524 @@ TEST(GFOR, convolve2_MM)
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
     }
 }
+
+TEST(Convolve, 1D_C32) {
+    array A = randu(10, c32);
+    array B = randu(3, c32);
+
+    array out = convolve1(A, B);
+    array gld = fftConvolve1(A, B);
+
+    cfloat acc = sum<cfloat>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(Convolve, 2D_C32) {
+    array A = randu(10, 10, c32);
+    array B = randu(3, 3, c32);
+
+    array out = convolve2(A, B);
+    array gld = fftConvolve2(A, B);
+
+    cfloat acc = sum<cfloat>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(Convolve, 3D_C32) {
+    array A = randu(10, 10, 3, c32);
+    array B = randu(3, 3, 3, c32);
+
+    array out = convolve3(A, B);
+    array gld = fftConvolve3(A, B);
+
+    cfloat acc = sum<cfloat>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(Convolve, 1D_C64) {
+    SUPPORTED_TYPE_CHECK(double);
+
+    array A = randu(10, c64);
+    array B = randu(3, c64);
+
+    array out = convolve1(A, B);
+    array gld = fftConvolve1(A, B);
+
+    cdouble acc = sum<cdouble>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(Convolve, 2D_C64) {
+    SUPPORTED_TYPE_CHECK(double);
+
+    array A = randu(10, 10, c64);
+    array B = randu(3, 3, c64);
+
+    array out = convolve2(A, B);
+    array gld = fftConvolve2(A, B);
+
+    cdouble acc = sum<cdouble>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(Convolve, 3D_C64) {
+    SUPPORTED_TYPE_CHECK(double);
+
+    array A = randu(10, 10, 3, c64);
+    array B = randu(3, 3, 3, c64);
+
+    array out = convolve3(A, B);
+    array gld = fftConvolve3(A, B);
+
+    cdouble acc = sum<cdouble>(out - gld);
+
+    EXPECT_LT(std::abs(real(acc)), 1E-3);
+    EXPECT_LT(std::abs(imag(acc)), 1E-3);
+}
+
+TEST(ConvolveLargeDim1D, CPP) {
+    const size_t n        = 10;
+    const size_t largeDim = 65535 + 1;
+
+    float h_filter[] = {0.f, 1.f, 0.f};
+    array identity_filter(3, h_filter);
+    array signal = constant(1, n, 1, largeDim);
+
+    array output  = convolve1(signal, identity_filter, AF_CONV_DEFAULT);
+    array output2 = output;
+    ASSERT_EQ(largeDim * n, sum<float>(output2));
+
+    signal = constant(1, n, 1, 1, largeDim);
+
+    output = convolve1(signal, identity_filter, AF_CONV_DEFAULT);
+    ASSERT_EQ(largeDim * n, sum<float>(output));
+}
+
+TEST(ConvolveLargeDim2D, CPP) {
+    const size_t n        = 10;
+    const size_t largeDim = 65535 + 1;
+
+    float h_filter[] = {0.f, 0.f, 0.f, 0.f, 1.f, 0.f, 0.f, 0.f, 0.f};
+    array identity_filter(3, 3, h_filter);
+    array signal = constant(1, n, n, largeDim);
+
+    array output = convolve2(signal, identity_filter, AF_CONV_DEFAULT);
+    ASSERT_EQ(largeDim * n * n, sum<float>(output));
+
+    signal = constant(1, n, n, 1, largeDim);
+
+    output = convolve2(signal, identity_filter, AF_CONV_DEFAULT);
+    ASSERT_EQ(largeDim * n * n, sum<float>(output));
+}
+
+TEST(DISABLED_ConvolveLargeDim3D, CPP) {
+    const size_t n        = 3;
+    const size_t largeDim = 65535 * 16 + 1;
+
+    float h_filter[] = {0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f,
+
+                        0.f, 0.f, 0.f, 0.f, 1.f, 0.f, 0.f, 0.f, 0.f,
+
+                        0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f, 0.f};
+
+    array identity_filter(3, 3, 3, h_filter);
+    array signal = constant(1, n, largeDim, n);
+
+    array output = convolve3(signal, identity_filter, AF_CONV_DEFAULT);
+    ASSERT_EQ(1.f, product<float>(output));
+
+    signal = constant(1, n, n, largeDim);
+
+    output = convolve3(signal, identity_filter, AF_CONV_EXPAND);
+    // TODO: fix product by indexing
+    // ASSERT_EQ(1.f, product<float>(output));
+}
+
+TEST(Convolve, CuboidBatchLaunchBugFix) {
+    std::string testFile(TEST_DIR "/convolve/conv3d_launch_bug.test");
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTests<float, float, float>(testFile, numDims, in, tests);
+
+    dim4 sDims = numDims[0];
+    dim4 fDims = numDims[1];
+
+    af::array signal(sDims, in[0].data());
+    af::array filter(fDims, in[1].data());
+
+    af::array output = convolve3(signal, filter);
+
+    ASSERT_VEC_ARRAY_NEAR(tests[0], sDims, output, 1.0e-3);
+}
+
+struct conv2_strided_params {
+    string testname_;
+    dim4 signal_sz_, filt_sz_, stride_, padding_, dilation_;
+
+    conv2_strided_params(string testname, dim4 signal_sz, dim4 filt_sz,
+                         dim4 stride, dim4 padding, dim4 dilation)
+        : testname_(testname)
+        , signal_sz_(signal_sz)
+        , filt_sz_(filt_sz)
+        , stride_(stride)
+        , padding_(padding)
+        , dilation_(dilation) {}
+};
+
+template<typename TestClass>
+string testNameGenerator(
+    const ::testing::TestParamInfo<typename TestClass::ParamType> info) {
+    return info.param.testname_;
+}
+
+class Conv2ConsistencyTest
+    : public ::testing::TestWithParam<conv2_strided_params> {};
+
+conv2_strided_params conv2_consistency_data(dim4 signal_sz, dim4 filt_sz) {
+    dim4 stride(1, 1);
+    dim4 padding(filt_sz[0] / 2, filt_sz[1] / 2);
+    dim4 dilation(1, 1);
+    std::string testname =
+        "conv2_consistency_" + std::to_string(signal_sz[0]) +
+        std::to_string(signal_sz[1]) + std::to_string(signal_sz[2]) +
+        std::to_string(signal_sz[3]) + "__" + std::to_string(filt_sz[0]) +
+        std::to_string(filt_sz[1]) + std::to_string(filt_sz[2]) +
+        std::to_string(filt_sz[3]) + "__" + "s" + std::to_string(stride[0]) +
+        std::to_string(stride[1]) + "_" + "p" + std::to_string(padding[0]) +
+        std::to_string(padding[1]) + "_" + "d" + std::to_string(dilation[0]) +
+        std::to_string(dilation[1]);
+
+    return conv2_strided_params(testname, signal_sz, filt_sz, stride, padding,
+                                dilation);
+}
+vector<conv2_strided_params> genConsistencyTests() {
+    // TODO: test nfilters and nfeatures
+    return {conv2_consistency_data(dim4(10, 10), dim4(3, 3)),
+            conv2_consistency_data(dim4(11, 11), dim4(5, 5)),
+            conv2_consistency_data(dim4(12, 12), dim4(7, 7)),
+            conv2_consistency_data(dim4(19, 19), dim4(9, 9)),
+            conv2_consistency_data(dim4(33, 33), dim4(3, 3)),
+            conv2_consistency_data(dim4(255, 255), dim4(3, 3)),
+            conv2_consistency_data(dim4(256, 256), dim4(3, 3)),
+            conv2_consistency_data(dim4(257, 257), dim4(3, 3))};
+}
+
+INSTANTIATE_TEST_SUITE_P(Conv2Consistency, Conv2ConsistencyTest,
+                         ::testing::ValuesIn(genConsistencyTests()),
+                         testNameGenerator<Conv2ConsistencyTest>);
+
+TEST_P(Conv2ConsistencyTest, RandomConvolutions) {
+    conv2_strided_params params = GetParam();
+    array signal                = randn(params.signal_sz_);
+    array filter                = randn(params.filt_sz_);
+
+    array out_native = convolve2(signal, filter);
+    array out = convolve2NN(signal, filter, params.stride_, params.padding_,
+                            params.dilation_);
+
+    ASSERT_ARRAYS_NEAR(out_native, out, 2e-5);
+}
+
+template<typename T>
+float tolerance();
+
+template<>
+float tolerance<float>() {
+    return 4e-3;
+}
+
+template<>
+float tolerance<double>() {
+    return 1e-4;
+}
+
+template<>
+float tolerance<half_float::half>() {
+    return 7e-2;
+}
+
+template<typename T>
+void convolve2stridedTest(string pTestFile, dim4 stride, dim4 padding,
+                          dim4 dilation) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 sDims         = numDims[0];
+    dim4 fDims         = numDims[1];
+    af_array signal    = 0;
+    af_array filter    = 0;
+    af_array convolved = 0;
+
+    ASSERT_SUCCESS(af_create_array(&signal, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&filter, &(in[1].front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_convolve2_nn(&convolved, signal, filter, stride.ndims(),
+                                   stride.get(), padding.ndims(), padding.get(),
+                                   dilation.ndims(), dilation.get()));
+
+    vector<T> &currGoldBar = tests[0];
+
+    dim_t expectedDim0 =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t expectedDim1 =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+
+    auto gdim = dim4(expectedDim0, expectedDim1, fDims[3], sDims[3]);
+    ASSERT_VEC_ARRAY_NEAR(currGoldBar, gdim, convolved, tolerance<T>());
+
+    ASSERT_SUCCESS(af_release_array(convolved));
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(filter));
+}
+
+template<typename T>
+void convolve2GradientTest(string pTestFile, dim4 stride, dim4 padding,
+                           dim4 dilation) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
+
+    dim4 sDims         = numDims[0];
+    dim4 fDims         = numDims[1];
+    af_array signal    = 0;
+    af_array filter    = 0;
+    af_array convolved = 0;
+
+    ASSERT_SUCCESS(af_create_array(&signal, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&filter, &(in[1].front()), fDims.ndims(),
+                                   fDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    vector<T> &currGoldBar = tests[0];
+    size_t nElems          = currGoldBar.size();
+
+    dim_t expectedDim0 =
+        1 + (sDims[0] + 2 * padding[0] - (((fDims[0] - 1) * dilation[0]) + 1)) /
+                stride[0];
+    dim_t expectedDim1 =
+        1 + (sDims[1] + 2 * padding[1] - (((fDims[1] - 1) * dilation[1]) + 1)) /
+                stride[1];
+    dim4 cDims(expectedDim0, expectedDim1, fDims[3], sDims[3]);
+    ASSERT_EQ(nElems, cDims.elements());
+
+    ASSERT_SUCCESS(af_create_array(&convolved, &(currGoldBar.front()),
+                                   cDims.ndims(), cDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    af_array incoming_gradient = 0;
+    ASSERT_SUCCESS(af_constant(&incoming_gradient, 1, cDims.ndims(),
+                               cDims.get(),
+                               (af_dtype)dtype_traits<T>::af_type));
+
+    af_array filter_gradient = 0;
+    ASSERT_SUCCESS(af_convolve2_gradient_nn(
+        &filter_gradient, incoming_gradient, signal, filter, convolved,
+        stride.ndims(), stride.get(), padding.ndims(), padding.get(),
+        dilation.ndims(), dilation.get(), AF_CONV_GRADIENT_FILTER));
+
+    af_array data_gradient = 0;
+    ASSERT_SUCCESS(af_convolve2_gradient_nn(
+        &data_gradient, incoming_gradient, signal, filter, convolved,
+        stride.ndims(), stride.get(), padding.ndims(), padding.get(),
+        dilation.ndims(), dilation.get(), AF_CONV_GRADIENT_DATA));
+
+    vector<T> &dataGradientGold = tests[1];
+    ASSERT_VEC_ARRAY_NEAR(dataGradientGold, sDims, data_gradient,
+                          tolerance<T>());
+
+    vector<T> &filterGradientGold = tests[2];
+    ASSERT_VEC_ARRAY_NEAR(filterGradientGold, fDims, filter_gradient,
+                          tolerance<T>());
+
+    ASSERT_SUCCESS(af_release_array(incoming_gradient));
+    ASSERT_SUCCESS(af_release_array(convolved));
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(filter));
+    ASSERT_SUCCESS(af_release_array(filter_gradient));
+    ASSERT_SUCCESS(af_release_array(data_gradient));
+}
+
+template<typename T>
+class ConvolveStrided : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+// create a list of types to be tested
+typedef ::testing::Types<float, double, half_float::half>
+    TestTypesStrided;  // TODO: integral types??
+
+// register the type list
+TYPED_TEST_SUITE(ConvolveStrided, TestTypesStrided);
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt33_s11_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig810_filt33_s11_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig81011_filt3311_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt33_s11_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt33_s33_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s33_p11_d11.test"),
+        dim4(3, 3), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt33_s33_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s33_p11_d11.test"),
+        dim4(3, 3), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt55_s55_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt5511_s55_p11_d11.test"),
+        dim4(5, 5), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt55_s55_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt5511_s55_p11_d11.test"),
+        dim4(5, 5), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt77_s77_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt7711_s77_p11_d11.test"),
+        dim4(7, 7), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt77_s77_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt7711_s77_p11_d11.test"),
+        dim4(7, 7), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt33_s11_p11_d22) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d22.test"),
+        dim4(1, 1), dim4(1, 1), dim4(2, 2));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt33_s11_p11_d22) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d22.test"),
+        dim4(1, 1), dim4(1, 1), dim4(2, 2));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt33_s11_p11_d33) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d33.test"),
+        dim4(1, 1), dim4(1, 1), dim4(3, 3));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt33_s11_p11_d33) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3311_s11_p11_d33.test"),
+        dim4(1, 1), dim4(1, 1), dim4(3, 3));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt35_s11_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3511_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt35_s11_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3511_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt53_s11_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt5311_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt53_s11_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt5311_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig1010_filt35_s31_p11_d21) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3511_s31_p11_d21.test"),
+        dim4(3, 1), dim4(1, 1), dim4(2, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig1010_filt35_s31_p11_d21) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig101011_filt3511_s31_p11_d21.test"),
+        dim4(3, 1), dim4(1, 1), dim4(2, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Strided_sig81032_filt3334_s11_p11_d11) {
+    convolve2stridedTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig81032_filt3334_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TYPED_TEST(ConvolveStrided, Gradient_sig81032_filt3334_s11_p11_d11) {
+    convolve2GradientTest<TypeParam>(
+        string(TEST_DIR "/convolve/sig81032_filt3334_s11_p11_d11.test"),
+        dim4(1, 1), dim4(1, 1), dim4(1, 1));
+}
+
+TEST(ConvolveNN, ZeroPadding_Issue2817) {
+    array signal = constant(1.f, 5, 5);
+    array filter = constant(1 / 9.f, 3, 3);
+    dim4 strides(1, 1), dilation(1, 1);
+    dim4 padding(0, 0, 1, 1);
+
+    array convolved = convolve2NN(signal, filter, strides, padding, dilation);
+    ASSERT_LT(sum<float>(abs(signal(seq(1, 3), seq(1, 3)) - convolved)),
+              1E-5);
+
+    array incoming_gradient = constant(1 / 9.f, 3, 3);
+    array convolved_grad = convolve2GradientNN(incoming_gradient, signal, filter,
+                                               convolved, strides, padding, dilation,
+                                               AF_CONV_GRADIENT_FILTER);
+    ASSERT_LT(sum<float>(abs(convolved - convolved_grad)), 1E-5);
+}
diff --git a/test/corrcoef.cpp b/test/corrcoef.cpp
new file mode 100644
index 0000000000..e9bc5a5616
--- /dev/null
+++ b/test/corrcoef.cpp
@@ -0,0 +1,93 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <algorithm>
+#include <ctime>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cfloat;
+using af::corrcoef;
+using af::dim4;
+using std::string;
+using std::vector;
+
+template<typename T>
+class CorrelationCoefficient : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, uint, intl, uintl, char, schar,
+                         uchar>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(CorrelationCoefficient, TestTypes);
+
+template<typename T>
+struct f32HelperType {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            type;
+};
+
+template<typename T>
+struct c32HelperType {
+    typedef typename cond_type<is_same_type<T, cfloat>::value, cfloat,
+                               typename f32HelperType<T>::type>::type type;
+};
+
+template<typename T>
+struct elseType {
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
+};
+
+template<typename T>
+struct ccOutType {
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
+};
+
+TYPED_TEST(CorrelationCoefficient, All) {
+    typedef typename ccOutType<TypeParam>::type outType;
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<int, float>(
+        string(TEST_DIR "/corrcoef/mat_10x10_scalar.test"), numDims, in, tests);
+
+    vector<TypeParam> input1(in[0].begin(), in[0].end());
+    vector<TypeParam> input2(in[1].begin(), in[1].end());
+
+    array a(numDims[0], &(input1.front()));
+    array b(numDims[1], &(input2.front()));
+    outType c = corrcoef<outType>(a, b);
+
+    vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
+    ASSERT_NEAR(::real(currGoldBar[0]), ::real(c), 1.0e-3);
+    ASSERT_NEAR(::imag(currGoldBar[0]), ::imag(c), 1.0e-3);
+}
diff --git a/test/covariance.cpp b/test/covariance.cpp
new file mode 100644
index 0000000000..f149fbd095
--- /dev/null
+++ b/test/covariance.cpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <algorithm>
+#include <ctime>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::exception;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Covariance : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, uint, intl, uintl, schar, uchar,
+                         short, ushort>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(Covariance, TestTypes);
+
+template<typename T>
+struct f32HelperType {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            type;
+};
+
+template<typename T>
+struct c32HelperType {
+    typedef typename cond_type<is_same_type<T, cfloat>::value, cfloat,
+                               typename f32HelperType<T>::type>::type type;
+};
+
+template<typename T>
+struct elseType {
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
+};
+
+template<typename T>
+struct covOutType {
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
+};
+
+template<typename T>
+void covTest(string pFileName, bool isbiased = true,
+             const bool useDeprecatedAPI = false) {
+    typedef typename covOutType<T>::type outType;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
+
+    dim4 dims1 = numDims[0];
+    dim4 dims2 = numDims[1];
+    vector<T> input1(in[0].begin(), in[0].end());
+    vector<T> input2(in[1].begin(), in[1].end());
+
+    array a(dims1, &(input1.front()));
+    array b(dims2, &(input2.front()));
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    array c =
+        (useDeprecatedAPI
+             ? cov(a, b, isbiased)
+             : cov(a, b,
+                   (isbiased ? AF_VARIANCE_SAMPLE : AF_VARIANCE_POPULATION)));
+#pragma GCC diagnostic pop
+
+    vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
+
+    size_t nElems = currGoldBar.size();
+    vector<outType> outData(nElems);
+
+    c.host((void*)outData.data());
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(::real(currGoldBar[elIter]), ::real(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+        ASSERT_NEAR(::imag(currGoldBar[elIter]), ::imag(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+    }
+}
+
+TYPED_TEST(Covariance, Vector) {
+    covTest<TypeParam>(string(TEST_DIR "/covariance/vec_size60.test"));
+    covTest<TypeParam>(string(TEST_DIR "/covariance/vec_size60.test"), true);
+}
+
+TYPED_TEST(Covariance, Matrix) {
+    covTest<TypeParam>(string(TEST_DIR "/covariance/matrix_65x121.test"));
+    covTest<TypeParam>(string(TEST_DIR "/covariance/matrix_65x121.test"), true);
+}
+
+TEST(Covariance, c32) {
+    array a = constant(cfloat(1.0f, -1.0f), 10, c32);
+    array b = constant(cfloat(2.0f, -1.0f), 10, c32);
+    ASSERT_THROW(cov(a, b, AF_VARIANCE_POPULATION), exception);
+}
+
+TEST(Covariance, c64) {
+    SUPPORTED_TYPE_CHECK(double);
+    array a = constant(cdouble(1.0, -1.0), 10, c64);
+    array b = constant(cdouble(2.0, -1.0), 10, c64);
+    ASSERT_THROW(cov(a, b, AF_VARIANCE_POPULATION), exception);
+}
diff --git a/test/cuda.cu b/test/cuda.cu
new file mode 100644
index 0000000000..d404c514a5
--- /dev/null
+++ b/test/cuda.cu
@@ -0,0 +1,79 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/array.h>
+#include <af/device.h>
+
+using af::allocV2;
+using af::freeV2;
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+TEST(Memory, AfAllocDeviceCUDA) {
+    void *ptr;
+    ASSERT_SUCCESS(af_alloc_device(&ptr, sizeof(float)));
+
+    /// Tests whether the returned pointer can be used by CUDA functions
+    float gold_val = 5;
+    float *gold    = NULL;
+    ASSERT_EQ(cudaSuccess, cudaMalloc(&gold, sizeof(float)));
+    ASSERT_EQ(cudaSuccess, cudaMemcpy(gold, &gold_val, sizeof(float),
+                                      cudaMemcpyHostToDevice));
+
+    ASSERT_EQ(cudaSuccess,
+              cudaMemcpy(ptr, gold, sizeof(float), cudaMemcpyDeviceToDevice));
+
+    float host;
+    ASSERT_EQ(cudaSuccess,
+              cudaMemcpy(&host, ptr, sizeof(float), cudaMemcpyDeviceToHost));
+    ASSERT_SUCCESS(af_free_device(ptr));
+
+    ASSERT_EQ(5, host);
+}
+#pragma GCC diagnostic pop
+
+TEST(Memory, AfAllocDeviceV2CUDA) {
+    void *ptr;
+    ASSERT_SUCCESS(af_alloc_device_v2(&ptr, sizeof(float)));
+
+    /// Tests whether the returned pointer can be used by CUDA functions
+    float gold_val = 5;
+    float *gold    = NULL;
+    ASSERT_EQ(cudaSuccess, cudaMalloc(&gold, sizeof(float)));
+    ASSERT_EQ(cudaSuccess, cudaMemcpy(gold, &gold_val, sizeof(float),
+                                      cudaMemcpyHostToDevice));
+
+    ASSERT_EQ(cudaSuccess,
+              cudaMemcpy(ptr, gold, sizeof(float), cudaMemcpyDeviceToDevice));
+
+    float host;
+    ASSERT_EQ(cudaSuccess,
+              cudaMemcpy(&host, ptr, sizeof(float), cudaMemcpyDeviceToHost));
+    ASSERT_SUCCESS(af_free_device_v2(ptr));
+
+    ASSERT_EQ(5, host);
+}
+
+TEST(Memory, SNIPPET_AllocCUDA) {
+    //! [ex_alloc_v2_cuda]
+
+    void *ptr = allocV2(sizeof(float));
+
+    float *dptr     = static_cast<float *>(ptr);
+    float host_data = 5.0f;
+
+    cudaError_t error = cudaSuccess;
+    error = cudaMemcpy(dptr, &host_data, sizeof(float), cudaMemcpyHostToDevice);
+    freeV2(ptr);
+
+    //! [ex_alloc_v2_cuda]
+    ASSERT_EQ(cudaSuccess, error);
+}
diff --git a/test/data b/test/data
deleted file mode 160000
index 8b4a85eb89..0000000000
--- a/test/data
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 8b4a85eb89118028b4da524f0d734969baa4b0f2
diff --git a/test/diagonal.cpp b/test/diagonal.cpp
index 16d67af30a..e3031f731c 100644
--- a/test/diagonal.cpp
+++ b/test/diagonal.cpp
@@ -1,87 +1,155 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
 
+#include <arrayfire.h>
 #include <gtest/gtest.h>
+#include <half.hpp>
 #include <testHelpers.hpp>
-#include <arrayfire.h>
-#include <iostream>
 
-using namespace std;
-using namespace af;
+#include <cmath>
+#include <vector>
+
+using af::array;
+using af::constant;
+using af::deviceGC;
+using af::diag;
+using af::dim4;
+using af::exception;
+using af::max;
+using af::seq;
+using af::span;
+using af::sum;
+using std::abs;
+using std::vector;
 
 template<typename T>
-class Diagonal : public ::testing::Test
-{
-
-};
+class Diagonal : public ::testing::Test {};
 
-typedef ::testing::Types<float, double, int, uint, char, unsigned char> TestTypes;
-TYPED_TEST_CASE(Diagonal, TestTypes);
+typedef ::testing::Types<float, double, int, uint, char, signed char,
+                         unsigned char, half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Diagonal, TestTypes);
 
-TYPED_TEST(Diagonal, Create)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Diagonal, Create) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
     try {
-
         static const int size = 1000;
-        vector<TypeParam> input (size * size);
-        for(int i = 0; i < size; i++) {
-            input[i] = i;
-        }
-        for(int jj = 10; jj < size; jj+=100) {
+        vector<TypeParam> input(size * size);
+        for (int i = 0; i < size; i++) { input[i] = i; }
+        for (int jj = 10; jj < size; jj += 100) {
             array data(jj, &input.front(), afHost);
             array out = diag(data, 0, false);
 
             vector<TypeParam> h_out(out.elements());
             out.host(&h_out.front());
 
-            for(int i =0; i < (int)out.dims(0); i++) {
-                for(int j =0; j < (int)out.dims(1); j++) {
-                    if(i == j) ASSERT_EQ(input[i], h_out[i * out.dims(0) + j]);
-                    else       ASSERT_EQ(TypeParam(0), h_out[i * out.dims(0) + j]);
+            for (int i = 0; i < (int)out.dims(0); i++) {
+                for (int j = 0; j < (int)out.dims(1); j++) {
+                    if (i == j)
+                        ASSERT_EQ(input[i], h_out[i * out.dims(0) + j]);
+                    else
+                        ASSERT_EQ(TypeParam(0), h_out[i * out.dims(0) + j]);
                 }
             }
         }
-    } catch (const af::exception& ex) {
-        FAIL() << ex.what() << endl;
-    }
+    } catch (const exception& ex) { FAIL() << ex.what(); }
 }
 
-TYPED_TEST(Diagonal, Extract)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Diagonal, DISABLED_CreateLargeDim) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    try {
+        deviceGC();
+        {
+            static const size_t largeDim = 65535 + 1;
+            array diagvals               = constant(1, largeDim);
+            array out                    = diag(diagvals, 0, false);
+
+            ASSERT_EQ(largeDim, sum<float>(out));
+        }
+    } catch (const exception& ex) { FAIL() << ex.what(); }
+}
+
+TYPED_TEST(Diagonal, Extract) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
     try {
         static const int size = 1000;
-        vector<TypeParam> input (size * size);
-        for(int i = 0; i < size * size; i++) {
-            input[i] = i;
-        }
-        for(int jj = 10; jj < size; jj+=100) {
+        vector<TypeParam> input(size * size);
+        for (int i = 0; i < size * size; i++) { input[i] = i; }
+        for (int jj = 10; jj < size; jj += 100) {
             array data(jj, jj, &input.front(), afHost);
             array out = diag(data, 0);
 
             vector<TypeParam> h_out(out.elements());
             out.host(&h_out.front());
 
-            for(int i =0; i < (int)out.dims(0); i++) {
+            for (int i = 0; i < (int)out.dims(0); i++) {
                 ASSERT_EQ(input[i * data.dims(0) + i], h_out[i]);
             }
         }
-    } catch (const af::exception& ex) {
-        FAIL() << ex.what() << endl;
-    }
+    } catch (const exception& ex) { FAIL() << ex.what(); }
+}
+
+TYPED_TEST(Diagonal, ExtractLargeDim) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    try {
+        static const size_t n        = 10;
+        static const size_t largeDim = 65535 + 1;
+
+        array largedata = constant(1, n, n, largeDim);
+        array out       = diag(largedata, 0);
+
+        ASSERT_EQ(n * largeDim, sum<float>(out));
+
+        largedata  = constant(1, n, n, 1, largeDim);
+        array out1 = diag(largedata, 0);
+
+        ASSERT_EQ(n * largeDim, sum<float>(out1));
+
+    } catch (const exception& ex) { FAIL() << ex.what(); }
 }
 
-TEST(Diagonal, ExtractGFOR)
-{
+TYPED_TEST(Diagonal, ExtractRect) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    try {
+        static const int size0 = 1000, size1 = 900;
+        vector<TypeParam> input(size0 * size1);
+        for (int i = 0; i < size0 * size1; i++) { input[i] = i; }
+
+        for (int jj = 10; jj < size0; jj += 100) {
+            for (int kk = 10; kk < size1; kk += 90) {
+                array data(jj, kk, &input.front(), afHost);
+                array out = diag(data, 0);
+
+                vector<TypeParam> h_out(out.elements());
+                out.host(&h_out.front());
+
+                ASSERT_EQ(out.dims(0), std::min(jj, kk));
+
+                for (int i = 0; i < (int)out.dims(0); i++) {
+                    ASSERT_EQ(input[i * data.dims(0) + i], h_out[i]);
+                }
+            }
+        }
+    } catch (const exception& ex) { FAIL() << ex.what(); }
+}
+
+TEST(Diagonal, ExtractGFOR) {
     dim4 dims = dim4(100, 100, 3);
-    array A = round(100 * randu(dims));
-    array B = constant(0, 100, 1, 3);
+    array A   = round(100 * randu(dims));
+    array B   = constant(0, 100, 1, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = diag(A(span, span, ii));
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = diag(A(span, span, ii)); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = diag(A(span, span, ii));
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
diff --git a/test/diff1.cpp b/test/diff1.cpp
index 13b6054541..9fdf11a91a 100644
--- a/test/diff1.cpp
+++ b/test/diff1.cpp
@@ -7,213 +7,228 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Diff1 : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(1, 4, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-            subMat0.push_back(af_make_seq(0, 1, 1));
-
-            subMat1.push_back(af_make_seq(0, 4, 1));
-            subMat1.push_back(af_make_seq(1, 3, 1));
-            subMat1.push_back(af_make_seq(1, 3, 1));
-
-            subMat2.push_back(af_make_seq(1, 5, 1));
-            subMat2.push_back(af_make_seq(0, 3, 1));
-            subMat2.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
-        vector<af_seq> subMat1;
-        vector<af_seq> subMat2;
+class Diff1 : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(1, 4, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+        subMat0.push_back(af_make_seq(0, 1, 1));
+
+        subMat1.push_back(af_make_seq(0, 4, 1));
+        subMat1.push_back(af_make_seq(1, 3, 1));
+        subMat1.push_back(af_make_seq(1, 3, 1));
+
+        subMat2.push_back(af_make_seq(1, 5, 1));
+        subMat2.push_back(af_make_seq(0, 3, 1));
+        subMat2.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
+    vector<af_seq> subMat1;
+    vector<af_seq> subMat2;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, intl,
+                         uintl, char, signed char, unsigned char, short, ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Diff1, TestTypes);
+TYPED_TEST_SUITE(Diff1, TestTypes);
 
 template<typename T>
-void diff1Test(string pTestFile, unsigned dim, bool isSubRef=false, const vector<af_seq> *seqv=NULL)
-{
-    if (noDoubleTests<T>()) return;
-
-    vector<af::dim4> numDims;
+void diff1Test(string pTestFile, unsigned dim, bool isSubRef = false,
+               const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<dim4> numDims;
 
-    T *outData;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array inArray   = 0;
     af_array outArray  = 0;
     af_array tempArray = 0;
     // Get input array
     if (isSubRef) {
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       dims.ndims(), dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
     // Run diff1
-    ASSERT_EQ(AF_SUCCESS, af_diff1(&outArray, inArray, dim));
-
-    // Get result
-    outData = new T[dims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_diff1(&outArray, inArray, dim));
 
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<T> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
-        }
-    }
+        dim4 goldDims;
+        ASSERT_SUCCESS(af_get_dims(&goldDims[0], &goldDims[1], &goldDims[2],
+                                   &goldDims[3], inArray));
+        goldDims[dim]--;
 
-    // Delete
-    delete[] outData;
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, outArray);
+    }
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-TYPED_TEST(Diff1,Vector0)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/vector0.test"), 0);
+TYPED_TEST(Diff1, Vector0) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/vector0.test"), 0);
 }
 
-TYPED_TEST(Diff1,Matrix0)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/matrix0.test"), 0);
+TYPED_TEST(Diff1, Matrix0) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/matrix0.test"), 0);
 }
 
-TYPED_TEST(Diff1,Matrix1)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/matrix1.test"), 1);
+TYPED_TEST(Diff1, Matrix1) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/matrix1.test"), 1);
 }
 
 // Diff on 0 dimension
-TYPED_TEST(Diff1,Basic0)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/basic0.test"), 0);
+TYPED_TEST(Diff1, Basic0) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/basic0.test"), 0);
 }
 
 // Diff on 1 dimension
-TYPED_TEST(Diff1,Basic1)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/basic1.test"), 1);
+TYPED_TEST(Diff1, Basic1) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/basic1.test"), 1);
 }
 
 // Diff on 2 dimension
-TYPED_TEST(Diff1,Basic2)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/basic2.test"), 2);
+TYPED_TEST(Diff1, Basic2) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/basic2.test"), 2);
 }
 
 // Diff on 0 dimension subref
-TYPED_TEST(Diff1,Subref0)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/subref0.test"), 0,true,&(this->subMat0));
+TYPED_TEST(Diff1, Subref0) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/subref0.test"), 0, true,
+                         &(this->subMat0));
 }
 
 // Diff on 1 dimension subref
-TYPED_TEST(Diff1,Subref1)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/subref1.test"), 1,true,&(this->subMat1));
+TYPED_TEST(Diff1, Subref1) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/subref1.test"), 1, true,
+                         &(this->subMat1));
 }
 
 // Diff on 2 dimension subref
-TYPED_TEST(Diff1,Subref2)
-{
-    diff1Test<TypeParam>(string(TEST_DIR"/diff1/subref2.test"), 2,true,&(this->subMat2));
+TYPED_TEST(Diff1, Subref2) {
+    diff1Test<TypeParam>(string(TEST_DIR "/diff1/subref2.test"), 2, true,
+                         &(this->subMat2));
 }
 
 template<typename T>
-void diff1ArgsTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void diff1ArgsTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array inArray  = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     ASSERT_EQ(AF_ERR_ARG, af_diff1(&outArray, inArray, -1));
-    ASSERT_EQ(AF_ERR_ARG, af_diff1(&outArray, inArray,  5));
+    ASSERT_EQ(AF_ERR_ARG, af_diff1(&outArray, inArray, 5));
 
-    if(inArray  != 0) af_release_array(inArray);
-    if(outArray != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-TYPED_TEST(Diff1,InvalidArgs)
-{
-    diff1ArgsTest<TypeParam>(string(TEST_DIR"/diff1/basic0.test"));
+TYPED_TEST(Diff1, InvalidArgs) {
+    diff1ArgsTest<TypeParam>(string(TEST_DIR "/diff1/basic0.test"));
 }
 
 ////////////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Diff1, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    const unsigned dim = 0;
-    vector<af::dim4> numDims;
 
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float,float,int>(string(TEST_DIR"/diff1/matrix0.test"),numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+using af::array;
+using af::constant;
+using af::deviceGC;
+using af::diff1;
+using af::sum;
+
+TEST(Diff1, DiffLargeDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    deviceGC();
+    {
+        array in   = constant(1, largeDim);
+        array diff = diff1(in, 0);
+        float s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, largeDim);
+        diff = diff1(in, 1);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, 1, largeDim);
+        diff = diff1(in, 2);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, 1, 1, largeDim);
+        diff = diff1(in, 3);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+    }
+}
 
+TEST(Diff1, CPP) {
+    const unsigned dim = 0;
+    vector<dim4> numDims;
 
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::diff1(input, dim);
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/diff1/matrix0.test"),
+                                 numDims, in, tests);
+    dim4 dims = numDims[0];
 
-    // Get result
-    float *outData = new float[dims.elements()];
-    output.host((void*)outData);
+    array input(dims, &(in[0].front()));
+    array output = diff1(input, dim);
 
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<float> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
-        }
-    }
+        dim4 goldDims             = dims;
+        goldDims[dim]--;
 
-    // Delete
-    delete[] outData;
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, output);
+    }
 }
-
diff --git a/test/diff2.cpp b/test/diff2.cpp
index 16b957f69f..cdc2b9909e 100644
--- a/test/diff2.cpp
+++ b/test/diff2.cpp
@@ -7,207 +7,222 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::deviceGC;
+using af::diff2;
+using af::dim4;
+using af::dtype_traits;
+using af::sum;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Diff2 : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-            subMat0.push_back(af_make_seq(0, 1, 1));
-
-            subMat1.push_back(af_make_seq(1, 4, 1));
-            subMat1.push_back(af_make_seq(0, 2, 1));
-            subMat1.push_back(af_make_seq(0, 1, 1));
-
-            subMat2.push_back(af_make_seq(1, 4, 1));
-            subMat2.push_back(af_make_seq(0, 3, 1));
-            subMat2.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
-        vector<af_seq> subMat1;
-        vector<af_seq> subMat2;
+class Diff2 : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+        subMat0.push_back(af_make_seq(0, 1, 1));
+
+        subMat1.push_back(af_make_seq(1, 4, 1));
+        subMat1.push_back(af_make_seq(0, 2, 1));
+        subMat1.push_back(af_make_seq(0, 1, 1));
+
+        subMat2.push_back(af_make_seq(1, 4, 1));
+        subMat2.push_back(af_make_seq(0, 3, 1));
+        subMat2.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
+    vector<af_seq> subMat1;
+    vector<af_seq> subMat2;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, intl,
+                         uintl, char, signed char, unsigned char, short, ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Diff2, TestTypes);
+TYPED_TEST_SUITE(Diff2, TestTypes);
 
 template<typename T>
-void diff2Test(string pTestFile, unsigned dim, bool isSubRef=false, const vector<af_seq> *seqv=NULL)
-{
-    if (noDoubleTests<T>()) return;
-
-    vector<af::dim4> numDims;
+void diff2Test(string pTestFile, unsigned dim, bool isSubRef = false,
+               const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<dim4> numDims;
 
-    T *outData;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array inArray   = 0;
     af_array outArray  = 0;
     af_array tempArray = 0;
     // Get input array
     if (isSubRef) {
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       dims.ndims(), dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
     // Run diff2
-    ASSERT_EQ(AF_SUCCESS, af_diff2(&outArray, inArray, dim));
-
-    // Get result
-    outData = new T[dims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_diff2(&outArray, inArray, dim));
 
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<T> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
-        }
-    }
+        dim4 goldDims;
+        ASSERT_SUCCESS(af_get_dims(&goldDims[0], &goldDims[1], &goldDims[2],
+                                   &goldDims[3], inArray));
+        goldDims[dim] -= 2;
 
-    // Delete
-    delete[] outData;
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, outArray);
+    }
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-TYPED_TEST(Diff2,Vector0)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/vector0.test"), 0);
+TYPED_TEST(Diff2, Vector0) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/vector0.test"), 0);
 }
 
-TYPED_TEST(Diff2,Matrix0)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/matrix0.test"), 0);
+TYPED_TEST(Diff2, Matrix0) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/matrix0.test"), 0);
 }
 
-TYPED_TEST(Diff2,Matrix1)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/matrix1.test"), 1);
+TYPED_TEST(Diff2, Matrix1) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/matrix1.test"), 1);
 }
 
 // Diff on 0 dimension
-TYPED_TEST(Diff2,Basic0)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/basic0.test"), 0);
+TYPED_TEST(Diff2, Basic0) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/basic0.test"), 0);
 }
 
 // Diff on 1 dimension
-TYPED_TEST(Diff2,Basic1)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/basic1.test"), 1);
+TYPED_TEST(Diff2, Basic1) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/basic1.test"), 1);
 }
 
 // Diff on 2 dimension
-TYPED_TEST(Diff2,Basic2)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/basic2.test"), 2);
+TYPED_TEST(Diff2, Basic2) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/basic2.test"), 2);
 }
 
-TYPED_TEST(Diff2,Subref0)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/subref0.test"), 0,true,&(this->subMat0));
+TYPED_TEST(Diff2, Subref0) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/subref0.test"), 0, true,
+                         &(this->subMat0));
 }
 
-TYPED_TEST(Diff2,Subref1)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/subref1.test"), 1,true,&(this->subMat1));
+TYPED_TEST(Diff2, Subref1) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/subref1.test"), 1, true,
+                         &(this->subMat1));
 }
 
-TYPED_TEST(Diff2,Subref2)
-{
-    diff2Test<TypeParam>(string(TEST_DIR"/diff2/subref2.test"), 2,true,&(this->subMat2));
+TYPED_TEST(Diff2, Subref2) {
+    diff2Test<TypeParam>(string(TEST_DIR "/diff2/subref2.test"), 2, true,
+                         &(this->subMat2));
 }
 
 template<typename T>
-void diff2ArgsTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void diff2ArgsTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array inArray  = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     ASSERT_EQ(AF_ERR_ARG, af_diff2(&outArray, inArray, -1));
-    ASSERT_EQ(AF_ERR_ARG, af_diff2(&outArray, inArray,  5));
+    ASSERT_EQ(AF_ERR_ARG, af_diff2(&outArray, inArray, 5));
+
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+}
 
-    if(inArray  != 0) af_release_array(inArray);
-    if(outArray != 0) af_release_array(outArray);
+TYPED_TEST(Diff2, InvalidArgs) {
+    diff2ArgsTest<TypeParam>(string(TEST_DIR "/diff2/basic0.test"));
 }
 
-TYPED_TEST(Diff2,InvalidArgs)
-{
-    diff2ArgsTest<TypeParam>(string(TEST_DIR"/diff2/basic0.test"));
+TEST(Diff2, DiffLargeDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    deviceGC();
+    {
+        array in   = constant(1, largeDim);
+        array diff = diff2(in, 0);
+        float s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, largeDim);
+        diff = diff2(in, 1);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, 1, largeDim);
+        diff = diff2(in, 2);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+
+        in   = constant(1, 1, 1, 1, largeDim);
+        diff = diff2(in, 3);
+        s    = sum<float>(diff, 1);
+        ASSERT_EQ(s, 0.f);
+    }
 }
 
 ////////////////////////////////// CPP ////////////////////////////////////////
 //
-TEST(Diff2, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(Diff2, CPP) {
     const unsigned dim = 1;
-    vector<af::dim4> numDims;
-
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float,float,int>(string(TEST_DIR"/diff2/matrix1.test"),numDims,in,tests);
-    af::dim4 dims       = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::diff2(input, dim);
+    vector<dim4> numDims;
 
-    float *outData = new float[dims.elements()];
-    output.host((void*)outData);
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/diff2/matrix1.test"),
+                                 numDims, in, tests);
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = diff2(input, dim);
 
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<float> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
-        for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
-        }
-    }
+        dim4 goldDims             = input.dims();
+        goldDims[dim] -= 2;
 
-    // Delete
-    delete[] outData;
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, output);
+    }
 }
-
diff --git a/test/dog.cpp b/test/dog.cpp
new file mode 100644
index 0000000000..af76c23f59
--- /dev/null
+++ b/test/dog.cpp
@@ -0,0 +1,82 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <af/vision.h>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::convolve2;
+using af::dim4;
+using af::dog;
+using af::dtype_traits;
+using af::exception;
+using af::gaussianKernel;
+using af::randu;
+using af::sum;
+
+template<typename T>
+class DOG : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(DOG, TestTypes);
+
+TYPED_TEST(DOG, Basic) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    dim4 iDims(512, 512, 1, 1);
+    array in = constant(1, iDims, (af_dtype)dtype_traits<float>::af_type);
+    /* calculate DOG using ArrayFire functions */
+    array k1    = gaussianKernel(3, 3);
+    array k2    = gaussianKernel(2, 2);
+    array smth1 = convolve2(in, k1);
+    array smth2 = convolve2(in, k2);
+    array diff  = smth1 - smth2;
+    /* calculate DOG using new function */
+    array out = dog(in, 3, 2);
+    /* compare the two results */
+    float accumErr = sum<float>(out - diff);
+    EXPECT_EQ(true, accumErr < 1.0e-2);
+}
+
+TYPED_TEST(DOG, Batch) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    dim4 iDims(512, 512, 3, 1);
+    array in = constant(1, iDims, (af_dtype)dtype_traits<float>::af_type);
+    /* calculate DOG using ArrayFire functions */
+    array k1    = gaussianKernel(3, 3);
+    array k2    = gaussianKernel(2, 2);
+    array smth1 = convolve2(in, k1);
+    array smth2 = convolve2(in, k2);
+    array diff  = smth1 - smth2;
+    /* calculate DOG using new function */
+    array out = dog(in, 3, 2);
+    /* compare the two results */
+    float accumErr = sum<float>(out - diff);
+    EXPECT_EQ(true, accumErr < 1.0e-2);
+}
+
+TYPED_TEST(DOG, InvalidArray) {
+    array in = randu(512);
+    EXPECT_THROW(dog(in, 3, 2), exception);
+}
diff --git a/test/dot.cpp b/test/dot.cpp
new file mode 100644
index 0000000000..834260af44
--- /dev/null
+++ b/test/dot.cpp
@@ -0,0 +1,315 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dot;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class DotF : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class DotC : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypesF;
+typedef ::testing::Types<cfloat, cdouble> TestTypesC;
+
+// register the type list
+TYPED_TEST_SUITE(DotF, TestTypesF);
+TYPED_TEST_SUITE(DotC, TestTypesC);
+
+bool isinf(af::af_cfloat val) {
+    using std::isinf;
+    return isinf(val.real) || isinf(val.imag);
+}
+bool isinf(af::af_cdouble val) {
+    using std::isinf;
+    return isinf(val.real) || isinf(val.imag);
+}
+
+template<typename T>
+void dotTest(string pTestFile, const int resultIdx,
+             const af_mat_prop optLhs = AF_MAT_NONE,
+             const af_mat_prop optRhs = AF_MAT_NONE) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    readTests<T, T, T>(pTestFile, numDims, in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    af_array a   = 0;
+    af_array b   = 0;
+    af_array out = 0;
+
+    ASSERT_SUCCESS(af_create_array(&a, &(in[0].front()), aDims.ndims(),
+                                   aDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&b, &(in[1].front()), bDims.ndims(),
+                                   bDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_dot(&out, a, b, optLhs, optRhs));
+
+    vector<T> goldData = tests[resultIdx];
+    size_t nElems      = goldData.size();
+
+    ASSERT_VEC_ARRAY_NEAR(goldData, dim4(nElems), out, 0.03);
+
+    ASSERT_SUCCESS(af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(b));
+    ASSERT_SUCCESS(af_release_array(out));
+}
+
+template<typename T>
+void compare(double rval, double /*ival*/, T gold) {
+    ASSERT_NEAR(gold, rval, 0.03);
+}
+
+template<>
+void compare<cfloat>(double rval, double ival, cfloat gold) {
+    ASSERT_NEAR(gold.real, rval, 0.03);
+    ASSERT_NEAR(gold.imag, ival, 0.03);
+}
+
+template<>
+void compare<cdouble>(double rval, double ival, cdouble gold) {
+    ASSERT_NEAR(gold.real, rval, 0.03);
+    ASSERT_NEAR(gold.imag, ival, 0.03);
+}
+
+template<typename T>
+void dotAllTest(string pTestFile, const int resultIdx,
+                const af_mat_prop optLhs = AF_MAT_NONE,
+                const af_mat_prop optRhs = AF_MAT_NONE) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    readTests<T, T, T>(pTestFile, numDims, in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    af_array a = 0;
+    af_array b = 0;
+
+    ASSERT_SUCCESS(af_create_array(&a, &(in[0].front()), aDims.ndims(),
+                                   aDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&b, &(in[1].front()), bDims.ndims(),
+                                   bDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    double rval = 0, ival = 0;
+    ASSERT_SUCCESS(af_dot_all(&rval, &ival, a, b, optLhs, optRhs));
+
+    vector<T> goldData = tests[resultIdx];
+
+    using ::isinf;
+    using std::isinf;
+    if (false == (isinf(rval) && isinf(goldData[0]))) {
+        compare<T>(rval, ival, goldData[0]);
+    }
+
+    ASSERT_SUCCESS(af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(b));
+}
+
+#define INSTANTIATEF(SIZE, FILENAME)                                           \
+    TYPED_TEST(DotF, DotF_##SIZE) {                                            \
+        dotTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 0);    \
+        dotAllTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 0); \
+    }
+
+#define INSTANTIATEC(SIZE, FILENAME)                                          \
+    TYPED_TEST(DotC, DotC_CC_##SIZE) {                                        \
+        dotTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 0,    \
+                           AF_MAT_CONJ, AF_MAT_CONJ);                         \
+        dotAllTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 0, \
+                              AF_MAT_CONJ, AF_MAT_CONJ);                      \
+    }                                                                         \
+    TYPED_TEST(DotC, DotC_UU_##SIZE) {                                        \
+        dotTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 1,    \
+                           AF_MAT_NONE, AF_MAT_NONE);                         \
+        dotAllTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 1, \
+                              AF_MAT_NONE, AF_MAT_NONE);                      \
+    }                                                                         \
+    TYPED_TEST(DotC, DotC_CU_##SIZE) {                                        \
+        dotTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 2,    \
+                           AF_MAT_CONJ, AF_MAT_NONE);                         \
+        dotAllTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 2, \
+                              AF_MAT_CONJ, AF_MAT_NONE);                      \
+    }                                                                         \
+    TYPED_TEST(DotC, DotC_UC_##SIZE) {                                        \
+        dotTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 3,    \
+                           AF_MAT_NONE, AF_MAT_CONJ);                         \
+        dotAllTest<TypeParam>(string(TEST_DIR "/blas/" #FILENAME ".test"), 3, \
+                              AF_MAT_NONE, AF_MAT_CONJ);                      \
+    }
+
+INSTANTIATEF(1000, dot_f_1000);
+INSTANTIATEF(10, dot_f_10);
+INSTANTIATEF(25600, dot_f_25600);
+INSTANTIATEC(1000, dot_c_1000);
+INSTANTIATEC(10, dot_c_10);
+INSTANTIATEC(25600, dot_c_25600);
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(DotF, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTests<float, float, float>(TEST_DIR "/blas/dot_f_1000.test", numDims,
+                                   in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    array a(aDims, &(in[0].front()));
+    array b(bDims, &(in[1].front()));
+
+    array out = dot(a, b, AF_MAT_CONJ, AF_MAT_NONE);
+
+    vector<float> goldData = tests[0];
+    dim4 goldDims(1);
+    ASSERT_VEC_ARRAY_EQ(goldData, goldDims, out);
+}
+
+TEST(DotCCU, CPP) {
+    vector<dim4> numDims;
+    vector<vector<cfloat>> in;
+    vector<vector<cfloat>> tests;
+
+    readTests<cfloat, cfloat, cfloat>(TEST_DIR "/blas/dot_c_1000.test", numDims,
+                                      in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    array a(aDims, &(in[0].front()));
+    array b(bDims, &(in[1].front()));
+
+    array out = dot(a, b, AF_MAT_CONJ, AF_MAT_NONE);
+
+    vector<cfloat> goldData = tests[2];
+    dim4 goldDims(1);
+    ASSERT_VEC_ARRAY_EQ(goldData, goldDims, out);
+}
+
+TEST(DotAllF, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTests<float, float, float>(TEST_DIR "/blas/dot_f_1000.test", numDims,
+                                   in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    array a(aDims, &(in[0].front()));
+    array b(bDims, &(in[1].front()));
+
+    float out = dot<float>(a, b, AF_MAT_CONJ, AF_MAT_NONE);
+
+    vector<float> goldData = tests[0];
+
+    ASSERT_EQ(goldData[0], out);
+}
+
+TEST(DotAllCCU, CPP) {
+    vector<dim4> numDims;
+    vector<vector<cfloat>> in;
+    vector<vector<cfloat>> tests;
+
+    readTests<cfloat, cfloat, cfloat>(TEST_DIR "/blas/dot_c_1000.test", numDims,
+                                      in, tests);
+
+    dim4 aDims = numDims[0];
+    dim4 bDims = numDims[1];
+
+    array a(aDims, &(in[0].front()));
+    array b(bDims, &(in[1].front()));
+
+    cfloat out = dot<cfloat>(a, b, AF_MAT_CONJ, AF_MAT_NONE);
+
+    vector<cfloat> goldData = tests[2];
+
+    ASSERT_EQ(goldData[0], out);
+}
+
+class Dot : public ::testing::TestWithParam<int> {
+   public:
+    array ha, hb, gold;
+
+    void SetUp() {
+        SUPPORTED_TYPE_CHECK(half_float::half);
+        int elems = GetParam();
+        array fa  = af::randu(elems) - 0.5f;
+        array fb  = af::randu(elems) - 0.5f;
+
+        ha = fa.as(f16);
+        hb = fb.as(f16);
+
+        gold = dot(fa, fb);
+    }
+};
+
+std::string print_dot(const ::testing::TestParamInfo<Dot::ParamType> info) {
+    std::stringstream ss;
+
+    ss << info.param;
+
+    return ss.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(Small, Dot,
+                         ::testing::Values(2, 4, 5, 10, 31, 32, 33, 100, 127,
+                                           128, 129, 200, 500, 511, 512, 513,
+                                           1000),
+                         print_dot);
+
+TEST_P(Dot, Half) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    array hc = dot(ha, hb);
+
+    ASSERT_ARRAYS_NEAR(gold, hc.as(f32), 1e-2);
+}
diff --git a/test/empty.cpp b/test/empty.cpp
new file mode 100644
index 0000000000..f38bb67eaf
--- /dev/null
+++ b/test/empty.cpp
@@ -0,0 +1,291 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+#include <arrayfire.h>
+#include <cstdio>
+#include <cstdlib>
+
+using namespace af;
+
+template<typename T>
+class Array : public ::testing::Test {};
+
+TEST(Array, TestEmptyAssignment) {
+    array A     = randu(5, f32);
+    array C     = constant(0, 0);
+    array B     = A(isNaN(A));
+    A(isNaN(A)) = C;
+    ASSERT_EQ(B.numdims(), 0u);
+    ASSERT_EQ(A.numdims(), 1u);
+    ASSERT_EQ(lookup(constant(1, 9), constant(0, 0)).numdims(), 0u);
+}
+
+TEST(Array, TestEmptySigProc) {
+    ASSERT_EQ(convolve(constant(1, 1), constant(0, 0)).numdims(), 1u);
+    ASSERT_EQ(convolve(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(convolve2(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(convolve3(constant(0, 0), constant(0, 0)).numdims(), 0u);
+
+    ASSERT_EQ(iir(constant(0, 0), constant(0, 0), constant(0, 0)).numdims(),
+              0u);
+
+    ASSERT_EQ(approx1(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(approx1(constant(0, 0), seq(0, 10)).numdims(), 0u);
+    ASSERT_EQ(approx2(constant(0, 0), constant(0, 0), constant(0, 0)).numdims(),
+              0u);
+    ASSERT_EQ(approx2(constant(0, 0), seq(0, 10), seq(0, 10)).numdims(), 0u);
+}
+
+TEST(Array, TestEmptySet) {
+    ASSERT_EQ(setIntersect(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(setUnique(constant(0, 0)).numdims(), 0u);
+
+    array A = randu(5, f32);
+    array B = constant(0, 0);
+
+    ASSERT_EQ(setUnion(A, B).elements(), 5);
+    ASSERT_EQ(setUnion(B, A).elements(), 5);
+}
+
+TEST(Array, TestEmptyOperators) {
+    ASSERT_EQ((constant(0, 0) + constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) && constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) - constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) & constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) | constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) ^ constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) << constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) >> constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) / constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) == constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) <= constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) >= constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) > constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) < constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((-constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((!constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) != constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) += 1).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) -= 1).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) *= 1).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) /= 1).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) || constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) % constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ((constant(0, 0) * constant(0, 0)).numdims(), 0u);
+}
+
+TEST(Array, TestEmptyFFT) {
+    array arr = constant(0, 0);
+    fftInPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    fft2InPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    fft3InPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    ifftInPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    ifft2InPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    ifft3InPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+
+    ASSERT_EQ((fft(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftNorm(constant(0, 0), 0.5)).numdims(), 0u);
+    ASSERT_EQ((fft2(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fft2Norm(constant(0, 0), 0.5)).numdims(), 0u);
+    ASSERT_EQ((fft3(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fft3Norm(constant(0, 0), 0.5)).numdims(), 0u);
+    ASSERT_EQ((fftC2R<1>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftR2C<1>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftC2R<2>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftR2C<2>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftC2R<3>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((fftR2C<3>(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((ifft(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((ifftNorm(constant(0, 0), 0.5)).numdims(), 0u);
+    ASSERT_EQ((ifft2(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((ifft2Norm(constant(0, 0), 0.5)).numdims(), 0u);
+    ASSERT_EQ((ifft3(constant(0, 0))).numdims(), 0u);
+    ASSERT_EQ((ifft3Norm(constant(0, 0), 0.5)).numdims(), 0u);
+}
+
+TEST(Array, TestEmptyDiff) {
+    ASSERT_EQ(diff1(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(diff1(constant(1, 1)).numdims(), 0u);
+    ASSERT_EQ(diff1(constant(1, 2)).numdims(), 1u);
+    ASSERT_EQ(diff2(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(diff2(constant(1, 1)).numdims(), 0u);
+    ASSERT_EQ(diff2(constant(1, 2)).numdims(), 0u);
+    ASSERT_EQ(diff2(constant(1, 3)).numdims(), 1u);
+}
+
+TEST(Array, TestEmptyLinAlg) {
+    ASSERT_EQ(det<float>(constant(0, 0)), 1);
+    ASSERT_EQ(det<cfloat>(constant(0, 0)).real, 1);
+    ASSERT_EQ(det<cdouble>(constant(0, 0)).real, 1);
+    ASSERT_EQ(norm(constant(0, 0)), 0);
+    ASSERT_EQ(rank(constant(0, 0)), 0u);
+
+    array tau_qr, arr = constant(0, 0);
+    qrInPlace(tau_qr, arr);
+    ASSERT_EQ(tau_qr.numdims(), 0u);
+
+    array out_qr;
+    qr(out_qr, tau_qr, constant(0, 0));
+    ASSERT_EQ(out_qr.numdims(), 0u);
+    ASSERT_EQ(tau_qr.numdims(), 0u);
+    ASSERT_EQ(solve(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(solveLU(constant(0, 0), constant(0, 0), constant(0, 0)).numdims(),
+              0u);
+
+    array out_lu, piv_lu;
+    lu(out_lu, piv_lu, constant(0, 0));
+    ASSERT_EQ(out_lu.numdims(), 0u);
+    ASSERT_EQ(piv_lu.numdims(), 0u);
+
+    array low_lu, up_lu;
+    lu(low_lu, up_lu, piv_lu, constant(0, 0));
+    ASSERT_EQ(low_lu.numdims(), 0u);
+    ASSERT_EQ(up_lu.numdims(), 0u);
+    ASSERT_EQ(piv_lu.numdims(), 0u);
+
+    luInPlace(piv_lu, arr, true);
+    ASSERT_EQ(piv_lu.numdims(), 0u);
+    ASSERT_EQ(arr.numdims(), 0u);
+
+    array u, s, v;
+    svd(u, s, v, constant(0, 0));
+    svdInPlace(u, s, v, arr);
+    ASSERT_EQ(dot(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(transpose(constant(0, 0)).numdims(), 0u);
+    choleskyInPlace(arr);
+    ASSERT_EQ(arr.numdims(), 0u);
+    array out;
+    cholesky(out, constant(0, 0));
+    ASSERT_EQ(out.numdims(), 0u);
+}
+
+TEST(Array, TestEmptyMath) {
+    ASSERT_EQ(acos(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(acosh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(abs(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(asin(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(asinh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(atan(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(atan2(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(atanh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(cos(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(cosh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(log(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(log10(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(log1p(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sin(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sinh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(tan(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(tanh(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sqrt(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(real(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(imag(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(conjg(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(erf(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(erfc(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(exp(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(expm1(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(cbrt(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(ceil(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(lgamma(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(pow(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(root(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(tgamma(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(arg(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(floor(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(hypot(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(rem(constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(round(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sign(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(trunc(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(factorial(constant(0, 0)).numdims(), 0u);
+    // ASSERT_EQ(complex(constant(0,0), constant(0,0)).numdims(), 0u);
+}
+
+TEST(Array, TestEmptyVecOp) {
+    ASSERT_EQ(accum(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(allTrue(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(anyTrue(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(count(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(where(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(max(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(min(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(product(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sum(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(sort(constant(0, 0)).numdims(), 0u);
+
+    array skeys, svals;
+    sort(skeys, svals, constant(0, 0), constant(0, 0));
+    ASSERT_EQ(skeys.numdims(), 0u);
+    ASSERT_EQ(svals.numdims(), 0u);
+
+    array sout, sind;
+    sort(sout, sind, constant(0, 0));
+    ASSERT_EQ(sout.numdims(), 0u);
+    ASSERT_EQ(sind.numdims(), 0u);
+}
+
+TEST(Array, TestEmptyArrMod) {
+    ASSERT_EQ(diag(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(diag(constant(0, 0), true).numdims(), 0u);
+    ASSERT_EQ(identity(0).numdims(), 0u);
+    ASSERT_EQ(iota(dim4(0)).numdims(), 0u);
+    ASSERT_EQ(lower(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(upper(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(constant(0, 0).as(u8).numdims(), 0u);
+    ASSERT_EQ(isNaN(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(isInf(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(iszero(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(flat(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(flip(constant(0, 0), 0).numdims(), 0u);
+    ASSERT_EQ(moddims(constant(0, 0), dim4(0)).numdims(), 0u);
+    ASSERT_EQ(reorder(constant(0, 0), 0).numdims(), 0u);
+    ASSERT_EQ(shift(constant(0, 0), 1).numdims(), 0u);
+    ASSERT_EQ(tile(constant(0, 0), 1).numdims(), 0u);
+
+    ASSERT_EQ(join(0, constant(0, 0), constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(join(0, randu(3), constant(0, 0)).elements(), 3);
+    ASSERT_EQ(join(0, constant(0, 0), randn(3)).elements(), 3);
+
+    ASSERT_EQ(select(constant(0, 0), constant(0, 0), constant(0, 0)).numdims(),
+              0u);
+
+    array arr = constant(0, 0);
+    replace(arr, constant(0, 0), constant(0, 0));
+    ASSERT_EQ(arr.numdims(), 0u);
+}
+
+TEST(Array, TestEmptyImage) {
+    ASSERT_EQ(histogram(constant(0, 0), 1).numdims(), 0u);
+    ASSERT_EQ(hsv2rgb(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(gray2rgb(constant(0, 0)).numdims(), 0u);
+    ASSERT_EQ(rotate(constant(0, 0), 0).numdims(), 0u);
+
+    af_array h, hout;
+    dim_t ds[1];
+    af_constant(&h, 0, 0, ds, f32);
+    af_histogram(&hout, h, 10, 0.0, 1.0);
+
+    unsigned nd;
+    af_get_numdims(&nd, h);
+    ASSERT_EQ(nd, 0u);
+    af_get_numdims(&nd, hout);
+    ASSERT_EQ(nd, 0u);
+    ASSERT_SUCCESS(af_release_array(h));
+    ASSERT_SUCCESS(af_release_array(hout));
+}
diff --git a/test/event.cpp b/test/event.cpp
new file mode 100644
index 0000000000..e99bbf80c3
--- /dev/null
+++ b/test/event.cpp
@@ -0,0 +1,53 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/event.h>
+
+#include <memory>
+#include <utility>
+
+#include <iostream>
+
+using af::event;
+
+TEST(EventTests, SimpleCreateRelease) {
+    af_event event;
+    ASSERT_SUCCESS(af_create_event(&event));
+    ASSERT_SUCCESS(af_delete_event(event));
+}
+
+TEST(EventTests, MarkEnqueueAndBlock) {
+    af_event event;
+    ASSERT_SUCCESS(af_create_event(&event));
+    ASSERT_SUCCESS(af_mark_event(event));
+    ASSERT_SUCCESS(af_enqueue_wait_event(event));
+    ASSERT_SUCCESS(af_block_event(event));
+    ASSERT_SUCCESS(af_delete_event(event));
+}
+
+TEST(EventTests, EventCreateAndMove) {
+    af_event eventHandle;
+    ASSERT_SUCCESS(af_create_event(&eventHandle));
+
+    event e(eventHandle);
+    e.mark();
+    ASSERT_EQ(eventHandle, e.get());
+
+    auto otherEvent = std::move(e);
+    ASSERT_EQ(otherEvent.get(), eventHandle);
+
+    event f;
+    af_event fE        = f.get();
+    event anotherEvent = std::move(f);
+    ASSERT_EQ(fE, anotherEvent.get());
+    af::sync();
+}
diff --git a/test/fast.cpp b/test/fast.cpp
index 6d1cf286a1..c5e3225d0e 100644
--- a/test/fast.cpp
+++ b/test/fast.cpp
@@ -7,37 +7,37 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/compatible.h>
-#include <string>
-#include <vector>
 #include <cmath>
-#include <testHelpers.hpp>
+#include <string>
 #include <typeinfo>
+#include <vector>
 
+using af::dim4;
+using std::abs;
+using std::endl;
 using std::string;
 using std::vector;
-using af::dim4;
 
-typedef struct
-{
+typedef struct {
     float f[5];
 } feat_t;
 
-bool feat_cmp(feat_t i, feat_t j)
-{
+static bool feat_cmp(feat_t i, feat_t j) {
     for (int k = 0; k < 5; k++)
-        if (i.f[k] != j.f[k])
-            return (i.f[k] < j.f[k]);
+        if (i.f[k] != j.f[k]) return (i.f[k] < j.f[k]);
 
     return false;
 }
 
-void array_to_feat(vector<feat_t>& feat, float *x, float *y, float *score, float *orientation, float *size, unsigned nfeat)
-{
+static void array_to_feat(vector<feat_t> &feat, float *x, float *y,
+                          float *score, float *orientation, float *size,
+                          unsigned nfeat) {
     feat.resize(nfeat);
     for (unsigned i = 0; i < feat.size(); i++) {
         feat[i].f[0] = x[i];
@@ -49,178 +49,193 @@ void array_to_feat(vector<feat_t>& feat, float *x, float *y, float *score, float
 }
 
 template<typename T>
-class FloatFAST : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class FloatFAST : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 template<typename T>
-class FixedFAST : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class FixedFAST : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 typedef ::testing::Types<float, double> FloatTestTypes;
-typedef ::testing::Types<int, unsigned> FixedTestTypes;
+typedef ::testing::Types<int, unsigned, short, ushort, schar> FixedTestTypes;
 
-TYPED_TEST_CASE(FloatFAST, FloatTestTypes);
-TYPED_TEST_CASE(FixedFAST, FixedTestTypes);
+TYPED_TEST_SUITE(FloatFAST, FloatTestTypes);
+TYPED_TEST_SUITE(FixedFAST, FixedTestTypes);
 
 template<typename T>
-void fastTest(string pTestFile, bool nonmax)
-{
-    if (noDoubleTests<T>()) return;
+void fastTest(string pTestFile, bool nonmax) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>        inDims;
-    vector<string>     inFiles;
-    vector<vector<float> > gold;
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
 
     readImageTests(pTestFile, inDims, inFiles, gold);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-        dim_t nElems       = 0;
-        af_array inArray_f32  = 0;
-        af_array inArray      = 0;
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        dim_t nElems         = 0;
+        af_array inArray_f32 = 0;
+        af_array inArray     = 0;
         af_features out;
 
-        inFiles[testId].insert(0,string(TEST_DIR"/fast/"));
+        inFiles[testId].insert(0, string(TEST_DIR "/fast/"));
 
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray_f32));
-        continue;
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
 
-        printf("I should not be here\n");
-        ASSERT_EQ(AF_SUCCESS, conv_image<T>(&inArray, inArray_f32));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
 
-        ASSERT_EQ(AF_SUCCESS, af_fast(&out, inArray, 20.0f, 9, nonmax, 0.05f, 3));
+        ASSERT_SUCCESS(af_fast(&out, inArray, 20.0f, 9, nonmax, 0.05f, 3));
 
         dim_t n = 0;
         af_array x, y, score, orientation, size;
 
-        ASSERT_EQ(AF_SUCCESS, af_get_features_num(&n, out));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_xpos(&x, out));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_ypos(&y, out));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_score(&score, out));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_orientation(&orientation, out));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_size(&size, out));
-
-
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems, x));
-
-        float * outX           = new float[gold[0].size()];
-        float * outY           = new float[gold[1].size()];
-        float * outScore       = new float[gold[2].size()];
-        float * outOrientation = new float[gold[3].size()];
-        float * outSize        = new float[gold[4].size()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outX, x));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outY, y));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outScore, score));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outOrientation, orientation));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outSize, size));
+        ASSERT_SUCCESS(af_get_features_num(&n, out));
+        ASSERT_SUCCESS(af_get_features_xpos(&x, out));
+        ASSERT_SUCCESS(af_get_features_ypos(&y, out));
+        ASSERT_SUCCESS(af_get_features_score(&score, out));
+        ASSERT_SUCCESS(af_get_features_orientation(&orientation, out));
+        ASSERT_SUCCESS(af_get_features_size(&size, out));
+
+        ASSERT_SUCCESS(af_get_elements(&nElems, x));
+
+        float *outX           = new float[gold[0].size()];
+        float *outY           = new float[gold[1].size()];
+        float *outScore       = new float[gold[2].size()];
+        float *outOrientation = new float[gold[3].size()];
+        float *outSize        = new float[gold[4].size()];
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outX, x));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outY, y));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outScore, score));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outOrientation, orientation));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outSize, size));
 
         vector<feat_t> out_feat;
-        array_to_feat(out_feat, outX, outY, outScore, outOrientation, outSize, n);
+        array_to_feat(out_feat, outX, outY, outScore, outOrientation, outSize,
+                      n);
 
         vector<feat_t> gold_feat;
-        array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(), &gold[2].front(), &gold[3].front(), &gold[4].front(), gold[0].size());
+        array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(),
+                      &gold[2].front(), &gold[3].front(), &gold[4].front(),
+                      gold[0].size());
 
         std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
         std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
 
         for (int elIter = 0; elIter < (int)nElems; elIter++) {
-            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0]) << "at: " << elIter << std::endl;
-            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1]) << "at: " << elIter << std::endl;
-            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3) << "at: " << elIter << std::endl;
-            ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3]) << "at: " << elIter << std::endl;
-            ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4]) << "at: " << elIter << std::endl;
+            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4])
+                << "at: " << elIter << endl;
         }
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray_f32));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(x));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(y));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(score));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(orientation));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(size));
+        ASSERT_SUCCESS(af_release_features(out));
 
-        delete [] outX;
-        delete [] outY;
-        delete [] outScore;
-        delete [] outOrientation;
-        delete [] outSize;
+        delete[] outX;
+        delete[] outY;
+        delete[] outScore;
+        delete[] outOrientation;
+        delete[] outSize;
     }
 }
 
-#define FLOAT_FAST_INIT(desc, image, nonmax) \
-    TYPED_TEST(FloatFAST, desc) \
-    {   \
-        fastTest<TypeParam>(string(TEST_DIR"/fast/"#image"_float.test"), nonmax); \
+#define FLOAT_FAST_INIT(desc, image, nonmax)                                \
+    TYPED_TEST(FloatFAST, desc) {                                           \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                             \
+        fastTest<TypeParam>(string(TEST_DIR "/fast/" #image "_float.test"), \
+                            nonmax);                                        \
+    }
+
+#define FIXED_FAST_INIT(desc, image, nonmax)                                \
+    TYPED_TEST(FixedFAST, desc) {                                           \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                             \
+        fastTest<TypeParam>(string(TEST_DIR "/fast/" #image "_fixed.test"), \
+                            nonmax);                                        \
     }
 
-#define FIXED_FAST_INIT(desc, image, nonmax) \
-    TYPED_TEST(FixedFAST, desc) \
-    {   \
-        fastTest<TypeParam>(string(TEST_DIR"/fast/"#image"_fixed.test"), nonmax); \
+FLOAT_FAST_INIT(square, square, false);
+FLOAT_FAST_INIT(square_nonmax, square_nonmax, true);
+FIXED_FAST_INIT(square, square, false);
+FIXED_FAST_INIT(square_nonmax, square_nonmax, true);
+
+/////////////////////////////////// CPP ////////////////////////////////
+
+using af::array;
+using af::features;
+using af::loadImage;
+
+TEST(FloatFAST, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(string(TEST_DIR "/fast/square_nonmax_float.test"), inDims,
+                   inFiles, gold);
+    inFiles[0].insert(0, string(TEST_DIR "/fast/"));
+
+    array in = loadImage(inFiles[0].c_str(), false);
+
+    features out = fast(in, 20.0f, 9, true, 0.05f, 3);
+
+    float *outX           = new float[gold[0].size()];
+    float *outY           = new float[gold[1].size()];
+    float *outScore       = new float[gold[2].size()];
+    float *outOrientation = new float[gold[3].size()];
+    float *outSize        = new float[gold[4].size()];
+    out.getX().host(outX);
+    out.getY().host(outY);
+    out.getScore().host(outScore);
+    out.getOrientation().host(outOrientation);
+    out.getSize().host(outSize);
+
+    vector<feat_t> out_feat;
+    array_to_feat(out_feat, outX, outY, outScore, outOrientation, outSize,
+                  out.getNumFeatures());
+
+    vector<feat_t> gold_feat;
+    array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(),
+                  &gold[2].front(), &gold[3].front(), &gold[4].front(),
+                  gold[0].size());
+
+    std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
+    std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
+
+    for (unsigned elIter = 0; elIter < out.getNumFeatures(); elIter++) {
+        ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3])
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4])
+            << "at: " << elIter << endl;
     }
 
-    FLOAT_FAST_INIT(square, square, false);
-    FLOAT_FAST_INIT(square_nonmax, square_nonmax, true);
-    FIXED_FAST_INIT(square, square, false);
-    FIXED_FAST_INIT(square_nonmax, square_nonmax, true);
-
-///////////////////////////////////// CPP ////////////////////////////////
-//
-// TEST(FloatFAST, CPP)
-// {
-//     if (noDoubleTests<float>()) return;
-
-//     vector<dim4>        inDims;
-//     vector<string>     inFiles;
-//     vector<vector<float> > gold;
-
-//     readImageTests(string(TEST_DIR"/fast/square_nonmax_float.test"), inDims, inFiles, gold);
-//     inFiles[0].insert(0,string(TEST_DIR"/fast/"));
-
-//     af::array in = af::loadimage(inFiles[0].c_str(), false);
-
-//     af::features out = fast(in, 20.0f, 9, true, 0.05f, 3);
-
-//     float * outX           = new float[gold[0].size()];
-//     float * outY           = new float[gold[1].size()];
-//     float * outScore       = new float[gold[2].size()];
-//     float * outOrientation = new float[gold[3].size()];
-//     float * outSize        = new float[gold[4].size()];
-//     out.getX().host(outX);
-//     out.getY().host(outY);
-//     out.getScore().host(outScore);
-//     out.getOrientation().host(outOrientation);
-//     out.getSize().host(outSize);
-
-//     vector<feat_t> out_feat;
-//     array_to_feat(out_feat, outX, outY, outScore, outOrientation, outSize, out.getNumFeatures());
-
-//     vector<feat_t> gold_feat;
-//     array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(), &gold[2].front(), &gold[3].front(), &gold[4].front(), gold[0].size());
-
-//     std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
-//     std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
-
-//     for (unsigned elIter = 0; elIter < out.getNumFeatures(); elIter++) {
-//         ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0]) << "at: " << elIter << std::endl;
-//         ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1]) << "at: " << elIter << std::endl;
-//         ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3) << "at: " << elIter << std::endl;
-//         ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3]) << "at: " << elIter << std::endl;
-//         ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4]) << "at: " << elIter << std::endl;
-//     }
-
-//     delete[] outX;
-//     delete[] outY;
-//     delete[] outScore;
-//     delete[] outOrientation;
-//     delete[] outSize;
-// }
+    delete[] outX;
+    delete[] outY;
+    delete[] outScore;
+    delete[] outOrientation;
+    delete[] outSize;
+}
diff --git a/test/fft.cpp b/test/fft.cpp
index aad8c171a0..0af43dca2b 100644
--- a/test/fft.cpp
+++ b/test/fft.cpp
@@ -7,441 +7,560 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+
+#include <sstream>
+#include <stdexcept>
 #include <string>
 #include <vector>
-#include <stdexcept>
-#include <testHelpers.hpp>
 
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype_traits;
+using af::fft;
+using af::fft2;
+using af::fft2InPlace;
+using af::fft3;
+using af::fft3InPlace;
+using af::fftInPlace;
+using af::ifft;
+using af::ifft2;
+using af::ifft2InPlace;
+using af::ifft3;
+using af::ifft3InPlace;
+using af::ifftInPlace;
+using af::moddims;
+using af::randu;
+using af::seq;
+using af::span;
+using std::abs;
+using std::endl;
 using std::string;
+using std::stringstream;
 using std::vector;
-using af::cfloat;
-using af::cdouble;
 
-TEST(fft, Invalid_Type)
-{
-    vector<char>   in(100,1);
+TEST(fft, Invalid_Type) {
+    vector<char> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af::dim4 dims(5 * 5 * 2 * 2);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in.front()),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<char>::af_type));
+    dim4 dims(5 * 5 * 2 * 2);
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in.front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<char>::af_type));
 
     ASSERT_EQ(AF_ERR_TYPE, af_fft(&outArray, inArray, 1.0, 0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TEST(fft2, Invalid_Array)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<float>   in(100,1);
+TEST(fft2, Invalid_Array) {
+    vector<float> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af::dim4 dims(5 * 5 * 2 * 2);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in.front()),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<float>::af_type));
+    dim4 dims(5 * 5 * 2 * 2);
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in.front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
     ASSERT_EQ(AF_ERR_SIZE, af_fft2(&outArray, inArray, 1.0, 0, 0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TEST(fft3, Invalid_Array)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<float>   in(100,1);
+TEST(fft3, Invalid_Array) {
+    vector<float> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af::dim4 dims(10,10,1,1);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in.front()),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<float>::af_type));
+    dim4 dims(10, 10, 1, 1);
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in.front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
     ASSERT_EQ(AF_ERR_SIZE, af_fft3(&outArray, inArray, 1.0, 0, 0, 0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TEST(ifft2, Invalid_Array)
-{
-    if (noDoubleTests<float>()) return;
+TEST(ifft2, Invalid_Array) {
+    vector<float> in(100, 1);
 
-    vector<float>   in(100,1);
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
-
-    af::dim4 dims(100,1,1,1);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in.front()),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<cfloat>::af_type));
+    dim4 dims(100, 1, 1, 1);
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in.front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
     ASSERT_EQ(AF_ERR_SIZE, af_ifft2(&outArray, inArray, 0.01, 0, 0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TEST(ifft3, Invalid_Array)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<float>   in(100,1);
+TEST(ifft3, Invalid_Array) {
+    vector<float> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af::dim4 dims(10,10,1,1);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in.front()),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<cfloat>::af_type));
+    dim4 dims(10, 10, 1, 1);
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in.front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
     ASSERT_EQ(AF_ERR_SIZE, af_ifft3(&outArray, inArray, 0.01, 0, 0, 0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
 template<typename inType, typename outType, bool isInverse>
-void fftTest(string pTestFile, dim_t pad0=0, dim_t pad1=0, dim_t pad2=0)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void fftTest(string pTestFile, dim_t pad0 = 0, dim_t pad1 = 0, dim_t pad2 = 0) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>        numDims;
-    vector<vector<inType> >       in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTestsFromFile<inType, outType>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims       = numDims[0];
-    af_array outArray   = 0;
-    af_array inArray    = 0;
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<inType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
 
-    if (isInverse){
+    if (isInverse) {
         switch (dims.ndims()) {
-            case 1 : ASSERT_EQ(AF_SUCCESS, af_ifft (&outArray, inArray, 1.0, pad0));              break;
-            case 2 : ASSERT_EQ(AF_SUCCESS, af_ifft2(&outArray, inArray, 1.0, pad0, pad1));        break;
-            case 3 : ASSERT_EQ(AF_SUCCESS, af_ifft3(&outArray, inArray, 1.0, pad0, pad1, pad2));  break;
-            default: throw std::runtime_error("This error shouldn't happen, pls check");
+            case 1:
+                ASSERT_SUCCESS(af_ifft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_ifft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_ifft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
         }
     } else {
-        switch(dims.ndims()) {
-            case 1 : ASSERT_EQ(AF_SUCCESS, af_fft (&outArray, inArray, 1.0, pad0));               break;
-            case 2 : ASSERT_EQ(AF_SUCCESS, af_fft2(&outArray, inArray, 1.0, pad0, pad1));         break;
-            case 3 : ASSERT_EQ(AF_SUCCESS, af_fft3(&outArray, inArray, 1.0, pad0, pad1, pad2));   break;
-            default: throw std::runtime_error("This error shouldn't happen, pls check");
+        switch (dims.ndims()) {
+            case 1:
+                ASSERT_SUCCESS(af_fft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_fft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_fft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
         }
     }
 
-    size_t out_size = tests[0].size();
-    outType *outData= new outType[out_size];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    size_t out_size  = tests[0].size();
+    outType *outData = new outType[out_size];
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
     vector<outType> goldBar(tests[0].begin(), tests[0].end());
 
     size_t test_size = 0;
-    switch(dims.ndims()) {
-        case 1  : test_size = dims[0]/2+1;                       break;
-        case 2  : test_size = dims[1] * (dims[0]/2+1);           break;
-        case 3  : test_size = dims[2] * dims[1] * (dims[0]/2+1); break;
-        default : test_size = dims[0]/2+1;                       break;
+    switch (dims.ndims()) {
+        case 1: test_size = dims[0] / 2 + 1; break;
+        case 2: test_size = dims[1] * (dims[0] / 2 + 1); break;
+        case 3: test_size = dims[2] * dims[1] * (dims[0] / 2 + 1); break;
+        default: test_size = dims[0] / 2 + 1; break;
     }
     outType output_scale = (outType)(isInverse ? test_size : 1);
-    for (size_t elIter=0; elIter<test_size; ++elIter) {
-        bool isUnderTolerance = std::abs(goldBar[elIter]-outData[elIter])<0.001;
-        ASSERT_EQ(true, isUnderTolerance)<<
-            "Expected value="<<goldBar[elIter] <<"\t Actual Value="<<
-            (output_scale*outData[elIter]) << " at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < test_size; ++elIter) {
+        bool isUnderTolerance = abs(goldBar[elIter] - outData[elIter]) < 0.001;
+        ASSERT_EQ(true, isUnderTolerance)
+            << "Expected value=" << goldBar[elIter]
+            << "\t Actual Value=" << (output_scale * outData[elIter])
+            << " at: " << elIter << endl;
     }
 
     // cleanup
     delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-#define INSTANTIATE_TEST(func, name, is_inverse, in_t, out_t, ...)  \
-    TEST(func, name)                                                \
-    {                                                               \
-        fftTest<in_t, out_t, is_inverse>(__VA_ARGS__);              \
-    }
+#define INSTANTIATE_TEST(func, name, is_inverse, in_t, out_t, ...) \
+    TEST(func, name) { fftTest<in_t, out_t, is_inverse>(__VA_ARGS__); }
 
 // Real to complex transforms
-INSTANTIATE_TEST(fft ,  R2C_Float, false,  float,  cfloat, string(TEST_DIR"/signal/fft_r2c.test") );
-INSTANTIATE_TEST(fft , R2C_Double, false, double, cdouble, string(TEST_DIR"/signal/fft_r2c.test") );
-INSTANTIATE_TEST(fft2,  R2C_Float, false,  float,  cfloat, string(TEST_DIR"/signal/fft2_r2c.test"));
-INSTANTIATE_TEST(fft2, R2C_Double, false, double, cdouble, string(TEST_DIR"/signal/fft2_r2c.test"));
-INSTANTIATE_TEST(fft3,  R2C_Float, false,  float,  cfloat, string(TEST_DIR"/signal/fft3_r2c.test"));
-INSTANTIATE_TEST(fft3, R2C_Double, false, double, cdouble, string(TEST_DIR"/signal/fft3_r2c.test"));
+INSTANTIATE_TEST(fft, R2C_Float, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft_r2c.test"));
+INSTANTIATE_TEST(fft, R2C_Double, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft_r2c.test"));
+INSTANTIATE_TEST(fft2, R2C_Float, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft2_r2c.test"));
+INSTANTIATE_TEST(fft2, R2C_Double, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft2_r2c.test"));
+INSTANTIATE_TEST(fft3, R2C_Float, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft3_r2c.test"));
+INSTANTIATE_TEST(fft3, R2C_Double, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft3_r2c.test"));
 
 // complex to complex transforms
-INSTANTIATE_TEST(fft ,  C2C_Float, false,  cfloat,  cfloat, string(TEST_DIR"/signal/fft_c2c.test") );
-INSTANTIATE_TEST(fft , C2C_Double, false, cdouble, cdouble, string(TEST_DIR"/signal/fft_c2c.test") );
-INSTANTIATE_TEST(fft2,  C2C_Float, false,  cfloat,  cfloat, string(TEST_DIR"/signal/fft2_c2c.test"));
-INSTANTIATE_TEST(fft2, C2C_Double, false, cdouble, cdouble, string(TEST_DIR"/signal/fft2_c2c.test"));
-INSTANTIATE_TEST(fft3,  C2C_Float, false,  cfloat,  cfloat, string(TEST_DIR"/signal/fft3_c2c.test"));
-INSTANTIATE_TEST(fft3, C2C_Double, false, cdouble, cdouble, string(TEST_DIR"/signal/fft3_c2c.test"));
+INSTANTIATE_TEST(fft, C2C_Float, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft_c2c.test"));
+INSTANTIATE_TEST(fft, C2C_Double, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft_c2c.test"));
+INSTANTIATE_TEST(fft2, C2C_Float, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft2_c2c.test"));
+INSTANTIATE_TEST(fft2, C2C_Double, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft2_c2c.test"));
+INSTANTIATE_TEST(fft3, C2C_Float, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft3_c2c.test"));
+INSTANTIATE_TEST(fft3, C2C_Double, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft3_c2c.test"));
+
+// Factors 7, 11, 13
+INSTANTIATE_TEST(fft, R2C_Float_7_11_13, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+INSTANTIATE_TEST(fft, R2C_Double_7_11_13, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+INSTANTIATE_TEST(fft2, R2C_Float_7_11_13, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+INSTANTIATE_TEST(fft2, R2C_Double_7_11_13, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+INSTANTIATE_TEST(fft3, R2C_Float_7_11_13, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+INSTANTIATE_TEST(fft3, R2C_Double_7_11_13, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+
+INSTANTIATE_TEST(fft, C2C_Float_7_11_13, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+INSTANTIATE_TEST(fft, C2C_Double_7_11_13, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+INSTANTIATE_TEST(fft2, C2C_Float_7_11_13, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+INSTANTIATE_TEST(fft2, C2C_Double_7_11_13, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+INSTANTIATE_TEST(fft3, C2C_Float_7_11_13, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
+INSTANTIATE_TEST(fft3, C2C_Double_7_11_13, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
 
 // transforms on padded and truncated arrays
-INSTANTIATE_TEST(fft2,  R2C_Float_Trunc, false,  float,  cfloat, string(TEST_DIR"/signal/fft2_r2c_trunc.test"), 16, 16);
-INSTANTIATE_TEST(fft2, R2C_Double_Trunc, false, double, cdouble, string(TEST_DIR"/signal/fft2_r2c_trunc.test"), 16, 16);
+INSTANTIATE_TEST(fft2, R2C_Float_Trunc, false, float, cfloat,
+                 string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16, 16);
+INSTANTIATE_TEST(fft2, R2C_Double_Trunc, false, double, cdouble,
+                 string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16, 16);
 
-INSTANTIATE_TEST(fft2,  C2C_Float_Pad, false,  cfloat,  cfloat, string(TEST_DIR"/signal/fft2_c2c_pad.test"), 16, 16);
-INSTANTIATE_TEST(fft2, C2C_Double_Pad, false, cdouble, cdouble, string(TEST_DIR"/signal/fft2_c2c_pad.test"), 16, 16);
+INSTANTIATE_TEST(fft2, C2C_Float_Pad, false, cfloat, cfloat,
+                 string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16, 16);
+INSTANTIATE_TEST(fft2, C2C_Double_Pad, false, cdouble, cdouble,
+                 string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16, 16);
 
 // inverse transforms
 // complex to complex transforms
-INSTANTIATE_TEST(ifft ,  C2C_Float, true,  cfloat,  cfloat, string(TEST_DIR"/signal/ifft_c2c.test") );
-INSTANTIATE_TEST(ifft , C2C_Double, true, cdouble, cdouble, string(TEST_DIR"/signal/ifft_c2c.test") );
-INSTANTIATE_TEST(ifft2,  C2C_Float, true,  cfloat,  cfloat, string(TEST_DIR"/signal/ifft2_c2c.test"));
-INSTANTIATE_TEST(ifft2, C2C_Double, true, cdouble, cdouble, string(TEST_DIR"/signal/ifft2_c2c.test"));
-INSTANTIATE_TEST(ifft3,  C2C_Float, true,  cfloat,  cfloat, string(TEST_DIR"/signal/ifft3_c2c.test"));
-INSTANTIATE_TEST(ifft3, C2C_Double, true, cdouble, cdouble, string(TEST_DIR"/signal/ifft3_c2c.test"));
-
+INSTANTIATE_TEST(ifft, C2C_Float, true, cfloat, cfloat,
+                 string(TEST_DIR "/signal/ifft_c2c.test"));
+INSTANTIATE_TEST(ifft, C2C_Double, true, cdouble, cdouble,
+                 string(TEST_DIR "/signal/ifft_c2c.test"));
+INSTANTIATE_TEST(ifft2, C2C_Float, true, cfloat, cfloat,
+                 string(TEST_DIR "/signal/ifft2_c2c.test"));
+INSTANTIATE_TEST(ifft2, C2C_Double, true, cdouble, cdouble,
+                 string(TEST_DIR "/signal/ifft2_c2c.test"));
+INSTANTIATE_TEST(ifft3, C2C_Float, true, cfloat, cfloat,
+                 string(TEST_DIR "/signal/ifft3_c2c.test"));
+INSTANTIATE_TEST(ifft3, C2C_Double, true, cdouble, cdouble,
+                 string(TEST_DIR "/signal/ifft3_c2c.test"));
 
 template<typename inType, typename outType, int rank, bool isInverse>
-void fftBatchTest(string pTestFile, dim_t pad0=0, dim_t pad1=0, dim_t pad2=0)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void fftBatchTest(string pTestFile, dim_t pad0 = 0, dim_t pad1 = 0,
+                  dim_t pad2 = 0) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>        numDims;
-    vector<vector<inType> >       in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTestsFromFile<inType, outType>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims       = numDims[0];
-    af_array outArray   = 0;
-    af_array inArray    = 0;
-
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<inType>::af_type));
-
-    if(isInverse) {
-        switch(rank) {
-            case 1 : ASSERT_EQ(AF_SUCCESS, af_ifft (&outArray, inArray, 1.0, pad0));              break;
-            case 2 : ASSERT_EQ(AF_SUCCESS, af_ifft2(&outArray, inArray, 1.0, pad0, pad1));        break;
-            case 3 : ASSERT_EQ(AF_SUCCESS, af_ifft3(&outArray, inArray, 1.0, pad0, pad1, pad2));  break;
-            default: throw std::runtime_error("This error shouldn't happen, pls check");
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
+
+    if (isInverse) {
+        switch (rank) {
+            case 1:
+                ASSERT_SUCCESS(af_ifft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_ifft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_ifft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
         }
     } else {
-        switch(rank) {
-            case 1 : ASSERT_EQ(AF_SUCCESS, af_fft (&outArray, inArray, 1.0, pad0));               break;
-            case 2 : ASSERT_EQ(AF_SUCCESS, af_fft2(&outArray, inArray, 1.0, pad0, pad1));         break;
-            case 3 : ASSERT_EQ(AF_SUCCESS, af_fft3(&outArray, inArray, 1.0, pad0, pad1, pad2));   break;
-            default: throw std::runtime_error("This error shouldn't happen, pls check");
+        switch (rank) {
+            case 1:
+                ASSERT_SUCCESS(af_fft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_fft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_fft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
         }
     }
 
-    size_t out_size = tests[0].size();
-    outType *outData= new outType[out_size];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    size_t out_size  = tests[0].size();
+    outType *outData = new outType[out_size];
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
     vector<outType> goldBar(tests[0].begin(), tests[0].end());
 
-    size_t test_size = 0;
+    size_t test_size   = 0;
     size_t batch_count = dims[rank];
-    switch(rank) {
-        case 1  : test_size = dims[0]/2+1;                       break;
-        case 2  : test_size = dims[1] * (dims[0]/2+1);           break;
-        case 3  : test_size = dims[2] * dims[1] * (dims[0]/2+1); break;
-        default : test_size = dims[0]/2+1;                       break;
+    switch (rank) {
+        case 1: test_size = dims[0] / 2 + 1; break;
+        case 2: test_size = dims[1] * (dims[0] / 2 + 1); break;
+        case 3: test_size = dims[2] * dims[1] * (dims[0] / 2 + 1); break;
+        default: test_size = dims[0] / 2 + 1; break;
     }
 
     size_t batch_stride = 1;
-    for(int i=0; i<rank; ++i) batch_stride *= dims[i];
+    for (int i = 0; i < rank; ++i) batch_stride *= dims[i];
 
     outType output_scale = (outType)(isInverse ? test_size : 1);
-    for(size_t batchId=0; batchId<batch_count; ++batchId) {
-        size_t off = batchId*batch_stride;
-        for (size_t elIter=0; elIter<test_size; ++elIter) {
-            bool isUnderTolerance = std::abs(goldBar[elIter+off]-outData[elIter+off])<0.001;
-            ASSERT_EQ(true, isUnderTolerance)<<"Batch id = "<<batchId<<
-                "; Expected value="<<goldBar[elIter+off] <<"\t Actual Value="<<
-                (output_scale*outData[elIter+off]) << " at: " << elIter<< std::endl;
+    for (size_t batchId = 0; batchId < batch_count; ++batchId) {
+        size_t off = batchId * batch_stride;
+        for (size_t elIter = 0; elIter < test_size; ++elIter) {
+            bool isUnderTolerance =
+                abs(goldBar[elIter + off] - outData[elIter + off]) < 0.001;
+            ASSERT_EQ(true, isUnderTolerance)
+                << "Batch id = " << batchId
+                << "; Expected value=" << goldBar[elIter + off]
+                << "\t Actual Value=" << (output_scale * outData[elIter + off])
+                << " at: " << elIter << endl;
         }
     }
 
     // cleanup
     delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
 #define INSTANTIATE_BATCH_TEST(func, name, rank, is_inverse, in_t, out_t, ...) \
-    TEST(func, name##_Batch)                                                   \
-    {                                                                          \
+    TEST(func, name##_Batch) {                                                 \
         fftBatchTest<in_t, out_t, rank, is_inverse>(__VA_ARGS__);              \
     }
 
 // real to complex transforms
-INSTANTIATE_BATCH_TEST(fft , R2C_Float, 1, false, float, cfloat, string(TEST_DIR"/signal/fft_r2c_batch.test") );
-INSTANTIATE_BATCH_TEST(fft2, R2C_Float, 2, false, float, cfloat, string(TEST_DIR"/signal/fft2_r2c_batch.test"));
-INSTANTIATE_BATCH_TEST(fft3, R2C_Float, 3, false, float, cfloat, string(TEST_DIR"/signal/fft3_r2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft, R2C_Float, 1, false, float, cfloat,
+                       string(TEST_DIR "/signal/fft_r2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft2, R2C_Float, 2, false, float, cfloat,
+                       string(TEST_DIR "/signal/fft2_r2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft3, R2C_Float, 3, false, float, cfloat,
+                       string(TEST_DIR "/signal/fft3_r2c_batch.test"));
 
 // complex to complex transforms
-INSTANTIATE_BATCH_TEST(fft , C2C_Float, 1, false, cfloat, cfloat, string(TEST_DIR"/signal/fft_c2c_batch.test") );
-INSTANTIATE_BATCH_TEST(fft2, C2C_Float, 2, false, cfloat, cfloat, string(TEST_DIR"/signal/fft2_c2c_batch.test"));
-INSTANTIATE_BATCH_TEST(fft3, C2C_Float, 3, false, cfloat, cfloat, string(TEST_DIR"/signal/fft3_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft, C2C_Float, 1, false, cfloat, cfloat,
+                       string(TEST_DIR "/signal/fft_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft2, C2C_Float, 2, false, cfloat, cfloat,
+                       string(TEST_DIR "/signal/fft2_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(fft3, C2C_Float, 3, false, cfloat, cfloat,
+                       string(TEST_DIR "/signal/fft3_c2c_batch.test"));
 
 // inverse transforms
 // complex to complex transforms
-INSTANTIATE_BATCH_TEST(ifft , C2C_Float, 1, true, cfloat, cfloat, string(TEST_DIR"/signal/ifft_c2c_batch.test") );
-INSTANTIATE_BATCH_TEST(ifft2, C2C_Float, 2, true, cfloat, cfloat, string(TEST_DIR"/signal/ifft2_c2c_batch.test"));
-INSTANTIATE_BATCH_TEST(ifft3, C2C_Float, 3, true, cfloat, cfloat, string(TEST_DIR"/signal/ifft3_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(ifft, C2C_Float, 1, true, cfloat, cfloat,
+                       string(TEST_DIR "/signal/ifft_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(ifft2, C2C_Float, 2, true, cfloat, cfloat,
+                       string(TEST_DIR "/signal/ifft2_c2c_batch.test"));
+INSTANTIATE_BATCH_TEST(ifft3, C2C_Float, 3, true, cfloat, cfloat,
+                       string(TEST_DIR "/signal/ifft3_c2c_batch.test"));
 
 // transforms on padded and truncated arrays
-INSTANTIATE_BATCH_TEST(fft2,  R2C_Float_Trunc, 2, false,  float,  cfloat, string(TEST_DIR"/signal/fft2_r2c_trunc_batch.test"), 16, 16);
-INSTANTIATE_BATCH_TEST(fft2, R2C_Double_Trunc, 2, false, double, cdouble, string(TEST_DIR"/signal/fft2_r2c_trunc_batch.test"), 16, 16);
-INSTANTIATE_BATCH_TEST(fft2,  C2C_Float_Pad, 2, false,  cfloat,  cfloat, string(TEST_DIR"/signal/fft2_c2c_pad_batch.test"), 16, 16);
-INSTANTIATE_BATCH_TEST(fft2, C2C_Double_Pad, 2, false, cdouble, cdouble, string(TEST_DIR"/signal/fft2_c2c_pad_batch.test"), 16, 16);
-
+INSTANTIATE_BATCH_TEST(fft2, R2C_Float_Trunc, 2, false, float, cfloat,
+                       string(TEST_DIR "/signal/fft2_r2c_trunc_batch.test"), 16,
+                       16);
+INSTANTIATE_BATCH_TEST(fft2, R2C_Double_Trunc, 2, false, double, cdouble,
+                       string(TEST_DIR "/signal/fft2_r2c_trunc_batch.test"), 16,
+                       16);
+INSTANTIATE_BATCH_TEST(fft2, C2C_Float_Pad, 2, false, cfloat, cfloat,
+                       string(TEST_DIR "/signal/fft2_c2c_pad_batch.test"), 16,
+                       16);
+INSTANTIATE_BATCH_TEST(fft2, C2C_Double_Pad, 2, false, cdouble, cdouble,
+                       string(TEST_DIR "/signal/fft2_c2c_pad_batch.test"), 16,
+                       16);
 
 /////////////////////////////////////// CPP ////////////////////////////////////
 //
 template<typename inType, typename outType, bool isInverse>
-void cppFFTTest(string pTestFile, dim_t pad0=0, dim_t pad1=0, dim_t pad2=0)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void cppFFTTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>        numDims;
-    vector<vector<inType> >       in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTestsFromFile<inType, outType>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
-    af::array signal(dims, &(in[0].front()));
-    af::array output;
+    dim4 dims = numDims[0];
+    array signal(dims, &(in[0].front()));
+    array output;
 
-    if (isInverse){
+    if (isInverse) {
         output = ifft3Norm(signal, 1.0);
     } else {
         output = fft3Norm(signal, 1.0);
     }
 
     size_t out_size = tests[0].size();
-    cfloat *outData= new cfloat[out_size];
-    output.host((void*)outData);
+    cfloat *outData = new cfloat[out_size];
+    output.host((void *)outData);
 
     vector<cfloat> goldBar(tests[0].begin(), tests[0].end());
 
     size_t test_size = 0;
-    switch(dims.ndims()) {
-        case 1  : test_size = dims[0]/2+1;                       break;
-        case 2  : test_size = dims[1] * (dims[0]/2+1);           break;
-        case 3  : test_size = dims[2] * dims[1] * (dims[0]/2+1); break;
-        default : test_size = dims[0]/2+1;                       break;
+    switch (dims.ndims()) {
+        case 1: test_size = dims[0] / 2 + 1; break;
+        case 2: test_size = dims[1] * (dims[0] / 2 + 1); break;
+        case 3: test_size = dims[2] * dims[1] * (dims[0] / 2 + 1); break;
+        default: test_size = dims[0] / 2 + 1; break;
     }
     outType output_scale = (outType)(isInverse ? test_size : 1);
-    for (size_t elIter=0; elIter<test_size; ++elIter) {
-        bool isUnderTolerance = std::abs(goldBar[elIter]-outData[elIter])<0.001;
-        ASSERT_EQ(true, isUnderTolerance)<<
-            "Expected value="<<goldBar[elIter] <<"\t Actual Value="<<
-            (output_scale*outData[elIter]) << " at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < test_size; ++elIter) {
+        bool isUnderTolerance = abs(goldBar[elIter] - outData[elIter]) < 0.001;
+        ASSERT_EQ(true, isUnderTolerance)
+            << "Expected value=" << goldBar[elIter]
+            << "\t Actual Value=" << (output_scale * outData[elIter])
+            << " at: " << elIter << endl;
     }
     // cleanup
     delete[] outData;
 }
 
 template<typename inType, typename outType, bool isInverse>
-void cppDFTTest(string pTestFile, dim_t pad0=0, dim_t pad1=0, dim_t pad2=0)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void cppDFTTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>        numDims;
-    vector<vector<inType> >       in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
 
     readTestsFromFile<inType, outType>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
-    af::array signal(dims, &(in[0].front()));
-    af::array output;
+    dim4 dims = numDims[0];
+    array signal(dims, &(in[0].front()));
+    array output;
 
-    if (isInverse){
+    if (isInverse) {
         output = idft(signal);
     } else {
         output = dft(signal);
     }
 
     size_t out_size = tests[0].size();
-    cfloat *outData= new cfloat[out_size];
-    output.host((void*)outData);
+    cfloat *outData = new cfloat[out_size];
+    output.host((void *)outData);
 
     vector<cfloat> goldBar(tests[0].begin(), tests[0].end());
 
     size_t test_size = 0;
-    switch(dims.ndims()) {
-        case 1  : test_size = dims[0]/2+1;                       break;
-        case 2  : test_size = dims[1] * (dims[0]/2+1);           break;
-        case 3  : test_size = dims[2] * dims[1] * (dims[0]/2+1); break;
-        default : test_size = dims[0]/2+1;                       break;
+    switch (dims.ndims()) {
+        case 1: test_size = dims[0] / 2 + 1; break;
+        case 2: test_size = dims[1] * (dims[0] / 2 + 1); break;
+        case 3: test_size = dims[2] * dims[1] * (dims[0] / 2 + 1); break;
+        default: test_size = dims[0] / 2 + 1; break;
     }
     outType output_scale = (outType)(isInverse ? test_size : 1);
-    for (size_t elIter=0; elIter<test_size; ++elIter) {
-        bool isUnderTolerance = std::abs(goldBar[elIter]-outData[elIter])<0.001;
-        ASSERT_EQ(true, isUnderTolerance)<<
-            "Expected value="<<goldBar[elIter] <<"\t Actual Value="<<
-            (output_scale*outData[elIter]) << " at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < test_size; ++elIter) {
+        bool isUnderTolerance = abs(goldBar[elIter] - outData[elIter]) < 0.001;
+        ASSERT_EQ(true, isUnderTolerance)
+            << "Expected value=" << goldBar[elIter]
+            << "\t Actual Value=" << (output_scale * outData[elIter])
+            << " at: " << elIter << endl;
     }
     // cleanup
     delete[] outData;
 }
 
-
-TEST(fft3, CPP)
-{
-    cppFFTTest<cfloat, cfloat, false>(string(TEST_DIR"/signal/fft3_c2c.test"));
+TEST(fft3, CPP) {
+    cppFFTTest<cfloat, cfloat, false>(string(TEST_DIR "/signal/fft3_c2c.test"));
 }
 
-TEST(ifft3, CPP)
-{
-    cppFFTTest<cfloat, cfloat, true>(string(TEST_DIR"/signal/ifft3_c2c.test"));
+TEST(ifft3, CPP) {
+    cppFFTTest<cfloat, cfloat, true>(string(TEST_DIR "/signal/ifft3_c2c.test"));
 }
 
-TEST(fft3, RandomData)
-{
-    af::array a = af::randu(31, 31, 31);
-    af::array b = af::fft3(a, 64, 64, 64);
-    af::array c = af::ifft3(b);
-
-    af::dim4 aDims = a.dims();
-    af::dim4 cDims = c.dims();
-    af::dim4 aStrides(1, aDims[0], aDims[0]*aDims[1], aDims[0]*aDims[1]*aDims[2]);
-    af::dim4 cStrides(1, cDims[0], cDims[0]*cDims[1], cDims[0]*cDims[1]*cDims[2]);
-
-    float* gold = new float[a.elements()];
-    float* out  = new float[2*c.elements()];
-
-    a.host((void*)gold);
-    c.host((void*)out);
-
-    for (int k=0; k<(int)aDims[2]; ++k) {
-        int gkOff = k*aStrides[2];
-        int okOff = k*cStrides[2];
-        for (int j=0; j<(int)aDims[1]; ++j) {
-            int gjOff = j*aStrides[1];
-            int ojOff = j*cStrides[1];
-            for (int i=0; i<(int)aDims[0]; ++i) {
-                int giOff = i*aStrides[0];
-                int oiOff = i*cStrides[0];
+TEST(fft3, RandomData) {
+    array a = randu(31, 31, 31);
+    array b = fft3(a, 64, 64, 64);
+    array c = ifft3(b);
+
+    dim4 aDims = a.dims();
+    dim4 cDims = c.dims();
+    dim4 aStrides(1, aDims[0], aDims[0] * aDims[1],
+                  aDims[0] * aDims[1] * aDims[2]);
+    dim4 cStrides(1, cDims[0], cDims[0] * cDims[1],
+                  cDims[0] * cDims[1] * cDims[2]);
+
+    float *gold = new float[a.elements()];
+    float *out  = new float[2 * c.elements()];
+
+    a.host((void *)gold);
+    c.host((void *)out);
+
+    for (int k = 0; k < (int)aDims[2]; ++k) {
+        int gkOff = k * aStrides[2];
+        int okOff = k * cStrides[2];
+        for (int j = 0; j < (int)aDims[1]; ++j) {
+            int gjOff = j * aStrides[1];
+            int ojOff = j * cStrides[1];
+            for (int i = 0; i < (int)aDims[0]; ++i) {
+                int giOff = i * aStrides[0];
+                int oiOff = i * cStrides[0];
 
                 int gi = gkOff + gjOff + giOff;
                 int oi = okOff + ojOff + oiOff;
 
-                bool isUnderTolerance = std::abs(gold[gi]-out[2*oi])<0.001;
-                ASSERT_EQ(true, isUnderTolerance)<< "Expected value="<<
-                    gold[gi] <<"\t Actual Value="<< out[2*oi] << " at: " <<gi<< std::endl;
+                bool isUnderTolerance =
+                    std::abs(gold[gi] - out[2 * oi]) < 0.001;
+                ASSERT_EQ(true, isUnderTolerance)
+                    << "Expected value=" << gold[gi]
+                    << "\t Actual Value=" << out[2 * oi] << " at: " << gi
+                    << endl;
             }
         }
     }
@@ -450,133 +569,350 @@ TEST(fft3, RandomData)
     delete[] out;
 }
 
-TEST(dft, CPP)
-{
-    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR"/signal/fft_c2c.test"));
+TEST(dft, CPP) {
+    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR "/signal/fft_c2c.test"));
 }
 
-TEST(idft, CPP)
-{
-    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR"/signal/ifft_c2c.test"));
+TEST(idft, CPP) {
+    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR "/signal/ifft_c2c.test"));
 }
 
-TEST(dft2, CPP)
-{
-    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR"/signal/fft2_c2c.test"));
+TEST(dft2, CPP) {
+    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR "/signal/fft2_c2c.test"));
 }
 
-TEST(idft2, CPP)
-{
-    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR"/signal/ifft2_c2c.test"));
+TEST(idft2, CPP) {
+    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR "/signal/ifft2_c2c.test"));
 }
 
-TEST(dft3, CPP)
-{
-    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR"/signal/fft3_c2c.test"));
+TEST(dft3, CPP) {
+    cppDFTTest<cfloat, cfloat, false>(string(TEST_DIR "/signal/fft3_c2c.test"));
 }
 
-TEST(idft3, CPP)
-{
-    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR"/signal/ifft3_c2c.test"));
+TEST(idft3, CPP) {
+    cppDFTTest<cfloat, cfloat, true>(string(TEST_DIR "/signal/ifft3_c2c.test"));
 }
 
-TEST(fft, CPP_4D)
-{
-    af::array a = af::randu(1024, 1024);
-    af::array b = af::fft(a);
+TEST(fft, CPP_4D) {
+    array a = randu(1024, 1024);
+    array b = fft(a);
 
-    af::array A = af::moddims(a, 1024, 32, 16, 2);
-    af::array B = af::fft(A);
+    array A = moddims(a, 1024, 32, 16, 2);
+    array B = fft(A);
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_B = B.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_B = B.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_B;
+    freeHost(h_b);
+    freeHost(h_B);
 }
 
-TEST(ifft, CPP_4D)
-{
-    af::array a = af::randu(1024, 1024, c32);
-    af::array b = af::ifft(a);
+TEST(ifft, CPP_4D) {
+    array a = randu(1024, 1024, c32);
+    array b = ifft(a);
 
-    af::array A = af::moddims(a, 1024, 32, 16, 2);
-    af::array B = af::ifft(A);
+    array A = moddims(a, 1024, 32, 16, 2);
+    array B = ifft(A);
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_B = B.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_B = B.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_B;
+    freeHost(h_b);
+    freeHost(h_B);
 }
 
-TEST(fft, GFOR)
-{
-    af::array a = af::randu(1024, 1024);
-    af::array b = af::constant(0, 1024, 1024, c32);
-    af::array c = af::fft(a);
+TEST(fft, GFOR) {
+    array a = randu(1024, 1024);
+    array b = constant(0, 1024, 1024, c32);
+    array c = fft(a);
 
-    gfor(af::seq ii, a.dims(1)) {
-        b(af::span, ii) = af::fft(a(af::span, ii));
-    }
+    gfor(seq ii, a.dims(1)) { b(span, ii) = fft(a(span, ii)); }
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_c = c.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_c = c.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(fft2, GFOR)
-{
-    af::array a = af::randu(1024, 1024, 4);
-    af::array b = af::constant(0, 1024, 1024, 4, c32);
-    af::array c = af::fft2(a);
+TEST(fft2, GFOR) {
+    array a = randu(1024, 1024, 4);
+    array b = constant(0, 1024, 1024, 4, c32);
+    array c = fft2(a);
 
-    gfor(af::seq ii, a.dims(2)) {
-        b(af::span, af::span, ii) = af::fft2(a(af::span, af::span, ii));
-    }
+    gfor(seq ii, a.dims(2)) { b(span, span, ii) = fft2(a(span, span, ii)); }
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_c = c.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_c = c.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(fft3, GFOR)
-{
-    af::array a = af::randu(32, 32, 32, 4);
-    af::array b = af::constant(0, 32, 32, 32, 4, c32);
-    af::array c = af::fft3(a);
+TEST(fft3, GFOR) {
+    array a = randu(32, 32, 32, 4);
+    array b = constant(0, 32, 32, 32, 4, c32);
+    array c = fft3(a);
 
-    gfor(af::seq ii, a.dims(3)) {
-        b(af::span, af::span, af::span, ii) = af::fft3(a(af::span, af::span, af::span, ii));
+    gfor(seq ii, a.dims(3)) {
+        b(span, span, span, ii) = fft3(a(span, span, span, ii));
     }
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_c = c.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_c = c.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_c[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_b);
+    freeHost(h_c);
+}
+
+void fft2InPlaceFunc() {
+    array a = randu(1024, 1024, c32);
+    array b = fft2(a);
+    fft2InPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+using af::getDevice;
+using af::getDeviceCount;
+using af::setDevice;
+
+#define DEVICE_ITERATE(func)                             \
+    do {                                                 \
+        const char *ENV = getenv("AF_MULTI_GPU_TESTS");  \
+        if (ENV && ENV[0] == '0') {                      \
+            func;                                        \
+        } else {                                         \
+            int oldDevice = getDevice();                 \
+            for (int i = 0; i < getDeviceCount(); i++) { \
+                setDevice(i);                            \
+                func;                                    \
+            }                                            \
+            setDevice(oldDevice);                        \
+        }                                                \
+    } while (0)
+
+TEST(FFT2, MultiGPUInPlaceSquare_CPP) { DEVICE_ITERATE((fft2InPlaceFunc())); }
+
+struct fft_params {
+    dim4 input_dims_;
+    bool is_odd_;
+    double norm_factor_;
+    fft_params(dim4 dim, bool is_odd, double norm_factor)
+        : input_dims_(dim), is_odd_(is_odd), norm_factor_(norm_factor) {}
+};
+
+class FFTBase : public ::testing::TestWithParam<fft_params> {};
+
+class FFTC2R2D : public FFTBase {};
+class FFT2D : public FFTBase {};
+class FFTC2R3D : public FFTBase {};
+class FFT3D : public FFTBase {};
+class FFTC2R : public FFTBase {};
+class FFTND : public FFTBase {};
+
+string to_test_params(const ::testing::TestParamInfo<FFTBase::ParamType> info) {
+    stringstream ss;
+    ss << "d0_" << info.param.input_dims_[0] << "_d1_"
+       << info.param.input_dims_[1] << "_d2_" << info.param.input_dims_[2]
+       << "_d3_" << info.param.input_dims_[3] << "_"
+       << ((info.param.is_odd_) ? string("odd") : string("even")) << "_norm_"
+       << info.param.norm_factor_;
+    string out = ss.str();
+    return out.replace(out.find("."), 1, "_");
+}
+
+// INSTANTIATE_TEST_SUITE_P(
+//     Inputs2D, FFTC2R2D,
+//     ::testing::Values(fft_params(dim4(513, 512), false, 0.5),
+//                       fft_params(dim4(1025, 1024), false, 0.5),
+//                       fft_params(dim4(2049, 2048), false, 0.5)),
+//     to_test_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    Inputs2D, FFT2D,
+    ::testing::Values(fft_params(dim4(512, 512), false, 0.5),
+                      fft_params(dim4(1024, 1024), false, 0.5),
+                      fft_params(dim4(2048, 2048), false, 0.5)),
+    to_test_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    Inputs3D, FFTC2R3D,
+    ::testing::Values(fft_params(dim4(512, 512, 3), false, 0.5),
+                      fft_params(dim4(1024, 1024, 3), false, 0.5),
+                      fft_params(dim4(2048, 2048, 3), false, 0.5)),
+    to_test_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    Inputs3D, FFT3D,
+    ::testing::Values(fft_params(dim4(1024, 1024, 3), true, 0.5),
+                      fft_params(dim4(1024, 1024, 3), false, 0.5)),
+    to_test_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    InputsND, FFTND,
+    ::testing::Values(fft_params(dim4(512), false, 0.5),
+                      fft_params(dim4(1024), false, 0.5),
+                      fft_params(dim4(1024, 1024), false, 0.5),
+                      fft_params(dim4(1024, 1024, 3), false, 0.5)),
+    to_test_params);
+
+INSTANTIATE_TEST_SUITE_P(
+    InputsND, FFTC2R,
+    ::testing::Values(fft_params(dim4(513), false, 0.5),
+                      fft_params(dim4(1025), false, 0.5),
+                      fft_params(dim4(1025, 1024), false, 0.5),
+                      fft_params(dim4(1025, 1024, 3), false, 0.5)),
+    to_test_params);
+
+// Does not work well with CUDA 10.1
+// TEST_P(FFTC2R2D, Complex32ToRealInputsPreserved) {
+//     fft_params params = GetParam();
+//     af::array a       = af::randu(params.input_dims_, c32);
+//     af::array a_copy  = a.copy();
+//     af::array out     = af::fftC2R<2>(a, params.is_odd_,
+//     params.norm_factor_);
+//
+//     ASSERT_ARRAYS_EQ(a_copy, a);
+// }
+//
+// TEST_P(FFTC2R2D, Complex64ToRealInputsPreserved) {
+//     fft_params params = GetParam();
+//     af::array a       = af::randu(params.input_dims_, c64);
+//     af::array a_copy  = a.copy();
+//     af::array out     = af::fftC2R<2>(a, params.is_odd_,
+//     params.norm_factor_);
+//
+//     ASSERT_ARRAYS_EQ(a_copy, a);
+// }
+
+TEST_P(FFT2D, Real32ToComplexInputsPreserved) {
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, f32);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftR2C<2>(a, a.dims(), params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFT2D, Real64ToComplexInputsPreserved) {
+    SUPPORTED_TYPE_CHECK(double);
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, f64);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftR2C<2>(a, a.dims(), params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFTC2R, Complex32ToRInputsPreserved) {
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, c32);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftC2R<1>(a, params.is_odd_, params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFTC2R, Complex64ToRInputsPreserved) {
+    SUPPORTED_TYPE_CHECK(double);
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, c64);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftC2R<1>(a, params.is_odd_, params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFTND, Real32ToComplexInputsPreserved) {
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, f32);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftR2C<1>(a, a.dims(), params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFTND, Real64ToComplexInputsPreserved) {
+    SUPPORTED_TYPE_CHECK(double);
+    fft_params params = GetParam();
+    af::array a       = af::randu(params.input_dims_, f64);
+    af::array a_copy  = a.copy();
+    af::array out     = af::fftR2C<1>(a, a.dims(), params.norm_factor_);
+
+    ASSERT_ARRAYS_EQ(a_copy, a);
+}
+
+TEST_P(FFTND, InPlaceFFTMatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = fft(a);
+    fftInPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST_P(FFTND, InPlaceIFFTMatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = ifft(a);
+    ifftInPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST_P(FFT2D, InPlaceFFT2MatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = fft2(a);
+    fft2InPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST_P(FFT2D, InPlaceIFFT2MatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = ifft2(a);
+    ifft2InPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST_P(FFT3D, InPlaceFFT3MatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = fft3(a);
+    fft3InPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST_P(FFTC2R3D, InPlaceIFFT3MatchesOutOfPlace) {
+    fft_params params = GetParam();
+    array a           = randu(params.input_dims_, c32);
+    array b           = ifft3(a);
+    ifft3InPlace(a);
+
+    ASSERT_ARRAYS_EQ(a, b);
 }
diff --git a/test/fft_large.cpp b/test/fft_large.cpp
index a3c331ce5a..137d55b32a 100644
--- a/test/fft_large.cpp
+++ b/test/fft_large.cpp
@@ -7,52 +7,57 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <stdexcept>
 #include <string>
 #include <vector>
-#include <stdexcept>
-#include <testHelpers.hpp>
 
+using af::array;
+using af::cfloat;
+using af::fft2;
+using af::ifft2;
+using af::moddims;
+using af::randu;
+using std::endl;
 using std::string;
 using std::vector;
 
-TEST(fft2, CPP_4D)
-{
-    af::array a = af::randu(1024, 1024, 32);
-    af::array b = af::fft2(a);
+TEST(fft2, CPP_4D) {
+    array a = randu(1024, 1024, 32);
+    array b = fft2(a);
 
-    af::array A = af::moddims(a, 1024, 1024, 4, 8);
-    af::array B = af::fft2(A);
+    array A = moddims(a, 1024, 1024, 4, 8);
+    array B = fft2(A);
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_B = B.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_B = B.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_B;
+    af_free_host(h_b);
+    af_free_host(h_B);
 }
 
-TEST(ifft2, CPP_4D)
-{
-    af::array a = af::randu(1024, 1024, 32, c32);
-    af::array b = af::ifft2(a);
+TEST(ifft2, CPP_4D) {
+    array a = randu(1024, 1024, 32, c32);
+    array b = ifft2(a);
 
-    af::array A = af::moddims(a, 1024, 1024, 4, 8);
-    af::array B = af::ifft2(A);
+    array A = moddims(a, 1024, 1024, 4, 8);
+    array B = ifft2(A);
 
-    af::cfloat *h_b = b.host<af::cfloat>();
-    af::cfloat *h_B = B.host<af::cfloat>();
+    cfloat *h_b = b.host<cfloat>();
+    cfloat *h_B = B.host<cfloat>();
 
     for (int i = 0; i < (int)a.elements(); i++) {
-        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << std::endl;
+        ASSERT_EQ(h_b[i], h_B[i]) << "at: " << i << endl;
     }
 
-    delete[] h_b;
-    delete[] h_B;
+    af_free_host(h_b);
+    af_free_host(h_B);
 }
diff --git a/test/fft_real.cpp b/test/fft_real.cpp
new file mode 100644
index 0000000000..863f66d74c
--- /dev/null
+++ b/test/fft_real.cpp
@@ -0,0 +1,107 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <stdexcept>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::fft;
+using af::fft2Norm;
+using af::fft3Norm;
+using af::fftC2R;
+using af::fftNorm;
+using af::fftR2C;
+using af::randu;
+using std::abs;
+using std::string;
+using std::vector;
+
+template<typename T>
+class FFT_REAL : public ::testing::Test {};
+
+typedef ::testing::Types<cfloat, cdouble> TestTypes;
+TYPED_TEST_SUITE(FFT_REAL, TestTypes);
+
+template<int rank>
+array fft(const array &in, double norm) {
+    switch (rank) {
+        case 1: return fftNorm(in, norm);
+        case 2: return fft2Norm(in, norm);
+        case 3: return fft3Norm(in, norm);
+        default: return in;
+    }
+}
+
+#define MY_ASSERT_NEAR(aa, bb, cc) ASSERT_NEAR(abs(aa), abs(bb), (cc))
+
+template<typename Tc, int rank>
+void fft_real(dim4 dims) {
+    typedef typename dtype_traits<Tc>::base_type Tr;
+    SUPPORTED_TYPE_CHECK(Tr);
+
+    dtype ty = (dtype)dtype_traits<Tr>::af_type;
+    array a  = randu(dims, ty);
+
+    bool is_odd = dims[0] & 1;
+
+    int dim0 = dims[0] / 2 + 1;
+
+    double norm = 1;
+    for (int i = 0; i < rank; i++) norm *= dims[i];
+    norm = 1 / norm;
+
+    array as = fftR2C<rank>(a, norm);
+    array af = fft<rank>(a, norm);
+
+    vector<Tc> has(as.elements());
+    vector<Tc> haf(af.elements());
+
+    as.host(&has[0]);
+    af.host(&haf[0]);
+
+    for (int j = 0; j < a.elements() / dims[0]; j++) {
+        for (int i = 0; i < dim0; i++) {
+            MY_ASSERT_NEAR(haf[j * dims[0] + i], has[j * dim0 + i], 1E-2)
+                << "at " << j * dims[0] + i;
+        }
+    }
+
+    array b = fftC2R<rank>(as, is_odd, 1);
+
+    vector<Tr> ha(a.elements());
+    vector<Tr> hb(a.elements());
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+
+    for (int j = 0; j < a.elements(); j++) { ASSERT_NEAR(ha[j], hb[j], 1E-2); }
+}
+
+TYPED_TEST(FFT_REAL, Even1D) { fft_real<TypeParam, 1>(dim4(1024, 256)); }
+
+TYPED_TEST(FFT_REAL, Odd1D) { fft_real<TypeParam, 1>(dim4(625, 256)); }
+
+TYPED_TEST(FFT_REAL, Even2D) { fft_real<TypeParam, 2>(dim4(1024, 256)); }
+
+TYPED_TEST(FFT_REAL, Odd2D) { fft_real<TypeParam, 2>(dim4(625, 256)); }
+
+TYPED_TEST(FFT_REAL, Even3D) { fft_real<TypeParam, 3>(dim4(32, 32, 32)); }
+
+TYPED_TEST(FFT_REAL, Odd3D) { fft_real<TypeParam, 3>(dim4(25, 32, 32)); }
diff --git a/test/fftconvolve.cpp b/test/fftconvolve.cpp
index 4838f1a22d..a8f63e2f45 100644
--- a/test/fftconvolve.cpp
+++ b/test/fftconvolve.cpp
@@ -7,55 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::vector;
-using std::string;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::randu;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class FFTConvolve : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class FFTConvolve : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 template<typename T>
-class FFTConvolveLarge : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class FFTConvolveLarge : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<cfloat, cdouble, float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<cfloat, cdouble, float, double, int, uint, char, schar,
+                         uchar, intl, uintl>
+    TestTypes;
 typedef ::testing::Types<float, double> TestTypesLarge;
 
 // register the type list
-TYPED_TEST_CASE(FFTConvolve, TestTypes);
-TYPED_TEST_CASE(FFTConvolveLarge, TestTypesLarge);
-
-static double get_real(double val) { return val; }
-static double get_real(cfloat val) { return std::real(val); }
-static double get_real(cdouble val) { return std::real(val); }
+TYPED_TEST_SUITE(FFTConvolve, TestTypes);
+TYPED_TEST_SUITE(FFTConvolveLarge, TestTypesLarge);
 
 template<typename T, int baseDim>
-void fftconvolveTest(string pTestFile, bool expand)
-{
-    if (noDoubleTests<T>()) return;
-
-    using af::dim4;
+void fftconvolveTest(string pTestFile, bool expand) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<dim4>      numDims;
-    vector<vector<T> >      in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
 
     readTests<T, T, int>(pTestFile, numDims, in, tests);
 
@@ -64,52 +63,53 @@ void fftconvolveTest(string pTestFile, bool expand)
     af_array signal   = 0;
     af_array filter   = 0;
     af_array outArray = 0;
-    af_dtype in_type =(af_dtype)af::dtype_traits<T>::af_type;
+    af_dtype in_type  = (af_dtype)dtype_traits<T>::af_type;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&signal, &(in[0].front()),
-                                          sDims.ndims(), sDims.get(), in_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&filter, &(in[1].front()),
-                                          fDims.ndims(), fDims.get(), in_type));
+    ASSERT_SUCCESS(af_create_array(&signal, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(), in_type));
+    ASSERT_SUCCESS(af_create_array(&filter, &(in[1].front()), fDims.ndims(),
+                                   fDims.get(), in_type));
 
     af_conv_mode mode = expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT;
-    switch(baseDim) {
-        case 1: ASSERT_EQ(AF_SUCCESS, af_fft_convolve1(&outArray, signal, filter, mode)); break;
-        case 2: ASSERT_EQ(AF_SUCCESS, af_fft_convolve2(&outArray, signal, filter, mode)); break;
-        case 3: ASSERT_EQ(AF_SUCCESS, af_fft_convolve3(&outArray, signal, filter, mode)); break;
+    switch (baseDim) {
+        case 1:
+            ASSERT_SUCCESS(af_fft_convolve1(&outArray, signal, filter, mode));
+            break;
+        case 2:
+            ASSERT_SUCCESS(af_fft_convolve2(&outArray, signal, filter, mode));
+            break;
+        case 3:
+            ASSERT_SUCCESS(af_fft_convolve3(&outArray, signal, filter, mode));
+            break;
     }
 
     vector<T> currGoldBar = tests[0];
     size_t nElems         = currGoldBar.size();
 
     dim_t out_elems = 0;
-    ASSERT_EQ(AF_SUCCESS, af_get_elements(&out_elems, outArray));
+    ASSERT_SUCCESS(af_get_elements(&out_elems, outArray));
     ASSERT_EQ(nElems, (size_t)out_elems);
 
-    T *outData            = new T[nElems];
+    vector<T> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)&outData.front(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(
-            get_real(currGoldBar[elIter]),
-            get_real(outData[elIter])
-            , 1e-2)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(real(currGoldBar[elIter]), real(outData[elIter]), 1e-2)
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(signal));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(filter));
+    ASSERT_SUCCESS(af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(signal));
+    ASSERT_SUCCESS(af_release_array(filter));
 }
 
 template<typename T, int baseDim>
-void fftconvolveTestLarge(int sDim, int fDim, int sBatch, int fBatch, bool expand)
-{
-    if (noDoubleTests<T>()) return;
+void fftconvolveTestLarge(int sDim, int fDim, int sBatch, int fBatch,
+                          bool expand) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    using af::dim4;
     using af::seq;
-    using af::array;
 
     int outDim = sDim + fDim - 1;
     int fftDim = (int)pow(2, ceil(log2(outDim)));
@@ -119,12 +119,10 @@ void fftconvolveTestLarge(int sDim, int fDim, int sBatch, int fBatch, bool expan
         if (k < baseDim) {
             sd[k] = sDim;
             fd[k] = fDim;
-        }
-        else if (k == baseDim) {
+        } else if (k == baseDim) {
             sd[k] = sBatch;
             fd[k] = fBatch;
-        }
-        else {
+        } else {
             sd[k] = 1;
             fd[k] = 1;
         }
@@ -133,402 +131,359 @@ void fftconvolveTestLarge(int sDim, int fDim, int sBatch, int fBatch, bool expan
     const dim4 signalDims(sd[0], sd[1], sd[2], sd[3]);
     const dim4 filterDims(fd[0], fd[1], fd[2], fd[3]);
 
-    array signal = randu(signalDims, (af_dtype) af::dtype_traits<T>::af_type);
-    array filter = randu(filterDims, (af_dtype) af::dtype_traits<T>::af_type);
+    array signal = randu(signalDims, (af_dtype)dtype_traits<T>::af_type);
+    array filter = randu(filterDims, (af_dtype)dtype_traits<T>::af_type);
 
-    array out = fftConvolve(signal, filter, expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT);
+    array out =
+        fftConvolve(signal, filter, expand ? AF_CONV_EXPAND : AF_CONV_DEFAULT);
 
     array gold;
-    switch(baseDim) {
-    case 1:
-        gold = real(af::ifft(af::fft(signal, fftDim) * af::fft(filter, fftDim)));
-        break;
-    case 2:
-        gold = real(af::ifft2(af::fft2(signal, fftDim, fftDim) * af::fft2(filter, fftDim, fftDim)));
-        break;
-    case 3:
-        gold = real(af::ifft3(af::fft3(signal, fftDim, fftDim, fftDim) * af::fft3(filter, fftDim, fftDim, fftDim)));
-        break;
-    default:
-        ASSERT_LT(baseDim, 4);
+    switch (baseDim) {
+        case 1:
+            gold = real(ifft(fft(signal, fftDim) * fft(filter, fftDim)));
+            break;
+        case 2:
+            gold = real(ifft2(fft2(signal, fftDim, fftDim) *
+                              fft2(filter, fftDim, fftDim)));
+            break;
+        case 3:
+            gold = real(ifft3(fft3(signal, fftDim, fftDim, fftDim) *
+                              fft3(filter, fftDim, fftDim, fftDim)));
+            break;
+        default: ASSERT_LT(baseDim, 4);
     }
 
     int cropMin = 0, cropMax = 0;
     if (expand) {
         cropMin = 0;
         cropMax = outDim - 1;
-    }
-    else {
-        cropMin = fDim/2;
-        cropMax = outDim - fDim/2 - 1;
+    } else {
+        cropMin = fDim / 2;
+        cropMax = outDim - fDim / 2 - 1;
     }
 
-    switch(baseDim) {
-    case 1:
-        gold = gold(seq(cropMin, cropMax));
-        break;
-    case 2:
-        gold = gold(seq(cropMin, cropMax), seq(cropMin, cropMax));
-        break;
-    case 3:
-        gold = gold(seq(cropMin, cropMax), seq(cropMin, cropMax), seq(cropMin, cropMax));
-        break;
+    switch (baseDim) {
+        case 1: gold = gold(seq(cropMin, cropMax)); break;
+        case 2:
+            gold = gold(seq(cropMin, cropMax), seq(cropMin, cropMax));
+            break;
+        case 3:
+            gold = gold(seq(cropMin, cropMax), seq(cropMin, cropMax),
+                        seq(cropMin, cropMax));
+            break;
     }
 
-    size_t outElems  = out.elements();
-    size_t goldElems = gold.elements();
-
-    ASSERT_EQ(goldElems, outElems);
-
-    T *goldData = new T[goldElems];
-    gold.host(goldData);
-
-    T *outData = new T[outElems];
-    out.host(outData);
-
-    for (size_t elIter=0; elIter<outElems; ++elIter) {
-        ASSERT_NEAR(goldData[elIter], outData[elIter], 5e-2) << "at: " << elIter << std::endl;
-    }
-
-    delete[] goldData;
-    delete[] outData;
+    ASSERT_ARRAYS_NEAR(gold, out, 5e-2);
 }
 
-TYPED_TEST(FFTConvolveLarge, VectorLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, VectorLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 1>(32768, 25, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, VectorLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, VectorLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 1>(32768, 4095, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameVectorLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameVectorLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 1>(32768, 25, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameVectorLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameVectorLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 1>(32768, 4095, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolveLarge, RectangleLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, RectangleLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 2>(1024, 5, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, RectangleLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, RectangleLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 2>(1024, 511, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameRectangleLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameRectangleLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 2>(1024, 5, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameRectangleLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameRectangleLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 2>(1024, 511, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolveLarge, CuboidLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, CuboidLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 3>(64, 5, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, CuboidLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, CuboidLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 3>(64, 31, 1, 1, true);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameCuboidLargeSignalSmallFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameCuboidLargeSignalSmallFilter) {
     fftconvolveTestLarge<TypeParam, 3>(64, 5, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolveLarge, SameCuboidLargeSignalLargeFilter)
-{
+TYPED_TEST(FFTConvolveLarge, SameCuboidLargeSignalLargeFilter) {
     fftconvolveTestLarge<TypeParam, 2>(64, 31, 1, 1, false);
 }
 
-TYPED_TEST(FFTConvolve, Vector)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector.test"), true);
+TYPED_TEST(FFTConvolve, Vector) {
+    fftconvolveTest<TypeParam, 1>(string(TEST_DIR "/convolve/vector.test"),
+                                  true);
 }
 
-TYPED_TEST(FFTConvolve, Rectangle)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle.test"), true);
+TYPED_TEST(FFTConvolve, Rectangle) {
+    fftconvolveTest<TypeParam, 2>(string(TEST_DIR "/convolve/rectangle.test"),
+                                  true);
 }
 
-TYPED_TEST(FFTConvolve, Cuboid)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid.test"), true);
+TYPED_TEST(FFTConvolve, Cuboid) {
+    fftconvolveTest<TypeParam, 3>(string(TEST_DIR "/convolve/cuboid.test"),
+                                  true);
 }
 
-TYPED_TEST(FFTConvolve, Vector_Many2One)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_many2one.test"), true);
+TYPED_TEST(FFTConvolve, Vector_Many2One) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_many2one.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Rectangle_Many2One)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_many2one.test"), true);
+TYPED_TEST(FFTConvolve, Rectangle_Many2One) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_many2one.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Cuboid_Many2One)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_many2one.test"), true);
+TYPED_TEST(FFTConvolve, Cuboid_Many2One) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_many2one.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Vector_Many2Many)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_many2many.test"), true);
+TYPED_TEST(FFTConvolve, Vector_Many2Many) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_many2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Rectangle_Many2Many)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_many2many.test"), true);
+TYPED_TEST(FFTConvolve, Rectangle_Many2Many) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_many2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Cuboid_Many2Many)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_many2many.test"), true);
+TYPED_TEST(FFTConvolve, Cuboid_Many2Many) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_many2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Vector_One2Many)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_one2many.test"), true);
+TYPED_TEST(FFTConvolve, Vector_One2Many) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_one2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Rectangle_One2Many)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_one2many.test"), true);
+TYPED_TEST(FFTConvolve, Rectangle_One2Many) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_one2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Cuboid_One2Many)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_one2many.test"), true);
+TYPED_TEST(FFTConvolve, Cuboid_One2Many) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_one2many.test"), true);
 }
 
-TYPED_TEST(FFTConvolve, Same_Vector)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_same.test"), false);
+TYPED_TEST(FFTConvolve, Same_Vector) {
+    fftconvolveTest<TypeParam, 1>(string(TEST_DIR "/convolve/vector_same.test"),
+                                  false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Rectangle)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_same.test"), false);
+TYPED_TEST(FFTConvolve, Same_Rectangle) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_same.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Cuboid)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_same.test"), false);
+TYPED_TEST(FFTConvolve, Same_Cuboid) {
+    fftconvolveTest<TypeParam, 3>(string(TEST_DIR "/convolve/cuboid_same.test"),
+                                  false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Vector_Many2One)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_same_many2one.test"), false);
+TYPED_TEST(FFTConvolve, Same_Vector_Many2One) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_same_many2one.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Rectangle_Many2One)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_same_many2one.test"), false);
+TYPED_TEST(FFTConvolve, Same_Rectangle_Many2One) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_same_many2one.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Cuboid_Many2One)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_same_many2one.test"), false);
+TYPED_TEST(FFTConvolve, Same_Cuboid_Many2One) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_same_many2one.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Vector_Many2Many)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_same_many2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Vector_Many2Many) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_same_many2many.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Rectangle_Many2Many)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_same_many2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Rectangle_Many2Many) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_same_many2many.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Cuboid_Many2Many)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_same_many2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Cuboid_Many2Many) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_same_many2many.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Vector_One2Many)
-{
-    fftconvolveTest<TypeParam, 1>(string(TEST_DIR"/convolve/vector_same_one2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Vector_One2Many) {
+    fftconvolveTest<TypeParam, 1>(
+        string(TEST_DIR "/convolve/vector_same_one2many.test"), false);
 }
 
-TYPED_TEST(FFTConvolve, Same_Rectangle_One2Many)
-{
-    fftconvolveTest<TypeParam, 2>(string(TEST_DIR"/convolve/rectangle_same_one2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Rectangle_One2Many) {
+    fftconvolveTest<TypeParam, 2>(
+        string(TEST_DIR "/convolve/rectangle_same_one2many.test"), false);
 }
-TYPED_TEST(FFTConvolve, Same_Cuboid_One2Many)
-{
-    fftconvolveTest<TypeParam, 3>(string(TEST_DIR"/convolve/cuboid_same_one2many.test"), false);
+TYPED_TEST(FFTConvolve, Same_Cuboid_One2Many) {
+    fftconvolveTest<TypeParam, 3>(
+        string(TEST_DIR "/convolve/cuboid_same_one2many.test"), false);
 }
 
-TEST(FFTConvolve1, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
+TEST(FFTConvolve1, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/vector.test"), numDims, in, tests);
+    readTests<float, float, int>(string(TEST_DIR "/convolve/vector.test"),
+                                 numDims, in, tests);
 
     //![ex_image_convolve1]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [32 1 1 1]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [4 1 1 1]
-
-    af::array output = fftConvolve1(signal, filter, AF_CONV_EXPAND);
-    //output dims = [32 1 1 1] - same as input since expand(3rd argument is false)
-    //None of the dimensions > 1 has lenght > 1, so no batch mode is activated.
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [32 1 1 1]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [4 1 1 1]
+
+    array output = fftConvolve1(signal, filter, AF_CONV_EXPAND);
+    // output dims = [35 1 1 1] - AF_CONV_EXPAND (3rd argument) returns the
+    // full convolution, i.e. 32 + 4 - 1 elements. No dimension beyond the
+    // first has length > 1, so no batch mode is activated.
     //![ex_image_convolve1]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(FFTConvolve2, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(FFTConvolve2, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    using af::dim4;
-
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/rectangle_one2many.test"), numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/convolve/rectangle_one2many.test"), numDims, in,
+        tests);
 
     //![ex_image_convolve2]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [15 17 1 1]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [5 5 2 1]
-
-    af::array output = fftConvolve2(signal, filter, AF_CONV_EXPAND);
-    //output dims = [15 17 1 1] - same as input since expand(3rd argument is false)
-    //however, notice that the 3rd dimension of filter is > 1.
-    //So, one to many batch mode will be activated automatically
-    //where the 2d input signal is convolved with each 2d filter
-    //and the result will written corresponding slice in the output 3d array
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [15 17 1 1]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [5 5 2 1]
+
+    array output = fftConvolve2(signal, filter, AF_CONV_EXPAND);
+    // output dims = [19 21 2 1] - AF_CONV_EXPAND (3rd argument) returns the
+    // full convolution along the first two dimensions. Notice that the 3rd
+    // dimension of filter is > 1, so one-to-many batch mode is activated
+    // automatically: the 2d input signal is convolved with each 2d filter and
+    // each result is written to the corresponding slice of the 3d output array
     //![ex_image_convolve2]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(FFTConvolve3, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::dim4;
+TEST(FFTConvolve3, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
-
-    readTests<float, float, int>(string(TEST_DIR"/convolve/cuboid_many2many.test"), numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/convolve/cuboid_many2many.test"), numDims, in, tests);
 
     //![ex_image_convolve3]
-    //vector<dim4> numDims;
-    //vector<vector<float> > in;
-    af::array signal(numDims[0], &(in[0].front()));
-    //signal dims = [10 11 2 2]
-    af::array filter(numDims[1], &(in[1].front()));
-    //filter dims = [4 2 3 2]
-
-    af::array output = fftConvolve3(signal, filter, AF_CONV_EXPAND);
-    //output dims = [10 11 2 2] - same as input since expand(3rd argument is false)
-    //however, notice that the 4th dimension is > 1 for both signal
-    //and the filter, therefore many to many batch mode will be
-    //activated where each 3d signal is convolved with the corresponding 3d filter
+    // vector<dim4> numDims;
+    // vector<vector<float> > in;
+    array signal(numDims[0], &(in[0].front()));
+    // signal dims = [10 11 2 2]
+    array filter(numDims[1], &(in[1].front()));
+    // filter dims = [4 2 3 2]
+
+    array output = fftConvolve3(signal, filter, AF_CONV_EXPAND);
+    // output dims = [13 12 4 2] - AF_CONV_EXPAND (3rd argument) returns the
+    // full convolution along the first three dimensions. Notice that the 4th
+    // dimension is > 1 for both signal and filter, so many-to-many batch mode
+    // is activated: each 3d signal is convolved with the corresponding 3d filter
     //![ex_image_convolve3]
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems  = output.elements();
-    float *outData = new float[nElems];
-    output.host(outData);
+    size_t nElems             = output.elements();
+    vector<float> outData(nElems);
+    output.host(&outData.front());
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1e-2)
+            << "at: " << elIter << endl;
     }
-
-    delete[] outData;
 }
 
-TEST(FFTConvolve, Docs_Unified_Wrapper)
-{
+TEST(FFTConvolve, Docs_Unified_Wrapper) {
     // This unit test doesn't necessarily need to function
-    // accuracy as af::convolve is merely a wrapper to
-    // af::convolve[1|2|3]
-    using af::array;
-    using af::dim4;
-    using af::randu;
+    // accuracy as convolve is merely a wrapper to
+    // convolve[1|2|3]
     using af::constant;
     using af::convolve;
 
     //![ex_image_convolve_1d]
     array a = randu(10);
-    //af_print(a);
-    //a [10 1 1 1] = 0.0000 0.1315 0.7556 0.4587 0.5328 0.2190 0.0470 0.6789 0.6793 0.9347
+    // af_print(a);
+    // a [10 1 1 1] = 0.0000 0.1315 0.7556 0.4587 0.5328 0.2190 0.0470 0.6789
+    // 0.6793 0.9347
     array b = randu(4);
-    //af_print(b);
-    //b [4 1 1 1]  = 0.3835 0.5194 0.8310 0.0346
+    // af_print(b);
+    // b [4 1 1 1]  = 0.3835 0.5194 0.8310 0.0346
     array c = convolve(a, b);
-    //af_print(c);
-    //c [10 1 1 1] = 0.3581 0.6777 1.0750 0.7679 0.5903 0.4851 0.6598 1.2770 1.0734 0.8002
+    // af_print(c);
+    // c [10 1 1 1] = 0.3581 0.6777 1.0750 0.7679 0.5903 0.4851 0.6598
+    // 1.2770 1.0734 0.8002
     //![ex_image_convolve_1d]
 
     //![ex_image_convolve_2d]
     array d = constant(0.5, 5, 5);
-    //af_print(d);
-    //d [5 5 1 1]
+    // af_print(d);
+    // d [5 5 1 1]
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     //    0.5000     0.5000     0.5000     0.5000     0.5000
     array e = constant(1, 2, 2);
-    //af_print(e);
-    //e [2 2 1 1]
+    // af_print(e);
+    // e [2 2 1 1]
     //     1.0000     1.0000
     //     1.0000     1.0000
     array f = fftConvolve(d, e);
-    //af_print(f);
-    //f [5 5 1 1]
+    // af_print(f);
+    // f [5 5 1 1]
     //     2.0000     2.0000     2.0000     2.0000     1.0000
     //     2.0000     2.0000     2.0000     2.0000     1.0000
     //     2.0000     2.0000     2.0000     2.0000     1.0000
@@ -538,8 +493,8 @@ TEST(FFTConvolve, Docs_Unified_Wrapper)
 
     //![ex_image_convolve_3d]
     array g = constant(1, 4, 4, 4);
-    //af_print(g);
-    //g [4 4 4 1]
+    // af_print(g);
+    // g [4 4 4 1]
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
@@ -560,8 +515,8 @@ TEST(FFTConvolve, Docs_Unified_Wrapper)
     //    1.0000     1.0000     1.0000     1.0000
     //    1.0000     1.0000     1.0000     1.0000
     array h = constant(0.5, 2, 2, 2);
-    //af_print(h);
-    //h [2 2 2 1]
+    // af_print(h);
+    // h [2 2 2 1]
     //    0.5000     0.5000
     //    0.5000     0.5000
 
@@ -569,8 +524,8 @@ TEST(FFTConvolve, Docs_Unified_Wrapper)
     //    0.5000     0.5000
 
     array i = fftConvolve(g, h);
-    //af_print(i);
-    //i [4 4 4 1]
+    // af_print(i);
+    // i [4 4 4 1]
     //    4.0000     4.0000     4.0000     2.0000
     //    4.0000     4.0000     4.0000     2.0000
     //    4.0000     4.0000     4.0000     2.0000
@@ -592,3 +547,94 @@ TEST(FFTConvolve, Docs_Unified_Wrapper)
     //    1.0000     1.0000     1.0000     0.5000
     //![ex_image_convolve_3d]
 }
+using namespace af;
+
+TEST(GFOR, fftConvolve2_MO) {
+    array A = randu(5, 5, 3);
+    array B = randu(5, 5, 3);
+    array K = randu(3, 3);
+
+    gfor(seq ii, 3) { B(span, span, ii) = fftConvolve2(A(span, span, ii), K); }
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = fftConvolve2(A(span, span, ii), K);
+        array b_ii = B(span, span, ii);
+        ASSERT_LT(max<double>(abs(c_ii - b_ii)), 1E-5);
+    }
+}
+
+TEST(GFOR, fftConvolve2_OM) {
+    array A = randu(5, 5);
+    array B = randu(5, 5, 3);
+    array K = randu(3, 3, 3);
+
+    gfor(seq ii, 3) { B(span, span, ii) = fftConvolve2(A, K(span, span, ii)); }
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = fftConvolve2(A, K(span, span, ii));
+        array b_ii = B(span, span, ii);
+        ASSERT_LT(max<double>(abs(c_ii - b_ii)), 1E-5);
+    }
+}
+
+TEST(GFOR, fftConvolve2_MM) {
+    array A = randu(5, 5, 3);
+    array B = randu(5, 5, 3);
+    array K = randu(3, 3, 3);
+
+    gfor(seq ii, 3) {
+        B(span, span, ii) = fftConvolve2(A(span, span, ii), K(span, span, ii));
+    }
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = fftConvolve2(A(span, span, ii), K(span, span, ii));
+        array b_ii = B(span, span, ii);
+        ASSERT_LT(max<double>(abs(c_ii - b_ii)), 1E-5);
+    }
+}
+
+TEST(Padding, fftConvolve2) {
+    for (int n = 5; n < 32; n++) {
+        array a = randu(n, n);
+        array b = randu(5, 5);
+        array c = fftConvolve2(a, b);
+        array d = convolve2(a, b, AF_CONV_DEFAULT, AF_CONV_SPATIAL);
+        ASSERT_LT(max<double>(abs(c - d)), 1E-5);
+    }
+}
+
+TEST(FFTConvolve1, Interleaved) {
+    array a = randu(100, 1, 2);
+    array b = randu(20, 3);
+    array c = fftConvolve1(a, b);
+
+    for (int ii = 0; ii < 2; ii++) {
+        array c_ii = c(span, span, ii);
+        array d    = fftConvolve1(a(span, 0, ii), b);
+        ASSERT_LT(max<double>(abs(c_ii - d)), 1E-5);
+    }
+}
+
+TEST(FFTConvolve2, Interleaved) {
+    array a = randu(100, 100, 2);
+    array b = randu(5, 5, 1, 3);
+    array c = fftConvolve2(a, b);
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = c(span, span, span, ii);
+        array d    = fftConvolve2(a, b(span, span, 0, ii));
+        ASSERT_LT(max<double>(abs(c_ii - d)), 1E-5);
+    }
+}
+
+TEST(FFTConvolve2, Interleaved2) {
+    array a = randu(100, 100, 2);
+    array b = randu(5, 5, 2, 3);
+    array c = fftConvolve2(a, b);
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = c(span, span, span, ii);
+        array d    = fftConvolve2(a, b(span, span, span, ii));
+        ASSERT_LT(max<double>(abs(c_ii - d)), 1E-5);
+    }
+}
diff --git a/test/flat.cpp b/test/flat.cpp
index 2448788c93..c9258e865b 100644
--- a/test/flat.cpp
+++ b/test/flat.cpp
@@ -8,117 +8,121 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/random.h>
+
+#include <vector>
 
-using namespace std;
-using namespace af;
+using af::array;
+using af::dim4;
+using af::flat;
+using af::freeHost;
+using af::randu;
+using af::seq;
+using af::span;
 
-TEST(FlatTests, Test_flat_1D)
-{
+using std::vector;
+
+TEST(FlatTests, Test_flat_1D) {
     const int num = 10000;
-    af::array in = randu(num);
-    af::array out = flat(in);
+    array in      = randu(num);
+    array out     = flat(in);
 
-    float *h_in = in.host<float>();
-    float *h_out = out.host<float>();
+    ASSERT_ARRAYS_EQ(in, out);
+}
 
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(h_in[i], h_out[i]);
-    }
+TEST(FlatTests, Test_flat_2D_Half) {
+    if (noHalfTests(f16)) return;
+    const int num = 10;
+    array in      = randu(num, num, f16);
+    array out     = flat(in);
+
+    vector<half_float::half> gold(num * num);
+    in.host(&gold[0]);
 
-    delete[] h_in;
-    delete[] h_out;
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(num * num), out);
 }
 
-TEST(FlatTests, Test_flat_2D)
-{
+TEST(FlatTests, Test_flat_2D) {
     const int nx = 200;
     const int ny = 200;
-    const int num =  nx * ny;
 
-    af::array in = randu(nx, ny);
-    af::array out = flat(in);
+    array in  = randu(nx, ny);
+    array out = flat(in);
 
-    float *h_in = in.host<float>();
-    float *h_out = out.host<float>();
-
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(h_in[i], h_out[i]);
-    }
-
-    delete[] h_in;
-    delete[] h_out;
+    vector<float> h_in_flat(in.elements());
+    in.host(h_in_flat.data());
+    dim4 h_in_flat_dims = dim4(nx * ny);
+    ASSERT_VEC_ARRAY_EQ(h_in_flat, h_in_flat_dims, out);
 }
 
-TEST(FlatTests, Test_flat_1D_index)
-{
+TEST(FlatTests, Test_flat_1D_index) {
     const int num = 10000;
-    const int st = 101;
-    const int en = 5000;
+    const int st  = 101;
+    const int en  = 5000;
 
-    af::array in = randu(num);
-    af::array tmp = in(seq(st, en));
-    af::array out = flat(tmp);
+    array in  = randu(num);
+    array tmp = in(seq(st, en));
+    array out = flat(tmp);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
-    for (int i = st; i <= en; i++) {
-        ASSERT_EQ(h_in[i], h_out[i - st]);
-    }
+    // TODO: Use ASSERT_ARRAYS_EQ
+    for (int i = st; i <= en; i++) { ASSERT_EQ(h_in[i], h_out[i - st]); }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlatTests, Test_flat_2D_index0)
-{
-    const int nx = 200;
-    const int ny = 200;
-    const int st = 21;
-    const int en = 180;
+TEST(FlatTests, Test_flat_2D_index0) {
+    const int nx  = 200;
+    const int ny  = 200;
+    const int st  = 21;
+    const int en  = 180;
     const int nxo = (en - st + 1);
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(seq(st, en), span);
-    af::array out = flat(tmp);
+    array in  = randu(nx, ny);
+    array tmp = in(seq(st, en), span);
+    array out = flat(tmp);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
+    // TODO: Use ASSERT_ARRAYS_EQ
     for (int j = 0; j < ny; j++) {
-        const int in_off = j * nx;
-        const int out_off =j * nxo;
+        const int in_off  = j * nx;
+        const int out_off = j * nxo;
         for (int i = st; i <= en; i++) {
             ASSERT_EQ(h_in[i + in_off], h_out[i - st + out_off])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlatTests, Test_flat_2D_index1)
-{
+TEST(FlatTests, Test_flat_2D_index1) {
     const int nx = 200;
     const int ny = 200;
     const int st = 21;
     const int en = 180;
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(span, seq(st, en));
-    af::array out = flat(tmp);
+    array in  = randu(nx, ny);
+    array tmp = in(span, seq(st, en));
+    array out = flat(tmp);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
+    // TODO: Use ASSERT_ARRAYS_EQ
     for (int j = st; j <= en; j++) {
-
-        const int in_off = j * nx;
+        const int in_off  = j * nx;
         const int out_off = (j - st) * nx;
 
         for (int i = 0; i < nx; i++) {
@@ -127,6 +131,6 @@ TEST(FlatTests, Test_flat_2D_index1)
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
diff --git a/test/flip.cpp b/test/flip.cpp
index 55d580c4af..852a837f14 100644
--- a/test/flip.cpp
+++ b/test/flip.cpp
@@ -8,42 +8,54 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
+#include <af/device.h>
 #include <af/index.h>
-#include <testHelpers.hpp>
+#include <af/random.h>
 
-using namespace std;
-using namespace af;
+using af::array;
+using af::flip;
+using af::freeHost;
+using af::randu;
+using af::seq;
+using af::span;
 
-TEST(FlipTests, Test_flip_1D)
-{
+void Test_flip_1D(const af::dtype dt) {
     const int num = 10000;
-    af::array in = randu(num);
-    af::array out = flip(in, 0);
+    array in      = randu(num, dt);
+    array out     = flip(in, 0);
 
-    float *h_in = in.host<float>();
-    float *h_out = out.host<float>();
+    float *h_in  = in.as(f32).host<float>();
+    float *h_out = out.as(f32).host<float>();
 
     for (int i = 0; i < num; i++) {
-        ASSERT_EQ(h_in[num - i - 1], h_out[i])
-            << "at (" << i << ")";
+        ASSERT_EQ(h_in[num - i - 1], h_out[i]) << "at (" << i << ")";
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D0)
-{
+TEST(FlipTests, Test_flip_1D_f32) { Test_flip_1D(f32); }
+
+TEST(FlipTests, Test_flip_1D_f16) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+
+    Test_flip_1D(f16);
+}
+
+TEST(FlipTests, Test_flip_2D0) {
     const int nx = 200;
     const int ny = 200;
 
-    af::array in = randu(nx, ny);
-    af::array out = flip(in, 0);
+    array in  = randu(nx, ny);
+    array out = flip(in, 0);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = 0; j < ny; j++) {
@@ -51,23 +63,21 @@ TEST(FlipTests, Test_flip_2D0)
         for (int i = 0; i < nx; i++) {
             ASSERT_EQ(h_in[off + nx - 1 - i], h_out[off + i])
                 << "at (" << i << "," << j << ")";
-
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D1)
-{
+TEST(FlipTests, Test_flip_2D1) {
     const int nx = 200;
     const int ny = 200;
 
-    af::array in = randu(nx, ny);
-    af::array out = flip(in, 1);
+    array in  = randu(nx, ny);
+    array out = flip(in, 1);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = 0; j < ny; j++) {
@@ -79,106 +89,99 @@ TEST(FlipTests, Test_flip_2D1)
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-
-TEST(FlipTests, Test_flip_1D_index)
-{
+TEST(FlipTests, Test_flip_1D_index) {
     const int num = 10000;
-    const int st = 101;
-    const int en = 5000;
+    const int st  = 101;
+    const int en  = 5000;
 
-    af::array in = randu(num);
-    af::array tmp = in(seq(st, en));
-    af::array out = flip(tmp, 0);
+    array in  = randu(num);
+    array tmp = in(seq(st, en));
+    array out = flip(tmp, 0);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int i = st; i <= en; i++) {
-        ASSERT_EQ(h_in[i], h_out[en - i])
-            << "at (" << i << ")";
+        ASSERT_EQ(h_in[i], h_out[en - i]) << "at (" << i << ")";
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D_index00)
-{
-    const int nx = 200;
-    const int ny = 200;
-    const int st = 21;
-    const int en = 180;
+TEST(FlipTests, Test_flip_2D_index00) {
+    const int nx  = 200;
+    const int ny  = 200;
+    const int st  = 21;
+    const int en  = 180;
     const int nxo = (en - st + 1);
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(seq(st, en), span);
-    af::array out = flip(tmp, 0);
+    array in  = randu(nx, ny);
+    array tmp = in(seq(st, en), span);
+    array out = flip(tmp, 0);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = 0; j < ny; j++) {
-        const int in_off = j * nx;
-        const int out_off =j * nxo;
+        const int in_off  = j * nx;
+        const int out_off = j * nxo;
         for (int i = st; i <= en; i++) {
             ASSERT_EQ(h_in[i + in_off], h_out[en - i + out_off])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D_index01)
-{
-    const int nx = 200;
-    const int ny = 200;
-    const int st = 21;
-    const int en = 180;
+TEST(FlipTests, Test_flip_2D_index01) {
+    const int nx  = 200;
+    const int ny  = 200;
+    const int st  = 21;
+    const int en  = 180;
     const int nxo = (en - st + 1);
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(seq(st, en), span);
-    af::array out = flip(tmp, 1);
+    array in  = randu(nx, ny);
+    array tmp = in(seq(st, en), span);
+    array out = flip(tmp, 1);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = 0; j < ny; j++) {
-        const int in_off = (ny - 1 - j) * nx;
-        const int out_off =j * nxo;
+        const int in_off  = (ny - 1 - j) * nx;
+        const int out_off = j * nxo;
         for (int i = st; i <= en; i++) {
             ASSERT_EQ(h_in[i + in_off], h_out[i - st + out_off])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D_index10)
-{
+TEST(FlipTests, Test_flip_2D_index10) {
     const int nx = 200;
     const int ny = 200;
     const int st = 21;
     const int en = 180;
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(span, seq(st, en));
-    af::array out = flip(tmp, 0);
+    array in  = randu(nx, ny);
+    array tmp = in(span, seq(st, en));
+    array out = flip(tmp, 0);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = st; j <= en; j++) {
-
-        const int in_off = j * nx;
+        const int in_off  = j * nx;
         const int out_off = (j - st) * nx;
 
         for (int i = 0; i < nx; i++) {
@@ -187,27 +190,25 @@ TEST(FlipTests, Test_flip_2D_index10)
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TEST(FlipTests, Test_flip_2D_index11)
-{
+TEST(FlipTests, Test_flip_2D_index11) {
     const int nx = 200;
     const int ny = 200;
     const int st = 21;
     const int en = 180;
 
-    af::array in = randu(nx, ny);
-    af::array tmp = in(span, seq(st, en));
-    af::array out = flip(tmp, 1);
+    array in  = randu(nx, ny);
+    array tmp = in(span, seq(st, en));
+    array out = flip(tmp, 1);
 
-    float *h_in = in.host<float>();
+    float *h_in  = in.host<float>();
     float *h_out = out.host<float>();
 
     for (int j = st; j <= en; j++) {
-
-        const int in_off = j * nx;
+        const int in_off  = j * nx;
         const int out_off = (en - j) * nx;
 
         for (int i = 0; i < nx; i++) {
@@ -216,6 +217,6 @@ TEST(FlipTests, Test_flip_2D_index11)
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
diff --git a/test/gaussiankernel.cpp b/test/gaussiankernel.cpp
index ad70f5037d..3fc8de1c23 100644
--- a/test/gaussiankernel.cpp
+++ b/test/gaussiankernel.cpp
@@ -7,99 +7,101 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class GaussianKernel : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class GaussianKernel : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float> TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(GaussianKernel, TestTypes);
+TYPED_TEST_SUITE(GaussianKernel, TestTypes);
 
 template<typename T>
-void gaussianKernelTest(string pFileName, double sigma)
-{
-    if (noDoubleTests<T>()) return;
+void gaussianKernelTest(string pFileName, double sigma) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4>     numDims;
-    vector<vector<int> > in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<T>> tests;
 
-    readTestsFromFile<int,T>(pFileName, numDims, in, tests);
+    readTestsFromFile<int, T>(pFileName, numDims, in, tests);
 
-    af_array outArray  = 0;
+    af_array outArray = 0;
 
     vector<int> input(in[0].begin(), in[0].end());
 
-    ASSERT_EQ(AF_SUCCESS, af_gaussian_kernel(&outArray, input[0], input[1], sigma, sigma));
+    ASSERT_SUCCESS(
+        af_gaussian_kernel(&outArray, input[0], input[1], sigma, sigma));
 
     dim_t outElems = 0;
-    ASSERT_EQ(AF_SUCCESS, af_get_elements(&outElems, outArray));
+    ASSERT_SUCCESS(af_get_elements(&outElems, outArray));
     T *outData = new T[outElems];
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
     vector<T> currGoldBar(tests[0].begin(), tests[0].end());
     size_t nElems = currGoldBar.size();
 
     ASSERT_EQ(outElems, (dim_t)nElems);
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)
+            << "at: " << elIter << endl;
     }
 
     delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(GaussianKernel, Small1D)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss1_7.test"), 0.0);
+TYPED_TEST(GaussianKernel, Small1D) {
+    gaussianKernelTest<TypeParam>(string(TEST_DIR "/gaussian/gauss1_7.test"),
+                                  0.0);
 }
 
-TYPED_TEST(GaussianKernel, Large1D)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss1_15.test"), 0.0);
+TYPED_TEST(GaussianKernel, Large1D) {
+    gaussianKernelTest<TypeParam>(string(TEST_DIR "/gaussian/gauss1_15.test"),
+                                  0.0);
 }
 
-TYPED_TEST(GaussianKernel, Small1DWithSigma)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss1_7_sigma1.test"), 1.0);
+TYPED_TEST(GaussianKernel, Small1DWithSigma) {
+    gaussianKernelTest<TypeParam>(
+        string(TEST_DIR "/gaussian/gauss1_7_sigma1.test"), 1.0);
 }
 
-TYPED_TEST(GaussianKernel, SmallSmall2D)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss2_7x7.test"), 0.0);
+TYPED_TEST(GaussianKernel, SmallSmall2D) {
+    gaussianKernelTest<TypeParam>(string(TEST_DIR "/gaussian/gauss2_7x7.test"),
+                                  0.0);
 }
 
-TYPED_TEST(GaussianKernel, LargeSmall2D)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss2_15x7.test"), 0.0);
+TYPED_TEST(GaussianKernel, LargeSmall2D) {
+    gaussianKernelTest<TypeParam>(string(TEST_DIR "/gaussian/gauss2_15x7.test"),
+                                  0.0);
 }
 
-TYPED_TEST(GaussianKernel, LargeLarge2D)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss2_15x15.test"), 0.0);
+TYPED_TEST(GaussianKernel, LargeLarge2D) {
+    gaussianKernelTest<TypeParam>(
+        string(TEST_DIR "/gaussian/gauss2_15x15.test"), 0.0);
 }
 
-TYPED_TEST(GaussianKernel, SmallSmall2DWithSigma)
-{
-    gaussianKernelTest<TypeParam>(string(TEST_DIR"/gaussian/gauss2_7x7_sigma1.test"), 1.0);
+TYPED_TEST(GaussianKernel, SmallSmall2DWithSigma) {
+    gaussianKernelTest<TypeParam>(
+        string(TEST_DIR "/gaussian/gauss2_7x7_sigma1.test"), 1.0);
 }
 
 //////////////////////////////// CPP ////////////////////////////////////
@@ -107,16 +109,15 @@ TYPED_TEST(GaussianKernel, SmallSmall2DWithSigma)
 
 #include <iostream>
 
-void gaussianKernelTestCPP(string pFileName, double sigma)
-{
-    using af::array;
-    using af::gaussianKernel;
+using af::array;
+using af::gaussianKernel;
 
-    vector<af::dim4>       numDims;
-    vector<vector<int> >   in;
-    vector<vector<float> > tests;
+void gaussianKernelTestCPP(string pFileName, double sigma) {
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
 
-    readTestsFromFile<int,float>(pFileName, numDims, in, tests);
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
 
     vector<int> input(in[0].begin(), in[0].end());
 
@@ -131,29 +132,28 @@ void gaussianKernelTestCPP(string pFileName, double sigma)
 
     ASSERT_EQ(outElems, (dim_t)nElems);
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)
+            << "at: " << elIter << endl;
     }
 
     delete[] outData;
 }
 
-TEST(GaussianKernel, Small1D_CPP)
-{
-    gaussianKernelTestCPP(string(TEST_DIR"/gaussian/gauss1_7.test"), 0.0);
+TEST(GaussianKernel, Small1D_CPP) {
+    gaussianKernelTestCPP(string(TEST_DIR "/gaussian/gauss1_7.test"), 0.0);
 }
 
-TEST(GaussianKernel, Small1DWithSigma_CPP)
-{
-    gaussianKernelTestCPP(string(TEST_DIR"/gaussian/gauss1_7_sigma1.test"), 1.0);
+TEST(GaussianKernel, Small1DWithSigma_CPP) {
+    gaussianKernelTestCPP(string(TEST_DIR "/gaussian/gauss1_7_sigma1.test"),
+                          1.0);
 }
 
-TEST(GaussianKernel, SmallSmall2D_CPP)
-{
-    gaussianKernelTestCPP(string(TEST_DIR"/gaussian/gauss2_7x7.test"), 0.0);
+TEST(GaussianKernel, SmallSmall2D_CPP) {
+    gaussianKernelTestCPP(string(TEST_DIR "/gaussian/gauss2_7x7.test"), 0.0);
 }
 
-TEST(GaussianKernel, SmallSmall2DWithSigma_CPP)
-{
-    gaussianKernelTestCPP(string(TEST_DIR"/gaussian/gauss2_7x7_sigma1.test"), 1.0);
+TEST(GaussianKernel, SmallSmall2DWithSigma_CPP) {
+    gaussianKernelTestCPP(string(TEST_DIR "/gaussian/gauss2_7x7_sigma1.test"),
+                          1.0);
 }
diff --git a/test/gen_assign.cpp b/test/gen_assign.cpp
index 3e950c3d46..07685108c4 100644
--- a/test/gen_assign.cpp
+++ b/test/gen_assign.cpp
@@ -7,151 +7,163 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/data.h>
 
-#include <vector>
+#include <testHelpers.hpp>
 #include <algorithm>
 #include <functional>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::generate;
-using std::cout;
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using af::exception;
+using af::freeHost;
+using af::randu;
+using af::seq;
+using af::span;
+using af::where;
 using std::endl;
 using std::ostream_iterator;
-using af::dtype_traits;
+using std::string;
+using std::vector;
 
-void testGeneralAssignOneArray(string pTestFile, const dim_t ndims, af_index_t* indexs, int arrayDim)
-{
-    vector<af::dim4>        numDims;
-    vector< vector<float> >      in;
-    vector< vector<float> >   tests;
+void testGeneralAssignOneArray(string pTestFile, const dim_t ndims,
+                               af_index_t *indexs, int arrayDim) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
     readTestsFromFile<float, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af::dim4 dims2     = numDims[2];
-    af_array outArray  = 0;
-    af_array rhsArray  = 0;
-    af_array lhsArray  = 0;
-    af_array idxArray  = 0;
-
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&lhsArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<float>::af_type));
-
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&rhsArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
-
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray, &(in[2].front()),
-                dims2.ndims(), dims2.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    dim4 dims0        = numDims[0];
+    dim4 dims1        = numDims[1];
+    dim4 dims2        = numDims[2];
+    af_array outArray = 0;
+    af_array rhsArray = 0;
+    af_array lhsArray = 0;
+    af_array idxArray = 0;
+
+    ASSERT_SUCCESS(af_create_array(&lhsArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_SUCCESS(af_create_array(&rhsArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_SUCCESS(af_create_array(&idxArray, &(in[2].front()), dims2.ndims(),
+                                   dims2.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[arrayDim].idx.arr = idxArray;
 
-    ASSERT_EQ(AF_SUCCESS, af_assign_gen(&outArray, lhsArray, ndims, indexs, rhsArray));
+    ASSERT_SUCCESS(af_assign_gen(&outArray, lhsArray, ndims, indexs, rhsArray));
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(rhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(lhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idxArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(rhsArray));
+    ASSERT_SUCCESS(af_release_array(lhsArray));
+    ASSERT_SUCCESS(af_release_array(idxArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TEST(GeneralAssign, ASSS)
-{
+TEST(GeneralAssign, ASSS) {
     af_index_t indexs[2];
     indexs[1].idx.seq = af_make_seq(0, 9, 1);
-    indexs[0].isSeq = false;
-    indexs[1].isSeq = true;
+    indexs[0].isSeq   = false;
+    indexs[1].isSeq   = true;
 
-    testGeneralAssignOneArray(string(TEST_DIR"/gen_assign/as0_9s0_ns0_n.test"), 2, indexs, 0);
+    testGeneralAssignOneArray(string(TEST_DIR "/gen_assign/as0_9s0_ns0_n.test"),
+                              2, indexs, 0);
 }
 
-TEST(GeneralAssign, SASS)
-{
+TEST(GeneralAssign, SASS) {
     af_index_t indexs[2];
     indexs[0].idx.seq = af_make_seq(10, 14, 1);
-    indexs[0].isSeq = true;
-    indexs[1].isSeq = false;
+    indexs[0].isSeq   = true;
+    indexs[1].isSeq   = false;
 
-    testGeneralAssignOneArray(string(TEST_DIR"/gen_assign/s10_14as0_ns0_n.test"), 2, indexs, 1);
+    testGeneralAssignOneArray(
+        string(TEST_DIR "/gen_assign/s10_14as0_ns0_n.test"), 2, indexs, 1);
 }
 
-TEST(GeneralAssign, SSSS)
-{
-    vector<af::dim4>        numDims;
-    vector< vector<float> >      in;
-    vector< vector<float> >   tests;
+TEST(GeneralAssign, SSSS) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTestsFromFile<float, float>(string(TEST_DIR"/gen_assign/s10_14s0_9s0_ns0_n.test"), numDims, in, tests);
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/gen_assign/s10_14s0_9s0_ns0_n.test"), numDims, in,
+        tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af_array outArray  = 0;
-    af_array rhsArray  = 0;
-    af_array lhsArray  = 0;
+    dim4 dims0        = numDims[0];
+    dim4 dims1        = numDims[1];
+    af_array outArray = 0;
+    af_array rhsArray = 0;
+    af_array lhsArray = 0;
 
     af_index_t indexs[2];
     indexs[0].idx.seq = af_make_seq(10, 14, 1);
     indexs[1].idx.seq = af_make_seq(0, 9, 1);
-    indexs[0].isSeq = true;
-    indexs[1].isSeq = true;
+    indexs[0].isSeq   = true;
+    indexs[1].isSeq   = true;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&lhsArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&lhsArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&rhsArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&rhsArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_assign_gen(&outArray, lhsArray, 2, indexs, rhsArray));
+    ASSERT_SUCCESS(af_assign_gen(&outArray, lhsArray, 2, indexs, rhsArray));
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(rhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(lhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(rhsArray));
+    ASSERT_SUCCESS(af_release_array(lhsArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TEST(GeneralAssign, AAAA)
-{
-    vector<af::dim4>        numDims;
-    vector< vector<float> >      in;
-    vector< vector<float> >   tests;
+TEST(GeneralAssign, AAAA) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTestsFromFile<float, float>(string(TEST_DIR"/gen_assign/aaaa.test"), numDims, in, tests);
+    readTestsFromFile<float, float>(string(TEST_DIR "/gen_assign/aaaa.test"),
+                                    numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af::dim4 dims2     = numDims[2];
-    af::dim4 dims3     = numDims[3];
-    af::dim4 dims4     = numDims[4];
-    af::dim4 dims5     = numDims[5];
+    dim4 dims0         = numDims[0];
+    dim4 dims1         = numDims[1];
+    dim4 dims2         = numDims[2];
+    dim4 dims3         = numDims[3];
+    dim4 dims4         = numDims[4];
+    dim4 dims5         = numDims[5];
     af_array outArray  = 0;
     af_array rhsArray  = 0;
     af_array lhsArray  = 0;
@@ -166,72 +178,78 @@ TEST(GeneralAssign, AAAA)
     indexs[2].isSeq = false;
     indexs[3].isSeq = false;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&lhsArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&lhsArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&rhsArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&rhsArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray0, &(in[2].front()),
-                dims2.ndims(), dims2.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray0, &(in[2].front()), dims2.ndims(),
+                                   dims2.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[0].idx.arr = idxArray0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray1, &(in[3].front()),
-                dims3.ndims(), dims3.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray1, &(in[3].front()), dims3.ndims(),
+                                   dims3.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[1].idx.arr = idxArray1;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray2, &(in[4].front()),
-                dims4.ndims(), dims4.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray2, &(in[4].front()), dims4.ndims(),
+                                   dims4.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[2].idx.arr = idxArray2;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray3, &(in[5].front()),
-                dims5.ndims(), dims5.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray3, &(in[5].front()), dims5.ndims(),
+                                   dims5.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[3].idx.arr = idxArray3;
 
-    ASSERT_EQ(AF_SUCCESS, af_assign_gen(&outArray, lhsArray, 4, indexs, rhsArray));
+    ASSERT_SUCCESS(af_assign_gen(&outArray, lhsArray, 4, indexs, rhsArray));
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(rhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(lhsArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(rhsArray));
+    ASSERT_SUCCESS(af_release_array(lhsArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(idxArray0));
+    ASSERT_SUCCESS(af_release_array(idxArray1));
+    ASSERT_SUCCESS(af_release_array(idxArray2));
+    ASSERT_SUCCESS(af_release_array(idxArray3));
 }
 
-
-TEST(ArrayAssign, CPP_ASSIGN_INDEX)
-{
+TEST(ArrayAssign, CPP_ASSIGN_INDEX) {
     using af::array;
 
     const int num = 20000;
 
-    array a = af::randu(num);
+    array a    = randu(num);
     float *hAO = a.host<float>();
 
-    array a_copy = a;
-    array idx = where(a < 0.5);
+    array a_copy  = a;
+    array idx     = where(a < 0.5);
     const int len = idx.elements();
-    array b = af::randu(len);
-    a(idx) = b;
+    array b       = randu(len);
+    a(idx)        = b;
 
-    float *hA = a.host<float>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    float *hB  = b.host<float>();
     float *hAC = a_copy.host<float>();
     uint *hIdx = idx.host<uint>();
 
     for (int i = 0; i < num; i++) {
-
         int j = 0;
-        while(j < len) {
-
+        while (j < len) {
             // If index found, value should match B
             if ((int)hIdx[j] == i) {
                 ASSERT_EQ(hA[i], hB[j]);
@@ -241,49 +259,42 @@ TEST(ArrayAssign, CPP_ASSIGN_INDEX)
         }
 
         // If index not found, value should match original
-        if (j >= len) {
-            ASSERT_EQ(hA[i], hAO[i]);
-        }
+        if (j >= len) { ASSERT_EQ(hA[i], hAO[i]); }
     }
 
     // hAC should not be modified, i.e. same as original
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(hAO[i], hAC[i]);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hAO[i], hAC[i]); }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hAC;
-    delete[] hAO;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hAC);
+    freeHost(hAO);
+    freeHost(hIdx);
 }
 
-TEST(ArrayAssign, CPP_ASSIGN_INDEX_LOGICAL)
-{
+TEST(ArrayAssign, CPP_ASSIGN_INDEX_LOGICAL) {
     try {
         using af::array;
 
         const int num = 20000;
 
-        array a = af::randu(num);
+        array a    = randu(num);
         float *hAO = a.host<float>();
 
-        array a_copy = a;
-        array idx = where(a < 0.5);
+        array a_copy  = a;
+        array idx     = where(a < 0.5);
         const int len = idx.elements();
-        array b = af::randu(len);
-        a(a < 0.5) = b;
+        array b       = randu(len);
+        a(a < 0.5)    = b;
 
-        float *hA = a.host<float>();
-        float *hB = b.host<float>();
+        float *hA  = a.host<float>();
+        float *hB  = b.host<float>();
         float *hAC = a_copy.host<float>();
         uint *hIdx = idx.host<uint>();
 
         for (int i = 0; i < num; i++) {
-
             int j = 0;
-            while(j < len) {
-
+            while (j < len) {
                 // If index found, value should match B
                 if ((int)hIdx[j] == i) {
                     ASSERT_EQ(hA[i], hB[j]);
@@ -293,36 +304,27 @@ TEST(ArrayAssign, CPP_ASSIGN_INDEX_LOGICAL)
             }
 
             // If index not found, value should match original
-            if (j >= len) {
-                ASSERT_EQ(hA[i], hAO[i]);
-            }
+            if (j >= len) { ASSERT_EQ(hA[i], hAO[i]); }
         }
 
         // hAC should not be modified, i.e. same as original
-        for (int i = 0; i < num; i++) {
-            ASSERT_EQ(hAO[i], hAC[i]);
-        }
-
-        delete[] hA;
-        delete[] hB;
-        delete[] hAC;
-        delete[] hAO;
-        delete[] hIdx;
-    } catch(af::exception &ex) {
-        FAIL() << ex.what() << std::endl;
-    }
+        for (int i = 0; i < num; i++) { ASSERT_EQ(hAO[i], hAC[i]); }
+
+        freeHost(hA);
+        freeHost(hB);
+        freeHost(hAC);
+        freeHost(hAO);
+        freeHost(hIdx);
+    } catch (exception &ex) { FAIL() << ex.what() << endl; }
 }
 
-
-TEST(GeneralAssign, CPP_ASNN)
-{
-    using namespace af;
+TEST(GeneralAssign, CPP_ASNN) {
     const int nx = 1000;
     const int ny = 1000;
     const int st = 200;
     const int en = 805;
 
-    array a = randu(nx, ny);
+    array a   = randu(nx, ny);
     array idx = where(randu(ny) > 0.5);
 
     const int nyb = (en - st) + 1;
@@ -332,34 +334,30 @@ TEST(GeneralAssign, CPP_ASNN)
 
     a(idx, seq(st, en)) = b;
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
-
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + (st + j) * nx;
         float *hBt = hB + j * nxb;
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[hIdx[i]], hBt[i])
-                << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[hIdx[i]], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralAssign, CPP_SANN)
-{
-    using namespace af;
+TEST(GeneralAssign, CPP_SANN) {
     const int nx = 1000;
     const int ny = 1000;
     const int st = 200;
     const int en = 805;
 
-    array a = randu(nx, ny);
+    array a   = randu(nx, ny);
     array idx = where(randu(ny) > 0.5);
 
     const int nxb = (en - st) + 1;
@@ -369,47 +367,44 @@ TEST(GeneralAssign, CPP_SANN)
 
     a(seq(st, en), idx) = b;
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + hIdx[j] * nx;
         float *hBt = hB + j * nxb;
 
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[i + st], hBt[i])
-            << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[i + st], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralAssign, CPP_SSAN)
-{
-    using namespace af;
+TEST(GeneralAssign, CPP_SSAN) {
     const int nx = 100;
     const int ny = 100;
     const int nz = 100;
     const int st = 20;
     const int en = 85;
 
-    array a = randu(nx, ny, nz);
+    array a   = randu(nx, ny, nz);
     array idx = where(randu(nz) > 0.5);
 
     const int nxb = (en - st) + 1;
     const int nyb = ny;
     const int nzb = idx.elements();
-    array b = randu(nxb, nyb, nzb);
+    array b       = randu(nxb, nyb, nzb);
 
     a(seq(st, en), span, idx) = b;
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int k = 0; k < nzb; k++) {
         float *hAt = hA + hIdx[k] * nx * ny;
@@ -417,49 +412,89 @@ TEST(GeneralAssign, CPP_SSAN)
 
         for (int j = 0; j < nyb; j++) {
             for (int i = 0; i < nxb; i++) {
-                ASSERT_EQ(hAt[j * nx  + i + st], hBt[j * nxb + i])
-                    << "at " << i << " " << j << " " << k << std::endl;
+                ASSERT_EQ(hAt[j * nx + i + st], hBt[j * nxb + i])
+                    << "at " << i << " " << j << " " << k << endl;
             }
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralAssign, CPP_AANN)
-{
-    using namespace af;
+TEST(GeneralAssign, CPP_AANN) {
     const int nx = 1000;
     const int ny = 1000;
 
-    array a = randu(nx, ny);
+    array a    = randu(nx, ny);
     array idx0 = where(randu(nx) > 0.5);
     array idx1 = where(randu(ny) > 0.5);
 
     const int nxb = idx0.elements();
     const int nyb = idx1.elements();
-    array b = randu(nxb, nyb);
+    array b       = randu(nxb, nyb);
 
     a(idx0, idx1) = b;
 
-    float *hA = a.host<float>();
-    uint  *hIdx0 = idx0.host<uint>();
-    uint  *hIdx1 = idx1.host<uint>();
-    float *hB = b.host<float>();
+    float *hA   = a.host<float>();
+    uint *hIdx0 = idx0.host<uint>();
+    uint *hIdx1 = idx1.host<uint>();
+    float *hB   = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + hIdx1[j] * nx;
         float *hBt = hB + j * nxb;
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[hIdx0[i]], hBt[i])
-                << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[hIdx0[i]], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx0;
-    delete[] hIdx1;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx0);
+    freeHost(hIdx1);
+}
+
+TEST(GeneralAssign, NDimsDoesNotMatchLDims) {
+    af_err err;
+    af_array zeros, l2, sevens;
+    dim_t sevens_size[3] = {5, 1, 1};
+    short hsevens[5]     = {7, 7, 7, 7, 7};
+
+    dim_t zeros_size[3] = {5, 6, 1};
+    short hzeros[5 * 6] = {0};
+
+    dim_t hone[1] = {1};
+
+    ASSERT_SUCCESS(af_create_array(&zeros, hzeros, 3, zeros_size, s16));
+    ASSERT_SUCCESS(af_create_array(&sevens, hsevens, 3, sevens_size, s16));
+    ASSERT_SUCCESS(af_create_array(&l2, hone, 1, hone, s64));
+
+    af_index_t *ix;
+    ASSERT_SUCCESS(af_create_indexers(&ix));
+    ASSERT_SUCCESS(af_set_array_indexer(ix, l2, 1));
+
+    // clang-format off
+    vector<short> gold = {
+            0, 0, 0, 0, 0,
+            7, 7, 7, 7, 7,
+            0, 0, 0, 0, 0,
+            0, 0, 0, 0, 0,
+            0, 0, 0, 0, 0,
+            0, 0, 0, 0, 0,
+        };
+    // clang-format on
+    for (int number_of_indices = 2; number_of_indices < 4;
+         number_of_indices++) {
+        af_array result = 0;
+        ASSERT_SUCCESS(
+            af_assign_gen(&result, zeros, number_of_indices, ix, sevens));
+
+        ASSERT_VEC_ARRAY_EQ(gold, dim4(3, zeros_size), af::array(result));
+    }
+    ASSERT_SUCCESS(af_release_array(zeros));
+    ASSERT_SUCCESS(af_release_array(sevens));
+    ASSERT_SUCCESS(af_release_array(l2));
+    ASSERT_SUCCESS(af_release_indexers(ix));
 }
diff --git a/test/gen_index.cpp b/test/gen_index.cpp
index 7fcd3fd46e..fe684ebd27 100644
--- a/test/gen_index.cpp
+++ b/test/gen_index.cpp
@@ -8,115 +8,210 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/random.h>
 #include <af/traits.hpp>
-#include <af/data.h>
 
-#include <vector>
 #include <algorithm>
 #include <functional>
 #include <iostream>
+#include <sstream>
 #include <string>
-#include <testHelpers.hpp>
+#include <tuple>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::generate;
-using std::cout;
+using af::dim4;
+using af::dtype_traits;
 using std::endl;
+using std::get;
 using std::ostream_iterator;
-using af::dtype_traits;
+using std::string;
+using std::stringstream;
+using std::vector;
 
-void testGeneralIndexOneArray(string pTestFile, const dim_t ndims, af_index_t* indexs, int arrayDim)
-{
-    vector<af::dim4>        numDims;
-    vector< vector<float> >      in;
-    vector< vector<float> >   tests;
+struct index_test {
+    string filename_;
+    dim4 dims_;
+    index_test(string filename, dim4 dims) : filename_(filename), dims_(dims) {}
+};
+
+using index_params = std::tuple<index_test, af_dtype, af_dtype>;
+
+class IndexGeneralizedLegacy : public ::testing::TestWithParam<index_params> {
+    void SetUp() {
+        index_params params = GetParam();
+        vector<dim4> numDims;
+        vector<vector<float>> in;
+        vector<vector<float>> tests;
+
+        if (noDoubleTests(get<1>(params))) return;
+        if (noHalfTests(get<1>(params))) return;
+
+        if (noDoubleTests(get<2>(params))) return;
+        if (noHalfTests(get<2>(params))) return;
+        readTestsFromFile<float, float>(get<0>(params).filename_, numDims, in,
+                                        tests);
+
+        dim4 dims0 = numDims[0];
+        dim4 dims1 = numDims[1];
+
+        af_array inTmp = 0;
+        ASSERT_SUCCESS(af_create_array(&inTmp, &(in[0].front()), dims0.ndims(),
+                                       dims0.get(), f32));
+
+        ASSERT_SUCCESS(af_cast(&inArray_, inTmp, get<1>(params)));
+        af_release_array(inTmp);
+
+        af_array idxTmp = 0;
+        ASSERT_SUCCESS(af_create_array(&idxTmp, &(in[1].front()), dims1.ndims(),
+                                       dims1.get(), f32));
+        ASSERT_SUCCESS(af_cast(&idxArray_, idxTmp, get<2>(params)));
+        af_release_array(idxTmp);
+
+        vector<float> hgold = tests[0];
+        af_array goldTmp = 0;
+        ASSERT_SUCCESS(af_create_array(&goldTmp, &hgold.front(),
+            get<0>(params).dims_.ndims(), get<0>(params).dims_.get(), f32));
+        ASSERT_SUCCESS(af_cast(&gold_, goldTmp, get<1>(params)));
+        af_release_array(goldTmp);
+    }
+
+    void TearDown() {
+        if (inArray_) { ASSERT_SUCCESS(af_release_array(inArray_)); }
+        if (idxArray_) { ASSERT_SUCCESS(af_release_array(idxArray_)); }
+        if (gold_) { ASSERT_SUCCESS(af_release_array(gold_)); }
+    }
+
+   public:
+    IndexGeneralizedLegacy() : gold_(0), inArray_(0), idxArray_(0) {}
+
+    af_array gold_;
+    af_array inArray_;
+    af_array idxArray_;
+};
+
+string testNameGenerator(
+    const ::testing::TestParamInfo<IndexGeneralizedLegacy::ParamType> info) {
+    stringstream ss;
+    ss << "type_" << get<1>(info.param) << "_idx_type_" << get<2>(info.param);
+    return ss.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(
+    Legacy, IndexGeneralizedLegacy,
+    ::testing::Combine(
+        ::testing::Values(index_test(
+            string(TEST_DIR "/gen_index/s0_3s0_1s1_2a.test"), dim4(4, 2, 2))),
+        ::testing::Values(f32, f64, c32, c64, u64, s64, u16, s16, s8, u8, b8,
+                          f16),
+        ::testing::Values(f32, f64, u64, s64, u16, s16, s8, u8, f16)),
+    testNameGenerator);
+
+TEST_P(IndexGeneralizedLegacy, SSSA) {
+    index_params params = GetParam();
+    if (noDoubleTests(get<1>(params))) return;
+    if (noHalfTests(get<1>(params))) return;
+
+    if (noDoubleTests(get<2>(params))) return;
+    if (noHalfTests(get<2>(params))) return;
+
+    af_array outArray = 0;
+    af_index_t indexes[4];
+    indexes[0].idx.seq = af_make_seq(0, 3, 1);
+    indexes[1].idx.seq = af_make_seq(0, 1, 1);
+    indexes[2].idx.seq = af_make_seq(1, 2, 1);
+    indexes[3].idx.arr = idxArray_;
+    indexes[0].isSeq   = true;
+    indexes[1].isSeq   = true;
+    indexes[2].isSeq   = true;
+    indexes[3].isSeq   = false;
+    ASSERT_SUCCESS(af_index_gen(&outArray, inArray_, 4, indexes));
+    ASSERT_ARRAYS_EQ(gold_, outArray);
+    af_release_array(outArray);
+}
+
+void testGeneralIndexOneArray(string pTestFile, const dim_t ndims,
+                              af_index_t *indexs, int arrayDim) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
     readTestsFromFile<float, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af_array outArray  = 0;
-    af_array inArray   = 0;
-    af_array idxArray  = 0;
+    dim4 dims0        = numDims[0];
+    dim4 dims1        = numDims[1];
+    af_array outArray = 0;
+    af_array inArray  = 0;
+    af_array idxArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
     indexs[arrayDim].idx.arr = idxArray;
 
-    ASSERT_EQ(AF_SUCCESS, af_index_gen(&outArray, inArray, ndims, indexs));
+    ASSERT_SUCCESS(af_index_gen(&outArray, inArray, ndims, indexs));
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idxArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(idxArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TEST(GeneralIndex, SSSA)
-{
-    af_index_t indexs[4];
-    indexs[0].idx.seq = af_make_seq(0, 3, 1);
-    indexs[1].idx.seq = af_make_seq(0, 1, 1);
-    indexs[2].idx.seq = af_make_seq(1, 2, 1);
-    indexs[0].isSeq = true;
-    indexs[1].isSeq = true;
-    indexs[2].isSeq = true;
-    indexs[3].isSeq = false;
-
-    testGeneralIndexOneArray(string(TEST_DIR"/gen_index/s0_3s0_1s1_2a.test"), 4, indexs, 3);
-}
-
-TEST(GeneralIndex, ASSS)
-{
+TEST(GeneralIndex, ASSS) {
     af_index_t indexs[4];
     indexs[1].idx.seq = af_make_seq(0, 9, 1);
     indexs[2].idx.seq = af_span;
     indexs[3].idx.seq = af_span;
-    indexs[0].isSeq = false;
-    indexs[1].isSeq = true;
-    indexs[2].isSeq = true;
-    indexs[3].isSeq = true;
+    indexs[0].isSeq   = false;
+    indexs[1].isSeq   = true;
+    indexs[2].isSeq   = true;
+    indexs[3].isSeq   = true;
 
-    testGeneralIndexOneArray(string(TEST_DIR"/gen_index/as0_9s0_ns0_n.test"), 4, indexs, 0);
+    testGeneralIndexOneArray(string(TEST_DIR "/gen_index/as0_9s0_ns0_n.test"),
+                             4, indexs, 0);
 }
 
-TEST(GeneralIndex, SASS)
-{
+TEST(GeneralIndex, SASS) {
     af_index_t indexs[2];
     indexs[0].idx.seq = af_make_seq(10, 40, 1);
-    indexs[0].isSeq = true;
-    indexs[1].isSeq = false;
+    indexs[0].isSeq   = true;
+    indexs[1].isSeq   = false;
 
-    testGeneralIndexOneArray(string(TEST_DIR"/gen_index/s0_9as0_ns0_n.test"), 2, indexs, 1);
+    testGeneralIndexOneArray(string(TEST_DIR "/gen_index/s0_9as0_ns0_n.test"),
+                             2, indexs, 1);
 }
 
-TEST(GeneralIndex, AASS)
-{
-    vector<af::dim4>        numDims;
-    vector< vector<float> >      in;
-    vector< vector<float> >   tests;
+TEST(GeneralIndex, AASS) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTestsFromFile<float, float>(string(TEST_DIR"/gen_index/aas0_ns0_n.test"), numDims, in, tests);
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/gen_index/aas0_ns0_n.test"), numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af::dim4 dims2     = numDims[2];
+    dim4 dims0         = numDims[0];
+    dim4 dims1         = numDims[1];
+    dim4 dims2         = numDims[2];
     af_array outArray  = 0;
     af_array inArray   = 0;
     af_array idxArray0 = 0;
@@ -124,126 +219,177 @@ TEST(GeneralIndex, AASS)
 
     af_index_t indexs[2];
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray0, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<float>::af_type));
-    indexs[0].isSeq = false;
+    ASSERT_SUCCESS(af_create_array(&idxArray0, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+    indexs[0].isSeq   = false;
     indexs[0].idx.arr = idxArray0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray1, &(in[2].front()),
-                dims2.ndims(), dims2.get(), (af_dtype)af::dtype_traits<float>::af_type));
-    indexs[1].isSeq = false;
+    ASSERT_SUCCESS(af_create_array(&idxArray1, &(in[2].front()), dims2.ndims(),
+                                   dims2.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+    indexs[1].isSeq   = false;
     indexs[1].idx.arr = idxArray1;
 
-    ASSERT_EQ(AF_SUCCESS, af_index_gen(&outArray, inArray, 2, indexs));
+    ASSERT_SUCCESS(af_index_gen(&outArray, inArray, 2, indexs));
+
+    vector<float> currGoldBar = tests[0];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
+
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(idxArray0));
+    ASSERT_SUCCESS(af_release_array(idxArray1));
+    ASSERT_SUCCESS(af_release_array(outArray));
+}
+
+TEST(GeneralIndex, SSAS_LinearSteps) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;  // Read tests from file
+
+    readTestsFromFile<float, float>(
+        TEST_DIR "/gen_index/s29_9__3s0_9_2as0_n.test", numDims, in, tests);
+
+    af_array outArray  = 0;
+    af_array inArray   = 0;
+    af_array idxArray0 = 0;
+    dim4 dims0         = numDims[0];
+    dim4 dims1         = numDims[1];
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    ASSERT_SUCCESS(af_create_array(&idxArray0, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
+
+    af_index_t indexs[4];
+    indexs[0].idx.seq = af_make_seq(29, 9, -3);
+    indexs[1].idx.seq = af_make_seq(0, 9, 2);
+    indexs[2].idx.arr = idxArray0;
+    indexs[3].idx.seq = af_span;
+
+    indexs[0].isSeq = true;
+    indexs[1].isSeq = true;
+    indexs[2].isSeq = false;
+    indexs[3].isSeq = true;
+
+    ASSERT_SUCCESS(af_index_gen(&outArray, inArray, 4, indexs));
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    size_t nElems             = currGoldBar.size();
+    vector<float> outData(nElems);
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idxArray0));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idxArray1));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TEST(GeneralIndex, CPP_ASNN)
-{
-    using namespace af;
+using af::array;
+using af::freeHost;
+using af::randu;
+using af::seq;
+using af::span;
+using af::where;
+
+TEST(GeneralIndex, CPP_ASNN) {
     const int nx = 1000;
     const int ny = 1000;
     const int st = 200;
     const int en = 805;
 
-    array a = randu(nx, ny);
+    array a   = randu(nx, ny);
     array idx = where(randu(nx) > 0.5);
-    array b = a(idx, seq(st, en));
+    array b   = a(idx, seq(st, en));
 
     const int nxb = b.dims(0);
     const int nyb = b.dims(1);
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
-
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + (st + j) * nx;
         float *hBt = hB + j * nxb;
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[hIdx[i]], hBt[i])
-                << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[hIdx[i]], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralIndex, CPP_SANN)
-{
-    using namespace af;
+TEST(GeneralIndex, CPP_SANN) {
     const int nx = 1000;
     const int ny = 1000;
     const int st = 200;
     const int en = 805;
 
-    array a = randu(nx, ny);
+    array a   = randu(nx, ny);
     array idx = where(randu(ny) > 0.5);
-    array b = a(seq(st, en), idx);
+    array b   = a(seq(st, en), idx);
 
     const int nxb = b.dims(0);
     const int nyb = b.dims(1);
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + hIdx[j] * nx;
         float *hBt = hB + j * nxb;
 
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[i + st], hBt[i])
-            << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[i + st], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralIndex, CPP_SSAN)
-{
-    using namespace af;
+TEST(GeneralIndex, CPP_SSAN) {
     const int nx = 100;
     const int ny = 100;
     const int nz = 100;
     const int st = 20;
     const int en = 85;
 
-    array a = randu(nx, ny, nz);
+    array a   = randu(nx, ny, nz);
     array idx = where(randu(nz) > 0.5);
-    array b = a(seq(st, en), span, idx);
+    array b   = a(seq(st, en), span, idx);
 
     const int nxb = b.dims(0);
     const int nyb = b.dims(1);
     const int nzb = b.dims(2);
 
-    float *hA = a.host<float>();
-    uint  *hIdx = idx.host<uint>();
-    float *hB = b.host<float>();
+    float *hA  = a.host<float>();
+    uint *hIdx = idx.host<uint>();
+    float *hB  = b.host<float>();
 
     for (int k = 0; k < nzb; k++) {
         float *hAt = hA + hIdx[k] * nx * ny;
@@ -251,48 +397,44 @@ TEST(GeneralIndex, CPP_SSAN)
 
         for (int j = 0; j < nyb; j++) {
             for (int i = 0; i < nxb; i++) {
-                ASSERT_EQ(hAt[j * nx  + i + st], hBt[j * nxb + i])
-                    << "at " << i << " " << j << " " << k << std::endl;
+                ASSERT_EQ(hAt[j * nx + i + st], hBt[j * nxb + i])
+                    << "at " << i << " " << j << " " << k << endl;
             }
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx);
 }
 
-TEST(GeneralIndex, CPP_AANN)
-{
-    using namespace af;
+TEST(GeneralIndex, CPP_AANN) {
     const int nx = 1000;
     const int ny = 1000;
 
-    array a = randu(nx, ny);
+    array a    = randu(nx, ny);
     array idx0 = where(randu(nx) > 0.5);
     array idx1 = where(randu(ny) > 0.5);
-    array b = a(idx0, idx1);
+    array b    = a(idx0, idx1);
 
     const int nxb = b.dims(0);
     const int nyb = b.dims(1);
 
-    float *hA = a.host<float>();
-    uint  *hIdx0 = idx0.host<uint>();
-    uint  *hIdx1 = idx1.host<uint>();
-    float *hB = b.host<float>();
-
+    float *hA   = a.host<float>();
+    uint *hIdx0 = idx0.host<uint>();
+    uint *hIdx1 = idx1.host<uint>();
+    float *hB   = b.host<float>();
 
     for (int j = 0; j < nyb; j++) {
         float *hAt = hA + hIdx1[j] * nx;
         float *hBt = hB + j * nxb;
         for (int i = 0; i < nxb; i++) {
-            ASSERT_EQ(hAt[hIdx0[i]], hBt[i])
-                << "at " << i << " " << j << std::endl;
+            ASSERT_EQ(hAt[hIdx0[i]], hBt[i]) << "at " << i << " " << j << endl;
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hIdx0;
-    delete[] hIdx1;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hIdx0);
+    freeHost(hIdx1);
 }
diff --git a/test/getting_started.cpp b/test/getting_started.cpp
index fce648577d..c9e73ef6b5 100644
--- a/test/getting_started.cpp
+++ b/test/getting_started.cpp
@@ -7,39 +7,46 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <vector>
-#include <complex>
+#include <gtest/gtest.h>
 #include <testHelpers.hpp>
+#include <complex>
+#include <vector>
 
 using namespace af;
-using namespace std;
-
-TEST(GettingStarted, SNIPPET_getting_started_gen)
-{
+using std::abs;
+using std::vector;
+
+TEST(GettingStarted, SNIPPET_getting_started_gen) {
+    //! [ex_getting_started_constructors]
+    // Arrays may be created using the array constructor and dimensioned
+    // as 1D, 2D, 3D; however, the values in these arrays will be undefined
+    array undefined_1D(100);         // 1D array with 100 elements
+    array undefined_2D(10, 100);     // 2D array of size 10 x 100
+    array undefined_3D(10, 10, 10);  // 3D array of size 10 x 10 x 10
+    //! [ex_getting_started_constructors]
 
     //! [ex_getting_started_gen]
     // Generate an array of size three filled with zeros.
     // If no data type is specified, ArrayFire defaults to f32.
-    // The af::constant function generates the data on the device.
-    array zeros      = constant(0, 3);
+    // The constant function generates the data on the device.
+    array zeros = constant(0, 3);
 
     // Generate a 1x4 array of uniformly distributed [0,1] random numbers
-    // The af::randu function generates the data on the device.
-    array rand1      = randu(1, 4);
+    // The randu function generates the data on the device.
+    array rand1 = randu(1, 4);
 
     // Generate a 2x2 array (or matrix, if you prefer) of random numbers
     // sampled from a normal distribution.
-    // The af::randn function generates data on the device.
-    array rand2      = randn(2, 2);
+    // The randn function generates data on the device.
+    array rand2 = randn(2, 2);
 
     // Generate a 3x3 identity matrix. The data is generated on the device.
-    array iden       = af::identity(3, 3);
+    array iden = identity(3, 3);
 
     // Lastly, create a 2x1 array (column vector) of uniformly distributed
     // 32-bit complex numbers (c32 data type):
-    array randcplx   = randu(2, 1, c32);
+    array randcplx = randu(2, 1, c32);
     //! [ex_getting_started_gen]
 
     {
@@ -47,30 +54,33 @@ TEST(GettingStarted, SNIPPET_getting_started_gen)
         output.resize(zeros.elements());
         zeros.host(&output.front());
         ASSERT_EQ(f32, zeros.type());
-        for(unsigned i = 0; i < zeros.elements(); i++) ASSERT_FLOAT_EQ(0, output[i]);
+        for (dim_t i = 0; i < zeros.elements(); i++)
+            ASSERT_FLOAT_EQ(0, output[i]);
     }
 
-    if (!noDoubleTests<double>()) {
-        array ones       = constant(1, 3, 2, f64);
+    if (!noDoubleTests(f64)) {
+        array ones = constant(1, 3, 2, f64);
         vector<double> output(ones.elements());
         ones.host(&output.front());
         ASSERT_EQ(f64, ones.type());
-        for(unsigned i = 0; i < ones.elements(); i++) ASSERT_FLOAT_EQ(1, output[i]);
+        for (dim_t i = 0; i < ones.elements(); i++)
+            ASSERT_FLOAT_EQ(1, output[i]);
     }
 
     {
         vector<float> output;
         output.resize(iden.elements());
         iden.host(&output.front());
-        for(unsigned i = 0; i < iden.dims(0); i++)
-            for(unsigned j = 0; j < iden.dims(1); j++)
-                if(i == j)  ASSERT_FLOAT_EQ(1, output[i * iden.dims(0) + j]);
-                else        ASSERT_FLOAT_EQ(0, output[i * iden.dims(0) + j]);
+        for (dim_t i = 0; i < iden.dims(0); i++)
+            for (dim_t j = 0; j < iden.dims(1); j++)
+                if (i == j)
+                    ASSERT_FLOAT_EQ(1, output[i * iden.dims(0) + j]);
+                else
+                    ASSERT_FLOAT_EQ(0, output[i * iden.dims(0) + j]);
     }
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_init)
-{
+TEST(GettingStarted, SNIPPET_getting_started_init) {
     //! [ex_getting_started_init]
     // Create a six-element array on the host
     float hA[] = {0, 1, 2, 3, 4, 5};
@@ -79,7 +89,7 @@ TEST(GettingStarted, SNIPPET_getting_started_init)
     // constructor. Here we copy the data into a 2x3 matrix:
     array A(2, 3, hA);
 
-    // ArrayFire provides a convenince function for printing af::array
+    // ArrayFire provides a convenience function for printing array
     // objects in case you wish to see how the data is stored:
     af_print(A);
 
@@ -87,18 +97,18 @@ TEST(GettingStarted, SNIPPET_getting_started_init)
     // data (stored in {{real, imaginary}, {real, imaginary},  ... } format
     // as found in C's complex.h and C++'s <complex>.
     // Below we create a 3x1 column vector of complex data values:
-    array dB(3, 1, (cfloat*) hA); // 3x1 column vector of complex numbers
+    array dB(3, 1, (cfloat *)hA);  // 3x1 column vector of complex numbers
     af_print(dB);
 
     //! [ex_getting_started_init]
 
     vector<float> out(A.elements());
     A.host(&out.front());
-    for(unsigned int i = 0; i < out.size(); i++) ASSERT_FLOAT_EQ(hA[i], out[i]);
+    for (unsigned int i = 0; i < out.size(); i++)
+        ASSERT_FLOAT_EQ(hA[i], out[i]);
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_print)
-{
+TEST(GettingStarted, SNIPPET_getting_started_print) {
     //! [ex_getting_started_print]
     // Generate two arrays
     array a = randu(2, 2);
@@ -120,24 +130,24 @@ TEST(GettingStarted, SNIPPET_getting_started_print)
     b.host(&outb.front());
     result.host(&out.front());
 
-    for(unsigned i = 0; i < outb.size(); i++) ASSERT_FLOAT_EQ(outa[i] + outb[i] + 0.4, out[i]);
+    for (unsigned i = 0; i < outb.size(); i++)
+        ASSERT_FLOAT_EQ(outa[i] + outb[i] + 0.4, out[i]);
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_dims)
-{
+TEST(GettingStarted, SNIPPET_getting_started_dims) {
     //! [ex_getting_started_dims]
     // Create a 4x5x2 array of uniformly distributed random numbers
-    array a = randu(4,5,2);
+    array a = randu(4, 5, 2);
     // Determine the number of dimensions using the numdims() function:
-    printf("numdims(a)  %d\n",  a.numdims()); // 3
+    printf("numdims(a)  %d\n", a.numdims());  // 3
 
     // We can also find the size of the individual dimentions using either
     // the `dims` function:
-    printf("dims = [%lld %lld]\n", a.dims(0), a.dims(1)); // 4,5
+    printf("dims = [%lld %lld]\n", a.dims(0), a.dims(1));  // 4,5
 
-    // Or the elements of a af::dim4 object:
+    // Or the elements of a dim4 object:
     dim4 dims = a.dims();
-    printf("dims = [%lld %lld]\n", dims[0], dims[1]); // 4,5
+    printf("dims = [%lld %lld]\n", dims[0], dims[1]);  // 4,5
     //! [ex_getting_started_dims]
 
     //! [ex_getting_started_prop]
@@ -150,11 +160,13 @@ TEST(GettingStarted, SNIPPET_getting_started_dims)
     printf("is complex? %d    is real? %d\n", a.iscomplex(), a.isreal());
 
     // if it is a column or row vector
-    printf("is vector? %d  column? %d  row? %d\n", a.isvector(), a.iscolumn(), a.isrow());
+    printf("is vector? %d  column? %d  row? %d\n", a.isvector(), a.iscolumn(),
+           a.isrow());
 
     // and whether or not the array is empty and how much memory it takes on
     // the device:
-    printf("empty? %d  total elements: %lld  bytes: %lu\n", a.isempty(), a.elements(), a.bytes());
+    printf("empty? %d  total elements: %lld  bytes: %zu\n", a.isempty(),
+           a.elements(), a.bytes());
     //! [ex_getting_started_prop]
 
     ASSERT_EQ(f32, a.type());
@@ -175,11 +187,10 @@ TEST(GettingStarted, SNIPPET_getting_started_dims)
     ASSERT_EQ(5, a.dims(1));
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_arith)
-{
+TEST(GettingStarted, SNIPPET_getting_started_arith) {
     //! [ex_getting_started_arith]
     array R = randu(3, 3);
-    af_print(constant(1, 3, 3) + af::complex(sin(R)));  // will be c32
+    af_print(constant(1, 3, 3) + complex(sin(R)));  // will be c32
 
     // rescale complex values to unit circle
     array a = randn(5, c32);
@@ -193,80 +204,80 @@ TEST(GettingStarted, SNIPPET_getting_started_arith)
     //! [ex_getting_started_arith]
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_dev_ptr)
-{
+TEST(GettingStarted, SNIPPET_getting_started_dev_ptr) {
 #ifdef __CUDACC__
     //! [ex_getting_started_dev_ptr]
-    // Create an array on the host, copy it into an ArrayFire 2x3 ArrayFire array
-    float host_ptr[] = {0,1,2,3,4,5};
+    // Create an array on the host and copy it into a 2x3 ArrayFire
+    // array
+    float host_ptr[] = {0, 1, 2, 3, 4, 5};
     array a(2, 3, host_ptr);
 
     // Create a CUDA device pointer, populate it with data from the host
     float *device_ptr;
-    cudaMalloc((void**)&device_ptr, 6*sizeof(float));
-    cudaMemcpy(device_ptr, host_ptr, 6*sizeof(float), cudaMemcpyHostToDevice);
+    cudaMalloc((void **)&device_ptr, 6 * sizeof(float));
+    cudaMemcpy(device_ptr, host_ptr, 6 * sizeof(float), cudaMemcpyHostToDevice);
 
     // Convert the CUDA-allocated device memory into an ArrayFire array:
-    array b(2,3, device_ptr, afDevice); // Note: afDevice (default: afHost)
+    array b(2, 3, device_ptr, afDevice);  // Note: afDevice (default: afHost)
     // Note that ArrayFire takes ownership over `device_ptr`, so memory will
     // be freed when `b` id destructed. Do not call cudaFree(device_ptr)!
 
     //! [ex_getting_started_dev_ptr]
-#endif //__CUDACC__
+#endif  //__CUDACC__
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_ptr)
-{
+TEST(GettingStarted, SNIPPET_getting_started_ptr) {
 #ifdef __CUDACC__
     //! [ex_getting_started_ptr]
     // Create an array consisting of 3 random numbers
     array a = randu(3, f32);
 
     // Copy an array on the device to the host:
-    float * host_a = a.host<float>();
+    float *host_a = a.host<float>();
     // access the host data as a normal array
     printf("host_a[2] = %g\n", host_a[2]);  // last element
-    // and free memory using delete:
-    delete[] host_a;
+    // and free memory using freeHost:
+    freeHost(host_a);
 
     // Get access to the device memory for a CUDA kernel
-    float * d_cuda = a.device<float>();    // no need to free this
+    float *d_cuda = a.device<float>();  // no need to free this
     float value;
     cudaMemcpy(&value, d_cuda + 2, sizeof(float), cudaMemcpyDeviceToHost);
     printf("d_cuda[2] = %g\n", value);
-    a.unlock(); // unlock to allow garbage collection if necessary
+    a.unlock();  // unlock to allow garbage collection if necessary
 
     // Because OpenCL uses references rather than pointers, accessing memory
     // is similar, but has a somewhat clunky syntax. For the C-API
-    cl_mem d_opencl = (cl_mem) a.device<float>();
+    cl_mem d_opencl = (cl_mem)a.device<float>();
     // for the C++ API, you can just wrap this object into a cl::Buffer
     // after calling clRetainMemObject.
 
     //! [ex_getting_started_ptr]
-#endif //__CUDACC__
+#endif  //__CUDACC__
 }
 
-
-TEST(GettingStarted, SNIPPET_getting_started_scalar)
-{
+TEST(GettingStarted, SNIPPET_getting_started_scalar) {
     //! [ex_getting_started_scalar]
-    array a = randu(3);
+    array a   = randu(3);
     float val = a.scalar<float>();
     printf("scalar value: %g\n", val);
     //! [ex_getting_started_scalar]
 }
 
-TEST(GettingStarted, SNIPPET_getting_started_bit)
-{
+TEST(GettingStarted, SNIPPET_getting_started_bit) {
     //! [ex_getting_started_bit]
     int h_A[] = {1, 1, 0, 0, 4, 0, 0, 2, 0};
     int h_B[] = {1, 0, 1, 0, 1, 0, 1, 1, 1};
     array A = array(3, 3, h_A), B = array(3, 3, h_B);
-    af_print(A); af_print(B);
-
-    array A_and_B = A & B; af_print(A_and_B);
-    array  A_or_B = A | B; af_print(A_or_B);
-    array A_xor_B = A ^ B; af_print(A_xor_B);
+    af_print(A);
+    af_print(B);
+
+    array A_and_B = A & B;
+    af_print(A_and_B);
+    array A_or_B = A | B;
+    af_print(A_or_B);
+    array A_xor_B = A ^ B;
+    af_print(A_xor_B);
     //! [ex_getting_started_bit]
 
     vector<int> Andout(A_and_B.elements());
@@ -276,24 +287,37 @@ TEST(GettingStarted, SNIPPET_getting_started_bit)
     A_or_B.host(&Orout.front());
     A_xor_B.host(&Xorout.front());
 
-
-    for(unsigned int i = 0; i < Andout.size(); i++) ASSERT_FLOAT_EQ(h_A[i] & h_B[i], Andout[i]);
-    for(unsigned int i = 0; i < Orout.size(); i++)  ASSERT_FLOAT_EQ(h_A[i] | h_B[i], Orout[i]);
-    for(unsigned int i = 0; i < Xorout.size(); i++) ASSERT_FLOAT_EQ(h_A[i] ^ h_B[i], Xorout[i]);
+    for (unsigned int i = 0; i < Andout.size(); i++)
+        ASSERT_FLOAT_EQ(h_A[i] & h_B[i], Andout[i]);
+    for (unsigned int i = 0; i < Orout.size(); i++)
+        ASSERT_FLOAT_EQ(h_A[i] | h_B[i], Orout[i]);
+    for (unsigned int i = 0; i < Xorout.size(); i++)
+        ASSERT_FLOAT_EQ(h_A[i] ^ h_B[i], Xorout[i]);
 }
 
-
-TEST(GettingStarted, SNIPPET_getting_started_constants)
-{
+TEST(GettingStarted, SNIPPET_getting_started_constants) {
     //! [ex_getting_started_constants]
-    array A = randu(5,5);
-    A(where(A > .5)) = af::NaN;
+    array A          = randu(5, 5);
+    A(where(A > .5)) = NaN;
 
     array x = randu(10e6), y = randu(10e6);
-    double pi_est = 4 * sum<float>(hypot(x,y) < 1) / 10e6;
+    double pi_est = 4 * sum<float>(hypot(x, y) < 1) / 10e6;
     printf("estimation error: %g\n", fabs(Pi - pi_est));
     //! [ex_getting_started_constants]
 
-    ASSERT_LE(fabs(Pi-pi_est), 0.005);
+    ASSERT_LE(fabs(Pi - pi_est), 0.005);
 }
 
+TEST(GettingStarted, SNIPPET_JohnTest) {
+    array a = iota(dim4(2, 3));
+    array b = sum(a);     // sum across the first axis, same as sum(a, 0)
+    array c = sum(a, 1);  // sum across the second axis
+    array d = sum(a, 2);  // sum across the third axis
+    array e = sum(a, 3);  // sum across the fourth axis
+    // array f = sum(a, 4); fails due to stepping out of bounds
+    af_print(a);
+    af_print(b);
+    af_print(c);
+    af_print(d);
+    af_print(e);
+}
\ No newline at end of file
diff --git a/test/gfor.cpp b/test/gfor.cpp
index c9e82b538d..3e3d95e51d 100644
--- a/test/gfor.cpp
+++ b/test/gfor.cpp
@@ -7,217 +7,560 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
-using namespace af;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::freeHost;
+using af::gforSet;
+using af::iota;
+using af::randu;
+using af::seq;
+using af::span;
+using std::endl;
+using std::string;
+using std::vector;
 
-TEST(GFOR, Assign_Scalar_Span)
-{
-    const int num = 1000;
+TEST(GFOR, Assign_Scalar_Span) {
+    const int num   = 1000;
     const float val = 3;
-    array A = randu(num);
+    array A         = randu(num);
 
-    gfor(seq ii, num) {
-        A(ii) = val;
-    }
+    gfor(seq ii, num) { A(ii) = val; }
 
     float *hA = A.host<float>();
 
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(hA[i], val);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hA[i], val); }
 
-    delete[] hA;
+    freeHost(hA);
 }
 
-TEST(GFOR, Assign_Scalar_Seq)
-{
-    const int num = 1000;
-    const int st = 100;
-    const int en = 500;
+TEST(GFOR, Assign_Scalar_Seq) {
+    const int num   = 1000;
+    const int st    = 100;
+    const int en    = 500;
     const float val = 3;
-    array A = randu(num);
-    array B = A.copy();
+    array A         = randu(num);
+    array B         = A.copy();
 
-    gfor(seq ii, st, en) {
-        A(ii) = val;
-    }
+    gfor(seq ii, st, en) { A(ii) = val; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
 
     for (int i = 0; i < num; i++) {
-        if (i >= st && i <= en) ASSERT_EQ(hA[i], val);
-        else ASSERT_EQ(hA[i], hB[i]);
+        if (i >= st && i <= en)
+            ASSERT_EQ(hA[i], val);
+        else
+            ASSERT_EQ(hA[i], hB[i]);
     }
 
-    delete[] hA;
-    delete[] hB;
+    freeHost(hA);
+    freeHost(hB);
 }
 
-TEST(GFOR, Inc_Scalar_Span)
-{
-    const int num = 1000;
+TEST(GFOR, Inc_Scalar_Span) {
+    const int num   = 1000;
     const float val = 3;
-    array A = randu(num);
-    array B = A.copy();
+    array A         = randu(num);
+    array B         = A.copy();
 
-    gfor(seq ii, num) {
-        A(ii) += val;
-    }
+    gfor(seq ii, num) { A(ii) += val; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
 
-    for (int i = 0; i < num; i++) {
-        ASSERT_EQ(hA[i], val + hB[i]);
-    }
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hA[i], val + hB[i]); }
 
-    delete[] hA;
-    delete[] hB;
+    freeHost(hA);
+    freeHost(hB);
 }
 
-TEST(GFOR, Inc_Scalar_Seq)
-{
-    const int num = 1000;
-    const int st = 100;
-    const int en = 500;
+TEST(GFOR, Inc_Scalar_Seq) {
+    const int num   = 1000;
+    const int st    = 100;
+    const int en    = 500;
     const float val = 3;
-    array A = randu(num);
-    array B = A.copy();
+    array A         = randu(num);
+    array B         = A.copy();
 
-    gfor(seq ii, st, en) {
-        A(ii) += val;
-    }
+    gfor(seq ii, st, en) { A(ii) += val; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
 
     for (int i = 0; i < num; i++) {
-        if (i >= st && i <= en) ASSERT_EQ(hA[i], hB[i] + val);
-        else ASSERT_EQ(hA[i], hB[i]);
+        if (i >= st && i <= en)
+            ASSERT_EQ(hA[i], hB[i] + val);
+        else
+            ASSERT_EQ(hA[i], hB[i]);
     }
 
-    delete[] hA;
-    delete[] hB;
+    freeHost(hA);
+    freeHost(hB);
 }
 
-TEST(GFOR, Assign_Array_Span)
-{
+TEST(GFOR, Assign_Array_Span) {
     const int nx = 1000;
-    array A = randu(nx);
-    array B = randu(1, 1);
+    array A      = randu(nx);
+    array B      = randu(1, 1);
 
-    gfor(seq ii, nx) {
-        A(ii) = B;
-    }
+    gfor(seq ii, nx) { A(ii) = B; }
 
     float *hA = A.host<float>();
     float val = B.scalar<float>();
 
-    for (int i = 0; i < nx; i++) {
-        ASSERT_EQ(hA[i], val);
-    }
+    ASSERT_ARRAYS_EQ(A, constant(val, nx));
 
-    delete[] hA;
+    freeHost(hA);
 }
 
-TEST(GFOR, Assign_Array_Seq)
-{
+TEST(GFOR, Assign_Array_Seq) {
     const int nx = 1000;
     const int ny = 25;
     const int st = 100;
     const int en = 500;
-    array A = randu(nx, ny);
-    array B = A.copy();
-    array C = randu(1, ny);
+    array A      = randu(nx, ny);
+    array B      = A.copy();
+    array C      = randu(1, ny);
 
-    gfor(seq ii, st, en) {
-        A(ii, span) = C;
-    }
+    gfor(seq ii, st, en) { A(ii, span) = C; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
     float *hC = C.host<float>();
 
     for (int j = 0; j < ny; j++) {
-        float val = hC[j];
+        float val     = hC[j];
         const int off = j * nx;
         for (int i = 0; i < nx; i++) {
-            if (i >= st && i <= en) ASSERT_EQ(hA[i + off], val);
-            else ASSERT_EQ(hA[i + off], hB[i + off]);
+            if (i >= st && i <= en)
+                ASSERT_EQ(hA[i + off], val);
+            else
+                ASSERT_EQ(hA[i + off], hB[i + off]);
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hC;
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
 }
 
-TEST(GFOR, Inc_Array_Span)
-{
+TEST(GFOR, Inc_Array_Span) {
     const int nx = 1000;
-    array A = randu(nx);
-    array B = A.copy();
-    array C = randu(1, 1);
+    array A      = randu(nx);
+    array B      = A.copy();
+    array C      = randu(1, 1);
 
-    gfor(seq ii, nx) {
-        A(ii) += C;
-    }
+    gfor(seq ii, nx) { A(ii) += C; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
     float val = C.scalar<float>();
 
-    for (int i = 0; i < nx; i++) {
-        ASSERT_EQ(hA[i], val + hB[i]);
-    }
+    for (int i = 0; i < nx; i++) { ASSERT_EQ(hA[i], val + hB[i]); }
 
-    delete[] hA;
-    delete[] hB;
+    freeHost(hA);
+    freeHost(hB);
 }
 
-TEST(GFOR, Inc_Array_Seq)
-{
+TEST(GFOR, Inc_Array_Seq) {
     const int nx = 1000;
     const int ny = 25;
     const int st = 100;
     const int en = 500;
-    array A = randu(nx, ny);
-    array B = A.copy();
-    array C = randu(1, ny);
+    array A      = randu(nx, ny);
+    array B      = A.copy();
+    array C      = randu(1, ny);
 
-    gfor(seq ii, st, en) {
-        A(ii, span) += C;
-    }
+    gfor(seq ii, st, en) { A(ii, span) += C; }
 
     float *hA = A.host<float>();
     float *hB = B.host<float>();
     float *hC = C.host<float>();
 
     for (int j = 0; j < ny; j++) {
-        float val = hC[j];
+        float val     = hC[j];
         const int off = j * nx;
         for (int i = 0; i < nx; i++) {
-            if (i >= st && i <= en) ASSERT_EQ(hA[i + off], val + hB[i + off]);
-            else ASSERT_EQ(hA[i + off], hB[i + off]);
+            if (i >= st && i <= en)
+                ASSERT_EQ(hA[i + off], val + hB[i + off]);
+            else
+                ASSERT_EQ(hA[i + off], hB[i + off]);
+        }
+    }
+
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 2D0) {
+    const int nx = 1000;
+    const int ny = 10;
+    array A      = randu(nx, ny);
+    array B      = randu(1, ny);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int j = 0; j < ny; j++) {
+        for (int i = 0; i < nx; i++) {
+            ASSERT_EQ(hA[j * nx + i] + hB[j], hC[j * nx + i]);
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 2D1) {
+    const int nx = 1000;
+    const int ny = 10;
+    array A      = randu(nx, ny);
+    array B      = randu(nx, 1);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int j = 0; j < ny; j++) {
+        for (int i = 0; i < nx; i++) {
+            ASSERT_EQ(hA[j * nx + i] + hB[i], hC[j * nx + i]);
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 3D0) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    array A      = randu(nx, ny, nz);
+    array B      = randu(1, ny, nz);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int k = 0; k < nz; k++) {
+        for (int j = 0; j < ny; j++) {
+            for (int i = 0; i < nx; i++) {
+                ASSERT_EQ(hA[k * ny * nx + j * nx + i] + hB[k * ny + j],
+                          hC[k * ny * nx + j * nx + i]);
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 3D1) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    array A      = randu(nx, ny, nz);
+    array B      = randu(nx, 1, nz);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int k = 0; k < nz; k++) {
+        for (int j = 0; j < ny; j++) {
+            for (int i = 0; i < nx; i++) {
+                ASSERT_EQ(hA[k * ny * nx + j * nx + i] + hB[k * nx + i],
+                          hC[k * ny * nx + j * nx + i]);
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 3D2) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    array A      = randu(nx, ny, nz);
+    array B      = randu(nx, ny, 1);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int k = 0; k < nz; k++) {
+        for (int j = 0; j < ny; j++) {
+            for (int i = 0; i < nx; i++) {
+                ASSERT_EQ(hA[k * ny * nx + j * nx + i] + hB[j * nx + i],
+                          hC[k * ny * nx + j * nx + i]);
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 3D01) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    array A      = randu(nx, ny, nz);
+    array B      = randu(1, 1, nz);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int k = 0; k < nz; k++) {
+        for (int j = 0; j < ny; j++) {
+            for (int i = 0; i < nx; i++) {
+                ASSERT_EQ(hA[k * ny * nx + j * nx + i] + hB[k],
+                          hC[k * ny * nx + j * nx + i]);
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 3D_1_2) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    array A      = randu(nx, ny, 1);
+    array B      = randu(nx, 1, nz);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int k = 0; k < nz; k++) {
+        for (int j = 0; j < ny; j++) {
+            for (int i = 0; i < nx; i++) {
+                ASSERT_EQ(hA[j * nx + i] + hB[k * nx + i],
+                          hC[k * ny * nx + j * nx + i]);
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 4D3) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    const int nw = 2;
+    array A      = randu(nx, ny, nz, nw);
+    array B      = randu(nx, ny, nz, 1);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int l = 0; l < nw; l++) {
+        for (int k = 0; k < nz; k++) {
+            for (int j = 0; j < ny; j++) {
+                for (int i = 0; i < nx; i++) {
+                    ASSERT_EQ(hA[l * nz * ny * nx + k * ny * nx + j * nx + i] +
+                                  hB[k * ny * nx + j * nx + i],
+                              hC[l * nz * ny * nx + k * ny * nx + j * nx + i]);
+                }
+            }
         }
     }
 
-    delete[] hA;
-    delete[] hB;
-    delete[] hC;
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(BatchFunc, 4D_2_3) {
+    const int nx = 1000;
+    const int ny = 10;
+    const int nz = 3;
+    const int nw = 2;
+    array A      = randu(nx, 1, nz, nw);
+    array B      = randu(nx, ny, 1, 1);
+
+    gforSet(true);
+
+    array C = A + B;
+
+    float *hA = A.host<float>();
+    float *hB = B.host<float>();
+    float *hC = C.host<float>();
+
+    for (int l = 0; l < nw; l++) {
+        for (int k = 0; k < nz; k++) {
+            for (int j = 0; j < ny; j++) {
+                for (int i = 0; i < nx; i++) {
+                    ASSERT_EQ(hA[l * nz * nx + k * nx + i] + hB[j * nx + i],
+                              hC[l * nz * ny * nx + k * ny * nx + j * nx + i]);
+                }
+            }
+        }
+    }
+
+    gforSet(false);
+    freeHost(hA);
+    freeHost(hB);
+    freeHost(hC);
+}
+
+TEST(ASSIGN, ISSUE_1127) {
+    array orig  = randu(512, 768, 3);
+    array vert  = randu(512, 768, 3);
+    array horiz = randu(512, 768, 3);
+    array diag  = randu(512, 768, 3);
+
+    array out0 = constant(0, orig.dims(0) * 2, orig.dims(1) * 2, orig.dims(2));
+    array out1 = constant(0, orig.dims(0) * 2, orig.dims(1) * 2, orig.dims(2));
+    int rows = out0.dims(0), cols = out0.dims(1);
+
+    gfor(seq chan, 3) {
+        out0(seq(0, rows - 1, 2), seq(0, cols - 1, 2), chan) =
+            orig(span, span, chan);
+        out0(seq(1, rows - 1, 2), seq(0, cols - 1, 2), chan) =
+            vert(span, span, chan);
+        out0(seq(0, rows - 1, 2), seq(1, cols - 1, 2), chan) =
+            horiz(span, span, chan);
+        out0(seq(1, rows - 1, 2), seq(1, cols - 1, 2), chan) =
+            diag(span, span, chan);
+    }
+    out1(seq(0, rows - 1, 2), seq(0, cols - 1, 2), span) = orig;
+    out1(seq(1, rows - 1, 2), seq(0, cols - 1, 2), span) = vert;
+    out1(seq(0, rows - 1, 2), seq(1, cols - 1, 2), span) = horiz;
+    out1(seq(1, rows - 1, 2), seq(1, cols - 1, 2), span) = diag;
+
+    ASSERT_ARRAYS_EQ(out0, out1);
+}
+
+TEST(GFOR, ArithLoopWithNonUnitIncrSeq) {
+    const int nx    = 10;
+    const int ny    = 10;
+    const int batch = 10;
+    const int start = 0;
+    const int end   = 8;
+    const int incr  = 2;
+
+    array A = randu(nx, ny, batch);
+    array B = randu(nx, ny);
+    array C = constant(0, nx, ny, batch);
+    array G = constant(0, nx, ny, batch);
+
+    for (int i = 0; i < batch; i += incr) {
+        G(span, span, i) = A(span, span, i) * B;
+    }
+    gfor(seq ii, start, end, incr) {
+        C(span, span, ii) = A(span, span, ii) * B;
+    }
+    ASSERT_ARRAYS_EQ(C, G);
+}
+
+TEST(GFOR, MatmulLoopWithNonUnitIncrSeq) {
+    const int nx    = 10;
+    const int ny    = 10;
+    const int batch = 10;
+    const int start = 0;
+    const int end   = 8;
+    const int incr  = 2;
+
+    array A = randu(nx, ny, batch);
+    array B = randu(nx, ny);
+    array C = constant(0, nx, ny, batch);
+    array G = constant(0, nx, ny, batch);
+
+    for (int i = 0; i < batch; i += incr) {
+        G(span, span, i) = matmul(A(span, span, i), B);
+    }
+    gfor(seq ii, start, end, incr) {
+        C(span, span, ii) = matmul(A(span, span, ii), B);
+    }
+    ASSERT_ARRAYS_NEAR(C, G, 1E-03);
+}
+
+TEST(GFOR, ConstArrayIndexing) {
+    const std::size_t dim = 4;
+
+    array m        = iota(dim4(1, dim), dim4(dim));
+    const array cm = iota(dim4(1, dim), dim4(dim));
+
+    array out_cm(dim), out_m(dim);
+
+    EXPECT_NO_THROW({
+        gfor(seq i, static_cast<double>(dim)) {
+            out_cm(i) = af::sum(cm(span, i) * cm(span, i));
+        }
+    });
+    gfor(seq i, static_cast<double>(dim)) {
+        out_m(i) = af::sum(m(span, i) * m(span, i));
+    }
+    ASSERT_ARRAYS_EQ(out_cm, out_m);
 }
diff --git a/test/gloh.cpp b/test/gloh.cpp
new file mode 100644
index 0000000000..4ce2fa547b
--- /dev/null
+++ b/test/gloh.cpp
@@ -0,0 +1,342 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <cmath>
+#include <string>
+#include <typeinfo>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::features;
+using af::loadImage;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+typedef struct {
+    float f[5];
+    unsigned d[272];
+} feat_desc_t;
+
+typedef struct {
+    float f[5];
+} feat_t;
+
+typedef struct {
+    float d[272];
+} desc_t;
+
+static bool feat_cmp(feat_desc_t i, feat_desc_t j) {
+    for (int k = 0; k < 5; k++)
+        if (round(i.f[k] * 1e1f) != round(j.f[k] * 1e1f))
+            return (round(i.f[k] * 1e1f) < round(j.f[k] * 1e1f));
+
+    return false;
+}
+
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               float* desc, unsigned nfeat) {
+    feat.resize(nfeat);
+    for (size_t i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = ori[i];
+        feat[i].f[4] = size[i];
+        for (unsigned j = 0; j < 272; j++) feat[i].d[j] = desc[i * 272 + j];
+    }
+}
+
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               vector<vector<float>>& desc, unsigned nfeat) {
+    feat.resize(nfeat);
+    for (size_t i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = ori[i];
+        feat[i].f[4] = size[i];
+        for (unsigned j = 0; j < 272; j++) feat[i].d[j] = desc[i][j];
+    }
+}
+
+static void split_feat_desc(vector<feat_desc_t>& fd, vector<feat_t>& f,
+                            vector<desc_t>& d) {
+    f.resize(fd.size());
+    d.resize(fd.size());
+    for (size_t i = 0; i < fd.size(); i++) {
+        f[i].f[0] = fd[i].f[0];
+        f[i].f[1] = fd[i].f[1];
+        f[i].f[2] = fd[i].f[2];
+        f[i].f[3] = fd[i].f[3];
+        f[i].f[4] = fd[i].f[4];
+        for (unsigned j = 0; j < 272; j++) d[i].d[j] = fd[i].d[j];
+    }
+}
+
+static bool compareEuclidean(dim_t desc_len, dim_t ndesc, float* cpu,
+                             float* gpu, float unit_thr = 1.f,
+                             float euc_thr = 1.f) {
+    bool ret  = true;
+    float sum = 0.0f;
+
+    for (dim_t i = 0; i < ndesc; i++) {
+        sum = 0.0f;
+        for (dim_t l = 0; l < desc_len; l++) {
+            dim_t idx = i * desc_len + l;
+            float x   = (cpu[idx] - gpu[idx]);
+            sum += x * x;
+            if (abs(x) > (float)unit_thr) {
+                ret = false;
+                cout << endl << "@compareEuclidean: unit mismatch." << endl;
+                cout << "(cpu,gpu,cpu-gpu)[" << i << "," << l << "] : {"
+                     << cpu[idx] << "," << gpu[idx] << ","
+                     << cpu[idx] - gpu[idx] << "}" << endl;
+                cout << endl;
+                break;
+            }
+        }
+        if (sqrt(sum) > euc_thr) {
+            ret = false;
+            cout << endl << "@compareEuclidean: distance mismatch." << endl;
+            cout << "Euclidean distance: " << sqrt(sum) << endl;
+        }
+        if (ret == false) return ret;
+    }
+
+    return ret;
+}
+
+template<typename T>
+class GLOH : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+
+TYPED_TEST_SUITE(GLOH, TestTypes);
+
+template<typename T>
+void glohTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<float>> goldDesc;
+
+    readImageFeaturesDescriptors<float>(pTestFile, inDims, inFiles, goldFeat,
+                                        goldDesc);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray_f32 = 0;
+        af_array inArray     = 0;
+        af_array desc        = 0;
+        af_features feat;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/gloh/"));
+
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
+
+        ASSERT_SUCCESS(af_gloh(&feat, &desc, inArray, 3,
+                               0.04f, 10.0f, 1.6f,
+                               true, 1.f / 256.f, 0.05f));
+
+        dim_t n = 0;
+        af_array x, y, score, orientation, size;
+
+        ASSERT_SUCCESS(af_get_features_num(&n, feat));
+        ASSERT_SUCCESS(af_get_features_xpos(&x, feat));
+        ASSERT_SUCCESS(af_get_features_ypos(&y, feat));
+        ASSERT_SUCCESS(af_get_features_score(&score, feat));
+        ASSERT_SUCCESS(af_get_features_orientation(&orientation, feat));
+        ASSERT_SUCCESS(af_get_features_size(&size, feat));
+
+        float* outX           = new float[n];
+        float* outY           = new float[n];
+        float* outScore       = new float[n];
+        float* outOrientation = new float[n];
+        float* outSize        = new float[n];
+        dim_t descSize;
+        dim_t descDims[4];
+        ASSERT_SUCCESS(af_get_elements(&descSize, desc));
+        ASSERT_SUCCESS(af_get_dims(&descDims[0], &descDims[1], &descDims[2],
+                                   &descDims[3], desc));
+        float* outDesc = new float[descSize];
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outX, x));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outY, y));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outScore, score));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outOrientation, orientation));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outSize, size));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outDesc, desc));
+
+        vector<feat_desc_t> out_feat_desc;
+        array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                           outSize, outDesc, n);
+
+        vector<feat_desc_t> gold_feat_desc;
+        array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                           &goldFeat[1].front(), &goldFeat[2].front(),
+                           &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                           goldFeat[0].size());
+
+        std::stable_sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
+        std::stable_sort(gold_feat_desc.begin(), gold_feat_desc.end(),
+                         feat_cmp);
+
+        vector<feat_t> out_feat;
+        vector<desc_t> v_out_desc;
+        vector<feat_t> gold_feat;
+        vector<desc_t> v_gold_desc;
+
+        split_feat_desc(out_feat_desc, out_feat, v_out_desc);
+        split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
+
+        for (int elIter = 0; elIter < (int)n; elIter++) {
+            ASSERT_LE(fabs(out_feat[elIter].f[0] - gold_feat[elIter].f[0]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[1] - gold_feat[elIter].f[1]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]),
+                      0.5f)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]),
+                      1e-3)
+                << "at: " << elIter << endl;
+        }
+
+        EXPECT_TRUE(compareEuclidean(descDims[0], descDims[1],
+                                     (float*)&v_out_desc[0],
+                                     (float*)&v_gold_desc[0], 2.f, 5.5f));
+
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
+
+        ASSERT_SUCCESS(af_release_array(desc));
+        ASSERT_SUCCESS(af_release_features(feat));
+
+        delete[] outX;
+        delete[] outY;
+        delete[] outScore;
+        delete[] outOrientation;
+        delete[] outSize;
+        delete[] outDesc;
+    }
+}
+
+#define GLOH_INIT(desc, image)                                         \
+    TYPED_TEST(GLOH, desc) {                                           \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                        \
+        glohTest<TypeParam>(string(TEST_DIR "/gloh/" #image ".test")); \
+    }
+
+GLOH_INIT(man, man);
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(GLOH, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<float>> goldDesc;
+
+    readImageFeaturesDescriptors<float>(string(TEST_DIR "/gloh/man.test"),
+                                        inDims, inFiles, goldFeat, goldDesc);
+    inFiles[0].insert(0, string(TEST_DIR "/gloh/"));
+
+    array in = loadImage(inFiles[0].c_str(), false);
+
+    features feat;
+    array desc;
+    gloh(feat, desc, in, 3, 0.04f, 10.0f, 1.6f, true, 1.f / 256.f, 0.05f);
+
+    float* outX           = new float[feat.getNumFeatures()];
+    float* outY           = new float[feat.getNumFeatures()];
+    float* outScore       = new float[feat.getNumFeatures()];
+    float* outOrientation = new float[feat.getNumFeatures()];
+    float* outSize        = new float[feat.getNumFeatures()];
+    float* outDesc        = new float[desc.elements()];
+    dim4 descDims         = desc.dims();
+    feat.getX().host(outX);
+    feat.getY().host(outY);
+    feat.getScore().host(outScore);
+    feat.getOrientation().host(outOrientation);
+    feat.getSize().host(outSize);
+    desc.host(outDesc);
+
+    vector<feat_desc_t> out_feat_desc;
+    array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                       outSize, outDesc, feat.getNumFeatures());
+
+    vector<feat_desc_t> gold_feat_desc;
+    array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                       &goldFeat[1].front(), &goldFeat[2].front(),
+                       &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                       goldFeat[0].size());
+
+    std::stable_sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
+    std::stable_sort(gold_feat_desc.begin(), gold_feat_desc.end(), feat_cmp);
+
+    vector<feat_t> out_feat;
+    vector<desc_t> v_out_desc;
+    vector<feat_t> gold_feat;
+    vector<desc_t> v_gold_desc;
+
+    split_feat_desc(out_feat_desc, out_feat, v_out_desc);
+    split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
+
+    for (int elIter = 0; elIter < (int)feat.getNumFeatures(); elIter++) {
+        ASSERT_LE(fabs(out_feat[elIter].f[0] - gold_feat[elIter].f[0]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[1] - gold_feat[elIter].f[1]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]), 0.5f)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]), 1e-3)
+            << "at: " << elIter << endl;
+    }
+
+    EXPECT_TRUE(compareEuclidean(descDims[0], descDims[1],
+                                 (float*)&v_out_desc[0],
+                                 (float*)&v_gold_desc[0], 2.f, 5.5f));
+
+    delete[] outX;
+    delete[] outY;
+    delete[] outScore;
+    delete[] outOrientation;
+    delete[] outSize;
+    delete[] outDesc;
+}
diff --git a/test/gradient.cpp b/test/gradient.cpp
index e283a450f3..5d04d3dd98 100644
--- a/test/gradient.cpp
+++ b/test/gradient.cpp
@@ -7,129 +7,137 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Grad : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Grad : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Grad, TestTypes);
+TYPED_TEST_SUITE(Grad, TestTypes);
 
 template<typename T>
-void gradTest(string pTestFile, const unsigned resultIdx0, const unsigned resultIdx1, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void gradTest(string pTestFile, const unsigned resultIdx0,
+              const unsigned resultIdx1, bool isSubRef = false,
+              const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
+    af_array inArray   = 0;
     af_array tempArray = 0;
-    af_array g0Array = 0;
-    af_array g1Array = 0;
+    af_array g0Array   = 0;
+    af_array g1Array   = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_gradient(&g0Array, &g1Array, inArray));
+    ASSERT_SUCCESS(af_gradient(&g0Array, &g1Array, inArray));
 
     size_t nElems = tests[resultIdx0].size();
     // Get result
     T* grad0Data = new T[tests[resultIdx0].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)grad0Data, g0Array));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)grad0Data, g0Array));
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], grad0Data[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx0][elIter], grad0Data[elIter])
+            << "at: " << elIter << endl;
     }
 
     // Get result
     T* grad1Data = new T[tests[resultIdx1].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)grad1Data, g1Array));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)grad1Data, g1Array));
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], grad1Data[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx1][elIter], grad1Data[elIter])
+            << "at: " << elIter << endl;
     }
 
-
     // Delete
     delete[] grad0Data;
     delete[] grad1Data;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(g0Array   != 0) af_release_array(g0Array);
-    if(g1Array   != 0) af_release_array(g1Array);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (g0Array != 0) af_release_array(g0Array);
+    if (g1Array != 0) af_release_array(g1Array);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define GRAD_INIT(desc, file, resultIdx0, resultIdx1)                                       \
-    TYPED_TEST(Grad, desc)                                                                  \
-    {                                                                                       \
-        gradTest<TypeParam>(string(TEST_DIR"/grad/"#file".test"), resultIdx0, resultIdx1);  \
+#define GRAD_INIT(desc, file, resultIdx0, resultIdx1)                \
+    TYPED_TEST(Grad, desc) {                                         \
+        gradTest<TypeParam>(string(TEST_DIR "/grad/" #file ".test"), \
+                            resultIdx0, resultIdx1);                 \
     }
 
-    GRAD_INIT(Grad0, grad, 0, 1);
-    GRAD_INIT(Grad1, grad2D, 0, 1);
-    GRAD_INIT(Grad2, grad3D, 0, 1);
-
+GRAD_INIT(Grad0, grad, 0, 1);
+GRAD_INIT(Grad1, grad2D, 0, 1);
+GRAD_INIT(Grad2, grad3D, 0, 1);
 
-/////////////////////////////////////// CPP ///////////////////////////////////////////
+/////////////////////////////////////// CPP
+//////////////////////////////////////////////
 //
-TEST(Grad, CPP)
-{
-    if (noDoubleTests<float>()) return;
 
+using af::array;
+
+TEST(Grad, CPP) {
     const unsigned resultIdx0 = 0;
     const unsigned resultIdx1 = 1;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, float>(string(TEST_DIR"/grad/grad3D.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/grad/grad3D.test"),
+                                   numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af::array input(idims, &(in[0].front()));
-    af::array g0, g1;
-    af::grad(g0, g1, input);
+    array input(idims, &(in[0].front()));
+    array g0, g1;
+    grad(g0, g1, input);
 
     size_t nElems = tests[resultIdx0].size();
     // Get result
@@ -138,7 +146,8 @@ TEST(Grad, CPP)
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], grad0Data[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx0][elIter], grad0Data[elIter])
+            << "at: " << elIter << endl;
     }
 
     // Get result
@@ -147,10 +156,25 @@ TEST(Grad, CPP)
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], grad1Data[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx1][elIter], grad1Data[elIter])
+            << "at: " << elIter << endl;
     }
 
     // Delete
     delete[] grad0Data;
     delete[] grad1Data;
 }
+
+TEST(Grad, MaxDim) {
+    using af::constant;
+    using af::sum;
+
+    const size_t largeDim = 65535 * 8 + 1;
+
+    array input = constant(1, 2, largeDim);
+    array g0, g1;
+    grad(g0, g1, input);
+
+    ASSERT_EQ(0.f, sum<float>(g0));
+    ASSERT_EQ(0.f, sum<float>(g1));
+}
diff --git a/test/gray_rgb.cpp b/test/gray_rgb.cpp
new file mode 100644
index 0000000000..16a085fb80
--- /dev/null
+++ b/test/gray_rgb.cpp
@@ -0,0 +1,130 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::randu;
+using std::vector;
+
+TEST(rgb_gray, 32bit) {
+    array rgb  = randu(10, 10, 3);
+    array gray = rgb2gray(rgb);
+
+    vector<float> h_rgb(rgb.elements());
+    vector<float> h_gray(gray.elements());
+
+    rgb.host(&h_rgb[0]);
+    gray.host(&h_gray[0]);
+
+    int num  = gray.elements();
+    int roff = 0;
+    int goff = num;
+    int boff = 2 * num;
+
+    const float rPercent = 0.2126f;
+    const float gPercent = 0.7152f;
+    const float bPercent = 0.0722f;
+
+    for (int i = 0; i < num; i++) {
+        float res = rPercent * h_rgb[i + roff] + gPercent * h_rgb[i + goff] +
+                    bPercent * h_rgb[i + boff];
+
+        ASSERT_FLOAT_EQ(res, h_gray[i]);
+    }
+}
+
+TEST(rgb_gray, 8bit) {
+    array rgb  = randu(10, 10, 3, u8);
+    array gray = rgb2gray(rgb);
+
+    vector<uchar> h_rgb(rgb.elements());
+    vector<float> h_gray(gray.elements());
+
+    rgb.host(&h_rgb[0]);
+    gray.host(&h_gray[0]);
+
+    int num  = gray.elements();
+    int roff = 0;
+    int goff = num;
+    int boff = 2 * num;
+
+    const float rPercent = 0.2126f;
+    const float gPercent = 0.7152f;
+    const float bPercent = 0.0722f;
+
+    for (int i = 0; i < num; i++) {
+        float res = rPercent * h_rgb[i + roff] + gPercent * h_rgb[i + goff] +
+                    bPercent * h_rgb[i + boff];
+
+        ASSERT_FLOAT_EQ(res, h_gray[i]);
+    }
+}
+
+TEST(gray_rgb, 32bit) {
+    array gray = randu(10, 10);
+
+    const float rPercent = 0.33f;
+    const float gPercent = 0.34f;
+    const float bPercent = 0.33f;
+    array rgb = gray2rgb(gray, rPercent, gPercent, bPercent);
+    vector<float> h_rgb(rgb.elements());
+    vector<float> h_gray(gray.elements());
+    rgb.host(&h_rgb[0]);
+    gray.host(&h_gray[0]);
+    int num  = gray.elements();
+    int roff = 0;
+    int goff = num;
+    int boff = 2 * num;
+
+    for (int i = 0; i < num; i++) {
+        float gray = h_gray[i];
+
+        float r = rPercent * gray;
+        float g = gPercent * gray;
+        float b = bPercent * gray;
+
+        ASSERT_FLOAT_EQ(r, h_rgb[i + roff]);
+        ASSERT_FLOAT_EQ(g, h_rgb[i + goff]);
+        ASSERT_FLOAT_EQ(b, h_rgb[i + boff]);
+    }
+}
+
+TEST(rgb_gray, MaxDim) {
+    size_t largeDim = 65535 * 32 + 1;
+    array rgb       = randu(1, largeDim, 3, u8);
+    array gray      = rgb2gray(rgb);
+
+    vector<uchar> h_rgb(rgb.elements());
+    vector<float> h_gray(gray.elements());
+
+    rgb.host(&h_rgb[0]);
+    gray.host(&h_gray[0]);
+
+    int num  = gray.elements();
+    int roff = 0;
+    int goff = num;
+    int boff = 2 * num;
+
+    const float rPercent = 0.2126f;
+    const float gPercent = 0.7152f;
+    const float bPercent = 0.0722f;
+
+    for (int i = 0; i < num; i++) {
+        float res = rPercent * h_rgb[i + roff] + gPercent * h_rgb[i + goff] +
+                    bPercent * h_rgb[i + boff];
+
+        ASSERT_FLOAT_EQ(res, h_gray[i]);
+    }
+}
diff --git a/test/gtest b/test/gtest
deleted file mode 160000
index 23574bf233..0000000000
--- a/test/gtest
+++ /dev/null
@@ -1 +0,0 @@
-Subproject commit 23574bf2333f834ff665f894c97bef8a5b33a0a9
diff --git a/test/half.cpp b/test/half.cpp
new file mode 100644
index 0000000000..8afb6d5f4d
--- /dev/null
+++ b/test/half.cpp
@@ -0,0 +1,148 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <iostream>
+#include <vector>
+
+#include <../extern/half/include/half.hpp>
+#include <testHelpers.hpp>
+
+using af::array;
+using af::constant;
+using af::half;
+using std::vector;
+
+TEST(Half, print) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    array aa = af::constant(3.14, 3, 3, f16);
+    array bb = af::constant(2, 3, 3, f16);
+    af_print(aa);
+}
+
+struct convert_params {
+    af_dtype from, to;
+    double value;
+    convert_params(af_dtype f, af_dtype t, double v)
+        : from(f), to(t), value(v) {}
+};
+
+class HalfConvert : public ::testing::TestWithParam<convert_params> {};
+
+INSTANTIATE_TEST_SUITE_P(ToF16, HalfConvert,
+                         ::testing::Values(convert_params(f32, f16, 10),
+                                           convert_params(f64, f16, 10),
+                                           convert_params(s32, f16, 10),
+                                           convert_params(u32, f16, 10),
+                                           convert_params(s8, f16, 10),
+                                           convert_params(u8, f16, 10),
+                                           convert_params(s64, f16, 10),
+                                           convert_params(u64, f16, 10),
+                                           convert_params(s16, f16, 10),
+                                           convert_params(u16, f16, 10),
+                                           convert_params(f16, f16, 10)));
+
+INSTANTIATE_TEST_SUITE_P(FromF16, HalfConvert,
+                         ::testing::Values(convert_params(f16, f32, 10),
+                                           convert_params(f16, f64, 10),
+                                           convert_params(f16, s32, 10),
+                                           convert_params(f16, u32, 10),
+                                           convert_params(f16, s8, 10),
+                                           convert_params(f16, u8, 10),
+                                           convert_params(f16, s64, 10),
+                                           convert_params(f16, u64, 10),
+                                           convert_params(f16, s16, 10),
+                                           convert_params(f16, u16, 10),
+                                           convert_params(f16, f16, 10)));
+
+TEST_P(HalfConvert, convert) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    convert_params params = GetParam();
+    if (noDoubleTests(params.to))
+        GTEST_SKIP() << "Double not supported on this device";
+    if (noDoubleTests(params.from))
+        GTEST_SKIP() << "Double not supported on this device";
+
+    array from = af::constant(params.value, 3, 3, params.from);
+    array to   = from.as(params.to);
+
+    ASSERT_EQ(from.type(), params.from);
+    ASSERT_EQ(to.type(), params.to);
+
+    array gold = af::constant(params.value, 3, 3, params.to);
+    ASSERT_ARRAYS_EQ(gold, to);
+}
+
+TEST(Half, arith) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    array aa = af::constant(3.14, 3, 3, f16);
+    array bb = af::constant(1, 3, 3, f16);
+
+    array gold   = constant(4.14, 3, 3, f16);
+    array result = bb + aa;
+
+    ASSERT_ARRAYS_EQ(gold, result);
+}
+
+TEST(Half, isInf) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    SKIP_IF_FAST_MATH_ENABLED();
+    half_float::half hinf = std::numeric_limits<half_float::half>::infinity();
+
+    vector<half_float::half> input(2, half_float::half(0));
+    input[0] = hinf;
+
+    array infarr(2, &input.front());
+
+    array res = isInf(infarr);
+
+    vector<char> hgold(2, 0);
+    hgold[0] = 1;
+    array gold(2, &hgold.front());
+
+    ASSERT_ARRAYS_EQ(gold, res);
+}
+
+TEST(Half, isNan) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    SKIP_IF_FAST_MATH_ENABLED();
+    half_float::half hnan = std::numeric_limits<half_float::half>::quiet_NaN();
+
+    vector<half_float::half> input(2, half_float::half(0));
+    input[0] = hnan;
+
+    array nanarr(2, &input.front());
+
+    array res = isNaN(nanarr);
+
+    vector<char> hgold(2, 0);
+    hgold[0] = 1;
+    array gold(2, &hgold.front());
+
+    ASSERT_ARRAYS_EQ(gold, res);
+}
+
+TEST(Half, isZero) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    half_float::half hzero(0.f);
+
+    vector<half_float::half> input(2, half_float::half(1));
+    input[0] = hzero;
+
+    array zeroarr(2, &input.front());
+
+    array res = iszero(zeroarr);
+
+    vector<char> hgold(2, 0);
+    hgold[0] = 1;
+    array gold(2, &hgold.front());
+
+    ASSERT_ARRAYS_EQ(gold, res);
+}
diff --git a/test/hamming.cpp b/test/hamming.cpp
index 042ff30fd6..b8394e36b5 100644
--- a/test/hamming.cpp
+++ b/test/hamming.cpp
@@ -7,57 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::vector;
-using std::string;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class HammingMatcher8  : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class HammingMatcher8 : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 template<typename T>
-class HammingMatcher32 : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class HammingMatcher32 : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create lists of types to be tested
-typedef ::testing::Types<uchar> TestTypes8;
-typedef ::testing::Types<uint> TestTypes32;
+typedef ::testing::Types<uchar, ushort> TestTypes8;
+typedef ::testing::Types<uint, uintl> TestTypes32;
 
 // register the type list
-TYPED_TEST_CASE(HammingMatcher8,  TestTypes8);
-TYPED_TEST_CASE(HammingMatcher32, TestTypes32);
+TYPED_TEST_SUITE(HammingMatcher8, TestTypes8);
+TYPED_TEST_SUITE(HammingMatcher32, TestTypes32);
 
 template<typename T>
-void hammingMatcherTest(string pTestFile, int feat_dim)
-{
+void hammingMatcherTest(string pTestFile, int feat_dim) {
     using af::dim4;
 
-    vector<dim4>         numDims;
-    vector<vector<uint> >   in32;
-    vector<vector<uint> >  tests;
+    vector<dim4> numDims;
+    vector<vector<uint>> in32;
+    vector<vector<uint>> tests;
 
-    readTests<uint, uint, uint>(pTestFile, numDims, in32, tests);
+    readTests<uint, uint, int>(pTestFile, numDims, in32, tests);
 
-    vector<vector<T> > in(in32.size());
-    for (size_t i = 0; i < in32[0].size(); i++)
-        in[0].push_back((T)in32[0][i]);
-    for (size_t i = 0; i < in32[1].size(); i++)
-        in[1].push_back((T)in32[1][i]);
+    vector<vector<T>> in(in32.size());
+    for (size_t i = 0; i < in32[0].size(); i++) in[0].push_back((T)in32[0][i]);
+    for (size_t i = 0; i < in32[1].size(); i++) in[1].push_back((T)in32[1][i]);
 
     dim4 qDims     = numDims[0];
     dim4 tDims     = numDims[1];
@@ -66,12 +63,14 @@ void hammingMatcherTest(string pTestFile, int feat_dim)
     af_array idx   = 0;
     af_array dist  = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&query, &(in[0].front()),
-                qDims.ndims(), qDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&train, &(in[1].front()),
-                tDims.ndims(), tDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&query, &(in[0].front()), qDims.ndims(),
+                                   qDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&train, &(in[1].front()), tDims.ndims(),
+                                   tDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_hamming_matcher(&idx, &dist, query, train, feat_dim, 1));
+    ASSERT_SUCCESS(af_hamming_matcher(&idx, &dist, query, train, feat_dim, 1));
 
     vector<uint> goldIdx  = tests[0];
     vector<uint> goldDist = tests[1];
@@ -79,56 +78,62 @@ void hammingMatcherTest(string pTestFile, int feat_dim)
     uint *outIdx          = new uint[nElems];
     uint *outDist         = new uint[nElems];
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outIdx,  idx));
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outDist, dist));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outIdx, idx));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outDist, dist));
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(goldDist[elIter], outDist[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
     }
 
     delete[] outIdx;
     delete[] outDist;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(query));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(train));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idx));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(dist));
+    ASSERT_SUCCESS(af_release_array(query));
+    ASSERT_SUCCESS(af_release_array(train));
+    ASSERT_SUCCESS(af_release_array(idx));
+    ASSERT_SUCCESS(af_release_array(dist));
 }
 
-TYPED_TEST(HammingMatcher8, Hamming_500_5000_Dim0)
-{
-    hammingMatcherTest<TypeParam>(string(TEST_DIR"/hamming/hamming_500_5000_dim0_u8.test"), 0);
+TYPED_TEST(HammingMatcher8, Hamming_500_5000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    hammingMatcherTest<TypeParam>(
+        string(TEST_DIR "/hamming/hamming_500_5000_dim0_u8.test"), 0);
 }
 
-TYPED_TEST(HammingMatcher8, Hamming_500_5000_Dim1)
-{
-    hammingMatcherTest<TypeParam>(string(TEST_DIR"/hamming/hamming_500_5000_dim1_u8.test"), 1);
+TYPED_TEST(HammingMatcher8, Hamming_500_5000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    hammingMatcherTest<TypeParam>(
+        string(TEST_DIR "/hamming/hamming_500_5000_dim1_u8.test"), 1);
 }
 
-TYPED_TEST(HammingMatcher32, Hamming_500_5000_Dim0)
-{
-    hammingMatcherTest<TypeParam>(string(TEST_DIR"/hamming/hamming_500_5000_dim0_u32.test"), 0);
+TYPED_TEST(HammingMatcher32, Hamming_500_5000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    hammingMatcherTest<TypeParam>(
+        string(TEST_DIR "/hamming/hamming_500_5000_dim0_u32.test"), 0);
 }
 
-TYPED_TEST(HammingMatcher32, Hamming_500_5000_Dim1)
-{
-    hammingMatcherTest<TypeParam>(string(TEST_DIR"/hamming/hamming_500_5000_dim1_u32.test"), 1);
+TYPED_TEST(HammingMatcher32, Hamming_500_5000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    hammingMatcherTest<TypeParam>(
+        string(TEST_DIR "/hamming/hamming_500_5000_dim1_u32.test"), 1);
 }
 
 ///////////////////////////////////// CPP ////////////////////////////////
 //
-TEST(HammingMatcher, CPP)
-{
+TEST(HammingMatcher, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
     using af::array;
     using af::dim4;
 
-    vector<dim4>         numDims;
-    vector<vector<uint> >     in;
-    vector<vector<uint> >  tests;
+    vector<dim4> numDims;
+    vector<vector<uint>> in;
+    vector<vector<uint>> tests;
 
-    readTests<uint, uint, uint>(TEST_DIR"/hamming/hamming_500_5000_dim0_u32.test", numDims, in, tests);
+    readTests<uint, uint, int>(
+        TEST_DIR "/hamming/hamming_500_5000_dim0_u32.test", numDims, in, tests);
 
-    dim4 qDims     = numDims[0];
-    dim4 tDims     = numDims[1];
+    dim4 qDims = numDims[0];
+    dim4 tDims = numDims[1];
 
     array query(qDims, &(in[0].front()));
     array train(tDims, &(in[1].front()));
@@ -145,8 +150,48 @@ TEST(HammingMatcher, CPP)
     idx.host(outIdx);
     dist.host(outDist);
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(goldDist[elIter], outDist[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    delete[] outIdx;
+    delete[] outDist;
+}
+
+TEST(HammingMatcher64bit, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    using af::array;
+    using af::dim4;
+
+    vector<dim4> numDims;
+    vector<vector<unsigned long long>> in;
+    vector<vector<unsigned long long>> tests;
+
+    readTests<unsigned long long, unsigned long long, int>(
+        TEST_DIR "/hamming/hamming_500_5000_dim0_u32.test", numDims, in, tests);
+
+    dim4 qDims = numDims[0];
+    dim4 tDims = numDims[1];
+
+    array query(qDims, &(in[0].front()));
+    array train(tDims, &(in[1].front()));
+
+    array idx, dist;
+    hammingMatcher(idx, dist, query, train, 0, 1);
+
+    vector<unsigned long long> goldIdx  = tests[0];
+    vector<unsigned long long> goldDist = tests[1];
+    size_t nElems                       = goldIdx.size();
+    uint *outIdx                        = new uint[nElems];
+    uint *outDist                       = new uint[nElems];
+
+    idx.host(outIdx);
+    dist.host(outDist);
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
     }
 
     delete[] outIdx;
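
Note on the Hamming matcher tests above: they compare the device result against gold distances read from a file. A minimal host-side sketch of what such a matcher computes, assuming descriptors are packed into 32-bit words and that ties keep the earliest training descriptor. The names `Match`, `popcount32`, `hammingDistance` and `nearestByHamming` are hypothetical and not part of ArrayFire, and the tie-breaking rule is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper type for this sketch, not ArrayFire API.
struct Match {
    std::size_t idx;  // index of the closest training descriptor
    unsigned dist;    // Hamming distance to that descriptor
};

// Count set bits by clearing the lowest set bit until none remain.
static unsigned popcount32(std::uint32_t v) {
    unsigned c = 0;
    while (v) {
        v &= v - 1;
        ++c;
    }
    return c;
}

// Hamming distance between two equal-length descriptors: XOR, then popcount.
static unsigned hammingDistance(const std::vector<std::uint32_t>& a,
                                const std::vector<std::uint32_t>& b) {
    unsigned d = 0;
    for (std::size_t i = 0; i < a.size(); ++i) d += popcount32(a[i] ^ b[i]);
    return d;
}

// Brute-force nearest neighbour; ties keep the earliest training descriptor
// (an assumption of this sketch).
Match nearestByHamming(const std::vector<std::uint32_t>& query,
                       const std::vector<std::vector<std::uint32_t>>& train) {
    Match best = {0, std::numeric_limits<unsigned>::max()};
    for (std::size_t t = 0; t < train.size(); ++t) {
        unsigned d = hammingDistance(query, train[t]);
        if (d < best.dist) {
            best.idx  = t;
            best.dist = d;
        }
    }
    return best;
}
```

A reference like this could be used to regenerate or sanity-check gold distances on the host, independent of any backend.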
diff --git a/test/harris.cpp b/test/harris.cpp
new file mode 100644
index 0000000000..f2fd27d47a
--- /dev/null
+++ b/test/harris.cpp
@@ -0,0 +1,222 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <cmath>
+#include <string>
+#include <typeinfo>
+#include <vector>
+
+using af::dim4;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
+
+typedef struct {
+    float f[5];
+} feat_t;
+
+static bool feat_cmp(feat_t i, feat_t j) {
+    for (int k = 0; k < 5; k++)
+        if (i.f[k] != j.f[k]) return (i.f[k] < j.f[k]);
+
+    return false;
+}
+
+static void array_to_feat(vector<feat_t> &feat, float *x, float *y,
+                          float *score, float *orientation, float *size,
+                          unsigned nfeat) {
+    feat.resize(nfeat);
+    for (unsigned i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = orientation[i];
+        feat[i].f[4] = size[i];
+    }
+}
+
+template<typename T>
+class Harris : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+
+TYPED_TEST_SUITE(Harris, TestTypes);
+
+template<typename T>
+void harrisTest(string pTestFile, float sigma, unsigned block_size) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(pTestFile, inDims, inFiles, gold);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        dim_t nElems         = 0;
+        af_array inArray_f32 = 0;
+        af_array inArray     = 0;
+        af_features out;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/harris/"));
+
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
+
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
+
+        ASSERT_SUCCESS(
+            af_harris(&out, inArray, 500, 1e5f, sigma, block_size, 0.04f));
+
+        dim_t n = 0;
+        af_array x, y, score, orientation, size;
+
+        ASSERT_SUCCESS(af_get_features_num(&n, out));
+        ASSERT_SUCCESS(af_get_features_xpos(&x, out));
+        ASSERT_SUCCESS(af_get_features_ypos(&y, out));
+        ASSERT_SUCCESS(af_get_features_score(&score, out));
+        ASSERT_SUCCESS(af_get_features_orientation(&orientation, out));
+        ASSERT_SUCCESS(af_get_features_size(&size, out));
+
+        ASSERT_SUCCESS(af_get_elements(&nElems, x));
+
+        vector<float> outX(gold[0].size());
+        vector<float> outY(gold[1].size());
+        vector<float> outScore(gold[2].size());
+        vector<float> outOrientation(gold[3].size());
+        vector<float> outSize(gold[4].size());
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outX.front(), x));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outY.front(), y));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outScore.front(), score));
+        ASSERT_SUCCESS(
+            af_get_data_ptr((void *)&outOrientation.front(), orientation));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outSize.front(), size));
+
+        vector<feat_t> out_feat;
+        array_to_feat(out_feat, &outX.front(), &outY.front(), &outScore.front(),
+                      &outOrientation.front(), &outSize.front(), n);
+
+        vector<feat_t> gold_feat;
+        array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(),
+                      &gold[2].front(), &gold[3].front(), &gold[4].front(),
+                      gold[0].size());
+
+        std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
+        std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
+
+        for (int elIter = 0; elIter < (int)nElems; elIter++) {
+            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e2)
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4])
+                << "at: " << elIter << endl;
+        }
+
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
+
+        ASSERT_SUCCESS(af_release_features(out));
+    }
+}
+
+#define HARRIS_INIT(desc, image, sigma, block_size)                        \
+    TYPED_TEST(Harris, desc) {                                             \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                            \
+        harrisTest<TypeParam>(string(TEST_DIR "/harris/" #image "_" #sigma \
+                                              "_" #block_size ".test"),    \
+                              sigma, block_size);                          \
+    }
+
+HARRIS_INIT(square_0_3, square, 0, 3);
+HARRIS_INIT(square_0_7, square, 0, 7);
+HARRIS_INIT(square_1_0, square, 1, 0);
+HARRIS_INIT(square_5_0, square, 5, 0);
+HARRIS_INIT(lena_0_3, lena, 0, 3);
+HARRIS_INIT(lena_0_7, lena, 0, 7);
+HARRIS_INIT(lena_1_0, lena, 1, 0);
+HARRIS_INIT(lena_5_0, lena, 5, 0);
+
+/////////////////////////////////// CPP ////////////////////////////////
+
+using af::array;
+using af::features;
+using af::harris;
+using af::loadImage;
+
+TEST(FloatHarris, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(string(TEST_DIR "/harris/square_0_3.test"), inDims, inFiles,
+                   gold);
+    inFiles[0].insert(0, string(TEST_DIR "/harris/"));
+
+    array in = loadImage(inFiles[0].c_str(), false);
+
+    features out = harris(in, 500, 1e5f, 0.0f, 3, 0.04f);
+
+    vector<float> outX(gold[0].size());
+    vector<float> outY(gold[1].size());
+    vector<float> outScore(gold[2].size());
+    vector<float> outOrientation(gold[3].size());
+    vector<float> outSize(gold[4].size());
+    out.getX().host(&outX.front());
+    out.getY().host(&outY.front());
+    out.getScore().host(&outScore.front());
+    out.getOrientation().host(&outOrientation.front());
+    out.getSize().host(&outSize.front());
+
+    vector<feat_t> out_feat;
+    array_to_feat(out_feat, &outX.front(), &outY.front(), &outScore.front(),
+                  &outOrientation.front(), &outSize.front(),
+                  out.getNumFeatures());
+
+    vector<feat_t> gold_feat;
+    array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(),
+                  &gold[2].front(), &gold[3].front(), &gold[4].front(),
+                  gold[0].size());
+
+    std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
+    std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
+
+    for (unsigned elIter = 0; elIter < out.getNumFeatures(); elIter++) {
+        ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e2)
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3])
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4])
+            << "at: " << elIter << endl;
+    }
+}
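
Note on the Harris tests above: detected features may come back in any order, so both the output and the gold data are sorted with a lexicographic comparator before the element-wise checks. A standalone sketch of that pattern, assuming five floats per feature as in `feat_t`. The names `Feat`, `featLess` and `sameFeatureSet` are hypothetical, and the score tolerance is passed in as a parameter rather than taken from the tests.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for feat_t: x, y, score, orientation, size.
struct Feat {
    float f[5];
};

// Lexicographic "less than": the first differing field decides the order.
static bool featLess(const Feat& a, const Feat& b) {
    for (int k = 0; k < 5; ++k)
        if (a.f[k] != b.f[k]) return a.f[k] < b.f[k];
    return false;  // all fields equal
}

// Order-independent comparison: sort copies of both sets, then compare
// field by field. Field 2 (score) is compared within scoreTol, all other
// fields must match exactly.
bool sameFeatureSet(std::vector<Feat> out, std::vector<Feat> gold,
                    float scoreTol) {
    if (out.size() != gold.size()) return false;
    std::sort(out.begin(), out.end(), featLess);
    std::sort(gold.begin(), gold.end(), featLess);
    for (std::size_t i = 0; i < out.size(); ++i) {
        for (int k = 0; k < 5; ++k) {
            if (k == 2) {
                if (std::fabs(out[i].f[k] - gold[i].f[k]) > scoreTol)
                    return false;
            } else if (out[i].f[k] != gold[i].f[k]) {
                return false;
            }
        }
    }
    return true;
}
```

Because x and y come first in the sort key, features whose scores differ slightly still land in the same sorted position as long as their coordinates are distinct.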
diff --git a/test/histogram.cpp b/test/histogram.cpp
index dfae986c34..ea9431485c 100644
--- a/test/histogram.cpp
+++ b/test/histogram.cpp
@@ -7,148 +7,153 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <iostream>
 #include <string>
 #include <vector>
-#include <iostream>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::ostream_iterator;
 using std::string;
 using std::vector;
 
 template<typename T>
-class Histogram : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Histogram : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<half_float::half, float, double, int, uint, char,
+                         schar, uchar, short, ushort, intl, uintl>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Histogram, TestTypes);
+TYPED_TEST_SUITE(Histogram, TestTypes);
 
 template<typename inType, typename outType>
-void histTest(string pTestFile, unsigned nbins, double minval, double maxval)
-{
-    if (noDoubleTests<inType>()) return;
-    if (noDoubleTests<outType>()) return;
+void histTest(string pTestFile, unsigned nbins, double minval, double maxval) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<inType> >  in;
-    vector<vector<outType> > tests;
-    readTests<inType,uint,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
+    readTests<inType, uint, uint>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
-    af_array outArray   = 0;
-    af_array inArray    = 0;
-    outType *outData;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<inType>::af_type));
+    af_array outArray = 0;
+    af_array inArray  = 0;
 
-    ASSERT_EQ(AF_SUCCESS,af_histogram(&outArray,inArray,nbins,minval,maxval));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
 
-    outData = new outType[dims.elements()];
+    ASSERT_SUCCESS(af_histogram(&outArray, inArray, nbins, minval, maxval));
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    vector<outType> outData(dims.elements());
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData.data(), outArray));
+
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<outType> currGoldBar = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter],outData[elIter])<< "at: " << elIter<< std::endl;
-        }
+
+        dim4 goldDims(nbins, 1, dims[2], dims[3]);
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, outArray);
     }
 
     // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(Histogram,256Bins0min255max_ones)
-{
-    histTest<TypeParam,uint>(string(TEST_DIR"/histogram/256bin1min1max.test"),256,0,255);
+TYPED_TEST(Histogram, 256Bins0min255max_ones) {
+    histTest<TypeParam, uint>(string(TEST_DIR "/histogram/256bin1min1max.test"),
+                              256, 0, 255);
 }
 
-TYPED_TEST(Histogram,100Bins0min99max)
-{
-    histTest<TypeParam,uint>(string(TEST_DIR"/histogram/100bin0min99max.test"),100,0,99);
+TYPED_TEST(Histogram, 100Bins0min99max) {
+    histTest<TypeParam, uint>(
+        string(TEST_DIR "/histogram/100bin0min99max.test"), 100, 0, 99);
 }
 
-TYPED_TEST(Histogram,40Bins0min100max)
-{
-    histTest<TypeParam,uint>(string(TEST_DIR"/histogram/40bin0min100max.test"),40,0,100);
+TYPED_TEST(Histogram, 40Bins0min100max) {
+    histTest<TypeParam, uint>(
+        string(TEST_DIR "/histogram/40bin0min100max.test"), 40, 0, 100);
 }
 
-TYPED_TEST(Histogram,40Bins0min100max_Batch)
-{
-    histTest<TypeParam,uint>(string(TEST_DIR"/histogram/40bin0min100max_batch.test"),40,0,100);
+TYPED_TEST(Histogram, 40Bins0min100max_Batch) {
+    histTest<TypeParam, uint>(
+        string(TEST_DIR "/histogram/40bin0min100max_batch.test"), 40, 0, 100);
 }
 
-TYPED_TEST(Histogram,256Bins0min255max_zeros)
-{
-    histTest<TypeParam,uint>(string(TEST_DIR"/histogram/256bin0min0max.test"),256,0,255);
+TYPED_TEST(Histogram, 256Bins0min255max_zeros) {
+    histTest<TypeParam, uint>(string(TEST_DIR "/histogram/256bin0min0max.test"),
+                              256, 0, 255);
 }
 
 /////////////////////////////////// CPP //////////////////////////////////
 //
-TEST(Histogram, CPP)
-{
-    if (noDoubleTests<float>()) return;
-    if (noDoubleTests<int>()) return;
-
+using af::array;
+using af::constant;
+using af::histogram;
+using af::max;
+using af::randu;
+using af::range;
+using af::round;
+using af::seq;
+using af::span;
+
+TEST(Histogram, CPP) {
     const unsigned nbins = 100;
-    const double minval = 0.0;
-    const double maxval = 99.0;
+    const double minval  = 0.0;
+    const double maxval  = 99.0;
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<float> >  in;
-    vector<vector<uint> > tests;
-    readTests<float,uint,int>(string(TEST_DIR"/histogram/100bin0min99max.test"),numDims,in,tests);
+    vector<vector<float>> in;
+    vector<vector<uint>> tests;
+    readTests<float, uint, int>(
+        string(TEST_DIR "/histogram/100bin0min99max.test"), numDims, in, tests);
 
-//! [hist_nominmax]
-    af::array input(numDims[0], &(in[0].front()));
-    af::array output = histogram(input, nbins, minval, maxval);
-//! [hist_nominmax]
+    //! [hist_nominmax]
+    array input(numDims[0], &(in[0].front()));
+    array output = histogram(input, nbins, minval, maxval);
+    //! [hist_nominmax]
 
-    uint *outData = new uint[output.elements()];
-    output.host((void*)outData);
+    vector<uint> outData(output.elements());
+    output.host((void*)outData.data());
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<uint> currGoldBar = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter],outData[elIter])<< "at: " << elIter<< std::endl;
-        }
-    }
 
-    // cleanup
-    delete[] outData;
+        dim4 goldDims = numDims[0];
+        goldDims[0]   = nbins;
+        goldDims[1]   = 1;
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, output);
+    }
 }
 
-/////////////////////////////////// Documentation Snippets //////////////////////////////////
+/////////////////////////////////// Documentation Snippets
+/////////////////////////////////////
 //
-TEST(Histogram, SNIPPET_hist_nominmax)
-{
-    using af::array;
-    using af::histogram;
-    using std::ostream_iterator;
-    using std::cout;
-    using std::endl;
-
+TEST(Histogram, SNIPPET_hist_nominmax) {
     unsigned output[] = {3, 1, 2, 0, 0, 0, 0, 1, 1, 1};
 
     //! [ex_image_hist_nominmax]
-    float input[]  = {1, 2, 1, 1, 3, 6, 7, 8, 3};
-    int nbins = 10;
+    float input[] = {1, 2, 1, 1, 3, 6, 7, 8, 3};
+    int nbins     = 10;
 
-    size_t nElems = sizeof(input)/sizeof(float);
+    size_t nElems = sizeof(input) / sizeof(float);
     array hist_in(nElems, input);
 
     array hist_out = histogram(hist_in, nbins);
@@ -158,30 +163,24 @@ TEST(Histogram, SNIPPET_hist_nominmax)
     vector<unsigned> h_out(nbins);
     hist_out.host((void*)h_out.data());
 
-    if( false == equal(h_out.begin(), h_out.end(), output) ) {
+    if (false == equal(h_out.begin(), h_out.end(), output)) {
         cout << "Expected: ";
         copy(output, output + nbins, ostream_iterator<unsigned>(cout, ", "));
         cout << endl << "Actual: ";
-        copy(h_out.begin(), h_out.end(), ostream_iterator<unsigned>(cout, ", "));
+        copy(h_out.begin(), h_out.end(),
+             ostream_iterator<unsigned>(cout, ", "));
         FAIL() << "Output did not match";
     }
 }
 
-TEST(Histogram, SNIPPET_hist_minmax)
-{
-    using af::array;
-    using af::histogram;
-    using std::ostream_iterator;
-    using std::cout;
-    using std::endl;
-
+TEST(Histogram, SNIPPET_hist_minmax) {
     unsigned output[] = {0, 3, 1, 2, 0, 0, 1, 1, 1, 0};
 
     //! [ex_image_hist_minmax]
-    float input[]  = {1, 2, 1, 1, 3, 6, 7, 8, 3};
-    int nbins = 10;
+    float input[] = {1, 2, 1, 1, 3, 6, 7, 8, 3};
+    int nbins     = 10;
 
-    size_t nElems = sizeof(input)/sizeof(float);
+    size_t nElems = sizeof(input) / sizeof(float);
     array hist_in(nElems, input);
 
     array hist_out = histogram(hist_in, nbins, 0, 9);
@@ -191,30 +190,24 @@ TEST(Histogram, SNIPPET_hist_minmax)
     vector<unsigned> h_out(nbins);
     hist_out.host((void*)h_out.data());
 
-    if( false == equal(h_out.begin(), h_out.end(), output) ) {
+    if (false == equal(h_out.begin(), h_out.end(), output)) {
         cout << "Expected: ";
         copy(output, output + nbins, ostream_iterator<unsigned>(cout, ", "));
         cout << endl << "Actual: ";
-        copy(h_out.begin(), h_out.end(), ostream_iterator<unsigned>(cout, ", "));
+        copy(h_out.begin(), h_out.end(),
+             ostream_iterator<unsigned>(cout, ", "));
         FAIL() << "Output did not match";
     }
 }
 
-TEST(Histogram, SNIPPET_histequal)
-{
-    using af::array;
-    using af::histogram;
-    using std::ostream_iterator;
-    using std::cout;
-    using std::endl;
-
-    float output[] = { 1.5, 4.5,  1.5, 1.5, 4.5, 4.5, 6.0, 7.5, 4.5 };
+TEST(Histogram, SNIPPET_histequal) {
+    float output[] = {1.5, 4.5, 1.5, 1.5, 4.5, 4.5, 6.0, 7.5, 4.5};
 
     //! [ex_image_histequal]
-    float input[]  = {1, 2, 1, 1, 3, 6, 7, 8, 3};
-    int nbins = 10;
+    float input[] = {1, 2, 1, 1, 3, 6, 7, 8, 3};
+    int nbins     = 10;
 
-    size_t nElems = sizeof(input)/sizeof(float);
+    size_t nElems = sizeof(input) / sizeof(float);
     array hist_in(nElems, input);
 
     array hist_out = histogram(hist_in, nbins);
@@ -228,30 +221,64 @@ TEST(Histogram, SNIPPET_histequal)
     vector<float> h_out(nElems);
     eq_out.host((void*)h_out.data());
 
-    if( false == equal(h_out.begin(), h_out.end(), output) ) {
+    if (false == equal(h_out.begin(), h_out.end(), output)) {
         cout << "Expected: ";
-        copy(output, output + nbins, ostream_iterator<float>(cout, ", "));
+        copy(output, output + nElems, ostream_iterator<float>(cout, ", "));
         cout << endl << "Actual: ";
         copy(h_out.begin(), h_out.end(), ostream_iterator<float>(cout, ", "));
         FAIL() << "Output did not match";
     }
 }
 
-TEST(histogram, GFOR)
-{
-    using namespace af;
-
+TEST(histogram, GFOR) {
     dim4 dims = dim4(100, 100, 3);
-    array A = round(100 * randu(dims));
-    array B = constant(0, 100, 1, 3);
+    array A   = round(100 * randu(dims));
+    array B   = constant(0, 100, 1, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = histogram(A(span, span, ii), 100);
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = histogram(A(span, span, ii), 100); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = histogram(A(span, span, ii), 100);
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
     }
 }
+
+TEST(histogram, IndexedArray) {
+    const dim_t LEN = 32;
+    array A         = range(LEN, (dim_t)2);
+    for (int i = 16; i < 28; ++i) { A(seq(i, i + 3), span) = i / 4 - 1; }
+    array B = A(seq(20), span);
+    array C = histogram(B, 4);
+    unsigned out[4];
+    C.host((void*)out);
+    ASSERT_EQ(true, out[0] == 16);
+    ASSERT_EQ(true, out[1] == 8);
+    ASSERT_EQ(true, out[2] == 8);
+    ASSERT_EQ(true, out[3] == 8);
+}
+
+TEST(histogram, LargeBins) {
+    const int max_val = 20000;
+    const int min_val = 0;
+    const int nbins   = max_val / 2;
+    const int num     = 1 << 20;
+    array A           = round(max_val * randu(num) + min_val).as(u32);
+    eval(A);
+    array H = histogram(A, nbins, min_val, max_val);
+
+    vector<unsigned> hA(num);
+    A.host(hA.data());
+
+    vector<unsigned> hH(nbins);
+    H.host(hH.data());
+
+    int dx = (max_val - min_val) / nbins;
+    for (int i = 0; i < num; i++) {
+        int bin = (hA[i] - min_val) / dx;
+        bin     = std::min(bin, nbins - 1);
+        hH[bin] -= 1;
+    }
+
+    for (int i = 0; i < nbins; i++) { ASSERT_EQ(hH[i], 0u); }
+}
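
Note on the LargeBins test above: it validates the device histogram by walking the input on the host, computing each value's bin, and decrementing that bin until every count reaches zero. The same host-side binning can be sketched as a standalone reference. This assumes an integer bin width of at least 1 and inputs no smaller than `minVal`, as in the test; the name `referenceHistogram` is hypothetical and not part of ArrayFire.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical host-side reference, not ArrayFire API. Counts values into
// nbins equal-width integer bins starting at minVal. Values that land past
// the last bin (for example v == maxVal) are clamped into it, mirroring the
// arithmetic used in the LargeBins test.
std::vector<unsigned> referenceHistogram(const std::vector<unsigned>& values,
                                         int nbins, int minVal, int maxVal) {
    std::vector<unsigned> counts(nbins, 0u);
    const int dx = (maxVal - minVal) / nbins;  // integer bin width, assumed >= 1
    for (unsigned v : values) {
        int bin = (static_cast<int>(v) - minVal) / dx;
        bin     = std::min(bin, nbins - 1);
        counts[bin] += 1;
    }
    return counts;
}
```

Comparing this vector with the device output is equivalent to the decrement-to-zero check, and it reports which bin differs when a mismatch occurs.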
diff --git a/test/homography.cpp b/test/homography.cpp
new file mode 100644
index 0000000000..bd4809d428
--- /dev/null
+++ b/test/homography.cpp
@@ -0,0 +1,290 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <cmath>
+#include <string>
+#include <typeinfo>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Homography : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+
+TYPED_TEST_SUITE(Homography, TestTypes);
+
+template<typename T>
+array perspectiveTransform(dim4 inDims, array H) {
+    T d0 = (T)inDims[0];
+    T d1 = (T)inDims[1];
+    return transformCoordinates(H, d0, d1);
+}
+
+template<typename T>
+void homographyTest(string pTestFile, const af_homography_type htype,
+                    const bool rotate, const float size_ratio) {
+    using af::dtype_traits;
+    using af::Pi;
+
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(pTestFile, inDims, inFiles, gold);
+
+    inFiles[0].insert(0, string(TEST_DIR "/homography/"));
+
+    af_array trainArray_f32 = 0;
+    af_array trainArray     = 0;
+    af_array train_desc     = 0;
+    af_array train_feat_x   = 0;
+    af_array train_feat_y   = 0;
+    af_features train_feat;
+
+    ASSERT_SUCCESS(af_load_image(&trainArray_f32, inFiles[0].c_str(), false));
+    ASSERT_SUCCESS(conv_image<T>(&trainArray, trainArray_f32));
+
+    ASSERT_SUCCESS(af_orb(&train_feat, &train_desc, trainArray, 20.0f, 2000,
+                          1.2f, 8, true));
+
+    ASSERT_SUCCESS(af_get_features_xpos(&train_feat_x, train_feat));
+    ASSERT_SUCCESS(af_get_features_ypos(&train_feat_y, train_feat));
+
+    af_array queryArray       = 0;
+    af_array query_desc       = 0;
+    af_array idx              = 0;
+    af_array dist             = 0;
+    af_array const_50         = 0;
+    af_array dist_thr         = 0;
+    af_array train_idx        = 0;
+    af_array query_idx        = 0;
+    af_array query_feat_x     = 0;
+    af_array query_feat_y     = 0;
+    af_array H                = 0;
+    af_array train_feat_x_idx = 0;
+    af_array train_feat_y_idx = 0;
+    af_array query_feat_x_idx = 0;
+    af_array query_feat_y_idx = 0;
+    af_features query_feat;
+
+    const float theta   = Pi * 0.5f;
+    const dim_t test_d0 = inDims[0][0] * size_ratio;
+    const dim_t test_d1 = inDims[0][1] * size_ratio;
+    const dim_t tDims[] = {test_d0, test_d1};
+    if (rotate) {
+        ASSERT_SUCCESS(af_rotate(&queryArray, trainArray, theta, false,
+                                 AF_INTERP_NEAREST));
+    } else {
+        ASSERT_SUCCESS(af_resize(&queryArray, trainArray, test_d0, test_d1,
+                                 AF_INTERP_BILINEAR));
+    }
+
+    ASSERT_SUCCESS(af_orb(&query_feat, &query_desc, queryArray, 20.0f, 2000,
+                          1.2f, 8, true));
+
+    ASSERT_SUCCESS(
+        af_hamming_matcher(&idx, &dist, train_desc, query_desc, 0, 1));
+
+    dim_t distDims[4];
+    ASSERT_SUCCESS(af_get_dims(&distDims[0], &distDims[1], &distDims[2],
+                               &distDims[3], dist));
+
+    ASSERT_SUCCESS(af_constant(&const_50, 50, 2, distDims, u32));
+    ASSERT_SUCCESS(af_lt(&dist_thr, dist, const_50, false));
+    ASSERT_SUCCESS(af_where(&train_idx, dist_thr));
+
+    dim_t tidxDims[4];
+    ASSERT_SUCCESS(af_get_dims(&tidxDims[0], &tidxDims[1], &tidxDims[2],
+                               &tidxDims[3], train_idx));
+    af_index_t tindexs;
+    tindexs.isSeq   = false;
+    tindexs.idx.seq = af_make_seq(0, tidxDims[0] - 1, 1);
+    tindexs.idx.arr = train_idx;
+    ASSERT_SUCCESS(af_index_gen(&query_idx, idx, 1, &tindexs));
+
+    ASSERT_SUCCESS(af_get_features_xpos(&query_feat_x, query_feat));
+    ASSERT_SUCCESS(af_get_features_ypos(&query_feat_y, query_feat));
+
+    dim_t qidxDims[4];
+    ASSERT_SUCCESS(af_get_dims(&qidxDims[0], &qidxDims[1], &qidxDims[2],
+                               &qidxDims[3], query_idx));
+    af_index_t qindexs;
+    qindexs.isSeq   = false;
+    qindexs.idx.seq = af_make_seq(0, qidxDims[0] - 1, 1);
+    qindexs.idx.arr = query_idx;
+
+    ASSERT_SUCCESS(af_index_gen(&train_feat_x_idx, train_feat_x, 1, &tindexs));
+    ASSERT_SUCCESS(af_index_gen(&train_feat_y_idx, train_feat_y, 1, &tindexs));
+    ASSERT_SUCCESS(af_index_gen(&query_feat_x_idx, query_feat_x, 1, &qindexs));
+    ASSERT_SUCCESS(af_index_gen(&query_feat_y_idx, query_feat_y, 1, &qindexs));
+
+    int inliers = 0;
+    ASSERT_SUCCESS(af_homography(&H, &inliers, train_feat_x_idx,
+                                 train_feat_y_idx, query_feat_x_idx,
+                                 query_feat_y_idx, htype, 3.0f, 1000,
+                                 (af_dtype)dtype_traits<T>::af_type));
+
+    array HH(H);
+
+    array t = perspectiveTransform<T>(inDims[0], HH);
+
+    T* gold_t = new T[8];
+    for (int i = 0; i < 8; i++) gold_t[i] = (T)0;
+    if (rotate) {
+        gold_t[1] = test_d0;
+        gold_t[2] = test_d0;
+        gold_t[4] = test_d1;
+        gold_t[5] = test_d1;
+    } else {
+        gold_t[2] = test_d1;
+        gold_t[3] = test_d1;
+        gold_t[5] = test_d0;
+        gold_t[6] = test_d0;
+    }
+
+    T* out_t = new T[8];
+    t.host(out_t);
+
+    for (int elIter = 0; elIter < 8; elIter++) {
+        ASSERT_LE(fabs(out_t[elIter] - gold_t[elIter]) / tDims[elIter & 1],
+                  0.25f)
+            << "at: " << elIter << endl;
+    }
+
+    delete[] gold_t;
+    delete[] out_t;
+
+    ASSERT_SUCCESS(af_release_array(queryArray));
+
+    ASSERT_SUCCESS(af_release_array(query_desc));
+    ASSERT_SUCCESS(af_release_array(idx));
+    ASSERT_SUCCESS(af_release_array(dist));
+    ASSERT_SUCCESS(af_release_array(const_50));
+    ASSERT_SUCCESS(af_release_array(dist_thr));
+    ASSERT_SUCCESS(af_release_array(train_idx));
+    ASSERT_SUCCESS(af_release_array(query_idx));
+    ASSERT_SUCCESS(af_release_features(query_feat));
+    ASSERT_SUCCESS(af_release_features(train_feat));
+    ASSERT_SUCCESS(af_release_array(train_feat_x_idx));
+    ASSERT_SUCCESS(af_release_array(train_feat_y_idx));
+    ASSERT_SUCCESS(af_release_array(query_feat_x_idx));
+    ASSERT_SUCCESS(af_release_array(query_feat_y_idx));
+
+    ASSERT_SUCCESS(af_release_array(trainArray));
+    ASSERT_SUCCESS(af_release_array(trainArray_f32));
+    ASSERT_SUCCESS(af_release_array(train_desc));
+}
+
+#define HOMOGRAPHY_INIT(desc, image, htype, rotate, size_ratio)            \
+    TYPED_TEST(Homography, desc) {                                         \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                            \
+        homographyTest<TypeParam>(                                         \
+            string(TEST_DIR "/homography/" #image ".test"), htype, rotate, \
+            size_ratio);                                                   \
+    }
+
+HOMOGRAPHY_INIT(Tux_RANSAC, tux, AF_HOMOGRAPHY_RANSAC, false, 1.0f);
+HOMOGRAPHY_INIT(Tux_RANSAC_90degrees, tux, AF_HOMOGRAPHY_RANSAC, true, 1.0f);
+HOMOGRAPHY_INIT(Tux_RANSAC_resize, tux, AF_HOMOGRAPHY_RANSAC, false, 1.5f);
+// HOMOGRAPHY_INIT(Tux_LMedS, tux, AF_HOMOGRAPHY_LMEDS, false, 1.0f);
+// HOMOGRAPHY_INIT(Tux_LMedS_90degrees, tux, AF_HOMOGRAPHY_LMEDS, true, 1.0f);
+// HOMOGRAPHY_INIT(Tux_LMedS_resize, tux, AF_HOMOGRAPHY_LMEDS, false, 1.5f);
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+
+using af::features;
+using af::loadImage;
+
+TEST(Homography, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(string(TEST_DIR "/homography/tux.test"), inDims, inFiles,
+                   gold);
+
+    inFiles[0].insert(0, string(TEST_DIR "/homography/"));
+
+    const float size_ratio = 0.5f;
+
+    array train_img = loadImage(inFiles[0].c_str(), false);
+    array query_img = resize(size_ratio, train_img);
+    dim4 tDims      = train_img.dims();
+
+    features feat_train, feat_query;
+    array desc_train, desc_query;
+    orb(feat_train, desc_train, train_img, 20, 2000, 1.2, 8, true);
+    orb(feat_query, desc_query, query_img, 20, 2000, 1.2, 8, true);
+
+    array idx, dist;
+    hammingMatcher(idx, dist, desc_train, desc_query, 0, 1);
+
+    array train_idx = where(dist < 30);
+    array query_idx = idx(train_idx);
+
+    array feat_train_x           = feat_train.getX()(train_idx);
+    array feat_train_y           = feat_train.getY()(train_idx);
+    array feat_train_score       = feat_train.getScore()(train_idx);
+    array feat_train_orientation = feat_train.getOrientation()(train_idx);
+    array feat_train_size        = feat_train.getSize()(train_idx);
+    array feat_query_x           = feat_query.getX()(query_idx);
+    array feat_query_y           = feat_query.getY()(query_idx);
+    array feat_query_score       = feat_query.getScore()(query_idx);
+    array feat_query_orientation = feat_query.getOrientation()(query_idx);
+    array feat_query_size        = feat_query.getSize()(query_idx);
+
+    array H;
+    int inliers = 0;
+    homography(H, inliers, feat_train_x, feat_train_y, feat_query_x,
+               feat_query_y, AF_HOMOGRAPHY_RANSAC, 3.0f, 1000, f32);
+
+    float* gold_t = new float[8];
+    for (int i = 0; i < 8; i++) gold_t[i] = 0.f;
+    gold_t[2] = tDims[1] * size_ratio;
+    gold_t[3] = tDims[1] * size_ratio;
+    gold_t[5] = tDims[0] * size_ratio;
+    gold_t[6] = tDims[0] * size_ratio;
+
+    array t = perspectiveTransform<float>(train_img.dims(), H);
+
+    float* out_t = new float[4 * 2];
+    t.host(out_t);
+
+    for (int elIter = 0; elIter < 8; elIter++) {
+        ASSERT_LE(fabs(out_t[elIter] - gold_t[elIter]) / tDims[elIter & 1],
+                  0.1f)
+            << "at: " << elIter << endl;
+    }
+
+    delete[] gold_t;
+    delete[] out_t;
+}
diff --git a/test/hsv_rgb.cpp b/test/hsv_rgb.cpp
index 5e221c2e6d..134e56c6c3 100644
--- a/test/hsv_rgb.cpp
+++ b/test/hsv_rgb.cpp
@@ -7,76 +7,142 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::array;
+using af::dim4;
+using af::exception;
+using af::hsv2rgb;
+using std::endl;
 using std::string;
 using std::vector;
 
-TEST(hsv_rgb, InvalidArray)
-{
+TEST(hsv_rgb, InvalidArray) {
     vector<float> in(100, 1);
 
-    af::dim4 dims(100);
-    af::array input(dims, &(in.front()));
+    dim4 dims(100);
+    array input(dims, &(in.front()));
 
     try {
-        af::array output = af::hsv2rgb(input);
+        array output = hsv2rgb(input);
         ASSERT_EQ(true, false);
-    } catch(af::exception) {
+    } catch (const exception & /* ex */) {
         ASSERT_EQ(true, true);
         return;
     }
 }
 
-TEST(hsv2rgb, CPP)
-{
-    vector<af::dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+TEST(hsv2rgb, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(string(TEST_DIR "/hsv_rgb/hsv2rgb.test"),
+                                    numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = hsv2rgb(input);
+
+    vector<float> currGoldBar = tests[0];
+    ASSERT_VEC_ARRAY_NEAR(currGoldBar, dims, output, 1.0e-3);
+}
+
+TEST(rgb2hsv, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(string(TEST_DIR "/hsv_rgb/rgb2hsv.test"),
+                                    numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = rgb2hsv(input);
 
-    readTestsFromFile<float,float>(string(TEST_DIR"/hsv_rgb/hsv2rgb.test"), numDims, in, tests);
+    vector<float> currGoldBar = tests[0];
+    ASSERT_VEC_ARRAY_NEAR(currGoldBar, dims, output, 1.0e-3);
+}
 
-    af::dim4 dims    = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::hsv2rgb(input);
+TEST(rgb2hsv, MaxDim) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    float *outData = new float[dims.elements()];
-    output.host((void*)outData);
+    readTestsFromFile<float, float>(string(TEST_DIR "/hsv_rgb/rgb2hsv.test"),
+                                    numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+
+    const size_t largeDim = 65535 * 16 + 1;
+    unsigned int ntile    = (largeDim + dims[1] - 1) / dims[1];
+    input                 = tile(input, 1, ntile);
+    array output          = rgb2hsv(input);
+    dim4 outDims          = output.dims();
+
+    float *outData = new float[outDims.elements()];
+    output.host((void *)outData);
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)<< "at: " << elIter<< std::endl;
+    for (int z = 0; z < outDims[2]; ++z) {
+        for (int y = 0; y < outDims[1]; ++y) {
+            for (int x = 0; x < outDims[0]; ++x) {
+                int outIter =
+                    (z * outDims[1] * outDims[0]) + (y * outDims[0]) + x;
+                int goldIter =
+                    (z * dims[1] * dims[0]) + ((y % dims[1]) * dims[0]) + x;
+                ASSERT_NEAR(currGoldBar[goldIter], outData[outIter], 1.0e-3)
+                    << "at: " << outIter << endl;
+            }
+        }
     }
 
     // cleanup
     delete[] outData;
 }
 
-TEST(rgb2hsv, CPP)
-{
-    vector<af::dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+TEST(hsv2rgb, MaxDim) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(string(TEST_DIR "/hsv_rgb/hsv2rgb.test"),
+                                    numDims, in, tests);
 
-    readTestsFromFile<float,float>(string(TEST_DIR"/hsv_rgb/rgb2hsv.test"), numDims, in, tests);
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
 
-    af::dim4 dims    = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::rgb2hsv(input);
+    const size_t largeDim = 65535 * 16 + 1;
+    unsigned int ntile    = (largeDim + dims[1] - 1) / dims[1];
+    input                 = tile(input, 1, ntile);
+    array output          = hsv2rgb(input);
+    dim4 outDims          = output.dims();
 
-    float *outData = new float[dims.elements()];
-    output.host((void*)outData);
+    float *outData = new float[outDims.elements()];
+    output.host((void *)outData);
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)<< "at: " << elIter<< std::endl;
+    for (int z = 0; z < outDims[2]; ++z) {
+        for (int y = 0; y < outDims[1]; ++y) {
+            for (int x = 0; x < outDims[0]; ++x) {
+                int outIter =
+                    (z * outDims[1] * outDims[0]) + (y * outDims[0]) + x;
+                int goldIter =
+                    (z * dims[1] * dims[0]) + ((y % dims[1]) * dims[0]) + x;
+                ASSERT_NEAR(currGoldBar[goldIter], outData[outIter], 1.0e-3)
+                    << "at: " << outIter << endl;
+            }
+        }
     }
 
     // cleanup
diff --git a/test/iir.cpp b/test/iir.cpp
index e061e5c48c..85fda2a959 100644
--- a/test/iir.cpp
+++ b/test/iir.cpp
@@ -7,46 +7,49 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::vector;
-using std::string;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::convolve1;
 using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::exception;
+using af::fir;
+using af::iir;
+using af::randu;
+using std::string;
+using std::vector;
 
 template<typename T>
-class filter : public ::testing::Test
-{
-public:
+class filter : public ::testing::Test {
+   public:
     virtual void SetUp() {}
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
-TYPED_TEST_CASE(filter, TestTypes);
-
-static double get_real(double val) { return val; }
-static double get_real(cfloat val) { return std::real(val); }
-static double get_real(cdouble val) { return std::real(val); }
+TYPED_TEST_SUITE(filter, TestTypes);
 
 template<typename T>
-void firTest(const int xrows, const int xcols, const int brows, const int bcols)
-{
-    if (noDoubleTests<T>()) return;
+void firTest(const int xrows, const int xcols, const int brows,
+             const int bcols) {
+    SUPPORTED_TYPE_CHECK(T);
     try {
-        af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
-        af::array x = af::randu(xrows, xcols, ty);
-        af::array b = af::randu(brows, bcols, ty);
+        dtype ty = (dtype)dtype_traits<T>::af_type;
+        array x  = randu(xrows, xcols, ty);
+        array b  = randu(brows, bcols, ty);
 
-        af::array y = af::fir(b, x);
-        af::array c = af::convolve1(x, b, AF_CONV_EXPAND);
+        array y = fir(b, x);
+        array c = convolve1(x, b, AF_CONV_EXPAND);
 
         const int ycols = xcols * bcols;
         const int crows = xrows + brows - 1;
@@ -60,48 +63,34 @@ void firTest(const int xrows, const int xcols, const int brows, const int bcols)
 
         for (int j = 0; j < ycols; j++) {
             for (int i = 0; i < yrows; i++) {
-                ASSERT_NEAR(get_real(hy[j * yrows + i]),
-                            get_real(hc[j * crows + i]), 0.01);
+                ASSERT_NEAR(real(hy[j * yrows + i]), real(hc[j * crows + i]),
+                            0.01);
             }
         }
-    } catch (af::exception &ex) {
-        FAIL() << ex.what();
-    }
+    } catch (exception &ex) { FAIL() << ex.what(); }
 }
 
-TYPED_TEST(filter, firVecVec)
-{
-    firTest<TypeParam>(10000, 1, 1000, 1);
-}
+TYPED_TEST(filter, firVecVec) { firTest<TypeParam>(10000, 1, 1000, 1); }
 
-TYPED_TEST(filter, firVecMat)
-{
-    firTest<TypeParam>(10000, 1, 50, 10);
-}
+TYPED_TEST(filter, firVecMat) { firTest<TypeParam>(10000, 1, 50, 10); }
 
-TYPED_TEST(filter, firMatVec)
-{
-    firTest<TypeParam>(5000, 10, 100, 1);
-}
+TYPED_TEST(filter, firMatVec) { firTest<TypeParam>(5000, 10, 100, 1); }
 
-TYPED_TEST(filter, firMatMat)
-{
-    firTest<TypeParam>(5000, 10, 50, 10);
-}
+TYPED_TEST(filter, firMatMat) { firTest<TypeParam>(5000, 10, 50, 10); }
 
 template<typename T>
-void iirA0Test(const int xrows, const int xcols, const int brows, const int bcols)
-{
-    if (noDoubleTests<T>()) return;
+void iirA0Test(const int xrows, const int xcols, const int brows,
+               const int bcols) {
+    SUPPORTED_TYPE_CHECK(T);
     try {
-        af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
-        af::array x = af::randu(xrows, xcols, ty);
-        af::array b = af::randu(brows, bcols, ty);
-        af::array a = af::randu(    1, bcols, ty);
-        af::array bNorm = b / tile(a, brows);
+        dtype ty    = (dtype)dtype_traits<T>::af_type;
+        array x     = randu(xrows, xcols, ty);
+        array b     = randu(brows, bcols, ty);
+        array a     = randu(1, bcols, ty);
+        array bNorm = b / tile(a, brows);
 
-        af::array y = af::iir(b, a, x);
-        af::array c = af::convolve1(x, bNorm, AF_CONV_EXPAND);
+        array y = iir(b, a, x);
+        array c = convolve1(x, bNorm, AF_CONV_EXPAND);
 
         const int ycols = xcols * bcols;
         const int crows = xrows + brows - 1;
@@ -115,77 +104,57 @@ void iirA0Test(const int xrows, const int xcols, const int brows, const int bcol
 
         for (int j = 0; j < ycols; j++) {
             for (int i = 0; i < yrows; i++) {
-                ASSERT_NEAR(get_real(hy[j * yrows + i]),
-                            get_real(hc[j * crows + i]), 0.01);
+                ASSERT_NEAR(real(hy[j * yrows + i]), real(hc[j * crows + i]),
+                            0.01);
             }
         }
-    } catch (af::exception &ex) {
-        FAIL() << ex.what();
-    }
+    } catch (exception &ex) { FAIL() << ex.what(); }
 }
 
-TYPED_TEST(filter, iirA0VecVec)
-{
-    iirA0Test<TypeParam>(10000, 1, 1000, 1);
-}
+TYPED_TEST(filter, iirA0VecVec) { iirA0Test<TypeParam>(10000, 1, 1000, 1); }
 
-TYPED_TEST(filter, iirA0VecMat)
-{
-    iirA0Test<TypeParam>(10000, 1, 50, 10);
-}
+TYPED_TEST(filter, iirA0VecMat) { iirA0Test<TypeParam>(10000, 1, 50, 10); }
 
-TYPED_TEST(filter, iirA0MatVec)
-{
-    iirA0Test<TypeParam>(5000, 10, 100, 1);
-}
+TYPED_TEST(filter, iirA0MatVec) { iirA0Test<TypeParam>(5000, 10, 100, 1); }
 
-TYPED_TEST(filter, iirA0MatMat)
-{
-    iirA0Test<TypeParam>(5000, 10, 50, 10);
-}
+TYPED_TEST(filter, iirA0MatMat) { iirA0Test<TypeParam>(5000, 10, 50, 10); }
 
 template<typename T>
-void iirTest(const char *testFile)
-{
-    if (noDoubleTests<T>()) return;
-    vector<af::dim4> inDims;
+void iirTest(const char *testFile) {
+    SUPPORTED_TYPE_CHECK(T);
+    vector<dim4> inDims;
 
-    vector<vector<T> > inputs;
-    vector<vector<T> > outputs;
-    readTests<T, T, float> (testFile, inDims, inputs, outputs);
+    vector<vector<T>> inputs;
+    vector<vector<T>> outputs;
+    readTests<T, T, float>(testFile, inDims, inputs, outputs);
 
     try {
-        af::array a = af::array(inDims[0], &inputs[0][0]);
-        af::array b = af::array(inDims[1], &inputs[1][0]);
-        af::array x = af::array(inDims[2], &inputs[2][0]);
+        array a = array(inDims[0], &inputs[0][0]);
+        array b = array(inDims[1], &inputs[1][0]);
+        array x = array(inDims[2], &inputs[2][0]);
 
-        af::array y = af::iir(b, a, x);
-        std::vector<T> gold = outputs[0];
+        array y        = iir(b, a, x);
+        vector<T> gold = outputs[0];
         ASSERT_EQ(gold.size(), (size_t)y.elements());
 
-        std::vector<T> out(y.elements());
+        vector<T> out(y.elements());
         y.host(&out[0]);
 
-        for(size_t i = 0; i < gold.size(); i++) {
-            ASSERT_NEAR(get_real(out[i]), get_real(gold[i]), 0.01) << "at: " << i;
+        for (size_t i = 0; i < gold.size(); i++) {
+            ASSERT_NEAR(real(out[i]), real(gold[i]), 0.01) << "at: " << i;
         }
 
-    } catch (af::exception &ex) {
-        FAIL() << ex.what();
-    }
+    } catch (exception &ex) { FAIL() << ex.what(); }
 }
 
-TYPED_TEST(filter, iirVecVec)
-{
-    iirTest<TypeParam>(TEST_DIR"/iir/iir_vv.test");
+TYPED_TEST(filter, iirVecVec) {
+    iirTest<TypeParam>(TEST_DIR "/iir/iir_vv.test");
 }
 
-TYPED_TEST(filter, iirVecMat)
-{
-    iirTest<TypeParam>(TEST_DIR"/iir/iir_vm.test");
+TYPED_TEST(filter, iirVecMat) {
+    iirTest<TypeParam>(TEST_DIR "/iir/iir_vm.test");
 }
 
-TYPED_TEST(filter, iirMatMat)
-{
-    iirTest<TypeParam>(TEST_DIR"/iir/iir_mm.test");
+TYPED_TEST(filter, iirMatMat) {
+    iirTest<TypeParam>(TEST_DIR "/iir/iir_mm.test");
 }
diff --git a/test/imageio.cpp b/test/imageio.cpp
index 95f40a9814..16cead852c 100644
--- a/test/imageio.cpp
+++ b/test/imageio.cpp
@@ -7,128 +7,142 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class ImageIO : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class ImageIO : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 typedef ::testing::Types<float> TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(ImageIO, TestTypes);
+TYPED_TEST_SUITE(ImageIO, TestTypes);
 
-// Disable tests if FreeImage is not found
-#if defined(WITH_FREEIMAGE)
-void loadImageTest(string pTestFile, string pImageFile, const bool isColor)
-{
-    if (noDoubleTests<float>()) return;
+void loadImageTest(string pTestFile, string pImageFile, const bool isColor) {
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array imgArray = 0;
-    ASSERT_EQ(AF_SUCCESS, af_load_image(&imgArray, pImageFile.c_str(), isColor));
+    ASSERT_SUCCESS(af_load_image(&imgArray, pImageFile.c_str(), isColor));
 
     // Get result
-    float *imgData = new float[dims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*) imgData, imgArray));
+    float* imgData = new float[dims.elements()];
+    ASSERT_SUCCESS(af_get_data_ptr((void*)imgData, imgArray));
+
+    bool isJPEG = false;
+    if (pImageFile.find(".jpg") != string::npos) { isJPEG = true; }
 
     // Compare result
     size_t nElems = in[0].size();
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(in[0][elIter], imgData[elIter]) << "at: " << elIter << std::endl;
+        if (isJPEG)  // Allow +- 1 because of compression when testing JPG
+            ASSERT_NEAR(in[0][elIter], imgData[elIter], 1)
+                << "at: " << elIter << endl;
+        else
+            ASSERT_EQ(in[0][elIter], imgData[elIter])
+                << "at: " << elIter << endl;
     }
 
     // Delete
     delete[] imgData;
 
-    if(imgArray != 0) af_release_array(imgArray);
+    if (imgArray != 0) af_release_array(imgArray);
 }
 
-TYPED_TEST(ImageIO, ColorSmall)
-{
-    loadImageTest(string(TEST_DIR"/imageio/color_small.test"), string(TEST_DIR"/imageio/color_small.png"), true);
+TYPED_TEST(ImageIO, ColorSmall) {
+    loadImageTest(string(TEST_DIR "/imageio/color_small.test"),
+                  string(TEST_DIR "/imageio/color_small.png"), true);
 }
 
-TYPED_TEST(ImageIO, GraySmall)
-{
-    loadImageTest(string(TEST_DIR"/imageio/gray_small.test"), string(TEST_DIR"/imageio/gray_small.jpg"), false);
+TYPED_TEST(ImageIO, GraySmall) {
+    loadImageTest(string(TEST_DIR "/imageio/gray_small.test"),
+                  string(TEST_DIR "/imageio/gray_small.jpg"), false);
 }
 
-TYPED_TEST(ImageIO, GraySeq)
-{
-    loadImageTest(string(TEST_DIR"/imageio/gray_seq.test"), string(TEST_DIR"/imageio/gray_seq.png"), false);
+TYPED_TEST(ImageIO, GraySeq) {
+    loadImageTest(string(TEST_DIR "/imageio/gray_seq.test"),
+                  string(TEST_DIR "/imageio/gray_seq.png"), false);
 }
 
-TYPED_TEST(ImageIO, ColorSeq)
-{
-    loadImageTest(string(TEST_DIR"/imageio/color_seq.test"), string(TEST_DIR"/imageio/color_seq.png"), true);
+TYPED_TEST(ImageIO, ColorSeq) {
+    loadImageTest(string(TEST_DIR "/imageio/color_seq.test"),
+                  string(TEST_DIR "/imageio/color_seq.png"), true);
 }
 
-void loadimageArgsTest(string pImageFile, const bool isColor, af_err err)
-{
+void loadimageArgsTest(string pImageFile, const bool isColor, af_err err) {
+    IMAGEIO_ENABLED_CHECK();
+
     af_array imgArray = 0;
 
     ASSERT_EQ(err, af_load_image(&imgArray, pImageFile.c_str(), isColor));
 
-    if(imgArray != 0) af_release_array(imgArray);
+    if (imgArray != 0) af_release_array(imgArray);
 }
 
-TYPED_TEST(ImageIO,InvalidArgsMissingFile)
-{
-    loadimageArgsTest(string(TEST_DIR"/imageio/nofile.png"), false, AF_ERR_RUNTIME);
+TYPED_TEST(ImageIO, InvalidArgsMissingFile) {
+    loadimageArgsTest(string(TEST_DIR "/imageio/nofile.png"), false,
+                      AF_ERR_RUNTIME);
 }
 
-TYPED_TEST(ImageIO,InvalidArgsWrongExt)
-{
-    loadimageArgsTest(string(TEST_DIR"/imageio/image.wrongext"), true, AF_ERR_NOT_SUPPORTED);
+TYPED_TEST(ImageIO, InvalidArgsWrongExt) {
+    loadimageArgsTest(string(TEST_DIR "/imageio/image.wrongext"), true,
+                      AF_ERR_NOT_SUPPORTED);
 }
 
 ////////////////////////////////// CPP //////////////////////////////////////
-TEST(ImageIO, CPP)
-{
-    if (noDoubleTests<float>()) return;
 
-    vector<af::dim4> numDims;
+using af::anyTrue;
+using af::deleteImageMem;
+using af::loadImage;
+using af::loadImageMem;
+using af::saveImageMem;
+using af::span;
+
+TEST(ImageIO, CPP) {
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/imageio/color_small.test"),numDims,in,tests);
+    vector<dim4> numDims;
 
-    af::dim4 dims = numDims[0];
-    af::array img = af::loadImage(string(TEST_DIR"/imageio/color_small.png").c_str(), true);
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/imageio/color_small.test"),
+                                   numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_small.png").c_str(), true);
 
     // Get result
-    float *imgData = new float[dims.elements()];
+    float* imgData = new float[dims.elements()];
     img.host((void*)imgData);
 
     // Compare result
     size_t nElems = in[0].size();
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(in[0][elIter], imgData[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(in[0][elIter], imgData[elIter]) << "at: " << elIter << endl;
     }
 
     // Delete
@@ -136,35 +150,244 @@ TEST(ImageIO, CPP)
 }
 
 TEST(ImageIO, SavePNGCPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array input(10, 10, 3, f32);
 
-    af::array input(10, 10, 3, f32);
+    input(span, span, span) = 0;
+    input(0, 0, 0)          = 255;
+    input(0, 9, 1)          = 255;
+    input(9, 0, 2)          = 255;
+    input(9, 9, span)       = 255;
 
-    input(af::span, af::span, af::span) = 0;
-    input(0, 0, 0) = 255;
-    input(0, 9, 1) = 255;
-    input(9, 0, 2) = 255;
-    input(9, 9, af::span) = 255;
+    std::string testname  = getTestName() + "_" + getBackendName(true);
+    std::string imagename = "SaveCPP_" + testname + ".png";
 
-    saveImage("SaveCPP.png", input);
-    af::array out = af::loadImage("SaveCPP.png", true);
+    saveImage(imagename.c_str(), input);
+    array out = loadImage(imagename.c_str(), true);
 
-    ASSERT_FALSE(af::anyTrue<bool>(out - input));
+    ASSERT_FALSE(anyTrue<bool>(out - input));
 }
 
 TEST(ImageIO, SaveBMPCPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array input(10, 10, 3, f32);
+
+    input(span, span, span) = 0;
+    input(0, 0, 0)          = 255;
+    input(0, 9, 1)          = 255;
+    input(9, 0, 2)          = 255;
+    input(9, 9, span)       = 255;
+
+    std::string testname  = getTestName() + "_" + getBackendName(true);
+    std::string imagename = "SaveCPP_" + testname + ".bmp";
+
+    saveImage(imagename.c_str(), input);
+    array out = loadImage(imagename.c_str(), true);
+
+    ASSERT_FALSE(anyTrue<bool>(out - input));
+}
+
+TEST(ImageMem, SaveMemPNG) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_seq.png").c_str(), true);
+
+    void* savedMem = saveImageMem(img, AF_FIF_PNG);
+
+    array loadMem = loadImageMem(savedMem);
 
-    af::array input(10, 10, 3, f32);
+    ASSERT_FALSE(anyTrue<bool>(img - loadMem));
 
-    input(af::span, af::span, af::span) = 0;
-    input(0, 0, 0) = 255;
-    input(0, 9, 1) = 255;
-    input(9, 0, 2) = 255;
-    input(9, 9, af::span) = 255;
+    deleteImageMem(savedMem);
+}
+
+TEST(ImageMem, SaveMemJPG1) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_seq.png").c_str(), false);
+    saveImage("color_seq1.jpg", img);
+
+    void* savedMem = saveImageMem(img, AF_FIF_JPEG);
 
-    saveImage("SaveCPP.bmp", input);
-    af::array out = af::loadImage("SaveCPP.bmp", true);
+    array loadMem = loadImageMem(savedMem);
+    array imgJPG  = loadImage("color_seq1.jpg", false);
 
-    ASSERT_FALSE(af::anyTrue<bool>(out - input));
+    ASSERT_FALSE(anyTrue<bool>(imgJPG - loadMem));
+
+    deleteImageMem(savedMem);
 }
 
-#endif // WITH_FREEIMAGE
+TEST(ImageMem, SaveMemJPG3) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_seq.png").c_str(), true);
+    saveImage("color_seq3.jpg", img);
+
+    void* savedMem = saveImageMem(img, AF_FIF_JPEG);
+
+    array loadMem = loadImageMem(savedMem);
+    array imgJPG  = loadImage("color_seq3.jpg", true);
+
+    ASSERT_FALSE(anyTrue<bool>(imgJPG - loadMem));
+
+    deleteImageMem(savedMem);
+}
+
+TEST(ImageMem, SaveMemBMP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_rand.png").c_str(), true);
+
+    void* savedMem = saveImageMem(img, AF_FIF_BMP);
+
+    array loadMem = loadImageMem(savedMem);
+
+    ASSERT_FALSE(anyTrue<bool>(img - loadMem));
+
+    deleteImageMem(savedMem);
+}
+
+TEST(ImageIO, LoadImage16CPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> numDims;
+
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(
+        string(TEST_DIR "/imageio/color_seq_16.test"), numDims, in, tests);
+
+    dim4 dims = numDims[0];
+
+    array img =
+        loadImage(string(TEST_DIR "/imageio/color_seq_16.png").c_str(), true);
+    ASSERT_EQ(img.type(), f32);  // loadImage should always return float
+
+    // Get result
+    float* imgData = new float[dims.elements()];
+    img.host((void*)imgData);
+
+    // Compare result
+    size_t nElems = in[0].size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(in[0][elIter], imgData[elIter]) << "at: " << elIter << endl;
+    }
+
+    // Delete
+    delete[] imgData;
+}
+
+TEST(ImageIO, SaveImage16CPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    dim4 dims(16, 24, 3);
+
+    array input     = randu(dims, u16);
+    array input_255 = floor(input.as(f32) / 257);
+
+    std::string testname  = getTestName() + "_" + getBackendName(true);
+    std::string imagename = "saveImage16CPP_" + testname + ".png";
+
+    saveImage(imagename.c_str(), input);
+
+    array img = loadImage(imagename.c_str(), true);
+
+    ASSERT_EQ(img.type(), f32);  // loadImage should always return float
+    ASSERT_IMAGES_NEAR(input_255, img, 0.001);
+}
+
+////////////////////////////////////////////////////////////////////////////////
+// Image IO Native Tests
+////////////////////////////////////////////////////////////////////////////////
+
+using af::dtype_traits;
+using af::loadImageNative;
+using af::saveImageNative;
+
+template<typename T>
+void loadImageNativeCPPTest(string pTestFile, string pImageFile) {
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> numDims;
+
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(pTestFile, numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array img = loadImageNative(pImageFile.c_str());
+    ASSERT_EQ(img.type(), (af_dtype)dtype_traits<T>::af_type);
+
+    // Get result
+    T* imgData = new T[dims.elements()];
+    img.host((void*)imgData);
+
+    // Compare result
+    size_t nElems = in[0].size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(in[0][elIter], imgData[elIter]) << "at: " << elIter << endl;
+    }
+
+    // Delete
+    delete[] imgData;
+}
+
+TEST(ImageIONative, LoadImageNative8CPP) {
+    loadImageNativeCPPTest<uchar>(string(TEST_DIR "/imageio/color_small.test"),
+                                  string(TEST_DIR "/imageio/color_small.png"));
+}
+
+TEST(ImageIONative, LoadImageNative16SmallCPP) {
+    loadImageNativeCPPTest<ushort>(
+        string(TEST_DIR "/imageio/color_small_16.test"),
+        string(TEST_DIR "/imageio/color_small_16.png"));
+}
+
+TEST(ImageIONative, LoadImageNative16ColorCPP) {
+    loadImageNativeCPPTest<ushort>(
+        string(TEST_DIR "/imageio/color_seq_16.test"),
+        string(TEST_DIR "/imageio/color_seq_16.png"));
+}
+
+TEST(ImageIONative, LoadImageNative16GrayCPP) {
+    loadImageNativeCPPTest<ushort>(string(TEST_DIR "/imageio/gray_seq_16.test"),
+                                   string(TEST_DIR "/imageio/gray_seq_16.png"));
+}
+
+template<typename T>
+void saveLoadImageNativeCPPTest(dim4 dims) {
+    IMAGEIO_ENABLED_CHECK();
+
+    array input = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+
+    std::string imagename = getTestName() + "_" + getBackendName(true) + ".png";
+
+    saveImageNative(imagename.c_str(), input);
+
+    array loaded = loadImageNative(imagename.c_str());
+    ASSERT_EQ(loaded.type(), input.type());
+
+    ASSERT_FALSE(anyTrue<bool>(input - loaded));
+}
+
+TEST(ImageIONative, SaveLoadImageNative8CPP) {
+    saveLoadImageNativeCPPTest<uchar>(dim4(480, 720, 3, 1));
+}
+
+TEST(ImageIONative, SaveLoadImageNative16SmallCPP) {
+    saveLoadImageNativeCPPTest<ushort>(dim4(8, 12, 3, 1));
+}
+
+TEST(ImageIONative, SaveLoadImageNative16ColorCPP) {
+    saveLoadImageNativeCPPTest<ushort>(dim4(480, 720, 3, 1));
+}
+
+TEST(ImageIONative, SaveLoadImageNative16GrayCPP) {
+    saveLoadImageNativeCPPTest<ushort>(dim4(24, 32, 1, 1));
+}
diff --git a/test/index.cpp b/test/index.cpp
index c43e4db4c5..d5d010ffb1 100644
--- a/test/index.cpp
+++ b/test/index.cpp
@@ -7,77 +7,83 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/data.h>
 
-#include <vector>
 #include <algorithm>
 #include <functional>
 #include <iostream>
+#include <numeric>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::generate;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
 using std::ostream_iterator;
-using af::dtype_traits;
-
+using std::string;
+using std::vector;
 
 template<typename T, typename OP>
-void
-checkValues(const af_seq &seq, const T* data, const T* indexed_data, OP compair_op) {
-    for(int i = 0, j = seq.begin; compair_op(j,(int)seq.end); j+= seq.step, i++) {
+void checkValues(const af_seq &seq, const T *data, const T *indexed_data,
+                 OP compair_op) {
+    for (int i = 0, j = seq.begin; compair_op(j, (int)seq.end);
+         j += seq.step, i++) {
         ASSERT_DOUBLE_EQ(real(data[j]), real(indexed_data[i]))
-        << "Where i = " << i << " and j = " << j;
+            << "Where i = " << i << " and j = " << j;
     }
 }
 
 template<typename T>
-void
-DimCheck(const vector<af_seq> &seqs) {
-    if (noDoubleTests<T>()) return;
+void DimCheck(const vector<af_seq> &seqs) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    static const int ndims = 1;
+    static const int ndims   = 1;
     static const size_t dims = 100;
 
     dim_t d[1] = {dims};
 
     vector<T> hData(dims);
-    for(int i = 0; i < (int)dims; i++) { hData[i] = i; }
+    for (int i = 0; i < (int)dims; i++) { hData[i] = i; }
 
     af_array a = 0;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&a, &hData.front(), ndims, d, (af_dtype) dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&a, &hData.front(), ndims, d,
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     vector<af_array> indexed_array(seqs.size(), 0);
-    for(size_t i = 0; i < seqs.size(); i++) {
-        ASSERT_EQ(AF_SUCCESS, af_index(&(indexed_array[i]), a, ndims, &seqs[i]))
-            << "where seqs[i].begin == "    << seqs[i].begin
-            << " seqs[i].step == "          << seqs[i].step
-            << " seqs[i].end == "           << seqs[i].end;
+    for (size_t i = 0; i < seqs.size(); i++) {
+        ASSERT_SUCCESS(af_index(&(indexed_array[i]), a, ndims, &seqs[i]))
+            << "where seqs[i].begin == " << seqs[i].begin
+            << " seqs[i].step == " << seqs[i].step
+            << " seqs[i].end == " << seqs[i].end;
     }
 
-    vector<T*> h_indexed(seqs.size());
-    for(size_t i = 0; i < seqs.size(); i++) {
+    vector<T *> h_indexed(seqs.size());
+    for (size_t i = 0; i < seqs.size(); i++) {
         dim_t elems;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&elems, indexed_array[i]));
+        ASSERT_SUCCESS(af_get_elements(&elems, indexed_array[i]));
         h_indexed[i] = new T[elems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void *)(h_indexed[i]), indexed_array[i]));
+        ASSERT_SUCCESS(
+            af_get_data_ptr((void *)(h_indexed[i]), indexed_array[i]));
     }
 
-    for(size_t k = 0; k < seqs.size(); k++) {
-        if(seqs[k].step > 0)        {
-            checkValues(seqs[k], &hData.front(), h_indexed[k], std::less_equal<int>());
-        } else if (seqs[k].step < 0)  {
-            checkValues(seqs[k], &hData.front(), h_indexed[k], std::greater_equal<int>());
+    for (size_t k = 0; k < seqs.size(); k++) {
+        if (seqs[k].step > 0) {
+            checkValues(seqs[k], &hData.front(), h_indexed[k],
+                        std::less_equal<int>());
+        } else if (seqs[k].step < 0) {
+            checkValues(seqs[k], &hData.front(), h_indexed[k],
+                        std::greater_equal<int>());
         } else {
-            for(size_t i = 0; i <= seqs[k].end; i++) {
+            for (size_t i = 0; i <= seqs[k].end; i++) {
                 ASSERT_DOUBLE_EQ(real(hData[i]), real(h_indexed[k][i]))
                     << "Where i = " << i;
             }
@@ -85,39 +91,44 @@ DimCheck(const vector<af_seq> &seqs) {
         delete[] h_indexed[k];
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(a));
     for (size_t i = 0; i < indexed_array.size(); i++) {
-        ASSERT_EQ(AF_SUCCESS, af_release_array(indexed_array[i]));
+        ASSERT_SUCCESS(af_release_array(indexed_array[i]));
     }
 }
 
 template<typename T>
-class Indexing1D : public ::testing::Test
-{
-public:
+class Indexing1D : public ::testing::Test {
+   public:
     virtual void SetUp() {
-        continuous_seqs.push_back(af_make_seq(  0,    20,   1 )); // Begin Continious
-        continuous_seqs.push_back(af_make_seq(  80,   99,   1 )); // End Continious
-        continuous_seqs.push_back(af_make_seq(  10,   89,   1 )); // Mid Continious
-
-        continuous_reverse_seqs.push_back(af_make_seq(  20,   0,    -1 )); // Begin Reverse Continious
-        continuous_reverse_seqs.push_back(af_make_seq(  99,   80,   -1 )); // End Reverse Continious
-        continuous_reverse_seqs.push_back(af_make_seq(  89,   10,   -1 )); // Mid Reverse Continious
-
-        strided_seqs.push_back(af_make_seq(  5,    40,   2 )); // Two Step
-        strided_seqs.push_back(af_make_seq(  5,    40,   3 )); // Three Step
-        strided_seqs.push_back(af_make_seq(  5,    40,   4 )); // Four Step
-
-        strided_reverse_seqs.push_back(af_make_seq(  40,    5,   -2 )); // Reverse Two Step
-        strided_reverse_seqs.push_back(af_make_seq(  40,    5,   -3 )); // Reverse Three Step
-        strided_reverse_seqs.push_back(af_make_seq(  40,    5,   -4 )); // Reverse Four Step
+        continuous_seqs.push_back(af_make_seq(0, 20, 1));   // Begin Continuous
+        continuous_seqs.push_back(af_make_seq(80, 99, 1));  // End Continuous
+        continuous_seqs.push_back(af_make_seq(10, 89, 1));  // Mid Continuous
+
+        continuous_reverse_seqs.push_back(
+            af_make_seq(20, 0, -1));  // Begin Reverse Continuous
+        continuous_reverse_seqs.push_back(
+            af_make_seq(99, 80, -1));  // End Reverse Continuous
+        continuous_reverse_seqs.push_back(
+            af_make_seq(89, 10, -1));  // Mid Reverse Continuous
+
+        strided_seqs.push_back(af_make_seq(5, 40, 2));  // Two Step
+        strided_seqs.push_back(af_make_seq(5, 40, 3));  // Three Step
+        strided_seqs.push_back(af_make_seq(5, 40, 4));  // Four Step
+
+        strided_reverse_seqs.push_back(
+            af_make_seq(40, 5, -2));  // Reverse Two Step
+        strided_reverse_seqs.push_back(
+            af_make_seq(40, 5, -3));  // Reverse Three Step
+        strided_reverse_seqs.push_back(
+            af_make_seq(40, 5, -4));  // Reverse Four Step
 
         span_seqs.push_back(af_span);
     }
 
     virtual ~Indexing1D() {}
 
-    //virtual void TearDown() {}
+    // virtual void TearDown() {}
 
     vector<af_seq> continuous_seqs;
     vector<af_seq> continuous_reverse_seqs;
@@ -126,20 +137,27 @@ class Indexing1D : public ::testing::Test
     vector<af_seq> span_seqs;
 };
 
-typedef ::testing::Types<float, double, af::cfloat, af::cdouble, int, unsigned, unsigned char, intl, uintl> AllTypes;
-TYPED_TEST_CASE(Indexing1D, AllTypes);
-
-TYPED_TEST(Indexing1D, Continious)          { DimCheck<TypeParam>(this->continuous_seqs);           }
-TYPED_TEST(Indexing1D, ContiniousReverse)   { DimCheck<TypeParam>(this->continuous_reverse_seqs);   }
-TYPED_TEST(Indexing1D, Strided)             { DimCheck<TypeParam>(this->strided_seqs);              }
-TYPED_TEST(Indexing1D, StridedReverse)      { DimCheck<TypeParam>(this->strided_reverse_seqs);      }
-TYPED_TEST(Indexing1D, Span)                { DimCheck<TypeParam>(this->span_seqs);                 }
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned,
+                         signed char, unsigned char, intl, uintl, short, ushort,
+                         half_float::half>
+    AllTypes;
+TYPED_TEST_SUITE(Indexing1D, AllTypes);
 
+TYPED_TEST(Indexing1D, Continious) {
+    DimCheck<TypeParam>(this->continuous_seqs);
+}
+TYPED_TEST(Indexing1D, ContiniousReverse) {
+    DimCheck<TypeParam>(this->continuous_reverse_seqs);
+}
+TYPED_TEST(Indexing1D, Strided) { DimCheck<TypeParam>(this->strided_seqs); }
+TYPED_TEST(Indexing1D, StridedReverse) {
+    DimCheck<TypeParam>(this->strided_reverse_seqs);
+}
+TYPED_TEST(Indexing1D, Span) { DimCheck<TypeParam>(this->span_seqs); }
 
 template<typename T>
-class Indexing2D : public ::testing::Test
-{
-public:
+class Indexing2D : public ::testing::Test {
+   public:
     vector<af_seq> make_vec(af_seq first, af_seq second) {
         vector<af_seq> out;
         out.push_back(first);
@@ -147,246 +165,304 @@ class Indexing2D : public ::testing::Test
         return out;
     }
     virtual void SetUp() {
-
-        column_continuous_seq.push_back(make_vec(af_span, af_make_seq(  0,  6,  1)));
-        column_continuous_seq.push_back(make_vec(af_span, af_make_seq(  4,  9,  1)));
-        column_continuous_seq.push_back(make_vec(af_span, af_make_seq(  3,  8,  1)));
-
-        column_continuous_reverse_seq.push_back(make_vec(af_span, af_make_seq(  6,  0,  -1)));
-        column_continuous_reverse_seq.push_back(make_vec(af_span, af_make_seq(  9,  4,  -1)));
-        column_continuous_reverse_seq.push_back(make_vec(af_span, af_make_seq(  8,  3,  -1)));
-
-        column_strided_seq.push_back(make_vec(af_span, af_make_seq(  0,    8,   2 ))); // Two Step
-        column_strided_seq.push_back(make_vec(af_span, af_make_seq(  2,    9,   3 ))); // Three Step
-        column_strided_seq.push_back(make_vec(af_span, af_make_seq(  0,    9,   4 ))); // Four Step
-
-        column_strided_reverse_seq.push_back(make_vec(af_span, af_make_seq(  8,   0,   -2 ))); // Two Step
-        column_strided_reverse_seq.push_back(make_vec(af_span, af_make_seq(  9,   2,   -3 ))); // Three Step
-        column_strided_reverse_seq.push_back(make_vec(af_span, af_make_seq(  9,   0,   -4 ))); // Four Step
-
-        row_continuous_seq.push_back(make_vec(af_make_seq(  0,  6,  1), af_span));
-        row_continuous_seq.push_back(make_vec(af_make_seq(  4,  9,  1), af_span));
-        row_continuous_seq.push_back(make_vec(af_make_seq(  3,  8,  1), af_span));
-
-        row_continuous_reverse_seq.push_back(make_vec(af_make_seq(  6,  0,  -1), af_span));
-        row_continuous_reverse_seq.push_back(make_vec(af_make_seq(  9,  4,  -1), af_span));
-        row_continuous_reverse_seq.push_back(make_vec(af_make_seq(  8,  3,  -1), af_span));
-
-        row_strided_seq.push_back(make_vec(af_make_seq(  0,    8,   2 ), af_span));
-        row_strided_seq.push_back(make_vec(af_make_seq(  2,    9,   3 ), af_span));
-        row_strided_seq.push_back(make_vec(af_make_seq(  0,    9,   4 ), af_span));
-
-        row_strided_reverse_seq.push_back(make_vec(af_make_seq(  8,   0,   -2 ), af_span));
-        row_strided_reverse_seq.push_back(make_vec(af_make_seq(  9,   2,   -3 ), af_span));
-        row_strided_reverse_seq.push_back(make_vec(af_make_seq(  9,   0,   -4 ), af_span));
-
-        continuous_continuous_seq.push_back(make_vec(af_make_seq(  1,  6,  1), af_make_seq(  0,  6,  1)));
-        continuous_continuous_seq.push_back(make_vec(af_make_seq(  3,  9,  1), af_make_seq(  4,  9,  1)));
-        continuous_continuous_seq.push_back(make_vec(af_make_seq(  5,  8,  1), af_make_seq(  3,  8,  1)));
-
-        continuous_reverse_seq.push_back(make_vec(af_make_seq(  1,  6,  1), af_make_seq(  6,  0,  -1)));
-        continuous_reverse_seq.push_back(make_vec(af_make_seq(  3,  9,  1), af_make_seq(  9,  4,  -1)));
-        continuous_reverse_seq.push_back(make_vec(af_make_seq(  5,  8,  1), af_make_seq(  8,  3,  -1)));
-
-        continuous_strided_seq.push_back(make_vec(af_make_seq(  1,  6,  1), af_make_seq(  0,  8,  2)));
-        continuous_strided_seq.push_back(make_vec(af_make_seq(  3,  9,  1), af_make_seq(  2,  9,  3)));
-        continuous_strided_seq.push_back(make_vec(af_make_seq(  5,  8,  1), af_make_seq(  1,  9,  4)));
-
-        continuous_strided_reverse_seq.push_back(make_vec(af_make_seq(  1,  6,  1), af_make_seq(  8,  0,  -2)));
-        continuous_strided_reverse_seq.push_back(make_vec(af_make_seq(  3,  9,  1), af_make_seq(  9,  2,  -3)));
-        continuous_strided_reverse_seq.push_back(make_vec(af_make_seq(  5,  8,  1), af_make_seq(  9,  1,  -4)));
-
-        reverse_continuous_seq.push_back(make_vec(af_make_seq(  6,  1,  -1), af_make_seq(  0,  6,  1)));
-        reverse_continuous_seq.push_back(make_vec(af_make_seq(  9,  3,  -1), af_make_seq(  4,  9,  1)));
-        reverse_continuous_seq.push_back(make_vec(af_make_seq(  8,  5,  -1), af_make_seq(  3,  8,  1)));
-
-        reverse_reverse_seq.push_back(make_vec(af_make_seq(  6,  1,  -1), af_make_seq(  6,  0,  -1)));
-        reverse_reverse_seq.push_back(make_vec(af_make_seq(  9,  3,  -1), af_make_seq(  9,  4,  -1)));
-        reverse_reverse_seq.push_back(make_vec(af_make_seq(  8,  5,  -1), af_make_seq(  8,  3,  -1)));
-
-        reverse_strided_seq.push_back(make_vec(af_make_seq(  6,  1,  -1), af_make_seq(  0,  8,  2)));
-        reverse_strided_seq.push_back(make_vec(af_make_seq(  9,  3,  -1), af_make_seq(  2,  9,  3)));
-        reverse_strided_seq.push_back(make_vec(af_make_seq(  8,  5,  -1), af_make_seq(  1,  9,  4)));
-
-        reverse_strided_reverse_seq.push_back(make_vec(af_make_seq(  6,  1,  -1), af_make_seq(  8,  0,  -2)));
-        reverse_strided_reverse_seq.push_back(make_vec(af_make_seq(  9,  3,  -1), af_make_seq(  9,  2,  -3)));
-        reverse_strided_reverse_seq.push_back(make_vec(af_make_seq(  8,  5,  -1), af_make_seq(  9,  1,  -4)));
-
-        strided_continuous_seq.push_back(make_vec(af_make_seq(  0,  8,  2), af_make_seq(  0,  6,  1)));
-        strided_continuous_seq.push_back(make_vec(af_make_seq(  2,  9,  3), af_make_seq(  4,  9,  1)));
-        strided_continuous_seq.push_back(make_vec(af_make_seq(  1,  9,  4), af_make_seq(  3,  8,  1)));
-
-        strided_strided_seq.push_back(make_vec(af_make_seq(  1,  6,  2), af_make_seq(  0,  8,  2)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  3,  9,  2), af_make_seq(  2,  9,  3)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  5,  8,  2), af_make_seq(  1,  9,  4)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  1,  6,  3), af_make_seq(  0,  8,  2)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  3,  9,  3), af_make_seq(  2,  9,  3)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  5,  8,  3), af_make_seq(  1,  9,  4)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  1,  6,  4), af_make_seq(  0,  8,  2)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  3,  9,  4), af_make_seq(  2,  9,  3)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  3,  8,  4), af_make_seq(  1,  9,  4)));
-        strided_strided_seq.push_back(make_vec(af_make_seq(  3,  6,  4), af_make_seq(  1,  9,  4)));
+        column_continuous_seq.push_back(
+            make_vec(af_span, af_make_seq(0, 6, 1)));
+        column_continuous_seq.push_back(
+            make_vec(af_span, af_make_seq(4, 9, 1)));
+        column_continuous_seq.push_back(
+            make_vec(af_span, af_make_seq(3, 8, 1)));
+
+        column_continuous_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(6, 0, -1)));
+        column_continuous_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(9, 4, -1)));
+        column_continuous_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(8, 3, -1)));
+
+        column_strided_seq.push_back(
+            make_vec(af_span, af_make_seq(0, 8, 2)));  // Two Step
+        column_strided_seq.push_back(
+            make_vec(af_span, af_make_seq(2, 9, 3)));  // Three Step
+        column_strided_seq.push_back(
+            make_vec(af_span, af_make_seq(0, 9, 4)));  // Four Step
+
+        column_strided_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(8, 0, -2)));  // Two Step
+        column_strided_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(9, 2, -3)));  // Three Step
+        column_strided_reverse_seq.push_back(
+            make_vec(af_span, af_make_seq(9, 0, -4)));  // Four Step
+
+        row_continuous_seq.push_back(make_vec(af_make_seq(0, 6, 1), af_span));
+        row_continuous_seq.push_back(make_vec(af_make_seq(4, 9, 1), af_span));
+        row_continuous_seq.push_back(make_vec(af_make_seq(3, 8, 1), af_span));
+
+        row_continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(6, 0, -1), af_span));
+        row_continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(9, 4, -1), af_span));
+        row_continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(8, 3, -1), af_span));
+
+        row_strided_seq.push_back(make_vec(af_make_seq(0, 8, 2), af_span));
+        row_strided_seq.push_back(make_vec(af_make_seq(2, 9, 3), af_span));
+        row_strided_seq.push_back(make_vec(af_make_seq(0, 9, 4), af_span));
+
+        row_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(8, 0, -2), af_span));
+        row_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(9, 2, -3), af_span));
+        row_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(9, 0, -4), af_span));
+
+        continuous_continuous_seq.push_back(
+            make_vec(af_make_seq(1, 6, 1), af_make_seq(0, 6, 1)));
+        continuous_continuous_seq.push_back(
+            make_vec(af_make_seq(3, 9, 1), af_make_seq(4, 9, 1)));
+        continuous_continuous_seq.push_back(
+            make_vec(af_make_seq(5, 8, 1), af_make_seq(3, 8, 1)));
+
+        continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(1, 6, 1), af_make_seq(6, 0, -1)));
+        continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(3, 9, 1), af_make_seq(9, 4, -1)));
+        continuous_reverse_seq.push_back(
+            make_vec(af_make_seq(5, 8, 1), af_make_seq(8, 3, -1)));
+
+        continuous_strided_seq.push_back(
+            make_vec(af_make_seq(1, 6, 1), af_make_seq(0, 8, 2)));
+        continuous_strided_seq.push_back(
+            make_vec(af_make_seq(3, 9, 1), af_make_seq(2, 9, 3)));
+        continuous_strided_seq.push_back(
+            make_vec(af_make_seq(5, 8, 1), af_make_seq(1, 9, 4)));
+
+        continuous_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(1, 6, 1), af_make_seq(8, 0, -2)));
+        continuous_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(3, 9, 1), af_make_seq(9, 2, -3)));
+        continuous_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(5, 8, 1), af_make_seq(9, 1, -4)));
+
+        reverse_continuous_seq.push_back(
+            make_vec(af_make_seq(6, 1, -1), af_make_seq(0, 6, 1)));
+        reverse_continuous_seq.push_back(
+            make_vec(af_make_seq(9, 3, -1), af_make_seq(4, 9, 1)));
+        reverse_continuous_seq.push_back(
+            make_vec(af_make_seq(8, 5, -1), af_make_seq(3, 8, 1)));
+
+        reverse_reverse_seq.push_back(
+            make_vec(af_make_seq(6, 1, -1), af_make_seq(6, 0, -1)));
+        reverse_reverse_seq.push_back(
+            make_vec(af_make_seq(9, 3, -1), af_make_seq(9, 4, -1)));
+        reverse_reverse_seq.push_back(
+            make_vec(af_make_seq(8, 5, -1), af_make_seq(8, 3, -1)));
+
+        reverse_strided_seq.push_back(
+            make_vec(af_make_seq(6, 1, -1), af_make_seq(0, 8, 2)));
+        reverse_strided_seq.push_back(
+            make_vec(af_make_seq(9, 3, -1), af_make_seq(2, 9, 3)));
+        reverse_strided_seq.push_back(
+            make_vec(af_make_seq(8, 5, -1), af_make_seq(1, 9, 4)));
+
+        reverse_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(6, 1, -1), af_make_seq(8, 0, -2)));
+        reverse_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(9, 3, -1), af_make_seq(9, 2, -3)));
+        reverse_strided_reverse_seq.push_back(
+            make_vec(af_make_seq(8, 5, -1), af_make_seq(9, 1, -4)));
+
+        strided_continuous_seq.push_back(
+            make_vec(af_make_seq(0, 8, 2), af_make_seq(0, 6, 1)));
+        strided_continuous_seq.push_back(
+            make_vec(af_make_seq(2, 9, 3), af_make_seq(4, 9, 1)));
+        strided_continuous_seq.push_back(
+            make_vec(af_make_seq(1, 9, 4), af_make_seq(3, 8, 1)));
+
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(1, 6, 2), af_make_seq(0, 8, 2)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(3, 9, 2), af_make_seq(2, 9, 3)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(5, 8, 2), af_make_seq(1, 9, 4)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(1, 6, 3), af_make_seq(0, 8, 2)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(3, 9, 3), af_make_seq(2, 9, 3)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(5, 8, 3), af_make_seq(1, 9, 4)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(1, 6, 4), af_make_seq(0, 8, 2)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(3, 9, 4), af_make_seq(2, 9, 3)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(3, 8, 4), af_make_seq(1, 9, 4)));
+        strided_strided_seq.push_back(
+            make_vec(af_make_seq(3, 6, 4), af_make_seq(1, 9, 4)));
     }
 
-    vector<vector<af_seq> > column_continuous_seq;
-    vector<vector<af_seq> > column_continuous_reverse_seq;
-    vector<vector<af_seq> > column_strided_seq;
-    vector<vector<af_seq> > column_strided_reverse_seq;
+    vector<vector<af_seq>> column_continuous_seq;
+    vector<vector<af_seq>> column_continuous_reverse_seq;
+    vector<vector<af_seq>> column_strided_seq;
+    vector<vector<af_seq>> column_strided_reverse_seq;
 
-    vector<vector<af_seq> > row_continuous_seq;
-    vector<vector<af_seq> > row_continuous_reverse_seq;
-    vector<vector<af_seq> > row_strided_seq;
-    vector<vector<af_seq> > row_strided_reverse_seq;
+    vector<vector<af_seq>> row_continuous_seq;
+    vector<vector<af_seq>> row_continuous_reverse_seq;
+    vector<vector<af_seq>> row_strided_seq;
+    vector<vector<af_seq>> row_strided_reverse_seq;
 
-    vector<vector<af_seq> > continuous_continuous_seq;
-    vector<vector<af_seq> > continuous_strided_seq;
-    vector<vector<af_seq> > continuous_reverse_seq;
-    vector<vector<af_seq> > continuous_strided_reverse_seq;
+    vector<vector<af_seq>> continuous_continuous_seq;
+    vector<vector<af_seq>> continuous_strided_seq;
+    vector<vector<af_seq>> continuous_reverse_seq;
+    vector<vector<af_seq>> continuous_strided_reverse_seq;
 
-    vector<vector<af_seq> > reverse_continuous_seq;
-    vector<vector<af_seq> > reverse_reverse_seq;
-    vector<vector<af_seq> > reverse_strided_seq;
-    vector<vector<af_seq> > reverse_strided_reverse_seq;
+    vector<vector<af_seq>> reverse_continuous_seq;
+    vector<vector<af_seq>> reverse_reverse_seq;
+    vector<vector<af_seq>> reverse_strided_seq;
+    vector<vector<af_seq>> reverse_strided_reverse_seq;
 
-    vector<vector<af_seq> > strided_continuous_seq;
-    vector<vector<af_seq> > strided_strided_seq;
+    vector<vector<af_seq>> strided_continuous_seq;
+    vector<vector<af_seq>> strided_strided_seq;
 };
 
 template<typename T>
-void
-DimCheck2D(const vector<vector<af_seq> > &seqs,string TestFile, size_t NDims)
-{
-    if (noDoubleTests<T>()) return;
+void DimCheck2D(const vector<vector<af_seq>> &seqs, string TestFile,
+                size_t NDims) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> > hData;
-    vector<vector<T> > tests;
-    readTests<T,T,int>(TestFile, numDims, hData, tests);
-    af::dim4 dimensions = numDims[0];
+    vector<vector<T>> hData;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(TestFile, numDims, hData, tests);
+    dim4 dimensions = numDims[0];
 
     af_array a = 0;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&a, &(hData[0].front()), NDims, dimensions.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&a, &(hData[0].front()), NDims,
+                                   dimensions.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     vector<af_array> indexed_arrays(seqs.size(), 0);
-    for(size_t i = 0; i < seqs.size(); i++) {
-        ASSERT_EQ(AF_SUCCESS, af_index(&(indexed_arrays[i]), a, NDims, seqs[i].data()));
+    for (size_t i = 0; i < seqs.size(); i++) {
+        ASSERT_SUCCESS(
+            af_index(&(indexed_arrays[i]), a, NDims, seqs[i].data()));
     }
 
-    vector<T*> h_indexed(seqs.size(), NULL);
-    for(size_t i = 0; i < seqs.size(); i++) {
+    vector<T *> h_indexed(seqs.size(), NULL);
+    for (size_t i = 0; i < seqs.size(); i++) {
         dim_t elems;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&elems, indexed_arrays[i]));
+        ASSERT_SUCCESS(af_get_elements(&elems, indexed_arrays[i]));
         h_indexed[i] = new T[elems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void *)h_indexed[i], indexed_arrays[i]));
+        ASSERT_SUCCESS(
+            af_get_data_ptr((void *)h_indexed[i], indexed_arrays[i]));
 
-        T* ptr = h_indexed[i];
-        if(false == equal(ptr, ptr + tests[i].size(), tests[i].begin())) {
+        T *ptr = h_indexed[i];
+        if (false == equal(ptr, ptr + tests[i].size(), tests[i].begin())) {
             cout << "index data: ";
             copy(ptr, ptr + tests[i].size(), ostream_iterator<T>(cout, ", "));
             cout << endl << "file data: ";
-            copy(tests[i].begin(), tests[i].end(), ostream_iterator<T>(cout, ", "));
+            copy(tests[i].begin(), tests[i].end(),
+                 ostream_iterator<T>(cout, ", "));
             FAIL() << "indexed_array[" << i << "] FAILED" << endl;
         }
         delete[] h_indexed[i];
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(a));
+    ASSERT_SUCCESS(af_release_array(a));
     for (size_t i = 0; i < indexed_arrays.size(); i++) {
-        ASSERT_EQ(AF_SUCCESS, af_release_array(indexed_arrays[i]));
+        ASSERT_SUCCESS(af_release_array(indexed_arrays[i]));
     }
 }
 
-TYPED_TEST_CASE(Indexing2D, AllTypes);
+TYPED_TEST_SUITE(Indexing2D, AllTypes);
 
-TYPED_TEST(Indexing2D, ColumnContinious)
-{
-    DimCheck2D<TypeParam>(this->column_continuous_seq, TEST_DIR"/index/ColumnContinious.test", 2);
+TYPED_TEST(Indexing2D, ColumnContinious) {
+    DimCheck2D<TypeParam>(this->column_continuous_seq,
+                          TEST_DIR "/index/ColumnContinious.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ColumnContiniousReverse)
-{
-    DimCheck2D<TypeParam>(this->column_continuous_reverse_seq, TEST_DIR"/index/ColumnContiniousReverse.test", 2);
+TYPED_TEST(Indexing2D, ColumnContiniousReverse) {
+    DimCheck2D<TypeParam>(this->column_continuous_reverse_seq,
+                          TEST_DIR "/index/ColumnContiniousReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ColumnStrided)
-{
-    DimCheck2D<TypeParam>(this->column_strided_seq, TEST_DIR"/index/ColumnStrided.test", 2);
+TYPED_TEST(Indexing2D, ColumnStrided) {
+    DimCheck2D<TypeParam>(this->column_strided_seq,
+                          TEST_DIR "/index/ColumnStrided.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ColumnStridedReverse)
-{
-    DimCheck2D<TypeParam>(this->column_strided_reverse_seq, TEST_DIR"/index/ColumnStridedReverse.test", 2);
+TYPED_TEST(Indexing2D, ColumnStridedReverse) {
+    DimCheck2D<TypeParam>(this->column_strided_reverse_seq,
+                          TEST_DIR "/index/ColumnStridedReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, RowContinious)
-{
-    DimCheck2D<TypeParam>(this->row_continuous_seq, TEST_DIR"/index/RowContinious.test", 2);
+TYPED_TEST(Indexing2D, RowContinious) {
+    DimCheck2D<TypeParam>(this->row_continuous_seq,
+                          TEST_DIR "/index/RowContinious.test", 2);
 }
 
-TYPED_TEST(Indexing2D, RowContiniousReverse)
-{
-    DimCheck2D<TypeParam>(this->row_continuous_reverse_seq, TEST_DIR"/index/RowContiniousReverse.test", 2);
+TYPED_TEST(Indexing2D, RowContiniousReverse) {
+    DimCheck2D<TypeParam>(this->row_continuous_reverse_seq,
+                          TEST_DIR "/index/RowContiniousReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, RowStrided)
-{
-    DimCheck2D<TypeParam>(this->row_strided_seq, TEST_DIR"/index/RowStrided.test", 2);
+TYPED_TEST(Indexing2D, RowStrided) {
+    DimCheck2D<TypeParam>(this->row_strided_seq,
+                          TEST_DIR "/index/RowStrided.test", 2);
 }
 
-TYPED_TEST(Indexing2D, RowStridedReverse)
-{
-    DimCheck2D<TypeParam>(this->row_strided_reverse_seq, TEST_DIR"/index/RowStridedReverse.test", 2);
+TYPED_TEST(Indexing2D, RowStridedReverse) {
+    DimCheck2D<TypeParam>(this->row_strided_reverse_seq,
+                          TEST_DIR "/index/RowStridedReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ContiniousContinious)
-{
-    DimCheck2D<TypeParam>(this->continuous_continuous_seq, TEST_DIR"/index/ContiniousContinious.test", 2);
+TYPED_TEST(Indexing2D, ContiniousContinious) {
+    DimCheck2D<TypeParam>(this->continuous_continuous_seq,
+                          TEST_DIR "/index/ContiniousContinious.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ContiniousReverse)
-{
-    DimCheck2D<TypeParam>(this->continuous_reverse_seq, TEST_DIR"/index/ContiniousReverse.test", 2);
+TYPED_TEST(Indexing2D, ContiniousReverse) {
+    DimCheck2D<TypeParam>(this->continuous_reverse_seq,
+                          TEST_DIR "/index/ContiniousReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ContiniousStrided)
-{
-    DimCheck2D<TypeParam>(this->continuous_strided_seq, TEST_DIR"/index/ContiniousStrided.test", 2);
+TYPED_TEST(Indexing2D, ContiniousStrided) {
+    DimCheck2D<TypeParam>(this->continuous_strided_seq,
+                          TEST_DIR "/index/ContiniousStrided.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ContiniousStridedReverse)
-{
-    DimCheck2D<TypeParam>(this->continuous_strided_reverse_seq, TEST_DIR"/index/ContiniousStridedReverse.test", 2);
+TYPED_TEST(Indexing2D, ContiniousStridedReverse) {
+    DimCheck2D<TypeParam>(this->continuous_strided_reverse_seq,
+                          TEST_DIR "/index/ContiniousStridedReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ReverseContinious)
-{
-    DimCheck2D<TypeParam>(this->reverse_continuous_seq, TEST_DIR"/index/ReverseContinious.test", 2);
+TYPED_TEST(Indexing2D, ReverseContinious) {
+    DimCheck2D<TypeParam>(this->reverse_continuous_seq,
+                          TEST_DIR "/index/ReverseContinious.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ReverseReverse)
-{
-    DimCheck2D<TypeParam>(this->reverse_reverse_seq, TEST_DIR"/index/ReverseReverse.test", 2);
+TYPED_TEST(Indexing2D, ReverseReverse) {
+    DimCheck2D<TypeParam>(this->reverse_reverse_seq,
+                          TEST_DIR "/index/ReverseReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ReverseStrided)
-{
-    DimCheck2D<TypeParam>(this->reverse_strided_seq, TEST_DIR"/index/ReverseStrided.test", 2);
+TYPED_TEST(Indexing2D, ReverseStrided) {
+    DimCheck2D<TypeParam>(this->reverse_strided_seq,
+                          TEST_DIR "/index/ReverseStrided.test", 2);
 }
 
-TYPED_TEST(Indexing2D, ReverseStridedReverse)
-{
-    DimCheck2D<TypeParam>(this->reverse_strided_reverse_seq, TEST_DIR"/index/ReverseStridedReverse.test", 2);
+TYPED_TEST(Indexing2D, ReverseStridedReverse) {
+    DimCheck2D<TypeParam>(this->reverse_strided_reverse_seq,
+                          TEST_DIR "/index/ReverseStridedReverse.test", 2);
 }
 
-TYPED_TEST(Indexing2D, StridedContinious)
-{
-    DimCheck2D<TypeParam>(this->strided_continuous_seq, TEST_DIR"/index/StridedContinious.test", 2);
+TYPED_TEST(Indexing2D, StridedContinious) {
+    DimCheck2D<TypeParam>(this->strided_continuous_seq,
+                          TEST_DIR "/index/StridedContinious.test", 2);
 }
 
-TYPED_TEST(Indexing2D, StridedStrided)
-{
-    DimCheck2D<TypeParam>(this->strided_strided_seq, TEST_DIR"/index/StridedStrided.test", 2);
+TYPED_TEST(Indexing2D, StridedStrided) {
+    DimCheck2D<TypeParam>(this->strided_strided_seq,
+                          TEST_DIR "/index/StridedStrided.test", 2);
 }
 
 vector<af_seq> make_vec(af_seq first, af_seq second) {
@@ -396,10 +472,8 @@ vector<af_seq> make_vec(af_seq first, af_seq second) {
     return out;
 }
 
-
 template<typename T>
-class Indexing : public ::testing::Test
-{
+class Indexing : public ::testing::Test {
     vector<af_seq> make_vec3(af_seq first, af_seq second, af_seq third) {
         vector<af_seq> out;
         out.push_back(first);
@@ -408,7 +482,8 @@ class Indexing : public ::testing::Test
         return out;
     }
 
-    vector<af_seq> make_vec4(af_seq first, af_seq second, af_seq third, af_seq fourth) {
+    vector<af_seq> make_vec4(af_seq first, af_seq second, af_seq third,
+                             af_seq fourth) {
         vector<af_seq> out;
         out.push_back(first);
         out.push_back(second);
@@ -417,266 +492,564 @@ class Indexing : public ::testing::Test
         return out;
     }
 
-    public:
-
+   public:
     virtual void SetUp() {
-        continuous3d_to_3d.push_back(make_vec3(af_make_seq( 0, 4, 1), af_make_seq( 0,  6,  1), af_span));
-        continuous3d_to_3d.push_back(make_vec3(af_make_seq( 4, 8, 1), af_make_seq( 4,  9,  1), af_span));
-        continuous3d_to_3d.push_back(make_vec3(af_make_seq( 6, 9, 1), af_make_seq( 3,  8,  1), af_span));
-
-        continuous3d_to_2d.push_back(make_vec3(af_span, af_make_seq( 0,  6,  1), af_make_seq( 0, 0, 1)));
-        continuous3d_to_2d.push_back(make_vec3(af_span, af_make_seq( 4,  9,  1), af_make_seq( 1, 1, 1)));
-        continuous3d_to_2d.push_back(make_vec3(af_span, af_make_seq( 3,  8,  1), af_make_seq( 0, 0, 1)));
-
-        continuous3d_to_1d.push_back(make_vec3(af_span, af_make_seq( 0,  0,  1), af_make_seq( 0, 0, 1)));
-        continuous3d_to_1d.push_back(make_vec3(af_span, af_make_seq( 6,  6,  1), af_make_seq( 1, 1, 1)));
-        continuous3d_to_1d.push_back(make_vec3(af_span, af_make_seq( 9,  9,  1), af_make_seq( 0, 0, 1)));
-
-        continuous4d_to_4d.push_back(make_vec4(af_make_seq( 2, 6, 1), af_make_seq( 2,  6,  1), af_span, af_span));
-        continuous4d_to_3d.push_back(make_vec4(af_make_seq( 2, 6, 1), af_make_seq( 2,  6,  1), af_span, af_make_seq(0, 0, 1)));
-        continuous4d_to_2d.push_back(make_vec4(af_make_seq( 2, 6, 1), af_make_seq( 2,  6,  1), af_make_seq( 0, 0, 1), af_make_seq(0, 0, 1)));
-        continuous4d_to_1d.push_back(make_vec4(af_make_seq( 2, 6, 1), af_make_seq( 2,  2,  1), af_make_seq( 0, 0, 1), af_make_seq(0, 0, 1)));
+        continuous3d_to_3d.push_back(
+            make_vec3(af_make_seq(0, 4, 1), af_make_seq(0, 6, 1), af_span));
+        continuous3d_to_3d.push_back(
+            make_vec3(af_make_seq(4, 8, 1), af_make_seq(4, 9, 1), af_span));
+        continuous3d_to_3d.push_back(
+            make_vec3(af_make_seq(6, 9, 1), af_make_seq(3, 8, 1), af_span));
+
+        continuous3d_to_2d.push_back(
+            make_vec3(af_span, af_make_seq(0, 6, 1), af_make_seq(0, 0, 1)));
+        continuous3d_to_2d.push_back(
+            make_vec3(af_span, af_make_seq(4, 9, 1), af_make_seq(1, 1, 1)));
+        continuous3d_to_2d.push_back(
+            make_vec3(af_span, af_make_seq(3, 8, 1), af_make_seq(0, 0, 1)));
+
+        continuous3d_to_1d.push_back(
+            make_vec3(af_span, af_make_seq(0, 0, 1), af_make_seq(0, 0, 1)));
+        continuous3d_to_1d.push_back(
+            make_vec3(af_span, af_make_seq(6, 6, 1), af_make_seq(1, 1, 1)));
+        continuous3d_to_1d.push_back(
+            make_vec3(af_span, af_make_seq(9, 9, 1), af_make_seq(0, 0, 1)));
+
+        continuous4d_to_4d.push_back(make_vec4(
+            af_make_seq(2, 6, 1), af_make_seq(2, 6, 1), af_span, af_span));
+        continuous4d_to_3d.push_back(make_vec4(af_make_seq(2, 6, 1),
+                                               af_make_seq(2, 6, 1), af_span,
+                                               af_make_seq(0, 0, 1)));
+        continuous4d_to_2d.push_back(
+            make_vec4(af_make_seq(2, 6, 1), af_make_seq(2, 6, 1),
+                      af_make_seq(0, 0, 1), af_make_seq(0, 0, 1)));
+        continuous4d_to_1d.push_back(
+            make_vec4(af_make_seq(2, 6, 1), af_make_seq(2, 2, 1),
+                      af_make_seq(0, 0, 1), af_make_seq(0, 0, 1)));
     }
 
-    vector<vector<af_seq> > continuous3d_to_3d;
-    vector<vector<af_seq> > continuous3d_to_2d;
-    vector<vector<af_seq> > continuous3d_to_1d;
+    vector<vector<af_seq>> continuous3d_to_3d;
+    vector<vector<af_seq>> continuous3d_to_2d;
+    vector<vector<af_seq>> continuous3d_to_1d;
 
-    vector<vector<af_seq> > continuous4d_to_4d;
-    vector<vector<af_seq> > continuous4d_to_3d;
-    vector<vector<af_seq> > continuous4d_to_2d;
-    vector<vector<af_seq> > continuous4d_to_1d;
+    vector<vector<af_seq>> continuous4d_to_4d;
+    vector<vector<af_seq>> continuous4d_to_3d;
+    vector<vector<af_seq>> continuous4d_to_2d;
+    vector<vector<af_seq>> continuous4d_to_1d;
 };
 
 template<typename T>
-void DimCheckND(const vector<vector<af_seq> > &seqs,string TestFile, size_t NDims)
-{
-    if (noDoubleTests<T>()) return;
+void DimCheckND(const vector<vector<af_seq>> &seqs, string TestFile,
+                size_t NDims) {
+    SUPPORTED_TYPE_CHECK(T);
 
     // DimCheck2D function is generalized enough
     // to check 3d and 4d indexing
     DimCheck2D<T>(seqs, TestFile, NDims);
 }
 
-TYPED_TEST_CASE(Indexing, AllTypes);
+TYPED_TEST_SUITE(Indexing, AllTypes);
 
-TYPED_TEST(Indexing, 4D_to_4D)
-{
-    DimCheckND<TypeParam>(this->continuous4d_to_4d, TEST_DIR"/index/Continuous4Dto4D.test", 4);
+TYPED_TEST(Indexing, 4D_to_4D) {
+    DimCheckND<TypeParam>(this->continuous4d_to_4d,
+                          TEST_DIR "/index/Continuous4Dto4D.test", 4);
 }
 
-TYPED_TEST(Indexing, 4D_to_3D)
-{
-    DimCheckND<TypeParam>(this->continuous4d_to_3d, TEST_DIR"/index/Continuous4Dto3D.test", 4);
+TYPED_TEST(Indexing, 4D_to_3D) {
+    DimCheckND<TypeParam>(this->continuous4d_to_3d,
+                          TEST_DIR "/index/Continuous4Dto3D.test", 4);
 }
 
-TYPED_TEST(Indexing, 4D_to_2D)
-{
-    DimCheckND<TypeParam>(this->continuous4d_to_2d, TEST_DIR"/index/Continuous4Dto2D.test", 4);
+TYPED_TEST(Indexing, 4D_to_2D) {
+    DimCheckND<TypeParam>(this->continuous4d_to_2d,
+                          TEST_DIR "/index/Continuous4Dto2D.test", 4);
 }
 
-TYPED_TEST(Indexing, 4D_to_1D)
-{
-    DimCheckND<TypeParam>(this->continuous4d_to_1d, TEST_DIR"/index/Continuous4Dto1D.test", 4);
+TYPED_TEST(Indexing, 4D_to_1D) {
+    DimCheckND<TypeParam>(this->continuous4d_to_1d,
+                          TEST_DIR "/index/Continuous4Dto1D.test", 4);
 }
 
-TYPED_TEST(Indexing, 3D_to_3D)
-{
-    DimCheckND<TypeParam>(this->continuous3d_to_3d, TEST_DIR"/index/Continuous3Dto3D.test", 3);
+TYPED_TEST(Indexing, 3D_to_3D) {
+    DimCheckND<TypeParam>(this->continuous3d_to_3d,
+                          TEST_DIR "/index/Continuous3Dto3D.test", 3);
 }
 
-TYPED_TEST(Indexing, 3D_to_2D)
-{
-    DimCheckND<TypeParam>(this->continuous3d_to_2d, TEST_DIR"/index/Continuous3Dto2D.test", 3);
+TYPED_TEST(Indexing, 3D_to_2D) {
+    DimCheckND<TypeParam>(this->continuous3d_to_2d,
+                          TEST_DIR "/index/Continuous3Dto2D.test", 3);
 }
 
-TYPED_TEST(Indexing, 3D_to_1D)
-{
-    DimCheckND<TypeParam>(this->continuous3d_to_1d, TEST_DIR"/index/Continuous3Dto1D.test", 3);
+TYPED_TEST(Indexing, 3D_to_1D) {
+    DimCheckND<TypeParam>(this->continuous3d_to_1d,
+                          TEST_DIR "/index/Continuous3Dto1D.test", 3);
 }
 
-//////////////////////////////// CPP ////////////////////////////////
-TEST(Indexing2D, ColumnContiniousCPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    using af::array;
+TEST(Index, Docs_Util_C_API) {
+    // clang-format off
+    ASSERT_EQ(0, ([]() -> int {
+    //![ex_index_util_0]
+    af_index_t *indexers = 0;
+    af_err err = af_create_indexers(&indexers); // Memory is allocated on the heap by the callee.
+                                                // By default, each indexer spans all the elements
+                                                // along its dimension.
+
+    // Create array
+    af_array a;
+    unsigned ndims = 2;
+    dim_t dim[]    = {10, 10};
+    af_randu(&a, ndims, dim, f32);
+
+    // Create index array
+    af_array idx;
+    unsigned n = 1;
+    dim_t d[]  = {5};
+    af_range(&idx, n, d, 0, s32);
+
+    af_print_array(a);
+    af_print_array(idx);
+
+    // create array indexer
+    err = af_set_array_indexer(indexers, idx, 1);
+
+    // index with indexers
+    af_array out;
+    err = af_index_gen(&out, a, 2, indexers);  // The number of indexers should be two, since
+                                               // only the second af_index_t has been set
+    if (err != AF_SUCCESS) {
+        printf("Failed in af_index_gen: %d\n", err);
+        return 1;
+    }
+    af_print_array(out);
+    af_release_array(out);
 
-    vector<vector<af_seq> > seqs;
+    af_seq zeroIndices = af_make_seq(0.0, 9.0, 2.0);
 
-    seqs.push_back(make_vec(af_span, af_make_seq(  0,  6,  1)));
-    //seqs.push_back(make_vec(span, af_make_seq(  4,  9,  1)));
-    //seqs.push_back(make_vec(span, af_make_seq(  3,  8,  1)));
+    err = af_set_seq_indexer(indexers, &zeroIndices, 0, false);
 
-    vector<af::dim4> numDims;
+    err = af_index_gen(&out, a, 2, indexers);
+    if (err != AF_SUCCESS) {
+        printf("Failed in af_index_gen: %d\n", err);
+        return 1;
+    }
+    af_print_array(out);
+
+    af_release_indexers(indexers);
+    af_release_array(a);
+    af_release_array(idx);
+    af_release_array(out);
+    return 0;
+    //![ex_index_util_0]
+    }()));
+    // clang-format on
+}
 
-    vector<vector<float> > hData;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(TEST_DIR"/index/ColumnContinious.test", numDims, hData, tests);
-    af::dim4 dimensions = numDims[0];
+//////////////////////////////// CPP ////////////////////////////////
 
-    array a(dimensions,&(hData[0].front()));
+using af::allTrue;
+using af::array;
+using af::constant;
+using af::deviceGC;
+using af::deviceMemInfo;
+using af::end;
+using af::freeHost;
+using af::randu;
+using af::range;
+using af::reorder;
+using af::seq;
+using af::span;
+using af::where;
+
+TEST(Indexing2D, ColumnContiniousCPP) {
+    vector<vector<af_seq>> seqs;
+
+    seqs.push_back(make_vec(af_span, af_make_seq(0, 6, 1)));
+    // seqs.push_back(make_vec(span, af_make_seq(  4,  9,  1)));
+    // seqs.push_back(make_vec(span, af_make_seq(  3,  8,  1)));
+
+    vector<dim4> numDims;
+
+    vector<vector<float>> hData;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(TEST_DIR "/index/ColumnContinious.test",
+                                 numDims, hData, tests);
+    dim4 dimensions = numDims[0];
+
+    array a(dimensions, &(hData[0].front()));
 
     vector<array> sub;
-    for(size_t i = 0; i < seqs.size(); i++) {
+    for (size_t i = 0; i < seqs.size(); i++) {
         vector<af_seq> seq = seqs[i];
         sub.push_back(a(seq[0], seq[1]));
     }
 
-    for(size_t i = 0; i < seqs.size(); i++) {
+    for (size_t i = 0; i < seqs.size(); i++) {
         dim_t elems = sub[i].elements();
-        float *ptr = new float[elems];
+        float *ptr  = new float[elems];
         sub[i].host(ptr);
 
-        if(false == equal(ptr, ptr + tests[i].size(), tests[i].begin())) {
+        if (false == equal(ptr, ptr + tests[i].size(), tests[i].begin())) {
             cout << "index data: ";
-            copy(ptr, ptr + tests[i].size(), ostream_iterator<float>(cout, ", "));
+            copy(ptr, ptr + tests[i].size(),
+                 ostream_iterator<float>(cout, ", "));
             cout << endl << "file data: ";
-            copy(tests[i].begin(), tests[i].end(), ostream_iterator<float>(cout, ", "));
+            copy(tests[i].begin(), tests[i].end(),
+                 ostream_iterator<float>(cout, ", "));
             FAIL() << "indexed_array[" << i << "] FAILED" << endl;
         }
         delete[] ptr;
     }
 }
 
-/************************ Array Based indexing tests from here on ******************/
+/************************ Array-based indexing tests from here on
+ ************************/
 
 template<typename T>
-class lookup : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class lookup : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
-typedef ::testing::Types<float, double, int, unsigned, unsigned char> ArrIdxTestTypes;
-TYPED_TEST_CASE(lookup, ArrIdxTestTypes);
+typedef ::testing::Types<float, double, int, unsigned, signed char,
+                         unsigned char, short, ushort, intl, uintl,
+                         half_float::half>
+    ArrIdxTestTypes;
+TYPED_TEST_SUITE(lookup, ArrIdxTestTypes);
 
 template<typename T>
-void arrayIndexTest(string pTestFile, int dim)
-{
-    if (noDoubleTests<T>()) return;
+void arrayIndexTest(string pTestFile, int dim) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4>  numDims;
-    vector<vector<T> >      in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
 
     readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
-    af_array outArray  = 0;
-    af_array inArray   = 0;
-    af_array idxArray  = 0;
+    dim4 dims0        = numDims[0];
+    dim4 dims1        = numDims[1];
+    af_array outArray = 0;
+    af_array inArray  = 0;
+    af_array idxArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims0.ndims(), dims0.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims0.ndims(),
+                                   dims0.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&idxArray, &(in[1].front()),
-                dims1.ndims(), dims1.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&idxArray, &(in[1].front()), dims1.ndims(),
+                                   dims1.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_lookup(&outArray, inArray, idxArray, dim));
+    ASSERT_SUCCESS(af_lookup(&outArray, inArray, idxArray, dim));
 
     vector<T> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    T *outData = new T[nElems];
-
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    dim4 goldDims         = dims0;
+    goldDims[dim]         = dims1[0];
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
-    }
+    ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, outArray);
 
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(idxArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(idxArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(lookup, Dim0)
-{
-    arrayIndexTest<TypeParam>(string(TEST_DIR"/arrayindex/dim0.test"), 0);
+TYPED_TEST(lookup, Dim0) {
+    arrayIndexTest<TypeParam>(string(TEST_DIR "/arrayindex/dim0.test"), 0);
 }
 
-TYPED_TEST(lookup, Dim1)
-{
-    arrayIndexTest<TypeParam>(string(TEST_DIR"/arrayindex/dim1.test"), 1);
+TYPED_TEST(lookup, Dim1) {
+    arrayIndexTest<TypeParam>(string(TEST_DIR "/arrayindex/dim1.test"), 1);
 }
 
-TYPED_TEST(lookup, Dim2)
-{
-    arrayIndexTest<TypeParam>(string(TEST_DIR"/arrayindex/dim2.test"), 2);
+TYPED_TEST(lookup, Dim2) {
+    arrayIndexTest<TypeParam>(string(TEST_DIR "/arrayindex/dim2.test"), 2);
 }
 
-TYPED_TEST(lookup, Dim3)
-{
-    arrayIndexTest<TypeParam>(string(TEST_DIR"/arrayindex/dim3.test"), 3);
+TYPED_TEST(lookup, Dim3) {
+    arrayIndexTest<TypeParam>(string(TEST_DIR "/arrayindex/dim3.test"), 3);
 }
 
-TEST(lookup, CPP)
-{
-    using af::array;
-
-    vector<af::dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+TEST(lookup, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTests<float, float, int>(string(TEST_DIR"/arrayindex/dim0.test"), numDims, in, tests);
+    readTests<float, float, int>(string(TEST_DIR "/arrayindex/dim0.test"),
+                                 numDims, in, tests);
 
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
+    dim4 dims0 = numDims[0];
+    dim4 dims1 = numDims[1];
 
     array input(dims0, &(in[0].front()));
     array indices(dims1, &(in[1].front()));
     array output = af::lookup(input, indices, 0);
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
+    dim4 goldDims             = dims0;
+    goldDims[0]               = dims1[0];
 
-    output.host((void*)outData);
+    ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, output);
+}
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
-    }
+TEST(lookup, largeDim) {
+    const size_t largeDim = 65535 * 8 + 1;
+
+    cleanSlate();
+    array input   = range(dim4(2, largeDim));
+    array indices = constant(1, 100);
 
-    delete[] outData;
+    array output = af::lookup(input, indices);
 }
 
-TEST(SeqIndex, CPP_END)
-{
-    using af::array;
+TEST(lookup, Issue2009) {
+    array a   = range(dim4(1000, 1));
+    array idx = constant(0, 1, u32);
+    array b   = af::lookup(a, idx, 1);
 
-    const int n = 5;
-    const int m = 5;
-    const int end_off = 2;
+    ASSERT_ARRAYS_EQ(a, b);
+}
 
-    array a = af::randu(n, m);
-    array b = a(af::end - end_off, af::span);
+TEST(lookup, Issue3613_FirstDimLookupWithOffset) {
+    dim4 dims(1);
+    const int selected_dim = 0; // selected span dimension
+    dims[selected_dim] = 125; // input size
 
-    float *hA = a.host<float>();
-    float *hB = b.host<float>();
+    array a = iota(dims);
+    array idxs = iota(dim4(5, 4, 3, 2));
+    array selected_idx = idxs(af::span, 3, 2, 1); // Offsets in second, third, & fourth dimension
 
-    for (int i = 0; i < m; i++) {
-        ASSERT_EQ(hA[i * n + end_off], hB[i]);
-    }
+    array expected_selected_idx = range(dim4(5)) * 1 + 3 * 5 + 2 * (5 * 4) + 1 * (5 * 4 * 3);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
 
+    array b = af::lookup(a, selected_idx, selected_dim);
+    dim4 output_dims(1);
+    output_dims[selected_dim] = 5; // output size
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, output_dims), b); // lookup output should be the same as looked up indices
+}
+
+TEST(lookup, Issue3613_SecondDimLookupWithOffset) {
+    dim4 dims(1);
+    const int selected_dim = 1; // selected span dimension
+    dims[selected_dim] = 125; // input size
+
+    array a = iota(dims);
+    array idxs = iota(dim4(5, 4, 3, 2));
+    array selected_idx = idxs(af::span, 3, 2, 1); // Offsets in second, third, & fourth dimension
 
-    delete[] hA;
-    delete[] hB;
+    array expected_selected_idx = range(dim4(5)) * 1 + 3 * 5 + 2 * (5 * 4) + 1 * (5 * 4 * 3);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+
+    array b = af::lookup(a, selected_idx, selected_dim);
+    dim4 output_dims(1);
+    output_dims[selected_dim] = 5; // output size
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, output_dims), b); // lookup output should be the same as looked up indices
 }
 
 
-TEST(SeqIndex, CPP_END_SEQ)
-{
-    using af::array;
+TEST(lookup, Issue3613_ThirdDimLookupWithOffset) {
+    dim4 dims(1);
+    const int selected_dim = 2; // selected span dimension
+    dims[selected_dim] = 125; // input size
+    
+    array a = iota(dims);
+    array idxs = iota(dim4(5, 4, 3, 2));
+    array selected_idx = idxs(af::span, 3, 2, 1); // Offsets in second, third, & fourth dimension
+    
+    array expected_selected_idx = range(dim4(5)) * 1 + 3 * 5 + 2 * (5 * 4) + 1 * (5 * 4 * 3);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+    
+    array b = af::lookup(a, selected_idx, selected_dim);
+    dim4 output_dims(1);
+    output_dims[selected_dim] = 5; // output size
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, output_dims), b); // lookup output should be the same as looked up indices
+}
 
-    const int num = 20;
+TEST(lookup, Issue3613_FourthDimLookupWithOffset) {
+    dim4 dims(1);
+    const int selected_dim = 3; // selected span dimension
+    dims[selected_dim] = 125; // input size
+    
+    array a = iota(dims);
+    array idxs = iota(dim4(5, 4, 3, 2));
+    array selected_idx = idxs(af::span, 3, 2, 1); // Offsets in second, third, & fourth dimension
+    
+    array expected_selected_idx = range(dim4(5)) * 1 + 3 * 5 + 2 * (5 * 4) + 1 * (5 * 4 * 3);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+    
+    array b = af::lookup(a, selected_idx, selected_dim);
+    dim4 output_dims(1);
+    output_dims[selected_dim] = 5; // output size
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, output_dims), b); // lookup output should be the same as looked up indices
+}
+
+TEST(lookup, IndicesInSecondDimension) {
+    const int selected_dim = 1; // selected span dimension
+    dim4 dims(1);
+    dims[selected_dim] = 3;
+
+    array a = iota(dim4(100));
+    array idxs = iota(dim4(3, 3, 3, 3));
+    array selected_idx = idxs(0, af::span, 0, 0); // Indices along the second dimension
+
+    array expected_selected_idx = iota(dims) * pow(3, selected_dim);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+
+    array b = af::lookup(a, selected_idx);
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, dim4(3)), b);
+}
+
+TEST(lookup, IndicesInThirdDimension) {
+    const int selected_dim = 2; // selected span dimension
+    dim4 dims(1);
+    dims[selected_dim] = 3;
+
+    array a = iota(dim4(100));
+    array idxs = iota(dim4(3, 3, 3, 3));
+    array selected_idx = idxs(0, 0, af::span, 0); // Indices along the third dimension
+
+    array expected_selected_idx = iota(dims) * pow(3, selected_dim);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+
+    array b = af::lookup(a, selected_idx);
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, dim4(3)), b);
+}
+
+TEST(lookup, IndicesInFourthDimension) {
+    const int selected_dim = 3; // selected span dimension
+    dim4 dims(1);
+    dims[selected_dim] = 3;
+
+    array a = iota(dim4(100));
+    array idxs = iota(dim4(3, 3, 3, 3));
+    array selected_idx = idxs(0, 0, 0, af::span); // Indices along the fourth dimension
+
+    array expected_selected_idx = iota(dims) * pow(3, selected_dim);
+    ASSERT_ARRAYS_EQ(expected_selected_idx, selected_idx);
+
+    array b = af::lookup(a, selected_idx);
+    ASSERT_ARRAYS_EQ(af::moddims(expected_selected_idx, dim4(3)), b);
+}
+
+TEST(lookup, SNIPPET_lookup1d) {
+    //! [ex_index_lookup1d]
+
+    // input array
+    float in_[5] = {10, 20, 30, 40, 50};
+    af::array in(5, in_);
+
+    // indices to lookup
+    int idx_[3] = {1, 3, 2};
+    af::array idx(3, idx_);
+
+    af::array indexed = af::lookup(in, idx);
+    // indexed == { 20, 40, 30 };
+
+    //! [ex_index_lookup1d]
+
+    // indexing tests
+    float in_g[3] = {20, 40, 30};
+    af::array indexed_gold(3, in_g);
+    ASSERT_ARRAYS_NEAR(indexed, indexed_gold, 1e-5);
+}
+
+TEST(lookup, SNIPPET_lookup_oob) {
+    //! [ex_index_lookup_oob]
+
+    // input array
+    float in_[5] = {10, 20, 30, 40, 50};
+    af::array in(5, in_);
+
+    // indexing past end of array
+    int idx_outofbounds_p_[8] = {4, 5, 6, 7, 8, 9, 10, 11};
+    af::array idx_outofbounds_p(8, idx_outofbounds_p_);
+
+    // and indexing before beginning of array
+    int idx_outofbounds_n_[8] = {0, -1, -2, -3, -4, -5, -6, -7};
+    af::array idx_outofbounds_n(8, idx_outofbounds_n_);
+
+    af::array indexed_out_of_bounds_pos = af::lookup(in, idx_outofbounds_p);
+    af::array indexed_out_of_bounds_neg = af::lookup(in, idx_outofbounds_n);
+    // indexed_out_of_bounds_pos == { 50, 50, 40, 30, 20, 10, 50, 40 }
+    // indexed_out_of_bounds_neg == { 10, 10, 20, 30, 40, 50, 10, 20 }
+
+    //! [ex_index_lookup_oob]
+
+    // out of bounds tests
+    float oob_p_g_[8] = {50, 50, 40, 30, 20, 10, 50, 40};
+    af::array oob_p_g(8, oob_p_g_);
+    ASSERT_ARRAYS_NEAR(indexed_out_of_bounds_pos, oob_p_g, 1e-5);
+    float oob_n_g_[8] = {10, 10, 20, 30, 40, 50, 10, 20};
+    af::array oob_n_g(8, oob_n_g_);
+    ASSERT_ARRAYS_NEAR(indexed_out_of_bounds_neg, oob_n_g, 1e-5);
+}
+
+TEST(lookup, SNIPPET_lookup2d) {
+    //! [ex_index_lookup2d]
+
+    // constant input data
+    float input_vals[9] = {10, 20, 30, 11, 21, 31, 12, 22, 32};
+    array input(3, 3, input_vals);
+    // {{10 11 12},
+    //  {20 21 22},
+    //  {30 31 32}}
+
+    // indices to lookup
+    int idx_[6] = {0, 0, 1, 1, 2, 2};
+    af::array idx(6, idx_);
+
+    // will look up all indices along specified dimension
+    af::array indexed = af::lookup(input, idx);  //(dim = 0)
+    // indexed == { 10, 11, 12,
+    //              10, 11, 12,
+    //              20, 21, 22,
+    //              20, 21, 22,
+    //              30, 31, 32,
+    //              30, 31, 32 };
+
+    af::array indexed_dim1 = af::lookup(input, idx, 1);
+    // indexed_dim1 == { 10, 10, 11, 11, 12, 12,
+    //                   20, 20, 21, 21, 22, 22,
+    //                   30, 30, 31, 31, 32, 32 };
+
+    //! [ex_index_lookup2d]
+
+    float expected_indexed[18] = {10, 10, 20, 20, 30, 30, 11, 11, 21,
+                                  21, 31, 31, 12, 12, 22, 22, 32, 32};
+
+    array indexed_gold(6, 3, expected_indexed);
+    ASSERT_ARRAYS_NEAR(indexed, indexed_gold, 1e-5);
+
+    float expected_indexed_dim1[18] = {10, 20, 30, 10, 20, 30, 11, 21, 31,
+                                       11, 21, 31, 12, 22, 32, 12, 22, 32};
+
+    array indexed_gold_dim1(3, 6, expected_indexed_dim1);
+    ASSERT_ARRAYS_NEAR(indexed_dim1, indexed_gold_dim1, 1e-5);
+}
+
+TEST(SeqIndex, CPP_END) {
+    const int n       = 5;
+    const int m       = 5;
+    const int end_off = 2;
+
+    array a = randu(n, m);
+    array b = a(end - end_off, span);
+
+    float *hA = a.host<float>();
+    float *hB = b.host<float>();
+
+    for (int i = 0; i < m; i++) { ASSERT_EQ(hA[i * n + end_off], hB[i]); }
+
+    freeHost(hA);
+    freeHost(hB);
+}
+
+TEST(SeqIndex, CPP_END_SEQ) {
+    const int num       = 20;
     const int end_begin = 10;
-    const int end_end = 0;
+    const int end_end   = 0;
 
-    array a = af::randu(num);
-    array b = a(af::seq(af::end - end_begin, af::end - end_end));
+    array a = randu(num);
+    array b = a(seq(end - end_begin, end - end_end));
 
     float *hA = a.host<float>();
     float *hB = b.host<float>();
@@ -685,112 +1058,90 @@ TEST(SeqIndex, CPP_END_SEQ)
         ASSERT_EQ(hA[i + end_begin - 1], hB[i]);
     }
 
-    delete[] hA;
-    delete[] hB;
+    freeHost(hA);
+    freeHost(hB);
 }
 
-af::array cpp_scope_seq_test(const int num, const float val, const af::seq s)
-{
-    af::array a = af::constant(val, num);
+array cpp_scope_seq_test(const int num, const float val, const seq s) {
+    array a = constant(val, num);
     return a(s);
 }
 
-TEST(SeqIndex, CPP_SCOPE_SEQ)
-{
-    using af::array;
-
-    const int num = 20;
+TEST(SeqIndex, CPP_SCOPE_SEQ) {
+    const int num       = 20;
     const int seq_begin = 3;
-    const int seq_end = 10;
-    const float val = 133.33;
+    const int seq_end   = 10;
+    const float val     = 133.33;
 
-    array b = cpp_scope_seq_test(num, val, af::seq(seq_begin, seq_end));
+    array b   = cpp_scope_seq_test(num, val, seq(seq_begin, seq_end));
     float *hB = b.host<float>();
 
-    for (int i = 0; i < seq_end - seq_begin + 1; i++) {
-        ASSERT_EQ(hB[i], val);
-    }
+    for (int i = 0; i < seq_end - seq_begin + 1; i++) { ASSERT_EQ(hB[i], val); }
 
-    delete[] hB;
+    freeHost(hB);
 }
 
-af::array cpp_scope_arr_test(const int num, const float val)
-{
-    af::array a = af::constant(val, num);
-    af::array idx = where(a > val/2);
+array cpp_scope_arr_test(const int num, const float val) {
+    array a   = constant(val, num);
+    array idx = where(a > val / 2);
     return a(idx) * (val - 1);
 }
 
-TEST(SeqIndex, CPP_SCOPE_ARR)
-{
-    using af::array;
-
-    const int num = 20;
+TEST(SeqIndex, CPP_SCOPE_ARR) {
+    const int num   = 20;
     const float val = 133.33;
 
-    array b = cpp_scope_arr_test(num, val);
+    array b   = cpp_scope_arr_test(num, val);
     float *hB = b.host<float>();
 
     for (int i = 0; i < (int)b.elements(); i++) {
         ASSERT_EQ(hB[i], val * (val - 1));
     }
 
-    delete[] hB;
+    freeHost(hB);
 }
 
-TEST(SeqIndex, CPPLarge)
-{
-    using af::array;
+TEST(SeqIndex, CPPLarge) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    vector<af::dim4>      numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+    readTests<float, float, int>(string(TEST_DIR "/arrayindex/dim0Large.test"),
+                                 numDims, in, tests);
 
-    readTests<float, float, int>(string(TEST_DIR"/arrayindex/dim0Large.test"), numDims, in, tests);
-
-    af::dim4 dims0     = numDims[0];
-    af::dim4 dims1     = numDims[1];
+    dim4 dims0 = numDims[0];
+    dim4 dims1 = numDims[1];
 
     array input(dims0, &(in[0].front()));
     array indices(dims1, &(in[1].front()));
     array output = af::lookup(input, indices, 0);
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    float *outData = new float[nElems];
-
-    output.host((void*)outData);
+    dim4 goldDims             = dims0;
+    goldDims[0]               = dims1[0];
 
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
-    }
-
-    delete[] outData;
+    ASSERT_VEC_ARRAY_EQ(currGoldBar, goldDims, output);
 }
 
-TEST(SeqIndex, Cascade00)
-{
-    using af::seq;
-    using af::span;
-
+TEST(SeqIndex, Cascade00) {
     const int nx = 200;
     const int ny = 200;
 
     const int stb = 21;
     const int enb = 180;
 
-    const int stc = 3;   // Should be less than nx - stb
-    const int enc = 109; // Should be less than ny - enb
+    const int stc = 3;    // Should be less than nx - stb
+    const int enc = 109;  // Should be less than ny - enb
 
-    const int st = stb + stc;
-    const int en = stb + enc;
+    const int st  = stb + stc;
+    const int en  = stb + enc;
     const int nxc = en - st + 1;
 
-    af::array a = af::randu(nx, ny);
-    af::array b = a(seq(stb, enb), span);
-    af::array c = b(seq(stc, enc), span);
+    array a = randu(nx, ny);
+    array b = a(seq(stb, enb), span);
+    array c = b(seq(stc, enc), span);
 
-    ASSERT_EQ(c.dims(1), (dim_t)ny );
+    ASSERT_EQ(c.dims(1), (dim_t)ny);
     ASSERT_EQ(c.dims(0), (dim_t)nxc);
 
     float *h_a = a.host<float>();
@@ -798,27 +1149,21 @@ TEST(SeqIndex, Cascade00)
     float *h_c = c.host<float>();
 
     for (int j = 0; j < ny; j++) {
-
         int a_off = j * nx;
         int c_off = j * nxc;
 
         for (int i = st; i < en; i++) {
-            ASSERT_EQ(h_a[a_off + i],
-                      h_c[c_off + i - st])
+            ASSERT_EQ(h_a[a_off + i], h_c[c_off + i - st])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_a;
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_a);
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(SeqIndex, Cascade01)
-{
-    using af::seq;
-    using af::span;
-
+TEST(SeqIndex, Cascade01) {
     const int nx = 200;
     const int ny = 200;
 
@@ -831,9 +1176,9 @@ TEST(SeqIndex, Cascade01)
     const int nxc = enb - stb + 1;
     const int nyc = enc - stc + 1;
 
-    af::array a = af::randu(nx, ny);
-    af::array b = a(seq(stb, enb), span);
-    af::array c = b(span, seq(stc, enc));
+    array a = randu(nx, ny);
+    array b = a(seq(stb, enb), span);
+    array c = b(span, seq(stc, enc));
 
     ASSERT_EQ(c.dims(1), (dim_t)nyc);
     ASSERT_EQ(c.dims(0), (dim_t)nxc);
@@ -843,28 +1188,21 @@ TEST(SeqIndex, Cascade01)
     float *h_c = c.host<float>();
 
     for (int j = stc; j < enc; j++) {
-
         int a_off = j * nx;
         int c_off = (j - stc) * nxc;
 
         for (int i = stb; i < enb; i++) {
-
-            ASSERT_EQ(h_a[a_off + i],
-                      h_c[c_off + i - stb])
+            ASSERT_EQ(h_a[a_off + i], h_c[c_off + i - stb])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_a;
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_a);
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(SeqIndex, Cascade10)
-{
-    using af::seq;
-    using af::span;
-
+TEST(SeqIndex, Cascade10) {
     const int nx = 200;
     const int ny = 200;
 
@@ -877,9 +1215,9 @@ TEST(SeqIndex, Cascade10)
     const int nxc = enc - stc + 1;
     const int nyc = enb - stb + 1;
 
-    af::array a = af::randu(nx, ny);
-    af::array b = a(span, seq(stb, enb));
-    af::array c = b(seq(stc, enc), span);
+    array a = randu(nx, ny);
+    array b = a(span, seq(stb, enb));
+    array c = b(seq(stc, enc), span);
 
     ASSERT_EQ(c.dims(1), (dim_t)nyc);
     ASSERT_EQ(c.dims(0), (dim_t)nxc);
@@ -889,76 +1227,64 @@ TEST(SeqIndex, Cascade10)
     float *h_c = c.host<float>();
 
     for (int j = stb; j < enb; j++) {
-
         int a_off = j * nx;
         int c_off = (j - stb) * nxc;
 
         for (int i = stc; i < enc; i++) {
-
-            ASSERT_EQ(h_a[a_off + i],
-                      h_c[c_off + i - stc])
+            ASSERT_EQ(h_a[a_off + i], h_c[c_off + i - stc])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_a;
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_a);
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(SeqIndex, Cascade11)
-{
-    using af::seq;
-    using af::span;
-
+TEST(SeqIndex, Cascade11) {
     const int nx = 200;
     const int ny = 200;
 
     const int stb = 50;
     const int enb = 150;
 
-    const int stc = 20; // Should be less than nx - stb
-    const int enc = 80; // Should be less than ny - enb
+    const int stc = 20;  // Should be less than nx - stb
+    const int enc = 80;  // Should be less than ny - enb
 
-    const int st = stb + stc;
-    const int en = stb + enc;
+    const int st  = stb + stc;
+    const int en  = stb + enc;
     const int nyc = en - st + 1;
 
-    af::array a = af::randu(nx, ny);
-    af::array b = a(span, seq(stb, enb));
-    af::array c = b(span, seq(stc, enc));
+    array a = randu(nx, ny);
+    array b = a(span, seq(stb, enb));
+    array c = b(span, seq(stc, enc));
 
     ASSERT_EQ(c.dims(1), nyc);
-    ASSERT_EQ(c.dims(0), nx );
+    ASSERT_EQ(c.dims(0), nx);
 
     float *h_a = a.host<float>();
     float *h_b = b.host<float>();
     float *h_c = c.host<float>();
 
     for (int j = st; j < en; j++) {
-
         int a_off = j * nx;
         int c_off = (j - st) * nx;
 
         for (int i = 0; i < nx; i++) {
-
-            ASSERT_EQ(h_a[a_off + i],
-                      h_c[c_off + i])
+            ASSERT_EQ(h_a[a_off + i], h_c[c_off + i])
                 << "at (" << i << "," << j << ")";
         }
     }
 
-    delete[] h_a;
-    delete[] h_b;
-    delete[] h_c;
+    freeHost(h_a);
+    freeHost(h_b);
+    freeHost(h_c);
 }
 
-TEST(ArrayIndex, CPP_INDEX_VECTOR)
-{
-    using af::array;
-    float h_inds[] = {0, 3, 2, 1}; // zero-based indexing
+TEST(ArrayIndex, CPP_INDEX_VECTOR) {
+    float h_inds[] = {0, 3, 2, 1};  // zero-based indexing
     array inds(1, 4, h_inds);
-    array B = af::randu(1, 4);
+    array B = randu(1, 4);
     array C = B(inds);
 
     ASSERT_EQ(B.dims(0), 1);
@@ -969,49 +1295,39 @@ TEST(ArrayIndex, CPP_INDEX_VECTOR)
     float *h_B = B.host<float>();
     float *h_C = C.host<float>();
 
-    for (int i = 0; i < 4; i++) {
-        ASSERT_EQ(h_C[i], h_B[(int)h_inds[i]]);
-    }
+    for (int i = 0; i < 4; i++) { ASSERT_EQ(h_C[i], h_B[(int)h_inds[i]]); }
 
-    delete[] h_B;
-    delete[] h_C;
+    freeHost(h_B);
+    freeHost(h_C);
 }
 
-TEST(SeqIndex, CPP_INDEX_VECTOR)
-{
-    using af::array;
-
+TEST(SeqIndex, CPP_INDEX_VECTOR) {
     const int num = 20;
     const int len = 10;
-    const int st  =  3;
+    const int st  = 3;
     const int en  = st + len - 1;
 
-    array B = af::randu(1, 20);
-    array C = B(af::seq(st, en));
+    array B = randu(1, 20);
+    array C = B(seq(st, en));
 
-    ASSERT_EQ(1  , B.dims(0));
+    ASSERT_EQ(1, B.dims(0));
     ASSERT_EQ(num, B.dims(1));
-    ASSERT_EQ(1  , C.dims(0));
+    ASSERT_EQ(1, C.dims(0));
     ASSERT_EQ(len, C.dims(1));
 
     float *h_B = B.host<float>();
     float *h_C = C.host<float>();
 
-    for (int i = 0; i < len; i++) {
-        ASSERT_EQ(h_C[i], h_B[i + st]);
-    }
+    for (int i = 0; i < len; i++) { ASSERT_EQ(h_C[i], h_B[i + st]); }
 
-    delete[] h_B;
-    delete[] h_C;
+    freeHost(h_B);
+    freeHost(h_C);
 }
 
-
-TEST(ArrayIndex, CPP_INDEX_VECTOR_2D)
-{
-    using af::array;
-    float h_inds[] = {3, 5, 7, 2}; // zero-based indexing
+TEST(ArrayIndex, CPP_INDEX_VECTOR_2D) {
+    float h_inds[] = {3, 5, 7, 2};  // zero-based indexing
     array inds(1, 4, h_inds);
-    array B = af::randu(4, 4);
+    array B = randu(4, 4);
     array C = B(inds);
 
     ASSERT_EQ(B.dims(0), 4);
@@ -1022,113 +1338,171 @@ TEST(ArrayIndex, CPP_INDEX_VECTOR_2D)
     float *h_B = B.host<float>();
     float *h_C = C.host<float>();
 
-    for (int i = 0; i < 4; i++) {
-        ASSERT_EQ(h_C[i], h_B[(int)h_inds[i]]);
-    }
+    for (int i = 0; i < 4; i++) { ASSERT_EQ(h_C[i], h_B[(int)h_inds[i]]); }
 
-    delete[] h_B;
-    delete[] h_C;
+    freeHost(h_B);
+    freeHost(h_C);
 }
 
-TEST(SeqIndex, CPP_INDEX_VECTOR_2D)
-{
-    using af::array;
-
-    const int nx = 4;
-    const int ny = 3 * nx;
+TEST(SeqIndex, CPP_INDEX_VECTOR_2D) {
+    const int nx  = 4;
+    const int ny  = 3 * nx;
     const int len = 2 * nx;
     const int st  = nx - 1;
     const int en  = st + len - 1;
 
-    array B = af::randu(nx, ny);
-    array C = B(af::seq(st, en));
+    array B = randu(nx, ny);
+    array C = B(seq(st, en));
 
-    ASSERT_EQ(nx , B.dims(0));
-    ASSERT_EQ(ny , B.dims(1));
+    ASSERT_EQ(nx, B.dims(0));
+    ASSERT_EQ(ny, B.dims(1));
     ASSERT_EQ(len, C.dims(0));
-    ASSERT_EQ(1  , C.dims(1));
+    ASSERT_EQ(1, C.dims(1));
 
     float *h_B = B.host<float>();
     float *h_C = C.host<float>();
 
-    for (int i = 0; i < len; i++) {
-        ASSERT_EQ(h_C[i], h_B[i + st]);
-    }
+    for (int i = 0; i < len; i++) { ASSERT_EQ(h_C[i], h_B[i + st]); }
 
-    delete[] h_B;
-    delete[] h_C;
+    freeHost(h_B);
+    freeHost(h_C);
 }
 
 template<typename T>
-class IndexedMembers : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class IndexedMembers : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
-TYPED_TEST_CASE(IndexedMembers, AllTypes);
+TYPED_TEST_SUITE(IndexedMembers, AllTypes);
 
-TYPED_TEST(IndexedMembers, MemFuncs)
-{
-    if (noDoubleTests<TypeParam>()) return;
-    using af::array;
-    dim_t dimsize = 100;
+TYPED_TEST(IndexedMembers, MemFuncs) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    const dim_t dimsize = 100;
     vector<TypeParam> in(dimsize * dimsize);
-    for(int i = 0; i < (int)in.size(); i++) in[i] = i;
+    for (int i = 0; i < (int)in.size(); i++) in[i] = i;
     array input(dimsize, dimsize, &in.front(), afHost);
 
-    ASSERT_EQ(dimsize, input(af::span, 1).elements());
-    ASSERT_EQ(input.type(), input(af::span, 1).type());
-    ASSERT_EQ(af::dim4(dimsize), input(af::span, 1).dims());
-    ASSERT_EQ(1u, input(af::span, 1).numdims());
-    ASSERT_FALSE(input(af::span, 1).isempty());
-    ASSERT_FALSE(input(af::span, 1).isscalar());
+    ASSERT_EQ(dimsize, input(span, 1).elements());
+    ASSERT_EQ(input.type(), input(span, 1).type());
+    ASSERT_EQ(dim4(dimsize), input(span, 1).dims());
+    ASSERT_EQ(1u, input(span, 1).numdims());
+    ASSERT_FALSE(input(span, 1).isempty());
+    ASSERT_FALSE(input(span, 1).isscalar());
     ASSERT_TRUE(input(1, 1).isscalar());
-    ASSERT_TRUE(input(af::span, 1).isvector());
-    ASSERT_FALSE(input(af::span, 1).isrow());
-    ASSERT_EQ(input.iscomplex(), input(af::span, 1).iscomplex());
-    ASSERT_EQ(input.isdouble(), input(af::span, 1).isdouble());
-    ASSERT_EQ(input.issingle(), input(af::span, 1).issingle());
-    ASSERT_EQ(input.isrealfloating(), input(af::span, 1).isrealfloating());
-    ASSERT_EQ(input.isfloating(), input(af::span, 1).isfloating());
-    ASSERT_EQ(input.isinteger(), input(af::span, 1).isinteger());
-    ASSERT_EQ(input.isbool(), input(af::span, 1).isbool());
+    ASSERT_TRUE(input(span, 1).isvector());
+    ASSERT_FALSE(input(span, 1).isrow());
+    ASSERT_EQ(input.iscomplex(), input(span, 1).iscomplex());
+    ASSERT_EQ(input.isdouble(), input(span, 1).isdouble());
+    ASSERT_EQ(input.issingle(), input(span, 1).issingle());
+    ASSERT_EQ(input.isrealfloating(), input(span, 1).isrealfloating());
+    ASSERT_EQ(input.isfloating(), input(span, 1).isfloating());
+    ASSERT_EQ(input.isinteger(), input(span, 1).isinteger());
+    ASSERT_EQ(input.isbool(), input(span, 1).isbool());
     // TODO: Doesn't compile in cuda for cfloat and cdouble
-    //ASSERT_EQ(input.scalar<TypeParam>(), input(af::span, 0).scalar<TypeParam>());
+    // ASSERT_EQ(input.scalar<TypeParam>(), input(span, 0).scalar<TypeParam>());
 }
 
+#if 1
+TYPED_TEST(IndexedMembers, MemIndex) {
+    array a     = range(dim4(10, 10));
+    array b     = a(seq(1, 7), span);
+    array brow  = b.row(5);
+    array brows = b.rows(5, 6);
+    array bcol  = b.col(5);
+    array bcols = b.cols(5, 6);
+
+    array out_row  = a(seq(1, 7), span).row(5);
+    array out_rows = a(seq(1, 7), span).rows(5, 6);
+    array out_col  = a(seq(1, 7), span).col(5);
+    array out_cols = a(seq(1, 7), span).cols(5, 6);
+
+    ASSERT_EQ(0, where(brow != out_row).elements());
+    ASSERT_EQ(0, where(brows != out_rows).elements());
+    ASSERT_EQ(0, where(bcol != out_col).elements());
+    ASSERT_EQ(0, where(bcols != out_cols).elements());
+
+    array avol    = range(dim4(10, 10, 10));
+    array bvol    = avol(seq(1, 7), span, span);
+    array bslice  = bvol.slice(5);
+    array bslices = bvol.slices(5, 6);
+
+    array out_slice  = avol(seq(1, 7), span, span).slice(5);
+    array out_slices = avol(seq(1, 7), span, span).slices(5, 6);
+
+    ASSERT_EQ(0, where(bslice != out_slice).elements());
+    ASSERT_EQ(0, where(bslices != out_slices).elements());
+}
+#endif
 
-TEST(Indexing, SNIPPET_indexing_first)
-{
-    using namespace af;
+TEST(Indexing, SNIPPET_indexing_first) {
     //! [ex_indexing_first]
-    array A = array(seq(1,9), 3, 3);
+    array A = array(seq(1, 9), 3, 3);
     af_print(A);
+    // 1.0000 4.0000 7.0000
+    // 2.0000 5.0000 8.0000
+    // 3.0000 6.0000 9.0000
+
+    af_print(A(0));  // first element
+    // 1.0000
+
+    af_print(A(0, 1));  // first row, second column
+    // 4.0000
 
-    af_print(A(0));    // first element
-    af_print(A(0,1));  // first row, second column
+    af_print(A(end));  // last element
+    // 9.0000
 
-    af_print(A(end));   // last element
-    af_print(A(-1));    // also last element
-    af_print(A(end-1)); // second-to-last element
+    af_print(A(-1));  // also last element
+    // 9.0000
 
-    af_print(A(1,span));       // second row
-    af_print(A.row(end));      // last row
-    af_print(A.cols(1,end));   // all but first column
+    af_print(A(end - 1));  // second-to-last element
+    // 8.0000
 
-    float b_host[] = {0,1,2,3,4,5,6,7,8,9};
+    af_print(A(1, span));  // second row
+    // 2.0000     5.0000     8.0000
+
+    af_print(A.row(end));  // last row
+    // 3.0000     6.0000     9.0000
+
+    af_print(A.cols(1, end));  // all but first column
+    // 4.0000     7.0000
+    // 5.0000     8.0000
+    // 6.0000     9.0000
+
+    float b_host[] = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
     array b(10, 1, b_host);
     af_print(b(seq(3)));
-    af_print(b(seq(1,7)));
-    af_print(b(seq(1,7,2)));
-    af_print(b(seq(0,end,2)));
+    // 0.0000
+    // 1.0000
+    // 2.0000
+
+    af_print(b(seq(1, 7)));
+    // 1.0000
+    // 2.0000
+    // 3.0000
+    // 4.0000
+    // 5.0000
+    // 6.0000
+    // 7.0000
+
+    af_print(b(seq(1, 7, 2)));
+    // 1.0000
+    // 3.0000
+    // 5.0000
+    // 7.0000
+
+    af_print(b(seq(0, end, 2)));
+    // 0.0000
+    // 2.0000
+    // 4.0000
+    // 6.0000
+    // 8.0000
     //! [ex_indexing_first]
 
-
-    array lin_first = A(0);
-    array lin_last = A(end);
-    array lin_snd_last = A(end-1);
+    array lin_first    = A(0);
+    array lin_last     = A(end);
+    array lin_snd_last = A(end - 1);
 
     EXPECT_EQ(1, lin_first.dims(0));
     EXPECT_EQ(1, lin_first.elements());
@@ -1141,7 +1515,6 @@ TEST(Indexing, SNIPPET_indexing_first)
     EXPECT_FLOAT_EQ(9.0f, lin_last.scalar<float>());
     EXPECT_FLOAT_EQ(8.0f, lin_snd_last.scalar<float>());
 
-
     lin_last = A(-1);
     EXPECT_EQ(1, lin_last.dims(0));
     EXPECT_EQ(1, lin_last.elements());
@@ -1152,7 +1525,9 @@ TEST(Indexing, SNIPPET_indexing_first)
         ASSERT_EQ(3, out.elements());
         vector<float> hout(out.elements());
         out.host(&hout.front());
-        for(unsigned i = 0; i < hout.size(); i++) { ASSERT_FLOAT_EQ(b_host[i], hout[i]); }
+        for (unsigned i = 0; i < hout.size(); i++) {
+            ASSERT_FLOAT_EQ(b_host[i], hout[i]);
+        }
     }
 
     {
@@ -1160,7 +1535,9 @@ TEST(Indexing, SNIPPET_indexing_first)
         ASSERT_EQ(7, out.elements());
         vector<float> hout(out.elements());
         out.host(&hout.front());
-        for(unsigned i = 1; i < hout.size(); i++) { ASSERT_FLOAT_EQ(b_host[i], hout[i - 1]); }
+        for (unsigned i = 1; i < hout.size(); i++) {
+            ASSERT_FLOAT_EQ(b_host[i], hout[i - 1]);
+        }
     }
 
     {
@@ -1168,52 +1545,616 @@ TEST(Indexing, SNIPPET_indexing_first)
         ASSERT_EQ(4, out.elements());
         vector<float> hout(out.elements());
         out.host(&hout.front());
-        for(unsigned i = 0; i < hout.size(); i++) { ASSERT_FLOAT_EQ(b_host[i * 2 + 1], hout[i]); }
+        for (unsigned i = 0; i < hout.size(); i++) {
+            ASSERT_FLOAT_EQ(b_host[i * 2 + 1], hout[i]);
+        }
     }
 }
 
-TEST(Indexing, SNIPPET_indexing_set)
-{
-    using namespace af;
+TEST(Indexing, SNIPPET_indexing_set) {
     //! [ex_indexing_set]
     array A = constant(0, 3, 3);
     af_print(A);
+    // 0.0000     0.0000     0.0000
+    // 0.0000     0.0000     0.0000
+    // 0.0000     0.0000     0.0000
 
     // setting entries to a constant
-    A(span) = 4;        // fill entire array
+    A(span) = 4;  // fill entire array
     af_print(A);
+    // 4.0000     4.0000     4.0000
+    // 4.0000     4.0000     4.0000
+    // 4.0000     4.0000     4.0000
 
-    A.row(0) = -1;      // first row
+    A.row(0) = -1;  // first row
     af_print(A);
+    // -1.0000    -1.0000    -1.0000
+    //  4.0000     4.0000     4.0000
+    //  4.0000     4.0000     4.0000
 
-    A(seq(3)) = 3.1415; // first three elements
+    A(seq(3)) = 3.1415;  // first three elements
     af_print(A);
+    // 3.1415    -1.0000    -1.0000
+    // 3.1415     4.0000     4.0000
+    // 3.1415     4.0000     4.0000
 
     // copy in another matrix
     array B = constant(1, 4, 4, s32);
-    B.row(0) = randu(1, 4, f32); // set a row to random values (also upcast)
+    af_print(B);
+    //          1          1          1          1
+    //          1          1          1          1
+    //          1          1          1          1
+    //          1          1          1          1
+
+    B.row(0) = randu(1, 4, f32);  // set a row to random values (cast to s32)
+
+    // The first row is all zeros because randu returns values between 0.0 and
+    // 1.0, and they were converted to the type of B, which is s32
+    af_print(B);
+    //          0          0          0          0
+    //          1          1          1          1
+    //          1          1          1          1
+    //          1          1          1          1
     //! [ex_indexing_set]
-    //TODO: Confirm the outputs are correct. see #697
+    // TODO: Confirm the outputs are correct. see #697
 }
 
-
-TEST(Indexing, SNIPPET_indexing_ref)
-{
-    using namespace af;
+TEST(Indexing, SNIPPET_indexing_ref) {
     //! [ex_indexing_ref]
-    float h_inds[] = {0, 4, 2, 1}; // zero-based indexing
+    float h_inds[] = {0, 4, 2, 1};  // zero-based indexing
     array inds(1, 4, h_inds);
     af_print(inds);
+    // 0.0000     4.0000     2.0000     1.0000
 
     array B = randu(1, 4);
     af_print(B);
+    // 0.5471     0.3114     0.5535     0.3800
 
-    array c = B(inds);        // get
+    array c = B(inds);  // get
     af_print(c);
+    // 0.5471     0.3800     0.5535     0.3114
 
-    B(inds) = -1;             // set to scalar
-    B(inds) = constant(0, 4); // zero indices
+    B(inds) = -1;              // set to scalar
+    B(inds) = constant(0, 4);  // zero indices
     af_print(B);
+    // 0.0000     0.0000     0.0000     0.0000
     //! [ex_indexing_ref]
-    //TODO: Confirm the outputs are correct. see #697
+    // TODO: Confirm the outputs are correct. see #697
+}
+
+TEST(Indexing, IndexingCopy) {
+    array A = constant(0, 1, s32);
+    af::index s1;
+    s1 = af::index(A);
+    // At exit both A and s1 will be destroyed
+    // but the underlying array should only be
+    // freed once.
+}
+
+TEST(Assign, LinearIndexSeq) {
+    const int nx = 5;
+    const int ny = 4;
+
+    const int st  = nx - 2;
+    const int en  = nx * (ny - 1);
+    const int num = (en - st + 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = seq(st, en);
+
+    af_array in_arr = a.get();
+    af_index_t ii   = idx.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_index(&out_arr, in_arr, 1, &ii.idx.seq));
+
+    array out(out_arr);
+
+    ASSERT_EQ(out.dims(0), num);
+    ASSERT_EQ(out.elements(), num);
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < num; i++) { ASSERT_EQ(ha[i + st], hout[i]); }
+}
+
+TEST(Assign, LinearIndexGenSeq) {
+    const int nx = 5;
+    const int ny = 4;
+
+    const int st  = nx - 2;
+    const int en  = nx * (ny - 1);
+    const int num = (en - st + 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = seq(st, en);
+
+    af_array in_arr = a.get();
+    af_index_t ii   = idx.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_index_gen(&out_arr, in_arr, 1, &ii));
+
+    array out(out_arr);
+
+    ASSERT_EQ(out.dims(0), num);
+    ASSERT_EQ(out.elements(), num);
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < num; i++) { ASSERT_EQ(ha[i + st], hout[i]); }
+}
+
+TEST(Assign, LinearIndexGenArr) {
+    const int nx = 5;
+    const int ny = 4;
+
+    const int st  = nx - 2;
+    const int en  = nx * (ny - 1);
+    const int num = (en - st + 1);
+
+    array a       = randu(nx, ny);
+    af::index idx = array(seq(st, en));
+
+    af_array in_arr = a.get();
+    af_index_t ii   = idx.get();
+    af_array out_arr;
+
+    ASSERT_SUCCESS(af_index_gen(&out_arr, in_arr, 1, &ii));
+
+    array out(out_arr);
+
+    ASSERT_EQ(out.dims(0), num);
+    ASSERT_EQ(out.elements(), num);
+
+    vector<float> hout(nx * ny);
+    vector<float> ha(nx * ny);
+
+    a.host(&ha[0]);
+    out.host(&hout[0]);
+
+    for (int i = 0; i < num; i++) { ASSERT_EQ(ha[i + st], hout[i]); }
+}
+
+TEST(Index, OutOfBounds) {
+    uint gold[7]  = {0, 9, 49, 119, 149, 149, 148};
+    uint h_idx[7] = {0, 9, 49, 119, 149, 150, 151};
+    uint output[7];
+
+    array a = iota(dim4(50, 1, 3)).as(s32);
+    array idx(7, h_idx);
+    array b = a(idx);
+    b.host((void *)output);
+
+    for (int i = 0; i < 7; ++i) ASSERT_EQ(gold[i], output[i]);
 }
+
+TEST(Index, ISSUE_1101_FULL) {
+    deviceGC();
+    array a = randu(5, 5);
+
+    size_t aby, abu, lby, lbu;
+    deviceMemInfo(&aby, &abu, &lby, &lbu);
+
+    array b = a(span, span);
+
+    size_t aby1, abu1, lby1, lbu1;
+    deviceMemInfo(&aby1, &abu1, &lby1, &lbu1);
+
+    ASSERT_EQ(aby, aby1);
+    ASSERT_EQ(abu, abu1);
+    ASSERT_EQ(lby, lby1);
+    ASSERT_EQ(lbu, lbu1);
+
+    ASSERT_ARRAYS_EQ(a, b);
+}
+
+TEST(Index, ISSUE_1101_COL0) {
+    deviceGC();
+    array a = randu(5, 5);
+    vector<float> ha(a.elements());
+    a.host(ha.data());
+    vector<float> gold(ha.begin(), ha.begin() + 5);
+
+    size_t aby, abu, lby, lbu;
+    deviceMemInfo(&aby, &abu, &lby, &lbu);
+
+    array b = a(span, 0);
+
+    size_t aby1, abu1, lby1, lbu1;
+    deviceMemInfo(&aby1, &abu1, &lby1, &lbu1);
+
+    ASSERT_EQ(aby, aby1) << "Number of bytes different";
+    ASSERT_EQ(abu, abu1) << "Number of buffers different";
+    ASSERT_EQ(lby, lby1) << "Number of bytes different";
+    ASSERT_EQ(lbu, lbu1) << "Number of buffers different";
+
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(a.dims()[0]), b);
+}
+
+TEST(Index, ISSUE_1101_MODDIMS) {
+    deviceGC();
+    array a = randu(5, 5);
+    vector<float> ha(a.elements());
+    a.host(&ha[0]);
+
+    size_t aby, abu, lby, lbu;
+    deviceMemInfo(&aby, &abu, &lby, &lbu);
+
+    int st  = 0;
+    int en  = 9;
+    int nx  = 2;
+    int ny  = 5;
+    array b = a(seq(st, en));
+    array c = moddims(b, nx, ny);
+    size_t aby1, abu1, lby1, lbu1;
+    deviceMemInfo(&aby1, &abu1, &lby1, &lbu1);
+
+    EXPECT_EQ(aby, aby1) << "Number of bytes different";
+    EXPECT_EQ(abu, abu1) << "Number of buffers different";
+    EXPECT_EQ(lby, lby1) << "Number of bytes different";
+    EXPECT_EQ(lbu, lbu1) << "Number of buffers different";
+
+    vector<float> hb(b.elements());
+    b.host(&hb[0]);
+    for (int i = 0; i < b.elements(); i++) { ASSERT_EQ(ha[i + st], hb[i]); }
+
+    vector<float> hc(c.elements());
+    c.host(&hc[0]);
+    for (int i = 0; i < c.elements(); i++) { ASSERT_EQ(ha[i + st], hc[i]); }
+}
+
+TEST(Index, Issue1846IndexStepCascade) {
+    array a = randu(3, 12);
+    array b = a(span, seq(0, end, 2));
+    array c = b(span, seq(0, end, 3));
+    array d = a(span, seq(0, end, 6));
+    EXPECT_EQ(allTrue<bool>(c == d), true);
+}
+
+TEST(Index, Issue1845IndexStepReorder) {
+    array a = randu(1, 8, 1);
+    array b = reorder(a, 0, 2, 1);
+    array d = reorder(b(0, 0, span), 2, 1, 0);
+    EXPECT_EQ(allTrue<bool>(a.T() == d), true);
+}
+
+TEST(Index, Issue1867ChainedIndexingLeak) {
+    using af::randn;
+    using af::sync;
+    {
+        array lInput = randn(100, 100, f32);
+        array Q3     = lInput.rows(0, 3).cols(0, 3);
+        Q3.eval();
+        sync();
+    }
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    ASSERT_EQ(0u, lock_buffers);
+}
+
+TEST(Index, InvalidSequence_SingleElementNegativeStep) {
+    EXPECT_THROW(af::seq(1, 1, -1), af::exception);
+}
+TEST(Index, InvalidSequence_PositiveRangeNegativeStep) {
+    EXPECT_THROW(af::seq(1, 5, -1), af::exception);
+}
+
+TEST(Index, InvalidSequence_NegativeRangePositiveStep) {
+    EXPECT_THROW(af::seq(-1, -5, 1), af::exception);
+}
+
+TEST(Index, ISSUE_2273) {
+    int h_idx[2] = {1, 1};
+    array idx(2, h_idx);
+
+    float h_input[12] = {0.f, 1.f, 2.f, 3.f, 4.f,  5.f,
+                         6.f, 7.f, 8.f, 9.f, 10.f, 11.f};
+    array input(2, 3, 2, h_input);
+    array input_reord = reorder(input, 0, 2, 1);
+    array output      = input_reord(span, idx, span);
+
+    float h_gold[12] = {6.f, 7.f, 6.f,  7.f,  8.f,  9.f,
+                        8.f, 9.f, 10.f, 11.f, 10.f, 11.f};
+    array gold(2, 2, 3, h_gold);
+
+    ASSERT_ARRAYS_EQ(gold, output);
+}
+
+TEST(Index, ISSUE_2273_Flipped) {
+    int h_idx[2] = {1, 1};
+    array idx(2, h_idx);
+
+    float h_input[12] = {0.f, 1.f, 6.f, 7.f, 2.f,  3.f,
+                         8.f, 9.f, 4.f, 5.f, 10.f, 11.f};
+    array input(2, 2, 3, h_input);
+    array input_reord = reorder(input, 0, 2, 1);
+    array input_slice = input_reord(span, span, idx);
+
+    array input_ref       = iota(dim4(2, 3, 2));
+    array input_ref_slice = input_ref(span, span, idx);
+
+    float h_gold[12] = {6.f, 7.f, 8.f, 9.f, 10.f, 11.f,
+                        6.f, 7.f, 8.f, 9.f, 10.f, 11.f};
+    array input_slice_gold(2, 3, 2, h_gold);
+
+    ASSERT_ARRAYS_EQ(input_slice_gold, input_slice);
+}
+
+TEST(Index, CopiedIndexDestroyed) {
+    array in = randu(10, 10);
+    array a  = constant(1, 10);
+
+    af::index index1(a);
+    af::index index2(seq(10));
+
+    af::index index3(index1);
+    { af::index index4(index1); }
+
+    af_print(in(index1, index2));
+}
+
+// clang-format off
+class IndexDocs : public ::testing::Test {
+public:
+  array A;
+
+  void SetUp() {
+    //![index_tutorial_1]
+    float data[] = {0,  1,  2,  3,
+                    4,  5,  6,  7,
+                    8,  9, 10, 11,
+                   12, 13, 14, 15};
+    af::array A(4, 4, data);
+    //![index_tutorial_1]
+    this->A = A;
+  }
+};
+
+TEST_F(IndexDocs, Precondition) {
+  vector<float> gold(4*4);
+  std::iota(gold.begin(), gold.end(), 0.f);
+  ASSERT_VEC_ARRAY_EQ(gold, dim4(4, 4), A);
+}
+
+TEST_F(IndexDocs, 2_3Element) {
+    array out =
+    //![index_tutorial_first_element]
+    // Returns an array pointing to the element at row 2, column 3 (zero-based)
+    A(2, 3); // WARN: avoid doing this. Demo only
+    //![index_tutorial_first_element]
+    vector<float> gold(1, 14.f);
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(1), out);
+}
+
+TEST_F(IndexDocs, FifthElement) {
+    array out =
+    //![index_tutorial_fifth_element]
+    // Returns an array pointing to the element at linear index 5 (zero-based)
+    A(5);
+    //![index_tutorial_fifth_element]
+    vector<float> gold(1, 5.f);
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(1), out);
+}
+
+TEST_F(IndexDocs, NegativeIndexing) {
+    //![index_tutorial_negative_indexing]
+    array ref0 = A(2, -1);    // 14 Third row, last column
+    array ref1 = A(2, end);   // 14 Same as above
+    array ref2 = A(2, -2);    // 10 Third row, second-to-last (third) column
+    array ref3 = A(2, end-1); // 10 Same as above
+    //![index_tutorial_negative_indexing]
+    vector<float> gold1(1, 14.f);
+    vector<float> gold2(1, 10.f);
+    ASSERT_VEC_ARRAY_EQ(gold1, dim4(1), ref0);
+    ASSERT_VEC_ARRAY_EQ(gold1, dim4(1), ref1);
+    ASSERT_VEC_ARRAY_EQ(gold2, dim4(1), ref2);
+    ASSERT_VEC_ARRAY_EQ(gold2, dim4(1), ref3);
+}
+
+TEST_F(IndexDocs, ThirdColumn) {
+    array out =
+    //![index_tutorial_third_column]
+    // Returns an array pointing to the third column
+    A(span, 2);
+    //![index_tutorial_third_column]
+    vector<float> gold{8, 9, 10, 11};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(4), out);
+}
+
+TEST_F(IndexDocs, SecondRow) {
+    array out =
+    //![index_tutorial_second_row]
+    // Returns an array pointing to the second row
+    A(1, span);
+    //![index_tutorial_second_row]
+    vector<float> gold{1, 5, 9, 13};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(1, 4), out);
+}
+
+TEST_F(IndexDocs, FirstTwoColumns) {
+    array out =
+    //![index_tutorial_first_two_columns]
+    // Returns an array pointing to the first two columns
+    A(span, seq(2));
+    //![index_tutorial_first_two_columns]
+    vector<float> gold{0, 1, 2, 3, 4, 5, 6, 7};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(4, 2), out);
+}
+
+TEST_F(IndexDocs, SecondAndFourthRows) {
+    array out =
+    //![index_tutorial_second_and_fourth_rows]
+    // Returns an array pointing to the second and fourth rows
+    A(seq(1, end, 2), span);
+    //![index_tutorial_second_and_fourth_rows]
+    vector<float> gold{1, 3, 5, 7, 9, 11, 13, 15};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(2, 4), out);
+}
+
+
+TEST_F(IndexDocs, Arrays) {
+    //![index_tutorial_array_indexing]
+    vector<int> hidx = {2, 1, 3};
+    vector<int> hidy = {3, 1, 2};
+    array idx(3, hidx.data());
+    array idy(3, hidy.data());
+
+    array out = A(idx, idy);
+    //![index_tutorial_array_indexing]
+
+    vector<float> gold{
+   14.f,    13.f,    15.f,
+    6.f,     5.f,     7.f,
+   10.f,     9.f,    11.f};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(3, 3), out);
+}
+
+
+TEST_F(IndexDocs, Approx) {
+    //![index_tutorial_approx]
+    vector<float> hidx = {2, 1, 3};
+    vector<float> hidy = {3, 1, 2};
+    array idx(3, hidx.data());
+    array idy(3, hidy.data());
+
+    array out = approx2(A, idx, idy);
+    //![index_tutorial_approx]
+
+    vector<float> gold{14.f, 5.f, 11.f};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(3), out);
+}
+
+TEST_F(IndexDocs, Boolean) {
+    //![index_tutorial_boolean]
+    array out = A(A < 5);
+    //![index_tutorial_boolean]
+    vector<float> gold = {0, 1, 2, 3, 4};
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(5), out);
+}
+
+TEST_F(IndexDocs, References) {
+    deviceGC();
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    //![index_tutorial_references]
+    array reference = A(span, 1);
+    array reference2 = A(seq(3), 1);
+    array reference3 = A(seq(2), span);
+    //![index_tutorial_references]
+
+    size_t alloc_bytes2, alloc_buffers2, lock_bytes2, lock_buffers2;
+    deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2, &lock_buffers2);
+
+    ASSERT_EQ(0, lock_buffers2 - lock_buffers);
+}
+
+TEST_F(IndexDocs, Copies) {
+    deviceGC();
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    //![index_tutorial_copies]
+    array copy = A(2, span);
+    array copy2 = A(seq(1, 3, 2), span);
+
+
+    int hidx[] = {0, 1, 2};
+    array idx(3, hidx);
+    array copy3 = A(idx, span);
+    //![index_tutorial_copies]
+
+    size_t alloc_bytes2, alloc_buffers2, lock_bytes2, lock_buffers2;
+    deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2, &lock_buffers2);
+
+    ASSERT_EQ(3, lock_buffers2 - lock_buffers);
+}
+
+TEST_F(IndexDocs, Assignment) {
+    deviceGC();
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    //![index_tutorial_assignment]
+    array inputA = constant(3, 10, 10);
+    array inputB = constant(2, 10, 10);
+    array data   = constant(1, 10, 10);
+
+    // Points to the second column of data. Does not allocate memory
+    array ref = data(span, 1);
+
+    // This call does NOT update data. Memory allocated in matmul
+    ref = matmul(inputA, inputB);
+    // ref no longer points to the same memory as the data array
+    //![index_tutorial_assignment]
+
+    size_t alloc_bytes2, alloc_buffers2, lock_bytes2, lock_buffers2;
+    deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2, &lock_buffers2);
+
+    vector<float> gold_reference(100, 60);
+    vector<float> gold_data(100, 1);
+    ASSERT_VEC_ARRAY_EQ(gold_reference, dim4(10, 10), ref);
+    ASSERT_VEC_ARRAY_EQ(gold_data, dim4(10, 10), data);
+    ASSERT_EQ(4, lock_buffers2 - lock_buffers);
+}
+
+TEST_F(IndexDocs, AssignmentThirdColumn) {
+    vector<float> gold(A.elements());
+    A.host(gold.data());
+
+    deviceGC();
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    //![index_tutorial_assignment_third_column]
+    array reference = A(span, 2);
+    A(span, 2) = 3.14f;
+    assert(allTrue<bool>(reference != A(span, 2)));
+    //![index_tutorial_assignment_third_column]
+    vector<float> gold_reference(begin(gold) + 8, begin(gold)+12);
+    ASSERT_VEC_ARRAY_EQ(gold_reference, dim4(4), reference);
+    gold[8] = gold[9] = gold[10] = gold[11] = 3.14f;
+    ASSERT_VEC_ARRAY_EQ(gold, A.dims(), A);
+
+    size_t alloc_bytes2, alloc_buffers2, lock_bytes2, lock_buffers2;
+    deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2, &lock_buffers2);
+
+    ASSERT_EQ(1, lock_buffers2 - lock_buffers);
+}
+
+TEST_F(IndexDocs, AssignmentAlloc) {
+    deviceGC();
+    size_t alloc_bytes, alloc_buffers, lock_bytes, lock_buffers;
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    //![index_tutorial_assignment_alloc]
+    {
+        // No allocation performed. ref points to A's memory
+        array ref = A(span, 2);
+    } // ref goes out of scope. Nothing else points to A's memory
+    A(span, 2) = 3.14f; // No allocation performed.
+    //![index_tutorial_assignment_alloc]
+
+    size_t alloc_bytes2, alloc_buffers2, lock_bytes2, lock_buffers2;
+    deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2, &lock_buffers2);
+
+    ASSERT_EQ(0, lock_buffers2 - lock_buffers);
+}
+
+TEST_F(IndexDocs, AssignmentRaceCondition) {
+    //![index_tutorial_assignment_race_condition]
+    vector<int> hidx = {4, 3, 4, 0};
+    vector<float> hvals = {9.f, 8.f, 7.f, 6.f};
+    array idx(4, hidx.data());
+    array vals(4, hvals.data());
+
+    A(idx) = vals; // nondeterministic. A(4) can be 9 or 7
+    //![index_tutorial_assignment_race_condition]
+}
+
+// clang-format on
diff --git a/test/info.cpp b/test/info.cpp
index bd70a66b9e..5cd82a6201 100644
--- a/test/info.cpp
+++ b/test/info.cpp
@@ -7,56 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/data.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
 #include <af/device.h>
 
+using af::dim4;
+using af::dtype_traits;
+using af::getDevice;
+using af::info;
+using af::setDevice;
 using std::string;
 using std::vector;
 
 template<typename T>
-class Info : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
-};
-
-// create a list of types to be tested
-typedef ::testing::Types<float> TestTypes;
-
-// register the type list
-TYPED_TEST_CASE(Info, TestTypes);
-
-template<typename T>
-void infoTest()
-{
-    if (noDoubleTests<T>()) return;
+void testFunction() {
+    info();
+
+    af_array outArray = 0;
+    dim4 dims(32, 32, 1, 1);
+    ASSERT_SUCCESS(af_randu(&outArray, dims.ndims(), dims.get(),
+                            (af_dtype)dtype_traits<T>::af_type));
+    // cleanup
+    if (outArray != 0) { ASSERT_SUCCESS(af_release_array(outArray)); }
+}
 
+void infoTest() {
     int nDevices = 0;
-    ASSERT_EQ(AF_SUCCESS, af_get_device_count(&nDevices));
-
-    for(int d = 0; d < nDevices; d++) {
-
-        af::setDevice(d);
-        af::info();
-
-        af_array outArray = 0;
-        af::dim4 dims(32, 32, 1, 1);
-        ASSERT_EQ(AF_SUCCESS, af_randu(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-        // cleanup
-        if(outArray != 0) ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_get_device_count(&nDevices));
+    ASSERT_GT(nDevices, 0);
+
+    const char* ENV = getenv("AF_MULTI_GPU_TESTS");
+    if (ENV && ENV[0] == '0') {
+        testFunction<float>();
+    } else {
+        int oldDevice = getDevice();
+        testFunction<float>();
+        for (int d = 0; d < nDevices; d++) {
+            setDevice(d);
+            testFunction<float>();
+        }
+        setDevice(oldDevice);
     }
 }
 
-TYPED_TEST(Info, All)
-{
-    infoTest<TypeParam>();
-}
+TEST(Info, All) { infoTest(); }
diff --git a/test/internal.cpp b/test/internal.cpp
new file mode 100644
index 0000000000..ede8e697a7
--- /dev/null
+++ b/test/internal.cpp
@@ -0,0 +1,146 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/internal.h>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::randu;
+using af::seq;
+using af::span;
+using std::vector;
+
+TEST(Internal, CreateStrided) {
+    float ha[] = {1,    101,  102,  103,  104,  105,  201,  202,  203,  204,
+                  205,  301,  302,  303,  304,  305,  401,  402,  403,  404,
+                  405,
+
+                  1010, 1020, 1030, 1040, 1050, 2010, 2020, 2030, 2040, 2050,
+                  3010, 3020, 3030, 3040, 3050, 4010, 4020, 4030, 4040, 4050};
+
+    dim_t offset    = 1;
+    unsigned ndims  = 3;
+    dim_t dims[]    = {3, 3, 2};
+    dim_t strides[] = {1, 5, 20};
+    array a         = createStridedArray((void *)ha, offset, dim4(ndims, dims),
+                                         dim4(ndims, strides), f32, afHost);
+
+    dim4 astrides = getStrides(a);
+    dim4 adims    = a.dims();
+
+    ASSERT_EQ(offset, getOffset(a));
+    for (int i = 0; i < (int)ndims; i++) {
+        ASSERT_EQ(strides[i], astrides[i]);
+        ASSERT_EQ(dims[i], adims[i]);
+    }
+
+    vector<float> va(a.elements());
+    a.host(&va[0]);
+
+    int o = offset;
+    for (int k = 0; k < dims[2]; k++) {
+        for (int j = 0; j < dims[1]; j++) {
+            for (int i = 0; i < dims[0]; i++) {
+                ASSERT_EQ(
+                    va[i + j * dims[0] + k * dims[0] * dims[1]],
+                    ha[i * strides[0] + j * strides[1] + k * strides[2] + o])
+                    << "at (" << i << "," << j << "," << k << ")";
+            }
+        }
+    }
+}
+
+TEST(Internal, CheckInfo) {
+    const int xdim = 10;
+    const int ydim = 8;
+
+    const int xoff = 1;
+    const int yoff = 2;
+
+    const int xnum = 5;
+    const int ynum = 3;
+
+    array a = randu(xdim, ydim);
+
+    array b = a(seq(xoff, xoff + xnum - 1), seq(yoff, yoff + ynum - 1));
+
+    dim4 strides = getStrides(b);
+    dim4 dims    = b.dims();
+
+    dim_t offset = xoff + yoff * xdim;
+
+    ASSERT_EQ(dims[0], xnum);
+    ASSERT_EQ(dims[1], ynum);
+    ASSERT_EQ(isOwner(a), true);
+    ASSERT_EQ(isOwner(b), false);
+
+    ASSERT_EQ(getOffset(b), offset);
+    ASSERT_EQ(strides[0], 1);
+    ASSERT_EQ(strides[1], xdim);
+    ASSERT_EQ(strides[2], xdim * ydim);
+    ASSERT_EQ(getRawPtr(a), getRawPtr(b));
+}
+
+TEST(Internal, Linear) {
+    array c;
+    {
+        array a = randu(10, 8);
+
+        // b just points to the same underlying data
+        // b is an owner
+        array b = a;
+        ASSERT_EQ(isOwner(b), true);
+
+        // c is considered a sub-array
+        // c will not be an owner
+        c = a(span);
+        ASSERT_EQ(isOwner(c), false);
+    }
+
+    // Even though a and b are out of scope, c is still not an owner
+    { ASSERT_EQ(isOwner(c), false); }
+}
+
+TEST(Internal, Allocated) {
+    array a            = randu(10, 8);
+    size_t a_allocated = a.allocated();
+    size_t a_bytes     = a.bytes();
+
+    // b just points to the same underlying data
+    // b is an owner
+    array b = a;
+    ASSERT_EQ(b.allocated(), a_allocated);
+    ASSERT_EQ(b.bytes(), a_bytes);
+
+    // c is considered a sub-array
+    // c will not be an owner
+    array c = a(span);
+    ASSERT_EQ(c.allocated(), a_allocated);
+    ASSERT_EQ(c.bytes(), a_bytes);
+
+    array d = a.col(1);
+    ASSERT_EQ(d.allocated(), a_allocated);
+
+    a = randu(20);
+    b = randu(20);
+
+    // Even though a, b are reallocated and c, d are not owners
+    // the allocated and bytes should remain the same
+    ASSERT_EQ(c.allocated(), a_allocated);
+    ASSERT_EQ(c.bytes(), a_bytes);
+
+    ASSERT_EQ(d.allocated(), a_allocated);
+}
diff --git a/test/interop_opencl_custom_kernel_snippet.cpp b/test/interop_opencl_custom_kernel_snippet.cpp
new file mode 100644
index 0000000000..c1864d2e79
--- /dev/null
+++ b/test/interop_opencl_custom_kernel_snippet.cpp
@@ -0,0 +1,96 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// clang-format off
+// ![interop_opencl_custom_kernel_snippet]
+#include <arrayfire.h>
+// 1. Add the af/opencl.h include to your project
+#include <af/opencl.h>
+
+#include <cassert>
+
+#define OCL_CHECK(call)                                                     \
+    if (cl_int err = (call)) { /* CL_SUCCESS is 0; nonzero is an error */   \
+        fprintf(stderr, __FILE__ "(%d):Returned error code %d\n", __LINE__, \
+                err);                                                       \
+    }
+
+int main() {
+    size_t length = 10;
+
+    // Create ArrayFire array objects:
+    af::array A = af::randu(length, f32);
+    af::array B = af::constant(0, length, f32);
+
+    // ... additional ArrayFire operations here
+
+    // 2. Obtain the device, context, and queue used by ArrayFire
+    static cl_context af_context     = afcl::getContext();
+    static cl_device_id af_device_id = afcl::getDeviceId();
+    static cl_command_queue af_queue = afcl::getQueue();
+
+    // 3. Obtain cl_mem references to af::array objects
+    cl_mem* d_A = A.device<cl_mem>();
+    cl_mem* d_B = B.device<cl_mem>();
+
+    // 4. Load, build, and use your kernels.
+    //    Errors are reported through the OCL_CHECK macro defined above.
+    int status = CL_SUCCESS;
+
+    // A simple copy kernel, uses C++11 syntax for multi-line strings.
+    const char* kernel_name = "copy_kernel";
+    const char* source      = R"(
+        void __kernel
+        copy_kernel(__global float* gA, __global float* gB) {
+        int id = get_global_id(0);
+        gB[id] = gA[id];
+    }
+    )";
+
+    // Create the program, build the executable, and extract the entry point
+    // for the kernel.
+    cl_program program = clCreateProgramWithSource(af_context, 1, &source, NULL, &status);
+    OCL_CHECK(status);
+    OCL_CHECK(clBuildProgram(program, 1, &af_device_id, NULL, NULL, NULL));
+    cl_kernel kernel = clCreateKernel(program, kernel_name, &status);
+    OCL_CHECK(status);
+
+    // Set arguments and launch your kernels
+    OCL_CHECK(clSetKernelArg(kernel, 0, sizeof(cl_mem), d_A));
+    OCL_CHECK(clSetKernelArg(kernel, 1, sizeof(cl_mem), d_B));
+    OCL_CHECK(clEnqueueNDRangeKernel(af_queue, kernel, 1, NULL, &length, NULL,
+                                     0, NULL, NULL));
+
+    // 5. Return control of af::array memory to ArrayFire
+    A.unlock();
+    B.unlock();
+
+    /// A and B should now be the same because of the copy_kernel user code
+    assert(af::allTrue<bool>(A == B));
+
+    // Delete the pointers returned by the device function. This does NOT
+    // delete the cl_mem memory and only deletes the pointers
+    delete d_A;
+    delete d_B;
+
+    // ... resume ArrayFire operations
+
+    // Because the device pointers, d_A and d_B, were returned to ArrayFire's
+    // control by the unlock function, there is no need to free them using
+    // clReleaseMemObject()
+
+    // Free the kernel and program objects because they are created in user
+    // code
+    OCL_CHECK(clReleaseKernel(kernel));
+    OCL_CHECK(clReleaseProgram(program));
+
+    return 0;
+}
+// ![interop_opencl_custom_kernel_snippet]
+// clang-format on
diff --git a/test/interop_opencl_external_context_snippet.cpp b/test/interop_opencl_external_context_snippet.cpp
new file mode 100644
index 0000000000..a1259580e6
--- /dev/null
+++ b/test/interop_opencl_external_context_snippet.cpp
@@ -0,0 +1,104 @@
+/*******************************************************
+ * Copyright (c) 2020, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#pragma GCC diagnostic ignored "-Wunused-parameter"
+#pragma GCC diagnostic ignored "-Wignored-qualifiers"
+#pragma GCC diagnostic ignored "-Wignored-attributes"
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+#if __GNUC__ >= 8
+#pragma GCC diagnostic ignored "-Wcatch-value="
+#endif
+// ![interop_opencl_external_context_snippet]
+#include <arrayfire.h>
+// 1. Add the af/opencl.h include to your project
+#include <af/opencl.h>
+
+#include <cassert>
+
+// definitions required by cl2.hpp
+#define CL_HPP_ENABLE_EXCEPTIONS
+#define CL_HPP_TARGET_OPENCL_VERSION 120
+#define CL_HPP_MINIMUM_OPENCL_VERSION 120
+#include <CL/cl2.hpp>
+
+// 1. Add arrayfire.h and af/opencl.h to your application
+#include "af/opencl.h"
+#include "arrayfire.h"
+
+#include <cstdio>
+#include <vector>
+
+using std::vector;
+
+int main() {
+    // 1. Set up the OpenCL context, device, and queues
+    cl::Context context;
+    try {
+        context = cl::Context(CL_DEVICE_TYPE_ALL);
+    } catch (const cl::Error& err) {
+        fprintf(stderr, "Exiting. Failed to create an OpenCL context");
+        return EXIT_FAILURE;
+    }
+    vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
+    if (devices.empty()) {
+        fprintf(stderr, "Exiting. No devices found");
+        return EXIT_SUCCESS;
+    }
+    cl::Device device = devices[0];
+    cl::CommandQueue queue(context, device);
+
+    // Create a buffer of size 10 filled with ones, copy it to the device
+    int length = 10;
+    vector<float> h_A(length, 1);
+    cl::Buffer cl_A(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
+                    length * sizeof(float), h_A.data());
+
+    // 2. Instruct OpenCL to complete its operations using clFinish (or similar)
+    queue.finish();
+
+    // 3. Instruct ArrayFire to use the user-created context
+    //    First, create a device from the current OpenCL device + context +
+    //    queue
+    afcl::addDevice(device(), context(), queue());
+    //    Next switch ArrayFire to the device using the device and context as
+    //    identifiers:
+    afcl::setDevice(device(), context());
+
+    // 4. Create ArrayFire arrays from OpenCL memory objects
+    af::array af_A = afcl::array(length, cl_A(), f32, true);
+    clRetainMemObject(cl_A());
+
+    // 5. Perform ArrayFire operations on the Arrays
+    af_A = af_A + af::randu(length);
+
+    // NOTE: ArrayFire does not perform the above operation in place, so the
+    // OpenCL buffer backing af_A has probably changed and may no longer be
+    // the buffer wrapped by cl_A
+
+    // 6. Instruct ArrayFire to finish operations using af::sync
+    af::sync();
+
+    // 7. Obtain cl_mem references for important memory
+    cl_mem* af_mem = af_A.device<cl_mem>();
+    cl_A           = cl::Buffer(*af_mem, /*retain*/ true);
+
+    /// Delete the af_mem pointer. The buffer returned by the device pointer is
+    /// still valid
+    delete af_mem;
+
+    // 8. Continue your OpenCL application
+
+    // ...
+    return EXIT_SUCCESS;
+}
+// ![interop_opencl_external_context_snippet]
+
+#pragma GCC diagnostic pop
diff --git a/test/inverse_deconv.cpp b/test/inverse_deconv.cpp
new file mode 100644
index 0000000000..86ac2869ab
--- /dev/null
+++ b/test/inverse_deconv.cpp
@@ -0,0 +1,137 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using std::abs;
+using std::string;
+using std::vector;
+using namespace af;
+
+template<typename T>
+class InverseDeconvolution : public ::testing::Test {};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, schar, uchar, short, ushort> TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(InverseDeconvolution, TestTypes);
+
+template<typename T, bool isColor>
+void invDeconvImageTest(string pTestFile, const float gamma,
+                        const af_inverse_deconv_algo algo) {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            OutType;
+
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    using af::dim4;
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/inverse_deconv/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/inverse_deconv/"));
+
+        af_array _inArray   = 0;
+        af_array inArray    = 0;
+        af_array kerArray   = 0;
+        af_array _outArray  = 0;
+        af_array cstArray   = 0;
+        af_array minArray   = 0;
+        af_array numArray   = 0;
+        af_array denArray   = 0;
+        af_array divArray   = 0;
+        af_array outArray   = 0;
+        af_array goldArray  = 0;
+        af_array _goldArray = 0;
+        dim_t nElems        = 0;
+
+        ASSERT_SUCCESS(af_gaussian_kernel(&kerArray, 13, 13, 2.25, 2.25));
+
+        af_dtype otype = (af_dtype)af::dtype_traits<OutType>::af_type;
+
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, _inArray));
+
+        ASSERT_SUCCESS(
+            af_load_image(&_goldArray, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<OutType>(&goldArray, _goldArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
+
+        unsigned ndims;
+        dim_t dims[4];
+        ASSERT_SUCCESS(af_get_numdims(&ndims, goldArray));
+        ASSERT_SUCCESS(
+            af_get_dims(dims, dims + 1, dims + 2, dims + 3, goldArray));
+
+        ASSERT_SUCCESS(
+            af_inverse_deconv(&_outArray, inArray, kerArray, gamma, algo));
+
+        double maxima, minima, imag;
+        ASSERT_SUCCESS(af_min_all(&minima, &imag, _outArray));
+        ASSERT_SUCCESS(af_max_all(&maxima, &imag, _outArray));
+        ASSERT_SUCCESS(af_constant(&cstArray, 255.0, ndims, dims, otype));
+        ASSERT_SUCCESS(
+            af_constant(&denArray, (maxima - minima), ndims, dims, otype));
+        ASSERT_SUCCESS(af_constant(&minArray, minima, ndims, dims, otype));
+        ASSERT_SUCCESS(af_sub(&numArray, _outArray, minArray, false));
+        ASSERT_SUCCESS(af_div(&divArray, numArray, denArray, false));
+        ASSERT_SUCCESS(af_mul(&outArray, divArray, cstArray, false));
+
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.03);
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(kerArray));
+        ASSERT_SUCCESS(af_release_array(cstArray));
+        ASSERT_SUCCESS(af_release_array(minArray));
+        ASSERT_SUCCESS(af_release_array(denArray));
+        ASSERT_SUCCESS(af_release_array(numArray));
+        ASSERT_SUCCESS(af_release_array(divArray));
+        ASSERT_SUCCESS(af_release_array(_outArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(_goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+}
+
+TYPED_TEST(InverseDeconvolution, TikhonovOnGrayscale) {
+    // Test file name format: <colorspace>_<gamma with dots replaced by
+    // "_">_<inverse deconv algo>.test
+    invDeconvImageTest<TypeParam, false>(
+        string(TEST_DIR "/inverse_deconv/gray_00_1_tikhonov.test"), 00.1f,
+        AF_INVERSE_DECONV_TIKHONOV);
+}
+
+TYPED_TEST(InverseDeconvolution, DISABLED_WienerOnGrayscale) {
+    // Test file name format: <colorspace>_<gamma with dots replaced by
+    // "_">_<inverse deconv algo>.test
+    invDeconvImageTest<TypeParam, false>(
+        string(TEST_DIR "/inverse_deconv/gray_1_wiener.test"), 1.0,
+        AF_INVERSE_DECONV_DEFAULT);
+    // TODO(pradeep) change to wiener enum value
+}
diff --git a/test/inverse_dense.cpp b/test/inverse_dense.cpp
index ea6c22256b..0d502389b8 100644
--- a/test/inverse_dense.cpp
+++ b/test/inverse_dense.cpp
@@ -7,59 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
+// NOTE: Tests are known to fail on OSX when using the CPU and OpenCL backends
+// for sizes larger than 128x128. You can read more about it in
+// issue https://github.com/arrayfire/arrayfire/issues/1617
+
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
-#include <string>
-#include <testHelpers.hpp>
+#include <iostream>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
-
-///////////////////////////////// CPP ////////////////////////////////////
-//
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::identity;
+using af::matmul;
+using af::max;
+using std::abs;
 
 template<typename T>
-void inverseTester(const int m, const int n, const int k, double eps)
-{
-    if (noDoubleTests<T>()) return;
+void inverseTester(const int m, const int n, double eps) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
 #if 1
-    af::array A  = cpu_randu<T>(af::dim4(m, n));
+    array A = cpu_randu<T>(dim4(m, n));
 #else
-    af::array A  = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    array A = randu(m, n, (dtype)dtype_traits<T>::af_type);
 #endif
 
     //! [ex_inverse]
-    af::array IA = inverse(A);
-    af::array I = af::matmul(A, IA);
+    array IA = inverse(A);
+    array I  = matmul(A, IA);
     //! [ex_inverse]
 
-    af::array I2 = af::identity(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    array I2 = identity(m, n, (dtype)dtype_traits<T>::af_type);
+
+    ASSERT_NEAR(0, max<typename dtype_traits<T>::base_type>(abs(real(I - I2))),
+                eps);
+    ASSERT_NEAR(0, max<typename dtype_traits<T>::base_type>(abs(imag(I - I2))),
+                eps);
+}
+
+template<typename T>
+class Inverse : public ::testing::Test {};
+
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 0.01;
+}
+
+template<>
+double eps<double>() {
+    return 1e-5;
+}
+
+template<>
+double eps<cfloat>() {
+    return 0.015;
+}
 
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(I - I2))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(I - I2))), eps);
+template<>
+double eps<cdouble>() {
+    return 1e-5;
 }
 
-#define INVERSE_TESTS(T, eps)                   \
-    TEST(INVERSE, T##Square)                    \
-    {                                           \
-        inverseTester<T>(1000, 1000, 100, eps); \
-    }                                           \
-    TEST(INVERSE, T##SquareMultiple)            \
-    {                                           \
-        inverseTester<T>(2048, 2048, 512, eps); \
-    }                                           \
-
-INVERSE_TESTS(float, 0.01)
-INVERSE_TESTS(double, 1E-5)
-INVERSE_TESTS(cfloat, 0.01)
-INVERSE_TESTS(cdouble, 1E-5)
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(Inverse, TestTypes);
+
+TYPED_TEST(Inverse, Square) {
+    inverseTester<TypeParam>(1000, 1000, eps<TypeParam>());
+}
+
+TYPED_TEST(Inverse, SquareMultiplePowerOfTwo) {
+    inverseTester<TypeParam>(2048, 2048, eps<TypeParam>());
+}
diff --git a/test/iota.cpp b/test/iota.cpp
index fae4f72c1d..33ff36e3ba 100644
--- a/test/iota.cpp
+++ b/test/iota.cpp
@@ -7,133 +7,110 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Iota : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Iota : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, unsigned int, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, int, unsigned int, intl, uintl,
+                         signed char, unsigned char, short, ushort,
+                         half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Iota, TestTypes);
+TYPED_TEST_SUITE(Iota, TestTypes);
 
 template<typename T>
-void iotaTest(const af::dim4 idims, const af::dim4 tdims)
-{
-    if (noDoubleTests<T>()) return;
+void iotaTest(const dim4 idims, const dim4 tdims) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_iota(&outArray, idims.ndims(), idims.get(),
-               tdims.ndims(), tdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_iota(&outArray, idims.ndims(), idims.get(), tdims.ndims(),
+                           tdims.get(), (af_dtype)dtype_traits<T>::af_type));
 
     af_array temp0 = 0, temp1 = 0, temp2 = 0;
-    af::dim4 tempdims(idims.elements());
-    af::dim4 fulldims;
-    for(unsigned i = 0; i < 4; i++) {
-        fulldims[i] = idims[i] * tdims[i];
-    }
-    ASSERT_EQ(AF_SUCCESS, af_range(&temp2, tempdims.ndims(), tempdims.get(), 0, (af_dtype) af::dtype_traits<T>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_moddims(&temp1, temp2, idims.ndims(), idims.get()));
-    ASSERT_EQ(AF_SUCCESS, af_tile(&temp0, temp1, tdims[0], tdims[1], tdims[2], tdims[3]));
-
-    // Get result
-    T* outData = new T[fulldims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
-
-    T* tileData = new T[fulldims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)tileData, temp0));
-
-    // Compare result
-    for(int i = 0; i < (int) fulldims.elements(); i++)
-        ASSERT_EQ(tileData[i], outData[i]) << "at: " << i << std::endl;
-
-    // Delete
-    delete[] outData;
-    delete[] tileData;
-
-    if(outArray  != 0) af_release_array(outArray);
-    if(temp0     != 0) af_release_array(temp0);
-    if(temp1     != 0) af_release_array(temp1);
-    if(temp2     != 0) af_release_array(temp2);
+    dim4 tempdims(idims.elements());
+    dim4 fulldims;
+    for (unsigned i = 0; i < 4; i++) { fulldims[i] = idims[i] * tdims[i]; }
+    ASSERT_SUCCESS(af_range(&temp2, tempdims.ndims(), tempdims.get(), 0,
+                            (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_moddims(&temp1, temp2, idims.ndims(), idims.get()));
+    ASSERT_SUCCESS(
+        af_tile(&temp0, temp1, tdims[0], tdims[1], tdims[2], tdims[3]));
+
+    ASSERT_ARRAYS_EQ(temp0, outArray);
+
+    if (outArray != 0) af_release_array(outArray);
+    if (temp0 != 0) af_release_array(temp0);
+    if (temp1 != 0) af_release_array(temp1);
+    if (temp2 != 0) af_release_array(temp2);
 }
 
-#define IOTA_INIT(desc, x, y, z, w, a, b, c, d)                                             \
-    TYPED_TEST(Iota, desc)                                                                  \
-    {                                                                                       \
-        iotaTest<TypeParam>(af::dim4(x, y, z, w), af::dim4(a, b, c, d));                    \
+#define IOTA_INIT(desc, x, y, z, w, a, b, c, d)                  \
+    TYPED_TEST(Iota, desc) {                                     \
+        iotaTest<TypeParam>(dim4(x, y, z, w), dim4(a, b, c, d)); \
     }
 
-    IOTA_INIT(Iota1D0, 100,  1, 1, 1, 2, 3, 1, 1);
+IOTA_INIT(Iota1D0, 100, 1, 1, 1, 2, 3, 1, 1);
+
+IOTA_INIT(Iota2D0, 10, 20, 1, 1, 3, 1, 2, 1);
+IOTA_INIT(Iota2D1, 100, 5, 1, 1, 1, 2, 4, 2);
 
-    IOTA_INIT(Iota2D0,  10, 20, 1, 1, 3, 1, 2, 1);
-    IOTA_INIT(Iota2D1, 100,  5, 1, 1, 1, 2, 4, 2);
+IOTA_INIT(Iota3D0, 20, 6, 3, 1, 1, 1, 1, 1);
+IOTA_INIT(Iota3D1, 10, 12, 5, 1, 2, 3, 4, 5);
+IOTA_INIT(Iota3D2, 25, 30, 2, 1, 1, 2, 2, 1);
 
-    IOTA_INIT(Iota3D0,  20,  6, 3, 1, 1, 1, 1, 1);
-    IOTA_INIT(Iota3D1,  10, 12, 5, 1, 2, 3, 4, 5);
-    IOTA_INIT(Iota3D2,  25, 30, 2, 1, 1, 2, 2, 1);
+IOTA_INIT(Iota4D0, 20, 6, 3, 2, 2, 3, 1, 2);
+IOTA_INIT(Iota4D1, 10, 12, 5, 2, 1, 2, 2, 2);
+IOTA_INIT(Iota4D2, 25, 30, 2, 2, 3, 2, 1, 1);
+IOTA_INIT(Iota4D3, 25, 30, 2, 2, 4, 2, 4, 2);
 
-    IOTA_INIT(Iota4D0,  20,  6, 3, 2, 2, 3, 1, 2);
-    IOTA_INIT(Iota4D1,  10, 12, 5, 2, 1, 2, 2, 2);
-    IOTA_INIT(Iota4D2,  25, 30, 2, 2, 3, 2, 1, 1);
-    IOTA_INIT(Iota4D3,  25, 30, 2, 2, 4, 2, 4, 2);
+IOTA_INIT(IotaMaxDimY, 1, 65535 * 32 + 1, 1, 1, 1, 1, 1, 1);
+IOTA_INIT(IotaMaxDimZ, 1, 1, 65535 * 32 + 1, 1, 1, 1, 1, 1);
+IOTA_INIT(IotaMaxDimW, 1, 1, 1, 65535 * 32 + 1, 1, 1, 1, 1);
 
 ///////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Iota, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    af::dim4 idims(23, 15, 1, 1);
-    af::dim4 tdims(2, 2, 1, 1);
-    af::dim4 fulldims;
-    for(unsigned i = 0; i < 4; i++) {
-        fulldims[i] = idims[i] * tdims[i];
-    }
-
-    af::array output = af::iota(idims, tdims);
-    af::array tileArray = af::tile(af::moddims(af::range(af::dim4(idims.elements()), 0), idims), tdims);
-
-    // Get result
-    float* outData = new float[fulldims.elements()];
-    output.host((void*)outData);
 
-    float* tileData = new float[fulldims.elements()];
-    tileArray.host((void*)tileData);
+using af::array;
+using af::iota;
 
-    // Compare result
+TEST(Iota, CPP) {
+    dim4 idims(23, 15, 1, 1);
+    dim4 tdims(2, 2, 1, 1);
+    dim4 fulldims;
+    for (unsigned i = 0; i < 4; i++) { fulldims[i] = idims[i] * tdims[i]; }
 
-    // Compare result
-    for(int i = 0; i < (int)fulldims.elements(); i++)
-        ASSERT_EQ(tileData[i], outData[i]) << "at: " << i << std::endl;
+    array output = iota(idims, tdims);
+    array tileArray =
+        tile(moddims(range(dim4(idims.elements()), 0), idims), tdims);
 
-    // Delete
-    delete[] outData;
-    delete[] tileData;
+    ASSERT_ARRAYS_EQ(tileArray, output);
 }
diff --git a/test/ireduce.cpp b/test/ireduce.cpp
index 18461c5ee5..b155512e32 100644
--- a/test/ireduce.cpp
+++ b/test/ireduce.cpp
@@ -8,88 +8,102 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
+
+#include <af/algorithm.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/random.h>
+
+#include <algorithm>
 
-using namespace std;
-using namespace af;
-
-
-#define MINMAXOP(fn, ty)                                \
-    TEST(IndexedMinMaxTests, Test_##fn##_##ty##_0)      \
-    {                                                   \
-        if (noDoubleTests<ty>()) return;                \
-        dtype dty = (dtype)dtype_traits<ty>::af_type;   \
-        const int nx = 10000;                           \
-        const int ny = 100;                             \
-        af::array in = randu(nx, ny, dty);              \
-        af::array val, idx;                             \
-        fn(val, idx, in, 0);                            \
-                                                        \
-        ty *h_in = in.host<ty>();                       \
-        ty *h_in_st = h_in;                             \
-        ty *h_val = val.host<ty>();                     \
-        uint *h_idx = idx.host<uint>();                 \
-        for (int i = 0; i < ny; i++) {                  \
-            ty tmp = *fn##_element(h_in, h_in + nx);    \
-            ASSERT_EQ(tmp, h_val[i])                    \
-                << "for index" << i;                    \
-            ASSERT_EQ(h_in[h_idx[i]], tmp)              \
-                << "for index" << i;                    \
-            h_in += nx;                                 \
-        }                                               \
-        delete[] h_in_st;                               \
-        delete[] h_val;                                 \
-        delete[] h_idx;                                 \
-    }                                                   \
-    TEST(IndexedMinMaxTests, Test_##fn##_##ty##_1)      \
-    {                                                   \
-        if (noDoubleTests<ty>()) return;                \
-        dtype dty = (dtype)dtype_traits<ty>::af_type;   \
-        const int nx = 100;                             \
-        const int ny = 100;                             \
-        af::array in = randu(nx, ny, dty);              \
-        af::array val, idx;                             \
-        fn(val, idx, in, 1);                            \
-                                                        \
-        ty *h_in = in.host<ty>();                       \
-        ty *h_val = val.host<ty>();                     \
-        uint *h_idx = idx.host<uint>();                 \
-        for (int i = 0; i < nx; i++) {                  \
-            ty val = h_val[i];                          \
-            for (int j= 0; j < ny; j++) {               \
-                ty tmp = fn(val, h_in[j * nx + i]);     \
-                ASSERT_EQ(tmp, val);                    \
-            }                                           \
-            ASSERT_EQ(val, h_in[h_idx[i] * nx + i]);    \
-        }                                               \
-        delete[] h_in;                                  \
-        delete[] h_val;                                 \
-        delete[] h_idx;                                 \
-    }                                                   \
-    TEST(IndexedMinMaxTests, Test_##fn##_##ty##_all)    \
-    {                                                   \
-        if (noDoubleTests<ty>()) return;                \
-        dtype dty = (dtype)dtype_traits<ty>::af_type;   \
-        const int num = 100000;                         \
-        af::array in = randu(num, dty);                 \
-        ty val;                                         \
-        uint idx;                                       \
-        fn<ty>(&val, &idx, in);                         \
-        ty *h_in = in.host<ty>();                       \
-        ty tmp = *fn##_element(h_in, h_in + num);       \
-        ASSERT_EQ(tmp, val);                            \
-        ASSERT_EQ(tmp, h_in[idx]);                      \
-        delete[] h_in;                                  \
-    }                                                   \
+using af::allTrue;
+using af::array;
+using af::constant;
+using af::dtype;
+using af::dtype_traits;
+using af::max;
+using af::min;
+using af::randu;
+using af::seq;
+using af::span;
+using std::complex;
+using std::vector;
+
+#define MINMAXOP(fn, ty)                                         \
+    TEST(IndexedReduce, fn##_##ty##_0) {                         \
+        SUPPORTED_TYPE_CHECK(ty);                                \
+        dtype dty    = (dtype)dtype_traits<ty>::af_type;         \
+        const int nx = 10;                                       \
+        const int ny = 100;                                      \
+        array in     = randu(nx, ny, dty);                       \
+        array val, idx;                                          \
+        fn(val, idx, in, 0);                                     \
+                                                                 \
+        ty *h_in    = in.host<ty>();                             \
+        ty *h_in_st = h_in;                                      \
+        uint *h_idx = idx.host<uint>();                          \
+        vector<ty> gold;                                         \
+        vector<ty> igold;                                        \
+        gold.reserve(ny);                                        \
+        igold.reserve(ny);                                       \
+        for (int i = 0; i < ny; i++) {                           \
+            gold.push_back(*std::fn##_element(h_in, h_in + nx)); \
+            igold.push_back(h_in[h_idx[i]]);                     \
+            h_in += nx;                                          \
+        }                                                        \
+        ASSERT_VEC_ARRAY_EQ(gold, af::dim4(1, ny), val);         \
+        ASSERT_VEC_ARRAY_EQ(igold, af::dim4(1, ny), val);        \
+        af_free_host(h_in_st);                                   \
+        af_free_host(h_idx);                                     \
+    }                                                            \
+    TEST(IndexedReduce, fn##_##ty##_1) {                         \
+        SUPPORTED_TYPE_CHECK(ty);                                \
+        dtype dty    = (dtype)dtype_traits<ty>::af_type;         \
+        const int nx = 100;                                      \
+        const int ny = 100;                                      \
+        array in     = randu(nx, ny, dty);                       \
+        array val, idx;                                          \
+        fn(val, idx, in, 1);                                     \
+                                                                 \
+        ty *h_in    = in.host<ty>();                             \
+        ty *h_val   = val.host<ty>();                            \
+        uint *h_idx = idx.host<uint>();                          \
+        for (int i = 0; i < nx; i++) {                           \
+            ty val = h_val[i];                                   \
+            for (int j = 0; j < ny; j++) {                       \
+                ty tmp = std::fn(val, h_in[j * nx + i]);         \
+                ASSERT_EQ(tmp, val);                             \
+            }                                                    \
+            ASSERT_EQ(val, h_in[h_idx[i] * nx + i]);             \
+        }                                                        \
+        af_free_host(h_in);                                      \
+        af_free_host(h_val);                                     \
+        af_free_host(h_idx);                                     \
+    }                                                            \
+    TEST(IndexedReduce, fn##_##ty##_all) {                       \
+        SUPPORTED_TYPE_CHECK(ty);                                \
+        dtype dty     = (dtype)dtype_traits<ty>::af_type;        \
+        const int num = 100000;                                  \
+        array in      = randu(num, dty);                         \
+        ty val;                                                  \
+        uint idx;                                                \
+        fn<ty>(&val, &idx, in);                                  \
+        ty *h_in = in.host<ty>();                                \
+        ty tmp   = *std::fn##_element(h_in, h_in + num);         \
+        ASSERT_EQ(tmp, val);                                     \
+        ASSERT_EQ(tmp, h_in[idx]);                               \
+        af_free_host(h_in);                                      \
+    }
 
 MINMAXOP(min, float)
 MINMAXOP(min, double)
 MINMAXOP(min, int)
 MINMAXOP(min, uint)
 MINMAXOP(min, char)
+MINMAXOP(min, schar)
 MINMAXOP(min, uchar)
 
 MINMAXOP(max, float)
@@ -97,4 +111,445 @@ MINMAXOP(max, double)
 MINMAXOP(max, int)
 MINMAXOP(max, uint)
 MINMAXOP(max, char)
+MINMAXOP(max, schar)
 MINMAXOP(max, uchar)
+
+TEST(IndexedReduce, MaxIndexedSmall) {
+    const int num = 1000;
+    const int st  = 10;
+    const int en  = num - 100;
+    array a       = randu(num);
+
+    float b;
+    unsigned idx;
+    max<float>(&b, &idx, a(seq(st, en)));
+
+    vector<float> ha(num);
+    a.host(&ha[0]);
+
+    float res = ha[st];
+    for (int i = st; i <= en; i++) { res = std::max(res, ha[i]); }
+
+    ASSERT_EQ(b, res);
+}
+
+TEST(IndexedReduce, MaxIndexedBig) {
+    const int num = 100000;
+    const int st  = 1000;
+    const int en  = num - 1000;
+    array a       = randu(num);
+
+    float b;
+    unsigned idx;
+    max<float>(&b, &idx, a(seq(st, en)));
+
+    vector<float> ha(num);
+    a.host(&ha[0]);
+
+    float res = ha[st];
+    for (int i = st; i <= en; i++) { res = std::max(res, ha[i]); }
+
+    ASSERT_EQ(b, res);
+}
+
+TEST(IndexedReduce, BUG_FIX_1005) {
+    const int m = 64;
+    const int n = 100;
+    const int b = 5;
+
+    array in = constant(0, m, n, b);
+    for (int i = 0; i < b; i++) {
+        array tmp         = randu(m, n);
+        in(span, span, i) = tmp;
+
+        float val0, val1;
+        unsigned idx0, idx1;
+
+        min<float>(&val0, &idx0, in(span, span, i));
+        min<float>(&val1, &idx1, tmp);
+
+        ASSERT_EQ(val0, val1);
+        ASSERT_EQ(idx0, idx1);
+    }
+}
+
+TEST(IndexedReduce, MinReduceDimensionHasSingleValue) {
+    array data = randu(10, 10, 1);
+
+    array mm, indx;
+    min(mm, indx, data, 2);
+
+    ASSERT_ARRAYS_EQ(data, mm);
+    ASSERT_TRUE(allTrue<bool>(indx == 0));
+}
+
+TEST(IndexedReduce, MaxReduceDimensionHasSingleValue) {
+    array data = randu(10, 10, 1);
+
+    array mm, indx;
+    max(mm, indx, data, 2);
+
+    ASSERT_ARRAYS_EQ(data, mm);
+    ASSERT_TRUE(allTrue<bool>(indx == 0));
+}
+
+TEST(IndexedReduce, MinNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    float test_data[] = {1.f, NAN, 5.f, 0.1f, NAN, -0.5f, NAN, 0.f};
+    int rows          = 4;
+    int cols          = 2;
+    array a(rows, cols, test_data);
+
+    float gold_min_val[] = {0.1f, -0.5f};
+    int gold_min_idx[]   = {3, 1};
+
+    array min_val;
+    array min_idx;
+    min(min_val, min_idx, a);
+
+    vector<float> h_min_val(cols);
+    min_val.host(&h_min_val[0]);
+
+    vector<int> h_min_idx(cols);
+    min_idx.host(&h_min_idx[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_min_val[i], gold_min_val[i]);
+    }
+
+    for (int i = 0; i < cols; i++) { ASSERT_EQ(h_min_idx[i], gold_min_idx[i]); }
+}
+
+TEST(IndexedReduce, MaxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    float test_data[] = {1.f, NAN, 5.f, 0.1f, NAN, -0.5f, NAN, 0.f};
+    int rows          = 4;
+    int cols          = 2;
+    array a(rows, cols, test_data);
+
+    float gold_max_val[] = {5.0f, 0.f};
+    int gold_max_idx[]   = {2, 3};
+
+    array max_val;
+    array max_idx;
+    max(max_val, max_idx, a);
+
+    vector<float> h_max_val(cols);
+    max_val.host(&h_max_val[0]);
+
+    vector<int> h_max_idx(cols);
+    max_idx.host(&h_max_idx[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_max_val[i], gold_max_val[i]);
+    }
+
+    for (int i = 0; i < cols; i++) { ASSERT_EQ(h_max_idx[i], gold_max_idx[i]); }
+}
+
+TEST(IndexedReduce, MinCplxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    float real_wnan_data[] = {0.005f, NAN, -6.3f, NAN,      -0.5f,
+                              NAN,    NAN, 0.2f,  -1205.4f, 8.9f};
+
+    float imag_wnan_data[] = {NAN,    NAN, -9.0f, -0.005f, -0.3f,
+                              0.007f, NAN, 0.1f,  NAN,     4.5f};
+
+    int rows = 5;
+    int cols = 2;
+    array real_wnan(rows, cols, real_wnan_data);
+    array imag_wnan(rows, cols, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_min_real[] = {-0.5f, 0.2f};
+    float gold_min_imag[] = {-0.3f, 0.1f};
+    int gold_min_idx[]    = {4, 2};
+
+    array min_val;
+    array min_idx;
+    af::min(min_val, min_idx, a);
+
+    vector<complex<float>> h_min_val(cols);
+    min_val.host(&h_min_val[0]);
+
+    vector<int> h_min_idx(cols);
+    min_idx.host(&h_min_idx[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_min_val[i].real(), gold_min_real[i]);
+        ASSERT_FLOAT_EQ(h_min_val[i].imag(), gold_min_imag[i]);
+    }
+
+    for (int i = 0; i < cols; i++) { ASSERT_EQ(h_min_idx[i], gold_min_idx[i]); }
+}
+
+TEST(IndexedReduce, MaxCplxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    float real_wnan_data[] = {0.005f, NAN, -6.3f, NAN,      -0.5f,
+                              NAN,    NAN, 0.2f,  -1205.4f, 8.9f};
+
+    float imag_wnan_data[] = {NAN,    NAN, -9.0f, -0.005f, -0.3f,
+                              0.007f, NAN, 0.1f,  NAN,     4.5f};
+
+    int rows = 5;
+    int cols = 2;
+    array real_wnan(rows, cols, real_wnan_data);
+    array imag_wnan(rows, cols, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_max_real[] = {-6.3f, 8.9f};
+    float gold_max_imag[] = {-9.0f, 4.5f};
+    int gold_max_idx[]    = {2, 4};
+
+    array max_val;
+    array max_idx;
+    af::max(max_val, max_idx, a);
+
+    vector<complex<float>> h_max_val(cols);
+    max_val.host(&h_max_val[0]);
+
+    vector<int> h_max_idx(cols);
+    max_idx.host(&h_max_idx[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_max_val[i].real(), gold_max_real[i]);
+        ASSERT_FLOAT_EQ(h_max_val[i].imag(), gold_max_imag[i]);
+    }
+
+    for (int i = 0; i < cols; i++) { ASSERT_EQ(h_max_idx[i], gold_max_idx[i]); }
+}
+
+TEST(IndexedReduce, MinPreferLargerIdxIfEqual) {
+    float test_data[] = {0.f, 50.f, 50.f, 0.f};
+    int len           = 4;
+    array a(len, test_data);
+
+    float gold_min_val = 0.f;
+    int gold_min_idx   = 3;
+
+    array min_val;
+    array min_idx;
+    min(min_val, min_idx, a);
+
+    vector<float> h_min_val(1);
+    min_val.host(&h_min_val[0]);
+
+    vector<int> h_min_idx(1);
+    min_idx.host(&h_min_idx[0]);
+
+    ASSERT_FLOAT_EQ(h_min_val[0], gold_min_val);
+    ASSERT_EQ(h_min_idx[0], gold_min_idx);
+}
+
+TEST(IndexedReduce, MaxPreferSmallerIdxIfEqual) {
+    float test_data[] = {0.f, 50.f, 50.f, 0.f};
+    int len           = 4;
+    array a(len, test_data);
+
+    float gold_max_val = 50.f;
+    int gold_max_idx   = 1;
+
+    array max_val;
+    array max_idx;
+    max(max_val, max_idx, a);
+
+    vector<float> h_max_val(1);
+    max_val.host(&h_max_val[0]);
+
+    vector<int> h_max_idx(1);
+    max_idx.host(&h_max_idx[0]);
+
+    ASSERT_FLOAT_EQ(h_max_val[0], gold_max_val);
+    ASSERT_EQ(h_max_idx[0], gold_max_idx);
+}
+
+TEST(IndexedReduce, MinCplxPreferLargerIdxIfEqual) {
+    float real_wnan_data[] = {0.f, 50.f, 50.f, 0.f};
+    float imag_wnan_data[] = {0.f, 50.f, 50.f, 0.f};
+
+    int len = 4;
+    array real_wnan(len, real_wnan_data);
+    array imag_wnan(len, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_min_real = 0.f;
+    float gold_min_imag = 0.f;
+    int gold_min_idx    = 3;
+
+    array min_val;
+    array min_idx;
+    min(min_val, min_idx, a);
+
+    vector<complex<float>> h_min_val(1);
+    min_val.host(&h_min_val[0]);
+
+    vector<int> h_min_idx(1);
+    min_idx.host(&h_min_idx[0]);
+
+    ASSERT_FLOAT_EQ(h_min_val[0].real(), gold_min_real);
+    ASSERT_FLOAT_EQ(h_min_val[0].imag(), gold_min_imag);
+
+    ASSERT_EQ(h_min_idx[0], gold_min_idx);
+}
+
+TEST(IndexedReduce, MaxCplxPreferSmallerIdxIfEqual) {
+    float real_wnan_data[] = {0.f, 50.f, 50.f, 0.f};
+    float imag_wnan_data[] = {0.f, 50.f, 50.f, 0.f};
+
+    int len = 4;
+    array real_wnan(len, real_wnan_data);
+    array imag_wnan(len, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_max_real = 50.f;
+    float gold_max_imag = 50.f;
+    int gold_max_idx    = 1;
+
+    array max_val;
+    array max_idx;
+    max(max_val, max_idx, a);
+
+    vector<complex<float>> h_max_val(1);
+    max_val.host(&h_max_val[0]);
+
+    vector<int> h_max_idx(1);
+    max_idx.host(&h_max_idx[0]);
+
+    ASSERT_FLOAT_EQ(h_max_val[0].real(), gold_max_real);
+    ASSERT_FLOAT_EQ(h_max_val[0].imag(), gold_max_imag);
+
+    ASSERT_EQ(h_max_idx[0], gold_max_idx);
+}
+
+#define SUBA_TEST_DATA                                              \
+    float test_data[25] = {0.0168, 0.0278, 0.0317, 0.0248, 0.0131,  \
+                           0.0197, 0.0321, 0.0362, 0.0279, 0.0141,  \
+                           0.0218, 0.0353, 0.0394, 0.0297, 0.0143,  \
+                           0.0224, 0.0363, 0.0104, 0.0302, 0.0142,  \
+                           0.0217, 0.0409, 0.0398, 0.0302, 0.0144}; \
+    array a(5, 5, test_data);                                       \
+    array a_sub = a(seq(1, 3), seq(2, 4))
+
+TEST(IndexedReduce, max_subarray_all) {
+    SUBA_TEST_DATA;
+
+    float gold_max_val    = 0.0409;
+    unsigned gold_max_idx = 6;
+
+    float max_val;
+    unsigned max_idx;
+    max<float>(&max_val, &max_idx, a_sub);
+
+    ASSERT_FLOAT_EQ(max_val, gold_max_val);
+    ASSERT_EQ(max_idx, gold_max_idx);
+}
+
+TEST(IndexedReduce, min_subarray_all) {
+    SUBA_TEST_DATA;
+
+    float gold_min_val    = 0.0104;
+    unsigned gold_min_idx = 4;
+
+    float min_val;
+    unsigned min_idx;
+    min<float>(&min_val, &min_idx, a_sub);
+
+    ASSERT_FLOAT_EQ(min_val, gold_min_val);
+    ASSERT_EQ(min_idx, gold_min_idx);
+}
+
+TEST(IndexedReduce, max_subarray_0) {
+    SUBA_TEST_DATA;
+
+    float gold_val[3]    = {0.0394, 0.0363, 0.0409};
+    unsigned gold_idx[3] = {1, 0, 0};
+
+    array val;
+    array idx;
+    float h_val[3];
+    unsigned h_idx[3];
+
+    max(val, idx, a_sub);
+    val.host(&h_val);
+    idx.host(&h_idx);
+
+    for (int i = 0; i < 3; ++i) {
+        ASSERT_FLOAT_EQ(h_val[i], gold_val[i]);
+        ASSERT_EQ(h_idx[i], gold_idx[i]);
+    }
+}
+
+TEST(IndexedReduce, min_subarray_0) {
+    SUBA_TEST_DATA;
+
+    float gold_val[3]    = {0.0297, 0.0104, 0.0302};
+    unsigned gold_idx[3] = {2, 1, 2};
+
+    array val;
+    array idx;
+    float h_val[3];
+    unsigned h_idx[3];
+
+    min(val, idx, a_sub);
+    val.host(&h_val);
+    idx.host(&h_idx);
+
+    for (int i = 0; i < 3; ++i) {
+        ASSERT_FLOAT_EQ(h_val[i], gold_val[i]);
+        ASSERT_EQ(h_idx[i], gold_idx[i]);
+    }
+}
+
+TEST(IndexedReduce, max_subarray_1) {
+    SUBA_TEST_DATA;
+
+    float gold_val[3]    = {0.0409, 0.0398, 0.0302};
+    unsigned gold_idx[3] = {2, 2, 1};
+
+    array val;
+    array idx;
+    float h_val[3];
+    unsigned h_idx[3];
+
+    max(val, idx, a_sub, 1);
+    val.host(&h_val);
+    idx.host(&h_idx);
+
+    for (int i = 0; i < 3; ++i) {
+        ASSERT_FLOAT_EQ(h_val[i], gold_val[i]);
+        ASSERT_EQ(h_idx[i], gold_idx[i]);
+    }
+}
+
+TEST(IndexedReduce, min_subarray_1) {
+    SUBA_TEST_DATA;
+
+    float gold_val[3]    = {0.0353, 0.0104, 0.0297};
+    unsigned gold_idx[3] = {0, 1, 0};
+
+    array val;
+    array idx;
+    float h_val[3];
+    unsigned h_idx[3];
+
+    min(val, idx, a_sub, 1);
+    val.host(&h_val);
+    idx.host(&h_idx);
+
+    for (int i = 0; i < 3; ++i) {
+        ASSERT_FLOAT_EQ(h_val[i], gold_val[i]);
+        ASSERT_EQ(h_idx[i], gold_idx[i]);
+    }
+}
+
+// Ensure that an unevaluated (JIT) array is evaluated before reducing
+TEST(IndexedReduce, reduce_jit_array) {
+    af::array jit(af::dim4(2), {1.0f, 2.0f});
+    jit += af::constant(1.0f, af::dim4(2));
+    float val; unsigned idx;
+    float gold_val = 2.0f;
+    unsigned gold_idx = 0;
+    af::min(&val, &idx, jit);
+    ASSERT_EQ(val, gold_val);
+    ASSERT_EQ(idx, gold_idx);
+}
diff --git a/test/iterative_deconv.cpp b/test/iterative_deconv.cpp
new file mode 100644
index 0000000000..290b81f0d6
--- /dev/null
+++ b/test/iterative_deconv.cpp
@@ -0,0 +1,142 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using std::abs;
+using std::string;
+using std::vector;
+using namespace af;
+
+template<typename T>
+class IterativeDeconvolution : public ::testing::Test {};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, schar, uchar, short, ushort> TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(IterativeDeconvolution, TestTypes);
+
+template<typename T, bool isColor>
+void iterDeconvImageTest(string pTestFile, const unsigned iters, const float rf,
+                         const af::iterativeDeconvAlgo algo) {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            OutType;
+
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    if (is_same_type<T, schar>::value &&
+        algo == AF_ITERATIVE_DECONV_RICHARDSONLUCY) {
+        GTEST_SKIP() << "Incompatible with signed values";
+    }
+
+    using af::dim4;
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> outSizes;
+    vector<string> outFiles;
+
+    readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/iterative_deconv/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/iterative_deconv/"));
+
+        af_array _inArray   = 0;
+        af_array inArray    = 0;
+        af_array kerArray   = 0;
+        af_array _outArray  = 0;
+        af_array cstArray   = 0;
+        af_array minArray   = 0;
+        af_array numArray   = 0;
+        af_array denArray   = 0;
+        af_array divArray   = 0;
+        af_array outArray   = 0;
+        af_array goldArray  = 0;
+        af_array _goldArray = 0;
+        dim_t nElems        = 0;
+
+        ASSERT_SUCCESS(af_gaussian_kernel(&kerArray, 13, 13, 2.25, 2.25));
+
+        af_dtype otype = (af_dtype)af::dtype_traits<OutType>::af_type;
+
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, _inArray));
+
+        ASSERT_SUCCESS(
+            af_load_image(&_goldArray, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<OutType>(&goldArray, _goldArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
+
+        unsigned ndims;
+        dim_t dims[4];
+        ASSERT_SUCCESS(af_get_numdims(&ndims, goldArray));
+        ASSERT_SUCCESS(
+            af_get_dims(dims, dims + 1, dims + 2, dims + 3, goldArray));
+
+        ASSERT_SUCCESS(af_iterative_deconv(&_outArray, inArray, kerArray, iters,
+                                           rf, algo));
+
+        double maxima, minima, imag;
+        ASSERT_SUCCESS(af_min_all(&minima, &imag, _outArray));
+        ASSERT_SUCCESS(af_max_all(&maxima, &imag, _outArray));
+        ASSERT_SUCCESS(af_constant(&cstArray, 255.0, ndims, dims, otype));
+        ASSERT_SUCCESS(
+            af_constant(&denArray, (maxima - minima), ndims, dims, otype));
+        ASSERT_SUCCESS(af_constant(&minArray, minima, ndims, dims, otype));
+        ASSERT_SUCCESS(af_sub(&numArray, _outArray, minArray, false));
+        ASSERT_SUCCESS(af_div(&divArray, numArray, denArray, false));
+        ASSERT_SUCCESS(af_mul(&outArray, divArray, cstArray, false));
+
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.03);
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(kerArray));
+        ASSERT_SUCCESS(af_release_array(cstArray));
+        ASSERT_SUCCESS(af_release_array(minArray));
+        ASSERT_SUCCESS(af_release_array(denArray));
+        ASSERT_SUCCESS(af_release_array(numArray));
+        ASSERT_SUCCESS(af_release_array(divArray));
+        ASSERT_SUCCESS(af_release_array(_outArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(_goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+}
+
+TYPED_TEST(IterativeDeconvolution, LandweberOnGrayscale) {
+    // Test file name format: <colorspace>_<iterations>_<number/1000:relaxation
+    // factor>_<algo>.test
+    iterDeconvImageTest<TypeParam, false>(
+        string(TEST_DIR "/iterative_deconv/gray_100_50_landweber.test"), 100,
+        0.05, AF_ITERATIVE_DECONV_LANDWEBER);
+}
+
+TYPED_TEST(IterativeDeconvolution, RichardsonLucyOnGrayscale) {
+    // Test file name format: <colorspace>_<iterations>_<number/1000:relaxation
+    // factor>_<algo>.test. The Richardson-Lucy algorithm does not use the
+    // relaxation factor.
+    iterDeconvImageTest<TypeParam, false>(
+        string(TEST_DIR "/iterative_deconv/gray_100_50_lucy.test"), 100, 0.05,
+        AF_ITERATIVE_DECONV_RICHARDSONLUCY);
+}
diff --git a/test/jit.cpp b/test/jit.cpp
index 3c2308d5eb..487fdcb6e2 100644
--- a/test/jit.cpp
+++ b/test/jit.cpp
@@ -8,60 +8,849 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/gfor.h>
+#include <af/random.h>
 
-using namespace std;
-using namespace af;
+#include <numeric>
+#include <tuple>
 
-TEST(JIT, CPP_JIT_HASH)
-{
-    using af::array;
+using af::array;
+using af::constant;
+using af::dim4;
+using af::eval;
+using af::freeHost;
+using af::gforSet;
+using af::randn;
+using af::randu;
+using af::seq;
+using std::get;
+using std::to_string;
+using std::tuple;
+using std::vector;
 
-    const int num = 20;
-    const float valA = 3;
-    const float valB = 5;
-    const float valC = 2;
-    const float valD = valA + valB;
-    const float valE = valA + valC;
+TEST(JIT, CPP_JIT_HASH) {
+    const int num     = 20;
+    const float valA  = 3;
+    const float valB  = 5;
+    const float valC  = 2;
+    const float valD  = valA + valB;
+    const float valE  = valA + valC;
     const float valF1 = valD * valE - valE;
     const float valF2 = valD * valE - valD;
 
-    array a = af::constant(valA, num);
-    array b = af::constant(valB, num);
-    array c = af::constant(valC, num);
-    af::eval(a);
-    af::eval(b);
-    af::eval(c);
-
+    array a = constant(valA, num);
+    array b = constant(valB, num);
+    array c = constant(valC, num);
+    eval(a);
+    eval(b);
+    eval(c);
 
     // Creating a kernel
     {
-        array d = a + b;
-        array e = a + c;
+        array d  = a + b;
+        array e  = a + c;
         array f1 = d * e - e;
-        float *hF1 = f1.host<float>();
 
-        for (int i = 0; i < num; i++) {
-            ASSERT_EQ(hF1[i], valF1);
-        }
+        float* hF1 = f1.host<float>();
 
-        delete[] hF1;
+        for (int i = 0; i < num; i++) { ASSERT_EQ(hF1[i], valF1); }
+
+        freeHost(hF1);
     }
 
     // Making sure a different kernel is generated
     {
-        array d = a + b;
-        array e = a + c;
-        array f2 = d * e - d;
-        float *hF2 = f2.host<float>();
+        array d    = a + b;
+        array e    = a + c;
+        array f2   = d * e - d;
+        float* hF2 = f2.host<float>();
+
+        for (int i = 0; i < num; i++) { ASSERT_EQ(hF2[i], valF2); }
+
+        freeHost(hF2);
+    }
+}
+
+TEST(JIT, CPP_JIT_Reset_Binary) {
+    array a = constant(2, 5, 5);
+    array b = constant(1, 5, 5);
+    array c = a + b;
+    array d = a - b;
+    array e = c * d;
+    e.eval();
+    array f = c - d;
+    f.eval();
+    array g = d - c;
+    g.eval();
+
+    ASSERT_ARRAYS_NEAR(f, -g, 1e-5);
+}
+
+TEST(JIT, CPP_JIT_Reset_Unary) {
+    array a = constant(2, 5, 5);
+    array b = constant(1, 5, 5);
+    array c = sin(a);
+    array d = cos(b);
+    array e = c * d;
+    e.eval();
+    array f = c - d;
+    f.eval();
+    array g = d - c;
+    g.eval();
+
+    ASSERT_ARRAYS_EQ(f, -g);
+}
+
+TEST(JIT, CPP_Multi_linear) {
+    const int num = 1 << 16;
+    array a       = randu(num, s32);
+    array b       = randu(num, s32);
+    array x       = a + b;
+    array y       = a - b;
+    eval(x, y);
+
+    vector<int> ha(num);
+    vector<int> hb(num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+
+    vector<int> goldx(num);
+    vector<int> goldy(num);
+    for (int i = 0; i < num; i++) {
+        goldx[i] = ha[i] + hb[i];
+        goldy[i] = ha[i] - hb[i];
+    }
+
+    ASSERT_VEC_ARRAY_EQ(goldx, dim4(num), x);
+    ASSERT_VEC_ARRAY_EQ(goldy, dim4(num), y);
+}
 
+TEST(JIT, CPP_gforSet_strided) {
+    const int num = 1024;
+    gforSet(true);
+    array a = randu(num, 1, s32);
+    array b = randu(1, num, s32);
+    array x = a + b;
+    array y = a - b;
+    eval(x);
+    eval(y);
+    gforSet(false);
+
+    vector<int> ha(num);
+    vector<int> hb(num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+
+    vector<int> hapb(num * num);
+    vector<int> hamb(num * num);
+    for (int j = 0; j < num; j++) {
         for (int i = 0; i < num; i++) {
-            ASSERT_EQ(hF2[i], valF2);
+            hapb[j * num + i] = ha[i] + hb[j];
+            hamb[j * num + i] = ha[i] - hb[j];
+        }
+    }
+    ASSERT_VEC_ARRAY_EQ(hapb, dim4(num, num), x);
+    ASSERT_VEC_ARRAY_EQ(hamb, dim4(num, num), y);
+}
+
+TEST(JIT, CPP_gforSet_Multi_strided) {
+    const int num = 1024;
+    gforSet(true);
+    array a = randu(num, 1, s32);
+    array b = randu(1, num, s32);
+    array x = a + b;
+    array y = a - b;
+    eval(x, y);
+    gforSet(false);
+
+    vector<int> ha(num);
+    vector<int> hb(num);
+    vector<int> hx(num * num);
+    vector<int> hy(num * num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+    x.host(&hx[0]);
+    y.host(&hy[0]);
+
+    for (int j = 0; j < num; j++) {
+        for (int i = 0; i < num; i++) {
+            ASSERT_EQ((ha[i] + hb[j]), hx[j * num + i]);
+            ASSERT_EQ((ha[i] - hb[j]), hy[j * num + i]);
+        }
+    }
+}
+
+TEST(JIT, CPP_Multi_pre_eval) {
+    const int num = 1 << 16;
+    array a       = randu(num, s32);
+    array b       = randu(num, s32);
+    array x       = a + b;
+    array y       = a - b;
+
+    eval(x);
+
+    // Should evaluate only y
+    eval(x, y);
+
+    // Should not evaluate anything
+    // Should not error out
+    eval(x, y);
+
+    vector<int> ha(num);
+    vector<int> hb(num);
+    vector<int> hx(num);
+    vector<int> hy(num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+    x.host(&hx[0]);
+    y.host(&hy[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_EQ((ha[i] + hb[i]), hx[i]);
+        ASSERT_EQ((ha[i] - hb[i]), hy[i]);
+    }
+}
+
+TEST(JIT, CPP_common_node) {
+    array r = seq(-3, 3, 0.5);
+
+    int n = r.dims(0);
+
+    array x = tile(r, 1, r.dims(0));
+    array y = tile(r.T(), r.dims(0), 1);
+
+    vector<float> hx(x.elements());
+    vector<float> hy(y.elements());
+    vector<float> hr(r.elements());
+
+    x.host(&hx[0]);
+    y.host(&hy[0]);
+    r.host(&hr[0]);
+
+    for (int j = 0; j < n; j++) {
+        for (int i = 0; i < n; i++) {
+            ASSERT_EQ(hx[j * n + i], hr[i]);
+            ASSERT_EQ(hy[j * n + i], hr[j]);
+        }
+    }
+}
+
+TEST(JIT, ISSUE_1646) {
+    array test1 = randn(10, 10);
+    array test2 = randn(10);
+    array test3 = randn(10);
+
+    for (int i = 0; i < 1000; i++) {
+        test3 += sum(test1, 1);
+        test2 += test3;
+    }
+    eval(test2);
+    eval(test3);
+}
+
+TEST(JIT, NonLinearLargeY) {
+    const int d0 = 2;
+    // d1 needs to be > 2 * (1 << 20) for this test to be meaningful.
+    const int d1 = 3 * (1 << 20);
+    array a      = randn(d0);
+    array b      = randn(1, d1);
+
+    // tile is jit-ted for both the operations
+    array c = tile(a, 1, d1) + tile(b, d0, 1);
+    eval(c);
+
+    vector<float> ha(d0);
+    vector<float> hb(d1);
+    vector<float> hc(d0 * d1);
+
+    a.host(ha.data());
+    b.host(hb.data());
+
+    for (int j = 0; j < d1; j++) {
+        for (int i = 0; i < d0; i++) { hc[i + j * d0] = ha[i] + hb[j]; }
+    }
+    ASSERT_VEC_ARRAY_EQ(hc, dim4(d0, d1), c);
+}
+
+TEST(JIT, NonLinearLargeX) {
+    af_array r, c, s;
+    dim_t rdims[] = {1024000, 1, 3};
+    dim_t cdims[] = {1, 1, 3};
+    dim_t sdims[] = {1, 1, 1};
+    dim_t ndims   = 3;
+
+    ASSERT_SUCCESS(af_randu(&r, ndims, rdims, f32));
+    ASSERT_SUCCESS(af_constant(&c, 1, ndims, cdims, f32));
+    ASSERT_SUCCESS(af_eval(c));
+    ASSERT_SUCCESS(af_sub(&s, r, c, true));
+    ASSERT_SUCCESS(af_eval(s));
+
+    dim_t relem = 1;
+    dim_t celem = 1;
+    dim_t selem = 1;
+    for (int i = 0; i < ndims; i++) {
+        relem *= rdims[i];
+        celem *= cdims[i];
+        sdims[i] = std::max(rdims[i], cdims[i]);
+        selem *= sdims[i];
+    }
+
+    vector<float> hr(relem);
+    vector<float> hc(celem);
+    vector<float> hs(selem);
+
+    ASSERT_SUCCESS(af_get_data_ptr(hr.data(), r));
+    ASSERT_SUCCESS(af_get_data_ptr(hc.data(), c));
+    ASSERT_SUCCESS(af_get_data_ptr(hs.data(), s));
+
+    for (int k = 0; k < sdims[2]; k++) {
+        for (int j = 0; j < sdims[1]; j++) {
+            for (int i = 0; i < sdims[0]; i++) {
+                int sidx = i + j * sdims[0] + k * (sdims[0] * sdims[1]);
+
+                int ridx = (i % rdims[0]) + (j % rdims[1]) * rdims[0] +
+                           (k % rdims[2]) * rdims[0] * rdims[1];
+
+                int cidx = (i % cdims[0]) + (j % cdims[1]) * cdims[0] +
+                           (k % cdims[2]) * cdims[0] * cdims[1];
+
+                ASSERT_EQ(hs[sidx], hr[ridx] - hc[cidx])
+                    << " at " << i << "," << j << "," << k;
+            }
         }
+    }
+
+    ASSERT_SUCCESS(af_release_array(r));
+    ASSERT_SUCCESS(af_release_array(c));
+    ASSERT_SUCCESS(af_release_array(s));
+}
+
+TEST(JIT, ISSUE_1894) {
+    array a = randu(1);
+    array b = tile(a, 2 * (1 << 20));
+    eval(b);
+    float ha = -100;
+    vector<float> hb(b.elements(), -200);
+
+    a.host(&ha);
+    b.host(hb.data());
+
+    for (size_t i = 0; i < hb.size(); i++) { ASSERT_EQ(ha, hb[i]); }
+}
+
+TEST(JIT, LinearLarge) {
+    // Needs to be larger than 65535 * 256 (or 1 << 24)
+    float v1 = std::rand() % 100;
+    float v2 = std::rand() % 100;
+
+    array a = constant(v1, 1 << 25);
+    array b = constant(v2, 1 << 25);
+    array c = (a + b) * (a - b);
+    eval(c);
+
+    float v3 = (v1 + v2) * (v1 - v2);
 
-        delete[] hF2;
+    vector<float> hc(c.elements());
+    c.host(hc.data());
+
+    for (size_t i = 0; i < hc.size(); i++) { ASSERT_EQ(hc[i], v3); }
+}
+
+TEST(JIT, NonLinearBuffers1) {
+    array a  = randu(5, 5);
+    array a0 = a;
+    for (int i = 0; i < 1000; i++) {
+        array b = randu(1, 5);
+        a += tile(b, 5);
+    }
+    a.eval();
+}
+
+TEST(JIT, NonLinearBuffers2) {
+    array a = randu(100, 310);
+    array b = randu(10, 10);
+    for (int i = 0; i < 300; i++) {
+        b += a(seq(10), seq(i, i + 9)) * randu(10, 10);
     }
+    b.eval();
+}
+
+TEST(JIT, TransposeBuffers) {
+    const int num = 10;
+    array a       = randu(1, num);
+    array b       = randu(1, num);
+    array c       = a + b;
+    array d       = a.T() + b.T();
+
+    vector<float> ha(a.elements());
+    a.host(ha.data());
+
+    vector<float> hb(b.elements());
+    b.host(hb.data());
+
+    vector<float> hc(c.elements());
+    c.host(hc.data());
+
+    vector<float> hd(d.elements());
+    d.host(hd.data());
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_FLOAT_EQ(ha[i] + hb[i], hc[i]);
+        ASSERT_FLOAT_EQ(hc[i], hd[i]);
+    }
+}
+
+TEST(JIT, ConstEval7) {
+    const array a = constant(1, 1);
+    const array b = constant(1, 1);
+    const array c = constant(1, 1);
+    const array d = constant(1, 1);
+    const array e = constant(1, 1);
+    const array f = constant(1, 1);
+
+#if (__cpp_variadic_templates >= 200704)
+    EXPECT_NO_THROW({
+        const array g = constant(1, 1);
+        eval(a, b, c, d, e, f, g);
+        af::sync();
+    });
+#else
+    EXPECT_NO_THROW({
+        eval(a, b, c, d, e, f);
+        af::sync();
+    });
+#endif
+}
+
+using af::dim4;
+
+struct tile_params {
+    dim4 in_dim;
+    dim4 tile;
+    dim4 out_dim;
+    tile_params(dim4 in, dim4 t, dim4 out)
+        : in_dim(in), tile(t), out_dim(out) {}
+};
+
+std::ostream& operator<<(std::ostream& os, const tile_params& tp) {
+    os << "in_dim: " << tp.in_dim << "; tile parameters: " << tp.tile
+       << "; out_dim " << tp.out_dim << ";";
+    return os;
+}
+
+class JIT : public ::testing::TestWithParam<tile_params> {
+   protected:
+    void SetUp() {
+        tile_params params = GetParam();
+        vector<float> vals(params.in_dim.elements());
+        iota(vals.begin(), vals.end(), 0.f);
+        in = array(params.in_dim, &vals.front());
+
+        // clang-format off
+        gold.resize(params.out_dim.elements());
+        dim_t tile_dim[4] = {params.tile[0], params.tile[1], params.tile[2],
+                             params.tile[3]};
+
+        dim_t istride[4] = {1,
+                            params.in_dim[0],
+                            params.in_dim[0] * params.in_dim[1],
+                            params.in_dim[0] * params.in_dim[1] * params.in_dim[2]};
+        dim_t ostride[4] = {1,
+                            params.out_dim[0],
+                            params.out_dim[0] * params.out_dim[1],
+                            params.out_dim[0] * params.out_dim[1] * params.out_dim[2]};
+
+        for (int i = 0; i < 4; i++) {
+            if (tile_dim[i] != 1) { istride[i] = 0; }
+        }
+
+        for (int l = 0; l < params.out_dim[3]; l++) {
+            for (int k = 0; k < params.out_dim[2]; k++) {
+                for (int j = 0; j < params.out_dim[1]; j++) {
+                    for (int i = 0; i < params.out_dim[0]; i++) {
+                        gold[l * ostride[3] +
+                            k * ostride[2] +
+                            j * ostride[1] +
+                            i * ostride[0]] = vals[l * istride[3] +
+                                                    k * istride[2] +
+                                                    j * istride[1] +
+                                                    i * istride[0]];
+                    }
+                }
+            }
+        }
+        // clang-format on
+    }
+
+   public:
+    array in;
+    vector<float> gold;
+};
+
+void replace_all(std::string& str, const std::string& oldStr,
+                 const std::string& newStr) {
+    std::string::size_type pos = 0u;
+    while ((pos = str.find(oldStr, pos)) != std::string::npos) {
+        str.replace(pos, oldStr.length(), newStr);
+        pos += newStr.length();
+    }
+}
+
+std::string concat_dim4(dim4 d) {
+    std::stringstream ss;
+    ss << d;
+    std::string s = ss.str();
+    replace_all(s, " ", "_");
+    return s;
+}
+std::string tile_info(const ::testing::TestParamInfo<JIT::ParamType> info) {
+    std::stringstream ss;
+    ss << "in_" << concat_dim4(info.param.in_dim) << "_tile_"
+       << concat_dim4(info.param.tile);
+    return ss.str();
+}
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+                        JitTile, JIT,
+                                                   //  input_dim            tile_dim             output_dim
+                        ::testing::Values(
+                                          tile_params( dim4(10),            dim4(1, 10),         dim4(10, 10)),
+                                          tile_params( dim4(10),            dim4(1, 1, 10),      dim4(10, 1, 10)),
+                                          tile_params( dim4(10),            dim4(1, 1, 1, 10),   dim4(10, 1, 1, 10)),
+                                          tile_params( dim4(1, 10),         dim4(10),            dim4(10, 10)),
+                                          tile_params( dim4(1, 10),         dim4(1, 1, 10),      dim4(1, 10, 10)),
+                                          tile_params( dim4(1, 10),         dim4(1, 1, 1, 10),   dim4(1, 10, 1, 10)),
+
+                                          tile_params( dim4(10, 10),        dim4(1, 1, 10),      dim4(10, 10, 10)),
+                                          tile_params( dim4(10, 10),        dim4(1, 1, 1, 10),   dim4(10, 10, 1, 10)),
+
+                                          tile_params( dim4(1, 1, 10),      dim4(10),            dim4(10, 1, 10)),
+                                          tile_params( dim4(1, 1, 10),      dim4(1, 10),         dim4(1, 10, 10)),
+                                          tile_params( dim4(1, 1, 10),      dim4(1, 1, 1, 10),   dim4(1, 1, 10, 10)),
+
+                                          tile_params( dim4(1, 10, 10),     dim4(10),            dim4(10, 10, 10)),
+                                          tile_params( dim4(10, 1, 10),     dim4(1, 10),         dim4(10, 10, 10)),
+                                          tile_params( dim4(10, 1, 10),     dim4(1, 1, 1, 10),   dim4(10, 1, 10, 10)),
+                                          tile_params( dim4(1, 10, 10),     dim4(1, 1, 1, 10),   dim4(1, 10, 10, 10)),
+                                          tile_params( dim4(10, 10, 10),    dim4(1, 1, 1, 10),   dim4(10, 10, 10, 10)),
+
+                                          tile_params( dim4(1, 1, 1, 10),   dim4(10),            dim4(10, 1, 1, 10)),
+                                          tile_params( dim4(1, 10, 1, 10),  dim4(10),            dim4(10, 10, 1, 10)),
+                                          tile_params( dim4(1, 1, 10, 10),  dim4(10),            dim4(10, 1, 10, 10)),
+                                          tile_params( dim4(1, 10, 10, 10), dim4(10),            dim4(10, 10, 10, 10)),
+
+                                          tile_params( dim4(1, 1, 1, 10),   dim4(1, 10),         dim4(1, 10, 1, 10)),
+                                          tile_params( dim4(10, 1, 1, 10),  dim4(1, 10),         dim4(10, 10, 1, 10)),
+                                          tile_params( dim4(1, 1, 10, 10),  dim4(1, 10),         dim4(1, 10, 10, 10)),
+
+                                          tile_params( dim4(1, 1, 1, 10),   dim4(1, 1, 10),      dim4(1, 1, 10, 10)),
+                                          tile_params( dim4(10, 1, 1, 10),  dim4(1, 1, 10),      dim4(10, 1, 10, 10)),
+                                          tile_params( dim4(1, 10, 1, 10),  dim4(1, 1, 10),      dim4(1, 10, 10, 10)),
+                                          tile_params( dim4(10, 10, 1, 10), dim4(1, 1, 10),      dim4(10, 10, 10, 10))
+                                          ),
+                        tile_info
+                        );
+// clang-format on
+
+TEST_P(JIT, Tile) {
+    tile_params params = GetParam();
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+    size_t alloc_bytes2, alloc_buffers2;
+    size_t lock_bytes2, lock_buffers2;
+    af::deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    array out = tile(in, params.tile);
+    af::deviceMemInfo(&alloc_bytes2, &alloc_buffers2, &lock_bytes2,
+                      &lock_buffers2);
+
+    // Make sure that the dimensions we are testing here are JIT nodes
+    // by checking that no new buffers are created.
+    ASSERT_EQ(alloc_bytes, alloc_bytes2)
+        << "Tile operation created a buffer therefore not a JIT node";
+    ASSERT_EQ(alloc_buffers, alloc_buffers2)
+        << "Tile operation created a buffer therefore not a JIT node";
+    ASSERT_EQ(lock_bytes, lock_bytes2)
+        << "Tile operation created a buffer therefore not a JIT node";
+    ASSERT_EQ(lock_buffers, lock_buffers2)
+        << "Tile operation created a buffer therefore not a JIT node";
+
+    ASSERT_VEC_ARRAY_EQ(gold, params.out_dim, out);
+}
+
+/// This test creates a large JIT tree with very small buffers. It
+/// performs random JIT operations on the arrays, and each iteration
+/// also creates a new buffer node. The test was written to address
+/// issues with large parameter sizes in CUDA. See issues #2436
+/// and #2389
+TEST(JIT, LargeJitTree) {
+    dim_t d0 = 30;
+    array a  = randu(d0, 5);
+    array b  = randu(d0, 1);
+    array c  = randu(d0, 1);
+    EXPECT_NO_THROW({
+        for (int i = 0; i < 500; i++) {
+            b += cos(pow(sin(c * 0.3f), 2) + pow(randu(d0, 1) - 3, 2) * 1.1f +
+                     3);
+            a = floor(a + tile(b, 1, 5));
+        }
+        eval(a);
+        af::sync();
+    });
+}
+
+void testTwoLargeNonLinear(const af_dtype dt) {
+    int dimsize = 10;
+    array a     = constant(0, dimsize, dimsize, dt);
+    array aa    = constant(0, dimsize, dimsize, dt);
+    array b     = constant(0, dimsize, dimsize, dt);
+    array bb    = constant(0, dimsize, dimsize, dt);
+
+    int val = 0;
+    for (int i = 0; i < 23; i++) {
+        array ones = constant(1, dimsize, dimsize, dt);
+        ones.eval();
+        array twos = constant(2, dimsize, dt);
+        twos.eval();
+
+        a += tile(twos, 1, dimsize) + ones;
+        aa += tile(twos, 1, dimsize) + ones;
+        val += 3;
+    }
+
+    for (int i = 0; i < 23; i++) {
+        array ones = constant(1, dimsize, dimsize, dt);
+        ones.eval();
+        array twos = constant(2, dimsize, dt);
+        twos.eval();
+        b += tile(twos, 1, dimsize) + ones;
+        bb += tile(twos, 1, dimsize) + ones;
+    }
+    array c  = a + b;
+    array cc = aa + bb;
+    eval(c, cc);
+
+    vector<float> gold(a.elements(), val * 2);
+    ASSERT_VEC_ARRAY_EQ(gold, a.dims(), c.as(f32));
+}
+
+TEST(JIT, TwoLargeNonLinear) { testTwoLargeNonLinear(f32); }
+
+TEST(JIT, TwoLargeNonLinearHalf) {
+    if (noHalfTests(f16)) return;
+    testTwoLargeNonLinear(f16);
+}
+
+std::string select_info(
+    const ::testing::TestParamInfo<std::tuple<int, int, int>> info) {
+    return "a_" + to_string(get<0>(info.param)) + "_b_" +
+           to_string(get<1>(info.param)) + "_cond_" +
+           to_string(get<2>(info.param));
+}
+
+class JITSelect : public ::testing::TestWithParam<std::tuple<int, int, int>> {
+   protected:
+    void SetUp() {}
+};
+
+// clang-format off
+INSTANTIATE_TEST_SUITE_P(
+                        JitSelect, JITSelect,
+                        testing::Combine(
+                                         testing::Range(10, 22),
+                                         testing::Range(10, 22),
+                                         testing::Range(10, 22)),
+                        select_info);
+TEST_P(JITSelect, SelectLargeNonLinear) {
+    int dimsize = 10;
+    array a     = constant(0, dimsize, dimsize);
+    array b     = constant(0, dimsize, dimsize);
+    array cond  = constant(0, dimsize, dimsize);
+
+    int val = 0;
+    for (int i = 0; i < std::get<0>(GetParam()); i++) {
+        array ones = constant(1, dimsize, dimsize);
+        ones.eval();
+        array twos = constant(2, dimsize);
+        twos.eval();
+
+        a += tile(twos, 1, dimsize) + ones;
+        val += 3;
+    }
+
+    for (int i = 0; i < std::get<1>(GetParam()); i++) {
+        array ones = constant(2, dimsize, dimsize);
+        ones.eval();
+        array twos = constant(2, dimsize);
+        twos.eval();
+        b += tile(twos, 1, dimsize) + ones;
+    }
+
+
+    for (int i = 0; i < std::get<2>(GetParam()); i++) {
+        array ones = constant(1, dimsize, dimsize);
+        ones.eval();
+        array twos = constant(2, dimsize);
+        twos.eval();
+        array fives = constant(5, dimsize, dimsize);
+        fives.eval();
+        cond += tile(twos, 1, dimsize) + ones;
+        cond = cond < fives;
+    }
+
+    array c  = select(cond, a, b);
+    c.eval();
+
+    vector<float> gold(a.elements(), val);
+    ASSERT_VEC_ARRAY_EQ(gold, a.dims(), c);
+}
+
+TEST(JIT, AllBuffers) {
+  int buffers = 128;
+  vector<array> arrs(buffers);
+
+  for(int i = 0; i < buffers; i++) {
+    arrs[i] = constant(1, 5);
+    arrs[i].eval();
+  }
+
+  int inc = 2;
+  for(int ii = buffers/2; ii > 2; ii/=2) {
+      for(size_t i = 0; i < arrs.size(); i += inc) {
+          arrs[i] = arrs[i] + arrs[i + inc/2];
+      }
+      inc *= 2;
+  }
+  arrs[0] = tile(arrs[0], 1, 5) + tile(arrs[64],1, 5);
+  arrs[0].eval();
+  af::sync();
+}
+
+TEST(JIT, IndexingColumn) {
+    array a = constant(1, 512, 32);
+    array b = constant(2, 512);
+    a.eval();
+    b.eval();
+
+    array c = a(af::span, 31) + b;
+
+    vector<float> gold(512, 3.0f);
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(512), c);
+}
+
+TEST(JIT, IndexingRow) {
+    array a = constant(1, 32, 512);
+    array b = constant(2, 1, 512);
+    a.eval();
+    b.eval();
+
+    array c = a(31, af::span) + b;
+
+    vector<float> gold(512, 3.0f);
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(1, 512), c);
+}
+
+TEST(JIT, DISABLED_ManyConstants) {
+    array res  = constant(1, 1);
+    array res2 = tile(res, 1, 10);
+    array res3 = randu(1);
+    array res4 = tile(res3, 1, 10);
+    array res5 = randu(1);
+    array res6 = tile(res5, 1, 10);
+    array res7 = randu(1);
+    array res8 = tile(res7, 1, 10);
+
+    for (int i = 0; i < 80; i++) { res2 = res2 + randu(1, 10); }
+    for (int i = 0; i < 80; i++) { res4 = res4 + tile(randu(1), 1, 10); }
+    for (int i = 0; i < 80; i++) { res6 = res6 + tile(randu(1), 1, 10); }
+    for (int i = 0; i < 80; i++) { res8 = res8 + 1.0f; }
+
+    // This still fails in the current implementation
+    eval(res2, res4, res6);//, res8);
+    af::sync();
+}
+
+TEST(JIT, getKernelCacheDirectory) {
+  size_t length = 0;
+  ASSERT_SUCCESS(af_get_kernel_cache_directory(&length, NULL));
+
+  std::string path;
+  path.resize(length);
+  ASSERT_SUCCESS(af_get_kernel_cache_directory(&length, &path.at(0)));
+}
+
+TEST(JIT, setKernelCacheDirectory) {
+  std::string path = ".";
+
+  // Get the old path so we can reset it after the test
+  size_t length = 0;
+  ASSERT_SUCCESS(af_get_kernel_cache_directory(&length, NULL));
+  std::string old_path;
+  old_path.resize(length);
+  ASSERT_SUCCESS(af_get_kernel_cache_directory(&length, &old_path.at(0)));
+
+  // Set cache directory to the new path
+  ASSERT_SUCCESS(af_set_kernel_cache_directory(path.c_str(), false));
+
+  // Get the new path for verification
+  size_t new_length = path.size();
+  std::string new_path;
+  new_path.resize(new_length);
+  ASSERT_SUCCESS(af_get_kernel_cache_directory(&new_length, &new_path.at(0)));
+
+  ASSERT_EQ(path, new_path);
+  ASSERT_EQ(path.size(), new_path.size());
+
+  // Reset to the old path
+  ASSERT_SUCCESS(af_set_kernel_cache_directory(old_path.c_str(), false));
+}
+
+// Ensure that a correct result is obtained when evaluating an expression
+// that contains both an array and its transpose - see ISSUE 3660
+TEST(JIT, evaluateBothArrayAndItsTranspose) {
+  float X2_ptr[25] = { -1.,  -1.,  -1.,  -1.,  -1.,
+                      -0.5, -0.5, -0.5, -0.5, -0.5,
+                        0.,   0.,   0.,   0.,   0.,
+                       0.5,  0.5,  0.5,  0.5,  0.5,
+                        1.,   1.,   1.,   1.,   1. };
+  array X2_gold(5, 5, X2_ptr);
+
+  float Y2_ptr[25] = { -1., -0.5,   0.,  0.5,   1.,
+                       -1., -0.5,   0.,  0.5,   1.,
+                       -1., -0.5,   0.,  0.5,   1.,
+                       -1., -0.5,   0.,  0.5,   1.,
+                       -1., -0.5,   0.,  0.5,   1. };
+  array Y2_gold(5, 5, Y2_ptr);
+
+  float X2Y2_ptr[25] = {  -2., -1.5,  -1., -0.5,  0.,
+                         -1.5,  -1., -0.5,   0., 0.5,
+                          -1., -0.5,   0.,  0.5,  1.,
+                         -0.5,   0.,  0.5,   1., 1.5,
+                           0.,  0.5,   1.,  1.5,  2. };
+  array X2Y2_gold(5, 5, X2Y2_ptr);
+
+  int n = 5;
+  int half = (n - 1) / 2;
+  double delta = 1.0 / half;
+
+  array coord = delta * (af::range(n) - half);
+
+  array X2 = tile(coord.T(), n, 1);
+  array Y2 = tile(coord, 1, n);
+
+  array X2Y2 = X2 + Y2;
+
+  ASSERT_ARRAYS_EQ(X2_gold, X2);
+  ASSERT_ARRAYS_EQ(Y2_gold, Y2);
+  ASSERT_ARRAYS_EQ(X2Y2_gold, X2Y2);
 }
diff --git a/test/jit_test_api.cpp b/test/jit_test_api.cpp
new file mode 100644
index 0000000000..79430ab874
--- /dev/null
+++ b/test/jit_test_api.cpp
@@ -0,0 +1,34 @@
+/*******************************************************
+ * Copyright (c) 2021, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/data.h>
+
+namespace af {
+int getMaxJitLen(void);
+
+void setMaxJitLen(const int jitLen);
+}  // namespace af
+
+TEST(JIT, UnitMaxHeight) {
+    const int oldMaxJitLen = af::getMaxJitLen();
+    af::setMaxJitLen(1);
+    af::array a = af::constant(1, 10);
+    af::array b = af::constant(2, 10);
+    af::array c = a * b;
+    af::array d = b * c;
+    c.eval();
+    d.eval();
+    af::setMaxJitLen(oldMaxJitLen);
+}
+
+TEST(JIT, ZeroMaxHeight) {
+    EXPECT_THROW({ af::setMaxJitLen(0); }, af::exception);
+}
diff --git a/test/join.cpp b/test/join.cpp
index 01014456ab..5cd470780f 100644
--- a/test/join.cpp
+++ b/test/join.cpp
@@ -7,176 +7,327 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/index.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/index.h>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
+
+#include <array>
 #include <complex>
+#include <iostream>
+#include <numeric>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::join;
+using af::randu;
+using af::seq;
+using af::sum;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Join : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Join : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int, intl, uintl, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Join, TestTypes);
+TYPED_TEST_SUITE(Join, TestTypes);
 
 template<typename T>
-void joinTest(string pTestFile, const unsigned dim, const unsigned in0, const unsigned in1, const unsigned resultIdx,
-        bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
-
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, int>(pTestFile,numDims,in,tests);
-
-    af::dim4 i0dims = numDims[in0];
-    af::dim4 i1dims = numDims[in1];
-
-    af_array in0Array = 0;
-    af_array in1Array = 0;
-    af_array outArray = 0;
+void joinTest(string pTestFile, const unsigned dim, const unsigned in0,
+              const unsigned in1, const unsigned resultIdx,
+              bool isSubRef = false, const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+
+    dim4 i0dims = numDims[in0];
+    dim4 i1dims = numDims[in1];
+
+    af_array in0Array  = 0;
+    af_array in1Array  = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[in0].front()), i0dims.ndims(), i0dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[in0].front()),
+                                       i0dims.ndims(), i0dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&in0Array, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&in0Array, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&in0Array, &(in[in0].front()), i0dims.ndims(), i0dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&in0Array, &(in[in0].front()),
+                                       i0dims.ndims(), i0dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[in1].front()), i1dims.ndims(), i1dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[in1].front()),
+                                       i1dims.ndims(), i1dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&in1Array, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&in1Array, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&in1Array, &(in[in1].front()), i1dims.ndims(), i1dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&in1Array, &(in[in1].front()),
+                                       i1dims.ndims(), i1dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_join(&outArray, dim, in0Array, in1Array));
-
-    // Get result
-    T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_join(&outArray, dim, in0Array, in1Array));
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
-    }
+    dim4 goldDims = i0dims;
+    goldDims[dim] = i0dims[dim] + i1dims[dim];
 
-    // Delete
-    delete[] outData;
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, outArray);
 
-    if(in0Array  != 0) af_release_array(in0Array);
-    if(in1Array  != 0) af_release_array(in1Array);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (in0Array != 0) af_release_array(in0Array);
+    if (in1Array != 0) af_release_array(in1Array);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define JOIN_INIT(desc, file, dim, in0, in1, resultIdx)                                     \
-    TYPED_TEST(Join, desc)                                                                  \
-    {                                                                                       \
-        joinTest<TypeParam>(string(TEST_DIR"/join/"#file".test"), dim, in0, in1, resultIdx);\
+#define JOIN_INIT(desc, file, dim, in0, in1, resultIdx)                        \
+    TYPED_TEST(Join, desc) {                                                   \
+        joinTest<TypeParam>(string(TEST_DIR "/join/" #file ".test"), dim, in0, \
+                            in1, resultIdx);                                   \
     }
 
-    JOIN_INIT(JoinBig0, join_big, 0, 0, 1, 0);
-    JOIN_INIT(JoinBig1, join_big, 1, 0, 2, 1);
-    JOIN_INIT(JoinBig2, join_big, 2, 0, 3, 2);
-
-    JOIN_INIT(JoinSmall0, join_small, 0, 0, 1, 0);
-    JOIN_INIT(JoinSmall1, join_small, 1, 0, 2, 1);
-    JOIN_INIT(JoinSmall2, join_small, 2, 0, 3, 2);
+JOIN_INIT(JoinBig0, join_big, 0, 0, 1, 0);
+JOIN_INIT(JoinBig1, join_big, 1, 0, 2, 1);
+JOIN_INIT(JoinBig2, join_big, 2, 0, 3, 2);
+
+JOIN_INIT(JoinSmall0, join_small, 0, 0, 1, 0);
+JOIN_INIT(JoinSmall1, join_small, 1, 0, 2, 1);
+JOIN_INIT(JoinSmall2, join_small, 2, 0, 3, 2);
+
+TEST(Join, JoinLargeDim) {
+    using af::constant;
+    using af::deviceGC;
+    using af::span;
+
+    // const int nx = 32;
+    const int nx = 1;
+    const int ny = 4 * 1024 * 1024;
+    const int nw = 4 * 1024 * 1024;
+
+    deviceGC();
+    {
+        array in         = randu(nx, ny, u8);
+        array joined     = join(0, in, in);
+        dim4 in_dims     = in.dims();
+        dim4 joined_dims = joined.dims();
+
+        ASSERT_EQ(2 * in_dims[0], joined_dims[0]);
+        ASSERT_EQ(0.f, sum<float>((joined(0, span) - joined(1, span)).as(f32)));
+
+        array in2 = constant(1, (dim_t)nx, (dim_t)ny, (dim_t)2, (dim_t)nw, u8);
+        joined    = join(3, in, in);
+        in_dims   = in.dims();
+        joined_dims = joined.dims();
+        ASSERT_EQ(2 * in_dims[3], joined_dims[3]);
+    }
+}
 
 ///////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Join, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(Join, CPP) {
     const unsigned resultIdx = 2;
-    const unsigned dim = 2;
+    const unsigned dim       = 2;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/join/join_big.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/join/join_big.test"),
+                                 numDims, in, tests);
 
-    af::dim4 i0dims = numDims[0];
-    af::dim4 i1dims = numDims[3];
+    dim4 i0dims = numDims[0];
+    dim4 i1dims = numDims[3];
 
-    af::array input0(i0dims, &(in[0].front()));
-    af::array input1(i1dims, &(in[3].front()));
+    array input0(i0dims, &(in[0].front()));
+    array input1(i1dims, &(in[3].front()));
 
-    af::array output = af::join(dim, input0, input1);
+    array output = join(dim, input0, input1);
 
-    // Get result
-    float* outData = new float[tests[resultIdx].size()];
-    output.host((void*)outData);
+    dim4 goldDims = i0dims;
+    goldDims[dim] = i0dims[dim] + i1dims[dim];
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
-    }
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, output);
+}
 
-    // Delete
-    delete[] outData;
+TEST(JoinMany0, CPP) {
+    array a0 = randu(10, 5);
+    array a1 = randu(20, 5);
+    array a2 = randu(5, 5);
+
+    array output = join(0, a0, a1, a2);
+    array gold   = join(0, a0, join(0, a1, a2));
+
+    ASSERT_EQ(sum<float>(output - gold), 0);
+}
+
+TEST(JoinMany1, CPP) {
+    array a0 = randu(20, 200);
+    array a1 = randu(20, 400);
+    array a2 = randu(20, 10);
+    array a3 = randu(20, 100);
+
+    int dim      = 1;
+    array output = join(dim, a0, a1, a2, a3);
+    array gold   = join(dim, a0, join(dim, a1, join(dim, a2, a3)));
+    ASSERT_EQ(sum<float>(output - gold), 0);
 }
 
-TEST(JoinMany0, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(Join, DifferentSizes) {
+    array a = seq(10);
+    array b = seq(11);
+    array c = seq(12);
 
-    af::array a0 = af::randu(10, 5);
-    af::array a1 = af::randu(20, 5);
-    af::array a2 = af::randu(5, 5);
+    array d = join(0, a, b, c);
 
-    af::array output = af::join(0, a0, a1, a2);
-    af::array gold = af::join(0, a0, af::join(0, a1, a2));
+    vector<float> ha(10);
+    vector<float> hb(11);
+    vector<float> hc(12);
 
+    for (size_t i = 0; i < ha.size(); i++) { ha[i] = i; }
+    for (size_t i = 0; i < hb.size(); i++) { hb[i] = i; }
+    for (size_t i = 0; i < hc.size(); i++) { hc[i] = i; }
+    vector<float> hgold(10 + 11 + 12);
+    vector<float>::iterator it = copy(ha.begin(), ha.end(), hgold.begin());
+    it                         = copy(hb.begin(), hb.end(), it);
+    it                         = copy(hc.begin(), hc.end(), it);
 
-    ASSERT_EQ(af::sum<float>(output - gold), 0);
+    ASSERT_VEC_ARRAY_EQ(hgold, dim4(10 + 11 + 12), d);
 }
 
-TEST(JoinMany1, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(Join, SameSize) {
+    array a = seq(10);
+    array b = seq(10);
+    array c = seq(10);
+
+    array d = join(0, a, b, c);
 
-    af::array a0 = af::randu(20, 200);
-    af::array a1 = af::randu(20, 400);
-    af::array a2 = af::randu(20, 10);
-    af::array a3 = af::randu(20, 100);
+    vector<float> ha(10);
+    vector<float> hb(10);
+    vector<float> hc(10);
 
-    int dim = 1;
-    af::array output = af::join(dim, a0, a1, a2, a3);
-    af::array gold = af::join(dim, a0, af::join(dim, a1, af::join(dim, a2, a3)));
+    for (size_t i = 0; i < ha.size(); i++) { ha[i] = i; }
+    for (size_t i = 0; i < hb.size(); i++) { hb[i] = i; }
+    for (size_t i = 0; i < hc.size(); i++) { hc[i] = i; }
+    vector<float> hgold(10 + 10 + 10);
+    vector<float>::iterator it = copy(ha.begin(), ha.end(), hgold.begin());
+    it                         = copy(hb.begin(), hb.end(), it);
+    it                         = copy(hc.begin(), hc.end(), it);
 
-    ASSERT_EQ(af::sum<float>(output - gold), 0);
+    ASSERT_VEC_ARRAY_EQ(hgold, dim4(10 + 10 + 10), d);
 }
+
+TEST(Join, ManyEmpty) {
+    array gold = af::constant(0, 15, 5);
+    array a    = af::randn(5, 5);
+    array e;
+    array c  = af::randn(10, 5);
+    array ee = af::join(0, e, e);
+    ASSERT_EQ(ee.elements(), 0);
+    array eee = af::join(0, e, e, e);
+    ASSERT_EQ(eee.elements(), 0);
+
+    array eeac                     = af::join(0, e, e, a, c);
+    array eace                     = af::join(0, e, a, c, e);
+    array acee                     = af::join(0, a, c, e, e);
+    gold(af::seq(0, 4), af::span)  = a;
+    gold(af::seq(5, 14), af::span) = c;
+    ASSERT_ARRAYS_EQ(gold, eeac);
+    ASSERT_ARRAYS_EQ(gold, eace);
+    ASSERT_ARRAYS_EQ(gold, acee);
+}
+
+TEST(Join, respect_parameters_order_ISSUE3511) {
+    const float column_host1[] = {1., 2., 3.};
+    const float column_host2[] = {4., 5., 6.};
+    const af::array buf1(3, 1, column_host1);
+    const af::array buf2(3, 1, column_host2);
+
+    // JIT arrays must stay unevaluated until the join call itself, so every
+    // combination below works on its own single-use copy
+    const af::array jit1{buf1 + 1.0};
+    const af::array jit2{buf2 + 2.0};
+    const std::array<af::array, 8> cases{jit1,  -jit1,       jit1 + 1.0, jit2,
+                                         -jit2, jit1 + jit2, buf1,       buf2};
+    const std::array<const char*, 8> cases_name{"JIT1", "-JIT1", "JIT1+1.0",
+                                                "JIT2", "-JIT2", "JIT1+JIT2",
+                                                "BUF1", "BUF2"};
+    assert(cases.size() == cases_name.size());
+    for (size_t cl0{0}; cl0 < cases.size(); ++cl0) {
+        for (size_t cl1{0}; cl1 < cases.size(); ++cl1) {
+            printf("Testing: af::join(1,%s,%s)\n", cases_name[cl0],
+                   cases_name[cl1]);
+            const array col0{cases[cl0]};
+            const array col1{cases[cl1]};
+            const array result{af::join(1, col0, col1)};
+            ASSERT_ARRAYS_EQ(result(af::span, 0), col0);
+            ASSERT_ARRAYS_EQ(result(af::span, 1), col1);
+        }
+    }
+    // Join of 3 arrays
+    for (size_t cl0{0}; cl0 < cases.size(); ++cl0) {
+        for (size_t cl1{0}; cl1 < cases.size(); ++cl1) {
+            for (size_t cl2{0}; cl2 < cases.size(); ++cl2) {
+                printf("Testing: af::join(1,%s,%s,%s)\n", cases_name[cl0],
+                       cases_name[cl1], cases_name[cl2]);
+                const array col0{cases[cl0]};
+                const array col1{cases[cl1]};
+                const array col2{cases[cl2]};
+                const array result{af::join(1, col0, col1, col2)};
+                ASSERT_ARRAYS_EQ(result(af::span, 0), col0);
+                ASSERT_ARRAYS_EQ(result(af::span, 1), col1);
+                ASSERT_ARRAYS_EQ(result(af::span, 2), col2);
+            }
+        }
+    }
+}
+
+#define TEST_TEMP_FORMAT(form, d)                                           \
+    TEST(TEMP_FORMAT, form##_dim##d) {                                      \
+        const dim4 dims(2, 2, 2, 2);                                        \
+        const array a(randu(dims));                                         \
+        const array b(randu(dims));                                         \
+                                                                            \
+        array out  = join(d, toTempFormat(form, a), toTempFormat(form, b)); \
+        array gold = join(d, a, b);                                         \
+        EXPECT_ARRAYS_EQ(gold, out);                                        \
+    }
+
+#define TEST_TEMP_FORMATS(form) \
+    TEST_TEMP_FORMAT(form, 0)   \
+    TEST_TEMP_FORMAT(form, 1)   \
+    TEST_TEMP_FORMAT(form, 2)   \
+    TEST_TEMP_FORMAT(form, 3)
+
+FOREACH_TEMP_FORMAT(TEST_TEMP_FORMATS)
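
Note on the join tests above: the expected shape follows `goldDims[dim] = i0dims[dim] + i1dims[dim]`, and because the data is column-major the element order of the result depends on the join dimension. The standalone sketch below illustrates this for two 2-D inputs only. `Mat` and `join2d` are hypothetical names introduced for illustration and are not part of ArrayFire.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-major 2-D matrix; element (r, c) is data[c * rows + r].
struct Mat {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;
};

// Joins a and b along dim 0 (stack rows) or dim 1 (append columns).
Mat join2d(int dim, const Mat& a, const Mat& b) {
    assert(dim == 0 || dim == 1);
    // Every dimension except the join dimension has to match.
    assert(dim == 0 ? a.cols == b.cols : a.rows == b.rows);

    Mat out;
    out.rows = (dim == 0) ? a.rows + b.rows : a.rows;
    out.cols = (dim == 1) ? a.cols + b.cols : a.cols;
    out.data.reserve(out.rows * out.cols);

    if (dim == 1) {
        // Whole columns are contiguous, so the two buffers simply concatenate.
        out.data.insert(out.data.end(), a.data.begin(), a.data.end());
        out.data.insert(out.data.end(), b.data.begin(), b.data.end());
    } else {
        // Interleave column by column: a's column c, then b's column c.
        for (std::size_t c = 0; c < out.cols; ++c) {
            out.data.insert(out.data.end(), a.data.begin() + c * a.rows,
                            a.data.begin() + (c + 1) * a.rows);
            out.data.insert(out.data.end(), b.data.begin() + c * b.rows,
                            b.data.begin() + (c + 1) * b.rows);
        }
    }
    return out;
}
```

Joining along dim 1 is a plain buffer concatenation, which is why the 1-D gold vectors in `Join.DifferentSizes` and `Join.SameSize` can be built with chained `copy` calls; joining along dim 0 of a multi-column input needs the per-column interleave.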
diff --git a/test/lu_dense.cpp b/test/lu_dense.cpp
index 7399759931..35c925ab57 100644
--- a/test/lu_dense.cpp
+++ b/test/lu_dense.cpp
@@ -7,43 +7,52 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
+// NOTE: Tests are known to fail on OSX when using the CPU and OpenCL backends
+// for sizes larger than 128x128. For details, see
+// https://github.com/arrayfire/arrayfire/issues/1617
+
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::count;
+using af::dim4;
+using af::dtype_traits;
+using af::max;
+using af::seq;
+using af::span;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
-///////////////////////////////// CPP ////////////////////////////////////
-//
-TEST(LU, InPlaceSmall)
-{
-    if (noDoubleTests<float>()) return;
+TEST(LU, InPlaceSmall) {
+    LAPACK_ENABLED_CHECK();
 
     int resultIdx = 0;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, float>(string(TEST_DIR"/lapack/lu.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/lapack/lu.test"), numDims,
+                                   in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
-    af::array output, pivot;
-    af::lu(output, pivot, input);
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array output, pivot;
+    lu(output, pivot, input);
 
-    af::dim4 odims = output.dims();
+    dim4 odims = output.dims();
 
     // Get result
     float* outData = new float[tests[resultIdx].size()];
@@ -53,9 +62,10 @@ TEST(LU, InPlaceSmall)
     for (int y = 0; y < (int)odims[1]; ++y) {
         for (int x = 0; x < (int)odims[0]; ++x) {
             // Check only upper triangle
-            if(x <= y) {
-            int elIter = y * odims[0] + x;
-            ASSERT_NEAR(tests[resultIdx][elIter], outData[elIter], 0.001) << "at: " << elIter << std::endl;
+            if (x <= y) {
+                int elIter = y * odims[0] + x;
+                ASSERT_NEAR(tests[resultIdx][elIter], outData[elIter], 0.001)
+                    << "at: " << elIter << endl;
             }
         }
     }
@@ -64,24 +74,24 @@ TEST(LU, InPlaceSmall)
     delete[] outData;
 }
 
-TEST(LU, SplitSmall)
-{
-    if (noDoubleTests<float>()) return;
+TEST(LU, SplitSmall) {
+    LAPACK_ENABLED_CHECK();
 
     int resultIdx = 0;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, float>(string(TEST_DIR"/lapack/lufactorized.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/lapack/lufactorized.test"),
+                                   numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
-    af::array l, u, pivot;
-    af::lu(l, u, pivot, input);
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array l, u, pivot;
+    lu(l, u, pivot, input);
 
-    af::dim4 ldims = l.dims();
-    af::dim4 udims = u.dims();
+    dim4 ldims = l.dims();
+    dim4 udims = u.dims();
 
     // Get result
     float* lData = new float[ldims.elements()];
@@ -92,9 +102,10 @@ TEST(LU, SplitSmall)
     // Compare result
     for (int y = 0; y < (int)ldims[1]; ++y) {
         for (int x = 0; x < (int)ldims[0]; ++x) {
-            if(x < y) {
+            if (x < y) {
                 int elIter = y * ldims[0] + x;
-                ASSERT_NEAR(tests[resultIdx][elIter], lData[elIter], 0.001) << "at: " << elIter << std::endl;
+                ASSERT_NEAR(tests[resultIdx][elIter], lData[elIter], 0.001)
+                    << "at: " << elIter << endl;
             }
         }
     }
@@ -104,7 +115,8 @@ TEST(LU, SplitSmall)
     for (int y = 0; y < (int)udims[1]; ++y) {
         for (int x = 0; x < (int)udims[0]; ++x) {
             int elIter = y * (int)udims[0] + x;
-            ASSERT_NEAR(tests[resultIdx][elIter], uData[elIter], 0.001) << "at: " << elIter << std::endl;
+            ASSERT_NEAR(tests[resultIdx][elIter], uData[elIter], 0.001)
+                << "at: " << elIter << endl;
         }
     }
 
@@ -114,81 +126,155 @@ TEST(LU, SplitSmall)
 }
 
 template<typename T>
-void luTester(const int m, const int n, double eps)
-{
-    if (noDoubleTests<T>()) return;
+void luTester(const int m, const int n, double eps) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
 
 #if 1
-    af::array a_orig = cpu_randu<T>(af::dim4(m, n));
+    array a_orig = cpu_randu<T>(dim4(m, n));
 #else
-    af::array a_orig = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    array a_orig = randu(m, n, (dtype)dtype_traits<T>::af_type);
 #endif
 
-
     //! [ex_lu_unpacked]
-    af::array l, u, pivot;
-    af::lu(l, u, pivot, a_orig);
+    array l, u, pivot;
+    lu(l, u, pivot, a_orig);
     //! [ex_lu_unpacked]
 
     //! [ex_lu_recon]
-    af::array a_recon = af::matmul(l, u);
-    af::array a_perm = a_orig(pivot, af::span);
+    array a_recon = matmul(l, u);
+    array a_perm  = a_orig(pivot, span);
     //! [ex_lu_recon]
 
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(a_recon - a_perm))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(a_recon - a_perm))), eps);
+    ASSERT_NEAR(
+        0,
+        max<typename dtype_traits<T>::base_type>(abs(real(a_recon - a_perm))),
+        eps);
+    ASSERT_NEAR(
+        0,
+        max<typename dtype_traits<T>::base_type>(abs(imag(a_recon - a_perm))),
+        eps);
 
     //! [ex_lu_packed]
-    af::array out = a_orig.copy();
-    af::array pivot2;
-    af::luInPlace(pivot2, out, false);
+    array out = a_orig.copy();
+    array pivot2;
+    luInPlace(pivot2, out, false);
     //! [ex_lu_packed]
 
     //! [ex_lu_extract]
-    af::array l2 = lower(out,  true);
-    af::array u2 = upper(out, false);
+    array l2 = lower(out, true);
+    array u2 = upper(out, false);
     //! [ex_lu_extract]
 
-    ASSERT_EQ(af::count<uint>(pivot == pivot2), pivot.elements());
+    ASSERT_EQ(count<uint>(pivot == pivot2), pivot.elements());
 
     int mn = std::min(m, n);
-    l2 = l2(af::span, af::seq(mn));
-    u2 = u2(af::seq(mn), af::span);
-
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(l2 - l))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(l2 - l))), eps);
-
-    ASSERT_NEAR(0, af::max<double>(af::abs(real(u2 - u))), eps);
-    ASSERT_NEAR(0, af::max<double>(af::abs(imag(u2 - u))), eps);
-}
-
-#define LU_BIG_TESTS(T, eps)                    \
-    TEST(LU, T##BigSquare)                      \
-    {                                           \
-        luTester<T>(500, 500, eps);             \
-    }                                           \
-    TEST(LU, T##BigRect0)                       \
-    {                                           \
-        luTester<T>(500, 1000, eps);            \
-    }                                           \
-    TEST(LU, T##BigRect1)                       \
-    {                                           \
-        luTester<T>(1000, 500, eps);            \
-    }                                           \
-    TEST(LU, T##BigSquareMultiple)              \
-    {                                           \
-        luTester<T>(512, 512, eps);             \
-    }                                           \
-    TEST(LU, T##BigRect0Multiple)               \
-    {                                           \
-        luTester<T>(512, 1024, eps);            \
-    }                                           \
-    TEST(LU, T##BigRect1Multiple)               \
-    {                                           \
-        luTester<T>(1024, 512, eps);            \
-    }                                           \
-
-LU_BIG_TESTS(float, 1E-3)
-LU_BIG_TESTS(double, 1E-8)
-LU_BIG_TESTS(cfloat, 1E-3)
-LU_BIG_TESTS(cdouble, 1E-8)
+    l2     = l2(span, seq(mn));
+    u2     = u2(seq(mn), span);
+
+    array a_recon2 = matmul(l2, u2);
+    array a_perm2  = a_orig(pivot2, span);
+
+    ASSERT_NEAR(
+        0,
+        max<typename dtype_traits<T>::base_type>(abs(real(a_recon2 - a_perm2))),
+        eps);
+    ASSERT_NEAR(
+        0,
+        max<typename dtype_traits<T>::base_type>(abs(imag(a_recon2 - a_perm2))),
+        eps);
+}
+
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 1E-3;
+}
+
+template<>
+double eps<double>() {
+    return 1e-8;
+}
+
+template<>
+double eps<cfloat>() {
+    return 1E-3;
+}
+
+template<>
+double eps<cdouble>() {
+    return 1e-8;
+}
+
+template<typename T>
+class LU : public ::testing::Test {};
+
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(LU, TestTypes);
+
+TYPED_TEST(LU, SquareLarge) { luTester<TypeParam>(500, 500, eps<TypeParam>()); }
+
+TYPED_TEST(LU, SquareMultipleOfTwoLarge) {
+    luTester<TypeParam>(512, 512, eps<TypeParam>());
+}
+
+TYPED_TEST(LU, RectangularLarge0) {
+    luTester<TypeParam>(1000, 500, eps<TypeParam>());
+}
+
+TYPED_TEST(LU, RectangularMultipleOfTwoLarge0) {
+    luTester<TypeParam>(1024, 512, eps<TypeParam>());
+}
+
+TYPED_TEST(LU, RectangularLarge1) {
+    luTester<TypeParam>(500, 1000, eps<TypeParam>());
+}
+
+TYPED_TEST(LU, RectangularMultipleOfTwoLarge1) {
+    luTester<TypeParam>(512, 1024, eps<TypeParam>());
+}
+
+TEST(LU, NullLowerOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    ASSERT_SUCCESS(af_randu(&in, dims.ndims(), dims.get(), f32));
+
+    af_array upper, pivot;
+    ASSERT_EQ(AF_ERR_ARG, af_lu(NULL, &upper, &pivot, in));
+    ASSERT_SUCCESS(af_release_array(in));
+}
+
+TEST(LU, NullUpperOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    ASSERT_SUCCESS(af_randu(&in, dims.ndims(), dims.get(), f32));
+
+    af_array lower, pivot;
+    ASSERT_EQ(AF_ERR_ARG, af_lu(&lower, NULL, &pivot, in));
+    ASSERT_SUCCESS(af_release_array(in));
+}
+
+TEST(LU, NullPivotOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    ASSERT_SUCCESS(af_randu(&in, dims.ndims(), dims.get(), f32));
+
+    af_array lower, upper;
+    ASSERT_EQ(AF_ERR_ARG, af_lu(&lower, &upper, NULL, in));
+    ASSERT_SUCCESS(af_release_array(in));
+}
+
+TEST(LU, InPlaceNullOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    ASSERT_SUCCESS(af_randu(&in, dims.ndims(), dims.get(), f32));
+
+    ASSERT_EQ(AF_ERR_ARG, af_lu_inplace(NULL, in, true));
+    ASSERT_SUCCESS(af_release_array(in));
+}
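
Note on `luTester` above: both the unpacked and the packed paths are validated by reconstruction, that is, by checking that `matmul(l, u)` matches the input with its rows permuted by `pivot`. The standalone sketch below shows the same check in plain C++ on a small square matrix. `luPacked` and `reconError` are hypothetical helpers written for this note, the matrix is stored row-major for brevity, and singular or rectangular inputs are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major, square

// Packed LU with partial pivoting. On return `a` holds U in its upper
// triangle and the strictly lower part of L (unit diagonal implied).
// pivot[i] is the original row index of row i of the factored matrix.
std::vector<std::size_t> luPacked(Matrix& a) {
    const std::size_t n = a.size();
    std::vector<std::size_t> pivot(n);
    for (std::size_t i = 0; i < n; ++i) pivot[i] = i;

    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row with the largest magnitude in column k.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i) {
            if (std::fabs(a[i][k]) > std::fabs(a[p][k])) p = i;
        }
        assert(a[p][k] != 0.0);  // sketch only: singular input not handled
        std::swap(a[k], a[p]);
        std::swap(pivot[k], pivot[p]);

        // Eliminate below the diagonal, storing the multipliers in place.
        for (std::size_t i = k + 1; i < n; ++i) {
            a[i][k] /= a[k][k];
            for (std::size_t j = k + 1; j < n; ++j) {
                a[i][j] -= a[i][k] * a[k][j];
            }
        }
    }
    return pivot;
}

// Largest |(L * U)(i, j) - orig(pivot[i], j)| over all entries.
double reconError(const Matrix& orig, const Matrix& packed,
                  const std::vector<std::size_t>& pivot) {
    const std::size_t n = orig.size();
    double worst = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k) {
                const double l = (k < i) ? packed[i][k] : (k == i ? 1.0 : 0.0);
                const double u = (k <= j) ? packed[k][j] : 0.0;
                sum += l * u;
            }
            worst = std::max(worst, std::fabs(sum - orig[pivot[i]][j]));
        }
    }
    return worst;
}
```

Comparing the reconstruction against the permuted input, rather than comparing factors element by element, keeps the check independent of how a given backend stores or orders the factors.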
diff --git a/test/manual_memory_test.cpp b/test/manual_memory_test.cpp
new file mode 100644
index 0000000000..35e66bcde5
--- /dev/null
+++ b/test/manual_memory_test.cpp
@@ -0,0 +1,41 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <iostream>
+
+TEST(Memory, recover) {
+    cleanSlate();  // Clean up everything done so far
+
+    try {
+        array vec[1000];
+
+        // Try to allocate 1 terabyte of memory to exhaust the memory manager;
+        // the allocation is expected to fail with AF_ERR_NO_MEM
+        for (int i = 0; i < 1000; i++) {
+            vec[i] = randu(1024, 1024, 256);  // Allocating 1GB
+        }
+
+        FAIL();
+    } catch (exception &ae) {
+        ASSERT_EQ(ae.err(), AF_ERR_NO_MEM);
+
+        const int num   = 1000 * 1000;
+        const float val = 1.0;
+
+        array a    = constant(val, num);  // This should work as expected
+        float *h_a = a.host<float>();
+        for (int i = 0; i < 1000 * 1000; i++) { ASSERT_EQ(h_a[i], val); }
+        freeHost(h_a);
+    }
+}
diff --git a/test/match_template.cpp b/test/match_template.cpp
index 083bdca217..f5f6eb4fc7 100644
--- a/test/match_template.cpp
+++ b/test/match_template.cpp
@@ -7,126 +7,140 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using af::exception;
+using std::cout;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class MatchTemplate : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class MatchTemplate : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(MatchTemplate, TestTypes);
+TYPED_TEST_SUITE(MatchTemplate, TestTypes);
 
 template<typename T>
-void matchTemplateTest(string pTestFile, af_match_type pMatchType)
-{
-    typedef typename cond_type<is_same_type<T, double>::value, double, float>::type outType;
-    if (noDoubleTests<T>()) return;
+void matchTemplateTest(string pTestFile, af_match_type pMatchType) {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            outType;
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4>  numDims;
-    vector<vector<T> >      in;
-    vector<vector<outType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<outType>> tests;
 
     readTests<T, outType, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 sDims    = numDims[0];
-    af::dim4 tDims    = numDims[1];
+    dim4 sDims        = numDims[0];
+    dim4 tDims        = numDims[1];
     af_array outArray = 0;
     af_array sArray   = 0;
     af_array tArray   = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&sArray, &(in[0].front()),
-                sDims.ndims(), sDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&sArray, &(in[0].front()), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&tArray, &(in[1].front()),
-                tDims.ndims(), tDims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&tArray, &(in[1].front()), tDims.ndims(),
+                                   tDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_match_template(&outArray, sArray, tArray, pMatchType));
+    ASSERT_SUCCESS(af_match_template(&outArray, sArray, tArray, pMatchType));
 
-    outType *outData = new outType[sDims.elements()];
+    vector<outType> outData(sDims.elements());
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData.data(), outArray));
 
     vector<outType> currGoldBar = tests[0];
-    size_t nElems        = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)<< "at: " << elIter<< std::endl;
+    size_t nElems               = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)
+            << "at: " << elIter << endl;
     }
 
     // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(sArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(tArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(sArray));
+    ASSERT_SUCCESS(af_release_array(tArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(MatchTemplate, Matrix_SAD)
-{
-    matchTemplateTest<TypeParam>(string(TEST_DIR"/MatchTemplate/matrix_sad.test"), AF_SAD);
+TYPED_TEST(MatchTemplate, Matrix_SAD) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    matchTemplateTest<TypeParam>(
+        string(TEST_DIR "/MatchTemplate/matrix_sad.test"), AF_SAD);
 }
 
-TYPED_TEST(MatchTemplate, Matrix_SSD)
-{
-    matchTemplateTest<TypeParam>(string(TEST_DIR"/MatchTemplate/matrix_ssd.test"), AF_SSD);
+TYPED_TEST(MatchTemplate, Matrix_SSD) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    matchTemplateTest<TypeParam>(
+        string(TEST_DIR "/MatchTemplate/matrix_ssd.test"), AF_SSD);
 }
 
-TYPED_TEST(MatchTemplate, MatrixBatch_SAD)
-{
-    matchTemplateTest<TypeParam>(string(TEST_DIR"/MatchTemplate/matrix_sad_batch.test"), AF_SAD);
+TYPED_TEST(MatchTemplate, MatrixBatch_SAD) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    matchTemplateTest<TypeParam>(
+        string(TEST_DIR "/MatchTemplate/matrix_sad_batch.test"), AF_SAD);
 }
 
-TEST(MatchTemplate, InvalidMatchType)
-{
-    af_array inArray   = 0;
+TEST(MatchTemplate, InvalidMatchType) {
+    af_array inArray  = 0;
     af_array tArray   = 0;
-    af_array outArray  = 0;
+    af_array outArray = 0;
 
-    vector<float>   in(100, 1);
+    vector<float> in(100, 1);
 
-    af::dim4 sDims(10, 10, 1, 1);
-    af::dim4 tDims(4, 4, 1, 1);
+    dim4 sDims(10, 10, 1, 1);
+    dim4 tDims(4, 4, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                sDims.ndims(), sDims.get(), (af_dtype) af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), sDims.ndims(),
+                                   sDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&tArray, &in.front(),
-                tDims.ndims(), tDims.get(), (af_dtype) af::dtype_traits<float>::af_type));
+    ASSERT_SUCCESS(af_create_array(&tArray, &in.front(), tDims.ndims(),
+                                   tDims.get(),
+                                   (af_dtype)dtype_traits<float>::af_type));
 
-    ASSERT_EQ(AF_ERR_ARG, af_match_template(&outArray, inArray, tArray, (af_match_type)-1));
+    ASSERT_EQ(AF_ERR_ARG,
+              af_match_template(&outArray, inArray, tArray, (af_match_type)-1));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(tArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(tArray));
 }
 
 ///////////////////////////////// CPP TESTS /////////////////////////////
 //
-TEST(MatchTemplate, CPP)
-{
-    vector<float>   in(100, 1);
+TEST(MatchTemplate, CPP) {
+    vector<float> in(100, 1);
 
-    af::dim4 sDims(10, 10, 1, 1);
-    af::dim4 tDims(4, 4, 1, 1);
+    dim4 sDims(10, 10, 1, 1);
+    dim4 tDims(4, 4, 1, 1);
 
     try {
-        af::array input(sDims, &in.front());
-        af::array tmplt(tDims, &in.front());
+        array input(sDims, &in.front());
+        array tmplt(tDims, &in.front());
 
-        af::array out = matchTemplate(input, tmplt, (af_match_type)-1);
-    } catch(af::exception &e) {
-        std::cout<<"Invalid Match test: "<<e.what()<<std::endl;
+        array out = matchTemplate(input, tmplt, (af_match_type)-1);
+    } catch (exception &e) {
+        cout << "Invalid Match test: " << e.what() << endl;
     }
 }
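
Note on the `AF_SAD` cases above: SAD scores a placement of the template by summing the absolute differences between template pixels and the image pixels underneath. The standalone sketch below computes that score for every fully overlapping placement. `sadMap` is a hypothetical helper, it uses row-major storage, and it does not model ArrayFire's output shape (the test reads back `sDims.elements()` values, so the library output has the search image's size) or its border handling.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of absolute differences for each fully overlapping template placement.
// img is ih x iw and tpl is th x tw, both row-major. The result is
// (ih - th + 1) x (iw - tw + 1), row-major; lower scores mean better matches.
std::vector<float> sadMap(const std::vector<float>& img, std::size_t ih,
                          std::size_t iw, const std::vector<float>& tpl,
                          std::size_t th, std::size_t tw) {
    assert(img.size() == ih * iw && tpl.size() == th * tw);
    assert(th >= 1 && tw >= 1 && th <= ih && tw <= iw);

    const std::size_t oh = ih - th + 1;
    const std::size_t ow = iw - tw + 1;
    std::vector<float> out(oh * ow, 0.0f);

    for (std::size_t y = 0; y < oh; ++y) {
        for (std::size_t x = 0; x < ow; ++x) {
            float acc = 0.0f;
            for (std::size_t ty = 0; ty < th; ++ty) {
                for (std::size_t tx = 0; tx < tw; ++tx) {
                    acc += std::fabs(img[(y + ty) * iw + (x + tx)] -
                                     tpl[ty * tw + tx]);
                }
            }
            out[y * ow + x] = acc;
        }
    }
    return out;
}
```

A score of zero marks an exact match; replacing the absolute difference with a squared difference would give the SSD variant exercised by `Matrix_SSD`.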
diff --git a/test/math.cpp b/test/math.cpp
index e5dbe76194..ee42a11423 100644
--- a/test/math.cpp
+++ b/test/math.cpp
@@ -1,107 +1,192 @@
 /*******************************************************
- * Copyright (c) 2014, ArrayFire
+ * Copyright (c) 2025, ArrayFire
  * All rights reserved.
  *
  * This file is distributed under 3-clause BSD license.
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
-
 #include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/arith.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
-
-using namespace std;
-using namespace af;
-
-const int num = 10000;
-const float flt_err = 1e-3;
-const double dbl_err = 1e-10;
-
-#define MATH_TESTS_LIMITS(Ti, To, func, err, lo, hi)            \
-    TEST(MathTests, Test_##func##_##Ti)                         \
-    {                                                           \
-        if (noDoubleTests<Ti>()) return;                        \
-        af_dtype ty = (af_dtype)dtype_traits<Ti>::af_type;      \
-        af::array a = (hi - lo) * randu(num, ty) + lo + err;    \
-        af::eval(a);                                            \
-        af::array b = af::func(a);                              \
-        Ti *h_a = a.host<Ti>();                                 \
-        To *h_b = b.host<To>();                                 \
-                                                                \
-        for (int i = 0; i < num; i++)                           \
-            ASSERT_NEAR(h_b[i], func(h_a[i]), err) <<           \
-                "for value: " << h_a[i] << std::endl;           \
-        delete[] h_a;                                           \
-        delete[] h_b;                                           \
-    }                                                           \
-
-#define MATH_TESTS_FLOAT(func) MATH_TESTS_LIMITS(float, float, func, flt_err, 0.05f, 0.95f)
-#define MATH_TESTS_DOUBLE(func) MATH_TESTS_LIMITS(double, double, func, dbl_err, 0.05, 0.95)
-
-MATH_TESTS_FLOAT(sin)
-MATH_TESTS_FLOAT(cos)
-MATH_TESTS_FLOAT(tan)
-MATH_TESTS_FLOAT(asin)
-MATH_TESTS_FLOAT(acos)
-MATH_TESTS_FLOAT(atan)
-
-MATH_TESTS_FLOAT(sinh)
-MATH_TESTS_FLOAT(cosh)
-MATH_TESTS_FLOAT(tanh)
-
-
-MATH_TESTS_FLOAT(sqrt)
-
-MATH_TESTS_FLOAT(exp)
-MATH_TESTS_FLOAT(log)
-MATH_TESTS_FLOAT(log10)
-MATH_TESTS_FLOAT(log2)
-
-MATH_TESTS_LIMITS(float, float, abs, flt_err, -10, 10)
-MATH_TESTS_LIMITS(float, float, ceil, flt_err, -10, 10)
-MATH_TESTS_LIMITS(float, float, floor, flt_err, -10, 10)
-
-MATH_TESTS_DOUBLE(sin)
-MATH_TESTS_DOUBLE(cos)
-MATH_TESTS_DOUBLE(tan)
-MATH_TESTS_DOUBLE(asin)
-MATH_TESTS_DOUBLE(acos)
-MATH_TESTS_DOUBLE(atan)
-
-MATH_TESTS_DOUBLE(sinh)
-MATH_TESTS_DOUBLE(cosh)
-MATH_TESTS_DOUBLE(tanh)
-#if __cplusplus > 199711L
-MATH_TESTS_FLOAT(asinh)
-MATH_TESTS_FLOAT(atanh)
-MATH_TESTS_LIMITS(float, float, acosh, flt_err, 1, 5)
-MATH_TESTS_LIMITS(float, float, round, flt_err, -10, 10)
-MATH_TESTS_FLOAT(cbrt)
-MATH_TESTS_FLOAT(expm1)
-MATH_TESTS_FLOAT(log1p)
-MATH_TESTS_FLOAT(erf)
-MATH_TESTS_FLOAT(erfc)
-
-MATH_TESTS_DOUBLE(asinh)
-MATH_TESTS_DOUBLE(atanh)
-MATH_TESTS_LIMITS(double, double, acosh, dbl_err, 1, 5)
-MATH_TESTS_LIMITS(double, double, round, dbl_err, -10, 10)
-MATH_TESTS_DOUBLE(cbrt)
-MATH_TESTS_DOUBLE(expm1)
-MATH_TESTS_DOUBLE(erf)
-MATH_TESTS_DOUBLE(log1p)
-MATH_TESTS_DOUBLE(erfc)
+#include <af/device.h>
+#include <af/exception.h>
+#include <af/random.h>
+
+#include <complex>
+
+// This makes the macros cleaner
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using af::exception;
+using af::randu;
+using half_float::half;
+using std::abs;
+using std::endl;
+using std::vector;
+
+const int num        = 10000;
+const float hlf_err  = 1e-2;
+const float flt_err  = 1e-3;
+const double dbl_err = 1e-6;
+
+typedef std::complex<float> complex_float;
+typedef std::complex<double> complex_double;
+
+template<typename T>
+T sigmoid(T in) {
+    return T(1.0 / (1.0 + std::exp(-in)));
+}
+
+template<typename T>
+T rsqrt(T in) {
+    return T(1.0 / sqrt(in));
+}
+
+#define MATH_TEST(T, func, err, lo, hi)                                        \
+    TEST(Math, func##_##T) {                                                   \
+        try {                                                                  \
+            SUPPORTED_TYPE_CHECK(T);                                           \
+            af_dtype ty = (af_dtype)dtype_traits<T>::af_type;                  \
+            array a     = (hi - lo) * randu(num, ty) + lo + err;               \
+            a           = a.as(ty);                                            \
+            eval(a);                                                           \
+            array b = func(a);                                                 \
+            vector<T> h_a(a.elements());                                       \
+            a.host(&h_a[0]);                                                   \
+            for (size_t i = 0; i < h_a.size(); i++) { h_a[i] = func(h_a[i]); } \
+                                                                               \
+            ASSERT_VEC_ARRAY_NEAR(h_a, dim4(h_a.size()), b, err);              \
+        } catch (exception & ex) { FAIL() << ex.what(); }                      \
+    }
+
+#define MATH_TESTS_HALF(func) MATH_TEST(half, func, hlf_err, 0.05f, 0.95f)
+#define MATH_TESTS_FLOAT(func) MATH_TEST(float, func, flt_err, 0.05f, 0.95f)
+#define MATH_TESTS_DOUBLE(func) MATH_TEST(double, func, dbl_err, 0.05, 0.95)
+
+#define MATH_TESTS_CFLOAT(func) \
+    MATH_TEST(complex_float, func, flt_err, 0.05f, 0.95f)
+#define MATH_TESTS_CDOUBLE(func) \
+    MATH_TEST(complex_double, func, dbl_err, 0.05, 0.95)
+
+#define MATH_TESTS_REAL(func) \
+    MATH_TESTS_HALF(func)     \
+    MATH_TESTS_FLOAT(func)    \
+    MATH_TESTS_DOUBLE(func)
+
+#define MATH_TESTS_CPLX(func) \
+    MATH_TESTS_CFLOAT(func)   \
+    MATH_TESTS_CDOUBLE(func)
+
+#define MATH_TESTS_ALL(func) \
+    MATH_TESTS_REAL(func)    \
+    MATH_TESTS_CPLX(func)
+
+#define MATH_TESTS_LIMITS_REAL(func, lo, hi) \
+    MATH_TEST(half, func, hlf_err, lo, hi)   \
+    MATH_TEST(float, func, flt_err, lo, hi)  \
+    MATH_TEST(double, func, dbl_err, lo, hi)
+
+#define MATH_TESTS_LIMITS_CPLX(func, lo, hi)        \
+    MATH_TEST(complex_float, func, flt_err, lo, hi) \
+    MATH_TEST(complex_double, func, dbl_err, lo, hi)
+
+MATH_TESTS_ALL(sin)
+MATH_TESTS_ALL(cos)
+MATH_TESTS_ALL(tan)
+
+MATH_TESTS_REAL(asin)
+MATH_TESTS_REAL(acos)
+MATH_TESTS_REAL(atan)
+
+MATH_TESTS_ALL(sinh)
+MATH_TESTS_ALL(cosh)
+MATH_TESTS_ALL(tanh)
+
+MATH_TESTS_ALL(sqrt)
+MATH_TESTS_ALL(exp)
+MATH_TESTS_ALL(log)
+MATH_TESTS_REAL(log10)
+MATH_TESTS_REAL(log2)
+MATH_TESTS_REAL(rsqrt)
+
+MATH_TESTS_REAL(sigmoid)
+
+MATH_TESTS_LIMITS_REAL(abs, -10, 10)
+MATH_TESTS_LIMITS_REAL(ceil, -10, 10)
+MATH_TESTS_LIMITS_REAL(floor, -10, 10)
+
+#if __cplusplus > 199711L || _MSC_VER >= 1800
+MATH_TESTS_CPLX(asin)
+MATH_TESTS_CPLX(acos)
+MATH_TESTS_CPLX(atan)
+
+MATH_TESTS_ALL(asinh)
+MATH_TESTS_ALL(atanh)
+MATH_TESTS_LIMITS_REAL(acosh, 1, 5)
+MATH_TESTS_LIMITS_CPLX(acosh, 1, 5)
+MATH_TESTS_LIMITS_REAL(round, -10, 10)
+MATH_TESTS_REAL(cbrt)
+MATH_TESTS_REAL(expm1)
+MATH_TESTS_REAL(log1p)
+MATH_TESTS_REAL(erf)
+MATH_TESTS_REAL(erfc)
 #endif
 
-MATH_TESTS_DOUBLE(sqrt)
-
-MATH_TESTS_DOUBLE(exp)
-MATH_TESTS_DOUBLE(log)
-MATH_TESTS_DOUBLE(log10)
-MATH_TESTS_DOUBLE(log2)
-
-MATH_TESTS_LIMITS(double, double, abs, dbl_err, -10, 10)
-MATH_TESTS_LIMITS(double, double, ceil, dbl_err, -10, 10)
-MATH_TESTS_LIMITS(double, double, floor, dbl_err, -10, 10)
+TEST(Math, Not) {
+    array a  = randu(5, 5, b8);
+    array b  = !a;
+    char *ha = a.host<char>();
+    char *hb = b.host<char>();
+
+    for (int i = 0; i < a.elements(); i++) { ASSERT_EQ(ha[i] ^ hb[i], true); }
+
+    af_free_host(ha);
+    af_free_host(hb);
+}
+
+TEST(Math, Modulus) {
+    af::dim4 shape(2, 2);
+    std::vector<long long> aData{3, 3, 3, 3};
+    std::vector<long long> bData{2, 2, 2, 2};
+
+    auto a       = af::array(shape, aData.data(), afHost);
+    auto b       = af::array(shape, bData.data(), afHost);
+    auto rem     = a % b;
+    auto neg_rem = -a % b;
+
+    ASSERT_ARRAYS_EQ(af::constant(1, shape, s64), rem);
+    ASSERT_ARRAYS_EQ(af::constant(-1, shape, s64), neg_rem);
+}
+
+TEST(Math, ModulusFloat) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    af::dim4 shape(2, 2);
+
+    auto a     = af::constant(3, shape, af::dtype::f16);
+    auto b     = af::constant(2, shape, af::dtype::f16);
+    auto a32   = af::constant(3, shape, af::dtype::f32);
+    auto b32   = af::constant(2, shape, af::dtype::f32);
+    auto a64   = af::constant(3, shape, af::dtype::f64);
+    auto b64   = af::constant(2, shape, af::dtype::f64);
+
+    auto rem   = a % b;
+    auto rem32 = a32 % b32;
+    auto rem64 = a64 % b64;
+
+    auto neg_rem   = -a % b;
+    auto neg_rem32 = -a32 % b32;
+    auto neg_rem64 = -a64 % b64;
+
+    ASSERT_ARRAYS_EQ(af::constant(1, shape, af::dtype::f16), rem);
+    ASSERT_ARRAYS_EQ(af::constant(1, shape, af::dtype::f32), rem32);
+    ASSERT_ARRAYS_EQ(af::constant(1, shape, af::dtype::f64), rem64);
+
+    ASSERT_ARRAYS_EQ(af::constant(-1, shape, af::dtype::f16), neg_rem);
+    ASSERT_ARRAYS_EQ(af::constant(-1, shape, af::dtype::f32), neg_rem32);
+    ASSERT_ARRAYS_EQ(af::constant(-1, shape, af::dtype::f64), neg_rem64);
+
+    ASSERT_ARRAYS_EQ(rem32.as(f16), rem);
+}
diff --git a/test/matrix_manipulation.cpp b/test/matrix_manipulation.cpp
index ded90efe3a..d9c6d554bb 100644
--- a/test/matrix_manipulation.cpp
+++ b/test/matrix_manipulation.cpp
@@ -8,20 +8,23 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
 #include <vector>
 
-using namespace af;
-using namespace std;
+using af::array;
+using af::join;
+using af::randu;
+using af::tile;
+using std::vector;
 
-TEST(MatrixManipulation, SNIPPET_matrix_manipulation_tile)
-{
+TEST(MatrixManipulation, SNIPPET_matrix_manipulation_tile) {
     //! [ex_matrix_manipulation_tile]
-    float h[] = {1, 2, 3, 4};
-    array small_arr = array(2, 2, h); // 2x2 matrix
+    float h[]       = {1, 2, 3, 4};
+    array small_arr = array(2, 2, h);  // 2x2 matrix
     af_print(small_arr);
-    array large_arr = tile(small_arr, 2, 3);  // produces 4x6 matrix: (2*2)x(2*3)
+    array large_arr =
+        tile(small_arr, 2, 3);  // produces 4x6 matrix: (2*2)x(2*3)
     af_print(large_arr);
     //! [ex_matrix_manipulation_tile]
 
@@ -33,23 +36,22 @@ TEST(MatrixManipulation, SNIPPET_matrix_manipulation_tile)
 
     unsigned fdim = large_arr.dims(0);
     unsigned sdim = large_arr.dims(1);
-    for(unsigned i = 0; i < sdim; i++) {
-        for(unsigned j = 0; j < fdim; j++) {
-            ASSERT_FLOAT_EQ(h[(i%2) * 2 + (j%2)], h_large_arr[i * fdim + j] );
+    for (unsigned i = 0; i < sdim; i++) {
+        for (unsigned j = 0; j < fdim; j++) {
+            ASSERT_FLOAT_EQ(h[(i % 2) * 2 + (j % 2)],
+                            h_large_arr[i * fdim + j]);
         }
     }
 }
 
-TEST(MatrixManipulation, SNIPPET_matrix_manipulation_join)
-{
-
+TEST(MatrixManipulation, SNIPPET_matrix_manipulation_join) {
     //! [ex_matrix_manipulation_join]
-    float hA[] = { 1, 2, 3, 4, 5, 6 };
-    float hB[] = { 10, 20, 30, 40, 50, 60, 70, 80, 90 };
-    array A = array(3, 2, hA);
-    array B = array(3, 3, hB);
+    float hA[] = {1, 2, 3, 4, 5, 6};
+    float hB[] = {10, 20, 30, 40, 50, 60, 70, 80, 90};
+    array A    = array(3, 2, hA);
+    array B    = array(3, 3, hB);
 
-    af_print(join(1, A, B)); // 3x5 matrix
+    af_print(join(1, A, B));  // 3x5 matrix
     // array result = join(0, A, B); // fail: dimension mismatch
     //! [ex_matrix_manipulation_join]
 
@@ -63,21 +65,20 @@ TEST(MatrixManipulation, SNIPPET_matrix_manipulation_join)
 
     unsigned fdim = out.dims(0);
     unsigned sdim = out.dims(1);
-    for(unsigned i = 0; i < sdim; i++) {
-        for(unsigned j = 0; j < fdim; j++) {
-            if( i < 2 ) {
-                ASSERT_FLOAT_EQ(hA[i * fdim + j], h_out[i * fdim + j]) << "At [" << i << ", " << j << "]";
-            }
-            else {
-                ASSERT_FLOAT_EQ(hB[(i - 2) * fdim + j], h_out[i * fdim + j]) << "At [" << i << ", " << j << "]";
+    for (unsigned i = 0; i < sdim; i++) {
+        for (unsigned j = 0; j < fdim; j++) {
+            if (i < 2) {
+                ASSERT_FLOAT_EQ(hA[i * fdim + j], h_out[i * fdim + j])
+                    << "At [" << i << ", " << j << "]";
+            } else {
+                ASSERT_FLOAT_EQ(hB[(i - 2) * fdim + j], h_out[i * fdim + j])
+                    << "At [" << i << ", " << j << "]";
             }
         }
     }
-
 }
 
-TEST(MatrixManipulation, SNIPPET_matrix_manipulation_mesh)
-{
+TEST(MatrixManipulation, SNIPPET_matrix_manipulation_mesh) {
     //! [ex_matrix_manipulation_mesh]
     float hx[] = {1, 2, 3, 4};
     float hy[] = {5, 6};
@@ -102,27 +103,27 @@ TEST(MatrixManipulation, SNIPPET_matrix_manipulation_mesh)
     vector<float> houty(outy.elements());
     outy.host(&houty.front());
 
-    for(unsigned i = 0; i < houtx.size(); i++) ASSERT_EQ(hx[i%4], houtx[i]) << "At [" << i << "]";
-    for(unsigned i = 0; i < houty.size(); i++) ASSERT_EQ(hy[i>3], houty[i]) << "At [" << i << "]";
+    for (unsigned i = 0; i < houtx.size(); i++)
+        ASSERT_EQ(hx[i % 4], houtx[i]) << "At [" << i << "]";
+    for (unsigned i = 0; i < houty.size(); i++)
+        ASSERT_EQ(hy[i > 3], houty[i]) << "At [" << i << "]";
 }
 
-TEST(MatrixManipulation, SNIPPET_matrix_manipulation_moddims)
-{
+TEST(MatrixManipulation, SNIPPET_matrix_manipulation_moddims) {
     //! [ex_matrix_manipulation_moddims]
     int hA[] = {1, 2, 3, 4, 5, 6};
-    array A = array(3, 2, hA);
+    array A  = array(3, 2, hA);
 
-    af_print(A); // 2x3 matrix
-    af_print(moddims(A, 2, 3)); // 2x3 matrix
-    af_print(moddims(A, 6, 1)); // 6x1 column vector
+    af_print(A);                 // 3x2 matrix
+    af_print(moddims(A, 2, 3));  // 2x3 matrix
+    af_print(moddims(A, 6, 1));  // 6x1 column vector
 
     // moddims(A, 2, 2); // fail: wrong number of elements
     // moddims(A, 8, 8); // fail: wrong number of elements
     //! [ex_matrix_manipulation_moddims]
 }
 
-TEST(MatrixManipulation, SNIPPET_matrix_manipulation_transpose)
-{
+TEST(MatrixManipulation, SNIPPET_matrix_manipulation_transpose) {
     //! [ex_matrix_manipulation_transpose]
     array x = randu(2, 2, f32);
     af_print(x.T());  // transpose (real)
diff --git a/test/matrixmarket.cpp b/test/matrixmarket.cpp
new file mode 100644
index 0000000000..700b604d50
--- /dev/null
+++ b/test/matrixmarket.cpp
@@ -0,0 +1,29 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+TEST(Sparse, ReadRealMTXFile) {
+    af::array out;
+    std::string file(MTX_TEST_DIR "HB/bcsstm02/bcsstm02.mtx");
+    ASSERT_TRUE(mtxReadSparseMatrix(out, file.c_str()));
+}
+
+TEST(Sparse, ReadComplexMTXFile) {
+    af::array out;
+    std::string file(MTX_TEST_DIR "HB/young4c/young4c.mtx");
+    ASSERT_TRUE(mtxReadSparseMatrix(out, file.c_str()));
+}
+
+TEST(Sparse, FailIntegerMTXRead) {
+    af::array out;
+    std::string file(MTX_TEST_DIR "JGD_Kocay/Trec4/Trec4.mtx");
+    ASSERT_FALSE(mtxReadSparseMatrix(out, file.c_str()));
+}
diff --git a/test/mean.cpp b/test/mean.cpp
index 2e89c75651..79dd76db2d 100644
--- a/test/mean.cpp
+++ b/test/mean.cpp
@@ -7,147 +7,173 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <algorithm>
+#include <ctime>
+#include <iostream>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::string;
-using std::vector;
+using af::array;
 using af::cdouble;
 using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::randu;
+using half_float::half;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Mean : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Mean : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<cdouble, cfloat, float, double, int, uint, intl, uintl, char, uchar> TestTypes;
+// This list does not allow the af_half/half_float type to be added cleanly;
+// at the moment, half is tested in some dedicated unit tests
+typedef ::testing::Types<cdouble, cfloat, float, double, int, uint, intl, uintl,
+                         char, schar, uchar, short, ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Mean, TestTypes);
+TYPED_TEST_SUITE(Mean, TestTypes);
 
 template<typename T>
 struct f32HelperType {
-   typedef typename cond_type<is_same_type<T, double>::value,
-                                             double,
-                                             float>::type type;
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            type;
 };
 
 template<typename T>
 struct c32HelperType {
-   typedef typename cond_type<is_same_type<T, cfloat>::value,
-                                             cfloat,
-                                             typename f32HelperType<T>::type >::type type;
+    typedef typename cond_type<is_same_type<T, cfloat>::value, cfloat,
+                               typename f32HelperType<T>::type>::type type;
 };
 
 template<typename T>
 struct elseType {
-   typedef typename cond_type< is_same_type<T, uintl>::value ||
-                               is_same_type<T, intl>::value,
-                                              double,
-                                              T>::type type;
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
 };
 
 template<typename T>
 struct meanOutType {
-   typedef typename cond_type< is_same_type<T, float>::value ||
-                               is_same_type<T, int>::value ||
-                               is_same_type<T, uint>::value ||
-                               is_same_type<T, uchar>::value ||
-                               is_same_type<T, char>::value,
-                                              float,
-                              typename elseType<T>::type>::type type;
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
 };
 
 template<typename T>
-void meanDimTest(string pFileName, dim_t dim)
-{
+void meanDimTest(string pFileName, dim_t dim, bool isWeighted = false) {
     typedef typename meanOutType<T>::type outType;
-    if (noDoubleTests<T>()) return;
-    if (noDoubleTests<outType>()) return;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4>      numDims;
-    vector<vector<int> >        in;
-    vector<vector<float> >   tests;
+    double tol = 1.0e-3;
+    if ((af_dtype)af::dtype_traits<T>::af_type == f16) tol = 4.e-3;
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
 
-    readTestsFromFile<int,float>(pFileName, numDims, in, tests);
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
 
-    af::dim4 dims      = numDims[0];
-    af_array outArray  = 0;
-    af_array inArray   = 0;
+    dim4 goldDims = numDims[0];
+    goldDims[dim] = 1;
+    if (!isWeighted) {
+        dim4 dims = numDims[0];
+        vector<T> input(in[0].begin(), in[0].end());
 
-    vector<T> input(in[0].begin(), in[0].end());
+        array inArray(dims, &(input.front()));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(input.front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+        array outArray = mean(inArray, dim);
 
-    ASSERT_EQ(AF_SUCCESS, af_mean(&outArray, inArray, dim));
+        vector<outType> outData(dims.elements());
 
-    outType *outData = new outType[dims.elements()];
+        outArray.host((void*)outData.data());
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
 
-    vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
-    size_t nElems = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_NEAR(::real(currGoldBar[elIter]), ::real(outData[elIter]), 1.0e-3)<< "at: " << elIter<< std::endl;
-        ASSERT_NEAR(::imag(currGoldBar[elIter]), ::imag(outData[elIter]), 1.0e-3)<< "at: " << elIter<< std::endl;
-    }
+        dim4 goldDims = dims;
+        goldDims[dim] = 1;
+        ASSERT_VEC_ARRAY_NEAR(currGoldBar, goldDims, outArray, tol);
+    } else {
+        dim4 dims  = numDims[0];
+        dim4 wdims = numDims[1];
+        vector<T> input(in[0].begin(), in[0].end());
+        vector<float> weights(in[1].size());
+        transform(in[1].begin(), in[1].end(), weights.begin(),
+                  convert_to<float, int>);
 
-    // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+        array inArray(dims, &(input.front()));
+        array wtsArray(wdims, &(weights.front()));
+
+        array outArray = mean(inArray, wtsArray, dim);
+
+        vector<outType> outData(dims.elements());
+
+        outArray.host((void*)outData.data());
+
+        vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
+
+        ASSERT_VEC_ARRAY_NEAR(currGoldBar, goldDims, outArray, tol);
+    }
 }
 
-TYPED_TEST(Mean, Dim0Matrix)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim0_matrix.test"), 0);
+TYPED_TEST(Mean, Dim0Matrix) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim0_matrix.test"), 0);
 }
 
-TYPED_TEST(Mean, Dim1Cube)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim1_cube.test"), 1);
+TYPED_TEST(Mean, Dim1Cube) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim1_cube.test"), 1);
 }
 
-TYPED_TEST(Mean, Dim0HyperCube)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim0_hypercube.test"), 0);
+TYPED_TEST(Mean, Dim0HyperCube) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim0_hypercube.test"),
+                           0);
 }
 
-TYPED_TEST(Mean, Dim2Matrix)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim2_matrix.test"), 2);
+TYPED_TEST(Mean, Dim2Matrix) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim2_matrix.test"), 2);
 }
 
-TYPED_TEST(Mean, Dim2Cube)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim2_cube.test"), 2);
+TYPED_TEST(Mean, Dim2Cube) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim2_cube.test"), 2);
 }
 
-TYPED_TEST(Mean, Dim2HyperCube)
-{
-    meanDimTest<TypeParam>(string(TEST_DIR"/mean/mean_dim2_hypercube.test"), 2);
+TYPED_TEST(Mean, Dim2HyperCube) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/mean_dim2_hypercube.test"),
+                           2);
 }
 
-//////////////////////////////// CPP ////////////////////////////////////
-// test mean_all interface using cpp api
+TYPED_TEST(Mean, Wtd_Dim0Matrix) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/wtd_mean_dim0_mat.test"), 0,
+                           true);
+}
 
-#include <iostream>
+TYPED_TEST(Mean, Wtd_Dim1Matrix) {
+    meanDimTest<TypeParam>(string(TEST_DIR "/mean/wtd_mean_dim1_mat.test"), 1,
+                           true);
+}
 
 template<typename T>
-void testCPPMean(T const_value, af::dim4 dims)
-{
+void meanAllTest(T const_value, dim4 dims) {
     typedef typename meanOutType<T>::type outType;
-    if (noDoubleTests<T>()) return;
-    if (noDoubleTests<outType>()) return;
+
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
 
     using af::array;
     using af::mean;
@@ -155,11 +181,9 @@ void testCPPMean(T const_value, af::dim4 dims)
     vector<T> hundred(dims.elements(), const_value);
 
     outType gold = outType(0);
-    //for(auto i:hundred) gold += i;
-    for(int i = 0; i < (int)hundred.size(); i++) {
-        gold += hundred[i];
-    }
-    gold /= dims.elements();
+    // for(auto i:hundred) gold += i;
+    for (int i = 0; i < (int)hundred.size(); i++) { gold = gold + hundred[i]; }
+    gold = gold / dims.elements();
 
     array a(dims, &(hundred.front()));
     outType output = mean<outType>(a);
@@ -168,42 +192,190 @@ void testCPPMean(T const_value, af::dim4 dims)
     ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-3);
 }
 
-TEST(Mean, CPP_f64)
-{
-    testCPPMean<double>(2.1, af::dim4(10, 10, 1, 1));
+template<>
+void meanAllTest(half_float::half const_value, dim4 dims) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+
+    using af::array;
+    using af::mean;
+
+    vector<float> hundred(dims.elements(), const_value);
+
+    float gold = float(0);
+    for (int i = 0; i < (int)hundred.size(); i++) { gold = gold + hundred[i]; }
+    gold = gold / dims.elements();
+
+    array a         = array(dims, &(hundred.front())).as(f16);
+    half output     = mean<half>(a);
+    af_half output2 = mean<af_half>(a);
+
+    // Make sure output2 and output are bitwise identical. This is necessary
+    // because af_half is not a complete type.
+    half output2_copy;
+    memcpy(static_cast<void*>(&output2_copy), &output2, sizeof(af_half));
+    ASSERT_EQ(output, output2_copy);
+
+    ASSERT_NEAR(output, gold, 1.0e-3);
 }
 
-TEST(Mean, CPP_f32)
-{
-    testCPPMean<float>(2.1f, af::dim4(10, 5, 2, 1));
+TEST(MeanAll, f64) { meanAllTest<double>(2.1, dim4(10, 10, 1, 1)); }
+
+TEST(MeanAll, f32) { meanAllTest<float>(2.1f, dim4(10, 5, 2, 1)); }
+
+TEST(MeanAll, f16) { meanAllTest<half>((half)0.3f, dim4(10, 5, 2, 1)); }
+
+TEST(MeanAll, s32) { meanAllTest<int>(2, dim4(5, 5, 2, 2)); }
+
+TEST(MeanAll, u32) { meanAllTest<unsigned>(2, dim4(100, 1, 1, 1)); }
+
+TEST(MeanAll, s8) { meanAllTest<schar>(2, dim4(5, 5, 2, 2)); }
+
+TEST(MeanAll, u8) { meanAllTest<uchar>(2, dim4(100, 1, 1, 1)); }
+
+TEST(MeanAll, c32) { meanAllTest<cfloat>(cfloat(2.1f), dim4(10, 5, 2, 1)); }
+
+TEST(MeanAll, s16) { meanAllTest<short>(2, dim4(5, 5, 2, 2)); }
+
+TEST(MeanAll, u16) { meanAllTest<ushort>(2, dim4(100, 1, 1, 1)); }
+
+TEST(MeanAll, c64) { meanAllTest<cdouble>(cdouble(2.1), dim4(10, 10, 1, 1)); }
+
+template<typename T>
+T random() {
+    return T(std::rand() % 10);
+}
+
+template<>
+half random<half>() {
+    // create values from -0.5 to 0.5 to ensure sum does not deviate
+    // too far out of half's useful range
+    float r = static_cast<float>(rand()) / static_cast<float>(RAND_MAX) - 0.5f;
+    return half(r);
+}
+
+template<>
+cfloat random<cfloat>() {
+    return cfloat(float(std::rand() % 10), float(std::rand() % 10));
+}
+
+template<>
+cdouble random<cdouble>() {
+    return cdouble(double(std::rand() % 10), double(std::rand() % 10));
 }
 
-TEST(Mean, CPP_s32)
-{
-    testCPPMean<int>(2, af::dim4(5, 5, 2, 2));
+template<typename T>
+class WeightedMean : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// register the type list
+TYPED_TEST_SUITE(WeightedMean, TestTypes);
+
+template<typename T, typename wtsType>
+void weightedMeanAllTest(dim4 dims) {
+    typedef typename meanOutType<T>::type outType;
+
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
+    SUPPORTED_TYPE_CHECK(wtsType);
+
+    using af::array;
+    using af::mean;
+
+    std::srand(std::time(0));
+
+    vector<T> data(dims.elements());
+    vector<wtsType> wts(dims.elements());
+    std::generate(data.begin(), data.end(), random<T>);
+    std::generate(wts.begin(), wts.end(), random<wtsType>);
+
+    outType wtdSum = outType(0);
+    wtsType wtsSum = wtsType(0);
+
+    for (int i = 0; i < (int)data.size(); i++) {
+        wtdSum = wtdSum + data[i] * wts[i];
+        wtsSum = wtsSum + wts[i];
+    }
+
+    outType gold = wtdSum / outType(wtsSum);
+
+    array a(dims, &(data.front()));
+    array w(dims, &(wts.front()));
+    outType output = mean<outType>(a, w);
+
+    ASSERT_NEAR(::real(output), ::real(gold), 1.0e-2);
+    ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-2);
 }
 
-TEST(Mean, CPP_u32)
-{
-    testCPPMean<unsigned>(2, af::dim4(100, 1, 1, 1));
+TYPED_TEST(WeightedMean, Basic) {
+    weightedMeanAllTest<TypeParam, float>(dim4(32, 30, 33, 17));
 }
 
-TEST(Mean, CPP_s8)
-{
-    testCPPMean<char>(2, af::dim4(5, 5, 2, 2));
+TEST(WeightedMean, Broadcast) {
+    float val = 0.5f;
+    array a   = randu(4096, 32);
+    array w   = constant(val, a.dims());
+    array c   = mean(a);
+    array d   = mean(a, w);
+
+    vector<float> hc(c.elements());
+    vector<float> hd(d.elements());
+
+    c.host(hc.data());
+    d.host(hd.data());
+
+    for (size_t i = 0; i < hc.size(); i++) {
+        // c and d are the same because the weighted mean is normalized by the
+        // sum of the weights.
+        ASSERT_NEAR(hc[i], hd[i], 1E-5);
+    }
 }
 
-TEST(Mean, CPP_u8)
-{
-    testCPPMean<uchar>(2, af::dim4(100, 1, 1, 1));
+TEST(Mean, Issue2093) {
+    const int NELEMS = 512;
+
+    array data = randu(1, NELEMS);
+    array wts  = constant(1.0f, 1, NELEMS);
+    vector<float> hdata(NELEMS);
+    data.host(hdata.data());
+
+    array out = mean(data, wts, 1);
+    float outVal;
+    out.host(&outVal);
+
+    float expected = 0.0;
+    for (size_t i = 0; i < NELEMS; ++i) expected += hdata[i];
+    expected /= NELEMS;
+
+    ASSERT_NEAR(outVal, expected, 0.001);
 }
 
-TEST(Mean, CPP_cfloat)
-{
-    testCPPMean<cfloat>(cfloat(2.1f), af::dim4(10, 5, 2, 1));
+TEST(MeanAll, SubArray) {
+    // Regression test for issue 2636
+    using af::mean;
+    using af::span;
+    using af::sum;
+
+    const dim4 inDims(10, 10, 10, 10);
+
+    array in  = randu(inDims);
+    array sub = in(0, span, span, span);
+
+    size_t nElems   = sub.elements();
+    float max_error = std::numeric_limits<float>::epsilon() * nElems;
+    ASSERT_NEAR(mean<float>(sub), sum<float>(sub) / nElems, max_error);
 }
 
-TEST(Mean, CPP_cdouble)
-{
-    testCPPMean<cdouble>(cdouble(2.1), af::dim4(10, 10, 1, 1));
+TEST(MeanHalf, dim0) {
+    SUPPORTED_TYPE_CHECK(half_float::half);
+    // Keeping N low to be able to run on 6GB GPUs
+    int N = 1024;
+    const dim4 inDims(N, N, 1, 1);
+    array in  = randu(inDims, f16);
+    array m16 = af::mean(in, 0);
+    array m32 = af::mean(in.as(f32), 0);
+    // Small differences (on the order of 0.0001) appear between the two, e.g.
+    // float: 0.507014 vs half: 0.506836
+    ASSERT_ARRAYS_NEAR(m16.as(f32), m32, 0.001f);
 }
diff --git a/test/meanshift.cpp b/test/meanshift.cpp
index 5f1f9a4e3c..d91648ae52 100644
--- a/test/meanshift.cpp
+++ b/test/meanshift.cpp
@@ -7,91 +7,95 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+#include <cmath>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
-#include <cmath>
 
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
 using std::string;
 using std::vector;
-using af::dim4;
 
 template<typename T>
-class Meanshift : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Meanshift : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
-typedef ::testing::Types<float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort, intl, uintl>
+    TestTypes;
 
-TYPED_TEST_CASE(Meanshift, TestTypes);
+TYPED_TEST_SUITE(Meanshift, TestTypes);
 
-TYPED_TEST(Meanshift, InvalidArgs)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Meanshift, InvalidArgs) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    vector<TypeParam>   in(100,1);
+    vector<TypeParam> in(100, 1);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    af::dim4 dims = af::dim4(100,1,1,1);
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<TypeParam>::af_type));
-    ASSERT_EQ(AF_ERR_SIZE, af_mean_shift(&outArray, inArray, 0.12f, 0.34f, 5, true));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    dim4 dims = dim4(100, 1, 1, 1);
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<TypeParam>::af_type));
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_mean_shift(&outArray, inArray, 0.12f, 0.34f, 5, true));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
 template<typename T, bool isColor>
-void meanshiftTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void meanshiftTest(string pTestFile, const float ss) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
     readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-
-        af_array inArray     = 0;
-        af_array inArray_f32 = 0;
-        af_array outArray    = 0;
-        af_array goldArray   = 0;
-        dim_t nElems      = 0;
-
-        inFiles[testId].insert(0,string(TEST_DIR"/meanshift/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/meanshift/"));
-
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray_f32, inFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, conv_image<T>(&inArray, inArray_f32));
-
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems, goldArray));
-
-        ASSERT_EQ(AF_SUCCESS, af_mean_shift(&outArray, inArray, 2.25f, 25.56f, 5, isColor));
-
-        T * outData = new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
-
-        T * goldData= new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)goldData, goldArray));
-
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.07f));
-
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray_f32));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(goldArray));
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray       = 0;
+        af_array inArray_f32   = 0;
+        af_array outArray      = 0;
+        af_array goldArray     = 0;
+        af_array goldArray_f32 = 0;
+        dim_t nElems           = 0;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/meanshift/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/meanshift/"));
+
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
+
+        ASSERT_SUCCESS(
+            af_load_image(&goldArray_f32, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(conv_image<T>(
+            &goldArray,
+            goldArray_f32));  // af_load_image always returns float array
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
+
+        ASSERT_SUCCESS(af_mean_shift(&outArray, inArray, ss, 30.f, 5, isColor));
+
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.02f);
+
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray_f32));
     }
 }
 
@@ -101,72 +105,70 @@ void meanshiftTest(string pTestFile)
 //       Note: compareArraysRMSD is handling upcasting while working
 //       with two different type of types
 //
-#define IMAGE_TESTS(T)                                                      \
-    TEST(Meanshift, Grayscale_##T)                                          \
-    {                                                                       \
-        meanshiftTest<T, false>(string(TEST_DIR"/meanshift/gray.test"));    \
-    }                                                                       \
-    TEST(Meanshift, Color_##T)                                              \
-    {                                                                       \
-        meanshiftTest<T, true>(string(TEST_DIR"/meanshift/color.test"));    \
+#define IMAGE_TESTS(T)                                                   \
+    TEST(Meanshift, Grayscale_##T) {                                     \
+        meanshiftTest<T, false>(string(TEST_DIR "/meanshift/gray.test"), \
+                                6.67f);                                  \
+    }                                                                    \
+    TEST(Meanshift, Color_##T) {                                         \
+        meanshiftTest<T, true>(string(TEST_DIR "/meanshift/color.test"), \
+                               3.5f);                                    \
     }
 
-IMAGE_TESTS(float )
+IMAGE_TESTS(float)
 IMAGE_TESTS(double)
 
-
 //////////////////////////////////////// CPP ///////////////////////////////
 //
-TEST(Meanshift, Color_CPP)
-{
-    if (noDoubleTests<float>()) return;
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+using af::array;
+using af::constant;
+using af::iota;
+using af::loadImage;
+using af::max;
+using af::meanShift;
+using af::seq;
+using af::span;
+
+TEST(Meanshift, Color_CPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
-    readImageTests(string(TEST_DIR"/meanshift/color.test"), inDims, inFiles, outSizes, outFiles);
+    readImageTests(string(TEST_DIR "/meanshift/color.test"), inDims, inFiles,
+                   outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-        inFiles[testId].insert(0,string(TEST_DIR"/meanshift/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/meanshift/"));
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/meanshift/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/meanshift/"));
 
-        af::array img   = af::loadImage(inFiles[testId].c_str(), true);
-        af::array gold  = af::loadImage(outFiles[testId].c_str(), true);
+        array img    = loadImage(inFiles[testId].c_str(), true);
+        array gold   = loadImage(outFiles[testId].c_str(), true);
         dim_t nElems = gold.elements();
-        af::array output= af::meanShift(img, 2.25f, 25.56f, 5, true);
-
-        float * outData = new float[nElems];
-        output.host((void*)outData);
+        array output = meanShift(img, 3.5f, 30.f, 5, true);
 
-        float * goldData= new float[nElems];
-        gold.host((void*)goldData);
-
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.07f));
-        // cleanup
-        delete[] outData;
-        delete[] goldData;
+        ASSERT_IMAGES_NEAR(gold, output, 0.02f);
     }
 }
 
-TEST(meanshift, GFOR)
-{
-    using namespace af;
-
+TEST(Meanshift, GFOR) {
     dim4 dims = dim4(10, 10, 3);
-    array A = iota(dims);
-    array B = constant(0, dims);
+    array A   = iota(dims);
+    array B   = constant(0, dims);
 
     gfor(seq ii, 3) {
         B(span, span, ii) = meanShift(A(span, span, ii), 3, 5, 3);
     }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = meanShift(A(span, span, ii), 3, 5, 3);
         array b_ii = B(span, span, ii);
-        ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
+
+        ASSERT_LT(max<double>(abs(c_ii - b_ii)), 1E-5);
     }
 }
diff --git a/test/meanvar.cpp b/test/meanvar.cpp
new file mode 100644
index 0000000000..c7eba339a8
--- /dev/null
+++ b/test/meanvar.cpp
@@ -0,0 +1,382 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+
+#include <testHelpers.hpp>
+
+#include <iterator>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::back_inserter;
+using std::move;
+using std::string;
+using std::vector;
+
+af_err init_err = af_init();
+
+template<typename T>
+struct elseType {
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
+};
+
+template<typename T>
+struct varOutType {
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
+};
+
+template<typename T>
+using outType = typename varOutType<T>::type;
+
+template<typename T>
+struct meanvar_test {
+    static af_dtype af_type;
+    string test_description_;
+    af_array in_;
+    af_array weights_;
+    af_var_bias bias_;
+    int dim_;
+    vector<outType<T>> mean_;
+    vector<outType<T>> variance_;
+
+    meanvar_test(string description, af_array in, af_array weights,
+                 af_var_bias bias, int dim,
+                 vector<typename varOutType<T>::type> &&mean,
+                 vector<typename varOutType<T>::type> &&variance)
+        : test_description_(description)
+        , in_(0)
+        , weights_(0)
+        , bias_(bias)
+        , dim_(dim) {
+        af_retain_array(&in_, in);
+        if (weights) { af_retain_array(&weights_, weights); }
+        mean_.reserve(mean.size());
+        variance_.reserve(variance.size());
+        for (auto &v : mean) mean_.push_back((outType<T>)v);
+        for (auto &v : variance) variance_.push_back((outType<T>)v);
+    }
+
+    meanvar_test(std::string name)
+        : test_description_(name), in_(0), weights_(0) {}
+
+    meanvar_test(meanvar_test<T> &&other)
+        : test_description_(other.test_description_)
+        , in_(other.in_)
+        , weights_(other.weights_)
+        , bias_(other.bias_)
+        , dim_(other.dim_)
+        , mean_(other.mean_)
+        , variance_(other.variance_) {
+        other.in_      = 0;
+        other.weights_ = 0;
+    }
+    meanvar_test &operator=(meanvar_test<T> &&other) = default;
+    meanvar_test &operator=(meanvar_test<T> &other)  = delete;
+
+    meanvar_test(const meanvar_test<T> &other)
+        : test_description_(other.test_description_)
+        , in_(0)
+        , weights_(0)
+        , bias_(other.bias_)
+        , dim_(other.dim_)
+        , mean_(other.mean_)
+        , variance_(other.variance_) {
+        if (other.in_) af_retain_array(&in_, other.in_);
+        if (other.weights_) { af_retain_array(&weights_, other.weights_); }
+    }
+
+    ~meanvar_test() {
+#ifndef _WIN32
+        if (in_) af_release_array(in_);
+        if (weights_) {
+            af_release_array(weights_);
+            weights_ = 0;
+        }
+#endif
+    }
+};
+
+template<typename T>
+af_dtype meanvar_test<T>::af_type = dtype_traits<T>::af_type;
+
+template<typename T>
+class MeanVarTyped : public ::testing::TestWithParam<meanvar_test<T>> {
+   public:
+    void meanvar_test_function(const meanvar_test<T> &test) {
+        SUPPORTED_TYPE_CHECK(T);
+        SUPPORTED_TYPE_CHECK(outType<T>);
+        af_array mean, var;
+
+        // Cast to the expected type
+        af_array in = 0;
+        ASSERT_SUCCESS(
+            af_cast(&in, test.in_, (af_dtype)dtype_traits<T>::af_type));
+
+        EXPECT_EQ(AF_SUCCESS, af_meanvar(&mean, &var, in, test.weights_,
+                                         test.bias_, test.dim_));
+
+        vector<outType<T>> h_mean(test.mean_.size()),
+            h_var(test.variance_.size());
+
+        dim4 outDim(1);
+        af_get_dims(&outDim[0], &outDim[1], &outDim[2], &outDim[3], in);
+        outDim[test.dim_] = 1;
+
+        if (is_same_type<half_float::half, outType<T>>::value) {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 1.f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.5f);
+        } else if (is_same_type<float, outType<T>>::value ||
+                   is_same_type<cfloat, outType<T>>::value) {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 0.0016f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.2f);
+        } else {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 0.00001f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.0001f);
+        }
+
+        ASSERT_SUCCESS(af_release_array(in));
+        ASSERT_SUCCESS(af_release_array(mean));
+        ASSERT_SUCCESS(af_release_array(var));
+    }
+
+    void meanvar_cpp_test_function(const meanvar_test<T> &test) {
+        SUPPORTED_TYPE_CHECK(T);
+        SUPPORTED_TYPE_CHECK(outType<T>);
+        array mean, var;
+
+        // Cast to the expected type
+        af_array in_tmp = 0;
+        ASSERT_SUCCESS(af_retain_array(&in_tmp, test.in_));
+        array in(in_tmp);
+        in = in.as((af_dtype)dtype_traits<T>::af_type);
+
+        af_array weights_tmp = test.weights_;
+        if (weights_tmp) {
+            ASSERT_SUCCESS(af_retain_array(&weights_tmp, weights_tmp));
+        }
+        array weights(weights_tmp);
+        meanvar(mean, var, in, weights, test.bias_, test.dim_);
+
+        vector<outType<T>> h_mean(test.mean_.size()),
+            h_var(test.variance_.size());
+
+        dim4 outDim       = in.dims();
+        outDim[test.dim_] = 1;
+
+        if (is_same_type<half_float::half, outType<T>>::value) {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 1.f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.5f);
+        } else if (is_same_type<float, outType<T>>::value ||
+                   is_same_type<cfloat, outType<T>>::value) {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 0.0016f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.2f);
+        } else {
+            ASSERT_VEC_ARRAY_NEAR(test.mean_, outDim, mean, 0.00001f);
+            ASSERT_VEC_ARRAY_NEAR(test.variance_, outDim, var, 0.0001f);
+        }
+    }
+};
+
+af_array empty = 0;
+
+enum test_size { MEANVAR_SMALL, MEANVAR_LARGE };
+
+template<typename T>
+meanvar_test<T> meanvar_test_gen(string name, int in_index, int weight_index,
+                                 af_var_bias bias, int dim, int mean_index,
+                                 int var_index, test_size size) {
+    if (noDoubleTests((af_dtype)af::dtype_traits<T>::af_type) ||
+        noDoubleTests((
+            af_dtype)af::dtype_traits<typename varOutType<T>::type>::af_type) ||
+        noHalfTests((af_dtype)af::dtype_traits<T>::af_type)) {
+        meanvar_test<T> out(name);
+        return out;
+    }
+
+    vector<af_array> inputs;
+    vector<vector<typename varOutType<T>::type>> outputs;
+    if (size == MEANVAR_SMALL) {
+        vector<af::dim4> numDims_;
+        vector<vector<T>> in_;
+        vector<vector<typename varOutType<T>::type>> tests_;
+        readTests<T, typename varOutType<T>::type, double>(
+            TEST_DIR "/meanvar/meanvar.data", numDims_, in_, tests_);
+
+        inputs.resize(in_.size());
+        for (size_t i = 0; i < in_.size(); i++) {
+            af_create_array(&inputs[i], &in_[i].front(), numDims_[i].ndims(),
+                            numDims_[i].get(),
+                            (af_dtype)af::dtype_traits<T>::af_type);
+        }
+
+        outputs.resize(tests_.size());
+        for (size_t i = 0; i < tests_.size(); i++) {
+            copy(tests_[i].begin(), tests_[i].end(), back_inserter(outputs[i]));
+        }
+    } else {
+        dim_t full_array_size            = 2000;
+        vector<vector<dim_t>> dimensions = {
+            {2000, 1, 1, 1},  // 0
+            {1, 2000, 1, 1},  // 1
+            {1, 1, 2000, 1},  // 2
+
+            {500, 4, 1, 1},  // 3
+            {4, 500, 1, 1},  // 4
+            {50, 40, 1, 1}   // 5
+        };
+
+        vector<T> large_(full_array_size);
+        for (size_t i = 0; i < large_.size(); i++) {
+            large_[i] = static_cast<T>(i);
+        }
+
+        inputs.resize(dimensions.size());
+        for (size_t i = 0; i < dimensions.size(); i++) {
+            af_create_array(&inputs[i], &large_.front(), 4,
+                            dimensions[i].data(),
+                            (af_dtype)af::dtype_traits<T>::af_type);
+        }
+
+        outputs.push_back(
+            vector<typename varOutType<T>::type>(1, outType<T>(999.5)));
+        outputs.push_back(
+            vector<typename varOutType<T>::type>(1, outType<T>(333500)));
+        outputs.push_back({outType<T>(249.50), outType<T>(749.50),
+                           outType<T>(1249.50), outType<T>(1749.50)});
+        outputs.push_back(
+            vector<typename varOutType<T>::type>(4, outType<T>(20875)));
+    }
+    meanvar_test<T> out(name, inputs[in_index],
+                        (weight_index == -1) ? empty : inputs[weight_index],
+                        bias, dim, move(outputs[mean_index]),
+                        move(outputs[var_index]));
+
+    for (auto input : inputs) { af_release_array(input); }
+    return out;
+}
+
+template<typename T>
+vector<meanvar_test<T>> small_test_values() {
+    // clang-format off
+    return {
+        //                  |           Name |   in_index | weight_index |                  bias |  dim | mean_index | var_index |
+        meanvar_test_gen<T>(   "Sample1Ddim0",           0,            -1,     AF_VARIANCE_SAMPLE,     0,           0,          1, MEANVAR_SMALL),
+        meanvar_test_gen<T>(   "Sample1Ddim1",           1,            -1,     AF_VARIANCE_SAMPLE,     1,           0,          1, MEANVAR_SMALL),
+        meanvar_test_gen<T>(   "Sample2Ddim0",           2,            -1,     AF_VARIANCE_SAMPLE,     0,           3,          4, MEANVAR_SMALL),
+        meanvar_test_gen<T>(   "Sample2Ddim1",           2,            -1,     AF_VARIANCE_SAMPLE,     1,           6,          7, MEANVAR_SMALL),
+
+        meanvar_test_gen<T>("Population1Ddim0",          0,            -1, AF_VARIANCE_POPULATION,     0,           0,          2, MEANVAR_SMALL),
+        meanvar_test_gen<T>("Population1Ddim1",          1,            -1, AF_VARIANCE_POPULATION,     1,           0,          2, MEANVAR_SMALL),
+        meanvar_test_gen<T>("Population2Ddim0",          2,            -1, AF_VARIANCE_POPULATION,     0,           3,          5, MEANVAR_SMALL),
+        meanvar_test_gen<T>("Population2Ddim1",          2,            -1, AF_VARIANCE_POPULATION,     1,           6,          8, MEANVAR_SMALL)};
+    // clang-format on
+}
+
+template<typename T>
+vector<meanvar_test<T>> large_test_values() {
+    return {
+        // clang-format off
+        //                  |       Name |      in_index | weight_index |                  bias |  dim | mean_index | var_index |
+        meanvar_test_gen<T>("Sample1Ddim0",             0,            -1,     AF_VARIANCE_SAMPLE,     0,           0,          1, MEANVAR_LARGE),
+        meanvar_test_gen<T>("Sample1Ddim1",             1,            -1,     AF_VARIANCE_SAMPLE,     1,           0,          1, MEANVAR_LARGE),
+        meanvar_test_gen<T>("Sample1Ddim2",             2,            -1,     AF_VARIANCE_SAMPLE,     2,           0,          1, MEANVAR_LARGE),
+        meanvar_test_gen<T>("Sample2Ddim0",             3,            -1,     AF_VARIANCE_SAMPLE,     0,           2,          3, MEANVAR_LARGE),
+        // TODO(umar) Add additional large tests
+        // meanvar_test_gen<T>(    "Sample2Ddim1",           3,            -1, AF_VARIANCE_SAMPLE,     1,           2,          3, MEANVAR_LARGE),
+        // meanvar_test_gen<T>(    "Sample2Ddim1",           2,            -1, AF_VARIANCE_SAMPLE,     1,           6,          7, MEANVAR_LARGE),
+        // clang-format on
+    };
+}
+
+#define MEANVAR_TEST(NAME, TYPE)                                              \
+    using MeanVar##NAME = MeanVarTyped<TYPE>;                                 \
+    INSTANTIATE_TEST_SUITE_P(                                                 \
+        Small, MeanVar##NAME, ::testing::ValuesIn(small_test_values<TYPE>()), \
+        [](const ::testing::TestParamInfo<MeanVar##NAME::ParamType> info) {   \
+            return info.param.test_description_;                              \
+        });                                                                   \
+    INSTANTIATE_TEST_SUITE_P(                                                 \
+        Large, MeanVar##NAME, ::testing::ValuesIn(large_test_values<TYPE>()), \
+        [](const ::testing::TestParamInfo<MeanVar##NAME::ParamType> info) {   \
+            return info.param.test_description_;                              \
+        });                                                                   \
+                                                                              \
+    TEST_P(MeanVar##NAME, Testing) {                                          \
+        const meanvar_test<TYPE> &test = GetParam();                          \
+        meanvar_test_function(test);                                          \
+    }                                                                         \
+    TEST_P(MeanVar##NAME, TestingCPP) {                                       \
+        const meanvar_test<TYPE> &test = GetParam();                          \
+        meanvar_cpp_test_function(test);                                      \
+    }
+
+MEANVAR_TEST(Float, float)
+MEANVAR_TEST(Double, double)
+MEANVAR_TEST(Int, int)
+MEANVAR_TEST(UnsignedInt, unsigned int)
+MEANVAR_TEST(Short, short)
+MEANVAR_TEST(UnsignedShort, unsigned short)
+MEANVAR_TEST(Long, long long)
+MEANVAR_TEST(UnsignedLong, unsigned long long)
+MEANVAR_TEST(ComplexFloat, af::af_cfloat)
+MEANVAR_TEST(ComplexDouble, af::af_cdouble)
+
+#undef MEANVAR_TEST
+
+using MeanVarHalf = MeanVarTyped<half_float::half>;
+INSTANTIATE_TEST_SUITE_P(
+    Small, MeanVarHalf,
+    ::testing::ValuesIn(small_test_values<half_float::half>()),
+    [](const ::testing::TestParamInfo<MeanVarHalf::ParamType> info) {
+        return info.param.test_description_;
+    });
+TEST_P(MeanVarHalf, Testing) {
+    const meanvar_test<half_float::half> &test = GetParam();
+    meanvar_test_function(test);
+}
+TEST_P(MeanVarHalf, TestingCPP) {
+    const meanvar_test<half_float::half> &test = GetParam();
+    meanvar_cpp_test_function(test);
+}
+
+#define MEANVAR_TEST(NAME, TYPE)                                              \
+    using MeanVar##NAME = MeanVarTyped<TYPE>;                                 \
+    INSTANTIATE_TEST_SUITE_P(                                                 \
+        Small, MeanVar##NAME, ::testing::ValuesIn(small_test_values<TYPE>()), \
+        [](const ::testing::TestParamInfo<MeanVar##NAME::ParamType> &info) {  \
+            return info.param.test_description_;                              \
+        });                                                                   \
+                                                                              \
+    TEST_P(MeanVar##NAME, Testing) {                                          \
+        const meanvar_test<TYPE> &test = GetParam();                          \
+        meanvar_test_function(test);                                          \
+    }                                                                         \
+    TEST_P(MeanVar##NAME, TestingCPP) {                                       \
+        const meanvar_test<TYPE> &test = GetParam();                          \
+        meanvar_cpp_test_function(test);                                      \
+    }
+
+// Only test small sizes; the large arrays hold values outside char range
+MEANVAR_TEST(SignedChar, signed char)
+MEANVAR_TEST(UnsignedChar, unsigned char)
+// MEANVAR_TEST(Bool, unsigned char) // TODO(umar): test this type
diff --git a/test/medfilt.cpp b/test/medfilt.cpp
index db00d94e51..5ef951d5b1 100644
--- a/test/medfilt.cpp
+++ b/test/medfilt.cpp
@@ -7,275 +7,420 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class MedianFilter : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class MedianFilter : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class MedianFilter1d : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(MedianFilter, TestTypes);
+TYPED_TEST_SUITE(MedianFilter, TestTypes);
+TYPED_TEST_SUITE(MedianFilter1d, TestTypes);
 
 template<typename T>
-void medfiltTest(string pTestFile, dim_t w_len, dim_t w_wid, af_border_type pad)
-{
-    if (noDoubleTests<T>()) return;
+void medfiltTest(string pTestFile, dim_t w_len, dim_t w_wid,
+                 af_border_type pad) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4>  numDims;
-    vector<vector<T> >      in;
-    vector<vector<T> >   tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
 
-    readTests<T,T,int>(pTestFile, numDims, in, tests);
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims      = numDims[0];
-    af_array outArray  = 0;
-    af_array inArray   = 0;
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_medfilt(&outArray, inArray, w_len, w_wid, pad));
+    ASSERT_SUCCESS(af_medfilt2(&outArray, inArray, w_len, w_wid, pad));
 
-    T *outData = new T[dims.elements()];
+    vector<T> outData(dims.elements());
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData.data(), outArray));
 
     vector<T> currGoldBar = tests[0];
-    size_t nElems        = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    size_t nElems         = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
 
     // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
+}
+
+TYPED_TEST(MedianFilter, ZERO_PAD_3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfiltTest<TypeParam>(
+        string(TEST_DIR "/medianfilter/zero_pad_3x3_window.test"), 3, 3,
+        AF_PAD_ZERO);
 }
 
-TYPED_TEST(MedianFilter, ZERO_PAD_3x3)
-{
-    medfiltTest<TypeParam>(string(TEST_DIR"/medianfilter/zero_pad_3x3_window.test"), 3, 3, AF_PAD_ZERO);
+TYPED_TEST(MedianFilter, SYMMETRIC_PAD_3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfiltTest<TypeParam>(
+        string(TEST_DIR "/medianfilter/symmetric_pad_3x3_window.test"), 3, 3,
+        AF_PAD_SYM);
+}
+
+TYPED_TEST(MedianFilter, BATCH_ZERO_PAD_3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfiltTest<TypeParam>(
+        string(TEST_DIR "/medianfilter/batch_zero_pad_3x3_window.test"), 3, 3,
+        AF_PAD_ZERO);
+}
+
+TYPED_TEST(MedianFilter, BATCH_SYMMETRIC_PAD_3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfiltTest<TypeParam>(
+        string(TEST_DIR "/medianfilter/batch_symmetric_pad_3x3_window.test"), 3,
+        3, AF_PAD_SYM);
+}
+
+template<typename T>
+void medfilt1_Test(string pTestFile, dim_t w_wid, af_border_type pad) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_medfilt1(&outArray, inArray, w_wid, pad));
+
+    vector<T> outData(dims.elements());
+
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData.data(), outArray));
+
+    vector<T> currGoldBar = tests[0];
+    size_t nElems         = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // cleanup
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(MedianFilter, SYMMETRIC_PAD_3x3)
-{
-    medfiltTest<TypeParam>(string(TEST_DIR"/medianfilter/symmetric_pad_3x3_window.test"), 3, 3, AF_PAD_SYM);
+TYPED_TEST(MedianFilter1d, ZERO_PAD_3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfilt1_Test<TypeParam>(
+        string(TEST_DIR "/medianfilter/zero_pad_3x1_window.test"), 3,
+        AF_PAD_ZERO);
 }
 
-TYPED_TEST(MedianFilter, BATCH_ZERO_PAD_3x3)
-{
-    medfiltTest<TypeParam>(string(TEST_DIR"/medianfilter/batch_zero_pad_3x3_window.test"), 3, 3, AF_PAD_ZERO);
+TYPED_TEST(MedianFilter1d, SYMMETRIC_PAD_3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfilt1_Test<TypeParam>(
+        string(TEST_DIR "/medianfilter/symmetric_pad_3x1_window.test"), 3,
+        AF_PAD_SYM);
 }
 
-TYPED_TEST(MedianFilter, BATCH_SYMMETRIC_PAD_3x3)
-{
-    medfiltTest<TypeParam>(string(TEST_DIR"/medianfilter/batch_symmetric_pad_3x3_window.test"), 3, 3, AF_PAD_SYM);
+TYPED_TEST(MedianFilter1d, BATCH_ZERO_PAD_3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfilt1_Test<TypeParam>(
+        string(TEST_DIR "/medianfilter/batch_zero_pad_3x1_window.test"), 3,
+        AF_PAD_ZERO);
 }
 
-template<typename T,bool isColor>
-void medfiltImageTest(string pTestFile, dim_t w_len, dim_t w_wid)
-{
-    if (noDoubleTests<T>()) return;
+TYPED_TEST(MedianFilter1d, BATCH_SYMMETRIC_PAD_3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    medfilt1_Test<TypeParam>(
+        string(TEST_DIR "/medianfilter/batch_symmetric_pad_3x1_window.test"), 3,
+        AF_PAD_SYM);
+}
 
-    using af::dim4;
+template<typename T, bool isColor>
+void medfiltImageTest(string pTestFile, dim_t w_len, dim_t w_wid) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
     readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray   = 0;
+        af_array outArray  = 0;
+        af_array goldArray = 0;
+        dim_t nElems       = 0;
 
-        af_array inArray  = 0;
-        af_array outArray = 0;
-        af_array goldArray= 0;
-        dim_t nElems   = 0;
+        inFiles[testId].insert(0, string(TEST_DIR "/medianfilter/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/medianfilter/"));
 
-        inFiles[testId].insert(0,string(TEST_DIR"/medianfilter/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/medianfilter/"));
+        ASSERT_SUCCESS(
+            af_load_image(&inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(
+            af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
 
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray, inFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems, goldArray));
+        ASSERT_SUCCESS(
+            af_medfilt2(&outArray, inArray, w_len, w_wid, AF_PAD_ZERO));
 
-        ASSERT_EQ(AF_SUCCESS, af_medfilt(&outArray, inArray, w_len, w_wid, AF_PAD_ZERO));
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.018f);
 
-        T * outData = new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
+    }
+}
 
-        T * goldData= new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)goldData, goldArray));
+template<typename T>
+void medfiltInputTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.018f));
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(goldArray));
-    }
+    vector<T> in(100, 1);
+
+    // Check for 1D inputs -> medfilt1
+    dim4 dims = dim4(100, 1, 1, 1);
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_medfilt2(&outArray, inArray, 1, 1, AF_PAD_ZERO));
+
+    bool medfilt1;
+    ASSERT_SUCCESS(af_is_vector(&medfilt1, outArray));
+
+    ASSERT_EQ(true, medfilt1);
+
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
+TYPED_TEST(MedianFilter, InvalidArray) { medfiltInputTest<TypeParam>(); }
+
 template<typename T>
-void medfiltInputTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void medfiltWindowTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    vector<T>   in(100, 1);
+    vector<T> in(100, 1);
 
-    // Check for 1D inputs
-    af::dim4 dims = af::dim4(100, 1, 1, 1);
+    // 2D (10x10) input; the 3x5 window below is expected to be rejected
+    dim4 dims(10, 10, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_ERR_SIZE, af_medfilt(&outArray, inArray, 1, 1, AF_PAD_ZERO));
+    ASSERT_EQ(AF_ERR_ARG, af_medfilt2(&outArray, inArray, 3, 5, AF_PAD_ZERO));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(MedianFilter, InvalidArray)
-{
-    medfiltInputTest<TypeParam>();
-}
+TYPED_TEST(MedianFilter, InvalidWindow) { medfiltWindowTest<TypeParam>(); }
 
 template<typename T>
-void medfiltWindowTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void medfilt1d_WindowTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    vector<T>   in(100, 1);
+    vector<T> in(100, 1);
 
     // Check for 4D inputs
-    af::dim4 dims(10, 10, 1, 1);
+    dim4 dims(10, 10, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_ERR_ARG, af_medfilt(&outArray, inArray, 3, 5, AF_PAD_ZERO));
+    ASSERT_EQ(AF_ERR_ARG, af_medfilt1(&outArray, inArray, -1, AF_PAD_ZERO));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(MedianFilter, InvalidWindow)
-{
-    medfiltWindowTest<TypeParam>();
-}
+TYPED_TEST(MedianFilter1d, InvalidWindow) { medfilt1d_WindowTest<TypeParam>(); }
 
 template<typename T>
-void medfiltPadTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void medfiltPadTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
+    af_array inArray  = 0;
+    af_array outArray = 0;
 
-    vector<T>   in(100, 1);
+    vector<T> in(100, 1);
 
     // Check for 4D inputs
-    af::dim4 dims(10, 10, 1, 1);
+    dim4 dims(10, 10, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_ERR_ARG, af_medfilt(&outArray, inArray, 3, 3, af_border_type(3)));
+    ASSERT_EQ(AF_ERR_ARG,
+              af_medfilt2(&outArray, inArray, 3, 3, af_border_type(3)));
 
-    ASSERT_EQ(AF_ERR_ARG, af_medfilt(&outArray, inArray, 3, 3, af_border_type(-1)));
+    ASSERT_EQ(AF_ERR_ARG,
+              af_medfilt2(&outArray, inArray, 3, 3, af_border_type(-1)));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(MedianFilter, InvalidPadType)
-{
-    medfiltPadTest<TypeParam>();
+TYPED_TEST(MedianFilter, InvalidPadType) { medfiltPadTest<TypeParam>(); }
+
+template<typename T>
+void medfilt1d_PadTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    af_array inArray  = 0;
+    af_array outArray = 0;
+
+    vector<T> in(100, 1);
+
+    // Check with a 10x10 (2D) input
+    dim4 dims(10, 10, 1, 1);
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_EQ(AF_ERR_ARG,
+              af_medfilt1(&outArray, inArray, 3, af_border_type(3)));
+
+    ASSERT_EQ(AF_ERR_ARG,
+              af_medfilt1(&outArray, inArray, 3, af_border_type(-1)));
+
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
+TYPED_TEST(MedianFilter1d, InvalidPadType) { medfilt1d_PadTest<TypeParam>(); }
 
 //////////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(MedianFilter, CPP)
-{
-    if (noDoubleTests<float>()) return;
 
+using af::array;
+
+TEST(MedianFilter, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
     const dim_t w_len = 3;
     const dim_t w_wid = 3;
 
-    vector<af::dim4>  numDims;
-    vector<vector<float> >      in;
-    vector<vector<float> >   tests;
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
 
-    readTests<float,float,int>(string(TEST_DIR"/medianfilter/batch_symmetric_pad_3x3_window.test"),
-                               numDims, in, tests);
+    readTests<float, float, int>(
+        string(TEST_DIR "/medianfilter/batch_symmetric_pad_3x3_window.test"),
+        numDims, in, tests);
 
-    af::dim4 dims    = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::medfilt(input, w_len, w_wid, AF_PAD_SYM);
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = medfilt(input, w_len, w_wid, AF_PAD_SYM);
 
-    float *outData = new float[dims.elements()];
-    output.host((void*)outData);
+    vector<float> outData(dims.elements());
+    output.host((void*)outData.data());
 
     vector<float> currGoldBar = tests[0];
-    size_t nElems = currGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
+    size_t nElems             = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
     }
-
-    // cleanup
-    delete[] outData;
 }
 
+TEST(MedianFilter1d, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const dim_t w_wid = 3;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTests<float, float, int>(
+        string(TEST_DIR "/medianfilter/batch_symmetric_pad_3x1_window.test"),
+        numDims, in, tests);
 
-TEST(MedianFilter, Docs)
-{
-    using af::array;
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = medfilt1(input, w_wid, AF_PAD_SYM);
+
+    vector<float> outData(dims.elements());
+    output.host((void*)outData.data());
+
+    vector<float> currGoldBar = tests[0];
+    size_t nElems             = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+}
 
-    float input[] = {
-        1.0000,  2.0000,  3.0000,  4.0000,
-        5.0000,  6.0000,  7.0000,  8.0000,
-        9.0000, 10.0000, 11.0000, 12.0000,
-       13.0000, 14.0000, 15.0000, 16.0000
-    };
+TEST(MedianFilter, Docs) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    float input[] = {1.0000,  2.0000,  3.0000,  4.0000,  5.0000,  6.0000,
+                     7.0000,  8.0000,  9.0000,  10.0000, 11.0000, 12.0000,
+                     13.0000, 14.0000, 15.0000, 16.0000};
 
-    float gold[] = {
-        0.0000,  2.0000,  3.0000, 0.0000,
-        2.0000,  6.0000,  7.0000, 4.0000,
-        6.0000, 10.0000, 11.0000, 8.0000,
-        0.0000, 10.0000, 11.0000, 0.0000
-    };
+    float gold[] = {0.0000, 2.0000,  3.0000,  0.0000,  2.0000,  6.0000,
+                    7.0000, 4.0000,  6.0000,  10.0000, 11.0000, 8.0000,
+                    0.0000, 10.0000, 11.0000, 0.0000};
 
     //![ex_image_medfilt]
     array a = array(4, 4, input);
-    //af_print(a);
-    //a = 1.0000        5.0000        9.0000       13.0000
+    // af_print(a);
+    // a = 1.0000        5.0000        9.0000       13.0000
     //    2.0000        6.0000       10.0000       14.0000
     //    3.0000        7.0000       11.0000       15.0000
     //    4.0000        8.0000       12.0000       16.0000
-    array b = af::medfilt(a, 3, 3, AF_PAD_ZERO);
-    //af_print(b);
-    //b=  0.0000        2.0000        6.0000        0.0000
+    array b = medfilt(a, 3, 3, AF_PAD_ZERO);
+    // af_print(b);
+    // b=  0.0000        2.0000        6.0000        0.0000
     //    2.0000        6.0000       10.0000       10.0000
     //    3.0000        7.0000       11.0000       11.0000
     //    0.0000        4.0000        8.0000        0.0000
@@ -284,26 +429,45 @@ TEST(MedianFilter, Docs)
     float output[16];
     b.host((void*)output);
 
-    for (int i=0; i<16; ++i) {
-        ASSERT_EQ(output[i], gold[i]) << "output mismatch at i = " << i << std::endl;
+    for (int i = 0; i < 16; ++i) {
+        ASSERT_EQ(output[i], gold[i]) << "output mismatch at i = " << i << endl;
     }
 }
 
-using namespace af;
+using af::constant;
+using af::iota;
+using af::max;
+using af::medfilt;
+using af::medfilt1;
+using af::seq;
+using af::span;
 
-TEST(MedianFilter, GFOR)
-{
+TEST(MedianFilter, GFOR) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
     dim4 dims = dim4(10, 10, 3);
-    array A = iota(dims);
-    array B = constant(0, dims);
+    array A   = iota(dims);
+    array B   = constant(0, dims);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = medfilt(A(span, span, ii));
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = medfilt(A(span, span, ii)); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = medfilt(A(span, span, ii));
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
     }
 }
+
+TEST(MedianFilter1d, GFOR) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    dim4 dims = dim4(10, 10, 3);
+    array A   = iota(dims);
+    array B   = constant(0, dims);
+
+    gfor(seq ii, 3) { B(span, ii) = medfilt1(A(span, ii)); }
+
+    for (int ii = 0; ii < 3; ii++) {
+        array c_ii = medfilt1(A(span, ii));
+        array b_ii = B(span, ii);
+        ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
+    }
+}
diff --git a/test/median.cpp b/test/median.cpp
index f613ea5b1c..4f64631c6f 100644
--- a/test/median.cpp
+++ b/test/median.cpp
@@ -8,102 +8,167 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/device.h>
+#include <af/random.h>
+#include <af/statistics.h>
+
+using af::array;
+using af::dtype;
+using af::dtype_traits;
+using af::median;
+using af::randu;
+using af::seq;
+using af::span;
+using af::sum;
+using std::vector;
+
+template<typename Ti>
+array generateArray(int nx, int ny, int nz, int nw) {
+    array a = randu(nx, ny, nz, nw, (dtype)dtype_traits<Ti>::af_type);
+    return a;
+}
+
+template<>
+array generateArray<int>(int nx, int ny, int nz, int nw) {
+    array a = (randu(nx, ny, nz, nw, (dtype)dtype_traits<float>::af_type) * 1e6)
+                  .as(s32);
+    return a;
+}
+
+template<>
+array generateArray<unsigned int>(int nx, int ny, int nz, int nw) {
+    array a = (randu(nx, ny, nz, nw, (dtype)dtype_traits<float>::af_type) * 1e6)
+                  .as(u32);
+    return a;
+}
+
+template<typename To, typename Ti>
+void median_flat(int nx, int ny = 1, int nz = 1, int nw = 1) {
+    SUPPORTED_TYPE_CHECK(Ti);
+    array a = generateArray<Ti>(nx, ny, nz, nw);
+
+    // Verification
+    array sa  = sort(flat(a));
+    dim_t mid = (sa.dims(0) + 1) / 2;
+
+    To verify;
+
+    To *h_sa = sa.as((af_dtype)dtype_traits<To>::af_type).host<To>();
+    if (sa.dims(0) % 2 == 1) {
+        verify = h_sa[mid - 1];
+    } else {
+        verify = (h_sa[mid - 1] + h_sa[mid]) / (To)2;
+    }
+
+    // Test Part
+    To val = median<To>(a);
+
+    ASSERT_EQ(verify, val);
+
+    af_free_host(h_sa);
+}
+
+template<typename To, typename Ti, int dim>
+void median_test(int nx, int ny = 1, int nz = 1, int nw = 1) {
+    SUPPORTED_TYPE_CHECK(Ti);
+
+    array a = generateArray<Ti>(nx, ny, nz, nw);
 
-using namespace std;
-using namespace af;
+    // If selected dim is higher than input ndims, then return
+    if (dim >= a.dims().ndims()) return;
 
-template<typename To, typename Ti, bool flat>
-void median0(int nx, int ny=1, int nz=1, int nw=1)
-{
-    if (noDoubleTests<Ti>()) return;
-    array a = randu(nx, ny, nz, nw, (af::dtype)dtype_traits<Ti>::af_type);
-    array sa = sort(a);
+    array verify;
 
-    Ti *h_sa = sa.host<Ti>();
+    // Verification
+    array sa = sort(a, dim);
 
-    To *h_b = NULL;
-    To val = 0;
+    double mid  = (a.dims(dim) + 1) / 2;
+    seq mSeq[4] = {span, span, span, span};
+    mSeq[dim]   = seq(mid, mid, 1.0);
 
-    if (flat) {
-        val = median<To>(a);
-        h_b = &val;
+    if (sa.dims(dim) % 2 == 1) {
+        mSeq[dim] = mSeq[dim] - 1.0;
+        sa        = sa.as((af_dtype)dtype_traits<To>::af_type);
+        verify    = sa(mSeq[0], mSeq[1], mSeq[2], mSeq[3]);
     } else {
-        array b = median(a);
-        h_b = b.host<To>();
+        dim_t sdim[4] = {0};
+        sdim[dim]     = 1;
+        sa            = sa.as((af_dtype)dtype_traits<To>::af_type);
+        array sas     = shift(sa, sdim[0], sdim[1], sdim[2], sdim[3]);
+        verify = ((sa + sas) / To(2))(mSeq[0], mSeq[1], mSeq[2], mSeq[3]);
     }
 
-    for (int w = 0; w < nw; w++) {
-        for (int z = 0; z < nz; z++) {
-            for (int y = 0; y < ny; y++) {
+    // Test Part
+    array out = median(a, dim);
 
-                int off = (y  + ny * (z + nz * w));
-                int id = nx / 2;
+    ASSERT_EQ(out.dims() == verify.dims(), true);
+    ASSERT_ARRAYS_EQ(verify, out);
+}
+
+#define MEDIAN_FLAT(To, Ti)                                                    \
+    TEST(MedianFlat, Ti##_flat_even) { median_flat<To, Ti>(1000); }            \
+    TEST(MedianFlat, Ti##_flat_odd) { median_flat<To, Ti>(783); }              \
+    TEST(MedianFlat, Ti##_flat_multi_even) { median_flat<To, Ti>(24, 11, 3); } \
+    TEST(MedianFlat, Ti##_flat_multi_odd) { median_flat<To, Ti>(15, 21, 7); }
 
-                if (nx & 2) {
-                    ASSERT_EQ(h_sa[id + off * nx], h_b[off]);
-                } else {
-                    To left = h_sa[id + off * nx - 1];
-                    To right = h_sa[id + off * nx];
+MEDIAN_FLAT(float, float)
+MEDIAN_FLAT(float, int)
+MEDIAN_FLAT(float, uint)
+MEDIAN_FLAT(float, schar)
+MEDIAN_FLAT(float, uchar)
+MEDIAN_FLAT(float, short)
+MEDIAN_FLAT(float, ushort)
+MEDIAN_FLAT(double, double)
 
-                    ASSERT_NEAR((left + right) / 2, h_b[off], 1e-8);
-                }
-            }
-        }
+#define MEDIAN_TEST(To, Ti, dim)                                               \
+    TEST(Median, Ti##_1D_##dim##_even) { median_test<To, Ti, dim>(1000); }     \
+    TEST(Median, Ti##_2D_##dim##_even) { median_test<To, Ti, dim>(1000, 25); } \
+    TEST(Median, Ti##_3D_##dim##_even) {                                       \
+        median_test<To, Ti, dim>(100, 25, 4);                                  \
+    }                                                                          \
+    TEST(Median, Ti##_4D_##dim##_even) {                                       \
+        median_test<To, Ti, dim>(100, 25, 2, 2);                               \
+    }                                                                          \
+    TEST(Median, Ti##_1D_##dim##_odd) { median_test<To, Ti, dim>(783); }       \
+    TEST(Median, Ti##_2D_##dim##_odd) { median_test<To, Ti, dim>(783, 25); }   \
+    TEST(Median, Ti##_3D_##dim##_odd) {                                        \
+        median_test<To, Ti, dim>(123, 25, 3);                                  \
+    }                                                                          \
+    TEST(Median, Ti##_4D_##dim##_odd) {                                        \
+        median_test<To, Ti, dim>(123, 25, 3, 3);                               \
     }
 
-    delete[] h_sa;
-    if (!flat) delete[] h_b;
+#define MEDIAN(To, Ti)     \
+    MEDIAN_TEST(To, Ti, 0) \
+    MEDIAN_TEST(To, Ti, 1) \
+    MEDIAN_TEST(To, Ti, 2) \
+    MEDIAN_TEST(To, Ti, 3)
+
+MEDIAN(float, float)
+MEDIAN(float, int)
+MEDIAN(float, uint)
+MEDIAN(float, schar)
+MEDIAN(float, uchar)
+MEDIAN(float, short)
+MEDIAN(float, ushort)
+MEDIAN(double, double)
+
+TEST(Median, OneElement) {
+    af::array in = randu(1, f32);
+
+    af::array out = median(in);
+    ASSERT_ARRAYS_EQ(in, out);
 }
 
-#define MEDIAN0(To, Ti)                         \
-    TEST(median0, Ti##_1D_even)                 \
-    {                                           \
-        median0<To, Ti, false>(1000);           \
-    }                                           \
-    TEST(median0, Ti##_2D_even)                 \
-    {                                           \
-        median0<To, Ti, false>(1000, 100);      \
-    }                                           \
-    TEST(median0, Ti##_3D_even)                 \
-    {                                           \
-        median0<To, Ti, false>(1000, 25, 4);    \
-    }                                           \
-    TEST(median0, Ti##_4D_even)                 \
-    {                                           \
-        median0<To, Ti, false>(1000, 25, 2, 2); \
-    }                                           \
-    TEST(median0, Ti##_flat_even)               \
-    {                                           \
-        median0<To, Ti, true>(1000);            \
-    }                                           \
-    TEST(median0, Ti##_1D_odd)                  \
-    {                                           \
-        median0<To, Ti, false>(783);            \
-    }                                           \
-    TEST(median0, Ti##_2D_odd)                  \
-    {                                           \
-        median0<To, Ti, false>(783, 100);       \
-    }                                           \
-    TEST(median0, Ti##_3D_odd)                  \
-    {                                           \
-        median0<To, Ti, false>(783, 25, 4);     \
-    }                                           \
-    TEST(median0, Ti##_4D_odd)                  \
-    {                                           \
-        median0<To, Ti, false>(783, 25, 2, 2);  \
-    }                                           \
-    TEST(median0, Ti##_flat_odd)                \
-    {                                           \
-        median0<To, Ti, true>(783);             \
-    }                                           \
-
-
-MEDIAN0(float, float)
-MEDIAN0(float, int)
-MEDIAN0(float, uint)
-MEDIAN0(float, uchar)
-MEDIAN0(double, double)
+TEST(Median, TwoElements) {
+    af::array in = randu(2, f32);
+
+    af::array out  = median(in);
+    af::array gold = mean(in);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
diff --git a/test/memory.cpp b/test/memory.cpp
index 144039cc61..9214ab472c 100644
--- a/test/memory.cpp
+++ b/test/memory.cpp
@@ -7,113 +7,204 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
+#include <af/internal.h>
+#include <af/memory.h>
 #include <af/traits.hpp>
+
+#include <memory>
+#include <unordered_map>
+#include <unordered_set>
+#include <utility>
 #include <vector>
-#include <iostream>
-#include <string>
-#include <testHelpers.hpp>
 
+using af::alloc;
+using af::allocV2;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::deviceGC;
+using af::deviceMemInfo;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::freeV2;
+using af::randu;
+using af::seq;
+using af::span;
 using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
 
 const size_t step_bytes = 1024;
 
-void cleanSlate()
-{
+TEST(Memory, Scope) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    af::deviceGC();
+    cleanSlate();  // Clean up everything done so far
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    {
+        array a = randu(5, 5);
 
-    ASSERT_EQ(alloc_buffers, 0u);
-    ASSERT_EQ(lock_buffers, 0u);
-    ASSERT_EQ(alloc_bytes, 0u);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+
+        ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+        ASSERT_EQ(lock_bytes, 1 * step_bytes);
+    }
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 0u);  // 0 because a is out of scope
+
+    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
     ASSERT_EQ(lock_bytes, 0u);
+}
+
+template<typename T>
+class MemAlloc : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort>
+    TestTypes;
 
-    af::setMemStepSize(step_bytes);
+// register the type list
+TYPED_TEST_SUITE(MemAlloc, TestTypes);
 
-    ASSERT_EQ(af::getMemStepSize(), step_bytes);
+size_t roundUpToStep(size_t bytes) {
+    if (step_bytes == 0) return bytes;
+
+    size_t remainder = bytes % step_bytes;
+    if (remainder == 0) return bytes;
+
+    return bytes + step_bytes - remainder;
 }
 
-TEST(Memory, GetDevicePtr)
-{
+template<typename T>
+void memAllocArrayScopeTest(int elements) {
+    SUPPORTED_TYPE_CHECK(T);
+
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
-    af::array a = af::randu(5, 5);
+    {
+        array a = randu(elements, (af_dtype)dtype_traits<T>::af_type);
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
-    ASSERT_EQ(alloc_buffers, 1u);
-    ASSERT_EQ(lock_buffers, 1u);
-    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
-    ASSERT_EQ(lock_bytes, 1 * step_bytes);
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
 
-    a.device<float>();
+        ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
+        ASSERT_EQ(lock_bytes, roundUpToStep(elements * sizeof(T)));
+    }
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
-    ASSERT_EQ(lock_buffers, 0u); // 0 because device should unlock the buffer
-    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+    ASSERT_EQ(lock_buffers, 0u);  // 0 because a is out of scope
+
+    ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
     ASSERT_EQ(lock_bytes, 0u);
 }
 
-TEST(Memory, Scope)
-{
+template<typename T>
+void memAllocPtrScopeTest(int elements) {
+    SUPPORTED_TYPE_CHECK(T);
+
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
-
+    cleanSlate();  // Clean up everything done so far
     {
-        af::array a = af::randu(5, 5);
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+        T *ptr = alloc<T>(elements);
 
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                          &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
         ASSERT_EQ(alloc_buffers, 1u);
         ASSERT_EQ(lock_buffers, 1u);
 
-        ASSERT_EQ(alloc_bytes, 1 * step_bytes);
-        ASSERT_EQ(lock_bytes, 1 * step_bytes);
+        ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
+        ASSERT_EQ(lock_bytes, roundUpToStep(elements * sizeof(T)));
+
+        af::free(ptr);
+#pragma GCC diagnostic pop
     }
 
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 0u);  // 0 because ptr has been freed
+
+    ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
+    ASSERT_EQ(lock_bytes, 0u);
+
+    // Repeat the check with the non-templated allocV2/freeV2
+    cleanSlate();  // Clean up everything done so far
+
+    {
+        void *ptr = allocV2(elements * sizeof(T));
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+
+        ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
+        ASSERT_EQ(lock_bytes, roundUpToStep(elements * sizeof(T)));
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+        af::freeV2(ptr);
+    }
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
-    ASSERT_EQ(lock_buffers, 0u); // 0 because a is out of scope
+    ASSERT_EQ(lock_buffers, 0u);  // 0 because ptr has been freed
 
-    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+    ASSERT_EQ(alloc_bytes, roundUpToStep(elements * sizeof(T)));
     ASSERT_EQ(lock_bytes, 0u);
 }
 
-TEST(Memory, SingleSizeLoop)
-{
+TYPED_TEST(MemAlloc, ArrayScope25) { memAllocArrayScopeTest<TypeParam>(25); }
+
+TYPED_TEST(MemAlloc, ArrayScope2048) {
+    memAllocArrayScopeTest<TypeParam>(2048);
+}
+
+TYPED_TEST(MemAlloc, ArrayScope2293) {
+    memAllocArrayScopeTest<TypeParam>(2293);
+}
+
+TYPED_TEST(MemAlloc, PtrScope25) { memAllocPtrScopeTest<TypeParam>(25); }
+
+TYPED_TEST(MemAlloc, PtrScope2048) { memAllocPtrScopeTest<TypeParam>(2048); }
+
+TYPED_TEST(MemAlloc, PtrScope2293) { memAllocPtrScopeTest<TypeParam>(2293); }
+
+TEST(Memory, SingleSizeLoop) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
     {
-        af::array a = af::randu(5, 5);
+        array a = randu(5, 5);
 
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
         ASSERT_EQ(alloc_buffers, 1u);
         ASSERT_EQ(lock_buffers, 1u);
@@ -121,13 +212,14 @@ TEST(Memory, SingleSizeLoop)
         ASSERT_EQ(lock_bytes, 1 * step_bytes);
 
         for (int i = 0; i < 100; i++) {
+            a = randu(5, 5);
 
-            a = af::randu(5,5);
-
-            af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                              &lock_bytes, &lock_buffers);
+            deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes,
+                          &lock_buffers);
 
-            ASSERT_EQ(alloc_buffers, 2u); //2 because a new one is created before a is destroyed
+            ASSERT_EQ(
+                alloc_buffers,
+                2u);  // 2 because a new one is created before a is destroyed
             ASSERT_EQ(lock_buffers, 1u);
             ASSERT_EQ(alloc_bytes, 2 * step_bytes);
             ASSERT_EQ(lock_bytes, 1 * step_bytes);
@@ -135,35 +227,33 @@ TEST(Memory, SingleSizeLoop)
     }
 }
 
-TEST(Memory, LargeLoop)
-{
+TEST(Memory, LargeLoop) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
-    const int num = step_bytes / sizeof(float);
+    const int num    = step_bytes / sizeof(float);
     size_t allocated = step_bytes;
 
-    af::array a = af::randu(num);
+    array a = randu(num);
 
-    std::vector<float> hA(num);
+    vector<float> hA(num);
 
     a.host(&hA[0]);
 
     // Run a large loop that allocates more and more memory at each step
     for (int i = 0; i < 250; i++) {
-        af::array b = af::randu(num * (i + 1));
+        array b        = randu(num * (i + 1));
         size_t current = (i + 1) * step_bytes;
         allocated += current;
 
         // Verify that new buffers are being allocated
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                          &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
-        // Limit to 10 to check before garbage collection
+        // Limit to 10 to check before memory cleanup
         if (i < 10) {
-            ASSERT_EQ(alloc_buffers, (size_t)(i + 2)); //i is zero based
+            ASSERT_EQ(alloc_buffers, (size_t)(i + 2));  // i is zero based
             ASSERT_EQ(lock_buffers, 2u);
 
             ASSERT_EQ(alloc_bytes, allocated);
@@ -171,11 +261,10 @@ TEST(Memory, LargeLoop)
         }
     }
 
-    size_t old_alloc_bytes = alloc_bytes;
+    size_t old_alloc_bytes   = alloc_bytes;
     size_t old_alloc_buffers = alloc_buffers;
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(old_alloc_bytes, alloc_bytes);
     ASSERT_EQ(old_alloc_buffers, alloc_buffers);
@@ -184,19 +273,17 @@ TEST(Memory, LargeLoop)
     ASSERT_EQ(lock_bytes, 1 * step_bytes);
 }
 
-TEST(Memory, IndexingOffset)
-{
+TEST(Memory, IndexingOffset) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
     const int num = step_bytes / sizeof(float);
 
-    af::array a = af::randu(num);
+    array a = randu(num);
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
     ASSERT_EQ(lock_buffers, 1u);
@@ -204,10 +291,9 @@ TEST(Memory, IndexingOffset)
     ASSERT_EQ(lock_bytes, 1 * step_bytes);
 
     {
-        af::array b = a(af::seq(1, num/2)); // Should just be an offset
+        array b = a(seq(1, num / 2));  // Should just be an offset
 
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                          &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
         ASSERT_EQ(alloc_buffers, 1u);
         ASSERT_EQ(lock_buffers, 1u);
@@ -215,31 +301,26 @@ TEST(Memory, IndexingOffset)
         ASSERT_EQ(lock_bytes, 1 * step_bytes);
     }
 
-
     // b should not have deleted a
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
     ASSERT_EQ(lock_buffers, 1u);
     ASSERT_EQ(alloc_bytes, 1 * step_bytes);
     ASSERT_EQ(lock_bytes, 1 * step_bytes);
-
 }
 
-TEST(Memory, IndexingCopy)
-{
+TEST(Memory, IndexingCopy) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
     const int num = step_bytes / sizeof(float);
 
-    af::array a = af::randu(num);
+    array a = randu(num);
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
     ASSERT_EQ(lock_buffers, 1u);
@@ -248,10 +329,9 @@ TEST(Memory, IndexingCopy)
 
     {
         // Should just a copy
-        af::array b = a(af::seq(0, num/2-1, 2));
+        array b = a(seq(0, num / 2 - 1, 2));
 
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                          &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
         ASSERT_EQ(alloc_buffers, 2u);
         ASSERT_EQ(lock_buffers, 2u);
@@ -259,31 +339,97 @@ TEST(Memory, IndexingCopy)
         ASSERT_EQ(lock_bytes, 2 * step_bytes);
     }
 
+    // b should not have deleted a
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 2u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, 2 * step_bytes);
+    ASSERT_EQ(lock_bytes, 1 * step_bytes);
+}
+
+TEST(Memory, Assign) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
+
+    const int num = step_bytes / sizeof(float);
+
+    array a = randu(num);
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+    ASSERT_EQ(lock_bytes, 1 * step_bytes);
+
+    {
+        array b         = randu(num / 2);
+        a(seq(num / 2)) = b;
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 2u);
+        ASSERT_EQ(lock_buffers, 2u);
+        ASSERT_EQ(alloc_bytes, 2 * step_bytes);
+        ASSERT_EQ(lock_bytes, 2 * step_bytes);
+    }
 
     // b should not have deleted a
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 2u);
     ASSERT_EQ(lock_buffers, 1u);
     ASSERT_EQ(alloc_bytes, 2 * step_bytes);
     ASSERT_EQ(lock_bytes, 1 * step_bytes);
+}
+
+TEST(Memory, AssignLoop) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
 
+    const int num  = step_bytes / sizeof(float);
+    const int cols = 100;
+
+    array a = randu(num, cols);
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, cols * step_bytes);
+    ASSERT_EQ(lock_bytes, cols * step_bytes);
+
+    for (int i = 0; i < cols; i++) {
+        array b    = randu(num);
+        a(span, i) = b;
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers,
+                  2u);  // 2 because b needs its own buffer alongside a
+        ASSERT_EQ(lock_buffers, 2u);
+        ASSERT_EQ(alloc_bytes, (cols + 1) * step_bytes);
+        ASSERT_EQ(lock_bytes, (cols + 1) * step_bytes);
+    }
 }
 
-TEST(Memory, Assign)
-{
+TEST(Memory, AssignRef) {
     size_t alloc_bytes, alloc_buffers;
     size_t lock_bytes, lock_buffers;
 
-    cleanSlate(); // Clean up everything done so far
+    cleanSlate();  // Clean up everything done so far
 
     const int num = step_bytes / sizeof(float);
 
-    af::array a = af::randu(num);
+    array a     = randu(num);
+    array a_ref = a;
 
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 1u);
     ASSERT_EQ(lock_buffers, 1u);
@@ -291,29 +437,679 @@ TEST(Memory, Assign)
     ASSERT_EQ(lock_bytes, 1 * step_bytes);
 
     {
-        // Should just a copy
-        af::array b = af::randu(num / 2);
-        a(af::seq(num / 2)) = b;
+        array b = randu(num / 2);
+        // This should do a full copy of a
+        a(seq(num / 2)) = b;
 
-        af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                          &lock_bytes, &lock_buffers);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
-        // FIXME: An extra buffer is used because of copy on write
-        // Fix to not perform a copy when the buffer does not have children
         ASSERT_EQ(alloc_buffers, 3u);
-        ASSERT_EQ(lock_buffers, 2u);
+        ASSERT_EQ(lock_buffers, 3u);
         ASSERT_EQ(alloc_bytes, 3 * step_bytes);
-        ASSERT_EQ(lock_bytes, 2 * step_bytes);
+        ASSERT_EQ(lock_bytes, 3 * step_bytes);
     }
 
+    // b should not have deleted a
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 3u);
+    ASSERT_EQ(lock_buffers, 2u);  // a_ref
+    ASSERT_EQ(alloc_bytes, 3 * step_bytes);
+    ASSERT_EQ(lock_bytes, 2 * step_bytes);  // a_ref
+}
+
+TEST(Memory, AssignRefLoop) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
+
+    const int num  = step_bytes / sizeof(float);
+    const int cols = 100;
+
+    array a     = randu(num, cols);
+    array a_ref = a;
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, cols * step_bytes);
+    ASSERT_EQ(lock_bytes, cols * step_bytes);
+
+    for (int i = 0; i < cols; i++) {
+        array b    = randu(num);
+        a(span, i) = b;
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 3u);
+        ASSERT_EQ(lock_buffers, 3u);
+        ASSERT_EQ(alloc_bytes, (2 * cols + 1) * step_bytes);
+        ASSERT_EQ(lock_bytes, (2 * cols + 1) * step_bytes);
+    }
 
     // b should not have deleted a
-    af::deviceMemInfo(&alloc_bytes, &alloc_buffers,
-                      &lock_bytes, &lock_buffers);
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
 
     ASSERT_EQ(alloc_buffers, 3u);
+    ASSERT_EQ(lock_buffers, 2u);  // a_ref
+    ASSERT_EQ(alloc_bytes, (2 * cols + 1) * step_bytes);
+    ASSERT_EQ(lock_bytes, 2 * cols * step_bytes);  // a_ref
+}
+
+TEST(Memory, device) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
+
+    {
+        array a = randu(5, 5);
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+        ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+        ASSERT_EQ(lock_bytes, 1 * step_bytes);
+
+        a.device<float>();
+
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+        ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+        ASSERT_EQ(lock_bytes, 1 * step_bytes);
+
+        a.unlock();  // to reset the lock flag
+    }
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 0u);
+    ASSERT_EQ(alloc_bytes, 1 * step_bytes);
+    ASSERT_EQ(lock_bytes, 0u);
+}
+
+TEST(Memory, Assign2D) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t alloc_bytes_after, alloc_buffers_after;
+    size_t lock_bytes, lock_buffers;
+    size_t lock_bytes_after, lock_buffers_after;
+
+    cleanSlate();  // Clean up everything done so far
+    {
+        array a       = af::randu(10, 10, f32);
+        unsigned hb[] = {3, 5, 6, 8, 9};
+        array b(5, hb);
+        array c = af::randu(5, f32);
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+        a(b) = c;
+    }
+
+    deviceMemInfo(&alloc_bytes_after, &alloc_buffers_after, &lock_bytes_after,
+                  &lock_buffers_after);
+
+    // Check that the indexed assignment did not allocate extra buffers
+    ASSERT_EQ(alloc_buffers, alloc_buffers_after);
+    ASSERT_EQ(alloc_bytes, alloc_bytes_after);
+}
+
+TEST(Memory, unlock) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
+
+    const dim_t num = step_bytes / sizeof(float);
+
+    vector<float> in(num);
+
+    af_array arr = 0;
+    ASSERT_SUCCESS(af_create_array(&arr, &in[0], 1, &num, f32));
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
     ASSERT_EQ(lock_buffers, 1u);
-    ASSERT_EQ(alloc_bytes, 3 * step_bytes);
-    ASSERT_EQ(lock_bytes, 1 * step_bytes);
+    ASSERT_EQ(alloc_bytes, step_bytes);
+    ASSERT_EQ(lock_bytes, step_bytes);
+
+    // arr gets released by the end of the following code block
+    {
+        array a(arr);
+        a.lock();
+
+        // No new memory should be allocated
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+        ASSERT_EQ(alloc_bytes, step_bytes);
+        ASSERT_EQ(lock_bytes, step_bytes);
+
+        a.unlock();
+    }
+
+    // Making sure all unlocked buffers are freed
+    deviceGC();
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 0u);
+    ASSERT_EQ(lock_buffers, 0u);
+    ASSERT_EQ(alloc_bytes, 0u);
+    ASSERT_EQ(lock_bytes, 0u);
+}
+
+TEST(Memory, IndexedDevice) {
+    // This test is checking to see if calling .device() will force copy to a
+    // new buffer
+    const int nx = 8;
+    const int ny = 8;
+
+    array in = randu(nx, ny);
+
+    vector<float> in1(in.elements());
+    in.host(&in1[0]);
+
+    int offx = nx / 4;
+    int offy = ny / 4;
+
+    in = in(seq(offx, offx + nx / 2 - 1), seq(offy, offy + ny / 2 - 1));
+
+    int nxo = (int)in.dims(0);
+    int nyo = (int)in.dims(1);
+
+    void *rawPtr = getRawPtr(in);
+    void *devPtr = in.device<float>();
+    ASSERT_NE(devPtr, rawPtr);
+    in.unlock();
+
+    vector<float> in2(in.elements());
+    in.host(&in2[0]);
+
+    for (int y = 0; y < nyo; y++) {
+        for (int x = 0; x < nxo; x++) {
+            ASSERT_EQ(in1[(offy + y) * nx + offx + x], in2[y * nxo + x]);
+        }
+    }
+}
+
+namespace {
+
+template<typename T>
+T *getMemoryManagerPayload(af_memory_manager manager) {
+    void *payloadPtr;
+    af_memory_manager_get_payload(manager, &payloadPtr);
+    return (T *)payloadPtr;
+}
+
+/**
+ * An extremely basic memory manager with a basic caching mechanism for testing
+ * purposes. It is not thread safe or optimized.
+ */
+struct E2ETestPayload {
+    int initializeCalledTimes{0};
+    int shutdownCalledTimes{0};
+    std::unordered_map<void *, size_t> table;
+    std::unordered_set<void *> locked;
+    size_t totalBytes{0};
+    size_t totalBuffers{0};
+    size_t lockedBytes{0};
+    unsigned lastNdims;
+    dim4 lastDims;
+    unsigned lastElementSize;
+
+    size_t maxBuffers{64};
+    size_t maxBytes{1024};
+    // Print info args
+    std::string printInfoStringArg;
+    int printInfoDevice{-1};
+};
+
+af_err allocated_fn(af_memory_manager manager, size_t *out, void *ptr) {
+    auto &table = getMemoryManagerPayload<E2ETestPayload>(manager)->table;
+    if (table.find(ptr) == table.end()) {
+        *out = 0;
+    } else {
+        *out = table[ptr];
+    }
+    return AF_SUCCESS;
+}
+
+af_err user_lock_fn(af_memory_manager manager, void *ptr) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    if (payload->locked.find(ptr) == payload->locked.end()) {
+        payload->locked.insert(ptr);
+        payload->lockedBytes += payload->table[ptr];
+    }
+    return AF_SUCCESS;
+}
+
+af_err is_user_locked_fn(af_memory_manager manager, int *out, void *ptr) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    *out          = payload->locked.find(ptr) != payload->locked.end();
+    return AF_SUCCESS;
+}
+
+af_err unlock_fn(af_memory_manager manager, void *ptr, int userLock) {
+    if (!ptr) { return AF_SUCCESS; }
+
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+
+    if (payload->table.find(ptr) == payload->table.end()) {
+        return AF_SUCCESS;  // fast path
+    }
+
+    // For testing, treat user-allocated and AF-allocated memory identically
+    if (payload->locked.find(ptr) != payload->locked.end()) {
+        payload->locked.erase(ptr);
+        payload->lockedBytes -= payload->table[ptr];
+    }
+    return AF_SUCCESS;
+}
+
+af_err user_unlock_fn(af_memory_manager manager, void *ptr) {
+    // unlock_fn already erases ptr from the locked set and subtracts its
+    // size from lockedBytes, so subtracting again here would double count
+    // (and table[ptr] would insert an entry for an unknown pointer).
+    return unlock_fn(manager, ptr, /* user */ 1);
+}
+
+af_err signal_memory_cleanup_fn(af_memory_manager manager) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    // Free unlocked memory
+    std::vector<void *> freed;
+    for (auto &entry : payload->table) {
+        int isUserLocked;
+        is_user_locked_fn(manager, &isUserLocked, entry.first);
+        if (!isUserLocked) {
+            void *ptr = entry.first;
+            af_memory_manager_native_free(manager, ptr);
+            payload->totalBytes -= payload->table[entry.first];
+            freed.push_back(entry.first);
+        }
+    }
+    for (auto ptr : freed) { payload->table.erase(ptr); }
+    return AF_SUCCESS;
+}
+
+af_err print_info_fn(af_memory_manager manager, char *c, int b) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    payload->printInfoStringArg = std::string(c);
+    payload->printInfoDevice    = b;
+    return AF_SUCCESS;
+}
+
+af_err get_memory_pressure_fn(af_memory_manager manager, float *out) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    if (payload->lockedBytes > payload->maxBytes ||
+        payload->totalBuffers > payload->maxBuffers) {
+        *out = 1.0;
+    } else {
+        *out = 0.0;
+    }
+    return AF_SUCCESS;
+}
+
+af_err jit_tree_exceeds_memory_pressure_fn(af_memory_manager manager, int *out,
+                                           size_t bytes) {
+    auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+    *out          = 2 * bytes > payload->totalBytes;
+    return AF_SUCCESS;
+}
+
+af_err alloc_fn(af_memory_manager manager, void **ptr,
+                /* bool */ int userLock, const unsigned ndims, dim_t *dims,
+                const unsigned element_size) {
+    size_t size = element_size;
+    for (unsigned i = 0; i < ndims; ++i) { size *= dims[i]; }
+
+    if (size > 0) {
+        float pressure;
+        get_memory_pressure_fn(manager, &pressure);
+        float threshold;
+        af_memory_manager_get_memory_pressure_threshold(manager, &threshold);
+        if (pressure >= threshold) { signal_memory_cleanup_fn(manager); }
+
+        if (af_err err = af_memory_manager_native_alloc(manager, ptr, size)) {
+            return err;
+        }
+
+        auto *payload        = getMemoryManagerPayload<E2ETestPayload>(manager);
+        payload->table[*ptr] = size;
+        payload->totalBytes += size;
+        payload->totalBuffers++;
+
+        // Simple implementation: treat user and AF allocations the same
+        payload->locked.insert(*ptr);
+        payload->lockedBytes += size;
+
+        payload->lastNdims       = ndims;
+        payload->lastDims        = dim4(ndims, dims);
+        payload->lastElementSize = element_size;
+    }
+
+    return AF_SUCCESS;
+}
+
+void add_memory_management_fn(af_memory_manager manager, int id) {}
+
+void remove_memory_management_fn(af_memory_manager manager, int id) {}
+
+}  // namespace
+
+class MemoryManagerApi : public ::testing::Test {
+   public:
+    af_memory_manager manager;
+    std::unique_ptr<E2ETestPayload> payload{new E2ETestPayload()};
+    void SetUp() override {
+        af_create_memory_manager(&manager);
+
+        // Set payload
+        af_memory_manager_set_payload(manager, payload.get());
+
+        auto initialize_fn = [](af_memory_manager manager) {
+            auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+            payload->initializeCalledTimes++;
+            return AF_SUCCESS;
+        };
+        af_memory_manager_set_initialize_fn(manager, initialize_fn);
+
+        auto shutdown_fn = [](af_memory_manager manager) {
+            auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+            payload->shutdownCalledTimes++;
+            return AF_SUCCESS;
+        };
+        af_memory_manager_set_shutdown_fn(manager, shutdown_fn);
+
+        // alloc
+        af_memory_manager_set_alloc_fn(manager, alloc_fn);
+        af_memory_manager_set_allocated_fn(manager, allocated_fn);
+        af_memory_manager_set_unlock_fn(manager, unlock_fn);
+        // utils
+        af_memory_manager_set_signal_memory_cleanup_fn(
+            manager, signal_memory_cleanup_fn);
+        af_memory_manager_set_print_info_fn(manager, print_info_fn);
+        // user lock/unlock
+        af_memory_manager_set_user_lock_fn(manager, user_lock_fn);
+        af_memory_manager_set_user_unlock_fn(manager, user_unlock_fn);
+        af_memory_manager_set_is_user_locked_fn(manager, is_user_locked_fn);
+        // memory pressure
+        af_memory_manager_set_get_memory_pressure_fn(manager,
+                                                     get_memory_pressure_fn);
+        af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn(
+            manager, jit_tree_exceeds_memory_pressure_fn);
+        // ocl
+        af_memory_manager_set_add_memory_management_fn(
+            manager, add_memory_management_fn);
+        af_memory_manager_set_remove_memory_management_fn(
+            manager, remove_memory_management_fn);
+
+        af_set_memory_manager(manager);
+    }
+
+    void TearDown() override {
+        af_device_gc();
+        af_unset_memory_manager();
+        af_release_memory_manager(manager);
+    }
+};
+
+TEST_F(MemoryManagerApi, E2ETest1D) {
+    size_t aSize = 8;
+
+    array a = af::array(aSize, af::dtype::f32);
+    ASSERT_EQ(payload->table.size(), 1);
+
+    ASSERT_EQ(payload->table[a.device<float>()], aSize * sizeof(float));
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize));
+    ASSERT_EQ(payload->lastElementSize, 4);
+}
+
+TEST_F(MemoryManagerApi, E2ETest2D) {
+    size_t aSize = 8;
+
+    af::array a = af::array(aSize, aSize, af::dtype::f32);
+    ASSERT_EQ(payload->table.size(), 1);
+    ASSERT_EQ(payload->table[a.device<float>()], aSize * aSize * sizeof(float));
+    ASSERT_EQ(payload->lastElementSize, 4);
+
+    // Currently this is set to 1 because all allocations request linear memory
+    // This behavior will change in the future
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize * aSize));
+}
+
+TEST_F(MemoryManagerApi, E2ETest3D) {
+    size_t aSize = 8;
+
+    af::array a = af::array(aSize, aSize, aSize, af::dtype::f32);
+    ASSERT_EQ(payload->table.size(), 1);
+    ASSERT_EQ(payload->table[a.device<float>()],
+              aSize * aSize * aSize * sizeof(float));
+    ASSERT_EQ(payload->lastElementSize, 4);
+
+    // Currently this is set to 1 because all allocations request linear memory
+    // This behavior will change in the future
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize * aSize * aSize));
+}
+
+TEST_F(MemoryManagerApi, E2ETest4D) {
+    size_t aSize = 8;
+
+    af::array a = af::array(aSize, aSize, aSize, aSize, af::dtype::f32);
+    ASSERT_EQ(payload->table.size(), 1);
+    ASSERT_EQ(payload->table[a.device<float>()],
+              aSize * aSize * aSize * aSize * sizeof(float));
+    ASSERT_EQ(payload->lastElementSize, 4);
+
+    // Currently this is set to 1 because all allocations request linear memory
+    // This behavior will change in the future
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize * aSize * aSize * aSize));
+    af::sync();
+}
+
+TEST_F(MemoryManagerApi, E2ETest4DComplexDouble) {
+    SUPPORTED_TYPE_CHECK(double);
+    size_t aSize = 8;
 
+    af::array a = af::array(aSize, aSize, aSize, aSize, af::dtype::c64);
+    ASSERT_EQ(payload->table.size(), 1);
+    ASSERT_EQ(payload->table[a.device<float>()],
+              aSize * aSize * aSize * aSize * sizeof(double) * 2);
+    ASSERT_EQ(payload->lastElementSize, 16);
+
+    // Currently this is set to 1 because all allocations request linear memory
+    // This behavior will change in the future
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize * aSize * aSize * aSize));
+}
+
+TEST_F(MemoryManagerApi, E2ETestMultipleAllocations) {
+    SUPPORTED_TYPE_CHECK(double);
+    size_t aSize = 8;
+
+    af::array a = af::array(aSize, af::dtype::c64);
+    ASSERT_EQ(payload->lastElementSize, 16);
+
+    af::array b = af::array(aSize, af::dtype::f64);
+    ASSERT_EQ(payload->lastElementSize, 8);
+
+    ASSERT_EQ(payload->table.size(), 2);
+    ASSERT_EQ(payload->table[a.device<float>()], aSize * sizeof(double) * 2);
+    ASSERT_EQ(payload->table[b.device<float>()], aSize * sizeof(double));
+
+    // Currently this is set to 1 because all allocations request linear memory
+    // This behavior will change in the future
+    ASSERT_EQ(payload->lastNdims, 1);
+    ASSERT_EQ(payload->lastDims, af::dim4(aSize));
+}
+
+TEST_F(MemoryManagerApi, OutOfMemory) {
+    af::array a;
+    const unsigned N = 99999;
+    try {
+        a = af::randu({N, N, N}, af::dtype::f32);
+        FAIL();
+    } catch (af::exception &ex) {
+        ASSERT_EQ(ex.err(), AF_ERR_NO_MEM);
+    } catch (...) { FAIL(); }
+}
+
+TEST(MemoryManagerE2E, E2ETest) {
+    af_memory_manager manager;
+    af_create_memory_manager(&manager);
+
+    // Set payload
+    std::unique_ptr<E2ETestPayload> payload(new E2ETestPayload());
+    af_memory_manager_set_payload(manager, payload.get());
+
+    auto initialize_fn = [](af_memory_manager manager) {
+        auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+        payload->initializeCalledTimes++;
+        return AF_SUCCESS;
+    };
+    af_memory_manager_set_initialize_fn(manager, initialize_fn);
+
+    auto shutdown_fn = [](af_memory_manager manager) {
+        auto *payload = getMemoryManagerPayload<E2ETestPayload>(manager);
+        payload->shutdownCalledTimes++;
+        return AF_SUCCESS;
+    };
+    af_memory_manager_set_shutdown_fn(manager, shutdown_fn);
+
+    // alloc
+    af_memory_manager_set_alloc_fn(manager, alloc_fn);
+    af_memory_manager_set_allocated_fn(manager, allocated_fn);
+    af_memory_manager_set_unlock_fn(manager, unlock_fn);
+    // utils
+    af_memory_manager_set_signal_memory_cleanup_fn(manager,
+                                                   signal_memory_cleanup_fn);
+    af_memory_manager_set_print_info_fn(manager, print_info_fn);
+    // user lock/unlock
+    af_memory_manager_set_user_lock_fn(manager, user_lock_fn);
+    af_memory_manager_set_user_unlock_fn(manager, user_unlock_fn);
+    af_memory_manager_set_is_user_locked_fn(manager, is_user_locked_fn);
+    // memory pressure
+    af_memory_manager_set_get_memory_pressure_fn(manager,
+                                                 get_memory_pressure_fn);
+    af_memory_manager_set_jit_tree_exceeds_memory_pressure_fn(
+        manager, jit_tree_exceeds_memory_pressure_fn);
+    // ocl
+    af_memory_manager_set_add_memory_management_fn(manager,
+                                                   add_memory_management_fn);
+    af_memory_manager_set_remove_memory_management_fn(
+        manager, remove_memory_management_fn);
+
+    af_set_memory_manager(manager);
+    {
+        size_t aSize = 8;
+
+        void *a = af::allocV2(aSize * sizeof(float));
+        ASSERT_EQ(payload->table.size(), 1);
+
+        ASSERT_EQ(payload->table[a], aSize * sizeof(float));
+        ASSERT_EQ(payload->lastNdims, 1);
+        ASSERT_EQ(payload->lastDims, af::dim4(aSize) * sizeof(float));
+        ASSERT_EQ(payload->lastElementSize, 1);
+
+        dim_t bDim = 2;
+        auto b     = af::randu({bDim, bDim});
+
+        ASSERT_EQ(payload->totalBytes, aSize * sizeof(float) + b.bytes());
+        ASSERT_EQ(payload->totalBuffers, 2);
+        ASSERT_EQ(payload->lockedBytes, aSize * sizeof(float) + b.bytes());
+        ASSERT_EQ(payload->locked.size(), 2);
+        ASSERT_EQ(payload->lastNdims, 1);
+        ASSERT_EQ(payload->lastDims, af::dim4(b.elements()));
+        ASSERT_EQ(payload->lastElementSize, sizeof(float));
+
+        af::freeV2(a);
+
+        ASSERT_EQ(payload->totalBytes, aSize * sizeof(float) + b.bytes());
+        ASSERT_EQ(payload->totalBuffers, 2);
+        ASSERT_EQ(payload->lockedBytes, b.bytes());
+        ASSERT_EQ(payload->locked.size(), 1);
+    }
+
+    // gc
+    af::deviceGC();
+    ASSERT_EQ(payload->table.size(), 0);
+
+    // printInfo
+    std::string printInfoMsg = "testPrintInfo";
+    int printInfoDeviceId    = 0;
+    af::printMemInfo(printInfoMsg.c_str(), printInfoDeviceId);
+    ASSERT_EQ(printInfoMsg, payload->printInfoStringArg);
+    ASSERT_EQ(printInfoDeviceId, payload->printInfoDevice);
+
+    // step size (throws with a custom memory manager)
+    ASSERT_THROW(af::setMemStepSize(500), af::exception);
+    ASSERT_THROW(af::getMemStepSize(), af::exception);
+
+    ASSERT_EQ(payload->table.size(), 0);
+    af_unset_memory_manager();
+    af_release_memory_manager(manager);
+    ASSERT_EQ(payload->initializeCalledTimes, 1);
+    ASSERT_EQ(payload->shutdownCalledTimes, af::getDeviceCount());
+}
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+TEST(Memory, AfAllocDeviceCPUC) {
+    af_backend active_backend;
+    ASSERT_SUCCESS(af_get_active_backend(&active_backend));
+
+    if (active_backend == AF_BACKEND_CPU) {
+        void *ptr;
+        ASSERT_SUCCESS(af_alloc_device(&ptr, sizeof(float)));
+
+        // This is the CPU backend so we can assign to the pointer
+        *static_cast<float *>(ptr) = 5;
+        ASSERT_SUCCESS(af_free_device(ptr));
+    }
+}
+#pragma GCC diagnostic pop
+
+TEST(Memory, AfAllocDeviceV2CPUC) {
+    af_backend active_backend;
+    ASSERT_SUCCESS(af_get_active_backend(&active_backend));
+
+    if (active_backend == AF_BACKEND_CPU) {
+        void *ptr;
+        ASSERT_SUCCESS(af_alloc_device_v2(&ptr, sizeof(float)));
+
+        // This is the CPU backend so we can assign to the pointer
+        *static_cast<float *>(ptr) = 5;
+        ASSERT_SUCCESS(af_free_device_v2(ptr));
+    }
+}
+
+TEST(Memory, SNIPPET_AllocCPU) {
+    af_backend active_backend;
+    ASSERT_SUCCESS(af_get_active_backend(&active_backend));
+
+    if (active_backend == AF_BACKEND_CPU) {
+        //! [ex_alloc_v2_cpu]
+
+        // Allocate one float and cast to float*
+        void *ptr   = af::allocV2(sizeof(float));
+        float *dptr = static_cast<float *>(ptr);
+
+        // This is the CPU backend so we can assign to the pointer
+        dptr[0] = 5.0f;
+        const float stored = dptr[0];  // read back before the buffer is freed
+        freeV2(ptr);
+
+        //! [ex_alloc_v2_cpu]
+        ASSERT_EQ(stored, 5.0f);
+    }
 }
diff --git a/test/memory_lock.cpp b/test/memory_lock.cpp
new file mode 100644
index 0000000000..5f1ca8034e
--- /dev/null
+++ b/test/memory_lock.cpp
@@ -0,0 +1,73 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::deviceGC;
+using af::deviceMemInfo;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+const size_t step_bytes = 1024;
+
+// This test should be by itself as it leaks memory intentionally
+TEST(Memory, lock) {
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    cleanSlate();  // Clean up everything done so far
+
+    const dim_t num = step_bytes / sizeof(float);
+
+    vector<float> in(num);
+
+    af_array arr = 0;
+    ASSERT_SUCCESS(af_create_array(&arr, &in[0], 1, &num, f32));
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, step_bytes);
+    ASSERT_EQ(lock_bytes, step_bytes);
+
+    // arr gets released by the end of the following code block
+    {
+        array a(arr);
+        a.lock();
+
+        // No new memory should be allocated
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        ASSERT_EQ(alloc_buffers, 1u);
+        ASSERT_EQ(lock_buffers, 1u);
+        ASSERT_EQ(alloc_bytes, step_bytes);
+        ASSERT_EQ(lock_bytes, step_bytes);
+    }
+
+    // Making sure all unlocked buffers are freed
+    deviceGC();
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 1u);
+    ASSERT_EQ(lock_buffers, 1u);
+    ASSERT_EQ(alloc_bytes, step_bytes);
+    ASSERT_EQ(lock_bytes, step_bytes);
+}
diff --git a/test/missing.cpp b/test/missing.cpp
index ff318ac5bf..d76b035c91 100644
--- a/test/missing.cpp
+++ b/test/missing.cpp
@@ -8,18 +8,19 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <af/array.h>
+#include <testHelpers.hpp>
 #include <af/arith.h>
+#include <af/array.h>
 #include <af/data.h>
-#include <testHelpers.hpp>
+#include <af/image.h>
+#include <af/lapack.h>
+#include <af/random.h>
 
 using namespace af;
 
-TEST(MissingFunctionTests, Dummy)
-{
-    array A = randu(10,10, f32);
+TEST(MissingFunctionTests, Dummy) {
+    array A = randu(10, 10, f32);
     af_print(A);
-    af_print(rank(A));
     af_print(arg(A));
     af_print(arg(complex(A, A)));
     af_print(trunc(3 * A));
@@ -28,8 +29,7 @@ TEST(MissingFunctionTests, Dummy)
     af_print(root(2, A));
     af_print(A - 0.5);
     af_print(sign(A - 0.5));
-    af_print(minfilt(A, 3, 3) - erode(A, constant(1, 3,3)));
-    af_print(maxfilt(A, 3, 3) - dilate(A, constant(1, 3,3)));
+    af_print(minfilt(A, 3, 3) - erode(A, constant(1, 3, 3)));
+    af_print(maxfilt(A, 3, 3) - dilate(A, constant(1, 3, 3)));
     printf("%lf\n", norm(A));
-    printf("%lf\n", det<double>(A));
 }
diff --git a/test/mmio/CMakeLists.txt b/test/mmio/CMakeLists.txt
new file mode 100644
index 0000000000..5f4bd419f0
--- /dev/null
+++ b/test/mmio/CMakeLists.txt
@@ -0,0 +1,23 @@
+# Copyright (c) 2018, ArrayFire
+# All rights reserved.
+#
+# This file is distributed under 3-clause BSD license.
+# The complete license agreement can be obtained at:
+# http://arrayfire.com/licenses/BSD-3-Clause
+
+cmake_minimum_required(VERSION 3.10.2)
+
+project(MatrixMarketIO LANGUAGES C)
+
+add_library(mmio STATIC mmio.c)
+
+target_include_directories(mmio
+    PUBLIC
+      $<BUILD_INTERFACE:${MatrixMarketIO_SOURCE_DIR}>
+    )
+
+target_compile_definitions(mmio PUBLIC USE_MTX)
+
+if(WIN32)
+  target_compile_definitions(mmio PRIVATE _CRT_SECURE_NO_WARNINGS)
+endif()
diff --git a/test/mmio/mmio.c b/test/mmio/mmio.c
new file mode 100644
index 0000000000..5777bf8fdb
--- /dev/null
+++ b/test/mmio/mmio.c
@@ -0,0 +1,426 @@
+/*
+ *   Matrix Market I/O library for ANSI C
+ *
+ *   See http://math.nist.gov/MatrixMarket for details.
+ *
+ *
+ */
+
+#include <ctype.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "mmio.h"
+
+int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
+                               double **val_, int **I_, int **J_) {
+    FILE *f;
+    MM_typecode matcode;
+    int M, N, nz;
+    int i;
+    double *val;
+    int *I, *J;
+
+    if ((f = fopen(fname, "r")) == NULL) return -1;
+
+    if (mm_read_banner(f, &matcode) != 0) {
+        printf("mm_read_unsymmetric: Could not process Matrix Market banner");
+        printf(" in file [%s]\n", fname);
+        return -1;
+    }
+
+    if (!(mm_is_real(matcode) && mm_is_matrix(matcode) &&
+          mm_is_sparse(matcode))) {
+        fprintf(stderr, "Sorry, this application does not support ");
+        fprintf(stderr, "Matrix Market type: [%s]\n",
+                mm_typecode_to_str(matcode));
+        return -1;
+    }
+
+    /* find out size of sparse matrix: M, N, nz .... */
+
+    if (mm_read_mtx_crd_size(f, &M, &N, &nz) != 0) {
+        fprintf(stderr,
+                "read_unsymmetric_sparse(): could not parse matrix size.\n");
+        return -1;
+    }
+
+    *M_  = M;
+    *N_  = N;
+    *nz_ = nz;
+
+    /* reserve memory for matrices */
+
+    I   = (int *)malloc(nz * sizeof(int));
+    J   = (int *)malloc(nz * sizeof(int));
+    val = (double *)malloc(nz * sizeof(double));
+
+    *val_ = val;
+    *I_   = I;
+    *J_   = J;
+
+    /* NOTE: when reading in doubles, ANSI C requires the use of the "l"  */
+    /*   specifier as in "%lg", "%lf", "%le", otherwise errors will occur */
+    /*  (ANSI C X3.159-1989, Sec. 4.9.6.2, p. 136 lines 13-15)            */
+
+    for (i = 0; i < nz; i++) {
+        fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i]);
+        I[i]--; /* adjust from 1-based to 0-based */
+        J[i]--;
+    }
+    fclose(f);
+
+    return 0;
+}
+
+int mm_is_valid(MM_typecode matcode) {
+    if (!mm_is_matrix(matcode)) return 0;
+    if (mm_is_dense(matcode) && mm_is_pattern(matcode)) return 0;
+    if (mm_is_real(matcode) && mm_is_hermitian(matcode)) return 0;
+    if (mm_is_pattern(matcode) &&
+        (mm_is_hermitian(matcode) || mm_is_skew(matcode)))
+        return 0;
+    return 1;
+}
+
+int mm_read_banner(FILE *f, MM_typecode *matcode) {
+    char line[MM_MAX_LINE_LENGTH];
+    char banner[MM_MAX_TOKEN_LENGTH];
+    char mtx[MM_MAX_TOKEN_LENGTH];
+    char crd[MM_MAX_TOKEN_LENGTH];
+    char data_type[MM_MAX_TOKEN_LENGTH];
+    char storage_scheme[MM_MAX_TOKEN_LENGTH];
+    char *p;
+
+    mm_clear_typecode(matcode);
+
+    if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL) return MM_PREMATURE_EOF;
+
+    if (sscanf(line, "%s %s %s %s %s", banner, mtx, crd, data_type,
+               storage_scheme) != 5)
+        return MM_PREMATURE_EOF;
+
+    for (p = mtx; *p != '\0'; *p = tolower(*p), p++)
+        ; /* convert to lower case */
+    for (p = crd; *p != '\0'; *p = tolower(*p), p++)
+        ;
+    for (p = data_type; *p != '\0'; *p = tolower(*p), p++)
+        ;
+    for (p = storage_scheme; *p != '\0'; *p = tolower(*p), p++)
+        ;
+
+    /* check for banner */
+    if (strncmp(banner, MatrixMarketBanner, strlen(MatrixMarketBanner)) != 0)
+        return MM_NO_HEADER;
+
+    /* first field should be "mtx" */
+    if (strcmp(mtx, MM_MTX_STR) != 0) return MM_UNSUPPORTED_TYPE;
+    mm_set_matrix(matcode);
+
+    /* second field describes whether this is a sparse matrix (in coordinate
+            storage) or a dense array */
+
+    if (strcmp(crd, MM_SPARSE_STR) == 0)
+        mm_set_sparse(matcode);
+    else if (strcmp(crd, MM_DENSE_STR) == 0)
+        mm_set_dense(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+    /* third field */
+
+    if (strcmp(data_type, MM_REAL_STR) == 0)
+        mm_set_real(matcode);
+    else if (strcmp(data_type, MM_COMPLEX_STR) == 0)
+        mm_set_complex(matcode);
+    else if (strcmp(data_type, MM_PATTERN_STR) == 0)
+        mm_set_pattern(matcode);
+    else if (strcmp(data_type, MM_INT_STR) == 0)
+        mm_set_integer(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+    /* fourth field */
+
+    if (strcmp(storage_scheme, MM_GENERAL_STR) == 0)
+        mm_set_general(matcode);
+    else if (strcmp(storage_scheme, MM_SYMM_STR) == 0)
+        mm_set_symmetric(matcode);
+    else if (strcmp(storage_scheme, MM_HERM_STR) == 0)
+        mm_set_hermitian(matcode);
+    else if (strcmp(storage_scheme, MM_SKEW_STR) == 0)
+        mm_set_skew(matcode);
+    else
+        return MM_UNSUPPORTED_TYPE;
+
+    return 0;
+}
+
+int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz) {
+    if (fprintf(f, "%d %d %d\n", M, N, nz) < 0) /* negative means write error */
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz) {
+    char line[MM_MAX_LINE_LENGTH];
+    int num_items_read;
+
+    /* set return null parameter values, in case we exit with errors */
+    *M = *N = *nz = 0;
+
+    /* now continue scanning until you reach the end-of-comments */
+    do {
+        if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL) return MM_PREMATURE_EOF;
+    } while (line[0] == '%');
+
+    /* line[] is either blank or has M,N, nz */
+    if (sscanf(line, "%d %d %d", M, N, nz) == 3)
+        return 0;
+
+    else
+        do {
+            num_items_read = fscanf(f, "%d %d %d", M, N, nz);
+            if (num_items_read == EOF) return MM_PREMATURE_EOF;
+        } while (num_items_read != 3);
+
+    return 0;
+}
+
+int mm_read_mtx_array_size(FILE *f, int *M, int *N) {
+    char line[MM_MAX_LINE_LENGTH];
+    int num_items_read;
+    /* set return null parameter values, in case we exit with errors */
+    *M = *N = 0;
+
+    /* now continue scanning until you reach the end-of-comments */
+    do {
+        if (fgets(line, MM_MAX_LINE_LENGTH, f) == NULL) return MM_PREMATURE_EOF;
+    } while (line[0] == '%');
+
+    /* line[] is either blank or has M, N */
+    if (sscanf(line, "%d %d", M, N) == 2)
+        return 0;
+
+    else /* we have a blank line */
+        do {
+            num_items_read = fscanf(f, "%d %d", M, N);
+            if (num_items_read == EOF) return MM_PREMATURE_EOF;
+        } while (num_items_read != 2);
+
+    return 0;
+}
+
+int mm_write_mtx_array_size(FILE *f, int M, int N) {
+    if (fprintf(f, "%d %d\n", M, N) < 0) /* negative means write error */
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+/*-------------------------------------------------------------------------*/
+
+/******************************************************************/
+/* use when I[], J[], and val[] are already allocated             */
+/******************************************************************/
+
+int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
+                         double val[], MM_typecode matcode) {
+    int i;
+    if (mm_is_complex(matcode)) {
+        for (i = 0; i < nz; i++)
+            if (fscanf(f, "%d %d %lg %lg", &I[i], &J[i], &val[2 * i],
+                       &val[2 * i + 1]) != 4)
+                return MM_PREMATURE_EOF;
+    } else if (mm_is_real(matcode)) {
+        for (i = 0; i < nz; i++) {
+            if (fscanf(f, "%d %d %lg\n", &I[i], &J[i], &val[i]) != 3)
+                return MM_PREMATURE_EOF;
+        }
+    }
+
+    else if (mm_is_pattern(matcode)) {
+        for (i = 0; i < nz; i++)
+            if (fscanf(f, "%d %d", &I[i], &J[i]) != 2) return MM_PREMATURE_EOF;
+    } else
+        return MM_UNSUPPORTED_TYPE;
+
+    return 0;
+}
+
+int mm_read_mtx_crd_entry(FILE *f, int *I, int *J, double *real, double *imag,
+                          MM_typecode matcode) {
+    if (mm_is_complex(matcode)) {
+        if (fscanf(f, "%d %d %lg %lg", I, J, real, imag) != 4)
+            return MM_PREMATURE_EOF;
+    } else if (mm_is_real(matcode)) {
+        if (fscanf(f, "%d %d %lg\n", I, J, real) != 3) return MM_PREMATURE_EOF;
+
+    }
+
+    else if (mm_is_pattern(matcode)) {
+        if (fscanf(f, "%d %d", I, J) != 2) return MM_PREMATURE_EOF;
+    } else
+        return MM_UNSUPPORTED_TYPE;
+
+    return 0;
+}
+
+/************************************************************************
+    mm_read_mtx_crd()  fills M, N, nz, array of values, and returns
+                        type code, e.g. 'MCRS'
+
+                        if matrix is complex, values[] is of size 2*nz,
+                            (nz pairs of real/imaginary values)
+************************************************************************/
+
+int mm_read_mtx_crd(char *fname, int *M, int *N, int *nz, int **I, int **J,
+                    double **val, MM_typecode *matcode) {
+    int ret_code;
+    FILE *f;
+
+    if (strcmp(fname, "stdin") == 0)
+        f = stdin;
+    else if ((f = fopen(fname, "r")) == NULL)
+        return MM_COULD_NOT_READ_FILE;
+
+    if ((ret_code = mm_read_banner(f, matcode)) != 0) return ret_code;
+
+    if (!(mm_is_valid(*matcode) && mm_is_sparse(*matcode) &&
+          mm_is_matrix(*matcode)))
+        return MM_UNSUPPORTED_TYPE;
+
+    if ((ret_code = mm_read_mtx_crd_size(f, M, N, nz)) != 0) return ret_code;
+
+    *I   = (int *)malloc(*nz * sizeof(int));
+    *J   = (int *)malloc(*nz * sizeof(int));
+    *val = NULL;
+
+    if (mm_is_complex(*matcode)) {
+        *val     = (double *)malloc(*nz * 2 * sizeof(double));
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, *matcode);
+        if (ret_code != 0) return ret_code;
+    } else if (mm_is_real(*matcode)) {
+        *val     = (double *)malloc(*nz * sizeof(double));
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, *matcode);
+        if (ret_code != 0) return ret_code;
+    }
+
+    else if (mm_is_pattern(*matcode)) {
+        ret_code = mm_read_mtx_crd_data(f, *M, *N, *nz, *I, *J, *val, *matcode);
+        if (ret_code != 0) return ret_code;
+    }
+
+    if (f != stdin) fclose(f);
+    return 0;
+}
+
+int mm_write_banner(FILE *f, MM_typecode matcode) {
+    char *str = mm_typecode_to_str(matcode);
+    int ret_code;
+
+    ret_code = fprintf(f, "%s %s\n", MatrixMarketBanner, str);
+    free(str);
+    if (ret_code < 0) /* fprintf returns a character count, negative on error */
+        return MM_COULD_NOT_WRITE_FILE;
+    else
+        return 0;
+}
+
+int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
+                     double val[], MM_typecode matcode) {
+    FILE *f;
+    int i;
+
+    if (strcmp(fname, "stdout") == 0)
+        f = stdout;
+    else if ((f = fopen(fname, "w")) == NULL)
+        return MM_COULD_NOT_WRITE_FILE;
+
+    /* print banner followed by typecode */
+    fprintf(f, "%s ", MatrixMarketBanner);
+    fprintf(f, "%s\n", mm_typecode_to_str(matcode));
+
+    /* print matrix sizes and nonzeros */
+    fprintf(f, "%d %d %d\n", M, N, nz);
+
+    /* print values */
+    if (mm_is_pattern(matcode))
+        for (i = 0; i < nz; i++) fprintf(f, "%d %d\n", I[i], J[i]);
+    else if (mm_is_real(matcode))
+        for (i = 0; i < nz; i++)
+            fprintf(f, "%d %d %20.16g\n", I[i], J[i], val[i]);
+    else if (mm_is_complex(matcode))
+        for (i = 0; i < nz; i++)
+            fprintf(f, "%d %d %20.16g %20.16g\n", I[i], J[i], val[2 * i],
+                    val[2 * i + 1]);
+    else {
+        if (f != stdout) fclose(f);
+        return MM_UNSUPPORTED_TYPE;
+    }
+
+    if (f != stdout) fclose(f);
+
+    return 0;
+}
+
+/**
+ *  Create a new copy of a string s.  strdup() is a common routine, but
+ *  not part of ANSI C, so it is included here.  Used by mm_typecode_to_str().
+ *
+ */
+char *mm_strdup(const char *s) {
+    int len  = strlen(s);
+    char *s2 = (char *)malloc((len + 1) * sizeof(char));
+    return strcpy(s2, s);
+}
+
+char *mm_typecode_to_str(MM_typecode matcode) {
+    char buffer[MM_MAX_LINE_LENGTH];
+    char *types[4];
+    char *mm_strdup(const char *);
+
+    /* check for MTX type; return early so that types[0] is never read
+       uninitialized by the sprintf() below */
+    if (mm_is_matrix(matcode))
+        types[0] = MM_MTX_STR;
+    else
+        return NULL;
+
+    /* check for CRD or ARR matrix */
+    if (mm_is_sparse(matcode))
+        types[1] = MM_SPARSE_STR;
+    else if (mm_is_dense(matcode))
+        types[1] = MM_DENSE_STR;
+    else
+        return NULL;
+
+    /* check for element data type */
+    if (mm_is_real(matcode))
+        types[2] = MM_REAL_STR;
+    else if (mm_is_complex(matcode))
+        types[2] = MM_COMPLEX_STR;
+    else if (mm_is_pattern(matcode))
+        types[2] = MM_PATTERN_STR;
+    else if (mm_is_integer(matcode))
+        types[2] = MM_INT_STR;
+    else
+        return NULL;
+
+    /* check for symmetry type */
+    if (mm_is_general(matcode))
+        types[3] = MM_GENERAL_STR;
+    else if (mm_is_symmetric(matcode))
+        types[3] = MM_SYMM_STR;
+    else if (mm_is_hermitian(matcode))
+        types[3] = MM_HERM_STR;
+    else if (mm_is_skew(matcode))
+        types[3] = MM_SKEW_STR;
+    else
+        return NULL;
+
+    sprintf(buffer, "%s %s %s %s", types[0], types[1], types[2], types[3]);
+    return mm_strdup(buffer);
+}
diff --git a/test/mmio/mmio.h b/test/mmio/mmio.h
new file mode 100644
index 0000000000..160b8e6bc6
--- /dev/null
+++ b/test/mmio/mmio.h
@@ -0,0 +1,133 @@
+/*
+ *   Matrix Market I/O library for ANSI C
+ *
+ *   See http://math.nist.gov/MatrixMarket for details.
+ *
+ *
+ */
+
+#ifndef MM_IO_H
+#define MM_IO_H
+
+#define MM_MAX_LINE_LENGTH 1025
+#define MatrixMarketBanner "%%MatrixMarket"
+#define MM_MAX_TOKEN_LENGTH 64
+
+typedef char MM_typecode[4];
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+char *mm_typecode_to_str(MM_typecode matcode);
+
+int mm_read_banner(FILE *f, MM_typecode *matcode);
+int mm_read_mtx_crd_size(FILE *f, int *M, int *N, int *nz);
+int mm_read_mtx_array_size(FILE *f, int *M, int *N);
+
+int mm_write_banner(FILE *f, MM_typecode matcode);
+int mm_write_mtx_crd_size(FILE *f, int M, int N, int nz);
+int mm_write_mtx_array_size(FILE *f, int M, int N);
+
+/********************* MM_typecode query functions ***************************/
+
+#define mm_is_matrix(typecode) ((typecode)[0] == 'M')
+
+#define mm_is_sparse(typecode) ((typecode)[1] == 'C')
+#define mm_is_coordinate(typecode) ((typecode)[1] == 'C')
+#define mm_is_dense(typecode) ((typecode)[1] == 'A')
+#define mm_is_array(typecode) ((typecode)[1] == 'A')
+
+#define mm_is_complex(typecode) ((typecode)[2] == 'C')
+#define mm_is_real(typecode) ((typecode)[2] == 'R')
+#define mm_is_pattern(typecode) ((typecode)[2] == 'P')
+#define mm_is_integer(typecode) ((typecode)[2] == 'I')
+
+#define mm_is_symmetric(typecode) ((typecode)[3] == 'S')
+#define mm_is_general(typecode) ((typecode)[3] == 'G')
+#define mm_is_skew(typecode) ((typecode)[3] == 'K')
+#define mm_is_hermitian(typecode) ((typecode)[3] == 'H')
+
+int mm_is_valid(MM_typecode matcode); /* too complex for a macro */
+
+/********************* MM_typecode modify functions ***************************/
+
+#define mm_set_matrix(typecode) ((*typecode)[0] = 'M')
+#define mm_set_coordinate(typecode) ((*typecode)[1] = 'C')
+#define mm_set_array(typecode) ((*typecode)[1] = 'A')
+#define mm_set_dense(typecode) mm_set_array(typecode)
+#define mm_set_sparse(typecode) mm_set_coordinate(typecode)
+
+#define mm_set_complex(typecode) ((*typecode)[2] = 'C')
+#define mm_set_real(typecode) ((*typecode)[2] = 'R')
+#define mm_set_pattern(typecode) ((*typecode)[2] = 'P')
+#define mm_set_integer(typecode) ((*typecode)[2] = 'I')
+
+#define mm_set_symmetric(typecode) ((*typecode)[3] = 'S')
+#define mm_set_general(typecode) ((*typecode)[3] = 'G')
+#define mm_set_skew(typecode) ((*typecode)[3] = 'K')
+#define mm_set_hermitian(typecode) ((*typecode)[3] = 'H')
+
+#define mm_clear_typecode(typecode)                          \
+    ((*typecode)[0] = (*typecode)[1] = (*typecode)[2] = ' ', \
+     (*typecode)[3]                                   = 'G')
+
+#define mm_initialize_typecode(typecode) mm_clear_typecode(typecode)
+
+/********************* Matrix Market error codes ***************************/
+
+#define MM_COULD_NOT_READ_FILE 11
+#define MM_PREMATURE_EOF 12
+#define MM_NOT_MTX 13
+#define MM_NO_HEADER 14
+#define MM_UNSUPPORTED_TYPE 15
+#define MM_LINE_TOO_LONG 16
+#define MM_COULD_NOT_WRITE_FILE 17
+
+/******************** Matrix Market internal definitions ********************
+
+   MM_matrix_typecode: 4-character sequence
+
+                    object 		sparse/   	data        storage
+                                dense     	type        scheme
+
+   string position:	 [0]        [1]			[2]         [3]
+
+   Matrix typecode:  M(atrix)  C(oord)		R(eal)   	G(eneral)
+                                A(rray) 	C(omplex)   H(ermitian)
+                                            P(attern)   S(ymmetric)
+                                            I(nteger)	K (skew)
+
+ ***********************************************************************/
+
+#define MM_MTX_STR "matrix"
+#define MM_ARRAY_STR "array"
+#define MM_DENSE_STR "array"
+#define MM_COORDINATE_STR "coordinate"
+#define MM_SPARSE_STR "coordinate"
+#define MM_COMPLEX_STR "complex"
+#define MM_REAL_STR "real"
+#define MM_INT_STR "integer"
+#define MM_GENERAL_STR "general"
+#define MM_SYMM_STR "symmetric"
+#define MM_HERM_STR "hermitian"
+#define MM_SKEW_STR "skew-symmetric"
+#define MM_PATTERN_STR "pattern"
+
+/*  high level routines */
+
+int mm_write_mtx_crd(char fname[], int M, int N, int nz, int I[], int J[],
+                     double val[], MM_typecode matcode);
+int mm_read_mtx_crd_data(FILE *f, int M, int N, int nz, int I[], int J[],
+                         double val[], MM_typecode matcode);
+int mm_read_mtx_crd_entry(FILE *f, int *I, int *J, double *real, double *imag,
+                          MM_typecode matcode);
+
+int mm_read_unsymmetric_sparse(const char *fname, int *M_, int *N_, int *nz_,
+                               double **val_, int **I_, int **J_);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif
diff --git a/test/moddims.cpp b/test/moddims.cpp
index 5fe751bbc0..c8b98f05d1 100644
--- a/test/moddims.cpp
+++ b/test/moddims.cpp
@@ -7,237 +7,376 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
+#include <cstdlib>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Moddims : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat.push_back(af_make_seq(1,2,1));
-            subMat.push_back(af_make_seq(1,3,1));
-        }
-        vector<af_seq> subMat;
+class Moddims : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat.push_back(af_make_seq(1, 2, 1));
+        subMat.push_back(af_make_seq(1, 3, 1));
+    }
+    vector<af_seq> subMat;
 };
 
 // create a list of types to be tested
 // TODO: complex types tests have to be added
-typedef ::testing::Types<float, double, int, unsigned, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, int, unsigned, char, signed char,
+                         unsigned char, short, ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Moddims, TestTypes);
+TYPED_TEST_SUITE(Moddims, TestTypes);
 
 template<typename T>
-void moddimsTest(string pTestFile, bool isSubRef=false, const vector<af_seq> *seqv=NULL)
-{
-    if (noDoubleTests<T>()) return;
+void moddimsTest(string pTestFile, bool isSubRef = false,
+                 const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     T *outData;
 
     if (isSubRef) {
-        af_array inArray   = 0;
-        af_array subArray  = 0;
-        af_array outArray  = 0;
+        af_array inArray  = 0;
+        af_array subArray = 0;
+        af_array outArray = 0;
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&subArray,inArray,seqv->size(),&seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&subArray, inArray, seqv->size(), &seqv->front()));
 
-        af::dim4 newDims(1);
+        dim4 newDims(1);
         newDims[0] = 2;
         newDims[1] = 3;
-        ASSERT_EQ(AF_SUCCESS, af_moddims(&outArray,subArray,newDims.ndims(),newDims.get()));
+        ASSERT_SUCCESS(
+            af_moddims(&outArray, subArray, newDims.ndims(), newDims.get()));
 
         dim_t nElems;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems,outArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, outArray));
 
-        outData          = new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        outData = new T[nElems];
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(subArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(subArray));
     } else {
-        af_array inArray   = 0;
-        af_array outArray  = 0;
+        af_array inArray  = 0;
+        af_array outArray = 0;
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        af::dim4 newDims(1);
+        dim4 newDims(1);
         newDims[0] = dims[1];
-        newDims[1] = dims[0]*dims[2];
-        ASSERT_EQ(AF_SUCCESS, af_moddims(&outArray,inArray,newDims.ndims(),newDims.get()));
+        newDims[1] = dims[0] * dims[2];
+        ASSERT_SUCCESS(
+            af_moddims(&outArray, inArray, newDims.ndims(), newDims.get()));
 
-        outData          = new T[dims.elements()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        outData = new T[dims.elements()];
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
     }
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
-        vector<T> currGoldBar   = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter],outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
+        vector<T> currGoldBar = tests[testIter];
+        size_t nElems         = currGoldBar.size();
+        for (size_t elIter = 0; elIter < nElems; ++elIter) {
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
     delete[] outData;
 }
 
-TYPED_TEST(Moddims,Basic)
-{
-    moddimsTest<TypeParam>(string(TEST_DIR"/moddims/basic.test"));
+TYPED_TEST(Moddims, Basic) {
+    moddimsTest<TypeParam>(string(TEST_DIR "/moddims/basic.test"));
 }
 
-TYPED_TEST(Moddims,Subref)
-{
-    moddimsTest<TypeParam>(string(TEST_DIR"/moddims/subref.test"),true,&(this->subMat));
+TYPED_TEST(Moddims, Subref) {
+    moddimsTest<TypeParam>(string(TEST_DIR "/moddims/subref.test"), true,
+                           &(this->subMat));
 }
 
-
 template<typename T>
-void moddimsArgsTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void moddimsArgsTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     af_array inArray   = 0;
     af_array outArray  = 0;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    af_array outArray2 = 0;
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    af::dim4 newDims(1);
+    dim4 newDims(1);
     newDims[0] = dims[1];
-    newDims[1] = dims[0]*dims[2];
-    ASSERT_EQ(AF_ERR_ARG, af_moddims(&outArray,inArray,0,newDims.get()));
-    ASSERT_EQ(AF_ERR_ARG, af_moddims(&outArray,inArray,newDims.ndims(),NULL));
+    newDims[1] = dims[0] * dims[2];
+    ASSERT_SUCCESS(af_moddims(&outArray, inArray, 0, newDims.get()));
+    ASSERT_EQ(AF_ERR_ARG,
+              af_moddims(&outArray2, inArray, newDims.ndims(), NULL));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(Moddims,InvalidArgs)
-{
-    moddimsArgsTest<TypeParam>(string(TEST_DIR"/moddims/basic.test"));
+TYPED_TEST(Moddims, InvalidArgs) {
+    moddimsArgsTest<TypeParam>(string(TEST_DIR "/moddims/basic.test"));
 }
 
 template<typename T>
-void moddimsMismatchTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void moddimsMismatchTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
-    af_array inArray   = 0;
-    af_array outArray  = 0;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    af_array inArray  = 0;
+    af_array outArray = 0;
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    af::dim4 newDims(1);
-    newDims[0] = dims[1]-1;
-    newDims[1] = (dims[0]-1)*dims[2];
-    ASSERT_EQ(AF_ERR_SIZE, af_moddims(&outArray,inArray,newDims.ndims(),newDims.get()));
+    dim4 newDims(1);
+    newDims[0] = dims[1] - 1;
+    newDims[1] = (dims[0] - 1) * dims[2];
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_moddims(&outArray, inArray, newDims.ndims(), newDims.get()));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(Moddims,Mismatch)
-{
-    moddimsMismatchTest<TypeParam>(string(TEST_DIR"/moddims/basic.test"));
+TYPED_TEST(Moddims, Mismatch) {
+    moddimsMismatchTest<TypeParam>(string(TEST_DIR "/moddims/basic.test"));
 }
 
-
 /////////////////////////////////// CPP ///////////////////////////////////
 //
+
+using af::array;
+
 template<typename T>
-void cppModdimsTest(string pTestFile, bool isSubRef=false, const vector<af_seq> *seqv=NULL)
-{
-    if (noDoubleTests<T>()) return;
+void cppModdimsTest(string pTestFile, bool isSubRef = false,
+                    const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
     T *outData;
 
     if (isSubRef) {
-        af::array input(dims, &(in[0].front()));
+        array input(dims, &(in[0].front()));
 
-        af::array subArray = input(seqv->at(0), seqv->at(1));
+        array subArray = input(seqv->at(0), seqv->at(1));
 
-        af::dim4 newDims(1);
-        newDims[0] = 2;
-        newDims[1] = 3;
-        af::array output = af::moddims(subArray, newDims.ndims(), newDims.get());
+        dim4 newDims(1);
+        newDims[0]   = 2;
+        newDims[1]   = 3;
+        array output = moddims(subArray, newDims.ndims(), newDims.get());
 
         dim_t nElems = output.elements();
-        outData = new T[nElems];
-        output.host((void*)outData);
+        outData      = new T[nElems];
+        output.host((void *)outData);
     } else {
-        af::array input(dims, &(in[0].front()));
+        array input(dims, &(in[0].front()));
 
-        af::dim4 newDims(1);
+        dim4 newDims(1);
         newDims[0] = dims[1];
-        newDims[1] = dims[0]*dims[2];
+        newDims[1] = dims[0] * dims[2];
 
-        af::array output = af::moddims(input, newDims.ndims(), newDims.get());
+        array output = moddims(input, newDims.ndims(), newDims.get());
 
         outData = new T[dims.elements()];
-        output.host((void*)outData);
+        output.host((void *)outData);
     }
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
-        vector<T> currGoldBar   = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter],outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
+        vector<T> currGoldBar = tests[testIter];
+        size_t nElems         = currGoldBar.size();
+        for (size_t elIter = 0; elIter < nElems; ++elIter) {
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
     delete[] outData;
 }
 
-TEST(Moddims,Basic_CPP)
-{
-    cppModdimsTest<float>(string(TEST_DIR"/moddims/basic.test"));
+TEST(Moddims, Basic_CPP) {
+    cppModdimsTest<float>(string(TEST_DIR "/moddims/basic.test"));
 }
 
-TEST(Moddims,Subref_CPP)
-{
+TEST(Moddims, Subref_CPP) {
     vector<af_seq> subMat;
-    subMat.push_back(af_make_seq(1,2,1));
-    subMat.push_back(af_make_seq(1,3,1));
-    cppModdimsTest<float>(string(TEST_DIR"/moddims/subref.test"),true,&subMat);
+    subMat.push_back(af_make_seq(1, 2, 1));
+    subMat.push_back(af_make_seq(1, 3, 1));
+    cppModdimsTest<float>(string(TEST_DIR "/moddims/subref.test"), true,
+                          &subMat);
 }
+
+TEST(Moddims, jit) {
+    using namespace af;
+    array c1 = constant(1, 10, 5);
+    c1.eval();
+    array c2 = randu(10, 10);
+
+    vector<float> hc2(100);
+    c2.host(hc2.data());
+
+    array c3 = c2(span, seq(5));
+    c3.eval();
+
+    array a = c1;
+    a       = a + c3;
+    a       = moddims(a, 5, 10);
+    a       = a + constant(2, 5, 10);
+
+    for (size_t i = 0; i < hc2.size(); i++) { hc2[i] += 3; }
+
+    array gold(10, 5, hc2.data());
+    gold = moddims(gold, 5, 10);
+    ASSERT_ARRAYS_EQ(gold, a);
+}
+
+TEST(Moddims, JitNested) {
+    array a    = af::constant(1, 5, 5);
+    array b    = moddims(moddims(moddims(a, 25), 1, 5, 5), 5, 5);
+    array gold = af::constant(1, 5, 5);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, b);
+}
+
+TEST(Moddims, JitDuplicate) {
+    array a = af::constant(1, 5, 5);
+    array b = af::moddims(a, 25);
+    array c = b + b;
+
+    array gold = af::constant(2, 25);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, c);
+}
+
+TEST(Moddims, JitNestedAndDuplicate) {
+    array a = af::constant(1, 10, 10);
+    array b = af::constant(1, 10, 10);
+    array c = af::constant(2, 100) + moddims(a + b, 100);
+    array d = moddims(
+        moddims(af::constant(2, 1, 10, 10) + moddims(c, 1, 10, 10), 100), 10,
+        10);
+    array e    = d + d;
+    array gold = af::constant(12, 10, 10);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, e);
+}
+
+TEST(Moddims, JitTileThenModdims) {
+    array a    = af::constant(1, 10);
+    array b    = tile(a, 1, 10);
+    array c    = moddims(b, 100);
+    array gold = af::constant(1, 100);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, c);
+}
+
+TEST(Moddims, JitModdimsThenTiled) {
+    array a    = af::constant(1, 10);
+    array b    = moddims(a, 1, 10);
+    array c    = tile(b, 10);
+    array gold = af::constant(1, 10, 10);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, c);
+}
+
+TEST(Moddims, JitTileThenMultipleModdims) {
+    array a    = af::constant(1, 10);
+    array b    = tile(a, 1, 10);
+    array c    = moddims(moddims(b, 100), 10, 10);
+    array gold = af::constant(1, 10, 10);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, c);
+}
+
+TEST(Moddims, JitMultipleModdimsThenTiled) {
+    array a    = af::constant(1, 10);
+    array b    = moddims(moddims(a, 1, 10), 1, 1, 10);
+    array c    = tile(b, 10);
+    array gold = af::constant(1, 10, 1, 10);
+    gold.eval();
+    ASSERT_ARRAYS_EQ(gold, c);
+}
+
+TEST(Moddims, SNIPPET_data_func_moddims) {
+    // clang-format off
+    //! [ex_data_func_moddims]
+    //!
+    // Create a, a 2x3 array
+    array a = iota(dim4(2, 3));           // a = [0, 2, 4,
+                                          //      1, 3, 5]
+
+    // Create b by modifying the dimensions of a to the shape described by a dim4 object
+    array b = moddims(a, dim4(3, 2));     // b = [0, 3,
+                                          //      1, 4,
+                                          //      2, 5]
+
+    // Create c by modifying the dimensions of a to the shape described by dimension length parameters
+    array c = moddims(a, 3, 2);           // c = [0, 3,
+                                          //      1, 4,
+                                          //      2, 5]
+
+    // Create d by modifying the dimensions of a to the shape described by an array of ndims dimensions
+    vector<dim_t> x{3, 2};
+    array d = moddims(a, 2, x.data());    // d = [0, 3,
+                                          //      1, 4,
+                                          //      2, 5]
+
+    //! [ex_data_func_moddims]
+    // clang-format on
+
+    vector<float> gold_a{0, 1, 2, 3, 4, 5};
+
+    ASSERT_VEC_ARRAY_EQ(gold_a, dim4(3, 2), b);
+    ASSERT_VEC_ARRAY_EQ(gold_a, dim4(3, 2), c);
+    ASSERT_VEC_ARRAY_EQ(gold_a, dim4(3, 2), d);
+}
\ No newline at end of file
diff --git a/test/moments.cpp b/test/moments.cpp
new file mode 100644
index 0000000000..bec90e5b5d
--- /dev/null
+++ b/test/moments.cpp
@@ -0,0 +1,190 @@
+/*******************************************************
+ * Copyright (c) 2016, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/array.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::identity;
+using af::loadImage;
+using af::max;
+using af::min;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Image : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int> TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(Image, TestTypes);
+
+template<typename T>
+void momentsTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+
+    vector<vector<T>> in;
+    vector<vector<float>> tests;
+    readTests<T, float, float>(pTestFile, numDims, in, tests);
+
+    array imgArray(numDims.front(), &in.front()[0]);
+
+    array momentsArray = moments(imgArray, AF_MOMENT_M00);
+    vector<float> mData(momentsArray.elements());
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[0][i], mData[i], 4e-3 * tests[0][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M01);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[1][i], mData[i], 8e-3 * tests[1][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M10);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[2][i], mData[i], 3e-3 * tests[2][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M11);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[3][i], mData[i], 7e-3 * tests[3][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_FIRST_ORDER);
+    mData.resize(momentsArray.elements());
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements() / 4; ++i) {
+        ASSERT_NEAR(tests[0][i], mData[4 * i], 1e-3 * tests[0][i])
+            << "at: " << 4 * i << endl;
+        ASSERT_NEAR(tests[1][i], mData[4 * i + 1], 1e-3 * tests[1][i])
+            << "at: " << 4 * i + 1 << endl;
+        ASSERT_NEAR(tests[2][i], mData[4 * i + 2], 1e-3 * tests[2][i])
+            << "at: " << 4 * i + 2 << endl;
+        ASSERT_NEAR(tests[3][i], mData[4 * i + 3], 1e-3 * tests[3][i])
+            << "at: " << 4 * i + 3 << endl;
+    }
+}
+
+void momentsOnImageTest(string pTestFile, string pImageFile, bool isColor) {
+    IMAGEIO_ENABLED_CHECK();
+    vector<dim4> numDims;
+
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(pTestFile, numDims, in, tests);
+
+    array imgArray = loadImage(pImageFile.c_str(), isColor);
+
+    double maxVal = max<double>(imgArray);
+    double minVal = min<double>(imgArray);
+    imgArray -= minVal;
+    imgArray /= maxVal - minVal;
+
+    array momentsArray = moments(imgArray, AF_MOMENT_M00);
+
+    vector<float> mData(momentsArray.elements());
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[0][i], mData[i], 1e-2 * tests[0][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M01);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[1][i], mData[i], 1e-2 * tests[1][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M10);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[2][i], mData[i], 1e-2 * tests[2][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_M11);
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements(); ++i) {
+        ASSERT_NEAR(tests[3][i], mData[i], 1e-2 * tests[3][i])
+            << "at: " << i << endl;
+    }
+
+    momentsArray = moments(imgArray, AF_MOMENT_FIRST_ORDER);
+    mData.resize(momentsArray.elements());
+    momentsArray.host(&mData[0]);
+    for (int i = 0; i < momentsArray.elements() / 4; ++i) {
+        ASSERT_NEAR(tests[0][i], mData[4 * i], 1e-2 * tests[0][i])
+            << "at: " << 4 * i << endl;
+        ASSERT_NEAR(tests[1][i], mData[4 * i + 1], 1e-2 * tests[1][i])
+            << "at: " << 4 * i + 1 << endl;
+        ASSERT_NEAR(tests[2][i], mData[4 * i + 2], 1e-2 * tests[2][i])
+            << "at: " << 4 * i + 2 << endl;
+        ASSERT_NEAR(tests[3][i], mData[4 * i + 3], 1e-2 * tests[3][i])
+            << "at: " << 4 * i + 3 << endl;
+    }
+}
+
+TEST(IMAGE, MomentsImage) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    momentsOnImageTest(string(TEST_DIR "/moments/gray_seq_16_moments.test"),
+                       string(TEST_DIR "/imageio/gray_seq_16.png"), false);
+}
+
+TEST(Image, MomentsImageBatch) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    momentsTest<float>(
+        string(TEST_DIR "/moments/simple_mat_batch_moments.test"));
+}
+
+TEST(Image, MomentsBatch2D) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    momentsOnImageTest(string(TEST_DIR "/moments/color_seq_16_moments.test"),
+                       string(TEST_DIR "/imageio/color_seq_16.png"), true);
+}
+
+TYPED_TEST(Image, MomentsSynthTypes) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    momentsTest<TypeParam>(string(TEST_DIR "/moments/simple_mat_moments.test"));
+}
+
+TEST(Image, Moment_Issue1957) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    array A = identity(3, 3, b8);
+
+    double m00;
+    moments(&m00, A, AF_MOMENT_M00);
+    ASSERT_EQ(m00, 3);
+}
diff --git a/test/morph.cpp b/test/morph.cpp
index 04de84f8a9..b68d95076f 100644
--- a/test/morph.cpp
+++ b/test/morph.cpp
@@ -7,405 +7,514 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/data.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class Morph : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Morph : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Morph, TestTypes);
+TYPED_TEST_SUITE(Morph, TestTypes);
 
 template<typename inType, bool isDilation, bool isVolume>
-void morphTest(string pTestFile)
-{
-    if (noDoubleTests<inType>()) return;
+void morphTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(inType);
 
-    vector<af::dim4>       numDims;
-    vector<vector<inType> >      in;
-    vector<vector<inType> >   tests;
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<inType>> tests;
 
-    readTests<inType,inType,int>(pTestFile, numDims, in, tests);
+    readTests<inType, inType, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims      = numDims[0];
-    af::dim4 maskDims  = numDims[1];
+    dim4 dims          = numDims[0];
+    dim4 maskDims      = numDims[1];
     af_array outArray  = 0;
     af_array inArray   = 0;
     af_array maskArray = 0;
-    inType *outData;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<inType>::af_type));
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &(in[1].front()),
-                maskDims.ndims(), maskDims.get(), (af_dtype)af::dtype_traits<inType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &(in[1].front()),
+                                   maskDims.ndims(), maskDims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
 
+    // Select the 2D or 3D (volume) variant of the requested operation.
     if (isDilation) {
-        if (isVolume)
-            ASSERT_EQ(AF_SUCCESS, af_dilate3(&outArray, inArray, maskArray));
-        else
-            ASSERT_EQ(AF_SUCCESS, af_dilate(&outArray, inArray, maskArray));
-    }
-    else {
-        if (isVolume)
-            ASSERT_EQ(AF_SUCCESS, af_erode3(&outArray, inArray, maskArray));
-        else
-            ASSERT_EQ(AF_SUCCESS, af_erode(&outArray, inArray, maskArray));
+        if (isVolume) {
+            ASSERT_SUCCESS(af_dilate3(&outArray, inArray, maskArray));
+        } else {
+            ASSERT_SUCCESS(af_dilate(&outArray, inArray, maskArray));
+        }
+    } else {
+        if (isVolume) {
+            ASSERT_SUCCESS(af_erode3(&outArray, inArray, maskArray));
+        } else {
+            ASSERT_SUCCESS(af_erode(&outArray, inArray, maskArray));
+        }
     }
 
-    outData = new inType[dims.elements()];
-
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
-
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<inType> currGoldBar = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter<< std::endl;
-        }
+        ASSERT_VEC_ARRAY_EQ(currGoldBar, dims, outArray);
     }
 
     // cleanup
-    delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(Morph, Dilate3x3)
-{
-    morphTest<TypeParam, true, false>(string(TEST_DIR"/morph/dilate3x3.test"));
+TYPED_TEST(Morph, Dilate3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, false>(string(TEST_DIR "/morph/dilate3x3.test"));
 }
 
-TYPED_TEST(Morph, Erode3x3)
-{
-    morphTest<TypeParam, false, false>(string(TEST_DIR"/morph/erode3x3.test"));
+TYPED_TEST(Morph, Erode3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, false, false>(string(TEST_DIR "/morph/erode3x3.test"));
 }
 
-TYPED_TEST(Morph, Dilate3x3_Batch)
-{
-    morphTest<TypeParam, true, false>(string(TEST_DIR"/morph/dilate3x3_batch.test"));
+TYPED_TEST(Morph, Dilate4x4) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, false>(string(TEST_DIR "/morph/dilate4x4.test"));
 }
 
-TYPED_TEST(Morph, Erode3x3_Batch)
-{
-    morphTest<TypeParam, false, false>(string(TEST_DIR"/morph/erode3x3_batch.test"));
+TYPED_TEST(Morph, Dilate12x12) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, false>(
+        string(TEST_DIR "/morph/dilate12x12.test"));
 }
 
-TYPED_TEST(Morph, Dilate3x3x3)
-{
-    morphTest<TypeParam, true, true>(string(TEST_DIR"/morph/dilate3x3x3.test"));
+TYPED_TEST(Morph, Erode4x4) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, false, false>(string(TEST_DIR "/morph/erode4x4.test"));
 }
 
-TYPED_TEST(Morph, Erode3x3x3)
-{
-    morphTest<TypeParam, false, true>(string(TEST_DIR"/morph/erode3x3x3.test"));
+TYPED_TEST(Morph, Dilate3x3_Batch) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, false>(
+        string(TEST_DIR "/morph/dilate3x3_batch.test"));
 }
 
-template<typename T, bool isDilation, bool isColor>
-void morphImageTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+TYPED_TEST(Morph, Erode3x3_Batch) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, false, false>(
+        string(TEST_DIR "/morph/erode3x3_batch.test"));
+}
 
-    using af::dim4;
+TYPED_TEST(Morph, Dilate3x3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, true>(
+        string(TEST_DIR "/morph/dilate3x3x3.test"));
+}
+
+TYPED_TEST(Morph, Erode3x3x3) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, false, true>(
+        string(TEST_DIR "/morph/erode3x3x3.test"));
+}
+
+TYPED_TEST(Morph, Dilate4x4x4) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, true, true>(
+        string(TEST_DIR "/morph/dilate4x4x4.test"));
+}
+
+TYPED_TEST(Morph, Erode4x4x4) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphTest<TypeParam, false, true>(
+        string(TEST_DIR "/morph/erode4x4x4.test"));
+}
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+template<typename T, bool isDilation, bool isColor>
+void morphImageTest(string pTestFile, dim_t seLen) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
     readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-
-        af_array inArray  = 0;
-        af_array maskArray= 0;
-        af_array outArray = 0;
-        af_array goldArray= 0;
-        dim_t nElems   = 0;
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array _inArray   = 0;
+        af_array inArray    = 0;
+        af_array maskArray  = 0;
+        af_array outArray   = 0;
+        af_array _goldArray = 0;
+        af_array goldArray  = 0;
+        dim_t nElems        = 0;
 
-        inFiles[testId].insert(0,string(TEST_DIR"/morph/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/morph/"));
+        inFiles[testId].insert(0, string(TEST_DIR "/morph/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/morph/"));
 
-        dim4 mdims(3,3,1,1);
-        ASSERT_EQ(AF_SUCCESS, af_constant(&maskArray, 1.0,
-                    mdims.ndims(), mdims.get(), (af_dtype)af::dtype_traits<T>::af_type));
+        af_dtype targetType = static_cast<af_dtype>(dtype_traits<T>::af_type);
 
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray, inFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&goldArray, outFiles[testId].c_str(), isColor));
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems, goldArray));
+        dim4 mdims(seLen, seLen, 1, 1);
+        ASSERT_SUCCESS(af_constant(&maskArray, 1.0, mdims.ndims(), mdims.get(),
+                                   targetType));
 
-        if (isDilation)
-            ASSERT_EQ(AF_SUCCESS, af_dilate(&outArray, inArray, maskArray));
-        else
-            ASSERT_EQ(AF_SUCCESS, af_erode(&outArray, inArray, maskArray));
+        ASSERT_SUCCESS(
+            af_load_image(&_inArray, inFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(af_cast(&inArray, _inArray, targetType));
 
-        T * outData = new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        ASSERT_SUCCESS(
+            af_load_image(&_goldArray, outFiles[testId].c_str(), isColor));
+        ASSERT_SUCCESS(af_cast(&goldArray, _goldArray, targetType));
 
-        T * goldData= new T[nElems];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)goldData, goldArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, goldArray));
 
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.018f));
+        af_err error_code = AF_SUCCESS;
+        if (isDilation) {
+            error_code = af_dilate(&outArray, inArray, maskArray);
+        } else {
+            error_code = af_erode(&outArray, inArray, maskArray);
+        }
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(goldArray));
+#if defined(AF_CPU)
+        ASSERT_SUCCESS(error_code);
+        ASSERT_IMAGES_NEAR(goldArray, outArray, 0.018f);
+#else
+        if (targetType != b8 && seLen > 19) {
+            ASSERT_EQ(error_code, AF_ERR_NOT_SUPPORTED);
+        } else {
+            ASSERT_SUCCESS(error_code);
+            ASSERT_IMAGES_NEAR(goldArray, outArray, 0.018f);
+        }
+#endif
+
+        ASSERT_SUCCESS(af_release_array(_inArray));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(maskArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(_goldArray));
+        ASSERT_SUCCESS(af_release_array(goldArray));
     }
 }
 
-TEST(Morph, Grayscale)
-{
-    morphImageTest<float, true, false>(string(TEST_DIR"/morph/gray.test"));
+TEST(Morph, GrayscaleDilation3x3StructuringElement) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphImageTest<float, true, false>(string(TEST_DIR "/morph/gray.test"), 3);
+}
+
+TEST(Morph, ColorImageErosion3x3StructuringElement) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    morphImageTest<float, false, true>(string(TEST_DIR "/morph/color.test"), 3);
+}
+
+TEST(Morph, BinaryImageDilationBy33x33Kernel) {
+    morphImageTest<char, true, false>(
+        string(TEST_DIR "/morph/zag_dilation.test"), 33);
+}
+
+TEST(Morph, BinaryImageErosionBy33x33Kernel) {
+    morphImageTest<char, false, false>(
+        string(TEST_DIR "/morph/zag_erosion.test"), 33);
 }
 
-TEST(Morph, ColorImage)
-{
-    morphImageTest<float, false, true>(string(TEST_DIR"/morph/color.test"));
+TEST(Morph, DilationBy33x33Kernel) {
+    morphImageTest<float, true, true>(
+        string(TEST_DIR "/morph/baboon_dilation.test"), 33);
+}
+
+TEST(Morph, ErosionBy33x33Kernel) {
+    morphImageTest<float, false, true>(
+        string(TEST_DIR "/morph/baboon_erosion.test"), 33);
 }
 
 template<typename T, bool isDilation>
-void morphInputTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void morphInputTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array inArray   = 0;
     af_array maskArray = 0;
     af_array outArray  = 0;
 
-    vector<T>   in(100,1);
-    vector<T>   mask(9,1);
+    vector<T> in(100, 1);
+    vector<T> mask(9, 1);
 
     // Check for 1D inputs
-    af::dim4 dims = af::dim4(100,1,1,1);
-    af::dim4 mdims(3,3,1,1);
+    dim4 dims = dim4(100, 1, 1, 1);
+    dim4 mdims(3, 3, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &mask.front(),
-                mdims.ndims(), mdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &mask.front(), mdims.ndims(),
+                                   mdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     if (isDilation)
         ASSERT_EQ(AF_ERR_SIZE, af_dilate(&outArray, inArray, maskArray));
     else
         ASSERT_EQ(AF_ERR_SIZE, af_erode(&outArray, inArray, maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
 }
 
-TYPED_TEST(Morph, DilateInvalidInput)
-{
-    morphInputTest<TypeParam,true>();
-}
+TYPED_TEST(Morph, DilateInvalidInput) { morphInputTest<TypeParam, true>(); }
 
-TYPED_TEST(Morph, ErodeInvalidInput)
-{
-    morphInputTest<TypeParam,false>();
-}
+TYPED_TEST(Morph, ErodeInvalidInput) { morphInputTest<TypeParam, false>(); }
 
 template<typename T, bool isDilation>
-void morphMaskTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void morphMaskTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array inArray   = 0;
     af_array maskArray = 0;
     af_array outArray  = 0;
 
-    vector<T>   in(100,1);
-    vector<T>   mask(16,1);
+    vector<T> in(100, 1);
+    vector<T> mask(16, 1);
 
     // Check for 4D mask
-    af::dim4 dims(10,10,1,1);
-    af::dim4 mdims(2,2,2,2);
+    dim4 dims(10, 10, 1, 1);
+    dim4 mdims(2, 2, 2, 2);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &mask.front(),
-                mdims.ndims(), mdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &mask.front(), mdims.ndims(),
+                                   mdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     if (isDilation)
         ASSERT_EQ(AF_ERR_SIZE, af_dilate(&outArray, inArray, maskArray));
     else
         ASSERT_EQ(AF_ERR_SIZE, af_erode(&outArray, inArray, maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
 
     // Check for 1D mask
-    mdims = af::dim4(16,1,1,1);
+    mdims = dim4(16, 1, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &mask.front(),
-                mdims.ndims(), mdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &mask.front(), mdims.ndims(),
+                                   mdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     if (isDilation)
         ASSERT_EQ(AF_ERR_SIZE, af_dilate(&outArray, inArray, maskArray));
     else
         ASSERT_EQ(AF_ERR_SIZE, af_erode(&outArray, inArray, maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(Morph, DilateInvalidMask)
-{
-    morphMaskTest<TypeParam,true>();
-}
+TYPED_TEST(Morph, DilateInvalidMask) { morphMaskTest<TypeParam, true>(); }
 
-TYPED_TEST(Morph, ErodeInvalidMask)
-{
-    morphMaskTest<TypeParam,false>();
-}
+TYPED_TEST(Morph, ErodeInvalidMask) { morphMaskTest<TypeParam, false>(); }
 
 template<typename T, bool isDilation>
-void morph3DMaskTest(void)
-{
-    if (noDoubleTests<T>()) return;
+void morph3DMaskTest(void) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array inArray   = 0;
     af_array maskArray = 0;
     af_array outArray  = 0;
 
-    vector<T>   in(1000,1);
-    vector<T>   mask(81,1);
+    vector<T> in(1000, 1);
+    vector<T> mask(81, 1);
 
     // Check for 2D mask
-    af::dim4 dims(10,10,10,1);
-    af::dim4 mdims(9,9,1,1);
+    dim4 dims(10, 10, 10, 1);
+    dim4 mdims(9, 9, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(),
-                dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &mask.front(),
-                mdims.ndims(), mdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &mask.front(), mdims.ndims(),
+                                   mdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     if (isDilation)
         ASSERT_EQ(AF_ERR_SIZE, af_dilate3(&outArray, inArray, maskArray));
     else
         ASSERT_EQ(AF_ERR_SIZE, af_erode3(&outArray, inArray, maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
 
     // Check for 4D mask
-    mdims = af::dim4(3,3,3,3);
+    mdims = dim4(3, 3, 3, 3);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&maskArray, &mask.front(),
-                mdims.ndims(), mdims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&maskArray, &mask.front(), mdims.ndims(),
+                                   mdims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     if (isDilation)
         ASSERT_EQ(AF_ERR_SIZE, af_dilate3(&outArray, inArray, maskArray));
     else
         ASSERT_EQ(AF_ERR_SIZE, af_erode3(&outArray, inArray, maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(maskArray));
+    ASSERT_SUCCESS(af_release_array(maskArray));
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-TYPED_TEST(Morph, DilateVolumeInvalidMask)
-{
-    morph3DMaskTest<TypeParam,true>();
+TYPED_TEST(Morph, DilateVolumeInvalidMask) {
+    morph3DMaskTest<TypeParam, true>();
 }
 
-TYPED_TEST(Morph, ErodeVolumeInvalidMask)
-{
-    morph3DMaskTest<TypeParam,false>();
+TYPED_TEST(Morph, ErodeVolumeInvalidMask) {
+    morph3DMaskTest<TypeParam, false>();
 }
 
-
 ////////////////////////////////////// CPP //////////////////////////////////
 //
-template<typename T, bool isDilation, bool isColor>
-void cppMorphImageTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
 
-    using af::dim4;
+using af::array;
+using af::constant;
+using af::erode;
+using af::iota;
+using af::loadImage;
+using af::max;
+using af::randu;
+using af::seq;
+using af::span;
+
+template<typename T, bool isDilation, bool isColor>
+void cppMorphImageTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>       inDims;
-    vector<string>    inFiles;
+    vector<dim4> inDims;
+    vector<string> inFiles;
     vector<dim_t> outSizes;
-    vector<string>   outFiles;
+    vector<string> outFiles;
 
     readImageTests(pTestFile, inDims, inFiles, outSizes, outFiles);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-        inFiles[testId].insert(0,string(TEST_DIR"/morph/"));
-        outFiles[testId].insert(0,string(TEST_DIR"/morph/"));
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/morph/"));
+        outFiles[testId].insert(0, string(TEST_DIR "/morph/"));
 
-        af::array mask = af::constant(1.0, 3, 3);
-        af::array img = af::loadImage(inFiles[testId].c_str(), isColor);
-        af::array gold = af::loadImage(outFiles[testId].c_str(), isColor);
-        dim_t nElems   = gold.elements();
-        af::array output;
+        array mask   = constant(1.0, 3, 3);
+        array img    = loadImage(inFiles[testId].c_str(), isColor);
+        array gold   = loadImage(outFiles[testId].c_str(), isColor);
+        dim_t nElems = gold.elements();
+        array output;
 
         if (isDilation)
             output = dilate(img, mask);
         else
             output = erode(img, mask);
 
-        T * outData = new T[nElems];
-        output.host((void*)outData);
+        vector<T> outData(nElems);
+        output.host((void*)outData.data());
 
-        T * goldData= new T[nElems];
-        gold.host((void*)goldData);
+        vector<T> goldData(nElems);
+        gold.host((void*)goldData.data());
 
-        ASSERT_EQ(true, compareArraysRMSD(nElems, goldData, outData, 0.018f));
-        //cleanup
-        delete[] outData;
-        delete[] goldData;
+        ASSERT_TRUE(compareArraysRMSD(nElems, goldData.data(),
+                                      outData.data(), 0.018f));
     }
 }
 
-TEST(Morph, Grayscale_CPP)
-{
-    cppMorphImageTest<float, true, false>(string(TEST_DIR"/morph/gray.test"));
+TEST(Morph, Grayscale_CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    cppMorphImageTest<float, true, false>(string(TEST_DIR "/morph/gray.test"));
 }
 
-TEST(Morph, ColorImage_CPP)
-{
-    cppMorphImageTest<float, false, true>(string(TEST_DIR"/morph/color.test"));
+TEST(Morph, ColorImage_CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    cppMorphImageTest<float, false, true>(string(TEST_DIR "/morph/color.test"));
 }
 
-using namespace af;
-TEST(Morph, GFOR)
-{
-    dim4 dims = dim4(10, 10, 3);
-    array A = iota(dims);
-    array B = constant(0, dims);
-    array mask = randu(3,3) > 0.3;
+TEST(Morph, GFOR) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    dim4 dims  = dim4(10, 10, 3);
+    array A    = iota(dims);
+    array B    = constant(0, dims);
+    array mask = randu(3, 3) > 0.3;
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = erode(A(span, span, ii), mask);
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = erode(A(span, span, ii), mask); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = erode(A(span, span, ii), mask);
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
     }
 }
+
+TEST(Morph, EdgeIssue1564) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    int inputData[10 * 10] = {0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                              0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
+                              0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1};
+    int goldData[10 * 10]  = {0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,
+                              0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+                              0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0,
+                              0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1,
+                              1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1,
+                              1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1};
+    array input(10, 10, inputData);
+    int maskData[3 * 3] = {1, 1, 1, 1, 0, 1, 1, 1, 1};
+    array mask(3, 3, maskData);
+
+    array dilated = dilate(input.as(b8), mask.as(b8));
+
+    size_t nElems = dilated.elements();
+    vector<char> outData(nElems);
+    dilated.host((void*)outData.data());
+
+    for (size_t i = 0; i < nElems; ++i) {
+        ASSERT_EQ((int)outData[i], goldData[i]);
+    }
+}
+
+TEST(Morph, UnsupportedKernel2D) {
+    const unsigned ndims = 2;
+    const dim_t dims[2]  = {10, 10};
+    const dim_t kdims[2] = {32, 32};
+
+    af_array in, mask, out;
+
+    ASSERT_SUCCESS(af_constant(&mask, 1.0, ndims, kdims, f32));
+    ASSERT_SUCCESS(af_randu(&in, ndims, dims, f32));
+
+#if defined(AF_CPU)
+    ASSERT_SUCCESS(af_dilate(&out, in, mask));
+    ASSERT_SUCCESS(af_release_array(out));
+#else
+    ASSERT_EQ(AF_ERR_NOT_SUPPORTED, af_dilate(&out, in, mask));
+#endif
+    ASSERT_SUCCESS(af_release_array(in));
+    ASSERT_SUCCESS(af_release_array(mask));
+}
diff --git a/test/nearest_neighbour.cpp b/test/nearest_neighbour.cpp
new file mode 100644
index 0000000000..82551bc31b
--- /dev/null
+++ b/test/nearest_neighbour.cpp
@@ -0,0 +1,595 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype_traits;
+using af::randu;
+using af::range;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class NearestNeighbour : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create lists of types to be tested
+typedef ::testing::Types<float, double, int, uint, intl, uintl, schar, uchar,
+                         short, ushort>
+    TestTypes;
+
+template<typename T>
+struct otype_t {
+    typedef T otype;
+};
+
+template<>
+struct otype_t<short> {
+    typedef int otype;
+};
+
+template<>
+struct otype_t<ushort> {
+    typedef uint otype;
+};
+
+template<>
+struct otype_t<schar> {
+    typedef int otype;
+};
+
+template<>
+struct otype_t<uchar> {
+    typedef uint otype;
+};
+
+// register the type list
+TYPED_TEST_SUITE(NearestNeighbour, TestTypes);
+
+template<typename T>
+void nearestNeighbourTest(string pTestFile, int feat_dim,
+                          const af_match_type type) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    typedef typename otype_t<T>::otype To;
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<uint>> tests;
+
+    readTests<T, uint, uint>(pTestFile, numDims, in, tests);
+
+    dim4 qDims     = numDims[0];
+    dim4 tDims     = numDims[1];
+    af_array query = 0;
+    af_array train = 0;
+    af_array idx   = 0;
+    af_array dist  = 0;
+
+    ASSERT_SUCCESS(af_create_array(&query, &(in[0].front()), qDims.ndims(),
+                                   qDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&train, &(in[1].front()), tDims.ndims(),
+                                   tDims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(
+        af_nearest_neighbour(&idx, &dist, query, train, feat_dim, 1, type));
+
+    vector<uint> goldIdx  = tests[0];
+    vector<uint> goldDist = tests[1];
+    size_t nElems         = goldIdx.size();
+    uint *outIdx          = new uint[nElems];
+    To *outDist           = new To[nElems];
+
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outIdx, idx));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outDist, dist));
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ((To)goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    delete[] outIdx;
+    delete[] outDist;
+    ASSERT_SUCCESS(af_release_array(query));
+    ASSERT_SUCCESS(af_release_array(train));
+    ASSERT_SUCCESS(af_release_array(idx));
+    ASSERT_SUCCESS(af_release_array(dist));
+}
+
+/////////////////////////////////////////////////
+// SSD
+/////////////////////////////////////////////////
+TYPED_TEST(NearestNeighbour, NN_SSD_100_1000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/ssd_100_1000_dim0.test"), 0,
+        AF_SSD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SSD_100_1000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/ssd_100_1000_dim1.test"), 1,
+        AF_SSD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SSD_500_5000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/ssd_500_5000_dim0.test"), 0,
+        AF_SSD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SSD_500_5000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/ssd_500_5000_dim1.test"), 1,
+        AF_SSD);
+}
+
+/////////////////////////////////////////////////
+// SAD
+/////////////////////////////////////////////////
+TYPED_TEST(NearestNeighbour, NN_SAD_100_1000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/sad_100_1000_dim0.test"), 0,
+        AF_SAD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SAD_100_1000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/sad_100_1000_dim1.test"), 1,
+        AF_SAD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SAD_500_5000_Dim0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/sad_500_5000_dim0.test"), 0,
+        AF_SAD);
+}
+
+TYPED_TEST(NearestNeighbour, NN_SAD_500_5000_Dim1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearestNeighbourTest<TypeParam>(
+        string(TEST_DIR "/nearest_neighbour/sad_500_5000_dim1.test"), 1,
+        AF_SAD);
+}
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(NearestNeighbourSSD, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<uint>> in;
+    vector<vector<uint>> tests;
+
+    readTests<uint, uint, uint>(TEST_DIR
+                                "/nearest_neighbour/ssd_500_5000_dim0.test",
+                                numDims, in, tests);
+
+    dim4 qDims = numDims[0];
+    dim4 tDims = numDims[1];
+
+    array query(qDims, &(in[0].front()));
+    array train(tDims, &(in[1].front()));
+
+    array idx, dist;
+    nearestNeighbour(idx, dist, query, train, 0, 1, AF_SSD);
+
+    vector<uint> goldIdx  = tests[0];
+    vector<uint> goldDist = tests[1];
+    size_t nElems         = goldIdx.size();
+    uint *outIdx          = new uint[nElems];
+    uint *outDist         = new uint[nElems];
+
+    idx.host(outIdx);
+    dist.host(outDist);
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    delete[] outIdx;
+    delete[] outDist;
+}
+
+TEST(NearestNeighbourSAD, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<uint>> in;
+    vector<vector<uint>> tests;
+
+    readTests<uint, uint, uint>(TEST_DIR
+                                "/nearest_neighbour/sad_100_1000_dim1.test",
+                                numDims, in, tests);
+
+    dim4 qDims = numDims[0];
+    dim4 tDims = numDims[1];
+
+    array query(qDims, &(in[0].front()));
+    array train(tDims, &(in[1].front()));
+
+    array idx, dist;
+    nearestNeighbour(idx, dist, query, train, 1, 1, AF_SAD);
+
+    vector<uint> goldIdx  = tests[0];
+    vector<uint> goldDist = tests[1];
+    size_t nElems         = goldIdx.size();
+    uint *outIdx          = new uint[nElems];
+    uint *outDist         = new uint[nElems];
+
+    idx.host(outIdx);
+    dist.host(outDist);
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(goldDist[elIter], outDist[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    delete[] outIdx;
+    delete[] outDist;
+}
+
+TEST(NearestNeighbourSSD, small) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const int ntrain            = 1;
+    const int nquery            = 5;
+    const int nfeat             = 2;
+    float train[ntrain * nfeat] = {
+        5,
+        5,
+    };
+
+    float query[nquery * nfeat] = {0, 0, 3.5, 4, 5, 5, 6, 5, 8, 6.5};
+    array t(nfeat, ntrain, train);
+    array q(nfeat, nquery, query);
+    array indices;
+    array distances;
+    nearestNeighbour(indices, distances, q, t, 0, 1, AF_SSD);
+
+    float expectedDistances[nquery] = {
+        (5 - 0) * (5 - 0) + (5 - 0) * (5 - 0),
+        (5 - 3.5) * (5 - 3.5) + (5 - 4) * (5 - 4),
+        (5 - 5) * (5 - 5) + (5 - 5) * (5 - 5),
+        (5 - 6) * (5 - 6) + (5 - 5) * (5 - 5),
+        (5 - 8) * (5 - 8) + (5 - 6.5) * (5 - 6.5)};
+
+    vector<float> actualDistances(nquery);
+    distances.host(&actualDistances[0]);
+    for (int i = 0; i < nquery; i++) {
+        EXPECT_NEAR(expectedDistances[i], actualDistances[i], 1E-8);
+    }
+}
+
+TEST(KNearestNeighbourSSD, small) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const int ntrain = 5;
+    const int nquery = 3;
+    const int nfeat  = 2;
+
+    float query[nquery * nfeat] = {5, 5, 0, 0, 10, 10};
+    float train[ntrain * nfeat] = {0, 0, 3.5, 4, 5, 5, 6, 5, 8, 6.5};
+
+    array t(nfeat, ntrain, train);
+    array q(nfeat, nquery, query);
+    array indices;
+    array actualDistances;
+    const int k = 2;
+    nearestNeighbour(indices, actualDistances, q, t, 0, k, AF_SSD);
+
+    vector<float> expectedDistances{
+        (5.f - 5.f) * (5.f - 5.f) + (5.f - 5.f) * (5.f - 5.f),
+        (5.f - 6.f) * (5.f - 6.f) + (5.f - 5.f) * (5.f - 5.f),
+
+        (0.f - 0.f) * (0.f - 0.f) + (0.f - 0.f) * (0.f - 0.f),
+        (0.f - 3.5f) * (0.f - 3.5f) + (0.f - 4.f) * (0.f - 4.f),
+
+        (10.f - 8.f) * (10.f - 8.f) + (10.f - 6.5f) * (10.f - 6.5f),
+        (10.f - 6.f) * (10.f - 6.f) + (10.f - 5.f) * (10.f - 5.f)};
+
+    ASSERT_VEC_ARRAY_NEAR(expectedDistances, dim4(k, nquery),
+                          actualDistances, 1E-8);
+}
+
+struct nearest_neighbors_params {
+    string testname_;
+    int k_, nfeat_, ntrain_, nquery_;
+    int feat_dim_;
+    dim4 qdims_, tdims_, idims_, ddims_;
+    vector<float> query_;
+    vector<float> train_;
+    vector<unsigned int> indices_;
+    vector<float> dists_;
+
+    nearest_neighbors_params(string testname, int k, int feat_dim, array query,
+                             array train, array indices, array dists)
+        : testname_(testname)
+        , k_(k)
+        , feat_dim_(feat_dim)
+        , query_(query.elements())
+        , train_(train.elements())
+        , indices_(indices.elements())
+        , dists_(dists.elements()) {
+        qdims_ = query.dims();
+        tdims_ = train.dims();
+        idims_ = indices.dims();
+        ddims_ = dists.dims();
+
+        query.host(query_.data());
+        train.host(train_.data());
+        indices.host(indices_.data());
+        dists.host(dists_.data());
+    }
+};
+
+template<typename TestClass>
+string testNameGenerator(
+    const ::testing::TestParamInfo<typename TestClass::ParamType> info) {
+    return info.param.testname_;
+}
+
+class NearestNeighborsTest
+    : public ::testing::TestWithParam<nearest_neighbors_params> {};
+class KNearestNeighborsTest
+    : public ::testing::TestWithParam<nearest_neighbors_params> {};
+
+nearest_neighbors_params single_knn_data(const string testname,
+                                         const int nquery, const int ntrain,
+                                         const int nfeat, const int k,
+                                         const int feat_dim) {
+    array indices, dists;
+    array query, train;
+    if (feat_dim == 0) {
+        query = constant(0, nfeat, nquery);
+        train = constant(1, nfeat, ntrain);
+    } else {
+        query = constant(0, nquery, nfeat);
+        train = constant(1, ntrain, nfeat);
+    }
+
+    indices = constant(0, k, nquery, u32);
+    dists   = constant(nfeat, k, nquery);
+
+    return nearest_neighbors_params(testname, k, feat_dim, query, train,
+                                    indices, dists);
+}
+
+nearest_neighbors_params knn_data(const string testname, const int nquery,
+                                  const int ntrain, const int nfeat,
+                                  const int k, const int feat_dim) {
+    array indices, dists;
+    array query, train;
+    if (feat_dim == 0) {
+        query = constant(0, nfeat, nquery);
+        train = range(dim4(nfeat, ntrain), 1);
+    } else {
+        query = constant(0, nquery, nfeat);
+        train = range(dim4(ntrain, nfeat), 0);
+    }
+
+    indices = range(dim4(k, nquery), 0, u32);
+    dists   = range(dim4(k, nquery));
+    dists *= dists;
+
+    return nearest_neighbors_params(testname, k, feat_dim, query, train,
+                                    indices, dists);
+}
+
+vector<nearest_neighbors_params> genNNTests() {
+    return {
+        single_knn_data("1q1t", 1, 1, 10, 1, 0),
+        single_knn_data("1q10t", 1, 10, 10, 1, 0),
+        single_knn_data("1q100t", 1, 100, 10, 1, 0),
+        single_knn_data("1q1000t", 1, 1000, 10, 1, 0),
+        single_knn_data("1q10000t", 1, 10000, 10, 1, 0),
+        single_knn_data("10q1t", 10, 1, 10, 1, 0),
+        single_knn_data("100q1t", 100, 1, 10, 1, 0),
+        single_knn_data("1000q1t", 1000, 1, 10, 1, 0),
+        single_knn_data("10000q1t", 10000, 1, 10, 1, 0),
+        single_knn_data("100000q1t", 100000, 1, 10, 1, 0),
+        single_knn_data("1q1tfl1", 10, 1, 1, 1, 0),
+        single_knn_data("1q1tfl2", 10, 1, 2, 1, 0),
+        single_knn_data("1q1tfl4", 10, 1, 4, 1, 0),
+        single_knn_data("1q1tfl8", 10, 1, 8, 1, 0),
+        single_knn_data("1q1tfl16", 10, 1, 16, 1, 0),
+        single_knn_data("1q1tfl32", 10, 1, 32, 1, 0),
+        single_knn_data("1q1tfl64", 10, 1, 64, 1, 0),
+        single_knn_data("1q1tfl128", 10, 1, 128, 1, 0),
+        single_knn_data("1q1tfl256", 10, 1, 256, 1, 0),
+        single_knn_data("1q1tfl10000", 10, 1, 10000, 1, 0),
+        single_knn_data("10q1t1d", 10, 1, 10, 1, 1),
+        single_knn_data("100q1t1d", 100, 1, 10, 1, 1),
+        single_knn_data("1000q1t1d", 1000, 1, 10, 1, 1),
+        single_knn_data("10000q1t1d", 10000, 1, 10, 1, 1),
+        single_knn_data("100000q1t1d", 100000, 1, 10, 1, 1),
+    };
+}
+
+vector<nearest_neighbors_params> genKNNTests() {
+    return {knn_data("1q1000t1k", 1, 1000, 1, 1, 0),
+            knn_data("1q1000t2k", 1, 1000, 1, 2, 0),
+            knn_data("1q1000t4k", 1, 1000, 1, 4, 0),
+            knn_data("1q1000t8k", 1, 1000, 1, 8, 0),
+            knn_data("1q1000t16k", 1, 1000, 1, 16, 0),
+            knn_data("1q1000t32k", 1, 1000, 1, 32, 0),
+            knn_data("1q1000t64k", 1, 1000, 1, 64, 0),
+            knn_data("1q1000t128k", 1, 1000, 1, 128, 0),
+            knn_data("1q1000t256k", 1, 1000, 1, 256, 0)};
+}
+
+INSTANTIATE_TEST_SUITE_P(KNearestNeighborsSSD, NearestNeighborsTest,
+                         ::testing::ValuesIn(genNNTests()),
+                         testNameGenerator<NearestNeighborsTest>);
+
+INSTANTIATE_TEST_SUITE_P(KNearestNeighborsSSD, KNearestNeighborsTest,
+                         ::testing::ValuesIn(genKNNTests()),
+                         testNameGenerator<KNearestNeighborsTest>);
+
+TEST_P(NearestNeighborsTest, SingleQTests) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearest_neighbors_params params = GetParam();
+    array query = array(params.qdims_, params.query_.data());
+    array train = array(params.tdims_, params.train_.data());
+
+    const int k        = params.k_;
+    const int feat_dim = params.feat_dim_;
+
+    array indices, distances;
+
+    nearestNeighbour(indices, distances, query, train, feat_dim, k, AF_SSD);
+
+    array indices_gold(params.idims_, params.indices_.data());
+    array distances_gold(params.ddims_, params.dists_.data());
+
+    ASSERT_ARRAYS_EQ(indices_gold, indices);
+    ASSERT_ARRAYS_NEAR(distances_gold, distances, 1e-5);
+}
+
+TEST_P(KNearestNeighborsTest, SingleQTests) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    nearest_neighbors_params params = GetParam();
+
+    array query = array(params.qdims_, params.query_.data());
+    array train = array(params.tdims_, params.train_.data());
+
+    const int k        = params.k_;
+    const int feat_dim = params.feat_dim_;
+
+    array indices, distances;
+
+    nearestNeighbour(indices, distances, query, train, feat_dim, k, AF_SSD);
+
+    array indices_gold(params.idims_, params.indices_.data());
+    array distances_gold(params.ddims_, params.dists_.data());
+
+    ASSERT_ARRAYS_EQ(indices_gold, indices);
+    ASSERT_ARRAYS_NEAR(distances_gold, distances, 1e-5);
+}
+
+TEST(KNearestNeighbours, InvalidNegativeK) {
+    const int ntrain = 500;
+    const int nquery = 1;
+    const int nfeat  = 2;
+
+    array t = randu(nfeat, ntrain);
+    array q = randu(nfeat, nquery);
+
+    array indices;
+    array distances;
+    int k = -1;
+    ASSERT_THROW(nearestNeighbour(indices, distances, q, t, 0, k, AF_SSD),
+                 af::exception);
+}
+
+TEST(KNearestNeighbours, InvalidLargeK) {
+    const int ntrain = 500;
+    const int nquery = 1;
+    const int nfeat  = 2;
+
+    array t = randu(nfeat, ntrain);
+    array q = randu(nfeat, nquery);
+
+    array indices;
+    array distances;
+    int k = 257;
+    ASSERT_THROW(nearestNeighbour(indices, distances, q, t, 0, k, AF_SSD),
+                 af::exception);
+}
+
+TEST(NearestNeighbour, DocSnippet1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    //! [ex_nearest_1]
+    float h_pts[6] = {1.f, 2.f, 3.f, 8.f, 9.f, 10.f};
+    array pts(dim4(1, 6), h_pts);
+    //  1.   2.   3.   8.   9.   10.
+
+    float h_query = 1.25f;
+    array query(dim4(1), &h_query);
+    //  1.25
+
+    array idx;
+    array dist;
+    nearestNeighbour(idx, dist, query, pts, 0, 3);
+    // idx
+    //  0
+    //  1
+    //  2
+    //
+    // dist
+    //  0.0625
+    //  0.5625
+    //  3.0625
+
+    //! [ex_nearest_1]
+
+    unsigned int h_gold_idx[3] = {0, 1, 2};
+    float h_gold_dist[3]       = {0.0625f, 0.5625f, 3.0625f};
+    array gold_idx(dim4(3), h_gold_idx);
+    array gold_dist(dim4(3), h_gold_dist);
+    ASSERT_ARRAYS_EQ(gold_idx, idx);
+    ASSERT_ARRAYS_EQ(gold_dist, dist);
+}
+
+TEST(NearestNeighbour, DocSnippet2) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    //! [ex_nearest_2]
+    float h_pts[18] = {0.f, 0.f, 0.f, 1.f, 0.f, 0.f, 0.f, 1.f, 0.f,
+                       8.f, 9.f, 1.f, 9.f, 8.f, 1.f, 9.f, 9.f, 1.f};
+    array pts(dim4(3, 6), h_pts);
+    //  0.    1.    0.    8.    9.    9.
+    //  0.    0.    1.    9.    8.    9.
+    //  0.    0.    0.    1.    1.    1.
+
+    float h_query[6] = {1.5f, 0.f, 0.f, 7.5f, 9.f, 1.f};
+    array query(dim4(3, 2), h_query);
+    //  1.5   7.5
+    //  0.    9.
+    //  0.    1.
+
+    array idx;
+    array dist;
+    nearestNeighbour(idx, dist, query, pts, 0, 3);
+    // idx
+    //  1     3
+    //  0     5
+    //  2     4
+    //
+    // dist
+    //  0.25  0.25
+    //  2.25  2.25
+    //  3.25  3.25
+    //! [ex_nearest_2]
+
+    unsigned int h_gold_idx[6] = {1, 0, 2, 3, 5, 4};
+    float h_gold_dist[6]       = {0.25f, 2.25f, 3.25f, 0.25f, 2.25f, 3.25f};
+    array gold_idx(dim4(3, 2), h_gold_idx);
+    array gold_dist(dim4(3, 2), h_gold_dist);
+    ASSERT_ARRAYS_EQ(gold_idx, idx);
+    ASSERT_ARRAYS_EQ(gold_dist, dist);
+}
diff --git a/test/nodevice.cpp b/test/nodevice.cpp
new file mode 100644
index 0000000000..5674953c12
--- /dev/null
+++ b/test/nodevice.cpp
@@ -0,0 +1,68 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+// Include functions that provide information about the system and shouldn't
+// throw exceptions during runtime.
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+TEST(NoDevice, Info) { ASSERT_SUCCESS(af_info()); }
+
+TEST(NoDevice, InfoCxx) { af::info(); }
+
+TEST(NoDevice, InfoString) {
+    char* str;
+    ASSERT_SUCCESS(af_info_string(&str, true));
+    ASSERT_SUCCESS(af_free_host((void*)str));
+}
+
+TEST(NoDevice, GetDeviceCount) {
+    int device = 0;
+    ASSERT_SUCCESS(af_get_device_count(&device));
+}
+
+TEST(NoDevice, GetDeviceCountCxx) { af::getDeviceCount(); }
+
+TEST(NoDevice, GetSizeOf) {
+    size_t size;
+    ASSERT_SUCCESS(af_get_size_of(&size, f32));
+    ASSERT_EQ(4u, size);
+}
+
+TEST(NoDevice, GetSizeOfCxx) {
+    size_t size = af::getSizeOf(f32);
+    ASSERT_EQ(4u, size);
+}
+
+TEST(NoDevice, GetBackendCount) {
+    unsigned int nbackends;
+    ASSERT_SUCCESS(af_get_backend_count(&nbackends));
+}
+
+TEST(NoDevice, GetBackendCountCxx) {
+    unsigned int nbackends = af::getBackendCount();
+    UNUSED(nbackends);
+}
+
+TEST(NoDevice, GetVersion) {
+    int major = 0, minor = 0, patch = 0;
+
+    ASSERT_SUCCESS(af_get_version(&major, &minor, &patch));
+
+    ASSERT_EQ(AF_VERSION_MAJOR, major);
+    ASSERT_EQ(AF_VERSION_MINOR, minor);
+    ASSERT_EQ(AF_VERSION_PATCH, patch);
+}
+
+TEST(NoDevice, GetRevision) {
+    const char* revision = af_get_revision();
+    UNUSED(revision);
+}
diff --git a/test/norm.cpp b/test/norm.cpp
new file mode 100644
index 0000000000..c795c112c3
--- /dev/null
+++ b/test/norm.cpp
@@ -0,0 +1,291 @@
+/*******************************************************
+ * Copyright (c) 2025, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <algorithm>
+#include <cmath>
+#include <complex>
+#include <limits>
+#include <sstream>
+#include <tuple>
+#include <vector>
+
+using af::array;
+using af::constant;
+using af::dim4;
+using std::complex;
+using std::stringstream;
+using std::vector;
+
+std::ostream &operator<<(std::ostream &os, af::normType nt) {
+    switch (nt) {
+        case AF_NORM_VECTOR_1: os << "AF_NORM_VECTOR_1"; break;
+        case AF_NORM_VECTOR_INF: os << "AF_NORM_VECTOR_INF"; break;
+        case AF_NORM_VECTOR_2: os << "AF_NORM_VECTOR_2"; break;
+        case AF_NORM_VECTOR_P: os << "AF_NORM_VECTOR_P"; break;
+        case AF_NORM_MATRIX_1: os << "AF_NORM_MATRIX_1"; break;
+        case AF_NORM_MATRIX_INF: os << "AF_NORM_MATRIX_INF"; break;
+        case AF_NORM_MATRIX_2: os << "AF_NORM_MATRIX_2"; break;
+        case AF_NORM_MATRIX_L_PQ: os << "AF_NORM_MATRIX_L_PQ"; break;
+    }
+    return os;
+}
+
+template<typename T>
+double cpu_norm1_impl(af::dim4 &dims, std::vector<T> &value) {
+    int M = dims[0];
+    int N = dims[1];
+
+    double norm1 = std::numeric_limits<double>::lowest();
+    for (int n = 0; n < N; n++) {
+        T *columnN = value.data() + n * M;
+        double sum = 0;
+        for (int m = 0; m < M; m++) { sum += std::abs(columnN[m]); }
+        norm1 = std::max(norm1, sum);
+    }
+    return norm1;
+}
+
+template<typename T>
+double cpu_norm_pq_impl(af::dim4 &dims, std::vector<T> &value, double p,
+                        double q) {
+    int N = dims[0];
+    int M = dims[1];
+
+    double norm = 0;
+    for (int n = 0; n < N; n++) {
+        T *columnN = value.data() + n * M;
+        double sum = 0;
+        for (int m = 0; m < M; m++) { sum += std::pow(std::abs(columnN[m]), p); }
+
+        norm += std::pow(sum, q / p);
+    }
+    norm = std::pow(norm, 1.0 / q);
+
+    return norm;
+}
+
+double cpu_norm1(af::array &value) {
+    double norm1;
+    af::dim4 dims = value.dims();
+    if (value.type() == f16) {
+        vector<half_float::half> values(value.elements());
+        value.host(values.data());
+        norm1 = cpu_norm1_impl<half_float::half>(dims, values);
+    } else if (value.type() == c32 || value.type() == c64) {
+        vector<complex<double> > values(value.elements());
+        value.as(c64).host(values.data());
+        norm1 = cpu_norm1_impl<complex<double> >(dims, values);
+    } else {
+        vector<double> values(value.elements());
+        value.as(f64).host(values.data());
+        norm1 = cpu_norm1_impl<double>(dims, values);
+    }
+    return norm1;
+}
+
+double cpu_norm_pq(af::array &value, double p, double q) {
+    double norm2;
+    af::dim4 dims = value.dims();
+    if (value.type() == f16) {
+        vector<half_float::half> values(value.elements());
+        value.host(values.data());
+        norm2 = cpu_norm_pq_impl<half_float::half>(dims, values, p, q);
+    } else if (value.type() == c32 || value.type() == c64) {
+        vector<complex<double> > values(value.elements());
+        value.as(c64).host(values.data());
+        norm2 = cpu_norm_pq_impl<complex<double> >(dims, values, p, q);
+    } else {
+        vector<double> values(value.elements());
+        value.as(f64).host(values.data());
+        norm2 = cpu_norm_pq_impl<double>(dims, values, p, q);
+    }
+    return norm2;
+}
+
+template<typename T>
+double cpu_norm_inf_impl(af::dim4 &dims, std::vector<T> &value) {
+    int M = dims[0];
+    int N = dims[1];
+
+    double norm_inf = std::numeric_limits<double>::lowest();
+    for (int m = 0; m < M; m++) {
+        T *rowM    = value.data() + m;
+        double sum = 0;
+        for (int n = 0; n < N; n++) { sum += std::abs(rowM[n * M]); }
+        norm_inf = std::max(norm_inf, sum);
+    }
+    return norm_inf;
+}
+
+double cpu_norm_inf(af::array &value) {
+    double norm_inf;
+    af::dim4 dims = value.dims();
+    if (value.type() == c32 || value.type() == c64) {
+        vector<complex<double> > values(value.elements());
+        value.as(c64).host(values.data());
+        norm_inf = cpu_norm_inf_impl<complex<double> >(dims, values);
+    } else {
+        vector<double> values(value.elements());
+        value.as(f64).host(values.data());
+        norm_inf = cpu_norm_inf_impl<double>(dims, values);
+    }
+    return norm_inf;
+}
+
+using norm_params = std::tuple<af::dim4, af::dtype>;
+class Norm
+    : public ::testing::TestWithParam<std::tuple<af::dim4, af::dtype> > {};
+
+INSTANTIATE_TEST_SUITE_P(
+    Norm, Norm,
+    ::testing::Combine(::testing::Values(dim4(3, 3), dim4(32, 32), dim4(33, 33),
+                                         dim4(64, 64), dim4(128, 128),
+                                         dim4(129, 129), dim4(256, 256),
+                                         dim4(257, 257)),
+                       ::testing::Values(f32, f64, c32, c64, f16)),
+    [](const ::testing::TestParamInfo<Norm::ParamType> info) {
+        stringstream ss;
+        using std::get;
+        ss << "dims_" << get<0>(info.param)[0] << "_" << get<0>(info.param)[1]
+           << "_dtype_" << get<1>(info.param);
+        return ss.str();
+    });
+
+TEST_P(Norm, Identity_AF_NORM_MATRIX_1) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    array identity = af::identity(get<0>(param), get<1>(param));
+    double result  = norm(identity, AF_NORM_MATRIX_1);
+    double norm1   = cpu_norm1(identity);
+
+    ASSERT_DOUBLE_EQ(norm1, result);
+}
+
+TEST_P(Norm, Random_AF_NORM_MATRIX_1) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    array in      = af::randu(get<0>(param), get<1>(param)) - 0.5f;
+    double result = norm(in, AF_NORM_MATRIX_1);
+    double norm1  = cpu_norm1(in);
+
+    ASSERT_NEAR(norm1, result, 2e-4);
+}
+
+TEST_P(Norm, Random_AF_NORM_VECTOR_1) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    af::dim4 dims = get<0>(param);
+    dims[1] = 1; // Test a vector
+
+    array in      = af::randu(dims, get<1>(param)) - 0.5f;
+    double result = norm(in, AF_NORM_VECTOR_1);
+    double norm1  = cpu_norm_pq(in, 1, 1);
+
+    ASSERT_NEAR(norm1, result, 2e-4);
+}
+
+TEST_P(Norm, Random_AF_NORM_VECTOR_INF) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    af::dim4 dims = get<0>(param);
+    dims[1] = 1; // Test a vector
+
+    array in        = af::randu(dims, get<1>(param)) - 0.5f;
+    double result   = norm(in, AF_NORM_VECTOR_INF);
+    double norm_inf = cpu_norm_inf(in);
+
+    ASSERT_NEAR(norm_inf, result, 2e-4);
+}
+
+TEST_P(Norm, Random_AF_NORM_VECTOR_2) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    af::dim4 dims = get<0>(param);
+    dims[1] = 1; // Test a vector
+
+    array in      = af::randu(dims, get<1>(param)) - 0.5f;
+    double result = norm(in, AF_NORM_VECTOR_2);
+    double norm2  = cpu_norm_pq(in, 1, 2); // vectors lie in first dims so swap p and q
+
+    ASSERT_NEAR(norm2, result, 3e-4);
+}
+
+TEST_P(Norm, Random_AF_NORM_VECTOR_P_P_EQUAL_3_POINT_5) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+
+    af::dim4 dims = get<0>(param);
+    dims[1] = 1; // Test a vector
+
+    array in      = af::randu(dims, get<1>(param)) - 0.5f;
+    double result = norm(in, AF_NORM_VECTOR_P, 3.5);
+    double normp  = cpu_norm_pq(in, 1, 3.5); // vectors lie in first dims so swap p and q
+
+    ASSERT_NEAR(normp, result, 3e-4);
+}
+
+TEST_P(Norm, Identity_AF_NORM_MATRIX_2_NOT_SUPPORTED) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+    try {
+        double result =
+            norm(af::identity(get<0>(param), get<1>(param)), AF_NORM_MATRIX_2);
+        FAIL();
+    } catch (af::exception &ex) {
+        ASSERT_EQ(AF_ERR_NOT_SUPPORTED, ex.err());
+        return;
+    }
+    FAIL();
+}
+
+TEST_P(Norm, Identity_AF_NORM_MATRIX_INF) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+    array in        = af::identity(get<0>(param), get<1>(param));
+    double result   = norm(in, AF_NORM_MATRIX_INF);
+    double norm_inf = cpu_norm_inf(in);
+
+    ASSERT_DOUBLE_EQ(norm_inf, result);
+}
+
+TEST_P(Norm, Random_AF_NORM_MATRIX_INF) {
+    using std::get;
+    norm_params param = GetParam();
+    if (get<1>(param) == f16) SUPPORTED_TYPE_CHECK(half_float::half);
+    if (get<1>(param) == f64) SUPPORTED_TYPE_CHECK(double);
+    array in        = af::randu(get<0>(param), get<1>(param));
+    double result   = norm(in, AF_NORM_MATRIX_INF);
+    double norm_inf = cpu_norm_inf(in);
+
+    ASSERT_NEAR(norm_inf, result, 2e-4);
+}
diff --git a/test/ocl_ext_context.cpp b/test/ocl_ext_context.cpp
new file mode 100644
index 0000000000..2f262bcf5d
--- /dev/null
+++ b/test/ocl_ext_context.cpp
@@ -0,0 +1,261 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#if defined(AF_OPENCL)
+#include <af/opencl.h>
+#include <iostream>
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+#pragma GCC diagnostic ignored "-Wunused-parameter"
+#pragma GCC diagnostic ignored "-Wignored-qualifiers"
+#pragma GCC diagnostic ignored "-Wignored-attributes"
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+#if __GNUC__ >= 8
+#pragma GCC diagnostic ignored "-Wcatch-value="
+#endif
+#define CL_HPP_MINIMUM_OPENCL_VERSION 120
+#define CL_HPP_TARGET_OPENCL_VERSION 120
+#define CL_HPP_ENABLE_EXCEPTIONS 1
+#include <CL/cl2.hpp>
+#pragma GCC diagnostic pop
+
+using af::allocV2;
+using af::array;
+using af::constant;
+using af::freeV2;
+using af::getDeviceCount;
+using af::info;
+using af::randu;
+using af::setDevice;
+using std::endl;
+using std::vector;
+
+inline void checkErr(cl_int err, const char *name) {
+    if (err != CL_SUCCESS) {
+        std::cerr << "ERROR: " << name << " (" << err << ")" << endl;
+        exit(EXIT_FAILURE);
+    }
+}
+
+class OCLExtContext : public ::testing::Test {
+   public:
+    cl_device_id deviceId  = NULL;
+    cl_context context     = NULL;
+    cl_command_queue queue = NULL;
+
+    void SetUp() override {
+        cl_platform_id platformId = NULL;
+        cl_uint numPlatforms;
+        cl_uint numDevices;
+        cl_int errorCode = 0;
+
+        checkErr(clGetPlatformIDs(1, &platformId, &numPlatforms),
+                 "Get Platforms failed");
+
+        checkErr(clGetDeviceIDs(platformId, CL_DEVICE_TYPE_DEFAULT, 1,
+                                &deviceId, &numDevices),
+                 "Get cl_device_id failed");
+
+        context = clCreateContext(NULL, 1, &deviceId, NULL, NULL, &errorCode);
+        checkErr(errorCode, "Context creation failed");
+
+#ifdef CL_VERSION_2_0
+        queue = clCreateCommandQueueWithProperties(context, deviceId, 0,
+                                                   &errorCode);
+#else
+        queue = clCreateCommandQueue(context, deviceId, 0, &errorCode);
+#endif
+
+        checkErr(errorCode, "Command queue creation failed");
+    }
+    void TearDown() override {
+        checkErr(clReleaseCommandQueue(queue), "clReleaseCommandQueue");
+        checkErr(clReleaseContext(context), "clReleaseContext");
+        checkErr(clReleaseDevice(deviceId), "clReleaseDevice");
+    }
+};
+
+TEST_F(OCLExtContext, PushAndPop) {
+    int dCount = getDeviceCount();
+    info();
+
+    afcl::addDevice(deviceId, context, queue);
+    ASSERT_EQ(true, dCount + 1 == getDeviceCount());
+
+    afcl::deleteDevice(deviceId, context);
+    ASSERT_EQ(true, dCount == getDeviceCount());
+    info();
+}
+
+TEST_F(OCLExtContext, set) {
+    int dCount = getDeviceCount();  // Before user device addition
+    setDevice(0);
+    info();
+    array t = randu(5, 5);
+    af_print(t);
+
+    afcl::addDevice(deviceId, context, queue);
+    info();
+
+    setDevice(
+        dCount);  // With 0-based indexing, dCount is the new device's index
+    info();
+
+    const int x = 5;
+    const int y = 5;
+    const int s = x * y;
+    array a     = constant(1, x, y);
+    vector<float> host(s);
+    a.host((void *)host.data());
+    for (int i = 0; i < s; ++i) ASSERT_EQ(host[i], 1.0f);
+
+    setDevice(0);
+    info();
+    af_print(t);
+}
+
+TEST(OCLCheck, DeviceType) {
+    afcl::deviceType devType = afcl::getDeviceType();
+    cl_device_type type      = -100;
+    clGetDeviceInfo(afcl::getDeviceId(), CL_DEVICE_TYPE, sizeof(cl_device_type),
+                    &type, NULL);
+    ASSERT_EQ(type, (cl_device_type)devType);
+}
+
+TEST(OCLCheck, DevicePlatform) {
+    afcl::platform platform = afcl::getPlatform();
+    ASSERT_NE(platform, AFCL_PLATFORM_UNKNOWN);
+}
+#else
+TEST(OCLExtContext, NoopCPU) {}
+#endif
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+TEST(Memory, AfAllocDeviceOpenCL) {
+    /// Tests to see if the pointer returned can be used by OpenCL functions
+    float gold_val = 5;
+
+    void *alloc_ptr;
+    ASSERT_SUCCESS(af_alloc_device(&alloc_ptr, sizeof(float)));
+    // Unfortunately, af_alloc_device returns a pointer to a cl::Buffer object
+    cl::Buffer *bptr = static_cast<cl::Buffer *>(alloc_ptr);
+    ASSERT_EQ(2, bptr->getInfo<CL_MEM_REFERENCE_COUNT>());
+
+    cl_command_queue queue;
+    afcl_get_queue(&queue, true);
+    cl::CommandQueue cq(queue);
+
+    cl::Buffer gold(cq, &gold_val, &gold_val + 1, false);
+    cq.enqueueCopyBuffer(gold, *bptr, 0, 0, sizeof(float));
+
+    float host;
+    cq.enqueueReadBuffer(*bptr, CL_TRUE, 0, sizeof(float), &host);
+
+    ASSERT_SUCCESS(af_free_device(alloc_ptr));
+    ASSERT_EQ(gold_val, host);
+}
+#pragma GCC diagnostic pop
+
+TEST(Memory, AfAllocDeviceV2OpenCLC) {
+    /// Tests to see if the pointer returned can be used by OpenCL functions
+    float gold_val = 5;
+
+    void *alloc_ptr;
+    ASSERT_SUCCESS(af_alloc_device_v2(&alloc_ptr, sizeof(float)));
+    {
+        cl::Buffer bptr(static_cast<cl_mem>(alloc_ptr), true);
+        ASSERT_EQ(3, bptr.getInfo<CL_MEM_REFERENCE_COUNT>());
+
+        cl_command_queue queue;
+        afcl_get_queue(&queue, true);
+        cl::CommandQueue cq(queue);
+
+        cl::Buffer gold(cq, &gold_val, &gold_val + 1, false);
+        cq.enqueueCopyBuffer(gold, bptr, 0, 0, sizeof(float));
+
+        float host;
+        cq.enqueueReadBuffer(bptr, CL_TRUE, 0, sizeof(float), &host);
+        ASSERT_EQ(gold_val, host);
+    }
+
+    ASSERT_SUCCESS(af_free_device_v2(alloc_ptr));
+}
+
+TEST(Memory, AfAllocDeviceV2OpenCLCPP) {
+    /// Tests to see if the pointer returned can be used by OpenCL functions
+    float gold_val = 5;
+
+    cl_mem alloc_ptr = static_cast<cl_mem>(allocV2(sizeof(float)));
+    {
+        cl::Buffer bptr(alloc_ptr, true);
+        ASSERT_EQ(3, bptr.getInfo<CL_MEM_REFERENCE_COUNT>());
+
+        cl_command_queue queue;
+        afcl_get_queue(&queue, true);
+        cl::CommandQueue cq(queue);
+
+        cl::Buffer gold(cq, &gold_val, &gold_val + 1, false);
+        cq.enqueueCopyBuffer(gold, bptr, 0, 0, sizeof(float));
+
+        float host;
+        cq.enqueueReadBuffer(bptr, CL_TRUE, 0, sizeof(float), &host);
+        ASSERT_EQ(gold_val, host);
+    }
+
+    freeV2(alloc_ptr);
+}
+
+TEST(Memory, SNIPPET_AllocOpenCL) {
+    // clang-format off
+    //! [ex_alloc_v2_opencl]
+    cl_command_queue queue;
+    afcl_get_queue(&queue, true);
+    cl_context context;
+    afcl_get_context(&context, true);
+
+    void *alloc_ptr = allocV2(sizeof(float));
+    cl_mem mem = static_cast<cl_mem>(alloc_ptr);
+
+    // Map memory from the device into system memory
+    cl_int map_err_code;
+    void *mapped_ptr = clEnqueueMapBuffer(
+        queue, // command queue
+        mem, // buffer
+        CL_TRUE, // is blocking
+        CL_MAP_READ | CL_MAP_WRITE, // map type
+        0, // offset
+        sizeof(float), // size
+        0, // num_events_in_wait_list
+        nullptr, // event_wait_list
+        nullptr, // event
+        &map_err_code); // error code
+
+    float *float_ptr = static_cast<float *>(mapped_ptr);
+    float_ptr[0]     = 5.0f;
+
+    // Unmap buffer after we are done using it
+    cl_int unmap_err_code =
+        clEnqueueUnmapMemObject(queue,      // command queue
+                                mem,        // buffer
+                                mapped_ptr, // mapped pointer
+                                0,          // num_events_in_wait_list
+                                nullptr,    // event_wait_list
+                                nullptr);   // event
+    freeV2(alloc_ptr);
+    //! [ex_alloc_v2_opencl]
+    // clang-format on
+
+    ASSERT_EQ(CL_SUCCESS, map_err_code);
+    ASSERT_EQ(CL_SUCCESS, unmap_err_code);
+}
diff --git a/test/orb.cpp b/test/orb.cpp
index 4b3c15864d..3ace1f4b05 100644
--- a/test/orb.cpp
+++ b/test/orb.cpp
@@ -7,48 +7,50 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/compatible.h>
-#include <string>
-#include <vector>
 #include <cmath>
-#include <testHelpers.hpp>
+#include <string>
 #include <typeinfo>
+#include <vector>
 
+using af::array;
+using af::dim4;
+using af::features;
+using af::loadImage;
+using std::abs;
+using std::cout;
+using std::endl;
 using std::string;
 using std::vector;
-using af::dim4;
 
-typedef struct
-{
+typedef struct {
     float f[5];
     unsigned d[8];
 } feat_desc_t;
 
-typedef struct
-{
+typedef struct {
     float f[5];
 } feat_t;
 
-typedef struct
-{
+typedef struct {
     unsigned d[8];
 } desc_t;
 
-bool feat_cmp(feat_desc_t i, feat_desc_t j)
-{
+static bool feat_cmp(feat_desc_t i, feat_desc_t j) {
     for (int k = 0; k < 5; k++)
-        if (i.f[k] != j.f[k])
-            return (i.f[k] < j.f[k]);
+        if (i.f[k] != j.f[k]) return (i.f[k] < j.f[k]);
 
-    return true;
+    return false;
 }
 
-void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y, float* score, float* ori, float* size, unsigned* desc, unsigned nfeat)
-{
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               unsigned* desc, unsigned nfeat) {
     feat.resize(nfeat);
     for (size_t i = 0; i < feat.size(); i++) {
         feat[i].f[0] = x[i];
@@ -56,13 +58,13 @@ void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y, float* sc
         feat[i].f[2] = score[i];
         feat[i].f[3] = ori[i];
         feat[i].f[4] = size[i];
-        for (unsigned j = 0; j < 8; j++)
-            feat[i].d[j] = desc[i * 8 + j];
+        for (unsigned j = 0; j < 8; j++) feat[i].d[j] = desc[i * 8 + j];
     }
 }
 
-void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y, float* score, float* ori, float* size, vector<vector<unsigned> >& desc, unsigned nfeat)
-{
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               vector<vector<unsigned>>& desc, unsigned nfeat) {
     feat.resize(nfeat);
     for (size_t i = 0; i < feat.size(); i++) {
         feat[i].f[0] = x[i];
@@ -70,25 +72,12 @@ void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y, float* sc
         feat[i].f[2] = score[i];
         feat[i].f[3] = ori[i];
         feat[i].f[4] = size[i];
-        for (unsigned j = 0; j < 8; j++)
-            feat[i].d[j] = desc[i][j];
+        for (unsigned j = 0; j < 8; j++) feat[i].d[j] = desc[i][j];
     }
 }
 
-void array_to_feat(vector<feat_t>& feat, float *x, float *y, float *score, float *ori, float *size, unsigned nfeat)
-{
-    feat.resize(nfeat);
-    for (unsigned i = 0; i < feat.size(); i++) {
-        feat[i].f[0] = x[i];
-        feat[i].f[1] = y[i];
-        feat[i].f[2] = score[i];
-        feat[i].f[3] = ori[i];
-        feat[i].f[4] = size[i];
-    }
-}
-
-void split_feat_desc(vector<feat_desc_t>& fd, vector<feat_t>& f, vector<desc_t>& d)
-{
+static void split_feat_desc(vector<feat_desc_t>& fd, vector<feat_t>& f,
+                            vector<desc_t>& d) {
     f.resize(fd.size());
     d.resize(fd.size());
     for (size_t i = 0; i < fd.size(); i++) {
@@ -97,13 +86,11 @@ void split_feat_desc(vector<feat_desc_t>& fd, vector<feat_t>& f, vector<desc_t>&
         f[i].f[2] = fd[i].f[2];
         f[i].f[3] = fd[i].f[3];
         f[i].f[4] = fd[i].f[4];
-        for (unsigned j = 0; j < 8; j++)
-            d[i].d[j] = fd[i].d[j];
+        for (unsigned j = 0; j < 8; j++) d[i].d[j] = fd[i].d[j];
     }
 }
 
-unsigned popcount(unsigned x)
-{
+static unsigned popcount(unsigned x) {
     x = x - ((x >> 1) & 0x55555555);
     x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
     x = (x + (x >> 4)) & 0x0F0F0F0F;
@@ -112,17 +99,17 @@ unsigned popcount(unsigned x)
     return x & 0x0000003F;
 }
 
-bool compareHamming(int data_size, unsigned *cpu, unsigned *gpu, unsigned thr = 1)
-{
+bool compareHamming(int data_size, unsigned* cpu, unsigned* gpu,
+                    unsigned thr = 1) {
     bool ret = true;
-    for(int i=0;i<data_size;i++)
-    {
+    for (int i = 0; i < data_size; i++) {
         unsigned x = (cpu[i] ^ gpu[i]);
-        if(popcount(x) > thr) {
+        if (popcount(x) > thr) {
             ret = false;
-            std::cout<<std::endl<<"@compareHamming: first mismatch."<<std::endl;
-            std::cout<<"(cpu,gpu,cpu-gpu)["<<i<<"] : {"<<cpu[i]<<","<<gpu[i]<<","<<cpu[i]-gpu[i]<<"}"<<std::endl;
-            std::cout<<std::endl;
+            cout << endl << "@compareHamming: first mismatch." << endl;
+            cout << "(cpu,gpu,cpu-gpu)[" << i << "] : {" << cpu[i] << ","
+                 << gpu[i] << "," << cpu[i] - gpu[i] << "}" << endl;
+            cout << endl;
             break;
         }
     }
@@ -130,73 +117,79 @@ bool compareHamming(int data_size, unsigned *cpu, unsigned *gpu, unsigned thr =
 }
 
 template<typename T>
-class ORB : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class ORB : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 typedef ::testing::Types<float, double> TestTypes;
 
-TYPED_TEST_CASE(ORB, TestTypes);
+TYPED_TEST_SUITE(ORB, TestTypes);
 
 template<typename T>
-void orbTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void orbTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
 
-    vector<dim4>             inDims;
-    vector<string>           inFiles;
-    vector<vector<float> >    goldFeat;
-    vector<vector<unsigned> > goldDesc;
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<unsigned>> goldDesc;
 
-    readImageFeaturesDescriptors<unsigned>(pTestFile, inDims, inFiles, goldFeat, goldDesc);
+    readImageFeaturesDescriptors<unsigned>(pTestFile, inDims, inFiles, goldFeat,
+                                           goldDesc);
 
     size_t testCount = inDims.size();
 
-    for (size_t testId=0; testId<testCount; ++testId) {
-        af_array inArray_f32  = 0;
-        af_array inArray      = 0;
-        af_array desc         = 0;
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray_f32 = 0;
+        af_array inArray     = 0;
+        af_array desc        = 0;
         af_features feat;
 
-        inFiles[testId].insert(0,string(TEST_DIR"/orb/"));
+        inFiles[testId].insert(0, string(TEST_DIR "/orb/"));
 
-        ASSERT_EQ(AF_SUCCESS, af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
-        ASSERT_EQ(AF_SUCCESS, conv_image<T>(&inArray, inArray_f32));
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
 
-        ASSERT_EQ(AF_SUCCESS, af_orb(&feat, &desc, inArray, 20.0f, 400, 1.2f, 8, true));
+        ASSERT_SUCCESS(
+            af_orb(&feat, &desc, inArray, 20.0f, 400, 1.2f, 8, true));
 
         dim_t n = 0;
         af_array x, y, score, orientation, size;
 
-        ASSERT_EQ(AF_SUCCESS, af_get_features_num(&n, feat));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_xpos(&x, feat));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_ypos(&y, feat));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_score(&score, feat));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_orientation(&orientation, feat));
-        ASSERT_EQ(AF_SUCCESS, af_get_features_size(&size, feat));
-
-        float * outX           = new float[n];
-        float * outY           = new float[n];
-        float * outScore       = new float[n];
-        float * outOrientation = new float[n];
-        float * outSize        = new float[n];
+        ASSERT_SUCCESS(af_get_features_num(&n, feat));
+        ASSERT_SUCCESS(af_get_features_xpos(&x, feat));
+        ASSERT_SUCCESS(af_get_features_ypos(&y, feat));
+        ASSERT_SUCCESS(af_get_features_score(&score, feat));
+        ASSERT_SUCCESS(af_get_features_orientation(&orientation, feat));
+        ASSERT_SUCCESS(af_get_features_size(&size, feat));
+
+        float* outX           = new float[n];
+        float* outY           = new float[n];
+        float* outScore       = new float[n];
+        float* outOrientation = new float[n];
+        float* outSize        = new float[n];
         dim_t descSize;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&descSize, desc));
-        unsigned * outDesc     = new unsigned[descSize];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outX, x));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outY, y));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outScore, score));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outOrientation, orientation));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outSize, size));
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outDesc, desc));
+        ASSERT_SUCCESS(af_get_elements(&descSize, desc));
+        unsigned* outDesc = new unsigned[descSize];
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outX, x));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outY, y));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outScore, score));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outOrientation, orientation));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outSize, size));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outDesc, desc));
 
         vector<feat_desc_t> out_feat_desc;
-        array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation, outSize, outDesc, n);
+        array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                           outSize, outDesc, n);
 
         vector<feat_desc_t> gold_feat_desc;
-        array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(), &goldFeat[1].front(), &goldFeat[2].front(), &goldFeat[3].front(), &goldFeat[4].front(), goldDesc, goldFeat[0].size());
+        array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                           &goldFeat[1].front(), &goldFeat[2].front(),
+                           &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                           goldFeat[0].size());
 
         std::sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
         std::sort(gold_feat_desc.begin(), gold_feat_desc.end(), feat_cmp);
@@ -210,25 +203,30 @@ void orbTest(string pTestFile)
         split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
 
         for (int elIter = 0; elIter < (int)n; elIter++) {
-            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0]) << "at: " << elIter << std::endl;
-            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1]) << "at: " << elIter << std::endl;
-            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3) << "at: " << elIter << std::endl;
-            ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]), 1e-3) << "at: " << elIter << std::endl;
-            ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]), 1e-3) << "at: " << elIter << std::endl;
+            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]),
+                      1e-3)
+                << "at: " << elIter << endl;
         }
 
         // TODO: improve distance for single/double-precision interchangeability
-        EXPECT_TRUE(compareHamming(descSize, (unsigned*)&v_out_desc[0], (unsigned*)&v_gold_desc[0], 3));
+        EXPECT_TRUE(compareHamming(descSize, (unsigned*)&v_out_desc[0],
+                                   (unsigned*)&v_gold_desc[0], 3));
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(inArray_f32));
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
 
-        ASSERT_EQ(AF_SUCCESS, af_release_array(x));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(y));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(score));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(orientation));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(size));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(desc));
+        ASSERT_SUCCESS(af_release_features(feat));
+        ASSERT_SUCCESS(af_release_array(desc));
 
         delete[] outX;
         delete[] outY;
@@ -239,41 +237,43 @@ void orbTest(string pTestFile)
     }
 }
 
-#define ORB_INIT(desc, image) \
-    TYPED_TEST(ORB, desc) \
-    {   \
-        orbTest<TypeParam>(string(TEST_DIR"/orb/"#image".test"));   \
-    }
+TYPED_TEST(ORB, Square) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    orbTest<TypeParam>(string(TEST_DIR "/orb/square.test"));
+}
 
-    ORB_INIT(square, square);
-    ORB_INIT(lena, lena);
+TYPED_TEST(ORB, Lena) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    orbTest<TypeParam>(string(TEST_DIR "/orb/lena.test"));
+}
 
 ///////////////////////////////////// CPP ////////////////////////////////
 //
-TEST(ORB, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<dim4>             inDims;
-    vector<string>           inFiles;
-    vector<vector<float> >    goldFeat;
-    vector<vector<unsigned> > goldDesc;
-
-    readImageFeaturesDescriptors<unsigned>(string(TEST_DIR"/orb/square.test"), inDims, inFiles, goldFeat, goldDesc);
-    inFiles[0].insert(0,string(TEST_DIR"/orb/"));
-
-    af::array in = af::loadImage(inFiles[0].c_str(), false);
-
-    af::features feat;
-    af::array desc;
-    af::orb(feat, desc, in, 20.0f, 400, 1.2f, 8, true);
-
-    float * outX           = new float[feat.getNumFeatures()];
-    float * outY           = new float[feat.getNumFeatures()];
-    float * outScore       = new float[feat.getNumFeatures()];
-    float * outOrientation = new float[feat.getNumFeatures()];
-    float * outSize        = new float[feat.getNumFeatures()];
-    unsigned * outDesc     = new unsigned[desc.elements()];
+TEST(ORB, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<unsigned>> goldDesc;
+
+    readImageFeaturesDescriptors<unsigned>(string(TEST_DIR "/orb/square.test"),
+                                           inDims, inFiles, goldFeat, goldDesc);
+    inFiles[0].insert(0, string(TEST_DIR "/orb/"));
+
+    array in = loadImage(inFiles[0].c_str(), false);
+
+    features feat;
+    array desc;
+    orb(feat, desc, in, 20.0f, 400, 1.2f, 8, true);
+
+    float* outX           = new float[feat.getNumFeatures()];
+    float* outY           = new float[feat.getNumFeatures()];
+    float* outScore       = new float[feat.getNumFeatures()];
+    float* outOrientation = new float[feat.getNumFeatures()];
+    float* outSize        = new float[feat.getNumFeatures()];
+    unsigned* outDesc     = new unsigned[desc.elements()];
     feat.getX().host(outX);
     feat.getY().host(outY);
     feat.getScore().host(outScore);
@@ -282,10 +282,14 @@ TEST(ORB, CPP)
     desc.host(outDesc);
 
     vector<feat_desc_t> out_feat_desc;
-    array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation, outSize, outDesc, feat.getNumFeatures());
+    array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                       outSize, outDesc, feat.getNumFeatures());
 
     vector<feat_desc_t> gold_feat_desc;
-    array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(), &goldFeat[1].front(), &goldFeat[2].front(), &goldFeat[3].front(), &goldFeat[4].front(), goldDesc, goldFeat[0].size());
+    array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                       &goldFeat[1].front(), &goldFeat[2].front(),
+                       &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                       goldFeat[0].size());
 
     std::sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
     std::sort(gold_feat_desc.begin(), gold_feat_desc.end(), feat_cmp);
@@ -299,15 +303,21 @@ TEST(ORB, CPP)
     split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
 
     for (int elIter = 0; elIter < (int)feat.getNumFeatures(); elIter++) {
-        ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0]) << "at: " << elIter << std::endl;
-        ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1]) << "at: " << elIter << std::endl;
-        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3) << "at: " << elIter << std::endl;
-        ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]), 1e-3) << "at: " << elIter << std::endl;
-        ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]), 1e-3) << "at: " << elIter << std::endl;
+        ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+            << "at: " << elIter << endl;
+        ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]), 1e-3)
+            << "at: " << elIter << endl;
     }
 
     // TODO: improve distance for single/double-precision interchangeability
-    EXPECT_TRUE(compareHamming(desc.elements(), (unsigned*)&v_out_desc[0], (unsigned*)&v_gold_desc[0], 3));
+    EXPECT_TRUE(compareHamming(desc.elements(), (unsigned*)&v_out_desc[0],
+                               (unsigned*)&v_gold_desc[0], 3));
 
     delete[] outX;
     delete[] outY;
diff --git a/test/pad_borders.cpp b/test/pad_borders.cpp
new file mode 100644
index 0000000000..2642ed83ca
--- /dev/null
+++ b/test/pad_borders.cpp
@@ -0,0 +1,186 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using std::vector;
+
+template<typename T>
+class PadBorders : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, cfloat, cdouble, char, signed char,
+                         unsigned char, int, uint, intl, uintl, short,
+                         ushort /*, half_float::half*/>
+    TestTypes;
+
+TYPED_TEST_SUITE(PadBorders, TestTypes);
+
+template<typename T>
+void testPad(const vector<T>& input, const dim4& inDims, const dim4& lbPadding,
+             const dim4& ubPadding, const af::borderType btype,
+             const vector<T>& gold, const dim4& outDims) {
+    SUPPORTED_TYPE_CHECK(T);
+    array in(inDims, input.data());
+    array out = af::pad(in, lbPadding, ubPadding, btype);
+    ASSERT_VEC_ARRAY_EQ(gold, outDims, out);
+}
+
+TYPED_TEST(PadBorders, Zero) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(2, 2, 0, 0), AF_PAD_ZERO,
+            vector<TypeParam>({
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
+                1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
+                1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+            }),
+            dim4(9, 9));
+}
+
+TYPED_TEST(PadBorders, ClampToEdge) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 2, 2,
+                2, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(2, 2, 0, 0),
+            AF_PAD_CLAMP_TO_EDGE,
+            vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2,
+                1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(9, 9));
+}
+
+TYPED_TEST(PadBorders, SymmetricOverEdge) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 0, 2, 3, 2, 2, 0, 3, 5, 2,
+                2, 0, 4, 7, 3, 3, 0, 5, 9, 1, 1, 0,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(2, 2, 0, 0), AF_PAD_SYM,
+            vector<TypeParam>({
+                3, 2, 2, 3, 2, 2, 0, 0, 2, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
+                1, 1, 1, 0, 0, 1, 3, 2, 2, 3, 2, 2, 0, 0, 2, 5, 3, 3, 5, 2, 2,
+                0, 0, 2, 7, 4, 4, 7, 3, 3, 0, 0, 3, 9, 5, 5, 9, 1, 1, 0, 0, 1,
+                9, 5, 5, 9, 1, 1, 0, 0, 1, 7, 4, 4, 7, 3, 3, 0, 0, 3,
+            }),
+            dim4(9, 9));
+}
+
+TYPED_TEST(PadBorders, Periodic) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 0, 2, 3, 2, 2, 0, 3, 5, 2,
+                2, 0, 4, 7, 3, 3, 0, 5, 9, 1, 1, 0,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(2, 2, 0, 0), AF_PAD_PERIODIC,
+            vector<TypeParam>({
+                3, 0, 4, 7, 3, 3, 0, 4, 7, 1, 0, 5, 9, 1, 1, 0, 5, 9, 1, 0, 1,
+                1, 1, 1, 0, 1, 1, 2, 0, 2, 3, 2, 2, 0, 2, 3, 2, 0, 3, 5, 2, 2,
+                0, 3, 5, 3, 0, 4, 7, 3, 3, 0, 4, 7, 1, 0, 5, 9, 1, 1, 0, 5, 9,
+                1, 0, 1, 1, 1, 1, 0, 1, 1, 2, 0, 2, 3, 2, 2, 0, 2, 3,
+            }),
+            dim4(9, 9));
+}
+
+TYPED_TEST(PadBorders, BeginOnly) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(0, 2, 0, 0), AF_PAD_ZERO,
+            vector<TypeParam>({
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1,
+                0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
+                0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+            }),
+            dim4(7, 9));
+}
+
+TYPED_TEST(PadBorders, EndOnly) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(0, 2, 0, 0), dim4(2, 2, 0, 0), AF_PAD_ZERO,
+            vector<TypeParam>({
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,
+                1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,
+                1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+            }),
+            dim4(7, 9));
+}
+
+TYPED_TEST(PadBorders, BeginCorner) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(2, 2, 0, 0), dim4(0, 0, 0, 0), AF_PAD_ZERO,
+            vector<TypeParam>({
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,
+                1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1,
+                1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
+            }),
+            dim4(7, 7));
+}
+
+TYPED_TEST(PadBorders, EndCorner) {
+    testPad(vector<TypeParam>({
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+                1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
+            }),
+            dim4(5, 5), dim4(0, 0, 0, 0), dim4(2, 2, 0, 0), AF_PAD_ZERO,
+            vector<TypeParam>({
+                1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1,
+                1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0,
+                0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+            }),
+            dim4(7, 7));
+}
+
+TEST(PadBorders, NegativePadding) {
+    af_array dummyIn  = 0;
+    af_array dummyOut = 0;
+    dim_t ldims[4]    = {-1, 1, 0, 1};
+    dim_t udims[4]    = {-1, 1, 0, 1};
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_pad(&dummyOut, dummyIn, 4, ldims, 4, udims, AF_PAD_ZERO));
+}
+
+TEST(PadBorders, NegativeNDims) {
+    af_array dummyIn  = 0;
+    af_array dummyOut = 0;
+    dim_t ldims[4]    = {1, 1, 0, 1};
+    dim_t udims[4]    = {1, 1, 0, 1};
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_pad(&dummyOut, dummyIn, -1, ldims, 4, udims, AF_PAD_ZERO));
+}
+
+TEST(PadBorders, InvalidPadType) {
+    af_array dummyIn  = 0;
+    af_array dummyOut = 0;
+    dim_t ldims[4]    = {1, 1, 0, 1};
+    dim_t udims[4]    = {1, 1, 0, 1};
+    ASSERT_EQ(AF_ERR_ARG, af_pad(&dummyOut, dummyIn, 4, ldims, 4, udims,
+                                 (af_border_type)4));
+}
diff --git a/test/pinverse.cpp b/test/pinverse.cpp
new file mode 100644
index 0000000000..13b2151836
--- /dev/null
+++ b/test/pinverse.cpp
@@ -0,0 +1,376 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <limits>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::exception;
+using af::identity;
+using af::matmul;
+using af::max;
+using af::pinverse;
+using af::randu;
+using af::span;
+using std::abs;
+using std::string;
+using std::vector;
+
+template<typename T>
+array makeComplex(dim4 dims, const vector<T>& real, const vector<T>& imag) {
+    array realArr(dims, &real.front());
+    array imagArr(dims, &imag.front());
+    return af::complex(realArr, imagArr);
+}
+
+template<typename T>
+array readTestInput(string testFilePath) {
+    typedef typename dtype_traits<T>::base_type InBaseType;
+    dtype outAfType = (dtype)dtype_traits<T>::af_type;
+
+    vector<dim4> dimsVec;
+    vector<vector<InBaseType>> inVec;
+    vector<vector<InBaseType>> goldVec;
+    readTestsFromFile<InBaseType, InBaseType>(testFilePath, dimsVec, inVec,
+                                              goldVec);
+    dim4 inDims = dimsVec[0];
+
+    if (outAfType == c32 || outAfType == c64) {
+        return makeComplex(inDims, inVec[1], inVec[2]);
+    } else {
+        return array(inDims, &inVec[0].front());
+    }
+}
+
+template<typename T>
+array readTestGold(string testFilePath) {
+    typedef typename dtype_traits<T>::base_type InBaseType;
+    dtype outAfType = (dtype)dtype_traits<T>::af_type;
+
+    vector<dim4> dimsVec;
+    vector<vector<InBaseType>> inVec;
+    vector<vector<InBaseType>> goldVec;
+    readTestsFromFile<InBaseType, InBaseType>(testFilePath, dimsVec, inVec,
+                                              goldVec);
+    dim4 goldDims(dimsVec[0][1], dimsVec[0][0]);
+
+    if (outAfType == c32 || outAfType == c64) {
+        return makeComplex(goldDims, goldVec[1], goldVec[2]);
+    } else {
+        return array(goldDims, &goldVec[0].front());
+    }
+}
+
+template<typename T>
+class Pinverse : public ::testing::Test {};
+
+// Epsilons taken from test/inverse.cpp
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 0.01f;
+}
+
+template<>
+double eps<double>() {
+    return 1e-5;
+}
+
+template<>
+double eps<cfloat>() {
+    return 0.01f;
+}
+
+template<>
+double eps<cdouble>() {
+    return 1e-5;
+}
+
+template<typename T>
+double relEps(array in) {
+    typedef typename af::dtype_traits<T>::base_type InBaseType;
+    double fixed_eps = eps<T>();
+    double calc_eps  = std::numeric_limits<InBaseType>::epsilon() *
+                      std::max(in.dims(0), in.dims(1)) * af::max<double>(in);
+    // Fall back to the fixed epsilon above when the calculated tolerance is
+    // smaller than it
+    return std::max(fixed_eps, calc_eps);
+}
+
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(Pinverse, TestTypes);
+
+// The first four tests below check the four Moore-Penrose conditions
+// See https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse#Definition
+TYPED_TEST(Pinverse, AApinvA_A) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+        string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, eps<TypeParam>());
+}
+
+TYPED_TEST(Pinverse, ApinvAApinv_Apinv) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+        string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array inpinv = pinverse(in);
+    array out    = matmul(inpinv, in, inpinv);
+    ASSERT_ARRAYS_NEAR(inpinv, out, eps<TypeParam>());
+}
+
+TYPED_TEST(Pinverse, AApinv_IsHermitian) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+        string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array inpinv = pinverse(in);
+    array aapinv = matmul(in, inpinv);
+    array out    = aapinv.H();
+    ASSERT_ARRAYS_NEAR(aapinv, out, eps<TypeParam>());
+}
+
+TYPED_TEST(Pinverse, ApinvA_IsHermitian) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+        string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array inpinv = pinverse(in);
+    array apinva = matmul(inpinv, in);
+    array out    = apinva.H();
+    ASSERT_ARRAYS_NEAR(apinva, out, eps<TypeParam>());
+}
+
+TYPED_TEST(Pinverse, Large) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+        string(TEST_DIR "/pinverse/pinv_640x480_inputs.test"));
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, relEps<TypeParam>(in));
+}
+
+TYPED_TEST(Pinverse, LargeTall) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array in = readTestInput<TypeParam>(
+                   string(TEST_DIR "/pinverse/pinv_640x480_inputs.test"))
+                   .T();
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, relEps<TypeParam>(in));
+}
+
+TEST(Pinverse, Square) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x10.test"));
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, eps<float>());
+}
+
+TEST(Pinverse, Dim1GtDim0) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse8x10.test"));
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, eps<float>());
+}
+
+TEST(Pinverse, CompareWithNumpy) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array gold =
+        readTestGold<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array out = pinverse(in);
+    ASSERT_ARRAYS_NEAR(gold, out, relEps<float>(gold));
+}
+
+TEST(Pinverse, SmallSigValExistsFloat) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    const dim_t dim0 = in.dims(0);
+    const dim_t dim1 = in.dims(1);
+
+    // Generate sigma with small non-zero value
+    af::array u;
+    af::array vT;
+    af::array sVec;
+    af::svd(u, sVec, vT, in);
+    dim_t sSize = sVec.elements();
+
+    sVec(2)         = 1e-12;
+    af::array s     = af::diag(sVec, 0, false);
+    af::array zeros = af::constant(0, dim0 > sSize ? dim0 - sSize : sSize,
+                                   dim1 > sSize ? dim1 - sSize : sSize);
+    s               = af::join(dim0 > dim1 ? 0 : 1, s, zeros);
+
+    // Make new input array that has a small non-zero value in its SVD sigma
+    in           = af::matmul(u, s, vT);
+    array inpinv = pinverse(in);
+    array out    = matmul(in, inpinv, in);
+
+    ASSERT_ARRAYS_NEAR(in, out, eps<float>());
+}
+
+TEST(Pinverse, SmallSigValExistsDouble) {
+    SUPPORTED_TYPE_CHECK(double);
+    array in =
+        readTestInput<double>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    const dim_t dim0 = in.dims(0);
+    const dim_t dim1 = in.dims(1);
+
+    // Generate sigma with small non-zero value
+    array u;
+    array vT;
+    array sVec;
+    svd(u, sVec, vT, in);
+    dim_t sSize = sVec.elements();
+
+    sVec(2)     = (double)1e-16;
+    array s     = diag(sVec, 0, false);
+    array zeros = constant(0, dim0 > sSize ? dim0 - sSize : sSize,
+                           dim1 > sSize ? dim1 - sSize : sSize, f64);
+    s           = join(dim0 > dim1 ? 0 : 1, s, zeros);
+
+    // Make new input array that has a small non-zero value in its SVD sigma
+    in           = matmul(u, s, vT);
+    array inpinv = pinverse(in, 1e-15);
+    array out    = matmul(in, inpinv, in);
+
+    ASSERT_ARRAYS_NEAR(in, out, eps<double>());
+}
+
+TEST(Pinverse, Batching3D) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8x2.test"));
+    array inpinv0 = pinverse(in(span, span, 0));
+    array inpinv1 = pinverse(in(span, span, 1));
+
+    array out  = pinverse(in);
+    array out0 = out(span, span, 0);
+    array out1 = out(span, span, 1);
+
+    ASSERT_ARRAYS_NEAR(inpinv0, out0, relEps<float>(inpinv0));
+    ASSERT_ARRAYS_NEAR(inpinv1, out1, relEps<float>(inpinv1));
+}
+
+TEST(Pinverse, Batching4D) {
+    array in = readTestInput<float>(
+        string(TEST_DIR "/pinverse/pinverse10x8x2x2.test"));
+    array inpinv00 = pinverse(in(span, span, 0, 0));
+    array inpinv01 = pinverse(in(span, span, 0, 1));
+    array inpinv10 = pinverse(in(span, span, 1, 0));
+    array inpinv11 = pinverse(in(span, span, 1, 1));
+
+    array out   = pinverse(in);
+    array out00 = out(span, span, 0, 0);
+    array out01 = out(span, span, 0, 1);
+    array out10 = out(span, span, 1, 0);
+    array out11 = out(span, span, 1, 1);
+
+    ASSERT_ARRAYS_NEAR(inpinv00, out00, relEps<float>(inpinv00));
+    ASSERT_ARRAYS_NEAR(inpinv01, out01, relEps<float>(inpinv01));
+    ASSERT_ARRAYS_NEAR(inpinv10, out10, relEps<float>(inpinv10));
+    ASSERT_ARRAYS_NEAR(inpinv11, out11, relEps<float>(inpinv11));
+}
+
+TEST(Pinverse, CustomTol) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array inpinv = pinverse(in, 1e-12);
+    array out    = matmul(in, inpinv, in);
+    ASSERT_ARRAYS_NEAR(in, out, eps<float>());
+}
+
+TEST(Pinverse, C) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    af_array inpinv = 0, identity = 0, out = 0;
+    ASSERT_SUCCESS(af_pinverse(&inpinv, in.get(), 1e-6, AF_MAT_NONE));
+    ASSERT_SUCCESS(
+        af_matmul(&identity, in.get(), inpinv, AF_MAT_NONE, AF_MAT_NONE));
+    ASSERT_SUCCESS(
+        af_matmul(&out, identity, in.get(), AF_MAT_NONE, AF_MAT_NONE));
+
+    ASSERT_ARRAYS_NEAR(in.get(), out, eps<float>());
+
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(identity));
+    ASSERT_SUCCESS(af_release_array(inpinv));
+}
+
+TEST(Pinverse, C_CustomTol) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    af_array inpinv = 0, identity = 0, out = 0;
+    ASSERT_SUCCESS(af_pinverse(&inpinv, in.get(), 1e-12, AF_MAT_NONE));
+    ASSERT_SUCCESS(
+        af_matmul(&identity, in.get(), inpinv, AF_MAT_NONE, AF_MAT_NONE));
+    ASSERT_SUCCESS(
+        af_matmul(&out, identity, in.get(), AF_MAT_NONE, AF_MAT_NONE));
+
+    ASSERT_ARRAYS_NEAR(in.get(), out, eps<float>());
+
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(identity));
+    ASSERT_SUCCESS(af_release_array(inpinv));
+}
+
+TEST(Pinverse, NegativeTol) {
+    array in =
+        readTestInput<float>(string(TEST_DIR "/pinverse/pinverse10x8.test"));
+    array out;
+    ASSERT_THROW(out = pinverse(in, -1.f), exception);
+}
+
+TEST(Pinverse, InvalidType) {
+    array in = constant(0, 10, 8, u8);
+    array out;
+    ASSERT_THROW(out = pinverse(in), exception);
+}
+
+TEST(Pinverse, InvalidMatProp) {
+    array in = constant(0.f, 10, 8, f32);
+    array out;
+    ASSERT_THROW(out = pinverse(in, 1e-6, AF_MAT_SYM), exception);
+}
+
+TEST(Pinverse, DocSnippet) {
+    //! [ex_pinverse]
+    float hA[] = {0, 1, 2, 3, 4, 5};
+    array A(3, 2, hA);
+    //  0.0000     3.0000
+    //  1.0000     4.0000
+    //  2.0000     5.0000
+
+    array Apinv = pinverse(A);
+    // -0.7778    -0.1111     0.5556
+    //  0.2778     0.1111    -0.0556
+
+    array MustBeA = matmul(A, Apinv, A);
+    //  0.0000     3.0000
+    //  1.0000     4.0000
+    //  2.0000     5.0000
+    //! [ex_pinverse]
+    ASSERT_ARRAYS_NEAR(A, MustBeA, eps<float>());
+}
diff --git a/test/print_info.cpp b/test/print_info.cpp
new file mode 100644
index 0000000000..0154ca9d95
--- /dev/null
+++ b/test/print_info.cpp
@@ -0,0 +1,26 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+
+using namespace af;
+
+int main(int, const char**) {
+    int backend = getAvailableBackends();
+    if (backend & AF_BACKEND_OPENCL) {
+        setBackend(AF_BACKEND_OPENCL);
+    } else if (backend & AF_BACKEND_CUDA) {
+        setBackend(AF_BACKEND_CUDA);
+    } else if (backend & AF_BACKEND_CPU) {
+        setBackend(AF_BACKEND_CPU);
+    }
+
+    info();
+    return 0;
+}
diff --git a/test/qr_dense.cpp b/test/qr_dense.cpp
index a0a954cae3..d87cb7b565 100644
--- a/test/qr_dense.cpp
+++ b/test/qr_dense.cpp
@@ -7,44 +7,51 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::exception;
+using af::identity;
+using af::matmul;
+using af::max;
+using std::abs;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 ///////////////////////////////// CPP ////////////////////////////////////
-TEST(QRFactorized, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(QRFactorized, CPP) {
+    LAPACK_ENABLED_CHECK();
 
     int resultIdx = 0;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, float>(string(TEST_DIR"/lapack/qrfactorized.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/lapack/qrfactorized.test"),
+                                   numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
 
-    af::array q, r, tau;
-    af::qr(q, r, tau, input);
+    array q, r, tau;
+    qr(q, r, tau, input);
 
-    af::dim4 qdims = q.dims();
-    af::dim4 rdims = r.dims();
+    dim4 qdims = q.dims();
+    dim4 rdims = r.dims();
 
     // Get result
     float* qData = new float[qdims.elements()];
@@ -56,7 +63,8 @@ TEST(QRFactorized, CPP)
     for (int y = 0; y < (int)qdims[1]; ++y) {
         for (int x = 0; x < (int)qdims[0]; ++x) {
             int elIter = y * qdims[0] + x;
-            ASSERT_NEAR(tests[resultIdx][elIter], qData[elIter], 0.001) << "at: " << elIter << std::endl;
+            ASSERT_NEAR(tests[resultIdx][elIter], qData[elIter], 0.001)
+                << "at: " << elIter << endl;
         }
     }
 
@@ -65,9 +73,10 @@ TEST(QRFactorized, CPP)
     for (int y = 0; y < (int)rdims[1]; ++y) {
         for (int x = 0; x < (int)rdims[0]; ++x) {
             // Test only upper half
-            if(x <= y) {
+            if (x <= y) {
                 int elIter = y * rdims[0] + x;
-                ASSERT_NEAR(tests[resultIdx][elIter], rData[elIter], 0.001) << "at: " << elIter << std::endl;
+                ASSERT_NEAR(tests[resultIdx][elIter], rData[elIter], 0.001)
+                    << "at: " << elIter << endl;
             }
         }
     }
@@ -78,90 +87,105 @@ TEST(QRFactorized, CPP)
 }
 
 template<typename T>
-void qrTester(const int m, const int n, double eps)
-{
+void qrTester(const int m, const int n, double eps) {
     try {
-        if (noDoubleTests<T>()) return;
+        SUPPORTED_TYPE_CHECK(T);
+        LAPACK_ENABLED_CHECK();
 
 #if 1
-        af::array in = cpu_randu<T>(af::dim4(m, n));
+        array in = cpu_randu<T>(dim4(m, n));
 #else
-        af::array in = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+        array in = randu(m, n, (dtype)dtype_traits<T>::af_type);
 #endif
 
         //! [ex_qr_unpacked]
-        af::array q, r, tau;
-        af::qr(q, r, tau, in);
+        array q, r, tau;
+        qr(q, r, tau, in);
         //! [ex_qr_unpacked]
 
-        af::array qq = af::matmul(q, q.H());
-        af::array ii = af::identity(qq.dims(), qq.type());
+        array qq = matmul(q, q.H());
+        array ii = identity(qq.dims(), qq.type());
 
-        ASSERT_NEAR(0, af::max<double>(af::abs(real(qq - ii))), eps);
-        ASSERT_NEAR(0, af::max<double>(af::abs(imag(qq - ii))), eps);
+        ASSERT_NEAR(0, max<double>(abs(real(qq - ii))), eps);
+        ASSERT_NEAR(0, max<double>(abs(imag(qq - ii))), eps);
 
         //! [ex_qr_recon]
-        af::array re = af::matmul(q, r);
+        array re = matmul(q, r);
         //! [ex_qr_recon]
 
-        ASSERT_NEAR(0, af::max<double>(af::abs(real(re - in))), eps);
-        ASSERT_NEAR(0, af::max<double>(af::abs(imag(re - in))), eps);
+        ASSERT_NEAR(0, max<double>(abs(real(re - in))), eps);
+        ASSERT_NEAR(0, max<double>(abs(imag(re - in))), eps);
 
         //! [ex_qr_packed]
-        af::array out = in.copy();
-        af::array tau2;
+        array out = in.copy();
+        array tau2;
         qrInPlace(tau2, out);
         //! [ex_qr_packed]
 
-        af::array r2 = upper(out);
+        array r2 = upper(out);
 
-        ASSERT_NEAR(0, af::max<double>(af::abs(real(tau - tau2))), eps);
-        ASSERT_NEAR(0, af::max<double>(af::abs(imag(tau - tau2))), eps);
+        ASSERT_NEAR(0, max<double>(abs(real(tau - tau2))), eps);
+        ASSERT_NEAR(0, max<double>(abs(imag(tau - tau2))), eps);
 
-        ASSERT_NEAR(0, af::max<double>(af::abs(real(r2 - r))), eps);
-        ASSERT_NEAR(0, af::max<double>(af::abs(imag(r2 - r))), eps);
+        ASSERT_NEAR(0, max<double>(abs(real(r2 - r))), eps);
+        ASSERT_NEAR(0, max<double>(abs(imag(r2 - r))), eps);
 
-
-    } catch(af::exception &ex) {
-        std::cout << ex.what() << std::endl;
+    } catch (exception& ex) {
+        cout << ex.what() << endl;
         throw;
     }
 }
 
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 1e-3;
+}
+
+template<>
+double eps<double>() {
+    return 1e-5;
+}
 
-#define QR_BIG_TESTS(T, eps)                    \
-    TEST(QR, T##BigRect0)                       \
-    {                                           \
-        qrTester<T>(500, 1000, eps);            \
-    }                                           \
-    TEST(QR, T##BigRect0Multiple)               \
-    {                                           \
-        qrTester<T>(512, 1024, eps);            \
-    }                                           \
-    TEST(QR, T##BigRect1Multiple)               \
-    {                                           \
-        qrTester<T>(1024, 512, eps);            \
-    }                                           \
-
-QR_BIG_TESTS(float, 1E-3)
-QR_BIG_TESTS(double, 1E-5)
-QR_BIG_TESTS(cfloat, 1E-3)
-QR_BIG_TESTS(cdouble, 1E-5)
-
-#undef QR_BIG_TESTS
-
-#define QR_BIG_TESTS(T, eps)                    \
-    TEST(QR, T##BigRect1)                       \
-    {                                           \
-        qrTester<T>(1000, 500, eps);            \
-    }                                           \
-
-QR_BIG_TESTS(float, 1E-3)
-QR_BIG_TESTS(double, 1E-5)
-// Fails on Windows on some devices
-#if !(defined(OS_WIN) && defined(AF_OPENCL))
-QR_BIG_TESTS(cfloat, 1E-3)
-QR_BIG_TESTS(cdouble, 1E-5)
-#endif
+template<>
+double eps<cfloat>() {
+    return 1e-3;
+}
+
+template<>
+double eps<cdouble>() {
+    return 1e-5;
+}
+template<typename T>
+class QR : public ::testing::Test {};
+
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(QR, TestTypes);
 
-#undef QR_BIG_TESTS
+TYPED_TEST(QR, RectangularLarge0) {
+    qrTester<TypeParam>(1000, 500, eps<TypeParam>());
+}
+
+TYPED_TEST(QR, RectangularMultipleOfTwoLarge0) {
+    qrTester<TypeParam>(1024, 512, eps<TypeParam>());
+}
+
+TYPED_TEST(QR, RectangularLarge1) {
+    qrTester<TypeParam>(500, 1000, eps<TypeParam>());
+}
+
+TYPED_TEST(QR, RectangularMultipleOfTwoLarge1) {
+    qrTester<TypeParam>(512, 1024, eps<TypeParam>());
+}
+
+TEST(QR, InPlaceNullOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    ASSERT_SUCCESS(af_randu(&in, dims.ndims(), dims.get(), f32));
+
+    ASSERT_EQ(AF_ERR_ARG, af_qr_inplace(NULL, in));
+    ASSERT_SUCCESS(af_release_array(in));
+}
diff --git a/test/random.cpp b/test/random.cpp
index 608a8f08f3..f6fd0dd45f 100644
--- a/test/random.cpp
+++ b/test/random.cpp
@@ -7,253 +7,408 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/data.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Random : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Random : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, unsigned char> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, intl,
+                         uintl, signed char, unsigned char, char, af_half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Random, TestTypes);
+TYPED_TEST_SUITE(Random, TestTypes);
 
 template<typename T>
-class Random_norm : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Random_norm : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class RandomEngine : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class RandomEngineSeed : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class RandomSeed : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble> TestTypesNorm;
+typedef ::testing::Types<float, cfloat, double, cdouble, af_half> TestTypesNorm;
+// register the type list
+TYPED_TEST_SUITE(Random_norm, TestTypesNorm);
 
+// create a list of types to be tested
+typedef ::testing::Types<float, double> TestTypesEngine;
 // register the type list
-TYPED_TEST_CASE(Random_norm, TestTypesNorm);
+TYPED_TEST_SUITE(RandomEngine, TestTypesEngine);
 
-template<typename T>
-void randuTest(af::dim4 & dims)
-{
-    if (noDoubleTests<T>()) return;
+typedef ::testing::Types<unsigned> TestTypesEngineSeed;
+// register the type list
+TYPED_TEST_SUITE(RandomEngineSeed, TestTypesEngineSeed);
 
-    af_array outArray = 0;
-    ASSERT_EQ(AF_SUCCESS, af_randu(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-    if(outArray != 0) af_release_array(outArray);
-}
+// create a list of types to be tested
+typedef ::testing::Types<unsigned> TestTypesSeed;
+// register the type list
+TYPED_TEST_SUITE(RandomSeed, TestTypesSeed);
 
 template<typename T>
-void randnTest(af::dim4 &dims)
-{
-    if (noDoubleTests<T>()) return;
+void randuTest(dim4 &dims) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array outArray = 0;
-    ASSERT_EQ(AF_SUCCESS, af_randn(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-    if(outArray != 0) af_release_array(outArray);
+    ASSERT_SUCCESS(af_randu(&outArray, dims.ndims(), dims.get(),
+                            (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_EQ(af_sync(-1), AF_SUCCESS);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-// INT, UNIT, CHAR, UCHAR Not Supported by RANDN
-template<>
-void randnTest<int>(af::dim4 &dims)
-{
-    if (noDoubleTests<int>()) return;
+template<typename T>
+void randnTest(dim4 &dims) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array outArray = 0;
-    ASSERT_EQ(AF_ERR_TYPE, af_randn(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<int>::af_type));
-    if(outArray != 0) af_release_array(outArray);
+    ASSERT_SUCCESS(af_randn(&outArray, dims.ndims(), dims.get(),
+                            (af_dtype)dtype_traits<T>::af_type));
+    ASSERT_EQ(af_sync(-1), AF_SUCCESS);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-template<>
-void randnTest<unsigned>(af::dim4 &dims)
-{
-    if (noDoubleTests<unsigned>()) return;
+#define RAND(d0, d1, d2, d3)                                   \
+    TYPED_TEST(Random, randu_##d0##_##d1##_##d2##_##d3) {      \
+        dim4 dims(d0, d1, d2, d3);                             \
+        randuTest<TypeParam>(dims);                            \
+    }                                                          \
+    TYPED_TEST(Random_norm, randn_##d0##_##d1##_##d2##_##d3) { \
+        dim4 dims(d0, d1, d2, d3);                             \
+        randnTest<TypeParam>(dims);                            \
+    }
 
-    af_array outArray = 0;
-    ASSERT_EQ(AF_ERR_TYPE, af_randn(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<unsigned>::af_type));
-    if(outArray != 0) af_release_array(outArray);
-}
+RAND(1024, 1024, 1, 1);
+RAND(512, 512, 1, 1);
+RAND(256, 256, 1, 1);
+RAND(128, 128, 1, 1);
+RAND(64, 64, 1, 1);
+RAND(32, 32, 1, 1);
+RAND(16, 16, 1, 1);
+RAND(8, 8, 1, 1);
+RAND(4, 4, 1, 1);
+RAND(2, 2, 2, 2);
+RAND(1, 1, 1, 1);
+RAND(256, 16, 4, 2);
+RAND(32, 16, 8, 4);
+RAND(2, 4, 16, 256);
+RAND(4, 8, 16, 32);
+
+RAND(10, 10, 10, 10);
+
+RAND(1920, 1080, 1, 1);
+RAND(1280, 720, 1, 1);
+RAND(640, 480, 1, 1);
+
+RAND(215, 24, 6, 5);
+RAND(132, 64, 23, 2);
+RAND(15, 35, 50, 3);
+RAND(77, 43, 8, 1);
+RAND(123, 45, 6, 7);
+RAND(345, 28, 9, 1);
+RAND(79, 68, 12, 6);
+RAND(45, 1, 1, 1);
 
-template<>
-void randnTest<char>(af::dim4 &dims)
-{
-    if (noDoubleTests<char>()) return;
+template<typename T>
+void randuArgsTest() {
+    SUPPORTED_TYPE_CHECK(T);
 
+    dim_t ndims       = 4;
+    dim_t dims[]      = {1, 2, 3, 0};
     af_array outArray = 0;
-    ASSERT_EQ(AF_ERR_TYPE, af_randn(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<char>::af_type));
-    if(outArray != 0) af_release_array(outArray);
+    ASSERT_EQ(AF_ERR_SIZE, af_randu(&outArray, ndims, dims,
+                                    (af_dtype)dtype_traits<char>::af_type));
+    ASSERT_EQ(af_sync(-1), AF_SUCCESS);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-template<>
-void randnTest<unsigned char>(af::dim4 &dims)
-{
-    if (noDoubleTests<unsigned char>()) return;
+TYPED_TEST(Random, InvalidArgs) { randuArgsTest<TypeParam>(); }
 
-    af_array outArray = 0;
-    ASSERT_EQ(AF_ERR_TYPE, af_randn(&outArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<unsigned char>::af_type));
-    if(outArray != 0) af_release_array(outArray);
-}
+template<typename T>
+void randuDimsTest() {
+    SUPPORTED_TYPE_CHECK(T);
 
-#define RAND(d0, d1, d2, d3)                            \
-    TYPED_TEST(Random,randu_##d0##_##d1##_##d2##_##d3)  \
-    {                                                   \
-        af::dim4 dims(d0, d1, d2, d3);                  \
-        randuTest<TypeParam>(dims);                     \
-    }                                                   \
-    TYPED_TEST(Random,randn_##d0##_##d1##_##d2##_##d3)  \
-    {                                                   \
-        af::dim4 dims(d0, d1, d2, d3);                  \
-        randnTest<TypeParam>(dims);                     \
-    }                                                   \
-
-RAND(1024, 1024,    1,    1);
-RAND( 512,  512,    1,    1);
-RAND( 256,  256,    1,    1);
-RAND( 128,  128,    1,    1);
-RAND(  64,   64,    1,    1);
-RAND(  32,   32,    1,    1);
-RAND(  16,   16,    1,    1);
-RAND(   8,    8,    1,    1);
-RAND(   4,    4,    1,    1);
-RAND(   2,    2,    2,    2);
-RAND(   1,    1,    1,    1);
-RAND( 256,   16,    4,    2);
-RAND(  32,   16,    8,    4);
-RAND(   2,    4,   16,  256);
-RAND(   4,    8,   16,   32);
-
-RAND(  10,   10,   10,   10);
-
-RAND(1920, 1080,    1,    1);
-RAND(1280,  720,    1,    1);
-RAND( 640,  480,    1,    1);
-
-RAND( 215,   24,    6,    5);
-RAND( 132,   64,   23,    2);
-RAND(  15,   35,   50,    3);
-RAND(  77,   43,    8,    1);
-RAND( 123,   45,    6,    7);
-RAND( 345,   28,    9,    1);
-RAND(  79,   68,   12,    6);
-RAND(  45,    1,    1,    1);
+    dim4 dims(1, 65535 * 32, 1, 1);
+    array large_rand = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+    ASSERT_EQ(large_rand.dims()[1], 65535 * 32);
 
-template<typename T>
-void randuArgsTest()
-{
-    if (noDoubleTests<T>()) return;
+    dims       = dim4(1, 1, 65535 * 32, 1);
+    large_rand = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+    ASSERT_EQ(large_rand.dims()[2], 65535 * 32);
 
-    dim_t ndims = 4;
-    dim_t dims[] = {1, 2, 3, 0};
-    af_array outArray = 0;
-    ASSERT_EQ(AF_ERR_SIZE, af_randu(&outArray, ndims, dims, (af_dtype) af::dtype_traits<char>::af_type));
-    if(outArray != 0) af_release_array(outArray);
+    dims       = dim4(1, 1, 1, 65535 * 32);
+    large_rand = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+    ASSERT_EQ(large_rand.dims()[3], 65535 * 32);
 }
 
-TYPED_TEST(Random,InvalidArgs)
-{
-    randuArgsTest<TypeParam>();
-}
+TYPED_TEST(Random, InvalidDims) { randuDimsTest<TypeParam>(); }
 
 ////////////////////////////////////// CPP /////////////////////////////////////
 //
-TEST(Random, CPP)
-{
-    if (noDoubleTests<float>()) return;
 
+using af::allTrue;
+using af::constant;
+using af::getDefaultRandomEngine;
+using af::getSeed;
+using af::mean;
+using af::randomEngine;
+using af::randomEngineType;
+using af::randu;
+using af::setDefaultRandomEngineType;
+using af::setSeed;
+using af::stdev;
+using af::sum;
+
+TEST(RandomEngine, Default) {
+    // Using the default random engine segfaults if it was never
+    // initialized. This test must run before any test that sets the
+    // engine type, so that it verifies the default engine is set up
+    // correctly; otherwise this test will fail.
+    randomEngine engine = getDefaultRandomEngine();
+}
+
+TEST(Random, CPP) {
     // TEST will fail if exception is thrown, which are thrown
     // when only wrong inputs are thrown on bad access happens
-    af::dim4 dims(1, 2, 3, 1);
-    af::array out1 = af::randu(dims);
-    af::array out2 = af::randn(dims);
+    dim4 dims(1, 2, 3, 1);
+    array out1 = randu(dims);
+    array out2 = randn(dims);
+    setDefaultRandomEngineType(AF_RANDOM_ENGINE_PHILOX);
+    array out3 = randu(dims);
+    array out4 = randn(dims);
+    setDefaultRandomEngineType(AF_RANDOM_ENGINE_THREEFRY);
+    array out5 = randu(dims);
+    array out6 = randn(dims);
+    setDefaultRandomEngineType(AF_RANDOM_ENGINE_MERSENNE);
+    array out7 = randu(dims);
+    array out8 = randn(dims);
+    af::sync();
 }
 
 template<typename T>
-void testSetSeed(const uintl seed0, const uintl seed1, bool is_norm = false)
-{
+void testSetSeed(const uintl seed0, const uintl seed1) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    if (noDoubleTests<T>()) return;
+    uintl orig_seed = getSeed();
 
     const int num = 1024 * 1024;
-    af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
+    dtype ty      = (dtype)dtype_traits<T>::af_type;
 
-    af::setSeed(seed0);
-    af::array in0 = is_norm ? af::randn(num, ty) : af::randu(num, ty);
+    setSeed(seed0);
+    array in0 = randu(num, ty);
 
-    af::setSeed(seed1);
-    af::array in1 = is_norm ? af::randn(num, ty) : af::randu(num, ty);
+    setSeed(seed1);
+    array in1 = randu(num, ty);
 
-    af::setSeed(seed0);
-    af::array in2 = is_norm ? af::randn(num, ty) : af::randu(num, ty);
+    setSeed(seed0);
+    array in2 = randu(num, ty);
+    array in3 = randu(num, ty);
 
-    std::vector<T> h_in0(num);
-    std::vector<T> h_in1(num);
-    std::vector<T> h_in2(num);
+    vector<T> h_in0(num);
+    vector<T> h_in1(num);
+    vector<T> h_in2(num);
+    vector<T> h_in3(num);
 
     in0.host((void *)&h_in0[0]);
     in1.host((void *)&h_in1[0]);
     in2.host((void *)&h_in2[0]);
+    in3.host((void *)&h_in3[0]);
 
     for (int i = 0; i < num; i++) {
         // Verify if same seed produces same arrays
-        ASSERT_EQ(h_in0[i], h_in2[i]);
+        ASSERT_EQ(h_in0[i], h_in2[i]) << "at : " << i;
+
+        // Verify different arrays created with different seeds differ
+        // b8, s8 and u8 can clash because they generate a small set of values
+        if (ty != b8 && ty != s8 && ty != u8) {
+            ASSERT_NE(h_in0[i], h_in1[i]) << "at : " << i;
+        }
 
-        // Verify different arrays don't clash at same location
-        // b8 and u9 can clash because they generate a small set of values
-        if (ty != b8 && ty != u8) ASSERT_NE(h_in0[i], h_in1[i]);
+        // Verify that arrays created one after the other with the same seed
+        // differ. b8, s8 and u8 can clash because they generate a small set
+        // of values.
+        if (ty != b8 && ty != s8 && ty != u8) {
+            ASSERT_NE(h_in2[i], h_in3[i]) << "at : " << i;
+        }
     }
+
+    setSeed(orig_seed);  // Reset the seed
 }
 
-TYPED_TEST(Random, setSeed)
-{
-    testSetSeed<TypeParam>(10101, 23232, false);
+TYPED_TEST(RandomSeed, setSeed) { testSetSeed<TypeParam>(10101, 23232); }
+
+template<typename T>
+void testGetSeed(const uintl seed0, const uintl seed1) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    uintl orig_seed = getSeed();
+
+    const int num = 1024;
+    dtype ty      = (dtype)dtype_traits<T>::af_type;
+
+    setSeed(seed0);
+    array in0 = randu(num, ty);
+    ASSERT_EQ(getSeed(), seed0);
+
+    setSeed(seed1);
+    array in1 = randu(num, ty);
+    ASSERT_EQ(getSeed(), seed1);
+
+    setSeed(seed0);
+    array in2 = randu(num, ty);
+    ASSERT_EQ(getSeed(), seed0);
+
+    setSeed(orig_seed);  // Reset the seed
 }
 
-TYPED_TEST(Random_norm, setSeed)
-{
-    testSetSeed<TypeParam>(456, 789, false);
+TYPED_TEST(Random, getSeed) { testGetSeed<TypeParam>(1234, 9876); }
+
+template<typename T>
+void testRandomEngineUniform(randomEngineType type) {
+    SUPPORTED_TYPE_CHECK(T);
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    int elem = 16 * 1024 * 1024;
+    randomEngine r(type, 0);
+    array A = randu(elem, ty, r);
+
+    // If double precision is available then perform the mean calculation using
+    // double because the A array is large and causes accuracy issues when using
+    // certain compiler flags (e.g. -march=native)
+    if (af::isDoubleAvailable(af::getDevice())) {
+        array Ad = A.as(f64);
+        double m = mean<double>(Ad);
+        double s = stdev<double>(Ad, AF_VARIANCE_POPULATION);
+        ASSERT_NEAR(m, 0.5, 1e-3);
+        ASSERT_NEAR(s, 0.2887, 1e-2);
+    } else {
+        T m = mean<T>(A);
+        T s = stdev<T>(A, AF_VARIANCE_POPULATION);
+        ASSERT_NEAR(m, 0.5, 1e-3);
+        ASSERT_NEAR(s, 0.2887, 1e-2);
+    }
 }
 
 template<typename T>
-void testGetSeed(const uintl seed0, const uintl seed1)
-{
-    if (noDoubleTests<T>()) return;
+void testRandomEngineNormal(randomEngineType type) {
+    SUPPORTED_TYPE_CHECK(T);
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    int elem = 16 * 1024 * 1024;
+    randomEngine r(type, 0);
+    array A = randn(elem, ty, r);
+    T m     = mean<T>(A);
+    T s     = stdev<T>(A, AF_VARIANCE_POPULATION);
+    ASSERT_NEAR(m, 0, 1e-1);
+    ASSERT_NEAR(s, 1, 1e-1);
+}
 
-    const int num = 1024;
-    af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
+TYPED_TEST(RandomEngine, philoxRandomEngineUniform) {
+    testRandomEngineUniform<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
+
+TYPED_TEST(RandomEngine, philoxRandomEngineNormal) {
+    testRandomEngineNormal<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
+
+TYPED_TEST(RandomEngine, threefryRandomEngineUniform) {
+    testRandomEngineUniform<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
+}
+
+TYPED_TEST(RandomEngine, threefryRandomEngineNormal) {
+    testRandomEngineNormal<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
+}
+
+TYPED_TEST(RandomEngine, mersenneRandomEngineUniform) {
+    testRandomEngineUniform<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
+}
+
+TYPED_TEST(RandomEngine, mersenneRandomEngineNormal) {
+    testRandomEngineNormal<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
+}
 
-    af::setSeed(seed0);
-    af::array in0 = af::randu(num, ty);
-    ASSERT_EQ(af::getSeed(), seed0);
+template<typename T>
+void testRandomEngineSeed(randomEngineType type) {
+    SUPPORTED_TYPE_CHECK(T);
+    int elem        = 4 * 32 * 1024;
+    uintl orig_seed = 0;
+    uintl new_seed  = 1;
+    randomEngine e(type, orig_seed);
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+    array d1 = randu(elem, ty, e);
+    e.setSeed(new_seed);
+    array d2 = randu(elem, ty, e);
+    e.setSeed(orig_seed);
+    array d3 = randu(elem, ty, e);
+    array d4 = randu(elem, ty, e);
+
+    vector<T> h1(elem);
+    vector<T> h2(elem);
+    vector<T> h3(elem);
+    vector<T> h4(elem);
+
+    d1.host((void *)h1.data());
+    d2.host((void *)h2.data());
+    d3.host((void *)h3.data());
+    d4.host((void *)h4.data());
+
+    for (int i = 0; i < elem; i++) {
+        ASSERT_EQ(h1[i], h3[i]) << "at : " << i;
+        if (ty != b8 && ty != s8 && ty != u8) {
+            ASSERT_NE(h1[i], h2[i]) << "at : " << i;
+            ASSERT_NE(h3[i], h4[i]) << "at : " << i;
+        }
+    }
+}
 
-    af::setSeed(seed1);
-    af::array in1 = af::randu(num, ty);
-    ASSERT_EQ(af::getSeed(), seed1);
+TYPED_TEST(RandomEngineSeed, philoxSeedUniform) {
+    testRandomEngineSeed<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
 
-    af::setSeed(seed0);
-    af::array in2 = af::randu(num, ty);
-    ASSERT_EQ(af::getSeed(), seed0);
+TYPED_TEST(RandomEngineSeed, threefrySeedUniform) {
+    testRandomEngineSeed<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
 }
 
-TYPED_TEST(Random, getSeed)
-{
-    testGetSeed<TypeParam>(1234, 9876);
+TYPED_TEST(RandomEngineSeed, mersenneSeedUniform) {
+    testRandomEngineSeed<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
 }
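Note on the uniform-engine checks in test/random.cpp above: they compare the sample mean with 0.5 and the population standard deviation with 0.2887, which is 1/sqrt(12) for a uniform distribution on [0, 1). The sketch below reproduces that check on the host. It is only an illustration under stated assumptions: it uses `std::mt19937_64` from the C++ standard library rather than any ArrayFire engine, and `populationStats` and `uniformSample` are hypothetical helper names that do not exist in ArrayFire. Like the double-precision branch in the test, it accumulates in `double`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not an ArrayFire API): population mean and
// population standard deviation of a host-side sample.
struct Stats {
    double mean;
    double stdev;
};

Stats populationStats(const std::vector<double> &v) {
    assert(!v.empty());
    const double n = static_cast<double>(v.size());

    double sum = 0.0;
    for (double x : v) sum += x;
    const double mean = sum / n;

    // Population variance divides by n (not n - 1).
    double sq = 0.0;
    for (double x : v) sq += (x - mean) * (x - mean);
    return {mean, std::sqrt(sq / n)};
}

// Hypothetical helper: n uniform samples from a seeded standard-library
// engine, standing in for randu() so the sketch needs no device.
std::vector<double> uniformSample(std::size_t n, unsigned long long seed) {
    std::mt19937_64 gen(seed);
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    std::vector<double> out(n);
    for (double &x : out) x = dist(gen);
    return out;
}
```

Calling `populationStats(uniformSample(1 << 20, 0))` should give a mean close to 0.5 and a standard deviation close to 0.2887. The tolerance to use depends on the sample size, since the standard error of the mean shrinks as 1/sqrt(n).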
diff --git a/test/random_practrand.cpp b/test/random_practrand.cpp
new file mode 100644
index 0000000000..10c9da85e1
--- /dev/null
+++ b/test/random_practrand.cpp
@@ -0,0 +1,30 @@
+// Generate random bits and send them to STDOUT.
+// Suitable for testing with PractRand, cf. http://pracrand.sourceforge.net/
+// and http://www.pcg-random.org/posts/how-to-test-with-practrand.html
+// Command-line arguments: backend, device, rng_type
+// Example:
+// random_practrand 0 0 200 | RNG_test stdin32
+#include <arrayfire.h>
+#include <cstdint>
+#include <cstdio>
+#include <cstdlib>  // atoi
+
+using namespace af;  // setBackend, randu, array, ... are used unqualified below
+
+int main(int argc, char **argv) {
+    int backend = argc > 1 ? atoi(argv[1]) : 0;
+    setBackend(static_cast<Backend>(backend));
+    int device = argc > 2 ? atoi(argv[2]) : 0;
+    setDevice(device);
+    int rng = argc > 3 ? atoi(argv[3]) : 100;
+    setDefaultRandomEngineType(static_cast<randomEngineType>(rng));
+
+    setSeed(0xfe47fe0cc078ec30ULL);
+    int samples = 1024 * 1024;
+    while (1) {
+        array values      = randu(samples, u32);
+        uint32_t *pvalues = values.host<uint32_t>();
+        fwrite((void *)pvalues, samples * sizeof(*pvalues), 1, stdout);
+        freeHost(pvalues);
+    }
+}
diff --git a/test/range.cpp b/test/range.cpp
index bd09137a6c..0e708160c2 100644
--- a/test/range.cpp
+++ b/test/range.cpp
@@ -7,77 +7,93 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::range;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Range : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Range : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
+template<typename T>
+class RangeMax : public Range<T> {};
+
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, unsigned int, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, int, unsigned int, intl, uintl,
+                         signed char, unsigned char, short, ushort,
+                         half_float::half>
+    AllTypes;
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, unsigned int, intl, uintl,
+                         signed char, unsigned char, short, ushort>
+    RegularTypes;
 
 // register the type list
-TYPED_TEST_CASE(Range, TestTypes);
+TYPED_TEST_SUITE(Range, AllTypes);
+TYPED_TEST_SUITE(RangeMax, RegularTypes);
 
 template<typename T>
-void rangeTest(const uint x, const uint y, const uint z, const uint w, const uint dim)
-{
-    if (noDoubleTests<T>()) return;
+void rangeTest(const uint x, const uint y, const uint z, const uint w,
+               const uint dim) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    af::dim4 idims(x, y, z, w);
+    dim4 idims(x, y, z, w);
 
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_range(&outArray, idims.ndims(), idims.get(), dim, (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_range(&outArray, idims.ndims(), idims.get(), dim,
+                            (af_dtype)dtype_traits<T>::af_type));
 
     // Get result
     T* outData = new T[idims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
-    for(int w = 0; w < (int)idims[3]; w++) {
-        for(int z = 0; z < (int)idims[2]; z++) {
-            for(int y = 0; y < (int)idims[1]; y++) {
-                for(int x = 0; x < (int)idims[0]; x++) {
-                    T val = 0;
-                    if(dim == 0) {
+    for (int w = 0; w < (int)idims[3]; w++) {
+        for (int z = 0; z < (int)idims[2]; z++) {
+            for (int y = 0; y < (int)idims[1]; y++) {
+                for (int x = 0; x < (int)idims[0]; x++) {
+                    T val(0);
+                    if (dim == 0) {
                         val = x;
-                    } else if(dim == 1) {
+                    } else if (dim == 1) {
                         val = y;
-                    } else if(dim == 2) {
+                    } else if (dim == 2) {
                         val = z;
-                    } else if(dim == 3) {
+                    } else if (dim == 3) {
                         val = w;
                     }
-                    dim_t idx = w * idims[0] * idims[1] * idims[2]
-                                 + z * idims[0] * idims[1]
-                                 + y * idims[0] + x;
+                    dim_t idx = w * idims[0] * idims[1] * idims[2] +
+                                z * idims[0] * idims[1] + y * idims[0] + x;
 
-                    ASSERT_EQ(val, outData[idx]) << "at: " << idx << std::endl;
+                    ASSERT_EQ(val, outData[idx]) << "at: " << idx;
                 }
             }
         }
@@ -86,67 +102,68 @@ void rangeTest(const uint x, const uint y, const uint z, const uint w, const uin
     // Delete
     delete[] outData;
 
-    if(outArray  != 0) af_release_array(outArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-#define RANGE_INIT(desc, x, y, z, w, rep)                                                    \
-    TYPED_TEST(Range, desc)                                                                  \
-    {                                                                                       \
-        rangeTest<TypeParam>(x, y, z, w, rep);                                               \
-    }
+#define RANGE_INIT(desc, x, y, z, w, rep) \
+    TYPED_TEST(Range, desc) { rangeTest<TypeParam>(x, y, z, w, rep); }
+
+RANGE_INIT(Range1D0, 100, 1, 1, 1, 0);
 
-    RANGE_INIT(Range1D0, 100,  1, 1, 1, 0);
+RANGE_INIT(Range2D0, 10, 20, 1, 1, 0);
+RANGE_INIT(Range2D1, 100, 5, 1, 1, 1);
 
-    RANGE_INIT(Range2D0,  10, 20, 1, 1, 0);
-    RANGE_INIT(Range2D1, 100,  5, 1, 1, 1);
+RANGE_INIT(Range3D0, 20, 6, 3, 1, 0);
+RANGE_INIT(Range3D1, 10, 12, 5, 1, 1);
+RANGE_INIT(Range3D2, 25, 30, 2, 1, 2);
 
-    RANGE_INIT(Range3D0,  20,  6, 3, 1, 0);
-    RANGE_INIT(Range3D1,  10, 12, 5, 1, 1);
-    RANGE_INIT(Range3D2,  25, 30, 2, 1, 2);
+RANGE_INIT(Range4D0, 20, 6, 3, 2, 0);
+RANGE_INIT(Range4D1, 10, 12, 5, 2, 1);
+RANGE_INIT(Range4D2, 25, 30, 2, 2, 2);
+RANGE_INIT(Range4D3, 25, 30, 2, 2, 3);
 
-    RANGE_INIT(Range4D0,  20,  6, 3, 2, 0);
-    RANGE_INIT(Range4D1,  10, 12, 5, 2, 1);
-    RANGE_INIT(Range4D2,  25, 30, 2, 2, 2);
-    RANGE_INIT(Range4D3,  25, 30, 2, 2, 3);
+#define RANGE_MAX_INIT(desc, x, y, z, w, rep) \
+    TYPED_TEST(RangeMax, desc) { rangeTest<TypeParam>(x, y, z, w, rep); }
+
+RANGE_MAX_INIT(Range1DMaxDim0, 65535 * 32 + 1, 1, 1, 1, 0);
+RANGE_MAX_INIT(Range1DMaxDim1, 1, 65535 * 32 + 1, 1, 1, 0);
+RANGE_MAX_INIT(Range1DMaxDim2, 1, 1, 65535 * 32 + 1, 1, 0);
+RANGE_MAX_INIT(Range1DMaxDim3, 1, 1, 1, 65535 * 32 + 1, 0);
 
 ///////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Range, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    const unsigned x = 23;
-    const unsigned y = 15;
-    const unsigned z = 4;
-    const unsigned w = 2;
+TEST(Range, CPP) {
+    const unsigned x   = 23;
+    const unsigned y   = 15;
+    const unsigned z   = 4;
+    const unsigned w   = 2;
     const unsigned dim = 2;
 
-    af::dim4 idims(x, y, z, w);
-    af::array output = af::range(x, y, z, w, dim, f32);
+    dim4 idims(x, y, z, w);
+    array output = range(x, y, z, w, dim, f32);
 
     // Get result
     float* outData = new float[idims.elements()];
     output.host((void*)outData);
 
     // Compare result
-    for(int w = 0; w < (int)idims[3]; w++) {
-        for(int z = 0; z < (int)idims[2]; z++) {
-            for(int y = 0; y < (int)idims[1]; y++) {
-                for(int x = 0; x < (int)idims[0]; x++) {
+    for (int w = 0; w < (int)idims[3]; w++) {
+        for (int z = 0; z < (int)idims[2]; z++) {
+            for (int y = 0; y < (int)idims[1]; y++) {
+                for (int x = 0; x < (int)idims[0]; x++) {
                     float val = 0;
-                    if(dim == 0) {
+                    if (dim == 0) {
                         val = x;
-                    } else if(dim == 1) {
+                    } else if (dim == 1) {
                         val = y;
-                    } else if(dim == 2) {
+                    } else if (dim == 2) {
                         val = z;
-                    } else if(dim == 3) {
+                    } else if (dim == 3) {
                         val = w;
                     }
                     dim_t idx = (w * idims[0] * idims[1] * idims[2]) +
-                                   (z * idims[0] * idims[1]) +
-                                   (y * idims[0]) + x;
-                    ASSERT_EQ(val, outData[idx]) << "at: " << idx << std::endl;
+                                (z * idims[0] * idims[1]) + (y * idims[0]) + x;
+                    ASSERT_EQ(val, outData[idx]) << "at: " << idx << endl;
                 }
             }
         }
@@ -155,3 +172,41 @@ TEST(Range, CPP)
     // Delete
     delete[] outData;
 }
+
+TEST(Range, SNIPPET_data_func_range) {
+    // clang-format off
+    //! [ex_data_func_range]
+    //!
+    // Generates an array of [0, 4] along first dimension
+    array a = range(dim4(5));          // a = [0,
+                                       //      1,
+                                       //      2,
+                                       //      3,
+                                       //      4]
+
+    // Generates an array of [0, 4] along first dimension, tiled along second dimension
+    array b = range(dim4(5, 2));       // b = [0, 0,
+                                       //      1, 1,
+                                       //      2, 2,
+                                       //      3, 3,
+                                       //      4, 4]
+
+    // Generates an array of [0, 2] along second dimension, tiled along first dimension
+    array c = range(dim4(5, 3), 1);    // c = [0, 1, 2,
+                                       //      0, 1, 2,
+                                       //      0, 1, 2,
+                                       //      0, 1, 2,
+                                       //      0, 1, 2]
+
+    //! [ex_data_func_range]
+    // clang-format on
+
+    using std::vector;
+    vector<float> gold_a{0, 1, 2, 3, 4};
+    vector<float> gold_b{0, 1, 2, 3, 4, 0, 1, 2, 3, 4};
+    vector<float> gold_c{0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2};
+
+    ASSERT_VEC_ARRAY_EQ(gold_a, a.dims(), a);
+    ASSERT_VEC_ARRAY_EQ(gold_b, b.dims(), b);
+    ASSERT_VEC_ARRAY_EQ(gold_c, c.dims(), c);
+}
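Note on the range checks in test/range.cpp above: the loops compute the column-major linear index `w * d0 * d1 * d2 + z * d0 * d1 + y * d0 + x` and expect each element to equal its coordinate along `dim`. The sketch below is a host-only reference for that layout. `hostRange` is a hypothetical name, not an ArrayFire function, and the code assumes nothing about how the backends implement `range`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference (not an ArrayFire API): fills a
// column-major buffer so that every element holds its own coordinate
// along dimension `dim`.
std::vector<float> hostRange(const std::array<std::size_t, 4> &dims,
                             unsigned dim) {
    assert(dim < 4);
    std::vector<float> out(dims[0] * dims[1] * dims[2] * dims[3]);
    for (std::size_t w = 0; w < dims[3]; ++w) {
        for (std::size_t z = 0; z < dims[2]; ++z) {
            for (std::size_t y = 0; y < dims[1]; ++y) {
                for (std::size_t x = 0; x < dims[0]; ++x) {
                    const std::size_t coord[4] = {x, y, z, w};
                    // Same index as in the test, written in nested form:
                    // w*d0*d1*d2 + z*d0*d1 + y*d0 + x
                    const std::size_t idx =
                        ((w * dims[2] + z) * dims[1] + y) * dims[0] + x;
                    out[idx] = static_cast<float>(coord[dim]);
                }
            }
        }
    }
    return out;
}
```

With dims (5, 3, 1, 1) and `dim = 1` this yields five 0s, then five 1s, then five 2s, which is the same column-major content as `gold_c` in the snippet test.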
diff --git a/test/rank_dense.cpp b/test/rank_dense.cpp
new file mode 100644
index 0000000000..7625ab82d2
--- /dev/null
+++ b/test/rank_dense.cpp
@@ -0,0 +1,124 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::det;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::join;
+using af::randu;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Rank : public ::testing::Test {};
+
+template<typename T>
+class Det : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
+TYPED_TEST_SUITE(Rank, TestTypes);
+TYPED_TEST_SUITE(Det, TestTypes);
+
+template<typename T>
+void rankSmall() {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    T ha[] = {1, 4, 7, 2, 5, 8, 3, 6, 20};
+    array a(3, 3, ha);
+
+    ASSERT_EQ(3, (int)rank(a));
+}
+
+template<typename T>
+void rankBig(const int num) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype dt = (dtype)dtype_traits<T>::af_type;
+    array a  = randu(num, num, dt);
+    ASSERT_EQ(num, (int)rank(a));
+
+    array b = randu(num, num / 2, dt);
+    ASSERT_EQ(num / 2, (int)rank(b));
+    ASSERT_EQ(num / 2, (int)rank(transpose(b)));
+}
+
+template<typename T>
+void rankLow(const int num) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype dt = (dtype)dtype_traits<T>::af_type;
+
+    array a  = randu(3 * num, num, dt);
+    array b  = randu(3 * num, num, dt);
+    array c  = a + 0.2 * b;
+    array in = join(1, a, b, c);
+
+    // The last third is just a linear combination of first and second thirds
+    ASSERT_EQ(2 * num, (int)rank(in));
+}
+
+TYPED_TEST(Rank, small) { rankSmall<TypeParam>(); }
+
+TYPED_TEST(Rank, big) { rankBig<TypeParam>(1024); }
+
+TYPED_TEST(Rank, low) { rankLow<TypeParam>(512); }
+
+template<typename T>
+void detTest() {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype dt = (dtype)dtype_traits<T>::af_type;
+
+    vector<dim4> numDims;
+
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/lapack/detSmall.test"),
+                                   numDims, in, tests);
+    dim4 dims = numDims[0];
+
+    array input = array(dims, &(in[0].front())).as(dt);
+    T output    = det<T>(input);
+
+    ASSERT_NEAR(abs((T)tests[0][0]), abs(output), 1e-6);
+}
+
+TYPED_TEST(Det, Small) { detTest<TypeParam>(); }
+
+TEST(Rank, NullOutput) {
+    LAPACK_ENABLED_CHECK();
+    dim4 dims(3, 3);
+    af_array in = 0;
+    af_randu(&in, dims.ndims(), dims.get(), f32);
+
+    ASSERT_EQ(AF_ERR_ARG, af_rank(NULL, in, 1e-6));
+    ASSERT_SUCCESS(af_release_array(in));
+}
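Note on `rankLow` in test/rank_dense.cpp above: it joins `a`, `b` and `a + 0.2 * b` column-wise and expects rank `2 * num`, because the last block is a linear combination of the first two. The sketch below shows the same property on a tiny host matrix. It computes a numerical rank by Gaussian elimination with partial pivoting, which is an assumption made for illustration and not necessarily the method ArrayFire uses. `numericalRank` and the tolerance default are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Row-major matrix; all rows are assumed to have the same length.
using Matrix = std::vector<std::vector<double>>;

// Hypothetical reference (not ArrayFire's implementation): numerical rank
// via Gaussian elimination with partial pivoting. Pivots whose magnitude
// is at most `tol` are treated as zero.
std::size_t numericalRank(Matrix m, double tol = 1e-9) {
    const std::size_t rows = m.size();
    const std::size_t cols = rows ? m[0].size() : 0;
    std::size_t rank = 0;
    for (std::size_t c = 0; c < cols && rank < rows; ++c) {
        // Choose the largest remaining entry in column c as the pivot.
        std::size_t pivot = rank;
        for (std::size_t r = rank + 1; r < rows; ++r) {
            if (std::fabs(m[r][c]) > std::fabs(m[pivot][c])) pivot = r;
        }
        // No usable pivot: this column depends on the earlier ones.
        if (std::fabs(m[pivot][c]) <= tol) continue;
        std::swap(m[pivot], m[rank]);
        // Eliminate column c from every row below the pivot row.
        for (std::size_t r = rank + 1; r < rows; ++r) {
            const double f = m[r][c] / m[rank][c];
            for (std::size_t k = c; k < cols; ++k) m[r][k] -= f * m[rank][k];
        }
        ++rank;
    }
    return rank;
}
```

For a 3x3 matrix whose third column is the first column plus 0.2 times the second, `numericalRank` returns 2. For the matrix used in `rankSmall` it returns 3.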
diff --git a/test/reduce.cpp b/test/reduce.cpp
index 5401cd5a87..c50f95d924 100644
--- a/test/reduce.cpp
+++ b/test/reduce.cpp
@@ -7,48 +7,61 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
+
+#include <math.h>
+#include <algorithm>
+#include <cmath>
+#include <functional>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
 using af::array;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::freeHost;
+using af::tile;
+using std::complex;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
 
+template<typename T>
+class Reduce : public ::testing::Test {};
 
 template<typename T>
-class Reduce : public ::testing::Test
-{
-};
+class ReduceByKey : public ::testing::Test {};
 
-typedef ::testing::Types<float, double, af::cfloat, af::cdouble, uint, int, char, uchar> TestTypes;
-TYPED_TEST_CASE(Reduce, TestTypes);
+typedef ::testing::Types<float, double, cfloat, cdouble, uint, int, intl, uintl,
+                         schar, uchar, short, ushort>
+    TestTypes;
+TYPED_TEST_SUITE(Reduce, TestTypes);
+TYPED_TEST_SUITE(ReduceByKey, TestTypes);
 
 typedef af_err (*reduceFunc)(af_array *, const af_array, const int);
 
 template<typename Ti, typename To, reduceFunc af_reduce>
-void reduceTest(string pTestFile, int off = 0, bool isSubRef=false, const vector<af_seq> seqv=vector<af_seq>())
-{
-    if (noDoubleTests<Ti>()) return;
-    if (noDoubleTests<To>()) return;
+void reduceTest(string pTestFile, int off = 0, bool isSubRef = false,
+                const vector<af_seq> seqv = vector<af_seq>()) {
+    SUPPORTED_TYPE_CHECK(Ti);
+    SUPPORTED_TYPE_CHECK(To);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
+    dim4 dims = numDims[0];
 
-    vector<Ti> in(data[0].begin(), data[0].end());
+    vector<Ti> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(), convert_to<Ti, int>);
 
     af_array inArray   = 0;
     af_array outArray  = 0;
@@ -56,184 +69,268 @@ void reduceTest(string pTestFile, int off = 0, bool isSubRef=false, const vector
 
     // Get input array
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<Ti>::af_type));
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(tempArray));
+        ASSERT_SUCCESS(
+            af_create_array(&tempArray, &in.front(), dims.ndims(), dims.get(),
+                            (af_dtype)af::dtype_traits<Ti>::af_type));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
+        ASSERT_SUCCESS(af_release_array(tempArray));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<Ti>::af_type));
+        ASSERT_SUCCESS(
+            af_create_array(&inArray, &in.front(), dims.ndims(), dims.get(),
+                            (af_dtype)af::dtype_traits<Ti>::af_type));
     }
 
     // Compare result
     for (int d = 0; d < (int)tests.size(); ++d) {
-
         vector<To> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        ASSERT_EQ(AF_SUCCESS, af_reduce(&outArray, inArray, d + off));
+        ASSERT_SUCCESS(af_reduce(&outArray, inArray, d + off));
+
+        af_dtype t;
+        af_get_type(&t, outArray);
 
         // Get result
-        To *outData;
-        outData = new To[dims.elements()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        vector<To> outData(dims.elements());
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
 
         size_t nElems = currGoldBar.size();
-        for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                            << " for dim " << d + off << std::endl;
+        if (std::equal(currGoldBar.begin(), currGoldBar.end(),
+                       outData.begin()) == false) {
+            for (size_t elIter = 0; elIter < nElems; ++elIter) {
+                EXPECT_EQ(currGoldBar[elIter], outData[elIter])
+                    << "at: " << elIter << " for dim " << d + off << endl;
+            }
+            for (int i = 0; i < (int)nElems; i++) {
+                cout << currGoldBar[i] << ", ";
+            }
+
+            cout << endl;
+            for (int i = 0; i < (int)nElems; i++) {
+                cout << outData[i] << ", ";
+            }
+            FAIL();
         }
 
-        // Delete
-        delete[] outData;
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
     }
 
-
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-vector<af_seq> init_subs()
-{
-    vector<af_seq> subs;
-    subs.push_back(af_make_seq(2, 6, 1));
-    subs.push_back(af_make_seq(1, 5, 1));
-    subs.push_back(af_make_seq(1, 3, 1));
-    subs.push_back(af_make_seq(1, 2, 1));
-    return subs;
-}
-
-#define REDUCE_TESTS(FN, TAG, Ti, To)                   \
-    TEST(Reduce,Test_##FN##_##TAG)                      \
-    {                                                   \
-        reduceTest<Ti, To, af_##FN>(                    \
-            string(TEST_DIR"/reduce/"#FN".test")        \
-            );                                          \
-    }                                                   \
-
-REDUCE_TESTS(sum, float   , float     , float     );
-REDUCE_TESTS(sum, double  , double    , double    );
-REDUCE_TESTS(sum, int     , int       , int       );
-REDUCE_TESTS(sum, cfloat  , cfloat , cfloat );
-REDUCE_TESTS(sum, cdouble , cdouble, cdouble);
-REDUCE_TESTS(sum, unsigned, unsigned  , unsigned  );
-REDUCE_TESTS(sum, uchar   , unsigned char, unsigned);
-
-REDUCE_TESTS(min, float   , float     , float     );
-REDUCE_TESTS(min, double  , double    , double    );
-REDUCE_TESTS(min, int     , int       , int       );
-REDUCE_TESTS(min, cfloat  , cfloat , cfloat );
-REDUCE_TESTS(min, cdouble , cdouble, cdouble);
-REDUCE_TESTS(min, unsigned, unsigned  , unsigned  );
-REDUCE_TESTS(min, uchar   , unsigned char, unsigned char);
-
-REDUCE_TESTS(max, float   , float     , float     );
-REDUCE_TESTS(max, double  , double    , double    );
-REDUCE_TESTS(max, int     , int       , int       );
-REDUCE_TESTS(max, cfloat  , cfloat , cfloat );
-REDUCE_TESTS(max, cdouble , cdouble, cdouble);
-REDUCE_TESTS(max, unsigned, unsigned  , unsigned  );
-REDUCE_TESTS(max, uchar   , unsigned char, unsigned char);
-
-REDUCE_TESTS(any_true, float   , float     , unsigned char);
-REDUCE_TESTS(any_true, double  , double    , unsigned char);
-REDUCE_TESTS(any_true, int     , int       , unsigned char);
-REDUCE_TESTS(any_true, cfloat  , cfloat , unsigned char);
-REDUCE_TESTS(any_true, cdouble , cdouble, unsigned char);
-REDUCE_TESTS(any_true, unsigned, unsigned  , unsigned char);
-REDUCE_TESTS(any_true, uchar   , unsigned char, unsigned char);
-
-REDUCE_TESTS(all_true, float   , float     , unsigned char);
-REDUCE_TESTS(all_true, double  , double    , unsigned char);
-REDUCE_TESTS(all_true, int     , int       , unsigned char);
-REDUCE_TESTS(all_true, cfloat  , cfloat , unsigned char);
-REDUCE_TESTS(all_true, cdouble , cdouble, unsigned char);
-REDUCE_TESTS(all_true, unsigned, unsigned  , unsigned char);
-REDUCE_TESTS(all_true, uchar   , unsigned char, unsigned char);
-
-REDUCE_TESTS(count, float   , float     , unsigned);
-REDUCE_TESTS(count, double  , double    , unsigned);
-REDUCE_TESTS(count, int     , int       , unsigned);
-REDUCE_TESTS(count, cfloat  , cfloat , unsigned);
-REDUCE_TESTS(count, cdouble , cdouble, unsigned);
-REDUCE_TESTS(count, unsigned, unsigned  , unsigned);
-REDUCE_TESTS(count, uchar   , unsigned char, unsigned);
-
-TEST(Reduce,Test_Reduce_Big0)
-{
-    if (noDoubleTests<int>()) return;
+template<typename T, reduceFunc OP>
+struct promote_type {
+    typedef T type;
+};
 
-    reduceTest<int, int, af_sum>(
-        string(TEST_DIR"/reduce/big0.test"),
-        0
-        );
+// 8- and 16-bit integer types are promoted to int/uint for sum and product
+template<>
+struct promote_type<schar, af_sum> {
+    typedef int type;
+};
+template<>
+struct promote_type<uchar, af_sum> {
+    typedef uint type;
+};
+template<>
+struct promote_type<char, af_sum> {
+    typedef uint type;
+};
+template<>
+struct promote_type<short, af_sum> {
+    typedef int type;
+};
+template<>
+struct promote_type<ushort, af_sum> {
+    typedef uint type;
+};
+template<>
+struct promote_type<schar, af_product> {
+    typedef int type;
+};
+template<>
+struct promote_type<uchar, af_product> {
+    typedef uint type;
+};
+template<>
+struct promote_type<char, af_product> {
+    typedef uint type;
+};
+template<>
+struct promote_type<short, af_product> {
+    typedef int type;
+};
+template<>
+struct promote_type<ushort, af_product> {
+    typedef uint type;
+};
+
+// float16 is promoted to float32 for sum and product
+template<>
+struct promote_type<half_float::half, af_sum> {
+    typedef float type;
+};
+template<>
+struct promote_type<half_float::half, af_product> {
+    typedef float type;
+};
+
+#define REDUCE_TESTS(FN)                                                       \
+    TYPED_TEST(Reduce, Test_##FN) {                                            \
+        reduceTest<TypeParam, typename promote_type<TypeParam, af_##FN>::type, \
+                   af_##FN>(string(TEST_DIR "/reduce/" #FN ".test"));          \
+    }
+
+REDUCE_TESTS(sum);
+REDUCE_TESTS(min);
+REDUCE_TESTS(max);
+
+#undef REDUCE_TESTS
+#define REDUCE_TESTS(FN, OT)                          \
+    TYPED_TEST(Reduce, Test_##FN) {                   \
+        reduceTest<TypeParam, OT, af_##FN>(           \
+            string(TEST_DIR "/reduce/" #FN ".test")); \
+    }
+
+REDUCE_TESTS(any_true, unsigned char);
+REDUCE_TESTS(all_true, unsigned char);
+REDUCE_TESTS(count, unsigned);
+
+#undef REDUCE_TESTS
+
+TEST(Reduce, Test_Reduce_Big0) {
+    reduceTest<int, int, af_sum>(string(TEST_DIR "/reduce/big0.test"), 0);
 }
 
+/*
 TEST(Reduce,Test_Reduce_Big1)
 {
-    if (noDoubleTests<int>()) return;
-
     reduceTest<int, int, af_sum>(
         string(TEST_DIR"/reduce/big1.test"),
         1
         );
 }
+*/
 
 /////////////////////////////////// CPP //////////////////////////////////
 //
-typedef af::array (*ReductionOp)(const af::array&, const int);
+typedef af::array (*ReductionOp)(const af::array &, const int);
 
-using af::sum;
-using af::min;
-using af::max;
 using af::allTrue;
 using af::anyTrue;
+using af::constant;
 using af::count;
+using af::iota;
+using af::max;
+using af::min;
+using af::NaN;
+using af::product;
+using af::randu;
+using af::round;
+using af::seq;
+using af::span;
+using af::sum;
 
 template<typename Ti, typename To, ReductionOp reduce>
-void cppReduceTest(string pTestFile)
-{
-    if (noDoubleTests<Ti>()) return;
-    if (noDoubleTests<To>()) return;
+void cppReduceTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(Ti);
+    SUPPORTED_TYPE_CHECK(To);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
+    dim4 dims = numDims[0];
 
-    vector<Ti> in(data[0].begin(), data[0].end());
+    vector<Ti> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(), convert_to<Ti, int>);
 
-    af::array input(dims, &in.front());
+    array input(dims, &in.front());
 
     // Compare result
     for (int d = 0; d < (int)tests.size(); ++d) {
-
         vector<To> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        af::array output = reduce(input, d);
+        array output = reduce(input, d);
 
         // Get result
-        To *outData = new To[dims.elements()];
-        output.host((void*)outData);
+        vector<To> outData(dims.elements());
+        output.host((void *)&outData.front());
 
         size_t nElems = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                            << " for dim " << d << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << " for dim " << d << endl;
         }
-
-        // Delete
-        delete[] outData;
     }
 }
 
-#define CPP_REDUCE_TESTS(FN, FNAME, Ti, To)        \
-    TEST(Reduce, Test_##FN##_CPP)                  \
-    {                                              \
-        cppReduceTest<Ti, To, FN>(                 \
-            string(TEST_DIR"/reduce/"#FNAME".test")\
-            );                                     \
+TEST(Reduce, Test_Sum_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = constant(1, dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(sum<float>(A, 1), largeDim);
+    A = constant(1, dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(sum<float>(A, 2), largeDim);
+    A = constant(1, dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(sum<float>(A, 3), largeDim);
+}
+
+TEST(Reduce, Test_Min_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = iota(dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(min(A, 1).scalar<float>(), 0.f);
+    A = iota(dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(min(A, 2).scalar<float>(), 0.f);
+    A = iota(dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(min(A, 3).scalar<float>(), 0.f);
+}
+
+TEST(Reduce, Test_Max_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = iota(dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(max(A, 1).scalar<float>(), largeDim - 1);
+    A = iota(dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(max(A, 2).scalar<float>(), largeDim - 1);
+    A = iota(dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(max(A, 3).scalar<float>(), largeDim - 1);
+}
+
+TEST(Reduce, Test_anyTrue_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = constant(1, dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(anyTrue(A, 1).scalar<char>(), 1);
+    A = constant(1, dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(anyTrue(A, 2).scalar<char>(), 1);
+    A = constant(1, dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(anyTrue(A, 3).scalar<char>(), 1);
+}
+
+TEST(Reduce, Test_allTrue_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = constant(1, dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(allTrue(A, 1).scalar<char>(), 1);
+    A = constant(1, dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(allTrue(A, 2).scalar<char>(), 1);
+    A = constant(1, dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(allTrue(A, 3).scalar<char>(), 1);
+}
+
+TEST(Reduce, Test_count_Scalar_MaxDim) {
+    const size_t largeDim = 65535 * 32 * 8 + 1;
+    array A               = constant(1, dim4(1, largeDim, 1, 1));
+    ASSERT_EQ(count(A, 1).scalar<unsigned int>(), largeDim);
+    A = constant(1, dim4(1, 1, largeDim, 1));
+    ASSERT_EQ(count(A, 2).scalar<unsigned int>(), largeDim);
+    A = constant(1, dim4(1, 1, 1, largeDim));
+    ASSERT_EQ(count(A, 3).scalar<unsigned int>(), largeDim);
+}
+
+#define CPP_REDUCE_TESTS(FN, FNAME, Ti, To)                                    \
+    TEST(Reduce, Test_##FN##_CPP) {                                            \
+        cppReduceTest<Ti, To, FN>(string(TEST_DIR "/reduce/" #FNAME ".test")); \
     }
 
 CPP_REDUCE_TESTS(sum, sum, float, float);
@@ -243,194 +340,2198 @@ CPP_REDUCE_TESTS(anyTrue, any_true, float, unsigned char);
 CPP_REDUCE_TESTS(allTrue, all_true, float, unsigned char);
 CPP_REDUCE_TESTS(count, count, float, unsigned);
 
-TEST(Reduce, Test_Product_Global)
-{
-    int num = 100;
-    af::array a = 1 + af::round(5 * af::randu(num, 1)) / 100;
+struct reduce_by_key_params {
+    size_t iSize, oSize;
+    void *iKeys_;
+    void *iVals_;
+    void *oKeys_;
+    void *oVals_;
+    af_dtype kType_, vType_, oType_;
+    string testname_;
+    virtual ~reduce_by_key_params() {}
+};
 
-    float res = af::product<float>(a);
-    float *h_a = a.host<float>();
-    float gold = 1;
+//
+// Reduce By Key tests
+//
+template<typename Tk, typename Tv, typename To>
+struct reduce_by_key_params_t : public reduce_by_key_params {
+    vector<Tk> iKeys_;
+    vector<Tv> iVals_;
+    vector<Tk> oKeys_;
+    vector<To> oVals_;
+    string testname_;
+
+    reduce_by_key_params_t(vector<Tk> ikeys, vector<Tv> ivals, vector<Tk> okeys,
+                           vector<To> ovals, string testname)
+        : iKeys_(ikeys)
+        , iVals_(ivals)
+        , oKeys_(okeys)
+        , oVals_(ovals)
+        , testname_(testname) {
+        reduce_by_key_params::iSize  = iKeys_.size();
+        reduce_by_key_params::oSize  = oKeys_.size();
+        reduce_by_key_params::iKeys_ = iKeys_.data();
+        reduce_by_key_params::iVals_ = iVals_.data();
+        reduce_by_key_params::oKeys_ = oKeys_.data();
+        reduce_by_key_params::oVals_ = oVals_.data();
+        reduce_by_key_params::vType_ = (af_dtype)af::dtype_traits<Tv>::af_type;
+        reduce_by_key_params::kType_ = (af_dtype)af::dtype_traits<Tk>::af_type;
+        reduce_by_key_params::oType_ = (af_dtype)af::dtype_traits<To>::af_type;
+        reduce_by_key_params::testname_ = testname_;
+    }
+    ~reduce_by_key_params_t() {}
+};
 
-    for (int i = 0; i < num; i++) {
-        gold *= h_a[i];
+array ptrToArray(size_t size, void *ptr, af_dtype type) {
+    array res;
+    switch (type) {
+        case f32: res = array(size, (float *)ptr); break;
+        case f64: res = array(size, (double *)ptr); break;
+        case c32: res = array(size, (cfloat *)ptr); break;
+        case c64: res = array(size, (cdouble *)ptr); break;
+        case u32: res = array(size, (unsigned *)ptr); break;
+        case s32: res = array(size, (int *)ptr); break;
+        case u64: res = array(size, (unsigned long long *)ptr); break;
+        case s64: res = array(size, (long long *)ptr); break;
+        case u16: res = array(size, (unsigned short *)ptr); break;
+        case s16: res = array(size, (short *)ptr); break;
+        case b8: res = array(size, (char *)ptr); break;
+        case s8: res = array(size, (signed char *)ptr); break;
+        case u8: res = array(size, (unsigned char *)ptr); break;
+        case f16: res = array(size, (half_float::half *)ptr); break;
     }
+    return res;
+}
 
-    ASSERT_EQ(gold, res);
-    delete[] h_a;
+array ptrToArray(af::dim4 size, void *ptr, af_dtype type) {
+    array res;
+    switch (type) {
+        case f32: res = array(size, (float *)ptr); break;
+        case f64: res = array(size, (double *)ptr); break;
+        case c32: res = array(size, (cfloat *)ptr); break;
+        case c64: res = array(size, (cdouble *)ptr); break;
+        case u32: res = array(size, (unsigned *)ptr); break;
+        case s32: res = array(size, (int *)ptr); break;
+        case u64: res = array(size, (unsigned long long *)ptr); break;
+        case s64: res = array(size, (long long *)ptr); break;
+        case u16: res = array(size, (unsigned short *)ptr); break;
+        case s16: res = array(size, (short *)ptr); break;
+        case b8: res = array(size, (char *)ptr); break;
+        case s8: res = array(size, (signed char *)ptr); break;
+        case u8: res = array(size, (unsigned char *)ptr); break;
+        case f16: res = array(size, (half_float::half *)ptr); break;
+    }
+    return res;
 }
 
-TEST(Reduce, Test_Sum_Global)
-{
-    int num = 10000;
-    af::array a = af::round(2 * af::randu(num, 1));
+class ReduceByKeyP : public ::testing::TestWithParam<reduce_by_key_params *> {
+   public:
+    array keys, vals;
+    array keyResGold, valsReducedGold;
 
-    float res = af::sum<float>(a);
-    float *h_a = a.host<float>();
-    float gold = 0;
+    void SetUp() {
+        reduce_by_key_params *params = GetParam();
+        if (noHalfTests(params->vType_)) {
+            GTEST_SKIP() << "Half not supported on this device";
+        }
+        if (noDoubleTests(params->vType_)) {
+            GTEST_SKIP() << "Double not supported on this device";
+        }
 
-    for (int i = 0; i < num; i++) {
-        gold += h_a[i];
+        keys = ptrToArray(params->iSize, params->iKeys_, params->kType_);
+        vals = ptrToArray(params->iSize, params->iVals_, params->vType_);
+
+        keyResGold = ptrToArray(params->oSize, params->oKeys_, params->kType_);
+        valsReducedGold =
+            ptrToArray(params->oSize, params->oVals_, params->oType_);
     }
 
-    ASSERT_EQ(gold, res);
-    delete[] h_a;
+    void TearDown() { delete GetParam(); }
+};
+
+template<typename T>
+struct generateConsq {
+    T vals;
+
+    generateConsq(T v_i = 0) : vals(v_i) {}
+
+    T operator()() { return vals++; }
+};
+
+template<typename T>
+struct generateConst {
+    T vals;
+
+    generateConst(T v_i) : vals(v_i) {}
+
+    T operator()() { return vals; }
+};
+
+template<typename Tk, typename Tv, typename To>
+reduce_by_key_params *rbk_unique_data(const string testname, const int testSz,
+                                      std::function<Tk()> k_gen,
+                                      std::function<Tv()> v_gen) {
+    vector<Tk> keys(testSz);
+    vector<Tv> vals(testSz);
+
+    generate(begin(keys), end(keys), k_gen);
+    generate(begin(vals), end(vals), v_gen);
+
+    vector<Tk> okeys(begin(keys), end(keys));
+    auto last = unique(begin(okeys), end(okeys));
+    okeys.resize(distance(begin(okeys), last));
+    vector<To> ovals(testSz, To(1));
+    return new reduce_by_key_params_t<Tk, Tv, To>(keys, vals, okeys, ovals,
+                                                  testname);
 }
 
-TEST(Reduce, Test_Count_Global)
-{
-    int num = 10000;
-    af::array a = af::round(2 * af::randu(num, 1));
-    af::array b = a.as(b8);
+template<typename Tk, typename Tv, typename To>
+reduce_by_key_params *rbk_single_data(const string testname, const int testSz,
+                                      std::function<Tk()> k_gen,
+                                      std::function<Tv()> v_gen) {
+    vector<Tk> keys(testSz);
+    vector<Tv> vals(testSz);
+
+    generate(begin(keys), end(keys), k_gen);
+    generate(begin(vals), end(vals), v_gen);
+
+    vector<Tk> okeys(begin(keys), end(keys));
+    auto last = unique(begin(okeys), end(okeys));
+    okeys.resize(distance(begin(okeys), last));
+    vector<To> ovals(okeys.size(), To(keys.size()));
+    return new reduce_by_key_params_t<Tk, Tv, To>(keys, vals, okeys, ovals,
+                                                  testname);
+}
 
-    int res = af::count<int>(b);
-    char *h_b = b.host<char>();
-    int gold = 0;
+// clang-format off
+template<typename Tk, typename Tv, typename To>
+vector<reduce_by_key_params*> genUniqueKeyTests() {
+  return {rbk_unique_data<Tk, Tv, To>("unique_key", 31,          generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 32,          generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 33,          generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 127,         generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 128,         generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 129,         generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 1024,        generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 1025,        generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_unique_data<Tk, Tv, To>("unique_key", 1024 * 1025, generateConsq<Tk>(0), generateConst<Tv>(Tv( 1 )))
+    };
+}
 
-    for (int i = 0; i < num; i++) {
-        gold += h_b[i];
-    }
+template<typename Tk, typename Tv, typename To>
+vector<reduce_by_key_params*> genSingleKeyTests() {
+  return {rbk_single_data<Tk, Tv, To>("single_key", 31,         generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 32,         generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 33,         generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 127,        generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 128,        generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 129,        generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 1024,       generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 1025,       generateConst<Tk>(0), generateConst<Tv>(Tv( 1 ))),
+          rbk_single_data<Tk, Tv, To>("single_key", 128 * 1025, generateConst<Tk>(0), generateConst<Tv>(Tv( 1 )))
+    };
+}
+// clang-format on
+
+vector<reduce_by_key_params *> generateAllTypes() {
+    vector<reduce_by_key_params *> out;
+    vector<vector<reduce_by_key_params *>> tmp{
+        genUniqueKeyTests<int, float, float>(),
+        genSingleKeyTests<int, float, float>(),
+        genUniqueKeyTests<unsigned, float, float>(),
+        genSingleKeyTests<unsigned, float, float>(),
+        genUniqueKeyTests<int, double, double>(),
+        genSingleKeyTests<int, double, double>(),
+        genUniqueKeyTests<unsigned, double, double>(),
+        genSingleKeyTests<unsigned, double, double>(),
+        genUniqueKeyTests<int, cfloat, cfloat>(),
+        genSingleKeyTests<int, cfloat, cfloat>(),
+        genUniqueKeyTests<unsigned, cfloat, cfloat>(),
+        genSingleKeyTests<unsigned, cfloat, cfloat>(),
+        genUniqueKeyTests<int, cdouble, cdouble>(),
+        genSingleKeyTests<int, cdouble, cdouble>(),
+        genUniqueKeyTests<unsigned, cdouble, cdouble>(),
+        genSingleKeyTests<unsigned, cdouble, cdouble>(),
+        genUniqueKeyTests<int, half_float::half, float>(),
+        genSingleKeyTests<int, half_float::half, float>(),
+        genUniqueKeyTests<unsigned, half_float::half, float>(),
+        genSingleKeyTests<unsigned, half_float::half, float>(),
+    };
+
+    for (auto &v : tmp) { copy(begin(v), end(v), back_inserter(out)); }
+    return out;
+}
 
-    ASSERT_EQ(gold, res);
-    delete[] h_b;
+template<typename TestClass>
+string testNameGenerator(
+    const ::testing::TestParamInfo<typename TestClass::ParamType> info) {
+    af_dtype kt = info.param->kType_;
+    af_dtype vt = info.param->vType_;
+    size_t size = info.param->iSize;
+    std::stringstream s;
+    s << info.param->testname_ << "_keyType_" << kt << "_valueType_" << vt
+      << "_size_" << size;
+    return s.str();
 }
 
-TEST(Reduce, Test_min_Global)
-{
-    if (noDoubleTests<double>()) return;
+INSTANTIATE_TEST_SUITE_P(UniqueKeyTests, ReduceByKeyP,
+                         ::testing::ValuesIn(generateAllTypes()),
+                         testNameGenerator<ReduceByKeyP>);
 
-    int num = 10000;
-    af::array a = af::randu(num, 1, f64);
-    double res = af::min<double>(a);
-    double *h_a = a.host<double>();
-    double gold = std::numeric_limits<double>::max();
+TEST_P(ReduceByKeyP, SumDim0) {
+    if (noHalfTests(GetParam()->vType_)) {
+        GTEST_SKIP() << "Half not supported on this device";
+    }
+    if (noHalfTests(GetParam()->kType_)) {
+        GTEST_SKIP() << "Half not supported on this device";
+    }
+    if (noDoubleTests(GetParam()->vType_)) {
+        GTEST_SKIP() << "Double not supported on this device";
+    }
+    array keyRes, valsReduced;
+    sumByKey(keyRes, valsReduced, keys, vals, 0, 0);
 
-    if (noDoubleTests<double>()) return;
+    ASSERT_ARRAYS_EQ(keyResGold, keyRes);
+    ASSERT_ARRAYS_NEAR(valsReducedGold, valsReduced, 1e-5);
+}
 
-    for (int i = 0; i < num; i++) {
-        gold = std::min(gold, h_a[i]);
+TEST_P(ReduceByKeyP, SumDim2) {
+    if (noHalfTests(GetParam()->vType_)) {
+        GTEST_SKIP() << "Half not supported on this device";
     }
+    if (noHalfTests(GetParam()->kType_)) {
+        GTEST_SKIP() << "Half not supported on this device";
+    }
+    if (noDoubleTests(GetParam()->vType_)) {
+        GTEST_SKIP() << "Double not supported on this device";
+    }
+    const int ntile = 2;
+    vals            = tile(vals, 1, ntile, 1, 1);
+    vals            = reorder(vals, 1, 2, 0, 3);
 
-    ASSERT_EQ(gold, res);
-    delete[] h_a;
+    valsReducedGold = tile(valsReducedGold, 1, ntile, 1, 1);
+    valsReducedGold = reorder(valsReducedGold, 1, 2, 0, 3);
+
+    array keyRes, valsReduced;
+    const int dim       = 2;
+    const double nanval = 0.0;
+    sumByKey(keyRes, valsReduced, keys, vals, dim, nanval);
+
+    ASSERT_ARRAYS_EQ(keyResGold, keyRes);
+    ASSERT_ARRAYS_NEAR(valsReducedGold, valsReduced, 1e-5);
 }
 
-TEST(Reduce, Test_max_Global)
-{
-    int num = 10000;
-    af::array a = af::randu(num, 1);
-    float res = af::max<float>(a);
-    float *h_a = a.host<float>();
-    float gold = -std::numeric_limits<float>::max();
+TYPED_TEST(ReduceByKey, MultiBlockReduceSingleval) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    array keys = constant(0, 1024 * 1024, s32);
+    array vals = constant(1, 1024 * 1024,
+                          (af_dtype)af::dtype_traits<TypeParam>::af_type);
 
-    for (int i = 0; i < num; i++) {
-        gold = std::max(gold, h_a[i]);
+    array keyResGold      = constant(0, 1);
+    using promoted_t      = typename promote_type<TypeParam, af_sum>::type;
+    array valsReducedGold = constant(
+        1024 * 1024, 1, (af_dtype)af::dtype_traits<promoted_t>::af_type);
+
+    array keyRes, valsReduced;
+    sumByKey(keyRes, valsReduced, keys, vals);
+
+    ASSERT_TRUE(allTrue<bool>(keyResGold == keyRes));
+    ASSERT_ARRAYS_NEAR(valsReducedGold, valsReduced, 1e-5);
+}
+
+void reduce_by_key_test(std::string test_fn) {
+    vector<dim4> numDims;
+    vector<vector<float>> data;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(test_fn, numDims, data, tests);
+
+    for (size_t t = 0; t < numDims.size() / 2; ++t) {
+        dim4 kdim = numDims[t * 2];
+        dim4 vdim = numDims[t * 2 + 1];
+
+        vector<int> in_keys(data[t * 2].begin(), data[t * 2].end());
+        vector<float> in_vals(data[t * 2 + 1].begin(), data[t * 2 + 1].end());
+
+        af_array inKeys  = 0;
+        af_array inVals  = 0;
+        af_array outKeys = 0;
+        af_array outVals = 0;
+        ASSERT_EQ(
+            AF_SUCCESS,
+            af_create_array(&inKeys, &in_keys.front(), kdim.ndims(), kdim.get(),
+                            (af_dtype)af::dtype_traits<int>::af_type));
+        ASSERT_EQ(
+            AF_SUCCESS,
+            af_create_array(&inVals, &in_vals.front(), vdim.ndims(), vdim.get(),
+                            (af_dtype)af::dtype_traits<float>::af_type));
+
+        vector<int> currGoldKeys(tests[t * 2].begin(), tests[t * 2].end());
+        vector<float> currGoldVals(tests[t * 2 + 1].begin(),
+                                   tests[t * 2 + 1].end());
+
+        // Run sum
+        ASSERT_EQ(AF_SUCCESS,
+                  af_sum_by_key(&outKeys, &outVals, inKeys, inVals, 0));
+
+        dim_t ok0, ok1, ok2, ok3;
+        dim_t ov0, ov1, ov2, ov3;
+        ASSERT_SUCCESS(af_get_dims(&ok0, &ok1, &ok2, &ok3, outKeys));
+        ASSERT_SUCCESS(af_get_dims(&ov0, &ov1, &ov2, &ov3, outVals));
+
+        // Get result
+        vector<int> outKeysVec(ok0 * ok1 * ok2 * ok3);
+        vector<float> outValsVec(ov0 * ov1 * ov2 * ov3);
+
+        ASSERT_EQ(AF_SUCCESS,
+                  af_get_data_ptr((void *)&outKeysVec.front(), outKeys));
+        ASSERT_EQ(AF_SUCCESS,
+                  af_get_data_ptr((void *)&outValsVec.front(), outVals));
+
+        size_t nElems = currGoldKeys.size();
+        if (std::equal(currGoldKeys.begin(), currGoldKeys.end(),
+                       outKeysVec.begin()) == false) {
+            for (size_t elIter = 0; elIter < nElems; ++elIter) {
+                EXPECT_NEAR(currGoldKeys[elIter], outKeysVec[elIter], 1e-4)
+                    << "at: " << elIter << endl;
+                EXPECT_NEAR(currGoldVals[elIter], outValsVec[elIter], 1e-4)
+                    << "at: " << elIter << endl;
+            }
+            for (int i = 0; i < (int)nElems; i++) {
+                cout << currGoldKeys[i] << ":" << currGoldVals[i] << ", ";
+            }
+
+            for (int i = 0; i < (int)nElems; i++) {
+                cout << outKeysVec[i] << ":" << outValsVec[i] << ", ";
+            }
+            FAIL();
+        }
+
+        ASSERT_EQ(AF_SUCCESS, af_release_array(outKeys));
+        ASSERT_EQ(AF_SUCCESS, af_release_array(outVals));
+        ASSERT_EQ(AF_SUCCESS, af_release_array(inKeys));
+        ASSERT_EQ(AF_SUCCESS, af_release_array(inVals));
     }
+}
+TEST(ReduceByKey, MultiBlockReduceContig10) {
+    reduce_by_key_test(string(TEST_DIR "/reduce/test_contig10_by_key.test"));
+}
 
-    ASSERT_EQ(gold, res);
-    delete[] h_a;
+TEST(ReduceByKey, MultiBlockReduceRandom10) {
+    reduce_by_key_test(string(TEST_DIR "/reduce/test_random10_by_key.test"));
 }
 
+TEST(ReduceByKey, MultiBlockReduceContig500) {
+    reduce_by_key_test(string(TEST_DIR "/reduce/test_contig500_by_key.test"));
+}
 
-template<typename T>
-void typed_assert_eq(T lhs, T rhs, bool both = true)
-{
-    ASSERT_EQ(lhs, rhs);
+TEST(ReduceByKey, MultiBlockReduceByKeyRandom500) {
+    reduce_by_key_test(string(TEST_DIR "/reduce/test_random500_by_key.test"));
 }
 
-template<>
-void typed_assert_eq<float>(float lhs, float rhs, bool both)
-{
-    ASSERT_FLOAT_EQ(lhs, rhs);
+TYPED_TEST(ReduceByKey, productReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    array reduced_keys, reduced_vals;
+    productByKey(reduced_keys, reduced_vals, keys, vals, 0, 1);
+
+    const int goldSz = 5;
+    using promoted_t = typename promote_type<TypeParam, af_product>::type;
+    const vector<promoted_t> gold_reduce{0, 7, 6, 30, 4};
+
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
 }
 
-template<>
-void typed_assert_eq<double>(double lhs, double rhs, bool both)
-{
-    ASSERT_DOUBLE_EQ(lhs, rhs);
+TYPED_TEST(ReduceByKey, minReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    array reduced_keys, reduced_vals;
+    minByKey(reduced_keys, reduced_vals, keys, vals);
+
+    const int goldSz = 5;
+    const vector<TypeParam> gold_reduce{0, 1, 6, 2, 4};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
 }
 
-template<>
-void typed_assert_eq<af::cfloat>(af::cfloat lhs, af::cfloat rhs, bool both)
-{
-    ASSERT_FLOAT_EQ(lhs.real(), rhs.real());
-    if(both)
-        ASSERT_FLOAT_EQ(lhs.imag(), rhs.imag());
+TYPED_TEST(ReduceByKey, maxReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
 
+    array reduced_keys, reduced_vals;
+    maxByKey(reduced_keys, reduced_vals, keys, vals);
+
+    const int goldSz = 5;
+    const vector<TypeParam> gold_reduce{0, 7, 6, 5, 4};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
 }
 
-template<>
-void typed_assert_eq<af::cdouble>(af::cdouble lhs, af::cdouble rhs, bool both)
-{
-    ASSERT_DOUBLE_EQ(lhs.real(), rhs.real());
-    if(both)
-        ASSERT_DOUBLE_EQ(lhs.imag(), rhs.imag());
+TYPED_TEST(ReduceByKey, allTrueReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 1, 1, 1, 0, 1, 1, 1};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    array reduced_keys, reduced_vals;
+    allTrueByKey(reduced_keys, reduced_vals, keys, vals);
+
+    const int goldSz = 5;
+    const vector<char> gold_reduce{0, 1, 1, 0, 1};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
 }
 
-TYPED_TEST(Reduce, Test_All_Global)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(ReduceByKey, anyTrueReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 8, 8};
+    const TypeParam testVals[testSz] = {0, 1, 1, 1, 0, 1, 0, 0};
 
-    // Input size test
-    for(int i = 1; i < 1000; i+=100) {
-        int num = 10 * i;
-        vector<TypeParam> h_vals(num, (TypeParam)true);
-        array a(2, num/2, &h_vals.front());
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
 
-        TypeParam res = af::allTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)true, res, false);
+    array reduced_keys, reduced_vals;
+    anyTrueByKey(reduced_keys, reduced_vals, keys, vals);
 
-        h_vals[3] = false;
-        a = array(2, num/2, &h_vals.front());
+    const int goldSz = 5;
+    const vector<char> gold_reduce{0, 1, 1, 1, 0};
 
-        res = af::allTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)false, res, false);
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
+}
+
+TYPED_TEST(ReduceByKey, countReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 5};
+    const TypeParam testVals[testSz] = {0, 1, 1, 1, 0, 1, 1, 1};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    array reduced_keys, reduced_vals;
+    countByKey(reduced_keys, reduced_vals, keys, vals);
+
+    const int goldSz = 4;
+    const vector<unsigned> gold_reduce{0, 2, 1, 3};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
+}
+
+TYPED_TEST(ReduceByKey, ReduceByKeyNans) {
+    if (!IsFloatingPoint<TypeParam>::value) {
+        SUCCEED() << "Not a floating point type.";
+        return;
     }
 
-    // false value location test
-    int num = 10000;
-    vector<TypeParam> h_vals(num, (TypeParam)true);
-    for(int i = 1; i < 10000; i+=100) {
-        h_vals[i] = false;
-        array a(2, num/2, &h_vals.front());
+    SKIP_IF_FAST_MATH_ENABLED();
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz    = 8;
+    const int testKeys[testSz] = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam nan        = std::numeric_limits<TypeParam>::quiet_NaN();
+    const TypeParam testVals[testSz] = {0, 7, nan, 6, 2, 5, 3, 4};
 
-        TypeParam res = af::allTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)false, res, false);
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
 
-        h_vals[i] = true;
+    array reduced_keys, reduced_vals;
+    productByKey(reduced_keys, reduced_vals, keys, vals, 0, 1);
+
+    const int goldSz = 5;
+    using promoted_t = typename promote_type<TypeParam, af_product>::type;
+    const vector<promoted_t> gold_reduce{0, 7, 6, 30, 4};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
+}
+
+TYPED_TEST(ReduceByKey, nDim0ReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    const int ntile = 2;
+    vals            = tile(vals, af::dim4(1, ntile, ntile, ntile));
+
+    array reduced_keys, reduced_vals;
+    const int dim       = 0;
+    const double nanval = 0.0;
+    sumByKey(reduced_keys, reduced_vals, keys, vals, dim, nanval);
+
+    const dim4 goldSz(5, 2, 2, 2);
+    using promoted_t = typename promote_type<TypeParam, af_sum>::type;
+    const vector<promoted_t> gold_reduce{0, 8, 6, 10, 4, 0, 8, 6, 10, 4,
+
+                                         0, 8, 6, 10, 4, 0, 8, 6, 10, 4,
+
+                                         0, 8, 6, 10, 4, 0, 8, 6, 10, 4,
+
+                                         0, 8, 6, 10, 4, 0, 8, 6, 10, 4};
+    ASSERT_VEC_ARRAY_EQ(gold_reduce, goldSz, reduced_vals);
+}
+
+TYPED_TEST(ReduceByKey, nDim1ReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
+
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
+
+    const int ntile = 2;
+    vals            = tile(vals, af::dim4(1, ntile, 1, 1));
+    vals            = transpose(vals);
+
+    array reduced_keys, reduced_vals;
+    const int dim       = 1;
+    const double nanval = 0.0;
+    sumByKey(reduced_keys, reduced_vals, keys, vals, dim, nanval);
+
+    const int goldSz = 5;
+    using promoted_t = typename promote_type<TypeParam, af_sum>::type;
+    const promoted_t gold_reduce[goldSz] = {0, 8, 6, 10, 4};
+    vector<promoted_t> hreduce(reduced_vals.elements());
+    reduced_vals.host(hreduce.data());
+
+    for (int i = 0; i < goldSz * ntile; i++) {
+        ASSERT_EQ(gold_reduce[i / ntile], hreduce[i]);
     }
 }
 
-TYPED_TEST(Reduce, Test_Any_Global)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(ReduceByKey, nDim2ReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
 
-    // Input size test
-    for(int i = 1; i < 1000; i+=100) {
-        int num = 10 * i;
-        vector<TypeParam> h_vals(num, (TypeParam)false);
-        array a(2, num/2, &h_vals.front());
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
 
-        TypeParam res = af::anyTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)false, res, false);
+    const int ntile = 2;
+    vals            = tile(vals, af::dim4(1, ntile, 1, 1));
+    vals            = reorder(vals, 1, 2, 0, 3);
 
-        h_vals[3] = true;
-        a = array(2, num/2, &h_vals.front());
+    array reduced_keys, reduced_vals;
+    const int dim       = 2;
+    const double nanval = 0.0;
+    sumByKey(reduced_keys, reduced_vals, keys, vals, dim, nanval);
+
+    const int goldSz = 5;
+    using promoted_t = typename promote_type<TypeParam, af_sum>::type;
+    const promoted_t gold_reduce[goldSz] = {0, 8, 6, 10, 4};
+    vector<promoted_t> h_a(reduced_vals.elements());
+    reduced_vals.host(h_a.data());
 
-        res = af::anyTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)true, res, false);
+    for (int i = 0; i < goldSz * ntile; i++) {
+        ASSERT_EQ(gold_reduce[i / ntile], h_a[i]);
     }
+}
 
-    // true value location test
-    int num = 10000;
-    vector<TypeParam> h_vals(num, (TypeParam)false);
-    for(int i = 1; i < 10000; i+=100) {
-        h_vals[i] = true;
-        array a(2, num/2, &h_vals.front());
+TYPED_TEST(ReduceByKey, nDim3ReduceByKey) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+    const static int testSz          = 8;
+    const int testKeys[testSz]       = {0, 2, 2, 9, 5, 5, 5, 8};
+    const TypeParam testVals[testSz] = {0, 7, 1, 6, 2, 5, 3, 4};
 
-        TypeParam res = af::anyTrue<TypeParam>(a);
-        typed_assert_eq((TypeParam)true, res, false);
+    array keys(testSz, testKeys);
+    array vals(testSz, testVals);
 
-        h_vals[i] = false;
+    const int ntile = 2;
+    vals            = tile(vals, af::dim4(1, ntile, 1, 1));
+    vals            = reorder(vals, 1, 2, 3, 0);
+
+    array reduced_keys, reduced_vals;
+    const int dim       = 3;
+    const double nanval = 0.0;
+    sumByKey(reduced_keys, reduced_vals, keys, vals, dim, nanval);
+
+    const int goldSz = 5;
+    using promoted_t = typename promote_type<TypeParam, af_sum>::type;
+    const promoted_t gold_reduce[goldSz] = {0, 8, 6, 10, 4};
+    vector<promoted_t> h_a(reduced_vals.elements());
+    reduced_vals.host(h_a.data());
+
+    for (int i = 0; i < goldSz * ntile; i++) {
+        ASSERT_EQ(gold_reduce[i / ntile], h_a[i]);
     }
 }
+
+TEST(Reduce, Test_Product_Global) {
+    const int num = 100;
+    array a       = 1 + round(5 * randu(num, 1)) / 100;
+
+    float res  = product<float>(a);
+    float *h_a = a.host<float>();
+    float gold = 1;
+
+    for (int i = 0; i < num; i++) { gold *= h_a[i]; }
+
+    ASSERT_NEAR(gold, res, 1e-3);
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_Sum_Global) {
+    const int num = 10000;
+    array a       = round(2 * randu(num, 1));
+
+    float res  = sum<float>(a);
+    float *h_a = a.host<float>();
+    float gold = 0;
+
+    for (int i = 0; i < num; i++) { gold += h_a[i]; }
+
+    ASSERT_EQ(gold, res);
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_Count_Global) {
+    const int num = 10000;
+    array a       = round(2 * randu(num, 1));
+    array b       = a.as(b8);
+
+    int res   = count<int>(b);
+    char *h_b = b.host<char>();
+    int gold  = 0;
+
+    for (int i = 0; i < num; i++) { gold += h_b[i]; }
+
+    ASSERT_EQ(gold, res);
+    freeHost(h_b);
+}
+
+TEST(Reduce, Test_min_Global) {
+    SUPPORTED_TYPE_CHECK(double);
+
+    const int num = 10000;
+    array a       = randu(num, 1, f64);
+    double res    = min<double>(a);
+    double *h_a   = a.host<double>();
+    double gold   = std::numeric_limits<double>::max();
+
+    for (int i = 0; i < num; i++) { gold = std::min(gold, h_a[i]); }
+
+    ASSERT_EQ(gold, res);
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_max_Global) {
+    const int num = 10000;
+    array a       = randu(num, 1);
+    float res     = max<float>(a);
+    float *h_a    = a.host<float>();
+    float gold    = -std::numeric_limits<float>::max();
+
+    for (int i = 0; i < num; i++) { gold = std::max(gold, h_a[i]); }
+
+    ASSERT_EQ(gold, res);
+    freeHost(h_a);
+}
+
+template<typename T>
+void typed_assert_eq(T lhs, T rhs, bool both = true) {
+    UNUSED(both);
+    ASSERT_EQ(lhs, rhs);
+}
+
+template<>
+void typed_assert_eq<float>(float lhs, float rhs, bool both) {
+    UNUSED(both);
+    ASSERT_FLOAT_EQ(lhs, rhs);
+}
+
+template<>
+void typed_assert_eq<double>(double lhs, double rhs, bool both) {
+    UNUSED(both);
+    ASSERT_DOUBLE_EQ(lhs, rhs);
+}
+
+template<>
+void typed_assert_eq<cfloat>(cfloat lhs, cfloat rhs, bool both) {
+    ASSERT_FLOAT_EQ(real(lhs), real(rhs));
+    if (both) { ASSERT_FLOAT_EQ(imag(lhs), imag(rhs)); }
+}
+
+template<>
+void typed_assert_eq<cdouble>(cdouble lhs, cdouble rhs, bool both) {
+    ASSERT_DOUBLE_EQ(real(lhs), real(rhs));
+    if (both) { ASSERT_DOUBLE_EQ(imag(lhs), imag(rhs)); }
+}
+
+TYPED_TEST(Reduce, Test_All_Global) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    // Input size test
+    for (int i = 1; i < 1000; i += 100) {
+        int num = 10 * i;
+        vector<TypeParam> h_vals(num, (TypeParam) true);
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = allTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+
+        h_vals[3] = false;
+        a         = array(2, num / 2, &h_vals.front());
+
+        res = allTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+    }
+
+    // false value location test
+    const int num = 10000;
+    vector<TypeParam> h_vals(num, (TypeParam) true);
+    for (int i = 1; i < 10000; i += 100) {
+        h_vals[i] = false;
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = allTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+
+        h_vals[i] = true;
+    }
+}
+
+TYPED_TEST(Reduce, Test_Any_Global) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    // Input size test
+    for (int i = 1; i < 1000; i += 100) {
+        int num = 10 * i;
+        vector<TypeParam> h_vals(num, (TypeParam) false);
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = anyTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+
+        h_vals[3] = true;
+        a         = array(2, num / 2, &h_vals.front());
+
+        res = anyTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+    }
+
+    // true value location test
+    const int num = 10000;
+    vector<TypeParam> h_vals(num, (TypeParam) false);
+    for (int i = 1; i < 10000; i += 100) {
+        h_vals[i] = true;
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = anyTrue<TypeParam>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+
+        h_vals[i] = false;
+    }
+}
+
+TEST(MinMax, MinMaxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num      = 10000;
+    array A            = randu(num);
+    A(where(A < 0.25)) = NaN;
+
+    float minval = min<float>(A);
+    float maxval = max<float>(A);
+
+    ASSERT_NE(std::isnan(minval), true);
+    ASSERT_NE(std::isnan(maxval), true);
+
+    float *h_A = A.host<float>();
+
+    for (int i = 0; i < num; i++) {
+        if (!std::isnan(h_A[i])) {
+            ASSERT_LE(minval, h_A[i]);
+            ASSERT_GE(maxval, h_A[i]);
+        }
+    }
+
+    freeHost(h_A);
+}
+
+TEST(MinMax, MinCplxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    float real_wnan_data[] = {0.005f, NAN, -6.3f, NAN,      -0.5f,
+                              NAN,    NAN, 0.2f,  -1205.4f, 8.9f};
+
+    float imag_wnan_data[] = {NAN,    NAN, -9.0f, -0.005f, -0.3f,
+                              0.007f, NAN, 0.1f,  NAN,     4.5f};
+
+    int rows = 5;
+    int cols = 2;
+    array real_wnan(rows, cols, real_wnan_data);
+    array imag_wnan(rows, cols, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_min_real[] = {-0.5f, 0.2f};
+    float gold_min_imag[] = {-0.3f, 0.1f};
+
+    array min_val = af::min(a);
+
+    vector<complex<float>> h_min_val(cols);
+    min_val.host(&h_min_val[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_min_val[i].real(), gold_min_real[i]);
+        ASSERT_FLOAT_EQ(h_min_val[i].imag(), gold_min_imag[i]);
+    }
+}
+
+TEST(MinMax, MaxCplxNaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    // The 4th element of the second column has an unusually large
+    //  real part to cover the case where one part holds the largest
+    //  magnitude in the array and the other part is NaN.
+    // The NaN could be turned into 0 during the comparisons (since
+    //  Binary<>::init() initializes it to 0 for the complex max op),
+    //  in which case the element's magnitude would make it the max,
+    //  whereas it should be ignored because its other part is NaN.
+    float real_wnan_data[] = {0.005f, NAN, -6.3f, NAN,      -0.5f,
+                              NAN,    NAN, 0.2f,  -1205.4f, 8.9f};
+
+    float imag_wnan_data[] = {NAN,    NAN, -9.0f, -0.005f, -0.3f,
+                              0.007f, NAN, 0.1f,  NAN,     4.5f};
+
+    int rows = 5;
+    int cols = 2;
+    array real_wnan(rows, cols, real_wnan_data);
+    array imag_wnan(rows, cols, imag_wnan_data);
+    array a = af::complex(real_wnan, imag_wnan);
+
+    float gold_max_real[] = {-6.3f, 8.9f};
+    float gold_max_imag[] = {-9.0f, 4.5f};
+
+    array max_val = af::max(a);
+
+    vector<complex<float>> h_max_val(cols);
+    max_val.host(&h_max_val[0]);
+
+    for (int i = 0; i < cols; i++) {
+        ASSERT_FLOAT_EQ(h_max_val[i].real(), gold_max_real[i]);
+        ASSERT_FLOAT_EQ(h_max_val[i].imag(), gold_max_imag[i]);
+    }
+}
+
+TEST(Count, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num = 10000;
+    array A       = round(5 * randu(num));
+    array B       = A;
+
+    A(where(A == 2)) = NaN;
+
+    ASSERT_EQ(count<uint>(A), count<uint>(B));
+}
+
+TEST(Sum, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num      = 10000;
+    array A            = randu(num);
+    A(where(A < 0.25)) = NaN;
+
+    float res = sum<float>(A);
+
+    ASSERT_EQ(std::isnan(res), true);
+
+    res        = sum<float>(A, 0);
+    float *h_A = A.host<float>();
+
+    float tmp = 0;
+    for (int i = 0; i < num; i++) { tmp += std::isnan(h_A[i]) ? 0 : h_A[i]; }
+
+    ASSERT_NEAR(res / num, tmp / num, 1E-5);
+    freeHost(h_A);
+}
+
+TEST(Product, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num = 5;
+    array A       = randu(num);
+    A(2)          = NaN;
+
+    float res = product<float>(A);
+
+    ASSERT_EQ(std::isnan(res), true);
+
+    res        = product<float>(A, 1);
+    float *h_A = A.host<float>();
+
+    float tmp = 1;
+    for (int i = 0; i < num; i++) { tmp *= std::isnan(h_A[i]) ? 1 : h_A[i]; }
+
+    ASSERT_NEAR(res / num, tmp / num, 1E-5);
+    freeHost(h_A);
+}
+
+TEST(AnyAll, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num = 10000;
+    array A       = (randu(num) > 0.5).as(f32);
+    array B       = A;
+
+    B(where(B == 0)) = NaN;
+
+    ASSERT_EQ(anyTrue<bool>(B), true);
+    ASSERT_EQ(allTrue<bool>(B), true);
+    ASSERT_EQ(anyTrue<bool>(A), true);
+    ASSERT_EQ(allTrue<bool>(A), false);
+}
+
+TEST(MaxAll, IndexedSmall) {
+    const int num = 1000;
+    const int st  = 10;
+    const int en  = num - 100;
+    array a       = randu(num);
+    float b       = max<float>(a(seq(st, en)));
+
+    vector<float> ha(num);
+    a.host(&ha[0]);
+
+    float res = ha[st];
+    for (int i = st; i <= en; i++) { res = std::max(res, ha[i]); }
+
+    ASSERT_EQ(b, res);
+}
+
+TEST(MaxAll, IndexedBig) {
+    const int num = 100000;
+    const int st  = 1000;
+    const int en  = num - 1000;
+    array a       = randu(num);
+    float b       = max<float>(a(seq(st, en)));
+
+    vector<float> ha(num);
+    a.host(&ha[0]);
+
+    float res = ha[st];
+    for (int i = st; i <= en; i++) { res = std::max(res, ha[i]); }
+
+    ASSERT_EQ(b, res);
+}
+
+TEST(Reduce, KernelName) {
+    const int m = 64;
+    const int n = 100;
+    const int b = 5;
+
+    array in = constant(0, m, n, b);
+    for (int i = 0; i < b; i++) {
+        array tmp         = randu(m, n);
+        in(span, span, i) = tmp;
+        ASSERT_EQ(min<float>(in(span, span, i)), min<float>(tmp));
+    }
+}
+
+TEST(Reduce, AllSmallIndexed) {
+    const int len = 512;
+    for (int i = 0; i < 1000; ++i) {
+        // const int len = 10000;
+        array a = af::range(dim4(len, 2));
+        array b = a(seq(len / 2), span);
+        // af::sync();
+        ASSERT_EQ(max<float>(b), len / 2 - 1);
+    }
+}
+
+TEST(ProductAll, BoolIn_ISSUE2543_All_Ones) {
+    ASSERT_EQ(true, product<int>(constant(1, 5, 5, b8)) > 0);
+}
+
+TEST(ProductAll, BoolIn_ISSUE2543_Random_Values) {
+    array in = randu(5, 5, b8);
+    vector<char> hostData(25);
+    in.host(hostData.data());
+    unsigned int gold = 1;
+    for (size_t i = 0; i < hostData.size(); ++i) { gold *= hostData[i]; }
+    const unsigned int out = product<unsigned int>(in);
+    ASSERT_EQ(gold, out);
+}
+
+TEST(Product, BoolIn_ISSUE2543) {
+    array A = randu(5, 5, b8);
+    ASSERT_ARRAYS_EQ(allTrue(A), product(A));
+}
+
+struct reduce_params {
+    double element_value;
+    dim4 arr_dim;
+    dim4 result_dim;
+    int reduce_dim;
+    reduce_params(double ev, dim4 ad, dim4 result_d, int red_dim)
+        : element_value(ev)
+        , arr_dim(ad)
+        , result_dim(result_d)
+        , reduce_dim(red_dim) {}
+};
+
+class ReduceHalf : public ::testing::TestWithParam<reduce_params> {};
+
+INSTANTIATE_TEST_SUITE_P(
+    SumFirstNonZeroDim, ReduceHalf,
+    ::testing::Values(
+        reduce_params(1, dim4(10), dim4(1), -1),
+        reduce_params(1, dim4(10, 10), dim4(1, 10), -1),
+        reduce_params(1, dim4(10, 10, 10), dim4(1, 10, 10), -1),
+        reduce_params(1, dim4(10, 10, 10, 10), dim4(1, 10, 10, 10), -1),
+
+        reduce_params(1, dim4(2048), dim4(1), -1),
+        reduce_params(1, dim4(2048, 10), dim4(1, 10), -1),
+        reduce_params(1, dim4(2048, 10, 10), dim4(1, 10, 10), -1),
+        reduce_params(1, dim4(2048, 10, 10, 10), dim4(1, 10, 10, 10), -1),
+
+        reduce_params(1, dim4(2049), dim4(1), -1),
+        reduce_params(1, dim4(2049, 10), dim4(1, 10), -1),
+        reduce_params(1, dim4(2049, 10, 10), dim4(1, 10, 10), -1),
+        reduce_params(1, dim4(2049, 10, 10, 10), dim4(1, 10, 10, 10), -1),
+
+        reduce_params(1, dim4(8192), dim4(1), -1),
+        reduce_params(1, dim4(8192, 10), dim4(1, 10), -1),
+        reduce_params(1, dim4(8192, 10, 10), dim4(1, 10, 10), -1),
+        reduce_params(1, dim4(8192, 10, 10, 10), dim4(1, 10, 10, 10), -1)));
+
+INSTANTIATE_TEST_SUITE_P(
+    SumNonZeroDim, ReduceHalf,
+    ::testing::Values(
+        reduce_params(1.25, dim4(10, 10), dim4(10), 1),
+        reduce_params(1.25, dim4(10, 10, 10), dim4(10, 1, 10), 1),
+        reduce_params(1.25, dim4(10, 10, 10, 10), dim4(10, 1, 10, 10), 1),
+
+        reduce_params(1.25, dim4(10, 2048), dim4(10), 1),
+        reduce_params(1.25, dim4(10, 2048, 10), dim4(10, 1, 10), 1),
+        reduce_params(1.25, dim4(10, 2048, 10, 10), dim4(10, 1, 10, 10), 1),
+
+        reduce_params(1.25, dim4(10, 2049), dim4(10), 1),
+        reduce_params(1.25, dim4(10, 2049, 10), dim4(10, 1, 10), 1),
+        reduce_params(1.25, dim4(10, 2049, 10, 10), dim4(10, 1, 10, 10), 1),
+
+        reduce_params(1.25, dim4(10, 8192), dim4(10), 1),
+        reduce_params(1.25, dim4(10, 8192, 10), dim4(10, 1, 10), 1),
+        reduce_params(1.25, dim4(10, 8192, 10, 10), dim4(10, 1, 10, 10), 1),
+
+        reduce_params(1.25, dim4(10, 10, 10), dim4(10, 10, 1), 2),
+        reduce_params(1.25, dim4(10, 10, 10, 10), dim4(10, 10, 1, 10), 2),
+
+        reduce_params(1.25, dim4(10, 10, 2048), dim4(10, 10, 1), 2),
+        reduce_params(1.25, dim4(10, 10, 2048, 10), dim4(10, 10, 1, 10), 2),
+
+        reduce_params(1.25, dim4(10, 10, 2049), dim4(10, 10, 1), 2),
+        reduce_params(1.25, dim4(10, 10, 2049, 10), dim4(10, 10, 1, 10), 2),
+
+        reduce_params(1.25, dim4(10, 10, 8192), dim4(10, 10, 1), 2),
+        reduce_params(1.25, dim4(10, 10, 8192, 10), dim4(10, 10, 1, 10), 2)));
+
+TEST_P(ReduceHalf, Sum) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    reduce_params param = GetParam();
+
+    array arr = constant(param.element_value, param.arr_dim, f16);
+
+    size_t elements = 0;
+    if (param.reduce_dim == -1) {
+        elements = param.arr_dim[0];
+    } else {
+        elements = param.arr_dim[param.reduce_dim];
+    }
+
+    double result_value = param.element_value * elements;
+    array gold          = constant(result_value, param.result_dim, f32);
+
+    array result = sum(arr, param.reduce_dim);
+    ASSERT_ARRAYS_EQ(gold, result);
+}
+
+TEST_P(ReduceHalf, Product) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    reduce_params param = GetParam();
+
+    array arr = constant(param.element_value, param.arr_dim, f16);
+
+    size_t elements = 0;
+    if (param.reduce_dim == -1) {
+        elements = param.arr_dim[0];
+    } else {
+        elements = param.arr_dim[param.reduce_dim];
+    }
+
+    float result_value = pow(param.element_value, elements);
+
+    if (std::isinf(result_value)) {
+        SUCCEED();
+        return;
+    }
+    array gold = constant(result_value, param.result_dim, f32);
+
+    array result = product(arr, param.reduce_dim);
+    ASSERT_ARRAYS_EQ(gold, result);
+}
+
+// TODO(umar): HalfMin
+TEST(ReduceHalf, Min) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    float harr[] = {1, 2, 3, 4, 5, 6, 7};
+    array arr(7, harr);
+    arr       = arr.as(f16);
+    array out = min(arr);
+
+    array gold = constant(1, 1, f16);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+// TODO(umar): HalfMax
+TEST(ReduceHalf, Max) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    float harr[] = {1, 2, 3, 4, 5, 6, 7};
+    array arr(7, harr);
+    arr       = arr.as(f16);
+    array out = max(arr);
+
+    array gold = constant(7, 1, f16);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+// TODO(umar): HalfCount
+TEST(ReduceHalf, Count) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    float harr[] = {1, 2, 3, 4, 5, 6, 7};
+    array arr(7, harr);
+    arr       = arr.as(f16);
+    array out = count(arr);
+
+    array gold = constant(7, 1, u32);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+// TODO(umar): HalfAnyTrue
+TEST(ReduceHalf, AnyTrue) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    float harr[] = {1, 2, 3, 4, 5, 6, 7};
+    array arr(7, harr);
+    arr       = arr.as(f16);
+    array out = anyTrue(arr);
+
+    array gold = constant(1, 1, b8);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+// TODO(umar): HalfAllTrue
+TEST(ReduceHalf, AllTrue) {
+    SUPPORTED_TYPE_CHECK(af_half);
+    float harr[] = {1, 2, 3, 4, 5, 6, 7};
+    array arr(7, harr);
+    arr       = arr.as(f16);
+    array out = allTrue(arr);
+
+    array gold = constant(1, 1, b8);
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+//
+// Documentation Snippets
+
+TEST(Reduce, SNIPPET_sum_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+
+    //! [ex_reduce_sum_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 2 3 4 5 6 7 8 9 ];
+
+    array okeys, ovals;
+    sumByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0  1  0  2 ]
+    // ovals = [ 3 12 13 17 ]
+
+    //! [ex_reduce_sum_by_key]
+
+    vector<int> gold_keys   = {0, 1, 0, 2};
+    vector<float> gold_vals = {3, 12, 13, 17};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals);
+}
+
+TEST(Reduce, SNIPPET_sum_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10};
+
+    //! [ex_reduce_sum_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 2 3 4 5  ]
+    //         [ 6 7 8 9 10 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    sumByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1  5  9 ],
+    //          [ 6 15 19 ]]
+
+    //! [ex_reduce_sum_by_key_dim]
+
+    vector<int> gold_keys   = {1, 0, 2};
+    vector<float> gold_vals = {1, 6, 5, 15, 9, 19};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals);
+}
+
+TEST(Reduce, SNIPPET_product_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+
+    //! [ex_reduce_product_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 2 3 4 5 6 7 8 9 ];
+
+    array okeys, ovals;
+    productByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0  1  0  2 ]
+    // ovals = [ 2 60 42 72 ]
+
+    //! [ex_reduce_product_by_key]
+
+    vector<int> gold_keys   = {0, 1, 0, 2};
+    vector<float> gold_vals = {2, 60, 42, 72};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals);
+}
+
+TEST(Reduce, SNIPPET_product_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10};
+
+    //! [ex_reduce_product_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 2 3 4 5  ]
+    //         [ 6 7 8 9 10 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    productByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1  6 20 ],
+    //          [ 6 56 90 ]]
+
+    //! [ex_reduce_product_by_key_dim]
+
+    vector<int> gold_keys   = {1, 0, 2};
+    vector<float> gold_vals = {1, 6, 6, 56, 20, 90};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals);
+}
+
+TEST(Reduce, SNIPPET_min_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+
+    //! [ex_reduce_min_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 2 3 4 5 6 7 8 9 ];
+
+    array okeys, ovals;
+    minByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0 1 0 2 ]
+    // ovals = [ 1 3 6 8 ]
+
+    //! [ex_reduce_min_by_key]
+
+    vector<int> gold_keys   = {0, 1, 0, 2};
+    vector<float> gold_vals = {1, 3, 6, 8};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals);
+}
+
+TEST(Reduce, SNIPPET_min_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10};
+
+    //! [ex_reduce_min_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 2 3 4 5  ]
+    //         [ 6 7 8 9 10 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    minByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1 2 4 ],
+    //          [ 6 7 9 ]]
+
+    //! [ex_reduce_min_by_key_dim]
+
+    vector<int> gold_keys   = {1, 0, 2};
+    vector<float> gold_vals = {1, 6, 2, 7, 4, 9};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals);
+}
+
+TEST(Reduce, SNIPPET_max_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+
+    //! [ex_reduce_max_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 2 3 4 5 6 7 8 9 ];
+
+    array okeys, ovals;
+    maxByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0 1 0 2 ]
+    // ovals = [ 2 5 7 9 ]
+
+    //! [ex_reduce_max_by_key]
+
+    vector<int> gold_keys   = {0, 1, 0, 2};
+    vector<float> gold_vals = {2, 5, 7, 9};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals);
+}
+
+TEST(Reduce, SNIPPET_max_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 6, 2, 7, 3, 8, 4, 9, 5, 10};
+
+    //! [ex_reduce_max_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 2 3 4 5  ]
+    //         [ 6 7 8 9 10 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    maxByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1  3  5 ],
+    //          [ 6  8 10 ]]
+
+    //! [ex_reduce_max_by_key_dim]
+
+    vector<int> gold_keys   = {1, 0, 2};
+    vector<float> gold_vals = {1, 6, 3, 8, 5, 10};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals);
+}
+
+TEST(Reduce, SNIPPET_alltrue_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 1, 0, 1, 1, 0, 0, 1, 0};
+
+    //! [ex_reduce_alltrue_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 1 0 1 1 0 0 1 0 ];
+
+    array okeys, ovals;
+    allTrueByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0 1 0 2 ]
+    // ovals = [ 1 0 0 0 ]
+
+    //! [ex_reduce_alltrue_by_key]
+
+    vector<int> gold_keys           = {0, 1, 0, 2};
+    vector<unsigned char> gold_vals = {1, 0, 0, 0};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals.as(u8));
+}
+
+TEST(Reduce, SNIPPET_alltrue_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 0, 1, 1, 1, 0, 0, 1, 1, 1};
+
+    //! [ex_reduce_alltrue_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 1 1 0 1 ]
+    //         [ 0 1 0 1 1 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    allTrueByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1 1 0 ],
+    //          [ 0 0 1 ]]
+
+    //! [ex_reduce_alltrue_by_key_dim]
+
+    vector<int> gold_keys           = {1, 0, 2};
+    vector<unsigned char> gold_vals = {1, 0, 1, 0, 0, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals.as(u8));
+}
+
+TEST(Reduce, SNIPPET_anytrue_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 1, 0, 1, 1, 0, 0, 1, 0};
+
+    //! [ex_reduce_anytrue_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 1 0 1 1 0 0 1 0 ];
+
+    array okeys, ovals;
+    anyTrueByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0 1 0 2 ]
+    // ovals = [ 1 1 0 1 ]
+
+    //! [ex_reduce_anytrue_by_key]
+
+    vector<int> gold_keys           = {0, 1, 0, 2};
+    vector<unsigned char> gold_vals = {1, 1, 0, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals.as(u8));
+}
+
+TEST(Reduce, SNIPPET_anytrue_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 0, 1, 1, 1, 0, 0, 1, 1, 1};
+
+    //! [ex_reduce_anytrue_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 1 1 0 1 ]
+    //         [ 0 1 0 1 1 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    anyTrueByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1 1 1 ],
+    //          [ 0 1 1 ]]
+
+    //! [ex_reduce_anytrue_by_key_dim]
+
+    vector<int> gold_keys           = {1, 0, 2};
+    vector<unsigned char> gold_vals = {1, 0, 1, 1, 1, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals.as(u8));
+}
+
+TEST(Reduce, SNIPPET_count_by_key) {
+    int hkeys[]   = {0, 0, 1, 1, 1, 0, 0, 2, 2};
+    float hvals[] = {1, 1, 0, 1, 1, 0, 0, 1, 0};
+
+    //! [ex_reduce_count_by_key]
+
+    array keys(9, hkeys);  // keys = [ 0 0 1 1 1 0 0 2 2 ]
+    array vals(9, hvals);  // vals = [ 1 1 0 1 1 0 0 1 0 ]
+
+    array okeys, ovals;
+    countByKey(okeys, ovals, keys, vals);
+
+    // okeys = [ 0 1 0 2 ]
+    // ovals = [ 2 2 0 1 ]
+
+    //! [ex_reduce_count_by_key]
+
+    vector<int> gold_keys      = {0, 1, 0, 2};
+    vector<unsigned> gold_vals = {2, 2, 0, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(4), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(4), ovals);
+}
+
+TEST(Reduce, SNIPPET_count_by_key_dim) {
+    int hkeys[] = {1, 0, 0, 2, 2};
+
+    float hvals[] = {1, 0, 1, 1, 1, 0, 0, 1, 1, 1};
+
+    //! [ex_reduce_count_by_key_dim]
+
+    array keys(5, hkeys);
+    array vals(2, 5, hvals);
+
+    // keys = [ 1 0 0 2 2 ]
+
+    // vals = [[ 1 1 1 0 1 ]
+    //         [ 0 1 0 1 1 ]]
+
+    const int reduce_dim = 1;
+    array okeys, ovals;
+    countByKey(okeys, ovals, keys, vals, reduce_dim);
+
+    // okeys = [ 1 0 2 ]
+
+    // ovals = [[ 1 2 1 ],
+    //          [ 0 1 2 ]]
+
+    //! [ex_reduce_count_by_key_dim]
+
+    vector<int> gold_keys      = {1, 0, 2};
+    vector<unsigned> gold_vals = {1, 0, 2, 1, 1, 2};
+
+    ASSERT_VEC_ARRAY_EQ(gold_keys, dim4(3), okeys);
+    ASSERT_VEC_ARRAY_EQ(gold_vals, dim4(2, 3), ovals);
+}
+
+TEST(RaggedMax, simple) {
+    const int testKeys[6]      = {1, 2, 3, 4, 5, 6};  // values, 3x2
+    const unsigned testVals[2] = {9, 2};  // ragged lengths, one per column
+
+    array arr(3, 2, testKeys);
+    array keys(1, 2, testVals);
+
+    array ragged_max, idx;
+    const int dim = 0;
+    max(ragged_max, idx, arr, keys, dim);
+
+    const dim4 goldSz(1, 2);
+    const vector<int> gold_reduced{3, 5};
+    const vector<unsigned> gold_idx{2, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_reduced, goldSz, ragged_max);
+    ASSERT_VEC_ARRAY_EQ(gold_idx, goldSz, idx);
+}
+
+TEST(RaggedMax, simpleDim1) {
+    const int testKeys[8]      = {1, 2, 3, 4, 5, 6, 7, 8};  // values, 2x4
+    const unsigned testVals[2] = {8, 2};  // ragged lengths, one per row
+
+    array arr(2, 4, testKeys);
+    array keys(2, 1, testVals);
+
+    array ragged_max, idx;
+    const int dim = 1;
+    max(ragged_max, idx, arr, keys, dim);
+
+    const dim4 goldSz(2, 1);
+    const vector<int> gold_reduced{7, 4};
+    const vector<unsigned> gold_idx{3, 1};
+
+    ASSERT_VEC_ARRAY_EQ(gold_reduced, goldSz, ragged_max);
+    ASSERT_VEC_ARRAY_EQ(gold_idx, goldSz, idx);
+}
+
+struct ragged_params {
+    size_t reduceDimLen_;
+    int reduceDim_;
+    af_dtype lType_, vType_, oType_;
+    string testname_;
+
+    virtual ~ragged_params() {}
+};
+
+template<typename Tl, typename Tv, typename To>
+struct ragged_params_t : public ragged_params {
+    string testname_;
+
+    ragged_params_t(size_t reduce_dim_len, int reduce_dim, string testname)
+        : testname_(testname) {
+        ragged_params::reduceDim_    = reduce_dim;
+        ragged_params::reduceDimLen_ = reduce_dim_len;
+        ragged_params::lType_        = (af_dtype)af::dtype_traits<Tl>::af_type;
+        ragged_params::vType_        = (af_dtype)af::dtype_traits<Tv>::af_type;
+        ragged_params::oType_        = (af_dtype)af::dtype_traits<To>::af_type;
+        ragged_params::testname_     = testname_;
+    }
+    ~ragged_params_t() {}
+};
+
+class RaggedReduceMaxRangeP : public ::testing::TestWithParam<ragged_params *> {
+   public:
+    array vals, ragged_lens;
+    array valsReducedGold, idxsReducedGold;
+
+    void SetUp() {
+        ragged_params *params = GetParam();
+        if (noHalfTests(params->vType_)) {
+            GTEST_SKIP() << "Half not supported on this device";
+        }
+        if (noDoubleTests(params->vType_)) {
+            GTEST_SKIP() << "Double not supported on this device";
+        }
+
+        const size_t rdim_size = params->reduceDimLen_;
+        const int dim          = params->reduceDim_;
+
+        af::dim4 rdim(3, 3, 3, 3);
+        rdim[dim] = rdim_size;
+        vals      = af::range(rdim, dim, params->vType_);
+
+        rdim[dim]   = 1;
+        ragged_lens = af::range(rdim, (dim > 0) ? 0 : 1, params->lType_) + 1;
+
+        valsReducedGold = af::range(rdim, (dim > 0) ? 0 : 1, params->oType_);
+        idxsReducedGold = af::range(rdim, (dim > 0) ? 0 : 1, params->lType_);
+    }
+
+    void TearDown() { delete GetParam(); }
+};
+
+template<typename Tl, typename Tv, typename To>
+ragged_params *ragged_range_data(const string testname, const int testSz,
+                                 const int rdim) {
+    return new ragged_params_t<Tl, Tv, To>(testSz, rdim, testname);
+}
+
+// clang-format off
+template<typename Tv, typename To>
+vector<ragged_params *> genRaggedRangeTests() {
+  return {ragged_range_data<unsigned, Tv, To>("ragged_range", 31,          0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 32,          0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 33,          0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 255,         0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 256,         0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 257,         0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024,        0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1025,        0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024 * 1025, 0),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 31,          1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 32,          1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 33,          1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 255,         1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 256,         1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 257,         1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024,        1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1025,        1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024 * 1025, 1),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 31,          2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 32,          2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 33,          2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 255,         2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 256,         2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 257,         2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024,        2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1025,        2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024 * 1025, 2),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 31,          3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 32,          3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 33,          3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 255,         3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 256,         3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 257,         3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024,        3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1025,        3),
+          ragged_range_data<unsigned, Tv, To>("ragged_range", 1024 * 1025, 3),
+    };
+}
+// clang-format on
+
+vector<ragged_params *> generateAllTypesRagged() {
+    vector<ragged_params *> out;
+    vector<vector<ragged_params *>> tmp{
+        genRaggedRangeTests<int, int>(), genRaggedRangeTests<float, float>(),
+        genRaggedRangeTests<double, double>(),
+        genRaggedRangeTests<half_float::half, half_float::half>()};
+
+    for (auto &v : tmp) { copy(begin(v), end(v), back_inserter(out)); }
+    return out;
+}
+
+template<typename TestClass>
+string testNameGeneratorRagged(
+    const ::testing::TestParamInfo<typename TestClass::ParamType> info) {
+    af_dtype lt = info.param->lType_;
+    af_dtype vt = info.param->vType_;
+    size_t size = info.param->reduceDimLen_;
+    int rdim    = info.param->reduceDim_;
+    std::stringstream s;
+    s << info.param->testname_ << "_lenType_" << lt << "_valueType_" << vt
+      << "_size_" << size << "_reduceDim_" << rdim;
+    return s.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(RaggedReduceTests, RaggedReduceMaxRangeP,
+                         ::testing::ValuesIn(generateAllTypesRagged()),
+                         testNameGeneratorRagged<RaggedReduceMaxRangeP>);
+
+TEST_P(RaggedReduceMaxRangeP, rangeMaxTest) {
+    if (noHalfTests(GetParam()->vType_)) {
+        GTEST_SKIP() << "Half not supported on this device";
+    }
+    array ragged_max, idx;
+    const int dim = GetParam()->reduceDim_;
+    max(ragged_max, idx, vals, ragged_lens, dim);
+
+    ASSERT_ARRAYS_EQ(valsReducedGold, ragged_max);
+    ASSERT_ARRAYS_EQ(idxsReducedGold, idx);
+}
+
+TEST(ReduceByKey, ISSUE_2955) {
+    int N                  = 256;
+    af::array val          = af::randu(N);
+    af::array key          = af::range(af::dim4(N), 0, af::dtype::s32);
+    key(seq(127, af::end)) = 1;
+
+    af::array ok, ov;
+    af::sumByKey(ok, ov, key, val);
+    ASSERT_EQ(ok.dims(0), 128);
+    ASSERT_EQ(ov.dims(0), 128);
+}
+
+TEST(ReduceByKey, ISSUE_2955_dim) {
+    int N                  = 256;
+    af::array val          = af::randu(8, N);
+    af::array key          = af::range(af::dim4(N), 0, af::dtype::s32);
+    key(seq(127, af::end)) = 1;
+
+    af::array ok, ov;
+    af::sumByKey(ok, ov, key, val, 1);
+    ASSERT_EQ(ok.dims(0), 128);
+    ASSERT_EQ(ov.dims(1), 128);
+}
+
+TEST(ReduceByKey, ISSUE_3062) {
+    size_t N = 129;
+
+    af::array ones  = af::constant(1, N, u32);
+    af::array zeros = af::constant(0, N, u32);
+
+    af::array okeys;
+    af::array ovalues;
+
+    af::sumByKey(okeys, ovalues, zeros, ones);
+    ASSERT_EQ(ovalues.scalar<unsigned>(), 129);
+
+    af::countByKey(okeys, ovalues, zeros, ones);
+    ASSERT_EQ(ovalues.scalar<unsigned>(), 129);
+
+    // test reduction on non-zero dimension as well
+    ones  = af::constant(1, 2, N, u32);
+    zeros = af::constant(0, N, u32);
+
+    af::sumByKey(okeys, ovalues, zeros, ones, 1);
+    ASSERT_EQ(ovalues.scalar<unsigned>(), 129);
+
+    af::countByKey(okeys, ovalues, zeros, ones, 1);
+    ASSERT_EQ(ovalues.scalar<unsigned>(), 129);
+}
+
+TEST(Reduce, Test_Sum_Global_Array) {
+    const int num = 513;
+    array a       = af::randn(num, 2, 33, 4);
+
+    float res         = af::sum<float>(a);
+    array full_reduce = af::sum<af::array>(a);
+
+    float *h_a = a.host<float>();
+    float gold = 0.f;
+
+    for (int i = 0; i < a.elements(); i++) { gold += h_a[i]; }
+
+    float max_error =
+        std::numeric_limits<float>::epsilon() * (float)a.elements();
+    ASSERT_NEAR(gold, res, max_error);
+    ASSERT_NEAR(res, full_reduce.scalar<float>(), max_error);
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_Product_Global_Array) {
+    const int num = 512;
+    array a       = 1 + (0.005 * af::randn(num, 2, 3, 4));
+
+    float res         = af::product<float>(a);
+    array full_reduce = af::product<af::array>(a);
+
+    float *h_a = a.host<float>();
+    float gold = 1.f;
+
+    for (int i = 0; i < a.elements(); i++) { gold *= h_a[i]; }
+
+    float max_error =
+        std::numeric_limits<float>::epsilon() * (float)a.elements();
+    ASSERT_NEAR(gold, res, max_error);
+    ASSERT_NEAR(res, full_reduce.scalar<float>(), max_error);
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_Count_Global_Array) {
+    const int num = 10000;
+    array a       = round(2 * randu(num, 2, 3, 4));
+    array b       = a.as(b8);
+
+    int res       = count<int>(b);
+    array res_arr = count<af::array>(b);
+    char *h_b     = b.host<char>();
+    unsigned gold = 0;
+
+    for (int i = 0; i < a.elements(); i++) { gold += h_b[i]; }
+
+    ASSERT_EQ(gold, res);
+    ASSERT_EQ(gold, res_arr.scalar<unsigned>());
+    freeHost(h_b);
+}
+
+TEST(Reduce, Test_min_Global_Array) {
+    SUPPORTED_TYPE_CHECK(double);
+
+    const int num = 10000;
+    array a       = af::randn(num, 2, 3, 4, f64);
+    double res    = min<double>(a);
+    array res_arr = min<af::array>(a);
+    double *h_a   = a.host<double>();
+    double gold   = std::numeric_limits<double>::max();
+
+    // Compute the reference minimum on the host.
+
+    for (int i = 0; i < a.elements(); i++) { gold = std::min(gold, h_a[i]); }
+
+    ASSERT_EQ(gold, res);
+    ASSERT_EQ(gold, res_arr.scalar<double>());
+    freeHost(h_a);
+}
+
+TEST(Reduce, Test_max_Global_Array) {
+    const int num = 10000;
+    array a       = af::randn(num, 2, 3, 4);
+    float res     = max<float>(a);
+    array res_arr = max<af::array>(a);
+    float *h_a    = a.host<float>();
+    float gold    = -std::numeric_limits<float>::max();
+
+    for (int i = 0; i < a.elements(); i++) { gold = std::max(gold, h_a[i]); }
+
+    ASSERT_EQ(gold, res);
+    ASSERT_EQ(gold, res_arr.scalar<float>());
+    freeHost(h_a);
+}
+
+TYPED_TEST(Reduce, Test_All_Global_Array) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    // Input size test
+    for (int i = 1; i < 1000; i += 100) {
+        int num = 10 * i;
+        vector<TypeParam> h_vals(num, (TypeParam) true);
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = allTrue<TypeParam>(a);
+        array res_arr = allTrue<array>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+        typed_assert_eq((TypeParam) true, (TypeParam)res_arr.scalar<char>(),
+                        false);
+
+        h_vals[3] = false;
+        a         = array(2, num / 2, &h_vals.front());
+
+        res     = allTrue<TypeParam>(a);
+        res_arr = allTrue<array>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+        typed_assert_eq((TypeParam) false, (TypeParam)res_arr.scalar<char>(),
+                        false);
+    }
+
+    // false value location test
+    const int num = 10000;
+    vector<TypeParam> h_vals(num, (TypeParam) true);
+    for (int i = 1; i < 10000; i += 100) {
+        h_vals[i] = false;
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = allTrue<TypeParam>(a);
+        array res_arr = allTrue<array>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+        typed_assert_eq((TypeParam) false, (TypeParam)res_arr.scalar<char>(),
+                        false);
+
+        h_vals[i] = true;
+    }
+}
+
+TYPED_TEST(Reduce, Test_Any_Global_Array) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    // Input size test
+    for (int i = 1; i < 1000; i += 100) {
+        int num = 10 * i;
+        vector<TypeParam> h_vals(num, (TypeParam) false);
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = anyTrue<TypeParam>(a);
+        array res_arr = anyTrue<array>(a);
+        typed_assert_eq((TypeParam) false, res, false);
+        typed_assert_eq((TypeParam) false, (TypeParam)res_arr.scalar<char>(),
+                        false);
+
+        h_vals[3] = true;
+        a         = array(2, num / 2, &h_vals.front());
+        res     = anyTrue<TypeParam>(a);
+        res_arr = anyTrue<array>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+        typed_assert_eq((TypeParam) true, (TypeParam)res_arr.scalar<char>(),
+                        false);
+    }
+
+    // true value location test
+    const int num = 10000;
+    vector<TypeParam> h_vals(num, (TypeParam) false);
+    for (int i = 1; i < 10000; i += 100) {
+        h_vals[i] = true;
+        array a(2, num / 2, &h_vals.front());
+
+        TypeParam res = anyTrue<TypeParam>(a);
+        array res_arr = anyTrue<array>(a);
+        typed_assert_eq((TypeParam) true, res, false);
+        typed_assert_eq((TypeParam) true, (TypeParam)res_arr.scalar<char>(),
+                        false);
+
+        h_vals[i] = false;
+    }
+}
+
+TEST(Reduce, Test_Sum_Global_Array_nanval) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    const int num = 100000;
+    array a       = af::randn(num, 2, 34, 4);
+    a(1, 0, 0, 0) = NAN;
+    a(0, 1, 0, 0) = NAN;
+    a(0, 0, 1, 0) = NAN;
+    a(0, 0, 0, 1) = NAN;
+
+    double nanval     = 0.2;
+    float res         = af::sum<float>(a, nanval);
+    array full_reduce = af::sum<af::array>(a, nanval);
+
+    float *h_a = a.host<float>();
+    float gold = 0.f;
+
+    for (int i = 0; i < a.elements(); i++) {
+        gold += (isnan(h_a[i])) ? nanval : h_a[i];
+    }
+    float max_error =
+        std::numeric_limits<float>::epsilon() * (float)a.elements();
+    ASSERT_NEAR(gold, res, max_error);
+    ASSERT_NEAR(res, full_reduce.scalar<float>(), max_error);
+    freeHost(h_a);
+}
+
+TEST(Reduce, nanval_issue_3255) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    SUPPORTED_TYPE_CHECK(double);
+    char *info_str;
+    af_array ikeys, ivals, okeys, ovals;
+    dim_t dims[1] = {8};
+
+    int ikeys_src[8] = {0, 0, 1, 1, 1, 2, 2, 0};
+    ASSERT_SUCCESS(af_create_array(&ikeys, ikeys_src, 1, dims, u32));
+
+    int i;
+    for (i = 0; i < 8; i++) {
+        double ivals_src[8] = {1, 2, 3, 4, 5, 6, 7, 8};
+        ivals_src[i]        = NAN;
+        ASSERT_SUCCESS(af_create_array(&ivals, ivals_src, 1, dims, f64));
+
+        ASSERT_SUCCESS(
+            af_product_by_key_nan(&okeys, &ovals, ikeys, ivals, 0, 1.0));
+        af::array ovals_cpp(ovals);
+        ASSERT_FALSE(af::anyTrue<bool>(af::isNaN(ovals_cpp)));
+        ASSERT_SUCCESS(af_release_array(okeys));
+
+        ASSERT_SUCCESS(af_sum_by_key_nan(&okeys, &ovals, ikeys, ivals, 0, 1.0));
+        ovals_cpp = af::array(ovals);
+
+        ASSERT_FALSE(af::anyTrue<bool>(af::isNaN(ovals_cpp)));
+        ASSERT_SUCCESS(af_release_array(ivals));
+        ASSERT_SUCCESS(af_release_array(okeys));
+    }
+    ASSERT_SUCCESS(af_release_array(ikeys));
+}
+
+TEST(Reduce, SNIPPET_algorithm_func_sum) {
+    // clang-format off
+    //! [ex_algorithm_func_sum]
+    //
+    // Create a, a 2x3 array
+    array a = iota(dim4(2, 3));           // a = [0, 2, 4,
+                                          //      1, 3, 5]
+
+    // Create b by summing across the first dimension
+    array b = sum(a);        // sum across the first dimension, same as sum(a,0)
+
+    // Create c by summing across the second dimension
+    array c = sum(a, 1);     // sum across the second dimension
+
+    // Create d by summing across the third dimension
+    array d = sum(a, 2);     // sum across the third dimension
+
+    // Create e by summing across the fourth dimension
+    array e = sum(a, 3);     // sum across the fourth dimension
+
+    // Summing across higher dimensions fails due to stepping out of bounds. For example,
+    // array f = sum(a, 4);  // fails due to stepping out of bounds
+
+    //! [ex_algorithm_func_sum]
+    // clang-format on
+
+    using std::vector;
+    vector<float> gold_a{0, 1, 2, 3, 4, 5};
+    vector<float> gold_b{1, 5, 9};
+    vector<float> gold_c{6, 9};
+
+    ASSERT_VEC_ARRAY_EQ(gold_a, a.dims(), a);
+    ASSERT_VEC_ARRAY_EQ(gold_b, b.dims(), b);
+    ASSERT_VEC_ARRAY_EQ(gold_c, c.dims(), c);
+    ASSERT_VEC_ARRAY_EQ(gold_a, d.dims(), d);
+    ASSERT_VEC_ARRAY_EQ(gold_a, e.dims(), e);
+}
+
+#define TEMP_FORMAT_TESTS_reduce(form, op)                    \
+    TEST(TEMP_FORMAT, form##_##op##_array) {                  \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});    \
+        const array gold = op(in, 3);                         \
+        array out        = op(toTempFormat(form, in), 3);     \
+        EXPECT_ARRAYS_EQ(out, gold);                          \
+    }                                                         \
+    TEST(TEMP_FORMAT, form##_##op##_value) {                  \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});    \
+        const float gold = op<float>(in);                     \
+        float out        = op<float>(toTempFormat(form, in)); \
+        EXPECT_EQ(out, gold);                                 \
+    }
+
+#define TEMP_FORMAT_TESTS_ragged(form, op)                                     \
+    TEST(TEMP_FORMAT, form##_##op##_ragged) {                                  \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});                     \
+        const array ragged_len(dim4(1), {(unsigned)in.elements()});            \
+        array gold_vals, gold_idxs;                                            \
+        op(gold_vals, gold_idxs, in, ragged_len, 3);                           \
+        array vals, idxs;                                                      \
+        op(vals, idxs, toTempFormat(form, in), toTempFormat(form, ragged_len), \
+           3);                                                                 \
+        EXPECT_ARRAYS_EQ(vals, gold_vals);                                     \
+        EXPECT_ARRAYS_EQ(idxs, gold_idxs);                                     \
+    }
+
+#define TEMP_FORMAT_TESTS_ByKey(form, op)                      \
+    TEST(TEMP_FORMAT, form##_##op) {                           \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});     \
+        const array keys(constant(0, in.dims().dims[3], u32)); \
+        keys.eval();                                           \
+        array gold_keys, gold_vals;                            \
+        op(gold_keys, gold_vals, keys, in, 3);                 \
+        array out_keys, out_vals;                              \
+        op(out_keys, out_vals, toTempFormat(form, keys),       \
+           toTempFormat(form, in), 3);                         \
+        EXPECT_ARRAYS_EQ(gold_vals, out_vals);                 \
+        EXPECT_ARRAYS_EQ(gold_keys, out_keys);                 \
+    }
+
+#define TEMP_FORMAT_TESTS_allTest(form, op)                         \
+    TEST(TEMP_FORMAT, form##_##op##_array) {                        \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});          \
+        const array gold = op(in > 2.0, 3);                         \
+        array out        = op(toTempFormat(form, in) > 2.0, 3);     \
+        EXPECT_ARRAYS_EQ(gold, out);                                \
+    }                                                               \
+    TEST(TEMP_FORMAT, form##_##op##_value) {                        \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});          \
+        const float gold = op<float>(in > 2.0);                     \
+        float out        = op<float>(toTempFormat(form, in) > 2.0); \
+        EXPECT_EQ(gold, out);                                       \
+    }
+
+#define TEMP_FORMAT_TESTS_allTestByKey(form, op)               \
+    TEST(TEMP_FORMAT, form##_##op) {                           \
+        const array in(dim4(1, 1, 1, 3), {1.f, 2.f, 3.f});     \
+        const array keys(constant(0, in.dims().dims[3], u32)); \
+        array gold_vals, gold_keys;                            \
+        op(gold_keys, gold_vals, keys, in > 2.0, 3);           \
+        array out_vals, out_keys;                              \
+        op(out_keys, out_vals, toTempFormat(form, keys),       \
+           toTempFormat(form, in) > 2.0, 3);                   \
+        EXPECT_ARRAYS_EQ(gold_vals, out_vals);                 \
+        EXPECT_ARRAYS_EQ(gold_keys, out_keys);                 \
+    }
+
+#define TEMP_FORMATS_TESTS(form)                        \
+    TEMP_FORMAT_TESTS_reduce(form, min);                \
+    TEMP_FORMAT_TESTS_reduce(form, max);                \
+    TEMP_FORMAT_TESTS_reduce(form, sum);                \
+    TEMP_FORMAT_TESTS_reduce(form, product);            \
+    TEMP_FORMAT_TESTS_reduce(form, count);              \
+    TEMP_FORMAT_TESTS_ragged(form, max);                \
+    TEMP_FORMAT_TESTS_ByKey(form, minByKey);            \
+    TEMP_FORMAT_TESTS_ByKey(form, maxByKey);            \
+    TEMP_FORMAT_TESTS_ByKey(form, sumByKey);            \
+    TEMP_FORMAT_TESTS_ByKey(form, productByKey);        \
+    TEMP_FORMAT_TESTS_ByKey(form, countByKey);          \
+    TEMP_FORMAT_TESTS_allTest(form, allTrue);           \
+    TEMP_FORMAT_TESTS_allTest(form, anyTrue);           \
+    TEMP_FORMAT_TESTS_allTestByKey(form, allTrueByKey); \
+    TEMP_FORMAT_TESTS_allTestByKey(form, anyTrueByKey);
+
+FOREACH_TEMP_FORMAT(TEMP_FORMATS_TESTS)
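Reviewer note on the `SNIPPET_*_by_key` tests above: their expected values (for example `okeys = [ 0 1 0 2 ]` for `keys = [ 0 0 1 1 1 0 0 2 2 ]`) suggest that each run of consecutive equal keys is reduced separately, so a key that reappears later produces a second output entry. The following is a minimal host-side sketch of that behaviour in plain C++, assuming equal-length 1-D inputs. `reduceRunsByKey` is a hypothetical helper invented for illustration; it is not part of the ArrayFire API and says nothing about how the library implements the reduction.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not an ArrayFire function): reduce every run of
// consecutive equal keys with the binary operation `op`, starting each run
// from `init`. Returns the per-run keys and the per-run reduced values.
template<typename K, typename V, typename Op>
std::pair<std::vector<K>, std::vector<V>> reduceRunsByKey(
    const std::vector<K> &keys, const std::vector<V> &vals, V init, Op op) {
    // Assumption of this sketch: one value per key.
    assert(keys.size() == vals.size());

    std::vector<K> okeys;
    std::vector<V> ovals;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        // Open a new output slot whenever the key differs from the previous
        // one; a key seen earlier in a different run is not merged.
        if (okeys.empty() || okeys.back() != keys[i]) {
            okeys.push_back(keys[i]);
            ovals.push_back(init);
        }
        ovals.back() = op(ovals.back(), vals[i]);
    }
    return {okeys, ovals};
}
```

With the keys and values from `SNIPPET_anytrue_by_key`, a logical-or operation gives keys `{0, 1, 0, 2}` and values `{1, 1, 0, 1}`, and a counting operation gives values `{2, 2, 0, 1}`, which agrees with the gold vectors in those tests.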
diff --git a/test/regions.cpp b/test/regions.cpp
index 273f336463..a6f14ede81 100644
--- a/test/regions.cpp
+++ b/test/regions.cpp
@@ -7,110 +7,119 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
-#include <af/traits.hpp>
+#include <af/dim4.hpp>
 #include <af/image.h>
-#include <vector>
+#include <af/traits.hpp>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::regions;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Regions : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Regions : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, int, unsigned> TestTypes;
+typedef ::testing::Types<float, double, int, unsigned, short, ushort> TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Regions, TestTypes);
+TYPED_TEST_SUITE(Regions, TestTypes);
 
 template<typename T>
-void regionsTest(string pTestFile, af_connectivity connectivity, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void regionsTest(string pTestFile, af_connectivity connectivity,
+                 bool isSubRef = false, const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<uchar> > in;
-    vector<vector<T> > tests;
-    readTests<uchar, T, unsigned>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<uchar>> in;
+    vector<vector<T>> tests;
+    readTests<uchar, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
+    af_array inArray   = 0;
     af_array tempArray = 0;
-    af_array outArray = 0;
+    af_array outArray  = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<char>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<char>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<char>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<char>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_regions(&outArray, inArray, connectivity, (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_regions(&outArray, inArray, connectivity,
+                               (af_dtype)dtype_traits<T>::af_type));
 
     // Get result
     T* outData = new T[idims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<T> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
+        size_t nElems         = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
 
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define REGIONS_INIT(desc, file, conn, conn_type)                                           \
-    TYPED_TEST(Regions, desc)                                                               \
-    {                                                                                       \
-        regionsTest<TypeParam>(string(TEST_DIR"/regions/"#file"_"#conn".test"), conn_type); \
+#define REGIONS_INIT(desc, file, conn, conn_type)                             \
+    TYPED_TEST(Regions, desc) {                                               \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                               \
+        regionsTest<TypeParam>(                                               \
+            string(TEST_DIR "/regions/" #file "_" #conn ".test"), conn_type); \
     }
 
-    REGIONS_INIT(Regions0, regions_8x8, 4, AF_CONNECTIVITY_4);
-    REGIONS_INIT(Regions1, regions_8x8, 8, AF_CONNECTIVITY_8);
-    REGIONS_INIT(Regions2, regions_128x128, 4, AF_CONNECTIVITY_4);
-    REGIONS_INIT(Regions3, regions_128x128, 8, AF_CONNECTIVITY_8);
-
+REGIONS_INIT(Regions0, regions_8x8, 4, AF_CONNECTIVITY_4);
+REGIONS_INIT(Regions1, regions_8x8, 8, AF_CONNECTIVITY_8);
+REGIONS_INIT(Regions2, regions_128x128, 4, AF_CONNECTIVITY_4);
+REGIONS_INIT(Regions3, regions_128x128, 8, AF_CONNECTIVITY_8);
 
 ///////////////////////////////////// CPP ////////////////////////////////
 //
-TEST(Regions, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, unsigned>(string(TEST_DIR"/regions/regions_8x8_4.test"),numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::array input(idims, (float*)&(in[0].front()));
-    af::array output = af::regions(input.as(b8));
+TEST(Regions, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/regions/regions_8x8_4.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, (float*)&(in[0].front()));
+    array output = regions(input.as(b8));
 
     // Get result
     float* outData = new float[idims.elements()];
@@ -119,9 +128,10 @@ TEST(Regions, CPP)
     // Compare result
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<float> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
+        size_t nElems             = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
 
@@ -130,34 +140,22 @@ TEST(Regions, CPP)
 }
 
 ///////////////////////////////// Documentation Examples ///////////////////
-TEST(Regions, Docs_8)
-{
+TEST(Regions, Docs_8) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
     // input data
-    uchar input[64] =  {
-        0, 0, 0, 0, 1, 0, 0, 0,
-        0, 0, 1, 0, 1, 0, 0, 1,
-        0, 0, 0, 1, 0, 0, 0, 0,
-        0, 0, 1, 0, 0, 1, 0, 0,
-        1, 0, 0, 1, 0, 0, 1, 0,
-        0, 0, 0, 1, 1, 0, 0, 1,
-        1, 1, 0, 0, 0, 0, 0, 0,
-        0, 1, 0, 1, 1, 1, 1, 0
-    };
+    uchar input[64] = {0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
+                       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
+                       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
+                       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0};
     // gold output
-    float gold[64] =  {
-        0, 0, 0, 0, 1, 0, 0, 0,
-        0, 0, 1, 0, 1, 0, 0, 2,
-        0, 0, 0, 1, 0, 0, 0, 0,
-        0, 0, 1, 0, 0, 3, 0, 0,
-        4, 0, 0, 1, 0, 0, 3, 0,
-        0, 0, 0, 1, 1, 0, 0, 3,
-        5, 5, 0, 0, 0, 0, 0, 0,
-        0, 5, 0, 6, 6, 6, 6, 0
-    };
+    float gold[64] = {0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 2,
+                      0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0,
+                      4, 0, 0, 1, 0, 0, 3, 0, 0, 0, 0, 1, 1, 0, 0, 3,
+                      5, 5, 0, 0, 0, 0, 0, 0, 0, 5, 0, 6, 6, 6, 6, 0};
 
     //![ex_image_regions]
-    af::array in(8, 8, input);
-    //af_print(in);
+    array in(8, 8, input);
+    // af_print(in);
     // in =
     // 0   0   0   0   1   0   1   0
     // 0   0   0   0   0   0   1   1
@@ -169,8 +167,8 @@ TEST(Regions, Docs_8)
     // 0   1   0   0   0   1   0   0
 
     // Compute the label matrix using 8-way connectivity
-    af::array out = regions(in.as(b8), AF_CONNECTIVITY_8);
-    //af_print(out);
+    array out = regions(in.as(b8), AF_CONNECTIVITY_8);
+    // af_print(out);
     // 0   0   0   0   4   0   5   0
     // 0   0   0   0   0   0   5   5
     // 0   1   0   1   0   0   0   0
@@ -181,72 +179,96 @@ TEST(Regions, Docs_8)
     // 0   2   0   0   0   3   0   0
     //![ex_image_regions]
 
-
     float output[64];
     out.host((void*)output);
 
-    for (int i=0; i<64; ++i) {
-        ASSERT_EQ(gold[i], output[i])<<" mismatch at i="<<i<<std::endl;
+    for (int i = 0; i < 64; ++i) {
+        ASSERT_EQ(gold[i], output[i]) << " mismatch at i=" << i << endl;
     }
 }
 
-TEST(Regions, Docs_4)
-{
+TEST(Regions, Docs_4) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
     // input data
-    uchar input[64] =  {
-        0, 0, 0, 0, 1, 0, 0, 0,
-        0, 0, 1, 0, 1, 0, 0, 1,
-        0, 0, 0, 1, 0, 0, 0, 0,
-        0, 0, 1, 0, 0, 1, 0, 0,
-        1, 0, 0, 1, 0, 0, 1, 0,
-        0, 0, 0, 1, 1, 0, 0, 1,
-        1, 1, 0, 0, 0, 0, 0, 0,
-        0, 1, 0, 1, 1, 1, 1, 0
-    };
+    uchar input[64] = {0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1,
+                       0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,
+                       1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
+                       1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0};
     // gold output
-    float gold[64] =  {
-        0.0000,  0.0000,  0.0000,  0.0000,  1.0000,  0.0000,  0.0000,  0.0000,
-        0.0000,  0.0000,  2.0000,  0.0000,  1.0000,  0.0000,  0.0000,  3.0000,
-        0.0000,  0.0000,  0.0000,  4.0000,  0.0000,  0.0000,  0.0000,  0.0000,
-        0.0000,  0.0000,  5.0000,  0.0000,  0.0000,  6.0000,  0.0000,  0.0000,
-        7.0000,  0.0000,  0.0000,  8.0000,  0.0000,  0.0000,  9.0000,  0.0000,
-        0.0000,  0.0000,  0.0000,  8.0000,  8.0000,  0.0000,  0.0000, 10.0000,
-        11.000, 11.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
-        0.0000, 11.0000,  0.0000, 12.0000, 12.0000, 12.0000, 12.0000,  0.0000
-    };
-
+    float gold[64] = {
+        0.0000, 0.0000,  0.0000, 0.0000,  1.0000,  0.0000,  0.0000,  0.0000,
+        0.0000, 0.0000,  2.0000, 0.0000,  1.0000,  0.0000,  0.0000,  3.0000,
+        0.0000, 0.0000,  0.0000, 4.0000,  0.0000,  0.0000,  0.0000,  0.0000,
+        0.0000, 0.0000,  5.0000, 0.0000,  0.0000,  6.0000,  0.0000,  0.0000,
+        7.0000, 0.0000,  0.0000, 8.0000,  0.0000,  0.0000,  9.0000,  0.0000,
+        0.0000, 0.0000,  0.0000, 8.0000,  8.0000,  0.0000,  0.0000,  10.0000,
+        11.000, 11.0000, 0.0000, 0.0000,  0.0000,  0.0000,  0.0000,  0.0000,
+        0.0000, 11.0000, 0.0000, 12.0000, 12.0000, 12.0000, 12.0000, 0.0000};
 
     //![ex_image_regions_4conn]
-    af::array in(8, 8, input);
-    //af_print(in.T());
-    //in
-    //0  0  0  0  1  0  1  0
-    //0  0  0  0  0  0  1  1
-    //0  1  0  1  0  0  0  0
-    //0  0  1  0  1  1  0  1
-    //1  1  0  0  0  1  0  1
-    //0  0  0  1  0  0  0  1
-    //0  0  0  0  1  0  0  1
-    //0  1  0  0  0  1  0  0
+    array in(8, 8, input);
+    // af_print(in.T());
+    // in
+    // 0  0  0  0  1  0  1  0
+    // 0  0  0  0  0  0  1  1
+    // 0  1  0  1  0  0  0  0
+    // 0  0  1  0  1  1  0  1
+    // 1  1  0  0  0  1  0  1
+    // 0  0  0  1  0  0  0  1
+    // 0  0  0  0  1  0  0  1
+    // 0  1  0  0  0  1  0  0
     // Compute the label matrix using 4-way connectivity
-    af::array out = regions(in.as(b8), AF_CONNECTIVITY_4);
-    //af_print(out.T());
-    //out
-    //0  0  0  0  7  0 11  0
-    //0  0  0  0  0  0 11 11
-    //0  2  0  5  0  0  0  0
-    //0  0  4  0  8  8  0 12
-    //1  1  0  0  0  8  0 12
-    //0  0  0  6  0  0  0 12
-    //0  0  0  0  9  0  0 12
-    //0  3  0  0  0 10  0  0
+    array out = regions(in.as(b8), AF_CONNECTIVITY_4);
+    // af_print(out.T());
+    // out
+    // 0  0  0  0  7  0 11  0
+    // 0  0  0  0  0  0 11 11
+    // 0  2  0  5  0  0  0  0
+    // 0  0  4  0  8  8  0 12
+    // 1  1  0  0  0  8  0 12
+    // 0  0  0  6  0  0  0 12
+    // 0  0  0  0  9  0  0 12
+    // 0  3  0  0  0 10  0  0
     //![ex_image_regions_4conn]
 
-
     float output[64];
     out.host((void*)output);
 
-    for (int i=0; i<64; ++i) {
-        ASSERT_EQ(gold[i], output[i])<<" mismatch at i="<<i<<std::endl;
+    for (int i = 0; i < 64; ++i) {
+        ASSERT_EQ(gold[i], output[i]) << " mismatch at i=" << i << endl;
     }
 }
+
+TEST(Regions, WholeImageComponent) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const int dim = 101;
+    const int sz  = dim * dim;
+    vector<char> input(sz, 1);
+    vector<float> gold(sz, 1.0f);
+
+    array in  = array(dim, dim, input.data());
+    array out = regions(in, AF_CONNECTIVITY_4);
+
+    vector<float> output(sz);
+    out.host((void*)output.data());
+
+    for (int i = 0; i < sz; ++i)
+        ASSERT_FLOAT_EQ(gold[i], output[i]) << " mismatch at i=" << i << endl;
+}
+
+TEST(Regions, NoComponentImage) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const int dim = 101;
+    const int sz  = dim * dim;
+    vector<char> input(sz, 0);
+    vector<float> gold(sz, 0.0f);
+
+    array in  = array(dim, dim, input.data());
+    array out = regions(in, AF_CONNECTIVITY_4);
+
+    vector<float> output(sz);
+    out.host((void*)output.data());
+
+    for (int i = 0; i < sz; ++i)
+        ASSERT_FLOAT_EQ(gold[i], output[i]) << " mismatch at i=" << i << endl;
+}
diff --git a/test/relative_difference.hpp b/test/relative_difference.hpp
new file mode 100644
index 0000000000..3fdfb28dc3
--- /dev/null
+++ b/test/relative_difference.hpp
@@ -0,0 +1,135 @@
+//  (C) Copyright John Maddock 2006, 2015
+//  Use, modification and distribution are subject to the
+//  Boost Software License, Version 1.0. (See accompanying file
+//  LICENSE_1_0.txt or copy at http://www.boost.org/LICENSE_1_0.txt)
+
+#ifndef BOOST_MATH_RELATIVE_ERROR
+#define BOOST_MATH_RELATIVE_ERROR
+
+#include <boost/math/special_functions/fpclassify.hpp>
+#include <boost/math/tools/precision.hpp>
+#include <boost/math/tools/promotion.hpp>
+
+namespace boost {
+namespace math {
+
+template<class T, class U>
+typename boost::math::tools::promote_args<T, U>::type relative_difference(
+    const T& arg_a, const U& arg_b) {
+    typedef typename boost::math::tools::promote_args<T, U>::type result_type;
+    result_type a = arg_a;
+    result_type b = arg_b;
+    BOOST_MATH_STD_USING
+#ifdef BOOST_MATH_NO_LONG_DOUBLE_MATH_FUNCTIONS
+    //
+    // If math.h has no long double support we can't rely
+    // on the math functions generating exponents outside
+    // the range of a double:
+    //
+    result_type min_val = (std::max)(
+        tools::min_value<result_type>(),
+        static_cast<result_type>((std::numeric_limits<double>::min)()));
+    result_type max_val = (std::min)(
+        tools::max_value<result_type>(),
+        static_cast<result_type>((std::numeric_limits<double>::max)()));
+#else
+    result_type min_val = tools::min_value<result_type>();
+    result_type max_val = tools::max_value<result_type>();
+#endif
+    // Screen out NaNs first: if either value is a NaN, the distance is
+    // "infinite":
+    if ((boost::math::isnan)(a) || (boost::math::isnan)(b)) return max_val;
+    // Screen out infinities:
+    if (fabs(b) > max_val) {
+        if (fabs(a) > max_val)
+            return (a < 0) == (b < 0)
+                       ? result_type(0)
+                       : max_val;  // one infinity is as good as another!
+        else
+            return max_val;  // one infinity and one finite value implies
+                             // infinite difference
+    } else if (fabs(a) > max_val)
+        return max_val;  // one infinity and one finite value implies infinite
+                         // difference
+
+    //
+    // If the values have different signs, treat as infinite difference:
+    //
+    if (((a < 0) != (b < 0)) && (a != 0) && (b != 0)) return max_val;
+    a = fabs(a);
+    b = fabs(b);
+    //
+    // Now deal with zeros: if one value is zero (or denorm), treat it the
+    // same as min_val for the purposes of the calculation that follows:
+    //
+    if (a < min_val) a = min_val;
+    if (b < min_val) b = min_val;
+
+    return (std::max)(fabs((a - b) / a), fabs((a - b) / b));
+}
+
+#if (defined(macintosh) || defined(__APPLE__) || defined(__APPLE_CC__)) && \
+    (LDBL_MAX_EXP <= DBL_MAX_EXP)
+template<>
+inline boost::math::tools::promote_args<double, double>::type
+relative_difference(const double& arg_a, const double& arg_b) {
+    BOOST_MATH_STD_USING
+    double a = arg_a;
+    double b = arg_b;
+    //
+    // On Mac OS X we evaluate "double" functions at "long double" precision,
+    // but "long double" actually has a very slightly narrower range than
+    // "double"! Therefore use the range of "long double" as our limits since
+    // results outside that range may have been truncated to 0 or INF:
+    //
+    double min_val = (std::max)((double)tools::min_value<long double>(),
+                                tools::min_value<double>());
+    double max_val = (std::min)((double)tools::max_value<long double>(),
+                                tools::max_value<double>());
+
+    // Screen out NaNs first: if either value is a NaN, the distance is
+    // "infinite":
+    if ((boost::math::isnan)(a) || (boost::math::isnan)(b)) return max_val;
+    // Screen out infinities:
+    if (fabs(b) > max_val) {
+        if (fabs(a) > max_val)
+            return 0;  // one infinity is as good as another!
+        else
+            return max_val;  // one infinity and one finite value implies
+                             // infinite difference
+    } else if (fabs(a) > max_val)
+        return max_val;  // one infinity and one finite value implies infinite
+                         // difference
+
+    //
+    // If the values have different signs, treat as infinite difference:
+    //
+    if (((a < 0) != (b < 0)) && (a != 0) && (b != 0)) return max_val;
+    a = fabs(a);
+    b = fabs(b);
+    //
+    // Now deal with zeros: if one value is zero (or denorm), treat it the
+    // same as min_val for the purposes of the calculation that follows:
+    //
+    if (a < min_val) a = min_val;
+    if (b < min_val) b = min_val;
+
+    return (std::max)(fabs((a - b) / a), fabs((a - b) / b));
+}
+#endif
+
+template<class T, class U>
+inline typename boost::math::tools::promote_args<T, U>::type epsilon_difference(
+    const T& arg_a, const U& arg_b) {
+    typedef typename boost::math::tools::promote_args<T, U>::type result_type;
+    result_type r = relative_difference(arg_a, arg_b);
+    if (tools::max_value<result_type>() *
+            boost::math::tools::epsilon<result_type>() <
+        r)
+        return tools::max_value<result_type>();
+    return r / boost::math::tools::epsilon<result_type>();
+}
+}  // namespace math
+}  // namespace boost
+
+#endif
diff --git a/test/reorder.cpp b/test/reorder.cpp
index 789fbfbbc8..3109839786 100644
--- a/test/reorder.cpp
+++ b/test/reorder.cpp
@@ -7,162 +7,194 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::allTrue;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::reorder;
+using af::seq;
+using af::span;
+using af::tile;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Reorder : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Reorder : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         char, signed char, unsigned char, short, ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Reorder, TestTypes);
+TYPED_TEST_SUITE(Reorder, TestTypes);
 
 template<typename T>
-void reorderTest(string pTestFile, const unsigned resultIdx,
-                 const uint x, const uint y, const uint z, const uint w,
-                 bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void reorderTest(string pTestFile, const unsigned resultIdx, const uint x,
+                 const uint y, const uint z, const uint w,
+                 bool isSubRef = false, const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_reorder(&outArray, inArray, x, y, z, w));
-
-    // Get result
-    T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_reorder(&outArray, inArray, x, y, z, w));
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
-    }
-
-    // Delete
-    delete[] outData;
+    dim4 goldDims(idims[x], idims[y], idims[z], idims[w]);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, outArray);
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define REORDER_INIT(desc, file, resultIdx, x, y, z, w)                                        \
-    TYPED_TEST(Reorder, desc)                                                                  \
-    {                                                                                       \
-        reorderTest<TypeParam>(string(TEST_DIR"/reorder/"#file".test"), resultIdx, x, y, z, w);   \
+#define REORDER_INIT(desc, file, resultIdx, x, y, z, w)                    \
+    TYPED_TEST(Reorder, desc) {                                            \
+        reorderTest<TypeParam>(string(TEST_DIR "/reorder/" #file ".test"), \
+                               resultIdx, x, y, z, w);                     \
     }
 
-    REORDER_INIT(Reorder012, reorder, 0, 0, 1, 2, 3);
-    REORDER_INIT(Reorder021, reorder, 1, 0, 2, 1, 3);
-    REORDER_INIT(Reorder102, reorder, 2, 1, 0, 2, 3);
-    REORDER_INIT(Reorder120, reorder, 3, 1, 2, 0, 3);
-    REORDER_INIT(Reorder201, reorder, 4, 2, 0, 1, 3);
-    REORDER_INIT(Reorder210, reorder, 5, 2, 1, 0, 3);
-
-    REORDER_INIT(Reorder0123, reorder4d, 0, 0, 1, 2, 3);
-    REORDER_INIT(Reorder0132, reorder4d, 1, 0, 1, 3, 2);
-    REORDER_INIT(Reorder0213, reorder4d, 2, 0, 2, 1, 3);
-    REORDER_INIT(Reorder0231, reorder4d, 3, 0, 2, 3, 1);
-    REORDER_INIT(Reorder0312, reorder4d, 4, 0, 3, 1, 2);
-    REORDER_INIT(Reorder0321, reorder4d, 5, 0, 3, 2, 1);
-
-    REORDER_INIT(Reorder1023, reorder4d, 6, 1, 0, 2, 3);
-    REORDER_INIT(Reorder1032, reorder4d, 7, 1, 0, 3, 2);
-    REORDER_INIT(Reorder1203, reorder4d, 8, 1, 2, 0, 3);
-    REORDER_INIT(Reorder1230, reorder4d, 9, 1, 2, 3, 0);
-    REORDER_INIT(Reorder1302, reorder4d,10, 1, 3, 0, 2);
-    REORDER_INIT(Reorder1320, reorder4d,11, 1, 3, 2, 0);
-
-    REORDER_INIT(Reorder2103, reorder4d,12, 2, 1, 0, 3);
-    REORDER_INIT(Reorder2130, reorder4d,13, 2, 1, 3, 0);
-    REORDER_INIT(Reorder2013, reorder4d,14, 2, 0, 1, 3);
-    REORDER_INIT(Reorder2031, reorder4d,15, 2, 0, 3, 1);
-    REORDER_INIT(Reorder2310, reorder4d,16, 2, 3, 1, 0);
-    REORDER_INIT(Reorder2301, reorder4d,17, 2, 3, 0, 1);
-
-    REORDER_INIT(Reorder3120, reorder4d,18, 3, 1, 2, 0);
-    REORDER_INIT(Reorder3102, reorder4d,19, 3, 1, 0, 2);
-    REORDER_INIT(Reorder3210, reorder4d,20, 3, 2, 1, 0);
-    REORDER_INIT(Reorder3201, reorder4d,21, 3, 2, 0, 1);
-    REORDER_INIT(Reorder3012, reorder4d,22, 3, 0, 1, 2);
-    REORDER_INIT(Reorder3021, reorder4d,23, 3, 0, 2, 1);
+REORDER_INIT(Reorder012, reorder, 0, 0, 1, 2, 3);
+REORDER_INIT(Reorder021, reorder, 1, 0, 2, 1, 3);
+REORDER_INIT(Reorder102, reorder, 2, 1, 0, 2, 3);
+REORDER_INIT(Reorder120, reorder, 3, 1, 2, 0, 3);
+REORDER_INIT(Reorder201, reorder, 4, 2, 0, 1, 3);
+REORDER_INIT(Reorder210, reorder, 5, 2, 1, 0, 3);
+
+REORDER_INIT(Reorder0123, reorder4d, 0, 0, 1, 2, 3);
+REORDER_INIT(Reorder0132, reorder4d, 1, 0, 1, 3, 2);
+REORDER_INIT(Reorder0213, reorder4d, 2, 0, 2, 1, 3);
+REORDER_INIT(Reorder0231, reorder4d, 3, 0, 2, 3, 1);
+REORDER_INIT(Reorder0312, reorder4d, 4, 0, 3, 1, 2);
+REORDER_INIT(Reorder0321, reorder4d, 5, 0, 3, 2, 1);
+
+REORDER_INIT(Reorder1023, reorder4d, 6, 1, 0, 2, 3);
+REORDER_INIT(Reorder1032, reorder4d, 7, 1, 0, 3, 2);
+REORDER_INIT(Reorder1203, reorder4d, 8, 1, 2, 0, 3);
+REORDER_INIT(Reorder1230, reorder4d, 9, 1, 2, 3, 0);
+REORDER_INIT(Reorder1302, reorder4d, 10, 1, 3, 0, 2);
+REORDER_INIT(Reorder1320, reorder4d, 11, 1, 3, 2, 0);
+
+REORDER_INIT(Reorder2103, reorder4d, 12, 2, 1, 0, 3);
+REORDER_INIT(Reorder2130, reorder4d, 13, 2, 1, 3, 0);
+REORDER_INIT(Reorder2013, reorder4d, 14, 2, 0, 1, 3);
+REORDER_INIT(Reorder2031, reorder4d, 15, 2, 0, 3, 1);
+REORDER_INIT(Reorder2310, reorder4d, 16, 2, 3, 1, 0);
+REORDER_INIT(Reorder2301, reorder4d, 17, 2, 3, 0, 1);
+
+REORDER_INIT(Reorder3120, reorder4d, 18, 3, 1, 2, 0);
+REORDER_INIT(Reorder3102, reorder4d, 19, 3, 1, 0, 2);
+REORDER_INIT(Reorder3210, reorder4d, 20, 3, 2, 1, 0);
+REORDER_INIT(Reorder3201, reorder4d, 21, 3, 2, 0, 1);
+REORDER_INIT(Reorder3012, reorder4d, 22, 3, 0, 1, 2);
+REORDER_INIT(Reorder3021, reorder4d, 23, 3, 0, 2, 1);
 
 ////////////////////////////////// CPP ///////////////////////////////////
 //
-TEST(Reorder, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(Reorder, CPP) {
     const unsigned resultIdx = 0;
-    const unsigned x = 0;
-    const unsigned y = 1;
-    const unsigned z = 2;
-    const unsigned w = 3;
+    const unsigned x         = 0;
+    const unsigned y         = 1;
+    const unsigned z         = 2;
+    const unsigned w         = 3;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/reorder/reorder4d.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/reorder/reorder4d.test"),
+                                 numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af::array input(idims, &(in[0].front()));
-    af::array output = af::reorder(input, x, y, z, w);
+    array input(idims, &(in[0].front()));
+    array output = reorder(input, x, y, z, w);
 
-    // Get result
-    float* outData = new float[tests[resultIdx].size()];
-    output.host((void*)outData);
+    dim4 goldDims(idims[x], idims[y], idims[z], idims[w]);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, output);
+}
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
+TEST(Reorder, ISSUE_1777) {
+    const int m = 5;
+    const int n = 4;
+    const int k = 3;
+    vector<float> h_input(m * n);
+
+    for (int i = 0; i < m * n; i++) { h_input[i] = (float)(i); }
+
+    array a(m, n, &h_input[0]);
+    array a_t = tile(a, 1, 1, k);
+    array a_r = reorder(a_t, 0, 2, 1);
+
+    vector<float> h_output(m * n * k);
+    a_r.host((void *)&h_output[0]);
+    for (int z = 0; z < n; z++) {
+        for (int y = 0; y < k; y++) {
+            for (int x = 0; x < m; x++) {
+                ASSERT_EQ(h_output[z * k * m + y * m + x], h_input[z * m + x]);
+            }
+        }
     }
+}
+
+TEST(Reorder, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    array input  = range(dim4(2, largeDim, 2), 2);
+    array output = reorder(input, 2, 1, 0);
 
-    // Delete
-    delete[] outData;
+    array gold = range(dim4(2, largeDim, 2));
+
+    ASSERT_ARRAYS_EQ(gold, output);
 }
 
+TEST(Reorder, InputArrayUnchanged) {
+    float h_input[12] = {0.f, 1.f, 2.f, 3.f, 4.f,  5.f,
+                         6.f, 7.f, 8.f, 9.f, 10.f, 11.f};
+    array input(2, 3, 2, h_input);
+    array input_reord = reorder(input, 0, 2, 1);
+
+    array input_gold(2, 3, 2, h_input);
+    ASSERT_ARRAYS_EQ(input_gold, input);
+}
diff --git a/test/replace.cpp b/test/replace.cpp
new file mode 100644
index 0000000000..1156731732
--- /dev/null
+++ b/test/replace.cpp
@@ -0,0 +1,205 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <iostream>
+#include <string>
+#include <type_traits>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::NaN;
+using af::randu;
+using af::seq;
+using af::span;
+using std::vector;
+
+template<typename T>
+class Replace : public ::testing::Test {};
+
+typedef ::testing::Types<half_float::half, float, double, cfloat, cdouble, uint,
+                         int, intl, uintl, schar, uchar, char, short, ushort>
+    TestTypes;
+
+TYPED_TEST_SUITE(Replace, TestTypes);
+
+template<typename T>
+void replaceTest(const dim4 &dims) {
+    SUPPORTED_TYPE_CHECK(T);
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array a = randu(dims, ty);
+    array b = randu(dims, ty);
+
+    if (a.isinteger()) {
+        a = (a % (1 << 30)).as(ty);
+        b = (b % (1 << 30)).as(ty);
+    }
+
+    array c = a.copy();
+
+    array cond = randu(dims, ty) > a;
+
+    replace(c, cond, b);
+
+    int num = (int)a.elements();
+
+    vector<T> ha(num);
+    vector<T> hb(num);
+    vector<T> hc(num);
+    vector<char> hcond(num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+    cond.host(&hcond[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_EQ(hc[i], hcond[i] ? ha[i] : hb[i]);
+    }
+}
+
+template<typename T>
+void replaceScalarTest(const dim4 &dims) {
+    SUPPORTED_TYPE_CHECK(T);
+    using scalar_t =
+        typename std::conditional<std::is_same<T, intl>::value ||
+                                      std::is_same<T, uintl>::value,
+                                  T, double>::type;
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array a = randu(dims, ty);
+
+    if (a.isinteger()) { a = (a % (1 << 30)).as(ty); }
+
+    array c    = a.copy();
+    array cond = randu(dims, ty) > a;
+    scalar_t b = static_cast<scalar_t>(3);
+
+    replace(c, cond, b);
+    int num = (int)a.elements();
+
+    vector<T> ha(num);
+    vector<T> hc(num);
+    vector<char> hcond(num);
+
+    a.host(&ha[0]);
+    c.host(&hc[0]);
+    cond.host(&hcond[0]);
+
+    for (int i = 0; i < num; i++) { ASSERT_EQ(hc[i], hcond[i] ? ha[i] : T(b)); }
+}
+
+TYPED_TEST(Replace, Simple) { replaceTest<TypeParam>(dim4(1024, 1024)); }
+
+TYPED_TEST(Replace, Scalar) { replaceScalarTest<TypeParam>(dim4(5, 5)); }
+
+TEST(Replace, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    dim4 dims(1000, 1250);
+    dtype ty = f32;
+
+    array a                                 = randu(dims, ty);
+    a(seq(a.dims(0) / 2), span, span, span) = NaN;
+    array c                                 = a.copy();
+    float b                                 = 0;
+    replace(c, !isNaN(c), b);
+
+    int num = (int)a.elements();
+
+    vector<float> ha(num);
+    vector<float> hc(num);
+
+    a.host(&ha[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_EQ(hc[i], (std::isnan(ha[i]) ? b : ha[i]));
+    }
+}
+
+TEST(Replace, ISSUE_1249) {
+    dim4 dims(2, 3, 4);
+    array cond = randu(dims) > 0.5;
+    array a    = randu(dims);
+    array b    = a.copy();
+    replace(b, !cond, a - a * 0.9);
+    array c  = (a - a * 0.9);
+    c(!cond) = a(!cond);
+
+    int num = (int)dims.elements();
+    vector<float> hb(num);
+    vector<float> hc(num);
+
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_FLOAT_EQ(hc[i], hb[i]) << "at " << i;
+    }
+}
+
+TEST(Replace, 4D) {
+    dim4 dims(2, 3, 4, 2);
+    array cond = randu(dims) > 0.5;
+    array a    = randu(dims);
+    array b    = a.copy();
+    replace(b, !cond, a - a * 0.9);
+    array c = a - a * cond * 0.9;
+
+    int num = (int)dims.elements();
+    vector<float> hb(num);
+    vector<float> hc(num);
+
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_FLOAT_EQ(hc[i], hb[i]) << "at " << i;
+    }
+}
+
+TEST(Replace, ISSUE_1683) {
+    array A = randu(10, 20, f32);
+    vector<float> ha1(A.elements());
+    A.host(ha1.data());
+
+    array B = A(0, span);
+    replace(B, A(0, span) > 0.5, 0.0);
+
+    vector<float> ha2(A.elements());
+    A.host(ha2.data());
+
+    vector<float> hb(B.elements());
+    B.host(hb.data());
+
+    // Ensures A is not modified by replace
+    for (int i = 0; i < (int)A.elements(); i++) {
+        ASSERT_FLOAT_EQ(ha1[i], ha2[i]);
+    }
+
+    // Ensures replace on B works as expected
+    for (int i = 0; i < (int)B.elements(); i++) {
+        float val = ha1[i * A.dims(0)];
+        val       = val > 0.5 ? val : 0.0f;
+        ASSERT_FLOAT_EQ(val, hb[i]);
+    }
+}
diff --git a/test/resize.cpp b/test/resize.cpp
index dc0b15c5ab..50c46730f9 100644
--- a/test/resize.cpp
+++ b/test/resize.cpp
@@ -7,138 +7,146 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Resize : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Resize : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 template<typename T>
-class ResizeI : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-
-            subMat1.push_back(af_make_seq(0, 5, 1));
-            subMat1.push_back(af_make_seq(0, 5, 1));
-            subMat1.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
-        vector<af_seq> subMat1;
+class ResizeI : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+
+        subMat1.push_back(af_make_seq(0, 5, 1));
+        subMat1.push_back(af_make_seq(0, 5, 1));
+        subMat1.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
+    vector<af_seq> subMat1;
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypesF;
-typedef ::testing::Types<int, unsigned, intl, uintl, unsigned char, char> TestTypesI;
+typedef ::testing::Types<int, unsigned, intl, uintl, signed char, unsigned char,
+                         char, short, ushort>
+    TestTypesI;
 
 // register the type list
-TYPED_TEST_CASE(Resize, TestTypesF);
-TYPED_TEST_CASE(ResizeI, TestTypesI);
+TYPED_TEST_SUITE(Resize, TestTypesF);
+TYPED_TEST_SUITE(ResizeI, TestTypesI);
 
-TYPED_TEST(Resize, InvalidDims)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Resize, InvalidDims) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
 
-    vector<TypeParam> in(8,8);
+    vector<TypeParam> in(8 * 8);
 
     af_array inArray  = 0;
     af_array outArray = 0;
 
-    af::dim4 dims = af::dim4(8,8,1,1);
+    dim4 dims = dim4(8, 8, 1, 1);
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(), dims.ndims(), dims.get(),
-                                          (af_dtype) af::dtype_traits<TypeParam>::af_type));
-    ASSERT_EQ(AF_ERR_SIZE, af_resize(&outArray, inArray, 0, 0, AF_INTERP_NEAREST));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<TypeParam>::af_type));
+    ASSERT_EQ(AF_ERR_SIZE,
+              af_resize(&outArray, inArray, 0, 0, AF_INTERP_NEAREST));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
 template<typename T>
-void compare(T test, T out, double err, size_t i)
-{
-    ASSERT_EQ(std::abs(test - out) < 0.0001, true) << "at: " << i << std::endl
-             << "for test = : " << test << std::endl
-             << "out data = : " << out << std::endl;
+void compare(T test, T out, double err, size_t i) {
+    ASSERT_EQ(abs(test - out) < err, true) << "at: " << i << endl
+                                           << "for test = : " << test << endl
+                                           << "out data = : " << out << endl;
 }
 
 template<>
-void compare<uintl>(uintl test, uintl out, double err, size_t i)
-{
-    ASSERT_EQ(((intl)test - (intl)out) < 0.0001, true) << "at: " << i << std::endl
-             << "for test = : " << test << std::endl
-             << "out data = : " << out << std::endl;
+void compare<uintl>(uintl test, uintl out, double err, size_t i) {
+    ASSERT_EQ(abs((intl)test - (intl)out) < err, true)
+        << "at: " << i << endl
+        << "for test = : " << test << endl
+        << "out data = : " << out << endl;
 }
 
 template<>
-void compare<uint>(uint test, uint out, double err, size_t i)
-{
-    ASSERT_EQ(((int)test - (int)out) < 0.0001, true) << "at: " << i << std::endl
-             << "for test = : " << test << std::endl
-             << "out data = : " << out << std::endl;
+void compare<uint>(uint test, uint out, double err, size_t i) {
+    ASSERT_EQ(abs((int)test - (int)out) < err, true)
+        << "at: " << i << endl
+        << "for test = : " << test << endl
+        << "out data = : " << out << endl;
 }
 
 template<>
-void compare<uchar>(uchar test, uchar out, double err, size_t i)
-{
-    ASSERT_EQ(((int)test - (int)out) < 0.0001, true) << "at: " << i << std::endl
-             << "for test = : " << test << std::endl
-             << "out data = : " << out << std::endl;
+void compare<uchar>(uchar test, uchar out, double err, size_t i) {
+    ASSERT_EQ(abs((int)test - (int)out) < err, true)
+        << "at: " << i << endl
+        << "for test = : " << test << endl
+        << "out data = : " << out << endl;
 }
 
 template<typename T>
-void resizeTest(string pTestFile, const unsigned resultIdx, const dim_t odim0, const dim_t odim1, const af_interp_type method, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void resizeTest(string pTestFile, const unsigned resultIdx, const dim_t odim0,
+                const dim_t odim1, const af_interp_type method,
+                bool isSubRef = false, const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
+    dim4 dims = numDims[0];
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
     if (isSubRef) {
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       dims.ndims(), dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_resize(&outArray, inArray, odim0, odim1, method));
+    ASSERT_SUCCESS(af_resize(&outArray, inArray, odim0, odim1, method));
 
     // Get result
-    af::dim4 odims(odim0, odim1, dims[2], dims[3]);
+    dim4 odims(odim0, odim1, dims[2], dims[3]);
     T* outData = new T[odims.elements()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
@@ -149,305 +157,263 @@ void resizeTest(string pTestFile, const unsigned resultIdx, const dim_t odim0, c
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
 ///////////////////////////////////////////////////////////////////////////////
 // Float Types
 ///////////////////////////////////////////////////////////////////////////////
-TYPED_TEST(Resize, Resize3CSquareUpNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 0, 16, 16, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize3CSquareUpNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 0, 16, 16,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize3CSquareUpLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 1, 16, 16, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize3CSquareUpLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 1, 16, 16,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Resize, Resize3CSquareDownNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 2, 4, 4, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize3CSquareDownNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 2, 4, 4,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize3CSquareDownLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 3, 4, 4, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize3CSquareDownLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 3, 4, 4,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Resize, Resize3CSquareUpNearestSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 4, 10, 10, AF_INTERP_NEAREST,
-                          true, &(this->subMat0));
+TYPED_TEST(Resize, Resize3CSquareUpNearestSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 4, 10, 10,
+                          AF_INTERP_NEAREST, true, &(this->subMat0));
 }
 
-TYPED_TEST(Resize, Resize3CSquareUpLinearSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 5, 10, 10, AF_INTERP_BILINEAR,
-                          true, &(this->subMat0));
+TYPED_TEST(Resize, Resize3CSquareUpLinearSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 5, 10, 10,
+                          AF_INTERP_BILINEAR, true, &(this->subMat0));
 }
 
-TYPED_TEST(Resize, Resize3CSquareDownNearestSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 6, 3, 3, AF_INTERP_NEAREST,
-                          true, &(this->subMat0));
+TYPED_TEST(Resize, Resize3CSquareDownNearestSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 6, 3, 3,
+                          AF_INTERP_NEAREST, true, &(this->subMat0));
 }
 
-TYPED_TEST(Resize, Resize3CSquareDownLinearSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 7, 3, 3, AF_INTERP_BILINEAR,
-                          true, &(this->subMat0));
+TYPED_TEST(Resize, Resize3CSquareDownLinearSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 7, 3, 3,
+                          AF_INTERP_BILINEAR, true, &(this->subMat0));
 }
 
-TYPED_TEST(Resize, Resize1CRectangleUpNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/rectangle.test"), 0, 12, 16, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize1CRectangleUpNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/rectangle.test"), 0, 12, 16,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize1CRectangleUpLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/rectangle.test"), 1, 12, 16, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize1CRectangleUpLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/rectangle.test"), 1, 12, 16,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Resize, Resize1CRectangleDownNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/rectangle.test"), 2, 6, 2, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize1CRectangleDownNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/rectangle.test"), 2, 6, 2,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize1CRectangleDownLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/rectangle.test"), 3, 6, 2, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize1CRectangleDownLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/rectangle.test"), 3, 6, 2,
+                          AF_INTERP_BILINEAR);
 }
 
 ///////////////////////////////////////////////////////////////////////////////
 // Integer Types
 ///////////////////////////////////////////////////////////////////////////////
-TYPED_TEST(ResizeI, Resize3CSquareUpNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 0, 16, 16, AF_INTERP_NEAREST);
+TYPED_TEST(ResizeI, Resize3CSquareUpNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 0, 16, 16,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareUpLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 1, 16, 16, AF_INTERP_BILINEAR);
+TYPED_TEST(ResizeI, Resize3CSquareUpLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 1, 16, 16,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareDownNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 2, 4, 4, AF_INTERP_NEAREST);
+TYPED_TEST(ResizeI, Resize3CSquareDownNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 2, 4, 4,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareDownLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 3, 4, 4, AF_INTERP_BILINEAR);
+TYPED_TEST(ResizeI, Resize3CSquareDownLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 3, 4, 4,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareUpNearestSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 4, 10, 10, AF_INTERP_NEAREST,
-                          true, &(this->subMat0));
+TYPED_TEST(ResizeI, Resize3CSquareUpNearestSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 4, 10, 10,
+                          AF_INTERP_NEAREST, true, &(this->subMat0));
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareUpLinearSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 5, 10, 10, AF_INTERP_BILINEAR,
-                          true, &(this->subMat0));
+TYPED_TEST(ResizeI, Resize3CSquareUpLinearSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 5, 10, 10,
+                          AF_INTERP_BILINEAR, true, &(this->subMat0));
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareDownNearestSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 6, 3, 3, AF_INTERP_NEAREST,
-                          true, &(this->subMat0));
+TYPED_TEST(ResizeI, Resize3CSquareDownNearestSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 6, 3, 3,
+                          AF_INTERP_NEAREST, true, &(this->subMat0));
 }
 
-TYPED_TEST(ResizeI, Resize3CSquareDownLinearSubref)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/square.test"), 8, 3, 3, AF_INTERP_BILINEAR,
-                          true, &(this->subMat1));
+TYPED_TEST(ResizeI, Resize3CSquareDownLinearSubref) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/square.test"), 8, 3, 3,
+                          AF_INTERP_BILINEAR, true, &(this->subMat1));
 }
 
 ///////////////////////////////////////////////////////////////////////////////
 // Float Types
 ///////////////////////////////////////////////////////////////////////////////
-TYPED_TEST(Resize, Resize1CLargeUpNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 0, 256, 256, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize1CLargeUpNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 0, 256, 256,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize1CLargeUpLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 1, 256, 256, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize1CLargeUpLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 1, 256, 256,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Resize, Resize1CLargeDownNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 2, 32, 32, AF_INTERP_NEAREST);
+TYPED_TEST(Resize, Resize1CLargeDownNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 2, 32, 32,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Resize, Resize1CLargeDownLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 3, 32, 32, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, Resize1CLargeDownLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 3, 32, 32,
+                          AF_INTERP_BILINEAR);
 }
 
 ///////////////////////////////////////////////////////////////////////////////
 // Integer Types
 ///////////////////////////////////////////////////////////////////////////////
-TYPED_TEST(ResizeI, Resize1CLargeUpNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 0, 256, 256, AF_INTERP_NEAREST);
+TYPED_TEST(ResizeI, Resize1CLargeUpNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 0, 256, 256,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(ResizeI, Resize1CLargeUpLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 1, 256, 256, AF_INTERP_BILINEAR);
+TYPED_TEST(ResizeI, Resize1CLargeUpLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 1, 256, 256,
+                          AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(ResizeI, Resize1CLargeDownNearest)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 2, 32, 32, AF_INTERP_NEAREST);
+TYPED_TEST(ResizeI, Resize1CLargeDownNearest) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 2, 32, 32,
+                          AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(ResizeI, Resize1CLargeDownLinear)
-{
-    resizeTest<TypeParam>(string(TEST_DIR"/resize/large.test"), 3, 32, 32, AF_INTERP_BILINEAR);
+TYPED_TEST(ResizeI, Resize1CLargeDownLinear) {
+    resizeTest<TypeParam>(string(TEST_DIR "/resize/large.test"), 3, 32, 32,
+                          AF_INTERP_BILINEAR);
 }
 
 template<typename T>
-void resizeArgsTest(af_err err, string pTestFile, const af::dim4 odims, const af_interp_type method)
-{
-    if (noDoubleTests<T>()) return;
+void resizeArgsTest(af_err err, string pTestFile, const dim4 odims,
+                    const af_interp_type method) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
+    dim4 dims = numDims[0];
 
-    af_array inArray = 0;
+    af_array inArray  = 0;
     af_array outArray = 0;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     ASSERT_EQ(err, af_resize(&outArray, inArray, odims[0], odims[1], method));
 
-    if(inArray != 0) af_release_array(inArray);
-    if(outArray != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-TYPED_TEST(Resize,InvalidArgsDims0)
-{
-    af::dim4 dims(0, 5, 2, 1);
-    resizeArgsTest<TypeParam>(AF_ERR_SIZE, string(TEST_DIR"/resize/square.test"), dims, AF_INTERP_BILINEAR);
+TYPED_TEST(Resize, InvalidArgsDims0) {
+    dim4 dims(0, 5, 2, 1);
+    resizeArgsTest<TypeParam>(AF_ERR_SIZE,
+                              string(TEST_DIR "/resize/square.test"), dims,
+                              AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Resize,InvalidArgsMethod)
-{
-    af::dim4 dims(10, 10, 1, 1);
-    resizeArgsTest<TypeParam>(AF_ERR_ARG, string(TEST_DIR"/resize/square.test"), dims, AF_INTERP_CUBIC);
+TYPED_TEST(Resize, InvalidArgsMethod) {
+    dim4 dims(10, 10, 1, 1);
+    resizeArgsTest<TypeParam>(AF_ERR_ARG,
+                              string(TEST_DIR "/resize/square.test"), dims,
+                              AF_INTERP_CUBIC);
 }
 
 ///////////////////////////////// CPP ////////////////////////////////////
 //
-TEST(Resize, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/resize/square.test"),numDims,in,tests);
 
-    af::dim4 dims = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::resize(input, 16, 16);
+using af::array;
+using af::constant;
+using af::max;
+using af::seq;
+using af::span;
 
-    // Get result
-    af::dim4 odims(16, 16, dims[2], dims[3]);
-    float* outData = new float[odims.elements()];
-    output.host((void*)outData);
+TEST(Resize, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/resize/square.test"),
+                                   numDims, in, tests);
 
-    // Compare result
-    size_t nElems = tests[0].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_NEAR(tests[0][elIter], outData[elIter], 0.0001) << "at: " << elIter << std::endl;
-    }
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = resize(input, 16, 16);
 
-    // Delete
-    delete[] outData;
+    dim4 goldDims(16, 16, dims[2], dims[3]);
+    ASSERT_VEC_ARRAY_NEAR(tests[0], goldDims, output, 0.0001);
 }
 
-TEST(ResizeScale1, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/resize/square.test"),numDims,in,tests);
-
-    af::dim4 dims = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::resize(2.f, input);
-
-    // Get result
-    af::dim4 odims(16, 16, dims[2], dims[3]);
-    float* outData = new float[odims.elements()];
-    output.host((void*)outData);
+TEST(ResizeScale1, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/resize/square.test"),
+                                   numDims, in, tests);
 
-    // Compare result
-    size_t nElems = tests[0].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_NEAR(tests[0][elIter], outData[elIter], 0.0001) << "at: " << elIter << std::endl;
-    }
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = resize(2.f, input);
 
-    // Delete
-    delete[] outData;
+    dim4 goldDims(16, 16, dims[2], dims[3]);
+    ASSERT_VEC_ARRAY_NEAR(tests[0], goldDims, output, 0.0001);
 }
 
-TEST(ResizeScale2, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/resize/square.test"),numDims,in,tests);
+TEST(ResizeScale2, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/resize/square.test"),
+                                   numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::resize(2.f, 2.f, input);
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = resize(2.f, 2.f, input);
 
-    // Get result
-    af::dim4 odims(16, 16, dims[2], dims[3]);
-    float* outData = new float[odims.elements()];
-    output.host((void*)outData);
-
-    // Compare result
-    size_t nElems = tests[0].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_NEAR(tests[0][elIter], outData[elIter], 0.0001) << "at: " << elIter << std::endl;
-    }
-
-    // Delete
-    delete[] outData;
+    dim4 goldDims(16, 16, dims[2], dims[3]);
+    ASSERT_VEC_ARRAY_NEAR(tests[0], goldDims, output, 0.0001);
 }
 
-
-
-TEST(Resize, ExtractGFOR)
-{
-    using namespace af;
+TEST(Resize, ExtractGFOR) {
     dim4 dims = dim4(100, 100, 3);
-    array A = round(100 * randu(dims));
-    array B = constant(0, 200, 200, 3);
+    array A   = round(100 * randu(dims));
+    array B   = constant(0, 200, 200, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = resize(A(span, span, ii), 200, 200);
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = resize(A(span, span, ii), 200, 200); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = resize(A(span, span, ii), 200, 200);
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
diff --git a/test/rng_match.cpp b/test/rng_match.cpp
new file mode 100644
index 0000000000..f13872889e
--- /dev/null
+++ b/test/rng_match.cpp
@@ -0,0 +1,121 @@
+/*******************************************************
+ * Copyright (c) 2019, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+#include <sstream>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::getAvailableBackends;
+using af::randomEngine;
+using af::randu;
+using af::setBackend;
+using af::setSeed;
+using std::get;
+using std::make_pair;
+using std::stringstream;
+using std::vector;
+
+enum param { engine, backend, size, seed, type };
+
+using rng_params =
+    std::tuple<af::randomEngineType, std::pair<af::Backend, af::Backend>,
+               af::dim4, int, af_dtype>;
+
+class RNGMatch : public ::testing::TestWithParam<rng_params> {
+   protected:
+    void SetUp() {
+        backends_available =
+            getAvailableBackends() & get<backend>(GetParam()).first;
+        backends_available =
+            backends_available &&
+            (getAvailableBackends() & get<backend>(GetParam()).second);
+
+        if (backends_available) {
+            setBackend(get<backend>(GetParam()).first);
+            randomEngine(get<engine>(GetParam()));
+            setSeed(get<seed>(GetParam()));
+            array tmp  = randu(get<size>(GetParam()), get<type>(GetParam()));
+            void* data = malloc(tmp.bytes());
+            tmp.host(data);
+
+            setBackend(get<backend>(GetParam()).second);
+            values[0] = array(get<size>(GetParam()), get<type>(GetParam()));
+            values[0].write(data, values[0].bytes());
+            free(data);
+            randomEngine(get<engine>(GetParam()));
+            setSeed(get<seed>(GetParam()));
+            values[1] = randu(get<size>(GetParam()), get<type>(GetParam()));
+        }
+    }
+
+    array values[2];
+    bool backends_available;
+};
+
+std::string engine_name(af::randomEngineType engine) {
+    switch (engine) {
+        case AF_RANDOM_ENGINE_PHILOX: return "PHILOX";
+        case AF_RANDOM_ENGINE_THREEFRY: return "THREEFRY";
+        case AF_RANDOM_ENGINE_MERSENNE: return "MERSENNE";
+        default: return "UNKNOWN ENGINE";
+    }
+}
+
+std::string backend_name(af::Backend backend) {
+    switch (backend) {
+        case AF_BACKEND_DEFAULT: return "DEFAULT";
+        case AF_BACKEND_CPU: return "CPU";
+        case AF_BACKEND_CUDA: return "CUDA";
+        case AF_BACKEND_OPENCL: return "OPENCL";
+        default: return "UNKNOWN BACKEND";
+    }
+}
+
+std::string rngmatch_info(
+    const ::testing::TestParamInfo<RNGMatch::ParamType> info) {
+    stringstream ss;
+    ss << "size_" << get<size>(info.param)[0] << "_"
+       << backend_name(get<backend>(info.param).first) << "_"
+       << backend_name(get<backend>(info.param).second) << "_"
+       << get<size>(info.param)[1] << "_" << get<size>(info.param)[2] << "_"
+       << get<size>(info.param)[3] << "_seed_" << get<seed>(info.param)
+       << "_type_" << get<type>(info.param);
+    return ss.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(
+    PhiloxCPU_CUDA, RNGMatch,
+    ::testing::Combine(
+        ::testing::Values(AF_RANDOM_ENGINE_PHILOX),
+        ::testing::Values(make_pair(AF_BACKEND_CPU, AF_BACKEND_CUDA),
+                          make_pair(AF_BACKEND_CPU, AF_BACKEND_OPENCL)),
+        ::testing::Values(dim4(10), dim4(100), dim4(1000), dim4(10000),
+                          dim4(1E5), dim4(10, 10), dim4(10, 100),
+                          dim4(100, 100), dim4(1000, 100), dim4(10, 10, 10),
+                          dim4(10, 100, 10), dim4(100, 100, 10),
+                          dim4(1000, 100, 10), dim4(10, 10, 10, 10),
+                          dim4(10, 100, 10, 10), dim4(100, 100, 10, 10),
+                          dim4(1000, 100, 10, 10)),
+        ::testing::Values(12), ::testing::Values(f32, f64, c32, c64, u8)),
+    rngmatch_info);
+
+TEST_P(RNGMatch, BackendEquals) {
+    if (backends_available) {
+        array actual   = values[0];
+        array expected = values[1];
+        ASSERT_ARRAYS_EQ(actual, expected);
+    } else {
+        printf("SKIPPED\n");
+    }
+}
diff --git a/test/rng_quality.cpp b/test/rng_quality.cpp
new file mode 100644
index 0000000000..92c264dfbb
--- /dev/null
+++ b/test/rng_quality.cpp
@@ -0,0 +1,248 @@
+
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+using af::allTrue;
+using af::array;
+using af::constant;
+using af::deviceGC;
+using af::dtype;
+using af::dtype_traits;
+using af::randomEngine;
+using af::randomEngineType;
+using af::sum;
+
+template<typename T>
+class RandomEngine : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        // Ensure all unlocked buffers are freed
+        deviceGC();
+        SUPPORTED_TYPE_CHECK(T);
+    }
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double> TestTypesEngine;
+// register the type list
+TYPED_TEST_SUITE(RandomEngine, TestTypesEngine);
+
+template<typename T>
+void testRandomEnginePeriod(randomEngineType type) {
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    int elem  = 1024 * 1024;
+    int steps = 4 * 1024;
+    randomEngine r(type, 0);
+
+    array first = randu(elem, ty, r);
+
+    for (int i = 0; i < steps; ++i) {
+        array step     = randu(elem, ty, r);
+        bool different = !allTrue<bool>(first == step);
+        ASSERT_TRUE(different);
+    }
+}
+
+TYPED_TEST(RandomEngine, philoxRandomEnginePeriod) {
+    testRandomEnginePeriod<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
+
+TYPED_TEST(RandomEngine, threefryRandomEnginePeriod) {
+    testRandomEnginePeriod<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
+}
+
+TYPED_TEST(RandomEngine, mersenneRandomEnginePeriod) {
+    testRandomEnginePeriod<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
+}
+
+template<typename T>
+double chi2_statistic(array input, array expected, bool print = false) {
+    expected *= sum<T>(input) / sum<T>(expected);
+    array diff = input - expected;
+
+    double chi2 = sum<T>((diff * diff) / expected);
+    if (print) {
+        array legend = af::seq(input.elements());
+        legend -= (input.elements() / 2.);
+        legend *= (14. / input.elements());
+
+        af_print(
+            join(1, legend, expected.as(f32), input.as(f32), diff.as(f32)));
+    }
+
+    return chi2;
+}
+
+template<>
+double chi2_statistic<half_float::half>(array input, array expected,
+                                        bool print) {
+    expected *= convert<float>(sum<float>(input)) /
+                convert<float>(sum<float>(expected));
+    array diff  = input - expected;
+    double chi2 = convert<float>(sum<float>((diff * diff) / expected));
+    return chi2;
+}
+
+template<typename T>
+void testRandomEngineUniformChi2(randomEngineType type) {
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    int elem  = 256 * 1024 * 1024;
+    int steps = 256;
+    int bins  = 100;
+
+    array total_hist = constant(0.0, bins, f32);
+    array expected   = constant(1.0 / bins, bins, f32);
+
+    randomEngine r(type, 0);
+
+    // R> qchisq(c(5e-6, 1 - 5e-6), 99)
+    // [1]  48.68125 173.87456
+    float lower(48.68125);
+    float upper(173.87456);
+
+    bool prev_step  = true;
+    bool prev_total = true;
+    for (int i = 0; i < steps; ++i) {
+        array rn_numbers = randu(elem, ty, r);
+        array step_hist  = histogram(rn_numbers, bins, 0.0, 1.0);
+        step_hist        = step_hist.as(f32);
+        float step_chi2  = chi2_statistic<float>(step_hist, expected);
+        if (!prev_step) {
+            EXPECT_GT(step_chi2, lower) << "at step: " << i;
+            EXPECT_LT(step_chi2, upper) << "at step: " << i;
+        }
+        prev_step = step_chi2 > lower && step_chi2 < upper;
+
+        total_hist += step_hist;
+        float total_chi2 = chi2_statistic<float>(total_hist, expected);
+        if (!prev_total) {
+            EXPECT_GT(total_chi2, lower) << "at step: " << i;
+            EXPECT_LT(total_chi2, upper) << "at step: " << i;
+        }
+        prev_total = total_chi2 > lower && total_chi2 < upper;
+    }
+}
+
+TYPED_TEST(RandomEngine, philoxRandomEngineUniformChi2) {
+    testRandomEngineUniformChi2<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
+
+TYPED_TEST(RandomEngine, threefryRandomEngineUniformChi2) {
+    testRandomEngineUniformChi2<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
+}
+
+TYPED_TEST(RandomEngine, mersenneRandomEngineUniformChi2) {
+    testRandomEngineUniformChi2<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
+}
+
+// Standard normal CDF; should be used only for x <= 5 (roughly)
+array cnd(array x) { return 0.5 * erfc(-x * sqrt(0.5)); }
+
+template<typename T>
+bool testRandomEngineNormalChi2(randomEngineType type)
+
+{
+    af::dtype ty = (af::dtype)af::dtype_traits<T>::af_type;
+
+    int elem  = 256 * 1024 * 1024;
+    int steps = 64;  // 256 * 32;
+    int bins  = 100;
+
+    T lower_edge(-7.0);
+    T upper_edge(7.0);
+
+    array total_hist = af::constant(0.0, 2 * bins, f32);
+    array edges      = af::seq(bins + 1) / bins * lower_edge;
+    array expected   = -af::diff1(cnd(edges));
+
+    expected =
+        af::join(0, expected(af::seq(bins - 1, 0, -1)), expected).as(f32);
+
+    af::randomEngine r(type, 0);
+
+    // NOTE(@rstub): In the chi^2 test one computes the test statistic and
+    // compares the value with the chi^2 distribution with appropriate number of
+    // degrees of freedom. For the uniform distribution one has "number of bins
+    // minus 1" degrees of freedom. For the normal distribution it is "number of
+    // bins minus 3", since there are two parameters mu and sigma. Here I used
+    // the qchisq() function from R to compute "suitable" values from the chi^2
+    // distribution.
+    //
+    // R> qchisq(c(5e-6, 1 - 5e-6), 197)
+    // [1] 121.3197 297.2989
+    float lower(121.3197);
+    float upper(297.2989);
+
+    bool prev_step  = true;
+    bool prev_total = true;
+
+    af::setSeed(0x76fa214467690e3c);
+
+    // std::cout << std::setw(4) << "step" << std::setw(7) << "chi2_i"
+    //           << std::setw(7) << "chi2_t" << std::setprecision(2) <<
+    //           std::fixed
+    //           << std::endl;
+
+    for (int i = 0; i < steps; ++i) {
+        array rn_numbers = randn(elem, ty, r);
+        array step_hist =
+            af::histogram(rn_numbers, 2 * bins, lower_edge, upper_edge);
+        step_hist = step_hist.as(f32);
+
+        float step_chi2 = chi2_statistic<float>(step_hist, expected);
+
+        // if (step_chi2 > 10000) af_print(rn_numbers);
+        // std::cout << std::setprecision(2) << std::fixed << std::setw(4) << i
+        //           << std::setw(9) << step_chi2;
+
+        bool step = step_chi2 > lower && step_chi2 < upper;
+
+        if (!prev_step) {
+            EXPECT_GT(step_chi2, lower) << "at step: " << i;
+            EXPECT_LT(step_chi2, upper) << "at step: " << i;
+            if (step_chi2 < lower || step_chi2 > upper) {
+                bool print = true;
+                chi2_statistic<float>(step_hist, expected, print);
+            }
+        }
+
+        // if (!(step || prev_step)) break;
+
+        prev_step = step;
+        total_hist += step_hist;
+
+        float total_chi2 = chi2_statistic<float>(total_hist, expected);
+
+        // std::cout << std::setw(9) << total_chi2 << std::endl;
+
+        bool total = total_chi2 > lower && total_chi2 < upper;
+        if (!prev_total) {
+            EXPECT_GT(total_chi2, lower) << "at step: " << i;
+            EXPECT_LT(total_chi2, upper) << "at step: " << i;
+            if (total_chi2 < lower || total_chi2 > upper) {
+                bool print = true;
+                chi2_statistic<float>(total_hist, expected, print);
+            }
+        }
+
+        prev_total = total;
+    }
+
+    return true;
+}
+
+TYPED_TEST(RandomEngine, philoxRandomEngineNormalChi2) {
+    testRandomEngineNormalChi2<TypeParam>(AF_RANDOM_ENGINE_PHILOX_4X32_10);
+}
+
+TYPED_TEST(RandomEngine, threefryRandomEngineNormalChi2) {
+    testRandomEngineNormalChi2<TypeParam>(AF_RANDOM_ENGINE_THREEFRY_2X32_16);
+}
+
+TYPED_TEST(RandomEngine, DISABLED_mersenneRandomEngineNormalChi2) {
+    testRandomEngineNormalChi2<TypeParam>(AF_RANDOM_ENGINE_MERSENNE_GP11213);
+}
diff --git a/test/rotate.cpp b/test/rotate.cpp
index 3f626a32cd..986398f88f 100644
--- a/test/rotate.cpp
+++ b/test/rotate.cpp
@@ -7,63 +7,70 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Rotate : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Rotate : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, intl, char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, intl, char, schar,
+                         short>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Rotate, TestTypes);
+TYPED_TEST_SUITE(Rotate, TestTypes);
 
 #define PI 3.1415926535897931f
 
 template<typename T>
-void rotateTest(string pTestFile, const unsigned resultIdx, const float angle, const bool crop, const bool recenter, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void rotateTest(string pTestFile, const unsigned resultIdx, const float angle,
+                const bool crop) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
+    dim4 dims = numDims[0];
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     float theta = angle * PI / 180.0f;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_rotate(&outArray, inArray, theta, crop, AF_INTERP_NEAREST));
+    ASSERT_SUCCESS(
+        af_rotate(&outArray, inArray, theta, crop, AF_INTERP_NEAREST));
 
     // Get result
     T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
@@ -76,101 +83,98 @@ void rotateTest(string pTestFile, const unsigned resultIdx, const float angle, c
     // We expect 99.99% values to be same between the CPU/GPU versions and
     // ASSERT_EQ (in comments below) to pass for CUDA & OpenCL backends
     size_t fail_count = 0;
-    for(size_t i = 0; i < nElems; i++) {
-        if(std::abs((tests[resultIdx][i] - (T)outData[i])) > 0.001)
-            fail_count++;
+    for (size_t i = 0; i < nElems; i++) {
+        if (abs((tests[resultIdx][i] - (T)outData[i])) > 0.001) fail_count++;
     }
     ASSERT_EQ(true, ((fail_count / (float)nElems) < 0.005));
 
-    //for (size_t elIter = 0; elIter < nElems; ++elIter) {
-    //    ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
+    // for (size_t elIter = 0; elIter < nElems; ++elIter) {
+    //    ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " <<
+    //    elIter << endl;
     //}
 
-
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define ROTATE_INIT(desc, file, resultIdx, angle, crop, recenter)                               \
-    TYPED_TEST(Rotate, desc)                                                                    \
-    {                                                                                           \
-        rotateTest<TypeParam>(string(TEST_DIR"/rotate/"#file".test"), resultIdx, angle, crop, recenter);\
+#define ROTATE_INIT(desc, file, resultIdx, angle, crop)                  \
+    TYPED_TEST(Rotate, desc) {                                           \
+        rotateTest<TypeParam>(string(TEST_DIR "/rotate/" #file ".test"), \
+                              resultIdx, angle, crop);                   \
     }
 
-    ROTATE_INIT(Square180NoCropRecenter     , rotate1,  0, 180, false, true);
-    ROTATE_INIT(Square180CropRecenter       , rotate1,  1, 180, true , true);
-    ROTATE_INIT(Square90NoCropRecenter      , rotate1,  2, 90 , false, true);
-    ROTATE_INIT(Square90CropRecenter        , rotate1,  3, 90 , true , true);
-    ROTATE_INIT(Square45NoCropRecenter      , rotate1,  4, 45 , false, true);
-    ROTATE_INIT(Square45CropRecenter        , rotate1,  5, 45 , true , true);
-    ROTATE_INIT(Squarem45NoCropRecenter     , rotate1,  6,-45 , false, true);
-    ROTATE_INIT(Squarem45CropRecenter       , rotate1,  7,-45 , true , true);
-    ROTATE_INIT(Square60NoCropRecenter      , rotate1,  8, 60 , false, true);
-    ROTATE_INIT(Square60CropRecenter        , rotate1,  9, 60 , true , true);
-    ROTATE_INIT(Square30NoCropRecenter      , rotate1, 10, 30 , false, true);
-    ROTATE_INIT(Square30CropRecenter        , rotate1, 11, 30 , true , true);
-    ROTATE_INIT(Square15NoCropRecenter      , rotate1, 12, 15 , false, true);
-    ROTATE_INIT(Square15CropRecenter        , rotate1, 13, 15 , true , true);
-    ROTATE_INIT(Square10NoCropRecenter      , rotate1, 14, 10 , false, true);
-    ROTATE_INIT(Square10CropRecenter        , rotate1, 15, 10 , true , true);
-    ROTATE_INIT(Square01NoCropRecenter      , rotate1, 16,  1 , false, true);
-    ROTATE_INIT(Square01CropRecenter        , rotate1, 17,  1 , true , true);
-    ROTATE_INIT(Square360NoCropRecenter     , rotate1, 18, 360, false, true);
-    ROTATE_INIT(Square360CropRecenter       , rotate1, 19, 360, true , true);
-    ROTATE_INIT(Squarem180NoCropRecenter    , rotate1, 20,-180, false, true);
-    ROTATE_INIT(Squarem180CropRecenter      , rotate1, 21,-180, false, true);
-    ROTATE_INIT(Square00NoCropRecenter      , rotate1, 22,  0 , false, true);
-    ROTATE_INIT(Square00CropRecenter        , rotate1, 23,  0 , true , true);
-
-    ROTATE_INIT(Rectangle180NoCropRecenter     , rotate2,  0, 180, false, true);
-    ROTATE_INIT(Rectangle180CropRecenter       , rotate2,  1, 180, true , true);
-    ROTATE_INIT(Rectangle90NoCropRecenter      , rotate2,  2, 90 , false, true);
-    ROTATE_INIT(Rectangle90CropRecenter        , rotate2,  3, 90 , true , true);
-    ROTATE_INIT(Rectangle45NoCropRecenter      , rotate2,  4, 45 , false, true);
-    ROTATE_INIT(Rectangle45CropRecenter        , rotate2,  5, 45 , true , true);
-    ROTATE_INIT(Rectanglem45NoCropRecenter     , rotate2,  6,-45 , false, true);
-    ROTATE_INIT(Rectanglem45CropRecenter       , rotate2,  7,-45 , true , true);
-    ROTATE_INIT(Rectangle60NoCropRecenter      , rotate2,  8, 60 , false, true);
-    ROTATE_INIT(Rectangle60CropRecenter        , rotate2,  9, 60 , true , true);
-    ROTATE_INIT(Rectangle30NoCropRecenter      , rotate2, 10, 30 , false, true);
-    ROTATE_INIT(Rectangle30CropRecenter        , rotate2, 11, 30 , true , true);
-    ROTATE_INIT(Rectangle15NoCropRecenter      , rotate2, 12, 15 , false, true);
-    ROTATE_INIT(Rectangle15CropRecenter        , rotate2, 13, 15 , true , true);
-    ROTATE_INIT(Rectangle10NoCropRecenter      , rotate2, 14, 10 , false, true);
-    ROTATE_INIT(Rectangle10CropRecenter        , rotate2, 15, 10 , true , true);
-    ROTATE_INIT(Rectangle01NoCropRecenter      , rotate2, 16,  1 , false, true);
-    ROTATE_INIT(Rectangle01CropRecenter        , rotate2, 17,  1 , true , true);
-    ROTATE_INIT(Rectangle360NoCropRecenter     , rotate2, 18, 360, false, true);
-    ROTATE_INIT(Rectangle360CropRecenter       , rotate2, 19, 360, true , true);
-    ROTATE_INIT(Rectanglem180NoCropRecenter    , rotate2, 20,-180, false, true);
-    ROTATE_INIT(Rectanglem180CropRecenter      , rotate2, 21,-180, false, true);
-    ROTATE_INIT(Rectangle00NoCropRecenter      , rotate2, 22,  0 , false, true);
-    ROTATE_INIT(Rectangle00CropRecenter        , rotate2, 23,  0 , true , true);
+ROTATE_INIT(Square180NoCropRecenter, rotate1, 0, 180, false);
+ROTATE_INIT(Square180CropRecenter, rotate1, 1, 180, true);
+ROTATE_INIT(Square90NoCropRecenter, rotate1, 2, 90, false);
+ROTATE_INIT(Square90CropRecenter, rotate1, 3, 90, true);
+ROTATE_INIT(Square45NoCropRecenter, rotate1, 4, 45, false);
+ROTATE_INIT(Square45CropRecenter, rotate1, 5, 45, true);
+ROTATE_INIT(Squarem45NoCropRecenter, rotate1, 6, -45, false);
+ROTATE_INIT(Squarem45CropRecenter, rotate1, 7, -45, true);
+ROTATE_INIT(Square60NoCropRecenter, rotate1, 8, 60, false);
+ROTATE_INIT(Square60CropRecenter, rotate1, 9, 60, true);
+ROTATE_INIT(Square30NoCropRecenter, rotate1, 10, 30, false);
+ROTATE_INIT(Square30CropRecenter, rotate1, 11, 30, true);
+ROTATE_INIT(Square15NoCropRecenter, rotate1, 12, 15, false);
+ROTATE_INIT(Square15CropRecenter, rotate1, 13, 15, true);
+ROTATE_INIT(Square10NoCropRecenter, rotate1, 14, 10, false);
+ROTATE_INIT(Square10CropRecenter, rotate1, 15, 10, true);
+ROTATE_INIT(Square01NoCropRecenter, rotate1, 16, 1, false);
+ROTATE_INIT(Square01CropRecenter, rotate1, 17, 1, true);
+ROTATE_INIT(Square360NoCropRecenter, rotate1, 18, 360, false);
+ROTATE_INIT(Square360CropRecenter, rotate1, 19, 360, true);
+ROTATE_INIT(Squarem180NoCropRecenter, rotate1, 20, -180, false);
+ROTATE_INIT(Squarem180CropRecenter, rotate1, 21, -180, false);
+ROTATE_INIT(Square00NoCropRecenter, rotate1, 22, 0, false);
+ROTATE_INIT(Square00CropRecenter, rotate1, 23, 0, true);
+
+ROTATE_INIT(Rectangle180NoCropRecenter, rotate2, 0, 180, false);
+ROTATE_INIT(Rectangle180CropRecenter, rotate2, 1, 180, true);
+ROTATE_INIT(Rectangle90NoCropRecenter, rotate2, 2, 90, false);
+ROTATE_INIT(Rectangle90CropRecenter, rotate2, 3, 90, true);
+ROTATE_INIT(Rectangle45NoCropRecenter, rotate2, 4, 45, false);
+ROTATE_INIT(Rectangle45CropRecenter, rotate2, 5, 45, true);
+ROTATE_INIT(Rectanglem45NoCropRecenter, rotate2, 6, -45, false);
+ROTATE_INIT(Rectanglem45CropRecenter, rotate2, 7, -45, true);
+ROTATE_INIT(Rectangle60NoCropRecenter, rotate2, 8, 60, false);
+ROTATE_INIT(Rectangle60CropRecenter, rotate2, 9, 60, true);
+ROTATE_INIT(Rectangle30NoCropRecenter, rotate2, 10, 30, false);
+ROTATE_INIT(Rectangle30CropRecenter, rotate2, 11, 30, true);
+ROTATE_INIT(Rectangle15NoCropRecenter, rotate2, 12, 15, false);
+ROTATE_INIT(Rectangle15CropRecenter, rotate2, 13, 15, true);
+ROTATE_INIT(Rectangle10NoCropRecenter, rotate2, 14, 10, false);
+ROTATE_INIT(Rectangle10CropRecenter, rotate2, 15, 10, true);
+ROTATE_INIT(Rectangle01NoCropRecenter, rotate2, 16, 1, false);
+ROTATE_INIT(Rectangle01CropRecenter, rotate2, 17, 1, true);
+ROTATE_INIT(Rectangle360NoCropRecenter, rotate2, 18, 360, false);
+ROTATE_INIT(Rectangle360CropRecenter, rotate2, 19, 360, true);
+ROTATE_INIT(Rectanglem180NoCropRecenter, rotate2, 20, -180, false);
+ROTATE_INIT(Rectanglem180CropRecenter, rotate2, 21, -180, false);
+ROTATE_INIT(Rectangle00NoCropRecenter, rotate2, 22, 0, false);
+ROTATE_INIT(Rectangle00CropRecenter, rotate2, 23, 0, true);
 
 ////////////////////////////////// CPP //////////////////////////////////////
 //
-TEST(Rotate, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(Rotate, CPP) {
     const unsigned resultIdx = 0;
-    const float angle = 180;
-    const bool crop = false;
+    const float angle        = 180;
+    const bool crop          = false;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/rotate/rotate1.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(string(TEST_DIR "/rotate/rotate1.test"),
+                                   numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
+    dim4 dims   = numDims[0];
     float theta = angle * PI / 180.0f;
 
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::rotate(input, theta, crop, AF_INTERP_NEAREST);
+    array input(dims, &(in[0].front()));
+    array output = rotate(input, theta, crop, AF_INTERP_NEAREST);
 
     // Get result
     float* outData = new float[tests[resultIdx].size()];
@@ -187,9 +191,8 @@ TEST(Rotate, CPP)
     // We expect 99.99% values to be same between the CPU/GPU versions and
     // ASSERT_EQ (in comments below) to pass for CUDA & OpenCL backends
     size_t fail_count = 0;
-    for(size_t i = 0; i < nElems; i++) {
-        if(fabs(tests[resultIdx][i] - outData[i]) > 0.0001)
-            fail_count++;
+    for (size_t i = 0; i < nElems; i++) {
+        if (fabs(tests[resultIdx][i] - outData[i]) > 0.0001) fail_count++;
     }
     ASSERT_EQ(true, ((fail_count / (float)nElems) < 0.01));
 
diff --git a/test/rotate_linear.cpp b/test/rotate_linear.cpp
index 6242fb3c09..1324a59a77 100644
--- a/test/rotate_linear.cpp
+++ b/test/rotate_linear.cpp
@@ -7,74 +7,89 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Rotate : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class RotateLinear : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, intl, char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, intl, schar, char,
+                         short>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Rotate, TestTypes);
+TYPED_TEST_SUITE(RotateLinear, TestTypes);
 
 #define PI 3.1415926535897931f
 
 template<typename T>
-void rotateTest(string pTestFile, const unsigned resultIdx, const float angle, const bool crop, const bool recenter, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void rotateTest(string pTestFile, const unsigned resultIdx, const float angle,
+                const bool crop, bool isSubRef = false,
+                const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T, T, float>(pTestFile,numDims,in,tests);
+    if (is_same_type<T, schar>::value && (int)angle % 90 != 0) {
+        GTEST_SKIP() << "Incompatible test data for s8";
+    }
 
-    af::dim4 dims = numDims[0];
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, float>(pTestFile, numDims, in, tests);
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    dim4 dims = numDims[0];
+
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     float theta = angle * PI / 180.0f;
 
     if (isSubRef) {
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       dims.ndims(), dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_rotate(&outArray, inArray, theta, crop, AF_INTERP_BILINEAR));
+    ASSERT_SUCCESS(
+        af_rotate(&outArray, inArray, theta, crop, AF_INTERP_BILINEAR));
 
     // Get result
     T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
@@ -87,102 +102,101 @@ void rotateTest(string pTestFile, const unsigned resultIdx, const float angle, c
     // We expect 99.99% values to be same between the CPU/GPU versions and
     // ASSERT_EQ (in comments below) to pass for CUDA & OpenCL backends
     size_t fail_count = 0;
-    for(size_t i = 0; i < nElems; i++) {
-        if(std::abs((tests[resultIdx][i] - (T)outData[i])) > 0.001) {
+    for (size_t i = 0; i < nElems; i++) {
+        if (abs((tests[resultIdx][i] - (T)outData[i])) > 0.001) {
             fail_count++;
         }
     }
-    ASSERT_EQ(true, ((fail_count / (float)nElems) < 0.02)) << "where count = " << fail_count << std::endl;
+    ASSERT_EQ(true, ((fail_count / (float)nElems) < 0.02))
+        << "where count = " << fail_count << endl;
 
-    //for (size_t elIter = 0; elIter < nElems; ++elIter) {
-    //    ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
+    // for (size_t elIter = 0; elIter < nElems; ++elIter) {
+    //    ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " <<
+    //    elIter << endl;
     //}
 
-
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define ROTATE_INIT(desc, file, resultIdx, angle, crop, recenter)                               \
-    TYPED_TEST(Rotate, desc)                                                                    \
-    {                                                                                           \
-        rotateTest<TypeParam>(string(TEST_DIR"/rotate/"#file".test"), resultIdx, angle, crop, recenter);\
+#define ROTATE_INIT(desc, file, resultIdx, angle, crop)                  \
+    TYPED_TEST(RotateLinear, desc) {                                     \
+        rotateTest<TypeParam>(string(TEST_DIR "/rotate/" #file ".test"), \
+                              resultIdx, angle, crop);                   \
     }
 
-    ROTATE_INIT(Square180NoCropRecenter     , rotatelinear1,  0, 180, false, true);
-    ROTATE_INIT(Square180CropRecenter       , rotatelinear1,  1, 180, true , true);
-    ROTATE_INIT(Square90NoCropRecenter      , rotatelinear1,  2, 90 , false, true);
-    ROTATE_INIT(Square90CropRecenter        , rotatelinear1,  3, 90 , true , true);
-    ROTATE_INIT(Square45NoCropRecenter      , rotatelinear1,  4, 45 , false, true);
-    ROTATE_INIT(Square45CropRecenter        , rotatelinear1,  5, 45 , true , true);
-    ROTATE_INIT(Squarem45NoCropRecenter     , rotatelinear1,  6,-45 , false, true);
-    ROTATE_INIT(Squarem45CropRecenter       , rotatelinear1,  7,-45 , true , true);
-    ROTATE_INIT(Square60NoCropRecenter      , rotatelinear1,  8, 60 , false, true);
-    ROTATE_INIT(Square60CropRecenter        , rotatelinear1,  9, 60 , true , true);
-    ROTATE_INIT(Square30NoCropRecenter      , rotatelinear1, 10, 30 , false, true);
-    ROTATE_INIT(Square30CropRecenter        , rotatelinear1, 11, 30 , true , true);
-    ROTATE_INIT(Square15NoCropRecenter      , rotatelinear1, 12, 15 , false, true);
-    ROTATE_INIT(Square15CropRecenter        , rotatelinear1, 13, 15 , true , true);
-    ROTATE_INIT(Square10NoCropRecenter      , rotatelinear1, 14, 10 , false, true);
-    ROTATE_INIT(Square10CropRecenter        , rotatelinear1, 15, 10 , true , true);
-    ROTATE_INIT(Square01NoCropRecenter      , rotatelinear1, 16,  1 , false, true);
-    ROTATE_INIT(Square01CropRecenter        , rotatelinear1, 17,  1 , true , true);
-    ROTATE_INIT(Square360NoCropRecenter     , rotatelinear1, 18, 360, false, true);
-    ROTATE_INIT(Square360CropRecenter       , rotatelinear1, 19, 360, true , true);
-    ROTATE_INIT(Squarem180NoCropRecenter    , rotatelinear1, 20,-180, false, true);
-    ROTATE_INIT(Squarem180CropRecenter      , rotatelinear1, 21,-180, false, true);
-    ROTATE_INIT(Square00NoCropRecenter      , rotatelinear1, 22,  0 , false, true);
-    ROTATE_INIT(Square00CropRecenter        , rotatelinear1, 23,  0 , true , true);
-
-    ROTATE_INIT(Rectangle180NoCropRecenter     , rotatelinear2,  0, 180, false, true);
-    ROTATE_INIT(Rectangle180CropRecenter       , rotatelinear2,  1, 180, true , true);
-    ROTATE_INIT(Rectangle90NoCropRecenter      , rotatelinear2,  2, 90 , false, true);
-    ROTATE_INIT(Rectangle90CropRecenter        , rotatelinear2,  3, 90 , true , true);
-    ROTATE_INIT(Rectangle45NoCropRecenter      , rotatelinear2,  4, 45 , false, true);
-    ROTATE_INIT(Rectangle45CropRecenter        , rotatelinear2,  5, 45 , true , true);
-    ROTATE_INIT(Rectanglem45NoCropRecenter     , rotatelinear2,  6,-45 , false, true);
-    ROTATE_INIT(Rectanglem45CropRecenter       , rotatelinear2,  7,-45 , true , true);
-    ROTATE_INIT(Rectangle60NoCropRecenter      , rotatelinear2,  8, 60 , false, true);
-    ROTATE_INIT(Rectangle60CropRecenter        , rotatelinear2,  9, 60 , true , true);
-    ROTATE_INIT(Rectangle30NoCropRecenter      , rotatelinear2, 10, 30 , false, true);
-    ROTATE_INIT(Rectangle30CropRecenter        , rotatelinear2, 11, 30 , true , true);
-    ROTATE_INIT(Rectangle15NoCropRecenter      , rotatelinear2, 12, 15 , false, true);
-    ROTATE_INIT(Rectangle15CropRecenter        , rotatelinear2, 13, 15 , true , true);
-    ROTATE_INIT(Rectangle10NoCropRecenter      , rotatelinear2, 14, 10 , false, true);
-    ROTATE_INIT(Rectangle10CropRecenter        , rotatelinear2, 15, 10 , true , true);
-    ROTATE_INIT(Rectangle01NoCropRecenter      , rotatelinear2, 16,  1 , false, true);
-    ROTATE_INIT(Rectangle01CropRecenter        , rotatelinear2, 17,  1 , true , true);
-    ROTATE_INIT(Rectangle360NoCropRecenter     , rotatelinear2, 18, 360, false, true);
-    ROTATE_INIT(Rectangle360CropRecenter       , rotatelinear2, 19, 360, true , true);
-    ROTATE_INIT(Rectanglem180NoCropRecenter    , rotatelinear2, 20,-180, false, true);
-    ROTATE_INIT(Rectanglem180CropRecenter      , rotatelinear2, 21,-180, false, true);
-    ROTATE_INIT(Rectangle00NoCropRecenter      , rotatelinear2, 22,  0 , false, true);
-    ROTATE_INIT(Rectangle00CropRecenter        , rotatelinear2, 23,  0 , true , true);
+ROTATE_INIT(Square180NoCropRecenter, rotatelinear1, 0, 180, false);
+ROTATE_INIT(Square180CropRecenter, rotatelinear1, 1, 180, true);
+ROTATE_INIT(Square90NoCropRecenter, rotatelinear1, 2, 90, false);
+ROTATE_INIT(Square90CropRecenter, rotatelinear1, 3, 90, true);
+ROTATE_INIT(Square45NoCropRecenter, rotatelinear1, 4, 45, false);
+ROTATE_INIT(Square45CropRecenter, rotatelinear1, 5, 45, true);
+ROTATE_INIT(Squarem45NoCropRecenter, rotatelinear1, 6, -45, false);
+ROTATE_INIT(Squarem45CropRecenter, rotatelinear1, 7, -45, true);
+ROTATE_INIT(Square60NoCropRecenter, rotatelinear1, 8, 60, false);
+ROTATE_INIT(Square60CropRecenter, rotatelinear1, 9, 60, true);
+ROTATE_INIT(Square30NoCropRecenter, rotatelinear1, 10, 30, false);
+ROTATE_INIT(Square30CropRecenter, rotatelinear1, 11, 30, true);
+ROTATE_INIT(Square15NoCropRecenter, rotatelinear1, 12, 15, false);
+ROTATE_INIT(Square15CropRecenter, rotatelinear1, 13, 15, true);
+ROTATE_INIT(Square10NoCropRecenter, rotatelinear1, 14, 10, false);
+ROTATE_INIT(Square10CropRecenter, rotatelinear1, 15, 10, true);
+ROTATE_INIT(Square01NoCropRecenter, rotatelinear1, 16, 1, false);
+ROTATE_INIT(Square01CropRecenter, rotatelinear1, 17, 1, true);
+ROTATE_INIT(Square360NoCropRecenter, rotatelinear1, 18, 360, false);
+ROTATE_INIT(Square360CropRecenter, rotatelinear1, 19, 360, true);
+ROTATE_INIT(Squarem180NoCropRecenter, rotatelinear1, 20, -180, false);
+ROTATE_INIT(Squarem180CropRecenter, rotatelinear1, 21, -180, false);
+ROTATE_INIT(Square00NoCropRecenter, rotatelinear1, 22, 0, false);
+ROTATE_INIT(Square00CropRecenter, rotatelinear1, 23, 0, true);
+
+ROTATE_INIT(Rectangle180NoCropRecenter, rotatelinear2, 0, 180, false);
+ROTATE_INIT(Rectangle180CropRecenter, rotatelinear2, 1, 180, true);
+ROTATE_INIT(Rectangle90NoCropRecenter, rotatelinear2, 2, 90, false);
+ROTATE_INIT(Rectangle90CropRecenter, rotatelinear2, 3, 90, true);
+ROTATE_INIT(Rectangle45NoCropRecenter, rotatelinear2, 4, 45, false);
+ROTATE_INIT(Rectangle45CropRecenter, rotatelinear2, 5, 45, true);
+ROTATE_INIT(Rectanglem45NoCropRecenter, rotatelinear2, 6, -45, false);
+ROTATE_INIT(Rectanglem45CropRecenter, rotatelinear2, 7, -45, true);
+ROTATE_INIT(Rectangle60NoCropRecenter, rotatelinear2, 8, 60, false);
+ROTATE_INIT(Rectangle60CropRecenter, rotatelinear2, 9, 60, true);
+ROTATE_INIT(Rectangle30NoCropRecenter, rotatelinear2, 10, 30, false);
+ROTATE_INIT(Rectangle30CropRecenter, rotatelinear2, 11, 30, true);
+ROTATE_INIT(Rectangle15NoCropRecenter, rotatelinear2, 12, 15, false);
+ROTATE_INIT(Rectangle15CropRecenter, rotatelinear2, 13, 15, true);
+ROTATE_INIT(Rectangle10NoCropRecenter, rotatelinear2, 14, 10, false);
+ROTATE_INIT(Rectangle10CropRecenter, rotatelinear2, 15, 10, true);
+ROTATE_INIT(Rectangle01NoCropRecenter, rotatelinear2, 16, 1, false);
+ROTATE_INIT(Rectangle01CropRecenter, rotatelinear2, 17, 1, true);
+ROTATE_INIT(Rectangle360NoCropRecenter, rotatelinear2, 18, 360, false);
+ROTATE_INIT(Rectangle360CropRecenter, rotatelinear2, 19, 360, true);
+ROTATE_INIT(Rectanglem180NoCropRecenter, rotatelinear2, 20, -180, false);
+ROTATE_INIT(Rectanglem180CropRecenter, rotatelinear2, 21, -180, true);
+ROTATE_INIT(Rectangle00NoCropRecenter, rotatelinear2, 22, 0, false);
+ROTATE_INIT(Rectangle00CropRecenter, rotatelinear2, 23, 0, true);
 
 ////////////////////////////////// CPP //////////////////////////////////////
 
-TEST(Rotate, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
+TEST(RotateLinear, CPP) {
     const unsigned resultIdx = 0;
-    const float angle = 180;
-    const bool crop = false;
+    const float angle        = 180;
+    const bool crop          = false;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> >   in;
-    vector<vector<float> >   tests;
-    readTests<float, float, float>(string(TEST_DIR"/rotate/rotatelinear1.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, float>(
+        string(TEST_DIR "/rotate/rotatelinear1.test"), numDims, in, tests);
 
-    af::dim4 dims = numDims[0];
+    dim4 dims   = numDims[0];
     float theta = angle * PI / 180.0f;
 
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::rotate(input, theta, crop, AF_INTERP_BILINEAR);
+    array input(dims, &(in[0].front()));
+    array output = rotate(input, theta, crop, AF_INTERP_BILINEAR);
 
     // Get result
     float* outData = new float[tests[resultIdx].size()];
@@ -199,9 +213,8 @@ TEST(Rotate, CPP)
     // We expect 99.99% values to be same between the CPU/GPU versions and
     // ASSERT_EQ (in comments below) to pass for CUDA & OpenCL backends
     size_t fail_count = 0;
-    for(size_t i = 0; i < nElems; i++) {
-        if(fabs(tests[resultIdx][i] - outData[i]) > 0.0001)
-            fail_count++;
+    for (size_t i = 0; i < nElems; i++) {
+        if (fabs(tests[resultIdx][i] - outData[i]) > 0.0001) fail_count++;
     }
     ASSERT_EQ(true, ((fail_count / (float)nElems) < 0.01));
 
diff --git a/test/sat.cpp b/test/sat.cpp
new file mode 100644
index 0000000000..f87b356b85
--- /dev/null
+++ b/test/sat.cpp
@@ -0,0 +1,51 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <string>
+#include <vector>
+
+using af::accum;
+using af::allTrue;
+using af::array;
+using af::dtype_traits;
+using af::randu;
+using af::sat;
+using std::string;
+using std::vector;
+
+template<typename T>
+class SAT : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, uintl,
+                         intl, short, ushort>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(SAT, TestTypes);
+
+TYPED_TEST(SAT, IntegralImage) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    array a = randu(530, 671, (af_dtype)dtype_traits<TypeParam>::af_type);
+    array b = accum(a, 0);
+    array c = accum(b, 1);
+
+    array s = sat(a);
+
+    EXPECT_EQ(true, allTrue<float>(c == s));
+}
diff --git a/test/scan.cpp b/test/scan.cpp
index 1a855ca66f..afb488278d 100644
--- a/test/scan.cpp
+++ b/test/scan.cpp
@@ -7,39 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/array.h>
+#include <af/device.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/array.h>
-#include <vector>
+#include <algorithm>
 #include <iostream>
+#include <iterator>
 #include <string>
-#include <testHelpers.hpp>
-#include <af/device.h>
+#include <utility>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::allTrue;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype_traits;
+using af::range;
+using af::scan;
+using af::seq;
+using af::span;
+using af::sum;
+using std::copy;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 typedef af_err (*scanFunc)(af_array *, const af_array, const int);
 
 template<typename Ti, typename To, scanFunc af_scan>
-void scanTest(string pTestFile, int off = 0, bool isSubRef=false, const vector<af_seq> seqv=vector<af_seq>())
-{
-    if (noDoubleTests<Ti>()) return;
+void scanTest(string pTestFile, int off = 0, bool isSubRef = false,
+              const vector<af_seq> seqv = vector<af_seq>()) {
+    SUPPORTED_TYPE_CHECK(Ti);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
+    dim4 dims = numDims[0];
 
-    vector<Ti> in(data[0].begin(), data[0].end());
+    vector<Ti> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(), convert_to<Ti, int>);
 
     af_array inArray   = 0;
     af_array outArray  = 0;
@@ -47,12 +62,16 @@ void scanTest(string pTestFile, int off = 0, bool isSubRef=false, const vector<a
 
     // Get input array
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<Ti>::af_type));
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
-        ASSERT_EQ(AF_SUCCESS, af_release_array(tempArray));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &in.front(), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<Ti>::af_type));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
+        ASSERT_SUCCESS(af_release_array(tempArray));
     } else {
-
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<Ti>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<Ti>::af_type));
     }
 
     // Compare result
@@ -60,107 +79,289 @@ void scanTest(string pTestFile, int off = 0, bool isSubRef=false, const vector<a
         vector<To> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        ASSERT_EQ(AF_SUCCESS, af_scan(&outArray, inArray, d + off));
+        ASSERT_SUCCESS(af_scan(&outArray, inArray, d + off));
 
         // Get result
         To *outData;
         outData = new To[dims.elements()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
         size_t nElems = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                << " for dim " << d +off
-                << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << " for dim " << d + off << endl;
         }
 
         // Delete
         delete[] outData;
-        ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+        ASSERT_SUCCESS(af_release_array(outArray));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
 }
 
-vector<af_seq> init_subs()
-{
-    vector<af_seq> subs;
-    subs.push_back(af_make_seq(2, 6, 1));
-    subs.push_back(af_make_seq(1, 5, 1));
-    subs.push_back(af_make_seq(1, 3, 1));
-    subs.push_back(af_make_seq(1, 2, 1));
-    return subs;
-}
+#define SCAN_TESTS(FN, TAG, Ti, To)                                       \
+    TEST(Scan, Test_##FN##_##TAG) {                                       \
+        scanTest<Ti, To, af_##FN>(string(TEST_DIR "/scan/" #FN ".test")); \
+    }
+
+SCAN_TESTS(accum, float, float, float);
+SCAN_TESTS(accum, double, double, double);
+SCAN_TESTS(accum, int, int, int);
+SCAN_TESTS(accum, cfloat, cfloat, cfloat);
+SCAN_TESTS(accum, cdouble, cdouble, cdouble);
+SCAN_TESTS(accum, unsigned, unsigned, unsigned);
+SCAN_TESTS(accum, intl, intl, intl);
+SCAN_TESTS(accum, uintl, uintl, uintl);
+SCAN_TESTS(accum, schar, schar, int);
+SCAN_TESTS(accum, uchar, uchar, unsigned);
+SCAN_TESTS(accum, short, short, int);
+SCAN_TESTS(accum, ushort, ushort, uint);
 
-#define SCAN_TESTS(FN, TAG, Ti, To)             \
-    TEST(Scan,Test_##FN##_##TAG)                \
-    {                                           \
-        scanTest<Ti, To, af_##FN>(              \
-            string(TEST_DIR"/scan/"#FN".test")  \
-            );                                  \
-    }                                           \
-
-SCAN_TESTS(accum, float   , float     , float     );
-SCAN_TESTS(accum, double  , double    , double    );
-SCAN_TESTS(accum, int     , int       , int       );
-SCAN_TESTS(accum, cfloat  , cfloat , cfloat );
-SCAN_TESTS(accum, cdouble , cdouble, cdouble);
-SCAN_TESTS(accum, unsigned, unsigned  , unsigned  );
-SCAN_TESTS(accum, uchar   , unsigned char, unsigned);
-
-TEST(Scan,Test_Scan_Big0)
-{
-    scanTest<int, int, af_accum>(
-        string(TEST_DIR"/scan/big0.test"),
-        0
-        );
+TEST(Scan, Test_Scan_Big0) {
+    scanTest<int, int, af_accum>(string(TEST_DIR "/scan/big0.test"), 0);
 }
 
-TEST(Scan,Test_Scan_Big1)
-{
-    scanTest<int, int, af_accum>(
-        string(TEST_DIR"/scan/big1.test"),
-        1
-        );
+TEST(Scan, Test_Scan_Big1) {
+    scanTest<int, int, af_accum>(string(TEST_DIR "/scan/big1.test"), 1);
 }
 
 ///////////////////////////////// CPP ////////////////////////////////////
-//
-TEST(Scan, CPP)
-{
-    vector<af::dim4> numDims;
-
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (string(TEST_DIR"/scan/accum.test"),numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+TEST(Accum, CPP) {
+    vector<dim4> numDims;
 
-    vector<float> in(data[0].begin(), data[0].end());
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(string(TEST_DIR "/scan/accum.test"), numDims, data,
+                             tests);
+    dim4 dims = numDims[0];
 
-    if (noDoubleTests<float>()) return;
+    vector<float> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(),
+              convert_to<float, int>);
 
-    af::array input(dims, &(in.front()));
+    array input(dims, &(in.front()));
 
     // Compare result
     for (int d = 0; d < (int)tests.size(); ++d) {
         vector<float> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        af::array output = af::accum(input, d);
+        array output = accum(input, d);
 
         // Get result
         float *outData;
         outData = new float[dims.elements()];
-        output.host((void*)outData);
+        output.host((void *)outData);
 
         size_t nElems = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                << " for dim " << d
-                << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << " for dim " << d << endl;
         }
 
         // Delete
         delete[] outData;
     }
 }
+
+TEST(Accum, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    // first dimension kernel tests
+    array input                           = constant(0, 2, largeDim, 2, 2);
+    input(span, seq(0, 9999), span, span) = 1;
+
+    array gold_first                           = constant(0, 2, largeDim, 2, 2);
+    gold_first(span, seq(0, 9999), span, span) = range(2, 10000, 2, 2) + 1;
+
+    array output_first = accum(input, 0);
+    ASSERT_ARRAYS_EQ(gold_first, output_first);
+
+    input                                 = constant(0, 2, 2, 2, largeDim);
+    input(span, span, span, seq(0, 9999)) = 1;
+
+    gold_first                                 = constant(0, 2, 2, 2, largeDim);
+    gold_first(span, span, span, seq(0, 9999)) = range(2, 2, 2, 10000) + 1;
+
+    output_first = accum(input, 0);
+    ASSERT_ARRAYS_EQ(gold_first, output_first);
+
+    // other dimension kernel tests
+    input                                 = constant(0, 2, largeDim, 2, 2);
+    input(span, seq(0, 9999), span, span) = 1;
+
+    array gold_dim = constant(10000, 2, largeDim, 2, 2);
+    gold_dim(span, seq(0, 9999), span, span) =
+        range(dim4(2, 10000, 2, 2), 1) + 1;
+
+    array output_dim = accum(input, 1);
+    ASSERT_ARRAYS_EQ(gold_dim, output_dim);
+
+    input                                 = constant(0, 2, 2, 2, largeDim);
+    input(span, span, span, seq(0, 9999)) = 1;
+
+    gold_dim = constant(0, 2, 2, 2, largeDim);
+    gold_dim(span, span, span, seq(0, 9999)) =
+        range(dim4(2, 2, 2, 10000), 1) + 1;
+
+    output_dim = accum(input, 1);
+    ASSERT_ARRAYS_EQ(gold_dim, output_dim);
+}
+
+TEST(Accum, DocSnippet) {
+    //! [ex_accum_1D]
+    float hA[] = {0, 1, 2, 3, 4};
+    array A(5, hA);
+    //  0.
+    //  1.
+    //  2.
+    //  3.
+    //  4.
+
+    array accumA = accum(A);
+    //  0.
+    //  1.
+    //  3.
+    //  6.
+    //  10.
+    //! [ex_accum_1D]
+
+    float h_gold_accumA[] = {0, 1, 3, 6, 10};
+    array gold_accumA(5, h_gold_accumA);
+    ASSERT_ARRAYS_EQ(gold_accumA, accumA);
+
+    //! [ex_accum_2D]
+    float hB[] = {0, 1, 2, 3, 4, 5, 6, 7, 8};
+    array B(3, 3, hB);
+    //  0.     3.     6.
+    //  1.     4.     7.
+    //  2.     5.     8.
+
+    array accumB_dim0 = accum(B);
+    //  0.     3.     6.
+    //  1.     7.     13.
+    //  3.     12.    21.
+
+    array accumB_dim1 = accum(B, 1);
+    //  0.     3.     9.
+    //  1.     5.     12.
+    //  2.     7.     15.
+    //! [ex_accum_2D]
+
+    float h_gold_accumB_dim0[] = {0, 1, 3, 3, 7, 12, 6, 13, 21};
+    array gold_accumB_dim0(3, 3, h_gold_accumB_dim0);
+    ASSERT_ARRAYS_EQ(gold_accumB_dim0, accumB_dim0);
+
+    float h_gold_accumB_dim1[] = {0, 1, 2, 3, 5, 7, 9, 12, 15};
+    array gold_accumB_dim1(3, 3, h_gold_accumB_dim1);
+    ASSERT_ARRAYS_EQ(gold_accumB_dim1, accumB_dim1);
+}
+
+TEST(Scan, ExclusiveSum1D) {
+    const int in_size = 80000;
+    vector<int> h_in(in_size, 1);
+    vector<int> h_gold(in_size, 0);
+    for (size_t i = 1; i < h_gold.size(); ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+
+    array in(in_size, &h_in.front());
+    array out = scan(in, 0, AF_BINARY_ADD, false);
+
+    ASSERT_VEC_ARRAY_EQ(h_gold, dim4(in_size), out);
+}
+
+TEST(Scan, ExclusiveSum2D_Dim0) {
+    const int in_size = 80000 * 2;
+    vector<int> h_in(in_size, 1);
+    vector<int> h_gold(in_size, 0);
+    for (size_t i = 1; i < h_gold.size() / 2; ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+    for (size_t i = h_gold.size() / 2 + 1; i < h_gold.size(); ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+
+    array in(in_size / 2, 2, &h_in.front());
+    array out = scan(in, 0, AF_BINARY_ADD, false);
+    array gold(in_size / 2, 2, &h_gold.front());
+
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+TEST(Scan, ExclusiveSum2D_Dim1) {
+    const int in_size = 80000 * 2;
+    vector<int> h_in(in_size, 1);
+    vector<int> h_gold(in_size, 0);
+    for (size_t i = 1; i < h_gold.size() / 2; ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+    for (size_t i = h_gold.size() / 2 + 1; i < h_gold.size(); ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+
+    array in(2, in_size / 2, &h_in.front());
+    array out = scan(in, 1, AF_BINARY_ADD, false);
+    array gold(in_size / 2, 2, &h_gold.front());
+    gold = gold.T();
+
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+TEST(Scan, ExclusiveSum2D_Dim2) {
+    const int in_size = 80000 * 2;
+    vector<int> h_in(in_size, 1);
+    vector<int> h_gold(in_size, 0);
+    for (size_t i = 1; i < h_gold.size() / 2; ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+    for (size_t i = h_gold.size() / 2 + 1; i < h_gold.size(); ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+
+    array in(1, 2, in_size / 2, &h_in.front());
+    array out = scan(in, 2, AF_BINARY_ADD, false);
+    array gold(in_size / 2, 2, &h_gold.front());
+    gold = af::reorder(gold, 2, 1, 0);
+
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+TEST(Scan, ExclusiveSum2D_Dim3) {
+    const int in_size = 80000 * 2;
+    vector<int> h_in(in_size, 1);
+    vector<int> h_gold(in_size, 0);
+    for (size_t i = 1; i < h_gold.size() / 2; ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+    for (size_t i = h_gold.size() / 2 + 1; i < h_gold.size(); ++i) {
+        h_gold[i] = h_in[i] + h_gold[i - 1];
+    }
+
+    array in(1, 1, 2, in_size / 2, &h_in.front());
+    array out = scan(in, 3, AF_BINARY_ADD, false);
+    array gold(in_size / 2, 2, &h_gold.front());
+    gold = af::reorder(gold, 2, 3, 1, 0);
+
+    ASSERT_ARRAYS_EQ(gold, out);
+}
+
+#define TEST_TEMP_FORMAT(form, dim)                                      \
+    TEST(TEMP_FORMAT, form##_Dim##dim) {                                 \
+        const dim4 dims(2, 2, 2, 2);                                     \
+        const array in(af::moddims(range(dim4(dims.elements())), dims)); \
+        in.eval();                                                       \
+        const array gold = scan(in, dim);                                \
+                                                                         \
+        array out = scan(toTempFormat(form, in), dim);                   \
+        ASSERT_ARRAYS_EQ(gold, out);                                     \
+    }
+
+#define TEST_TEMP_FORMATS(form) \
+    TEST_TEMP_FORMAT(form, 0)   \
+    TEST_TEMP_FORMAT(form, 1)   \
+    TEST_TEMP_FORMAT(form, 2)   \
+    TEST_TEMP_FORMAT(form, 3)
+
+FOREACH_TEMP_FORMAT(TEST_TEMP_FORMATS)
\ No newline at end of file
diff --git a/test/scan_by_key.cpp b/test/scan_by_key.cpp
new file mode 100644
index 0000000000..08928b5fdc
--- /dev/null
+++ b/test/scan_by_key.cpp
@@ -0,0 +1,266 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/array.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <cstdlib>
+#include <iostream>
+#include <string>
+#include <utility>
+#include <vector>
+#include "binary_ops.hpp"
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+float randomInterval(float start, float end) {
+    return start + (end - start) * (std::rand() / float(RAND_MAX));
+}
+
+int randomInterval(int start, int end) {
+    return start + std::rand() % (end - start);
+}
+
+template<typename T>
+vector<T> createScanKey(dim4 dims, int scanDim, const vector<int> &nodeLengths,
+                        T keyStart, T keyEnd) {
+    std::srand(0);
+    int elemCount = dims.elements();
+    vector<T> key(elemCount);
+
+    int stride = 1;
+    for (int i = 0; i < scanDim; ++i) { stride *= dims[i]; }
+
+    for (int start = 0; start < stride; ++start) {
+        T keyval = (T)(0);
+        for (int index = start, i = 0; index < elemCount;
+             index += stride, i   = (i + 1) % dims[scanDim]) {
+            bool isNode = false;
+            for (unsigned n = 0; n < nodeLengths.size(); ++n) {
+                if (i % nodeLengths[n] == 0) { isNode = true; }
+            }
+            if (isNode && (std::rand() % 2)) {
+                keyval = randomInterval(keyStart, keyEnd);
+            }
+            key[index] = keyval;
+        }
+    }
+    return key;
+}
+
+template<typename T>
+vector<T> createScanData(dim4 dims, T dataStart, T dataEnd) {
+    int elemCount = dims.elements();
+    vector<T> in(elemCount);
+    for (int i = 0; i < elemCount; ++i) {
+        in[i] = randomInterval(dataStart, dataEnd);
+    }
+    return in;
+}
+
+template<typename Ti, typename Tk, typename To, af_binary_op op,
+         bool inclusive_scan>
+void verify(dim4 dims, const vector<Ti> &in, const vector<Tk> &key,
+            const vector<To> &out, int scanDim, double eps) {
+    std::srand(1);
+    Binary<To, op> binOp;
+    int elemCount = dims.elements();
+
+    int stride = 1;
+    for (int i = 0; i < scanDim; ++i) { stride *= dims[i]; }
+
+    for (int start = 0; start < stride; ++start) {
+        Tk keyval = key[start];
+        To gold   = binOp.init();
+        for (int index = start + (!inclusive_scan) * stride,
+                 i     = (!inclusive_scan);
+             index < elemCount; index += stride, i = (i + 1) % dims[scanDim]) {
+            if ((key[index] != keyval) || (i == 0)) {
+                keyval = key[index];
+                if (inclusive_scan) {
+                    gold = (To)in[index];
+                    ASSERT_NEAR(gold, out[index], eps);
+                } else {
+                    gold = binOp.init();
+                }
+            } else {
+                To dataval = (To)in[index - (!inclusive_scan) * stride];
+                gold       = binOp(gold, dataval);
+                ASSERT_NEAR(gold, out[index], eps);
+            }
+        }
+    }
+}
+
+template<typename Ti, typename To, af_binary_op op, bool inclusive_scan>
+void scanByKeyTest(dim4 dims, int scanDim, vector<int> nodeLengths,
+                   int keyStart, int keyEnd, Ti dataStart, Ti dataEnd,
+                   double eps) {
+    vector<int> key =
+        createScanKey<int>(dims, scanDim, nodeLengths, keyStart, keyEnd);
+    vector<Ti> in = createScanData<Ti>(dims, dataStart, dataEnd);
+
+    array afkey(dims, key.data());
+    array afin(dims, in.data());
+    array afout = scanByKey(afkey, afin, scanDim, op, inclusive_scan);
+    vector<To> out(afout.elements());
+    afout.host(out.data());
+
+    verify<Ti, int, To, op, inclusive_scan>(dims, in, key, out, scanDim, eps);
+}
+
+#define SCAN_BY_KEY_TEST(FN, X, Y, Z, W, Ti, To, INC, DIM, DSTART, DEND, EPS) \
+    TEST(ScanByKey, Test_Scan_By_Key_##FN##_##Ti##_##INC##_##DIM) {           \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                               \
+        dim4 dims(X, Y, Z, W);                                                \
+        int scanDim = DIM;                                                    \
+        int nodel[] = {37, 256};                                              \
+        vector<int> nodeLengths(nodel, nodel + sizeof(nodel) / sizeof(int));  \
+        int keyStart  = 0;                                                    \
+        int keyEnd    = 15;                                                   \
+        int dataStart = DSTART;                                               \
+        int dataEnd   = DEND;                                                 \
+        scanByKeyTest<Ti, To, FN, INC>(dims, scanDim, nodeLengths, keyStart,  \
+                                       keyEnd, dataStart, dataEnd, EPS);      \
+    }
+
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 16 * 1024, 1024, 1, 1, int, int, true, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 16 * 1024, 1024, 1, 1, int, int, false, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 16 * 1024, 1024, 1, 1, float, float, true, 0,
+                 -5.0, 5.0, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 16 * 1024, 1024, 1, 1, float, float, false, 0,
+                 -5.0, 5.0, 1e-3);
+
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 16 * 1024, 1024, 1, 1, int, int, true, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 16 * 1024, 1024, 1, 1, int, int, false, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 16 * 1024, 1024, 1, 1, float, float, true, 0,
+                 -5.0, 5.0, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 16 * 1024, 1024, 1, 1, float, float, false, 0,
+                 -5.0, 5.0, 1e-3);
+
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 16 * 1024, 1024, 1, 1, int, int, true, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 16 * 1024, 1024, 1, 1, int, int, false, 0, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 16 * 1024, 1024, 1, 1, float, float, true, 0,
+                 -5.0, 5.0, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 16 * 1024, 1024, 1, 1, float, float, false, 0,
+                 -5.0, 5.0, 1e-3);
+
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 4 * 1024, 512, 1, 1, int, int, true, 1, -15, 15,
+                 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 4 * 1024, 512, 1, 1, int, int, false, 1, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 4 * 1024, 512, 1, 1, float, float, true, 1, -5,
+                 5, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_ADD, 4 * 1024, 512, 1, 1, float, float, false, 1, -5,
+                 5, 1e-3);
+
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 4 * 1024, 512, 1, 1, int, int, true, 1, -15, 15,
+                 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 4 * 1024, 512, 1, 1, int, int, false, 1, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 4 * 1024, 512, 1, 1, float, float, true, 1, -5,
+                 5, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MIN, 4 * 1024, 512, 1, 1, float, float, false, 1, -5,
+                 5, 1e-3);
+
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 4 * 1024, 512, 1, 1, int, int, true, 1, -15, 15,
+                 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 4 * 1024, 512, 1, 1, int, int, false, 1, -15,
+                 15, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 4 * 1024, 512, 1, 1, float, float, true, 1, -5,
+                 5, 1e-3);
+SCAN_BY_KEY_TEST(AF_BINARY_MAX, 4 * 1024, 512, 1, 1, float, float, false, 1, -5,
+                 5, 1e-3);
+
+TEST(ScanByKey, Test_Scan_By_key_Simple_0) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    dim4 dims(16, 8, 2, 1);
+    int scanDim = 0;
+    int nodel[] = {4, 8};
+    vector<int> nodeLengths(nodel, nodel + sizeof(nodel) / sizeof(int));
+    int keyStart  = 0;
+    int keyEnd    = 15;
+    int dataStart = 2;
+    int dataEnd   = 4;
+    scanByKeyTest<int, int, AF_BINARY_ADD, false>(
+        dims, scanDim, nodeLengths, keyStart, keyEnd, dataStart, dataEnd, 1e-5);
+}
+
+TEST(ScanByKey, Test_Scan_By_key_Simple_1) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    dim4 dims(8, 256 + 128, 1, 1);
+    int scanDim = 1;
+    int nodel[] = {4, 8};
+    vector<int> nodeLengths(nodel, nodel + sizeof(nodel) / sizeof(int));
+    int keyStart  = 0;
+    int keyEnd    = 15;
+    int dataStart = 2;
+    int dataEnd   = 4;
+    scanByKeyTest<int, int, AF_BINARY_ADD, false>(
+        dims, scanDim, nodeLengths, keyStart, keyEnd, dataStart, dataEnd, 1e-5);
+}
+
+TEST(ScanByKey, FixOverflowWrite) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    const int SIZE = 41000;
+    vector<int> keys(SIZE, 0);
+    vector<float> vals(SIZE, 1.0f);
+
+    array someVals = array(SIZE, vals.data());
+    array keysAF   = array(SIZE, s32);
+    array valsAF   = array(SIZE, vals.data());
+
+    keysAF = array(SIZE, keys.data());
+
+    float prior = valsAF(0).scalar<float>();
+
+    array result = af::scanByKey(keysAF, someVals, 0, AF_BINARY_ADD, true);
+
+    ASSERT_EQ(prior, valsAF(0).scalar<float>());
+}
+
+#define TEST_TEMP_FORMAT(form, dim)                                           \
+    TEST(TEMP_FORMAT, form##_Dim##dim) {                                      \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                               \
+        const dim4 dims(2, 2, 2, 2);                                          \
+        const array in(af::moddims(range(dim4(dims.elements())), dims));      \
+        in.eval();                                                            \
+        const array keys(af::constant(0, dims, u32));                         \
+        keys.eval();                                                          \
+        const array gold = scanByKey(keys, in, dim);                          \
+                                                                              \
+        array out =                                                           \
+            scanByKey(toTempFormat(form, keys), toTempFormat(form, in), dim); \
+        ASSERT_ARRAYS_EQ(gold, out);                                          \
+    }
+
+#define TEST_TEMP_FORMATS(form) \
+    TEST_TEMP_FORMAT(form, 0)   \
+    TEST_TEMP_FORMAT(form, 1)   \
+    TEST_TEMP_FORMAT(form, 2)   \
+    TEST_TEMP_FORMAT(form, 3)
+
+FOREACH_TEMP_FORMAT(TEST_TEMP_FORMATS)
\ No newline at end of file
diff --git a/test/select.cpp b/test/select.cpp
new file mode 100644
index 0000000000..4b4c96dd21
--- /dev/null
+++ b/test/select.cpp
@@ -0,0 +1,551 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <cmath>
+#include <cstdio>
+#include <iostream>
+#include <sstream>
+#include <string>
+#include <type_traits>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::eval;
+using af::NaN;
+using af::randu;
+using af::select;
+using af::seq;
+using af::span;
+using af::sum;
+using std::string;
+using std::stringstream;
+using std::vector;
+
+template<typename T>
+class Select : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, cfloat, cdouble, uint, int, intl, uintl,
+                         schar, uchar, char, short, ushort, half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Select, TestTypes);
+
+template<typename T>
+void selectTest(const dim4& dims) {
+    SUPPORTED_TYPE_CHECK(T);
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array a = randu(dims, ty);
+    array b = randu(dims, ty);
+
+    if (a.isinteger()) {
+        a = (a % (1 << 30)).as(ty);
+        b = (b % (1 << 30)).as(ty);
+    }
+
+    array cond = randu(dims, ty) > a;
+
+    array c = select(cond, a, b);
+
+    int num = (int)a.elements();
+
+    vector<T> ha(num);
+    vector<T> hb(num);
+    vector<T> hc(num);
+    vector<char> hcond(num);
+
+    a.host(&ha[0]);
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+    cond.host(&hcond[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_EQ(hc[i], hcond[i] ? ha[i] : hb[i]);
+    }
+}
+
+template<typename T, bool is_right>
+void selectScalarTest(const dim4& dims) {
+    SUPPORTED_TYPE_CHECK(T);
+    using scalar_t =
+        typename std::conditional<std::is_same<T, intl>::value ||
+                                      std::is_same<T, uintl>::value,
+                                  T, double>::type;
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array a    = randu(dims, ty);
+    array cond = randu(dims, ty) > a;
+    scalar_t b = static_cast<scalar_t>(3);
+
+    if (a.isinteger()) { a = (a % (1 << 30)).as(ty); }
+
+    array c = is_right ? select(cond, a, b) : select(cond, b, a);
+
+    int num = (int)a.elements();
+
+    vector<T> ha(num);
+    vector<T> hc(num);
+    vector<char> hcond(num);
+
+    a.host(&ha[0]);
+    c.host(&hc[0]);
+    cond.host(&hcond[0]);
+
+    if (is_right) {
+        for (int i = 0; i < num; i++) {
+            ASSERT_EQ(hc[i], hcond[i] ? ha[i] : T(b));
+        }
+    } else {
+        for (int i = 0; i < num; i++) {
+            ASSERT_EQ(hc[i], hcond[i] ? T(b) : ha[i]);
+        }
+    }
+}
+
+TYPED_TEST(Select, Simple) { selectTest<TypeParam>(dim4(1024, 1024)); }
+
+TYPED_TEST(Select, RightScalar) {
+    selectScalarTest<TypeParam, true>(dim4(1000, 1000));
+}
+
+TYPED_TEST(Select, LeftScalar) {
+    selectScalarTest<TypeParam, false>(dim4(1000, 1000));
+}
+
+TEST(Select, NaN) {
+    SKIP_IF_FAST_MATH_ENABLED();
+    dim4 dims(1000, 1250);
+    dtype ty = f32;
+
+    array a                                 = randu(dims, ty);
+    a(seq(a.dims(0) / 2), span, span, span) = NaN;
+    float b                                 = 0;
+    array c                                 = select(isNaN(a), b, a);
+
+    int num = (int)a.elements();
+
+    vector<float> ha(num);
+    vector<float> hc(num);
+
+    a.host(&ha[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        ASSERT_FLOAT_EQ(hc[i], std::isnan(ha[i]) ? b : ha[i]);
+    }
+}
+
+TEST(Select, ISSUE_1249) {
+    dim4 dims(2, 3, 4);
+    array cond = randu(dims) > 0.5;
+    array a    = randu(dims);
+    array b    = select(cond, a - a * 0.9, a);
+    array c    = a - a * cond * 0.9;
+
+    int num = (int)dims.elements();
+    vector<float> hb(num);
+    vector<float> hc(num);
+
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        EXPECT_NEAR(hc[i], hb[i], 1e-7) << "at " << i;
+    }
+}
+
+TEST(Select, 4D) {
+    dim4 dims(2, 3, 4, 2);
+    array cond = randu(dims) > 0.5;
+    array a    = randu(dims);
+    array b    = select(cond, a - a * 0.9, a);
+    array c    = a - a * cond * 0.9;
+
+    int num = (int)dims.elements();
+    vector<float> hb(num);
+    vector<float> hc(num);
+
+    b.host(&hb[0]);
+    c.host(&hc[0]);
+
+    for (int i = 0; i < num; i++) {
+        EXPECT_NEAR(hc[i], hb[i], 1e-7) << "at " << i;
+    }
+}
+
+TEST(Select, Issue_1730) {
+    const int n = 1000;
+    const int m = 200;
+    array a     = randu(n, m) - 0.5;
+    eval(a);
+
+    vector<float> ha1(a.elements());
+    a.host(&ha1[0]);
+
+    const int n1 = n / 2;
+    const int n2 = n1 + n / 4;
+
+    a(seq(n1, n2), span) =
+        select(a(seq(n1, n2), span) >= 0, a(seq(n1, n2), span),
+               a(seq(n1, n2), span) * -1);
+
+    vector<float> ha2(a.elements());
+    a.host(&ha2[0]);
+
+    for (int j = 0; j < m; j++) {  // k: column-major index of (i, j)
+        for (int i = 0, k = j * n; i < n; i++, k++) {
+            if (i < n1 || i > n2) {
+                ASSERT_FLOAT_EQ(ha1[k], ha2[k])
+                    << "at (" << i << ", " << j << ")";
+            } else {
+                ASSERT_FLOAT_EQ(ha2[k], (ha1[k] >= 0 ? ha1[k] : -ha1[k]))
+                    << "at (" << i << ", " << j << ")";
+            }
+        }
+    }
+}
+
+TEST(Select, Issue_1730_scalar) {
+    const int n = 1000;
+    const int m = 200;
+    array a     = randu(n, m) - 0.5;
+    eval(a);
+
+    vector<float> ha1(a.elements());
+    a.host(&ha1[0]);
+
+    const int n1 = n / 2;
+    const int n2 = n1 + n / 4;
+
+    float val = 0;
+    a(seq(n1, n2), span) =
+        select(a(seq(n1, n2), span) >= 0, a(seq(n1, n2), span), val);
+
+    vector<float> ha2(a.elements());
+    a.host(&ha2[0]);
+
+    for (int j = 0; j < m; j++) {  // k: column-major index of (i, j)
+        for (int i = 0, k = j * n; i < n; i++, k++) {
+            if (i < n1 || i > n2) {
+                ASSERT_FLOAT_EQ(ha1[k], ha2[k])
+                    << "at (" << i << ", " << j << ")";
+            } else {
+                ASSERT_FLOAT_EQ(ha2[k], (ha1[k] >= 0 ? ha1[k] : val))
+                    << "at (" << i << ", " << j << ")";
+            }
+        }
+    }
+}
+
+TEST(Select, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+
+    array a    = constant(1, largeDim);
+    array b    = constant(0, largeDim);
+    array cond = constant(0, largeDim, b8);
+
+    array sel = select(cond, a, b);
+    float sum = af::sum<float>(sel);
+
+    ASSERT_FLOAT_EQ(sum, 0.f);
+
+    a    = constant(1, 1, largeDim);
+    b    = constant(0, 1, largeDim);
+    cond = constant(0, 1, largeDim, b8);
+
+    sel = select(cond, a, b);
+    sum = af::sum<float>(sel);
+
+    ASSERT_FLOAT_EQ(sum, 0.f);
+
+    a    = constant(1, 1, 1, largeDim);
+    b    = constant(0, 1, 1, largeDim);
+    cond = constant(0, 1, 1, largeDim, b8);
+
+    sel = select(cond, a, b);
+    sum = af::sum<float>(sel);
+
+    ASSERT_FLOAT_EQ(sum, 0.f);
+
+    a    = constant(1, 1, 1, 1, largeDim);
+    b    = constant(0, 1, 1, 1, largeDim);
+    cond = constant(0, 1, 1, 1, largeDim, b8);
+
+    sel = select(cond, a, b);
+    sum = af::sum<float>(sel);
+
+    ASSERT_FLOAT_EQ(sum, 0.f);
+}
+
+struct select_params {
+    dim4 out;
+    dim4 cond;
+    dim4 a;
+    dim4 b;
+    select_params(dim4 out_, dim4 cond_, dim4 a_, dim4 b_)
+        : out(out_), cond(cond_), a(a_), b(b_) {}
+};
+
+class Select_ : public ::testing::TestWithParam<select_params> {};
+
+string pd4(dim4 dims) {
+    string out(32, '\0');
+    int len = snprintf(const_cast<char*>(out.data()), 32, "%lld_%lld_%lld_%lld",
+                       dims[0], dims[1], dims[2], dims[3]);
+    out.resize(len);
+    return out;
+}
+
+string testNameGenerator(
+    const ::testing::TestParamInfo<Select_::ParamType> info) {
+    stringstream ss;
+    ss << "out_" << pd4(info.param.out) << "_cond_" << pd4(info.param.cond)
+       << "_a_" << pd4(info.param.a) << "_b_" << pd4(info.param.b);
+    return ss.str();
+}
+
+vector<select_params> getSelectTestParams(int M, int N) {
+    const select_params _[] = {
+        select_params(dim4(M), dim4(M), dim4(M), dim4(M)),
+        select_params(dim4(M, N), dim4(M, N), dim4(M, N), dim4(M, N)),
+        select_params(dim4(M, N, N), dim4(M, N, N), dim4(M, N, N),
+                      dim4(M, N, N)),
+        select_params(dim4(M, N, N, N), dim4(M, N, N, N), dim4(M, N, N, N),
+                      dim4(M, N, N, N)),
+        select_params(dim4(M, N), dim4(M, 1), dim4(M, 1), dim4(M, N)),
+        select_params(dim4(M, N), dim4(M, 1), dim4(M, N), dim4(M, 1)),
+        select_params(dim4(M, N), dim4(M, 1), dim4(M, N), dim4(M, N)),
+        select_params(dim4(M, N), dim4(M, N), dim4(M, 1), dim4(M, N)),
+        select_params(dim4(M, N), dim4(M, N), dim4(M, N), dim4(M, 1)),
+        select_params(dim4(M, N), dim4(M, N), dim4(M, 1), dim4(M, 1))};
+    return vector<select_params>(_, _ + sizeof(_) / sizeof(_[0]));
+}
+
+INSTANTIATE_TEST_SUITE_P(SmallDims, Select_,
+                         ::testing::ValuesIn(getSelectTestParams(10, 5)),
+                         testNameGenerator);
+
+INSTANTIATE_TEST_SUITE_P(Dims33_9, Select_,
+                         ::testing::ValuesIn(getSelectTestParams(33, 9)),
+                         testNameGenerator);
+
+INSTANTIATE_TEST_SUITE_P(DimsLg, Select_,
+                         ::testing::ValuesIn(getSelectTestParams(512, 32)),
+                         testNameGenerator);
+
+TEST_P(Select_, Batch) {
+    select_params params = GetParam();
+
+    float aval = 5.0f;
+    float bval = 10.0f;
+    array a    = constant(aval, params.a);
+    array b    = constant(bval, params.b);
+    array cond = (iota(params.cond) % 2).as(b8);
+
+    array out = select(cond, a, b);
+
+    EXPECT_EQ(out.dims(), params.out);
+
+    vector<float> h_out(out.elements());
+    out.host(h_out.data());
+    vector<unsigned char> h_cond(cond.elements());
+    cond.host(h_cond.data());
+
+    vector<float> gold(params.out.elements());
+    for (size_t i = 0; i < gold.size(); i++) {
+        gold[i] = h_cond[i % h_cond.size()] ? aval : bval;
+        ASSERT_FLOAT_EQ(gold[i], h_out[i]) << "at: " << i;
+    }
+}
+
+struct selectlr_params {
+    dim4 out;
+    dim4 cond;
+    dim4 ab;
+    selectlr_params(dim4 out_, dim4 cond_, dim4 ab_)
+        : out(out_), cond(cond_), ab(ab_) {}
+};
+
+class SelectLR_ : public ::testing::TestWithParam<selectlr_params> {};
+
+vector<selectlr_params> getSelectLRTestParams(int M, int N) {
+    const selectlr_params _[] = {
+        selectlr_params(dim4(M), dim4(M), dim4(M)),
+        selectlr_params(dim4(M, N), dim4(M, N), dim4(M, N)),
+        selectlr_params(dim4(M, N, N), dim4(M, N, N), dim4(M, N, N)),
+        selectlr_params(dim4(M, N, N, N), dim4(M, N, N, N), dim4(M, N, N, N)),
+        selectlr_params(dim4(M, N), dim4(M, 1), dim4(M, N)),
+        selectlr_params(dim4(M, N), dim4(M, N), dim4(M, 1))};
+
+    return vector<selectlr_params>(_, _ + sizeof(_) / sizeof(_[0]));
+}
+
+string testNameGeneratorLR(
+    const ::testing::TestParamInfo<SelectLR_::ParamType> info) {
+    stringstream ss;
+    ss << "out_" << pd4(info.param.out) << "_cond_" << pd4(info.param.cond)
+       << "_ab_" << pd4(info.param.ab);
+    return ss.str();
+}
+
+INSTANTIATE_TEST_SUITE_P(SmallDims, SelectLR_,
+                         ::testing::ValuesIn(getSelectLRTestParams(10, 5)),
+                         testNameGeneratorLR);
+
+INSTANTIATE_TEST_SUITE_P(Dims33_9, SelectLR_,
+                         ::testing::ValuesIn(getSelectLRTestParams(33, 9)),
+                         testNameGeneratorLR);
+
+INSTANTIATE_TEST_SUITE_P(DimsLg, SelectLR_,
+                         ::testing::ValuesIn(getSelectLRTestParams(512, 32)),
+                         testNameGeneratorLR);
+
+TEST_P(SelectLR_, BatchL) {
+    selectlr_params params = GetParam();
+
+    float aval = 5.0f;
+    float bval = 10.0f;
+    array b    = constant(bval, params.ab);
+    array cond = (iota(params.cond) % 2).as(b8);
+
+    array out = select(cond, static_cast<double>(aval), b);
+
+    EXPECT_EQ(out.dims(), params.out);
+
+    vector<float> h_out(out.elements());
+    out.host(h_out.data());
+    vector<unsigned char> h_cond(cond.elements());
+    cond.host(h_cond.data());
+
+    vector<float> gold(params.out.elements());
+    for (size_t i = 0; i < gold.size(); i++) {
+        gold[i] = h_cond[i % h_cond.size()] ? aval : bval;
+        ASSERT_FLOAT_EQ(gold[i], h_out[i]) << "at: " << i;
+    }
+}
+
+TEST_P(SelectLR_, BatchR) {
+    selectlr_params params = GetParam();
+
+    float aval = 5.0f;
+    float bval = 10.0f;
+    array a    = constant(aval, params.ab);
+    array cond = (iota(params.cond) % 2).as(b8);
+
+    array out = select(cond, a, static_cast<double>(bval));
+
+    EXPECT_EQ(out.dims(), params.out);
+
+    vector<float> h_out(out.elements());
+    out.host(h_out.data());
+    vector<unsigned char> h_cond(cond.elements());
+    cond.host(h_cond.data());
+
+    vector<float> gold(params.out.elements());
+    for (size_t i = 0; i < gold.size(); i++) {
+        gold[i] = h_cond[i % h_cond.size()] ? aval : bval;
+        ASSERT_FLOAT_EQ(gold[i], h_out[i]) << "at: " << i;
+    }
+}
+
+TEST(Select, InvalidSizeOfAB) {
+    af_array a    = 0;
+    af_array b    = 0;
+    af_array cond = 0;
+    af_array out  = 0;
+
+    double val = 0;
+    dim_t dims = 10;
+    ASSERT_SUCCESS(af_constant(&a, val, 1, &dims, f32));
+
+    dims = 9;
+    ASSERT_SUCCESS(af_constant(&b, val, 1, &dims, f32));
+
+    dims = 10;
+    ASSERT_SUCCESS(af_constant(&cond, val, 1, &dims, b8));
+
+    ASSERT_EQ(AF_ERR_SIZE, af_select(&out, cond, a, b));
+
+    char* msg = NULL;
+    dim_t len = 0;
+    af_get_last_error(&msg, &len);
+    af_free_host(msg);
+    af_release_array(a);
+    af_release_array(b);
+    af_release_array(cond);
+}
+
+TEST(Select, InvalidSizeOfCond) {
+    af_array a    = 0;
+    af_array b    = 0;
+    af_array cond = 0;
+    af_array out  = 0;
+
+    double val = 0;
+    dim_t dims = 10;
+    ASSERT_SUCCESS(af_constant(&a, val, 1, &dims, f32));
+
+    dims = 10;
+    ASSERT_SUCCESS(af_constant(&b, val, 1, &dims, f32));
+
+    dims = 9;
+    ASSERT_SUCCESS(af_constant(&cond, val, 1, &dims, b8));
+
+    ASSERT_EQ(AF_ERR_SIZE, af_select(&out, cond, a, b));
+
+    char* msg = NULL;
+    dim_t len = 0;
+    af_get_last_error(&msg, &len);
+    af_free_host(msg);
+    af_release_array(a);
+    af_release_array(b);
+    af_release_array(cond);
+}
+
+TEST(Select, SNIPPET_select) {
+    //! [ex_data_select]
+    int elements = 9;
+    char hCond[] = {1, 0, 1, 0, 1, 0, 1, 0, 1};
+    float hA[]   = {2, 2, 2, 2, 2, 2, 2, 2, 2};
+    float hB[]   = {3, 3, 3, 3, 3, 3, 3, 3, 3};
+
+    array cond(elements, hCond);
+    array a(elements, hA);
+    array b(elements, hB);
+
+    array out = select(cond, a, b);
+    // out = {2, 3, 2, 3, 2, 3, 2, 3, 2};
+    //! [ex_data_select]
+
+    //! [ex_data_select_c]
+    vector<float> hOut(elements);
+    for (size_t i = 0; i < hOut.size(); i++) {
+        if (hCond[i]) {
+            hOut[i] = hA[i];
+        } else {
+            hOut[i] = hB[i];
+        }
+    }
+    //! [ex_data_select_c]
+
+    ASSERT_VEC_ARRAY_EQ(hOut, dim4(9), out);
+}
diff --git a/test/set.cpp b/test/set.cpp
index e879d2472a..0e1ececadc 100644
--- a/test/set.cpp
+++ b/test/set.cpp
@@ -7,157 +7,297 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/algorithm.h>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-void uniqueTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void uniqueTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
+    vector<dim4> numDims;
 
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
 
     // Compare result
     for (int d = 0; d < (int)tests.size(); ++d) {
-
-        af::dim4 dims       = numDims[d];
+        dim4 dims = numDims[d];
         vector<T> in(data[d].begin(), data[d].end());
 
-        af_array inArray   = 0;
-        af_array outArray  = 0;
+        af_array inArray  = 0;
+        af_array outArray = 0;
 
         // Get input array
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(), dims.ndims(),
-                                              dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
+        ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
         vector<T> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        ASSERT_EQ(AF_SUCCESS, af_set_unique(&outArray, inArray, d == 0 ? false : true));
+        ASSERT_SUCCESS(
+            af_set_unique(&outArray, inArray, d == 0 ? false : true));
 
         // Get result
-        T *outData;
-        outData = new T[currGoldBar.size()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        vector<T> outData(currGoldBar.size());
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
 
         size_t nElems = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                            << " for test: " << d << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << " for test: " << d << endl;
         }
 
-        // Delete
-        delete[] outData;
-
-        if(inArray   != 0) af_release_array(inArray);
-        if(outArray  != 0) af_release_array(outArray);
+        if (inArray != 0) af_release_array(inArray);
+        if (outArray != 0) af_release_array(outArray);
     }
 }
 
-#define UNIQUE_TESTS(T)                             \
-    TEST(Set, Test_Unique_##T)                      \
-    {                                               \
-        uniqueTest<T>(TEST_DIR"/set/unique.test");  \
-    }                                               \
+#define UNIQUE_TESTS(T) \
+    TEST(Set, Test_Unique_##T) { uniqueTest<T>(TEST_DIR "/set/unique.test"); }
 
 UNIQUE_TESTS(float)
 UNIQUE_TESTS(double)
 UNIQUE_TESTS(int)
 UNIQUE_TESTS(uint)
+UNIQUE_TESTS(schar)
 UNIQUE_TESTS(uchar)
+UNIQUE_TESTS(short)
+UNIQUE_TESTS(ushort)
+UNIQUE_TESTS(intl)
+UNIQUE_TESTS(uintl)
 
-typedef af_err (*setFunc)(af_array *, const af_array, const af_array, const bool);
+typedef af_err (*setFunc)(af_array *, const af_array, const af_array,
+                          const bool);
 
 template<typename T, setFunc af_set_func>
-void setTest(string pTestFile)
-{
-    if (noDoubleTests<T>()) return;
+void setTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
+    vector<dim4> numDims;
 
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
 
     // Compare result
     for (int d = 0; d < (int)tests.size(); d += 2) {
-
-        af::dim4 dims0       = numDims[d + 0];
+        dim4 dims0 = numDims[d + 0];
         vector<T> in0(data[d + 0].begin(), data[d + 0].end());
 
-        af::dim4 dims1       = numDims[d + 1];
+        dim4 dims1 = numDims[d + 1];
         vector<T> in1(data[d + 1].begin(), data[d + 1].end());
 
-        af_array inArray0   = 0;
-        af_array inArray1   = 0;
-        af_array outArray  = 0;
-
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray0, &in0.front(), dims0.ndims(),
-                                              dims0.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray1, &in1.front(), dims1.ndims(),
-                                              dims1.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        af_array inArray0 = 0;
+        af_array inArray1 = 0;
+        af_array outArray = 0;
 
+        ASSERT_SUCCESS(af_create_array(&inArray0, &in0.front(), dims0.ndims(),
+                                       dims0.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
+        ASSERT_SUCCESS(af_create_array(&inArray1, &in1.front(), dims1.ndims(),
+                                       dims1.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
         vector<T> currGoldBar(tests[d].begin(), tests[d].end());
 
         // Run sum
-        ASSERT_EQ(AF_SUCCESS, af_set_func(&outArray, inArray0, inArray1, d == 0 ? false : true));
+        ASSERT_SUCCESS(
+            af_set_func(&outArray, inArray0, inArray1, d == 0 ? false : true));
 
         // Get result
-        T *outData;
-        outData = new T[currGoldBar.size()];
-        ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+        vector<T> outData(currGoldBar.size());
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
 
         size_t nElems = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                            << " for test: " << d << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << " for test: " << d << endl;
         }
 
-        // Delete
-        delete[] outData;
-
-        if(inArray0   != 0) af_release_array(inArray0);
-        if(inArray1   != 0) af_release_array(inArray1);
-        if(outArray  != 0) af_release_array(outArray);
+        if (inArray0 != 0) af_release_array(inArray0);
+        if (inArray1 != 0) af_release_array(inArray1);
+        if (outArray != 0) af_release_array(outArray);
     }
 }
 
-#define SET_TESTS(T)                                                    \
-    TEST(Set, Test_Union_##T)                                           \
-    {                                                                   \
-        setTest<T, af_set_union>(TEST_DIR"/set/union.test");            \
-    }                                                                   \
-    TEST(Set, Test_Intersect_##T)                                       \
-    {                                                                   \
-        setTest<T, af_set_intersect>(TEST_DIR"/set/intersect.test");    \
-    }                                                                   \
+#define SET_TESTS(T)                                                  \
+    TEST(Set, Test_Union_##T) {                                       \
+        setTest<T, af_set_union>(TEST_DIR "/set/union.test");         \
+    }                                                                 \
+    TEST(Set, Test_Intersect_##T) {                                   \
+        setTest<T, af_set_intersect>(TEST_DIR "/set/intersect.test"); \
+    }
 
 SET_TESTS(float)
 SET_TESTS(double)
 SET_TESTS(int)
 SET_TESTS(uint)
+SET_TESTS(schar)
 SET_TESTS(uchar)
+SET_TESTS(short)
+SET_TESTS(ushort)
+SET_TESTS(intl)
+SET_TESTS(uintl)
+
+// Documentation examples for setUnique
+TEST(Set, SNIPPET_setUniqueSorted) {
+    //! [ex_set_unique_sorted]
+
+    // input data
+    int h_set[6] = {1, 2, 2, 3, 3, 3};
+    af::array set(6, h_set);
+
+    // The is_sorted flag specifies whether the input is already sorted;
+    // when true, the algorithm skips its internal sorting step.
+    const bool is_sorted = true;
+    af::array unique     = setUnique(set, is_sorted);
+    // unique == { 1, 2, 3 };
+
+    //! [ex_set_unique_sorted]
+
+    vector<int> unique_gold = {1, 2, 3};
+    dim4 gold_dim(3, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(unique_gold, gold_dim, unique);
+}
+
+TEST(Set, SNIPPET_setUniqueSortedDesc) {
+    //! [ex_set_unique_desc]
+
+    // input data
+    int h_set[6] = {3, 3, 3, 2, 2, 1};
+    af::array set(6, h_set);
+
+    // The is_sorted flag specifies whether the input is already sorted;
+    // when true, the algorithm skips its internal sorting step.
+    // The input may be sorted in ascending or descending order.
+    const bool is_sorted = true;
+    af::array unique     = setUnique(set, is_sorted);
+    // unique == { 3, 2, 1 };
+
+    //! [ex_set_unique_desc]
+
+    vector<int> unique_gold = {3, 2, 1};
+    dim4 gold_dim(3, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(unique_gold, gold_dim, unique);
+}
+
+TEST(Set, SNIPPET_setUniqueSimple) {
+    //! [ex_set_unique_simple]
+
+    // input data
+    int h_set[6] = {3, 2, 3, 3, 2, 1};
+    af::array set(6, h_set);
+
+    af::array unique = setUnique(set);
+    // unique == { 1, 2, 3 };
+
+    //! [ex_set_unique_simple]
+
+    vector<int> unique_gold = {1, 2, 3};
+    dim4 gold_dim(3, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(unique_gold, gold_dim, unique);
+}
+
+// Documentation examples for setUnion
+TEST(Set, SNIPPET_setUnion) {
+    //! [ex_set_union]
+
+    // input data
+    int h_setA[4] = {1, 2, 3, 4};
+    int h_setB[4] = {2, 3, 4, 5};
+    af::array setA(4, h_setA);
+    af::array setB(4, h_setB);
+
+    const bool is_unique = true;
+    // The is_unique flag specifies whether the inputs are already unique;
+    // when true, the algorithm skips its internal calls to setUnique.
+    // The inputs must then be unique and sorted in increasing order.
+    af::array setAB = setUnion(setA, setB, is_unique);
+    // setAB == { 1, 2, 3, 4, 5 };
+
+    //! [ex_set_union]
+
+    vector<int> union_gold = {1, 2, 3, 4, 5};
+    dim4 gold_dim(5, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(union_gold, gold_dim, setAB);
+}
+
+TEST(Set, SNIPPET_setUnionSimple) {
+    //! [ex_set_union_simple]
+
+    // input data
+    int h_setA[4] = {1, 2, 3, 3};
+    int h_setB[4] = {3, 4, 5, 5};
+    af::array setA(4, h_setA);
+    af::array setB(4, h_setB);
+
+    af::array setAB = setUnion(setA, setB);
+    // setAB == { 1, 2, 3, 4, 5 };
+
+    //! [ex_set_union_simple]
+
+    vector<int> union_gold = {1, 2, 3, 4, 5};
+    dim4 gold_dim(5, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(union_gold, gold_dim, setAB);
+}
+
+// Documentation examples for setIntersect()
+TEST(Set, SNIPPET_setIntersect) {
+    //! [ex_set_intersect]
+
+    // input data
+    int h_setA[4] = {1, 2, 3, 4};
+    int h_setB[4] = {2, 3, 4, 5};
+    af::array setA(4, h_setA);
+    af::array setB(4, h_setB);
+
+    const bool is_unique = true;
+    // The is_unique flag specifies whether the inputs are already unique;
+    // when true, the algorithm skips its internal calls to setUnique.
+    // The inputs must then be unique and sorted in increasing order.
+    af::array setA_B = setIntersect(setA, setB, is_unique);
+    // setA_B == { 2, 3, 4 };
+
+    //! [ex_set_intersect]
+
+    vector<int> intersect_gold = {2, 3, 4};
+    dim4 gold_dim(3, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(intersect_gold, gold_dim, setA_B);
+}
+
+TEST(Set, SNIPPET_setIntersectSimple) {
+    //! [ex_set_intersect_simple]
+
+    // input data
+    int h_setA[4] = {1, 2, 3, 3};
+    int h_setB[4] = {3, 3, 4, 5};
+    af::array setA(4, h_setA);
+    af::array setB(4, h_setB);
+
+    af::array setA_B = setIntersect(setA, setB);
+    // setA_B == { 3 };
+
+    //! [ex_set_intersect_simple]
+
+    vector<int> intersect_gold = {3};
+    dim4 gold_dim(1, 1, 1, 1);
+    ASSERT_VEC_ARRAY_EQ(intersect_gold, gold_dim, setA_B);
+}
diff --git a/test/shift.cpp b/test/shift.cpp
index 1fe72ec86f..c86c43c8e3 100644
--- a/test/shift.cpp
+++ b/test/shift.cpp
@@ -7,141 +7,152 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::product;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Shift : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Shift : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort>
+    TestTypes;
 // register the type list
-TYPED_TEST_CASE(Shift, TestTypes);
+TYPED_TEST_SUITE(Shift, TestTypes);
 
 template<typename T>
-void shiftTest(string pTestFile, const unsigned resultIdx,
-                 const int x, const int y, const int z, const int w,
-                 bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void shiftTest(string pTestFile, const unsigned resultIdx, const int x,
+               const int y, const int z, const int w, bool isSubRef = false,
+               const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_shift(&outArray, inArray, x, y, z, w));
+    ASSERT_SUCCESS(af_shift(&outArray, inArray, x, y, z, w));
 
-    // Get result
-    T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], idims, outArray);
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
+}
+
+#define SHIFT_INIT(desc, file, resultIdx, x, y, z, w)                  \
+    TYPED_TEST(Shift, desc) {                                          \
+        shiftTest<TypeParam>(string(TEST_DIR "/shift/" #file ".test"), \
+                             resultIdx, x, y, z, w);                   \
     }
 
-    // Delete
-    delete[] outData;
+SHIFT_INIT(Shift0, shift4d, 0, 2, 0, 0, 0);
+SHIFT_INIT(Shift1, shift4d, 1, -1, 0, 0, 0);
+SHIFT_INIT(Shift2, shift4d, 2, 3, 2, 0, 0);
+SHIFT_INIT(Shift3, shift4d, 3, 11, 22, 0, 0);
+SHIFT_INIT(Shift4, shift4d, 4, 0, 1, 0, 0);
+SHIFT_INIT(Shift5, shift4d, 5, 0, -6, 0, 0);
+SHIFT_INIT(Shift6, shift4d, 6, 0, 3, 1, 0);
+SHIFT_INIT(Shift7, shift4d, 7, 0, 0, 2, 0);
+SHIFT_INIT(Shift8, shift4d, 8, 0, 0, -2, 0);
+SHIFT_INIT(Shift9, shift4d, 9, 0, 0, 0, 1);
+SHIFT_INIT(Shift10, shift4d, 10, 0, 0, 0, -1);
+SHIFT_INIT(Shift11, shift4d, 11, 1, 1, 1, 1);
+SHIFT_INIT(Shift12, shift4d, 12, -1, -1, -1, -1);
+SHIFT_INIT(Shift13, shift4d, 13, 21, 21, 21, 21);
+SHIFT_INIT(Shift14, shift4d, 14, -21, -21, -21, -21);
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+////////////////////////////////// CPP ///////////////////////////////////
+//
+TEST(Shift, CPP) {
+    const unsigned resultIdx = 0;
+    const unsigned x         = 2;
+    const unsigned y         = 0;
+    const unsigned z         = 0;
+    const unsigned w         = 0;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/shift/shift4d.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array output = shift(input, x, y, z, w);
+
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], idims, output);
 }
 
-#define SHIFT_INIT(desc, file, resultIdx, x, y, z, w)                                       \
-    TYPED_TEST(Shift, desc)                                                                 \
-    {                                                                                       \
-        shiftTest<TypeParam>(string(TEST_DIR"/shift/"#file".test"), resultIdx, x, y, z, w); \
-    }
+TEST(Shift, MaxDim) {
+    const size_t largeDim  = 65535 * 32 + 1;
+    const unsigned shift_x = 1;
 
-SHIFT_INIT(Shift0,  shift4d, 0,    2,  0,  0,  0);
-    SHIFT_INIT(Shift1,  shift4d, 1,   -1,  0,  0,  0);
-    SHIFT_INIT(Shift2,  shift4d, 2,    3,  2,  0,  0);
-    SHIFT_INIT(Shift3,  shift4d, 3,   11, 22,  0,  0);
-    SHIFT_INIT(Shift4,  shift4d, 4,    0,  1,  0,  0);
-    SHIFT_INIT(Shift5,  shift4d, 5,    0, -6,  0,  0);
-    SHIFT_INIT(Shift6,  shift4d, 6,    0,  3,  1,  0);
-    SHIFT_INIT(Shift7,  shift4d, 7,    0,  0,  2,  0);
-    SHIFT_INIT(Shift8,  shift4d, 8,    0,  0, -2,  0);
-    SHIFT_INIT(Shift9,  shift4d, 9,    0,  0,  0,  1);
-    SHIFT_INIT(Shift10, shift4d, 10,   0,  0,  0, -1);
-    SHIFT_INIT(Shift11, shift4d, 11,   1,  1,  1,  1);
-    SHIFT_INIT(Shift12, shift4d, 12,  -1, -1, -1, -1);
-    SHIFT_INIT(Shift13, shift4d, 13,  21, 21, 21, 21);
-    SHIFT_INIT(Shift14, shift4d, 14, -21,-21,-21,-21);
+    array input  = range(dim4(2, largeDim));
+    array output = shift(input, shift_x);
 
+    output = abs(input - output);
+    ASSERT_EQ(1.f, product<float>(output));
 
-////////////////////////////////// CPP ///////////////////////////////////
-//
-TEST(Shift, CPP)
-{
-    if (noDoubleTests<float>()) return;
+    input  = range(dim4(2, 1, 1, largeDim));
+    output = shift(input, shift_x);
 
-    const unsigned resultIdx = 0;
-    const unsigned x = 2;
-    const unsigned y = 0;
-    const unsigned z = 0;
-    const unsigned w = 0;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/shift/shift4d.test"),numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
-    af::array output = af::shift(input, x, y, z, w);
-
-    // Get result
-    float* outData = new float[tests[resultIdx].size()];
-    output.host((void*)outData);
-
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
-    }
+    output = abs(input - output);
+    ASSERT_EQ(1.f, product<float>(output));
+}
 
-    // Delete
-    delete[] outData;
+TEST(Shift, RowVector) {
+    const unsigned shift_x = 1;
+    const unsigned shift_y = 1;
+    array input            = iota(dim4(1, 4));
+    array output           = shift(input, shift_x, shift_y);
+    vector<float> gold{3.f, 0.f, 1.f, 2.f};
+    EXPECT_VEC_ARRAY_EQ(gold, dim4(1, 4), output);
 }
diff --git a/test/sift.cpp b/test/sift.cpp
new file mode 100644
index 0000000000..b96325d672
--- /dev/null
+++ b/test/sift.cpp
@@ -0,0 +1,352 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <cmath>
+#include <string>
+#include <typeinfo>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::features;
+using af::loadImage;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+typedef struct {
+    float f[5];
+    unsigned d[128];
+} feat_desc_t;
+
+typedef struct {
+    float f[5];
+} feat_t;
+
+typedef struct {
+    float d[128];
+} desc_t;
+
+static bool feat_cmp(feat_desc_t i, feat_desc_t j) {
+    for (int k = 0; k < 5; k++)
+        if (round(i.f[k] * 1e1f) != round(j.f[k] * 1e1f))
+            return (round(i.f[k] * 1e1f) < round(j.f[k] * 1e1f));
+
+    return false;
+}
+
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               float* desc, unsigned nfeat) {
+    feat.resize(nfeat);
+    for (size_t i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = ori[i];
+        feat[i].f[4] = size[i];
+        for (unsigned j = 0; j < 128; j++) feat[i].d[j] = desc[i * 128 + j];
+    }
+}
+
+static void array_to_feat_desc(vector<feat_desc_t>& feat, float* x, float* y,
+                               float* score, float* ori, float* size,
+                               vector<vector<float>>& desc, unsigned nfeat) {
+    feat.resize(nfeat);
+    for (size_t i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = ori[i];
+        feat[i].f[4] = size[i];
+        for (unsigned j = 0; j < 128; j++) feat[i].d[j] = desc[i][j];
+    }
+}
+
+static void split_feat_desc(vector<feat_desc_t>& fd, vector<feat_t>& f,
+                            vector<desc_t>& d) {
+    f.resize(fd.size());
+    d.resize(fd.size());
+    for (size_t i = 0; i < fd.size(); i++) {
+        f[i].f[0] = fd[i].f[0];
+        f[i].f[1] = fd[i].f[1];
+        f[i].f[2] = fd[i].f[2];
+        f[i].f[3] = fd[i].f[3];
+        f[i].f[4] = fd[i].f[4];
+        for (unsigned j = 0; j < 128; j++) d[i].d[j] = fd[i].d[j];
+    }
+}
+
+static bool compareEuclidean(dim_t desc_len, dim_t ndesc, float* cpu,
+                             float* gpu, float unit_thr = 1.f,
+                             float euc_thr = 1.f) {
+    bool ret  = true;
+    float sum = 0.0f;
+
+    for (dim_t i = 0; i < ndesc; i++) {
+        sum = 0.0f;
+        for (dim_t l = 0; l < desc_len; l++) {
+            dim_t idx = i * desc_len + l;
+            float x   = (cpu[idx] - gpu[idx]);
+            sum += x * x;
+            if (abs(x) > (float)unit_thr) {
+                ret = false;
+                cout << endl << "@compareEuclidean: unit mismatch." << endl;
+                cout << "(cpu,gpu,cpu-gpu)[" << i << "," << l << "] : {"
+                     << cpu[idx] << "," << gpu[idx] << ","
+                     << cpu[idx] - gpu[idx] << "}" << endl;
+                cout << endl;
+                break;
+            }
+        }
+        if (sqrt(sum) > euc_thr) {
+            ret = false;
+            cout << endl << "@compareEuclidean: distance mismatch." << endl;
+            cout << "Euclidean distance: " << sqrt(sum) << endl;
+        }
+        if (ret == false) return ret;
+    }
+
+    return ret;
+}
+
+template<typename T>
+class SIFT : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+
+TYPED_TEST_SUITE(SIFT, TestTypes);
+
+template<typename T>
+void siftTest(string pTestFile, unsigned nLayers, float contrastThr,
+              float edgeThr, float initSigma, bool doubleInput) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<float>> goldDesc;
+
+    readImageFeaturesDescriptors<float>(pTestFile, inDims, inFiles, goldFeat,
+                                        goldDesc);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        af_array inArray_f32 = 0;
+        af_array inArray     = 0;
+        af_array desc        = 0;
+        af_features feat;
+
+        inFiles[testId].insert(0, string(TEST_DIR "/sift/"));
+
+        ASSERT_SUCCESS(
+            af_load_image(&inArray_f32, inFiles[testId].c_str(), false));
+        ASSERT_SUCCESS(conv_image<T>(&inArray, inArray_f32));
+
+        ASSERT_SUCCESS(af_sift(&feat, &desc, inArray, nLayers,
+                               contrastThr, edgeThr, initSigma,
+                               doubleInput, 1.f / 256.f, 0.05f));
+
+        dim_t n = 0;
+        af_array x, y, score, orientation, size;
+
+        ASSERT_SUCCESS(af_get_features_num(&n, feat));
+        ASSERT_SUCCESS(af_get_features_xpos(&x, feat));
+        ASSERT_SUCCESS(af_get_features_ypos(&y, feat));
+        ASSERT_SUCCESS(af_get_features_score(&score, feat));
+        ASSERT_SUCCESS(af_get_features_orientation(&orientation, feat));
+        ASSERT_SUCCESS(af_get_features_size(&size, feat));
+
+        float* outX           = new float[n];
+        float* outY           = new float[n];
+        float* outScore       = new float[n];
+        float* outOrientation = new float[n];
+        float* outSize        = new float[n];
+        dim_t descSize;
+        dim_t descDims[4];
+        ASSERT_SUCCESS(af_get_elements(&descSize, desc));
+        ASSERT_SUCCESS(af_get_dims(&descDims[0], &descDims[1], &descDims[2],
+                                   &descDims[3], desc));
+        float* outDesc = new float[descSize];
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outX, x));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outY, y));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outScore, score));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outOrientation, orientation));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outSize, size));
+        ASSERT_SUCCESS(af_get_data_ptr((void*)outDesc, desc));
+
+        vector<feat_desc_t> out_feat_desc;
+        array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                           outSize, outDesc, n);
+
+        vector<feat_desc_t> gold_feat_desc;
+        array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                           &goldFeat[1].front(), &goldFeat[2].front(),
+                           &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                           goldFeat[0].size());
+
+        std::stable_sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
+        std::stable_sort(gold_feat_desc.begin(), gold_feat_desc.end(),
+                         feat_cmp);
+
+        vector<feat_t> out_feat;
+        vector<desc_t> v_out_desc;
+        vector<feat_t> gold_feat;
+        vector<desc_t> v_gold_desc;
+
+        split_feat_desc(out_feat_desc, out_feat, v_out_desc);
+        split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
+
+        for (int elIter = 0; elIter < (int)n; elIter++) {
+            ASSERT_LE(fabs(out_feat[elIter].f[0] - gold_feat[elIter].f[0]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[1] - gold_feat[elIter].f[1]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]),
+                      1e-3)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]),
+                      0.5f)
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]),
+                      1e-3)
+                << "at: " << elIter << endl;
+        }
+
+        EXPECT_TRUE(compareEuclidean(descDims[0], descDims[1],
+                                     (float*)&v_out_desc[0],
+                                     (float*)&v_gold_desc[0], 2.f, 4.5f));
+
+        ASSERT_SUCCESS(af_release_array(inArray));
+        ASSERT_SUCCESS(af_release_array(inArray_f32));
+
+        ASSERT_SUCCESS(af_release_array(desc));
+        ASSERT_SUCCESS(af_release_features(feat));
+
+        delete[] outX;
+        delete[] outY;
+        delete[] outScore;
+        delete[] outOrientation;
+        delete[] outSize;
+        delete[] outDesc;
+    }
+}
+
+#define SIFT_INIT(desc, image, nLayers, contrastThr, edgeThr, initSigma,  \
+                  doubleInput)                                            \
+    TYPED_TEST(SIFT, desc) {                                              \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                           \
+        for (int i = 0; i < 1; i++)                                       \
+            siftTest<TypeParam>(string(TEST_DIR "/sift/" #image ".test"), \
+                                nLayers, contrastThr, edgeThr, initSigma, \
+                                doubleInput);                             \
+    }
+
+SIFT_INIT(Man_Default, man, 3, 0.04f, 10.0f, 1.6f, true);
+SIFT_INIT(Man_2Layers, man_2layers, 2, 0.04f, 10.0f, 1.6f, true);
+SIFT_INIT(Man_ContrastThr005, man_contrast005, 3, 0.05f, 10.0f, 1.6f, true);
+SIFT_INIT(Man_EdgeThr5, man_edge5, 3, 0.04f, 5.0f, 1.6f, true);
+SIFT_INIT(Man_InitSigma18, man_initsigma18, 3, 0.04f, 10.0f, 1.8f, true);
+SIFT_INIT(Man_NoDoubleInput, man_nodoubleinput, 3, 0.04f, 10.0f, 1.6f, false);
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(SIFT, CPP) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> goldFeat;
+    vector<vector<float>> goldDesc;
+
+    readImageFeaturesDescriptors<float>(string(TEST_DIR "/sift/man.test"),
+                                        inDims, inFiles, goldFeat, goldDesc);
+    inFiles[0].insert(0, string(TEST_DIR "/sift/"));
+
+    array in = loadImage(inFiles[0].c_str(), false);
+
+    features feat;
+    array desc;
+    sift(feat, desc, in, 3, 0.04f, 10.0f, 1.6f, true, 1.f / 256.f, 0.05f);
+
+    float* outX           = new float[feat.getNumFeatures()];
+    float* outY           = new float[feat.getNumFeatures()];
+    float* outScore       = new float[feat.getNumFeatures()];
+    float* outOrientation = new float[feat.getNumFeatures()];
+    float* outSize        = new float[feat.getNumFeatures()];
+    float* outDesc        = new float[desc.elements()];
+    dim4 descDims         = desc.dims();
+    feat.getX().host(outX);
+    feat.getY().host(outY);
+    feat.getScore().host(outScore);
+    feat.getOrientation().host(outOrientation);
+    feat.getSize().host(outSize);
+    desc.host(outDesc);
+
+    vector<feat_desc_t> out_feat_desc;
+    array_to_feat_desc(out_feat_desc, outX, outY, outScore, outOrientation,
+                       outSize, outDesc, feat.getNumFeatures());
+
+    vector<feat_desc_t> gold_feat_desc;
+    array_to_feat_desc(gold_feat_desc, &goldFeat[0].front(),
+                       &goldFeat[1].front(), &goldFeat[2].front(),
+                       &goldFeat[3].front(), &goldFeat[4].front(), goldDesc,
+                       goldFeat[0].size());
+
+    std::stable_sort(out_feat_desc.begin(), out_feat_desc.end(), feat_cmp);
+    std::stable_sort(gold_feat_desc.begin(), gold_feat_desc.end(), feat_cmp);
+
+    vector<feat_t> out_feat;
+    vector<desc_t> v_out_desc;
+    vector<feat_t> gold_feat;
+    vector<desc_t> v_gold_desc;
+
+    split_feat_desc(out_feat_desc, out_feat, v_out_desc);
+    split_feat_desc(gold_feat_desc, gold_feat, v_gold_desc);
+
+    for (int elIter = 0; elIter < (int)feat.getNumFeatures(); elIter++) {
+        ASSERT_LE(fabs(out_feat[elIter].f[0] - gold_feat[elIter].f[0]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[1] - gold_feat[elIter].f[1]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e-3)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[3] - gold_feat[elIter].f[3]), 0.5f)
+            << "at: " << elIter << endl;
+        ASSERT_LE(fabs(out_feat[elIter].f[4] - gold_feat[elIter].f[4]), 1e-3)
+            << "at: " << elIter << endl;
+    }
+
+    EXPECT_TRUE(compareEuclidean(descDims[0], descDims[1],
+                                 (float*)&v_out_desc[0],
+                                 (float*)&v_gold_desc[0], 2.f, 4.5f));
+
+    delete[] outX;
+    delete[] outY;
+    delete[] outScore;
+    delete[] outOrientation;
+    delete[] outSize;
+    delete[] outDesc;
+}
diff --git a/test/sobel.cpp b/test/sobel.cpp
index 2ec5ab01c2..72a70ddde3 100644
--- a/test/sobel.cpp
+++ b/test/sobel.cpp
@@ -7,91 +7,86 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
 using std::string;
 using std::vector;
 
 template<typename T>
-class Sobel : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Sobel : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 template<typename T>
-class Sobel_Integer : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {}
+class Sobel_Integer : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double> TestTypes;
-typedef ::testing::Types<int, unsigned, char, unsigned char> TestTypesInt;
+typedef ::testing::Types<int, unsigned, char, signed char, unsigned char, short,
+                         ushort>
+    TestTypesInt;
 
 // register the type list
-TYPED_TEST_CASE(Sobel, TestTypes);
-TYPED_TEST_CASE(Sobel_Integer, TestTypesInt);
+TYPED_TEST_SUITE(Sobel, TestTypes);
+TYPED_TEST_SUITE(Sobel_Integer, TestTypesInt);
 
 template<typename Ti, typename To>
-void testSobelDerivatives(string pTestFile)
-{
-    if (noDoubleTests<Ti>()) return;
+void testSobelDerivatives(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(Ti);
 
-    vector<af::dim4>  numDims;
-    vector<vector<Ti> >      in;
-    vector<vector<To> >   tests;
+    vector<dim4> numDims;
+    vector<vector<Ti>> in;
+    vector<vector<To>> tests;
 
-    readTests<Ti,To,int>(pTestFile, numDims, in, tests);
+    readTests<Ti, To, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 dims    = numDims[0];
+    dim4 dims        = numDims[0];
     af_array dxArray = 0;
     af_array dyArray = 0;
     af_array inArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()),
-                dims.ndims(), dims.get(), (af_dtype)af::dtype_traits<Ti>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<Ti>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_sobel_operator(&dxArray, &dyArray, inArray, 3));
-
-    To *dxData = new To[dims.elements()];
-    To *dyData = new To[dims.elements()];
-
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)dxData, dxArray));
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)dyData, dyArray));
+    ASSERT_SUCCESS(af_sobel_operator(&dxArray, &dyArray, inArray, 3));
 
     vector<To> currDXGoldBar = tests[0];
     vector<To> currDYGoldBar = tests[1];
-    size_t nElems = currDXGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currDXGoldBar[elIter], dxData[elIter])<< "at: " << elIter<< std::endl;
-    }
-    nElems = currDYGoldBar.size();
-    for (size_t elIter=0; elIter<nElems; ++elIter) {
-        ASSERT_EQ(currDYGoldBar[elIter], dyData[elIter])<< "at: " << elIter<< std::endl;
-    }
+
+    ASSERT_VEC_ARRAY_EQ(currDXGoldBar, dims, dxArray);
+    ASSERT_VEC_ARRAY_EQ(currDYGoldBar, dims, dyArray);
 
     // cleanup
-    delete[] dxData;
-    delete[] dyData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(dxArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(dyArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(dxArray));
+    ASSERT_SUCCESS(af_release_array(dyArray));
 }
 
-TYPED_TEST(Sobel, Rectangle)
-{
-    testSobelDerivatives<TypeParam, TypeParam>(string(TEST_DIR"/sobel/rectangle.test"));
+// The rectangle test data was generated with OpenCV, with the border type
+// set to cv.BORDER_REFLECT_101.
+
+TYPED_TEST(Sobel, Rectangle) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    testSobelDerivatives<TypeParam, TypeParam>(
+        string(TEST_DIR "/sobel/rectangle.test"));
 }
 
-TYPED_TEST(Sobel_Integer, Rectangle)
-{
-    testSobelDerivatives<TypeParam, int>(string(TEST_DIR"/sobel/rectangle.test"));
+TYPED_TEST(Sobel_Integer, Rectangle) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    testSobelDerivatives<TypeParam, int>(
+        string(TEST_DIR "/sobel/rectangle.test"));
 }
diff --git a/test/solve_common.hpp b/test/solve_common.hpp
new file mode 100644
index 0000000000..0eee3d7029
--- /dev/null
+++ b/test/solve_common.hpp
@@ -0,0 +1,132 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+
+#include <arrayfire.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::cdouble;
+using af::cfloat;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+void solveTester(const int m, const int n, const int k, double eps,
+                 int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    af::deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+#if 1
+    af::array A  = cpu_randu<T>(af::dim4(m, n));
+    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+#else
+    af::array A  = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+#endif
+    af::array B0 = af::matmul(A, X0);
+
+    //! [ex_solve]
+    af::array X1 = af::solve(A, B0);
+    //! [ex_solve]
+
+    //! [ex_solve_recon]
+    af::array B1 = af::matmul(A, X1);
+    //! [ex_solve_recon]
+
+    ASSERT_ARRAYS_NEAR(B0, B1, eps);
+}
+
+template<typename T>
+void solveLUTester(const int n, const int k, double eps,
+                   int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    af::deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+#if 1
+    af::array A  = cpu_randu<T>(af::dim4(n, n));
+    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+#else
+    af::array A  = af::randu(n, n, (af::dtype)af::dtype_traits<T>::af_type);
+    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+#endif
+    af::array B0 = af::matmul(A, X0);
+
+    //! [ex_solve_lu]
+    af::array A_lu, pivot;
+    af::lu(A_lu, pivot, A);
+    af::array X1 = af::solveLU(A_lu, pivot, B0);
+    //! [ex_solve_lu]
+
+    af::array B1 = af::matmul(A, X1);
+
+    ASSERT_ARRAYS_NEAR(B0, B1, eps);
+}
+
+template<typename T>
+void solveTriangleTester(const int n, const int k, bool is_upper, double eps,
+                         int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    af::deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+#if 1
+    af::array A  = cpu_randu<T>(af::dim4(n, n));
+    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+#else
+    af::array A  = af::randu(n, n, (af::dtype)af::dtype_traits<T>::af_type);
+    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+#endif
+
+    af::array L, U, pivot;
+    af::lu(L, U, pivot, A);
+
+    af::array AT = is_upper ? U : L;
+    af::array B0 = af::matmul(AT, X0);
+    af::array X1;
+
+    if (is_upper) {
+        //! [ex_solve_upper]
+        af::array X = af::solve(AT, B0, AF_MAT_UPPER);
+        //! [ex_solve_upper]
+
+        X1 = X;
+    } else {
+        //! [ex_solve_lower]
+        af::array X = af::solve(AT, B0, AF_MAT_LOWER);
+        //! [ex_solve_lower]
+
+        X1 = X;
+    }
+
+    af::array B1 = af::matmul(AT, X1);
+
+    ASSERT_ARRAYS_NEAR(B0, B1, eps);
+}
diff --git a/test/solve_dense.cpp b/test/solve_dense.cpp
index 8f2657098c..161aa7a212 100644
--- a/test/solve_dense.cpp
+++ b/test/solve_dense.cpp
@@ -7,181 +7,348 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
+// NOTE: Tests are known to fail on OSX when utilizing the CPU and OpenCL
+// backends for sizes larger than 128x128. You can read more about it in
+// issue https://github.com/arrayfire/arrayfire/issues/1617
+
 #include <gtest/gtest.h>
-#include <arrayfire.h>
-#include <af/dim4.hpp>
+
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
+#include <af/arith.h>
+#include <af/blas.h>
 #include <af/defines.h>
+#include <af/device.h>
+#include <af/dim4.hpp>
+#include <af/lapack.h>
 #include <af/traits.hpp>
-#include <vector>
+
+#include <cstdlib>
 #include <iostream>
-#include <complex>
 #include <string>
-#include <testHelpers.hpp>
+#include <thread>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::deviceGC;
+using af::dim4;
+using af::dtype_traits;
+using af::setDevice;
+using af::sum;
+using std::abs;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
-
-///////////////////////////////// CPP ////////////////////////////////////
-//
+using std::string;
+using std::vector;
 
 template<typename T>
-void solveTester(const int m, const int n, const int k, double eps)
-{
-    if (noDoubleTests<T>()) return;
+void solveTester(const int m, const int n, const int k, const int b, double eps,
+                 int targetDevice = -1) {
+    if (targetDevice >= 0) setDevice(targetDevice);
+
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
 #if 1
-    af::array A  = cpu_randu<T>(af::dim4(m, n));
-    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+    array A  = cpu_randu<T>(dim4(m, n, b));
+    array X0 = cpu_randu<T>(dim4(n, k, b));
 #else
-    af::array A  = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
-    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+    array A  = randu(m, n, (dtype)dtype_traits<T>::af_type);
+    array X0 = randu(n, k, (dtype)dtype_traits<T>::af_type);
 #endif
-    af::array B0 = af::matmul(A, X0);
+    array B0 = matmul(A, X0);
 
     //! [ex_solve]
-    af::array X1 = af::solve(A, B0);
+    array X1 = solve(A, B0);
     //! [ex_solve]
 
     //! [ex_solve_recon]
-    af::array B1 = af::matmul(A, X1);
+    array B1 = matmul(A, X1);
     //! [ex_solve_recon]
 
-    ASSERT_NEAR(0, af::sum<double>(af::abs(real(B0 - B1))) / (m * k), eps);
-    ASSERT_NEAR(0, af::sum<double>(af::abs(imag(B0 - B1))) / (m * k), eps);
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(abs(real(B0 - B1))) / (m * k),
+        eps);
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(abs(imag(B0 - B1))) / (m * k),
+        eps);
 }
 
 template<typename T>
-void solveLUTester(const int n, const int k, double eps)
-{
-    if (noDoubleTests<T>()) return;
+void solveLUTester(const int n, const int k, double eps,
+                   int targetDevice = -1) {
+    if (targetDevice >= 0) setDevice(targetDevice);
+
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
 #if 1
-    af::array A  = cpu_randu<T>(af::dim4(n, n));
-    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+    array A  = cpu_randu<T>(dim4(n, n));
+    array X0 = cpu_randu<T>(dim4(n, k));
 #else
-    af::array A  = af::randu(n, n, (af::dtype)af::dtype_traits<T>::af_type);
-    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+    array A  = randu(n, n, (dtype)dtype_traits<T>::af_type);
+    array X0 = randu(n, k, (dtype)dtype_traits<T>::af_type);
 #endif
-    af::array B0 = af::matmul(A, X0);
+    array B0 = matmul(A, X0);
 
     //! [ex_solve_lu]
-    af::array A_lu, pivot;
-    af::lu(A_lu, pivot, A);
-    af::array X1 = af::solveLU(A_lu, pivot, B0);
+    array A_lu, pivot;
+    lu(A_lu, pivot, A);
+    array X1 = solveLU(A_lu, pivot, B0);
     //! [ex_solve_lu]
 
-    af::array B1 = af::matmul(A, X1);
+    array B1 = matmul(A, X1);
 
-    ASSERT_NEAR(0, af::sum<double>(af::abs(real(B0 - B1))) / (n * k), eps);
-    ASSERT_NEAR(0, af::sum<double>(af::abs(imag(B0 - B1))) / (n * k), eps);
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(abs(real(B0 - B1))) / (n * k),
+        eps);
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(abs(imag(B0 - B1))) / (n * k),
+        eps);
 }
 
 template<typename T>
-void solveTriangleTester(const int n, const int k, bool is_upper, double eps)
-{
-    if (noDoubleTests<T>()) return;
+void solveTriangleTester(const int n, const int k, bool is_upper, double eps,
+                         int targetDevice = -1) {
+    if (targetDevice >= 0) setDevice(targetDevice);
+
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
 #if 1
-    af::array A  = cpu_randu<T>(af::dim4(n, n));
-    af::array X0 = cpu_randu<T>(af::dim4(n, k));
+    array A  = cpu_randu<T>(dim4(n, n));
+    array X0 = cpu_randu<T>(dim4(n, k));
 #else
-    af::array A  = af::randu(n, n, (af::dtype)af::dtype_traits<T>::af_type);
-    af::array X0 = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+    array A  = randu(n, n, (dtype)dtype_traits<T>::af_type);
+    array X0 = randu(n, k, (dtype)dtype_traits<T>::af_type);
 #endif
 
-    af::array L, U, pivot;
-    af::lu(L, U, pivot, A);
+    array L, U, pivot;
+    lu(L, U, pivot, A);
 
-    af::array AT = is_upper ? U : L;
-    af::array B0 = af::matmul(AT, X0);
-    af::array X1;
+    array AT = is_upper ? U : L;
+    array B0 = matmul(AT, X0);
+    array X1;
 
     if (is_upper) {
         //! [ex_solve_upper]
-        af::array X = af::solve(AT, B0, AF_MAT_UPPER);
+        array X = solve(AT, B0, AF_MAT_UPPER);
         //! [ex_solve_upper]
 
         X1 = X;
     } else {
         //! [ex_solve_lower]
-        af::array X = af::solve(AT, B0, AF_MAT_LOWER);
+        array X = solve(AT, B0, AF_MAT_LOWER);
         //! [ex_solve_lower]
 
         X1 = X;
     }
 
-    af::array B1 = af::matmul(AT, X1);
-
-    ASSERT_NEAR(0, af::sum<double>(af::abs(real(B0 - B1))) / (n * k), eps);
-    ASSERT_NEAR(0, af::sum<double>(af::abs(imag(B0 - B1))) / (n * k), eps);
-}
-
-#define SOLVE_TESTS(T, eps)                             \
-    TEST(SOLVE_LU, T##Reg)                              \
-    {                                                   \
-        solveLUTester<T>(1000, 100, eps);               \
-    }                                                   \
-    TEST(SOLVE_LU, T##RegMultiple)                      \
-    {                                                   \
-        solveLUTester<T>(2048, 512, eps);               \
-    }                                                   \
-    TEST(SOLVE_Upper, T##Reg)                           \
-    {                                                   \
-        solveTriangleTester<T>(1000, 100, true, eps);   \
-    }                                                   \
-    TEST(SOLVE_Upper, T##RegMultiple)                   \
-    {                                                   \
-        solveTriangleTester<T>(2048, 512, true, eps);   \
-    }                                                   \
-    TEST(SOLVE_Lower, T##Reg)                           \
-    {                                                   \
-        solveTriangleTester<T>(1000, 100, false, eps);  \
-    }                                                   \
-    TEST(SOLVE_Lower, T##RegMultiple)                   \
-    {                                                   \
-        solveTriangleTester<T>(2048, 512, false, eps);  \
-    }                                                   \
-    TEST(SOLVE, T##Square)                              \
-    {                                                   \
-        solveTester<T>(1000, 1000, 100, eps);           \
-    }                                                   \
-    TEST(SOLVE, T##SquareMultiple)                      \
-    {                                                   \
-        solveTester<T>(2048, 2048, 512, eps);           \
-    }                                                   \
-    TEST(SOLVE, T##RectUnder)                           \
-    {                                                   \
-        solveTester<T>(800, 1000, 200, eps);            \
-    }                                                   \
-    TEST(SOLVE, T##RectUnderMultiple)                   \
-    {                                                   \
-        solveTester<T>(1536, 2048, 400, eps);           \
-    }                                                   \
-    TEST(SOLVE, T##RectOverMultiple)                    \
-    {                                                   \
-        solveTester<T>(1536, 1024, 1, eps);             \
-    }
+    array B1 = matmul(AT, X1);
 
-SOLVE_TESTS(float, 0.01)
-SOLVE_TESTS(double, 1E-5)
-SOLVE_TESTS(cfloat, 0.01)
-SOLVE_TESTS(cdouble, 1E-5)
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(af::abs(real(B0 - B1))) /
+            (n * k),
+        eps);
+    ASSERT_NEAR(
+        0,
+        sum<typename dtype_traits<T>::base_type>(af::abs(imag(B0 - B1))) /
+            (n * k),
+        eps);
+}
 
-#undef SOLVE_TESTS
+template<typename T>
+class Solve : public ::testing::Test {};
+
+typedef ::testing::Types<float, cfloat, double, cdouble> TestTypes;
+TYPED_TEST_SUITE(Solve, TestTypes);
+
+template<typename T>
+double eps();
+
+template<>
+double eps<float>() {
+    return 0.01f;
+}
+
+template<>
+double eps<double>() {
+    return 1e-5;
+}
+
+template<>
+double eps<cfloat>() {
+    return 0.015f;
+}
+
+template<>
+double eps<cdouble>() {
+    return 1e-5;
+}
+
+TYPED_TEST(Solve, Square) {
+    solveTester<TypeParam>(100, 100, 10, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareMultipleOfTwo) {
+    solveTester<TypeParam>(96, 96, 16, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareLarge) {
+    solveTester<TypeParam>(1000, 1000, 10, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareMultipleOfTwoLarge) {
+    solveTester<TypeParam>(2048, 2048, 32, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareBatch) {
+    solveTester<TypeParam>(100, 100, 10, 10, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareMultipleOfTwoBatch) {
+    solveTester<TypeParam>(96, 96, 16, 10, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareLargeBatch) {
+    solveTester<TypeParam>(1000, 1000, 10, 10, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, SquareMultipleOfTwoLargeBatch) {
+    solveTester<TypeParam>(2048, 2048, 32, 10, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresUnderDetermined) {
+    solveTester<TypeParam>(80, 100, 20, 1, eps<TypeParam>());
+}
 
-#define SOLVE_TESTS(T, eps)                     \
-    TEST(SOLVE, T##RectOver)                    \
-    {                                           \
-        solveTester<T>(800, 600, 50, eps);      \
+TYPED_TEST(Solve, LeastSquaresUnderDeterminedMultipleOfTwo) {
+    solveTester<TypeParam>(96, 128, 40, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresUnderDeterminedLarge) {
+    solveTester<TypeParam>(800, 1000, 200, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresUnderDeterminedMultipleOfTwoLarge) {
+    solveTester<TypeParam>(1536, 2048, 400, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresOverDetermined) {
+    solveTester<TypeParam>(80, 60, 20, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresOverDeterminedMultipleOfTwo) {
+    solveTester<TypeParam>(96, 64, 1, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresOverDeterminedLarge) {
+    solveTester<TypeParam>(800, 600, 64, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LeastSquaresOverDeterminedMultipleOfTwoLarge) {
+    solveTester<TypeParam>(1536, 1024, 1, 1, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LU) { solveLUTester<TypeParam>(100, 10, eps<TypeParam>()); }
+
+TYPED_TEST(Solve, LUMultipleOfTwo) {
+    solveLUTester<TypeParam>(96, 64, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LULarge) {
+    solveLUTester<TypeParam>(1000, 100, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, LUMultipleOfTwoLarge) {
+    solveLUTester<TypeParam>(2048, 512, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleUpper) {
+    solveTriangleTester<TypeParam>(100, 10, true, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleUpperMultipleOfTwo) {
+    solveTriangleTester<TypeParam>(96, 64, true, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleUpperLarge) {
+    solveTriangleTester<TypeParam>(1000, 100, true, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleUpperMultipleOfTwoLarge) {
+    solveTriangleTester<TypeParam>(2048, 512, true, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleLower) {
+    solveTriangleTester<TypeParam>(100, 10, false, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleLowerMultipleOfTwo) {
+    solveTriangleTester<TypeParam>(96, 64, false, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleLowerLarge) {
+    solveTriangleTester<TypeParam>(1000, 100, false, eps<TypeParam>());
+}
+
+TYPED_TEST(Solve, TriangleLowerMultipleOfTwoLarge) {
+    solveTriangleTester<TypeParam>(2048, 512, false, eps<TypeParam>());
+}
+
+#if !defined(AF_OPENCL)
+int nextTargetDeviceId() {
+    static int nextId = 0;
+    return nextId++;
+}
+
+#define SOLVE_LU_TESTS_THREADING(T, eps)                              \
+    tests.emplace_back(solveLUTester<T>, 1000, 100, eps,              \
+                       nextTargetDeviceId() % numDevices);            \
+    tests.emplace_back(solveTriangleTester<T>, 1000, 100, true, eps,  \
+                       nextTargetDeviceId() % numDevices);            \
+    tests.emplace_back(solveTriangleTester<T>, 1000, 100, false, eps, \
+                       nextTargetDeviceId() % numDevices);            \
+    tests.emplace_back(solveTester<T>, 1000, 1000, 100, 1, eps,       \
+                       nextTargetDeviceId() % numDevices);            \
+    tests.emplace_back(solveTester<T>, 800, 1000, 200, 1, eps,        \
+                       nextTargetDeviceId() % numDevices);            \
+    tests.emplace_back(solveTester<T>, 800, 600, 64, 1, eps,          \
+                       nextTargetDeviceId() % numDevices);
+
+TEST(Solve, Threading) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    SOLVE_LU_TESTS_THREADING(float, 0.01);
+    SOLVE_LU_TESTS_THREADING(cfloat, 0.01);
+    if (!noDoubleTests(f64)) {
+        SOLVE_LU_TESTS_THREADING(double, 1E-5);
+        SOLVE_LU_TESTS_THREADING(cdouble, 1E-5);
     }
 
-SOLVE_TESTS(float, 0.01)
-SOLVE_TESTS(double, 1E-5)
-// Fails on Windows on some devices
-#if !(defined(OS_WIN) && defined(AF_OPENCL))
-SOLVE_TESTS(cfloat, 0.01)
-SOLVE_TESTS(cdouble, 1E-5)
-#endif
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
 
+#undef SOLVE_LU_TESTS_THREADING
+#endif
 #undef SOLVE_TESTS
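Reviewer note on the residual check in solveTriangleTester: the test builds B0 = AT * X0, solves for X1, recomputes B1 = AT * X1, and requires the mean absolute difference between B0 and B1 to stay below eps. The sketch below reproduces that round trip in plain C++ for a single lower-triangular 3x3 system, without ArrayFire. The names matVec, solveLower, meanAbsDiff and lowerRoundTripOk are hypothetical helpers written for this note and are not part of the patch; the forward substitution stands in for solve(..., AF_MAT_LOWER) and is not a description of how the library implements it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Row-major n x n matrix times a length-n vector.
std::vector<double> matVec(const std::vector<double>& a,
                           const std::vector<double>& x) {
    const std::size_t n = x.size();
    std::vector<double> y(n, 0.0);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) y[i] += a[i * n + j] * x[j];
    return y;
}

// Forward substitution for a lower-triangular system L x = b
// (hypothetical stand-in for a triangular solve).
std::vector<double> solveLower(const std::vector<double>& l,
                               const std::vector<double>& b) {
    const std::size_t n = b.size();
    std::vector<double> x(n, 0.0);
    for (std::size_t i = 0; i < n; ++i) {
        double acc = b[i];
        for (std::size_t j = 0; j < i; ++j) acc -= l[i * n + j] * x[j];
        x[i] = acc / l[i * n + i];
    }
    return x;
}

// Mean absolute difference, the same shape as sum(abs(B0 - B1)) / (n * k).
double meanAbsDiff(const std::vector<double>& p,
                   const std::vector<double>& q) {
    double total = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i) total += std::fabs(p[i] - q[i]);
    return total / static_cast<double>(p.size());
}

// Hypothetical driver: true when the round trip stays within eps.
bool lowerRoundTripOk(double eps) {
    const std::vector<double> l  = {2, 0, 0, 1, 3, 0, 4, 5, 6};
    const std::vector<double> x0 = {1, -2, 0.5};
    const std::vector<double> b0 = matVec(l, x0);
    const std::vector<double> x1 = solveLower(l, b0);
    const std::vector<double> b1 = matVec(l, x1);
    return meanAbsDiff(b0, b1) < eps;
}
```

Comparing B0 with B1 instead of X0 with X1 keeps the check meaningful for ill-conditioned inputs, where the recovered solution can drift while the residual stays small.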
diff --git a/test/sort.cpp b/test/sort.cpp
index 7377d2a9f8..bd60edb5b5 100644
--- a/test/sort.cpp
+++ b/test/sort.cpp
@@ -7,130 +7,173 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Sort : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Sort : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, uint, int, uchar> TestTypes;
+typedef ::testing::Types<float, double, uint, int, schar, uchar, short, ushort,
+                         intl, uintl>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Sort, TestTypes);
+TYPED_TEST_SUITE(Sort, TestTypes);
 
 template<typename T>
-void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0,
+              bool isSubRef = false, const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<float> > tests;
-    readTests<T, float, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<float>> tests;
+    readTests<T, float, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
+    af_array inArray   = 0;
     af_array tempArray = 0;
-    af_array sxArray = 0;
+    af_array sxArray   = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_sort(&sxArray, inArray, 0, dir));
+    ASSERT_SUCCESS(af_sort(&sxArray, inArray, 0, dir));
 
     size_t nElems = tests[resultIdx0].size();
 
     // Get result
     T* sxData = new T[tests[resultIdx0].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)sxData, sxArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)sxData, sxArray));
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter])
+            << "at: " << elIter << endl;
     }
 
     // Delete
     delete[] sxData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(sxArray   != 0) af_release_array(sxArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (sxArray != 0) af_release_array(sxArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define SORT_INIT(desc, file, dir, resultIdx0)                                      \
-    TYPED_TEST(Sort, desc)                                                          \
-    {                                                                               \
-        sortTest<TypeParam>(string(TEST_DIR"/sort/"#file".test"), dir, resultIdx0); \
+#define SORT_INIT(desc, file, dir, resultIdx0)                            \
+    TYPED_TEST(Sort, desc) {                                              \
+        sortTest<TypeParam>(string(TEST_DIR "/sort/" #file ".test"), dir, \
+                            resultIdx0);                                  \
     }
 
-    // Using same inputs as sort_index. So just skipping the index results
-    SORT_INIT(Sort0True,  sort, true, 0);
-    SORT_INIT(Sort0False, sort,false, 2);
+// Uses the same inputs as sort_index, so the index results are skipped
+SORT_INIT(Sort0True, sort, true, 0);
+SORT_INIT(Sort0False, sort, false, 2);
 
-    SORT_INIT(Sort2d0False, basic_2d, true, 0);
+SORT_INIT(Sort2d0False, basic_2d, true, 0);
 
-    SORT_INIT(Sort10x10True,  sort_10x10, true,  0);
-    SORT_INIT(Sort10x10False, sort_10x10, false, 2);
-    SORT_INIT(Sort1000True,   sort_1000,  true,  0);
-    SORT_INIT(Sort1000False,  sort_1000,  false, 2);
-    SORT_INIT(SortMedTrue,    sort_med1,  true,  0);
-    SORT_INIT(SortMedFalse,   sort_med1,  false, 2);
-    // Takes too much time in current implementation. Enable when everything is parallel
-    //SORT_INIT(SortMed5True,   sort_med,   true,  0);
-    //SORT_INIT(SortMed5False,  sort_med,   false, 2);
-    //SORT_INIT(SortLargeTrue,  sort_large, true,  0);
-    //SORT_INIT(SortLargeFalse, sort_large, false, 2);
+SORT_INIT(Sort10x10True, sort_10x10, true, 0);
+SORT_INIT(Sort10x10False, sort_10x10, false, 2);
+SORT_INIT(Sort1000True, sort_1000, true, 0);
+SORT_INIT(Sort1000False, sort_1000, false, 2);
+SORT_INIT(SortMedTrue, sort_med1, true, 0);
+SORT_INIT(SortMedFalse, sort_med1, false, 2);
 
+SORT_INIT(SortMed5True, sort_med, true, 0);
+SORT_INIT(SortMed5False, sort_med, false, 2);
+SORT_INIT(SortLargeTrue, sort_large, true, 0);
+SORT_INIT(SortLargeFalse, sort_large, false, 2);
 
 ////////////////////////////////////// CPP ////////////////////////////////
 //
-TEST(Sort, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(Sort, CPPDim0) {
+    const bool dir            = true;
+    const unsigned resultIdx0 = 0;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_10x10.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+
+    array output = sort(input, 0, dir);
+
+    size_t nElems = tests[resultIdx0].size();
+
+    // Get result
+    float* sxData = new float[tests[resultIdx0].size()];
+    output.host((void*)sxData);
 
-    const bool dir = true;
+    // Compare result
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // Delete
+    delete[] sxData;
+}
+
+TEST(Sort, CPPDim1) {
+    const bool dir            = true;
     const unsigned resultIdx0 = 0;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/sort/sort_10x10.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_10x10.test"),
+                                 numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
 
-    af::array output = af::sort(input, 0, dir);
+    array input_ = reorder(input, 1, 0, 2, 3);
+
+    array output = sort(input_, 1, dir);
+
+    output =
+        reorder(output, 1, 0, 2, 3);  // Required for checking with test data
 
     size_t nElems = tests[resultIdx0].size();
 
@@ -140,10 +183,46 @@ TEST(Sort, CPP)
 
     // Compare result
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter]) << "at: " << elIter << std::endl;
+        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter])
+            << "at: " << elIter << endl;
     }
 
     // Delete
     delete[] sxData;
 }
 
+TEST(Sort, CPPDim2) {
+    const bool dir            = false;
+    const unsigned resultIdx0 = 2;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_med.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+
+    array input_ = reorder(input, 1, 2, 0, 3);
+
+    array output = sort(input_, 2, dir);
+
+    output =
+        reorder(output, 2, 0, 1, 3);  // Required for checking with test data
+
+    size_t nElems = tests[resultIdx0].size();
+
+    // Get result
+    float* sxData = new float[tests[resultIdx0].size()];
+    output.host((void*)sxData);
+
+    // Compare result
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // Delete
+    delete[] sxData;
+}
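Reviewer note on the reorder pairs in CPPDim1 and CPPDim2: the tests reorder the input, sort along dimension 1 or 2, and then reorder back so the result can be compared with the dim-0 reference data. The second reorder has to be the inverse permutation of the first. The sketch below checks that relationship in plain C++, assuming the convention that output axis i takes input axis p[i]; that convention is an assumption of this note, and Perm, invert and permuteDims are hypothetical names, not ArrayFire API.

```cpp
#include <array>
#include <cassert>

using Perm = std::array<int, 4>;
using Dims = std::array<long long, 4>;

// Inverse permutation: inv[p[i]] = i for every axis i.
Perm invert(const Perm& p) {
    Perm inv{};
    for (int i = 0; i < 4; ++i) inv[p[i]] = i;
    return inv;
}

// Shape after a reorder-style permutation: out[i] = dims[p[i]]
// (assumed convention, see the note above).
Dims permuteDims(const Dims& dims, const Perm& p) {
    Dims out{};
    for (int i = 0; i < 4; ++i) out[i] = dims[p[i]];
    return out;
}
```

Under this convention (1, 0, 2, 3) is its own inverse, which matches CPPDim1 using the same arguments twice, while (1, 2, 0, 3) inverts to (2, 0, 1, 3), which matches CPPDim2.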
diff --git a/test/sort_by_key.cpp b/test/sort_by_key.cpp
index 4f817aad9d..265ee570b7 100644
--- a/test/sort_by_key.cpp
+++ b/test/sort_by_key.cpp
@@ -7,53 +7,57 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Sort : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class SortByKey : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, uint, int, uchar> TestTypes;
+typedef ::testing::Types<float, double, uint, int, schar, uchar, short, ushort,
+                         intl, uintl>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Sort, TestTypes);
+TYPED_TEST_SUITE(SortByKey, TestTypes);
 
 template<typename T>
-void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0, const unsigned resultIdx1, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0,
+              const unsigned resultIdx1, bool isSubRef = false) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<float> > tests;
-    readTests<T, float, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
     af_array ikeyArray = 0;
     af_array ivalArray = 0;
@@ -62,110 +66,133 @@ void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0, const
     af_array ovalArray = 0;
 
     if (isSubRef) {
-        //ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        // ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+        // idims.ndims(), idims.get(), (af_dtype) dtype_traits<T>::af_type));
 
-        //ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        // ASSERT_SUCCESS(af_index(&inArray, tempArray, seqv->size(),
+        // &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&ikeyArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&ivalArray, &(in[1].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&ikeyArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&ivalArray, &(in[1].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_sort_by_key(&okeyArray, &ovalArray, ikeyArray, ivalArray, 0, dir));
-
-    size_t nElems = tests[resultIdx0].size();
-
-    // Get result
-    T* keyData = new T[tests[resultIdx0].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)keyData, okeyArray));
+    ASSERT_SUCCESS(
+        af_sort_by_key(&okeyArray, &ovalArray, ikeyArray, ivalArray, 0, dir));
 
     // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], keyData[elIter]) << "at: " << elIter << std::endl;
-    }
-
-    T* valData = new T[tests[resultIdx1].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)valData, ovalArray));
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, okeyArray);
 
-#ifndef AF_OPENCL
+#ifdef AF_OPENCL
+    UNUSED(resultIdx1);
+#else
     // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], valData[elIter]) << "at: " << elIter << std::endl;
-    }
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx1], idims, ovalArray);
 #endif
 
-    // Delete
-    delete[] keyData;
-    delete[] valData;
-
-    if(ikeyArray != 0) af_release_array(ikeyArray);
-    if(ivalArray != 0) af_release_array(ivalArray);
-    if(okeyArray != 0) af_release_array(okeyArray);
-    if(ovalArray != 0) af_release_array(ovalArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (ikeyArray != 0) af_release_array(ikeyArray);
+    if (ivalArray != 0) af_release_array(ivalArray);
+    if (okeyArray != 0) af_release_array(okeyArray);
+    if (ovalArray != 0) af_release_array(ovalArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define SORT_INIT(desc, file, dir, resultIdx0, resultIdx1)                                       \
-    TYPED_TEST(Sort, desc)                                                                       \
-    {                                                                                            \
-        sortTest<TypeParam>(string(TEST_DIR"/sort/"#file".test"), dir, resultIdx0, resultIdx1);  \
+#define SORT_INIT(desc, file, dir, resultIdx0, resultIdx1)                \
+    TYPED_TEST(SortByKey, desc) {                                         \
+        sortTest<TypeParam>(string(TEST_DIR "/sort/" #file ".test"), dir, \
+                            resultIdx0, resultIdx1);                      \
     }
 
-    SORT_INIT(Sort0True,      sort_by_key_tiny,  true,  0, 1);
-    SORT_INIT(Sort0False,     sort_by_key_tiny,  false, 2, 3);
-    SORT_INIT(Sort10x10True,  sort_by_key_2D,    true,  0, 1);
-    SORT_INIT(Sort10x10False, sort_by_key_2D,    false, 2, 3);
-    SORT_INIT(Sort1000True,   sort_by_key_1000,  true,  0, 1);
-    SORT_INIT(Sort1000False,  sort_by_key_1000,  false, 2, 3);
-    SORT_INIT(SortMedTrue,    sort_by_key_med,   true,  0, 1);
-    SORT_INIT(SortMedFalse,   sort_by_key_med,   false, 2, 3);
-    // Takes too much time in current implementation. Enable when everything is parallel
-    //SORT_INIT(SortLargeTrue,  sort_by_key_large, true,  0, 1);
-    //SORT_INIT(SortLargeFalse, sort_by_key_large, false, 2, 3);
-
-
+SORT_INIT(Sort0True, sort_by_key_tiny, true, 0, 1);
+SORT_INIT(Sort0False, sort_by_key_tiny, false, 2, 3);
+SORT_INIT(Sort10x10True, sort_by_key_2D, true, 0, 1);
+SORT_INIT(Sort10x10False, sort_by_key_2D, false, 2, 3);
+SORT_INIT(Sort1000True, sort_by_key_1000, true, 0, 1);
+SORT_INIT(SortMedTrue, sort_by_key_med, true, 0, 1);
+SORT_INIT(Sort1000False, sort_by_key_1000, false, 2, 3);
+SORT_INIT(SortMedFalse, sort_by_key_med, false, 2, 3);
 
+SORT_INIT(SortLargeTrue, sort_by_key_large, true, 0, 1);
+SORT_INIT(SortLargeFalse, sort_by_key_large, false, 2, 3);
 
 ////////////////////////////////////// CPP ///////////////////////////////
 //
-TEST(SortByKey, CPP)
-{
-    if (noDoubleTests<float>()) return;
-
-    const bool dir = true;
+TEST(SortByKey, CPPDim0) {
+    const bool dir            = true;
     const unsigned resultIdx0 = 0;
     const unsigned resultIdx1 = 1;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/sort/sort_by_key_tiny.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_by_key_tiny.test"),
+                                 numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array keys(idims, &(in[0].front()));
-    af::array vals(idims, &(in[1].front()));
-    af::array out_keys, out_vals;
-    af::sort(out_keys, out_vals, keys, vals, 0, dir);
+    dim4 idims = numDims[0];
+    array keys(idims, &(in[0].front()));
+    array vals(idims, &(in[1].front()));
+    array out_keys, out_vals;
+    sort(out_keys, out_vals, keys, vals, 0, dir);
 
-    size_t nElems = tests[resultIdx0].size();
-    // Get result
-    float* keyData = new float[tests[resultIdx0].size()];
-    out_keys.host((void*)keyData);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, out_keys);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx1], idims, out_vals);
+}
 
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], keyData[elIter]) << "at: " << elIter << std::endl;
-    }
+TEST(SortByKey, CPPDim1) {
+    const bool dir            = true;
+    const unsigned resultIdx0 = 0;
+    const unsigned resultIdx1 = 1;
 
-    float* valData = new float[tests[resultIdx1].size()];
-    out_vals.host((void*)valData);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(
+        string(TEST_DIR "/sort/sort_by_key_large.test"), numDims, in, tests);
 
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], valData[elIter]) << "at: " << elIter << std::endl;
-    }
+    dim4 idims = numDims[0];
+    array keys(idims, &(in[0].front()));
+    array vals(idims, &(in[1].front()));
+
+    array keys_ = reorder(keys, 1, 0, 2, 3);
+    array vals_ = reorder(vals, 1, 0, 2, 3);
+
+    array out_keys, out_vals;
+    sort(out_keys, out_vals, keys_, vals_, 1, dir);
 
-    // Delete
-    delete[] keyData;
-    delete[] valData;
+    out_keys = reorder(out_keys, 1, 0, 2, 3);
+    out_vals = reorder(out_vals, 1, 0, 2, 3);
+
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, out_keys);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx1], idims, out_vals);
 }
 
+TEST(SortByKey, CPPDim2) {
+    const bool dir            = false;
+    const unsigned resultIdx0 = 2;
+    const unsigned resultIdx1 = 3;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(
+        string(TEST_DIR "/sort/sort_by_key_large.test"), numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array keys(idims, &(in[0].front()));
+    array vals(idims, &(in[1].front()));
+
+    array keys_ = reorder(keys, 1, 2, 0, 3);
+    array vals_ = reorder(vals, 1, 2, 0, 3);
+
+    array out_keys, out_vals;
+    sort(out_keys, out_vals, keys_, vals_, 2, dir);
+
+    out_keys = reorder(out_keys, 2, 0, 1, 3);
+    out_vals = reorder(out_vals, 2, 0, 1, 3);
+
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, out_keys);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx1], idims, out_vals);
+}
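Reviewer note on what the SortByKey tests assert: the keys come back ordered according to dir, and each value travels with its key. A minimal host-side reference in plain C++ is sketched below. sortByKey is a hypothetical helper written for this note; it uses a stable sort, so equal keys keep their input order, which is an assumption of the sketch and not a statement about how ArrayFire orders ties.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical reference helper: sorts keys and carries vals along with them.
std::pair<std::vector<float>, std::vector<float>> sortByKey(
    const std::vector<float>& keys, const std::vector<float>& vals,
    bool ascending) {
    // Sort a list of positions so keys and values can be gathered together.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return ascending ? keys[a] < keys[b]
                                          : keys[a] > keys[b];
                     });

    std::vector<float> outKeys, outVals;
    outKeys.reserve(keys.size());
    outVals.reserve(vals.size());
    for (std::size_t i : order) {
        outKeys.push_back(keys[i]);
        outVals.push_back(vals[i]);
    }
    return {outKeys, outVals};
}
```

A reference like this only pins down the value order when the keys are distinct; with duplicate keys, different valid implementations may return the matching values in different orders.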
diff --git a/test/sort_index.cpp b/test/sort_index.cpp
index f4296266fd..5e1b88a97d 100644
--- a/test/sort_index.cpp
+++ b/test/sort_index.cpp
@@ -7,167 +7,198 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
 using std::cout;
 using std::endl;
-using af::cfloat;
-using af::cdouble;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Sort : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class SortIndex : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, uint, int, uchar> TestTypes;
+typedef ::testing::Types<float, double, uint, int, schar, uchar, short, ushort,
+                         intl, uintl>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Sort, TestTypes);
+TYPED_TEST_SUITE(SortIndex, TestTypes);
 
 template<typename T>
-void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0, const unsigned resultIdx1, bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void sortTest(string pTestFile, const bool dir, const unsigned resultIdx0,
+              const unsigned resultIdx1, bool isSubRef = false,
+              const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<float> > tests;
-    readTests<T, float, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<float>> tests;
+    readTests<T, float, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
+    af_array inArray   = 0;
     af_array tempArray = 0;
-    af_array sxArray = 0;
-    af_array ixArray = 0;
+    af_array sxArray   = 0;
+    af_array ixArray   = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_sort_index(&sxArray, &ixArray, inArray, 0, dir));
-
-    size_t nElems = tests[resultIdx0].size();
+    ASSERT_SUCCESS(af_sort_index(&sxArray, &ixArray, inArray, 0, dir));
 
-    // Get result
-    T* sxData = new T[tests[resultIdx0].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)sxData, sxArray));
+    vector<T> sxTest(tests[resultIdx0].size());
+    transform(tests[resultIdx0].begin(), tests[resultIdx0].end(),
+              sxTest.begin(), convert_to<T, float>);
 
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter]) << "at: " << elIter << std::endl;
-    }
-
-    // Get result
-    unsigned* ixData = new unsigned[tests[resultIdx1].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)ixData, ixArray));
+    ASSERT_VEC_ARRAY_EQ(sxTest, idims, sxArray);
 
-#ifndef AF_OPENCL
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], ixData[elIter]) << "at: " << elIter << std::endl;
-    }
+#ifdef AF_OPENCL
+    UNUSED(resultIdx1);
+#else
+    vector<unsigned> ixTest(tests[resultIdx1].begin(), tests[resultIdx1].end());
+    ASSERT_VEC_ARRAY_EQ(ixTest, idims, ixArray);
 #endif
 
-    // Delete
-    delete[] sxData;
-    delete[] ixData;
-
-    if(inArray   != 0) af_release_array(inArray);
-    if(sxArray   != 0) af_release_array(sxArray);
-    if(ixArray   != 0) af_release_array(ixArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (sxArray != 0) af_release_array(sxArray);
+    if (ixArray != 0) af_release_array(ixArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-#define SORT_INIT(desc, file, dir, resultIdx0, resultIdx1)                                       \
-    TYPED_TEST(Sort, desc)                                                                       \
-    {                                                                                            \
-        sortTest<TypeParam>(string(TEST_DIR"/sort/"#file".test"), dir, resultIdx0, resultIdx1);  \
+#define SORT_INIT(desc, file, dir, resultIdx0, resultIdx1)                \
+    TYPED_TEST(SortIndex, desc) {                                         \
+        sortTest<TypeParam>(string(TEST_DIR "/sort/" #file ".test"), dir, \
+                            resultIdx0, resultIdx1);                      \
     }
 
-    SORT_INIT(Sort0True,  sort, true, 0, 1);
-    SORT_INIT(Sort0False, sort,false, 2, 3);
+SORT_INIT(Sort0True, sort, true, 0, 1);
+SORT_INIT(Sort0False, sort, false, 2, 3);
 
-    SORT_INIT(Sort2d0False, basic_2d, true, 0, 1);
+SORT_INIT(Sort2d0False, basic_2d, true, 0, 1);
 
-    SORT_INIT(Sort10x10True,  sort_10x10, true,  0, 1);
-    SORT_INIT(Sort10x10False, sort_10x10, false, 2, 3);
-    SORT_INIT(Sort1000True,   sort_1000,  true,  0, 1);
-    SORT_INIT(Sort1000False,  sort_1000,  false, 2, 3);
-    SORT_INIT(SortMedTrue,    sort_med1,  true,  0, 1);
-    SORT_INIT(SortMedFalse,   sort_med1,  false, 2, 3);
-    // Takes too much time in current implementation. Enable when everything is parallel
-    //SORT_INIT(SortMed5True,   sort_med,   true,  0, 1);
-    //SORT_INIT(SortMed5False,  sort_med,   false, 2, 3);
-    //SORT_INIT(SortLargeTrue,  sort_large, true,  0, 1);
-    //SORT_INIT(SortLargeFalse, sort_large, false, 2, 3);
-;
+SORT_INIT(Sort10x10True, sort_10x10, true, 0, 1);
+SORT_INIT(Sort10x10False, sort_10x10, false, 2, 3);
+SORT_INIT(Sort1000True, sort_1000, true, 0, 1);
+SORT_INIT(SortMedTrue, sort_med1, true, 0, 1);
+SORT_INIT(Sort1000False, sort_1000, false, 2, 3);
+SORT_INIT(SortMedFalse, sort_med1, false, 2, 3);
 
+SORT_INIT(SortMed5True, sort_med, true, 0, 1);
+SORT_INIT(SortMed5False, sort_med, false, 2, 3);
+SORT_INIT(SortLargeTrue, sort_large, true, 0, 1);
+SORT_INIT(SortLargeFalse, sort_large, false, 2, 3);
 
 //////////////////////////////////// CPP /////////////////////////////////
 //
-TEST(SortIndex, CPP)
-{
-    if (noDoubleTests<float>()) return;
+TEST(SortIndex, CPPDim0) {
+    const bool dir            = true;
+    const unsigned resultIdx0 = 0;
+    const unsigned resultIdx1 = 1;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_10x10.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array outValues, outIndices;
+    sort(outValues, outIndices, input, 0, dir);
+
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, outValues);
+
+    vector<unsigned> ixTest(tests[resultIdx1].size());
+    transform(tests[resultIdx1].begin(), tests[resultIdx1].end(),
+              ixTest.begin(), convert_to<unsigned, float>);
 
-    const bool dir = true;
+    ASSERT_VEC_ARRAY_EQ(ixTest, idims, outIndices);
+}
+
+TEST(SortIndex, CPPDim1) {
+    const bool dir            = true;
     const unsigned resultIdx0 = 0;
     const unsigned resultIdx1 = 1;
 
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/sort/sort_10x10.test"),numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_10x10.test"),
+                                 numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
-    af::array outValues, outIndices;
-    af::sort(outValues, outIndices, input, 0, dir);
+    dim4 idims = numDims[0];
+    array input_(idims, &(in[0].front()));
+    array input = reorder(input_, 1, 0, 2, 3);
 
-    size_t nElems = tests[resultIdx0].size();
+    array outValues, outIndices;
+    sort(outValues, outIndices, input, 1, dir);
 
-    // Get result
-    float* sxData = new float[tests[resultIdx0].size()];
-    outValues.host((void*)sxData);
+    outValues  = reorder(outValues, 1, 0, 2, 3);
+    outIndices = reorder(outIndices, 1, 0, 2, 3);
 
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx0][elIter], sxData[elIter]) << "at: " << elIter << std::endl;
-    }
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, outValues);
 
-    // Get result
-    unsigned* ixData = new unsigned[tests[resultIdx1].size()];
-    outIndices.host((void*)ixData);
+    vector<unsigned> ixTest(tests[resultIdx1].begin(), tests[resultIdx1].end());
+    ASSERT_VEC_ARRAY_EQ(ixTest, idims, outIndices);
+}
 
-    // Compare result
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx1][elIter], ixData[elIter]) << "at: " << elIter << std::endl;
-    }
+TEST(SortIndex, CPPDim2) {
+    const bool dir            = false;
+    const unsigned resultIdx0 = 2;
+    const unsigned resultIdx1 = 3;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/sort/sort_med.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input_(idims, &(in[0].front()));
+    array input = reorder(input_, 1, 2, 0, 3);
+
+    array outValues, outIndices;
+    sort(outValues, outIndices, input, 2, dir);
+
+    outValues  = reorder(outValues, 2, 0, 1, 3);
+    outIndices = reorder(outIndices, 2, 0, 1, 3);
+
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx0], idims, outValues);
 
-    // Delete
-    delete[] sxData;
-    delete[] ixData;
+    vector<unsigned> ixTest(tests[resultIdx1].begin(), tests[resultIdx1].end());
+    ASSERT_VEC_ARRAY_EQ(ixTest, idims, outIndices);
 }
diff --git a/test/sparse.cpp b/test/sparse.cpp
new file mode 100644
index 0000000000..f1e1b67d72
--- /dev/null
+++ b/test/sparse.cpp
@@ -0,0 +1,476 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <gtest/gtest.h>
+#include <sparse_common.hpp>
+#include <testHelpers.hpp>
+
+using af::allTrue;
+using af::array;
+using af::deviceMemInfo;
+using af::dim4;
+using af::dtype_traits;
+using af::identity;
+using af::randu;
+using af::span;
+using af::seq;
+
+#define SPARSE_TESTS(T, eps)                                                \
+    TEST(Sparse, T##Square) { sparseTester<T>(1000, 1000, 100, 5, eps); }   \
+    TEST(Sparse, T##RectMultiple) {                                         \
+        sparseTester<T>(2048, 1024, 512, 3, eps);                           \
+    }                                                                       \
+    TEST(Sparse, T##RectDense) { sparseTester<T>(500, 1000, 250, 1, eps); } \
+    TEST(Sparse, T##MatVec) { sparseTester<T>(625, 1331, 1, 2, eps); }      \
+    TEST(Sparse, Transpose_##T##MatVec) {                                   \
+        sparseTransposeTester<T>(625, 1331, 1, 2, eps);                     \
+    }                                                                       \
+    TEST(Sparse, Transpose_##T##Square) {                                   \
+        sparseTransposeTester<T>(1000, 1000, 100, 5, eps);                  \
+    }                                                                       \
+    TEST(Sparse, Transpose_##T##RectMultiple) {                             \
+        sparseTransposeTester<T>(2048, 1024, 512, 3, eps);                  \
+    }                                                                       \
+    TEST(Sparse, Transpose_##T##RectDense) {                                \
+        sparseTransposeTester<T>(453, 751, 397, 1, eps);                    \
+    }                                                                       \
+    TEST(Sparse, T##ConvertCSR) { convertCSR<T>(2345, 5678, 0.5); }
+
+SPARSE_TESTS(float, 1E-3)
+SPARSE_TESTS(double, 1E-5)
+SPARSE_TESTS(cfloat, 1E-3)
+SPARSE_TESTS(cdouble, 1E-5)
+
+#undef SPARSE_TESTS
+
+#define CREATE_TESTS(STYPE) \
+    TEST(Sparse, Create_##STYPE) { createFunction<STYPE>(); }
+
+CREATE_TESTS(AF_STORAGE_CSR)
+CREATE_TESTS(AF_STORAGE_COO)
+
+#undef CREATE_TESTS
+
+TEST(Sparse, Create_AF_STORAGE_CSC) {
+    array d = identity(3, 3);
+
+    af_array out = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_create_sparse_array_from_dense(&out, d.get(), AF_STORAGE_CSC));
+
+    if (out != 0) af_release_array(out);
+}
+
+#define CAST_TESTS_TYPES(Ti, To, SUFFIX, M, N, F) \
+    TEST(Sparse, Cast_##Ti##_##To##_##SUFFIX) {   \
+        sparseCastTester<Ti, To>(M, N, F);        \
+    }
+
+#define CAST_TESTS(Ti, To)                     \
+    CAST_TESTS_TYPES(Ti, To, 1, 1000, 1000, 5) \
+    CAST_TESTS_TYPES(Ti, To, 2, 512, 1024, 2)
+
+CAST_TESTS(float, float)
+CAST_TESTS(float, double)
+CAST_TESTS(float, cfloat)
+CAST_TESTS(float, cdouble)
+
+CAST_TESTS(double, float)
+CAST_TESTS(double, double)
+CAST_TESTS(double, cfloat)
+CAST_TESTS(double, cdouble)
+
+CAST_TESTS(cfloat, cfloat)
+CAST_TESTS(cfloat, cdouble)
+
+CAST_TESTS(cdouble, cfloat)
+CAST_TESTS(cdouble, cdouble)
+
+TEST(Sparse, ISSUE_1745) {
+    using af::where;
+
+    array A    = randu(4, 4);
+    A(1, span) = 0;
+    A(2, span) = 0;
+
+    array idx     = where(A);
+    array data    = A(idx);
+    array row_idx = (idx / A.dims()[0]).as(s64);
+    array col_idx = (idx % A.dims()[0]).as(s64);
+
+    af_array A_sparse;
+    ASSERT_EQ(AF_ERR_ARG, af_create_sparse_array(
+                              &A_sparse, A.dims(0), A.dims(1), data.get(),
+                              row_idx.get(), col_idx.get(), AF_STORAGE_CSR));
+}
+
+TEST(Sparse, offsets_work_csr_to_dense_ISSUE_1918) {
+    array reference(2,2);
+    reference(0, span) = 0;
+    reference(1, span) = 2;
+    float value[] = { 1, 1, 2, 2 };
+    int row_csr[] = { 0, 2, 2, 0, 0, 2 };
+    int col[] = { 0, 1, 0, 1 };
+    array values(4, 1, value, afHost);
+    array rows_csr(6, 1, row_csr, afHost);
+    array cols(4, 1, col, afHost);
+    array S_csr;
+
+    S_csr = sparse(2, 2, values(seq(2, 3)), rows_csr(seq(3, 5)), cols(seq(2, 3)));
+    array output_csr = dense(S_csr);
+
+    EXPECT_ARRAYS_EQ(reference, output_csr);
+}
+
+TEST(Sparse, offsets_work_coo_to_dense_ISSUE_1918) {
+    array reference(2,2);
+    reference(0, span) = 0;
+    reference(1, span) = 2;
+    float value[] = { 1, 1, 2, 2 };
+    int row_coo[] = { 0, 0, 1, 1 };
+    int col[] = { 0, 1, 0, 1 };
+    array values(4, 1, value, afHost);
+    array rows_coo(4, 1, row_coo, afHost);
+    array cols(4, 1, col, afHost);
+    array S_coo;
+
+    S_coo = sparse(2, 2, values(seq(2, 3)), rows_coo(seq(2, 3)), cols(seq(2, 3)), AF_STORAGE_COO);
+    array output_coo = dense(S_coo);
+
+    EXPECT_ARRAYS_EQ(reference, output_coo);
+}
+
+TEST(Sparse, ISSUE_2134_COO) {
+    int rows[]     = {0, 0, 0, 1, 1, 2, 2};
+    int cols[]     = {0, 1, 2, 0, 1, 0, 2};
+    float values[] = {3, 3, 4, 3, 10, 4, 3};
+    array row(7, rows);
+    array col(7, cols);
+    array value(7, values);
+    af_array A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSR));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSC));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_SUCCESS,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_COO));
+    if (A != 0) af_release_array(A);
+}
+
+TEST(Sparse, ISSUE_2134_CSR) {
+    int rows[]     = {0, 3, 5, 7};
+    int cols[]     = {0, 1, 2, 0, 1, 0, 2};
+    float values[] = {3, 3, 4, 3, 10, 4, 3};
+    array row(4, rows);
+    array col(7, cols);
+    array value(7, values);
+    af_array A = 0;
+    EXPECT_EQ(AF_SUCCESS,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSR));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSC));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_COO));
+    if (A != 0) af_release_array(A);
+}
+
+TEST(Sparse, ISSUE_2134_CSC) {
+    int rows[]     = {0, 0, 0, 1, 1, 2, 2};
+    int cols[]     = {0, 3, 5, 7};
+    float values[] = {3, 3, 4, 3, 10, 4, 3};
+    array row(7, rows);
+    array col(4, cols);
+    array value(7, values);
+    af_array A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSR));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_SUCCESS,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_CSC));
+    if (A != 0) af_release_array(A);
+    A = 0;
+    EXPECT_EQ(AF_ERR_SIZE,
+              af_create_sparse_array(&A, 3, 3, value.get(), row.get(),
+                                     col.get(), AF_STORAGE_COO));
+    if (A != 0) af_release_array(A);
+}
+
+template<typename T>
+class Sparse : public ::testing::Test {};
+
+typedef ::testing::Types<float, cfloat, double, cdouble> SparseTypes;
+TYPED_TEST_SUITE(Sparse, SparseTypes);
+
+TYPED_TEST(Sparse, DeepCopy) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    cleanSlate();
+
+    array s;
+    {
+        // Create a sparse array from a dense array. Make sure that the dense
+        // arrays are removed
+        array dense = randu(10, 10);
+        array d     = makeSparse<TypeParam>(dense, 5);
+        s           = sparse(d);
+    }
+
+    // At this point only the sparse array will be allocated in memory.
+    // Determine how much memory is allocated by one sparse array
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+    size_t size_of_alloc      = lock_bytes;
+    size_t buffers_per_sparse = lock_buffers;
+
+    {
+        array s2 = s.copy();
+        s2.eval();
+
+        // Make sure that the deep copy allocated additional memory
+        deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+        EXPECT_NE(s.get(), s2.get()) << "The sparse arrays point to the same "
+                                        "af_array object.";
+        EXPECT_EQ(size_of_alloc * 2, lock_bytes)
+            << "The number of bytes allocated by the deep copy does "
+               "not match the original array";
+
+        EXPECT_EQ(buffers_per_sparse * 2, lock_buffers)
+            << "The number of buffers allocated by the deep "
+               "copy does not match the original array";
+        array d  = dense(s);
+        array d2 = dense(s2);
+        ASSERT_ARRAYS_EQ(d, d2);
+    }
+}
+
+TYPED_TEST(Sparse, Empty) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    af_array ret = 0;
+    dim_t rows = 0, cols = 0, nnz = 0;
+    EXPECT_EQ(AF_SUCCESS, af_create_sparse_array_from_ptr(
+                              &ret, rows, cols, nnz, NULL, NULL, NULL,
+                              (af_dtype)dtype_traits<TypeParam>::af_type,
+                              AF_STORAGE_CSR, afHost));
+    bool sparse = false;
+    EXPECT_EQ(AF_SUCCESS, af_is_sparse(&sparse, ret));
+    EXPECT_EQ(true, sparse);
+    EXPECT_EQ(AF_SUCCESS, af_release_array(ret));
+}
+
+TYPED_TEST(Sparse, EmptyDeepCopy) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    array a = sparse(0, 0, array(0, (af_dtype)dtype_traits<TypeParam>::af_type),
+                     array(1, s32), array(0, s32));
+    EXPECT_TRUE(a.issparse());
+    EXPECT_EQ(0, sparseGetNNZ(a));
+
+    array b = a.copy();
+    EXPECT_TRUE(b.issparse());
+    EXPECT_EQ(0, sparseGetNNZ(b));
+}
+
+TEST(Sparse, CPPSparseFromHostArrays) {
+    //! [ex_sparse_host_arrays]
+
+    float vals[]  = {5, 8, 3, 6};
+    int row_ptr[] = {0, 0, 2, 3, 4};
+    int col_idx[] = {0, 1, 2, 1};
+    const int M = 4, N = 4, nnz = 4;
+
+    // Create sparse array (CSR) from host pointers to values, row
+    // pointers, and column indices.
+    array sparse = af::sparse(M, N, nnz, vals, row_ptr, col_idx, f32,
+                              AF_STORAGE_CSR, afHost);
+
+    // sparse
+    //     values:  [ 5.0, 8.0, 3.0, 6.0 ]
+    //     row_ptr: [ 0, 0, 2, 3, 4 ]
+    //     col_idx: [ 0, 1, 2, 1 ]
+
+    //! [ex_sparse_host_arrays]
+
+    array sparse_vals, sparse_row_ptr, sparse_col_idx;
+    af::storage sparse_storage;
+    sparseGetInfo(sparse_vals, sparse_row_ptr, sparse_col_idx, sparse_storage,
+                  sparse);
+
+    ASSERT_ARRAYS_EQ(sparse_vals, array(dim4(nnz, 1), vals));
+    ASSERT_ARRAYS_EQ(sparse_row_ptr, array(dim4(M + 1, 1), row_ptr));
+    ASSERT_ARRAYS_EQ(sparse_col_idx, array(dim4(nnz, 1), col_idx));
+    ASSERT_EQ(sparse_storage, AF_STORAGE_CSR);
+    ASSERT_EQ(sparseGetNNZ(sparse), nnz);
+}
+
+TEST(Sparse, CPPSparseFromAFArrays) {
+    //! [ex_sparse_af_arrays]
+
+    float v[]   = {5, 8, 3, 6};
+    int r[]     = {0, 0, 2, 3, 4};
+    int c[]     = {0, 1, 2, 1};
+    const int M = 4, N = 4, nnz = 4;
+    array vals    = array(dim4(nnz), v);
+    array row_ptr = array(dim4(M + 1), r);
+    array col_idx = array(dim4(nnz), c);
+
+    // Create sparse array (CSR) from af::arrays containing values,
+    // row pointers, and column indices.
+    array sparse = af::sparse(M, N, vals, row_ptr, col_idx, AF_STORAGE_CSR);
+
+    // sparse
+    //     values:  [ 5.0, 8.0, 3.0, 6.0 ]
+    //     row_ptr: [ 0, 0, 2, 3, 4 ]
+    //     col_idx: [ 0, 1, 2, 1 ]
+
+    //! [ex_sparse_af_arrays]
+
+    array sparse_vals, sparse_row_ptr, sparse_col_idx;
+    af::storage sparse_storage;
+    sparseGetInfo(sparse_vals, sparse_row_ptr, sparse_col_idx, sparse_storage,
+                  sparse);
+
+    ASSERT_ARRAYS_EQ(sparse_vals, vals);
+    ASSERT_ARRAYS_EQ(sparse_row_ptr, row_ptr);
+    ASSERT_ARRAYS_EQ(sparse_col_idx, col_idx);
+    ASSERT_EQ(sparse_storage, AF_STORAGE_CSR);
+    ASSERT_EQ(sparseGetNNZ(sparse), nnz);
+}
+
+TEST(Sparse, CPPSparseFromDenseUsage) {
+    float dns[] = {0, 5, 0, 0, 0, 8, 0, 6, 0, 0, 3, 0, 0, 0, 0, 0};
+    const int M = 4, N = 4, nnz = 4;
+    array dense(dim4(M, N), dns);
+
+    //! [ex_sparse_from_dense]
+
+    // dense
+    //     0     0     0     0
+    //     5     8     0     0
+    //     0     0     3     0
+    //     0     6     0     0
+
+    // Convert dense af::array to its sparse (CSR) representation.
+    array sparse = af::sparse(dense, AF_STORAGE_CSR);
+
+    // sparse
+    //     values:  [ 5.0, 8.0, 3.0, 6.0 ]
+    //     row_ptr: [ 0, 0, 2, 3, 4 ]
+    //     col_idx: [ 0, 1, 2, 1 ]
+
+    //! [ex_sparse_from_dense]
+
+    float v[] = {5, 8, 3, 6};
+    int r[]   = {0, 0, 2, 3, 4};
+    int c[]   = {0, 1, 2, 1};
+    array gold_vals(dim4(nnz), v);
+    array gold_row_ptr(dim4(M + 1), r);
+    array gold_col_idx(dim4(nnz), c);
+
+    array sparse_vals, sparse_row_ptr, sparse_col_idx;
+    af::storage sparse_storage;
+    sparseGetInfo(sparse_vals, sparse_row_ptr, sparse_col_idx, sparse_storage,
+                  sparse);
+
+    ASSERT_ARRAYS_EQ(sparse_vals, gold_vals);
+    ASSERT_ARRAYS_EQ(sparse_row_ptr, gold_row_ptr);
+    ASSERT_ARRAYS_EQ(sparse_col_idx, gold_col_idx);
+    ASSERT_EQ(sparse_storage, AF_STORAGE_CSR);
+    ASSERT_EQ(sparseGetNNZ(sparse), nnz);
+}
+
+TEST(Sparse, CPPDenseToSparseToDenseUsage) {
+    float g[]   = {0, 5, 0, 0, 0, 8, 0, 6, 0, 0, 3, 0, 0, 0, 0, 0};
+    const int M = 4, N = 4;
+    array in(dim4(M, N), g);
+    array sparse = af::sparse(in, AF_STORAGE_CSR);
+
+    //! [ex_dense_from_sparse]
+
+    // sparse
+    //     values:  [ 5.0, 8.0, 3.0, 6.0 ]
+    //     row_ptr: [ 0, 0, 2, 3, 4 ]
+    //     col_idx: [ 0, 1, 2, 1 ]
+
+    // Get dense representation of given sparse af::array.
+    array dense = af::dense(sparse);
+
+    // dense
+    //     0     0     0     0
+    //     5     8     0     0
+    //     0     0     3     0
+    //     0     6     0     0
+
+    //! [ex_dense_from_sparse]
+
+    float v[]     = {5, 8, 3, 6};
+    int r[]       = {0, 0, 2, 3, 4};
+    int c[]       = {0, 1, 2, 1};
+    const int nnz = 4;
+    array gold_vals(dim4(nnz), v);
+    array gold_row_ptr(dim4(M + 1), r);
+    array gold_col_idx(dim4(nnz), c);
+
+    array sparse_vals, sparse_row_ptr, sparse_col_idx;
+    af::storage sparse_storage;
+    sparseGetInfo(sparse_vals, sparse_row_ptr, sparse_col_idx, sparse_storage,
+                  sparse);
+
+    ASSERT_ARRAYS_EQ(sparse_vals, gold_vals);
+    ASSERT_ARRAYS_EQ(sparse_row_ptr, gold_row_ptr);
+    ASSERT_ARRAYS_EQ(sparse_col_idx, gold_col_idx);
+    ASSERT_EQ(sparse_storage, AF_STORAGE_CSR);
+    ASSERT_EQ(sparseGetNNZ(sparse), nnz);
+
+    // Check dense array
+    array gold(dim4(M, N), g);
+    ASSERT_ARRAYS_EQ(in, gold);
+    ASSERT_ARRAYS_EQ(dense, gold);
+}
+
+TEST(Sparse, CPPDenseToSparseConversions) {
+    array in      = af::randu(200, 200);
+    in(in < 0.75) = 0;
+
+    array coo_sparse_arr = af::sparse(in, AF_STORAGE_COO);
+    array csr_sparse_arr = af::sparse(in, AF_STORAGE_CSR);
+
+    array coo_dense_arr = af::dense(coo_sparse_arr);
+    array csr_dense_arr = af::dense(csr_sparse_arr);
+
+    ASSERT_ARRAYS_EQ(in, coo_dense_arr);
+    ASSERT_ARRAYS_EQ(in, csr_dense_arr);
+
+    array non_zero   = af::flat(in)(af::where(in));
+    array non_zero_T = af::flat(in.T())(af::where(in.T()));
+    ASSERT_ARRAYS_EQ(non_zero, af::sparseGetValues(coo_sparse_arr));
+    ASSERT_ARRAYS_EQ(
+        non_zero_T,
+        af::sparseGetValues(csr_sparse_arr));  // csr values are transposed
+}
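Note on the CSR layout used by the usage tests in test/sparse.cpp: the same triple (values, row_ptr, col_idx) appears in several of them, paired with a 4x4 column-major dense buffer. The sketch below shows how that triple maps back to the dense buffer, using only the standard library. `csrToDense` is a hypothetical helper written for illustration; it is not an ArrayFire function and makes no claim about how `af::dense` is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ArrayFire): expands a CSR matrix into a
// dense, column-major buffer, the layout the dense arrays in the tests use.
std::vector<float> csrToDense(int rows, int cols,
                              const std::vector<float>& values,
                              const std::vector<int>& rowPtr,
                              const std::vector<int>& colIdx) {
    // CSR stores one offset per row plus a final end offset.
    assert(rowPtr.size() == static_cast<std::size_t>(rows) + 1);
    // Every stored value has exactly one column index.
    assert(values.size() == colIdx.size());

    std::vector<float> dense(static_cast<std::size_t>(rows) * cols, 0.0f);
    for (int r = 0; r < rows; ++r) {
        // [rowPtr[r], rowPtr[r + 1]) delimits the non-zeros stored for row r.
        for (int k = rowPtr[r]; k < rowPtr[r + 1]; ++k) {
            // Column-major addressing: element (r, c) lives at c * rows + r.
            dense[static_cast<std::size_t>(colIdx[k]) * rows + r] = values[k];
        }
    }
    return dense;
}
```

Calling `csrToDense(4, 4, {5, 8, 3, 6}, {0, 0, 2, 3, 4}, {0, 1, 2, 1})` yields the 16-element buffer `{0, 5, 0, 0, 0, 8, 0, 6, 0, 0, 3, 0, 0, 0, 0, 0}` that the tests pass to `array dense(dim4(M, N), dns)`. The `rows + 1` length of `rowPtr` matches the shape the ISSUE_2134 tests use: for a 3x3 matrix the CSR case that succeeds passes 4 row offsets, and the cases that pass 7 expect `AF_ERR_SIZE`.
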
diff --git a/test/sparse_arith.cpp b/test/sparse_arith.cpp
new file mode 100644
index 0000000000..8415effed5
--- /dev/null
+++ b/test/sparse_arith.cpp
@@ -0,0 +1,362 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::deviceGC;
+using af::dim4;
+using af::freeHost;
+using af::max;
+using af::sum;
+using std::abs;
+using std::string;
+using std::vector;
+
+template<typename T>
+array makeSparse(array A, int factor) {
+    A = floor(A * 1000);
+    A = A * ((A % factor) == 0) / 1000;
+    return A;
+}
+
+template<>
+array makeSparse<cfloat>(array A, int factor) {
+    array r = real(A);
+    r       = floor(r * 1000);
+    r       = r * ((r % factor) == 0) / 1000;
+
+    array i = r / 2;
+
+    A = complex(r, i);
+    return A;
+}
+
+template<>
+array makeSparse<cdouble>(array A, int factor) {
+    array r = real(A);
+    r       = floor(r * 1000);
+    r       = r * ((r % factor) == 0) / 1000;
+
+    array i = r / 2;
+
+    A = complex(r, i);
+    return A;
+}
+
+typedef enum {
+    af_add_t,
+    af_sub_t,
+    af_mul_t,
+    af_div_t,
+} af_op_t;
+
+template<af_op_t op>
+struct arith_op;
+
+template<>
+struct arith_op<af_add_t> {
+    array operator()(array v1, array v2) { return v1 + v2; }
+};
+
+template<>
+struct arith_op<af_sub_t> {
+    array operator()(array v1, array v2) { return v1 - v2; }
+};
+
+template<>
+struct arith_op<af_mul_t> {
+    array operator()(array v1, array v2) { return v1 * v2; }
+};
+
+template<>
+struct arith_op<af_div_t> {
+    array operator()(array v1, array v2) { return v1 / v2; }
+};
+
+template<typename T, af_op_t op>
+void sparseArithTester(const int m, const int n, int factor, const double eps) {
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    array A = cpu_randu<T>(dim4(m, n));
+    array B = cpu_randu<T>(dim4(m, n));
+#else
+    array A = randu(m, n, (dtype)dtype_traits<T>::af_type);
+    array B = randu(m, n, (dtype)dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+
+    array RA = sparse(A, AF_STORAGE_CSR);
+    array OA = sparse(A, AF_STORAGE_COO);
+
+    // Arith Op
+    array resR = arith_op<op>()(RA, B);
+    array resO = arith_op<op>()(OA, B);
+    array resD = arith_op<op>()(A, B);
+
+    array revR = arith_op<op>()(B, RA);
+    array revO = arith_op<op>()(B, OA);
+    array revD = arith_op<op>()(B, A);
+
+    ASSERT_ARRAYS_NEAR(resD, resR, eps);
+    ASSERT_ARRAYS_NEAR(resD, resO, eps);
+    ASSERT_ARRAYS_NEAR(revD, revR, eps);
+    ASSERT_ARRAYS_NEAR(revD, revO, eps);
+}
+
+// Mul
+template<typename T>
+void sparseArithTesterMul(const int m, const int n, int factor,
+                          const double eps) {
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    array A = cpu_randu<T>(dim4(m, n));
+    array B = cpu_randu<T>(dim4(m, n));
+#else
+    array A = randu(m, n, (dtype)dtype_traits<T>::af_type);
+    array B = randu(m, n, (dtype)dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+
+    array RA = sparse(A, AF_STORAGE_CSR);
+    array OA = sparse(A, AF_STORAGE_COO);
+
+    // Forward
+    {
+        // Arith Op
+        array resR = arith_op<af_mul_t>()(RA, B);
+        array resO = arith_op<af_mul_t>()(OA, B);
+
+        // We will test this by converting the COO to CSR and CSR to COO and
+        // comparing them. In essence, we are comparing the resR and resO
+        // TODO: Make a better comparison using dense
+
+        // Check resR against conR
+        array conR = sparseConvertTo(resR, AF_STORAGE_CSR);
+        ASSERT_ARRAYS_NEAR(resR, conR, eps);
+
+        // Check resO against conO
+        array conO = sparseConvertTo(resR, AF_STORAGE_COO);
+        ASSERT_ARRAYS_NEAR(resO, conO, eps);
+    }
+
+    // Reverse
+    {
+        // Arith Op
+        array resR = arith_op<af_mul_t>()(B, RA);
+        array resO = arith_op<af_mul_t>()(B, OA);
+
+        // We will test this by converting the COO to CSR and CSR to COO and
+        // comparing them. In essence, we are comparing the resR and resO
+        // TODO: Make a better comparison using dense
+
+        // Check resR against conR
+        array conR = sparseConvertTo(resR, AF_STORAGE_CSR);
+        ASSERT_ARRAYS_NEAR(resR, conR, eps);
+
+        // Check resO against conO
+        array conO = sparseConvertTo(resR, AF_STORAGE_COO);
+        ASSERT_ARRAYS_NEAR(resO, conO, eps);
+    }
+}
+
+// Div
+template<typename T>
+void sparseArithTesterDiv(const int m, const int n, int factor,
+                          const double eps) {
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    array A = cpu_randu<T>(dim4(m, n));
+    array B = cpu_randu<T>(dim4(m, n));
+#else
+    array A = randu(m, n, (dtype)dtype_traits<T>::af_type);
+    array B = randu(m, n, (dtype)dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+
+    array RA = sparse(A, AF_STORAGE_CSR);
+    array OA = sparse(A, AF_STORAGE_COO);
+
+    // Arith Op
+    array resR = arith_op<af_div_t>()(RA, B);
+    array resO = arith_op<af_div_t>()(OA, B);
+
+    // Assert division by sparse is not allowed
+    af_array out_temp = 0;
+    ASSERT_EQ(AF_ERR_NOT_SUPPORTED,
+              af_div(&out_temp, B.get(), RA.get(), false));
+    ASSERT_EQ(AF_ERR_NOT_SUPPORTED,
+              af_div(&out_temp, B.get(), OA.get(), false));
+    if (out_temp != 0) af_release_array(out_temp);
+
+    // We will test this by converting the COO to CSR and CSR to COO and
+    // comparing them. In essence, we are comparing the resR and resO
+    // TODO: Make a better comparison using dense
+
+    // Check resR against conR
+    array conR = sparseConvertTo(resR, AF_STORAGE_CSR);
+    ASSERT_ARRAYS_EQ(resR, conR);
+
+    // Check resO against conO
+    array conO = sparseConvertTo(resR, AF_STORAGE_COO);
+    ASSERT_ARRAYS_EQ(resO, conO);
+}
+
+#define ARITH_TESTS_OPS(T, M, N, F, EPS)              \
+    TEST(SPARSE_ARITH, T##_ADD_##M##_##N) {           \
+        sparseArithTester<T, af_add_t>(M, N, F, EPS); \
+    }                                                 \
+    TEST(SPARSE_ARITH, T##_SUB_##M##_##N) {           \
+        sparseArithTester<T, af_sub_t>(M, N, F, EPS); \
+    }                                                 \
+    TEST(SPARSE_ARITH, T##_MUL_##M##_##N) {           \
+        sparseArithTesterMul<T>(M, N, F, EPS);        \
+    }                                                 \
+    TEST(SPARSE_ARITH, T##_DIV_##M##_##N) {           \
+        sparseArithTesterDiv<T>(M, N, F, EPS);        \
+    }
+
+#define ARITH_TESTS(T, eps)                \
+    ARITH_TESTS_OPS(T, 10, 10, 5, eps)     \
+    ARITH_TESTS_OPS(T, 1024, 1024, 5, eps) \
+    ARITH_TESTS_OPS(T, 100, 100, 1, eps)   \
+    ARITH_TESTS_OPS(T, 2048, 1000, 6, eps) \
+    ARITH_TESTS_OPS(T, 123, 278, 5, eps)
+
+ARITH_TESTS(float, 1e-6)
+ARITH_TESTS(double, 1e-6)
+ARITH_TESTS(cfloat, 1e-4)  // This is mostly for complex division in OpenCL
+ARITH_TESTS(cdouble, 1e-6)
+
+// Sparse-Sparse Arithmetic testing function
+template<typename T, af_op_t op>
+void ssArithmetic(const int m, const int n, int factor, const double eps) {
+    deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    array A = cpu_randu<T>(dim4(m, n));
+    array B = cpu_randu<T>(dim4(m, n));
+#else
+    array A = randu(m, n, (dtype)dtype_traits<T>::af_type);
+    array B = randu(m, n, (dtype)dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+    B = makeSparse<T>(B, factor);
+
+    array spA = sparse(A, AF_STORAGE_CSR);
+    array spB = sparse(B, AF_STORAGE_CSR);
+
+    arith_op<op> binOp;
+
+    // Arith Op
+    array resS = binOp(spA, spB);
+    array resD = binOp(A, B);
+    ASSERT_ARRAYS_NEAR(resD, resS, eps);
+
+    array revS = binOp(spB, spA);
+    array revD = binOp(B, A);
+    ASSERT_ARRAYS_NEAR(revD, revS, eps);
+}
+
+#define SP_SP_ARITH_TEST(type, m, n, factor, eps)           \
+    TEST(SparseSparseArith, type##_Addition_##m##_##n) {    \
+        ssArithmetic<type, af_add_t>(m, n, factor, eps);    \
+    }                                                       \
+    TEST(SparseSparseArith, type##_Subtraction_##m##_##n) { \
+        ssArithmetic<type, af_sub_t>(m, n, factor, eps);    \
+    }
+
+#define SP_SP_ARITH_TESTS(T, eps)           \
+    SP_SP_ARITH_TEST(T, 10, 10, 5, eps)     \
+    SP_SP_ARITH_TEST(T, 1024, 1024, 5, eps) \
+    SP_SP_ARITH_TEST(T, 100, 100, 1, eps)   \
+    SP_SP_ARITH_TEST(T, 2048, 1000, 6, eps) \
+    SP_SP_ARITH_TEST(T, 123, 278, 5, eps)
+
+SP_SP_ARITH_TESTS(float, 1e-6)
+SP_SP_ARITH_TESTS(double, 1e-6)
+SP_SP_ARITH_TESTS(cfloat,
+                  1e-4)  // This is mostly for complex division in OpenCL
+SP_SP_ARITH_TESTS(cdouble, 1e-6)
+
+#if defined(USE_MTX) && defined(MTX_TEST_DIR)
+
+// Sparse-Sparse Arithmetic testing function using mtx files
+template<af_op_t op>
+void ssArithmeticMTX(const char* op1, const char* op2) {
+    deviceGC();
+
+    // Re-enable when double is enabled SUPPORTED_TYPE_CHECK(T);
+
+    array cooA, cooB;
+    ASSERT_TRUE(mtxReadSparseMatrix(cooA, op1));
+    ASSERT_TRUE(mtxReadSparseMatrix(cooB, op2));
+
+    array spA = sparseConvertTo(cooA, AF_STORAGE_CSR);
+    array spB = sparseConvertTo(cooB, AF_STORAGE_CSR);
+
+    array A = dense(spA);
+    array B = dense(spB);
+
+    arith_op<op> binOp;
+
+    // Arith Op
+    array resS = binOp(spA, spB);
+    array resD = binOp(A, B);
+    array revS = binOp(spB, spA);
+    array revD = binOp(B, A);
+
+    ASSERT_ARRAYS_NEAR(resD, dense(resS), 1e-4);
+    ASSERT_ARRAYS_NEAR(revD, dense(revS), 1e-4);
+}
+
+TEST(SparseSparseArith, LinearProgrammingData) {
+    std::string file1(MTX_TEST_DIR "LPnetlib/lpi_vol1/lpi_vol1.mtx");
+    std::string file2(MTX_TEST_DIR "LPnetlib/lpi_qual/lpi_qual.mtx");
+    ssArithmeticMTX<af_add_t>(file1.c_str(), file2.c_str());
+}
+
+TEST(SparseSparseArith, SubsequentCircuitSimData) {
+    std::string file1(MTX_TEST_DIR "Sandia/oscil_dcop_12/oscil_dcop_12.mtx");
+    std::string file2(MTX_TEST_DIR "Sandia/oscil_dcop_42/oscil_dcop_42.mtx");
+    ssArithmeticMTX<af_sub_t>(file1.c_str(), file2.c_str());
+}
+
+TEST(SparseSparseArith, QuantumChemistryData) {
+    std::string file1(MTX_TEST_DIR "QCD/conf6_0-4x4-20/conf6_0-4x4-20.mtx");
+    std::string file2(MTX_TEST_DIR "QCD/conf6_0-4x4-30/conf6_0-4x4-30.mtx");
+    ssArithmeticMTX<af_add_t>(file1.c_str(), file2.c_str());
+}
+#endif
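Note on the COO/CSR round trip used by the arithmetic tests above: the tests rely on `sparseConvertTo` to move between storage formats before comparing results. The following is only a minimal, self-contained sketch of what a COO-to-CSR conversion does in principle. The `Coo` and `Csr` structs and the `cooToCsr` function are hypothetical names made up for illustration; they are not ArrayFire types or APIs, and the library's actual implementation may differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical containers for illustration only; not ArrayFire types.
struct Coo {
    int rows, cols;
    std::vector<int> rowIdx, colIdx;  // one entry per non-zero
    std::vector<double> values;
};

struct Csr {
    int rows, cols;
    std::vector<int> rowPtr;  // rows + 1 offsets into colIdx/values
    std::vector<int> colIdx;
    std::vector<double> values;
};

// Counting-sort style conversion. It is stable, so entries that share a
// row keep the relative order they had in the COO input.
Csr cooToCsr(const Coo& in) {
    Csr out;
    out.rows = in.rows;
    out.cols = in.cols;
    out.rowPtr.assign(in.rows + 1, 0);
    out.colIdx.resize(in.values.size());
    out.values.resize(in.values.size());

    // Count the non-zeros in each row.
    for (int r : in.rowIdx) ++out.rowPtr[r + 1];

    // Prefix sum turns the counts into row start offsets.
    for (int r = 0; r < in.rows; ++r) out.rowPtr[r + 1] += out.rowPtr[r];

    // Scatter each entry into the next free slot of its row.
    std::vector<int> next(out.rowPtr.begin(), out.rowPtr.end() - 1);
    for (std::size_t k = 0; k < in.values.size(); ++k) {
        const int dst   = next[in.rowIdx[k]]++;
        out.colIdx[dst] = in.colIdx[k];
        out.values[dst] = in.values[k];
    }
    return out;
}
```

Two storage formats describe the same matrix when this kind of conversion maps one onto the other, which is the property the round-trip assertions lean on.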
diff --git a/test/sparse_common.hpp b/test/sparse_common.hpp
new file mode 100644
index 0000000000..5884871388
--- /dev/null
+++ b/test/sparse_common.hpp
@@ -0,0 +1,255 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#pragma once
+#include <arrayfire.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::cdouble;
+using af::cfloat;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+///////////////////////////////// CPP ////////////////////////////////////
+//
+
+template<typename T>
+static af::array makeSparse(af::array A, int factor) {
+    A = floor(A * 1000);
+    A = A * ((A % factor) == 0) / 1000;
+    return A;
+}
+
+template<>
+af::array makeSparse<cfloat>(af::array A, int factor) {
+    af::array r = real(A);
+    r           = floor(r * 1000);
+    r           = r * ((r % factor) == 0) / 1000;
+
+    af::array i = r / 2;
+
+    A = af::complex(r, i);
+    return A;
+}
+
+template<>
+af::array makeSparse<cdouble>(af::array A, int factor) {
+    af::array r = real(A);
+    r           = floor(r * 1000);
+    r           = r * ((r % factor) == 0) / 1000;
+
+    af::array i = r / 2;
+
+    A = af::complex(r, i);
+    return A;
+}
+
+static double calc_norm(af::array lhs, af::array rhs) {
+    return af::max<double>(af::abs(lhs - rhs) /
+                           (af::abs(lhs) + af::abs(rhs) + 1E-5));
+}
+
+template<typename T>
+static void sparseTester(const int m, const int n, const int k, int factor,
+                         double eps, int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    af::deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    af::array A = cpu_randu<T>(af::dim4(m, n));
+    af::array B = cpu_randu<T>(af::dim4(n, k));
+#else
+    af::array A = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    af::array B = af::randu(n, k, (af::dtype)af::dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+
+    // Result of GEMM
+    af::array dRes1 = matmul(A, B);
+
+    // Create Sparse Array From Dense
+    af::array sA = af::sparse(A, AF_STORAGE_CSR);
+
+    // Sparse Matmul
+    af::array sRes1 = matmul(sA, B);
+
+    // Verify Results
+    ASSERT_NEAR(0, calc_norm(real(dRes1), real(sRes1)), eps);
+    ASSERT_NEAR(0, calc_norm(imag(dRes1), imag(sRes1)), eps);
+}
+
+template<typename T>
+static void sparseTransposeTester(const int m, const int n, const int k,
+                                  int factor, double eps,
+                                  int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    af::deviceGC();
+
+    SUPPORTED_TYPE_CHECK(T);
+
+#if 1
+    af::array A = cpu_randu<T>(af::dim4(m, n));
+    af::array B = cpu_randu<T>(af::dim4(m, k));
+#else
+    af::array A = af::randu(m, n, (af::dtype)af::dtype_traits<T>::af_type);
+    af::array B = af::randu(m, k, (af::dtype)af::dtype_traits<T>::af_type);
+#endif
+
+    A = makeSparse<T>(A, factor);
+
+    // Result of GEMM
+    af::array dRes2 = matmul(A, B, AF_MAT_TRANS, AF_MAT_NONE);
+    af::array dRes3;
+    if (IsComplex<T>::value) {
+        dRes3 = matmul(A, B, AF_MAT_CTRANS, AF_MAT_NONE);
+    }
+
+    // Create Sparse Array From Dense
+    af::array sA = af::sparse(A, AF_STORAGE_CSR);
+
+    // Sparse Matmul
+    af::array sRes2 = matmul(sA, B, AF_MAT_TRANS, AF_MAT_NONE);
+    af::array sRes3;
+    if (IsComplex<T>::value) {
+        sRes3 = matmul(sA, B, AF_MAT_CTRANS, AF_MAT_NONE);
+    }
+
+    // Verify Results
+    ASSERT_NEAR(0, calc_norm(real(dRes2), real(sRes2)), eps);
+    ASSERT_NEAR(0, calc_norm(imag(dRes2), imag(sRes2)), eps);
+
+    if (IsComplex<T>::value) {
+        ASSERT_NEAR(0, calc_norm(real(dRes3), real(sRes3)), eps);
+        ASSERT_NEAR(0, calc_norm(imag(dRes3), imag(sRes3)), eps);
+    }
+}
+
+template<typename T>
+static void convertCSR(const int M, const int N, const double ratio,
+                       int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    SUPPORTED_TYPE_CHECK(T);
+#if 1
+    af::array a = cpu_randu<T>(af::dim4(M, N));
+#else
+    af::array a = af::randu(M, N);
+#endif
+    a = a * (a > ratio);
+
+    af::array s  = af::sparse(a, AF_STORAGE_CSR);
+    af::array aa = af::dense(s);
+
+    ASSERT_ARRAYS_EQ(a, aa);
+}
+
+template<typename T>
+static void convertCSC(const int M, const int N, const double ratio,
+                       int targetDevice = -1) {
+    if (targetDevice >= 0) af::setDevice(targetDevice);
+
+    SUPPORTED_TYPE_CHECK(T);
+#if 1
+    af::array a = cpu_randu<T>(af::dim4(M, N));
+#else
+    af::array a = af::randu(M, N);
+#endif
+    a = a * (a > ratio);
+
+    af::array s  = af::sparse(a, AF_STORAGE_CSC);
+    af::array aa = af::dense(s);
+
+    ASSERT_ARRAYS_EQ(a, aa);
+}
+
+// This test essentially verifies that the sparse structures have the correct
+// dimensions and indices using a very basic test
+template<af_storage stype>
+static void createFunction() {
+    af::array in = af::sparse(af::identity(3, 3), stype);
+
+    af::array values = sparseGetValues(in);
+    af::array rowIdx = sparseGetRowIdx(in);
+    af::array colIdx = sparseGetColIdx(in);
+    dim_t nNZ        = sparseGetNNZ(in);
+
+    ASSERT_EQ(nNZ, values.elements());
+
+    ASSERT_EQ(0, af::max<double>(values - af::constant(1, nNZ)));
+    ASSERT_EQ(0, af::max<int>(rowIdx -
+                              af::range(af::dim4(rowIdx.elements()), 0, s32)));
+    ASSERT_EQ(0, af::max<int>(colIdx -
+                              af::range(af::dim4(colIdx.elements()), 0, s32)));
+}
+
+template<typename Ti, typename To>
+static void sparseCastTester(const int m, const int n, int factor) {
+    SUPPORTED_TYPE_CHECK(Ti);
+    SUPPORTED_TYPE_CHECK(To);
+
+    af::array A = cpu_randu<Ti>(af::dim4(m, n));
+
+    A = makeSparse<Ti>(A, factor);
+
+    af::array sTi = af::sparse(A, AF_STORAGE_CSR);
+
+    // Cast
+    af::array sTo = sTi.as((af::dtype)af::dtype_traits<To>::af_type);
+
+    // Verify NNZ
+    dim_t iNNZ = sparseGetNNZ(sTi);
+    dim_t oNNZ = sparseGetNNZ(sTo);
+
+    ASSERT_EQ(iNNZ, oNNZ);
+
+    // Verify storage types
+    dim_t iSType = sparseGetStorage(sTi);
+    dim_t oSType = sparseGetStorage(sTo);
+
+    ASSERT_EQ(iSType, oSType);
+
+    // Get the individual arrays and verify equality
+    af::array iValues = sparseGetValues(sTi);
+    af::array iRowIdx = sparseGetRowIdx(sTi);
+    af::array iColIdx = sparseGetColIdx(sTi);
+
+    af::array oValues = sparseGetValues(sTo);
+    af::array oRowIdx = sparseGetRowIdx(sTo);
+    af::array oColIdx = sparseGetColIdx(sTo);
+
+    // Verify indices and values
+    ASSERT_EQ(0, af::max<int>(af::abs(iRowIdx - oRowIdx)));
+    ASSERT_EQ(0, af::max<int>(af::abs(iColIdx - oColIdx)));
+
+    static const double eps = 1e-6;
+    if (iValues.iscomplex() && !oValues.iscomplex()) {
+        ASSERT_NEAR(0, af::max<double>(af::abs(af::abs(iValues) - oValues)),
+                    eps);
+    } else if (!iValues.iscomplex() && oValues.iscomplex()) {
+        ASSERT_NEAR(0, af::max<double>(af::abs(iValues - af::abs(oValues))),
+                    eps);
+    } else {
+        ASSERT_NEAR(0, af::max<double>(af::abs(iValues - oValues)), eps);
+    }
+}
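Note on the helpers in sparse_common.hpp: `makeSparse` quantizes each value to steps of 1/1000 and keeps only those whose quantized integer is a multiple of `factor`, and `calc_norm` reports the largest element-wise relative error with a small constant in the denominator to avoid division by zero. The sketch below restates both on plain `double` values using only the standard library. The names `sparsify` and `maxRelativeError` are hypothetical and exist only for this illustration; they are not part of the test suite or of ArrayFire.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar analogue of makeSparse (hypothetical name): quantize to 1/1000
// steps and zero out values whose quantized integer is not a multiple of
// `factor`. With factor == 1 every value survives.
double sparsify(double x, int factor) {
    const double q = std::floor(x * 1000.0);
    const bool keep = std::fmod(q, static_cast<double>(factor)) == 0.0;
    return keep ? q / 1000.0 : 0.0;
}

// Loop analogue of calc_norm (hypothetical name): the maximum of
// |l - r| / (|l| + |r| + 1e-5) over all element pairs.
double maxRelativeError(const std::vector<double>& lhs,
                        const std::vector<double>& rhs) {
    double worst = 0.0;
    const std::size_t n = std::min(lhs.size(), rhs.size());
    for (std::size_t i = 0; i < n; ++i) {
        const double num = std::abs(lhs[i] - rhs[i]);
        const double den = std::abs(lhs[i]) + std::abs(rhs[i]) + 1e-5;
        worst            = std::max(worst, num / den);
    }
    return worst;
}
```

For uniformly distributed inputs, roughly one in `factor` entries survives `sparsify`, which is why larger `factor` arguments in the tests produce sparser matrices.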
diff --git a/test/sparse_convert.cpp b/test/sparse_convert.cpp
new file mode 100644
index 0000000000..7e8b927542
--- /dev/null
+++ b/test/sparse_convert.cpp
@@ -0,0 +1,125 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::max;
+using std::abs;
+using std::string;
+using std::vector;
+
+///////////////////////////////// CPP ////////////////////////////////////
+//
+
+template<typename T>
+array makeSparse(array A, int factor) {
+    A = floor(A * 1000);
+    A = A * ((A % factor) == 0) / 1000;
+    return A;
+}
+
+template<>
+array makeSparse<cfloat>(array A, int factor) {
+    array r = real(A);
+    r       = floor(r * 1000);
+    r       = r * ((r % factor) == 0) / 1000;
+
+    array i = r / 2;
+
+    A = complex(r, i);
+    return A;
+}
+
+template<>
+array makeSparse<cdouble>(array A, int factor) {
+    array r = real(A);
+    r       = floor(r * 1000);
+    r       = r * ((r % factor) == 0) / 1000;
+
+    array i = r / 2;
+
+    A = complex(r, i);
+    return A;
+}
+
+template<typename T, af_storage src, af_storage dest>
+void sparseConvertTester(const int m, const int n, int factor) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    array A = cpu_randu<T>(dim4(m, n));
+
+    A = makeSparse<T>(A, factor);
+
+    // Create Sparse Array of type src From Dense
+    array sA = sparse(A, src);
+
+    // Convert src to dest format
+    array s2d = sparseConvertTo(sA, dest);
+
+    // Create the dest type from dense - gold
+    array dA = sparse(A, dest);
+
+    ASSERT_ARRAYS_EQ(dA, s2d);
+    ASSERT_ARRAYS_EQ(A, s2d);
+}
+
+#define CONVERT_TESTS_TYPES(T, STYPE, DTYPE, SUFFIX, M, N, F) \
+    TEST(SPARSE_CONVERT, T##_##STYPE##_##DTYPE##_##SUFFIX) {  \
+        sparseConvertTester<T, STYPE, DTYPE>(M, N, F);        \
+    }                                                         \
+    TEST(SPARSE_CONVERT, T##_##DTYPE##_##STYPE##_##SUFFIX) {  \
+        sparseConvertTester<T, DTYPE, STYPE>(M, N, F);        \
+    }
+
+#define CONVERT_TESTS(T, STYPE, DTYPE)                      \
+    CONVERT_TESTS_TYPES(T, STYPE, DTYPE, 1, 1000, 1000, 5)  \
+    CONVERT_TESTS_TYPES(T, STYPE, DTYPE, 2, 512, 512, 1)    \
+    CONVERT_TESTS_TYPES(T, STYPE, DTYPE, 3, 512, 1024, 2)   \
+    CONVERT_TESTS_TYPES(T, STYPE, DTYPE, 4, 2048, 1024, 10) \
+    CONVERT_TESTS_TYPES(T, STYPE, DTYPE, 5, 237, 411, 5)
+
+CONVERT_TESTS(float, AF_STORAGE_CSR, AF_STORAGE_COO)
+CONVERT_TESTS(double, AF_STORAGE_CSR, AF_STORAGE_COO)
+CONVERT_TESTS(cfloat, AF_STORAGE_CSR, AF_STORAGE_COO)
+CONVERT_TESTS(cdouble, AF_STORAGE_CSR, AF_STORAGE_COO)
+
+#undef CONVERT_TESTS
+#undef CONVERT_TESTS_TYPES
+
+// Test to check failure with CSC
+TEST(SPARSE_CONVERT, CSC_ARG_ERROR) {
+    const int m = 100, n = 28, factor = 5;
+
+    array A = cpu_randu<float>(dim4(m, n));
+
+    A = makeSparse<float>(A, factor);
+
+    // Create a CSR sparse array from dense
+    array sA = sparse(A, AF_STORAGE_CSR);
+
+    // Attempt a CSR to CSC conversion, which is expected to fail.
+    // Use C-API to catch error
+    af_array out = 0;
+    ASSERT_EQ(AF_ERR_ARG, af_sparse_convert_to(&out, sA.get(), AF_STORAGE_CSC));
+
+    if (out != 0) af_release_array(out);
+}
diff --git a/test/stdev.cpp b/test/stdev.cpp
new file mode 100644
index 0000000000..bf95801fed
--- /dev/null
+++ b/test/stdev.cpp
@@ -0,0 +1,244 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <algorithm>
+#include <ctime>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::exception;
+using af::seq;
+using af::stdev;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class StandardDev : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, int, uint, intl, uintl, char, schar,
+                         uchar>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(StandardDev, TestTypes);
+
+template<typename T>
+struct f32HelperType {
+    typedef
+        typename cond_type<is_same_type<T, double>::value, double, float>::type
+            type;
+};
+
+template<typename T>
+struct c32HelperType {
+    typedef typename cond_type<is_same_type<T, cfloat>::value, cfloat,
+                               typename f32HelperType<T>::type>::type type;
+};
+
+template<typename T>
+struct elseType {
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
+};
+
+template<typename T>
+struct sdOutType {
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
+};
+
+template<typename T>
+void stdevDimTest(string pFileName, dim_t dim,
+                  const bool useDeprecatedAPI = false) {
+    typedef typename sdOutType<T>::type outType;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    vector<T> input(in[0].begin(), in[0].end());
+
+    array a(dims, &(input.front()));
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    array b = (useDeprecatedAPI ? stdev(a, dim)
+                                : stdev(a, AF_VARIANCE_POPULATION, dim));
+#pragma GCC diagnostic pop
+
+    vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
+
+    size_t nElems = currGoldBar.size();
+    vector<outType> outData(nElems);
+
+    b.host((void*)outData.data());
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(::real(currGoldBar[elIter]), ::real(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+        ASSERT_NEAR(::imag(currGoldBar[elIter]), ::imag(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+    }
+}
+
+TYPED_TEST(StandardDev, Dim0) {
+    stdevDimTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_dim0.test"), 0);
+    stdevDimTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_dim0.test"), 0,
+                            true);
+}
+
+TYPED_TEST(StandardDev, Dim1) {
+    stdevDimTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_dim1.test"), 1);
+    stdevDimTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_dim1.test"), 1,
+                            true);
+}
+
+TYPED_TEST(StandardDev, Dim2) {
+    stdevDimTest<TypeParam>(
+        string(TEST_DIR "/stdev/hypercube_10x10x5x5_dim2.test"), 2);
+    stdevDimTest<TypeParam>(
+        string(TEST_DIR "/stdev/hypercube_10x10x5x5_dim2.test"), 2, true);
+}
+
+TYPED_TEST(StandardDev, Dim3) {
+    stdevDimTest<TypeParam>(
+        string(TEST_DIR "/stdev/hypercube_10x10x5x5_dim3.test"), 3);
+    stdevDimTest<TypeParam>(
+        string(TEST_DIR "/stdev/hypercube_10x10x5x5_dim3.test"), 3, true);
+}
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+TEST(StandardDev, InvalidDim) { ASSERT_THROW(stdev(array(), 5), exception); }
+
+TEST(StandardDev, InvalidType) {
+    ASSERT_THROW(stdev(constant(cdouble(1.0, -1.0), 10)), exception);
+}
+#pragma GCC diagnostic pop
+
+template<typename T>
+void stdevDimIndexTest(string pFileName, dim_t dim,
+                       const bool useDeprecatedAPI = false) {
+    typedef typename sdOutType<T>::type outType;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    vector<T> input(in[0].begin(), in[0].end());
+
+    array a(dims, &(input.front()));
+    array b = a(seq(2, 6), seq(1, 7));
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    array c = (useDeprecatedAPI ? stdev(b, dim)
+                                : stdev(b, AF_VARIANCE_POPULATION, dim));
+#pragma GCC diagnostic pop
+
+    vector<outType> currGoldBar(tests[0].begin(), tests[0].end());
+
+    size_t nElems = currGoldBar.size();
+    vector<outType> outData(nElems);
+
+    c.host((void*)outData.data());
+
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(::real(currGoldBar[elIter]), ::real(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+        ASSERT_NEAR(::imag(currGoldBar[elIter]), ::imag(outData[elIter]),
+                    1.0e-3)
+            << "at: " << elIter << endl;
+    }
+}
+
+TYPED_TEST(StandardDev, IndexedArrayDim0) {
+    stdevDimIndexTest<TypeParam>(
+        string(TEST_DIR "/stdev/mat_10x10_seq2_6x1_7_dim0.test"), 0);
+    stdevDimIndexTest<TypeParam>(
+        string(TEST_DIR "/stdev/mat_10x10_seq2_6x1_7_dim0.test"), 0, true);
+}
+
+TYPED_TEST(StandardDev, IndexedArrayDim1) {
+    stdevDimIndexTest<TypeParam>(
+        string(TEST_DIR "/stdev/mat_10x10_seq2_6x1_7_dim1.test"), 1);
+    stdevDimIndexTest<TypeParam>(
+        string(TEST_DIR "/stdev/mat_10x10_seq2_6x1_7_dim1.test"), 1, true);
+}
+
+template<typename T>
+void stdevAllTest(string pFileName, const bool useDeprecatedAPI = false) {
+    typedef typename sdOutType<T>::type outType;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<int>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<int, float>(pFileName, numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    vector<T> input(in[0].size());
+    transform(in[0].begin(), in[0].end(), input.begin(), convert_to<T, int>);
+
+    array a(dims, &(input.front()));
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    outType b = (useDeprecatedAPI ? stdev<outType>(a)
+                                  : stdev<outType>(a, AF_VARIANCE_POPULATION));
+#pragma GCC diagnostic pop
+
+    vector<outType> currGoldBar(tests[0].size());
+    transform(tests[0].begin(), tests[0].end(), currGoldBar.begin(),
+              convert_to<outType, float>);
+
+    ASSERT_NEAR(::real(currGoldBar[0]), ::real(b), 1.0e-3);
+    ASSERT_NEAR(::imag(currGoldBar[0]), ::imag(b), 1.0e-3);
+}
+
+TYPED_TEST(StandardDev, All) {
+    stdevAllTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_scalar.test"));
+    stdevAllTest<TypeParam>(string(TEST_DIR "/stdev/mat_10x10_scalar.test"),
+                            true);
+}
diff --git a/test/susan.cpp b/test/susan.cpp
new file mode 100644
index 0000000000..c488bda775
--- /dev/null
+++ b/test/susan.cpp
@@ -0,0 +1,177 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/compatible.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <algorithm>
+#include <cmath>
+#include <string>
+#include <typeinfo>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::exception;
+using af::features;
+using af::loadImage;
+using af::randu;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
+
+typedef struct {
+    float f[5];
+} feat_t;
+
+static bool feat_cmp(feat_t i, feat_t j) {
+    for (int k = 0; k < 5; k++)
+        if (i.f[k] != j.f[k]) return (i.f[k] < j.f[k]);
+
+    return false;
+}
+
+static void array_to_feat(vector<feat_t> &feat, float *x, float *y,
+                          float *score, float *orientation, float *size,
+                          unsigned nfeat) {
+    feat.resize(nfeat);
+    for (unsigned i = 0; i < feat.size(); i++) {
+        feat[i].f[0] = x[i];
+        feat[i].f[1] = y[i];
+        feat[i].f[2] = score[i];
+        feat[i].f[3] = orientation[i];
+        feat[i].f[4] = size[i];
+    }
+}
+
+template<typename T>
+class Susan : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double, int, uint, char, schar, uchar, short,
+                         ushort>
+    TestTypes;
+
+TYPED_TEST_SUITE(Susan, TestTypes);
+
+template<typename T>
+void susanTest(string pTestFile, float t, float g) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<vector<float>> gold;
+
+    readImageTests(pTestFile, inDims, inFiles, gold);
+
+    size_t testCount = inDims.size();
+
+    for (size_t testId = 0; testId < testCount; ++testId) {
+        inFiles[testId].insert(0, string(TEST_DIR "/susan/"));
+
+        array in = loadImage(inFiles[testId].c_str(), false);
+
+        features out = susan(in, 3, t, g, 0.05f, 3);
+
+        vector<float> outX(gold[0].size());
+        vector<float> outY(gold[1].size());
+        vector<float> outScore(gold[2].size());
+        vector<float> outOrientation(gold[3].size());
+        vector<float> outSize(gold[4].size());
+        out.getX().host(outX.data());
+        out.getY().host(outY.data());
+        out.getScore().host(outScore.data());
+        out.getOrientation().host(outOrientation.data());
+        out.getSize().host(outSize.data());
+
+        vector<feat_t> out_feat;
+        array_to_feat(out_feat, outX.data(), outY.data(), outScore.data(),
+                      outOrientation.data(), outSize.data(),
+                      out.getNumFeatures());
+
+        vector<feat_t> gold_feat;
+        array_to_feat(gold_feat, &gold[0].front(), &gold[1].front(),
+                      &gold[2].front(), &gold[3].front(), &gold[4].front(),
+                      gold[0].size());
+
+        std::sort(out_feat.begin(), out_feat.end(), feat_cmp);
+        std::sort(gold_feat.begin(), gold_feat.end(), feat_cmp);
+
+        for (int elIter = 0; elIter < (int)out.getNumFeatures(); elIter++) {
+            ASSERT_EQ(out_feat[elIter].f[0], gold_feat[elIter].f[0])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[1], gold_feat[elIter].f[1])
+                << "at: " << elIter << endl;
+            ASSERT_LE(fabs(out_feat[elIter].f[2] - gold_feat[elIter].f[2]), 1e2)
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[3], gold_feat[elIter].f[3])
+                << "at: " << elIter << endl;
+            ASSERT_EQ(out_feat[elIter].f[4], gold_feat[elIter].f[4])
+                << "at: " << elIter << endl;
+        }
+    }
+}
+
+#define SUSAN_TEST(image, tval, gval)                                         \
+    TYPED_TEST(Susan, image) {                                                \
+        UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);                               \
+        susanTest<TypeParam>(string(TEST_DIR "/susan/" #image ".test"), tval, \
+                             gval);                                           \
+    }
+
+SUSAN_TEST(man_t32_g10, 32, 10);
+SUSAN_TEST(square_t32_g10, 32, 10);
+SUSAN_TEST(square_t32_g20, 32, 20);
+
+TEST(Susan, InvalidDims) {
+    try {
+        array a      = randu(256);
+        features out = susan(a);
+        EXPECT_TRUE(false);
+    } catch (exception &e) { EXPECT_TRUE(true); }
+}
+
+TEST(Susan, InvalidRadius) {
+    try {
+        array a      = randu(256);
+        features out = susan(a, 10);
+        EXPECT_TRUE(false);
+    } catch (exception &e) { EXPECT_TRUE(true); }
+}
+
+TEST(Susan, InvalidThreshold) {
+    try {
+        array a      = randu(256);
+        features out = susan(a, 3, -32, 10, 0.05f, 3);
+        EXPECT_TRUE(false);
+    } catch (exception &e) { EXPECT_TRUE(true); }
+}
+
+TEST(Susan, InvalidFeatureRatio) {
+    try {
+        array a      = randu(256);
+        features out = susan(a, 3, 32, 10, 1.3f, 3);
+        EXPECT_TRUE(false);
+    } catch (exception &e) { EXPECT_TRUE(true); }
+}
+
+TEST(Susan, InvalidEdge) {
+    try {
+        array a      = randu(128, 128);
+        features out = susan(a, 3, 32, 10, 1.3f, 129);
+        EXPECT_TRUE(false);
+    } catch (exception &e) { EXPECT_TRUE(true); }
+}
diff --git a/test/svd_dense.cpp b/test/svd_dense.cpp
new file mode 100644
index 0000000000..f0da346ce4
--- /dev/null
+++ b/test/svd_dense.cpp
@@ -0,0 +1,168 @@
+/*******************************************************
+ * Copyright (c) 2015, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::iota;
+using af::randu;
+using af::seq;
+using af::span;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class svd : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
+TYPED_TEST_SUITE(svd, TestTypes);
+
+template<typename T>
+inline double get_val(T val) {
+    return val;
+}
+
+template<>
+inline double get_val<cfloat>(cfloat val) {
+    return abs(val);
+}
+
+template<>
+inline double get_val<cdouble>(cdouble val) {
+    return abs(val);
+}
+
+template<typename T>
+void svdTest(const int M, const int N) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array A = randu(M, N, ty);
+
+    //! [ex_svd_reg]
+    array U, S, Vt;
+    af::svd(U, S, Vt, A);
+
+    const int MN = std::min(M, N);
+
+    array UU = U(span, seq(MN));
+    array SS = diag(S, 0, false).as(ty);
+    array VV = Vt(seq(MN), span);
+
+    array AA = matmul(UU, SS, VV);
+    //! [ex_svd_reg]
+
+#if defined(OS_MAC)
+    ASSERT_ARRAYS_NEAR(A, AA, 3E-3);
+#else
+    ASSERT_ARRAYS_NEAR(A, AA, 1E-3);
+#endif
+}
+
+template<typename T>
+void svdInPlaceTest(const int M, const int N) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array A      = randu(M, N, ty);
+    array A_copy = A.copy();
+
+    array U, S, Vt;
+    af::svdInPlace(U, S, Vt, A);
+
+    const int MN = std::min(M, N);
+
+    array UU = U(span, seq(MN));
+    array SS = diag(S, 0, false).as(ty);
+    array VV = Vt(seq(MN), span);
+
+    array AA = matmul(UU, SS, VV);
+
+#if defined(OS_MAC)
+    ASSERT_ARRAYS_NEAR(A_copy, AA, 3E-3);
+#else
+    ASSERT_ARRAYS_NEAR(A_copy, AA, 1E-3);
+#endif
+}
+
+template<typename T>
+void checkInPlaceSameResults(const int M, const int N) {
+    SUPPORTED_TYPE_CHECK(T);
+    LAPACK_ENABLED_CHECK();
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+
+    array in = randu(dim4(M, N), ty);
+    array u, s, v;
+    af::svd(u, s, v, in);
+
+    array uu, ss, vv;
+    af::svdInPlace(uu, ss, vv, in);
+
+    ASSERT_ARRAYS_EQ(u, uu);
+    ASSERT_ARRAYS_EQ(s, ss);
+    ASSERT_ARRAYS_EQ(v, vv);
+}
+
+TYPED_TEST(svd, Square) { svdTest<TypeParam>(500, 500); }
+
+TYPED_TEST(svd, Rect0) { svdTest<TypeParam>(500, 300); }
+
+TYPED_TEST(svd, Rect1) { svdTest<TypeParam>(300, 500); }
+
+TYPED_TEST(svd, InPlaceSquare) { svdInPlaceTest<TypeParam>(500, 500); }
+
+TYPED_TEST(svd, InPlaceRect0) { svdInPlaceTest<TypeParam>(500, 300); }
+
+// dim0 < dim1 case not supported for now
+// TYPED_TEST(svd, InPlaceRect1)
+// {
+//     svdInPlaceTest<TypeParam>(300, 500);
+// }
+
+TYPED_TEST(svd, InPlaceSameResultsSquare) {
+    checkInPlaceSameResults<TypeParam>(10, 10);
+}
+
+TYPED_TEST(svd, InPlaceSameResultsRect0) {
+    checkInPlaceSameResults<TypeParam>(10, 8);
+}
+
+// dim0 < dim1 case not supported for now
+// TYPED_TEST(svd, InPlaceSameResultsRect1)
+// {
+//     checkInPlaceSameResults<TypeParam>(8, 10);
+// }
+
+TEST(svd, InPlaceRect0_Exception) {
+    array in = randu(3, 5);
+    array u, s, v;
+    EXPECT_THROW(svdInPlace(u, s, v, in), af::exception);
+}
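The reconstruction check performed by `svdTest` above can be sketched without ArrayFire. This is a minimal standard-library illustration, not the library's implementation: it assumes a row-major layout (ArrayFire itself is column-major), and `Matrix`, `reconstruct`, and `maxAbsDiff` are hypothetical names introduced here only for the example. It shows why only the first min(M, N) columns of U and rows of Vt are needed to rebuild A.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix. Hypothetical helper type, not part of ArrayFire.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> data;

    Matrix(std::size_t r, std::size_t c) : rows(r), cols(c), data(r * c, 0.0) {}

    double &at(std::size_t i, std::size_t j) { return data[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return data[i * cols + j]; }
};

// Rebuild A (M x N) from U (M x M), the singular values s (min(M, N) entries)
// and Vt (N x N). Only the first min(M, N) columns of U and rows of Vt
// contribute, which corresponds to the UU/SS/VV slicing in the test.
Matrix reconstruct(const Matrix &U, const std::vector<double> &s,
                   const Matrix &Vt) {
    const std::size_t M  = U.rows;
    const std::size_t N  = Vt.cols;
    const std::size_t MN = std::min(M, N);
    assert(s.size() == MN);

    Matrix A(M, N);
    for (std::size_t i = 0; i < M; ++i) {
        for (std::size_t j = 0; j < N; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < MN; ++k) {
                acc += U.at(i, k) * s[k] * Vt.at(k, j);
            }
            A.at(i, j) = acc;
        }
    }
    return A;
}

// Largest element-wise absolute difference between two equally sized matrices.
double maxAbsDiff(const Matrix &a, const Matrix &b) {
    assert(a.rows == b.rows && a.cols == b.cols);
    double worst = 0.0;
    for (std::size_t i = 0; i < a.data.size(); ++i) {
        worst = std::max(worst, std::fabs(a.data[i] - b.data[i]));
    }
    return worst;
}
```

With U a 3x3 permutation matrix, s = {3, 2}, and Vt the 2x2 identity, `reconstruct` yields a 3x2 matrix. Comparing it against the expected values with `maxAbsDiff(...) <= 1e-3` mirrors the tolerance that the test passes to `ASSERT_ARRAYS_NEAR` on non-macOS platforms.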
diff --git a/test/testHelpers.hpp b/test/testHelpers.hpp
index b31fc40458..405f23309d 100644
--- a/test/testHelpers.hpp
+++ b/test/testHelpers.hpp
@@ -6,295 +6,141 @@
  * The complete license agreement can be obtained at:
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
+#pragma once
+#ifdef __GNUC__
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wunused-function"
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wparentheses"
+#endif
+#include <half.hpp>
+#ifdef __GNUC__
+#pragma GCC diagnostic pop
+#endif
+#include <af/array.h>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+
+#include <gtest/gtest.h>
 
+#include <cfloat>
 #include <string>
-#include <fstream>
-#include <iterator>
 #include <vector>
-#include <algorithm>
-#include <limits>
-#include <arrayfire.h>
-#include <af/dim4.hpp>
-#include <af/array.h>
 
+#if defined(USE_MTX)
+#include <mmio.h>
+#include <cstdlib>
+#endif
+
+/// GTest deprecated the INSTANTIATE_TEST_CASE_P macro in favor of the
+/// INSTANTIATE_TEST_SUITE_P macro, which has the same syntax. Older versions
+/// of gtest do not support the new macro, so define INSTANTIATE_TEST_SUITE_P
+/// (and likewise TYPED_TEST_SUITE) here and map it to the old macro.
+#ifndef INSTANTIATE_TEST_SUITE_P
+#define INSTANTIATE_TEST_SUITE_P INSTANTIATE_TEST_CASE_P
+#endif
+#ifndef TYPED_TEST_SUITE
+#define TYPED_TEST_SUITE TYPED_TEST_CASE
+#endif
+
+bool operator==(const af_half &lhs, const af_half &rhs);
+
+std::ostream &operator<<(std::ostream &os, const af_half &val);
+
+#define UNUSED(expr) \
+    do { (void)(expr); } while (0)
+
+namespace aft {
+#ifdef __GNUC__
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+#elif defined(_MSC_VER)
+#pragma warning(push)
+#pragma warning(disable : 4996)
+#endif
+typedef intl intl;
+typedef uintl uintl;
+#ifdef __GNUC__
+#pragma GCC diagnostic pop
+#elif defined(_MSC_VER)
+#pragma warning(pop)
+#endif
+}  // namespace aft
+
+using aft::intl;
+using aft::uintl;
+
+std::ostream &operator<<(std::ostream &os, af::Backend bk);
+
+std::ostream &operator<<(std::ostream &os, af_err e);
+
+std::ostream &operator<<(std::ostream &os, af::dtype type);
+
+namespace af {
+template<>
+struct dtype_traits<half_float::half> {
+    enum { af_type = f16, ctype = f16 };
+    typedef half_float::half base_type;
+    static const char *getName() { return "half"; }
+};
+
+}  // namespace af
+
+typedef signed char schar;
 typedef unsigned char uchar;
 typedef unsigned int uint;
+typedef unsigned short ushort;
+
+std::string getBackendName(bool lower = false);
+std::string getTestName();
+
+std::string readNextNonEmptyLine(std::ifstream &file);
+
+namespace half_float {
+std::ostream &operator<<(std::ostream &os, half_float::half val);
+}  // namespace half_float
+
+template<typename To, typename Ti>
+To convert(Ti in) {
+    return static_cast<To>(in);
+}
+
+#ifndef EXTERN_TEMPLATE
+extern template float convert(af::half in);
+extern template af_half convert(int in);
+#endif
 
 template<typename inType, typename outType, typename FileElementType>
 void readTests(const std::string &FileName, std::vector<af::dim4> &inputDims,
-                std::vector<std::vector<inType> >   &testInputs,
-                std::vector<std::vector<outType> > &testOutputs)
-{
-    using std::vector;
-
-    std::ifstream testFile(FileName.c_str());
-    if(testFile.good()) {
-        unsigned inputCount;
-        testFile >> inputCount;
-        for(unsigned i=0; i<inputCount; i++) {
-            af::dim4 temp(1);
-            testFile >> temp;
-            inputDims.push_back(temp);
-        }
-
-        unsigned testCount;
-        testFile >> testCount;
-        testOutputs.resize(testCount);
-
-        vector<unsigned> testSizes(testCount);
-        for(unsigned i = 0; i < testCount; i++) {
-            testFile >> testSizes[i];
-        }
-
-        testInputs.resize(inputCount,vector<inType>(0));
-        for(unsigned k=0; k<inputCount; k++) {
-            unsigned nElems = inputDims[k].elements();
-            testInputs[k].resize(nElems);
-            FileElementType tmp;
-            for(unsigned i = 0; i < nElems; i++) {
-                testFile >> tmp;
-                testInputs[k][i] = tmp;
-            }
-        }
-
-        testOutputs.resize(testCount, vector<outType>(0));
-        for(unsigned i = 0; i < testCount; i++) {
-            testOutputs[i].resize(testSizes[i]);
-            FileElementType tmp;
-            for(unsigned j = 0; j < testSizes[i]; j++) {
-                testFile >> tmp;
-                testOutputs[i][j] = tmp;
-            }
-        }
-    }
-    else {
-        FAIL() << "TEST FILE NOT FOUND";
-    }
-}
+               std::vector<std::vector<inType>> &testInputs,
+               std::vector<std::vector<outType>> &testOutputs);
 
 template<typename inType, typename outType>
-void readTestsFromFile(const std::string &FileName, std::vector<af::dim4> &inputDims,
-                std::vector<std::vector<inType> >  &testInputs,
-                std::vector<std::vector<outType> > &testOutputs)
-{
-    using std::vector;
-
-    std::ifstream testFile(FileName.c_str());
-    if(testFile.good()) {
-        unsigned inputCount;
-        testFile >> inputCount;
-        for(unsigned i=0; i<inputCount; i++) {
-            af::dim4 temp(1);
-            testFile >> temp;
-            inputDims.push_back(temp);
-        }
-
-        unsigned testCount;
-        testFile >> testCount;
-        testOutputs.resize(testCount);
-
-        vector<unsigned> testSizes(testCount);
-        for(unsigned i = 0; i < testCount; i++) {
-            testFile >> testSizes[i];
-        }
-
-        testInputs.resize(inputCount,vector<inType>(0));
-        for(unsigned k=0; k<inputCount; k++) {
-            unsigned nElems = inputDims[k].elements();
-            testInputs[k].resize(nElems);
-            inType tmp;
-            for(unsigned i = 0; i < nElems; i++) {
-                testFile >> tmp;
-                testInputs[k][i] = tmp;
-            }
-        }
-
-        testOutputs.resize(testCount, vector<outType>(0));
-        for(unsigned i = 0; i < testCount; i++) {
-            testOutputs[i].resize(testSizes[i]);
-            outType tmp;
-            for(unsigned j = 0; j < testSizes[i]; j++) {
-                testFile >> tmp;
-                testOutputs[i][j] = tmp;
-            }
-        }
-    }
-    else {
-        FAIL() << "TEST FILE NOT FOUND";
-    }
-}
+void readTestsFromFile(const std::string &FileName,
+                       std::vector<af::dim4> &inputDims,
+                       std::vector<std::vector<inType>> &testInputs,
+                       std::vector<std::vector<outType>> &testOutputs);
 
-void readImageTests(const std::string        &pFileName,
-                    std::vector<af::dim4>    &pInputDims,
+void readImageTests(const std::string &pFileName,
+                    std::vector<af::dim4> &pInputDims,
                     std::vector<std::string> &pTestInputs,
-                    std::vector<dim_t>    &pTestOutSizes,
-                    std::vector<std::string> &pTestOutputs)
-{
-    using std::vector;
-
-    std::ifstream testFile(pFileName.c_str());
-    if(testFile.good()) {
-        unsigned inputCount;
-        testFile >> inputCount;
-        for(unsigned i=0; i<inputCount; i++) {
-            af::dim4 temp(1);
-            testFile >> temp;
-            pInputDims.push_back(temp);
-        }
-
-        unsigned testCount;
-        testFile >> testCount;
-        pTestOutputs.resize(testCount);
-
-        pTestOutSizes.resize(testCount);
-        for(unsigned i = 0; i < testCount; i++) {
-            testFile >> pTestOutSizes[i];
-        }
-
-        pTestInputs.resize(inputCount, "");
-        for(unsigned k=0; k<inputCount; k++) {
-            std::string temp = "";
-            while(std::getline(testFile, temp)) {
-                if (temp!="")
-                    break;
-            }
-            if (temp=="")
-                throw std::runtime_error("Test file might not be per format, please check.");
-            pTestInputs[k] = temp;
-        }
-
-        pTestOutputs.resize(testCount, "");
-        for(unsigned i = 0; i < testCount; i++) {
-            std::string temp = "";
-            while(std::getline(testFile, temp)) {
-                if (temp!="")
-                    break;
-            }
-            if (temp=="")
-                throw std::runtime_error("Test file might not be per format, please check.");
-            pTestOutputs[i] = temp;
-        }
-    }
-    else {
-        FAIL() << "TEST FILE NOT FOUND";
-    }
-}
+                    std::vector<dim_t> &pTestOutSizes,
+                    std::vector<std::string> &pTestOutputs);
 
 template<typename outType>
-void readImageTests(const std::string                 &pFileName,
-                    std::vector<af::dim4>             &pInputDims,
-                    std::vector<std::string>          &pTestInputs,
-                    std::vector<std::vector<outType> > &pTestOutputs)
-{
-    using std::vector;
-
-    std::ifstream testFile(pFileName.c_str());
-    if(testFile.good()) {
-        unsigned inputCount;
-        testFile >> inputCount;
-        for(unsigned i=0; i<inputCount; i++) {
-            af::dim4 temp(1);
-            testFile >> temp;
-            pInputDims.push_back(temp);
-        }
-
-        unsigned testCount;
-        testFile >> testCount;
-        pTestOutputs.resize(testCount);
-
-        vector<unsigned> testSizes(testCount);
-        for(unsigned i = 0; i < testCount; i++) {
-            testFile >> testSizes[i];
-        }
-
-        pTestInputs.resize(inputCount, "");
-        for(unsigned k=0; k<inputCount; k++) {
-            std::string temp = "";
-            while(std::getline(testFile, temp)) {
-                if (temp!="")
-                    break;
-            }
-            if (temp=="")
-                throw std::runtime_error("Test file might not be per format, please check.");
-            pTestInputs[k] = temp;
-        }
-
-        pTestOutputs.resize(testCount, vector<outType>(0));
-        for(unsigned i = 0; i < testCount; i++) {
-            pTestOutputs[i].resize(testSizes[i]);
-            outType tmp;
-            for(unsigned j = 0; j < testSizes[i]; j++) {
-                testFile >> tmp;
-                pTestOutputs[i][j] = tmp;
-            }
-        }
-    }
-    else {
-        FAIL() << "TEST FILE NOT FOUND";
-    }
-}
+void readImageTests(const std::string &pFileName,
+                    std::vector<af::dim4> &pInputDims,
+                    std::vector<std::string> &pTestInputs,
+                    std::vector<std::vector<outType>> &pTestOutputs);
 
 template<typename descType>
-void readImageFeaturesDescriptors(const std::string                  &pFileName,
-                                  std::vector<af::dim4>              &pInputDims,
-                                  std::vector<std::string>           &pTestInputs,
-                                  std::vector<std::vector<float> >    &pTestFeats,
-                                  std::vector<std::vector<descType> > &pTestDescs)
-{
-    using std::vector;
-
-    std::ifstream testFile(pFileName.c_str());
-    if(testFile.good()) {
-        unsigned inputCount;
-        testFile >> inputCount;
-        for(unsigned i=0; i<inputCount; i++) {
-            af::dim4 temp(1);
-            testFile >> temp;
-            pInputDims.push_back(temp);
-        }
-
-        unsigned attrCount, featCount, descLen;
-        testFile >> featCount;
-        testFile >> attrCount;
-        testFile >> descLen;
-        pTestFeats.resize(attrCount);
-
-        pTestInputs.resize(inputCount, "");
-        for(unsigned k=0; k<inputCount; k++) {
-            std::string temp = "";
-            while(std::getline(testFile, temp)) {
-                if (temp!="")
-                    break;
-            }
-            if (temp=="")
-                throw std::runtime_error("Test file might not be per format, please check.");
-            pTestInputs[k] = temp;
-        }
-
-        pTestFeats.resize(attrCount, vector<float>(0));
-        for(unsigned i = 0; i < attrCount; i++) {
-            pTestFeats[i].resize(featCount);
-            float tmp;
-            for(unsigned j = 0; j < featCount; j++) {
-                testFile >> tmp;
-                pTestFeats[i][j] = tmp;
-            }
-        }
-
-        pTestDescs.resize(featCount, vector<descType>(0));
-        for(unsigned i = 0; i < featCount; i++) {
-            pTestDescs[i].resize(descLen);
-            descType tmp;
-            for(unsigned j = 0; j < descLen; j++) {
-                testFile >> tmp;
-                pTestDescs[i][j] = tmp;
-            }
-        }
-    }
-    else {
-        FAIL() << "TEST FILE NOT FOUND";
-    }
-}
+void readImageFeaturesDescriptors(
+    const std::string &pFileName, std::vector<af::dim4> &pInputDims,
+    std::vector<std::string> &pTestInputs,
+    std::vector<std::vector<float>> &pTestFeats,
+    std::vector<std::vector<descType>> &pTestDescs);
 
 /**
  * Below is not a pair wise comparition method, rather
@@ -311,34 +157,13 @@ void readImageFeaturesDescriptors(const std::string                  &pFileName,
  * value of NRMSD. Hence, the range of RMSD is [0,255] for image inputs.
  */
 template<typename T>
-bool compareArraysRMSD(dim_t data_size, T *gold, T *data, double tolerance)
-{
-    double accum  = 0.0;
-    double maxion = FLT_MAX;//(double)std::numeric_limits<T>::lowest();
-    double minion = FLT_MAX;//(double)std::numeric_limits<T>::max();
-
-    for(dim_t i=0;i<data_size;i++)
-    {
-        double dTemp = (double)data[i];
-        double gTemp = (double)gold[i];
-        double diff  = gTemp-dTemp;
-        double err   = std::abs(diff) > 1.0e-4 ? diff : 0.0f;
-        accum  += std::pow(err,2.0);
-        maxion  = std::max(maxion, dTemp);
-        minion  = std::min(minion, dTemp);
-    }
-    accum      /= data_size;
-    double NRMSD = std::sqrt(accum)/(maxion-minion);
-
-    std::cout<<"NRMSD = "<<NRMSD<<std::endl;
-    if (NRMSD > tolerance)
-        return false;
-
-    return true;
-}
+bool compareArraysRMSD(dim_t data_size, T *gold, T *data, double tolerance);
+
+template<typename T>
+double computeArraysRMSD(dim_t data_size, T *gold, T *data);
 
 template<typename T, typename Other>
-struct is_same_type{
+struct is_same_type {
     static const bool value = false;
 };
 
@@ -360,100 +185,498 @@ struct cond_type<false, T, Other> {
     typedef Other type;
 };
 
+template<bool B, class T = void>
+struct enable_if {};
+
+template<class T>
+struct enable_if<true, T> {
+    typedef T type;
+};
+
 template<typename T>
-double real(T val) { return val.real(); }
-template<>
-double real<double>(double val) { return val; }
-template<>
-double real<float>(float val) { return val; }
-template<>
-double real<int>(int val) { return val; }
-template<>
-double real<char>(char val) { return val; }
-template<>
-double real<uchar>(uchar val) { return val; }
-template<>
-double real<uint>(uint val) { return val; }
+inline double real(T val) {
+    return (double)val;
+}
 template<>
-double real<intl>(intl val) { return val; }
+inline double real<af::cdouble>(af::cdouble val) {
+    return real(val);
+}
 template<>
-double real<uintl>(uintl val) { return val; }
+inline double real<af::cfloat>(af::cfloat val) {
+    return real(val);
+}
 
 template<typename T>
-double imag(T val) { return val.imag(); }
-template<>
-double imag<double>(double val) { return 0; }
-template<>
-double imag<float>(float val) { return 0; }
-template<>
-double imag<int>(int val) { return 0; }
-template<>
-double imag<uint>(uint val) { return 0; }
-template<>
-double imag<intl>(intl val) { return 0; }
-template<>
-double imag<uintl>(uintl val) { return 0; }
+inline double imag(T) {
+    return 0.0;  // non-complex values have no imaginary part
+}
 template<>
-double imag<char>(char val) { return 0; }
+inline double imag<af::cdouble>(af::cdouble val) {
+    return imag(val);
+}
 template<>
-double imag<uchar>(uchar val) { return 0; }
+inline double imag<af::cfloat>(af::cfloat val) {
+    return imag(val);
+}
 
-template<typename T>
-bool noDoubleTests()
-{
-    bool isTypeDouble = is_same_type<T, double>::value || is_same_type<T, af::cdouble>::value;
+template<class T>
+struct IsComplex {
+    static const bool value = is_same_type<af::cfloat, T>::value ||
+                              is_same_type<af::cdouble, T>::value;
+};
 
-    int dev = af::getDevice();
-    bool isDoubleSupported = af::isDoubleAvailable(dev);
+template<class T>
+struct IsFloatingPoint {
+    static const bool value = is_same_type<half_float::half, T>::value ||
+                              is_same_type<float, T>::value ||
+                              is_same_type<double, T>::value ||
+                              is_same_type<long double, T>::value;
+};
 
-    return ((isTypeDouble && !isDoubleSupported) ? true : false);
+bool noDoubleTests(af::dtype ty);
+
+bool noHalfTests(af::dtype ty);
+
+#define SUPPORTED_TYPE_CHECK(type)                                \
+    if (noDoubleTests((af_dtype)af::dtype_traits<type>::af_type)) \
+        GTEST_SKIP() << "Device doesn't support Doubles";         \
+    if (noHalfTests((af_dtype)af::dtype_traits<type>::af_type))   \
+    GTEST_SKIP() << "Device doesn't support Half"
+
+#ifdef SKIP_UNSUPPORTED_TESTS
+#define UNSUPPORTED_BACKEND(backend)                                         \
+    if (backend == af::getActiveBackend())                                   \
+    GTEST_SKIP() << "Skipping unsupported function on " + getBackendName() + \
+                        " backend"
+#else
+#define UNSUPPORTED_BACKEND(backend)
+#endif
+
+#define LAPACK_ENABLED_CHECK() \
+    if (!af::isLAPACKAvailable()) GTEST_SKIP() << "LAPACK Not Configured."
+
+#define IMAGEIO_ENABLED_CHECK() \
+    if (!af::isImageIOAvailable()) GTEST_SKIP() << "Image IO Not Configured"
+
+#ifdef AF_WITH_FAST_MATH
+#define SKIP_IF_FAST_MATH_ENABLED() \
+    GTEST_SKIP() << "ArrayFire compiled with AF_WITH_FAST_MATH"
+#else
+#define SKIP_IF_FAST_MATH_ENABLED()
+#endif
+
+template<typename TO, typename FROM>
+TO convert_to(FROM in) {
+    return TO(in);
 }
 
 // TODO: perform conversion on device for CUDA and OpenCL
 template<typename T>
-af_err conv_image(af_array *out, af_array in)
-{
-    af_array outArray;
+af_err conv_image(af_array *out, af_array in);
+
+template<typename T>
+af::array cpu_randu(const af::dim4 dims);
 
-    dim_t d0, d1, d2, d3;
-    af_get_dims(&d0, &d1, &d2, &d3, in);
-    af::dim4 idims(d0, d1, d2, d3);
+void cleanSlate();
 
-    dim_t nElems = 0;
-    af_get_elements(&nElems, in);
+//********** arrayfire custom test asserts ***********
 
-    float *in_data = new float[nElems];
-    af_get_data_ptr(in_data, in);
+// The unary operator+ overloads declared below are needed to make unsigned
+//  char values printable as numbers
+af_half abs(af_half in);
 
-    T *out_data = new T[nElems];
+af_half operator-(af_half lhs, af_half rhs);
 
-    for (int i = 0; i < (int)nElems; i++)
-        out_data[i] = (T)in_data[i];
+const af::cfloat &operator+(const af::cfloat &val);
 
-    af_create_array(&outArray, out_data, idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type);
+const af::cdouble &operator+(const af::cdouble &val);
 
-    std::swap(*out, outArray);
+const af_half &operator+(const af_half &val);
 
-    delete [] in_data;
-    delete [] out_data;
+// Calculate the linearized index of a set of multi-dimensional coordinates
+dim_t ravelIdx(af::dim4 coords, af::dim4 strides);
 
-    return AF_SUCCESS;
-}
+// Calculate a linearized index's multi-dimensional coordinates in an af::array,
+//  given its dimension sizes and strides
+af::dim4 unravelIdx(dim_t idx, af::dim4 dims, af::dim4 strides);
+
+af::dim4 unravelIdx(dim_t idx, af::array arr);
+
+af::dim4 calcStrides(const af::dim4 &parentDim);
+
+std::string minimalDim4(af::dim4 coords, af::dim4 dims);
 
 template<typename T>
-af::array cpu_randu(const af::dim4 dims)
-{
-    typedef typename af::dtype_traits<T>::base_type BT;
+std::string printContext(const std::vector<T> &hGold, std::string goldName,
+                         const std::vector<T> &hOut, std::string outName,
+                         af::dim4 arrDims, af::dim4 arrStrides, dim_t idx);
 
-    bool isTypeCplx = is_same_type<T, af::cfloat>::value || is_same_type<T, af::cdouble>::value;
-    bool isTypeFloat = is_same_type<BT, float>::value || is_same_type<BT, double>::value;
+struct FloatTag {};
+struct IntegerTag {};
 
-    dim_t elements = (isTypeCplx ? 2 : 1) * dims.elements();
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<T> &a, af::dim4 aDims,
+                                      const std::vector<T> &b, af::dim4 bDims,
+                                      float maxAbsDiff, IntegerTag);
 
-    std::vector<BT> out(elements);
-    for(int i = 0; i < (int)elements; i++) {
-        out[i] = isTypeFloat ? (BT)(rand())/RAND_MAX : rand() % 100;
-    }
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const std::vector<T> &a, af::dim4 aDims,
+                                      const std::vector<T> &b, af::dim4 bDims,
+                                      float maxAbsDiff, FloatTag);
 
-    return af::array(dims, (T *)&out[0]);
-}
+template<typename T>
+::testing::AssertionResult elemWiseEq(std::string aName, std::string bName,
+                                      const af::array &a, const af::array &b,
+                                      float maxAbsDiff);
+
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         const af::array &a, const af::array &b,
+                                         float maxAbsDiff = 0.f);
+
+// Called by ASSERT_VEC_ARRAY_EQ
+template<typename T>
+::testing::AssertionResult assertArrayEq(std::string aName,
+                                         std::string aDimsName,
+                                         std::string bName,
+                                         const std::vector<T> &hA,
+                                         af::dim4 aDims, const af::array &b,
+                                         float maxAbsDiff = 0.0f);
+
+// To support C API
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         const af_array a, const af_array b);
+
+// To support C API
+template<typename T>
+::testing::AssertionResult assertArrayEq(std::string hA_name,
+                                         std::string aDimsName,
+                                         std::string bName,
+                                         const std::vector<T> &hA,
+                                         af::dim4 aDims, const af_array b);
+
+// Called by ASSERT_ARRAYS_NEAR
+::testing::AssertionResult assertArrayNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af::array &a,
+                                           const af::array &b,
+                                           float maxAbsDiff);
+
+::testing::AssertionResult assertImageNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af_array &a, const af_array &b,
+                                           float maxAbsDiff);
+
+::testing::AssertionResult assertImageNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af::array &a,
+                                           const af::array &b,
+                                           float maxAbsDiff);
+
+// Called by ASSERT_VEC_ARRAY_NEAR
+template<typename T>
+::testing::AssertionResult assertArrayNear(
+    std::string hA_name, std::string aDimsName, std::string bName,
+    std::string maxAbsDiffName, const std::vector<T> &hA, af::dim4 aDims,
+    const af::array &b, float maxAbsDiff);
+
+// To support C API
+::testing::AssertionResult assertArrayNear(std::string aName, std::string bName,
+                                           std::string maxAbsDiffName,
+                                           const af_array a, const af_array b,
+                                           float maxAbsDiff);
+
+// To support C API
+template<typename T>
+::testing::AssertionResult assertArrayNear(
+    std::string hA_name, std::string aDimsName, std::string bName,
+    std::string maxAbsDiffName, const std::vector<T> &hA, af::dim4 aDims,
+    const af_array b, float maxAbsDiff);
+
+::testing::AssertionResult assertRefEq(std::string hA_name,
+                                       std::string expected_name,
+                                       const af::array &a, int expected);
+
+/// Checks if the C-API arrayfire function returns successfully
+///
+/// \param[in] CALL This is the arrayfire C function
+#define ASSERT_SUCCESS(CALL) ASSERT_EQ(AF_SUCCESS, CALL)
+
+/// Compares two af::array or af_arrays for their types, dims, and values
+/// (strict equality).
+///
+/// \param[in] EXPECTED The expected array of the assertion
+/// \param[in] ACTUAL The actual resulting array from the calculation
+#define ASSERT_ARRAYS_EQ(EXPECTED, ACTUAL) \
+    ASSERT_PRED_FORMAT2(assertArrayEq, EXPECTED, ACTUAL)
+
+/// Same as ASSERT_ARRAYS_EQ, but for cases when a "special" output array is
+/// given to the function.
+/// The special array can be null, a full-sized array, a subarray, or
+/// reordered. Currently this can only be used for testing C-API functions.
+///
+/// \param[in] EXPECTED The expected array of the assertion
+/// \param[in] ACTUAL The actual resulting array from the calculation
+#define ASSERT_SPECIAL_ARRAYS_EQ(EXPECTED, ACTUAL, META) \
+    ASSERT_PRED_FORMAT3(assertArrayEq, EXPECTED, ACTUAL, META)
+
+/// Compares a std::vector with an af::array or af_array for their types, dims,
+/// and values (strict equality).
+///
+/// \param[in] EXPECTED_VEC The vector that represents the expected array
+/// \param[in] EXPECTED_ARR_DIMS The dimensions of the expected array
+/// \param[in] ACTUAL_ARR The actual resulting array from the calculation
+#define ASSERT_VEC_ARRAY_EQ(EXPECTED_VEC, EXPECTED_ARR_DIMS, ACTUAL_ARR) \
+    ASSERT_PRED_FORMAT3(assertArrayEq, EXPECTED_VEC, EXPECTED_ARR_DIMS,  \
+                        ACTUAL_ARR)
+
+/// Compares two af::array or af_arrays for their types, dims, and values
+/// (strict equality).
+///
+/// \param[in] EXPECTED The expected array of the assertion
+/// \param[in] ACTUAL The actual resulting array from the calculation
+#define EXPECT_ARRAYS_EQ(EXPECTED, ACTUAL) \
+    EXPECT_PRED_FORMAT2(assertArrayEq, EXPECTED, ACTUAL)
+
+/// Same as EXPECT_ARRAYS_EQ, but for cases when a "special" output array is
+/// given to the function.
+/// The special array can be null, a full-sized array, a subarray, or
+/// reordered. Currently this can only be used for testing C-API functions.
+///
+/// \param[in] EXPECTED The expected array of the assertion
+/// \param[in] ACTUAL The actual resulting array from the calculation
+#define EXPECT_SPECIAL_ARRAYS_EQ(EXPECTED, ACTUAL, META) \
+    EXPECT_PRED_FORMAT3(assertArrayEq, EXPECTED, ACTUAL, META)
+
+/// Compares a std::vector with an af::array or af_array for their types, dims,
+/// and values (strict equality).
+///
+/// \param[in] EXPECTED_VEC The vector that represents the expected array
+/// \param[in] EXPECTED_ARR_DIMS The dimensions of the expected array
+/// \param[in] ACTUAL_ARR The actual resulting array from the calculation
+#define EXPECT_VEC_ARRAY_EQ(EXPECTED_VEC, EXPECTED_ARR_DIMS, ACTUAL_ARR) \
+    EXPECT_PRED_FORMAT3(assertArrayEq, EXPECTED_VEC, EXPECTED_ARR_DIMS,  \
+                        ACTUAL_ARR)
+
+/// Compares two af::array or af_arrays for their type, dims, and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED Expected value of the assertion
+/// \param[in] ACTUAL Actual value of the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED and ACTUAL
+///
+/// \NOTE: This macro will deallocate the af_arrays after the call
+#define ASSERT_ARRAYS_NEAR(EXPECTED, ACTUAL, MAX_ABSDIFF) \
+    ASSERT_PRED_FORMAT3(assertArrayNear, EXPECTED, ACTUAL, MAX_ABSDIFF)
+
+/// Compares two af::array or af_arrays for their type, dims, and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED Expected value of the assertion
+/// \param[in] ACTUAL Actual value of the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED and ACTUAL
+///
+/// \NOTE: This macro will deallocate the af_arrays after the call
+#define ASSERT_IMAGES_NEAR(EXPECTED, ACTUAL, MAX_ABSDIFF) \
+    ASSERT_PRED_FORMAT3(assertImageNear, EXPECTED, ACTUAL, MAX_ABSDIFF)
+
+/// Compares a std::vector with an af::array for their dims and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED_VEC The vector that represents the expected array
+/// \param[in] EXPECTED_ARR_DIMS The dimensions of the expected array
+/// \param[in] ACTUAL_ARR The actual array from the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED and ACTUAL
+#define ASSERT_VEC_ARRAY_NEAR(EXPECTED_VEC, EXPECTED_ARR_DIMS, ACTUAL_ARR, \
+                              MAX_ABSDIFF)                                 \
+    ASSERT_PRED_FORMAT4(assertArrayNear, EXPECTED_VEC, EXPECTED_ARR_DIMS,  \
+                        ACTUAL_ARR, MAX_ABSDIFF)
+
+/// Compares two af::array or af_arrays for their type, dims, and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED Expected value of the assertion
+/// \param[in] ACTUAL Actual value of the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED and ACTUAL
+///
+/// \note This macro will deallocate the af_arrays after the call
+#define EXPECT_ARRAYS_NEAR(EXPECTED, ACTUAL, MAX_ABSDIFF) \
+    EXPECT_PRED_FORMAT3(assertArrayNear, EXPECTED, ACTUAL, MAX_ABSDIFF)
+
+/// Compares two af::array or af_arrays for their type, dims, and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED Expected value of the assertion
+/// \param[in] ACTUAL Actual value of the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED and ACTUAL
+///
+/// \note This macro will deallocate the af_arrays after the call
+#define EXPECT_IMAGES_NEAR(EXPECTED, ACTUAL, MAX_ABSDIFF) \
+    EXPECT_PRED_FORMAT3(assertImageNear, EXPECTED, ACTUAL, MAX_ABSDIFF)
+
+/// Compares a std::vector with an af::array for their dims and values (with a
+/// given tolerance).
+///
+/// \param[in] EXPECTED_VEC The vector that represents the expected array
+/// \param[in] EXPECTED_ARR_DIMS The dimensions of the expected array
+/// \param[in] ACTUAL_ARR The actual array from the calculation
+/// \param[in] MAX_ABSDIFF Expected maximum absolute difference between
+///            elements of EXPECTED_VEC and ACTUAL_ARR
+#define EXPECT_VEC_ARRAY_NEAR(EXPECTED_VEC, EXPECTED_ARR_DIMS, ACTUAL_ARR, \
+                              MAX_ABSDIFF)                                 \
+    EXPECT_PRED_FORMAT4(assertArrayNear, EXPECTED_VEC, EXPECTED_ARR_DIMS,  \
+                        ACTUAL_ARR, MAX_ABSDIFF)
+
+#define ASSERT_REF(arr, expected) \
+    ASSERT_PRED_FORMAT2(assertRefEq, arr, expected)
+
+#if defined(USE_MTX)
+::testing::AssertionResult mtxReadSparseMatrix(af::array &out,
+                                               const char *fileName);
+#endif  // USE_MTX
+
+enum TestOutputArrayType {
+    // Test af_* function when given a null array as its output
+    NULL_ARRAY,
+
+    // Test af_* function when given an output array that is the same size as
+    // the expected output
+    FULL_ARRAY,
+
+    // Test af_* function when given an output array that is a sub-array of a
+    // larger array (the sub-array size is still the same size as the expected
+    // output). Only the sub-array must be modified by the af_* function
+    SUB_ARRAY,
+
+    // Test af_* function when given an output array that was previously
+    // reordered (but, after the reorder, still has the same shape as the
+    // expected output).
+    // This specifically uses the reorder behavior when dim0 is kept, and
+    // thus no data movement is done - only the dims and strides are
+    // modified
+    REORDERED_ARRAY
+};
+
+class TestOutputArrayInfo {
+    af_array out_arr;
+    af_array out_arr_cpy;
+    af_array out_subarr;
+    dim_t out_subarr_ndims;
+    af_seq out_subarr_idxs[4];
+    TestOutputArrayType out_arr_type;
+
+   public:
+    TestOutputArrayInfo();
+
+    TestOutputArrayInfo(TestOutputArrayType arr_type);
+
+    ~TestOutputArrayInfo();
+
+    void init(const unsigned ndims, const dim_t *const dims, const af_dtype ty);
+
+    void init(const unsigned ndims, const dim_t *const dims, const af_dtype ty,
+              const af_seq *const subarr_idxs);
+
+    void init(double val, const unsigned ndims, const dim_t *const dims,
+              const af_dtype ty);
+
+    void init(double val, const unsigned ndims, const dim_t *const dims,
+              const af_dtype ty, const af_seq *const subarr_idxs);
+
+    af_array getOutput();
+
+    void setOutput(af_array array);
+
+    af_array getFullOutput();
+    af_array getFullOutputCopy();
+    af_seq *getSubArrayIdxs();
+    dim_t getSubArrayNumDims();
+    TestOutputArrayType getOutputArrayType();
+};
+
+// Generates a random array. testWriteToOutputArray expects that it will receive
+// the same af_array that this generates after the af_* function is called
+void genRegularArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                     const dim_t *const dims, const af_dtype ty);
+
+void genRegularArray(TestOutputArrayInfo *metadata, double val,
+                     const unsigned ndims, const dim_t *const dims,
+                     const af_dtype ty);
+
+// Generates a large, random array, and extracts a subarray for the af_*
+// function to use. testWriteToOutputArray expects that the large array that it
+// receives is equal to the same large array with the gold array injected on the
+// same subarray location
+void genSubArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                 const dim_t *const dims, const af_dtype ty);
+
+void genSubArray(TestOutputArrayInfo *metadata, double val,
+                 const unsigned ndims, const dim_t *const dims,
+                 const af_dtype ty);
+
+// Generates a reordered array. testWriteToOutputArray expects that this array
+// will still have the correct output values from the af_* function, even though
+// the array was initially reordered.
+void genReorderedArray(TestOutputArrayInfo *metadata, const unsigned ndims,
+                       const dim_t *const dims, const af_dtype ty);
+
+void genReorderedArray(TestOutputArrayInfo *metadata, double val,
+                       const unsigned ndims, const dim_t *const dims,
+                       const af_dtype ty);
+// Partner function of testWriteToOutputArray. This generates the "special"
+// array that testWriteToOutputArray will use to check if the af_* function
+// correctly uses an existing array as its output
+void genTestOutputArray(af_array *out_ptr, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype ty,
+                        TestOutputArrayInfo *metadata);
+
+void genTestOutputArray(af_array *out_ptr, double val, const unsigned ndims,
+                        const dim_t *const dims, const af_dtype ty,
+                        TestOutputArrayInfo *metadata);
+
+// Partner function of genTestOutputArray. This uses the same "special"
+// array that genTestOutputArray generates, and checks whether the
+// af_* function wrote to that array correctly
+::testing::AssertionResult testWriteToOutputArray(
+    std::string gold_name, std::string result_name, const af_array gold,
+    const af_array out, TestOutputArrayInfo *metadata);
+
+// Called by ASSERT_SPECIAL_ARRAYS_EQ
+::testing::AssertionResult assertArrayEq(std::string aName, std::string bName,
+                                         std::string metadataName,
+                                         const af_array a, const af_array b,
+                                         TestOutputArrayInfo *metadata);
+
+enum tempFormat {
+    LINEAR_FORMAT,    // Linear array (= default)
+    JIT_FORMAT,       // Array which has JIT operations outstanding
+    SUB_FORMAT_dim0,  // Array where only a subset is allocated for dim0
+    SUB_FORMAT_dim1,  // Array where only a subset is allocated for dim1
+    SUB_FORMAT_dim2,  // Array where only a subset is allocated for dim2
+    SUB_FORMAT_dim3,  // Array where only a subset is allocated for dim3
+    REORDERED_FORMAT  // Array where the dimensions are reordered
+};
+// Invokes the macro TESTS once for each available format
+#define FOREACH_TEMP_FORMAT(TESTS) \
+    TESTS(LINEAR_FORMAT)           \
+    TESTS(JIT_FORMAT)              \
+    TESTS(SUB_FORMAT_dim0)         \
+    TESTS(SUB_FORMAT_dim1)         \
+    TESTS(SUB_FORMAT_dim2)         \
+    TESTS(SUB_FORMAT_dim3)         \
+    TESTS(REORDERED_FORMAT)
+
+// Formats the "in" array according to the provided format. The content remains
+// unchanged.
+af::array toTempFormat(tempFormat form, const af::array &in);
+void toTempFormat(tempFormat form, af_array *out, const af_array &in);
+
+#ifdef __GNUC__
+#pragma GCC diagnostic pop
+#endif
diff --git a/test/threading.cpp b/test/threading.cpp
new file mode 100644
index 0000000000..1b71411f0e
--- /dev/null
+++ b/test/threading.cpp
@@ -0,0 +1,753 @@
+/*******************************************************
+ * Copyright (c) 2017, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <sparse_common.hpp>
+#include <testHelpers.hpp>
+#include <af/traits.hpp>
+#include <chrono>
+#include <complex>
+#include <condition_variable>
+#include <cstddef>
+#include <iterator>
+#include <thread>
+#include <vector>
+
+using namespace af;
+
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+static const int THREAD_COUNT = 32;
+
+#if defined(AF_CPU)
+static const unsigned ITERATION_COUNT = 10;
+#else
+static const unsigned ITERATION_COUNT = 1000;
+#endif
+
+enum ArithOp { ADD, SUB, DIV, MUL };
+
+void calc(ArithOp opcode, array op1, array op2, float outValue,
+          int iteration_count) {
+    setDevice(0);
+    array res;
+    for (int i = 0; i < iteration_count; ++i) {
+        switch (opcode) {
+            case ADD: res = op1 + op2; break;
+            case SUB: res = op1 - op2; break;
+            case DIV: res = op1 / op2; break;
+            case MUL: res = op1 * op2; break;
+        }
+    }
+
+    vector<float> out(res.elements());
+    res.host((void*)out.data());
+
+    for (unsigned i = 0; i < out.size(); ++i) ASSERT_FLOAT_EQ(out[i], outValue);
+    af::sync();
+}
+
+TEST(Threading, SimultaneousRead) {
+    setDevice(0);
+
+    array A = constant(1.0, 100, 100);
+    array B = constant(1.0, 100, 100);
+
+    vector<std::thread> tests;
+
+    int thread_count    = 8;
+    int iteration_count = 30;
+    for (int t = 0; t < thread_count; ++t) {
+        ArithOp op;
+        float outValue;
+
+        switch (t % 4) {
+            case 0:
+                op       = ADD;
+                outValue = 2.0f;
+                break;
+            case 1:
+                op       = SUB;
+                outValue = 0.0f;
+                break;
+            case 2:
+                op       = DIV;
+                outValue = 1.0f;
+                break;
+            case 3:
+                op       = MUL;
+                outValue = 1.0f;
+                break;
+        }
+
+        tests.emplace_back(calc, op, A, B, outValue, iteration_count);
+    }
+
+    for (int t = 0; t < thread_count; ++t)
+        if (tests[t].joinable()) tests[t].join();
+}
+
+std::condition_variable cv;
+std::mutex cvMutex;
+size_t counter = THREAD_COUNT;
+
+void doubleAllocationTest() {
+    setDevice(0);
+
+    // Block until all threads are launched and the
+    // counter variable hits zero
+    std::unique_lock<std::mutex> lock(cvMutex);
+    // Decrement the thread launch counter: if it has reached zero,
+    // notify the other threads to continue; otherwise block the
+    // current thread until it does
+    if (--counter == 0)
+        cv.notify_all();
+    else
+        cv.wait(lock, [] { return counter == 0; });
+    lock.unlock();
+
+    array a = randu(5, 5);
+
+    // Wait for the other threads to hit the randu call
+    // while this thread's variable a is still in scope.
+    std::this_thread::sleep_for(std::chrono::seconds(2));
+}
+
+int nextTargetDeviceId() {
+    static int nextId = 0;
+    return nextId++;
+}
+
+void morphTest(const array input, const array mask, const bool isDilation,
+               const array gold, int targetDevice) {
+    UNSUPPORTED_BACKEND(AF_BACKEND_ONEAPI);
+    setDevice(targetDevice);
+
+    array out;
+
+    for (unsigned i = 0; i < ITERATION_COUNT; ++i)
+        out = isDilation ? dilate(input, mask) : erode(input, mask);
+
+    ASSERT_IMAGES_NEAR(gold, out, 0.018f);
+}
+
+TEST(Threading, SetPerThreadActiveDevice) {
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<bool> isDilationFlags;
+    vector<bool> isColorFlags;
+    vector<string> files;
+
+    files.push_back(string(TEST_DIR "/morph/gray.test"));
+    isDilationFlags.push_back(true);
+    isColorFlags.push_back(false);
+
+    files.push_back(string(TEST_DIR "/morph/color.test"));
+    isDilationFlags.push_back(false);
+    isColorFlags.push_back(true);
+
+    vector<std::thread> tests;
+    unsigned totalTestCount = 0;
+
+    for (size_t pos = 0; pos < files.size(); ++pos) {
+        const bool isDilation = isDilationFlags[pos];
+        const bool isColor    = isColorFlags[pos];
+
+        vector<dim4> inDims;
+        vector<string> inFiles;
+        vector<dim_t> outSizes;
+        vector<string> outFiles;
+
+        readImageTests(files[pos], inDims, inFiles, outSizes, outFiles);
+
+        const unsigned testCount = inDims.size();
+
+        const dim4 maskdims(3, 3, 1, 1);
+
+        for (size_t testId = 0; testId < testCount; ++testId) {
+            int trgtDeviceId = totalTestCount % getDeviceCount();
+
+            // prefix full path to image file names
+            inFiles[testId].insert(0, string(TEST_DIR "/morph/"));
+            outFiles[testId].insert(0, string(TEST_DIR "/morph/"));
+
+            setDevice(trgtDeviceId);
+
+            const array mask = constant(1.0, maskdims);
+
+            array input = loadImage(inFiles[testId].c_str(), isColor);
+            array gold  = loadImage(outFiles[testId].c_str(), isColor);
+
+            // Push the new test as a new thread of execution
+            tests.emplace_back(morphTest, input, mask, isDilation, gold,
+                               trgtDeviceId);
+
+            totalTestCount++;
+        }
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+TEST(Threading, MemoryManagementScope) {
+    setDevice(0);
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    for (int t = 0; t < THREAD_COUNT; ++t)
+        tests.emplace_back(doubleAllocationTest);
+
+    for (int t = 0; t < THREAD_COUNT; ++t)
+        if (tests[t].joinable()) tests[t].join();
+
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(lock_buffers, 0u);
+    ASSERT_EQ(lock_bytes, 0u);
+    ASSERT_EQ(alloc_buffers, 32u);
+    ASSERT_EQ(alloc_bytes, 32768u);
+}
+
+void jitAllocationTest() {
+    setDevice(0);
+
+    for (int i = 0; i < 100; ++i) array a = constant(1, 5, 5);
+}
+
+TEST(Threading, MemoryManagement_JIT_Node) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    for (int t = 0; t < THREAD_COUNT; ++t)
+        tests.emplace_back(jitAllocationTest);
+
+    for (int t = 0; t < THREAD_COUNT; ++t)
+        if (tests[t].joinable()) tests[t].join();
+
+    size_t alloc_bytes, alloc_buffers;
+    size_t lock_bytes, lock_buffers;
+
+    deviceMemInfo(&alloc_bytes, &alloc_buffers, &lock_bytes, &lock_buffers);
+
+    ASSERT_EQ(alloc_buffers, 0u);
+    ASSERT_EQ(lock_buffers, 0u);
+    ASSERT_EQ(alloc_bytes, 0u);
+    ASSERT_EQ(lock_bytes, 0u);
+}
+
+template<typename inType, typename outType, bool isInverse>
+void fftTest(int targetDevice, string pTestFile, dim_t pad0 = 0, dim_t pad1 = 0,
+             dim_t pad2 = 0) {
+    SUPPORTED_TYPE_CHECK(inType);
+    SUPPORTED_TYPE_CHECK(outType);
+
+    vector<dim4> numDims;
+    vector<vector<inType>> in;
+    vector<vector<outType>> tests;
+
+    readTestsFromFile<inType, outType>(pTestFile, numDims, in, tests);
+
+    dim4 dims         = numDims[0];
+    af_array outArray = 0;
+    af_array inArray  = 0;
+
+    ASSERT_SUCCESS(af_set_device(targetDevice));
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<inType>::af_type));
+
+    if (isInverse) {
+        switch (dims.ndims()) {
+            case 1:
+                ASSERT_SUCCESS(af_ifft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_ifft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_ifft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
+        }
+    } else {
+        switch (dims.ndims()) {
+            case 1:
+                ASSERT_SUCCESS(af_fft(&outArray, inArray, 1.0, pad0));
+                break;
+            case 2:
+                ASSERT_SUCCESS(af_fft2(&outArray, inArray, 1.0, pad0, pad1));
+                break;
+            case 3:
+                ASSERT_SUCCESS(
+                    af_fft3(&outArray, inArray, 1.0, pad0, pad1, pad2));
+                break;
+            default:
+                throw std::runtime_error(
+                    "This error shouldn't happen, please check");
+        }
+    }
+
+    size_t out_size  = tests[0].size();
+    outType* outData = new outType[out_size];
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
+
+    vector<outType> goldBar(tests[0].begin(), tests[0].end());
+
+    size_t test_size = 0;
+    switch (dims.ndims()) {
+        case 1: test_size = dims[0] / 2 + 1; break;
+        case 2: test_size = dims[1] * (dims[0] / 2 + 1); break;
+        case 3: test_size = dims[2] * dims[1] * (dims[0] / 2 + 1); break;
+        default: test_size = dims[0] / 2 + 1; break;
+    }
+    outType output_scale = (outType)(isInverse ? test_size : 1);
+    for (size_t elIter = 0; elIter < test_size; ++elIter) {
+        bool isUnderTolerance = abs(goldBar[elIter] - outData[elIter]) < 0.001;
+        ASSERT_EQ(true, isUnderTolerance)
+            << "Expected value=" << goldBar[elIter]
+            << "\t Actual Value=" << (output_scale * outData[elIter])
+            << " at: " << elIter
+            << " from thread: " << std::this_thread::get_id() << endl;
+    }
+
+    // cleanup
+    delete[] outData;
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
+}
+
+#define INSTANTIATE_TEST(func, name, is_inverse, in_t, out_t, file)        \
+    {                                                                      \
+        int targetDevice = nextTargetDeviceId() % numDevices;              \
+        tests.emplace_back(fftTest<in_t, out_t, is_inverse>, targetDevice, \
+                           file, 0, 0, 0);                                 \
+    }
+
+#define INSTANTIATE_TEST_TP(func, name, is_inverse, in_t, out_t, file, p0, p1) \
+    {                                                                          \
+        int targetDevice = nextTargetDeviceId() % numDevices;                  \
+        tests.emplace_back(fftTest<in_t, out_t, is_inverse>, targetDevice,     \
+                           file, p0, p1, 0);                                   \
+    }
+
+TEST(Threading, FFT_R2C) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    // Real to complex transforms
+    INSTANTIATE_TEST(fft, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft_r2c.test"));
+    INSTANTIATE_TEST(fft2, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft2_r2c.test"));
+    INSTANTIATE_TEST(fft3, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft3_r2c.test"));
+
+    // Factors 7, 11, 13
+    INSTANTIATE_TEST(fft, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft2, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft3, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+
+    // transforms on padded and truncated arrays
+    INSTANTIATE_TEST_TP(fft2, R2C_Float_Trunc, false, float, cfloat,
+                        string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16, 16);
+
+    if (noDoubleTests(f64)) {
+        // Real to complex transforms
+        INSTANTIATE_TEST(fft, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft_r2c.test"));
+        INSTANTIATE_TEST(fft2, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft2_r2c.test"));
+        INSTANTIATE_TEST(fft3, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft3_r2c.test"));
+
+        // Factors 7, 11, 13
+        INSTANTIATE_TEST(fft, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft2, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft3, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+
+        // transforms on padded and truncated arrays
+        INSTANTIATE_TEST_TP(fft2, R2C_Double_Trunc, false, double, cdouble,
+                            string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16,
+                            16);
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+TEST(Threading, FFT_C2C) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    // complex to complex transforms
+    INSTANTIATE_TEST(fft, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft_c2c.test"));
+    INSTANTIATE_TEST(fft2, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft2_c2c.test"));
+    INSTANTIATE_TEST(fft3, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft3_c2c.test"));
+
+    INSTANTIATE_TEST(fft, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft2, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft3, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
+
+    // transforms on padded and truncated arrays
+    INSTANTIATE_TEST_TP(fft2, C2C_Float_Pad, false, cfloat, cfloat,
+                        string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16, 16);
+
+    // inverse transforms
+    // complex to complex transforms
+    INSTANTIATE_TEST(ifft, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft_c2c.test"));
+    INSTANTIATE_TEST(ifft2, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft2_c2c.test"));
+    INSTANTIATE_TEST(ifft3, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft3_c2c.test"));
+
+    if (noDoubleTests(f64)) {
+        INSTANTIATE_TEST(fft, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft_c2c.test"));
+        INSTANTIATE_TEST(fft2, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft2_c2c.test"));
+        INSTANTIATE_TEST(fft3, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft3_c2c.test"));
+
+        INSTANTIATE_TEST(fft, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft2, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft3, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
+
+        INSTANTIATE_TEST_TP(fft2, C2C_Double_Pad, false, cdouble, cdouble,
+                            string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16,
+                            16);
+
+        INSTANTIATE_TEST(ifft, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft_c2c.test"));
+        INSTANTIATE_TEST(ifft2, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft2_c2c.test"));
+        INSTANTIATE_TEST(ifft3, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft3_c2c.test"));
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+TEST(Threading, FFT_ALL) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    // Real to complex transforms
+    INSTANTIATE_TEST(fft, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft_r2c.test"));
+    INSTANTIATE_TEST(fft2, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft2_r2c.test"));
+    INSTANTIATE_TEST(fft3, R2C_Float, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft3_r2c.test"));
+
+    // Factors 7, 11, 13
+    INSTANTIATE_TEST(fft, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft2, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft3, R2C_Float_7_11_13, false, float, cfloat,
+                     string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+
+    // transforms on padded and truncated arrays
+    INSTANTIATE_TEST_TP(fft2, R2C_Float_Trunc, false, float, cfloat,
+                        string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16, 16);
+
+    // complex to complex transforms
+    INSTANTIATE_TEST(fft, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft_c2c.test"));
+    INSTANTIATE_TEST(fft2, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft2_c2c.test"));
+    INSTANTIATE_TEST(fft3, C2C_Float, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft3_c2c.test"));
+
+    INSTANTIATE_TEST(fft, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft2, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+    INSTANTIATE_TEST(fft3, C2C_Float_7_11_13, false, cfloat, cfloat,
+                     string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
+
+    // transforms on padded and truncated arrays
+    INSTANTIATE_TEST_TP(fft2, C2C_Float_Pad, false, cfloat, cfloat,
+                        string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16, 16);
+
+    // inverse transforms
+    // complex to complex transforms
+    INSTANTIATE_TEST(ifft, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft_c2c.test"));
+    INSTANTIATE_TEST(ifft2, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft2_c2c.test"));
+    INSTANTIATE_TEST(ifft3, C2C_Float, true, cfloat, cfloat,
+                     string(TEST_DIR "/signal/ifft3_c2c.test"));
+
+    if (noDoubleTests(f64)) {
+        INSTANTIATE_TEST(fft, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft_r2c.test"));
+        INSTANTIATE_TEST(fft2, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft2_r2c.test"));
+        INSTANTIATE_TEST(fft3, R2C_Double, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft3_r2c.test"));
+        INSTANTIATE_TEST(fft, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft_r2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft2, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft2_r2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft3, R2C_Double_7_11_13, false, double, cdouble,
+                         string(TEST_DIR "/signal/fft3_r2c_7_11_13.test"));
+        INSTANTIATE_TEST_TP(fft2, R2C_Double_Trunc, false, double, cdouble,
+                            string(TEST_DIR "/signal/fft2_r2c_trunc.test"), 16,
+                            16);
+        INSTANTIATE_TEST(fft, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft_c2c.test"));
+        INSTANTIATE_TEST(fft2, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft2_c2c.test"));
+        INSTANTIATE_TEST(fft3, C2C_Double, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft3_c2c.test"));
+        INSTANTIATE_TEST(fft, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft_c2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft2, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft2_c2c_7_11_13.test"));
+        INSTANTIATE_TEST(fft3, C2C_Double_7_11_13, false, cdouble, cdouble,
+                         string(TEST_DIR "/signal/fft3_c2c_7_11_13.test"));
+        INSTANTIATE_TEST_TP(fft2, C2C_Double_Pad, false, cdouble, cdouble,
+                            string(TEST_DIR "/signal/fft2_c2c_pad.test"), 16,
+                            16);
+        INSTANTIATE_TEST(ifft, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft_c2c.test"));
+        INSTANTIATE_TEST(ifft2, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft2_c2c.test"));
+        INSTANTIATE_TEST(ifft3, C2C_Double, true, cdouble, cdouble,
+                         string(TEST_DIR "/signal/ifft3_c2c.test"));
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+template<typename T, bool isBVector>
+void cppMatMulCheck(int targetDevice, string TestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    using std::vector;
+    vector<dim4> numDims;
+
+    vector<vector<T>> hData;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(TestFile, numDims, hData, tests);
+
+    setDevice(targetDevice);
+
+    array a(numDims[0], &hData[0].front());
+    array b(numDims[1], &hData[1].front());
+
+    dim4 atdims = numDims[0];
+    {
+        dim_t f   = atdims[0];
+        atdims[0] = atdims[1];
+        atdims[1] = f;
+    }
+    dim4 btdims = numDims[1];
+    {
+        dim_t f   = btdims[0];
+        btdims[0] = btdims[1];
+        btdims[1] = f;
+    }
+
+    array aT = moddims(a, atdims.ndims(), atdims.get());
+    array bT = moddims(b, btdims.ndims(), btdims.get());
+
+    vector<array> out(tests.size());
+    if (isBVector) {
+        out[0] = matmul(aT, b, AF_MAT_NONE, AF_MAT_NONE);
+        out[1] = matmul(bT, a, AF_MAT_NONE, AF_MAT_NONE);
+        out[2] = matmul(b, a, AF_MAT_TRANS, AF_MAT_NONE);
+        out[3] = matmul(bT, aT, AF_MAT_NONE, AF_MAT_TRANS);
+        out[4] = matmul(b, aT, AF_MAT_TRANS, AF_MAT_TRANS);
+    } else {
+        out[0] = matmul(a, b, AF_MAT_NONE, AF_MAT_NONE);
+        out[1] = matmul(a, bT, AF_MAT_NONE, AF_MAT_TRANS);
+        out[2] = matmul(a, bT, AF_MAT_TRANS, AF_MAT_NONE);
+        out[3] = matmul(aT, bT, AF_MAT_TRANS, AF_MAT_TRANS);
+    }
+
+    for (size_t i = 0; i < tests.size(); i++) {
+        dim_t elems = out[i].elements();
+        vector<T> h_out(elems);
+        out[i].host((void*)&h_out.front());
+
+        if (false == equal(h_out.begin(), h_out.end(), tests[i].begin())) {
+            cout << "Failed test " << i << "\nCalculated: " << endl;
+            std::copy(h_out.begin(), h_out.end(),
+                      std::ostream_iterator<T>(cout, ", "));
+            cout << "Expected: " << endl;
+            std::copy(tests[i].begin(), tests[i].end(),
+                      std::ostream_iterator<T>(cout, ", "));
+            FAIL();
+        }
+    }
+}
+
+#define TEST_BLAS_FOR_TYPE(TypeName)                        \
+    tests.emplace_back(cppMatMulCheck<TypeName, false>,     \
+                       nextTargetDeviceId() % numDevices,   \
+                       TEST_DIR "/blas/Basic.test");        \
+    tests.emplace_back(cppMatMulCheck<TypeName, false>,     \
+                       nextTargetDeviceId() % numDevices,   \
+                       TEST_DIR "/blas/NonSquare.test");    \
+    tests.emplace_back(cppMatMulCheck<TypeName, true>,      \
+                       nextTargetDeviceId() % numDevices,   \
+                       TEST_DIR "/blas/SquareVector.test"); \
+    tests.emplace_back(cppMatMulCheck<TypeName, true>,      \
+                       nextTargetDeviceId() % numDevices,   \
+                       TEST_DIR "/blas/RectangleVector.test");
+
+TEST(Threading, BLAS) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    TEST_BLAS_FOR_TYPE(float);
+    TEST_BLAS_FOR_TYPE(cfloat);
+
+    if (noDoubleTests(f64)) {
+        TEST_BLAS_FOR_TYPE(double);
+        TEST_BLAS_FOR_TYPE(cdouble);
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+#define SPARSE_TESTS(T, eps)                                            \
+    tests.emplace_back(sparseTester<T>, 1000, 1000, 100, 5, eps,        \
+                       nextTargetDeviceId() % numDevices);              \
+    tests.emplace_back(sparseTester<T>, 500, 1000, 250, 1, eps,         \
+                       nextTargetDeviceId() % numDevices);              \
+    tests.emplace_back(sparseTester<T>, 625, 1331, 1, 2, eps,           \
+                       nextTargetDeviceId() % numDevices);              \
+    tests.emplace_back(sparseTransposeTester<T>, 625, 1331, 1, 2, eps,  \
+                       nextTargetDeviceId() % numDevices);              \
+    tests.emplace_back(sparseTransposeTester<T>, 453, 751, 397, 1, eps, \
+                       nextTargetDeviceId() % numDevices);              \
+    tests.emplace_back(convertCSR<T>, 2345, 5678, 0.5,                  \
+                       nextTargetDeviceId() % numDevices);
+
+TEST(Threading, Sparse) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    int numDevices = 0;
+    ASSERT_SUCCESS(af_get_device_count(&numDevices));
+    ASSERT_EQ(true, numDevices > 0);
+
+    SPARSE_TESTS(float, 1E-3);
+    SPARSE_TESTS(cfloat, 1E-3);
+    if (noDoubleTests(f64)) {
+        SPARSE_TESTS(double, 1E-5);
+        SPARSE_TESTS(cdouble, 1E-5);
+    }
+
+    for (size_t testId = 0; testId < tests.size(); ++testId)
+        if (tests[testId].joinable()) tests[testId].join();
+}
+
+TEST(Threading, DISABLED_MemoryManagerStressTest) {
+    vector<std::thread> threads;
+    for (int i = 0; i < THREAD_COUNT; i++) {
+        threads.emplace_back([] {
+            vector<array> arrg;
+            int size     = 100;
+            int ex_count = 0;
+
+            // Continue until the memory runs out multiple times
+            while (true) {
+                try {
+                    // constantly change size of the array allocated
+                    size += 10;
+                    arrg.push_back(randu(size));
+
+                    // delete some values intermittently
+                    if (!(size % 200)) {
+                        arrg.erase(std::begin(arrg), std::begin(arrg) + 5);
+                    }
+                } catch (const exception& ex) {
+                    if (ex_count++ > 3) { break; }
+                }
+            }
+        });
+    }
+    for (auto& t : threads) { t.join(); }
+}
+
+TEST(Threading, DISABLED_Sort) {
+    cleanSlate();  // Clean up everything done so far
+
+    vector<std::thread> tests;
+
+    ASSERT_SUCCESS(af_set_device(0));
+
+    for (int i = 0; i < THREAD_COUNT; ++i) {
+        tests.emplace_back([] {
+            array a = randu(100, 100);
+            for (int k = 0; k < 100; ++k) array b = sort(a);
+        });
+    }
+
+    for (auto& t : tests)
+        if (t.joinable()) t.join();
+}
diff --git a/test/tile.cpp b/test/tile.cpp
index fc38827fe2..3a608fa987 100644
--- a/test/tile.cpp
+++ b/test/tile.cpp
@@ -8,143 +8,234 @@
  ********************************************************/
 
 #include <gtest/gtest.h>
-#include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/algorithm.h>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
-#include <iostream>
+
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::constant;
+using af::dim4;
+using af::dtype_traits;
+using af::product;
+using af::seq;
+using af::span;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Tile : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat0.push_back(af_make_seq(0, 4, 1));
-            subMat0.push_back(af_make_seq(2, 6, 1));
-            subMat0.push_back(af_make_seq(0, 2, 1));
-        }
-        vector<af_seq> subMat0;
+class Tile : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat0.push_back(af_make_seq(0, 4, 1));
+        subMat0.push_back(af_make_seq(2, 6, 1));
+        subMat0.push_back(af_make_seq(0, 2, 1));
+    }
+    vector<af_seq> subMat0;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Tile, TestTypes);
+TYPED_TEST_SUITE(Tile, TestTypes);
 
 template<typename T>
-void tileTest(string pTestFile, const unsigned resultIdx, const uint x, const uint y, const uint z, const uint w,
-              bool isSubRef = false, const vector<af_seq> * seqv = NULL)
-{
-    if (noDoubleTests<T>()) return;
+void tileTest(string pTestFile, const unsigned resultIdx, const uint x,
+              const uint y, const uint z, const uint w, bool isSubRef = false,
+              const vector<af_seq>* seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> > in;
-    vector<vector<T> > tests;
-    readTests<T, T, int>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
 
-    af::dim4 idims = numDims[0];
+    dim4 idims = numDims[0];
 
-    af_array inArray = 0;
-    af_array outArray = 0;
+    af_array inArray   = 0;
+    af_array outArray  = 0;
     af_array tempArray = 0;
 
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
 
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv->size(), &seqv->front()));
     } else {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), idims.ndims(), idims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()),
+                                       idims.ndims(), idims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_tile(&outArray, inArray, x, y, z, w));
+    ASSERT_SUCCESS(af_tile(&outArray, inArray, x, y, z, w));
 
-    // Get result
-    T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    dim4 goldDims(idims[0] * x, idims[1] * y, idims[2] * z, idims[3] * w);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, outArray);
 
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
+}
+
+#define TILE_INIT(desc, file, resultIdx, x, y, z, w)                 \
+    TYPED_TEST(Tile, desc) {                                         \
+        tileTest<TypeParam>(string(TEST_DIR "/tile/" #file ".test"), \
+                            resultIdx, x, y, z, w);                  \
     }
 
-    // Delete
-    delete[] outData;
+TILE_INIT(Tile432, tile, 0, 4, 3, 2, 1);
+TILE_INIT(Tile111, tile, 1, 1, 1, 1, 1);
+TILE_INIT(Tile123, tile, 2, 1, 2, 3, 1);
+TILE_INIT(Tile312, tile, 3, 3, 1, 2, 1);
+TILE_INIT(Tile231, tile, 4, 2, 3, 1, 1);
+
+TILE_INIT(Tile3D432, tile_large3D, 0, 2, 2, 2, 1);
+TILE_INIT(Tile3D111, tile_large3D, 1, 1, 1, 1, 1);
+TILE_INIT(Tile3D123, tile_large3D, 2, 1, 2, 3, 1);
+TILE_INIT(Tile3D312, tile_large3D, 3, 3, 1, 2, 1);
+TILE_INIT(Tile3D231, tile_large3D, 4, 2, 3, 1, 1);
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+TILE_INIT(Tile2D432, tile_large2D, 0, 2, 2, 2, 1);
+TILE_INIT(Tile2D111, tile_large2D, 1, 1, 1, 1, 1);
+TILE_INIT(Tile2D123, tile_large2D, 2, 1, 2, 3, 1);
+TILE_INIT(Tile2D312, tile_large2D, 3, 3, 1, 2, 1);
+TILE_INIT(Tile2D231, tile_large2D, 4, 2, 3, 1, 1);
+
+///////////////////////////////// CPP ////////////////////////////////////
+//
+TEST(Tile, CPP) {
+    const unsigned resultIdx = 0;
+    const unsigned x         = 2;
+    const unsigned y         = 2;
+    const unsigned z         = 2;
+    const unsigned w         = 1;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/tile/tile_large3D.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array output = tile(input, x, y, z, w);
+
+    dim4 goldDims(idims[0] * x, idims[1] * y, idims[2] * z, idims[3] * w);
+    ASSERT_VEC_ARRAY_EQ(tests[resultIdx], goldDims, output);
 }
 
-#define TILE_INIT(desc, file, resultIdx, x, y, z, w)                                        \
-    TYPED_TEST(Tile, desc)                                                                  \
-    {                                                                                       \
-        tileTest<TypeParam>(string(TEST_DIR"/tile/"#file".test"), resultIdx, x, y, z, w);   \
-    }
+TEST(Tile, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+    const unsigned x      = 1;
+    const unsigned z      = 1;
+    unsigned y            = 2;
+    unsigned w            = 1;
 
-    TILE_INIT(Tile432, tile, 0, 4, 3, 2, 1);
-    TILE_INIT(Tile111, tile, 1, 1, 1, 1, 1);
-    TILE_INIT(Tile123, tile, 2, 1, 2, 3, 1);
-    TILE_INIT(Tile312, tile, 3, 3, 1, 2, 1);
-    TILE_INIT(Tile231, tile, 4, 2, 3, 1, 1);
+    array input  = constant(1, 1, largeDim);
+    array output = tile(input, x, y, z, w);
 
-    TILE_INIT(Tile3D432, tile_large3D, 0, 2, 2, 2, 1);
-    TILE_INIT(Tile3D111, tile_large3D, 1, 1, 1, 1, 1);
-    TILE_INIT(Tile3D123, tile_large3D, 2, 1, 2, 3, 1);
-    TILE_INIT(Tile3D312, tile_large3D, 3, 3, 1, 2, 1);
-    TILE_INIT(Tile3D231, tile_large3D, 4, 2, 3, 1, 1);
+    ASSERT_EQ(1, output.dims(0));
+    ASSERT_EQ(2 * largeDim, output.dims(1));
+    ASSERT_EQ(1, output.dims(2));
+    ASSERT_EQ(1, output.dims(3));
 
-    TILE_INIT(Tile2D432, tile_large2D, 0, 2, 2, 2, 1);
-    TILE_INIT(Tile2D111, tile_large2D, 1, 1, 1, 1, 1);
-    TILE_INIT(Tile2D123, tile_large2D, 2, 1, 2, 3, 1);
-    TILE_INIT(Tile2D312, tile_large2D, 3, 3, 1, 2, 1);
-    TILE_INIT(Tile2D231, tile_large2D, 4, 2, 3, 1, 1);
+    ASSERT_EQ(1.f, product<float>(output));
 
+    y = 1;
+    w = 2;
 
-///////////////////////////////// CPP ////////////////////////////////////
-//
-TEST(Tile, CPP)
-{
-    if (noDoubleTests<float>()) return;
+    input  = constant(1, 1, 1, 1, largeDim);
+    output = tile(input, x, y, z, w);
 
-    const unsigned resultIdx = 0;
-    const unsigned x = 2;
-    const unsigned y = 2;
-    const unsigned z = 2;
-    const unsigned w = 1;
-
-    vector<af::dim4> numDims;
-    vector<vector<float> > in;
-    vector<vector<float> > tests;
-    readTests<float, float, int>(string(TEST_DIR"/tile/tile_large3D.test"),numDims,in,tests);
-
-    af::dim4 idims = numDims[0];
-    af::array input(idims, &(in[0].front()));
-    af::array output = af::tile(input, x, y, z, w);
-
-    // Get result
-    float* outData = new float[tests[resultIdx].size()];
-    output.host((void*)outData);
-
-    // Compare result
-    size_t nElems = tests[resultIdx].size();
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter]) << "at: " << elIter << std::endl;
-    }
+    ASSERT_EQ(1, output.dims(0));
+    ASSERT_EQ(1, output.dims(1));
+    ASSERT_EQ(1, output.dims(2));
+    ASSERT_EQ(2 * largeDim, output.dims(3));
+
+    ASSERT_EQ(1.f, product<float>(output));
+}
 
-    // Delete
-    delete[] outData;
+TEST(Tile, DocSnippet) {
+    //! [ex_tile_input]
+    float hA[] = {0, 1, 2, 3, 4, 5};
+    array A(3, 2, hA);
+    //  0.  3.
+    //  1.  4.
+    //  2.  5.
+    //! [ex_tile_input]
+
+    //! [ex_tile_0_2]
+    array B = tile(A, 2, 1);
+    //  0.  3.
+    //  1.  4.
+    //  2.  5.
+    //  0.  3.
+    //  1.  4.
+    //  2.  5.
+    //! [ex_tile_0_2]
+
+    ASSERT_ARRAYS_EQ(A, B(seq(0, 2), span));
+    ASSERT_ARRAYS_EQ(A, B(seq(3, 5), span));
+
+    //! [ex_tile_1_3]
+    array C = tile(A, 1, 3);
+    //  0.  3.  0.  3.  0.  3.
+    //  1.  4.  1.  4.  1.  4.
+    //  2.  5.  2.  5.  2.  5.
+    //! [ex_tile_1_3]
+
+    ASSERT_ARRAYS_EQ(A, C(span, seq(0, 1)));
+    ASSERT_ARRAYS_EQ(A, C(span, seq(2, 3)));
+    ASSERT_ARRAYS_EQ(A, C(span, seq(4, 5)));
+
+    //! [ex_tile_0_2_and_1_3]
+    array D = tile(A, 2, 3);
+    //  0.  3.  0.  3.  0.  3.
+    //  1.  4.  1.  4.  1.  4.
+    //  2.  5.  2.  5.  2.  5.
+    //  0.  3.  0.  3.  0.  3.
+    //  1.  4.  1.  4.  1.  4.
+    //  2.  5.  2.  5.  2.  5.
+    //! [ex_tile_0_2_and_1_3]
+
+    ASSERT_ARRAYS_EQ(C, D(seq(0, 2), span));
+    ASSERT_ARRAYS_EQ(C, D(seq(3, 5), span));
 }
 
+// Tile was failing for larger sizes because the JIT kernels were not able to
+// handle repeated x blocks. The kernels were exiting early, which caused the
+// next iteration to fail.
+TEST(Tile, LargeRepeatDim) {
+    long long dim0     = 33;
+    long long largeDim = 40001;
+    array temp_ones    = af::iota(largeDim, dim4(1), u8);
+    temp_ones          = af::moddims(temp_ones, 1, 1, largeDim);
+    temp_ones.eval();
+
+    array temp = tile(temp_ones, dim0, 1, 1);
+    temp.eval();
+    vector<unsigned char> empty(dim0 * largeDim);
+    for (long long ii = 0; ii < largeDim; ii++) {
+        int offset = ii * dim0;
+        for (int i = 0; i < dim0; i++) { empty[offset + i] = ii; }
+    }
+
+    ASSERT_VEC_ARRAY_EQ(empty, dim4(dim0, 1, largeDim), temp);
+}
diff --git a/test/topk.cpp b/test/topk.cpp
new file mode 100644
index 0000000000..58319f25e8
--- /dev/null
+++ b/test/topk.cpp
@@ -0,0 +1,495 @@
+/*******************************************************
+ * Copyright (c) 2018, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+
+#include <algorithm>
+#include <cmath>
+#include <functional>
+#include <iostream>
+#include <iterator>
+#include <map>
+#include <numeric>
+#include <random>
+#include <sstream>
+#include <string>
+#include <type_traits>
+#include <utility>
+
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using af::iota;
+using af::topk;
+using af::topkFunction;
+using half_float::half;
+
+using std::iota;
+using std::make_pair;
+using std::min;
+using std::mt19937;
+using std::ostream;
+using std::pair;
+using std::random_device;
+using std::shuffle;
+using std::string;
+using std::stringstream;
+using std::vector;
+
+template<typename T>
+class TopK : public ::testing::Test {};
+
+typedef ::testing::Types<float, double, int, uint, half_float::half> TestTypes;
+
+TYPED_TEST_SUITE(TopK, TestTypes);
+
+template<typename T>
+void increment_next(T& val,
+                    typename std::enable_if<std::is_floating_point<T>::value,
+                                            int>::type t = 0) {
+    val = std::nextafter(val, std::numeric_limits<T>::max());
+}
+
+template<typename T>
+void increment_next(
+    T& val,
+    typename std::enable_if<std::is_integral<T>::value, int>::type t = 0) {
+    ++val;
+}
+
+void increment_next(half_float::half& val) {
+    half_float::half tmp = (half_float::half)half_float::nextafter(
+        val, std::numeric_limits<half_float::half>::max());
+    val = tmp;
+}
+
+template<typename T>
+void topkTest(const int ndims, const dim_t* dims, const unsigned k,
+              const int dim, const af_topk_function order) {
+    SUPPORTED_TYPE_CHECK(T);
+    af_dtype dtype = (af_dtype)dtype_traits<T>::af_type;
+
+    af_array input, output, outindex;
+
+    size_t ielems = 1;
+    size_t oelems = 1;
+
+    for (int i = 0; i < ndims; i++) {
+        ielems *= dims[i];
+        oelems *= (i == dim ? k : dims[i]);
+    }
+
+    size_t bCount = ielems / dims[dim];
+    size_t bSize  = dims[dim];
+
+    vector<T> inData(ielems);
+    T val{std::numeric_limits<T>::lowest()};
+    generate(begin(inData), end(inData), [&]() {
+        increment_next(val);
+        return val;
+    });
+
+    random_device rnd_device;
+    mt19937 g(rnd_device());
+    shuffle(begin(inData), end(inData), g);
+
+    vector<T> outData(oelems);
+    vector<uint> outIdxs(oelems);
+
+    for (size_t b = 0; b < bCount; b++) {
+        using KeyValuePair = pair<T, uint>;
+
+        vector<KeyValuePair> kvPairs;
+        kvPairs.reserve(bSize);
+
+        for (size_t i = b * bSize; i < ((b + 1) * bSize); ++i)
+            kvPairs.push_back(make_pair(inData[i], (i - b * bSize)));
+
+        if (order & AF_TOPK_MIN) {
+            stable_sort(kvPairs.begin(), kvPairs.end(),
+                        [](const KeyValuePair& lhs, const KeyValuePair& rhs) {
+                            return lhs.first < rhs.first;
+                        });
+        } else {
+            stable_sort(kvPairs.begin(), kvPairs.end(),
+                        [](const KeyValuePair& lhs, const KeyValuePair& rhs) {
+                            return lhs.first > rhs.first;
+                        });
+        }
+
+        auto it = kvPairs.begin();
+        for (size_t i = 0; i < k; ++it, ++i) {
+            outData[i + b * k] = it->first;
+            outIdxs[i + b * k] = it->second;
+        }
+    }
+
+    ASSERT_SUCCESS(af_create_array(&input, inData.data(), ndims, dims, dtype));
+    ASSERT_SUCCESS(af_topk(&output, &outindex, input, k, dim, order));
+
+    vector<T> hovals(oelems);
+    vector<uint> hoidxs(oelems);
+
+    ASSERT_SUCCESS(af_get_data_ptr((void*)hovals.data(), output));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)hoidxs.data(), outindex));
+
+    for (size_t i = 0; i < oelems; ++i) {
+        switch (dtype) {
+            case f64:
+                EXPECT_DOUBLE_EQ(outData[i], hovals[i]) << "at: " << i;
+                break;
+            case f32:
+                EXPECT_FLOAT_EQ(outData[i], hovals[i]) << "at: " << i;
+                break;
+            default: EXPECT_EQ(outData[i], hovals[i]) << "at: " << i; break;
+        }
+        ASSERT_EQ(outIdxs[i], hoidxs[i]) << "at: " << i;
+    }
+
+    ASSERT_SUCCESS(af_release_array(input));
+    ASSERT_SUCCESS(af_release_array(output));
+    ASSERT_SUCCESS(af_release_array(outindex));
+}
+
+int type_max(af_dtype type) {
+    switch (type) {
+        case f16: return 63000;
+        default: return 100000;
+    }
+}
+
+TYPED_TEST(TopK, Max1D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t), 1, 1, 1};
+    topkTest<TypeParam>(1, dims, 5, 0, AF_TOPK_MAX);
+}
+
+TYPED_TEST(TopK, Max2D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 10, 10, 1, 1};
+    topkTest<TypeParam>(2, dims, 3, 0, AF_TOPK_MAX);
+}
+
+TYPED_TEST(TopK, Max3D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 100, 10, 10, 1};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_MAX);
+}
+
+TYPED_TEST(TopK, Max4D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 1000, 10, 10, 10};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_MAX);
+}
+
+TYPED_TEST(TopK, Min1D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t), 1, 1, 1};
+    topkTest<TypeParam>(1, dims, 5, 0, AF_TOPK_MIN);
+}
+
+TYPED_TEST(TopK, Min2D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 10, 10, 1, 1};
+    topkTest<TypeParam>(2, dims, 3, 0, AF_TOPK_MIN);
+}
+
+TYPED_TEST(TopK, Min3D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 100, 10, 10, 1};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_MIN);
+}
+
+TYPED_TEST(TopK, Min4D0) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 1000, 10, 10, 10};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_MIN);
+}
+
+TEST(TopK, ValidationCheck_DimN) {
+    dim_t dims[4] = {10, 10, 1, 1};
+    af_array out, idx, in;
+    ASSERT_SUCCESS(af_randu(&in, 2, dims, f32));
+    ASSERT_EQ(AF_ERR_NOT_SUPPORTED,
+              af_topk(&out, &idx, in, 10, 1, AF_TOPK_MAX));
+    ASSERT_SUCCESS(af_release_array(in));
+}
+
+TEST(TopK, ValidationCheck_DefaultDim) {
+    dim_t dims[4] = {10, 10, 1, 1};
+    af_array out, idx, in;
+    ASSERT_SUCCESS(af_randu(&in, 4, dims, f32));
+    ASSERT_SUCCESS(af_topk(&out, &idx, in, 10, -1, AF_TOPK_MAX));
+    ASSERT_SUCCESS(af_release_array(in));
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(idx));
+}
+
+// stable variants
+TYPED_TEST(TopK, Max1D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t), 1, 1, 1};
+    topkTest<TypeParam>(1, dims, 5, 0, AF_TOPK_STABLE_MAX);
+}
+
+TYPED_TEST(TopK, Max2D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 10, 10, 1, 1};
+    topkTest<TypeParam>(2, dims, 3, 0, AF_TOPK_STABLE_MAX);
+}
+
+TYPED_TEST(TopK, Max3D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 100, 10, 10, 1};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_STABLE_MAX);
+}
+
+TYPED_TEST(TopK, Max4D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 1000, 10, 10, 10};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_STABLE_MAX);
+}
+
+TYPED_TEST(TopK, Min1D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t), 1, 1, 1};
+    topkTest<TypeParam>(1, dims, 5, 0, AF_TOPK_STABLE_MIN);
+}
+
+TYPED_TEST(TopK, Min2D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 10, 10, 1, 1};
+    topkTest<TypeParam>(2, dims, 3, 0, AF_TOPK_STABLE_MIN);
+}
+
+TYPED_TEST(TopK, Min3D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 100, 10, 10, 1};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_STABLE_MIN);
+}
+
+TYPED_TEST(TopK, Min4D0_Stable) {
+    af_dtype t    = (af_dtype)dtype_traits<TypeParam>::af_type;
+    dim_t dims[4] = {type_max(t) / 1000, 10, 10, 10};
+    topkTest<TypeParam>(2, dims, 5, 0, AF_TOPK_STABLE_MIN);
+}
+
+TEST(TopK, ValidationCheck_DimN_Stable) {
+    dim_t dims[4] = {10, 10, 1, 1};
+    af_array out, idx, in;
+    ASSERT_SUCCESS(af_randu(&in, 2, dims, f32));
+    ASSERT_EQ(AF_ERR_NOT_SUPPORTED,
+              af_topk(&out, &idx, in, 10, 1, AF_TOPK_STABLE_MAX));
+    ASSERT_SUCCESS(af_release_array(in));
+}
+
+TEST(TopK, ValidationCheck_DefaultDim_Stable) {
+    dim_t dims[4] = {10, 10, 1, 1};
+    af_array out, idx, in;
+    ASSERT_SUCCESS(af_randu(&in, 4, dims, f32));
+    ASSERT_SUCCESS(af_topk(&out, &idx, in, 10, -1, AF_TOPK_STABLE_MAX));
+    ASSERT_SUCCESS(af_release_array(in));
+    ASSERT_SUCCESS(af_release_array(out));
+    ASSERT_SUCCESS(af_release_array(idx));
+}
+
+struct topk_params {
+    int d0;
+    int d1;
+    int k;
+    int dim;
+    topkFunction order;
+};
+
+ostream& operator<<(ostream& os, const topk_params& param) {
+    os << "d0: " << param.d0 << " d1: " << param.d1 << " k:  " << param.k
+       << " dim: " << param.dim
+       << " order: " << ((param.order == AF_TOPK_MAX) ? "MAX" : "MIN");
+    return os;
+}
+
+class TopKParams : public ::testing::TestWithParam<topk_params> {};
+
+INSTANTIATE_TEST_SUITE_P(
+    InstantiationName, TopKParams,
+    ::testing::Values(topk_params{100, 10, 32, 0, AF_TOPK_MIN},
+                      topk_params{100, 10, 64, 0, AF_TOPK_MIN},
+                      topk_params{100, 10, 32, 0, AF_TOPK_MAX},
+                      topk_params{100, 10, 64, 0, AF_TOPK_MAX},
+                      topk_params{100, 10, 5, 0, AF_TOPK_MIN},
+                      topk_params{1000, 10, 5, 0, AF_TOPK_MIN},
+                      topk_params{10000, 10, 5, 0, AF_TOPK_MIN},
+                      topk_params{100, 10, 5, 0, AF_TOPK_MAX},
+                      topk_params{1000, 10, 5, 0, AF_TOPK_MAX},
+                      topk_params{10000, 10, 5, 0, AF_TOPK_MAX},
+                      topk_params{10, 10, 5, 0, AF_TOPK_MIN},
+                      topk_params{10, 100, 5, 0, AF_TOPK_MIN},
+                      topk_params{10, 1000, 5, 0, AF_TOPK_MIN},
+                      topk_params{10, 10000, 5, 0, AF_TOPK_MIN},
+                      topk_params{10, 10, 5, 0, AF_TOPK_MAX},
+                      topk_params{10, 100, 5, 0, AF_TOPK_MAX},
+                      topk_params{10, 1000, 5, 0, AF_TOPK_MAX},
+                      topk_params{10, 10000, 5, 0, AF_TOPK_MAX},
+                      topk_params{1000, 10, 256, 0, AF_TOPK_MAX}),
+    [](const ::testing::TestParamInfo<TopKParams::ParamType> info) {
+        stringstream ss;
+        ss << "d0_" << info.param.d0 << "_d1_" << info.param.d1 << "_k_"
+           << info.param.k << "_dim_" << info.param.dim << "_order_"
+           << ((info.param.order == AF_TOPK_MAX) ? string("MAX")
+                                                 : string("MIN"));
+        return ss.str();
+    });
+
+string print_context(int idx0, int idx1, const vector<float>& val,
+                     const vector<unsigned>& idx) {
+    stringstream ss;
+    if (idx0 > 3 && idx1 > 3) {
+        for (int i = idx0 - 3; i < idx0 + 3; i++) {
+            ss << i << ": " << val[i] << " " << idx[i] << "\n";
+        }
+    } else {
+        int end = min(6, idx0 + 3);
+        for (int i = 0; i < end; i++) {
+            ss << i << ": " << val[i] << " " << idx[i] << "\n";
+        }
+    }
+    return ss.str();
+}
+
+TEST_P(TopKParams, CPP) {
+    topk_params params = GetParam();
+    int d0             = params.d0;
+    int d1             = params.d1;
+    int k              = params.k;
+    int dim            = params.dim;
+    topkFunction order = params.order;
+
+    array in = iota(dim4(d0, d1));
+
+    // reverse the array if the order is ascending
+    if (order == AF_TOPK_MIN) { in = -in + (d0 * d1 - 1); }
+    array val, idx;
+    topk(val, idx, in, k, dim, order);
+
+    vector<float> hval(k * d1);
+    vector<unsigned> hidx(k * d1);
+    val.host(&hval[0]);
+    idx.host(&hidx[0]);
+
+    if (order == AF_TOPK_MIN) {
+        for (int j = d1 - 1, i = 0; j > 0; j--) {
+            for (int kidx = 0, goldidx = d0 - 1; kidx < k;
+                 i++, kidx++, goldidx--) {
+                float gold = static_cast<float>(j * d0 + kidx);
+                ASSERT_FLOAT_EQ(gold, hval[i])
+                    << print_context(i, kidx, hval, hidx);
+                ASSERT_EQ(goldidx, hidx[i])
+                    << print_context(i, kidx, hval, hidx);
+            }
+        }
+    } else {
+        for (int ii = 0, i = 0; ii < d1; ii++) {
+            for (int j = d0 - 1; j >= d0 - k; --j, i++) {
+                float gold  = static_cast<float>(ii * d0 + j);
+                int goldidx = j;
+                ASSERT_FLOAT_EQ(gold, hval[i])
+                    << print_context(i, j, hval, hidx);
+                ASSERT_EQ(goldidx, hidx[i]) << print_context(i, j, hval, hidx);
+            }
+        }
+    }
+}
+
+TEST(TopK, KGreaterThan256) {
+    af::array a = af::randu(500);
+    af::array vals, idx;
+
+    int k = 257;
+    EXPECT_THROW(topk(vals, idx, a, k), af::exception)
+        << "The current limitation of the K value has increased. Please check "
+           "or remove this test";
+}
+
+TEST(TopK, KEquals0) {
+    af::array a = af::randu(500);
+    af::array vals, idx;
+
+    int k = 0;
+    EXPECT_THROW(topk(vals, idx, a, k), af::exception)
+        << "K cannot be less than 1";
+}
+
+TEST(TopK, KLessThan0) {
+    af::array a = af::randu(500);
+    af::array vals, idx;
+
+    int k = -1;
+    EXPECT_THROW(topk(vals, idx, a, k), af::exception)
+        << "K cannot be less than 0";
+}
+
+TEST(TopK, DeterministicTiesMin) {
+    af::array a           = af::constant(1, 500);
+    a(af::seq(0, 499, 2)) = 7;
+    af::array vals_min, idx_min;
+
+    int k = 6;
+    topk(vals_min, idx_min, a, k, 0, AF_TOPK_STABLE_MIN);
+
+    af::array expected_idx_min   = af::seq(1, 499, 2);
+    af::array k_expected_idx_min = expected_idx_min(af::seq(0, k - 1));
+    ASSERT_ARRAYS_EQ(idx_min, k_expected_idx_min.as(u32));
+}
+
+TEST(TopK, DeterministicTiesMax) {
+    af::array a           = af::constant(1, 500);
+    a(af::seq(0, 499, 2)) = 7;
+    af::array vals_max, idx_max;
+
+    int k = 6;
+    topk(vals_max, idx_max, a, k, 0, AF_TOPK_STABLE_MAX);
+
+    af::array expected_idx_max   = af::seq(0, 499, 2);
+    af::array k_expected_idx_max = expected_idx_max(af::seq(0, k - 1));
+    ASSERT_ARRAYS_EQ(idx_max, k_expected_idx_max.as(u32));
+}
+
+TEST(TopK, DeterministicTiesBatchedMin) {
+    const int nbatch = 10;
+    af::array a      = af::constant(1, 500, nbatch, nbatch, nbatch);
+    a(af::seq(0, 499, 2), af::span, af::span, af::span) = 7;
+    af::array vals_min, idx_min;
+
+    int k = 6;
+    topk(vals_min, idx_min, a, k, 0, AF_TOPK_STABLE_MIN);
+
+    af::array expected_idx_min = af::seq(1, 499, 2);
+    af::array k_expected_idx_min =
+        af::tile(expected_idx_min(af::seq(0, k - 1)),
+                 af::dim4(1, nbatch, nbatch, nbatch));
+    ASSERT_ARRAYS_EQ(idx_min, k_expected_idx_min.as(u32));
+}
+
+TEST(TopK, DeterministicTiesBatchedMax) {
+    const int nbatch = 10;
+    af::array a      = af::constant(1, 500, nbatch, nbatch, nbatch);
+    a(af::seq(0, 499, 2), af::span, af::span, af::span) = 7;
+    af::array vals_max, idx_max;
+
+    int k = 6;
+    topk(vals_max, idx_max, a, k, 0, AF_TOPK_STABLE_MAX);
+
+    af::array expected_idx_max = af::seq(0, 499, 2);
+    af::array k_expected_idx_max =
+        af::tile(expected_idx_max(af::seq(0, k - 1)),
+                 af::dim4(1, nbatch, nbatch, nbatch));
+    ASSERT_ARRAYS_EQ(idx_max, k_expected_idx_max.as(u32));
+}
diff --git a/test/transform.cpp b/test/transform.cpp
new file mode 100644
index 0000000000..e6026576ba
--- /dev/null
+++ b/test/transform.cpp
@@ -0,0 +1,662 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using af::loadImage;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Transform : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+template<typename T>
+class TransformInt : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+typedef ::testing::Types<int, intl, uint, uintl, short, ushort, schar, uchar>
+    TestTypesInt;
+
+TYPED_TEST_SUITE(Transform, TestTypes);
+TYPED_TEST_SUITE(TransformInt, TestTypesInt);
+
+template<typename T>
+void genTestData(af_array *gold, af_array *in, af_array *transform,
+                 dim_t *odim0, dim_t *odim1, string pTestFile,
+                 string pHomographyFile) {
+    vector<dim4> inNumDims;
+    vector<string> inFiles;
+    vector<dim_t> goldNumDims;
+    vector<string> goldFiles;
+
+    readImageTests(pTestFile, inNumDims, inFiles, goldNumDims, goldFiles);
+
+    inFiles[0].insert(0, string(TEST_DIR "/transform/"));
+    inFiles[1].insert(0, string(TEST_DIR "/transform/"));
+    goldFiles[0].insert(0, string(TEST_DIR "/transform/"));
+
+    dim4 objDims = inNumDims[0];
+
+    vector<dim4> HNumDims;
+    vector<vector<float>> HIn;
+    vector<vector<float>> HTests;
+    readTests<float, float, float>(pHomographyFile, HNumDims, HIn, HTests);
+
+    dim4 HDims = HNumDims[0];
+
+    af_array sceneArray_f32 = 0;
+    af_array goldArray_f32  = 0;
+    af_array sceneArray     = 0;
+    af_array goldArray      = 0;
+    af_array HArray         = 0;
+
+    ASSERT_SUCCESS(af_load_image(&sceneArray_f32, inFiles[1].c_str(), false));
+    ASSERT_SUCCESS(af_load_image(&goldArray_f32, goldFiles[0].c_str(), false));
+
+    ASSERT_SUCCESS(conv_image<T>(&sceneArray, sceneArray_f32));
+    ASSERT_SUCCESS(conv_image<T>(&goldArray, goldArray_f32));
+
+    ASSERT_SUCCESS(af_create_array(&HArray, &(HIn[0].front()), HDims.ndims(),
+                                   HDims.get(), f32));
+
+    *gold      = goldArray;
+    *in        = sceneArray;
+    *transform = HArray;
+    *odim0     = objDims[0];
+    *odim1     = objDims[1];
+
+    if (goldArray_f32 != 0) af_release_array(goldArray_f32);
+    if (sceneArray_f32 != 0) af_release_array(sceneArray_f32);
+}
+
+template<typename T>
+void transformTest(string pTestFile, string pHomographyFile,
+                   const af_interp_type method, const bool invert) {
+    SUPPORTED_TYPE_CHECK(T);
+    IMAGEIO_ENABLED_CHECK();
+
+    af_array sceneArray = 0;
+    af_array goldArray  = 0;
+    af_array outArray   = 0;
+    af_array HArray     = 0;
+
+    dim_t odim0 = 0;
+    dim_t odim1 = 0;
+
+    genTestData<T>(&goldArray, &sceneArray, &HArray, &odim0, &odim1, pTestFile,
+                   pHomographyFile);
+
+    ASSERT_SUCCESS(af_transform(&outArray, sceneArray, HArray, odim0, odim1,
+                                method, invert));
+
+    // Get gold data
+    dim_t goldEl = 0;
+    ASSERT_SUCCESS(af_get_elements(&goldEl, goldArray));
+    vector<T> goldData(goldEl);
+    ASSERT_SUCCESS(af_get_data_ptr((void *)&goldData.front(), goldArray));
+
+    // Get result
+    dim_t outEl = 0;
+    ASSERT_SUCCESS(af_get_elements(&outEl, outArray));
+    vector<T> outData(outEl);
+    ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
+
+    const float thr = 1.1f;
+
+    // The number of wrong pixels must be <= 0.01% of the number of elements.
+    // This tolerance is needed because of rounding differences between
+    // backends for AF_INTERP_NEAREST and AF_INTERP_LOWER.
+    const size_t maxErr = goldEl * 0.0001f;
+    size_t err          = 0;
+
+    for (dim_t elIter = 0; elIter < goldEl; elIter++) {
+        err += fabs((float)floor(outData[elIter]) -
+                    (float)floor(goldData[elIter])) > thr;
+        if (err > maxErr) {
+            ASSERT_LE(err, maxErr) << "at: " << elIter << endl;
+        }
+    }
+
+    if (HArray != 0) { af_release_array(HArray); }
+    if (outArray != 0) { af_release_array(outArray); }
+    if (goldArray != 0) { af_release_array(goldArray); }
+    if (sceneArray != 0) { af_release_array(sceneArray); }
+}
+
+TYPED_TEST(Transform, PerspectiveNearest) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_nearest.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_NEAREST, false);
+}
+
+TYPED_TEST(Transform, PerspectiveBilinear) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_bilinear.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_BILINEAR, false);
+}
+
+TYPED_TEST(Transform, PerspectiveLower) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_lower.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_LOWER, false);
+}
+
+TYPED_TEST(Transform, PerspectiveNearestInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_nearest.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_NEAREST,
+        true);
+}
+
+TYPED_TEST(Transform, PerspectiveBilinearInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_bilinear.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_BILINEAR,
+        true);
+}
+
+TYPED_TEST(Transform, PerspectiveLowerInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_lower.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_LOWER,
+        true);
+}
+
+TYPED_TEST(TransformInt, PerspectiveNearest) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_nearest.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_NEAREST, false);
+}
+
+TYPED_TEST(TransformInt, PerspectiveBilinear) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_bilinear.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_BILINEAR, false);
+}
+
+TYPED_TEST(TransformInt, PerspectiveLower) {
+    transformTest<TypeParam>(string(TEST_DIR "/transform/tux_lower.test"),
+                             string(TEST_DIR "/transform/tux_tmat.test"),
+                             AF_INTERP_LOWER, false);
+}
+
+TYPED_TEST(TransformInt, PerspectiveNearestInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_nearest.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_NEAREST,
+        true);
+}
+
+TYPED_TEST(TransformInt, PerspectiveBilinearInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_bilinear.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_BILINEAR,
+        true);
+}
+
+TYPED_TEST(TransformInt, PerspectiveLowerInvert) {
+    transformTest<TypeParam>(
+        string(TEST_DIR "/transform/tux_lower.test"),
+        string(TEST_DIR "/transform/tux_tmat_inverse.test"), AF_INTERP_LOWER,
+        true);
+}
+
+template<typename T>
+class TransformV2 : public Transform<T> {
+   protected:
+    typedef typename dtype_traits<T>::base_type BT;
+
+    af_array gold;
+    af_array in;
+    af_array transform;
+
+    dim4 gold_dims;
+    dim4 in_dims;
+    dim4 transform_dims;
+
+    dim_t odim0;
+    dim_t odim1;
+
+    af_interp_type method;
+    bool invert;
+
+    TransformV2()
+        : gold(0)
+        , in(0)
+        , transform(0)
+        , odim0(0)
+        , odim1(0)
+        , method(AF_INTERP_NEAREST)
+        , invert(false) {}
+
+    void setInterpType(af_interp_type m) { method = m; }
+    void setInvertFlag(bool i) { invert = i; }
+
+    void SetUp() {}
+
+    void releaseArrays() {
+        if (transform != 0) { ASSERT_SUCCESS(af_release_array(transform)); }
+        if (in != 0) { ASSERT_SUCCESS(af_release_array(in)); }
+        if (gold != 0) { ASSERT_SUCCESS(af_release_array(gold)); }
+
+        gold      = 0;
+        in        = 0;
+        transform = 0;
+    }
+
+    void TearDown() { releaseArrays(); }
+
+    void setTestData(float *h_gold, dim4 gold_dims, float *h_in, dim4 in_dims,
+                     float *h_transform, dim4 transform_dims) {
+        releaseArrays();
+
+        this->gold_dims      = gold_dims;
+        this->in_dims        = in_dims;
+        this->transform_dims = transform_dims;
+
+        vector<T> h_gold_cast;
+        vector<T> h_in_cast;
+        vector<BT> h_transform_cast;
+
+        for (int i = 0; i < gold_dims.elements(); ++i) {
+            h_gold_cast.push_back(static_cast<T>(h_gold[i]));
+        }
+        for (int i = 0; i < in_dims.elements(); ++i) {
+            h_in_cast.push_back(static_cast<T>(h_in[i]));
+        }
+        for (int i = 0; i < transform_dims.elements(); ++i) {
+            h_transform_cast.push_back(static_cast<BT>(h_transform[i]));
+        }
+
+        ASSERT_SUCCESS(af_create_array(&gold, &h_gold_cast.front(),
+                                       gold_dims.ndims(), gold_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&in, &h_in_cast.front(), in_dims.ndims(),
+                                       in_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(
+            &transform, &h_transform_cast.front(), transform_dims.ndims(),
+            transform_dims.get(), (af_dtype)dtype_traits<BT>::af_type));
+    }
+
+    void setTestData(string pTestFile, string pHomographyFile) {
+        IMAGEIO_ENABLED_CHECK();
+        releaseArrays();
+
+        genTestData<T>(&gold, &in, &transform, &odim0, &odim1, pTestFile,
+                       pHomographyFile);
+
+        ASSERT_SUCCESS(af_get_dims(&gold_dims[0], &gold_dims[1], &gold_dims[2],
+                                   &gold_dims[3], gold));
+        ASSERT_SUCCESS(af_get_dims(&in_dims[0], &in_dims[1], &in_dims[2],
+                                   &in_dims[3], in));
+        ASSERT_SUCCESS(af_get_dims(&transform_dims[0], &transform_dims[1],
+                                   &transform_dims[2], &transform_dims[3],
+                                   transform));
+    }
+
+    void assertSpclArraysTransform(const af_array gold, const af_array out,
+                                   TestOutputArrayInfo *metadata) {
+        // In the case of NULL_ARRAY, the output array starts out as null.
+        // After the af_* function is called, it shouldn't be null anymore
+        if (metadata->getOutputArrayType() == NULL_ARRAY) {
+            if (out == 0) {
+                ASSERT_TRUE(out != 0) << "Output af_array is null";
+            }
+            metadata->setOutput(out);
+        }
+        // For every other case, must check if the af_array generated by
+        // genTestOutputArray was used by the af_* function as its output array
+        else {
+            if (metadata->getOutput() != out) {
+                ASSERT_TRUE(metadata->getOutput() == out)
+                    << "af_array POINTER MISMATCH:\n"
+                    << "  Actual: " << out << "\n"
+                    << "Expected: " << metadata->getOutput();
+            }
+        }
+
+        af_array out_  = 0;
+        af_array gold_ = 0;
+
+        if (metadata->getOutputArrayType() == SUB_ARRAY) {
+            // There are two full arrays. One will be injected with the gold
+            // subarray, the other should have already been injected with the
+            // af_* function's output. Then we compare the two full arrays
+            af_array gold_full_array = metadata->getFullOutputCopy();
+            af_assign_seq(&gold_full_array, gold_full_array,
+                          metadata->getSubArrayNumDims(),
+                          metadata->getSubArrayIdxs(), gold);
+
+            gold_ = metadata->getFullOutputCopy();
+            out_  = metadata->getFullOutput();
+        } else {
+            gold_ = gold;
+            out_  = out;
+        }
+
+        // Get gold data
+        dim_t goldEl = 0;
+        af_get_elements(&goldEl, gold_);
+        vector<T> goldData(goldEl);
+        af_get_data_ptr((void *)&goldData.front(), gold_);
+
+        // Get result
+        dim_t outEl = 0;
+        af_get_elements(&outEl, out_);
+        vector<T> outData(outEl);
+        af_get_data_ptr((void *)&outData.front(), out_);
+
+        const float thr = 1.1f;
+
+        // The number of wrong pixels must be <= 0.01% of the number of
+        // elements. This tolerance is needed because of rounding differences
+        // between backends for AF_INTERP_NEAREST and AF_INTERP_LOWER.
+        const size_t maxErr = goldEl * 0.0001f;
+        size_t err          = 0;
+
+        for (dim_t elIter = 0; elIter < goldEl; elIter++) {
+            err += fabs((float)floor(outData[elIter]) -
+                        (float)floor(goldData[elIter])) > thr;
+            if (err > maxErr) {
+                ASSERT_LE(err, maxErr) << "at: " << elIter << endl;
+            }
+        }
+    }
+
+    void testSpclOutArray(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+        IMAGEIO_ENABLED_CHECK();
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        genTestOutputArray(&out, gold_dims.ndims(), gold_dims.get(),
+                           (af_dtype)dtype_traits<T>::af_type, &metadata);
+        ASSERT_SUCCESS(
+            af_transform_v2(&out, in, transform, odim0, odim1, method, invert));
+
+        assertSpclArraysTransform(gold, out, &metadata);
+    }
+};
+
+TYPED_TEST_SUITE(TransformV2, TestTypes);
+
+template<typename T>
+class TransformV2TuxNearest : public TransformV2<T> {
+   protected:
+    void SetUp() {
+        this->setTestData(string(TEST_DIR "/transform/tux_nearest.test"),
+                          string(TEST_DIR "/transform/tux_tmat.test"));
+        this->setInterpType(AF_INTERP_NEAREST);
+        this->setInvertFlag(false);
+    }
+};
+
+TYPED_TEST_SUITE(TransformV2TuxNearest, TestTypes);
+
+TYPED_TEST(TransformV2TuxNearest, UseNullOutputArray) {
+    this->testSpclOutArray(NULL_ARRAY);
+}
+
+TYPED_TEST(TransformV2TuxNearest, UseFullExistingOutputArray) {
+    this->testSpclOutArray(FULL_ARRAY);
+}
+
+TYPED_TEST(TransformV2TuxNearest, UseExistingOutputSubArray) {
+    this->testSpclOutArray(SUB_ARRAY);
+}
+
+TYPED_TEST(TransformV2TuxNearest, UseReorderedOutputArray) {
+    this->testSpclOutArray(REORDERED_ARRAY);
+}
+
+class TransformNullArgs : public TransformV2TuxNearest<float> {
+   protected:
+    af_array out;
+    TransformNullArgs() : out(0) {}
+};
+
+TEST_F(TransformNullArgs, NullOutputPtr) {
+    af_array *out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform(out_ptr, this->in, this->transform, this->odim0,
+                           this->odim1, this->method, this->invert));
+}
+
+TEST_F(TransformNullArgs, NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform(&this->out, 0, this->transform, this->odim0,
+                           this->odim1, this->method, this->invert));
+}
+
+TEST_F(TransformNullArgs, NullTransformArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform(&this->out, this->in, 0, this->odim0, this->odim1,
+                           this->method, this->invert));
+}
+
+TEST_F(TransformNullArgs, V2NullOutputPtr) {
+    af_array *out_ptr = 0;
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform_v2(out_ptr, this->in, this->transform, this->odim0,
+                              this->odim1, this->method, this->invert));
+}
+
+TEST_F(TransformNullArgs, V2NullInputArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform_v2(&this->out, 0, this->transform, this->odim0,
+                              this->odim1, this->method, this->invert));
+}
+
+TEST_F(TransformNullArgs, V2NullTransformArray) {
+    ASSERT_EQ(AF_ERR_ARG,
+              af_transform_v2(&this->out, this->in, 0, this->odim0, this->odim1,
+                              this->method, this->invert));
+}
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(Transform, CPP) {
+    IMAGEIO_ENABLED_CHECK();
+
+    vector<dim4> inDims;
+    vector<string> inFiles;
+    vector<dim_t> goldDim;
+    vector<string> goldFiles;
+
+    vector<dim4> HDims;
+    vector<vector<float>> HIn;
+    vector<vector<float>> HTests;
+    readTests<float, float, float>(TEST_DIR "/transform/tux_tmat.test", HDims,
+                                   HIn, HTests);
+
+    readImageTests(string(TEST_DIR "/transform/tux_nearest.test"), inDims,
+                   inFiles, goldDim, goldFiles);
+
+    inFiles[0].insert(0, string(TEST_DIR "/transform/"));
+    inFiles[1].insert(0, string(TEST_DIR "/transform/"));
+
+    goldFiles[0].insert(0, string(TEST_DIR "/transform/"));
+
+    array H  = array(HDims[0][0], HDims[0][1], &(HIn[0].front()));
+    array IH = array(HDims[0][0], HDims[0][1], &(HIn[0].front()));
+
+    array scene_img = loadImage(inFiles[1].c_str(), false);
+
+    array gold_img = loadImage(goldFiles[0].c_str(), false);
+
+    array out_img = transform(scene_img, IH, inDims[0][0], inDims[0][1],
+                              AF_INTERP_NEAREST, false);
+
+    dim4 outDims  = out_img.dims();
+    dim4 goldDims = gold_img.dims();
+
+    vector<float> h_out_img(outDims[0] * outDims[1]);
+    out_img.host(&h_out_img.front());
+    vector<float> h_gold_img(goldDims[0] * goldDims[1]);
+    gold_img.host(&h_gold_img.front());
+
+    const dim_t n   = gold_img.elements();
+    const float thr = 1.0f;
+
+    // The number of wrong pixels must be <= 0.01% of the number of elements.
+    // This tolerance is needed because of rounding differences between
+    // backends for AF_INTERP_NEAREST and AF_INTERP_LOWER.
+    const size_t maxErr = n * 0.0001f;
+    size_t err          = 0;
+
+    for (dim_t elIter = 0; elIter < n; elIter++) {
+        err += fabs((int)h_out_img[elIter] - h_gold_img[elIter]) > thr;
+        if (err > maxErr) {
+            ASSERT_LE(err, maxErr) << "at: " << elIter << endl;
+        }
+    }
+}
+
+// Tests batching of images and transforms in different combinations.
+// tf0 rotates by 90 degrees clockwise.
+// tf1 rotates by 90 degrees counterclockwise.
+// The test only verifies that batching works correctly.
+TEST(TransformBatching, CPP) {
+    vector<dim4> vDims;
+    vector<vector<float>> in;
+    vector<vector<float>> gold;
+
+    readTests<float, float, int>(
+        string(TEST_DIR "/transform/transform_batching.test"), vDims, in, gold);
+
+    array img0(vDims[0], &(in[0].front()));
+    array img1(vDims[1], &(in[1].front()));
+    array ip_tile(vDims[2], &(in[2].front()));
+    array ip_quad(vDims[3], &(in[3].front()));
+    array ip_mult(vDims[4], &(in[4].front()));
+    array ip_tile3(vDims[5], &(in[5].front()));
+    array ip_quad3(vDims[6], &(in[6].front()));
+
+    array tf0(vDims[7 + 0], &(in[7 + 0].front()));
+    array tf1(vDims[7 + 1], &(in[7 + 1].front()));
+    array tf_tile(vDims[7 + 2], &(in[7 + 2].front()));
+    array tf_quad(vDims[7 + 3], &(in[7 + 3].front()));
+    array tf_mult(vDims[7 + 4], &(in[7 + 4].front()));
+    array tf_mult3(vDims[7 + 5], &(in[7 + 5].front()));
+    array tf_mult3x(vDims[7 + 6], &(in[7 + 6].front()));
+
+    const int X = img0.dims(0);
+    const int Y = img0.dims(1);
+
+    ASSERT_EQ(gold.size(), 21u);
+    vector<array> out(gold.size());
+    out[0] = transform(img0, tf0, Y, X, AF_INTERP_NEAREST);  // 1,1 x 1,1
+    out[1] = transform(img0, tf1, Y, X, AF_INTERP_NEAREST);  // 1,1 x 1,1
+    out[2] = transform(img1, tf0, Y, X, AF_INTERP_NEAREST);  // 1,1 x 1,1
+    out[3] = transform(img1, tf1, Y, X, AF_INTERP_NEAREST);  // 1,1 x 1,1
+
+    out[4] = transform(img0, tf_tile, Y, X, AF_INTERP_NEAREST);  // 1,1 x N,1
+    out[5] = transform(img0, tf_mult, Y, X, AF_INTERP_NEAREST);  // 1,1 x N,N
+    out[6] = transform(img0, tf_quad, Y, X, AF_INTERP_NEAREST);  // 1,1 x 1,N
+
+    out[7] = transform(ip_tile, tf0, Y, X, AF_INTERP_NEAREST);      // N,1 x 1,1
+    out[8] = transform(ip_tile, tf_tile, Y, X, AF_INTERP_NEAREST);  // N,1 x N,1
+    out[9] = transform(ip_tile, tf_mult, Y, X, AF_INTERP_NEAREST);  // N,1 x N,N
+    out[10] =
+        transform(ip_tile, tf_quad, Y, X, AF_INTERP_NEAREST);  // N,1 x 1,N
+
+    out[11] = transform(ip_quad, tf0, Y, X, AF_INTERP_NEAREST);  // 1,N x 1,1
+    out[12] =
+        transform(ip_quad, tf_quad, Y, X, AF_INTERP_NEAREST);  // 1,N x 1,N
+    out[13] =
+        transform(ip_quad, tf_mult, Y, X, AF_INTERP_NEAREST);  // 1,N x N,N
+    out[14] =
+        transform(ip_quad, tf_tile, Y, X, AF_INTERP_NEAREST);  // 1,N x N,1
+
+    out[15] = transform(ip_mult, tf0, Y, X, AF_INTERP_NEAREST);  // N,N x 1,1
+    out[16] =
+        transform(ip_mult, tf_tile, Y, X, AF_INTERP_NEAREST);  // N,N x N,1
+    out[17] =
+        transform(ip_mult, tf_mult, Y, X, AF_INTERP_NEAREST);  // N,N x N,N
+    out[18] =
+        transform(ip_mult, tf_quad, Y, X, AF_INTERP_NEAREST);  // N,N x 1,N
+
+    out[19] =
+        transform(ip_tile3, tf_mult3, Y, X, AF_INTERP_NEAREST);  // N,1 x N,N
+    out[20] =
+        transform(ip_quad3, tf_mult3x, Y, X, AF_INTERP_NEAREST);  // 1,N x N,N
+
+    array x_(dim4(35, 40, 1, 1), &(gold[1].front()));
+
+    for (int i = 0; i < (int)gold.size(); i++) {
+        // Get result
+        vector<float> outData(out[i].elements());
+        out[i].host((void *)&outData.front());
+
+        for (int iter = 0; iter < (int)gold[i].size(); iter++) {
+            ASSERT_EQ(gold[i][iter], outData[iter])
+                << "at: " << iter << endl
+                << "for " << i << "-th operation" << endl;
+        }
+    }
+}
+
+#define TEST_TEMP_FORMAT(form, interp)                                         \
+    TEST(TEMP_FORMAT, form##_##interp) {                                       \
+        IMAGEIO_ENABLED_CHECK();                                               \
+                                                                               \
+        vector<dim4> inDims;                                                   \
+        vector<string> inFiles;                                                \
+        vector<dim_t> goldDim;                                                 \
+        vector<string> goldFiles;                                              \
+                                                                               \
+        vector<dim4> HDims;                                                    \
+        vector<vector<float>> HIn;                                             \
+        vector<vector<float>> HTests;                                          \
+        readTests<float, float, float>(TEST_DIR "/transform/tux_tmat.test",    \
+                                       HDims, HIn, HTests);                    \
+                                                                               \
+        readImageTests(string(TEST_DIR "/transform/tux_nearest.test"), inDims, \
+                       inFiles, goldDim, goldFiles);                           \
+        inFiles[1].insert(0, string(TEST_DIR "/transform/"));                  \
+        const array IH = array(HDims[0][0], HDims[0][1], &(HIn[0].front()));   \
+        const array scene_img = loadImage(inFiles[1].c_str(), false);          \
+                                                                               \
+        const array out =                                                      \
+            transform(toTempFormat(form, scene_img), toTempFormat(form, IH),   \
+                      inDims[0][0], inDims[0][1], interp, false);              \
+        const array gold = transform(scene_img, IH, inDims[0][0],              \
+                                     inDims[0][1], interp, false);             \
+                                                                               \
+        EXPECT_ARRAYS_EQ(out, gold);                                           \
+    }
+
+#define TESTS_TEMP_FORMAT(form)                       \
+    TEST_TEMP_FORMAT(form, AF_INTERP_NEAREST)         \
+    TEST_TEMP_FORMAT(form, AF_INTERP_BILINEAR)        \
+    TEST_TEMP_FORMAT(form, AF_INTERP_BILINEAR_COSINE) \
+    TEST_TEMP_FORMAT(form, AF_INTERP_BICUBIC)         \
+    TEST_TEMP_FORMAT(form, AF_INTERP_BICUBIC_SPLINE)  \
+    TEST_TEMP_FORMAT(form, AF_INTERP_LOWER)
+
+FOREACH_TEMP_FORMAT(TESTS_TEMP_FORMAT)
\ No newline at end of file
diff --git a/test/transform_coordinates.cpp b/test/transform_coordinates.cpp
new file mode 100644
index 0000000000..bc5dbed4e9
--- /dev/null
+++ b/test/transform_coordinates.cpp
@@ -0,0 +1,139 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class TransformCoordinates : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+typedef ::testing::Types<float, double> TestTypes;
+
+TYPED_TEST_SUITE(TransformCoordinates, TestTypes);
+
+template<typename T>
+void transformCoordinatesTest(string pTestFile) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> inDims;
+    vector<vector<T>> in;
+    vector<vector<float>> gold;
+
+    readTests<T, float, float>(pTestFile, inDims, in, gold);
+
+    af_array tfArray  = 0;
+    af_array outArray = 0;
+    ASSERT_SUCCESS(af_create_array(&tfArray, &(in[0].front()),
+                                   inDims[0].ndims(), inDims[0].get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    int nTests = in.size();
+
+    for (int test = 1; test < nTests; test++) {
+        dim_t d0 = (dim_t)in[test][0];
+        dim_t d1 = (dim_t)in[test][1];
+
+        ASSERT_SUCCESS(af_transform_coordinates(&outArray, tfArray, d0, d1));
+
+        // Get result
+        dim_t outEl = 0;
+        ASSERT_SUCCESS(af_get_elements(&outEl, outArray));
+        vector<T> outData(outEl);
+        ASSERT_SUCCESS(af_get_data_ptr((void *)&outData.front(), outArray));
+
+        ASSERT_SUCCESS(af_release_array(outArray));
+        const float thr = 1.f;
+
+        for (dim_t elIter = 0; elIter < outEl; elIter++) {
+            ASSERT_LE(fabs(outData[elIter] - gold[test - 1][elIter]), thr)
+                << "at: " << elIter << endl;
+        }
+    }
+
+    if (tfArray != 0) af_release_array(tfArray);
+}
+
+TYPED_TEST(TransformCoordinates, RotateMatrix) {
+    transformCoordinatesTest<TypeParam>(
+        string(TEST_DIR "/transformCoordinates/rotate_matrix.test"));
+}
+
+TYPED_TEST(TransformCoordinates, 3DMatrix) {
+    transformCoordinatesTest<TypeParam>(
+        string(TEST_DIR "/transformCoordinates/3d_matrix.test"));
+}
+
+///////////////////////////////////// CPP ////////////////////////////////
+//
+TEST(TransformCoordinates, CPP) {
+    vector<dim4> inDims;
+    vector<vector<float>> in;
+    vector<vector<float>> gold;
+
+    readTests<float, float, float>(
+        TEST_DIR "/transformCoordinates/3d_matrix.test", inDims, in, gold);
+
+    array tf = array(inDims[0][0], inDims[0][1], &(in[0].front()));
+
+    float d0 = in[1][0];
+    float d1 = in[1][1];
+
+    array out    = transformCoordinates(tf, d0, d1);
+    dim4 outDims = out.dims();
+
+    vector<float> h_out(outDims[0] * outDims[1]);
+    out.host(&h_out.front());
+
+    const size_t n  = gold[0].size();
+    const float thr = 1.f;
+
+    for (size_t elIter = 0; elIter < n; elIter++) {
+        ASSERT_LE(fabs(h_out[elIter] - gold[0][elIter]), thr)
+            << "at: " << elIter << endl;
+    }
+}
+
+#define TESTS_TEMP_FORMAT(form)                                                \
+    TEST(TEMP_FORMAT, form) {                                                  \
+        vector<dim4> inDims;                                                   \
+        vector<vector<float>> in;                                              \
+        vector<vector<float>> gold;                                            \
+                                                                               \
+        readTests<float, float, float>(TEST_DIR                                \
+                                       "/transformCoordinates/3d_matrix.test", \
+                                       inDims, in, gold);                      \
+                                                                               \
+        const array tf(inDims[0][0], inDims[0][1], &(in[0].front()));          \
+        const float d0 = in[1][0];                                             \
+        const float d1 = in[1][1];                                             \
+                                                                               \
+        const array out =                                                      \
+            transformCoordinates(toTempFormat(form, tf), d0, d1);              \
+        const array gout = transformCoordinates(tf, d0, d1);                   \
+                                                                               \
+        EXPECT_ARRAYS_EQ(out, gout);                                           \
+    }
+
+FOREACH_TEMP_FORMAT(TESTS_TEMP_FORMAT)
\ No newline at end of file
diff --git a/test/translate.cpp b/test/translate.cpp
index 6f4d691612..edbab15a2c 100644
--- a/test/translate.cpp
+++ b/test/translate.cpp
@@ -7,180 +7,182 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Translate : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Translate : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 template<typename T>
-class TranslateInt : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class TranslateInt : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
 typedef ::testing::Types<float, double, cfloat, cdouble> TestTypes;
-typedef ::testing::Types<int, intl, char> TestTypesInt;
+typedef ::testing::Types<int, intl, char, schar, short> TestTypesInt;
 
 // register the type list
-TYPED_TEST_CASE(Translate, TestTypes);
-TYPED_TEST_CASE(TranslateInt, TestTypesInt);
+TYPED_TEST_SUITE(Translate, TestTypes);
+TYPED_TEST_SUITE(TranslateInt, TestTypesInt);
 
 template<typename T>
-void translateTest(string pTestFile, const unsigned resultIdx, af::dim4 odims, const float tx, const float ty, const af_interp_type method, const float max_fail_count = 0.0001)
-{
-    if (noDoubleTests<T>()) return;
+void translateTest(string pTestFile, const unsigned resultIdx, dim4 odims,
+                   const float tx, const float ty, const af_interp_type method,
+                   const float max_fail_count = 0.0001) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
-    vector<vector<T> >   in;
-    vector<vector<float> >   tests;
-    readTests<T, float, float>(pTestFile,numDims,in,tests);
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<float>> tests;
+    readTests<T, float, float>(pTestFile, numDims, in, tests);
 
-    af_array inArray = 0;
+    af_array inArray  = 0;
     af_array outArray = 0;
 
-    af::dim4 dims = numDims[0];
+    dim4 dims = numDims[0];
 
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
-    ASSERT_EQ(AF_SUCCESS, af_translate(&outArray, inArray, tx, ty, odims[0], odims[1], method));
+    ASSERT_SUCCESS(
+        af_translate(&outArray, inArray, tx, ty, odims[0], odims[1], method));
 
     // Get result
     T* outData = new T[tests[resultIdx].size()];
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void*)outData, outArray));
 
     // Compare result
     size_t nElems = tests[resultIdx].size();
 
     size_t fail_count = 0;
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        if(std::abs((T)tests[resultIdx][elIter] - outData[elIter]) > 0.0001) {
+        if (abs((T)tests[resultIdx][elIter] - outData[elIter]) > 0.0001) {
             fail_count++;
         }
     }
     ASSERT_EQ(true, (((float)fail_count / (float)(nElems)) <= max_fail_count))
-             << "Fail Count  = " << fail_count << std::endl;
+        << "Fail Count  = " << fail_count << endl;
 
     // Delete
     delete[] outData;
 
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
 }
 
-TYPED_TEST(Translate, Small1)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 0,
-                             af::dim4(10, 10, 1, 1), 3, 2, AF_INTERP_NEAREST);
+TYPED_TEST(Translate, Small1) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 0,
+        dim4(10, 10, 1, 1), 3, 2, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Translate, Small2)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 1,
-                             af::dim4(10, 10, 1, 1), -3, -2, AF_INTERP_NEAREST);
+TYPED_TEST(Translate, Small2) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 1,
+        dim4(10, 10, 1, 1), -3, -2, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Translate, Small3)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 2,
-                             af::dim4(15, 15, 1, 1), 1.5, 2.5, AF_INTERP_BILINEAR);
+TYPED_TEST(Translate, Small3) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 2,
+        dim4(15, 15, 1, 1), 1.5, 2.5, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Translate, Small4)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 3,
-                             af::dim4(15, 15, 1, 1), -1.5, -2.5, AF_INTERP_BILINEAR);
+TYPED_TEST(Translate, Small4) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 3,
+        dim4(15, 15, 1, 1), -1.5, -2.5, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Translate, Large1)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 0,
-                             af::dim4(250, 320, 1, 1), 10, 18, AF_INTERP_NEAREST);
+TYPED_TEST(Translate, Large1) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 0,
+        dim4(250, 320, 1, 1), 10, 18, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Translate, Large2)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 1,
-                             af::dim4(250, 320, 1, 1), -20, 24, AF_INTERP_NEAREST);
+TYPED_TEST(Translate, Large2) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 1,
+        dim4(250, 320, 1, 1), -20, 24, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(Translate, Large3)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 2,
-                             af::dim4(300, 400, 1, 1), 10.23, 12.72, AF_INTERP_BILINEAR);
+TYPED_TEST(Translate, Large3) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 2,
+        dim4(300, 400, 1, 1), 10.23, 12.72, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(Translate, Large4)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 3,
-                             af::dim4(300, 400, 1, 1), -15.69, -10.13, AF_INTERP_BILINEAR);
+TYPED_TEST(Translate, Large4) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 3,
+        dim4(300, 400, 1, 1), -15.69, -10.13, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(TranslateInt, Small1)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 0,
-                             af::dim4(10, 10, 1, 1), 3, 2, AF_INTERP_NEAREST);
+TYPED_TEST(TranslateInt, Small1) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 0,
+        dim4(10, 10, 1, 1), 3, 2, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(TranslateInt, Small2)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 1,
-                             af::dim4(10, 10, 1, 1), -3, -2, AF_INTERP_NEAREST);
+TYPED_TEST(TranslateInt, Small2) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 1,
+        dim4(10, 10, 1, 1), -3, -2, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(TranslateInt, Small3)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 2,
-                             af::dim4(15, 15, 1, 1), 1.5, 2.5, AF_INTERP_BILINEAR);
+TYPED_TEST(TranslateInt, Small3) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 2,
+        dim4(15, 15, 1, 1), 1.5, 2.5, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(TranslateInt, Small4)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_small_1.test"), 3,
-                             af::dim4(15, 15, 1, 1), -1.5, -2.5, AF_INTERP_BILINEAR);
+TYPED_TEST(TranslateInt, Small4) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_small_1.test"), 3,
+        dim4(15, 15, 1, 1), -1.5, -2.5, AF_INTERP_BILINEAR);
 }
 
-TYPED_TEST(TranslateInt, Large1)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 0,
-                             af::dim4(250, 320, 1, 1), 10, 18, AF_INTERP_NEAREST);
+TYPED_TEST(TranslateInt, Large1) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 0,
+        dim4(250, 320, 1, 1), 10, 18, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(TranslateInt, Large2)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 1,
-                             af::dim4(250, 320, 1, 1), -20, 24, AF_INTERP_NEAREST);
+TYPED_TEST(TranslateInt, Large2) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 1,
+        dim4(250, 320, 1, 1), -20, 24, AF_INTERP_NEAREST);
 }
 
-TYPED_TEST(TranslateInt, Large3)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 2,
-                             af::dim4(300, 400, 1, 1), 10.23, 12.72, AF_INTERP_BILINEAR, 0.001);
+TYPED_TEST(TranslateInt, Large3) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 2,
+        dim4(300, 400, 1, 1), 10.23, 12.72, AF_INTERP_BILINEAR, 0.001);
 }
 
-TYPED_TEST(TranslateInt, Large4)
-{
-    translateTest<TypeParam>(string(TEST_DIR"/translate/translate_large_1.test"), 3,
-                             af::dim4(300, 400, 1, 1), -15.69, -10.13, AF_INTERP_BILINEAR, 0.001);
+TYPED_TEST(TranslateInt, Large4) {
+    translateTest<TypeParam>(
+        string(TEST_DIR "/translate/translate_large_1.test"), 3,
+        dim4(300, 400, 1, 1), -15.69, -10.13, AF_INTERP_BILINEAR, 0.001);
 }
diff --git a/test/transpose.cpp b/test/transpose.cpp
index 346dcc4daa..420f6d88e3 100644
--- a/test/transpose.cpp
+++ b/test/transpose.cpp
@@ -7,173 +7,175 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
+
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
+using af::allTrue;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::abs;
+using std::endl;
 using std::string;
 using std::vector;
-using af::cfloat;
-using af::cdouble;
 
 template<typename T>
-class Transpose : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-            subMat2D.push_back(af_make_seq(2,7,1));
-            subMat2D.push_back(af_make_seq(2,7,1));
-
-            subMat3D.push_back(af_make_seq(2,7,1));
-            subMat3D.push_back(af_make_seq(2,7,1));
-            subMat3D.push_back(af_span);
-        }
-        vector<af_seq> subMat2D;
-        vector<af_seq> subMat3D;
+class Transpose : public ::testing::Test {
+   public:
+    virtual void SetUp() {
+        subMat2D.push_back(af_make_seq(2, 7, 1));
+        subMat2D.push_back(af_make_seq(2, 7, 1));
+
+        subMat3D.push_back(af_make_seq(2, 7, 1));
+        subMat3D.push_back(af_make_seq(2, 7, 1));
+        subMat3D.push_back(af_span);
+    }
+    vector<af_seq> subMat2D;
+    vector<af_seq> subMat3D;
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, uint, char, schar,
+                         uchar, short, ushort, half_float::half>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Transpose, TestTypes);
+TYPED_TEST_SUITE(Transpose, TestTypes);
 
 template<typename T>
-void trsTest(string pTestFile, bool isSubRef=false, const vector<af_seq> *seqv=NULL)
-{
-    if (noDoubleTests<T>())
-        return;
+void trsTest(string pTestFile, bool isSubRef = false,
+             const vector<af_seq> *seqv = NULL) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
-    readTests<T,T,int>(pTestFile,numDims,in,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+    dim4 dims = numDims[0];
 
-    af_array outArray   = 0;
-    af_array inArray    = 0;
+    af_array outArray = 0;
+    af_array inArray  = 0;
     T *outData;
-    ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &(in[0].front()), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), dims.ndims(),
+                                   dims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
 
     // check if the test is for indexed Array
     if (isSubRef) {
-        af::dim4 newDims(dims[1]-4,dims[0]-4,dims[2],dims[3]);
+        dim4 newDims(dims[1] - 4, dims[0] - 4, dims[2], dims[3]);
         af_array subArray = 0;
-        ASSERT_EQ(AF_SUCCESS, af_index(&subArray,inArray,seqv->size(),&seqv->front()));
-        ASSERT_EQ(AF_SUCCESS, af_transpose(&outArray,subArray, false));
+        ASSERT_SUCCESS(
+            af_index(&subArray, inArray, seqv->size(), &seqv->front()));
+        ASSERT_SUCCESS(af_transpose(&outArray, subArray, false));
         // destroy the temporary indexed Array
-        ASSERT_EQ(AF_SUCCESS, af_release_array(subArray));
+        ASSERT_SUCCESS(af_release_array(subArray));
 
         dim_t nElems;
-        ASSERT_EQ(AF_SUCCESS, af_get_elements(&nElems,outArray));
+        ASSERT_SUCCESS(af_get_elements(&nElems, outArray));
         outData = new T[nElems];
     } else {
-        ASSERT_EQ(AF_SUCCESS,af_transpose(&outArray,inArray, false));
+        ASSERT_SUCCESS(af_transpose(&outArray, inArray, false));
         outData = new T[dims.elements()];
     }
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
+    ASSERT_SUCCESS(af_get_data_ptr((void *)outData, outArray));
 
-    for (size_t testIter=0; testIter<tests.size(); ++testIter) {
-        vector<T> currGoldBar   = tests[testIter];
-        size_t nElems        = currGoldBar.size();
-        for (size_t elIter=0; elIter<nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter],outData[elIter])<< "at: " << elIter<< std::endl;
+    for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
+        vector<T> currGoldBar = tests[testIter];
+        size_t nElems         = currGoldBar.size();
+        for (size_t elIter = 0; elIter < nElems; ++elIter) {
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
 
     // cleanup
     delete[] outData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-TYPED_TEST(Transpose,Vector)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/vector.test"));
+TYPED_TEST(Transpose, Vector) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/vector.test"));
 }
 
-TYPED_TEST(Transpose,VectorBatch)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/vector_batch.test"));
+TYPED_TEST(Transpose, VectorBatch) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/vector_batch.test"));
 }
 
- TYPED_TEST(Transpose,Square)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/square.test"));
+TYPED_TEST(Transpose, Square) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/square.test"));
 }
 
-TYPED_TEST(Transpose,Rectangle)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/rectangle.test"));
+TYPED_TEST(Transpose, Rectangle) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/rectangle.test"));
 }
 
-TYPED_TEST(Transpose,Rectangle2)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/rectangle2.test"));
+TYPED_TEST(Transpose, Rectangle2) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/rectangle2.test"));
 }
 
-TYPED_TEST(Transpose,SquareBatch)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/square_batch.test"));
+TYPED_TEST(Transpose, SquareBatch) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/square_batch.test"));
 }
 
-TYPED_TEST(Transpose,RectangleBatch)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/rectangle_batch.test"));
+TYPED_TEST(Transpose, RectangleBatch) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/rectangle_batch.test"));
 }
 
-TYPED_TEST(Transpose,RectangleBatch2)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/rectangle_batch2.test"));
+TYPED_TEST(Transpose, RectangleBatch2) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/rectangle_batch2.test"));
 }
 
-TYPED_TEST(Transpose,Square512x512)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/square2.test"));
+TYPED_TEST(Transpose, Square512x512) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/square2.test"));
 }
 
-TYPED_TEST(Transpose,SubRef)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/offset.test"),true,&(this->subMat2D));
+TYPED_TEST(Transpose, SubRef) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/offset.test"), true,
+                       &(this->subMat2D));
 }
 
-TYPED_TEST(Transpose,SubRefBatch)
-{
-    trsTest<TypeParam>(string(TEST_DIR"/transpose/offset_batch.test"),true,&(this->subMat3D));
+TYPED_TEST(Transpose, SubRefBatch) {
+    trsTest<TypeParam>(string(TEST_DIR "/transpose/offset_batch.test"), true,
+                       &(this->subMat3D));
 }
 
-
 ////////////////////////////////////// CPP //////////////////////////////////
 //
 template<typename T>
-void trsCPPTest(string pFileName)
-{
-    vector<af::dim4> numDims;
+void trsCPPTest(string pFileName) {
+    vector<dim4> numDims;
 
-    vector<vector<T> >   in;
-    vector<vector<T> >   tests;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
     readTests<T, T, int>(pFileName, numDims, in, tests);
-    af::dim4 dims = numDims[0];
+    dim4 dims = numDims[0];
 
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
-    af::array input(dims, &(in[0].front()));
-    af::array output = af::transpose(input);
+    array input(dims, &(in[0].front()));
+    array output = transpose(input);
 
     T *outData = new T[dims.elements()];
-    output.host((void*)outData);
+    output.host((void *)outData);
 
     for (size_t testIter = 0; testIter < tests.size(); ++testIter) {
         vector<T> currGoldBar = tests[testIter];
-        size_t nElems = currGoldBar.size();
+        size_t nElems         = currGoldBar.size();
         for (size_t elIter = 0; elIter < nElems; ++elIter) {
-            ASSERT_EQ(currGoldBar[elIter], outData[elIter])<< "at: " << elIter << std::endl;
+            ASSERT_EQ(currGoldBar[elIter], outData[elIter])
+                << "at: " << elIter << endl;
         }
     }
 
@@ -181,38 +183,37 @@ void trsCPPTest(string pFileName)
     delete[] outData;
 }
 
-TEST(Transpose, CPP_f64)
-{
-    trsCPPTest<double>(string(TEST_DIR"/transpose/rectangle_batch2.test"));
+TEST(Transpose, CPP_f64) {
+    trsCPPTest<double>(string(TEST_DIR "/transpose/rectangle_batch2.test"));
 }
 
-TEST(Transpose, CPP_f32)
-{
-    trsCPPTest<float>(string(TEST_DIR"/transpose/rectangle_batch2.test"));
+TEST(Transpose, CPP_f32) {
+    trsCPPTest<float>(string(TEST_DIR "/transpose/rectangle_batch2.test"));
 }
 
 template<typename T>
-void trsCPPConjTest()
-{
-    vector<af::dim4> numDims;
+void trsCPPConjTest(dim_t d0, dim_t d1 = 1, dim_t d2 = 1, dim_t d3 = 1) {
+    vector<dim4> numDims;
 
-    af::dim4 dims(40, 40);
+    dim4 dims(d0, d1, d2, d3);
 
-    if (noDoubleTests<T>()) return;
+    SUPPORTED_TYPE_CHECK(T);
 
-    af::array input = randu(dims, (af_dtype) af::dtype_traits<T>::af_type);
-    af::array output_t = af::transpose(input, false);
-    af::array output_c = af::transpose(input, true);
+    array input    = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+    array output_t = transpose(input, false);
+    array output_c = transpose(input, true);
 
-    T *tData  = new T[dims.elements()];
+    T *tData = new T[dims.elements()];
     T *cData = new T[dims.elements()];
-    output_t.host((void*)tData);
-    output_c.host((void*)cData);
+    output_t.host((void *)tData);
+    output_c.host((void *)cData);
 
     size_t nElems = dims.elements();
     for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_NEAR(tData[elIter].real(), cData[elIter].real(), 1e-6)<< "at: " << elIter << std::endl;
-        ASSERT_NEAR(-tData[elIter].imag(), cData[elIter].imag(), 1e-6)<< "at: " << elIter << std::endl;
+        ASSERT_NEAR(real(tData[elIter]), real(cData[elIter]), 1e-6)
+            << "at: " << elIter << endl;
+        ASSERT_NEAR(-imag(tData[elIter]), imag(cData[elIter]), 1e-6)
+            << "at: " << elIter << endl;
     }
 
     // cleanup
@@ -220,25 +221,67 @@ void trsCPPConjTest()
     delete[] cData;
 }
 
-TEST(Transpose, CPP_c32_CONJ)
-{
-    trsCPPConjTest<cfloat>();
+TEST(Transpose, CPP_c32_CONJ40x40) { trsCPPConjTest<cfloat>(40, 40); }
+
+TEST(Transpose, CPP_c32_CONJ2000x1) { trsCPPConjTest<cfloat>(2000); }
+
+TEST(Transpose, CPP_c32_CONJ20x20x5) { trsCPPConjTest<cfloat>(20, 20, 5); }
+
+TEST(Transpose, MaxDim) {
+    const size_t largeDim = 65535 * 33 + 1;
+
+    array input  = range(dim4(2, largeDim, 1, 1));
+    array gold   = range(dim4(largeDim, 2, 1, 1), 1);
+    array output = transpose(input);
+
+    ASSERT_EQ(output.dims(0), (int)largeDim);
+    ASSERT_EQ(output.dims(1), 2);
+    ASSERT_ARRAYS_EQ(gold, output);
+
+    input  = range(dim4(2, 5, 1, largeDim));
+    gold   = range(dim4(5, 2, 1, largeDim), 1);
+    output = transpose(input);
+
+    ASSERT_ARRAYS_EQ(gold, output);
 }
 
-TEST(Transpose, GFOR)
-{
-    using namespace af;
+TEST(Transpose, GFOR) {
+    using af::constant;
+    using af::max;
+    using af::seq;
+    using af::span;
+
     dim4 dims = dim4(100, 100, 3);
-    array A = round(100 * randu(dims));
-    array B = constant(0, 100, 100, 3);
+    array A   = round(100 * randu(dims));
+    array B   = constant(0, 100, 100, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = A(span, span, ii).T();
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = A(span, span, ii).T(); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = A(span, span, ii).T();
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
     }
 }
+
+TEST(Transpose, SNIPPET_blas_func_transpose) {
+    // clang-format off
+    //! [ex_blas_func_transpose]
+    //!
+    // Create a, a 2x3 array
+    array a = iota(dim4(2, 3));    // a = [0, 2, 4
+                                   //      1, 3, 5]
+
+    // Create b, the transpose of a
+    array b = transpose(a);        // b = [0, 1,
+                                   //      2, 3,
+                                   //      4, 5]
+
+    //! [ex_blas_func_transpose]
+    // clang-format on
+
+    using std::vector;
+    vector<float> gold_b{0, 2, 4, 1, 3, 5};
+
+    ASSERT_VEC_ARRAY_EQ(gold_b, b.dims(), b);
+}
diff --git a/test/transpose_inplace.cpp b/test/transpose_inplace.cpp
index 34e17647c8..7e542fd34f 100644
--- a/test/transpose_inplace.cpp
+++ b/test/transpose_inplace.cpp
@@ -7,69 +7,59 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::string;
-using std::vector;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using std::endl;
+using std::vector;
 
 template<typename T>
-class Transpose : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Transpose : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, uint, char, uchar> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, uint, char, schar,
+                         uchar, short, ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Transpose, TestTypes);
+TYPED_TEST_SUITE(Transpose, TestTypes);
 
 template<typename T>
-void transposeip_test(af::dim4 dims)
-{
-    if (noDoubleTests<T>())
-        return;
+void transposeip_test(dim4 dims) {
+    SUPPORTED_TYPE_CHECK(T);
 
     af_array inArray  = 0;
     af_array outArray = 0;
 
-    ASSERT_EQ(AF_SUCCESS, af_randu(&inArray, dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-
-    ASSERT_EQ(AF_SUCCESS, af_transpose(&outArray, inArray, false));
-    ASSERT_EQ(AF_SUCCESS, af_transpose_inplace(inArray, false));
+    ASSERT_SUCCESS(af_randu(&inArray, dims.ndims(), dims.get(),
+                            (af_dtype)dtype_traits<T>::af_type));
 
-    T *outData = new T[dims.elements()];
-    T *trsData = new T[dims.elements()];
+    ASSERT_SUCCESS(af_transpose(&outArray, inArray, false));
+    ASSERT_SUCCESS(af_transpose_inplace(inArray, false));
 
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)outData, outArray));
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr((void*)trsData, inArray));
-
-    dim_t nElems = dims.elements();
-    for (int elIter = 0; elIter < (int)nElems; ++elIter) {
-        ASSERT_EQ(trsData[elIter] , outData[elIter])<< "at: " << elIter << std::endl;
-    }
+    ASSERT_ARRAYS_EQ(inArray, outArray);
 
     // cleanup
-    delete[] outData;
-    delete[] trsData;
-    ASSERT_EQ(AF_SUCCESS, af_release_array(inArray));
-    ASSERT_EQ(AF_SUCCESS, af_release_array(outArray));
+    ASSERT_SUCCESS(af_release_array(inArray));
+    ASSERT_SUCCESS(af_release_array(outArray));
 }
 
-#define INIT_TEST(Side, D3, D4)                                                     \
-    TYPED_TEST(Transpose, TranposeIP_##Side)                                        \
-    {                                                                               \
-        transposeip_test<TypeParam>(af::dim4(Side, Side, D3, D4));                  \
+#define INIT_TEST(Side, D3, D4)                                \
+    TYPED_TEST(Transpose, TranposeIP_##Side) {                 \
+        transposeip_test<TypeParam>(dim4(Side, Side, D3, D4)); \
     }
 
 INIT_TEST(10, 1, 1);
@@ -81,27 +71,12 @@ INIT_TEST(25, 2, 2);
 
 ////////////////////////////////////// CPP //////////////////////////////////
 //
-void transposeInPlaceCPPTest()
-{
-    if (noDoubleTests<float>()) return;
+void transposeInPlaceCPPTest() {
+    dim4 dims(64, 64, 1, 1);
 
-    af::dim4 dims(64, 64, 1,1);
-
-    af::array input = randu(dims);
-    af::array output = af::transpose(input);
+    array input  = randu(dims);
+    array output = transpose(input);
     transposeInPlace(input);
 
-    float *outData = new float[dims.elements()];
-    float *trsData = new float[dims.elements()];
-
-    output.host((void*)outData);
-    input.host((void*)trsData);
-
-    dim_t nElems = dims.elements();
-    for (int elIter = 0; elIter < (int)nElems; ++elIter) {
-        ASSERT_EQ(trsData[elIter], outData[elIter])<< "at: " << elIter << std::endl;
-    }
-
-    // cleanup
-    delete[] outData;
+    ASSERT_ARRAYS_EQ(input, output);
 }
diff --git a/test/triangle.cpp b/test/triangle.cpp
index d3bed920cc..a7d47832e5 100644
--- a/test/triangle.cpp
+++ b/test/triangle.cpp
@@ -7,45 +7,51 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
-#include <af/dim4.hpp>
+#include <gtest/gtest.h>
+#include <half.hpp>
+#include <testHelpers.hpp>
+#include <af/data.h>
 #include <af/defines.h>
+#include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/data.h>
-#include <vector>
-#include <iostream>
+
 #include <complex>
+#include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
 using af::dim4;
+using af::freeHost;
+using std::abs;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Triangle : public ::testing::Test { };
+class Triangle : public ::testing::Test {};
 
-typedef ::testing::Types<float, af::cfloat, double, af::cdouble, int, unsigned, char, uchar, uintl, intl> TestTypes;
-TYPED_TEST_CASE(Triangle, TestTypes);
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char,
+                         schar, uchar, uintl, intl, short, ushort,
+                         half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Triangle, TestTypes);
 
 template<typename T>
-void triangleTester(const dim4 dims, bool is_upper, bool is_unit_diag=false)
-{
-    if (noDoubleTests<T>()) return;
+void triangleTester(const dim4 dims, bool is_upper, bool is_unit_diag = false) {
+    SUPPORTED_TYPE_CHECK(T);
 #if 1
-    af::array in = cpu_randu<T>(dims);
+    array in = cpu_randu<T>(dims);
 #else
-    af::array in = af::randu(dims, (af::dtype)af::dtype_traits<T>::af_type);
+    array in = af::randu(dims, (af::dtype)af::dtype_traits<T>::af_type);
 #endif
 
-    T *h_in = in.host<T>();
-    af::array out = is_upper ?  upper(in, is_unit_diag) : lower(in, is_unit_diag);
-    T *h_out = out.host<T>();
+    T *h_in   = in.host<T>();
+    array out = is_upper ? upper(in, is_unit_diag) : lower(in, is_unit_diag);
+    T *h_out  = out.host<T>();
 
     int m = dims[0];
     int n = dims[1];
@@ -58,112 +64,104 @@ void triangleTester(const dim4 dims, bool is_upper, bool is_unit_diag=false)
 
             for (int x = 0; x < m; x++) {
                 T val = T(0);
-                if (((y <= x) && !is_upper) ||
-                    ((y >= x) &&  is_upper)) {
+                if (((y <= x) && !is_upper) || ((y >= x) && is_upper)) {
                     val = (is_unit_diag && y == x) ? (T)(1) : h_in[y_off + x];
                 }
 
-                ASSERT_EQ(h_out[y_off + x], val) << "at (" << x << ", " << y << ")";
+                ASSERT_EQ(h_out[y_off + x], val)
+                    << "at (" << x << ", " << y << ")";
             }
         }
     }
 
-    delete[] h_in;
-    delete[] h_out;
+    freeHost(h_in);
+    freeHost(h_out);
 }
 
-TYPED_TEST(Triangle, Lower2DRect0)
-{
+TYPED_TEST(Triangle, Lower2DRect0) {
     triangleTester<TypeParam>(dim4(500, 600), false);
 }
 
-TYPED_TEST(Triangle, Lower2DRect1)
-{
+TYPED_TEST(Triangle, Lower2DRect1) {
     triangleTester<TypeParam>(dim4(2003, 1775), false);
 }
 
-TYPED_TEST(Triangle, Lower2DSquare)
-{
+TYPED_TEST(Triangle, Lower2DSquare) {
     triangleTester<TypeParam>(dim4(2048, 2048), false);
 }
 
-TYPED_TEST(Triangle, Lower3D)
-{
+TYPED_TEST(Triangle, Lower3D) {
     triangleTester<TypeParam>(dim4(1000, 1000, 5), false);
 }
 
-TYPED_TEST(Triangle, Lower4D)
-{
+TYPED_TEST(Triangle, Lower4D) {
     triangleTester<TypeParam>(dim4(600, 900, 3, 2), false);
 }
 
-TYPED_TEST(Triangle, Upper2DRect0)
-{
+TYPED_TEST(Triangle, Upper2DRect0) {
     triangleTester<TypeParam>(dim4(500, 600), true);
 }
 
-TYPED_TEST(Triangle, Upper2DRect1)
-{
+TYPED_TEST(Triangle, Upper2DRect1) {
     triangleTester<TypeParam>(dim4(2003, 1775), true);
 }
 
-TYPED_TEST(Triangle, Upper2DSquare)
-{
+TYPED_TEST(Triangle, Upper2DSquare) {
     triangleTester<TypeParam>(dim4(2048, 2048), true);
 }
 
-TYPED_TEST(Triangle, Upper3D)
-{
+TYPED_TEST(Triangle, Upper3D) {
     triangleTester<TypeParam>(dim4(1000, 1000, 5), true);
 }
 
-TYPED_TEST(Triangle, Upper4D)
-{
+TYPED_TEST(Triangle, Upper4D) {
     triangleTester<TypeParam>(dim4(600, 900, 3, 2), true);
 }
 
-TYPED_TEST(Triangle, Lower2DRect0Unit)
-{
+TYPED_TEST(Triangle, Lower2DRect0Unit) {
     triangleTester<TypeParam>(dim4(500, 600), false, true);
 }
 
-TYPED_TEST(Triangle, Lower2DRect1Unit)
-{
+TYPED_TEST(Triangle, Lower2DRect1Unit) {
     triangleTester<TypeParam>(dim4(2003, 1775), false, true);
 }
 
-TYPED_TEST(Triangle, Lower2DSquareUnit)
-{
+TYPED_TEST(Triangle, Lower2DSquareUnit) {
     triangleTester<TypeParam>(dim4(2048, 2048), false, true);
 }
 
-TYPED_TEST(Triangle, Upper2DRect0Unit)
-{
+TYPED_TEST(Triangle, Upper2DRect0Unit) {
     triangleTester<TypeParam>(dim4(500, 600), true, true);
 }
 
-TYPED_TEST(Triangle, Upper2DRect1Unit)
-{
+TYPED_TEST(Triangle, Upper2DRect1Unit) {
     triangleTester<TypeParam>(dim4(2003, 1775), true, true);
 }
 
-TYPED_TEST(Triangle, Upper2DSquareUnit)
-{
+TYPED_TEST(Triangle, Upper2DSquareUnit) {
     triangleTester<TypeParam>(dim4(2048, 2048), true, true);
 }
 
-TEST(Lower, ExtractGFOR)
-{
-    using namespace af;
+TYPED_TEST(Triangle, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 1;
+    triangleTester<TypeParam>(dim4(2, largeDim), true, true);
+}
+
+TEST(Lower, ExtractGFOR) {
+    using af::constant;
+    using af::lower;
+    using af::max;
+    using af::round;
+    using af::seq;
+    using af::span;
+
     dim4 dims = dim4(100, 100, 3);
-    array A = round(100 * randu(dims));
-    array B = constant(0, 100, 100, 3);
+    array A   = round(100 * randu(dims));
+    array B   = constant(0, 100, 100, 3);
 
-    gfor(seq ii, 3) {
-        B(span, span, ii) = lower(A(span, span, ii));
-    }
+    gfor(seq ii, 3) { B(span, span, ii) = lower(A(span, span, ii)); }
 
-    for(int ii = 0; ii < 3; ii++) {
+    for (int ii = 0; ii < 3; ii++) {
         array c_ii = lower(A(span, span, ii));
         array b_ii = B(span, span, ii);
         ASSERT_EQ(max<double>(abs(c_ii - b_ii)) < 1E-5, true);
diff --git a/test/unwrap.cpp b/test/unwrap.cpp
new file mode 100644
index 0000000000..9b97059dac
--- /dev/null
+++ b/test/unwrap.cpp
@@ -0,0 +1,239 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::allTrue;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::range;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Unwrap : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(Unwrap, TestTypes);
+
+template<typename T>
+void unwrapTest(string pTestFile, const unsigned resultIdx, const dim_t wx,
+                const dim_t wy, const dim_t sx, const dim_t sy, const dim_t px,
+                const dim_t py) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<T>> tests;
+    readTests<T, T, int>(pTestFile, numDims, in, tests);
+
+    dim4 idims = numDims[0];
+
+    af_array inArray   = 0;
+    af_array outArray  = 0;
+    af_array outArrayT = 0;
+    af_array outArray2 = 0;
+
+    ASSERT_SUCCESS(af_create_array(&inArray, &(in[0].front()), idims.ndims(),
+                                   idims.get(),
+                                   (af_dtype)dtype_traits<T>::af_type));
+
+    ASSERT_SUCCESS(af_unwrap(&outArray, inArray, wx, wy, sx, sy, px, py, true));
+    ASSERT_SUCCESS(
+        af_unwrap(&outArrayT, inArray, wx, wy, sx, sy, px, py, false));
+    ASSERT_SUCCESS(af_transpose(&outArray2, outArrayT, false));
+
+    size_t nElems = tests[resultIdx].size();
+    vector<T> outData(nElems);
+
+    // TODO: Change to ASSERT_VEC_ARRAY_EQ
+    // Compare is_column == true results
+    ASSERT_SUCCESS(af_get_data_ptr((void*)&outData[0], outArray));
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // Compare is_column == false results
+    ASSERT_SUCCESS(af_get_data_ptr((void*)&outData[0], outArray2));
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (outArrayT != 0) af_release_array(outArrayT);
+    if (outArray2 != 0) af_release_array(outArray2);
+}
+
+#define UNWRAP_INIT(desc, file, resultIdx, wx, wy, sx, sy, px, py)       \
+    TYPED_TEST(Unwrap, desc) {                                           \
+        unwrapTest<TypeParam>(string(TEST_DIR "/unwrap/" #file ".test"), \
+                              resultIdx, wx, wy, sx, sy, px, py);        \
+    }
+
+UNWRAP_INIT(UnwrapSmall00, unwrap_small, 0, 3, 3, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall01, unwrap_small, 1, 3, 3, 1, 1, 1, 1);
+UNWRAP_INIT(UnwrapSmall02, unwrap_small, 2, 3, 3, 1, 1, 2, 2);
+UNWRAP_INIT(UnwrapSmall03, unwrap_small, 3, 3, 3, 2, 2, 0, 0);
+UNWRAP_INIT(UnwrapSmall04, unwrap_small, 4, 3, 3, 2, 2, 1, 1);
+UNWRAP_INIT(UnwrapSmall05, unwrap_small, 5, 3, 3, 2, 2, 2, 2);
+UNWRAP_INIT(UnwrapSmall06, unwrap_small, 6, 3, 3, 3, 3, 0, 0);
+UNWRAP_INIT(UnwrapSmall07, unwrap_small, 7, 3, 3, 3, 3, 1, 1);
+UNWRAP_INIT(UnwrapSmall08, unwrap_small, 8, 3, 3, 3, 3, 2, 2);
+UNWRAP_INIT(UnwrapSmall09, unwrap_small, 9, 4, 4, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall10, unwrap_small, 10, 4, 4, 1, 1, 1, 1);
+UNWRAP_INIT(UnwrapSmall11, unwrap_small, 11, 4, 4, 1, 1, 2, 2);
+UNWRAP_INIT(UnwrapSmall12, unwrap_small, 12, 4, 4, 1, 1, 3, 3);
+UNWRAP_INIT(UnwrapSmall13, unwrap_small, 13, 4, 4, 2, 2, 0, 0);
+UNWRAP_INIT(UnwrapSmall14, unwrap_small, 14, 4, 4, 2, 2, 1, 1);
+UNWRAP_INIT(UnwrapSmall15, unwrap_small, 15, 4, 4, 2, 2, 2, 2);
+UNWRAP_INIT(UnwrapSmall16, unwrap_small, 16, 4, 4, 2, 2, 3, 3);
+UNWRAP_INIT(UnwrapSmall17, unwrap_small, 17, 4, 4, 4, 4, 0, 0);
+UNWRAP_INIT(UnwrapSmall18, unwrap_small, 18, 4, 4, 4, 4, 1, 1);
+UNWRAP_INIT(UnwrapSmall19, unwrap_small, 19, 4, 4, 4, 4, 2, 2);
+UNWRAP_INIT(UnwrapSmall20, unwrap_small, 20, 4, 4, 4, 4, 3, 3);
+UNWRAP_INIT(UnwrapSmall21, unwrap_small, 21, 5, 5, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall22, unwrap_small, 22, 5, 5, 1, 1, 1, 1);
+UNWRAP_INIT(UnwrapSmall23, unwrap_small, 23, 5, 5, 5, 5, 0, 0);
+UNWRAP_INIT(UnwrapSmall24, unwrap_small, 24, 5, 5, 5, 5, 1, 1);
+UNWRAP_INIT(UnwrapSmall25, unwrap_small, 25, 8, 8, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall26, unwrap_small, 26, 8, 8, 1, 1, 7, 7);
+UNWRAP_INIT(UnwrapSmall27, unwrap_small, 27, 8, 8, 8, 8, 0, 0);
+UNWRAP_INIT(UnwrapSmall28, unwrap_small, 28, 8, 8, 8, 8, 7, 7);
+UNWRAP_INIT(UnwrapSmall29, unwrap_small, 29, 12, 12, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall30, unwrap_small, 30, 12, 12, 1, 1, 2, 2);
+UNWRAP_INIT(UnwrapSmall31, unwrap_small, 31, 12, 12, 12, 12, 0, 0);
+UNWRAP_INIT(UnwrapSmall32, unwrap_small, 32, 12, 12, 12, 12, 2, 2);
+UNWRAP_INIT(UnwrapSmall33, unwrap_small, 33, 16, 16, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall34, unwrap_small, 34, 16, 16, 16, 16, 0, 0);
+UNWRAP_INIT(UnwrapSmall35, unwrap_small, 35, 16, 16, 16, 16, 15, 15);
+UNWRAP_INIT(UnwrapSmall36, unwrap_small, 36, 31, 31, 8, 8, 15, 15);
+UNWRAP_INIT(UnwrapSmall37, unwrap_small, 37, 8, 12, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall38, unwrap_small, 38, 8, 12, 1, 1, 7, 11);
+UNWRAP_INIT(UnwrapSmall39, unwrap_small, 39, 8, 12, 8, 12, 0, 0);
+UNWRAP_INIT(UnwrapSmall40, unwrap_small, 40, 8, 12, 8, 12, 7, 11);
+UNWRAP_INIT(UnwrapSmall41, unwrap_small, 41, 15, 10, 1, 1, 0, 0);
+UNWRAP_INIT(UnwrapSmall42, unwrap_small, 42, 15, 10, 1, 1, 14, 9);
+UNWRAP_INIT(UnwrapSmall43, unwrap_small, 43, 15, 10, 15, 10, 0, 0);
+
+// FIXME: This test is faulty after fixing the copy-paste errors in unwrap
+// UNWRAP_INIT(UnwrapSmall44, unwrap_small, 44, 15, 10, 15, 10, 14,  9);
+UNWRAP_INIT(UnwrapSmall45, unwrap_small, 45, 18, 16, 18, 16, 1, 0);
+UNWRAP_INIT(UnwrapSmall46, unwrap_small, 46, 16, 18, 16, 18, 0, 1);
+///////////////////////////////// CPP ////////////////////////////////////
+//
+TEST(Unwrap, CPP) {
+    const unsigned resultIdx = 20;
+    const unsigned wx        = 4;
+    const unsigned wy        = 4;
+    const unsigned sx        = 4;
+    const unsigned sy        = 4;
+    const unsigned px        = 3;
+    const unsigned py        = 3;
+
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+    readTests<float, float, int>(string(TEST_DIR "/unwrap/unwrap_small.test"),
+                                 numDims, in, tests);
+
+    dim4 idims = numDims[0];
+    array input(idims, &(in[0].front()));
+    array output = unwrap(input, wx, wy, sx, sy, px, py);
+
+    // Get result
+    float* outData = new float[tests[resultIdx].size()];
+    output.host((void*)outData);
+
+    // Compare result
+    size_t nElems = tests[resultIdx].size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_EQ(tests[resultIdx][elIter], outData[elIter])
+            << "at: " << elIter << endl;
+    }
+
+    // Free the host buffer
+    delete[] outData;
+}
+
+TEST(Unwrap, MaxDim) {
+    const size_t largeDim = 65535 + 1;
+    array input           = range(5, 5, largeDim);
+
+    const unsigned wx = 5;
+    const unsigned wy = 5;
+    const unsigned sx = 5;
+    const unsigned sy = 5;
+    const unsigned px = 0;
+    const unsigned py = 0;
+
+    array output = unwrap(input, wx, wy, sx, sy, px, py);
+
+    array gold = range(dim4(5, 5, 1, largeDim));
+    gold       = moddims(gold, dim4(25, 1, largeDim));
+
+    ASSERT_ARRAYS_EQ(gold, output);
+}
+
+TEST(Unwrap, DocSnippet) {
+    //! [ex_unwrap]
+    float hA[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+    array A(dim4(3, 3), hA);
+    //  1.     4.     7.
+    //  2.     5.     8.
+    //  3.     6.     9.
+
+    array A_simple = unwrap(A, 2, 2,  // window size
+                            1, 1);    // stride (sliding window)
+    //  1.     2.     4.     5.
+    //  2.     3.     5.     6.
+    //  4.     5.     7.     8.
+    //  5.     6.     8.     9.
+
+    array A_padded = unwrap(A, 2, 2,  // window size
+                            2, 2,     // stride (distinct)
+                            1, 1);    // padding
+    //  0.     0.     0.     5.
+    //  0.     0.     4.     6.
+    //  0.     2.     0.     8.
+    //  1.     3.     7.     9.
+    //! [ex_unwrap]
+
+    float gold_hA_simple[] = {1, 2, 4, 5, 2, 3, 5, 6, 4, 5, 7, 8, 5, 6, 8, 9};
+    array gold_A_simple(dim4(4, 4), gold_hA_simple);
+    ASSERT_ARRAYS_EQ(gold_A_simple, A_simple);
+
+    float gold_hA_padded[] = {0, 0, 0, 1, 0, 0, 2, 3, 0, 4, 0, 7, 5, 6, 8, 9};
+    array gold_A_padded(dim4(4, 4), gold_hA_padded);
+    ASSERT_ARRAYS_EQ(gold_A_padded, A_padded);
+}
diff --git a/test/var.cpp b/test/var.cpp
index fcea0ab02f..b889413646 100644
--- a/test/var.cpp
+++ b/test/var.cpp
@@ -7,59 +7,54 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
 #include <string>
 #include <vector>
-#include <testHelpers.hpp>
 
-using std::string;
-using std::vector;
+using af::array;
 using af::cdouble;
 using af::cfloat;
-using af::array;
+using af::dim4;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Var : public ::testing::Test
-{
-
-};
+class Var : public ::testing::Test {};
 
-typedef ::testing::Types< float, double, cfloat, cdouble, uint, int, uintl, intl, char, uchar> TestTypes;
-TYPED_TEST_CASE(Var, TestTypes);
+typedef ::testing::Types<float, double, cfloat, cdouble, uint, int, uintl, intl,
+                         char, schar, uchar, short, ushort, half_float::half>
+    TestTypes;
+TYPED_TEST_SUITE(Var, TestTypes);
 
 template<typename T>
 struct elseType {
-   typedef typename cond_type< is_same_type<T, uintl>::value ||
-                               is_same_type<T, intl>::value,
-                                              double,
-                                              T>::type type;
+    typedef typename cond_type<is_same_type<T, uintl>::value ||
+                                   is_same_type<T, intl>::value,
+                               double, T>::type type;
 };
 
 template<typename T>
 struct varOutType {
-   typedef typename cond_type< is_same_type<T, float>::value ||
-                               is_same_type<T, int>::value ||
-                               is_same_type<T, uint>::value ||
-                               is_same_type<T, uchar>::value ||
-                               is_same_type<T, char>::value,
-                                              float,
-                              typename elseType<T>::type>::type type;
+    typedef typename cond_type<
+        is_same_type<T, float>::value || is_same_type<T, int>::value ||
+            is_same_type<T, uint>::value || is_same_type<T, short>::value ||
+            is_same_type<T, ushort>::value || is_same_type<T, schar>::value ||
+            is_same_type<T, uchar>::value || is_same_type<T, char>::value,
+        float, typename elseType<T>::type>::type type;
 };
 
-
-
 //////////////////////////////// CPP ////////////////////////////////////
 // test var_all interface using cpp api
 
 template<typename T>
-void testCPPVar(T const_value, af::dim4 dims)
-{
+void testCPPVar(T const_value, dim4 dims, const bool useDeprecatedAPI = false) {
     typedef typename varOutType<T>::type outType;
-    if (noDoubleTests<T>()) return;
-    if (noDoubleTests<outType>()) return;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
 
     using af::array;
     using af::var;
@@ -69,86 +64,127 @@ void testCPPVar(T const_value, af::dim4 dims)
     outType gold = outType(0);
 
     array a(dims, &(hundred.front()));
-    outType output = var<outType>(a, false);
+
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    outType output =
+        (useDeprecatedAPI ? var<outType>(a, false)
+                          : var<outType>(a, AF_VARIANCE_POPULATION));
 
     ASSERT_NEAR(::real(output), ::real(gold), 1.0e-3);
     ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-3);
 
-    output = var<outType>(a, true);
+    output = (useDeprecatedAPI ? var<outType>(a, true)
+                               : var<outType>(a, AF_VARIANCE_SAMPLE));
 
     ASSERT_NEAR(::real(output), ::real(gold), 1.0e-3);
     ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-3);
 
-    gold = outType(2.5);
-    outType tmp[] = { outType(0), outType(1), outType(2), outType(3),
-        outType(4) };
+    gold          = outType(2);
+    outType tmp[] = {outType(0), outType(1), outType(2), outType(3),
+                     outType(4)};
     array b(5, tmp);
-    output = var<outType>(b, false);
+    af_print(b);
+    output = (useDeprecatedAPI ? var<outType>(b, false)
+                               : var<outType>(b, AF_VARIANCE_POPULATION));
 
     ASSERT_NEAR(::real(output), ::real(gold), 1.0e-3);
     ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-3);
 
-    gold = outType(2);
-    output = var<outType>(b, true);
+    gold   = outType(2.5);
+    output = (useDeprecatedAPI ? var<outType>(b, true)
+                               : var<outType>(b, AF_VARIANCE_SAMPLE));
+#pragma GCC diagnostic pop
 
     ASSERT_NEAR(::real(output), ::real(gold), 1.0e-3);
     ASSERT_NEAR(::imag(output), ::imag(gold), 1.0e-3);
 }
 
-TYPED_TEST(Var, AllCPPSmall)
-{
-    testCPPVar<TypeParam>(2, af::dim4(10, 10, 1, 1));
+TYPED_TEST(Var, AllCPPSmall) {
+    testCPPVar<TypeParam>(TypeParam(2), dim4(10, 10, 1, 1));
+    testCPPVar<TypeParam>(TypeParam(2), dim4(10, 10, 1, 1), true);
 }
 
-TYPED_TEST(Var, AllCPPMedium)
-{
-    testCPPVar<TypeParam>(2, af::dim4(100, 100, 1, 1));
+TYPED_TEST(Var, AllCPPMedium) {
+    testCPPVar<TypeParam>(TypeParam(2), dim4(100, 100, 1, 1));
+    testCPPVar<TypeParam>(TypeParam(2), dim4(100, 100, 1, 1), true);
 }
 
-TYPED_TEST(Var, AllCPPLarge)
-{
-    testCPPVar<TypeParam>(2, af::dim4(1000, 1000, 1, 1));
+TYPED_TEST(Var, AllCPPLarge) {
+    testCPPVar<TypeParam>(TypeParam(2), dim4(1000, 1000, 1, 1));
+    testCPPVar<TypeParam>(TypeParam(2), dim4(1000, 1000, 1, 1), true);
 }
 
-TYPED_TEST(Var, DimCPPSmall)
-{
-    typedef typename varOutType<TypeParam>::type outType;
+template<typename T>
+void dimCppSmallTest(const string pFileName,
+                     const bool useDeprecatedAPI = false) {
+    typedef typename varOutType<T>::type outType;
+    float tol = 0.001f;
+    if ((af_dtype)af::dtype_traits<T>::af_type == f16) { tol = 0.6f; }
 
-    if (noDoubleTests<TypeParam>()) return;
-    if (noDoubleTests<outType>()) return;
+    SUPPORTED_TYPE_CHECK(T);
+    SUPPORTED_TYPE_CHECK(outType);
 
-    vector<af::dim4> numDims;
-    vector<vector<TypeParam> > in;
-    vector<vector<outType> > tests;
+    vector<dim4> numDims;
+    vector<vector<T>> in;
+    vector<vector<outType>> tests;
 
-    readTests<TypeParam, outType, double> (TEST_DIR"/var/var.data",numDims,in,tests);
+    readTests<T, outType, float>(pFileName, numDims, in, tests);
 
-    for(size_t i = 0; i < in.size(); i++)
-    {
+    for (size_t i = 0; i < in.size(); i++) {
         array input(numDims[i], &in[i].front(), afHost);
 
-        array bout  = var(input, false);
-        array nbout = var(input, true);
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+        array bout  = (useDeprecatedAPI ? var(input, true)
+                                        : var(input, AF_VARIANCE_SAMPLE));
+        array nbout = (useDeprecatedAPI ? var(input, false)
+                                        : var(input, AF_VARIANCE_POPULATION));
 
-        array bout1  = var(input, false, 1);
-        array nbout1 = var(input, true,  1);
+        array bout1 = (useDeprecatedAPI ? var(input, true, 1)
+                                        : var(input, AF_VARIANCE_SAMPLE, 1));
+        array nbout1 =
+            (useDeprecatedAPI ? var(input, false, 1)
+                              : var(input, AF_VARIANCE_POPULATION, 1));
+#pragma GCC diagnostic pop
 
-        vector<vector<outType> > h_out(4);
+        vector<vector<outType>> h_out(4);
 
         h_out[0].resize(bout.elements());
         h_out[1].resize(nbout.elements());
         h_out[2].resize(bout1.elements());
         h_out[3].resize(nbout1.elements());
 
-        bout.host(  &h_out[0].front());
-        nbout.host( &h_out[1].front());
-        bout1.host( &h_out[2].front());
+        bout.host(&h_out[0].front());
+        nbout.host(&h_out[1].front());
+        bout1.host(&h_out[2].front());
         nbout1.host(&h_out[3].front());
 
-        for(size_t j = 0; j < tests.size(); j++) {
-            for(size_t jj = 0; jj < tests[j].size(); jj++) {
-                ASSERT_EQ(h_out[j][jj], tests[j][jj]);
-            }
-        }
+        ASSERT_VEC_ARRAY_NEAR(tests[0], bout.dims(), bout, tol);
+        ASSERT_VEC_ARRAY_NEAR(tests[1], nbout.dims(), nbout, tol);
+        ASSERT_VEC_ARRAY_NEAR(tests[2], bout1.dims(), bout1, tol);
+        ASSERT_VEC_ARRAY_NEAR(tests[3], nbout1.dims(), nbout1, tol);
     }
 }
+
+TYPED_TEST(Var, DimCPPSmall) {
+    dimCppSmallTest<TypeParam>(string(TEST_DIR "/var/var.data"));
+    dimCppSmallTest<TypeParam>(string(TEST_DIR "/var/var.data"), true);
+}
+
+TEST(Var, ISSUE2117) {
+    using af::constant;
+    using af::sum;
+    using af::var;
+
+    array myArray = constant(1, 1000, 3000);
+    myArray       = var(myArray, AF_VARIANCE_SAMPLE, 1);
+    ASSERT_NEAR(0.0f, sum<float>(myArray), 0.000001);
+
+    myArray = constant(1, 1000, 3000);
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
+    myArray = var(myArray, true, 1);
+#pragma GCC diagnostic pop
+    ASSERT_NEAR(0.0f, sum<float>(myArray), 0.000001);
+}
diff --git a/test/where.cpp b/test/where.cpp
index 96bc8d50de..a6c8dcde46 100644
--- a/test/where.cpp
+++ b/test/where.cpp
@@ -7,42 +7,51 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/array.h>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <af/array.h>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::allTrue;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::randu;
+using af::range;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Where : public ::testing::Test { };
+class Where : public ::testing::Test {};
 
-typedef ::testing::Types< float, double, cfloat, cdouble, int, uint, intl, uintl, char, uchar > TestTypes;
-TYPED_TEST_CASE(Where, TestTypes);
+typedef ::testing::Types<float, double, cfloat, cdouble, int, uint, intl, uintl,
+                         char, schar, uchar, short, ushort>
+    TestTypes;
+TYPED_TEST_SUITE(Where, TestTypes);
 
 template<typename T>
-void whereTest(string pTestFile, bool isSubRef=false, const vector<af_seq> seqv=vector<af_seq>())
-{
-    if (noDoubleTests<T>()) return;
+void whereTest(string pTestFile, bool isSubRef = false,
+               const vector<af_seq> seqv = vector<af_seq>()) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    vector<af::dim4> numDims;
+    vector<dim4> numDims;
 
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (pTestFile,numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(pTestFile, numDims, data, tests);
+    dim4 dims = numDims[0];
 
-    vector<T> in(data[0].begin(), data[0].end());
+    vector<T> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(), convert_to<T, int>);
 
     af_array inArray   = 0;
     af_array outArray  = 0;
@@ -50,85 +59,99 @@ void whereTest(string pTestFile, bool isSubRef=false, const vector<af_seq> seqv=
 
     // Get input array
     if (isSubRef) {
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&tempArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
-        ASSERT_EQ(AF_SUCCESS, af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
+        ASSERT_SUCCESS(af_create_array(&tempArray, &in.front(), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(
+            af_index(&inArray, tempArray, seqv.size(), &seqv.front()));
     } else {
-
-        ASSERT_EQ(AF_SUCCESS, af_create_array(&inArray, &in.front(), dims.ndims(), dims.get(), (af_dtype) af::dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&inArray, &in.front(), dims.ndims(),
+                                       dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
     }
 
     // Compare result
     vector<uint> currGoldBar(tests[0].begin(), tests[0].end());
 
     // Run sum
-    ASSERT_EQ(AF_SUCCESS, af_where(&outArray, inArray));
+    ASSERT_SUCCESS(af_where(&outArray, inArray));
 
-    // Get result
-    size_t nElems = currGoldBar.size();
-    vector<uint> outData(nElems);
-    ASSERT_EQ(AF_SUCCESS, af_get_data_ptr(&outData.front(), outArray));
+    ASSERT_VEC_ARRAY_EQ(currGoldBar, dim4(tests[0].size()), outArray);
 
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                        << std::endl;
-    }
-
-    if(inArray   != 0) af_release_array(inArray);
-    if(outArray  != 0) af_release_array(outArray);
-    if(tempArray != 0) af_release_array(tempArray);
+    if (inArray != 0) af_release_array(inArray);
+    if (outArray != 0) af_release_array(outArray);
+    if (tempArray != 0) af_release_array(tempArray);
 }
 
-vector<af_seq> init_subs()
-{
-    vector<af_seq> subs;
-    subs.push_back(af_make_seq(2, 6, 1));
-    subs.push_back(af_make_seq(1, 5, 1));
-    subs.push_back(af_make_seq(1, 3, 1));
-    subs.push_back(af_make_seq(1, 2, 1));
-    return subs;
-}
-
-
-#define WHERE_TESTS(T)                          \
-    TEST(Where,Test_##T)                        \
-    {                                           \
-        whereTest<T>(                           \
-            string(TEST_DIR"/where/where.test") \
-            );                                  \
-    }                                           \
+#define WHERE_TESTS(T)                                      \
+    TEST(Where, Test_##T) {                                 \
+        whereTest<T>(string(TEST_DIR "/where/where.test")); \
+    }
 
-TYPED_TEST(Where, BasicC)
-{
-    whereTest<TypeParam>(string(TEST_DIR"/where/where.test") );
+TYPED_TEST(Where, BasicC) {
+    whereTest<TypeParam>(string(TEST_DIR "/where/where.test"));
 }
 
 //////////////////////////////////// CPP /////////////////////////////////
 //
-TYPED_TEST(Where, CPP)
-{
-    if (noDoubleTests<TypeParam>()) return;
+TYPED_TEST(Where, CPP) {
+    SUPPORTED_TYPE_CHECK(TypeParam);
+
+    vector<dim4> numDims;
 
-    vector<af::dim4> numDims;
+    vector<vector<int>> data;
+    vector<vector<int>> tests;
+    readTests<int, int, int>(string(TEST_DIR "/where/where.test"), numDims,
+                             data, tests);
+    dim4 dims = numDims[0];
 
-    vector<vector<int> > data;
-    vector<vector<int> > tests;
-    readTests<int,int,int> (string(TEST_DIR"/where/where.test"),numDims,data,tests);
-    af::dim4 dims       = numDims[0];
+    vector<float> in(data[0].size());
+    transform(data[0].begin(), data[0].end(), in.begin(),
+              convert_to<float, int>);
 
-    vector<float> in(data[0].begin(), data[0].end());
-    af::array input(dims, &in.front(), afHost);
-    af::array output = where(input);
+    array input(dims, &in.front(), afHost);
+    array output = where(input);
 
     // Compare result
     vector<uint> currGoldBar(tests[0].begin(), tests[0].end());
 
-    // Get result
-    size_t nElems = currGoldBar.size();
-    vector<uint> outData(nElems);
-    output.host((void*)&(outData.front()));
+    ASSERT_VEC_ARRAY_EQ(currGoldBar, dim4(tests[0].size()), output);
+}
 
-    for (size_t elIter = 0; elIter < nElems; ++elIter) {
-        ASSERT_EQ(currGoldBar[elIter], outData[elIter]) << "at: " << elIter
-                                                        << std::endl;
-    }
+TEST(Where, MaxDim) {
+    const size_t largeDim = 65535 * 32 + 2;
+
+    array input  = range(dim4(1, largeDim), 1);
+    array output = where(input % 2 == 0);
+    array gold   = 2 * range(largeDim / 2);
+    ASSERT_ARRAYS_EQ(gold.as(u32), output);
+
+    input  = range(dim4(1, 1, 1, largeDim), 3);
+    output = where(input % 2 == 0);
+    ASSERT_ARRAYS_EQ(gold.as(u32), output);
+}
+
+TEST(Where, ISSUE_1259) {
+    array a       = randu(10, 10, 10);
+    array indices = where(a > 2);
+    ASSERT_EQ(indices.elements(), 0);
 }
+
+#define TEST_TEMP_FORMAT(form, dim)                                      \
+    TEST(TEMP_FORMAT, form##_Dim##dim) {                                 \
+        const dim4 dims(2, 3, 4, 5);                                     \
+        const array in(af::moddims(range(dim4(dims.elements())), dims)); \
+        in.eval();                                                       \
+        const array gold = where(in > 3.0);                              \
+                                                                         \
+        array out = where(toTempFormat(form, in) > 3.0);                 \
+        ASSERT_ARRAYS_EQ(gold, out);                                     \
+    }
+
+#define TEST_TEMP_FORMATS(form) \
+    TEST_TEMP_FORMAT(form, 0)   \
+    TEST_TEMP_FORMAT(form, 1)   \
+    TEST_TEMP_FORMAT(form, 2)   \
+    TEST_TEMP_FORMAT(form, 3)
+
+FOREACH_TEMP_FORMAT(TEST_TEMP_FORMATS)
\ No newline at end of file
diff --git a/test/wrap.cpp b/test/wrap.cpp
new file mode 100644
index 0000000000..4f53d9fd34
--- /dev/null
+++ b/test/wrap.cpp
@@ -0,0 +1,542 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/defines.h>
+#include <af/dim4.hpp>
+#include <af/traits.hpp>
+#include <algorithm>
+#include <complex>
+#include <iostream>
+#include <string>
+#include <vector>
+
+using af::allTrue;
+using af::array;
+using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype;
+using af::dtype_traits;
+using af::randu;
+using af::range;
+using std::abs;
+using std::cout;
+using std::endl;
+using std::string;
+using std::vector;
+
+template<typename T>
+class Wrap : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
+};
+
+// create a list of types to be tested
+typedef ::testing::Types<float, double, cfloat, cdouble, int, unsigned int,
+                         intl, uintl, char, signed char, unsigned char, short,
+                         ushort>
+    TestTypes;
+
+// register the type list
+TYPED_TEST_SUITE(Wrap, TestTypes);
+
+template<typename T>
+inline double get_val(T val) {
+    return val;
+}
+
+template<>
+inline double get_val<cfloat>(cfloat val) {
+    return abs(val);
+}
+
+template<>
+inline double get_val<cdouble>(cdouble val) {
+    return abs(val);
+}
+
+template<>
+inline double get_val<unsigned char>(unsigned char val) {
+    return ((int)(val)) % 256;
+}
+
+template<>
+inline double get_val<char>(char val) {
+    return (val != 0);
+}
+
+template<typename T>
+void wrapTest(const dim_t ix, const dim_t iy, const dim_t wx, const dim_t wy,
+              const dim_t sx, const dim_t sy, const dim_t px, const dim_t py,
+              bool cond) {
+    SUPPORTED_TYPE_CHECK(T);
+
+    const int nc = 1;
+
+    int lim = std::max((dim_t)2, (dim_t)(250) / (wx * wy));
+
+    dtype ty = (dtype)dtype_traits<T>::af_type;
+    array in = round(lim * randu(ix, iy, nc, f32)).as(ty);
+
+    vector<T> h_in(in.elements());
+    in.host(&h_in[0]);
+
+    vector<int> h_factor(ix * iy);
+
+    dim_t ny = (iy + 2 * py - wy) / sy + 1;
+    dim_t nx = (ix + 2 * px - wx) / sx + 1;
+
+    for (int idy = 0; idy < ny; idy++) {
+        int fy = idy * sy - py;
+        if (fy + wy < 0 || fy >= iy) continue;
+
+        for (int idx = 0; idx < nx; idx++) {
+            int fx = idx * sx - px;
+            if (fx + wx < 0 || fx >= ix) continue;
+
+            for (int ly = 0; ly < wy; ly++) {
+                if (fy + ly < 0 || fy + ly >= iy) continue;
+
+                for (int lx = 0; lx < wx; lx++) {
+                    if (fx + lx < 0 || fx + lx >= ix) continue;
+                    h_factor[(fy + ly) * ix + (fx + lx)]++;
+                }
+            }
+        }
+    }
+
+    array factor(ix, iy, &h_factor[0]);
+
+    array in_dim  = unwrap(in, wx, wy, sx, sy, px, py, cond);
+    array res_dim = wrap(in_dim, ix, iy, wx, wy, sx, sy, px, py, cond);
+
+    ASSERT_EQ(in.elements(), res_dim.elements());
+
+    vector<T> h_res(ix * iy);
+    res_dim.host(&h_res[0]);
+
+    for (int n = 0; n < nc; n++) {
+        T *iptr = &h_in[n * ix * iy];
+        T *rptr = &h_res[n * ix * iy];
+
+        for (int y = 0; y < iy; y++) {
+            for (int x = 0; x < ix; x++) {
+                // FIXME: Use a better test
+                T ival     = iptr[y * ix + x];
+                T rval     = rptr[y * ix + x];
+                int factor = h_factor[y * ix + x];
+
+                if (get_val(ival) == 0) continue;
+
+                ASSERT_NEAR(get_val<T>(ival * factor), get_val<T>(rval), 1E-5)
+                    << "at " << x << "," << y << " for cond == " << cond
+                    << endl;
+            }
+        }
+    }
+}
+
+#define WRAP_INIT(desc, ix, iy, wx, wy, sx, sy, px, py)             \
+    TYPED_TEST(Wrap, Col##desc) {                                   \
+        wrapTest<TypeParam>(ix, iy, wx, wy, sx, sy, px, py, true);  \
+    }                                                               \
+    TYPED_TEST(Wrap, Row##desc) {                                   \
+        wrapTest<TypeParam>(ix, iy, wx, wy, sx, sy, px, py, false); \
+    }
+
+WRAP_INIT(00, 300, 100, 3, 3, 1, 1, 0, 0);
+WRAP_INIT(01, 300, 100, 3, 3, 1, 1, 1, 1);
+WRAP_INIT(03, 300, 100, 3, 3, 2, 2, 0, 0);
+WRAP_INIT(04, 300, 100, 3, 3, 2, 2, 1, 1);
+WRAP_INIT(05, 300, 100, 3, 3, 2, 2, 2, 2);
+WRAP_INIT(06, 300, 100, 3, 3, 3, 3, 0, 0);
+WRAP_INIT(07, 300, 100, 3, 3, 3, 3, 1, 1);
+WRAP_INIT(08, 300, 100, 3, 3, 3, 3, 2, 2);
+WRAP_INIT(09, 300, 100, 4, 4, 1, 1, 0, 0);
+WRAP_INIT(13, 300, 100, 4, 4, 2, 2, 0, 0);
+WRAP_INIT(14, 300, 100, 4, 4, 2, 2, 1, 1);
+WRAP_INIT(15, 300, 100, 4, 4, 2, 2, 2, 2);
+WRAP_INIT(16, 300, 100, 4, 4, 2, 2, 3, 3);
+WRAP_INIT(17, 300, 100, 4, 4, 4, 4, 0, 0);
+WRAP_INIT(18, 300, 100, 4, 4, 4, 4, 1, 1);
+WRAP_INIT(19, 300, 100, 4, 4, 4, 4, 2, 2);
+WRAP_INIT(27, 300, 100, 8, 8, 8, 8, 0, 0);
+WRAP_INIT(28, 300, 100, 8, 8, 8, 8, 7, 7);
+WRAP_INIT(31, 300, 100, 12, 12, 12, 12, 0, 0);
+WRAP_INIT(32, 300, 100, 12, 12, 12, 12, 2, 2);
+WRAP_INIT(35, 300, 100, 16, 16, 16, 16, 15, 15);
+WRAP_INIT(36, 300, 100, 31, 31, 8, 8, 15, 15);
+WRAP_INIT(39, 300, 100, 8, 12, 8, 12, 0, 0);
+WRAP_INIT(40, 300, 100, 8, 12, 8, 12, 7, 11);
+WRAP_INIT(43, 300, 100, 15, 10, 15, 10, 0, 0);
+WRAP_INIT(44, 300, 100, 15, 10, 15, 10, 14, 9);
+
+TEST(Wrap, MaxDim) {
+    const size_t largeDim = 65535 + 1;
+    array input           = range(5, 5, 1, largeDim);
+
+    const unsigned wx = 5;
+    const unsigned wy = 5;
+    const unsigned sx = 5;
+    const unsigned sy = 5;
+    const unsigned px = 0;
+    const unsigned py = 0;
+
+    array unwrapped = unwrap(input, wx, wy, sx, sy, px, py);
+    array output    = wrap(unwrapped, 5, 5, wx, wy, sx, sy, px, py);
+
+    ASSERT_ARRAYS_EQ(output, input);
+}
+
+TEST(Wrap, DocSnippet) {
+    //! [ex_wrap_1]
+    float hA[] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
+    array A(dim4(3, 3), hA);
+    //  1.     4.     7.
+    //  2.     5.     8.
+    //  3.     6.     9.
+
+    array A_unwrapped = unwrap(A, 2, 2,  // window size
+                               2, 2,     // stride (distinct)
+                               1, 1);    // padding
+    //  0.     0.     0.     5.
+    //  0.     0.     4.     6.
+    //  0.     2.     0.     8.
+    //  1.     3.     7.     9.
+
+    array A_wrapped = wrap(A_unwrapped, 3, 3,  // A's size
+                           2, 2,               // window size
+                           2, 2,               // stride (distinct)
+                           1, 1);              // padding
+    //  1.     4.     7.
+    //  2.     5.     8.
+    //  3.     6.     9.
+    //! [ex_wrap_1]
+
+    ASSERT_ARRAYS_EQ(A, A_wrapped);
+
+    //! [ex_wrap_2]
+    float hB[] = {1, 1, 1, 1, 1, 1, 1, 1, 1};
+    array B(dim4(3, 3), hB);
+    //  1.     1.     1.
+    //  1.     1.     1.
+    //  1.     1.     1.
+    array B_unwrapped = unwrap(B, 2, 2,  // window size
+                               1, 1);    // stride (sliding)
+    //  1.     1.     1.     1.
+    //  1.     1.     1.     1.
+    //  1.     1.     1.     1.
+    //  1.     1.     1.     1.
+    array B_wrapped = wrap(B_unwrapped, 3, 3,  // B's size
+                           2, 2,               // window size
+                           1, 1);              // stride (sliding)
+    //  1.     2.     1.
+    //  2.     4.     2.
+    //  1.     2.     1.
+    //! [ex_wrap_2]
+
+    float gold_hB_wrapped[] = {1, 2, 1, 2, 4, 2, 1, 2, 1};
+    array gold_B_wrapped(dim4(3, 3), gold_hB_wrapped);
+    ASSERT_ARRAYS_EQ(gold_B_wrapped, B_wrapped);
+}
+
+static void getInput(af_array *data, const dim_t *dims) {
+    float h_data[16] = {10, 20, 20, 30, 30, 40, 40, 50,
+                        30, 40, 40, 50, 50, 60, 60, 70};
+    ASSERT_SUCCESS(af_create_array(data, &h_data[0], 2, dims, f32));
+}
+static void getGold(af_array *gold, const dim_t *dims) {
+    float h_gold[16] = {10, 20, 30, 40, 20, 30, 40, 50,
+                        30, 40, 50, 60, 40, 50, 60, 70};
+    ASSERT_SUCCESS(af_create_array(gold, &h_gold[0], 2, dims, f32));
+}
+
+class WrapCommon : virtual public ::testing::Test {
+   protected:
+    WrapCommon()
+        : in_(0)
+        , gold_(0)
+        , in_dims(4, 4)
+        , gold_dims(4, 4)
+        , win_len(2)
+        , strd_len(2)
+        , pad_len(0)
+        , is_column(true) {}
+
+    virtual void SetUp() {
+        ::getInput(&in_, &in_dims[0]);
+        ::getGold(&gold_, &gold_dims[0]);
+    }
+
+    virtual void TearDown() {
+        if (in_ != 0) af_release_array(in_);
+        if (gold_ != 0) af_release_array(gold_);
+    }
+
+    af_array in_;
+    af_array gold_;
+    dim4 in_dims;
+    dim4 gold_dims;
+    dim_t win_len;
+    dim_t strd_len;
+    dim_t pad_len;
+    bool is_column;
+};
+
+template<typename T>
+class WrapV2 : public WrapCommon {
+   protected:
+    vector<T> h_gold_cast;
+    vector<T> h_in_cast;
+
+    WrapV2() {}
+
+    void setTestData(float *h_gold, dim4 gold_dims, float *h_in, dim4 in_dims) {
+        releaseArrays();
+
+        this->gold_ = 0;
+        this->in_   = 0;
+
+        this->gold_dims = gold_dims;
+        this->in_dims   = in_dims;
+
+        for (int i = 0; i < gold_dims.elements(); ++i) {
+            h_gold_cast.push_back(static_cast<T>(h_gold[i]));
+        }
+        for (int i = 0; i < in_dims.elements(); ++i) {
+            h_in_cast.push_back(static_cast<T>(h_in[i]));
+        }
+
+        ASSERT_SUCCESS(af_create_array(&this->gold_, &h_gold_cast.front(),
+                                       gold_dims.ndims(), gold_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+        ASSERT_SUCCESS(af_create_array(&this->in_, &h_in_cast.front(),
+                                       in_dims.ndims(), in_dims.get(),
+                                       (af_dtype)dtype_traits<T>::af_type));
+    }
+
+    void testSpclOutArray(TestOutputArrayType out_array_type) {
+        SUPPORTED_TYPE_CHECK(T);
+
+        af_array out = 0;
+        TestOutputArrayInfo metadata(out_array_type);
+        if (out_array_type == NULL_ARRAY) {
+            genTestOutputArray(&out, this->gold_dims.ndims(),
+                               this->gold_dims.get(),
+                               (af_dtype)dtype_traits<T>::af_type, &metadata);
+        } else {
+            genTestOutputArray(&out, 0.0, this->gold_dims.ndims(),
+                               this->gold_dims.get(),
+                               (af_dtype)dtype_traits<T>::af_type, &metadata);
+        }
+
+        // Same wrap parameters as the WrapCommon fixture defaults
+        ASSERT_SUCCESS(af_wrap_v2(&out, this->in_, 4, 4,  // output dims
+                                  2, 2,                   // window size
+                                  2, 2,                   // stride
+                                  0, 0,                   // padding
+                                  true));                 // is_column
+
+        ASSERT_SPECIAL_ARRAYS_EQ(this->gold_, out, &metadata);
+    }
+
+    void releaseArrays() {
+        if (this->in_ != 0) { ASSERT_SUCCESS(af_release_array(this->in_)); }
+        if (this->gold_ != 0) { ASSERT_SUCCESS(af_release_array(this->gold_)); }
+    }
+};
+
+TYPED_TEST_SUITE(WrapV2, TestTypes);
+
+template<typename T>
+class WrapV2Simple : public WrapV2<T> {
+   protected:
+    void SetUp() {
+        SUPPORTED_TYPE_CHECK(T);
+        this->releaseArrays();
+        this->in_   = 0;
+        this->gold_ = 0;
+
+        af_array tmp_in   = 0;
+        af_array tmp_gold = 0;
+
+        ::getInput(&tmp_in, this->in_dims.get());
+        ::getGold(&tmp_gold, this->gold_dims.get());
+
+        af_dtype dtype = (af_dtype)dtype_traits<T>::af_type;
+        ASSERT_SUCCESS(af_cast(&this->in_, tmp_in, dtype));
+        ASSERT_SUCCESS(af_cast(&this->gold_, tmp_gold, dtype));
+
+        ASSERT_SUCCESS(af_release_array(tmp_in));
+        ASSERT_SUCCESS(af_release_array(tmp_gold));
+    }
+};
+
+TYPED_TEST_SUITE(WrapV2Simple, TestTypes);
+
+TYPED_TEST(WrapV2Simple, UseNullOutputArray) {
+    this->testSpclOutArray(NULL_ARRAY);
+}
+
+TYPED_TEST(WrapV2Simple, UseFullExistingOutputArray) {
+    this->testSpclOutArray(FULL_ARRAY);
+}
+
+TYPED_TEST(WrapV2Simple, UseExistingOutputSubArray) {
+    this->testSpclOutArray(SUB_ARRAY);
+}
+
+TYPED_TEST(WrapV2Simple, UseReorderedOutputArray) {
+    this->testSpclOutArray(REORDERED_ARRAY);
+}
+
+class WrapNullArgs : public WrapCommon {};
+
+TEST_F(WrapNullArgs, NullOutputPtr) {
+    af_array *out_ptr = 0;
+    ASSERT_EQ(af_wrap(out_ptr, this->in_, 4, 4,  // output dims
+                      2, 2,                      // window size
+                      2, 2,                      // stride
+                      0, 0,                      // padding
+                      true),                     // is_column
+              AF_ERR_ARG);
+}
+
+TEST_F(WrapNullArgs, NullInputArray) {
+    af_array out = 0;
+    ASSERT_EQ(af_wrap(&out, 0, 4, 4,  // output dims
+                      2, 2,           // window size
+                      2, 2,           // stride
+                      0, 0,           // padding
+                      true),          // is_column
+              AF_ERR_ARG);
+}
+
+TEST_F(WrapNullArgs, V2NullOutputPtr) {
+    af_array *out_ptr = 0;
+    ASSERT_EQ(af_wrap_v2(out_ptr, this->in_, 4, 4,  // output dims
+                         2, 2,                      // window size
+                         2, 2,                      // stride
+                         0, 0,                      // padding
+                         true),                     // is_column
+              AF_ERR_ARG);
+}
+
+TEST_F(WrapNullArgs, V2NullInputArray) {
+    af_array out = 0;
+    ASSERT_EQ(af_wrap_v2(&out, 0, 4, 4,  // output dims
+                         2, 2,           // window size
+                         2, 2,           // stride
+                         0, 0,           // padding
+                         true),          // is_column
+              AF_ERR_ARG);
+}
+
+struct ArgDim {
+    ArgDim(dim_t d0, dim_t d1) : dim0(d0), dim1(d1) {}
+    void get(dim_t *d0, dim_t *d1);
+
+    dim_t dim0;
+    dim_t dim1;
+};
+
+struct WindowDims : public ArgDim {
+    WindowDims() : ArgDim(1, 1) {}
+    WindowDims(dim_t d0, dim_t d1) : ArgDim(d0, d1) {}
+};
+
+struct StrideDims : public ArgDim {
+    StrideDims() : ArgDim(1, 1) {}
+    StrideDims(dim_t d0, dim_t d1) : ArgDim(d0, d1) {}
+};
+
+struct PadDims : public ArgDim {
+    PadDims() : ArgDim(0, 0) {}
+    PadDims(dim_t d0, dim_t d1) : ArgDim(d0, d1) {}
+};
+
+class WrapArgs {
+   public:
+    WindowDims wc_;
+    StrideDims sc_;
+    PadDims pc_;
+    bool is_column;
+    af_err err;
+
+    WrapArgs() : wc_(), sc_(), pc_(), is_column(true), err(af_err(999)) {}
+
+    WrapArgs(dim_t win_d0, dim_t win_d1, dim_t str_d0, dim_t str_d1,
+             dim_t pad_d0, dim_t pad_d1, bool is_col, af_err err)
+        : wc_(win_d0, win_d1)
+        , sc_(str_d0, str_d1)
+        , pc_(pad_d0, pad_d1)
+        , is_column(is_col)
+        , err(err) {}
+};
+
+class WrapAPITest
+    : public WrapCommon
+    , public ::testing::WithParamInterface<WrapArgs> {
+   public:
+    WrapAPITest() : input(), in_(0), in_dims(4, 4, 1, 1) {}
+
+    virtual void SetUp() {
+        input = GetParam();
+        ::getInput(&in_, in_dims.get());
+    }
+    virtual void TearDown() {
+        if (in_ != 0) af_release_array(in_);
+    }
+
+    WrapArgs input;
+    af_array in_;
+    dim4 in_dims;
+};
+
+TEST_P(WrapAPITest, CheckDifferentWrapArgs) {
+    dim_t win_d0 = input.wc_.dim0;
+    dim_t win_d1 = input.wc_.dim1;
+    dim_t str_d0 = input.sc_.dim0;
+    dim_t str_d1 = input.sc_.dim1;
+    dim_t pad_d0 = input.pc_.dim0;
+    dim_t pad_d1 = input.pc_.dim1;
+
+    af_array out_ = 0;
+    af_err err    = af_wrap(&out_, in_, in_dims[0], in_dims[1], win_d0, win_d1,
+                            str_d0, str_d1, pad_d0, pad_d1, input.is_column);
+
+    ASSERT_EQ(err, input.err);
+    if (out_ != 0) af_release_array(out_);
+}
+
+WrapArgs args[] = {
+    // clang-format off
+    //      | win_dim0 | win_dim1 | str_dim0 | str_dim1 | pad_dim0 | pad_dim1 | is_col |    err    |
+    WrapArgs(        2,         2,         2,         2,         0,         0,    true,  AF_SUCCESS),
+    WrapArgs(        2,         2,         2,         2,         0,         0,   false,  AF_SUCCESS),
+
+    WrapArgs(       -1,         2,         2,         2,         0,         0,    true,  AF_ERR_ARG),
+    WrapArgs(        2,        -1,         2,         2,         0,         0,    true,  AF_ERR_ARG),
+    WrapArgs(       -1,        -1,         2,         2,         0,         0,    true,  AF_ERR_ARG),
+
+    WrapArgs(        2,         2,        -1,         2,         0,         0,    true,  AF_ERR_ARG),
+    WrapArgs(        2,         2,         2,        -1,         0,         0,    true,  AF_ERR_ARG),
+    WrapArgs(        2,         2,        -1,        -1,         0,         0,    true,  AF_ERR_ARG),
+
+    WrapArgs(        2,         2,         2,         2,         1,         1,    true,  AF_ERR_SIZE),
+    WrapArgs(        2,         2,         2,         2,        -1,         1,    true,  AF_ERR_SIZE),
+    WrapArgs(        2,         2,         2,         2,         1,        -1,    true,  AF_ERR_SIZE),
+    WrapArgs(        2,         2,         2,         2,        -1,        -1,    true,  AF_ERR_SIZE),
+    // clang-format on
+};
+
+INSTANTIATE_TEST_SUITE_P(BulkTest, WrapAPITest, ::testing::ValuesIn(args));
diff --git a/test/write.cpp b/test/write.cpp
index afe5f386f6..db751939ab 100644
--- a/test/write.cpp
+++ b/test/write.cpp
@@ -7,46 +7,48 @@
  * http://arrayfire.com/licenses/BSD-3-Clause
  ********************************************************/
 
-#include <gtest/gtest.h>
 #include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
 #include <af/dim4.hpp>
 #include <af/traits.hpp>
-#include <vector>
 #include <iostream>
 #include <string>
-#include <testHelpers.hpp>
+#include <vector>
 
-using std::vector;
-using std::string;
-using std::cout;
-using std::endl;
-using af::cfloat;
+using af::array;
 using af::cdouble;
+using af::cfloat;
+using af::dim4;
+using af::dtype_traits;
+using af::freeHost;
+using std::endl;
+using std::string;
+using std::vector;
 
 template<typename T>
-class Write : public ::testing::Test
-{
-    public:
-        virtual void SetUp() {
-        }
+class Write : public ::testing::Test {
+   public:
+    virtual void SetUp() {}
 };
 
 // create a list of types to be tested
-typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char, unsigned char> TestTypes;
+typedef ::testing::Types<float, cfloat, double, cdouble, int, unsigned, char,
+                         signed char, unsigned char, short, ushort>
+    TestTypes;
 
 // register the type list
-TYPED_TEST_CASE(Write, TestTypes);
+TYPED_TEST_SUITE(Write, TestTypes);
 
 template<typename T>
-void writeTest(af::dim4 dims)
-{
-    if (noDoubleTests<T>()) return;
+void writeTest(dim4 dims) {
+    SUPPORTED_TYPE_CHECK(T);
 
-    af::array A = af::randu(dims, (af_dtype) af::dtype_traits<T>::af_type);
-    af::array B = af::randu(dims, (af_dtype) af::dtype_traits<T>::af_type);
+    array A = randu(dims, (af_dtype)dtype_traits<T>::af_type);
+    array B = randu(dims, (af_dtype)dtype_traits<T>::af_type);
 
-    af::array A_copy = A.copy();
-    af::array B_copy = B.copy();
+    array A_copy = A.copy();
+    array B_copy = B.copy();
 
     T *a_host = A.host<T>();
     T *b_dev  = B.device<T>();
@@ -54,48 +56,31 @@ void writeTest(af::dim4 dims)
     A.write(b_dev, dims.elements() * sizeof(T), afDevice);
     B.write(a_host, dims.elements() * sizeof(T), afHost);
 
-    af::array check1 = A != B_copy;     // False so check1 is all 0s
-    af::array check2 = B != A_copy;     // False so check2 is all 0s
+    ASSERT_ARRAYS_EQ(B_copy, A);
+    ASSERT_ARRAYS_EQ(A_copy, B);
 
-    char *h_check1 = check1.host<char>();
-    char *h_check2 = check2.host<char>();
+    freeHost(a_host);
+}
 
-    for(int i = 0; i < (int)dims.elements(); i++) {
-        ASSERT_EQ(h_check1[i], 0) << "at: " << i << std::endl;
-        ASSERT_EQ(h_check2[i], 0) << "at: " << i << std::endl;
-    }
+TYPED_TEST(Write, Vector0) { writeTest<TypeParam>(dim4(10)); }
 
-    delete [] a_host;
-    delete [] h_check1;
-    delete [] h_check2;
-}
+TYPED_TEST(Write, Vector1) { writeTest<TypeParam>(dim4(1000)); }
 
-TYPED_TEST(Write, Vector0)
-{
-    writeTest<TypeParam>(af::dim4(10));
-}
+TYPED_TEST(Write, Matrix0) { writeTest<TypeParam>(dim4(64, 8)); }
 
-TYPED_TEST(Write, Vector1)
-{
-    writeTest<TypeParam>(af::dim4(1000));
-}
+TYPED_TEST(Write, Matrix1) { writeTest<TypeParam>(dim4(256, 256)); }
 
-TYPED_TEST(Write, Matrix0)
-{
-    writeTest<TypeParam>(af::dim4(64, 8));
-}
+TYPED_TEST(Write, Volume0) { writeTest<TypeParam>(dim4(10, 10, 10)); }
 
-TYPED_TEST(Write, Matrix1)
-{
-    writeTest<TypeParam>(af::dim4(256, 256));
-}
+TYPED_TEST(Write, Volume1) { writeTest<TypeParam>(dim4(32, 64, 16)); }
 
-TYPED_TEST(Write, Volume0)
-{
-    writeTest<TypeParam>(af::dim4(10, 10, 10));
-}
+TEST(Write, VoidPointer) {
+    vector<float> gold(100, 5);
+
+    array a(100);
+
+    void *h_gold = (void *)&gold.front();
+    a.write(h_gold, 100 * sizeof(float), afHost);
 
-TYPED_TEST(Write, Volume1)
-{
-    writeTest<TypeParam>(af::dim4(32, 64, 16));
+    ASSERT_VEC_ARRAY_EQ(gold, dim4(100), a);
 }
diff --git a/test/ycbcr_rgb.cpp b/test/ycbcr_rgb.cpp
new file mode 100644
index 0000000000..ec365db9a4
--- /dev/null
+++ b/test/ycbcr_rgb.cpp
@@ -0,0 +1,157 @@
+/*******************************************************
+ * Copyright (c) 2014, ArrayFire
+ * All rights reserved.
+ *
+ * This file is distributed under 3-clause BSD license.
+ * The complete license agreement can be obtained at:
+ * http://arrayfire.com/licenses/BSD-3-Clause
+ ********************************************************/
+
+#include <arrayfire.h>
+#include <gtest/gtest.h>
+#include <testHelpers.hpp>
+#include <af/dim4.hpp>
+#include <string>
+#include <vector>
+
+using af::array;
+using af::dim4;
+using std::endl;
+using std::string;
+using std::vector;
+
+TEST(ycbcr_rgb, InvalidArray) {
+    vector<float> in(100, 1);
+
+    dim4 dims(100);
+    array input(dims, &(in.front()));
+
+    try {
+        array output = ycbcr2rgb(input);
+        ASSERT_EQ(true, false);
+    } catch (const af::exception &ex) {
+        ASSERT_EQ(true, true);
+        return;
+    }
+}
+
+TEST(ycbcr2rgb, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/ycbcr_rgb/ycbcr2rgb.test"), numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = ycbcr2rgb(input);
+
+    vector<float> outData(dims.elements());
+    output.host(outData.data());
+
+    vector<float> currGoldBar = tests[0];
+    size_t nElems             = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)
+            << "at: " << elIter << endl;
+    }
+}
+
+TEST(ycbcr2rgb, MaxDim) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/ycbcr_rgb/ycbcr2rgb.test"), numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+
+    const size_t largeDim = 65535 * 16 + 1;
+    unsigned int ntile    = (largeDim + dims[1] - 1) / dims[1];
+    input                 = tile(input, 1, ntile);
+    array output          = ycbcr2rgb(input);
+    dim4 outDims          = output.dims();
+
+    float *outData = new float[outDims.elements()];
+    output.host((void *)outData);
+
+    vector<float> currGoldBar = tests[0];
+    for (int z = 0; z < outDims[2]; ++z) {
+        for (int y = 0; y < outDims[1]; ++y) {
+            for (int x = 0; x < outDims[0]; ++x) {
+                int outIter =
+                    (z * outDims[1] * outDims[0]) + (y * outDims[0]) + x;
+                int goldIter =
+                    (z * dims[1] * dims[0]) + ((y % dims[1]) * dims[0]) + x;
+                ASSERT_NEAR(currGoldBar[goldIter], outData[outIter], 1.0e-3)
+                    << "at: " << outIter << endl;
+            }
+        }
+    }
+
+    // cleanup
+    delete[] outData;
+}
+
+TEST(rgb2ycbcr, CPP) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/ycbcr_rgb/rgb2ycbcr.test"), numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+    array output = rgb2ycbcr(input);
+
+    vector<float> outData(dims.elements());
+    output.host(outData.data());
+
+    vector<float> currGoldBar = tests[0];
+    size_t nElems             = currGoldBar.size();
+    for (size_t elIter = 0; elIter < nElems; ++elIter) {
+        ASSERT_NEAR(currGoldBar[elIter], outData[elIter], 1.0e-3)
+            << "at: " << elIter << endl;
+    }
+}
+
+TEST(rgb2ycbcr, MaxDim) {
+    vector<dim4> numDims;
+    vector<vector<float>> in;
+    vector<vector<float>> tests;
+
+    readTestsFromFile<float, float>(
+        string(TEST_DIR "/ycbcr_rgb/rgb2ycbcr.test"), numDims, in, tests);
+
+    dim4 dims = numDims[0];
+    array input(dims, &(in[0].front()));
+
+    const size_t largeDim = 65535 * 16 + 1;
+    unsigned int ntile    = (largeDim + dims[1] - 1) / dims[1];
+    input                 = tile(input, 1, ntile);
+    array output          = rgb2ycbcr(input);
+    dim4 outDims          = output.dims();
+
+    float *outData = new float[outDims.elements()];
+    output.host((void *)outData);
+
+    vector<float> currGoldBar = tests[0];
+    for (int z = 0; z < outDims[2]; ++z) {
+        for (int y = 0; y < outDims[1]; ++y) {
+            for (int x = 0; x < outDims[0]; ++x) {
+                int outIter =
+                    (z * outDims[1] * outDims[0]) + (y * outDims[0]) + x;
+                int goldIter =
+                    (z * dims[1] * dims[0]) + ((y % dims[1]) * dims[0]) + x;
+                ASSERT_NEAR(currGoldBar[goldIter], outData[outIter], 1.0e-3)
+                    << "at: " << outIter << endl;
+            }
+        }
+    }
+
+    delete[] outData;
+}
diff --git a/vcpkg.json b/vcpkg.json
new file mode 100644
index 0000000000..7b8d9bca2f
--- /dev/null
+++ b/vcpkg.json
@@ -0,0 +1,86 @@
+{
+    "name": "arrayfire",
+    "version": "3.10.0",
+    "homepage": "https://github.com/arrayfire/arrayfire",
+    "description": "ArrayFire is an HPC general-purpose library targeting parallel and massively-parallel architectures such as CPUs, GPUs, etc.",
+    "supports": "x64",
+    "dependencies": [
+        "boost-math",
+        "boost-stacktrace",
+        "spdlog",
+        "freeimage",
+        "span-lite"
+    ],
+    "overrides": [
+        {
+            "name": "fmt",
+            "version": "8.1.1"
+        },
+        {
+            "name": "spdlog",
+            "version": "1.9.2"
+        },
+        {
+            "name": "jasper",
+            "version": "4.2.0"
+        }
+    ],
+    "features": {
+        "tests": {
+            "description": "Build with tests",
+            "dependencies": [
+                "gtest"
+            ]
+        },
+        "forge": {
+            "description": "Build Forge",
+            "dependencies": [
+                {
+                    "name": "freetype",
+                    "default-features": false
+                },
+                {
+                    "name": "fontconfig",
+                    "platform": "!windows"
+                },
+                "glfw3",
+                "glad"
+            ]
+        },
+        "openblasfftw": {
+            "description": "Build with OpenBLAS/FFTW",
+            "dependencies": [
+                {
+                    "name": "fftw3",
+                    "features": [ "threads" ]
+                },
+                {
+                    "name": "openblas",
+                    "features": [ "threads" ]
+                },
+                "lapack"
+            ]
+        },
+        "cuda": {
+            "description": "Build CUDA backend",
+            "dependencies": [
+                "cuda"
+            ]
+        },
+        "opencl": {
+            "description": "Build OpenCL backend",
+            "dependencies": [
+                "boost-compute",
+                "boost-program-options",
+                "opencl"
+            ]
+        },
+        "cudnn": {
+            "description": "Build CUDA with support for cuDNN",
+            "dependencies": [
+                "cudnn"
+            ]
+        }
+    },
+    "builtin-baseline": "b02e341c927f16d991edbd915d8ea43eac52096c"
+}